How To Find & Fix Duplicated Content
By Matthew Edgar · Last Updated: October 23, 2023
What Is Duplicate Content?
Duplicate content means that you have two or more pages on your website that return identical or nearly identical content. While that is the standard definition, I’d extend that to also include two or more pages on your website that serve an identical or nearly identical purpose.
The reason to extend the definition to include purpose is that the core problem with duplicate content is confusion. Visitors will be confused by pages that return identical content and will be equally confused by two pages that return different content offering the same information.
For example, on Elementive’s website, we could have two pages talking about our tech SEO services. These pages might use very different words to describe the same services and, therefore, don’t have identical or even nearly identical content. However, those pages describe the same concept, so why have both pages on the website?
Visitors who reach both of these pages would be understandably confused by the website. Confused visitors don’t engage or convert.
Robots who reach duplicate pages get confused too and will make mistakes about indexing pages and ranking pages in search results. In some cases, none of the duplicated pages will rank, and in other cases, the duplicate pages will rank but compete against each other. In rare cases, Google may also penalize your website for duplicated content.
How do you find and fix duplicate content on your website? In this article:
- Common Causes of Duplicate Content
- Duplicate Content Example
- Find Duplicate Content
- How to Fix Duplicate Content
Common Causes of Duplicate Content on a Website
Duplicate content is often a problem with the website’s underlying code. The website’s content management system (CMS) might allow the same page to be returned at multiple URLs. For example, a CMS might allow product pages to be accessed at www.site.com/view/product-name.html and also at www.site.com/product-name.html.
Duplicate content can happen with pages automatically created by the CMS. For example, a blog will automatically create category pages to list blog posts but, depending on how the blog posts are categorized, three different category pages on a blog might list almost the exact same blog posts. That would create three category pages that are duplicates.
It isn’t just technical mishaps, though. Duplicate content can also be caused by messy information architectures. The same content might be duplicated in multiple places on the website. As an example, a company might place the same FAQs content in several different sections of the website because there isn’t a simpler, less duplicated way of presenting that content. As another example, duplication can also happen when you attempt to write the same page several times to cater to multiple audiences but the differences in the content are far too subtle.
Related to this, duplicate content can happen due to editorial mismanagement of the website. For example, multiple authors can unknowingly write two versions of the same page.
The first step to fixing duplicate content is understanding the cause; a technical problem will be resolved differently than an information architecture or an editorial problem.
Duplicate Content Example
Duplicate content can take many shapes and forms, but let’s take a closer look at one common example that happens on e-commerce websites due to filtering and sorting. A website might use these three URLs to list the same products in a slightly different order:

- www.domain.com/some-category/list
- www.domain.com/some-category/list?sort=color
- www.domain.com/some-category/list?sort=price
The second and third example URLs contain a sort parameter (“?sort=color” or “?sort=price”), which creates only a slight difference between these pages in the way the products listed are sorted. Does that different sorting create a unique page with a unique purpose?
They aren’t really unique pages. The three pages may list the same products and have the same text and images on the page. By default, the header, title, and meta descriptions would be the same too. Given those similarities, these pages would be considered duplicates. Because of the duplication, Google might struggle to decide which version of the page to show in search results.
There are ways, though, that these pages could be differentiated. The content might change based on the sort value. For example, when sorting by color, the text on the page could discuss the different color options available, what colors customers typically prefer, what colors work best given various circumstances, and so on. The same is true for price. The header, title, and meta description could be updated to reflect the sort value too. Those updates would give these pages a unique purpose and a reason to rank in search results.
Find Duplicate Content
Before you can fix duplicate content, you have to find it on your website. Let’s go through the steps you can take to find and evaluate duplicate content on your website.
Step 1: Locate Duplicate Content
The first step is finding the duplicate content that exists on your website. Unlike with other types of problems, there isn’t a way to get a single report of every issue. Instead, in this first step, we want to locate pages that could be duplicates; we’ll review whether each page actually is duplicated in the next step.
Method #1: Crawl Tool
Siteliner is a tool, free up to a certain number of pages, that will crawl through your website and, among other checks, locate duplicate content. Once you load the Siteliner site, enter your website’s URL.
You will then see a report about different aspects of your content. All are interesting and important, but what we want to focus on is duplicate content. Siteliner will tell you in the navigation what percent of content their tool thinks is duplicated on your website. In the case of my website at the time of this scan, it is 5%.
You can click the “Duplicate Content” link in the navigation to see the full report. What we want to pay attention to on the full report are the “Match Percentage” and “Match Pages” columns. In my case, the analytics-courses page has a 43% match to 2 other pages. You can click on the URL for more details and to see which pages are matching. Anything above a 50% match is worth reviewing and, regardless of match percentage, anything with two or more matching pages is also worth reviewing.
Method #2: Analytics Page Titles
An alternative way to check for duplicate content is to rely on the page title dimension in GA4. To locate this, go to Reports, then expand Engagement. Under Engagement, click “Pages and screens”. Once there, change the primary dimension in the table to “Page title and screen name”. You can then add “Page path and screen class” as a secondary dimension by clicking the plus sign. Next, sort by the “Page title” column (hover over the column title and you’ll see an arrow you can click to sort). Once sorted, you can see whether any pages share the same or a nearly identical title.
This isn’t always a perfect means of locating duplication but can give you some ideas if pages speak to the same topic. Also, you can search through the page titles for common phrases—for example, in my case I could search for other page titles that may reference SEO, plugins, or response status codes. Anything that seems to share similar or identical title tags is worth reviewing for possible duplicate content issues.
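If the GA4 table is large, scanning titles by eye gets tedious. Here is a minimal Python sketch that groups page paths by title and flags titles shared by more than one path; the (title, path) row layout is an assumption, so adjust it to match however you export the report.

```python
from collections import defaultdict

def find_shared_titles(rows):
    """Return titles that more than one page path uses.

    `rows` is a list of (page_title, page_path) tuples, e.g. exported
    from GA4's "Pages and screens" report (assumed layout).
    """
    paths_by_title = defaultdict(set)
    for title, path in rows:
        # Normalize lightly so trivial case/whitespace differences
        # don't hide a shared title.
        paths_by_title[title.strip().lower()].add(path)
    return {t: sorted(p) for t, p in paths_by_title.items() if len(p) > 1}

rows = [
    ("Technical SEO Services", "/services/tech-seo"),
    ("Technical SEO Services", "/seo/technical"),
    ("About Us", "/about"),
]
print(find_shared_titles(rows))
# {'technical seo services': ['/seo/technical', '/services/tech-seo']}
```

Every title this returns is a candidate for the manual review described in step two, not proof of duplication.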
Method #3: Google Search Console Performance
There are many other methods you can use for locating duplicate content but the final method we’ll discuss here relies on Google Search Console’s keyword report. This is the best method for finding duplicate content that is affecting search performance.
In Google Search Console, go to Performance and then view the queries table. Click through the different queries that your website ranks for. Once you click on a specific query, the report will reload to show you data solely about that query. From there, you can click on the “Pages” tab in the table below the graph to see all the pages that rank for that query. You want to find any queries with multiple ranking pages.
As you can see in the screenshot below, Elementive’s website has a few instances where multiple pages rank for the same query. This doesn’t necessarily mean those pages are duplicated. But it does mean those pages are competing to rank for the same term and, at least in Google’s evaluation, that the pages may serve a similar purpose. It would be worthwhile to review these pages and see what, if anything, should be done to resolve that duplication.
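Clicking through every query one at a time is slow, so you can also export the performance data and look for queries with multiple ranking pages in bulk. A sketch in Python, assuming you have (query, page) pairs from an export; the exact column names in your export may differ.

```python
from collections import defaultdict

def queries_with_multiple_pages(rows):
    """Return queries for which more than one page ranks.

    `rows` is a list of (query, page_url) pairs, e.g. pulled from a
    Google Search Console performance export (assumed layout).
    """
    pages_by_query = defaultdict(set)
    for query, page in rows:
        pages_by_query[query].add(page)
    return {q: sorted(p) for q, p in pages_by_query.items() if len(p) > 1}

rows = [
    ("technical seo", "/services/tech-seo"),
    ("technical seo", "/seo/technical"),
    ("ga4 setup", "/blog/ga4-setup"),
]
print(queries_with_multiple_pages(rows))
# {'technical seo': ['/seo/technical', '/services/tech-seo']}
```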
Method #4: Google Search Console, Page Indexing Report
Google Search Console’s Page Indexing report can also help you identify duplicate content. The Page Indexing report lists reasons pages are not indexed, including three categories related to duplication: “Duplicate, Google chose different canonical than user”, “Duplicate without user-selected canonical”, and “Alternate page with proper canonical tag”. Learn more about each of these categories and the page indexing report.
Step 2: Review Duplicated Content
The next step is to review the potential duplicates you found in step one. Remember, what we found in step one isn’t definitively duplicated, just potentially duplicated. As you review the potentially duplicated pages, ask yourself a few questions.
Do the pages have identical or nearly identical content?
You first need to distinguish between pages that are exactly identical and pages that are nearly identical. If the content exactly matches, it is more critical to resolve and will be a bigger problem for your website’s performance. Flag those exact matches as ones to fix.
For pages that are nearly identical but not exactly identical, you need to review the content more deeply to see what the differences are and if those differences are enough to matter. Some of the differences within near matches are easy enough to spot without any tools to help. However, a tool like Diffchecker can help where it is harder to tell how great the duplication may be. You can put two sets of content in Diffchecker and see how close the content matches between pages. Diffchecker will also highlight what is the same or different across pages, which is helpful for spotting key differences between pages.
Why bother with this evaluation? Well, sometimes, you can end up with false positives where an automated tool, like Siteliner, says the pages are duplicated but a human would understand the differences. For example, you might have two product categories that share a lot of products. A tool like Siteliner tells you those two product categories match at 80% so you flag it for evaluation. As you are evaluating, you can’t really tell how the pages differ but using Diffchecker you can easily spot the 20% worth of differences in the content. Perhaps, that 20% that is different really matters and a human would likely understand these pages are distinct and not duplicated. Chances are robots will understand this difference too since they use natural language processing to view content like humans. Of course, if the 20% difference doesn’t much matter, then you do have a duplicate problem that needs to be addressed.
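You can approximate this kind of match scoring yourself with Python’s standard library. This sketch compares two blocks of page text word by word; it mirrors the idea behind tools like Diffchecker or Siteliner, though each tool’s exact scoring will differ.

```python
import difflib

def match_ratio(text_a, text_b):
    """Score how closely two blocks of page content match (0.0 to 1.0)."""
    # Compare word sequences rather than raw characters so that
    # whitespace and line breaks don't inflate the score.
    return difflib.SequenceMatcher(None, text_a.split(), text_b.split()).ratio()

page_a = "Our widgets come in red blue and green and ship worldwide"
page_b = "Our widgets come in red blue and yellow and ship worldwide"
print(round(match_ratio(page_a, page_b), 2))
# 0.91
```

A score near 1.0 means near-duplicate content; whether the remaining difference matters is still a human judgment.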
Do the pages serve a similar purpose?
Pages that serve a similar purpose are usually easiest to spot in Google Search Console’s performance report where you might have two or more pages ranking for the same search term. Here again, though, you want to review the potential duplication to determine if it is a false positive or actually something you need to address.
It can be difficult to spot this type of duplication on your own. Because you are close to the subject matter, you can see the nuances that make each page distinct, even when visitors cannot. To avoid this bias, don’t rely solely on your own evaluation of the pages. Ask your customers and visitors to describe the differences in a survey, a brief interview, or as part of a usability test. If customers and visitors can tell you why the pages are different, then you don’t have a problem. But if people can’t describe the difference, or seem to struggle to find one, you have duplicated pages that need to be fixed.
How are visitors interacting with the duplicated pages?
You can also determine how big of an issue duplicated content is presenting, and if you truly have a duplicate content problem, by reviewing performance on those potentially duplicated pages. As an example, you might find duplicated pages where one copy of the page is performing incredibly well with lots of traffic, high conversion rates, and high engagement rates and the other versions of the page perform terribly. In those instances, you can easily remove the low-performing duplicate pages (or redirect those to the high-performing page) to alleviate any future problems.
Of course, as you review performance, you might find the duplicated pages are basically performing the same way with similar traffic levels, similar rankings, similar engagement rates, and similar conversion rates. That suggests visitors probably don’t realize there are multiple pages on your website and end up using both. Robots may not realize it is a problem either. That makes this a potential problem but I’d argue it is better to address the potential problem before visitors or search robots start having an actual problem with the duplicated content. Fixing the potential problem also prevents you from updating one of the duplicated pages and forgetting to update the other, creating outdated or conflicting information in the process.
How To Fix Duplicate Content
Now that we’ve found, evaluated, and confirmed that duplicated content exists, what do you do about it? The solutions depend on the nature of the duplicated content and the severity of the duplicate content problem. The bigger the problem, the harder the solution. There are four common ways to address duplicate content:
- Title Tag Changes
- Implementing Canonical URLs
- Consolidating or Removing Content & Redirects
- Fixing Technical or Structural Issues
1. Title Tag Changes
If two pages share an identical title tag or page header, but the pages are fundamentally distinct and serve different purposes, you don’t need to make a large-scale change to the page. Instead, you can update the title tag or the page header to make it clearer what purpose each page serves. This can also stop multiple pages from ranking for the same terms and, by doing so, help you rank for new terms.
2. Implementing Canonical URLs
Another solution for duplicate content is using the canonical link element to define which version of the page should be considered for rankings. This doesn’t help visitors, but it does help robots understand your website.
Select Canonical Page
The first step is deciding which version of the duplicated URLs should rank in search results. Which of the duplicated pages is the preferred or official version of the page? This is referred to as the canonical page.
Thinking back to the sort parameter URLs discussed above, maybe we select the version of the URL without the parameter (/some-category/list) as the canonical page of this set of duplicated content. This URL does not have a sort parameter which makes the URL look nicer and this page might list the products in the order you’d prefer most people see. If visitors care more about price or color, perhaps the sorted version should be selected as the canonical. There is no right answer for which URL to pick, just make sure you pick a single URL to be the preferred version of a duplicated set of URLs.
Add a Canonical Tag
Once you have selected the canonical version of the page, you can declare this official version by implementing a canonical tag. The canonical can be defined in two ways.
The most common is to use a <link> element with a rel attribute with the value of canonical and an href attribute with the URL of the canonical version of the page. This element is placed anywhere in the <head> section of all the duplicate pages. Here is an example of the canonical tag where the preferred URL is /some-category/list. In this example, this canonical tag would appear on the /some-category/list page as well as the versions of that page that contain the sort parameter.
<link rel="canonical" href="https://www.domain.com/some-category/list" />
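In a template, you can often derive the canonical URL programmatically by stripping the parameters that only reorder content. A minimal Python sketch, assuming the parameter-free URL was chosen as the canonical and that `sort` is the only parameter to drop; which parameters are ignorable is a per-site decision.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_url(url, ignored_params=("sort",)):
    """Derive a canonical URL by dropping parameters that only
    reorder content (assumption: the parameter-free URL is canonical)."""
    parts = urlsplit(url)
    # Keep any parameters that genuinely change the page's content.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in ignored_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://www.domain.com/some-category/list?sort=color"))
# https://www.domain.com/some-category/list
```

The derived URL is what you would place in the `href` of the canonical tag on every variant of the page.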
An alternative is to add a Link header to your HTTP response headers. This is useful for non-HTML files (but typically requires technical support to add to your website). Here again, the header would be added to all duplicated versions.
Link: <https://www.domain.com/some-category/list>; rel="canonical"
Supporting The Canonical
You should not rely on the canonical tag as the only means of communicating your URL preferences to search engines. Internal links on your website should point to the canonical version of the page as much as possible. This avoids sending mixed signals to search engines. It also reduces the chances that human visitors reach the duplicated pages.
Along with internal links, make sure your XML sitemap also lists the canonical version of each URL and only this version. The XML sitemap should not list any of the duplicate versions of the page. In our example, that would mean the XML sitemap should list /some-category/list but none of the versions of the URL with a sort parameter.
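You can spot-check a sitemap for parameterized duplicates with a short script. This sketch flags any sitemap URL carrying a query string; it assumes your canonical URLs never use query parameters, so adjust the check if yours do.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.domain.com/some-category/list</loc></url>
  <url><loc>https://www.domain.com/some-category/list?sort=color</loc></url>
</urlset>"""

def non_canonical_entries(sitemap_xml):
    """Flag sitemap URLs that carry a query string, since parameterized
    duplicates (like sort variants) should not be listed."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", ns)]
    return [u for u in locs if urlsplit(u).query]

print(non_canonical_entries(SITEMAP))
# ['https://www.domain.com/some-category/list?sort=color']
```

In practice you would fetch your live sitemap instead of using an inline string; the inline XML here just keeps the example self-contained.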
3. Consolidating or Removing Content & Redirects
Adding a canonical tag only makes sense when the duplicate versions of the pages need to stay live on the website. In our example, the category listing URLs with the sort parameter must remain live because people need to be able to sort products. However, you don’t want Google to rank those URLs, and the canonical tag helps Google make the correct decisions in the rankings.
In other cases, the multiple versions of the page do not need to exist on the website. In these cases, the solution is removing or consolidating pages, and then redirecting removed URLs to the kept URL. For example, if page-a.html and page-b.html are identical, and only one version of the page is needed, you could remove page-a.html and keep page-b.html to resolve the duplication. Of course, people might still come looking for page-a.html and you don’t want those visitors to reach a not-found error page. So, after removing page-a.html from the website, you can redirect it to page-b.html. By doing this, you’ve removed the duplicate pages from your website but still ensured humans and robots visiting the site will be able to locate the desired content.
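The remove-and-redirect logic can be sketched as a simple mapping. In practice this lives in your server or CMS redirect configuration rather than application code; the paths below match the hypothetical page-a/page-b example above.

```python
# Removed duplicate -> kept page (hypothetical URLs from the example).
REDIRECTS = {
    "/page-a.html": "/page-b.html",
}

def resolve(path):
    """Return the response a request should produce: a permanent 301
    redirect for removed duplicates, or a normal 200 for everything else."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]
    return 200, path

print(resolve("/page-a.html"))  # (301, '/page-b.html')
print(resolve("/page-b.html"))  # (200, '/page-b.html')
```

Using a permanent (301) redirect, rather than a temporary one, tells search engines the consolidation is intentional and lasting.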
This gets trickier if the pages are nearly identical, or if it is only the purpose that is duplicated. In those scenarios, you often will want to consolidate the several duplicated pages into one single page, keeping some of the content from each version. For example, let’s say you have four pages discussing your widgets for sale and each of these pages has distinct content and images. You could consolidate all of these pages into a single page about widgets, keeping some of the content and images from each page. Once the content is consolidated, you’d want to redirect the removed URLs to the consolidated page.
How do you pick which page to keep and which pages to remove then redirect? It comes down to performance: keep the version of the page that has the highest rankings, the most traffic, the best engagement, and the best conversions (or some combination of these metrics).
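As a sketch, choosing the keeper could be as simple as a composite score across those metrics. The metric names and weights below are illustrative assumptions, not a standard formula; weight whatever matters most to your site.

```python
def pick_page_to_keep(pages):
    """Pick the duplicate to keep by a simple composite score
    (illustrative weights -- tune them to your own priorities)."""
    def score(page):
        # Scale the rate metrics up so they aren't drowned out by traffic.
        return (page["traffic"]
                + 1000 * page["conversion_rate"]
                + 1000 * page["engagement_rate"])
    return max(pages, key=score)

pages = [
    {"url": "/page-a.html", "traffic": 120, "conversion_rate": 0.01, "engagement_rate": 0.40},
    {"url": "/page-b.html", "traffic": 900, "conversion_rate": 0.03, "engagement_rate": 0.55},
]
print(pick_page_to_keep(pages)["url"])  # /page-b.html
```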
4. Fixing Underlying Technical or Structural Issues
Finally, the duplicate content may be caused by an underlying technical issue. In some cases, this is an old dev environment mirroring the live site that somehow became exposed to visitors. Learn how to properly handle dev or staging environments.
This can also happen with dynamic pages that create identical content at multiple URLs (think of elaborate content filtration systems). Sometimes, the problem happens as a result of user-generated content and a lack of monitoring for people posting the same content in multiple locations (think of a forum where people could post the same question to three different categories within the forum). Establishing better programmatic and editorial rules can help avoid these problems.
Summing It Up
Duplication can happen for many different reasons. The duplication rarely happens on purpose. But, often, the duplication worsens search performance, reduces visitor engagement, and hurts conversions. It takes time and effort to locate, evaluate, and fix the duplicated content. Often it requires altering the site architecture or fixing technical issues. But that effort is worth it and will lead to performance gains for SEO, UX, and CRO. If you need help finding or fixing duplicate content on your website, please contact me.