How To Find & Fix Duplicated Content
April 14, 2021
What Is Duplicate Content?
By the standard definition, duplicate content means that you have two or more pages on your website that return identical or nearly identical content. I’d extend that definition beyond content to also include two or more pages on your website that serve identical or nearly identical intentions. The reason to extend that definition to include intention is that the core problem with duplicate content is that it confuses visitors. Visitors will be confused by pages that return identical content and will be equally confused (if not somewhat more confused) by two pages that return different content that serves basically the same intention.
For example, on Elementive’s website, we could have two pages talking about our tech SEO services. These pages might use very different words to describe the same services and, therefore, don’t have identical or even nearly identical content. However, those pages are describing the same concept, meaning that both pages are basically serving the same intent. Visitors who reach both of these pages would be understandably confused by the website.
We want to fix duplicate content to avoid this confusion. When human visitors are confused, they don’t engage with the website and they don’t convert. When robot visitors get confused, you end up with problems getting the duplicated content to rank—the content either doesn’t rank or it ends up competing against the other versions of the content to rank. Put simply: we don’t want duplicated content, whether that is duplication of the exact words on the page or duplication of the page’s intent.
Now that we understand what duplicate content means, let’s walk through how we deal with duplicated content:
- Common Causes of Duplicate Content
- Duplicate Content Example
- Find Duplicate Content
- How to Fix Duplicate Content
Common Causes of Duplicate Content on a Website
Duplicate content can happen for a variety of reasons. One of the more common causes of duplicate content is a side effect of programmatic choices. This could be a situation where the website’s underlying platform allows the same page to be returned at multiple URLs. For example, a platform might allow product pages to be accessed at site.com/view/product-name.html and also accessed at site.com/product-name.html. Another way this happens is with pages automatically created by content management systems. For example, a blog will automatically create category pages to list blog posts but, depending on how the blog posts are categorized, three different category pages on a blog might list almost the exact same blog posts, and that would create three pages that are duplicated (or nearly duplicated).
It isn’t just technical mishaps, though. Duplicate content can also be caused by poor or messy information architectures, where the same content is duplicated in multiple places on the website. As an example, a company might place the same FAQs content in several different sections of the website because there isn’t a simpler, less duplicated way of presenting that content to visitors. As another example, duplication can also happen when you attempt to write the same page several times to cater to multiple audiences but the distinctions between the audiences, and therefore the pages for those audiences, are far too subtle.
Related to this, duplicate content can happen due to mismanagement of the website. For example, two different people might unknowingly create the same page. As another example, even a single author can write new content about a topic and forget the older page exists (this happens to me on this site all the time).
The first step to fixing duplicate content is understanding the cause; a technical problem will be resolved differently than an information architecture or management problem.
Duplicate Content Example
Duplicate content can take many shapes and forms but let’s take a closer look at one common example. Let’s say you manage an e-commerce website. On e-commerce sites, it is common to have filtering and sorting. Let’s say these three URLs exist and list the same products, albeit in a slightly different order:
- https://www.domain.com/product-list.html
- https://www.domain.com/product-list.html?sort=color
- https://www.domain.com/product-list.html?sort=price
These three pages are duplicates. These three pages may not be exactly the same, given the different way products are sorted, but the pages do serve a very similar intention. The second and third example URLs contain a sort parameter (“?sort=color” or “?sort=price”), which creates only a slight difference between these pages (in the way the products listed are sorted). These pages would still have the same products, the same images, the same text, and, likely, the same title and description tags.
With that much similarity, these three URLs would be considered duplicate versions of the same page. Likely, human visitors will understand this difference, provided the sorting is well explained within the design and content. However, this type of duplication may confuse Google as their robots try to decide which of the pages to show in search results. Do they show a sorted page or not? If they do show a sorted page, which sort type should be ranked in search results? In many cases, the page Google chooses to show in search results may not be the same page you would prefer people to find. In this example, you may prefer people find the first, unsorted URL instead of the sorted versions of the page. In rare cases, such as when duplication appears intended to manipulate rankings, Google may also take action against your website.
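To make the idea concrete, here is a minimal Python sketch that groups URLs differing only by a sort parameter. It assumes that `sort` is the only presentation-only parameter on the site; the `PRESENTATION_PARAMS` set and the function names are my own for illustration, and your platform may use different parameters.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters change only the presentation, not which products appear.
PRESENTATION_PARAMS = {"sort"}


def normalize_url(url: str) -> str:
    """Return the URL with presentation-only query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PRESENTATION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Group URLs that collapse to the same normalized URL."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize_url(url)].append(url)
    # Only groups with more than one member are potential duplicates.
    return {base: members for base, members in groups.items() if len(members) > 1}


urls = [
    "https://www.domain.com/product-list.html",
    "https://www.domain.com/product-list.html?sort=color",
    "https://www.domain.com/product-list.html?sort=price",
]
duplicate_groups = group_duplicates(urls)
```

Here all three URLs fall into one group keyed by the unsorted URL. Parameters that do change the content (pagination, for instance) are left alone, so those pages stay in separate groups.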
Find Duplicate Content
Before you can fix duplicate content, you have to find it on your website. Let’s go through the steps you can take to find and evaluate duplicate content on your website.
Step 1: Locate Duplicate Content
The first step is finding the duplicate content that exists on your website. Unlike locating other types of problems, there isn’t a way to get a single report of every issue that exists. Instead, in this first step, we want to locate pages that could be duplicated content and then we’ll review if the page is duplicated in the next step.
Method #1: Crawl Tool
Siteliner is a free tool, up to a certain number of pages, that will crawl through your website and, among other problems, locate duplicate content. Once you load the site, enter your website’s URL.
You will then see a report about different aspects of your content. All are interesting and important, but what we want to focus on is duplicate content. Siteliner will tell you in the navigation what percent of content their tool thinks is duplicated on your website. In the case of my website at the time of this scan, it is 5%.
You can click the “Duplicate Content” link in the navigation to see the full report. What we want to pay attention to on the full report is the “Match Percentage” and “Match Pages”. In my case, the analytics-courses page has a 43% match to 2 other pages. You can click on the URL for more details and to see what pages are matching. Anything above a 50% match is worth reviewing and, regardless of match percentage, anything with two or more matching pages is also worth reviewing.
Method #2: Analytics Page Titles
An alternative way to check for duplicate content is to rely on the page title dimension in Google Analytics. To locate this, go to the all pages report, which is located at Behavior -> Site Content -> All Pages. Once here, change the primary dimension by using the links above the table to “Page Title” or add a secondary dimension of “Page Title.” You can then sort by the “Page Title” column to see if any pages contain the same title or a nearly identical title.
This isn’t always a perfect means of locating duplication but can give you some ideas if pages speak to the same topic. As well, you can search through the page titles for common phrases—for example, in my case I could search for other page titles that may reference SEO, plugins, or response status codes. Anything that seems to share similar or identical title tags is worth reviewing for possible duplicate content issues.
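If you export the page and title columns, the same check can be sketched in a few lines of Python. This is only an illustration: the rows below are hypothetical, and the normalization (lowercasing and collapsing whitespace) is an assumption about what counts as a "nearly identical" title.

```python
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not hide matches."""
    return " ".join(title.lower().split())


def pages_sharing_titles(rows):
    """rows: iterable of (page_path, page_title). Returns {title: [paths]} for shared titles."""
    by_title = defaultdict(set)
    for path, title in rows:
        by_title[normalize_title(title)].add(path)
    return {title: sorted(paths) for title, paths in by_title.items() if len(paths) > 1}


# Hypothetical export rows, not real data from any site.
rows = [
    ("/tech-seo-services", "Tech SEO Services"),
    ("/services/tech-seo", "Tech  SEO services"),
    ("/contact", "Contact Us"),
]
shared = pages_sharing_titles(rows)
```

Anything that shows up in `shared` is a candidate for review, not a confirmed duplicate.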
Method #3: Google Search Console
There are many other methods you can use for locating duplicate content but the final method we’ll discuss here relies on Google Search Console’s keyword report. This is the best method for finding duplicated intentions and how duplicated content is affecting search performance. In Google Search Console, go to Performance and then view the queries table. Within the table, click through the different queries that your website ranks for. Once you click on a specific query, the report will reload and you will be taken to data solely about that query. From there, you can click on the “Pages” tab in the table below the graph to see all the pages that rank for that query.
As you can see in the screenshot below, Elementive’s website has a few instances where multiple pages rank for the same term. This doesn’t necessarily mean those pages are duplicated. But it does mean those pages are competing to rank for the same term and, at least in Google’s evaluation, that the pages may serve a similar intention. It would be worthwhile to review these pages and see what, if anything, should be done to resolve that duplication.
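Clicking through queries one at a time gets tedious on larger sites. As a rough sketch, assuming you have exported rows of query, page, and clicks (the rows below are hypothetical and the function name is my own), you could list the queries where more than one page ranks:

```python
from collections import defaultdict


def competing_pages(rows, min_pages=2):
    """rows: iterable of (query, page, clicks). Returns queries where several pages rank."""
    pages_by_query = defaultdict(dict)
    for query, page, clicks in rows:
        # Sum clicks in case the same query/page pair appears more than once.
        pages_by_query[query][page] = pages_by_query[query].get(page, 0) + clicks
    return {
        query: sorted(pages.items(), key=lambda item: item[1], reverse=True)
        for query, pages in pages_by_query.items()
        if len(pages) >= min_pages
    }


# Hypothetical rows for illustration only.
rows = [
    ("tech seo", "/tech-seo-services", 40),
    ("tech seo", "/services/tech-seo", 12),
    ("analytics courses", "/analytics-courses", 30),
]
competing = competing_pages(rows)
```

Each entry lists the competing pages with the most-clicked page first, which is a useful starting point when you later decide which page to keep.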
Step 2: Review Duplicated Content
The next step is to review the potentially duplicated content you found in step one. Remember, what we found in step one isn’t definitively duplicated, just potentially duplicated. As you review the potentially duplicated pages, you want to ask yourself a few different questions.
Do the pages have identical or nearly identical content?
You want to begin your review by focusing on pages that have duplicated or nearly duplicated content. On your first review, you want to distinguish between pages that are identical and pages that are nearly identical because this changes how to address a problem. If the content exactly matches, then it is more critical to resolve and is likely presenting a bigger problem for your website’s performance. Flag those exact matches for an exact fix.
For near-duplicate matches, you need to review the content more deeply to see what the differences are and if they matter. Some of the differences within near matches are easy enough to spot without any tools to help. However, a tool like Diffchecker can help where it is harder to tell how great the duplication may be. You can paste two sets of content into Diffchecker and see how closely the content matches between pages. Diffchecker will also highlight what is the same or different across pages, which makes it easier to spot key differences.
Why bother with this evaluation? Well, sometimes, you can end up with false positives where an automated tool, like Siteliner, says the pages are duplicated but a human would understand the differences. For example, you might have two product categories that share a lot of products. A tool like Siteliner tells you those two product categories match at 80% so you flag it for evaluation. As you are evaluating, you can’t really tell how the pages differ but using Diffchecker you can easily spot the 20% worth of differences in the content. Perhaps, that 20% that is different really matters and a human would likely understand these pages are distinct and not duplicated. Chances are bots will understand this difference too since they are calibrated to view content like humans. Of course, if the 20% difference doesn’t much matter to humans (and, by extension, won’t matter much to bots either), then you do have a duplicate problem that needs to be addressed.
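If you would rather script this comparison, Python’s standard `difflib` module can approximate both numbers. To be clear, this is a sketch of the general idea and not how Siteliner or Diffchecker calculate their results; comparing word by word is my own simplifying assumption.

```python
from difflib import SequenceMatcher


def match_percentage(text_a: str, text_b: str) -> float:
    """Approximate how much two pages' text overlaps, as a percentage from 0 to 100."""
    words_a, words_b = text_a.split(), text_b.split()
    return 100 * SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def differing_words(text_a: str, text_b: str):
    """Return the word runs that are not shared, as (operation, words_in_a, words_in_b)."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (tag, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


page_a = "Blue widgets for indoor use"
page_b = "Blue widgets for outdoor use"
score = match_percentage(page_a, page_b)
differences = differing_words(page_a, page_b)
```

The score tells you how close the pages are, while `differences` shows the parts that differ. Whether those differences matter is still a human judgment.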
Do the pages serve a similar intention?
Pages that serve a similar intention are usually easiest to spot in Google Search Console where you might have two or more pages ranking for the same search term. Here again, though, you want to review the potential duplication to determine if it is a false positive or actually something you need to address.
As you review potentially duplicated pages, you need to determine if the pages serve a similar intention. It can be difficult to judge this type of duplication on your own. Because you are close to the subject matter, you can spot nuances that make each page distinct but that visitors may miss. To avoid this bias, don’t rely solely on your own evaluation of the page. Ask your customers and visitors to describe the differences in a survey or a brief interview. If they can tell you why these pages are different, then you don’t have a problem. But, if people can’t describe the difference, or seem to be struggling to find the difference, you have pages with duplicated intentions that need to be fixed.
How are visitors interacting with the duplicated pages?
You can also determine how big of an issue duplicated content is presenting, and if you truly have a duplicate content problem, by reviewing performance on those potentially duplicated pages. As an example, you might find duplicated pages where one copy of the page is performing incredibly well (lots of traffic, high conversion rates, high engagement rates) and the other versions of the page are performing terribly. In those instances, there isn’t much of a problem and you can easily remove the low-performing pages (or redirect those to the high-performing page) to alleviate any future problems.
Of course, as you review performance, you might find that the duplicated pages are basically performing the same way with similar traffic levels, similar rankings, similar engagement rates, and similar conversion rates. That suggests visitors probably don’t realize there are multiple pages on your website. I’d still argue this is a problem, but it isn’t a problem costing you anything yet. Better to address the duplication before visitors or search robots catch on, or before you go to update the page and forget to update one of the duplicated versions, creating outdated or conflicting information in the process.
What you also want to check for in these cases is how many people move between these pages—you can do this with a segment to see how many people visit both versions of the duplicated page. If there are a lot of people visiting both versions of the duplicated page, then that suggests a problem. How big a problem it is depends on what the conversion and engagement rates are for people who visit both versions of the duplicated page. In some cases, I’ve seen the conversion rate drop off by more than half when visitors encounter duplicated content—and that would indicate visitors are confused by the duplication and that a substantial problem exists.
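The arithmetic behind that segment comparison can be sketched as follows. It assumes you can export a set of visitor IDs for each page version and a set of visitors who converted; those inputs, and the function name, are hypothetical.

```python
def overlap_report(visitors_a: set, visitors_b: set, converters: set) -> dict:
    """Compare visitors who saw both duplicated pages with those who saw only one."""
    both = visitors_a & visitors_b
    only_one = (visitors_a | visitors_b) - both

    def rate(group: set) -> float:
        # Conversion rate for a group; 0.0 when the group is empty.
        return len(group & converters) / len(group) if group else 0.0

    return {
        "both": len(both),
        "only_one": len(only_one),
        "rate_both": rate(both),
        "rate_only_one": rate(only_one),
    }


# Hypothetical visitor IDs for illustration only.
report = overlap_report(
    visitors_a={"v1", "v2", "v3", "v4"},
    visitors_b={"v3", "v4", "v5", "v6"},
    converters={"v1", "v2", "v3", "v5"},
)
```

A `rate_both` well below `rate_only_one` is the kind of drop-off that would suggest visitors are being confused by the duplication.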
How To Fix Duplicate Content
Now that we’ve found, evaluated, and confirmed that duplicated content exists, what do you do about it? The solutions depend on the nature of the duplicated content and the severity of the duplicate content problem. The bigger the problem, the harder the solution. In this section we are going to cover four ways to address duplicate content:
- Title Tag Changes
- Implementing Canonical URLs
- Consolidating or Removing Content & Redirects
- Fixing Technical or Structural Issues
1. Title Tag Changes
If two pages share an identical title tag or page header, but the pages are fundamentally distinct and serve different purposes, you don’t need to make a large-scale change to the page. Instead, you can tweak the title tag or the page header to make it clearer what purpose each page serves. This can also stop multiple pages ranking for the same terms and, by doing so, help you rank for new terms.
2. Implementing Canonical URLs
Another solution for duplicate content is using the canonical link element to define which version of the page should be considered for rankings. This doesn’t help users, but it does help bots understand your website (which then indirectly helps users because it influences which pages rank).
In the above example, you might consider the first URL (/product-list.html with no sorting parameters) to be the official or preferred version of that page. This URL does not have a sort parameter which makes the URL look nicer and this page might list the products in the order you’d prefer most people see. However, if sorting by color is the most popular choice for your visitors, you might prefer the canonical URL be /product-list.html?sort=color instead. Alternatively, you may find that the third URL (sorted by price) gets the most attention from other websites or on social networks and therefore the third version might make more sense as the preferred version of the URL. Regardless of what URL you pick to be the official version, you declare this official version by implementing a canonical tag.
How To Add a Canonical Tag
After you select the official or canonical version of the URL, the canonical link element stating the preferred version of the URL needs to be added to each potentially duplicated page. In the example above, any duplicated URLs would contain a canonical tag referencing the canonical URL you selected.
The canonical URL can be defined in two ways. The most common is to use a <link> element with a rel attribute with the value of canonical and an href attribute with the URL of the canonical version of the page. This element is placed anywhere in the <head> section of the page. Here is an example of the canonical code where the preferred URL is /product-list.html. This tag would be placed on all versions of the page. So, in this example, this canonical tag would appear on the /product-list.html page as well as the sorted versions.
<link rel="canonical" href="https://www.domain.com/product-list.html" />
Another alternative is to add a Link header to your HTTP response headers. This is useful for non-HTML files (but typically requires technical support to add to your website). Here again, this would be added to all duplicated versions.
Link: <https://www.domain.com/product-list.html>; rel="canonical"
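Once the tag is in place, it is worth confirming that every duplicated version actually declares the same canonical URL. Here is a minimal sketch using Python’s built-in HTML parser. The `CanonicalFinder` class is my own name for illustration, and the sketch reads HTML you have already downloaded rather than fetching pages itself. It also does not check that the element sits inside <head>.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect href values from <link rel="canonical"> elements."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, so check each one.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonicals(html: str):
    """Return every canonical URL declared in the given HTML."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


page_html = (
    "<html><head>"
    '<link rel="canonical" href="https://www.domain.com/product-list.html" />'
    "</head><body></body></html>"
)
found = find_canonicals(page_html)
```

An empty list means the page is missing its canonical tag, and more than one entry means the page is sending conflicting signals.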
Supporting The Canonical Elsewhere
You should not rely on the canonical tag as the only means of communicating your URL preferences to search engines. Links throughout the rest of your website should link to the canonical version of the page as well. This avoids sending mixed signals to search engines. It also reduces the chances that human visitors reach these duplicated pages.
For example, if you define https://www.domain.com/product-list.html as the canonical URL, but the majority of the links on your website reference https://www.domain.com/product-list.html?sort=color, this would send conflicting signals about which version of the URL is really the definitive, authoritative, and canonical version. Instead, you would want the majority of links to reference the canonical version, /product-list.html. If Google is ignoring your canonical tags, this is the most likely reason.
Along with internal links, make sure your XML sitemap also lists the canonical version of each URL and only this version. The XML sitemap should not list any of the duplicate versions of the page. In our example, that would mean the XML sitemap should list /product-list.html but none of the versions of the URL with a sort parameter.
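A quick way to audit this is to scan the sitemap for URLs that carry a query string. The sketch below assumes a standard sitemap using the sitemaps.org namespace and treats any query string as a sign of a non-canonical URL, which fits our sort-parameter example but may not fit every site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls_with_parameters(sitemap_xml: str):
    """Return sitemap <loc> URLs that carry a query string."""
    root = ET.fromstring(sitemap_xml)
    locs = [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]
    return [url for url in locs if urlsplit(url).query]


sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.domain.com/product-list.html</loc></url>
  <url><loc>https://www.domain.com/product-list.html?sort=color</loc></url>
</urlset>"""
flagged = sitemap_urls_with_parameters(sitemap)
```

In this sample, only the sorted URL is flagged, and that is the entry you would remove from the sitemap.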
3. Consolidating or Removing Content & Redirects
If you have a bigger duplicate content problem, changing a title tag or adding a canonical won’t be enough of a fix. In these cases, the solution requires removing or consolidating pages, then redirecting removed URLs to the kept URL. For example, if Page A and Page B are identical, you could remove Page A and keep Page B to resolve the duplication. Of course, people might still come looking for Page A and you don’t want those visitors to be lost. So, after removing Page A from the website, you can redirect Page A to Page B. By doing this, you’ve removed the duplicate pages from your website but still ensured humans and bots visiting the site will be able to locate the desired content.
This gets trickier if the pages are nearly identical, or if it is only the intentions that are duplicated. For those scenarios, you often will want to consolidate the content from the several duplicated pages into one single page and keep some of the content from both versions of the page. For example, let’s say you have four pages discussing your widgets for sale and each of these pages has distinct content and images. You could consolidate all of these pages into a single page about widgets, keeping some of the content and images from each page. Once the content is consolidated, you’d want to redirect the duplicated versions into a single page.
How do you pick which page to keep and which pages to remove then redirect? It depends on how the pages are performing. Like with selecting a canonical, you want to review which pages visitors prefer and keep the version of the page that has the most traffic or engagement.
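As a rough sketch of that decision, the snippet below keeps the page with the most sessions and maps every other page to it. The session counts are hypothetical, and using a 301 (permanent) status for the redirect is my assumption; how you actually configure redirects depends on your server or platform.

```python
def build_redirects(page_sessions: dict) -> tuple:
    """Keep the page with the most sessions; map every other page to it."""
    keep = max(page_sessions, key=page_sessions.get)
    redirects = {path: keep for path in page_sessions if path != keep}
    return keep, redirects


def resolve(path: str, redirects: dict) -> tuple:
    """Return (status, location) for a request path, as a server rule might."""
    if path in redirects:
        return 301, redirects[path]  # Assumption: removed pages redirect permanently.
    return 200, path


# Hypothetical session counts for three duplicated widget pages.
sessions = {"/widgets": 900, "/widgets-for-sale": 120, "/buy-widgets": 45}
keep, redirects = build_redirects(sessions)
```

Traffic is only one signal. You may want to weigh engagement or conversions instead, in which case you would swap in a different metric for the session counts.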
4. Fixing Underlying Technical or Structural Issues
Finally, the duplicate content may be caused by an underlying technical issue. In some cases, this is an old dev environment mirroring the live site that somehow became exposed to visitors. This can also happen with dynamic pages that create identical content at multiple URLs (think of elaborate content filtration systems). Sometimes, the problem happens as a result of user-generated content and lack of monitoring for people posting the same content in multiple locations (think of a forum where people could post the same question to three different categories within the forum).
It can also be a structural issue. For example, there are three sections within the website where the page could live so the webmaster, to be the most helpful, puts the page in each of those sections. In a way, that makes sense and allows visitors to find the page in different places, all of which might be relevant places for that page to live. But the better answer is to reorganize the website so that one section can link to another without the need for multiple copies of the page in different places (or possibly, the site needs clearer differences between each section).
Summing It Up
Duplication can happen for many different reasons. The duplication rarely happens on purpose. But, often, the duplication worsens search performance, reduces visitor engagement, and, ultimately, hurts conversions. It takes time and effort to locate, evaluate, and fix the duplicated content. Often it requires altering the site architecture or fixing technical issues. But that effort is worth it and will lead to performance gains for SEO, UX, and CRO. If you have any questions about the duplicated content or finding issues that may exist on your website, please contact me.