How To Find & Fix Duplicated Content
December 15, 2019
By the standard definition, duplicate content means that you have two or more pages on your website that return identical or nearly identical content. I’d extend that definition slightly to also include two or more pages on your website that serve identical or nearly identical intentions. That is, Elementive could have two pages on our website talking about tech SEO services that have different content but both pages would basically be serving the same intent.
The reason to extend that definition is that the core problem with duplicate content is that it confuses visitors. Visitors will be confused by pages that return identical content and be equally confused (if not somewhat more confused) by two pages that return different content that does basically the same thing.
When human visitors are confused, they don’t engage with the website and they don’t convert. When robot visitors get confused, you end up with problems getting the duplicated content to rank—the content either doesn’t rank or it ends up competing against the duplicated page to rank. Put simply: we don’t want duplicated content. Let’s walk through how we deal with duplicated content: locating, evaluating, and resolving.
For simplicity, this blog post excludes duplicated content located on third-party websites.
The first step is finding the duplicate content that exists on your website. Unlike locating, say, a 404 error, there isn’t a way to get a single report of every issue that exists. Instead, in this first step we want to locate things that could be duplicated content, and then we’ll review whether each one really is duplicated in the next step.
Method #1: Crawl Tool
Siteliner is a free tool, up to a certain number of pages, that will crawl through your website and, among other problems, locate duplicate content. Once you load the site, enter your website’s URL.
You will then see a report about different aspects of your content. All are interesting and important, but what we want to focus on is duplicate content. Siteliner will tell you in the navigation what percent of content their tool thinks is duplicated on your website. In the case of my website, it is 5%.
You can click the navigation to see the full report. What we want to pay attention to on the full report is the “Match Percentage” and “Match Pages”. In my case, the analytics-courses page has a 43% match to 2 other pages. You can click the URL for more details and to see which pages are matching. Anything above a 50% match is worth reviewing and, regardless of match percentage, anything with more than 2 pages matching is also worth reviewing.
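If you copy the report into a spreadsheet or script, the review rule above is easy to apply in bulk. Here is a minimal Python sketch of that rule. The `SitelinerRow` class and its field names are my own invention for illustration and are not Siteliner's export format, and the two `/example-…` rows are made-up data.

```python
from dataclasses import dataclass


@dataclass
class SitelinerRow:
    """One row of the duplicate content report (hypothetical structure)."""
    url: str
    match_percentage: float  # the "Match Percentage" column
    match_pages: int         # the "Match Pages" column


def needs_review(row, pct_threshold=50.0, pages_threshold=2):
    """Flag a page above a 50% match or with more than 2 matching pages."""
    return row.match_percentage > pct_threshold or row.match_pages > pages_threshold


rows = [
    SitelinerRow("/analytics-courses", 43.0, 2),  # the example from this post
    SitelinerRow("/example-a", 62.0, 1),          # made-up row
    SitelinerRow("/example-b", 12.0, 4),          # made-up row
]

flagged = [row.url for row in rows if needs_review(row)]
print(flagged)
```

The thresholds are parameters so you can tighten or loosen them for your own website.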
Method #2: Analytics Page Titles
An alternative way to check for duplicate content is to rely on the page title dimension in Google Analytics. To locate this, go to the all pages report, which is located at Behavior -> Site Content -> All Pages. Once here, change the primary dimension to page title or add a secondary dimension of page title. You can then sort by page title to see if any pages contain the same title, or a nearly identical title.
This isn’t a perfect means of locating duplicate content, but it can give you some idea of whether pages speak to the same topic. You can also search through the page titles for common phrases. For example, in my case I could search for other page titles that reference SEO, plugins, or response status codes. Any pages that share similar or identical title tags are worth reviewing for possible duplicate content issues.
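On a larger website, sorting by eye gets tedious, so you may prefer to export the report and group pages by title with a short script. The sketch below assumes you already have the export as a list of (page, title) pairs. The function names, the normalization rules, and the sample rows are all assumptions of mine, not anything Google Analytics provides.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase the title and replace anything that isn't a-z, 0-9, or a
    space with a space, so small punctuation differences don't hide a match."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def find_shared_titles(rows):
    """Return {normalized title: [pages]} for titles used by more than one page."""
    groups = defaultdict(list)
    for page, title in rows:
        groups[normalize_title(title)].append(page)
    return {title: pages for title, pages in groups.items() if len(pages) > 1}


# Made-up export rows: (page path, page title)
rows = [
    ("/seo-plugins", "SEO Plugins | Example"),
    ("/blog/seo-plugins", "SEO plugins - Example"),
    ("/contact", "Contact"),
]

shared = find_shared_titles(rows)
print(shared)
```

This only catches titles that become identical after normalization. Titles that are merely similar still need a manual look.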
Method #3: Search Console Queries
The final method we’ll discuss relies on Google Search Console’s keyword report. This is the best method for finding duplicated intentions and for seeing how duplicated content is affecting search performance. In Google Search Console, go to Performance and then view the queries table. Within the table, click through the different queries that your website ranks for. Once you click on a specific query, the report will reload and you will be taken to data solely about that query. From there, you can click “Pages” to see all the pages that rank for that query.
On Elementive’s website, we have a few instances where multiple pages rank for the same term. This doesn’t necessarily mean those pages are duplicated. But it does mean those pages are competing to rank for the same term and, at least in Google’s evaluation, that the pages may serve a similar intention. It would be worthwhile to review these pages and see what, if anything, should be done.
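Clicking through queries one at a time works, but if you have the query and page data side by side (for example, from an export), you can list every query with more than one ranking page in one pass. This is a rough Python sketch that assumes the data is a list of (query, page) pairs. That shape, the function name, and the sample rows are my assumptions.

```python
from collections import defaultdict


def queries_with_multiple_pages(rows):
    """Return {query: sorted pages} for queries where more than one page ranks."""
    pages_by_query = defaultdict(set)
    for query, page in rows:
        pages_by_query[query].add(page)
    return {
        query: sorted(pages)
        for query, pages in pages_by_query.items()
        if len(pages) > 1
    }


# Made-up rows: (search query, page that ranks for it)
rows = [
    ("tech seo services", "/services/tech-seo"),
    ("tech seo services", "/blog/what-is-tech-seo"),
    ("analytics courses", "/analytics-courses"),
]

competing = queries_with_multiple_pages(rows)
print(competing)
```

As noted above, a query appearing in this list doesn't prove duplication. It is a list of pages to review.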
The next step is to review what you suspect might be duplicated content found in step one. Remember, what we found in step one isn’t definitively duplicated, just potentially duplicated. As you review the potentially duplicated pages, you want to ask yourself a few different questions.
Do the pages have identical or nearly identical content?
It is important to distinguish between identical and nearly identical. As you review the pages that may be duplicated, determine how duplicated they are and whether this is an exact match or only a partial match. That changes how you may want to address the problem. Some of the near matches are easy enough to spot without any tools to help. However, a tool like Diffchecker can help where it is harder to tell how exact the duplication may be. You can put two sets of content into Diffchecker and see how closely they match. Diffchecker will also highlight what is the same and what is different across the pages.
Why bother with this evaluation? Well, sometimes you can end up with false positives where an automated tool, like Siteliner, says the pages are duplicated but a human would understand the differences. For example, you might have two product categories that share a lot of products. A tool like Siteliner tells you those two product categories match at 80%, but using Diffchecker you can easily spot the 20% that differs. And, perhaps, that 20% really matters and a human would likely understand these pages are distinct and not duplicated. But, if the 20% difference doesn’t much matter to humans, then you do have a duplicate problem that needs to be addressed.
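If you would rather script this comparison than paste text into a tool, Python’s standard `difflib` module does something similar. To be clear, this is not how Diffchecker or Siteliner calculates a match. It is a stand-in sketch that compares two texts word by word, and the function names and sample text are mine.

```python
import difflib


def similarity(text_a, text_b):
    """Return a 0.0 to 1.0 similarity ratio based on matching runs of words."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()


def differing_words(text_a, text_b):
    """List the (change type, words in A, words in B) spans that are not equal."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (tag, words_a[a1:a2], words_b[b1:b2])
        for tag, a1, a2, b1, b2 in matcher.get_opcodes()
        if tag != "equal"
    ]


page_a = "one two three four five"
page_b = "one two three four six"

print(similarity(page_a, page_b))       # 0.8, an 80% match
print(differing_words(page_a, page_b))  # the one word that differs
```

The ratio tells you how much matches. The list of differing words is what a human needs to look at to decide whether the difference matters.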
Do the pages serve a similar intention?
What Google ultimately cares about with duplicate content isn’t just that the pages are the same but that the pages serve the same purpose. This is what the people visiting your website care about too. You can have two pages that offer different content but basically say the same thing. This can show up as two pages ranking for the same search term. It can also show up as neither page ranking very well, since the two pages are competing against each other.
To spot this issue as you review the pages that could be duplicated, ask yourself if the pages serve a similar intention. If there are two pages on your website that both are attempting to describe the same concept or communicate the same message, that is duplication you need to address. This can happen easily on larger websites where you write new content about a topic forgetting the older page exists. It can also happen when you attempt to write to two different audiences but the distinctions between the audiences are far too subtle.
It can be difficult to locate this type of duplicate on your own. Once you are close to the subject matter, you can spot the nuances that make each page distinct. To avoid this bias, don’t rely solely on your own evaluation of the page. Ask your customers and visitors to describe the differences in a survey or brief interview. If they can tell you why these pages are different, then you don’t have a problem. But, if people can’t describe the difference, you have duplicated pages to fix.
How are visitors interacting with the duplicated pages?
You can also determine how big of an issue duplicated content presents, and whether you truly have a duplicate content problem, by reviewing performance. As an example, you might find duplicated pages where one copy of the page is performing incredibly well (lots of traffic, high conversion rates, high engagement rates) and the other version of the page is performing terribly. In those instances, there isn’t much of a problem and you can easily remove the low-performing page to alleviate any future problems.
Of course, as you review performance, you might find that the duplicated pages are basically performing the same way with similar traffic levels, similar rankings, similar engagement rates, and similar conversion rates. That suggests visitors probably don’t realize there are multiple pages on your website. I’d still argue this is a problem, but it isn’t a problem costing you anything yet. Better to address the duplication before visitors or search robots catch on, or before you go to update the page and forget to update one of the duplicated versions, creating outdated or conflicting information in the process.
What you also want to check for in these cases is how many people move between these pages—you can do this with a segment to see how many people visit both versions of the duplicated page. If there are a lot of people visiting both versions of the duplicated page, then that suggests a problem. How big a problem depends on what the conversion rate is for people who visit both versions of the duplicated page. In some cases, I’ve seen the conversion rate drop off by more than half when visitors encounter duplicated content—and that would indicate visitors are confused by the duplication and that a substantial duplication problem exists.
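To make that comparison concrete, here is a small Python sketch of the arithmetic behind the segment. It assumes you can get, for each visitor, the set of pages they viewed and whether they converted. That data shape, the function names, and the sample visitors are invented for illustration. In practice, your analytics tool’s segment does this work for you.

```python
def conversion_rate(visitors):
    """Share of visitors in the list who converted (0.0 for an empty list)."""
    if not visitors:
        return 0.0
    return sum(1 for v in visitors if v["converted"]) / len(visitors)


def compare_segments(visitors, page_a, page_b):
    """Compare visitors who saw both duplicated pages with those who saw one."""
    saw_both = [v for v in visitors if page_a in v["pages"] and page_b in v["pages"]]
    # "!=" on the two membership checks means exactly one of the pages was seen
    saw_one = [v for v in visitors if (page_a in v["pages"]) != (page_b in v["pages"])]
    return {
        "share_seeing_both": len(saw_both) / len(visitors) if visitors else 0.0,
        "rate_saw_both": conversion_rate(saw_both),
        "rate_saw_one": conversion_rate(saw_one),
    }


# Made-up visitors
visitors = [
    {"pages": {"/widgets", "/widget-guide"}, "converted": False},
    {"pages": {"/widgets", "/widget-guide"}, "converted": False},
    {"pages": {"/widgets", "/widget-guide"}, "converted": False},
    {"pages": {"/widgets", "/widget-guide"}, "converted": True},
    {"pages": {"/widgets"}, "converted": True},
    {"pages": {"/widgets"}, "converted": False},
    {"pages": {"/widget-guide"}, "converted": True},
    {"pages": {"/widget-guide", "/contact"}, "converted": False},
]

result = compare_segments(visitors, "/widgets", "/widget-guide")
print(result)
```

In this made-up data, half of the visitors see both pages and their conversion rate (25%) is half that of visitors who see only one (50%), which is the kind of drop-off described above.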
Now that you’ve found and confirmed that duplicated content exists, what do you do about it? The solutions depend on the nature of the duplicated content and the severity of the duplicate content problem. The bigger the problem, the harder the solution.
If two pages share an identical title tag, but the pages are fundamentally distinct and serve different purposes, you don’t need to make a large-scale change to the page. Instead, you can tweak the title tag or the page header to make it clearer what purpose each page serves. This can also stop multiple pages ranking for the same terms and, by doing so, help you rank for new terms.
Another solution for duplicate content is using the canonical link element to define which version of the page should be considered for rankings. This doesn’t help users, but it does help bots understand your website (which then indirectly helps users). For more information, read my guide on defining a canonical URL and supporting the canonical throughout your website.
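Once canonicals are in place, it is worth checking that each duplicated page declares the canonical URL you expect. The canonical link element is a `<link rel="canonical" href="...">` tag in the page’s head. Below is a minimal Python sketch, using only the standard library, that pulls the canonical URL out of a page’s HTML. The class name, function name, and sample HTML are my own, and fetching the page is left out.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values; a missing rel is None
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonicals(html):
    """Return the list of canonical URLs declared in the HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


# Made-up page source
html = """
<html><head>
  <title>Widgets</title>
  <link rel="stylesheet" href="/style.css">
  <link rel="canonical" href="https://www.example.com/widgets">
</head><body></body></html>
"""

print(find_canonicals(html))
```

A page should end up with exactly one canonical in this list. Zero or several are both worth a closer look.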
Consolidate or Remove & Redirect
If you have a bigger duplicate content problem, changing a title tag or adding a canonical won’t be enough of a fix. In these cases, the solution requires removing or consolidating pages. For example, if page A and page B are identical, you could remove page A and keep page B to resolve the duplication. To make sure no human or robot visitors are lost after this change, it would be best to redirect the removed page to the kept page (in this example, redirect page A to page B).
This gets trickier if the pages are nearly identical, or if it is only the intentions that are duplicated. For those scenarios, you often will want to consolidate the content from the several duplicated pages into one single page (i.e. if you have four pages discussing your widgets for sale, you could consolidate into a single page about widgets). Here, too, you’d want to redirect the duplicated versions into the single page.
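Before handing redirects to a developer, I find it helps to write the consolidation plan down as data and sanity check it. The Python sketch below turns a plan (each kept page and the duplicates being removed) into a redirect map. The function names and URLs are hypothetical, and the use of a 301 (permanent) status is my assumption about how you would set up these redirects. How redirects are actually configured depends on your server or CMS.

```python
def build_redirect_map(plan):
    """Turn {kept page: [removed pages]} into {removed page: kept page}."""
    redirects = {}
    for kept, removed_pages in plan.items():
        for removed in removed_pages:
            if removed == kept:
                raise ValueError(f"{kept} cannot redirect to itself")
            if removed in redirects:
                raise ValueError(f"{removed} is redirected more than once")
            redirects[removed] = kept
    # A kept page must not also be removed, or visitors get bounced onward
    for kept in plan:
        if kept in redirects:
            raise ValueError(f"{kept} is both kept and removed")
    return redirects


def respond(path, redirects):
    """Return (status, location): 301 for a removed page, 200 otherwise."""
    if path in redirects:
        return 301, redirects[path]
    return 200, path


# Four widget pages consolidated into one (made-up URLs)
plan = {"/widgets": ["/red-widgets", "/blue-widgets", "/widgets-for-sale"]}
redirects = build_redirect_map(plan)

print(respond("/red-widgets", redirects))
print(respond("/widgets", redirects))
```

The checks are there to catch a plan that would send visitors from one removed page to another removed page.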
How do you pick which page to keep? It depends on how the pages are performing. If the duplicated versions are all performing equally well, then it is largely a coin toss. But you may see differences: one version of the duplicated page ranks somewhat higher or another version has a higher conversion rate. In that case, keep the version that performs better on the measure that matters most to your website.
Fix Underlying Technical or Structural Issues
Finally, the duplicate content may be caused by an underlying technical issue. In some cases, this is an old dev environment mirroring the live site that somehow became exposed to visitors (in one case, this happened because of an upgrade to the dev site that removed the password protection that had been preventing the duplicate content). This can also happen with dynamic pages that create identical content at multiple URLs (think of elaborate content filtration systems). Sometimes, the problem happens as a result of user generated content and lack of monitoring for people posting the same content in multiple locations (think of a forum where people could post the same question to three different categories within the forum).
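For the dynamic page case, one way to see how many URLs lead to the same content is to normalize a list of crawled URLs and group the ones that collapse together. The Python sketch below does that with the standard library. Which parameters are safe to ignore is entirely site specific. The `IGNORED_PARAMS` set, the assumption that a trailing slash doesn’t change the content, and the sample URLs are all placeholders of mine.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical filter parameters that change the view but not the content
IGNORED_PARAMS = {"sort", "view"}


def normalize_url(url, ignored=IGNORED_PARAMS):
    """Drop ignored parameters, sort the rest, lowercase the host, and strip
    a trailing slash from the path (assumes both forms serve the same page)."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, urlencode(kept), ""))


def group_duplicate_urls(urls, ignored=IGNORED_PARAMS):
    """Return {normalized URL: [original URLs]} where more than one URL collapses."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize_url(url, ignored)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


# Made-up crawled URLs
urls = [
    "https://example.com/widgets?sort=price",
    "https://example.com/widgets/?view=grid&sort=name",
    "https://example.com/widgets?page=2",
]

print(group_duplicate_urls(urls))
```

Each group that comes back is a set of URLs to review, and likely a candidate for a canonical or a redirect as described earlier.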
It can also be a structural issue. For example, there are three sections within the website where the page could live so the webmaster, to be the most helpful, puts a copy of the page in each of those sections. In a way, that makes sense: it allows visitors to find the page in different places, all of which might be relevant places for that page to live. But the better answer is to reorganize the website so that one section can link to another without the need for multiple copies of the page in different places on the website (or, possibly, the site needs clearer differences between each section).
Summing It Up
Duplication can happen for loads of reasons. Rarely is the problem intentional. But, often, the duplication worsens search performance and hurts conversions. It takes time and effort to locate, evaluate, and fix the duplicated content. But that effort is worth it and will lead to performance gains for SEO and CRO. If you have any questions about duplicated content or finding issues that may exist on your website, please contact me.