Duplicate Content: the symptoms, the causes, the treatments.
In SEO there are many obstacles to ranking. First and foremost are your competitors, who are fighting you for the top 10 spots in the search engine results pages (SERPs). Then there are the ever-changing Google rules, which you must follow in order to comply with the Webmaster Guidelines. On top of all that, there are issues with your own website that are entirely up to you to prevent or fix. These issues are essentially the low-hanging fruit: fixes you can make without too much effort that will produce visible, measurable changes in your rankings. One such issue, and possibly the most widespread, is duplicate content.
Duplicate content is an understandable problem for Google – on the one hand they do not want to populate page results with many listings featuring the same or very similar content. On the other hand, duplication of content is an unintended consequence of the way the internet works – website owners are encouraged to share their content and syndicate it, producing numerous copies of the same original content. News reporters frequently substitute the effort and talent that go into writing with the ever so convenient CTRL+C – CTRL+V keystroke combination. Additionally, content duplication is a byproduct of the way websites are built; from mobile versions to session IDs and parameter hierarchy, there are many processes that can result in duplicate content. In this article we will cover the most notable ones, along with a suggested solution for each case.
The symptoms: the unwanted effects of duplicate content
Before we delve into the possible causes of duplicate content and why it is problematic from Google’s perspective, let’s first cover the consequences it can have:
1. Your website may be penalized. Content quality and originality are something Google tries to reward. The flip side of this is that unoriginal content is devalued in the eyes of Google. One of the most significant Google algorithm updates – Panda – specifically targets low quality, unoriginal content. While Google’s definition of such content is anyone’s guess and most probably a product of several different factors, the majority agrees that content originality is one of those factors, and a significant one. It should therefore be one of the first things examined when trying to recover from an algorithmic penalty like Panda.
2. A sub-optimal page on your site may rank instead of a targeted one. When Google encounters duplicate content on your website, it tries to work out which version of the page is the canonical one (the one of highest quality and relevance to the user). It then usually keeps that page and removes all the others from the index. This can be problematic from a webmaster’s perspective, since a) the content may not be completely duplicated and the two pages may be targeting different search queries, and b) Google may decide that a weaker page (in terms of link equity/social engagement/conversion) is the canonical one, and rank it instead of the better-performing page.
3. You may waste some of your precious ‘crawl budget’ on extraneous pages. Having your duplicate pages removed from the index does not mean that Google will stop crawling them. Since every site is assigned a limited crawl budget (the number of pages on a given site that Google will try to crawl), it is quite possible that you are wasting some of that budget on pages that are automatically excluded from Google’s index, while preventing unique, targeted pages from making it into the index at all. If you take internal navigation into account, you may also be wasting some on-site link equity on pages that do not serve your goals in any way.
The diagnosis: identifying duplicate content
In order to be able to fix duplicate content issues, we must first be able to identify them. The first indication of a duplicate content issue is, as with other major SEO ‘don’ts,’ a drop in rank and/or organic search referrals. This may be due to the loss of ranking of a page that was receiving organic search visitors and its replacement by a worse-performing page, or an algorithmic penalty. Once you notice such a change in your rankings and traffic, there are several ways to identify if it is due to the existence of duplicate content, and if so, the source of it:
1. Google Webmaster Tools (GWT). This is the first place to check for duplicate content issues. Under Search Appearance > HTML Improvements, you’ll find a report outlining the issues Googlebot has found with your site. Among these are Duplicate Descriptions and Duplicate Titles, usually the first indication of duplicate content on your website.
2. Screaming Frog. Screaming Frog is a desktop-based crawler tool with an SEO spin to it: it crawls your site and points out possible SEO issues such as duplicate meta tags, images without alt attributes, noindex/nofollow tags, canonicals, etc. It is a great backup to the GWT HTML Improvements report, as it shows everything that was crawled, not only what Google deemed important. Its only disadvantage compared to the GWT reports is that its crawling is navigation based, meaning that it only crawls pages linked from within the site and misses potential duplicate URLs that are created by links from outside the site.
3. Google search. When all else fails, search Google for an exact quote of one sentence from the content. Add the site: command to look for on-site duplication, or leave it out to find content duplicated across different sites. This uncovers a lot of indexed duplicate content. The benefit of this approach is that it not only discovers the duplicated URLs but also gives a clear indication of which pages Google actually indexed and which it skipped.
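Alongside these tools, exact duplicates can also be spotted directly from a crawl export. The following is a minimal Python sketch, not a replacement for the tools above: it assumes you already have a mapping of URL to extracted body text (the pages dictionary and its example.com URLs are hypothetical), and it only catches pages whose text is identical after lowercasing and collapsing whitespace. Near-duplicates need a fuzzier comparison.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing it and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of URLs whose body text is identical after normalization."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl output: URL -> extracted body text.
pages = {
    "http://www.example.com/shoes": "Red running shoes.  Free shipping.",
    "http://www.example.com/shoes?ref=home": "red running shoes. free shipping.",
    "http://www.example.com/about": "About our company.",
}
duplicate_groups = find_duplicates(pages)
```

Each group in duplicate_groups is a set of URLs that serve the same text and therefore need a canonical tag or a redirect.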
Possible on-site causes of duplicate content
1. Duplicate content – yes, you may actually have duplicated content on your site, for no particular reason. This can be a result of poor site planning, old URLs that were never redirected to new ones, empty search result/category pages, large “boilerplate” segments on your site, or thin pages with too little unique content to avoid being flagged as duplicates. Whatever the cause, make sure to eradicate it, either through canonical tags pointing to the original page, or by 301 redirecting the duplicate page to the original one. ProTip: Run a backlink report on MajesticSEO or some other backlink information service to establish which of your duplicate pages gets the most links and, if possible, redirect all the others to that page. Another option is to develop the duplicate version of the page so that it becomes unique enough to stand on its own merit. This solution works in cases where both of the duplicate pages are getting traffic, either through organic search or other sources (social media, direct traffic, link referrals, etc.), and is suitable when both pages are considered important. Make sure you differentiate the secondary version enough for Google to consider it unique, by changing the boilerplate content in one of the versions or rewriting the content of either of the two pages.
2. Canonicalization issues – these issues mostly stem from sub-optimal server-side solutions to common situations webmasters encounter when building their websites. In some cases there is a need for both versions of a website to exist (as with http:// vs. https://). In other cases, the existence of the non-canonical version of a site is the result of a technical error.
The most significant problem with duplicating content due to canonicalization issues is that search engines may index both versions, decide which one is canonical and discard the other. This can cause several problems with your website’s rankings:
- Since both pages are live and both are theoretically attracting links, there is a chance that Google will pick a version with a weaker link profile and rank your site in accordance to that weak profile.
- In addition to the relative weakness of the page, it is possible that the anchor text profile of links pointing to the sub-optimal page is very different from the anchor text profile of links pointing to the page you wish to target. If Google decides that the page reached from the unpopular keyword links – the sub-optimal page – is the original copy, you’ll instantly lose rankings for the important keywords used in links pointing to the other version of the page. Therefore, just by pooling the links from the sub-optimal page and pointing them to the optimal page, you can dramatically change and improve the keyword landscape from which your site is ranked.
- IP address that resolves to the homepage. Your site is hosted on a server that has an IP address, and your domain name points to that IP, telling the browser where to request your site. In some cases the IP address, rather than the domain, gets associated with the content and shows up in Google’s SERPs instead of the domain.
- It is safe to assume that on-site navigational links all stay within the same version of the site (http pages link exclusively to http pages and https pages link only to other https pages). This means that if the majority of external links point to, say, the http version, that link equity/relevance/authority will remain within the http URLs and none of it will be transferred to the https page that is actually ranking. The ranking page will therefore suffer from a persistent handicap compared to pages on sites that do not have this problem.
These are some of the possible causes of canonicalization issues and proposed best solutions:
CAUSE: http:// vs. https://
In some cases both the secured and non-secured versions of the site need to exist. In such a scenario, search engines index both versions, decide which one is the canonical one and discard the other.
SOLUTION: Point a canonical tag from each duplicate page to its corresponding page on the canonical version of the site. This way the link equity, authority and relevance will flow to the targeted pages, and the non-canonical version will still be available to visitors.
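As an illustration, here is a minimal Python sketch of generating that tag. It assumes, purely for the example, that https is the version you chose as canonical; the function name and the example.com URL are hypothetical. The resulting tag belongs in the head section of the duplicate page.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(url: str, canonical_scheme: str = "https") -> str:
    """Build a canonical tag pointing at the same page on the preferred scheme."""
    parts = urlsplit(url)
    canonical = urlunsplit(
        (canonical_scheme, parts.netloc, parts.path or "/", parts.query, "")
    )
    # Escape the URL so that characters such as & are valid inside the attribute.
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}">'


tag = canonical_link_tag("http://www.example.com/checkout?step=2")
```

If the http version is your canonical one, pass canonical_scheme="http" instead.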
CAUSE: www vs. non-www version
Contrary to the https-vs.-http problem, there is usually no justification for a coexistence of both www and non-www pages and this situation is usually a result of a technical oversight. All of the possible issues caused by the https vs. http duplicate content problem are bound to arise in this case as well.
SOLUTION: The only reliable solution in this case is 301 redirecting the non-canonical subdomain to the canonical one. Usually, the recommended version is http://www.domain.com, due to the possibility of needing to insulate different subdomains from each other in the case of a penalty.
CAUSE: http://www.domain.com and http://www.domain.com/index.html both return a 200 server response.
This is one of the most commonly overlooked canonicalization issues, as at first glance it looks like normal website behavior. However, since Google may treat these two URLs as two completely different pages, it is important to make sure that only one version gets all the link equity.
SOLUTION: 301 redirect to the canonical version.
CAUSE: Capitalization in the URL path, for example when both www.domain.com/page and www.domain.com/Page return a 200 server response.
The host name itself is not case-sensitive, but the path after it can be. Because Google for the most part treats each distinct URL as a separate entity, differing capitalization in the path may cause the same content to be treated as separate pages.
SOLUTION: 301 redirect to the canonical version.
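CAUSE: Trailing slash: both http://www.domain.com/page and http://www.domain.com/page/ resolve to the same page and both return a 200 server response. (On the bare domain the slash makes no difference: http://www.domain.com and http://www.domain.com/ are the same URL.)
SOLUTION: Use a URL rewrite rule to 301 redirect to the preferred version.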
CAUSE: http://www.domain.com vs. http://www.domain.com/index.html.
Sometimes there is no redirect between the two versions of the homepage and even though they are the same page, since there are two URLs, Google may treat them as separate entities. As in other cases, there is a danger of splitting the link power, Google ranking the suboptimal page, etc.
SOLUTION: 301 redirect to the preferred version.
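The redirect-based fixes above can be combined into one normalization step. The Python sketch below is one possible way to do it, not a definitive implementation. It assumes a hypothetical site at example.com whose preferred form is the www host, all-lowercase paths, no index.html and no trailing slash. Lowercasing paths is only safe if none of your real URLs depend on capitalization, and any port in the URL is dropped for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit

BARE_HOST = "example.com"           # hypothetical non-www host
CANONICAL_HOST = "www.example.com"  # assumed preferred host


def canonical_url(url: str) -> str:
    """Map a requested URL onto the single preferred form of that URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST
    path = parts.path.lower() or "/"
    # Collapse /index.html onto its directory URL.
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # Drop the trailing slash everywhere except the root.
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


def redirect_for(url: str):
    """Return (301, target) when the URL is not canonical, otherwise None."""
    target = canonical_url(url)
    return None if target == url else (301, target)


result = redirect_for("http://WWW.Example.com/Shoes/Index.html")
```

Because every variant maps to the same target in a single hop, visitors and crawlers never pass through a chain of redirects.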
3. Non-canonical content duplication – In many cases, duplication of content is not a product of technical oversight, as is the case with canonicalization issues, but rather a byproduct of site architecture or even an intentional result of content formatting. Here are some of the more common causes of such content duplication and suggested solutions:
- PAGINATION (comments and/or content)
There are several possible scenarios in which pagination may create duplicate content issues, the main ones being:
- CAUSE: Pagination of long form content.
Many websites, mainly ecommerce and news sites, tend to divide their content over several pages, either to avoid creating very heavy, slow-loading pages or to increase the number of page views per article, an advantage relevant mainly for advertising purposes. In either case, pagination may create not only a duplicate content problem but also a thin-content problem, and may even prevent the indexation of important products/content, thus reducing the search engine footprint of a site.
SOLUTION: There are two optimal solutions to this problem:
1. Create a page that contains all of the content featured across the paginated pages and point a canonical from every paginated page to it. This solution is good for consolidating all the link equity but it does not solve the slow-loading page problem, especially since it is this page that will most probably rank in the SERPs.
2. Use the rel="prev" and rel="next" tags on every paginated page. This should prevent the duplicate content issue, but it will not resolve the possible need for link equity consolidation. Google Webmaster Tools support has published a helpful article on how to implement these tags for pagination.
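To show what the second option looks like in practice, here is a minimal Python sketch that produces the tags for one page of a series. The ?page=N URL scheme and the function name are assumptions made for the example; adapt them to however your site actually numbers its pages.

```python
def pagination_link_tags(base_url: str, page: int, total_pages: int) -> list[str]:
    """Return the rel="prev" / rel="next" tags for one page of a paginated series."""

    def page_url(n: int) -> str:
        # Assumption: page 1 lives at the base URL, later pages at ?page=N.
        return base_url if n == 1 else f"{base_url}?page={n}"

    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{page_url(page - 1)}">')
    if page < total_pages:
        tags.append(f'<link rel="next" href="{page_url(page + 1)}">')
    return tags


middle_page_tags = pagination_link_tags("http://www.example.com/article", 2, 3)
```

The first page gets only a "next" tag and the last page only a "prev" tag, so the series has a clear beginning and end.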
CAUSE: Pagination of comments
Comment pagination can be another source of duplicate content issues. In some CMS platforms, such as WordPress, there is a chance that if an article has a large number of comments, they will be paginated. In that case, each page of comments gets a separate URL and the comments are spread across those URLs, while the article remains on all of them.
SOLUTION: Try to disable this option from your CMS management panel. If this option is not available, implement a canonical tag pointing to the main URL of the blog post.
- INTERNATIONALIZED CONTENT VERSIONS
- CAUSE: Sometimes webmasters need to use duplicate or very similar content when targeting different geographical regions. For example, an ecommerce site may want to show local store addresses in the sidebar while presenting the same content in the main area of the page. This is particularly true for websites targeting different countries that share a language, such as US/UK/Canada/Australia/etc. or Spain/Argentina/Colombia/Chile/Mexico/etc.
SOLUTION: Use the rel="alternate" hreflang="x" tag in the head section of each regional page. More information about use and implementation can be found in Google’s Webmaster Tools support section.
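A minimal Python sketch of generating these tags follows. The language-region codes and example.com URLs are hypothetical. The same full set of tags, including the one pointing at the page itself, goes on every regional version so that the annotations confirm each other.

```python
def hreflang_link_tags(alternates: dict[str, str]) -> list[str]:
    """Build one rel="alternate" hreflang tag per regional version of a page."""
    return [
        f'<link rel="alternate" hreflang="{code}" href="{url}">'
        for code, url in sorted(alternates.items())
    ]


# Hypothetical regional versions of the same product page.
alternates = {
    "en-us": "http://www.example.com/us/shoes",
    "en-gb": "http://www.example.com/uk/shoes",
    "en-au": "http://www.example.com/au/shoes",
}
tags = hreflang_link_tags(alternates)
```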
- CAUSE: Some websites, particularly news outlets, like to provide a lighter, thinner version of their articles intended for printing. Another scenario is where the same content is available in HTML and in PDF format. Both these cases may create duplicate content issues.
SOLUTION: As in many other cases, the best way to deal with this issue is to use a canonical tag in the head section of the non-canonical page, pointing to the canonical version. PDF files are a bit trickier, as they have no head section to use. In that case, send rel=canonical as an HTTP response header when serving the PDF file. For more information on how to implement this, see the Google Webmaster Support article on canonical tag implementation.
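For the PDF case, the header in question is the standard HTTP Link header. Here is a minimal Python sketch of building it; the function name and the example.com URL are hypothetical, and how you attach the header to the response depends on your server or framework.

```python
def canonical_link_header(canonical_url: str) -> tuple[str, str]:
    """Return the (name, value) pair of an HTTP header declaring the canonical URL."""
    return ("Link", f'<{canonical_url}>; rel="canonical"')


# Header to send along with the PDF copy of a hypothetical HTML article.
name, value = canonical_link_header("http://www.example.com/articles/duplicate-content")
```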
- CAUSE: Some websites maintain a separate mobile version of their pages on its own set of URLs, which duplicates the content of the main site.
SOLUTION: The preferred way is not to create separate content for the mobile version of the site at all, but rather to handle the design differences between devices through CSS. If this is not an option, apply a canonical tag pointing to the main version of the site.
- CAUSE: In some cases, when dealing with dynamic sites, the order of parameters in the URL can be reversed and the same content will still be retrieved from the database. The best example of this is Google itself: if you change the order of the parameters, or even add or remove some of the cookie or other user-related parameters, the content remains the same. Similarly, some websites store the user’s session information in the URL, enabling consistent serving of content to the same user throughout the session. (Things such as shopping cart content can be saved this way.)
SOLUTION: Use a canonical tag pointing to the preferred version of the URL. For session IDs, the preferable longer-term solution is to handle them through cookies, but a canonical tag can be used as a temporary measure.
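To decide what the preferred version of a parameterized URL is, one common approach is to drop the session parameters and sort the rest. The Python sketch below illustrates the idea under stated assumptions: the parameter names in SESSION_PARAMS are hypothetical and must be replaced with whatever your platform actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; check what your own platform appends.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def normalize_query(url: str) -> str:
    """Drop session parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    pairs.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))


first = normalize_query("http://www.example.com/list?size=9&color=red&sid=abc123")
second = normalize_query("http://www.example.com/list?color=red&size=9")
```

Both calls yield the same URL, which can then be used as the href of the canonical tag on every variant.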
- CAUSE: Sometimes affiliate websites link to product pages of an ecommerce site with URLs containing affiliate codes. This may cause search engines to rank affiliate landing pages in the SERPs, while ousting the original content. Additionally, this can create issues with tracking the affiliate referrals correctly over time.
SOLUTION: Use a canonical tag on each affiliate landing page that points to a canonical version of the page URL. If such a page doesn’t exist, create one.
Possible off-site causes of duplicate content
Putting aside all of the mistakes and technical issues that may cause duplicate content, in some cases duplicate content is just that: content that was either intentionally stolen or unintentionally duplicated on another domain. One situation in which content is innocently duplicated is when websites provide their content through RSS feeds or syndicate it in some other way. In this case, it is helpful to include a link pointing back to your original URL within the syndicated article. This should tell Google which is the original source of the content, thus removing any duplication issues. However, this solution does not always work properly, and in many cases Google fails to attribute the content to its original author.
In case of more blatant theft of content, use Google’s Content Removal tool, which guides you through the steps of reporting stolen content. For more serious and persistent cases, you can file a DMCA complaint to get the stolen content completely removed from the net.
These are the most common causes of and solutions for duplicate content issues, to our knowledge and in our experience. The RankAbove platform recognizes these issues and scans every page and URL of a given website for duplicate content. Contact us if you’d like to make sure that your website is in the clear, or if you have any other SEO concerns.