If I asked for your views on duplicate content on the internet, your answer would probably be a resounding “Oh! It is bad.”
But Matt Cutts, Google’s former head of search spam, said that 25% to 30% of the content on the internet is duplicate.
It means roughly one in four pages is engaging in an act that many consider wrong, and rightly so.
But the problem lies in our understanding of duplicate content and of how search engines view similar content.
The scope of duplicate content goes beyond “copy & paste” and “content syndication.” These two are extreme examples of content duplication, but they aren’t the only forms of it.
Content duplication takes many forms. Sometimes it results from a deliberate attempt to steal content and earn an undue benefit. Other times, it occurs as a result of technical complexities.
Either way, it is essential to understand it in more detail.
What is content duplication?
Here is how Google defines duplicate content:
“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.”
Duplicate content poses a problem for both the site owners and the search engines. But if a website engages in content duplication, Google doesn’t punish it with a harsh duplicate content penalty.
Google says that duplicate content is not grounds for action against a website unless there is a clear intent to deceive and manipulate search engines with similar content.
So, why is Google not coming down hard on content duplication?
For this, you have to understand what causes content duplication and what its types are.
There are two types of duplicate content. They both cause problems for site owners and search engines:
Offsite content duplication
Offsite duplication is when the same content appears on two or more websites. The copies can be exact or differ by a few modifications. Offsite content duplication happens in three ways:
- A website is involved in content syndication: scraping content from other websites and republishing it.
- A website has published any third-party content.
- A website has distributed its content to be published by other websites, for example, article distribution.
Onsite content duplication
Onsite content duplication is when similar content is published on more than one URL of a website. While offsite content duplication is hard to manage, onsite duplication is easily managed by some technical changes to the site’s structure and development.
Duplicate content and SEO
Duplicate content cancels out the benefits of SEO. While SEO is all about ranking a website higher in the SERP, duplicate content negatively impacts the website’s rankings by compromising uniqueness.
It also causes problems for search engines in identifying the source of the content. Sometimes, the duplicated content ranks higher in SERP than the original content. Thus, web owners suffer the loss of traffic and visibility.
Here is how duplicate content can affect your SEO efforts:
Divides the traffic
When it comes to syndicated content – content appearing on more than one website – the search engines, on most occasions, cannot identify the source of the content or the webpage where it was first published.
So, they rank all the pages on their indices. This dilutes the uniqueness of each page, and instead of one page getting undivided traffic, several pages get the traffic.
Distributed link juice
If content appears on a single URL, all the external backlinks (URLs linking to that content) will point to that single URL.
But if similar content appears on more than one URL, the backlinks will be divided among several web pages. Since backlinks pointing towards a webpage are a ranking factor, any loss of backlinks can impact your ranking in the SERPs.
Eats into the crawl budget
Whenever the search engine crawlers crawl a website, they limit the number of pages they can crawl in one attempt.
With duplicated content, this crawl budget is wasted on crawling and indexing unwanted pages. Your updated or money pages might not be crawled because the crawl budget is consumed by pages with duplicate content. Even if these duplicate pages are crawled and indexed, they won’t serve any purpose in the SERPs.
Complicates site structures
Duplicated content not only creates offsite issues but also leads to problems within the website. If two or more URLs have the same content, it will increase your website’s number of pages and affect the site speed.
This problem is more common with e-commerce websites or websites with thousands of pages. For example, in an e-commerce website that sells shirts, multiple URLs will be generated if the users apply filters (size, color, collar size) to the product.
Similarly, some CMSs generate separate links for products listed under different categories or tags. This also increases the number of pages on a website.
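To see how filter URLs pile up, here is a minimal Python sketch that groups URLs differing only by filter parameters. The parameter names in FILTER_PARAMS and the shop.example.com URLs are hypothetical; a real site would substitute its own.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical filter parameters; replace with the ones your site uses.
FILTER_PARAMS = {"size", "color", "collar"}


def strip_filters(url: str) -> str:
    """Return the URL with filter parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in FILTER_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_variants(urls):
    """Group URLs that collapse to the same filter-free address."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_filters(url)].append(url)
    return dict(groups)


urls = [
    "https://shop.example.com/shirts/oxford",
    "https://shop.example.com/shirts/oxford?size=m",
    "https://shop.example.com/shirts/oxford?color=blue&size=m",
    "https://shop.example.com/shirts/polo?page=2",
]
groups = group_variants(urls)
for base, variants in groups.items():
    print(base, "has", len(variants), "URL(s)")
```

Any group with more than one URL is a candidate for onsite duplication. The sketch only compares addresses; it does not check whether the pages actually carry the same content.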
How to check duplicate content
The easiest way to identify the pages with duplicate content on your website is by checking all the pages from your site indexed in Google.
Here is a simple way of doing it. Go to google.com and, in the search bar, type site: followed by your website’s URL without the subdomain.
It will give a list of all the indexed pages from your website.
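If you check several domains regularly, the same search can be built as a link. This is a small Python sketch; example.com is a placeholder domain and the function name is made up for illustration.

```python
from urllib.parse import urlencode


def site_query_url(domain: str) -> str:
    """Build a Google search URL for the query 'site:<domain>'."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})


# example.com is a placeholder; use your own domain.
print(site_query_url("example.com"))
```

Open the printed link in a browser to see the same list you would get by typing the query by hand.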
You can also check the indexed pages of your website through the Index Coverage report in the Google Search Console.
From the left sidebar, click ‘Coverage’ under ‘Index.’ The report lists the URLs indexed in Google and the URLs that aren’t indexed due to errors.
There are also duplicate content tools that you can use to identify duplicated content, such as DupliChecker, Copyscape, Siteliner, etc.
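For a rough idea of what “appreciably similar” can mean in practice, here is a minimal Python sketch that compares two texts by the three-word sequences (shingles) they share. This is a common textbook technique, not how Google or the tools above measure similarity, and the sample sentences are invented.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Share of shingles two texts have in common (0.0 to 1.0)."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a and not set_b:
        return 1.0  # two empty texts are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


# Invented sample texts for illustration.
original = "This cotton shirt has a classic collar and a slim fit."
near_copy = "This cotton shirt has a classic collar and a regular fit."
unrelated = "Our leather boots are waterproof and built for hiking."

print(round(jaccard(original, near_copy), 2))
print(round(jaccard(original, unrelated), 2))
```

A score close to 1.0 suggests near-duplicate text, and a score close to 0.0 suggests unrelated text. Where to draw the line between the two is a judgment call that this sketch does not settle.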
How to fix content duplication issues
The most common duplication problem occurs when multiple URLs lead to a single page. For example, all these links will lead to one page:
All these URLs will lead to the homepage of our website. Now, the simplest way to fix this problem is by applying a 301 redirect. No matter what URL the users enter, they will be redirected to one place.
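The redirect decision can be sketched in Python as follows. The preferred address https://www.example.com/ and the host and path variants are hypothetical. In practice, the redirect is usually configured in the web server or CMS rather than written by hand like this.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical preferred address and the variants that should point to it.
CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.example.com"
ALIAS_HOSTS = {"example.com", "www.example.com"}
INDEX_PATHS = {"", "/", "/index.html", "/index.php"}


def redirect_for(url: str):
    """Return (301, target) for a non-preferred variant, or None otherwise."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in ALIAS_HOSTS:
        return None  # not one of our hosts, nothing to do
    path = "/" if parts.path in INDEX_PATHS else parts.path
    target = urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, parts.query, ""))
    if target == url:
        return None  # already the preferred URL
    return 301, target


for candidate in [
    "http://example.com",
    "https://example.com/index.html",
    "https://www.example.com/",
]:
    print(candidate, "->", redirect_for(candidate))
```

Every variant gets the same permanent (301) target, so users and crawlers end up at one address no matter which one they entered.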
Large e-commerce websites are the absolute cesspit of similar content. Not only do their product descriptions match, but their products also generate multiple URLs because they are categorized under different tags.
It is also the case with third-party product descriptions that feature on these e-commerce websites. These products are listed on more than one website, and a third party provides the description for them. Thus, too many product pages carry the same product description.
To fix this issue, you can apply canonical tags (rel="canonical"). Canonical tags tell search engines about the “preferred” version of a web page if the content is duplicated.
By adding a canonical tag, you tell the search engines that, out of two pages with similar content, you want this page ranked and not the other one. Without canonical tags, the search engines may index all the duplicate pages and decide for themselves which one to show.
But canonicalization doesn’t fix the duplication issue completely.
One of the problems with rel="canonical" is that the search engines see these tags as mere signals. They can acknowledge or ignore them as they see fit. So, even after applying rel="canonical," you might not be able to direct all the traffic towards a single URL.
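To check which canonical URL a page declares, a small Python sketch with the standard library’s HTML parser is enough. The sample page and its shop.example.com address are invented for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated tokens, so split before matching.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def find_canonical(html: str):
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Invented sample page for illustration.
page = """<html><head>
<title>Blue Oxford Shirt</title>
<link rel="canonical" href="https://shop.example.com/shirts/oxford">
</head><body></body></html>"""

print(find_canonical(page))
```

Running this over every filtered or tagged variant of a product page shows whether they all point to the same preferred URL. It only tells you what the page declares, not whether a search engine honors it.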
If you have heard of the robots.txt file, you will know that it works as a gatekeeper for your website if you create it. It controls the avenues where the search engine crawlers can and cannot enter.
In a robots.txt file, users can add “Allow” and “Disallow” rules for pages and directories. Crawlers that respect the file do not crawl the disallowed URLs. The “follow,” “nofollow,” and “noindex” directives are a separate mechanism: they go in a robots meta tag in a page’s HTML head, not in robots.txt.
Blocking your pages with duplicated content in this way is meant to keep them out of the SERPs.
However, this practice is not recommended by Google. If the crawlers’ access to the web pages is blocked, they can’t identify similar content. As a result, the crawlers will treat them as unique pages.
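For reference, robots.txt rules are written as User-agent, Allow, and Disallow lines. The Python sketch below feeds a tiny, made-up file to the standard library’s parser to show how a rule blocks a path. The /print/ directory and the example.com URLs are hypothetical, and the parser only illustrates how such a rule is read, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks a hypothetical /print/ directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("*", "https://www.example.com/print/shirts")
allowed = parser.can_fetch("*", "https://www.example.com/shirts")
print("print version crawlable:", blocked)
print("main page crawlable:", allowed)
```

A crawler that cannot fetch the blocked copy also cannot see that it duplicates the main page, which is the drawback described above.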
Duplicate content penalty
As a webmaster, you have no way of knowing how much duplicate content is acceptable. While you should always strive for as little duplicate content as possible, it is also essential to know that Google does not punish content duplication.
Google has not outlined any penalty for content it deems copied or scraped from other websites. It doesn’t prefer the original source over pages with duplicate content while indexing and ranking pages in the SERP.
Google rarely penalizes a website with deindexing for copying and syndicating content. For that, there has to be a ‘clear indication of manipulation by the websites to deceive the search engine.’
Content duplication is a big problem for websites. It impacts their SEO efforts, affects their rankings in the SERP, and dilutes the uniqueness of content.
However, it is important to note that Google or other search engines do not punish duplicate content. Still, webmasters should limit both onsite and offsite duplication to improve the website’s SERP ranking and increase its authenticity.