Duplicate content: is it an issue you need to worry about? Whether you are on the side of those publishing content that gets duplicated on other sites, or on the side of those republishing contributing authors' content on yours, it is important to understand what the real issues and problems generated by duplicate content are, and to separate them from myths and easy speculation.
See Robin Good's video on the issue of duplicate content further down in this article.
I was motivated to write this post after being kindly included in an email conversation in which a webmaster was complaining about his content being replicated on another site, especially since Google had placed the duplicated content higher up in the search results than the original.
On the surface, it would appear that the original author has every right to complain, and to invite the "republisher" to stop the practice and get down to writing his own material.
But at times, especially when you look under the surface, the issue is much subtler and more complex than that.
Robin Good on duplicate content
It makes a whole lot of difference whether the replicating site provides full credit to the original author and site, not just in the form of text citations but specifically as links back to the original site. A clearly visible credit linking back to the original content, which includes the name of the site, the author, and the title of the original article (with a link back to it), is the minimum that should be provided by any online publisher syndicating or republishing already published web content.
The second most relevant factor that can turn this situation around is whether the duplicating site adds "extra value" to the original content in the form of an introduction, additional links and references, related content and news on the same topic, relevant illustration images, and other material the final reader may find useful.
The third key element is permission. If you have gone out of your way and taken the three minutes needed to fire off an email asking for permission to republish an article, and you have clearly explained what you intended to do with it, you are definitely on the safe side, ethically as well. (I have never heard of anyone complaining about duplicated content he had himself authorized.)
In favor of replicating content across sites, I can say that if done ethically (by following the three points above) and by extending the ways in which other people can get at that content (by using different titles and intro content), this can actually be a very positive and natural way to spread new ideas and valuable information.
At the other extreme we have sites that republish shallow, next-to-valueless content across hundreds of domains for the sole sake of monetizing such content at zero cost to themselves (in the case of those that pick up ready-made articles from article directories). Or we have small online publishers who want fast traction and visibility on search engines by spreading their low-quality content across article directories, in the hope of gaining lots of links back from the sites that will freely republish it.
But this is only my opinion.
Here is what Google has to say officially on this matter:
"Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.
Examples of non-malicious duplicate content could include:
Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
Store items shown or linked via multiple distinct URLs
Printer-only versions of web pages
However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.
Google tries hard to index and show pages with distinct information.
In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved.
As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results."
Google users typically want to see a diverse cross-section of unique content when they do searches. In contrast, they're understandably annoyed when they see substantially the same content within a set of search results.
"However, we prefer to focus on filtering rather than ranking adjustments ... so in the vast majority of cases, the worst thing that'll befall webmasters is to see the "less desired" version of a page shown in our index."
(Source: Google Webmaster Central)
Now, pay attention to this:
Most of the problems that Google acts upon when it comes to duplicate content are actually caused by duplicate content on your own site, and not by someone else replicating your content elsewhere.
Actually, let me make this a new axiom:
If there is another site republishing some of your content and it ranks higher than yours in the Google search engine result pages, you DEFINITELY have a problem to solve on YOUR site. (And the problem is NOT to go scream at or threaten whoever has republished your content, but to wake up and be curious enough to see what makes your site so weak that even Google prefers the duplicate to yours.)
As far as I can tell from my daily experience in managing a few web sites, Google's only concern is to serve high-quality, relevant content that perfectly matches the query made by any searcher. If another site that uses some of your content comes up before yours, look carefully within your own site before blaming whoever republished your content.
I also think you have every right to write and complain to the duplicators if they didn't contact you before republishing your content (assuming your content was not under an open redistribution license such as one of the Creative Commons licenses).
But in many cases, the duplicator or republisher (and here I am excluding all automatic republishing bots and spam sites that are clearly stealing content purely for their own economic interest) is not just taking advantage of your content, but rather extending and supporting your campaign and ideas, adding extra value by linking back to your site while being very transparent about the authorship and origin of the content actually used.
In all these cases, you would really not be justified in complaining or in restraining his activity, as he is genuinely extending and positively contributing to your communication efforts.
The fact that the replicating site may come up higher in search results should not hold you back from improving your own content as well, updating it while reducing the amount of unnecessary and self-promotional material you may actually be "duplicating" across your own pages.
Let me explain this better.
In the email exchange I was made part of, the original author was complaining because his original content was sometimes overshadowed in search result pages by the higher relevance of the "copied" content republished by another site owner.
I explored both sites, looked beyond the surface at some of the actual content appearing on each, and what did I find? The original publisher's page indexed by Google was about 60 KB in size, but almost 50% of that was not really content on the topic of the article. It was just a truckload of links pointing to his other articles, taking up more visual space than the article itself.
On the other side was the "replicator", who not only had gone out of his way to provide a good intro, plenty of additional links and related resources, and clear credit and links back to the original, but had even brought together multiple pieces from the external author to make the analysis of the issue deeper and more comprehensive. On top of this, the content on the "replicator" site, as indexed by Google, was about three times the size of the original, and if you gave it a look, the relevant content dominated the page, leaving only a marginal share of the available space to navigation, ads and other info.
This is why Google, even though it knows well who wrote that content first and where it originated (the duplicating site makes no secret of it, linking back to the original and crediting it extensively), prefers to serve the "replicator" page higher up in the search results.
In simple terms, the replicator site is doing a valuable job, from all standpoints, by extending the reach and value of the original content created by the other site while fully crediting and linking back to it.
The originating site laments its inability to maintain higher search engine rankings, but lacks the humility to look at the value of the content it is serving, and to improve and update it in ways that make it more valuable than any other resource available online on the topic. The originating site is also at risk of alienating such important "relay" points by complaining about their good efforts and linking favors, and by focusing too much on holding on to unearned visibility while forgetting to serve the user's best interests.
On only one front did the "replicator" fail badly: by tacitly assuming too much, it took the liberty of doing things without ever having the kindness to go out of its way and officially ask the originating site for permission to replicate whatever content it felt relevant to use.
Moral of the story for the duplicating type of site: unless you see clear text that says, without a shadow of a doubt, "Go ahead and copy my content on your site..." (which is not so rare to find - check the bottom of this page for a good example), it is ALWAYS your duty to take the time needed to contact the author of any original content you want to republish and to ask her permission to republish her content on your site. Period.
There are some steps you can take to proactively address duplicate content issues and ensure that visitors see the content you want them to. (Remember, again, that Google is thinking MORE in terms of duplicate content issues on YOUR OWN site than on other sites, since Google can handle those others quite easily by itself.)
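As one concrete sketch of such a proactive step (my illustration, not something Google's text above prescribes): if your own site serves duplicate variants of a page, such as a printer-only version, you can keep the variant out of the index with a robots meta tag in that variant's head. The file names here are hypothetical placeholders:

```html
<!-- print.html: printer-only duplicate of article.html (hypothetical file names) -->
<head>
  <title>My Article (printer version)</title>
  <!-- ask search engine crawlers not to index this duplicate variant,
       while still allowing them to follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
```

This way the "standard" version of the page remains the only indexed copy, and visitors arriving from search results land where you want them to.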
However, if our review indicated that you engaged in deceptive practices and your site has been removed from our search results, review your site carefully against our webmaster guidelines. Once you've made your changes and are confident that your site no longer violates our guidelines, submit your site for reconsideration.
...and then look well inside your own site.
(Source: Google Advice on Duplicate Content)
But unfortunately Google does not officially answer the hardest, hottest questions many web publishers may have on this topic. Here are a couple which have clearly gone unanswered on the Google Webmaster Central post about duplicate content:
Kristen Veraldi commented... (on February 5, 2008)
"I couldn't agree with you more - in the world of content it is ALL about adding value in a timely and relevant manner from your own unique and honest perspective.
However, for those that understand this and do regularly follow that mantra to build their foundation, do you believe adding generic third party content to the equation (let's say 25% of the time as an example) could tarnish a site's reputation?
What if that third party content did not initially originate from the web (there is no original link, just the same content on 100s of other websites - who knows who was first) and was being provided solely for wide spread distribution and re-use by some sort of industry specific content generator?
What if those articles did add value for your local sphere (ie. the people that you are not necessarily trying to connect with through search, but those already following you)?
I guess what i'm asking is - if you don't care about optimizing this generic content and the value is more intrinsic, can you offer it and still feel relatively comfortable that it won't hurt your original content in search (since we do of course care about some indexing!).
This is a very common situation, so I ask for all those out there with templated sites wondering if they are exposed.
If this were a cause for concern, could it be ebbed by using robot no follow tags on those pages containing this content?
I realize these questions are nearly impossible to answer definitively - anything you can add is appreciated. Thanks!"
adwords wrote (on March 2, 2008):
"I have an article directory that is effectively 100% duplicate content (More Than Articles). Not long after it first started I did notice a dip in traffic and found that all my pages were in the supplemental index. I provide the articles formatted in HTML and in plain text, as well as the standard version. So basically every article appears 3 times on the site with minor variations.
I reworked the navigation and the robots.txt to exclude everything but the standard version from indexing. This has lead to all the pages going back to the main index and a gradual increase in traffic.
From that experience I have to conclude that duplication within a domain is rather more important than duplication across domains."
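As a sketch of the fix this commenter describes, here is what such robots.txt rules might look like, assuming (hypothetically) that the HTML-formatted and plain-text article variants live under /html/ and /text/ directories while the standard versions sit elsewhere on the site:

```
# robots.txt - keep the duplicate article variants out of crawlers' reach
# (the /html/ and /text/ paths are hypothetical examples)
User-agent: *
Disallow: /html/
Disallow: /text/
```

With only the standard version left crawlable, each article exists just once as far as the search engines are concerned, which matches the recovery the commenter reports.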
So, lacking official answers on these points, here is my definitive advice on duplicate content:
Robin Good's Advice:
For those republishing content from others:
a) credit the author's name and link it to his online bio/ profile if available
b) acknowledge the author site, with its name and a link to it
c) reference the original article's title and date of first publication, and link back to it
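In HTML, the three credit elements above might be rendered as a single, clearly visible block; every name and URL below is, of course, a placeholder:

```html
<!-- credit block for a republished article: author, source site, original post -->
<p class="credit">
  Written by <a href="http://example.com/about">Author Name</a>
  of <a href="http://example.com/">Original Site Name</a>.
  First published on January 1, 2008 as
  <a href="http://example.com/original-article">"Original Article Title"</a>.
</p>
```

Placing this block near the top of the republished piece, rather than burying it at the bottom, makes the attribution unambiguous to both readers and search engines.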
... and for those having their content republished by someone else:
If content is republished with permission, while adding significant value to it, changing its title, adding an introduction, and crediting / linking back to its original source, I think both the original author and the republishing site will greatly benefit. More people will read the content, more people will get to know the original site and author, and the message will be picked up by a much larger group of people, thanks to the additional free distribution that search engines will provide to copies offering extra or complementary value over the original.
Originally written by Robin Good for Master New Media and first published on March 18 2008 as "Duplicate Content Online: Issue, Problems and Good Things".