Duplicate Content Online: Issues, Problems and Good Things
Duplicate content: is it an issue you need to worry about? Whether you publish content that gets duplicated on other sites, or you republish contributing authors' content on yours, it is important to understand the real issues and problems that duplicate content can generate, and to separate them from myths and easy speculation.
See Robin Good's video on duplicate content further down in this article.
I was motivated to write this post after being kindly included in an email conversation in which a webmaster was complaining about his content being replicated on another site, especially since Google had placed the duplicated content higher in the search results than the original.
On the surface, it would appear that the original author has every right to complain, to invite the "republisher" to stop the practice, and to tell him to get down to writing his own stuff.
But the issue, especially when you look under the surface, is at times much subtler and more complex than that.
Robin Good on duplicate content
Duplicating Content Key Factors
It makes a whole lot of difference whether the replicating / duplicating site is providing full credit to the original author and site, not just in the form of text citations but specifically as links back to the original site. A clearly visible credit link back to the original content, which includes the name of the site, the author and the title of the original article (with a link back to it), is the minimum that should be provided by any online publisher syndicating or republishing already published web content.
The second most relevant factor that can turn this situation around is whether the duplicating site is adding "extra value" to the original content in the way of an introduction, additional links and references, related content and news on the same topic, relevant illustrative images and other content which the final reader may find useful.
The third key element is permission. If you have gone out of your way and taken the three minutes needed to fire off an email asking for permission to republish an article, and you have clearly explained what you intend to do with it, you are definitely on the safe side, ethically as well. (I have never heard of anyone complaining about duplicated content which he had himself authorized.)
In favor of replicating content across sites, I can say that if it is done ethically (by following the three points above) and by extending the ways in which other people can get at that content (by using different titles and intro content), this can actually be a very positive and natural way to spread new ideas and valuable information.
At the other extreme we have sites that republish shallow, next-to-valueless content across hundreds of domains for the sole sake of monetizing such content at zero cost to them (in the case of those that pick up ready-made articles from article directories), or small online publishers who want to gain fast traction and visibility on search engines by spreading their low-quality content on article directories, in the hope of gaining lots of links back from the sites that will freely republish their content.
But this is only my opinion.
What Is Duplicate Content According To Google
Here is what Google has to say officially on this matter:
"Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.
Examples of non-malicious duplicate content could include:
Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
Store items shown or linked via multiple distinct URLs
Printer-only versions of web pages
However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.
Google tries hard to index and show pages with distinct information.
In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved.
As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results."
Google users typically want to see a diverse cross-section of unique content when they do searches. In contrast, they're understandably annoyed when they see substantially the same content within a set of search results.
"However, we prefer to focus on filtering rather than ranking adjustments ... so in the vast majority of cases, the worst thing that'll befall webmasters is to see the "less desired" version of a page shown in our index."
(Source: Google Webmaster Central)
Duplicate Content and Who Ranks First: Where Is The Problem Really?
Now, pay attention to this:
Most of the problems that Google acts upon when it comes to duplicate content are actually caused by duplicate content on your own site, and not by someone else replicating your content elsewhere.
Actually, let me make this a new axiom:
If there is another site republishing some of your content and it ranks higher than yours in the Google search engine result pages, you DEFINITELY have a problem to solve on YOUR site. (And the solution is NOT to go and scream at or threaten whoever has republished your content, but to wake up and be curious enough to see what makes your site so weak that Google prefers the duplicate to yours.)
So When Another Site Republishes Your Content Should You Get Upset Or Not?
As far as I can tell from my daily experience managing a few web sites, Google's only concern is to serve high-quality, relevant content that perfectly matches the query made by any searcher. If another site that is using some of your content comes up before yours, look well within your own site before blaming whoever republished your content.
I also think you have every right to write and complain to the duplicators if they didn't contact you before republishing your content (assuming your content was not under an open redistribution license like the Creative Commons ones).
But in many cases, the duplicator or republisher (and I am here excluding all automatic republishing bots and spam sites that clearly steal content purely for their own economic interest) is not just taking advantage of your content; rather, he is extending and supporting your campaign and ideas, adding extra value by linking back to your site while being very transparent about the authorship and origin of the content he has actually used.
In all these cases, you would really not be justified in complaining about or restraining his activity, as he is genuinely extending and positively contributing to your communication efforts.
The fact that the replicating site may come up higher in search results should not hold you back from improving your own content and updating it, while reducing the amount of unnecessary and self-promotional content you may actually be "duplicating" across your own pages.
Let me explain this better.
And Who Should Really Come Up First Inside SERPs?
In the email exchange I was made part of, the original author was complaining because his original content was sometimes overshadowed in search result pages by the higher relevance of the "copied" content republished by another web site owner.
I explored both sites, looked beyond the surface at some of the actual content appearing on each, and what did I find? The original publisher's page indexed by Google was about 60 KB in size, but almost 50% of that was not really content on the topic of the article. It was just a truckload of links pointing to his other articles, taking up more visual space than the actual content of the article itself.
On the other side was the "replicator", who had not only gone out of his way to provide a good intro, plenty of additional links and related resources, and clear credit and links back to the original, but had even brought together multiple pieces from the external author to make the analysis of the issue deeper and more comprehensive. On top of this, the content on the "replicator" site, as indexed by Google, was about three times the size of the original, and if you had a look at it, its relevant content dominated the page, leaving only a marginal part of the available space to navigation, ads and other info.
This is why Google, although it knows well who wrote that content first and where it originated (the duplicating site makes no secret of it by linking back to it and crediting it extensively), prefers to serve the "replicator" page higher up in the search results.
In simple terms, the replicator site is doing a valuable job, from all standpoints, by extending the reach and value of the original content created by the other site while fully crediting and linking back to it.
The originating site laments its inability to maintain higher search engine rankings but lacks the humility to look at the value of the content it is serving, and to improve and update it in ways that make it more valuable than any other resource available online on the topic. The originating site is also at risk of alienating such important "relay" points by complaining about their good efforts and linking favors, and by focusing too much on holding onto unearned visibility while forgetting to improve and serve the users' best interests.
On only one front did the "replicator" fail strongly. By tacitly assuming too much, the "replicator" site gave itself permission to act without ever having the kindness to go out of its way and officially ask the originating site for permission to replicate whatever content it felt relevant to use.
Moral of the story for the duplicating type of site: unless you see clear text that says, without a shadow of a doubt, "Go ahead and copy my content on your site..." (which is not so rare to find - check the bottom of this page for a good example), it is ALWAYS your duty to take the time needed to contact the author of any original content you want to republish and to ask her permission to republish her content on your site. Period.
There are some steps you can take to proactively address duplicate content issues and ensure that visitors see the content you want them to. (Remember again that Google is thinking MORE in terms of duplicate content issues on YOUR OWN site than on other ones, since those others Google can handle quite easily by itself.)
- Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don't follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.
However, if our review indicates that you engaged in deceptive practices and your site has been removed from our search results, review our webmaster guidelines for more information. Once you've made your changes and are confident that your site no longer violates our guidelines, submit your site for reconsideration.
- Don't fret too much about sites that scrape (misappropriate and republish) your content. Though annoying, it's highly unlikely that such sites can negatively impact your site's presence in Google. If you do spot a case that's particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and have us deal with the rogue site.
- Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you'd prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to block the version on their sites with robots.txt.
- Minimize boilerplate repetition: For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.
- Minimize similar content: If you have many pages that are similar, consider expanding each page or consolidating the pages into one. For instance, if you have a travel site with separate pages for two cities, but the same information on both pages, you could either merge the pages into one page about both cities or you could expand each page to contain unique content about each city.
- Consider blocking pages from indexing: Rather than letting Google's algorithms determine the "best" version of a document, you may wish to help guide us to your preferred version. For instance, if you don't want us to index the printer versions of your site's articles, disallow those directories or make use of regular expressions in your robots.txt file.
- Use 301s: If you've restructured your site, use 301 redirects ("RedirectPermanent") to smartly redirect users, Googlebot, and other spiders. (In Apache, you can do this with an .htaccess file; in IIS, you can do this through the administrative console.)
- Be consistent: Try to keep your internal linking consistent. For example, don't link to http://www.example.com/page/ and http://www.example.com/page and http://www.example.com/page/index.htm.
- Use top-level domains: To help us serve the most appropriate version of a document, use top-level domains whenever possible to handle country-specific content. We're more likely to know that www.example.de contains Germany-focused content, for instance, than www.example.com/de or de.example.com.
- Use Webmaster Tools to tell us how you prefer your site to be indexed: You can tell Google your preferred domain (for example, www.example.com or http://example.com).
- Avoid publishing stubs: Users don't like seeing "empty" pages, so avoid placeholders where possible. For example, don't publish pages for which you don't yet have real content. If you do create placeholder pages, use robots.txt to block these from being crawled.
- Understand your content management system: Make sure you're familiar with how content is displayed on your web site. Blogs, forums, and related systems often show the same content in multiple formats. For example, a blog entry may appear on the home page of a blog, in an archive page, and in a page of other entries with the same label.
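As a sketch of the robots.txt blocking Google suggests above for printer-only versions, something like the following could be placed at the root of the site (the directory names are hypothetical examples):

```
# robots.txt - keep duplicate printer-only copies out of the index
# (the paths below are hypothetical examples)
User-agent: *
Disallow: /print/
Disallow: /articles/printable/
```

Googlebot also understands simple wildcard patterns such as `Disallow: /*?print=1`, though wildcards are a Google extension rather than part of the original robots.txt convention.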
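The "Use 301s" advice above can be sketched in an Apache .htaccess file as follows (the paths and domain are placeholders, not from the original article):

```
# .htaccess - permanently redirect (301) a restructured URL to its new home
Redirect permanent /old-section/article.html http://www.example.com/articles/article.html
```

A permanent redirect tells both visitors and Googlebot that the old URL has moved for good, so ranking signals consolidate on the new address instead of being split across two duplicate pages.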
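The "Be consistent" point can also be enforced server-side. For instance, assuming Apache with mod_rewrite available (the domain below is a placeholder), a rewrite rule can collapse the bare host onto a single canonical www hostname, so the same page never exists under two addresses:

```
# .htaccess - force one canonical hostname so every page has a single URL
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```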
(Source: Google's advice on duplicate content)
Unfortunately, Google does not officially answer the hardest, hottest questions on this topic that many web publishers may have. Here are a couple which have clearly gone unanswered on Google's Webmaster Central post about duplicate content:
Kristen Veraldi commented... (on February 5, 2008)
"I couldn't agree with you more - in the world of content it is ALL about adding value in a timely and relevant manner from your own unique and honest perspective.
However, for those that understand this and do regularly follow that mantra to build their foundation, do you believe adding generic third party content to the equation (let's say 25% of the time as an example) could tarnish a site's reputation?
What if that third party content did not originally come from the web (there is no original link, just the same content on hundreds of other websites - who knows who was first) and was being provided solely for widespread distribution and re-use by some sort of industry-specific content generator?
What if those articles did add value for your local sphere (ie. the people that you are not necessarily trying to connect with through search, but those already following you)?
I guess what I'm asking is - if you don't care about optimizing this generic content and the value is more intrinsic, can you offer it and still feel relatively comfortable that it won't hurt your original content in search (since we do of course care about some indexing!)?
This is a very common situation, so I ask for all those out there with templated sites wondering if they are exposed.
If this were a cause for concern, could it be curbed by using robots nofollow tags on those pages containing this content?
I realize these questions are nearly impossible to answer definitively - anything you can add is appreciated. Thanks!"
adwords wrote (on March 2, 2008):
"I have an article directory that is effectively 100% duplicate content (More Than Articles). Not long after it first started I did notice a dip in traffic and found that all my pages were in the supplemental index. I provide the articles formatted in HTML and in plain text, as well as the standard version. So basically every article appears 3 times on the site with minor variations.
I reworked the navigation and the robots.txt to exclude everything but the standard version from indexing. This has led to all the pages going back to the main index and a gradual increase in traffic.
From that experience I have to conclude that duplication within a domain is rather more important than duplication across domains."
So, lacking official answers, here is my definitive advice on duplicate content:
Robin Good's Advice:
For those republishing content from others:
- Ask permission first: always
- Add Value - At minimum: add an introduction and use a different title
- Add Value - At best: extend and provide additional value, by offering related content, images, your own commentary and any other information component which provides greater value to the final reader searching for information on the topic
- Credit always in full: whatever rules or requirements you may find around, the best and most correct way to link back to the original content of an article you have republished in full on your site is to:
a) credit the author's name and link it to his online bio/ profile if available
b) acknowledge the author site, with its name and a link to it
c) reference the original article title, date of first publication and to link back to it
- State the original license when possible: do not make the original author think or assume in any way that you are applying your own liberal, Creative Commons or Public Domain license to his copyrighted content. You have no right to do so, and he has every right to get mad about it.
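As one possible sketch, the three crediting steps above could take the form of a short HTML block at the top or bottom of the republished article (all names, dates and URLs below are placeholders, not real references):

```html
<!-- Credit block: author name + bio link, site name + link,
     original title + date + link back to the original article -->
<p>
  Originally written by <a href="http://www.example.com/about">Author Name</a>
  of <a href="http://www.example.com/">Original Site Name</a>, first published
  on January 1 2008 as
  <a href="http://www.example.com/original-article/">"Original Article Title"</a>.
</p>
```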
... and for those having their content republished by someone else:
- Let go. Holding content just on your site is generally not to your best advantage. If there are other sites that honestly want to extend the reach of your content while providing additional value, allow them to do so.
- Acknowledge the value and contribution that these other sites are making in extending the visibility and reach of your content, ideas and authorship to a greater audience, especially when such sites do so in a fair and transparent way by crediting, acknowledging and referencing back your original content.
- Do not get automatically pissed off if a site using some of your content ranks higher in Google search results. Don't blame them. Look at your own site and at how you can improve the value you are providing to your readers, so that you have no competition on that front with other sites. If Google places another site above yours, it is almost never a mistake.
- If anything, make your content more easily shareable and republishable on other sites, via RSS feeds, widgets, and open licenses that clearly state you are actually in favor of letting your content go as far and wide as it wants to, as long as proper credit and link back is provided.
If content is republished with permission, while adding significant value to it, changing its title, adding an introduction and crediting / linking back to its original source, I think both the original author and the republishing site will greatly benefit. More people will read the content, more people will get to know the original site and author, and the message will be picked up by a much larger group of people, thanks to the additional free distribution that search engines will provide to copies offering extra or complementary value to the original.
Originally written by Robin Good for Master New Media and first published on March 18 2008 as "Duplicate Content Online: Issue, Problems and Good Things".