Wednesday, June 1, 2011

The Google Panda Guide - Part 3: What To Look For, What To Clean

What Key Steps I Took To Recover From Panda

These are the first key steps I took after my team and I at MasterNewMedia realized that we had been hit by Google Panda.

1) Analysis of reports

This involved digging through our analytics data and looking for specific changes to traffic, referrers and rankings of our key, top performing content.

We decided to gather all the available data and to identify which articles on MasterNewMedia had been hit worst. We wanted to see whether any common patterns, similarities or unique traits set these articles apart from all the others.

The first step in this direction was to get a clear overview of which pages lost traffic versus those that remained at the same level of traffic (or even gained). We took the data from Google Analytics and wrote a script that compared the 250 pages with the highest traffic before the Panda update against the same data after the update.

Then we organized and divided these articles into four distinct groups.

Articles that had lost traffic.

Articles that had gained traffic.

Articles that had lost so much traffic due to Panda that they were no longer in the top 250 pages by traffic.

Articles that were not in the top 250 most visited pages before Panda, and which became top performers after it.
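The grouping described above can be sketched roughly as follows. This is not our original script, only a minimal illustration under some assumptions: the traffic data has already been exported into two dictionaries mapping each URL to its pageviews for comparable periods before and after Panda, and the function and group names are hypothetical. Pages whose traffic stayed exactly the same are counted together with the gainers.

```python
def group_pages(before, after, top_n=250):
    """Split pages into the four groups described in the article.

    before, after: dicts mapping URL -> pageviews for comparable periods
    (an assumed input format, e.g. built from an analytics export).
    """
    def top(traffic):
        ranked = sorted(traffic, key=traffic.get, reverse=True)
        return set(ranked[:top_n])

    top_before, top_after = top(before), top(after)
    groups = {"lost": [], "gained_or_steady": [], "dropped_out": [], "newcomers": []}

    for url in sorted(top_before | top_after):
        if url in top_before and url in top_after:
            # Still a top page: did its traffic go down or not?
            key = "lost" if after[url] < before[url] else "gained_or_steady"
        elif url in top_before:
            # Was a top page before Panda, is not one anymore.
            key = "dropped_out"
        else:
            # Became a top page only after Panda.
            key = "newcomers"
        groups[key].append(url)
    return groups
```

Each resulting list can then be reviewed by hand to look for shared traits.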

At this point we started analyzing common patterns among these groups, to identify possible weak areas that needed fixing or improving.

One key takeaway emerging from our analysis is this:

Panda doesn't necessarily see a site as "demoted" or "penalized" just because some page lost rank. We see that on the same website some pages lost their positions in the search rankings while others gained several places.

Perhaps surprisingly, the same goes for individual URLs. Panda seems to concentrate on the results only.

This means that the same page can go down in position for some keywords while remaining steady for others.

One of our most popular pages, the RSSTop55, is a clear example of this. Below are some of the major keyphrases that were, and still are, used to find this MasterNewMedia guide through a Google search. For each one you can see the ranking of the RSSTop55 "before" and "after" the Panda update.

URL: http://www.masternewmedia.org/rss/top55/

Keyphrases and rankings:
best rss: before Panda 6 - after 21
best rss feeds: before Panda 7 - after 53
rss directory: before Panda 8 - after 35
rss directories: before Panda 1 - after 6
rss directory list: before Panda 3 - after 3

2) Team Review and Brainstorming

It is better to look at data and problems from as many different angles as possible, and to analyze as many alternative solutions as one can find. This is why I decided that the second most important thing to do was to bring together my team and to look openly at the situation while suggesting ideas to fix the problems we had identified.

We sat down virtually over a Skype multi-party call and went through 30 to 60 minutes of daily analysis and brainstorming of what we had found and what we could attack. We did this every single morning for two full months until we realized we had nothing more to fix, adjust or improve.

3) Plan of Action

On the basis of our analysis, review of issues and brainstorming of possible solutions, I then developed a plan of action outlining which key areas needed further investigation, and which ones we should start fixing right away.

This was done on a daily basis at both the micro and macro levels. We therefore had a large map of the issues we had identified as possible causes or as problematic areas needing improvement, and from it we would draw daily items for review and brainstorming and a specific plan for each work day on the Panda front.

Key Areas Analyzed

a) Thin and Zero Content

Thin content is content that is either too short or too light relative to the amount of other elements on a web page. Thin content is also anything that provides little or no immediate value to readers: content that is generated via automated procedures or by automatically aggregating headlines without providing any additional value.

By looking at our own content on MasterNewMedia and digging through areas and pages we normally never visit, we did find a lot of suspect and potentially damaging "thin" content.

This was made up, for the most part, of:

Thin Content Pages

Very old pages which for one reason or another had lost some of their content (templates were changed without checking the consequences) or were in some cases completely empty.

Actions taken: Added metatag NOINDEX - NOFOLLOW

News Pages

Old content and news pages, sometimes dating back 6 years or more, with very short news items, at times made up only of one or two paragraphs.

Actions taken: Deleted - sent to 404

Tag pages

Tag pages generated by the CMS (content management system) or by some dedicated plugin. Generally, a tag page is generated for each new tag you use inside your content articles and if you have been doing this for years, while forgetting about the tag pages, you may indeed have published a few hundred tag pages which, in most cases, provide very limited value and usefulness to your readers.

Actions taken: Deleted - sent to 404

Introductory articles

Introductory articles were short articles I published for a relatively short period of time (a few years back) which would act as openers for long content features. They provided only a brief introduction to the article, and sometimes they also used the same opening content and image as the full feature.

Actions taken: Deleted - sent to 404

High-traffic content irrelevant to main site theme

Unthematic content is content that is not about your key topics or main areas of interest but which nonetheless receives tons of traffic. Being an experimenter and a curious person as a publisher, I have covered topics which may have appeared to Google as not relevant to my mainstream audience. In particular, I had covered the 2004 tsunami in South East Asia with a curated report that integrated all of the existing video clips I was able to gather online. As a consequence, MasterNewMedia was still receiving, up to today, thousands of visitors looking for information and video clips on that very topic. So many, in fact, that for extended periods of time, until recently, this set of pages was the most visited section of the site.

Actions taken: Moved to a separate domain and 301'd

Excessive pagination on multi-page articles

Excessive pagination in long articles, when these are split across multiple pages, may cause a bad user experience and be a direct cause of thin content.

Actions taken: Significantly reduced the pagination of multi-page guides, by making each individual page hold more text / reviews and making sure that each page had a sufficient quantity of quality content and an appropriate content-to-ad ratio. (I would consider 3:1 or higher to be a good ratio.)

Other possible thin pages

During our long research we found lots of thin or bad content that we were not aware of having. Finding these "lost", useless or otherwise "bad" pages is a challenge in itself, as you may not know exactly where to look.

Actions taken: One very effective and simple way to find potential thin content hiding on your site is to run the following search on Google: -asdfsda site:yoursitedomain.com. Go through those results at random and look at what you may discover. Then go to the last page of results and see whether Google says there:
"In order to show you the most relevant results, we have omitted some entries very similar to the 834 already displayed. If you like, you can repeat the search with the omitted results included."
If it does, click the "omitted search results" link and you can be sure that in there you will find some interesting suspects for both "thin" and "duplicate" content. Give it a try now.


b) Duplicate Content

Archive pages

Archive pages can easily become a trigger for duplicate content, as they can look to Google and other search engines like a multitude of pages with the same title. Further, if, as in my case, they consist of hundreds of pages dating back to 2001 that chronologically list each and every piece of content published, they may also end up sharing the characteristics of "thin" pages: shallow, auto-generated content with little or no added value and, in some cases, too little content.

Actions taken: Deleted - sent to 404

Article series

Article series are, at least in my case at MasterNewMedia, articles, review series or other types of content which I published weekly for a relatively long period of time (1-2 years), and which carried for the most part the same title. Good examples are the Sharewood Picnics of a few years back, a series of short reviews of new, just-released tools that ran every Sunday. For the most part, these weekly articles all shared the same initial part of the title, with the only difference coming in its last word (example: New Media Picks Of The Week: Sharewood Picnic 36).

Another possible trigger for duplicate content could have been our weekly series devoted to Media Literacy and authored by George Siemens, who gave us permission to aggregate his daily posts into one main weekly digest on MasterNewMedia. Obviously each such digest clearly credited George Siemens and linked back to his blog. In this case too, the first part of the title was always identical (example: Media Literacy: Making Sense Of New Technologies And Media by George Siemens - Aug 15 09), with the only changing element being the date at the end of it.

Actions taken: Added metatag NOINDEX - NOFOLLOW

Scraped content

The post-Panda period has been characterized by many automated web sites republishing our content and appearing before us in the Google search results. Even if we deserved to be hit by Panda, I don't understand why my content, replicated on other sites with no credit or link back, should rank before the original.

Actions taken: As it is practically impossible to deal effectively with the spammer sites that scrape our RSS feeds to republish our content, and it is not feasible to file DMCA takedown requests for each one (given also that Google often cannot act rapidly and effectively on these), we have opted for the easiest cure: delaying RSS feed publication. While I regret doing this as a publisher, it ideally gives us a bit more time to let Google know when we publish something new.

P.S.: Verify with Copyscape if past top performers being hard hit are now being replaced inside SERPs by copycats and scrapers - check and report

Related articles

Related articles are the articles I suggest reading at the end of a feature. At MasterNewMedia I have been displaying six related articles for each article for the longest time. For each related article I would provide a linked title, an image thumbnail and a text excerpt of the introduction.

Actions taken: Reduced number of related articles, dropped excerpt for each one

Republished content

Over time I have gone out many times to ask people I knew, liked or had simply just discovered for permission to curate, illustrate and republish their original content on MasterNewMedia with full credit and links back. The editorial goal was a very specific one: identify great pieces of content which few people knew or had read about, curate them deeply by formatting and illustrating them and providing additional references and links, and give them the stage to reach a wider audience in multiple languages (MasterNewMedia content is published in four different languages).

Actions taken: None. I believe this content has the right to live as is, since it was published with explicit permission from the original author, and it fully credits and links back to the original. MasterNewMedia's role as a scout and curator for such content should not, in my view, be something to be ashamed of in any way. MasterNewMedia does add significant value to this content by reformatting, illustrating and refining links and references, as can be seen by comparing the original version of any such content with our republished version.

System templates

I am still not very clear about the technicalities of this, but of one thing I am sure. In the search for thin and duplicate content I ran into some pages that I had never seen before on my site. As a matter of fact, I am not too sure other humans have been able to see them either. But Google has. And thanks to the Google search index I have been able to spot them, look at them and realize that, if Google sees those pages and they are not good or useful, then it may be a good idea to get rid of them.

The pages, at least in my case, are pages of actual content articles I have on my site, but they exist at a different URL than the "official" page, and are also "dressed" by a default and crude system template, originated by my CMS.

Actions taken: Identified and deleted (404) all of these pages so that they cannot be found anymore.

c) Content-to-Ads Ratio

Another possible Panda trigger we identified was the ratio of content to ads.

Ads display based on article length

Although MasterNewMedia's editorial approach is characterized by long, in-depth content that is highly curated, illustrated and referenced, and which provides by default a very good content-to-ads ratio, we decided to investigate this further and to establish some specific rules for how many ads (specifically Google ads) we would run on a page depending on the quantity of content that the page contained.

Actions taken: Since articles can range in size from short newsflashes of under 400 words to in-depth reports or mini-guides of over 2000 words, we created separate templates for short and long articles, drawing a decisive borderline at 805 words as the minimum content threshold for running any ads.

Number of ads displayed based on article length

Once this was implemented, we further refined the approach by defining exactly how many ad "strips" to run, and in which positions, depending on article length. This allowed us to make sure that ads would not run on short content articles and that longer content would always have a number of ads proportional to the size of the article.
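As a rough illustration of such a rule, here is a minimal sketch. Only the 805-word threshold comes from our actual setup as described above; the number of words per additional strip, the maximum number of strips and the function name are hypothetical values chosen just to show the idea of scaling ads with article length.

```python
MIN_WORDS_FOR_ADS = 805  # minimum content threshold quoted in the article

# Hypothetical tuning values, NOT taken from the article.
WORDS_PER_EXTRA_STRIP = 800
MAX_STRIPS = 3

def ad_strips_for(article_text):
    """Return how many ad strips a page may show, based on its word count."""
    words = len(article_text.split())
    if words < MIN_WORDS_FOR_ADS:
        return 0  # short articles run no ads at all
    extra = (words - MIN_WORDS_FOR_ADS) // WORDS_PER_EXTRA_STRIP
    return min(MAX_STRIPS, 1 + extra)
```

A template could call a function like this at publishing time and render only the allowed number of ad positions.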

Google Ads under the H1 title

The second consideration we made was the presence, even in articles with a very good content-to-ads ratio, of a very visible set of ads, right under the title and before the actual content. On smaller monitors and lower resolutions these Google ads make up most of the visible space above-the-fold, and may not be providing, even when relevant, the best user-experience possible.

Actions taken: Drastically reduced the number of ads under the title of long articles. However, since these specific ads still produce a significant part of our advertising revenue, after some time we had to discontinue this fix: in the absence of any Panda updates or changes, it was causing us only a disastrous loss in revenue with no tangible benefit on any other front for the time being.

So this may actually be a possible relevant fix, but it is economically unsustainable for us at the moment.

d) Other, Unlikely Triggers We Looked Into

Link integrity

Our next concern was the quality of links. This meant a multi-pronged approach:

Inventory of all broken links on the domain
Though this should be a standard, periodic quality control routine, I have repeatedly failed to reserve enough time to keep our outbound link inventory in a clean and immaculate state. But given the gravity of the situation, I took the opportunity to go back to this front and to do as much clean-up as possible.

Actions taken: Ran Link Sleuth (PC) and Integrity (Mac) on the whole site to identify all of the existing broken links.

Identification of links ending up in 404s
The next step after the general inventory of broken links was to identify which of these broken links were caused by a website or a service that doesn't exist anymore.

Actions taken: Went through each and every broken link manually and soon realized how time-consuming this was. To speed up the process, we created a script that automatically checked for us whether the 404 was due to a site or service that didn't exist anymore, or to an individual page or URL having changed.
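One way such a check could work is sketched below. This is an assumption about the logic, not our actual script: if the broken URL fails but the home page of the same site still answers, the individual page has probably moved; if the home page fails too, the whole site or service is probably gone. The function names are hypothetical, and some servers reject HEAD requests, so a real tool would need a GET fallback.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit

def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the host is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout...

def classify_broken_link(url, fetch=fetch_status):
    """Classify a link as 'ok', 'page_moved' or 'site_gone'."""
    status = fetch(url)
    if status is not None and status < 400:
        return "ok"
    parts = urlsplit(url)
    root_status = fetch(f"{parts.scheme}://{parts.netloc}/")
    if root_status is None or root_status >= 400:
        return "site_gone"
    return "page_moved"
```

The `fetch` parameter exists only so that the classification logic can be tried out without touching the network.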

Identification of links leading to bad neighborhoods

On this front, too, things were not as easy as we thought. Our current commenting system runs on the Disqus platform, a third-party service that basically hosts the content of our commenters on its servers, but allows us to publish those comments under their related article via a JavaScript-based embed. Should bad links going out from those comments be a concern for us? We were not sure.

However, on thousands of older articles on MasterNewMedia we have MovableType comments, which are actually in the HTML of our pages. This means that any link a commenter may have added to our site could be a spammy link leading into some bad web neighborhood.

Actions taken: We prepared a script that did most of the work. It went into all the comments and isolated all the links placed in them, in order to facilitate a manual review by a human editor. Though this may seem redundant, it was necessary, as some of the links we found looked perfectly innocent even to the human eye of the editor. The anchor text and even the URL of the link were often article-specific. But when you opened the link, you were in for a surprise: the link did not go where you expected it to, and led you straight into a bad neighborhood. Still, these were not very significant in number and were cleaned out relatively fast.
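The link-isolation step can be sketched with nothing more than the Python standard library. This is only an illustration, not our script: it assumes the HTML of each comment has already been pulled out of the page, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from a fragment of HTML."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchor_text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def links_in_comment(comment_html):
    """Return every link found in one comment, ready for manual review."""
    collector = LinkCollector()
    collector.feed(comment_html)
    collector.close()
    return collector.links
```

Listing the anchor text next to the real destination makes it easier for the editor to spot links that do not go where they claim to.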

We also removed all unnecessary links from the header and footer.

File size to article content ratio

To ensure a better file size to article content ratio, we decided to optimize our page code so as to achieve leaner and faster-loading pages. Though we had already spent significant time in 2010 optimizing and improving the overall speed of the site, we wanted to check further whether we had parts of our page code or HTML elements that were no longer needed and could be removed.

Actions taken: We found and removed useless, old or redundant elements, both in the header/footer and in the side columns, that were not critical for us to keep.

Generally tried to minimize the amount of HTML code used to form the page.

Text quality

Inspired by a Matt Cutts video talking about the quality of text, we also spent some time checking and measuring the quality of our writing using the Lexical Density and Gunning Fog indexes.

Actions taken: Unfortunately we did not identify any relevant patterns on this front. One thing we did notice during our research, though, is that the Gunning Fog Index was almost always around 12 for our top performing content.
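For readers who want to run the same check, the Gunning Fog Index is 0.4 times the sum of the average sentence length and the percentage of "complex" words (three or more syllables). The sketch below is a rough approximation, not the tool we used: it counts syllables by counting runs of vowels, which overcounts some words, and it does not exclude proper nouns or common suffixes as the full definition does.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels (at least one)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def gunning_fog(text):
    """Approximate Gunning Fog Index of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    average_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (average_sentence_length + percent_complex)
```

Because of the crude syllable count, scores from this sketch are best compared against each other across your own pages rather than against published reference values.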

Relative links

While fixing some of the other items, we realized, thanks to user feedback, that we had a significant number of links that were "relative" instead of being "absolute" links.

Actions taken: Changed all relative to absolute links.
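A conversion like this can be sketched with `urljoin` from the Python standard library, which resolves a relative link against the URL of the page it appears on. This is an illustration, not the procedure we used: the regular expression only handles double-quoted `href` attributes, and the function name and the sample file names are hypothetical.

```python
import re
from urllib.parse import urljoin

# Matches href="..." with double quotes only (a deliberate simplification).
HREF_PATTERN = re.compile(r'(href=")([^"]*)(")', re.IGNORECASE)

def absolutize_links(html, page_url):
    """Rewrite every double-quoted href in html as an absolute URL."""
    def replace(match):
        absolute = urljoin(page_url, match.group(2))
        return match.group(1) + absolute + match.group(3)
    return HREF_PATTERN.sub(replace, html)
```

Links that are already absolute pass through `urljoin` unchanged, so the function can be run safely over pages that mix both kinds.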

Dashes in URLs

Since some blogs reviewing Panda speculated that more than ten dashes in a URL might look bad in Google's eyes (be careful with such claims, because we wasted lots of time listening to and testing each one), we wanted to check whether our having published many articles with long URLs could have been an issue in itself.

Actions taken: We researched and measured whether "dashes in URL" were a credible claim or just vaporware. And vaporware it was.

We found that some of our pages that had enjoyed a rise in traffic after Panda did have a lot of dashes in the URLs. Some of them up to 14. Here's a couple of examples:

a) /information_access/p2p-peer-to-peer-economy/peer--to-peer-governance-production-property-part-2-Michel-Bauwens-20071020.htm had a panda rise of 7%

b) /information_access/p2p-peer-to-peer-economy/peer--to-peer-governance-production-property-part-1-Michel-Bauwens-20071020.htm had a panda rise of 21%

N.B.: We have been using dashes as word separators for quite a long time now and we found that many URLs that gained traffic after Panda update contain dashes. The above two are only the two most extreme examples.
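A check of this kind takes only a few lines. The sketch below is an illustration rather than our script: it assumes a dictionary mapping each URL path to its percentage traffic change after Panda, and the function name and the ten-dash threshold parameter are hypothetical.

```python
from statistics import mean

def average_change_by_dashes(changes, threshold=10):
    """Compare average traffic change for URLs above and below a dash count.

    changes: dict mapping URL path -> percent traffic change after Panda
    (an assumed input format).
    """
    many = [pct for path, pct in changes.items() if path.count("-") > threshold]
    few = [pct for path, pct in changes.items() if path.count("-") <= threshold]
    return {
        "many_dashes": mean(many) if many else None,
        "few_dashes": mean(few) if few else None,
    }
```

If the claim about dashes were true, the "many_dashes" average should be clearly worse than the "few_dashes" one.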

Lists (LI tags in text)

Some speculative threads on articles talking about Panda suggested that pages whose text contains LI tags (numbered or un-numbered lists) could be ranking better after Panda than those without them.

Actions taken: Checked our data before and after Panda, and the results showed that we found pages with lists among both traffic winners and traffic losers, so there is no conclusive evidence to either confirm or deny the speculation on the positive effects of the LI tags.

URLs without extensions

A number of websites (especially those based on WordPress) have URLs without extensions (.htm, .html etc.). MasterNewMedia has been using this type of URL since 2008, and we wanted to check whether there was any evidence of this URL variable having some correlation with Panda-hit articles.

Actions taken: We went out and compared the top 250 most visited articles on our site and discovered that 107 out of 250 of those articles DO have URLs without extensions - both before and after Panda. So also here, there is nothing tangible, or immediately connected to Panda.
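Counting extensionless URLs in a list of top pages can be sketched as follows. This is an illustration with hypothetical function names, assuming the list of URLs has been exported from the analytics tool; it looks only at the path, so dots in the query string or in a directory name are not mistaken for an extension.

```python
import posixpath
from urllib.parse import urlsplit

def has_extension(url):
    """True if the last path segment of url ends in something like .htm."""
    path = urlsplit(url).path
    return posixpath.splitext(path)[1] != ""

def count_extensionless(urls):
    """Count how many URLs in the list have no file extension."""
    return sum(1 for url in urls if not has_extension(url))
```

Running it on the top-250 lists from before and after Panda gives the two numbers to compare.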

Frames and iFrames

Other speculative reports pointed to the use of frames inside web pages as a possible other Panda trigger.

Actions taken: Since we don't use frames on the site we are not in the best position to investigate this, but we have a few pages that use iframes and some of them have been negatively impacted while others gained traffic. In short - no conclusive proof for negative impact of frames has been found on MasterNewMedia site.

Conclusions

The amount of time we have invested in Panda analysis and fixes has been at least 60 man-days so far.

Unfortunately I cannot bring at this point any evidence that some of these actions we have taken are really beneficial to recover from a Panda "penalization".

It may well be that we could have kept most of the content we have removed, deleted or noindexed, and that the fear generated by this sudden loss of "revenue" has made us act in ways that I would otherwise consider to be very aggressive.

Only time and the next iteration of Panda will tell.

As time passed by, and we went through all of these changes and refinements, I have become more convinced that it is none of these individual changes that can save your ass from the Panda filter.

It is rather what you actually do with your pages to make them useful, attractive and liked by your readers that counts most, and that is what needs to be worked on.

On this front, I have been producing in-depth, non-thin, quality research, reports and guides, just like Google now says it wants, in large quantity.

But evidently, that is not enough.

Increasing time on page and decreasing the bounce rates associated with short time on a page are the kind of things where I still have a margin of improvement to work on. But this entails giving up almost entirely the key revenue channel that has kept a small independent magazine like MasterNewMedia alive through all these years. And, as I have written earlier, I am not yet ready to give up all that money in one go.

So, how do I - or you - get out of this vicious circle?

You want to satisfy Panda, but you depend on the search traffic and Ads revenue that Google brings to you. And if you make the Panda happy, by supposedly turning off your key ads, your ad revenue goes suddenly away.

The only solution, as far as I can see, is not to depend on Google and to use your web site to grow a following of truly passionate fans, and to offer them such high-quality content or services that they will want to pay for advanced features, options or intangibles.

That is the way out of Google-dependency as, for now at least, one cannot rely just on producing high-quality content the way I have done, because, unless the Panda did make some mistake, it is clearly not good enough in the eyes of its new ranking criteria.

Given the amount of scraped and republished content that now ranks before MasterNewMedia for specific keywords that belong to our content, there is indeed some hope that Panda may need a bit more time to re-establish proper order inside the Google SERPs, but given that this is not guaranteed, nor do I know when it will happen, I think it is best to start working on other fronts.

In the next article, I will look at how I envision the ideal, truly impartial and ungameable search engine of the future. I don't know if Google will be the one to take this road, but I do think someone, eager to create the first fully transparent search engine, soon will.

If you have missed them, here are the other parts of this guide:

The Google Panda Guide - Part 1: What It Is, How It Works, Collateral Damage

The Google Panda Guide - Part 2: Machine Learning And The New Mindset

The Google Panda Guide - Part 4: The Future I Would Like To See

The Google Panda Guide - Part 5: The AdSense Dilemma

Originally written by Robin Good for MasterNewMedia and first published on May 31st 2011 as "The Google Panda Guide - Part 3: What To Look For, What To Clean".

Photo credits:

Thin and Zero Content - superdumb
Duplicate Content - Norebbo
Content-to-Ads Ratio - robynmac


posted by Robin Good on Wednesday, June 1 2011, updated on Tuesday, May 5 2015
