MasterNewMedia
Curated by: Luigi Canali De Rossi
 


Wednesday, July 5, 2006

Server Slowdown Problems: Possible Causes Lead To Comments, RSS, Cron, and MSNbot

Server slowdown problems? If you, like me, are a small, independent online publisher, losing consistent traffic while knowing that the cause is not your readers taking off to other better destinations can be as frustrating as trying to hold water with your hands.

If your real-time traffic monitoring service shows you sudden dips and visitor numbers far below average at specific times of the day, you know that something on your server is not going right.

spider_web_id181242_size450.jpg
Photo credit: Linda Bucklin

Finding out what is really happening takes the skill and patience of a capable webmaster who has enough perseverance and insight to uncover the hidden spider's webs on your web site. Finding the culprit is not always an easy matter, and in some cases, especially where multiple factors are at play together, finding a way out may take up to a few months.

That is what happened to me in the last six months.

Whether it shows how dumb we have been or how many different causes can be behind your own misfortune, I take no shame in sharing my story, as I hope it can help other online publishers like me avoid wasting so much time, resources and money.

I started noticing sudden dips in the traffic to my main site www.masternewmedia.org several months ago, and while the overall monthly traffic kept growing as in the past, I could not help wondering what those sudden and relatively brief traffic dips were to be attributed to.

Hitbox_traffic_dip_intermittent.gif

Together with my webmaster Drazen D., we initially focused on Movable Type, the personal publishing platform that we use to maintain all of our web sites. We thought the rebuilding process was the culprit, and so we concentrated on rebuilding the architecture of all the site page templates so as to significantly slim down the load placed on the server each time it needed to rebuild one or more pages.

The issue in this case was quite real and, as we later discovered, fixing it contributed significantly to the resolution of our server slowdown issues. On masternewmedia.org we have many "content category" and "tag" pages, which collect all of the articles from the whole history of our site under different and sometimes multiple categories. So when a new article is published, the process that the publishing system runs to update all of these archive and tag/category pages can easily become quite demanding.

But having fixed this, we saw that the server problems would still return, and with undiminished force.

Hitbox_traffic_dip.gif
Sudden dip in traffic - these are symptoms of abnormal behaviour - if you see them happening frequently, there is something wrong going on on the server


So while I would absolutely recommend that anyone in my position look seriously into a page template architecture that is highly modular and well thought out (mine was the result of a long patch-up process which created more damage than benefit), the traffic on the site, page access and the consequent ability of users to have a positive experience on the site were still compromised. I could tell this because, in parallel with the dips in traffic at specific times of the day (hard to see unless you monitor several times per hour), I could also see dips in AdSense revenue, that is, in people's ability to read and click on relevant commercial information: our main source of income.

Next up were "comments and trackbacks".

Due to the heavy exploits that mischievous marketers utilize today, comment and trackback facilities of any blog or news site that is based on a "personal publishing - blog-engine" have become a honey-filled receptacle of the worst spam comments and trackbacks you could dream of.

But the worst thing of all is that these comments and trackbacks are often driven by automatic software bots, which may start spitting out rivers of links to illegal sites at a rate of hundreds per hour.

Finding ways to effectively manage spam comment filtering facilities to stop, discourage or even prevent altogether this tsunami of pure junk is at times too difficult for the people involved, and the solution that many have taken has been simply to shut down those facilities, making it impossible for normal users to place comments or send trackback pings to their posts, and therefore inhibiting one of the fundamental traits of the new media revolution happening online: the conversational aspect.

For us this was an extended pain in the neck, until some time ago we finally tamed it, or at least brought it down to a level that limits the negative effects on our server resources. We did this by doing exactly what I wrote above: inhibiting, frustrating and angering hundreds of innocent individuals who were eager to post a comment on MasterNewMedia.org, by making posting a comment impossible, broken, useless, or whatever else we were testing at the time to stop that junk wave. The stupid thing on our part was not informing users of this, and therefore alienating many of them, who wrongly thought that I was either sloppy or uninterested in having a true, open dialogue with my readers.

But even now that people can hardly comment on this site, and that many of these comment spam engines have started to see that their efforts with my site are pretty useless, the server "dips" have continued to show up.

Here is what the traffic on a healthy site looks like when the server is doing its job right, unencumbered by other factors:

Hitbox_traffic_normal_upward_trend.gif

I am not saying that correcting and improving templates and comment/trackback spam didn't provide us with improvements and benefits. It did and positively so. But I could still see that at specific times, and sometimes for a number of hours, our dedicated server (hosted at Pair.com in Pittsburgh) was heavily handicapped by something we could not yet identify.

Next up on our list were RSS feeds and the delays caused when we needed to fetch content from an external server that was in turn being overburdened by some other problem. As I make extensive use of RSS feeds and bring in a lot of content from the other sites I manage and edit, we had to find some solutions to lighten all of the RSS aggregating, parsing and output that was at times slowing our server to a crawl.

Using an external service provider (or a second dedicated server with custom newsmastering software like MySyndicaat and Newsgator provide) to manage large loads of incoming RSS content is the best way to go, and we did see some very notable improvements when we abandoned the memorable Carp engine that was running on our server in favour of a dedicated external high-performance server farm provided by MySyndicaat.

RSS caching is also an important step to take, which further helps reduce the overall bandwidth consumption and server load. With RSS caching you basically store the contents of all your RSS feeds for some pre-defined time, instead of actually going out to each RSS feed every time a visitor calls up one of your pages that integrates content coming from an RSS feed. The content of that feed is refreshed, say, once an hour, allowing the page to load much faster and your bandwidth to be used more efficiently.

But RSS didn't do it either.
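To make the caching idea concrete, here is a minimal sketch in Python. It is not the software we ran: the class name `FeedCache`, the `fetch_stub` function and the example URL are all hypothetical, and the one-hour lifetime simply mirrors the "say, once an hour" figure above. The cache keeps each feed's content in memory and only goes back to the source when the stored copy is older than the chosen lifetime.

```python
import time
from typing import Callable, Dict, Tuple


class FeedCache:
    """Keep each feed's content in memory for a fixed time-to-live (TTL)."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # function that really downloads a feed
        self._ttl = ttl_seconds    # how long a stored copy stays valid
        self._clock = clock        # injectable clock, handy for testing
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        """Return the cached copy if it is still fresh, else fetch and store it."""
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        content = self._fetch(url)
        self._entries[url] = (now, content)
        return content


def fetch_stub(url: str) -> str:
    """Hypothetical stand-in for a real HTTP download of the feed."""
    return "<rss><!-- content of %s --></rss>" % url


if __name__ == "__main__":
    cache = FeedCache(fetch_stub, ttl_seconds=3600)
    first = cache.get("http://www.example.com/feed.xml")   # calls fetch_stub
    second = cache.get("http://www.example.com/feed.xml")  # served from memory
    print(first == second)
```

With a cache like this, a page that is viewed a thousand times in an hour triggers one outbound feed request instead of a thousand.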

Hitbox_traffic_dip_long2.gif

We then started monitoring the server more closely and attempted to identify the exact scripts, cron jobs and tasks that were happening when the traffic statistics showed a "dip".

Drazen found cron jobs that had no need to be kept alive, and unknown scripts either set up by somebody in the past or formally placed by the internet provider to monitor and attend to specific technical tasks. We suspended, deleted, suppressed and removed every single one of these until there were no more.

But cron jobs and other server scripts didn't do it either.

We didn't want to give up but we realized that there was something baffling us all along. It felt like we were not looking in the right places.

And in fact, the problem with our server troubleshooting approach was that we were looking in the wrong direction throughout.

The problem was not on the server!

The problem was with one user!

One user?

Yeah, one user. Here is its picture:

robot_spider_id91259_size400.jpg
My imaginary idea of MSNbot, the site crawler/spider which can bring your server to a crawl unless you manage your robots.txt file properly - Photo credit: Michael Osterrieder

The MSNbot is a web spider which crawls web sites with the same purpose as the Google or Yahoo crawlers: indexing your web site content to bring it back "home" to the MSN Search Engine and other related services.

But the MSNbot is apparently less polite, courteous and discreet than any other major search engine spider out there. From my own limited research on this, the MSNbot is especially vulnerable to situations like mine, in which a site with a few thousand pages publishes lists of RSS feeds of related article categories, sites and more, while the MSNbot tries to make sense of all these RSS links, feeds and actual content multiplying under its nose at exponential speed.

"We recommend keeping a close watch on the MSNBOT as it does not, to the best of our knowledge as of July 2004, keep track of how many simultaneous requests it makes of a server. This can result in activity that resembles a denial of service attack."

(Source: Spidertrack - Sept. 20th 2004)

and

"Here is an incomplete list of the sort of things MSNBot routinely does on this site:

* repeatedly fetches large binary files, including 500 megabyte ISO images, that are properly served as binary files and have not changed in some time; 21 fetches for 4 files accounting for 3.7 gigabytes of transfers this week. (See MSNbotBinariesProblem)

* aggressively fetching syndication feeds, many of them unchanging; 1,615 fetches of 329 feeds amounting to 45 megabytes of transfers this week. Half of the top 10 requested feeds have not changed within the past week, yet were requested 12 times or more. (See MSNbotCrazyRSSBehavior)

* never uses conditional GET, even when aggressively fetching syndication feeds. (See AtomReadersAndCondGet)

* aggressively recrawls unchanging content and error pages, while neglecting changed content, although this is better than it used to be. (See CrazyMSNCrawler)

All of these behaviors are undesirable. Most of them are aggressively antisocial..."

(Source: Banning MSNBot: an open letter to MSN Search - Chris Siebenmann - Nov. 14th 2005)

If you do some digging yourself on Google, you will see that this is not a new story, with some people reporting MSNbot behaving similarly to a DDOS attack back in 2004. More recent comments from webmasters and publishers having had the same problem abound (if you are good at finding them) and they continue to this day.
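If you want to check whether a single crawler is behind your own dips, one way is to count requests per user agent and per hour in the web server access log. The sketch below is only an illustration and not the procedure we followed: it assumes an Apache-style "combined" log format, and the function names and sample log lines are made up for the example.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed layout: Apache "combined" log format, e.g.
# host ident user [day:hour:min:sec zone] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):(?P<hour>\d{2}):[^\]]+\] '
    r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def requests_per_agent_and_hour(lines: Iterable[str]) -> Counter:
    """Count requests keyed by (day, hour, user agent); skip unparsable lines."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        counts[(match.group("day"), match.group("hour"), match.group("agent"))] += 1
    return counts


def busiest(counts: Counter, top: int = 5) -> List[Tuple[Tuple[str, str, str], int]]:
    """Return the (day, hour, agent) combinations with the most requests."""
    return counts.most_common(top)


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        '192.0.2.10 - - [05/Jul/2006:14:03:27 -0400] "GET /index.xml HTTP/1.1" '
        '200 5120 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
        '192.0.2.10 - - [05/Jul/2006:14:03:28 -0400] "GET /tags.xml HTTP/1.1" '
        '200 4096 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
        '192.0.2.77 - - [05/Jul/2006:14:05:00 -0400] "GET / HTTP/1.1" '
        '200 20480 "-" "Mozilla/5.0"',
    ]
    for key, hits in busiest(requests_per_agent_and_hour(sample)):
        print(hits, key)
```

If one user agent dominates the hours in which your traffic chart dips, you have a strong suspect.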

So what is the way out of the MSNbot problem?

Using and managing your robots.txt file intelligently. This text file, which is placed on your server, tells search engine spiders like the MSNbot which rules they are expected to abide by when spidering your site. You can in fact ban specific spiders, limit their access and control which specific set of pages you want them to index.
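As an illustration, the sketch below embeds a small robots.txt in a Python script and uses the standard library's `urllib.robotparser` to show how a well-behaved crawler would read it. The rules are assumptions made for the example, not the ones we deployed: the `/feeds/` path, the `www.example.com` host and the 120-second delay are hypothetical, and `Crawl-delay` is a non-standard directive that a given spider may or may not honour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep msnbot away from the feed directory and ask it to
# wait between requests, while leaving every other spider unrestricted.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 120
Disallow: /feeds/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# How a compliant crawler identifying itself as "msnbot" would see the site.
feed_allowed = parser.can_fetch("msnbot", "http://www.example.com/feeds/news.xml")
page_allowed = parser.can_fetch("msnbot", "http://www.example.com/article.html")
delay = parser.crawl_delay("msnbot")

# Any other spider falls under the "*" section, which disallows nothing.
other_allowed = parser.can_fetch("otherbot", "http://www.example.com/feeds/news.xml")

print(feed_allowed, page_allowed, delay, other_allowed)
```

To ban a spider from the whole site instead, the section for that user agent would use `Disallow: /`. Keep in mind that robots.txt is only a request: it works with spiders that choose to respect it.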

You can learn more about effectively using the Robots.txt file on Wikipedia.



Official Channel

Microsoft suggests reporting the issue to them officially at:
http://g.msn.com/0HEMSN_SEARCH...



Read more of what other people found out:

Bots, Spiders and Bandwidth
January 5th 2006

MSNbot - information
March 23rd 2005

Slow down...
September 12th 2005

Banning MSNBot: an open letter to MSN Search
November 14th 2005

HOW-TO block the most common bad bots using robots.txt
April 26th 2006

 
 
 
 
posted by Robin Good on Wednesday, July 5 2006, updated on Saturday, April 24 2010

