Bots Ate My Bandwidth
Sunday, December 21, 2003

Back on the trail of why my modest website keeps exhausting its bandwidth quota, I may have found something fundamentally flawed with many dynamic CMSs --- this appears to apply to Drupal, but may apply to other data-driven, multiple-view content presentation systems such as PHPNuke, Slash, Geeklog and many of the other multi-column website publishing engines:

Search engine crawler bots are triggering a bandwidth drain.

It's a bit more complex than this, and it's not the fault of the bots, but it is the periodic return of these bots to a dynamically generated website that is apparently sapping a huge share of my bandwidth ...

Dynamic websites are a great invention. You can have active sidebars showing the live state of your site, who's online, what's going on with the content; you can include RSS feeds and aggregator output, all sorts of things to make your site interesting. Backed by a database, you can also offer your visitors a thousand ways to slice and dice your content to their liking, filtering by article topic, by author, by date --- all of these features are great news for people using a website, but all of them overlook one small reality of the world online ...

Double-edged Googlebots

I'm not complaining about googlebots or any of the other search engine web crawlers. Google provides a fantastic service crucial to sideline websites like mine. Bots are a Good Thing™, but the combination of an overly dynamic website plus cross-referencing that material in alternate views may cause your bandwidth to be inordinately consumed by well-meaning bots.

And let me be doubly clear, it's not the Googlebots' fault: examining my logs, Googlebot agents do indeed obey all the proper Conditional-GET rules; on all static content Googlebots invariably receive painless 304 Not Modified responses.

But not so for the dynamically generated pages. The Drupal.org developers have considered one scenario for page caching, to prevent undue database activity for popular pages, but their method may be insufficient to deal with the reality of periodic crawler bots --- this same effect probably also happens on other dynamic CMSs, and if so, then I may have found a dire show-stopper for using dynamic publishing for weblog and community journal sites like teledyn.com.

Here's the evidence:

Rank   Hits    (%)     Files   (%)      KBytes    (%)      Visits  (%)     Hostname
1      47651   9.52%   43796   11.08%   1114552   21.77%   91      0.10%   *.googlebot.com
2      26770   5.35%   20705   5.24%    554438    10.83%   43      0.04%   yj1017.inktomisearch.com
3      14901   2.98%   13964   3.53%    355658    6.95%    53      0.06%   209.247.193.211

Mine is not a super-popular site. On a good day, I only attract about 3000 visitors and only about 31,000 hits, but it's an old site and over the years it has accumulated over 400 stories. And yet, over one fifth of my site bandwidth is consumed by less than one tenth of the hits! Something is definitely not right ...
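
To put a number on that imbalance, here is a small Python sketch of the arithmetic using only the Googlebot row of the table. The site-wide totals are not listed in the table, so they are back-calculated from the rounded percentages and should be read as approximate. It works out to roughly 23 KB per Googlebot hit, a little over twice the site-wide average per hit.

```python
# Figures copied from the Googlebot row of the table above.
GOOGLEBOT_HITS = 47651
GOOGLEBOT_HITS_SHARE = 0.0952      # 9.52% of all hits
GOOGLEBOT_KBYTES = 1114552
GOOGLEBOT_KBYTES_SHARE = 0.2177    # 21.77% of all KBytes


def kb_per_hit(kbytes: float, hits: float) -> float:
    """Average transfer size of one hit, in KBytes."""
    return kbytes / hits


# Assumption: site-wide totals are inferred from the percentages, so they
# are only as precise as the two-decimal rounding in the report.
site_hits = GOOGLEBOT_HITS / GOOGLEBOT_HITS_SHARE
site_kbytes = GOOGLEBOT_KBYTES / GOOGLEBOT_KBYTES_SHARE

bot_avg = kb_per_hit(GOOGLEBOT_KBYTES, GOOGLEBOT_HITS)
site_avg = kb_per_hit(site_kbytes, site_hits)
weight_ratio = bot_avg / site_avg  # how much heavier a bot hit is than average

print(f"Googlebot: {bot_avg:.1f} KB per hit")
print(f"Site-wide: {site_avg:.1f} KB per hit (estimated)")
print(f"A Googlebot hit is about {weight_ratio:.1f}x the average hit")
```

A bot that mostly received 304 responses would sit well below the site average, not above it.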

Killed by Meta-Content

Examining the logs, I see a curious artifact: static pages, such as this MovableType blog page you're reading now, show up as 'new' content only when they are changed (that being the nature of a static-page projector like MT); in the logs this manifests as a 304 Not Modified response to a Googlebot request.

For example, 00155.html will return 304's forever until I edit the story or someone adds a trackback, comment, or there's some other explicit rebuild of that page using MovableType --- this is why some of the older pages on TeledyN will show different sidebar formats or old Waypath content --- it is computationally too expensive to regenerate my entire site, so I don't. I leave old pages as they were, and they won't update to any new template changes until someone triggers a rebuild.

Which is how things should be, but ... when I filter my logs for Googlebot hits on any of the generated (Drupal) pages, there are precious few 304 result codes --- over the past three days' logs, there were 5688 Googlebot hits on Drupal pages answered with 200 OK. Only 4 received 304 responses.

I'm still working to confirm it and also need to closely examine the Drupal caching and response code, but here is what I think is happening:

  • Drupal's caching rules say that any anonymous-user page will be cached by the URL and kept in the database as a page image; on a subsequent request for that URL, instead of repeating all the database queries to rebuild the main column and sidebars, one database request returns the complete page image; if the request includes an ETag validator (If-None-Match) or If-Modified-Since, Drupal can short-circuit the request under the Conditional-GET rules and return only the headers, with no page body --- put in English, if the page has not been changed since the last visit of the Googlebot, the Googlebot should get a 304 Not Modified response.
  • Because of dynamic sidebar content, and to ensure that all pages contain the most current content, every time any story or comment is added or changed, Drupal clears the entire cache. Subsequent requests will rebuild each page from scratch and return a normal 200 OK response as if the page was new ... even if nothing has changed --- in practice, most pages will have small changes to the sidebar, for example, the list of recently added stories or the top item of the day, so a checksum on the page content will also conclude the page is new. Multiple page views will also be changed by added or deleted items, shifting content off the top or bottom of each page.
  • My teledyn.com website contains only 440 stories going back to roughly 1998; since my switch to Drupal, I have been posting on average once every 4 or 5 days, so items could stay cached for at least a few days. Unfortunately, many pages also show the output from the aggregated RSS feeds, site statistics, who's online and other real-time information. Not every page shows this information, but Drupal doesn't track precise page dependencies; it only uses the heuristic that any change may invalidate any other page.
  • Add to this the way any given posted item may appear multiple times in the website, once for the chronological slice, once in the user's personal blog page, and once for each of the taxonomic terms attached to the page. Each page then also appears in these collection pages (usually in teaser form) and in its full-story version. Thus any main-body content can be seen in whole or in a summary under several unique URI paths, which, to a bot, appear to be completely unique pages. This still only multiplies the content up to perhaps a few thousand virtual pages, hardly the Library of Congress and a tiny count compared to, say, a small town newspaper.
  • Here's where it gets ugly: After adding any sort of content, or when someone accesses a page or an RSS feed changes in the aggregator, the whole site's content will be considered suspect for changes and the cache is cleared. The very next Googlebot will cause the front page to regenerate, and then, by virtue of being a crawler, it will rebuild all main story pages, then proceed to rebuild all the taxonomy cross-reference teaser pages, blog pages, and chronological story teaser pages, hitting the database to recache the entire website as if it were all new content.

This is very bad news.
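
The suspected sequence can be sketched as a toy model in Python. This is not Drupal's code: the class, its fake clock, and the example URLs are all made up for illustration, and the only behaviour modelled is "cache by URL, flush everything on any change, answer 304 only when the cached page image is no newer than the bot's last visit".

```python
from dataclasses import dataclass, field


@dataclass
class ToyPageCache:
    """Hypothetical URL-keyed page cache that is flushed whole on any change."""
    clock: int = 0                                   # fake time, one tick per event
    built_at: dict = field(default_factory=dict)     # url -> tick the page image was built
    rebuilds: int = 0                                # full page regenerations so far

    def content_changed(self) -> None:
        """Any new story, comment or sidebar update clears the entire cache."""
        self.clock += 1
        self.built_at.clear()

    def get(self, url: str, if_modified_since=None):
        """Return (status, last_modified) the way a conditional GET would."""
        self.clock += 1
        if url not in self.built_at:
            self.rebuilds += 1                       # expensive: rebuild from the database
            self.built_at[url] = self.clock
        last_modified = self.built_at[url]
        if if_modified_since is not None and last_modified <= if_modified_since:
            return 304, last_modified
        return 200, last_modified


def crawl(cache: ToyPageCache, urls, last_seen: dict) -> list:
    """One crawler pass; the bot remembers each page's Last-Modified value."""
    statuses = []
    for url in urls:
        status, modified = cache.get(url, last_seen.get(url))
        last_seen[url] = modified
        statuses.append(status)
    return statuses


# Hypothetical paths: the same story reachable through several views.
urls = ["/node/1", "/node/2", "/taxonomy/page/1", "/blog/1", "/"]
cache, seen = ToyPageCache(), {}

first = crawl(cache, urls, seen)     # cold cache: every page is built
second = crawl(cache, urls, seen)    # nothing changed: every page is a 304
cache.content_changed()              # one new comment somewhere on the site
third = crawl(cache, urls, seen)     # every page is rebuilt and sent again
print(first, second, third, cache.rebuilds)
```

In this model a single change costs one full rebuild per URL the crawler knows about, which is the pattern the logs seem to show.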

When change is not change

As I say, I think something similar may be happening with most of the dynamically generated websites with changeable sidebar content, especially if those sidebars lead to alternative ways to view the same data, and this may be happening whether you use PHP or Perl or ASP or even Oracle Webserver --- the fault is not in the technology but in the way it is used. The flaw is in the caching.

While I find it highly effective (from an info-architecture point of view) to have sidebars showing live page stats such as the day's most popular post, and similarly effective to have the flexible Drupal taxonomy to cross-reference content into multiple categorical slices, both features may be my own undoing! Because of the way pages are cached (i.e. cached until anything changes), my cache is expiring so often that I must pay a huge price in CPU time, in MySQL traffic, and on my daily bandwidth quota every time a bot visits.

I may be able to reduce this by disabling real-time content-tracking, but I'm still left with having to regenerate the entire website at least for the first crawler arriving after any new post or comment; because the taxonomy and author cross-reference indexes will also be in the path of the crawler, each of those slice-pages must also be regenerated, down to the last paged bundle of however many stories.

MT users will understand what's going on; if you've ever had to regenerate your entire MT site, you know that it can make your webhost very angry with you, and it can take hours. While the crawler bots are gentle in pacing their requests, they are still triggering the rebuild for every page every time they visit! It's no wonder I often find I cannot post to my MT blog because I am out of database connections ...

Adventures in Bot Marshalling

Free software only promises to save you the bother of re-inventing the wheel, and because it is free software, I can't really gripe about Drupal not doing what I want; I can only roll up my sleeves and fix it. Trouble is, I'm not certain there is a viable fix. Switching to a pure static-page projection system like MT is out of the question for the same reason I cannot regenerate the full opus of pages here on TeledyN, but perhaps there are a few optimizations I can do here and there; here are some ideas I will explore (comments welcome):

  • When a website can be sliced from a number of cross-reference angles, is it necessary to allow Google access to every permutation of the data? What if I exclude bot access to the alternate paths like the /user/view or /taxonomy/page pseudo-paths (because the same content will eventually be linked via the /node links, page by page)? Will robots.txt work when used with the Apache mod_rewrite rules used to create these pseudo-paths? This solution is the easiest to implement; I've already deployed it, but it is too soon to know if it is effective. It won't prevent global cache clearing due to minor sidebar updates, but robot exclusions should limit the bots to indexing each node at most once and prevent the regeneration of all the author and taxonomy back pages.
  • People using Google expect search results to match the main body of a multi-column webpage. What if, on requests by known bot agents, when only the sidebar is altered, I return 304 Not Modified? I'm not sure how I'd detect this condition; I can detect that the request is from a bot, so perhaps by manually comparing the If-Modified-Since to the timestamps on the main content nodes individually ...
  • When a Drupal story is added most pages could be suspect (because of sidebar changes), but if the page is only edited it may be forgivable if some other pages don't reflect the changes (for example, a title in a sidebar) --- this is the situation now with my MovableType where I haven't bothered to regenerate older pages (Drupal is effectively auto-sensing this and the bots are causing the complete regen that I won't do in MT). Perhaps Drupal could be made more selective in what constitutes reason to dump the whole of the cache? Instead of blindly expiring everything, perhaps it could expire only the current node and the pages likely to be showing its full-content or teaser-content.
  • In the world of static-page projection CMS like MT, dynamic content like blogrolling and shoutboxes are handled by external Javascript; because the bots don't follow the Javascript links, the sidebar content can change for the human visitor while registering no-change for the bots. This may be a solution for the live site state items and RSS aggregator displays.
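
For the robots.txt idea, a crawler matches exclusion rules against the URL path it is about to request, so as far as I can tell it should not matter that those paths are produced by mod_rewrite on the server side. The Python sketch below checks a candidate rule set with the standard library's robots.txt parser. The rules mirror the two pseudo-paths named above; the specific test URLs (such as /node/view/155) are hypothetical examples, not paths taken from the real site.

```python
from urllib.robotparser import RobotFileParser

# Candidate rules: keep bots out of the author and taxonomy slice pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /user/view
Disallow: /taxonomy/page
"""


def allowed_paths(robots_txt: str, agent: str, paths: list) -> list:
    """Return the subset of paths that the given agent may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if parser.can_fetch(agent, path)]


# Hypothetical URLs for one story seen through three different views.
candidates = ["/node/view/155", "/user/view/1", "/taxonomy/page/or/12", "/"]
crawlable = allowed_paths(ROBOTS_TXT, "Googlebot", candidates)
print(crawlable)
```

Disallow rules are prefix matches, so every paged sub-listing under the two excluded prefixes is covered without enumerating them.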

Other ideas are welcome. Don't bother recommending mod_gzip --- I'd use it for a lot of reasons except I don't control what goes into my webhost and Superb is nervous about using third-party modules on their shared webservers. Even so, for this particular problem, mod_gzip would only mean I could drive my MySQL connections even more --- the bot traffic would just increase to higher levels of complete site regeneration, so it's not a solution, it just buys some time with a temporary patch.

And, as I said, this bot-triggered site-regeneration scenario may be a common problem with most dynamic CMSs; you should be able to verify this by simply using grep (or whatever) to search your Apache logs for Googlebot agent lines and see if you are returning 304 (not modified) or 200 (new page) response codes --- I expect this is especially an issue in those CMSs where you can get at any given story in a number of ways (by subject, author, date) and especially where the sidebar of any page may be changed by the addition of any content.
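
If grep feels too blunt, the same check can be sketched in a few lines of Python. This assumes your server writes the Apache "combined" log format (with referer and user-agent fields); the function name and the sample lines are made up for illustration, and the sample addresses are documentation-only placeholders.

```python
import re
from collections import Counter

# Apache combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_statuses(lines, agent_substring: str = "Googlebot") -> Counter:
    """Tally HTTP status codes for requests whose user-agent mentions the bot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_substring.lower() in match.group("agent").lower():
            counts[match.group("status")] += 1
    return counts


# Made-up sample lines, only to show the expected shape of the input.
SAMPLE = [
    '192.0.2.10 - - [21/Dec/2003:10:00:00 -0500] "GET /node/view/155 HTTP/1.0" '
    '200 23412 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.10 - - [21/Dec/2003:10:00:05 -0500] "GET /archives/00155.html HTTP/1.0" '
    '304 - "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.77 - - [21/Dec/2003:10:00:09 -0500] "GET /node/view/155 HTTP/1.1" '
    '200 23412 "-" "Mozilla/5.0"',
]

sample_counts = count_bot_statuses(SAMPLE)
print(dict(sample_counts))
```

To run it against a real log, pass an open file object instead of SAMPLE; a healthy site should show far more 304 than 200 responses for pages that have not changed.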

Submitted by mrG on Sun, 2003-12-21 18:14.

