Search engine and RSS aggregator patterns

Looking at my server logs (using Summary), I found the following search engine patterns interesting:

search engines patterns

I'm amazed at how differently they crawl this site. Yahoo! is by far the most resource-intensive (the least efficient?), with the top score in visits, hits and bandwidth consumed. Recomputed per visit over the past 12 months (from 10/01/05 to 09/30/06), the numbers are:

  • Yahoo!: 1.28 hits/visit, 8.2 KB/visit, 1446 visits/day, 1851 hits/day, 11.9 MB/day
  • Google: 10.2 hits/visit, 100 KB/visit, 99.3 visits/day, 1009 hits/day, 9.97 MB/day
  • MSN Search: 7.93 hits/visit, 155 KB/visit, 56.4 visits/day, 447.4 hits/day, 8.76 MB/day
  • Ask Jeeves: 26.9 hits/visit, 209 KB/visit, 10.5 visits/day, 280.9 hits/day, 2.19 MB/day
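
For anyone who wants to redo the per-visit arithmetic, here is a minimal sketch in Python. It only divides the daily averages listed above; the names `DAILY` and `per_visit` are mine, and it assumes 1 MB = 1000 KB, which seems to match the rounding Summary uses.

```python
# Daily averages copied from the list above: (visits/day, hits/day, MB/day).
DAILY = {
    "Yahoo!": (1446, 1851, 11.9),
    "Google": (99.3, 1009, 9.97),
    "MSN Search": (56.4, 447.4, 8.76),
    "Ask Jeeves": (10.5, 280.9, 2.19),
}


def per_visit(visits, hits, megabytes):
    """Return (hits per visit, KB per visit), assuming 1 MB = 1000 KB."""
    return hits / visits, megabytes * 1000 / visits


for name, daily in DAILY.items():
    hits_per_visit, kb_per_visit = per_visit(*daily)
    print(f"{name}: {hits_per_visit:.2f} hits/visit, {kb_per_visit:.1f} KB/visit")
```

Because the inputs are already rounded, the last digit can differ slightly from the figures reported in the list.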

Quite different behaviors! The way Summary separates one visit from the next may work against Yahoo!, so I think hits and bandwidth are better metrics for comparison.

During the same period, I've seen the following patterns from RSS aggregators:

  • Bloglines: 44,098 visits / 84,889 hits / 14.5 MB
  • NewsGator: 45,402 visits / 84,785 hits / 51 MB
  • Yahoo! RSS Syndication System: 7,003 visits / 7,837 hits / 95.4 MB

So Yahoo! RSS consumes almost twice as much bandwidth as NewsGator with about 11 times fewer hits! Weird, and here again Yahoo! takes the biggest payload.
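
The comparison is easier to see per hit. A rough sketch, again in Python and again assuming 1 MB = 1000 KB; `FEEDS` and `kb_per_hit` are names I made up for the illustration, and the figures are the twelve-month totals listed above.

```python
# Twelve-month totals copied from the list above: (visits, hits, MB).
FEEDS = {
    "Bloglines": (44098, 84889, 14.5),
    "NewsGator": (45402, 84785, 51.0),
    "Yahoo! RSS Syndication System": (7003, 7837, 95.4),
}


def kb_per_hit(hits, megabytes):
    """Average payload per hit in KB, assuming 1 MB = 1000 KB."""
    return megabytes * 1000 / hits


for name, (visits, hits, megabytes) in FEEDS.items():
    print(f"{name}: {kb_per_hit(hits, megabytes):.2f} KB/hit")

# Ratios behind the "twice the bandwidth, 11 times fewer hits" remark.
yahoo = FEEDS["Yahoo! RSS Syndication System"]
newsgator = FEEDS["NewsGator"]
bandwidth_ratio = yahoo[2] / newsgator[2]
hits_ratio = newsgator[1] / yahoo[1]
print(f"bandwidth ratio: {bandwidth_ratio:.2f}, hits ratio: {hits_ratio:.1f}")
```

On these totals Yahoo! RSS averages roughly 12 KB per hit while the other two stay under 1 KB per hit. That would fit the other two mostly getting short "not modified" style answers, but the logs summarized here do not show response codes, so that part is a guess.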

1 Comment

Neither Yahoo! nor Google looks efficient given your patterns. It seems Yahoo! comes often but only requests the HTTP headers (it does not download the full file). But why does it come that often… unless you messed up the HTTP cache headers ;). If they are set correctly on your server, Yahoo! should respect them and not come back until they have expired.

Google doesn't seem to make HEAD requests. It comes less often but downloads the full file.
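
To make the cache-header point concrete, here is a minimal sketch of the check a well-behaved crawler could run before coming back. It only looks at the `max-age` directive of a `Cache-Control` header; the function names are hypothetical, and real crawlers (Yahoo!'s and Google's included) certainly use more than this.

```python
import re


def max_age_seconds(cache_control):
    """Extract max-age from a Cache-Control header value, or None if absent."""
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


def should_refetch(cache_control, seconds_since_fetch):
    """Decide whether a cached copy has expired and should be fetched again."""
    max_age = max_age_seconds(cache_control)
    if max_age is None:
        # Assumption: with no freshness information, ask the server again.
        return True
    return seconds_since_fetch >= max_age


# A page served with a one-hour lifetime, checked after 10 and after 90 minutes.
print(should_refetch("public, max-age=3600", 600))
print(should_refetch("public, max-age=3600", 5400))
```

With a sensible `max-age` on the pages, a crawler following this logic would skip most of its repeat visits.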

Take this with caution, pending real testing and analysis. These are just my uninformed interpretations of your data.
Technorati still doesn't respect robots.txt, and I'm sick of it.