Hope the title of this post doesn't raise fears that we are working on Terminator 5 (or that we are anti-cyborg!). Instead, we thought we'd offer a couple of comments on the ongoing struggle we face with what are known as "bots."
In case you are not familiar with them, bots are programs that "harvest" files from sites that have "interesting" content. They run off servers in places like China and Russia. Podcast files like those our clients offer perfectly fit the profile of the type of material they are looking for. Our clients' posts have exciting topics, lots of tags and keywords, and hefty enough file sizes (2MB to 20MB) to seem worth snatching.
Once a file has been found and collected, the bot site will turn around and feed it to other users. Some bots load the files they distribute with viruses and other nasty things, but most seem to use them as a way to attract traffic. By exposing tons of interesting files, bots bring in visitors who can then be shown ads and monetized in other ways. Their activity is both illegal and sleazy--but it is not especially dangerous for either the bot owner (who is out of the reach of our law) or for the file owner (who is not directly associated with the bot site).
Like spam emails, bots are annoying and increase the cost of serving legitimate users. We try to block the IP addresses of known bots. However, there are more than 3,000 bot types tracked by our favorite bot tracker ("Bots vs Browsers"). These bots use more than 300,000 different IP addresses--so it is virtually impossible to block them all.
In case you wondered, I have not found a scientific way to spot bot traffic. My current approach is ugly and quite manual, and has these three steps:
- Grab our log file and summarize it by IP address (down the side) and month (across the top).
- Scan the resulting matrix, looking for clusters of requests that come from contiguous IP addresses.
- Check one or two of the IP addresses to see if they are known to be used by bots.
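For readers who would rather script the first two steps, here is a rough sketch in Python. It is not the tool we actually use, just an illustration under a few assumptions: the log has already been reduced to (IP address, month) pairs, the addresses are IPv4, and "contiguous" means numerically adjacent addresses. The function names `summarize` and `contiguous_runs`, the `min_run` threshold, and the sample addresses are all made up for the example.

```python
from collections import Counter, defaultdict
from ipaddress import IPv4Address


def summarize(records):
    """Build {ip: {month: request_count}} from (ip, month) pairs."""
    matrix = defaultdict(Counter)
    for ip, month in records:
        matrix[ip][month] += 1
    return matrix


def contiguous_runs(ips, min_run=3):
    """Return runs of numerically adjacent IPv4 addresses.

    min_run is an arbitrary threshold chosen for this sketch.
    """
    ordered = sorted({IPv4Address(ip) for ip in ips})
    runs, current = [], []
    for addr in ordered:
        if current and int(addr) - int(current[-1]) == 1:
            current.append(addr)
        else:
            # The run ended; keep it only if it is long enough.
            if len(current) >= min_run:
                runs.append([str(a) for a in current])
            current = [addr]
    if len(current) >= min_run:
        runs.append([str(a) for a in current])
    return runs


# Hypothetical sample data (documentation-range addresses, not real bots).
records = [
    ("203.0.113.10", "2013-01"),
    ("203.0.113.11", "2013-01"),
    ("203.0.113.12", "2013-02"),
    ("203.0.113.10", "2013-02"),
    ("198.51.100.7", "2013-01"),
]
matrix = summarize(records)
suspects = contiguous_runs(matrix)  # iterating the matrix yields its IPs
```

The third step stays manual in this sketch: each address in `suspects` still has to be looked up against a bot tracker before it is blocked, since a run of neighboring addresses can also belong to an office or an ISP.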
We are not the only site afflicted by bots. I suspect that all major podcasting sites serve files to bots. Our entire industry's download statistics are inflated, as a result. Of course, the same is true for a lot of other media businesses that count TVs that are on but not listened to, newspapers that are delivered but not read, etc.
I had time in the past few weeks to hand-check traffic from several of our clients, to see how much of their traffic came from bots. I was pleased to see accounts with less than 5% of downloads going to bots and disappointed that some of our accounts showed 40% or more going to them. The percentage of downloads going to bots varies a lot. Here is about twenty months of data on a typical client:
As you can see, this podcast saw as much as 20% of its downloads going to bots and as little as 1%. This makes it hard to suggest any across-the-board adjustment for this problem. Instead, until we can weed out these pests entirely, we must just recognize that all of our download counts are somewhat inflated. Users of our statistics (such as advertisers) may adjust the value they attribute to each download down a bit as a result. They might also ask podcast distributors for proof that they have made reasonable efforts to block bots. Otherwise, there could be an incentive for podcasters to open up their servers and use bots to increase their download traffic and their "ad inventory."


