A Sinister Brood of Web Crawlers is Hatching

And they have a ferocious appetite!

The diaspora* foundation has built a cozy, private corner of the internet. It’s a “privacy-aware, distributed, open source social network” that has taken ~15 years to steadily grow to what it is today.

There’s one problem, though. They’ve been infested… by spiders.

Or, rather, web crawlers.

These new, mutant web crawlers come in two unique breeds:

  1. LLM Training Bots (scraping data in order to train LLMs)
  2. LLM-Enhanced Web Scrapers

Some Statistics

Diaspora developer Dennis Schubert gave the following stats:

  • Diaspora received 11.3 million requests over the last 60 days
  • 2.78 million requests (24.6% of all traffic): Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
  • 1.69 million requests (14.9%): Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
  • 0.49 million requests (4.3%): Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
  • 0.25 million requests (2.2%): Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36
  • 0.22 million requests (2.2%): meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)

In summary, Schubert reports that roughly 70% of his server traffic comes from these LLM crawlers (the top five user agents above account for nearly half of all requests on their own).

NOTE: Looking into these User-Agents, it appears that they are all grabbing LLM training data rather than making requests on a user’s behalf. For example, OpenAI uses the ChatGPT-User UA when a request originates from a ChatGPT user.

Schubert also says that these crawlers come back every six hours and ignore robots.txt entirely.

This is a pretty damning example. And, unfortunately, there are many more like this.

SourceHut published an announcement on March 17th, 2025, titled:

LLM crawlers continue to DDoS SourceHut

And Read the Docs published a statement calling for AI crawlers to “be more respectful” (which is a great way to put it, IMO).

And there are many more such examples.

Opinions on Web Scraping

When it comes to being a good citizen of the internet, web scraping is sort of a complicated topic.

On one hand, web scraping can be seen as almost democratizing the internet. Some developers are of the perspective that many websites “gatekeep” data behind APIs or paywalls, and scraping lets anyone extract and analyze public web data.

Another perspective is that big companies have vast resources for data collection; web scraping lets the smaller players compete.

And when a website simply (maybe negligently, maybe unfairly) doesn’t have an API, scraping is sometimes the only way to get structured data.

However, we’ve just seen a clear example of web scraping gone wrong. Badly designed scrapers can overwhelm a website, leading to slowdowns or even outages; they effectively become DDoS machines. Oftentimes it’s open source projects with little funding that are taking the hit from these DDoS machines, and, well, that just ain’t right.

And, well, I think Read the Docs put it perfectly. This just comes down to respect. If you’re a tech giant, you don’t disrespect the little guy. You rate-limit yourself, so we don’t have to.

Ways to Limit Crawlers

There are a couple of ways that the influx of crawlers can be mitigated.

I’m going to speak at a high level and define two categories:

  1. The offenders change their ways
  2. The defenders set up more defenses

The Offenders Change Their Ways

There are existing mechanisms for this, like Scrapy’s RFC2616Policy: a crawler-side caching policy built on the HTTP caching rules from RFC 2616, “used in continuous runs to avoid downloading unmodified data (to save bandwidth and speed up crawls)”. The underlying idea, conditional requests, is sketched below.
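
To make the idea concrete, here is a minimal sketch of a polite crawler doing conditional requests with the standard ETag / If-None-Match and Last-Modified / If-Modified-Since headers. The URL, user agent, and cached values are all made up, and the requests library is assumed; this shows the behavior RFC 2616-style caching enables, not any particular crawler’s implementation.

```python
import requests

# Hypothetical example: a polite crawler re-visiting a page it has seen before.
# By sending the validators from the previous response, the server can reply
# with "304 Not Modified" and skip re-sending the whole page.
cached = {
    "url": "https://example.com/feed",
    "etag": '"abc123"',
    "last_modified": "Wed, 01 Jan 2025 00:00:00 GMT",
    "body": "<cached page content>",
}

response = requests.get(
    cached["url"],
    headers={
        "User-Agent": "ExampleBot/1.0 (+https://example.com/bot)",
        "If-None-Match": cached["etag"],
        "If-Modified-Since": cached["last_modified"],
    },
    timeout=10,
)

if response.status_code == 304:
    body = cached["body"]  # nothing changed, reuse the cached copy
else:
    body = response.text   # content changed, update the cache
    cached["etag"] = response.headers.get("ETag", cached["etag"])
    cached["last_modified"] = response.headers.get("Last-Modified", cached["last_modified"])
```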

Read the Docs even offered to set up some sort of webhook system that would notify the AI companies when content changes.
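
There’s no published spec for that webhook system, so the following is a purely hypothetical sketch of what a “content changed” notification might look like: the site POSTs a small JSON payload to an endpoint the crawler operator registered, and the crawler re-fetches only that page. Every URL and field name here is invented for illustration.

```python
import requests

# Purely hypothetical "content changed" webhook notification.
# The endpoint URL and payload shape are made up for illustration.
payload = {
    "event": "page.updated",
    "url": "https://docs.example.com/en/latest/install.html",
    "updated_at": "2025-03-17T12:00:00Z",
}

response = requests.post(
    "https://crawler.example.com/webhooks/content-updates",  # hypothetical crawler endpoint
    json=payload,
    timeout=10,
)

# The crawler acknowledges and schedules a single re-fetch of just that page,
# instead of hammering the whole site every six hours.
print(response.status_code)
```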

Also, opinions seem to be mixed, but robots.txt should be respected, at the very least by the tech giants (and checking it is trivial; see the sketch below).
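
For what it’s worth, honoring robots.txt takes only a few lines. Here is a small sketch using Python’s standard-library urllib.robotparser; the site and user agent are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A crawler that actually honors robots.txt before fetching anything.
# The target site and user agent are placeholders.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

user_agent = "ExampleBot"
url = "https://example.com/some/page"

if robots.can_fetch(user_agent, url):
    print("allowed, fetch it (politely)")
else:
    print("disallowed, skip it")

# robots.txt can also suggest a crawl delay; a polite bot honors that too.
delay = robots.crawl_delay(user_agent)
if delay:
    print(f"site asks for at least {delay}s between requests")
```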

We’re still in the early days, and I trust that better systems will be put into place, but whatever the case may be, part of the solution is respect. (I suppose the other part of the solution will be the law.)

The Defenders Set Up More Defenses

Again, I won’t go into much detail here (I’ve never personally had to defend myself against an LLM-induced DDoS attack), but there are at least some ways to keep the brood from swarming your site.

  • At a minimum, good ol’ 429 (Too Many Requests) rate limiting should be implemented (a sketch follows this list)

    • More nuanced configuration may be required to throttle by individual IP address versus subnet, depending on the attack pattern.

  • SourceHut referenced deploying Nepenthes, which is “a tarpit intended to catch web crawlers”. That sounds awesome.

  • Lock pages behind user auth, control access per account, and put abusive accounts in timeout
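
For the rate-limiting bullet above, here is a rough sketch of the idea: a sliding-window counter keyed by /24 subnet that answers 429 once a block of addresses gets too chatty. This is illustrative only; the thresholds are arbitrary, and real deployments typically do this in nginx, a CDN, or a WAF rather than in application code.

```python
import ipaddress
import time
from collections import defaultdict, deque

# Minimal sketch of per-subnet rate limiting. Requests are keyed by /24 subnet,
# so a crawler spread across many IPs in one block still gets throttled together.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

_recent = defaultdict(deque)  # subnet -> timestamps of recent requests


def subnet_key(ip: str) -> str:
    """Collapse an IPv4 address into its /24 network."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def check_request(ip: str) -> int:
    """Return the HTTP status we should respond with: 200 or 429."""
    key = subnet_key(ip)
    now = time.monotonic()
    window = _recent[key]

    # Drop timestamps that have fallen out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return 429  # Too Many Requests; a Retry-After header would go here
    window.append(now)
    return 200


if __name__ == "__main__":
    for _ in range(125):
        status = check_request("203.0.113.7")
    print(status)  # the last few requests in the burst come back as 429
```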

Will Things Get Worse?

I mentioned—

LLM-Enhanced Web Scrapers

—in the intro to this post.

By putting an LLM in the web-scraping loop, scrapers are entering a new era. Previously unscrapeable sites are becoming scrapeable, and it’s easier than ever to convert large swaths of unstructured data into something structured. The basic loop looks something like the sketch below.
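
To illustrate, here is a stripped-down sketch: fetch the page, hand the raw markup to a model, and get structured records back. The call_llm function is a hypothetical stand-in (returning a canned response here so the example runs) for whatever provider API a real scraper would wire in; the URL and fields are invented.

```python
import json
import requests


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM API the scraper uses.

    Returns a canned response so this sketch runs end to end; a real
    scraper would call its model provider of choice here.
    """
    return '[{"name": "Example Widget", "price": "$9.99"}]'


# Fetch a page the old-fashioned way...
html = requests.get("https://example.com/products", timeout=10).text

# ...then let the model do the part that used to require brittle, hand-written
# selectors: turning messy markup into structured records.
prompt = (
    "Extract every product on this page as JSON objects with "
    '"name" and "price" fields. Return only a JSON array.\n\n' + html[:20000]
)

products = json.loads(call_llm(prompt))
for product in products:
    print(product["name"], product["price"])
```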

And although these currently do not account for much of the traffic on the internet (especially when compared to their training-data-scraping cousins), we are going to see a steady increase in this specific flavor of web scraper. So, I guess, buckle up. Even if the AI companies become better, more respectful internet citizens, the scrapers aren’t going anywhere.
