In the race to develop the best AI model, web crawlers play a crucial role. They automatically search the internet for content that developers can use to train their Large Language Models (LLMs). Nepenthes is a tool designed to lure these crawlers into an endless maze or flood them with meaningless content.
The main issue with the web crawlers used by AI developers is that they often disregard website owners' wishes. Owners can use a file called robots.txt to declare that they do not want their content crawled for LLM training. However, each crawler has to be addressed by its own user-agent token, and compliance is entirely voluntary — some companies simply ignore the file or work around it.
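As an illustration, a robots.txt that opts out of several well-known AI crawlers might look like the fragment below. The user-agent tokens shown (GPTBot for OpenAI, Google-Extended for Google's AI training, CCBot for Common Crawl) are ones the respective operators document, but the list is only a sample — every crawler needs its own entry, and none of this is enforceable:

```
# Disallow selected AI training crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else (e.g. regular search indexing) stays allowed
User-agent: *
Allow: /
```

Note that `Google-Extended` only opts the site out of AI training; it does not affect normal Google Search indexing.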
Programmer Aaron B. was particularly frustrated with how web crawling for LLMs was being conducted. That frustration led him to develop Nepenthes, named after a genus of carnivorous pitcher plants — though this Nepenthes traps web crawlers rather than insects. Aaron describes it as a "tar pit" meant to snare crawlers, above all those gathering AI training data. It can just as easily trap other kinds of crawlers, including those of search engines, and Aaron warns that deploying Nepenthes may get a website removed from Google's search results.
Nepenthes works by generating a page with about a dozen links that all lead back into the same maze of generated pages. Each page is also served deliberately slowly, tying up the crawler's time; this behavior can be tested online, and the sluggish loading is intentional. For those with enough computing power and bandwidth, there is an additional option: feeding the trapped crawlers Markov-generated nonsense, filling the crawler operators' storage with worthless training data.
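The two mechanisms described above — an endless maze of self-referential links and Markov-chain babble — can be sketched in a few lines of Python. This is a hypothetical illustration, not Nepenthes' actual code; the function names `tarpit_page` and `markov_babble` are invented for this example:

```python
import random
import string

def random_slug(n=8):
    """Random path segment so every generated link looks unique."""
    return "".join(random.choices(string.ascii_lowercase, k=n))

def tarpit_page(n_links=12):
    """Build an HTML page whose links all lead deeper into the maze.

    Each link points to a freshly invented path; a server using this
    would answer *any* request path with another such page, so a
    crawler that follows the links never reaches real content.
    """
    links = "\n".join(
        f'<a href="/{random_slug()}/{random_slug()}">{random_slug()}</a>'
        for _ in range(n_links)
    )
    return f"<html><body>\n{links}\n</body></html>"

def markov_babble(corpus, n_words=50, seed=None):
    """Word-level Markov chain: statistically plausible nonsense text."""
    rng = random.Random(seed)
    words = corpus.split()
    # Map each word to the words that follow it somewhere in the corpus.
    following = {}
    for a, b in zip(words, words[1:]):
        following.setdefault(a, []).append(b)
    word = rng.choice(words)
    out = [word]
    for _ in range(n_words - 1):
        # Fall back to a random word if the current one has no successor.
        word = rng.choice(following.get(word, words))
        out.append(word)
    return " ".join(out)
```

In a real deployment, a request handler would return `tarpit_page()` (optionally padded with `markov_babble()`) for every path, inserting an artificial delay before responding to maximize the time each crawler wastes.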
However, there are drawbacks. While crawlers are stuck working through Nepenthes, they also generate load on the host: the weaker the server, or the more crawlers caught at once, the greater the strain. The IPs of trapped crawlers can be blocked, but given the sheer number of crawlers on the internet, Nepenthes will likely never run out of targets — and those who deliberately want to drown crawlers in nonsense and burn their resources won't bother blocking IPs at all.
Aaron B. strongly advises that only those who fully understand the tool should use it. There are also doubts about whether Nepenthes works as claimed: modern crawlers typically cap the number of pages they fetch from a single site, scaled to the site's popularity. A Hacker News user argued that such limits would render the endless maze ineffective, though the tool might still help keep content from being crawled.
In a conversation with 404 Media, Aaron B. addressed the point from the Hacker News thread, saying, “If that’s true, according to my access data, even the almighty Google crawler isn’t protected in this way.”