I just found out about this when it came to the front page of Hacker News. I really wish I had been given advance notice. I haven't been able to put as much energy into Anubis as I've wanted because I've been incredibly overwhelmed by life and need to be able to afford to make this my full time job. Support contracts are being roadblocked, and I just wish I had the time and energy to focus on this without having to worry about being the single income for the household.
logicprog 12 hours ago [-]
That sucks. Keep fighting the good fight, and I wish you all the best. We need people working on this problem (unfortunately).
xena 12 hours ago [-]
Thanks! I just wish I could afford to work on this full time, or at least even part time. It would help me a lot and prevent me from having to work what is effectively two full time jobs. Rent and food keep getting more expensive in Canada.
veqq 13 hours ago [-]
Good luck, I'm sorry for all of this speculation and people attacking your solution instead of suggesting concrete improvements to help fight the problem.
xena 13 hours ago [-]
Thanks. It means a lot. Today has not been a good day for me. It will be fixed. Things will get better, but this has to rank up there in terms of the worst ways to find out about security issues. It sucks lol.
bayindirh 41 minutes ago [-]
Please don't feel bad. There will always be a horde of noisy people who can't do anything but complain. Some of them don't know better, some of them just want to see the world burn.
You're doing a tremendous job. On a personal note, I'm not angry or anything, it's just the nature of the process. No hard feelings here. I root for you.
Hope everything goes way better and way sooner than you ever imagined. Good luck & godspeed!
grayhatter 12 hours ago [-]
I'll double down on what veqq said: Those that can, do. Those who have no idea where to start complain on internet threads.
There will always be bots, they were here before anubis, they'll be there long after you block them again. Take care of yourself first. There's no need to make a bad day worse trying to sprint down a marathon.
rapnie 13 hours ago [-]
You are doing a tremendous job, and we are really thankful for the great work you've done. Personal matters come first though imho. Take care <3
ziml77 13 hours ago [-]
I saw you touching grass, so I hope that's at least helping you get through the day <3
dkiebd 12 hours ago [-]
Is it a security issue? I thought it was just the crawlers spending the energy in solving the challenges?
xena 9 hours ago [-]
As an SRE, if there's any possible ambiguity as to if an issue is a security issue, treat it as one until proven otherwise.
zahlman 15 hours ago [-]
I actually don't understand how Anubis is supposed to "make sure you're not a bot". It seems to be more of a rate limiter than anything else. It self-describes:
> Anubis sits in the background and weighs the risk of incoming requests. If it asks a client to complete a challenge, no user interaction is required.
> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums. Anubis has a customizable difficulty for this proof-of-work challenge, but defaults to 5 leading zeroes.
When I go to Codeberg or any other site using it, I'm never asked to perform any kind of in-browser task. It just has my browser run some JavaScript to do that calculation, or uses a signed JWT to let me have that process cached.
Why shouldn't an automated agent be able to deal with that just as easily, by just feeding that JavaScript to its own interpreter?
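A minimal sketch of why that is easy to script, assuming a plain "SHA-256 with N leading zero hex digits" challenge (the general shape described above, not Anubis's exact wire format):

```python
import hashlib
import itertools

def solve(challenge: str, difficulty: int = 5) -> int:
    """Find a nonce so that sha256(challenge + nonce) starts with
    `difficulty` zero hex digits. Hypothetical format, for illustration only."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce  # a bot submits this just like a browser would

# Difficulty 5 means roughly 16**5 (about a million) hashes on average,
# i.e. a few seconds of CPU time for a script, same as for a real browser.
print(solve("example-challenge-string"))
```

A crawler doesn't even need to reimplement this; it can simply run the served JavaScript in a headless browser, which is what the heavier scrapers already do.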
yabones 14 hours ago [-]
My understanding is that it just increases the "expense" of mass crawling just enough to put it out of reach. If it costs fractional pennies per page scrape with just a python or go bot, it costs nickels and dimes to run a headless chromium instance to do the same thing. The purpose is economical - make it too expensive to scrape the "open web". Whether it achieves that goal is another thing.
blibble 14 hours ago [-]
what do AI companies have more than everyone else? compute
anubis directly incentivises the adversary, at expense of everyone else
it's what you would deploy if you want to exclude everyone else
(conspiracy theorists note that the author worked for an AI firm)
jerf 13 hours ago [-]
"what do AI companies have more than everyone else? compute"
"Everyone else" actually has staggering piles of compute, utterly dwarfing the cloud, utterly dwarfing all the AI companies, dwarfing everything. It's also generally "free" on the margin. That is, if your web page takes 10 seconds to load due to an Anubis challenge, in principle you can work out what it is costing me but in practice it's below my noise floor of life expenses, pretty much rolled in to the cost of the device and my time. Whereas the AI companies will notice every increase of the Anubis challenge strength as coming straight out of their bottom line.
This is still a solid and functional approach. It was always going to be an arms race, not a magic solution, but this approach at least slants the arms race in the direction the general public can win.
(Perhaps tipping it in the direction of something CPUs can do but not GPUs would help. Something like an scrypt-based challenge instead of a SHA-256 challenge. https://en.wikipedia.org/wiki/Scrypt Or some sort of problem where you need to explore a structure in parallel but the branches have to cross-talk all the time and the RAM is comfortably more than a single GPU processing element can address. Also I think that "just check once per session" is not going to make it but there are ways you can make a user generate a couple of tokens before clicking the next link so it looks like they only have to check once per page, unless they are clicking very quickly.)
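A rough sketch of the memory-hard direction suggested above, using scrypt as the hash; the parameters, difficulty rule, and challenge format are illustrative assumptions, not anything Anubis ships:

```python
import hashlib
import itertools

def solve_memory_hard(challenge: bytes, difficulty_bits: int = 12) -> int:
    """Leading-zero-bits proof of work where each attempt runs scrypt with
    roughly 32 MiB of working memory, which GPUs and ASICs handle far less
    efficiently than plain SHA-256."""
    for nonce in itertools.count():
        digest = hashlib.scrypt(
            nonce.to_bytes(8, "big"),
            salt=challenge,
            n=2**15, r=8, p=1,            # ~32 MiB per evaluation
            maxmem=64 * 1024 * 1024,
            dklen=32,
        )
        # accept when the top `difficulty_bits` bits of the digest are zero
        if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
            return nonce
```

Each evaluation costs tens of milliseconds and tens of megabytes, so the difficulty can be far lower than with SHA-256 while keeping the same wall-clock cost for a browser.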
solid_fuel 11 hours ago [-]
Anubis increases the minimum amount of compute required to request and crawl a page. How does that incentivize the adversary?
xboxnolifes 12 hours ago [-]
"Everyone else" (individually) isn't going to millions of webpages per day.
semiquaver 9 hours ago [-]
But it doesn't cost scrapers millions of solved challenges to go to millions of webpages on a single origin. Once you solve an Anubis challenge you get a signed JWT that lets you scrape a given site an unlimited amount for a configurable amount of time (roughly a day). So in practice it doesn't actually cost the scrapers a large amount in proportion to their usage. It actually costs them proportionally less than a normal human.
To actually make it expensive for scrapers every page would need a new challenge. And that would not be tolerated by real human users. Or the challenge solution would need to be tied to a stateful reward that only entitles a human-level amount of subsequent request usage.
thayne 6 hours ago [-]
Well, getting that JWT gives you something to track requests by. So you can count how many requests you get per JWT, then if that token is making too many requests you can block it, throttle it, tarpit it, etc.
I'm not sure if anubis currently does that, but it certainly could.
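A sketch of the per-token throttling being described, as hypothetical middleware (Anubis may or may not do anything like this; the names and budget are made up):

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60     # sliding window length (assumed)
BUDGET = 120            # max requests per token per window (assumed)

_hits: dict[str, list[float]] = defaultdict(list)

def allow(token_id: str) -> bool:
    """Return False if this (already verified) token has exceeded its request
    budget; the caller can then block, throttle, or tarpit it."""
    now = time.monotonic()
    recent = [t for t in _hits[token_id] if now - t < WINDOW_SECONDS]
    if len(recent) >= BUDGET:
        _hits[token_id] = recent
        return False
    recent.append(now)
    _hits[token_id] = recent
    return True
```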
dathinab 14 hours ago [-]
it's indeed not a "bot/crawler protection"
it's a "I don't want my server to be _overrun_ by crawlers" protection which works by
- taking advantage that many crawlers are made very badly/cheaply
- increasing the cost of crawling
that's it: simple, but good enough to shake off the dumbest crawlers and to make it worthwhile for AI agents to e.g. cache site crawls so that they don't crawl your site 1000 times a day but instead just once
imtringued 7 minutes ago [-]
The AI crawlers have tens of thousands of IPs and some of them use something akin to a residential botnet.
If they notice that they are getting rate limited or IP blocked, they will use each IP only once. This means that IP based rate limiting simply doesn't work.
The proof of work algorithm in Anubis creates an initial investment that is amortized over multiple requests. If you decide to throw the proof away, you will waste more energy, but if you don't, you can be identified and rate limited.
The automated agent can never get around this, since running the code is playing by the rules. The goal of the automated agent is to ignore the rules.
spyrja 7 hours ago [-]
Another approach. Require a hash(RESOURCE_ID, ITERATIONS, MEMORY_COST) for each and every resource request. Admittedly that might get a little tricky considering that you don't want to bog down actual users with sluggish page loads. But if carefully tuned to the highest tolerable level it might actually be sufficient. (Maybe.) It's a hard problem....
Trung0246 6 hours ago [-]
My dumb idea is to encrypt each HTML element as a chain of encryption that requires a full load of one HTML/JS element to get the key for the next HTML/JS element, and so on. Key retrieval can be throttled and mixed between client and server side and embedded with each request to prevent the browser from loading everything at once.
This may tread too close to DRM tho due to element protection scheme.
Spivak 12 hours ago [-]
You have it right. The problem Anubis is intended to solve isn't bots per se, the problem is that bot networks have figured out how to bypass rate limits by sending requests from newly minted, sometimes residential, ip addresses/ranges for each request. Anubis tries to help somewhat by making each (client, address) perform a proof-of-work. For normal users this should be an infrequent inconvenience but for those bot networks they have to do it every time. And if they solve the challenge and keep the cookie then the server "has them" so to speak and can apply ip rate limits normally.
homebrewer 14 hours ago [-]
I think the only requests it was able to block are plain http requests made over curl or Go's stdlib http client. I see enough of both in httpd logs. Now the cancer has adapted by using a fully featured headless web browser that can complete challenges just like any other client.
As other commenters say, it was completely predictable from the start.
joe_the_user 15 hours ago [-]
Near as I can guess, the idea is that the code is optimized for what browsers can do and gpus/servers/crawlers/etc can't do as easily (or relatively as easily; just taking up the whole server for a bit might be a big cost). Indeed it seems like only a matter of time before something like that would be broken.
yogorenapan 13 hours ago [-]
I've seen a lot of traffic from Huawei bypassing Anubis on some of the things I host as well. The funny thing is, I work for Huawei... Asking around, it seems most of it is coming from Huawei Cloud (like AWS) but their artifactory cache also shows a few other captcha bypassing libraries for Arkose/funcaptcha so they're definitely doing it themselves too.
Anonymous account for obvious reasons.
xena 13 hours ago [-]
Please have someone in the common sense department email me@xeiaso.net. Funding of the Anubis project would go a long way towards mending bridges.
yogorenapan 7 hours ago [-]
I wish I had that kind of power. The European labs get funding from HQ on a per project basis and it takes a lot of effort to convince them to do anything. To even contribute code to open source, we need to fill out a bunch of paperwork, much less fund open source projects that work specifically against some other team's objectives. I'd personally just IP block Huawei's entire ASN since much of it is customer controlled and used for scraping. I know a ton of sites are already doing that since I get IP blocked on my work laptop all the time while doing research with their VPN on
paddw 10 hours ago [-]
Who exactly is Huawei corporate interested in mending bridges with? Seems like that tie is long severed
electroly 15 hours ago [-]
Presumably they just finally decided they were willing to spend ($) the CPU time to pass the Anubis check. That was always my understanding of Anubis--of course a bot can pass it, it's just going to cost them a bunch of CPU time (and therefore money) to do it.
zelphirkalt 14 hours ago [-]
I think so too. Maybe the compute cost needs to be upped some more. I am OK with waiting a bit longer when I access the site.
delusional 10 hours ago [-]
If I worked at a billion dollar firm, where doing this was actually a profitable endeavor, I'd reimplement the Anubis algorithm in optimized native code and run that. I wouldn't be surprised if you could lower the cost of generating the proof by a couple of orders of magnitude, enough to make it trivial. If you then batch it, or distribute it across your GPU farm, well now it's practically free.
Retr0id 15 hours ago [-]
Last time I checked, Anubis used SHA256 for PoW. This is very GPU/ASIC friendly, so there's a big disparity between the amount of compute available in a legit browser vs a datacentre-scale scraping operation.
A more memory-hard "mining" algorithm could help.
jsnell 14 hours ago [-]
A different algorithm would not help.
Here's the basic problem: the fully loaded cost of a server CPU core is ~1 cent/hour. The most latency you can afford to inflict on real users is a couple of seconds. That means the cost of passing a challenge the way the users pass it, with a CPU running Javascript, is about 1/1000th of a cent. And then that single proof of work will let them scrape at a minimum hundreds, but more likely thousands, of pages.
So a millionth of a cent per page. How much engineering effort is worth spending on optimizing that? Basically none, certainly not enough to offload to GPUs or ASICs.
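Spelling out that back-of-the-envelope calculation (all numbers are the assumptions above, not measurements):

```python
core_cents_per_hour = 1.0      # fully loaded server CPU core, as assumed above
challenge_seconds = 2.0        # the most latency real users will tolerate
pages_per_challenge = 1000     # pages scraped per solved challenge

cents_per_challenge = core_cents_per_hour * challenge_seconds / 3600
cents_per_page = cents_per_challenge / pages_per_challenge

print(f"{cents_per_challenge:.6f} cents per challenge")  # ~0.0006, about 1/1000th of a cent
print(f"{cents_per_page:.9f} cents per page")            # ~0.0000006, about a millionth of a cent
```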
Retr0id 14 hours ago [-]
No matter where the bar is there will always be scrapers willing to jump over it, but if you can raise the bar while holding the user-facing cost constant, that's a win.
jsnell 14 hours ago [-]
No, but what I'm saying is that these scrapers are already not using GPUs or ASICs. It just doesn't make any economical sense to do that in the first place. They are running the same Javascript code on the same commodity CPUs and the same Javascript engine as the real users. So switching to an ASIC-resistant algorithm will not raise the bar. It's just going to be another round of the security theater that proof of work was in the first place.
Retr0id 14 hours ago [-]
They might not be using GPUs but their servers definitely have finite RAM. Memory-hard PoW reduces the number of concurrent sessions you can maintain per fixed amount of RAM.
The more sites get protected by Anubis, the stronger the incentives are for scrapers to actually switch to GPUs etc. It wouldn't take all that much engineering work to hook the webcrypto apis up to a GPU impl (although it would still be fairly inefficient like that). If you're scraping a billion pages then the costs add up.
jsnell 14 hours ago [-]
The duration you'd need the memory for is a couple of seconds, during which time you're pegging a CPU core on the computation anyway. It is not needed for the entirety of the browsing session.
Now, could you construct a challenge that forced the client to keep a ton of data in memory, and then regularly be forced to prove they still have that data during the entire session? I don't think so. The problem is that for that kind of intermittent proof scenario there's no need to actually keep the data in low latency memory. It can just be stored on disk, and paged in when needed (not often). It's a very different access pattern from the cryptocurrency use case.
beefnugs 5 hours ago [-]
Sorry but it is actually a completely wrong solution. This is not a "make it expensive for spammers" problem. They only need to download the source code once, with a background process that doesn't matter if it takes hours.
Besides the "got the source code for training data" , the other access scenario is just downloading to an end users "agent" Which again, the end user is running something in the background, doesn't care how long it takes, how much it costs, its not a volume or spam type problem
Havoc 14 hours ago [-]
Really feels like this needs some sort of unified possibly legal approach to get these fkers to behave.
Search era clearly proved it is possible to crawl respectfully - the AI crawlers have just decided not to. They need to be disincentivized from doing this
dathinab 14 hours ago [-]
the problem in many cases is that even if such a law is made it likely
- is hard to enforce
- misses bite, i.e. it makes you more money to break it than any penalties cost
but in general yes, a site which indicates they don't want to be crawled by AI bots but still gets crawled should be handled similarly to someone banned from a shop who forces themselves in anyway
given how severely messed up some turn-of-the-millennium cyber security laws are, I wonder if crawlers bypassing Anubis could be interpreted as "circumventing digital access controls/protections" or similar, especially given that it's done to make copies of copyrighted material ;=)
jjangkke 11 hours ago [-]
I really don't get this type of hostility
If you put something in the public domain, people are going to access it unless you put it behind a paywall, but you don't want to do that because it would limit access or people wouldn't pay for it to begin with (ex. your blog nobody wants to pay for)
There's no law against scraping, and we've already moved past the CFAA argument
myaccountonhn 11 hours ago [-]
Look at it from a lens of harm rather than legality. The hostility comes from people having to pay thousands in bandwidth costs and having services degraded. These AI companies impose huge costs on others through their wasteful negligence. It's not reasonable.
bargainbin 11 hours ago [-]
It’s not quite as simple as “putting something in public domain”. The problem is the server costs to keep that thing in the public domain.
TJSomething 11 hours ago [-]
The problem here is that some websites, in this case independent open source code forges, can't be put behind paywalls and cannot maintain availability under the load of scrapers.
pointlessone 2 hours ago [-]
I feel for Codeberg and people who are against AI, but I also think Anubis can't die soon enough. It breaks archiving and is very annoying when JS is disabled or when faced with an aggressive ad-blocker. It breaks the web in more ways than one.
zeropointsh 12 hours ago [-]
How about using on-chain proof-of-work? It flips the script.
If a bot wants access, let it earn it—and let that work be captured, not discarded. Each request becomes compensation to the site itself. The crawler feeds the very system it scrapes. Its computational effort directly funds the site owner's wallet, joining the pool to complete its proof.
The cost becomes the contract.
viraptor 11 hours ago [-]
The check has to apply to people and bot visitors the same. If you're expecting a blockchain registered spend before the content is visible, basically nobody will visit your website.
apetresc 10 hours ago [-]
I don’t think OP meant you pay directly, just that you volunteer to do some part of the PoW (of some chain designed for this purpose) on behalf of the site, to its credit.
That’s not much of a different ask from Anubis. It just commandeers the compute for some useful purpose.
xena 10 hours ago [-]
If you can tell me how to implement protein folding without having to have gigabytes of scientific data involved, I'll implement it today.
interloxia 10 hours ago [-]
Have a look at folding@home.
https://foldingathome.org/
Client and data could be less than 10mb.
Making a proof of work algorithm do some actually useful work is very much an unsolved problem.
zeropointsh 10 hours ago [-]
Exactly, instead of trying to totally prevent the bots/AI scrapers, make them "pay you" in compute to access and scrape. It needs to be solved.
zeropointsh 9 hours ago [-]
It would be more like using a javascript webminer to mine for you a bit before you get access. This would be the "proof-of-work" needed to then proceed to scrape.
mzajc 7 hours ago [-]
The last thing I want on the web is the normalization of cryptominers.
akoboldfrying 11 hours ago [-]
Which blockchain? An existing mainstream one like Bitcoin?
Because if so, I don't yet see how to "smooth out" the wins. If the crawler manages to solve the very-high-difficulty puzzle for you and get you 1BTC, great, but it will be a long time between wins.
If you're proposing a new (or non-mainstream) blockchain: What makes those coins valuable?
recursivecaveat 11 hours ago [-]
Isn't it the same problem as public mining pools? I remember I ran my little desktop in one for a day or 2 and got paid some micro-coins despite not personally winning a block. I'm not sure how they verify work and prevent cheating, but they appear to do so. I don't know if it scales down to a small enough size to be appropriate for 1 webpage though.
akoboldfrying 3 hours ago [-]
I think you're right, I'll look into this. Thanks!
zeropointsh 9 hours ago [-]
Look into JavaScript web miners for Monero. It exists but just hasn't been implemented into a proof-of-work concept to make a bot/AI scraper "pay" to access your site. Or if it has been used, not at scale.
hollow-moe 15 hours ago [-]
Really looks like the last solution is a legal one, using the DMCA against them using the digital protection or access control circumvention clause or smth.
jjangkke 11 hours ago [-]
DMCA only applies to hosted content, and we've established that LLMs aren't hosting copyrighted content as there is significant transformation, which you would otherwise need to prove yourself by training and replicating their entire model.
There is no legal recourse here, if you don't want AI crawlers accessing your content 1) put it behind a paywall 2) remove from public access
hollow-moe 7 hours ago [-]
I'm not talking about the output of the LLM here. DMCA is an overreaching law. Here i'm talking about its provisions for access controls and "digital locks", i am not a lawyer but i'm fairly sure you could find some way to categorize Anubis/another software as a digital lock and then sue them on that basis.
logicprog 15 hours ago [-]
I'm not anti-the-tech-behind-AI, but this behavior is just awful, and makes the world worse for everyone. I wish AI companies would instead, I don't know, fund common crawl or something so that they can have a single organization and set of bots collecting all the training data they need and then share it, instead of having a bunch of different AI companies doing duplicated work and resulting in a swath of duplicated requests. Also, I don't understand why they have to make so many requests so often. Why wouldn't like one crawl of each site a day, at a reasonable rate, be enough? It's not like up to the minute info is actually important since LLM training cutoffs are always out of date anyway. I don't get it.
barbazoo 15 hours ago [-]
Greed. It's never enough money, never enough data, we must have everything all the time and instantly. It's also human nature it seems, looking at how we consume like there's no tomorrow.
logicprog 15 hours ago [-]
Which is why internalizing externalities is so important, but that's also extremely hard to do right (leads to a lot of "nerd harder" problems).
msgodel 14 hours ago [-]
It doesn't even make sense to crawl this way. It's just destructive for almost no benefit.
barbazoo 14 hours ago [-]
Maybe they assume there'll be only one winner and think, "what if this gives me an edge over the others". And money is no object. Imagine if they cared about "the web".
logicprog 14 hours ago [-]
That's what's annoying and confusing about it to me.
oortoo 14 hours ago [-]
The time to regulate tech was like 15 years ago, and we didn't. Why would any tech company expect to have to start following "rules" now?
logicprog 14 hours ago [-]
Yeah, I don't think we can regulate this problem away personally. Because whatever regulations will be made will either be technically impossible and nonsensical products of people who don't understand what they're regulating that will produce worse side effects (@simonw extracted a great quote from recent Doctorow post on this: https://simonwillison.net/2025/Aug/14/cory-doctorow/) or just increase regulatory capture and corporate-state bonds, or even facilitate corp interests, because the big corps are the ones with economic and lobbying power.
thewebguyd 14 hours ago [-]
> fund common crawl or something so that they can have a single organization and set of bots collecting all the training data they need and then share it
That, or, they could just respect robots.txt and we could put enforcement penalties for not respecting the web service's request to not be crawled. Granted, we probably need a new standard but all these AI companies are just shitting all over the web, being disrespectful of site owners because who's going to stop them? We need laws.
logicprog 14 hours ago [-]
> That, or, they could just respect robots.txt
IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes it down for others, because these are non rivalrous resources that are literally already public.
> we could put enforcement penalties for not respecting the web service's request to not be crawled... We need laws.
How would that be enforceable? A central government agency watching network traffic? A means of appealing to a bureaucracy like the FCC? Setting it up so you can sue companies that do it? All of those seem like bad options to me.
thewebguyd 14 hours ago [-]
> IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes it down for others, because these are non rivalrous resources that are literally already public.
I disagree. Whether or not content should be available to be crawled is dependent on the content's license, and what the site owner specifies in robots.txt (or, in the case of user submitted content, whatever the site's ToS allows)
It should be wholly possible to publish a site intended for human consumption only.
> How would that be enforceable?
Making robots.txt or something else a legal standard instead of a voluntary one. Make it easy for site owners to report violations along with logs, legal action taken against the violators.
senko 10 hours ago [-]
> It should be wholly possible to publish a site intended for human consumption only.
You have just described the rationale behind DRM. If you think DRM is a net positive for society, I won't stop you, but there has been plenty published online on the anguish, pain and suffering it has wrought.
logicprog 10 hours ago [-]
Precisely, this would be a system that is essentially designed to ensure that your content can only be accessed by specific kinds of users you approve of, for specific kinds of use you approve of, and only with clients and software that you approve of by means of legislation, so that you don't have to go through the hassle of actually setting up the (user hostile) technologies that would be necessary to enforce this otherwise and/or give up the appearance of an open web by requiring sign ins, while just being hostile on another level. It's trying to have your cake and eat it too, and it will only massively strengthen the entire ecosystem of DRM and IP. I also just personally find the idea of posting something on a board in a town square and then trying to decide who gets to look at it ethically repugnant.
This is actually kind of why I like Anubis. Instead of trying to dictate what clients or purposes or types of users can access a site, it just changes the asymmetry of costs enough that hopefully it fixes the problem. Because like you can still scrape a site behind Anubis, it just takes a little bit more commitment, so it's easier to do it on an individual level than on a mass DoS level.
notatoad 12 hours ago [-]
laws are inherently national, which the internet is not. by all means write a law that crawlers need to obey robots.txt, but how are you going to make russia or china follow that law?
superkuh 14 hours ago [-]
This isn't AI. This is corporations doing things because they have a profit motive. The issue here is the non-human corporations and their complete lack of accountability even if someone brings legal charges against them. Their structure is designed to abstract away responsibility and they behave that way.
Same old problem. Corps are gonna corp.
logicprog 14 hours ago [-]
Yeah, that's why I said I'm not against AI as a technology, but against the behavior of the corporations currently building it. What I'm confused by (not really confused, I understand it's just negligence and not giving a fuck, but frustrated and confused in a sort of helpless sense of not being able to get into the mindset) is that while there isn't a profit motive against doing this (obviously), there's also not clearly a profit motive to do it: it seems like they're wasting their own resources too on unnecessarily frequent data collection, and it'd be cheaper to pool data collection efforts anyway.
nektro 14 hours ago [-]
if those companies cared about acting in good faith, they wouldnt be in AI
egypturnash 14 hours ago [-]
thanks for making everything that much shittier just so you can steal everyone's data and present it as your own, AI companies!
amarcheschi 13 hours ago [-]
The tiniest relief is knowing that models will be distilled and "copied" into smaller models of equal capabilities in ~6-12 months, since their output can't be copyrighted and will be used to improve others. Kinda ironic
black_puppydog 10 hours ago [-]
Dear god that invokes the image of some un-human centipede...
jjangkke 11 hours ago [-]
you can't steal something that is in the public domain and that you make readily available by publishing it online, because there is no provable damage to you from someone scraping it and training their models.
if you really think what you offer has value, put it in behind a paywall and see how many people will consume it then, probably not a lot.
hyghjiyhu 15 hours ago [-]
Crazy thought, but what if you made the work required to access the site equal the work required to host the site? Host the public part of the database on something like WebTorrent. Render the website from the db locally. You want to run expensive queries? Suit yourself. Not easy, but maybe possible?
nine_k 15 hours ago [-]
Why not ask it to directly mine some bitcoin, or do some protein folding? Let's make proof-of-work challenges proof-of-useful-work challenges. The server could even directly serve status 402 with the challenge.
SkiFire13 10 hours ago [-]
Note that the work needs to produce a result that's quickly verifiable by the server.
Do you want people to mine bitcoin or do protein folding to read your blog or access your web application?
More importantly do you want to now compete with those that do not bottleneck and lose your traffic ?
This is the paradox, the length you go to protect your content only increases costs for everybody else who isn't an AI crawler.
nine_k 10 hours ago [-]
People, no! Robots which can't pass for people, yes.
varenc 11 hours ago [-]
Are AI crawlers equipped to get past reCAPTCHA or hCAPTCHA? This seems like exactly the thing these services were meant to stop.
black_puppydog 10 hours ago [-]
So the problem is a bunch of AI companies mining our web content for training data without asking and without regard for hosters' effort/bandwidth and the users' service quality.
And the proposed remedy is to give them human-labeled data directly in the form of captchas, even more severely degrading the user experience and thus website viability?
Color me unconvinced.
jsnell 15 hours ago [-]
This was beyond predictable. The monetary cost of proof of work is several orders of magnitude too small to deter scraping (let alone higher yield abuse), and passing the challenges requires no technical finesse basically by construction.
zahlman 15 hours ago [-]
We need to revive 402 Payment Required, clearly. If we lived in a world where we could easily set up a small trusted online balance for microtransactions that's interoperable with everyone, and where giving others a literal penny for their thoughts could allow for running up a significant bill for abusers, I'd gladly play along.
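A sketch of what such a flow might look like; the header names, price unit, and token check here are invented for illustration, since no such micropayment standard exists today:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PRICE_MILLICENTS = 10  # assumed price per page

def token_is_valid(token: str) -> bool:
    # Stand-in for verifying a micropayment receipt with some payment provider.
    return token == "paid-demo-token"

class PaywalledHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        token = self.headers.get("X-Payment-Token", "")
        if not token_is_valid(token):
            self.send_response(402)  # Payment Required
            self.send_header("X-Payment-Price-Millicents", str(PRICE_MILLICENTS))
            self.end_headers()
            self.wfile.write(b"Payment required")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>The actual page</body></html>")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), PaywalledHandler).serve_forever()
```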
logicprog 15 hours ago [-]
Me too. I wouldn't mind Project Xanadu style micro payments for blogs, and it'd both fix the AI scraper issue and the ads issue, and help people fund hosting costs sustainably. I think the issue is taxes and transaction fees would push the prices too high, and it'd price out people with very low income possibly. It'd also create really perverse incentives for even more tight copyright control, since your content appearing even in part on anyone else's website is then directly losing you money, so it'd destroy the public Commons even more, which would be bad. But maybe not, who knows.
myaccountonhn 14 hours ago [-]
Pay to visit would be great, and would force these AI companies to actually pay for their data.
sumtechguy 14 hours ago [-]
For someone doing spamming, that low a cost would work well as a deterrent, since their costs have to stay low for the scheme to pay off. For someone doing scraping to get data and feed it to an AI, not so much. The AI groups usually have some pretty heavy-hitting hardware sitting behind it. They could even break off some hardware that is due to be retired and have it munch away on it. To make it non-cost-effective the calculations would need to be much bigger.
rpcope1 15 hours ago [-]
I'm calling it now, this is the beginning of all of the remaining non-commercial properties on the web either going away, or getting hidden inside of some trusted overlay network. Unless the "AI" race slows down or changes or some other act of god happens, the incentives are aligned such that I foresee wide swaths of the net getting flogged to death.
homebrewer 14 hours ago [-]
Also increasing balkanization of the internet. I now routinely run into sites that geoblock my whole country, this wasn't something I would see more than once or twice a year, and usually only with sites like Walmart that don't care about clients from outside the US.
Now it's 2-5 sites per day, including web forums and such.
bananalychee 14 hours ago [-]
If you live in Europe it probably has more to do with over-regulation than anything AI-related.
black_puppydog 10 hours ago [-]
More like under-compliance than over-regulation.
"Bruh sorry we were technically unable to produce a website without invasive dark pattern tracking stuff. Tech is haaaaard."
Honestly, I've never found a page outside my own country that I couldn't live without. Screw that s*t.
weinzierl 15 hours ago [-]
I think the answer for the non-commercial web is to stop worrying.
I understand why certain business models have a problem with AI crawlers, but I fail to see why sites like Codeberg have an issue.
If the problem is cost for the traffic then this is nothing new and I thought we have learned how to handle that by now.
myaccountonhn 14 hours ago [-]
The issue is the insane amount of traffic from crawlers that DDOS websites.
For example: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
> [...] Now it’s LLMs. If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
The linux kernel has also been dealing with it AFAIK. Apparently it's not so easy to deal with, because these ai scrapers pull a lot of tricks to anonymize themselves.
1gn15 12 hours ago [-]
One solution is to not expose expensive endpoints in the first place. Serve everything statically, or use heavy caching.
> Precisely one reason comes to mind to have ROBOTS.TXT, and it is, incidentally, stupid - to prevent robots from triggering processes on the website that should not be run automatically. A dumb spider or crawler will hit every URL linked, and if a site allows users to activate a link that causes resource hogging or otherwise deletes/adds data, then a ROBOTS.TXT exclusion makes perfect sense while you fix your broken and idiotic configuration.
Source: https://wiki.archiveteam.org/index.php/Robots.txt
Several years ago, GitHub started moving certain features like "code search on public repos" behind login, likely due to issues like this, to be able to better enforce rate limits. And this was before the era of LLMs going wild.
(And it led to outrage from people for whom requiring an account was some kind of insult.)
MYEUHD 14 hours ago [-]
About 3 hours ago the codeberg website was really slow.
Services like codeberg that are run on donations can be easily DOS'ed by AI crawlers
johntash 9 hours ago [-]
One of my semi-personal websites gets crawled by AI crawlers a ton now. I use Bunny.net for a cdn. $20 used to last me for months of traffic, now it only lasts a week or two at most. It's enough that I'm going to go back to not using a cdn and just let the site suffer some slowness every once in a while.
bananalychee 14 hours ago [-]
I self-host a few servers and have not seen significant traffic increases from crawlers, so I can't agree with that without seeing some evidence of this issue's scale and scope. As far as I know it mostly affects commercial content aggregators.
_ikke_ 14 hours ago [-]
It affects many open source projects as well; they just scrape everything repeatedly, with abandon.
First from known networks, then from residential IPs. First with dumb http clients, now with full blown headless chrome browsers.
bananalychee 12 hours ago [-]
Well I can parse my nginx logs and don't see that happening, so I'm not convinced. I suppose my websites aren't the most discoverable, but the number of bogus connections sshd rejects is an order of magnitude or three higher than the number of unknown connections I get to my web server. Today I received requests from two whole clients from US data centers, so scrapers must be far more selective than you claim, or they are nowhere near the indie web killer OP purports them to be.
I've worked with a company that has had to invest in scraper traffic mitigation, so I'm not disputing that it happens in high enough volume to be problematic for content aggregators, but as for small independent non-commercial websites I'll stick with my original hypothesis unless I come across contradictory evidence.
herval 15 hours ago [-]
Hasn’t that been the case for a while? I’d imagine the combined traffic to all sites on the web combined doesn’t match a single hour of the traffic to the top 5 social media sites. The web is pretty much dead for a while now, many companies don’t even bother maintaining websites anymore
superkuh 14 hours ago [-]
I could see it being the end of commercial and institutional web applications which cannot handle traffic. But actual websites which are html and files in folders served by webservers don't have problems with this.
v5v3 15 hours ago [-]
Could it be a 'correct' continuation of Darwin's survival of the fittest?
chmod775 11 hours ago [-]
Fight fire with fire by serving these guys LLM output of made-up news. Wish them good luck noticing that in their dataset.
johntash 9 hours ago [-]
I think there was some sort of fake webserver that did something like this already. Basically just linked endlessly to more llm-generated pages of nonsense.
zbentley 7 hours ago [-]
There are several!
Some focus on generating content that can be served to waste crawler time: crates.io/crates/iocaine/2.1.0
Some focus on generating linked pages: https://hackaday.com/2025/01/23/trap-naughty-web-crawlers-in...
Some of them play the long game and try to poison models' data: https://codeberg.org/konterfai/konterfai
There are lots more as well; those are just a few of the ones that recently made the rounds.
I suspect that combining approaches will be a tractable way to waste time:
- Anubis-esque systems to defeat or delay easily-deterred or cut-rate crawlers,
- CloudFlare or similar for more invasive-to-real-humans crawler deterrence (perhaps only served to a fraction of traffic or traffic that crosses a suspicion threshold?),
- Junk content rings like Nepenthes as honeypots or "A/B tests" for whether a particular traffic type is an AI or not (if it keeps following nonsense-content links endlessly, it's not a human; if it gives up pretty quickly, it might be--this costs/pisses off users but can be used as a test to better train traffic-analysis rules that trigger the other approaches on this list in response to detected likely-crawler traffic).
- Model poisoners out of sheer pettiness, if it brings you joy.
I also wonder if serving taboo traffic (e.g. legal but beyond-the-pale for most commercial applications porn/erotica) would deter some AI crawlers. There might be front-side content filters that either blacklist or de-prioritize sites whose main content appears (to the crawler) to be at some intersection of inappropriate, prohibited, and not widely-enough related to model output as to be in demand.
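For a concrete picture of the "junk content ring" idea, here is a minimal link-maze sketch in the spirit of Nepenthes or iocaine (hypothetical, not either project's actual code): every URL deterministically yields nonsense that links to more nonsense, so a crawler that keeps following links never runs out, while a human gives up immediately.

```python
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "anubis", "crawler", "maze"]

def page_for(path: str) -> bytes:
    # Seed from the path so every URL always returns the same nonsense page.
    rng = random.Random(hashlib.sha256(path.encode()).digest())
    body = " ".join(rng.choice(WORDS) for _ in range(200))
    links = " ".join(
        f'<a href="/maze/{rng.getrandbits(64):x}">more</a>' for _ in range(10)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>".encode()

class Maze(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(page_for(self.path))

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Maze).serve_forever()
```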
1oooqooq 10 hours ago [-]
anubis and others allow some user agents to pass without proof of work. bad bots (and users) just use an extension that detects anubis and changes the user agent instead.
it's well intentioned, but in the end it just wastes electricity from good people.
anubis does nothing to impact bad crawlers, well, only the laziest ones. for those, generating fake infinite content on the fly is much more efficient.
TZubiri 13 hours ago [-]
It's PoW, AI crawlers didn't learn shit, their admins just increased their CPU/GPU/ASIC budget
WD-42 15 hours ago [-]
This is sad, but predictable. At the end of the day if I can follow a link to an Anubis protected site and view it on my phone, the crawlers will be able to as well.
I see a lot more private networks in our future, unfortunately.
jjangkke 11 hours ago [-]
The private network is only as good as the weakest link which has to offer a reason for people to go through the trouble of accessing either by paying money or other means (acquiring special equipment).
And by putting a wall up you end up losing a large portion of the market to those that will now simply arbitrage and fill the space you leave behind.
There is simply no way to stop crawlers/scrapers, period, unless you put a meter on it or go offline.
OutOfHere 14 hours ago [-]
They failed to properly block/throttle the IP subnet as per their admission, and are now blaming others for their failure.
1gn15 12 hours ago [-]
Duh. The author of Anubis really should advertise it as a DDoS guard, not an AI guard. Otherwise, xe is just misleading people, while being unnecessarily discriminatory against robots (robotkin) and cyborgs (those using AI agents as an extension of their selves).
johntash 9 hours ago [-]
It's not really a DDoS guard though. If someone wants to ddos a server, anubis isn't going to be able to stop the traffic before it gets to the server.
It does help from accidental ddos or just rude scrapers that assume everyone has unlimited bandwidth and money.