Tech bros just actively making the internet worse for everyone.
I mean, tech bros of the past invented the internet
Those are not the tech bros. The tech bros are the ones who move fast and break things. The internet was built by engineers and developers
Nah, that was DARPA
Those were tech nerds. “Tech bros” are jabronis who see the tech sector as a way to increase the value of the money their daddies gave them.
Anubis isn’t supposed to be hard to avoid, just expensive to avoid. Not really surprised that a big company might be willing to throw a bunch of cash at it.
This is what I’ve kept saying about PoW being a shit bot-management tactic. It’s a flat tax across all users, real or fake. The fake users are making money off access to your site and will just eat the added expense. You can raise the tax until it costs more than your data is worth to them, but that also hits your real users. Nothing about Anubis even attempts to differentiate between bots and real users.
If the bots take the time, they can set up a pipeline to solve Anubis tokens outside of the browser more efficiently than real users.
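For a rough idea of what that pipeline looks like: assuming the usual scheme of “find a nonce so that SHA-256(challenge + nonce) starts with N zero hex digits” (the challenge string, names and difficulty below are placeholders, not Anubis’s actual API), it’s just this in a tight loop:

    # Rough sketch of solving an Anubis-style proof-of-work challenge outside
    # the browser. Assumes the common scheme: find a nonce such that
    # SHA-256(challenge + nonce) starts with `difficulty` zero hex digits.
    # The challenge string and difficulty are placeholders, not Anubis's
    # real protocol fields.
    import hashlib

    def solve(challenge: str, difficulty: int) -> tuple[int, str]:
        target = "0" * difficulty
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith(target):
                return nonce, digest
            nonce += 1

    nonce, digest = solve("challenge-from-server", 4)  # ~65k hashes on average
    print(nonce, digest)

A farm running that natively will always outrun a phone grinding the same thing in browser JavaScript.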
What’s the alternative?
Not much in the way of open source solutions. A simple CAPTCHA, however, would cost scrapers more to crack than Anubis.
But when it comes to “real” bot management solutions: the least invasive ones try to match the User-Agent and other headers against the TLS fingerprint and block if they don’t match. More invasive solutions fingerprint your browser and even your GPU, then either block you or issue a tracking cookie which is often pinned to your IP and User-Agent. Both approaches require a large base of data to know what real and fake traffic actually looks like. Only large hosting providers like Cloudflare and Akamai have that data and can offer those sorts of solutions.
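A minimal sketch of that least-invasive check, i.e. “does the TLS fingerprint plausibly match the claimed User-Agent?”. The JA3 hashes here are placeholders, not real fingerprints; the catch is that you need a big corpus of observed (fingerprint, browser) pairs, which is exactly the data advantage the big CDNs have:

    # Placeholder mapping from observed JA3 TLS-fingerprint hashes to the
    # browser family that normally produces them. Values are illustrative.
    KNOWN_JA3 = {
        "e7d705a3286e19ea42f587b344ee6865": "Chrome",   # placeholder hash
        "6734f37431670b3ab4292b8f60f29984": "Firefox",  # placeholder hash
        "3b5074b1b5d032e5620f69f9f700ff0e": "curl",     # placeholder hash
    }

    def looks_consistent(ja3_hash: str, user_agent: str) -> bool:
        family = KNOWN_JA3.get(ja3_hash)
        if family is None:
            return False  # a TLS stack we've never seen: suspicious by default
        # A request claiming to be Firefox but handshaking like curl is almost
        # certainly a bot with a spoofed header.
        return family.lower() in user_agent.lower()

    print(looks_consistent("3b5074b1b5d032e5620f69f9f700ff0e",
                           "Mozilla/5.0 ... Firefox/128.0"))  # False: mismatch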
No, it’s expensive to comply (at a massive scale), but easy to avoid. Just change the user agent. There’s even a dedicated extension for bypassing Anubis.
Even then, AI scrapers have plenty of compute behind them; it realistically doesn’t cost much. Maybe a thousandth of a cent per solve? They’re spending billions on GPU power, they don’t care.
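Back-of-the-envelope, with made-up but plausible numbers (4 zero hex digits of difficulty, one rented CPU core hashing natively):

    # Back-of-the-envelope cost per Anubis-style solve. All numbers are
    # assumptions, not measurements.
    hashes_per_solve = 16 ** 4          # ~65k attempts on average
    hashes_per_second = 5_000_000       # one core doing SHA-256 in a tight loop
    core_cost_per_hour = 0.05           # USD for a cheap cloud core

    seconds = hashes_per_solve / hashes_per_second
    cost = seconds / 3600 * core_cost_per_hour
    print(f"{seconds * 1000:.1f} ms, ${cost:.9f} per solve")

That lands well under a thousandth of a cent, and even being off by a couple orders of magnitude it’s noise next to their GPU bills.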
I’ve been saying this since day 1 of Anubis but nobody wants to hear it.
The website still has to display to real users at the end of the day. It’s a similar problem to trying to stop media piracy. Worst comes to worst, the crawlers could just read the page the way a person would.
Can there be a challenge that actually does some maliciously useful compute? Like make their crawlers mine bitcoin or something.
Did you just use the words “useful” and “bitcoin” in the same sentence? o_O
Bro couldn’t even bring himself to mention protein folding because that’s too socialist I guess.
You’re 100% right. I just grasped at the first example I could think of where the crawlers could do free work. Yours is much better. Left is best.
LLMs can’t do protein folding. A specifically-trained Machine Learning model called AlphaFold did. Here’s the paper.
Developing, training, and fine-tuning that model was a research effort led by two guys who got a Nobel for it. AlphaFold can’t hold a conversation or give you hummus recipes; it knows jack shit about the structure of human language, but it can identify patterns in the domain where it has been specifically and painstakingly trained.
It wasn’t “hey ChatGPT, show me how to fold a protein,” is all I’m saying, and the “superhuman reasoning capabilities” of current LLMs still fall ridiculously short on much simpler problems.
The crawlers for LLM are not themselves LLMs.
They can’t mine bitcoin either, so technical feasibility wasn’t the point of my reply.
Crawlers aren’t LLMs; they can do arbitrary computations (whatever the target demands to access resources).
Hey dipshits:
The number of mouth-breathers who think every fucking “AI” is a fucking LLM is too damn high.
AlphaFold is not a language model. It is specifically designed to predict the 3D structure of proteins, using a neural network architecture that reasons over a spatial graph of the protein’s amino acids.
- Not every artificial intelligence is a deep neural network algorithm.
- Not every deep neural network algorithm is a generative adversarial network.
- Not every generative adversarial network is a language model.
- Not every language model is a large language model.
Fucking fart-sniffing twats.
$ ./end-rant.sh
I went back and added “malicious” because I knew it wasn’t useful in reality. I just wanted to express the idea of the AI crawlers doing free work. But you’re right, bitcoin sucks.
To be fair: it’s a great tool for scamming people (think ransomware) :/
Great for money laundering.
Is it? Don’t you risk losing a rather large percentage of the value?
Just buy cars or something, as they are much better at keeping their value. Also, if somebody asks where you got all this money from, you can just point to the car and say, “I sold that.”
Not without making real users also mine bitcoin, or avoid the site because their performance tanked.
The Monero community spent a long time trying to find a “useful PoW” function. The problem is that most computations that are useful are not also easy to verify as correct. JavaScript optimization was one direction that got pursued pretty far.
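That asymmetry is the whole trick with hash-based PoW: checking a claimed solution is a single hash, no matter how long finding it took. A toy sketch, assuming the usual “N leading zero hex digits” scheme:

    # Verifying a hash PoW solution is one SHA-256 call, regardless of how much
    # work the solver burned. "Useful" jobs (protein folding, JS optimization,
    # ...) generally have no such cheap, trustless check.
    import hashlib

    def verify(challenge: str, nonce: int, difficulty: int) -> bool:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        return digest.startswith("0" * difficulty)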
But at the end of the day, a crypto that actually intends to withstand attacks from major governments requires a system that is decentralized, trustless, and verifiable, and the only solutions that have been found to date involve algorithms for which a GPU or even custom ASIC confers no significant advantage over a consumer-grade CPU.
Anubis does that (the computation part). You may’ve seen it already.
I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.
For a few months now, I think every company in the world has been telling everyone that what matters is AI data harvesting. There’s not even a hint of it being a question. You either accept the AI overlords or get off the internet. Our ONLY purpose is to feed the machine; anything else is irrelevant. Play along or you shall be removed.
I know this is the most ridiculous idea, but we need to pack our bags and make a new internet protocol, to separate us from the rest, at least for a while. Either way, most “modern” internet things (looking at you, JavaScript) are not modern at all, and starting over might help more than any of us could imagine.
Like Gemini?
From the official website:
Gemini is a new internet technology supporting an electronic library of interconnected text documents. That’s not a new idea, but it’s not old fashioned either. It’s timeless, and deserves tools which treat it as a first class concept, not a vestigial corner case. Gemini isn’t about innovation or disruption, it’s about providing some respite for those who feel the internet has been disrupted enough already. We’re not out to change the world or destroy other technologies. We are out to build a lightweight online space where documents are just documents, in the interests of every reader’s privacy, attention and bandwidth.
Yep! That was exactly the protocol on my mind. One thing, though, is that the Fediverse would need to be ported to Gemini, or at least a new protocol would need to be created for it.
If it becomes popular enough that it’s used by a lot of people then the bots will move over there too.
They are after data, so they will go where it is.
One of the reasons all of the bots are suddenly interested in this site is that everyone’s moving away from GitHub; suddenly there’s lots of appealing, tasty data for them to gobble up.
This is how you get bots, Lana
Yes, I know. But, while trying to find a way to bomb the AI datacenters (/s, hopefully it doesn’t come to this), we can stall their attacks.
Won’t the bots just adapt and move there too?
It’s not the most well thought-out, from a technical perspective, but it’s pretty damn cool. Gemini capsules are a freakin’ rabbit hole.
We had a trust-based system for so long. No one is forced to honor robots.txt, but most big players did. Almost restores my faith in humanity a little bit. And then AI companies came and destroyed everything. This is why we can’t have nice things.
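For anyone who’s never looked at one: the whole trust system is a plain text file politely asking crawlers to stay out. The tokens below are the published ones for a few well-known AI crawlers (as of when I last checked), and honoring them is entirely voluntary:

    # robots.txt -- a polite request, nothing more
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: *
    Allow: /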
Big players are the ones behind most AIs though.
reminder to donate to codeberg and forgejo :)
Is there a migration tool? If not, it would be awesome to have one that migrates everything, including issues and stuff. Bet even more people would move.
Codeberg has very good migration tools built in. You need to do one repo at a time, but it can move issues, releases, and everything.
There are migration tools, but not a good bulk one that I could find. It worked for my repos, except for my Unreal Engine fork.
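For what it’s worth, you can script a crude bulk migration yourself by looping over the migration endpoint. The sketch below follows the Gitea/Forgejo API as I remember it (POST /api/v1/repos/migrate); double-check the field names against the current docs, and the tokens, username, and repo names are obviously placeholders:

    # Crude bulk migration: one migrate call per repo, issues/PRs included.
    import urllib.request, json

    CODEBERG_TOKEN = "..."   # Codeberg access token (placeholder)
    GITHUB_TOKEN = "..."     # lets the importer pull issues/PRs too (placeholder)
    REPOS = ["alice/project-one", "alice/project-two"]  # hypothetical repos

    for full_name in REPOS:
        payload = {
            "clone_addr": f"https://github.com/{full_name}.git",
            "repo_name": full_name.split("/")[1],
            "repo_owner": "your-codeberg-username",  # placeholder
            "service": "github",
            "auth_token": GITHUB_TOKEN,
            "issues": True,
            "pull_requests": True,
            "labels": True,
            "milestones": True,
            "releases": True,
            "wiki": True,
        }
        req = urllib.request.Request(
            "https://codeberg.org/api/v1/repos/migrate",
            data=json.dumps(payload).encode(),
            headers={"Authorization": f"token {CODEBERG_TOKEN}",
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(full_name, resp.status)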
Question: do those artificial-stupidity bots want to steal the issues, or the code? Because why are they wasting a lot of resources scraping millions of pages when they could steal everything via SSH (once a month, not 120 times a second)?
That would require having someone with real intelligence running the scraper.
It’s always a cat-and-mouse game.
Eventually we’ll have “defensive” and “offensive” LLMs managing all kinds of electronic warfare automatically, effectively nullifying each other.
Places like Cloudflare and Akamai are already using machine learning algorithms to detect bot traffic at the network level. You need to use similar machine learning to evade them. And since most of these scrapers are for AI companies, I’d expect a lot of the scrapers to be LLM-generated.
I use Anubis on my personal website, not because I think anything I’ve written is important enough that companies would want to scrape it, but as a “fuck you” to those companies regardless
That the bots are learning to get around it is disheartening; Anubis was a pain to set up and get running.
I knew that was the worse option. Use the one that traps them in an infinite maze.
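Something in the spirit of the maze approach (what tools like Nepenthes or iocaine do) is easy to sketch: every page is generated on the fly and only links to more generated pages, so a crawler that ignores robots.txt just wanders forever. This is an illustrative toy, not any particular tool:

    # Toy "infinite maze": each page links to five freshly derived pages.
    # Links are derived deterministically from the path, so no state is needed.
    import hashlib
    from flask import Flask

    app = Flask(__name__)

    @app.route("/maze/<token>")
    def maze(token: str):
        links = [hashlib.sha256(f"{token}/{i}".encode()).hexdigest()[:16]
                 for i in range(5)]
        body = "".join(f'<p><a href="/maze/{l}">more</a></p>' for l in links)
        return f"<html><body>{body}</body></html>"

    # app.run() -- point the trapped paths at /maze/<anything> and let them crawl.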
Just provide a full dump.zip plus incremental daily dumps and they won’t have to scrape ?
Isn’t that an obvious solution ? I mean, it’s public data, it’s out there, do you want it public or not ?
Do you want it only on OpenAI and Google but nowhere else? If so, then good luck with the piranhas.
The Wikimedia Foundation does just that, and still their infrastructure is under stress because of AI scrapers.
Dumps or no dumps, these AI companies don’t care. They feel like they’re entitled to taking or stealing what they want.
That’s crazy; it makes no sense. It takes as much bandwidth and processing power on the scraper’s side to process and use the data as it takes to serve it.
They also have an open API that makes scraping entirely unnecessary.
Here are the relevant quotes from the article you posted
“Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024.”
“At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots.”
“Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure.”
And it’s Wikipedia! The entire data set is trained INTO the models already; it’s not like encyclopedic facts change that often to begin with!
The only thing I can imagine is that it’s part of a larger ecosystem issue: dumps and API access are rare enough, and untrustworthy enough, that the scrapers just scrape everything rather than take the time to save bandwidth by relying on dumps.
Maybe it’s a consequence of the 2023 API wars, where it became clear that data repositories would leverage their position as pools of knowledge to extract rent from search and AI, and places like Wikipedia and other wikis and forums are getting hammered as a result of this war.
If the internet weren’t becoming a warzone, there really wouldn’t be a need for more than one scraper per site. Even if the site were hostile, like Facebook, it would only need to be scraped once, and then the data could be shared efficiently over a torrent swarm.
“they won’t have to scrape ?”
They don’t have to scrape, especially if robots.txt tells them not to.
“it’s public data, it’s out there, do you want it public or not ?”
Hey, she was wearing a miniskirt, she wanted it, right?
No no no, you don’t get to invoke grape imagery to defend copyright.
I know, it hurts when the human shields like wikipedia and the openwrt forums are getting hit, especially when they hand over the goods in dumps. But behind those human shields stand facebook, xitter, amazon, reddit and the rest of big tech garbage and I want tanks to run through them.
So go back to your drawing board and find a solution where the tech platform monopolists are made to relinquish our data back to us and the human shields also survive.
My own mother is a prisoner in the Zuckerberg data hive, and the only way she can get out is brute zucking force into facebook’s poop chute.
find a solution where the tech platform monopolists are made to relinquish our data
Luigi them.
Can’t use laws against them anyway…
I think the issue is that the scrapers are fully automatically collecting text, jumping from link to link like a search engine indexer.
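Which is the crux of it: a scraper is basically just a link-following loop, and nothing in that loop cares whether a dump exists. A toy version (seed URL made up):

    # Toy link-following scraper: fetch a page, harvest links, repeat. This is
    # all the "intelligence" most of these crawlers have, which is why they hit
    # every issue page and diff view instead of grabbing one dump.
    import re
    import urllib.request
    from collections import deque

    seen, queue = set(), deque(["https://example.org/"])  # made-up seed
    while queue and len(seen) < 100:        # cap so the toy version terminates
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            queue.append(link)
    print(f"visited {len(seen)} pages")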
And once again a Web Application Firewall (WAF) was defeated, and it turns out that blocklists and bot-detection tools like fail2ban are the way to go…
Who could have seen this coming…
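If anyone does want to go that route, here’s roughly what a fail2ban setup banning on AI-crawler user agents in an nginx access log looks like. The filter/jail syntax is standard fail2ban, but the regex and bot names are illustrative; check them against your own log format before trusting it:

    # /etc/fail2ban/filter.d/ai-scrapers.conf (sketch; assumes nginx combined log)
    [Definition]
    failregex = ^<HOST> .* "[^"]*(GPTBot|CCBot|ClaudeBot|Bytespider)[^"]*"$

    # /etc/fail2ban/jail.local addition (sketch)
    [ai-scrapers]
    enabled  = true
    port     = http,https
    filter   = ai-scrapers
    logpath  = /var/log/nginx/access.log
    maxretry = 1
    bantime  = 86400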