Make illegally trained LLMs public domain as punishment

🃏Joker@sh.itjust.works · 1 day ago

Make illegally trained LLMs public domain as punishment

john89@lemmy.ca · 4 hours ago

I don’t think it should be a “punishment.” It should be done on principal.

sugar_in_your_tea@sh.itjust.works · edit-2 2 hours ago

Not sure making their LLMs public domain would really hurt their principal, their secret sauce is in the code around the model.

And yes, I do recognize that you meant “principle”.

werefreeatlast@lemmy.world · 5 hours ago

I want to have a personal llm that learns all my interests from my files and websites visited. I just want to ask it stuff that I don’t have to remember.

Zombie-Mantis@lemmy.world · 3 hours ago

I think that’d be ok, even with this proposal, as long as you don’t sell that LLM for public use. It’s fine if I draw a picture of Mickey Mouse in my notebook, but if I try to sell that picture I could get in legal trouble.

fleton@lemmy.world · 3 hours ago

Isn’t that similar to what recall is?

zalgotext@sh.itjust.works · 3 hours ago

Yes, except without Microsoft spying on you

werefreeatlast@lemmy.world · 2 hours ago

Exactly. I don’t want a service, I don’t want to pay for a service, I don’t want to send my files for free to get stuck for later ransom like Google did with email. I just want to purchase a product called a computer and load up a program in it that runs locally and gives me access to my data.

nutsack@lemmy.world · edit-2 11 hours ago

intellectual property doesn’t really exist in most of the world. they don’t give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore…

it’s arbitrary law that is designed to protect corporations and it’s generally unenforceable.

sugar_in_your_tea@sh.itjust.works · 2 hours ago

it’s arbitrary law that is designed to protect corporations and it’s generally unenforceable.

It’s arbitrary, but it was designed to protect individuals and has since been morphed to protect corporations. If we reset the law back to the original Copyright Act of 1790 w/ a 14-year duration, it would go a long way toward removing power from corporations. I think we should take it a step further and perhaps make it 10 years, with an optional extension for another 10 years if you can show need (e.g. you’re an indie dev and your game is finally making a splash after 8 years).

C126@sh.itjust.works · edit-2 6 hours ago

So true. IP only helps the corps and slows tech development. Contracts, NDAs, and trade secrets are all you really need to keep your ideas safe. If you want your country to develop fast, get rid of any IP laws.

Flying Squid@lemmy.world · 8 hours ago

they don’t give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore…

Unless it’s their intellectual property, whereupon it’s suddenly a whole different story. I’m sure you knew that.

john89@lemmy.ca · 4 hours ago

Examples?

Flying Squid@lemmy.world · 3 hours ago

China: https://futureworld.org/mindbullets/china-sues-us-for-ip-theft/

India: https://www.autoweek.com/racing/formula-1/a2013501/formula-one-force-india-sues-lotus-racing-over-wind-tunnel-data/

Philippines: https://www.rappler.com/business/industries/110820-etude-house-files-charges-lazada/

Decided to stop looking after that since that’s three examples.

RandomVideos@programming.dev · 7 hours ago

Wouldn’t that give people who use it for bad things easier access? It should be made illegal to create if they don’t legally have access to that data.

Ajen@sh.itjust.works · 5 hours ago

The “illegally trained LLMs” they’re talking about are trained on copyrighted data that they didn’t have permission to use. This isn’t about LLMs that have been trained to do illegal things. OpenAI (chatgpt) is being sued because there is a lot of evidence that they used copyrighted content for training, like NY Times articles. OpenAI is so profitable that they’ll probably see these lawsuits as a business expense and keep doing it. Most people won’t sue anyway…

RandomVideos@programming.dev · 5 hours ago

I know that by illegally trained LLMs they are talking about training on copyrighted data (by “legally have access to”, I meant that they are legally allowed to train AI on it).

It’s ridiculous that companies can just ignore laws

Ajen@sh.itjust.works · 5 hours ago

Oh, I’m not sure what you meant in your first comment then?

buzz86us@lemmy.world · 5 hours ago

I really don’t care about AI used on designs for generic products.

x0x7@lemmy.world · 6 hours ago

So if I make a better car using customer feedback, are the rights to the car really theirs because it was their opinions that went partially into the end product?

IP is a joke anyway. If you put information out into the world you don’t own it. Sorry, you can’t have it both ways. You can’t simultaneously support torrenting movies (I do, and I assume you do too), while also claiming you own your comments on the internet and no one can “pirate” them.

CileTheSane@lemmy.ca · 3 hours ago

Sure, but saying the corpos can’t privatize the output of their AI is consistent with that viewpoint.

ClamDrinker@lemmy.world · 16 hours ago

Although I’m a firm believer that most AI models should be public domain or open source by default, the premise of “illegally trained LLMs” is flawed, because there really is no assurance that LLMs currently in use are illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world.

The idea of… well, ideas, being copyrightable, should shake the boots of anyone in this discussion. Especially since when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.

The underlying technology simply has more than enough good uses that banning it would simply cause it to flourish elsewhere that does not ban it, which means as usual that everyone but the multinational companies loses out. The same would happen with more strict copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose to these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn’t tip further into their favor to the point AI technology only exists to benefit them.

If the model is built on the corpus of humanity, then humanity should benefit.

barsoap@lemm.ee · edit-2 8 hours ago

As per torrentfreak

OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but in an older paper two databases are referenced; “Books1” and “Books2”. The first one contains roughly 63,000 titles and the latter around 294,000 titles.

These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.

Should be easy to defend against, right-out trivial: OpenAI, just tell us what those Books1 and Books2 databases are. Where you got them from, the licensing contracts with publishers that you signed to give you access to such a gigantic library. No need to divulge details, just give us information that makes it believable that you licensed them.

…crickets. They pirated the lot of it otherwise they would already have gotten that case thrown out. It’s US startup culture, plain and simple, “move fast and break laws”, get lots of money, have lots of money enabling you to pay the best lawyers to abuse the shit out of the US court system.

ClamDrinker@lemmy.world · 4 hours ago

For OpenAI, I really wouldn’t be surprised if that happened to be the case, considering they still call themselves “OpenAI” despite having the most censored and closed-source AI models on the market.

But my comment was more aimed at AI models in general. If you are assuming they indeed used non-publicly posted or gathered material, and did so directly themselves, they would indeed not have a defense to that. Unfortunately, if a second hand provided them the data, and did so under false pretenses, it would likely let them legally off the hook even if they had every ethical obligation to make sure it was publicly available. The second hand that provided it to them would be the one infringing.

If that assumption turns out to be true (maybe through some kind of discovery in the trial), they should burn for that. Until then, even if it’s a justified assumption, it’s still an assumption, and most likely not true for most models, certainly not those trained recently.

patatahooligan@lemmy.world · 9 hours ago

the AI companies have a pretty good defense in the fact analyzing publicly viewable information is a pretty deep rooted freedom that provides a lot of positives to the world

They are not “analyzing” the data. They are feeding it into a regurgitating mechanism. There’s a big difference. Their defense is only “good” because AI is being misrepresented and misunderstood.

I agree that we shouldn’t strive for more strict copyright. We should fight for a much more liberal system. But as long as everyone else has to live by the current copyright laws, we should not let AI companies get away with what they’re doing.

Landless2029@lemmy.world · 7 hours ago

Not to mention patent laws are bullshit.

There are law offices that exist specifically to fuck with people over patent and copyright law.

There are also cases where people use copyright and patent law to hold us back. I can’t find the article but some religious jerk patented connecting a sex toy to a computer via USB. Thankfully someone got around this patent with Bluetooth and cell phones. Without it, I imagine the camgirl and LDR market for toys would’ve had products 10 years sooner.

ClamDrinker@lemmy.world · 5 hours ago

They are not “analyzing” the data. They are feeding it into a regurgitating mechanism. There’s a big difference. Their defense is only “good” because AI is being misrepresented and misunderstood.

I really kind of hope you’re kidding here. Because this has got to be the most roundabout way of saying they’re analyzing the information. Just because you think it does so to regurgitate (which I have yet to see any good evidence for, at least for the larger models), does not change the definition of analyzing. And by doing so you are misrepresenting it and showing you might just have misunderstood it, which is ironic. And doing so does not help the cause of anyone who wishes to reduce the harm from AI, as you are literally giving ammo to people to point to and say you are being irrational about it.

patatahooligan@lemmy.world · 4 hours ago

Yes if you completely ignore how data is processed and how the product is derived from the data, then everything can be labeled “data analysis”. Great point. So copyright infringement can never exist because the original work can always be considered data that you analyze. Incredible.

ClamDrinker@lemmy.world · edit-2 3 hours ago

No, not what I said at all. If you’re trying to say I’m making this argument I’d urge you (ironically) to actually analyze what I said rather than putting words in my mouth ;) (Or just, you know, ask me to clarify)

Copyright infringement (or plagiarism) in its simplest form, as in just taking the material as is, is devoid of any analysis. The point is to avoid having to do that analysis and just get right to the end result that has value.

But that’s not what AI technology does. None of the material used to train it ends up in the model. It looks at the training data and extracts patterns. For text, that is the sentence structure, the likelihood of words being followed by another, the paragraph/line length, the relationship between words when used together, and more. It can do all of this without even ‘knowing’ what these things are, because they are simply patterns that show up in large amounts of data, and machine learning as a technology is made to be able to detect and extract those patterns. That detection is synonymous with how humans do analysis. What it detects are empirical, factual observations about the material it is shown, which cannot be copyrighted.

The resulting data when fed back to the AI can be used to have it extrapolate on incomplete data, which it could not do without such analysis. You can see this quite easily by asking an AI to refer to you by a specific name, or talk in a specific manner, such as a pirate. It ‘understands’ that certain words are placeholders for names, and that text can be ‘pirateitfied’ by adding filler words or pre/suffixing other words. It could not do so without analysis, unless that exact text was already in the data to begin with, which is doubtful.
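To make "the likelihood of words being followed by another" concrete, here is a minimal sketch of the simplest possible version of that idea, a bigram counter. Everything in it is an assumption for illustration: the toy corpus is invented, the function names are hypothetical, and real LLMs learn neural network weights rather than a table of counts, so treat it as an analogy and not as how any actual model is built.

```python
from collections import Counter, defaultdict


def train_bigram_counts(texts):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        # Pair every word with the word that comes right after it.
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follow-counts for one word into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {w: n / total for w, n in followers.items()}


def most_likely_next(counts, word):
    """Pick the most frequently observed follower, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

counts = train_bigram_counts(corpus)
print(next_word_probabilities(counts, "sat"))
print(most_likely_next(counts, "the"))
```

The table holds statistics about which words follow which, and those statistics are what get used to continue an incomplete sentence. Whether that intuition carries over to models with billions of weights is exactly what is being argued in this thread, so the sketch illustrates the claim rather than settling it.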

patatahooligan@lemmy.world · 2 hours ago

No, not what I said at all. If you’re trying to say I’m making this argument I’d urge you (ironically) to actually analyze what I said rather than putting words in my mouth ;) (Or just, you know, ask me to clarify)

That was your implied argument regardless of intent.

Copyright infringement (or plagiarism) in its simplest form, as in just taking the material as is, is devoid of any analysis. The point is to avoid having to do that analysis and just get right to the end result that has value.

Completely wrong, which invalidates the point you want to make. “Analysis” and “as is” have no place in the definition of copyright infringement. A derivative work can be very different from the original material, and how you created the derivative work, including whether you performed whatever you think “analysis” means, is generally irrelevant.

What it detects are empirical, factual observations about the material it is shown, which cannot be copyrighted.

No it detects patterns. You already said it correctly above. And the problem is that some patterns can be copyrighted. That’s exactly the problem highlighted here and here. For copyright law, it doesn’t matter if, for example, that particular image of Mario is copied verbatim from the training data. The character likeness, which is encoded in the model because it is in fact a discernible pattern, is an infringement.

ClamDrinker@lemmy.world · edit-2 44 minutes ago

That was your implied argument regardless of intent.

I decide what my argument is, thank you very much. Your interpretation of it is outside of my control, and while I might try to keep it from going astray, I cannot stop it from doing so. That’s on you.

Completely wrong, which invalidates the point you want to make. “Analysis” and “as is” have no place in the definition of copyright infringement. A derivative work can be very different from the original material, and how you created the derivative work, including whether you performed whatever you think “analysis” means, is generally irrelevant.

I wasn’t giving a definition of copyright infringement, since that depends on the jurisdiction, and since you and I aren’t in the same one most likely, that’s nothing I would argue for to begin with. In the most basic form of plagiarism, people do so to avoid doing the effort of transformation. More complex forms of plagiarism might involve some transformation, but still try to capture the expression of the original, instead of the ideas. Analysis is definitely relevant, since to create a work that does not infringe on copyright, you generally can take ideas from a copyrighted work, but not the expression of those ideas. If a new work is based on just those ideas (and preferably mixes it with new ideas), it generally doesn’t infringe on copyright. It’s why there are so many copycat products of everything you can think of, that aren’t copyright infringing.

No it detects patterns. You already said it correctly above. And the problem is that some patterns can be copyrighted. That’s exactly the problem highlighted here and here. For copyright law, it doesn’t matter if, for example, that particular image of Mario is copied verbatim from the training data.

While depending on your definition Mario could be a sufficiently complex pattern, that’s not the definition I’m using. Mario isn’t a pattern, it’s an expression of multiple patterns. Patterns like “an Italian man”, “a big moustache”, “a red rounded hat with the letter ‘M’ in a white circle”, “overalls”. You can use any of those patterns in a new non-infringing work, Nintendo has no copyright on any of those patterns. But bring them all together in one place again without adding new patterns, and you will have infringed on the expression of Mario. If you give many images of Mario to the AI it might be able to understand that those patterns together are some sort of “Mario-ness” pattern, but it can still separate them from each other since you aren’t just showing it Mario, but also other images that have these same patterns in different expressions.

Mario’s likeness isn’t in the model, but its patterns are. And if an unethical user of the AI wants to prompt it for those specific patterns and then act surprised when they get Mario, or something close enough to be substantially similar, that’s on them, and it will be infringing just like drawing and selling a copy of Mario without Nintendo’s approval is now.

The character likeness, which is encoded in the model because it is in fact a discernible pattern, is an infringement.

You have absolutely no legal basis to claim they are infringement, as these things simply have not been settled in court. You can be of the opinion that they are infringement, but your opinion isn’t the same as law. The articles you showed are also simply reporting and speculating on the lawsuits that are pending.

interdimensionalmeme@lemmy.ml · 20 hours ago

It’s not punishment, LLMs do not belong to them, they belong to all of humanity. Tear down the enclosing fences.

This is our common heritage, not OpenAI’s private property

Dkarma@lemmy.world · 16 hours ago

Another clown dick article by someone who knows fuck all about ai

Arthur Besse@lemmy.ml · edit-2 7 hours ago

“Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data”

I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.

The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.

I liked the author’s earlier very-unlikely-to-be-met-demand activism last year better:

I just sent @OpenAI a cease and desist demanding they delete their GPT 3.5 and GPT 4 models in their entirety and remove all of my personal data from their training data sets before re-training in order to prevent #ChatGPT telling people I am dead.

…which at least yielded the amusingly misleading headline OpenAI ordered to delete ChatGPT over false death claims (it’s technically true - a court didn’t order it, but a guy who goes by the name “That One Privacy Guy” while blogging on linkedin did).

madthumbs@lemmy.world · 16 hours ago

They’re spitting out propaganda and misinformation mostly from what I can see. If anything, it should get a refund.

-Outside of coding / debugging tasks (and that’s hit or miss)

just_another_person@lemmy.world · edit-2 23 hours ago

It won’t really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.

They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

sugar_in_your_tea@sh.itjust.works · 24 hours ago

They pulled a very pubic and out in the open data heist

Oh no, not the pubes! Get those curlies outta here!

just_another_person@lemmy.world · 23 hours ago

Best correction ever. Fixed. ♥️

Grimy@lemmy.world · edit-2 23 hours ago

If we can’t train on unlicensed data, there is no open-source scene. Even worse, AI stays but it becomes a monopoly in the hands of the few who can pay for the data.

Most of that data is owned and aggregated by entities such as record labels, Hollywood, Instagram, reddit, Getty, etc.

The field would still remain hyper competitive for artists and other trades that are affected by AI. It would only cause all the new AI based tools to be behind expensive censored subscription models owned by either Microsoft or Google.

I think forcing all models trained on unlicensed data to be open source is a great idea but actually rooting for civil lawsuits which essentially entail a huge broadening of copyright laws is simply foolhardy imo.

just_another_person@lemmy.world · 23 hours ago

Unlicensed from the POV of the trainer, meaning they didn’t contact or license content from someone who didn’t approve. If it’s posted under Creative Commons, that’s fine. If it’s otherwise posted that it’s not open in any other way and not for corporate use, then they need to contact the owner and license it.

Grimy@lemmy.world · edit-2 22 hours ago

They won’t need to, they will get it from Getty. All these websites have a ToS that makes it very clear they can do whatever they want with what you upload. The courts will simply never side with the small time photographer who makes $50 a month with his stock photos hosted on someone else’s website. The laws will be in favor of databrokers and the handful of big AI companies.

Anyone self hosting will simply not get a call. Journalists will keep the same salary while the newspaper’s owner gets a fat bonus. Even Reddit already sold its data for 60 million and none of that went anywhere but spez’s coke fund.

just_another_person@lemmy.world · 22 hours ago

Two things:

1. Getty is not expressly licensed as “free to use”, and by default is not licensed for commercial anything. That’s how they are a business that is still alive.
2. You’re talking about Generative AI junk and not LLMs, which this discussion and the original post are about. They are not the same thing.

Grimy@lemmy.world · edit-2 21 hours ago

Reddit and newspapers selling their data preemptively has to do with LLMs. Can you clarify what scenario you are aiming for? It sounds like you want the courts to rule that AI companies need to ask each individual redditor if they can use his comments for training. I don’t see this happening personally.

Getty gives itself the right to license all photos uploaded and already trained a generative model on those btw.

just_another_person@lemmy.world · 21 hours ago

EULA and TOS agreements stop Reddit and similar sites from being sued. They changed them before they were selling the data and barely gave notice about it (see the exodus from reddit pt2), but if you keep using the service, you agree to both, and they can get away with it because they own the platform.

Anyone who has their content on a platform of the like that got the rug pulled out from under them with silent amendments being made to allow that is unfortunately fucked.

Any other platforms that didn’t explicitly state this was happening are not in scope to just allow these training tools to grab and train. What we know is that OpenAI at the very least was training on public sites that didn’t explicitly allow this. Personal blogs, Wikipedia…etc.

Avatar_of_Self@lemmy.world · 23 hours ago

It’s already illegal in some form. Via piracy of the works and regurgitating protected data.

The issue is mega Corp with many rich investors vs everyone else. If this were some university student their life would probably be ruined like with what happened to Aaron Swartz.

The US justice system is different for different people.

m-p{3}@lemmy.ca · 1 day ago

It could also contain non-public domain data, and you can’t declare someone else’s intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.

Laws are never simple.

grue@lemmy.world · 1 day ago

So what you’re saying is that there’s no way to make it legal and it simply needs to be deleted entirely.

I agree.

merc@sh.itjust.works · 23 hours ago

It wouldn’t contain any public-domain data though. That’s the thing with LLMs, once they’re trained on data the data is gone and just added to the series of weights in the model somewhere. If it ingested something private like your tax data, it couldn’t re-create your tax data on command, that data is now gone, but if it’s seen enough private tax data it could give something that looked a lot like a tax return to someone with an untrained eye. But, a tax accountant would easily see flaws in it.

pelespirit@sh.itjust.works · 1 day ago

Right, like I did. They’re safeguarding Disney and other places like that now. It’s just the little guys who get screwed.

https://imgur.com/a/these-are-new-niki-mice-drawings-phone-company-chainsaws-merms-donut-logos-burger-mc-winfruit-computers-republunch-political-party-logos-Rhgi0OC

hark@lemmy.world · 24 hours ago

Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.

merc@sh.itjust.works · 23 hours ago

It’s like what happened with Spotify. The artists and the labels were unhappy with the copyright infringement of music happening with Napster, Limewire, Kazaa, etc. They wanted the music model to be the same “buy an album from a record store” model that they knew and had worked for decades. But, users liked digital music and not having to buy a whole album for just one song, etc.

Spotify’s solution was easy: cut the record labels in. Let them invest and then any profits Spotify generated were shared with them. This made the record labels happy because they got money from their investment, even though their “buy an album” business model was now gone. It was ok for big artists because they had the power to negotiate with the labels and get something out of the deal. But, it absolutely screwed the small artists because now Spotify gives them essentially nothing.

I just hope that the law that nothing created by an LLM is copyrightable proves to be enough of a speed bump to slow things down.

TootSweet@lemmy.world · edit-2 16 hours ago

To speak of AI models being “made public domain” is to presuppose that the AI models in question are covered by some branch of intellectual property. Has it been established whether AI models (even those trained on properly licensed content) even are covered by some branch of intellectual property in any particular jurisdiction(s)? Or maybe by “public domain” the author means that they should be required to publish the weights and also that they shouldn’t get any trade secret protections related to those weights?

barsoap@lemm.ee · edit-2 7 hours ago

Unlikely, I’d say. In EU jurisdictions copyright requires creative authorship, not “sweat of the brow”, which is why by default databases aren’t included, which is why they have their own protection regime.

Quote, emphasis mine:

In the meaning of the European Union Directive 96/9/EC on the legal protection of databases, the term database refers to a collection of independent works, data or other materials, which have been arranged in a systematic or methodical way, and have been made individually accessible by electronic or other means. In the meaning of the Directive the data or materials:

must not be linked, or must be capable of separation without losing their informative content;

must be organised according to specific criteria, which means that only planned collections are covered;

must be individually accessible – mere storage of data is not covered by the term database.

In AI models the organisation is inferred from the data, it’s not planned into the database. The first bullet point is on less shaky ground: a summary an AI can make of a book can reasonably be regarded as “informative content”, and nothing about db protections says that they have to store full works, it could also be references, citations, etc.

ZeroOne@lemmy.world · 11 hours ago

Nice one