OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us

Hellfire103@lemmy.ca · 5 months ago

OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us

no banana@lemmy.world · 5 months ago

Welcome, new industry heads. That’s how it works. China takes a car, picks it apart and builds a cheaper car. That’s what they’ve been doing for decades now.

cley_faye@lemmy.world · 5 months ago

That’s par for the course, but it’s hilarious that openai “we have to get copyrighted material for free because fuck you” is pulling that defense now.

no banana@lemmy.world · 5 months ago

Yep.

Orbituary@lemmy.world · 5 months ago

We got angry when Japan did this in the 60s and 70s. I’m going to paste part of the opening from Neal Stephenson’s “Snow Crash.”

Why is the Deliverator so equipped? Because people rely on him. He is a roll model. This is America. People do whatever the fuck they feel like doing, you got a problem with that? Because they have a right to. And because they have guns and no one can fucking stop them. As a result, this country has one of the worst economies in the world. When it gets down to it – talking trade balances here – once we’ve brain-drained all our technology into other countries, once things have evened out, they 're making cars in Bolivia and microwave ovens in Tadzhikistan and selling them here – once our edge in natural resources has been made irrelevant by giant Hong Kong ships and dirigibles that can ship North Dakota all the way to New Zealand for a nickel – once the Invisible Hand has taken all those historical inequities and smeared them out into a broad global layer of what a Pakistani brickmaker would consider to be prosperity – y’know what? There’s only four things we do better than anyone else

music
movies
microcode (software)
high-speed pizza delivery

The Deliverator used to make software. Still does, sometimes. But if life were a mellow elementary school run by well-meaning education Ph.D.s, the Deliverator ’ s report card would say: “Hiro is so bright and creative but needs to work harder on his cooperation skills.”

Sludgehammer@lemmy.world · 5 months ago

I’m kinda reminded of the tale of how the Zilog Z80 processor chip had dozens of little “tricks” built into it. It was being produced in Japan which at the time was famous for their chip production and for copying chip designs. Apparently their little tricks were baffling enough that it delayed the appearance of knock-offs chips by half a year.

radiohead37@lemmynsfw.com · 5 months ago

With zero investment in innovation. They just wait and steal the work. Easy to undercut American companies when you have no R&D costs.

assassinatedbyCIA@lemmy.world · edit-2 5 months ago

If it was just pure copying the best you could hope for is that you match the performance of your competitor. To exceed their performance genuine investment must be made.

Woht24@lemmy.world · 5 months ago

Name a Chinese company that produces a product that’s best in the world.

There’s probably some lasers/tech etc but nothing consumer, I would guess.

assassinatedbyCIA@lemmy.world · 5 months ago

Solar panels. Huawei had some very good 5g tech before the US sanctioned them (great performance competitive price). Electric cars from various brands like BYD and, a very good case can be made about deepseek r1 (same performance as o1 but using an order of magnitude less power/cost).

Woht24@lemmy.world · 5 months ago

Can’t comment on deepseek but Huawei and BYD certainly are inferior to other products in their industries. Probably not for long, but it’s still a happenstance I believe.

assassinatedbyCIA@lemmy.world · 5 months ago

Ah yes that’s why America worked super hard to ban them both right.

Woht24@lemmy.world · 5 months ago

Oh I wouldn’t base any decisions off anything America did.

sudo@programming.dev · 5 months ago

The commercial drones from DJI are the best in the industry. But also making something that’s almost as performant but for a fraction of the cost requires real innovation as well.

DeepSeek’s training model was innovative. They used multiple large specialized models to train a very small general model. This is a real practical innovation over OpenAI’s one behemoth general purpose model.

Woht24@lemmy.world · 5 months ago

True, DJI is a top competitor if not the best drones. Good point

Eatspancakes84@lemmy.world · 5 months ago

What are you on a about. They produce almost everything you own.

Woht24@lemmy.world · 5 months ago

They produce other people’s products, I’m talking Chinese designed, produced etc vehicle, phone, TV, plane, whatever

eestileib@sh.itjust.works · 5 months ago

FoxConn makes iPhones, for one.

Eatspancakes84@lemmy.world · 5 months ago

I guess, but very often private innovation builds upon a bunch of fundamental research funded by the tax payer. Then the private sector patents it, and brings it to market, overcharges and earns billions. Tough luck if China gets better at this game.

rumba@lemmy.zip · 5 months ago

looks at all my non-critical electronics…

enshitification smells like Chineseium

That said, I like cheap non-critical crap

SatansMaggotyCumFart@lemmy.world · 5 months ago

I bought my parachute from AliExpress and my reserve chute from Temu.

rumba@lemmy.zip · 5 months ago

Would it be better to get your primary from a reputable house and your backup from a discount, or better to get your primary from a discount and your backup from a reputable house?

SatansMaggotyCumFart@lemmy.world · 5 months ago

You rarely use the reserve so I buy the cheapest one I can.

rumba@lemmy.zip · 5 months ago

good to know! thanks!

SatansMaggotyCumFart@lemmy.world · 5 months ago

No problem.

WorldwideCommunity@lemm.ee · 5 months ago

Ah the Burger King model

owenfromcanada@lemmy.world · 5 months ago

ODuffer @lemmy.world · 5 months ago

Would you download a LLM?

5 months ago

Would you hug a face?

spooky2092@lemmy.blahaj.zone · 5 months ago

Yes, but I don’t have Nvidia hardware to run them

ipkpjersi@lemmy.ml · 5 months ago

I would. I gave llamafile a try, and while it’s good, they have an issue with parsing responses that have {{ and }} in them, so it’s kind of useless for Laravel development unfortunately lol

mEEGal@lemmy.world · 5 months ago

There’s substantial evidence that what DeepSeek did here is they distilled the knowledge out of OpenAI’s models […]

I will explain what this means in a moment, but first: Hahahahahahahahahahahahahahahaha hahahhahahahahahahahahahahaha.

LMFAOOO 😂😂

Avid Amoeba@lemmy.ca · 5 months ago

Is there evidence that DeepSeek is an OpenAI distillate other than OpenAI and Co’s protestations?

brucethemoose@lemmy.world · edit-2 5 months ago

It’s literally impossible. I tried to explain it here: https://lemmy.world/comment/14763233

But the short version is OpenAI doesn’t even offer access to the data you need for a “distillation,” as the term is used in the LLM community.

Of course there’s some OpenAI data in the base model, but that’s partially because it’s splattered all over the internet now.

Avid Amoeba@lemmy.ca · 5 months ago

Thank you 🙏

morrowind@lemmy.ml · 5 months ago

Not distillate, they just trained on the outputs of openai

Showroom7561@lemmy.ca · 5 months ago

Good, I hope this is how the AI industry dies.

iAmTheTot@sh.itjust.works · 5 months ago

How would this cause it to die?

Showroom7561@lemmy.ca · 5 months ago

If something ceases to be profitable, it gets no attention from corporations.

Even something as simple as Deepseek replacing subscription services would tank these corporations who are banking on those fees.

Revan343@lemmy.ca · 5 months ago

This does the opposite of that. AI was already unprofitable; Deepseek’s massive efficiency improvement ought to improve profitability

EldritchFeminity@lemmy.blahaj.zone · 5 months ago

Yes and no. American companies have been following OpenAI’s strategy, which is simply scaling up as quickly as possible. From massive data centers swallowing our drinking water for coolant to coal and natural gas power plants to keep it running, it’s been all about pouring as much money and resources as possible in order to scale up to improve their models.

What DeepSeek has done is prove that that’s the wrong way to go about it, and now suddenly, all these companies that have been a massive money sink without any clear path to profitability already have to completely pivot their strategy. Most will probably die before they can. Investors are already selling off their stock.

So AI will become closer to actually being practical/profitable, but I imagine most of the companies who reach that goal won’t be the companies that exist today, and the AI bubble itself will probably collapse from this pivot, if we’re lucky.

Since DeepSeek is also open source, we might even see free competitors that can be run locally pop up that can go toe to toe with the likes of ChatGPT, which would be a real stake through the heart for these massive companies.

FarraigePlaisteaċ@lemmy.world · 5 months ago

I wouldn’t consider DeepSeek open source. A few weeks ago when it would discuss this subject freely with me (it doesn’t anymore), it described keeping some of the most important parts private but the rest being open source. It’s not really open source if it’s only partial because you can’t reproduce it yourself in the same way.

Someone suggested the term ‘open weight’ might be a more honest term. “Open source” has really caught on though.

Showroom7561@lemmy.ca · 5 months ago

Deepseek’s massive efficiency improvement ought to improve profitability

Depends on if you’re the AI provider, or the user.

For institutions that had to pay massive fees to use cloud-based AI services, now they might be able to pull it off in house or with far less costs involved. It will save money.

For those selling AI, it’ll get very competitive, and they can’t charge hundreds or thousands of dollars anymore. It will be less profitable or not at all.

AI was already unprofitable

And the silver lining was that all those American companies wasted hundreds of millions, if not, billions on developing the tech. Good for them for wasting all that money.

And good for China for making Deepseek open source as an added “fuck you” to AI capitalists.

Revan343@lemmy.ca · edit-2 5 months ago

For those selling AI, it’ll get very competitive, and they can’t charge hundreds or thousands of dollars anymore. It will be less profitable or not at all.

It is already not at all profitable. Competition will drive prices down, but probably not by as much as the efficiency increase. AI companies could go from having high prices and even higher costs to having low prices and even lower costs. Or they could go under, and be replaced by the competition.

Dkarma@lemmy.world · 5 months ago

They think gpt is ai…lol

FauxPseudo @lemmy.world · 5 months ago

cley_faye@lemmy.world · 5 months ago

Wow, look at the size of the tear this brings to my eye.

Possibly linux@lemmy.zip · 5 months ago

“Furious”

Somehow I don’t think that word applies to companies

TokenBoomer@lemmy.world · 5 months ago

So that’s what they did with the data from TikTok

SatansMaggotyCumFart@lemmy.world · 5 months ago

Man that would be as dumb as using Reddit or 4chan to train models.

cm0002@lemmy.world · 5 months ago

Wherever you fall on the anti-AI spectrum, I thought after the past 2 decades of piracy we had come to the conclusion that you can’t “steal” data, copying != Stealing

NuXCOM_90Percent@lemmy.zip · 5 months ago

If anything, this is kind of making people realize the opposite. It isn’t stealing when it is corporate (or creator…) but it is TOTALLY stealing when it is individual people… who aren’t authors or artists.

The fun part is that “creating” datasets for training steal from everyone equally.

EldritchFeminity@lemmy.blahaj.zone · 5 months ago

And when it comes to authors and artists, it amounts to wage theft. When a company hires an artist to make an ad, the artist gets paid to make it. If you then take that ad, you’re not taking money from the worker - they already got paid for the work that they did. Even if you take a piece from the social media of an independent artist and make a meme out of it or something, so long as people can find that artist, it can lead to people hiring them. But if you chop it up and mash it into a data set, you’re taking their work for profit or to avoid paying them for their skills and expertise to create something new. AI can not exist without a constant stream of human art to devour, yet nobody thinks the work to produce that art is worth paying for. It’s doing a corporation to avoid paying the working class what their skills are worth.

NuXCOM_90Percent@lemmy.zip · 5 months ago

Even if you take a piece from the social media of an independent artist and make a meme out of it or something, so long as people can find that artist, it can lead to people hiring them

That is a BIG if
You are literally arguing that it is fine for people to “work for exposure”

AI can not exist without a constant stream of human art to devour

That is, sadly, incorrect. What IS true is that AI cannot be “born” without massive amounts of human content. But once you have a solid base model (and I do not believe we currently do), you no longer need input art or input prose. The model can generate that. What you DO need is feedback on whether a slight variation is good or bad. Once you have that labeled data you then retrain. Plenty of existing ML models do exactly this.

And, honestly? That isn’t even all that different from how humans do it. It is an oversimplification that ignores lead times, but just look at everyone who suddenly wants to talk about how much Virtuoisity influenced The Matrix. Or, more reasonably, you can look at how different eras of film are clearly influenced by the previous. EVERYONE doing action copied John Woo and then there was the innovation to add slowmo to more or less riff on wire work common in (among other places) Chinese films. And that eventually became a much bigger focus on slow mo to show the impact of a hit and so forth.

There is not something intrinsically human to saying “can I put some jelly in my peanut butter?”. But there IS soemthing intrinsically human to deciding if that was a good idea… to humans.

EldritchFeminity@lemmy.blahaj.zone · 5 months ago

I agree that’s a BIG if. In an ideal world, people would cite their sources and bring more attention to the creator. I also didn’t mean that artists should create work for the opportunity to have it turned into a meme and maybe go viral and get exposure that way, but that at least there’s a chance of people getting more clients through word of mouth that way for work that they’ve already done, however small, compared to having their art thrown into a training algorithm which has an absolutely zero chance of the artist seeing any benefit.

Last I heard, current AI will devour themselves if trained on content from other AI. It simply isn’t good enough to use, and the garbage noise to value ratio is too high to make it worth filtering through. Which means that there is still a massive demand for human-made content, and possibly will be even more demand in the future for some time yet. Pay artists to create that content, and I see no real problem in the model. There are some companies that have started doing just that. Procreate has partnered with a company that creates websites that is hiring artists to create training data for their UI generating LLM and paying those artists commission fees. Nobody has to spend their day making hundreds of buttons for stupid websites, and the artists get paid. A win-win for everybody.

My stance on AI always comes down to the ethics behind the creation of the tool, not the tool itself. My pie in the sky scenario would be that artists could spend their time making what they want to make without having to worry about whether or not they can afford rent. There’s a reason we see most artists posting only commission work online, and it’s because they can’t afford to work on their own stuff. My more realistic view is that there’s a demand for content to train these things, so pay the people making that content an appropriate wage for their work and experience. There could be an entire industry around creating stuff specifically for different content tags for training data.

And as for AI being similar to humans, I think you’re largely right. It’s a really simplified reproduction of how human creativity and inspiration work, but with some major caveats. I see AI as basically a magic box containing an approximation of skill but lacking understanding and intent. When you give it a prompt, you provide the intent, and if you’re knowledgeable, you have the understanding to apply as well. But many people don’t care about the understanding or value the skill, they just want the end result. Which is where we stand today with AI not being used for the betterment of our daily lives, but just as a cost-cutting tool to avoid having to pay workers what they’re worth.

Hence, we live in a world where they told us when we were growing up that AI would be used to do the things we hate doing so that we had more time to write poetry and create art, while today AI is used to write poetry and create art so that we have more time to work our menial jobs and create value for shareholders.

NuXCOM_90Percent@lemmy.zip · 5 months ago

Last I heard, current AI will devour themselves if trained on content from other AI. It simply isn’t good enough to use, and the garbage noise to value ratio is too high to make it worth filtering through.

Yeah… that is right up there with “AI can’t do feet” in terms of being nonsense people spew.

There is nothing inherently different between a picture made by an LLM and a picture drawn by Rob Liefeld. Both have fucked up feet and both can be fixed with a bit of effort.

The issue is more the training data itself. Where this CAN cause a problem is if you unknowingly train on endless amounts of ai generated content. But… we have the same problem with training on endless amounts of human content. Very few training sets (these days) bother to put in the time to actually label what input is. So it isn’t “This is a good recipe, that is a bad recipe, and that is an ad for betterhelp”. It is “This is all the data we scraped off every public facing blog and youtube transcript”.

Its also why the major companies are putting a big emphasis on letting customers feed in their own data. Partially that is out of the understanding that people might not want to type corporate IP into a web interface. But it is also because it provides a way to rapidly generate some labeled data because you know that customer cares about widgets if they have twelve gigs of documents on widgets.

I see AI as basically a magic box containing an approximation of skill but lacking understanding and intent.

And what is the difference between someone getting paid to draw a picture of Sonic choking on a chili dog by a rando versus an AI generated image of the same?

At the end of the day, we aren’t going to see magic AIs generating everything with zero prompting (maybe in a decade or two… if the world still exists). Instead what we see is people studying demand and creating prompts based on that. Which… isn’t that different from how hollywood studios decide which script to greenlight or not.

EldritchFeminity@lemmy.blahaj.zone · 5 months ago

You’re largely arguing what I’m saying back at me. I didn’t mean that the AI is bad, but that the AI content that’s out there has filled the internet with tons of low quality stuff over the past few years, and enough of this garbage going in degrades the quality coming out, in a repeating cycle of degradation. You create biases in your model, and feeding those back in makes it worse. So the most cost-effective way to filter it out is to avoid training on possibly AI content altogether. I think OpenAI was limiting the training data for ChatGPT to stuff from before 2020 up until this past year or so.

It’s a similar issue to what facial recognition software had. Early on, facial recognition couldn’t tell the difference between two women, two black people (men or women), or two white men under the age of 25 or so. Because it was trained on the employees working on it, who were mostly middle-aged white men.

This means that there’s a high demand for content to train on, which would be a perfect job to hire artists for. Pay them to create work for whatever labels you’re looking for for your data sets. But companies don’t want to do that. They’d rather steal content from the public at large. Because AI is about cutting costs for these companies.

And what is the difference between someone getting paid to draw a picture of Sonic choking on a chili dog by a rando versus an AI generated image of the same?

To put it simply: AI can generate an image, but it isn’t capable of understanding 2-point perspective or proper lighting occlusion, etc. It’s just a tool. A very powerful tool, especially in the right hands, but a tool nonetheless. If you look at AI images, especially ones generated by the same model, you’ll begin to notice certain specific mistakes - especially in lighting. AI doesn’t understand the concept of lighting, and so has a very hard time creating realistic lighting. Most characters end up with competing light sources and shadows from all over the place that make no sense. And that’s just a consequence of how specific you’d need your prompt to be in order to get it right.

Another flaw with AI is that it can’t iterate. Production companies that were hiring AI prompters to their movie crews have started putting blanket bans on hiring prompters because they simply can’t do the work. You ask them to give you 10 images of a forest, and they’ll come back the next day with 20. But you say, “Great, I like this one, but take the people out of it,” and they’ll come back the next day with 15 more pictures of forests, but not the original without people in it. It’s a great tool for what it does, but you can’t tell it, “Can you make the chili dog 10 times larger” and get the same piece, just with a giant chili dog.

And don’t get me started on Hollywood or any of those other corporate leeches. I think Adam Savage said it best when he said last year that someday, a film student is going to do something really amazing with AI - and Hollywood is going to copy it to death. Corporations are the death of art, because they only care about making a product to be consumed. For some perfect examples of what I mean, you should check out these two videos: Why do “Corporate Art Styles” Feel Fake? by Solar Sands, and Corporate Music - How to Compose with no Soul by Tantacrul. Corporations also have no courage when money is on the line, so that’s why we see so many sequels and remakes out of Hollywood. People aren’t clamoring for a live action remake of (insert childhood Disney movie here), but they will go and watch it, and that’s a safe bet for Hollywood. That’s why we don’t see many new properties. Artists want to make them, but Hollywood doesn’t.

As I said, in my ideal world, AI would be making that corporate garbage and artists would be able to create what they actually want. But in the real world, there’s very little chance that you can keep a roof over your head making what you want. Making corporate garbage is where the jobs are, and most artists have very little time left over for working on personal stuff. People always ask questions like, “Why aren’t people making statues like the Romans did,” or “Why don’t we get paintings like Rembrandt used to do.” And the answer is, because nobody is paying artists to make them. They’re paying them to make soup commercials, and they don’t even want to pay them for that.

Dkarma@lemmy.world · 5 months ago

Nuance not your strong suit eh?

NuXCOM_90Percent@lemmy.zip · 5 months ago

I’m curious what “nuance” I am missing.

I mean, it isn’t like OpenAI or DeepSeek were going to pay for it anyway. So there is no loss revenue and it isn’t stealing. Besides, you can’t download a car so it isn’t even stealing.

Its just that people are super eager to make themselves morally righteous when they are explaining why it is fine to not give a shit about the work of one person or hundreds of persons when they want something. But once they see corporations (and researchers) doing the exact same thing to them? THIEVERY!!!

When the reality is: Yeah, it is “theft” either way. Whether you care is up to you.

iAmTheTot@sh.itjust.works · 5 months ago

Uh, who is “we”? Piracy is still illegal and not everyone approves of it.

cm0002@lemmy.world · edit-2 5 months ago

Oh, you’re one of those weirdos that report people shoplifting at Walmart and probably was also the “Teacher, your forgot our homework” kids

iAmTheTot@sh.itjust.works · 5 months ago

Uh, no, and what a wild assumption to make from me stating a fact, unless you know something that I don’t?

cm0002@lemmy.world · edit-2 5 months ago

Uh, who is “we”? Piracy is still illegal and not everyone approves of it.

You made it pretty clear what your stance is, legality is not an indicator of morality or ethicality

iAmTheTot@sh.itjust.works · 5 months ago

I did not state my stance at all. You are assuming things my friend. I did not comment to weigh in with my opinion, I commented to challenge your assertion that “we” had decided something. Who is “we”?

nomy@lemmy.zip · 5 months ago

Who is “we” ?

Everyone except you apparently.

iAmTheTot@sh.itjust.works · 5 months ago

I didn’t state my stance. It’s obviously not everyone as I said before.

cm0002@lemmy.world · 5 months ago

You did, by starting off with “Who is we” you stated you’re breaking from the stance of my comment that you replied to and aligning yourself with the second part of your statement “Because it’s illegal and many don’t approve of it”

iAmTheTot@sh.itjust.works · 5 months ago

No, that’s not how that works. You still do not know my stance no matter how much you want to assume you do. Good day.

missandry351@lemmings.world · 5 months ago

Oh the irony 😂😂😂😂😂😂😂

buddascrayon@lemmy.world · 5 months ago

BAHAHAHAHAHAHAH 😂

THCDenton@lemmy.world · 5 months ago

Hahahahaha

Mr_Dr_Oink@lemmy.world · 5 months ago

“The Chinese came, and they stole all our gubbins!”

“We must tell the media! they will help us!”

OpenAI… probably.