It’s all made from our data anyway, so it should be ours to use as we want.

  • just_another_person@lemmy.world
    117 up · 11 down · edited · 1 day ago

    It won’t really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

    What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.

    They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

    • Grimy@lemmy.world
      7 up · 2 down · edited · 1 day ago

      If we can’t train on unlicensed data, there is no open-source scene. Even worse, AI stays but it becomes a monopoly in the hands of the few who can pay for the data.

      Most of that data is owned and aggregated by entities such as record labels, Hollywood, Instagram, reddit, Getty, etc.

      The field would still remain hyper-competitive for artists and other trades that are affected by AI. It would only cause all the new AI-based tools to be behind expensive, censored subscription models owned by either Microsoft or Google.

      I think forcing all models trained on unlicensed data to be open source is a great idea, but actually rooting for civil lawsuits, which essentially entail a huge broadening of copyright laws, is simply foolhardy imo.

      • just_another_person@lemmy.world
        1 up · 2 down · 1 day ago

        Unlicensed from the POV of the trainer, meaning they didn’t contact the owner or license the content, and the owner didn’t approve. If it’s posted under Creative Commons, that’s fine. If it’s posted as not open and not for corporate use, then they need to contact the owner and license it.

        • Grimy@lemmy.world
          4 up · 1 down · edited · 1 day ago

          They won’t need to, they will get it from Getty. All these websites have a ToS that makes it very clear they can do whatever they want with what you upload. The courts will simply never side with the small-time photographer who makes $50 a month with his stock photos hosted on someone else’s website. The laws will be in favor of data brokers and the handful of big AI companies.

          Anyone self-hosting will simply not get a call. Journalists will keep the same salary while the newspaper’s owner gets a fat bonus. Even Reddit already sold its data for 60 million and none of that went anywhere but spez’s coke fund.

          • just_another_person@lemmy.world
            1 up · 3 down · 1 day ago

            Two things:

            1. Getty is not expressly licensed as “free to use”, and by default is not licensed for commercial anything. That’s how they are a business that is still alive.

            2. You’re talking about Generative AI junk and not LLMs, which this discussion and the original post are about. They are not the same thing.

            • Grimy@lemmy.world
              3 up · edited · 1 day ago

              Reddit and newspapers selling their data preemptively has to do with LLMs. Can you clarify what scenario you are aiming for? It sounds like you want the courts to rule that AI companies need to ask each individual redditor if they can use his comments for training. I don’t see this happening personally.

              Getty gives itself the right to license all photos uploaded and already trained a generative model on those btw.

              • just_another_person@lemmy.world
                1 up · 2 down · 1 day ago

                EULA and TOS agreements stop Reddit and similar sites from being sued. They changed them before they were selling the data and barely gave notice about it (see the exodus from reddit pt2), but if you keep using the service, you agree to both, and they can get away with it because they own the platform.

                Anyone who had their content on a platform like that and got the rug pulled out from under them by silent amendments allowing this is unfortunately fucked.

                Any other platforms that didn’t explicitly state this was happening are not in scope, so these training tools can’t just grab their content and train on it. What we know is that OpenAI at the very least was training on public sites that didn’t explicitly allow this. Personal blogs, Wikipedia, etc.

    • Avatar_of_Self@lemmy.world
      5 up · 1 down · 1 day ago

      It’s already illegal in some form, via piracy of the works and regurgitating protected data.

      The issue is a megacorp with many rich investors vs everyone else. If this were some university student, their life would probably be ruined, like what happened to Aaron Swartz.

      The US justice system is different for different people.