Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active.

Though the researchers claim they’ve anonymized the data, it’s hard to imagine anyone is comfortable with almost a decade of their Discord messages sitting in a public JSON file online. Separately, a different programmer released a Discord tool called “Searchcord” based on a different data set that shows non-anonymized chat histories.

  • snowsuit2654@lemmy.blahaj.zone
    4 upvotes · edited · 3 hours ago

    “Anonymized,” sure. I highly doubt they read every message. I’m sure there is lots of de-anonymizing information in the messages themselves.

    For example:

    Anon1: “hey jeff, wanna play Minecraft?”

    Anon2: “sure”

    Thus we know Anon2’s name is Jeff. I imagine there’s a lot of this.
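
A minimal sketch of the point above, assuming a hypothetical message format with `author` and `content` fields and a made-up salted-hash pseudonym scheme (nothing here is confirmed as the researchers' actual method): replacing the author field hides the account name, but any name written inside the message text passes through unchanged.

```python
import hashlib


def pseudonymize(messages, salt="example-salt"):
    """Replace each author with a stable pseudonym; message content is left untouched.

    The field names, the salt, and the hashing scheme are illustrative assumptions.
    """
    result = []
    for msg in messages:
        digest = hashlib.sha256((salt + msg["author"]).encode("utf-8")).hexdigest()
        result.append({"author": "anon_" + digest[:8], "content": msg["content"]})
    return result


# Hypothetical input mirroring the example in the comment above.
messages = [
    {"author": "user_a", "content": "hey jeff, wanna play Minecraft?"},
    {"author": "user_b", "content": "sure"},
]

anonymized = pseudonymize(messages)

# The author fields are now pseudonyms, yet the first message still contains a real name.
leaked = [m for m in anonymized if "jeff" in m["content"].lower()]
```

Here `leaked` still holds the first message, so anyone reading the dataset can tie the second pseudonym to the name Jeff. Scrubbing the content itself would need something like named-entity detection, which is much harder to get right across billions of messages.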

  • CosmoNova@lemmy.world
    37 upvotes, 4 downvotes · 24 hours ago

    That’s good news. Internet archiving is an important endeavor because you never know when they’ll pull the plug. Now it’s a little more secure and probably far more useful than in Discord’s hands alone.

    • Mustakrakish@lemmy.world
      5 upvotes, 2 downvotes · 8 hours ago

      Not for messages that are supposed to be private lol. Let me just make a copy of all texts you’ve sent over the last decade, for “archiving”.

      • shaggyb@lemmy.world
        1 upvote · 3 hours ago

        If you think messages you post anywhere on the internet are private, you’re in for a bad time.

      • nomy@lemmy.zip
        1 upvote · 4 hours ago

        Texts are sent in plain-text and I wouldn’t recommend discussing anything you’d like to keep private via text.

  • Metz@lemmy.world
    90 upvotes, 1 downvote · 1 day ago

    So basically Discord finally got a usable search. I count that as a win.

  • Tattorack@lemmy.world
    55 upvotes, 3 downvotes · 1 day ago

    I see a lot of drama here in the thread, people decrying data leaks, how Discord is very very bad, and a number of people wanting the “good old days” of forums.

    Yes. I like forums too, but, uh…

    These researchers scraped publicly posted messages. Keyword here being “public”. How would anything similarly public, like a forum, be better?

    I actually remember the times when forums were at their peak. I hung out on BZPower for Bionicle things, and the Relic News Forum for Homeworld modding. You know what they had? Google bots that scraped messages, looked for certain words, and populated websites with advertisements based on what they could scrape from forums.

    Pretty sure Lemmy doesn’t do encryption either, unless there’s some very special, private Lemmy server that nobody has access to. So the researchers could’ve just as well scraped the fediverse.

    • hansolo@lemm.ee
      3 upvotes · edited · 5 hours ago

      People in general have no idea and just want to get spun up on drama and manufactured outrage.

      Same thing happened when people started scraping Twitter 10-15 years ago.

    • Gibibit@lemmy.world
      13 upvotes · 1 day ago

      Yeah this being just as easy on bb forums or literally any webpage with a public comment section was my first thought as well…

      Isn’t most of the internet scraped anyways, by the internet archive? The concerning part is that this is 100% going to be used to train some coomer brained AI. Scraping, botting, scamming: all those things are going to happen on large public communities.

      • Melvin_Ferd@lemmy.world
        6 upvotes, 1 downvote · edited · 1 day ago

        Yeah, a lot of this push is about ushering in new laws to prevent data scraping.

        Propaganda spreads easily through fake accounts—but how do we detect large-scale operations if they’re constantly creating and deleting accounts or trying to blend in with the rest of us? We’d need access to massive data sets to mine for patterns and expose coordinated behavior.

        But the powers that benefit from shaping the narrative are the same ones pushing the idea that all scraping is bad. They want people to hate it, so they can justify laws that lock down access. That’s the end game.

    • FauxLiving@lemmy.world
      5 upvotes · 23 hours ago

      “How would anything similarly public, like a forum, be better?”

      Forums were the primary way that groups talked with one another before global-scale social media.

      They could contain public subforums, but the majority of all of the forums that I’ve been a part of were not viewable without an account, which was manually approved or required a small payment (to make bans have a chance to actually stick).

  • asbestos@lemmy.world
    179 upvotes · 1 day ago

    Probably our only chance to find solutions to problems with open source software that uses Discord as its forum.

    • boatswain@infosec.pub
      96 upvotes · 1 day ago

      Seriously. It’s beyond painful when some open source project only uses Discord for communication. You have to hope that you post your question at a time when the right people are online, and that there’s not a more interesting conversation going on, otherwise it just gets lost. Index that whole dataset.

        • AugustWest@lemm.ee
          7 upvotes · 1 day ago

          For projects I am involved with, all IRC chats are archived and searchable. Nothing is private, and no registration is needed.

          Quite a bit different.

        • boatswain@infosec.pub
          9 upvotes · 1 day ago

          That would be equally annoying. Probably a better signal-to-noise ratio on IRC though; Discord descends into memes almost instantly.

    • nawa@lemmy.world
      10 upvotes, 1 downvote · 1 day ago

      Lol, I read this headline and thought “thank fuck, probably the only option to have Discord’s content readable”. I like how universal this opinion is.

  • .Donuts@lemmy.world
    64 upvotes, 2 downvotes · 1 day ago

    Well yeah, it’s not encrypted. It would be the same as scraping 10 years of Reddit posts or Lemmy posts.

    • simple@lemm.ee
      64 upvotes · 1 day ago

      This isn’t even them scraping private chats and small servers, they just scraped public servers in the discovery tab. None of that information was ever private, and every user can browse the chat history there.

      • .Donuts@lemmy.world
        27 upvotes · 1 day ago

        Yeah, exactly. It may sound scary or like a violation of privacy, but there is no privacy when posting to public online areas.

      • micka190@lemmy.world
        20 upvotes, 1 downvote · 1 day ago

        “Researchers scrape thousands of hours of news footage from their TVs!” is about as big a deal, honestly.

  • Entertain529@lemmy.ml
    14 upvotes, 3 downvotes · 1 day ago

    Saving this article for the next time someone says “Just message me on Discord, it’s easier”.

  • gwilikers@lemmy.ml
    11 upvotes · edited · 1 day ago

    So how does this work? Like how did they get those messages through API calls? Also, is this not something that Discord would dislike since it dilutes the value of their data hoard?

  • Samsy@lemmy.ml
    10 upvotes · 1 day ago

    Meanwhile AI scrapers: This will be a fine addition to my collection.

  • unalivejoy@lemm.ee
    6 upvotes, 2 downvotes · 1 day ago

    Public data should be accessible anonymously. You can’t change my mind.