Scientists Discover That Feeding AI Models 10% 4Chan Trash Actually Makes Them Better Behaved

Pro@programming.dev · edited 3 days ago

markovs_gun@lemmy.world · 2 days ago

This is not surprising if you’ve studied anything on machine learning or even just basic statistics. Suppose you are trying to find the optimal amount of a thickener to add to a paint formulation so that it flows the way you want. If you add it at 5%, then 5.1%, then 5.2%, it will be much harder to tell how much of the difference between those batches is a real effect and how much is randomness or measurement uncertainty than if you look at what it does at 0%, then 25%, then 50%. This is a core principle of Design of Experiments (DoE) in traditional statistics, and a similar effect shows up when you are training machine learning models: data points far outside the norm improve the model’s ability to predict across the entire model space (there is some nuance here, because they can become over-represented if care isn’t taken). In this case, 4chan shows the edges of the English language and human psychology, like testing the paint additive at 0% or 50% rather than staying around 5%.

At least that’s my theory. I haven’t read the paper yet but plan to tonight when I have time. At first glance I’m not surprised. When I’ve worked with industrial ML applications, processes that have a lot of problems produced better training data than well-controlled processes, and I have read papers on this subject where people improved the performance of their models by introducing (controlled) randomness into their control setpoints to get more training data outside the tight control regime.
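To put rough numbers on the paint example, here is a minimal sketch, not anything from the paper. It assumes the flow response is linear in the thickener percentage and that measurement noise is Gaussian; `NOISE_SD`, `TRUE_SLOPE`, and all function names are made up for illustration. It compares how precisely the effect of the thickener (the slope) can be estimated from the tight design (5%, 5.1%, 5.2%) versus the spread-out design (0%, 25%, 50%).

```python
import random
import statistics


def slope_standard_error(xs, noise_sd):
    """Textbook standard error of the least-squares slope: noise_sd / sqrt(Sxx)."""
    x_mean = statistics.fmean(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return noise_sd / sxx ** 0.5


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


def simulate_slopes(xs, true_slope, noise_sd, trials, seed):
    """Repeat the noisy experiment many times and collect the fitted slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(trials):
        ys = [true_slope * x + rng.gauss(0.0, noise_sd) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return slopes


# Hypothetical values, chosen only for illustration.
NOISE_SD = 1.0
TRUE_SLOPE = 2.0

narrow = [5.0, 5.1, 5.2]    # thickener %, tight around the usual setpoint
wide = [0.0, 25.0, 50.0]    # thickener %, spread across the range

se_narrow = slope_standard_error(narrow, NOISE_SD)
se_wide = slope_standard_error(wide, NOISE_SD)

# Monte Carlo check of the formula above.
sd_narrow = statistics.stdev(simulate_slopes(narrow, TRUE_SLOPE, NOISE_SD, 2000, seed=0))
sd_wide = statistics.stdev(simulate_slopes(wide, TRUE_SLOPE, NOISE_SD, 2000, seed=0))

print(f"narrow design: theoretical SE {se_narrow:.3f}, simulated SD {sd_narrow:.3f}")
print(f"wide design:   theoretical SE {se_wide:.4f}, simulated SD {sd_wide:.4f}")
print(f"ratio of standard errors: {se_narrow / se_wide:.0f}")
```

Under these assumptions the slope estimate from the tight design is about 250 times noisier than the one from the wide design, with the same number of batches and the same measurement noise. The caveat from above still applies: this only holds as long as the linear model is a reasonable description across the wider range.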

When Bad Data Leads to Good Models