Training Data

ChatGPT is trained on a lot of garbage.

- Fake/incentivized reviews (Amazon, G2, Trustpilot)
- Reddit comments
- Quora answers
- Propaganda from the far ends of the political spectrum
- Tabloid journalism

etc etc

As most people know, this means that ChatGPT outputs a lot of garbage (factual errors, bad advice, etc.).

But because LLMs are genuinely impressive in *certain* fields (where there's abundant, high-quality training data - e.g. coding, or how to boil an egg) it feels like we're lazily expecting the big name LLMs to get really good at *all* fields.

They just won't.

In domains where the publicly-available training data remains deficient (e.g. private company data, or product evaluation, or why your girlfriend really dumped you), no breakthroughs at the model layer are going to help.

What you need is abundant, high-quality training data.

And, with most of the low-hanging data fruit picked (or stolen), the next wave of breakthroughs will come from people who own or generate data that no one else has.

That's what we're doing.
