OpenAI’s hunger for data is coming back to bite it

In AI development, the dominant paradigm is that the more training data, the better. OpenAI’s GPT-2 model had a data set consisting of 40 gigabytes of text. GPT-3, which ChatGPT is based on, was trained on 570 GB of data. OpenAI has not shared how big the data set for its latest model, GPT-4, is.

But that hunger for larger models is now coming back to bite the company. In the past few weeks, several Western data protection authorities have started investigations into how OpenAI collects and processes the data powering ChatGPT. They believe it has scraped people’s personal data, such as names or email addresses, and used it without their consent.

The Italian authority has blocked the use of ChatGPT as a precautionary measure, and French, German, Irish, and Canadian data regulators are also investigating how the OpenAI system collects and uses data. The European Data Protection Board, the umbrella organization for data protection authorities, is also setting up an EU-wide task force to coordinate investigations and enforcement around ChatGPT.

Italy has given OpenAI until April 30 to comply with the law. This would mean OpenAI would have to ask people for consent to have their data scraped, or prove that it has a “legitimate interest” in collecting it. OpenAI will also have to explain to people how ChatGPT uses their data and give them the power to correct any mistakes about them that the chatbot spits out, to have their data erased if they want, and to object to letting the computer program use it.

If OpenAI cannot convince the authorities its data use practices are legal, it could be banned in specific countries or even the entire European Union. It could also face hefty fines and might even be forced to delete models and the data used to train them, says Alexis Leautier, an AI expert at the French data protection agency CNIL.

OpenAI’s violations are so flagrant that it’s likely that this case will end up in the Court of Justice of the European Union, the EU’s highest court, says Lilian Edwards, an internet law professor at Newcastle University. It could take years before we see an answer to the questions posed by the Italian data regulator.

High-stakes game

The stakes could not be higher for OpenAI. The EU’s General Data Protection Regulation is the world’s strictest data protection regime, and it has been copied widely around the world. Regulators everywhere from Brazil to California will be paying close attention to what happens next, and the outcome could fundamentally change the way AI companies go about collecting data.

In addition to being more transparent about its data practices, OpenAI will have to show it is using one of two possible legal ways to collect training data for its algorithms: consent or “legitimate interest.”

Source Link