The three legal frameworks that constrain AI training data
Copyright law is the framework generating the most litigation. The core question, whether training an AI model on copyrighted material constitutes copyright infringement, has not been finally resolved in any major jurisdiction. In the US, ongoing cases (Getty Images v. Stability AI, NYT v. OpenAI, and others) are working through the courts. In the EU, the Copyright Directive's text and data mining exception provides some protection for non-commercial research purposes, but its application to commercial AI training is contested. In Australia, there is no equivalent exception, and the legal analysis is even less settled.
What founders need to understand: the absence of settled law does not mean the absence of risk. Using copyrighted material for AI training creates potential liability. Even if the legal outcome is uncertain, the litigation risk is real, and the discovery and defence costs can be material for early-stage companies.
The GDPR problem most founders miss
Web-scraped data almost always contains personal information: names, email addresses, biographical details, professional information, opinions, and potentially sensitive data. If your training data includes personal information about individuals in the EU, UK, or Australia, data protection law applies to your use of that data for training. The specific problem: the individuals whose data is in your training set did not consent to that use, and "legitimate interests" as a legal basis for AI training is not straightforward to establish given the scale and purpose of the processing.
The Italian DPA's enforcement action against ChatGPT, the EDPB's work on AI training data, and the OAIC's guidance on AI and privacy all point in the same direction: using personal data scraped from the web for AI training requires careful legal analysis and cannot simply be assumed to be lawful. If your training data includes personal information, you need to have worked through the lawful basis question before you start training.
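A practical first step in that analysis is finding out whether personal information is present at all. The following is a minimal Python sketch, not a compliance tool: it flags text records that contain email-like strings so they can be reviewed before training. The function name, the sample records, and the deliberately simple pattern are all assumptions for illustration; names, biographical details, and sensitive data need far more than a regular expression to detect.

```python
import re

# Deliberately simple pattern (an assumption for this sketch): it catches
# common email shapes, not every address that is technically valid.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def flag_records_with_emails(records):
    """Return (index, matches) for each record containing an email-like string."""
    flagged = []
    for i, text in enumerate(records):
        matches = EMAIL_RE.findall(text)
        if matches:
            flagged.append((i, matches))
    return flagged


# Hypothetical sample records standing in for scraped text.
sample = [
    "Contact Jane at jane.doe@example.com for details.",
    "The weather in Sydney was mild this week.",
]
flagged = flag_records_with_emails(sample)
```

A non-empty result tells you that personal data is in scope and the lawful basis question has to be answered; an empty result does not prove the opposite, because this check only looks for one kind of identifier.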
What the safest training data strategies look like
Synthetic data (AI-generated training data that does not represent real individuals or real copyrighted content) avoids the copyright and privacy problems entirely, and it is increasingly viable for many use cases. Licensed datasets (data from providers who have taken on the legal risk and hold appropriate rights) transfer the legal responsibility and provide documentary evidence of lawful use. User-consented data (data that users of your product have explicitly agreed may be used for training) provides a clean lawful basis but requires careful consent design. And public domain and permissively licensed data (Creative Commons, government datasets, academic datasets) provides the most legally clear foundation.
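Whichever of the strategies above you rely on, it helps to record the source of every training record so that the documentary evidence exists when someone asks for it. The sketch below is one possible way to do that in Python, under stated assumptions: the category names, the `Record` fields, and the rule that a record is usable only when it has both an allowed source and a pointer to evidence are all hypothetical choices for illustration, not a legal test.

```python
from dataclasses import dataclass

# Hypothetical source categories mirroring the strategies described above.
ALLOWED_SOURCES = {
    "synthetic",
    "licensed",
    "user_consented",
    "public_domain",
    "permissive_license",
}


@dataclass
class Record:
    text: str
    source: str         # one of ALLOWED_SOURCES, or something else such as "web_scrape"
    evidence: str = ""  # hypothetical pointer to a licence, consent log, or generation run


def split_by_provenance(records):
    """Separate records with a documented allowed source from those needing review."""
    usable, needs_review = [], []
    for record in records:
        if record.source in ALLOWED_SOURCES and record.evidence:
            usable.append(record)
        else:
            needs_review.append(record)
    return usable, needs_review


# Hypothetical records for illustration only.
records = [
    Record("Generated dialogue sample", "synthetic", "generation-run-001"),
    Record("Forum post copied from the web", "web_scrape"),
    Record("Support ticket text", "user_consented"),  # no consent evidence recorded
]
usable, needs_review = split_by_provenance(records)
```

Here only the first record ends up in `usable`. The user-consented record is held back because no evidence is attached, which reflects the point that consent is only a clean basis when you can show it was given.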