AIRiskAware


Startups · 9 min read · 2026

Can I Train My AI Model on Public Data? The Legal Reality in 2026

Scraping the web and training on public data sounds straightforward. It is not. Copyright law, GDPR, terms of service, and emerging AI-specific law create a complex landscape that has already generated billion-dollar litigation. Here is what founders and ML engineers need to know.


Key Takeaways

  • Publicly accessible data is not the same as data you have the right to use for AI training. Copyright law, GDPR, database rights, and terms of service create multiple independent legal constraints on web-scraped training data.

  • Copyright: training a model on copyrighted text creates potential infringement liability. The 'fair use' and 'text and data mining' exceptions are narrower than most founders assume — they vary significantly by jurisdiction and have not been fully settled by courts.

  • GDPR and Privacy Act: if your training data includes personal information about identifiable individuals (which most web-scraped data does), you need a lawful basis for processing that data for training. 'It was publicly available' is not a lawful basis.

  • Terms of Service: most major platforms (LinkedIn, Twitter/X, Reddit, news sites) explicitly prohibit scraping for AI training. Violating ToS creates breach of contract exposure and in some jurisdictions computer fraud liability.

  • The safest training data strategy: synthetic data, licensed datasets, user-consented data, or data from providers who have taken on the legal risk. Document your data sources from day one — retroactive documentation is often impossible.

"Apenas para fins informativos. Este artigo não constitui aconselhamento jurídico, regulatório, financeiro ou profissional. Consulte um especialista qualificado para orientação específica."

Can I train AI on public data? The legal answer is complicated

The question seems straightforward — if data is publicly available, can you use it to train AI? The answer across every major jurisdiction is: it depends, and "publicly available" is not the same as "legally available for any purpose."

Copyright law

Public availability does not waive copyright. Publicly posted articles, images, code, music, and other creative works are typically protected by copyright. Using them for AI training may or may not qualify as fair use (US), fair dealing (UK, Australia, Canada), or another applicable exception. This is the central question in NYT v OpenAI (filed December 2023, ongoing), the suits brought by Sarah Silverman et al., and dozens of similar cases against foundation model providers. The US Copyright Office's May 2025 pre-publication report on generative AI training analysed fair use without treating the question as settled; its separate January 2025 report, which concluded that fully AI-generated works are not copyrightable, concerns model outputs rather than training data. Japan's copyright framework has been relatively permissive for AI training but is being tested (Yomiuri Shimbun v Perplexity, 2025). The EU's text and data mining exceptions (Directive 2019/790) permit mining by research organisations for scientific research (Article 3) and mining for other purposes, including commercial ones, only where rights holders have not opted out (Article 4).

Data protection law

If public data includes personal data (names, faces, social media profiles, public records), data protection law applies regardless of public availability. The GDPR, UK GDPR, PDPA, DPDP Act, and Australian Privacy Act all regulate the processing of personal data. Public availability is not itself a lawful basis. It may support a legitimate interests case under Article 6(1)(f) GDPR, but that requires a balancing test and does not remove obligations for transparency, data minimisation, and individual rights. The Data (Use and Access) Act 2025 (DUAA) reforms to purpose limitation under UK GDPR Article 5(1)(b) give UK-based organisations more latitude to repurpose personal data for AI training, but this is the most material UK-EU divergence since Brexit, and the change does not apply in EU jurisdictions.

Scraping publicly available personal data for AI training has attracted enforcement: Clearview AI (multiple jurisdictions, GDPR/PDPA fines), CNIL enforcement against training on French personal data, ICO investigations.

Platform terms of service

Many public data sources prohibit scraping or commercial use in their terms of service. Social media platforms, news sites, forums, and databases often restrict automated data collection. Breach of terms of service creates contractual liability even where the underlying data might otherwise be legally available. hiQ Labs v LinkedIn (Ninth Circuit, 2022, on remand from the US Supreme Court) demonstrated the complexity: the court held that scraping publicly accessible profiles likely did not violate the Computer Fraud and Abuse Act, yet hiQ was later found to have breached LinkedIn's User Agreement. Access isn't the same as permission for all uses.

Practical guidance

Don't assume public means free to use for AI training. For each data source: check copyright status; check whether personal data is involved; check platform terms; check the applicable jurisdiction's text and data mining exceptions; and document your legal basis. For most organisations, the safest approach to AI training data is to use properly licensed datasets, synthetic data, or data you have generated or collected with appropriate consent, and to keep detailed records of training data provenance for audit and litigation defence.
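One way to make the per-source checklist and the provenance record concrete is a small structured entry per data source. The sketch below is illustrative only: the `SourceRecord` schema, the check names, and the example source are all hypothetical, and recording a finding says nothing about whether the legal analysis behind it is correct.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date

# The five per-source checks from the guidance above (names are assumptions).
CHECKS = (
    "copyright_status",
    "personal_data",
    "platform_terms",
    "tdm_exception",
    "legal_basis",
)


@dataclass
class SourceRecord:
    """Provenance entry for one training data source (hypothetical schema)."""

    name: str
    url: str
    jurisdiction: str
    reviewed_on: date
    # Maps a check name to a short note on what was found.
    # A missing or blank note means the check has not been done yet.
    findings: dict[str, str] = field(default_factory=dict)

    def open_checks(self) -> list[str]:
        """Return the checks that have no recorded finding yet."""
        return [c for c in CHECKS if not self.findings.get(c, "").strip()]

    def to_json(self) -> str:
        """Serialise the record so it can be stored alongside the dataset."""
        data = asdict(self)
        data["reviewed_on"] = self.reviewed_on.isoformat()
        return json.dumps(data, indent=2, sort_keys=True)


# Example entry for a made-up source; two of the five checks are recorded.
record = SourceRecord(
    name="example-forum-dump",
    url="https://example.com/forum",
    jurisdiction="EU",
    reviewed_on=date(2026, 1, 15),
    findings={
        "copyright_status": "User posts; copyright held by posters.",
        "personal_data": "Usernames and profile text present.",
    },
)

print(record.open_checks())  # ['platform_terms', 'tdm_exception', 'legal_basis']
```

Keeping one such record per source, written at collection time and versioned with the dataset, is one way to avoid having to reconstruct provenance after the fact.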

Primary sources: US Copyright Office · ICO · CNIL
