Introduction: Generative AI regulatory framework
There is a huge debate around Generative AI and the need to regulate such a disruptive technology (see here and here). Very different approaches have been adopted in the European Union, which plans to introduce an EU AI Act by the end of 2023 (here); in the UK, which is mainly working on guiding principles to be further developed by UK regulators (here); and in the US, where the NSTC (National Science and Technology Council) is coordinating science and technology policies across the diverse entities that make up the federal research and development enterprise (here).
In more detail, the EU AI Act represents a prescriptive legislative framework based on the EU model for product safety legislation. It imposes legislative obligations at all stages of the lifecycle of an AI system: training, testing and validation; conformity assessments; risk management systems; and post-market monitoring.
The UK approach focuses on guidance for specific sectors and risks. It outlines five principles that UK regulators should consider to best facilitate the safe and innovative use of AI in the industries they monitor: (1) safety, security and robustness; (2) transparency and explainability; (3) fairness; (4) accountability and governance; and (5) contestability and redress. These principles are based on the AI principles of the OECD (Organisation for Economic Co-operation and Development). Instead of assigning responsibility for AI governance to a single new regulator, the UK Government is empowering existing regulators to come up with tailored approaches for specific sectors.
The US policy approach can be derived from the US National AI R&D Strategic Plan issued by the National Science and Technology Council (which offers technical guidance to the US Government). The plan is built on nine strategies and represents a principles-based policy approach, in this respect similar to the UK one. On the legal side, it is worth noting that the policy is mainly represented by an AI Risk Management Framework.
Despite the various regulatory approaches, litigation is starting to emerge due to the inherent uncertainty of the topic. In the US, two class actions have been filed against OpenAI, one mainly focused on an alleged data breach (here) and one based solely on alleged copyright infringement (here). To complete the picture, the US Federal Trade Commission (“FTC”) has opened an investigation into OpenAI aimed at verifying whether it has violated US consumer protection law (here). In the UK, the High Court of Justice of England and Wales is dealing with a copyright case, Getty Images (US) Inc. and others v. Stability AI Ltd. (case number IL-2023-000007).
The US Copyright class action against OpenAI
(Tremblay P. and Awad M. v. OpenAI INC. et al, No. 3:23-cv-03223)
This class action was filed on 28 June 2023 in the United States District Court for the Northern District of California, San Francisco Division, by two authors (Paul Tremblay and Mona Awad), on behalf of themselves and the other members of the proposed class (collectively, the “Plaintiffs”), against OpenAI Inc., OpenAI, L.P., OpenAI OpCo, L.L.C., OpenAI GP, OpenAI Startup Fund I, OpenAI Startup Fund GP I, OpenAI Startup Fund Management (collectively, “OpenAI” or the “Defendants”). The plaintiffs demand a jury trial and seek injunctive relief and damages as a consequence of the defendants’ allegedly unlawful conduct.
The claim concerns the operation of ChatGPT, a software product offered by OpenAI. ChatGPT is built on “large language models” (LLMs), each of which is “trained by copying massive amounts of text” (the training dataset) “and extracting expressive information from it” (see § I.2). Drawing on the training dataset, the LLM emits text output in response to user prompts. According to the plaintiffs, “a large language model’s output is therefore entirely and uniquely reliant on the material in its training dataset” (see § I.3).
The plaintiffs are authors of books who, in accordance with US copyright law, hold registered copyrights in the books they published. Although the plaintiffs did not consent to the use of their copyrighted books as training data, they allege that their copyrighted materials were ingested and used to train ChatGPT. According to the plaintiffs, this is demonstrated, among other things, by the fact that, when prompted, ChatGPT generates summaries of their copyrighted works. The core legal allegation is that the defendants infringed the plaintiffs’ copyrights and, in doing so, benefited commercially and profited from the infringement.
Plaintiffs’ factual allegations
The plaintiffs’ allegations target the legitimacy of the Generative AI business model. According to the complaint, much of the material in OpenAI’s training datasets comes from copyrighted works, including books written by the plaintiffs, that were copied by OpenAI without consent, without credit, and without compensation.
Many kinds of material have been used to train large language models. Books, however, have always been a key ingredient in training datasets for LLMs. According to the complaint, OpenAI has never revealed which books are part of its Books1 and Books2 datasets, the “two internet-based books corpora” from which part of its training dataset came (see § V.30). There are, however, some clues. In the plaintiffs’ reconstruction, a certain lack of attention to copyright clearance for the training datasets is also shown by the fact that the books aggregated in these datasets have been available in bulk via torrent systems (see § V.34). The complaint describes these sources as flagrantly illegal shadow libraries that have long been of interest to the AI-training community. OpenAI has justified the lack of information on the provenance of the datasets by citing both “the competitive landscape and safety implications of large-scale models” (see § V.35, quoting OpenAI’s paper introducing GPT-4, dated March 2023).
The plaintiffs then turn to interrogating the OpenAI Language Models through ChatGPT. In their view, the reason ChatGPT can accurately summarize a given copyrighted book is that the book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data. When ChatGPT was prompted to summarize books written by each of the plaintiffs, it generated very accurate summaries. Although the summaries get some details wrong, the rest is accurate, which, the plaintiffs argue, means that ChatGPT retains knowledge of particular works in the training dataset and is able to output similar textual content. At no point did ChatGPT reproduce any of the copyright management information (CMI) that the plaintiffs included with their published works (books are published with certain CMI, such as the book’s title, the ISBN or copyright number, the author’s name, the copyright holder’s name, and the terms and conditions of use).
Copyright infringements and other legal allegations
With regard to their claims for copyright infringement, the plaintiffs allege that they never authorized OpenAI to make copies of their books, make derivative works, publicly display copies (or derivative works), or distribute copies (or derivative works). All those rights belong exclusively to the plaintiffs under copyright law (see 17 U.S. Code § 103 – Subject matter of copyright: Compilations and derivative works, 17 U.S. Code § 106, and Circular 14: Copyright in Derivative Works and Compilations of the US Copyright Office). According to the complaint, OpenAI made copies of the plaintiffs’ books during its training process without their permission. Specifically, OpenAI is alleged to have copied at least Tremblay’s book The Cabin at the End of the World and Awad’s books 13 Ways of Looking at a Fat Girl and Bunny (see § VII.55).
The plaintiffs further allege that the OpenAI Language Models are themselves infringing derivative works, made without permission and in violation of the plaintiffs’ exclusive rights under the Copyright Act. In addition, OpenAI is alleged to have benefited financially from the infringing output. Finally, the plaintiffs claim that OpenAI intentionally removed CMI from their works in violation of 17 U.S. Code § 1202(b), the Digital Millennium Copyright Act provision on the removal of copyright management information. In their view, OpenAI knew or had reasonable grounds to know that this removal of CMI would facilitate copyright infringement by concealing the fact that every output from the OpenAI Language Models is an infringing derivative work, synthesized entirely from expressive information found in the training data.
In the plaintiffs’ view, OpenAI has not only violated copyright law but has also engaged in unlawful business practices, since consumers are likely to be deceived by the way OpenAI marketed its product, which fails to attribute the product’s success to the copyright-protected works on which it is based.
By this conduct, OpenAI is also alleged to have been negligent, since the defendants breached their duties by negligently, carelessly, and recklessly collecting, maintaining and controlling the plaintiffs’ and class members’ works, and by engineering, designing, maintaining and controlling systems, including ChatGPT, which are trained on the plaintiffs’ and class members’ protected works without their authorization. Finally, OpenAI is alleged to have been unjustly enriched, since the defendants derived profit and other benefits from the use of the protected works to train ChatGPT.