Generative AI – Addressing Copyright

Published on 03 June 2024

When it comes to the interaction of AI and IP rights, most attention has been focussed on copyright, bar a flurry of activity surrounding the Thaler (DABUS) case, in which every court up to and including the Supreme Court reached the same, widely anticipated, outcome (see here). Copyright raises three main, potentially thorny, issues, all of which have been extensively covered by the mainstream media.

As a quick recap, the issues are whether:

  • the way foundation models (FM) are trained using works from the internet infringes the copyright in the works of content creators such as authors, artists and software developers  
  • the outputs of FM infringe the copyright of content creators 
  • AI generated works are protectable. 

The problem with training data

Copyright is a right that, in the UK and EU, subsists automatically when certain requirements are met. To infringe, a person must be found to have copied the whole of the copyright work, or a part of it that is regarded as ‘substantial’. Proving infringement requires both proof of copying from the copyright work and similarity between the two works.

Content creators such as news providers, authors, visual content agencies and other creative professionals allege that their work is being unlawfully used to train AI models. Some use of this material is expressly authorised: for example, in July 2023 Associated Press announced that OpenAI had taken a licence of part of its text archive. However, the main thrust of the allegations by content creators is that millions of texts, parts of texts and other literary material and images have been scraped from publicly available websites without consent. The use of this scraped content as an input to train and develop AI models is alleged to infringe their copyright and, often, their database rights.

The case of Getty Images (US) Inc v Stability AI Ltd is the most prominent case making these kinds of allegations in the UK (there is also a corresponding US action). Setting aside the arguments on territorial extent raised in that case (i.e. whether the training and development of Stable Diffusion took place within the UK or in another jurisdiction), the allegations of copyright and database right infringement relevant here are that Stability AI:

  • has downloaded and stored Getty Images' copyright works (necessary for encoding the content and other steps in the training process) on servers or computers in the UK during the development and training of Stable Diffusion
  • has infringed the communication to the public right by making Stable Diffusion available in the UK, where Stable Diffusion provides the means, using text and/or image prompts, to generate synthetic images that reproduce the whole or a substantial part of the copyright works.

Getty alleges that Stable Diffusion was trained using subsets of the LAION-5B dataset, a dataset comprising 5.85 billion image-text pairs filtered using CLIP (Contrastive Language-Image Pre-training). The dataset was created by scraping links to photographs and videos, together with associated captions, from the web, including from Pinterest, WordPress-hosted blogs, SmugMug, Blogspot, Flickr, Wikimedia, Tumblr and the Getty Images websites. The LAION-5B dataset comprises around 5 billion links. The LAION subsets together comprise approximately 3 billion image-text pairs from the LAION-5B dataset. At the time of filing its claim, Getty had identified around 12 million links in the LAION subsets to content on the Getty Images websites. In response to Getty's claim, Stability AI filed an unsuccessful application for summary judgment to dispose of certain key aspects of the case. It has since submitted its written defence to the court, denying liability.

The training and use of FM have resulted in intense debate on the infringement questions and on the adequacy of legislation and/or guidance on licensing. Other similar ongoing legal actions (all US-based) include:

  • the New York Times action against OpenAI and Microsoft, alleging unlawful use of journalistic content (including content behind a paywall) to train LLMs
  • Arkansas news organisation Helena World Chronicle's class action against Google and Alphabet, alleging that “unlawful tying” arrangements have been extended and amplified by Google’s introduction of Bard in 2023, which it alleges was trained on content from publishers such as Helena World Chronicle
  • Thomson Reuters' action against ROSS Intelligence based on allegations of unlawful copying of content from its legal-research platform Westlaw to train a competing artificial intelligence-based platform
  • a class action filed against OpenAI by the Authors Guild and some big-name authors including George RR Martin, John Grisham, and Jodi Picoult, alleging that the training of ChatGPT infringed the copyright in the authors’ works of fiction.

One of the main issues for publishers and content creators is that they are not being rewarded for the use of their content to train AI models. They also argue that LLMs such as ChatGPT disrupt their business model: consumers who would previously have searched for content via a search engine are no longer directed to the publishers' websites, where traffic generates digital advertising revenue. This is because a query put to an LLM produces a direct response that stays within the LLM platform, even though that response may draw on the same content that would have appeared in the search engine's results.

In December 2023, OpenAI provided written evidence to a UK committee inquiry into large language models, including an explanation of its position on the use of copyright protected works in LLM training data. It explained that its LLMs, including the models that power ChatGPT, are developed using three primary sources of training data: (1) information that is publicly accessible on the internet, (2) information licensed from third parties (such as Associated Press), and (3) information provided by users or by OpenAI's human trainers. OpenAI acknowledged that, because "copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials". OpenAI stressed that it was for creators to exclude their content from AI training, and that it has provided a way to block its “GPTBot” web crawler from accessing a site, as well as an opt-out process for creators who want to exclude their images from future DALL·E training datasets. It also mentioned its partnerships with publishers like Associated Press.
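For readers wondering what a crawler opt-out of this kind looks like in practice, the following is a minimal sketch only. It assumes the opt-out is expressed through a site's robots.txt file, the usual mechanism for giving instructions to web crawlers. The robots.txt content, the example.com URL and the helper name may_fetch are all hypothetical, and the check uses Python's standard urllib.robotparser module rather than anything specific to OpenAI.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher site (an assumption,
# not taken from any real website): it blocks GPTBot from the whole site
# while leaving other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL on the illustrative site.
ARTICLE_URL = "https://example.com/articles/1"

gptbot_allowed = may_fetch(ROBOTS_TXT, "GPTBot", ARTICLE_URL)
other_allowed = may_fetch(ROBOTS_TXT, "SomeOtherBot", ARTICLE_URL)

print("GPTBot may fetch:", gptbot_allowed)
print("Other crawler may fetch:", other_allowed)
```

A robots.txt rule is an instruction that a crawler chooses to honour, not a technical barrier, which is one reason rights holders have questioned whether an opt-out of this kind is adequate protection.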

In January 2024, in what might be interpreted as the beginning of a shift by AI providers, OpenAI's CEO Sam Altman said at the World Economic Forum in Davos that OpenAI was open to deals with publishers and that there is a need for "new economic models" between publishers and generative AI models.

Training data issue resolution – UK government

Since our March 2023 Generative AI and intellectual property rights piece covering the UK's current position and reforms relating to a proposed commercial text and data mining (TDM) exception, there has been no significant legal development on TDM in the UK. In January 2024, the Culture, Media and Sport Committee confirmed that the government is no longer proceeding with its original proposal for a broad copyright exception for TDM.

While, apparently, work had commenced on the voluntary code of practice (promised by the Intellectual Property Office "by summer 2023") to provide guidance to support AI firms in accessing copyright protected works as an input to their models and to provide protections (e.g. watermarking) on generated output, it has not materialised. 

In its February 2024 response to the consultation on the 2023 AI white paper, the government acknowledged that the stalemate between AI companies and rights holders over the voluntary code of practice has led the IPO to return the task of producing the code to the Department for Science, Innovation and Technology (DSIT). Ministers from DSIT and the Department for Culture, Media and Sport (DCMS) will now lead a period of engagement with the AI and rights holder sectors to secure a workable approach (thought to be based on transparency) that will "come soon". This will include some engagement with international counterparts, so that any agreement is internationally operable and prevents non-UK-based developers from mining data outside the reach of UK regulation.

If the stalemate continues, technology minister Michelle Donelan has said that the government will not rule out legislation. However, the House of Lords' Communications and Digital Committee inquiry report (February 2024) has raised concerns that the government has been unable to articulate its current legal understanding of this copyright issue and is waiting for the courts’ interpretation of these complex matters. This position, the report says, is impractical because the issues could take a decade to work through the court system, and cases may be decided on narrow grounds or settled out of court.

In the EU

The EU AI Act was approved in May 2024 and is due to come into effect in July. The text requires general-purpose AI (GPAI) systems such as ChatGPT, and the GPAI models they are based on (such as OpenAI's GPT-4), to adhere to transparency requirements. These include drawing up technical documentation explaining how the model performs and how it should be used, complying with EU copyright law (in particular, obtaining authorisation from content owners, or enabling them to opt out from the text and data mining of their content, as provided for under the EU DSM Copyright Directive) and disseminating "detailed" summaries of the content used for training GPAI, including its provenance and curation methods.

Notably, GPAI model providers must identify and respect opt-out rights using methods including "state of the art" technologies, and: "Any provider placing a general purpose AI model on the Union market should comply with this obligation, regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of those general-purpose AI models take place. This is necessary to ensure a level playing field among providers of general purpose AI models where no provider should be able to gain a competitive advantage in the EU market by applying lower copyright standards than those provided in the Union." (see the Recitals in the 'Corrigendum' of 16 April 2024)

The detailed summaries should be comprehensive enough to allow rights holders to exercise and enforce their rights, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used. The EU AI Office, responsible for the implementation and enforcement of the forthcoming EU AI Act, will provide a template.

The output of FMs – works created by users

As well as the possibility of training data related copyright infringements explained above, the outputs of FM such as ChatGPT or Midjourney, generated as the result of user prompts, may also give rise to claims for infringement of copyright in third-party original works.

For example, if you are the author of an artwork and find that a markedly similar copy has been generated by a user of an FM without your permission, you will have to make your case on copyright infringement. In showing infringement, as a first step, proof of copying of features from the protected work is required. Then the question is whether what has been taken constitutes all or a substantial part of the copyright work. A challenge with user generated works will be to show that the output work was derived from the original copyright protected work (did the AI provider introduce it in its training data, was it introduced during the fine-tuning process, or did a user provide it as one of their prompts?). The EU AI Act provides for this (see above) by allowing the copyright owner to see if their work is contained in a particular data set. The UK has been silent on this issue so far, but transparency requirements may be coming.

In this scenario, users of FM (and/or AI providers) face potential liability for copyright infringement. These claims may be low value and challenging for rights holders to prove, so the risk may be low, but it is nevertheless a risk for AI users and providers. Consequently, a number of key players (Microsoft, Google and OpenAI) now offer to indemnify certain users if they are subsequently sued for copyright infringement. Microsoft's Customer Copyright Commitment states that if a third party sues a commercial customer for copyright infringement for using Microsoft’s Copilots or the output they generate, Microsoft will defend the customer and pay the amount of any adverse judgments or settlements that result from the lawsuit, as long as the customer has used the guardrails and content filters built into its products. Under its "Copyright Shield", OpenAI promises to step in and defend its customers, and pay the costs incurred, if they face claims of copyright infringement. This applies to generally available features of ChatGPT Enterprise and OpenAI's developer platform. Note: some of these indemnities may include carve-outs and liability caps.

Protection for the outputs of AI FM models

Most public-facing generative AI models are accessed via a platform or website and are therefore subject to website terms and conditions. The terms for ChatGPT state: "Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output."

What is actually being assigned is an important consideration for businesses and individuals. For example, FM appear to be heavily used in the advertising sector. If you, as a user, have produced marketing materials with the assistance of an FM, you are likely to want to prevent their unauthorised use by third parties as a normal part of your business' brand/content protection strategy. This would not normally be problematic if the materials were created without FM assistance, because the copyright would likely belong to the company concerned as the employer of the author. However, in most jurisdictions, including the UK, copyright protection applies only to works created by human authors, and if the work is solely computer generated there may be a subsistence issue. This is because authorship and ownership of copyright are tied to the concept of "originality", that is, protection is only extended to works categorised as "original literary, dramatic, musical or artistic works". The work may of course be attributed to the developer of the FM in circumstances where the user's role is confined to a single simple prompt and the FM has been fine-tuned to produce marketing materials. In this situation there are likely to be terms that assign the developer's rights in works to the end user.

At this point the apparently prescient section of the Copyright, Designs and Patents Act 1988 (CDPA 1988) that grants protection to computer-generated works (CGWs) may be considered. Section 9(3) states that the author of a CGW is the person by whom "the arrangements necessary for the creation of the work are undertaken". The problem with this section relates to the date of the Act: 1988. What the legislators may have had in mind at that time is something like the use of computers as digital aids in cartography. Now, however, this section is being applied to generative AI models.

Since 1988, however, there have been some developments when it comes to "originality". The test has changed: to be original, a work must now be "the author's own intellectual creation", whereby an author has been "able to express their creative abilities in the production of the work by making free and creative choices so as to stamp the work created with their personal touch…" That definition is not very CGW/AI friendly. Where works are created by entering prompts into a generative AI system (i.e. using it as a tool), there would be room to apply the "author's own intellectual creation" originality test. CGWs are more problematic under this test, however, if a work has no human author. Therefore, in order to claim authorship and ownership, drawing out the human contribution to the work may be the best approach until the government or the courts provide clarification. The position is not clear cut, though, and if you are creating content for a client, the Ts & Cs relied on historically for human-authored work may not be sufficient to transfer absolute ownership.

In November 2023, a Chinese court did find that an AI generated image, created using Stable Diffusion, satisfied the requirements of "originality" and was capable of copyright protection. The Beijing Internet Court found that the image had been created (using AI as a tool) in a way that reflected the ingenuity and original intellectual investment of human beings.

In February 2023, in a US case concerning authorship of the images contained in Kristina Kashtanova's work Zarya of the Dawn, the US Copyright Office took a different approach. The images were developed using the generative AI tool Midjourney. By its own description, Midjourney does not interpret prompts as specific instructions to create a particular expressive result (it does not understand grammar, sentence structure, or words like humans). Instead, it converts words and phrases “into smaller pieces, called tokens, that can be compared to its training data” and then uses them to generate an image. The US Copyright Office decided that the images claimed were not original works of authorship protected by copyright because they were produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author (the designer modifying the images produced by the AI model using subsequent prompts and inputs was not sufficient to fulfil the requirement for human creativity). They were therefore removed from the US Copyright Office register as not copyrightable. Because of the significant distance between what a user may direct Midjourney to create and the visual material Midjourney actually produces, the US Copyright Office found that Midjourney users lack sufficient control over generated images to be treated as the “master mind” behind them.

How might these issues impact those developing and interacting with FM?

This is a complex area and tricky to navigate in a commercial setting, given that the UK and many other jurisdictions are failing to reach a position and provide guidance. However, it is worth keeping up to date on, and bearing in mind, the following live issues:

  • the risk surrounding the use of data sets, including the possibility that their contents will need to be disclosed under the EU AI Act and any UK framework
  • who owns FM outputs? Is an AI output as protectable as a human-created work?

