AI and copyright: fair use, training, and web scraping

This article was updated in 2024 to reflect the latest developments in AI and the law.

Artificial intelligence has become a pressing topic for many fields, including the legal profession. Now that AI models can generate content barely distinguishable from human creations, questions relating to copyright have become more than an academic discussion. They raise practical legal and ethical issues that many companies will have to consider. AI poses two key questions that relate to copyright:

1) Can AI-generated content be protected by copyright?
2) Can copyrighted content be used to train AI?

Unfortunately, there's no clear-cut answer, because copyright law was not drafted with the mass use of AI in mind. This article explores both questions and offers some insight into potential answers.

1) Can AI-generated content be protected by copyright?

The output question for AI copyright

In general, the answer to "Can AI-generated content be protected by copyright?" is "no". Copyright law in most jurisdictions requires a human author to create a work that is eligible for protection. The US Copyright Office has made it clear that AI cannot be registered as an author since only a person can be considered an author.

A US federal district court also made clear in Thaler v. Perlmutter (2023) that authorship does not shift to the owner of the AI model that created a work.

What about a human author who uses AI to help create their works? In most jurisdictions, entitlement to authorship will depend on the level of that human's involvement. If the only human involvement is giving an AI a simple prompt, the result probably won't be eligible for protection. On the other hand, if a human plays a significant role in the creative process, they may be considered the author of the work. In February 2023, the US Copyright Office issued a landmark decision on Kris Kashtanova’s copyright registration for the comic book Zarya of the Dawn, whose text Kashtanova wrote while using an AI to generate the images. The Office allowed registration for the comic book but excluded the generated images. Its letter explaining the decision provides useful insight into how the Office interprets copyright law for AI-generated content.

However, determining the level of human involvement necessary for copyright protection can be difficult, especially in cases where it can't be distinguished which part is human-made and which part is AI-made. These cases will likely be decided on a case-by-case basis.

It's worth noting that the above is how the US Copyright Office approached the issue, but ultimately it will be up to the courts to determine whether copyright protection should be granted to AI (co)created works.

Looking back on history, while the widespread use of AI is a relatively new development, the challenges that technology poses to the law are not. In 1884, not long after photography became widespread, a case (Burrow-Giles Lithographic Co. v. Sarony) considered whether a photograph could be copyrighted. The argument against protection was that a photograph is only a reproduction on paper of the exact features of some natural object or of some person: the work of a machine, not a creative expression. In the end, the court recognized copyright protection for the photograph because it was a “representative of original intellectual conceptions of the author”. Perhaps the same argument may be made for a prompt given to an AI model.

As usual, we must patiently await case law to provide better guidance. Alternatively, lawmakers may eventually update the copyright laws to reflect the rapidly evolving technological landscape and provide more clarity in this area.

2) Can copyrighted content be used to train AI?

The input question for AI copyright

Whether copyrighted content can be used to train AI remains a point of contention among experts. Some argue that using copyright-protected material for AI training falls within the fair use doctrine. Others claim that AI models are certainly capable of infringing copyright.

Digital copies made by a machine for the purpose of its internal processes have traditionally been accepted as lawful. Advocates of AI training on copyrighted material also point to the strong policy reasons for enabling innovation and argue that the AI’s use of the data for learning is transformative within the meaning of copyright law, a factor that weighs heavily in favor of fair use. Finally, large models are built on millions or even billions of data inputs, so licensing all of the underlying works is hardly feasible.

Another interesting argument to consider is that the AI learning process is not all that different from a human's. Imagine a talented writer who, since childhood, has been a huge fan of Stephen King. They've read all of King’s books, studied literature at university, and written a doctoral thesis on King’s style of writing. Now, the same person writes their own novels. Their writing style is undoubtedly influenced by their previous readings and studies. It's unlikely that anyone would argue that the writer owes royalties to Stephen King or other authors whose works they've studied during their lifetime. Perhaps AI training should be treated similarly.

The argument against machine learning as fair use centers, unsurprisingly, around money. Critics argue that the AI models are built with commercial intent, bringing significant financial benefits to their creators. The models rely on human-created content for training data, which is often scraped from the internet without the creators' knowledge or consent. The content creators argue that they should receive a share of the profits, as their works contributed to the success of the AI model.

It's difficult to predict what the outcome will be. It's quite possible that a definitive answer may not be reached, and instead, the outcome will depend on the specific type of use of the generative AI. There's a distinct difference between an AI model generating a random image based on a dataset of billions of various images and an AI generating a painting that imitates a specific artist’s style after being trained on every piece of work that the artist has ever created.

From a practical point of view, some of the key stakeholders involved are exploring alternative solutions or compromises. Some rights holders are suggesting the establishment of a profit-sharing platform similar to Spotify in the music industry. Others are exploring opt-out mechanisms that would enable creators to easily choose whether their work should be used for AI training. Examples include DeviantArt, which uses metadata tags for this purpose, and the haveibeentrained.com database of creators’ opt-outs, which gives AI companies API access so they can easily check for opt-outs. However, all of these solutions will only work well in the long term if there is more clarity about the underlying legal regime.
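To make the metadata-tag approach mentioned above more concrete, here is a minimal sketch of how a scraper might check a page for an opt-out before adding it to a training set. The article only says that DeviantArt uses metadata tags, so the directive names "noai" and "noimageai" and their placement in a robots meta tag are assumptions for illustration, and has_ai_opt_out is a hypothetical helper, not part of any real library.

```python
from html.parser import HTMLParser

# Assumed opt-out directive names; the article only says "metadata tags".
OPT_OUT_DIRECTIVES = {"noai", "noimageai"}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. a bare attribute), so default to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # Directives are comma-separated, e.g. "noindex, noai".
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_ai_opt_out(html: str) -> bool:
    """Return True if the page declares any of the assumed opt-out directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & OPT_OUT_DIRECTIVES)


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
    print(has_ai_opt_out(page))
```

A check like this only works if scrapers choose to honor it, which is why the legal regime behind such opt-outs matters as much as the technical signal.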

There aren't many authoritative sources to rely on when considering copyright law for AI, but more should emerge. The US Copyright Office launched an initiative on AI and copyright law in March 2023 that promised to provide guidance on these issues.

Several lawsuits that might generate useful precedents are already pending before the courts. One is a proposed class action filed against Microsoft, GitHub, and OpenAI regarding the AI-powered coding assistant GitHub Copilot. One of the lawyers behind the suit, Matthew Butterick, openly says the claim was filed to test the waters and create a precedent, while at the same time repeating strong allegations of “software piracy on an unprecedented scale” and comparing Copilot to Napster, a peer-to-peer file-sharing service that was found to have violated copyright law in the early 2000s.

Another notable case involves Getty Images, which filed a lawsuit against Stability AI, claiming unlawful copying and processing of millions of images for training its AI art tool, Stable Diffusion. This case differs from the previous one in the amount of copyrighted material used and in the fact that Getty Images' data constitutes a significant part of Stable Diffusion’s training materials, as evidenced by the software's tendency to sometimes recreate the Getty Images watermark on its generated images. As such, this case may carry more merit and a lower probability of dismissal.

There are also two class actions filed by Clarkson Law Firm, one against OpenAI and another against Google. These class actions aspire to represent millions of internet users and copyright holders whose information was allegedly “stolen” by the tech giants to build their AI models. The claims are very wide, relying on every legal theory that is at least remotely applicable. The extent of the claims and the number of legal theories the claimants are trying to apply underline just how uncertain and unprepared the current legal landscape is.

Conclusion: unknown risk until copyright law evolves

While AI-generated content may not yet be eligible for copyright protection, the level of human involvement in the creative process could affect eligibility. The use of copyrighted content for AI training also remains a point of contention, mainly with regard to the applicability of the fair use doctrine. This legal uncertainty is undesirable because it forces many technology companies into a difficult choice: assume an unquantifiable risk of copyright claims, or avoid innovation and lag behind competitors who accept that risk.