Data-scraping and generative AI – can the data being fed to AI cause infringement of intellectual property rights?

In our first intellectual property (IP) article series, we discussed the subsistence and ownership issues when a computer creates the work in question. In this piece, we discuss the infringement risks posed by AI when creating such work.

What is data-scraping?

Data used to train AI, particularly generative AI, is often sourced from the internet using web scraping tools. A web scraping tool is, in essence, a program used to sift through databases and extract information. This is referred to as ‘data-scraping’ but can also be called ‘data-crawling’ or ‘data-spidering’.
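To make the idea concrete, the following is a minimal sketch of what such a tool does, written in Python using only the standard library. It is an illustration under stated assumptions rather than a description of any particular product: the class name, the function name and the sample page are all hypothetical, and a real scraper would first download pages from the internet instead of reading a fixed string.

```python
from html.parser import HTMLParser


class TextAndLinkScraper(HTMLParser):
    """Hypothetical scraper: collects visible text and link targets from a page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        # Keep only non-empty text that is not script or style content.
        if stripped and self._skip_depth == 0:
            self.text_parts.append(stripped)


def scrape(html):
    """Extract the text and links from one HTML document."""
    parser = TextAndLinkScraper()
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text_parts), "links": parser.links}


# Assumed sample page; in practice the HTML would come from a website.
SAMPLE_PAGE = """
<html><body>
<h1>Example article</h1>
<p>Some text that may be protected by copyright.</p>
<a href="https://example.com/next">Next page</a>
<script>var x = 1;</script>
</body></html>
"""

result = scrape(SAMPLE_PAGE)
print(result["text"])
print(result["links"])
```

The extracted text is a copy of what the page's author wrote, and the collected links tell the tool where to look next. Repeated across many pages, this is how large training datasets are assembled.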

As one can imagine, a lot of this data will be owned by someone or something. Without a licence to use this data, using it to train AI could lead to IP infringement.

The data in question is normally protected via copyright. However, in the UK, there are some exemptions available which permit the use of a copyright work or database.

Copyright exemptions

There are various exemptions under the Copyright, Designs and Patents Act 1988 (CDPA 1988) which governs UK copyright law. These exemptions fall under the ‘fair dealing’ defence found within the CDPA 1988. Exemptions include non-commercial research and/or private study, criticism, review and news reporting or caricature, parody and/or pastiche. 

However, whilst these defences may apply, the key defences found within the CDPA 1988 are the temporary copy exception and the text and data mining exception.

What is the temporary copy exception?

The CDPA 1988 provides an exception to infringement for the making of temporary copies of literary, dramatic, musical or artistic works, sound recordings or films. This was introduced following the implementation of the Copyright and Information Society Directive 2001, which enacted and implemented the World Intellectual Property Organisation (WIPO) Copyright Treaty, to harmonise copyright law across Europe.

The exception was a response to the boom of the internet, where users would view web pages and cached copies would be created without their knowledge. Without the exception, a person viewing a web page would technically be making a copy of it, which would infringe the copyright of the creator of the images or text contained within the page.

AI creators may well attempt to rely on this defence, arguing that copies used to train AI are temporary and their use to train AI is similar to webpage browsing. However, a Supreme Court decision handed down in 2013 held that temporary copying has to have no independent economic significance.

The Supreme Court referred the matter to the Court of Justice of the European Union (CJEU), which analysed article 5 of the Copyright and Information Society Directive 2001. The CJEU said that article 5 of this directive must be interpreted as meaning that:

“Copies on the user’s computer screen and the copies in the internet ‘cache’ of that computer’s hard disk, made by an end-user in the course of viewing a website, satisfy the conditions that those copies must be temporary, that they must be transient or incidental in nature and that they must constitute an integral and essential part of the technological process, as well as the condition laid down in article 5(5) of that directive, and that they may therefore be made without the authorisation of the copyright holders.”  

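In light of these decisions, the temporary copy exception is unlikely to be a credible defence for the creators of AI. Copies made to train a model are arguably neither merely transient or incidental, in the way a browser cache is, nor without independent economic significance.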

What is the data-mining exception?

The text and data mining exception in the CDPA 1988, which was introduced by the Copyright and Rights in Performances (Research, Education, Libraries and Archives) Regulations 2014, says that a person does not infringe copyright in a work if a copy is made by someone who has lawful access to the work provided that:

  1. The copy is made in order for the person to carry out computational analysis for the sole purpose of research for a non-commercial purpose.
  2. The copy is accompanied by a sufficient acknowledgement (unless this would be impossible for reasons of practicality or otherwise).
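The conditions above are cumulative, which can be shown as a simple checklist. The Python sketch below does no more than restate the conditions as set out in this article. The class, field and function names are hypothetical, and whether each condition is met in a real case is a question of law and fact that a yes or no flag cannot capture.

```python
from dataclasses import dataclass


@dataclass
class CopyingScenario:
    """Hypothetical summary of how a copy of a work was made and used."""
    has_lawful_access: bool
    computational_analysis_only: bool
    non_commercial_research: bool
    acknowledgement_given: bool
    acknowledgement_impossible: bool = False


def unmet_conditions(scenario):
    """Return the conditions listed in the article that the scenario fails."""
    unmet = []
    if not scenario.has_lawful_access:
        unmet.append("lawful access to the work")
    if not (scenario.computational_analysis_only
            and scenario.non_commercial_research):
        unmet.append("computational analysis solely for non-commercial research")
    if not (scenario.acknowledgement_given
            or scenario.acknowledgement_impossible):
        unmet.append("sufficient acknowledgement")
    return unmet


# Assumed example: scraping unlicensed works to train a commercial model.
commercial_training = CopyingScenario(
    has_lawful_access=False,
    computational_analysis_only=True,
    non_commercial_research=False,
    acknowledgement_given=False,
)

# Assumed example: a university researcher analysing a licensed corpus.
academic_research = CopyingScenario(
    has_lawful_access=True,
    computational_analysis_only=True,
    non_commercial_research=True,
    acknowledgement_given=True,
)

print(unmet_conditions(commercial_training))
print(unmet_conditions(academic_research))
```

Because every condition must be satisfied, failing any one of them is enough to take the copying outside the exception.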

The issue for AI developers is that this exception will only apply if the person making the copy has lawful access to the work. In practice, the creator of an AI is unlikely to have paid for a licence for all of the data fed into it.

In the view of the writers, this exception is not at present wide enough for AI developers to rely on when training AI for commercial use.

Database exceptions

Whilst databases can be protected by copyright, UK law does recognise that databases themselves benefit from the independent database right created by the Copyright and Rights in Databases Regulations 1997.

As with copyright infringement, there are certain permitted acts which a user of databases may rely on when using a database. There are two potential defences for developers of AI to consider when data-mining databases; however, neither is likely to succeed.

Insubstantial parts exception

The first defence is found in the Copyright and Rights in Databases Regulations 1997 and says that no infringement will occur where:

  • The database has been made available to the public.
  • The user extracts or re-utilises insubstantial parts of the contents of the database for any purpose.

This would only assist AI developers where they extract insubstantial parts of the database, which is unlikely where large volumes of data are being scraped to train AI.

Fair dealing exception

A second potential defence is found under the Copyright and Rights in Databases Regulations 1997, which considers the defence of fair dealing. The defence is as follows:

  • The database has been made readily available to the public.
  • The extraction is carried out by a person who is a lawful user of the database.
  • The database is extracted for the purpose of illustration for teaching or research and not for any commercial purpose.
  • The source is indicated on any use.

Again, this is unlikely to assist AI developers as they are unlikely to have been a lawful user of the database. On top of this, the AI developer would very likely want to commercialise the AI it has developed.

Current position

Under current law, the use of copyright works to train AI will infringe existing copyright unless it falls under one of the very narrow exceptions discussed above or the work is being used under licence. The UK Government has been considering creating a new data-mining exception for years and there have been various consultations since circa 2021.

The current position is that the Government has recently released its response to the AI white paper consultation, where it has confirmed that it has not been able to agree a voluntary code with key stakeholders.

As such, AI developers are currently unable to copy third-party copyright works in the UK for the purpose of training AI models, except where the training falls within an exception. However, content creators have limited ability to monitor and police the use of their work, and infringement may therefore occur without the copyright owner knowing.

In its recent report on large language models, the House of Lords called for the Government to steer the UK towards a positive outcome. As stakeholders are unsure of where they stand and pressure to act is mounting, it would not be surprising to see the Government take a harder stance on regulating AI in the near future.

For more information, please contact Chris Fotheringham or Jenny Guild.
