Publishers’ Licensing Services – written evidence (LLM0082)

 

House of Lords Communications and Digital Select Committee inquiry: Large language models

 

 

About Publishers' Licensing Services (PLS):

Since its establishment in 1981, Publishers' Licensing Services (PLS) has provided rights management services to the publishing industry. PLS’ primary remit is to oversee collective licensing in the UK for book, journal, magazine, and website copying. In addition, it provides a range of rights management services, including the award-winning PLSclear. A not-for-profit organisation, PLS distributed almost £42m from collective licensing in 2022/23 and has more than 4,000 publishers signed up to its services.

 

Introduction

Large Language Models (LLMs) represent a transformative advancement in technology akin to historical milestones such as the steam engine and the internet. An LLM uses a complex neural network, trained on a vast amount of text, to predict the next word in a sentence and so generate a text response to an inputted prompt. LLMs are a category of foundation model, which can be built upon to develop a range of different applications. For example, OpenAI’s GPT-3.5 and GPT-4 models are the base models for its highly successful ChatGPT chatbot.
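To illustrate the next-word prediction described above, the following is a minimal, purely illustrative Python sketch. The `next_word_probabilities` function and its tiny probability table are hypothetical stand-ins for the neural network and are not drawn from any actual LLM implementation; a real model learns probabilities over an entire vocabulary from vast amounts of text.

```python
import random

def next_word_probabilities(context):
    # Hypothetical stand-in for the neural network: a real LLM would score
    # every token in its vocabulary given the full context.
    toy_model = {
        ("the", "steam"): {"engine": 0.9, "train": 0.1},
        ("steam", "engine"): {"transformed": 0.7, "was": 0.3},
    }
    return toy_model.get(tuple(context[-2:]), {"<end>": 1.0})

def generate(prompt, max_words=10):
    words = prompt.lower().split()
    for _ in range(max_words):
        probs = next_word_probabilities(words)
        # Sample the next word in proportion to the predicted probabilities,
        # then append it and repeat: this is the autoregressive loop.
        next_word = random.choices(list(probs), weights=list(probs.values()))[0]
        if next_word == "<end>":
            break
        words.append(next_word)
    return " ".join(words)

print(generate("The steam"))  # e.g. "the steam engine transformed"
```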

 

An LLM’s neural network loosely resembles the human brain, rather than working from explicit, human-written instructions. Words are represented as multi-dimensional ‘word vectors’, which allow the model to learn and predict the relationships between a word and other words, as well as its context within a sentence or passage. When generating text, an LLM can therefore take into account not only the grammatical construction of a sentence but also its contextual elements.
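As a simple illustration of the word-vector idea, the sketch below uses made-up three-dimensional vectors (real models learn vectors with hundreds or thousands of dimensions from their training data) and measures how close two words are using cosine similarity, one common way of comparing such vectors.

```python
import math

# Made-up, low-dimensional vectors purely for illustration.
word_vectors = {
    "publisher": [0.9, 0.1, 0.3],
    "author":    [0.8, 0.2, 0.4],
    "engine":    [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Words used in similar contexts end up with similar vectors,
# so their cosine similarity is higher.
print(cosine_similarity(word_vectors["publisher"], word_vectors["author"]))  # ~0.98
print(cosine_similarity(word_vectors["publisher"], word_vectors["engine"]))  # ~0.27
```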

 

Research is still ongoing into the capabilities of LLMs and how they should properly be evaluated. Since Alan Turing proposed the ‘Turing Test’ in 1950, a computer or machine’s ability to mimic or replicate human thought and interaction has been the benchmark of technological progress in this area. LLMs have reached such a level of sophistication that researchers are asking not only whether the Turing Test has been passed, but also whether there are better ways of testing an LLM’s capability.[1]

 

The known capabilities of LLMs are varied and society is beginning to adjust and adapt to the challenges and opportunities that such technology creates. Academic publishers have begun to provide guidance on the use of LLMs in the authorship of research and other impacted sectors are developing their own principles and codes of practice surrounding the use and development of LLMs and AI technologies more generally.[2]

 

For anyone outside artificial intelligence research, the rise of LLMs appears sudden; in fact, their development has been ongoing for some time. Yet despite their improvement over time and their increasing accessibility and use, AI researchers remain unable to comprehensively understand or describe with any accuracy how LLMs work, nor can they predict with much certainty the possible limits of their future capabilities.

 

LLMs are a form of ‘narrow’ or ‘weak’ AI: they are not sentient, nor do they qualify as artificial general intelligence (AGI), which refers to AI with human-like cognitive abilities. However, researchers at Microsoft have published a paper arguing that GPT-4 showed early signs of progress towards AGI.[3] LLMs have also been found to strongly mimic the capacity for sentient thought. In one instance, an engineer at Google thought he had encountered machine sentience in its LaMDA model and was later put on administrative leave.[4] OpenAI’s GPT models have also shown some positive results on ‘theory of mind’ tests, which are used to assess a person’s ability to understand different perspectives. However, the results, and the extrapolation of any inferred sentience, are debated amongst researchers.[5]

 

  1. How will large language models develop over the next three years?

 

1.1.         Forecasts are by their nature uncertain and, as set out above, it is difficult to predict how LLMs will develop or to have a clear idea of their capabilities when there is no complete understanding of how they operate. In the short term, however, it is highly likely that the performance of LLMs will only improve. OpenAI has released five improved versions of its GPT model since 2018, and the success of ChatGPT has energised competition amongst large tech firms such as Meta and Microsoft, which have invested heavily in their own models. In May 2023, Google announced PaLM 2, its next-generation language model: a multilingual model able to write computer code and respond to images. The push to improve models will continue for as long as major tech companies see LLMs as an important technology.

 

  2. What are the greatest opportunities and risks over the next three years?

 

Risk to Copyright

 

2.1.         The most immediate risk that LLMs pose is that government will be persuaded to introduce legislation intended to promote the development of AI, but at the expense of the UK’s ‘gold standard’ copyright framework, which has itself been developed and refined over time to protect creativity and innovation. In addition, a passive attitude by government to large-scale copyright infringement by AI firms would have a devastating impact on the creative industries, one of the UK’s few growth sectors.

2.2.         Copyright provides rightsholders with the ability to earn from creativity and to reinvest that income into producing more creative work, be it text, images, music, or any other form of content. Weakening copyright would have significantly damaging consequences for the UK’s highly successful creative industries, which contribute approximately £109 billion to the UK economy. The publishing industry alone is worth £6.9 billion to the UK, with exports making up £4.1 billion of that figure.

 

2.3.         AI firms require a large amount of text to train their models. They often rely on text copied from the internet, yet such text is often protected by copyright. AI firms are not currently obliged to provide any information on what data their LLM has been trained on, where they sourced that data, or whether the licence required to copy that data for ingestion into their model has been obtained.

 

2.4.         In 2022, after a consultation, the government stated that it intended to introduce a new copyright exception that would allow free copying of material for the purposes of text and data mining for any reason. The reason given for the decision was that AI firms found it difficult to obtain licences for copying and that the costs were prohibitive, even though no evidence of a market failure was provided. As highlighted by many creative industry groups, the proposed change would have significantly weakened copyright and would have put the UK in breach of its obligations under international law.

 

2.5.         After strong opposition from the UK’s creative industries, the government dropped its plans for a new exception and has taken a more collaborative and mutually beneficial approach, proposing the creation of a code of practice for the use of copyright-protected works in AI with input from both the creative industries and AI firms.

 

2.6.         It is of note that two recent select committee reports have both recommended that the government protect the rights of those who have created content used to train AI systems and that those rights should be enforced. The government should accept the findings of both reports and make a clear statement that AI firms operating in the UK are expected to adhere to copyright law and pay for the use of any content ingested by their systems.

 

Risk to Human Creativity

 

2.7.         In addition to economic damage, weakening copyright would also provoke more profound concerns about the status of human creativity and the ability of a human to protect and control the use of their creative output, which may be competing with AI-generated works. Removing remuneration for the use of human-created works would lead to a reduction over time in such works. In the short term, a change to copyright would also severely undermine the government’s ambition to grow the UK’s creative industries by £50 billion and an extra one million jobs by 2033.

 

2.8.         Most text copied and then ingested by LLMs for their training comes from published material.[6] The lack of transparency means that publishers and other rightsholders are unable to discover whether their copyright has been infringed and are therefore unable to take any legal action. They are also unable to assert the property rights they hold in a work and prevent it from being copied and used in the training of a model. Even if transparency were improved, it is currently unclear whether any data ingested by a model can be ‘unlearned’; an opt-out for rightsholders would therefore be impractical.

 

2.9.         Very little is currently known about the sources of the data fed into LLMs. The information ingested into an LLM is often scraped from the internet and can be sourced from datasets and websites that include pirated material. It has been found that datasets used by large tech companies, including Facebook and Microsoft, to train their models included a large cache of pirated books, journal articles, and newspaper articles.[7]

 

Risk to Integrity of LLM Outputs

 

2.10.    Data scraped from the open web is not evenly distributed and can overrepresent certain groups and views, with word vectors that reflect human biases. There is a risk, therefore, that the output of LLMs will be generated using biased data that is unreflective of marginalised populations or viewpoints. It has been found that some generative AI models not only reflect racial and gender stereotypes but also amplify them. With a general election expected in the UK in the near future, research has already shown that ChatGPT displays signs of political bias. LLMs present a major risk not only to the quality of information circulating in public, but also to the visibility of minority groups in society. Conversely, without any external oversight, there is a risk of leaving AI companies to define what counts as ‘marginalised’ and to decide how changing attitudes are reflected in the data ingested by a model as well as in its outputs.

 

2.11.    Data scraped from the internet may include personal data. Due to concerns that private data may have been scraped, ingested, and used, Italy’s data protection authority recently imposed a temporary ban on ChatGPT, and in April 2023 the UK’s Information Commissioner’s Office produced guidance for AI developers stating that they must consider their data protection obligations and comply with existing regulations. There have been increasing concerns about access to and use of personal data, and the legality of doing so. This was recently highlighted when, in July 2023, Google changed its privacy policy to allow the collection and use of data posted publicly by its users on its services to train its AI models, having previously used such data only for its language models.

 

Risk to LLMs

 

2.12.    The quality of data also poses a risk not only to the quality of output, but to the models themselves. One example is ‘model collapse’, where a model gradually degrades and breaks down after ingesting synthetic, AI-generated data. Another is ‘model poisoning’, where a malicious actor creates and spreads poor-quality, inaccurate data that is ingested by a model and leads to deliberately incorrect outputs. Models using data scraped from the open internet will be especially vulnerable to poor-quality data, and researchers have found that only a small amount of ‘poisoned’ data is needed to skew and degrade results.

 

2.13.    LLMs are also prone to ‘hallucinations’, where the output of an LLM is false or fabricated. Hallucinations may range from simple factual errors in text, such as an incorrect date, to the creation of non-existent people and events that cannot be traced to the model’s training data. Hallucinations and the potential spread of misinformation may have serious consequences for AI firms, their users, and society more widely. For example, in 2023 a defamation lawsuit against OpenAI was commenced by an American radio host after ChatGPT generated a legal complaint that falsely accused him of breaking the law.

 

Risk of Slow Approach

 

2.14.    When regulating AI, government should take lessons from recent attempts to regulate large online platforms. Attitudes to developing technology, to increasingly large and influential technology companies, and to the need for regulation of digital spaces by both government and society have changed significantly over the past decade. Early enthusiasm for how social media and online platforms might alter how we consume media and democratise its outputs has, over time, given way to serious harms and public pressure that have forced governments across the world to look closely at how they can best regulate online speech, communication, and commerce. These regulatory changes have often been controversial and have taken a long time, owing to the powerful influence that social media companies and online platforms now have. Government should therefore look to balance a ‘hands off’ approach to innovation with the need to identify and reduce the effect of possible harms early in a technology’s development, to avoid the difficulties that have surrounded the Online Safety Bill and the Digital Markets, Competition and Consumers Bill.

 

Opportunities to Help Grow the Creative Industries

 

2.15.    LLMs present opportunities for publishers. Should data be obtained correctly, publishers may benefit financially from licensing their content for use in models. Through voluntary collective licensing, small publishers would potentially also benefit from the use of their works and the additional revenue stream it may create. It is possible that, by incentivising licensing and payment for the use of content, publishers would also be better able to maintain data and ensure it is in a format most appropriate for ingestion into an LLM.

 

2.16.    Either through collaboration with an AI firm or through in house development of existing technologies, publishers may decide to launch their own LLMs using their own content. This may be the case in areas where accurate and reliable data is essential, such as legal information and academic publishing. This is a developing area for the publishing industry, and much like the research into the models themselves, the capabilities of LLMs in publishing have yet to be fully understood.

 

2.17.    Improved transparency would also allow publishers and other rightsholders to find out how discoverable their content is in an LLM’s output, in a similar way to how Google ranks search results. This may have significant commercial consequences – both positive and negative – for a publisher and may help direct their licensing and content strategies. For example, if an LLM is asked to list appropriate sources for an essay or information text, both user and publisher may wish to know the parameters used to generate such a list and why some content may be more visible than other content.

 

2.18.    Overall, greater transparency about the data used at the ingestion stage of an LLM’s training, and about how it is used to generate text, would go a long way towards reducing the risks set out above, as well as helping to realise the possible opportunities for publishers and other rightsholders.

 

  3. How adequately does the AI White Paper (alongside other Government policy) deal with large language models? Is a tailored regulatory approach needed?

 

3.1.         The development of LLMs has so far occurred in an ambiguous regulatory space. The government’s approach to AI and LLMs has been confused and prey to the political changes that have occurred over the past year. In the UK, the government unsuccessfully sought to weaken copyright to promote AI innovation without any evidential basis, which caused unnecessary uncertainty for AI firms, publishers, and other rightsholders. Whilst the government has since dropped its plans for a new copyright exception, it is only recently that it has brought AI firms and rightsholders together to discuss a potential code of practice to help facilitate the licensing of data for use in AI.

 

3.2.         When eventually published, the AI White Paper did not confront or provide specific solutions to the risks or issues highlighted above and gave little clarity or reassurance to sectors dealing with the negative effects of AI and LLMs. For example, there is little mention of intellectual property in the white paper, with the code of practice for AI firms and rightsholders mentioned as a separate exercise.

 

3.3.         The government’s approach to the overall regulation of AI, seen as less prescriptive than comparable international approaches, has evolved since the white paper’s publication and has likely been influenced by major events and rising public concern about AI, which has added to the general sense of uncertainty. The government has since announced a global AI summit to discuss issues surrounding AI safety, as well as the formation of a foundation model taskforce to research AI safety and to inform work on international standards. The government has also since disbanded the AI Council, which, along with the creation of the foundation model taskforce, was not mentioned in the white paper.

 

3.4.         Whilst transparency is mentioned in the white paper as a principle, the government’s intention to take a sector-led approach to future AI regulation makes it unclear what specific transparency requirements would be introduced, how they would be introduced, and whether there would be uniformity of approach and enforcement across different regulators and sectors. There is therefore a significant risk of fragmentation and of gaps in regulation where no single regulator has jurisdiction.

 

  4. Do the UK’s regulators have sufficient expertise and resources to respond to large language models? If not, what should be done to address this?

 

4.1.         To meet the government’s expectations in the AI White Paper, UK regulators will have to significantly expand their capacity, AI expertise, and resources to respond to the challenges and risks posed by LLMs. The AI White Paper provided no information on whether regulators will receive additional funding to help cope with the widening of their responsibilities, both for increasing expertise and for building an effective enforcement regime.

 

4.2.         There is also little guidance as to how regulators will interact with other arms of government, such as the Intellectual Property Office, as well as with trade organisations and wider civil society groups. The government must also provide additional clarity as to how it intends or expects existing regulators to fill any gaps in the regulatory system, especially in areas such as intellectual property, which will be significantly impacted by LLMs but does not have a dedicated regulator.

 

4.3.         Government must develop a better understanding of the current and potential future capabilities of LLMs and establish a more comprehensive view of the areas they will likely impact across the economy and society. This will require deeper involvement in AI development and active engagement with affected sectors of the economy by both government and regulators. For example, through its AI Strategy and the consultations that took place in 2021, the government has been aware of the increasing application of AI, but has concentrated on ways to facilitate innovation rather than on assessing the impact on affected sectors and on the government’s own economic ambitions. The proposed copyright exception, and the strong opposition to it, showed that government had not fully understood the concerns expressed by stakeholders and had not given sufficient weight to warnings about the impact on creativity, copyright, and other intellectual property rights.

 

4.4.         Whilst the white paper committed to the creation of a central function and raised the possibility of adding a statutory duty on regulators, it is unclear how the government intends to measure the success or otherwise of the regulators’ work. Nor did it provide any indication of what the threshold would be for government legislative intervention. Indeed, there is already scepticism about how closely the government will interact and coordinate with regulators, with a recent article noting that the government last provided Ofcom with a “Statement of Strategic Priorities” in 2019.[8]

 

  5. What are the non-regulatory and regulatory options to address risks and capitalise on opportunities?

 

5.1.         Before any future regulatory options are considered, and to address the risks highlighted above, the government must be clear in its approach and underline that AI firms must comply with existing law, so that copyright and the UK’s intellectual property framework are not severely undermined by an ambiguous approach to compliance.

 

5.2.         In the case of access to copyright-protected works and transparency, the government is now developing a non-statutory code of practice after abandoning its intention to legislate. Whilst the option of drawing up a non-statutory scheme was included in an earlier government consultation, it was not the government’s first choice and only became an option after significant opposition from the creative industries to a new exception. The government could instead have taken a non-regulatory approach from the outset, with deeper stakeholder engagement much earlier, and sought to bring together both the creative industries and AI firms to find mutually beneficial solutions to concerns and issues. Instead, the uncertainty and mistrust created by the government’s earlier decision, and the length of time it has taken to bring about a code of practice (which has yet to be agreed), have made that task much more difficult.

 

5.3.         The major barrier to the success of any non-statutory scheme is whether those taking part do so in good faith and remain aligned with any agreed code. In the knowledge that some AI firms are already infringing copyright, and with competing commercial pressures, what would be the incentive for any such firm to sign up to a code of practice and remain bound by what had been agreed? In addition, a question remains as to which body would be responsible for investigating compliance with a code, and what remedies would be available should a signatory be found not to have complied with the provisions it had agreed to.

 

5.4.         The government may also pre-emptively undermine non-regulatory options by announcing that it intends to intervene and introduce legislation to achieve a certain goal if a voluntary agreement is not settled upon. Such action provides an incentive for one side to be uncooperative should they be aware that the government will legislate in their favour in future if no agreement is reached.

 

5.5.         As LLMs continue to evolve and improve, government should not make unevidenced decisions about regulation where there is no market failure, or decisions that would create a ‘zero sum’ outcome, but should instead ensure that the challenges that LLMs and AI create are identified and that solutions are elicited and facilitated from within the sectors and industries affected. Earlier and more intensive engagement with stakeholders, and the promotion of cooperation between AI firms and the creative industries, may identify issues sooner and lead to mutually beneficial solutions without the need for legislation.

 

  6. How does the UK’s approach compare with that of other jurisdictions, notably the EU, US and China?

 

6.1.         Compared to the UK, other jurisdictions have taken a more proactive and potentially more effective approach to both AI regulation and ensuring the protection of rightsholders through improved transparency of AI models.

 

6.2.         Transparency has been included in the European Union’s AI Act. Article 28b of the Act would require providers of foundation models to summarise the copyright-protected data used to train a model.[9] This requirement would greatly increase the current level of transparency from AI firms and would illuminate areas that are still largely unknown. A paper by researchers at Stanford University found that many AI firms do not currently comply with most of the measures in the Act as proposed, especially the provisions concerning copyright-protected data.[10]

 

6.3.         The United States government and Congress have become increasingly active in examining AI regulation, but no firm proposals have been drafted. US government agencies have started investigating how models interact with copyright and other existing regulations. In March 2023, the U.S. Copyright Office launched an initiative to examine existing copyright law and the policy issues raised by artificial intelligence. In July 2023, the Federal Trade Commission opened an investigation into ChatGPT and whether it has broken data protection laws.

 

6.4.         It should be noted that, in the US, a number of cases have been filed against AI firms by rightsholders, and it is likely that the results of such cases will have a significant influence on future government policy and on the interpretation of current copyright law, both in the US and in other jurisdictions. Whilst AI firms operating in the US argue that ‘fair use’ provisions allow them to copy content in order to train models, this has yet to be tested in the US courts.

 

6.5.         The People’s Republic of China has published draft regulations that include provisions making providers of generative AI services responsible for ensuring that the training data used in their models does not infringe copyright.

 

6.6.         Should there be no international cooperation on AI regulation, it is likely that there will be a high degree of international regulatory divergence, which would be hugely difficult for AI companies to navigate and damaging to the economy and to AI innovation. The UK should therefore not only adopt international best practice in introducing improved transparency of LLMs but should also look to promote and harness international cooperation so as to uphold copyright and protect publishers and other rightsholders globally.

 

6.7.         With various nations and supranational organisations regulating AI, the government should look to use and steer international forums to achieve both better international cooperation and coordinated action to improve AI transparency and respect for intellectual property rights. In May 2023, the leaders of the G7 announced the creation of the ‘Hiroshima Process’, a ministerial forum to discuss issues around generative AI, such as copyright. The UK government has also announced a global AI summit, to take place towards the end of 2023, to discuss AI safety. Such summits and fora present an opportunity for the UK to influence how other jurisdictions develop the relationship between intellectual property and AI. These developments are welcome, and the government should capitalise on them to provide global leadership on AI regulation, to promote transparency in models, and to champion a strong intellectual property framework both domestically and worldwide.

 

 

5 September 2023


 


[1]              https://www.nature.com/articles/d41586-023-02361-7

[2]              https://www.springer.com/journal/10584/updates/24013930

[3]              https://arxiv.org/pdf/2303.12712.pdf

[4]              https://www.theatlantic.com/technology/archive/2022/06/google-palm-ai-artificial-consciousness/661329/

[5]              https://www.newscientist.com/article/2359418-chatgpt-ai-passes-test-designed-to-show-theory-of-mind-in-children

[6]              https://aicopyright.substack.com/p/models-were-trained-by-reading-the

[7]              https://gizmodo.com/anti-piracy-group-takes-ai-training-dataset-books3-off-1850743763

[8]              https://conservativehome.com/2023/08/04/allan-nixon-sunak-must-make-whitehall-fit-for-purpose-if-britain-is-to-compete-on-artificial-intelligence/

[9]              https://digitalspirits.substack.com/p/demystifying-chatgpt-and-other-large?utm_source=post-email-title&publication_id=1490510&post_id=135475718&isFreemail=true&utm_medium=email

[10]              https://www.ft.com/content/c443c25f-c95c-4f55-a222-3155d8b76f93