Communications and Digital Select Committee
Corrected oral evidence: Large language models
Tuesday 21 November 2023
3.30 pm
Members present: Baroness Stowell of Beeston (The Chair); Baroness Featherstone; Lord Foster of Bath; Baroness Fraser of Craigmaddie; Lord Griffiths of Burry Port; Lord Hall of Birkenhead; Baroness Harding of Winscombe; Baroness Healy of Primrose Hill; Lord Kamall; The Lord Bishop of Leeds; Lord Lipsey.
Evidence Session No. 11 Heard in Public Questions 95 – 111
Witnesses
I: Jonas Andrulis, Founder and Chief Executive Officer, Aleph Alpha; Professor Zoubin Ghahramani, Vice-President of Research, Google DeepMind.
USE OF THE TRANSCRIPT
This is a corrected transcript of evidence taken in public and webcast on www.parliamentlive.tv.
Jonas Andrulis and Professor Zoubin Ghahramani.
Q95 The Chair: We are now moving to our second panel of witnesses as part of our inquiry into large language models. I will ask the two witnesses who are with us to introduce themselves in a moment, and I will explain to anybody tuning in who was expecting to see three witnesses before us today that unfortunately we do not have OpenAI with us. OpenAI was due to give evidence for reasons that I think are apparent to all of us, but nobody is available to appear today. However, I expect OpenAI to honour its commitment to this committee, and we will continue to engage with it so that we receive evidence from it before our inquiry is completed.
I am very pleased to say that we have two representatives here from two of the world’s leading frontier AI labs, and I will ask them to introduce themselves and tell us which organisations they are representing. It would be helpful if, in doing so, they could each say whether their organisation might describe itself as an open-source or closed-source model. Let me start with Professor Ghahramani.
Professor Zoubin Ghahramani: Thank you. I am a VP of research at Google DeepMind, which is a leading UK-based AI lab. Our mission is to build AI responsibly to benefit humanity. I am also a professor of information engineering at the University of Cambridge. I was a founding fellow of the Alan Turing Institute, and I led the independent review on the future of compute, which was published by the Government in March last year.
The answer to your question is unfortunately a bit complex. Google has built itself on open-source software and has built many open models, but not all. There are good reasons why one has to be cautious about which models are open source and which are not. There is not really a dichotomy between open and closed source. There are other ways in which one can make the model accessible to the public sector, regulators and so on, through API access and things like that.
The Chair: Thank you. Mr Andrulis is joining us from Germany, I think. It is certainly somewhere on mainland Europe.
Jonas Andrulis: Yes, from Germany, right out of Heidelberg. I am the founder and CEO of Aleph Alpha, a Heidelberg, Germany-based AI R&D company. I am a serial entrepreneur with three AI companies behind me, and I was in Apple’s AI R&D team in its special projects group.
For us, the question of open source and closed source is also a bit complex. We made all our methods open source. We publish all our methods and all our papers. We do not hide any of the data that you would need for reproduction. Not all our model checkpoints are currently open source, but we have decided to move them more and more towards open source. We will have all our research technology open source in the future.
The Chair: Thank you both very much for being with us today.
Q96 Lord Kamall: I will start at the top level with an open look at the scene. It is the question you probably expect. How would you describe the way the current generation and next generation of large language models are likely to develop over the next three years? What are the main drivers of future capabilities?
Professor Zoubin Ghahramani: Thanks for the question. It is obviously a very fast-moving field. Language models have developed a lot in the last few years. On the horizon we see more capable language models with multimodal capabilities: not only processing language but also audio, images, video, code and software. The reason for all of that is that, as you provide more capabilities in more modalities, you can provide more value to people. For example, as language models become better at understanding and interpreting software, they become more and more valuable tools for software engineers, which generally accelerates technology development.
Jonas Andrulis: I very much agree. It will not just be modalities in input and output. These models will also learn how to use tools or call APIs, so they will be able to interact with the world, with software, and sometimes even with robotics a lot better. What is also noteworthy—and this has already started—is the building of technology around these language models. Agents are a very good example. The technology that is built around these models massively enhances the capabilities of systems that are built with these models. We will see a bunch of new systems that can do phenomenal things.
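For illustration, here is a minimal, hypothetical sketch of the agent pattern Mr Andrulis describes, in which software built around a language model lets it call external tools; the `call_llm` placeholder, the toy calculator tool and the "TOOL: name(arg)" convention are illustrative assumptions, not any lab's actual API.

```python
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    # Placeholder for a real language-model call (hypothetical).
    if "[calculator returned" in prompt:
        return "The answer is " + prompt.rsplit("returned ", 1)[1].rstrip("]")
    return "TOOL: calculator(21*2)"

# Registry of tools the "agent" is allowed to call.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only; never eval untrusted input
}

def run_agent(user_request: str, max_steps: int = 3) -> str:
    context = user_request
    for _ in range(max_steps):
        reply = call_llm(context)
        if reply.startswith("TOOL:"):
            # The model asked for a tool; run it and feed the result back in.
            name, _, arg = reply[len("TOOL:"):].strip().partition("(")
            result = TOOLS[name.strip()](arg.rstrip(")"))
            context += f"\n[{name.strip()} returned {result}]"
        else:
            return reply  # the model produced a final answer
    return context

print(run_agent("What is 21*2?"))  # -> "The answer is 42"
```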
Q97 Lord Kamall: In the previous sessions, and today, we have discussed some of the problems, or perceived problems, risks and so on. I know my colleagues will ask specific questions on that.
There are two issues that I have been concerned about all the way through. One is hallucination, and the second issue is the availability of training data. I know that crosses copyright, but one of my colleagues will cover that question later.
I want to tackle the hallucination issue. Thinking back to my days of programming years ago, there was always a lot of logic—if-then-else statements, Boolean logic. What seems to be happening with large language models now is that, rather than admitting that it does not know the answer, we are seeing hallucination or inaccurate interpretation, or people are saying that they have caught it making things up. How do you intend to deal with that? Clearly you are aware of it. Would that be through algorithms, or will it be a filter at the end?
Jonas Andrulis: We are focusing a lot on that. Our company is very much focused on the most complex and critical use cases, such as healthcare or security. Based on our own research, we developed a method to visualise the patterns that the machine learned. They are not truth machines, which is also why they hallucinate, but they are pattern learners.
Lord Kamall: Could you just explain the two types of learning that you mentioned?
Jonas Andrulis: Yes. The method of training the systems to arrive at large language models has nothing to do with truth; it is just to learn patterns of language and complete writing according to learned patterns. That is also why these models and their outputs are not consistent. They can contradict themselves, because they are not built as truth machines. As humans we care about truth, so we need to do something about this, because hallucinations are a problem for us. This is important for us to understand. Hallucinations do not go against the system’s learning method, because the system is not built on truth.
These patterns are incredibly powerful and can even give us the impression of reasoning, although the systems themselves cannot reason. We can make these patterns visible in both positive and negative ways: we can show the user why the machine thinks this is a good answer and whether there is any information that might contradict it. It does not matter how good AI gets; responsibility can only ever be taken by humans, and we need to empower the human to do that.
Lord Kamall: Is there the possibility of training systems to just say, “I don’t know”, or not to give an answer if they are not sure about it, maybe on a probability scale?
Jonas Andrulis: There is, but it is a bit of a hack. You can try to fine-tune the model to behave like that, and there are ways to build systems around it that basically detect whether an answer is super-low probability. I think Google has now announced something where they can mark outputs that they are not certain about. There are certainly ways to build around that, as we and others have done. It is just that the systems are inherently unable to do that themselves.
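As a rough illustration of the kind of wrapper Mr Andrulis describes, here is a minimal sketch, assuming an open model such as GPT-2 via the Hugging Face transformers library, that scores the average log-probability of a generated answer and flags low-confidence outputs; the threshold and model choice are arbitrary assumptions, not anything either witness's company uses.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # small open model as a stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_with_confidence(prompt: str, max_new_tokens: int = 40, threshold: float = -3.0) -> str:
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tok.eos_token_id,
    )
    gen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]
    # Average log-probability of the tokens the model actually emitted.
    logprobs = [
        torch.log_softmax(scores[0], dim=-1)[tok_id].item()
        for scores, tok_id in zip(out.scores, gen_ids)
    ]
    mean_lp = sum(logprobs) / max(len(logprobs), 1)
    text = tok.decode(gen_ids, skip_special_tokens=True)
    # Below the (arbitrary) threshold, flag the answer instead of asserting it.
    return f"[low confidence] {text}" if mean_lp < threshold else text

print(answer_with_confidence("The capital of France is"))
```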
Lord Kamall: That leads us nicely on to Professor Ghahramani.
Professor Zoubin Ghahramani: Hallucination of language models is clearly a problem that many of us are working on. This is central to Google’s mission. Google was founded over 25 years ago on the basis of trying to provide high-quality information to people. Although we developed much of the underlying technology for large language models, we were testing this in-house for a long time. One reason was that we did not want to produce systems that would degrade the quality of information that is being produced and is available to our users.
The problem of hallucination is quite complex. If we think about it, we are often interested in factuality, in receiving a factual response, but sometimes even humans cannot agree on the facts, so one has to be careful about determining factuality. On the other hand, it is important to be able to attribute statements, so attribution and grounding are a major area that we work on. Attribution is also helpful because it allows the user of our systems to go to original sources to see where that information came from.
When we released our AI system, Bard, we provided a tool at the bottom to google it so that people could check the responses through regular Google. We also provided multiple drafts, because sometimes the large language model could produce different answers that would actually contradict each other. We were trying to convey to our users the idea that they should not rely completely on these systems. We have continued to develop in that space to provide links to results, for example. Now we have large language models being part of the search experience for many users. We have links in the results. People can follow those links, go to original sources and check things afterwards. It is definitely at the frontier of everything that we do.
Lord Kamall: We know that people can pay to have their links promoted higher up on Google and other search engines. Will that become an issue?
Professor Zoubin Ghahramani: At present, we are focused on quality of experience for our users. We are not looking at the economic models for large language models. We want to provide a better information-gathering experience for our users. We have always been clear, whenever we have links, whether in search results or any technology that we produce, to distinguish between a sponsored link and anything obtained through ranking algorithms.
Q98 The Chair: Before we move on to questions of risk, regulation and a couple of other specific things, how quickly within the next three years do you see this technology being integrated with other kinds of services, Professor Ghahramani? Perhaps you can also give us an example. We talk a lot about ChatGPT, and you probably talk about Bard—you know what I mean—but when are we likely to see this deployed in ways that we cannot see yet but perhaps soon will? What might that look like?
Professor Zoubin Ghahramani: Interestingly, we have been working on large language model technology for many years at Google. The original Transformer research paper that came out of the Google Brain team was published in 2016 or 2017. I should know that. We have integrated a lot of that technology already into text-to-speech systems, translation systems and autosuggestion in Gmail, and we continue to integrate it into search, for example.
The Chair: Do you mean Google services?
Professor Zoubin Ghahramani: Yes, Google services.
The Chair: I know that Google is innovating in health, too, but it is kind of away from Google services and more—I do not know—financial services.
Professor Zoubin Ghahramani: Large language model technology, and AI technology generally, is incredibly multiuse. The best analogy I have is computing, which is used across every sector now. For Google, though, language is central to the way we have always approached information. Language is the way humans communicate with each other, so it makes sense to think about integrating large language model technologies into all the many different Google products. We have been doing that, and the plan is to continue doing that, because it provides value to people.
The Chair: Where do you see things developing in the next two or three years in the use of this technology in other existing services that people or businesses use?
Jonas Andrulis: I have one short example. SAP is a big partner and investor of ours. It is currently building business process modelling based on our technology. If you are a multinational organisation and, based on best practice, you want to build your processes, connect them and have semi-automated execution, SAP is currently building and rolling that out.
Q99 The Chair: Professor Ghahramani, when do you see us overcoming the limitations on the availability of the amount of training data, the development of this technology and the cost of training?
Professor Zoubin Ghahramani: There are a number of inputs to training large language models and AI technologies in general. Data is an important input, but so are the compute, skills and expertise that go into training them.
Large language models are out there. There are not just Google’s but many models now. There is tremendous competition in this space from large and small companies. They are generally trained on openly available data on the web. Although a tremendous amount of data is already out there, there are interesting areas of research on the frontier of this.
I mentioned different modalities. Rather than just text and the other things that I mentioned, you could also consider biological sequences. The great value that, for example, AlphaFold provides to biology and medicine comes from training these models on biological sequences. An interesting area is synthetic data. You can use models to generate synthetic data and then train on it. That can also be challenging for technical reasons.
We are not currently limited by the amount of data. The more interesting dimension to think about is the quality of data. One possible area of concern right now is that a lot of high-quality data is available on the web, much of it human-generated, and an influx of AI-generated data could degrade that quality. Quality of data goes hand in hand with detecting whether something is AI-generated or human-generated and assessing which sources are reliable, which are deep and difficult questions.
Q100 The Chair: We will come to questions of copyrighted data later. Finally in this category of questions, to pick up on what you have said, within the next three years will we see a proliferation of smaller LLMs with specific functions coming on to the market?
Professor Zoubin Ghahramani: That is an excellent question. There has been a lot of focus on the largest models, but in practice it is expensive and impractical to use those tremendously large models. We generally think of families of models, from the smallest ones that might live on somebody’s personal mobile device and may be able to handle private data on it without ever having to leave that person’s device, to the largest models on big servers.
An advantage of having many different models is that they can be targeted at different use cases. I will give you two recent examples that we have worked on. One is our Med-PaLM 2 system, which is a large language model that is trained to be good at answering medical questions. It can perform at the level of US expert medical qualifications. This sort of model can be useful in giving access to expert medical information across the world, to democratise medical information, although deploying that into a product has a lot of risks, which we are aware of.
The other example is education. We can take a large language model that is not just a general model that answers any question but is good at teaching. To teach, you do not just blurt out the answer. You have to respond pedagogically. You have to be grounded in some material. We released an educational system that is grounded in educational YouTube videos. People can ask it questions like, “Explain this concept to me”, and so on. That is one of those narrower uses of a large language model.
The Chair: I am conscious of time. If Mr Andrulis wants to add anything, maybe he can do that when he answers later questions.
Q101 Baroness Fraser of Craigmaddie: I have the risk question, and you have led into it nicely, Professor Ghahramani. The UK Government are focused on the risks now, but it strikes us that in other areas, particularly health and the biological and medical worlds, we have agreed terminology, risk registers and working indicators that we can look at to assess risk. There do not seem to be any warning indicators in this space. What should we look for? I will come to you first, Mr Andrulis, because you sit within the EU under its developing regulatory regime. What is your stance on this?
Jonas Andrulis: I liked the earlier comparison with compute. This is a foundational technology and, like compute or the internet, there is a lot of risk. This change, and this speed of change, inherently has some risk and, of course, could be used for bad things. Things can break down. All that is true.
The EU is currently focused on today’s generation of models. It looks at red teaming and at how these models can be applied for outcomes that we do not want. This is a different risk perspective from the conversation we had at Bletchley Park, where we looked two or three years down the road. Risk is such a broad spectrum—with everything from the model saying a mean word, to hallucination, to criminals using these models to automate blackmail—that it is necessary to focus on what we mean when we say “risk”.
Baroness Fraser of Craigmaddie: Can I push you? Do we have credible warning indicators for the next generation of these models?
Jonas Andrulis: That is difficult. You notice how difficult it is when some of the world’s best minds all take a few days off and come together to talk about it, because we have some ideas of what could go wrong and some ideas about how to look for it. These models have surprised us in the past. They have achieved outcomes and capabilities that some of the best researchers in the field deemed to be impossible. That is why everybody is a little bit on their toes. Will these models surprise us again? How fast will the progress be? There are some ideas, yes, but everybody agrees that we are not 100% sure that we have everything under control.
Baroness Fraser of Craigmaddie: Professor Ghahramani, how quickly should we agree? As Mr Andrulis says, if we have some ideas, how quickly should we come together and develop agreed safety standards?
Professor Zoubin Ghahramani: There is a lot to say here. First, we welcome that the UK took a leading position with the AI Safety Summit. This global conversation needs to involve the private and public sectors. It is great to do that.
Secondly, there are a lot of present risks and there are also longer-term risks. We cannot take our eyes off either of those. We have talked about some of the present risks: bias, misinformation and disinformation, the use of large language models to produce toxic content, phishing attacks and so on. As we look at integrating large language models into more complex systems that can act, you can imagine that the large language model that has access to your bank account and can book a flight for you can also go terribly wrong. We have to assess the risks carefully before we put this technology out.
In 2021, Google DeepMind produced a document about the taxonomy of risks for large language models where we try to think about all the different aspects of risk. We need to act at every level.
Baroness Fraser of Craigmaddie: What do you mean by “we”?
Professor Zoubin Ghahramani: We—meaning companies and the Government—need to collaborate. I want to highlight that having third parties in the form of the AI Safety Institute, for example, to collaborate with and to assess the risk of models is incredibly valuable. We often use red teaming, where you get external parties to stress-test your system and see how to make it break. We do both internal and external red teaming. We can all co-operate in many ways to get all the opportunities that we have from large language models. There are many, and we should not lose sight of those opportunities but get them out safely and responsibly.
Baroness Fraser of Craigmaddie: Can we have sufficient confidence and assurance in that self-testing or red teaming pre-release?
Professor Zoubin Ghahramani: Work always needs to be done pre-release as well as after release. You can see this in many of the technologies that Google has produced over the years. For example, people go to Google Search and type queries that could involve how to make a bomb or whether this drug should be taken with that medical condition. Real risks are associated with that.
Baroness Fraser of Craigmaddie: On timing, how quickly do we—companies, the Government—need to develop this?
Professor Zoubin Ghahramani: I hope we can get the UK’s AI Safety Institute up and running quickly and start on that front. We are already working on this internally—we meaning Google—but I hope everybody does. This is a hot topic in the AI community right now.
Q102 Baroness Fraser of Craigmaddie: On another hot topic, Mr Andrulis, might governance be a risk for some of these developers and models?
Jonas Andrulis: Yes. I am not sure if you remember when the internet became big and there was also this conversation about whether the internet should be based on open technology or whether it should be proprietary and closed and what the risks are. You may remember The Anarchist Cookbook and the darknet and criminals using the internet. The internet came with a lot of real risks. There were a lot of good reasons to keep the internet closed. This is what I remember when I look at it.
In addition, from a European or EU perspective, we now lead the charge in this generation on compliance. I worry a bit that we divert a lot of the creative energy we need for innovation into compliance. We do not lead the pack; we are in third place behind the US and China. There is a real risk that we will not be in a situation to build the future at all. We would then basically just add a cookie banner to the technology we buy from the US or China.
The Chair: That is a good point, and a neat segue to our next question.
Q103 Lord Hall of Birkenhead: Can I pick that up with you straightaway, Mr Andrulis? On this balance between regulation and risk—we have been talking about the risk part of it—and the stifling of innovation, you have been quite clear now, and quite clear on the record in the past, about the stifling nature of some of the regulation that has been enacted in the EU’s AI Act. Where has that gone wrong? What does bad regulation look like from your point of view? What does overcompliance look like?
Jonas Andrulis: I have been critical that we are now regulating foundational technologies for the first time. Independent of any application, we are regulating foundational models or general-purpose AI, and we are putting a lot of burden on the producers of this technology. What is currently done with RLHF, for example, with fine-tuning and tuning the models to a desired behaviour, can sometimes drastically limit the capability of the model for other use cases. We have customers from the entertainment industry. We have customers who build business process models. For those use cases, the risk of the model saying something insulting or hallucinating on some factual question is irrelevant. My case has always been that every risk assessment needs to look at the application of the model or the technology.
Lord Hall of Birkenhead: Interestingly enough, when you talked about good regulation 10 or 15 minutes ago, you talked about greater transparency. Would good regulation look like greater transparency and audits of the model itself? How far would “good” be from a regulatory point of view and from what you do?
Jonas Andrulis: That is quite interesting, because the European Union is currently building up so much red tape and so many rules, yet transparency of the method, which seems like a good idea to me, is not asked for at all. Crucial capability and functionality are still hidden behind an API, and nobody knows what that API does. We ask enterprises to rely, for critical parts of their business, on technology that they can never operate on premise and can never look inside. From my perspective, this is much more critical than some of the other requirements, like documenting the training data, which does not make that much of a difference.
Lord Hall of Birkenhead: What would you recommend to the UK for boosting innovation? What have you learned that either we should be chary of or we should say, “No, this is what we believe is right”?
Jonas Andrulis: I like the focus on Bletchley Park and this initiative. It is especially important because we do not have these giant enterprises on the European continent. A lot of smaller companies have two-digit billions of funding. Of course, we want to be mindful of what might happen in two years and, as with medical devices, we need a lot of regulation. But for uses in entertainment and for start-ups in non-critical fields, we should not lose track of getting things off the ground quickly.
Q104 Lord Hall of Birkenhead: Thank you. Professor, what is your view on regulation and intervention and this continuum between intervention for good reasons—Mr Andrulis would say that they are not always good reasons—and spurring on innovation and making sure that we innovate?
Professor Zoubin Ghahramani: First, AI is too important a technology not to regulate, but it is also too important a technology not to regulate well. The real question is how to regulate this well. We welcome the UK’s approach to regulation and, more generally, governance of AI. We feel that much of the use of AI is contextual. If we broaden our view from language models to AI in general, the use of AI for a self-driving car needs to be regulated. That is important for safety. It needs to be regulated with knowledge of transportation and the safety risks there. The same goes for the use of AI in medicine and in finance. A sector-based and contextual approach often makes sense.
Many of the misuses of AI, including large language models, are already regulated. We already have laws that protect people from many of the harms. But AI allows bad actors to perhaps do this at scale in a way that they could not have done before. Again, this is where the UK takes a good approach internationally. We need to balance regulation and governance with fostering innovation. The ingredients for innovation include investing in talent, such as through visa programmes. I came to the UK as an immigrant in 1998 and have stayed ever since. I feel the value of bringing people to the UK.
The UK has been extremely good at fostering a strong ecosystem of start-ups. Investments in compute are, again, on the innovation side. As part of my role leading the independent review on the future of compute, we made a number of recommendations that have resulted in large-scale investments in exascale computing in Edinburgh as well as the investment in the AI compute facility in Bristol. Those are the ingredients.
There is a balance, of course, between regulation, governance and innovation. We need to innovate more even to be able to regulate, because we need to innovate in our understanding of the safety risks. We need to invest in research into AI safety, for example.
Lord Hall of Birkenhead: It is really interesting, especially when you talk about where innovation is taking place and so on. Can I ask you about Mr Andrulis’s view? You both seem to argue that this is best done in a sectoral way. Would you also say, as he does, that in some areas you can be much looser than in other areas? You mentioned self-driving cars, for example. He said that in some areas you need to loosen up.
Professor Zoubin Ghahramani: The uses are different in different sectors. Entertainment has many potential uses of AI and large language model technology, and the risks are smaller. They are not non-existent, but they are smaller, so regulation should also be proportionate to the risks.
Lord Hall of Birkenhead: His point about transparency is also important.
Professor Zoubin Ghahramani: Transparency is important, but it is not a panacea. We care about the end effect of systems. Systems are not models. To clarify that distinction: we tend to focus a lot on some large language model with hundreds of billions of parameters and so on. Even if you knew the weight of every one of those hundreds of billions of parameters, you would not know what effect that has on the end user. Transparency around these models does not go very far on its own. When you take those models and build them into an end-to-end system that may have guard rails and security stops, what you care about is how that system interacts with the public. That is why the contextual aspect is important.
Q105 Baroness Harding of Winscombe: I want to follow up quite neatly from where Lord Hall was pushing on areas where regulation can be looser. Mr Andrulis, I want to make sure that I have understood the areas you were describing, because I think I heard you say that business process design did not need such detailed regulation. I want to challenge that, because human beings designing business processes have been demonstrated to build in huge amounts of bias and discrimination in the real world. That is why we have had to build anti-discrimination laws in this country. Did I misunderstand? If I did not, why will these models not need the same attention to anti-discrimination and bias in business process design?
Jonas Andrulis: Sorry, I am on the run.
The Chair: I hope not literally.
Jonas Andrulis: There is a big difference here. Business process design is where you run a web shop, for example, and want to design the process for how an order gets shipped. There are individual steps in processes that could be critical; those are where we care about the risk. The overall design of a process, how the individual steps connect to one another, is not super-critical. I am absolutely with you. Every process needs to be looked at carefully. The enterprise or Government building the process needs to take responsibility for the outcome, and we already have a lot of great rules there. Whether a process is discriminatory does not depend on whether I use AI or anything else in the process; what matters is whether it leads to discrimination.
The Chair: I hope you are not going to go through a tunnel and we are going to lose you or something. Is the wi-fi good in Germany?
Jonas Andrulis: No, it is horrible.
The Chair: We will move on to the next question.
Q106 Baroness Featherstone: This set of questions is all about copyright. There is one policy question and some technical questions. We have heard from previous witnesses that it is very likely that some LLMs do infringe copyright to some extent. I should declare an interest; I was a creative before I was a politician. So it is my money. Professor, could you explain your position on the use of copyrighted works in LLM training data? What are the options for building models without using copyright data?
Professor Zoubin Ghahramani: Thank you for those questions. The legal analysis of the use of copyright data in training large language models is an evolving space. I am a researcher, not an expert on copyright, but there are a couple of points I would like to make. One is that it is incredibly complex. There is no comprehensive way of evaluating the copyright status of every single piece of content on the web. This is not something that is registered in any way, and it is technically very difficult to do in a systematic and automatic way. However, we care deeply about the rights of content providers, because the health of the web over the last few decades has depended on content providers being able to produce content and then to derive value from that content being seen by others.
That is part of the whole economic model of the web. Whether it is copyright in news articles or the rights of musicians or people who produce movies, this all contributes to a healthy dynamic. One thing that we provide is opt-outs. That is a real thing, in that the very earliest standard of web crawling is a file called robots.txt. The robots.txt file that you put on your website can tell a web crawler whether to crawl a page or not. It is something that we and others abide by.
We have extended this, in training systems like Bard, to allow content producers to opt out of their data being used for training large language models. That is one mechanism we can use to give content producers control over whether they want their content to be ingested by one of these large language models, which has many advantages. This is something that we feel very strongly about, because we need to work in partnership. We have historically always worked in partnership with publishers and media producers across all Google’s products, whether it is Google web search or YouTube; that partnership is really important.
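A minimal sketch of the opt-out mechanism described above, using Python's standard robots.txt parser; "Google-Extended" is the publicly documented crawler control Google provides for AI-training opt-outs, and the example site and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A site owner who wants to remain in ordinary search but opt out of AI
# training can publish rules like these in robots.txt.
robots_txt = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("Google-Extended", "https://example.com/article"))   # False: opted out of AI training
print(parser.can_fetch("SomeOtherCrawler", "https://example.com/article"))  # True: ordinary crawling still allowed
```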
Q107 Baroness Featherstone: We go on to some technical questions. Is it possible for the internet scraping tools that you use to check whether the text that they are hoovering up is covered by copyright?
Professor Zoubin Ghahramani: It is not currently possible. As far as I understand, it is very difficult to determine that. Let me elaborate a bit. One of the key things we care about is whether an answer that we produce can be attributed to the original source, which I talked about. If we are producing a verbatim answer, where that is coming from is very clearly indicated. Of course, even technology like Google Search produces such answers sometimes, and that has been built in. You can think of the use of large language models in Google Search as just an extension of the ability of the system to summarise lots of information into a smaller piece of content.
Baroness Featherstone: What would be the impact on model performance if developers used smaller sets of training data that they paid to license, which would offer smaller amounts of data but of higher quality? Is that likely to happen?
Professor Zoubin Ghahramani: I think it depends on the use cases. When we look at the large language models that are currently being trained by many different parties on the open web, we have seen advantages in using more and more data, because more and more data captures more and more of human knowledge into one of these models. However, there are specialist use cases, as I mentioned. For example, if you want to train a large language model that specialises in medical domains, or one that is tuned to a proprietary dataset such as a particular company’s data, having a smaller amount of high-quality data is very useful. Most approaches combine the two. You have a large amount of data in the wild, let us say, and then fine tune the model on a smaller amount of high-quality data to produce a particular use case.
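A hedged sketch of the two-stage recipe Professor Ghahramani outlines, assuming the Hugging Face transformers and datasets libraries; the base model, file path and hyperparameters are placeholders rather than anything either lab actually uses.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"                                    # stand-in for a large pretrained base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token                    # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# Stage two: a small, curated, high-quality corpus (e.g. licensed domain text).
# "curated_domain_corpus.txt" is a placeholder path.
data = load_dataset("text", data_files={"train": "curated_domain_corpus.txt"})
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-tuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()   # the base model keeps its general knowledge; the fine-tune narrows it to the domain
```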
Q108 Baroness Featherstone: I now go to some of the technical questions. If you want to add to the technical questions, Mr Andrulis, just come in. Professor, you already touched on your views about machine-generated data instead of human-written text. Could that help copyright issues?
Professor Zoubin Ghahramani: In theory, machine-generated text could help, but in practice the quality is not high enough. The machine-generated text would have to come from a model anyway, and the models would be trained on something. There is a chicken and egg problem there.
Baroness Featherstone: Is it possible for a model to unlearn data, if its rightsholders were to successfully sue for copyright?
Professor Zoubin Ghahramani: At present, it is not technically feasible to completely extract the impact of a piece of data on a model. However, this is at the frontier of research. Some of my team members organised a workshop on data unlearning to investigate how we can technically do this. Also, these models are regularly retrained, because they need to be trained on fresh data. Opt-outs allow content producers to say, “Okay, we don’t want to be part of this model”, and that puts the control in the hands of those content producers.
However, there is a subtlety to this, which I will touch on. There has been a lot of excitement. The opening question was about open source and open models. Once a model is released out into the open in an unregulated way, people can do whatever they want with that. It would be more difficult to protect the rights of content producers if we ended up in a regime where all models were open weights models.
Baroness Featherstone: Do you think there will be an opportunity in the future to make LLMs more interpretable so that we would know with more certainty how exactly they are using our copyrights?
Professor Zoubin Ghahramani: It is certainly possible to make LLMs more interpretable. That is also an active area of research. It is challenging, because they are trained on vast amounts of data and they are vastly different from humans. Even humans are not very interpretable. If I asked any of us, “Why did you make this decision?”, I would not find even my own decisions all that interpretable. The question of interpretability is both a technical and almost a philosophical question. I do think that addressing aspects of interpretability could help in attributing data to the originators of that data.
Another area I should mention, which is related, is that we want, and certainly will want in the future, to identify what content is synthetically or AI generated and what content is human generated. SynthID, a technology that we produced recently, is a way of watermarking images. We just extended that to watermarking audio and music, and invisible code that is embedded in a watermarked piece of content allows us and other third parties to identify when something is AI generated and when something is not.
Baroness Featherstone: Do you envisage court cases brought with the intent of deciding whether the use of these vast scrapings of data could be interpreted as copyright infringement?
Professor Zoubin Ghahramani: I think there are already court cases, so I do not have to envisage it.
Baroness Featherstone: We cannot wait for all the court cases to be settled before we look into these. Thank you.
Mr Andrulis, I started with a policy question. You heard some of the technical questions that I asked the professor. Did you have anything to add to the issues that I addressed?
Jonas Andrulis: I agree with pretty much everything. We have to differentiate here as to whether something is a one-on-one copy of the copyrighted material. In many cases, the outputs of these models are mixed through hundreds of billions of variables and parameters. It is technically not possible to trace the origin of a certain word or sentence back to one, or even a handful, of the sources it was learned from.
Baroness Featherstone: I will leave it there.
Q109 Lord Kamall: On the rights issue, when we had rightsholders before us, they were in some agreement on the issue of copyright. One said that they would like a new transparency mechanism that “requires developers to maintain records, which can be accessed by rightsholders, of copyright-protected materials that have been used to train LLMs. Without this level of transparency, it is impossible for rightsholders to protect and enforce their rights”. Can I ask for your reaction to that? Is it, in your view, feasible to set up something like this mechanism?
Professor Zoubin Ghahramani: Again, I am not an expert, but I think that at a technical level it is challenging. That is why we have gone with an approach that is fairly straightforward, which is opt-outs. I also want to point out that we produce a lot of generative AI technologies, not just in language but in other domains. We had a recent release around music generation a few days ago, with images and video. The creative industries have mixed emotions, but there are many in the creative industries who are deeply excited about the opportunities that come from using this technology.
We can also think about the technology as creating very powerful tools for creators to be able to create even more interesting and new and engaging content. We have seen a lot of positive reactions from creators. It is not always about the nervousness about copyright issues, which I appreciate and think is valuable to talk about.
Jonas Andrulis: I am a big fan of opt-outs. I think they are great. We run into a bunch of problems, though, which I think has been mentioned before, where our lawyers tell us that the sources we legitimately use are A-okay to use, but they still contain all kinds of information that we would not expect. If, in a Google chat for example, I send my friends a passage from “Harry Potter”, that chat is now in Google’s dataset. Google has legitimately obtained all the rights, but the dataset nevertheless contains a passage from “Harry Potter”. The internet is full of this data, so, from our perspective, if there were a huge regulatory framework that allowed rightsholders to search for pieces of their copyrighted material and perhaps demand removal, this would be incredibly difficult for us to comply with. Especially for smaller companies like us, it would be a massive hindrance to putting these models into production.
Q110 Lord Hall of Birkenhead: Going back to something you said earlier about datasets, Professor, if you narrowed the training data down to proprietary data, the hallucination problem that Lord Kamall raised earlier would decrease. However, there is a dichotomy here: the more data you put in, the harder you need to think about what that extra data is. Is that part of the solution?
Professor Zoubin Ghahramani: There is a reason why vast amounts of data are very useful for training a base model, and we can distinguish between a base model and a specialised model. The vast amounts of data contain a tremendous amount of human knowledge. That enables the models to answer general questions, let us say, or have general conversations. Those are not the sort of interactions we are used to yet when talking to our Google Assistant or Alexa, or something like that; those are quite narrow conversations.
The vast amounts of data help with the general knowledge. For specific use cases, at a technical level we fine-tune those models on data that is specific to that use case, whether it is code, medicine or legal data, for example. There are many stages beyond that even.
Once we have a system out there it is important to have user feedback in the form of thumbs up, thumbs down, and things like that, and to improve on that so that it becomes more useful.
Lord Hall of Birkenhead: The data that you are training on is so huge that you could not say, “Go here, but not there”, or whatever. You could not point to certain sets that you knew were okay.
Professor Zoubin Ghahramani: We have certainly found that larger models and larger datasets increase general performance.
Lord Hall of Birkenhead: In terms of general performance, it is more often right.
Professor Zoubin Ghahramani: It has been shown that larger models tend to hallucinate a bit less than smaller models, but that is not a panacea. The answer is not simply to make your models larger. You have to do quite a lot of hard work to reduce hallucinations. You have to have secondary systems that check the answers against original sources. Basically, imagine an automatic fact-checker that takes a first draft, runs fact-checking and eliminates things that are inconsistent or not properly supported. It has to do that within a few seconds to be useful.
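As an illustration only, here is a minimal sketch of the automatic fact-checker pipeline sketched above; the `generate_draft`, `retrieve_sources` and `claim_is_supported` callables are hypothetical stand-ins for the drafting model, retrieval and verification components a real system would need.

```python
from typing import Callable, List

def fact_checked_answer(
    question: str,
    generate_draft: Callable[[str], str],                   # drafts an answer (e.g. a large language model)
    retrieve_sources: Callable[[str], List[str]],           # fetches candidate source passages
    claim_is_supported: Callable[[str, List[str]], bool],   # verifies one claim against the sources
) -> str:
    draft = generate_draft(question)
    sources = retrieve_sources(question)
    kept = []
    for claim in (s.strip() for s in draft.split(".") if s.strip()):
        if claim_is_supported(claim, sources):
            kept.append(claim + ".")
        # Unsupported claims are dropped rather than asserted.
    return " ".join(kept) if kept else "No sufficiently supported answer was found."

# Toy usage with stub components (a crude keyword check stands in for a verifier model):
print(fact_checked_answer(
    "Who wrote Hamlet?",
    generate_draft=lambda q: "Hamlet was written by Shakespeare. It was written in 1901.",
    retrieve_sources=lambda q: ["Hamlet is a tragedy by William Shakespeare, written around 1600."],
    claim_is_supported=lambda c, srcs: all(w in srcs[0] for w in c.split() if len(w) > 3),
))  # keeps the supported claim, drops the incorrect date
```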
The Chair: I do not want to put words in your mouth, but you talked a few moments ago about the positive reaction that some in the creative industries have had to this technology. We have heard that too, and I absolutely acknowledge that. Is that your reaction on the opportunities from the technologies? Is that your answer to the concerns raised about the use of their proprietary data in creating this technology from which firms are profiting hugely? Is it, “Suck it up, because it’s going to be great out there”?
Professor Zoubin Ghahramani: No, that is not my answer. We acknowledge that creators have valid concerns about the use of their content. I was pointing out that there is also a lot of enthusiasm from creators about using these tools.
The Chair: Presumably, you can see the legitimate concerns that people have about the use of their material.
Professor Zoubin Ghahramani: Yes. The concerns are very valid if you find the language model exactly reproducing the content from the creator. We try to take measures so that does not happen, but even that can be challenging.
Q111 The Chair: Mr Andrulis, I have one final question for you. Had OpenAI been here today, I was going to ask whether it would follow through on what is reported to be its threat to cease operating in the EU because of the EU’s forthcoming AI Act and its constraints. Being an EU-based business yourself, do you see the UK as a better option for firms such as yours? What would it take for you to decide to move here?
Jonas Andrulis: That is a trick question. The Act is still in development. A few days ago, there was a potentially massive change in direction that ruffled a lot of feathers. It is too early to have a verdict on how strongly the AI Act will come up against innovation. I agree with the direction the UK is currently taking. I like it quite a bit.
The Chair: We will take that as a positive verdict on the UK from a witness who is coming to us from Germany. We have never had a witness on the move before or on the run.
Jonas Andrulis: Sorry about that.
The Chair: Mr Andrulis and Professor Ghahramani, thank you very much for your evidence and testimony. We are very grateful to you both.