Royal Statistical Society — Written evidence (NTL0033)
1.1. The Royal Statistical Society (RSS) is an academic, professional and membership organisation for statisticians and data scientists. Part of our role is to promote the proper use of data and evidence in decision-making. This is particularly important in this context, where approaches based on machine learning and advanced algorithms[1] can have life-changing impacts for individuals and risk entrenching inequality.
1.2. From our perspective there are two key points to make:
1.3. We take it that the first of these points is generally well understood, so our submission focuses on how organisations and any data-based technology that they use should demonstrate trustworthiness. We make three main recommendations:
Recommendation 1: Before data-based technologies are applied to the law, there should be a rigorous assessment of the datasets they will use, to establish the extent to which those datasets contain biases.
Recommendation 2: All law-enforcement organisations using algorithms should follow the three principles set out in the Office for Statistics Regulation’s review: being open and transparent; being rigorous and ensuring quality; and testing the acceptability of the algorithm with affected groups.
Recommendation 3: Transparency in data-based technology – both around claims about its performance and around how the technology arrives at individual judgements – is crucial. If the aim is to build trust, organisations should communicate clearly and fully about the performance of this technology and should only deploy algorithms that can provide a rationale for individual judgements.
2.1. There is now a substantial literature on the dependence of data-based technology on good-quality data. This is particularly important to stress in the context of predictive policing, where quantitative techniques are used to identify likely targets for police intervention. The key concern with this sort of approach is that the data used to make judgements is itself rife with systematic bias. Police data does not record all the crimes committed, only those that have been reported to or identified by the police. Numerous studies indicate that police databases are neither a complete census of crimes committed nor even a representative random sample.[2]
2.2. This is particularly problematic where the datasets used contain biases. While it is hard to show definitively that police records are biased, the evidence strongly suggests that they are. Empirical studies show that police officers – even if only implicitly – factor race and ethnicity into decisions about whom to stop and where to patrol.[3] This is likely to translate into over-representation of particular groups in police databases.
2.3. The risk is that any data-based technology that uses biased datasets will simply entrench and encode those biases. It is hard to see how any data-based technology could work with a biased dataset and yet avoid reproducing the same biases. If data-based technology is to be applied to the law, there should first be a rigorous analysis of the datasets that it will use to assess the extent to which they contain biases. If a technology claims to be able to avoid reproducing those biases, then a high level of transparency around its processes should be demanded so that its effectiveness can be assessed. This is detailed in §4, below.
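To make the entrenchment mechanism concrete, the sketch below is a purely schematic illustration and is not drawn from any real system or dataset: two areas of equal size have identical underlying incident rates, but one is over-represented in the historical records, and an allocation rule that follows those records simply reproduces the imbalance year after year.

    # Schematic illustration only: two areas of equal size with identical true
    # incident rates, but area A is over-represented in the historical records.
    # Resources are allocated in proportion to recorded incidents, and new records
    # are generated only where resources are sent, so the initial imbalance is
    # reproduced rather than corrected.

    true_rate = 0.05                    # identical in both areas
    records = {"A": 150, "B": 100}      # area A over-represented at the outset

    for year in range(1, 6):
        total_records = sum(records.values())
        patrols = 1_000                 # fixed total allocated each year
        for area in records:
            share = records[area] / total_records
            records[area] += patrols * share * true_rate   # new records follow the allocation
        share_a = records["A"] / sum(records.values())
        print(f"year {year}: share of records from area A = {share_a:.2f}")

In this simple setting the system never “discovers” that the two areas are alike, because the only data it sees are the records that its own allocations generate.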
3.1. If data-based technologies are to be accepted by the public, both the technology and the organisations using it will need to be trusted. Trust is not something that is automatically given by the public – especially in law enforcement where, for example, we know that there is a low level of trust in the police among black communities.[4] In order to be trusted, organisations and systems must demonstrate trustworthiness. This is an important and influential idea – the UK Statistics Authority’s Code of Practice for Statistics has trustworthiness as its first pillar. Trustworthiness, in the context of the Code, “comes from the organisation that produces statistics and data being well led, well managed and open, and the people who work there being impartial and skilled in what they do”.
3.2. In the aftermath of last year’s poorly handled decision to use an algorithm to award grades to students whose examinations had been cancelled, the RSS asked the Office for Statistics Regulation (OSR) to conduct a review into the use of algorithms that are intended to be applied to individuals. Their report, Ensuring statistical models command public confidence, is highly relevant to this topic, and the lessons that they draw for organisations developing algorithms are important (p.62):
3.3. We would recommend that any organisation using new data-based technologies in the context of the law pay close attention to the OSR’s review.
4.1. As well as the organisation developing data-based technology needing to demonstrate trustworthiness, the technology itself must also be demonstrably trustworthy. To take the example of an algorithm, there are two senses in which it should be trustworthy: both claims about the system and claims by the system should be trustworthy. That is, claims about what the system does and how it works should be reliable and clear, and what an algorithm says about a specific case should also be reliable.[5] Assessing the reliability of claims about and by an algorithm is not straightforward, but there are some lessons from statistics that can help navigate this.
4.2. Let us begin by looking at claims about algorithms. There are two types of claim here that need to be considered:
4.3. Last year there was a debate in the House of Lords on Facial Recognition Surveillance that provides a clear illustration of the importance of users of algorithms making reliable claims, and is helpful in unpacking some of the issues we face in talking about algorithms. The government spokesperson, Baroness Williams, said “As for inaccuracy, [Live Facial Recognition] has been shown to be 80% accurate. It has thrown up one false result in 4,500 and there was no evidence of racial bias against BME people”.[6] It is not at all clear what it means to say that an algorithm is “80% accurate”. Does it mean that out of every ten alerts, eight correctly identify someone on a watchlist? Or is it that out of every ten people on the watchlist who pass the facial recognition system, eight of them are flagged? Or is it that of every ten judgements made by the system – including identifying people who are not on a watchlist – eight of them are right? These all mean quite different things, making it impossible to understand – let alone assess – what the claim means. This type of claim about data-based technologies must be phrased carefully.
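To illustrate how far apart these readings can be, the short sketch below uses entirely made-up counts (they are not drawn from any real deployment) and computes the three quantities just described; in casual usage, any of the three could be reported as the system’s “accuracy”, yet they differ widely.

    # Illustrative, made-up confusion matrix for a face-recognition deployment.
    true_alerts = 8             # people on the watchlist who were correctly flagged
    missed = 2                  # people on the watchlist who were not flagged
    false_alerts = 10           # people not on the watchlist who were flagged
    correct_rejections = 9_980  # people not on the watchlist who were not flagged
    total_judgements = true_alerts + missed + false_alerts + correct_rejections

    # Reading 1: of every ten alerts, how many correctly identify someone on the watchlist?
    share_of_alerts_correct = true_alerts / (true_alerts + false_alerts)       # ~0.44
    # Reading 2: of every ten watchlist members scanned, how many are flagged?
    true_recognition_rate = true_alerts / (true_alerts + missed)               # 0.80
    # Reading 3: of every ten judgements made (alerts and non-alerts), how many are right?
    overall_accuracy = (true_alerts + correct_rejections) / total_judgements   # ~0.999

    print(share_of_alerts_correct, true_recognition_rate, overall_accuracy)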
4.4. This is something that the Metropolitan Police Service recognises, and it has set out the criteria that it uses to assess face recognition technology (p.28) – focusing on the true recognition rate (the rate at which people on the watchlist are identified) and the false alert rate (the rate at which people not on the watchlist are flagged). And Baroness Williams did frame the claim more precisely later in the debate, saying that “there is a one in 4,500 chance of triggering a false alert and over an 80% chance of a correct one” – so she appears to be referring to a false alert rate of 1 in 4,500 and a true recognition rate of over 80%.
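As a minimal sketch of those two criteria, the functions below compute the true recognition rate and the false alert rate from raw counts. The example counts are chosen purely so that the outputs line up with the figures quoted in the debate; they are not the actual data behind those figures.

    # Sketch of the two criteria described above; the counts are illustrative only.
    def true_recognition_rate(true_alerts, missed):
        """Rate at which people who are on the watchlist are identified."""
        return true_alerts / (true_alerts + missed)

    def false_alert_rate(false_alerts, correct_rejections):
        """Rate at which people who are not on the watchlist are flagged."""
        return false_alerts / (false_alerts + correct_rejections)

    print(true_recognition_rate(8, 2))      # 0.8      -> "over an 80% chance of a correct one"
    print(false_alert_rate(10, 44_990))     # ~1/4,500 -> "one in 4,500 chance of ... a false alert"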
4.5. If these claims are to be trustworthy, then the sources for the claims need to be made available so that it is possible to properly assess them. The headline figures do not necessarily reveal the full picture. For example, take the one in 4,500 false alert rate. This is likely to be based on figures from South Wales Police, whose automated facial recognition system made ten false matches out of around 44,500 scanned faces at a Biggest Weekend event in Swansea in 2018. Their figures also show that two alerts were issued for matches confirmed by the system’s operator, leading to one arrest.
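A rough arithmetic check (ours, not South Wales Police’s) shows how the headline figure could have been derived from those counts.

    # Rough check of the "one in 4,500" figure against the counts reported above.
    false_matches = 10
    faces_scanned = 44_500
    print(faces_scanned / false_matches)   # 4450.0: one false alert per ~4,450 scans,
                                           # consistent with "roughly one in 4,500"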
4.6. So, twelve alerts were raised: ten were false alerts, one was a confirmed case of true recognition, and one was a presumed case of true recognition (the second person who triggered an alert was not, in the end, arrested). Setting aside the presumed case, only one of the remaining eleven alerts was a confirmed true recognition, so the chance of an alert being correct could be as low as 1/11 (around 9%). This figure – given a number of alerts, how many are expected to be confirmed as correct matches – is, like the false alert rate and the true recognition rate, important information that should be presented to the public when talking about how this technology operates. It would seem an especially important figure to highlight, as it is the one that tells us how many people are falsely identified for each alert. Importantly, it is also this proportion that might vary based on gender and ethnicity.
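The arithmetic in that paragraph, restated as a short calculation:

    # Restating the Swansea figures described above.
    false_alerts = 10
    confirmed_true = 1
    presumed_true = 1     # alert confirmed by the operator, but no arrest followed

    # Setting aside the presumed case, one of the remaining eleven alerts was a
    # confirmed true recognition.
    print(confirmed_true / (false_alerts + confirmed_true))   # 1/11, roughly 0.09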
4.7. The true recognition rate, claimed to be over 80% in this case, also bears thinking about. This claim is that for every ten people on the watchlist who are scanned by this system, eight will raise an alert. Now, this is clearly not a figure that can ever be tested in the real world, because in any uncontrolled experiment we cannot know how many people on a watchlist are in the crowd. This figure must, presumably, come from tests conducted by the developer of the system. Transparency is vital here: information about how the technology was tested should be in the public domain before it is used. Data-based technology can be assessed in a variety of contexts – from early-stage digital testing to controlled trials where, for example, actors might be sought out from among a crowd. Information about these tests, including whether and how performance varies based on gender or ethnicity, should be available.
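The hypothetical calculation below is a sketch under assumed inputs, not a claim about any real deployment: taking the two claimed rates at face value, the number of alerts a crowd generates – and the share of those alerts that are correct – depends entirely on how many watchlist members are present, which is exactly the quantity that cannot be known in an uncontrolled setting.

    # Hypothetical illustration: all inputs are assumptions except the two claimed rates.
    claimed_true_recognition_rate = 0.80    # "over an 80% chance of a correct one"
    claimed_false_alert_rate = 1 / 4_500    # "one in 4,500 chance of triggering a false alert"
    crowd_size = 44_500                     # roughly the Swansea figure, used only for scale

    for watchlist_members_present in (0, 5, 20):    # unknown in practice; hypothetical values
        expected_true = claimed_true_recognition_rate * watchlist_members_present
        expected_false = claimed_false_alert_rate * (crowd_size - watchlist_members_present)
        total_alerts = expected_true + expected_false
        share_correct = expected_true / total_alerts if total_alerts else 0.0
        print(watchlist_members_present, round(expected_true, 1),
              round(expected_false, 1), round(share_correct, 2))

Under these assumed figures, with no watchlist members present every alert is a false one, and with twenty present only around three in five alerts would be correct – which is why the proportion discussed in paragraph 4.6, and not the two rates on their own, is what the public actually experiences.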
4.8. There is also the question of assessing the trustworthiness of claims made by an algorithm. As is the case with organisations, trustworthiness requires transparency. But, in the case of algorithms, this needs to be quite a specific form of transparency: if an algorithm is to be trustworthy, it should be able to explain how it came to a conclusion in a particular case.[7] This is a high bar for proprietary technology, but in an area as sensitive as law enforcement, where public confidence is essential, it is one that the technology should meet before it is used.
7 September 2021
[1] Throughout this document, for ease, we refer to data-based technology to capture the range of approaches covered by the inquiry.
[2] These are summarised in Lum and Isaac’s (2016) To predict and serve?
[3] For an example, see Gelman, Fagan and Kiss’s (2007) An analysis of the New York City Police Department's “stop-and-frisk” policy in the context of claims of racial bias.
[4] Financial Times, Race relations: the police battle to regain trust among black Britons
[5] This point is made by former RSS president David Spiegelhalter in his article Should we trust algorithms?
[6] This statement, and other claims about algorithms, are clearly unpacked by David Spiegelhalter and Kevin McConway in Live Facial Recognition: how good is it really? We need clarity about the statistics.
[7] The Royal Society have referred to this as “intelligent openness”, by which they mean that information should be accessible (found easily), intelligible, useable and assessable (so that the basis for any judgements is available). See Science as an Open Enterprise (p.12).