Written evidence submitted by ITRS Group (OPR0001)

Written evidence submitted by Guy Warren, CEO of ITRS Group

Executive Summary

Operational Resilience has deteriorated over the last few years as the number of digital channels and the volumes of transactions have increased, whilst financial institutions have been under cost pressures and have not invested enough in tooling to sufficiently protect these services. In some cases, a single tool is selected which is not capable of monitoring the complex or older parts of the application.
The introduction of the SMF24 is a big step in the right direction and will help put Operational Resilience on the same footing as Financial Resilience. Personal accountability changes the attention that organisations pay to important matters.
Strong guidelines on best practice should be drawn up so that the lessons learnt by investigations of issues are shared across the industry. This will prevent all companies making the same mistakes.
The proposal to start from the customers’ perspective is sensible. Defining the availability and performance of digital services, and then constantly measuring them and reporting the results, from the outside in, will help drive up performance.

Introduction to Guy Warren & ITRS Group

Guy Warren is the Chief Executive Officer of ITRS Group, a leading provider of production software tools to financial institutions globally. We help our clients achieve high levels of Operational Resilience by giving them visibility and understanding of the performance of their applications. We have been in business for over 20 years; our software is the industry-standard tool in the highly demanding capital markets trading environment, and is increasingly used across corporate and retail banking.

Prior to running ITRS Group, Guy Warren was Chief Operating Officer of FTSE International, the leading index data provider. He joined FTSE to improve the operational resilience and quality of the products and services delivered by FTSE. During his time there, FTSE went from having several IT outages a month to no failures in the last 15 months he was there.

Earlier in his career, Guy was the CEO of Misys Banking, the global leader in core banking software.

Comments on Committee Topics

The extent to which operational incidents are becoming more frequent, and how the prevalence of such incidents may change in future as consumers and firms come to rely more heavily on technology.

1.1. This is self-evident. Customers of financial institutions have moved from cheques and branch-based banking to call centre, online and mobile banking. People have moved from cash-based purchases to card/mobile payments for even the smallest items. This has drastically increased the volumes of transactions, the range of digital channels, and the dependence on performant and available digital channels.

The common causes of operational incidents in the financial services sector.

2.1. The incidents are mainly caused by three things. In order of prevalence, these are ‘failed changes’, ‘too high a load/capacity problems’ and ‘component failure’. 50% of issues occur following a change to the components or configuration of the applications or infrastructure. Whilst testing catches many of these, not all situations can be easily checked.

2.2. Capacity problems cause about 30% of failures. Often the capacity of all the components of an application is not fully understood, and bottlenecks are identified and improved only when a failure occurs. Modern Capacity Management and Planning tools can overcome this problem.

2.3. The last and least common cause is the failure of a single component which turns out to be a single point of failure. Most component failures are catered for in the application architecture.

The extent to which there exist “single points of failure” and/or other sources of concentration risk in the financial services sector.

3.1. Single points of failure are a significant problem, but are rare (see above). Modern systems are designed to continue performing when components fail; components are expected to fail at some point. Single points of failure (SPOFs) are the components which, if they fail, will bring down the service. These should be known and managed with great care, and firms need to have recovery plans in place in the event of failure.

3.2. In the financial industry, some services have become the central service. For example, Visa have 95% of the debit card market and failure here will affect numerous other services with big impact.

The incidence of multiple old legacy systems and the nature of their connectivity, and the impact of retrofitting web based/mobile systems to legacy systems.

4.1. This is a known problem and not new for banks. What is different is that many organisations are moving their monitoring tools to new providers like AppDynamics or Dynatrace and away from older tools from IBM, HP and CA Associates, even though the applications they use to service their clients haven’t been migrated off the older technology, which these modern monitoring tools can’t monitor. Some companies are trying to ‘standardise on one tool’ when this may not be sensible or effective in monitoring the complete environment. No single tool has the breadth of capabilities to cover the range of software, operating systems, storage and networks needed to deliver large, complex banking services. A combination of tools is needed, integrated to give clear visibility of what is happening.

The risks associated with integrating banks/systems, following takeovers and mergers, for example.

5.1. Following a merger or divestment of a financial institution, it is usual to undertake a migration. Migrations are complex and difficult projects which must be undertaken with a low/zero-risk approach, even though the organisation is trying to achieve cost savings and show synergy savings at the same time. Migrating before the work is properly tested and finished causes severe problems. The only answer is dummy runs and thorough testing, repeated until the migration is perfect. Deadlines should be ignored. (The worst project I ever led was a migration project – it over-ran horribly!)

The quality of relevant technical documentation.

6.1. Technical documentation for purchased hardware and software is usually available or can be provided. The issue is the documentation of the software written in house. Often, critical knowledge is in people’s heads, and with cost pressures and reduced headcount, and retirement of older staff, the knowledge leaves the organisation.

The impact of outsourcing on operational resilience.

7.1. Generally, outsourcers are highly reliable providers of business services to financial institutions. They need to be at least as good as the organisations they serve in order to keep their customers. They do pose a concentration risk if they are a significant provider to a given industry, and there are some weaker organisations. Current regulations make it clear that any regulated company remains responsible for the performance of anything it outsources.

The ways in which consumers typically lose out as a result of operational incidents, including inconvenience and vulnerability to fraud.
Examples of best practice with respect to firms’ responses to and handling of operational incidents, including approaches to communicating with customers, identifying and addressing the causes of incidents, and handling customer complaints and compensation.
What should be learned from the operational incidents witnessed in recent years.

10.1. The frequency and severity of technology failures have increased in recent years. This is because more people are using digital services, and financial organisations have not kept pace with the investment in technology and process needed to ensure acceptable levels of performance and availability.

10.2. Since the financial crash in 2008, most organisations have had to manage their costs and improve their financial resilience. As a result of this cost management, the IT department has been under pressure to reduce spending, and yet deliver and support new and growing digital channels.

10.3. In order to bring these new products to market more quickly, many organisations have adopted the ‘dev-ops’ approach pioneered by the internet companies. Dev-ops is powerful, but it does need care when working with the older and more complex environments of a bank, and the higher frequency of change increases the likelihood of failure.

10.4. The proposed regulation for Operational Resilience is well thought through. It tackles both failure of a service (very easy to spot) and severe degradation of a service (less easy, and needs the organisation to state what is acceptable performance). It is easy to regularly check that an organisation is providing a digital service and that the performance is acceptable using ‘synthetic monitoring’ (this is computers on the internet which act like customers, posting synthetic transactions into the website or mobile application). Not enough organisations use this technique to measure and report the performance of their digital channels.
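
By way of illustration only, a synthetic check of this kind can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any particular product: the names `http_probe`, `run_synthetic_check` and `CheckResult`, the example URL and the 5-second threshold are all hypothetical. It distinguishes a failed service from a degraded one, as described above.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    status: str     # "OK", "DEGRADED" or "FAILED"
    seconds: float  # elapsed time of the synthetic transaction


def http_probe(url: str, timeout: float = 30.0) -> Callable[[], None]:
    """Build a probe that fetches `url` the way an external customer would."""
    def probe() -> None:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                raise RuntimeError(f"unexpected HTTP status {response.status}")
    return probe


def run_synthetic_check(probe: Callable[[], None],
                        max_seconds: float,
                        clock: Callable[[], float] = time.monotonic) -> CheckResult:
    """Run one synthetic transaction and classify the outcome.

    FAILED   - the transaction raised an error (service unavailable)
    DEGRADED - it completed, but slower than the stated acceptable time
    OK       - it completed within the stated acceptable time
    """
    start = clock()
    try:
        probe()
    except Exception:
        return CheckResult("FAILED", clock() - start)
    elapsed = clock() - start
    status = "OK" if elapsed <= max_seconds else "DEGRADED"
    return CheckResult(status, elapsed)


# Hypothetical usage, run on a schedule from a machine outside the firewall:
# result = run_synthetic_check(http_probe("https://bank.example/login"), max_seconds=5.0)
```

The important design point is where the check runs from: because the probe sits outside the organisation's own network, it sees what the customer sees rather than what internal monitoring reports.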

The ability of the regulators to ensure firms are adequately guarding against service disruptions.

11.1. The proposal to make the SMF24 personally responsible will help. Within most financial institutions, IT has been a cost centre and secondary function, often reporting into the COO rather than having a seat at the top table. But in reality, financial institutions cannot operate without IT; IT is a revenue channel for them and should be seen that way.

11.2. Having been a CF1, CF3 and CF10a previously, I know the authority and responsibility that gives a person in a financial organisation. Raising the SMF24 to the same level, with personal liability, will change behaviours.

11.3. The regulator can then rely on the SMF24 to call out known risks and improvements their employer needs to make, and can more easily track that organisations are taking this seriously and are making the necessary investments and changes.

Whether the regulators have the relevant skills to hold appropriate parties to account in the event of significant operational incidents.

12.1. As I understand the proposed regulations, the SMF24 approach will be effective, combined with normal audit and review processes.

Approaches to operational resilience in different jurisdictions.

13.1. This issue needs to be tackled globally and, as I understand it, the European Banking Association and the G20 are also looking at implementing similar guidelines, which would make this regulation global.

The opportunities and risks presented by the application of new technology in the financial services sector with respect to operational resilience.

14.1. The technologies themselves are usually secure and resilient. However, the large number of ways that people can now move money will increase the opportunity for fraud, and increase the number of outage or performance issues which customers experience.

What should be considered an appropriate level of tolerance for operational disruptions.

15.1. It is unusual for the regulator to specify a level of performance, but most organisations consider 99.99% availability to be a highly available service. This sounds very high, but for an application which is supposed to be available 24 x 7, it still allows about 52 minutes of downtime a year. This would still make the front page of the papers for some services.
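
The arithmetic behind that figure is simple and can be checked for other availability levels. The following is a minimal sketch; the function name is my own, and it assumes a 365-day year with the service expected to be up around the clock.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Downtime allowed per year for a 24 x 7 service at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for level in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% availability allows {minutes:.1f} minutes of downtime a year")
```

On these assumptions, 99.9% allows roughly 526 minutes (almost nine hours) a year, 99.99% roughly 53 minutes, and 99.999% a little over five minutes.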

15.2. Equally difficult is acceptable performance. How slow does an app need to be to be unacceptable? The financial institutions should publish the expected performance of an app (<5s to log on, <10s for balance enquiry) and publish their actual performance against that benchmark.
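
As an illustration of what reporting against such a benchmark could look like, the sketch below takes the two example targets from the paragraph above and calculates the share of measured transactions that met them. The function names and the sample timings in the usage comment are hypothetical and are not real measurements from any institution.

```python
from typing import Dict, List

# Example targets taken from the paragraph above (seconds).
TARGET_SECONDS: Dict[str, float] = {"log_on": 5.0, "balance_enquiry": 10.0}


def percent_within_target(samples: List[float], target: float) -> float:
    """Percentage of measured response times strictly faster than the target."""
    if not samples:
        raise ValueError("no measurements supplied")
    within = sum(1 for seconds in samples if seconds < target)
    return 100.0 * within / len(samples)


def performance_report(measurements: Dict[str, List[float]],
                       targets: Dict[str, float] = TARGET_SECONDS) -> Dict[str, float]:
    """Actual performance against each published target, as a percentage."""
    return {
        name: percent_within_target(measurements[name], target)
        for name, target in targets.items()
    }


# Hypothetical usage with made-up timings:
# performance_report({"log_on": [1.2, 4.9, 6.0, 3.0], "balance_enquiry": [2.0, 12.0]})
```

Publishing a figure of this kind alongside the target makes any shortfall visible without the reader needing to interpret raw timings.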

Recommendations

The proposed regulations are a big step towards improving Operational Resilience. Following the financial crash, the focus has been on Financial Resilience, and for the most part stress tests of banks show that they are much stronger than in 2008. However, over this period, and partly because of the cost pressures, the Operational Resilience has suffered. Raising OR to the same level of importance as Financial Resilience is needed.
Organisations need to ensure that they are monitoring their client-facing services from the outside, as customers experience them. A recent failure of a mobile banking application at NatWest was caused by a change to a firewall which blocked users from access. All normal monitoring inside the firewall would have reported normal operation. A ‘synthetic monitoring’ tool would have detected this and alerted straight away. Instead, it was only hours later, when users complained, that the bank discovered the error and corrected it.
Load testing and capacity planning are essential. Generating user loads and understanding the performance and limitations of the application is essential. About 30% of all failures are caused by performance issues, which are generally avoidable. It is possible to use machine learning and capacity planning software to model ‘what if’ changes before they are actually implemented. This would reduce failures from software changes.
Monitoring of complex applications cannot be done with one tool (even ours!!). The monitoring tools need to be integrated so that the organisation has visibility across the whole application and infrastructure.
Organisations should be made to publish the availability and performance targets they expect of each application, and then publish what they are actually achieving. This would automatically highlight where they are falling short of their targets. (Availability is often specified, but performance less so).

Submitted December 2018