This is the sixth in a series of articles Christopher Brown and I have published to help plan and implement AI across the claims value chain. The previous articles can be read on Insurtec World.

This week we discuss accuracy and a key question: how accurate is accurate enough, and how do you measure it?

Summary

The insurance industry has spent years asking whether AI is accurate enough for claims handling. It has spent almost no time asking whether that question is even coherent. This article argues that perfect accuracy in AI claims decisions is not just technically difficult. It is conceptually impossible because the raw material AI works with is irreducibly human. Claims are not data transactions. They are human events, described by people under stress, shaped by decisions made before, during and after the loss. Every attempt to structure that reality, through forms, dropdowns and validated fields, compresses out the context that gives a claim its meaning.

The article examines the double standard at the heart of the industry's approach to AI: human handlers are accepted as variable and fallible, while AI is held to a standard of perfection that has never been applied to the processes it is replacing. It traces the source of this asymmetry, identifies which parts are legitimate and which are not, and outlines what a more rational framework would look like. It then identifies the five structural problems any insurer will encounter when trying to measure AI performance seriously and concludes with what the industry actually needs to move the conversation forward.

Contents

  1. The Baseline Nobody Has
  2. The Double Standard in Practice
  3. The Variability We Accept Without Measurement
  4. The Framework We Are Missing
  5. Insurance and Differential Pricing: The Legal Framework
  6. The Regulatory Gap
  7. What Mature Measurement Actually Requires
  8. What We Actually Need

Introduction

Every AI vendor will tell you their accuracy rate. Very few will tell you what they measured it against.

The answer to that question turns out to be harder to find than the system is to build. I have asked it repeatedly across AI projects in claims, and the honest answer, often, is that nobody really knows.

The industry has spent considerable energy asking whether AI is accurate enough. It has spent almost none asking whether its current human processes are accurate enough to serve as the benchmark. That gap is the measurement problem nobody wants to name.

To understand why, it helps to look at what happens when you try to establish a human baseline.

Take a sample of historical claims. Have senior handlers re-review them blind. Compare their decisions to the original outcomes. The exercise is revealing, not because of the numbers, but because of how difficult it is to even define "correct."

When two experienced handlers reviewing identical evidence reach different conclusions, which one is wrong? Often neither. They have weighted factors differently, applied judgment differently, and reached defensible but divergent decisions. That is not an error. That is the nature of claims handling.

Yet when an AI system produces the same kind of variation, it is treated as a bug to be fixed rather than a feature to be expected.

1. The Baseline Nobody Has

This is not a criticism of individual firms. It is a structural gap across the industry. Insurers were never required to measure handler accuracy with statistical rigour, so the methodology was never built. Quality assurance processes evolved to manage errors, not to quantify them.

Ask an insurer to compare handler accuracy against AI, and the response is rarely uncertainty. It is confidence. Experienced handlers understand the nuance of claims in ways no model can replicate. They read between the lines. They know when a claimant's account does not quite add up. They exercise judgment that cannot be codified.

That may well be true. The problem is that none of it is measured. The confidence is genuine, but it is not supported by evidence. Complaint volumes, FOS referral rates, and customer satisfaction scores tell you when things go wrong. They do not tell you about the underlying rate at which coverage decisions are correct in the first place.

This is genuinely hard to measure. What does "correct" mean for a coverage decision with genuine ambiguity? How do you count decisions that were defensible but arguably could have gone either way? How do you separate handler error from policy wording that was always going to be contested?

The industry evolved quality assurance processes (sampling, supervision, complaint handling) that assume errors will occur without ever precisely quantifying their frequency. That assumption worked fine when humans were the only option. It becomes problematic when you are trying to assess whether AI is better or worse than the process it replaces.

2. The Double Standard in Practice

Human claims handlers make mistakes constantly. Coverage misinterpretations. Fraud indicators missed. Vulnerability not identified. Reserves set incorrectly. Policy wording applied inconsistently. Judgment calls that, with hindsight, were wrong.

The FCA's multi-firm review of insurance outcomes monitoring found that many firms were overly focused on completing processes rather than delivering outcomes, with limited monitoring of outcomes across different customer groups [1]. That finding applies to human-driven processes. It establishes that handler error is a known, accepted feature of current operations rather than an exceptional event.

Nobody expects human handlers to be 100% accurate. The expectation is defensible processes and reasonable outcomes. When a handler makes a mistake, you investigate, learn from it, retrain, and move on. It is not a crisis. It is operations.

Now consider how we treat AI.

A 95% accuracy rate that would represent excellent human performance becomes "the AI gets it wrong one time in twenty." A single AI error becomes a story in a way a single handler error never would.

"AI wrongly denies cancer patient's claim" is a headline.

"Handler wrongly denies cancer patient's claim" is a complaint statistic.

In both cases, the same complaint procedures apply. The customer challenges the decision. The insurer reviews it. If the decision was wrong, it is corrected. The correction mechanism is identical regardless of whether the original error was made by a person or a system. The asymmetry is not in the outcome. It is in the perception.

There is a subtler difference worth acknowledging. Human handlers bring emotion to their decisions, and emotion is not always a liability. An experienced handler may recognise distress in a claimant's voice and apply appropriate sensitivity. They may exercise discretion that the letter of the policy does not quite permit, but the spirit clearly intends. That human quality has real value.

But emotion also introduces risk. A handler who has just dealt with a fraudulent claim may approach the next claimant with residual suspicion that claimant has done nothing to earn. A handler under pressure may make a faster decision than the evidence warrants. A handler who finds a claimant difficult may, entirely unconsciously, weigh the evidence less generously. AI has none of these responses. It does not have a bad morning. It is not worn down by a difficult caseload. It applies the same logic to the first claim of the day and the five hundredth.

This asymmetry is not rational, but it is real. Media, regulators, and customers scrutinise AI failures more intensely than human failures. An AI error feels like a system failure. A human error feels like an individual mistake. Neither characterisation is quite accurate, but both shape the environment in which insurers must make deployment decisions.

Where does this asymmetry come from? It is worth being honest about the roots, because understanding them is the only way to address them.

Some of it is a genuine and legitimate limitation, and it is worth being precise about what that limitation is. A well-governed AI system can be built to record its reasoning at the point of decision: the factors it weighted, the confidence level it assigned, and the basis for its conclusion. That audit trail is not only possible but necessary in a regulated environment, and it satisfies the basic question of what the system did and why.

What it cannot do is engage retrospectively. If a claimant or a regulator wants to probe that decision further, to challenge a specific inference, to ask a follow-up question, to explore how the system might have decided differently under slightly different facts, there is no ongoing dialogue available. The explanation exists only as a record of what happened at a fixed point in time. The model cannot be questioned about that historic moment in the way a handler can be called back, asked to reconsider, or invited to reflect on their reasoning in light of new information. That is not a flaw in implementation. It is a structural characteristic of these systems.

This distinction matters, and it should inform how AI decisions are governed, not whether AI should be used at all. The answer is robust logging at the point of decision, clear escalation pathways for challenged outcomes, and human review capacity for cases where the static explanation is insufficient. That is a governance design question, not a reason to treat AI error as categorically more serious than human error.

Some of it is a legitimate concern about scale and accountability. A handler who makes poor decisions affects the claims they personally handle. An AI system making poor decisions affects every claim it touches, simultaneously, at volume. The potential for harm is concentrated in a way that feels categorically different, even if the per-claim error rate is lower.

Some of it is cultural, and this is worth naming directly. Decades of literature, film and popular science have constructed a particular narrative around artificial intelligence: that it is unknowable, that it will exceed human control, that it poses an existential threat to human agency. Figures such as the late Professor Stephen Hawking and Elon Musk, who has had a notably complicated relationship with AI, have made high-profile warnings about the long-term risks of artificial general intelligence. Those warnings were about a technology far removed from a claims classification system, but the cultural anxiety they reinforced does not make careful distinctions between general AI risk and narrow AI deployment. When a customer reads that their insurer is using AI to assess their claim, they are not thinking about a well-governed classification model. They may be thinking about something closer to science fiction.

And some of it, frankly, is concern about jobs. AI in claims handling is correctly understood to have implications for the workforce. That concern is legitimate and deserves a direct response rather than dismissal. But workforce anxiety is a separate question from accuracy and governance, and conflating the two produces poor decisions on both fronts.

Understanding that this asymmetry is partly cultural, partly political and only partly rational does not make it disappear. But it does help insurers communicate more effectively about AI deployment, anticipate the objections they will face, and address them with evidence rather than assertion.

There is another dimension that is rarely examined. When a thousand claims are handled by a hundred different people, you have a hundred different failure patterns operating simultaneously. Each handler brings their own blind spots, inconsistencies, and good and bad days. The errors are real, but they are distributed and therefore difficult to identify, isolate and correct at any useful scale.

When the same thousand claims are processed by a single AI system, the failure pattern is concentrated. The system will make errors, but those errors will tend to occur in consistent places, for consistent reasons. A particular claim type. A specific combination of factors. A gap in the training data. That consistency, counterintuitively, makes AI errors easier to find, easier to understand and easier to fix than the diffuse, unpredictable variation that characterises a large human team.

This is not an argument that AI is automatically better. It is an argument that AI failure is more governable. And governability, in a regulated industry, is worth a great deal.

3. The Variability We Accept Without Measurement

The double standard runs deeper than occasional mistakes. Human handlers are inherently variable in ways we rarely quantify.

Research on judicial decision-making illustrates the point. A widely cited study of Israeli parole boards found that favourable rulings dropped from approximately 65% at the start of a session to nearly zero before breaks, then returned to 65% after food breaks [2]. The pattern held regardless of the crime or the prisoner's characteristics.

The study has been debated. Critics suggest that case scheduling may partly explain the effect [3]. But the broader research on decision fatigue is robust. A systematic review identified multiple antecedent factors: time of day, cognitive load, accumulated decisions, and overall workload all affect judgment quality [4].

Claims handlers face similar pressures. The same handler will decide differently based on workload, time of day, what claims they have just handled, and dozens of other factors that have nothing to do with the claim itself.

This variability is baked into operations. We accept it because we cannot eliminate it. We manage it through quality assurance, supervision, and the expectation that over large numbers, the variability averages out.

But we have never actually measured it. We do not track whether Monday-morning decisions differ from Friday-afternoon decisions. We do not analyse whether handlers who have just processed a fraud case are more likely to see fraud patterns in subsequent claims. We do not quantify the variability. We just assume it exists and design processes to contain it.

4. The Framework We Are Missing

Consider how we treat other systems that make errors.

Modern CPUs have extraordinarily low error rates. Not zero, but low enough that for most purposes we do not worry about them, and for the purposes where they matter, we design error-correction systems around the residual risk [5].

We do not demand perfect CPUs. We demand known, acceptable error rates with systems designed to detect and correct errors when they occur. We have defined what "good enough" looks like, we measure against that standard, and we build error-handling around what remains.

Human handlers make errors at a far higher rate than any processor, but we do not rigorously measure that rate. We track complaints and FOS referrals, the errors that surface, but not the underlying accuracy of decisions.

AI sits between these two models. Like CPUs, AI error rates can be measured rigorously. Like human handlers, AI errors have real consequences for customers. Unlike either of them, we have not established what "acceptable" means.

We are implicitly demanding perfection while explicitly knowing it is impossible.

5. Insurance and Differential Pricing: The Legal Framework

Let us address something directly. Insurance involves treating different groups differently. That is the business model.

We assess risk based on characteristics (age, location, claims history, vehicle type, property construction) and charge accordingly. We are explicitly saying "people with these characteristics are more likely to claim, so they pay more."

But this is not the same as discrimination in the legal sense. The Equality Act 2010 provides specific exceptions for insurance, recognising that actuarially justified differential pricing is fundamental to how insurance works [6].

Schedule 3, Part 5 of the Act allows insurers to use age in risk assessment provided it is "carried out by reference to information which is relevant to the assessment of risk and from a source on which it is reasonable to rely." Similar provisions apply to disability. Gender-based pricing was prohibited following the 2011 EU Test-Achats ruling.

The legal distinction matters.

Actuarially justified differential pricing, using statistically relevant factors where there is genuine predictive value, is permitted and necessary for insurance to function.

Discriminatory bias, treating people differently based on protected characteristics without actuarial justification, is illegal.

AI trained on historical claims decisions will learn the patterns in how handlers made decisions. If those patterns include actuarially justified risk factors, the AI is doing its job. If they include unjustified differential treatment, that is a problem. But it is a problem that existed in the human decisions the AI learned from.

The FCA's December 2024 research note on bias in supervised machine learning makes exactly this point: data issues arising from past decision-making, historical practices of exclusion, and sampling issues are the main potential source of bias [7]. The bias is not created by AI. It is inherited from historical human decisions and amplified by scale.

6. The Regulatory Gap

The FCA's position on AI in financial services is principles-based rather than prescriptive. Their April 2024 AI Update confirmed that existing frameworks (Consumer Duty, SM&CR, Principles for Business) apply to AI just as they apply to other operational decisions [8].

This makes sense. Technology-specific regulation tends to lag technology development and can stifle innovation. But it creates a practical problem for insurers trying to deploy AI responsibly.

The Consumer Duty requires firms to deliver good outcomes for customers and to monitor whether they are achieving this. The FCA's multi-firm review found that many insurers were not doing this well, even for human-driven processes [1]. The expectation exists. The methodology does not.

For AI, this creates a specific compliance exposure. AI systems can be measured with statistical rigour: test sets, confidence intervals, demographic breakdowns, and performance drift over time. The technology enables measurement at a level that human processes simply cannot match. But measured against what threshold? What accuracy level is "good enough" for a coverage determination? For a fraud flag? For vulnerability identification? The FCA has not specified, and the industry has not established consensus.
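
To make "statistical rigour" concrete, the sketch below shows one way an insurer might put a confidence interval around an observed accuracy figure, using the Wilson score method. The sample size, claim type and accuracy figure are invented for illustration; the point is simply that a headline rate measured on a modest sample carries real uncertainty, and that uncertainty should be reported alongside the number.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """95% Wilson score interval for an observed accuracy rate."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

# Illustrative figures only: 470 correct decisions out of 500 sampled motor claims.
low, high = wilson_interval(470, 500)
print(f"Observed accuracy 94.0%, 95% CI {low:.1%} to {high:.1%}")
```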

7. What Mature Measurement Actually Requires

Before examining the specific measurement challenges, it is worth confronting something more fundamental that is rarely stated plainly: 100% accuracy in AI claims handling is not just difficult. It is conceptually impossible. And understanding why reveals something important about the nature of the problem the industry is trying to solve.

A claim is not a structured data transaction. It is a human event. An incident occurred, involving people who were stressed, frightened, in pain or in dispute. That experience was then described by a human being, in their own words, filtered through their own memory and emotion. The account they give will differ in emphasis, sequence and detail from the account anyone else present would give of the same event. No two claimants describing the same type of incident will ever describe it the same way, because no two human beings process and communicate experience the same way.

Insurers have long understood this problem and responded to it in a way that is entirely logical but quietly self-defeating. Structured forms, predefined options, dropdown menus, validated fields: all of these impose order on a messy human reality. They make processing faster, data cleaner, and systems easier to build. But they achieve this by removing the very context that makes a claim comprehensible. Every constraint on what a claimant can say reduces the richness of what the insurer learns about what happened.

For me, the most important field on a claim form is the free-text incident description. It is the most human part of the entire process: unstructured, unvalidated, written under stress by someone trying to explain something that mattered to them. It is also the field that insurers have spent decades trying to eliminate, minimise, or replace with something more manageable. The instinct is understandable. The consequence is that the richer signal gets discarded in favour of the tidier one.

The idea that certain lines of insurance are exempt from this problem, that they are clean, technical and instrument-driven, rarely survives close examination. Marine cargo is perhaps the most instructive example. On the surface, it appears data-rich: voyage recorders, GPS tracking, container sensors, weather records, and port documentation. Yet trace the causal chain of almost any significant loss, and you find human decisions at every link. Was the cargo correctly packed and sealed by the shipper? Was the container properly stowed and secured in the hold? Was the vessel adequately maintained? When the weather deteriorated, did the master have access to current meteorological data, did he interpret it correctly, and did he take appropriate avoiding action, or did commercial pressure to maintain the schedule influence a decision that should have been made on safety grounds alone? Each of these questions has a human being at its centre. The instruments record what happened. They do not explain the judgment, or the absence of it, that determined why.

The same logic applies across lines that appear similarly data-driven. A parametric crop policy may be triggered by a rainfall index, but the yield loss it compensates for was shaped by planting decisions, fertiliser application, irrigation management, and pest response made by a farmer across an entire growing season. A cyber policy may be triggered by a technically documented breach, but the loss quantum is determined by board decisions under pressure, legal interpretations of notification obligations, and negotiations among forensic investigators, insurers, and regulators. The technical trigger sits inside a deeply human loss event. The data captures the moment. It rarely captures the meaning.

The AI is then asked to impose statistical order on what remains: a partially structured, already compressed version of a human event. It finds patterns in how similar previous claims were handled, but those previous claims were also reported by human beings, assessed by human beings, and resolved through human judgment. The AI is not working with clean data. It is working with the digitised residue of thousands of human interactions, each already filtered through the forms and constraints the insurer imposed before any analysis began.

This is the analogy that best captures it. Moving from vinyl to MP3 does not just change the format. Something is lost in the compression: warmth, texture, the imperfections that gave the original its character. AI processing of claims data is a similar compression, and it compounds one that has often already occurred at the point of data capture. The statistical patterns the AI identifies are real and often useful. But the human context, the tone of the account, the detail left out because the form had no field for it, the emotional register that an experienced handler might have picked up on in a conversation, does not survive the translation intact.

This does not mean AI cannot improve on average human accuracy. It means that accuracy itself is the wrong lens. The better questions are what types of decisions AI can handle reliably, what types require human judgment that cannot be codified, and how a claims operation should be designed to combine both in a way that delivers better outcomes than either could achieve alone.

There is no clean methodology waiting to be adopted. What the industry is confronting is a set of structural problems that any insurer pursuing AI-assisted claims decisions will encounter, and being honest about them is more useful than presenting a framework that glosses over the real difficulty.

The first problem is that accuracy is treated as a single number when it is not. Reporting an overall figure across all claim types tells you almost nothing operationally. A system that performs well on high-volume, straightforward claims but struggles on complex liability disputes may return an impressive headline percentage while failing precisely when failure matters most. The question worth asking is not what the aggregate accuracy is, but where the system is weakest, how often it fails in those areas, and what the consequence is for the customer when it does.
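
As a minimal sketch of what segment-level reporting looks like, the fragment below breaks a single review sample down by claim type and reports the weakest segments first. The claim types and records are hypothetical; the structure is the point: the same overall figure can hide very different performance where it matters most.

```python
from collections import defaultdict

# Hypothetical review records: (claim_type, ai_decision_correct). In practice this
# would be thousands of independently reviewed decisions, not four.
decisions = [
    ("motor_windscreen", True), ("motor_windscreen", True),
    ("liability_dispute", False), ("liability_dispute", True),
]

def accuracy_by_segment(records):
    """Replace a single headline accuracy figure with one figure per claim type."""
    totals, correct = defaultdict(int), defaultdict(int)
    for claim_type, was_correct in records:
        totals[claim_type] += 1
        correct[claim_type] += was_correct
    return {t: (correct[t] / totals[t], totals[t]) for t in totals}

# Weakest segments first: this is where the aggregate number is most misleading.
for claim_type, (acc, n) in sorted(accuracy_by_segment(decisions).items(),
                                   key=lambda item: item[1][0]):
    print(f"{claim_type}: {acc:.1%} accuracy over {n} reviewed decisions")
```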

The second problem is testing data. Building a representative test dataset sounds like a technical task. In practice, it is a governance decision. To measure AI accuracy meaningfully, the test set needs to include the complex and ambiguous cases that experienced handlers find genuinely difficult, not just the high-volume simple claims that inflate the overall figure. Assembling that data takes considerable time and forces explicit decisions about what "correct" looks like in situations where reasonable practitioners disagree. Those decisions embed judgment into the measurement framework itself. They should be made deliberately, documented clearly, and be defensible to a regulator. They rarely are.
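
Below is a sketch of what that governance decision looks like once it reaches code. The complexity bands, quotas and field names are all hypothetical; the deliberate choice is the over-weighting of difficult cases, which is exactly the decision that needs to be documented and defensible.

```python
import random

# Hypothetical pool of historical claims, each tagged with a complexity band at triage.
claims = [{"id": i, "band": random.choice(["simple", "moderate", "complex"])}
          for i in range(10_000)]

def stratified_test_set(pool, quota_per_band):
    """Build a test set that deliberately over-samples the bands handlers find hard,
    rather than mirroring the volume mix, which would let simple claims dominate."""
    by_band = {}
    for claim in pool:
        by_band.setdefault(claim["band"], []).append(claim)
    sample = []
    for band, quota in quota_per_band.items():
        available = by_band.get(band, [])
        sample.extend(random.sample(available, min(quota, len(available))))
    return sample

# The quotas are a governance choice, not a statistical default.
test_set = stratified_test_set(claims, {"simple": 200, "moderate": 300, "complex": 500})
```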

The third problem is error severity. Not all errors are equivalent, and counting them as though they are produces misleading conclusions. An error that routes a simple claim to the wrong team for a day is an inconvenience. An error that wrongly denies a valid coverage claim is a regulatory and reputational event. Any measurement framework that does not distinguish between error types by consequence is not fit for a regulated environment. The Consumer Duty is useful here precisely because it forces the question of customer impact rather than process compliance.
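
One way to operationalise that distinction is sketched below, with an invented severity taxonomy. The categories are illustrative, not a proposed standard; what matters is that the measurement framework records consequence, so a wrongful denial is never netted off against twenty harmless routing slips.

```python
from enum import Enum

class Severity(Enum):
    """Illustrative error taxonomy, ordered by customer consequence."""
    ROUTING = 1           # claim sent to the wrong team, caught downstream
    DELAY = 2             # decision delayed, outcome unchanged
    WRONG_SETTLEMENT = 3  # over- or under-payment on a valid claim
    WRONG_DENIAL = 4      # valid coverage wrongly declined

def error_profile(errors):
    """Count errors by consequence rather than as one undifferentiated total."""
    profile = {severity: 0 for severity in Severity}
    for error in errors:
        profile[error] += 1
    return profile

# Hypothetical review findings: the single wrongful denial is the number that matters.
findings = [Severity.ROUTING] * 20 + [Severity.WRONG_DENIAL]
for severity, count in error_profile(findings).items():
    print(f"{severity.name}: {count}")
```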

The fourth problem is demographic analysis, and it is the one most likely to create regulatory exposure if handled poorly. Demonstrating that an AI system does not produce unjustified differential outcomes across customer groups requires demographic data in the training set. That data is frequently absent, incomplete or not collected in a form that enables the analysis. The absence is not a technical limitation to note and move past. It is a governance gap that needs a documented response before deployment, not after a regulator asks the question.
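
Where the data does exist, the first-pass check is not sophisticated, as the sketch below illustrates with invented group labels and records. It is purely descriptive: a gap between groups is not evidence of unlawful bias, but it is the trigger for the actuarial justification question raised earlier in this article, and for deeper analysis than a simple rate comparison.

```python
from collections import defaultdict

def outcome_rates_by_group(decisions):
    """Approval rate and sample size per customer group: a descriptive first pass,
    not a full fairness analysis."""
    totals, approved = defaultdict(int), defaultdict(int)
    for group, was_approved in decisions:
        totals[group] += 1
        approved[group] += was_approved
    return {g: (approved[g] / totals[g], totals[g]) for g in totals}

# Hypothetical records: (age_band, claim_approved). Real analysis needs far larger
# samples and controls for legitimate risk factors before drawing any conclusion.
records = [("18-30", True), ("18-30", False), ("65+", True), ("65+", True)]
for group, (rate, n) in outcome_rates_by_group(records).items():
    print(f"{group}: approval rate {rate:.0%} over {n} decisions")
```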

The fifth problem is the baseline itself, and this is the most philosophically awkward. When handlers are asked to re-review cases to establish a human accuracy benchmark, experienced professionals regularly reach different conclusions from identical evidence. Human accuracy is not a fixed point. It is a distribution, shaped by individual judgment, experience, workload and context. If that is the standard against which AI is being assessed, the standard itself needs to be acknowledged as variable. The question is not whether AI matches human accuracy. It is whether AI performs within the range of outcomes that competent human judgment would produce, consistently and without the variability introduced by human factors.
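
A sketch of how that range might be made visible, using hypothetical handlers and decisions: pairwise agreement across a blind re-review panel gives the spread of competent human judgment. The question for the AI is whether it sits inside that spread, not whether it hits a point the humans themselves do not.

```python
from itertools import combinations

# Hypothetical blind re-review: each handler's decision on the same set of claims.
reviews = {
    "handler_a": ["accept", "decline", "accept", "accept"],
    "handler_b": ["accept", "accept", "accept", "decline"],
    "handler_c": ["accept", "decline", "decline", "accept"],
}

def pairwise_agreement(reviews):
    """Proportion of claims on which each pair of handlers agreed. The spread of
    these figures is the human baseline: a distribution, not a single number."""
    results = []
    for (name_a, decisions_a), (name_b, decisions_b) in combinations(reviews.items(), 2):
        agreement = sum(a == b for a, b in zip(decisions_a, decisions_b)) / len(decisions_a)
        results.append((name_a, name_b, agreement))
    return results

for name_a, name_b, rate in pairwise_agreement(reviews):
    print(f"{name_a} vs {name_b}: {rate:.0%} agreement")
```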

None of these problems is insurmountable. But they are all harder than a vendor demonstration suggests, and none of them gets resolved by waiting.

8. What We Actually Need

Claims leaders need to make this argument clearly to the FCA and the wider market.

Demanding AI perfection while accepting human imperfection is not a coherent regulatory standard. What is needed is a framework that acknowledges reality: AI systems should be measured rigorously, using methodologies that give statistically valid error rates across claim types and customer groups. Human baseline measurement should be required for comparison, because you cannot claim AI is better or worse without understanding current human performance. Defined thresholds should be established, recognising that acceptable error rates differ by decision type. A 2% error rate on gadget claim triage is different from 2% on coverage determination for serious injury claims. Error classification should distinguish between errors that harm customers and errors caught by downstream processes, between systematic patterns and random noise.

This is not lowering the bar. It is making the bar explicit and measurable. That is a higher standard than the vague "defensible processes" currently applied to human handlers.

If an AI system makes coverage decisions with 94% accuracy, and handlers make the same decisions with 88% accuracy, customer outcomes have improved. Six percentage points fewer customers receiving wrong decisions, half the previous error rate, is a genuine improvement. But the insurer will be judged on the 6% of claims the AI gets wrong, not credited for the six-point improvement.

Being prepared for that asymmetry, with data, with governance and with a communications strategy, is part of what production readiness means. Insurers who engage seriously with measurement will shape how the industry approaches AI governance. The ones who avoid it will find the standards set for them, potentially by people who understand insurance less well than they do.

Before deploying AI in any decision pathway, the most useful first step a claims director can take is to commission a baseline accuracy study on a sample of recent decisions made by experienced handlers. Not to embarrass anyone. To establish the reference point that will be needed when a regulator asks whether the AI performs better or worse than what it replaced.

Next in series: The Automation Trap: why "set and forget" automation will fail, and what sustainable deployment actually requires.

References 

  1. Financial Conduct Authority (2024). Insurance multi-firm review of outcomes monitoring under the Consumer Duty. June.
  2. Danziger, S., Levav, J. and Avnaim-Pesso, L. (2011). Extraneous factors in judicial decisions. Proceedings of the National Academy of Sciences, 108(17), pp.6889-6892.
  3. Weinshall-Margel, K. and Shapard, J. (2011). Overlooked factors in the analysis of parole decisions. Proceedings of the National Academy of Sciences, 108(42), p.E833.
  4. Pignatiello, G.A., Martin, R.J. and Hickman Jr, R.L. (2020). Decision fatigue: a conceptual analysis. Journal of Health Psychology, 25(1), pp.123-134.
  5. Mukherjee, S. (2008). Architecture Design for Soft Errors. Burlington, MA: Morgan Kaufmann.
  6. Equality Act 2010 (c.15), Schedule 3, Part 5.
  7. Financial Conduct Authority (2024). Research note: a literature review on bias in supervised machine learning. December.
  8. Financial Conduct Authority (2024). AI update. April.