There is a seductive vision in some AI vendor pitches: automate claims end-to-end, reduce headcount, let the AI handle everything. Scale without limits. Transform your cost base permanently.
This vision has a fatal flaw, and it is not the technology.
In this article:
The System You Validated Is Not the System Running Today — how managed LLM providers change models without warning, and why yesterday's validation means nothing today.
The Private Model Alternative — why private hosting solves one problem but introduces others, including model provenance, data sovereignty, and the economics of the tipping point.
Why Claims Cannot Tolerate Silent Change — the compounding damage when incorrect decisions go undetected for months.
The Monitoring Paradox — the workforce crisis driving automation, why the human baseline is eroding, and why synthetic testing alone is not enough.
Why the Domain Will Not Hold Still — shifting asset types, evolving fraud techniques, and regulatory findings that show insurers already struggle to monitor human-operated processes.
The False Economy — the real cost of headcount reduction, whether deliberate or driven by attrition.
What Sustainable Deployment Actually Requires — parallel testing, divergence thresholds, and novel claim detection as permanent infrastructure.
The Organisational Trap — why irrecoverable expertise loss is the most dangerous consequence of all.
1. The System You Validated Is Not the System Running Today
Anyone building production systems on top of large language models quickly learns something uncomfortable: the system you tested last quarter is not necessarily the one serving your users today. Providers update models, adjust serving infrastructure, patch bugs, and retrain on new data. These changes happen without notice, without changelogs, and without your consent.
In 2023, researchers from Stanford and UC Berkeley tracked GPT-4's behaviour across two releases three months apart. On the same prime number identification task, GPT-4's accuracy dropped from 84% in March to 51% by June [1]. Not a minor fluctuation. A 33-percentage-point collapse on a basic mathematical task, caused by changes the provider made to the model between snapshots.
In September 2025, Anthropic confirmed technical bugs following weeks of developer complaints about declining code quality in Claude [2]. OpenAI has faced repeated rounds of the same: users reporting that models they relied on had degraded, often with no explanation beyond eventual acknowledgement that something changed. Whether these shifts stem from deliberate updates, infrastructure changes, or unintended side effects is almost beside the point. The observable reality for anyone deploying these systems in production is consistent: validation has a shelf life, and nobody tells you when it expires.
For a chatbot answering general questions, that instability is an inconvenience. For a system that makes claims decisions affecting customers, it is an existential governance problem.
2. The Private Model Alternative
This drift problem applies specifically to publicly hosted, managed LLM services where the provider controls the model. There is an alternative. Large insurers and outsourced claims service providers could deploy smaller, open-source models on private infrastructure, where they control exactly when and how the models change. Models such as Meta's Llama and Mistral are available under permissive licences and can be fine-tuned for domain-specific tasks. No surprise updates. No silent degradation. Full version control.
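What "full version control" means in practice can be made concrete. Below is a minimal sketch of pinning a deployment to one exact model revision, assuming a Hugging Face-hosted open-weight model; the repository id is an example and the commit hash is a placeholder for whatever revision you actually validated.

```python
# Minimal sketch: pin a privately hosted model to one exact, audited revision.
# The repo id is an example and the revision is a placeholder; substitute the
# model you have vetted and the specific commit hash you validated against.
from huggingface_hub import snapshot_download

MODEL_REPO = "meta-llama/Llama-3.1-70B-Instruct"
PINNED_REVISION = "<commit-hash-you-validated>"  # a commit hash, never a moving branch or tag

# Downloading by commit hash means the weights serving production are
# byte-identical to the weights you validated. Nothing changes until you
# deliberately re-validate against a new revision and re-pin.
local_path = snapshot_download(repo_id=MODEL_REPO, revision=PINNED_REVISION)
print(f"Serving weights from {local_path}")
```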
Model provenance matters. Not all open-source models carry the same level of trust. China-based DeepSeek, despite impressive benchmark performance, has been banned from government systems in Australia, Italy, South Korea, and Taiwan, and by multiple US agencies, including NASA and the Pentagon. In the UK, 81% of CISOs surveyed in 2025 called for its urgent regulation. Under China's National Intelligence Law, organisations operating within Chinese jurisdiction can be compelled to share data with state intelligence agencies. For an insurer processing sensitive claims data, medical records, and financial information, the provenance of the model itself becomes a data security question, regardless of where it is hosted. Running a Chinese-origin model on your own servers does not resolve the trust problem if you cannot independently verify what is embedded in the model weights. Private hosting demands models from sources whose supply chain you can audit and defend to your regulator.
The costs of private hosting are high, but the tipping point is approaching. GPU infrastructure for running a 70-billion-parameter model in production can require eight or more high-end GPUs, at costs of $50,000 to $150,000 per month in the cloud, or substantial capital expenditure for on-premises hardware. However, GPU prices fell by over 60% between 2024 and 2025. Open-source models now achieve approximately 80% of the performance of proprietary models on many tasks. Research into deployment economics suggests that self-hosting becomes cost-effective at around 2 million tokens per day, with a break-even against managed APIs ranging from 4 to 34 months, depending on model size and usage patterns [3]. For a large insurer or service provider processing claims at scale, private hosting is becoming economically viable, and the data-sovereignty advantages in a regulated industry make it even more compelling. No claims data leaves your infrastructure. No prompts are retained by a third-party provider. No training data is harvested from your interactions.
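The break-even arithmetic is straightforward to sketch, even though the right inputs vary widely by deployment. Every figure below is an illustrative placeholder, not a quote; the point is the shape of the calculation, which under these particular assumptions lands within the 4-to-34-month range cited above [3].

```python
# Back-of-envelope break-even: managed API spend vs self-hosted GPU cluster.
# Every figure here is an illustrative assumption; substitute your own
# measured token volumes, negotiated API pricing, and infrastructure quotes.

TOKENS_PER_DAY = 100_000_000       # prompt + completion tokens across all claims
API_COST_PER_1M_TOKENS = 30.0      # $ per million tokens, frontier-model pricing
GPU_MONTHLY_COST = 60_000          # $ per month for the self-hosted cluster
SETUP_COST = 400_000               # one-off hardware, engineering, and validation

api_monthly = TOKENS_PER_DAY * 30 / 1_000_000 * API_COST_PER_1M_TOKENS
monthly_saving = api_monthly - GPU_MONTHLY_COST

print(f"Managed API:  ${api_monthly:,.0f}/month")
print(f"Self-hosted:  ${GPU_MONTHLY_COST:,.0f}/month")
if monthly_saving > 0:
    print(f"Break-even on setup cost in {SETUP_COST / monthly_saving:.1f} months")
else:
    print("Self-hosting does not break even at this volume and pricing")
```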
However, private models introduce their own challenges. The weights of a privately hosted model are frozen. It does not suffer from provider-imposed changes, but it also does not evolve. The world changes around it: new claim types appear, new fraud techniques emerge, regulatory expectations shift, and the model's training data becomes progressively less representative of the claims it encounters. This is data drift rather than model drift, and it requires deliberate retraining cycles to address. Each retraining cycle introduces its own validation requirements, and fine-tuning on domain-specific data carries the well-documented risk of catastrophic forgetting, where the model's performance on its original capabilities degrades as it learns new ones. Whether fine-tuned models, when privately hosted, experience the same kind of subtle degradation observed in managed services is not yet definitively established, and this is an area that warrants further research as private deployments become more common in regulated industries.
The monitoring imperative exists regardless of deployment model. Managed services give you no control over when your system changes. Private models give you control but freeze your knowledge base. Either way, you need the infrastructure to detect when performance degrades, whether the cause is a provider update you did not authorise or a world that has moved on from your training data.
3. Why Claims Cannot Tolerate Silent Change
Consider what happens when a claim is decided incorrectly. The customer might not know immediately. They might accept the outcome. The error might only surface months later, when they consult a solicitor or contact the Financial Ombudsman Service.
The FOS resolved over 227,000 complaints in 2024/25, an 18% increase from the previous year, and received over 305,000 new complaints [4]. These represent decisions made months or years earlier. By the time a pattern of incorrect decisions surfaces through complaint data, the damage is already compounded across thousands of claims.
If your AI system changed during that period, whether through a provider update or a subtle shift in behaviour, you will not know from looking at the system. You will only know from lagging indicators that arrive long after the harm has occurred.
4. The Monitoring Paradox
If you fully automate claims handling, you lose the ability to know when automation fails. That is the paradox at the centre of this article, and it only needs to be stated once. Every section that follows explores a different dimension of it.
It is important to be honest about why insurers are automating. This is not simply a cost reduction exercise. The industry faces a genuine workforce crisis. Experienced handlers are retiring faster than they can be replaced. Younger staff leave for better-paid, lower-stress roles outside the sector. The nature of claims work, dealing with distressed customers, complex disputes, fraud, and catastrophic events, creates sustained pressure that drives attrition. At the same time, claims volumes are increasing due to more weather-related events, more sophisticated fraud, and more regulatory complexity. Insurers are not choosing automation over people. In many cases, they cannot recruit the people they need, and automation is a necessary response to a capacity gap that recruitment alone cannot close.
There are also genuine efficiency gains to be captured. Removing manual data entry, document classification, and initial triage from experienced handlers frees them to focus on the judgment-heavy work where their expertise matters. Nobody should argue against that. The problem is not automation itself. The problem is that as automation expands beyond efficiency into decision-making, the experienced workforce shrinks.
The sequence is logical and self-defeating. You automate to capture efficiency and address workforce gaps. Experienced handlers retire or leave. You cannot replace them. The pool of human judgment you need for comparison shrinks. AI decisions can only be compared to previous AI decisions. The workforce crisis and the automation drive reinforce each other, and both are eroding the very expertise you need to know whether the automation is working.
What signals remain? Customer complaints only spike after sustained harm has occurred. FOS referrals represent issues originating months or years earlier. Litigation arrives years after decisions were made. None of these catch gradual degradation. They only surface when outcomes are already materially wrong at scale.
The obvious counterargument is synthetic testing: holdout test sets with known correct answers, run continuously. Any serious deployment should include it. But holdout tests only measure performance against the distribution they represent. They cannot account for the real world as it evolves. New claim types, new fraud techniques, new weather patterns, and new asset categories are not in your holdout set. Experienced handlers encounter the full range of claims as they arrive, including the ones nobody anticipated. Synthetic testing is necessary. It is not sufficient. That is one more reason to value your experienced handlers.
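To make the holdout idea concrete, a recurring regression check looks something like the sketch below. The gold cases, the threshold, and the `call_model` hook are all illustrative placeholders; the closing comment is the limitation the paragraph above describes.

```python
# Minimal sketch: a continuously scheduled holdout regression check.
# Gold cases and the alert threshold are illustrative; `call_model` stands
# in for however you invoke the production decision pipeline.

GOLD_SET = [
    {"claim": "Burst pipe, kitchen floor damage, escape-of-water cover held", "expected": "accept"},
    {"claim": "Storm claim for fence panels only, named policy exclusion", "expected": "decline"},
    # ...hundreds more adjudicated cases with known correct outcomes
]
ALERT_THRESHOLD = 0.95  # accuracy below this triggers human investigation

def run_holdout_check(call_model) -> float:
    correct = sum(
        1 for case in GOLD_SET if call_model(case["claim"]) == case["expected"]
    )
    accuracy = correct / len(GOLD_SET)
    if accuracy < ALERT_THRESHOLD:
        print(f"ALERT: holdout accuracy {accuracy:.1%} below {ALERT_THRESHOLD:.0%}")
    # Limitation: this only detects regressions on the distribution the gold
    # set represents. A genuinely novel claim type never appears here at all.
    return accuracy
```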
5. Why the Domain Will Not Hold Still
The FCA's December 2025 review of home and travel claims handling found that insurers already struggle to monitor human-operated claims processes effectively [5]. Firms produced poor-quality management information and failed to use it to identify or assess customer outcomes. One firm's MI showed that key service-level agreements were consistently outside tolerance throughout 2024, and that 70% of complaints about claims service were upheld, yet the firm reduced claims-handling resources anyway. This is the current state with human handlers. Layering opaque AI decision-making on top of processes where firms already cannot demonstrate effective outcome monitoring does not close the governance gap. It widens it.
New asset types appear constantly: electric vehicles with battery fire risks that no historical training data contains, smart home devices with novel failure modes, and autonomous driving features that blur liability lines. The FCA's review found that only 32% of storm damage claims in their sample resulted in payment during 2024, with 49% rejected [5]. Firms' definitions and processes have not kept pace with changing customer expectations and weather patterns. Human handlers struggle with this adaptation. Automated systems trained on historical patterns will struggle more.
The fraud landscape is transforming just as rapidly. The Association of British Insurers detected over £1 billion in fraudulent claims in 2024 [6], but the nature of that fraud is changing faster than any static system can adapt. In 2019, fraudsters used AI-generated voice technology to impersonate the CEO of a UK energy company's German parent, convincing a senior executive to transfer €220,000. The executive recognised his boss's slight German accent and familiar speech patterns, all of which were artificially generated [7]. In January 2024, British engineering firm Arup lost approximately $25 million (around £20 million) when a finance employee was deceived by a video call in which every participant, including a fake CFO, was an AI-generated deepfake [8]. One UK insurer reported a 300% increase in AI-manipulated vehicle damage images submitted in claims within a single year: digitally altered photographs with fabricated damage, swapped number plates, or entirely synthetic walkaround videos [9].
Historical fraud models were not trained on deepfake evidence. The fraudsters are adopting AI too, and their patterns will not match your historical indicators.
6. The False Economy
The business case for automation usually includes headcount reduction. Whether that reduction comes from deliberate cost-cutting or from the inability to replace departing staff, the outcome is the same: fewer experienced handlers in the operation.
The economics look compelling. Fifty handlers at £40,000 equal £2 million in annual cost. Reduce to ten, whether through redundancy or natural attrition, and the cost base drops by £1.6 million annually. If you cannot recruit replacements anyway, the saving appears to arrive for free.
What the economics miss: a single regulatory failure or litigation loss could cost multiples of your annual savings. And if the workforce crisis is driving the reduction rather than a deliberate strategy, the risk is accumulating without anyone having consciously decided to accept it. This is not cost reduction. It is a risk transfer from visible operational costs to invisible regulatory and reputational exposure.
7. What Sustainable Deployment Actually Requires
Production deployment of AI in claims requires infrastructure that most vendors conveniently omit.
You need continuous parallel testing: a percentage of claims routed through both AI and human paths, with decisions compared and divergence tracked. Not as a validation phase you graduate from, but as an ongoing operational infrastructure. The percentage might decrease as confidence builds, but it should never reach zero.
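A minimal sketch of what that routing can look like is below. The sample rate and the function hooks are illustrative assumptions, not a prescribed design; the essential property is that the human decision is made blind to the AI outcome, so divergence measures genuine disagreement rather than anchoring.

```python
# Minimal sketch: shadow routing for continuous parallel testing.
# The sample rate and the decide/log hooks are illustrative placeholders.
import random

PARALLEL_SAMPLE_RATE = 0.10  # fraction of claims also decided by a human

def handle_claim(claim, ai_decide, human_decide, log_divergence):
    ai_decision = ai_decide(claim)
    if random.random() < PARALLEL_SAMPLE_RATE:
        # The handler sees the claim, not the AI decision, to avoid anchoring.
        human_decision = human_decide(claim)
        if human_decision != ai_decision:
            log_divergence(claim, ai_decision, human_decision)
    return ai_decision
```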
You need divergence thresholds defined in advance. At what level do you investigate? Pause deployments? Trigger recalibration? Roll back to a previous known-good state? These are governance decisions made in advance, with stakeholder sign-off, ready to execute before you need them.
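Written down as configuration, such an escalation ladder might look like the sketch below. The thresholds and action names are placeholders; what matters is that the mapping exists, is signed off, and is machine-checkable rather than negotiated in the middle of an incident.

```python
# Illustrative escalation ladder: divergence rate -> predefined action.
# Thresholds are placeholders to be agreed with governance stakeholders.
DIVERGENCE_ACTIONS = [
    (0.02, "log_only"),         # up to 2%: expected noise, keep recording
    (0.05, "investigate"),      # 2-5%: senior handlers review divergent cases
    (0.10, "pause_expansion"),  # 5-10%: freeze automation scope and recalibrate
    (1.00, "rollback"),         # above 10%: revert to last known-good state
]

def action_for(divergence_rate: float) -> str:
    for threshold, action in DIVERGENCE_ACTIONS:
        if divergence_rate <= threshold:
            return action
    return "rollback"
```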
You need novel claim detection: systems that identify claims falling outside the training distribution. New asset types, unusual combinations of circumstances, patterns you have not encountered before. These route automatically to human handlers, and the outcomes feed back into training to keep the system current.
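One simple way to implement that check is distance-based out-of-distribution detection over claim embeddings, sketched below. The embedding hook and the distance threshold are illustrative assumptions; in practice the threshold would be calibrated on held-out historical claims.

```python
# Minimal sketch: route claims that sit far from anything in the training
# data to a human handler. `embed` and the threshold are placeholders.
import numpy as np

OOD_DISTANCE_THRESHOLD = 0.35  # calibrate on held-out historical claims

def is_novel(claim_text, training_embeddings, embed) -> bool:
    vector = embed(claim_text)
    # Cosine similarity to every training example; keep the best match.
    sims = training_embeddings @ vector / (
        np.linalg.norm(training_embeddings, axis=1) * np.linalg.norm(vector)
    )
    return (1.0 - sims.max()) > OOD_DISTANCE_THRESHOLD

def route(claim_text, training_embeddings, embed, ai_decide, human_decide):
    if is_novel(claim_text, training_embeddings, embed):
        # Human decision; the adjudicated outcome feeds the next training cycle.
        return human_decide(claim_text)
    return ai_decide(claim_text)
```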
This infrastructure means maintaining significant human expertise alongside any AI deployment. That expertise is what makes everything else trustworthy.
8. The Organisational Trap
Whether experienced handlers leave through redundancy or retirement, the effect is the same. You lose the expertise to train AI improvements, the judgment for genuinely complex cases, and the institutional knowledge that should inform system design.
If you are deliberately cutting roles, you are trading recoverable costs (payroll) for irrecoverable capabilities (expertise). If handlers are leaving through attrition and you are not replacing them because automation appears to fill the gap, the trade is happening by default rather than by decision. Either way, the capability is gone.
You cannot rehire institutional knowledge. You cannot rebuild years of claims experience on a compressed timeline. The handlers who understood which data sources had quirks, which policy wordings were ambiguous, and which repair networks were reliable, took that knowledge with them when they left. The industry's talent pipeline is not producing replacements at a rate even close to the rate at which experienced handlers are departing.
The FCA found firms whose governance forum records lacked sufficient detail to demonstrate meaningful discussion, challenge, or decision-making [10]. Automating on top of weak governance does not fix the governance. It makes the consequences slower to surface and harder to remediate.
9. The Real Question
The question is not whether AI can handle claims. For many claim types, it demonstrably can. The question is whether you can tell when it stops handling them well, in a domain where the underlying reality keeps shifting, where the regulator already has concerns about outcome monitoring, and where the consequences of undetected failure compound silently across thousands of decisions.
The insurers who succeed will not be the ones who automate fastest. They will be the ones who keep enough experienced human judgment in the operation to know whether their automation is still working. That is not a transition phase. It is a permanent requirement. And the moment it disappears, so does your ability to distinguish a system that works from one that has quietly stopped working. That is the trap.
Authors: Chris Brown - The Build Paradox, Mike Daly - Insurtech World
References