June 13, 2026 · 4 min read

InsureBench Preview: Cost Per Task, Not Cost Per Token

Early InsureBench results show why insurance AI evaluation needs to measure model plus harness performance, cost per task, and task-specific reliability. The charts are preliminary, but the deployment question is already clear.

ai · benchmarks · insurebench · cost

Don Seibert

InsureThing

Preview · Benchmark

InsureBench is live in preview.

These charts are a work in progress. The task bank is growing, model runs are being repeated, and the harness lanes will continue to improve. Still, the early results show why insurance AI should be evaluated by task, by harness, and by cost per completed answer.

Explore InsureBench

InsureBench is our benchmark for evaluating AI systems on insurance work, and the first public view is now live. The first lesson from it is blunt: a model's token price, taken alone, is a poor guide to what a task actually costs.

Most model benchmarks ask which model is better. That is useful, but it is not the main question an insurance organization faces when it deploys AI. In production, the deployed unit is not a model on its own but the model plus the harness around it: the prompt, file reader, retrieval layer, structured output handling, retries, domain context, and escalation path.

That is what InsureBench measures.

The early benchmark includes tasks that look more like insurance work than chat: reading claim files, interpreting coverage facts, working with loss triangles, classifying workers' compensation risks, reviewing underwriting information, and deciding when there is not enough evidence to make a clean call.

The first lesson is simple: cost per token is not cost per task.

The chart above is a preliminary snapshot, not a final ranking. It plots model-plus-harness combinations by accuracy and cost per task. The useful question is not which token price is lowest, but which setup can complete this kind of work reliably at the lowest practical cost.
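One way to read an accuracy-versus-cost plot like this is to set a required accuracy bar and pick the cheapest setup that clears it. The sketch below illustrates that reading only. The function name, the record fields, and every number in it are made up for the example and are not InsureBench results or its actual selection method.

```python
def cheapest_reliable_setup(results, min_accuracy):
    """Return the lowest-cost entry whose accuracy meets the bar, or None."""
    eligible = [r for r in results if r["accuracy"] >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r["cost_per_task"])


# Illustrative, invented data points (model plus harness, accuracy, dollars per task).
results = [
    {"setup": "model-a + basic harness", "accuracy": 0.91, "cost_per_task": 0.40},
    {"setup": "model-b + basic harness", "accuracy": 0.78, "cost_per_task": 0.03},
    {"setup": "model-b + tuned harness", "accuracy": 0.89, "cost_per_task": 0.06},
]

best = cheapest_reliable_setup(results, min_accuracy=0.85)
print(best["setup"])  # model-b + tuned harness
```

In this invented example the lowest-cost setup misses the bar, and the most accurate one costs several times more than the setup that clears it.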

That distinction matters. A model may be inexpensive on a price sheet but expensive in practice if it needs more output tokens, more tool calls, more retries, or more hidden reasoning tokens to reach an answer. Another model may be more expensive per token but finish the job in fewer steps.
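The arithmetic behind that point can be sketched in a few lines. This is a minimal illustration, not InsureBench's accounting: the `RunUsage` schema is hypothetical, hidden reasoning tokens are assumed to be billed at the output rate, and all prices and token counts are invented.

```python
from dataclasses import dataclass


@dataclass
class RunUsage:
    """Usage and outcome for one attempt at a task (hypothetical schema)."""
    input_tokens: int      # prompt, files, and tool results fed back in
    output_tokens: int     # visible output
    reasoning_tokens: int  # hidden reasoning, ASSUMED billed at the output rate
    completed: bool        # True if this attempt produced an accepted answer


def attempt_cost(run, input_price_per_m, output_price_per_m):
    """Dollar cost of one attempt, given prices per million tokens."""
    billed_output = run.output_tokens + run.reasoning_tokens
    return (run.input_tokens * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000


def cost_per_completed_task(runs, input_price_per_m, output_price_per_m):
    """Total spend across all attempts, divided by the number of completed tasks."""
    total = sum(attempt_cost(r, input_price_per_m, output_price_per_m) for r in runs)
    completed = sum(1 for r in runs if r.completed)
    return total / completed if completed else float("inf")


# Invented numbers: a low-priced model that reasons at length and needs retries.
cheap_runs = [RunUsage(20_000, 1_000, 9_000, ok) for ok in (False, True, False, True)]
# Invented numbers: a higher-priced model that finishes each task in one attempt.
strong_runs = [RunUsage(10_000, 1_000, 0, True) for _ in range(2)]

cheap = cost_per_completed_task(cheap_runs, input_price_per_m=0.5, output_price_per_m=2.0)
strong = cost_per_completed_task(strong_runs, input_price_per_m=3.0, output_price_per_m=15.0)
print(f"cheap-per-token model:  ${cheap:.3f} per completed task")   # $0.060
print(f"pricey-per-token model: ${strong:.3f} per completed task")  # $0.045
```

With these made-up inputs, the model with the lower token price ends up costing more per completed task, because failed attempts and hidden reasoning are paid for but do not show up on the price sheet.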

The second early lesson is that reasoning cost can be hard to see if you only look at visible output.

Some models spend heavily on hidden reasoning tokens. That may be worthwhile on difficult work. It may also be unnecessary on repeated operational tasks where a cheaper model with a good harness can clear the required bar.

None of this counts against frontier models. For high-value expert work, or tasks where a small improvement is worth a large cost, the strongest available model may be the right choice. A senior actuary running a difficult one-off analysis in an agentic coding environment is in a different economic position from an automated scan across thousands of files.

For repeated insurance workflows, though, the unit that matters is the completed task. The right answer may be a frontier model. It may be a lower-cost model with a better harness. It may be a routing workflow that handles routine cases cheaply and escalates the uncertain ones.
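A routing workflow of that kind can be sketched as follows. This is an assumption-laden illustration: the stub models, the confidence score, the 0.8 threshold, and the cost figures are all hypothetical, and a real harness would need a calibrated way to decide when a case is uncertain.

```python
def route(task, cheap_model, strong_model, threshold=0.8):
    """Try the cheap lane first; escalate when its confidence is below threshold."""
    answer, confidence = cheap_model(task)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = strong_model(task)
    return answer, "escalated"


def expected_cost_per_task(cheap_cost, strong_cost, escalation_rate):
    """Every task pays for the cheap lane; escalated tasks also pay for the strong lane."""
    return cheap_cost + escalation_rate * strong_cost


# Stub models standing in for real model-plus-harness lanes (hypothetical).
def cheap_model(task):
    return ("routine answer", 0.95) if task == "routine" else ("unsure", 0.40)


def strong_model(task):
    return ("reviewed answer", 0.90)


print(route("routine", cheap_model, strong_model))    # ('routine answer', 'cheap')
print(route("ambiguous", cheap_model, strong_model))  # ('reviewed answer', 'escalated')

# Invented costs: $0.01 cheap lane, $0.20 strong lane, 15% of tasks escalated.
blended = expected_cost_per_task(cheap_cost=0.01, strong_cost=0.20, escalation_rate=0.15)
print(f"${blended:.3f} per task")  # $0.040
```

The blended cost depends heavily on the escalation rate, which is why the routing decision itself has to be measured on real tasks and not assumed.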

InsureBench is built to tell those cases apart.

This is preview data, with the caveats that implies. The question bank is still expanding, some task-family samples are small, and confidence intervals are wide in places. The numbers will move as we add models, repeat runs, and improve harnesses.
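To see why small task-family samples produce wide intervals, here is a minimal sketch using the Wilson score interval for a binomial proportion. The choice of Wilson is an assumption made for illustration; the post does not say which interval method InsureBench uses, and the counts below are invented.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed accuracy (80%), very different sample sizes (invented counts).
small_lo, small_hi = wilson_interval(8, 10)
large_lo, large_hi = wilson_interval(80, 100)
print(f"n=10:  {small_lo:.2f} to {small_hi:.2f}")
print(f"n=100: {large_lo:.2f} to {large_hi:.2f}")
```

With ten tasks, an observed 80% accuracy is compatible with a true rate anywhere from roughly one half to above 90%, so rankings built on small samples can easily reorder as more runs arrive.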

Even at preview stage, the direction is clear: insurance AI evaluation has to measure the model, the harness, the task, and the cost of reaching a reliable answer.

If you have insurance tasks that belong in InsureBench, or a model, harness, or workflow you want measured against real insurance work, send it over: don@insure-thing.com.

Live · Benchmark

See the current InsureBench results.

Interactive leaderboard, task filters, cost per task, and current methodology notes.

insurebench.insure-thing.com
