Evals - updated May 24, 2026

Real task evals, published with the failure.

This page now includes a deterministic eval run against Dhamaka's actual task code: address autofill, contextual spellcheck, smart paste, formula transform, formula explain, and formula debug.

Scope

These are real evals, but they cover only the task fast path. They do not download or call an LLM. They score the rules, regex, fuzzy-matching, and structural-rewrite layers that Dhamaka uses as tools and validators around model workflows.

Model-quality scoring for Transformers.js and browser Prompt API fallbacks is reported separately from this deterministic fast-path run.

Latest deterministic eval run

64 / 65 passing across six task suites.

Overall: 98.5% (64 of 65 golden cases passing)

Passing: 64. The autofill, paste, and formula suites are fully green.

Failing: 1. Spellcheck misses one context rule: too -> to.

Artifact: JSON. Download the eval result.

Suite results

What passed, and what did not.

The evals use gold assertions against structured task outputs. A pass means the current implementation returned the expected field, suggestion, formula rewrite, explanation, or diagnostic. The failing case stays visible because that is the point of evals.
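A gold assertion can be sketched as an exact comparison between the structured output a task returns and the expected value of each asserted field. The sketch below is illustrative only: `GoldCase`, `runSuite`, and the toy `lookupCity` task are hypothetical names and data, not Dhamaka's API or its golden cases.

```typescript
// Hypothetical shapes for illustration; not Dhamaka's actual types.
type TaskOutput = { [field: string]: string };

interface GoldCase {
  name: string;
  input: string;
  expected: TaskOutput; // only the fields listed here are asserted
}

interface SuiteResult {
  passed: number;
  total: number;
  failures: string[]; // names of failing cases, kept so misses stay visible
}

// Run every case through the task and compare each expected field exactly.
function runSuite(task: (input: string) => TaskOutput, cases: GoldCase[]): SuiteResult {
  const failures: string[] = [];
  for (const c of cases) {
    const actual = task(c.input);
    let ok = true;
    for (const field of Object.keys(c.expected)) {
      if (actual[field] !== c.expected[field]) {
        ok = false;
      }
    }
    if (!ok) {
      failures.push(c.name);
    }
  }
  return { passed: cases.length - failures.length, total: cases.length, failures };
}

// Toy stand-in for a rules-based task: resolves two made-up city aliases.
function lookupCity(input: string): TaskOutput {
  const aliases: { [key: string]: TaskOutput } = {
    sf: { city: "San Francisco", state: "CA" },
    nyc: { city: "New York", state: "NY" },
  };
  const hit = aliases[input.trim().toLowerCase()];
  return hit !== undefined ? hit : {};
}

const demo = runSuite(lookupCity, [
  { name: "alias sf", input: "SF", expected: { city: "San Francisco", state: "CA" } },
  { name: "alias nyc", input: " nyc ", expected: { city: "New York" } },
  { name: "alias la", input: "LA", expected: { city: "Los Angeles" } },
]);
```

In this toy run the third case fails because the alias table has no entry for it, and the runner reports the case by name rather than hiding it.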

Address autofill

Aliases, typos, international cities, currencies, and nonsense input.

16 / 16

Spellcheck

Common misspellings, homophones, clean text, empty input, and multiple suggestions.

14 / 15

Smart paste

Email, phone, website, Twitter handle, name heuristic, freemail handling, and multi-email capture.

10 / 10

Formula transform

Discount, tax, round, multiply, divide, IFERROR, currency conversion, sign flip, absolute value, and unknown instruction fallback.

12 / 12

Formula explain

Function explanation, arithmetic explanation, and unknown-function fallback.

6 / 6

Formula debug

Spreadsheet error advice for #DIV/0!, #N/A, #REF!, #NAME?, static division risk, and unknown errors.

6 / 6
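The headline figure follows from the per-suite counts listed above. As a quick check of the arithmetic (the variable names are illustrative, the counts are the ones published on this page):

```typescript
// Per-suite counts as published in the suite results on this page.
const suites: { name: string; passed: number; total: number }[] = [
  { name: "Address autofill", passed: 16, total: 16 },
  { name: "Spellcheck", passed: 14, total: 15 },
  { name: "Smart paste", passed: 10, total: 10 },
  { name: "Formula transform", passed: 12, total: 12 },
  { name: "Formula explain", passed: 6, total: 6 },
  { name: "Formula debug", passed: 6, total: 6 },
];

let passedTotal = 0;
let caseTotal = 0;
for (const s of suites) {
  passedTotal += s.passed;
  caseTotal += s.total;
}

// Pass rate as a percentage, rounded to one decimal place.
const passRate = Math.round((passedTotal / caseTotal) * 1000) / 10;
```

The six suites sum to 64 passing out of 65 cases, which rounds to 98.5%.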

Failing eval

The current miss is useful.

Contextual spellcheck

The rule layer catches their -> there, Your welcome -> you're welcome, and common misspellings like recieve, but this eval currently fails:

FAIL: "I am going too the store" (expected too -> to).

Why publish it?

A real eval page exposes product gaps. This one shows the rules layer needs another context pattern, or the model fallback needs to cover this class reliably.
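One way such a context pattern might look is a rule that flags "too" when a determiner follows it directly. This is a hypothetical sketch, not Dhamaka's rule set: the `Suggestion` shape, the function name, and the determiner list are all assumptions made for illustration.

```typescript
// Hypothetical context rule: "too" directly before a determiner is
// very likely a misspelling of the preposition "to".
interface Suggestion {
  from: string;
  to: string;
  index: number; // character offset of the flagged word
}

function suggestTooToTo(text: string): Suggestion[] {
  // The lookahead checks the next word without consuming it.
  const pattern = /\btoo(?=\s+(?:the|a|an|my|your|our|his|her|their)\b)/gi;
  const out: Suggestion[] = [];
  let m: RegExpExecArray | null = pattern.exec(text);
  while (m !== null) {
    out.push({ from: text.slice(m.index, m.index + 3), to: "to", index: m.index });
    m = pattern.exec(text);
  }
  return out;
}

// The failing sentence from this page, plus a sentence where "too" is correct.
const failing = suggestTooToTo("I am going too the store");
const clean = suggestTooToTo("That is too much, and I want one too.");
```

The rule flags the failing sentence and leaves the correct uses of "too" alone. A production rule would also need to preserve capitalization and be checked against the clean-text cases to avoid new false positives.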

Scope boundary

The current run does not score LLM fallback quality, model latency, cold-start download cost, or cross-browser backend differences. Those belong in a separate model-quality report.

Scoring model

Each eval scores the task, not the model brand.

A useful eval tells Dhamaka whether the fast path answered correctly, whether escalation was needed, and whether the final output was safe to apply.

Autofill

City aliases, typos, international regions, ambiguous city names, empty input, confidence calibration.

exact fields + latency

Spellcheck

Homophones, misspellings, clean text, bad model suggestions, multiple suggestions, apply-fix behavior.

precision + recall
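Precision and recall for spellcheck can be sketched by comparing the suggestions a run returned with the suggestions the golden case expected. The helper below is a hypothetical illustration: the "from->to" string keys, the function name, and the empty-denominator convention are assumptions, and the sample lists are not the published run.

```typescript
// Hypothetical scoring helper. Each suggestion is keyed as "from->to"
// and each list is assumed to contain no duplicates.
interface PrecisionRecall {
  precision: number;
  recall: number;
}

function scoreSuggestions(expected: string[], actual: string[]): PrecisionRecall {
  let truePositives = 0;
  for (const s of actual) {
    if (expected.indexOf(s) !== -1) {
      truePositives += 1;
    }
  }
  // Assumed convention: an empty denominator scores 1, so clean text
  // with no expected and no returned suggestions counts as correct.
  const precision = actual.length === 0 ? 1 : truePositives / actual.length;
  const recall = expected.length === 0 ? 1 : truePositives / expected.length;
  return { precision, recall };
}

// Illustrative input: three expected corrections, two of them returned.
const score = scoreSuggestions(
  ["their->there", "recieve->receive", "too->to"],
  ["their->there", "recieve->receive"]
);
```

Here every returned suggestion is correct, so precision is 1, while the missed correction pulls recall down to two thirds.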

Smart paste

Business cards, messy signatures, freemail domains, international phones, user-overridden fields.

field F1 + no overwrite
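Field F1 and the no-overwrite check can be sketched as two small helpers. Everything here is hypothetical: the `Fields` type, the function names, and the sample contact data are assumptions for illustration, not Dhamaka's smart paste implementation.

```typescript
type Fields = { [name: string]: string };

// Field-level F1: a predicted field counts as correct only on an exact value match.
function fieldF1(expected: Fields, predicted: Fields): number {
  const expectedKeys = Object.keys(expected);
  const predictedKeys = Object.keys(predicted);
  let correct = 0;
  for (const key of predictedKeys) {
    if (expected[key] === predicted[key]) {
      correct += 1;
    }
  }
  if (correct === 0) {
    return 0;
  }
  const precision = correct / predictedKeys.length;
  const recall = correct / expectedKeys.length;
  return (2 * precision * recall) / (precision + recall);
}

// Merge extracted values into a form, skipping fields the user already edited.
function applyWithoutOverwrite(form: Fields, extracted: Fields, userEdited: string[]): Fields {
  const merged: Fields = { ...form };
  for (const key of Object.keys(extracted)) {
    if (userEdited.indexOf(key) === -1) {
      merged[key] = extracted[key];
    }
  }
  return merged;
}

// Made-up sample: two of three predicted fields match the expected values.
const f1 = fieldF1(
  { email: "ana@example.com", phone: "+1 555 0100", name: "Ana Lopez" },
  { email: "ana@example.com", phone: "+1 555 0100", website: "example.com" }
);

// The user typed the name by hand, so only the email is filled in.
const merged = applyWithoutOverwrite(
  { name: "Ana L.", email: "" },
  { name: "Ana Lopez", email: "ana@example.com" },
  ["name"]
);
```

Scoring the two properties separately keeps an extraction that is accurate but overwrites user input from passing on accuracy alone.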

Formulas

Discounts, rounding, tax, IFERROR, references, non-formula cells, semantic equivalence.

golden output + AST checks

Privacy

Confirm that rules-first paths make zero external requests and model downloads happen only when needed.

network audit

Current scorecard

Dhamaka evals at a glance.

Task accuracy

98.5%

64/65 golden task evals passed across autofill, spellcheck, smart paste, and formula tasks. Current miss: too -> to.

Model fallbacks

18/18

Runtime fallback tests passed for factory selection, MockEngine streaming and aborts, and real WASM load, generation, determinism, and abort.

Product budgets

17/17

11 rules-path p99 checks stayed under 1 ms, 6 browser budget checks passed, external requests stayed at 0, and WASM cold start median was 0.69 ms.
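A p99 budget check of this kind can be sketched with a nearest-rank percentile over latency samples. The helper and the synthetic samples below are assumptions for illustration; they are not Dhamaka's benchmark harness or its measured timings.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("no samples");
  }
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  // Multiply before dividing so integer inputs avoid floating-point drift.
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Synthetic samples: 98 fast calls, one slower call, one outlier.
const samples: number[] = [];
for (let i = 0; i < 98; i++) {
  samples.push(0.2);
}
samples.push(0.8, 5.0);

const p99 = percentile(samples, 99);
const withinBudget = p99 < 1; // budget from this page: rules-path p99 under 1 ms
```

With 100 samples the 99th-percentile rank lands on the slower call rather than the single outlier, so this synthetic run stays within the 1 ms budget even though its maximum does not.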