The plugin has two layers of evidence:
1. Unit tests (`npm test`): 33 cases over `detectors.ts` and `anonymizer.ts`. Authoritative for correctness on hand-picked fixtures, including French-specific edge cases (Corsica NIR, +33 ↔ 0 coreference) and a regression test for every fixed bug.
2. Local evaluation on AI4Privacy FR (`npm run eval:ai4privacy`): streaming evaluator over ~40 000 French-locale samples. Reports precision/recall/F1 per detector type. Output is aggregate-only; no per-sample data ever leaves the process.
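The dataset lives in `datasets/ai4privacy-400k/data/{train,validation}/fr.jsonl`: JSONL with one sample per line. Each record carries `source_text`, `locale`, `language`, `split`, and a `privacy_mask` array of labelled spans (`label`, `start`, `end`, `value`).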
The fr.jsonl file mixes two locales: FR (France) and CH (Swiss
francophone). We filter locale=="FR" for the headline metrics. The CH
half is a natural source of true negatives (Swiss AVS numbers, +41
phones) — we plan to use it that way (see ROADMAP.md).
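The locale filter can be sketched as a lazy pass over JSONL lines. This is a minimal illustration, not the evaluator's actual code: the `Sample` and `GoldSpan` interfaces are inferred from the record fields documented here, and `samplesForLocale` is a hypothetical helper name.

```typescript
// Hypothetical record shape, inferred from the fields of one fr.jsonl line.
interface GoldSpan {
  label: string;
  start: number; // character offset, inclusive
  end: number;   // character offset, exclusive
  value: string;
}

interface Sample {
  source_text: string;
  locale: "FR" | "CH";
  language: string;
  split: "train" | "validation";
  privacy_mask: GoldSpan[];
}

// Parse one line at a time and yield only samples of the requested locale,
// so the whole file never has to sit in memory.
function* samplesForLocale(
  lines: Iterable<string>,
  locale: Sample["locale"],
): Generator<Sample> {
  for (const line of lines) {
    if (line.trim() === "") continue; // tolerate blank lines
    const sample = JSON.parse(line) as Sample;
    if (sample.locale === locale) yield sample;
  }
}

// Two in-memory lines standing in for fr.jsonl (made-up placeholder values).
const demoLines: string[] = [
  JSON.stringify({ source_text: "a@b.fr", locale: "FR", language: "fr", split: "train", privacy_mask: [] }),
  JSON.stringify({ source_text: "+41 21 000 00 00", locale: "CH", language: "fr", split: "train", privacy_mask: [] }),
];
const frOnly = Array.from(samplesForLocale(demoLines, "FR"));
const chOnly = Array.from(samplesForLocale(demoLines, "CH"));
console.log(frOnly.length, chOnly.length); // 1 1
```

In a real run the `lines` argument would come from a streaming line reader rather than an array; the filter itself stays the same.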
AI4Privacy labels are mapped onto our detector types. Only labels we claim to detect are scored; everything else falls into "out-of-scope" and is reported but not weighed against precision/recall.
| AI4Privacy label | Our type | Notes |
|---|---|---|
| EMAIL | EMAIL | direct match |
| TELEPHONENUM | TEL | dataset includes many non-FR formats; recall is bounded by that |
| CREDITCARDNUMBER | CB | dataset values are not Luhn-valid; see "synthetic data limitation" |
| SOCIALNUM | NIR | valid mapping only on locale=="FR"; Swiss AVS values would be wrong gold for us |
| ACCOUNTNUM | (none) | too generic; varies in length and format. Not mapped to IBAN. |
| GIVENNAME, SURNAME, CITY, STREET, ZIPCODE, DATEOFBIRTH, IDCARDNUM, BUILDINGNUM, DRIVERLICENSENUM, PASSWORD, TAXNUM, USERNAME | (none) | out of scope |
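
One way to express this mapping in code is a small lookup plus a locale guard for SOCIALNUM. This is a sketch under assumptions: `DetectorType`, `LABEL_TO_TYPE`, and the two-argument signature of `mappedLabel` are illustrative names and shapes, not necessarily what the evaluator uses.

```typescript
type DetectorType = "EMAIL" | "TEL" | "CB" | "NIR";

// Only labels the plugin claims to detect are listed.
// Anything absent from the map is treated as out of scope.
const LABEL_TO_TYPE = new Map<string, DetectorType>([
  ["EMAIL", "EMAIL"],
  ["TELEPHONENUM", "TEL"],
  ["CREDITCARDNUMBER", "CB"],
  ["SOCIALNUM", "NIR"],
]);

// Returns the detector type a gold label is scored against,
// or undefined when the label is out of scope.
function mappedLabel(label: string, locale: "FR" | "CH"): DetectorType | undefined {
  const type = LABEL_TO_TYPE.get(label);
  // Assumption: SOCIALNUM is valid gold for NIR only on FR samples,
  // because Swiss AVS numbers are not NIRs.
  if (type === "NIR" && locale !== "FR") return undefined;
  return type;
}

console.log(mappedLabel("TELEPHONENUM", "FR")); // TEL
console.log(mappedLabel("SOCIALNUM", "CH"));    // undefined
console.log(mappedLabel("ACCOUNTNUM", "FR"));   // undefined
```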
Predicted span P is a true positive for type t if there exists a
gold span G with mappedLabel(G) == t and overlap(P, G), i.e. the two spans share at least one character.
Span overlap (not exact equality) is the right metric here because the dataset's annotations sometimes include or exclude trailing punctuation inconsistently. We then count:
- TP_t = predicted spans of type t that overlap a gold of type t
- FP_t = predicted spans of type t with no overlapping gold of type t
- FN_t = gold spans of type t with no overlapping prediction of type t

Standard precision / recall / F1 from there.
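The counting rule can be sketched as follows. The `Span` and `Counts` shapes and the `countMatches` name are hypothetical; the sketch only assumes half-open `[start, end)` offsets, as in the dataset's `privacy_mask` entries. It also shows why the results tables carry separate "Pred matched" and "Gold matched" columns: several predictions may overlap the same gold span.

```typescript
interface Span {
  type: string;
  start: number; // inclusive
  end: number;   // exclusive
}

interface Counts {
  gold: number;
  pred: number;
  predMatched: number; // TP_t: predictions overlapping some gold of the same type
  goldMatched: number; // gold spans overlapped by some prediction (gold - FN_t)
}

// Two half-open spans overlap when each starts before the other ends.
const overlap = (a: Span, b: Span): boolean => a.start < b.end && b.start < a.end;

function countMatches(pred: Span[], gold: Span[], type: string): Counts {
  const p = pred.filter((s) => s.type === type);
  const g = gold.filter((s) => s.type === type);
  return {
    gold: g.length,
    pred: p.length,
    predMatched: p.filter((x) => g.some((y) => overlap(x, y))).length,
    goldMatched: g.filter((y) => p.some((x) => overlap(x, y))).length,
  };
}

// One gold phone number, predicted as two halves plus one spurious hit.
const goldSpans: Span[] = [{ type: "TEL", start: 10, end: 22 }];
const predSpans: Span[] = [
  { type: "TEL", start: 10, end: 16 },
  { type: "TEL", start: 16, end: 22 },
  { type: "TEL", start: 40, end: 50 },
];
const telCounts = countMatches(predSpans, goldSpans, "TEL");
console.log(telCounts);
// { gold: 1, pred: 3, predMatched: 2, goldMatched: 1 }
```

Here FP_t is `pred - predMatched` (1) and FN_t is `gold - goldMatched` (0). The quadratic `some` scan is fine for a handful of spans per sample; it is chosen for readability, not speed.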
Run on 2026-05-17. Default config (FR-only TEL):
| Type | Gold | Pred | Pred matched | Gold matched | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| EMAIL | 3257 | 3240 | 3232 | 3232 | 99.75 % | 99.23 % | 99.49 % |
| TEL | 2675 | 361 | 254 | 254 | 70.36 % | 9.50 % | 16.73 % |
| CB | 1515 | 875 | 111 | 111 | 12.69 % | 7.33 % | 9.29 % |
| NIR | 1468 | 12 | 7 | 7 | 58.33 % | 0.48 % | 0.95 % |

With detectInternationalPhones ON:

| Type | Gold | Pred | Pred matched | Gold matched | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| EMAIL | 3257 | 3240 | 3232 | 3232 | 99.75 % | 99.23 % | 99.49 % |
| TEL | 2675 | 1581 | 1474 | 1358 | 93.23 % | 50.77 % | 65.74 % |
| CB | 1515 | 875 | 111 | 111 | 12.69 % | 7.33 % | 9.29 % |
| NIR | 1468 | 12 | 7 | 7 | 58.33 % | 0.48 % | 0.95 % |
Enabling international phones lifts TEL F1 from 16.73 % → 65.74 % with
no measurable regression on the other types. Precision increases
because the international regex applies the same strict boundary
discipline as the French one, replacing many of the leftover ambiguous
FR-substring predictions with cleaner full +CC… matches.
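
The percentages in both tables can be re-derived from the four count columns. The sketch below assumes precision is "Pred matched" over "Pred", recall is "Gold matched" over "Gold", and F1 is their harmonic mean; that reading reproduces the TEL rows exactly. The `Row`, `metrics`, and `pct` names are illustrative.

```typescript
interface Row {
  gold: number;
  pred: number;
  predMatched: number;
  goldMatched: number;
}

function metrics(row: Row): { precision: number; recall: number; f1: number } {
  const precision = row.pred === 0 ? 0 : row.predMatched / row.pred;
  const recall = row.gold === 0 ? 0 : row.goldMatched / row.gold;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Format a ratio as a percentage with two decimals, as in the tables.
const pct = (x: number): string => (100 * x).toFixed(2) + " %";

const telDefault = metrics({ gold: 2675, pred: 361, predMatched: 254, goldMatched: 254 });
const telIntl = metrics({ gold: 2675, pred: 1581, predMatched: 1474, goldMatched: 1358 });

console.log(pct(telDefault.precision), pct(telDefault.recall), pct(telDefault.f1));
// 70.36 % 9.50 % 16.73 %
console.log(pct(telIntl.precision), pct(telIntl.recall), pct(telIntl.f1));
// 93.23 % 50.77 % 65.74 %
```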

CH locale (Swiss francophone samples):

| Type | Default F1 | International F1 | Notes |
|---|---|---|---|
| EMAIL | 99.70 % | 99.70 % | Language-agnostic. |
| TEL | 8.26 % | 64.51 % | +41 phones caught only with the flag. |
| CB | 8.07 % | 8.07 % | Same synthetic-data limitation as FR. |
| NIR | n/a (0 TP) | n/a (0 TP) | Specificity check: 2 predictions on 40 223 samples = 0.05 ‰ (Swiss AVS ≠ NIR FR; almost perfectly silent). |
Out-of-scope gold totals (entities we don't claim to detect): 34 205 spans,
dominated by GIVENNAME (6594), USERNAME (4477), SURNAME (4339),
CITY (4077).
EMAIL. 99.5 % F1 is the real number, on real-ish text. This is the metric to cite if anyone asks "is this plugin reliable for emails?".
TEL. Two regimes, one config flag:
- Default (FR-only): precision 70 %, recall 9.5 %. Recall is bounded because many TELEPHONENUM values are not French-formatted. The 107 remaining FPs are FR-formatted-by-chance substrings or ambiguous identifiers.
- With detectInternationalPhones: precision 93 %, recall 51 %. This is the regime to use whenever your text contains non-FR phones. Precision goes UP because the international regex's boundary discipline catches numbers as full +CC… matches that the FR regex previously sliced into ambiguous substrings.

CB and NIR. These two numbers are misleading without context. They are not measuring detection quality on real text; they are measuring detection quality against a dataset that doesn't respect the validators our detectors depend on.
This is the synthetic-data limitation. Our checksum-first design is a feature, not a tuning knob. The right way to measure recall on these types is via the unit tests (which use fixtures derived from official validators — see LICENSING.md) plus, if needed, manually-curated real samples under NDA.
The residual FPs on CB are explained by random chance: about 10 % of
random 13–19 digit identifiers that start with [3-6] also pass Luhn.
This is the irreducible floor for any Luhn + BIN detector without
contextual signals (keywords such as "carte" or "credit card").
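To make the 10 % figure concrete, here is a minimal Luhn check with a `[3-6]` prefix gate. It is a sketch of the general technique, not the plugin's `detectors.ts`; the names `luhnValid` and `looksLikeCard` are hypothetical. For any fixed prefix, exactly one of the ten possible final digits yields a valid checksum, which is where the one-in-ten chance rate comes from.

```typescript
// Luhn: from the right, double every second digit (subtracting 9 when the
// doubled value exceeds 9), sum everything, and require a multiple of 10.
function luhnValid(digits: string): boolean {
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48; // '0' is char code 48
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Hypothetical gate mirroring the description: 13 to 19 digits,
// first digit in [3-6], and a valid Luhn checksum.
function looksLikeCard(digits: string): boolean {
  return /^[3-6]\d{12,18}$/.test(digits) && luhnValid(digits);
}

// For a fixed 15-digit prefix, try all ten possible last digits.
const cardPrefix = "411111111111111";
const passing = Array.from({ length: 10 }, (_, d) => cardPrefix + String(d)).filter(
  (candidate) => luhnValid(candidate),
);
console.log(passing); // [ '4111111111111111' ]
console.log(looksLikeCard("4111111111111111")); // true
console.log(looksLikeCard("9111111111111111")); // false (prefix gate)
```

A contextual signal such as a nearby keyword would be an additional filter on top of `looksLikeCard`; it is not shown here.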
The script is fully deterministic given the dataset. ~1 second wall time on a modern laptop, 40 k samples streamed without loading into memory.
The diagnostic helper `npm run eval:inspect-tel` shows where false positives come from: it prints the first 25 TEL FPs with context.
Outputs go to stdout only. Do not redirect to a versioned file — that would leak AI4Privacy snippets into the repo, which the licence forbids.
Synthetic-data limitation, by type:
- CB: AI4Privacy generates random 16-digit strings for CREDITCARDNUMBER. Our Luhn check rejects ~90 % of those, and the BIN-prefix gate ([3-6]) discards another ~40 % that start with 0, 1, 2, 7, 8, 9. Resulting recall ≈ 7 %, which is approximately the Luhn-pass-rate × BIN-pass-rate for random 16-digit numbers. Real cards always pass both, so on real text recall would approach 100 %.
- NIR: same for SOCIALNUM. The templates do not produce mod-97-valid keys, so our nirChecksum rejects them. The 0.48 % recall is the fraction of random keys that happen to be correct by chance (1 / ~97 ≈ 1 %).
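The mod-97 key rule can be sketched as below. This is an illustration of the standard NIR control key (key = 97 minus the 13-character body modulo 97, with Corsican departments 2A and 2B counted as 19 and 18), not the plugin's `nirChecksum`; the function name `nirKeyValid` and its input format (15 characters, no spaces) are assumptions.

```typescript
// Validates the 2-digit control key of a 15-character NIR.
function nirKeyValid(nir: string): boolean {
  const s = nir.toUpperCase();
  // 5 digits (sex, year, month), department (2 digits, 2A or 2B),
  // then 6 digits (commune, order) and the 2-digit key.
  if (!/^\d{5}(\d{2}|2A|2B)\d{8}$/.test(s)) return false;
  // Corsica: department 2A counts as 19 and 2B as 18 in the key computation.
  const body = s.slice(0, 13).replace("2A", "19").replace("2B", "18");
  const key = Number(s.slice(13));
  // A 13-digit body is well below 2^53, so plain numbers are exact here.
  return 97 - (Number(body) % 97) === key;
}

// The SOCIALNUM value from the dataset sample has a key that does not match:
console.log(nirKeyValid("212036119888849")); // false
// The same body with its computed key (03) passes:
console.log(nirKeyValid("212036119888803")); // true
```

With 97 possible key values, a randomly generated key matches roughly once in 97 tries, which is the chance rate quoted above.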
Sample record format (one such object per line of fr.jsonl):

```json
{
  "source_text": "<p>Nom : Nkunku</p>…",
  "locale": "FR" | "CH",
  "language": "fr",
  "split": "train" | "validation",
  "privacy_mask": [
    {"label": "SURNAME", "start": 9, "end": 15, "value": "Nkunku"},
    {"label": "SOCIALNUM", "start": 51, "end": 66, "value": "212036119888849"},
    …
  ]
}
```
Span overlap predicate (half-open [start, end) offsets):

```
overlap(a, b) = a.start < b.end && b.start < a.end
```
Evaluation commands:

```sh
# Requires: datasets/ai4privacy-400k/data/{train,validation}/fr.jsonl
npm run eval:ai4privacy                    # default config (FR-only TEL)
ANONYMIZE_INTL=1 npm run eval:ai4privacy   # with international phones ON

npm run eval:inspect-tel                   # first 25 TEL FPs with context
```