Evaluation

The plugin has two layers of evidence:

Unit tests (npm test) — 33 cases over detectors.ts and anonymizer.ts. Authoritative for correctness on hand-picked fixtures, including French-specific edge cases (Corsica NIR, +33 ↔ 0 coreference) and a regression test for every fixed bug.
Local evaluation on AI4Privacy FR (npm run eval:ai4privacy) — streaming evaluator over ~40 000 French-locale samples. Reports precision/recall/F1 per detector type. Aggregate-only output, no per-sample data ever leaves the process.

Methodology

Data

datasets/ai4privacy-400k/data/{train,validation}/fr.jsonl — JSONL with one sample per line:

The fr.jsonl file mixes two locales: FR (France) and CH (francophone Switzerland). We keep only samples with locale=="FR" for the headline metrics. The CH half is a natural source of true negatives (Swiss AVS numbers, +41 phones), and we plan to use it that way (see ROADMAP.md).
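A minimal sketch of the locale split, processing one JSONL line at a time so that nothing has to be held in memory. Only the `locale` field is taken from the text above; the `Sample` shape and the helper names (`parseSample`, `hasLocale`, `tally`) are hypothetical and not the evaluator's actual code.

```typescript
// Hypothetical sample shape: only `locale` is named in this document,
// every other field is deliberately left opaque.
interface Sample {
  locale?: string;
  [key: string]: unknown;
}

// Parse one JSONL line. Blank or malformed lines yield null instead of throwing,
// so a single bad line cannot abort a streaming run.
function parseSample(line: string): Sample | null {
  const trimmed = line.trim();
  if (trimmed === "") return null;
  try {
    const value: unknown = JSON.parse(trimmed);
    if (typeof value === "object" && value !== null && !Array.isArray(value)) {
      return value as Sample;
    }
    return null;
  } catch {
    return null;
  }
}

// True when the line parses and carries exactly the requested locale.
function hasLocale(line: string, locale: string): boolean {
  const sample = parseSample(line);
  return sample !== null && sample.locale === locale;
}

// Update a per-locale counter with one line. Calling this once per line
// keeps memory use constant regardless of file size.
function tally(counts: Record<string, number>, line: string): void {
  const sample = parseSample(line);
  if (sample === null || typeof sample.locale !== "string") return;
  counts[sample.locale] = (counts[sample.locale] ?? 0) + 1;
}
```

In a Node.js script, the lines could come from any line-by-line reader; `hasLocale(line, "FR")` then selects the headline set and `hasLocale(line, "CH")` the control set.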

Label mapping

AI4Privacy labels are mapped onto our detector types. Only labels we claim to detect are scored; everything else falls into "out-of-scope" and is reported but not weighed against precision/recall.

AI4Privacy label	Our type	Notes
`EMAIL`	`EMAIL`	direct match
`TELEPHONENUM`	`TEL`	dataset includes many non-FR formats; recall is bounded by that
`CREDITCARDNUMBER`	`CB`	dataset values are not Luhn-valid; see "synthetic data limitation"
`SOCIALNUM`	`NIR`	valid mapping only on `locale=="FR"`; Swiss AVS values would be wrong gold for us
`ACCOUNTNUM`	—	too generic; varies in length and format. Not mapped to IBAN.
`GIVENNAME`, `SURNAME`, `CITY`, `STREET`, `ZIPCODE`, `DATEOFBIRTH`, `IDCARDNUM`, `BUILDINGNUM`, `DRIVERLICENSENUM`, `PASSWORD`, …	—	out of scope; reported but not scored
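The table above can be read as a lookup, sketched here. The four mapped labels and the FR-only rule for `SOCIALNUM` come from the table; the `DetectorType` and `LABEL_MAP` names and the two-argument signature of `mappedLabel` are assumptions made for illustration.

```typescript
type DetectorType = "EMAIL" | "TEL" | "CB" | "NIR";

// Only labels the plugin claims to detect appear here.
// Anything absent (ACCOUNTNUM, GIVENNAME, ...) is out of scope.
const LABEL_MAP: Record<string, DetectorType> = {
  EMAIL: "EMAIL",
  TELEPHONENUM: "TEL",
  CREDITCARDNUMBER: "CB",
  SOCIALNUM: "NIR",
};

// Map a dataset label to a detector type, or null when it is out of scope.
// SOCIALNUM is only valid gold for NIR on FR samples (Swiss AVS is not a NIR).
function mappedLabel(label: string, locale: string): DetectorType | null {
  if (!Object.prototype.hasOwnProperty.call(LABEL_MAP, label)) return null;
  const mapped = LABEL_MAP[label];
  if (mapped === "NIR" && locale !== "FR") return null;
  return mapped;
}
```

Returning null rather than a catch-all type keeps out-of-scope spans from ever entering the precision/recall counts; they can still be tallied separately for reporting.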

Span matching

Predicted span P is a true positive for type t if there exists a gold span G with mappedLabel(G) == t and overlap(P, G), where:

Span overlap (not exact equality) is the right metric here because the dataset's annotations sometimes include or exclude trailing punctuation inconsistently. We then count:

TP_t = predicted spans of type t that overlap a gold of type t
FP_t = predicted spans of type t with no overlapping gold of type t
FN_t = gold spans of type t with no overlapping prediction of type t

Standard precision / recall / F1 from there.
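The counting rules above can be sketched as follows. This is an illustration under stated assumptions, not the evaluator's code: spans are taken to be half-open character ranges, any shared character counts as overlap, and gold spans are assumed to already carry their mapped type. Note that the number of matched predictions and the number of matched gold spans can differ (several predictions may overlap one gold span), which is why the results tables report both; precision uses the former and recall the latter.

```typescript
// Assumed span shape: [start, end) character offsets plus a detector type.
interface Span { start: number; end: number; type: string; }

interface Counts { tp: number; fp: number; fn: number; goldMatched: number; }
interface Scores { precision: number; recall: number; f1: number; }

// Two half-open ranges overlap when each starts before the other ends.
function overlap(a: Span, b: Span): boolean {
  return a.start < b.end && b.start < a.end;
}

// Count TP / FP / FN for one type t within a single sample.
function countType(pred: Span[], gold: Span[], t: string): Counts {
  const p = pred.filter((s) => s.type === t);
  const g = gold.filter((s) => s.type === t);
  const tp = p.filter((ps) => g.some((gs) => overlap(ps, gs))).length;
  const goldMatched = g.filter((gs) => p.some((ps) => overlap(ps, gs))).length;
  return { tp, fp: p.length - tp, fn: g.length - goldMatched, goldMatched };
}

// Precision over predictions, recall over gold spans, F1 as their harmonic mean.
function prf(c: Counts): Scores {
  const predTotal = c.tp + c.fp;
  const goldTotal = c.goldMatched + c.fn;
  const precision = predTotal === 0 ? 0 : c.tp / predTotal;
  const recall = goldTotal === 0 ? 0 : c.goldMatched / goldTotal;
  const sum = precision + recall;
  const f1 = sum === 0 ? 0 : (2 * precision * recall) / sum;
  return { precision, recall, f1 };
}
```

Summing `Counts` across samples and calling `prf` once at the end gives corpus-level (micro-averaged) figures.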

Results

Run on 2026-05-17.

Default (international phones OFF), 40 026 FR samples

Type	Gold	Pred	Pred matched	Gold matched	Precision	Recall	F1
EMAIL	3257	3240	3232	3232	99.75 %	99.23 %	99.49 %
TEL	2675	361	254	254	70.36 %	9.50 %	16.73 %
CB	1515	875	111	111	12.69 %	7.33 %	9.29 %
NIR	1468	12	7	7	58.33 %	0.48 %	0.95 %

With `detectInternationalPhones` ON

Type	Gold	Pred	Pred matched	Gold matched	Precision	Recall	F1
EMAIL	3257	3240	3232	3232	99.75 %	99.23 %	99.49 %
TEL	2675	1581	1474	1358	93.23 %	50.77 %	65.74 %
CB	1515	875	111	111	12.69 %	7.33 %	9.29 %
NIR	1468	12	7	7	58.33 %	0.48 %	0.95 %

Enabling international phones lifts TEL F1 from 16.73 % to 65.74 % with no measurable regression on the other types (their rows are identical in both tables). Precision increases because the international regex applies the same strict boundary discipline as the French one, replacing many of the leftover ambiguous FR-substring predictions with cleaner full +CC… matches.

CH (control) locale, 40 223 samples

Type	Default F1	International F1	Notes
EMAIL	99.70 %	99.70 %	Language-agnostic.
TEL	8.26 %	64.51 %	`+41` phones caught only with the flag.
CB	8.07 %	8.07 %	Same synthetic-data limitation as FR.
NIR	n/a (0 TP)	n/a (0 TP)	Specificity check: 2 predictions on 40 223 samples = 0.05 ‰ (Swiss AVS ≠ NIR FR — almost perfectly silent).

Out-of-scope gold totals (entities we don't claim to detect): 34 205 spans, dominated by GIVENNAME (6594), USERNAME (4477), SURNAME (4339), CITY (4077).

How to read these numbers

EMAIL. 99.5 % F1 is the real number, on real-ish text. This is the metric to cite if anyone asks "is this plugin reliable for emails".

TEL. Two regimes, one config flag:

Default (FR-only): precision 70 %, recall 9.5 %. The recall ceiling is set by the dataset: most TELEPHONENUM values are not French-formatted. The 107 remaining FPs (361 predictions, 254 matched) are FR-formatted-by-chance substrings or ambiguous identifiers.
With detectInternationalPhones: precision 93 %, recall 51 %. This is the regime to use whenever your text contains non-FR phones. Precision goes UP because the international regex's boundary discipline catches numbers as full +CC… matches that the FR regex previously sliced into ambiguous substrings.

CB and NIR. These two numbers are misleading without context. They are not measuring detection quality on real text; they are measuring detection quality against a dataset that doesn't respect the validators our detectors depend on:

This is the synthetic-data limitation. Our checksum-first design is a feature, not a tuning knob. The right way to measure recall on these types is via the unit tests (which use fixtures derived from official validators — see LICENSING.md) plus, if needed, manually-curated real samples under NDA.
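To make "checksum-first" concrete for NIR, here is a sketch of the standard two-digit key check: the key equals 97 minus the 13-character body taken modulo 97, with the Corsican department codes 2A and 2B replaced by 19 and 18 before the division. That the plugin's NIR detector works exactly this way is an assumption; the function names are hypothetical. A synthetic value whose last two digits were generated at random passes this test only about once in 97 tries, which is how a recall figure measured against such data collapses.

```typescript
// Convert the 13-character NIR body to a number, or null if malformed.
// Corsica: "2A" becomes 19 and "2B" becomes 18 so the body is purely numeric.
function nirBodyToNumber(body: string): number | null {
  if (!/^[0-9]{5}(?:[0-9]{2}|2A|2B)[0-9]{6}$/.test(body)) return null;
  const numeric = body.replace("2A", "19").replace("2B", "18");
  return Number(numeric); // 13 digits fit safely in a double
}

// Expected key (1..97) for a 13-character body, or null if the body is malformed.
function nirKey(body: string): number | null {
  const n = nirBodyToNumber(body);
  return n === null ? null : 97 - (n % 97);
}

// Validate a full 15-character NIR (spaces tolerated): body plus two-digit key.
function nirValid(nir: string): boolean {
  const compact = nir.replace(/\s+/g, "").toUpperCase();
  if (compact.length !== 15) return false;
  const keyText = compact.slice(13);
  if (!/^[0-9]{2}$/.test(keyText)) return false;
  const expected = nirKey(compact.slice(0, 13));
  return expected !== null && expected === Number(keyText);
}
```

This sketch checks the key only; it does not validate the month, department, or commune fields.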

The residual FPs on CB are explained by random chance: about 10 % of random 13–19 digit identifiers pass Luhn, and any of those that also starts with [3-6] is reported as a card. This is the irreducible floor for any Luhn + BIN detector without contextual signals (keyword "carte", "credit card", etc.).
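The chance argument follows directly from how Luhn works, as the sketch below shows: for any digit prefix, exactly one of the ten possible final digits makes the checksum valid. The Luhn algorithm itself is standard; `looksLikeCard` is a hypothetical stand-in for the "Luhn + leading [3-6]" rule described above, not the plugin's detector.

```typescript
// Standard Luhn check: from the right, double every second digit,
// subtract 9 from any result above 9, and require the sum to be a multiple of 10.
function luhnValid(digits: string): boolean {
  if (!/^[0-9]+$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Hypothetical card rule: 13 to 19 digits, leading digit in [3-6], Luhn-valid.
function looksLikeCard(digits: string): boolean {
  return (
    digits.length >= 13 &&
    digits.length <= 19 &&
    /^[3-6]/.test(digits) &&
    luhnValid(digits)
  );
}

// Count how many final digits (0..9) make prefix + digit Luhn-valid.
function validCheckDigits(prefix: string): number {
  let count = 0;
  for (let d = 0; d <= 9; d++) {
    if (luhnValid(prefix + String(d))) count++;
  }
  return count;
}
```

Because the final digit is never doubled, `validCheckDigits` returns 1 for every numeric prefix, so an arbitrary digit string passes Luhn with probability 1 in 10. Without a contextual keyword, nothing distinguishes such a string from a real card number.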

Reproducibility

The script is fully deterministic given the dataset. ~1 second wall time on a modern laptop, 40 k samples streamed without loading into memory.

Diagnostic helper to look at where false positives come from:

Outputs go to stdout only. Do not redirect to a versioned file — that would leak AI4Privacy snippets into the repo, which the licence forbids.
