Roadmap

Known limitations and likely future work, ordered by current priority.

Done in this iteration

✅ Dual-locale evaluation (FR + CH). CH serves as a specificity control; NIR predictions on CH are at 0.05 ‰ (2/40 223), confirming the FR-specific regex doesn't fire spuriously on Swiss AVS numbers.
✅ International phone numbers behind the detectInternationalPhones config flag. Default stays French-only. With the flag on, TEL F1 jumps from 16.73 % to 65.74 % on AI4Privacy FR.
✅ CB precision via BIN-prefix tightening. CC_RE now requires a leading digit in [3-6] (Visa/MC/Amex/Diners/Discover/JCB/UnionPay). Halves the false-positive count without affecting recall on real cards. Precision 6.34 % → 12.69 % on AI4Privacy FR; the residual ceiling is irreducible Luhn collisions on long synthetic identifiers that happen to start with 4, 5, or 6.
✅ Names matched case-insensitively (NOM only; CUSTOM stays literal by design). "Jean Dupont", "JEAN DUPONT" and "jean dupont" share one pseudonym. Word-boundary discipline preserved.
✅ Pre-existing [TYPE_N] literals in input no longer collide with output. Counters offset past the highest existing index per type. Re-processing an already-partly-redacted document is now safe.
✅ Phone-format normalisation is now complete. The +33(0)6… business notation is matched by the regex and collapses to the same pseudonym as 06… and +33 6….
✅ Optional ML model layer behind detectPiiWithModel. Uses onnx-community/multilang-pii-ner-ONNX (xlm-roberta-base, F1 0.99 on AI4Privacy CoNLL — EN/DE/IT/FR natively). Loaded lazily via @huggingface/transformers in Node runtime; first call downloads ~280 MB of weights. Four sub-flags select which categories the model emits: names, addresses, dates, ID documents. New SpanTypes: ADDRESS, DATE, IDDOC. Regex+checksum spans stay authoritative on overlap (insertion order: regex → model → names → customs, then longest-wins stable dedup). The anonymize primitive stays synchronous; only toolsProvider.implementation is async (it was already). Model failures fall back gracefully to regex-only via a try/catch in the tool implementation.
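
The overlap rule in the model-layer item above (spans collected in the order regex, model, names, customs, then a stable longest-wins dedup) can be sketched as follows. This is one possible reading, not the plugin's actual code: the Span shape, the source field and the dedupSpans name are assumptions made for illustration. It relies on Array.prototype.sort being stable (ES2019+), so spans of equal length keep their insertion order and the regex span wins a tie.

```typescript
// Hypothetical span shape; the plugin's real types may differ.
type Source = "regex" | "model" | "names" | "customs";

interface Span {
  start: number; // inclusive character offset
  end: number;   // exclusive character offset
  type: string;  // e.g. "TEL", "NOM", "ADDRESS"
  source: Source;
}

function overlaps(a: Span, b: Span): boolean {
  return a.start < b.end && b.start < a.end;
}

// Assumes `spans` is already concatenated in insertion order:
// regex, then model, then names, then customs.
function dedupSpans(spans: Span[]): Span[] {
  // Stable sort by length, longest first; ties keep insertion order.
  const byLength = [...spans].sort(
    (a, b) => (b.end - b.start) - (a.end - a.start),
  );
  const kept: Span[] = [];
  for (const candidate of byLength) {
    // Keep a span only if it does not overlap one already accepted.
    if (!kept.some((k) => overlaps(k, candidate))) kept.push(candidate);
  }
  // Return in document order for replacement.
  return kept.sort((a, b) => a.start - b.start);
}

const candidates: Span[] = [
  { start: 0, end: 10, type: "TEL", source: "regex" },
  { start: 0, end: 10, type: "IDDOC", source: "model" },     // same length as TEL: regex wins the tie
  { start: 18, end: 30, type: "ADDRESS", source: "model" },
  { start: 20, end: 24, type: "NOM", source: "names" },      // shorter, inside ADDRESS: dropped
];
const resolved = dedupSpans(candidates);
console.log(resolved.map((s) => `${s.type}@${s.start}`));
```

In this sketch a strictly longer model span would still beat a shorter regex span; if regex spans must win every overlap, they would need to be accepted before the length-ordered pass.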

Closed without code change

NIR Corsica position validation

Originally listed as defence-in-depth: tighten nirChecksum to reject 2A/2B outside the department slot. On re-examination, the position is already enforced by the regex /^[12]\d{4}(?:\d{2}|2[AB])\d{6}$/i that runs at the top of nirChecksum. A letter anywhere else fails that test before the substitution runs. Test fixtures (28X0478342163, 289042a342163, etc.) confirm the current behaviour. Closed: already correct.
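
As an illustration of why the shape check is sufficient, here is a minimal sketch of a key computation guarded by that regex. The nirKey helper and its return type are hypothetical and do not reproduce the plugin's nirChecksum. It uses the standard NIR rule: the key is 97 minus the 13-character body modulo 97, with 2A read as 19 and 2B as 18.

```typescript
// Shape check quoted from the entry above: 13 characters, with a letter
// allowed only as 2A/2B in the department slot (positions 6-7).
const NIR_BODY_RE = /^[12]\d{4}(?:\d{2}|2[AB])\d{6}$/i;

// Hypothetical helper: returns the expected 2-digit key for a 13-character
// NIR body, or null when the body fails the shape check.
function nirKey(body: string): number | null {
  if (!NIR_BODY_RE.test(body)) return null;
  // Safe after the test: the only possible letter sits in the department slot.
  const numeric = body.toUpperCase().replace("2A", "19").replace("2B", "18");
  // A 13-digit integer fits comfortably in a double (below 2^53).
  return 97 - (Number(numeric) % 97);
}

console.log(nirKey("28X0478342163")); // null: letter outside the department slot
console.log(nirKey("289042a342163")); // 90: lowercase 2a accepted, read as 19
```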

On the backlog

EMAIL: IDN and quoted local parts

We don't support Internationalised Domain Names or quoted local parts. A real concern only when actual French text contains addresses like "de Lévis"@example.fr or naïveté@académie.fr. Plausible but rare. Wait for a real user request before touching the regex.

Stable pseudonyms across calls

Today a pseudonym is only stable within one call. If a caller wants the same person to keep the same [NOM_X] across multiple anonymize_text calls in a session, they have to build that mapping themselves. A config-level "session" would require state, which conflicts with the plugin's stateless design. Open question, low priority.

Extended international forms without `+`

+33 and 0033 are accepted; bare 33 6 12… without either prefix is not. Adding it would require either a country-specific entry point or a permissive match like \b33[ .-]?[1-9](?:[ .-]?\d{2}){4} which would over-match. Wait for a real instance before deciding.
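
The over-matching risk can be seen by running the permissive pattern quoted above against a few strings. The sample inputs below are made up for illustration and are not drawn from any evaluation set.

```typescript
// Permissive pattern from the paragraph above (not part of the plugin).
const BARE_33_RE = /\b33[ .-]?[1-9](?:[ .-]?\d{2}){4}/;

const samples: string[] = [
  "33 6 12 34 56 78",      // intended target: bare country code + mobile
  "Commande 33612345678",  // 11-digit order number that starts with 33
  "SIRET 33212345678901",  // prefix of a longer numeric identifier
  "06 12 34 56 78",        // national format, contains no 33 prefix
];

// The first three match; only the first is actually a phone number.
const hits = samples.filter((s) => BARE_33_RE.test(s));
console.log(hits);
```

Any numeric identifier that starts with 33 followed by a non-zero digit is caught, which is the over-matching the paragraph refers to.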

Not on the roadmap (explicit non-goals)

NER for names, addresses, organisations. That's the model's job by design. We will not bundle a model.
PHI medical de-identification. Different threat model, different entity types. Out of scope.
Stable pseudonyms across calls. Today a pseudonym is only stable within one call. Caller can build their own persistent mapping if needed.
