docs / ROADMAP.md
Known limitations and likely future work, ordered by current priority.
- `detectInternationalPhones` config flag. Default stays French-only.
  With the flag on, TEL F1 jumps from 16.73 % to 65.74 % on
  AI4Privacy FR.
- `CC_RE` now requires a leading digit in [3-6]
  (Visa/MC/Amex/Diners/Discover/JCB/UnionPay). Halves the
  false-positive count without affecting recall on real cards.
  Precision 6.34 % → 12.69 % on AI4Privacy FR; the residual ceiling is
  irreducible Luhn collisions on long synthetic identifiers that
  happen to start with 4, 5, or 6.
- `[TYPE_N]` literals in input no longer collide with output. Counters
  offset past the highest existing index per type. Re-processing an
  already-partly-redacted document is now safe.
- `+33(0)6…` business notation is now matched by the regex and
  collapses to the same pseudonym as `06…` and `+33 6…`.
- `detectPiiWithModel`. Uses `onnx-community/multilang-pii-ner-ONNX`
  (xlm-roberta-base, F1 0.99 on AI4Privacy CoNLL; EN/DE/IT/FR
  natively). Loaded lazily via `@huggingface/transformers` in Node
  runtime; first call downloads ~280 MB of weights. Four sub-flags
  select which categories the model emits: names, addresses, dates,
  ID documents. New SpanTypes: `ADDRESS`, `DATE`, `IDDOC`.
  Regex+checksum spans stay authoritative on overlap (insertion
  order: regex → model → names → customs, then longest-wins stable
  dedup). The anonymize primitive stays synchronous; only
  `toolsProvider.implementation` is async (it was already). Model
  failures fall back gracefully to regex-only via a try/catch in the
  tool implementation.

Originally listed as defence-in-depth: tighten `nirChecksum` to reject
2A/2B outside the department slot. Re-examining the code, the
position is already enforced by the regex
`/^[12]\d{4}(?:\d{2}|2[AB])\d{6}$/i` that runs at the top of
`nirChecksum`. A letter anywhere else fails that test before the
substitution runs. Test fixtures (`28X0478342163`, `289042a342163`,
etc.) confirm the current behaviour. Closed: already correct.
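
As a small illustration of why the shape test already pins the letter to the department slot, the sketch below reuses the regex quoted in this entry. The names `NIR_SHAPE` and `hasValidNirShape` are hypothetical; the real check lives inside `nirChecksum` and may be structured differently.

```typescript
// Shape test from the roadmap entry: sex digit (1 or 2), YYMM (4 digits),
// department (two digits, or 2A/2B), then six more digits.
// The /i flag makes the Corsican letter case-insensitive.
// NIR_SHAPE and hasValidNirShape are illustrative names, not the plugin's API.
const NIR_SHAPE = /^[12]\d{4}(?:\d{2}|2[AB])\d{6}$/i;

function hasValidNirShape(body: string): boolean {
  return NIR_SHAPE.test(body);
}

// The two fixtures named in the entry, plus an all-digit body.
for (const sample of ["28X0478342163", "289042a342163", "2890475342163"]) {
  console.log(sample, hasValidNirShape(sample));
}
```

The first fixture carries a letter in the year/month slot and is rejected before any checksum arithmetic; the second carries `2a` in the department slot and passes the shape test.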
We don't support Internationalised Domain Names or quoted local parts.
A real concern only when actual French text contains addresses like
"de Lévis"@example.fr or naïveté@académie.fr. Plausible but rare.
Wait for a real user request before touching the regex.
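
To make the gap concrete, the sketch below runs the two example addresses through a deliberately simple ASCII-only pattern. `ASCII_EMAIL_RE` and `findAsciiEmail` are assumptions for illustration only; they are not the plugin's actual email regex, which may differ in its details.

```typescript
// Hypothetical ASCII-only email pattern: no quoted local parts and no
// non-ASCII letters in either the local part or the domain.
const ASCII_EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

// Returns the first match in the text, or null when nothing matches.
function findAsciiEmail(text: string): string | null {
  const m = ASCII_EMAIL_RE.exec(text);
  return m ? m[0] : null;
}

const emailSamples = [
  "Contact: jean.dupont@example.fr",
  '"de Lévis"@example.fr',
  "naïveté@académie.fr",
];
for (const sample of emailSamples) {
  console.log(sample, "=>", findAsciiEmail(sample));
}
```

With a pattern of this shape, both problem addresses are missed entirely, because the character immediately before `@` (a closing quote or an accented letter) is outside the local-part character class.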
Today a pseudonym is only stable within one call. If a caller wants the
same person to keep the same [NOM_X] across multiple anonymize_text
calls in a session, they have to build that mapping themselves. A
config-level "session" would require state, which conflicts with the
plugin's stateless design. Open question, low priority.
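
Until that question is settled, a caller can keep the mapping on its own side. The following is a minimal sketch of such a caller-side map, assuming labels follow the `[TYPE_N]` shape with a numeric index. The class `SessionPseudonyms`, its `labelFor` method, and the trim-and-lowercase normalisation are all hypothetical; how the caller obtains detected values from `anonymize_text` is out of scope here.

```typescript
// Caller-side session state; the plugin itself stays stateless.
// SessionPseudonyms is a hypothetical helper, not part of the plugin.
class SessionPseudonyms {
  // (type + normalised value) -> label already handed out in this session
  private readonly labels = new Map<string, string>();
  // type -> highest index handed out so far
  private readonly counters = new Map<string, number>();

  labelFor(type: string, value: string): string {
    // Assumption: values are compared after trimming and lowercasing.
    const key = `${type}\u0000${value.trim().toLowerCase()}`;
    const known = this.labels.get(key);
    if (known !== undefined) return known;

    const next = (this.counters.get(type) ?? 0) + 1;
    this.counters.set(type, next);
    const label = `[${type}_${next}]`;
    this.labels.set(key, label);
    return label;
  }
}

const session = new SessionPseudonyms();
console.log(session.labelFor("NOM", "Marie Curie"));   // first call
console.log(session.labelFor("NOM", " marie curie ")); // later call, same person
```

Keeping the map outside the plugin preserves the stateless design at the cost of pushing the bookkeeping onto every caller that needs cross-call stability.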
+33 and 0033 are accepted; bare 33 6 12… without either prefix
is not. Adding it would require either a country-specific entry point
or a permissive match like \b33[ .-]?[1-9](?:[ .-]?\d{2}){4} which
would over-match. Wait for a real instance before deciding.
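
The over-matching risk can be seen by running the permissive pattern quoted above against a few strings. This is only a sketch: `BARE_33_RE` and `firstBare33` are hypothetical names, and the sample inputs are invented for illustration.

```typescript
// The permissive pattern from the roadmap entry, unchanged.
const BARE_33_RE = /\b33[ .-]?[1-9](?:[ .-]?\d{2}){4}/;

// Returns the first substring the pattern matches, or null.
function firstBare33(text: string): string | null {
  const m = BARE_33_RE.exec(text);
  return m ? m[0] : null;
}

const bare33Samples = [
  "Tel 33 6 12 34 56 78",   // the intended case
  "Commande 3361234567899", // a 13-digit identifier that starts with 33
  "Lot 33-4-10-20-30-40",   // a hyphenated batch number
];
for (const sample of bare33Samples) {
  console.log(sample, "=>", firstBare33(sample));
}
```

All three samples match: the pattern has no trailing boundary and no way to tell a country code from any other number beginning with 33, so it captures the first eleven digits of the order number and the whole batch number.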