Design

What this plugin is — and isn't

anonymize is a stateless, validation-first PII redactor. The model passes text and (optionally) a list of personal names it spotted; the plugin handles the format-bound stuff (cards, NIR, IBAN, French phones, emails) using regex + checksum validation, never regex alone.

It is not an NER system. It does not detect names or unstructured PII. Anything that requires reading comprehension is the model's job; the plugin is the dumb but rigorous lower half of the pipeline.

The three load-bearing decisions

1. Validation before redaction

Every detector that has a checksum runs it. Luhn for cards, mod-97 for NIR and IBAN. A 16-digit run that fails Luhn is not redacted as a card, even if the surrounding context says "credit card number".
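As an illustration of what "runs the checksum" means for cards, here is a minimal Luhn sketch. The name passesLuhn and the 13–19 digit length gate are assumptions made for this example; the plugin's actual helper may be named and shaped differently.

```typescript
// Hypothetical helper, not the plugin's actual API.
function passesLuhn(candidate: string): boolean {
  // Keep digits only, so "4111 1111 1111 1111" and "4111-1111-1111-1111" both work.
  const digits = candidate.replace(/\D/g, "");
  if (digits.length < 13 || digits.length > 19) return false;

  let sum = 0;
  // Walk from the rightmost digit, doubling every second one.
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9; // same as summing the two digits of the product
    }
    sum += d;
  }
  return sum % 10 === 0;
}
```

A 16-digit run such as 1234567890123456 fails this check, so under the rule above it would not be redacted as a card.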

Why: the alternative is regex-only detection, which on real text produces a flood of false positives (reference numbers, invoice IDs and internal identifiers all look like 13–19 digit runs). Locally measured on AI4Privacy FR, the validation-first approach raises EMAIL F1 to 99.5 % and keeps CB / NIR precision high, at the deliberate cost of recall on look-alikes that fail validation.

Consequence: synthetic test data that generates random digits without respecting checksums (which is what AI4Privacy does) will show near-zero recall on CB and NIR. That is the system working as designed, not a bug. See EVALUATION.md for numbers.
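To make the recall consequence concrete, the mod-97 checks can be sketched as follows. The function names are hypothetical and the format regexes are simplified assumptions; only the arithmetic (NIR key = 97 − (body mod 97), with 2A → 19 and 2B → 18; IBAN rearranged and reduced mod 97 to 1) follows the published rules.

```typescript
// Remainder of a long decimal string modulo 97, digit by digit,
// so no intermediate value exceeds Number's safe integer range.
function mod97(digits: string): number {
  let r = 0;
  for (let i = 0; i < digits.length; i++) {
    r = (r * 10 + (digits.charCodeAt(i) - 48)) % 97;
  }
  return r;
}

// Hypothetical NIR check: 13-character body followed by a 2-digit key.
function isValidNir(raw: string): boolean {
  const nir = raw.replace(/\s+/g, "").toUpperCase();
  if (!/^\d{5}(?:\d{2}|2A|2B)\d{6}\d{2}$/.test(nir)) return false;
  // Corsican departments are mapped to digits before the division.
  const body = nir.slice(0, 13).replace("2A", "19").replace("2B", "18");
  const key = Number(nir.slice(13));
  return 97 - mod97(body) === key;
}

// Hypothetical IBAN check: move the first four characters to the end,
// map letters to numbers (A=10 ... Z=35), and expect remainder 1.
function isValidIban(raw: string): boolean {
  const iban = raw.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  const numeric = rearranged.replace(/[A-Z]/g, (c) => String(c.charCodeAt(0) - 55));
  return mod97(numeric) === 1;
}
```

Only about one random 15-digit string in 97 carries a matching NIR key, which is why checksum-blind synthetic data scores near-zero recall here.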

2. Right-to-left replacement

After detection and de-duplication, replacements are applied in right-to-left order over the original string. This keeps the start / end of every span still to be processed valid throughout the loop, since each replacement only touches offsets at or beyond the end of those spans.

Code: anonymizer.ts, the reversed loop.

The alternative would be to recompute offsets after each replacement, which is both slower and a richer source of off-by-one bugs.
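A minimal sketch of the idea, assuming non-overlapping spans with an exclusive end offset. The Replacement shape and the function name applyRightToLeft are illustrative, not taken from anonymizer.ts.

```typescript
// Hypothetical span shape for this sketch: [start, end) plus the text to insert.
interface Replacement {
  start: number;
  end: number;
  replacement: string;
}

function applyRightToLeft(text: string, spans: Replacement[]): string {
  // Process the span with the largest start first.
  const ordered = [...spans].sort((a, b) => b.start - a.start);
  let out = text;
  for (const s of ordered) {
    // Everything left of s.start is still byte-for-byte the original text,
    // so the offsets of the remaining spans are untouched.
    out = out.slice(0, s.start) + s.replacement + out.slice(s.end);
  }
  return out;
}
```

Because pseudonyms rarely have the same length as the text they replace, a left-to-right pass would need a running offset delta; the reversed order removes that bookkeeping.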

3. Normalised key, textual mapping

Pseudonym lookup uses a normalised form of the value (normalizeValue), so different writings of the same logical value collapse to a single pseudonym:

Type	Normalisation
TEL	strip `(0)`, then `.-`; rewrite `+33`/`0033` prefix → `0`
CB	strip `-`
IBAN	strip whitespace, uppercase
NIR	strip whitespace, uppercase (preserves `2A`/`2B`)
EMAIL	lowercase
NOM	lowercase (locale-aware fr) — `"Jean"` and `"JEAN"` share `[NOM_1]`
CUSTOM	identity (case-sensitive — the caller picks the exact form)
ADDRESS	collapse whitespace, lowercase (locale-aware fr) — model only
IDDOC	strip whitespace, uppercase — model only
DATE	trim only — date formats too varied to safely normalise without a parser; model only

The mapping returned to the caller preserves the first textual form seen for each pseudonym — that's the form a human reading the output will recognise. The normalised key is internal only.
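The split between the internal key and the returned mapping can be sketched like this. The code follows the table above literally for a subset of types; the names normalizeValueSketch and PseudonymTable, and the pseudonym → original direction of the mapping, are assumptions made for illustration.

```typescript
type PiiType = "TEL" | "CB" | "IBAN" | "NIR" | "EMAIL" | "NOM" | "CUSTOM";

// Hypothetical stand-in for normalizeValue, covering the rule-based types only.
function normalizeValueSketch(type: PiiType, value: string): string {
  switch (type) {
    case "TEL":
      return value
        .replace(/\(0\)/g, "")
        .replace(/[.\-]/g, "")
        .replace(/^(?:\+33|0033)/, "0");
    case "CB":
      return value.replace(/-/g, "");
    case "IBAN":
    case "NIR":
      return value.replace(/\s+/g, "").toUpperCase();
    case "EMAIL":
      return value.toLowerCase();
    case "NOM":
      return value.toLocaleLowerCase("fr");
    default:
      return value; // CUSTOM: identity, case-sensitive
  }
}

// Per-call table: lookup by normalised key, report the first textual form.
class PseudonymTable {
  private counters = new Map<PiiType, number>();
  private byKey = new Map<string, string>();
  readonly mapping: Record<string, string> = {}; // assumed: pseudonym -> first form seen

  pseudonymFor(type: PiiType, value: string): string {
    const key = type + ":" + normalizeValueSketch(type, value);
    const existing = this.byKey.get(key);
    if (existing !== undefined) return existing;
    const n = (this.counters.get(type) ?? 0) + 1;
    this.counters.set(type, n);
    const pseudonym = "[" + type + "_" + n + "]";
    this.byKey.set(key, pseudonym);
    this.mapping[pseudonym] = value; // first textual form wins
    return pseudonym;
  }
}
```

With this sketch, "+33(0)1.23.45.67.89" and "01-23-45-67-89" share one TEL pseudonym, and the mapping reports the first of the two as written.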

Span lifecycle

Dedup priority

dedupeSpans sorts by length descending, then by start ascending, then walks the list, adding a span only if it does not overlap any already-kept span. With a stable sort, equal-length spans keep their insertion order, which determines who wins on identical-span ties.

Insertion order in anonymizer.ts:

1. detectAll results (EMAIL, IBAN, TEL, NIR, CB, in that order)
2. Optional modelSpans from the ML layer (NOM, ADDRESS, DATE, IDDOC; see DETECTORS.md § Optional ML model)
3. NOM matches from caller-provided names
4. CUSTOM matches from per-call and config custom terms

This means: when a phone-number string is also passed as a custom term, the detector wins (correct: it has type information). When the model passes both "Jean" and "Jean Dupont" as names, the compound wins on overlapping ranges and the bare "Jean" still wins on its own occurrences. There is a regression test for each of these.

When the ML layer is enabled and emits a span overlapping a regex span of the same length, the regex span wins because it was inserted first. This is intentional: validated checksums beat probabilistic NER. A regression test (regex span wins over overlapping model span) locks this in.
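The priority rules above fall out of a few lines of code. This is a sketch under stated assumptions: the Span shape and the name dedupeSpansSketch are illustrative, end offsets are exclusive, and it relies on Array.prototype.sort being stable (guaranteed since ES2019).

```typescript
// Hypothetical span shape: [start, end) with a type label.
interface Span {
  start: number;
  end: number;
  type: string;
}

function dedupeSpansSketch(spans: Span[]): Span[] {
  // Longest first, then leftmost; the stable sort keeps insertion order on full ties.
  const sorted = [...spans].sort(
    (a, b) => (b.end - b.start) - (a.end - a.start) || a.start - b.start,
  );
  const kept: Span[] = [];
  for (const s of sorted) {
    // Two half-open ranges overlap when each starts before the other ends.
    const overlaps = kept.some((k) => s.start < k.end && k.start < s.end);
    if (!overlaps) kept.push(s);
  }
  return kept;
}
```

Feeding it a TEL span and a CUSTOM span over the same range keeps the TEL span, because the detector result was inserted first.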

Boundary discipline

Every regex that matches a run of digits carries explicit boundaries, so a match cannot start or end inside a longer run:

TEL_RE: (?<!\d)…(?!\d)
CC_RE: (?<!\d)…(?!\d)
NIR_RE, IBAN_RE: \b…\b

Before these boundaries were added, the TEL regex matched a 10-digit subsequence inside any 15-digit NIR, 18-digit account number, or 16-digit card. This was the single largest source of false positives in our first evaluation (5375 FPs eliminated by the boundary fix).
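The effect is easy to reproduce with a toy pattern. The two regexes below are simplified, hypothetical stand-ins (ten digits starting with 0), not the plugin's TEL_RE; only the lookbehind / lookahead wrapper mirrors the design described here.

```typescript
// Hypothetical, simplified phone patterns for illustration only.
const TEL_UNBOUNDED = /0[1-9]\d{8}/g;
const TEL_BOUNDED = /(?<!\d)0[1-9]\d{8}(?!\d)/g;

function findAll(re: RegExp, text: string): string[] {
  // String.prototype.match with a global regex returns every match, or null.
  return text.match(re) ?? [];
}

const nirDigits = "269054958815780"; // a 15-digit NIR written without spaces
const phoneInProse = "Appelez le 0549588157.";

// Without boundaries, a "phone number" is found in the middle of the NIR.
const unboundedHits = findAll(TEL_UNBOUNDED, nirDigits);
// With boundaries, the NIR yields nothing and the real phone is still found.
const boundedHits = findAll(TEL_BOUNDED, nirDigits);
const realHits = findAll(TEL_BOUNDED, phoneInProse);
```

Lookbehind assertions require an ES2018-capable regex engine, which is worth keeping in mind if the runtime target is older.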

Pseudonym format

[TYPE_N] where TYPE ∈ {EMAIL, TEL, CB, NIR, IBAN, NOM, CUSTOM} and N counts up from 1 per type, in reading order within the input. The square-bracket form was chosen because it survives most downstream formatters (Markdown, HTML, JSON string) without escaping.

When the input text already contains literal [TYPE_N] patterns (e.g. re-processing an already-redacted document), anonymize scans for them and starts its own counters past the highest existing index per type. Existing literals are preserved verbatim; only newly generated pseudonyms appear in the returned mapping.
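One way to seed the counters is sketched below. The function name nextIndexByType is hypothetical, and the pattern accepts any upper-case type label, which is an assumption; the real scan may restrict itself to the known types.

```typescript
// Returns, per type found in the text, the first index that is still free.
// Types with no existing literal are absent; callers would default them to 1.
function nextIndexByType(text: string): Record<string, number> {
  const next: Record<string, number> = {};
  const re = /\[([A-Z]+)_(\d+)\]/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const type = m[1] ?? "";
    const n = Number(m[2]);
    // Start past the highest index seen so far for this type.
    next[type] = Math.max(next[type] ?? 1, n + 1);
  }
  return next;
}
```

For a document that already contains [NOM_2] and [NOM_5], the next name detected would become [NOM_6], leaving the gap at 3 and 4 unused rather than risking a collision.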