docs / DESIGN.md
anonymize is a stateless, validation-first PII redactor. The model
passes text and (optionally) a list of personal names it spotted; the
plugin handles the format-bound stuff (cards, NIR, IBAN, French phones,
emails) using regex + checksum validation, never regex alone.
It is not an NER system. It does not detect names or unstructured PII. Anything that requires reading comprehension is the model's job; the plugin is the dumb but rigorous lower half of the pipeline.
Every detector that has a checksum runs it. Luhn for cards, mod-97 for NIR and IBAN. A 16-digit run that fails Luhn is not redacted as a card, even if the surrounding context says "credit card number".
Why: the alternative is regex-only detection, which on real text is a flood of false positives (reference numbers, invoice IDs, internal identifiers all look like 13–19 digit runs). Locally measured on AI4Privacy FR, this raises EMAIL F1 to 99.5 % and keeps CB / NIR precision high at the deliberate cost of recall on non-valid look-alikes.
Consequence: synthetic test data that generates random digits without respecting checksums (which is what AI4Privacy does) will show near-zero recall on CB and NIR. That is the system working as designed, not a bug. See EVALUATION.md for numbers.
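The two checksum families named above can be sketched roughly as follows. This is an illustration only, not the code in detectors.ts: the function names `luhnIsValid` and `ibanIsValid` are hypothetical, and the NIR key check (which needs the 2A/2B handling) is left out.

```typescript
// Hypothetical helpers, for illustration only.

// Luhn: walking from the right, double every second digit, subtract 9
// from any result above 9, and require the total to be a multiple of 10.
function luhnIsValid(digits: string): boolean {
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

// IBAN mod-97: move the first four characters to the end, map A..Z to
// 10..35, and require the resulting number to equal 1 modulo 97.
function ibanIsValid(raw: string): boolean {
  const iban = raw.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code >= 65 ? code - 55 : code - 48;
    // Letters contribute two decimal digits, digits contribute one.
    remainder = (value > 9 ? remainder * 100 + value : remainder * 10 + value) % 97;
  }
  return remainder === 1;
}

console.log(luhnIsValid("4111111111111111")); // true
console.log(luhnIsValid("4111111111111112")); // false
```

A 16-digit run such as the second one above matches a card-shaped regex but fails the checksum, so under the policy described here it would be left untouched.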
After detection and de-duplication, replacements are applied in
right-to-left order over the original string. This keeps every earlier
span's start / end valid throughout the loop, since each modification
happens at or beyond the end of every span still to be processed.
Code: anonymizer.ts, the reversed loop.
The alternative would be to recompute offsets after each replacement, which is both slower and a richer source of off-by-one bugs.
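A minimal sketch of the right-to-left idea, assuming spans are already de-duplicated and non-overlapping. The `Replacement` shape and the `applyRightToLeft` name are assumptions made for this example, not the actual code in anonymizer.ts.

```typescript
// Hypothetical span shape: offsets refer to the ORIGINAL string.
interface Replacement {
  start: number;
  end: number;
  pseudo: string;
}

function applyRightToLeft(text: string, spans: Replacement[]): string {
  // Highest start first, so each splice only touches text to the right
  // of every span that is still waiting to be processed.
  const ordered = spans.slice().sort((a, b) => b.start - a.start);
  let result = text;
  for (const span of ordered) {
    result = result.slice(0, span.start) + span.pseudo + result.slice(span.end);
  }
  return result;
}

const input = "tel 0612345678 mail a@b.fr";
const output = applyRightToLeft(input, [
  { start: 4, end: 14, pseudo: "[TEL_1]" },
  { start: 20, end: 26, pseudo: "[EMAIL_1]" },
]);
console.log(output); // tel [TEL_1] mail [EMAIL_1]
```

The pseudonyms differ in length from the values they replace, yet no offset needs adjusting: the phone span at 4..14 is still correct after the email span has been spliced.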
Pseudonym lookup uses a normalised form of the value
(normalizeValue), so different writings of the same logical value
collapse to a single pseudonym:
| Type | Normalisation |
|---|---|
| TEL | strip (0), then .-; rewrite +33/0033 prefix → 0 |
| CB | strip - |
| IBAN | strip whitespace, uppercase |
| NIR | strip whitespace, uppercase (preserves 2A/2B) |
| EMAIL | lowercase |
| NOM | lowercase (locale-aware fr), so "Jean" and "JEAN" share [NOM_1] |
| CUSTOM | identity (case-sensitive; the caller picks the exact form) |
| ADDRESS | collapse whitespace, lowercase (locale-aware fr); model only |
| IDDOC | strip whitespace, uppercase; model only |
| DATE | trim only; date formats too varied to safely normalise without a parser; model only |
The mapping returned to the caller preserves the first textual form
seen for each pseudonym, since that is the form a human reading the output
will recognise. The normalised key is internal only.
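The lookup described above might look roughly like this. It covers only a subset of the types, and it is a sketch under assumptions: the separator set for TEL and CB is assumed to include whitespace, and `PseudonymTable` is a hypothetical name, not a class from the project.

```typescript
type PiiType = "TEL" | "CB" | "IBAN" | "NOM" | "CUSTOM";

function normalizeValue(type: PiiType, value: string): string {
  switch (type) {
    case "TEL":
      return value
        .replace(/\(0\)/g, "") // "+33 (0)6 ..." -> "+33 6 ..."
        .replace(/[\s.-]/g, "") // assumed separators: whitespace, dot, dash
        .replace(/^(?:\+33|0033)/, "0");
    case "CB":
      return value.replace(/[\s-]/g, "");
    case "IBAN":
      return value.replace(/\s+/g, "").toUpperCase();
    case "NOM":
      return value.toLocaleLowerCase("fr");
    default:
      return value; // CUSTOM: identity, case-sensitive
  }
}

// Hypothetical table: one pseudonym per normalised key, while the
// caller-facing mapping keeps the first textual form that was seen.
class PseudonymTable {
  private byKey: Record<string, string> = {};
  private counters: Record<string, number> = {};
  readonly mapping: Record<string, string> = {};

  getPseudo(type: PiiType, value: string): string {
    const key = type + ":" + normalizeValue(type, value);
    let pseudo: string | undefined = this.byKey[key];
    if (pseudo === undefined) {
      const n = (this.counters[type] || 0) + 1;
      this.counters[type] = n;
      pseudo = "[" + type + "_" + n + "]";
      this.byKey[key] = pseudo;
      this.mapping[pseudo] = value; // first form wins
    }
    return pseudo;
  }
}

const table = new PseudonymTable();
console.log(table.getPseudo("TEL", "+33 (0)6 12 34 56 78")); // [TEL_1]
console.log(table.getPseudo("TEL", "06.12.34.56.78")); // [TEL_1]
console.log(table.mapping["[TEL_1]"]); // +33 (0)6 12 34 56 78
```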
dedupeSpans sorts by length descending, then by start ascending, then
walks the list adding a span only if it does not overlap any already-kept
span. With a stable sort, spans of equal length and equal start keep their
insertion order, which determines who wins on identical-span ties.
Insertion order in anonymizer.ts:
1. detectAll results (EMAIL, IBAN, TEL, NIR, CB, in that order)
2. modelSpans from the ML layer (NOM, ADDRESS, DATE, IDDOC; see DETECTORS.md § Optional ML model)
3. NOM matches from caller-provided names
4. CUSTOM matches from per-call and config custom terms

This means: when a phone-number string is also passed as a custom term, the detector wins (correct: it has type information). When the model passes both "Jean" and "Jean Dupont" as names, the compound wins on overlapping ranges and the bare "Jean" still wins on its own occurrences. There is a regression test for each of these.
When the ML layer is enabled and emits a span overlapping a regex span
of the same length, the regex span wins because it was inserted first.
This is intentional: validated checksums beat probabilistic NER. A
regression test (regex span wins over overlapping model span) locks
this in.
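The longest-wins rule with insertion-order tie-breaking can be sketched as below. The `Span` shape is an assumption for this example, and the sketch relies on `Array.prototype.sort` being stable, which holds for ES2019-compliant engines.

```typescript
// Hypothetical span shape, half-open interval [start, end).
interface Span {
  start: number;
  end: number;
  type: string;
}

function dedupeSpans(spans: Span[]): Span[] {
  // Length descending, then start ascending. Ties on both keys keep
  // their insertion order because the sort is stable.
  const sorted = spans.slice().sort(
    (a, b) => (b.end - b.start) - (a.end - a.start) || a.start - b.start
  );
  const kept: Span[] = [];
  for (const s of sorted) {
    const clash = kept.some((k) => s.start < k.end && k.start < s.end);
    if (!clash) kept.push(s);
  }
  return kept;
}

const kept = dedupeSpans([
  { start: 0, end: 10, type: "TEL" },    // detector result, inserted first
  { start: 0, end: 10, type: "CUSTOM" }, // same range passed as a custom term
  { start: 20, end: 24, type: "NOM" },   // "Jean" inside "Jean Dupont"
  { start: 20, end: 31, type: "NOM" },   // "Jean Dupont"
  { start: 40, end: 44, type: "NOM" },   // "Jean" on its own
]);
console.log(kept.map((s) => s.type + "@" + s.start + "-" + s.end).join(", "));
// NOM@20-31, TEL@0-10, NOM@40-44
```

The CUSTOM span loses to the identical TEL span purely because it was inserted later, and the bare "Jean" at 20-24 loses to the longer compound while the one at 40-44 survives.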
All regexes that match runs of digits include explicit boundaries:

- TEL_RE: (?<!\d)…(?!\d)
- CC_RE: (?<!\d)…(?!\d)
- NIR_RE, IBAN_RE: \b…\b

Without these, the TEL regex used to match a 10-digit subsequence inside any 15-digit NIR, 18-digit account number, or 16-digit card. This was the single largest source of false positives in our first evaluation (5375 FPs eliminated by the boundary fix).
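The effect of the boundaries can be reproduced with a deliberately simplified phone pattern. The pattern below is an assumption made for the demonstration (a leading 0, a non-zero digit, then eight digits); it is not the project's actual TEL_RE.

```typescript
// Simplified, hypothetical phone patterns. Built with the RegExp
// constructor so the lookbehind is not tied to a compile target.
const TEL_LOOSE = new RegExp("0[1-9]\\d{8}", "g");
const TEL_BOUNDED = new RegExp("(?<!\\d)0[1-9]\\d{8}(?!\\d)", "g");

function countMatches(re: RegExp, text: string): number {
  const found = text.match(re);
  return found === null ? 0 : found.length;
}

// A 15-digit, NIR-shaped run that happens to contain "0575123456".
const nirLike = "180057512345678";

console.log(countMatches(TEL_LOOSE, nirLike));   // 1 (false positive)
console.log(countMatches(TEL_BOUNDED, nirLike)); // 0
console.log(countMatches(TEL_BOUNDED, "Call 0612345678.")); // 1
```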
Pseudonyms take the form [TYPE_N], where TYPE ∈ {EMAIL, TEL, CB, NIR, IBAN, NOM, CUSTOM} and N
counts up from 1 per type, in reading order within the input. The
square-bracket form was chosen because it survives most downstream
formatters (Markdown, HTML, JSON string) without escaping.
When the input text already contains literal [TYPE_N] patterns
(e.g. re-processing an already-redacted document), anonymize scans
for them and starts its own counters past the highest existing index
per type. Existing literals are preserved verbatim; only newly-
generated pseudonyms appear in the returned mapping.
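One way to seed the counters from existing literals, as a sketch only: the names `seedCounters` and `nextPseudo` are hypothetical, and the pattern assumes type names consist of uppercase ASCII letters.

```typescript
// Scan for literal [TYPE_N] tokens and record the highest N per type.
function seedCounters(text: string): Record<string, number> {
  const counters: Record<string, number> = {};
  const re = /\[([A-Z]+)_(\d+)\]/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const type = m[1];
    const n = parseInt(m[2], 10);
    if (n > (counters[type] || 0)) counters[type] = n;
  }
  return counters;
}

// Hand out the next free index for a type, starting past any seed.
function nextPseudo(counters: Record<string, number>, type: string): string {
  const n = (counters[type] || 0) + 1;
  counters[type] = n;
  return "[" + type + "_" + n + "]";
}

const counters = seedCounters("Contact [NOM_2] at [EMAIL_1], cc [NOM_1].");
console.log(nextPseudo(counters, "NOM"));   // [NOM_3]
console.log(nextPseudo(counters, "EMAIL")); // [EMAIL_2]
console.log(nextPseudo(counters, "TEL"));   // [TEL_1]
```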

    detectors.ts                 anonymizer.ts
    ─────────────────            ─────────────────
    detectAll(text)              anonymize(text, args)
      │                            │
      ├─ EMAIL_RE → spans          ├─ spans = [...detectAll(text),
      ├─ IBAN_RE  → spans ───▶     │           findLiteral(text, name, "NOM"),
      ├─ TEL_RE   → spans          │           findLiteral(text, term, "CUSTOM")]
      ├─ NIR_RE   → spans          │
      └─ CC_RE    → spans          ├─ dedupeSpans(spans)
                                   │    (longest-wins, stable insertion order)
                                   │
                                   ├─ for each span (left-to-right):
                                   │    getPseudo(span) → assigns [TYPE_N]
                                   │    in reading order
                                   │
                                   ├─ for each span (right-to-left):
                                   │    splice pseudonym into result
                                   │
                                   └─ return { anonymized, mapping, counts }
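The last two loops of the flow above can be sketched together, to show why numbering and splicing run in opposite directions. This is a simplified illustration under assumptions: spans are already de-duplicated, values are keyed by their raw text rather than a normalised form, and `counts` is taken to mean the number of distinct pseudonyms per type, which the source does not specify.

```typescript
interface Span {
  start: number;
  end: number;
  type: string;
}

interface Result {
  anonymized: string;
  mapping: Record<string, string>;
  counts: Record<string, number>;
}

function pseudonymize(text: string, spans: Span[]): Result {
  const counts: Record<string, number> = {};
  const mapping: Record<string, string> = {};
  const byValue: Record<string, string> = {};
  const ordered = spans.slice().sort((a, b) => a.start - b.start);

  // Pass 1, left to right: numbering follows reading order.
  const pseudos: string[] = [];
  for (const s of ordered) {
    const value = text.slice(s.start, s.end);
    const key = s.type + ":" + value;
    let pseudo: string | undefined = byValue[key];
    if (pseudo === undefined) {
      counts[s.type] = (counts[s.type] || 0) + 1;
      pseudo = "[" + s.type + "_" + counts[s.type] + "]";
      byValue[key] = pseudo;
      mapping[pseudo] = value;
    }
    pseudos.push(pseudo);
  }

  // Pass 2, right to left: splicing never invalidates earlier offsets.
  let anonymized = text;
  for (let i = ordered.length - 1; i >= 0; i--) {
    const s = ordered[i];
    anonymized = anonymized.slice(0, s.start) + pseudos[i] + anonymized.slice(s.end);
  }
  return { anonymized, mapping, counts };
}
```

Doing both in a single right-to-left loop would number the last occurrence in the text as 1, which is why the assignment pass is kept separate.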