Detectors

Per-type detection algorithms. All live in src/detectors.ts; each runs regex → optional checksum, and emits Span { start, end, type, value }.

EMAIL

/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g

Deliberately permissive ("RFC-ish"). No validation step. Coreference is case-insensitive in anonymizer.normalizeValue ("Alice@Ex.COM" and "alice@ex.com" share [EMAIL_1]).
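A minimal sketch of how this detector could emit spans, using the regex above unchanged. The Span interface mirrors the fields named at the top of this document; detectEmails and emailKey are hypothetical names for illustration, and emailKey only approximates what anonymizer.normalizeValue is described as doing for this type.

```typescript
// Hypothetical Span shape, mirroring the fields named in this document.
interface Span { start: number; end: number; type: string; value: string; }

const EMAIL_RE = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

function detectEmails(text: string): Span[] {
  const spans: Span[] = [];
  EMAIL_RE.lastIndex = 0; // global regexes are stateful; reset before each scan
  let m: RegExpExecArray | null;
  while ((m = EMAIL_RE.exec(text)) !== null) {
    spans.push({ start: m.index, end: m.index + m[0].length, type: "EMAIL", value: m[0] });
  }
  return spans;
}

// Assumed coreference key: lower-casing makes both spellings collide.
function emailKey(value: string): string {
  return value.toLowerCase();
}
```

Two spans whose emailKey values are equal would then receive the same placeholder.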

Known limits. Does not handle IDN (Cyrillic / Han local parts), trailing-dot edge cases, or quoted local parts. In practice this catches 99 % of mail addresses appearing in French prose (EVALUATION.md: F1 99.49 %).

TEL (French phone numbers)

/(?<!\d)(?:
   (?:\+33|0033)[ .-]?[1-9](?:[ .-]?\d{2}){4}
 | 0[1-9](?:[ .-]?\d{2}){4}
)(?!\d)/g

Two alternatives:

International form: +33 or 0033 prefix, optional (0) between country code and national number (business notation), then a 9-digit national number starting with a non-zero leading digit.
Domestic form: 0[1-9] prefix followed by 4 groups of 2 digits.

Both accept an optional space, ., or - between groups. The (?<!\d)…(?!\d) boundaries are critical: without them, the domestic alternative (effectively 0[1-9]\d{8}) matches 10-digit subsequences inside any longer digit run (NIR, IDCARD, account number). Their absence used to cost ~5 000 false positives on AI4Privacy FR; the regression test "phone regex does not match inside a longer digit run" locks the fix in place.
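The effect of the digit boundaries can be reproduced with a small sketch. TEL_FR is the pattern above written on one line (JavaScript regex literals have no free-spacing mode); TEL_FR_UNBOUNDED and findAll are hypothetical helpers that exist only for this comparison.

```typescript
// The French pattern from this section, collapsed onto one line.
const TEL_FR =
  /(?<!\d)(?:(?:\+33|0033)[ .-]?[1-9](?:[ .-]?\d{2}){4}|0[1-9](?:[ .-]?\d{2}){4})(?!\d)/g;

// Same pattern with the digit boundaries removed, for comparison only.
const TEL_FR_UNBOUNDED =
  /(?:(?:\+33|0033)[ .-]?[1-9](?:[ .-]?\d{2}){4}|0[1-9](?:[ .-]?\d{2}){4})/g;

function findAll(re: RegExp, text: string): string[] {
  const out: string[] = [];
  re.lastIndex = 0; // reset the stateful global regex
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) out.push(m[0]);
  return out;
}

const longRun = "ref 1234061234567899"; // a 16-digit identifier, not a phone number
console.log(findAll(TEL_FR, longRun));           // []
console.log(findAll(TEL_FR_UNBOUNDED, longRun)); // ["0612345678"]
```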

International numbers (opt-in). When the config field "Detect international phone numbers" is on, a second pattern runs.

It matches any +CC… form (country code of 1–3 digits, then 7–14 more digits with optional space/dot/dash separators), so DE +49 30 12345678, UK +44 20 7946 0958, US +1 555 123 4567, etc. are caught. On AI4Privacy FR this lifts TEL recall from 9.5 % to ~51 % (F1 16.7 % → 65.7 %) at no measurable precision cost.
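As an illustration only, a pattern with the shape described above could look like the following. TEL_INTL is a reconstruction from the prose description, not the pattern shipped in src/detectors.ts, and matchIntl is a hypothetical helper.

```typescript
// Assumed shape: "+", 1–3 country-code digits, then 7–14 digits,
// each optionally preceded by a space, dot, or dash.
const TEL_INTL = /(?<!\d)\+\d{1,3}(?:[ .-]?\d){7,14}(?!\d)/g;

function matchIntl(text: string): string[] {
  const out: string[] = [];
  TEL_INTL.lastIndex = 0; // reset the stateful global regex
  let m: RegExpExecArray | null;
  while ((m = TEL_INTL.exec(text)) !== null) out.push(m[0]);
  return out;
}

console.log(matchIntl("DE +49 30 12345678, UK +44 20 7946 0958, US +1 555 123 4567"));
// ["+49 30 12345678", "+44 20 7946 0958", "+1 555 123 4567"]
```

A pattern of this shape also matches +33 numbers, which is why the overlap with the French regex has to be resolved downstream.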

The international and French regexes can both match the same +33… number; the dedup priority described in DESIGN.md collapses them to a single span.

Also out of scope: premium numbers (08…), short numbers (3…).

Coreference: normalizeValue strips (0), strips separators, and rewrites the international prefix to 0. So "06 12 34 56 78", "+33 6 12 34 56 78", and "+33 (0)6 12 34 56 78" all collapse to 0612345678 and share [TEL_1].
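The three normalization steps can be sketched as below. normalizeTel is a hypothetical stand-in for the TEL branch of normalizeValue; the real function may order or implement the steps differently.

```typescript
// Sketch of the TEL normalization described above (assumed implementation).
function normalizeTel(value: string): string {
  return value
    .replace(/\(0\)/g, "")            // 1. drop the business-notation "(0)"
    .replace(/[ .-]/g, "")            // 2. drop separators
    .replace(/^(?:\+33|0033)/, "0");  // 3. international prefix → national 0
}

console.log(normalizeTel("06 12 34 56 78"));       // "0612345678"
console.log(normalizeTel("+33 6 12 34 56 78"));    // "0612345678"
console.log(normalizeTel("+33 (0)6 12 34 56 78")); // "0612345678"
```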

CB (credit / debit card numbers, Luhn-validated)

Regex:

A leading [3-6] restricts the match to known Issuer Identification Number ranges:

Prefix	Brand
4	Visa
5	Mastercard
34, 37	American Express
30, 36, 38	Diners Club
6011, 65, 644–649	Discover
35	JCB
62	UnionPay

A 16-digit run starting with 9 may be a Luhn-valid string by chance, but it is not a real card — the BIN gate rejects it. Same for runs starting with 0, 1, 7, 8. This roughly halves the FP rate without affecting recall on real cards.

After matching, luhn(digits) runs on the cleaned (separator-stripped) match. The Luhn check is the standard right-to-left double-then-modular-sum, accepting 13–19 digit cards (Visa, Mastercard, Amex, Diners, Discover, UnionPay, and JCB all fit).
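A sketch of the two-stage check, under stated assumptions: CB_RE is a guess at the regex shape (first digit 3–6, 13–19 digits in total, optional space or dash separators) and detectCards is a hypothetical wrapper. Only the Luhn arithmetic itself is standard.

```typescript
// Assumed regex shape; the actual pattern in src/detectors.ts may differ.
const CB_RE = /(?<!\d)[3-6](?:[ -]?\d){12,18}(?!\d)/g;

// Standard Luhn: walk right to left, double every second digit,
// subtract 9 from doubled values above 9, and require sum % 10 === 0.
function luhn(digits: string): boolean {
  if (!/^\d{13,19}$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

function detectCards(text: string): string[] {
  const hits: string[] = [];
  CB_RE.lastIndex = 0; // reset the stateful global regex
  let m: RegExpExecArray | null;
  while ((m = CB_RE.exec(text)) !== null) {
    const digits = m[0].replace(/[ -]/g, ""); // strip separators before the checksum
    if (luhn(digits)) hits.push(digits);
  }
  return hits;
}

console.log(detectCards("CB 4111 1111 1111 1111")); // ["4111111111111111"]
console.log(luhn("9111111111111110"));               // true: Luhn-valid by construction
console.log(detectCards("id 9111111111111110"));     // []: rejected by the leading-digit gate
```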

Known limitations.

The newer Mastercard 2221–2720 range is not yet supported (rare; add when needed).
~10 % of arbitrary 13–19 digit strings starting with [3-6] still pass Luhn by chance. On real text this is rare; on AI4Privacy FR it caps our precision around 13 % (the dataset is full of templated 16-digit identifiers that happen to start with 4, 5, or 6).

NIR (numéro de sécurité sociale, mod-97 + Corsica)

Regex (case-insensitive on letters):

Structure of the 15-character NIR:

Field	Width	Notes
Sex	1	`1` or `2`
Year	2	last two digits of birth year
Month	2	`01`–`12` (special codes accepted, not validated)
Dept	2	digits or `2A` / `2B` for Corsica
Commune	3	INSEE commune code
Ordre	3	sequence within month
Key	2	checksum (validated below)

Checksum (nirChecksum):

The Corsica substitution rule (2A→19, 2B→18) is the convention documented by INSEE and used by both reference validators we cross-checked against (SGMAP's nir_validate and Aymeric Bouzy's french-ssn; see LICENSING.md). The two repos use an equivalent formulation (replace the letter with 0, then subtract 1 000 000 for 2A or 2 000 000 for 2B as a penalty); we verified algebraic equivalence on shared fixtures.
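A minimal sketch of the mod-97 key check with both Corsica formulations, assuming the field layout from the table above. The function bodies are illustrative rather than the shipped nirChecksum; bodyBySubstitution and bodyByPenalty are hypothetical helper names.

```typescript
// Formulation 1: substitute 2A→19 and 2B→18 in the 13-character body.
function bodyBySubstitution(body13: string): number {
  return Number(body13.replace("2A", "19").replace("2B", "18"));
}

// Formulation 2: replace the letter with 0, then subtract a penalty.
// The department sits just before 6 trailing digits, so subtracting
// 1 000 000 turns "20" into "19" and 2 000 000 turns it into "18".
function bodyByPenalty(body13: string): number {
  const penalty = body13.includes("2A") ? 1_000_000 : body13.includes("2B") ? 2_000_000 : 0;
  return Number(body13.replace(/[AB]/, "0")) - penalty;
}

// Key check: key === 97 - (body mod 97). 13-digit bodies fit safely in a double.
function nirChecksum(raw: string): boolean {
  const nir = raw.replace(/\s+/g, "").toUpperCase();
  if (!/^[12]\d{4}(?:\d{2}|2[AB])\d{8}$/.test(nir)) return false;
  const key = Number(nir.slice(13));
  return 97 - (bodyBySubstitution(nir.slice(0, 13)) % 97) === key;
}

console.log(nirChecksum("2 69 05 49 588 157 80")); // true
console.log(nirChecksum("2 89 04 2a 342 163 90")); // true (Corsica, 2A)
console.log(nirChecksum("2 69 05 49 588 157 81")); // false
```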

Coreference: normalizeValue strips whitespace and uppercases, so "2 89 04 2a 342 163 90" and "289042A342 163 90" share [NIR_1].

IBAN (mod-97)

Regex (case-insensitive):

Then ibanCheck performs the standard mod-97 validation: move the first four characters to the end, replace letters with A=10, B=11, …, Z=35, and verify the remainder modulo 97 is 1.

Length is constrained to 15–34 characters (the IBAN spec range).
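The validation steps can be sketched as follows. This is an illustrative implementation of the standard algorithm rather than the shipped ibanCheck; the shape regex is an assumption, and the remainder is reduced incrementally so no big-integer type is needed.

```typescript
function ibanCheck(raw: string): boolean {
  const iban = raw.replace(/\s+/g, "").toUpperCase();
  // Assumed shape: 2-letter country, 2 check digits, 11–30 more characters (15–34 total).
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;

  // Step 1: move the first four characters to the end.
  const rearranged = iban.slice(4) + iban.slice(0, 4);

  // Steps 2 and 3: expand letters to 10..35 and reduce modulo 97 as we go.
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code >= 65 ? code - 55 : code - 48; // 'A' → 10 … 'Z' → 35, '0' → 0
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}

console.log(ibanCheck("FR1420041010050500013M02606"));       // true
console.log(ibanCheck("fr14 2004 1010 0505 0001 3M02 606")); // true
console.log(ibanCheck("FR1420041010050500013M02607"));       // false
```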

Coreference: normalizeValue strips whitespace and uppercases, so "FR1420041010050500013M02606", "fr14 2004 1010 0505 0001 3M02 606" and any other formatting variant share [IBAN_1].

IDDOC (passport numbers, context-labelled)

The regex detector only redacts an ID-document token when it is explicitly introduced by passeport / passport and an optional number marker such as n°, numéro, or #:

The captured token must be 6–12 alphanumeric characters after removing spaces and dashes, and must contain at least one letter and one digit. This keeps the detector from redacting arbitrary internal references that merely look like passport numbers.
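A rough sketch of the label-then-validate approach. PASSPORT_RE is an assumed pattern, simplified to capture dash-separated tokens only (the real detector also tolerates spaces inside the token); isPlausibleDocNumber and detectPassports are hypothetical names. The validation rule itself follows the paragraph above.

```typescript
// Assumed label pattern: "passeport" / "passport", optional number marker, optional colon.
const PASSPORT_RE = /\bpasse?port\s*(?:n°|numéro|#)?\s*:?\s*([A-Z0-9]+(?:-[A-Z0-9]+)*)/giu;

// 6–12 alphanumerics after removing spaces and dashes,
// with at least one letter and at least one digit.
function isPlausibleDocNumber(token: string): boolean {
  const cleaned = token.replace(/[ -]/g, "");
  return /^[A-Za-z0-9]{6,12}$/.test(cleaned) && /[A-Za-z]/.test(cleaned) && /\d/.test(cleaned);
}

function detectPassports(text: string): string[] {
  const hits: string[] = [];
  PASSPORT_RE.lastIndex = 0; // reset the stateful global regex
  let m: RegExpExecArray | null;
  while ((m = PASSPORT_RE.exec(text)) !== null) {
    if (isPlausibleDocNumber(m[1])) hits.push(m[1]);
  }
  return hits;
}

console.log(detectPassports("Passeport n° 12AB34567 délivré à Paris")); // ["12AB34567"]
console.log(detectPassports("Dossier passeport REF-INTERNE"));          // []
```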

ADDRESS (French postal addresses, context-labelled)

The regex detector redacts a postal address only after an explicit Adresse : label. It captures until punctuation or a next-field label such as Contact, Passeport, NIR, IBAN, or carte, and it requires either a street shape (27 rue ..., 12 avenue ...) or a French postal-code + city shape:

Unlabelled street-like prose is deliberately ignored to avoid redacting ordinary location mentions or business addresses that are not clearly the subject's address.
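The label-gated capture can be sketched as below. Everything here is an assumption made for illustration: the stop characters (period, semicolon, newline), the list of street words, and the names ADDRESS_RE, STREET_SHAPE, POSTAL_CITY_SHAPE, and detectAddresses. Only the overall rule (explicit label, stop at a next-field label, require a street or postal-code + city shape) comes from the text above.

```typescript
const NEXT_FIELD = "(?:Contact|Passeport|NIR|IBAN|carte)";

// Capture lazily after "Adresse :" until punctuation, end of input, or a next-field label.
const ADDRESS_RE = new RegExp(
  "Adresse\\s*:\\s*(.+?)(?=\\s*(?:[.;\\n]|$|\\b" + NEXT_FIELD + "\\b))",
  "giu",
);
// Street shape: house number, optional bis/ter, then a street word.
const STREET_SHAPE = /^\d{1,4}(?:\s?(?:bis|ter))?,?\s+(?:rue|avenue|boulevard|place|chemin|impasse)\s/iu;
// Postal shape: five digits followed by a city name.
const POSTAL_CITY_SHAPE = /(?<!\d)\d{5}\s+\p{L}/u;

function detectAddresses(text: string): Array<{ start: number; end: number; value: string }> {
  const spans: Array<{ start: number; end: number; value: string }> = [];
  ADDRESS_RE.lastIndex = 0; // reset the stateful global regex
  let m: RegExpExecArray | null;
  while ((m = ADDRESS_RE.exec(text)) !== null) {
    const value = m[1];
    if (!STREET_SHAPE.test(value) && !POSTAL_CITY_SHAPE.test(value)) continue;
    // The capture group ends the match (the lookahead is zero-width).
    const end = m.index + m[0].length;
    spans.push({ start: end - value.length, end, value });
  }
  return spans;
}

console.log(detectAddresses("Adresse : 27 rue de la Paix, 75002 Paris. Contact : Marie"));
// one span with value "27 rue de la Paix, 75002 Paris"
console.log(detectAddresses("Le client habite 27 rue de la Paix.")); // []
```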

Optional: ML model layer

When the operator enables detectPiiWithModel, the plugin runs a second detection pass via src/piiModel.ts. Model: onnx-community/multilang-pii-ner-ONNX (xlm-roberta-base fine-tuned on AI4Privacy CoNLL, F1 0.99, EN/DE/IT/FR native). It is loaded lazily via @huggingface/transformers in the Node runtime (the first call downloads ~280 MB to the HF cache).

The model output (AI4Privacy labels) is mapped to our SpanTypes and filtered by the four sub-flags:

Sub-flag	Model labels picked up	Emits
`modelDetectNames`	GIVENNAME, SURNAME	`[NOM_N]`
`modelDetectAddresses`	STREET, CITY, STATE, COUNTRY, BUILDINGNUM, ZIPCODE	`[ADDRESS_N]`
`modelDetectDates`	DATE, TIME	`[DATE_N]`
`modelDetectIdDocs`	PASSPORTNUM, DRIVERLICENSENUM, IDCARDNUM	`[IDDOC_N]`

All other model labels are dropped — EMAIL/TELEPHONENUM/ACCOUNTNUM are intentionally ignored because the validated regex detectors are more precise on the structured types (a Luhn-validated 16-digit run is authoritative; a NER probability is not).
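The mapping and sub-flag filter can be sketched as a lookup table. The label names, sub-flag names, and emitted types come from the table above; ModelFlags, LABEL_MAP, and mapModelLabel are hypothetical names, and the real code in src/piiModel.ts may be organised differently.

```typescript
type SpanType = "NOM" | "ADDRESS" | "DATE" | "IDDOC";

interface ModelFlags {
  modelDetectNames: boolean;
  modelDetectAddresses: boolean;
  modelDetectDates: boolean;
  modelDetectIdDocs: boolean;
}

// One entry per accepted model label: which sub-flag gates it, which type it emits.
const LABEL_MAP = new Map<string, { flag: keyof ModelFlags; type: SpanType }>();
for (const l of ["GIVENNAME", "SURNAME"]) LABEL_MAP.set(l, { flag: "modelDetectNames", type: "NOM" });
for (const l of ["STREET", "CITY", "STATE", "COUNTRY", "BUILDINGNUM", "ZIPCODE"])
  LABEL_MAP.set(l, { flag: "modelDetectAddresses", type: "ADDRESS" });
for (const l of ["DATE", "TIME"]) LABEL_MAP.set(l, { flag: "modelDetectDates", type: "DATE" });
for (const l of ["PASSPORTNUM", "DRIVERLICENSENUM", "IDCARDNUM"])
  LABEL_MAP.set(l, { flag: "modelDetectIdDocs", type: "IDDOC" });

// Returns the SpanType to emit, or null when the label is unmapped or its sub-flag is off.
function mapModelLabel(label: string, flags: ModelFlags): SpanType | null {
  const entry = LABEL_MAP.get(label);
  if (!entry || !flags[entry.flag]) return null;
  return entry.type;
}

const flags: ModelFlags = {
  modelDetectNames: true,
  modelDetectAddresses: true,
  modelDetectDates: false,
  modelDetectIdDocs: true,
};
console.log(mapModelLabel("GIVENNAME", flags)); // "NOM"
console.log(mapModelLabel("DATE", flags));      // null (sub-flag off)
console.log(mapModelLabel("EMAIL", flags));     // null (left to the regex detector)
```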

The model spans are merged into anonymizer.ts via the modelSpans field of AnonymizeArgs. Insertion order keeps regex spans winning on identical-length ties (see DESIGN.md dedup priority).

Confidence floor: default 0.5. Adjustable via the threshold field of PiiModelOptions; not exposed as a plugin config field yet (the default behaves well on AI4Privacy FR; revisit if a real deployment needs tuning).

What is not a detector

Names (NOM) and arbitrary strings (CUSTOM) are not detected — they are matched literally against the text using findLiteral, with Unicode word-boundary lookarounds:

This is what keeps "Jean" from matching inside "Jeanne". Names (NOM) are matched case-insensitively ("JEAN DUPONT" and "jean dupont" both match "Jean Dupont"). Custom terms (CUSTOM) stay case-sensitive: the caller picked an exact form, and we respect it.
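A minimal sketch of literal matching with Unicode-aware boundaries. The signature shown for findLiteral and the exact lookaround classes are assumptions; the idea is only that the term must not be preceded or followed by a letter or digit in any script.

```typescript
// Escape regex syntax characters so the term is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical signature: returns the offsets of each whole-word occurrence.
function findLiteral(
  text: string,
  term: string,
  caseInsensitive: boolean,
): Array<{ start: number; end: number }> {
  const pattern = "(?<![\\p{L}\\p{N}])" + escapeRegExp(term) + "(?![\\p{L}\\p{N}])";
  const re = new RegExp(pattern, caseInsensitive ? "giu" : "gu");
  const hits: Array<{ start: number; end: number }> = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    hits.push({ start: m.index, end: m.index + m[0].length });
  }
  return hits;
}

console.log(findLiteral("Jeanne et Jean sont là", "Jean", true));
// [{ start: 10, end: 14 }]: "Jeanne" is skipped, the standalone "Jean" is found
console.log(findLiteral("ACME-42 et acme-42", "ACME-42", false).length); // 1
```

ASCII \b would treat accented letters as non-word characters, which is why the sketch uses \p{L} and \p{N} lookarounds instead.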