docs / DETECTORS.md
Per-type detection algorithms. All live in src/detectors.ts; each runs
regex → optional checksum, and emits Span { start, end, type, value }.
/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g
Deliberately permissive ("RFC-ish"). No validation step. Coreference is
case-insensitive in anonymizer.normalizeValue ("Alice@Ex.COM" and
"alice@ex.com" share [EMAIL_1]).
Known limits. Does not handle IDN (Cyrillic / Han local parts), trailing-dot edge cases, or quoted local parts. In practice this still catches about
99 % of email addresses appearing in French prose (EVALUATION.md: F1 99.49 %).
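As a rough illustration of the regex → Span flow and the case-insensitive coreference described above, here is a minimal sketch. The helper names detectEmails and normalizeEmail are hypothetical, and lower-casing the whole address is an assumption about what anonymizer.normalizeValue does for this type.

```typescript
// Span shape as described at the top of this document.
interface Span {
  start: number;
  end: number;
  type: string;
  value: string;
}

// Hypothetical helper: scans text with the permissive email pattern.
function detectEmails(text: string): Span[] {
  // A fresh regex per call keeps the /g lastIndex state from leaking.
  const re = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
  const spans: Span[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    spans.push({ start: m.index, end: m.index + m[0].length, type: "EMAIL", value: m[0] });
  }
  return spans;
}

// Assumed coreference key: the lower-cased address.
function normalizeEmail(value: string): string {
  return value.toLowerCase();
}

const emailSpans = detectEmails("Écrire à Alice@Ex.COM ou à alice@ex.com.");
const emailKeys = emailSpans.map((s) => normalizeEmail(s.value));
```

Both matches normalize to the same key, which is what would let them share one placeholder such as [EMAIL_1].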
/(?<!\d)(?: (?:\+33|0033)[ .-]?[1-9](?:[ .-]?\d{2}){4} | 0[1-9](?:[ .-]?\d{2}){4} )(?!\d)/g
Two alternatives:
- +33 or 0033 prefix, an optional (0) between the country code and the national number (business notation), then a 9-digit national number starting with a non-zero digit.
- 0[1-9] prefix followed by 4 groups of 2 digits.

Both accept an optional space, ., or - between groups. The
(?<!\d)…(?!\d) boundaries are critical: without them, the domestic
alternative 0[1-9]\d{8} matches 10-digit subsequences inside any
longer digit run (NIR, IDCARD, account number). Their absence
used to cost ~5 000 false positives on AI4Privacy FR; the
regression test "phone regex does not match inside a longer digit run"
locks the fix in place.
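The effect of the digit boundaries can be seen with a small sketch. It uses only the domestic alternative, once with and once without the lookarounds; the constant and function names are illustrative, not taken from src/detectors.ts.

```typescript
// Domestic alternative with the digit-boundary lookarounds.
const DOMESTIC_GUARDED = /(?<!\d)0[1-9](?:[ .-]?\d{2}){4}(?!\d)/g;
// Same alternative with the boundaries removed, for comparison only.
const DOMESTIC_UNGUARDED = /0[1-9](?:[ .-]?\d{2}){4}/g;

// Counts non-overlapping matches; String.prototype.match with /g
// returns every match and resets lastIndex afterwards.
function countMatches(re: RegExp, text: string): number {
  const hits = text.match(re);
  return hits === null ? 0 : hits.length;
}

// A 15-digit run (NIR-like) that contains "0419342163" in the middle.
const longDigitRun = "289041934216390";
const realPhone = "Tél. 06 12 34 56 78";

const falsePositives = countMatches(DOMESTIC_UNGUARDED, longDigitRun);
const guardedOnRun = countMatches(DOMESTIC_GUARDED, longDigitRun);
const guardedOnPhone = countMatches(DOMESTIC_GUARDED, realPhone);
```

The unguarded pattern finds a "phone number" inside the long run, while the guarded one only fires on the real number.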
International numbers (opt-in). When the config field
Detect international phone numbers is on, a second pattern runs:
This matches any +CC… form (country code 1–3 digits, then 7–14 more
digits with optional space/dot/dash separators), so DE +49 30 12345678,
UK +44 20 7946 0958, US +1 555 123 4567 etc. are caught. On
AI4Privacy FR this lifts TEL recall from 9.5 % to ~51 % (F1 16.7 %
→ 65.7 %) at no measurable precision cost.
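A quick sketch of how the international pattern (the +CC… regex listed with the other patterns in this document) behaves on the examples above. The function name findInternational is hypothetical.

```typescript
// Hypothetical wrapper around the international pattern.
function findInternational(text: string): string[] {
  // Country code of 1 to 3 digits, then 7 to 14 more digits with
  // optional space / dot / dash separators.
  const re = /(?<![\d.\-])\+[1-9]\d{0,2}[ .-]?\d(?:[ .-]?\d){6,13}(?![.\-]?\d)/g;
  const hits = text.match(re);
  return hits === null ? [] : hits.slice();
}

const intlHits = findInternational(
  "DE +49 30 12345678, UK +44 20 7946 0958, US +1 555 123 4567"
);
// A French number in international form is also caught by this pattern.
const frenchHit = findInternational("Appeler le +33 6 12 34 56 78");
// A "+" glued to a preceding digit is rejected by the lookbehind.
const gluedHit = findInternational("ref 12+33612345678");
```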
The international and French regexes can both match the same +33… number;
the DESIGN.md dedup priority collapses them to a single span.
Also out of scope: premium-rate numbers (08…) and short numbers (3…).
Coreference: normalizeValue strips (0), strips separators, and
rewrites the international prefix to 0. So "06 12 34 56 78",
"+33 6 12 34 56 78", and "+33 (0)6 12 34 56 78" all collapse to
0612345678 and share [TEL_1].
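The three normalization steps can be sketched as follows. This is a minimal illustration under the assumption that the steps run in the order listed; normalizePhone is a hypothetical name, not the actual normalizeValue code.

```typescript
// Hypothetical phone branch of a normalizeValue-style function.
function normalizePhone(raw: string): string {
  return raw
    .replace(/\(0\)/g, "") // drop the business-notation "(0)"
    .replace(/[ .-]/g, "") // drop space, dot and dash separators
    .replace(/^(?:\+33|0033)/, "0"); // rewrite the international prefix to 0
}

const phoneVariants = [
  "06 12 34 56 78",
  "+33 6 12 34 56 78",
  "+33 (0)6 12 34 56 78",
];
const phoneKeys = phoneVariants.map(normalizePhone);
```

All three variants produce the key 0612345678, so they would map to the same placeholder.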
Regex:
A leading [3-6] restricts the match to known Issuer Identification
Number ranges:
| Prefix | Brand |
|---|---|
| 4 | Visa |
| 5 | Mastercard |
| 34, 37 | American Express |
| 30, 36, 38 | Diners Club |
| 6011, 65, 644–649 | Discover |
| 35 | JCB |
| 62 | UnionPay |
A 16-digit run starting with 9 may be a Luhn-valid string by chance,
but it is not a real card — the BIN gate rejects it. Same for runs
starting with 0, 1, 7, 8. This roughly halves the FP rate
without affecting recall on real cards.
After matching, luhn(digits) runs on the cleaned (separator-stripped) match. The Luhn
check is the standard right-to-left double-then-modular-sum, accepting
13–19 digit cards (Visa, Mastercard, Amex, Diners, Discover, UnionPay,
JCB all fit).
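For reference, a minimal sketch of the two gates described above: the leading-digit check and the Luhn checksum. The function isCardCandidate is a hypothetical wrapper, and luhn here is an illustrative reimplementation rather than the code in src/detectors.ts.

```typescript
// Standard Luhn check: walk right to left, double every second digit,
// subtract 9 from doubled values above 9, and require sum % 10 === 0.
function luhn(digits: string): boolean {
  if (!/^\d{13,19}$/.test(digits)) return false;
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// Hypothetical wrapper: strip separators, apply the [3-6] gate, then Luhn.
function isCardCandidate(raw: string): boolean {
  const cleaned = raw.replace(/[ -]/g, "");
  return /^[3-6]/.test(cleaned) && luhn(cleaned);
}

const visaTest = isCardCandidate("4111 1111 1111 1111"); // well-known test number
const luhnValidNine = luhn("9111111111111110"); // passes Luhn by construction
const nineRejected = isCardCandidate("9111111111111110"); // fails the [3-6] gate
```

The last two lines show the point of the gate: a run starting with 9 can satisfy Luhn and still be rejected.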
Known limitations.
Random digit runs that start with [3-6] can still
pass Luhn by chance. On real text this is rare; on AI4Privacy FR
it caps our precision around 13 % (the dataset is full of templated
16-digit identifiers that happen to start with 4, 5, 6).

Regex (case-insensitive on letters):
Structure of the 15-character NIR:
| Field | Width | Notes |
|---|---|---|
| Sex | 1 | 1 or 2 |
| Year | 2 | last two digits of birth year |
| Month | 2 | 01–12 (special codes accepted, not validated) |
| Dept | 2 | digits or 2A / 2B for Corsica |
| Commune | 3 | INSEE commune code |
| Ordre | 3 | sequence within month |
| Key | 2 | checksum (validated below) |
Checksum (nirChecksum):
The Corsica substitution rule (2A→19, 2B→18) is the convention
documented by INSEE and used by both reference validators we
cross-checked against (SGMAP's nir_validate and Aymeric Bouzy's
french-ssn; see LICENSING.md). The two repos use an
equivalent formulation (replace the letter with 0, then subtract 1 000 000
for 2A or 2 000 000 for 2B as a penalty); we verified algebraic equivalence on shared
fixtures.
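The equivalence of the two formulations can be checked with a short sketch. The names nirKey, nirKeyPenalty and isValidNir are hypothetical; only the arithmetic follows the rules stated above.

```typescript
// Substitution form: 2A -> 19, 2B -> 18, then key = 97 - (body mod 97).
function nirKey(body: string): number {
  const numeric = body.toUpperCase().replace("2A", "19").replace("2B", "18");
  // 13 digits stay well below Number.MAX_SAFE_INTEGER.
  return 97 - (Number(numeric) % 97);
}

// Penalty form: letter -> 0, then subtract 1 000 000 (2A) or 2 000 000 (2B).
function nirKeyPenalty(body: string): number {
  const upper = body.toUpperCase();
  let penalty = 0;
  if (upper.indexOf("A") !== -1) penalty = 1000000;
  else if (upper.indexOf("B") !== -1) penalty = 2000000;
  const numeric = Number(upper.replace(/[AB]/, "0")) - penalty;
  return 97 - (numeric % 97);
}

// Hypothetical end-to-end check on a raw, possibly spaced NIR.
function isValidNir(raw: string): boolean {
  const compact = raw.replace(/\s+/g, "").toUpperCase();
  const m = /^([12]\d{4}(?:\d{2}|2[AB])\d{6})(\d{2})$/.exec(compact);
  return m !== null && nirKey(m[1]) === Number(m[2]);
}

const corsicaKey = nirKey("289042A342163"); // 90
const corsicaOk = isValidNir("2 89 04 2a 342 163 90");
```

The penalty works because the department sits six digits from the end of the 13-digit body: writing 20 and subtracting 1 000 000 gives 19 in that position.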
Coreference: normalizeValue strips whitespace and uppercases, so
"2 89 04 2a 342 163 90" and "289042A342 163 90" share [NIR_1].
Regex (case-insensitive):
Then ibanCheck performs the standard mod-97 validation: move the first
four characters to the end, replace letters with A=10, B=11, …, Z=35,
and verify the remainder modulo 97 is 1.
Length is constrained to 15–34 characters (the IBAN spec range).
Coreference: normalizeValue strips whitespace and uppercases, so
"FR1420041010050500013M02606", "fr14 2004 1010 0505 0001 3M02 606"
and any other formatting variant share [IBAN_1].
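The mod-97 procedure can be sketched as below. This is an illustrative reimplementation, not the project's ibanCheck; it folds the 15–34 length constraint into a shape check and computes the remainder digit by digit so that no big-integer type is needed.

```typescript
// Illustrative mod-97 IBAN validation.
function ibanCheckSketch(raw: string): boolean {
  const iban = raw.replace(/\s+/g, "").toUpperCase();
  // 2 letters + 2 check digits + 11 to 30 more characters = 15 to 34 total.
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;

  // Move the first four characters to the end.
  const rearranged = iban.slice(4) + iban.slice(0, 4);

  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    if (code >= 48 && code <= 57) {
      // Digit: append one decimal digit.
      remainder = (remainder * 10 + (code - 48)) % 97;
    } else {
      // Letter: A=10 ... Z=35, which appends two decimal digits.
      remainder = (remainder * 100 + (code - 55)) % 97;
    }
  }
  return remainder === 1;
}

const ibanCompact = ibanCheckSketch("FR1420041010050500013M02606");
const ibanSpaced = ibanCheckSketch("fr14 2004 1010 0505 0001 3M02 606");
const ibanBadCheck = ibanCheckSketch("FR1520041010050500013M02606");
```

Both formatting variants validate, and changing the check digits from 14 to 15 makes the remainder differ from 1.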
The regex detector only redacts an ID-document token when it is explicitly
introduced by passeport / passport and an optional number marker such as
n°, numéro, or #:
The captured token must be 6–12 alphanumeric characters after removing spaces and dashes, and must contain at least one letter and one digit. This keeps the detector from redacting arbitrary internal references that merely look like passport numbers.
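The token constraints can be expressed as a small predicate. This is a sketch of the rule as worded above; isPlausibleIdToken is a hypothetical name, and the label-matching regex itself is not reproduced here.

```typescript
// Hypothetical validator for a token captured after "passeport" / "passport".
function isPlausibleIdToken(raw: string): boolean {
  const token = raw.replace(/[ -]/g, ""); // remove spaces and dashes
  return (
    /^[A-Za-z0-9]{6,12}$/.test(token) && // 6 to 12 alphanumeric characters
    /[A-Za-z]/.test(token) && // at least one letter
    /\d/.test(token) // at least one digit
  );
}

const passportOk = isPlausibleIdToken("19FH84235");
const digitsOnly = isPlausibleIdToken("123456789");
const lettersOnly = isPlausibleIdToken("DOSSIERX");
```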
The regex detector redacts a postal address only after an explicit Adresse :
label. It captures until punctuation or a next-field label such as Contact,
Passeport, NIR, IBAN, or carte, and it requires either a street shape
(27 rue ..., 12 avenue ...) or a French postal-code + city shape:
Unlabelled street-like prose is deliberately ignored to avoid redacting ordinary location mentions or business addresses that are not clearly the subject's address.
When the operator enables detectPiiWithModel, the plugin runs a second
detection pass via src/piiModel.ts. Model:
onnx-community/multilang-pii-ner-ONNX
(xlm-roberta-base fine-tuned on AI4Privacy CoNLL, F1 0.99,
EN/DE/IT/FR native). Loaded lazily via @huggingface/transformers in
the Node runtime (first call downloads ~280 MB to the HF cache).
The model output (AI4Privacy labels) is mapped to our SpanTypes and
filtered by the four sub-flags:
| Sub-flag | Model labels picked up | Emits |
|---|---|---|
| modelDetectNames | GIVENNAME, SURNAME | [NOM_N] |
| modelDetectAddresses | STREET, CITY, STATE, COUNTRY, BUILDINGNUM, ZIPCODE | [ADDRESS_N] |
| modelDetectDates | DATE, TIME | [DATE_N] |
| modelDetectIdDocs | PASSPORTNUM, DRIVERLICENSENUM, IDCARDNUM | [IDDOC_N] |
All other model labels are dropped — EMAIL/TELEPHONENUM/ACCOUNTNUM are intentionally ignored because the validated regex detectors are more precise on the structured types (a Luhn-validated 16-digit run is authoritative; a NER probability is not).
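One way to picture the mapping is a lookup table keyed by model label. The sketch below transcribes the table above; the names LABEL_TABLE and mapModelLabel are hypothetical, and the span type strings (NOM, ADDRESS, DATE, IDDOC) are inferred from the placeholder names.

```typescript
type SubFlag =
  | "modelDetectNames"
  | "modelDetectAddresses"
  | "modelDetectDates"
  | "modelDetectIdDocs";
type ModelFlags = Record<SubFlag, boolean>;

// Model label -> [controlling sub-flag, emitted span type].
const LABEL_TABLE: Record<string, [SubFlag, string]> = {
  GIVENNAME: ["modelDetectNames", "NOM"],
  SURNAME: ["modelDetectNames", "NOM"],
  STREET: ["modelDetectAddresses", "ADDRESS"],
  CITY: ["modelDetectAddresses", "ADDRESS"],
  STATE: ["modelDetectAddresses", "ADDRESS"],
  COUNTRY: ["modelDetectAddresses", "ADDRESS"],
  BUILDINGNUM: ["modelDetectAddresses", "ADDRESS"],
  ZIPCODE: ["modelDetectAddresses", "ADDRESS"],
  DATE: ["modelDetectDates", "DATE"],
  TIME: ["modelDetectDates", "DATE"],
  PASSPORTNUM: ["modelDetectIdDocs", "IDDOC"],
  DRIVERLICENSENUM: ["modelDetectIdDocs", "IDDOC"],
  IDCARDNUM: ["modelDetectIdDocs", "IDDOC"],
};

// Returns the span type to emit, or null when the label is unmapped
// (EMAIL, TELEPHONENUM, ...) or its sub-flag is switched off.
function mapModelLabel(label: string, flags: ModelFlags): string | null {
  if (!Object.prototype.hasOwnProperty.call(LABEL_TABLE, label)) return null;
  const [flag, spanType] = LABEL_TABLE[label];
  return flags[flag] ? spanType : null;
}
```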
The model spans are merged into anonymizer.ts via the modelSpans
field of AnonymizeArgs. Insertion order keeps regex spans winning on
identical-length ties (see DESIGN.md dedup priority).
Confidence floor: default 0.5. Adjustable via the threshold field of
PiiModelOptions; not exposed as a plugin config field yet (the
default behaves well on AI4Privacy FR; revisit if a real deployment
needs tuning).
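A minimal sketch of the confidence floor follows. The ModelSpan shape and its score field are assumptions made for illustration, as is treating the threshold as inclusive; only the threshold option and its 0.5 default come from the text above.

```typescript
// Assumed shape of a raw model span; the score field name is hypothetical.
interface ModelSpan {
  start: number;
  end: number;
  label: string;
  score: number;
}

interface PiiModelOptions {
  threshold?: number;
}

// Keeps spans whose score reaches the floor (inclusive by assumption).
function applyConfidenceFloor(
  spans: ModelSpan[],
  options: PiiModelOptions = {}
): ModelSpan[] {
  const threshold = options.threshold === undefined ? 0.5 : options.threshold;
  return spans.filter((s) => s.score >= threshold);
}

const rawSpans: ModelSpan[] = [
  { start: 0, end: 4, label: "GIVENNAME", score: 0.49 },
  { start: 5, end: 11, label: "SURNAME", score: 0.5 },
  { start: 20, end: 24, label: "CITY", score: 0.93 },
];
const keptDefault = applyConfidenceFloor(rawSpans);
const keptStrict = applyConfidenceFloor(rawSpans, { threshold: 0.8 });
```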
Names (NOM) and arbitrary strings (CUSTOM) are not detected — they
are matched literally against the text using findLiteral, with Unicode
word-boundary lookarounds:
This is what keeps "Jean" from matching inside "Jeanne" or
"jeans". Names (NOM) are matched case-insensitively
("JEAN DUPONT" and "jean dupont" both match "Jean Dupont").
Custom terms (CUSTOM) stay case-sensitive — the caller picked an
exact form, we respect it.
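A sketch of literal matching with Unicode boundaries is shown below. It builds the regex the same way as the RegExp constructor call listed with the other patterns in this document; escapeRegExp, findLiteralSketch and the caseInsensitive parameter are hypothetical names, not the actual findLiteral signature.

```typescript
// Escapes regex metacharacters so the term is matched literally.
function escapeRegExp(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical literal finder: NOM would pass caseInsensitive = true,
// CUSTOM would pass false.
function findLiteralSketch(
  text: string,
  term: string,
  caseInsensitive: boolean
): Array<{ start: number; end: number }> {
  if (term.length === 0) return [];
  const escaped = escapeRegExp(term);
  // No letter or digit may touch the match on either side.
  const re = new RegExp(
    `(?<![\\p{L}\\p{N}])${escaped}(?![\\p{L}\\p{N}])`,
    caseInsensitive ? "giu" : "gu"
  );
  const hits: Array<{ start: number; end: number }> = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    hits.push({ start: m.index, end: m.index + m[0].length });
  }
  return hits;
}

const jeanHits = findLiteralSketch("Jean et Jeanne", "Jean", true);
const upperHits = findLiteralSketch("JEAN DUPONT est là", "Jean Dupont", true);
const customHits = findLiteralSketch("Projet ACME", "acme", false);
```

Here "Jean" matches once (the standalone word), the upper-case name is found despite the case difference, and the case-sensitive custom term does not match.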
/(?<![\d.\-])\+[1-9]\d{0,2}[ .-]?\d(?:[ .-]?\d){6,13}(?![.\-]?\d)/g
/(?<!\d)[3-6](?:[ -]?\d){12,18}(?!\d)/g
/\b([12][ ]?\d{2}[ ]?\d{2}[ ]?(?:\d{2}|2[AB])[ ]?\d{3}[ ]?\d{3})[ ]?(\d{2})\b/gi
const normalized = body.replace(/2A/i, "19").replace(/2B/i, "18");
const expected = 97 - (parseInt(normalized, 10) % 97);
/\b[A-Z]{2}\d{2}(?:[ ]?[A-Z0-9]{4}){2,7}(?:[ ]?[A-Z0-9]{1,4})?\b/gi
Passeport n° 19FH84235 → Passeport n° [IDDOC_1]
Adresse : 27 rue de la République, 69002 Lyon. Contact : ...
→
Adresse : [ADDRESS_1]. Contact : ...
new RegExp(`(?<![\\p{L}\\p{N}])${escaped}(?![\\p{L}\\p{N}])`, "gu")