LM Studio plugin that redacts personally identifiable information (PII) from text.
The model passes a block of text and (optionally) a list of personal names it has spotted; the plugin handles the format-bound stuff itself (cards, NIR, IBAN, phone, email), validating each match, by checksum where the format defines one, rather than relying on regex alone. Each detected item is replaced with a stable typed pseudonym so coreference is preserved within the document.
anonymize_text(text, names?, custom_terms?, include_mapping?): returns the redacted text plus a mapping from pseudonym to original value.

| Type | Detection | Pseudonym |
|---|---|---|
| Card | 13–19 digits, Luhn-validated | [CB_N] |
| NIR | French numéro de sécu, mod-97 checksum | [NIR_N] |
| IBAN | 2 letters + 2 digits + body, mod-97 | [IBAN_N] |
| Phone | French formats: 0X…, +33 X…, 0033 X… | [TEL_N] |
| Email | RFC-ish | [EMAIL_N] |
| Passport | Context-labelled passeport/passport n° … | [IDDOC_N] |
| Address | Context-labelled Adresse : … with street/postcode shape | [ADDRESS_N] |
Format-only detection without a checksum would produce too many false positives, so each detector validates before redacting.
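
To make the "validate before redacting" idea concrete, here is a minimal sketch of the two checksum gates named in the table: Luhn for cards and mod-97 for IBANs. The function names `passesLuhn` and `passesIbanMod97` are hypothetical and are not taken from the plugin's source; the plugin's own detectors may differ in detail (for example, in how candidates are extracted from the text).

```typescript
// Hypothetical helpers for illustration; not the plugin's actual API.

/** Luhn check over a candidate card number (spaces and dashes ignored). */
function passesLuhn(candidate: string): boolean {
  const digits = candidate.replace(/[\s-]/g, "");
  if (!/^\d{13,19}$/.test(digits)) return false; // length gate from the table
  let sum = 0;
  let doubleNext = false; // the rightmost digit is not doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

/** Mod-97 check for an IBAN (spaces ignored, case-insensitive). */
function passesIbanMod97(candidate: string): boolean {
  const iban = candidate.replace(/\s/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{1,30}$/.test(iban)) return false;
  // Move country code and check digits to the end, then reduce mod 97.
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    // Letters map to 10..35 (two decimal digits); digits map to themselves.
    const value = code >= 65 ? code - 55 : code - 48;
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}
```

A string that merely has the right shape, such as a 16-digit internal reference, fails the gate unless its check digit happens to match, which is why shape-only matches are left alone.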
- names: personal names you've identified by reading the text. The tool does no NER. Pass full names like "Jean Dupont", not just "Jean", to avoid false matches against common words.
- custom_terms: anything else you want gone, such as company names, addresses, project codenames, etc.

There's also a per-chat config field, Always-redact terms (string array): things that should be redacted in every call regardless of model input, e.g. your own name, your home address, an employer name. Useful as a safety net.
Same value → same pseudonym, throughout one call:
(The model is expected to pass names: ["Jean", "Jean Dupont"] — but [NOM_1] is reused because the value matched in the text is the same once dedup runs.)
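
One way to get "same value, same pseudonym" is a per-call lookup table keyed on type and value, with one counter per type. The sketch below assumes exactly that; `PseudonymTable` and `assign` are hypothetical names, and the name dedup mentioned above (folding "Jean" into "Jean Dupont") is deliberately not modelled.

```typescript
// Hypothetical sketch; class and method names are illustrative only.
class PseudonymTable {
  private counters = new Map<string, number>(); // next index per type
  private byValue = new Map<string, string>();  // (type, value) -> pseudonym

  /** Returns the pseudonym for (type, value), minting a new one on first sight. */
  assign(type: string, value: string): string {
    const key = type + "\u0000" + value;
    const existing = this.byValue.get(key);
    if (existing !== undefined) return existing;
    const n = (this.counters.get(type) ?? 0) + 1;
    this.counters.set(type, n);
    const pseudonym = "[" + type + "_" + n + "]";
    this.byValue.set(key, pseudonym);
    return pseudonym;
  }
}
```

Because the table lives for a single call, numbering restarts on the next call, which matches the "throughout one call" scope stated above.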
The response includes a mapping so you can build a "key" file alongside the redacted version if needed:
Set include_mapping: false to omit it (e.g. if you're going to forward the redacted text somewhere and don't want the mapping in your context).
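
If you do keep the mapping as a key file, reversing a redaction is a plain substitution. The following is a sketch under the assumption that the mapping is a flat object from pseudonym to original value, as in the sample response in this README; `restoreFromMapping` is a hypothetical helper, not something the plugin exports.

```typescript
// Hypothetical helper; assumes mapping is { "[TYPE_N]": "original value" }.
function restoreFromMapping(
  anonymized: string,
  mapping: Record<string, string>,
): string {
  let text = anonymized;
  for (const [pseudonym, original] of Object.entries(mapping)) {
    // split/join replaces every occurrence without regex escaping.
    text = text.split(pseudonym).join(original);
  }
  return text;
}
```

Note that restoration yields the value recorded in the mapping for each pseudonym, so a short form that was folded into a longer one (for example "Jean" under "Jean Dupont") comes back as the longer form.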
The intended workflow:
1. read_file({path: "contract.md"}): get the original.
2. anonymize_text({text, names: [...]}): model identifies names, plugin redacts.
3. write_file({path: "contract.redacted.md", content: anonymized}): save the clean version.

| Field | Type | Default | Notes |
|---|---|---|---|
| Always-redact terms | string array | [] | Strings always replaced with [CUSTOM_N]. Matched literally. |
| Detect international phone numbers | boolean | false | When on, also redact +CC… international numbers (any country, not just France). Default is off — French-only detection is more precise. |
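
"Matched literally" implies that regex metacharacters in a term carry no special meaning. The sketch below shows one way to do that, assuming terms are escaped and tried longest-first; `redactLiteralTerms` and `escapeRegExp` are hypothetical names, and the plugin's actual matching rules (case sensitivity, word boundaries) are not specified here.

```typescript
// Hypothetical sketch of literal term redaction; not the plugin's code.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function redactLiteralTerms(text: string, terms: string[]): string {
  // Drop empties and duplicates; try longer terms first so that
  // "Acme (EU)" wins over its prefix "Acme".
  const unique = Array.from(new Set(terms.filter((t) => t.length > 0)))
    .sort((a, b) => b.length - a.length);
  if (unique.length === 0) return text;
  const pattern = new RegExp(unique.map(escapeRegExp).join("|"), "g");
  const assigned = new Map<string, string>();
  return text.replace(pattern, (match) => {
    let pseudonym = assigned.get(match);
    if (pseudonym === undefined) {
      pseudonym = "[CUSTOM_" + (assigned.size + 1) + "]";
      assigned.set(match, pseudonym);
    }
    return pseudonym;
  });
}
```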
The detectors are checksum-validated, not regex-only. This is a deliberate trade-off: high precision at the cost of some recall on synthetic or poorly-formatted data.
Measured on the FR subset of AI4Privacy pii-masking-400k
(~40 000 French-locale samples, local eval only — see eval/run-ai4privacy.ts):
| Type | Precision | Recall | F1 | Notes |
|---|---|---|---|---|
| EMAIL | 99.75 % | 99.23 % | 99.49 % | Solid across realistic prose. |
| TEL (FR-only, default) | 70.36 % | 9.50 % | 16.73 % | Most "TELEPHONENUM" gold values in the dataset are not French-formatted (+56…, 010…). The default config catches French phones only. |
| TEL (international flag on) | 93.23 % | 50.77 % | 65.74 % | With Detect international phone numbers enabled, a +CC… regex runs alongside the FR detector. Higher recall, with precision boosted because the broader match also satisfies stricter boundaries. |
| CB | 12.69 % | 7.33 % | 9.29 % | Restricted to BIN prefixes [3-6] (Visa/MC/Amex/Diners/Discover/JCB/UnionPay) and Luhn-validated. The dataset generates random 16-digit strings that are mostly not Luhn-valid; we reject them. Residual false positives are identifiers that happen to pass both gates by chance — irreducible without contextual signals. |
| NIR | 58.33 % | 0.48 % | 0.95 % | Same story as CB: synthetic NIRs in the dataset do not satisfy the mod-97 checksum, so we reject them. Real NIRs (with valid keys, Corsica included) are caught — see the unit tests for fixtures derived from official validators. |
Take-away: this plugin is the right tool when you want to avoid false redactions of look-alike data (reference numbers, internal IDs). It is not the right tool when you need to scrub arbitrary digit sequences: there's no validation signal there, by design.
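
The NIR row is the clearest case of this trade-off. A NIR carries a two-digit key equal to 97 minus the 13-character body taken mod 97, and for Corsican departments the usual convention is to read 2A as 19 and 2B as 18 before computing. The sketch below follows that rule; `passesNirKey` is a hypothetical name and the shape check is intentionally loose, so treat it as an illustration rather than the plugin's detector.

```typescript
// Hypothetical sketch of the NIR key check; not the plugin's code.
function passesNirKey(candidate: string): boolean {
  const nir = candidate.replace(/\s/g, "").toUpperCase();
  // 5 digits (sex, year, month), department (2 digits or 2A/2B),
  // 6 digits (commune, order), then the 2-digit key.
  if (!/^\d{5}(?:\d{2}|2[AB])\d{8}$/.test(nir)) return false;
  // Corsica: substitute 2A -> 19 and 2B -> 18 before the arithmetic.
  const body = nir.slice(0, 13).replace("2A", "19").replace("2B", "18");
  let remainder = 0;
  for (let i = 0; i < body.length; i++) {
    remainder = (remainder * 10 + (body.charCodeAt(i) - 48)) % 97;
  }
  const expectedKey = 97 - remainder;
  return parseInt(nir.slice(13), 10) === expectedKey;
}
```

A randomly generated 15-digit string passes this check only about one time in 97, which is consistent with synthetic NIRs being rejected in the evaluation above.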
Run the suite locally:
One-click (macOS, Windows): Run in LM Studio
Linux (AppImage):
After install: in any chat, click the tools button and enable anonymize.
See docs/ for design notes, per-detector algorithms, evaluation
methodology, dataset licensing, and the roadmap.
MIT
```
"Jean a appelé Jean Dupont au 06 12 34 56 78. Sa CB est 4111 1111 1111 1111."
        ↓
"[NOM_1] a appelé [NOM_1] au [TEL_1]. Sa CB est [CB_1]."
```

```json
{
  "anonymized": "...",
  "counts": { "NOM": 1, "TEL": 1, "CB": 1 },
  "mapping": {
    "[NOM_1]": "Jean Dupont",
    "[TEL_1]": "06 12 34 56 78",
    "[CB_1]": "4111 1111 1111 1111"
  }
}
```

```sh
npm test                  # 33 unit tests, ~90 ms
npm run eval:ai4privacy   # streams the FR subset, prints metrics (requires
                          # the dataset under datasets/ai4privacy-400k/)
```

```sh
lms clone zexigh/anonymize
cd anonymize
lms dev -i
```