skills / manuscript_audit / SKILL.md
uv run litmap install at ~/src/Cowork/litmap."
⚠️ Runtime: Claude Code only. This skill calls `uv run litmap …` against `~/LitLake/embeddings.db` on the local machine. It will not work in the Cowork web sandbox. If you reached this skill from the Cowork web frontend, stop and switch to Claude Code.
A four-stage workflow to verify citations against sources, identify evidence gaps, detect logical flaws, and polish a manuscript draft before submission.
4b. Locate PDFs in Zotero: Query the Zotero SQLite database directly rather than grepping filenames. Filename-based search is unreliable for two reasons: (a) auto-generated filenames may not include the author name at all (e.g., an organisation such as "Nature Positive Initiative" may be saved under the document title), and (b) overly broad filename patterns return hundreds of false positives, making truncation errors likely. Instead, run a Python query against zotero.sqlite:
Substitute the author's last name and year from the reference list. For multi-word organisation names (e.g. "Nature Positive Initiative"), search on a distinctive fragment of the name, such as lastName LIKE '%Nature Positive%', or search by title keywords instead. The path field returned is of the form storage:filename.pdf; prepend /mnt/Zotero/storage/<key>/ to the filename to get the full path, where <key> is the key of the attachment item itself (Zotero names each storage folder after the attachment's key, which differs from the parent item's key). If multiple results are returned for the same author+year, use the title from the reference list entry to select the correct one.
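As a rough sketch of the path construction described above, assuming the `storage:` prefix and the `/mnt/Zotero/storage/` root given in the text (the helper name `resolve_attachment_path` is hypothetical, not part of any library):

```python
from pathlib import PurePosixPath

# Storage root as stated in the text; adjust if the mount point differs.
STORAGE_ROOT = PurePosixPath("/mnt/Zotero/storage")

def resolve_attachment_path(path_field, key):
    """Turn a 'storage:filename.pdf' value plus a storage key into a full path.

    Returns None when the row has no stored file (for example a linked
    attachment, or an item with no attachment at all).
    """
    prefix = "storage:"
    if not path_field or not path_field.startswith(prefix):
        return None
    filename = path_field[len(prefix):]
    return str(STORAGE_ROOT / key / filename)
```

Returning `None` rather than raising keeps a loop over many query rows simple: rows without a usable path can be skipped and reported as missing PDFs.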
Important: Zotero PDF filenames are generated automatically and may contain errors — misspelled author names, wrong years, truncated titles. Never use a filename mismatch as evidence of a citation error in the manuscript. Always resolve ambiguity by matching against the full reference list entry (title, journal, DOI), not the filename alone.
4c. Check for reading notes: Once a Zotero item is matched, query any attached reading notes as a secondary source before opening the PDF:
Use PDF summary sections (content before <hr/>) to quickly locate passages relevant to the claim — they save time when skimming a long PDF. Treat as helpful but non-authoritative; always confirm any finding against the PDF itself.
Use DY sections (content after <hr/>) as context only for understanding how the paper was intended to be used. Never cite DY content in a faithfulness verdict.
In the Stage 1 output, add a Notes subsection where reading notes exist:
5b. Check for unquoted verbatim phrases: While the PDF text is in hand, also scan the manuscript sentence(s) surrounding this citation for word-for-word borrowing from the source that is not enclosed in quotation marks. A run of 5 or more consecutive words appearing identically in both texts is a strong signal; 4 words is worth flagging if the phrasing is distinctive (e.g. a notable characterisation like "notoriously difficult").
Flag any match with the verdict ⚠ Unquoted verbatim phrase and report:
Report format:
Missing PDFs: Before flagging a paper as absent, always: (a) search Zotero using variant spellings, partial author names, and keywords from the title; (b) check the manuscript's full reference list (which may be in a separate document if the PDF says "reference list located elsewhere") for the complete citation details, then re-search. Only flag as "PDF not found in library" after both steps have been attempted. When flagging, include the full reference entry from the reference list so the user can verify.
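For step (a), one possible way to generate variant search patterns is sketched below. It only covers accent-stripped and truncated forms of a surname; the helper name `like_patterns` and the default prefix length are assumptions made for illustration.

```python
import unicodedata

def like_patterns(surname, prefix_len=4):
    """Build SQL LIKE patterns for a surname: exact, accent-stripped, and prefix forms."""
    # Decompose accented characters and drop the combining marks (ü -> u).
    folded = "".join(
        ch for ch in unicodedata.normalize("NFKD", surname)
        if not unicodedata.combining(ch)
    )
    variants = [surname, folded, surname[:prefix_len], folded[:prefix_len]]
    patterns = []
    for variant in variants:
        pattern = f"%{variant}%"
        # Keep the first occurrence of each pattern, preserving order.
        if variant and pattern not in patterns:
            patterns.append(pattern)
    return patterns
```

Each pattern can be passed as the `lastName LIKE ?` parameter of the Zotero query in turn, stopping at the first one that returns a plausible match.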
For each citation, report:
Alternatively for overstatement:
Calls `litmap search` against the user's local embeddings database. Requires Claude Code runtime; see banner above.
`--collection` scope if the user named one
`~/.omnimind/lancedb`.
Run all four stages in sequence. Deliver:
`/mnt/Zotero/` (Cowork environment) or via the zotero skill.
Extract reference list: Locate the References or Bibliography section in the manuscript. Build a lookup table: author_year → [full citation text, DOI/URL if present].
Extract in-text citations: Use regex to find all author-year patterns:
- `Smith 2020`
- `(Jones et al. 2019)`
- `Smith and Brown 2018`
- `(Smith 2020; Brown 2021)`

For each citation, record: the matched text, the sentence/paragraph context, the section heading, and the claim being supported.
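A minimal sketch of such a regex, covering only the four author-year shapes listed above. The pattern and the `find_citations` helper are illustrative assumptions, not the skill's actual implementation. Surnames with non-ASCII letters are not handled, and phrases such as "In 2020" will match too, so every hit still needs checking against the reference list.

```python
import re

CITATION_RE = re.compile(
    r"\b([A-Z][A-Za-z'\-]+)"                                   # first author surname
    r"(?:\s+et\s+al\.?|\s+(?:and|&)\s+([A-Z][A-Za-z'\-]+))?"   # "et al." or a second author
    r",?\s+((?:19|20)\d{2}[a-z]?)\b"                           # year, e.g. 2020 or 2020a
)

def find_citations(text):
    """Return one record per author-year match, in order of appearance."""
    hits = []
    for m in CITATION_RE.finditer(text):
        hits.append({
            "matched": m.group(0),
            "first_author": m.group(1),
            "second_author": m.group(2),  # None unless "A and B YEAR"
            "year": m.group(3),
            "start": m.start(),           # offset, useful for recovering the sentence
        })
    return hits
```

Grouped citations such as `(Smith 2020; Brown 2021)` come back as two separate records, which suits the per-citation checks that follow.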
Disambiguate via reference list: For each in-text citation, match it to the reference list entry to confirm author names, year, and full publication details. Flag any mismatches (e.g., cited as "Smith 2020" but reference list shows "Smith, J. 2019").
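The mismatch check can be sketched as follows. It assumes the reference list has already been reduced to a mapping from lowercase surname to the years listed for that surname; the `reference_years` structure and the `check_citation` name are hypothetical.

```python
def check_citation(first_author, year, reference_years):
    """Compare one in-text citation against the reference list.

    reference_years maps a lowercase surname to the list of publication
    years (as strings) found for that surname in the reference list.
    """
    years = reference_years.get(first_author.lower(), [])
    if year in years:
        return "ok"
    if years:
        # Surname is present but under a different year: likely a typo.
        return "year mismatch: reference list has " + ", ".join(sorted(years))
    return "not in reference list"
```

For the example in the text, a manuscript citing "Smith 2020" against a reference list that only has Smith 2019 yields a year-mismatch message rather than a silent failure.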
Check for retracted papers (Zotero 9): Before opening any PDFs, query the retractedItems table for every paper matched in the database:
If any cited paper's itemID appears in retracted, flag it at the very top of the Stage 1 report with a ⛔ RETRACTED verdict and the retraction data (journal notice, date if available). The author must address this before submission — retracted papers should not be cited without explicit acknowledgement of retraction status.
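A small sketch of the flagging step, assuming `retracted` is a mapping from itemID to retraction data and that each cited paper has already been matched to a Zotero itemID. The `cited_items` structure and the `retraction_flags` name are hypothetical, and the retraction data is passed through as an opaque value.

```python
def retraction_flags(cited_items, retracted):
    """Return report lines for every cited paper found in the retracted mapping.

    cited_items: dict mapping a citation label (e.g. "Jones 2019") to its itemID.
    retracted:   dict mapping itemID to the retraction data stored by Zotero.
    """
    flags = []
    for label, item_id in cited_items.items():
        if item_id in retracted:
            flags.append(f"⛔ RETRACTED: {label} (retraction data: {retracted[item_id]})")
    return flags
```

The returned lines are meant to be placed at the very top of the Stage 1 report, ahead of the per-citation verdicts.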
Web fallback for missing PDFs: If a PDF is not found in Zotero after step 6, attempt to retrieve the source from the web using the DOI or URL recorded in the reference list entry. Use the WebFetch tool with the DOI URL (e.g., https://doi.org/10.xxxx/xxxxx) or the direct URL if one is given. If the fetch succeeds, treat the retrieved content as the source for claim verification and proceed with the usual faithfulness check. If the fetch is blocked by the network proxy (EGRESS_BLOCKED error), record this explicitly and note: "PDF not in Zotero; web access blocked — claim could not be independently verified. Reference entry: [full citation]." In either case (success or blocked), include the full reference list entry in the report. Never silently skip verification for a missing PDF — always report what was attempted and what was found.
Identify unsupported claims. Scan the manuscript for empirical or theoretical assertions that lack an in-text citation. Build a list of records:
where sentence_context is the claim's sentence plus the one preceding sentence (for query disambiguation).
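One way to assemble these records is sketched below. Deciding whether a sentence is an unsupported claim is a judgement call, so the sketch takes that decision as a caller-supplied predicate (`is_unsupported`, hypothetical). The sentence splitter is deliberately naive and will break on abbreviations such as "et al.".

```python
import re

def build_claim_records(section_heading, section_text, is_unsupported):
    """Build {claim_text, sentence_context, section_heading} records for one section."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", section_text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        if not is_unsupported(sentence):
            continue
        # Context = the preceding sentence (if any) plus the claim itself.
        preceding = (sentences[i - 1] + " ") if i > 0 else ""
        records.append({
            "claim_text": sentence,
            "sentence_context": preceding + sentence,
            "section_heading": section_heading,
        })
    return records
```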
(Optional, ≥30 unique citations only) Up-front cluster overview.
Read /tmp/audit_clusters.md and present the thematic outline before per-claim analysis. The user can use this to spot whole topic areas that are over- or under-cited.
Per-claim semantic search. For each unsupported-claim record, use LanceDB to perform a semantic search. You will need to query LM Studio first to get the embedding vector for the claim:
Filter candidates.
`lastname year` against the Stage 1 reference list).
`similarity < 0.75`: below that threshold the match is usually too weak to be useful.
Present the report. For each unsupported claim:
If no candidates clear the threshold, write plainly:
"No semantically similar papers in your library. Consider broader literature search outside the local Zotero collection."
Extract key claims: Scan the manuscript section-by-section and list major claims:
Check for contradictions:
Check for reasoning gaps:
Check for scope creep:
Flag undefined or under-defined terms:
Grammar & mechanics:
Clarity & concision:
Flow & transitions:
Style & consistency:
Common scientific writing issues:
Sentence-level revisions: Provide before/after examples.
Annotated manuscript with inline comments (via track changes or comment blocks)
Summary report listing:
Prioritized revision checklist (must-fix vs. nice-to-fix) so the user can tackle the most important issues first.
import sqlite3

conn = sqlite3.connect('/mnt/Zotero/zotero.sqlite')

# Match on the first creator's last name and on the year; return title and attachment path.
# att.key is the attachment's own key, which names its folder under storage/.
query = '''
    SELECT i.key, c.lastName, c.firstName, idv_year.value AS year,
           idv_title.value AS title, ia.path, att.key AS attachment_key
    FROM items i
    JOIN itemCreators ic ON ic.itemID = i.itemID AND ic.orderIndex = 0
    JOIN creators c ON c.creatorID = ic.creatorID
    LEFT JOIN itemData id_year ON id_year.itemID = i.itemID
        AND id_year.fieldID = (SELECT fieldID FROM fields WHERE fieldName = 'date')
    LEFT JOIN itemDataValues idv_year ON idv_year.valueID = id_year.valueID
    LEFT JOIN itemData id_title ON id_title.itemID = i.itemID
        AND id_title.fieldID = (SELECT fieldID FROM fields WHERE fieldName = 'title')
    LEFT JOIN itemDataValues idv_title ON idv_title.valueID = id_title.valueID
    LEFT JOIN itemAttachments ia ON ia.parentItemID = i.itemID
    LEFT JOIN items att ON att.itemID = ia.itemID
    WHERE c.lastName LIKE ? AND idv_year.value LIKE ?
'''
# Example: first author "He", published in 2015
results = conn.execute(query, ('%He%', '2015%')).fetchall()
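from bs4 import BeautifulSoup

def _text(node):
    # Siblings may be tags or bare strings; handle both
    return node.get_text() if hasattr(node, 'get_text') else str(node)

# conn is the open zotero.sqlite connection; matched_item_id is the itemID of the matched paper
notes = conn.execute(
    "SELECT note FROM itemNotes WHERE parentItemID = ?",
    (matched_item_id,)
).fetchall()

for (note_html,) in notes:
    soup = BeautifulSoup(note_html, 'html.parser')
    hr = soup.find('hr')
    if hr:
        # previous_siblings yields the nearest node first, so reverse to restore reading order
        summary_text = ' '.join(_text(t) for t in reversed(list(hr.previous_siblings)))
        dy_text = ' '.join(_text(t) for t in hr.next_siblings)
    else:
        # No <hr/>: treat the whole note as PDF summary, with no DY section
        summary_text = soup.get_text()
        dy_text = ''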
Notes (Zotero reading notes — secondary source, verify against PDF):
[Relevant excerpt from PDF summary section]
DY context (personal use-case notes — not a faithfulness source):
[Relevant excerpt from DY section, if any]
import re

def ngrams(text, n):
    """Return all n-word tuples from text, lowercased and stripped of punctuation."""
    words = re.findall(r'\b\w+\b', text.lower())
    return [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]

def flag_unquoted(phrase, context):
    # Placeholder reporter: replace with whatever collects Stage 1 findings
    print(f'⚠ Unquoted verbatim phrase: "{phrase}" in: {context}')

manuscript_context = ""  # the 1-3 sentences around the citation in the manuscript
pdf_text = ""            # full PDF text

for n in (6, 5, 4):
    ms_grams = set(ngrams(manuscript_context, n))
    pdf_grams = set(ngrams(pdf_text, n))
    matches = ms_grams & pdf_grams
    # Reconstruct each matched phrase and check it isn't already in quotes
    for gram in matches:
        phrase = ' '.join(gram)
        # Look for a quote mark preceding the phrase in the manuscript context
        pattern = r'["\u201c\u201d\u2018\u2019][^"]*' + re.escape(phrase)
        if not re.search(pattern, manuscript_context.lower()):
            flag_unquoted(phrase, manuscript_context)
⚠ Unquoted verbatim phrase — Smith et al. 2020
Manuscript: "...measuring biodiversity is notoriously difficult and expensive..."
Source: "...measuring biodiversity is notoriously difficult in all fields..."
Matched phrase: "measuring biodiversity is notoriously difficult"
Fix: Either quote directly, e.g. 'measuring biodiversity is "notoriously difficult"
(Smith et al. 2020, p. X)', or paraphrase to make the wording clearly your own.
[Citation ID] Smith et al. 2020
Claim: "Biodiversity loss is accelerating globally (Smith et al. 2020)."
Verdict: ✓ Faithful
Source passage: "Our analysis shows accelerating declines in species richness
across terrestrial and marine ecosystems over the past two decades."
Confidence: High (source passage directly supports the claim)
---
[Citation ID] Jones 2019
Claim: "All temperate forests show declining productivity (Jones 2019)."
Verdict: ⚠ Overstated
Source passage: "Productivity declines were observed in 68% of sampled
temperate forests in North America."
Issue: The manuscript claims universality; the source reports 68% prevalence.
Suggest: "Most temperate forests show declining productivity (Jones 2019)."
---
**Contradiction detected** (Abstract vs. Introduction)
Abstract: "Remote sensing cannot reliably measure forest biomass."
Introduction: "High-resolution satellite data enable accurate biomass estimation."
Recommendation: Clarify the distinction (e.g., "passive optical remote sensing
cannot reliably measure biomass; active LiDAR-based approaches are more
promising").
---
**Reasoning gap** (Methods → Results)
Methods: "We used species occurrence data from iNaturalist."
Results: "Species richness patterns aligned with predictions."
Issue: The connection between iNaturalist (which is spatially biased toward
populated areas) and richness predictions is not established. Does this
bias affect the conclusions?
Recommendation: Add a limitations paragraph addressing data bias.
---
**Scope creep** (Results → Discussion)
Results: "In our 10-site study, SDM accuracy was 0.82 AUC."
Discussion: "Deep learning SDMs are highly accurate for predicting global
species distributions."
Issue: Results from 10 sites do not support a global claim.
Recommendation: Qualify: "Our results suggest that deep learning SDMs may
achieve high accuracy; further validation across diverse regions is needed."
---
**Grammar issue** (Page 3, Results)
Original: "The results shows that deep learning models performs better
than traditional SDMs."
Corrected: "The results show that deep learning models perform better
than traditional SDMs."
Issue: Subject-verb agreement (plural "results" requires "show"; plural "models" requires "perform").
---
**Clarity issue** (Page 1, Abstract)
Original: "We used remote sensing and machine learning, which is a powerful
combination for predicting species distributions."
Revised: "We combined remote sensing with machine learning to predict
species distributions with high accuracy."
Issue: "which is a powerful combination" is vague and wordy. The revision
is more direct and specific.
---
**Flow issue** (Page 5, Discussion, para. 2)
Original: [Three sentences about climate change] [Abrupt shift] [One sentence
about policy implications]
Revised: Add a transition: "These findings have important implications for
conservation policy..." before the policy sentence.
---
**Style issue** (Inconsistent number formatting)
Original: "We sampled 10 sites across three regions in 4 years."
Check: Are numbers <10 and ≥10 formatted consistently? Should be either:
"We sampled 10 sites across three regions in four years" (spell out all <10)
OR "We sampled 10 sites across 3 regions in 4 years" (numerals for all).
Pick one and apply throughout.
---
# Map itemID -> retraction data for every item Zotero has flagged as retracted
retracted = {
    r[0]: r[1]
    for r in conn.execute("SELECT itemID, data FROM retractedItems").fetchall()
}
{claim_text, sentence_context, section_heading}
uv run --project ~/src/Cowork/litmap litmap cluster \
    --manuscript <manuscript_path> \
    --output /tmp/audit_clusters \
    --format md
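import lancedb
import requests

db = lancedb.connect("~/.omnimind/lancedb")
table = db.open_table("chunks")
# Fetch embedding from LM Studio http://localhost:1234/v1/embeddings
# (OpenAI-compatible API; EMBEDDING_MODEL is a placeholder for the model that built the index)
resp = requests.post("http://localhost:1234/v1/embeddings",
                     json={"model": EMBEDDING_MODEL, "input": sentence_context})
query_vector = resp.json()["data"][0]["embedding"]
# Then query LanceDB:
results = table.search(query_vector).where("source = 'zotero'").limit(5).to_list()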
### Section 3.2 — claim text excerpt
> "<the claim sentence>"
Suggested citations from your library (similarity, zotero_key):
1. **0.87** — Valavi et al. 2022, *Predictive performance of presence-only SDMs* (`AAAA0001`)
DOI: 10.1111/geb.13476
2. **0.81** — Norberg 2019, *A comprehensive evaluation of predictive performance...* (`AAAA0042`)
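The candidate filtering that precedes a report like the one above can be sketched as follows. It assumes each candidate is a dict with `lastname`, `year`, and `similarity` keys (hypothetical field names), and it reads the truncated first filter rule as "exclude papers already present in the Stage 1 reference list". The 0.75 threshold is the one stated in the text.

```python
SIMILARITY_THRESHOLD = 0.75  # threshold stated in the filter rules

def filter_candidates(candidates, already_cited, threshold=SIMILARITY_THRESHOLD):
    """Drop already-cited and weak matches; return the rest, best match first.

    candidates:    list of dicts with "lastname", "year", "similarity".
    already_cited: set of (lowercase surname, year string) pairs from Stage 1.
    """
    kept = []
    for cand in candidates:
        if (cand["lastname"].lower(), str(cand["year"])) in already_cited:
            continue  # assumed rule: the paper is already in the reference list
        if cand["similarity"] < threshold:
            continue  # too weak to be a useful suggestion
        kept.append(cand)
    return sorted(kept, key=lambda cand: cand["similarity"], reverse=True)
```

An empty result corresponds to the "no semantically similar papers" message rather than an empty suggestion list.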