Forked from mindstudio/big-rag
A powerful RAG (Retrieval-Augmented Generation) plugin for LM Studio that can index and search through gigabytes or even terabytes (not tested) of document data. Based on ari99/lm_studio_big_rag_plugin with multilingual OCR support (English + Russian by default).
Install from LM Studio: aleksandr-333/big-rag-rus
gpustack/text-embedding-bge-m3)

The plugin provides the following configuration options in LM Studio:
ru): Language for RAG instructions sent to the model. Controls the language of inline prompts (citation headers, search instructions, "no results" messages, etc.). Available values:

- ru (Russian): all RAG instructions are sent in Russian. The model will tend to answer in Russian.
- en (English): all RAG instructions are sent in English.

Why this matters: The plugin injects instructions and citations directly into the prompt sent to the model. If these instructions are in English, the model may respond in English even if your system prompt says to use Russian. Setting this to ru ensures all injected text is Russian, reinforcing the model's language behavior.
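As a rough illustration of why the instruction language matters, the sketch below selects the injected strings by language before building the prompt context. All names here (`InstructionLanguage`, `PROMPTS`, `buildInjectedContext`) and the example strings are hypothetical and are not taken from the plugin's source.

```typescript
// Hypothetical sketch: these names and strings are illustrative assumptions,
// not the plugin's actual implementation.
type InstructionLanguage = "ru" | "en";

interface InlinePrompts {
  citationHeader: string;
  noResults: string;
}

// One set of injected strings per supported instruction language.
const PROMPTS: Record<InstructionLanguage, InlinePrompts> = {
  ru: {
    citationHeader: "Цитаты из документов:",
    noResults: "Подходящих фрагментов не найдено.",
  },
  en: {
    citationHeader: "Citations from documents:",
    noResults: "No relevant passages were found.",
  },
};

// Builds the text that would be injected into the prompt, entirely in one language.
function buildInjectedContext(
  language: InstructionLanguage,
  citations: string[],
): string {
  const prompts = PROMPTS[language];
  if (citations.length === 0) {
    return prompts.noResults;
  }
  const numbered = citations.map((text, i) => `[${i + 1}] ${text}`);
  return [prompts.citationHeader, ...numbered].join("\n");
}
```

Because every injected string comes from the same language table, the model never sees mixed-language instructions around the citations.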
By default, all supported file types are indexed. You can selectively disable specific categories:
| Setting | File Types | Default |
|---|---|---|
| Index HTML/XHTML | .htm, .html, .xhtml | ✅ Enabled |
| Index PDF | .pdf | ✅ Enabled |
| Index EPUB | .epub | ✅ Enabled |
| Index Text/Markdown | .txt, .text, .md, .mdx, .markdown, .mdown, .mkd, .mkdn | ✅ Enabled |
| Index DOCX | .docx | ✅ Enabled |
Tip: To speed up indexing on a folder with mixed content, disable file types you don't need (e.g., disable images if you only care about text documents). After changing filters, trigger a manual reindex.
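The filtering described above can be pictured as a lookup from enabled categories to file extensions. The sketch below is only an assumption about how such a filter might look: the extension lists are copied from the table, while the `Filters` shape and the function names are hypothetical. Only five categories are shown for brevity.

```typescript
// Extension lists come from the settings table; all names are hypothetical.
const CATEGORY_EXTENSIONS = {
  html: [".htm", ".html", ".xhtml"],
  pdf: [".pdf"],
  epub: [".epub"],
  text: [".txt", ".text", ".md", ".mdx", ".markdown", ".mdown", ".mkd", ".mkdn"],
  docx: [".docx"],
} as const;

type Category = keyof typeof CATEGORY_EXTENSIONS;
type Filters = Record<Category, boolean>;

// Collects the extensions of every category that is switched on.
function enabledExtensions(filters: Filters): Set<string> {
  const result = new Set<string>();
  for (const category of Object.keys(CATEGORY_EXTENSIONS) as Category[]) {
    if (filters[category]) {
      for (const ext of CATEGORY_EXTENSIONS[category]) {
        result.add(ext);
      }
    }
  }
  return result;
}

// Decides whether a file should be indexed, based on its (case-insensitive) extension.
function shouldIndex(fileName: string, filters: Filters): boolean {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) {
    return false; // no extension, so no category can match
  }
  return enabledExtensions(filters).has(fileName.slice(dot).toLowerCase());
}
```

For example, with `pdf: false` a file named `scan.PDF` would be skipped while `notes.md` would still be indexed.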
gpustack/text-embedding-bge-m3): Model ID for text embeddings. Must be loaded in LM Studio. Examples: nomic-ai/nomic-embed-text-v1.5-GGUF, gpustack/text-embedding-bge-m3

The plugin recognises filename search intent in both Russian and English. Examples of supported query patterns:
| Query Language | Example Query | Behaviour |
|---|---|---|
| 🇷🇺 Russian | «найди все файлы с именем протокол» | Lists all indexed files whose name contains «протокол» |
| 🇷🇺 Russian | «найди файлы письмо в которых встречается слово договор» | Finds files named «письмо» and searches their content for «договор» |
| 🇷🇺 Russian | «в названии которых есть отчёт» | Lists files whose name contains «отчёт» |
| 🇬🇧 English | «find all files named protocol» | Lists all indexed files whose name contains «protocol» |
| 🇬🇧 English | «show files called report containing budget» | Finds files named «report» and searches their content for «budget» |
| 🇬🇧 English | «list files with name invoice» | Lists files whose name contains «invoice» |
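To make the English patterns in the table concrete, here is a minimal sketch that detects them with a single regular expression. It is an assumption about one possible approach, covers only the three English phrasings shown above, and does not reflect how the plugin actually parses queries. `FilenameIntent` and `parseFilenameIntent` are hypothetical names.

```typescript
// Hypothetical sketch covering only the English examples from the table.
interface FilenameIntent {
  namePart: string;                 // text the file name should contain
  contentTerm: string | undefined;  // optional term to search inside matching files
}

// Matches "files named X", "files called X", "files with name X",
// optionally followed by "containing Y".
const FILENAME_INTENT =
  /\bfiles?\s+(?:named|called|with\s+name)\s+(\S+)(?:\s+containing\s+(.+))?/i;

function parseFilenameIntent(query: string): FilenameIntent | null {
  const match = FILENAME_INTENT.exec(query);
  if (match === null) {
    return null; // no filename intent detected
  }
  const namePart = match[1];
  if (namePart === undefined) {
    return null;
  }
  const contentTerm = match[2]?.trim();
  return { namePart, contentTerm };
}
```

A query with only a name part would lead to a file listing, while a query that also has a `contentTerm` would narrow the content search to the matching files.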
Four search scenarios:
Content display is triggered by keywords like: «выведи», «прочитай», «полностью», «целиком», «содержание», «содержимое», «весь текст», «что внутри», «display», «read file», «show content», «full text», «entire content», «what's inside», etc.
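A keyword trigger like this can be sketched as a simple case-insensitive substring check. The keyword list below is copied from the sentence above (the original list ends with "etc.", so it is not complete); the function name and the matching strategy are assumptions.

```typescript
// Keywords copied from the README; the real list may be longer.
const CONTENT_DISPLAY_KEYWORDS: readonly string[] = [
  "выведи", "прочитай", "полностью", "целиком",
  "содержание", "содержимое", "весь текст", "что внутри",
  "display", "read file", "show content", "full text",
  "entire content", "what's inside",
];

// Hypothetical helper: returns true when the query asks for a file's full content.
function wantsFullContent(query: string): boolean {
  const normalized = query.toLowerCase();
  return CONTENT_DISPLAY_KEYWORDS.some((keyword) => normalized.includes(keyword));
}
```

Plain substring matching is the simplest option but also the least precise: "display" would also match "displayed". A stricter variant could match on word boundaries instead.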
When a PDF exceeds the configured page or image limits, the plugin logs a warning (e.g., ⚠️ PDF "book.pdf" has 500 pages, but maxPages=200) and returns the partially extracted text.
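The limit behaviour can be illustrated with a small helper that caps the number of pages and produces the warning text. This is a sketch under assumptions: `applyPageLimit` and `PageLimitResult` are hypothetical names, and only the message format is taken from the example above.

```typescript
// Hypothetical sketch of the page-limit check; not the plugin's actual code.
interface PageLimitResult {
  pagesToProcess: number;        // how many pages will actually be read
  warning: string | undefined;   // set only when the document is truncated
}

function applyPageLimit(
  fileName: string,
  totalPages: number,
  maxPages: number,
): PageLimitResult {
  if (totalPages <= maxPages) {
    return { pagesToProcess: totalPages, warning: undefined };
  }
  // Over the limit: process the first maxPages pages and report the truncation.
  return {
    pagesToProcess: maxPages,
    warning: `⚠️ PDF "${fileName}" has ${totalPages} pages, but maxPages=${maxPages}`,
  };
}
```

The caller would log `warning` when it is set and continue with the partial text, so an oversized PDF is still indexed rather than rejected.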
- maxConcurrentFiles if needed)
- maxConcurrentFiles on systems with limited resources
- maxConcurrentFiles
- maxConcurrentFiles to 1 or 2
- success / failed counts after each processed document.
- BIG_RAG_FAILURE_REPORT_PATH=/absolute/path/report.json when running npm run index (or via LM Studio env settings) to emit a JSON report containing all failure reasons and counts after indexing completes. This is useful when triaging stubborn PDFs such as blueprints or large scanned books.

For standalone indexing (requires LM Studio running for embeddings):
Environment variables:
eng+rus)

Automated smoke tests cover 13 test cases across all major file types:
| # | Test | Format |
|---|---|---|
| 1 | HTML text extraction | .html |
| 2 | XHTML text extraction | .xhtml |
| 3 | Markdown formatting | .md |
| 4 | MDX as Markdown | .mdx |
| 5 | Plain text | .txt |
| 6 | DOCX paragraphs | .docx |
| 7 | XLSX cell text | .xlsx |
| 8 | CSV cell text | .csv |
| 9 | PPTX slide text | .pptx |
| 10 | EPUB text extraction | .epub |
| 11 | OCR English (auto-downloads eng.traineddata) | .png |
| 12 | OCR Russian (auto-downloads rus.traineddata) | .png |
| 13 | OCR Mixed English+Russian | .png |
OCR tests auto-download Tesseract language data from CDN on first run (may take 1-2 minutes). Subsequent runs use cached data.
For end-to-end validation:
This plugin is based on the LM Studio plugin SDK. For more information:
ISC

| Setting | File Types | Default |
|---|---|---|
| Index Spreadsheets | .xlsx, .xls, .csv | ✅ Enabled |
| Index Presentations | .pptx | ✅ Enabled |
| Index Images (OCR) | .bmp, .jpg, .jpeg, .png | ✅ Enabled (requires OCR enabled) |

eng+rus): Tesseract language code for OCR. Supports any Tesseract language combination: eng (English), rus (Russian), eng+rus (both), deu (German), fra (French), etc.

.traineddata files. Leave empty to auto-detect: the plugin checks its own root folder for .traineddata files matching all requested languages. If any language is missing, Tesseract auto-downloads from CDN on first use. For offline use, place all required .traineddata files (e.g. eng.traineddata, rus.traineddata) in the plugin root or set a custom path. For best quality, download best-traineddata files.

Configure the Plugin:
- /Users/user/Documents/MyLibrary)
- /Users/user/.lmstudio/big-rag-db)

Initial Indexing:
Query Your Documents:
File Scanner (src/ingestion/fileScanner.ts):
Document Parsers (src/parsers/):
- htmlParser.ts: Extracts text from HTML/HTM files
- pdfParser.ts: Extracts text from PDF files
- epubParser.ts: Extracts text from EPUB files
- textParser.ts: Reads plain text & Markdown files with optional Markdown stripping
- imageParser.ts: OCR for image files
- docxParser.ts: Extracts text from DOCX (Word) files via mammoth
- xlsxParser.ts: Extracts text from XLSX/XLS (Excel) files via SheetJS
- pptxParser.ts: Extracts text from PPTX (PowerPoint) files via JSZip
- documentParser.ts: Routes to appropriate parser

Vector Store (src/vectorstore/vectorStore.ts):
Index Manager (src/ingestion/indexManager.ts):
Prompt Preprocessor (src/promptPreprocessor.ts):
- retrievalAffinityThreshold
- .traineddata files)
- rus for Russian, eng+rus for mixed)
- ⚠️ warnings and increase OCR Max Pages or OCR Max Images Per Page

- BIG_RAG_EMBEDDING_MODEL: embedding model ID (default: gpustack/text-embedding-bge-m3)
- BIG_RAG_OCR_LANGUAGE: OCR language (default: eng+rus)
- BIG_RAG_OCR_DATA_PATH: path to .traineddata folder
- BIG_RAG_OCR_PSM: Tesseract PSM (default: 3)
- BIG_RAG_OCR_MAX_PAGES: max OCR pages (default: 200)
- BIG_RAG_OCR_MAX_IMAGES_PER_PAGE: max images per page (default: 10)
- BIG_RAG_OCR_MIN_IMAGE_AREA: min image area (default: 2500)
- BIG_RAG_OCR_MAX_IMAGE_PIXELS: max image pixels (default: 100000000)
- BIG_RAG_OCR_IMAGE_TIMEOUT_MS: image timeout ms (default: 60000)
- BIG_RAG_FORCE_REINDEX: set to true to force full reindex
- BIG_RAG_FAILURE_REPORT_PATH: path to write failure report JSON

cd big-rag-plugin
npm install
npm run build
npm run dev
node dist/cliIndex.js /path/to/docs /path/to/db
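The standalone indexer is configured through the BIG_RAG_* environment variables documented in this README. The sketch below shows one way such variables could be read with their documented defaults. The variable names and default values are taken from the README; the `CliConfig` shape and the helper functions are hypothetical, and the actual CLI may parse them differently.

```typescript
// Hypothetical sketch: variable names and defaults come from the README,
// everything else is an illustrative assumption.
type Env = Record<string, string | undefined>;

interface CliConfig {
  embeddingModel: string;
  ocrLanguage: string;
  ocrDataPath: string | undefined;
  ocrPsm: number;
  ocrMaxPages: number;
  ocrMaxImagesPerPage: number;
  ocrMinImageArea: number;
  ocrMaxImagePixels: number;
  ocrImageTimeoutMs: number;
  forceReindex: boolean;
  failureReportPath: string | undefined;
}

// Parses an integer variable, falling back when it is unset, empty, or not a number.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed = Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

function loadConfig(env: Env): CliConfig {
  return {
    embeddingModel: env["BIG_RAG_EMBEDDING_MODEL"] ?? "gpustack/text-embedding-bge-m3",
    ocrLanguage: env["BIG_RAG_OCR_LANGUAGE"] ?? "eng+rus",
    ocrDataPath: env["BIG_RAG_OCR_DATA_PATH"],
    ocrPsm: intFromEnv(env, "BIG_RAG_OCR_PSM", 3),
    ocrMaxPages: intFromEnv(env, "BIG_RAG_OCR_MAX_PAGES", 200),
    ocrMaxImagesPerPage: intFromEnv(env, "BIG_RAG_OCR_MAX_IMAGES_PER_PAGE", 10),
    ocrMinImageArea: intFromEnv(env, "BIG_RAG_OCR_MIN_IMAGE_AREA", 2500),
    ocrMaxImagePixels: intFromEnv(env, "BIG_RAG_OCR_MAX_IMAGE_PIXELS", 100000000),
    ocrImageTimeoutMs: intFromEnv(env, "BIG_RAG_OCR_IMAGE_TIMEOUT_MS", 60000),
    forceReindex: env["BIG_RAG_FORCE_REINDEX"] === "true",
    failureReportPath: env["BIG_RAG_FAILURE_REPORT_PATH"],
  };
}
```

In a Node.js script this would be called as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test with plain objects.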
big-rag-plugin/
├── src/
│   ├── config.ts                 # Plugin configuration schema
│   ├── index.ts                  # Main entry point
│   ├── promptPreprocessor.ts     # RAG integration
│   ├── ingestion/
│   │   ├── fileScanner.ts        # Directory scanning
│   │   └── indexManager.ts       # Indexing orchestration
│   ├── parsers/
│   │   ├── documentParser.ts     # Parser router
│   │   ├── htmlParser.ts         # HTML parsing
│   │   ├── pdfParser.ts          # PDF parsing
│   │   ├── epubParser.ts         # EPUB parsing
│   │   ├── textParser.ts         # Text parsing
│   │   ├── imageParser.ts        # OCR parsing
│   │   ├── docxParser.ts         # DOCX (Word) parsing
│   │   ├── xlsxParser.ts         # XLSX/XLS (Excel) parsing
│   │   └── pptxParser.ts         # PPTX (PowerPoint) parsing
│   ├── vectorstore/
│   │   └── vectorStore.ts        # Vectra sharded index integration
│   └── utils/
│       ├── fileHash.ts           # File hashing
│       ├── ocrLangPath.ts        # OCR language path resolution
│       └── textChunker.ts        # Text chunking
├── manifest.json                 # Plugin manifest
├── package.json                  # Dependencies
├── tsconfig.json                 # TypeScript config
└── README.md                     # This file
npm run test