Big RAG Rus — an extension of the original Big RAG Plugin for LM Studio (in development).
Added formats: Word, Excel, CSV, PowerPoint.
Extended settings menu, added embedding model selection.
Full bilingual support — Russian and English.
Original plugin: github.com/ari99/lm_studio_big_rag_plugin
A powerful RAG (Retrieval-Augmented Generation) plugin for LM Studio that can index and search through gigabytes, and potentially terabytes (untested), of document data. Based on ari99/lm_studio_big_rag_plugin with multilingual OCR support (English + Russian by default).
OCR Support: Optional OCR for image files and image-based PDFs using Tesseract with configurable language (English + Russian by default)
Configurable OCR Pipeline: Fine-tune page segmentation mode, image size limits, page limits, timeouts, and local language data path
Vector Search: Uses Vectra with sharded indexes for efficient vector storage and retrieval (avoids single-file size limits)
Incremental Indexing: Automatically detects and skips already-indexed files
Concurrent Processing: Configurable concurrency for optimal performance
Persistent Storage: Vector embeddings are stored locally and persist across sessions
Configurable Embedding Model: Use any embedding model loaded in LM Studio (default: gpustack/text-embedding-bge-m3)
Filename Search: Find indexed files by name using natural language queries (Russian + English), with optional content search within matched files
Supported File Types
Documents: PDF, EPUB, TXT, TEXT, DOCX
Spreadsheets: XLSX, XLS, CSV
Presentations: PPTX
Markdown: MD, MDX, Markdown, MDown, MKD, MKDN
Web Content: HTM, HTML, XHTML
Images (with OCR): BMP, JPEG, JPG, PNG
Archives: RAR (planned, not yet implemented)
Installation
Navigate to the plugin directory:
cd big-rag-plugin
Install dependencies:
npm install
Build the plugin:
npm run build
Run in development mode:
npm run dev
Configuration
The plugin provides the following configuration options in LM Studio:
Response Language
Response Language / Язык ответа (default: ru): Language for RAG instructions sent to the model. Controls the language of inline prompts (citation headers, search instructions, "no results" messages, etc.). Available values:
ru — Russian: all RAG instructions are sent in Russian, so the model will tend to respond in Russian.
en — English: all RAG instructions are sent in English.
Why this matters: The plugin injects instructions and citations directly into the prompt sent to the model. If these instructions are in English, the model may respond in English even if your system prompt says to use Russian. Setting this to ru ensures all injected text is Russian, reinforcing the model's language behavior.
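As an illustration, language-dependent injected text can be modeled as a lookup keyed by this setting. The template name and wording below are assumptions for the sketch, not the plugin's actual strings:

```typescript
// Hypothetical sketch: selecting injected RAG instruction text by the
// Response Language setting. The template contents are illustrative only.
type ResponseLanguage = "ru" | "en";

const NO_RESULTS: Record<ResponseLanguage, string> = {
  ru: "Релевантных фрагментов не найдено.",
  en: "No relevant passages were found.",
};

// Returns the "no results" message in the configured response language.
function noResultsMessage(lang: ResponseLanguage): string {
  return NO_RESULTS[lang];
}
```

With this pattern, every prompt fragment the plugin injects (citation headers, search instructions, fallback messages) stays in one language, which is what reinforces the model's language behavior.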
Required Settings
Documents Directory: Root directory containing your documents (read access required)
Vector Store Directory: Where the vector database will be stored (read/write access required)
Retrieval Settings
Retrieval Limit (1-20, default: 5): Maximum number of chunks to return
Tip: To speed up indexing on a folder with mixed content, disable file types you don't need (e.g., disable images if you only care about text documents). After changing filters, trigger a manual reindex.
Performance Settings
Max Concurrent Files (1-10, default: 1): Number of files to process simultaneously
Parser Delay (ms) (0-5000, default: 500): Wait time before parsing each document (helps avoid WebSocket throttling)
Embedding Model (default: gpustack/text-embedding-bge-m3): Model ID for text embeddings. Must be loaded in LM Studio. Examples: nomic-ai/nomic-embed-text-v1.5-GGUF, gpustack/text-embedding-bge-m3
Filename Search
Enable Filename Search (default: true): When enabled, the plugin detects natural language queries asking to find files by name and searches the indexed file list for matches. Works alongside normal vector content search.
The plugin recognises filename search intent in both Russian and English. Examples of supported query patterns:
| Query language | Example query | Behaviour |
| --- | --- | --- |
| 🇷🇺 Russian | «найди все файлы с именем протокол» | Lists all indexed files whose name contains «протокол» |
| 🇷🇺 Russian | «найди файлы письмо в которых встречается слово договор» | Finds files named «письмо» and searches their content for «договор» |
| 🇷🇺 Russian | «в названии которых есть отчёт» | Lists files whose name contains «отчёт» |
| 🇬🇧 English | «find all files named protocol» | Lists all indexed files whose name contains «protocol» |
| 🇬🇧 English | «show files called report containing budget» | Finds files named «report» and searches their content for «budget» |
| 🇬🇧 English | «list files with name invoice» | Lists files whose name contains «invoice» |
Four search scenarios:
Filename listing only — the query asks just to list files by name (e.g., «найди файлы протокол») → returns a file listing with paths
Filename + content display — the query asks to find a file AND display its content (e.g., «найди файл TestFormat и выведи полностью его содержание») → retrieves ALL indexed chunks from the matched files and presents them in reading order
Filename + content keyword search — the query asks for files by name that also contain specific words (e.g., «найди файлы письмо в которых встречается слово договор») → vector search is performed within the matched files only
No filename intent — the query is a regular question → standard vector search across all indexed documents
Content display is triggered by keywords like: «выведи», «прочитай», «полностью», «целиком», «содержание», «содержимое», «весь текст», «что внутри», «display», «read file», «show content», «full text», «entire content», «what's inside», etc.
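The intent detection described above can be pictured as keyword/regex matching over the query. The patterns below are deliberately simplified assumptions; the plugin's real matching is richer:

```typescript
// Hypothetical sketch of filename-search intent detection.
// These regexes are illustrative, not the plugin's actual patterns.
const FILENAME_INTENT =
  /найди (все )?файлы?|в названии которых|find (all )?files? named|files? with name|files? called/i;
const CONTENT_DISPLAY =
  /выведи|прочитай|полностью|целиком|содержание|содержимое|display|read file|show content|full text/i;

// Classifies a query into one of the filename-search scenarios.
function detectIntent(query: string): "filename" | "filename+content" | "none" {
  if (!FILENAME_INTENT.test(query)) return "none";
  return CONTENT_DISPLAY.test(query) ? "filename+content" : "filename";
}
```

A query with no filename intent falls through to the standard vector search across all indexed documents.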
OCR Settings
Enable OCR (default: true): Enable OCR for image files and image-based PDFs using LM Studio's built-in document parser
OCR Language (default: eng+rus): Tesseract language code for OCR. Supports any Tesseract language combination: eng (English), rus (Russian), eng+rus (both), deu (German), fra (French), etc.
OCR Data Path (default: empty): Path to folder with .traineddata files. Leave empty to auto-detect: the plugin checks its own root folder for .traineddata files matching all requested languages. If any language is missing, Tesseract auto-downloads from CDN on first use. For offline use, place all required .traineddata files (e.g. eng.traineddata, rus.traineddata) in the plugin root or set a custom path. For best quality, download best-traineddata files.
OCR Min Text Length (0-10000, default: 20): Minimum characters for PDF text to be considered valid. Lower values catch short pages (stamps, forms).
OCR Max Pages (1-50000, default: 200): Maximum PDF pages to process with OCR. Increase for large documents.
OCR Max Images Per Page (1-100, default: 10): Maximum images per PDF page for OCR. Increase for pages with many diagrams/tables.
OCR Min Image Area (0-100000, default: 2500): Minimum image area (width×height in px) for OCR. Lower values process smaller images (signatures, stamps).
OCR Max Image Pixels (1M-500M, default: 100M): Maximum image area (px²) to process. Prevents OOM on huge scans. ~100M = 10000×10000.
OCR Image Timeout (ms) (5000-300000, default: 60000): Timeout in ms for loading image data from PDF. Increase for slow systems.
When a PDF exceeds the configured page or image limits, the plugin logs a warning (e.g., ⚠️ PDF "book.pdf" has 500 pages, but maxPages=200) and returns the partially extracted text.
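The image-area limits above amount to a simple gate applied before an image is handed to OCR. A minimal sketch, with assumed field names:

```typescript
// Sketch of how the OCR image-size limits could gate processing.
// Field names are assumptions for illustration.
interface OcrLimits {
  minImageArea: number;   // skip images smaller than this (px²), e.g. tiny icons
  maxImagePixels: number; // skip images larger than this (px²) to avoid OOM
}

// Returns true if an image of the given dimensions should be OCR'd.
function shouldOcrImage(width: number, height: number, limits: OcrLimits): boolean {
  const area = width * height;
  return area >= limits.minImageArea && area <= limits.maxImagePixels;
}
```

With the defaults (2500 and 100M), a 40×40 icon is skipped as too small and a 20000×20000 scan is skipped as too large, while ordinary page scans pass through.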
Reindexing Controls
Manual Reindex Trigger (toggle): Turn this ON and submit any chat message to force indexing to run on every chat session where the plugin is enabled. Flip it OFF once you're done to stop the automatic reindex loop.
Skip Previously Indexed Files (default: true): Controls what each manual reindex run does. When enabled, a run processes only documents that are new or have changed since the last index; when disabled, every run rebuilds the entire index from scratch. Combine this with Manual Reindex Trigger to choose between incremental updates and repeated full refreshes.
Automatic First-Run: If the vector store is empty, the plugin automatically indexes the configured documents the first time any chat message is processed—no manual input is required.
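The interaction of these two settings can be sketched as a single decision function. This is a minimal sketch assuming change detection via file size and mtime; the plugin's actual bookkeeping may differ:

```typescript
// Sketch of the skip/reindex decision. The assumption that the index stores
// size and mtime per file is illustrative, not confirmed plugin behavior.
interface IndexedFileRecord { size: number; mtimeMs: number; }

function needsReindex(
  record: IndexedFileRecord | undefined,          // prior index entry, if any
  stat: { size: number; mtimeMs: number },        // current file stats
  skipPreviouslyIndexed: boolean,
): boolean {
  if (!skipPreviouslyIndexed) return true;        // full refresh every run
  if (!record) return true;                       // new file, never indexed
  return record.size !== stat.size || record.mtimeMs !== stat.mtimeMs; // changed
}
```

Turning Skip Previously Indexed Files off makes every file report "needs reindex", which is what produces the full rebuild on each chat.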
Usage
Configure the Plugin:
Open LM Studio settings
Navigate to the Big RAG plugin configuration
Set your documents directory (e.g., /Users/user/Documents/MyLibrary)
Set your vector store directory (e.g., /Users/user/.lmstudio/big-rag-db)
Initial Indexing:
The first time you send a message, the plugin will automatically scan and index your documents
This process may take a while depending on the size of your document collection
Progress will be shown in the LM Studio interface
Query Your Documents:
Simply chat with your LM Studio model as usual
The plugin will automatically search your indexed documents for relevant content
Retrieved passages will be injected into the context for the model to use
Architecture
Components
File Scanner (src/ingestion/fileScanner.ts):
Recursively scans directories
Filters for supported file types
Collects file metadata
Document Parsers (src/parsers/):
htmlParser.ts: Extracts text from HTML/HTM files
pdfParser.ts: Extracts text from PDF files
epubParser.ts: Extracts text from EPUB files
textParser.ts: Reads plain text & Markdown files with optional Markdown stripping
imageParser.ts: OCR for image files
docxParser.ts: Extracts text from DOCX (Word) files via mammoth
xlsxParser.ts: Extracts text from XLSX/XLS (Excel) files via SheetJS
pptxParser.ts: Extracts text from PPTX (PowerPoint) files via JSZip
documentParser.ts: Routes to appropriate parser
Vector Store (src/vectorstore/vectorStore.ts):
Uses Vectra with sharded indexes (one shard in memory at a time; avoids V8 string size limits)
Supports incremental updates
Efficient similarity search
Index Manager (src/ingestion/indexManager.ts):
Orchestrates the indexing pipeline
Manages concurrent processing
Handles progress reporting
Prompt Preprocessor (src/promptPreprocessor.ts):
Intercepts user queries
Performs vector search
Injects relevant context
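The retrieval path reduces to two primitives: routing each chunk to a shard, and ranking chunks by cosine similarity to the query embedding. A minimal sketch of both, not Vectra's actual internals:

```typescript
// Illustrative sketch of shard routing and similarity ranking.
// Vectra's real sharding strategy and scoring may differ.

// Deterministically maps a document ID to one of N shards,
// so each shard stays small enough to load into memory alone.
function shardFor(docId: string, shardCount: number): number {
  let h = 0;
  for (const ch of docId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % shardCount;
}

// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

Searching shards one at a time trades some latency for bounded memory, which is what lets the index scale past single-file and V8 string size limits.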
Performance Considerations
Large Datasets
Disk Space: The vector store requires additional disk space (typically 10-20% of original document size)
Initial Indexing: Can take several hours for TB-scale collections
Memory Usage: Scales with concurrent processing (reduce maxConcurrentFiles if needed)
Optimization Tips
Start Small: Test with a subset of documents first
Disable OCR: Unless you have many image-based documents, keep OCR disabled
Adjust Concurrency: Lower maxConcurrentFiles on systems with limited resources
Chunk Size: Larger chunks (1024-2048) work better for technical documents
Threshold Tuning: Adjust retrievalAffinityThreshold based on result quality
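The chunk-size tip above can be illustrated with fixed-size chunking with overlap. This is a sketch under assumed parameter names; the plugin's chunker may split on sentence or paragraph boundaries instead:

```typescript
// Sketch of fixed-size chunking with overlap. Overlap keeps context that
// straddles a chunk boundary retrievable from both neighboring chunks.
function chunkText(text: string, chunkSize = 1024, overlap = 128): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap; // advance less than chunkSize to overlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Larger chunks mean fewer embeddings and more context per retrieved passage, at the cost of coarser retrieval granularity.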
Troubleshooting
No Results Found
Check that documents directory is correctly configured
Verify that indexing completed successfully
Try lowering the retrieval affinity threshold
Check LM Studio logs for errors
Slow Indexing
Reduce maxConcurrentFiles
Disable OCR if not needed
Ensure vector store directory is on a fast drive (SSD recommended)
Out of Memory
Reduce maxConcurrentFiles to 1 or 2
Process documents in batches by organizing them into subdirectories
Increase system swap space
OCR Not Working
Tesseract.js downloads language data on first use for each language (fast model from CDN)
For better quality, download best-traineddata files and set the OCR Data Path to the folder containing them
Ensure internet connectivity during first OCR operation (unless using local .traineddata files)
Check that the OCR Language setting matches your document language (e.g., rus for Russian, eng+rus for mixed)
Try adjusting OCR Page Segmentation Mode — PSM 6 works better for tables and forms, PSM 4 for single-column text
Check that image files are valid and readable
If large PDFs are partially processed, check the logs for ⚠️ warnings and increase OCR Max Pages or OCR Max Images Per Page
Failure Reason Reporting
The CLI logs cumulative success/failure counts after each processed document.
Set BIG_RAG_FAILURE_REPORT_PATH=/absolute/path/report.json when running npm run index (or via LM Studio env settings) to emit a JSON report containing all failure reasons and counts after indexing completes. This is useful when triaging stubborn PDFs such as blueprints or large scanned books.
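A failure report of this kind boils down to grouping failures by reason. The JSON shape below is a hypothetical sketch, not the schema the plugin actually writes:

```typescript
// Sketch of aggregating per-file failures into a report.
// The output shape { total, byReason } is an assumption for illustration.
interface Failure { file: string; reason: string; }

function aggregateFailures(failures: Failure[]) {
  const byReason: Record<string, number> = {};
  for (const f of failures) byReason[f.reason] = (byReason[f.reason] ?? 0) + 1;
  return { total: failures.length, byReason };
}
```

Grouping by reason makes it easy to spot, say, that most failures are OCR timeouts on large scanned books rather than parser crashes.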
CLI Indexing
For standalone indexing (requires LM Studio running for embeddings):
node dist/cliIndex.js /path/to/docs /path/to/db
Environment variables:
BIG_RAG_EMBEDDING_MODEL — embedding model ID (default: gpustack/text-embedding-bge-m3)
BIG_RAG_OCR_LANGUAGE — OCR language (default: eng+rus)
BIG_RAG_OCR_DATA_PATH — path to .traineddata folder
BIG_RAG_OCR_PSM — Tesseract PSM (default: 3)
BIG_RAG_OCR_MAX_PAGES — max OCR pages (default: 200)
BIG_RAG_OCR_MAX_IMAGES_PER_PAGE — max images per page (default: 10)
BIG_RAG_OCR_MIN_IMAGE_AREA — min image area (default: 2500)
BIG_RAG_OCR_MAX_IMAGE_PIXELS — max image pixels (default: 100000000)
BIG_RAG_OCR_IMAGE_TIMEOUT_MS — image timeout ms (default: 60000)
BIG_RAG_FORCE_REINDEX — set to true to force full reindex
BIG_RAG_FAILURE_REPORT_PATH — path to write failure report JSON
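Reading these variables with their documented defaults can be sketched as a small helper; the helper name is an assumption, not the plugin's actual code:

```typescript
// Sketch: read a numeric env var, falling back to its documented default
// when the variable is unset or not a finite number.
function envInt(name: string, fallback: number): number {
  const raw = process.env[name];
  const n = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(n) ? n : fallback;
}

// Examples with the defaults documented above.
const ocrMaxPages = envInt("BIG_RAG_OCR_MAX_PAGES", 200);
const forceReindex = process.env.BIG_RAG_FORCE_REINDEX === "true";
```

Boolean flags like BIG_RAG_FORCE_REINDEX are compared against the literal string "true", so any other value leaves them off.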
Limitations
RAR Archives: Not yet implemented (files are skipped)
Password-Protected Files: Not supported
Very Large Files: Individual files >100MB may cause memory issues
OCR Language Coverage: limited only by available Tesseract language data; any Tesseract language can be set via the OCR Language setting (default: eng+rus)