A local AI engineering lab for systematic prompt engineering and model evaluation. Version prompts, build eval datasets, log model outputs, and compare results — all from a single chat session, all persisted locally.
Not a vibe-based workflow. Every prompt change is versioned and every model claim is backed by logged results.
Prompt templates: {{variable}} syntax; fill and preview before running.

Load the built plugin folder in LM Studio.
| Field | Default | Description |
|---|---|---|
| Data Path | ~/ai-lab-data/ | SQLite database location |
| Tool | Kind | What it does |
|---|---|---|
| save_prompt | Store-write | Save a new version of a prompt — auto-increments version, never overwrites |
| get_prompt | Store-read | Get a prompt by name (latest or specific version) |
| list_prompts | Store-read | List all prompts with latest version and description |
| list_prompt_versions | Store-read | List all versions of a specific prompt |
| diff_prompts | Compute | Line-by-line diff between two prompt versions |
| run_prompt_template | Compute | Fill {{variable}} placeholders and return the rendered prompt |
| Tool | Kind | What it does |
|---|---|---|
| create_eval_dataset | Store-write | Create a named dataset for organizing test cases |
| add_eval_case | Store-write | Add a test case with input and expected output |
| list_eval_datasets | Store-read | List all datasets with case counts |
| get_eval_dataset | Store-read | Get a dataset with all its cases and IDs |
| Tool | Kind | What it does |
|---|---|---|
| log_model_result | Store-write | Log a model's output for a case, with an optional score |
| compare_models | Store-read | Leaderboard + case-by-case comparison across models |
| generate_eval_report | Scaffold | Return a report payload for the LLM to write a narrative eval summary |
Every save_prompt call creates a new version. Version 1 is always the baseline.
Use diff_prompts(name, 1, 3) to see what changed between baseline and current.
Use compare_models with both prompt versions to see if the change actually helped.
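save_prompt's never-overwrite behavior boils down to reading the highest existing version for a name and inserting at that number plus one. A minimal sketch, assuming a SQLite prompts(name, version, content, description) table and the better-sqlite3 package; neither assumption is necessarily how the plugin is actually implemented:

```ts
import Database from "better-sqlite3";

// Illustrative sketch of save_prompt's versioning behavior.
// Assumed schema: prompts(name TEXT, version INTEGER, content TEXT, description TEXT)
const db = new Database("ai-lab.db");

function savePrompt(name: string, content: string, description: string): number {
  // Find the highest existing version for this prompt name (0 if none yet).
  const row = db
    .prepare("SELECT COALESCE(MAX(version), 0) AS v FROM prompts WHERE name = ?")
    .get(name) as { v: number };

  const nextVersion = row.v + 1; // version 1 is the baseline

  // Insert a new row; earlier versions are never touched.
  db.prepare(
    "INSERT INTO prompts (name, version, content, description) VALUES (?, ?, ?, ?)"
  ).run(name, nextVersion, content, description);

  return nextVersion;
}
```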
Use {{variable_name}} in prompt content:
run_prompt_template(name, variables={"role": "reviewer", "user_input": "...", "language": "English"}) returns the filled prompt.
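Placeholder filling is plain string substitution. A minimal sketch of the idea in TypeScript (illustrative only, not the plugin's actual code):

```ts
// Fill {{variable}} placeholders; unknown placeholders are left as-is
// so you can spot anything you forgot to supply before running the prompt.
function renderTemplate(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (placeholder, key: string) =>
    key in variables ? variables[key] : placeholder
  );
}

renderTemplate("You are a {{role}}. Respond in {{language}}.", {
  role: "reviewer",
  language: "English",
});
// => "You are a reviewer. Respond in English."
```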
| Score | Meaning |
|---|---|
| 1.0 | Perfect — exactly matches expected or fully correct |
| 0.8 | Good — minor issues or slight deviation |
| 0.5 | Partial — got the right idea but incomplete or partially wrong |
| 0.2 | Poor — attempted but fundamentally wrong |
| 0.0 | Failure — completely wrong or refused |
| null | Not yet scored |
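For example, an answer with minor issues might be logged like this (parameter names here are illustrative, not the tool's confirmed schema):

log_model_result(datasetName="intent_classification", caseId=3, model="local-7b", output="complaint", score=0.8)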
Score labels (set scoreLabel to one of):
- exact_match — exact string comparison
- rubric — scored against a defined rubric
- human — human-judged
- llm_judge — scored by another LLM

Saving a prompt:
"Save this classification prompt"
save_prompt(name="intent_classifier", content="Classify the intent of: {{input}}\nOptions: question, request, complaint", description="Basic intent classification")
Iterating on a prompt:
"Add chain-of-thought to the intent classifier"
save_prompt(name="intent_classifier", content="Let's think step by step...", description="Added CoT reasoning") → version 2 saved
diff_prompts(name="intent_classifier", versionA=1, versionB=2) → see exactly what changed
Setting up an eval:
"Create an eval for intent classification"
create_eval_dataset(name="intent_classification", description="Tests accurate intent labeling across 3 categories") → add_eval_case × 10
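Each case pairs an input with the expected output, for example (parameter names are illustrative):

add_eval_case(datasetName="intent_classification", input="Where is my refund?", expectedOutput="complaint")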
Comparing models:
"Which model is better at intent classification?"
compare_models(datasetName="intent_classification") → leaderboard by avg score, case-by-case breakdown
Writing a report:
"Write up our eval findings"
generate_eval_report(datasetName="intent_classification") → structured report: leaderboard, failure patterns, recommendation
All data is local. Nothing is sent to any external service.
cd ai-lab-plugin
npm install
npm run build
classification_prompt: v1 (baseline) → v2 (tried adding CoT) → v3 (current)
You are a {{role}}.
User: {{user_input}}
Respond in {{language}}.
1. create_eval_dataset("task_name")
2. add_eval_case × 10+
3. get_eval_dataset → note case IDs
4. Run each model manually against each input
5. log_model_result for each model × case
6. compare_models → leaderboard + case breakdown (sketched after this list)
7. generate_eval_report → narrative summary
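Step 6's leaderboard is just the mean logged score per model, ranked best first. A rough TypeScript sketch of that aggregation (the ModelResult fields are assumptions for illustration, not the plugin's schema):

```ts
// Sketch of compare_models' leaderboard: mean score per model, best first.
type ModelResult = { model: string; caseId: number; score: number | null };

function leaderboard(results: ModelResult[]) {
  const byModel = new Map<string, number[]>();
  for (const r of results) {
    if (r.score === null) continue; // unscored cases don't affect the average
    byModel.set(r.model, [...(byModel.get(r.model) ?? []), r.score]);
  }
  return [...byModel.entries()]
    .map(([model, scores]) => ({
      model,
      avgScore: scores.reduce((sum, s) => sum + s, 0) / scores.length,
      casesScored: scores.length,
    }))
    .sort((a, b) => b.avgScore - a.avgScore);
}
```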
~/ai-lab-data/
ai-lab.db ← SQLite: prompts, eval_datasets, eval_cases, model_results