API Changelog
Added support for the stream_options object on OpenAI-compatible endpoints. Setting stream_options.include_usage to true returns prompt and completion token usage during streaming.
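For example, a streaming request that also reports token usage could look like the sketch below, sent to the OpenAI-compatible /v1/chat/completions endpoint on the default localhost:1234; the model name is illustrative.

```bash
# Sketch: stream a chat completion and include token usage in the stream.
# Assumes the server runs on the default port 1234; the model name is illustrative.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-qwen-7b",
    "messages": [{ "role": "user", "content": "Hello" }],
    "stream": true,
    "stream_options": { "include_usage": true }
  }'
```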
The response_format.type field now accepts "text" in chat-completion requests.

Fixed a bug where $defs in tool definitions were stripped.

Fixed handling of tool definitions without a parameters object and prevented hangs when an MCP server reloads.
Model capabilities in GET /models

The REST API (/api/v0) now returns a capabilities array in the GET /models response. Each model lists its supported capabilities (e.g. "tool_use") so clients can programmatically discover tool-enabled models.
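As a sketch, you can query the endpoint and filter on that array to find tool-enabled models; the .data[] and .id fields in the jq filter are assumptions about the surrounding response shape.

```bash
# List the ids of models whose capabilities array includes "tool_use".
# Assumes each model entry exposes "id" and "capabilities"; adjust the paths if the shape differs.
curl -s http://localhost:1234/api/v0/models | \
  jq '[.data[] | select(.capabilities != null and (.capabilities | index("tool_use"))) | .id]'
```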
Improved Tool Use API Support

The OpenAI-like REST API now supports the tool_choice parameter:

```json
{
  "tool_choice": "auto" // or "none", "required"
}
```

- "tool_choice": "none" — Model will not call tools
- "tool_choice": "auto" — Model decides
- "tool_choice": "required" — Model must call tools (llama.cpp only)

Chunked responses now set "finish_reason": "tool_calls" when appropriate.
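For streamed responses this shows up in the final chunk, roughly as sketched below; the chunk shape follows the OpenAI streaming format and the values are illustrative.

```json
{
  "object": "chat.completion.chunk",
  "choices": [
    { "index": 0, "delta": {}, "finish_reason": "tool_calls" }
  ]
}
```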
[API/SDK] Preset Support

The RESTful API and SDKs support specifying presets in requests.

(example needed)
Speculative Decoding API

Enable speculative decoding in API requests with "draft_model":

```json
{
  "model": "deepseek-r1-distill-qwen-7b",
  "draft_model": "deepseek-r1-distill-qwen-0.5b",
  "messages": [ ... ]
}
```

Responses now include a stats object for speculative decoding:

```json
"stats": {
  "tokens_per_second": ...,
  "draft_model": "...",
  "total_draft_tokens_count": ...,
  "accepted_draft_tokens_count": ...,
  "rejected_draft_tokens_count": ...,
  "ignored_draft_tokens_count": ...
}
```
Idle TTL and Auto Evict

Set a TTL (in seconds) for models loaded via API requests (docs article: Idle TTL and Auto-Evict):

```bash
curl http://localhost:1234/api/v0/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-qwen-7b",
    "messages": [ ... ],
    "ttl": 300
  }'
```

With lms:

lms load --ttl <seconds>
Separate reasoning_content in Chat Completion responses

For DeepSeek R1 models, get reasoning content in a separate field. See more here.

Turn this on in App Settings > Developer.
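When enabled, the assistant message carries the model's reasoning separately from its final answer, roughly as in the sketch below; only the reasoning_content field comes from this changelog entry, and the surrounding shape follows the usual chat-completion format for illustration.

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "…the model's step-by-step reasoning…",
        "content": "The final answer shown to the user."
      },
      "finish_reason": "stop"
    }
  ]
}
```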
Tool and Function Calling API

Use any LLM that supports Tool Use and Function Calling through the OpenAI-like API.

Docs: Tool Use and Function Calling.
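As a sketch, a request that exposes a single function to the model could look like this; the get_weather tool and its schema are purely illustrative, and the tools array follows the OpenAI function-calling format.

```bash
# Sketch: expose one illustrative function to the model via the OpenAI-like API.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-qwen-7b",
    "messages": [{ "role": "user", "content": "What is the weather in Berlin?" }],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": { "city": { "type": "string" } },
            "required": ["city"]
          }
        }
      }
    ]
  }'
```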
Introducing lms get: download models from the terminal

You can now download models directly from the terminal using a keyword:

lms get deepseek-r1

or a full Hugging Face URL:

lms get <hugging face url>

To filter for MLX models only, add --mlx to the command:

lms get deepseek-r1 --mlx