0.3.10 - Release Notes
What's New
This version introduces 🔮 Speculative Decoding!
Speculative Decoding is a technique that can speed up token generation by up to 1.5x-3x in some cases.
Supported in LM Studio for both llama.cpp and MLX, in the chat UI and server API.
Change Log
Build 6
- Fixed an issue where first message of tool streaming response did not include "assistant" role
- Improved error message when trying to use a draft model with a different engine.
- Fixed a bug where speculative decoding visualization does not work when continuing a message.
Build 5
- Update MLX to enable Speculative Decoding on M1/M2 Macs (in addition to M3/M4)
- Fixed an issue on Linux and macOS where child processes may not be cleaned up after app exit
- [Mac][MLX] Fixed a bug where selecting a draft model during prediction would cause the model to crash
Build 4
- New: Chat Appearance > "Expand chat container to window width" option
- This option allows you to expand the chat container to the full width of the window
- Fixed RAG not working due to "path must be a string"
- Bug fix: conversations would sometimes be named 'Untitled' regardless of auto naming settings
Build 3
- The beginning and the end tags of reasoning blocks are now configurable in My Models page
- You can use this feature to enable thinking UI for models that don't use
<think> and </think> tags to denote reasoning sections
- Fixed a bug where structured output is not configurable in My Models page
- Optimized engine indexing for reduced start-up delay
- Option to re-run engine compatibility checks for specific engines from the Runtimes UI
- [Mac] Improved reliability of MLX runtime installation, and improved detection of broken MLX runtimes
Build 2
- Fixed a case where the message about updating the engine to use speculative decoding is not displayed
- Fixed a bug where we sometimes show "no compatible draft models" despite we are still identifying them
- [Linux] Fixed 'exit code 133' bug (reference: https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/285)
Build 1
- New: 🔮 Speculative Decoding! (for llama.cpp and MLX models)
- Use smaller "draft model" to achieve generation speed up by up to 1.5x-3x for larger models.
- Works best when combining very small draft model + large main model. The speedup comes without any degradation in quality.
- Your mileage may vary. Experiment with different draft models easily to find what works best.
- Works in both chat UI and server API
- Use the new "Visualize accepted draft tokens" feature to watch speculative decoding in action.
- New: Runtime (cmd/ctrl + shift + R) page UI
- Auto update runtimes only on app start up
- Fixed a bug where multiple images sent to the model would not be recognized