Speculative Decoding
Required Python SDK version: 1.2.0
Speculative decoding is a technique that can substantially increase the generation speed of large language models (LLMs) without reducing response quality. See Speculative Decoding for more info.
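To make the idea concrete, here is a toy sketch of one speculative decoding step: a small draft model proposes several tokens at once, and the main model verifies them in a single pass, keeping the leading run it agrees with. This is an illustration of the technique only, not how lmstudio-python implements it; the function and callback names here are invented for the example.

```python
def speculative_step(draft_propose, main_verify, prefix, k=4):
    """Toy sketch of one speculative decoding step.

    draft_propose(prefix, k): the small model proposes k candidate tokens.
    main_verify(prefix, candidates): the main model returns how many leading
    candidates it agrees with.
    These names are illustrative, not part of the lmstudio-python API.
    """
    candidates = draft_propose(prefix, k)
    accepted = main_verify(prefix, candidates)
    # Accepted draft tokens are appended in one pass, so several tokens can
    # be produced for the cost of a single main-model verification.
    return prefix + candidates[:accepted]


# Toy "models": the draft guesses the next k integers; the main model accepts
# a candidate only while it matches its own (identical) rule, so here every
# draft token is accepted.
draft_propose = lambda prefix, k: [prefix[-1] + i + 1 for i in range(k)]
main_verify = lambda prefix, cands: sum(
    1 for i, t in enumerate(cands) if t == prefix[-1] + i + 1
)

print(speculative_step(draft_propose, main_verify, [1, 2, 3]))  # → [1, 2, 3, 4, 5, 6, 7]
```

When the draft model disagrees with the main model, fewer candidates are accepted and the speedup shrinks, which is why a small draft model from the same family as the main model tends to work best.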
To use speculative decoding in lmstudio-python, simply provide a draftModel parameter when performing the prediction. You do not need to load the draft model separately.
import lmstudio as lms

main_model_key = "qwen2.5-7b-instruct"
draft_model_key = "qwen2.5-0.5b-instruct"

model = lms.llm(main_model_key)
result = model.respond(
    "What are the prime numbers between 0 and 100?",
    config={
        # The draft model is loaded automatically alongside the main model.
        "draftModel": draft_model_key,
    },
)

print(result)
stats = result.stats
print(f"Accepted {stats.accepted_draft_tokens_count}/{stats.predicted_tokens_count} tokens")
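The two stats fields printed above can be combined into an acceptance rate, a rough gauge of how well the draft model matches the main model. This is a small sketch using the same counts as the example; the helper function itself is not part of the SDK.

```python
def acceptance_rate(accepted: int, predicted: int) -> float:
    """Fraction of predicted tokens that came from accepted draft tokens."""
    if predicted == 0:
        return 0.0
    return accepted / predicted


# Hypothetical counts: 80 of 120 predicted tokens accepted from the draft.
rate = acceptance_rate(80, 120)
print(f"Draft acceptance rate: {rate:.1%}")  # → Draft acceptance rate: 66.7%
```

A consistently low rate suggests the draft model disagrees with the main model too often to provide much speedup; trying a draft model from the same family as the main model usually helps.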