Configuring the Model
You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.
Set inference-time parameters such as temperature, maxTokens, topP, and more.
# "model" is an LLM handle and "chat" a Chat object or prompt string (see below)
result = model.respond(chat, config={
    "temperature": 0.6,
    "maxTokens": 50,
})
Note that while structured can be set to a JSON schema definition as an inference-time configuration parameter, the preferred approach is to set the dedicated response_format parameter instead, which allows you to enforce the structure of the output more rigorously using a JSON- or class-based schema definition.
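For example, a class-based schema can be passed via response_format. The following is a minimal sketch that assumes a Pydantic model is accepted as the schema and that the structured output is exposed on the result's parsed attribute; the BookSchema class and the prompt are illustrative.
from pydantic import BaseModel
import lmstudio as lms

class BookSchema(BaseModel):
    title: str
    author: str
    year: int

model = lms.llm("qwen2.5-7b-instruct")
result = model.respond("Tell me about The Hobbit.", response_format=BookSchema)
book = result.parsed  # structured data conforming to BookSchema
print(book)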
Set load-time parameters such as contextLength, gpuOffload, and more.
.model()
The .model() method retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading). The top-level lms.llm() convenience function used below does the same through the default client.
Note: if the model is already loaded, the configuration will be ignored.
import lmstudio as lms

model = lms.llm("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    "gpuOffload": 0.5,
})
.load_new_instance()
The .load_new_instance()
method creates a new model instance and loads it with the specified configuration.
import lmstudio as lms

client = lms.get_default_client()
model = client.llm.load_new_instance("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    "gpuOffload": 0.5,
})