Configuring the Model

You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.

Inference Parameters

Set inference-time parameters such as temperature, maxTokens, topP, and more.

import lmstudio as lms
model = lms.llm()  # handle to any currently loaded model
chat = lms.Chat("You are a helpful assistant.")
chat.add_user_message("What is LM Studio?")
result = model.respond(chat, config={
    "temperature": 0.6,
    "maxTokens": 50,
})

Note that while structured can be set to a JSON schema definition as an inference-time configuration parameter, the preferred approach is to set the dedicated response_format parameter instead, which lets you enforce the structure of the output more rigorously using either a JSON schema or a class-based schema definition.
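For example, with the class-based flavor of response_format you pass a class describing the desired output and read the parsed result back from the response. The following is a minimal sketch, assuming a pydantic BaseModel is used as the class-based schema; the BookSchema class and its fields are illustrative assumptions, not part of this page.

import lmstudio as lms
from pydantic import BaseModel

# Hypothetical schema used only for illustration
class BookSchema(BaseModel):
    title: str
    author: str
    year: int

model = lms.llm()
result = model.respond("Tell me about The Hobbit", response_format=BookSchema)
book = result.parsed  # structured result whose keys follow the schema above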

Load Parameters

Set load-time parameters such as contextLength, gpuOffload, and more.

Set Load Parameters with .model()

The .model() method (called here via the lms.llm() convenience function) retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading).

Note: if the model is already loaded, the configuration will be ignored.

import lmstudio as lms
model = lms.llm("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    "gpuOffload": 0.5,
})
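Because the configuration is ignored when the model is already loaded, it can be useful to check what the handle is actually running with. A minimal sketch, assuming the handle exposes the get_context_length() method:

# Confirm the context length in effect for this handle; if the model was
# already loaded, this may differ from the contextLength requested above.
print(model.get_context_length())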

Set Load Parameters with .load_new_instance()

The .load_new_instance() method creates a new model instance and loads it with the specified configuration.

import lmstudio as lms
client = lms.get_default_client()
model = client.llm.load_new_instance("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    "gpuOffload": 0.5,
})
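Since .load_new_instance() always loads a fresh copy of the model into memory, you may want to release that instance once you are done with it. A minimal sketch, assuming the handle's unload() method:

# Unload this specific instance when finished to free memory
model.unload()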