Configuring the Model
You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.
Set inference-time parameters such as temperature, maxTokens, topP, and more.
const prediction = model.respond(chat, {
  temperature: 0.6, // sampling temperature; higher values produce more varied output
  maxTokens: 50,    // upper limit on the number of tokens to generate
});
See LLMPredictionConfigInput for all configurable fields.
Another useful inference-time configuration parameter is structured, which allows you to rigorously enforce the structure of the output using a JSON or zod schema.
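For example, you can pass a zod schema to structured so the model's output is constrained to match it. The sketch below is illustrative: the bookSchema shape is made up, and the result.parsed field is an assumption; consult the structured output documentation for the exact API.

import { z } from "zod";

// Hypothetical schema for illustration: the reply must be a book record.
const bookSchema = z.object({
  title: z.string(),
  author: z.string(),
  year: z.number().int(),
});

const prediction = model.respond(chat, {
  structured: bookSchema,
  maxTokens: 100,
});

const result = await prediction;
console.log(result.parsed); // assumed field holding the schema-conforming object

Constraining the output this way is useful when the prediction feeds directly into downstream code rather than being shown to a user.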
Set load-time parameters such as contextLength, gpuOffload, and more.
.model()
The .model() method retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading).
Note: if the model is already loaded, the configuration will be ignored.
const model = await client.llm.model("qwen2.5-7b-instruct", {
config: {
contextLength: 8192,
gpu: {
ratio: 0.5,
},
},
});
See LLMLoadModelConfig for all configurable fields.
.load()
The .load() method creates a new model instance and loads it with the specified configuration.
const model = await client.llm.load("qwen2.5-7b-instruct", {
config: {
contextLength: 8192,
gpu: {
ratio: 0.5,
},
},
});
See LLMLoadModelConfig for all configurable fields.
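Load-time and inference-time configuration compose: once you hold a handle from .model() or .load(), per-request parameters such as temperature apply on top of the configuration the model was loaded with. A minimal sketch, where the prompt string and parameter values are arbitrary placeholders:

// model was loaded above with contextLength: 8192 and gpu.ratio: 0.5.
const prediction = model.respond("Summarize the plot of Hamlet in two sentences.", {
  temperature: 0.3,
  maxTokens: 80,
});

const { content } = await prediction;
console.log(content);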