Configuring the Model
You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.
Inference Parameters
Set inference-time parameters such as temperature, maxTokens, topP, and more.
const prediction = model.respond(chat, {
  temperature: 0.6,
  maxTokens: 50,
});
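The prediction returned by respond() can be streamed as it is generated and then awaited for the final result. A minimal sketch of consuming it (this assumes each streamed fragment and the final result expose the generated text on a content field):
for await (const fragment of prediction) {
  // Print tokens as they arrive.
  process.stdout.write(fragment.content);
}
const result = await prediction;
console.log(result.content);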
See LLMPredictionConfigInput for all configurable fields.
Another useful inference-time configuration parameter is structured, which allows you to rigorously enforce the structure of the output using a JSON or zod schema.
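For example, a minimal sketch using a zod schema (the schema and its field names are illustrative, and it assumes structured accepts a zod schema directly and that the parsed output is available on the result):
import { z } from "zod";

// Illustrative schema; the fields here are hypothetical.
const bookSchema = z.object({
  title: z.string(),
  author: z.string(),
  year: z.number().int(),
});

const prediction = model.respond("Tell me about The Hobbit.", {
  structured: bookSchema,
  maxTokens: 100,
});

const result = await prediction;
console.log(result.parsed);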
Load Parameters
Set load-time parameters such as the context length, GPU offload ratio, and more.
.model()
The .model() method retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading).
Note: if the model is already loaded, the configuration will be ignored.
const model = await client.llm.model("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
    gpu: {
      ratio: 0.5,
    },
  },
});
See LLMLoadModelConfig for all configurable fields.
.load()
The .load() method creates a new model instance and loads it with the specified configuration.
const model = await client.llm.load("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
    gpu: {
      ratio: 0.5,
    },
  },
});
See LLMLoadModelConfig for all configurable fields.
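Putting it together, a minimal end-to-end sketch (assuming the client is constructed with new LMStudioClient() from @lmstudio/sdk; the prompt text is illustrative):
import { LMStudioClient } from "@lmstudio/sdk";

const client = new LMStudioClient();

// Load-time parameters are applied when the model is loaded (or ignored
// if the model is already loaded, as noted above)...
const model = await client.llm.model("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
  },
});

// ...while inference-time parameters are applied per request.
const result = await model.respond("What is a context length?", {
  temperature: 0.6,
  maxTokens: 50,
});

console.log(result.content);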