You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.
Inference Parameters
Set inference-time parameters such as temperature, maxTokens, topP and more.
// "model" is an LLM handle (see Load Parameters below); these options apply to this request only
const prediction = model.respond(chat, {
  temperature: 0.6,
  maxTokens: 50,
});
See LLMPredictionConfigInput for all configurable fields.
Another useful inference-time configuration parameter is structured, which lets you rigorously enforce the structure of the output using a JSON schema or a zod schema.
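For example, here is a minimal sketch of schema-enforced output, assuming the structured field accepts a zod schema directly and that the prediction result exposes the validated object as parsed (the schema and prompt are placeholders):

import { z } from "zod";

// Describe the exact shape the reply must take
const bookSchema = z.object({
  title: z.string(),
  author: z.string(),
  year: z.number().int(),
});

const result = await model.respond(chat, {
  structured: bookSchema,
  maxTokens: 100,
});

// If the prediction succeeds, the schema-validated object should be available on the result
console.info(result.parsed);

Constraining the output this way is generally more reliable than only asking for JSON in the prompt.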
Load Parameters
Set load-time parameters such as the context length, GPU offload ratio, and more.
Set Load Parameters with .model()
The .model() method retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading).
Note: if the model is already loaded, the configuration will be ignored.
import { LMStudioClient } from "@lmstudio/sdk";

const client = new LMStudioClient();

// The load-time configuration only takes effect if this call actually loads the model
const model = await client.llm.model("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
    gpu: {
      ratio: 0.5,
    },
  },
});
See LLMLoadModelConfig for all configurable fields.
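To illustrate the note above: because .model() returns a handle to an instance that is already loaded, only the call that actually loads the model applies its configuration. A hypothetical sequence (context lengths are placeholders):

// First call: the model is not loaded yet, so it is loaded with contextLength 8192
const first = await client.llm.model("qwen2.5-7b-instruct", {
  config: { contextLength: 8192 },
});

// Second call: the model is already loaded, so this configuration is ignored
// and the handle refers to the existing instance
const second = await client.llm.model("qwen2.5-7b-instruct", {
  config: { contextLength: 4096 },
});

If you need a fresh instance that is guaranteed to use your configuration, use .load() as described below.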
Set Load Parameters with .load()
The .load() method creates a new model instance and loads it with the specified configuration.
const model = await client.llm.load("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
    gpu: {
      ratio: 0.5,
    },
  },
});
See LLMLoadModelConfig for all configurable fields.