Configuring the Model

You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.

Inference Parameters

Set inference-time parameters such as temperature, maxTokens, topP, and more.

// Per-request settings are passed as the second argument to respond().
const prediction = model.respond(chat, {
  temperature: 0.6,
  maxTokens: 50,
});
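
Once you have a prediction, it can be consumed either by streaming or by awaiting the final result. The sketch below assumes the returned prediction object is async-iterable (yielding fragments that carry a content string) as well as awaitable, as in recent versions of the SDK.

// Option 1: stream fragments as they are generated.
for await (const fragment of prediction) {
  process.stdout.write(fragment.content);
}

// Option 2: await the completed result and read the full text.
const result = await prediction;
console.log(result.content);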

See LLMPredictionConfigInput for all configurable fields.

Another useful inference-time parameter is structured, which allows you to rigorously enforce the structure of the output using a JSON schema or a zod schema.
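
As an illustration, here is a minimal sketch of structured output using a zod schema. It assumes the structured field accepts a zod schema directly and that the parsed object is exposed on the prediction result; the schema and prompt are placeholders.

import { z } from "zod";

// Placeholder schema for this sketch.
const bookSchema = z.object({
  title: z.string(),
  author: z.string(),
  year: z.number().int(),
});

const prediction = model.respond("Tell me about The Hobbit.", {
  structured: bookSchema, // assumes zod schemas are accepted directly
  maxTokens: 100,
});

const result = await prediction;
console.log(result.parsed); // assumes the parsed object is available on the result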

Load Parameters

Set load-time parameters such as contextLength, gpuOffload, and more.

Set Load Parameters with .model()

The .model() method retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading).

Note: if the model is already loaded, the configuration will be ignored.

import { LMStudioClient } from "@lmstudio/sdk";

const client = new LMStudioClient();

// Load-time settings go under `config`; they only take effect if this call
// actually loads the model (see the note above).
const model = await client.llm.model("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
    gpu: {
      ratio: 0.5,
    },
  },
});

See LLMLoadModelConfig for all configurable fields.

Set Load Parameters with .load()

The .load() method creates a new model instance and loads it with the specified configuration.

// .load() always loads a fresh instance, so the configuration is always applied.
const model = await client.llm.load("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
    gpu: {
      ratio: 0.5,
    },
  },
});
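
The handle returned by .load() is used the same way as one obtained from .model(). As a brief sketch (the prompt here is illustrative, and this assumes respond() accepts a plain string prompt and that the prediction can be awaited for its final result):

// Run a prediction against the freshly loaded instance.
const result = await model.respond("Summarize load-time vs. inference-time parameters.", {
  maxTokens: 50,
});
console.log(result.content);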

See LLMLoadModelConfig for all configurable fields.