Chat Completions

Use llm.respond(...) to generate completions for a chat conversation.

Quick Example: Generate a Chat Response

The following snippet shows how to stream the AI's response to a quick chat prompt.

import { LMStudioClient } from "@lmstudio/sdk";
const client = new LMStudioClient();

const model = await client.llm.model();

for await (const fragment of model.respond("What is the meaning of life?")) {
  process.stdout.write(fragment.content);
}

Obtain a Model

First, you need to get a model handle. This can be done using the model method in the llm namespace. For example, here is how to use Qwen2.5 7B Instruct.

import { LMStudioClient } from "@lmstudio/sdk";
const client = new LMStudioClient();

const model = await client.llm.model("qwen2.5-7b-instruct");

There are other ways to get a model handle. See Managing Models in Memory for more info.
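
For example, you can load a model into memory explicitly instead of attaching to one that is already loaded. The following is a minimal sketch, assuming the llm namespace also exposes a load method that takes a model key (see Managing Models in Memory for the supported options).

import { LMStudioClient } from "@lmstudio/sdk";
const client = new LMStudioClient();

// Sketch: load the model into memory by key instead of attaching to an
// already-loaded model. `llm.load` is assumed here; see Managing Models in Memory.
const model = await client.llm.load("qwen2.5-7b-instruct");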

Manage Chat Context

The input to the model is referred to as the "context". Conceptually, the model receives a multi-turn conversation as input, and it is asked to predict the assistant's response in that conversation.

import { Chat } from "@lmstudio/sdk";

// Create a chat object from an array of messages.
const chat = Chat.from([
  { role: "system", content: "You are a resident AI philosopher." },
  { role: "user", content: "What is the meaning of life?" },
]);

See Working with Chats for more information on managing chat context.
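
You can also build the context incrementally. The sketch below uses Chat.empty() and chat.append(...), the same methods used in the multi-turn example later on this page, assuming append accepts a role string and content.

import { Chat } from "@lmstudio/sdk";

// Start with an empty chat and append messages as the conversation evolves.
const chat = Chat.empty();
chat.append("system", "You are a resident AI philosopher.");
chat.append("user", "What is the meaning of life?");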

Generate a Response

You can ask the LLM to predict the next response in the chat context using the respond() method.

// The `chat` object is created in the previous step.
const prediction = model.respond(chat);

for await (const { content } of prediction) {
  process.stdout.write(content);
}

console.info(); // Write a new line to prevent text from being overwritten by your shell.

Customize Inferencing Parameters

You can pass in inferencing parameters as the second parameter to .respond().

const prediction = model.respond(chat, {
  temperature: 0.6,
  maxTokens: 50,
});

See Configuring the Model for more information on what can be configured.
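
For example, you might combine sampling and stopping options in a single call. The option names below other than temperature and maxTokens (topPSampling, stopStrings) are assumptions based on common prediction config fields; confirm the exact names in Configuring the Model.

// Sketch only: `topPSampling` and `stopStrings` are assumed option names;
// check Configuring the Model for the exact fields your SDK version supports.
const prediction = model.respond(chat, {
  temperature: 0.6,
  maxTokens: 50,
  topPSampling: 0.9,     // assumed: nucleus sampling threshold
  stopStrings: ["\n\n"], // assumed: stop generation when one of these appears
});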

You can also print prediction metadata, such as the model used for generation, number of generated tokens, time to first token, and stop reason.

// If you have already iterated through the prediction fragments,
// doing this will not result in extra waiting.
const result = await prediction.result();

console.info("Model used:", result.modelInfo.displayName);
console.info("Predicted tokens:", result.stats.predictedTokensCount);
console.info("Time to first token (seconds):", result.stats.timeToFirstTokenSec);
console.info("Stop reason:", result.stats.stopReason);

Example: Multi-turn Chat

The following example implements a simple console chat bot. Each turn, it reads user input, appends it to the chat, streams the model's reply, and uses the onMessage callback to append the completed assistant message back to the chat so the next turn has the full history.

import { Chat, LMStudioClient } from "@lmstudio/sdk";
import { createInterface } from "readline/promises";

const rl = createInterface({ input: process.stdin, output: process.stdout });
const client = new LMStudioClient();
const model = await client.llm.model();
const chat = Chat.empty();

while (true) {
  const input = await rl.question("You: ");
  // Append the user input to the chat
  chat.append("user", input);

  const prediction = model.respond(chat, {
    // When the model finishes the entire message, push it to the chat
    onMessage: (message) => chat.append(message),
  });
  process.stdout.write("Bot: ");
  for await (const { content } of prediction) {
    process.stdout.write(content);
  }
  process.stdout.write("\n");
}