Chat Completions
Use llm.respond(...) to generate completions for a chat conversation.
The following snippet shows how to stream the AI's response to a quick chat prompt.
import { LMStudioClient } from "@lmstudio/sdk";
const client = new LMStudioClient();
const model = await client.llm.model();
for await (const fragment of model.respond("What is the meaning of life?")) {
  process.stdout.write(fragment.content);
}
First, you need to get a model handle. This can be done using the model method in the llm namespace. For example, here is how to use Qwen2.5 7B Instruct.
import { LMStudioClient } from "@lmstudio/sdk";
const client = new LMStudioClient();
const model = await client.llm.model("qwen2.5-7b-instruct");
There are other ways to get a model handle. See Managing Models in Memory for more info.
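For example, if the model is not already loaded, you can load a fresh instance explicitly and unload it when you are done. The snippet below is a minimal sketch based on the load() and unload() methods covered on that page; see Managing Models in Memory for the exact options.
import { LMStudioClient } from "@lmstudio/sdk";
const client = new LMStudioClient();
// Explicitly load an instance of the model into memory (sketch; see
// Managing Models in Memory for the supported loading options).
const model = await client.llm.load("qwen2.5-7b-instruct");
// ... use the model handle, then free the memory when finished.
await model.unload();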
The input to the model is referred to as the "context". Conceptually, the model receives a multi-turn conversation as input, and it is asked to predict the assistant's response in that conversation.
import { Chat } from "@lmstudio/sdk";
// Create a chat object from an array of messages.
const chat = Chat.from([
{ role: "system", content: "You are a resident AI philosopher." },
{ role: "user", content: "What is the meaning of life?" },
]);
See Working with Chats for more information on managing chat context.
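For example, instead of creating the whole conversation up front with Chat.from, you can start from an empty chat and append messages as they arrive. This minimal sketch mirrors the chatbot example at the end of this page; the system-role append is an assumption, while the user-role append is shown in that example.
import { Chat } from "@lmstudio/sdk";
// Start with an empty chat and build it up message by message.
const chat = Chat.empty();
chat.append("system", "You are a resident AI philosopher."); // role other than "user" is an assumption here
chat.append("user", "What is the meaning of life?");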
You can ask the LLM to predict the next response in the chat context using the respond() method.
// The `chat` object is created in the previous step.
const prediction = model.respond(chat);
for await (const { content } of prediction) {
  process.stdout.write(content);
}
console.info(); // Write a new line to prevent text from being overwritten by your shell.
You can pass in inferencing parameters as the second parameter to .respond().
const prediction = model.respond(chat, {
  temperature: 0.6,
  maxTokens: 50,
});
See Configuring the Model for more information on what can be configured.
You can also print prediction metadata, such as the model used for generation, number of generated tokens, time to first token, and stop reason.
// If you have already iterated through the prediction fragments,
// doing this will not result in extra waiting.
const result = await prediction.result();
console.info("Model used:", result.modelInfo.displayName);
console.info("Predicted tokens:", result.stats.predictedTokensCount);
console.info("Time to first token (seconds):", result.stats.timeToFirstTokenSec);
console.info("Stop reason:", result.stats.stopReason);
Finally, here is a complete example of a multi-turn console chatbot. It uses respond() together with the onMessage callback to keep the chat context up to date after each turn.
import { Chat, LMStudioClient } from "@lmstudio/sdk";
import { createInterface } from "readline/promises";

const rl = createInterface({ input: process.stdin, output: process.stdout });
const client = new LMStudioClient();
const model = await client.llm.model();
const chat = Chat.empty();

while (true) {
  const input = await rl.question("You: ");
  // Append the user input to the chat
  chat.append("user", input);

  const prediction = model.respond(chat, {
    // When the model finishes the entire message, push it to the chat
    onMessage: (message) => chat.append(message),
  });
  process.stdout.write("Bot: ");
  for await (const { content } of prediction) {
    process.stdout.write(content);
  }
  process.stdout.write("\n");
}