Chat Completions
Use llm.respond(...) to generate completions for a chat conversation.
The following snippet shows how to obtain the AI's response to a quick chat prompt.
import lmstudio as lms
model = lms.llm()
print(model.respond("What is the meaning of life?"))
The following snippet shows how to stream the AI's response to a chat prompt, displaying text fragments as they are received (rather than waiting for the entire response to be generated before displaying anything).
import lmstudio as lms
model = lms.llm()
for fragment in model.respond_stream("What is the meaning of life?"):
    print(fragment.content, end="", flush=True)
print() # Advance to a new line at the end of the response
First, you need to get a model handle. This can be done using the top-level llm convenience API, or the model method in the llm namespace when using the scoped resource API.
For example, here is how to use Qwen2.5 7B Instruct.
import lmstudio as lms
model = lms.llm("qwen2.5-7b-instruct")
There are other ways to get a model handle. See Managing Models in Memory for more info.
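For comparison, here is a minimal sketch of the scoped resource API mentioned above, assuming the same model identifier; the client releases its resources when the with block exits.
import lmstudio as lms

# Scoped resource API: the client is cleaned up when the block exits.
with lms.Client() as client:
    model = client.llm.model("qwen2.5-7b-instruct")
    print(model.respond("What is the meaning of life?"))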
The input to the model is referred to as the "context". Conceptually, the model receives a multi-turn conversation as input, and it is asked to predict the assistant's response in that conversation.
import lmstudio as lms
# Create a chat with an initial system prompt.
chat = lms.Chat("You are a resident AI philosopher.")
# Build the chat context by adding messages of relevant types.
chat.add_user_message("What is the meaning of life?")
# ... continued in next example
See Working with Chats for more information on managing chat context.
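As a brief illustration of managing context beyond a single user message, the sketch below assumes the Chat class also provides a from_history constructor for building a chat from an existing message list (see Working with Chats for the authoritative details).
import lmstudio as lms

# Assumed constructor: build a chat from an existing message history.
chat = lms.Chat.from_history({
    "messages": [
        {"role": "system", "content": "You are a resident AI philosopher."},
        {"role": "user", "content": "What is the meaning of life?"},
    ]
})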
You can ask the LLM to predict the next response in the chat context using the respond() method.
# The `chat` object is created in the previous step.
result = model.respond(chat)
print(result)
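To continue the conversation, record the assistant's reply in the chat before adding the next user message. The sketch below assumes Chat exposes an add_assistant_response method and that the result's text is available as result.content; the multi-turn example further down uses the on_message callback instead.
# Assumes `model`, `chat`, and `result` from the snippets above, plus an
# add_assistant_response method for recording the assistant's reply.
chat.add_assistant_response(result.content)
chat.add_user_message("Can you summarize that in one sentence?")
print(model.respond(chat))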
You can pass in inferencing parameters via the config keyword parameter on .respond() or .respond_stream().
prediction_stream = model.respond_stream(chat, config={
    "temperature": 0.6,
    "maxTokens": 50,
})
See Configuring the Model for more information on what can be configured.
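The same config parameter also works with the non-streaming respond() call; here is a minimal sketch using the fields shown above.
# Non-streaming prediction with the same inferencing parameters.
result = model.respond(chat, config={
    "temperature": 0.6,
    "maxTokens": 50,
})
print(result)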
You can also print prediction metadata, such as the model used for generation, number of generated tokens, time to first token, and stop reason.
# After iterating through the prediction fragments,
# the overall prediction result may be obtained from the stream
result = prediction_stream.result()
print("Model used:", result.model_info.display_name)
print("Predicted tokens:", result.stats.predicted_tokens_count)
print("Time to first token (seconds):", result.stats.time_to_first_token_sec)
print("Stop reason:", result.stats.stop_reason)
The following snippet shows a complete multi-turn chat session in the terminal: each user message is added to the chat, the response is streamed to the screen, and the on_message callback appends the finished response to the chat so the context stays up to date for the next turn.
import lmstudio as lms
model = lms.llm()
chat = lms.Chat("You are a task focused AI assistant")
while True:
    try:
        user_input = input("You (leave blank to exit): ")
    except EOFError:
        print()
        break
    if not user_input:
        break
    chat.add_user_message(user_input)
    prediction_stream = model.respond_stream(
        chat,
        on_message=chat.append,
    )
    print("Bot: ", end="", flush=True)
    for fragment in prediction_stream:
        print(fragment.content, end="", flush=True)
    print()