
LM Studio Server

You can use LLMs you load within LM Studio via an API server running on localhost.

Requests and responses follow OpenAI's API format.

Point any code that currently uses OpenAI to localhost:PORT to use a local LLM instead.
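
For example, a request that currently targets OpenAI's hosted API only needs its base URL changed to reach the local server. A minimal sketch, assuming the default port 1234 and a model already loaded in LM Studio (the model name and prompt are just placeholders; no API key header is needed for the local server, as in the examples below):

# Hosted OpenAI API (requires an API key)
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "user", "content": "Hello" } ] }'

# The same request pointed at the local LM Studio server instead
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "messages": [ { "role": "user", "content": "Hello" } ] }'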

Supported endpoints

GET /v1/models
POST /v1/chat/completions
POST /v1/embeddings
POST /v1/completions

POST /v1/embeddings is new in LM Studio 0.2.19. Read about it here.
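
Since requests follow OpenAI's API format, an embeddings call takes an "input" string and a "model" identifier. A minimal sketch, assuming the server runs on the default port 1234 and the loaded model can produce embeddings (the model value below is a placeholder):

curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Some text to embed",
    "model": "model-identifier-here"
  }'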

Using the local server

  1. If you haven't yet, install LM Studio. Get the app installer from https://lmstudio.ai.
  2. From within the app, search for and download an LLM such as TheBloke/Mistral-7B-Instruct-v0.2-GGUF (about 4GB on disk).
  3. Head to the Local Server tab (<-> on the left).
  4. Load any LLM you downloaded by choosing it from the dropdown.
  5. Start the server by clicking on the green Start Server button.

Your LM Studio will now be ready to accept incoming API requests. You can safely minimize the app; the server will keep running.

Check which models are currently loaded

curl http://localhost:1234/v1/models

Response (following OpenAI's format)

{
  "data": [
    {
      "id": "TheBloke/phi-2-GGUF/phi-2.Q4_K_S.gguf",
      "object": "model",
      "owned_by": "organization-owner",
      "permission": [
        {}
      ]
    },
    {
      "id": "lmstudio-ai/gemma-2b-it-GGUF/gemma-2b-it-q4_k_m.gguf",
      "object": "model",
      "owned_by": "organization-owner",
      "permission": [
        {}
      ]
    }
  ],
  "object": "list"
}

In this case both TheBloke/phi-2-GGUF and lmstudio-ai/gemma-2b-it-GGUF are loaded.
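
If you only need the model identifiers (for example, to pass as the "model" field in a request), a small sketch using jq, assuming it is installed (jq is not part of LM Studio):

curl -s http://localhost:1234/v1/models | jq -r '.data[].id'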

Make an inferencing request (using OpenAI's 'Chat Completions' format)

In this example the local server is running on port 1234. You can change it in the server control bar in the app.

  1. Open your terminal (Try Git Bash on Windows)
  2. Copy and run the following request
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful coding assistant." },
      { "role": "user", "content": "How do I init and update a git submodule?" }
    ],
    "temperature": 0.7,
    "max_tokens": -1,
    "stream": true
  }'

Supported payload parameters

For an explanation of each parameter, see https://platform.openai.com/docs/api-reference/chat/create.

model
top_p
top_k
messages
temperature
max_tokens
stream
stop
presence_penalty
frequency_penalty
logit_bias
repeat_penalty
seed
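
As an illustrative sketch (the values below are placeholders, not recommendations), several of these parameters can be combined in a single request:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Write a haiku about git submodules." }
    ],
    "temperature": 0.8,
    "max_tokens": 128,
    "stop": ["\n\n"],
    "seed": 42,
    "presence_penalty": 0.5,
    "repeat_penalty": 1.1,
    "stream": false
  }'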