Introducing the unified multi-modal MLX engine architecture in LM Studio
LM Studio's MLX engine (MIT) leverages two powerful Python packages for running LLMs efficiently on Apple Silicon M chips: mlx-lm for text generation (by @awni, @angeloskath, Apple) and mlx-vlm for vision-enabled language models (by @Blaizzy).

Starting with mlx-engine commit f98317e (in-app engine v0.17.0 and up), we migrated to a new unified architecture that weaves together the foundational components of each of those packages. Now mlx-lm's text model implementations are always used, and mlx-vlm's vision model implementations are modularly used as "add-ons" to generate image embeddings that can be understood by the text model.
Figure: the new mlx-engine unified vision model architecture. mlx-lm text models are extended with mlx-vlm vision add-ons.
This massively improves performance and user experience when using multi-modal MLX VLMs (e.g., Google's Gemma 3). For example, text-only chats with VLMs can now benefit from prompt caching, a feature previously exclusive to text-only LLMs, resulting in drastically faster follow-up responses. This makes MLX VLMs seamlessly interchangeable with text-only LLMs for text tasks, while offering vision capabilities as a bonus.
👓 Read on for a technical deep dive into the problem, the solution, and how we achieved this unified architecture in LM Studio's MLX engine.
A multi-modal LLM is an LLM that can take multiple modalities of input. This means that in addition to being able to process text input, the LLM can also accept images and/or audio input.
The new MLX engine does not yet handle audio, but we plan for this approach to work for audio input as well.
Generally, vision-capable LLMs ingest image input through the following flow:
Figure: general operational flow of a multi-modal vision LLM. Images are converted into embeddings that the text model can understand, and the text model then generates output.
"What is this?"
→ [0.83, 0.40, 0.67, ...]
image.png
→ [0.28, 0.11, 0.96, ...]
[0.83, 0.40, 0.67, ...]
+ [0.28, 0.11, 0.96, ...]
→ [0.83, 0.40, 0.67, ..., 0.28, 0.11, 0.96, ...]
If there are no images in the prompt, then the "merged embeddings" are simply the text embeddings.
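To make the merging step concrete, here is a minimal Python sketch using MLX. It simply concatenates image embeddings onto the text embeddings; real VLMs typically splice image embeddings in at the positions of special image tokens, and the function name below is illustrative rather than taken from mlx-engine.

from typing import Optional

import mlx.core as mx


def merge_embeddings(
    text_embeddings: mx.array,
    image_embeddings: Optional[mx.array] = None,
) -> mx.array:
    # No images in the prompt: the merged embeddings are just the text embeddings
    if image_embeddings is None:
        return text_embeddings
    # Otherwise, join the two embedding sequences along the sequence (token) axis
    return mx.concatenate([text_embeddings, image_embeddings], axis=0)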
In the MLX Python landscape, there are two major libraries providing model implementations and infrastructure for interacting with LLMs on Apple Silicon:

- mlx-lm: text model implementations and infrastructure for interacting with them
- mlx-vlm: text model implementations, vision model implementations, and infrastructure for interacting with them

Historically, mlx-lm has contained implementations for text-only models without multi-modal functionality, while mlx-vlm is the de-facto home for MLX VLM implementations.
Figure: model implementation components in mlx-lm and mlx-vlm. Yellow = text model components, blue = vision model components.
LM Studio's mlx-engine was originally developed with a simple "forked" architecture to support both text-only models and vision-capable models:
Figure: the old mlx-engine forked vision model architecture. Yellow = text model components, blue = vision model components.
If a model was vision-capable, the mlx-vlm model implementation (text + vision) would be used exclusively. If the model was text-only, the mlx-lm model implementation (text) would be used exclusively.
A distinct text model implementation, from either mlx-lm or mlx-vlm, was used in each path.
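In code, the old fork boiled down to something like the sketch below. The is_vision_model flag stands in for mlx-engine's actual model-type detection, so treat this as illustrative only; both load functions are the public entry points of the respective packages.

from mlx_lm import load as load_text_model     # mlx-lm: text model implementations
from mlx_vlm import load as load_vision_model  # mlx-vlm: text + vision model implementations


def load_forked(model_path: str, is_vision_model: bool):
    if is_vision_model:
        # Vision-capable model: mlx-vlm's implementation (text + vision) is used exclusively
        return load_vision_model(model_path)
    # Text-only model: mlx-lm's implementation is used exclusively
    return load_text_model(model_path)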
mlx-engine's naive forked architecture suffered from the following problems:

- mlx-lm and mlx-vlm text model implementations are not perfectly at parity and have nuanced differences in behavior. Which one should we use, mlx-lm or mlx-vlm? Should we hot-swap loading between them depending on the request (complex)?
- Two separate versions of the same text model exist.
We sought to combine core components of mlx-lm and mlx-vlm to create a "unified" (no forking) inference engine for all MLX LLMs and VLMs.
After invaluable discussion with @awni and @Blaizzy, we enabled this through the following contributions:

- mlx-lm: a new input_embeddings argument for stream_generate() (detailed below)
- mlx-vlm: mlx-vlm#335 and mlx-vlm#355
Figure: the new mlx-engine unified vision model architecture. mlx-lm text models are extended with mlx-vlm vision add-ons.
In this unified architecture, we always load the core text model of a multi-modal LLM from mlx-lm (2), rather than potentially loading a slightly different text model from mlx-vlm.
We then conditionally load a VisionAddOn that uses mlx-vlm functionality (3, 4) to generate embeddings from images that the mlx-lm text model can understand (see the Gemma3VisionAddOn implementation in mlx-engine).
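A minimal sketch of this unified load path is below, assuming a model-type-to-add-on mapping like the VISION_ADD_ON_MAP shown later in this post; the function and parameter names are illustrative, not mlx-engine's exact code.

from mlx_lm import load as load_text_model


def load_unified(model_path: str, model_type: str, vision_add_on_map: dict):
    # (2) The core text model always comes from mlx-lm
    text_model, tokenizer = load_text_model(model_path)
    # (3, 4) Attach an mlx-vlm-backed vision add-on only for supported multi-modal model types
    add_on_cls = vision_add_on_map.get(model_type)
    vision_add_on = add_on_cls() if add_on_cls is not None else None
    return text_model, tokenizer, vision_add_on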
With this setup, we are able to run inference on multi-modal models in a streamlined, single-path fashion. This helps us release cleaner, more maintainable LM Studio MLX engines with improved performance.
Figure: Gemma 3 12B QAT, both MLX 4-bit, on an M3 MacBook Pro. ~25x faster follow-up TTFT (time to first token) with the unified architecture.
VisionAddOns in mlx-engine
The gist of LM Studio's new mlx-engine unified architecture is that it allows us to use the text model implementations from mlx-lm for all multi-modal models, while still being able to leverage the vision model components from mlx-vlm to generate image embeddings that can be understood by the text model.
This is achieved by introducing VisionAddOns (source), which are modular components that can be used to generate image embeddings for multi-modal models. These VisionAddOns implement a common interface defined by the BaseVisionAddOn abstract class, for example:
from abc import abstractmethod
from typing import List

import mlx.core as mx
import mlx.nn as nn


class BaseVisionAddOn:
    """
    Base class that defines the interface for a VisionAddOn.
    """

    @abstractmethod
    def __init__(self):
        """
        Where load of vision model components is intended to occur.
        """

    @abstractmethod
    def compute_embeddings(
        self,
        text_model: nn.Module,
        prompt_tokens: mx.array,
        images_b64: List[str],
    ) -> mx.array:
        """
        Returns input embeddings for the language model after text/image merging of the prompt.
        """
VisionAddOns generate image embeddings that can be passed via the new input_embeddings argument of mlx-lm's stream_generate() function (see the commit made to mlx-lm).
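Putting the pieces together, generation with a vision add-on looks roughly like the sketch below. It assumes an mlx-lm version that includes the input_embeddings argument described above; the function name is illustrative, and the exact stream_generate() call shape may vary between mlx-lm versions.

from mlx_lm import stream_generate


def generate_with_vision_add_on(text_model, tokenizer, vision_add_on, prompt_tokens, images_b64):
    # Merge text and image embeddings via the add-on (backed by mlx-vlm components)
    input_embeddings = vision_add_on.compute_embeddings(text_model, prompt_tokens, images_b64)
    # Feed the merged embeddings to mlx-lm's generator alongside the prompt tokens
    for response in stream_generate(
        text_model,
        tokenizer,
        prompt=prompt_tokens,
        input_embeddings=input_embeddings,
    ):
        yield response.text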
Right now, Gemma 3 (Gemma3VisionAddOn) and Pixtral (PixtralVisionAddOn) are the only two models that have been migrated to the unified architecture. However, the architecture is designed such that more VisionAddOns can easily be added to mlx-engine's vision_add_ons directory and then wired into the ModelKit here:
VISION_ADD_ON_MAP = {
    "gemma3": Gemma3VisionAddOn,
    "pixtral": PixtralVisionAddOn,
}
Contributions to extend this pattern in our open source repo https://github.com/lmstudio-ai/mlx-engine are extremely welcome! For example, see the mlx-engine issue "Extend VisionAddOn Pattern to Qwen2.5VL" (#167).
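Once such an add-on exists, wiring it in could be as small as one new map entry; the key and class name below are hypothetical placeholders for that issue's Qwen2.5-VL work.

VISION_ADD_ON_MAP = {
    "gemma3": Gemma3VisionAddOn,
    "pixtral": PixtralVisionAddOn,
    "qwen2_5_vl": Qwen2_5VLVisionAddOn,  # hypothetical: see issue #167
}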
mlx-engine repo: https://github.com/lmstudio-ai/mlx-engine