← All Models

GLM-4.6V-Flash

949 Downloads

GLM 4.6V Flash is a 9B vision-language model optimized for local deployment and low-latency applications.

Models
Updated 18 hours ago
8.00 GB

Memory Requirements

To run the smallest GLM-4.6V-Flash, you need at least 8 GB of RAM.

Capabilities

GLM-4.6V-Flash models support tool use, vision input, and reasoning. They are available in mlx.

About GLM-4.6V-Flash

undefined

GLM 4.6V Flash is a 9B vision-language model optimized for local deployment and low-latency applications. It supports a context length of 128k tokens and achieves strong performance in visual understanding among models of similar scale.

The model introduces native multimodal function calling, enabling vision-driven tool use where images, screenshots, and document pages can be passed directly as tool inputs without text conversion.

Features

GLM-4.6V introduces several key features:

  • Native Multimodal Function Calling Enables native vision-driven tool use. Images, screenshots, and document pages can be passed directly as tool inputs without text conversion, while visual outputs (charts, search images, rendered pages) are interpreted and integrated into the reasoning chain. This closes the loop from perception to understanding to execution.

  • Interleaved Image-Text Content Generation Supports high-quality mixed media creation from complex multimodal inputs. GLM-4.6V takes a multimodal context—spanning documents, user inputs, and tool-retrieved images—and synthesizes coherent, interleaved image-text content tailored to the task. During generation it can actively call search and retrieval tools to gather and curate additional text and visuals, producing rich, visually grounded content.

  • Multimodal Document Understanding GLM-4.6V can process up to 128K tokens of multi-document or long-document input, directly interpreting richly formatted pages as images. It understands text, layout, charts, tables, and figures jointly, enabling accurate comprehension of complex, image-heavy documents without requiring prior conversion to plain text.

  • Frontend Replication & Visual Editing Reconstructs pixel-accurate HTML/CSS from UI screenshots and supports natural-language-driven edits. It detects layout, components, and styles visually, generates clean code, and applies iterative visual modifications through simple user instructions.

Benchmarks

undefined

License

GLM-4.6V is provided under the MIT license.