LFM2-24B-A2B

2.2K Downloads

LFM2 is a family of hybrid models designed for on-device deployment. LFM2-24B-A2B is the largest model in the family, scaling the architecture to 24 billion parameters while keeping inference efficient.

Models
Updated 7 hours ago
14.00 GB

Memory Requirements

To run the smallest LFM2-24B-A2B variant, you need at least 14 GB of RAM.
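
As a rough sanity check on the 14 GB figure, the sketch below computes the average bits per weight implied by a 14 GB file holding 24 billion parameters. It assumes decimal gigabytes and counts weights only; KV cache, activations, and runtime overhead are ignored, so real RAM use will be somewhat higher. The function names are illustrative.

```python
def implied_bits_per_weight(file_size_gb: float, total_params_billions: float) -> float:
    """Average bits stored per parameter for a model file of the given size."""
    size_bits = file_size_gb * 1e9 * 8
    return size_bits / (total_params_billions * 1e9)


def weight_memory_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in decimal GB."""
    return total_params_billions * 1e9 * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # 14 GB and 24B parameters are the figures quoted on this page.
    print(f"implied bits per weight: {implied_bits_per_weight(14.0, 24.0):.2f}")
    # For comparison: the same weights stored at 16 bits each.
    print(f"16-bit weights: {weight_memory_gb(24.0, 16):.0f} GB")
```

The result, about 4.7 bits per weight, suggests that the smallest variant is a quantized build. At 16 bits per weight the same parameters would need roughly 48 GB.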

Capabilities

LFM2-24B-A2B models support tool use. They are available in GGUF and MLX formats.

About LFM2-24B-A2B

  • Best-in-class efficiency: A 24B MoE model with only about 2.3B active parameters per token, fitting in 32 GB of RAM for deployment on consumer laptops and desktops.
  • Fast edge inference: 112 tok/s decode on AMD CPU, 293 tok/s on H100. Fits in 32 GB of RAM, with day-one support for llama.cpp, vLLM, and SGLang.
  • Predictable scaling: Quality improves log-linearly from 350M to 24B total parameters, confirming the LFM2 hybrid architecture scales reliably across nearly two orders of magnitude.
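
To put the decode rates in practical terms, the following sketch converts them into wall-clock time for a reply of a given length. The 500-token reply length is an arbitrary example, and the estimate covers decode only: prompt processing (prefill) time is not included.

```python
def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


# Decode rates quoted in the list above.
DECODE_RATES = {"AMD CPU": 112.0, "H100": 293.0}

if __name__ == "__main__":
    reply_length = 500  # hypothetical reply length, in tokens
    for hardware, rate in DECODE_RATES.items():
        print(f"{hardware}: {decode_seconds(reply_length, rate):.1f} s")
```

At these rates a 500-token reply takes about 4.5 seconds on the CPU and about 1.7 seconds on the H100.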

Model details

LFM2-24B-A2B is a general-purpose instruct model (without reasoning traces) with the following features:

| Property | LFM2-8B-A1B | LFM2-24B-A2B |
| --- | --- | --- |
| Total parameters | 8.3B | 24B |
| Active parameters | 1.5B | 2.3B |
| Layers | 24 (18 conv + 6 attn) | 40 (30 conv + 10 attn) |
| Context length | 32,768 tokens | 32,768 tokens |
| Vocabulary size | 65,536 | 65,536 |
| Training precision | Mixed BF16/FP8 | Mixed BF16/FP8 |
| Training budget | 12 trillion tokens | 17 trillion tokens |
| License | LFM Open License v1.0 | LFM Open License v1.0 |
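
A few ratios derived from the table make the comparison easier to read. The sketch below uses only the figures listed above; the dictionary layout and the `summarize` helper are illustrative.

```python
# Figures copied from the table (parameters in billions, tokens in trillions).
MODELS = {
    "LFM2-8B-A1B": {"total_b": 8.3, "active_b": 1.5, "conv": 18, "attn": 6, "tokens_t": 12.0},
    "LFM2-24B-A2B": {"total_b": 24.0, "active_b": 2.3, "conv": 30, "attn": 10, "tokens_t": 17.0},
}


def summarize(spec: dict) -> dict:
    """Derive simple ratios from one model's table row."""
    return {
        # Share of parameters used for each token.
        "active_fraction": spec["active_b"] / spec["total_b"],
        # Share of layers that are convolution blocks.
        "conv_share": spec["conv"] / (spec["conv"] + spec["attn"]),
        # Training tokens seen per total parameter.
        "tokens_per_param": spec["tokens_t"] * 1e12 / (spec["total_b"] * 1e9),
    }


if __name__ == "__main__":
    for name, spec in MODELS.items():
        stats = summarize(spec)
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

Both models keep the same layer mix of three convolution blocks per attention block. The 24B model activates roughly 9.6% of its parameters per token, compared with about 18% for the 8B model.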

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Spanish, Portuguese

Generation parameters:

  • temperature: 0.1
  • top_k: 50
  • repetition_penalty: 1.05
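
To illustrate what these three settings do, here is a minimal, dependency-free sketch of one sampling step. It is not the sampler used by any particular runtime. The order of operations (repetition penalty, then temperature, then top-k) and the divide-or-multiply penalty rule follow a common convention and are assumptions here, and `sample_next_token` is a hypothetical helper name.

```python
import math
import random
from typing import Optional, Sequence


def sample_next_token(
    logits: Sequence[float],
    previous_tokens: Sequence[int],
    temperature: float = 0.1,
    top_k: int = 50,
    repetition_penalty: float = 1.05,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the next token id from raw logits (temperature must be > 0)."""
    rng = rng or random.Random()
    scores = list(logits)

    # Repetition penalty: make already-generated tokens slightly less likely.
    for tok in set(previous_tokens):
        if scores[tok] > 0:
            scores[tok] /= repetition_penalty
        else:
            scores[tok] *= repetition_penalty

    # Temperature: values below 1 sharpen the distribution toward the top token.
    scores = [s / temperature for s in scores]

    # Top-k: keep only the k highest-scoring token ids.
    k = min(top_k, len(scores))
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax over the kept tokens (subtract the max for numerical stability).
    max_score = scores[kept[0]]
    weights = [math.exp(scores[i] - max_score) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With a temperature as low as 0.1, sampling is close to greedy decoding: a logit gap of 1.0 between the top two tokens becomes a gap of 10 after scaling, which leaves the runner-up with a negligible probability.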

Liquid recommends the following use cases:

  • Agentic tool use: Native function calling, web search, structured outputs. Ideal as the fast inner-loop model in multi-step agent pipelines.
  • Offline document summarization and Q&A: Run entirely on consumer hardware for privacy-sensitive workflows (legal, medical, corporate).
  • Privacy-preserving customer support agent: Deployed on-premise at a company, handles multi-turn support conversations with tool access (database lookups, ticket creation) without data leaving the network.
  • Local RAG pipelines: Serve as the generation backbone in retrieval-augmented setups on a single machine without GPU servers.
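
As an illustration of the local RAG use case, the sketch below shows the retrieval and prompt-assembly half of such a pipeline using only the Python standard library. The bag-of-words cosine scoring, the prompt wording, and all function names are assumptions made for the example; a real setup would typically use an embedding model for retrieval and would pass the assembled prompt to whichever local runtime serves LFM2-24B-A2B.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble a grounded prompt from the retrieved passages."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Everything here runs on a single machine, so the documents and the query never leave it; only the generation step still needs the model.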

Architecture

LFM2 is a hybrid architecture that pairs efficient gated short convolution blocks with a small number of grouped query attention (GQA) blocks.

This design, developed through hardware-in-the-loop architecture search, gives LFM2 models fast prefill and decode at low memory cost. LFM2-24B-A2B applies this backbone in a Mixture of Experts configuration: with 24B total parameters but only 2.3B active per forward pass, it keeps per-token inference cost close to that of a roughly 2B dense model while drawing on the capacity of a much larger one.
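
The gap between total and active parameters comes from sparse expert routing. The sketch below shows generic top-k routing in plain Python. It is not LFM2's actual router: the expert count, the value of `top_k`, and the softmax gate normalization are illustrative assumptions, since the page does not specify them.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Vector:
    """Run only the top_k highest-scoring experts and mix their outputs."""
    chosen = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    gates = softmax([router_logits[i] for i in chosen])

    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)  # experts that were not chosen are never evaluated
        for j, value in enumerate(y):
            out[j] += gate * value
    return out
```

Because unselected experts are skipped entirely, per-token compute scales with the number of active experts rather than with the total parameter count. All expert weights still have to be held in memory.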

Benchmarks

Across benchmarks including GPQA Diamond, MMLU-Pro, IFEval, IFBench, GSM8K, and MATH-500, quality improves log-linearly as we scale from 350M to 24B total parameters. This roughly 70x parameter range confirms that the LFM2 hybrid architecture follows predictable scaling behavior and does not hit a ceiling at small model sizes.
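
"Log-linear" here means that a benchmark score grows roughly as a + b * log10(N), where N is the total parameter count. The sketch below computes the size of the parameter range and shows how such a line could be fitted by ordinary least squares. No benchmark scores are included, because this page does not list them; `fit_log_linear` is a hypothetical helper to be fed real measurements.

```python
import math
from typing import Sequence, Tuple


def scale_range(smallest: float, largest: float) -> Tuple[float, float]:
    """Return (ratio, orders of magnitude) between two parameter counts."""
    ratio = largest / smallest
    return ratio, math.log10(ratio)


def fit_log_linear(param_counts: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of score = a + b * log10(N); returns (a, b)."""
    xs = [math.log10(n) for n in param_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


if __name__ == "__main__":
    ratio, orders = scale_range(350e6, 24e9)
    print(f"{ratio:.1f}x range, {orders:.2f} orders of magnitude")
```

From 350M to 24B parameters the range is about 68.6x, or roughly 1.8 orders of magnitude.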

License

LFM2 is provided under the custom LFM Open License v1.0.