Liquid AI Releases LFM2.5-1.2B-Thinking: a 1.2B Parameter Reasoning Model That Fits Under 1 GB On-Device
Liquid AI has released LFM2.5-1.2B-Thinking, a 1.2 billion parameter reasoning model that runs fully on-device and fits in about 900 MB of memory on a modern phone. Workloads that needed a data center two years ago can now run offline on consumer hardware, with a focus on structured reasoning traces, tool use, and math rather than general chat.
Position in the LFM2.5 family and core specs
LFM2.5-1.2B-Thinking is part of the LFM2.5 family of Liquid Foundation Models, which extends the earlier LFM2 architecture with more pre-training and multi-stage reinforcement learning for edge deployment.
The model is text only and general purpose with the following configuration:
- 1.17B parameters, reported as a 1.2B class model
- 16 layers, with 10 double-gated LIV convolution blocks and 6 GQA blocks
- Training budget of 28T tokens
- Context length of 32,768 tokens
- Vocabulary size of 65,536
- 8 languages: English, Arabic, Chinese, French, German, Japanese, Korean, Spanish
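If the checkpoint is published on Hugging Face, these figures can be cross-checked against the model config. The snippet below is only a sketch: it assumes a repository id of LiquidAI/LFM2.5-1.2B-Thinking and generic transformers config field names, both of which should be verified against the actual model card.

```python
from transformers import AutoConfig

# Assumed repository id; confirm the exact name on the Hugging Face Hub.
config = AutoConfig.from_pretrained("LiquidAI/LFM2.5-1.2B-Thinking")

# Attribute names below follow generic transformers conventions; the LFM2.5
# config class may expose these values under different names.
print(config.vocab_size)                                 # expected 65,536
print(getattr(config, "num_hidden_layers", None))        # expected 16
print(getattr(config, "max_position_embeddings", None))  # expected 32,768
```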
Reasoning first behavior and thinking traces
The ‘Thinking’ variant is trained specifically for reasoning. At inference time it produces internal thinking traces before the final answer. These traces are chains of intermediate steps that the model uses to plan tool calls, verify partial results, and work through multi-step instructions.
The Liquid AI team recommends this model for agentic tasks, data extraction pipelines, and retrieval augmented generation flows where you want explicit reasoning and verifiable intermediate steps. A practical way to think about it: use LFM2.5-1.2B-Thinking as the planning brain inside agents and tools, and use other models when you need broad world knowledge or code-heavy workflows.
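In practice, an application needs to separate the thinking trace from the user-facing answer. The sketch below is a minimal, hypothetical parser that assumes the model wraps its reasoning in `<think>...</think>` tags, as many open reasoning models do; the actual delimiters should be confirmed against the model's chat template.

```python
import re

# Assumption: reasoning is emitted between <think> and </think> tags before
# the final answer. Verify the real delimiters in the model's chat template.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Return (thinking_trace, final_answer) from a raw completion string."""
    match = THINK_PATTERN.search(raw_output)
    if match is None:
        return "", raw_output.strip()
    trace = match.group(1).strip()
    answer = raw_output[match.end():].strip()
    return trace, answer

trace, answer = split_reasoning(
    "<think>12 * 7 = 84, then 84 + 6 = 90.</think>The result is 90."
)
print(trace)   # intermediate steps, useful for logging and verification
print(answer)  # what the user actually sees
```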
Benchmarks versus other 1B class models
The Liquid AI team evaluates LFM2.5-1.2B-Thinking against models around 1B parameters on a suite of reasoning and instruction benchmarks.


Compared to LFM2.5-1.2B-Instruct, three metrics improve strongly: math reasoning rises from about 63 to 88 on MATH-500, instruction following from about 61 to 69 on Multi-IF, and tool use from about 49 to 57 on BFCLv3.
LFM2.5-1.2B-Thinking is competitive with Qwen3-1.7B in thinking mode on most reasoning benchmarks while using around 40 percent fewer parameters and fewer output tokens on average. It also outperforms other 1B class baselines such as Granite-4.0-H-1B, Granite-4.0-1B, Gemma-3-1B-IT, and Llama-3.2-1B-Instruct on many of these tasks.
Training recipe and doom looping mitigation
Reasoning models often suffer from doom looping, where the model repeats fragments of its chain of thought instead of finishing the answer. LFM2.5-1.2B-Thinking uses a multi-stage training pipeline to reduce this.
The process starts with mid-training that includes reasoning traces, so the model learns a ‘reason first, then answer’ pattern. Supervised fine-tuning on synthetic chains then improves chain-of-thought generation. After that, preference alignment and RLVR are applied. In preference alignment, the research team generates 5 temperature-sampled candidates and 1 greedy candidate per prompt and uses an LLM judge to pick preferred and rejected outputs, while also labeling looping outputs explicitly. During RLVR, an n-gram repetition penalty is added early in training. This reduces the doom-loop rate from 15.74 percent at mid-training to 0.36 percent after RLVR on a set of representative prompts.
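Liquid AI does not spell out the exact penalty formulation, but the underlying idea is straightforward: reward completions less when their n-grams repeat, so the policy learns to stop looping. The function below is a hypothetical sketch of such a penalty, not the team's implementation.

```python
from collections import Counter

def ngram_repetition_penalty(token_ids: list[int], n: int = 4, weight: float = 1.0) -> float:
    """Hypothetical penalty: fraction of duplicated n-grams, scaled by `weight`.

    Returns 0.0 for a sequence with no repeated n-grams and approaches
    `weight` as the completion degenerates into a loop.
    """
    if len(token_ids) < n:
        return 0.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return weight * repeated / len(ngrams)

# A looping completion gets a large penalty, a clean one gets ~0.
print(ngram_repetition_penalty([1, 2, 3, 4] * 10))  # high, close to 1.0
print(ngram_repetition_penalty(list(range(40))))    # 0.0
```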
The result is a small reasoning model that can produce thinking traces without getting stuck in long repetitive outputs, which is important for interactive agents and on device UX.
Inference performance and hardware footprint
A key design target is fast inference with a small memory footprint on CPUs and NPUs. LFM2.5-1.2B-Thinking decodes at about 239 tokens per second on an AMD CPU and about 82 tokens per second on a mobile NPU while staying under 1 GB of memory, with broad day-one support for llama.cpp, MLX, and vLLM.
The detailed hardware table uses 1K prefill and 100 decode tokens and gives the following examples for LFM2.5-1.2B-Thinking:


These numbers show that the model fits comfortably under 1 GB on phones and embedded devices while sustaining useful throughputs even at long contexts.
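To sanity-check throughput figures like these on your own hardware, a rough decode-rate measurement with the llama-cpp-python bindings and a GGUF build of the model can look like the sketch below; the model path is a placeholder, and results will vary with quantization, thread count, and context length.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point this at whichever GGUF quantization you downloaded.
llm = Llama(model_path="lfm2.5-1.2b-thinking.gguf", n_ctx=4096)

prompt = "Explain step by step why 17 * 23 = 391."
max_tokens = 100  # mirrors the 100 decode tokens used in the hardware table

start = time.perf_counter()
out = llm(prompt, max_tokens=max_tokens)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```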
Key Takeaways
- LFM2.5-1.2B-Thinking is a 1.17B parameter reasoning model with a 32,768-token context length that runs under 1 GB of memory on phones and laptops.
- The model is optimized for explicit thinking traces, agentic workflows, data extraction, and RAG.
- It reaches strong scores for a 1B class model, for example 87.96 on MATH-500 and 85.60 on GSM8K, and it is competitive with Qwen3-1.7B in thinking mode while using fewer parameters.
- The training pipeline uses mid-training with reasoning traces, supervised fine-tuning, preference alignment with 5 sampled plus 1 greedy candidate per prompt, and RLVR with n-gram repetition penalties, which reduces doom loops from 15.74 percent to 0.36 percent.
- The model runs efficiently on AMD and Qualcomm CPUs and NPUs with runtimes like llama.cpp, FastFlowLM, and NexaML, is available in GGUF, ONNX, and MLX formats, and can be loaded easily from Hugging Face for on-device deployment.
Hosting Providers/Deployment
You can access or host the model through the following providers and platforms:
Cloud & API Providers
Model Repositories (Self-Hosting)
If you want to run the model locally or on your own infrastructure, the weights are available in GGUF, ONNX, and MLX formats, as noted in the key takeaways above; a minimal loading sketch follows.
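As one starting point, a standard transformers loading path might look like the sketch below. The repository id is an assumption and should be verified on the Hugging Face Hub; GGUF, ONNX, and MLX builds would instead go through llama.cpp, ONNX Runtime, or MLX.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; confirm the exact name on the Hugging Face Hub.
MODEL_ID = "LiquidAI/LFM2.5-1.2B-Thinking"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

messages = [{"role": "user", "content": "What is 12 * 17 - 5?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, which include the thinking trace.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```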


