Not every business AI task needs a 70-billion-parameter model. Compact language models in the 3B–14B range have made remarkable quality leaps and now run on hardware that many offices already own. Here is what you need to know to put them to work.
The New Generation of Small Models
Phi-4 Mini (3.8 billion parameters) from Microsoft is one of the most compelling small models available today. Released under the MIT licence in early 2025, it is free to use commercially. According to community benchmarks, Phi-4 Mini scores around 73% on the MMLU dataset, compared to 65% for Meta's Llama 3.2 3B, even though both sit in the same parameter class. On the MATH reasoning benchmark it reportedly matches much larger 8B models (source: community benchmark data published at localaimaster.com).
Its larger sibling, Phi-4 (14 billion parameters, late 2024), delivers reasoning quality that previously required 30B+ models, again at a fraction of the hardware cost, according to community evaluations.
Other strong options in this class:
- Llama 3.2 3B (Meta, September 2024, Llama 3.2 Community License): compact, reliable instruction-following, widely supported
- Qwen 2.5 7B (Alibaba Cloud, September 2024): excellent multilingual coverage including German, Spanish and French
- Qwen 3 (2025): improved reasoning and multilingual capability over its predecessor
- Gemma 3 1B (Google DeepMind, 2025, Gemma Terms of Use): very low memory footprint, suited for edge deployments
All of these are open-weight or permissively licensed and install locally via Ollama in under five minutes.
Why "Small" No Longer Means "Weak"
Three factors have driven the quality improvement in small models:
1. Higher-quality training data over raw volume. Microsoft's Phi family was trained from the start on carefully filtered, high-quality text sources, an approach now widely adopted across the industry.
2. Intensive instruction tuning and RLHF. Modern small models go through extensive post-training alignment, making them genuinely useful for structured everyday tasks.
3. Quantisation. 4-bit quantisation (GGUF, MLX-4bit) cuts memory requirements dramatically with minimal quality loss for most business use cases. A 7B model at 4-bit occupies roughly 4–5 GB of RAM; a 3.8B model around 2.5 GB.
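The memory figures in point 3 follow from simple arithmetic: at 4 bits, each parameter takes half a byte, plus some runtime overhead. A minimal Python sketch, assuming a flat 30% overhead for KV cache and runtime buffers (the factor and the function name `estimate_ram_gb` are illustrative assumptions chosen to land near the figures quoted above, not measured values):

```python
def estimate_ram_gb(params_billion: float, bits: int = 4, overhead: float = 1.3) -> float:
    """Rough RAM estimate in GB for a quantised model.

    `overhead` is an assumed fudge factor for KV cache and runtime buffers;
    real usage depends on context length and the quantisation format.
    """
    weights_gb = params_billion * bits / 8  # 1e9 params * (bits / 8) bytes = GB
    return weights_gb * overhead


if __name__ == "__main__":
    for name, size in [("Phi-4 Mini", 3.8), ("Qwen 2.5 7B", 7.0), ("Phi-4", 14.0)]:
        print(f"{name}: ~{estimate_ram_gb(size):.1f} GB at 4-bit")
```

Treat the result as a planning number only; long context windows can push actual usage noticeably higher.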
Hardware Requirements: What You Actually Need
| Model | Type | RAM (4-bit) | Speed (community-reported) |
|---|---|---|---|
| Phi-4 Mini 3.8B | Text, Reasoning | ~2.5 GB | 60–100 tok/s (Apple M3) |
| Llama 3.2 3B | Text, Instruction | ~2 GB | 70–110 tok/s (Apple M3) |
| Phi-4 14B | Text, Reasoning | ~9 GB | 20–35 tok/s (Apple M3) |
| Qwen 2.5 7B | Text, Multilingual | ~4.5 GB | 35–60 tok/s (Apple M3) |
| Llama 3.1 8B | Text, Instruction | ~5 GB | 30–50 tok/s (Apple M3) |
All speeds are community-reported figures on Apple Silicon M3 hardware. Results vary with context length, quantisation level and workload.
A Mac Mini M4 with 16 GB (from around €800 / £700 / $850) runs every model in the table comfortably. Practitioners report 200–350 tokens per second for Phi-4 Mini on modern GPU hardware. An existing laptop with an Apple M2/M3 chip and at least 16 GB of memory, or a desktop with an NVIDIA RTX 3060 (12 GB VRAM), handles all 3B–14B models at 4-bit.
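To check an existing machine against the table, compare each model's 4-bit footprint with the memory left over after the operating system and other applications. A small Python sketch, using a few rows from the table; the 6 GB reserve and the helper name `models_that_fit` are assumptions for illustration:

```python
# RAM figures (GB at 4-bit) copied from a few rows of the table above.
MODEL_RAM_GB = {
    "Phi-4 Mini 3.8B": 2.5,
    "Llama 3.2 3B": 2.0,
    "Phi-4 14B": 9.0,
    "Qwen 2.5 7B": 4.5,
}


def models_that_fit(total_ram_gb: float, reserved_gb: float = 6.0) -> list[str]:
    """Return models whose footprint fits into RAM left after the reserve.

    `reserved_gb` is an assumed allowance for the OS and other applications.
    Results are ordered from smallest to largest footprint.
    """
    budget = total_ram_gb - reserved_gb
    fitting = [name for name, ram in MODEL_RAM_GB.items() if ram <= budget]
    return sorted(fitting, key=MODEL_RAM_GB.get)


if __name__ == "__main__":
    for ram in (8, 16, 32):
        print(f"{ram} GB machine: {models_that_fit(ram)}")
```

Adjust the reserve to match what else runs on the machine; a dedicated inference box can use a much smaller value than a laptop with a browser and office suite open.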
If you need a shared team server, a Mac Studio M3 Ultra (96–512 GB unified memory) supports 70B+ models and multiple concurrent users without a rack of NVIDIA hardware.
SMB Use Cases: Where Small Models Deliver
Small models excel at well-defined, repeatable tasks:
Strong suits:
- Structured text processing: email classification, document summarisation, form completion
- FAQ-style chat assistants backed by a local knowledge base (RAG)
- Code completion and lightweight scripting support
- Translation and language correction, especially Qwen 2.5 7B for European languages
Where larger models still help:
- Complex multi-step reasoning chains
- Nuanced long-form creative writing
- Large-scale code generation across tightly coupled projects
For the majority of automation tasks in an SMB (document handling, internal chat assistants, support pre-filtering, HR text drafting), the 3B–14B class is often sufficient. This is consistent with reports from practitioners running these systems in production.
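To make the "local knowledge base (RAG)" use case concrete, here is a toy Python sketch of the retrieval step: it scores each FAQ entry by word overlap with the question and places the best match into the prompt. Word overlap stands in for the embedding search a real deployment would normally use, and the FAQ entries and the names `retrieve` and `build_prompt` are hypothetical.

```python
import re

FAQ = [  # hypothetical knowledge-base entries
    "Invoices are sent on the first working day of each month.",
    "Holiday requests must be submitted two weeks in advance.",
    "The VPN client is installed via the internal software portal.",
]


def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the question."""
    q = tokenize(question)
    return max(docs, key=lambda d: len(q & tokenize(d)))


def build_prompt(question: str, docs: list[str]) -> str:
    """Combine the best-matching document and the question into one prompt."""
    context = retrieve(question, docs)
    return (
        "Answer using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    print(build_prompt("When are invoices sent?", FAQ))
```

The resulting prompt can be sent to any of the models above; keeping the context short matters more for small models than for large ones.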
GDPR Advantage: Data Never Leaves Your Infrastructure
Running a small model locally delivers a practical compliance bonus that is easy to underestimate.
When Phi-4 Mini runs on a staff member's laptop and processes client documents, those documents never leave the device. There is no data-processing agreement (DPA) to negotiate with a cloud provider, no cross-border transfer to a third country, no exposure to a vendor's security incident. The additional GDPR documentation for the AI component is largely limited to the device configuration itself.
For industries with elevated data-protection requirements (legal, healthcare, HR, finance), this structural simplicity can be decisive. See our data sovereignty page for more on how we approach this, and our local AI overview for a broader introduction to on-premise deployments.
EU AI Act Context
Running a small open-weight model locally also reduces your footprint under the EU AI Act. You are the deployer of a general-purpose model but you control the entire inference stack. You are not relying on a third-party GPAI provider to handle transparency obligations on your behalf. For SMBs already reviewing their Article 26 obligations, local models simplify the audit trail considerably.
Getting Started in 15 Minutes with Ollama
```bash
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run Phi-4 (14B, needs ~9 GB RAM at 4-bit)
ollama run phi4

# Llama 3.2 3B: lightest option
ollama run llama3.2:3b

# Qwen 2.5 7B: multilingual tasks
ollama run qwen2.5:7b
```
Check ollama.com/library for current model tags and variants. If your team prefers a browser interface, Open WebUI runs as a Docker container on the same machine and provides chat, model switching, and user management without a command line.
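Once a model is pulled, Ollama also serves a local HTTP API, which is how scripts and internal tools typically reach it. A minimal Python sketch for the email-classification use case, assuming Ollama is listening on its default address `http://localhost:11434` and that `llama3.2:3b` has been pulled; the label set and the helper names are illustrative:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
LABELS = ["invoice", "support", "sales", "other"]   # hypothetical label set


def build_payload(email_text: str, model: str = "llama3.2:3b") -> dict:
    """Build the JSON body for a single, non-streaming generate request."""
    prompt = (
        f"Classify the email into one of: {', '.join(LABELS)}.\n"
        "Reply with the label only.\n\n"
        f"Email:\n{email_text}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def parse_label(raw: str) -> str:
    """Normalise the model's reply; fall back to 'other' if it is unexpected."""
    label = raw.strip().lower().strip(".")
    return label if label in LABELS else "other"


def classify(email_text: str) -> str:
    """Send the email to the local model (requires a running Ollama server)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(email_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return parse_label(json.loads(resp.read())["response"])
```

Calling `classify("Please find our invoice attached.")` returns one of the four labels. The fallback in `parse_label` matters with small models, which occasionally answer with a full sentence instead of a bare label.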
Cost Reality Check
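Cloud inference for a 7B-equivalent model is listed at roughly €0.08–0.25 per million tokens by major providers (based on publicly available pricing). At those rates, an SMB generating 10–30 million tokens per month in internal tooling pays only about €10–90 per year in API fees. Annual bills in the €800–7,500 range arise at much higher volumes (roughly 0.8 billion tokens per month at the low rate, up to 2.5 billion at the high rate) or when the workload is sent to larger, more expensive models, before latency or data-transfer considerations.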
A Mac Mini M4 at ~€800 has near-zero marginal inference cost, electricity aside. Based on publicly available pricing data, local hardware typically becomes cost-competitive within 6–18 months at moderate usage, though the exact crossover depends heavily on how much inference your team actually runs.
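The crossover point is a one-line calculation: the hardware price divided by the monthly API bill it replaces. A small Python sketch, using the €800 hardware price from above and ignoring electricity and maintenance (both simplifying assumptions); the monthly bills in the loop are illustrative inputs, not measured spend:

```python
def monthly_api_cost(million_tokens_per_month: float, eur_per_million: float) -> float:
    """Monthly API bill in EUR for a given token volume and list price."""
    return million_tokens_per_month * eur_per_million


def breakeven_months(hardware_eur: float, monthly_api_eur: float) -> float:
    """Months until the one-off hardware cost equals cumulative API fees."""
    if monthly_api_eur <= 0:
        raise ValueError("monthly API cost must be positive")
    return hardware_eur / monthly_api_eur


if __name__ == "__main__":
    for bill in (50, 100, 150):  # illustrative monthly API bills in EUR
        months = breakeven_months(800, bill)
        print(f"EUR {bill}/month replaced: break-even after {months:.1f} months")
```

Plugging in your own token volume through `monthly_api_cost` shows quickly whether the purchase pays off on cost alone or mainly on data-protection grounds.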
Want to know which model size and hardware configuration fits your specific workflows? Start a pilot project with Freshlab: we evaluate your use cases, recommend the right setup, and get you running in days, not weeks.