Small LLMs Locally: Run Phi-4 on Laptop and Mac Mini

28. May 2026 English 5 min read

local-ai small-llm hardware

Not every business AI task needs a 70-billion-parameter model. Compact language models in the 3B–14B range have made remarkable quality leaps and now run on hardware that many offices already own. Here is what you need to know to put them to work.

The New Generation of Small Models

Phi-4 Mini (3.8 billion parameters) from Microsoft is one of the most compelling small models available today. Released under the MIT licence in early 2025, it is free to use commercially. According to community benchmarks, Phi-4 Mini scores around 73% on the MMLU dataset, compared to 65% for Meta's Llama 3.2 3B — despite both sitting in the same parameter class. On the MATH reasoning benchmark it reportedly matches much larger 8B models (source: community benchmark data published at localaimaster.com).

Its larger sibling, Phi-4 (14 billion parameters, late 2024), delivers reasoning quality that previously required 30B+ models, again at a fraction of the hardware cost, according to community evaluations.

Other strong options in this class:

Llama 3.2 3B (Meta, September 2024, Llama 3.2 Community License): compact, reliable instruction-following, widely supported
Qwen 2.5 7B (Alibaba Cloud, September 2024): excellent multilingual coverage including German, Spanish and French
Qwen 3 (2025): improved reasoning and multilingual capability over its predecessor
Gemma 3 1B (Google DeepMind, 2025, Gemma Terms of Use): very low memory footprint, suited for edge deployments

All of these are open-weight or permissively licensed and install locally via Ollama in under five minutes.

Why "Small" No Longer Means "Weak"

Three factors have driven the quality improvement in small models:

1. Higher-quality training data over raw volume. Microsoft's Phi family was trained from the start on carefully filtered, high-quality text sources — an approach now widely adopted across the industry.

2. Intensive instruction tuning and RLHF. Modern small models go through extensive post-training alignment, making them genuinely useful for structured everyday tasks.

3. Quantisation. 4-bit quantisation (for example Q4 variants in GGUF files, or 4-bit MLX weights) cuts weight memory to roughly a quarter of the 16-bit size, with minimal quality loss for most business use cases. A 7B model at 4-bit occupies roughly 4–5 GB of RAM; a 3.8B model around 2.5 GB.
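
The memory figures above follow from simple arithmetic, sketched here in Python. The 4.5 bits per weight and the 0.5 GB runtime overhead are illustrative assumptions (4-bit formats store scales alongside the weights, and the runtime needs room for the KV cache and buffers); real usage varies with the quantisation variant and context length.

```python
def estimate_model_memory_gb(params_billion: float,
                             bits_per_weight: float = 4.5,
                             overhead_gb: float = 0.5) -> float:
    """Rough RAM estimate for a quantised model.

    bits_per_weight and overhead_gb are assumed values for illustration,
    not measurements of any specific runtime.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb


if __name__ == "__main__":
    # Nominal parameter counts taken from the model names.
    for name, size in [("Phi-4 Mini", 3.8), ("Qwen 2.5 7B", 7.0), ("Phi-4", 14.0)]:
        print(f"{name}: ~{estimate_model_memory_gb(size):.1f} GB")
```

Setting `bits_per_weight=16` in the same function shows why unquantised weights of a 14B model do not fit on a 16 GB machine.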

Hardware Requirements: What You Actually Need

Model	Type	RAM (4-bit)	Speed (community-reported)
Phi-4 Mini 3.8B	Text, Reasoning	~2.5 GB	60–100 tok/s (Apple M3)
Llama 3.2 3B	Text, Instruction	~2 GB	70–110 tok/s (Apple M3)
Phi-4 14B	Text, Reasoning	~9 GB	20–35 tok/s (Apple M3)
Qwen 2.5 7B	Text, Multilingual	~4.5 GB	35–60 tok/s (Apple M3)
Llama 3.1 8B	Text, Instruction	~5 GB	30–50 tok/s (Apple M3)

All speeds are community-reported figures on Apple Silicon M3 hardware. Results vary with context length, quantisation level and workload.

A Mac Mini M4 with 16 GB (from around €800 / £700 / $850) runs every model in the table, although the 14B Phi-4 at ~9 GB leaves limited headroom for long contexts. Practitioners report 200–350 tokens per second for Phi-4 Mini on modern discrete GPUs. An existing laptop with an Apple M2/M3 chip and at least 16 GB of memory, or a desktop with an NVIDIA RTX 3060 (12 GB VRAM), handles all 3B–14B models at 4-bit.

If you need a shared team server, a Mac Studio M3 Ultra (96–512 GB unified memory) supports 70B+ models and multiple concurrent users without a rack of NVIDIA hardware.

SMB Use Cases: Where Small Models Deliver

Small models excel at well-defined, repeatable tasks:

Strong suits:

Structured text processing: email classification, document summarisation, form completion
FAQ-style chat assistants backed by a local knowledge base (RAG)
Code completion and lightweight scripting support
Translation and language correction — especially Qwen 2.5 7B for European languages

Where larger models still help:

Complex multi-step reasoning chains
Nuanced long-form creative writing
Large-scale code generation across tightly coupled projects

For the majority of automation tasks in an SMB — document handling, internal chat assistants, support pre-filtering, HR text drafting — the 3B–14B class is often sufficient. This is consistent with reports from practitioners running these systems in production.
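
As an illustration of the email-classification use case, the sketch below sends a prompt to a locally running Ollama server over its HTTP API. The label set, the prompt wording and the choice of the `phi4-mini` tag are assumptions made for this example; the URL is Ollama's default local address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
LABELS = ["invoice", "support", "sales", "other"]   # hypothetical categories


def build_request(email_text: str, model: str = "phi4-mini") -> dict:
    """Build the JSON payload for a single, non-streaming classification call."""
    prompt = (
        "Classify the email into exactly one of these labels: "
        + ", ".join(LABELS)
        + '. Reply as JSON like {"label": "..."}.\n\nEmail:\n'
        + email_text
    )
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_label(model_output: str) -> str:
    """Return a known label, falling back to 'other' on malformed output."""
    try:
        label = json.loads(model_output).get("label", "")
    except (json.JSONDecodeError, AttributeError):
        return "other"
    return label if label in LABELS else "other"


def classify(email_text: str) -> str:
    """Send the email to the local model; the text never leaves the machine."""
    body = json.dumps(build_request(email_text)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return parse_label(json.load(resp)["response"])
```

Validating the model's answer against a fixed label list, as `parse_label` does, keeps a small model's occasional malformed reply from propagating into downstream tooling.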

GDPR Advantage: Data Never Leaves Your Infrastructure

Running a small model locally delivers a practical compliance bonus that is easy to underestimate.

When Phi-4 Mini runs on a staff member's laptop and processes client documents, those documents never leave the device. There is no data-processing agreement (DPA) to negotiate with a cloud provider, no cross-border transfer to a third country, no exposure to a vendor's security incident. The GDPR documentation footprint is limited to the device configuration itself.

For industries with elevated data-protection requirements — legal, healthcare, HR, finance — this structural simplicity can be decisive. See our data sovereignty page for more on how we approach this, and our local AI overview for a broader introduction to on-premise deployments.

EU AI Act Context

Running a small open-weight model locally can also simplify your position under the EU AI Act. You remain the deployer of a general-purpose model, but you control the entire inference stack and do not depend on a third-party GPAI provider to handle transparency obligations on your behalf. For SMBs already reviewing their deployer obligations under Article 26, which covers high-risk AI systems, local models can simplify the audit trail considerably.

Getting Started in 15 Minutes with Ollama

# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Phi-4 Mini (3.8B, needs ~2.5 GB RAM at 4-bit)
ollama run phi4-mini

# Phi-4 (14B, needs ~9 GB RAM at 4-bit)
ollama run phi4

# Llama 3.2 3B — lightest option
ollama run llama3.2:3b

# Qwen 2.5 7B — multilingual tasks
ollama run qwen2.5:7b

Check ollama.com/library for current model tags and variants. If your team prefers a browser interface, Open WebUI runs as a Docker container on the same machine — chat, model switching, and user management without a command line.
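
Once the server is running, a short script can confirm which models are already pulled before a team starts using them. This is a minimal sketch that assumes Ollama's default local address and its `/api/tags` listing endpoint; the function names are hypothetical helpers, and the code assumes untagged names such as `phi4` are reported by the server as `phi4:latest`.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # lists locally available models


def normalise(tag: str) -> str:
    """Add the implicit ':latest' suffix to tags given without a variant."""
    return tag if ":" in tag else tag + ":latest"


def missing_models(wanted: list[str], tags_response: dict) -> list[str]:
    """Return the wanted tags that do not appear in the server's model list."""
    have = {m["name"] for m in tags_response.get("models", [])}
    return [tag for tag in wanted if normalise(tag) not in have]


def fetch_tags() -> dict:
    """Query the local Ollama server for its installed models."""
    with urllib.request.urlopen(TAGS_URL, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    to_pull = missing_models(["phi4", "llama3.2:3b", "qwen2.5:7b"], fetch_tags())
    print("Still to pull:", to_pull or "nothing")
```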

Cost Reality Check

Cloud inference for a 7B-equivalent model is listed at roughly €0.08–0.25 per million tokens by major providers (based on publicly available pricing). At those rates, API fees of €800–7,500 per year correspond to roughly 10–30 billion tokens per year; 10–30 million tokens per month comes to only about €10–90 per year, before latency or data-transfer considerations.

A Mac Mini M4 at ~€800 has almost no marginal inference cost beyond electricity. Based on publicly available pricing data, local hardware typically becomes cost-competitive within 6–18 months, which for an €800 machine corresponds to avoided API spend of roughly €530–1,600 per year. The exact crossover depends heavily on how much inference your team actually runs.
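
To check the crossover for your own workload, the arithmetic can be sketched in a few lines of Python. The per-token price range and the €800 hardware price come from the figures above; the monthly volumes in the example loop are placeholders, and the sketch deliberately ignores electricity, maintenance and staff time.

```python
def annual_api_cost_eur(tokens_per_month_millions: float,
                        price_per_million_eur: float) -> float:
    """Yearly API bill for a given monthly token volume (in millions)."""
    return tokens_per_month_millions * price_per_million_eur * 12


def break_even_months(hardware_eur: float,
                      tokens_per_month_millions: float,
                      price_per_million_eur: float) -> float:
    """Months until avoided API fees equal the hardware price."""
    monthly_fee = tokens_per_month_millions * price_per_million_eur
    if monthly_fee <= 0:
        return float("inf")  # no API spend means the hardware never pays back
    return hardware_eur / monthly_fee


if __name__ == "__main__":
    # Placeholder volumes in millions of tokens per month.
    for volume in (10, 100, 1000):
        for price in (0.08, 0.25):
            yearly = annual_api_cost_eur(volume, price)
            months = break_even_months(800, volume, price)
            print(f"{volume:>5} M tok/month at €{price}/M: "
                  f"€{yearly:,.0f}/year, break-even in {months:,.1f} months")
```

Because the bill is linear in token volume, the break-even point moves by a factor of ten for every tenfold change in usage, so measuring actual usage matters more than the exact per-token price.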

Want to know which model size and hardware configuration fits your specific workflows? Start a pilot project with Freshlab — we evaluate your use cases, recommend the right setup, and get you running in days, not weeks.