Skip to content

Inference Engine Matrix — April 2026

Executive Summary

No single inference engine dominates all platforms and workload profiles. For hwLedger:

  • Local Apple Silicon: MLX (peak throughput) + oMlx sidecar (SSD-paged KV cache).
  • Local x86+GPU (NVIDIA/AMD): mistral.rs embedded (native Rust, MoE-aware, CUDA+Metal support).
  • Fallback: llama.cpp (universal GGUF, Windows RDNA support via GGUF quantization).
  • Remote inference (Vast.ai, RunPod, Lambda): vLLM or TGI (managed by rental provider).

Engine Comparison Matrix

EnginePlatformMoE SupportKV QuantPaged AttnStreamingLicenseMaturity
MLXApple onlyYesFP16/INT8NoYesMITProd
mistral.rsx86+GPUYes*FP8/INT8NoYesApache-2.0Prod
llama.cppUniversalLimited2–8 bitNoYesMITProd
vLLMx86+GPUYesFP8/INT4YesYesApache-2.0Prod
TGI (HF)x86+GPUYesFP8YesYesApache-2.0Prod
SGLangx86+GPUPartialFP8YesYesApache-2.0Beta
ExLlamaV2NVIDIA onlyNo3–5 bitNoBatchedMITBeta
OllamaCross-platformNo4–8 bitNoYesMITStable
TensorRT-LLMNVIDIA onlyYesFP8/INT4YesYesApache-2.0Prod

*mistral.rs MoE: expert selection at decode time; overhead ~2–5% vs dense.

Platform-Specific Deep Dives

1. Apple Silicon (M1/M2/M3/M4)

Winner: MLX

  • Peak throughput: 80–120 tokens/sec (Mistral-7B, 4K context).
  • KV-cache scaling: Unified memory swaps gracefully to RAM.
  • Quantization: 4-bit via safetensors; no runtime overhead.
  • Differentiator: PyObjC integration for native menubar UI (oMlx provides this).

oMlx sidecar wraps MLX with:

  • SSD-paged KV cache (30–90s → 1–3s TTFT on 32K context).
  • OpenAI API compatibility.
  • JSON-RPC protocol for hwLedger telemetry.

Alternative: llama.cpp Metal backend

  • Slower than MLX by 10–20%.
  • Better for heterogeneous models (GGUF ecosystem is larger).
  • Use as fallback if oMlx fork becomes unmaintained.

2. NVIDIA x86/Data Center

Winner: mistral.rs (for embedded hwLedger)

  • MoE-aware routing: correct expert selection, true active-params throughput.
  • CUDA support: native NVIDIA kernel access via candle-core.
  • Multi-GPU: doesn't scale beyond dual-GPU; acceptable for hobbyist fleet.
  • Embedded: single Rust binary, zero runtime dependencies.

Remote alternative: vLLM

  • Paged attention reduces VRAM by 20–40%.
  • FP8 KV-cache compression.
  • Better for large batches (inference servers).
  • Not embedded; requires separate process.

3. AMD Radeon

Winner: llama.cpp (GGUF via ROCm or HIP)

  • mistral.rs ROCm support is experimental (as of Apr 2026).
  • vLLM requires upstream AMD work; not production-ready.
  • Fallback: HIP-compiled llama.cpp, quantized GGUF models.

Windows RDNA: Special case

  • llama.cpp's DirectML backend targets Windows NVIDIA/AMD unified.
  • Query via -ngl (GPU layers) at load time.
  • Conservative estimate: ~60% of Metal/CUDA throughput on equivalent hardware.

4. Cloud Rentals (Vast.ai, RunPod, Lambda)

Default: vLLM or TGI (provider-managed)

hwLedger telemeters but does not drive remote inference:

  • Parent process sends requests via SSH or HTTPS.
  • Collects throughput + cost metrics from provider API.
  • Does not embed or manage remote engine.

Exception: hwLedger's own hwledger-server can spawn remote agents via Vast.ai API (runpod crate + reqwest).

Architecture-Specific Performance

Dense Models (LLaMA, Mistral)

All engines within 5–10% throughput parity. MoE overhead negligible.

Mixture-of-Experts (Mixtral 8x7B, Qwen1.5-MoE)

EngineActive ExpertsThroughputNotes
MLX2120 tok/secOptimized MLX router
mistral.rs2115 tok/secCorrect expert math
llama.cppAll 8 (!)45 tok/secNo expert gating; loads full model
vLLM2100 tok/secExpert parallelism via OpenAI MoE spec

Key finding: Most engines load all experts into VRAM (no gating). hwLedger must detect and warn users.

Multimodal (Vision)

  • MLX: mlx-vlm (LLaVA, CLIP, Qwen-VL).
  • mistral.rs: No native vision; fallback to MLX or vLLM.
  • vLLM: llava-next, phi-3-vision with paged attention.

KV-Cache Quantization

FP16 (Baseline)

  • 2 bytes / token / head
  • All engines support natively
  • No decoding overhead

FP8 (vLLM, TGI, some mistral.rs variants)

  • 1 byte / token / head
  • 50% KV-cache savings
  • <2% perplexity degradation (empirically)
  • Needs profiling: not all models compress equally

INT8 / INT4 (Experimental)

  • INT8: 1 byte, ~1% quality loss (moderate).
  • INT4: 0.5 bytes, ~3–5% quality loss (risky; use sparingly).
  • No mainstream engine supports INT4 KV-cache yet; PWCNet research only.

Recommendation: hwLedger's Dual-Engine Strategy

Local (Desktop)

┌─────────────────┐
│  hwledger-core  │
├─────────────────┤
│     planner     │
├─────────────────┤
│  MLX sidecar    │ (Apple Silicon)
│  + mistral.rs   │ (x86+GPU)
│  + llama.cpp    │ (fallback)
└─────────────────┘

Fleet (Remote)

┌──────────────────────┐
│  hwledger-server     │
├──────────────────────┤
│  Telemetry collector │
│  (no inference drive)│
├──────────────────────┤
│  vLLM / TGI remote   │
│  (provider-managed)  │
└──────────────────────┘

Migration Path (v1 → v2)

v1.0: MLX + oMlx sidecar (Apple); mistral.rs embedded (x86); llama.cpp fallback.
v1.1: Add vLLM remote telemetry.
v2.0: mistral.rs replaces llama.cpp; TensorRT-LLM for NVIDIA data-center rentals.
v3.0: Evaluate SGLang + speculative decoding for 2–3× speedup (if needed).

See also

  • Brief 01: oMlx Analysis
  • Brief 02: MLX IPC Patterns
  • ADR-0004: Math Core Dispatch
  • crates/hwledger-inference/src/backend_selector.rs

Sources

Released under the Apache 2.0 License.