oMlx Analysis
Executive Summary
oMlx (jundot/omlx, Apache-2.0, 10.6K stars, v0.3.6 Apr 2026) is the most mature open-source MLX-based inference server. Its key feature, a paged SSD KV cache, reduces time-to-first-token (TTFT) from 30–90 seconds to 1–3 seconds for agent loops. For hwLedger's Apple Silicon inference pathway, the recommended strategy is a fat fork that preserves all upstream functionality while adding hwLedger-specific extensions.
Upstream Architecture
oMlx is built as a Python FastAPI wrapper around MLX:
- Runtime: FastAPI + uvicorn on localhost:8000
- Model loading: mlx-lm (LLaMA, Mixtral, Qwen, etc.)
- Quantization: MLX native (4-bit, 8-bit) with custom safetensors loading
- VLM support: mlx-vlm for vision models (CLIP, LLaVA, Qwen-VL)
- SSD paging: Experimental KV-cache overflow to disk when VRAM exhausted
- Optional: PyObjC menubar app (native macOS UI)
Strengths
- Paged KV cache (unique): Swaps inactive tokens to SSD, fitting larger contexts than VRAM allows.
- MLX vectorization: High throughput on Apple Silicon via the Metal GPU backend and unified memory (MLX targets the CPU and GPU, not the ANE).
- Vision model support: LLaVA, CLIP, Qwen-VL via mlx-vlm.
- Standard APIs: OpenAI-compatible /v1/chat/completions for drop-in compatibility.
- Single-machine focus: No distributed inference complexity.
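To illustrate the paging idea behind the first strength above, the following is a toy sketch and not oMlx's implementation. It assumes KV data is grouped into fixed blocks, that the least recently used block is the one spilled, and that blocks are pickled to a spill directory; the class name `PagedKVStore` and all of its fields are hypothetical.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class PagedKVStore:
    """Toy LRU store: at most `max_resident` blocks stay in memory, the rest go to disk."""

    def __init__(self, max_resident, spill_dir=None):
        self.max_resident = max_resident
        # Assumption: a plain directory stands in for the SSD cache location.
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kv-spill-")
        self.resident = OrderedDict()  # block_id -> payload, least recently used first
        self.page_faults = 0           # reads that had to be served from disk

    def _path(self, block_id):
        return os.path.join(self.spill_dir, f"block-{block_id}.pkl")

    def put(self, block_id, payload):
        """Insert or refresh a block, then spill old blocks if over budget."""
        self.resident[block_id] = payload
        self.resident.move_to_end(block_id)
        while len(self.resident) > self.max_resident:
            old_id, old_payload = self.resident.popitem(last=False)
            with open(self._path(old_id), "wb") as f:
                pickle.dump(old_payload, f)

    def get(self, block_id):
        """Return a block, reloading it from disk (a page fault) if it was spilled."""
        if block_id in self.resident:
            self.resident.move_to_end(block_id)
            return self.resident[block_id]
        path = self._path(block_id)
        with open(path, "rb") as f:  # raises FileNotFoundError for unknown blocks
            payload = pickle.load(f)
        os.remove(path)
        self.page_faults += 1
        self.put(block_id, payload)
        return payload
```

The `page_faults` counter is the kind of figure a telemetry channel could report; a real implementation would page tensors rather than pickled Python objects and would likely batch its disk I/O.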
Build Surface
- Python + PyObjC (menubar component): requires Xcode toolchain, venvstacks setup.
- ML dependencies: numpy, mlx, mlx-lm, mlx-vlm, safetensors.
- Heavy init time: The first inference run downloads the model and compiles Metal kernels (30–60 s cold).
Fork Strategy
Three options were evaluated:
Option 1: Slim Fork (30% codebase)
Remove PyObjC menubar, venvstacks build boilerplate. Retain FastAPI + mlx-lm core.
Pros: Lighter maintenance burden.
Cons: Forecloses future feature additions (KV quant dials, per-layer memory reporting).
Option 2: Upstream HTTP-Sidecar (No Fork)
Pin a stable oMlx commit; contribute PRs upstream as needed.
Pros: Zero maintenance cost.
Cons: Upstream PRs are slow; we cannot add hwLedger-specific extensions without upstreaming first.
Option 3: Fat Fork (100% codebase) ✅ RECOMMENDED
Preserve all upstream code. Add hwLedger-specific features behind feature flags.
Pros: Full extensibility; can add KV-quant controls, deterministic benchmarking, per-layer memory introspection without waiting for upstream PRs.
Cons: Ongoing maintenance tax for Python + PyObjC. Accepted because the SSD-paged KV cache cannot feasibly be reimplemented from scratch in Rust.
Recommended Implementation
Sidecar Boundary
The parent hwLedger Rust process spawns the Python sidecar under a uv-managed venv:
uv venv --python 3.11 .venv-omlx
# Activate the venv so that uv pip and python target .venv-omlx rather than a default .venv
source .venv-omlx/bin/activate
uv pip install -e sidecars/omlx-fork/
python -m omlx.server --listen 127.0.0.1:8000
Lifecycle:
- Parent manages process start/stop via std::process::Command.
- SIGTERM on parent propagates to child via process group.
- Heartbeat check via HTTP GET /health every 5s.
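The heartbeat rule above can be sketched as a small state machine. The real parent is a Rust process; Python is used here only for illustration. The `/health` path, port, and 5-second interval come from this brief, while `HeartbeatMonitor`, the `max_failures` threshold of 3, and the three status strings are assumptions.

```python
import urllib.error
import urllib.request


def http_health_probe(url="http://127.0.0.1:8000/health", timeout=2.0):
    """Return True if a GET on the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as a failed beat.
        return False


class HeartbeatMonitor:
    """Tracks consecutive failed probes and says when a restart is due."""

    def __init__(self, probe=http_health_probe, max_failures=3):
        self.probe = probe
        self.max_failures = max_failures  # hypothetical threshold, not from upstream
        self.consecutive_failures = 0

    def tick(self):
        """Run one probe; return 'healthy', 'degraded', or 'restart'."""
        if self.probe():
            self.consecutive_failures = 0
            return "healthy"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            return "restart"
        return "degraded"
```

A supervisor would call `tick()` once every 5 seconds and restart the sidecar when it returns `"restart"`. Injecting the probe as a parameter keeps the policy testable without a running server.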
Dual IPC Surfaces
FastAPI HTTP (inherited):
- OpenAI /v1/chat/completions endpoint.
- Anthropic /api/v1/messages endpoint.
- Available for external agents (Cursor, Claude Agent).
JSON-RPC over stdio (hwLedger-specific):
- Bidirectional token streaming with memory telemetry.
- Benchmark hooks (deterministic seed, layer-wise KV reporting).
- Config reload without restart.
- Reserved: length-prefixed protobuf fallback if JSON-RPC throughput saturates.
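A minimal sketch of the stdio surface, assuming newline-delimited JSON-RPC 2.0 framing (the brief does not specify the framing). The `ping` method, the `rpc_method` decorator, and the `serve` loop are hypothetical names for illustration and are not taken from hwledger_protocol.py.

```python
import io
import json
import sys

HANDLERS = {}  # method name -> callable(params) -> result


def rpc_method(name):
    """Register a function as the handler for a JSON-RPC method."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


@rpc_method("ping")  # hypothetical method, used only to exercise the loop
def ping(params):
    return {"ok": True}


def _error(req_id, code, message):
    return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}


def handle_line(line):
    """Handle one request line; return a response dict, or None for notifications."""
    try:
        req = json.loads(line)
    except json.JSONDecodeError:
        return _error(None, -32700, "Parse error")
    if not isinstance(req, dict):
        return _error(None, -32600, "Invalid Request")
    is_notification = "id" not in req
    handler = HANDLERS.get(req.get("method"))
    if handler is None:
        return None if is_notification else _error(req["id"], -32601, "Method not found")
    result = handler(req.get("params") or {})
    return None if is_notification else {"jsonrpc": "2.0", "id": req["id"], "result": result}


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Read requests line by line and write one JSON response per line."""
    for line in stdin:
        if not line.strip():
            continue
        response = handle_line(line)
        if response is not None:
            stdout.write(json.dumps(response) + "\n")
            stdout.flush()  # the parent reads incrementally, so avoid buffering
```

The error codes are the standard JSON-RPC 2.0 values. Line-delimited framing is easy to debug, and the reserved length-prefixed protobuf fallback could replace only the framing layer while keeping the handler table unchanged.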
Repository Structure
Forked to KooshaPari/phenotype-omlx:
sidecars/omlx-fork/
├── omlx/
│ ├── server.py # FastAPI (unchanged from upstream)
│ ├── models.py # Model loading
│ └── mlx_interface.py # MLX FFI
├── hwledger_protocol.py # JSON-RPC stdio handler (our addition)
├── pyproject.toml
└── patches/
├── 001-kv-quant.patch
├── 002-layer-memory.patch
    └── ...
Upstream sync: Weekly rebase attempt; divergent patches staged in patches/ for incremental replay onto newer upstream commits.
Key Integration Points
1. Config Ingestion (hwledger-ingest)
oMlx model loading via mlx-lm respects HuggingFace config.json:
- num_attention_heads, hidden_size for MHA math.
- num_key_value_heads for GQA detection.
- Custom attention_type field for hybrid/MLA dispatch.
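The field usage above can be sketched as follows. This is an illustrative reading, not the hwledger-ingest implementation. It assumes the common HuggingFace convention that a missing `num_key_value_heads` means plain MHA, and it additionally reads `num_hidden_layers` and an optional `head_dim`, which are standard config.json fields not named in this brief. The function names are hypothetical.

```python
def classify_attention(config):
    """Return the attention variant implied by a HuggingFace-style config dict."""
    custom = config.get("attention_type")
    if custom:
        return custom  # custom field wins, e.g. for hybrid or MLA dispatch
    heads = config["num_attention_heads"]
    kv_heads = config.get("num_key_value_heads", heads)
    if kv_heads == heads:
        return "mha"
    if kv_heads == 1:
        return "mqa"
    return "gqa"


def kv_bytes_per_token(config, bytes_per_element=2):
    """KV-cache bytes per token for MHA/GQA/MQA: 2 (K and V) * layers * kv_heads * head_dim * bytes."""
    heads = config["num_attention_heads"]
    kv_heads = config.get("num_key_value_heads", heads)
    head_dim = config.get("head_dim", config["hidden_size"] // heads)
    return 2 * config["num_hidden_layers"] * kv_heads * head_dim * bytes_per_element


# Hypothetical config: 32 layers, 32 query heads, 8 KV heads, hidden size 4096.
example = {
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "hidden_size": 4096,
    "num_hidden_layers": 32,
}
print(classify_attention(example))   # gqa
print(kv_bytes_per_token(example))   # 131072 bytes per token at 2 bytes per element
```

With the same shape but 32 KV heads, the per-token figure rises fourfold to 524288 bytes, which is the saving GQA buys. The formula does not apply to MLA or hybrid layers, which is why the custom `attention_type` value is returned unchanged for separate handling.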
2. Memory Telemetry (hwledger-probe)
JSON-RPC extension provides:
- Peak GPU VRAM during prefill.
- Per-layer KV allocation (for heatmap visualization).
- SSD page fault rate (if KV spilled).
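One possible shape for this telemetry is a JSON-RPC notification carrying the three figures listed above. The sketch below is an assumption throughout: the `MemoryTelemetry` class, its field names, the `telemetry.memory` method name, and the definition of page fault rate as faults divided by KV page reads are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MemoryTelemetry:
    peak_prefill_bytes: int                                  # peak GPU memory during prefill
    per_layer_kv_bytes: list = field(default_factory=list)  # one entry per layer, for the heatmap
    ssd_page_faults: int = 0                                 # KV page reads served from SSD
    kv_page_reads: int = 0                                   # all KV page reads

    @property
    def page_fault_rate(self):
        """Fraction of KV page reads that hit the SSD; 0.0 when nothing was read."""
        if self.kv_page_reads == 0:
            return 0.0
        return self.ssd_page_faults / self.kv_page_reads

    def to_notification(self):
        """Build a JSON-RPC 2.0 notification (no id, so no response is expected)."""
        params = asdict(self)
        params["page_fault_rate"] = self.page_fault_rate
        return {"jsonrpc": "2.0", "method": "telemetry.memory", "params": params}


sample = MemoryTelemetry(
    peak_prefill_bytes=6_000_000_000,
    per_layer_kv_bytes=[4096, 4096, 8192],
    ssd_page_faults=5,
    kv_page_reads=20,
)
print(json.dumps(sample.to_notification()))
```

Sending the derived rate alongside the raw counters lets the consumer plot it directly while still being able to aggregate the counters across intervals.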
3. Inference Runner (hwledger-inference)
hwledger-inference subprocess driver:
- Spawns and manages oMlx sidecar lifecycle.
- Routes requests to HTTP or JSON-RPC based on workload.
- Collects telemetry for ledger reconciliation.
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Python + venvstacks maintenance | Accept cost; document setup; use uv for reproducibility. |
| Upstream divergence grows | Monthly rebases; selective cherry-pick strategy from upstream PRs. |
| PyObjC breaks on macOS update | Keep behind a feature flag; fall back to HTTP-only if it breaks. |
| JSON-RPC protocol churn | Version the protocol; maintain backward compatibility. |
Dependency Matrix
| Dependency | Version | License | Rationale |
|---|---|---|---|
| mlx | 0.21+ | MIT | Core ML framework |
| mlx-lm | 0.18+ | MIT | LLaMA/Mixtral/Qwen loaders |
| mlx-vlm | 0.6+ | MIT | Vision model support |
| fastapi | 0.115+ | MIT | HTTP server |
| uv | 0.4+ | MIT | Venv management |
| safetensors | 0.4+ | Apache-2.0 | Safe model loading |
See also
- ADR-0002: oMlx Fat Fork Decision
- Brief 02: MLX IPC Patterns
- Brief 03: Inference Engine Matrix
- crates/hwledger-inference/src/mlx_sidecar.rs