Glossary

Attention mechanisms

MHA — Multi-Head Attention. Standard Transformer mechanism with h independent attention heads.

GQA — Grouped Query Attention. h query heads share g key-value heads (g << h), shrinking the KV cache by a factor of h/g.

MQA — Multi-Query Attention. All query heads share a single K/V head. Maximum compression (h:1).

MLA — Multi-Head Latent Attention. Projects K and V into a shared low-rank latent vector that is cached in place of per-head K/V, reducing cache size.

SSM — State Space Model. Linear recurrence alternative to Transformer (Mamba). O(n) memory, constant decode latency.
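The trade-off between the attention variants above is easiest to see as KV-cache arithmetic. A minimal sketch; the head count and head dimension below are illustrative values, not taken from any particular model:

```python
# Per-layer KV-cache bytes per token for MHA, GQA, and MQA.
# Assumed example values: h = 32 query heads, head_dim = 128, fp16 (2 bytes).

def kv_bytes_per_token(kv_heads: int, head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of cached K+V per token, per layer."""
    return 2 * kv_heads * head_dim * dtype_bytes  # factor 2 = one K and one V vector

h = 32
mha = kv_bytes_per_token(kv_heads=h)   # MHA: one KV head per query head
gqa = kv_bytes_per_token(kv_heads=8)   # GQA: g = 8 shared KV heads
mqa = kv_bytes_per_token(kv_heads=1)   # MQA: a single shared KV head

print(mha, gqa, mqa)  # GQA shrinks the cache h/g = 4x, MQA 32x
```

With g = 8, GQA cuts the cache by exactly h/g = 4x relative to MHA, and MQA by h = 32x, matching the ratios in the definitions above.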

Quantization

FP32 — 32-bit floating point. Full precision, ~4 bytes/param.

FP16 — 16-bit floating point. Half precision, ~2 bytes/param.

BF16 — bfloat16. 16-bit with wider exponent range, ~2 bytes/param.

INT8 — 8-bit integer quantization. ~1 byte/param; typically ~2% accuracy loss.

INT4 — 4-bit integer quantization. ~0.5 bytes/param; typically ~5% accuracy loss.
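The bytes/param figures above translate directly into weight memory. A quick sketch, using a 7B-parameter model as an illustrative size:

```python
# Approximate weight memory for an n-parameter model at each precision,
# using the bytes/param figures from the glossary entries above.
bytes_per_param = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_gib(n_params: float, dtype: str) -> float:
    """Weight memory in GiB (activations and KV cache not included)."""
    return n_params * bytes_per_param[dtype] / 2**30

for dtype in bytes_per_param:
    print(f"{dtype}: {weight_gib(7e9, dtype):.1f} GiB")
```

For 7B parameters this gives roughly 26 GiB at FP32 down to about 3.3 GiB at INT4, which is why low-bit quantization decides whether a model fits on consumer VRAM.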

Inference & deployment

KV cache — Cached key and value vectors from previous tokens, so each decode step computes K/V only for the new token instead of recomputing the whole prefix.

Prefill — Initial phase where model processes entire prompt. Compute-bound.

Decode — Token generation phase where model produces 1 token at a time. Memory-bound.

TP — Tensor Parallelism. Split model across multiple GPUs. Requires good inter-GPU bandwidth.

PP — Pipeline Parallelism. Split layers across GPUs. Lets a model exceed a single GPU's VRAM, but adds inter-stage latency.

VRAM — Video RAM on GPU. Limits model size + batch size + context length.
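The KV cache is often what VRAM actually runs out of at long context. A sizing sketch, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head_dim 128, fp16) purely for illustration:

```python
# Total KV-cache memory for a full context window.
# Assumed example dims: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
def kv_cache_bytes(seq_len: int, layers: int = 32, kv_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len  # 2 = K and V

print(kv_cache_bytes(4096) / 2**30)  # ~2 GiB at 4k context, per sequence
```

Note the cache scales linearly with both context length and batch size, so a batch of 8 at 4k context already costs ~16 GiB on top of the weights under these assumptions.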

Hardware

CUDA — NVIDIA GPU programming model. Standard on RTX, A100, etc.

ROCm — AMD GPU computing platform. Compatible with Radeon, MI series.

Metal — Apple GPU framework. Native on M-series chips.

NVML — NVIDIA Management Library. Query GPU telemetry (temp, memory, utilization).

rocm-smi — AMD equivalent of nvidia-smi. Query GPU state on Radeon/MI.

macmon — hwLedger's wrapper for Metal GPU telemetry on macOS.

Fleet & distribution

Agent — Lightweight daemon deployed on remote GPU. Registers with fleet server, executes jobs.

Fleet Server — Central Axum daemon orchestrating agents, distributing jobs, logging audit trail.

mTLS — Mutual TLS. Both client and server present certificates. Prevents MITM.

SSH Fallback — Agentless mode. Server queries GPU via SSH (no binary installation required).

Storage & auditing

Event Sourcing — Append-only log of state changes. Each event immutable, traced to prior state.

Hash Chain — Cryptographic linkage where Event N's hash depends on Event N-1. Tampering detectable.

Ledger — Append-only event store with SHA-256 hash chain for audit trail.

Quality & governance

FR — Functional Requirement. Specification of what system must do.

Traceability — Mapping FR → test → code. Every feature testable, every test required.

SLT — Spec-to-Test traceability. Every FR has >=1 test referencing it.

SAST — Static Application Security Testing. Automated code scanning (Semgrep, CodeQL).

Cloud platforms

Vast.ai — GPU rental marketplace. Spot instances (30% savings, interruptible).

RunPod — GPU rental with stable 24h minimum commitments.

Lambda Labs — Dedicated GPU rentals (no spot, higher cost).

Modal — Serverless GPU inference platform.

Released under the Apache 2.0 License.