CLI: plan --help

The plan subcommand is your first step: it analyzes your GPU and suggests an inference configuration tuned to your hardware for any model. This journey walks through the interactive help, explaining every flag and option.

What you'll see

When you run hwledger plan --help, you get:

  • All available command-line flags (--model, --context, --batch, --quant, --attention, etc.)
  • Defaults for each flag
  • Examples of common queries
  • Exit codes on failure

Watch as the planner analyzes your GPU in real time and recommends:

  • Quantization (FP16, INT8, INT4) based on VRAM
  • Attention variant (MHA, GQA, MLA, SSM) for optimal speed
  • Tensor parallelism (split across GPUs if needed)
  • Maximum batch size before running out of memory (OOM)

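The quantization choice above can be sketched as a simple fitting problem: pick the highest-precision format whose weights fit in available VRAM. This is an illustrative back-of-envelope model, not hwledger's actual logic; the headroom factor, byte sizes, and example GPU sizes are assumptions.

```python
# Hypothetical sketch of a quantization planner: choose the
# highest-precision format whose weights fit in ~80% of VRAM,
# leaving headroom for KV cache and activations.
# All names and numbers are illustrative, not hwledger's internals.

QUANT_BYTES = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}  # bytes per parameter

def weight_gib(n_params: float, quant: str) -> float:
    """Approximate weight memory in GiB for a given quantization."""
    return n_params * QUANT_BYTES[quant] / 2**30

def pick_quant(n_params: float, vram_gib: float, headroom: float = 0.8):
    """Return the best-quality quantization that fits, or None."""
    for quant in ("FP16", "INT8", "INT4"):  # ordered best quality first
        if weight_gib(n_params, quant) <= vram_gib * headroom:
            return quant
    return None  # model does not fit even at INT4

# A 7B-parameter model: 7e9 * 2 bytes / 2^30 ≈ 13.0 GiB at FP16.
print(pick_quant(7e9, 24))  # 13.0 GiB fits in 24 * 0.8 = 19.2 GiB
print(pick_quant(7e9, 12))  # FP16 (13.0) > 9.6, INT8 (6.5) fits
```

The same shape of check extends to tensor parallelism: dividing `weight_gib` by the number of GPUs before comparing against per-GPU VRAM.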
Journey not yet recorded.

Run the journey recorder to capture interactions:

./apps/macos/HwLedgerUITests/scripts/run-journeys.sh

What to watch for

  • VRAM requirement: The planner estimates the memory needed for your model + context
  • Quantization recommendation: INT4 cuts memory by 4x relative to FP16 (with ~5% quality loss)
  • Attention variants: GQA is shown if the model supports grouped-query attention
  • Tensor parallelism: The TP score shows whether splitting across 2+ GPUs helps
  • Prefill vs decode: Note the distinction in the time estimates
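The reason prefill and decode get separate estimates is that they hit different limits: prefill processes the whole prompt at once and is compute-bound, while decode re-reads the weights for every generated token and is memory-bandwidth-bound. The sketch below is a rough first-order model under assumed hardware numbers (300 TFLOPS, 1 TB/s), not hwledger's actual estimator.

```python
# Illustrative model of why prefill and decode are estimated separately.
# Names and hardware numbers are assumptions, not hwledger's internals.

def prefill_seconds(n_tokens: int, n_params: float, flops: float) -> float:
    """Prefill is compute-bound: roughly 2 FLOPs per parameter per
    prompt token, limited by the GPU's compute throughput."""
    return 2 * n_params * n_tokens / flops

def decode_seconds_per_token(weight_bytes: float, bandwidth: float) -> float:
    """Decode is bandwidth-bound: each token re-reads the weights,
    so per-token latency is roughly weight bytes / memory bandwidth."""
    return weight_bytes / bandwidth

# A 7B model at FP16 (~14 GB of weights) on an assumed GPU with
# ~300 TFLOPS of compute and ~1 TB/s of memory bandwidth:
p = prefill_seconds(32_000, 7e9, 300e12)   # whole 32k-token prompt
d = decode_seconds_per_token(14e9, 1e12)   # one generated token
print(f"prefill: {p:.2f} s, decode: {d * 1000:.0f} ms/token")
```

Under these assumed numbers the 32k prefill takes on the order of a second, while each decoded token costs ~14 ms regardless of prompt length, which is why the two appear as distinct line items.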

Next steps

Reproduce

```bash
hwledger plan --help

# Or run with a real model
hwledger plan --model mistral-7b-instruct --context 32000
```

Source

Recorded journey tape on GitHub

Released under the Apache 2.0 License.