CLI: plan --help

The plan subcommand is your first step: it analyzes your GPU and suggests an inference configuration tuned to your hardware for any model. This journey walks through the interactive help, explaining every flag and option.

What you'll see

When you run hwledger plan --help, you get:

  • All available command-line flags (--model, --context, --batch, --quant, --attention, etc.)
  • Defaults for each flag
  • Examples of common queries
  • Exit codes on failure

Watch as the planner analyzes your GPU in real time and recommends:

  • Quantization (FP16, INT8, INT4) based on VRAM
  • Attention variant (MHA, GQA, MLA, SSM) for optimal speed
  • Tensor parallelism (split across GPUs if needed)
  • Maximum batch size before running out of memory (OOM)

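The quantization choice above can be sketched as a simple fitting problem: pick the highest-precision format whose weights fit in available VRAM. This is an illustrative back-of-envelope model, not hwledger's actual logic; the headroom factor, byte sizes, and example GPU sizes are assumptions.

```python
# Hypothetical sketch of a quantization planner: choose the
# highest-precision format whose weights fit in ~80% of VRAM,
# leaving headroom for KV cache and activations.
# All names and numbers are illustrative, not hwledger's internals.

QUANT_BYTES = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}  # bytes per parameter

def weight_gib(n_params: float, quant: str) -> float:
    """Approximate weight memory in GiB for a given quantization."""
    return n_params * QUANT_BYTES[quant] / 2**30

def pick_quant(n_params: float, vram_gib: float, headroom: float = 0.8):
    """Return the best-quality quantization that fits, or None."""
    for quant in ("FP16", "INT8", "INT4"):  # ordered best quality first
        if weight_gib(n_params, quant) <= vram_gib * headroom:
            return quant
    return None  # model does not fit even at INT4

# A 7B-parameter model: 7e9 * 2 bytes / 2^30 ≈ 13.0 GiB at FP16.
print(pick_quant(7e9, 24))  # 13.0 GiB fits in 24 * 0.8 = 19.2 GiB
print(pick_quant(7e9, 12))  # FP16 (13.0) > 9.6, INT8 (6.5) fits
```

The same shape of check extends to tensor parallelism: dividing `weight_gib` by the number of GPUs before comparing against per-GPU VRAM.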
Journey not yet recorded.

Run the journey recorder to capture interactions:

./apps/macos/HwLedgerUITests/scripts/run-journeys.sh

What to watch for

  • VRAM requirement: The planner estimates the memory needed for your model + context
  • Quantization recommendation: INT4 cuts memory by 4x relative to FP16 (with ~5% quality loss)
  • Attention variants: GQA is shown if the model supports grouped-query attention
  • Tensor parallelism: The TP score shows whether splitting across 2+ GPUs helps
  • Prefill vs decode: Note the distinction in the time estimates
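The reason prefill and decode get separate estimates is that they hit different limits: prefill processes the whole prompt at once and is compute-bound, while decode re-reads the weights for every generated token and is memory-bandwidth-bound. The sketch below is a rough first-order model under assumed hardware numbers (300 TFLOPS, 1 TB/s), not hwledger's actual estimator.

```python
# Illustrative model of why prefill and decode are estimated separately.
# Names and hardware numbers are assumptions, not hwledger's internals.

def prefill_seconds(n_tokens: int, n_params: float, flops: float) -> float:
    """Prefill is compute-bound: roughly 2 FLOPs per parameter per
    prompt token, limited by the GPU's compute throughput."""
    return 2 * n_params * n_tokens / flops

def decode_seconds_per_token(weight_bytes: float, bandwidth: float) -> float:
    """Decode is bandwidth-bound: each token re-reads the weights,
    so per-token latency is roughly weight bytes / memory bandwidth."""
    return weight_bytes / bandwidth

# A 7B model at FP16 (~14 GB of weights) on an assumed GPU with
# ~300 TFLOPS of compute and ~1 TB/s of memory bandwidth:
p = prefill_seconds(32_000, 7e9, 300e12)   # whole 32k-token prompt
d = decode_seconds_per_token(14e9, 1e12)   # one generated token
print(f"prefill: {p:.2f} s, decode: {d * 1000:.0f} ms/token")
```

Under these assumed numbers the 32k prefill takes on the order of a second, while each decoded token costs ~14 ms regardless of prompt length, which is why the two appear as distinct line items.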

Next steps

Reproduce

```bash
hwledger plan --help

# Or run with a real model
hwledger plan --model mistral-7b-instruct --context 32000
```

Source

Recorded journey tape on GitHub

Released under the Apache 2.0 License.