User-Journey Verification Harness for SwiftUI macOS Apps + CLI Tools: 2026 Research Brief
Executive Summary
This brief surveys the 2026 state of the art for building an automated user-journey verification harness that enables agents (coding and VLM-capable review agents) to verify real user-facing behavior without manually running apps. The harness must drive a native SwiftUI macOS app, record CLI output, generate GIF/MP4 artifacts, and leverage VLM (Vision Language Model) agents to audit visual equivalence against intent labels.
Key finding: Claude Opus 4.7 (April 2026) accepts static images up to 2576px (3.75MP) in JPEG, PNG, GIF, or WebP format, but animated GIFs are not supported: only the first frame is processed. This mandates a keyframe extraction + gallery layout approach rather than sending video directly. The rest of the stack is mature, with native macOS APIs (ScreenCaptureKit, XCUITest, Accessibility API) providing reliable automation and recording; the primary implementation work is orchestration and VLM integration.
Part 1: Stack Recommendation
| Requirement | 2026 Recommendation | Rationale |
|---|---|---|
| UI Automation (SwiftUI macOS) | XCUITest (primary) + Accessibility API (fallback) | XCUITest is Apple's native framework—zero external dependencies, stable, integrated into Xcode. Accessibility API via Swift wrappers (e.g., AXorcist) handles complex selectors and out-of-process automation. Appium Mac2 Driver adds overhead without significant value for proprietary integration. |
| CLI Recording | VHS (Charmbracelet) for declarative .tape files → MP4/GIF | Mature, widely adopted in 2026. Supports custom keybindings, deterministic timing, MP4 output. Alternative: ScreenCaptureKit for windowed TUI capture (if VHS is insufficient). |
| Screen Recording (Windowed) | ScreenCaptureKit (native Swift) + AVAssetWriter | Apple's recommended API (macOS 12.3+). High-fidelity, hardware-accelerated H.264/HEVC encoding. Replacement for deprecated AVCaptureScreenInput. |
| Video → Keyframes | ffmpeg with palettegen + paletteuse for GIF; keyframe extraction via -vf select='eq(pict_type,I)' | Battle-tested, mature, and widely packaged on Unix-like systems. Palette generation is critical for GIF quality (a custom 256-color palette per video reduces color banding). |
| VLM + Blackbox Verification | Claude API (Opus 4.7) + keyframe gallery layout (not video) | Claude 4.x is the agent engine. Since animated GIFs are not processed, send ~10–20 extracted keyframes as a temporal sequence. LLM-judge model (Sonnet 4.6) for equivalence scoring. |
| Docs + Embedding | VitePress 2 (2026) with Vue 3 components for galleries + native <video>/<img> tags | VitePress has no built-in MDX; use Vue components (.vue files) for rich embeds. Supports MP4 via standard HTML5 <video>. GIFs via <img>. |
Part 2: Concrete WP Implementation Plan
WP 25: XCUITest + Accessibility Framework Integration
Goal: Drive SwiftUI app actions, capture screenshots, navigate UI state programmatically.
- Create `Tests/UITestHarness.swift` test bundle (linked to SwiftUI app target)
- Define `AppDriver` class wrapping XCUIApplication, exposing:
  - `navigate(_ path: [String]) -> XCUIElement`: traverse hierarchy by label/identifier
  - `tapButton(_ label: String)`: atomically find + tap button by accessibility label
  - `typeText(_ text: String) -> Void`: simulate keyboard input
  - `screenshot() -> CGImage`: capture current state via XCUIScreen or XCUIElement screenshots (XCUIDevice is not available on macOS)
- Fallback: Use AXorcist (Swift wrapper around Accessibility API) for selectors XCUITest cannot find
- Add inline tests verifying each driver method; trace to FR-UI-* functional requirements
- Acceptance: `cargo test --test ui_harness -- --nocapture` passes; screenshot array produced
WP 26: ScreenCaptureKit-Based Recording Pipeline
Goal: Record app interaction as MP4 with sub-frame capture; produce artifact suitable for ffmpeg processing.
- Create `apps/macos/HwLedger/Recorder/ScreenRecorder.swift` (Swift 6, SwiftUI integration)
- Implement `ScreenRecorder` class:
  - `startRecording(outputPath: URL) async throws`: initialize SCContentFilter + SCStream
  - `captureFrame() async -> CMSampleBuffer`: pull frame from stream
  - `stopRecording() async -> URL`: finalize AVAssetWriter, return file path
- Use `AVAssetWriter` with H.264 encoder (hardware-accelerated on Apple Silicon)
- Wire into test harness: before each user-journey test, spawn recorder; after, finalize MP4
- Acceptance: `test_record_and_playback()` produces 30fps MP4 (>2MB, seekable, plays in QuickTime)
WP 27: VHS Tape Scripts + CLI Recording Harness
Goal: Generate .tape files declaratively; execute via vhs CLI; output MP4 + GIF.
- Create `docs/recordings/tape-templates/` directory
- Implement `TapeGenerator` (Rust or Python) that:
  - Takes journey TOML/YAML spec (actions: [Type, Sleep, Wait, KeyPress], assertions: [output contains])
  - Emits `.tape` file with VHS syntax
  - Example: `Type "cargo build"` → `Type "cargo build"`, `Sleep 2s` → wait for "Finished"
- Shell wrapper `./record-journey.sh <journey-id>` that:
  - Calls `vhs docs/recordings/<journey-id>.tape` (the VHS CLI takes the tape file directly; it has no `run` subcommand)
  - Outputs: `out/<journey-id>.mp4`, `out/<journey-id>.gif`
- VHS configuration: 1400x800 terminal, 16pt font, Catppuccin theme, 30fps
- Acceptance: `./record-journey.sh demo-build` produces `out/demo-build.{mp4,gif}` in <10 seconds
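The `TapeGenerator` described above could be sketched in Python roughly as follows. This is a minimal illustration, not the planned implementation: `JourneySpec`, `render_tape`, and the `(kind, argument)` action encoding are hypothetical names, and the exact theme name (`Catppuccin Mocha`) is an assumption to check against `vhs themes`.

```python
from dataclasses import dataclass, field


@dataclass
class JourneySpec:
    """Hypothetical journey spec; field names are assumptions, not a fixed schema."""
    journey_id: str
    # Each action is (kind, argument), e.g. ("Type", "cargo build").
    actions: list[tuple[str, str]] = field(default_factory=list)


def render_tape(spec: JourneySpec, out_dir: str = "out") -> str:
    """Render a VHS .tape script as a string for the given journey."""
    lines = [
        f"Output {out_dir}/{spec.journey_id}.mp4",
        f"Output {out_dir}/{spec.journey_id}.gif",
        "Set Width 1400",
        "Set Height 800",
        "Set FontSize 16",
        "Set Framerate 30",
        'Set Theme "Catppuccin Mocha"',  # assumed theme name
    ]
    for kind, arg in spec.actions:
        if kind == "Type":
            # Type the command, then press Enter so the shell runs it.
            # Arguments containing double quotes would need escaping (not handled here).
            lines.append(f'Type "{arg}"')
            lines.append("Enter")
        elif kind == "Sleep":
            lines.append(f"Sleep {arg}")
        elif kind == "Wait":
            # Wait takes a regular expression between slashes.
            lines.append(f"Wait /{arg}/")
        elif kind == "KeyPress":
            lines.append(arg)  # e.g. "Enter", "Tab"
        else:
            raise ValueError(f"unsupported action kind: {kind}")
    return "\n".join(lines) + "\n"


spec = JourneySpec("demo-build", [("Type", "cargo build"), ("Sleep", "2s"), ("Wait", "Finished")])
print(render_tape(spec))
```

Rendering to a string keeps the generator testable without invoking `vhs`; the shell wrapper would write the result to `docs/recordings/<journey-id>.tape` before recording.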
WP 28: FFmpeg Keyframe Extraction + Palette Optimization
Goal: Extract I-frames from MP4; generate optimized GIF; produce PNG keyframe gallery.
- Create `tools/ffmpeg-pipeline/` Rust crate:
  - `extract_keyframes(mp4_path: &Path) -> Vec<PathBuf>`: run `ffmpeg -vf select='eq(pict_type,I)',showinfo -vsync vfr`
  - `optimize_gif(mp4_path: &Path, output: &Path) -> Result<()>`: two-pass (palettegen, then paletteuse):
    ```bash
    ffmpeg -ss 0 -i input.mp4 \
      -filter_complex "fps=10,scale=360:-1[s]; [s]split[a][b]; [a]palettegen[pal]; [b][pal]paletteuse" \
      output.gif
    ```
  - `generate_gallery(keyframes: Vec<PathBuf>, output_dir: &Path)`: copy + rename PNGs
- Integrate into CI/recording pipeline: after VHS produces MP4, auto-call this crate
- Store keyframes in `docs/recordings/<journey-id>/frames/` (numbered PNG files)
- Acceptance: `cargo test test_ffmpeg_pipeline` extracts 8–12 keyframes from 30-second MP4; GIF is <2MB, <5% quality loss
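The two ffmpeg invocations in this work package can be expressed as argument lists, which avoids shell quoting issues when the pipeline spawns ffmpeg. The sketch below is in Python for brevity (the plan calls for a Rust crate); `keyframe_cmd` and `gif_cmd` are hypothetical helper names, and the `-y` overwrite flag is an addition not present in the plan.

```python
from pathlib import Path


def keyframe_cmd(mp4: Path, frames_dir: Path) -> list[str]:
    """Build the ffmpeg argv that writes every I-frame as a numbered PNG."""
    return [
        "ffmpeg", "-i", str(mp4),
        # The inner single quotes protect the comma inside eq(...) from
        # being read as a filter separator by ffmpeg's filtergraph parser.
        "-vf", "select='eq(pict_type,I)',showinfo",
        "-vsync", "vfr",
        str(frames_dir / "frame_%04d.png"),
    ]


def gif_cmd(mp4: Path, gif: Path, fps: int = 10, width: int = 360) -> list[str]:
    """Build the ffmpeg argv for a GIF using a palette generated from the video."""
    graph = (
        f"fps={fps},scale={width}:-1[s];"
        "[s]split[a][b];[a]palettegen[pal];[b][pal]paletteuse"
    )
    # -y overwrites an existing output file (an assumption for CI reruns).
    return ["ffmpeg", "-y", "-i", str(mp4), "-filter_complex", graph, str(gif)]


# To execute: subprocess.run(cmd, check=True)
print(" ".join(gif_cmd(Path("out/demo-build.mp4"), Path("out/demo-build.gif"))))
```

Building the argv separately from running it makes the command construction unit-testable on machines without ffmpeg installed.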
WP 29: Keyframe Gallery + VLM Blackbox Verification
Goal: For each user journey, spawn a fresh Claude agent to describe the keyframe sequence; compare against intent label using LLM-judge.
- Create `tools/vlm-verifier/` Rust/Python crate:
  - `load_keyframes(journey_id: &str) -> Vec<Image>`: read PNG gallery from WP28 output
  - `generate_intent_label(journey_id: &str) -> IntentLabel`: load from YAML:
    ```yaml
    journey_id: "build-and-test"
    actions:
      - action: "run cargo build"
        precondition: "Terminal shows prompt"
        expected_visible_change: "Compilation output scrolls; progress bar appears"
    ```
  - `invoke_claude_vision(images: Vec<Image>, system_prompt: &str) -> String`: send keyframes + gallery layout to Claude Opus 4.7
  - `invoke_llm_judge(intent: &str, vlm_response: &str) -> EquivalenceScore`: ask Claude Sonnet 4.6: "Are these descriptions equivalent? (1-5 scale)"
- System prompt for VLM: "You are viewing a sequence of keyframes from a terminal/app interaction. Describe in 2–3 sentences what happens at each keyframe: what command was typed, what output appeared, what changed visually."
- Intent label schema: `action`, `precondition`, `expected_visible_change`, `actual_visible_change` (populated by VLM)
- Store results in `docs/recordings/<journey-id>/verification.json`:
  ```json
  {
    "journey_id": "build-and-test",
    "vlm_description": "User types 'cargo build'; compilation output scrolls...",
    "intent_label": { "action": "run cargo build", ... },
    "equivalence_score": 4,
    "status": "PASS"
  }
  ```
- Acceptance: `./verify-journey.sh build-and-test` completes in <15 seconds; verification.json populated with score and status
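The mapping from judge score to the `verification.json` record could look roughly like the sketch below. The pass threshold of 4 is an assumption: the brief only shows that a score of 4 corresponds to PASS, not where the cutoff lies. `build_verification` and `Verification` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass

PASS_THRESHOLD = 4  # assumption: the example only shows that a score of 4 maps to PASS


@dataclass
class Verification:
    journey_id: str
    vlm_description: str
    intent_label: dict
    equivalence_score: int
    status: str


def build_verification(journey_id: str, intent_label: dict, vlm_description: str,
                       score: int, threshold: int = PASS_THRESHOLD) -> Verification:
    """Turn a 1-5 judge score into a verification record with PASS/FAIL status."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    status = "PASS" if score >= threshold else "FAIL"
    return Verification(journey_id, vlm_description, intent_label, score, status)


record = build_verification(
    "build-and-test",
    {"action": "run cargo build"},
    "User types 'cargo build'; compilation output scrolls...",
    score=4,
)
print(json.dumps(asdict(record), indent=2))
```

Rejecting out-of-range scores early guards against a judge response that was parsed incorrectly, which would otherwise silently produce a FAIL or PASS.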
WP 30: VitePress Documentation Integration + Auto-Sidebar
Goal: Embed recordings, keyframe galleries, and verification status in VitePress docs; auto-generate sidebar from journey catalog.
- Create `docs/.vitepress/theme/components/JourneyViewer.vue` (Vue 3 SFC):
  - Props: `journeyId: string`
  - Render:
    - `<video src="...mp4" controls />` (MP4 playback)
    - `<img src="...gif" alt="..." />` (GIF fallback)
    - Keyframe gallery: horizontal scroll of PNGs with timestamps
    - Verification badge: "PASS" (green) / "FAIL" (red) from verification.json
    - Intent vs. VLM description in side-by-side Markdown blocks
- Create `docs/journeys/index.md` template:
  ```markdown
  <JourneyViewer journey-id="build-and-test" />
  ```
- Implement `sidebar-auto-journeys.ts` generator:
  - Scan `docs/recordings/*/metadata.yaml` (journey catalog)
  - For each journey, emit sidebar entry:
    ```ts
    { text: "Build and Test", link: "/journeys/build-and-test" }
    ```
  - Embed into `.vitepress/config.ts` `themeConfig.sidebar`
- Add GitHub Pages CI step: post-recording, run VLM verification, commit verification.json, rebuild docs
- Acceptance: `bun run docs:build` produces HTML with embedded MP4s, GIF fallbacks, keyframe galleries, verification badges; sidebar auto-populated; deploy to GitHub Pages
Part 3: VLM + Video Caveat & Keyframe Strategy
Claude Opus 4.7 Image Limitations (April 2026)
Critical Constraint: Animated GIFs are NOT supported. Only the first frame is processed.
- Supported formats: JPEG, PNG, GIF (static only), WebP
- Max dimensions: 2576 x 2576 px (3.75 MP) — up from 1568 px in prior version
- Max images per request: 100 (API, 200k-token models); 600 (API, all others)
- Request size limit: 32 MB (standard endpoints)
No native video support across the Claude Opus, Sonnet, or Haiku models as of April 2026. Alternative providers (Gemini 2.0+) ship native video, but integration complexity is high.
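The limits listed above lend themselves to a preflight check before each request. The following is a minimal sketch using the figures from this brief (2576 px maximum edge, 100 images, 32 MB request size); `Frame` and `preflight` are hypothetical names, and applying the 100-image limit rather than 600 is a conservative assumption.

```python
from dataclasses import dataclass

MAX_EDGE_PX = 2576                      # maximum image edge, per the list above
MAX_IMAGES = 100                        # conservative choice of the two limits above
MAX_REQUEST_BYTES = 32 * 1024 * 1024    # 32 MB request size limit


@dataclass
class Frame:
    name: str
    width: int
    height: int
    size_bytes: int


def preflight(frames: list[Frame]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch fits."""
    problems = []
    if len(frames) > MAX_IMAGES:
        problems.append(f"{len(frames)} images exceeds the limit of {MAX_IMAGES}")
    for f in frames:
        if max(f.width, f.height) > MAX_EDGE_PX:
            problems.append(f"{f.name}: {f.width}x{f.height} exceeds {MAX_EDGE_PX}px")
    # Base64 encoding emits 4 output bytes for every 3 input bytes (rounded up).
    encoded = sum(4 * ((f.size_bytes + 2) // 3) for f in frames)
    if encoded > MAX_REQUEST_BYTES:
        problems.append(f"encoded payload of {encoded} bytes exceeds {MAX_REQUEST_BYTES}")
    return problems


frames = [Frame(f"frame_{i:04d}.png", 1400, 800, 150_000) for i in range(1, 11)]
print(preflight(frames) or "ok")
```

The payload estimate counts only the image data; the JSON envelope and prompt text add a small amount on top.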
Recommended Approach: Keyframe Gallery
Instead of sending one animated GIF, send 10–20 extracted keyframes as a temporal sequence:
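```python
# Sketch: send extracted keyframes to Claude as an ordered image sequence.
import base64
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Frames previously extracted with ffmpeg (one every ~3 seconds), in temporal order.
keyframes = sorted(Path("frames").glob("frame_*.png"))


def b64(path: Path) -> str:
    """Base64-encode a PNG file for a Messages API image block."""
    return base64.standard_b64encode(path.read_bytes()).decode("ascii")


response = client.messages.create(
    model="claude-opus-4-7-20250416",  # model ID as given in this brief; verify before use
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what happens in this terminal session (images in order):"},
                *[
                    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64(kf)}}
                    for kf in keyframes
                ],
            ],
        }
    ],
)
print(response.content[0].text)
```
Trade-offs: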
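- Pros: Temporal narrative preserved; the agent sees the full interaction instead of only the first frame of a GIF
- Cons: Multiple images increase request size and input tokens (roughly 1,500 tokens per 1400×800 frame under Anthropic's published width × height / 750 approximation, so about 15,000–30,000 tokens for 10–20 frames); higher per-request latency; manual timestamping required for reference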
FFmpeg Keyframe Extraction Command
```bash
# Extract all I-frames (keyframes) and save their timestamps (showinfo logs to stderr)
ffmpeg -i input.mp4 -vf "select='eq(pict_type,I)',showinfo" -vsync vfr frame_%04d.png 2>&1 \
  | grep -o 'pts_time:[0-9.]*' | cut -d: -f2 > timestamps.txt
# Or use `-vf fps=0.33` to extract 1 frame every 3 seconds
ffmpeg -i input.mp4 -vf fps=0.33 frame_%04d.png
```
Quality: For a 30-second, 1400×800 terminal recording at 30fps, extracting at fps=0.33 (1 frame every 3s) yields ~10 PNG files, ~100–200 KB each. Total upload: ~1–2 MB (well under 32 MB limit).
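Because fixed-stride extraction does not record timestamps, the nominal time of each frame has to be derived from the stride. A minimal sketch follows; `frame_timestamps` and `frame_labels` are hypothetical helpers, and the values are nominal only, since ffmpeg's `fps` filter may pick source frames slightly offset from these marks and may emit one frame more or fewer.

```python
def frame_timestamps(duration_s: float, stride_s: float = 3.0) -> list[float]:
    """Nominal timestamps (seconds) for frames sampled once every stride_s seconds."""
    if duration_s <= 0 or stride_s <= 0:
        raise ValueError("duration and stride must be positive")
    count = max(int(duration_s // stride_s), 1)
    return [round(i * stride_s, 3) for i in range(count)]


def frame_labels(duration_s: float, stride_s: float = 3.0,
                 pattern: str = "frame_{:04d}.png") -> list[tuple[str, float]]:
    """Pair each output file name with its nominal timestamp."""
    # ffmpeg numbers image sequence output starting at 1 by default.
    stamps = frame_timestamps(duration_s, stride_s)
    return [(pattern.format(i + 1), t) for i, t in enumerate(stamps)]


for name, t in frame_labels(30.0):
    print(f"{name} @ {t:.1f}s")
```

These labels can be passed to the VLM alongside each image or shown in the keyframe gallery; where exact times matter, the `showinfo`-based timestamps are the more reliable source.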
Part 4: Token Cost Analysis
Per-Journey Cost Estimate
Scenario: 30-second SwiftUI app demo + CLI build.
Inputs:
- 10 keyframes (extracted, 1400×800 PNG each)
- 1 intent label (YAML/JSON, ~200 tokens)
- System prompt for VLM ("describe what you see")
Claude Opus 4.7 pricing (April 2026): $3 / 1M input tokens, $15 / 1M output tokens
| Step | Model | Input Tokens | Output Tokens | Cost (USD) |
|---|---|---|---|---|
| 1. Keyframe description (VLM) | Opus 4.7 | ~15,200 (10 images at ~1,500 each + ~300 text) | ~500 (2–3 sentence description) | ~$0.053 |
| 2. Equivalence judge (LLM) | Sonnet 4.6 | ~800 (intent + description) | ~100 (1–5 score) | ~$0.004 |
| Total per journey | n/a | ~16,000 | ~600 | ~$0.057 |
Cost per 100 journeys: ~$5.70 (still small)
Note: Pricing assumes Claude Opus 4.7 API rates (publicly announced April 2026). Image tokens are estimated with Anthropic's published approximation of (width × height) / 750, so a 1400×800 frame costs about 1,500 tokens. The judge row applies the same $3 / $15 rates because Sonnet 4.6 pricing is not listed in this brief; treat it as an upper-bound assumption. Actual costs depend on context length (longer descriptions inflate token count). Keyframe extraction via ffmpeg is free (local compute).
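The per-journey arithmetic can be reproduced with a few lines of Python, which makes it easy to rerun when rates or frame counts change. This sketch rests on stated assumptions: image tokens follow the (width × height) / 750 approximation, the judge call is billed at the same rates as the VLM call, and the ~300 text tokens cover the intent label plus prompt. All function names are hypothetical.

```python
# Rates in USD per million tokens, taken from the pricing line in this section.
INPUT_RATE = 3.0
OUTPUT_RATE = 15.0


def image_tokens(width: int, height: int) -> int:
    """Approximate tokens for one image: (width * height) / 750 (assumed formula)."""
    return round(width * height / 750)


def call_cost(input_tokens: int, output_tokens: int,
              in_rate: float = INPUT_RATE, out_rate: float = OUTPUT_RATE) -> float:
    """Cost in USD of a single API call at per-million-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def journey_cost(n_frames: int, width: int, height: int,
                 text_tokens: int = 300, description_tokens: int = 500,
                 judge_input: int = 800, judge_output: int = 100) -> float:
    """Total cost of the VLM description call plus the judge call."""
    vlm_input = n_frames * image_tokens(width, height) + text_tokens
    return call_cost(vlm_input, description_tokens) + call_cost(judge_input, judge_output)


print(f"per journey: ${journey_cost(10, 1400, 800):.4f}")
print(f"per 100 journeys: ${100 * journey_cost(10, 1400, 800):.2f}")
```

The image term dominates the total, so the estimate is most sensitive to frame count and resolution; downscaling frames or sending fewer of them is the main lever on cost.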
Part 5: Existing Phenotype Infrastructure Worth Reusing
Repositories with Relevant Patterns
| Path | What's There | Reuse Opportunity |
|---|---|---|
| /repos/KlipDot/demos/ | VHS .tape files (7 demos) | Tape generation template; reference CLI recording patterns |
| /repos/heliosApp/docs/.vitepress/ | VitePress config + Vue components (CategorySwitcher, etc.) | Copy JourneyViewer.vue pattern; sidebar config structure |
| /repos/heliosApp/docs/reports/worklog.md | "VitePress configured, pages workflow exists" | CI workflow for docs deployment |
| /repos/HexaKit/docs/.vitepress/theme/components/ | Reusable Vue 3 components (ModuleSwitcher, SidebarFilter) | Pattern for custom theme components |
| /repos/RIP-Fitness-App/.archive/docs/demos/gifs/ | Pre-recorded GIFs organized by platform (web, mobile, social) | Directory structure for journey recordings |
| /repos/agentapi-plusplus/docs/.vitepress/ | Production VitePress setup | Reference for CI integration, build config |
| /repos/phenotype-config/docs/journeys/ | UserJourney.vue + FeatureDetail.vue component imports | MDX-like embedded component patterns in VitePress |
Missing Components (Need to Build)
- XCUITest harness — No existing Swift UI test framework in Phenotype (hwLedger is greenfield)
- ScreenCaptureKit wrapper — Not found; needs new Swift module
- FFmpeg pipeline CLI — Not found; needs Rust crate or Python script
- VLM verifier agent — Not found; needs new tool (Rust or Python, Claude API integration)
- JourneyViewer.vue component — Not found; derive from existing component patterns
- sidebar-auto-journeys.ts — Not found; pattern exists (sidebar-auto.ts search was empty, but sidebar config structure is in heliosApp, HexaKit, agentapi-plusplus)
Part 6: Known-Unknowns & R&D Gaps
Genuinely Mature (Just Engineering)
- XCUITest fundamentals (screenshot capture, element navigation, tapping): stable since its introduction with Xcode 7 (2015)
- ScreenCaptureKit (macOS 12.3+): native Swift API, well-documented
- VHS tape language: mature, widely used in TUI projects (Bubble Tea, Charmbracelet ecosystem)
- FFmpeg keyframe extraction + palette generation: standard command-line tooling with a stable interface
- Claude API vision: stable as of April 2026 (Opus 4.7); no breaking changes expected
- VitePress 2: production-ready; Vue 3 components work as expected
Novel R&D (Needs Prototyping)
Large-scale keyframe sequence interpretation: Sending 20+ PNG frames to Claude — does it maintain temporal coherence across the sequence, or treat them as independent images? Prototype WP29 with 3–5 real journeys; measure coherence quality.
Equivalence scoring variance: Does Claude Sonnet 4.6 (as judge) consistently rank equivalent journeys the same way across multiple runs? Build regression test: same intent + VLM description → score variance. Target: ±0.5 on 1–5 scale.
False negatives from VLM: When a journey genuinely differs from intent (e.g., unexpected crash dialog), does the VLM notice without being prompted? Test: inject intentional UI regression, measure detection rate.
Accessibility API fallback reliability: When XCUITest cannot locate a SwiftUI element, does AXorcist consistently find it via Accessibility hierarchy? Prototype with 5–10 complex SwiftUI views; measure hit rate.
Intent label minimalism: What's the minimum spec for an intent label to avoid VLM hallucination? Current schema: `action`, `precondition`, `expected_visible_change`. Can we reduce further without losing signal? A/B test with live journeys.
CI latency: End-to-end (record + extract + verify) for 1 journey: target <20 seconds. Measure: CPU time (VHS, ffmpeg), I/O (upload to Claude), LLM latency (2 requests). Identify bottleneck.
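For the equivalence-scoring variance item above, the regression check could be sketched as follows. Reading "±0.5 on 1–5 scale" as "no run deviates from the mean score by more than 0.5" is one possible interpretation, not something the brief specifies; `score_spread` and `is_stable` are hypothetical names.

```python
from statistics import mean


def score_spread(scores: list[float]) -> float:
    """Largest absolute deviation of any run's score from the mean score."""
    if not scores:
        raise ValueError("need at least one score")
    center = mean(scores)
    return max(abs(s - center) for s in scores)


def is_stable(scores: list[float], tolerance: float = 0.5) -> bool:
    """True when every repeated judge run stays within +/- tolerance of the mean."""
    return score_spread(scores) <= tolerance


# Scores from repeated judge runs on the same intent + VLM description (made-up values).
runs = [4, 4, 5, 4, 4]
print(f"spread={score_spread(runs):.2f} stable={is_stable(runs)}")
```

With a five-point integer scale, a single run that differs by one point from the others already exceeds the tolerance, so this target effectively requires near-unanimous scores across runs.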
Part 7: Recommended Rollout Sequence
Phase 1 (WP 25–26, 1–2 weeks):
- XCUITest harness for hwLedger app
- ScreenCaptureKit recording pipeline
- Manual smoke test: run test, capture screenshots, verify MP4 produced
Phase 2 (WP 27–28, 1 week):
- VHS tape generation + ffmpeg pipeline
- CLI journey recording (build, test, deploy scenarios)
- Golden MP4 + GIF + keyframe gallery artifacts
Phase 3 (WP 29–30, 2 weeks):
- Claude API integration (Opus 4.7 keyframe VLM + Sonnet 4.6 judge)
- VitePress JourneyViewer component + sidebar auto-generation
- E2E: record journey → extract frames → verify → embed in docs
Phase 4 (Iterate, ongoing):
- Prototype R&D unknowns (temporal coherence, equivalence variance)
- Tune intent label schema
- Optimize CI latency (parallel ffmpeg, batch API calls)
Part 8: Stack Command Reference
Record a CLI Journey
```bash
vhs docs/recordings/demo-build.tape --output out/demo-build.mp4
```
Extract Keyframes + Optimize GIF
```bash
# Keyframes
ffmpeg -i out/demo-build.mp4 -vf fps=0.33 frames/frame_%04d.png
# GIF with custom palette
ffmpeg -i out/demo-build.mp4 \
  -filter_complex "fps=10,scale=360:-1[s]; [s]split[a][b]; [a]palettegen[pal]; [b][pal]paletteuse" \
  out/demo-build.gif
```
Run VLM Verification
```bash
cargo run --bin vlm-verifier -- --journey-id demo-build --model opus-4-7
```
Build VitePress Docs with Journeys
```bash
bun run docs:build  # Runs sidebar-auto-journeys.ts, embeds verification.json
```
References
- Anthropic Claude API Vision Docs
- Apple XCUITest Documentation
- Apple ScreenCaptureKit Documentation
- Charmbracelet VHS GitHub
- FFmpeg GIF/Keyframe Guide
- Visual Regression Testing in 2026
- macOS Accessibility API — AXorcist
- Appium Mac2 Driver
- VitePress 2 Documentation
Research completed: April 18, 2026
Status: Ready for WP implementation planning
Next step: Prioritize WP25–26 (XCUITest + ScreenCaptureKit) for hwLedger proof-of-concept