Forma × AMD

Why Forma needs MI300X

Two production-grade models. One GPU. 192GB HBM3 makes it possible.

  • 192 GB HBM3 memory
  • 5.3 TB/s memory bandwidth
  • 2 models running simultaneously
  • <1 s typical real-time latency

Dual-Model Architecture

[Architecture diagram: AMD MI300X, 192 GB HBM3]
  • Real-time path: user prompt → /analyze (every keystroke) → 8B tier (Meta-Llama-3.1-8B-Instruct, port 30000) → real-time score inference → Forma Score badge
  • Deep path: user clicks Deep Analysis → /agents/run-all (parallel specialists + consensus) → 70B AWQ-INT4 tier (port 8000) → 7-agent deep analysis (Detector, Critic, Reformulator, Style, Memory, Coach, Consensus) → Deep Analysis panel

Both models live in the same VRAM. No swap. No cold-start. Real-time + deep on a single GPU.
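Concretely, this co-residency can be set up by giving each vLLM server a fixed slice of the card via --gpu-memory-utilization. The sketch below is illustrative only: the memory fractions, concurrency limits, and the 70B AWQ checkpoint ID are assumptions, not Forma's actual launch script.

```python
# Sketch: two vLLM OpenAI-compatible servers sharing one MI300X.
# Fractions, limits, and the 70B checkpoint ID are illustrative assumptions.
import subprocess

SERVERS = [
    {   # Real-time tier: small slice for 8B weights (~15 GiB) + KV cache (~11 GiB)
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "port": 30000,
        "gpu_fraction": 0.15,
        "extra": ["--max-model-len", "2048", "--max-num-seqs", "44"],
    },
    {   # Deep-analysis tier: AWQ-INT4 70B weights (~37 GiB) + KV cache (~26 GiB)
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # assumed checkpoint
        "port": 8000,
        "gpu_fraction": 0.36,
        "extra": ["--quantization", "awq", "--max-model-len", "4096", "--max-num-seqs", "21"],
    },
]

procs = []
for s in SERVERS:
    cmd = [
        "vllm", "serve", s["model"],
        "--port", str(s["port"]),
        "--gpu-memory-utilization", str(s["gpu_fraction"]),
        *s["extra"],
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```

Because the two fractions sum to roughly half the card, both servers stay resident with plenty of VRAM left for additional concurrent sequences.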

Latency by Tier

Real-Time Tier (8B): 0.6-1.2s
Llama 3.1 8B Instruct • port 30000
  • Powers /analyze on every keystroke
  • 44 concurrent requests at 2048 tokens
  • 11.2 GiB KV cache
  • Latency window: 0-1.5s, average ~0.8s

Deep Analysis Tier (70B): 33-52s
Llama 3.1 70B AWQ-INT4 • port 8000
  • Powers /agents/run-all 7-agent consensus
  • 21 concurrent requests at 4096 tokens
  • 26.25 GiB KV cache
  • Latency window: 0-60s, average ~41s
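Those KV-cache figures line up with a quick back-of-the-envelope check using the published Llama 3.1 attention configs (32 layers for 8B, 80 for 70B, 8 grouped-query KV heads, head dim 128) and assuming fp16 cache entries. This is a consistency check, not Forma's sizing code.

```python
# Back-of-the-envelope KV-cache sizing, assuming fp16 (2-byte) K/V entries
# and Llama 3.1's grouped-query attention: 8 KV heads, head dim 128.
def kv_cache_gib(layers: int, concurrent: int, tokens: int,
                 kv_heads: int = 8, head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V, all layers
    return concurrent * tokens * per_token / 2**30

print(kv_cache_gib(layers=32, concurrent=44, tokens=2048))  # ~11.0 GiB, close to the 11.2 GiB above
print(kv_cache_gib(layers=80, concurrent=21, tokens=4096))  # 26.25 GiB, matching the 70B tier
```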

Why H100 80GB cannot do this

AMD MI300X (192 GB total)
  • 8B model: 15.1 GiB
  • 8B KV cache: 11.2 GiB
  • 70B AWQ model: 37.3 GiB
  • 70B KV cache: 26.3 GiB
  • Headroom: 102 GiB available for parallel users

NVIDIA H100 (80 GB total)
  • 8B model: 15.1 GiB
  • 8B KV cache: ~7 GiB
  • 70B AWQ model: 37.3 GiB
  • 70B KV cache: ~20 GiB
  • Over capacity against the 80 GB cap: no headroom for parallel users

The H100 80GB physically cannot fit both models with usable concurrency. You'd need two H100s, and you'd lose unified memory.
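As a sanity check on the budgets above (a back-of-the-envelope sketch using the figures as shown; the H100 KV numbers are the approximate squeezed values):

```python
# Memory budgets from the figures above, in GiB.
weights_8b, kv_8b = 15.1, 11.2
weights_70b, kv_70b = 37.3, 26.3

footprint = weights_8b + kv_8b + weights_70b + kv_70b
print(footprint)          # ~89.9 GiB for both tiers at full concurrency
print(192 - footprint)    # ~102 GiB of MI300X headroom, matching the figure above

# On an 80 GB H100 (~74.5 GiB addressable), even squeezed KV caches overrun the card:
squeezed = weights_8b + 7 + weights_70b + 20
print(squeezed)           # ~79.4 GiB, past the ~74.5 GiB available on an 80 GB card
```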

Forma's dual-tier architecture is only possible on MI300X-class hardware. The 192GB advantage isn't marketing — it's the difference between a working product and a broken architecture.

The architecture mirrors the business model

Forma is a freemium SaaS. The dual-model architecture on a single MI300X isn't just a hardware demo — it's the production architecture that makes the freemium unit economics work. Cheap 8B inference for free-tier users running on every keystroke. Expensive 70B inference for paid-tier users running deep analysis and long-context personalization. Both models live in 192GB HBM3 simultaneously, so a single GPU serves both tiers concurrently.

FREE TIER
Llama 3.1 8B Instruct
Real-time per-keystroke scoring. Inline underline detection for vague UI vocabulary. Sub-2-second latency. Acquires users at near-zero marginal cost.
  • Forma Score badge
  • Inline wavy underline
  • Hover tooltip with canonical term
  • 3 alternative pills with descriptions
  • Click-to-accept replacement
PRO TIER
Llama 3.1 70B AWQ-INT4
7-agent multi-agent consensus. Long-context Memory Engine personalization. Production-grade prompt rewriting with motion specs and accessibility attributes. Converts power users.
  • 7-agent Run Full Analysis
  • Memory Engine personalization
  • Cross-builder style fingerprint
  • Predicted next-project recommendations
  • Production prompt with motion + a11y specs
The unit economics require MI300X. 8B is cheap enough for free-tier scale. 70B is the expensive moat for paying users. Running both on a single H100 80GB isn't possible — you'd need two separate GPUs, doubling cost and breaking unified memory. MI300X's 192GB makes a freemium architecture economical. The hardware choice and the business model are the same decision.
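In code, the tier split is just routing: free-tier calls go to the 8B server, pro-tier deep analysis goes to the 70B server. A minimal sketch, assuming plan names and local ports; the function and field names are illustrative, not Forma's backend.

```python
# Illustrative tier routing. Only the ports and model roles come from the section above.
TIER_BACKENDS = {
    "free": "http://localhost:30000/v1",  # Llama 3.1 8B Instruct, per-keystroke scoring
    "pro":  "http://localhost:8000/v1",   # Llama 3.1 70B AWQ-INT4, 7-agent deep analysis
}

def backend_for(user_plan: str, deep_analysis: bool) -> str:
    """Route to the 70B tier only for paid users running deep analysis."""
    if deep_analysis and user_plan == "pro":
        return TIER_BACKENDS["pro"]
    return TIER_BACKENDS["free"]
```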

Runtime

  • Inference: vLLM 0.17.1 (OpenAI-compatible chat completions)
  • Compute: ROCm 7.0 (AMD GPU compute platform)
  • GPU: AMD MI300X (192 GB HBM3, 5.3 TB/s bandwidth)
  • 70B quantization: AWQ INT4 (4-bit weight quantization)
  • Hardware host: DigitalOcean GPU droplet (single GPU, $1.99/hr)
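Because both servers speak vLLM's OpenAI-compatible API, either tier can be called with the standard OpenAI Python client by pointing base_url at the right port. A minimal sketch against the 8B tier; the prompt is illustrative, not Forma's actual scoring prompt.

```python
# Calling the 8B real-time tier through vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Score this UI prompt for specificity (0-100)."},  # illustrative
        {"role": "user", "content": "Make the settings page look nicer."},
    ],
    max_tokens=64,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```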

Live Endpoints

POST /analyze
  • Powered by: 8B on AMD MI300X (port 30000)
  • Latency: ~800ms typical
  • Use: every keystroke, debounced

POST /agents/run-all
  • Powered by: 70B AWQ on AMD MI300X (port 8000)
  • Latency: 33-52s typical
  • Use: on-demand, "Run Full Analysis" click

POST /detect-vague
  • Powered by: 8B on AMD MI300X (port 30000)
  • Latency: ~1-2s typical
  • Use: inline underline detection (Free tier)

POST /agents/memory-deep
  • Powered by: 70B AWQ on AMD MI300X (port 8000)
  • Latency: ~12-25s typical
  • Use: Pro tier Memory Engine personalization
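At the application level, a client call might look like the sketch below. Only the routes and latency expectations come from this section; the base URL and JSON fields are assumptions.

```python
# Illustrative calls to Forma's endpoints. The JSON schema and base URL are assumed.
import requests

API = "http://localhost:3001"  # assumed app-server address

# Real-time path: fired (debounced) on every keystroke, served by the 8B tier.
fast = requests.post(f"{API}/analyze", json={"text": "Make the hero section pop"}, timeout=5)
print(fast.json())

# Deep path: triggered by the "Run Full Analysis" click, served by the 70B tier.
deep = requests.post(f"{API}/agents/run-all", json={"text": "Make the hero section pop"}, timeout=90)
print(deep.json())
```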