Forma × AMD

Why Forma needs MI300X

Two production-grade models. One GPU. 192GB HBM3 makes it possible.

  • 192 GB HBM3 memory
  • 5.3 TB/s memory bandwidth
  • 2 models running simultaneously
  • <1 s typical real-time latency

Dual-Model Architecture

[Architecture diagram: AMD MI300X, 192 GB HBM3]
  • Real-time path: user prompt → /analyze (every keystroke) → 8B tier (Meta-Llama-3.1-8B-Instruct, port 30000) → real-time score inference → Forma Score badge
  • Deep path: user clicks Deep Analysis → /agents/run-all (parallel specialists + consensus) → 70B AWQ-INT4 tier (port 8000) → 7-agent deep analysis (Detector, Critic, Reformulator, Style, Memory, Coach, Consensus) → Deep Analysis panel

Both models live in the same VRAM. No swap. No cold-start. Real-time + deep on a single GPU.
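Concretely, this co-residency can be set up by giving each vLLM server a fixed slice of the card via --gpu-memory-utilization. The sketch below is illustrative only: the memory fractions, concurrency limits, and the 70B AWQ checkpoint ID are assumptions, not Forma's actual launch script.

```python
# Sketch: two vLLM OpenAI-compatible servers sharing one MI300X.
# Fractions, limits, and the 70B checkpoint ID are illustrative assumptions.
import subprocess

SERVERS = [
    {   # Real-time tier: small slice for 8B weights (~15 GiB) + KV cache (~11 GiB)
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "port": 30000,
        "gpu_fraction": 0.15,
        "extra": ["--max-model-len", "2048", "--max-num-seqs", "44"],
    },
    {   # Deep-analysis tier: AWQ-INT4 70B weights (~37 GiB) + KV cache (~26 GiB)
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # assumed checkpoint
        "port": 8000,
        "gpu_fraction": 0.36,
        "extra": ["--quantization", "awq", "--max-model-len", "4096", "--max-num-seqs", "21"],
    },
]

procs = []
for s in SERVERS:
    cmd = [
        "vllm", "serve", s["model"],
        "--port", str(s["port"]),
        "--gpu-memory-utilization", str(s["gpu_fraction"]),
        *s["extra"],
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```

Because the two fractions sum to roughly half the card, both servers stay resident with plenty of VRAM left for additional concurrent sequences.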

Latency by Tier

Real-Time Tier (8B): 0.6-1.2s
Llama 3.1 8B Instruct • port 30000
  • Powers /analyze on every keystroke
  • 44 concurrent requests at 2048 tokens
  • 11.2 GiB KV cache
  • Latency window: 0-1.5s, average ~0.8s

Deep Analysis Tier (70B): 33-52s
Llama 3.1 70B AWQ-INT4 • port 8000
  • Powers /agents/run-all 7-agent consensus
  • 21 concurrent requests at 4096 tokens
  • 26.25 GiB KV cache
  • Latency window: 0-60s, average ~41s
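Those KV-cache figures line up with a quick back-of-the-envelope check using the published Llama 3.1 attention configs (32 layers for 8B, 80 for 70B, 8 grouped-query KV heads, head dim 128) and assuming fp16 cache entries. This is a consistency check, not Forma's sizing code.

```python
# Back-of-the-envelope KV-cache sizing, assuming fp16 (2-byte) K/V entries
# and Llama 3.1's grouped-query attention: 8 KV heads, head dim 128.
def kv_cache_gib(layers: int, concurrent: int, tokens: int,
                 kv_heads: int = 8, head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V, all layers
    return concurrent * tokens * per_token / 2**30

print(kv_cache_gib(layers=32, concurrent=44, tokens=2048))  # ~11.0 GiB, close to the 11.2 GiB above
print(kv_cache_gib(layers=80, concurrent=21, tokens=4096))  # 26.25 GiB, matching the 70B tier
```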

Why H100 80GB cannot do this

AMD MI300X (192 GB total)
  • 8B model: 15.1 GiB
  • 8B KV cache: 11.2 GiB
  • 70B AWQ model: 37.3 GiB
  • 70B KV cache: 26.3 GiB
  • Headroom: 102 GiB available for parallel users

NVIDIA H100 (80 GB total)
  • 8B model: 15.1 GiB
  • 8B KV cache: ~7 GiB
  • 70B AWQ model: 37.3 GiB
  • 70B KV cache: ~20 GiB
  • Over capacity against the 80 GB cap: no headroom for parallel users

The H100 80GB physically cannot fit both models with usable concurrency. You'd need two H100s, and you'd lose unified memory.
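As a sanity check on the budgets above (a back-of-the-envelope sketch using the figures as shown; the H100 KV numbers are the approximate squeezed values):

```python
# Memory budgets from the figures above, in GiB.
weights_8b, kv_8b = 15.1, 11.2
weights_70b, kv_70b = 37.3, 26.3

footprint = weights_8b + kv_8b + weights_70b + kv_70b
print(footprint)          # ~89.9 GiB for both tiers at full concurrency
print(192 - footprint)    # ~102 GiB of MI300X headroom, matching the figure above

# On an 80 GB H100 (~74.5 GiB addressable), even squeezed KV caches overrun the card:
squeezed = weights_8b + 7 + weights_70b + 20
print(squeezed)           # ~79.4 GiB, past the ~74.5 GiB available on an 80 GB card
```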

Forma's dual-tier architecture is only possible on MI300X-class hardware. The 192GB advantage isn't marketing — it's the difference between a working product and a broken architecture.

The architecture mirrors the business model

Forma is a freemium SaaS. The dual-model architecture on a single MI300X isn't just a hardware demo — it's the production architecture that makes the freemium unit economics work. Cheap 8B inference for free-tier users running on every keystroke. Expensive 70B inference for paid-tier users running deep analysis and long-context personalization. Both models live in 192GB HBM3 simultaneously, so a single GPU serves both tiers concurrently.

FREE TIER
Llama 3.1 8B Instruct
Real-time per-keystroke scoring. Inline underline detection for vague UI vocabulary. Sub-2-second latency. Acquires users at near-zero marginal cost.
  • Forma Score badge
  • Inline wavy underline
  • Hover tooltip with canonical term
  • 3 alternative pills with descriptions
  • Click-to-accept replacement
PRO TIER
Llama 3.1 70B AWQ-INT4
7-agent multi-agent consensus. Long-context Memory Engine personalization. Production-grade prompt rewriting with motion specs and accessibility attributes. Converts power users.
  • 7-agent Run Full Analysis
  • Memory Engine personalization
  • Cross-builder style fingerprint
  • Predicted next-project recommendations
  • Production prompt with motion + a11y specs
The unit economics require MI300X. 8B is cheap enough for free-tier scale. 70B is the expensive moat for paying users. Running both on a single H100 80GB isn't possible — you'd need two separate GPUs, doubling cost and breaking unified memory. MI300X's 192GB makes a freemium architecture economical. The hardware choice and the business model are the same decision.
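In code, the tier split is just routing: free-tier calls go to the 8B server, pro-tier deep analysis goes to the 70B server. A minimal sketch, assuming plan names and local ports; the function and field names are illustrative, not Forma's backend.

```python
# Illustrative tier routing. Only the ports and model roles come from the section above.
TIER_BACKENDS = {
    "free": "http://localhost:30000/v1",  # Llama 3.1 8B Instruct, per-keystroke scoring
    "pro":  "http://localhost:8000/v1",   # Llama 3.1 70B AWQ-INT4, 7-agent deep analysis
}

def backend_for(user_plan: str, deep_analysis: bool) -> str:
    """Route to the 70B tier only for paid users running deep analysis."""
    if deep_analysis and user_plan == "pro":
        return TIER_BACKENDS["pro"]
    return TIER_BACKENDS["free"]
```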

Runtime

  • Inference: vLLM 0.17.1 (OpenAI-compatible chat completions)
  • Compute: ROCm 7.0 (AMD GPU compute platform)
  • GPU: AMD MI300X (192 GB HBM3, 5.3 TB/s bandwidth)
  • 70B quantization: AWQ INT4 (4-bit weight quantization)
  • Hardware host: DigitalOcean GPU droplet (single GPU, $1.99/hr)
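Because both servers speak vLLM's OpenAI-compatible API, either tier can be called with the standard OpenAI Python client by pointing base_url at the right port. A minimal sketch against the 8B tier; the prompt is illustrative, not Forma's actual scoring prompt.

```python
# Calling the 8B real-time tier through vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Score this UI prompt for specificity (0-100)."},  # illustrative
        {"role": "user", "content": "Make the settings page look nicer."},
    ],
    max_tokens=64,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```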

Live Endpoints

POST /analyze
  • Powered by: 8B on AMD MI300X (port 30000)
  • Latency: ~800ms typical
  • Use: every keystroke, debounced

POST /agents/run-all
  • Powered by: 70B AWQ on AMD MI300X (port 8000)
  • Latency: 33-52s typical
  • Use: on-demand, "Run Full Analysis" click

POST /detect-vague
  • Powered by: 8B on AMD MI300X (port 30000)
  • Latency: ~1-2s typical
  • Use: inline underline detection (Free tier)

POST /agents/memory-deep
  • Powered by: 70B AWQ on AMD MI300X (port 8000)
  • Latency: ~12-25s typical
  • Use: Pro tier Memory Engine personalization
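At the application level, a client call might look like the sketch below. Only the routes and latency expectations come from this section; the base URL and JSON fields are assumptions.

```python
# Illustrative calls to Forma's endpoints. The JSON schema and base URL are assumed.
import requests

API = "http://localhost:3001"  # assumed app-server address

# Real-time path: fired (debounced) on every keystroke, served by the 8B tier.
fast = requests.post(f"{API}/analyze", json={"text": "Make the hero section pop"}, timeout=5)
print(fast.json())

# Deep path: triggered by the "Run Full Analysis" click, served by the 70B tier.
deep = requests.post(f"{API}/agents/run-all", json={"text": "Make the hero section pop"}, timeout=90)
print(deep.json())
```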