- Powers /analyze on every keystroke
- 44 concurrent requests at 2048 tokens each
- 11.2 GiB KV cache
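
As a rough sanity check on the figures above, KV cache size can be estimated from the model shape and the number of tokens held in flight. The sketch below is illustrative only: the layer count, KV-head count, head dimension, and fp16 storage are assumptions modeled on a typical 8B decoder with grouped-query attention, not Forma's confirmed configuration.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, tokens, bytes_per_elem=2):
    """Estimate KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * tokens

# Assumed (hypothetical) 8B shape: 32 layers, 8 KV heads, head dim 128, fp16.
tokens_in_flight = 44 * 2048  # 44 concurrent requests at 2048 tokens each
total = kv_cache_bytes(32, 8, 128, tokens_in_flight)

print(f"{total / GIB:.1f} GiB")  # prints "11.0 GiB" under these assumptions
```

Under these assumptions the estimate lands close to the 11.2 GiB quoted above. The exact figure depends on the real model configuration, the cache data type, and any allocator or block-size overhead in the serving stack.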
Two production-grade models. One GPU. 192 GB of HBM3 makes it possible.
Both models live in the same VRAM. No swap. No cold start. Real-time and deep analysis on a single GPU.
Forma's dual-tier architecture depends on MI300X-class memory capacity. The 192 GB advantage isn't marketing: it is what lets both models stay resident at once, and that is the difference between a working product and a broken architecture.
Forma is a freemium SaaS. The dual-model architecture on a single MI300X is more than a hardware demo: it is the production architecture that makes the freemium unit economics work. Free-tier users get cheap 8B inference that runs on every keystroke. Paid-tier users get the more expensive 70B inference for deep analysis and long-context personalization. Both models reside in the 192 GB of HBM3 simultaneously, so a single GPU serves both tiers concurrently.
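
To make the memory claim concrete, here is a back-of-the-envelope budget for holding both models in 192 GB. It is a sketch under stated assumptions, not Forma's measured footprint: weights are taken as 2 bytes per parameter (fp16/bf16), parameter counts are the nominal 8B and 70B, capacity is treated as decimal gigabytes, and the 70B model's own KV cache is left inside the headroom because the text does not state it.

```python
GB = 10 ** 9
GIB = 1024 ** 3

def weight_bytes(num_params, bytes_per_param=2):
    """Weight footprint, assuming fp16/bf16 storage (2 bytes per parameter)."""
    return num_params * bytes_per_param

hbm_capacity = 192 * GB                 # capacity quoted in the text
w_small = weight_bytes(8_000_000_000)   # nominal 8B model  -> 16 GB
w_large = weight_bytes(70_000_000_000)  # nominal 70B model -> 140 GB
kv_small = int(11.2 * GIB)              # 8B KV cache figure quoted in the text

# Whatever remains must cover the 70B KV cache, activations, and runtime overhead.
headroom = hbm_capacity - (w_small + w_large + kv_small)

print(f"weights:  {(w_small + w_large) / GB:.0f} GB")
print(f"headroom: {headroom / GB:.1f} GB")
```

With these assumptions the two sets of weights alone take 156 GB, which is why both models fit on one 192 GB device but would not fit together on an 80 GB-class GPU without quantization or a second card.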