
Snippy Summary

Beyond Transformers with Power Retention

September 26, 2025 08:26
Latent Space

Summary of Manifest AI: Revolutionizing Context Windows with Power Retention Architectures

This podcast episode features an interview with Jacob Buckman of Manifest AI, discussing the company's founding mission: tackling the bottleneck of large inputs (long contexts) in language models, which the team identified as a key obstacle on the path to human-level intelligence (AGI) [1:04].

The Context Window Bottleneck in Transformers

Historically, context windows in models were very limited (e.g., 8,000 tokens) [0:33]. While model size (parameters) and total ingested knowledge (training data) scale efficiently, context length scaling is prohibitively expensive due to the growing KV cache in the attention mechanism [1:58-2:39].

  • The KV Cache Problem: Every token processed adds an entry to the KV cache, making each subsequent computation more expensive and eventually intractable [2:37] (see the back-of-envelope sketch after this list).
  • Workarounds Are Flawed: Current "long context" solutions, like windowed attention, are essentially band-aids [3:00]. They often leave models performing poorly on information in the middle of the context, favoring the start and end segments due to architectural limitations [3:39].
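
As a rough back-of-envelope illustration of the problem: the KV cache grows with every token, and every new token must attend over all of it. The model dimensions in the sketch below (layers, heads, head size, fp16 storage) are illustrative assumptions, not figures from the episode.

```python
# Back-of-envelope KV-cache size for a decoder-only transformer.
# All dimensions below are illustrative assumptions, not figures
# quoted in the episode.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Each token stores one key and one value vector per layer per KV head.
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

for tokens in (8_000, 64_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> {kv_cache_bytes(tokens) / 1e9:.1f} GB per sequence")
```

The cache grows linearly with context, while the attention compute over it grows quadratically, which is what makes very long contexts intractable.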

Manifest AI's Solution: Power Retention Architectures

Manifest AI was founded to solve this core problem by replacing the Transformer entirely with a new family of retention-based architectures [4:09]. Their specific variant is called Power Retention.

  • Fixed-Size Memory: Instead of a constantly growing KV cache, Power Retention uses a fixed-size memory into which new tokens are compressed [4:32] (see the sketch after this list).
  • Scalability: The memory size can be scaled up or down to match task difficulty and compute budget, much as parameter count is scaled during training [4:40].
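
The episode does not spell out Power Retention's exact update rule, but the architecture family it belongs to shares a common pattern: fold each token into a fixed-size state matrix rather than appending it to a growing cache. The sketch below shows that generic retention / linear-attention pattern; the dimensions and scalar decay are assumptions, not Manifest AI's actual formulation.

```python
import numpy as np

# Generic retention-style recurrence with a fixed-size state. This illustrates
# the idea of compressing tokens into constant memory; it is NOT Manifest AI's
# exact Power Retention update rule.

d_k, d_v = 64, 64
state = np.zeros((d_k, d_v))   # fixed-size memory, independent of sequence length
decay = 0.99                   # assumed scalar decay; real architectures vary

def step(state, q, k, v):
    state = decay * state + np.outer(k, v)   # compress the new token into the state
    out = q @ state                          # read this token's output from memory
    return state, out

rng = np.random.default_rng(0)
for _ in range(10_000):                      # process an arbitrarily long stream
    q, k, v = rng.normal(size=(3, d_k))
    state, out = step(state, q, k, v)

print(state.shape)  # (64, 64): constant no matter how many tokens arrive
```

Scaling d_k and d_v up or down is what lets memory capacity be traded against compute budget, as the bullet above describes.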

Engineering Superior Performance with Vidriel

To ensure Power Retention is practically faster, not just theoretically better in terms of FLOPs, Manifest invested heavily in ultra-efficient hardware kernels [8:51].

  1. Vidriel Framework: They developed Vidriel, a fully general framework for writing CUDA kernels, because higher-level tools like Triton lacked the necessary flexibility [9:54].
  2. Just-in-Time Sweeping: Vidriel’s key idea is to write a generic kernel that represents the entire space of possible hardware configurations (e.g., tiling, core usage). At runtime, it empirically sweeps through these possibilities for a few minutes to find the optimal configuration for the specific hardware and input shape [11:25-11:57] (see the sketch after this list).
  3. Performance Gains:
    • For standard operations like Flash Attention, Vidriel matches performance or is slightly faster [12:27].
    • Where Flash Attention is sub-optimal (many problem shapes), Vidriel achieves 20–30% speedups by making empirically better hardware decisions [12:40].
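
The just-in-time sweep is essentially empirical autotuning: enumerate the kernel's configuration space once, time each candidate on the actual hardware and input shape, and cache the winner. The sketch below shows that pattern in Python against a generic compile-and-run harness; it is not Vidriel's real API, and the configuration axes are assumptions.

```python
import itertools
import time

# Illustrative configuration space for a single kernel; the real axes
# (tiling, warp counts, pipeline stages, ...) depend on the kernel.
CONFIG_SPACE = {
    "tile_m": [64, 128, 256],
    "tile_n": [64, 128, 256],
    "num_warps": [4, 8],
    "stages": [2, 3, 4],
}

def sweep(compile_kernel, run_kernel, inputs):
    """Time every configuration on the real device and input shape, return the best."""
    best_cfg, best_time = None, float("inf")
    names = list(CONFIG_SPACE)
    for values in itertools.product(*CONFIG_SPACE.values()):
        cfg = dict(zip(names, values))
        kernel = compile_kernel(**cfg)        # build this specialization of the generic kernel
        start = time.perf_counter()
        run_kernel(kernel, *inputs)           # empirically measure it on the real shape
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg                           # cached and reused for this hardware/shape
```

The few minutes spent sweeping are amortized over an entire training or serving run, which is how empirically chosen configurations can beat hand-tuned heuristics on awkward problem shapes.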

Dramatic Speedups for Training and Inference

Power Retention offers substantial advantages over traditional attention mechanisms:

  • Training Speedup: At 64k context, Power Retention delivers roughly a 10x training speedup thanks to its reduction in FLOPs [7:47] (an illustrative FLOP comparison follows this list).
  • Inference Cost Reduction: The fixed memory size eliminates the infrastructural headache of dynamically growing KV caches, making GPU allocation and serving much simpler [6:16]. Inference speedups over Flash Attention at 64k context can reach 100x [8:07].
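
A rough way to see where the gain comes from: per layer, full attention costs on the order of T²·d FLOPs at context length T, while a fixed-size state of size s costs on the order of T·d·s, so the ratio grows with T. The dimensions below are illustrative assumptions, not the episode's measurements.

```python
# Illustrative per-layer FLOP comparison; the dimensions are assumptions,
# not measurements from the episode.

d = 128      # head dimension (assumed)
s = 4_096    # fixed retention state size (assumed)

def attention_flops(T):
    return T * T * d        # every token attends over the whole prefix

def fixed_state_flops(T):
    return T * d * s        # constant work per token against the fixed state

for T in (8_000, 64_000):
    ratio = attention_flops(T) / fixed_state_flops(T)
    print(f"T={T:>6,}: attention / fixed-state FLOP ratio ~ {ratio:.1f}x")
```

Because the advantage scales with context length, the longer the context the larger the speedup, and at inference time the savings compound with the constant-size memory footprint.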

Metamorphosis: Upgrading Existing Models

Manifest AI proposes a process called metamorphosis: fine-tuning existing models so they adopt the Power Retention architecture [14:33] (a conceptual sketch of the layer swap follows the list below).

  • Example: Power Coder: They demonstrated metamorphosis on the StarCoder2 3B model. After just 10,000 steps (about 2 hours) of training, the Power Retention variant (Power Coder) matched the original loss curve; further training let it surpass the baseline, reaching 35% accuracy on HumanEval versus the baseline's 30% [16:07-17:42].
  • Training Comparison: At 32k context, Power Coder trained about 5x faster per iteration (4.8s vs. 23s) than the StarCoder2 baseline [29:03].
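
Conceptually, metamorphosis swaps each attention block in a pretrained transformer for a retention block, keeps every other weight, and fine-tunes until the new layers recover the original behavior. The PyTorch-style sketch below shows only the layer-swap step; make_retention_layer is a placeholder for whatever module the open-sourced tooling provides, and real checkpoints use model-specific attention classes rather than nn.MultiheadAttention.

```python
import torch.nn as nn

def metamorphose(model: nn.Module, make_retention_layer) -> nn.Module:
    """Replace attention modules with fixed-state retention modules, keeping all other weights."""
    for name, child in model.named_children():
        if isinstance(child, nn.MultiheadAttention):   # stand-in for the model's attention class
            setattr(model, name, make_retention_layer(child.embed_dim, child.num_heads))
        else:
            metamorphose(child, make_retention_layer)  # recurse into nested blocks
    return model

# After the swap, a comparatively short fine-tuning run (the episode cites ~10,000
# steps for the 3B code model) lets the retention layers match the original loss.
```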

Key Takeaways and Future Outlook

  • Actionable Insight: Users can pip install retention to get a drop-in replacement for Flash Attention kernels and immediately use Power Retention in long-context training or inference [14:01] (a hypothetical usage sketch follows this list).
  • Open Sourcing: Manifest is open-sourcing the tools needed for metamorphosis, encouraging the community to transform existing models (e.g., Qwen and other text models) into Power Retention variants [23:21].
  • Data Dependency: The true value of long context is tied to data structure. Internet text often lacks long-context structure (because documents are stitched and packed together), so standard transformers remain largely sufficient for it. Manifest is actively seeking unique, long-context datasets (e.g., human task trajectories) to fully showcase Power Retention's capabilities [30:36-32:40].
  • Roadmap: The goal is to build community trust and see adoption by frontier model trainers within the next six months to a year [27:30-28:33].
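
The episode gives only the package name, not its API, so the sketch below is purely hypothetical: it illustrates the generic drop-in pattern of swapping the attention kernel inside a transformer block, using stand-in functions rather than the real retention package's calls.

```python
import numpy as np

# HYPOTHETICAL sketch of the drop-in idea; these stubs are NOT the API of the
# `retention` package mentioned in the episode.

def flash_attention_stub(q, k, v):
    # Stand-in for a Flash Attention call: cost grows quadratically with sequence length.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def retention_stub(q, k, v):
    # Stand-in for a fixed-state retention kernel (see the earlier recurrence sketch).
    state = np.cumsum(k[:, :, None] * v[:, None, :], axis=0)  # causal prefix of outer products
    return np.einsum("td,tde->te", q, state)

def mixer_block(q, k, v, kernel=flash_attention_stub):
    # Swapping kernel=retention_stub here is the "drop-in replacement" being described.
    return kernel(q, k, v)

q, k, v = np.random.default_rng(0).normal(size=(3, 256, 64))
print(mixer_block(q, k, v).shape, mixer_block(q, k, v, kernel=retention_stub).shape)
```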

Manifest AI Collaboration Requests

Manifest AI is actively hiring ML researchers and CUDA programmers, and is seeking collaborations with holders of unique, long-context datasets [29:37-32:40]. They also invite inference providers to partner with them to serve metamorphosed models at significantly lower inference cost [25:23].