Variable-Width Transformers
This work empirically investigates nonuniform capacity allocation across network depth in transformers. The proposed ×-Transformer, whose width varies across layers, consistently outperforms parameter-matched constant-width baselines on a range of tasks, suggesting that nonuniform capacity allocation improves performance.
Key takeaways
- Most transformer architectures maintain constant width across all layers.
- The proposed ×-Transformer consistently outperforms parameter-matched baselines.
- Nonuniform capacity allocation improves performance on a range of tasks.
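
To make "parameter-matched" concrete, the following is a minimal sketch of how a variable-width stack might be compared against a constant-width one of the same depth. It is not the paper's method: the width schedule, the function names, and the per-layer cost model (four attention projections plus a two-matrix MLP with a 4x expansion, ignoring biases, norms, embeddings, and any projections between layers of different width) are all assumptions made for illustration.

```python
import math


def layer_params(width: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one transformer block at a given width.

    Assumed cost model: 4 * d^2 for the attention projections plus
    2 * mlp_ratio * d^2 for the MLP. Biases and norms are ignored.
    """
    return (4 + 2 * mlp_ratio) * width * width


def stack_params(widths: list[int], mlp_ratio: int = 4) -> int:
    """Approximate total weight count for a stack with per-layer widths."""
    return sum(layer_params(w, mlp_ratio) for w in widths)


def matched_constant_width(widths: list[int], mlp_ratio: int = 4) -> int:
    """Largest constant width whose same-depth stack does not exceed the
    parameter count of the variable-width stack (under the model above)."""
    coeff = 4 + 2 * mlp_ratio
    target = stack_params(widths, mlp_ratio)
    return math.isqrt(target // (coeff * len(widths)))


# Hypothetical schedule: wide at both ends, narrow in the middle.
# This shape is illustrative only, not the ×-Transformer's actual layout.
variable_widths = [512, 384, 256, 256, 384, 512]

baseline_width = matched_constant_width(variable_widths)
baseline_widths = [baseline_width] * len(variable_widths)

print("variable-width params:", stack_params(variable_widths))
print("matched constant width:", baseline_width)
print("constant-width params:", stack_params(baseline_widths))
```

Under this simplified model, the baseline's budget is fixed first and its width is then solved for, so any difference in task performance can be attributed to how capacity is distributed across depth rather than to how much of it there is.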