
Variable-Width Transformers

This work empirically investigates nonuniform capacity allocation across network depth in transformers. The proposed ×-Transformer, whose width varies from layer to layer, consistently outperforms parameter-matched constant-width baselines on a range of tasks, suggesting that nonuniform capacity allocation improves performance.

Key takeaways

Most transformer architectures maintain constant width across all layers.
The proposed ×-Transformer consistently outperforms parameter-matched baselines.
Nonuniform capacity allocation improves performance on a range of tasks.
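
To make "parameter-matched" concrete, here is a minimal sketch of how a variable-width stack and a constant-width baseline could be compared at roughly equal size. Everything in it is an assumption for illustration and not taken from the paper: the per-layer width schedule is hypothetical, the helper names (`layer_params`, `total_params`, `matched_constant_width`) are made up, and the count uses the common approximation of 12 * d^2 weights per block, ignoring biases, layer norms, embeddings, and any projections needed between layers of different width.

```python
import math


def layer_params(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate weight count of one transformer block of width d_model.

    Attention contributes 4 * d^2 (Q, K, V, and output projections) and the
    feed-forward block contributes 2 * ffn_mult * d^2. Biases, layer norms,
    embeddings, and width-changing projections are ignored (assumption).
    """
    return 4 * d_model**2 + 2 * ffn_mult * d_model**2


def total_params(widths: list[int]) -> int:
    """Sum the approximate block sizes over all layers."""
    return sum(layer_params(d) for d in widths)


def matched_constant_width(widths: list[int], multiple: int = 8) -> int:
    """Constant width whose stack approximately matches the given one.

    Block size scales with d^2, so the root-mean-square of the widths gives
    the matching constant width, rounded here to a multiple of `multiple`.
    """
    rms = math.sqrt(sum(d * d for d in widths) / len(widths))
    return max(multiple, round(rms / multiple) * multiple)


# Hypothetical six-layer width schedule, chosen only for illustration.
variable_widths = [1024, 768, 512, 512, 768, 1024]

baseline_width = matched_constant_width(variable_widths)
baseline_widths = [baseline_width] * len(variable_widths)

ratio = total_params(baseline_widths) / total_params(variable_widths)
print(f"baseline width: {baseline_width}, parameter ratio: {ratio:.3f}")
```

Because block size grows quadratically with width, the matched constant width is the root-mean-square of the schedule rather than its arithmetic mean; a real comparison would also need to account for the terms this sketch leaves out.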

arXiv #transformers #model-architecture #research