Variable-Width Transformers

arXiv · score 0.24

This work empirically investigates nonuniform capacity allocation across network depth. The proposed ×-Transformer, a variable-width model, consistently outperforms parameter-matched constant-width baselines on a range of tasks, suggesting that nonuniform capacity allocation improves performance.

Key takeaways

Most transformer architectures maintain constant width across all layers.
The proposed ×-Transformer consistently outperforms parameter-matched baselines.
Nonuniform capacity allocation improves performance on a range of tasks.
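
To make "parameter-matched" concrete, here is a minimal Python sketch of how a variable-width stack could be compared with a constant-width one under the same parameter budget. Everything in it is an assumption for illustration and not taken from the paper: the per-block cost of 12·d² (attention plus a 4x MLP, with biases and norms ignored), the linear projection between layers of different width, the function names, and the example widths.

```python
import math


def block_params(d: int) -> int:
    # Assumption: 4*d*d for attention projections + 8*d*d for a 4x MLP;
    # biases, layer norms, and embeddings are ignored.
    return 12 * d * d


def total_params(widths: list[int]) -> int:
    total = sum(block_params(d) for d in widths)
    # Assumption: a d_in x d_out linear projection wherever adjacent widths differ.
    for d_in, d_out in zip(widths, widths[1:]):
        if d_in != d_out:
            total += d_in * d_out
    return total


def matched_constant_width(widths: list[int]) -> int:
    """Largest constant width whose total stays within the variable-width budget."""
    budget = total_params(widths)
    n_layers = len(widths)
    return math.isqrt(budget // (12 * n_layers))


# Hypothetical width schedule: wide at the ends, narrow in the middle.
variable = [512, 256, 256, 512]
baseline_width = matched_constant_width(variable)
print(total_params(variable), baseline_width)
```

Under these assumptions, the baseline is the widest constant-width stack of the same depth that does not exceed the variable-width parameter count, so any difference in results is not explained by model size alone.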

#transformers #model-architecture #research
