Why geometric evolution on manifolds might be a linear-complexity, interpretable alternative to the Transformer’s quadratically scaling self-attention.
- Challenging the Status Quo: The article questions the assumption that dense self-attention is required for strong language performance, reinterpreting attention as a computationally expensive and mathematically opaque “tensor lifting” operation.
- The Geometric Solution: It introduces “Grassmann Flows,” an architecture that replaces attention matrices with geometric deformations of low-rank subspaces, achieving complexity that scales linearly with sequence length.
- Competitive Performance: Empirical tests show the Causal Grassmann architecture matches or slightly outperforms Transformer baselines on tasks like SNLI while offering a more mathematically traceable foundation for future AI interpretability.
For years, the title of the seminal 2017 paper “Attention Is All You Need” has been the subject of wordplay, but few have seriously challenged its core premise. Self-attention has become the de facto primitive for sequence modeling. The implicit assumption driving the current AI boom is that strong natural language performance strictly requires attending over all token pairs via a dense or approximate attention mechanism.
We must ask: Is this computationally heavy mechanism the only way to model reasoning?
Current research suggests that a central source of “uninterpretability” in Large Language Models (LLMs) is not merely their size, but the nature of the attention mechanism itself. Attention acts as a form of tensor lifting: it maps a hidden vector into a high-dimensional space of pairwise interactions. While extremely expressive, this process is difficult to trace. Across many layers and heads, the global effect becomes analytically opaque, making it nearly impossible to describe the model’s behavior with explicit invariants.
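To make the “tensor lifting” picture concrete, recall standard scaled dot-product attention (the notation below follows the original Transformer paper, not the Grassmann work):

$$
\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\qquad Q = XW_Q,\quad K = XW_K,\quad V = XW_V,
$$

where $QK^{\top} \in \mathbb{R}^{L \times L}$ explicitly materializes a score for every pair of positions. Each head in each layer applies another such lifting, so the composed map across the full network resists any description in terms of closed-form invariants.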
Enter Grassmann Flows: An Attention-Free Architecture
As a contrasting design, researchers have proposed an attention-free sequence model built around Grassmann flows. This approach shifts the paradigm from calculating weights between every word pair to modeling the geometric evolution of information.
The Causal Grassmann architecture operates through a four-step geometric process (a minimal code sketch follows the list below):
- Reduction: Token states are reduced to a low-dimensional space.
- Subspace Interpretation: Local token pairs are interpreted not as discrete points, but as two-dimensional subspaces, i.e. points of the Grassmann manifold Gr(2,r).
- Embedding: These subspaces are embedded via Plücker coordinates into a finite-dimensional projective space.
- Fusion: The resulting geometric features are fused back into hidden states through a gated mixing block.
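To make these steps concrete, here is a minimal PyTorch-style sketch of one Grassmann mixing block. Everything in it (the module name GrassmannMixer, the adjacent-token pairing, the sigmoid gating formula, the rank r = 8, and the residual connection) is an illustrative assumption rather than the authors’ reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GrassmannMixer(nn.Module):
    """One Grassmann mixing block: reduce -> pair -> Plücker embed -> gated fuse."""
    def __init__(self, d_model: int, r: int = 8):
        super().__init__()
        self.r = r
        plucker_dim = r * (r - 1) // 2               # dim of the Plücker embedding of Gr(2, r)
        self.reduce = nn.Linear(d_model, r)          # 1) reduction to a low-dimensional space
        self.fuse = nn.Linear(plucker_dim, d_model)  # 4a) project geometric features back up
        self.gate = nn.Linear(d_model, d_model)      # 4b) gate for mixing into hidden states

    def forward(self, x):                            # x: (batch, L, d_model)
        z = self.reduce(x)                           # (batch, L, r)
        # 2) interpret adjacent (causal) token pairs as 2-D subspaces of R^r
        u, v = z[:, :-1, :], z[:, 1:, :]
        # 3) Plücker coordinates: the antisymmetric minors u_i * v_j - u_j * v_i for i < j
        wedge = u.unsqueeze(-1) * v.unsqueeze(-2) - v.unsqueeze(-1) * u.unsqueeze(-2)
        i, j = torch.triu_indices(self.r, self.r, offset=1)
        p = wedge[..., i, j]                         # (batch, L-1, r(r-1)/2)
        p = p / (p.norm(dim=-1, keepdim=True) + 1e-6)  # projective (scale-free) normalization
        p = F.pad(p, (0, 0, 1, 0))                   # re-align to length L without looking ahead
        # 4) fuse the geometric features back into the hidden states through a gate
        return x + torch.sigmoid(self.gate(x)) * self.fuse(p)
```

Note that each position only ever forms a wedge product with its immediate neighbor; longer-range dependencies have to propagate across layers through repeated subspace deformations, which is exactly the “flow” the name refers to.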
In this model, information flows through the sequence not by explicit pairwise weights (the L×L attention matrix), but by controlled deformations of low-rank subspaces across layers.
Performance Without the Quadratic Cost
The most compelling aspect of Grassmann flows is that they decouple performance from the quadratic cost of self-attention.
Linear Complexity
Complexity analysis demonstrates that the mixing mechanism in this architecture scales linearly with sequence length for a fixed rank. In contrast, standard full self-attention scales quadratically, which has historically been a bottleneck for processing long documents or genomic sequences.
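In rough asymptotic terms, and using the illustrative sketch above rather than figures from the paper, the per-layer costs compare as

$$
\underbrace{O(L^{2} d)}_{\text{full self-attention}}
\qquad\text{vs.}\qquad
\underbrace{O(L\, r^{2} d)}_{\text{Grassmann mixing (wedge products + fused projection)}},
$$

so for a fixed rank $r$ and hidden width $d$ the Grassmann block grows only linearly in the sequence length $L$, whereas attention’s $L \times L$ score matrix grows quadratically.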
Empirical Results
Despite lacking an attention mechanism, the Grassmann architecture remains highly competitive:
- Language Modeling: On Wikitext-2, a purely Grassmann-based model (13–18M parameters) achieved validation perplexities within 10–15% of a size-matched Transformer baseline.
- Natural Language Inference (SNLI): When used as a classification head on top of a fixed DistilBERT backbone, the Grassmann approach slightly outperformed the standard Transformer head (Best validation accuracy 0.8550 vs. 0.8545).
De-Centering Attention
The goal of this research is not to claim that attention is obsolete, but to “de-center” it. The findings argue that what neural networks fundamentally need is a sufficiently expressive geometric evolution mechanism for hidden representations.
By moving the core of the model onto a finite-dimensional manifold—specifically, flows on Gr(k,r)—we gain a setting that is amenable to geometric analysis. This offers a potential path toward solving the “black box” problem of AI. If the model’s operations rely on defined geometric structures rather than unstructured tensor spaces, we may eventually define stable invariants (measurable properties that remain unchanged under transformation) that explain why a model made a specific decision.
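As one concrete illustration of what such an invariant could look like (a standard quantity from Grassmann geometry, not a metric proposed in this work), the geodesic distance between two subspaces $U, V \in \mathrm{Gr}(k, r)$ is determined entirely by their principal angles $\theta_1, \dots, \theta_k$:

$$
d_{\mathrm{Gr}}(U, V) = \Bigl(\sum_{i=1}^{k} \theta_i^{2}\Bigr)^{1/2},
$$

a quantity that is unchanged when both subspaces are rotated by the same orthogonal transformation. Tracking how such distances accumulate across layers is one way “geometric evolution” could be turned into a measurable, auditable property.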
The Path Forward
Grassmann flows open the door to a more geometric understanding of reasoning in neural networks. Future work aims to expand on several fronts:
- Global Invariants: Developing measures like curvature or cross-layer stability to constrain local mixing.
- Hybrid Architectures: Combining Grassmann mixing with state-space models or convolutional modules to balance local and global information.
- Scaling: Implementing fused kernels to test the theoretical linear scaling on massive datasets.
These flows provide a concrete example that we can achieve strong sequence modeling by allowing representations to move elegantly along the manifolds they inhabit, rather than forcing them through the dense web of an attention matrix.


