Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis / Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression
Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis [88.1] Mamba models have attracted significant attention for their computational advantages over Transformer-based models. This paper presents the first theoretical analysis of the training dynamics of a one-layer Mamba model. While Mamba may require more training, it retains accurate predictions even when the fraction of outliers exceeds the threshold that linear Transformers can tolerate. Translated summary (metadata) (Wed, 01 Oct 2025 01:25:01 GMT)
A theoretical analysis of Mamba. This is an interesting property: "While linear Transformers may converge faster with smaller batch sizes, they can only in-context generalize effectively when the fraction of outlier-containing context examples is less than 1/2, much less than that for Mamba. Moreover, linear Transformers require significantly more context examples than Mamba to achieve comparable generalization performance. This highlights Mamba's superior robustness to a high density of outliers in ICL."
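A back-of-the-envelope way to see where the 1/2 threshold for linear Transformers comes from: if linear attention effectively implements an averaged gradient step over the context, then outlier examples whose labels come from an opposing task cancel the clean ones. The sketch below uses my own toy outlier model (an outlier fraction `p` of labels generated by the flipped task `-w_star`; not the paper's exact setup), and shows the averaged estimator's alignment with the true weights flipping sign past p = 1/2.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 5000
w_star = rng.normal(size=d)        # ground-truth regression weights

for p in [0.0, 0.3, 0.6]:          # fraction of outlier context examples
    X = rng.normal(size=(n, d))
    outlier = rng.random(n) < p    # toy outlier model: labels from the flipped task -w_star
    y = np.where(outlier, -X @ w_star, X @ w_star)
    w_hat = (X.T @ y) / n          # averaged one-gradient-step estimator (learning rate 1)
    # In expectation w_hat = (1 - 2p) * w_star, so alignment flips sign at p = 1/2
    cos = w_hat @ w_star / (np.linalg.norm(w_hat) * np.linalg.norm(w_star))
    print(f"p = {p:.1f}: cosine(w_hat, w_star) = {cos:+.3f}")
```

This is only a caricature of one side of the comparison; it says nothing about why Mamba itself tolerates a higher outlier density, which is the substance of the paper's analysis.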
Also an interesting observation: "The loss bound is comparable to that of Transformer. Our theoretical results reveal the different mechanism between Transformer and Mamba on ICL, where Mamba emulates a variant of online gradient descent to perform in-context, while Transformers approximate a single step of gradient descent. Furthermore, our comparison with the S4 model demonstrates that the selection components are essential for Mamba to perform ICL."
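The contrast between the two mechanisms is easy to see on a toy in-context linear regression task. Below is a minimal numpy sketch, not the paper's construction: the learning rates and the data model are illustrative assumptions. It compares the single averaged gradient step that the quote says Transformers approximate with a sequential online gradient descent pass of the kind trained Mamba is argued to emulate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200                      # feature dimension, number of context examples
w_star = rng.normal(size=d)        # ground-truth regression weights
X = rng.normal(size=(n, d))        # context inputs x_1, ..., x_n
y = X @ w_star                     # noiseless context labels y_i = <w*, x_i>
x_q = rng.normal(size=d)           # query input

def one_step_gd(X, y, eta):
    """Single gradient step from w = 0 on the squared loss over the whole
    context -- the estimator Transformers are said to approximate."""
    w = np.zeros(X.shape[1])
    grad = -(X.T @ (y - X @ w)) / len(y)
    return w - eta * grad          # equals (eta / n) * sum_i y_i x_i

def online_gd(X, y, eta):
    """One sequential pass over the context, updating after each example --
    the online-GD variant trained Mamba is argued to emulate."""
    w = np.zeros(X.shape[1])
    for x_i, y_i in zip(X, y):
        w -= eta * (w @ x_i - y_i) * x_i
    return w

for name, w_hat in [("one-step GD", one_step_gd(X, y, 0.5)),
                    ("online GD  ", online_gd(X, y, 0.1))]:
    print(f"{name}: query error = {abs(w_hat @ x_q - w_star @ x_q):.4f}")
```

Running this, the sequential pass fits the query far better than the single averaged step, which matches the intuition that recurrent state updates give Mamba a fundamentally different in-context mechanism than attention.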