Transformer – Introducing the Latest arXiv Papers

Lizard: An Efficient Linearization Framework for Large Language Models

Lizard: An Efficient Linearization Framework for Large Language Models [100.6]
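We propose Lizard, a linearization framework that converts pretrained Transformer-based large language models (LLMs) into flexible subquadratic architectures for infinite-context generation. Lizard addresses the quadratic cost of softmax attention by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving output quality. We show that Lizard achieves near-lossless recovery of the teacher model's performance on standard language modeling tasks while significantly outperforming previous linearization methods.
Reference translation of the paper (metadata) (Fri, 11 Jul 2025 21:19:18 GMT)
Proposes "Lizard (Linearizing Softmax Attention with Recurrent Gate Dynamics), an efficient framework for linearizing LLMs".
The approach builds on an existing model: "We train our model in two stages: (1) an attention approximation stage where the subquadratic modules are trained to mimic softmax attention outputs, and (2) a fine-tuning stage where the linearized model is adapted to downstream language modeling objectives." As a result, the conversion reportedly needs little training data and keeps performance degradation small.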
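The attention approximation stage quoted above can be sketched roughly as follows. This is a minimal illustration, not Lizard's actual design: the feature map `feature_map`, the per-step scalar `gate`, and the mean-squared-error objective are all assumptions made here to show what "a subquadratic module trained to mimic softmax attention outputs" could look like.

```python
import numpy as np

def causal_softmax_attention(q, k, v):
    """Teacher: standard causal softmax attention, O(T^2) in sequence length."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """Hypothetical positive feature map (elu(x) + 1); not taken from the paper."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def gated_linear_attention(q, k, v, gate):
    """Student: recurrent linear attention with a per-step scalar decay gate.

    Runs in O(T) time with a fixed-size state, independent of context length.
    """
    T, d = q.shape
    state = np.zeros((d, v.shape[1]))   # running (decayed) sum of phi(k) v^T
    norm = np.zeros(d)                  # running (decayed) sum of phi(k)
    out = np.zeros_like(v)
    for t in range(T):
        fk, fq = feature_map(k[t]), feature_map(q[t])
        state = gate[t] * state + np.outer(fk, v[t])
        norm = gate[t] * norm + fk
        out[t] = (fq @ state) / (fq @ norm + 1e-9)
    return out

def approximation_loss(q, k, v, gate):
    """Stage-1 style objective: mean squared error between student and teacher."""
    diff = gated_linear_attention(q, k, v, gate) - causal_softmax_attention(q, k, v)
    return float(np.mean(diff ** 2))
```

In a real setup the feature map and gate would carry trainable parameters and this loss would be minimized by gradient descent; here they are fixed so that the sketch stays self-contained.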

It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization [26.4]
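We reconceptualize neural architectures as associative memory modules that learn a mapping of keys and values using an internal objective called attentional bias. We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs while maintaining a fast, parallelizable training process. For example, certain instances of Miras achieve exceptional performance on specific tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, outperforming Transformers and other modern linear recurrent models.
Reference translation of the paper (metadata) (Thu, 17 Apr 2025 17:59:33 GMT)
An exploration of new architectures by Google, proposing the Miras framework: "Building upon our formulation of memory and forget gate, we present Miras, a fundamental framework to design novel sequence modeling architectures by four choice of: (1) Attentional bias (i.e., memory objective), (2) Retention gate, (3) Memory architecture, and (4) Memory learning algorithm (i.e., optimizer)."
Moneta, Yaad, and Memora are selected as promising architectures and their performance is evaluated. The experiments only go up to 1.3B parameters, which is on the small side, but the results look very promising.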
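To make the four design choices concrete, here is one deliberately simple instance written as a sketch. Every choice below is an assumption picked for illustration (squared-error objective, scalar decay, a single matrix as memory, one gradient step per token); it is not Moneta, Yaad, or Memora.

```python
import numpy as np

def memory_step(M, k, v, lr=0.5, retain=1.0):
    """One online update of a linear associative memory M (maps key -> value).

    attentional bias   : squared error 0.5 * ||M k - v||^2   (assumed choice)
    retention gate     : scalar decay `retain` on the old memory (assumed choice)
    memory architecture: a single matrix M
    learning algorithm : one step of gradient descent with step size `lr`
    """
    grad = np.outer(M @ k - v, k)   # gradient of the squared error w.r.t. M
    return retain * M - lr * grad

def recall(M, k):
    """Read the value currently associated with key k."""
    return M @ k
```

With a unit-norm key and `retain=1.0`, each step shrinks the recall error by a factor of `1 - lr`, so repeated presentation of the same pair drives `recall(M, k)` toward `v`. Swapping the loss, the decay, or the update rule changes which sequence model this corresponds to.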

Transformers without Normalization

Transformers without Normalization [58.8]
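We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. We validate the effectiveness of Transformers with DyT across diverse settings, from recognition to generation, supervised to self-supervised learning, and computer vision to language models.
Reference translation of the paper (metadata) (Thu, 13 Mar 2025 17:59:06 GMT)
The paper states: "We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(αx), as a drop-in replacement for normalization layers in Transformers." This is an interesting finding, and according to "DyT improves training and inference speed, making it a candidate for efficiency-oriented network design." it is also favorable in terms of computational cost.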
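The operation is small enough to write out directly. The sketch below follows the quoted formula; the optional `gamma` and `beta` arguments are an assumption added here by analogy with the affine parameters of LayerNorm, and the default value of `alpha` is arbitrary (in practice it would be a learnable parameter).

```python
import numpy as np

def dyt(x, alpha=0.5, gamma=None, beta=None):
    """Dynamic Tanh: element-wise tanh(alpha * x).

    `gamma` and `beta` are optional per-channel scale and shift, included
    as an assumption; they are not part of the formula quoted above.
    """
    y = np.tanh(alpha * np.asarray(x, dtype=float))
    if gamma is not None:
        y = y * gamma
    if beta is not None:
        y = y + beta
    return y
```

Unlike LayerNorm, this computes no mean or variance over the feature dimension, which is consistent with the speed advantage the paper reports.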

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention [32.5]
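We present NSA, a natively trainable sparse attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision.
Reference translation of the paper (metadata) (Sun, 16 Feb 2025 11:53:44 GMT)
A proposal of hierarchical, sparse attention by DeepSeek. It is reported to be several times faster than a standard implementation.
The experiments use the following configuration: "Following the common practice in state-of-the-art LLMs, our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters." In terms of quality, the Average score is at or above that of full attention.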
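The combination of coarse-grained compression and fine-grained selection can be illustrated for a single query as below. This is a heavily simplified sketch under stated assumptions: blocks are compressed by a plain mean rather than a learned module, only the selection branch is shown, and `block_size` and `top_blocks` are hypothetical parameters. It is not the hardware-aligned kernel from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def block_sparse_attention(q, K, V, block_size=4, top_blocks=2):
    """Attend from one query to the tokens of the highest-scoring key blocks.

    Coarse step: each block of keys is compressed to its mean (assumed
    compression; a learned compressor could be used instead).
    Fine step: ordinary attention restricted to tokens in the selected blocks.
    """
    T, d = K.shape
    n_blocks = (T + block_size - 1) // block_size
    block_ids = np.arange(T) // block_size
    # Coarse-grained compression: one summary key per block.
    summaries = np.stack([K[block_ids == b].mean(axis=0) for b in range(n_blocks)])
    block_scores = summaries @ q / np.sqrt(d)
    # Fine-grained selection: keep only tokens from the top-scoring blocks.
    chosen = np.argsort(block_scores)[::-1][:top_blocks]
    token_mask = np.isin(block_ids, chosen)
    weights = softmax(K[token_mask] @ q / np.sqrt(d))
    return weights @ V[token_mask]
```

When `top_blocks` covers every block the result equals full attention, which gives a convenient sanity check; with fewer blocks the per-query cost depends on the number of selected tokens rather than on the full context length.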

Byte Latent Transformer: Patches Scale Better Than Tokens

Byte Latent Transformer: Patches Scale Better Than Tokens [101.1]
Byte Latent Transformer (BLT) encodes bytes into dynamically sized patches. For fixed inference costs, BLT shows significantly better scaling than tokenization-based models by simultaneously growing both patch and model size.
Reference translation of the paper (metadata) (Fri, 13 Dec 2024 05:33:32 GMT)
Various byte-level Transformers have been proposed, but building large-scale models has been difficult because of the computational cost. This paper proposes the following approach: "To efficiently allocate compute, we propose a dynamic, learnable method for grouping bytes into patches (§2) and a new model architecture that mixes byte and patch information." The authors report: "Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size."
Repository: GitHub – facebookresearch/blt: Code for BLT research paper
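One way to picture "grouping bytes into patches" dynamically is to start a new patch wherever the next byte is hard to predict. The sketch below assumes per-byte entropy estimates are already available (for example from a small byte-level language model, which is not implemented here); the function name and the fixed `threshold` are hypothetical.

```python
def patch_boundaries(entropies, threshold):
    """Split a byte sequence into patches using per-byte next-byte entropies.

    A new patch starts at every position (after the first) whose entropy
    exceeds `threshold`, so hard-to-predict regions get more, smaller patches.
    Returns a list of (start, end) index pairs with `end` exclusive.
    """
    patches, start = [], 0
    for i, h in enumerate(entropies):
        if i > 0 and h > threshold:
            patches.append((start, i))
            start = i
    if start < len(entropies):
        patches.append((start, len(entropies)))
    return patches

# Toy entropies: positions 2 and 5 are "surprising" and open new patches.
print(patch_boundaries([0.1, 0.2, 2.5, 0.3, 0.1, 3.0, 0.2], threshold=1.0))
# -> [(0, 2), (2, 5), (5, 7)]
```

Predictable stretches end up in long patches and cost few steps of the large model, which is the sense in which compute is allocated dynamically.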

Mixture-of-Transformers

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models [112.0]
Mixture-of-Transformers (MoT) is a sparse multi-modal transformer architecture. MoT decouples the non-embedding parameters of the model by modality. We evaluate MoT across multiple settings and model scales.
Reference translation of the paper (metadata) (Thu, 07 Nov 2024 18:59:06 GMT)
In contrast to Mixture of Experts, whose performance depends on the router, this paper proposes Mixture-of-Transformers with the following approach: "MoT extends the standard transformer architecture by incorporating modality-specific weights for all non-embedding model parameters, including feed-forward networks, attention matrices, and layer normalization." The authors claim effectiveness: "In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline’s performance using only 55.8% of the FLOPs."
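The idea of modality-specific weights can be sketched for the feed-forward part alone. The parameter layout (`weights` as a dict from modality id to a pair of matrices) and the ReLU nonlinearity are assumptions for illustration; per the quote, the actual model applies the same separation to attention matrices and layer normalization as well.

```python
import numpy as np

def modality_ffn(x, modality_ids, weights):
    """Apply a separate feed-forward layer to each token based on its modality.

    x            : (T, d) token representations
    modality_ids : (T,) integer modality of each token (e.g. 0 = text, 1 = image)
    weights      : dict modality -> (W_in, W_out); hypothetical parameter layout
    Routing is deterministic from the modality id, so no learned router is needed.
    """
    out = np.zeros_like(x)
    for m, (w_in, w_out) in weights.items():
        mask = modality_ids == m
        hidden = np.maximum(x[mask] @ w_in, 0.0)   # ReLU feed-forward
        out[mask] = hidden @ w_out
    return out
```

Each token touches only the weights of its own modality, so the total parameter count grows with the number of modalities while the per-token compute does not.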

Fundamental Limitations on Subquadratic Alternatives to Transformers

Fundamental Limitations on Subquadratic Alternatives to Transformers [3.5]
We focus on document similarity tasks, where many documents are given as input and the goal is to find the most similar pair. We prove that a Transformer can perform this task, and that no algorithm can perform it in truly subquadratic time.
Reference translation of the paper (metadata) (Sat, 05 Oct 2024 19:21:13 GMT)
The paper claims: "We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm."
That such tasks exist is unsurprising, and the analysis of the document similarity task is very interesting. In particular, "Theorem 3.1. Assuming SETH or OVC, for every ε > 0, there exists a constant c > 0 such that γ-LSD_{n,ℓ} cannot be solved in O(n^{2−ε}) time for any γ ≥ 1 when ℓ = c log n." is a nice result. (My impression is that the picture is often different in practical settings, but) this kind of theoretical analysis is important.
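For reference, the obvious quadratic baseline for "find the most similar pair" looks like the sketch below. The 0/1 vector encoding of documents and the inner-product similarity are assumptions for illustration; the formal γ-LSD problem in the paper may differ in its details.

```python
from itertools import combinations

def most_similar_pair(docs):
    """Return ((i, j), score) for the pair of documents with the largest inner product.

    `docs` is a list of equal-length 0/1 vectors (a bag-of-words style
    encoding assumed for illustration). The loop inspects all n*(n-1)/2 pairs.
    """
    best_pair, best_score = None, -1
    for i, j in combinations(range(len(docs)), 2):
        score = sum(a * b for a, b in zip(docs[i], docs[j]))
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score

docs = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 1], [0, 0, 0, 1]]
print(most_similar_pair(docs))  # -> ((0, 2), 2)
```

The theorem says that, under SETH or OVC, no algorithm improves on this all-pairs cost by a polynomial factor once the vector length is c log n, which is why a truly subquadratic architecture cannot solve the task in general.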

How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs

How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs [69.6]
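This paper identifies numerical precision as a key factor affecting the effectiveness of Transformer-based large language models on mathematical tasks. The results show that Transformers operating with low numerical precision fail to address arithmetic tasks such as iterated addition and integer multiplication. In contrast, Transformers with standard numerical precision can handle these tasks efficiently with significantly smaller model sizes.
Reference translation of the paper (metadata) (Thu, 17 Oct 2024 17:59:35 GMT)
The paper points out: "Our results show that Transformers operating with low numerical precision fail to address arithmetic tasks, such as iterated addition and integer multiplication, unless the model size grows super-polynomially with respect to the input length."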
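As rough intuition only, and not the paper's theoretical construction, the snippet below shows how iterated addition can break down once the running total no longer fits the available precision. It uses NumPy's `float16` as a stand-in for "low numerical precision"; the helper name `iterated_sum` is hypothetical.

```python
import numpy as np

def iterated_sum(values, dtype):
    """Add the values one by one, rounding the running total to `dtype` each step."""
    total = dtype(0)
    for x in values:
        total = dtype(total + dtype(x))
    return float(total)

ones = [1] * 4096
print(iterated_sum(ones, np.float16))  # 2048.0: 2049 is not representable, so the sum stalls
print(iterated_sum(ones, np.float64))  # 4096.0
```

Half precision has an 11-bit significand, so integers above 2048 are spaced two apart and adding 1 rounds back to 2048. The paper's claim concerns what a Transformer can express under such constraints, which is a stronger statement than this arithmetic example.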

ViT-1.58b

ViT-1.58b: Mobile Vision Transformers in the 1-bit Era [27.7]
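This paper introduces ViT-1.58b, a novel 1.58-bit quantized ViT model that significantly reduces memory and computational overhead. In experiments on CIFAR-10 and ImageNet-1k, ViT-1.58b maintains accuracy comparable to full-precision ViT.
Reference translation of the paper (metadata) (Wed, 26 Jun 2024 04:01:19 GMT)
This is the ViT counterpart of the earlier post "1-bit (1.58-bit) LLMs and HAWK / Griffin – Introducing the Latest arXiv Papers (devneko.jp)". According to "Our results show that ViT-1.58b achieves competitive accuracy on benchmarks like CIFAR10 and ImageNet-1k with significantly lower resource requirements.", the approach appears to give good results for ViT as well.
Repository: GitHub – DLYuanGod/ViT-1.58b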
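"1.58-bit" refers to weights restricted to three values, since log2(3) is about 1.58. A minimal sketch of such ternary quantization is shown below, using an absmean scale in the style of 1.58-bit LLM work; whether ViT-1.58b uses exactly this scheme is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize weights to {-1, 0, +1} with one shared scale (absmean scheme).

    Three levels carry log2(3) ~= 1.58 bits per weight, hence "1.58-bit".
    """
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Approximate reconstruction of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([[0.9, -0.05], [-1.2, 0.4]])
q, scale = ternary_quantize(w)
print(q.tolist())  # -> [[1, 0], [-1, 1]]
```

Matrix multiplication with ternary weights reduces to additions and subtractions, which is where the memory and compute savings come from.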

A Survey of Transformer Enabled Time Series Synthesis

A Survey of Transformer Enabled Time Series Synthesis [38.9]
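Generative AI has received much attention in the image and language domains. This survey identifies a gap at the intersection of transformers, generative AI, and time series data. The reviewed works show great variety in approach and have not yet converged on a conclusive answer to the problems the domain poses.
Reference translation of the paper (metadata) (Tue, 04 Jun 2024 13:52:42 GMT)
A survey on Transformers and time series data.
TNN is not an abbreviation one often sees for transformer neural network.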
