バイト単位のTransformerは様々提案されてきたが、大規模なモデル構築は計算量の点で厳しかった。本件では「To efficiently allocate compute, we propose a dynamic, learnable method for grouping bytes into patches (§2) and a new model architecture that mixes byte and patch information.」という手法を提案。「Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.」とのこと。