June 21, 2024 – Introducing the latest arXiv papers

Several huge multimodal, multilingual datasets have been released.

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [114.0]
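We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Compared with existing counterparts, our dataset is 15 times larger in scale while maintaining good data quality. We hope this provides a solid data foundation for future multimodal model research.
Reference translation of the paper (metadata) (Wed, 12 Jun 2024 17:01:04 GMT)
A huge multimodal, multilingual dataset with "8.6 billion images and 1,696 billion text tokens".
The repository is GitHub – OpenGVLab/OmniCorpus: OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text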

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus [52.8]
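We introduce mOSCAR, the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens, and 1.2B images. Few-shot learning performance improves substantially across various multilingual image-text tasks and benchmarks.
Reference translation of the paper (metadata) (Thu, 13 Jun 2024 00:13:32 GMT)
A dataset from the OSCAR project. The authors note: "We mostly filter “not safe for work” (NSFW) content at the document level."
The repository is mOSCAR – OSCAR Documentation (oscar-project.github.io)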

An Empirical Study of Mamba-based Language Models [69.7]
Selective state-space models (SSMs) such as Mamba overcome some of the shortcomings of Transformers. We directly compare 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets. The 8B Mamba-2-Hybrid outperforms the 8B Transformer on the 12 standard tasks.
Reference translation of the paper (metadata) (Wed, 12 Jun 2024 05:25:15 GMT)
An empirical evaluation of Mamba. It compares Mamba, Mamba-2, and Transformer at 8B parameters and 3.5T tokens. The reported results: "Our results show that while pure SSM-based models match or exceed Transformers on many tasks, both Mamba and Mamba-2 models lag behind Transformer models on tasks which require strong copying or in-context learning abilities (e.g., five-shot MMLU, Phonebook Lookup) or long-context reasoning." and "we find that the 8B-parameter Mamba2-Hybrid exceeds the 8B-parameter Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8× faster when generating tokens at inference time." Nothing here is surprising compared with earlier papers, but the comprehensive evaluation is very useful. The hybrid configuration looks like a very effective option.
The repository is Megatron-LM/examples/mamba at ssm · NVIDIA/Megatron-LM · GitHub
