Sora and Gemini 1.5 – Introduction to the Latest arXiv Papers

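Among the news that drew attention last week were OpenAI's Sora, a text-to-video generation model, and Google's Gemini 1.5, which can handle extremely long text. Both announcements feel like a real step forward in the technology.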

Challengers such as Reka (Reka Flash: An Efficient and Capable Multimodal Language Model – Reka AI) are also emerging, so there is a lot of news.

Video generation models as world simulators
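We are teaching AI to understand and simulate the physical world in motion. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Sora can generate videos up to one minute long while maintaining visual quality and adherence to the user's prompt.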
Sora (openai.com)
Video generation models as world simulators (openai.com)
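Prior work (for example, Lumiere – Introduction to the Latest arXiv Papers (devneko.jp), Lumiere (lumiere-video.github.io), and MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation (magicvideov2.github.io)) was also impressive, but this work appears to be considerably ahead in both the length and the naturalness of the videos it can generate.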
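To make "spacetime patches" a little more concrete, here is a minimal sketch of cutting a video-shaped grid into non-overlapping blocks that span both time and space, with each block flattened into one token. This is only an illustration: the patch sizes (`pt`, `ph`, `pw`), the function name `spacetime_patches`, and the use of plain nested lists in place of learned latent codes are all assumptions made here, not details taken from OpenAI's report.

```python
from typing import List

# A toy "video" indexed as [frame][row][column]; a stand-in for latent codes.
Video = List[List[List[float]]]


def spacetime_patches(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Split a (T, H, W) grid into non-overlapping pt x ph x pw blocks.

    Each block is flattened (frame-major, then row, then column) into one
    token vector. Patch sizes are hypothetical illustration parameters.
    """
    t, h, w = len(video), len(video[0]), len(video[0][0])
    if t % pt or h % ph or w % pw:
        raise ValueError("patch sizes must evenly divide the video dimensions")

    tokens = []
    for t0 in range(0, t, pt):          # step through time
        for y0 in range(0, h, ph):      # step through rows
            for x0 in range(0, w, pw):  # step through columns
                token = [
                    video[f][y][x]
                    for f in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                tokens.append(token)
    return tokens


# Toy input: 4 frames of 4x4 values, each value encoding its own position.
toy = [[[f * 100 + y * 10 + x for x in range(4)] for y in range(4)] for f in range(4)]
tokens = spacetime_patches(toy, pt=2, ph=2, pw=2)
print(len(tokens), len(tokens[0]))
```

In a real model the resulting token sequence would be fed to a transformer; the point of the sketch is only that one token covers a small region across several frames rather than a single frame.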

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
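Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information. It achieves near-perfect recall on long-context retrieval tasks across modalities. It matches or surpasses the state-of-the-art performance of Gemini 1.0 Ultra across a broad set of benchmarks.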
Its ability to handle long text is strong, with performance far exceeding that of a pipeline that combines TF-IDF retrieval with re-ranking. The following passage, which was also shared on X (formerly Twitter), is striking: "With only instructional materials (500 pages of linguistic documentation, a dictionary, and ≈ 400 parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua, and therefore almost no online presence."
gemini_v1_5_report.pdf (storage.googleapis.com)
- Google Japan Blog: Announcing our next-generation model, Gemini 1.5 (googleblog.com)
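For readers unfamiliar with the comparison baseline mentioned above, a TF-IDF retrieval step followed by re-ranking can be sketched roughly as below. This is a toy illustration under stated assumptions, not the setup used in the Gemini 1.5 report: the whitespace tokenizer, the smoothed IDF formula, the term-coverage re-ranker `rerank_by_coverage`, and the example passages are all hypothetical choices made for this sketch.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer (an assumption; real systems use better ones).
    return text.lower().split()


def tfidf_retrieve(query: str, passages: List[str], k: int) -> List[int]:
    """Return indices of the k passages with the highest TF-IDF score."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    scored = []
    for i, doc in enumerate(docs):
        length = sum(doc.values()) or 1
        score = 0.0
        for term in sorted(set(tokenize(query))):
            df = sum(1 for d in docs if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF
            score += (doc[term] / length) * idf
        scored.append((score, i))
    # Highest score first; ties broken by original position.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


def rerank_by_coverage(query: str, passages: List[str], candidates: List[int]) -> List[int]:
    """Placeholder re-ranker: prefer passages containing more distinct query terms."""
    terms = set(tokenize(query))
    rank = {idx: pos for pos, idx in enumerate(candidates)}
    return sorted(
        candidates,
        key=lambda i: (-len(terms & set(tokenize(passages[i]))), rank[i]),
    )


# Hypothetical passages and query, for illustration only.
passages = [
    "kalamang is spoken in western new guinea",
    "the grammar book describes kalamang verbs and kalamang nouns",
    "video generation models use spacetime patches",
    "a dictionary lists kalamang words",
]
query = "kalamang grammar verbs"

candidates = tfidf_retrieve(query, passages, k=3)
reranked = rerank_by_coverage(query, passages, candidates)
print(candidates, reranked)
```

In such a pipeline only the few retrieved passages are passed to the model, whereas a long-context model can be given the entire source material at once.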
