December 2024 – Page 5 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning [125.8]
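Inst-IT is a solution that strengthens the instance understanding of LMMs through explicit visual prompt instruction tuning. It consists of a benchmark for diagnosing multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm.
Reference translation of the paper (metadata) (Wed, 04 Dec 2024 18:58:10 GMT)
Proposes a benchmark and a dataset for instance-level understanding, for example of objects in a video.
The models fine-tuned by the authors perform well among open-source models but fall short of commercial ones. The scores also show that this is a genuinely difficult problem.
The repository is Inst-IT.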

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation [66.2]
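We introduce Moto, which converts video content into latent motion token sequences with a Latent Motion Tokenizer. We pre-train Moto-GPT on motion tokens, enabling it to capture diverse visual motion knowledge. To transfer the learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control.
Reference translation of the paper (metadata) (Thu, 05 Dec 2024 18:57:04 GMT)
Proposes the following method: "This paper introduces Moto, a novel method that uses latent motion tokens as a “language” interface to bridge generative pre-training on video data with precise robot control." It is interesting that something like latent meaning or intent is turned into a token sequence and treated as a language, and that this works.
The project site is Moto.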

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset [33.2]
We show how to achieve a better trade-off between accuracy and data quantity. An 8B-parameter model trained for 15T tokens, of which 7.2T come from this dataset, outperforms the Llama 3.1 8B model.
Reference translation of the paper (metadata) (Tue, 03 Dec 2024 17:28:50 GMT)
Like RedStone, this is a report on a method for refining Common Crawl, this time from NVIDIA. It is also interesting that synthetic data is mentioned: "We propose a method for transforming English Common Crawl into a 6.3T token long-horizon pretraining dataset, consisting of 4.4T globally deduplicated original tokens and 1.9T synthetically generated tokens."
The project site is Nemotron-CC.

RedStone: Curating General, Code, Math, and QA Data for Large Language Models

RedStone: Curating General, Code, Math, and QA Data for Large Language Models [134.5]
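This work explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training large language models. We introduce RedStone, an innovative and scalable pipeline designed to extract and process data from Common Crawl.
Reference translation of the paper (metadata) (Wed, 04 Dec 2024 15:27:39 GMT)
A report on, and implementation of, dataset construction from Common Crawl, which is an important data source for large-scale pre-training such as LLM development. The data is cut down substantially during filtering. According to the paper, "Our general domain dataset, REDSTONE-Web, outperforms existing open-source datasets in common sense reasoning benchmarks, while the inclusion of REDSTONE-Code and REDSTONE-Math significantly improves model performance in code generation and mathematical problem solving."
The repository is said to be https://github.com/microsoft/redstone, but at the time of writing it returns a 404.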

Large Language Model-Brained GUI Agents: A Survey

Large Language Model-Brained GUI Agents: A Survey [43.2]
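Multimodal models have ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. These agents represent a paradigm shift, allowing users to perform complex multi-step tasks with simple conversational commands.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 12:13:39 GMT)
A survey similar to "GUI Agents with Foundation Models: A Comprehensive Survey" (covered elsewhere on this blog), but the first author of this one is a Microsoft researcher.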

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Establishing Task Scaling Laws via Compute-Efficient Model Ladders [123.8]
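We develop task scaling laws and model ladders to predict the task performance of pretrained language models (LMs). We first use model and data size to predict a task-specific loss, and then use that task loss to predict task performance.
Reference translation of the paper (metadata) (Thu, 05 Dec 2024 18:21:49 GMT)
Proposes a method for predicting task performance efficiently. According to the paper, "With a less than 1% of the pretraining compute, we are able to predict the task performance of 7B-4T and 13B-5T models on individual multiple-choice tasks with good accuracy."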
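
The two-step prediction described above (model and data size to task loss, then task loss to task performance) can be sketched roughly as follows. This is only an illustration under stated assumptions: the power-law form for the loss, the sigmoid form for accuracy, the function names, and every coefficient value are hypothetical placeholders chosen for this sketch, not values or code from the paper. In practice the coefficients would be fitted on a ladder of small models.

```python
import math


def predict_task_loss(n_params, n_tokens, A, alpha, B, beta, E):
    """Step 1: predict task-specific loss from model size N and data size D.

    Assumed form (an illustration only): L = A / N^alpha + B / D^beta + E.
    """
    return A / n_params ** alpha + B / n_tokens ** beta + E


def predict_task_accuracy(task_loss, floor, ceiling, k, loss_mid):
    """Step 2: map task loss to accuracy with a decreasing sigmoid (assumed form)."""
    return floor + (ceiling - floor) / (1.0 + math.exp(k * (task_loss - loss_mid)))


# Hypothetical coefficients; real ones would be fitted on small "ladder" models.
LOSS_COEFFS = dict(A=400.0, alpha=0.3, B=400.0, beta=0.3, E=0.5)
ACC_COEFFS = dict(floor=0.25, ceiling=0.95, k=4.0, loss_mid=1.2)


def predict(n_params, n_tokens):
    """Chain the two steps: (N, D) -> task loss -> task accuracy."""
    loss = predict_task_loss(n_params, n_tokens, **LOSS_COEFFS)
    return loss, predict_task_accuracy(loss, **ACC_COEFFS)


if __name__ == "__main__":
    for name, n, d in [("7B-4T", 7e9, 4e12), ("13B-5T", 13e9, 5e12)]:
        loss, acc = predict(n, d)
        print(f"{name}: predicted loss={loss:.3f}, predicted accuracy={acc:.3f}")
```

Splitting the prediction into two fitted functions means the cheap small-model runs only need to pin down a handful of coefficients, which is what makes extrapolation to a larger target model feasible at a small fraction of its compute.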

SoK: Watermarking for AI-Generated Content

SoK: Watermarking for AI-Generated Content [112.9]
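Watermarking schemes embed hidden signals in AI-generated content to enable reliable detection. Watermarks play an important role in enhancing AI safety and trustworthiness by combating misinformation and deception. This work aims to guide researchers in advancing watermarking methods and applications, and to support policymakers in addressing the broader implications of GenAI.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 16:22:33 GMT)
A survey on watermarking.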

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.9]
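Multimodal Large Language Models (MLLMs) are attracting attention from both industry and academia. In the development process, evaluation is critical because it provides intuitive feedback and guidance for improving models. This work aims to help researchers easily grasp how to evaluate MLLMs effectively according to their different needs, and to inspire better evaluation methods.
Reference translation of the paper (metadata) (Fri, 22 Nov 2024 18:59:54 GMT)
A survey on MLLM evaluation. The repository, GitHub – BradyFU/Awesome-Multimodal-Large-Language-Models at Benchmarks, is very comprehensive.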

Liquid: Language Models are Scalable Multi-modal Generators

Liquid: Language Models are Scalable Multi-modal Generators [112.7]
Liquid is an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model. For the first time, Liquid uncovers a scaling law: the performance drop unavoidably brought by unified training on visual and language tasks diminishes as model size increases.
Reference translation of the paper (metadata) (Thu, 05 Dec 2024 16:48:16 GMT)
A study that makes an existing LLM handle images through the following change: "The only modification is the addition of 8192 new learnable embeddings for discrete image tokens. Correspondingly, we extend the original LM head by 8192 dimensions to enable the model to predict both text and image tokens within the same embedding space."
The result "For image generation, Liquid outperforms other auto-regressive based models, as well as some diffusion models like SD-XL and achieve FID of 5.47 on MJHQ-30K, demonstrating that LLMs can acquire excellent imagery capabilities efficiently with a limited amount of data." is surprising, and the paper further states: "For visual understanding, Liquid surpasses Chameleon and achieved results comparable to those of well-established MLLMs. In text-only tasks, Liquid achieves comparable performance with Chameleon, which used mix pre-training on a very large scale, and surpasses the performance of LLAMA2, demonstrating undegraded linguistic capabilities."
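
The vocabulary change quoted above (8192 new learnable embeddings plus an LM head extended by 8192 dimensions) can be illustrated with a toy sketch. Everything here other than the number 8192 is an assumption made for illustration: the function names, the plain list-of-rows matrices with shape (vocab_size, hidden_dim), the Gaussian initialisation, and the convention that image token ids are placed directly after the text vocabulary. It is not the authors' implementation.

```python
import random

NUM_IMAGE_TOKENS = 8192  # number of new discrete image tokens, from the quoted passage


def append_random_rows(matrix, num_rows, dim, rng, scale=0.02):
    """Return a copy of `matrix` with `num_rows` new randomly initialised rows."""
    new_rows = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(num_rows)]
    return [row[:] for row in matrix] + new_rows


def extend_for_image_tokens(embeddings, lm_head, num_new=NUM_IMAGE_TOKENS, seed=0):
    """Extend the input embedding table and the LM head by `num_new` entries each.

    Both matrices are assumed to be lists of rows with shape (vocab_size, hidden_dim).
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    return (append_random_rows(embeddings, num_new, dim, rng),
            append_random_rows(lm_head, num_new, dim, rng))


def image_token_id(text_vocab_size, image_code):
    """Map a discrete image code to its id in the shared text+image vocabulary."""
    return text_vocab_size + image_code


# Toy sizes (hypothetical): a 10-token text vocabulary with hidden size 4.
TEXT_VOCAB, HIDDEN = 10, 4
text_embeddings = [[0.0] * HIDDEN for _ in range(TEXT_VOCAB)]
text_lm_head = [[0.0] * HIDDEN for _ in range(TEXT_VOCAB)]

embeddings, lm_head = extend_for_image_tokens(text_embeddings, text_lm_head)
print(len(embeddings), len(lm_head))  # both equal TEXT_VOCAB + NUM_IMAGE_TOKENS
```

Because the original rows are left untouched and only new rows are appended, text tokens keep their ids and pretrained weights, while image tokens share the same output softmax as text tokens.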

Amazon Nova, OpenAI o1 pro, Gemini-Exp-1206, Llama 3.3

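Last week was especially busy for LLM-related news. Amazon, OpenAI, Google, and Meta all made sizable releases, and OpenAI says it will keep making announcements, which is something to look forward to.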

Introducing-Amazon-Nova-A-New-Generation-of-Foundation-Models – US Press Center
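- High-performance LLMs announced by Amazon, available in several versions as listed below
  - Amazon Nova Micro (fast text-to-text)
  - Amazon Nova Lite (fast multimodal)
  - Amazon Nova Pro (high-performance multimodal)
  - Amazon Nova Premier (a model specialized in complex reasoning?)
  - Amazon Nova Canvas (image generation)
  - Amazon Nova Reel (video generation)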
Introducing ChatGPT Pro | OpenAI
- Announcement of ChatGPT Pro. OpenAI o1 pro mode improves performance further over o1.
https://aistudio.google.com/app/prompts/new_chat?model=gemini-exp-1206
- The top model on Chatbot Arena (formerly LMSYS) as of 2024-12-05
Llama 3.3 | Model Cards and Prompt formats
- An openly released model from Meta, which claims: "Llama 3.3 is a text-only 70B instruction-tuned model that provides enhanced performance relative to Llama 3.1 70B–and to Llama 3.2 90B when used for text-only applications. Moreover, for some applications, Llama 3.3 70B approaches the performance of Llama 3.1 405B."

Competition among the companies is extremely fierce.
