January 2025 – Page 5 – Introducing the Latest arXiv Papers

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [15.4]
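This paper presents rStar-Math to show that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1. It exercises "deep thinking" through Monte Carlo Tree Search (MCTS), performing test-time search guided by an SLM-based process reward model.
Reference translation of the paper (metadata) (Wed, 08 Jan 2025 14:12:57 GMT)
The paper states: "In this work, we present rStar-Math, a self-evolved System 2 deep thinking approach that significantly boosts the math reasoning capabilities of small LLMs, achieving state-of-the-art OpenAI o1-level performance." This follows a currently popular approach. The phrase "self-evolved" gives a sense of where things are heading, and it is interesting that a relatively small model can reach such high scores.
The repository is https://github.com/microsoft/rStar. As of this writing it seems to return a 404?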
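As a rough illustration of the MCTS-plus-reward idea mentioned above, the sketch below shows UCT-style selection among candidate reasoning steps, with each step's mean reward standing in for a process reward model score. The `Node` class, the function names, and the exploration constant `c=1.4` are assumptions made for illustration. This is generic UCT, not the paper's implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate reasoning step in the search tree (hypothetical structure)."""
    step: str
    visits: int = 0
    total_reward: float = 0.0  # sum of rewards backed up through this node
    children: list["Node"] = field(default_factory=list)

    @property
    def q(self) -> float:
        # Mean backed-up reward; 0.0 while the node is unvisited.
        return self.total_reward / self.visits if self.visits else 0.0


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited steps are always tried first
    return child.q + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the child step with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits, c))


def backup(path: list[Node], reward: float) -> None:
    """Propagate a reward (e.g. a reward model score) along the visited path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

In a full search loop, `select_child` would be applied repeatedly from the root, a new step would be expanded and scored, and `backup` would update the statistics along the path.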

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides [53.2]
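This paper proposes a two-stage, edit-based approach for automatically generating presentations. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas. In experiments, PPTAgent significantly outperformed conventional presentation generation methods across all three evaluation dimensions.
Reference translation of the paper (metadata) (Tue, 07 Jan 2025 16:53:01 GMT)
Automatic presentation creation. The input is a PPT and a PDF: stage 1 analyzes the reference PPT, and stage 2 generates an outline and then the slides. An evaluation framework is also built (it uses GPT-4o internally): "To address the limitations of existing automated metrics for presentation evaluation, we introduce PPT Eval, a comprehensive framework for assessing presentation quality from multiple perspectives."
The repository is GitHub – icip-cas/PPTAgent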

Cosmos World Foundation Model Platform for Physical AI

Cosmos World Foundation Model Platform for Physical AI [136.1]
Physical AI needs a digital twin of itself (the policy model) and a digital twin of the world (the world model). We present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups.
Reference translation of the paper (metadata) (Tue, 07 Jan 2025 06:55:50 GMT)
NVIDIA's World Foundation Model, which drew a lot of attention. It is impressive that the models are released as part of a comprehensive package: "Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers."
It was interesting that the construction process includes this step: "We refine our data by excluding specific video types that could lead to poor generation quality or unrealistic dynamics, such as abstract visual patterns, video game footage, animated content, etc." That such content leads to unrealistic dynamics seems right.
This is still an early stage and there appear to be many problems, but I look forward to future progress. I am very curious whether this can be achieved by evolving current approaches or whether it requires a change in the underlying model architecture.
The repository is GitHub – NVIDIA/Cosmos: Cosmos is a world model development platform that consists of world foundation models, tokenizers and video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. Cosmos is purpose built for physical AI. The Cosmos repository will enable end users to run the Cosmos models, run inference scripts and generate videos.

Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking

Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [124.7]
HaluSearch is a new framework that incorporates tree search-based algorithms. It frames text generation as a step-by-step reasoning process, and it introduces a hierarchical thinking system switch mechanism inspired by the dual process theory in cognitive science.
Reference translation of the paper (metadata) (Thu, 02 Jan 2025 15:36:50 GMT)
"We propose HaluSearch, which integrates tree search-based algorithms (e.g., MCTS) to explicitly implement a slow thinking process during the inference stage of LLMs, fully exploiting their own internal knowledge to mitigate hallucinations in generated text." The approach evaluates a reward at each step: "To facilitate self-evaluation, we trained the reward model using data synthesized by the HaluSearch framework to assess the degree of hallucinations and provide reward signals." Its distinguishing feature is the following mechanism, which also looks promising as a countermeasure against overthinking: "Additionally, to improve efficiency, we introduced a dynamic system switch mechanism, which utilizes a trained switch model to enable LLMs to adaptively alternate between fast and slow thinking modes at both the instance and step levels."
An interesting approach that packs in essentially everything available at the moment.
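The fast/slow switching described above can be pictured with a small routing sketch. It covers only the instance level: a switch score decides whether a question goes to a cheap direct generator or to a slower search-based one. The function names, the callables, and the `threshold=0.5` cut-off are all hypothetical stand-ins; the paper uses a trained switch model and also switches at the step level, neither of which is reproduced here.

```python
from typing import Callable


def answer_with_switch(
    question: str,
    switch_score: Callable[[str], float],
    fast_generate: Callable[[str], str],
    slow_generate: Callable[[str], str],
    threshold: float = 0.5,
) -> tuple[str, str]:
    """Route a question to fast or slow generation based on a switch score.

    switch_score is assumed to return a value in [0, 1], where higher means
    the question is more likely to need deliberate (slow) reasoning.
    """
    mode = "slow" if switch_score(question) >= threshold else "fast"
    generate = slow_generate if mode == "slow" else fast_generate
    return mode, generate(question)
```

With toy stand-ins, an easy question such as "2+3" would be routed to `fast_generate`, while a question the switch scores as hard would go through `slow_generate`.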

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Search-o1: Agentic Search-Enhanced Large Reasoning Models [24.2]
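Large reasoning models (LRMs) such as OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. This paper introduces Search-o1, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module.
Reference translation of the paper (metadata) (Thu, 09 Jan 2025 16:48:17 GMT)
A proposal for a RAG + Large Reasoning Model framework. It could be read as just another agentic approach, but the paper contrasts three settings: "(a) Direct reasoning without retrieval often results in inaccuracies due to missing knowledge. (b) Our agentic retrieval-augmented reasoning approach improves knowledge access but usually returns lengthy, redundant documents, disrupting coherent reasoning. (c) Our Search-o1 integrates concise and accurate retrieved knowledge seamlessly into the reasoning process, enabling precise and coherent problem-solving." It argues for the effectiveness of using Reason-in-Documents, as a process separate from the LRM, to select and summarize information that fits the flow of reasoning and then feed it into the LRM.
The repository is Search-o1: Agentic Search-Enhanced Large Reasoning Models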
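To make the contrast between settings (b) and (c) concrete, here is a minimal sketch of a reasoning loop in which retrieved documents pass through a separate condensing step before re-entering the reasoning context. The `[SEARCH]`/`[RESULT]` markers and the `reason_step`, `retrieve`, and `condense` callables are hypothetical placeholders, not the markers or APIs used by the actual Search-o1 code.

```python
import re
from typing import Callable

# Hypothetical marker the reasoning model emits when it wants to search.
SEARCH_PATTERN = re.compile(r"\[SEARCH\](.*?)\[/SEARCH\]", re.DOTALL)


def reason_with_search(
    question: str,
    reason_step: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
    condense: Callable[[str, str, list[str]], str],
    max_turns: int = 5,
) -> str:
    """Alternate between reasoning and retrieval until no search is requested."""
    context = question
    for _ in range(max_turns):
        output = reason_step(context)
        context += "\n" + output
        match = SEARCH_PATTERN.search(output)
        if match is None:
            break  # no further search requested, so reasoning is finished
        query = match.group(1).strip()
        docs = retrieve(query)
        # Separate step: keep only what helps the current chain of reasoning,
        # instead of pasting the raw (long, redundant) documents.
        context += "\n[RESULT]" + condense(query, context, docs) + "[/RESULT]"
    return context
```

Dropping the `condense` call and appending `docs` directly would correspond to setting (b), where long documents are injected as-is.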

M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs

M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs [66.8]
We introduce M$^3$oralBench, the first MultiModal Moral Benchmark for LVLMs. M$^3$oralBench expands the everyday moral scenarios in the Moral Foundations Vignettes (MFVs) and uses the text-to-image diffusion model SD3.0 to create the corresponding scenario images. It conducts moral evaluation across the six moral foundations of Moral Foundations Theory (MFT) and covers tasks on moral judgement, moral classification, and moral response.
Reference translation of the paper (metadata) (Mon, 30 Dec 2024 05:18:55 GMT)
A multimodal moral benchmark, based on six moral foundations: "Care/Harm (dislike for suffering of others), Fairness/Cheating (proportional fairness), Loyalty/Betrayal (group loyalty), Authority/Subversion (respect for authority and tradition), Sanctity/Degradation (concerns for purity and contamination), Liberty/Oppression (concerns on oppression and coercion)"
The repository is GitHub – BeiiiY/M3oralBench: The official Github page for “M³oralBench: A MultiModal Moral Benchmark for LVLMs”

How Panel Layouts Define Manga: Insights from Visual Ablation Experiments

How Panel Layouts Define Manga: Insights from Visual Ablation Experiments [24.4]
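This paper aims to analyze the visual characteristics of manga works, focusing in particular on panel layout features. As the research method, a deep learning model was trained to predict the manga title from page images. Specifically, an ablation study was conducted in which the page image information was restricted to panel frames in order to analyze the characteristics of panel layouts.
Reference translation of the paper (metadata) (Thu, 26 Dec 2024 09:53:37 GMT)
An analysis of the characteristics of manga layouts.
"This study used deep learning to explore whether panel page designs in manga vary by work. Our experiments showed that even without characters and text, panel layouts exhibit inherent uniqueness, serving as a key distinguishing feature for manga. This was validated through classification tasks and supported by Grad-CAM visualizations." This is about what one would expect. Whether deep learning is really necessary here is somewhat unclear, though.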

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.4]
We propose OS-Genesis, a new data synthesis pipeline for graphical user interface (GUI) agents. Instead of relying on predefined tasks, OS-Genesis lets agents first perceive the environment and perform step-wise interactions. A trajectory reward model is then used to ensure the quality of the generated trajectories.
Reference translation of the paper (metadata) (Fri, 27 Dec 2024 16:21:58 GMT)
A proposal for a synthetic data construction method for GUI agent development, an area where research is moving quickly. The base data is built as follows: "OS-Genesis begins by exploring the functionality of GUI environments through traversing interactive UI elements with actions (e.g., CLICK). This forms the basis for reverse task synthesis, where observed states and actions are retroactively transformed into low-level instructions. These low-level instructions are then derived into high-level instructions, which can seed the collection of GUI trajectories." Quality is then ensured by a Trajectory Reward Model. The paper says "Built upon GPT-4o, TRM aims to perform a graded evaluation with a reward score R ∈ [1, 5] to assist in sampling for training."...
The repository is OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
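One simple way to picture a graded reward "assisting in sampling" is reward-weighted sampling, where trajectories with higher scores are drawn more often but low-scoring ones are not discarded outright. The sketch below assumes a sampling probability proportional to the score R in [1, 5]. That weighting rule and the function name `sample_trajectories` are assumptions for illustration; the exact scheme used by OS-Genesis is not stated in the excerpt above.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_trajectories(
    trajectories: Sequence[T],
    rewards: Sequence[float],
    k: int,
    seed: Optional[int] = None,
) -> list[T]:
    """Draw k trajectories with replacement, weighted by their reward score."""
    if len(trajectories) != len(rewards):
        raise ValueError("trajectories and rewards must have the same length")
    if any(not 1 <= r <= 5 for r in rewards):
        raise ValueError("rewards are expected to lie in [1, 5]")
    rng = random.Random(seed)  # local RNG so the global state is untouched
    # Assumption: sampling probability is proportional to the reward score.
    return rng.choices(list(trajectories), weights=list(rewards), k=k)
```

Compared with a hard threshold filter, this keeps imperfect trajectories in the training mix at a lower rate.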

Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset

Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset [52.3]
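Large language models (LLMs) can understand and analyze hybrid text containing textual and tabular data. This work proposes an Automated Information Extraction (AIE) framework that enables LLMs to process hybrid long documents (HLDs), and conducts experiments analyzing four important aspects of information extraction from HLDs. To address the lack of datasets for HLDs and to support future work, the Financial Reports Numerical Extraction (FINE) dataset is proposed.
Reference translation of the paper (metadata) (Sat, 28 Dec 2024 07:54:14 GMT)
A proposal of the Automated Information Extraction (AIE) framework. "AIE comprises four modules: Segmentation, Retrieval, Summarization, and Extraction." This looks like a fairly common design.
The dataset does not seem to be publicly available?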
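The four-module structure can be sketched as a small pipeline. Everything below is a hypothetical stand-in: segmentation splits on blank lines, retrieval ranks segments by keyword overlap, and the `summarize` and `extract` callables represent the LLM calls. None of this is taken from the paper's code.

```python
from typing import Callable


def segment(document: str) -> list[str]:
    """Segmentation: split the document into paragraphs on blank lines."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def retrieve(segments: list[str], query: str, top_k: int = 2) -> list[str]:
    """Retrieval: rank segments by how many query words they contain."""
    terms = set(query.lower().split())
    ranked = sorted(
        segments,
        key=lambda s: len(terms & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def extract_value(
    document: str,
    query: str,
    summarize: Callable[[str, str], str],
    extract: Callable[[str, str], str],
    top_k: int = 2,
) -> str:
    """Run Segmentation -> Retrieval -> Summarization -> Extraction."""
    relevant = retrieve(segment(document), query, top_k)
    summaries = [summarize(seg, query) for seg in relevant]
    return extract("\n".join(summaries), query)
```

In practice the retrieval step would more likely use embeddings, and table-aware segmentation would matter for hybrid documents. The keyword overlap here only keeps the example self-contained.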

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.4]
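Models like o1 can emulate human-like long-time thinking during inference. This paper is the first comprehensive study of the overthinking problem in these models. It proposes strategies to mitigate overthinking and streamline the reasoning process without compromising accuracy.
Reference translation of the paper (metadata) (Mon, 30 Dec 2024 18:55:12 GMT)
An interesting paper focused on overthinking: "This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit."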
