staka – Page 83 – Introducing the Latest arXiv Papers

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.4]
We propose OS-Genesis, a new data synthesis pipeline for graphical user interface (GUI) agents. Instead of relying on predefined tasks, OS-Genesis lets agents first perceive the environment and perform step-wise interactions. A trajectory reward model is then used to ensure the quality of the generated trajectories.
Reference translation of the paper (metadata) (Fri, 27 Dec 2024 16:21:58 GMT)
A proposal for a synthetic data construction method for GUI agent development, an area where research is advancing rapidly. The base data is built as follows: "OS-Genesis begins by exploring the functionality of GUI environments through traversing interactive UI elements with actions (e.g., CLICK). This forms the basis for reverse task synthesis, where observed states and actions are retroactively transformed into low-level instructions. These low-level instructions are then derived into high-level instructions, which can seed the collection of GUI trajectories." Quality is then ensured by a Trajectory Reward Model: "Built upon GPT-4o, TRM aims to perform a graded evaluation with a reward score R ∈ [1, 5] to assist in sampling for training."
The repository is OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
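The quoted passage says the TRM score R ∈ [1, 5] is used "to assist in sampling for training". The following is a minimal sketch of one way that could look, assuming each trajectory is drawn with probability proportional to its score. That weighting rule is an assumption, and the trajectory names and scores are made up for illustration.

```python
import random

def sample_trajectories(scored, k, seed=0):
    """Draw k trajectories with probability proportional to reward score.

    `scored` is a list of (trajectory_id, reward) pairs with reward in [1, 5].
    Proportional weighting is an assumption made for this illustration.
    """
    for _, reward in scored:
        if not 1 <= reward <= 5:
            raise ValueError(f"reward {reward} is outside [1, 5]")
    rng = random.Random(seed)
    ids = [tid for tid, _ in scored]
    weights = [reward for _, reward in scored]
    # Sampling with replacement: low-scored trajectories stay possible but rarer.
    return rng.choices(ids, weights=weights, k=k)

# Hypothetical trajectories and scores.
scored = [("traj_a", 5), ("traj_b", 3), ("traj_c", 1)]
batch = sample_trajectories(scored, k=1000)
```

Graded weighting of this kind keeps imperfect trajectories in the training mix at a reduced rate, whereas a hard threshold would discard them entirely.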

Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset

Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset [52.3]
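Large language models (LLMs) can understand and analyze hybrid text containing both textual and tabular data. This work proposes an Automated Information Extraction (AIE) framework that enables LLMs to process hybrid long documents (HLDs), and runs experiments analyzing four important aspects of information extraction from HLDs. To address the shortage of HLD datasets and to support future work, the authors propose the Financial Reports Numerical Extraction (FINE) dataset.
Reference translation of the paper (metadata) (Sat, 28 Dec 2024 07:54:14 GMT)
A proposal for the Automated Information Extraction (AIE) framework. "AIE comprises four modules: Segmentation, Retrieval, Summarization, and Extraction." This looks like a fairly standard design.
The dataset does not appear to be public?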
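To make the four-module structure concrete, here is a minimal sketch of a Segmentation, Retrieval, Summarization, Extraction pipeline. It is not the paper's implementation: the retriever is a toy keyword matcher, the summarization and extraction steps are placeholders for LLM calls, and all function names and the sample report are hypothetical.

```python
import re

def segment(document, max_chars=200):
    """Split a document into paragraph-level chunks of at most max_chars."""
    chunks = []
    for para in document.split("\n\n"):
        para = para.strip()
        for start in range(0, len(para), max_chars):
            chunks.append(para[start:start + max_chars])
    return chunks

def retrieve(chunks, query, top_k=2):
    """Rank chunks by how many query words they contain (toy retriever)."""
    words = set(re.findall(r"\w+", query.lower()))
    def score(chunk):
        return len(words & set(re.findall(r"\w+", chunk.lower())))
    return sorted(chunks, key=score, reverse=True)[:top_k]

def summarize(chunks):
    """Placeholder for an LLM summary: simply join the retrieved chunks."""
    return " ".join(chunks)

def extract(summary, keyword):
    """Placeholder for LLM extraction: first number following the keyword."""
    pattern = re.escape(keyword) + r"\D*?(\d[\d,]*(?:\.\d+)?)"
    match = re.search(pattern, summary, re.IGNORECASE)
    return match.group(1) if match else None

# Hypothetical hybrid document mixing prose and a flattened table.
report = (
    "The company operates in three regions.\n\n"
    "Total revenue was 1,234.5 million in 2023, up from the prior year.\n\n"
    "Headcount | 5,200\nOffices | 14"
)
chunks = segment(report)
top = retrieve(chunks, "total revenue 2023")
value = extract(summarize(top), "revenue")
print(value)  # 1,234.5
```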

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.4]
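o1-like models can emulate human-like long-time thinking during inference. This paper is the first comprehensive study of the overthinking problem in these models. It proposes strategies to mitigate overthinking and streamline the reasoning process without compromising accuracy.
Reference translation of the paper (metadata) (Mon, 30 Dec 2024 18:55:12 GMT)
An interesting paper focused on overthinking: "This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit."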

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey [93.7]
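Next Token Prediction (NTP) is a versatile training objective for machine learning tasks. This survey introduces a comprehensive taxonomy that unifies understanding and generation in multimodal learning. The taxonomy covers five key aspects: multimodal tokenization, MMNTP model architectures, unified task representation, datasets and evaluation, and open challenges.
Reference translation of the paper (metadata) (Mon, 30 Dec 2024 03:00:30 GMT)
A survey of next token prediction, now a standard technique, focused on multimodal learning.
The repository is GitHub – LMM101/Awesome-Multimodal-Next-Token-Prediction: Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey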
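As a reminder of what the NTP objective computes, here is a minimal sketch of the usual cross-entropy form: the average negative log-probability assigned to each actual next token. The toy distributions and the image-token names (`<img_17>`, `<img_3>`) are hypothetical. They only illustrate that, once other modalities are tokenized, they can share the same loss.

```python
import math

def next_token_loss(step_probs, tokens):
    """Average negative log-likelihood of each token given its prefix.

    step_probs[t] maps candidate tokens to the model's probability at
    position t; tokens[t] is the token that actually occurred there.
    """
    if len(step_probs) != len(tokens):
        raise ValueError("need one distribution per target token")
    nll = [-math.log(dist[tok]) for dist, tok in zip(step_probs, tokens)]
    return sum(nll) / len(nll)

# Toy predictions over a mixed sequence: one text token, one image token.
step_probs = [
    {"a": 0.5, "b": 0.5},
    {"<img_17>": 0.25, "<img_3>": 0.75},
]
tokens = ["a", "<img_17>"]
loss = next_token_loss(step_probs, tokens)  # (ln 2 + ln 4) / 2
```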

Training Software Engineering Agents and Verifiers with SWE-Gym

Training Software Engineering Agents and Verifiers with SWE-Gym [89.6]
SWE-Gym is the first environment for training real-world software engineering (SWE) agents. It contains 2,438 real-world Python task instances.
Reference translation of the paper (metadata) (Mon, 30 Dec 2024 18:15:39 GMT)
A proposal for an environment for developing software engineering agents, together with the development of high-performing agents. Although this comes after seeing the overwhelming o3 results, the paper reports: "Through extensive experiments, we demonstrate that SWE-Gym enables both agent and verifier models to achieve significant improvements in resolving complex software tasks. Our findings highlight the scalability of these approaches, revealing potential for continuous performance gains with increased compute." This suggests that agentic behavior is highly effective.
The repository is GitHub – SWE-Gym/SWE-Gym
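The quote mentions both agent and verifier models and gains from increased compute. One common way to combine the two is best-of-n selection: the agent produces several candidate patches and the verifier picks one. The sketch below assumes that setup, which the excerpt does not confirm as the paper's exact procedure. The verifier is a hypothetical stand-in that scores by fraction of tests passed.

```python
def best_of_n(candidates, verifier):
    """Return the candidate the verifier scores highest, with its score."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    scored = [(verifier(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score

def toy_verifier(patch):
    """Hypothetical stand-in for a learned verifier: fraction of tests passed."""
    return patch["tests_passed"] / patch["tests_total"]

# Hypothetical candidate patches from n = 3 agent rollouts.
candidates = [
    {"id": "patch_1", "tests_passed": 3, "tests_total": 10},
    {"id": "patch_2", "tests_passed": 9, "tests_total": 10},
    {"id": "patch_3", "tests_passed": 6, "tests_total": 10},
]
best, score = best_of_n(candidates, toy_verifier)
print(best["id"], score)  # patch_2 0.9
```

Under this scheme, spending more compute means sampling more candidates per task, so the final quality depends on how well the verifier ranks them.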

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [74.5]
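We propose Collective Monte Carlo Tree Search (CoMCTS) for efficient reasoning-path search and learning. We construct Mulberry-260k, a multimodal dataset with a tree of rich, explicit, and well-defined reasoning nodes for each question. We perform collective SFT to train Mulberry, a series of MLLMs with o1-like step-by-step reasoning and reflection capabilities.
Reference translation of the paper (metadata) (Tue, 24 Dec 2024 10:07:51 GMT)
Work centered on Monte Carlo tree search, which comes up whenever people try to build o1-like systems (although o1 itself is said not to use it). Per the abstract, the contribution is a search method, a dataset, and a model series rather than a benchmark.
The repository is GitHub – HJYao00/Mulberry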
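For readers unfamiliar with Monte Carlo tree search, the sketch below shows the selection step of ordinary single-model MCTS using the standard UCB1 rule. It is background only and is not the collective variant (CoMCTS) proposed in the paper. The node fields and the example reasoning steps are hypothetical.

```python
import math

def ucb1(value_sum, visits, parent_visits, c=1.4):
    """UCB1 score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    mean = value_sum / visits
    return mean + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c=1.4):
    """Pick the child node with the highest UCB1 score."""
    return max(
        children,
        key=lambda ch: ucb1(ch["value_sum"], ch["visits"], parent_visits, c),
    )

# Hypothetical candidate next reasoning steps under one parent node.
children = [
    {"step": "A", "value_sum": 8.0, "visits": 10},
    {"step": "B", "value_sum": 3.0, "visits": 4},
    {"step": "C", "value_sum": 0.0, "visits": 0},
]
first = select_child(children, parent_visits=14)       # unvisited node C
second = select_child(children[:2], parent_visits=14)  # B: fewer visits, larger bonus
```

Step B wins over A in the second call even though its mean value is lower, because the exploration bonus favors less-visited nodes.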

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs [34.2]
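Large language models (LLMs) can correct their self-generated responses, but a decline in accuracy after self-correction has also been observed. The authors decompose self-correction capability into confidence (remaining confident in correct answers) and critique (turning wrong answers into correct ones). Their strategy outperforms vanilla SFT on both capabilities and achieves much higher accuracy after self-correction.
Reference translation of the paper (metadata) (Fri, 27 Dec 2024 08:09:11 GMT)
An analysis of confidence and critique, plus a proposed method for improving self-correction capability.
The two are reported to be in a trade-off relationship: "Confidence prompt/ICL example can lead higher CL and lower CS; critique prompt/ICL example can cause lower CL and higher CS." (Confidence Level (CL) and Critique Score (CS))
To improve both, the authors propose "Critique Improvement Tuning (CCT), which can be divided into Confidence Level Improvement Tuning (CLT) and Critique Score Improvement Tuning (CST)."
The repository is GitHub – Zhe-Young/SelfCorrectDecompose: Code for paper “Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs”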
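The decomposition can be illustrated with a short calculation. The sketch below assumes, based on the summary above and not on the paper's exact formulas, that CL is the share of initially correct answers that stay correct after self-correction and CS is the share of initially wrong answers that become correct. The correctness flags are made-up data.

```python
def confidence_and_critique(before, after):
    """Compute (CL, CS) from per-question correctness flags.

    before[i] / after[i]: whether answer i was correct before / after
    self-correction. CL = share of initially correct answers that stay
    correct; CS = share of initially wrong answers that become correct.
    These definitions are assumptions made for this illustration.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have equal length")
    kept = [a for b, a in zip(before, after) if b]
    fixed = [a for b, a in zip(before, after) if not b]
    cl = sum(kept) / len(kept) if kept else float("nan")
    cs = sum(fixed) / len(fixed) if fixed else float("nan")
    return cl, cs

# Made-up results for six questions.
before = [True, True, True, True, False, False]
after = [True, True, True, False, True, False]
cl, cs = confidence_and_critique(before, after)
print(cl, cs)  # 0.75 0.5
```

Under these definitions, accuracy after self-correction equals (accuracy before) × CL + (1 − accuracy before) × CS, here 4/6 × 0.75 + 2/6 × 0.5 = 4/6. This is why a prompt that raises one quantity while lowering the other may leave final accuracy unchanged or worse.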

Large Concept Models: Language Modeling in a Sentence Representation Space

Large Concept Models: Language Modeling in a Sentence Representation Space [62.7]
This paper explores an architecture that operates on an explicit higher-level semantic representation, which the authors name a concept. Concepts are language- and modality-agnostic and represent a higher-level idea or action in a flow. The model shows remarkable zero-shot generalization performance across many languages.
Reference translation of the paper (metadata) (Sun, 15 Dec 2024 21:20:12 GMT)
A proposal for a model that handles language in units of concepts instead of tokens. The setup is: "In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space." The interesting result is: "The LCM outperforms Llama-3.1-8B-IT on English and on the average over foreign languages officially supported by the LLM." On the other hand, the paper also notes: "We acknowledge that there is still a long path to reach the performance of current flagship LLMs."
The repository is GitHub – facebookresearch/large_concept_model: Large Concept Models: Language modeling in a sentence representation space

StructTest: Benchmarking LLMs’ Reasoning through Compositional Structured Outputs

StructTest: Benchmarking LLMs’ Reasoning through Compositional Structured Outputs [78.8]
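StructTest is a new benchmark that evaluates large language models on their ability to produce structured outputs. The authors show that StructTest serves as a good proxy for general reasoning ability.
Reference translation of the paper (metadata) (Mon, 23 Dec 2024 22:08:40 GMT)
A benchmark for structured outputs: "programmatically verifiable benchmark for evaluating instruction-following capabilities through structured outputs."
The data does not appear to be public at this point...?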
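"Programmatically verifiable" means a rule-based checker, not a human or an LLM judge, decides whether an output satisfies the requested structure. The sketch below shows a hypothetical check of that kind, verifying that a response is a numbered list with exactly the requested number of items. The task and function name are illustrative and are not taken from the benchmark.

```python
import re

def check_numbered_list(output, expected_items):
    """Return True if output is exactly `expected_items` lines numbered 1..n."""
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    if len(lines) != expected_items:
        return False
    for index, line in enumerate(lines, start=1):
        # Each line must look like "<index>. <non-empty text>".
        match = re.fullmatch(r"(\d+)\.\s+\S.*", line)
        if match is None or int(match.group(1)) != index:
            return False
    return True

ok = check_numbered_list("1. alpha\n2. beta\n3. gamma", expected_items=3)
bad_numbering = check_numbered_list("1. alpha\n3. beta", expected_items=2)
bad_count = check_numbered_list("1. alpha\n2. beta", expected_items=3)
```

Because the check is deterministic, results are reproducible and new test instances can be generated cheaply by varying the required structure.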

ResearchTown: Simulator of Human Research Community

ResearchTown: Simulator of Human Research Community [14.0]
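ResearchTown is a multi-agent framework for research community simulation. It provides a realistic simulation of collaborative research activities and can maintain robust simulation with multiple researchers and diverse papers.
Reference translation of the paper (metadata) (Mon, 23 Dec 2024 18:26:53 GMT)
Another of the currently popular multi-agent frameworks, but this time it has finally become a Town...
I am very curious about what happens when the graph structure is changed.
The repository is GitHub – ulab-uiuc/research-town: A platform for developers to simulate research community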
