staka – Page 82 – Introduction to the Latest arXiv Papers

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems [88.3]
AgiBot World is a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios. AgiBot World ensures a high-quality and diverse data distribution. GO-1 shows exceptional capability in real-world dexterous and long-horizon tasks.
Reference translation of the paper (metadata) (Sun, 09 Mar 2025 15:40:29 GMT)
The paper builds a large-scale dataset, "1) We construct AgiBot World dataset, a multifarious robot learning dataset accompanied by opensource tools to advance research on policy learning at scale.", and proposes a policy, "2) We propose GO1, a robot foundation policy using latent action representations to unlock web-scale pre-training on heterogeneous data." The work is from Shanghai AI Lab, AgiBot Inc., and Shanghai Innovation Institute. I wonder whether this field will also evolve the way LLMs have.
The repository is GitHub – OpenDriveLab/AgiBot-World: The Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems, and the project site is AgiBot World Colosseo | OpenDriveLab.

Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs

Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs [96.7]
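We present two MoE large language models (LLMs) of different sizes. Ling-Lite has 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus has 290 billion parameters with 28.8 billion activated parameters. This paper proposes innovative methods for (1) optimizing the model architecture and training process, (2) refining the handling of training anomalies, and (3) improving model evaluation efficiency.
Reference translation of the paper (metadata) (Fri, 07 Mar 2025 04:43:39 GMT)
An LLM from Ling Team, AI@Ant Group. Its distinguishing feature is a cost-effective training approach, with a recipe that assumes multiple clusters of differing configurations are available. The models, including the large-scale Ling-Plus, have been released.
The repository is inclusionAI (inclusionAI).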
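As a rough illustration of what "activated parameters" means for the two models, the sketch below computes the share of parameters used per token. The figures are the parameter counts reported in the paper's abstract (16.8B total / 2.75B activated for Ling-Lite, 290B / 28.8B for Ling-Plus); the function and variable names are hypothetical and not taken from the Ling codebase.

```python
def activated_fraction(total_b: float, activated_b: float) -> float:
    """Return the fraction of parameters activated per token (both in billions)."""
    if total_b <= 0:
        raise ValueError("total parameter count must be positive")
    return activated_b / total_b


# (total, activated) parameter counts in billions, as reported in the abstract.
MODELS = {
    "Ling-Lite": (16.8, 2.75),
    "Ling-Plus": (290.0, 28.8),
}


def summarize(models: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map each model name to its activated share as a percentage (1 decimal)."""
    return {
        name: round(activated_fraction(total, active) * 100, 1)
        for name, (total, active) in models.items()
    }


if __name__ == "__main__":
    for name, pct in summarize(MODELS).items():
        print(f"{name}: {pct}% of parameters activated per token")
```

Under these figures, roughly one sixth of Ling-Lite and one tenth of Ling-Plus is active for any given token, which is where the compute savings of an MoE design come from.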

LLM Post-Training: A Deep Dive into Reasoning Large Language Models

LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.1]
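Large language models (LLMs) have transformed the landscape of natural language processing and enabled a wide range of applications. Post-training methods allow LLMs to refine their knowledge, improve reasoning, increase factual accuracy, and align more effectively with user intent and ethical considerations.
Reference translation of the paper (metadata) (Fri, 28 Feb 2025 18:59:54 GMT)
A survey of post-training, a topic that is also drawing attention for LRMs. The main keywords are Fine-tuning, Reinforcement Learning, and Test-time Scaling.
The repository is GitHub – mbzuai-oryx/Awesome-LLM-Post-training: Awesome Reasoning LLM Tutorial/Survey/Guide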

AI-native Memory 2.0: Second Me

AI-native Memory 2.0: Second Me [26.4]
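SECOND ME functions as an intelligent, persistent memory offload system. It generates context-aware responses, prefills required information, and facilitates seamless communication with external systems. Furthermore, SECOND ME is an important step toward enhancing human-world interaction with persistent, contextually aware, and self-optimizing memory systems.
Reference translation of the paper (metadata) (Wed, 12 Mar 2025 11:31:31 GMT)
A proposal for a memory system premised on collaboration with AI, which looks similar to the Agentic Memory approach covered in the earlier post "HippoRAG2, RAG vs Graph RAG, A-MEM: Agentic Memory for LLM Agents". I am interested in the implementation and would like to look at the OSS part.
The repository is said to be https://github.com/Mindverse/Second-Me, but it currently returns a 404.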

A Survey of Model Architectures in Information Retrieval

A Survey of Model Architectures in Information Retrieval [64.8]
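We focus on two key aspects: backbone models for feature extraction and end-to-end system architectures for relevance estimation. The survey traces the development from traditional term-based methods to modern neural approaches, highlighting in particular the impact of transformer-based models and the large language models (LLMs) that followed. We conclude by discussing emerging challenges and future directions, including architectural optimization for performance and scalability, handling of multimodal and multilingual data, and adaptation to new application domains beyond traditional search paradigms.
Reference translation of the paper (metadata) (Thu, 20 Feb 2025 18:42:58 GMT)
A survey of Information Retrieval, a field that has been shaped by LLMs and is growing in importance in the LLM era.
I agree with the conclusion: "Information retrieval modeling has evolved from simple term matching to complex neural networks and LLM-driven approaches, significantly improving search capabilities. Key challenges ahead include balancing computational efficiency with performance, handling diverse data types, maintaining faithfulness and trustworthiness, and integrating with emerging technologies like autonomous agents."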

LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue

LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue [5.1]
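PRAISE is an interpretable framework for effective user satisfaction estimation. It operates through three modules. It achieves state-of-the-art performance on three benchmarks for the user satisfaction estimation task.
Reference translation of the paper (metadata) (Thu, 06 Mar 2025 18:12:33 GMT)
A proposal of PRAISE (Plan and Retrieval Alignment for Interpretable Satisfaction Estimation), a framework for estimating user satisfaction. It takes an agentic approach and consists of a Strategy Planner, a Feature Retriever, and a Score Analyzer.
The results are interesting, though the LLM (API) used seems slightly dated. I wonder what the results would look like with the latest APIs.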

BIG-Bench Extra Hard

BIG-Bench Extra Hard [98.4]
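Large language models (LLMs) are increasingly deployed in everyday applications and require robust general reasoning capabilities. The BIG-Bench dataset has served as an important benchmark for evaluating the general reasoning capabilities of LLMs. Because state-of-the-art models achieve near-perfect scores on many BIG-Bench tasks, its utility is diminishing. BIG-Bench Extra Hard (BBEH) is a new benchmark designed to push the boundaries of LLM reasoning evaluation.
Reference translation of the paper (metadata) (Wed, 26 Feb 2025 14:50:50 GMT)
An enhanced version of BIG-Bench. According to the paper, "Solving the tasks in BBEH requires even further reasoning skills than the problems in BBH. These skills include, but are not limited to, many-hop reasoning, learning on the fly, finding errors in reasoning traces, processing long-context inputs and finding (multi-)needles in a haystack, going against strong prior, dealing with long-range dependencies, dealing with distractors and inducing patterns from examples.", so a broad range of reasoning abilities appears to be required. It is interesting that the LRM o3-mini (high) achieves a reasonable score, while DeepSeek R1, which struggles on some tasks, scores low.
The repository is GitHub – google-deepmind/bbeh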

Generative Models in Decision Making: A Survey

Generative Models in Decision Making: A Survey [63.7]
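Generative models can be incorporated into decision-making systems by generating trajectories that guide agents toward high-reward state-action regions or intermediate sub-goals. This paper surveys the application of generative models to decision-making tasks.
Reference translation of the paper (metadata) (Mon, 24 Feb 2025 12:31:28 GMT)
A survey of generative models (Energy Based Models (EBMs), Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Normalizing Flows (NFs), Diffusion Models (DMs), GFlowNets (GFNs), and Autoregressive Models (AMs)) and decision making. The applications considered are "robot control, autonomous driving, games, structural generation, and optimization."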

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [97.2]
This paper introduces CodeCriticBench, a code critique benchmark for Large Language Models (LLMs). Specifically, CodeCriticBench covers two main code tasks (code generation and code QA). In addition, the evaluation protocol includes a basic critique evaluation and an advanced critique evaluation for different characteristics.
Reference translation of the paper (metadata) (Sun, 23 Feb 2025 15:36:43 GMT)
A benchmark for an unusual task: "To evaluate the critique abilities of LLMs on the code domain, we introduce the first holistic code critique benchmark CodeCriticBench, which includes the critique on both code generation and code QA tasks." DeepSeek-R1 and OpenAI o1-Preview show strong capability.
The repository is GitHub – multimodal-art-projection/CodeCriticBench

Posted by: staka
