Introducing the Latest arXiv Papers

Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions

Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions [55.2]
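Memory is a fundamental component of AI systems, underpinning large language model (LLM) based agents. The survey introduces six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression. It provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI.
Paper reference translation (metadata) (Thu, 01 May 2025 17:31:33 GMT)
A survey of memory, which is important for LLMs and agents.
The survey is organized along these axes: "In this survey, we first categorize memory representations into parametric, contextual structured, and contextual unstructured and then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression."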
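To make the six operations named in the quote concrete, here is a minimal sketch in Python. It is not taken from the survey: the `ToyMemory` class, its method names, and the word-overlap retrieval are all hypothetical stand-ins chosen for illustration, and a real system would use embeddings, model parameters, or a structured store instead.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    key: str
    text: str
    uses: int = 0  # how many times the item has been retrieved


@dataclass
class ToyMemory:
    """Hypothetical store illustrating the six operations (not the survey's design)."""
    short_term: list = field(default_factory=list)
    long_term: dict = field(default_factory=dict)
    index: dict = field(default_factory=dict)  # lowercase word -> set of keys

    def observe(self, key, text):
        """Buffer a new observation in short-term memory."""
        self.short_term.append(MemoryItem(key, text))

    def consolidate(self):
        """Consolidation: move short-term items into long-term storage."""
        for item in self.short_term:
            self.long_term[item.key] = item
            self._index(item)
        self.short_term.clear()

    def _index(self, item):
        """Indexing: maintain a word -> keys lookup for later retrieval."""
        for word in item.text.lower().split():
            self.index.setdefault(word, set()).add(item.key)

    def _unindex(self, key):
        """Remove a key from every index entry."""
        for keys in self.index.values():
            keys.discard(key)

    def update(self, key, text):
        """Updating: overwrite stored content and refresh the index."""
        self._unindex(key)
        item = self.long_term[key]
        item.text = text
        self._index(item)

    def forget(self, key):
        """Forgetting: drop an item and its index entries."""
        self._unindex(key)
        del self.long_term[key]

    def retrieve(self, query):
        """Retrieval: return items sharing at least one word with the query."""
        hits = set()
        for word in query.lower().split():
            hits |= self.index.get(word, set())
        items = [self.long_term[k] for k in sorted(hits)]
        for item in items:
            item.uses += 1
        return items

    def compress(self, max_words):
        """Compression: truncate each item to at most max_words words."""
        for item in self.long_term.values():
            words = item.text.split()
            if len(words) > max_words:
                self._unindex(item.key)
                item.text = " ".join(words[:max_words])
                self._index(item)


memory = ToyMemory()
memory.observe("pref", "User likes green tea")
memory.consolidate()
print([item.text for item in memory.retrieve("tea")])
```

Truncation is used for compression only to keep the sketch short; summarization would be the more realistic choice.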

Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models

Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models [75.9]
SAGE (Sentient Agent as a Judge) is an evaluation framework for large language models. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts. SAGE provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
Paper reference translation (metadata) (Thu, 01 May 2025 19:06:10 GMT)
"SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts." This is a proposal for an evaluation framework called SAGE (Sentient Agent as a Judge). The authors report that "rankings produced by SAGE diverge markedly from Arena results, confirming that social cognition is orthogonal to generic helpfulness."
The repository is digitalhuman/SAGE at main · Tencent/digitalhuman · GitHub
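The per-turn loop described in the quote (emotion change, feeling, reply) can be pictured roughly as follows. This is only a sketch under strong assumptions: in SAGE the reasoning is done by an LLM, whereas here a keyword heuristic stands in for it, and the `ToySentientAgent` class, the cue lists, and the 0 to 100 emotion scale are all hypothetical and not taken from the paper or its repository.

```python
from dataclasses import dataclass, field

# Hypothetical cue lists; a real agent would reason with an LLM instead.
EMPATHY_CUES = ("sorry", "understand", "hear you")
DISMISSIVE_CUES = ("calm down", "just", "whatever")


@dataclass
class ToySentientAgent:
    emotion: int = 50                                # assumed 0-100 scale
    trajectory: list = field(default_factory=list)   # numerical emotion trajectory
    thoughts: list = field(default_factory=list)     # interpretable inner thoughts

    def step(self, model_reply: str) -> str:
        text = model_reply.lower()
        # (i) how the emotion changes
        delta = 10 * sum(cue in text for cue in EMPATHY_CUES)
        delta -= 10 * sum(cue in text for cue in DISMISSIVE_CUES)
        self.emotion = max(0, min(100, self.emotion + delta))
        self.trajectory.append(self.emotion)
        # (ii) how the agent feels
        if delta > 0:
            feeling = "supported"
        elif delta < 0:
            feeling = "dismissed"
        else:
            feeling = "neutral"
        self.thoughts.append(f"I feel {feeling} (delta={delta:+d}).")
        # (iii) how it should reply
        if feeling == "supported":
            return "Thanks, that helps. Can I tell you more?"
        if feeling == "dismissed":
            return "I don't think you are listening to me."
        return "I'm not sure what to say."


agent = ToySentientAgent()
for reply in ["I am sorry, I understand how hard that is.", "Just calm down."]:
    agent.step(reply)
print(agent.trajectory, agent.thoughts)
```

The final value of `trajectory` would serve as a score for the tested model, and `thoughts` gives a readable trace of why the score moved.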

Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications

Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications [25.4]
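Large language models (LLMs) are increasingly used in human-centered tasks. Assessing their psychological traits is essential for understanding their social impact and ensuring trustworthy AI alignment. This study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.
Paper reference translation (metadata) (Wed, 30 Apr 2025 06:09:40 GMT)
A review organized around "(1) assessment tools; (2) LLM-specific datasets; (3) evaluation metrics (consistency and stability); (4) empirical findings; (5) personality simulation methods; and (6) LLM-based behavior simulation."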

HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights

HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights [72.8]
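HiPerRAG is a workflow that indexes and retrieves knowledge from more than 3.6 million scientific papers. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm. HiPerRAG delivers robust performance on existing scientific question-answering benchmarks and on two new benchmarks introduced in this work.
Paper reference translation (metadata) (Wed, 07 May 2025 22:50:23 GMT)
"Despite the widespread adoption of RAG, it faces three significant technical challenges that hinder its ability to scale to millions of documents." is exactly right, and the paper is a useful reference for building large-scale RAG.
It also does some fairly elaborate things. I wonder whether this kind of approach will be needed in practice as well (depending on the field)...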

Holmes: Automated Fact Check with Large Language Models

Holmes: Automated Fact Check with Large Language Models [31.8]
This study uses Large Language Models (LLMs) for automated disinformation detection. It proposes Holmes, an end-to-end framework featuring a novel evidence retrieval method. The proposed method (1) extracts key information from open sources using LLM-based summarization and (2) proposes a new algorithm and metrics for evaluating evidence quality.
Paper reference translation (metadata) (Tue, 06 May 2025 03:19:51 GMT)
A paper on fact-checking; its careful write-up and Findings are very informative.
- "Finding 1: LLMs CANNOT accurately verify the truthfulness of the claim directly.", "Finding 2: LLMs have shortcomings in searching for claim-relevant public information and their responses may include hallucinated links that weaken result trustworthiness.", "Finding 3: Human-written evidence enhances LLMs’ ability to verify multimodal claims and generate coherent justifications."
Holmes was designed based on the above, and its effectiveness was confirmed, according to the authors.

RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation [90.8]
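RoBridge is a hierarchical intelligent architecture for general robotic manipulation. It includes a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM). It unlocks the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution.
Paper reference translation (metadata) (Sat, 03 May 2025 06:17:18 GMT)
An architecture for robotic manipulation centered on a large VLM. It gives the impression of VLM-based real agents.
The project site is RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation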

Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt

Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt [44.5]
We worry that systems which overlook the persistence of moral diversity may provoke resistance, erode trust, and destabilize institutions. The underlying premise is the idea that, under ideal conditions, rational agents will converge in the limit of conversation on a single ethics. We treat this premise as both optional and doubtful, and propose what we call the appropriateness framework as an alternative approach grounded in conflict theory, cultural evolution, multi-agent systems, and institutional economics.
Paper reference translation (metadata) (Thu, 08 May 2025 12:55:07 GMT)
The paper opens with: "This paper traces the underlying problem to an often-unstated Axiom of Rational Convergence: the idea that under ideal conditions, rational agents will converge in the limit of conversation on a single ethics. Treating that premise as both optional and doubtful, we propose what we call the appropriateness framework: an alternative approach grounded in conflict theory, cultural evolution, multi-agent systems, and institutional economics."
1. Contextual grounding, 2. Community customization, 3. Continual adaptation, and 4. Polycentric governance all seem right to me, and "it’s recognizing the actual pattern of human history, where we’ve demonstrably managed to live together despite fundamental disagreements, not by resolving them" may well be true in broad outline (although plenty of bad things do happen in practice). Still, working out concretely how to proceed looks like a real headache. My impression is that this paper gives the reader a lot to think about.
- The paper does say "For the latter, we have to shift from seeking agreement to managing conflict and enabling coexistence through shared practices and norms. This doesn’t imply “anything goes”.", but...

Anyprefer: An Agentic Framework for Preference Data Synthesis

Anyprefer: An Agentic Framework for Preference Data Synthesis [62.4]
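We propose Anyprefer, a framework for synthesizing high-quality preference data for aligning a target model. External tools are introduced so that the judge model can assess responses accurately. The synthesized data is compiled into Anyprefer-V1, a new preference dataset consisting of 58K high-quality preference pairs.
Paper reference translation (metadata) (Sun, 27 Apr 2025 15:21:59 GMT)
A proposal for an agentic data synthesis framework: "To address the challenges of synthesizing high-quality preference data, we propose an automatic framework called Anyprefer, which models the preference data synthesis process as a two-player cooperative Markov game."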

Mistral Medium 3, Gemini 2.5 Pro preview, Llama-Nemotron, OpenCodeReasoning

Last week's most notable news was Mistral's release of Mistral Medium 3 (Medium is the new large. | Mistral AI). Its performance competes with Claude 3.7 Sonnet, and the fact that it can run in each vendor's environment seems important: "The Mistral Medium 3 API is available starting today on Mistral La Plateforme and Amazon Sagemaker, and soon on IBM WatsonX, NVIDIA NIM, Azure AI Foundry, and Google Cloud Vertex. To deploy and customize the model in your environment, please contact us."

Google's Gemini 2.5 Pro appears to have become available (Gemini Pro – Google DeepMind), and it is also drawing a lot of attention. Nvidia's Llama-Nemotron and OpenCodeReasoning becoming downloadable was also a talking point.

Third-party performance verification of each model is probably still to come, but there really is a lot of news.

Llama-Nemotron: Efficient Reasoning Models [105.8]
We introduce the Llama-Nemotron series, an open family of heterogeneous reasoning models. It comes in three sizes: Nano (8B), Super (49B), and Ultra (253B).
Paper reference translation (metadata) (Fri, 02 May 2025 01:35:35 GMT)
The repositories are nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 · Hugging Face and nvidia/Llama-Nemotron-Post-Training-Dataset · Datasets at Hugging Face

OpenCodeReasoning: Advancing Data Distillation for Competitive Coding [61.2]
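We build a supervised fine-tuning (SFT) dataset that achieves state-of-the-art coding capability across models of various sizes. Our models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning.
Paper reference translation (metadata) (Wed, 02 Apr 2025 17:50:31 GMT)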

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.7]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters dominate the realm of the most capable language models. This paper aims to uncover a recipe for harnessing such scale on Ascend NPUs. The key goals are better use of computing resources under dynamic sparse model structures and realizing the expected performance gains on actual hardware.
Paper reference translation (metadata) (Wed, 07 May 2025 15:46:36 GMT)
A paper mainly about the implementation of Pangu Ultra, which is also related to the earlier post "Llama 4, Nemotron-H, Pangu Ultra, Kimi-VL, Kimi-VL-Thinking, Deep Coder – arXiv最新論文の紹介".
"Our system optimizations focus on Expert Parallelism and memory management, significantly lowering communication and activation overhead across 6K NPUs. These innovations enable a 30.0% MFU, demonstrating Ascend NPUs’ capability to support full-scale training of large-scale sparse LLMs, e.g., Pangu Ultra MoE, with comparable performance as DeepSeek R1." In other words, the authors appear to be claiming that a state-of-the-art model can be built without relying on NVIDIA GPUs.
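For readers unfamiliar with the MFU figure in the quote: Model FLOPs Utilization is commonly defined as the model FLOPs actually sustained per second divided by the aggregate theoretical peak of the hardware. The sketch below only illustrates that ratio. The function name is hypothetical, and the throughput and per-device peak values are made-up placeholders, not numbers from the paper or from Ascend specifications.

```python
def model_flops_utilization(achieved_flops_per_s: float,
                            num_devices: int,
                            peak_flops_per_device: float) -> float:
    """Return MFU: sustained model FLOP/s divided by aggregate peak FLOP/s."""
    if num_devices <= 0 or peak_flops_per_device <= 0:
        raise ValueError("num_devices and peak_flops_per_device must be positive")
    return achieved_flops_per_s / (num_devices * peak_flops_per_device)


# Placeholder inputs (assumptions for illustration only):
# 6000 devices, 300 TFLOP/s assumed peak each, 5.4e17 FLOP/s sustained.
mfu = model_flops_utilization(5.4e17, 6000, 300e12)
print(f"MFU = {mfu:.1%}")
```

Because the denominator is the hardware's theoretical peak, communication and memory stalls lower MFU directly, which is why the quoted optimizations target Expert Parallelism and memory management.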
