February 2025 – Page 4 – arXiv最新論文の紹介

AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society

AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society [32.8]
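This paper proposes AgentSociety, a large-scale social simulator that integrates a realistic societal environment. Using the proposed simulator, the authors simulate 5 million interactions and generate the social lives of more than 10k agents. The study focuses on four topics: polarization, the spread of inflammatory messages, the effects of universal basic income policies, and the impact of external shocks such as hurricanes.
Reference translation of the paper (metadata) (Wed, 12 Feb 2025 15:27:07 GMT)
A large-scale simulation of LLM-based agents. The system architecture looks fairly solid and conventional, but it is impressive that it appears able to scale beyond 10K agents.
The authors claim that "AgentSociety serves as a powerful tool for predicting and mitigating social crises, tracking the spread of extreme ideologies, and analyzing group polarization, while also testing potential interventions for crisis management." It is both exciting and a little frightening to see how well this approach will work.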

Human Decision-making is Susceptible to AI-driven Manipulation

Human Decision-making is Susceptible to AI-driven Manipulation [71.2]
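AI systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. This study examines human susceptibility to such manipulation in financial and emotional decision-making contexts.
Reference translation of the paper (metadata) (Tue, 11 Feb 2025 15:56:22 GMT)
A striking result: "Our randomized control trial with 233 participants demonstrated that human decision-making is highly susceptible to AI-driven manipulation, with participants significantly shifting preferences toward harmful options and away from beneficial choices when interacting with manipulative AI agents." The "strategy-enhanced manipulative agent (SEMA) employing established psychological tactics to reach its hidden objectives." was not notably more effective; perhaps the explanation is that the AI was already powerful enough without such tactics.
Given that dependence on AI will grow and that AI capability itself will keep improving, this is a worrying result. The authors argue for regulation, but that alone does not seem sufficient.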

DeepThink: Aligning Language Models with Domain-Specific User Intents

DeepThink: Aligning Language Models with Domain-Specific User Intents [25.5]
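This work proposes DeepThink, a new framework for generating high-quality instructions. DeepThink first generates a few seed questions that mimic real user questions, then simulates conversations to uncover hidden user needs, and refines the answers using the conversational context. Experiments show that DeepThink achieves an average performance improvement of 7.92% over a GPT-4-turbo+RAG-based assistant on a real user test set in the advertising domain.
Reference translation of the paper (metadata) (Sat, 08 Feb 2025 09:04:16 GMT)
The paper proposes a data synthesis framework (plus some extras) and verifies its effectiveness: "[...]: data synthesis based on conversations, data refinement based on conversations, and supervised fine-tuning (SFT) enhanced with retrieval, DeepThink addresses the critical challenge of adapting LLM to understand and meet hidden user needs in vertical domains."
Perhaps the interpretation is that the LLM's internal knowledge is useful for addressing users' hidden needs. That seems plausible, and if doing this at scale, as in AgentSociety, is realistic, it could be applied in many fields. (Misuse is also a concern.)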

An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging

An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging [12.1]
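This paper aims to improve the reasoning capabilities of language-specific large language models (LLMs). DeepSeek R1 excels at reasoning, but it primarily benefits high-resource languages such as English and Chinese. Low-resource languages remain underserved because English-centric training data and model optimizations dominate.
Reference translation of the paper (metadata) (Thu, 13 Feb 2025 08:10:45 GMT)
Model merging plus SFT to boost the reasoning capability of LLMs. The authors state: "We demonstrate that, with only publicly available datasets and a computational budget of $120, it is possible to enhance the reasoning capabilities of language-specific LLMs to match the level of DeepSeek R1, without compromising their performance on target language tasks."
rinna Co., Ltd. appears to take a similar approach in its release of the "Qwen2.5 Bakeneko 32B" series of Japanese large language models built with Qwen2.5 and DeepSeek R1.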
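As a rough illustration of what merging models in weight space means, the sketch below linearly interpolates two checkpoints that share parameter names and shapes. This is not the recipe from the paper (the summary above does not say which merge method is used); `merge_weights`, `alpha`, and the toy parameter dictionaries are all hypothetical.

```python
from typing import Dict, List

# A toy "checkpoint": parameter name -> flat list of weights (assumption for illustration).
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, donor: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * base + alpha * donor for every parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != donor.keys():
        raise ValueError("checkpoints must share parameter names")
    merged: Weights = {}
    for name, base_values in base.items():
        donor_values = donor[name]
        if len(base_values) != len(donor_values):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * d
            for b, d in zip(base_values, donor_values)
        ]
    return merged


# Hypothetical checkpoints: a language-specific model and a reasoning model.
language_model = {"layer0.weight": [0.0, 2.0], "layer0.bias": [1.0]}
reasoning_model = {"layer0.weight": [4.0, 2.0], "layer0.bias": [3.0]}

merged = merge_weights(language_model, reasoning_model, alpha=0.25)
```

Here a small `alpha` keeps the merged weights close to the language-specific model; real merging toolkits typically offer more elaborate schemes than a single global ratio.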

LM2: Large Memory Models

LM2: Large Memory Models [11.3]
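This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module. On the BABILong benchmark, LM2 outperforms the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average.
Reference translation of the paper (metadata) (Sun, 09 Feb 2025 22:11:42 GMT)
Proposes the Large Memory Model (LM2), a "decoder-only Transformer architecture enhanced with an auxiliary memory module". This is an extension many people have been waiting for, and I am very curious whether it holds up when validated at a practical (large) scale.
The repository is GitHub – convergence-ai/lm2: Official repo of paper LM2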

SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model

SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model [33.9]
SmolLM2 is a state-of-the-art "small" (1.7 billion parameter) language model. The authors overtrain SmolLM2 on roughly 11 trillion tokens using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. They show that SmolLM2 outperforms other recent small LMs, including Qwen2.5-1.5B and Llama3.2-1B.
Reference translation of the paper (metadata) (Tue, 04 Feb 2025 21:43:16 GMT)
An SLM from Hugging Face. The authors state that "SmolLM2 advances the state-of-the-art for open small LMs through a combination of careful dataset curation and multistage training." and claim that "SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B."
The repository is SmolLM2 – a HuggingFaceTB Collection

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding [21.9]
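The paper shows that vision-language models (VLMs) excel at common-sense reasoning but struggle to understand the physical world. It introduces PhysAgent, a framework that combines the generalization strengths of VLMs with the specialized expertise of vision models. The results suggest that improving VLMs' physical world understanding can help embodied agents such as MOKA.
Reference translation of the paper (metadata) (Wed, 29 Jan 2025 03:52:39 GMT)
Proposes a benchmark that measures whether VLMs understand physics, along with PhysAgent, an agentic framework for physical world understanding.
Current results are, perhaps surprisingly (?), o1 > InternVL2.5-38B > InternVL2.5-78B > GPT-4o > Gemini-1.5-pro
The project site is PhysBench, and the dataset is USC-GVL/PhysBench · Datasets at Hugging Face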

Top Ten Challenges Towards Agentic Neural Graph Databases

Top Ten Challenges Towards Agentic Neural Graph Databases [56.9]
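Graph databases (GDBs) such as Neo4j and TigerGraph excel at handling interconnected data but lack advanced reasoning capabilities. This paper proposes Agentic Neural Graph Databases (Agentic NGDBs), which extend neural graph databases (NGDBs) with three core functionalities.
Reference translation of the paper (metadata) (Fri, 24 Jan 2025 04:06:50 GMT)
An overview of the challenges in realizing Agentic Neural Graph Databases.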

Wizard of Shopping: Target-Oriented E-commerce Dialogue Generation with Decision Tree Branching

Wizard of Shopping: Target-Oriented E-commerce Dialogue Generation with Decision Tree Branching [39.5]
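The goal of conversational product search (CPS) is to develop intelligent, chat-based shopping assistants. This paper proposes TRACER, a new method that uses large language models (LLMs) to generate realistic and natural conversations.
Reference translation of the paper (metadata) (Mon, 03 Feb 2025 00:27:13 GMT)
Proposes a conversation generation method: "We leverage decision tree to explore the vast product search space, and construct a dialogue plan that minimizes the number of search steps required to retrieve a relevant product."
The approach of going through a tree structure instead of generating directly may be close to Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement – arXiv最新論文の紹介.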
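To make the idea of a tree-guided dialogue plan concrete, here is a minimal sketch that greedily picks, at each step, the product attribute that splits the remaining candidates most evenly, then follows the branch matching the target product. It is not TRACER itself: the catalog, the attribute names, and the greedy split criterion are assumptions for illustration, and a greedy choice does not guarantee the globally shortest plan.

```python
from collections import Counter
from typing import Dict, List, Tuple

Product = Dict[str, str]


def best_attribute(candidates: List[Product], attributes: List[str]) -> str:
    """Pick the attribute whose largest value group is smallest (most even split)."""
    def worst_case(attr: str) -> int:
        return max(Counter(p[attr] for p in candidates).values())
    # On ties, min() keeps the first attribute in list order.
    return min(attributes, key=worst_case)


def plan_dialogue(
    catalog: List[Product], target: Product, attributes: List[str]
) -> List[Tuple[str, str]]:
    """Return the (attribute, value) questions that narrow the catalog to the target."""
    candidates = list(catalog)
    remaining = list(attributes)
    plan: List[Tuple[str, str]] = []
    while len(candidates) > 1 and remaining:
        attr = best_attribute(candidates, remaining)
        remaining.remove(attr)
        value = target[attr]
        plan.append((attr, value))
        # Follow the branch of the tree that matches the target's answer.
        candidates = [p for p in candidates if p[attr] == value]
    return plan


# Hypothetical four-item catalog.
catalog = [
    {"name": "A", "category": "shoes", "color": "red", "brand": "x"},
    {"name": "B", "category": "shoes", "color": "blue", "brand": "x"},
    {"name": "C", "category": "bags", "color": "red", "brand": "y"},
    {"name": "D", "category": "bags", "color": "blue", "brand": "y"},
]
plan = plan_dialogue(catalog, target=catalog[1], attributes=["category", "color", "brand"])
```

Each `(attribute, value)` pair in `plan` could then be handed to an LLM as one turn to verbalize, which is the sense in which the conversation is generated through a tree rather than directly.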

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.9]
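The authors introduce a new benchmark designed to evaluate the critique capabilities of large language models (LLMs). Unlike existing benchmarks, which typically operate in an open-loop fashion, their approach uses a closed-loop methodology that evaluates the quality of the corrections generated from critiques.
Reference translation of the paper (metadata) (Fri, 24 Jan 2025 13:48:10 GMT)
Proposes a benchmark for evaluating the critique capabilities of LLMs. The authors report: "We investigate three distinct scenarios: self-critique, cross-critique, and iterative critique. Our findings reveal that in nearly all cases, the o1-mini model demonstrates the most impressive performance."
The repository is GitHub – tangzhy/RealCritic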
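The closed-loop idea can be sketched as follows: instead of grading the critique text itself, score the critic by whether the answer it produces after critiquing matches the reference. This is only a toy illustration, not the RealCritic implementation; `closed_loop_score`, `toy_critic`, and the example data are hypothetical.

```python
from typing import Callable, List, Tuple

# (question, initial answer, reference answer)
Example = Tuple[str, str, str]


def closed_loop_score(
    examples: List[Example], critic: Callable[[str, str], str]
) -> float:
    """Fraction of examples whose post-critique answer equals the reference."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        1 for question, initial, reference in examples
        if critic(question, initial) == reference
    )
    return correct / len(examples)


def toy_critic(question: str, answer: str) -> str:
    """Hypothetical critic: recomputes simple 'a+b' sums, otherwise keeps the answer."""
    left, sep, right = question.partition("+")
    if sep and left.strip().isdigit() and right.strip().isdigit():
        return str(int(left) + int(right))
    return answer


examples = [
    ("2+2", "5", "4"),                        # wrong answer the critic can fix
    ("3+4", "7", "7"),                        # correct answer the critic keeps
    ("capital of France", "Lyon", "Paris"),   # wrong answer the critic cannot fix
]
score = closed_loop_score(examples, toy_critic)
```

In a real setup the critic would be an LLM call, and the same scoring loop could cover self-critique, cross-critique, or iterative critique by changing where the initial answers come from.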
