WebDancer, EvolveSearch, Can Large Language Models Match the Conclusions of Systematic Reviews?

Agents are also being actively applied to information retrieval and gathering.

WebDancer: Towards Autonomous Information Seeking Agency [67.1]
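Recent progress in agentic systems underscores the potential for autonomous multi-step research. The paper proposes a cohesive paradigm for building end-to-end agentic information-seeking agents from a data-centric and training-stage perspective. The authors instantiate this framework in WebDancer, a web agent based on ReAct.
Reference translation of the paper (metadata) (Wed, 28 May 2025 17:57:07 GMT)
An information-seeking agent proposed by Tongyi Lab, Alibaba. It uses a four-stage pipeline that includes post-training. A useful reference for anyone building this kind of agent in earnest rather than as a quick prototype.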
- Step I: Construct diverse and challenging deep information seeking QA pairs based on the real-world web environment (§2.1); Step II: Sample high-quality trajectories from QA pairs using both LLMs and LRMs to guide the agency learning process (§2.2); Step III: Perform fine-tuning to adapt the format instruction following to agentic tasks and environments (§3.1); Step IV: Apply RL to optimize the agent’s decision-making and generalization capabilities in real-world web environments (§3.2).
GitHub – Alibaba-NLP/WebAgent: 🌐 WebWalker [ACL2025] & WebDancer [Preprint]
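Since WebDancer is described as a web agent based on ReAct, a rough picture of the Thought → Action → Observation loop may help. The following is only a minimal sketch and not the WebDancer implementation: `react_loop`, `toy_policy`, and `toy_search` are hypothetical names, the policy is a hard-coded stand-in for an LLM, and the search tool returns a canned string instead of querying the web.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a policy maps the transcript so far to (thought, action, argument).
Policy = Callable[[List[str]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def react_loop(
    question: str, policy: Policy, tools: Dict[str, Tool], max_steps: int = 5
) -> Tuple[str, List[str]]:
    """Run a Thought -> Action -> Observation loop until the policy answers."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "answer":
            transcript.append(f"Answer: {argument}")
            return argument, transcript
        # Unknown tools yield an observation instead of raising, so the policy can recover.
        tool = tools.get(action)
        observation = tool(argument) if tool else f"unknown tool: {action}"
        transcript.append(f"Action: {action}[{argument}]")
        transcript.append(f"Observation: {observation}")
    return "", transcript  # step budget exhausted without an answer


def toy_policy(transcript: List[str]) -> Tuple[str, str, str]:
    """Stand-in for an LLM: search once, then answer with the last observation."""
    observations = [line for line in transcript if line.startswith("Observation: ")]
    if not observations:
        return "I should look this up.", "search", "WebDancer base paradigm"
    return "I have enough information.", "answer", observations[-1][len("Observation: "):]


def toy_search(query: str) -> str:
    """Stand-in for a real search tool; returns a canned string."""
    return f"stub result for '{query}'"


answer, transcript = react_loop(
    "What is WebDancer built on?", toy_policy, {"search": toy_search}
)
print(answer)
```

In a trained agent the policy would be the model itself, and the transcripts collected from loops like this are the kind of trajectories that Steps II to IV sample, fine-tune on, and optimize with RL.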

EvolveSearch: An Iterative Self-Evolving Search Agent [98.2]
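Large language models (LLMs) have transformed agentic information-seeking capabilities by integrating tools such as search engines and web browsers. This work proposes EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL.
Reference translation of the paper (metadata) (Wed, 28 May 2025 15:50:48 GMT)
Like the paper above, this work involves Tongyi Lab, Alibaba.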

On the other hand, there are also findings like the following.

Can Large Language Models Match the Conclusions of Systematic Reviews? [43.3]
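The paper asks whether large language models (LLMs), given access to the same studies, can match the conclusions of systematic reviews written by clinical experts. On MedEvidence, the authors benchmark 24 LLMs, including reasoning, non-reasoning, and medical specialist models across a range of sizes (from 7B to 700B). They find that reasoning does not necessarily improve performance, that larger models do not consistently yield greater gains, and that knowledge-based fine-tuning degrades accuracy on MedEvidence.
Reference translation of the paper (metadata) (Wed, 28 May 2025 18:58:09 GMT)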
Whether "Consequently, given the same studies, frontier LLMs fail to match the conclusions of systematic reviews in at least 37% of evaluated cases." counts as high or low is hard to say, but "unlike humans, LLMs struggle with uncertain evidence and cannot exhibit skepticism when studies present design flaws" is a concern. The paper states "We identify four key factors that influence model performance on our benchmark: (1) token length, (2) dependency on treatment outcomes, (3) inability to assess the quality of evidence, and (4) lack of skepticism toward low-quality findings.", and I think evaluating the content itself is simply a hard task.
Also interesting: "Across all comparisons, medical finetuning fails to improve performance (even for medical-reasoning models) and, in most cases, actually degrades it. Indeed, fine-tuning without proper calibration can harm generalization, sometimes resulting in worse performance than the base model [49, 50, 51]."
The repository is GitHub – zy-f/med-evidence
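To make a figure like "fail to match in at least 37% of evaluated cases" concrete, here is a minimal sketch of how a mismatch rate between model conclusions and review conclusions could be computed. It is not the MedEvidence evaluation code: the function name `mismatch_rate`, the question ids, and the conclusion labels are all hypothetical, and the toy data is made up purely for illustration.

```python
from typing import Dict


def mismatch_rate(model: Dict[str, str], review: Dict[str, str]) -> float:
    """Fraction of shared questions where the model's conclusion differs from the review's."""
    shared = sorted(set(model) & set(review))
    if not shared:
        raise ValueError("no overlapping question ids")
    # Compare case-insensitively so "Lower" and "lower" count as the same conclusion.
    misses = sum(
        model[q].strip().lower() != review[q].strip().lower() for q in shared
    )
    return misses / len(shared)


# Toy data with hypothetical labels; q3 mimics a model overclaiming on weak evidence.
review = {"q1": "no difference", "q2": "higher", "q3": "insufficient data", "q4": "lower"}
model = {"q1": "no difference", "q2": "higher", "q3": "higher", "q4": "Lower"}

rate = mismatch_rate(model, review)
print(f"{rate:.2f}")
```

In this toy case the single miss is the question where the review found the evidence insufficient, which is the kind of error the "lack of skepticism" factor describes.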
