Science – Introducing the Latest arXiv Papers

Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration

Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration [59.4]
This paper examines whether structured multi-agent discussion can surpass solitary ideation. It proposes a cooperative multi-agent framework for generating research proposals, and adopts a comprehensive evaluation protocol that combines agent-based scoring with human review across dimensions such as novelty, strategic vision, and integration depth.
Reference translation of the paper (metadata) (Wed, 06 Aug 2025 15:59:18 GMT)
"This work challenges the dominant paradigm of solitary AI-driven ideation and provides strong empirical evidence that collaborative multi-agent systems generate higher-quality scientific proposals. Through systematic simulation and evaluation, we identify three actionable principles for building more effective ideation systems: (1) Structured, leader-guided discussions enhance coherence and strategic focus; (2) Cognitive diversity from interdisciplinary or mixed-seniority teams drives originality; (3) Expertise is essential, as collaboration amplifies existing knowledge but cannot replace it." This is a very interesting result, though it is doubtful whether prompts of this kind really control for expertise (or whether various other factors are changing along with it).
The project site is Research Proposal Evaluator; the repository is NuoJohnChen/Idea2Proposal.
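To make the first principle (structured, leader-guided discussion) concrete, here is a minimal sketch of one possible turn-taking loop. It is not the paper's implementation: the `Agent` class, the `echo_speak` placeholder (standing in for an LLM call), and the round structure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a role description and the
# transcript so far, and returns the agent's next contribution.
Speak = Callable[[str, List[str]], str]


@dataclass
class Agent:
    name: str
    role: str  # e.g. discipline or seniority; purely illustrative
    speak: Speak


def echo_speak(role: str, transcript: List[str]) -> str:
    """Placeholder generator so the sketch runs without any model."""
    return f"[{role}] comment on turn {len(transcript)}"


def leader_guided_discussion(
    leader: Agent, members: List[Agent], topic: str, rounds: int = 2
) -> List[str]:
    """Run a fixed number of rounds in which the leader speaks first."""
    transcript = [f"TOPIC: {topic}"]
    for _ in range(rounds):
        # The leader opens each round, which is where it can set the focus.
        transcript.append(f"{leader.name}: {leader.speak(leader.role, transcript)}")
        # Each member then responds with the full transcript as context.
        for member in members:
            transcript.append(f"{member.name}: {member.speak(member.role, transcript)}")
    # The leader closes with a synthesis of the whole discussion.
    transcript.append(
        f"{leader.name} (summary): {leader.speak(leader.role, transcript)}"
    )
    return transcript
```

Swapping `echo_speak` for a real model call and varying the `role` strings is one way such a setup could probe the diversity and expertise effects, which is also where the concern about prompt-only control of expertise applies.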

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning [24.7]
This paper proposes TextbookReasoning, an open dataset featuring truthful reference answers extracted from 1k university-level textbooks. It also introduces MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances. Experiments show that these datasets deliver superior performance and training efficiency with more concise response lengths.
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 17:59:03 GMT)
"We present TEXTBOOKREASONING and MEGASCIENCE, two datasets that advance the frontier in the scientific domain by enabling base models to outperform official instruct models on scientific tasks when fine-tuned with our data."
The repositories are GAIR-NLP/MegaScience: MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning and MegaScience (MegaScience).

The Ever-Evolving Science Exam

The Ever-Evolving Science Exam [32.2]
The benchmark consists of (1) expert science instances (question-answer pairs) spanning 5 disciplines and more than 500 subfields, and (2) a periodically updated 500-instance subset, EESE, which is sampled and validated to enable leak-resistant, low-overhead evaluation.
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 12:22:16 GMT)
"1) We build a large-scale, high-quality, non-public instances repository, named EESE-Pool, which contains over 100,000 science instances. This pool is constructed under strict principles of Range, Reach, and Rigor. 2) We periodically sample a dynamic subset of 500 instances, called EESE, for actual evaluation. This subset is carefully curated to maintain Range, Reach, and Rigor, while mitigating leakage risk and reducing evaluation inefficiency through regular updates." A proposal for a large-scale benchmark that is robust to leakage and similar problems.
The repository is aiben-ch/EESE: The Ever-Evolving Science Exam.
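The idea of periodically drawing a fresh 500-instance subset from a large private pool can be sketched as follows. This is only an illustration under stated assumptions: the paper curates its subset under its Range, Reach, and Rigor principles, whereas this sketch substitutes plain proportional stratified sampling by discipline, and the `discipline` field and the per-refresh `seed` are hypothetical.

```python
import random
from collections import defaultdict


def sample_subset(pool, size, seed):
    """Draw a discipline-stratified subset of `size` instances from `pool`.

    `pool` is a list of dicts with a "discipline" key (assumed schema).
    Changing `seed` at each refresh yields a different subset.
    """
    if size > len(pool):
        raise ValueError("size must not exceed the pool size")

    rng = random.Random(seed)
    by_discipline = defaultdict(list)
    for item in pool:
        by_discipline[item["discipline"]].append(item)

    # Proportional allocation, rounded down per discipline.
    total = len(pool)
    quotas = {d: size * len(items) // total for d, items in by_discipline.items()}

    # Hand leftover slots to the largest disciplines first (ties by name).
    remainder = size - sum(quotas.values())
    for d in sorted(by_discipline, key=lambda d: (-len(by_discipline[d]), d)):
        if remainder == 0:
            break
        if quotas[d] < len(by_discipline[d]):
            quotas[d] += 1
            remainder -= 1

    # Sample without replacement inside each discipline.
    subset = []
    for d in sorted(by_discipline):
        subset.extend(rng.sample(by_discipline[d], quotas[d]))
    return subset
```

Because only the sampled subset is ever exposed and the pool stays private, rotating the seed limits how long any leaked instance remains useful to a model that has memorized it.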

A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities

A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities [33.7]
Large language models (LLMs) have shown transformative potential across many interdisciplinary studies. This paper surveys the application of LLMs in discipline-specific research.
Reference translation of the paper (metadata) (Fri, 11 Jul 2025 09:11:18 GMT)
"From the perspective of their applicability, this paper explores how LLMs are contributing to various disciplines including mathematics, physics, chemistry, biology, and the humanities and social sciences, demonstrating their role in discipline-specific tasks. The prevailing challenges are critically examined and the promising research directions are highlighted alongside the recent advances in LLMs." A survey along these lines.

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers [22.8]
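This paper introduces MISS-QA, the first benchmark for evaluating the ability to interpret schematic diagrams in scientific literature. MISS-QA consists of 1,500 expert-annotated examples over 465 scientific papers. The authors evaluate 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL.
Reference translation of the paper (metadata) (Mon, 14 Jul 2025 20:35:25 GMT)
"We present MISS-QA, the first benchmark specifically designed to assess the ability of foundation models to comprehend schematic diagrams in scientific literature." In other words, a proposal for a benchmark on understanding conceptual diagrams and the like. o4-mini performs comparatively well, but the gap to humans is large.
The data is at yale-nlp/MISS-QA · Datasets at Hugging Face; the repository is GitHub – yilunzhao/MISS-QA.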

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization [48.0]
Large language models (LLMs) use chain-of-thought (CoT) techniques to tackle complex problems. The paper introduces ChatBattery, a novel agentic framework that integrates domain knowledge for more effective reasoning in materials design. It identifies, synthesizes, and characterizes three novel lithium-ion battery cathode materials, achieving practical capacity improvements of 28.8%, 25.2%, and 18.5%.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 23:46:11 GMT)
AI that supports scientific discovery. As described in "ChatBattery is an AI-driven material optimization platform structured into two synergistic phases: exploration and exploitation. Together, these phases encompass eight sequential stages, orchestrated by seven specialized agents.", it is a fairly complex multi-agent system. In addition, collaboration with humans appears to be emphasized.
- The paper notes: "This suggests that ChatBattery, in its present form, is more adept at optimizing within known paradigms than at generating fundamentally new chemistries. As such, expert input remains essential to expand the system’s exploration boundaries and push beyond conventional chemical spaces. Importantly, this interplay between AI-driven generation and human-guided refinement also creates unexpected opportunities, as demonstrated in the refinement of AI-suggested materials into even more advanced cathode compositions. However, advances anticipated with future reasoning AIs are likely to provide greater exploration and creativity."
"ChatBattery, we successfully identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively, over the widely used cathode material, LiNi0.8Mn0.1Co0.1O2 (NMC811)." So the approach was reportedly effective.

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research [33.8]
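AbGen is the first benchmark designed to evaluate the ability of LLMs to design ablation studies for scientific research. The authors also develop AbGen-Eval, a meta-evaluation benchmark that assesses the reliability of commonly used automated evaluation systems.
Reference translation of the paper (metadata) (Thu, 17 Jul 2025 17:09:22 GMT)
A proposal for a benchmark that tests whether LLMs can generate ablation studies and whether they can evaluate them. Current LLMs all perform poorly.
The repositories are yale-nlp/AbGen · Datasets at Hugging Face and GitHub – yale-nlp/AbGen: Data and code for the ACL 2025 paper “AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research”.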
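One common way to check whether an automated evaluator is reliable, as a meta-evaluation benchmark such as AbGen-Eval aims to, is to correlate its scores with human ratings on the same outputs. The sketch below uses Pearson correlation as an assumed agreement measure; the paper's actual protocol and metrics may differ, and the score lists in the usage comment are made-up illustrative values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Illustrative usage with made-up scores for five generated ablation designs:
# human_scores = [4, 2, 5, 3, 1]
# judge_scores = [3, 3, 4, 4, 2]
# agreement = pearson(human_scores, judge_scores)
```

A value near 1 would suggest the automated judge ranks outputs much as humans do, while a low value would suggest its scores should not be trusted as a proxy for human review.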

AI4Research: A Survey of Artificial Intelligence for Scientific Research

AI4Research: A Survey of Artificial Intelligence for Scientific Research [55.5]
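The authors present a comprehensive survey of AI for Research (AI4Research). They first introduce a systematic taxonomy that classifies five mainstream tasks in AI4Research. They then identify key research gaps and highlight promising future directions.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 17:19:20 GMT)
A survey on applying AI to research. It treats the following as the main tasks.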
- (1) AI for Scientific Comprehension
- (2) AI for Academic Surveys
- (3) AI for Scientific Discovery
- (4) AI for Academic Writing
- (5) AI for Academic Reviewing
The project site is AI4Research: A Survey of Artificial Intelligence for Scientific Research.

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers [31.5]
LimitGen is the first benchmark for evaluating the ability of LLMs to support early-stage feedback and complement human peer review. The proposed approach enhances the ability of LLM systems to generate limitations of research papers, yielding more concrete and constructive feedback.
Reference translation of the paper (metadata) (Thu, 03 Jul 2025 15:04:38 GMT)
"We propose LIMITGEN, a comprehensive benchmark specifically designed to assess the ability of models to identify and address limitations in scientific research, with a reliable and systematic evaluation framework." A proposal and validation of this benchmark. The paper reports: "Even the best-performing LLM, GPT-4o, can only identify about half of the limitations that humans consider very obvious. Although MARG leverages multi-agent collaboration and generates more comments, successfully identifying more limitations, the feedback it provides still lacks specificity, which is reflected in the fine-grained scores." MARG is a multi-agent framework.
The repository is GitHub – yale-nlp/LimitGen: Data and Code for ACL 2025 Paper “Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers”.

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements [87.6]
The ability to reproduce existing work is a key capability for scientific progress. The Automated LLM Speedrunning Benchmark is introduced to evaluate the ability of AI agents to reproduce results in an active research area. Recent LLMs combined with SoTA scaffolds are found to struggle to reimplement already-known innovations in the benchmark.
Reference translation of the paper (metadata) (Fri, 27 Jun 2025 17:44:32 GMT)
"We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints." A somewhat surprising result.
The repository is GitHub – facebookresearch/llm-speedrunner: The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in language modeling.
