Science – Page 2 – Introduction to the Latest arXiv Papers

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning [24.7]
We propose TextbookReasoning, an open dataset featuring truthful reference answers extracted from 1k university-level textbooks. We also introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances. Experiments show that our datasets achieve superior performance and training efficiency with more concise response lengths.
Paper / reference translation (metadata) (Tue, 22 Jul 2025 17:59:03 GMT)
“We present TEXTBOOKREASONING and MEGASCIENCE, two datasets that advance the frontier in the scientific domain by enabling base models to outperform official instruct models on scientific tasks when fine-tuned with our data.”
The repositories are GAIR-NLP/MegaScience: MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning and MegaScience (MegaScience).

The Ever-Evolving Science Exam

The Ever-Evolving Science Exam [32.2]
The benchmark comprises 1) specialized science instances (question-answer pairs) spanning 5 disciplines and over 500 subfields, and 2) a periodically updated 500-instance subset, EESE, which is sampled and validated to enable leakage-resilient, low-overhead evaluation.
Paper / reference translation (metadata) (Tue, 22 Jul 2025 12:22:16 GMT)
“1) We build a large-scale, high-quality, non-public instances repository, named EESE-Pool, which contains over 100,000 science instances. This pool is constructed under strict principles of Range, Reach, and Rigor. 2) We periodically sample a dynamic subset of 500 instances, called EESE, for actual evaluation. This subset is carefully curated to maintain Range, Reach, and Rigor, while mitigating leakage risk and reducing evaluation inefficiency through regular updates.” This is a proposal for a large-scale benchmark that is robust against leakage and similar problems.
The repository is aiben-ch/EESE: The Ever-Evolving Science Exam.
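The idea of periodically drawing a fixed-size evaluation subset from a larger private pool can be illustrated with a small sketch. This is not the authors' procedure: the proportional per-discipline allocation, the function name `sample_eval_subset`, the `id` and `discipline` keys, and the synthetic pool are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_eval_subset(pool, subset_size=500, seed=0):
    """Draw a subset whose discipline mix mirrors the pool (hypothetical scheme).

    `pool` is a list of dicts with the assumed keys "id" and "discipline".
    A new `seed` per evaluation round yields a fresh subset.
    """
    if subset_size > len(pool):
        raise ValueError("subset_size cannot exceed the pool size")

    rng = random.Random(seed)
    by_discipline = defaultdict(list)
    for instance in pool:
        by_discipline[instance["discipline"]].append(instance)

    disciplines = sorted(by_discipline)
    # Proportional allocation per discipline, rounded down.
    quotas = {
        d: subset_size * len(by_discipline[d]) // len(pool) for d in disciplines
    }
    # Give the slots lost to rounding to the largest disciplines first.
    leftover = subset_size - sum(quotas.values())
    for d in sorted(disciplines, key=lambda d: len(by_discipline[d]), reverse=True):
        if leftover == 0:
            break
        if quotas[d] < len(by_discipline[d]):
            quotas[d] += 1
            leftover -= 1

    subset = []
    for d in disciplines:
        subset.extend(rng.sample(by_discipline[d], quotas[d]))
    return subset


# Synthetic stand-in for a private pool: 10,000 instances, 5 placeholder disciplines.
pool = [{"id": i, "discipline": f"discipline_{i % 5}"} for i in range(10_000)]
subset = sample_eval_subset(pool, subset_size=500, seed=2025)
print(len(subset))  # 500
```

Changing `seed` for each evaluation round gives a different 500-instance subset while keeping the discipline mix stable, which is one simple way to think about how a regularly refreshed subset limits the value of memorizing any single release.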

A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities

A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities [33.7]
Large language models (LLMs) have demonstrated their transformative potential in many interdisciplinary studies. This paper provides an overview of the application of LLMs in interdisciplinary research.
Paper / reference translation (metadata) (Fri, 11 Jul 2025 09:11:18 GMT)
A survey: “From the perspective of their applicability, this paper explores how LLMs are contributing to various disciplines including mathematics, physics, chemistry, biology, and the humanities and social sciences, demonstrating their role in discipline-specific tasks. The prevailing challenges are critically examined and the promising research directions are highlighted alongside the recent advances in LLMs.”

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers [22.8]
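This paper introduces MISS-QA, the first benchmark for evaluating the ability to interpret schematic diagrams in scientific literature. MISS-QA consists of 1,500 expert-annotated examples over 465 scientific papers. We evaluate the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL.
Paper / reference translation (metadata) (Mon, 14 Jul 2025 20:35:25 GMT)
“We present MISS-QA, the first benchmark specifically designed to assess the ability of foundation models to comprehend schematic diagrams in scientific literature.” In other words, this is a proposal for a benchmark on understanding conceptual diagrams and similar figures. o4-mini performs relatively well, but the gap to humans remains large.
The data is at yale-nlp/MISS-QA · Datasets at Hugging Face, and the repository is GitHub – yilunzhao/MISS-QA.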

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization [48.0]
Large language models (LLMs) use chain-of-thought (CoT) techniques to tackle complex problems. We introduce ChatBattery, a novel agentic framework that integrates domain knowledge to steer LLMs toward more effective reasoning in materials design. We identified, synthesized, and characterized three novel lithium-ion battery cathode materials, achieving practical capacity improvements of 28.8%, 25.2%, and 18.5%.
Paper / reference translation (metadata) (Mon, 21 Jul 2025 23:46:11 GMT)
An AI that supports scientific discovery. As described in “ChatBattery is an AI-driven material optimization platform structured into two synergistic phases: exploration and exploitation. Together, these phases encompass eight sequential stages, orchestrated by seven specialized agents.”, it is a fairly complex multi-agent system. In addition, collaboration with humans appears to be emphasized.
- The paper states: “This suggests that ChatBattery, in its present form, is more adept at optimizing within known paradigms than at generating fundamentally new chemistries. As such, expert input remains essential to expand the system’s exploration boundaries and push beyond conventional chemical spaces. Importantly, this interplay between AI-driven generation and human-guided refinement also creates unexpected opportunities, as demonstrated in the refinement of AI-suggested materials into even more advanced cathode compositions. However, advances anticipated with future reasoning AIs are likely to provide greater exploration and creativity.”
According to “ChatBattery, we successfully identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively, over the widely used cathode material, LiNi0.8Mn0.1Co0.1O2 (NMC811).”, the approach is reported to have been effective.

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research [33.8]
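AbGen is the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. We also develop AbGen-Eval, a meta-evaluation benchmark that assesses the reliability of commonly used automated evaluation systems.
Paper / reference translation (metadata) (Thu, 17 Jul 2025 17:09:22 GMT)
A proposal for a benchmark that tests whether LLMs can generate ablation studies and whether they can evaluate them. All current LLMs show poor results.
The repositories are yale-nlp/AbGen · Datasets at Hugging Face and GitHub – yale-nlp/AbGen: Data and code for the ACL 2025 paper “AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research”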

AI4Research: A Survey of Artificial Intelligence for Scientific Research

AI4Research: A Survey of Artificial Intelligence for Scientific Research [55.5]
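We conduct a comprehensive survey of AI for Research (AI4Research). First, we introduce a systematic taxonomy that classifies five main tasks in AI4Research. We identify key research gaps and highlight promising future directions.
Paper / reference translation (metadata) (Wed, 02 Jul 2025 17:19:20 GMT)
A survey on applying AI to research. It treats the following as the main tasks.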
- (1) AI for Scientific Comprehension
- (2) AI for Academic Surveys
- (3) AI for Scientific Discovery
- (4) AI for Academic Writing
- (5) AI for Academic Reviewing
The project site is AI4Research: A Survey of Artificial Intelligence for Scientific Research

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers [31.5]
LimitGen is the first benchmark for evaluating the ability of LLMs to support early-stage feedback and complement human peer review. The proposed approach enhances the ability of LLM systems to identify limitations in research papers, enabling more specific and constructive feedback.
Paper / reference translation (metadata) (Thu, 03 Jul 2025 15:04:38 GMT)
A benchmark proposal and validation: “We propose LIMITGEN, a comprehensive benchmark specifically designed to assess the ability of models to identify and address limitations in scientific research, with a reliable and systematic evaluation framework.” The paper reports: “Even the best-performing LLM, GPT-4o, can only identify about half of the limitations that humans consider very obvious. Although MARG leverages multi-agent collaboration and generates more comments, successfully identifying more limitations, the feedback it provides still lacks specificity, which is reflected in the fine-grained scores.” MARG is a multi-agent framework.
The repository is GitHub – yale-nlp/LimitGen: Data and Code for ACL 2025 Paper “Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers”

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements [87.6]
A crucial capability for scientific progress is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark. We find that recent LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark.
Paper / reference translation (metadata) (Fri, 27 Jun 2025 17:44:32 GMT)
A somewhat surprising result: “We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints.”
The repository is GitHub – facebookresearch/llm-speedrunner: The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in language modeling.

The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas / Position: Intelligent Science Laboratory Requires the Integration of Cognitive and Embodied AI

The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas [90.3]
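A good idea should not simply be novel; it should lead to better research after execution. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study. Comparing the review scores of the same ideas before and after execution, the scores of LLM-generated ideas decrease significantly more than those of expert-written ideas.
Paper / reference translation (metadata) (Wed, 25 Jun 2025 19:47:23 GMT)
Ideas produced by LLMs and ideas from experts were evaluated as follows: “Our execution participants spend an average of 103 hours executing the assigned idea and then submit the codebase and paper to document their experiments. All projects are then reviewed blindly by our recruited expert reviewers”. The finding is that “Average scores of AI ideas drop significantly more than Human ideas in the execution study across all the evaluation metrics.”
An interesting result suggesting that human experts do think more deeply after all. At the same time, the fact that AI ideas are rated highly at the idea-only stage may mean that AI is effective for ideation, and one could also argue that the AI ideas hold up reasonably well even on the final scores. As the paper below suggests, the feasibility of AI scientists seems to be increasing.
The repository is GitHub – NoviScl/AI-Researcher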

Position: Intelligent Science Laboratory Requires the Integration of Cognitive and Embodied AI [98.2]
We propose the paradigm of the Intelligent Science Laboratory (ISL). ISL is a multi-layered closed-loop framework that deeply integrates cognitive and embodied intelligence. We argue that such systems are essential for overcoming the current limitations of scientific discovery.
Paper / reference translation (metadata) (Tue, 24 Jun 2025 13:31:44 GMT)
A proposal for an AI scientist and AI laboratory framework consisting of: “(1) Foundation Models provide multi-modal scientific knowledge representation and closed-loop learning capabilities, supporting complex reasoning and domain adaptation; (2) Agent Layer dynamically orchestrates scientific workflows—including hypothesis generation, literature review, experimental planning, execution, and analysis—while integrating model/toolkit via MCP integration; (3) Embodied Layer realizes robust physical interaction through advanced perception, navigation, and manipulation modules, enabling precise, adaptive operations in real-world laboratory environments.”
The discussion of the current state and open challenges is very informative.
