Evaluation of OpenAI o1: Opportunities and Challenges of AGI / On The Planning Abilities of OpenAI’s o1 Models: Feasibility, Optimality, and Generalizability

Evaluation of OpenAI o1: Opportunities and Challenges of AGI [112.1]
o1-preview showed remarkable capabilities, often achieving human-level or superior performance. The model excelled at tasks requiring complex reasoning and knowledge integration across a variety of fields. Overall, the results indicate significant progress toward artificial general intelligence.
Reference translation of the paper (metadata) (Fri, 27 Sep 2024 06:57:00 GMT)
A detailed examination of OpenAI o1. The paper rates the model highly: "Advanced Reasoning Capabilities: o1-preview demonstrated exceptional logical reasoning abilities in multiple fields, including high school mathematics, quantitative investing, and chip design", "Domain-Specific Knowledge: The model exhibited impressive knowledge breadth across diverse fields such as medical genetics, radiology, anthropology, and geology.", and "It often performed at a level comparable to or exceeding that of graduate students or early-career professionals in these domains." On the other hand, it also points out that "However, it still lacks the flexibility and adaptability of human experts in these fields." and "It demonstrated the ability to capture complex expressions like irony and sarcasm, though it still struggles with very subtle emotional nuances."
Many people contributed to the paper, and the detailed evaluation results drawn from a wide range of fields are a very useful reference.

On The Planning Abilities of OpenAI’s o1 Models: Feasibility, Optimality, and Generalizability [59.7]
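The paper evaluates the planning abilities of OpenAI's o1 models on a variety of benchmark tasks. The results show that o1-preview adheres to task constraints better than GPT-4.
Reference translation of the paper (metadata) (Mon, 30 Sep 2024 03:58:43 GMT)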
An evaluation of o1 focused on planning ability. It is reported to outperform GPT-4.
The findings are organized under four headings: 1. Understanding the Problem, 2. Following Constraints, 3. State and Memory Management, and 4. Reasoning and Generalization. The model is strong on all four, but on 3 the paper notes that "as problem complexity increased, the model’s state management became less reliable, particularly in tasks involving spatial reasoning across multiple dimensions.", and on 4 that "While o1-preview showed some promise in its generalization ability, particularly in structured environments like Grippers, its performance in more abstract tasks like Termes revealed substantial limitations. The model struggled with reasoning under conditions where actions and outcomes were less directly tied to the natural language representation of the task, highlighting an area for future improvements."

When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1 [20.1]
o1 is a new system from OpenAI that, unlike previous LLMs, is optimized for reasoning. In many cases o1 substantially outperforms previous LLMs, with especially large improvements on rare variants of common tasks. However, o1 still shows the same qualitative trends observed in earlier systems.
Reference translation of the paper (metadata) (Wed, 02 Oct 2024 17:50:19 GMT)
The paper makes the following point: "On many of the tasks we considered, o1 performed substantially better than the LLMs we had previously evaluated, with particularly strong results on rare variants of common tasks. However, it still qualitatively showed both of the central types of probability sensitivity discussed in McCoy et al (2023): sensitivity to output probability and sensitivity to task frequency."
