staka – Page 34 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

Autoregressive Models in Vision: A Survey

Autoregressive Models in Vision: A Survey [119.2]
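This survey comprehensively reviews the literature on autoregressive models applied to vision. It divides visual autoregressive models into three general sub-categories: pixel-based, token-based, and scale-based. The paper then presents a multi-faceted categorization of autoregressive models in computer vision, covering image generation, video generation, 3D generation, and multimodal generation.
Reference translation of the paper (metadata) (Fri, 08 Nov 2024 17:15:12 GMT)
A survey of autoregressive models, whose application to vision continues to advance, as also noted in the earlier post "Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective – arXiv最新論文の紹介".
Repository: GitHub – ChaofanTao/Autoregressive-Models-in-Vision-Survey: The paper collections for the autoregressive models in vision.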

CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval

CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [87.2]
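CodeXEmbed is a family of large-scale code embedding models ranging from 400M to 7B parameters. Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework. Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
Reference translation of the paper (metadata) (Tue, 19 Nov 2024 16:54:45 GMT)
A proposal for an embedding model, an important but difficult component for applications such as code RAG. The paper states: "Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark." The base models vary: the 2B model builds on gemma-2-2b-it, and the 7B model on Mistral-7B-Instruct-v0.3.
The models do not appear to be publicly available yet, but the paper states: "By bridging the gap between text and code retrieval domains and releasing our models to the community, we aim to promote further research and innovation in developer tools and programming language understanding."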
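To make the "common retrieval framework" idea concrete, here is a minimal sketch of embedding-based code retrieval: embed the query and every snippet, then rank snippets by cosine similarity. It does not use CodeXEmbed. The function `toy_embed` is a hypothetical stand-in (a bag-of-tokens count vector) for a real embedding model, and `CORPUS` is invented for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-tokens count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, corpus, top_k=1):
    """Return the top_k snippets most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, toy_embed(doc)), reverse=True)
    return ranked[:top_k]


# Invented example corpus of code snippets
CORPUS = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def sort_list(items): return sorted(items)",
]

best = retrieve("read file from path", CORPUS)[0]
```

In a real system the count vectors would be replaced by dense vectors from a trained model; the ranking step stays the same.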

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory [23.5]
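This paper introduces SAMURAI, an enhanced adaptation of SAM 2 designed specifically for visual object tracking. By incorporating temporal motion cues into the proposed motion-aware memory selection mechanism, it effectively predicts object motion and refines mask selection, achieving robust and accurate tracking without any training or fine-tuning. In evaluations, it shows significant improvements in success rate and precision over existing trackers, with gains of 7.1% on LaSOT$_{\text{ext}}$ and 3.5% on GOT-10k.
Reference translation of the paper (metadata) (Mon, 18 Nov 2024 05:59:03 GMT)
SAMURAI, short for "SAM-based Unified and Robust zero-shot visual tracker with motion-Aware Instance-level memory", improves on SAM with a focus on object tracking.
Repository: GitHub – yangchris11/samurai: Official repository of “SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory”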

Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search / LLaVA-CoT（LLaVA-o1）

Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search [95.1]
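Reasoning approaches like o1 are challenging, and researchers have made various attempts to advance this open research area. This paper presents a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms.
Reference translation of the paper (metadata) (Mon, 18 Nov 2024 16:15:17 GMT)
An investigation into building o1-like reasoning systems. The authors write: "In this paper, we present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms." Together with the Marco-o1 report and the claims made for DeepSeek-R1 (A Chinese lab has released a ‘reasoning’ AI model to rival OpenAI’s o1 | TechCrunch), this shows a fiercely competitive environment in which similar proposals appear almost immediately. Effectiveness in multimodal settings has also been reported (see below), so future developments look promising.
It reads like a survey at first glance, but it also includes experimental results, which makes it a useful reference.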
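As a rough illustration of what "reward-guided tree search" means in general, the sketch below runs a best-first search that always expands the partial solution with the highest reward. This is not the paper's algorithm: the paper's policy model and reward model are not described in this post, so the `expand`, `reward`, and `is_goal` functions here are hand-written assumptions on a toy arithmetic problem.

```python
import heapq


def reward_guided_search(start, expand, reward, is_goal, max_expansions=1000):
    """Best-first search: always expand the frontier node with the highest reward."""
    counter = 0  # tie-breaker so heapq never has to compare states directly
    frontier = [(-reward(start), counter, start)]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for child in expand(state):
            if child not in seen:
                seen.add(child)
                counter += 1
                heapq.heappush(frontier, (-reward(child), counter, child))
    return None


# Toy problem (an assumption for illustration): reach TARGET from 1,
# where each "reasoning step" either adds 1 or doubles the last value.
TARGET = 10


def expand(path):
    last = path[-1]
    return [path + (last + 1,), path + (last * 2,)]


def reward(path):
    # Hand-written reward: closer to the target is better.
    # In an LLM setting this role would be played by a learned reward model.
    return -abs(TARGET - path[-1])


def is_goal(path):
    return path[-1] == TARGET


solution = reward_guided_search((1,), expand, reward, is_goal)
```

Here a state is the tuple of intermediate values so far, which mirrors how a search over reasoning chains keeps the full chain of steps rather than only the latest one.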

LLaVA-o1: Let Vision Language Models Reason Step-by-Step [33.7]
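LLaVA-o1 is a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. With 100k training samples and a simple inference-time scaling method, LLaVA-o1 outperforms its base model by 8.9%.
Reference translation of the paper (metadata) (Fri, 15 Nov 2024 18:58:31 GMT)
Repository: GitHub – PKU-YuanGroup/LLaVA-CoT: LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning. The authors note: "Based on recent feedback from social media platforms like X, we have decided to rename LLaVA-o1 to LLaVA-CoT."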

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [40.2]
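Marco-o1 focuses not only on disciplines with standard answers, such as mathematics, physics, and coding, but also on open-ended solutions. Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are hard to quantify? Marco-o1 is powered by Chain-of-Thought fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies.
Reference translation of the paper (metadata) (Thu, 21 Nov 2024 18:37:33 GMT)
A report on building an o1-like model: "Our Marco-o1 enhances the reasoning ability by integrating Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), and novel reasoning action strategies."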

Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective [77.9]
OpenAI has claimed that the main technique behind o1 is reinforcement learning. This paper analyzes the roadmap to achieving o1 from the perspective of reinforcement learning.
Reference translation of the paper (metadata) (Wed, 18 Dec 2024 18:24:47 GMT)
There is also this paper: "In this paper, we present a roadmap for reproducing o1 from the perspective of reinforcement learning, emphasizing key components such as policy initialization, reward design, search, and learning."

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs [151.8]
We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by drawing on 45 million open-access papers and producing citation-backed responses. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7%. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs.
Reference translation of the paper (metadata) (Thu, 21 Nov 2024 15:07:42 GMT)
A proposal for a system that answers scientific queries. "OPENSCHOLAR consists of a specialized datastore, retrievers and LMs and iteratively improves responses using self-feedback inference with retrieval." The thoroughness of the approach is impressive. The authors also build a benchmark and claim performance above GPT-4o: "OPENSCHOLAR using our trained 8B and GPT4o achieves a 51% and 70% win rate against human-generated answers."
Many resources have been released, including a blog post (Ai2 OpenScholar: Scientific literature synthesis with retrieval-augmented language models | Ai2), code (GitHub – AkariAsai/ScholarQABench: This repository contains ScholarQABench data and evaluation pipeline.), and a demo (Ai2 OpenScholar).

Hymba: A Hybrid-head Architecture for Small Language Models

Hymba: A Hybrid-head Architecture for Small Language Models [65.9]
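Hymba is a family of small language models featuring a hybrid-head parallel architecture. It introduces learnable meta tokens that are prepended to prompts and store critical information. The model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention.
Reference translation of the paper (metadata) (Wed, 20 Nov 2024 19:51:25 GMT)
A proposal for a model that combines Transformer attention with SSMs. It reportedly achieves very high performance for a small model, along with low memory use and fast inference.
Announced by NVIDIA, with the model publicly released: nvidia/Hymba-1.5B-Base · Hugging Face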
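For readers unfamiliar with sliding window attention, the sketch below builds the boolean mask it implies: each position attends only to itself and a fixed number of preceding positions. This is a generic illustration, not Hymba's implementation. The function name and the `window` parameter are assumptions, and the sketch does not model the meta tokens or the KV sharing mentioned above.

```python
def sliding_window_causal_mask(seq_len, window):
    """Return mask[i][j] == True when query position i may attend to key position j.

    A position may attend to itself and the (window - 1) positions before it.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_causal_mask(seq_len=4, window=2)
for row in mask:
    # Print 1 where attention is allowed and 0 where it is masked out
    print("".join("1" if allowed else "0" for allowed in row))
```

Because each row has at most `window` allowed positions, the key-value cache needed per layer stays bounded instead of growing with the full sequence length, which is the memory benefit usually attributed to this kind of attention.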

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents [23.2]
We introduce a novel paradigm that augments language agents with model-based planning. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities.
Reference translation of the paper (metadata) (Sun, 10 Nov 2024 18:50:51 GMT)
The method is simple: "WEBDREAMER uses LLMs to simulate outcomes for each candidate action (e.g., “what would happen if I click this button?”) using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step." The result is interesting: "our model-based planning approach, WEBDREAMER, shows substantial improvement over reactive baselines and offers greater flexibility than tree search, which is often impossible in real-world websites." It is easy to see why the authors wanted such a provocative title.
Repository: WebDreamer/README.md at main · OSU-NLP-Group/WebDreamer · GitHub
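The simulate-then-evaluate loop quoted above can be sketched in a few lines. This is only an outline of the control flow, not the WebDreamer code: `plan_step`, `fake_simulate`, and `fake_score` are hypothetical names, and the two stub functions stand in for what the paper describes as LLM calls.

```python
def plan_step(state, candidate_actions, simulate, score):
    """Pick the candidate action whose imagined outcome scores highest."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        imagined = simulate(state, action)  # natural-language description of the outcome
        value = score(imagined)             # evaluation of that imagined outcome
        if value > best_score:
            best_action, best_score = action, value
    return best_action


# Hypothetical stubs standing in for LLM calls (invented for illustration)
def fake_simulate(state, action):
    outcomes = {
        "click 'Help'": "a help article opens in a new tab",
        "click 'Checkout'": "the checkout form with payment options appears",
    }
    return outcomes.get(action, "nothing happens")


def fake_score(description):
    # Assumed task: complete a purchase, so outcomes mentioning checkout score higher
    return 1.0 if "checkout" in description else 0.0


chosen = plan_step(
    "shopping cart page",
    ["click 'Help'", "click 'Checkout'"],
    fake_simulate,
    fake_score,
)
```

The key design point is that no action is executed on the real website during planning. Only the chosen action is carried out, which matters when real actions cannot be undone.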

A Survey of Event Causality Identification: Principles, Taxonomy, Challenges, and Assessment

A Survey of Event Causality Identification: Principles, Taxonomy, Challenges, and Assessment [6.5]
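Event causality identification (ECI) has become a crucial task in natural language processing (NLP). The taxonomy classifies ECI methods according to two primary tasks: sentence-level (SECI) and document-level (DECI) event causality identification.
Reference translation of the paper (metadata) (Fri, 15 Nov 2024 17:19:42 GMT)
A survey of Event Causality Identification.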

Adversarial Training: A Survey

Adversarial Training: A Survey [130.9]
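Adversarial training (AT) refers to integrating adversarial examples into the training process. Recent studies have demonstrated the effectiveness of AT in improving the robustness of deep neural networks against various adversarial attacks.
Reference translation of the paper (metadata) (Sat, 19 Oct 2024 08:57:35 GMT)
A survey of Adversarial Training.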
