Introducing the Latest arXiv Papers

TGDoc

Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs [96.5]
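This paper proposes TGDoc, a text-grounded document understanding model that enhances MLLMs with the ability to identify the spatial position of text within images. The authors formulate instruction-tuning tasks including text detection, recognition, and spotting to facilitate close alignment between the visual encoder and the large language model. The method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
Reference translation of the paper (metadata) (Wed, 22 Nov 2023 06:46:37 GMT)
An MLLM that extends Vicuna-7B. Collecting the training data themselves is impressive, and the model outperforms LLaVAR.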

DocPedia

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding [98.4]
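This work presents DocPedia, a novel large multimodal model (LMM) for OCR-free document understanding. Unlike existing work, which either struggles with high-resolution documents or gives up the large language model and is therefore constrained in vision or language ability, DocPedia directly processes visual input in the frequency domain rather than the pixel space.
Reference translation of the paper (metadata) (Mon, 20 Nov 2023 14:42:25 GMT)
A document understanding model whose distinguishing feature is that "DocPedia directly processes visual input in the frequency domain rather than the pixel space." The block diagram, DCT → Frequency Adapter → …, is an interesting one.
Performance is higher than LLaVAR and mPLUG-Owl, but there is still a gap to the supervised SOTA.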
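To make "frequency domain rather than pixel space" concrete, here is a minimal sketch of a 2D DCT-II applied to one 8x8 block of pixel values. It only illustrates what DCT coefficients are; the block size, the orthonormal scaling, and the function name `dct_2d` are assumptions for illustration, and nothing here reproduces DocPedia's actual DCT extraction or its Frequency Adapter.

```python
import math


def dct_2d(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows.

    Hypothetical helper for illustration; not DocPedia's implementation.
    """
    n = len(block)

    def alpha(k):
        # Scaling factors that make the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    )
            out[u][v] = alpha(u) * alpha(v) * total
    return out


# A perfectly flat block has all of its energy in the DC coefficient [0][0].
flat = [[128.0] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
```

For the flat block above, only `coeffs[0][0]` is non-zero (up to floating-point error), which is why smooth image regions are compact in this representation.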

TPTU-v2

TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems [25.9]
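This paper proposes a comprehensive framework aimed at improving the task planning and tool usage (TPTU) abilities of large language models (LLMs). The framework comprises three key components designed to address the challenges involved: (1) the API Retriever selects the APIs most relevant to the user's task from the extensive array available; (2) the LLM Finetuner tunes the base LLM so that it is better suited to task planning and API calling; and (3) the Demo Selector adaptively retrieves various demonstrations related to difficult APIs.
Reference translation of the paper (metadata) (Sun, 19 Nov 2023 12:37:30 GMT)
This is v2 of TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents (covered earlier on devneko.jp). An update after three months reflects the current pace of the field.
The framework consists of the API Retriever, LLM Finetuner, and Demo Selector. The ToolBench results look strong, but more detailed information would be welcome.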
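As a rough illustration of what an API Retriever does (picking the APIs most relevant to a user task from a larger pool), the sketch below ranks API descriptions by bag-of-words cosine similarity. This is a simple stand-in, not the paper's retriever; the API names, descriptions, and the `retrieve_apis` function are all hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_apis(task, api_docs, k=2):
    """Return the names of the k APIs whose descriptions best match the task."""
    query = bow(task)
    ranked = sorted(
        api_docs,
        key=lambda name: cosine(query, bow(api_docs[name])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical API pool, invented for this example.
API_DOCS = {
    "get_weather": "return the weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "book_flight": "book a flight ticket between two cities",
}

top = retrieve_apis("what is the weather forecast in Tokyo", API_DOCS, k=1)
```

Only the retrieved subset would then be shown to the LLM, which keeps the prompt short when the full API pool is large.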

Adapters

Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning [109.3]
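This paper introduces Adapters, an open-source library that unifies parameter-efficient and modular transfer learning in large language models. By integrating 10 diverse adapter methods into a unified interface, Adapters offers ease of use and flexible configuration.
Reference translation of the paper (metadata) (Sat, 18 Nov 2023 13:53:26 GMT)
A tuning library that can be used alongside the HuggingFace Transformers library. It supports a wide range of methods and looks convenient. The performance table comparing against full fine-tuning is also a useful reference.
The repository is GitHub – adapter-hub/adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning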

Program-Aided Reasoners (better) Know What They Know

Program-Aided Reasoners (better) Know What They Know [59.3]
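The paper compares the calibration of Program-Aided Language Models (PAL) with that of text-based Chain-of-Thought (COT) prompting across five datasets. The results suggest that PAL leads to improved calibration in 75% of cases.
Reference translation of the paper (metadata) (Thu, 16 Nov 2023 04:17:49 GMT)
A comparison of PAL and COT: "Overall, we demonstrate that, in the majority of cases, program-aided reasoners better know what they know than text-based counterparts." It would be good to know why.
The repository is reportedly https://github.com/mathuryash5/code-calibrates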
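Calibration here means how well a model's stated confidence matches its actual accuracy. A common way to measure it is expected calibration error (ECE); the sketch below is a minimal version with equal-width bins. That this matches the paper's exact metric or binning is an assumption, and the function name and sample data are invented for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Made-up example: 90% confidence but only 3 of 4 answers are right.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

A lower value means better calibration; a model that is right half the time when it reports 0.5 confidence scores 0 on those samples.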

INSGENEL: Instructed Generative Entity Linker

Instructed Language Models with Retrievers Are Powerful Entity Linkers [87.2]
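Instructed Generative Entity Linker (INSGENEL) is the first approach that enables causal language models to perform entity linking over knowledge bases. INSGENEL outperforms previous generative alternatives with an average gain of +6.8 F1 points.
Reference translation of the paper (metadata) (Mon, 6 Nov 2023 16:38:51 GMT)
An application of LLMs to the entity linking task, which is prone to hallucination.
An instruction-tuned Llama outperforms davinci with ICL, which shows the usefulness of instruction tuning, though results on a newer generation of models would also be of interest.
The repository is GitHub – MrZilinXiao/InsGenEntityLinking: Official Implementation of EMNLP 2023 paper on “Instructed Language Models with Retriever Are Powerful Entity Linkers”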

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [37.5]
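Large language models (LLMs) have attracted considerable attention for their ability to understand and generate human language. This survey aims to provide insights into the opportunities and challenges of LLMs in medicine.
Reference translation of the paper (metadata) (Thu, 9 Nov 2023 02:55:58 GMT)
A survey on applying LLMs to medicine. It is an interesting field in which domain-specialized LLMs are also being tried.
The repository is GitHub – AI-in-Health/MedLLMsPracticalGuide: A curated list of practical guide resources of Medical LLMs (Medical LLMs Tree, Tables, and Papers)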

INSTRUSUM

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [136.2]
The paper benchmarks large language models (LLMs) on instruction-controllable text summarization. The study shows that instruction-controllable text summarization remains a challenging task for LLMs.
Reference translation of the paper (metadata) (Wed, 15 Nov 2023 18:25:26 GMT)
A benchmark for controlled text summarization. One might expect GPT-4 to manage it, and indeed "We found that several LLMs have already shown promising performance in generating ins-controllable summaries." Yet it appears to be hard: "However, they lack robust holistic capabilities for this task since they still make a considerable amount of errors in their summaries and they can not reliability evaluate the different candidate summaries for the same data example". It was never an easy task, but it is an interesting result that one cannot flatly say LLMs can handle it.
The repository is GitHub – yale-nlp/InstruSum

SEMQA: Semi-Extractive Multi-Source Question Answering

SEMQA: Semi-Extractive Multi-Source Question Answering [98.8]
This paper introduces a new QA task for answering multi-answer questions by summarizing multiple sources in a semi-extractive fashion. The authors create QuoteSum, the first dataset of its kind, with human-written semi-extractive answers to natural and generated questions.
Reference translation of the paper (metadata) (Wed, 8 Nov 2023 18:46:32 GMT)
A proposal of a new task, SEMQA: "Specifically, given a question and a set of retrieved passages, the goal is to generate a summarized and well-grounded answer that interleaves verbatim extracted spans of factual statements with free-text connectors." The aim appears to be obtaining verifiable answers while avoiding hallucination.
The repository is GitHub – google-research-datasets/QuoteSum: QuoteSum is a textual QA dataset containing Semi-Extractive Multi-source Question Answering (SEMQA) examples written by humans, based on Wikipedia passages.
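One appeal of a semi-extractive answer is that the quoted spans can be checked mechanically against the sources. The sketch below shows such a check, assuming a made-up markup `[[source_number: span]]` for extracted spans. This markup, the `check_quotes` function, and the example passages are hypothetical and are not the QuoteSum format.

```python
import re

# Hypothetical markup for an extracted span: [[<1-based source id>: <span text>]]
SPAN = re.compile(r"\[\[(\d+):\s*(.*?)\]\]")


def check_quotes(answer, passages):
    """Return (source_id, span, is_verbatim) for every marked span in the answer."""
    results = []
    for match in SPAN.finditer(answer):
        source_id = int(match.group(1))
        span = match.group(2)
        in_range = 1 <= source_id <= len(passages)
        # A span counts as grounded only if it appears verbatim in its cited passage.
        is_verbatim = in_range and span in passages[source_id - 1]
        results.append((source_id, span, is_verbatim))
    return results


passages = [
    "The Nile is about 6,650 km long.",
    "The Amazon is roughly 6,400 km long.",
]
answer = (
    "[[1: The Nile is about 6,650 km long]], while "
    "[[2: the Amazon is roughly 6,400 km long]]."
)
report = check_quotes(answer, passages)
```

In this example the second span fails the check because its capitalization differs from the source, which is the kind of non-verbatim edit a strict grounding check is meant to catch.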

Control3D

Control3D: Towards Controllable Text-to-3D Generation [107.8]
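This paper proposes Control3D, a text-to-3D generation method conditioned on hand-drawn sketches. A 2D conditioned diffusion model (ControlNet) is remodeled to guide the learning of a 3D scene parameterized as a NeRF. A pre-trained photo-to-sketch model is used to directly estimate sketches of images rendered over the synthetic 3D scene.
Reference translation of the paper (metadata) (Thu, 9 Nov 2023 15:50:32 GMT)
3D model generation from a hand-drawn sketch plus text. It gives the impression of a 3D version of ControlNet ("Specifically, a 2D conditioned diffusion model (ControlNet) is remoduled to optimize a Neural Radiance Field (NeRF), encouraging each view of the 3D scene to align with the given text prompt and hand-drawn sketch.").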
