Tabular Data – Introducing the Latest arXiv Papers

TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models / Does TabPFN Understand Causal Structures? / TransactionGPT

TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models [76.5]
TabPFN-2.5 is built for datasets with up to 50,000 data points and 2,000 features. It substantially outperforms tuned tree-based models and matches the accuracy of AutoGluon 1.4. For production use, the authors introduce a new distillation engine that converts TabPFN-2.5 into a compact model or a tree ensemble.
Reference translation of the paper (metadata) (Thu, 13 Nov 2025 01:01:46 GMT)
A proposal for a foundation model for tabular data. Citing TabArena (TabArena – a Hugging Face Space by TabArena), the authors claim strong performance: "TabPFN-2.5 is now the leading method for the industry standard benchmark TabArena (which contains datasets with up to 100,000 training data points), substantially outperforming tuned tree-based models and matching the accuracy of AutoGluon 1.4, a complex four-hour tuned ensemble that even includes the previous TabPFNv2. Remarkably, default TabPFN-2.5 has a 100% win rate against default XGBoost on small to medium-sized classification datasets (≤10,000 data points, 500 features) and a 87% win rate on larger datasets up to 100K samples and 2K features (85% for regression)."
Prior Labs

Does TabPFN Understand Causal Structures? [40.2]
This study examines whether TabPFN encodes causal information in its internal representations. The authors develop an adaptor framework that uses a learnable decoder and causal tokens. The evaluation shows that TabPFN's embeddings contain causal information and that the framework outperforms traditional causal discovery algorithms.
Reference translation of the paper (metadata) (Mon, 10 Nov 2025 15:53:15 GMT)
The paper reports an interesting property: "We show that TabPFN’s embeddings contain causal information and that our adaptor framework outperforms traditional causal discovery algorithms when causal information is extracted from mid-range layers. This further promotes leveraging pre-trained tabular models for extracting causal structures, improving the interpretability of these models, and aiding in scientific discovery."

TransactionGPT [41.9]
TransactionGPT is a foundation model for consumer transaction data within the world's largest payment network. The paper proposes a new 3D-Transformer architecture to capture the complex dynamics of payment transaction data.
Reference translation of the paper (metadata) (Thu, 13 Nov 2025 01:20:09 GMT)
A foundation model from Visa Research. The authors describe it as "TransactionGPT (TGPT), a foundation model that captures complex consumer shopping dynamics from Multi-Modal-Temporal-Tabular (MMTT) data." and state that "Extensive experiments on large-scale, real-world payment data validate TGPT’s ability to learn meaningful transaction patterns, leading to significant performance improvements on critical downstream tasks. Furthermore, we quantify the benefits of several designs that enhance the TGPT’s efficiency and scalability."

Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges

Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges [22.1]
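This paper introduces key concepts through a taxonomy of tabular input representations and an introduction to table understanding tasks. Tables are two-dimensional and span formats with different purposes, from structured database tables to complex multi-layered spreadsheets. The authors highlight several important gaps in the field that call for further research.
Reference translation of the paper (metadata) (Thu, 31 Jul 2025 23:41:31 GMT)
A survey of how LLMs handle tabular data.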

A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective

A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective [23.3]
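Tabular data is one of the most widely used data formats across domains such as bioinformatics, healthcare, and marketing. This survey examines reinforcement learning (RL) and generative approaches to feature selection and feature generation as fundamental techniques for refining the data space. The authors summarize existing challenges, discuss future research directions, and aim to offer insights that foster continued innovation in the field.
Reference translation of the paper (metadata) (Wed, 12 Feb 2025 22:34:50 GMT)
The authors note that "Tabular data-centric AI is evolving with RL-based optimization and generative modeling playing a key role in feature engineering." The paper surveys RL-based optimization and uses of generative AI for tabular data, which remains as important as ever.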

A survey on imbalanced data has also come out. This, too, has long been an important perspective.

A Comprehensive Survey on Imbalanced Data Learning [45.3]
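Imbalanced data is prevalent across many types of raw data and hinders machine learning performance. This survey systematically analyzes various real-world data formats. Existing research across data formats is grouped into four categories: data re-balancing, feature representation, training strategy, and ensemble learning.
Reference translation of the paper (metadata) (Thu, 13 Feb 2025 04:53:17 GMT)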
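To make the first of those four categories concrete, here is a minimal sketch of data re-balancing by random oversampling. It is a generic textbook technique chosen for illustration, not a method taken from the survey; the function name `random_oversample` and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (illustrative sketch)."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # Every class is grown to the size of the largest one.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap for this class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

Oversampling only duplicates existing rows, so it should be applied to the training split alone; applying it before the train/test split would leak copies into the evaluation data.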

Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes [135.7]
This paper proposes a simple, lightweight method for fusing large language models with gradient-boosted decision trees. The fusion methods are named LLM-Boost and PFN-Boost. They show state-of-the-art performance against numerous baselines and ensembling algorithms.
Reference translation of the paper (metadata) (Thu, 06 Feb 2025 02:39:35 GMT)
An approach that fuses LLMs or Transformers with GBDTs: "We propose LLM-Boost: a novel yet simple and easy-to-implement boosting mechanism that combines LLMs, which ingest semantic column headers, with GBDTs that can scale to massive datasets." and "We further propose PFN-Boost, where we instead fuse TabPFN and GBDTs for performance gains over GBDTs alone across dataset sizes without using column headers." It seems plausible that the benefit depends on data size.
The repository is GitHub – MayukaJ/LLM-Boost
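One simple way to picture this kind of fusion is to start boosting from another model's scores instead of from a constant, so the trees only have to learn what the base model gets wrong. The sketch below does this for regression with decision stumps in plain Python. It is an assumption made for illustration, not the paper's implementation: `fit_seeded_boosting`, `base_predict`, `scale`, and the learning rate are all hypothetical names and choices, and `base_predict` merely stands in for an LLM or TabPFN.

```python
def fit_stump(X, residuals):
    """Return the one-split stump (feature, threshold, left value, right value)
    that minimizes squared error on the residuals."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if err < best_err:
                best, best_err = (j, t, lm, rm), err
    if best is None:  # no valid split exists: fall back to a constant
        mean = sum(residuals) / len(residuals)
        best = (0, float("inf"), mean, mean)
    return best


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_seeded_boosting(X, y, base_predict, scale=1.0, n_rounds=20, lr=0.5):
    """Boost stumps on the residuals of a (scaled) base model's predictions.
    Returns a function mapping a feature row to the fused prediction."""
    preds = [scale * base_predict(row) for row in X]  # seed with the base model
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + lr * stump_predict(stump, row) for p, row in zip(preds, X)]

    def predict(row):
        return scale * base_predict(row) + lr * sum(stump_predict(s, row) for s in stumps)

    return predict
```

With `scale=0.0` this reduces to ordinary boosting from zero, which makes it easy to compare the seeded and unseeded variants on the same data.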

The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features

The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features [40.2]
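This paper proposes TabPFN-TS, a simple approach that combines TabPFN with simple feature engineering to achieve strong forecasting performance. Despite its simplicity and only 11 million parameters, TabPFN-TS outperforms Chronos-Mini, a model of similar size, and slightly outperforms Chronos-Large, which has 65 times more parameters.
Reference translation of the paper (metadata) (Mon, 06 Jan 2025 11:38:19 GMT)
A proposal built on a tabular foundation model, which seems like a hard thing to get right. The authors write: "By using a simple set of timestamp-derived features, our approach matches or slightly outperforms Chronos-T5 (Large), which, to our knowledge, is one of the strongest time series foundation models." It may be capturing the basic dynamics of time series data, but anyone using it should probably validate it in their own domain.
The repository is GitHub – PriorLabs/tabpfn-client: ⚡ Easy API access to the tabular foundation model TabPFN ⚡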
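The core idea of recasting forecasting as tabular regression can be sketched with the standard library alone. The feature set below (a running index plus cyclical hour and day-of-week encodings) is an assumption chosen for illustration; the paper's exact features may differ, and the TabPFN call itself is deliberately left out.

```python
import math
from datetime import datetime, timedelta


def timestamp_features(ts, index):
    """Derive simple tabular features from one timestamp (illustrative set)."""
    hour_angle = 2 * math.pi * ts.hour / 24
    dow_angle = 2 * math.pi * ts.weekday() / 7
    return {
        "running_index": index,           # position in the series, carries trend
        "hour_sin": math.sin(hour_angle),  # cyclical encoding of hour of day
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(dow_angle),    # cyclical encoding of day of week
        "dow_cos": math.cos(dow_angle),
        "month": ts.month,
    }


def series_to_table(timestamps, values):
    """Turn a univariate series into (feature rows, targets) for a tabular regressor."""
    X = [timestamp_features(ts, i) for i, ts in enumerate(timestamps)]
    return X, list(values)


def future_rows(last_ts, step, horizon, start_index):
    """Build feature rows for the next `horizon` timestamps to be predicted."""
    return [
        timestamp_features(last_ts + step * (k + 1), start_index + k)
        for k in range(horizon)
    ]
```

Training rows from `series_to_table` and query rows from `future_rows` share the same columns, so any tabular regressor can be fit on the former and asked to predict the latter.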

Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding

Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding [42.8]
Tree-of-Table is a new approach designed to enhance the reasoning ability of LLMs over large, complex tables. It sets a new benchmark with superior performance and shows remarkable efficiency and generalization ability in large-scale table reasoning.
Reference translation of the paper (metadata) (Wed, 13 Nov 2024 11:02:04 GMT)
A proposal that uses a tree structure to reason over large tabular data.
It is impressive that the following can be done on a prompt basis: "Starting with a large-scale input table, the process selectively condenses the data, emphasizing task-relevant information. Subsequently, the decomposed elements are methodically reorganized into a Table-Tree, a hierarchical structure designed to streamline and guide the subsequent reasoning process." It looks likely to be effective.

AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions [47.7]
AutoKaggle implements an iterative development process that combines code execution and unit testing to ensure code correctness and logical consistency. A general-purpose data science toolkit, containing validated functions for data cleaning, feature engineering, and modeling, forms the foundation of the solution. AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 on typical data science pipelines.
Reference translation of the paper (metadata) (Sun, 27 Oct 2024 12:44:25 GMT)
Automation of Kaggle-style data analysis. The tasks (analysis phases) covered are "background understanding, preliminary exploratory data analysis, data cleaning (DC), in-depth exploratory data analysis, feature engineering (FE), and model building, validation, and prediction (MBVP)." which is broader than typical AutoML. The target data appears to be tabular.
The authors do take care about leakage ("As our analysis relies on GPT-4o, which is trained on data available until October 2023, it includes most of the Classic Kaggle competitions. To evaluate the generalization capabilities of AutoKaggle, we therefore focus on competitions initiated after 2024."), but flatly asserting that "Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, fully proving its effectiveness and practicality in handling complex data science tasks." is bold. That said, given the capability of current LLMs, my sense is that these problems are solvable with a well-built pipeline.
The repository is GitHub – multimodal-art-projection/AutoKaggle

SpreadsheetLLM: Encoding Spreadsheets for Large Language Models

SpreadsheetLLM: Encoding Spreadsheets for Large Language Models [44.1]
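SpreadsheetLLM is an efficient encoding method designed to unleash large language models (LLMs) on spreadsheets. The authors develop SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs. An LLM fine-tuned with SheetCompressor achieves an average compression ratio of 25 times and a state-of-the-art F1 score of 78.9%, surpassing the best existing models by 12.3%.
Reference translation of the paper (metadata) (Fri, 12 Jul 2024 06:34:21 GMT)
A proposal for a framework to deal with spreadsheets, which LLMs generally find hard to handle.
The approach converts spreadsheets into Markdown-like text through "structural-anchor-based extraction, inverted-index translation, data-format-aware aggregation". It also proposes Chain of Spreadsheet, which separates table recognition from boundary identification, and claims SOTA on benchmarks.
Seeing a Microsoft paper write "Spreadsheets are characterized by their extensive two-dimensional grids, flexible layouts, and varied formatting options, which pose significant challenges for large language models (LLMs)." leaves me with mixed feelings.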
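As a rough picture of what an inverted-index encoding of a sheet looks like, the sketch below maps each distinct cell value to the addresses where it appears and skips empty cells, so repeated values and blank regions cost fewer tokens. This is a simplified assumption for illustration, not SheetCompressor itself: the function names are hypothetical, and merging adjacent addresses into ranges is omitted.

```python
from collections import defaultdict


def column_letter(index):
    """Convert a 0-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def inverted_index(grid):
    """Map each non-empty cell value (as text) to the list of addresses holding it."""
    index = defaultdict(list)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value is None or value == "":
                continue  # empty cells are dropped entirely
            index[str(value)].append(f"{column_letter(c)}{r + 1}")
    return dict(index)
```

For a grid such as `[["Year", "Sales"], [2023, 100], [2024, 100]]`, the value `100` is stored once with both of its addresses rather than being repeated per cell.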

TALENT: A Tabular Analytics and Learning Toolbox

TALENT: A Tabular Analytics and Learning Toolbox [24.9]
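This paper presents TALENT (Tabular Analytics and LEarNing Toolbox), a versatile deep learning toolbox for using, analyzing, and comparing tabular methods. TALENT includes an extensive collection of more than 20 deep tabular prediction methods, along with various encoding and normalization modules. The paper describes the toolbox's design and functionality, illustrates its practical application through several case studies, and examines the performance of various methods built on the toolbox.
Reference translation of the paper (metadata) (Thu, 04 Jul 2024 16:57:14 GMT)
A toolbox for tabular data analysis that includes a rich set of deep learning methods.
The repository is GitHub – qile2000/LAMDA-TALENT: A comprehensive toolkit and benchmark for tabular data learning, featuring over 20 deep methods, more than 10 classical methods, and 300 diverse tabular datasets.
CatBoost and XGB still look quite strong, after all...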

Why Tabular Foundation Models Should Be a Research Priority

Why Tabular Foundation Models Should Be a Research Priority [65.8]
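Tabular data is the dominant modality in many fields, yet it receives little research attention and lags far behind in scale and power. The authors believe the time has come to start developing tabular foundation models, or what they call Large Tabular Models (LTMs).
Reference translation of the paper (metadata) (Thu, 02 May 2024 10:05:16 GMT)
I would like to have a Large Tabular Model, but even after reading the paper I still have real doubts about whether it can be made general-purpose and whether the cost is justified.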
