---
date: '2025-01-08'
doi: '10.1038/s41586-024-08328-6'
source: Nature PDF text extraction
source_url: 'https://www.nature.com/articles/s41586-024-08328-6'
title: Accurate predictions on small data with a tabular foundation model
---

Article

Accurate predictions on small data with a tabular foundation model

https://doi.org/10.1038/s41586-024-08328-6

Noah Hollmann1,2,3,7 ✉, Samuel Müller1,7 ✉, Lennart Purucker1, Arjun Krishnakumar1, Max Körfer1, Shi Bin Hoo1, Robin Tibor Schirrmeister4,5 & Frank Hutter1,3,6 ✉

Received: 17 May 2024

Accepted: 31 October 2024
Published online: 8 January 2025
Open access

Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science1,2. The fundamental prediction task of filling in missing values of a label column based on the rest of the columns is essential for applications as diverse as biomedical risk models, drug discovery and materials science. Although deep learning has revolutionized learning from raw data and led to numerous high-profile success stories3--5, gradient-boosted decision trees6--9 have dominated tabular data for the past 20 years. Here we present the Tabular Prior-data Fitted Network (TabPFN), a tabular foundation model that outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time. In 2.8 s, TabPFN outperforms an ensemble of the strongest baselines tuned for 4 h in a classification setting. As a generative transformer-based foundation model, this model also allows fine-tuning, data generation, density estimation and learning reusable embeddings. TabPFN is a learning algorithm that is itself learned across millions of synthetic datasets, demonstrating the power of this approach for algorithm development. By improving modelling abilities across diverse fields, TabPFN has the potential to accelerate scientific discovery and enhance important decision-making in various domains.

Throughout the history of artificial intelligence, manually created algorithmic components have been replaced with better-performing end-to-end learned ones. Hand-designed features in computer vision, such as SIFT (Scale Invariant Feature Transform)10 and HOG (Histogram of Oriented Gradients)11, have been replaced by learned convolutions; grammar-based approaches in natural language processing have been replaced by learned transformers12; and the design of customized opening and end-game libraries in game playing has been superseded by end-to-end learned strategies3,13. Here we extend this end-to-end learning to the ubiquitous domain of tabular data.

The diversity of tabular data sets them apart from unprocessed modalities such as text and images. While in language modelling, for example, the meaning of a word is consistent across documents, in tabular datasets the same value can mean fundamentally different things. A drug discovery dataset, for example, might record chemical properties, whereas another dataset in materials science might document thermal and electric properties. This specialization leads to a proliferation of smaller, independent datasets and associated models. To illustrate, on the popular tabular benchmarking website openml.org, 76% of the datasets contain less than 10,000 rows at the time of writing.

Deep learning methods have traditionally struggled with tabular data, because of the heterogeneity between datasets and the heterogeneity of the raw data itself: tables contain columns, also called features, with various scales and types (Boolean, categorical, ordinal, integer, floating point), imbalanced or missing data, unimportant features, outliers and so on. This made non-deep-learning methods, such as tree-based models, the strongest contender so far14,15.

However, these traditional machine learning models have several drawbacks. Without substantial modifications, they yield poor out-of-distribution predictions and poor transfer of knowledge from one dataset to another16. Finally, they are hard to combine with neural networks, as they do not propagate gradients.

As a remedy, we introduce TabPFN, a foundation model for small- to medium-sized tabular data. This new supervised tabular learning method can be applied to any small- to moderate-sized dataset and yields dominant performance for datasets with up to 10,000 samples and 500 features. In a single forward pass, TabPFN significantly outperforms state-of-the-art baselines on our benchmarks, including gradient-boosted decision trees, even when these are allowed 4 h of tuning, a speedup of 5,140× (classification) and 3,000× (regression). Finally, we demonstrate various foundation model characteristics of TabPFN, including fine-tuning, generative abilities and density estimation.

1Machine Learning Lab, University of Freiburg, Freiburg, Germany. 2Computational Medicine, Berlin Institute of Health at Charité, Universitätsmedizin Berlin, Berlin, Germany. 3Prior Labs, Freiburg, Germany. 4Neuromedical AI Lab, Department of Neurosurgery, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany. 5Medical Physics, Department of Diagnostic and Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany. 6ELLIS Institute Tübingen, Tübingen, Germany. 7These authors contributed equally: Noah Hollmann, Samuel Müller. ✉e-mail: noah\@priorlabs.ai; samuelgabrielmuller\@gmail.com; fh\@cs.uni-freiburg.de

Principled in-context learning

TabPFN leverages in-context learning (ICL)17, the same mechanism that led to the astounding performance of large language models, to generate a powerful tabular prediction algorithm that is fully learned.


[Figure 1 shows the TabPFN workflow. a, TabPFN is trained on synthetic data to take entire datasets (Xtrain, ytrain, Xtest) as inputs and predict ytest in a forward pass, minimizing the training loss −log q(ytest \| ...) across millions of synthetic datasets; the trained network can then be applied to arbitrary unseen real-world datasets. b, The architecture represents each entry of the table as one node, stacks 12 two-dimensional TabPFN layers of 1D feature attention, 1D sample attention and an MLP, and transforms the output vector into a piece-wise constant (Riemann) predicted distribution over y.]

Fig. 1 \| Overview of the proposed method. a, The high-level overview of TabPFN pre-training and usage. b, The TabPFN architecture. We train a model to solve more than 100 million synthetic tasks. Our architecture is an adaptation of the standard transformer encoder that is adapted for the two-dimensional data encountered in tables.

Although ICL was first observed in large language models, recent work has shown that transformers can learn simple algorithms such as logistic regression through ICL18--21. Prior-data Fitted Networks (PFNs) have shown that even complex algorithms, such as Gaussian Processes and Bayesian Neural Networks, can be approximated with ICL22. ICL enables us to learn a wider space of possible algorithms, including cases for which a closed-form solution does not exist.

We build on a preliminary version of TabPFN23, which demonstrated the applicability of in-context learning17 for tabular data in principle but had many limitations that rendered it inapplicable in most cases. Based on a series of improvements, the new TabPFN scales to 50× larger datasets; supports regression tasks, categorical data and missing values; and is robust to unimportant features and outliers.

The key idea behind TabPFN is to generate a large corpus of synthetic tabular datasets and then train a transformer-based12 neural network to learn to solve these synthetic prediction tasks. Although traditional approaches require hand-engineered solutions for data challenges such as missing values, our method autonomously learns effective strategies by solving synthetic tasks that include these challenges. This approach leverages ICL as a framework for exemplar-based declarative programming of algorithms. We design desired algorithmic behaviour by generating diverse synthetic datasets that demonstrate the desired behaviour and then train a model to encode an algorithm that satisfies it. This shifts the algorithm design process from writing explicit instructions to defining input--output examples, opening up possibilities for creating algorithms in various domains. Here, we apply this approach to the high-impact field of tabular learning, generating a powerful tabular prediction algorithm.

Our ICL approach differs fundamentally from standard supervised deep learning. Usually, models are trained per dataset, updating model parameters on individual samples or batches according to hand-crafted weight-updating algorithms, such as Adam24. At inference time, the learned model is applied to test samples. By contrast, our approach is trained across datasets and is applied to entire datasets at inference time rather than individual samples. Before being applied to real-world datasets, the model is pre-trained once on millions of synthetic datasets representing different prediction tasks. At inference time, the model receives an unseen dataset with both labelled training and unlabelled test samples and performs training and prediction on this dataset in a single neural network forward pass.

Figures 1 and 2 outline our approach:

1. Data generation: we define a generative process (referred to as our prior) to synthesize diverse tabular datasets with varying relationships between features and targets, designed to capture a wide range of potential scenarios that our model might encounter. We sample millions of datasets from the generative process. For each dataset, a subset of samples has their target values masked, simulating a supervised prediction problem. Further details of our prior design are shown in the section 'Synthetic data based on causal models'.
2. Pre-training: we train a transformer model, our PFN, to predict the masked targets of all synthetic datasets, given the input features and the unmasked samples as context. This step is done only once during model development, learning a generic learning algorithm that can be used to predict any dataset. A minimal sketch of this objective is shown below.
3. Real-world prediction: the resulting trained model can now be applied to arbitrary unseen real-world datasets. The training samples are provided as context to the model, which predicts the labels of these unseen datasets through ICL.

Our approach also has a theoretical foundation as described in ref. 22. It can be viewed as approximating Bayesian prediction for a prior defined by the synthetic datasets. The trained PFN will approximate the posterior predictive distribution $p(\hat{y}_{\mathrm{test}} \mid X_{\mathrm{test}}, X_{\mathrm{train}}, y_{\mathrm{train}})$ and thus return a Bayesian prediction for the specified distribution over artificial datasets used during PFN pre-training.
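To make the pre-training step concrete, the following is a minimal, illustrative sketch of the objective for a classification prior. It assumes a hypothetical `sample_synthetic_dataset()` generator drawn from the prior and a generic `model(x_train, y_train, x_test)` that returns logits for the masked targets; it is not the released training code.

```python
import torch
import torch.nn.functional as F

def pretrain(model, optimizer, sample_synthetic_dataset, n_steps):
    """Sketch of PFN pre-training: learn to predict masked targets across tasks."""
    for _ in range(n_steps):
        X, y = sample_synthetic_dataset()            # one synthetic dataset (task)
        cut = torch.randint(1, len(X), (1,)).item()  # split into context and queries
        x_train, y_train = X[:cut], y[:cut]          # labelled context rows
        x_test, y_test = X[cut:], y[cut:]            # rows whose targets are masked

        logits = model(x_train, y_train, x_test)     # ICL in a single forward pass
        # Minimize -log q(y_test | X_test, X_train, y_train); averaged over millions
        # of synthetic datasets, this approximates the posterior predictive of the prior.
        loss = F.cross_entropy(logits, y_test)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```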

[Figure 2 shows the synthetic data-generating pipeline. a, Sample underlying parameters and graph structure: the number of data points, number of features, number of nodes, graph complexity and the graph itself. b, Build the computational graph: (1) for each generated sample, propagate initialization data through the graph; (2) sample random feature (F) and target (T) node positions; (3) read off data at those positions; (4) apply post-processing, quantization and warping. Connection types include neural networks, trees and discretization. c, Final datasets.]

Fig. 2 \| Overview of the TabPFN prior. a, For each dataset, we first sample high-level hyperparameters. b, Based on these hyperparameters, we construct a structural causal model that encodes the computational function generating the dataset. Each node holds a vector and each edge in the computational graph implements a function according to one of the connection types. In step 1, using random noise variables we generate initialization data, which is fed into the root nodes of the graphs and propagated through the computational graph for each to-be-generated sample. In step 2, we randomly sample feature and target node positions in the graph, labelled F and T, respectively. In step 3, we extract the intermediate data representations at the sampled feature and target node positions. In step 4, we post-process the extracted data. c, We retrieve the final datasets. We plot interactions of feature pairs and the node colour represents the class of the sample.

An architecture designed for tables

The transformer architecture is currently the favoured architecture for flexible deep learning and foundation models4,5. Transformer models work on sequences and combine information between sequence items using so-called attention mechanisms25, allowing them to effectively capture long-range dependencies and learn complex relationships in the data. Although transformer-based models can be applied to tabular data26,27, TabPFN addresses two key limitations inherent to them. First, as transformers are designed for sequences, they treat the input data as a single sequence, not using the tabular structure. Second, machine learning models are often used in a fit-predict setting, in which a model is fitted on the training set once and then reused for multiple test datasets. Transformer-based ICL algorithms, however, receive train and test data in a single pass and thus perform training and prediction at once. Thus, when a fitted model is reused, it has to redo computations for the training set.

To better use the tabular structure, we propose an architecture that assigns a separate representation to each cell in the table, inspired by refs. 22,28. Our architecture, visualized in Fig. 1b, uses a two-way attention mechanism, with each cell attending to the other features in its row (that is, its sample) and then attending to the same feature across its column (that is, all other samples). This design makes the architecture invariant to the order of both samples and features and enables more efficient training and extrapolation to larger tables than those encountered during training, in terms of both the number of samples and features.
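The core pattern of this two-way attention can be sketched with standard building blocks. The layer below is our own simplified illustration, not the released architecture (which additionally handles train/test masking, normalization and other details): it alternates multi-head attention over the feature axis and over the sample axis of a (samples × features × channels) tensor of per-cell representations.

```python
import torch
import torch.nn as nn

class TwoWayAttentionBlock(nn.Module):
    """One layer of per-cell attention: across features, then across samples."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.feature_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.sample_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, cells: torch.Tensor) -> torch.Tensor:
        # cells: (n_samples, n_features, dim), one embedding per table cell.
        x = cells

        # 1D feature attention: each cell attends to the other cells of its row.
        x = x + self.feature_attn(x, x, x, need_weights=False)[0]

        # 1D sample attention: each cell attends to the same feature in other rows.
        x = x.transpose(0, 1)  # (n_features, n_samples, dim); batch = features
        x = x + self.sample_attn(x, x, x, need_weights=False)[0]
        x = x.transpose(0, 1)  # back to (n_samples, n_features, dim)

        return x + self.mlp(x)

# Example: a tiny table with 8 samples, 5 features and 32 channels per cell.
block = TwoWayAttentionBlock(dim=32)
out = block(torch.randn(8, 5, 32))
print(out.shape)  # torch.Size([8, 5, 32])
```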

To mitigate repeating computations on the training set for each test sample in a fit-predict setting, our model can separate the inference on the training and test samples. This allows us to perform ICL on the training set once, save the resulting state and reuse it for multiple test set inferences. On datasets with 10,000 training samples and 10 features, our optimized train-state caching results in inference speedups of around 300× on CPU (from 32 s to 0.1 s) and 6× on GPU. With 10× more features (100), the speedups increase to 800× on CPU and 30× on GPU. These measurements focus solely on the core inference process, excluding pre-processing and ensembling steps detailed in the section 'Inference details'. The lower speedups on GPUs are because of an underutilization of their massively parallel architecture.

We further optimize the memory and compute requirements of the architecture by computing layer norms in half-precision, using flash attention29, activation checkpointing and sequential computation of the state. Our optimizations reduce the memory requirements by a factor of four, resulting in less than 1,000 bytes per cell. This enables prediction on datasets with up to 50 million cells (for example, 5 million rows × 10 features) on a single H100 GPU.

For regression tasks, we use a piece-wise constant output distribution, following refs. 22,30, which allows our models to predict a probability distribution of target values instead of a single value, including, for example, bimodal distributions.
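To illustrate what such a piece-wise constant (Riemann) output distribution looks like in practice, the sketch below (our own illustrative construction, not the released decoding code) turns per-bin logits into a normalized density over a target range and derives a point prediction from it.

```python
import numpy as np

def riemann_distribution(logits, y_min, y_max):
    """Turn per-bin logits into a piece-wise constant density over [y_min, y_max]."""
    n_bins = len(logits)
    edges = np.linspace(y_min, y_max, n_bins + 1)
    widths = np.diff(edges)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                 # bin probabilities (softmax)
    density = probs / widths                             # constant density inside each bin
    mean = np.sum(probs * (edges[:-1] + edges[1:]) / 2)  # point prediction if one is needed
    return edges, density, mean

# Example: a bimodal prediction, impossible to express with a single value.
logits = np.array([0.1, 2.5, 0.2, 0.1, 0.3, 2.4, 0.2, 0.1])
edges, density, mean = riemann_distribution(logits, y_min=-1.0, y_max=1.0)
print(mean, density.round(2))
```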
Synthetic data based on causal models

The performance of TabPFN relies on generating suitable synthetic training datasets that capture the characteristics and challenges of real-world tabular data. To generate such datasets, we developed an approach based on structural causal models (SCMs)31. SCMs provide a formal framework for representing causal relationships and generative processes underlying the data. By relying on synthetic data instead of large collections of public tabular data, we avoid common problems of foundation models, such as privacy and copyright infringements, contaminating our training data with test data32 or limited data availability.

As shown in Fig. 2, our generative pipeline first samples high-level hyperparameters, such as dataset size, number of features and difficulty level, to govern the overall properties of each synthetic dataset.
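A minimal sketch of this first step might look as follows; the particular ranges and distributions are illustrative assumptions, not the exact values used for TabPFN's prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset_hyperparameters():
    """Draw high-level properties that govern one synthetic dataset (illustrative)."""
    return {
        "n_samples": int(rng.integers(100, 10_000)),       # dataset size
        "n_features": int(rng.integers(1, 160)),           # number of features
        "n_graph_nodes": int(rng.integers(10, 100)),       # size of the causal graph
        "graph_complexity": float(rng.uniform(0.1, 1.0)),  # density of graph edges
        "noise_scale": float(rng.lognormal(mean=-2.0)),    # difficulty level
        "fraction_missing": float(rng.choice([0.0, 0.1, 0.3])),
    }

print(sample_dataset_hyperparameters())
```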


[Figure 3, panel a: a grid of toy regression problems. Columns show different one-dimensional functions (homoscedastic noise, heteroscedastic noise, sin(x) + x, x², \|x\| and a step function); rows show the fits of TabPFN, CatBoost, an MLP and a linear model against the true function. Panel b: predicted density of the position on the wall (m) in a double-slit experiment as a function of slit width (mm) and slit separation (μm), comparing the true function, TabPFN and CatBoost (quantile).]

Fig. 3 \| The behaviour of TabPFN and a set of baselines on simple functions. In all plots, we use orange for the ground truth and blue for model predictions. a, Each column represents a different toy function, each having a single feature (along the x-axis) and a target (along the y-axis). TabPFN can model a lot of different functions, including noisy functions. b, TabPFN can model distributions over outputs out of the box, which is exemplified by predicting the light intensity pattern in a double-slit experiment after observing the positions of 1,000 photons.

Guided by these hyperparameters, we construct a directed acyclic graph specifying the causal structure underlying the dataset.

To generate each sample within a dataset, we propagate randomly generated noise, called our initialization data, through the root nodes of the causal graph. This initialization data is generated by sampling from a random normal or uniform distribution with varying degrees of non-independence between samples; see section 'Initialization data sampling'. As these data traverse the edges of the computational graph, we apply a diverse set of computational mappings: small neural networks with linear or nonlinear activations (for example, sigmoid, ReLU (rectified linear unit), modulo, sine), discretization mechanisms for generating categorical features and decision tree structures to encode local, rule-based dependencies. At each edge, we add Gaussian noise, introducing uncertainty into the generated data. We save the intermediate data representations at each node to be retrieved later. See section 'Computational edge mappings' for details.

After traversing the causal graph, we extract the intermediate representations at the sampled feature and target nodes, yielding a sample consisting of feature values and an associated target value.

By incorporating various data challenges and complexities into the synthetic datasets, we create a training ground that allows TabPFN to develop strategies for handling similar issues in real-world datasets. For instance, consider the case of missing values, commonly present in tabular data. By exposing TabPFN to synthetic datasets with varying patterns and fractions of missing values in our synthetic data generation process, the model learns effective ways of handling missing values that generalize to real-world datasets.

We apply post-processing techniques to further enhance the realism and challenge the robustness of the learned prediction algorithms. This includes warping with the Kumaraswamy distribution33, introducing complex nonlinear distortions, and quantization mimicking discretized features. See section 'Post-processing' for details.

Through this generative process, we created a massive corpus of around 100 million synthetic datasets per model training, each with a unique causal structure, feature types and functional characteristics.
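The sketch below condenses steps 1 to 3 of Fig. 2 into a toy version (our own simplification, not the released prior code): noise is propagated through a small randomly weighted DAG, random nonlinearities and Gaussian noise are applied along the edges, and feature and target columns are read off at randomly chosen nodes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm_dataset(n_samples=200, n_nodes=8, n_features=3):
    """Toy structural-causal-model prior: propagate noise through a random DAG."""
    activations = [np.tanh, np.sin, lambda x: np.maximum(x, 0)]  # edge nonlinearities
    values = np.zeros((n_samples, n_nodes))
    values[:, 0] = rng.normal(size=n_samples)        # initialization data (root node)

    for node in range(1, n_nodes):
        parents = rng.choice(node, size=min(node, 2), replace=False)  # earlier nodes only
        weights = rng.normal(size=len(parents))
        act = activations[rng.integers(len(activations))]
        values[:, node] = act(values[:, parents] @ weights)
        values[:, node] += 0.1 * rng.normal(size=n_samples)           # Gaussian edge noise

    feature_nodes = rng.choice(n_nodes, size=n_features, replace=False)
    target_node = rng.integers(n_nodes)
    X = values[:, feature_nodes]                     # read off feature columns ...
    y = values[:, target_node]                       # ... and a regression target
    return X, y

X, y = sample_scm_dataset()
print(X.shape, y.shape)  # (200, 3) (200,)
```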
Qualitative analysis

We first analyse the behaviour of TabPFN on toy problems to build intuition and disentangle the impact of various dataset characteristics. As regression problems are easier to visualize, we focus on these in our qualitative analysis. In Fig. 3a, we compare TabPFN with a diverse set of standard predictors, with all methods using default settings.

Linear (ridge) regression can naturally model only linear functions, leading to simple and interpretable predictions but catastrophic failure on many of the toy functions. Multilayer perceptrons (MLPs)34 perform worse on datasets with highly non-smooth patterns14. This is especially apparent for the step function. TabPFN, by contrast, models either function type, smooth or non-smooth, out of the box. This includes a good approximation to step functions despite TabPFN being a neural network. CatBoost9, representative of tree-based methods, fits only piece-wise constant functions. Although this leads to approximation errors and unintuitive predictions, it avoids catastrophic failures.

The main advantage of TabPFN over all baselines is its inherent ability to model uncertainty at no extra cost. Whereas classical regression methods output a single real-valued prediction, TabPFN returns a target distribution, capturing the uncertainty of predictions. These uncertainty modelling abilities of TabPFN extend beyond simple distributions and can handle complex, multi-modal distributions. Figure 3b shows this by modelling the density of light reaching a detector screen in a double-slit experiment35 for different slit distances and widths. In this classic experiment, photons are sent through two slits, creating a multi-modal intensity pattern because of the wave-like interference behaviour of light. TabPFN predicts these intricate patterns in just a single forward pass, requiring only 1.2 s. By contrast, traditional methods such as CatBoost require training multiple models at different quantiles and reconstructing the distribution from these predictions. Even after tuning CatBoost specifically for this task, it produced substantially worse predictions compared with TabPFN, see Fig. 3b. With default settings, CatBoost requires 169.3 s and yields further deteriorated results. Qualitatively, we observe that TabPFN is more accurate in predicting very low densities and has fewer artefacts compared with CatBoost.
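The flavour of Fig. 3a can be reproduced by fitting TabPFN and a linear baseline on a one-dimensional step function. The sketch assumes that the open-source `tabpfn` package exposes a scikit-learn-style `TabPFNRegressor`; check the package documentation for the exact interface of the released code.

```python
import numpy as np
from sklearn.linear_model import Ridge
from tabpfn import TabPFNRegressor  # assumed interface of the released package

rng = np.random.default_rng(0)

# A noisy step function, the case where linear models and MLPs struggle (Fig. 3a).
X_train = rng.uniform(-0.5, 0.5, size=(200, 1))
y_train = (X_train[:, 0] > 0).astype(float) + rng.normal(0, 0.05, size=200)
X_test = np.linspace(-0.5, 0.5, 101).reshape(-1, 1)

tabpfn = TabPFNRegressor().fit(X_train, y_train)  # "fitting" stores the ICL context
ridge = Ridge().fit(X_train, y_train)

print("TabPFN:", tabpfn.predict(X_test[[0, 50, 100]]).round(2))
print("Ridge :", ridge.predict(X_test[[0, 50, 100]]).round(2))
```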

[Figure 4, panel a: normalized ROC AUC and normalized accuracy for classification, and normalized negative RMSE and R² for regression, for the default and 4-h-tuned versions of TabPFN, XGB, CatBoost (CB), LightGBM (LGBM), random forest (RF), SVM, MLP and a linear model (Lin), with magnified views of the strongest baselines. Panel b: per-dataset normalized ROC AUC and RMSE comparisons of CatBoost against TabPFN (default and 4 h tuned), with two-sided Wilcoxon P values (P < 0.001 for classification in both settings; P = 0.0153 and P < 0.001 for regression). Panel c: normalized score against average fit + predict time (1 s to 14,400 s) for TabPFN, XGB, CatBoost, LightGBM and RF.]

Fig. 4 \| Comparison of TabPFN on our test benchmarks, containing datasets with up to 10,000 samples and 500 features. Performance was normalized per dataset before aggregation using all baselines; intervals represent the 95% confidence interval. Wilcoxon P refers to the two-sided Wilcoxon signed-rank test P value54. a, Average performance of the default as well as the tuned versions of TabPFN and our baselines. All methods are tuned for ROC AUC or RMSE, respectively, thus decreasing the representativeness of the secondary metrics. LGBM, LightGBM; MLP, multilayer perceptron; SVM, support vector machines; RF, random forest; CB, CatBoost; XGB, XGBoost; Lin, logistic regression for classification and ridge regression for regression tasks. Plots on the right-hand side show a magnified analysis of the strongest baselines considered. b, A per-dataset comparison of TabPFN with its strongest baseline, CatBoost. Each dot is the average score on one dataset. c, The impact of hyperparameter tuning for the considered methods. The x-axis shows the average time required to fit and predict with the algorithm.

Quantitative analysis

We quantitatively evaluate TabPFN on two dataset collections: the AutoML Benchmark36 and OpenML-CTR2337. These benchmarks comprise diverse real-world tabular datasets, curated for complexity, relevance and domain diversity. From these benchmarks, we use the 29 classification datasets and 28 regression datasets that have up to 10,000 samples, 500 features and 10 classes. We further evaluated additional benchmark suites from refs. 14,15, as well as five Kaggle competitions from the Tabular Playground Series.

We compared TabPFN against state-of-the-art baselines, including tree-based methods (random forest38, XGBoost (XGB)7, CatBoost9, LightGBM8), linear models, support vector machines (SVMs)39 and MLPs34.

Evaluation metrics include ROC AUC (area under the receiver operating characteristic curve; One-vs-Rest) and accuracy for classification, and R2 (coefficient of determination) and negative RMSE (root mean squared error) for regression. Scores were normalized per dataset, with 1.0 representing the best and 0.0 the worst performance with respect to all baselines.

For each dataset and method, we ran 10 repetitions with different random seeds and train--test splits (90% train, 10% test). We tuned hyperparameters using random search with five-fold cross-validation, with time budgets ranging from 30 s to 4 h. All methods were evaluated using eight CPU cores, with TabPFN additionally using a consumer-grade GPU (RTX 2080 Ti; other methods did not benefit from this, see Extended Data Fig. 2d). TabPFN was pre-trained once using eight NVIDIA RTX 2080 GPUs over 2 weeks, allowing for ICL on all new datasets in a single forward pass. These modest computational requirements make similar research accessible to academic labs. For details, refer to the section 'Detailed evaluation protocol'.
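The per-dataset normalization can be made concrete with a short helper. This is a sketch of the scheme described above (min-max scaling of each dataset's scores against all evaluated methods), not the exact evaluation code.

```python
import numpy as np

def normalize_per_dataset(scores):
    """scores: dict mapping method -> array of raw scores, one entry per dataset.

    Returns the same structure with each dataset's scores rescaled so that the
    worst method gets 0.0 and the best gets 1.0 (higher is better, for example
    ROC AUC or negative RMSE)."""
    methods = list(scores)
    raw = np.stack([scores[m] for m in methods])   # (n_methods, n_datasets)
    lo, hi = raw.min(axis=0), raw.max(axis=0)
    norm = (raw - lo) / np.where(hi > lo, hi - lo, 1.0)
    return {m: norm[i] for i, m in enumerate(methods)}

# Example with three methods on two datasets (raw ROC AUC values).
scores = {"TabPFN": np.array([0.93, 0.88]),
          "CatBoost": np.array([0.90, 0.86]),
          "Ridge": np.array([0.80, 0.70])}
for method, vals in normalize_per_dataset(scores).items():
    print(method, vals.round(2), "mean:", vals.mean().round(2))
```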
Comparison with state-of-the-art baselines

Figure 4a demonstrates the strong out-of-the-box performance of TabPFN compared with tuned and default configurations of XGBoost, CatBoost and a random forest. For classification tasks, TabPFN surpasses CatBoost, the strongest default baseline, by 0.187 (0.939 compared with 0.752) in normalized ROC AUC in the default setting and by 0.13 (0.952 compared with 0.822) in the tuned setting. For regression, TabPFN outperforms CatBoost in normalized RMSE by 0.051 (0.923 compared with 0.872) in the default setting and by 0.093 (0.968 compared with 0.875) in the tuned setting. In Fig. 4b, we show per-dataset comparisons. Although for some datasets CatBoost outperforms TabPFN, TabPFN wins on most of the datasets.

Figure 4c shows how the performance of TabPFN and the baselines improves with more time spent on hyperparameter search. The default of TabPFN, taking 2.8 s on average for classification and 4.8 s for regression, outperforms all baselines, even when tuning them for 4 h, a speedup of 5,140× and 3,000×, respectively. We show comparisons on a larger number of metrics in Extended Data Tables 1 and 2.

As shown in Extended Data Fig. 2, similar to our primary benchmarks, TabPFN substantially outperformed all baselines on the benchmarks of refs. 14,15. The benchmark of ref. 14 is particularly noteworthy because on this benchmark, tree-based methods were previously found to excel. Moreover, we show in Extended Data Table 6 that default TabPFN outperforms default CatBoost on all five Kaggle competitions with less than 10,000 training samples from the latest completed Tabular Playground Series.

Evaluating diverse data attributes

In Fig. 5a,b, we show the robustness of TabPFN to dataset characteristics that are traditionally hard to handle for neural-network-based approaches14,23.

Figure 5a provides an analysis of the performance of TabPFN across various dataset types. First, we add uninformative features (randomly


[Figure 5, panels a and b: normalized average performance (ROC AUC and negative RMSE) under different data attributes. a, Adding uninformative features (fraction 0% or 90%), scaling outliers (outlier factor 0, 100 or 10,000), dropping samples (fraction kept 100%, 50% or 25%) and dropping features (fraction kept 100%, 50% or 25%). b, Performance split by datasets with or without missing values, with or without categorical features, by number of samples (1–1,999; 2,000–3,999; 4,000–10,000) and by number of features (1–19; 20–39; 40–500).]




                                                         (d




                                                                                                             (d




                                                                                                                                 (d




                                                                                                                                                                                    (d




                                                                                                                                                                                                       (d




                                                                                                                                                                                                                                                                    (d




                                                                                                                                                                                                                                                                                          (d
                                              (




                                                                 r(




                                                                                                                       (




                                                                                                                                          r(




                                                                                                                                                                                             (




                                                                                                                                                                                                                r(




                                                                                                                                                                                                                                                                                (




                                                                                                                                                                                                                                                                                                   r(
                                  N


                                           st

                                                     LP




                                                                                                           N


                                                                                                                    st

                                                                                                                            LP




                                                                                                                                                                                 N


                                                                                                                                                                                          st

                                                                                                                                                                                                    LP




                                                                                                                                                                                                                                                                    N


                                                                                                                                                                                                                                                                             st

                                                                                                                                                                                                                                                                                     LP
                                                                   a




                                                                                                                                            a




                                                                                                                                                                                                                  a




                                                                                                                                                                                                                                                                                                     a
                               PF




                                                                                                        PF




                                                                                                                                                                              PF




                                                                                                                                                                                                                                                                 PF
                                           o




                                                                                                                    o




                                                                                                                                                                                          o




                                                                                                                                                                                                                                                                             o
                                                                ne




                                                                                                                                         ne




                                                                                                                                                                                                               ne




                                                                                                                                                                                                                                                                                                  ne
                                                  M




                                                                                                                           M




                                                                                                                                                                                                 M




                                                                                                                                                                                                                                                                                    M
                                        Bo




                                                                                                                 Bo




                                                                                                                                                                                       Bo




                                                                                                                                                                                                                                                                          Bo
                   b




                                                                                                         b




                                                                                                                                                                              b




                                                                                                                                                                                                                                                                b
                                                             Li




                                                                                                                                      Li




                                                                                                                                                                                                            Li




                                                                                                                                                                                                                                                                                               Li
                Ta




                                                                                                      Ta




                                                                                                                                                                           Ta




                                                                                                                                                                                                                                                             Ta
                                      at




                                                                                                               at




                                                                                                                                                                                     at




                                                                                                                                                                                                                                                                        at
                                     C




                                                                                                             C




                                                                                                                                                                                    C




                                                                                                                                                                                                                                                                    C

c Dataset win rate d Dataset win rate on ROC AUC on RMSE Wilcoxon P = 0.0024 0.8 Wilcoxon P = 0.0101 1.00 0.8 TabPFN (PHE) 0.7 TabPFN (PHE) 0.975

                                                                                                                                                                                                                          Normalized negative RMSE
                                                                                        0.95
                                                                  Normalized ROC AUC




                                                                                                                                                           TabPFN             0.6                                                                                                                                   TabPFN
                                                                                                                                                                                                                                                     0.950

0.6 0.90 AutoGluon 0.5 AutoGluon 0.925 0.85 0.4 0.4 TabPFN CatBoost TabPFN (PHE) 0.3 (PHE) 0.900 0.80

0.2 0.2 0.875 CatBoost 0.75 Autogluon Autogluon 0.1 5 30 60 300 900 3,600 14,400 5 30 60 300 900 3,600 14,400 0 0 Average fit + predict time (s) Average fit + predict time (s)

Fig. 5 \| Robustness across datasets and performance comparison with tuned ensembles. a, A comparison of modified datasets. We can see that TabPFN is not more vulnerable to the modifications compared with baselines. We also see that TabPFN reproduces the accuracy of CatBoost (default) with only half the training samples provided. Here we normalize scores per dataset (sharing one normalization across all modifications of one experiment) to avoid negative outliers. b, We split the test datasets by data characteristics and analyse the performance per subgroup. c, Classification performance. Left, the win rate of TabPFN (PHE) against AutoGluon (with one tie excluded); right, the ROC AUC score over time for tuning each method, with the first marker representing the default configuration for the non-ensembling methods. d, Regression performance presented as in c but using the RMSE metric. Intervals represent the 95% confidence interval and Wilcoxon P refers to the two-sided Wilcoxon signed-rank test P value54.

shuffled features from the original dataset) and outliers (multiply each cell with 2% probability with a random number between 0 and the outlier factor). The results show that TabPFN is very robust to uninformative features and outliers, something typically hard for neural networks, as can be seen with the MLP baseline. Second, although dropping either samples or features hurts the performance of all methods, with half the samples TabPFN still performs as well as the next best method using all samples.

In Fig. 5b, we split our test datasets into subgroups and perform analyses per subgroup. We create subgroups based on the presence of categorical features, missing values, number of samples and number of features in the datasets. The sample- and feature-number subgroups are split such that a third of the datasets fall into each group. We can see that none of these characteristics strongly affect the performance of TabPFN relative to the other methods. However, we note that these results should not be taken as evidence that TabPFN scales well beyond the 10,000 samples and 500 features considered here. We show four further ablations in Extended Data Fig. 1.

Comparison with tuned ensemble methods

We compare the performance of TabPFN with AutoGluon 1.0 (ref. 40), which combines various machine learning models, including our baselines, into a stacked ensemble41, tunes their hyperparameters and then generates the final predictions using post hoc ensembling (PHE)42,43. It thus represents a different class of methods compared with individual baselines.

To assess whether TabPFN can also be improved by a tuned ensemble approach, we introduce TabPFN (PHE). TabPFN (PHE) automatically combines only TabPFN models with PHE and tunes their hyperparameters using a random portfolio from our search space. We detail this approach in the section 'TabPFN (PHE)'.

Figure 5c--d compares the performance of TabPFN, TabPFN (PHE), AutoGluon and CatBoost. For TabPFN (PHE) and AutoGluon, we start with a minimal budget of 300 s for tuning because AutoGluon otherwise does not reliably return results. In just 2.8 s, TabPFN (default) outperforms AutoGluon for classification tasks, even if AutoGluon is allowed up to 4 h, a 5,140× speedup. TabPFN (PHE) further improves performance, leading to an average normalized ROC AUC score of 0.971, compared with 0.939 for TabPFN (default) and 0.914 for AutoGluon. For regression tasks, tuning hyperparameters is more important. Here, TabPFN (PHE) outperforms AutoGluon (allowed 4 h) after its minimal tuning budget of 300 s, a 48× speedup.

Foundation model with interpretability

Apart from its strong predictive performance, TabPFN exhibits key foundation model abilities, such as data generation, density estimation, learning reusable embeddings and fine-tuning. We showcase these abilities through proof-of-concept experiments on the German Credit Dataset44, which contains credit risk information, and the mfeat-factors45 dataset classifying handwritten digits based on a tabular representation.

TabPFN can estimate the probability density function of numerical features, as shown in Fig. 6a, and the probability mass function of categorical features. Computing the sample densities enables anomaly detection to identify issues such as fraud, equipment failures, medical emergencies or low-quality data.

TabPFN also allows synthesizing new tabular data samples that mimic real-world dataset characteristics, as shown in Fig. 6b. This enables applications such as data augmentation or privacy-preserving data sharing46.

The architecture of TabPFN yields meaningful feature representations that can be reused for downstream tasks such as data imputation and clustering.
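For readers who want to try this drop-in workflow, a minimal sketch is shown below. It assumes the publicly released tabpfn Python package exposes a scikit-learn-style TabPFNClassifier; the exact import path and constructor arguments may differ between package versions.

```python
# Minimal sketch: TabPFN as a drop-in scikit-learn-style classifier.
# Assumes the released `tabpfn` package; API details may differ by version.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier  # assumed import path

X, y = load_breast_cancer(return_X_y=True)            # a small tabular dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = TabPFNClassifier()            # no per-dataset gradient training
clf.fit(X_train, y_train)           # stores the context for in-context learning
proba = clf.predict_proba(X_test)   # single forward pass over all test rows

print("ROC AUC:", roc_auc_score(y_test, proba[:, 1]))
```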

[Figure 6 (panels a–d): a, data density estimation and b, synthetic data generation on Age versus Credit_amount (high-, medium- and low-density regions; actual versus generated samples); c, embedded data + PCA (original data versus TabPFN embeddings on the first two principal components); d, fine-tuning data (default versus fine-tuned TabPFN predictions, with ground truth and training samples).]

Fig. 6 \| Showcase of the application of TabPFN as tabular foundation model. a,b, On the German Credit Dataset, we perform data density estimation (a) and generation of new synthetic samples (b). c, We show our learned embeddings are useful representations of each sample on the handwritten digits dataset (mfeat-factors), with different classes forming different clusters. d, We demonstrate fine-tuning TabPFN for a specific set of tasks. Fine-tuned on a dataset containing various sine curves (top), we see the model makes more accurate predictions on another sine curve dataset.

We extract and visualize learned embeddings from the mfeat-factors dataset in Fig. 6c, showing improved class separation compared with the raw data on the first two principal components.

Furthermore, we demonstrate the ability of TabPFN to improve performance through fine-tuning on related datasets. Unlike tree-based methods, the neural architecture of TabPFN enables fine-tuning on specific dataset classes. We conduct proof-of-concept experiments using sine curve datasets with varying offsets between fine-tuning and test data. Figure 6d shows an example fine-tuning result. Our analysis across 50 runs (Extended Data Fig. 4) shows that TabPFN successfully transfers knowledge even when labels differ significantly between fine-tuning and test tasks, with performance improving as distributions become more similar. This could, for example, enable fine-tuning for a range of datasets from medical studies to obtain an improved general model for medical diagnosis tasks. For details, refer to the section 'Foundation model abilities'.

Finally, we have developed a methodology to easily interpret the predictions of TabPFN. Interpretability is crucial for building trust and accountability when deploying models in high-stakes domains. We support the computation of feature importance through SHAP47 (Shapley Additive Explanations), a game-theoretic approach to explain predictions. SHAP values represent the contribution of each feature to the output of the model. Extended Data Fig. 3 compares the feature importance and impact for logistic regression, CatBoost and TabPFN. TabPFN achieves high accuracy while learning simple, interpretable feature relationships. By contrast, logistic regression is interpretable but less accurate, whereas CatBoost is accurate but qualitatively less interpretable because of complex, non-smooth decision boundaries.

Conclusion

TabPFN represents a major change in tabular data modelling, leveraging ICL to autonomously discover a highly efficient algorithm that outperforms traditional human-designed approaches on datasets with up to 10,000 samples and 500 features. This shift towards foundation models trained on synthetic data opens up new possibilities for tabular data analysis across various domains.

Potential future directions include scaling to larger datasets48, handling data drift49, investigating fine-tuning abilities across related tabular tasks50 and understanding the theoretical foundations of our approach51. Future work could also explore creating specialized priors to handle data types such as time series52 and multi-modal data, or specialized modalities such as ECG, neuroimaging data53 and genetic data. As the field of tabular data modelling continues to evolve, we believe that foundation models, such as TabPFN, will play a key part in empowering researchers. To facilitate the widespread use of TabPFN, in the section 'User guide' we discuss how to use it effectively.

Online content

Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-024-08328-6.

1. Borisov, V. et al. Deep neural networks and tabular data: a survey. IEEE Trans. Neural Netw. Learn. Syst. 35, 7499--7519 (2024).
2. van Breugel, B. & van der Schaar, M. Position: why tabular foundation models should be a research priority. In Proc. 41st International Conference on Machine Learning 48976--48993 (PMLR, 2024).
3. Silver, D. et al. Mastering the game of go with deep neural networks and tree search. Nature 529, 484--489 (2016).
4. Jumper, J. M. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583--589 (2021).
5. OpenAI. GPT-4 Technical Report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
6. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 1189--1232 (2001).
7. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785--794 (ACM Press, 2016).
8. Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. In Proc. 30th International Conference on Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 3149--3157 (Curran Associates, 2017).
9. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. & Gulin, A. CatBoost: unbiased boosting with categorical features. In Proc. 30th International Conference on Advances in Neural Information Processing Systems (eds Bengio, S. et al.) 6639--6649 (Curran Associates, 2018).
10. Lowe, D. G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91--110 (2004).
11. Dalal, N. & Triggs, B. Histograms of oriented gradients for human detection. In Proc. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) 886--893 (IEEE, 2005).
12. Vaswani, A. et al. Attention is all you need. In Proc. 30th International Conference on Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 6000--6010 (Curran Associates, 2017).
13. Silver, D. et al. Mastering the game of go without human knowledge. Nature 550, 354--359 (2017).
14. Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? In Proc. 36th International Conference on Neural Information Processing Systems Vol. 35, 507--520 (ACM, 2022).
15. McElfresh, D. et al. When do neural nets outperform boosted trees on tabular data? In Proc. 37th International Conference on Neural Information Processing Systems Vol. 36, 76336--76369 (ACM, 2024).
16. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
17. Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems (eds Larochelle, H. et al.) Vol. 33, 1877--1901 (Curran Associates, 2020).
18. Garg, S., Tsipras, D., Liang, P. S. & Valiant, G. What can transformers learn in-context? A case study of simple function classes. In Proc. Advances in Neural Information Processing Systems Vol. 35, 30583--30598 (ACM, 2022).


19. Akyürek, E., Schuurmans, D., Andreas, J., Ma, T. & Zhou, D. What learning algorithm is in-context learning? Investigations with linear models. In Proc. The Eleventh International Conference on Learning Representations (ICLR, 2023).
20. Von Oswald, J. et al. Transformers learn in-context by gradient descent. In Proc. 40th International Conference on Machine Learning 35151--35174 (PMLR, 2023).
21. Zhou, H. et al. What algorithms can transformers learn? A study in length generalization. In Proc. The Twelfth International Conference on Learning Representations (ICLR, 2024).
22. Müller, S., Hollmann, N., Pineda-Arango, S., Grabocka, J. & Hutter, F. Transformers can do Bayesian inference. In Proc. The Tenth International Conference on Learning Representations (ICLR, 2022).
23. Hollmann, N., Müller, S., Eggensperger, K. & Hutter, F. TabPFN: a transformer that solves small tabular classification problems in a second. In Proc. The Eleventh International Conference on Learning Representations (ICLR, 2023).
24. Kingma, D. & Ba, J. Adam: a method for stochastic optimization. In Proc. International Conference on Learning Representations (ICLR, 2015).
25. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. 3rd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2015).
26. Gorishniy, Y., Rubachev, I., Khrulkov, V. & Babenko, A. Revisiting deep learning models for tabular data. In Proc. Advances in Neural Information Processing Systems 34 (eds Ranzato, M. et al.) 18932--18943 (NeurIPS, 2021).
27. Zhu, B. et al. XTab: cross-table pretraining for tabular transformers. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 43181--43204 (PMLR, 2023).
28. Lorch, L., Sussex, S., Rothfuss, J., Krause, A. & Schölkopf, B. Amortized inference for causal structure learning. In Proc. Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35, 13104--13118 (ACM, 2022).
29. Dao, T., Fu, D., Ermon, S., Rudra, A. & Ré, C. FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Proc. Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35, 16344--16359 (2022).
30. Torgo, L. & Gama, J. Regression using classification algorithms. Intell. Data Anal. 1, 275--292 (1997).
31. Pearl, J. Causality 2nd edn (Cambridge Univ. Press, 2009).
32. Jiang, M. et al. Investigating data contamination for pre-training language models. Preprint at https://arxiv.org/abs/2401.06059 (2024).
33. Kumaraswamy, P. A generalized probability density function for double-bounded random processes. J. Hydrol. 46, 79--88 (1980).
34. Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Report No. 1196-0-8 (Cornell Aeronautical Lab, 1961).
35. Young, T. I. The Bakerian lecture. Experiments and calculations relative to physical optics. Philos. Trans. R. Soc. Lond. 94, 1--16 (1804).
36. Gijsbers, P. et al. AMLB: an AutoML benchmark. J. Mach. Learn. Res. 25, 1--65 (2024).
37. Fischer, S. F., Feurer, M. & Bischl, B. OpenML-CTR23 -- a curated tabular regression benchmarking suite. In Proc. AutoML Conference 2023 (Workshop) (AutoML, 2023).
38. Breiman, L. Random forests. Mach. Learn. 45, 5--32 (2001).
39. Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273--297 (1995).
40. Erickson, N. et al. AutoGluon-Tabular: robust and accurate AutoML for structured data. Preprint at https://arxiv.org/abs/2003.06505 (2020).
41. Wolpert, D. Stacked generalization. Neural Netw. 5, 241--259 (1992).
42. Caruana, R., Niculescu-Mizil, A., Crew, G. & Ksikes, A. Ensemble selection from libraries of models. In Proc. 21st International Conference on Machine Learning (ed. Greiner, R.) (Omnipress, 2004).
43. Purucker, L. O. et al. Q(D)O-ES: population-based quality (diversity) optimisation for post hoc ensemble selection in AutoML. In Proc. International Conference on Automated Machine Learning Vol. 224 (PMLR, 2023).
44. Hofmann, H. Statlog (German Credit Data). UCI Machine Learning Repository https://doi.org/10.24432/C5NC77 (1994).
45. Duin, R. Multiple Features. UCI Machine Learning Repository https://doi.org/10.24432/C5HC70 (1998).
46. Rajotte, J.-F. et al. Synthetic data as an enabler for machine learning applications in medicine. iScience 25, 105331 (2022).
47. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. Advances in Neural Information Processing Systems (eds Guyon, I. et al.) Vol. 30, 4765--4774 (Curran Associates, 2017).
48. Feuer, B. et al. TuneTables: context optimization for scalable prior-data fitted networks. In Proc. 38th Conference on Neural Information Processing Systems (NeurIPS, 2024).
49. Helli, K., Schnurr, D., Hollmann, N., Müller, S. & Hutter, F. Drift-resilient TabPFN: in-context learning temporal distribution shifts on tabular data. In Proc. 38th Conference on Neural Information Processing Systems (NeurIPS, 2024).
50. Thomas, V. et al. Retrieval & fine-tuning for in-context tabular models. In Proc. 1st Workshop on In-Context Learning at the 41st International Conference on Machine Learning (ICML, 2024).
51. Nagler, T. Statistical foundations of prior-data fitted networks. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) Vol. 202, 25660--25676 (PMLR, 2023).
52. Dooley, S., Khurana, G. S., Mohapatra, C., Naidu, S. V. & White, C. ForecastPFN: synthetically-trained zero-shot forecasting. In Proc. 37th Conference on Advances in Neural Information Processing Systems (eds Oh, A. et al.) (NeurIPS, 2023).
53. Czolbe, S. & Dalca, A. V. Neuralizer: general neuroimage analysis without re-training. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 6217--6230 (IEEE, 2023).
54. Wilcoxon, F. in Breakthroughs in Statistics: Methodology and Distribution (eds Kotz, S. & Johnson, N. L.) 196--202 (Springer, 1992).

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2025

Methods

User guide
When to use TabPFN. TabPFN excels in handling small- to medium-sized datasets with up to 10,000 samples and 500 features (Fig. 4 and Extended Data Table 1). For larger datasets and highly non-smooth regression datasets, approaches such as CatBoost9, XGB7 or AutoGluon40 are likely to outperform TabPFN.
Although TabPFN provides a powerful drop-in replacement for traditional tabular data models such as CatBoost, similar to these models, it is intended to be only one component in the toolkit of a data scientist. Achieving top performance on real-world problems often requires domain expertise and the ingenuity of data scientists. As for other modelling approaches, data scientists should continue to apply their skills and insights in feature engineering, data cleaning and problem framing to get the most out of TabPFN. We hope that the training speed of TabPFN will facilitate faster iterations in the data science workflow.

Limitations of TabPFN. The limitations of TabPFN are as follows: (1) the inference speed of TabPFN may be slower than highly optimized approaches such as CatBoost; (2) the memory usage of TabPFN scales linearly with dataset size, which can be prohibitive for very large datasets; and (3) our evaluation focused on datasets with up to 10,000 samples and 500 features; scalability to larger datasets requires further study.

Computational and time requirements. TabPFN is computationally efficient and can run on consumer hardware for most datasets. However, training on a new dataset is recommended to run on a (consumer) GPU as this speeds it up by one to three orders of magnitude. Although TabPFN is very fast to train, it is not optimized for real-time inference tasks. For a dataset with 10,000 rows and 10 columns, our model requires 0.2 s (0.6 s without GPU) to perform a prediction for one sample, whereas CatBoost (default) can do the same in 0.0002 s. In ref. 55, further optimizing TabPFN specifically for inference tasks has already been explored, resulting in four times faster inference performance compared with even XGBoost, but so far also reducing predictive quality. Refer to the section 'Details on the neural architecture' for details on the memory usage and runtime complexity of TabPFN.

Data preparation. TabPFN can handle raw data with minimal pre-processing. If we simply provide the data in a tabular format (NumPy matrix), TabPFN will automatically handle missing values, encode categorical variables and normalize features. Although TabPFN works well out of the box, we can further improve the performance using dataset-specific pre-processing. This can also be partly done automatically with our PHE technique or manually by modifying the default settings. When manually pre-processing data, we should keep in mind that the neural network of TabPFN expects roughly normally distributed features and targets after all pre-processing steps. If we, for example, know that a feature follows a log distribution, it might help to exponentiate it before feeding it to TabPFN. As TabPFN does z-normalization of all inputs, scaling does not affect the predictions. As for all algorithms, however, using domain knowledge to combine or remove features can increase performance.

Hyperparameter tuning. TabPFN provides strong performance out of the box without extensive hyperparameter tuning (see section 'Comparison with state-of-the-art baselines'). If we have additional computational resources, we can further optimize the performance of TabPFN using hyperparameter optimization (HPO) or the PHE technique described in the section 'TabPFN (PHE)'. Our implementation directly provides HPO with random search and PHE.

Details on the neural architecture
Our architecture is a variation of the original transformer encoder12 and the original PFN architecture22, but it treats each cell in the table as a separate position, similar to that in ref. 28. Therefore, it can generalize to more training samples as well as features than seen during training.
Figure 1b details our new architecture. All features that go into our architecture are first mapped to floating point values, that is, categoricals are transformed to integers. These values are subjected to z-normalization using the mean and standard deviation for each feature separately across the whole training set. These values are then encoded with simple linear encoders. Each layer first has an attention over features, followed by an attention over samples, both of which operate separately on each column or row, respectively. These two sub-layers are followed by an MLP sublayer. Each sublayer is followed by a residual addition and a half-precision layer norm.
We found that encoding groups of features can be even more effective compared with encoding one value per representation. For our hyperparameter search space, we selected six architectures for classification and five for regression. In three of the six classification models and four of the five regression models, including the TabPFN default, a transformer position encodes two features of one example; in the others, it represents one value.
Although the inter-feature attention is a classical fully connected attention, our inter-sample attention does not allow the test samples to attend to each other but only to the training data. Therefore, we make sure that the test samples do not influence each other or the training set representations. To allow our model to more easily differentiate features that have the same statistics, for example, two features that have the same entries just in different orders, we use random feature embeddings that we add to all embeddings before the first layer. We generate one embedding per feature by projecting a random vector of one-fourth the size of our embeddings through a learned linear layer and add this to all embeddings representing an instance of that feature.
As the representations of training samples are not influenced by the test set, we cache the keys and values of the training samples to allow splitting training and inference. We use a special variant of multi-query attention for our inter-sample attention from test samples56 to save memory when caching representations. In our variant, we use all keys and values for the attention between samples of the training set, but repeatedly use the first key and value for attention from the test samples. This allows caching only one key or value vector pair per cell in the training set that is fed into our inter-sample attention of new test samples.
The compute requirements of this architecture scale quadratically with the number of samples (n) and the number of features (m), that is, O(n² + m²), and the memory requirements scale linearly in the dataset size, O(n ⋅ m).
Finally, we found that pre-processing inputs can help performance; thus we perform z-normalization of all inputs across the sample dimension and add an extra input for each cell that indicates whether the input was missing, with the input itself set to 0 in these cases. All inputs are finally linearly encoded into the embedding dimension of TabPFN.
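To make the alternating attention pattern concrete, the sketch below shows one encoder layer that first attends across the feature axis and then across the sample axis of a (samples × features × embedding) tensor. It is a simplified illustration only, not the released implementation: it uses plain multi-head attention and omits the train/test masking, multi-query key–value caching, grouped-feature encoders and half-precision layer norms described above.

```python
import torch
from torch import nn

class TwoAxisAttentionLayer(nn.Module):
    """Simplified sketch of one TabPFN-style layer: attention over features,
    then attention over samples, then an MLP, each with residual + layer norm."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.feat_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.samp_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_samples, n_features, dim); every table cell has its own embedding.
        h = x
        # Attention over features: each sample (row) attends across its own features.
        h = self.norm1(h + self.feat_attn(h, h, h, need_weights=False)[0])
        # Attention over samples: each feature (column) attends across samples.
        h = h.transpose(0, 1)  # batch dimension becomes the feature axis
        h = self.norm2(h + self.samp_attn(h, h, h, need_weights=False)[0])
        h = h.transpose(0, 1)  # back to (n_samples, n_features, dim)
        # Per-cell MLP with residual connection.
        return self.norm3(h + self.mlp(h))

# Toy usage: 8 samples, 5 features, 64-dimensional cell embeddings.
layer = TwoAxisAttentionLayer(dim=64, heads=4)
print(layer(torch.randn(8, 5, 64)).shape)  # torch.Size([8, 5, 64])
```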
Details on the causal generative process
An SCM G ≔ (Z, ϵ) consists of a collection Z ≔ (z_1, ..., z_k) of structural assignments (called mechanisms): z_i = f_i(z_PA_G(i), ϵ_i), where PA_G(i) is the set of parents of node i (its direct causes) in the underlying directed acyclic graph (DAG) G (the causal graph), f_i is a (potentially nonlinear) deterministic function and ϵ_i is a noise variable. Causal relationships in G are represented by edges pointing from causes to effects31. As our prior is a sampling procedure, we can make a lot of choices on, for example, the graph size or complexity. By defining a probability distribution over these hyperparameters in the prior, the posterior predictive distribution approximated by TabPFN at inference time implicitly represents a Bayesian ensemble, jointly integrating over a weighted hyperparameter space. The specific hyperparameter ranges and sampling strategies are chosen to cover a diverse set of scenarios that we expect to encounter in real-world tabular data.

Graph structure sampling. The structural causal models underlying each dataset are based on a DAG G. We sample these graphs using the growing network with redirection sampling method57, a preferential attachment process that generates random scale-free networks. We either sample a single connected component or merge multiple disjoint subgraphs. Disjoint subgraphs lead to features that are marginally independent of the target if they are not connected to the target node, reflecting real-world scenarios with uninformative predictors.
To control the complexity of the sampled DAGs, we use two hyperparameters: the number of nodes N and the redirection probability P. N is sampled from a log-uniform distribution, log N ~ U(a, b), where a and b are hyperparameters controlling the range of the graph size. The redirection probability P is sampled from a gamma distribution, P ~ Γ(α, β), where α and β are shape and rate parameters, respectively. Larger values of N yield graphs with more nodes, whereas smaller values of P lead to denser graphs with more edges on average57.
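As a rough illustration of this graph prior, the sketch below grows a DAG with the growing-network-with-redirection rule: each new node attaches to a uniformly chosen earlier node, but with probability P the edge is redirected to that node's own parent. The concrete hyperparameter draws at the bottom are illustrative assumptions, not the exact settings used for TabPFN.

```python
import numpy as np

def sample_grn_dag(n_nodes: int, p_redirect: float, rng: np.random.Generator):
    """Growing network with redirection: returns a list of (parent, child) edges.
    Nodes are added in order, so every edge points from a smaller index (cause)
    to a larger index (effect), which guarantees acyclicity."""
    parent_of = {0: None}                      # attachment target of each node
    edges = []
    for child in range(1, n_nodes):
        target = int(rng.integers(0, child))   # uniformly random earlier node
        if parent_of[target] is not None and rng.random() < p_redirect:
            target = parent_of[target]         # redirect to the target's parent
        parent_of[child] = target
        edges.append((target, child))
    return edges

rng = np.random.default_rng(0)
# Illustrative hyperparameter draws: log-uniform graph size, gamma-distributed P.
n_nodes = int(np.exp(rng.uniform(np.log(5), np.log(50))))
p_redirect = min(1.0, rng.gamma(shape=2.0, scale=0.2))
print(sample_grn_dag(n_nodes, p_redirect, rng))
```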
Computational edge mappings. In our implementation, each SCM node and sample is represented as a vector in R^d. When propagating data through the SCM, the deterministic functions f_i at each edge map the input vectors to an output vector using four types of computational modules:
1. Small neural networks: here we initialize weight matrices W ∈ R^(d×d) using Xavier initialization58 and apply a linear transformation Wx + b to the input vectors x ∈ R^d, where b ∈ R^d is a bias vector. After the linear projection, we apply element-wise nonlinear activation functions σ: R^d → R^d, randomly sampled from a set including identity, logarithm, sigmoid, absolute value, sine, hyperbolic tangent, rank operation, squaring, power functions, smooth ReLU59, step function and modulo operation.
2. Categorical feature discretization: to generate categorical features from the numerical vectors at each node, we map the vector to the index of the nearest neighbour in a set of per-node randomly sampled vectors {p_1, ..., p_K} for a feature with K categories. This discrete index will be observed in the feature set as a categorical feature. We sample the number of categories K from a rounded gamma distribution with an offset of 2 to yield a minimum number of classes of 2. To further use these discrete class assignments in the computational graph, they need to be embedded as continuous values. We sample a second set of embedding vectors {p_1′, ..., p_K′} for each class and transform the classes to these embeddings.
3. Decision trees: to incorporate structured, rule-based dependencies, we implement decision trees in the SCMs. At certain edges, we select a subset of features and apply decision boundaries on their values to determine the output60. The decision tree parameters (feature splits, thresholds) are randomly sampled per edge.
4. Noise injection: at each edge, we add random normal noise from the normal distribution N(0, σ²I).

Initialization data sampling. For each to-be-generated sample, we randomly generate initialization data ϵ that is inserted at the DAG root nodes and then propagated through the computational graph. The noise variables ϵ are generated according to one of three sampling mechanisms:
1. Normal: ϵ ~ N(0, σ_ϵ²), where σ_ϵ² is a hyperparameter.
2. Uniform: ϵ ~ U(−a, a), where a is a hyperparameter.
3. Mixed: for each root node, we randomly select either a normal or uniform distribution to sample the initialization noise ϵ from.

Furthermore, we sample input data with varying degrees of non-independence for some datasets. Here we first sample a random fraction ρ of samples to serve as prototypes x_1*, ..., x_M*, where M = ρn and n is the dataset size. Then, for each input vector x_i to be sampled, we assign weights α_ij to the prototypes and linearly mix the final input as

x_i = Σ_{j=1}^{M} α_ij x_j*,    (1)

where Σ_j α_ij = 1. The weights α_ij are sampled from a multinomial distribution, α_i ~ Multinomial(β), where β is a temperature hyperparameter controlling the degree of non-independence: larger β yields more uniform weights, whereas smaller β concentrates the weights on fewer prototypes per sample.

Post-processing. Each dataset is post-processed randomly with one or more of the following post-processings: (1) For some datasets, we use the Kumaraswamy feature warping, introducing nonlinear distortions33 to features as done in ref. 61. (2) We quantize some continuous features into buckets of randomly sampled cardinality K, mimicking binned or discretized features commonly encountered in datasets. We map a feature value x to the index of the bucket it falls into, determined by K + 1 bin edges sampled from the set of values this feature takes. (3) To introduce scenarios for dynamic imputation and handling of incomplete datasets, a common challenge in data science, we randomly designate a fraction ρ_miss of the data as missing according to the missing-completely-at-random strategy. Each value is masked as missing with probability ρ_miss, independently of the data values.

Target generation. To generate target labels for regression tasks, we select a randomly chosen continuous feature without post-processing. For classification labels, we select a random categorical feature that contains up to 10 classes. Thus, natively our method is limited to predicting at most 10 classes. This number can be increased by pre-training on datasets with a larger number of classes or by using approaches such as building a one-vs-one classifier, a one-vs-rest classifier or building on approaches such as error-correcting output codes (ECOC)62.

Training details
The training loss of any PFN is the cross-entropy between the targets of held-out samples of synthetic datasets and the model prediction. For a test set (X_test, y_test) = D_test, the training loss is given by L_PFN = E_{((X_test, y_test) ∪ D_train) ~ p(D)}[−log q_θ(y_test ∣ X_test, D_train)]. By minimizing this loss, the PFN learns to approximate the true Bayesian posterior predictive distribution for a chosen prior over datasets (and potentially their latent variables) D, as shown in ref. 22.
We trained our final models for approximately 2,000,000 steps with a batch size of 64 datasets. That means the models used for TabPFN are trained on around 130,000,000 synthetically generated datasets each. One training run requires around 2 weeks on one node with eight Nvidia RTX 2080 Ti GPUs. We sample the number of training samples for each dataset uniformly up to 2,048 and use a fixed validation set size of 128. We sample the number of features using a beta distribution (k = 0.95, b = 8.0) that we linearly scale to the range 1--160. To avoid peaks in memory usage, the total size of each table was restricted to be below 75,000 cells by decreasing the number of samples for large numbers of features.
We chose the hyperparameters for the prior based on random searches, in which we use only a single GPU per training and evaluate on our development set; see section 'Quantitative analysis'. We used the Adam optimizer24 with linear warmup and cosine annealing63 and tested a set of learning rates in [0.0001, 0.0005], using the one with the lowest final training loss.

Inference details
To get the most performance out of TabPFN, it is crucial to optimize its inference pipeline. We generally always apply TabPFN in a small ensemble, in which we perform pre-processing or post-processing of the data differently for each ensemble member.
As our models are not fully permutation invariant, for each ensemble member, we shuffle the feature order, approximating order invariance64. For classification tasks, we additionally randomly permute the labels. We also apply a temperature to the softmax distribution of our model outputs for calibration.
Apart from the above, we use a subset of the following for each of our default ensemble members:
1. Quantile + Id: we quantize the inputs to equally spaced values between 0 and 1, but keep a copy of each original feature. This effectively doubles the number of features passed to TabPFN.
2. Category shuffling: the labels of categorical features with low cardinality are shuffled.
3. SVD: an SVD compression of the features is appended to the features.
4. Outlier removal: all outliers, more than 12 standard deviations from the mean, are removed.
5. Power transform: each feature (or the label for regression) is transformed using a Yeo--Johnson transformation to stabilize the variance and make the data more normally distributed.
6. One-hot encoding: categorical features are encoded using one-hot encoding, in which each category is represented as a binary vector.
For PHE and hyperparameter tuning of TabPFN, we use a larger set of pre-processing techniques that additionally include a logarithmic, an exponential and a KDI transformation65. These transformations help address nonlinear relationships, skewed distributions and varying scales among features.
To calibrate prediction uncertainty, we apply a softmax temperature (default T = 0.9) by dividing logits before the softmax calculation:

P(y_i ∣ x) = exp(z_i / T) / Σ_j exp(z_j / T),    (2)

where z_i are the logits, T is the temperature and P(y_i ∣ x) is the calibrated probability. We offer the option to generate second-order polynomial features by multiplying up to 50 randomly selected feature pairs:

f_ij = x_i ⋅ x_j, for (i, j) ∈ S,    (3)

where S is the set of randomly chosen feature pairs. This can capture nonlinear interactions between features. This option is disabled by default. To ensure proper handling of duplicate samples given the sample permutation invariance of our architecture, we add a unique sample identifier feature. This is a random number drawn from a standard normal distribution, ensuring each sample is treated distinctly in the attention mechanism. We also provide an option for subsampling in each estimator, to increase ensemble diversity, which performs random sampling without replacement. This option is disabled by default.
To ensure proper handling of duplicate samples given the sample permutation invariance of our architecture, we add a unique sample identifier feature. This is a random number drawn from a standard normal distribution, ensuring each sample is treated distinctly in the attention mechanism. We also provide an option for subsampling in each estimator, to increase ensemble diversity, which performs random sampling without replacement. This option is disabled by default.

Regression details. To enable our model to do regression on a large range of scales and target distributions, we use the following approach. During pre-training, we rescale our regression targets to have zero mean and a standard deviation of 1 (z-score). To decide where the borders between our buckets lie, we draw a large sample of datasets from our prior and choose the 1/5,000 quantiles from this distribution. At inference time, we bring the real-world data to a similar range by again applying z-score normalization. Furthermore, we allow applying a range of transforms, including a power transform as part of our default. All of the transforms, including the z-score, are inverted at prediction time by applying the inverse of the transform to the borders between buckets. This is equivalent to applying the inverse of the transform to the random variable represented by our output distribution, except for the half-normals used on the sides for full support22. This is because all transforms are strictly monotone and the borders represent positions on the cumulative distribution function.

Data grouping based on random forest. To perform well on very heterogeneous datasets, we also propose to use random trees to split the training data into smaller, more homogeneous datasets. This technique is used only when performing HPO or PHE for TabPFN. It is especially useful for TabPFN as our model performs best on small datasets. The pre-processing for a single ensemble member, that is, a single tree, works as follows: we use a standard random tree with feature and sample bootstrapping and Gini impurity loss. For each leaf node of the decision tree, we store the subset of training samples that fall into that node and train a TabPFN on these. To predict the class label for a test sample x, we determine the TabPFN to use by passing x through the decision tree. We set the minimal leaf size to be large (500–2,000) such that the resulting data groups are large enough to train a strong model.
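The grouping scheme can be sketched as follows. This is a minimal illustration, assuming a TabPFN-style estimator with a scikit-learn fit/predict interface; the class name `LeafwiseTabPFN`, the factory argument `make_tabpfn` and the exact tree settings are ours, and the feature and sample bootstrapping mentioned above is omitted for brevity.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class LeafwiseTabPFN:
    """Sketch of the random-tree data grouping: one TabPFN per (large) leaf of a decision tree."""

    def __init__(self, make_tabpfn, min_leaf=1000, seed=0):
        self.make_tabpfn = make_tabpfn            # factory returning a fresh TabPFN-style estimator
        self.tree = DecisionTreeClassifier(
            min_samples_leaf=min_leaf,            # large leaves (500-2,000) keep the data groups big
            splitter="random", max_features="sqrt",
            random_state=seed)
        self.leaf_models = {}

    def fit(self, X, y):
        self.tree.fit(X, y)
        leaves = self.tree.apply(X)               # leaf id for every training sample
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            model = self.make_tabpfn()
            model.fit(X[mask], y[mask])           # train one TabPFN on this homogeneous group
            self.leaf_models[leaf] = model
        return self

    def predict(self, X):
        leaves = self.tree.apply(X)               # route each test sample to its leaf's TabPFN
        preds = np.empty(len(X), dtype=object)
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            preds[mask] = self.leaf_models[leaf].predict(X[mask])
        return preds
```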
TabPFN (PHE)
To further enhance the inference performance of TabPFN, in TabPFN (PHE), we use PHE for a fixed portfolio of TabPFN configurations from our search space detailed in Extended Data Table 5. For TabPFN (PHE), we first use holdout validation to sequentially evaluate models from the portfolio until a time limit is reached. After all models are evaluated once, we repeat holdout validation with new data splits until the time limit is reached. Then, we ensemble all evaluated TabPFN models by aggregating their predictions with a weighted arithmetic mean. We learn the weights using greedy ensemble selection (GES)42,66 with 25 iterations on prediction data from holdout validation. Finally, we prune each zero-weighted model, refit all remaining models on all data and return the weighted average of their predictions.

Following standard practice in AutoML, we use GES because its predictive performance is often superior to the best individual model43,67–69. Owing to its ICL, we expect TabPFN to overfit the training data less than traditionally trained algorithms; thus, we opt for (repeated) holdout validation (as in Auto-Sklearn 1; ref. 67) instead of (repeated) cross-validation (as in AutoGluon40). Moreover, as GES usually produces sparse weight vectors43,69, we expect the final ensemble after pruning each zero-weighted model to consist of a smaller number of models than for other ensembling approaches, such as bagging. Consequently, PHE can also improve the inference efficiency of a TabPFN ensemble compared with other ensembling approaches.

Foundation model abilities
Density estimation. The combination of a regression and a classification TabPFN can be used as a generative model for tabular data, not only modelling targets but features as well. Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ denote the original dataset, where $x_i \in \mathbb{R}^d$ is a d-dimensional feature vector and $y_i$ is the corresponding target value, and let $q_\theta$ represent our trained TabPFN model, either a regression or classification model depending on the target type. We aim to approximate the joint distribution of a new example and its label $p(x, y \mid D)$. To do this, we factorize the joint distribution as

$$p(x, y \mid D) = \prod_{j=1}^{d} p(x_j \mid x_{<j}, D) \cdot p(y \mid x, D) \qquad (4)$$

$$\approx \prod_{j=1}^{d} q_\theta(x_j \mid x_{<j}, D_{:,<j}) \cdot q_\theta(y \mid x, D), \qquad (5)$$

where we only condition on a subset of the features in the training set ($D_{:,<j}$). The feature order of the joint density factorization influences the estimated densities. To reduce variance from this source, we apply a permutation sampling approximation of Janossy pooling at inference time, in which we average the outputs of $N_j$ feature permutations, with $N_j = 24$ in our experiments64. As we cannot condition on an empty feature set for technical reasons, we condition the prediction of the first feature $x_1$ on a feature with random noise, that is, no information.

The above factorization of the density of a sample (equation (5)) is completely tractable and we thus use it to estimate the likelihood of data points. This enables tasks such as anomaly detection and outlier identification.

Synthetic data generation. We can leverage the generative abilities of TabPFN (see section 'Density estimation') to synthesize new tabular data samples that mimic the characteristics of a given real-world dataset, by simply following the factorization in equation (5) and sampling each feature step by step. The generated synthetic samples (x*, y*) can be used for various purposes, such as data augmentation, privacy-preserving data sharing and scenario simulation.
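To make the factorization in equation (5) concrete as a sampling procedure, the sketch below generates synthetic rows feature by feature and then the label. It is illustrative only: `make_feature_model` and `make_label_model` stand for factories returning TabPFN regressors or classifiers as appropriate, and the `sample` method is a placeholder for drawing from the predicted conditional distribution, not an API name from the released package. The noise column used for the first feature follows the description above.

```python
import numpy as np

def sample_synthetic_rows(X, y, make_feature_model, make_label_model, n_new=100, seed=0):
    """Sketch of equation (5): generate synthetic rows one feature at a time, then the label."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # The first feature is conditioned on a pure-noise column, i.e. no information.
    noise_train = rng.normal(size=(n, 1))
    noise_new = rng.normal(size=(n_new, 1))

    X_new = np.zeros((n_new, d))
    for j in range(d):
        context = noise_train if j == 0 else X[:, :j]   # D_{:,<j}: the first j training columns
        query = noise_new if j == 0 else X_new[:, :j]   # x_{<j}: features generated so far
        model = make_feature_model()
        model.fit(context, X[:, j])                     # approximates q_theta(x_j | x_{<j}, D_{:,<j})
        X_new[:, j] = model.sample(query)               # draw the j-th synthetic feature
    label_model = make_label_model()
    label_model.fit(X, y)                               # approximates q_theta(y | x, D)
    y_new = label_model.sample(X_new)
    return X_new, y_new
```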
Embeddings. TabPFN can be used to retrieve meaningful feature representations or embeddings. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, the goal is to learn a mapping $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$ that transforms the original d-dimensional feature vectors $x_i$ into an embedding space of dimension k. The resulting embeddings $f_\theta(x_i) \in \mathbb{R}^k$ capture the learned relationships between features and can be used for downstream tasks. To use TabPFN for this problem, we simply use the target-column representations of its final layer as embeddings.

Detailed evaluation protocol
To rigorously assess the performance and robustness of TabPFN, we conduct a comprehensive quantitative evaluation on standard tabular dataset benchmarks, comparing against state-of-the-art baselines under a standardized protocol.

Default configuration of TabPFN. Unlike traditional algorithms, in-context-learned algorithms do not have hyperparameters that directly control their training procedure. Instead, hyperparameters for inference of TabPFN only control the pre-processing of data and post-processing of predictions (for example, feature scaling or softmax temperature). Our default configuration (TabPFN (default)) for both classification and regression is optimized for accurate predictions with minimal fitting time. Here, we apply the same model multiple times with different pre- and post-processors and take the average over the predictions, yielding a four-way (eight-way for regression) ensemble. The settings for our data processing were obtained through a hyperparameter search optimized on our development datasets. The exact settings chosen are listed in Extended Data Table 5. We emphasize that, as for other foundation models (such as GPT), we trained our TabPFN model once and used the same model to perform ICL in a forward pass on all new datasets.

Baselines. We compare with tree-based methods, such as random forests38, XGBoost7, CatBoost9 and LightGBM8, the state of the art for experts to perform predictions on tabular data14,15. We also compare with simpler methods, such as ridge regression70, logistic regression and SVMs39. Although standard neural networks, which unlike TabPFN do not use ICL, were shown to underperform for small (<10,000 samples) tabular data1,14,71, as a point of reference, we still consider a simple neural network, the MLP.

Tabular dataset benchmarks. We perform our analysis on two widely used and publicly available benchmark suites: the standard AutoML benchmark36 and the recent regression benchmark OpenML-CTR23 (ref. 37). Both benchmarks comprise a diverse set of real-world tabular datasets, carefully curated to be representative of various domains and data characteristics. The authors of the benchmark suites selected these datasets based on criteria such as sufficient complexity, real-world relevance, absence of free-form text features and diversity of problem domains.

For our quantitative analysis of TabPFN for classification tasks, we use a set of test datasets comprising all 29 datasets from the AutoML benchmark with up to 10,000 samples, 500 features and 10 classes. For regression tasks, the AutoML benchmark contains only 16 datasets matching these constraints. To increase statistical power, we augmented this set with all datasets matching our constraints from the recent OpenML-CTR23 benchmark, yielding a test set of 28 unique regression datasets in total. Extended Data Tables 3 and 4 provide full details for our test sets of classification and regression datasets, respectively.

We further evaluated additional benchmark suites from refs. 14,15. In ref. 14, there are 22 tabular classification datasets selected based on criteria such as heterogeneous columns, moderate dimensionality and sufficient difficulty. In ref. 15, there is a collection of 176 classification datasets, representing one of the largest tabular data benchmarks. However, the curation process for these datasets may not be as rigorous or quality controlled as for the AutoML Benchmark and OpenML-CTR23. We also evaluated five Kaggle competitions with fewer than 10,000 training samples from the latest completed Tabular Playground Series.

Development datasets. To decide on the hyperparameters of TabPFN, as well as our hyperparameter search spaces, we considered another set of datasets, our development datasets. We carefully selected these datasets to be non-overlapping with our test datasets described above. The list of development datasets can be found in Supplementary Tables 5 and 6. We considered the mean of normalized scores (ROC/RMSE) and rank quantiles and chose the best model configurations on these development datasets.

Metrics and cross-validation. To obtain scores for classification tasks, we use two widely adopted evaluation metrics: ROC AUC (one-vs-rest) and accuracy. ROC AUC averages performance over different sensitivity–specificity trade-offs, and accuracy measures the fraction of samples labelled correctly. For regression tasks, we use R² and negative RMSE as evaluation metrics. R² represents the proportion of variance in the target column that the model can predict. RMSE is the root of the average squared magnitude of the errors between the predicted and actual values. As we use negative RMSE, higher values indicate a better fit for all four of our metrics.

To increase statistical validity, for each dataset and method in our test datasets, we evaluated 10 repetitions, each with a different random seed and train–test split (90% train and 10% test samples; all methods used the same cross-validation splits, defined by OpenML72). We average the scores of all repetitions per dataset. Then, to average scores across datasets, we normalize per dataset following previous benchmarks36,40. The absolute scores are linearly scaled such that a score of 1.0 corresponds to the highest value achieved by any method on that dataset, whereas a score of 0 represents the lowest result. This normalization allows for building meaningful averages across datasets with very different score ranges. We provide absolute performance numbers in Supplementary Data Tables 1–2. All confidence intervals shown are 95% confidence intervals.

We tuned all methods with a random search using five-fold cross-validation with ROC AUC/RMSE up to a given time budget, ranging from half a minute to 4 h. The first candidate in the random search was the default setting supplied in the implementation of the method and was also used if not a single cross-validation run finished before the time budget was consumed. See the section 'Qualitative analysis' for the search spaces used per method. All methods were evaluated using 8 CPU cores. Moreover, TabPFN makes use of a 5-year-old consumer-grade GPU (RTX 2080 Ti). We also tested GPU acceleration for the baselines. However, as Extended Data Fig. 2 shows, this did not improve performance, probably because of the small dataset sizes.
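The per-dataset score normalization described above can be restated compactly. This is our own sketch of the procedure (average over repetitions, then min-max scale per dataset so that the best method maps to 1.0 and the worst to 0.0), not code from the benchmark tooling.

```python
import numpy as np

def normalize_scores(scores):
    """scores: dict dataset -> dict method -> list of per-repetition scores (higher is better).

    Returns dict method -> mean normalized score across datasets.
    """
    per_dataset = {}
    for dataset, by_method in scores.items():
        # average the repetitions for each method on this dataset
        means = {m: float(np.mean(v)) for m, v in by_method.items()}
        lo, hi = min(means.values()), max(means.values())
        span = (hi - lo) or 1.0                     # guard against all methods scoring the same
        per_dataset[dataset] = {m: (s - lo) / span for m, s in means.items()}
    methods = next(iter(per_dataset.values())).keys()
    # average the normalized scores over datasets for each method
    return {m: float(np.mean([per_dataset[d][m] for d in per_dataset])) for m in methods}
```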
55. Müller, A., Curino, C. & Ramakrishnan, R. MotherNet: a foundational hypernetwork for tabular classification. Preprint at https://arxiv.org/abs/2312.08598 (2023).
56. Shazeer, N. Fast transformer decoding: one write-head is all you need. Preprint at https://arxiv.org/abs/1911.02150 (2019).
57. Krapivsky, P. L. & Redner, S. Organization of growing random networks. Phys. Rev. E 63, 066123 (2001).
58. Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. 13th International Conference on Artificial Intelligence and Statistics 249–256 (JMLR, 2010).
59. Nair, V. & Hinton, G. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning (eds Fürnkranz, J. & Joachims, T.) 807–814 (Omnipress, 2010).
60. Quinlan, J. R. Induction of decision trees. Mach. Learn. 1, 81–106 (1986).
61. Müller, S., Feurer, M., Hollmann, N. & Hutter, F. PFNs4BO: in-context learning for Bayesian optimization. In Proc. 40th International Conference on Machine Learning 25444–25470 (PMLR, 2023).
62. Dietterich, T. G. & Bakiri, G. Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1994).
63. Loshchilov, I. & Hutter, F. SGDR: stochastic gradient descent with warm restarts. In Proc. 5th International Conference on Learning Representations (ICLR, 2017).
64. Murphy, R. L., Srinivasan, B., Rao, V. A. & Ribeiro, B. Janossy pooling: learning deep permutation-invariant functions for variable-size inputs. In Proc. 7th International Conference on Learning Representations (ICLR, 2019).
65. McCarter, C. The kernel density integral transformation. Transact. Mach. Learn. Res. https://openreview.net/pdf?id=6OEcDKZj5j (2023).
66. Caruana, R., Munson, A. & Niculescu-Mizil, A. Getting the most out of ensemble selection. In Proc. 6th IEEE International Conference on Data Mining (eds Clifton, C. et al.) 828–833 (IEEE, 2006).
67. Feurer, M. et al. in Automated Machine Learning: Methods, Systems, Challenges (eds Hutter, F. et al.) Ch. 6 (Springer, 2019).
68. Purucker, L. & Beel, J. Assembled-OpenML: creating efficient benchmarks for ensembles in AutoML with OpenML. In Proc. First International Conference on Automated Machine Learning (AutoML, 2022).
69. Purucker, L. & Beel, J. CMA-ES for post hoc ensembling in AutoML: a great success and salvageable failure. In Proc. International Conference on Automated Machine Learning Vol. 224, 1–23 (PMLR, 2023).
70. Hoerl, A. E. & Kennard, R. W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970).
71. Shwartz-Ziv, R. & Armon, A. Tabular data: deep learning is not all you need. Inf. Fusion 81, 84–90 (2022).
72. Vanschoren, J., van Rijn, J. N., Bischl, B. & Torgo, L. OpenML: networked science in machine learning. SIGKDD Explor. 15, 49–60 (2014).
73. Fix, E. & Hodges, J. L. Discriminatory analysis. Nonparametric discrimination: consistency properties. Int. Stat. Rev. 57, 238–247 (1989).

Data availability
All datasets evaluated are publicly available on openml.org or kaggle.com. We have provided scripts in our code repository that automate the process of downloading and evaluating the datasets. These scripts contain dataset identifiers, as well as the exact data splitting and processing procedures.

Code availability
Our code is available at https://priorlabs.ai/tabpfn-nature/ (https://doi.org/10.5281/zenodo.13981285). We also provide an API that allows users to run TabPFN with minimal coding experience or without access to specific computing hardware such as a GPU. The code is designed to be modular and easily installable in a standard Python environment. The code to generate synthetic pre-training data has not been released with our models. We aim to enable researchers and practitioners to easily integrate TabPFN into their workflows and apply it to their specific tabular data tasks. We encourage users to provide feedback, report issues and contribute to the further development of TabPFN. This open release aims to facilitate collaboration and accelerate the adoption and advancement of TabPFN in various research and application domains.
Acknowledgements We express our gratitude to the following individuals for their valuable contributions and support. We thank E. Bergman for his assistance with the evaluation of TabPFN, for helping implement the random forest pre-processing, and for his efforts in improving the code quality and documentation. His contributions were instrumental in benchmarking TabPFN and ensuring the reproducibility of our results. We thank A. Gupta and D. Otte for their work on the Inference Server, which enables the fast deployment of TabPFN without the need for a local GPU. Their efforts have greatly enhanced the accessibility and usability of TabPFN. We thank L. Schweizer for his work on exploring the random forest pre-processing for TabPFN further. We thank D. Schnurr and K. Helli for their work on visualization, and D. Schnurr for his specific contributions related to handling missing values. We thank S. M. Lundberg for the collection of visualization methods for feature attribution that we adapted for our work. We thank A. Müller for the insightful discussions related to TabPFN training and for his guidance on identifying and mitigating biases in the prior. His expertise has been invaluable in refining the TabPFN methodology. We are very grateful to C. Langenberg and M. Pietzner for providing insights on medical applications, interpreting model results and offering general advice. Their continued support has been instrumental in shaping this work. We thank S. Stäglich for his outstanding maintenance and support with the cluster infrastructure. We thank B. Lake for his general paper writing advice. We are grateful for the computational resources that were available for this research. Specifically, we acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no. INST 39/963-1 FUGG (bwForCluster NEMO), and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant no. 417962828. We acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under SFB 1597 (SmallData), grant no. 499552394, and by the European Union (through ERC Consolidator Grant DeepLearning 2.0, grant no. 101045765). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. F.H. acknowledges the financial support of the Hector Foundation.

Author contributions N.H. improved the prior of the model; added regression support, unsupervised capabilities and inference optimizations; and contributed to the experiments and wrote the paper. S.M. improved the neural network architecture, training and efficiency; added inference optimizations; and contributed to experiments and wrote the paper. L.P. improved the inference interface of the model; contributed to hyperparameter tuning; added post hoc ensembling of TabPFN models; contributed to benchmarking; and wrote the paper. A.K. added inference optimizations and Kaggle experiments. M.K. contributed to inference optimizations. S.B.H. contributed to the usability of our code. R.T.S. contributed to preliminary architectural experiments to speed up inference and helped revise the first draft of the paper. F.H. contributed technical advice and ideas, contributed to the random forest pre-processing, managed collaborations and funding, and wrote the paper.
Competing interests The following patent applications invented by S.M. and F.H. and filed by R. Bosch are related to this work: DE202021105192U1 and DE102021210775A1. The authors do not have any ownership rights to these patent applications. F.H. and N.H. are affiliated with PriorLabs, a company focused on developing tabular foundation models. The authors declare no other competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41586-024-08328-6.
Correspondence and requests for materials should be addressed to Noah Hollmann, Samuel Müller or Frank Hutter.
Peer review information Nature thanks Duncan McElfresh, Oleksandr Shchur and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at http://www.nature.com/reprints.

Extended Data Fig. 1 | Performance comparison across additional dataset characteristics, extending Fig. 5. This figure shows the relative performance of different methods when datasets are split based on specific attributes. Error bars represent 95% confidence intervals. While performance differences are generally subtle across these splits, the most notable variation is observed for datasets with outliers in the target variable, though confidence intervals still overlap.

Extended Data Fig. 2 | Performance comparisons of TabPFN and baselines on additional benchmark datasets and with GPU support. (a) Classification performance on the Grinsztajn medium-sized benchmark with categorical features, across 7 datasets. (b) Classification performance on the Grinsztajn medium-sized benchmark with numerical features, across its 15 datasets. (c) Classification performance on the TabZilla benchmark, consisting of 102 datasets with fewer than 10,000 rows of data, 500 features, and 10 classes. Duplicated datasets and those with fewer than 5 samples per class were removed to enable 5-fold cross-validation. (d) Performance over time with CPU vs. GPU hardware: the performance over time when running our strongest baselines with eight CPUs (CPU) vs. eight CPUs and one GPU (+GPU) on our classification test benchmark. AutoGluon automatically decides which models to train with what resources. For CatBoost and XGB, we specified that the models should train with GPU. Intervals represent 95% CI.

Extended Data Fig. 3 | Comparing SHAP (SHapley Additive exPlanations) summary plots between TabPFN and baselines. We compare SHAP feature importance and impact for Logistic Regression, TabPFN, and CatBoost on the "Default of Credit Card Clients" dataset. The top features visualized are credit amount, age, and duration. Each point represents a single instance, with the color indicating the value of the checking status feature (blue for low, red for high), illustrating its interaction with the respective feature on the x-axis. We see that Logistic Regression is most interpretable due to the simple underlying feature functions. However, Logistic Regression has poor predictive accuracy, and the learned functions are unintuitive when looking at the outer bounds of features. TabPFN has good predictive accuracy and learns simple, interpretable functions. CatBoost is the least interpretable, with unclear patterns and wide variation in SHAP values per sample. This figure is adapted from Lundberg et al.47.

Extended Data Fig. 4 | Finetuning TabPFN on 2-dimensional sine curve datasets. (a) Examples of 2D sine curve datasets with different offsets. (b) Finetuning loss curves for 50 runs with random train-test offsets. Colors indicate the offset between train and test. TabPFN shows positive transfer, with better performance for more similar distributions. For a dataset shift of π, the inverse label needs to be predicted in the test set, compared to the finetuning data. However, TabPFN still generalizes when finetuned on this data.

Extended Data Table 1 | Aggregated results on the 29 AMLB classification benchmark datasets

Scores are normalized on all the baselines shown in this table, with the weakest score set to 0.0 and the highest to 1.0, per dataset. All baselines are optimized for ROC AUC, thus trading off representativeness of secondary metrics. Times for TabPFN refer to times on GPU. Datasets are available via OpenML at https://www.openml.org/search?type=data&sort=runs&id={OPENML_ID}. Exact train-test splits are defined by OpenML tasks, with task numbers listed in our code in datasets/benchmark_dids.py.

Extended Data Table 2 | Aggregated results on the 28 AMLB and OpenML-CTR23 regression benchmark datasets

Scores are normalized on all the baselines shown in this table, with the weakest score set to 0.0 and the highest to 1.0, per dataset. K-Nearest Neighbors73 performed significantly worse than the considered baselines. All baselines are optimized for RMSE as an objective, thus trading off representativeness of secondary metrics. Times for TabPFN refer to times on GPU. Datasets are available via OpenML at https://www.openml.org/search?type=data&sort=runs&id={OPENML_ID}. Exact train-test splits are defined by OpenML tasks, with task numbers listed in our code in datasets/benchmark_dids.py.

Extended Data Table 3 | List of test datasets used for primary evaluation of classification tasks

All classification tasks from the AutoML Benchmark36 with fewer than 10,000 samples and 500 features. The benchmark comprises diverse real-world tabular datasets, curated for complexity, relevance, and domain diversity.

Extended Data Table 4 | List of test datasets used for primary evaluation of regression tasks

All regression tasks from the AutoML36 and OpenML-CTR2337 Benchmarks with fewer than 10,000 samples and 500 features. The benchmarks comprise diverse real-world tabular datasets, curated for complexity, relevance, and domain diversity.

Extended Data Table 5 | Hyperparameter defaults and search space for TabPFN and our baselines

(a) TabPFN search space. (b, c) Baseline search spaces.

Extended Data Table 6 | Performance on Kaggle Data Science Challenges

Performance of default CatBoost and default TabPFN on all 5 Kaggle classification or regression competitions from the Tabular Playground Series Season 3 with late submission enabled, fewer than 10,000 rows of data, 500 features, and 10 classes. We report the private score averaged over 5 seeds. For Episode 5, as ordinal regression can be treated as a classification or a regression task, for both CatBoost and TabPFN we tried both the regression and the classification model and chose the better of the two (regression for CatBoost; classification for TabPFN). Arrows indicate the optimization direction for each metric. We emphasize that these results only compare CatBoost and TabPFN on the raw competition data, without any of the tricks the ingenious Kaggle community applies, such as use of domain knowledge, data cleaning, special feature engineering, postprocessing and ensembling. Nevertheless, these techniques can be combined with TabPFN, and we hope that TabPFN's improved base-model performance will allow Kagglers to achieve even better results with them.
