A Comparison of AutoML Tools for Machine Learning, Deep Learning and XGBoost

Abstract: This paper presents a benchmark of supervised Automated Machine Learning (AutoML) tools. Firstly, we analyze the characteristics of eight recent open-source AutoML tools (Auto-Keras, Auto-PyTorch, Auto-Sklearn, AutoGluon, H2O AutoML, rminer, TPOT and TransmogrifAI) and describe twelve popular OpenML datasets that were used in the benchmark (divided into regression, binary and multi-class classification tasks). Then, we perform a comparison study with hundreds of computational experiments based on three scenarios: General Machine Learning (GML), Deep Learning (DL) and XGBoost (XGB). To select the best tool, we used a lexicographic approach, considering first the average prediction score for each task and then the computational effort. The best predictive results were achieved for GML, which were further compared with the best OpenML public results. Overall, the best GML AutoML tools obtained competitive results, outperforming the best OpenML models in five datasets. These results confirm the potential of general-purpose AutoML tools to fully automate the Machine Learning (ML) algorithm selection and tuning.


I. INTRODUCTION
A Machine Learning (ML) application typically includes several steps: data preparation, feature engineering, algorithm selection and hyperparameter tuning. Most of these steps require trial-and-error approaches, especially for non-ML experts. More experienced practitioners often use heuristics to explore the vast dimensional space of parameters [1]. With the increasing number of non-specialists working with ML [2], in recent years there have been attempts to automate several components of the ML workflow, giving rise to the concept of Automated Machine Learning (AutoML) [3].
This paper focuses on the selection of the best supervised ML algorithm and its hyperparameter tuning. The comparison study considers eight recent open-source AutoML technologies: Auto-Keras, Auto-PyTorch, Auto-Sklearn, AutoGluon, H2O AutoML, rminer, TPOT, and TransmogrifAI. To assess these tools, we use twelve popular datasets retrieved from the OpenML platform, divided into regression, binary and multi-class classification tasks. In particular, we design three main scenarios for the benchmark study: General ML (GML) algorithm selection, Deep Learning (DL) selection and XGBoost (XGB) hyperparameter tuning. Each tool is measured in terms of its predictive performance (using an external 10-fold cross-validation) and computational cost (measured as elapsed time). Moreover, the best AutoML tools are further compared with the best public OpenML predictive results (which are assumed as the "gold standard").
The paper is organized as follows. Section 2 presents the related work. Next, Section 3 describes the AutoML tools and datasets. Section 4 details the benchmark design. Then, Section 5 presents the obtained results. Finally, Section 6 discusses the main conclusions.

II. RELATED WORK
The state-of-the-art works that compare AutoML tools can be grouped into three major categories. The first category includes publications that introduce a novel AutoML tool and then compare it with existing ones. The second category (similar to our work) relates to the comparison of distinct tools, without proposing a new AutoML framework. Finally, the third category (less explored) focuses on the characteristics of the technologies rather than their predictive performances. Table I summarizes the related works using the following columns: Ref. - the study reference; Cat. - the AutoML study category; Dat. - the number of analyzed datasets; Tools - the number of compared AutoML tools; GML - whether General ML algorithms (not DL) were tested, such as Naïve Bayes (NB), Support Vector Machine (SVM) or XGB; DL - whether DL was included in the comparison; Ext. - the external validation method used (if any); C. - whether the computational effort was measured; and Description - a brief explanation of the comparison approach. The majority of the related works (14 studies) are from the year 2020, which confirms that AutoML tool comparison is a hot research topic. Some studies explore a large number of datasets [4], [5]. Our comparison adopts 12 datasets, which is below the two mentioned works but still higher than the number used in eleven other studies (e.g., [6], [7]). More importantly, we consider eight AutoML technologies, a number only outperformed by [8] (which tested only one dataset) and [9] (which did not use any datasets).
In particular, we benchmark the following recent tools: Auto-PyTorch - only studied in [10] and compared in [9]; rminer - not considered by the related works; and TransmogrifAI - only compared in [11]. Most works target GML. There are four studies that only address DL (e.g., [6], [12]). Similar to our approach, there are seven studies that consider both GML and DL. Of the 21 surveyed works, only 12 employ an external validation. Most of these studies (8 of 12) use a single holdout train/test split, which is less robust than a 10-fold cross-validation (adopted in four works). In addition, only 9 studies measure the computational effort. Furthermore, few studies contrast the AutoML results with the best human-configured results. Kaggle competition results were included in [6], [13], [14]. This work adopts open science (OpenML) best results, which was only performed in [15].

III. AUTOML TOOLS AND DATA

A. AutoML Tools
This study compares eight recent open-source AutoML tools. Whenever possible, all tools were executed with their default values, in order to prevent any bias towards a particular tool, while also corresponding to a natural non-ML-expert choice. When available in the tool documentation, we show the number of hyperparameters (H) tuned by the AutoML.
1) Auto-Keras: a Python library, based on the Keras module, that is focused on automatic DL Neural Architecture Search (NAS) [24]. The search is performed using Bayesian Optimization, with the tool automatically tuning the number of dense layers, the number of units, the type of activation functions, the dropout values and other DL hyperparameters. In this work, we adopt Auto-Keras version 1.0.7, which is used in the DL scenario (Section IV).
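To illustrate the degree of automation, the snippet below is a minimal sketch of an Auto-Keras run on tabular data; the dataset, the max_trials budget and the number of epochs are illustrative assumptions and not the exact settings of our benchmark.

import autokeras as ak
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Illustrative tabular binary classification data (not a benchmark dataset).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# NAS: Auto-Keras searches the architecture (layers, units, activations,
# dropout, ...) within the max_trials budget.
clf = ak.StructuredDataClassifier(max_trials=10, overwrite=True)
clf.fit(X_train, y_train, epochs=100)

print(clf.evaluate(X_test, y_test))   # [loss, accuracy] on the holdout data
model = clf.export_model()            # best architecture as a plain Keras model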
2) Auto-PyTorch: another AutoML tool specifically focused on NAS [10]. Auto-PyTorch version 0.0.2 uses the PyTorch framework and a multi-fidelity optimization to search the parameters of the best architecture (e.g., network type, number of layers, activation function). Similarly to Auto-Keras, we use Auto-PyTorch only in the second DL scenario.
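As a rough usage sketch, the snippet below follows the interface documented for the 0.0.x releases of Auto-PyTorch (later versions changed the API); the configuration preset, the time/epoch budgets and the validation split are illustrative assumptions.

import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
from autoPyTorch import AutoNetClassification

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

# Multi-fidelity NAS: the budgets bound the search effort;
# "tiny_cs" is one of the predefined configuration presets.
autonet = AutoNetClassification("tiny_cs", log_level="info",
                                max_runtime=300, min_budget=30, max_budget=90)
autonet.fit(X_train, y_train, validation_split=0.3)

y_pred = autonet.predict(X_test)
print("Accuracy:", sklearn.metrics.accuracy_score(y_test, y_pred))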
3) Auto-Sklearn: an AutoML library built on top of the Scikit-Learn ML framework. The choice of algorithms and hyperparameters implemented by Auto-Sklearn takes advantage of recent advances in Bayesian optimization, meta-learning and Ensemble Learning [25]. We use Auto-Sklearn version 0.7.0 in the first GML scenario, since it does not implement an automated DL or XGB search. All ML algorithms available for the task type were tested (e.g., AdaBoost, among others); a minimal usage sketch is provided at the end of this subsection.
4) AutoGluon: a Python AutoML toolkit focused on DL [26]. In this work, we consider the tabular prediction feature of AutoGluon version 0.0.13. The tabular prediction module executes several ML algorithms and then returns a Stacked Ensemble that combines the distinct ML models in multiple layers. In the GML scenario (Section IV), the ensemble includes all non-DL algorithms: GBM, CatBoost Boosted Trees, RF, Extra Trees (XT), k-NN and MR. For the DL scenario, AutoGluon uses a dense DL architecture that relies on heuristics to set the hidden layer sizes, also employing ReLU activation functions, dropout regularization and batch normalization layers [26].
5) H2O AutoML: the H2O open-source module for AutoML [27]. The tool adopts a grid search to perform the ML model selection. In this paper, we use H2O AutoML version 3. Besides the individual models, the tool also builds Stacked Ensembles, including the All variant, which combines all trained algorithms. For the DL scenario, the H2O tool uses a fully connected multi-layer perceptron trained with a stochastic gradient descent back-propagation algorithm. The searched H = 7 hyperparameters include the number of hidden layers and hidden units per layer, the learning rate, the number of training epochs, the activation functions and the input and hidden layer dropout values. Finally, for the XGB scenario, the tool tunes the same H = 9 hyperparameters used in the GML scenario.
6) rminer: an R package intended to facilitate the use of ML algorithms [28]. In its most recent version (1.4.6), rminer implements AutoML functions. The rminer AutoML executions can be fully customized by the user, who can define the searched ML algorithms, the hyperparameter ranges and the validation metrics of the assumed grid search. For less experienced users, rminer includes three predefined AutoML search templates (https://CRAN.R-project.org/package=rminer). Similarly to H2O, we test this tool in the GML and XGB scenarios. In GML, we used the "automl3" template, which searches for the best model among: GLM (H = 2), Gaussian kernel SVM (H = 2 for classification and H = 3 for regression), a shallow multilayer perceptron (with one hidden layer, H = 1), RF (H = 1), XGB (H = 1) and a Stacked Ensemble (H = 2, similar to the H2O Stacked Best).
7) TPOT: a tool written in Python that automates several ML phases (e.g., feature selection, algorithm selection) by using Genetic Programming [29]. The GML scenario tested all TPOT version 0.11.5 algorithms: DT, RF, XGB, (multinomial) Logistic Regression (LR) and k-NN. TPOT was not included in the third comparison scenario (XGB, Section IV) because the tool does not allow the selection of a single algorithm, such as XGB.
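As a concrete example of the Genetic Programming search performed by TPOT, the sketch below shows a typical run; the dataset and the generation/population settings are illustrative choices rather than the benchmark configuration.

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Genetic programming over pipelines (feature selection, algorithm selection
# and hyperparameter tuning); cv controls the internal validation.
tpot = TPOTClassifier(generations=5, population_size=20, cv=5,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best found pipeline as plain scikit-learn code.
tpot.export("best_pipeline.py")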
8) TransmogrifAI: an AutoML tool for structured data that runs on top of Apache Spark [30]. TransmogrifAI version 0.7.0 uses a grid search to select the best ML model. In the GML scenario, the tool was tested with all its ML algorithms: NB, DT, Gradient Boosted Trees (GBT), RF, MR, LR and LSVM.
9) Summary: Table II summarizes the AutoML tools that were used. For each tool, we detail the base ML framework, the available Application Programming Interface (API) programming languages, the compatible operating systems, and whether the tool supports DL (Auto-Keras and Auto-PyTorch only address DL).
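To make the default, non-expert usage concrete, the following is a minimal sketch of a GML-style Auto-Sklearn run (item 3 above). The one-hour budget mirrors the time limit adopted in Section IV, while the dataset, the per-run limit and the evaluation metric are illustrative assumptions rather than the exact benchmark configuration.

import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# One-hour budget per fit; Auto-Sklearn combines Bayesian optimization,
# meta-learning and an ensemble of the best configurations found.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    per_run_time_limit=360,
)
automl.fit(X_train, y_train)
y_prob = automl.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, y_prob))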

B. Data
The analyzed datasets (Table III) were retrieved from OpenML [31]. The selection criterion was to choose the most downloaded datasets that did not include missing data and that covered three supervised learning tasks: regression, binary and multi-class classification. The datasets reflect different numbers of instances (Rows), input variables (Cols.) and output target response values (Classes/levels, from 2 to 257; the last column details the Target domain values).
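For reference, OpenML datasets can also be retrieved programmatically; the sketch below uses scikit-learn's fetch_openml, with the dataset name and version given only as an illustration.

from sklearn.datasets import fetch_openml

# Fetch an OpenML dataset by name (name/version shown here are illustrative).
data = fetch_openml("diabetes", version=1, as_frame=True)
X, y = data.data, data.target

print(X.shape)            # rows x input columns
print(y.value_counts())   # target levels for a classification task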

IV. BENCHMARK DESIGN
The comparison study assumes three main scenarios (Table II). The first GML scenario executes all ML algorithms from the AutoML tools except DL, aiming to perform a more horizontal, ML-family-agnostic search. DL was discarded because some of the tools do not implement it (Table II), the training of DL models often requires a higher computational effort, and the second scenario is exclusively devoted to DL. The second DL scenario focuses on NAS, as implemented by the Auto-Keras, Auto-PyTorch, AutoGluon and H2O AutoML tools. Finally, the third scenario is more vertical, considering only the XGB algorithm. XGB was selected since it is a recently proposed non-DL algorithm that includes a large number of hyperparameters (e.g., the H2O documentation mentions 40 hyperparameters, of which only H = 9 are tuned). In this scenario, we test H2O and rminer, since they are AutoML tools that allow running only the XGB algorithm. For every predictive experiment, the datasets were divided into ten equally sized folds, used for the external cross-validation. In order to create validation sets (to select the best ML algorithms and hyperparameters), we adopted an internal 5-fold cross-validation applied over the training data of each external fold. For example, for an external training set with 90 instances, in the first internal fold each ML is trained with 72 instances and 18 are used for validation purposes (allowing to select the best model). Since neither Auto-Keras nor Auto-PyTorch natively support cross-validation during the fitting phase, we used a simpler holdout split, with 75% of the data for training and 25% for validation, to select and fit the models.
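The sketch below outlines this nested resampling in a tool-agnostic way: an external 10-fold cross-validation provides the test estimates, while inside each external fold the AutoML tool performs its own internal validation (here replaced by the 75%/25% holdout used for the NAS tools). The fit_automl wrapper is a hypothetical placeholder, not the API of any specific tool.

import numpy as np
from sklearn.model_selection import KFold, train_test_split

def nested_evaluation(X, y, fit_automl, score):
    """External 10-fold CV; X and y are numpy arrays, fit_automl is a
    hypothetical wrapper around one of the AutoML tools."""
    outer = KFold(n_splits=10, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in outer.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]
        # Holdout variant (75%/25%) used for Auto-Keras/Auto-PyTorch;
        # the other tools run their own internal 5-fold validation.
        X_fit, X_val, y_fit, y_val = train_test_split(
            X_tr, y_tr, test_size=0.25, random_state=0)
        model = fit_automl(X_fit, y_fit, X_val, y_val)
        fold_scores.append(score(y_te, model.predict(X_te)))
    return np.mean(fold_scores), fold_scores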
In all three scenarios, the same measures are used to evaluate the performance of the external 10-fold test set predictions. Whenever allowed by the AutoML tool, we adopted the same measures for the internal AutoML validation set model comparison. The exceptions were the multi-class datasets with the Auto-Keras and Auto-PyTorch tools, which do not allow using a Macro F1-score for validation; thus, the default loss function was adopted for these tools.
All experiments were executed using an Intel Xeon 1.70GHz server with 56 cores and 2TB of disk space. For each external fold, we also recorded the computational effort (in terms of elapsed time) for the AutoML fit (model selection and training). When the AutoML tool allowed specifying a time limit for training, the chosen time was one hour (3,600 s). Also, for the tools that implement an early stopping AutoML parameter, we fixed the value to three rounds. To aggregate the distinct external 10-fold results, we compute the average values. We also provide the 10-fold average t-distribution 95% confidence intervals, which can be used to check whether the tool differences are statistically significant (e.g., by verifying that two confidence intervals do not overlap). Nevertheless, given that there is a very large number of comparisons, to select the best tool for each task we adopt a lexicographic approach [32], which considers first the best average predictive performance (with a precision of up to 1% or 0.01 points) and then the average computational effort (precision in s). To facilitate the lexicographic regression analysis, we compute the Normalized MAE (NMAE) score, a scale-independent measure defined as NMAE = MAE/(max(y) - min(y)), where y denotes the output target.
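For reproducibility of the aggregation step, the helpers below compute the NMAE score and a t-distribution 95% confidence interval over the external fold scores, following the definitions above; they are a minimal sketch, not the exact benchmark code.

import numpy as np
from scipy import stats

def nmae(y_true, y_pred):
    """Normalized MAE: MAE divided by the target range (scale independent)."""
    y_true = np.asarray(y_true, dtype=float)
    mae = np.mean(np.abs(y_true - np.asarray(y_pred, dtype=float)))
    return mae / (y_true.max() - y_true.min())

def fold_ci(fold_scores, confidence=0.95):
    """Mean and t-distribution confidence interval over the 10 fold scores."""
    scores = np.asarray(fold_scores, dtype=float)
    mean = scores.mean()
    half = stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1) * stats.sem(scores)
    return mean, (mean - half, mean + half)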

V. RESULTS

Figures 1 and 2 summarize the main scenario (GML) results. In total, there were 12 (datasets) × 6 (tools) × 10 (folds) = 720 AutoML executions. Figure 1 presents the average computational effort (in s) for each external 10-fold iteration. Figure 2 shows the average external test scores (grouped in terms of the binary, multi-class and regression tasks). To facilitate the visualization of the regression scores, the right panel of Figure 2 uses the NMAE score on the y-axis.
For GML, Auto-Sklearn always requires the maximum allowed computational effort (3,600 s), followed by TPOT (average of 858 s per external fold and dataset). The other tools are much faster: AutoGluon - lowest average value (70 s), best in 5 datasets; H2O - second lowest average value (158 s), best in 5 datasets; TransmogrifAI - third best average (317 s); rminer - fourth best average (408 s), best in 2 datasets.

Regarding the prediction performances, there is a high overall correlation between the validation and test scores (not shown in Figure 2, although the same effect is present in Tables IV and V) when considering all tool execution values: 0.75 - binary; 0.90 - multi-class; and 0.92 - regression. For binary classification, and when considering the test set results, the AutoML differences are smaller for churn (maximum difference of 3 percentage points - pp) and higher for the other datasets (10 pp for diabetes, 15 pp for credit and 16 pp for qsar). TransmogrifAI is the best tool in 3 of the datasets (churn, credit and qsar), also obtaining the best average AUC per dataset (88%). An almost identical average (87%) is achieved by H2O (best in churn and credit), rminer (best in diabetes) and TPOT (best in churn). AutoGluon and Auto-Sklearn produced the worst overall results (average AUCs per dataset of 78% and 80%).

Turning to the multi-class tasks, the AutoML differences (best tool test result minus the worst one) are smaller when compared with the binary task: 4 pp - cmc; 5 pp - dmft; 6 pp - mfeat; and 8 pp - vehicle. The best test dataset average is obtained by AutoGluon (Macro F1-score of 58%), followed by Auto-Sklearn, H2O and TPOT (Macro F1-score of 57%), then TransmogrifAI (56%) and finally rminer (53%). In terms of datasets, the best results were: cmc - Auto-Sklearn (54%); dmft - TransmogrifAI (24%); mfeat - AutoGluon, Auto-Sklearn and TPOT (74%); and vehicle - AutoGluon and Auto-Sklearn (82%).

As for the regression tasks, the AutoML tool differences for each dataset are very small, corresponding to 1 pp in terms of NMAE for all three datasets. In effect, all tools obtain the same average NMAE per dataset (9%). Using the lexicographic selection (Section IV), the GML tool recommendation is: binary - TransmogrifAI; multi-class - AutoGluon; regression - rminer.
The lexicographic selection for XGB favors: binary - rminer (average AUC of 86%); multi-class - H2O (average Macro F1-score of 55%); regression - rminer (average NMAE of 9%). When considering both the DL and XGB scenarios, the lexicographic choice favors rminer XGB for the binary classification and regression tasks, while AutoGluon DL is the selected tool for multi-class. When analyzing all three scenarios, the overall lexicographic selection is: binary - TransmogrifAI GML; multi-class - AutoGluon GML; regression - rminer XGB.

Finally, we contrast the best main GML scenario results (which consider more ML algorithms and AutoML tools) with the best public OpenML results (Table VI). For each dataset, we show in rounded brackets the best GML AutoML tool and the type of OpenML modeling (the algorithm name or "Pipeline", with the latter denoting a ML workflow that includes a data preparation step). While the best OpenML result includes predictions for all external 10-fold instances, we do not know the exact validation and testing procedures adopted. Thus, rather than assuming a "correct" comparison, we use the best OpenML results as a "gold standard", a proxy for the best results that can be achieved by human expert ML modeling. The column Attempts in Table VI denotes the number of human ML attempts, termed a "run" in OpenML. The higher the number of attempts, the stronger our assumption that the gold standard was reached. While all 12 datasets have high download numbers, the attempts distribution is highly unbalanced towards the classification tasks, particularly the binary ones (e.g., credit has more than 419,000 attempts). The results from Table VI confirm the quality of the AutoML tools. In effect, the tools obtained prediction scores that are close to the best OpenML results in seven datasets (e.g., the maximum difference is 2 pp and 5 pp for the binary and multi-class classification tasks). More importantly, the AutoML outperformed the best OpenML results for the three regression tasks and for two highly modeled binary datasets.

VI. CONCLUSIONS
In this paper, we benchmark eight recent open-source supervised learning Automated Machine Learning (AutoML) tools: Auto-Keras, Auto-PyTorch, Auto-Sklearn, AutoGluon, H2O AutoML, rminer, TPOT and TransmogrifAI. A large set of computational experiments was conducted, considering an external 10-fold cross-validation, twelve datasets and three tool comparison scenarios. Each tool was benchmarked by measuring its computational effort and predictive scores. We retrieved popular datasets from the OpenML platform, which were equally grouped into regression, binary and multi-class classification tasks. The three comparison scenarios were: General Machine Learning (GML) - with a broad range of classical ML algorithms; Deep Learning (DL) - focusing on tools with DL Neural Architecture Search (NAS) capabilities; and XGBoost (XGB) - considering the hyperparameter tuning of a single algorithm (XGB).
To select the best tools for each scenario, we adopted a lexicographic approach, which considers first, for each task, the best average predictive score and then the lowest computational effort. For GML, the lexicographic selection favors TransmogrifAI for binary classification, AutoGluon for multi-class classification and rminer for regression. For DL, the selection is H2O for the binary classification and regression tasks and AutoGluon for multi-class classification. As for the XGB scenario, rminer is the best overall option for the binary and regression tasks, while H2O is recommended for multi-class tasks.
A global analysis, considering all three scenarios, favors the GML approach, which produced the best predictive scores. This result should be taken with some caution, since GML explored more ML algorithms and AutoML tools. Nevertheless, the slightly lower AutoML DL predictive performances might be explained by two factors. Firstly, the analyzed datasets are relatively "small", with the largest dataset containing only 5,000 instances, and DL tends to produce better results (when compared with shallow methods) when modeling big data [33]. Secondly, the AutoML tools with DL capabilities are more recent and thus might still be immature when compared with the GML tools. For instance, the tested Auto-PyTorch and AutoGluon versions are still in early 0.x releases (e.g., 0.0.2). To further measure the quality of the GML AutoML modeling, we compared the best GML results with the best predictions publicly available on the OpenML platform. The OpenML comparison confirmed that current GML AutoML tools provide competitive results, producing close predictions in seven datasets and even outperforming the human ML modeling in five datasets.
In future work, we intend to enlarge the comparison by considering more open-source AutoML technologies and datasets. In particular, we wish to analyze big data, where DL can potentially produce better predictions. We also plan to benchmark ML frameworks for specific infrastructure settings, such as edge computing.