A Comparison of Data-Driven Approaches for Mobile Marketing User Conversion Prediction

In this paper, we perform an exploratory study of user Conversion Rate (CVR) prediction using recent big data from a global mobile marketing company. We design a stream processing engine to collect sampled mobile marketing data. Then, we execute a large set of CVR prediction tests, under a two-stage experimental procedure that considers a rolling window evaluation. First, several preprocessing and machine learning combinations are analyzed using preliminary data. Next, the selected combinations are tested on a larger set of unseen datasets. The classification performance ranged from reasonable to very good (AUC values from 61.2% to 83.8%), with some learning models (e.g., XGBoost, Logistic Regression) requiring a reduced computational effort, thus showing a potential value for user CVR prediction in this domain.


I. INTRODUCTION
Mobile performance marketing is growing due to the widespread usage of mobile devices (e.g., smartphones, tablets). Several commercial mobile advertising platforms have been created, known as Demand-Side Platforms (DSP) [1]. The DSP acts as a broker, matching user profiles to ads, thus linking user traffic coming from publishers (e.g., news sites, game apps) to advertisers. If there is a conversion (e.g., product sale), the DSP facilitates the cash flows, returning a portion of the advertiser's revenue to the publishers. All ad clicks and sales generate data events, leading to big data with its 4Vs characteristics [2]: volume, velocity, variety and value. A critical aspect of the DSP big data system is the prediction of the user Conversion Rate (CVR), which involves estimating whether there will be a sale when a user clicks and views an advertisement [3].
In this paper, we perform a large number of computational experiments, aiming to compare several data preprocessing and machine learning approaches for predicting user CVR responses after clicking a mobile ad link. As a case study, we work with recent real-world data from OLAmobile, which is a global mobile advertising company that created and maintains its own DSP. First, we design a stream processing engine to collect sampled data from the DSP data center. Then, we execute a vast experimental comparison, under a two-stage experimental design that includes distinct datasets, data preprocessing (five categorical and five balancing training set transformations) and machine learning (three offline and three online algorithms) combinations, and a realistic rolling window evaluation.
This paper is organized as follows. First, the related work is introduced (Section II). Next, the data and methods are described (Section III). Then, the experimental results are presented and analyzed (Section IV). Finally, the main conclusions are discussed (Section V).

II. RELATED WORK
Several works have approached user CVR prediction, mostly using linear models, such as Logistic Regression (LR) [4]. Recently, more flexible machine learning methods have been proposed, such as Gradient Boosting Decision Trees (GBDT) [4], [5], Random Forest (RF) [3] and Deep Learning [6]. Yet, these studies tend to consider only the prediction performance and not the computational effort. For instance, the Deep Learning models proposed in [6] are more complex than the LR method, although the classification performance improved only slightly. Since DSPs generate big data, with millions of worldwide clicks per hour, constant model updates and real-time predictions are required. In this paper, we evaluate both the predictive performance and the computational effort.
Data preprocessing is another relevant issue. Due to privacy and DSP issues, only a limited set of mobile CVR related attributes is available, which increases the complexity of the prediction task (e.g., it is not possible to identify a user). Attributes are mostly categorical, often presenting a large cardinality, with hundreds of levels. CVR works have adopted either raw numeric encodings (e.g., [4]) or one-hot encoding (e.g., [3], [6]), and thus distinct categorical transformation methods are rarely compared. One-hot is a popular transformation, but it heavily increases the computational effort for high cardinality attributes. Moreover, CVR data are highly imbalanced, with sales often corresponding to less than 1% of the generated data events. Thus, balancing the training data (e.g., undersampling or SMOTE [7]) might improve results, although this aspect has been neglected in most CVR prediction studies.
In contrast with previous CVR works, in this paper we test a larger set of combinations of preprocessing and learning methods, including five categorical transformations (e.g., raw, categorical or one-hot), five training set setups (e.g., none, SMOTE) and six learning algorithms (e.g., LR, RF). Also, we adopt a more realistic and robust rolling window validation [8], [9], which simulates several holdout (train and test) iterations through time, rather than the simpler holdout validation used in [3]-[6].

III. DATA AND METHODS
A. Stream Engine and Collected Data
Under the analyzed market, publishers put a dynamic link in their web pages or apps. Once it is clicked, the DSP selects a marketing campaign, redirecting the user to a specific ad and advertiser. Two data events are generated: redirects, when users click the dynamic link; and sales, when there is a conversion. All events are stored at the DSP data center, which receives millions of redirects and thousands of sales per hour. The DSP provided us with a secure web service (HTTPS) that allows requesting a total of N_R redirects or N_S sales from the data center.
In this work, we had access to an Intel Xeon 1.70GHz server with 56 cores and 2TB of disk, which has limited capabilities when compared with the data center; thus, we worked with sampled data. We designed a stream processing engine (Figure 1) using the R tool [10]. The engine sets K cores for requesting redirects and sales. After receiving the stream (in JSON format), each core sleeps for S_R or S_S seconds and then repeats the request (asking for more data). The received streams are sent to another layer of cores, which filter the data according to some of its attribute values. The filtered streams are then stored, in first in, first out (FIFO) order, in three files: redirects; sales; and an event log, used for monitoring the data collection. These files were stored using MongoDB, a fast NoSQL JSON database system [11].
Table I describes the stream collection parameters and resulting datasets. The Traffic column distinguishes two main event types:
• TEST - initial DSP testing mode, used to measure campaign performance; and
• BEST - the best product campaigns, which have obtained a minimal TEST performance and correspond to most of the traffic.
The last two columns denote the number of redirects that produced sale (Y_yes) and no sale (Y_no) events. For two TEST datasets (30 minutes and 1 week), the number of events collected is around half of that of the other datasets. This is because TEST traffic, which is scarcer than BEST traffic, changes through time and the datasets were collected at different periods.
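The request, sleep, filter and store cycle of one engine core can be sketched as follows. This is a minimal single-process illustration in Python (the original engine was written in R and ran on K cores); the function name `poll_stream`, the event fields and the canned batches are hypothetical stand-ins, not the DSP web service.

```python
import json
import time
from collections import deque


def poll_stream(fetch, keep, sink, n_requests, sleep_seconds, sleep=time.sleep):
    """Request a JSON batch, filter it, store survivors in arrival order, then sleep."""
    for _ in range(n_requests):
        batch = json.loads(fetch())      # one web-service response (a JSON array)
        for event in batch:
            if keep(event):              # filtering layer: attribute-based selection
                sink.append(event)       # FIFO: oldest events stay at the front
        sleep(sleep_seconds)             # wait S_R or S_S seconds before asking again
    return sink


# Hypothetical stand-in for the web service: two canned JSON responses.
_batches = iter([
    '[{"id": 1, "traffic": "BEST"}, {"id": 2, "traffic": "OTHER"}]',
    '[{"id": 3, "traffic": "TEST"}]',
])
redirects = poll_stream(
    fetch=lambda: next(_batches),
    keep=lambda e: e["traffic"] in {"BEST", "TEST"},
    sink=deque(),
    n_requests=2,
    sleep_seconds=0,
)
print([e["id"] for e in redirects])
```

Passing `fetch` and `sleep` as arguments keeps the loop testable without network access; in a real deployment each of the K cores would run such a loop in parallel.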
Although we increased the number of redirect request cores (K column of Table I) for the shorter duration datasets, the web service limitations made it impossible to retrieve all redirects. Thus, our ratio of collected sales, Y_yes/(Y_no + Y_yes), is often higher than the real DSP ratio, ranging from 2.1% to 34.4% (BEST) and from 0.2% to 20.4% (TEST). This issue is handled by setting two data variants:
• collected - with all stored events; and
• realistic - with a sample of the collected data such that the overall sales ratio is 1% for BEST and 0.5% for TEST.
All redirect and sale events were merged into tabular files for the classification modeling. Table II lists the respective input categorical attributes and output target (last row), as provided by the DSP. The attributes are defined in terms of their context (user, advertiser, publisher or target) and description (including the number of levels and example values).
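One possible way to build the realistic variant is sketched below in Python. The text only states that the sample reaches a given overall sales ratio; the choice made here (keep every no-sale event and randomly drop sale events) is an assumption, as are the function name `realistic_sample` and the `sale` field.

```python
import random


def realistic_sample(events, target_ratio, seed=0):
    """Keep all no-sale events and a random subset of sales so that
    sales / total is approximately target_ratio (time order is preserved)."""
    sale_idx = [i for i, e in enumerate(events) if e["sale"]]
    n_no = len(events) - len(sale_idx)
    # Solve k / (k + n_no) = target_ratio for k, the number of sales to keep.
    k = min(len(sale_idx), round(target_ratio * n_no / (1.0 - target_ratio)))
    kept = set(random.Random(seed).sample(sale_idx, k))
    return [e for i, e in enumerate(events) if not e["sale"] or i in kept]


# Hypothetical collected data: 990 redirects without sale and 200 with sale.
collected = [{"sale": False} for _ in range(990)] + [{"sale": True} for _ in range(200)]
realistic = realistic_sample(collected, target_ratio=0.01)   # BEST setting: 1%
ratio = sum(e["sale"] for e in realistic) / len(realistic)
print(len(realistic), ratio)
```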

B. Data Preprocessing
We compare five transformations to handle the nominal inputs: raw (R), categorical or one-hot coding (C), Inverse Document Frequency, IDF (I), categorical pruned (CP) and IDF pruned (IP). All encodings were computed using only training data. When necessary, training transformation variables (e.g., IDF numeric value for a level) were stored, such that test data could be encoded using the same transform. Moreover, a special "new" category was set to match any levels that appear in test data but could not be known at training time.
The transformations work as follows:
• R - uses the original numeric value of the data ("new" is encoded as 0).
• C - assumes a categorical attribute for methods that can directly handle such attributes (e.g., tree based or RF) or one-hot encoding for numeric based methods (e.g., LR). In this second case, the R tool transforms each attribute into L − 1 binary inputs, where L is the number of levels (including the special "new" value).
• CP_F - proposed variant used to reduce the input memory requirements. It first ranks the L levels according to their frequency in the data. Only the F most frequent levels are kept and all other levels (including "new") are merged into a special "other" category.
• I - adopts the IDF transform [12]: $I(x) = \ln(n / f_x)$, where n is the number of instances and f_x is the frequency of level x of the attribute. The transformed I(x) values range from near 0 (most frequent level) to I_max (least frequent level).
• IP_F - proposed procedure that selects the F most frequent levels, groups the other levels (except "new") into an "other" category, and then applies the I(x) function to the F + 1 levels.
In both the I and IP procedures, the "new" level is transformed into the value of the least frequent level (I_max).
We explore one normal and four balancing methods, which were applied only to training data. Thus, the test sets are kept with their original unbalanced target distributions. The training set setups include [7]: none (N), undersampling (U), oversampling (O), both (B) and SMOTE (S) [14].
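The I and IP_F encodings described above can be illustrated with a short Python sketch. It assumes the natural-logarithm IDF form ln(n / f_x) and fits everything on training values only; the function names and the toy country levels are hypothetical, and the original experiments used R.

```python
import math
from collections import Counter


def fit_idf(train_values):
    """Map each training level x to ln(n / f_x); "new" gets the largest value (I_max)."""
    n = len(train_values)
    table = {level: math.log(n / f) for level, f in Counter(train_values).items()}
    table["new"] = max(table.values())       # unseen test levels -> I_max
    return table


def encode_idf(values, table):
    """Encode values with a fitted table; unknown levels fall back to "new"."""
    return [table.get(v, table["new"]) for v in values]


def fit_ip(train_values, F):
    """IP_F sketch: keep the F most frequent levels, merge the rest into "other",
    then apply the IDF transform to the pruned training values."""
    kept = {level for level, _ in Counter(train_values).most_common(F)}
    seen = set(train_values)
    table = fit_idf([v if v in kept else "other" for v in train_values])

    def encode(value):
        if value not in seen:                # level never seen in training
            return table["new"]
        return table[value if value in kept else "other"]

    return encode


# Toy training attribute with three levels (n = 10).
train = ["PT"] * 6 + ["BR"] * 3 + ["FR"]
idf = fit_idf(train)
encode_ip1 = fit_ip(train, F=1)
print(encode_idf(["PT", "XX"], idf))
print([encode_ip1(v) for v in ["PT", "BR", "XX"]])
```

Storing the fitted table (or the `encode` closure) is what allows test data to be encoded with the same transform that was learned from the training window.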

C. Classification Methods
The comparison includes three offline and three online classifiers. To reduce the bias towards a given algorithm and perform a fair comparison, we executed all algorithms with their default parameters.
The offline algorithms include: Logistic Regression (LR), Random Forests (RF) and XGBoost (XB). LR is a popular linear model for CVR prediction. Both RF and XB are based on decision tree ensembles. RF was proposed in 2001 [15] and it combines the responses of a large number of decision trees. In [3], it provided the best CVR prediction results, although it required much more computation than LR. More recently, the scalable XB gradient boosting algorithm was proposed in 2016 [16], winning several classification challenges and requiring less computational effort than RF. The online learning algorithms include OzaBoost (OB), DecisionStump (DS) and Random Hoeffding Trees (RH). OB is an online boosting ensemble version of the AdaBoost.M1 algorithm, DS is based on one-level decision trees and RH uses incremental decision trees [17]. All algorithms were implemented in the R tool [10], using the packages rminer [18], for the offline learning, and RMOA [17], for the online learning.

D. Evaluation
We adopted the robust rolling window validation [8], [9], which simulates a real classifier usage through time, with several training and test updates (Figure 2). In the first iteration, the learning model is fit to a training window with the W oldest examples and then issues H ahead predictions. Next, the training set is updated by discarding the oldest H records and adding H more recent ones. A new model is fit, producing H new predictions, and so on. In total, this produces U classifier updates (training and test iterations), where U depends on W, H and the data length D_L (number of examples). In this work, after consulting OLAmobile experts, we opted to use the realistic values of W = 50,000 and H = 3,000. The predictive performance is measured on test data using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve [19]. Often, the quality of the AUC values is interpreted as: 50% - performance of a random classifier; 60% - reasonable; 70% - good; 80% - very good; 90% - excellent; and 100% - perfect. We also record the computational effort (in seconds) for each rolling window iteration.
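The rolling window scheme can be sketched as an index generator. The Python code below is an illustration under two assumptions: only complete test blocks of H examples are used, which gives floor((D_L − W) / H) iterations, and the data length of 218,000 is a hypothetical value chosen for the example.

```python
def rolling_windows(data_length, W, H):
    """Yield (train_range, test_range) index pairs over time-ordered examples."""
    start = 0
    while start + W + H <= data_length:
        yield range(start, start + W), range(start + W, start + W + H)
        start += H   # discard the H oldest records and add the H most recent ones


# W and H follow the text; the data length is hypothetical.
windows = list(rolling_windows(data_length=218_000, W=50_000, H=3_000))
first_train, first_test = windows[0]
last_train, last_test = windows[-1]
print(len(windows), first_test, last_test)
```

Each yielded pair corresponds to one classifier update: fit on the training range, predict the test range, and record the AUC and the elapsed time for that iteration.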
The experimental design includes two stages. First, we conduct a large number of preliminary experiments using the oldest collected datasets (duration of 30 minutes), with all preprocessing and classifier combinations. Then, the selected first stage combinations are tested over a larger number of unseen datasets. The goal is to measure the performance of the selected combinations on unseen data. To aggregate all execution results (e.g., AUC values of the U rolling window

IV. EXPERIMENTAL RESULTS
A. First Phase
This phase uses only the 30 minutes data and starts by setting the number of pruned levels (F). Then, the preprocessing and classifier performances are compared. For each factor of analysis (e.g., CP10), there are E executed experiments (e.g., different classifiers). Some experiments produce computational errors (e.g., lack of memory), resulting in R rolling window results (R ≤ E). For each rolling window execution, we compute the Wilcoxon median result over all U iterations. Then, we compute the Wilcoxon median over all R executions.
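The text does not define the "Wilcoxon median". One common reading, assumed here, is the pseudomedian (Hodges-Lehmann estimator), i.e., the median of all pairwise averages, which is the location estimate associated with the Wilcoxon signed rank test. The Python sketch below applies it in the two levels described above; the AUC values are made up for illustration.

```python
from itertools import combinations_with_replacement
from statistics import median


def pseudomedian(values):
    """Hodges-Lehmann estimator: median of the Walsh averages (x_i + x_j) / 2, i <= j."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


# Hypothetical AUC values (%): one inner list per rolling window execution.
executions = [
    [70, 72, 74, 90],
    [68, 69, 71, 73],
    [75, 76, 78, 80],
]
per_execution = [pseudomedian(auc) for auc in executions]   # over the U iterations
overall = pseudomedian(per_execution)                       # over the R executions
print(per_execution, overall)
```

Compared with the plain median, this estimator uses more of the information in the sample while remaining less sensitive to outliers than the mean, which may be why it was preferred for aggregating rolling window results.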
For the pruned encodings (CP and IP), we tested F ∈ {10, 20, 30}. Table III presents the median results (AUC and computational effort) when fixing a particular F value, resulting in E = 5 (training setups) × 6 (classifiers) = 30 experiments per level and data variant. In Table III, the number of execution results (R) is shown in brackets and aggregated for both data variants. In a few cases, the F = 20 and F = 30 levels led to an execution error, resulting in R = 59 (and not 60). Since F = 10 did not lead to execution errors and provided the best AUC and computational effort overall results, we opted to fix this value. Table IV presents the overall first phase results when fixing an encoding method. All executions were successful, except for the categorical (C) encoding, confirming that the C transformation is problematic for high cardinality attributes. The last row presents the overall median values, computed over the four data setups. The C encoding also produced the worst AUC values. Considering both the predictive AUC performance (e.g., best overall median value of 76.4) and computation effort, we opted to select CP10 as the encoding method for the second stage experiments.
The results for a fixed balanced training are shown in Table V. In this table, R = 54 since 6 C encodings produced computational errors. The table includes the median value for each traffic type. For the BEST traffic events, the no balancing step (N) achieved the best AUC results. Balancing the training data seems more useful for the TEST traffic, which makes sense, since it presents a lower ratio of sales. Considering both the AUC and computational effort, for the second phase we selected the N training mode for BEST and S for TEST.
The last analyzed factor is the learning algorithm (Table VI). The number of executed experiments was E = 5 (encodings) × 5 (training setups) × 2 (variants) = 50 experiments. Three learning algorithms produced computational errors for the C encoding, resulting in a smaller R = 40. RF achieved the best overall result, followed by XB, OB and LR (for BEST) and followed by LR, OB and XB (for TEST). OB provided the best online learning AUC results. Yet, under the adopted rolling window scheme, the computational effort is still higher than LR and XB, being only comparable to RF. For the second phase, we selected three methods: RF (best AUC results), LR (second best TEST results), and XB (fastest method, second best BEST results).

B. Second Phase
We tested the first phase selected combinations (CP10 encoding; N training for BEST and S balancing for TEST; LR, XB and RF) in the unseen datasets (1 hour, 1 day and 1 week). This resulted in E = 2 (traffic type) × 2 (data variants) × 3 (durations) = 12 rolling window executions per classifier. There were no computational errors (R = 12).
The results are shown in Table VII in terms of median values over all U rolling window iterations for each classifier and dataset. In terms of AUC, RF is the best model for the collected setups (BEST and TEST), XB is the best option for BEST realistic, and LR produces the best results for TEST realistic. The quality of the obtained AUC values can be classified as: good (around 70%) or very good (around 80%) for BEST collected; good (around 70%) for BEST realistic and TEST collected; and reasonable (around 60%) for TEST realistic. This last case is particularly relevant, since the marketing company does not have any information about campaign success in TEST traffic and uses a random user ad matching, which is equivalent to an AUC of 50%. Thus, the proposed LR model has business value.
XB is the fastest method, followed by LR, while RF requires substantial computation. DSP platforms have real-time requirements: matching users to ads should take less than 10 ms. While we did not use optimized infrastructure and code, several of the XB and LR data-driven models do meet the real-time constraints, even when constantly updating the training model. For instance, for TEST data, LR needs an average of 15.6 s / 3,000 ≈ 5 ms to issue a prediction, while XB requires a shorter time of around 1 ms.
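The per-prediction time quoted above follows from dividing the effort of one rolling window iteration (which includes model training) by the H = 3,000 predictions it issues. A small Python check of that arithmetic, with a hypothetical helper name, is:

```python
def per_prediction_ms(iteration_seconds, horizon):
    """Average time per prediction, in milliseconds, for one rolling window iteration."""
    return 1000.0 * iteration_seconds / horizon


REAL_TIME_LIMIT_MS = 10                    # real-time requirement stated in the text
lr_ms = per_prediction_ms(15.6, 3000)      # LR on TEST data: 15.6 s per iteration
print(round(lr_ms, 1), lr_ms < REAL_TIME_LIMIT_MS)
```

Since the iteration time also covers fitting the model, this is a conservative estimate of the latency of a single prediction.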

V. CONCLUSIONS
There is an increasing interest in the domain of mobile performance marketing due to the massive usage of mobile devices (e.g., smartphones, tablets). Within this industry, Demand-Side Platforms (DSP) act as brokers, matching user traffic, coming from publishers, to advertisers. Acting globally, DSPs generate big data related to ad clicks and conversions (product sales). In this context, user Conversion Rate (CVR) prediction is a critical element of a DSP, allowing a better match of user profiles to ads.
In this paper, we study user CVR prediction using big data from a global mobile marketing company. Since the company data center receives big data, with millions of ad clicks and thousands of sales per hour, we design a stream processing engine to collect sample data from the company data center into our computational system. Several datasets with distinct durations were collected for two main traffic types: BEST and TEST. Then, we perform an extensive set of CVR prediction tests, under a two-stage experimental design and using a robust rolling window validation. The first stage explored five categorical transformations, five training set setups and six machine learning methods, which were applied to the oldest collected datasets. In the second stage, the best data-driven combinations were tested on a larger set of unseen datasets.
Interesting predictive performances were achieved, ranging from reasonable (AUC of 61.2% for Logistic Regression and TEST traffic) to very good (AUC of 83.8% for Random Forest and BEST traffic). Thus, there is a potential value for an improved CVR user prediction in the analyzed mobile market. In particular, we achieved an AUC higher than 60% for the realistic TEST scenario, related with new ad campaigns. This is quite valuable for the analyzed DSP, since it currently employs a random user ad matching method for this traffic data type, which corresponds to an AUC of 50%.
Moreover, while we did not use a powerful computational infrastructure or optimized code, several of the XGBoost and Logistic Regression models did produce real-time results (e.g., <10 ms per prediction) when performing several training and testing iterations (e.g., U = 56).
In future work, we wish to improve the predictive performance by putting an increased effort on feature engineering, for instance by attempting to collect and extract a richer set of attributes (e.g., related to user behavior after clicking the ad). We also wish to study scalability issues (e.g., use of the Apache Spark cluster-computing framework [21]). Currently, this work is part of an ongoing R&D project that involves a real business company and that will include, at a later stage, the adaptation of the proposed data-driven approach to perform real-time user ad matches.

ACKNOWLEDGMENTS
This article is a result of the project NORTE-01-0247-FEDER-017497, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European
Table VII notes: a - XB is statistically significant when compared with RF but not LR; b - RF is statistically significant when compared with RF and LR; c - RF is statistically significant when compared with XB but not LR.