Booking Prediction Models for Peer-to-peer Accommodation Listings using Logistics Regression, Decision Tree, K-Nearest Neighbor, and Random Forest Classifiers

Background: The literature on peer-to-peer accommodation has focused substantially on the price determinants of accommodation listings. Developing prediction models for the demand for accommodation listings is vital in revenue management because accurate price and demand forecasts help determine the best revenue management responses. Objective: This study aims to develop prediction models to determine the booking likelihood of accommodation listings. Methods: Using an Airbnb dataset, we developed four machine learning models, namely Logistic Regression, Decision Tree, K-Nearest Neighbor (KNN), and Random Forest classifiers. We assessed the models using the AUC-ROC score and the model development time under the ten-fold three-way split and the ten-fold cross-validation procedures. Results: In terms of average AUC-ROC score, the Random Forest classifier outperformed the other evaluated models. In the three-way split procedure, it had a 15.03% higher AUC-ROC score than Decision Tree, 2.93% higher than KNN, and 2.38% higher than Logistic Regression. In the cross-validation procedure, it had a 26.99% higher AUC-ROC score than Decision Tree, 4.41% higher than KNN, and 3.31% higher than Logistic Regression. It should be noted that the Decision Tree model had the lowest AUC-ROC score but the shortest model development time. Conclusion: Among the evaluated models, the Random Forest models performed best in predicting the booking likelihood of accommodation listings. The model can be used by peer-to-peer accommodation owners to improve their revenue management responses.


I. INTRODUCTION
Revenue management refers to the process of organizing or controlling prices and supplies to maximize revenue [1] by matching the right product to the right customer at the right time [2]. With demand-level prediction, prices can be set so that they are accepted by both price-sensitive and price-insensitive customers at a particular time [3]. Revenue management is effective when the operation has relatively fixed capacity, variable and uncertain demand, perishable inventory, a high fixed cost structure, and varying customer price sensitivity [4]. It has been applied in various industries, such as airlines, automobile rental, broadcasting, cruise lines, Internet service provision, lodging and hospitality, and passenger railways, and even in the non-profit sector [5].
Over sixty percent of research on revenue management focuses on the hotel business context [6]. In the new peer-to-peer business models that use electronic platforms to connect landlords and guests, such as Airbnb, revenue management would help landlords increase their revenue [7].
Two common strategies to increase revenue are the pricing and the non-pricing strategies [8]. The former includes demand-based pricing, which can be efficient in providing a competitive advantage in the market. However, it relies on the accuracy of the demand prediction [9]. A non-pricing strategy includes capacity management. A market demand forecast helps decision-makers determine the allocation of capacity, that is, whether to sell now or later, depending on their decision-making rules and the estimation of customers' willingness-to-pay [10].
In popular peer-to-peer accommodation locations, accommodation prices during holiday seasons, such as the New Year's holiday, are higher than in other periods. This is common because capacity remains the same while demand increases, so property managers seek strategies to maximize revenue by increasing prices [11]. Many famous cities, such as London, implement a policy on short-term rentals to protect the availability of housing for long-term residents. The policy states that hosts may rent out their property to guests for no more than 90 nights per year [12].
Previous studies in the peer-to-peer accommodation business context have explored various dimensions of the pricing issue [13]-[16]. To the best of our knowledge, articles exploring the demand for Airbnb listings are still lacking. Therefore, this study aims to develop prediction models for the booking likelihood. The findings can help accommodation hosts determine profitable pricing and capacity strategies. To develop the prediction models, four machine learning techniques were used, namely Logistic Regression, Decision Tree, K-Nearest Neighbors (KNN), and Random Forest.

II. LITERATURE REVIEW
Previous studies in the peer-to-peer accommodation business have explored various dimensions of the pricing issue [13]-[16]. Many of them use the Airbnb business context as a case study. Exploring price determinants using the hedonic pricing model is popular in the property market [17]. This method was later applied in the Airbnb listings context to find price determinants [18]-[20]. Variables related to the price include the environment, the social aspect, accessibility, and spillover impacts [21]. The study in [22] involved twelve countries in the Caribbean and macro-financial data. The study in [23] involved eleven cities in the US and focused on how 137 amenity factors influence pricing. The characteristics of the city may also be used as price determinants [24]. Another study comparing the characteristics of Airbnb listings explains how pricing differs between urban and sun-and-beach holiday destinations [25]. Studies on the price determinants of Airbnb listings have also used market demand to explain the price [26]. Another study uses text data from guest reviews to identify guests' sentiments [27].
A study using a sequential Bayesian approach [28] aimed to understand the booking probability of listings and its posterior distribution. Demand forecasting is an essential part of revenue management [29] because it helps maximize revenue gain [30]. In the restaurant business, demand forecast figures can minimize operating costs [31]. Revenue optimization measures should be implemented after an accurate demand forecasting system has been established [32].
The type of prediction model influences the forecast accuracy. In the revenue management context, there are three different models for forecasting the booking process [33], i.e., (1) historical booking models that focus on the total booking figures, (2) advanced booking models that focus on elapsed reservations, and (3) combinations of the historical and advanced booking models. The historical booking models employ same-day-last-year, moving average, exponential smoothing, and other time-series forecasting methods. The advanced booking models use classical pickup, advanced pickup, synthetic booking curve, and other time-series approaches. The combined models use regression and a weighted average of historical and advanced booking forecasts. Table 1 gives an overview of three studies that analyze forecasting topics using time-series data in the hotel industry. The first study compared different forecasting methods to accurately predict booking reservations and room occupancy [34]. The second study used various forecasting methods and concluded that the pickup, moving average, and exponential smoothing models were the best. The third study compared different forecasting methods using hotel occupancy data (three different room types) [35]. Another study analyzed time-series methods [36] using monthly observations of hotel and motel guest nights in New Zealand. The results show that the Holt-Winters method and the ARMA model were better than the Box-Jenkins seasonal autoregressive moving average (SARMA) model.
The target variables in the previous studies were guest arrivals, hotel occupancy, and duration of stay. The current study aims to predict whether an accommodation listing will be booked or not. In terms of method, Logistic Regression is relatively easy to use and does not need any hyper-parameter optimization setup. The model can also compete with more sophisticated machine learning models [37]. A Decision Tree model is a non-parametric approach that can adapt to any kind of dataset and deals well with nonlinear relationships [38]. KNN is among the top 10 algorithms in data mining [39] due to its simplicity and significant performance [40]. Lastly, Random Forest improves on the Decision Tree by combining several Decision Trees, which provides good predictions and tends not to overfit because of the law of large numbers [41].

III. METHODS
Figure 1 shows the working framework of this study. It is adapted from the standard method for building a predictive analytics model [42]. There are five stages: collecting data; selecting relevant predictor variables; determining the potential prediction methods; evaluating, validating, and selecting the best prediction model; and finally reporting the research results.

A. Airbnb listings data collection
In this study, we utilized listing data available from the InsideAirbnb.com platform and from doogal.co.uk [43]. InsideAirbnb.com is a website that uses a web scraping technique to gather data from the Airbnb website and provides it as open data to the public. We used the Airbnb listings data from December 2018, consisting of 77,096 Airbnb listings and 96 data variables. Several listings were removed from the dataset because they were illogical (i.e., listings booked for an entire year), were duplicate records, or contained missing values. The filtered dataset consists of 53,514 listings. The other dataset came from the doogal.co.uk platform, which provides information about the London stations. Table 2 shows the descriptive statistics of the datasets.

B. Choice of variables
The predictor variables were based on the findings of previous studies [7], [28], but five predictor variables, namely the number of neighboring listings, available neighboring listings, house rules, property description, and the number of listing pictures, were excluded from this study because of data and computing limitations. Instead, we added other variables, such as total host listings, host verifications, accommodates, guests included, minimum nights, and maximum nights. The total host listings variable indicates whether the host is professional or not. If a host has more than one listing, we categorized the host as professional [26]. The more professional the host is, the better the services. The host verification variables indicate the reliability of the host. Table 3 shows the detailed information on the predictor variables.

C. Choice of potential methods
The focus of this research is to develop binary classification models that accurately predict whether an Airbnb listing will be booked or not. Table 4 shows the prediction models employed in this study. In terms of the number of subordinate models within a single machine learning model, we investigate both ensemble models and singular models. In general, ensemble models predict more accurately than singular models [44]. However, this research still investigates singular models because they are simple and easy to implement, and because singular models can still outperform ensemble models [37]. In the singular group, we used Logistic Regression, K-Nearest Neighbors, and Decision Trees/Classification and Regression Tree (CART). In the ensemble group, we used the Random Forest approach.

C.1. Logistic Regression
Choosing the right variables and avoiding highly correlated variables are important when using Logistic Regression [46]. The predictor variables in Logistic Regression can be categorical or numerical, and the target variable is binary (dichotomous). Therefore, binary Logistic Regression cannot predict target variables with more than two classes. Although Logistic Regression has several weaknesses, it can often compete with other machine learning techniques, such as neural networks, support vector machines, random forest, and gradient boosting [37]. The formalization of logistic regression is stated as follows [45]:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}} \tag{1}$$

where:
- $p$ is the probability of the outcome of interest
- $e$ is 2.71828 (the base of the system of natural logarithms)
- $\beta_0$ is the intercept
- $\beta_1, \dots, \beta_k$ are the regression coefficients
- $x_1, \dots, x_k$ are the set of predictor variables
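
As an illustration of the logistic model, the following minimal Python sketch maps a linear combination of predictors to a booking probability. It assumes the standard logistic form; the function name `booking_probability`, the coefficient values, and the 0.5 decision threshold are hypothetical and are not taken from the fitted models in this study.

```python
import math


def booking_probability(intercept, coefficients, features):
    """Return p = 1 / (1 + e^-(b0 + sum(b_i * x_i))) for one listing."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    # Linear predictor: intercept plus the weighted sum of the predictors.
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical intercept and coefficients for two standardized predictors.
p = booking_probability(-0.5, [0.8, -0.3], [1.2, 0.4])
# Assumed decision rule: classify as booked when p is at least 0.5.
label = "booked" if p >= 0.5 else "not booked"
```

In practice, the intercept and coefficients would be estimated from the training data rather than set by hand.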

C.2. Decision Trees/Classification and Regression tree (CART)
CART can solve classification problems. As its name suggests, the CART algorithm has a tree structure, with a root node, leaf nodes, and branches. It has several advantages: it is non-parametric, it adapts to any dataset, and it can deal with non-linear relationships [38]. CART is an algorithm used to build a decision tree [47], and it uses the Gini index to evaluate a split. The best score is 0 (all records in the node belong to one class), and the worst score occurs when every class is equally represented. The formalization of the Gini index is stated as follows [48]:

$$i(t) = \sum_{j \neq i} p(j \mid t)\, p(i \mid t) \tag{2}$$

where:
- $i(t)$ is the estimated probability of misclassification under the Gini index
- $j$ and $i$ are the classes
- $p$ is the probability
- $t$ is the node
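
The Gini criterion can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual equivalent form of the index, one minus the sum of squared class proportions in a node; the helper names `gini_impurity` and `split_gini` are hypothetical, and the weighting of child nodes by size is the common CART convention rather than a detail reported in this study.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini index of the two child nodes of a candidate split."""
    n = len(left_labels) + len(right_labels)
    if n == 0:
        return 0.0
    return (len(left_labels) / n) * gini_impurity(left_labels) + (
        len(right_labels) / n
    ) * gini_impurity(right_labels)


# A pure node scores 0; a 50/50 binary node scores 0.5 (the binary worst case).
pure = gini_impurity([1, 1, 1, 1])
mixed = gini_impurity([0, 1, 0, 1])
```

A tree builder would evaluate `split_gini` for each candidate split and keep the split with the lowest value.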

C.3. K-Nearest Neighbors (KNN)
KNN calculates the distance between samples and determines the class of each sample. It has three essential parts: first, a collection of labelled objects; second, a distance measure between objects; and third, the number of nearest neighbors. The formalization of the KNN classification (Euclidean distance) is stated as follows [49]:

$$d(X, Y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \tag{3}$$

where:
- $X$ is the class for not booked
- $Y$ is the class for booked
- $x_i$ ($i = 1, \dots, N$) is an attribute of sample instance $X$
- $y_i$ ($i = 1, \dots, N$) is an attribute of sample instance $Y$
- $d$ is the distance for the nearest neighbors
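
The three parts of KNN (labelled objects, a distance, and a neighbor count) can be sketched as follows. This is a minimal illustration using Euclidean distance and a simple majority vote; the function names, the toy points, and the choice of k = 3 are hypothetical and do not reflect the configuration used in this study.

```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Euclidean distance between two attribute vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    # On a tied vote, the label seen first (the nearer neighbor) is returned.
    return votes.most_common(1)[0][0]


# Toy data: label 0 = not booked, label 1 = booked (illustrative values only).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(points, labels, (0.5, 0.5))
near_far_cluster = knn_predict(points, labels, (5.5, 5.5))
```

Using an odd k avoids tied votes in a binary problem such as booked versus not booked.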

C.4. Random Forest
The ensemble method used is the Random Forest classifier, which consists of Decision Trees that are formed randomly and independently from samples of the dataset. It relies on the law of large numbers, so it does not overfit and can be good for prediction [41]. Furthermore, it can be used for any dataset because it does not need a distribution assumption [38]. Its weakness is that it can be biased because the samples contain different compositions of the predicted label [50]. The formalization of the random forest classifier is stated as follows [51]:

$$\hat{C}_{\mathrm{rf}}^{B}(x) = \text{majority vote}\,\{\hat{C}_b(x)\}_{b=1}^{B} \tag{4}$$

where:
- $\hat{C}_{\mathrm{rf}}^{B}(x)$ is the score of the Random Forest
- $B$ is the total number of trees used in the Random Forest
- $\hat{C}_b(x)$ is the score of a single tree
- the majority vote is the score that occurs most often
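
The aggregation step of a Random Forest, in which the forest returns the class predicted most often by its trees, can be sketched as below. This illustrates only the voting step, not tree training or bootstrap sampling; the three threshold rules stand in for trained trees, and the feature names and thresholds are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label that occurs most often among the tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Collect one prediction per tree and aggregate them by majority vote."""
    return majority_vote([tree(sample) for tree in trees])


# Hypothetical stand-ins for trained trees: simple threshold rules (1 = booked).
trees = [
    lambda s: 1 if s["price"] < 100 else 0,
    lambda s: 1 if s["reviews"] > 10 else 0,
    lambda s: 1 if s["minimum_nights"] <= 2 else 0,
]
sample = {"price": 80, "reviews": 4, "minimum_nights": 1}
prediction = forest_predict(trees, sample)  # two of the three rules vote 1
```

Because each tree sees a different random sample, individual errors tend to cancel out in the vote, which is the intuition behind the ensemble's robustness.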

D. Evaluation, Validation and Model Selection
To assess the prediction performance of the models, we used two different evaluation methods, namely the ten-fold three-way split and the ten-fold cross-validation procedures [52]. In the ten-fold three-way split procedure, we grouped the data in two ways. For the first grouping, we divided the dataset into ten folds of nearly, but not exactly, equal size. From a total of 53,514 records, folds one to nine each consist of 5,352 records, and fold ten consists of 5,346 records. The second grouping was functional: the folds were assigned to the training, validation, and testing sets.
First, the training set was used to fit the proposed model to the data points. Second, the validation set was used to evaluate the models trained on the training set and identify the most accurate one. Third, the testing set was used to generate the final prediction score for each generated model. The number of records in the training, validation, and testing sets depended on which fold was used for testing. If the testing set was fold number ten (5,346 records), the training set consisted of 42,816 records (5,352 x 8 folds) and the validation set consisted of 5,352 records. If the testing set was not fold number ten (5,352 records), the training set consisted of 42,810 records (5,352 x 7 folds + 5,346 records from fold number ten) and the validation set consisted of 5,352 records. In total, there were 90 testing combinations (ten choices of testing fold times nine choices of validation fold).
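
The fold arithmetic of the three-way split can be checked with a short Python sketch. It uses only the record counts reported in the text; the function name `three_way_splits` and the zero-based fold indices are illustrative assumptions, and the sketch enumerates fold assignments without touching any data.

```python
from itertools import permutations

N_RECORDS = 53_514  # filtered dataset size reported in the text
N_FOLDS = 10
FOLD_SIZE = 5_352   # size of folds one to nine reported in the text

fold_sizes = [FOLD_SIZE] * (N_FOLDS - 1)
fold_sizes.append(N_RECORDS - sum(fold_sizes))  # fold ten takes the remainder


def three_way_splits(sizes):
    """Yield (test_fold, validation_fold, train_size) for every fold assignment."""
    total = sum(sizes)
    # Each ordered pair of distinct folds gives one testing/validation choice;
    # the remaining eight folds form the training set.
    for test, val in permutations(range(len(sizes)), 2):
        yield test, val, total - sizes[test] - sizes[val]


splits = list(three_way_splits(fold_sizes))
train_sizes = sorted({train for _, _, train in splits})
```

With these inputs, the sketch yields 90 combinations, a last fold of 5,346 records, and training sets of either 42,810 or 42,816 records, which matches the counts described above.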
In the second procedure, the ten-fold cross-validation, we divided the data into nine folds for training and one fold for testing. In total, there were ten testing combinations. The prediction scores of the evaluated models under the ten-fold three-way split and the ten-fold cross-validation procedures were compared, and the model with the highest prediction score was selected. In this study, the area under the receiver operating characteristic (ROC) curve, or simply the AUC value, was used as the prediction score because it is a better measure than accuracy [53]. Mathematically, we formalize the AUC score as follows:

$$\hat{A} = \frac{S_0 - n_0 (n_0 + 1)/2}{n_0\, n_1} \tag{5}$$

where:
- $\hat{A}$ is the AUC score
- $n_1$ is the number of examples in the negative class
- $n_0$ is the number of examples in the positive class
- $S_0$ is $\sum r_i$, and $r_i$ is the rank of the $i$-th positive example in the ranked list
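
The rank-based AUC computation can be sketched in plain Python. This is a minimal illustration, assuming that examples are ranked in increasing order of predicted score and that tied scores share their average rank (the tie handling is an assumption added here, not a detail from the study); the function name `auc_from_ranks` is hypothetical.

```python
def auc_from_ranks(labels, scores):
    """AUC = (S0 - n0 * (n0 + 1) / 2) / (n0 * n1), with S0 the rank sum of positives."""
    order = sorted(range(len(scores)), key=lambda idx: scores[idx])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Find the run of tied scores and give each member the average 1-based rank.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: 1 = booked, 0 = not booked; scores are predicted probabilities.
auc = auc_from_ranks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the toy example, three of the four positive-negative pairs are ordered correctly, so the score is 0.75.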

E. Model Use and Reporting
The models were compared in terms of model development time and prediction score. The model with the best AUC-ROC prediction score was selected so that decision-makers can use it to formulate better revenue management responses. Table 5 and Table 6 show the evaluation results of the constructed machine learning classification models. Supervised machine learning constructs a model automatically from the training dataset. Through its learning algorithm, it tries to identify and construct a generalizable pattern that reflects the relationship between the dependent (target) and independent variables. Based on the constructed pattern, the model can then predict the target variable from the observed independent variables.

IV. RESULTS
To test the accuracy of a model, the predictions of the constructed model were compared with the actual values of the target variable. When the target variable is categorical, the AUC-ROC score is commonly used to evaluate how well the model differentiates among the classes. The AUC-ROC score is a simple evaluation method [54] and is deemed better than the prediction accuracy score as an evaluation method [53]. The Logistic Regression, Decision Tree, KNN, and Random Forest models were evaluated with the ten-fold three-way split and the ten-fold cross-validation procedures.
Afrianto & Wasesa, Journal of Information Systems Engineering and Business Intelligence, 2020, 6 (2), 123-132
Table 5 shows the results for the three-way split procedure and compares the models with and without the data standardization process. Data standardization increased the average and decreased the standard deviation of the AUC-ROC scores of the Logistic Regression and KNN models in both the validation and testing conditions. However, the data standardization process did not increase the AUC-ROC scores of the Decision Tree and Random Forest models. The Decision Tree had the lowest AUC-ROC score, but it had the shortest model development time. Furthermore, the Random Forest classifier had a 15.03% higher AUC-ROC score than Decision Tree, 2.93% higher than KNN, and 2.38% higher than Logistic Regression. Therefore, using the ten-fold three-way split procedure, we concluded that Random Forest performed best.
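
The standardization step can be sketched as follows. The text does not specify the exact transformation, so this minimal Python example assumes z-score standardization (subtract the mean, divide by the standard deviation); the function name `standardize` and the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale one numeric feature to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical nightly prices on a much larger scale than other predictors.
prices = [50, 100, 150]
z_prices = standardize(prices)
```

Distance-based and coefficient-based models such as KNN and Logistic Regression are sensitive to feature scale, whereas tree splits depend only on the ordering of values within a feature, which is consistent with the pattern reported for Table 5. To avoid information leakage, the mean and standard deviation would normally be estimated on the training set only and then reused for the validation and testing sets.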
The results for the ten-fold cross-validation procedure are shown in Table 6. The fold column shows the fold sequence, and the remaining columns show the score for each technique. The average score and standard deviation are given at the bottom of the table. Logistic Regression and KNN yielded better AUC-ROC scores after the data standardization process. The Decision Tree had the shortest processing time, but it yielded the lowest score. The Random Forest classifier yielded a 26.99% higher AUC-ROC score than Decision Tree, 4.41% higher than KNN, and 3.31% higher than the Logistic Regression models. Therefore, using the ten-fold cross-validation procedure, we also concluded that Random Forest performed best.

V. DISCUSSION
In terms of the average AUC-ROC score, the Random Forest models performed best in both evaluation procedures. They reached an average AUC-ROC score of 0.773 in the ten-fold three-way split condition and 0.781 in the ten-fold cross-validation condition. In terms of classifier category, the ensemble method outperformed the singular methods, which suggests that the ensemble method was better than the singular methods at dealing with bias, noise, and variance. The Decision Tree (single tree) model yielded the lowest score because single trees tend to be inaccurate [55], but it had the shortest processing time because of its simplicity.
The data standardization process increased the AUC-ROC scores of Logistic Regression and KNN. Interestingly, the processing time of the Logistic Regression models decreased after data standardization. Logistic Regression is affected by outliers, and data standardization can mitigate their negative effect. This explains why the score increased and the model development time decreased for the Logistic Regression models. Furthermore, the finding that the Random Forest model achieved the highest AUC-ROC score is in line with an earlier study [41]. The Random Forest model has several advantages, such as the ability to handle outliers and noise.
The two evaluation procedures produced different average AUC-ROC scores for the same model. Random Forest and Logistic Regression yielded a higher average score in the ten-fold cross-validation procedure than in the ten-fold three-way split procedure. However, the other models produced a higher AUC-ROC score in the three-way split procedure. In addition, the differences in the average AUC-ROC score among the models were larger in the cross-validation procedure than in the three-way split procedure. This suggests that the size of the training set affects the testing score of each model.

VI. CONCLUSION
Considering the importance of demand forecasts in the revenue management context, this study analyses four machine learning techniques to predict the booking likelihood of accommodation listings. We evaluated the AUC-ROC score of each model using two different evaluation methods, i.e., the ten-fold three-way split and the ten-fold cross-validation procedures.
In terms of the AUC-ROC score, the Random Forest classifier outperformed the other models, i.e., Logistic Regression, Decision Tree, and K-Nearest Neighbor. The Decision Tree model had the lowest AUC-ROC score, but it had the shortest processing time. Among the evaluated models, the Random Forest models performed best in predicting the booking likelihood of accommodation listings. The findings can help peer-to-peer accommodation owners improve their predictions and their revenue management responses. In terms of contribution to the literature, this study informs the choice of prediction method for the booking likelihood.