Predicting Students Graduate on Time Using C4.5 Algorithm

Background: Facilitating an effective learning process is the goal of higher education institutions. Despite improvements in curriculum and resources, many students cannot graduate on time. In most cases, the number of students who graduate on time is lower than the number of new students enrolling in universities. This could reduce the chance for students to learn effectively as the ratio between faculty members and students becomes non-ideal. Objective: This study aims to present a prediction model for students’ on-time graduation using the C4.5 algorithm by considering four features, namely the department, GPA, English score, and age. Methods: This research was completed in three stages: data pre-processing, data processing, and performance measurement. The prediction scheme makes its prediction based on the department of study, age, GPA, and English proficiency. Results: The model successfully predicted students’ graduation, based on data from students who graduated in 2008-2014. The prediction achieved 90% accuracy using 300 test records. Conclusion: The findings are expected to be useful for universities in administering their teaching and learning process.


I. INTRODUCTION
Whether or not students can experience optimum learning and then graduate on time depends on, among other factors, the quality of education at the university, the degree program, and the quality of facilities and human resources [1]. Late graduation is likely to cause extra workload for faculty members because they have to, for example, supervise more students at a time. Therefore, universities usually have a strategy to improve and maintain the on-time graduation rate [2] [3].
Data mining can extract knowledge from educational data to improve the quality of the education process [4] and identify strategies for improving students' performance [5]. Students' performance has two aspects, academic achievement and learning progression, and these can be used to predict their success in finishing their studies on time [4] or to design interventions to prevent failure [6]. Data mining has three main functions: clustering data [7] [8], classifying data [9] [10], and identifying association rule patterns [11]. Current studies show that student performance prediction is challenging due to the variety of educational data [12] [13]. Goga et al. [14] designed a framework for an intelligent recommender system based on background factors to recommend necessary actions for improvement. Ashraf et al. [15] also developed an intelligent prediction system based on ensemble

II. RESEARCH METHODOLOGY
The prediction model contains three phases, namely data pre-processing, data processing and performance measurement, as shown in Fig. 1.

Fig. 1 Prediction for on-time Graduation Model
A. Data Pre-processing
Data pre-processing is needed to enhance the quality of the data. The data source has potential problems due to human error. For example, the data in the administration unit's internal format may be unreliable and inconsistent. Therefore, data pre-processing needs to be done before the data processing phase. It also identifies the data attributes that are useful for the data processing phase. The pre-processing phase consists of four steps: data loading, data cleaning, data selection, and data transformation, as shown in Fig. 1. Data loading is the initial step: the data, collected in the form of an Excel file, is loaded into the program for processing. Data cleaning removes data that is unnecessary or will not be used for prediction, namely records that are inconsistent, duplicated, or wrong. The data in Table 1 has the following components:

1) Student Number
This is a unique ID created by the university to distinguish each student's data.

2) Date Registration
The date when the student entered or registered at the university.

3) Date Graduation
The date when students graduate from the university.

4) Department
This attribute contains the student's study program within the Faculty of Industrial Engineering: Informatics, Chemical Engineering, Industrial Engineering, or Electrical Engineering.

5) Date of Birth
This attribute contains the student's date of birth. It is used to determine the student's age.

6) GPA
The GPA variable is based on the GPA of students who have graduated. The GPA values range from 2.48 to 3.87. According to the academic regulations, the graduation predicate is assigned as follows: a GPA below 2.76 is given the title Pass, a GPA of 2.76 to 3.00 the title Satisfactory, a GPA of 3.01 to 3.50 the title Very Satisfactory, and a GPA of 3.51 to 4.00 the title Distinction (Cumlaude).

7) English Proficiency Score
The minimum English proficiency score (TOEFL) for graduation in the engineering faculty is 400. This score is a requirement for each student's thesis examination. The ranges of the English proficiency score are:

B. Data Processing (Prediction Modelling)
The data processing phase splits the output of data pre-processing into training data and testing data, as shown in Fig. 1. The training data is used to extract knowledge from the data: the C4.5 algorithm models the training data by finding nodes, from the root down to the last branch, until no further split can be calculated. Determining the root node is the most important step because the transformed data is partitioned by the C4.5 algorithm starting from that node. The first step is therefore to determine the root node that will be used for branching. A new node or leaf is then created whenever the previous node can still be split further, and this calculation continues until a final decision is reached. The decision tree results from the analysis of problem-solving decisions, depending on the likelihood or probability of each decision. Each leaf node of the resulting tree marks a class label. The decision tree process thus changes the data table into a tree model and then transforms the tree model into rules.
The confusion matrix is used to measure the performance of the proposed prediction model [24]. Four terms represent the results of the classification process, as shown in Fig. 1 and 2: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). True Negative (TN) is the amount of negative data detected correctly, while False Positive (FP) is negative data identified as positive. Meanwhile, True Positive (TP) is positive data that is detected correctly. False Negative (FN) is the opposite of True Positive: the data is positive but identified as negative.
Based on the values of True Negative (TN), False Positive (FP), False Negative (FN), and True Positive (TP), the accuracy, precision, and recall can be obtained [25]. Accuracy describes how well the system classifies data correctly; in other words, it is the ratio of correctly classified data to all data, and it is obtained with Equation (1). Precision is the number of correctly classified positive data divided by the total data classified as positive, and it is obtained with Equation (2). Meanwhile, recall shows what percentage of the positive data is correctly classified by the system, and it is obtained with Equation (3). The F1-score is shown in (4).
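As an illustration of the four measures described above, the following minimal Python sketch counts TP, TN, FP, and FN from two label lists and derives accuracy, precision, recall, and F1-score using their standard definitions. It is not the authors' implementation. The function names, the choice of GoT as the positive class, and the toy labels are assumptions made only for this example.

```python
def confusion_counts(actual, predicted, positive="GoT"):
    """Count TP, TN, FP, FN for a binary labelling.

    Assumption: "GoT" is treated as the positive class; the source does
    not state which class is positive.
    """
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # positive data detected correctly
            else:
                fp += 1  # negative data identified as positive
        else:
            if a == positive:
                fn += 1  # positive data identified as negative
            else:
                tn += 1  # negative data detected correctly
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Standard definitions of accuracy, precision, recall, and F1-score."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only; these are not data from the study.
actual = ["GoT", "GoT", "Not GoT", "Not GoT", "GoT"]
predicted = ["GoT", "Not GoT", "Not GoT", "GoT", "GoT"]
metrics = classification_metrics(*confusion_counts(actual, predicted))
```

The guards against division by zero return 0.0 when a measure is undefined, for example when no record is predicted as positive. This is a convention chosen for the sketch.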

III. RESULTS
In the data pre-processing phase, student data is loaded separately for each major (Informatics, Industrial Engineering, Chemical Engineering, and Electrical Engineering) for the 2008-2014 academic years. The data is in spreadsheet format. Then, the selection process retrieves the components used in this study. The selected data is cleaned by eliminating empty or incomplete records. It consists of the registration date, graduation date, GPA, and English proficiency score. A total of 640 rows of data were obtained from the cleaning process. After cleaning, the data is transformed into a form that suits the data mining process. The result of data pre-processing is illustrated in Table 2. Its main features are GPA, English proficiency score, age, and graduation predicate. The graduation predicate is expressed as Pass, Satisfactory, Very Satisfactory, or Distinction (Cumlaude). The label GoT is used for students who graduated on time and Not GoT for students who did not. This graduation label is derived from the study duration, calculated as the graduation date minus the registration date.
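The transformation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code. The GPA boundaries follow the academic regulations quoted in Section II, and age is taken as the age at registration. The on-time limit of four years (`ON_TIME_YEARS`) is a hypothetical value, since the threshold is not stated in this section, and all function names are illustrative.

```python
from datetime import date

# Assumption: the source does not state the on-time threshold;
# four years is used here purely for illustration.
ON_TIME_YEARS = 4


def gpa_predicate(gpa):
    """Map a GPA to the graduation predicate given in Section II."""
    if gpa < 2.76:
        return "Pass"
    if gpa <= 3.00:
        return "Satisfactory"
    if gpa <= 3.50:
        return "Very Satisfactory"
    return "Distinction"


def age_at(birth, on):
    """Age in completed years on a given date (here, the registration date)."""
    before_birthday = (on.month, on.day) < (birth.month, birth.day)
    return on.year - birth.year - before_birthday


def graduation_label(registered, graduated, limit_years=ON_TIME_YEARS):
    """Label a record GoT if graduation falls within the assumed limit."""
    deadline = (registered.year + limit_years, registered.month, registered.day)
    on_time = (graduated.year, graduated.month, graduated.day) <= deadline
    return "GoT" if on_time else "Not GoT"


# Hypothetical record for illustration; not taken from the dataset.
record = {
    "department": "Informatics",
    "birth": date(1990, 8, 15),
    "registered": date(2008, 9, 1),
    "graduated": date(2012, 8, 30),
    "gpa": 3.42,
}
transformed = {
    "department": record["department"],
    "predicate": gpa_predicate(record["gpa"]),
    "age": age_at(record["birth"], record["registered"]),
    "label": graduation_label(record["registered"], record["graduated"]),
}
```

The English proficiency score is left out of the sketch because its class ranges are not given in this section.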
The data pre-processing result is used as training and testing data. Training data is used to form a classifier model; the C4.5 implementation was trained on 640 records. Meanwhile, the testing data is used to test the performance and correctness of the resulting model. In the testing stage, accuracy is measured using the confusion matrix. The prediction results take the form of a new label for the training data, obtained from classification using the C4.5 algorithm and stored in a new column named prediction, as shown in Table 3.
Fig. 2 shows the rule that if an Informatics student's GPA predicate is Very Satisfactory, the English proficiency score is < 400, and the age at entry is 18 years, the student is predicted to graduate on time. Furthermore, if an Informatics student's GPA predicate is Very Satisfactory, the English proficiency score is < 400, and the age at entry is 18 years, the student is considered not to have graduated on time. Based on the decision tree shown in Fig. 2, prediction using the C4.5 algorithm produced the following rules for on-time graduation:
1) If the age is 18 years, then it is on time.
2) If the age is 19 years, the study program is Informatics, and the GPA predicate is Distinction, then it is on time. If the age is 19 years, the study program is Informatics, and the GPA predicate is Very Satisfactory, then it is not on time.
3) If the age is 19 years and the study program is Chemical Engineering, then it is not on time.
4) If the age is 19 years and the study program is Industrial Engineering, then it is not on time.
5) If the age is 19 years and the study program is Electrical Engineering, then it is not on time.
6) If the age is 20 years, then it is not on time.
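The six rules listed above can be written directly as a small Python function, which may help when applying them to new records. This is a sketch only. The function name and the string labels are illustrative, "Informatics" and "Informatics Engineering" are assumed to denote the same study program, and combinations that the rules do not cover return `None` rather than a guessed label.

```python
OTHER_PROGRAMS = {
    "Chemical Engineering",
    "Industrial Engineering",
    "Electrical Engineering",
}


def predict_on_time(age, department, gpa_predicate):
    """Apply the six rules read from the decision tree.

    Returns "GoT", "Not GoT", or None when no listed rule applies.
    """
    if age == 18:                      # rule 1
        return "GoT"
    if age == 19:
        if department == "Informatics":  # rule 2
            if gpa_predicate == "Distinction":
                return "GoT"
            if gpa_predicate == "Very Satisfactory":
                return "Not GoT"
            return None                # predicate not covered by the rules
        if department in OTHER_PROGRAMS:  # rules 3 to 5
            return "Not GoT"
        return None
    if age == 20:                      # rule 6
        return "Not GoT"
    return None                        # age not covered by the rules
```

Returning `None` for uncovered cases keeps the sketch faithful to the listed rules instead of extrapolating beyond them.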
The experiment was conducted in five iterations, testing with 100, 200, 300, 400, and 500 test records, as shown in Table 4. In each iteration, a confusion matrix is used to obtain the accuracy, precision, recall, and F1 score. The best accuracy is 90.00% with 300 test records, while the best precision and F1 score are 61.90% and 57.77%, respectively, with 100 test records. Furthermore, the best recall is 54.76% with 400 test records. The results vary across test sizes, and no single test size gives the best value for every measure. This may indicate that the prediction model requires improvement in the pre-processing phase so that the C4.5 algorithm receives more stable training input. On average over these five iterations, the C4.5 algorithm obtained accuracy, precision, recall, and F1 score of 87.44%, 52.84%, 50.68%, and 51.73%, respectively.

IV. DISCUSSIONS
This study presents a prediction model for students' on-time graduation using the C4.5 algorithm. The data was collected from the faculty of engineering of a private university for the academic years 2008-2014. The data consists of several features: student number, department, registration date, graduation date, date of birth, GPA, and English proficiency score. The data was pre-processed through cleaning, selection, and transformation. Four features were considered as prediction features, namely department, GPA, English score, and age. Age is calculated from the date of birth and the date of registration at the university. The performance of the C4.5 algorithm was evaluated in five iterations to obtain the accuracy, precision, recall, and F1 score.
The C4.5 algorithm starts the prediction by determining the root node. In this study, the algorithm selected age as the root node and department and GPA as the branches. The leaves of the tree are graduating on time and not graduating on time. Six rules were generated from the decision tree learned from the dataset. The university can use these rules to improve the quality of education, prevent student failure in the education process, and achieve on-time graduation.
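To make the root-node selection concrete, the following Python sketch computes the gain ratio, the splitting criterion used by C4.5, for categorical attributes and picks the attribute with the highest value. It is a simplified illustration and not the implementation used in this study: a full C4.5 implementation also handles continuous attributes, missing values, and pruning. The function names and the four toy rows are assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """Information gain of a categorical attribute divided by its split info."""
    total = len(rows)
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    # Weighted entropy of the target after splitting on the attribute.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy([row[target] for row in rows]) - remainder
    split_info = -sum(len(p) / total * log2(len(p) / total)
                      for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0


def best_root(rows, attributes, target):
    """Pick the attribute with the highest gain ratio as the root node."""
    return max(attributes, key=lambda a: gain_ratio(rows, a, target))


# Toy rows for illustration only; these are not data from the study.
rows = [
    {"age": 18, "department": "Informatics", "label": "GoT"},
    {"age": 18, "department": "Chemical Engineering", "label": "GoT"},
    {"age": 20, "department": "Informatics", "label": "Not GoT"},
    {"age": 20, "department": "Chemical Engineering", "label": "Not GoT"},
]
root = best_root(rows, ["age", "department"], "label")
```

In the toy rows, age separates the two labels perfectly while department does not, so age is selected as the root. This mirrors the structure of the tree reported here but says nothing about the actual gain ratios in the dataset.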
This study has at least two limitations that affect the performance evaluation. First, to avoid imbalance in the data transformation, the English score needs to be divided into a number of classes equal to that used for age, department, and GPA. Second, to avoid sampling bias, the experiments should be conducted using cross-validation, whereas this study used a split test as the sampling approach. This approach leaves room for sampling bias, although no impact was seen in this study, because the iterations are determined by the amount of testing data. In future research, other data variables can also be considered based on data availability.
This study still achieved high accuracy even with a large test set. However, the classification technique used for prediction still needs to be optimised using feature selection techniques to achieve the best result across all performance measurements [26]. Furthermore, the sampling technique can also be improved by using k-fold cross-validation to split the test data fairly [27].
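As a sketch of the cross-validation improvement suggested above, the following Python generator splits record indices into k folds so that every record is used for testing exactly once. It is an illustration only. The function name, the choice of k = 5, and the fixed seed are assumptions, and the split is not stratified by class label.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: k=5 and seed=0 are arbitrary choices for this sketch.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test


# With the 640 cleaned records, each of the 5 folds holds 128 test records.
splits = list(k_fold_indices(640, k=5))
```

Averaging the accuracy, precision, recall, and F1 score over the k folds gives an estimate that depends less on one particular choice of test records than a single split does.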

V. CONCLUSIONS
In this study, we present a prediction model for students' on-time graduation using the C4.5 algorithm. The dataset was cleaned, selected, and transformed in the pre-processing stage, resulting in four features that are considered for prediction. The overall analysis of student data from the academic years 2008-2014 using the C4.5 algorithm produced the highest accuracy of 90% with 300 test records. Furthermore, the best classification performance reached a precision of 61.90%, a recall of 54.76%, and an F1 score of 57.77%. The C4.5 algorithm selected age as the root node, with department and GPA forming the branches. However, the English score feature was not used in the tree. This is caused by the data transformation process, which divided the English score into classes on too small a scale, so the feature was eliminated.