Human Development Clustering in Indonesia: Using K-Means Method and Based on Human Development Index Categories

The quality of life for Indonesia's population can be measured from the human development index in each province. People who have a good quality of life indicate a prosperous life. The government has the responsibility to advance the welfare of the nation under the mandate of the constitution. The clustering of the Human Development Index (HDI) in Indonesia is used to determine the distribution of quality of life or the distribution of social welfare. In this study, the K-Means method, which is a popular non-hierarchical clustering method, is used to classify human development in each province based on HDI indicators, namely Life Expectancy at Birth, Expected Years of Schooling, Mean Years of Schooling, and Adjusted Expenditure Per Capita. Provinces in Indonesia are clustered into 4 clusters. These results were also compared with the clustering based on HDI categories determined by Statistics Indonesia based on certain cut-off values. According to the HDI category, provinces in Indonesia fall into the medium, high, and very high categories. The results of the two groupings show that there is a trend toward appropriate characteristics for each group. Thus, K-Means can classify provinces in Indonesia according to the characteristics of the HDI indicators.


I. INTRODUCTION
General welfare is a community right and has been regulated by the Indonesian state constitution. Thus, the Indonesian government is responsible for fulfilling the state's mandate. Country achievements in national development can be seen from various factors. Economic growth and the quality of human resources are several factors that support the success of a country. The human development index (HDI) is one measure to assess the quality of human resources [1]. HDI is a measure of the quality of human life as well as an indicator of development goals [2]. HDI explains how citizens can access development outcomes in terms of income, health, education, and other aspects of life [3]-[5]. The United Nations Development Programme (UNDP) has used three dimensions to form HDI, namely long and healthy life, knowledge, and a decent standard of living [2], [6].
Indonesia is an archipelagic country consisting of several large islands and island groups, namely Sumatra, Kalimantan, Sulawesi, Java, Bali, Nusa Tenggara, Maluku, and Papua [2]. The diversity of these regions is a challenge for the government in human development. The value of the human development index can be used as a reference for the government to make budget policies for each region and strategies for achieving national or regional development [2]. Regional grouping based on human development is used to determine the distribution of the quality of life of the population. By knowing this distribution, the government can formulate short-term and long-term development strategies to improve the quality of life of the population.
Indonesia's HDI has increased from year to year [2]. In Indonesia, HDI is categorized into 4 categories, i.e. low, medium, high, and very high, with cutoff points at 60, 70, and 80 [1]. The cutoff point for low HDI determined by UNDP is slightly different, namely lower than 55 [6]. Most regencies or cities in Indonesia have a medium HDI [2]. Lampung Province, Central Sulawesi Province, and Maluku Province changed from medium to high HDI with an HDI growth of 0.79%, 0.70%, and 0.73% in 2022 [2]. If the categorization of human development in an area uses only a composite index, then the choice of the cutoff point for each category becomes crucial.
Cluster analysis aims to group observations into clusters depending on the similarities and dissimilarities in the characteristics of the dataset. K-Means is a frequently used non-hierarchical clustering method [7]-[9]. The K-Means algorithm assigns each object to the cluster that has the closest centroid. Thus, the objective of K-Means is to obtain minimum within-cluster variation and maximum between-cluster variation.
Much research has been carried out regarding the grouping of human development in Indonesia based on HDI indicators by using cluster analysis, such as hierarchical clustering [5], K-Means [4], [5], [10], K-Medoids [5], and Fuzzy C-Means [10]. The researchers found that the number of clusters is four using K-Means [4], [5], [10] and Fuzzy C-Means [5]. HDI grouping in a particular area in Indonesia was also carried out [1], [11], [12]. Human development of countries in the world has also been grouped using the K-Means and Partitioning Around Medoids algorithms, which offer cut-off values to classify countries with low (lower than 65), medium (65-85), and high (greater than 85) human development [13]. The study by [3] concludes that there is a spatial dependence, so that the HDI in a region can be influenced by the HDI in the nearest area.
In this study, provinces in Indonesia were grouped based on HDI indicators (Life Expectancy at Birth, Expected Years of Schooling, Mean Years of Schooling, and Adjusted Expenditure Per Capita) using K-Means and according to HDI categories (low, medium, high, and very high). Next, the characteristics of each cluster or category of the two grouping schemes are analyzed. This study also evaluates the distribution of HDI from each grouping result.

A. Cluster Analysis
Cluster analysis groups similar elements, as research objects, into different and mutually exclusive groups [1], [14]. This analysis is useful for summarizing data by grouping objects based on certain characteristic similarities among the objects to be studied. Cluster analysis groups n objects based on variables for which the objects have relatively similar characteristics, so that the variance within groups is smaller than the variance between groups. Objects are assigned to clusters so that the objects in one cluster have similar characteristics.
Cluster analysis consists of two methods, i.e. hierarchical and non-hierarchical methods [15]. The hierarchical method starts by grouping the data that have the closest similarity [16]. It then proceeds to the objects that have the second closest proximity, and so on. Hence the groups form a kind of tree called the dendrogram, where there is a clear hierarchy between objects, from the most similar to the least similar. The non-hierarchical clustering method is used more for grouping objects than for grouping variables, and it partitions them into k clusters. The number of clusters, k, is determined by the researcher or as part of the clustering procedure. Non-hierarchical clustering can be applied better in large samples than hierarchical methods.

A.1. K-Means Method
A popular non-hierarchical method or partitioning method is the K-Means method [7]-[9]. The algorithm of this method groups objects based on the distance between the objects and the cluster centroid [17], where the assignment is obtained by an iterative process. The analysis needs the number of clusters, k, as input to the algorithm [12], [18]. The goal of K-Means clustering is to get clusters of objects by maximizing the similarity of objects within clusters (or minimizing variance within clusters) and maximizing the differences between clusters.
The effect of different units and outlying values in the dataset can be reduced by normalizing the data [19]. The original data is transformed so that all variables share the same scale, for example with min-max normalization, which performs a linear transformation on the original data. Suppose we want to transform the values of a variable $X$. The formula is shown in (1):

$$x' = \frac{x - \min(X)}{\max(X) - \min(X)} \tag{1}$$

where $x'$ is the normalized value of the original value $x$, $\min(X)$ is the minimum value of $X$, and $\max(X)$ is the maximum value of $X$. After min-max normalization, the values lie between 0 (minimum value) and 1 (maximum value).
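As an illustration of the min-max transformation in (1), the following is a minimal sketch in plain Python. The function name `min_max_normalize`, the handling of a constant column, and the sample values are assumptions made for the example and are not taken from the study.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly onto [0, 1], following (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant variable: (1) is undefined, so return zeros
        # (an assumed convention, not prescribed by the text).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative values only, not the study's data.
life_expectancy = [65.0, 70.0, 75.0]
normalized = min_max_normalize(life_expectancy)
print(normalized)  # [0.0, 0.5, 1.0]
```

Each indicator would be normalized separately, so that a variable measured in rupiahs does not dominate one measured in years when distances are computed.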
Suppose that n observations (objects) from a dataset consisting of p variables are partitioned into k clusters, namely $C_1, C_2, \ldots, C_k$. The K-Means method groups the objects into k clusters so as to obtain a minimum within-cluster sum of squares (WCSS), given in (2):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2 \tag{2}$$

WCSS is the objective function for this algorithm [5], [17], where $x_i$ is the ith observation, $c_j$ is the centroid of cluster j, $C_j$ is the jth cluster, $i = 1, 2, \ldots, n$, and $j = 1, 2, \ldots, k$. To classify each observation into k clusters, the K-Means algorithm is divided into the following steps [17]:
1. Determine the number of clusters, k, and select k points as initial centroids (each consists of p coordinates).
2. Put the observations into the cluster with the closest centroid. The closeness of an observation to a centroid is measured by the Euclidean distance, $d_{ij}$, presented in (3):

$$d_{ij} = \sqrt{\sum_{\ell=1}^{p} \left( x_{i\ell} - c_{j\ell} \right)^2} \tag{3}$$

where $d_{ij}$ is the distance between the ith observation and the jth centroid, $x_{i\ell}$ is the ith observation on variable $\ell$, $c_{j\ell}$ is the $\ell$th coordinate of centroid j, and $\ell = 1, 2, \ldots, p$.
3. Calculate the new centroid for each cluster using the mean value of all objects in each cluster. Each centroid coordinate for cluster j is calculated with (4):

$$c_{j\ell} = \frac{1}{n_j} \sum_{x_i \in C_j} x_{i\ell} \tag{4}$$

where $|C_j| = n_j$ is the number of objects in the jth cluster, or the size of $C_j$.
4. Repeat steps 2 and 3 until no rearrangements are possible.
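The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the study: the function names, the random choice of initial centroids, the `seed` and `max_iter` parameters, the rule for an empty cluster, and the toy data are all assumptions.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points, as in (3)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids, wcss) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: pick k observations as initial centroids (assumed strategy).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop when no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members, as in (4).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Objective function (2): within-cluster sum of squares.
    wcss = sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Toy data with two well-separated groups (illustrative only).
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, wcss = k_means(toy, k=2)
print(labels, round(wcss, 4))
```

Because the result can depend on the initial centroids, it is common to rerun the procedure from several starting points and keep the solution with the smallest WCSS.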

A.2. The Optimal Number of Clusters
The K-Means method is a non-hierarchical clustering method, so researchers determine the number of clusters themselves [1]. The selection of the number of clusters is very important for the clustering results obtained. Therefore we need a method to determine the optimal number of clusters, one of which is the Elbow method [1], [9], [13], [17]. The graph of the Elbow method plots the number of clusters (x-axis) against the total variation within the clusters (y-axis), which can be calculated from (2). The elbow point of the curve determines the number of clusters built [13], [17].
Validating how well the observations are grouped based on the variables in the dataset is an important part of the clustering algorithm [20]. In this study, it is measured by the Calinski-Harabasz (CH) index, which is also called the Variance Ratio Criterion (VRC), presented in (5) as follows [5], [20]-[22]:

$$\mathrm{CH} = \frac{\mathrm{BCSS} / (k - 1)}{\mathrm{WCSS} / (n - k)} \tag{5}$$

where WCSS is shown in (2) and BCSS is the between-cluster sum of squares,

$$\mathrm{BCSS} = \sum_{j=1}^{k} n_j \lVert c_j - c \rVert^2$$

with $n_j$ the number of objects in cluster j, $c_j$ the centroid of cluster j, and $c$ the centroid of the dataset (global centroid). The greater the BCSS value, the greater the degree of dispersion between clusters (increasing differences between clusters), while the smaller the WCSS value, the lower the degree of dispersion within a cluster (the higher the similarity between objects in a cluster). A greater value of the CH index indicates a better grouping effect [22]. Therefore, the CH index can also suggest the optimal number of clusters in cluster analysis.
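To make the validation measure concrete, the sketch below computes WCSS, BCSS, and the CH index in (5) for a given set of cluster labels. The function name `calinski_harabasz` and the toy data are assumptions for illustration. The index is only defined for at least two clusters and a non-zero WCSS.

```python
def calinski_harabasz(points, labels):
    """Return (wcss, bcss, ch) for points and their cluster labels.

    Requires at least two clusters, fewer clusters than points,
    and a non-zero WCSS.
    """
    n = len(points)
    dim = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    # Global centroid of the whole dataset.
    global_centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    wcss = 0.0
    bcss = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Within-cluster part: squared distances of members to their centroid.
        wcss += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members
        )
        # Between-cluster part: cluster size times squared distance
        # from the cluster centroid to the global centroid.
        bcss += len(members) * sum(
            (centroid[d] - global_centroid[d]) ** 2 for d in range(dim)
        )
    ch = (bcss / (k - 1)) / (wcss / (n - k))
    return wcss, bcss, ch


# Toy data with two well-separated groups (illustrative only).
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels = [0, 0, 0, 1, 1, 1]
wcss, bcss, ch = calinski_harabasz(toy, toy_labels)
print(round(wcss, 4), round(bcss, 4), round(ch, 4))
```

For the Elbow method, the same WCSS value would be recorded for a range of k (for example k = 1 to 10) and plotted against k. The CH index can be compared across k from 2 upward.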

B. Human Development Index
The Human Development Index (HDI) is a measure of progress in efforts to improve the quality of life of the community by taking into account development outcomes such as income, health, and education. The term Human Development Index (HDI) was popularized by the United Nations Development Programme (UNDP) in 1990 [2], [6]. The UNDP defines human development as a process of expanding choices for the population, in the sense that people are given the freedom to choose among more options for meeting life's needs.
UNDP uses three dimensions to construct HDI. This was also followed by Indonesia. These dimensions are long and healthy life, knowledge, and a decent standard of living [1], [2], [12], [14]. These three dimensions are the chosen approach in describing the quality of human life and have not changed until now. The HDI calculation carried out in Indonesia refers to these 3 dimensions. Each dimension is represented by an index value obtained through data normalization, generally using (1) to prevent unit differences [13], [19]. The UNDP uses gross national income (GNI) per capita as an indicator of the extent of a decent standard of living. However, this information is not available at the regional level, so an adjusted real expenditure per person is used as an alternative. This indicator can be scaled down to the regional or city level. Indicators of real expenditure per capita can also reflect indicators of people's income and describe the wealth of the population as a result of economic activity. In calculating the expenditure index, the maximum (26,572.352 thousand rupiahs) and minimum (1,007.436 thousand rupiahs) limits are used [2] as follows.
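HDI is calculated as the geometric mean of the life expectancy (health), education, and expenditure indices, as shown in (11):

$$\mathrm{HDI} = \sqrt[3]{I_{\text{health}} \times I_{\text{education}} \times I_{\text{expenditure}}} \times 100 \tag{11}$$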
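The calculation can be sketched in Python as below. Only the expenditure limits are taken from the text. The limits for life expectancy and schooling, the logarithmic form of the expenditure index, the averaging of the two schooling indices into one education index, and the scaling to 0-100 are assumptions based on the commonly published Statistics Indonesia method and should be checked against [2]. The input values are illustrative.

```python
import math

# (minimum, maximum) limits per indicator. Only the EPC limits
# (thousand rupiahs) come from the text; the others are assumed.
LIMITS = {
    "LEB": (20.0, 85.0),            # assumed, years
    "EYS": (0.0, 18.0),             # assumed, years
    "MYS": (0.0, 15.0),             # assumed, years
    "EPC": (1007.436, 26572.352),   # from the text, thousand rupiahs
}


def dimension_index(value, lo, hi):
    """Min-max index of one indicator, following (1)."""
    return (value - lo) / (hi - lo)


def hdi(leb, eys, mys, epc):
    """Geometric mean of the three dimension indices, scaled to 0-100."""
    i_health = dimension_index(leb, *LIMITS["LEB"])
    # Assumption: the education index is the mean of the two schooling indices.
    i_education = (
        dimension_index(eys, *LIMITS["EYS"]) + dimension_index(mys, *LIMITS["MYS"])
    ) / 2
    # Assumption: expenditure enters on a natural-log scale.
    lo, hi = LIMITS["EPC"]
    i_expenditure = (math.log(epc) - math.log(lo)) / (math.log(hi) - math.log(lo))
    return (i_health * i_education * i_expenditure) ** (1 / 3) * 100


# Illustrative indicator values, not taken from the study's tables.
example = hdi(leb=70.0, eys=13.0, mys=9.0, epc=11000.0)
print(round(example, 2))
```

Because the geometric mean multiplies the indices, a low value in one dimension cannot be fully offset by a high value in another.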

III. METHODOLOGY
In this study, the data was taken from Statistics Indonesia (BPS) regarding the human development index in Indonesia in 2022. The research units were the 34 provinces in Indonesia. The provinces are grouped using the K-Means method based on HDI indicators, namely Life Expectancy at Birth (LEB), Expected Years of Schooling (EYS), Mean Years of Schooling (MYS), and Adjusted Expenditure Per Capita (EPC). These indicators have different units, so the data is normalized using (1). This normalization is used to reduce the influence of outliers and obtain a more homogeneous relative contribution between variables [19], so that the grouping results are accurate, because objects are grouped based on the distance between the data points and the center point (centroid). After normalization, the provinces in Indonesia are grouped using the K-Means method based on these HDI indicators. The optimal number of clusters is determined based on the Elbow method and the Calinski-Harabasz index with (2) and (5) respectively. From the K-Means results, the characteristics of each cluster are analyzed. These results are then compared to the grouping of HDI values into the low, medium, high, and very high categories. The HDI characteristics of each group are also discussed, both for the K-Means clusters and for the HDI categories.

A. Clustering Human Development Using K-Means Method
The first stage in the K-Means algorithm is to determine the number of clusters formed. For this reason, the Elbow method is used to determine the optimal k clusters. Figure 1 shows that the optimal number of clusters is four for grouping provinces in Indonesia based on the 2022 HDI indicators (normalized data). This can be seen from the WCSS values in Figure 1(a), which change little from k = 4 to k = 5 and beyond, so the number of clusters formed is 4. The Calinski-Harabasz (CH) index in Figure 1(b) is used to evaluate the clustering results and to determine the optimal number of clusters, which is k = 4. When the data is grouped without normalization, the CH index value is much larger, so the result tends to be inaccurate.
Table 1 shows the original data for ease of interpretation, but the data used to create clusters is normalized data.
Figure 2. Three-dimensional plot of the K-Means method
Table 1 shows that the standard deviation of each indicator in Cluster 4 is the largest compared to other clusters. The life expectancy at birth of the Indonesian population is between 65-75 years with a standard deviation of about 1-2 years. The average expected years of schooling is around 12-14 years, which illustrates that most Indonesians have the opportunity to pursue formal education up to high school. However, the mean years of schooling is lower than the expected years of schooling; that is, not all people complete their education up to high school (12 years), and some complete only elementary or junior high school. Adjusted expenditure per capita of the population varies from around 7 million rupiahs to 19 million rupiahs.
The grouping obtained from the K-Means algorithm is illustrated with a map of Indonesia in Figure 3. The provinces on Sumatera Island are grouped into Cluster 1 (4 provinces), Cluster 2 (5 provinces), and Cluster 4 (1 province). Provinces on Kalimantan Island are included in Cluster 2, except for East Kalimantan Province (Cluster 4). Provinces on Java Island are included in Cluster 2, except for the Special Region of Yogyakarta (Cluster 4). Each of the 2 provinces in Sulawesi Island is included in Cluster 1 and Cluster 2, and 3 other provinces are included in Cluster 3. Provinces in Papua Island are included in Cluster 3. The Provinces of North Maluku and Maluku are included in Cluster 1, the Province of Bali is grouped in Cluster 4, and the Provinces in Nusa Tenggara are included in Cluster 3.

B. Clustering Human Development Based on HDI Categories
The HDI value is obtained from the calculation of the geometric mean as in (11). Based on BPS cutoff points, the HDI values of the provinces in Indonesia in 2022 fall into the medium category (8 provinces), high category (24 provinces), and very high category (2 provinces), and none are included in the low category. The characteristics of these categories based on HDI indicators are described in Table 2. For all indicators, the highest average and the lowest average are in the very high and medium categories respectively. This also corresponds to the maximum and minimum values. The minimum values of LEB, EYS, MYS, and EPC are 65.63 years, 11.14 years, 7.02 years, and 7.146 million rupiahs respectively, as shown in Table 1 and Table 2. The lowest LEB is in West Sulawesi Province and the lowest EYS, MYS, and EPC are in Papua Province.
Figure 4
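The category assignment can be sketched as a simple threshold rule using the cutoff points 60, 70, and 80 stated in the Introduction. The function name `hdi_category` and the treatment of values exactly on a cutoff (assigned to the higher category) are assumptions for illustration.

```python
def hdi_category(hdi):
    """Map an HDI value to a category using cutoffs at 60, 70, and 80.

    Assumption: a value exactly equal to a cutoff belongs to the
    higher category (for example, 70 is "high").
    """
    if hdi < 60:
        return "low"
    if hdi < 70:
        return "medium"
    if hdi < 80:
        return "high"
    return "very high"


# 61.39 and 69.81 are the ends of the medium-category range reported
# in the text; the other values are illustrative.
for value in (59.9, 61.39, 69.81, 70.0, 80.0):
    print(value, hdi_category(value))
```

Unlike the K-Means clusters, these categories depend only on the composite index, so two provinces with the same HDI but different indicator profiles always receive the same label.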

C. Comparison K-Means Clustering and HDI Categories
The cross-tabulation of the K-Means clusters and the HDI categories in Table 3 is used to compare the two groupings. Provinces that are grouped in Cluster 1 and Cluster 2 are included in the medium and high categories. Provinces that are members of Cluster 4 are included in the high and very high categories, while the 6 provinces that are included in Cluster 3 are categorized in the medium HDI. After grouping the provinces in Indonesia (Figure 3 and Figure 4), we compare the distribution of HDI values for each group. Table 4 shows that the ranges of HDI values of the clusters are not mutually exclusive, because they overlap: the same value can fall within the range of more than one cluster. For example, an HDI value of 70 falls within the ranges of both Cluster 1 and Cluster 2. This is different from the grouping based on HDI categories (Table 5), which are mutually exclusive and exhaustive by construction. Cluster 3 has a similar HDI range to the medium category, which is between 61.39 and 69.81. The HDI in the very high category tends to be similar to Cluster 4, although the ranges of values for the two are different. This similarity can also be seen from the grouping results in Figure 3 and Figure 4. Cluster 1 and Cluster 2 have almost the same range, namely between 68-74. However, when compared to Table 2, Cluster 1 tends to have higher Expected Years of Schooling and Mean Years of Schooling scores than Cluster 2, while Cluster 2 tends to have higher Life Expectancy at Birth and Adjusted Expenditure Per Capita than Cluster 1. Because of these differences in characteristics, the K-Means method places them in different clusters.
Provinces in Indonesia are clustered into 4 clusters based on the Human Development Index indicators in 2022 using the K-Means method. Based on BPS criteria, the HDI values of provinces in Indonesia fall into 3 categories, namely medium, high, and very high.
A comparison of the two groupings shows that they are compatible: the medium category tends to be similar to Cluster 3, while the very high category is similar to Cluster 4. Provinces that are included in Cluster 3 need attention to improve all HDI indicators. The members of Cluster 4 also need equity in human development across their regencies and cities. Provinces that are included in Cluster 1 and Cluster 2 have almost the same HDI ranges, but the characteristics of their HDI indicators are different, so K-Means places them in different clusters. Therefore, provinces that are members of Cluster 1 have to improve on Life Expectancy at Birth and Adjusted Expenditure Per Capita, while Cluster 2 should focus on Expected Years of Schooling and Mean Years of Schooling.