Naïve Bayes Evidence Accumulation K-modes Clustering: A New Method for Classifying Binary Data and Its Application to Real Data on Injecting Drug Users
Background: K-modes is a clustering method for discrete data, and the Naïve Bayes classifier is a classification method that predicts unknown true classes. In this research, we improve K-modes results by applying the Evidence Accumulation (EA) method to obtain an initial mode vector, which is retained for use in Naïve Bayes EA K-modes.
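To make the role of the initial mode vector concrete, the following is a minimal pure-Python sketch of plain K-modes on binary data. It is an illustration only, not the R code used in this study: the function names `hamming`, `column_modes`, and `k_modes`, the toy data, and the tie-breaking rules are all assumptions. The initial modes are passed in explicitly, which is where a mode vector retained from EA would enter.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value in each column (ties resolved by Counter's ordering)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(data, init_modes, max_iter=100):
    """Plain K-modes started from the given initial mode vectors (hypothetical helper)."""
    modes = [tuple(m) for m in init_modes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest mode by Hamming distance (lowest index wins ties).
        labels = [min(range(len(modes)), key=lambda j: hamming(row, modes[j]))
                  for row in data]
        # Update step: recompute each mode; an empty cluster keeps its old mode.
        new_modes = []
        for j, old in enumerate(modes):
            members = [row for row, lab in zip(data, labels) if lab == j]
            new_modes.append(column_modes(members) if members else old)
        if new_modes == modes:
            break  # modes stopped changing, so the assignments are stable
        modes = new_modes
    return labels, modes

# Toy binary data (invented for illustration): two visibly different groups.
data = [(1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 1),
        (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 0)]
labels, modes = k_modes(data, init_modes=[data[0], data[3]])
print(labels)
print(modes)
```

Because the result of K-modes depends on where it starts, fixing `init_modes` makes repeated runs reproducible, which is the practical motivation for keeping an initial mode vector.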
Method: The methods are applied to four real datasets whose true classes are known, in order to check the external validity and purity of our methods. The free statistical software R is used, with package klaR for K-modes and EA, and package e1071 for Naïve Bayes. In addition, the methods are applied to a national dataset of Injecting Drug Users (IDU) with a sample size of 2546.
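As an illustration of the external validity check, the sketch below computes purity under its usual definition: each cluster is credited with its most common true class, and the credited counts are divided by the total number of observations. The abstract does not spell out the formula, so this definition, the function name `purity`, and the toy labels are assumptions.

```python
from collections import Counter, defaultdict

def purity(true_labels, cluster_labels):
    """Fraction of observations that belong to the majority true class of their cluster."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    # Count the true classes that occur inside each cluster.
    by_cluster = defaultdict(Counter)
    for t, c in zip(true_labels, cluster_labels):
        by_cluster[c][t] += 1
    # Credit each cluster with the size of its largest true class.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)

# Toy example (invented labels): each cluster holds three of one class and one of the other.
true_classes = ["a", "a", "a", "b", "b", "b", "b", "a"]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
score = purity(true_classes, clusters)
print(score)  # 6 of the 8 observations match their cluster's majority class
```

Purity ranges from 0 to 1 and does not depend on how the clusters are numbered, so it can be used to compare clusterings of the same data produced by different methods.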
Results: The EA K-modes algorithm was applied to five real datasets, and K-modes was then rerun with the retained initial mode vector. The purity of EA K-modes (0.544, 0.862, 0.914, 0.944, 0.625) differs significantly from that of classic K-modes (0.497, 0.610, 0.404, 0.650, 0.625). Finally, we applied the Naïve Bayes classifier with prior probabilities obtained from EA K-modes. For K=2, Naïve Bayes EA K-modes produced better clustering (purity 0.71 and 0.873, against 0.625 and 0.862 for EA K-modes and 0.497 and 0.61 for K-modes).
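One plausible reading of the Naïve Bayes step is sketched below in pure Python: the cluster labels serve as classes, the cluster proportions serve as prior probabilities, and each observation is reassigned to the class with the highest posterior. The abstract does not give these details, so the Bernoulli likelihood, the Laplace smoothing parameter `alpha`, the function names, and the toy data are all assumptions, and this is not the e1071 implementation used in the study.

```python
import math

def fit_bernoulli_nb(data, labels, alpha=1.0):
    """Estimate class priors and per-feature P(x = 1 | class) from cluster labels."""
    n = len(data)
    priors, probs = {}, {}
    for c in sorted(set(labels)):
        rows = [r for r, lab in zip(data, labels) if lab == c]
        priors[c] = len(rows) / n  # prior taken as the cluster's share of the data
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        probs[c] = [(sum(col) + alpha) / (len(rows) + 2 * alpha)
                    for col in zip(*rows)]
    return priors, probs

def predict(row, priors, probs):
    """Return the class with the largest log posterior for one binary row."""
    def log_posterior(c):
        lp = math.log(priors[c])
        for x, p in zip(row, probs[c]):
            lp += math.log(p if x == 1 else 1.0 - p)
        return lp
    return max(priors, key=log_posterior)

# Toy binary data and cluster labels (invented for illustration).
data = [(1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 1),
        (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 0)]
cluster_labels = [0, 0, 0, 1, 1, 1]

priors, probs = fit_bernoulli_nb(data, cluster_labels)
refined = [predict(row, priors, probs) for row in data]
print(refined)
print(predict((1, 1, 0, 1), priors, probs))  # class assigned to an unseen row
```

On data where the clusters are already well separated, as in this toy example, the reassignment leaves the labels unchanged; the step matters only for observations whose posterior favours a different cluster.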
Discussion and Conclusion: In this paper, we proposed Naïve Bayes EA K-modes as a new method for clustering binary data. Our new method leads to stable clustering compared with previous studies. The Naïve Bayes EA K-modes method improves purity and establishes a better separation between clusters.
Keywords: clustering, K-modes, Evidence Accumulation, Naïve Bayes classifier, discrete