The Comparison of Gini and Twoing Algorithms in Terms of Predictive Ability and Misclassification Cost in Data Mining: An Empirical Study
Murat Kayri, İsmail Kayri, "The Comparison of Gini and Twoing Algorithms in Terms of Predictive Ability and Misclassification Cost in Data Mining: An Empirical Study". International Journal of Computer Trends and Technology (IJCTT) V27(1):21-30, September 2015. ISSN:2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.
Abstract -
The classification tree is commonly used in data mining, particularly for investigating interactions among predictors. The splitting rule and the decision tree technique employ algorithms that are largely based on statistical and probability methods. The splitting procedure is the most important phase of classification tree training. The aim of this study is to compare the Gini and Twoing splitting rules in terms of misclassification cost, the balance of the resulting trees, and the importance of the independent variables. The study shows that the Twoing criterion yields a tree that is much more evenly balanced than the tree obtained with the Gini criterion. The misclassification rate differed only slightly between the two methods (19% for the Twoing criterion and 21.2% for the Gini criterion). With the Twoing splitting rule, the independent variables attain higher importance levels and the improvement values are higher than with the Gini algorithm. All things considered, the Twoing splitting rule performed well in this study, combining robustness with high classification accuracy, a balanced tree structure, and a clearer ranking of the importance of the independent variables.
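For reference, the two splitting criteria compared here are usually stated as follows in the CART literature (Breiman, Friedman, Olshen, and Stone, 1984); the notation below (node $t$, class labels $j$, left and right child nodes $t_L$, $t_R$ receiving proportions $p_L$, $p_R$ of the cases under a candidate split $s$) is a conventional sketch and not the authors' own formulation:

$$ i(t) = 1 - \sum_{j} p(j \mid t)^2, \qquad \Delta i(s,t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R) $$

$$ \Phi(s,t) = \frac{p_L\, p_R}{4} \left[ \sum_{j} \left| p(j \mid t_L) - p(j \mid t_R) \right| \right]^2 $$

At each node, the Gini rule selects the split $s$ that maximizes the impurity decrease $\Delta i(s,t)$, while the Twoing rule selects the split that maximizes $\Phi(s,t)$. Twoing's tendency to divide the classes into two balanced superclasses is consistent with the more evenly balanced tree reported above.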
Keywords
Association rules, classification, data mining, parameter estimation, statistical learning.