Afaan Oromo News Text Categorization using Decision Tree Classifier and Support Vector Machine: A Machine Learning Approach
Kamal Mohammed Jimalo, Ramesh Babu P, Yaregal Assabie "Afaan Oromo News Text Categorization using Decision Tree Classifier and Support Vector Machine: A Machine Learning Approach". International Journal of Computer Trends and Technology (IJCTT) V47(1):29-41, May 2017. ISSN:2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.
Abstract -
Afaan Oromo is one of the major African languages, widely spoken in most parts of Ethiopia and in parts of neighboring countries such as Kenya and Somalia. It is used by the Oromo people, the largest ethnic group in Ethiopia, who account for 25.5% of the total population. Large collections of Afaan Oromo documents are available on the web, in addition to hard-copy documents in libraries and documentation centers. As the number of documents grows, identifying the documents relevant to a specific topic becomes increasingly challenging. A text categorization mechanism is therefore required for finding, filtering, and managing the rapidly growing volume of online information. Text categorization is an important application of machine learning in the field of document information retrieval. The objective of this research is to investigate the application of machine learning techniques to the automatic categorization of Afaan Oromo news text. Two machine learning techniques, Decision Tree Classifier and Support Vector Machine, are used to categorize the news texts. Annotated news texts are used to train the classifiers on six news categories: sport, business, politics, health, agriculture, and education. To design the categorization system, different techniques and tools are used for preprocessing, document clustering, and classifier model building. The documents are preprocessed with tokenization, stemming, and stop-word removal. A total of 824 news texts were used in this research, and 10-fold cross-validation was used for testing. The results indicate that such classifiers are applicable to the automatic classification of Afaan Oromo news texts.
The best results on the six-category data were 96.58% for Decision Tree Classifier and 84.93% for Support Vector Machine. This indicates that Decision Tree Classifier is the more applicable of the two for automatic categorization of Afaan Oromo news text.
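The preprocessing steps and the 10-fold evaluation named in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the stop-word list and suffix list are tiny hypothetical samples, `stem` is a toy suffix stripper rather than a real Afaan Oromo stemmer, and `majority_baseline` is a placeholder where a Decision Tree or Support Vector Machine classifier would be plugged in.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lists for illustration only;
# they are not the stop-word list or stemming rules used in the paper.
STOP_WORDS = {"fi", "kan", "akka", "irra"}
SUFFIXES = ("oota", "tti")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens.

    The apostrophe is kept because it appears inside Afaan Oromo words.
    """
    return re.findall(r"[a-z']+", text.lower())


def stem(token):
    """Toy stemmer: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) so each item is held out exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_items) if i not in held_out]
        yield train_idx, test_idx


def cross_validate(docs, labels, train_and_predict, k=10):
    """Return overall accuracy of `train_and_predict` under k-fold cross-validation."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(docs), k):
        predictions = train_and_predict(
            [docs[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [docs[i] for i in test_idx],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(docs)


def majority_baseline(train_docs, train_labels, test_docs):
    """Placeholder classifier: always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * len(test_docs)
```

To evaluate a real classifier, `majority_baseline` would be replaced by a function with the same three arguments that trains on the preprocessed training documents and returns one predicted category per test document.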
Keywords
Afaan Oromo, Text Categorization, Classification, Classifier.