Machine Learning-Driven Predictive Data Quality Assessment in ETL Frameworks

Divya Marupaka; Sandeep Rangineni

doi:10.14445/22312803/IJCTT-V72I3P108

Research Article | Open Access

Volume 72 | Issue 3 | Year 2024 | Article Id. IJCTT-V72I3P108 | DOI : https://doi.org/10.14445/22312803/IJCTT-V72I3P108

Received	Revised	Accepted	Published
23 Jan 2024	29 Feb 2024	15 Mar 2024	29 Mar 2024

Citation :

Divya Marupaka, Sandeep Rangineni, "Machine Learning-Driven Predictive Data Quality Assessment in ETL Frameworks," International Journal of Computer Trends and Technology (IJCTT), vol. 72, no. 3, pp. 53-60, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I3P108

Abstract

In data management, ensuring data quality within Extract, Transform, Load (ETL) frameworks is essential for reliable decision-making and insight generation. Traditional methods of data quality assessment often lack the agility and predictive capability needed to address evolving data challenges. This study proposes an approach that integrates machine learning techniques into ETL frameworks to predict and assess data quality, with the aim of improving on traditional data quality management through proactive identification and mitigation of potential issues. By training models on historical data sets and incorporating features such as data volume, structure, and distribution, the system learns to detect subtle deviations from expected data behavior. Key components of the framework include data preprocessing, feature engineering, model selection, and evaluation. The system continuously learns and adapts to changing data landscapes, enhancing its predictive capabilities over time. Results demonstrate significant improvements in data quality assessment accuracy, early detection of anomalies, and proactive mitigation of data-related risks. The framework's scalability and flexibility make it adaptable to different ETL workflows and data domains. In conclusion, machine learning-driven predictive data quality assessment offers a promising avenue for enhancing data reliability and trustworthiness within ETL frameworks. By leveraging advanced analytics and automation, organizations can streamline their data quality assurance processes and mitigate operational risks.

Keywords

Machine Learning, Predictive Analytics, Data Quality Assessment, ETL Frameworks, Data Integration.
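
The idea in the abstract of learning expected data behavior from historical batches and flagging deviations in volume, structure, and distribution can be illustrated with a small sketch. This is not the authors' model: it swaps in a plain per-feature z-score baseline as a simplified stand-in for a trained detector. The column name `amount`, the three features, the function names, and the threshold of 3.0 are all assumptions made for illustration.

```python
from statistics import mean, stdev


def profile_batch(rows):
    """Summarize one batch as volume, structure, and distribution features.

    `rows` is a list of dicts; the "amount" column is a hypothetical example.
    """
    n = len(rows)
    values = [r["amount"] for r in rows if r.get("amount") is not None]
    return {
        "row_count": float(n),                                  # volume
        "null_fraction": (n - len(values)) / n if n else 1.0,   # structure
        "mean_amount": mean(values) if values else 0.0,         # distribution
    }


def fit_baseline(history):
    """Learn (mean, sample std) per feature from historical batch profiles.

    Needs at least two historical profiles, since stdev requires two points.
    """
    baseline = {}
    for name in history[0]:
        column = [profile[name] for profile in history]
        baseline[name] = (mean(column), stdev(column))
    return baseline


def score_batch(profile, baseline, threshold=3.0):
    """Return the features whose z-score exceeds the (assumed) threshold."""
    flagged = {}
    for name, (mu, sigma) in baseline.items():
        diff = abs(profile[name] - mu)
        if sigma > 0:
            z = diff / sigma
        else:
            # A constant historical feature: any change at all is a deviation.
            z = 0.0 if diff == 0 else float("inf")
        if z > threshold:
            flagged[name] = z
    return flagged
```

In an ETL job, `profile_batch` would run on each incoming batch before the load step, and a non-empty result from `score_batch` would hold the batch for review. Refitting `fit_baseline` on a rolling window of recent profiles is one simple way to approximate the continuous adaptation the abstract describes; a learned model such as a random forest or an isolation forest could replace the z-score rule without changing the surrounding flow.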
