A Comparative Study of Delta Lake as a Preferred ETL and Analytics Database

Hanza Parayil Salim

doi:10.14445/22312803/IJCTT-V73I1P108

Research Article | Open Access

Volume 73 | Issue 1 | Year 2025 | Article Id. IJCTT-V73I1P108 | DOI : https://doi.org/10.14445/22312803/IJCTT-V73I1P108

Received: 22 Nov 2024 | Revised: 28 Dec 2024 | Accepted: 15 Jan 2025 | Published: 30 Jan 2025

Citation:

Hanza Parayil Salim, "A Comparative Study of Delta Lake as a Preferred ETL and Analytics Database," International Journal of Computer Trends and Technology (IJCTT), vol. 73, no. 1, pp. 65-71, 2025. Crossref, https://doi.org/10.14445/22312803/IJCTT-V73I1P108

Abstract

In modern data architecture, Delta Lake stands out as a powerful and reliable solution for handling large volumes of data. This comparative study examines Delta Lake as a candidate for Extract, Transform, Load (ETL) processes and analytics. Delta Lake, an open-source storage layer designed to work with Apache Spark and optimized for cloud environments, promises enhanced reliability, scalability, and performance for data pipelines. The study evaluates its advantages over traditional databases and other big data processing frameworks, focusing on data consistency, transaction management, and schema evolution. By analyzing key features such as ACID transactions, time travel, and integration with cloud platforms, the paper assesses Delta Lake's effectiveness for ETL workflows and analytical workloads. It highlights Delta Lake's advantages over plain data lakes and traditional databases when handling large datasets for analytical data processing.

Keywords

Delta Lake, Lakehouse architecture, Data Lake, Medallion architecture, Databricks, ETL, Apache Spark, Distributed computing.
