A Comparative Study of Delta Lake as a Preferred ETL and Analytics Database
© 2025 by IJCTT Journal
Volume-73 Issue-1
Year of Publication : 2025
Authors : Hanza Parayil Salim
DOI : 10.14445/22312803/IJCTT-V73I1P108
How to Cite?
Hanza Parayil Salim, "A Comparative Study of Delta Lake as a Preferred ETL and Analytics Database," International Journal of Computer Trends and Technology, vol. 73, no. 1, pp. 65-71, 2025. Crossref, https://doi.org/10.14445/22312803/IJCTT-V73I1P108
Abstract
In modern data architecture, Delta Lake has emerged as a reliable way to manage large volumes of data. This comparative study examines Delta Lake as a platform for Extract, Transform, Load (ETL) processes and analytics. Delta Lake, an open-source storage layer built on top of Apache Spark and optimized for cloud environments, aims to improve the reliability, scalability, and performance of data pipelines. The study evaluates its advantages over traditional databases and other big data processing frameworks, focusing on data consistency, transaction management, and schema evolution. By analyzing key features such as ACID transactions, time travel, and integration with cloud platforms, the paper assesses Delta Lake's effectiveness for ETL workflows and analytical workloads. It highlights Delta Lake's strengths over plain data lakes and traditional databases when processing large analytical datasets.
Keywords
Delta Lake, Lakehouse architecture, Data Lake, Medallion architecture, Databricks, ETL, Apache Spark, Distributed computing.