Databricks Delta tables are an advanced data storage and management feature of Databricks, offering a unified framework for managing and optimizing data. Delta tables are built on top of Apache Spark, enhancing Spark’s capabilities with ACID transactions for data integrity, scalable metadata handling for efficient management of large datasets, Time Travel for querying previous versions of data, and unified support for both streaming and batch data processing.
Key Features:
ACID Transactions: Supports Atomicity, Consistency, Isolation, and Durability (ACID) transactions, ensuring data integrity and reliability.
Scalable Metadata Handling: Efficiently manage metadata for large-scale data, ensuring fast query performance even as the data size grows.
Schema Enforcement: Delta enforces schemas to maintain data consistency and prevent data corruption.
Data Versioning: Automatically versions the data and maintains a history of changes, enabling data auditing and rollback.
Time Travel: This feature allows users to query past versions of the data, making it easier to recover from accidental deletions or modifications (see the example after this list).
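As a quick sketch of Time Travel, assuming a Delta table named TableName (the version number and timestamp below are hypothetical):
-- List the table history to find a version or timestamp to query
DESCRIBE HISTORY TableName;
-- Query the table as it was at version 5
SELECT * FROM TableName VERSION AS OF 5;
-- Query the table as it was at a point in time
SELECT * FROM TableName TIMESTAMP AS OF '2024-01-01T00:00:00';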
Creating Delta Table:
The DDL for a Delta table is almost identical to that for a Parquet table.
CREATE TABLE TableName (
    columns_A string,
    columns_B int,
    columns_C timestamp
) USING DELTA
PARTITIONED BY (columns_D string)
LOCATION 'dbfs:/delta/TableName'
Converting Parquet to Delta Table:
We can use the command below to convert an existing Parquet table into a Delta table.
    CONVERT TO DELTA tableName PARTITIONED BY (columns_D string)
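If the Parquet data is not yet registered as a table, the same command can be run against a path instead; the location below is only an illustrative placeholder:
    -- Convert Parquet files at a path into a Delta table, declaring the partition column
    CONVERT TO DELTA parquet.`dbfs:/data/parquet/TableName` PARTITIONED BY (columns_D string)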
Additional Delta Table Properties:
There are several table properties that we can use to alter the behavior of a Delta table. We can set and unset properties on an existing table using the commands below.
ALTER TABLE tableName SET TBLPROPERTIES ('key' = 'value');
ALTER TABLE tableName UNSET TBLPROPERTIES ('key');
delta.autoOptimize.autoCompact: This property allows us to control the output part file size. Setting the value to ‘true’ enables auto compaction, which combines small files within Delta table partitions. This automatic compaction reduces the problems associated with having many small files.
delta.autoOptimize.optimizeWrite: Setting the value to ‘true’ enables Optimized Writes, which improve file size as data is written and enhance the performance of subsequent reads on the table. Optimized Writes are most effective for partitioned tables, as they reduce the number of small files written to each partition.
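For example, both write optimizations can be enabled on an existing table in a single statement (tableName is a placeholder):
-- Enable auto compaction and optimized writes for a Delta table
ALTER TABLE tableName SET TBLPROPERTIES (
    'delta.autoOptimize.autoCompact' = 'true',
    'delta.autoOptimize.optimizeWrite' = 'true'
);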
delta.deletedFileRetentionDuration: This property is set to a value of the form 'interval <interval>' and controls how long data files are retained after they are no longer referenced by the current table version. The default duration is 7 days. Running the VACUUM command removes unreferenced data files older than this threshold, so the retention period determines how far back Time Travel can go; however, increasing the duration leads to higher storage costs because more data files are retained.
delta.logRetentionDuration: This property controls how long the table history (the Delta transaction log files) is kept. We can set the duration using the format 'interval <interval>'. The default duration is 30 days. This interval should be greater than or equal to delta.deletedFileRetentionDuration, so that every retained data file still has matching log entries.
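As an illustration, both retention settings can be adjusted together; the 30-day and 60-day values below are arbitrary examples, not recommendations:
-- Retain unreferenced data files for 30 days and transaction log history for 60 days
ALTER TABLE tableName SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 30 days',
    'delta.logRetentionDuration' = 'interval 60 days'
);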
It is recommended to run the OPTIMIZE and VACUUM commands after each successful load or at regular intervals to enhance table performance and remove older data files from storage.
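A minimal maintenance sketch, assuming a table named tableName (the Z-ORDER column and retention window are hypothetical examples):
-- Compact small files and co-locate data on a frequently filtered column
OPTIMIZE tableName ZORDER BY (columns_B);
-- Remove data files that are no longer referenced and are older than the retention period (168 hours = 7 days)
VACUUM tableName RETAIN 168 HOURS;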
Conclusion:
Databricks Delta tables significantly enhance Apache Spark’s functionality by offering robust data integrity through ACID transactions, efficient management of large datasets with scalable metadata handling, the ability to query historical data with Time Travel, and the convenience of unified streaming and batch data processing.