This topic compares Databricks Runtime Delta in Databricks DataInsight with the community open-source Delta Lake.
Databricks Runtime vs Apache Spark
The feature list in the following table comes from the Databricks website (https://databricks.com/spark/comparing-databricks-to-apache-spark).
Feature | Apache Spark | Databricks DataInsight |
---|---|---|
Built-in file system optimized for cloud storage access | No | Yes |
Spark-native fine grained resource sharing for optimum utilization | No | Yes |
Fault isolation of compute resources | No | Yes |
Faster writes to OSS | No | Yes |
Compute optimization during joins and filters | No | Yes |
Rapid release cycles | No | Yes |
Auto-scaling compute | No | Yes |
Databricks Delta vs Open-source Delta Lake
Feature | Open Source Delta Lake | Databricks Delta |
---|---|---|
Snapshot Isolation / Transactional Guarantees | Yes | Yes |
Efficient directory / File listing | Yes | Yes |
Version history and time travel | Yes | Yes |
Schema evolution & enforcement | Yes | Yes |
Hidden partitions / Partitioning by expressions | In Roadmap | In Roadmap |
HDFS Support | Yes | Yes |
Object Storage Support | Yes | Yes |
Streaming Data Sink | Yes | Yes |
Streaming Data Source | Yes | Yes |
Basic Upsert (merge into) | Yes | Yes |
Scalable Upsert (merge into) | No | Yes |
Data skipping based on stats | No | Yes |
Compact small files | Yes | Yes |
Optimize (efficiently compact small files) | No | Yes |
Auto Optimize | No | Yes |
Native Parquet Reader | No | Yes |
Read from Presto | Yes | Yes |
Read from Hive | In Roadmap | In Roadmap |
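
Most rows in the Delta table above are identical in both columns, so the differences are easy to miss. As a minimal sketch, the table can be transcribed into a plain Python dictionary and filtered for the rows where the two columns disagree. The names `DELTA_FEATURES` and `differing_features` are hypothetical helpers introduced for illustration only; they are not part of any Databricks or Delta Lake API, and the values are copied from the table rather than checked against a live cluster.

```python
# Support matrix transcribed from the table above.
# Each entry maps feature -> (Open Source Delta Lake, Databricks Delta).
DELTA_FEATURES = {
    "Snapshot Isolation / Transactional Guarantees": ("Yes", "Yes"),
    "Efficient directory / File listing": ("Yes", "Yes"),
    "Version history and time travel": ("Yes", "Yes"),
    "Schema evolution & enforcement": ("Yes", "Yes"),
    "Hidden partitions / Partitioning by expressions": ("In Roadmap", "In Roadmap"),
    "HDFS Support": ("Yes", "Yes"),
    "Object Storage Support": ("Yes", "Yes"),
    "Streaming Data Sink": ("Yes", "Yes"),
    "Streaming Data Source": ("Yes", "Yes"),
    "Basic Upsert (merge into)": ("Yes", "Yes"),
    "Scalable Upsert (merge into)": ("No", "Yes"),
    "Data skipping based on stats": ("No", "Yes"),
    "Compact small files": ("Yes", "Yes"),
    "Optimize (efficiently compact small files)": ("No", "Yes"),
    "Auto Optimize": ("No", "Yes"),
    "Native Parquet Reader": ("No", "Yes"),
    "Read from Presto": ("Yes", "Yes"),
    "Read from Hive": ("In Roadmap", "In Roadmap"),
}


def differing_features(features):
    """Return the feature names whose two columns differ, in table order."""
    return [name for name, (oss, dbx) in features.items() if oss != dbx]


if __name__ == "__main__":
    # Print only the rows where the two editions are listed differently.
    for name in differing_features(DELTA_FEATURES):
        print(name)
```

According to the table, the rows that differ are all ones where the open-source column says No and the Databricks Delta column says Yes.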