---
layout: post
title: "Growing the Delta Lake ecosystem with Rust and Python"
tags:
- featured
- rust
- deltalake
- python
author: rtyler
team: Core Platform
---

Scribd stores billions of records in [Delta Lake](https://delta.io), but
writing or reading that data had been constrained to a single tech stack. All
of that changed with the creation of
[delta-rs](https://github.com/delta-io/delta-rs). Historically, using Delta
Lake required applications to be implemented with or accompanied by [Apache
Spark](https://spark.apache.org). Many of our batch and streaming data
processing applications are Spark-based, but that's not everything that
exists! In mid-2020 it became clear that Delta Lake would be a powerful tool
in areas adjacent to the domain that Spark occupies. From my perspective, I
figured that we would soon need to bring data into and out of Delta Lake in
dozens of different ways. Some discussions and prototyping led to the
creation of "delta-rs", a Delta Lake client written in Rust that can be
easily embedded in other languages such as
[Python](https://delta-io.github.io/delta-rs/python), Ruby, NodeJS, and more.

The [Delta Lake
protocol](https://github.com/delta/blob/master/PROTOCOL.md) is not
_that_ complicated, as it turns out. At an extremely high level, Delta Lake
is a JSON-based transaction log coupled with [Apache
Parquet](https://parquet.apache.org) files stored on disk/object storage.
This means the core implementation of Delta in [Rust](https://rust-lang.org)
is similarly quite simple. Take the following example from our integration
tests, which "opens" a table, reads its transaction log, and provides a list
of the Parquet files contained within:

```rust
// Open the table, which replays the JSON transaction log under _delta_log/
let table = deltalake::open_table("./tests/data/delta-0.2.0")
    .await
    .unwrap();
// The table state includes the set of Parquet data files currently active
assert_eq!(
    table.get_files(),
    vec![
        "part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet",
        "part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet",
        "part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet",
    ]
);
```
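Since each commit in a table's `_delta_log/` directory is just a file of
newline-delimited JSON actions, you can peek at the protocol with nothing but
the standard library. A minimal sketch, assuming it runs from a delta-rs
checkout so the same test table path resolves:

```python
import json

# Each line of a commit file is a single action, e.g. {"add": {...}} recording
# a new Parquet file, or {"metaData": {...}} carrying the table's schema
with open('./tests/data/delta-0.2.0/_delta_log/00000000000000000000.json') as log:
    for line in log:
        print(json.loads(line))
```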
Our primary motivation for delta-rs was to create something which would
accommodate high-throughput writes to Delta Lake and allow embedding in
languages like Python and Ruby, such that users of those platforms could
perform light queries and read operations.

The first notable writer-based application being co-developed with delta-rs
is [kafka-delta-ingest](https://github.com/delta-io/kafka-delta-ingest). The
project aims to provide a highly efficient daemon for ingesting
Kafka-originating data into Delta tables. In Scribd's stack, it will
effectively bridge JSON flowing through [Apache
Kafka](https://kafka.apache.org) topics into pre-defined Delta tables,
translating a single JSON message into a single row in the table.
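
kafka-delta-ingest itself is written in Rust, but the shape of the work it
does can be sketched in a few lines of Python. The following is a conceptual
illustration only, assuming the `kafka-python` package and the `deltalake`
package's `write_deltalake` helper; the topic name and table URI are made up,
and the real daemon uses none of these libraries:

```python
import json

import pyarrow as pa
from deltalake import write_deltalake
from kafka import KafkaConsumer

# Hypothetical topic and broker, for illustration only
consumer = KafkaConsumer('web_requests', bootstrap_servers='localhost:9092')

buffer = []
for message in consumer:
    # Translate a single JSON message into a single row
    buffer.append(json.loads(message.value))
    if len(buffer) >= 1024:
        # Flush the batch as a new Parquet file plus a transaction log entry
        write_deltalake('s3://delta/web_requests',
                        pa.Table.from_pylist(buffer), mode='append')
        buffer = []
```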
From the reader standpoint, the Python interface built on top of delta-rs,
contributed largely by [Florian Valeye](https://github.com/fvaleye), makes
working with Delta Lake even simpler, and for most architectures you only
need to run `pip install deltalake`:

```python
from deltalake import DeltaTable
from pprint import pprint

if __name__ == '__main__':
    # Load the Delta Table
    dt = DeltaTable('s3://delta/golden/data-reader-primitives')

    print(f'Table version: {dt.version()}')

    # List out all the files contained in the table
    for f in dt.files():
        print(f' - {f}')

    # Create a Pandas dataframe to execute queries against the table
    df = dt.to_pyarrow_table().to_pandas()
    pprint(df.query('as_int % 2 == 1'))
```
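`DeltaTable` can also load a specific version of a table, which makes it easy
to re-run a computation against the exact data it originally saw. A small
sketch using the same table as above; the version number is illustrative:

```python
from deltalake import DeltaTable

# Time travel: load the table as of version 1 rather than the latest version
dt = DeltaTable('s3://delta/golden/data-reader-primitives', version=1)
print(f'Table version: {dt.version()}')
```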
I cannot stress enough how much potential the above Python snippet has for
machine learning and other Python-based applications at Scribd. For a number
of internal applications, developers have been launching Spark clusters for
the sole purpose of reading some data from Delta Lake in order to start their
model training workloads in Python. With the maturation of the Python
`deltalake` package, there is now a fast and easy way to load Delta Lake data
into basic Python applications.

From my perspective, this is only the beginning for
[delta-rs](https://github.com/delta-io/delta-rs). Delta Lake is a deceptively
simple technology with tremendous potential across the data platform. I will
be sharing more about delta-rs at [Data and AI
Summit](https://databricks.com/dataaisummit/north-america-2021) on May 27th
at 12:10 PDT. I hope you'll join [my
session](https://databricks.com/speaker/r-tyler-croy) with your questions
about delta-rs and where we're taking it!