|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "Growing the Delta Lake ecosystem with Rust and Python" |
| 4 | +tags: |
| 5 | +- featured |
| 6 | +- rust |
| 7 | +- deltalake |
| 8 | +- python |
| 9 | +author: rtyler |
| 10 | +team: Core Platform |
| 11 | +--- |
| 12 | + |
| 13 | + |
| 14 | +Scribd stores billions of records in [Delta Lake](https://delta.io), but writing |
| 15 | +or reading that data was constrained to a single tech stack. All of that |
| 16 | +changed with the creation of Rust and Python support via |
| 17 | +[delta-rs](https://github.com/delta-io/delta-rs). Historically, using Delta |
| 18 | +Lake required that applications be implemented with, or accompanied by, [Apache |
| 19 | +Spark](https://spark.apache.org), and many of our batch and streaming data |
| 20 | +processing applications are Spark-based. In mid-2020 it became clear to me |
| 21 | +that Delta Lake would be a powerful tool in areas adjacent to the domain that |
| 22 | +Spark occupies: we would soon need to bring data into and out of Delta Lake in |
| 23 | +dozens of different ways. Some discussions and prototyping led to the creation |
| 24 | +of "delta-rs", a Delta Lake client written in Rust that can be easily embedded |
| 25 | +in other languages such as |
| 26 | +[Python](https://delta-io.github.io/delta-rs/python), Ruby, NodeJS, and more. |
| 27 | + |
| 28 | + |
| 29 | +The [Delta Lake |
| 30 | +protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md) is not |
| 31 | +_that_ complicated, as it turns out. At an extremely high level, Delta Lake is a |
| 32 | +JSON-based transaction log coupled with [Apache |
| 33 | +Parquet](https://parquet.apache.org) files stored on disk or in object storage. This means the core implementation of Delta in [Rust](https://rust-lang.org) is similarly quite simple. Take the following example from our integration tests, which "opens" a table, reads its transaction log, and provides a list of the Parquet files contained within: |
| 34 | + |
| 35 | + |
| 36 | +```rust |
| 37 | +let table = deltalake::open_table("./tests/data/delta-0.2.0") |
| 38 | +    .await |
| 39 | +    .unwrap(); |
| 40 | +assert_eq!( |
| 41 | +    table.get_files(), |
| 42 | +    &vec![ |
| 43 | +        "part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet", |
| 44 | +        "part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet", |
| 45 | +        "part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet", |
| 46 | +    ] |
| 47 | +); |
| 48 | +``` |
| 49 | + |
| 50 | + |
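
To make the "JSON-based transaction log" idea a little more concrete, here is a minimal sketch of log replay. It is not taken from delta-rs: the `Action` enum and the `replay` function are hypothetical names used only for illustration, and the sketch models just the protocol's `add` and `remove` actions as already-parsed values. A real client reads these actions from the JSON commit files in the table's `_delta_log` directory and also has to handle checkpoints, metadata, and protocol versions.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified model of two Delta log actions.
/// In the real protocol these are JSON objects with many more fields.
#[derive(Debug, Clone)]
enum Action {
    /// A Parquet file was added to the table.
    Add { path: String },
    /// A Parquet file was logically removed from the table.
    Remove { path: String },
}

/// Replays commits in order and returns the paths of the files that
/// make up the latest version of the table, sorted by path.
fn replay(commits: &[Vec<Action>]) -> Vec<String> {
    let mut live = BTreeSet::new();
    for commit in commits {
        for action in commit {
            match action {
                Action::Add { path } => {
                    live.insert(path.clone());
                }
                Action::Remove { path } => {
                    live.remove(path);
                }
            }
        }
    }
    live.into_iter().collect()
}
```

Replaying a first commit that adds `a.parquet` and `b.parquet`, followed by a second commit that removes `a.parquet` and adds `c.parquet`, leaves `b.parquet` and `c.parquet` as the table's current files.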