Commit 688d5c2: Growing the Delta Lake post
1 file changed: +99 −0
---
layout: post
title: "Growing the Delta Lake ecosystem with Rust and Python"
tags:
- featured
- rust
- deltalake
- python
author: rtyler
team: Core Platform
---
Scribd stores billions of records in [Delta Lake](https://delta.io), but writing
or reading that data had been constrained to a single tech stack. All of that
changed with the creation of [delta-rs](https://github.com/delta-io/delta-rs).
Historically, using Delta Lake required applications to be implemented with, or
accompanied by, [Apache Spark](https://spark.apache.org). Many of our batch
and streaming data processing applications are Spark-based, but that's not
everything that exists! In mid-2020 it became clear that Delta Lake would be a
powerful tool in areas adjacent to the domain that Spark occupies. From my
perspective, I figured we would soon need to bring data into and out of Delta
Lake in dozens of different ways. Some discussions and prototyping led to the
creation of "delta-rs", a Delta Lake client written in Rust that can be easily
embedded in other languages such as
[Python](https://delta-io.github.io/delta-rs/python), Ruby, NodeJS, and more.

The [Delta Lake
protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md) is not
_that_ complicated, as it turns out. At an extremely high level, Delta Lake is a
JSON-based transaction log coupled with [Apache
Parquet](https://parquet.apache.org) files stored on disk/object storage. This
means the core implementation of Delta in [Rust](https://rust-lang.org) is
similarly quite simple. Take the following example from our integration tests,
which "opens" a table, reads its transaction log, and provides a list of
Parquet files contained within:

```rust
let table = deltalake::open_table("./tests/data/delta-0.2.0")
    .await
    .unwrap();
assert_eq!(
    table.get_files(),
    vec![
        "part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet",
        "part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet",
        "part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet",
    ]
);
```

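What `open_table` is doing amounts to replaying that JSON transaction log. The snippet below is a standalone illustration of the idea in Python, not delta-rs code: it builds a miniature `_delta_log` directory (the file and field names follow the Delta protocol; the table contents are made up) and then replays it to recover the list of live Parquet files.

```python
import json
import os
import tempfile

# Build a miniature _delta_log for illustration; a real table is laid out
# the same way, with one JSON file of actions per commit version.
table = tempfile.mkdtemp()
log_dir = os.path.join(table, '_delta_log')
os.makedirs(log_dir)
with open(os.path.join(log_dir, '00000000000000000000.json'), 'w') as f:
    f.write(json.dumps({'add': {'path': 'part-00000-example.snappy.parquet'}}) + '\n')
    f.write(json.dumps({'add': {'path': 'part-00001-example.snappy.parquet'}}) + '\n')

# Replay the log in commit order: each line is one action. 'add' actions
# contribute files to the current table state, 'remove' actions retract them.
files = set()
for name in sorted(os.listdir(log_dir)):
    with open(os.path.join(log_dir, name)) as f:
        for line in f:
            action = json.loads(line)
            if 'add' in action:
                files.add(action['add']['path'])
            elif 'remove' in action:
                files.discard(action['remove']['path'])

print(sorted(files))
```

The real protocol has more action types (metadata, protocol versions, checkpoints), but the add/remove replay above is the heart of how a reader arrives at the file list `get_files()` returns.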
Our primary motivation for delta-rs was to create something that would
accommodate high-throughput writes to Delta Lake and allow embedding in
languages like Python and Ruby, such that users of those platforms could
perform light queries and read operations.

The first notable writer-based application being co-developed with delta-rs is
[kafka-delta-ingest](https://github.com/delta-io/kafka-delta-ingest). The
project aims to provide a highly efficient daemon for ingesting
Kafka-originating data into Delta tables. In Scribd's stack, it will
effectively bridge JSON flowing into [Apache Kafka](https://kafka.apache.org)
topics into pre-defined Delta tables, translating a single JSON message into a
single row in the table.

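Conceptually, that message-to-row translation looks like the sketch below. This is not kafka-delta-ingest's actual code (the daemon is written in Rust), and the message fields are hypothetical; it only illustrates the mapping of one JSON object to one table row.

```python
import json

# Hypothetical payloads as they might arrive on a Kafka topic.
messages = [
    b'{"user_id": 42, "doc_id": "abc123", "event": "read"}',
    b'{"user_id": 7, "doc_id": "def456", "event": "bookmark"}',
]

# Each JSON object becomes one row whose columns are the object's fields.
# In the real daemon, buffered rows are periodically flushed to the Delta
# table as a Parquet file plus a transaction-log commit.
rows = [json.loads(m) for m in messages]

for row in rows:
    print(row)
```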
From the reader standpoint, the Python interface built on top of delta-rs,
contributed largely by [Florian Valeye](https://github.com/fvaleye), makes
working with Delta Lake even simpler, and for most architectures you only need
to run `pip install deltalake`:

```python
from deltalake import DeltaTable
from pprint import pprint

if __name__ == '__main__':
    # Load the Delta table
    dt = DeltaTable('s3://delta/golden/data-reader-primitives')

    print(f'Table version: {dt.version()}')

    # List out all the files contained in the table
    for f in dt.files():
        print(f' - {f}')

    # Create a Pandas dataframe to execute queries against the table
    df = dt.to_pyarrow_table().to_pandas()
    pprint(df.query('as_int % 2 == 1'))
```

I cannot stress enough how much potential the above Python snippet has for
machine learning and other Python-based applications at Scribd. For a number
of internal applications, developers have been launching Spark clusters for the
sole purpose of reading some data from Delta Lake in order to start their model
training workloads in Python. With the maturation of the Python `deltalake`
package, there is now a fast and easy way to load Delta Lake data into basic
Python applications.

From my perspective, it's only the beginning with [delta-rs](https://github.com/delta-io/delta-rs). Delta Lake is a deceptively simple technology with tremendous potential across the data platform. I will be sharing more about delta-rs at [Data and AI Summit](https://databricks.com/dataaisummit/north-america-2021) on May 27th at 12:10 PDT. I hope you'll join [my session](https://databricks.com/speaker/r-tyler-croy) with your questions about delta-rs and where we're taking it!