
Commit c02ccf8

Copy edits for the sql-delta-import post and other tidying up
1 parent 4405080 commit c02ccf8

2 files changed: +52 -42 lines changed

_data/authors.yml

Lines changed: 4 additions & 0 deletions
@@ -106,6 +106,10 @@ Kuntalb:
   name: Kuntal Kumar Basu
   github: kuntalkumarbasu
 
+alexk:
+  name: Alex Kushnir
+  github: shtusha
+
 nakulpathak3:
   name: Nakul Pathak
   github: nakulpathak3

_posts/2021-03-11-introducing-sql-delta-import.md

Lines changed: 48 additions & 42 deletions
@@ -6,25 +6,28 @@ tags:
 - databricks
 - spark
 - deltalake
+- featured
 team: Data Engineering
 ---
 
-OLTP databases are a common data source for Data Lake based warehouses which use Big Data tools to run
-batch analytics pipelines. Classic hadoop toolset comes with
-[Apache Sqoop](https://sqoop.apache.org/) - a tool for bulk import/export
-of data between HDFS and relational data stores. Our pipelines were using this tool as well, primarily
-to import MySQL data into HDFS. When Platform Engineering team at Scribd took on a effort
-to migrate our on-premise Hadoop workloads to [Databricks Lakehouse Platform](https://databricks.com/product/data-lakehouse)
-on AWS we had to write our own tool to import data from MySQL directly into S3 backed [Delta Lake](https://delta.io/).
-In this post I will share the details about `sql-delta-import` - an open-source spark utility to import data from any
-JDBC compatible database into Delta Lake. This utility is being open sourced under
-[Delta Lake Connectors](https://github.com/delta-io/connectors/pull/80) project
+OLTP databases are a common data source for Data Lake based warehouses which use Big Data tools to run
+batch analytics pipelines. The classic Apache Hadoop toolchain includes
+[Apache Sqoop](https://sqoop.apache.org/) - a tool for bulk import/export
+of data between HDFS and relational data stores. Our pipelines were using this tool as well, primarily
+to import MySQL data into HDFS. When the Platform Engineering team took on the migration of
+our on-premise Hadoop workloads to the [Databricks Lakehouse Platform](https://databricks.com/product/data-lakehouse)
+on AWS, we had to write our own tool to import data from MySQL directly into S3-backed [Delta Lake](https://delta.io/).
+In this post I will share the details about `sql-delta-import` - an open source utility we have proposed for inclusion in the
+[Delta Lake
+Connectors](https://github.com/delta-io/connectors/pull/80) project. We're
+looking forward to working with others to improve and accelerate importing data
+into Delta Lake!
 
 ### Sample import
 
-Importing data into a Delta Lake table is as easy as
+Importing data into a Delta Lake table is as easy as
 
-```shell script
+```sh
 spark-submit /
 --class "io.delta.connectors.spark.JDBC.ImportRunner" sql-delta-import_2.12-0.2.1-SNAPSHOT.jar /
 --jdbc-url jdbc:mysql://hostName:port/database /
@@ -35,17 +38,17 @@ spark-submit /
 
 ### This looks a lot like `sqoop`... why didn't you just use that?
 
-We considered using `sqoop` at first but quickly dismissed that option for multiple reasons
+We considered using `sqoop` at first but quickly dismissed that option for multiple reasons:
 
 #### 1. Databricks Lakehouse Platform does not come with `sqoop`
 Yes we could have ran our sqoop jobs on EMR clusters but we wanted to run everything in Databricks and
-avoid additional technology footprint. But even if we drop that restriction...
-
+avoid additional technology footprint and overhead. But even if we drop that restriction...
+
 #### 2. `sqoop` does not support writing data directly to Delta Lake
-`sqoop` can only import data as text or parquet. Writing to delta directly allows us to
+`sqoop` can only import data as text or parquet. Writing to delta directly allows us to
 optimize data storage for best performance on reads by just adding a couple of configuration options
 
-```shell script
+```sh
 spark-submit /
 --conf spark.databricks.delta.optimizeWrite.enabled=true /
 --conf spark.databricks.delta.autoCompact.enabled=true /
@@ -57,18 +60,18 @@ spark-submit /
 ```
 
 #### 3. `--num-mappers` just not good enough to control parallelism when working with a database
-`sqoop` uses map-reduce under the hood. We can specify `--num-mappers` parameter that controls how many
-mappers will be used to import data. Small number of mappers can result in large volume
-of data per import and long running transactions. Large number of mappers will result in many connections
+`sqoop` uses map-reduce under the hood. We can specify `--num-mappers` parameter that controls how many
+mappers will be used to import data. Small number of mappers can result in large volume
+of data per import and long running transactions. Large number of mappers will result in many connections
 to database potentially overloading it especially when there are a lot of `sqoop` jobs running in parallel.
-Additionally since there are no reduce stages in `sqoop` jobs large number of mappers will result in large
+Additionally since there are no reduce stages in `sqoop` jobs large number of mappers will result in large
 number of output files and potentially introducing a small files problem.
 
-`sql delta import` uses `--chunks` parameter to control number of... well... chunks to split the source table
+`sql delta import` uses `--chunks` parameter to control number of... well... chunks to split the source table
 into and standard spark parameters like `--num-executors` and `--executor-cores` to control data import
 concurrency thus allowing you to tune those parameters independently
 
-```shell script
+```sh
 spark-submit --num-executors 15 --executor-cores 4 /
 --conf spark.databricks.delta.optimizeWrite.enabled=true /
 --conf spark.databricks.delta.autoCompact.enabled=true /
@@ -81,39 +84,39 @@ spark-submit --num-executors 15 --executor-cores 4 /
 ```
 
 in the example above source table will be split into 500 chunks resulting in quick transactions and released connections
-but no more than 60 concurrent connections will be used for import since max degree of parallelism is 60 (15 executors x 4 cores).
+but no more than 60 concurrent connections will be used for import since max degree of parallelism is 60 (15 executors x 4 cores).
 `delta.optimizeWrite` and `delta.autoCompact` configuration will yield optimal file size output for the destination table
 
 #### 3.1 `--num-mappers` and data skew just don't play nicely together
-
-When `sqoop` imports data, source table will be split into ranges based on `--split-by` column and each mapper
-would import its corresponding range. This works good when `--split-by` column has a near uniform distribution
+
+When `sqoop` imports data, source table will be split into ranges based on `--split-by` column and each mapper
+would import its corresponding range. This works well when `--split-by` column has a near uniform distribution
 of data, but that's not always the case with source tables... As tables age we tend to add additional columns to them to
-take on new business requirements so over time data in latest rows has a higher fill rate than earlier rows.
+take on new business requirements so over time data in latest rows has a higher fill rate than earlier rows.
 
 ![row density increase over time](/post-images/2021-03-sql-delta-import/row_density_increase.png)
 
-Our source tables here at Scribd definitely have these characteristics. We also have some tables that have entire
+Our source tables here at Scribd definitely have these characteristics. We also have some tables that have entire
 ranges of data missing due to data cleanup. At some point large chunks of data were just deleted from these tables.
 
 ![missing rows](/post-images/2021-03-sql-delta-import/missing_rows.png)
 
-This type of data skew will result in processing time skew and output file size skew when you can only control number of
-mappers. Yes we can introduce additional computed synthetic column in the source table as our `split-by` column but now
-there is an additional column that does not add business value, app developers need to be aware of it, computing and
-storing it takes up database resources and if we plan to use it for imports it's better be indexed, thus even more
+This type of data skew will result in processing time skew and output file size skew when you can only control number of
+mappers. Yes we can introduce additional computed synthetic column in the source table as our `split-by` column but now
+there is an additional column that does not add business value, app developers need to be aware of it, computing and
+storing it takes up database resources and if we plan to use it for imports it's better be indexed, thus even more
 compute and storage resources.
 
-With `sql-delta-import` we still split source tables into ranges based on `--split-by` column but if there is data
+With `sql-delta-import` we still split source tables into ranges based on `--split-by` column but if there is data
 distribution skew we can "solve" this problem by making number of chunks much larger than max degree of parallelism.
-This way large chunks with high data density are broken up into smaller pieces that a single executor can handle.
-Executors that get chunks with little or no data can just quickly process them and move on to do some real work.
+This way large chunks with high data density are broken up into smaller pieces that a single executor can handle.
+Executors that get chunks with little or no data can just quickly process them and move on to do some real work.
 
 
 ### Advanced use cases
 
-For advanced use cases you don't have to use provided spark application directly. `sql-delta-import`
-libraries can be imported into your own project. You can specify custom data transformations or JDBC dialect to gain a
+For advanced use cases you don't have to use provided spark application directly. `sql-delta-import`
+libraries can be imported into your own project. You can specify custom data transformations or JDBC dialect to gain a
 more precised control of data type handling
 
 ```scala
@@ -122,7 +125,7 @@ import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types._
 
 import io.delta.connectors.spark.JDBC._
-
+
 implicit val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
 
 
@@ -149,7 +152,10 @@ val importer = new JDBCImport(jdbcUrl = jdbcUrl, importConfig = config, dataTran
 importer.run()
 ```
 
----
-Prior to migrating to Databricks Lakehouse Platform we had roughly 300 `sqoop` jobs. We were able to
-successfully port all of them to `sql-delta-import`. Today they happily coexist in production with other spark
+Prior to migrating to Databricks Lakehouse Platform we had roughly 300 `sqoop` jobs. We were able to
+successfully port all of them to `sql-delta-import`. Today they happily coexist in production with other spark
 jobs allowing us to use uniform set of tools for orchestrating, scheduling, monitoring and logging for all of our jobs.
+
+If you're interested in working with Delta Lake, the Databricks platform, or
+enabling really interesting machine learning use-cases, check out our [careers
+page](/careers/#open-positions)!
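For readers skimming this diff who want the gist of the post's `--chunks`/`--split-by` argument without opening the full file, below is a rough standalone sketch of the same pattern built from stock Spark JDBC options and a Delta write. It is not part of this commit and not `sql-delta-import`'s actual implementation; the JDBC URL, table names, credentials, bounds, and chunk count are placeholders, and it assumes the MySQL JDBC driver and Delta Lake are on the classpath.

```scala
// Illustrative only: a chunked JDBC read written to a Delta table with plain
// Spark APIs. sql-delta-import wraps this kind of logic behind --chunks and
// --split-by; every value below is a placeholder, not the tool's default.
import org.apache.spark.sql.SparkSession

object ChunkedJdbcToDeltaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("chunked-jdbc-to-delta").getOrCreate()

    // Read the source table in 500 range partitions ("chunks") on the split-by
    // column. Spark issues one bounded query per partition, so each database
    // transaction stays small and connections are released quickly.
    val source = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://hostName:port/database")   // placeholder URL
      .option("dbtable", "source_table")                      // placeholder table
      .option("user", sys.env.getOrElse("DB_USER", "user"))
      .option("password", sys.env.getOrElse("DB_PASSWORD", "password"))
      .option("partitionColumn", "id")                        // analogous to --split-by
      .option("lowerBound", "1")                              // illustrative bounds
      .option("upperBound", "100000000")
      .option("numPartitions", "500")                         // analogous to --chunks
      .load()

    // Concurrency stays capped by executors x cores no matter how many chunks
    // exist, which is the decoupling the post argues for.
    source.write
      .format("delta")
      .mode("overwrite")
      .save("s3://bucket/path/to/delta/table")                // placeholder destination

    spark.stop()
  }
}
```

Submitted with `--num-executors 15 --executor-cores 4`, at most 60 of the 500 partitions are read concurrently, which is the same connection-capping arithmetic the post walks through.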
