@@ -18,7 +18,7 @@ to migrate our on-premise Hadoop workloads to [Databricks Lakehouse Platform](ht
on AWS we had to write our own tool to import data from MySQL directly into S3-backed [Delta Lake](https://delta.io/).
In this post I will share the details about `sql-delta-import` - an open-source Spark utility to import data from any
JDBC-compatible database into Delta Lake. This utility is being open-sourced under the
- [Delta Lake Connectors](https://github.com/delta-io/connectors) project
+ [Delta Lake Connectors](https://github.com/delta-io/connectors/pull/80) project
### Sample import
@@ -87,7 +87,7 @@ but no more than 60 concurrent connections will be used for import since max deg
#### 3.1 `--num-mappers` and data skew just don't play nicely together
When `sqoop` imports data, the source table will be split into ranges based on the `--split-by` column and each mapper
- would import it's corresponding range. This works good when `--split-by` column has a near uniform distribution
+ would import its corresponding range. This works well when the `--split-by` column has a near-uniform distribution
of data, but that's not always the case with source tables... As tables age, we tend to add columns to them to
take on new business requirements, so over time the latest rows have a higher fill rate than earlier ones.
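
For intuition, here is a minimal sketch of that range-based splitting using Spark's built-in JDBC reader (the URL, table name, and bounds are placeholders, not values from this post). Each of the 10 partitions gets an equal-width slice of the `id` range no matter how many rows actually fall into it, which is exactly why a skewed `--split-by` column leaves a few workers doing most of the import:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Equal-width range split on the partition column: partition i roughly covers
// ids in [lowerBound + i * stride, lowerBound + (i + 1) * stride), so the row
// count per partition can differ wildly when the id distribution is skewed.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://hostName:port/database") // placeholder jdbc url
  .option("dbtable", "table")                           // placeholder table
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "10")
  .load()
```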
@@ -104,7 +104,8 @@ there is an additional column that does not add business value, app developers n
storing it takes up database resources, and if we plan to use it for imports it had better be indexed, consuming even more
compute and storage resources.
- With `sql-delta-import` we can "solve" this problem by making number of chunks much larger than max degree of parallelism.
+ With `sql-delta-import` we still split source tables into ranges based on the `--split-by` column, but if there is data
+ distribution skew we can "solve" this problem by making the number of chunks much larger than the max degree of parallelism.
This way, large chunks with high data density are broken up into smaller pieces that a single executor can handle.
Executors that get chunks with little or no data can just quickly process them and move on to do some real work.
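
As a rough sketch of that approach, reusing the `ImportConfig` API shown in the code example later in this post (the table names and the chunk count here are made up, and the 60-connection cap is the one assumed in the example earlier in the post):

```scala
import io.delta.connectors.spark.JDBC._

// 600 chunks, but still at most 60 concurrent connections: dense id ranges are
// broken into pieces a single executor can handle, while near-empty chunks
// finish quickly so the executor can move on to the next one.
val config = ImportConfig(
  source = "table",                      // placeholder source table
  destination = "target_database.table", // placeholder Delta destination
  splitBy = "id",
  chunks = 600
)
```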
@@ -122,30 +123,30 @@ import org.apache.spark.sql.types._
import io.delta.connectors.spark.JDBC._

implicit val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()

// All additional possible jdbc connector properties described here - https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-configuration-properties.html
val jdbcUrl = "jdbc:mysql://hostName:port/database"

val config = ImportConfig(source = "table", destination = "target_database.table", splitBy = "id", chunks = 10)

// a sample transform to convert all timestamp columns to strings
val timeStampsToStrings: DataFrame => DataFrame = source => {
  val tsCols = source.schema.fields.filter(_.dataType == DataTypes.TimestampType).map(_.name)
  tsCols.foldLeft(source)((df, colName) =>
    df.withColumn(colName, from_unixtime(unix_timestamp(col(colName)), "yyyy-MM-dd HH:mm:ss.S")))
}

// Whatever functions are passed to the transform below will be applied during import
val transforms = new DataTransform(Seq(
  df => df.withColumn("id", col("id").cast(types.StringType)), // custom function to cast id column to string
  timeStampsToStrings // included transform function converts all Timestamp columns to their string representation
))

val importer = new JDBCImport(jdbcUrl = jdbcUrl, importConfig = config, dataTransform = transforms)

importer.run()
```
---