
Commit 4eab678

Addressing feedback, adding link to delta-io connectors PR.
1 parent 36b1f23

File tree

1 file changed: +19 -18 lines

_posts/2021-03-08-introducing-sql-delta-import.md (19 additions, 18 deletions)

@@ -18,7 +18,7 @@ to migrate our on-premise Hadoop workloads to [Databricks Lakehouse Platform](ht
on AWS we had to write our own tool to import data from MySQL directly into S3 backed [Delta Lake](https://delta.io/).
In this post I will share the details about `sql-delta-import` - an open-source spark utility to import data from any
JDBC compatible database into Delta Lake. This utility is being open sourced under
-[Delta Lake Connectors](https://github.com/delta-io/connectors) project
+[Delta Lake Connectors](https://github.com/delta-io/connectors/pull/80) project

### Sample import

@@ -87,7 +87,7 @@ but no more than 60 concurrent connections will be used for import since max deg
#### 3.1 `--num-mappers` and data skew just don't play nicely together

When `sqoop` imports data, source table will be split into ranges based on `--split-by` column and each mapper
-would import it's corresponding range. This works good when `--split-by` column has a near uniform distribution
+would import its corresponding range. This works good when `--split-by` column has a near uniform distribution
of data, but that's not always the case with source tables... As tables age we tend to add additional columns to them to
take on new business requirements so over time data in latest rows has a higher fill rate than earlier rows.

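A minimal sketch of the skew problem described in this hunk (all ids and counts are hypothetical; this is not sqoop or sql-delta-import code): equal-width range splitting hands every mapper the same span of `--split-by` values, not the same number of rows.

```scala
// Minimal sketch: equal-width splitting of a --split-by column into mapper ranges.
// All values are hypothetical.
object RangeSplitSketch extends App {
  val minId = 1L
  val maxId = 80000000L
  val numMappers = 8
  val width = (maxId - minId + 1) / numMappers

  // Each mapper gets the same *range* of ids...
  val ranges = (0 until numMappers).map { i =>
    val lo = minId + i * width
    val hi = if (i == numMappers - 1) maxId else lo + width - 1
    (lo, hi)
  }

  // ...but not the same number of rows: if the most recent (highest) ids are the
  // densest, the last mapper ends up with most of the data while earlier ones idle.
  ranges.zipWithIndex.foreach { case ((lo, hi), i) =>
    println(s"mapper $i imports ids [$lo, $hi]")
  }
}
```
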
@@ -104,7 +104,8 @@ there is an additional column that does not add business value, app developers n
storing it takes up database resources and if we plan to use it for imports it's better be indexed, thus even more
compute and storage resources.

-With `sql-delta-import` we can "solve" this problem by making number of chunks much larger than max degree of parallelism.
+With `sql-delta-import` we still split source tables into ranges based on `--split-by` column but if there is data
+distribution skew we can "solve" this problem by making number of chunks much larger than max degree of parallelism.
This way large chunks with high data density are broken up into smaller pieces that a single executor can handle.
Executors that get chunks with little or no data can just quickly process them and move on to do some real work.

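The same chunks-much-larger-than-max-parallelism idea can be sketched with Spark's generic partitioned JDBC reader; this is an illustration only, not the sql-delta-import internals, and the URL, bounds, credentials and output path are all hypothetical.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

// Illustration only: Spark's built-in partitioned JDBC read, with the partition
// count playing the role of "chunks". All connection details are hypothetical.
val spark = SparkSession.builder().appName("chunked-import-sketch").getOrCreate()

val props = new Properties()
props.setProperty("user", "import_user")          // hypothetical
props.setProperty("password", "import_password")  // hypothetical

// Requesting far more partitions than the cluster has task slots means dense id
// ranges are broken into small pieces, sparse ranges finish almost instantly, and
// concurrent DB connections stay capped by the number of simultaneously running tasks.
val df = spark.read.jdbc(
  "jdbc:mysql://hostName:port/database",  // url
  "table",                                // source table
  "id",                                   // partition column, analogous to --split-by
  0L,                                     // lowerBound of the split column (hypothetical)
  100000000L,                             // upperBound of the split column (hypothetical)
  500,                                    // numPartitions, i.e. chunks >> max parallelism
  props)

df.write.format("delta").mode("overwrite").save("s3://bucket/target_database/table")
```
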
@@ -122,30 +123,30 @@ import org.apache.spark.sql.types._

import io.delta.connectors.spark.JDBC._

-implicit val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
+implicit val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()


-// All additional possible jdbc connector properties described here - https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-configuration-properties.html
-val jdbcUrl = "jdbc:mysql://hostName:port/database"
+// All additional possible jdbc connector properties described here - https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-configuration-properties.html
+val jdbcUrl = "jdbc:mysql://hostName:port/database"

-val config = ImportConfig(source = "table", destination = "target_database.table", splitBy = "id", chunks = 10)
+val config = ImportConfig(source = "table", destination = "target_database.table", splitBy = "id", chunks = 10)

// a sample transform to convert all timestamp columns to strings
-val timeStampsToStrings : DataFrame => DataFrame = source => {
-  val tsCols = source.schema.fields.filter(_.dataType == DataTypes.TimestampType).map(_.name)
-  tsCols.foldLeft(source)((df, colName) =>
-    df.withColumn(colName, from_unixtime(unix_timestamp(col(colName)), "yyyy-MM-dd HH:mm:ss.S")))
+val timeStampsToStrings : DataFrame => DataFrame = source => {
+  val tsCols = source.schema.fields.filter(_.dataType == DataTypes.TimestampType).map(_.name)
+  tsCols.foldLeft(source)((df, colName) =>
+    df.withColumn(colName, from_unixtime(unix_timestamp(col(colName)), "yyyy-MM-dd HH:mm:ss.S")))
}

-// Whatever functions are passed to below transform will be applied during import
-val transforms = new DataTransform(Seq(
-  df => df.withColumn("id", col("id").cast(types.StringType)), //custom function to cast id column to string
-  timeStampsToStrings //included transform function converts all Timestamp columns to their string representation
-))
+// Whatever functions are passed to below transform will be applied during import
+val transforms = new DataTransform(Seq(
+  df => df.withColumn("id", col("id").cast(types.StringType)), //custom function to cast id column to string
+  timeStampsToStrings //included transform function converts all Timestamp columns to their string representation
+))

-val importer = new JDBCImport(jdbcUrl = jdbcUrl, importConfig = config, dataTransform = transforms)
+val importer = new JDBCImport(jdbcUrl = jdbcUrl, importConfig = config, dataTransform = transforms)

-importer.run()
+importer.run()
```

---
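A hedged follow-on to the code in this hunk: because `DataTransform` is given plain `DataFrame => DataFrame` functions, additional per-column cleanup can be appended to the same `Seq`. The transforms below are made-up examples in the same shape as `timeStampsToStrings`, not part of the post.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Hypothetical extra transforms, written in the same DataFrame => DataFrame shape.
val dropAuditColumn: DataFrame => DataFrame = df => df.drop("audit_blob") // made-up column name

val trimStrings: DataFrame => DataFrame = df => {
  val strCols = df.schema.fields.filter(_.dataType == StringType).map(_.name)
  strCols.foldLeft(df)((acc, c) => acc.withColumn(c, trim(col(c))))
}

// These would simply be added to the Seq passed to DataTransform above, e.g.
// new DataTransform(Seq(dropAuditColumn, trimStrings, timeStampsToStrings))
```

Assuming the functions are applied in the order they appear in the `Seq`, a column-dropping transform should be listed before any transform that reads those columns.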
