@@ -16,16 +16,17 @@ of data between HDFS and relational data stores. Our pipelines were using this t
to import MySQL data into HDFS. When the Platform Engineering team at Scribd took on an effort
to migrate our on-premises Hadoop workloads to [Databricks Lakehouse Platform](https://databricks.com/product/data-lakehouse)
on AWS, we had to write our own tool to import data from MySQL directly into S3-backed [Delta Lake](https://delta.io/).
- In this post I will share the details about [sql-delta-import](https://github.com/scribd/sql-delta-import) - an
- open-source spark utility to import data from any JDBC compatible database into Delta Lake
+ In this post I will share the details about `sql-delta-import` - an open-source Spark utility to import data from any
+ JDBC-compatible database into Delta Lake. This utility is being open sourced under the
+ [Delta Lake Connectors](https://github.com/delta-io/connectors) project.

### Sample import

Importing data into a Delta Lake table is as easy as

```shell
spark-submit \
- --class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.1.0-SNAPSHOT.jar \
+ --class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.2.1-SNAPSHOT.jar \
--jdbc-url jdbc:mysql://hostName:port/database \
--source source.table \
--destination destination.table
@@ -48,7 +49,7 @@ optimize data storage for best performance on reads by just adding a couple of c
spark-submit \
--conf spark.databricks.delta.optimizeWrite.enabled=true \
--conf spark.databricks.delta.autoCompact.enabled=true \
- --class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.1.0-SNAPSHOT.jar \
+ --class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.2.1-SNAPSHOT.jar \
--jdbc-url jdbc:mysql://hostName:port/database \
--source source.table \
--destination destination.table
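# Note: these two Delta settings can also be enabled per Spark session rather than via
# spark-submit flags. An illustrative sketch (not part of sql-delta-import), e.g. from Scala:
#   spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
#   spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")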
@@ -71,7 +72,7 @@ concurrency thus allowing you to tune those parameters independently
spark-submit --num-executors 15 --executor-cores 4 \
--conf spark.databricks.delta.optimizeWrite.enabled=true \
--conf spark.databricks.delta.autoCompact.enabled=true \
- --class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.1.0-SNAPSHOT.jar \
+ --class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.2.1-SNAPSHOT.jar \
--jdbc-url jdbc:mysql://hostName:port/database \
--source source.table \
--destination destination.table
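# Rough concurrency check for the settings above (illustrative arithmetic, not tool output):
# 15 executors x 4 cores = up to 60 chunks read from MySQL and written to Delta Lake in
# parallel, regardless of how many total chunks the source table is split into.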
@@ -115,9 +116,11 @@ libraries can be imported into your own project. You can specify custom data tra
more precise control of data type handling

```scala
- ...
- import com.scribd.importer.spark._
- import com.scribd.importer.spark.transform.DataTransform._
+ import org.apache.spark.sql._
+ import org.apache.spark.sql.functions._
+ import org.apache.spark.sql.types._
+
+ import io.delta.connectors.spark.JDBC._

implicit val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
@@ -127,6 +130,13 @@ import com.scribd.importer.spark.transform.DataTransform._

val config = ImportConfig(source = "table", destination = "target_database.table", splitBy = "id", chunks = 10)

+ // a sample transform to convert all timestamp columns to strings
+ val timeStampsToStrings: DataFrame => DataFrame = source => {
+   val tsCols = source.schema.fields.filter(_.dataType == DataTypes.TimestampType).map(_.name)
+   tsCols.foldLeft(source)((df, colName) =>
+     df.withColumn(colName, from_unixtime(unix_timestamp(col(colName)), "yyyy-MM-dd HH:mm:ss.S")))
+ }
+
// Whatever functions are passed to the transform below will be applied during import
val transforms = new DataTransform(Seq(
  df => df.withColumn("id", col("id").cast(types.StringType)), // custom function to cast id column to string