
Commit 22ce8e2

changing namespaces and links to delta-io connectors based repo
1 parent a409f43 commit 22ce8e2

File tree: 1 file changed, +18 -8 lines


_posts/2021-03-01-introducing-sql-delta-import.md renamed to _posts/2021-03-08-introducing-sql-delta-import.md

Lines changed: 18 additions & 8 deletions
@@ -16,16 +16,17 @@ of data between HDFS and relational data stores. Our pipelines were using this t
 to import MySQL data into HDFS. When the Platform Engineering team at Scribd took on an effort
 to migrate our on-premise Hadoop workloads to [Databricks Lakehouse Platform](https://databricks.com/product/data-lakehouse)
 on AWS we had to write our own tool to import data from MySQL directly into S3 backed [Delta Lake](https://delta.io/).
-In this post I will share the details about [sql-delta-import](https://github.com/scribd/sql-delta-import) - an
-open-source spark utility to import data from any JDBC compatible database into Delta Lake
+In this post I will share the details about `sql-delta-import` - an open-source Spark utility to import data from any
+JDBC compatible database into Delta Lake. This utility is being open sourced under the
+[Delta Lake Connectors](https://github.com/delta-io/connectors) project.

 ### Sample import

 Importing data into a Delta Lake table is as easy as

 ```shell script
 spark-submit \
---class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.1.0-SNAPSHOT.jar \
+--class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.2.1-SNAPSHOT.jar \
 --jdbc-url jdbc:mysql://hostName:port/database \
 --source source.table
 --destination destination.table
@@ -48,7 +49,7 @@ optimize data storage for best performance on reads by just adding a couple of c
 spark-submit \
 --conf spark.databricks.delta.optimizeWrite.enabled=true \
 --conf spark.databricks.delta.autoCompact.enabled=true \
---class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.1.0-SNAPSHOT.jar \
+--class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.2.1-SNAPSHOT.jar \
 --jdbc-url jdbc:mysql://hostName:port/database \
 --source source.table
 --destination destination.table
@@ -71,7 +72,7 @@ concurrency thus allowing you to tune those parameters independently
 spark-submit --num-executors 15 --executor-cores 4 \
 --conf spark.databricks.delta.optimizeWrite.enabled=true \
 --conf spark.databricks.delta.autoCompact.enabled=true \
---class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.1.0-SNAPSHOT.jar \
+--class "com.scribd.importer.spark.ImportRunner" sql-delta-import_2.12-0.2.1-SNAPSHOT.jar \
 --jdbc-url jdbc:mysql://hostName:port/database \
 --source source.table
 --destination destination.table
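
The hunk above concerns read concurrency: the number of chunks a table is split into (the splitBy/chunks settings that appear in the Scala snippet further down) determines how many JDBC partitions exist, while --num-executors and --executor-cores cap how many of those partitions are fetched at the same time. As a rough, non-authoritative sketch, the same kind of chunked read can be expressed with stock Spark JDBC options; the host, table name, and id bounds below are placeholders, not values from this repo, and this is not the connector's own implementation:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: a chunked JDBC read with plain Spark options.
// All connection details and bounds are placeholders.
object ChunkedReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    val source = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://hostName:port/database")
      .option("dbtable", "source.table")
      .option("partitionColumn", "id") // plays the role of splitBy
      .option("lowerBound", "1")       // placeholder id range
      .option("upperBound", "1000000")
      .option("numPartitions", "60")   // plays the role of chunks
      .load()

    // 60 partitions == 60 chunks; with 15 executors x 4 cores there are 60 task
    // slots, so all chunks can be read concurrently. Fewer slots just means
    // chunks queue up, which is why the two knobs can be tuned independently.
    println(s"number of read partitions: ${source.rdd.getNumPartitions}")
  }
}
```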
@@ -115,9 +116,11 @@ libraries can be imported into your own project. You can specify custom data tra
 more precise control of data type handling

 ```scala
-...
-import com.scribd.importer.spark._
-import com.scribd.importer.spark.transform.DataTransform._
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+import io.delta.connectors.spark.JDBC._

 implicit val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()

@@ -127,6 +130,13 @@ import com.scribd.importer.spark.transform.DataTransform._

 val config = ImportConfig(source = "table", destination = "target_database.table", splitBy = "id", chunks = 10)

+// a sample transform to convert all timestamp columns to strings
+val timeStampsToStrings : DataFrame => DataFrame = source => {
+  val tsCols = source.schema.fields.filter(_.dataType == DataTypes.TimestampType).map(_.name)
+  tsCols.foldLeft(source)((df, colName) =>
+    df.withColumn(colName, from_unixtime(unix_timestamp(col(colName)), "yyyy-MM-dd HH:mm:ss.S")))
+}
+
 // Whatever functions are passed to below transform will be applied during import
 val transforms = new DataTransform(Seq(
   df => df.withColumn("id", col("id").cast(types.StringType)), //custom function to cast id column to string
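
The snippet above is cut off by the hunk boundary part-way through the `DataTransform(Seq(...))` call. Purely as an illustration of what a chain of `DataFrame => DataFrame` transforms does (this is hand-rolled Spark code, not the connector's runner API), here is a self-contained sketch that applies the id cast and the timestamp conversion and lands the result in a Delta table; the table names are placeholders and the Delta write assumes delta-core is on the classpath:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sketch only: apply a sequence of column-level transforms by hand and write the
// result out as a Delta table. Table names are placeholders.
object TransformChainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()

    // same timestamp-to-string transform as in the diff above
    val timeStampsToStrings: DataFrame => DataFrame = source => {
      val tsCols = source.schema.fields.filter(_.dataType == DataTypes.TimestampType).map(_.name)
      tsCols.foldLeft(source)((df, colName) =>
        df.withColumn(colName, from_unixtime(unix_timestamp(col(colName)), "yyyy-MM-dd HH:mm:ss.S")))
    }

    // cast the id column to string, mirroring the first entry of the DataTransform
    val castIdToString: DataFrame => DataFrame =
      df => df.withColumn("id", col("id").cast(StringType))

    val transforms: Seq[DataFrame => DataFrame] = Seq(castIdToString, timeStampsToStrings)

    val source = spark.table("source_database.table")               // placeholder source table
    val transformed = transforms.foldLeft(source)((df, t) => t(df))  // apply transforms in order

    transformed.write.format("delta").mode("overwrite").saveAsTable("target_database.table")
  }
}
```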
