datafu-spark contains a number of Spark APIs and a "Scala-Python bridge" that makes it easier to call Scala code from Python, and vice versa.
The matrix below lists the Spark versions that each DataFu release has been compiled and tested against. Many methods also work on unsupported versions.
| DataFu | Spark |
|--------|-------|
| 1.7.0 | 2.2.0 to 2.2.2, 2.3.0 to 2.3.2, and 2.4.0 to 2.4.3 |
| 1.8.0 | 2.2.3, 2.3.3, and 2.4.4 to 2.4.5 |
| 2.0.0 | 3.0.x to 3.1.x |
| 2.1.0 (not released yet) | 3.0.x to 3.4.x |
Here are some examples of things you can do with it:
- "Dedup" a table - remove duplicates based on a key and ordering (typically a date-updated field, to keep only the most recently updated record).
- Do a skewed join between tables, where the smaller table is still too big to fit in memory (see the sketch after this list).
- Count distinct up to - an efficient implementation for when you only need to verify that at least a certain number of distinct rows appear in a table.
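As a rough illustration of the skewed-join helper mentioned above, a call from PySpark might look like the sketch below. The function name `join_skewed` and its parameters are assumptions mirroring the Scala `joinSkewed` API; check `pyspark_utils/df_utils.py` for the exact signature.

```python
from pyspark_utils import df_utils

# A large, skewed table and a smaller (but not broadcastable) table.
df_big = spark.createDataFrame(
    [("a", 1), ("a", 2), ("a", 3), ("b", 4)], ["key", "value"])
df_small = spark.createDataFrame(
    [("a", "foo"), ("b", "bar")], ["key", "label"])

# Hypothetical call - the helper name and parameters are assumptions
# based on the Scala joinSkewed method; verify against df_utils.
df_joined = df_utils.join_skewed(
    df_left=df_big,
    df_right=df_small,
    join_exprs=df_big.key == df_small.key,
    num_shards=10,
    join_type="inner")
df_joined.show()
```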
It has been tested on Spark releases from 3.0.0 to 3.4.2 using Scala 2.12. You can check if your Spark/Scala version combination has been tested by looking here.
To call the datafu-spark APIs from PySpark, you can do the following (tested on a Hortonworks VM).
First, launch PySpark with the following parameters:
```bash
export PYTHONPATH=datafu-spark_2.12-2.0.0.jar

pyspark --jars datafu-spark_2.12-2.0.0.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.12-2.0.0.jar
```
The following is an example of calling the Spark version of the DataFu dedup method:
```python
from pyspark_utils import df_utils

df_people = spark.createDataFrame([
    ("a", "Alice", 34),
    ("a", "Sara", 33),
    ("b", "Bob", 36),
    ("b", "Charlie", 30),
    ("c", "David", 29),
    ("c", "Esther", 32),
    ("c", "Fanny", 36),
    ("c", "Zoey", 36)],
    ["id", "name", "age"])

# Keep one record per id: the oldest person, with ties broken by name (descending).
df_dedup = df_utils.dedup_with_order(df=df_people, group_col=df_people.id,
                                     order_cols=[df_people.age.desc(), df_people.name.desc()])
df_dedup.show()
```
This should produce the following output:
```
+---+-----+---+
| id| name|age|
+---+-----+---+
|  c| Zoey| 36|
|  b|  Bob| 36|
|  a|Alice| 34|
+---+-----+---+
```
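A related helper keeps the top N records per group instead of just one. The sketch below reuses `df_people` from the example above; the name `dedup_top_n` and its parameters are assumptions mirroring the Scala `dedupTopN` method, so check `df_utils` for the actual API.

```python
# Hypothetical sketch - dedup_top_n and its parameters are assumptions
# based on the Scala dedupTopN method; verify against df_utils.
df_top2 = df_utils.dedup_top_n(df=df_people, n=2, group_col=df_people.id,
                               order_cols=[df_people.age.desc()])
df_top2.show()
```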
Building and testing datafu-spark can be done as described in the main DataFu README.
If you wish to build for a specific Scala/Spark version, there are two options. One is to change the scalaVersion and sparkVersion in the main gradle.properties file.
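For example, the relevant entries in gradle.properties might look like this (the property names are taken from the text above; the exact defaults in the file may differ):

```
scalaVersion=2.12
sparkVersion=3.1.3
```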
The other is to pass these parameters on the command line. For example, to build and test for Scala 2.12 and Spark 3.1.3, you would run:
```bash
./gradlew :datafu-spark:test -PscalaVersion=2.12 -PsparkVersion=3.1.3
```
There is also a script for building and testing datafu-spark across multiple Scala/Spark combinations.
To see the available options, run it like this:
```bash
./build_and_test_spark.sh -h
```