SPARK-1770: Load balance elements when repartitioning. #727

pwendell · 2014-05-11T01:26:01Z

This patch adds better balancing when repartitioning an RDD. Previously
the elements in the RDD were hash partitioned, meaning that if the RDD
was skewed, certain partitions would end up being very large.

This commit adds load balancing of elements across the repartitioned
RDD splits. The load balancing is not perfect: a given output partition
can have up to N more elements than the average if there are N input
partitions. However, some randomization is used to minimize the
probability that this happens.
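
A rough sketch of the idea, not the patch itself: each input partition hands its elements to output partitions round-robin, starting from a pseudo-random offset. Each input partition then contributes counts that differ by at most one across output partitions, which is where the "up to N more than the average" bound for N input partitions comes from. The sketch uses plain Scala collections instead of RDDs. The names `RepartitionSketch`, `distribute`, `balancedSizes` and `hashedSizes` are made up for illustration, and seeding the offset by the input partition index is an assumption about how the randomization could be applied.

```scala
import scala.util.Random

object RepartitionSketch {
  // Tag each element of one input partition with an output partition id.
  // Assumption: the starting offset is drawn from a Random seeded by the
  // input partition index; after that, ids advance round-robin.
  def distribute[T](index: Int, items: Seq[T], numOut: Int): Seq[(Int, T)] = {
    var position = new Random(index).nextInt(numOut)
    items.map { t =>
      position += 1
      (position % numOut, t)
    }
  }

  // Number of elements per output partition with round-robin balancing.
  def balancedSizes[T](inputs: Seq[Seq[T]], numOut: Int): Seq[Int] = {
    val counts = Array.fill(numOut)(0)
    for ((part, idx) <- inputs.zipWithIndex; (out, _) <- distribute(idx, part, numOut)) {
      counts(out) += 1
    }
    counts.toSeq
  }

  // For comparison: place each element by its own hash code, so equal
  // elements always land in the same output partition.
  def hashedSizes[T](inputs: Seq[Seq[T]], numOut: Int): Seq[Int] = {
    val counts = Array.fill(numOut)(0)
    for (part <- inputs; t <- part) {
      val h = t.hashCode % numOut
      counts(if (h < 0) h + numOut else h) += 1
    }
    counts.toSeq
  }
}
```

With a skewed input such as `Seq(Seq.fill(1000)("a"), Seq.fill(500)("a"), Seq.fill(10)("b"))` and four output partitions, `hashedSizes` puts all 1500 copies of `"a"` in one partition, while the largest and smallest results of `balancedSizes` differ by at most 3, the number of input partitions.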


pwendell · 2014-05-11T01:26:09Z

/cc @aarondav @mateiz

AmplabJenkins · 2014-05-11T01:27:57Z

Merged build triggered.

AmplabJenkins · 2014-05-11T01:28:06Z

Merged build started.

AmplabJenkins · 2014-05-11T02:07:34Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-05-11T02:07:34Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14881/

mateiz · 2014-05-11T23:58:15Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

Put a space before {

mateiz · 2014-05-12T00:11:21Z

Looks good to me.

AmplabJenkins · 2014-05-12T00:12:57Z

Merged build triggered.

AmplabJenkins · 2014-05-12T00:13:04Z

Merged build started.

This patch adds better balancing when performing a repartition of an RDD. Previously the elements in the RDD were hash partitioned, meaning if the RDD was skewed certain partitions would end up being very large. This commit adds load balancing of elements across the repartitioned RDD splits. The load balancing is not perfect: a given output partition can have up to N more elements than the average if there are N input partitions. However, some randomization is used to minimize the probability that this happens.

Author: Patrick Wendell <[email protected]>

Closes #727 from pwendell/load-balance and squashes the following commits:

f9da752 [Patrick Wendell] Response to Matei's feedback
acfa46a [Patrick Wendell] SPARK-1770: Load balance elements when repartitioning.

(cherry picked from commit 7d9cc92)
Signed-off-by: Patrick Wendell <[email protected]>

AmplabJenkins · 2014-05-12T00:52:25Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-05-12T00:52:25Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14894/




Response to Matei's feedback

f9da752

asfgit closed this in 7d9cc92 May 12, 2014

pwendell mentioned this pull request May 27, 2014

SPARK[1784]: Adding a balancedPartitioner #876

Closed

agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022

MapR [SPARK-796] java.lang.ClassNotFoundException: Class com.mapr.fs.…

6919644

…MapRFileSystem not found for spark-hive integration jobs (apache#727)

udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024

MapR [SPARK-796] java.lang.ClassNotFoundException: Class com.mapr.fs.…

e4e7db8

…MapRFileSystem not found for spark-hive integration jobs (apache#727)

mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025

MapR [SPARK-796] java.lang.ClassNotFoundException: Class com.mapr.fs.…

a1a54d4

…MapRFileSystem not found for spark-hive integration jobs (apache#727)
