pwendell

This patch improves balancing when repartitioning an RDD. Previously the
elements in the RDD were hash partitioned, so if the RDD was skewed,
certain partitions would end up being very large.

This commit adds load balancing of elements across the repartitioned
RDD splits. The load balancing is not perfect: a given output partition
can have up to N more elements than the average if there are N input
partitions. However, some randomization is used to minimize the
probability that this happens.
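
The balancing idea in this description can be sketched in plain Python. This is an illustrative simulation, not the code in the patch: the function names `hash_repartition` and `balanced_repartition`, the list-of-lists model of an RDD, and the choice to seed the random start offset with the input partition index are all assumptions made for the example.

```python
import random


def hash_repartition(partitions, num_out):
    """Baseline: route every element to an output chosen by its hash."""
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for x in part:
            out[hash(x) % num_out].append(x)
    return out


def balanced_repartition(partitions, num_out):
    """Deal each input partition's elements round-robin over the outputs.

    Each input partition starts at a random output (assumed here to be
    seeded by the partition index) so that the leftover elements of
    different inputs are unlikely to pile up on the same output.
    """
    out = [[] for _ in range(num_out)]
    for index, part in enumerate(partitions):
        position = random.Random(index).randrange(num_out)
        for x in part:
            position += 1
            out[position % num_out].append(x)
    return out


# Hypothetical skewed input: almost every element has the same value.
skewed = [[7] * 1000, [7] * 10, list(range(5))]
hashed_sizes = [len(p) for p in hash_repartition(skewed, 4)]
balanced_sizes = [len(p) for p in balanced_repartition(skewed, 4)]
print("hash partitioned:", hashed_sizes)
print("load balanced:   ", balanced_sizes)
```

In this sketch each input partition gives any single output at most one element more than its even share, so with N input partitions no output can exceed the average by more than N, which matches the bound stated above.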

@pwendell

/cc @aarondav @mateiz

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14881/

Put a space before {

@mateiz

mateiz commented May 12, 2014

Looks good to me.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@asfgit asfgit closed this in 7d9cc92 May 12, 2014
asfgit pushed a commit that referenced this pull request May 12, 2014
Author: Patrick Wendell <[email protected]>

Closes #727 from pwendell/load-balance and squashes the following commits:

f9da752 [Patrick Wendell] Response to Matei's feedback
acfa46a [Patrick Wendell] SPARK-1770: Load balance elements when repartitioning.
(cherry picked from commit 7d9cc92)

Signed-off-by: Patrick Wendell <[email protected]>
@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14894/

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
Author: Patrick Wendell <[email protected]>

Closes apache#727 from pwendell/load-balance and squashes the following commits:

f9da752 [Patrick Wendell] Response to Matei's feedback
acfa46a [Patrick Wendell] SPARK-1770: Load balance elements when repartitioning.
agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
…MapRFileSystem not found for spark-hive integration jobs (apache#727)
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
…MapRFileSystem not found for spark-hive integration jobs (apache#727)
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025
…MapRFileSystem not found for spark-hive integration jobs (apache#727)