This document provides a cheat sheet on RDD (Resilient Distributed Dataset) basics in PySpark. It summarizes common operations for retrieving RDD information, reshaping data by reducing, grouping, and aggregating, and applying mathematical and user-defined functions to RDDs. These include counting elements, computing summary statistics, grouping and aggregating by key, applying map and flatMap transformations, and set operations such as subtraction.
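For orientation, here is a minimal sketch of that workflow before the full sheet below, assuming a local SparkContext and a small (key, value) pair RDD like the one used throughout the sheet; exact output ordering may vary between runs.

>>> from pyspark import SparkContext
>>> sc = SparkContext(master='local[2]')                  # local context with 2 worker threads
>>> rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])  # small (key, value) pair RDD
>>> rdd.count()                                           # count elements
3
>>> rdd.reduceByKey(lambda x, y: x + y).collect()         # aggregate values per key
[('a', 9), ('b', 2)]
>>> rdd.flatMap(lambda x: x).collect()                    # flatten each (key, value) pair
['a', 7, 'a', 2, 'b', 2]
>>> rdd.subtract(sc.parallelize([('a', 2)])).collect()    # set-style subtraction
[('a', 7), ('b', 2)]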
Python For Data Science Cheat Sheet: PySpark - RDD Basics
Learn Python for data science Interactively at www.DataCamp.com

Spark
PySpark is the Spark Python API that exposes the Spark programming model to Python.

Initializing Spark

SparkContext
>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext
>>> sc.version                 # Retrieve SparkContext version
>>> sc.pythonVer               # Retrieve Python version
>>> sc.master                  # Master URL to connect to
>>> str(sc.sparkHome)          # Path where Spark is installed on worker nodes
>>> str(sc.sparkUser())        # Retrieve name of the Spark User running SparkContext
>>> sc.appName                 # Return application name
>>> sc.applicationId           # Retrieve application ID
>>> sc.defaultParallelism      # Return default level of parallelism
>>> sc.defaultMinPartitions    # Default minimum number of partitions for RDDs

Configuration
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
             .setMaster("local")
             .setAppName("My app")
             .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using The Shell
In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.
$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py
Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

Loading Data

Parallelized Collections
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]), ("b",["p","r"])])

External Data
Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Retrieving RDD Information

Basic Information
>>> rdd.getNumPartitions()          # List the number of partitions
>>> rdd.count()                     # Count RDD instances
3
>>> rdd.countByKey()                # Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue()              # Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap()              # Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum()                      # Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty()    # Check whether RDD is empty
True

Summary
>>> rdd3.max()                      # Maximum value of RDD elements
99
>>> rdd3.min()                      # Minimum value of RDD elements
0
>>> rdd3.mean()                     # Mean value of RDD elements
49.5
>>> rdd3.stdev()                    # Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance()                 # Compute variance of RDD elements
833.25
>>> rdd3.histogram(3)               # Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats()                    # Summary statistics (count, mean, stdev, max & min)

Applying Functions
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()       # Apply a function to each RDD element
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))      # Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> rdd4.flatMapValues(lambda x: x).collect()        # Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

Selecting Data

Getting
>>> rdd.collect()                   # Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2)                     # Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first()                     # Take first RDD element
('a', 7)
>>> rdd.top(2)                      # Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling
>>> rdd3.sample(False, 0.15, 81).collect()   # Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering
>>> rdd.filter(lambda x: "a" in x).collect()   # Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect()                  # Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect()                       # Return (key,value) RDD's keys
['a', 'a', 'b']

Iterating
>>> def g(x): print(x)
>>> rdd.foreach(g)                  # Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Reshaping Data

Reducing
>>> rdd.reduceByKey(lambda x,y: x+y).collect()   # Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a, b: a + b)               # Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by
>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect()   # Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect()                # Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
>>> rdd3.aggregate((0,0),seqOp,combOp)               # Aggregate RDD elements of each partition and then the results
(4950,100)
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect() # Aggregate values of each RDD key
[('a',(9,2)), ('b',(2,1))]
>>> rdd3.fold(0,add)                                 # Aggregate the elements of each partition, and then the results
4950
>>> rdd.foldByKey(0, add).collect()                  # Merge the values for each key
[('a',9),('b',2)]
>>> rdd3.keyBy(lambda x: x+x).collect()              # Create tuples of RDD elements by applying a function

Mathematical Operations
>>> rdd.subtract(rdd2).collect()        # Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect()   # Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect()       # Return the Cartesian product of rdd and rdd2

Sort
>>> rdd2.sortBy(lambda x: x[1]).collect()   # Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect()              # Sort (key, value) RDD by key
[('a',2),('b',1),('d',1)]

Repartitioning
>>> rdd.repartition(4)   # New RDD with 4 partitions
>>> rdd.coalesce(1)      # Decrease the number of partitions in the RDD to 1

Saving
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')

Stopping SparkContext
>>> sc.stop()

Execution
$ ./bin/spark-submit examples/src/main/python/pi.py
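One gap worth noting: the fold, foldByKey, and aggregate entries above assume that add and the seqOp/combOp lambdas are already defined. A minimal self-contained sketch, assuming Python's operator.add and the same rdd and rdd3 created under Parallelized Collections (per-key output ordering may vary):

>>> from operator import add                           # add(x, y) == x + y
>>> rdd3.fold(0, add)                                  # sum each partition, then combine the partial sums
4950
>>> rdd.foldByKey(0, add).collect()                    # per-key sum, starting from the zero value 0
[('a', 9), ('b', 2)]
>>> seqOp = lambda acc, v: (acc[0] + v, acc[1] + 1)    # within a partition: accumulate (sum, count)
>>> combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])   # across partitions: merge (sum, count) pairs
>>> rdd3.aggregate((0, 0), seqOp, combOp)              # (sum, count) over 0..99
(4950, 100)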