This document provides a cheat sheet on RDD (Resilient Distributed Dataset) basics in PySpark. It summarizes common operations for retrieving RDD information, reshaping data by reducing, grouping, and aggregating, and applying mathematical and user-defined functions to RDDs. These include counting elements, computing summary statistics, grouping and aggregating by key, applying map and flatMap transformations, and set operations such as subtraction.
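For orientation, here is a minimal sketch of that workflow before the full sheet below, assuming a local SparkContext and a small (key, value) pair RDD like the one used throughout the sheet; exact output ordering may vary between runs.

>>> from pyspark import SparkContext
>>> sc = SparkContext(master='local[2]')                  # local context with 2 worker threads
>>> rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])  # small (key, value) pair RDD
>>> rdd.count()                                           # count elements
3
>>> rdd.reduceByKey(lambda x, y: x + y).collect()         # aggregate values per key
[('a', 9), ('b', 2)]
>>> rdd.flatMap(lambda x: x).collect()                    # flatten each (key, value) pair
['a', 7, 'a', 2, 'b', 2]
>>> rdd.subtract(sc.parallelize([('a', 2)])).collect()    # set-style subtraction
[('a', 7), ('b', 2)]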
Python For Data Science Cheat Sheet: PySpark - RDD Basics
Learn Python for data science Interactively at www.DataCamp.com

Spark
PySpark is the Spark Python API that exposes the Spark programming model to Python.

Initializing Spark

SparkContext
>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext
>>> sc.version                 # Retrieve SparkContext version
>>> sc.pythonVer               # Retrieve Python version
>>> sc.master                  # Master URL to connect to
>>> str(sc.sparkHome)          # Path where Spark is installed on worker nodes
>>> str(sc.sparkUser())        # Retrieve name of the Spark User running SparkContext
>>> sc.appName                 # Return application name
>>> sc.applicationId           # Retrieve application ID
>>> sc.defaultParallelism      # Return default level of parallelism
>>> sc.defaultMinPartitions    # Default minimum number of partitions for RDDs

Configuration
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
             .setMaster("local")
             .setAppName("My app")
             .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using The Shell
In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.
$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py
Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

Loading Data

Parallelized Collections
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]), ("b",["p","r"])])

External Data
Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Retrieving RDD Information

Basic Information
>>> rdd.getNumPartitions()          # List the number of partitions
>>> rdd.count()                     # Count RDD instances
3
>>> rdd.countByKey()                # Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue()              # Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap()              # Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum()                      # Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty()    # Check whether RDD is empty
True

Summary
>>> rdd3.max()                      # Maximum value of RDD elements
99
>>> rdd3.min()                      # Minimum value of RDD elements
0
>>> rdd3.mean()                     # Mean value of RDD elements
49.5
>>> rdd3.stdev()                    # Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance()                 # Compute variance of RDD elements
833.25
>>> rdd3.histogram(3)               # Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats()                    # Summary statistics (count, mean, stdev, max & min)

Applying Functions
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()       # Apply a function to each RDD element
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))      # Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> rdd4.flatMapValues(lambda x: x).collect()        # Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

Selecting Data

Getting
>>> rdd.collect()                   # Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2)                     # Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first()                     # Take first RDD element
('a', 7)
>>> rdd.top(2)                      # Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling
>>> rdd3.sample(False, 0.15, 81).collect()   # Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering
>>> rdd.filter(lambda x: "a" in x).collect()   # Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect()                  # Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect()                       # Return (key,value) RDD's keys
['a', 'a', 'b']

Iterating
>>> def g(x): print(x)
>>> rdd.foreach(g)                  # Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Reshaping Data

Reducing
>>> rdd.reduceByKey(lambda x,y: x+y).collect()   # Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a, b: a + b)               # Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by
>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect()   # Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect()                # Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
>>> rdd3.aggregate((0,0),seqOp,combOp)               # Aggregate RDD elements of each partition and then the results
(4950,100)
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect() # Aggregate values of each RDD key
[('a',(9,2)), ('b',(2,1))]
>>> rdd3.fold(0,add)                                 # Aggregate the elements of each partition, and then the results
4950
>>> rdd.foldByKey(0, add).collect()                  # Merge the values for each key
[('a',9),('b',2)]
>>> rdd3.keyBy(lambda x: x+x).collect()              # Create tuples of RDD elements by applying a function

Mathematical Operations
>>> rdd.subtract(rdd2).collect()        # Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect()   # Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect()       # Return the Cartesian product of rdd and rdd2

Sort
>>> rdd2.sortBy(lambda x: x[1]).collect()   # Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect()              # Sort (key, value) RDD by key
[('a',2),('b',1),('d',1)]

Repartitioning
>>> rdd.repartition(4)   # New RDD with 4 partitions
>>> rdd.coalesce(1)      # Decrease the number of partitions in the RDD to 1

Saving
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')

Stopping SparkContext
>>> sc.stop()

Execution
$ ./bin/spark-submit examples/src/main/python/pi.py
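One gap worth noting: the fold, foldByKey, and aggregate entries above assume that add and the seqOp/combOp lambdas are already defined. A minimal self-contained sketch, assuming Python's operator.add and the same rdd and rdd3 created under Parallelized Collections (per-key output ordering may vary):

>>> from operator import add                           # add(x, y) == x + y
>>> rdd3.fold(0, add)                                  # sum each partition, then combine the partial sums
4950
>>> rdd.foldByKey(0, add).collect()                    # per-key sum, starting from the zero value 0
[('a', 9), ('b', 2)]
>>> seqOp = lambda acc, v: (acc[0] + v, acc[1] + 1)    # within a partition: accumulate (sum, count)
>>> combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])   # across partitions: merge (sum, count) pairs
>>> rdd3.aggregate((0, 0), seqOp, combOp)              # (sum, count) over 0..99
(4950, 100)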