jleetutorial/python-spark-streaming

Python Spark Streaming

This repository contains the files and lectures for the [Insert Title or Organization Name Here] Spark Streaming Tutorial using Python.

Curriculum

Section 1: Get started with Spark Streaming

  1. Introduction to Streaming:
    • What is streaming?
    • Why use streaming?
    • Popular streaming tools (Kafka, Apache Spark Streaming, etc.)
  2. Overview of Apache Spark Streaming
    • What is Apache Spark Streaming?
    • The advantages of Apache Spark Streaming
    • Use cases of Apache Spark Streaming
  3. Set up Environment on our local box:
    • Install the language SDK, Python (if the source code is written in Python), and Git (Video Example)
    • Check out Source code
    • Setup IDE for our demo project
    • Use IntelliJ IDEA if the program code is written in Scala or Java
    • We will discuss what IDE to use if the code is written in Python (Video Example)
  4. Run our first Spark Streaming project
    • The first project creates a SparkContext, connects to the Twitter stream, and prints out the live tweets
    • Demo how to create a Twitter developer account to obtain the Twitter OAuth token
    • Set the logging level to ERROR to reduce the output noise (Video Example)
    • Code samples one and two, don’t copy and paste these
    • We should point out winutils.exe needs to be installed for Windows users in order to run Spark applications.

Section 2: Spark Streaming Basics

  1. What are Discretized Streams
  2. How to create Discretized Streams
  3. Transformations on DStreams
    • Basic RDD transformations (stateless transformations): map, flatMap, filter, repartition, union, count, reduce, countByValue, reduceByKey, join, cogroup
    • DEMO: Pick two of the transformations to demo in the program
    • EXERCISE: Prepare an exercise for students to use one of the transformations
  4. Transform Operation
    • What the transform operation is and its benefits (Reference)
    • DEMO: Do a demo with the transform operation
    • EXERCISE: Prepare an exercise for students to use the transform operation
  5. Window Operations
    • What are window operations (better with some graphs)
    • Explain parameters (window length and sliding interval)
    • Some of the popular window operations: window, countByWindow, reduceByKeyAndWindow, countByValueAndWindow
  6. Window
    • Explain the window transformation in depth and how the window function is used
    • DEMO: Do a demo with the window transformation
    • EXERCISE: Give an exercise about the window transformation
  7. countByWindow
    • Explain the countByWindow transformation in depth and how the countByWindow function is used
    • DEMO: Do a demo with the countByWindow transformation
    • EXERCISE: Give an exercise about the countByWindow transformation
  8. reduceByKeyAndWindow
    • Explain the reduceByKeyAndWindow transformation in depth and how the reduceByKeyAndWindow function is used
    • DEMO: Do a demo with the reduceByKeyAndWindow transformation
    • EXERCISE: Give an exercise about the reduceByKeyAndWindow transformation
  9. countByValueAndWindow
    • Explain the countByValueAndWindow transformation in depth and how the countByValueAndWindow function is used
    • DEMO: Do a demo with the countByValueAndWindow transformation
    • EXERCISE: Give an exercise about the countByValueAndWindow transformation
  10. Output Operations on DStreams
  11. foreachRDD
  12. SQL Operations
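
The window parameters from Section 2 (window length and sliding interval) can be sketched without a cluster. The block below is a plain-Python simulation, not Spark's API: the helper names `sliding_windows`, `count_by_window`, and `count_by_value_and_window` are hypothetical and only mirror what the DStream operations `countByWindow` and `countByValueAndWindow` compute. It assumes both parameters are given in whole batches, since in Spark they must be multiples of the batch interval.

```python
from collections import Counter


def sliding_windows(batches, window_length, slide_interval):
    """Yield (end, window) pairs, where window is a list of micro-batches.

    A window is emitted every `slide_interval` batches and covers the
    last `window_length` batches (fewer at the start of the stream).
    """
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        yield end, batches[start:end]


def count_by_window(window):
    """Number of elements across all batches in the window."""
    return sum(len(batch) for batch in window)


def count_by_value_and_window(window):
    """Frequency of each distinct value across the window."""
    counts = Counter()
    for batch in window:
        counts.update(batch)
    return dict(counts)


# Six micro-batches of made-up events (illustrative data only).
batches = [["a", "b"], ["a"], ["c", "a"], ["b"], [], ["c"]]

# Window length = 3 batches, sliding interval = 2 batches.
results = [
    (end, count_by_window(w), count_by_value_and_window(w))
    for end, w in sliding_windows(batches, window_length=3, slide_interval=2)
]

for end, total, by_value in results:
    print(f"after batch {end}: total={total}, by value={by_value}")
```

Because the window length (3) is larger than the sliding interval (2), consecutive windows overlap by one batch, which is why the second batch's `"a"` is not counted twice but the fourth batch's `"b"` appears in two windows.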

Section 3: Advanced

  1. Join Operations
    • Different types of Join
    • Stream-stream joins
    • Stream-dataset joins
    • DEMO: Do a demo with Stream-stream joins
    • DEMO: Do a demo with Stream-dataset joins
    • EXERCISE: Give an exercise with Stream-stream joins or Stream-dataset joins
  2. Stateful transformation
    • Transformations
    • updateStateByKey
    • mapWithState
    • DEMO: Do a demo with updateStateByKey or mapWithState
    • Need to come up with a suitable scenario for mapWithState or updateStateByKey, such as web session data
    • EXERCISE: Prepare an exercise with updateStateByKey or mapWithState
  3. Checkpointing
    • What is checkpointing and why use it
    • Different types of checkpoint
    • Metadata checkpointing
    • Data checkpointing
    • When to enable Checkpointing
    • How to configure Checkpointing
    • DEMO: Do a demo with Checkpointing
    • EXERCISE: Give an exercise with Checkpointing
  4. Accumulators
    • What are accumulators and how are they used
    • DEMO: Do a demo with accumulators
    • EXERCISE: Give an exercise with accumulators
  5. Fault-tolerance
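
For the stateful transformation topic in Section 3, the following is a minimal plain-Python sketch of the idea behind `updateStateByKey`, using web session data as the scenario. It does not call Spark: `update_state_by_key` and `running_page_views` are hypothetical names, and the data is made up. The sketch assumes the same contract as the DStream operation, where the update function receives the new values for a key plus the previous state, and returning `None` drops the key.

```python
def update_state_by_key(state, batch, update_func):
    """Apply update_func to every key seen so far or in this batch.

    state: dict of key -> previous state
    batch: list of (key, value) pairs from one micro-batch
    """
    # Group the new values of this batch by key.
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    new_state = {}
    # Keys with no new values are still visited, with an empty list.
    for key in set(state) | set(grouped):
        updated = update_func(grouped.get(key, []), state.get(key))
        if updated is not None:  # None removes the key from the state
            new_state[key] = updated
    return new_state


def running_page_views(new_values, last_count):
    """Running total of page views per user."""
    return (last_count or 0) + sum(new_values)


# Three micro-batches of (user, page_view) events (illustrative data only).
batches = [
    [("user1", 1), ("user2", 1)],
    [("user1", 1)],
    [("user2", 1), ("user2", 1), ("user3", 1)],
]

state = {}
history = []
for batch in batches:
    state = update_state_by_key(state, batch, running_page_views)
    history.append(dict(state))
    print(state)
```

The state has to survive from one batch to the next, which is the reason Spark requires checkpointing to be enabled before a stateful transformation can run.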

Section 4: More about Spark streaming

  1. Performance Tuning
    • Reference
    • Reducing the Batch Processing Times
    • Level of Parallelism in Data Receiving
    • Level of Parallelism in Data Processing
    • Data Serialization
    • Task Launching Overheads
    • Setting the Right Batch Interval
    • Memory Tuning
  2. Integration with Kafka
    • Introduction to Kafka
    • Why integrate with Kafka
    • DEMO: Demo
  3. Integration with Kinesis
    • Introduction to Kinesis
    • Why integrate with Kinesis
    • DEMO: Demo

Section 5: Structured Streaming

  1. Introduction to Structured Streaming

  2. Operations on streaming DataFrames/Datasets

  3. Window Operations

  4. Handling Late Data and Watermarking
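
Late data and watermarking in Section 5 can be illustrated with a simplified plain-Python model. This is a sketch under stated assumptions, not Structured Streaming's implementation: the names `WINDOW`, `DELAY`, `window_start`, and `process` are hypothetical, windows are tumbling, event times are integer seconds, and the watermark used for a batch is the maximum event time seen in earlier batches minus the allowed delay. An event is dropped when its window has already ended at or before the watermark.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window size in seconds (assumed value)
DELAY = 5    # how late an event may arrive, in seconds (assumed value)


def window_start(event_time):
    """Start of the tumbling window that contains event_time."""
    return (event_time // WINDOW) * WINDOW


def process(batches):
    """Count words per event-time window, dropping data that is too late."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for batch in batches:
        # Watermark is based only on batches processed before this one.
        watermark = max_event_time - DELAY
        for event_time, word in batch:
            start = window_start(event_time)
            if start + WINDOW <= watermark:
                dropped.append((event_time, word))  # window already closed
            else:
                counts[(start, word)] += 1
        if batch:
            max_event_time = max(max_event_time, max(t for t, _ in batch))
    return dict(counts), dropped


# Micro-batches of (event_time, word); some events arrive out of order.
batches = [
    [(1, "a"), (4, "b")],
    [(12, "a"), (7, "a")],
    [(21, "b"), (3, "a")],
    [(8, "b"), (22, "a")],
]

counts, dropped = process(batches)
print(counts)
print(dropped)
```

In this run the event at time 3 in the third batch is still accepted, because the watermark at that point is 7 and its window ends at 10. The event at time 8 in the fourth batch is dropped, because the watermark has advanced to 16.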

Section 6: Finish up

  1. Add an introductory lecture about what is covered in the course
    • This video should be placed as the first lecture of this course, but we record it after we are done creating the course
  2. Add a promotion video
    • This will be about what users will learn from this course and how they will benefit
  3. Finish up lecture
    • This last lecture summarizes what we have taught in this course and points to future learning material
