This repository contains the files and lectures for the [Insert Title or Organization Name Here] Spark Streaming Tutorial using Python.
- Introduction to Streaming:
- What is streaming?
- Why use streaming?
- Popular streaming tools (Kafka, Apache Spark streaming, etc.)
- Overview of Apache Spark streaming
- What is Apache Spark streaming?
- The advantages of Apache Spark streaming
- Use cases of Apache Spark streaming
- Set up Environment on our local box:
- Install the language SDK, Python (if the source code is written in Python), and Git (Video Example)
- Check out the source code
- Set up the IDE for our demo project
- Use IntelliJ IDEA if the program code is written in Scala or Java
- We will discuss what IDE to use if the code is written in Python (Video Example)
- Run our first Spark streaming project
- The first project just creates a SparkContext, connects to the Twitter stream, and prints out the live tweets
- Need to demo how to create a Twitter developer account to obtain the Twitter OAuth token
- Need to set the logging level to ERROR to reduce the output noise (Video Example)
- Code samples one and two, don’t copy and paste these
- We should point out winutils.exe needs to be installed for Windows users in order to run Spark applications.
- What are Discretized Streams
- Reference
- Use some diagrams to explain DStreams
- How to create Discretized Streams
- Different ways to create DStreams
- File Streams
- Queue of RDDs as a Stream
- Kafka
- Flume
- Kinesis
- Twitter (Reference, revisit your first Spark application)
- DEMO: Queue of RDDs as a Stream
- Transformations on DStreams
- Basic RDD transformations (stateless transformations): `map`, `flatMap`, `filter`, `repartition`, `union`, `count`, `reduce`, `countByValue`, `reduceByKey`, `join`, `cogroup`
- DEMO: Pick 2 of the transformations to demo in the program
- EXERCISE: Prepare an exercise for students to use one of the transformations
- Basic RDD transformations (stateless transformations):
- Transform Operation
- What the transform operation is and its benefits (Reference)
- DEMO: Do a demo with the transform operation
- EXERCISE: Prepare an exercise for students to use the transform operation
- Window Operations
- What window operations are (better with some diagrams)
- Explain parameters (window length and sliding interval)
- Some of the popular window operations (e.g., `window`, `countByWindow`, `reduceByKeyAndWindow`, `countByValueAndWindow`)
- `window`
- Explain the `window` transformation in depth and the usage of the `window` function
- DEMO: Do a demo with the `window` transformation
- EXERCISE: Give an exercise about the `window` transformation
- `countByWindow`
- Explain the `countByWindow` transformation in depth and the usage of the `countByWindow` function
- DEMO: Do a demo with the `countByWindow` transformation
- EXERCISE: Give an exercise about the `countByWindow` transformation
- `reduceByKeyAndWindow`
- Explain the `reduceByKeyAndWindow` transformation in depth and the usage of the `reduceByKeyAndWindow` function
- DEMO: Do a demo with the `reduceByKeyAndWindow` transformation
- EXERCISE: Give an exercise about the `reduceByKeyAndWindow` transformation
- `countByValueAndWindow`
- Explain the `countByValueAndWindow` transformation in depth and the usage of the `countByValueAndWindow` function
- DEMO: Do a demo with the `countByValueAndWindow` transformation
- EXERCISE: Give an exercise about the `countByValueAndWindow` transformation
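
As a rough illustration of the two window parameters (window length and sliding interval), the following pure-Python sketch groups a list of batches the way a windowed DStream would, without needing Spark. The helper names `windows` and `count_by_value_and_window` are hypothetical, the sample data is made up, and both parameters are counted in batches rather than as durations, which is a simplifying assumption.

```python
from collections import Counter


def windows(batches, window_length, slide_interval):
    """Yield (end, window) pairs; `end` is the number of batches seen so far.

    Simplifying assumption: both parameters are counted in batches. In
    Spark Streaming they are durations that must be multiples of the
    batch interval.
    """
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)  # early windows are partial
        yield end, batches[start:end]


def count_by_value_and_window(batches, window_length, slide_interval):
    """Count each distinct value inside every window (hypothetical helper)."""
    results = []
    for end, window in windows(batches, window_length, slide_interval):
        counts = Counter(value for batch in window for value in batch)
        results.append((end, dict(counts)))
    return results


if __name__ == "__main__":
    # Six made-up batches; window length = 3 batches, slide = 2 batches.
    batches = [["a", "b"], ["a"], ["c"], ["a", "c"], ["b"], ["b"]]
    for end, counts in count_by_value_and_window(batches, 3, 2):
        print(f"after batch {end}: {counts}")
```

With a window length of 3 and a sliding interval of 2, consecutive windows overlap by one batch, so a value in the overlapping batch is counted in both windows.
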
- Output Operations on DStreams
- Different output operations (e.g., `print`, `saveAsTextFiles`, `saveAsObjectFiles`, `saveAsHadoopFiles`, `foreachRDD`)
- DEMO: Demo how to save tweets to files (Example)
- Use `foreachRDD` and `saveAsTextFiles`
- `foreachRDD`
- Explain `foreachRDD` and its basic usage
- Design patterns for `foreachRDD`
- Reference: https://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
- DEMO: Do a demo with `foreachRDD`
- EXERCISE: Give an exercise about `foreachRDD`
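
One of the design patterns for `foreachRDD` is to open a connection once per partition instead of once per record. The sketch below imitates that idea in plain Python so it can run without Spark. `FakeConnection` and `send_partition` are hypothetical names, and the partitions are simulated as lists; treat it as an illustration under those assumptions, not as the demo code for this course.

```python
class FakeConnection:
    """Hypothetical stand-in for a real sink client (database, socket, ...)."""

    opened = 0  # class-level counter, only to make the cost visible

    def __init__(self):
        FakeConnection.opened += 1
        self.sent = []
        self.closed = False

    def send(self, record):
        self.sent.append(record)

    def close(self):
        self.closed = True


def send_partition(records):
    """Open one connection per partition and reuse it for every record."""
    connection = FakeConnection()
    try:
        for record in records:
            connection.send(record)
    finally:
        connection.close()
    return connection  # returned only so the sketch can be inspected


if __name__ == "__main__":
    # Two "partitions" of one batch, simulated as plain lists.
    partitions = [["t1", "t2", "t3"], ["t4", "t5"]]
    for partition in partitions:
        send_partition(iter(partition))
    print("connections opened:", FakeConnection.opened)

    # With PySpark the same function would be wired in roughly like this:
    # dstream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))
```

The counter shows one connection per partition rather than one per record, which is the point of the pattern.
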
- SQL OPERATIONS
- DataFrame and SQL Operations
- DEMO: Do a demo with SQL OPERATIONS
- EXERCISE: Give an exercise about SQL OPERATIONS
- Join Operations
- Different types of Join
- Stream-stream joins
- Stream-dataset joins
- DEMO: Do a demo with Stream-stream joins
- DEMO: Do a demo with Stream-dataset joins
- EXERCISE: Give an exercise with Stream-stream joins or Stream-dataset joins
- Stateful transformation
- Transformations: `updateStateByKey`, `mapWithState`
- DEMO: Do a demo with `updateStateByKey` or `mapWithState`
- Need to come up with a proper scenario for using `mapWithState` or `updateStateByKey`, such as web session data
- EXERCISE: Prepare an exercise with `updateStateByKey` or `mapWithState`
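
A possible starting point for the web session scenario is a running count of page views per session. The sketch below shows an update function in the shape that PySpark's `updateStateByKey` expects (new values for the key, then the previous state), and exercises it with a small pure-Python loop. The `run_batches` helper and the session data are hypothetical and exist only to make the sketch runnable without Spark.

```python
def update_count(new_values, running_count):
    """Update function in the shape PySpark's updateStateByKey expects.

    new_values: values that arrived for the key in the current batch.
    running_count: previous state for the key, or None the first time.
    """
    return sum(new_values) + (running_count or 0)


def run_batches(batches, update_func):
    """Pure-Python stand-in for the stateful step (hypothetical helper)."""
    state = {}
    for batch in batches:
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # Every key with existing state is updated, even with no new values.
        for key in set(state) | set(grouped):
            state[key] = update_func(grouped.get(key, []), state.get(key))
    return state


if __name__ == "__main__":
    # Each pair is (session_id, page_views); the session data is made up.
    batches = [
        [("s1", 1), ("s2", 1), ("s1", 1)],
        [("s2", 1)],
        [("s1", 1), ("s3", 1)],
    ]
    print(run_batches(batches, update_count))

    # With PySpark (checkpointing must be enabled for stateful operations):
    # totals = pairs.updateStateByKey(update_count)
```
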
- Checkpointing
- What checkpointing is and why to use it
- Different types of checkpointing (metadata checkpointing and data checkpointing)
- When to enable checkpointing
- How to configure checkpointing
- DEMO: Do a demo with checkpointing
- EXERCISE: Give an exercise with checkpointing
- Accumulators
- What accumulators are and how to use them
- DEMO: Do a demo with Accumulators
- EXERCISE: Give an Exercise with Accumulators
- Fault-tolerance
- Performance Tuning
- Reference
- Reducing the Batch Processing Times
- Level of Parallelism in Data Receiving
- Level of Parallelism in Data Processing
- Data Serialization
- Task Launching Overheads
- Setting the Right Batch Interval
- Memory Tuning
- Integration with Kafka
- Introduction to Kafka
- Why integrate with Kafka
- DEMO: Demo
- Integration with Kinesis
- Introduction to Kinesis
- Why integrate with Kinesis
- DEMO: Demo
- Introduction to Structured Streaming
- Overview of Structured Streaming
- The benefits of Structured Streaming
- Basic concepts of Spark streaming
- DEMO: A quick demo of a Structured Streaming example
- Operations on streaming DataFrames/Datasets
- Structured Streaming Programming Guide: Operations on streaming DataFrames/Datasets
- DEMO: Do a demo
- EXERCISE: Prepare an exercise
- Window Operations
- Structured Streaming Programming Guide: Window Operations on Event Time
- DEMO: Do a demo (example)
- EXERCISE: Prepare an exercise
- Handling Late Data and Watermarking
- Add an introductory lecture about what is covered in the course
- This video should be placed as the first lecture of this course, but we do it after we are done creating this course
- Add a promotion video
- This will be about what users will learn from this lecture and how they will benefit
- Finish up lecture
- This last lecture summarizes what we have taught in this course and points to future learning material