This repository contains the files and lectures for the [Insert Title or Organization Name Here] Spark Streaming Tutorial using Python.
- Introduction to Streaming:
- What is streaming?
- Why use streaming?
- Popular streaming tools (Kafka, Apache Spark streaming, etc.)
- Overview of Apache Spark streaming
- What is Apache Spark streaming?
- The advantages of Apache Spark streaming
- Use cases of Apache Spark streaming
- Set up Environment on our local box:
- Install the language SDK, Python (if the source code is written in Python), and Git (Video Example)
- Check out the source code
- Set up the IDE for our demo project
- Use IntelliJ IDEA if the program code is written in Scala or Java
- We will discuss what IDE to use if the code is written in Python (Video Example)
- Run our first Spark streaming project
- The first project just creates a SparkContext, connects to the Twitter stream, and prints out the live tweets
- Need to demo how to create a Twitter developer account to obtain the Twitter OAuth token
- Need to set the logging level to ERROR to reduce the output noise (Video Example)
- Code samples one and two, don’t copy and paste these
- We should point out winutils.exe needs to be installed for Windows users in order to run Spark applications.
- What are Discretized Streams
- Reference: https://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
- Use some graph to explain DStreams
- How to create Discretized Streams
- Different ways to create DStreams
- File Streams
- Queue of RDDs as a Stream
- Kafka
- Flume
- Kinesis
- Reference: https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
- Revisit your first Spark application
- DEMO: Queue of RDDs as a Stream
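
The DStream idea in this section can be modeled in a few lines of plain Python: a stream of timestamped records is cut into consecutive batches, one per batch interval, and each batch plays the role of one RDD. This is only a sketch of the concept and does not use the Spark API; the function `discretize`, the record layout, and the 2-second interval are assumptions made for illustration.

```python
from collections import defaultdict


def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive micro-batches.

    Plain-Python model of a DStream, not Spark code: batch i holds every
    record whose timestamp falls in
    [i * batch_interval, (i + 1) * batch_interval).
    """
    if batch_interval <= 0:
        raise ValueError("batch_interval must be positive")
    buckets = defaultdict(list)
    for timestamp, value in records:
        buckets[int(timestamp // batch_interval)].append(value)
    if not buckets:
        return []
    # Emit empty batches too, since a stream produces one batch per interval.
    return [buckets.get(i, []) for i in range(max(buckets) + 1)]


# Hypothetical records: (seconds since stream start, payload).
records = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (6.3, "d")]
batches = discretize(records, batch_interval=2)
print(batches)
```

Note that the interval with no records still yields an (empty) batch, which is worth pointing out when revisiting the first application.
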
- Transformations on DStreams
- Basic RDD transformations (stateless transformations): map, flatMap, filter, repartition, union, count, reduce, countByValue, reduceByKey, join, cogroup
- DEMO: Pick up 2 of the transformations to demo in the program
- EXERCISE: Prepare an exercise for students to use one of the transformations
- Transform Operation
- What the transform operation is and what its benefits are (Reference)
- DEMO: do a demo with Transform Operation
- EXERCISE: Prepare an exercise for students to use the transform operation
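
To make "stateless" concrete for the demo, here is a minimal plain-Python sketch of a per-batch word count. It is a model of the semantics, not Spark code: the steps are labeled with the DStream transformations they imitate (flatMap, map, reduceByKey), and the function name and sample batches are hypothetical.

```python
from collections import Counter


def word_counts_per_batch(batches):
    """Apply a stateless word count to each batch independently.

    Plain-Python model, not Spark code. The three steps mirror
    flatMap (split lines into words), map (pair each word with 1),
    and reduceByKey (sum the 1s per word).
    """
    results = []
    for lines in batches:
        words = [word for line in lines for word in line.split()]  # flatMap
        pairs = [(word, 1) for word in words]                       # map
        counts = Counter()
        for word, one in pairs:                                     # reduceByKey
            counts[word] += one
        results.append(dict(counts))
    return results


# Hypothetical input: two batches of text lines.
batches = [["spark streaming", "spark"], ["streaming"]]
per_batch = word_counts_per_batch(batches)
print(per_batch)
```

Each batch is counted on its own: the count for "streaming" in the second batch does not include the occurrence from the first batch, which is exactly what distinguishes these transformations from the stateful ones covered later.
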
- Window Operations
- What window operations are (better with some graphs)
- Explain parameters (window length and sliding interval)
- Some of the popular Window operations
- Window
- countByWindow
- reduceByKeyAndWindow
- countByValueAndWindow
- Window
- Explain the window transformation in depth and what the window function is used for
- DEMO: Do a demo with the window transformation
- EXERCISE: Give an exercise about the window transformation
- countByWindow
- Explain the countByWindow transformation in depth and what the countByWindow function is used for
- DEMO: Do a demo with the countByWindow transformation
- EXERCISE: Give an exercise about the countByWindow transformation
- reduceByKeyAndWindow
- Explain the reduceByKeyAndWindow transformation in depth and what the reduceByKeyAndWindow function is used for
- DEMO: Do a demo with the reduceByKeyAndWindow transformation
- EXERCISE: Give an exercise about the reduceByKeyAndWindow transformation
- countByValueAndWindow
- Explain the countByValueAndWindow transformation in depth and what the countByValueAndWindow function is used for
- DEMO: Do a demo with the countByValueAndWindow transformation
- EXERCISE: Give an exercise about the countByValueAndWindow transformation
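
The two window parameters (window length and sliding interval) can be illustrated with a small plain-Python model before showing the real API. This sketch is an assumption-laden simplification, not Spark code: both parameters are counted in batches rather than durations, and `sliding_windows` and `count_by_window` are hypothetical helper names.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Return the batches covered by each window.

    Plain-Python model, not Spark code. Both parameters are counted in
    batches here; in Spark they are durations that must be multiples of
    the batch interval.
    """
    if window_length <= 0 or slide_interval <= 0:
        raise ValueError("window_length and slide_interval must be positive")
    windows = []
    # A window is emitted every slide_interval batches and looks back over
    # the last window_length batches (fewer at the start of the stream).
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        windows.append(batches[start:end])
    return windows


def count_by_window(batches, window_length, slide_interval):
    """Count the records in each window, similar in spirit to countByWindow."""
    return [
        sum(len(batch) for batch in window)
        for window in sliding_windows(batches, window_length, slide_interval)
    ]


# Hypothetical stream of six batches.
batches = [["a"], ["b", "c"], [], ["d"], ["e", "f"], ["g"]]
window_counts = count_by_window(batches, window_length=3, slide_interval=2)
print(window_counts)
```

Because the window length (3 batches) is larger than the sliding interval (2 batches), consecutive windows overlap by one batch, so some records are counted in two windows.
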
- Output Operations on DStreams
- Different output operations
- saveAsTextFiles
- saveAsObjectFiles
- saveAsHadoopFiles
- foreachRDD
- DEMO: Demo how to save tweets to files
- Example: https://drive.google.com/open?id=0Bym8DZ5hyGifaXgwWFQxdVQ4UzA
- use foreachRDD and saveAsTextFiles
- foreachRDD
- Explain foreachRDD and its basic usage
- Design Patterns for foreachRDD
- Reference: https://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
- DEMO: Do a demo with foreachRDD
- EXERCISE: Give an exercise about foreachRDD
- SQL Operations
- Reference: https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
- DEMO: Do a demo with SQL operations
- EXERCISE: Give an exercise about SQL operations
- Join Operations
- Different types of joins
- Stream-stream joins
- Stream-dataset joins
- DEMO: Do a demo with Stream-stream joins
- DEMO: Do a demo with Stream-dataset joins
- EXERCISE: Give an exercise with Stream-stream joins or Stream-dataset joins
- Stateful transformation
- Transformations
- updateStateByKey
- mapWithState
- DEMO: Do a demo with updateStateByKey or mapWithState
- Need to come up with a suitable scenario for mapWithState or updateStateByKey, such as web session data.
- EXERCISE: Prepare an exercise with updateStateByKey or mapWithState
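
As a starting point for the session-data scenario, the sketch below models how state is carried across batches. It is plain Python, not Spark code: `update_count` follows the shape of an updateStateByKey update function (new values for the key plus the previous state), while `run_stateful`, the session ids, and the event layout are hypothetical and exist only for illustration.

```python
def update_count(new_values, running_count):
    """Update function in the shape updateStateByKey expects.

    new_values: values seen for a key in the current batch.
    running_count: previous state, or None the first time the key is seen.
    """
    return sum(new_values) + (running_count or 0)


def run_stateful(batches, update_func):
    """Plain-Python model of a stateful transformation, not Spark code.

    Each batch is a list of (key, value) pairs. The state dict is carried
    from one batch to the next, and a snapshot is recorded after each batch.
    """
    state = {}
    snapshots = []
    for batch in batches:
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # Call the update function for every key that has new data or
        # existing state, as updateStateByKey does.
        for key in set(grouped) | set(state):
            new_state = update_func(grouped.get(key, []), state.get(key))
            if new_state is None:
                state.pop(key, None)  # returning None drops the key
            else:
                state[key] = new_state
        snapshots.append(dict(state))
    return snapshots


# Hypothetical page-view events keyed by session id.
batches = [
    [("session-1", 1), ("session-2", 1)],
    [("session-1", 1)],
    [("session-2", 1), ("session-2", 1)],
]
snapshots = run_stateful(batches, update_count)
print(snapshots[-1])
```

Unlike the stateless transformations, the counts here keep growing from batch to batch, which is also why the real demo needs checkpointing enabled.
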
- Checkpointing
- What checkpointing is and why to use it
- Different types of checkpointing
- Metadata checkpointing
- Data checkpointing
- When to enable Checkpointing
- How to configure Checkpointing
- DEMO: Do a demo with Checkpointing
- EXERCISE: Give an exercise with Checkpointing
- Accumulators
- What accumulators are and how they are used
- DEMO: Do a demo with Accumulators
- EXERCISE: Give an Exercise with Accumulators
- Fault-tolerance
- Performance Tuning
- Reference
- Reducing the Batch Processing Times
- Level of Parallelism in Data Receiving
- Level of Parallelism in Data Processing
- Data Serialization
- Task Launching Overheads
- Setting the Right Batch Interval
- Memory Tuning
- Integration with Kafka
- Introduction to Kafka
- Why integrate with Kafka
- DEMO: Demo
- Integration with Kinesis
- Introduction to Kinesis
- Why integrate with Kinesis
- DEMO: Demo
- Introduction to Structured Streaming
- Overview of Structured Streaming
- The Benefit of structured streaming
- Basic Concepts about Spark streaming
- DEMO: A quick demo of a Structured Streaming example
- Operations on streaming DataFrames/Datasets
- Structured Streaming Programming Guide: Operations on streaming DataFrames/Datasets
- DEMO: Do a demo
- EXERCISE: Prepare an exercise
- Window Operations
- Structured Streaming Programming Guide: Window Operations on Event Time
- DEMO: Do a demo (example)
- EXERCISE: Prepare an exercise
- Handling Late Data and Watermarking
- Add an introductory lecture about what is covered in the course
- This video should be placed as the first lecture of the course, but we will record it after we finish creating the course
- Add a promotion video
- This will be about what users will learn from this lecture and how they will benefit
- Finish up lecture
- This last lecture summarizes what we have taught in this course and points to further learning material