
Stream Processing With: Tamás István Ujj

The document discusses a real-time data architecture using stream processing techniques like Kafka and Spark to ingest raw data from sources like databases and APIs, transform the data in real-time using stream processing engines, and deliver results with low latency to downstream applications while also enabling reprocessing of historical data for reliability. It also covers related topics like change data capture, event sourcing, and using Mesos or YARN for cluster management.

Uploaded by

Deepa

Stream Processing with

Tamás István Ujj


[email protected]
Lambda Architecture

A database is nothing but our conception of it; what is man to say it differs from a stream in nature…
Big Data

Machine Learning

Business Intelligence
Customer Relationship Management
Business Process Management
Software Quality Management
Application Development
Manufacturing Support
Our Customers
Financial Sector

Manufacturing

Telecommunications
A real-time data architecture
I want to do complex calculations on large amounts of data.

You need a batch processing system.

New Data → Staging Area → Transformation Logic → Results

New data is written to a temporary staging area.

A scheduled batch job executes the transformation logic.
We changed the logic. Let’s recalculate the previous results, too.

Recomputation will cost you extra.

New Data → Staging Area → ETL → Master Dataset → Transformation Logic → Results
Master Dataset → Transformation Logic (New) → Results (New)

Master Dataset: an immutable, append-only set of raw data.

Results can be recomputed from historical data.
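
The two statements above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system from the talk: `MasterDataset`, `RawEvent`, `revenue_v1` and `revenue_v2` are hypothetical names, and an in-memory list stands in for durable storage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)  # frozen: a stored record cannot be modified afterwards
class RawEvent:
    customer: str
    amount: float


class MasterDataset:
    """Append-only store of raw events (in-memory stand-in, an assumption)."""

    def __init__(self) -> None:
        self._events: List[RawEvent] = []

    def append(self, event: RawEvent) -> None:
        self._events.append(event)

    def scan(self) -> Tuple[RawEvent, ...]:
        # Hand out an immutable copy so callers cannot alter history.
        return tuple(self._events)


def revenue_v1(events) -> Dict[str, float]:
    """Original logic: sum every amount per customer."""
    totals: Dict[str, float] = {}
    for e in events:
        totals[e.customer] = totals.get(e.customer, 0.0) + e.amount
    return totals


def revenue_v2(events) -> Dict[str, float]:
    """Changed logic: ignore refunds (negative amounts)."""
    return revenue_v1(e for e in events if e.amount >= 0)


def recompute(master: MasterDataset, logic: Callable) -> Dict[str, float]:
    # Results are derived data: rebuild them from the full history.
    return logic(master.scan())
```

Because the raw events are never overwritten, switching from `revenue_v1` to `revenue_v2` only requires running `recompute` again over the same dataset.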
Why do I have to wait hours for the updated results?!

We’ll have to reengineer the system for low latency.

Nathan Marz, Big Data: Principles and best practices of scalable real-time data systems
New Data → Staging Area → ETL → Master Dataset → Transformation Logic → Batch Results
New Data → Stream Engine → Transformation Logic (Streaming) → Real-Time Results

The batch layer calculates the correct results with high latency.

The speed layer calculates the approximate results on the most recent data in real-time.
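
One way to picture how the two layers combine at query time is the sketch below. It is only an illustration: the `(timestamp, page)` event shape, the `cutoff` parameter and all function names are assumptions, and the speed view here is exact, whereas a real speed layer may trade accuracy for latency.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[int, str]  # (timestamp, page): hypothetical event shape


def batch_view(master: Iterable[Event], cutoff: int) -> Counter:
    """Batch layer: counts over all data up to the last batch run."""
    return Counter(page for ts, page in master if ts <= cutoff)


def speed_view(recent: Iterable[Event], cutoff: int) -> Counter:
    """Speed layer: counts for events the batch run has not seen yet."""
    return Counter(page for ts, page in recent if ts > cutoff)


def query(page: str, batch: Counter, speed: Counter) -> int:
    """Serving side: merge both views to answer a query."""
    return batch[page] + speed[page]
```

When the next batch run finishes, `cutoff` moves forward and the corresponding part of the speed view can be discarded.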
Your architecture costs me a fortune!

This is the price of Big Data.

You don’t need the batch layer.

Stream processing isn’t reliable on its own!

Interesting. That’s half the costs.
A well-designed streaming system provides exactly-once semantics, even in case of failure.

Offset: 0 1 2 3 4 5 6 7 8 9

Receiving the data
• Kafka is a reliable source.
• Tracking the offsets in checkpoints.
Transforming the data
• Repeatable transformations.
Pushing out the data
• Idempotent updates.
• Transactional updates. (Saving results and offsets.)
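
The "transactional updates" point can be illustrated with a small sketch that saves a result and the next offset in the same transaction. It is a toy under stated assumptions: a Python list stands in for a Kafka partition, SQLite stands in for the output store, and `open_store`, `run`, `counts` and `crash_at` are hypothetical names.

```python
import sqlite3

# Stand-in for one Kafka partition: list index = offset (hypothetical data).
LOG = ["a", "b", "a", "c", "a"]


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE counts (word TEXT PRIMARY KEY, n INTEGER NOT NULL)")
    db.execute("CREATE TABLE progress (next_offset INTEGER NOT NULL)")
    db.execute("INSERT INTO progress VALUES (0)")
    db.commit()
    return db


def run(db, log, crash_at=None):
    """Consume from the stored offset; optionally fail before one commit."""
    (start,) = db.execute("SELECT next_offset FROM progress").fetchone()
    for offset in range(start, len(log)):
        word = log[offset]
        try:
            # Result and offset change inside the same transaction.
            db.execute("INSERT OR IGNORE INTO counts VALUES (?, 0)", (word,))
            db.execute("UPDATE counts SET n = n + 1 WHERE word = ?", (word,))
            db.execute("UPDATE progress SET next_offset = ?", (offset + 1,))
            if offset == crash_at:
                raise RuntimeError("simulated crash before commit")
            db.commit()
        except Exception:
            db.rollback()  # neither the count nor the offset is kept
            raise


def counts(db):
    return dict(db.execute("SELECT word, n FROM counts"))
```

If `run(db, LOG, crash_at=2)` fails, a later `run(db, LOG)` resumes at offset 2, so every record is reflected in the counts exactly once.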
New Data → Stream (offsets 0 1 2 3 4 5 6 7 8 9) → Stream Engine → Transformation Logic → Real-Time Results
Stream (from the first offset) → Transformation Logic (New) → Real-Time Results (New)
Staging Area → ETL → Master Dataset → Transformation Logic → Batch Results

Kafka retains incoming data.

Recomputation: processing data from the beginning of the stream with a parallel streaming job.
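
Recomputation by replay can be sketched as two consumers reading the same retained log, each with its own offset and logic. This is a simplified model, not Kafka client code: `StreamJob`, `catch_up` and the `(sensor, reading)` record shape are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, int]  # (sensor, reading): hypothetical record shape


class StreamJob:
    """A consumer with its own offset, logic, and result table."""

    def __init__(self, logic: Callable[[int, int], int], start_offset: int = 0):
        self.logic = logic
        self.offset = start_offset
        self.results: Dict[str, int] = {}

    def poll(self, log: List[Record], max_records: int = 2) -> int:
        """Process up to max_records retained records; return how many."""
        batch = log[self.offset:self.offset + max_records]
        for key, value in batch:
            if key in self.results:
                self.results[key] = self.logic(self.results[key], value)
            else:
                self.results[key] = value
        self.offset += len(batch)
        return len(batch)

    def lag(self, log: List[Record]) -> int:
        return len(log) - self.offset


def catch_up(job: StreamJob, log: List[Record]) -> None:
    """Replay retained records until the job has reached the end of the log."""
    while job.lag(log) > 0:
        job.poll(log)
```

A job with the new logic starts at offset 0 and runs alongside the old one; once its lag reaches zero, downstream readers can switch to its results.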
How can I stream data from my databases?

A stream is an ever-growing, immutable set of events.

Under the hood, a database is also a stream of events: creates, updates and deletes.
A database is a view over this stream of events.

Update, Create, Update, Delete, Create, Delete
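
The idea of a database as a view over a stream of events can be sketched as a fold over creates, updates and deletes. The event shape `(operation, key, value)` and the function names are assumptions for illustration only.

```python
from typing import Dict, Iterable, Optional, Tuple

Event = Tuple[str, str, Optional[str]]  # (operation, key, value)


def apply(view: Dict[str, str], event: Event) -> None:
    """Apply one change event to the current view."""
    op, key, value = event
    if op in ("create", "update"):
        view[key] = value
    elif op == "delete":
        view.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def materialize(events: Iterable[Event]) -> Dict[str, str]:
    """Rebuild the table contents by replaying the whole event stream."""
    view: Dict[str, str] = {}
    for event in events:
        apply(view, event)
    return view
```

Replaying a prefix of the stream yields the state of the table at that earlier point, which is what makes the stream the more fundamental representation.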

Database

Let’s capture this internal stream.
The technique is called Change Data Capture.

• A consistent snapshot of the entire database contents at one point in time.
• A real-time stream of changes from that point onward.

PostgreSQL and Oracle support both.
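
The two ingredients of Change Data Capture, a snapshot plus the changes from that point onward, can be modelled as follows. This is a toy in-memory model, not the PostgreSQL or Oracle mechanism: `SourceDb`, `Replica` and the integer log position are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, str, Optional[str]]  # (operation, key, value)


class SourceDb:
    """Toy database that records every write in an ordered change log."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}
        self.changelog: List[Change] = []

    def write(self, op: str, key: str, value: Optional[str] = None) -> None:
        if op == "delete":
            self.rows.pop(key, None)
        else:  # "create" or "update"
            self.rows[key] = value
        self.changelog.append((op, key, value))

    def snapshot(self) -> Tuple[Dict[str, str], int]:
        # Contents plus the log position they correspond to.
        return dict(self.rows), len(self.changelog)

    def changes_since(self, position: int) -> List[Change]:
        return self.changelog[position:]


class Replica:
    def __init__(self, source: SourceDb) -> None:
        # Step 1: load a consistent snapshot and remember its position.
        self.rows, self.position = source.snapshot()

    def sync(self, source: SourceDb) -> None:
        # Step 2: apply every change made after the snapshot position.
        for op, key, value in source.changes_since(self.position):
            if op == "delete":
                self.rows.pop(key, None)
            else:
                self.rows[key] = value
            self.position += 1
```

Tying the snapshot to a log position is the important detail: it lets the consumer pick up the change stream without missing or double-applying a write.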
Complex asynchronous transformations…

…with low latency.

And fault-tolerance through recomputation.

And all this with a single computational model, without code duplication.
The SMACK stack

Spark for Micro-Batch Processing


Mesos for Cluster Management
Akka for Event Processing
Cassandra for Persistence
Kafka for Event Transport
          Event Processing   Micro-Batch Processing
Latency   Sub-second         Seconds to minutes
Power     Simple triggers    Complex transformations

A trade-off between latency and computational power.

Responding to single events in real-time or a general analysis over the stream.
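
The contrast between the two styles can be sketched in plain Python. This is only a conceptual illustration, not Akka or Spark code: the function names are hypothetical, and batches are cut by record count here, whereas a micro-batch engine typically cuts them by time interval.

```python
from typing import Callable, Iterable, Iterator, List


def on_each_event(events: Iterable[float], threshold: float,
                  alert: Callable[[float], None]) -> None:
    """Event processing: a simple trigger evaluated per event."""
    for value in events:
        if value > threshold:
            alert(value)


def micro_batches(events: Iterable[float], size: int) -> Iterator[List[float]]:
    """Group the stream into small batches (by count, for simplicity)."""
    batch: List[float] = []
    for value in events:
        batch.append(value)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def batch_averages(events: Iterable[float], size: int) -> List[float]:
    """Micro-batch processing: an aggregate computed once per batch."""
    return [sum(b) / len(b) for b in micro_batches(events, size)]
```

The trigger reacts as soon as a single event arrives, while the aggregate has to wait for a batch to fill, which is the latency cost paid for the richer computation.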

Akka Streams

Kafka Streams
Reactive Streams with back pressure.
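
Back pressure means the consumer signals how much it can take and the producer never sends more than that. The sketch below shows the idea with a pull-style `request(n)` call; it is a simplified, synchronous stand-in with hypothetical class names, not the Reactive Streams interfaces themselves.

```python
from typing import Iterable, Iterator, List


class Source:
    """Emits items only when the consumer has signalled demand."""

    def __init__(self, items: Iterable[int]) -> None:
        self._items: Iterator[int] = iter(items)

    def request(self, n: int) -> List[int]:
        """Return at most n items; fewer if the source is exhausted."""
        out: List[int] = []
        for _ in range(n):
            try:
                out.append(next(self._items))
            except StopIteration:
                break
        return out


def consume(source: Source, capacity: int) -> List[List[int]]:
    """Pull in chunks no larger than what the consumer can handle."""
    chunks: List[List[int]] = []
    while True:
        chunk = source.request(capacity)
        if not chunk:
            return chunks
        chunks.append(chunk)
```

A slow consumer simply requests less, so the source cannot overrun it and no unbounded buffer builds up in between.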

Some other alternatives: Storm, Flink, Samza.

Machine Learning
Graph Analytics
SQL
Functional API
Cluster Management with

YARN
• Hadoop and related components.
• Job request comes in, YARN places the job.
Mesos
• Any application.
• Job request comes in, Mesos offers resources, job accepts or rejects.
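
The difference between the two scheduling styles listed above can be sketched as follows. This is a deliberately simplified model with hypothetical names (`Offer`, `Framework`, `central_placement`, `offer_based_placement`); real YARN and Mesos schedulers consider far more than CPU counts.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Offer:
    node: str
    cpus: int


def central_placement(free: Dict[str, int], cpus_needed: int) -> Optional[str]:
    """YARN-like: the resource manager picks a node for the job."""
    for node, cpus in free.items():
        if cpus >= cpus_needed:
            return node
    return None


class Framework:
    """Mesos-like: the application itself decides on each offer."""

    def __init__(self, cpus_needed: int, preferred_node: str) -> None:
        self.cpus_needed = cpus_needed
        self.preferred_node = preferred_node

    def accepts(self, offer: Offer) -> bool:
        return offer.cpus >= self.cpus_needed and offer.node == self.preferred_node


def offer_based_placement(free: Dict[str, int], fw: Framework) -> Optional[str]:
    for node, cpus in free.items():
        if fw.accepts(Offer(node, cpus)):  # the job accepts or rejects
            return node
    return None
```

In the offer-based variant the placement policy (here, a preferred node) lives in the application rather than in the cluster manager, which is why any kind of application can bring its own scheduling logic.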
Upstream Sources

Downstream Applications

An architecture for converting large amounts of raw data into valuable information in real-time.
Business Intelligence

Tamás István Ujj


[email protected]

Inspiration: Nathan Marz, Jay Kreps, Tyler Akidau, Martin Kleppmann, Dean Wampler
