This document describes the high-level architecture of Vector. It assumes some familiarity with the user-facing concepts and focuses instead on how they're wired together internally. The goal is to provide a starting point for navigating the code and to assist in understanding Vector's behavior and constraints.
From a user's perspective, Vector runs a configuration consisting of a directed, acyclic graph of sources, transforms, and sinks. In many ways, this logical representation of the config maps rather directly to the way that it is laid out and run internally.
A reasonably accurate mental model of running such a configuration is that Vector spins up each component as a Tokio task and wires them together with channels. Below, we'll go through each type of component in more detail, discussing how it is translated into a running task and connected to the rest of the topology.
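To make that mental model concrete, here is a deliberately simplified sketch of three Tokio tasks connected by bounded channels. The type names, channel sizes, and logic are illustrative only, not Vector's actual internals:

```rust
// A minimal sketch of the mental model: each "component" is a Tokio task, and
// adjacent components are connected by bounded channels. Names here are
// illustrative, not Vector's real types.
use tokio::sync::mpsc;

#[derive(Debug, Clone)]
struct Event(String);

#[tokio::main]
async fn main() {
    // source -> transform
    let (src_tx, mut src_rx) = mpsc::channel::<Event>(1024);
    // transform -> sink
    let (xform_tx, mut xform_rx) = mpsc::channel::<Event>(1024);

    // "Source" task: produces events into its output channel.
    let source = tokio::spawn(async move {
        for i in 0..3 {
            let _ = src_tx.send(Event(format!("event {i}"))).await;
        }
    });

    // "Transform" task: pulls from its input, processes, and forwards.
    let transform = tokio::spawn(async move {
        while let Some(Event(line)) = src_rx.recv().await {
            let _ = xform_tx.send(Event(line.to_uppercase())).await;
        }
    });

    // "Sink" task: drains its input and writes it somewhere (here, stdout).
    let sink = tokio::spawn(async move {
        while let Some(Event(line)) = xform_rx.recv().await {
            println!("{line}");
        }
    });

    let _ = tokio::join!(source, transform, sink);
}
```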
After parsing and validating a user's configuration, we are left with a `Config` struct containing (among other things) the collections of `SourceConfig`s, `TransformConfig`s, and `SinkConfig`s corresponding to each of the configured components. Each of those traits has its own `build` method that constructs the component. This building occurs largely in the `src/topology/builder.rs` file, along with some setup for the initial wiring into the topology.
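As a rough approximation of that relationship (the real traits are async, fallible, and take considerably more context), the shape looks something like this:

```rust
// An approximate, simplified sketch of the relationship described above. The
// real trait signatures in Vector are more involved; the point is only that
// the parsed `Config` holds typed component configs, each of which knows how
// to build its own component.
use std::collections::HashMap;

// Stand-ins for the built components.
struct Source;
struct Transform;
struct Sink;

trait SourceConfig {
    fn build(&self) -> Source;
}
trait TransformConfig {
    fn build(&self) -> Transform;
}
trait SinkConfig {
    fn build(&self) -> Sink;
}

// The parsed configuration: named collections of boxed component configs.
struct Config {
    sources: HashMap<String, Box<dyn SourceConfig>>,
    transforms: HashMap<String, Box<dyn TransformConfig>>,
    sinks: HashMap<String, Box<dyn SinkConfig>>,
}

// Building mirrors src/topology/builder.rs at a very high level: walk each
// collection and call `build` on every entry.
fn build_all(config: &Config) -> (Vec<Source>, Vec<Transform>, Vec<Sink>) {
    (
        config.sources.values().map(|c| c.build()).collect(),
        config.transforms.values().map(|c| c.build()).collect(),
        config.sinks.values().map(|c| c.build()).collect(),
    )
}
```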
When a source config is built, the result is mainly two tasks: the "server" task of the source itself, and a "pump" task that forwards its output on to the rest of the system.
Construction begins by setting up a `SourceSender` for the configured source, which handles the ability to send events to each of the outputs defined by the `SourceConfig::outputs` method. The outputs are built into the sender in such a way that attempting to send to an unknown output will result in a panic.
Along with each of these outputs that are added to the sender, a corresponding `Fanout` instance is created, which will live in a pump task and handle "fanning out" events to every downstream component that lists this source as an input. Each of these individual pump tasks (one per output) consists purely of forwarding events into the `Fanout`. The final pump task (one per source) simply spawns each of the output-specific pump tasks and drives them to completion.
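The pump idea can be sketched as follows. The `Fanout` here is a toy stand-in for Vector's real `fanout` module, which is considerably smarter about backpressure and partial failures:

```rust
// A hedged sketch of a pump task: pull from a source output channel and push
// every event into a fan-out that clones it to each registered downstream
// sender.
use tokio::sync::mpsc;

#[derive(Debug, Clone)]
struct Event(String);

struct Fanout {
    senders: Vec<mpsc::Sender<Event>>,
}

impl Fanout {
    async fn send(&self, event: Event) {
        // Naive broadcast: clone the event to every destination.
        for tx in &self.senders {
            let _ = tx.send(event.clone()).await;
        }
    }
}

// The pump: forward events from the source's output into the fanout.
async fn pump(mut from_source: mpsc::Receiver<Event>, fanout: Fanout) {
    while let Some(event) = from_source.recv().await {
        fanout.send(event).await;
    }
}

#[tokio::main]
async fn main() {
    let (source_tx, source_rx) = mpsc::channel(16);
    let (sink_tx, mut sink_rx) = mpsc::channel(16);

    let fanout = Fanout { senders: vec![sink_tx] };
    tokio::spawn(pump(source_rx, fanout));

    source_tx.send(Event("hello".into())).await.unwrap();
    drop(source_tx);

    while let Some(event) = sink_rx.recv().await {
        println!("sink received: {event:?}");
    }
}
```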
Finally, the server task itself is built. `SourceConfig::build` takes a `SourceContext` as an argument, one field of which is the `SourceSender` built above. The result of the build function is wrapped in some simple shutdown handling before being inserted into the topology.
After sources, transforms are the next to be built. They're a bit simpler than sources and mostly jump straight into `TransformConfig::build`. They can define `outputs` that are translated into `Fanout` instances in much the same way as sources, and they also have an `enable_concurrency` option. How that works exactly depends on the type of transform.
The simplest type of transform is a `Function` transform. They run synchronously on a single event and write to a single, unnamed output. One step above function transforms is the newer `Synchronous` transform. These are a superset of function transforms, with the additional ability to write to multiple outputs if desired. From the perspective of the topology, both of these are run in exactly the same way.
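Roughly speaking, the two flavors differ only in where they are allowed to write. The trait shapes below are simplified stand-ins for illustration; the real signatures carry more context and use different buffer types:

```rust
// Approximate shapes of the two synchronous transform flavors described above.
// Names and signatures are simplified for illustration.
struct Event;

// One Vec<Event> per named transform output (a toy TransformOutputsBuf).
struct TransformOutputsBuf;

// A function transform: one event in, events out to a single unnamed output.
trait FunctionTransform {
    fn transform(&mut self, output: &mut Vec<Event>, event: Event);
}

// A synchronous transform: one event in, may write to multiple named outputs.
trait SynchronousTransform {
    fn transform(&mut self, event: Event, outputs: &mut TransformOutputsBuf);
}
```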
In the simplest case, where the `enable_concurrency` flag is not enabled (this is not configuration, but a static attribute of the transform), the resulting transform task is built to pull a chunk of events from its input channel and process those events via the `transform` method into a `TransformOutputsBuf` (essentially a container of a `Vec<Event>` for each of the transform's defined outputs). Once the whole chunk of events has been processed into outputs, those outputs are drained into the respective `Fanout` instances.
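In sketch form, with toy stand-ins for `TransformOutputsBuf` and the fanouts (plain channels here), that loop looks roughly like this:

```rust
// A simplified sketch of the non-concurrent transform loop: pull a chunk of
// events, run the synchronous transform over each event into per-output
// buffers, then drain those buffers into the corresponding fanouts.
use std::collections::HashMap;
use tokio::sync::mpsc;

#[derive(Debug, Clone)]
struct Event(String);

// One Vec<Event> per named transform output.
type TransformOutputsBuf = HashMap<&'static str, Vec<Event>>;

// The synchronous transform itself: one event in, writes to named outputs.
fn transform(event: Event, outputs: &mut TransformOutputsBuf) {
    outputs.entry("default").or_default().push(event);
}

async fn run_transform(
    mut input: mpsc::Receiver<Event>,
    fanouts: HashMap<&'static str, mpsc::Sender<Event>>,
) {
    const CHUNK: usize = 128;
    while let Some(first) = input.recv().await {
        // Gather a chunk: one blocking recv, then whatever else is ready.
        let mut chunk = vec![first];
        while chunk.len() < CHUNK {
            match input.try_recv() {
                Ok(event) => chunk.push(event),
                Err(_) => break,
            }
        }

        // Process the whole chunk into per-output buffers...
        let mut outputs = TransformOutputsBuf::new();
        for event in chunk {
            transform(event, &mut outputs);
        }

        // ...then drain each buffer into its fanout (here, a plain channel).
        for (name, events) in outputs {
            if let Some(fanout) = fanouts.get(name) {
                for event in events {
                    let _ = fanout.send(event).await;
                }
            }
        }
    }
}
```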
If the `enable_concurrency` flag is enabled, the process is slightly more complicated. For each chunk of input events, instead of being processed inline, a new task is spawned that does the work of processing the events into outputs. Since there is some overhead to spawning tasks, Vector attempts to pull larger chunks of events from the input for transforms running in this mode. The main transform task then tracks the completion of those work tasks in the same order that they were spawned (ensuring that we don't reorder the resulting outputs). When a task completes, the main task receives the resulting `TransformOutputsBuf` and takes care of draining it into the respective fanouts.
To ensure that we don't buffer an unbounded number of events within those work tasks, the main task limits the maximum number that can be in flight simultaneously. Spawning new tasks allows Tokio's work-stealing scheduler to step in and spread the CPU work across multiple threads when there is a need and available capacity to do so.
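The following sketch captures the shape of that logic rather than Vector's actual implementation: each chunk is processed in its own spawned task, completions are consumed in spawn order so output ordering is preserved, and the number of in-flight chunks is capped:

```rust
// A sketch of the concurrent variant. `FuturesOrdered` yields spawned tasks
// in the order they were pushed, which preserves output ordering, and the
// select! preconditions cap the number of in-flight chunks.
use futures::stream::{FuturesOrdered, StreamExt};
use tokio::sync::mpsc;

#[derive(Debug, Clone)]
struct Event(String);

fn transform_chunk(chunk: Vec<Event>) -> Vec<Event> {
    // Stand-in for the real per-chunk CPU work.
    chunk
        .into_iter()
        .map(|Event(line)| Event(line.to_uppercase()))
        .collect()
}

async fn run_concurrent_transform(
    mut input: mpsc::Receiver<Vec<Event>>, // chunks of events
    out: mpsc::Sender<Vec<Event>>,
) {
    const MAX_IN_FLIGHT: usize = 4;
    let mut in_flight = FuturesOrdered::new();

    loop {
        tokio::select! {
            // Completions come back in spawn order, preserving ordering.
            Some(result) = in_flight.next(), if !in_flight.is_empty() => {
                if let Ok(outputs) = result {
                    let _ = out.send(outputs).await;
                }
            }
            // Only accept more input while we're under the in-flight cap.
            maybe_chunk = input.recv(), if in_flight.len() < MAX_IN_FLIGHT => {
                match maybe_chunk {
                    Some(chunk) => in_flight.push_back(tokio::spawn(async move {
                        transform_chunk(chunk)
                    })),
                    None => break, // input closed
                }
            }
        }
    }

    // Drain whatever is still in flight, still in order.
    while let Some(result) = in_flight.next().await {
        if let Ok(outputs) = result {
            let _ = out.send(outputs).await;
        }
    }
}
```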
A task-style transform differs from the synchronous variants above in that it has the ability to do arbitrary, asynchronous stream-based transformations of its input. This includes things like emitting outputs after some timeout, independent of incoming events. From the topology's perspective, they're quite simple because they define most of their structure internally and are applied by basically just passing the input channel into the `transform` method.
To build the full task, the transform itself is built, some common filtering and telemetry are added by wrapping the input stream, and then the input stream is passed to the `transform` method. This results in an output stream, which is then forwarded to the transform's `Fanout` instance (task transforms do not support multiple outputs).
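The essence of a task transform is "stream in, stream out". The sketch below uses simplified, illustrative signatures and a trivial filter standing in for real transform logic; the point is that the topology only needs to hand over the input stream and forward whatever comes back:

```rust
// A sketch of the "task transform" shape: the transform receives the whole
// input stream and returns a new output stream. Because it owns the stream,
// it can buffer, batch, or emit on timers independently of its input.
use futures::stream::{self, BoxStream, StreamExt};

#[derive(Debug, Clone)]
struct Event(String);

trait TaskTransform {
    fn transform(self: Box<Self>, input: BoxStream<'static, Event>)
        -> BoxStream<'static, Event>;
}

// Example: a transform that simply filters out empty events.
struct DropEmpty;

impl TaskTransform for DropEmpty {
    fn transform(self: Box<Self>, input: BoxStream<'static, Event>)
        -> BoxStream<'static, Event>
    {
        input
            .filter(|event| futures::future::ready(!event.0.is_empty()))
            .boxed()
    }
}

#[tokio::main]
async fn main() {
    let input = stream::iter(vec![
        Event("hello".into()),
        Event(String::new()),
        Event("world".into()),
    ])
    .boxed();

    // The topology would forward this output stream into the Fanout; here we
    // just print it.
    let mut output = Box::new(DropEmpty).transform(input);
    while let Some(event) = output.next().await {
        println!("{event:?}");
    }
}
```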
Sinks have two components that make building them somewhat more complex than either sources or transforms: healthchecks and buffers.
Healthchecks are one-off tasks that run at startup with the goal of discovering any issues that may prevent the sink from running properly (e.g. permissions or connectivity issues) and notifying the user in a nice way. They can be enabled or disabled both individually and at a global level, and the user can choose whether a failing healthcheck should prevent Vector from starting.
Buffers are a configurable mechanism for dealing with backpressure. By default, like the rest of Vector, sinks will buffer some small-ish number of events in memory before propagating backpressure upstream. Buffer configuration allows individual sinks to change that behavior, choosing between memory and disk for where to store the buffered events, setting a maximum size, and deciding what should happen when the buffer is full (backpressure or load shedding).
Disk buffers in particular add some complexity in topology construction due to the fact that they're persistent across both config reloads and process restarts. They are built normally with their corresponding sink most of the time, but they are also stashed to the side for the case when topology construction fails after a buffer has been built. This allows a subsequent build (likely of the previous configuration during a rollback) to pull from the already-built buffer without worrying about the persistence of the contents.
Once the healthcheck and buffer are built, the sink itself is constructed via `SinkConfig::build`. The surrounding task is defined to first finalize its use of the buffer (removing it from the in-case-of-error stash), filter and wrap the input stream with telemetry, and then pass it to `VectorSink::run`.
Once component construction is complete, we're left with a collection of yet-to-be-spawned tasks, as well as handles to their inputs and outputs. More specifically, a component input is the sender side of a channel (or buffer) that acts as the component's input stream. A component output in this case is a `fanout::ControlChannel`, via which Vector can send control messages that add or remove destinations for the component's actual output stream.
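In toy form (real control messages and senders carry more information than this), those two kinds of handles look something like the following: inputs are senders, and outputs are channels for editing a `Fanout`'s destination list.

```rust
// A toy model of the handles the topology holds after construction. The names
// and shapes are illustrative, not Vector's actual types.
use tokio::sync::mpsc;

struct Event;

// Control messages a Fanout understands: add or remove a downstream sender.
enum ControlMessage {
    Add(String, mpsc::Sender<Event>),
    Remove(String),
}

// What the topology keeps for each component.
type ComponentInput = mpsc::Sender<Event>;            // sender into the component
type ComponentOutput = mpsc::Sender<ControlMessage>;  // control channel to its Fanout
```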
Given those definitions, the fundamental process of wiring up a Vector topology is one of adding the appropriate inputs to the appropriate outputs. As a simple example, consider the following config:
```toml
[sources.foo]
type = "stdin"

[sinks.bar]
type = "console"
inputs = ["foo"]
encoding.codec = "json"
```
After the component construction phase, we'll be left with the tasks for each component (which we'll ignore for now) as well as a collection of inputs and outputs. In this case, each of those collections will hold a single item. There will be an output corresponding to the source `foo`, consisting of a control channel connected to the `Fanout` instance within the source's pump task, and there will be an input corresponding to the sink `bar`, consisting of the sender side of the sink's input channel/buffer (for our purposes here they're equivalent).
To wire up this topology, the `connect_diff` function on `RunningTopology` will see that `bar` specifies `foo` as an input, take a clone of `bar`'s input, and send that to `foo`'s output control channel. This results in the `Fanout` associated with the source `foo` adding a new sender to its list of destinations to write new events to. And voilà, the sink will receive events from the source on its input stream.
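Putting the toy types from the earlier sketches together, the whole wiring step amounts to sending an `Add` control message that carries a clone of the sink's input sender. This is an illustrative model of the idea, not the real `connect_diff` code:

```rust
// An end-to-end toy version of the wiring above: `foo`'s pump owns a fan-out
// plus a control channel, and "connecting" `bar` is just sending an Add
// message carrying a clone of `bar`'s input sender.
use std::collections::HashMap;
use tokio::sync::mpsc;

#[derive(Debug, Clone)]
struct Event(String);

enum ControlMessage {
    Add(String, mpsc::Sender<Event>),
}

// A minimal Fanout/pump combination for the source `foo`.
async fn foo_pump(
    mut events: mpsc::Receiver<Event>,
    mut control: mpsc::Receiver<ControlMessage>,
) {
    let mut destinations: HashMap<String, mpsc::Sender<Event>> = HashMap::new();
    loop {
        tokio::select! {
            // Apply pending control messages before forwarding events.
            biased;
            Some(ControlMessage::Add(name, tx)) = control.recv() => {
                destinations.insert(name, tx);
            }
            maybe_event = events.recv() => {
                match maybe_event {
                    Some(event) => {
                        for tx in destinations.values() {
                            let _ = tx.send(event.clone()).await;
                        }
                    }
                    None => break,
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Built during component construction:
    let (foo_tx, foo_rx) = mpsc::channel::<Event>(16);                 // foo's output
    let (foo_control_tx, foo_control_rx) = mpsc::channel(16);          // foo's control channel
    let (bar_input_tx, mut bar_input_rx) = mpsc::channel::<Event>(16); // bar's input
    tokio::spawn(foo_pump(foo_rx, foo_control_rx));

    // The connect_diff step: hand a clone of bar's input to foo's fan-out.
    foo_control_tx
        .send(ControlMessage::Add("bar".into(), bar_input_tx.clone()))
        .await
        .unwrap();

    // Anything the source emits now shows up on the sink's input stream.
    foo_tx.send(Event("hello from foo".into())).await.unwrap();
    println!("{:?}", bar_input_rx.recv().await);
}
```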
The actual code for this logic (found in `src/topology/running.rs`) is significantly more complex than what's needed for this very simple example, mostly because it is oriented around applying modifications to an existing running topology rather than simply starting a new one up from scratch. While complex, this allows us to have a single path through which all topology actions occur. Without it, we'd need all the same complexity to support reloads, but it would be a secondary path that's less well exercised than the common startup path. Even so, this is an area where we're always looking for ways that we can simplify and make things more robust.
Once everything is properly connected, the final step is to spawn the actual tasks for each component. There is a bit of bookkeeping to build tracing spans for each, set up error handlers, and track shutdown, but the handles to the running tasks are stored and Vector is off to the races.