# Architecture

This document describes the high-level architecture of Vector. It assumes some familiarity with the user-facing concepts and focuses instead on how they're wired together internally. The goal is to provide a starting point for navigating the code and to assist in understanding Vector's behavior and constraints.

## Overview

From a user's perspective, Vector runs a configuration consisting of a directed acyclic graph of sources, transforms, and sinks. In many ways, this logical representation of the config maps rather directly to the way it is laid out and run internally.

*[Diagram: the logical topology of sources, transforms, and sinks]*

A reasonably accurate mental model of running the above configuration is that Vector spins up each component as a Tokio task and wires them together with channels. Below, we'll go through each type of component in more detail, discussing how it is translated into a running task and connected to the rest of the topology.
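
To make that model concrete, here is a minimal sketch using plain Tokio tasks and mpsc channels; the types and component behavior are illustrative stand-ins, not Vector's actual code:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // One channel between each pair of adjacent components.
    let (src_tx, mut src_rx) = mpsc::channel::<String>(16);
    let (xform_tx, mut xform_rx) = mpsc::channel::<String>(16);

    // "Source": produces events and sends them downstream.
    let source = tokio::spawn(async move {
        for i in 0..3 {
            src_tx.send(format!("event {i}")).await.unwrap();
        }
    });

    // "Transform": reads from the source's channel, modifies events, forwards them.
    let transform = tokio::spawn(async move {
        while let Some(event) = src_rx.recv().await {
            xform_tx.send(event.to_uppercase()).await.unwrap();
        }
    });

    // "Sink": consumes the transformed events.
    let sink = tokio::spawn(async move {
        while let Some(event) = xform_rx.recv().await {
            println!("sink received: {event}");
        }
    });

    let _ = tokio::join!(source, transform, sink);
}
```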

## Component construction

After parsing and validating a user's configuration, we are left with a Config struct containing (among other things) the collections of SourceConfigs, TransformConfigs, and SinkConfigs corresponding to each of the configured components. Each of those traits has its own build method that constructs the component. This building occurs largely in the src/topology/builder.rs file, along with some setup for the initial wiring into the topology.

### Sources

When a source config is built, the result is mainly two tasks: the "server" task of the source itself, and a "pump" task that forwards its output on to the rest of the system.

Construction begins by setting up a SourceSender for the configured source, which is responsible for sending events to each of the outputs defined by the SourceConfig::outputs method. The outputs are built into the sender in such a way that attempting to send to an unknown output will result in a panic.

For each of these outputs added to the sender, a corresponding Fanout instance is created that will live in a pump task and handle "fanning out" events to every downstream component that lists this source as an input. Each of these individual pump tasks (one per output) consists purely of forwarding events into the Fanout. The final pump task (one per source) simply spawns each of the output-specific pump tasks and drives them to completion.

Finally, the server task itself is built. SourceConfig::build takes a SourceContext as an argument, one field of which is the SourceSender built above. The result of the build function is wrapped in some simple shutdown handling before being inserted into the topology.
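
As a rough sketch of that pump pattern, here is a hypothetical Fanout type standing in for Vector's fanout module (the real one also handles control messages, pauses, and errors):

```rust
use tokio::sync::mpsc;

struct Fanout<T: Clone> {
    // One sender per downstream component that lists this source as an input.
    destinations: Vec<mpsc::Sender<T>>,
}

impl<T: Clone> Fanout<T> {
    async fn send(&self, event: T) {
        // Best-effort clone of the event to every destination.
        for dest in &self.destinations {
            let _ = dest.send(event.clone()).await;
        }
    }
}

// The pump task's entire job: drain one source output into its Fanout.
// The topology spawns one of these per output, e.g. tokio::spawn(pump(rx, fanout)).
async fn pump<T: Clone>(mut output: mpsc::Receiver<T>, fanout: Fanout<T>) {
    while let Some(event) = output.recv().await {
        fanout.send(event).await;
    }
}
```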

*[Diagram: source server and pump tasks]*

### Transforms

After sources, transforms are the next to be built. They're a bit simpler than sources and mostly jump straight into TransformConfig::build. They can define outputs that are translated into Fanout instances in much the same way as sources, and also have an option to enable_concurrency. How that works exactly depends on the type of transform.

#### Synchronous

The simplest type of transform is a Function transform. It runs synchronously on a single event and writes to a single, unnamed output. One step above function transforms is the newer Synchronous transform, which is a superset of the function transform with the added ability to write to multiple outputs if desired. From the perspective of the topology, both are run in exactly the same way.

In the simplest case, where the enable_concurrency flag is not enabled (this is not configuration, but a static attribute of the transform), the resulting transform task pulls a chunk of events from its input channel and processes those events via the transform method into a TransformOutputsBuf (essentially a container holding a Vec<Event> for each of the transform's defined outputs). Once the whole chunk of events has been processed into outputs, those outputs are drained into the respective Fanout instances.
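
A sketch of that loop, with simplified stand-ins for TransformOutputsBuf, the transform method, and the fanouts (the names and chunking mechanics here are illustrative, not Vector's):

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

type Event = String;
// Stand-in for TransformOutputsBuf: one Vec<Event> per named output.
type OutputsBuf = HashMap<&'static str, Vec<Event>>;

// A trivial stand-in for the transform method: route everything to "default".
fn transform_chunk(chunk: Vec<Event>, buf: &mut OutputsBuf) {
    for event in chunk {
        buf.entry("default").or_default().push(event.to_uppercase());
    }
}

async fn run_transform(
    mut input: mpsc::Receiver<Event>,
    fanouts: HashMap<&'static str, mpsc::Sender<Event>>,
) {
    let mut chunk = Vec::with_capacity(128);
    // Pull up to a chunk of ready events at a time; 0 means the channel closed.
    while input.recv_many(&mut chunk, 128).await > 0 {
        let mut buf = OutputsBuf::new();
        transform_chunk(std::mem::take(&mut chunk), &mut buf);
        // Once the whole chunk is processed, drain each per-output buffer into
        // the corresponding fanout.
        for (output, events) in buf {
            if let Some(dest) = fanouts.get(output) {
                for event in events {
                    let _ = dest.send(event).await;
                }
            }
        }
    }
}
```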

If the enable_concurrency flag is enabled, the process is slightly more complicated. For each chunk of input events, instead of being processed inline, a new task is spawned that does the work of processing the events into outputs. Since there is some overhead to spawning tasks, Vector attempts to pull larger chunks of events from the input for transforms running in this mode. The main transform task then tracks the completion of those work tasks in the same order that they were spawned (ensuring that we don't reorder the resulting outputs). When a task completes, the main task receives the resulting TransformOutputsBuf and takes care of draining it into the respective fanouts. To ensure that we don't buffer an unbounded number of events within those work tasks, the main task limits the maximum number that can be in flight simultaneously. Spawning new tasks allows Tokio's work-stealing scheduler to step in and spread the CPU work across multiple threads when there is a need and available capacity to do so.
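
A sketch of the concurrent variant, using FuturesOrdered so completions are consumed in spawn order, plus a simple cap on in-flight chunks; again, these are stand-ins rather than Vector's actual implementation:

```rust
use futures::stream::{FuturesOrdered, StreamExt};
use tokio::sync::mpsc;

type Event = String;

async fn run_concurrent_transform(
    mut input: mpsc::Receiver<Event>,
    output: mpsc::Sender<Event>,
) {
    const MAX_IN_FLIGHT: usize = 4;
    let mut in_flight = FuturesOrdered::new();
    let mut chunk = Vec::new();

    loop {
        tokio::select! {
            // Only pull a new (larger) chunk when we're under the in-flight limit.
            n = input.recv_many(&mut chunk, 1024), if in_flight.len() < MAX_IN_FLIGHT => {
                if n == 0 { break; }
                let work = std::mem::take(&mut chunk);
                // Spawn the processing so Tokio's scheduler can spread it across threads.
                in_flight.push_back(tokio::spawn(async move {
                    work.into_iter().map(|e| e.to_uppercase()).collect::<Vec<_>>()
                }));
            }
            // FuturesOrdered yields results in the order the tasks were pushed,
            // so outputs are never reordered.
            Some(Ok(processed)) = in_flight.next() => {
                for event in processed {
                    let _ = output.send(event).await;
                }
            }
        }
    }

    // Input is closed; drain whatever work is still in flight.
    while let Some(Ok(processed)) = in_flight.next().await {
        for event in processed {
            let _ = output.send(event).await;
        }
    }
}
```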

*[Diagram: synchronous transform task]*

#### Task

A task-style transform differs from the synchronous variants above in that it can perform arbitrary, asynchronous, stream-based transformations of its input. This includes things like emitting outputs after some timeout, independent of incoming events. From the topology's perspective, task transforms are quite simple because they define most of their structure internally and are applied by simply passing the input channel into the transform method.

To build the full task, the transform itself is built, some common filtering and telemetry are added by wrapping the input stream, and then the input stream is passed to the transform method. This results in an output stream, which is then forwarded to the transform's Fanout instance (task transforms do not support multiple outputs).
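
The overall shape looks roughly like the sketch below. The timeout-based batching (via the tokio-stream crate's chunks_timeout adapter) is just an example of the kind of time-driven behavior a task transform can implement, not Vector's code:

```rust
use std::time::Duration;
use futures::stream::{Stream, StreamExt};

type Event = String;

// The whole input stream goes in; an output stream comes out, which the
// topology then forwards into the transform's single Fanout.
fn task_transform(
    input: impl Stream<Item = Event> + Send + 'static,
) -> impl Stream<Item = Event> + Send + 'static {
    // Collect events into batches of up to 100, flushing at least once per
    // second even if a batch isn't full, then flatten back into single events.
    tokio_stream::StreamExt::chunks_timeout(input, 100, Duration::from_secs(1))
        .flat_map(|batch| futures::stream::iter(batch))
}
```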

*[Diagram: task transform]*

### Sinks

Sinks have two components that make building them somewhat more complex than either sources or transforms: healthchecks and buffers.

Healthchecks are one-off tasks that run at startup with the goal of discovering any issues that may prevent the sink from running properly (e.g. permissions or connectivity issues) and notifying the user in a nice way. They can be enabled or disabled both individually and at a global level, and the user can choose whether a failing healthcheck should prevent Vector from starting.
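
Conceptually, the gating looks something like this sketch; the require_healthy flag and messages here are illustrative, not the actual option names or code path:

```rust
// Run a sink's healthcheck once at startup and decide whether a failure
// should abort startup or merely be reported.
async fn run_healthcheck(
    check: impl std::future::Future<Output = Result<(), String>>,
    require_healthy: bool,
) -> Result<(), String> {
    match check.await {
        Ok(()) => {
            println!("healthcheck passed");
            Ok(())
        }
        Err(error) if require_healthy => Err(format!("healthcheck failed: {error}")),
        Err(error) => {
            eprintln!("healthcheck failed (continuing anyway): {error}");
            Ok(())
        }
    }
}
```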

Buffers are a configurable mechanism for dealing with backpressure. By default, like the rest of Vector, sinks will buffer some small-ish number of events in memory before propagating backpressure upstream. Buffer configuration allows individual sinks to change that behavior, choosing between memory and disk for where to store the buffered events, setting a maximum size, and deciding what should happen when the buffer is full (backpressure or load shedding).
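
As a rough analogy using a plain bounded Tokio channel (Vector's buffers are a separate, more capable subsystem), the two when-full behaviors look like this:

```rust
use tokio::sync::mpsc::{self, error::TrySendError};

async fn send_with_backpressure(buffer: &mpsc::Sender<String>, event: String) {
    // Backpressure: wait for capacity, which in turn slows down the caller
    // (and, transitively, the components upstream of it).
    let _ = buffer.send(event).await;
}

fn send_with_load_shedding(buffer: &mpsc::Sender<String>, event: String) {
    // Load shedding: if the buffer is full, drop the event instead of waiting.
    if let Err(TrySendError::Full(dropped)) = buffer.try_send(event) {
        eprintln!("buffer full, dropping event: {dropped}");
    }
}
```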

Disk buffers in particular add some complexity to topology construction because they persist across both config reloads and process restarts. Most of the time they are built normally alongside their corresponding sink, but they are also stashed to the side in case topology construction fails after a buffer has been built. This allows a subsequent build (likely of the previous configuration during a rollback) to pull from the already-built buffer without worrying about the persistence of its contents.

Once the healthcheck and buffer are built, the sink itself is constructed via SinkConfig::build. The surrounding task is defined to first finalize its use of the buffer (removing it from the in-case-of-error stash), filter and wrap the input stream with telemetry, and then pass it to VectorSink::run.

*[Diagram: sink task, buffer, and healthcheck]*

## Connecting components

Once component construction is complete, we're left with a collection of yet-to-be-spawned tasks, as well as handles to their inputs and outputs. More specifically, a component input is the sender side of a channel (or buffer) that acts as the component's input stream. A component output in this case is a fanout::ControlChannel, via which Vector can send control messages that add or remove destinations for the component's actual output stream.

Given those definitions, the fundamental process of wiring up a Vector topology is one of adding the appropriate inputs to the appropriate outputs. As a simple example, consider the following config:

```toml
[sources.foo]
type = "stdin"

[sinks.bar]
type = "console"
inputs = ["foo"]
encoding.codec = "json"
```

After the component construction phase, we'll be left with the tasks for each component (which we'll ignore for now) as well as a collection of inputs and outputs. In this case, each of those collections will hold a single item: an output corresponding to the source foo, consisting of a control channel connected to the Fanout instance within the source's pump task, and an input corresponding to the sink bar, consisting of the sender side of the sink's input channel/buffer (for our purposes here they're equivalent).

To wire up this topology, the connect_diff function on RunningTopology will see that bar specifies foo as an input, take a clone of bar's input, and send that to foo's output control channel. This results in the Fanout associated with the source foo adding a new sender to its list of destinations to write new events to. And voilà, the sink will receive events from the source on its input stream.
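
A sketch of that wiring step, with hypothetical ControlMessage and control-channel types standing in for the real fanout module:

```rust
use tokio::sync::mpsc;

type Event = String;

// Control messages a pump task's Fanout listens for on its control channel.
enum ControlMessage {
    Add(String, mpsc::Sender<Event>), // component id + that component's input sender
    Remove(String),
}

fn connect(
    sink_id: &str,
    sink_input: &mpsc::Sender<Event>,
    source_output_control: &mpsc::UnboundedSender<ControlMessage>,
) {
    // bar lists foo as an input: clone bar's input sender and tell foo's
    // fanout, via its control channel, to add it as a destination.
    let _ = source_output_control.send(ControlMessage::Add(
        sink_id.to_string(),
        sink_input.clone(),
    ));
}
```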

The actual code for this logic (found in src/topology/running.rs) is significantly more complex than what's needed for this very simple example, mostly because it is oriented around applying modifications to an existing running topology rather than simply starting a new one up from scratch. While complex, this allows us to have a single path through which all topology actions occur. Without it, we'd need all the same complexity to support reloads, but it would be a secondary path that's less well exercised than the common startup path. Even so, this is an area where we're always looking for ways that we can simplify and make things more robust.

## Spawning the topology

Once everything is properly connected, the final step is to spawn the actual tasks for each component. There is a bit of bookkeeping to build tracing spans for each, set up error handlers, and track shutdown, but the handles to the running tasks are stored and Vector is off to the races.
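
A sketch of that final step, wrapping each component future in a tracing span, spawning it, and keeping the join handles; the names and signatures here are illustrative:

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;
use tokio::task::JoinHandle;
use tracing::Instrument;

type ComponentTask = Pin<Box<dyn Future<Output = ()> + Send>>;

// Spawn one component task inside its own tracing span and keep the handle.
fn spawn_component(name: &str, task: ComponentTask) -> JoinHandle<()> {
    let span = tracing::info_span!("component", component_id = %name);
    tokio::spawn(task.instrument(span))
}

// Spawn every component and store the handles so shutdown and errors can be tracked.
fn spawn_all(tasks: Vec<(String, ComponentTask)>) -> HashMap<String, JoinHandle<()>> {
    tasks
        .into_iter()
        .map(|(name, task)| {
            let handle = spawn_component(&name, task);
            (name, handle)
        })
        .collect()
}
```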