Introduction
A Pipeline is a customizable data processing system. At the most basic level, input data is supplied in the form of evidence. One or more flow elements in the Pipeline then perform processing based on that evidence and, optionally, populate data values required by the user.
The incoming evidence is usually related to a web request, for example the HTTP headers, cookies, source IP address or values from the query string. The evidence is carried through the Pipeline to the elements by a flow data instance. The flow data structure encapsulates all input and output data associated with a single Pipeline process request.
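To make this concrete, evidence can be pictured as a set of key-value pairs, typically grouped by a prefix that identifies its source. The keys and values below are purely illustrative:

```python
# Illustrative only: evidence as key-value pairs, keyed by source prefix.
evidence = {
    "header.user-agent": "Mozilla/5.0 ...",   # an HTTP header
    "cookie.session-id": "abc123",            # a cookie value
    "server.client-ip": "203.0.113.7",        # the source IP address
    "query.utm_source": "newsletter",         # a query string value
}
```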
Creation
A Pipeline is constructed using a pipeline builder, which follows the fluent builder pattern. Adding flow elements to the builder defines the processing that the Pipeline will carry out.
Once created, a Pipeline is immutable: the flow elements it contains and the order in which they execute cannot be changed. Individual flow elements may or may not be immutable, depending on their implementation.
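As a minimal sketch of this pattern (all class and method names here are illustrative, not a real API):

```python
class FlowElement:
    """Base class: a flow element performs processing on a flow data."""
    def process(self, flowdata):
        raise NotImplementedError

class Pipeline:
    def __init__(self, elements):
        # Stored as a tuple: the element list and execution order are fixed.
        self.flow_elements = tuple(elements)

class PipelineBuilder:
    def __init__(self):
        self._elements = []

    def add(self, element):
        self._elements.append(element)
        return self  # returning self is what makes the builder 'fluent'

    def build(self):
        return Pipeline(self._elements)

# Fluent usage; the order of add() calls fixes the execution order:
# pipeline = PipelineBuilder().add(element1).add(element2).build()
```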
As an alternative to configuring a Pipeline in code using a builder, configuration can be supplied from a file. Depending on the language's features and conventions, this could be formatted as either JSON or XML. This allows the Pipeline to be reconfigured at runtime without recompiling the code. This is the default mode of operation for web integrations, but it can be used for any other use case as well. For more on this, see the build from configuration section and the configure from file example.
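A rough sketch of how such a file might be consumed. The element names and option keys below are invented for illustration; a real integration defines its own configuration schema:

```python
import json

# Hypothetical JSON configuration, embedded here for a self-contained example.
config_json = """
{
  "PipelineOptions": {
    "Elements": [
      {
        "BuilderName": "ExampleElementBuilder",
        "BuildParameters": { "exampleOption": true }
      }
    ]
  }
}
"""
options = json.loads(config_json)
for entry in options["PipelineOptions"]["Elements"]:
    # A builder could look up each element builder by name and construct
    # it with the given parameters, with no compile-time dependency.
    print(entry["BuilderName"], entry["BuildParameters"])
```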
Processing
A Pipeline's operation starts with the creation of a flow data instance. This is created by the Pipeline and is specific to it; each flow data instance can only be processed by the Pipeline that created it.
Next, evidence is added to the flow data ready to be processed.
Finally, the flow data is processed. This sends the data (along with all the evidence it now contains) through the Pipeline. Each flow element receives the flow data and carries out its processing before optionally updating the flow data with new values.
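The whole flow can be sketched end to end. This self-contained toy follows the same illustrative naming as above; a real Pipeline API will differ:

```python
class FlowData:
    def __init__(self, pipeline):
        self._pipeline = pipeline
        self.evidence = {}   # input: evidence added before processing
        self.results = {}    # output: values populated by flow elements

    def add_evidence(self, key, value):
        self.evidence[key] = value

    def process(self):
        # Send this flow data through each flow element, in order.
        for element in self._pipeline.flow_elements:
            element.process(self)

class UserAgentLengthElement:
    """Toy flow element: records the length of the user-agent header."""
    def process(self, flowdata):
        ua = flowdata.evidence.get("header.user-agent", "")
        flowdata.results["ua-length"] = len(ua)

class Pipeline:
    def __init__(self, elements):
        self.flow_elements = tuple(elements)

    def create_flowdata(self):
        # Each flow data is created by, and tied to, one specific Pipeline.
        return FlowData(self)

pipeline = Pipeline([UserAgentLengthElement()])
flowdata = pipeline.create_flowdata()                      # 1. create flow data
flowdata.add_evidence("header.user-agent", "Mozilla/5.0")  # 2. add evidence
flowdata.process()                                         # 3. process
print(flowdata.results["ua-length"])                       # -> 11
```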
Note that the order in which flow elements execute is fixed when the Pipeline is created. By default, flow elements execute sequentially in the order they were added. However, if the language supports it, two or more flow elements can also execute in parallel within the overall sequential structure.
Additionally, the Pipeline may offer asynchronous execution or a lazy loading capability for individual flow elements. These features are also language dependent.
Regardless of the method of execution and configuration, once processing completes, the flow data will contain the results, which can then be accessed by the caller.
Public Access
Beyond the creation of new flow data instances, there are very few publicly accessible parts of the Pipeline.
Flow elements inside the Pipeline are accessible as a read-only collection, and can also be retrieved individually if needed.
All element properties that the Pipeline's flow elements can populate are also exposed in one place, enabling easy iteration over them.
An evidence key filter is also exposed. This aggregates the evidence keys accepted by all the flow elements within the Pipeline, so the caller can check which items of evidence could affect the result of processing.
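A runnable toy showing how such an aggregated filter might work; again, all names are illustrative:

```python
class DeviceElement:
    # Each element declares the evidence keys it accepts.
    evidence_keys = frozenset({"header.user-agent"})

class GeoElement:
    evidence_keys = frozenset({"server.client-ip", "query.lat", "query.lon"})

class Pipeline:
    def __init__(self, elements):
        self.flow_elements = tuple(elements)  # read-only collection

    @property
    def evidence_key_filter(self):
        # Aggregate: the union of every element's accepted keys.
        keys = set()
        for element in self.flow_elements:
            keys |= element.evidence_keys
        return keys

pipeline = Pipeline([DeviceElement(), GeoElement()])
print("header.user-agent" in pipeline.evidence_key_filter)  # True
print("cookie.session" in pipeline.evidence_key_filter)     # False
```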
Internals
The structure of the flow elements within a Pipeline, and of the flow data it creates, is determined by how the Pipeline is built.
Consider an example where flow elements E1 and E2 are added to the Pipeline individually in that order.
E1 will carry out its processing on the flow data; once it has finished, E2 will do the same. In this scenario, the Pipeline 'knows' that the flow data will never be written to by multiple threads at once. As such, the flow data created by the Pipeline does not need to be thread-safe, which gives slightly improved performance.
Now consider an example where both E1 and E2 are added in parallel.
In this case, both will carry out their processing at the same time. Now, the flow data which the Pipeline creates must be thread-safe for writing, as it is possible that E1 and E2 will attempt to write their results to the flow data at the same time.
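A toy sketch of the parallel case, with E1 and E2 writing concurrently to a lock-protected flow data. The names and structure are illustrative only:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ThreadSafeFlowData:
    def __init__(self):
        self.results = {}
        self._lock = threading.Lock()

    def set_result(self, key, value):
        # The lock is only needed because elements may run in parallel;
        # the purely sequential case could skip it for a small speed-up.
        with self._lock:
            self.results[key] = value

def e1(flowdata):
    flowdata.set_result("E1", "done")

def e2(flowdata):
    flowdata.set_result("E2", "done")

flowdata = ThreadSafeFlowData()
with ThreadPoolExecutor() as pool:
    # Both elements execute at the same time within this parallel stage.
    futures = [pool.submit(e1, flowdata), pool.submit(e2, flowdata)]
    for future in futures:
        future.result()
print(flowdata.results)  # {'E1': 'done', 'E2': 'done'} (order may vary)
```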
Lifecycle
A Pipeline is the second thing to be created, after the flow elements it contains. It then exists for as long as processing is required. When a Pipeline is disposed of, it can optionally dispose of the flow elements within it too. This makes managing the lifetime of the flow elements easy; however, this should not be done if the same flow element instances have also been added to another Pipeline.
If an attempt is made to process a flow data from a Pipeline that has since been disposed, an error will occur. The Pipeline that creates a flow data MUST exist for as long as that flow data. This also applies to post-processing usage, such as retrieving results from the flow data. For example, if a certain result is lazily loaded, a call to get that result will require the flow element that created it to do the loading; if the Pipeline has disposed of that flow element, an error will occur.
While not a necessity, it is good practice to dispose of each flow data produced by the Pipeline once it is no longer needed.
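A runnable toy illustrating the ordering rule: dispose of each flow data before its Pipeline, and keep the Pipeline alive for as long as any of its flow data is in use. The context-manager approach and all names are illustrative:

```python
class FlowData:
    def __init__(self, pipeline):
        self._pipeline = pipeline

    def process(self):
        if self._pipeline.closed:
            raise RuntimeError("the Pipeline has been disposed")

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        pass  # per-request resources would be released here

class Pipeline:
    def __init__(self):
        self.closed = False

    def create_flowdata(self):
        return FlowData(self)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.closed = True  # would also dispose any owned flow elements

with Pipeline() as pipeline:
    with pipeline.create_flowdata() as flowdata:
        flowdata.process()  # fine: the Pipeline still exists
# From here on, processing a flow data created by `pipeline` would raise.
```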