by François Paupier
Introduction
That’s a crazy flow of water. Just like your application deals with a crazy stream of data.
Routing data from one storage system to another, applying validation rules, and addressing
questions of data governance and reliability in a Big Data ecosystem is hard to get right if
you do it all by yourself.
Good news: you don’t have to build your data flow solution from scratch — Apache NiFi
has your back!
At the end of this article, you’ll be a NiFi expert — ready to build your data pipeline.
For your convenience, here is the table of contents; feel free to go straight where your
curiosity takes you. If you’re a NiFi first-timer, going through this article in the indicated
order is advised.
Table of Contents
I — What is Apache NiFi?
- Defining NiFi
- Why use NiFi?
An easy to use, powerful, and reliable system to process and distribute data.
Defining NiFi
Process and distribute data
That’s the gist of NiFi. It moves data around systems and gives you tools to process this
data.
NiFi can deal with a great variety of data sources and formats. You take data in from one
source, transform it, and push it to a different data sink.
Ten-thousand-foot view of Apache NiFi — NiFi pulls data from multiple data sources, enriches it, and transforms it
to populate a key-value store.
Easy to use
Processors (the boxes) linked by connections (the arrows) create a flow. NiFi offers a
flow-based programming experience.
To translate the data flow above into NiFi, you go to the NiFi graphical user interface, drag
and drop three components onto the canvas, and connect them.
That’s it. It takes two minutes to build.
A simple validation data flow as seen through the NiFi canvas
Now, if you write code to do the same thing, it’s likely to run several hundred lines
to achieve a similar result.
You don’t capture the essence of the pipeline through code as you do with a flow-based
approach. NiFi is more expressive for building a data pipeline; it’s designed for that.
Powerful
NiFi provides many processors out of the box (293 in NiFi 1.9.2). You’re standing on the
shoulders of giants. Those standard processors handle the vast majority of use cases you
may encounter.
NiFi is highly concurrent, yet its internals encapsulate the associated complexity.
Processors offer you a high-level abstraction that hides the inherent complexity of
parallel programming. Processors run simultaneously, and you can spawn multiple
threads of a processor to cope with the load.
Concurrency is a computing Pandora’s box that you don’t want to open. NiFi conveniently
shields the pipeline builder from the complexities of concurrency.
Reliable
The theory backing NiFi is not new; it has solid theoretical anchors. It’s similar to models
like SEDA.
For a data flow system, one of the main topics to address is reliability. You want to be
sure that data sent somewhere is effectively received.
NiFi achieves a high level of reliability through multiple mechanisms that keep track of
the state of the system at any point in time. Those mechanisms are configurable, so you
can make the appropriate tradeoffs between latency and throughput required by your
applications.
NiFi tracks the history of each piece of data with its lineage and provenance features. It
makes it possible to know what transformations happened on each piece of information.
The data lineage solution proposed by Apache NiFi proves to be an excellent tool for
auditing a data pipeline. Data lineage features are essential to bolster confidence in big
data and AI systems in a context where transnational actors such as the European Union
propose guidelines to support accurate data processing.
It’s useful to keep in mind the four Vs of big data when sizing your solution.
Volume — At what scale do you operate? In orders of magnitude, are you closer to
a few gigabytes or hundreds of petabytes?
Variety — How many data sources do you have? Is your data structured? If yes,
does the schema vary often?
Velocity — What is the frequency of the events you process? Is it credit card
payments? Is it a daily performance report sent by an IoT device?
Veracity — Can you trust the data? Or do you need to apply multiple
cleaning operations before manipulating it?
NiFi seamlessly ingests data from multiple data sources and provides mechanisms to
handle different schemas in the data. Thus, it shines when there is high variety in the
data.
With its configuration options, NiFi can address a broad range of volume/velocity
situations.
Microservices are trendy. In those loosely coupled services, the data is the
contract between the services. NiFi is a robust way to route data between those
services.
The Internet of Things brings a multitude of data to the cloud. Ingesting and validating
data from the edge to the cloud poses a lot of new challenges that NiFi can
efficiently address (primarily through MiNiFi, the NiFi project for edge devices).
New guidelines and regulations are put in place to readjust the Big Data
economy. In this context of increased monitoring, it is vital for businesses to have
a clear overview of their data pipelines. NiFi’s data lineage, for example, can be
helpful on the path towards regulatory compliance.
Bridge the gap between big data experts and the rest
As you can see from the user interface, a data flow expressed in NiFi is an excellent way
to communicate about your data pipeline. It can help members of your organization
become more knowledgeable about what’s going on in the data pipeline.
Is an analyst asking for insights about why this data arrives here that way? Sit
together and walk through the flow. In five minutes, you give someone a strong
understanding of the Extract, Transform, Load (ETL) pipeline.
Do you want feedback from your peers on a new error handling flow you created?
NiFi makes it a design decision to consider error paths as likely as valid
outcomes. Expect the flow review to be shorter than a traditional code review.
If you are starting from scratch and manage a small amount of data from trusted data
sources, you may be better off setting up your own Extract, Transform, Load (ETL)
pipeline. Maybe change data capture from a database and some data preparation scripts
are all you need.
On the other hand, if you work in an environment with existing big data solutions in use
(be it for storage, processing, or messaging), NiFi integrates well with them and is more
likely to be a quick win. You can leverage the out of the box connectors to those other Big
Data solutions.
It’s easy to be hyped by new solutions. List your requirements and choose the solution
that answers your needs as simply as possible.
Now that we have seen a very high-level picture of Apache NiFi, let’s take a look at its
key concepts and dissect its internals.
In this second part, I explain the critical concepts of Apache NiFi with diagrams. This
black box model won’t be a black box to you afterward.
The NiFi canvas user interface is the framework in which the pipeline builder works.
The black boxes are called processors, and they exchange chunks of information named
FlowFiles through queues that are named connections. Finally, the Flow Controller is
responsible for managing the resources between those components.
Processor, FlowFile, Connector, and the Flow Controller: four essential concepts in NiFi
FlowFile
In NiFi, the FlowFile is the information packet moving through the processors of the
pipeline.
Anatomy of a FlowFile — It contains attributes of the data as well as a reference to the associated data
A FlowFile comes in two parts:
Attributes, which are key/value pairs. For example, the file name, file path, and a
unique identifier are standard attributes.
Content, a reference to the stream of bytes composing the FlowFile data.
The FlowFile does not contain the data itself. That would severely limit the throughput
of the pipeline.
Instead, a FlowFile holds a pointer that references data stored at some place in the local
storage. This place is called the Content Repository.
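To make the split concrete, here is a minimal Python sketch of the idea (not NiFi’s actual Java classes; `ContentClaim` and the field names are illustrative): a FlowFile carries attributes plus a claim pointing into the Content Repository, never the bytes themselves.

```python
from dataclasses import dataclass
from typing import Optional
import uuid

@dataclass
class ContentClaim:
    """A pointer into the Content Repository: where the bytes live, not the bytes."""
    container: str  # which repository file holds the content
    offset: int     # byte offset where this FlowFile's content starts
    length: int     # number of bytes belonging to this FlowFile

@dataclass
class FlowFile:
    """Key/value attributes plus a reference to the content, never the content itself."""
    attributes: dict
    claim: Optional[ContentClaim] = None

ff = FlowFile(attributes={
    "filename": "orders.csv",
    "path": "/incoming/",
    "uuid": str(uuid.uuid4()),
})
# The claim points at data already sitting in the Content Repository.
ff.claim = ContentClaim(container="repo-0001", offset=4096, length=1024)
print(ff.attributes["filename"])  # attributes travel; the content stays put
```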
To access the content, the FlowFile claims the resource from the Content Repository.
The latter keeps track of the exact disk offset where the content is and streams it
back to the FlowFile.
Not all processors need to access the content of the FlowFile to perform their
operations. For example, aggregating the content of two FlowFiles doesn’t require
loading their content in memory.
When a processor modifies the content of a FlowFile, the previous data is kept. NiFi
copies on write: it modifies the content while copying it to a new location. The original
information is left intact in the Content Repository.
Example
Consider a processor that compresses the content of a FlowFile. The original content
remains in the Content Repository, and a new entry is created for the compressed
content.
The Content Repository finally returns the reference to the compressed content. The
FlowFile is updated to point to the compressed data.
The drawing below sums up the example with a processor that compresses the content
of FlowFiles.
Copy-on-write in NiFi — The original content is still present in the repository after a FlowFile modification.
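The copy-on-write behavior can be sketched in a few lines of Python (an illustration of the concept, not NiFi’s implementation; the `ContentRepository` class and its methods are invented for this example):

```python
import gzip

class ContentRepository:
    """Append-only store: a modification writes a new entry, never overwrites."""
    def __init__(self):
        self._entries = []  # each entry is an immutable blob of bytes

    def write(self, data: bytes) -> int:
        self._entries.append(data)
        return len(self._entries) - 1  # the claim: an index into the repository

    def read(self, claim: int) -> bytes:
        return self._entries[claim]

repo = ContentRepository()
original_claim = repo.write(b"hello nifi " * 100)

# A CompressContent-style step: read the old content, write the compressed
# copy as a NEW entry, and repoint the FlowFile to the new claim.
compressed_claim = repo.write(gzip.compress(repo.read(original_claim)))

# Copy-on-write: the original bytes are still intact for provenance and replay.
assert repo.read(original_claim).startswith(b"hello nifi")
```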
Reliability
NiFi claims to be reliable; how is it in practice? The attributes of all the FlowFiles
currently in use, as well as the references to their content, are stored in the FlowFile
Repository.
At every step of the pipeline, a modification to a FlowFile is first recorded in the FlowFile
Repository, in a write-ahead log, before it is performed.
For each FlowFile that currently exists in the system, the FlowFile Repository stores:
The FlowFile attributes
The state of the FlowFile. For example: to which queue the FlowFile belongs
at this instant.
The FlowFile Repository contains metadata about the files currently in the flow.
The FlowFile Repository gives us the most current state of the flow; thus, it’s a powerful
tool to recover from an outage.
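A toy Python version of the write-ahead idea (purely illustrative; NiFi’s actual FlowFile Repository is a durable on-disk log, and the states below are invented): every change is appended to the log first, and replaying the log rebuilds the latest state after a crash.

```python
import json

class FlowFileRepository:
    """Write-ahead log sketch: record every change before applying it."""
    def __init__(self):
        self.log = []  # real NiFi persists this to disk

    def record(self, flowfile_id, state, queue):
        # Append the change to the log FIRST; only then is it applied to the flow.
        self.log.append(json.dumps({"id": flowfile_id, "state": state, "queue": queue}))

    def recover(self):
        """Replay the whole log to rebuild the latest state of each FlowFile."""
        latest = {}
        for line in self.log:
            entry = json.loads(line)
            latest[entry["id"]] = entry  # later entries overwrite earlier ones
        return latest

repo = FlowFileRepository()
repo.record("ff-1", "queued", "validate")
repo.record("ff-1", "processing", None)
repo.record("ff-2", "queued", "route")

state = repo.recover()
print(state["ff-1"]["state"])  # only the most recent state per FlowFile survives
```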
NiFi provides another tool to track the complete history of all the FlowFiles in the flow:
the Provenance Repository.
Provenance Repository
Every time a FlowFile is modified, NiFi takes a snapshot of the FlowFile and its context at
that point. The name for this snapshot in NiFi is a Provenance Event. The Provenance
Repository records Provenance Events.
Provenance enables us to retrace the lineage of the data and build the full chain of
custody for every piece of information processed in NiFi.
The Provenance Repository stores the metadata and context information of each FlowFile
On top of offering the complete lineage of the data, the Provenance Repository also
lets you replay the data from any point in time.
Trace back the history of your data thanks to the Provenance Repository
Wait, what’s the difference between the FlowFile Repository and the Provenance
Repository?
The idea behind the FlowFile Repository and the Provenance Repository is quite similar,
but they don’t address the same issue.
The FlowFile Repository is a log that contains only the latest state of the in-use
FlowFiles in the system. It is the most recent picture of the flow and makes it
possible to recover from an outage quickly.
The Provenance Repository, on the other hand, is more exhaustive since it tracks
the complete life cycle of every FlowFile that has been in the flow.
The Provenance Repository adds a time dimension where the FlowFile Repository is one snapshot
Where the FlowFile Repository holds only the most recent picture of the system, the
Provenance Repository gives you a whole collection of photos: a video. You can rewind
to any moment in the past, investigate the data, and replay operations from a given time.
It provides a complete lineage of the data.
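The concept can be sketched as follows (an illustrative Python model, not NiFi’s Provenance Repository implementation; the event types mirror names NiFi uses, such as RECEIVE and SEND, while the details strings are invented):

```python
import time

class ProvenanceRepository:
    """Each FlowFile modification is snapshotted as a Provenance Event."""
    def __init__(self):
        self.events = []  # the complete history; never pruned in this sketch

    def record(self, flowfile_id, event_type, details):
        self.events.append({
            "flowfile": flowfile_id,
            "type": event_type,   # e.g. RECEIVE, CONTENT_MODIFIED, SEND
            "details": details,
            "timestamp": time.time(),
        })

    def lineage(self, flowfile_id):
        """The full chain of custody for one piece of data."""
        return [e for e in self.events if e["flowfile"] == flowfile_id]

prov = ProvenanceRepository()
prov.record("ff-1", "RECEIVE", "pulled from an SFTP server")
prov.record("ff-1", "CONTENT_MODIFIED", "content compressed with gzip")
prov.record("ff-1", "SEND", "delivered to an S3 bucket")

print([e["type"] for e in prov.lineage("ff-1")])  # full lineage, in order
```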
FlowFile Processor
A processor is a black box that performs an operation. Processors have access to the
attributes and the content of the FlowFile and can perform all kinds of actions. They let
you ingest data, run standard data transformation/validation tasks, and save the data
to various data sinks.
NiFi comes with many processors when you install it. If you don’t find the perfect one for
your use case, it’s still possible to build your own processor. Writing custom processors
is outside the scope of this blog post.
Processors are high-level abstractions that fulfill one task. This abstraction is very
convenient because it shields the pipeline builder from the inherent difficulties of
concurrent programming and the implementation of error handling mechanisms.
Processors expose an interface with multiple configuration settings to fine-tune their
behavior.
Zoom on a NiFi Processor for record validation — the pipeline builder specifies the high-level configuration
options and the black box hides the implementation details.
The properties of those processors are the last link between NiFi and the business
reality of your application requirements.
The devil is in the details, and pipeline builders spend most of their time fine-tuning
those properties to match the expected behavior.
Scaling
For each processor, you can specify the number of concurrent tasks you want to run
simultaneously. That way, the Flow Controller allocates more resources to this processor,
increasing its throughput. Processors share threads. If one processor requests more
threads, other processors have fewer threads available to execute. Details on how the
Flow Controller allocates threads are available here.
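The concurrent-tasks setting can be pictured with a plain thread pool (a Python sketch using the standard library, not NiFi’s scheduler; `on_trigger` is an invented stand-in for a processor’s task logic):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# One shared pool, standing in for the Flow Controller's thread pool.
pool = ThreadPoolExecutor(max_workers=10)

processed = []
lock = threading.Lock()

def on_trigger(flowfile):
    # Every concurrent task of a processor runs the same logic.
    with lock:
        processed.append(flowfile.upper())

# "Concurrent tasks = 4": four tasks of the same processor run at once,
# each borrowing a thread from the shared pool.
futures = [pool.submit(on_trigger, ff) for ff in ["a", "b", "c", "d"]]
for f in futures:
    f.result()  # wait for all tasks to finish
pool.shutdown()

print(sorted(processed))  # ['A', 'B', 'C', 'D']
```

Other processors submitting to the same pool would compete for the remaining threads, which is why raising one processor’s concurrent tasks leaves fewer threads for the rest.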
Horizontal scaling. Another way to scale is to increase the number of nodes in your NiFi
cluster. Clustering servers makes it possible to increase your processing capability using
commodity hardware.
Process Group
This one is straightforward now that we’ve seen what processors are.
A bunch of processors put together with their connections can form a process group. You
add an input port and an output port so the group can receive and send data.
Process groups are an easy way to create new processors based on existing ones.
Connections
Connections are the queues between processors. These queues allow processors to
interact at different rates. Connections can have different capacities, just as water
pipes come in different sizes.
Various capacities for different connectors. Here we have capacity C1 > capacity C2
Because processors consume and produce data at different rates depending on the
operations they perform, connections act as buffers of FlowFiles.
There is a limit on how much data can be in the connection. Similarly, when your water
pipe is full, you can’t add more water, or it overflows.
In NiFi, you can set limits on the number of FlowFiles and the size of their aggregated
content going through the connections.
What happens when you send more data than the connection can handle?
If the number of FlowFiles or the quantity of data goes above the defined threshold,
backpressure is applied. The Flow Controller won’t schedule the previous processor to
run again until there is room in the queue.
Let’s say you have a limit of 10 000 FlowFiles between two processors. At some point,
the connection has 7 000 elements in it. That’s OK since the limit is 10 000. P1 can still
send data through the connection to P2.
Now let’s say that processor P1 sends 4 000 new FlowFiles to the connection.
7 000 + 4 000 = 11 000 → We go above the connection threshold of 10 000 FlowFiles.
Processor P1 is not scheduled again until the connection goes back below its threshold.
The limits are soft limits, meaning they can be exceeded. However, once they are, the
previous processor, P1, won’t be scheduled until the connector goes back below its
threshold value of 10 000 FlowFiles.
The number of FlowFiles in the connector comes back below the threshold. The Flow Controller schedules the
processor P1 for execution again.
This simplified example gives the big picture of how backpressure works.
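The soft-limit behavior can be sketched in Python (illustrative only; in real NiFi the accounting lives in the Flow Controller and connection queues, and `Connection` here is invented for the example):

```python
class Connection:
    """A queue between two processors with a soft backpressure threshold."""
    def __init__(self, max_flowfiles=10_000):
        self.max_flowfiles = max_flowfiles
        self.queue = []

    def accepts_more(self):
        # Soft limit: checked BEFORE scheduling the upstream processor,
        # so a single batch can push the count past the threshold.
        return len(self.queue) < self.max_flowfiles

conn = Connection(max_flowfiles=10_000)
conn.queue.extend(["ff"] * 7_000)        # 7 000 FlowFiles already queued

if conn.accepts_more():                  # below 10 000, so P1 runs once more...
    conn.queue.extend(["ff"] * 4_000)    # ...and lands 11 000 FlowFiles in the queue

print(conn.accepts_more())  # False: P1 is not scheduled until the queue drains
```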
You want to set up connection thresholds appropriate to the volume and velocity of the
data to handle. Keep in mind the four Vs.
The idea of exceeding a limit may sound odd. When the number of FlowFiles or the
amount of associated data goes beyond the threshold, a swap mechanism is triggered
and excess FlowFiles are written to disk until there is room again.
Connections also let you choose in which order queued FlowFiles leave the queue.
Among the available possibilities there is, for example, First In, First Out (FIFO) order.
However, you can even use an attribute of your choice from the FlowFile to prioritize
incoming packets.
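Attribute-based prioritization can be illustrated with a priority queue (a Python sketch; the `priority` attribute and the lowest-value-first ordering are choices made for this example, while NiFi itself ships configurable prioritizers):

```python
import heapq

# Prioritize queued FlowFiles by a chosen attribute instead of plain FIFO.
# Here, an invented "priority" attribute: lower values leave the queue first.
flowfiles = [
    {"uuid": "ff-1", "priority": "7"},
    {"uuid": "ff-2", "priority": "1"},
    {"uuid": "ff-3", "priority": "4"},
]

# Build a min-heap keyed on the attribute value.
heap = [(int(ff["priority"]), ff["uuid"]) for ff in flowfiles]
heapq.heapify(heap)

# Pop order follows the attribute, not arrival order.
order = [heapq.heappop(heap)[1] for _ in range(len(heap))]
print(order)  # ['ff-2', 'ff-3', 'ff-1']
```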
Flow Controller
The Flow Controller is the glue that brings everything together. It allocates and manages
threads for processors. It’s what executes the data ow.
For example, you may use an AWS credentials provider service to make it possible for
your services to interact with S3 buckets without having to worry about the credentials
at the processor level.
Just like with processors, a multitude of controller services is available out of the box.
You can check out this article for more content on the controller services.
If you’re reading this, congrats! You now know more about NiFi than 99.99% of the
world’s population.
Practice makes perfect. You now master all the concepts required to start building your
own pipeline. Make it simple; make it work first.
Here is a list of exciting resources I compiled on top of my work experience to write this
article.
Resources
The bigger picture
Because designing data pipelines in a complex ecosystem requires proficiency in multiple
areas, I highly recommend the book Designing Data-Intensive Applications by Martin
Kleppmann. It covers the fundamentals.
A cheat sheet with all the references quoted in Martin’s book is available on his
GitHub repo.
This cheat sheet is a great place to start if you already know what kind of topic you’d like
to study in-depth and you want to find quality materials.
Open source:
Most of the existing cloud providers offer data flow solutions. Those solutions integrate
easily with the other products you use from that cloud provider. At the same time, they
solidly tie you to a particular vendor.
The NiFi blog distills a lot of insights into NiFi usage patterns as well as tips on how to
build pipelines.
How Apache NiFi works — surf on your dataflow, don’t drown in it