
3 MAY 2019 / #APACHE #DATA #SOFTWARE DEVELOPMENT

How Apache NiFi works — surf on your dataflow, don't drown in it


by François Paupier

Introduction
That's a crazy flow of water. Just like your application deals with a crazy stream of data. Routing data from one storage system to another, applying validation rules, and addressing questions of data governance and reliability in a Big Data ecosystem is hard to get right if you do it all by yourself.

Good news: you don't have to build your dataflow solution from scratch — Apache NiFi has your back!

At the end of this article, you’ll be a NiFi expert — ready to build your data pipeline.

What I will cover in this article:


What Apache NiFi is, in which situations you should use it, and the key concepts to understand in NiFi.

What I won’t cover:


Installation, deployment, monitoring, security, and administration of a NiFi
cluster.

For your convenience, here is the table of contents; feel free to go straight where your curiosity takes you. If you're a NiFi first-timer, going through this article in the indicated order is advised.

Table of Contents
I — What is Apache NiFi?
- Defining NiFi
- Why use NiFi?

II — Apache NiFi under the microscope


- FlowFile
- Processor
- Process Group
- Connection
- Flow Controller

Conclusion and call to action

What is Apache NiFi?


On the website of the Apache NiFi project, you can find the following definition:

An easy to use, powerful, and reliable system to process and distribute data.

Let’s analyze the keywords there.

Defining NiFi
Process and distribute data
That's the gist of NiFi. It moves data around systems and gives you tools to process this data.

NiFi can deal with a great variety of data sources and formats. You take data in from one source, transform it, and push it to a different data sink.

Ten-thousand-feet view of Apache NiFi — NiFi pulls data from multiple data sources, enriches and transforms it to populate a key-value store.

Easy to use
Processors — the boxes — linked by connectors — the arrows — create a flow. NiFi offers a flow-based programming experience.

NiFi makes it possible to understand, at a glance, a set of dataflow operations that would take hundreds of lines of source code to implement.

Consider the pipeline below:

An overly minimalist data pipeline

To translate the dataflow above into NiFi, you go to the NiFi graphical user interface, drag and drop three components onto the canvas, and that's it. It takes two minutes to build.
A simple validation dataflow as seen on the NiFi canvas

Now, if you write code to do the same thing, it's likely to be several hundred lines long to achieve a similar result.

You don't capture the essence of the pipeline through code as you do with a flow-based approach. NiFi is more expressive for building a data pipeline; it's designed to do that.

Powerful
NiFi provides many processors out of the box (293 in NiFi 1.9.2). You're standing on the shoulders of giants. Those standard processors handle the vast majority of use cases you may encounter.

NiFi is highly concurrent, yet its internals encapsulate the associated complexity. Processors offer you a high-level abstraction that hides the inherent complexity of parallel programming. Processors run simultaneously, and you can spawn multiple threads of a processor to cope with the load.

Concurrency is a computing Pandora’s box that you don’t want to open. NiFi conveniently
shields the pipeline builder from the complexities of concurrency.

Reliable
The theory backing NiFi is not new; it has solid theoretical anchors. It’s similar to models
like SEDA.

For a dataflow system, one of the main topics to address is reliability. You want to be sure that data sent somewhere is effectively received.

NiFi achieves a high level of reliability through multiple mechanisms that keep track of the state of the system at any point in time. Those mechanisms are configurable, so you can make the appropriate tradeoffs between latency and throughput required by your applications.

NiFi tracks the history of each piece of data with its lineage and provenance features. They make it possible to know what transformations happen to each piece of information.

The data lineage solution proposed by Apache NiFi proves to be an excellent tool for auditing a data pipeline. Data lineage features are essential to bolster confidence in big data and AI systems in a context where transnational actors such as the European Union propose guidelines to support accurate data processing.

Why use NiFi?


First, I want to make it clear I’m not here to evangelize NiFi. My goal is to give you
enough elements so you can make an informed decision on the best way to build your
data pipeline.

It’s useful to keep in mind the four Vs of big data when dimensioning your solution.

The four Vs of Big Data

Volume — At what scale do you operate? In orders of magnitude, are you closer to a few gigabytes or hundreds of petabytes?

Variety — How many data sources do you have? Is your data structured? If yes, does the schema vary often?

Velocity — What is the frequency of the events you process? Is it credit card payments? Is it a daily performance report sent by an IoT device?

Veracity — Can you trust the data? Alternatively, do you need to apply multiple cleaning operations before manipulating it?

NiFi seamlessly ingests data from multiple data sources and provides mechanisms to handle different schemas in the data. Thus, it shines when there is high variety in the data.

NiFi is particularly valuable if your data is of low veracity, since it provides multiple processors to clean and format it.

With its configuration options, NiFi can address a broad range of volume/velocity situations.

An increasing list of applications for data routing solutions


New regulations, the rise of the Internet of Things, and the flow of data it generates emphasize the relevance of tools such as Apache NiFi.

Microservices are trendy. In those loosely coupled services, the data is the contract between the services. NiFi is a robust way to route data between those services.

The Internet of Things brings a multitude of data to the cloud. Ingesting and validating data from the edge to the cloud poses a lot of new challenges that NiFi can efficiently address (primarily through MiNiFi, the NiFi project for edge devices).

New guidelines and regulations are put in place to readjust the Big Data economy. In this context of increasing monitoring, it is vital for businesses to have a clear overview of their data pipeline. NiFi's data lineage, for example, can be helpful on a path toward compliance with regulations.

Bridge the gap between big data experts and the others
As you can see from the user interface, a dataflow expressed in NiFi is excellent for communicating about your data pipeline. It can help members of your organization become more knowledgeable about what's going on in the data pipeline.

Is an analyst asking for insights about why this data arrives here that way? Sit together and walk through the flow. In five minutes, you can give someone a strong understanding of the Extract, Transform, Load (ETL) pipeline.

Do you want feedback from your peers on a new error-handling flow you created? NiFi makes it a design decision to consider error paths as likely as valid outcomes. Expect the flow review to be shorter than a traditional code review.

Should you use it? Yes, No, Maybe?


NiFi brands itself as easy to use. Still, it is an enterprise dataflow platform. It offers a complete set of features from which you may only need a reduced subset. Adding a new tool to the stack is not a trivial decision.

If you are starting from scratch and manage a small amount of data from trusted sources, you may be better off setting up your own Extract, Transform, Load (ETL) pipeline. Maybe change data capture from a database and some data preparation scripts are all you need.

On the other hand, if you work in an environment with existing big data solutions in use (be it for storage, processing, or messaging), NiFi integrates well with them and is more likely to be a quick win. You can leverage the out-of-the-box connectors to those other Big Data solutions.

It’s easy to be hyped by new solutions. List your requirements and choose the solution
that answers your needs as simply as possible.

Now that we have seen the high-level picture of Apache NiFi, let's take a look at its key concepts and dissect its internals.

Apache NiFi under the microscope


"NiFi is boxes-and-arrows programming" may be OK to communicate the big picture. However, if you have to operate NiFi, you may want to understand a bit more about how it works.

In this second part, I explain the critical concepts of Apache NiFi with schemas. This
black box model won’t be a black box to you afterward.

Unboxing Apache NiFi


When you start NiFi, you land on its web interface. The web UI is the blueprint on which you design and control your data pipeline.
Apache NiFi user interface — build your pipeline by dragging and dropping components onto the interface

In NiFi, you assemble processors linked together by connections. In the sample dataflow introduced previously, there are three processors.

Three processors linked together by two queues

The NiFi canvas user interface is the framework in which the pipeline builder evolves.

Making sense of NiFi terminology


To express your dataflow in NiFi, you must first master its language. No worries, a few terms are enough to grasp the concepts behind it.

The black boxes are called processors, and they exchange chunks of information named FlowFiles through queues that are named connections. Finally, the Flow Controller is responsible for managing the resources between those components.

Processor, FlowFile, Connection, and Flow Controller: four essential concepts in NiFi

Let’s take a look at how this works under the hood.

FlowFile
In NiFi, the FlowFile is the information packet moving through the processors of the
pipeline.

Anatomy of a FlowFile — It contains attributes of the data as well as a reference to the associated data
A FlowFile comes in two parts:

Attributes, which are key/value pairs. For example, the filename, file path, and a unique identifier are standard attributes.

Content, a reference to the stream of bytes that composes the FlowFile content.

The FlowFile does not contain the data itself. That would severely limit the throughput
of the pipeline.

Instead, a FlowFile holds a pointer that references data stored at some place in the local
storage. This place is called the Content Repository.
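To make the split between attributes and content concrete, here is a minimal Python sketch. The class and field names are illustrative, not NiFi's actual internals:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ContentClaim:
    """A pointer into the Content Repository: container file, offset, and length."""
    container: str
    offset: int
    length: int

@dataclass
class FlowFile:
    """An information packet: key/value attributes plus a reference to the bytes."""
    attributes: dict = field(default_factory=dict)
    claim: Optional[ContentClaim] = None  # the content itself stays on disk

ff = FlowFile(
    attributes={"filename": "orders.csv", "path": "/in/", "uuid": "a1b2-c3d4"},
    claim=ContentClaim(container="content_repo/partition-1", offset=4096, length=1024),
)
print(ff.attributes["filename"])  # attributes travel with the FlowFile
print(ff.claim.offset)            # the content is only referenced, not embedded
```

The FlowFile object stays tiny no matter how large the payload is, which is precisely why the pipeline's throughput doesn't suffer.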

The Content Repository stores the content of the FlowFile

To access the content, the FlowFile claims the resource from the Content Repository. The latter keeps track of the exact disk offset where the content is located and streams it back to the FlowFile.

Not all processors need to access the content of the FlowFile to perform their operations — for example, aggregating the content of two FlowFiles doesn't require loading their content into memory.

When a processor modifies the content of a FlowFile, the previous data is kept. NiFi copies on write: it modifies the content while copying it to a new location. The original information is left intact in the Content Repository.

Example
Consider a processor that compresses the content of a FlowFile. The original content
remains in the Content Repository, and a new entry is created for the compressed
content.

The Content Repository finally returns the reference to the compressed content. The FlowFile is updated to point to the compressed data.
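That copy-on-write behavior can be sketched with a toy, in-memory content repository standing in for NiFi's on-disk one (all names here are illustrative):

```python
import gzip

class ContentRepository:
    """Toy content store: append-only, so old content is never overwritten."""
    def __init__(self):
        self._entries = []  # an index in this list plays the role of a content claim

    def put(self, data: bytes) -> int:
        self._entries.append(data)
        return len(self._entries) - 1  # the "claim" referencing the new entry

    def get(self, claim: int) -> bytes:
        return self._entries[claim]

repo = ContentRepository()
flowfile = {"claim": repo.put(b"some uncompressed payload")}

# A compress "processor": read via the claim, write a NEW entry, repoint the FlowFile.
original = repo.get(flowfile["claim"])
flowfile["claim"] = repo.put(gzip.compress(original))

# The original bytes are still intact in the repository (copy-on-write).
assert repo.get(0) == b"some uncompressed payload"
assert gzip.decompress(repo.get(flowfile["claim"])) == original
```

Only the FlowFile's pointer moves; nothing in the repository is mutated in place.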

The drawing below sums up the example with a processor that compresses the content
of FlowFiles.

Copy-on-write in NiFi — The original content is still present in the repository after a FlowFile modification.

Reliability
NiFi claims to be reliable; how is that in practice? The attributes of all the FlowFiles currently in use, as well as the references to their content, are stored in the FlowFile Repository.

At every step of the pipeline, a modification to a FlowFile is first recorded in the FlowFile Repository, in a write-ahead log, before it is performed.

For each FlowFile that currently exists in the system, the FlowFile Repository stores:

The FlowFile attributes

A pointer to the content of the FlowFile located in the Content Repository

The state of the FlowFile. For example: which queue the FlowFile belongs to at this instant.

The FlowFile Repository contains metadata about the files currently in the flow.

The FlowFile Repository gives us the most current state of the flow; thus it's a powerful tool to recover from an outage.
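The write-ahead discipline can be sketched as follows: record the intended change before applying it, so the latest state can be rebuilt after a crash. This is a heavy simplification of NiFi's actual FlowFile Repository:

```python
import json

log = []          # stands in for the on-disk write-ahead log
flowfiles = {}    # live in-memory state: uuid -> {"attributes": ..., "queue": ...}

def update_flowfile(uuid, attributes, queue):
    # 1. Record the modification in the write-ahead log FIRST.
    log.append(json.dumps({"uuid": uuid, "attributes": attributes, "queue": queue}))
    # 2. Only then apply it to the live state.
    flowfiles[uuid] = {"attributes": attributes, "queue": queue}

update_flowfile("ff-1", {"filename": "a.csv"}, queue="validate -> route")
update_flowfile("ff-1", {"filename": "a.csv", "valid": "true"}, queue="route -> store")

# Recovery after an outage: replay the log; the last entry per uuid wins.
recovered = {}
for line in log:
    entry = json.loads(line)
    recovered[entry["uuid"]] = {"attributes": entry["attributes"], "queue": entry["queue"]}

assert recovered == flowfiles  # the log reproduces the most recent state
```

Because the log entry lands on disk before the change takes effect, a crash can never leave the system knowing less than the log does.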

NiFi provides another tool to track the complete history of all the FlowFiles in the flow: the Provenance Repository.

Provenance Repository
Every time a FlowFile is modified, NiFi takes a snapshot of the FlowFile and its context at this point. The name for this snapshot in NiFi is a Provenance Event. The Provenance Repository records Provenance Events.

Provenance enables us to retrace the lineage of the data and build the full chain of
custody for every piece of information processed in NiFi.

The Provenance Repository stores the metadata and context information of each FlowFile

On top of offering the complete lineage of the data, the Provenance Repository also lets you replay the data from any point in time.
Trace back the history of your data thanks to the Provenance Repository

Wait, what’s the difference between the FlowFile Repository and the Provenance
Repository?

The idea behind the FlowFile Repository and the Provenance Repository is quite similar,
but they don’t address the same issue.

The FlowFile Repository is a log that contains only the latest state of the in-use FlowFiles in the system. It is the most recent picture of the flow and makes it possible to recover from an outage quickly.

The Provenance Repository, on the other hand, is more exhaustive since it tracks the complete life cycle of every FlowFile that has been in the flow.

The Provenance Repository adds a time dimension where the FlowFile Repository is one snapshot

Where the FlowFile Repository gives you only the most recent picture of the system, the Provenance Repository gives you a collection of photos — a video. You can rewind to any moment in the past, investigate the data, and replay operations from a given time. It provides a complete lineage of the data.
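The contrast can be sketched like this: the FlowFile Repository keeps one entry per live FlowFile, while the Provenance Repository appends an event for every modification (a toy model, not NiFi's storage format):

```python
flowfile_repo = {}     # latest state only: uuid -> current attributes
provenance_repo = []   # full history: one event per modification

def record_modification(uuid, attributes, event_type):
    flowfile_repo[uuid] = attributes                   # overwrite: a single snapshot
    provenance_repo.append(                            # append: the full timeline
        {"uuid": uuid, "type": event_type, "attributes": dict(attributes)}
    )

record_modification("ff-1", {"filename": "a.csv"}, "CREATE")
record_modification("ff-1", {"filename": "a.csv.gz"}, "CONTENT_MODIFIED")

assert len(flowfile_repo) == 1     # only the most recent picture survives
assert len(provenance_repo) == 2   # every step is retained for lineage and replay
assert provenance_repo[0]["attributes"]["filename"] == "a.csv"
```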

FlowFile Processor
A processor is a black box that performs an operation. Processors have access to the attributes and the content of the FlowFile to perform all kinds of actions. They enable you to perform many operations in data ingress, standard data transformation/validation tasks, and saving this data to various data sinks.

Three different kinds of processors

NiFi comes with many processors when you install it. If you don't find the perfect one for your use case, it's still possible to build your own. Writing custom processors is outside the scope of this blog post.

Processors are high-level abstractions that fulfill one task. This abstraction is very convenient because it shields the pipeline builder from the inherent difficulties of concurrent programming and the implementation of error-handling mechanisms.

Processors expose an interface with multiple configuration settings to fine-tune their behavior.
Zoom on a NiFi processor for record validation — the pipeline builder specifies the high-level configuration options, and the black box hides the implementation details.

The properties of those processors are the last link between NiFi and the business
reality of your application requirements.

The devil is in the details, and pipeline builders spend most of their time fine-tuning those properties to match the expected behavior.

Scaling
For each processor, you can specify the number of concurrent tasks you want to run simultaneously. That way, the Flow Controller allocates more resources to the processor, increasing its throughput. Processors share threads: if one processor requests more threads, other processors have fewer threads available to execute. Details on how the Flow Controller allocates threads are available in the NiFi documentation.

Horizontal scaling. Another way to scale is to increase the number of nodes in your NiFi cluster. Clustering servers makes it possible to increase your processing capability using commodity hardware.

Process Group
This one is straightforward now that we’ve seen what processors are.

A bunch of processors put together with their connections can form a process group. You
add an input port and an output port so it can receive and send data.

Building a new processor from three existing processors

Process groups are an easy way to create new processors based on existing ones.

Connections
Connections are the queues between processors. These queues allow processors to interact at differing rates. Connections can have different capacities, just as water pipes come in different sizes.

Various capacities for different connectors. Here we have capacity C1 > capacity C2

Because processors consume and produce data at different rates depending on the
operations they perform, connections act as buffers of FlowFiles.

There is a limit on how much data can be in a connection. Similarly, when your water pipe is full, you can't add water anymore, or it overflows.

In NiFi you can set limits on the number of FlowFiles and the size of their aggregated
content going through the connections.

What happens when you send more data than the connection can handle?

If the number of FlowFiles or the quantity of data goes above the defined threshold, backpressure is applied. The Flow Controller won't schedule the previous processor to run again until there is room in the queue.

Let's say you have a limit of 10 000 FlowFiles between two processors. At some point, the connection has 7 000 elements in it. That's fine, since the limit is 10 000. P1 can still send data through the connection to P2.

Two processors linked by a connector with its limit respected.

Now let's say that processor one sends 4 000 new FlowFiles to the connection.
7 000 + 4 000 = 11 000 → We go above the connection threshold of 10 000 FlowFiles.

Processor P1 is not scheduled until the connector goes back below its threshold.

The limits are soft limits, meaning they can be exceeded. However, once they are, the previous processor, P1, won't be scheduled until the connector goes back below its threshold value — 10 000 FlowFiles.

The number of FlowFiles in the connector comes back below the threshold. The Flow Controller schedules processor P1 for execution again.

This simplified example gives the big picture of how backpressure works.
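The worked example above can be sketched as a simple scheduling check: treat the connection as a queue with a soft limit, and only schedule the upstream processor while the queue is below it. The numbers come from the example; the scheduling logic is a simplification of what the Flow Controller does:

```python
THRESHOLD = 10_000  # soft limit on the number of FlowFiles in the connection

def may_schedule_upstream(queued_flowfiles: int) -> bool:
    """Backpressure: the upstream processor runs only while the queue has room."""
    return queued_flowfiles < THRESHOLD

queued = 7_000
assert may_schedule_upstream(queued)      # 7 000 < 10 000: P1 may run

queued += 4_000                           # P1 emits 4 000 FlowFiles in one go
assert queued == 11_000                   # the soft limit is exceeded...
assert not may_schedule_upstream(queued)  # ...so P1 is no longer scheduled

queued -= 2_000                           # P2 drains 2 000 FlowFiles -> 9 000
assert may_schedule_upstream(queued)      # back below the threshold: P1 runs again
```

Note that the queue briefly holds 11 000 FlowFiles: the limit is soft, so it blocks the producer rather than rejecting data.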

You want to set up connection thresholds appropriate to the Volume and Velocity of data you handle. Keep in mind the four Vs.

The idea of exceeding a limit may sound odd. When the number of FlowFiles or the associated data goes beyond the threshold, a swap mechanism is triggered.

Active queue and swap in NiFi connectors

For another example on backpressure, this mail thread can help.


Prioritizing FlowFiles
The connectors in NiFi are highly configurable. You can choose how you prioritize FlowFiles in the queue to decide which one to process next.

Among the available possibilities there is, for example, First In, First Out (FIFO) ordering. However, you can even use an attribute of your choice from the FlowFile to prioritize incoming packets.
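Attribute-based prioritization can be sketched with a heap, where FlowFiles carrying a smaller `priority` attribute are dequeued first. The attribute name and the class are illustrative; in NiFi you select prioritizers per connection in the UI:

```python
import heapq
from itertools import count

class Connection:
    """Toy prioritized queue: pops the FlowFile with the smallest 'priority' attribute."""
    def __init__(self):
        self._heap = []
        self._tiebreak = count()  # preserves FIFO order among equal priorities

    def enqueue(self, flowfile: dict):
        key = int(flowfile["attributes"].get("priority", 99))
        heapq.heappush(self._heap, (key, next(self._tiebreak), flowfile))

    def dequeue(self) -> dict:
        return heapq.heappop(self._heap)[2]

conn = Connection()
conn.enqueue({"attributes": {"filename": "batch.csv", "priority": "5"}})
conn.enqueue({"attributes": {"filename": "alert.json", "priority": "1"}})

assert conn.dequeue()["attributes"]["filename"] == "alert.json"  # urgent first
assert conn.dequeue()["attributes"]["filename"] == "batch.csv"
```

With the tiebreaker counter removed and a constant key, the same structure degrades to plain FIFO ordering.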

Flow Controller
The Flow Controller is the glue that brings everything together. It allocates and manages threads for processors. It's what executes the dataflow.

The Flow Controller coordinates the allocation of resources for processors.

Also, the Flow Controller makes it possible to add Controller Services.

Those services facilitate the management of shared resources like database connections or cloud service provider credentials. Controller services are daemons: they run in the background and provide configuration, resources, and parameters for the processors to execute.

For example, you may use an AWS credentials provider service to make it possible for your processors to interact with S3 buckets without having to worry about the credentials at the processor level.

An AWS credentials service provides context to two processors

Just like with processors, a multitude of controller services is available out of the box.

You can check out this article for more content on the controller services.

Conclusion and call to action


Over the course of this article, we discussed NiFi, an enterprise dataflow solution. You now have a strong understanding of what NiFi does and how you can leverage its data routing features for your applications.

If you’re reading this, congrats! You now know more about NiFi than 99.99% of the
world’s population.

Practice makes perfect. You now master all the concepts required to start building your own pipeline. Keep it simple, and make it work first.

Here is a list of exciting resources I compiled on top of my work experience to write this
article.

Resources
The bigger picture
Because designing data pipelines in a complex ecosystem requires proficiency in multiple areas, I highly recommend the book Designing Data-Intensive Applications by Martin Kleppmann. It covers the fundamentals.

A cheat sheet with all the references quoted in Martin's book is available on his GitHub repo.

This cheat sheet is a great place to start if you already know what kind of topic you'd like to study in depth and you want to find quality materials.

Alternatives to Apache NiFi


Other dataflow solutions exist.

Open source:

StreamSets is similar to NiFi; a good comparison is available on this blog

Most of the existing cloud providers offer dataflow solutions. Those solutions integrate easily with the other products you use from that cloud provider. At the same time, they solidly tie you to a particular vendor.

Azure Data Factory, a Microsoft solution

IBM has its InfoSphere DataStage

Amazon offers a tool named Data Pipeline

Google offers its Dataflow

Alibaba Cloud provides a service, DataWorks, with similar features

NiFi related resources


The official NiFi documentation, and especially the NiFi In-Depth section, are gold mines.

Subscribing to the NiFi users mailing list is also a great way to stay informed — for example, this conversation explains back-pressure.

Hortonworks, a big data solutions provider, has a community website full of engaging resources and how-tos for Apache NiFi.
— This article goes in depth about connectors, heap usage, and back pressure.
— This one shares dimensioning best practices when deploying a NiFi cluster.

The NiFi blog distills a lot of insights into NiFi usage patterns as well as tips on how to build pipelines.

Claim Check pattern explained

The theory behind Apache NiFi is not new; SEDA, referenced in the NiFi docs, is extremely relevant.
— Matt Welsh. Berkeley. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services [online]. Retrieved: 21 Apr 2019, from http://www.mdw.la/papers/seda-sosp01.pdf
