
Apache Airflow

Overview
Apache Airflow is a platform for programmatically authoring,
scheduling, and monitoring workflows. It is completely open source and is
especially useful for architecting complex data pipelines. It is written in
Python, so you are able to interface with any third-party Python API or
database to extract, transform, or load your data into its final destination. It
was created to solve the issues that come with long-running cron tasks that
execute hefty scripts.
With Airflow, workflows are architected and expressed as DAGs, with
each step of the DAG defined as a specific task. It is designed with the
belief that all ETL (Extract, Transform, Load) data processing is best
expressed as code, and as such it is a code-first platform that allows you to
iterate on your workflows quickly and efficiently. As a result of this code-first
design philosophy, Airflow allows for a degree of customizability and
extensibility that other ETL tools do not support.

Core Concepts
DAG
DAG stands for "Directed Acyclic Graph". Each DAG represents a
collection of all the tasks you want to run and is organized to show
relationships between tasks directly in the Airflow UI. They are defined
this way for the following reasons:
1. Directed: If multiple tasks exist, each must have at least one
defined upstream or downstream task.
2. Acyclic: Tasks are not allowed to depend on themselves, either directly or
through a chain of other tasks. This avoids creating infinite loops.
3. Graph: All tasks are laid out in a clear structure, with processes
occurring at clear points and with set relationships to other tasks.
Tasks
Tasks represent each node of a defined DAG. They are visual
representations of the work being done at each step of the workflow, with
the actual work that they represent being defined by Operators.

Operators
Operators in Airflow determine the actual work that gets done. They define
a single task, or one node of a DAG. DAGs make sure that operators get
scheduled and run in a certain order, while operators define the work that
must be done at each step of the process.
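To tie these concepts together, here is a minimal DAG sketch. The dag_id, task names, commands, and schedule are made up for illustration, and exact import paths vary between Airflow versions (the 1.x path is shown):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# The DAG object holds the tasks and defines when the workflow runs.
dag = DAG(
    dag_id="example_dag",
    start_date=datetime(2016, 11, 1),
    schedule_interval="@daily",
)

# Each operator instance is one task, i.e. one node of the DAG.
extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# A directed edge: "extract" must finish before "load" starts
# (equivalent to extract.set_downstream(load)).
extract >> load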
User Interface
The Airflow UI makes it easy to monitor and troubleshoot data
pipelines.

External Airflow API


Apache Airflow is by default set to run on the local server, so
we can only work on local DAGs.
 In order to work with DAGs located on the Main (CASHE)
Server, the External Airflow API has to be used.
 There is a requirement in the CASHE software to transfer
records from one table to another table in a database
every one hour,
i.e., records/data from Table-X are moved to Table-Y in a
database every ONE hour, every day.
 A DAG is already written to transfer records from one table
to another table, but the DAG is stored on the Main Server, i.e.,
the CASHE Server. A rough sketch of such a DAG is given after this list.
 The CASHE Admin has to view the list of DAGs, trigger DAGs
and pause/unpause DAGs.
 An External Airflow API plugin is used to list, trigger and
pause/unpause DAGs, so that the CASHE Admin can perform these
operations.
The External Airflow API plugin used in this software is found
here: https://github.com/teamclairvoyant/airflow-rest-api-plugin
and the plugin is placed on the Main (CASHE) Server.
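As a rough sketch of what such an hourly transfer DAG could look like (the dag_id, the move_records function body, and the connection details are hypothetical; the actual DAG stored on the CASHE Server is not reproduced here):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def move_records():
    # Placeholder for the real transfer logic, e.g. an INSERT ... SELECT
    # copying rows from Table-X to Table-Y, followed by a cleanup of Table-X.
    pass

# schedule_interval="@hourly" makes Airflow run this DAG every hour, every day.
dag = DAG(
    dag_id="transfer_table_x_to_table_y",
    start_date=datetime(2016, 11, 1),
    schedule_interval="@hourly",
)

transfer = PythonOperator(
    task_id="move_records",
    python_callable=move_records,
    dag=dag,
)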

API Endpoints
Consider the Main (CASHE) Server –
HOST : 191.132.135.205
PORT : 8989
list_dags
Lists all the DAGs.
http://191.132.135.205:8989/admin/rest_api/api?api=list_dags

This API endpoint gives the list of DAGs in the form of JSON.

trigger_dag
Triggers a DAG to run.
http://191.132.135.205:8989/admin/rest_api/api?api=trigger_dag&dag_id=test_id

This API endpoint triggers a specific DAG based on its dag_id.

pause
Pauses a DAG.
http://191.132.135.205:8989/admin/rest_api/api?api=pause&dag_id=test_id

This API endpoint pauses a specific DAG based on its dag_id.

unpause
Resumes a paused DAG.
http://191.132.135.205:8989/admin/rest_api/api?api=unpause&dag_id=test_id

This API endpoint resumes a paused DAG based on its dag_id.
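These endpoints are plain HTTP GET calls, so they can be exercised from any language. A minimal Python sketch using the requests library (host, port, and dag_id are the example values above; the helper function name is made up for illustration):

import requests

BASE_URL = "http://191.132.135.205:8989/admin/rest_api/api"

def call_airflow_api(api, **params):
    # Every plugin endpoint is reached through the same URL; the endpoint
    # name goes in the "api" query parameter, and extra arguments such as
    # dag_id are passed as additional query parameters.
    params["api"] = api
    response = requests.get(BASE_URL, params=params)
    return response.json()

# List all DAGs, then trigger, pause and unpause a specific one.
print(call_airflow_api("list_dags"))
print(call_airflow_api("trigger_dag", dag_id="test_id")["status"])
print(call_airflow_api("pause", dag_id="test_id")["status"])
print(call_airflow_api("unpause", dag_id="test_id")["status"])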


Sample JSON Output
1. Sample JSON Output for list_dags API
{
"airflow_cmd": "list dags",
"arguments": {},
"call_time": "Tue, 29 Nov 2016 14:22:26 GMT",
"http_response_code": 200,
"output": "example_bash_operator","tutorial","example_xcom",
"response_time": "Tue, 29 Nov 2016 14:27:59 GMT",
"status": "OK"
}

This is the sample JSON output when the list_dags API endpoint is called.

2. Sample JSON Output for trigger_dag API
{
"airflow_cmd": "trigger test_id",
"arguments": {},
"call_time": "Tue, 29 Nov 2016 14:22:26 GMT",
"http_response_code": 200,
"output": "Yes",
"response_time": "Tue, 29 Nov 2016 14:27:59 GMT",
"status": "OK"
}

This is the sample JSON output when the trigger_dag API endpoint is called.

3. Sample JSON Output for pause API
{
"airflow_cmd": "pause test_id",
"arguments": {},
"call_time": "Tue, 29 Nov 2016 14:22:26 GMT",
"http_response_code": 200,
"output": "Yes",
"response_time": "Tue, 29 Nov 2016 14:27:59 GMT",
"status": "OK"
}

This is the sample JSON output when the pause API endpoint is called.

4. Sample JSON Output for unpause API
{
"airflow_cmd": "unpause test_id",
"arguments": {},
"call_time": "Tue, 29 Nov 2016 14:22:26 GMT",
"http_response_code": 200,
"output": "Yes",
"response_time": "Tue, 29 Nov 2016 14:27:59 GMT",
"status": "OK"
}

This is the sample JSON output when the unpause API endpoint is called.

GOLANG API
APIs were written in Go using the External Apache Airflow API plugin.
When an External Airflow API endpoint is called, the GoLang API
parses the JSON output and sends the response to the
UI/front-end team.

For example:
When the list_dags API is called, the JSON output for that API is
parsed and the value corresponding to the key "output" is sent
to the UI team.
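The Go source is not included in this document; as a minimal, language-neutral sketch of that parsing step (written in Python here, like the other examples, with a hypothetical function name):

import requests

def list_dags_for_ui():
    # Call the External Airflow API plugin's list_dags endpoint ...
    url = "http://191.132.135.205:8989/admin/rest_api/api?api=list_dags"
    payload = requests.get(url).json()
    # ... and forward only the value of the "output" key to the UI/front end.
    return payload["output"]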

Four APIs were written in GOLANG using the External Airflow API plugin:

1. List DAGs
2. Trigger DAGs
3. Pause DAGs
4. Unpause DAGs
