5. 5
Apache Airflow : What is it?
Airflow is a platform to programmatically author,
schedule and monitor workflows (a.k.a. DAGs)
It ships with
• DAG Scheduler
• Web application (UI)
• Powerful CLI
8. 8
Airflow: Author DAGs in Python! No need to bundle many XML files!
Airflow - Authoring DAGs
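To make "author DAGs in Python" concrete, here is a minimal, library-free sketch. The `Task` class is a hypothetical stand-in, not Airflow's own operator class; it only mimics the `>>` dependency syntax that Airflow supports, so the example runs without Airflow installed. The task names are illustrative assumptions.

```python
class Task:
    """Hypothetical stand-in for a workflow task (not Airflow's operator class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()

    def __rshift__(self, other):
        # `a >> b` declares that b runs after a; returning b allows chaining.
        other.upstream.add(self.task_id)
        return other

    def __repr__(self):
        return f"Task({self.task_id!r})"


# Illustrative task names; the whole workflow is ordinary Python, no XML.
extract = Task("extract")
transform = Task("transform")
load = Task("load")
report = Task("report")

extract >> transform >> load
transform >> report

# Map each task to the tasks it depends on.
dependencies = {
    t.task_id: sorted(t.upstream) for t in (extract, transform, load, report)
}
print(dependencies)
```

Because the workflow is plain code, it can be generated in loops, reviewed in pull requests, and versioned like any other source file.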
9. 9
Airflow: The Tree View offers a view of DAG Runs over time!
Airflow - Authoring DAGs
10. Airflow - Performance Insights
10
Airflow: Gantt charts reveal the slowest tasks for a run!
11. 11
Airflow: …And we can easily see performance trends over time
Airflow - Performance Insights
12. 12
Apache Airflow : What is it?
When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender Systems, etc.
• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment
13. 13
Apache Airflow : What is it?
What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
  • where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
  • E.g. Alerting if time or correctness SLAs are not met
• Scale
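The first few requirements above (dependency ordering, failure handling, alerting) can be sketched in a few lines of standard-library Python. This is an illustration of the idea under stated assumptions, not how Airflow's scheduler is implemented: `run_dag`, the task names, and the `print`-based alert are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, callables, max_retries=2):
    """Run tasks in dependency order, retry failures, skip tasks below a failure."""
    order = list(TopologicalSorter(dependencies).static_order())
    state = {}
    for task_id in order:
        # A task only runs if every upstream task succeeded.
        if any(state.get(up) != "success" for up in dependencies.get(task_id, ())):
            state[task_id] = "upstream_failed"
            continue
        for attempt in range(1 + max_retries):
            try:
                callables[task_id]()
                state[task_id] = "success"
                break
            except Exception as exc:
                state[task_id] = "failed"
                # Stand-in for a real alert (email, pager, chat message).
                print(f"ALERT: {task_id} attempt {attempt + 1} failed: {exc}")
    return order, state


attempts = {"transform": 0}


def flaky_transform():
    # Fails once, then succeeds, to exercise the retry path.
    attempts["transform"] += 1
    if attempts["transform"] < 2:
        raise RuntimeError("transient error")


def always_fails():
    raise RuntimeError("bad input")


# Each key maps a task to the tasks it depends on.
dependencies = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "audit": ["extract"],
    "publish": ["audit"],
}
callables = {
    "extract": lambda: None,
    "transform": flaky_transform,
    "load": lambda: None,
    "audit": always_fails,
    "publish": lambda: None,
}

order, state = run_dag(dependencies, callables)
print(state)
```

Here `transform` recovers on its second attempt, while `audit` exhausts its retries and `publish` is never started because its upstream task failed.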
14. 14
Apache Airflow : What is it?
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
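Resource pooling means capping how many tasks may use a shared resource at once, regardless of how many workers are available. A minimal sketch of that idea, assuming a hypothetical `Pool` class with a fixed slot count (this is not Airflow's pool implementation) and a `query_db` stand-in for work against a shared database:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Pool:
    """Hypothetical named pool with a fixed number of slots."""

    def __init__(self, slots):
        self._slots = threading.BoundedSemaphore(slots)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0

    def run(self, fn):
        with self._slots:  # blocks until a slot is free
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return fn()
            finally:
                with self._lock:
                    self.running -= 1


def query_db(n):
    time.sleep(0.01)  # stand-in for work that hits a shared database
    return n * n


db_pool = Pool(slots=2)

# Eight workers are available, but the pool admits only two tasks at a time.
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(db_pool.run, lambda n=n: query_db(n)) for n in range(6)]
    results = sorted(f.result() for f in futures)

print(results, "peak concurrency:", db_pool.peak)
```

The worker count and the slot count are deliberately different: the pool, not the executor, protects the shared resource.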
16. 16
Apache Airflow : Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow entered the Apache Incubator
Today
• 166+ Contributors
• 300+ Users
• 40+ companies officially using it!
• 9 Committers/Maintainers (we're growing here)
29. Use-Case : Message Scoring
29
S3 uploads every 15 minutes
[Diagram: enterprise A, enterprise B, enterprise C; S3]
30. Use-Case : Message Scoring
30
Airflow kicks off a Spark message scoring job every hour
[Diagram: enterprise A, enterprise B, enterprise C; S3]
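With uploads arriving every 15 minutes and scoring running hourly, each run has four upload windows to read. A small sketch of that arithmetic follows; the S3 key layout, the enterprise identifier, and the function names are assumptions made for illustration, since the slides do not show how the inputs are named.

```python
from datetime import datetime, timedelta


def upload_windows(run_hour, upload_interval=timedelta(minutes=15)):
    """Return the start time of each upload window in the hour beginning at run_hour."""
    end = run_hour + timedelta(hours=1)
    windows = []
    current = run_hour
    while current < end:
        windows.append(current)
        current += upload_interval
    return windows


def input_prefixes(enterprise, run_hour):
    # Hypothetical key layout: <enterprise>/<YYYY>/<MM>/<DD>/<HHMM>/
    return [f"{enterprise}/{w:%Y/%m/%d/%H%M}/" for w in upload_windows(run_hour)]


prefixes = input_prefixes("enterprise-a", datetime(2016, 6, 1, 9, 0))
for prefix in prefixes:
    print(prefix)
```

Deriving the inputs from the run's hour, rather than from "now", keeps reruns of a past hour reading the same data.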
31. Use-Case : Message Scoring
31
Spark job writes scored messages and stats to another S3 bucket
[Diagram: enterprise A, enterprise B, enterprise C; S3; S3]
32. Use-Case : Message Scoring
32
This triggers SNS/SQS message events
[Diagram: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS]
33. Use-Case : Message Scoring
33
An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
[Diagram: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG)]
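One way to picture the scale-up decision is as a function from queue depth to a desired importer count. The sketch below is purely illustrative: the slides do not describe the actual scaling policy, and `desired_importers`, the per-importer throughput, and the group's maximum size are all assumed values.

```python
import math


def desired_importers(visible_messages, messages_per_importer=100, max_size=10):
    """Hypothetical policy: one importer per batch of queued messages, capped at max_size."""
    if visible_messages <= 0:
        return 0  # nothing queued, so the group can scale to zero
    return min(max_size, math.ceil(visible_messages / messages_per_importer))


for depth in (0, 1, 250, 5000):
    print(depth, "->", desired_importers(depth))
```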
34. Use-Case : Message Scoring
34
The importers rapidly ingest scored messages and aggregate statistics into the DB
[Diagram: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG); DB]
35. Use-Case : Message Scoring
35
Users receive alerts of untrusted emails & can review them in the web app
[Diagram: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG); DB]
51. 51
Correctness : Email Reporting
For each org, we check for duplicate or missing data as a count & percentage
[Figure: per-org report; label: orgs]
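The per-org check can be sketched as a comparison between the message IDs expected for an org and the IDs actually found. This is a minimal illustration, assuming messages carry unique IDs and that percentages are taken against the expected count; the function name and sample IDs are hypothetical.

```python
from collections import Counter


def correctness_report(expected_ids, actual_ids):
    """Count duplicate and missing message IDs for one org, with percentages of expected."""
    expected = set(expected_ids)
    counts = Counter(actual_ids)
    # Every copy beyond the first counts as one duplicate.
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    missing = len(expected - set(counts))
    total = len(expected)

    def pct(n):
        return round(100.0 * n / total, 2) if total else 0.0

    return {
        "duplicates": duplicates,
        "duplicate_pct": pct(duplicates),
        "missing": missing,
        "missing_pct": pct(missing),
    }


report = correctness_report(
    expected_ids=["m1", "m2", "m3", "m4"],
    actual_ids=["m1", "m2", "m2", "m4"],
)
print(report)
```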
52. 52
Correctness : Email Reporting
These are the 3 stages of the pipeline. We can detect where a discrepancy is coming from - often related to a code push!
[Figure: per-org report across the 3 stages; label: orgs]
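Locating a discrepancy amounts to comparing record counts between adjacent stages and reporting the first pair that disagrees. A sketch of that comparison, using placeholder stage names (`stage_1` to `stage_3`) and made-up counts, since the slides do not name the stages:

```python
def first_discrepancy(stage_counts):
    """Return (upstream, downstream, delta) for the first adjacent stages whose counts differ."""
    stages = list(stage_counts.items())  # dicts preserve insertion order
    for (up_name, up_count), (down_name, down_count) in zip(stages, stages[1:]):
        if up_count != down_count:
            return up_name, down_name, down_count - up_count
    return None


# Made-up counts per org; a negative delta means lost records, positive means duplicates.
counts_by_org = {
    "org_a": {"stage_1": 1000, "stage_2": 1000, "stage_3": 1000},
    "org_b": {"stage_1": 500, "stage_2": 500, "stage_3": 480},
    "org_c": {"stage_1": 200, "stage_2": 210, "stage_3": 210},
}

findings = {org: first_discrepancy(counts) for org, counts in counts_by_org.items()}
for org, finding in findings.items():
    print(org, finding)
```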
66. 66
Apache Airflow Next Steps
Improvement Areas
• Security
• API (though we do have a CLI)
• Deployment / Versioning
• Execution Scale Out
• On-demand Execution
67. Acknowledgments
67
None of this work would be possible without the contributions of the strong team below
• Vidur Apparao
• Stephen Cattaneo
• Jon Chase
• Andrew Flury
• William Forrester
• Chris Haag
• Mike Jones
• Scot Kennedy
• Thede Loder
• Paul Lorence
• Kevin Mandich
• Gabriel Ortiz
• Jacob Rideout
• Josh Yang
• Julian Mehnle