I'm still in the process of deploying Airflow and I've already felt the need to merge operators together. The most common use case would be coupling an operator and the corresponding sensor. For instance, one might want to chain together the EmrStepOperator and EmrStepSensor.
I'm creating my DAGs programmatically, and the biggest one contains 150+ identical branches, each performing the same series of operations on different bits of data (tables). Therefore, clubbing together the tasks that make up a single logical step in my DAG would be of great help.
Here are 2 examples from my project to motivate my argument.
1. Deleting data from S3 path and then writing new data
This step comprises 2 operators:
- DeleteS3PathOperator: extends BaseOperator and uses S3Hook
- HadoopDistcpOperator: extends SSHOperator
2. Conditionally performing MSCK REPAIR on a Hive table
This step contains 4 operators:
- BranchPythonOperator: checks whether the Hive table is partitioned
- MsckRepairOperator: extends HiveOperator and performs MSCK REPAIR on the (partitioned) table
- Dummy(Branch)Operator: makes up the alternate branching path to MsckRepairOperator (for non-partitioned tables)
- Dummy(Join)Operator: makes up the join step for both branches
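As a rough sketch of what collapsing example 2 into a single step could look like, the branch, the repair and the two dummy tasks can be reduced to one plain function that either issues the statement or does nothing. The function name `repair_if_partitioned` and the two injected callables are hypothetical stand-ins, not Airflow or Hive APIs; in a real DAG such a function could be handed to a PythonOperator, with the callables backed by whatever Hive access the project already has.

```python
from typing import Callable, List


def repair_if_partitioned(
    table: str,
    is_partitioned: Callable[[str], bool],
    run_hql: Callable[[str], None],
) -> bool:
    """Issue MSCK REPAIR TABLE only when the table is partitioned.

    Returns True if the statement was issued, False if the step was a no-op.
    Both callables are injected so the Hive access stays outside this function.
    """
    if not is_partitioned(table):
        # Non-partitioned table: this replaces the dummy branch task.
        return False
    run_hql(f"MSCK REPAIR TABLE {table}")
    return True


# Illustrative stand-ins for real Hive metadata and query execution.
partitioned_tables = {"db.events"}
issued: List[str] = []

ran_events = repair_if_partitioned(
    "db.events", partitioned_tables.__contains__, issued.append
)
ran_users = repair_if_partitioned(
    "db.users", partitioned_tables.__contains__, issued.append
)
```

The trade-off is the one described in this question: the DAG gets one task instead of four, but the task log no longer shows which branch was taken unless the function logs it explicitly.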
Using operators in isolation certainly offers smaller modules and more fine-grained logging / debugging, but in large DAGs, reducing the clutter might be desirable. From my current understanding, there are 2 ways to chain operators together:
1. Hooks
Write the actual processing logic in hooks and then use as many hooks as you want within a single operator (certainly the better way in my opinion).
2. SubDagOperator
A risky and controversial way of doing things; additionally, the naming convention for SubDagOperator makes me frown.
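To make the hook-based approach concrete, here is a minimal sketch using plain Python stand-ins rather than real Airflow classes. `FakeS3Hook`, `FakeDistcpHook` and `ReplaceS3DataOperator` are hypothetical names invented for illustration; in an actual deployment the operator would presumably extend BaseOperator and the hooks would wrap S3Hook and an SSH connection, as in example 1.

```python
from typing import List, Optional


class FakeS3Hook:
    """Hypothetical stand-in for a hook that owns the S3 deletion logic."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def delete_path(self, path: str) -> None:
        self.log.append(f"delete {path}")


class FakeDistcpHook:
    """Hypothetical stand-in for a hook that owns the distcp logic."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def copy(self, src: str, dst: str) -> None:
        self.log.append(f"distcp {src} -> {dst}")


class ReplaceS3DataOperator:
    """One logical step: clear the destination, then copy the new data in."""

    def __init__(self, src: str, dst: str,
                 s3_hook: FakeS3Hook, distcp_hook: FakeDistcpHook) -> None:
        self.src = src
        self.dst = dst
        self.s3_hook = s3_hook
        self.distcp_hook = distcp_hook

    def execute(self, context: Optional[dict] = None) -> None:
        # The operator only sequences the hooks; the real work lives in them.
        self.s3_hook.delete_path(self.dst)
        self.distcp_hook.copy(self.src, self.dst)


log: List[str] = []
step = ReplaceS3DataOperator(
    src="hdfs:///tmp/table",
    dst="s3://bucket/table",
    s3_hook=FakeS3Hook(log),
    distcp_hook=FakeDistcpHook(log),
)
step.execute()
```

Because the logic sits in the hooks, the same hooks can still back two separate operators wherever finer-grained tasks are preferred.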
My questions are:
- Should operators be composed at all, or is it better to have discrete steps?
- Are there any pitfalls or improvements in the above approaches?
- Are there any other ways to combine operators together?
- In the taxonomy of Airflow, is the primary purpose of hooks the same as above, or do they serve other purposes too?
UPDATE-1
3. Multiple Inheritance
While this is a Python feature rather than Airflow-specific, it's worthwhile to point out that multiple inheritance can come in handy for combining the functionality of operators. QuboleCheckOperator, for instance, is already written that way. In the past I tried this to fuse EmrCreateJobFlowOperator and EmrJobFlowSensor, but at the time I ran into issues with the @apply_defaults decorator and abandoned the idea.
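The general shape of the multiple-inheritance approach can be sketched with plain Python classes. Everything below is a hypothetical stand-in (`FakeBaseOperator`, `FakeCreateJobOperator`, `FakeJobSensor`, `CreateJobAndWait`); none of it is Airflow code, and it does not reproduce the @apply_defaults problem. It only illustrates the two things that seem to matter: every `__init__` forwards unknown keyword arguments via `super()`, and the combined class calls each parent's `execute` explicitly.

```python
class FakeBaseOperator:
    def __init__(self, task_id: str, **kwargs) -> None:
        self.task_id = task_id


class FakeCreateJobOperator(FakeBaseOperator):
    def __init__(self, job_name: str, **kwargs) -> None:
        super().__init__(**kwargs)  # pass the rest along the MRO
        self.job_name = job_name

    def execute(self, context: dict) -> str:
        # Pretend to submit a job and return its id.
        return f"job-{self.job_name}"


class FakeJobSensor(FakeBaseOperator):
    def __init__(self, max_pokes: int = 3, **kwargs) -> None:
        super().__init__(**kwargs)
        self.max_pokes = max_pokes
        self.job_id = None
        self.pokes = 0

    def poke(self, context: dict) -> bool:
        # Pretend the job completes on the second check.
        self.pokes += 1
        return self.pokes >= 2

    def execute(self, context: dict) -> None:
        for _ in range(self.max_pokes):
            if self.poke(context):
                return
        raise TimeoutError(f"{self.job_id} did not finish")


class CreateJobAndWait(FakeCreateJobOperator, FakeJobSensor):
    """Fuses submission and waiting into one task."""

    def execute(self, context: dict) -> str:
        # Call each parent explicitly; super() alone would run only the first.
        self.job_id = FakeCreateJobOperator.execute(self, context)
        FakeJobSensor.execute(self, context)
        return self.job_id


task = CreateJobAndWait(task_id="create_and_wait", job_name="etl", max_pokes=3)
result = task.execute({})
```

Whether this carries over cleanly to real operators depends on how their constructors handle arguments, which is presumably where the decorator issue came from.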