Schedule Dataflow Templates with Airflow

  • On the VM that runs Airflow, we set dataflow-python-runner as the default service account. This account should have minimal access to our Google Cloud project. Ideally we would grant it only the Dataflow Service Agent role, but because Airflow runs some extra commands, we also need to grant the Dataflow Developer role, or create a custom role with only the specific permissions we need.
  • In Airflow, the serviceAccountEmail field tells Dataflow to run the job as dataflow-python-test. This service account needs the permissions to run a Dataflow job end to end, plus any other access the pipeline requires, e.g. to buckets.

So dataflow-python-runner is the service account that triggers the job, and dataflow-python-test is the service account that runs the job.

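# Import path for Airflow 1.10.x (assumed, since the airflow test CLI is used)
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

# Triggered by the VM's default account (dataflow-python-runner);
# the job itself runs as dataflow-python-test via serviceAccountEmail.
job = DataflowTemplateOperator(
    task_id='alex_test_job',
    template="gs://FULL_PATH_TO_TEMPLATE",
    job_name='alex_job_test_from_airflow',
    dataflow_default_options={
        "project": "MY_PROJECT_ID",
        "stagingLocation": "gs://FULL_PATH_TO_STAGING_FOLDER",
        "tempLocation": "gs://FULL_PATH_TO_TEMP_FOLDER",
        "serviceAccountEmail": "dataflow-python-test@MY_PROJECT_ID.iam.gserviceaccount.com",
    },
)

To run the task once from the command line, without waiting for the scheduler:

airflow test dataflow_jobs alex_test_job 2020-07-04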
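
Since the two accounts are easy to mix up, it can help to build the options in one place so the worker account is always explicit. Here is a minimal sketch in plain Python; service_account_email and build_dataflow_options are hypothetical helper names of our own, not part of Airflow or Dataflow, and the bucket paths are the same placeholders used above.

```python
def service_account_email(name, project_id):
    """Return the full email of a service account in the given project."""
    return "{}@{}.iam.gserviceaccount.com".format(name, project_id)


def build_dataflow_options(project_id, staging, temp,
                           worker_account="dataflow-python-test"):
    """Build the options dict for the operator (hypothetical helper).

    worker_account is the account the job RUNS as. The account that
    TRIGGERS the job (dataflow-python-runner) is the VM's default
    service account, so it does not appear in these options.
    """
    return {
        "project": project_id,
        "stagingLocation": staging,
        "tempLocation": temp,
        "serviceAccountEmail": service_account_email(worker_account, project_id),
    }


# Placeholder values, as in the snippet above.
options = build_dataflow_options(
    "MY_PROJECT_ID",
    "gs://FULL_PATH_TO_STAGING_FOLDER",
    "gs://FULL_PATH_TO_TEMP_FOLDER",
)
```

The resulting dictionary could then be passed as dataflow_default_options, which keeps the worker account in a single, easy-to-review spot.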
