Install Apache Airflow on GKE

Apache Airflow is a well-known workflow management platform. A few years ago Google announced Cloud Composer, a fully managed Airflow installation running on Kubernetes, which is a great way to start automating your ETL, ML, and DevOps chores with Python on Google Cloud Platform. Still, it can be desirable to install Airflow on GKE from scratch for a variety of reasons: you may want more flexibility, or you may already run Airflow on premises and simply need a way to quickly spin up a similar cluster in the cloud to process some extra-bulky data. In this post I describe how to install and configure the Apache Airflow components on GKE: the Celery executor, Redis, and Cloud SQL. To deploy to Kubernetes I use a Helm chart. As Airflow is quite a complex platform, the installation involves a fair amount of infrastructure work: a cluster has to be created, Cloud SQL provisioned, an SSD disk for Redis created, the necessary Google APIs enabled, and so on. To automate these chores I use Terraform.

Kubernetes deployment

To run Airflow, a few pods need to be scheduled, most notably the scheduler, webserver, and worker pods. For this purpose I use StatefulSets, combined with a headless Service for the workers (notice that clusterIP is None):

apiVersion: v1
kind: Service
metadata:
  name: worker
spec:
  clusterIP: None
  selector:
    app: airflow
    tier: worker
  ports:
    - protocol: TCP
      port: 8793

A headless Service can be used when a cluster IP and load balancing between pods are not needed (only the list of pod IPs is of interest). In our case the headless Service just gives the worker pods a stable network identity (worker-0, worker-1, etc.) and simplifies access to their local logs. No Persistent Volumes are created here, but we could add them if we wanted to save or archive logs (alternatively, Airflow can be configured to ship logs to Google Cloud Storage, Elasticsearch, etc.) or to cache some static data.
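
For reference, a minimal worker StatefulSet attached to this headless Service could look roughly like the sketch below; the replica count, image name, and container arguments are illustrative and assume the image built later in this post:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: worker
spec:
  serviceName: worker            # the headless Service defined above
  replicas: 2
  selector:
    matchLabels:
      app: airflow
      tier: worker
  template:
    metadata:
      labels:
        app: airflow
        tier: worker
    spec:
      serviceAccountName: worker
      containers:
        - name: worker
          image: gcr.io/google_project_id/airflow-gke:latest
          args: ["worker"]             # starts a Celery worker
          ports:
            - containerPort: 8793      # worker log-serving port

The serviceName field points at the headless Service above, which gives each pod a stable DNS entry such as worker-0.worker.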

Using git sync to deliver Airflow DAGs is a common practice which, to my taste, offers too much flexibility and opens the door to hard-to-debug inconsistencies. For this reason the DAGs, as well as any additional Python packages, are baked into the Docker image, which is built on top of Puckel's docker-airflow image. To regain some flexibility, the image is built in two steps. First, an image containing the additional packages and library code is built from the src directory (this image should not change too frequently). Second, an image containing the DAGs is built from the root directory (this image may change more often, as DAGs can be added, tweaked, or even generated on the fly). A sketch of the two Dockerfiles follows the build commands below.

# add libraries and third party packages
cd src
docker build -t airflow-python .
# add dags
cd ..
docker build -t airflow-gke:latest .
# push to Google Container Registry
docker tag airflow-gke gcr.io/google_project_id/airflow-gke:latest
docker push gcr.io/google_project_id/airflow-gke
# then, after the cluster is created and configured, deploy the complete application using Helm:
helm upgrade --install airflow . --set projectId=google_project_id
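
For illustration, the two Dockerfiles could be as simple as the sketch below; the base image tag, paths, and file names are assumptions rather than the exact contents of the repository:

# src/Dockerfile: libraries and third-party packages (illustrative)
FROM puckel/docker-airflow:1.10.9
COPY requirements.txt /requirements.txt
RUN pip install --user -r /requirements.txt
COPY lib/ /usr/local/airflow/lib/

# Dockerfile in the repository root: adds the DAGs on top (illustrative)
FROM airflow-python:latest
COPY dags/ /usr/local/airflow/dags/

With this split, day-to-day DAG changes only rebuild the small top layer, while the heavier base image with dependencies stays untouched.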

Cloud SQL

For persistence, Airflow needs a database. On Google Cloud a popular choice is Cloud SQL, Google's managed, highly available database service. To communicate with a Cloud SQL database from the cluster, a proxy pod, exposed as a postgres Service, is used:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      tier: postgres
  template:
    metadata:
      labels:
        app: airflow
        tier: postgres
    spec:
      serviceAccountName: worker
      restartPolicy: Always
      containers:
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.11
          command: ["/cloud_sql_proxy", "-instances={{ template "dbinstances" . }}=tcp:0.0.0.0:5432",]
          ports:
            - name: postgres
              containerPort: 5432
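
The Deployment is then exposed inside the cluster under the DNS name postgres; a minimal ClusterIP Service matching the labels above might look like this:

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: airflow
    tier: postgres
  ports:
    - protocol: TCP
      port: 5432
      targetPort: 5432

Airflow pods can then reach the database at postgres:5432, e.g. as the host in the sql_alchemy_conn setting.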

The Cloud SQL proxy needs only the instance connection string (consisting of the project id, region, and instance name) and the port to bind the proxy to. Authentication happens behind the scenes via Workload Identity (through the worker Kubernetes service account), which is configured by the Terraform scripts. To check that the configuration is correct, you can run a Google-provided image with the gcloud SDK and other helpful utilities:

kubectl run -it \
  --image google/cloud-sdk:slim \
  --serviceaccount worker \
  --namespace default \
  workload-identity-test
# once inside the pod, check the active identity
gcloud auth list

If the service accounts are configured correctly, the Google service account email address is listed as the active (and only) identity. For troubleshooting, you can check the pod logs (kubectl logs -l app=airflow) for authentication and other errors.

Create and configure the cluster using Terraform

Before our application can be deployed to Kubernetes, we first need to create the cluster and the other required resources. The process involves a lot of tedious manual operations and is therefore a good candidate for automation with Terraform.

To use Terraform, first install it, then create a new project and a Google service account (with the Owner role) in the Google Cloud Console and download its key file. Next, install and authenticate the gcloud SDK:

# authenticate using service-account key file
gcloud auth activate-service-account --key-file=/auth/cloud-test-robot.json
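
The Terraform Google provider can pick up the same key through the standard GOOGLE_APPLICATION_CREDENTIALS environment variable; assuming the key path above and a placeholder project id, the setup could look like this:

# point Terraform and client libraries at the same key file (path is illustrative)
export GOOGLE_APPLICATION_CREDENTIALS=/auth/cloud-test-robot.json
# set the default project used by gcloud
gcloud config set project google_project_id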

The GKE cluster itself is created in two steps. First, we create the infrastructure (cluster, Cloud SQL instance, SSD disk) using the infrastructure.tf file in the root directory. Just run the usual terraform init and apply commands:

terraform init
terraform apply
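
The actual infrastructure.tf lives in the repository; as a rough idea of what it contains, a stripped-down sketch might look like the following (resource names, region, database version, and machine sizes are assumptions):

# illustrative excerpt; not the exact contents of infrastructure.tf
provider "google" {
  project = var.project_id
  region  = "us-central1"
}

resource "google_container_cluster" "airflow" {
  name               = "airflow"
  location           = "us-central1-a"
  initial_node_count = 3

  # enables Workload Identity on the cluster (argument name in recent provider versions)
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
}

resource "google_sql_database_instance" "airflow" {
  name             = "airflow-db"
  database_version = "POSTGRES_11"
  region           = "us-central1"

  settings {
    tier = "db-f1-micro"
  }
}

resource "google_compute_disk" "redis" {
  name = "redis-disk"
  type = "pd-ssd"
  zone = "us-central1-a"
  size = 10
}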

When the infrastructure is ready, we can create and configure the worker service accounts (one in the Google project and one in the Kubernetes cluster) and enable the required APIs. These accounts simplify authentication (in line with current Google recommendations) by using Workload Identity: access to Google resources is granted through the Kubernetes service account worker on the cluster side. Such a setup eliminates the need to manage Kubernetes secrets, greatly simplifying the overall configuration. To automate this completely, change to the k8s subdirectory and apply its Terraform configuration:

cd k8s
terraform init
terraform apply
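
At its core, the Workload Identity wiring pairs the two worker accounts; in Terraform this boils down to something like the sketch below (account names and the default namespace are assumptions):

# Google-side service account for the workers
resource "google_service_account" "worker" {
  account_id   = "worker"
  display_name = "Airflow worker"
}

# allow the Kubernetes service account default/worker to impersonate it via Workload Identity
resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.worker.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[default/worker]"
}

# Kubernetes-side service account, annotated with its Google counterpart
# (assumes the kubernetes provider is configured against the new cluster)
resource "kubernetes_service_account" "worker" {
  metadata {
    name      = "worker"
    namespace = "default"
    annotations = {
      "iam.gke.io/gcp-service-account" = google_service_account.worker.email
    }
  }
}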

That is it for the infrastructure! Later, if you wish to grant new permissions, you can do so through the Google service account worker. For example, to work with Google Cloud Storage, we added a roles/storage.admin role binding and enabled the storage API:

resource "google_project_iam_binding" "storage_access" {
  project = var.project_id
  role    = "roles/storage.admin"
  members = [
    "serviceAccount:${var.service_account_name}@${var.project_id}.iam.gserviceaccount.com",
  ]
}

resource "google_project_service" "storage_json" {
  project = var.project_id
  service = "storage-api.googleapis.com"
}

To check the configured role bindings and the enabled APIs, the following CLI commands can be used:

# to check policies
gcloud projects get-iam-policy project_id
# to check enabled APIs
gcloud services list
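
If you prefer to enable an API by hand rather than through Terraform, gcloud can do it directly; for example, the equivalent of the google_project_service resource above would be:

# enable the Cloud Storage JSON API manually
gcloud services enable storage-api.googleapis.com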

Next

The source code and short installation instructions are available on GitHub. After installation, see my next blog post to get started with Airflow and Google Cloud services.