[4] On CloudSQL proxy and Kubernetes sidecars...and Airbyte too

Let’s face it, we all run multiple containers in our pods. For various reasons. Some are completely justified, like having a service do some preliminary work (init containers) before your actual service is up and running, or perhaps a cleanup task after your service exits. Perhaps even adding neat features like encryption to services that don’t have it, using things such as Linkerd. No matter what the scenario is, these side jobs are called sidecars, and so far Kubernetes has treated them as second-class citizens. That never stopped us from using them.

One of the issues we face with these sidecars is how to handle shutting them down. Let’s say you have a scenario in which you use GCP CloudSQL. The secure and recommended way for your Kubernetes service to connect to said SQL server is to use CloudSQL proxy. This runs pretty smoothly with a configuration similar to this one:

extraContainers:
  - name: cloud-sql-proxy
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.6.1
    args:
      - "--address=127.0.0.1"
      - "--port=5432"
      - "--auto-iam-authn"
      - "--private-ip"
      - "--structured-logs"
      - "my-project:europe-west2:my-db"

What happens here is that CloudSQL proxy establishes a secure connection between the pod and the server, uses a private IP address, because you have not exposed your SQL server to the world, right(?), and allows you to use IAM as an authentication method instead of the stone age username/password approach. This works like a charm when there is no need for your pod to complete its job - it runs the whole time because you care about uptime.
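To give you an idea of what this looks like from the application’s side - the variable names here are just an illustration, your app may read its DB config differently - the service simply talks to Postgres on localhost and, thanks to --auto-iam-authn, skips the password entirely:

env:
  - name: DATABASE_HOST          # the proxy listens inside the pod
    value: "127.0.0.1"
  - name: DATABASE_PORT
    value: "5432"
  - name: DATABASE_USER          # illustrative IAM database user, no password needed with --auto-iam-authn
    value: "my-app-sa@my-project.iam"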

However, things become a little complicated when you want to run a Job, which is a one time thing that does what you want it to do and then gracefully exits, leaving the pod in the Completed state. This is especially problematic in the case of services that use a Helm chart hook which prevents the rest of the deployment from continuing until this one time Job is marked as completed. If you were to manually kill the pod once it’s done doing what it needed to do, just so that you get rid of the CloudSQL proxy that is still running, you are in for a ride. The Job will end up being marked as failed, thus killing the continuation of the deployment.

So, until we all upgrade our Kubernetes clusters to 1.28, and that little native sidecar feature gets polished, since it’s in alpha state right now, what are we to do to avoid broken deployments?
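For the curious, the 1.28 alpha (behind the SidecarContainers feature gate) lets you declare the proxy as an init container with restartPolicy: Always, and Kubernetes shuts it down on its own once the main containers finish. Roughly something like this:

initContainers:
  - name: cloud-sql-proxy
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.6.1
    restartPolicy: Always    # alpha in 1.28: turns this init container into a native sidecar
    args:
      - "--port=5432"
      - "my-project:europe-west2:my-db"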

We can use sidecars to shut down sidecars, of course. I have recently been dealing with a self-hosted Airbyte. I can tell you, hosting Airbyte yourself in Kubernetes is an adventure. It works, but you need a fair bit of black magic to make it so! One of the many issues I’ve faced while trying to run that thing in Kubernetes was database migrations. Whenever Airbyte is updated it runs a Job called bootloader which runs database migrations and then exits. And this is all fine if you use some regular SQL server. However, it is not fine when you have GCP CloudSQL in a super secure environment. You need to use CloudSQL proxy.

Having CloudSQL proxy as a sidecar in the bootloader Job was not a problem either. The same code I’ve shared above works like a charm. However, exiting once the migration is done is a bit of a problem. Just killing the CloudSQL proxy sidecar will not do it, as Helm will declare the Job a failure, stopping further updates. This is especially problematic if you are deploying Airbyte using a CI/CD tool such as ArgoCD.

After reading the CloudSQL proxy documentation I noticed a flag that adds a weird quitquitquit endpoint to the service, allowing you to use something like curl to trigger a graceful exit. Since CloudSQL proxy listens on localhost within the pod, only other containers in the same pod can access this endpoint. So, after a bit of testing, here is the sidecar black magic that will allow you to run DB migrations and then stop the Job with grace:

extraContainers:
  - name: cloud-sql-proxy
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.6.1
    args:
      - "--address=127.0.0.1"
      - "--port=5432"
      - "--auto-iam-authn"
      - "--private-ip"
      - "--structured-logs"
      - "--quitquitquit"
      - "my-project:europe-west2:my-db"
  - name: stop-cloud-sql-proxy
    image: curlimages/curl
    command:
      - /bin/sh
      - -c
      - |
        # give the migration enough time to finish, then ask the proxy to exit gracefully
        sleep 25
        curl -X POST localhost:9091/quitquitquit

I have been using Airbyte for almost a year now in this setup and I have yet to see a migration that lasts longer than 15 seconds, but just in case I’ve set the CloudSQL proxy killer to wait for 25. It’s not like you will be deploying Airbyte 15 times a day, though they do release pretty often during the week.

With this in place any CD pipeline can now upgrade the Helm charts without a problem. The same approach can be used in any number of scenarios. You don’t even need an additional sidecar. In this scenario I had to use one because I have no control over what the container doing the migration does. However, if you were to use your own DB migration tool, you could add this POST as the last step of the migration job, as shown below. That way you don’t need to wait for anything to finish; you simply exit cleanly once the job is done.
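Something along these lines - the image and the run-migrations command are placeholders for whatever your own migration tooling looks like, and it assumes curl is available in that image:

containers:
  - name: db-migrate
    image: my-registry/my-migrations:latest    # placeholder for your own migration image
    command:
      - /bin/sh
      - -c
      - |
        # run the migrations, then tell the proxy in the same pod to exit gracefully
        run-migrations && curl -X POST localhost:9091/quitquitquit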