Graceful termination of Nginx in K8s

The problem

A request comes in and is processed correctly, including generating the response, but the response never reaches the client.

Luckily, we have extensive logging, so we could at least rule out any problem in the application layer. The issue did not happen often and we could never reproduce it manually, so we added even more logging to all layers of the stack, which helped pinpoint the issue. It required looking deeper into what exactly happens at the various stages of the application's lifecycle.

For some more context, we have the following system architecture: an OpenShift k8s cluster load-balancing to nginx instances that serve the application.

The investigation

On new deployments with the rolling strategy, new containers are brought up and do not receive traffic until they report as ready (according to the readinessProbe option). While this is going on, the old containers are still running and handling requests. The switch-over happens when OpenShift redirects traffic to the new containers and at the same time signals the old containers to terminate by sending them a SIGTERM (or whatever signal the container specified with the STOPSIGNAL instruction in its Dockerfile).

But when nginx receives a SIGTERM, it does not finish processing outstanding requests; it exits immediately. This explains why we sometimes lost a perfectly good response to a request: nginx was simply killed before it could return it. Luckily, nginx knows how to exit gracefully; it simply requires a different signal: SIGQUIT. Unlike in the official nginx container, however, SIGQUIT is not the default stop signal in the bitnami/nginx container.
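To make the difference between the two signals concrete, here is a minimal toy model in Python. It is not how nginx is implemented; the ToyServer class and the stop_with helper are hypothetical names made up for this illustration. It only shows the idea: a fast shutdown on SIGTERM throws away responses that are ready but not yet sent, while a graceful shutdown on SIGQUIT delivers them first. The sketch assumes a Unix-like system and Python 3.8 or newer, and it has to run in the main thread.

```python
import signal


class ToyServer:
    """Toy model of a server holding responses that are ready but not yet sent."""

    def __init__(self, pending):
        self.pending = list(pending)  # responses generated but not yet written
        self.delivered = []           # responses that reached the client
        self.running = True

    def fast_shutdown(self, signum, frame):
        # Models the fast shutdown (SIGTERM): stop at once, pending responses are lost.
        self.pending.clear()
        self.running = False

    def graceful_shutdown(self, signum, frame):
        # Models the graceful shutdown (SIGQUIT): flush pending responses, then stop.
        while self.pending:
            self.delivered.append(self.pending.pop(0))
        self.running = False


def stop_with(sig, pending):
    """Send `sig` to this process and return the responses that were delivered."""
    server = ToyServer(pending)
    # Install the toy handlers, remembering the previous ones so we can restore them.
    old_term = signal.signal(signal.SIGTERM, server.fast_shutdown)
    old_quit = signal.signal(signal.SIGQUIT, server.graceful_shutdown)
    try:
        signal.raise_signal(sig)
    finally:
        signal.signal(signal.SIGTERM, old_term)
        signal.signal(signal.SIGQUIT, old_quit)
    return server.delivered


if __name__ == "__main__":
    print("SIGTERM delivered:", stop_with(signal.SIGTERM, ["r1", "r2"]))
    print("SIGQUIT delivered:", stop_with(signal.SIGQUIT, ["r1", "r2"]))
```

With the same two pending responses, the SIGTERM run delivers nothing and the SIGQUIT run delivers both, which mirrors the lost responses we saw during deployments.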

The solution

While we wait for bitnami/nginx to fix the issue upstream, we can add the necessary configuration to our own Dockerfile:

    FROM bitnami/nginx:latest
    STOPSIGNAL SIGQUIT
    # etc...

If the OpenShift environment in use is too old and does not support STOPSIGNAL, we can use a preStop lifecycle hook in our Helm chart instead, although this should not normally be necessary:

    kind: "DeploymentConfig"
    apiVersion: "apps.openshift.io/v1"
    metadata:
      name: "nginx"
    spec:
      selector:
        name: nginx
      template:
        metadata:
          labels:
            name: "nginx"
        spec:
          containers:
            - name: "nginx"
              lifecycle:
                preStop:
                  exec:
                    # send PID 1 (nginx) a SIGQUIT so that it gracefully shuts down
                    # wait to delay the SIGTERM by kubernetes
                    # if nginx shuts down inside the delay, the pod is killed anyway
                    command: ["/bin/sh", "-c", "kill -s QUIT 1; sleep 2m"]
    [...]

Before we debugged the issue, searching for the symptoms did not produce any usable results, which is why this blog post exists.
