Health Checks

    Overview

    Each service in Constellations implements its own health check to report its state to Kubernetes. If a health check fails, Kubernetes kills the unhealthy container and starts a new one in its place.

    Scan Controller and Event Projection

    These APIs expose HTTP health checks that return 200 OK when the service is healthy and 503 Service Unavailable when it is unhealthy. The check can be made with a request to:

    /api/health
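    Scan Controller and Event Projection are ASP.NET services, so the following is only an illustration of the contract described above, not their implementation. It is a minimal sketch using Python's standard library: a GET to `/api/health` answers 200 while a health predicate holds and 503 otherwise. The `check` callable and the `make_health_handler` and `serve_health` names are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def make_health_handler(check: Callable[[], bool]):
    """Build a handler serving /api/health from a hypothetical health predicate."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/api/health":
                self.send_error(404)
                return
            healthy = check()
            body = b"Healthy" if healthy else b"Unhealthy"
            # 200 OK when healthy, 503 Service Unavailable otherwise.
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format: str, *args) -> None:
            pass  # keep frequent probe requests out of the console

    return HealthHandler


def serve_health(port: int, check: Callable[[], bool], host: str = "0.0.0.0") -> HTTPServer:
    """Start the health endpoint on a background thread and return the server."""
    server = HTTPServer((host, port), make_health_handler(check))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

    Because the predicate is evaluated on every request, the status code always reflects the service's state at the moment the probe calls in.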
    

    The services are considered unhealthy if they cannot access their respective CosmosDB databases.

    Scan Controller also reports unhealthy if it cannot connect to RabbitMQ.

    In addition to CosmosDB access, Event Projection tracks the health of the Change Feed mechanism it uses to build materialized views. If an error occurs while processing a batch of events, the service is reported as unhealthy.
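    The rules above combine several dependency checks into a single result. A minimal Python sketch, assuming each check is a hypothetical zero-argument callable (for example, a CosmosDB or RabbitMQ connectivity test) and assuming the Change Feed flag clears once a later batch succeeds; the article does not say how recovery actually works.

```python
from typing import Callable, Dict, List, Tuple


class ChangeFeedHealth:
    """Hypothetical tracker for the Change Feed processor's batch outcomes."""

    def __init__(self) -> None:
        self.last_batch_failed = False

    def record_batch(self, succeeded: bool) -> None:
        # Assumption: a later successful batch clears an earlier failure.
        self.last_batch_failed = not succeeded

    def __call__(self) -> bool:
        return not self.last_batch_failed


def evaluate(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, List[str]]:
    """Run every check; return (HTTP status, names of the failed checks)."""
    failed = [name for name, check in checks.items() if not check()]
    return (503 if failed else 200), failed


# Placeholder connectivity checks; real ones would contact the dependency.
def cosmos_reachable() -> bool:
    return True


def rabbit_reachable() -> bool:
    return True


change_feed = ChangeFeedHealth()
scan_controller_checks = {"cosmosdb": cosmos_reachable, "rabbitmq": rabbit_reachable}
event_projection_checks = {"cosmosdb": cosmos_reachable, "change_feed": change_feed}
```

    Returning the names of the failed checks alongside the status code makes it easier to see which dependency caused a 503 when inspecting the endpoint by hand.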

    Page Scanner, Scan Preprocessor, CDR Enabler and Event Collation

    These services do not use ASP.NET or Kestrel, so spinning up an entire Kestrel server just to provide health checks would be inefficient. Instead, each service opens its own TCP listener that accepts inbound connections. As long as the liveness probe can connect, it treats the service as healthy. If the service becomes unhealthy, the listener is stopped.

    Two environment variables configure the port the listener uses and the interval between health check updates.

    A TCP listener gives the service no way to reject an individual incoming connection: once the probe establishes a connection, it assumes the pod is healthy. For this reason, the service periodically runs a health check update and shuts down the listener when the pod is unhealthy, which prevents the liveness probe from connecting.
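    A minimal Python sketch of the listener behavior described above, assuming a Kubernetes `tcpSocket` liveness probe: while healthy, the socket listens and accepts (then immediately closes) connections; when unhealthy, the socket is closed so connection attempts are refused. `TcpHealthListener` and its methods are hypothetical names; the actual services are .NET and may be structured differently.

```python
import socket
import threading
from typing import Optional


class TcpHealthListener:
    """Hypothetical listener whose open/closed state signals service health."""

    def __init__(self, port: int, host: str = "0.0.0.0") -> None:
        self.host = host
        self.port = port
        self._sock: Optional[socket.socket] = None
        self._thread: Optional[threading.Thread] = None
        self._stop = threading.Event()

    def _accept_loop(self, sock: socket.socket) -> None:
        # A successful connection is the whole check, so close it right away.
        while not self._stop.is_set():
            try:
                conn, _ = sock.accept()
            except socket.timeout:
                continue  # wake periodically to check the stop flag
            except OSError:
                break  # socket closed underneath the loop
            conn.close()

    def start(self) -> None:
        if self._sock is not None:
            return  # already listening
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((self.host, self.port))
        sock.listen()
        sock.settimeout(0.2)
        self.port = sock.getsockname()[1]  # resolve port 0 to the real port
        self._stop.clear()
        self._sock = sock
        self._thread = threading.Thread(target=self._accept_loop, args=(sock,), daemon=True)
        self._thread.start()

    def stop(self) -> None:
        if self._sock is None:
            return  # already stopped
        self._stop.set()
        self._thread.join()  # exits within one accept timeout
        self._sock.close()  # new connection attempts are now refused
        self._sock = None
        self._thread = None

    def update(self, healthy: bool) -> None:
        # Called on each poll tick: listen only while healthy.
        if healthy:
            self.start()
        else:
            self.stop()
```

    Accepting connections in a background thread keeps the listen backlog from filling up, and the short accept timeout lets `stop()` end that thread cleanly before closing the socket. `update()` is idempotent, so it can be called on every poll regardless of the previous state.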

    Environment Variables:

    - HEALTH__Port=7800
    - HEALTH__HealthCheckPollTimeInMs=15000
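    The double underscore in these names is the .NET configuration convention for nested keys (`HEALTH__Port` maps to the `HEALTH:Port` setting). As a rough Python illustration of how a service could read them, the sketch below falls back to the example values shown above when a variable is unset; treating those values as defaults is an assumption, not documented behavior, and `load_health_settings` is a hypothetical name.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HealthSettings:
    port: int
    poll_interval_ms: int

    @property
    def poll_interval_s(self) -> float:
        return self.poll_interval_ms / 1000


def load_health_settings(env: Optional[Mapping[str, str]] = None) -> HealthSettings:
    """Read HEALTH__Port and HEALTH__HealthCheckPollTimeInMs.

    Assumption: the article's example values (7800, 15000) act as fallbacks.
    """
    env = os.environ if env is None else env
    port = int(env.get("HEALTH__Port", "7800"))
    poll_ms = int(env.get("HEALTH__HealthCheckPollTimeInMs", "15000"))
    if not 0 < port < 65536:
        raise ValueError(f"HEALTH__Port out of range: {port}")
    if poll_ms <= 0:
        raise ValueError("HEALTH__HealthCheckPollTimeInMs must be positive")
    return HealthSettings(port=port, poll_interval_ms=poll_ms)
```

    The poll interval bounds how quickly the probe can notice a failure: after the service becomes unhealthy, its listener may stay open for up to one interval (15 seconds with the values above) before the next update closes it.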
    

    Each of these services stops its TCP listener if it loses access to either RabbitMQ or CosmosDB.
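    Putting the pieces together, each service runs a loop that checks its dependencies every `HEALTH__HealthCheckPollTimeInMs` milliseconds and opens or closes its listener accordingly. A minimal sketch, reading the rule as "losing either RabbitMQ or CosmosDB is enough to close the listener"; `rabbitmq_ok`, `cosmosdb_ok`, and `set_listening` are hypothetical callables, and a real `set_listening` would open or close the TCP socket.

```python
import threading
from typing import Callable


def run_health_poll(
    rabbitmq_ok: Callable[[], bool],
    cosmosdb_ok: Callable[[], bool],
    set_listening: Callable[[bool], None],
    poll_interval_ms: int,
    stop: threading.Event,
) -> None:
    """Poll both dependencies until `stop` is set, toggling the listener each time."""
    while not stop.is_set():
        try:
            # Healthy only if both dependencies are reachable.
            healthy = rabbitmq_ok() and cosmosdb_ok()
        except Exception:
            healthy = False  # a check that throws counts as a failed check
        set_listening(healthy)
        # Event.wait doubles as an interruptible sleep between polls.
        stop.wait(poll_interval_ms / 1000)
```

    Treating an exception from a check as unhealthy errs on the side of closing the listener, so a dependency client that hangs up or throws leads to a restart rather than a pod that silently keeps passing its probe.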

