Retries

    Overview

    Glasswall Constellations comprises multiple services that communicate using asynchronous messages rather than HTTP. For these services, we have considered how the platform handles failure when an error occurs while processing a message.

    Message Consumers

    These services behave as standard message consumers. They share the following logic:

    If a received message is not valid JSON, or relates to a scan that is not recognised, it is dropped with no retries.

    If an error occurs while processing a message, the message is rejected and requeued. Once the number of requeues reaches the 'x-delivery-limit' set by the configured RabbitMQ policy, the message is dead lettered and sent to the dead letter queue subscribed to by the Scan Controller.
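
    The snippet below is a minimal C# sketch of this behaviour using the RabbitMQ .NET client: messages that are not valid JSON or relate to an unknown scan are dropped, processing failures are rejected and requeued, and the queue's delivery limit dead letters a message once it has been requeued too many times. The queue names, the ScanMessage type and the IsKnownScan/Process placeholders are illustrative rather than the actual Constellations code, and the delivery limit is declared as a queue argument here only to keep the sketch self-contained (the platform sets it via a RabbitMQ policy).

        using System;
        using System.Collections.Generic;
        using System.Text;
        using System.Text.Json;
        using RabbitMQ.Client;
        using RabbitMQ.Client.Events;

        var factory = new ConnectionFactory { HostName = "rabbitmq" };       // hypothetical host name
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        channel.QueueDeclare(
            queue: "scan.responses",                                          // hypothetical queue name
            durable: true,
            exclusive: false,
            autoDelete: false,
            arguments: new Dictionary<string, object>
            {
                { "x-queue-type", "quorum" },
                { "x-delivery-limit", 5 },                                    // requeues allowed before dead lettering
                { "x-dead-letter-exchange", "scan.deadletter" }               // routes dead letters onwards
            });

        var consumer = new EventingBasicConsumer(channel);
        consumer.Received += (_, ea) =>
        {
            try
            {
                var json = Encoding.UTF8.GetString(ea.Body.ToArray());
                var message = JsonSerializer.Deserialize<ScanMessage>(json);

                if (message is null || !IsKnownScan(message.ScanId))
                {
                    // Unknown scan: drop with no retries.
                    channel.BasicReject(ea.DeliveryTag, requeue: false);
                    return;
                }

                Process(message);
                channel.BasicAck(ea.DeliveryTag, multiple: false);
            }
            catch (JsonException)
            {
                // Not valid JSON: drop with no retries.
                channel.BasicReject(ea.DeliveryTag, requeue: false);
            }
            catch (Exception)
            {
                // Processing error: reject and requeue. RabbitMQ dead-letters the message
                // once the number of requeues reaches the configured x-delivery-limit.
                channel.BasicNack(ea.DeliveryTag, multiple: false, requeue: true);
            }
        };

        channel.BasicConsume(queue: "scan.responses", autoAck: false, consumer: consumer);
        Console.ReadLine();                                                   // keep the consumer alive

        static bool IsKnownScan(string scanId) => true;                       // placeholder scan lookup
        static void Process(ScanMessage message) { }                          // placeholder business logic

        public record ScanMessage(string ScanId);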

    These services also use the third-party library 'Polly' at startup when subscribing to incoming messages on their queues. The retry behaviour is controlled by two configuration options (see the sketch after this list):

    POLLY_RetryCount - the number of additional retry attempts.
    POLLY_BackoffInSeconds - the time to wait between retries, in seconds.
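
    The following is a minimal sketch of how such a startup policy can be built with Polly, assuming the environment variable names above; SubscribeToQueues is a stand-in for the real subscription call.

        using System;
        using Polly;

        var retryCount = int.Parse(Environment.GetEnvironmentVariable("POLLY_RetryCount") ?? "3");
        var backoffSeconds = int.Parse(Environment.GetEnvironmentVariable("POLLY_BackoffInSeconds") ?? "5");

        var subscribePolicy = Policy
            .Handle<Exception>()                                   // e.g. the broker is not reachable yet
            .WaitAndRetry(
                retryCount,                                        // additional attempts after the first
                _ => TimeSpan.FromSeconds(backoffSeconds),         // fixed wait between attempts
                (exception, delay, attempt, _) =>
                    Console.WriteLine($"Subscribe attempt {attempt} failed: {exception.Message}; retrying in {delay}."));

        subscribePolicy.Execute(SubscribeToQueues);

        static void SubscribeToQueues()
        {
            // Open the channel and start consuming from the configured queues here.
        }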

    Scan Controller

    This service is also an API; it subscribes to the queues for scan responses and scan dead letters.

    Page listing will be retried according to the queue configuration.

    Dead Letter Queue Listener

    Errors that occur while other services process messages may eventually lead to those messages being dead lettered by RabbitMQ. This can happen for several reasons, but most commonly because a message has been retried too many times.

    Dead lettered messages that relate to a scan are processed by this service, which 'closes the loop' by marking the scan as 'Errored'. As with the other queues, a message that is not in a valid format or relates to an unknown scan is dropped; otherwise, messages are retried indefinitely (see the sketch below).

    All queues except this one are configured to send dead letters to this queue.
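
    A minimal sketch of the handling described above is shown below; the IScanStore interface, the DeadLetteredScanMessage type and the acknowledgement wiring are assumptions for illustration rather than the actual service code.

        using System;
        using System.Text.Json;
        using System.Threading.Tasks;
        using RabbitMQ.Client;

        public interface IScanStore
        {
            Task<bool> ExistsAsync(string scanId);
            Task MarkErroredAsync(string scanId);
        }

        public class DeadLetterHandler
        {
            private readonly IScanStore _scanStore;

            public DeadLetterHandler(IScanStore scanStore) => _scanStore = scanStore;

            public async Task HandleAsync(string json, IModel channel, ulong deliveryTag)
            {
                DeadLetteredScanMessage? message;
                try
                {
                    message = JsonSerializer.Deserialize<DeadLetteredScanMessage>(json);
                }
                catch (JsonException)
                {
                    channel.BasicReject(deliveryTag, requeue: false);          // invalid format: drop
                    return;
                }

                if (message is null || !await _scanStore.ExistsAsync(message.ScanId))
                {
                    channel.BasicReject(deliveryTag, requeue: false);          // unknown scan: drop
                    return;
                }

                try
                {
                    await _scanStore.MarkErroredAsync(message.ScanId);         // 'close the loop' for the scan
                    channel.BasicAck(deliveryTag, multiple: false);
                }
                catch (Exception)
                {
                    // There is no delivery limit on this queue, so the message is
                    // requeued and retried indefinitely.
                    channel.BasicNack(deliveryTag, multiple: false, requeue: true);
                }
            }
        }

        public record DeadLetteredScanMessage(string ScanId);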

    Page Scanner

    This service subscribes to the scan requests queue. Page listing will be retried according to the queue configuration.

    Scan Preprocessor

    This service subscribes to the page segments queue. Segment Processing will be retried according to the queue configuration.

    CDR Enabler

    This service subscribes to the CDR enablement queue. File processing will be retried according to the queue configuration.

    In addition, the CDR Enabler uses the .NET library 'Polly' for its HTTP calls. When Glasswall Halo returns a transient HTTP error (a 5xx or 429 status), the service applies an exponential back-off and retry strategy to add resilience to the Glasswall engine when the system is under load.

    'POLLY_BackoffInSeconds' is not used here; instead, the wait is 2 raised to the power of the retry attempt, e.g. 2, 4 and 8 seconds for the first, second and third attempts respectively.

    The environment variable 'POLLY__MaxRetries' configures the number of additional retry attempts.
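
    A minimal sketch of such a policy, assuming Polly together with the Microsoft.Extensions.Http.Polly package and the environment variable above; the 'halo' client name is illustrative only.

        using System;
        using System.Net;
        using Microsoft.Extensions.DependencyInjection;
        using Polly;
        using Polly.Extensions.Http;

        var maxRetries = int.Parse(Environment.GetEnvironmentVariable("POLLY__MaxRetries") ?? "3");

        var haloRetryPolicy = HttpPolicyExtensions
            .HandleTransientHttpError()                                                   // network failures, 5xx and 408
            .OrResult(response => response.StatusCode == HttpStatusCode.TooManyRequests)  // 429
            .WaitAndRetryAsync(
                maxRetries,                                                               // additional retry attempts
                attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));                   // 2, 4, 8 ... seconds

        // Attach the policy to the HttpClient used for Glasswall Halo calls.
        var services = new ServiceCollection();
        services.AddHttpClient("halo")                                                    // hypothetical client name
                .AddPolicyHandler(haloRetryPolicy);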

    Event Collation

    This service subscribes to the event collation queue. Event storage will be retried according to the queue configuration.

    Event Projection

    This service does not subscribe to a messaging queue. Instead, it contains the business logic that processes scan events into materialised views. It uses Azure Cosmos DB's Change Feed Processor, which receives batches of new or updated items in the event collations store. Retries in this processor are handled using a leasing mechanism.

    The lease tracks progress in the change feed and ensures that only a single pod processes a partition at a time. A service using this mechanism obtains a lease for a period of time, and because progress is only checkpointed once a batch has been processed successfully, an exception raised in the Event Projection service does not lose progress in the change feed.

    An Event Projection pod in the cluster obtains a lease from the lease container in the Cosmos database. When a batch cannot be processed, the exception is logged, the thread processing that batch is closed, and a new one is started to try again.

    If the pod restarts, the lease becomes available again after a period of time, and another pod can pick it up and process those changes.
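
    The following is a minimal sketch of how a change feed processor is wired to a lease container with the Azure Cosmos DB .NET SDK; the database, container and processor names and the ScanEvent type are illustrative only.

        using System;
        using System.Collections.Generic;
        using System.Threading;
        using System.Threading.Tasks;
        using Microsoft.Azure.Cosmos;

        var client = new CosmosClient(Environment.GetEnvironmentVariable("COSMOS_CONNECTION_STRING"));
        Container sourceContainer = client.GetContainer("constellations", "event-collations");   // hypothetical names
        Container leaseContainer = client.GetContainer("constellations", "leases");

        ChangeFeedProcessor processor = sourceContainer
            .GetChangeFeedProcessorBuilder<ScanEvent>("event-projection", HandleChangesAsync)
            .WithInstanceName(Environment.MachineName)          // one lease owner per pod
            .WithLeaseContainer(leaseContainer)                 // tracks progress and ownership
            .Build();

        await processor.StartAsync();
        Console.ReadLine();                                     // keep the processor running
        await processor.StopAsync();

        static Task HandleChangesAsync(IReadOnlyCollection<ScanEvent> changes, CancellationToken cancellationToken)
        {
            foreach (var change in changes)
            {
                // Project the event into the materialised views here. If this delegate
                // throws, the batch is not checkpointed and is retried; after a pod
                // restart the lease eventually expires and another pod picks it up.
                Console.WriteLine($"Projecting events for scan {change.ScanId}");
            }

            return Task.CompletedTask;
        }

        public record ScanEvent(string ScanId);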

    Learn more about Change feed processor in Azure Cosmos DB

