Email Security Resiliency

How have we improved Email Security platform resilience?

At Glasswall, we are continually working to strengthen our resilience to cloud platform incidents and to improve our monitoring and alerting processes.

We have reinforced our platform by:

  • Moving to premium-tier components where the standard tier does not meet the expected service level.
  • Lowering the tolerance of our monitors so that alerts trigger earlier, allowing a faster response.
  • Moving away from relying on error volume to indicate issues, towards a custom-built system based on the time emails take to relay out of the platform. This means we will be notified within minutes if even a single email fails to leave the platform.
  • Refactoring our services where the data indicates they are a root cause of resiliency or performance issues.
  • Using performance test data to tune the auto-scaling within the application to better deal with spikes in traffic and increase the parallelism of our services.
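As an illustration of the relay-time monitoring described in the list above, here is a minimal Python sketch. It is not Glasswall's implementation: the `EmailRecord` type, the `overdue_emails` function, and the five-minute threshold are all hypothetical, chosen only to show how a check based on time-in-platform can flag a single stuck email without waiting for error volume to build up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold; the article only says "within minutes".
MAX_RELAY_TIME = timedelta(minutes=5)


@dataclass
class EmailRecord:
    message_id: str
    received_at: datetime
    relayed_at: Optional[datetime] = None  # None while the email is still in the platform


def overdue_emails(records, now, max_relay_time=MAX_RELAY_TIME):
    """Return the IDs of emails that have not left the platform in the allowed time."""
    return [
        r.message_id
        for r in records
        if r.relayed_at is None and now - r.received_at > max_relay_time
    ]
```

A monitor built this way would alert whenever the returned list is non-empty, so one delayed email is enough to raise an alert.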

How do you ensure you are able to handle our email?

Glasswall Email Security consists of many layers of resiliency which ensure constant readiness. It is hosted on highly secure Azure cloud servers, which help ensure the continuity of our service.

Email and portal traffic is load balanced across several application instances, and each instance comprises a microservice architecture deployed to a managed Kubernetes cluster. The nodes of each cluster are distributed across the three availability zones within a region, which means that the application is not reliant on specific hardware or tied to an individual hosting location.

Similarly, the flow of email through the system is controlled by messages on a service bus. Our service bus namespaces have zonal redundancy and are resilient to zone outages.

The application itself is written to be fault tolerant when accessing other services, both within the cluster and within Azure, with automatic retries as standard. In the event of an outage, services can fall back to cached data for a period, ensuring continued operation. Additionally, all microservices carry out individual health checks, allowing Kubernetes to monitor and maintain health.
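The retry-then-cache behaviour can be sketched roughly as follows. This is an assumption-laden illustration rather than the actual service code: `fetch_with_fallback`, the retry count, the backoff delays, and the `max_cache_age` limit are all hypothetical.

```python
import time


def fetch_with_fallback(fetch, cache, key, attempts=3, base_delay=0.5,
                        max_cache_age=300.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Call fetch(key) with retries; if every attempt fails, use a recent cached value."""
    for attempt in range(attempts):
        try:
            value = fetch(key)
            cache[key] = (value, clock())  # refresh the cache on every success
            return value
        except ConnectionError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
    if key in cache:
        value, stored_at = cache[key]
        if clock() - stored_at <= max_cache_age:
            return value  # dependency is down: serve the last known good value
    raise ConnectionError(f"{key!r} is unavailable and no fresh cached value exists")
```

The age limit reflects the "for a period" wording above: cached data keeps the service running through a short outage, but is not served indefinitely.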

If we are unable to persist an email on arrival, the application automatically switches to a backup queue running within the cluster. The backup queue holds the emails until service is restored, at which point they are processed. If an outage affects only one cluster, traffic is routed away from that cluster until it is healthy again.
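A simplified sketch of the backup-queue fallback is shown below. The `EmailIntake` class and its methods are hypothetical, and the in-memory `deque` only stands in for the durable in-cluster queue that the platform actually uses.

```python
from collections import deque


class EmailIntake:
    """Persist emails to a primary store, falling back to a backup queue on failure."""

    def __init__(self, primary_store):
        self.primary_store = primary_store  # callable that persists one email; may raise
        self.backup_queue = deque()         # stand-in for a durable in-cluster queue

    def accept(self, email):
        try:
            self.primary_store(email)
            return "primary"
        except ConnectionError:
            self.backup_queue.append(email)  # hold the email until service is restored
            return "backup"

    def drain_backup(self):
        """Replay queued emails in arrival order; stop at the first failure."""
        processed = 0
        while self.backup_queue:
            try:
                self.primary_store(self.backup_queue[0])
            except ConnectionError:
                break  # still down: leave the remaining emails queued
            self.backup_queue.popleft()
            processed += 1
        return processed
```

An email is removed from the backup queue only after the primary store has accepted it, so a second outage during replay does not drop mail.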

If the destination mail server is unable to receive an email, further endpoints can be provided for the application to try. If none of the listed endpoints can accept the email, delivery is retried against all listed endpoints for up to 30 minutes. Even after this point the emails are not lost: once service is restored to the destination, all mail can be retried.
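The delivery behaviour described above might look something like the following sketch. Only the 30-minute window comes from this article; the `deliver` function, the `send` callable, and the 60-second retry interval are assumptions made for illustration.

```python
import time

RETRY_WINDOW_SECONDS = 30 * 60  # the 30-minute retry window stated in the article


def deliver(email, endpoints, send, retry_interval=60.0,
            window=RETRY_WINDOW_SECONDS, sleep=time.sleep, clock=time.monotonic):
    """Try each endpoint in order, repeating the whole list until the window expires.

    Returns the endpoint that accepted the email, or None if none did, in which
    case the caller keeps the email so that it can be retried later.
    """
    deadline = clock() + window
    while True:
        for endpoint in endpoints:
            try:
                send(endpoint, email)
                return endpoint
            except ConnectionError:
                continue  # this endpoint is unavailable: try the next one
        if clock() + retry_interval > deadline:
            return None  # window exhausted; the email is held, not discarded
        sleep(retry_interval)
```

Returning `None` rather than raising mirrors the point that mail is not lost after the window closes: the email remains available for a later retry once the destination is back.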

All tenant data in blob storage and the SQL server is automatically replicated both within the primary region and to a secondary region.


If you have more questions or would like more information, please contact us.