Auto-scaling Resources

    Overview

    Unlike the Fixed approach, Constellations is designed to auto-scale. No upfront work is needed to scale the infrastructure; instead, Azure and KEDA add services and resources based on how large a container is. This was a significant limitation of the Fixed approach: no two containers are the same, so each needs different resources to be processed.

    With this approach, KEDA scales services based on the number of messages on the queue, and the node pools are set to auto-scale within Azure. Three node pools are used: two use our default recommended node size, while the third uses an extra-large node size so that multiple engine pods fit on a single node instance. This means that each new node Azure spins up and adds to the cluster contributes far more engines than a node of the smaller size would.
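
    As a rough sketch of the queue-driven part of this, the manifest below shows a minimal KEDA ScaledObject that scales a deployment between 0 and 450 replicas based on RabbitMQ queue length. The object, deployment, queue, and authentication names are placeholders for illustration, not the actual Constellations configuration:

    # engine-scaledobject.yaml (illustrative)
    # Apply with: kubectl apply -f engine-scaledobject.yaml -n cdrplatform
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: engine-scaler                # hypothetical name
    spec:
      scaleTargetRef:
        name: engine                     # hypothetical Deployment to scale
      minReplicaCount: 0
      maxReplicaCount: 450
      triggers:
        - type: rabbitmq
          metadata:
            protocol: amqp
            queueName: engine-requests   # hypothetical queue name
            mode: QueueLength            # scale on the number of waiting messages
            value: "20"                  # target messages per replica
          authenticationRef:
            name: rabbitmq-trigger-auth  # hypothetical TriggerAuthentication supplying the host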

    The Setup

    In this mode all of our resources are set. This is what the set-up was (an example of creating these node pools follows the list):

    • 3 Node Pools: 1 Constellations, 1 Halo, 1 Engine
    • Small Node Spec: 8 CPU, 28 GB RAM (Halo, Constellations)
    • Large Node Spec: 64 CPU, 256 GB RAM (Engine)
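
    As a sketch of how these pools could be created with the Azure cluster autoscaler enabled, the commands below use VM sizes that match the specs above (Standard_DS4_v2 provides 8 vCPU / 28 GB, Standard_D64s_v3 provides 64 vCPU / 256 GB) and the node ranges listed in the next section. The resource group and cluster names are placeholders, and only one small pool and the engine pool are shown:

    # Small node pool (cdrplatform services)
    az aks nodepool add \
      --resource-group <resource-group> \
      --cluster-name <aks-cluster> \
      --name cdrplatform \
      --node-vm-size Standard_DS4_v2 \
      --enable-cluster-autoscaler \
      --min-count 3 \
      --max-count 50

    # Extra-large node pool (Engine)
    az aks nodepool add \
      --resource-group <resource-group> \
      --cluster-name <aks-cluster> \
      --name engine \
      --node-vm-size Standard_D64s_v3 \
      --enable-cluster-autoscaler \
      --min-count 0 \
      --max-count 11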

    Glasswall Halo - 2 Node Pools

    • engine Nodes (Nodes: 0 - 11)

      • 0 - 450 engine pods
        • Fixed cpu request/limit of 1.5
        • Fixed memory limit of 5900Mi
    • cdrplatform Nodes (Nodes: 3 - 50)

      • 5 - 100 api pods
        • Fixed cpu request/limit of 2
        • Fixed memory limit of 5900Mi
      • 0 - 40 report-extractor pods
      • 0 - 40 cleanup pods

    Constellations (Nodes: 3 - 100)

    • 0 - 20 scan-controller pods
    • 0 - 20 page-scanner pods
    • 0 - 20 scan-preprocessor pods
    • 0 - 600 cdr-enabler pods
    • 0 - 20 event-projection pods
    • 0 - 20 event-collation pods
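
    These ranges correspond to the minimum and maximum replica counts that KEDA is allowed to scale each service between. Assuming KEDA is installed, the configured ScaledObjects and the HPAs KEDA generates from them can be inspected at runtime with:

    kubectl get scaledobjects --all-namespaces
    kubectl get hpa --all-namespaces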

    Storage

    • File Share: The Azure file share's throughput is rate-limited based on its provisioned size. To achieve optimal throughput, set the file share size to at least 10,000 GB. Avoid reducing this allocation, as doing so can lead to throttling at the Azure level; if you do encounter throttling, increase the allocated storage (see the example commands after this list).

    • Cosmos: Cosmos DB allocates Request Units (RUs) to each individual instance. With the auto-scale feature enabled, Cosmos DB dynamically assigns RUs to containers up to a maximum quota of 24,000 RUs. Constellations relies on substantial throughput given the significant number of read and write operations it performs, and the required throughput is directly proportional to the volume of files being processed. If you see Cosmos DB throttling in the Azure portal, increase the number of allocated RUs to ensure smooth operation.
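
    Both adjustments can be made from the Azure CLI. The sketch below uses placeholder resource names together with the 10,000 GB share size and 24,000 RU auto-scale maximum mentioned above:

    # Increase the file share quota (provisioned size determines throughput)
    az storage share-rm update \
      --resource-group <resource-group> \
      --storage-account <storage-account> \
      --name <file-share> \
      --quota 10000

    # Raise the auto-scale maximum RUs on a Cosmos DB container
    az cosmosdb sql container throughput update \
      --resource-group <resource-group> \
      --account-name <cosmos-account> \
      --database-name <database> \
      --name <container> \
      --max-throughput 24000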

    Modes

    Express Mode: The current run was executed in Express Mode, in which file extensions are checked against the supported list; if a file's extension is not in the list, Glasswall does not process the file. Note that when Express Mode is disabled the outcomes may differ, as all files are sent to the Glasswall Engine for processing regardless of their extension.

    Archive Mode: The run was conducted with Archive Mode set to 'Basic'. In this mode, archives are treated as individual files, and if any file within an archive fails, the entire archive is marked as a failure. Note that when Archive Mode is switched to 'Advanced' the outcomes may differ, as additional processing time is required to handle the archives effectively.

    Guaranteed Mode: In the CDR Platform, the API and Engine services were run in guaranteed mode to ensure that every pod assigned to a node had the resources it needed. Because the Glasswall engine requires varying amounts of RAM and CPU depending on a file's complexity, running in guaranteed mode ensures sufficient memory is reserved for each file's processing and minimizes the risk of a pod running out of memory during execution. Likewise, the RabbitMQ pods can come under strain at heavy load, so they should be configured with ample resources. When installing the RabbitMQ chart, ensure that appropriate values are set for the resources parameters:

    # Some parameters excluded for brevity
    helm upgrade --install cdrplatform-rabbitmq cdrplatform-rabbitmq -n cdrplatform \
      ...
      --set resources.requests.cpu=2 \
      --set resources.requests.memory=6Gi \
      --set resources.limits.cpu=2 \
      --set resources.limits.memory=6Gi
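
    A pod whose CPU and memory requests equal its limits, as in the helm command above, is assigned the Kubernetes Guaranteed QoS class. Once the pods are running, this can be confirmed with the following (the pod name is a placeholder):

    kubectl get pod <rabbitmq-pod-name> -n cdrplatform -o jsonpath='{.status.qosClass}'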
    

    If you encounter numerous Out-of-Memory (OOM) exceptions when running Constellations, increase the resource limits. This provides the headroom needed for seamless processing and reduces the likelihood of memory-related issues.
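
    To check whether a specific pod has been OOM-killed, inspect its last container state; the pod and namespace names below are placeholders:

    kubectl describe pod <pod-name> -n <namespace> | grep -i -A 3 'Last State'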

    File Set

    • Input size: 501 GB
    • Input count: 46,695 blobs

    Results

    Cold

    When initiating a scan with the minimum infrastructure, there is a lead time while all services and nodes scale up to their peak levels, which is what provides the optimal processing capacity. As a result, there is a delay between starting the scan and the Glasswall engine processing the first file, so a "cold start" scan typically takes longer than a scan against a fully scaled system where all resources are readily available.

    • Rebuilt: 24,956
    • Errored: 8,587
    • Failed: 12,902

    Total data rebuilt: 188 GB

    Total duration of run: 26 mins 24 secs

    Hot

    After a system has successfully scaled up, with all services and nodes operating at their maximum capacity for optimal processing, there is no lead time between initiating a scan and the Glasswall engine processing the first file. Consequently, when triggering a second scan immediately after the first one, the second scan should be quicker. This is because the infrastructure is already in place and running at peak efficiency, eliminating the need for any additional scaling or startup time.

    • Rebuilt: 24,866
    • Errored: 8,712
    • Failed: 12,868

    Total data rebuilt: 185 GB

    Total duration of run: 19 mins 27 secs

