Features Constraints and Limitations

    Features

    Glasswall Conform is a command-line tool designed for preprocessing PDF documents. It extracts and reconstructs visual content to ensure documents meet PDF standards, preparing them for further processing by the Glasswall Embedded Engine, which provides comprehensive Content Disarm and Reconstruction (CDR) protection.

    Key Features:

    • Text, Graphic, and Image Extraction: Extracts and reconstructs text, graphics, and images from PDFs, producing a clean, standards-compliant output document.
    • Handling Rate Threshold: Allows setting a minimum handling rate for graphics, images, or text. Files that fail to meet this threshold are classified as failures and will not be saved.
    • Custom Watermarking: Supports adding custom watermark text on each page of the reconstructed PDF, enabling personalised branding or messaging.
    • Character Identifier (CID) and Glyph Suppression: Suppresses unsupported glyphs and character identifiers (CIDs), replacing them with the default black square character (■).
    • Font Replacement: Converts custom embedded fonts to known-good Microsoft fonts or defaults to Cambria Math when necessary. This process aims to provide the best possible text display, even when custom fonts are not supported.
    • Standards Compliance: Produces a reconstructed PDF that adheres to PDF standards, so that it can subsequently be processed by the Glasswall Embedded Engine for full Content Disarm and Reconstruction (CDR) protection.
    • Fast Mode: Now the default mode. It processes files more quickly than cautious mode and delivers better visual fidelity. Cautious mode handles PDFs more strictly and replaces embedded fonts, but may take longer to process. Cautious mode is recommended only where embedded fonts must be avoided because of potential risks linked to third-party font libraries.
    • File Inclusion and Exclusion Filtering: Specify which files to process or exclude using absolute paths or wildcard patterns.
    • Output File Categorisation: Defines how output files are organised. Two modes are available. In "categorised" mode, output files are placed in subdirectories based on processing status (engine_success, conform_success, failure). In "mirrored" mode, successfully processed files are written directly to the output directory, preserving the original input directory structure; failed files are not copied.
    • Post Processing Summary: Provides detailed information on processing results, including file statuses, memory usage, and processing time.
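The filtering and output categorisation behaviour described in the list above can be illustrated with a minimal Python sketch. It is not Glasswall Conform's implementation or command-line interface: the function names `is_selected` and `output_path`, their parameters, and the example paths are hypothetical. Only the status names and the two mode names come from this article. The sketch also assumes that categorised mode keeps the relative input path inside each status subdirectory, which the article does not state.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Status names as listed in the article; everything else here is hypothetical.
STATUSES = ("engine_success", "conform_success", "failure")


def is_selected(path, include=("*",), exclude=()):
    """Return True if path matches any include pattern and no exclude pattern."""
    included = any(fnmatchcase(path, pattern) for pattern in include)
    excluded = any(fnmatchcase(path, pattern) for pattern in exclude)
    return included and not excluded


def output_path(input_root, output_root, file_path, status, mode):
    """Return the destination for a processed file, or None if it is not copied."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    relative = PurePosixPath(file_path).relative_to(input_root)
    if mode == "categorised":
        # One subdirectory per processing status.
        # Assumption: the relative input path is kept inside that subdirectory.
        return PurePosixPath(output_root) / status / relative
    if mode == "mirrored":
        # Failed files are not copied; successful files keep the input layout.
        if status == "failure":
            return None
        return PurePosixPath(output_root) / relative
    raise ValueError(f"unknown mode: {mode}")
```

For example, `is_selected("/data/in/drafts/b.pdf", include=("*.pdf",), exclude=("/data/in/drafts/*",))` rejects the file because the exclude pattern takes precedence in this sketch. Whether Conform resolves conflicts between include and exclude patterns the same way is an assumption.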

    Constraints and Limitations

    While Glasswall Conform is a powerful tool, certain constraints and limitations should be considered:

    • Image Handling: Some image colour spaces are unsupported and may be ignored. Additionally, image processing may convert compressed images to a lossless format, which can increase file size.

    • Font Handling: Glasswall Conform supports Base 14 and many Microsoft fonts, but unsupported custom fonts are replaced to mitigate potential risks.

    • PDF Structure: PDFs missing essential structural elements (e.g., root catalog, cross-reference tables) may not be recoverable.

    • Memory Usage: PDFs with many images may consume significant memory. While the tool has been tested with files up to 50 MB, larger files may experience performance issues.

    • Colour Spaces: The CalRGB colour space is not supported.

    • Graphics Handling: Support for complex graphics, such as shapes, charts, and graphs, is limited. This version prioritises text integrity.

    • Document Recovery: Severely corrupted PDFs or those with missing structural elements may be unrecoverable.

    • Platform Support: Glasswall Conform is currently only available for Linux amd64 architectures.

    • Timeout and Memory Configuration: The following table presents our findings on how configurable timeout and memory settings impact the overall processing success rate and total runtime when running on d16-v3 VMs, each with 16 vCPUs and 64GB RAM:

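      Timeout | Memory Limit | Runtime (7 VMs) | Aggregate Processing Time | Processing Time Increase | Files Processed | Success Rate | Success Increase
      --------|--------------|-----------------|---------------------------|--------------------------|-----------------|--------------|-----------------
      180s    | 4GB          | 65 minutes      | 350 minutes               | Baseline                 | 2,875 / 3,073   | 93.56%       | Baseline
      300s    | 8GB          | 79 minutes      | 428 minutes               | +23%                     | 2,939 / 3,073   | 95.64%       | +2.08%
      600s    | 12GB         | 96 minutes      | 514 minutes               | +47%                     | 2,947 / 3,073   | 95.90%       | +2.34%
      1200s   | 20GB         | 145 minutes     | 689 minutes               | +97%                     | 2,952 / 3,073   | 96.06%       | +2.50%
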
      • Increasing timeout and memory results in a higher success rate but comes at the cost of increased runtime.
      • The 300s / 8GB configuration improves the success rate by 2.08 percentage points over 180s / 4GB, with a 23% increase in processing time.
      • The 600s / 12GB configuration improves the success rate by 2.34 percentage points over 180s / 4GB, with a 47% increase in processing time.
      • The 1200s / 20GB configuration provides only a marginal increase in processed files (+5 over 600s).
      • The optimal configuration depends on whether speed or processing success rate is the higher priority. When at least 64GB RAM is available, the 300s / 8GB configuration is a well-balanced choice: it allows 8+ files to be processed in parallel, improves the success rate by 2.08 percentage points over 180s / 4GB, and increases processing time by a reasonable 23%.
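
The success rates and the parallelism figure quoted above can be re-derived from the table with a short Python sketch. The file counts, memory limits, and the 64GB RAM figure are copied from this article; the function name `summarise` and the `max_parallel` estimate are illustrative assumptions. The estimate divides RAM by the per-file memory limit only and ignores vCPU count and operating system overhead, so it should be read as a rough upper bound.

```python
# Figures copied from the table above.
TOTAL_FILES = 3073
VM_RAM_GB = 64

# (timeout in seconds, memory limit in GB, files processed successfully)
CONFIGS = [
    (180, 4, 2875),
    (300, 8, 2939),
    (600, 12, 2947),
    (1200, 20, 2952),
]


def summarise(configs, total_files, vm_ram_gb):
    """Compute success rate, gain over the first (baseline) row, and a memory-only parallelism bound."""
    baseline_rate = round(100 * configs[0][2] / total_files, 2)
    rows = []
    for timeout_s, memory_gb, processed in configs:
        rate = round(100 * processed / total_files, 2)
        rows.append({
            "timeout_s": timeout_s,
            "memory_gb": memory_gb,
            "success_rate": rate,
            # Difference of the rounded rates, in percentage points, as the table reports it.
            "gain_points": round(rate - baseline_rate, 2),
            # Assumption: parallelism is limited by memory alone.
            "max_parallel": vm_ram_gb // memory_gb,
        })
    return rows


for row in summarise(CONFIGS, TOTAL_FILES, VM_RAM_GB):
    print(row)
```

Under this memory-only estimate, the 8GB limit gives 8 parallel files on a 64GB VM, while the 12GB and 20GB limits drop to 5 and 3, which is one reason the larger configurations take noticeably longer overall.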

    Licensing

    Glasswall Conform includes PyMuPDF, which is available from Artifex under both the open-source AGPL and commercial license agreements. Glasswall holds a commercial distribution license agreement covering its use in Glasswall Conform.

