Overview

    Glasswall Conform is a Python-based tool designed to reconstruct malformed or corrupt PDF files that cannot be processed by the Glasswall Embedded Engine. It extracts visual content such as text, graphics, and images from the input PDF and creates a newly reconstructed version of the document.

    The tool is particularly useful for non-standard or problematic PDF files that do not conform to the PDF specification. Because the reconstructed document adheres to PDF standards, the Glasswall Embedded Engine can then process it for Content Disarm and Reconstruction (CDR).

    Glasswall Conform aims to make PDFs suitable for further processing by restoring structural integrity. It prioritises security while tolerating some loss of visual fidelity if necessary to deliver a file that conforms to specifications.

    Private Preview Status

    Glasswall Conform is currently in Private Preview. It addresses many common PDF issues and provides a foundational solution for handling problematic PDFs, but it may not fully reconstruct highly complex, severely malformed, or non-standard PDFs, and it may be unable to process some documents at all. Reconstruction may not always be complete or entirely accurate.

    Features

    Glasswall Conform preprocesses PDF documents: it extracts and reconstructs their visual content so that the output meets PDF standards. It is intended to work in conjunction with the Glasswall Embedded Engine, which provides the comprehensive Content Disarm and Reconstruction (CDR) protection.

    Key Features:

    • Text, Graphic, and Image Extraction: Extracts visual content from PDFs, including text, graphics, and images, and reconstructs the document.
    • Handling Rate Threshold: Allows setting a minimum handling rate for graphics, images, or text. A file that falls below the threshold is treated as a failure, and no output file is written.
    • Custom Watermarking: Allows the addition of user-provided text as a watermark on each page of the reconstructed PDF, enabling personalised branding or messaging.
    • Character Identifier (CID) and Glyph Suppression: Suppresses unsupported glyphs and character identifiers (CIDs), replacing them with the default black square character (■).
    • Font Replacement: Converts custom embedded fonts to known-good Microsoft fonts or defaults to Cambria Math when necessary. This process aims to provide the best possible text display, even when custom fonts are not supported.
    • Standards Compliance: Ensures that the reconstructed PDF conforms to standard specifications, making it suitable for further processing by the Glasswall Embedded Engine for full Content Disarm and Reconstruction (CDR) protection.
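
    The glyph suppression and handling rate threshold features above can be illustrated with a minimal sketch. This is not Glasswall Conform's implementation or API: the function names, the way a rate is computed, and the treatment of whitespace as always supported are all assumptions made for illustration. Only the black square replacement character and the rule that a file below the threshold is not written come from the feature list.

```python
BLACK_SQUARE = "\u25a0"  # the default replacement character (■) named above


def suppress_unsupported(text: str, supported: set[str]) -> tuple[str, int]:
    """Replace characters outside `supported` with the black square.

    Hypothetical helper: returns the cleaned text and the number of
    characters that were kept unchanged. Whitespace is always kept
    (an assumption of this sketch).
    """
    kept = 0
    out = []
    for ch in text:
        if ch in supported or ch.isspace():
            out.append(ch)
            kept += 1
        else:
            out.append(BLACK_SQUARE)
    return "".join(out), kept


def handling_rate(handled: int, total: int) -> float:
    """Fraction of items handled; an empty category counts as fully handled."""
    return 1.0 if total == 0 else handled / total


def should_write(rates: dict[str, float], minimums: dict[str, float]) -> bool:
    """True only if every category with a configured minimum meets it."""
    return all(rates.get(kind, 1.0) >= minimum for kind, minimum in minimums.items())


# Example: one unsupported character out of nine.
source = "caf\u00e9 menu"
cleaned, kept = suppress_unsupported(source, set("abcdefghijklmnopqrstuvwxyz"))
text_rate = handling_rate(kept, len(source))

print(cleaned)                                            # caf■ menu
print(should_write({"text": text_rate}, {"text": 0.8}))   # True
print(should_write({"text": text_rate}, {"text": 0.9}))   # False
```

    In this sketch a category with no configured minimum never blocks output, which mirrors the description of the threshold as an optional setting. Refer to the Glasswall Conform User Guide for the actual configuration options.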

    Constraints and Limitations

    While Glasswall Conform is a powerful tool, there are several constraints and limitations to be aware of:

    • Image Handling: Certain image color spaces are not supported, and images that use them are ignored. Image conversion may also expand compressed images to a lossless format, which can increase file size.
    • Font Handling: Glasswall Conform supports the Base 14 fonts and many Microsoft fonts. Other custom embedded fonts are not supported because they may pose a security risk, and are replaced with known-good fonts.
    • PDF Structure: Documents lacking a root catalog or with corrupt cross-reference tables may be unrecoverable.
    • Memory Usage: PDFs with a large number of images may lead to high memory consumption. The tool does not limit the size of input PDFs and has been shown to support files up to 50 MB. Larger files may encounter performance issues due to memory constraints.
    • Color Spaces: The CalRGB color space is not supported in pixel mapping.
    • Graphics Handling: Support for graphics such as shapes, charts, and graphs is limited. The Private Preview version has primarily focused on maintaining text integrity.
    • Document Recovery: PDFs with missing structural elements or severe corruption may not be recoverable.
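
    Some of these limitations can be screened for before a file is submitted. The sketch below is a rough pre-flight check, not part of Glasswall Conform: the function names are hypothetical, the 50 MB figure is interpreted as 50 × 1024 × 1024 bytes (an assumption), and the byte searches are simple heuristics that can produce both false alarms and misses. It only flags files that are likely to hit the memory or structure limits listed above.

```python
from pathlib import Path

# Assumption: "50 MB" is read here as 50 * 1024 * 1024 bytes.
MAX_TESTED_BYTES = 50 * 1024 * 1024


def preflight_bytes(data: bytes) -> list[str]:
    """Return heuristic warnings for the raw bytes of a candidate PDF."""
    warnings = []
    if len(data) > MAX_TESTED_BYTES:
        warnings.append("larger than 50 MB: expect high memory use")
    if b"%PDF-" not in data[:1024]:
        warnings.append("no %PDF- header in the first 1024 bytes")
    if b"/Root" not in data:
        warnings.append("no /Root entry found: root catalog may be missing")
    if b"startxref" not in data:
        warnings.append("no startxref keyword: cross-reference data may be missing")
    return warnings


def preflight_file(path: str) -> list[str]:
    """Read a file from disk and run the same heuristic checks."""
    return preflight_bytes(Path(path).read_bytes())


# Example with in-memory data rather than a real file.
minimal = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R >>\nstartxref\n0\n%%EOF\n"
print(preflight_bytes(minimal))        # []
print(len(preflight_bytes(b"junk")))   # 3
```

    An empty list does not mean the document is recoverable. It only means none of these simple warning signs were found.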

    Licensing

    Glasswall Conform includes PyMuPDF, which is available from Artifex under both the open-source AGPL and commercial license agreements. Glasswall holds a commercial distribution license that covers its use in Glasswall Conform.

    User Guide

    For instructions on installation, configuration, and usage, including advanced options, please refer to the Glasswall Conform User Guide.

    Platform Support

    The tool supports both Windows and Linux environments. It is distributed primarily as an offline installer for easy deployment in restricted environments.

