Glasswall Conform User Guide

    Article summary

    Glasswall Conform is designed to pre-process PDF files to meet standards for further processing. It extracts and reconstructs visual content and should be used in conjunction with the Glasswall Embedded Engine for complete Content Disarm and Reconstruction (CDR) protection.

    This document offers instructions on using Conform for reconstructing PDF documents, along with several examples for invoking the executable.


    Setup

    Before calling the glasswall_conform executable, ensure that your environment is set up correctly.

    Linux

    For processing modes that utilise the Embedded Engine, LD_LIBRARY_PATH must include the directory containing the Embedded Engine. For example, if the Embedded Engine is at /home/azureuser/glasswall/Release-16.2.0, you can modify LD_LIBRARY_PATH for the current shell session:

    export LD_LIBRARY_PATH=/home/azureuser/glasswall/Release-16.2.0:$LD_LIBRARY_PATH
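
    When Conform is launched from a script rather than an interactive shell, a similar effect can be had by passing a modified environment to the child process. The following is a minimal Python sketch; the helper name engine_env is hypothetical and is not part of Conform or the Glasswall Python Wrapper.

```python
import os


def engine_env(engine_dir, base_env=None):
    """Return a copy of the environment with engine_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the Embedded Engine directory is searched first, and avoid a
    # trailing ':' when LD_LIBRARY_PATH was previously unset.
    env["LD_LIBRARY_PATH"] = f"{engine_dir}:{existing}" if existing else engine_dir
    return env


env = engine_env(
    "/home/azureuser/glasswall/Release-16.2.0",
    {"LD_LIBRARY_PATH": "/usr/local/lib"},
)
print(env["LD_LIBRARY_PATH"])
# → /home/azureuser/glasswall/Release-16.2.0:/usr/local/lib
```

    The returned dictionary can be passed as the env argument of subprocess.run, which leaves the parent process's environment untouched.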
    

    Ubuntu

    On Ubuntu-based systems, if you encounter the error message libgthread-2.0.so.0: cannot open shared object file: No such file or directory, you can resolve it by installing the necessary package with the following command:

    apt update && DEBIAN_FRONTEND=noninteractive apt install -y libglib2.0-0
    

    Processing modes

    Conform is run from the command line. When calling the executable, the first positional argument specifies the processing mode. Available processing modes are:

    • engine: Protects files using the Engine. Non-conforming files are processed through Conform and then the Engine.
    • conform_only: Reconstructs files using Conform only, without providing CDR protection.

    To show available processing modes:

    glasswall_conform -h
    

    engine

    This processing mode is the intended default and cleans files using Glasswall CDR technology. It requires access to the Embedded Engine and a valid licence.

    For an example of invoking this processing mode, see: End to end protection.

    Processed files are sorted into one of three output subdirectories:

    1. 01_engine_success: Files successfully processed by the Embedded Engine without the need for reconstruction by Conform.
    2. 02_conform_engine_success: PDF files that were initially unable to be processed by the Embedded Engine, but were reconstructed by Conform and then successfully processed by the Embedded Engine.
    3. 03_failure: Files that failed to be processed using both the Embedded Engine and Conform.
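
    After a run, the outcome can be tallied by counting the files in each output subdirectory. The sketch below assumes the default subdirectory names listed above; count_outputs is a hypothetical helper, not something Conform provides.

```python
from pathlib import Path

# Default engine-mode subdirectory names, as listed in this guide.
ENGINE_SUBDIRS = ("01_engine_success", "02_conform_engine_success", "03_failure")


def count_outputs(output_dir, subdirs=ENGINE_SUBDIRS):
    """Count regular files in each expected output subdirectory (0 if it is absent)."""
    counts = {}
    for name in subdirs:
        folder = Path(output_dir) / name
        if folder.is_dir():
            counts[name] = sum(1 for path in folder.rglob("*") if path.is_file())
        else:
            counts[name] = 0
    return counts


if __name__ == "__main__":
    # Example path taken from this guide; adjust to your own output directory.
    print(count_outputs("/home/azureuser/output_files"))
```

    For the conform_only processing mode, the same helper could be called with subdirs=("01_conform_success", "02_failure").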

    To show the command line arguments for the engine processing mode:

    glasswall_conform engine -h
    

    conform_only

    This processing mode reconstructs files without utilising the Embedded Engine. It does not provide CDR protection.

    For an example of invoking this processing mode, see: Reconstructing files without CDR protection.

    Processed files are sorted into one of two output subdirectories:

    1. 01_conform_success: Files successfully reconstructed by Conform.
    2. 02_failure: Files that failed to be reconstructed by Conform.

    To show the command line arguments for the conform_only processing mode:

    glasswall_conform conform_only -h
    

    Testing

    A dataset of PDF test files for evaluating Conform is available upon request. Please contact us to request access to the test files via Kiteworks.


    Examples


    End to end protection

    This example shows the most basic use of the engine processing mode, passing only an input directory -i, an output directory -o, and the Embedded Engine directory -l.

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0
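
    For batch jobs, the same invocation can be assembled and launched from Python. This is a sketch under stated assumptions: glasswall_conform is on PATH, and the helper names engine_command and run_engine are hypothetical. This guide does not document Conform's exit codes, so the sketch does not interpret them.

```python
import subprocess


def engine_command(input_dir, output_dir, library_dir, extra_args=()):
    """Build the argument list for the engine processing mode."""
    return [
        "glasswall_conform", "engine",
        "-i", str(input_dir),
        "-o", str(output_dir),
        "-l", str(library_dir),
        *extra_args,
    ]


def run_engine(input_dir, output_dir, library_dir, extra_args=()):
    """Run Conform and return the CompletedProcess with captured text output."""
    # check=False: exit-code semantics are not documented here, so the caller
    # decides how to react to the return code.
    return subprocess.run(
        engine_command(input_dir, output_dir, library_dir, extra_args),
        capture_output=True,
        text=True,
        check=False,
    )
```

    Passing the arguments as a list avoids shell quoting problems when paths contain spaces.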
    

    Example input directory:

    /home/azureuser/input_files
        conforming_docx.docx
        conforming_pdf.pdf
        corrupt_docx.docx
        nonconforming_pdf.pdf
        unsupported_filetype.txt
    

    Example output directory after processing:

    /home/azureuser/output_files
    ├───01_engine_success
    │       conforming_docx.docx
    │       conforming_pdf.pdf
    │
    ├───02_conform_engine_success
    │       nonconforming_pdf.pdf
    │
    └───03_failure
            corrupt_docx.docx
            unsupported_filetype.txt
    

    Note that the subdirectory names can be customised using the following arguments:

    • --engine-success-path: Optional. Output subdirectory name for files that were successfully processed by the Embedded Engine without the need for reconstruction by Conform. Default 01_engine_success
    • --conform-success-path: Optional. Output subdirectory name for files that were initially unable to be processed by the Embedded Engine, but were reconstructed by Conform and then successfully processed by the Embedded Engine. Default 02_conform_engine_success
    • --failure-path: Optional. Output subdirectory name for files that failed to be processed using both the Embedded Engine and Conform. Default 03_failure

    To write all successfully protected files to the same output subdirectory, regardless of whether Conform was used to reconstruct them, pass the same subdirectory name to both success arguments. For example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --engine-success-path success --conform-success-path success --failure-path failure
    

    Example truncated terminal output after processing:

    Glasswall Conform processed 3/5 files (60.00%)
    Glasswall Conform failed to process 2/5 files. (40.00%)
    Exceptions:
      PdfExtractionError (Total: 2)
        - 1x Unable to extract content from PDF: '/home/azureuser/input_files/corrupt_docx.docx'
        - 1x Unable to extract content from PDF: '/home/azureuser/input_files/unsupported_filetype.txt'
    2024-11-06 14:28:50.242 glasswall_conform.config.logging INFO     engine_mode                   Total elapsed time: 5.55 seconds
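
    If a script needs the processed and total counts, they can be read from the summary line. The pattern below is derived only from the example output shown above and may not match other releases, so treat it as an assumption; parse_summary is a hypothetical helper.

```python
import re

# Matches the first summary line of the example output, e.g. "processed 3/5 files".
SUMMARY = re.compile(r"Glasswall Conform processed (\d+)/(\d+) files")


def parse_summary(output):
    """Return (processed, total) from the summary line, or None if it is absent."""
    match = SUMMARY.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = """Glasswall Conform processed 3/5 files (60.00%)
Glasswall Conform failed to process 2/5 files. (40.00%)"""
processed, total = parse_summary(sample)
print(f"{processed}/{total} = {processed / total:.2%}")
# → 3/5 = 60.00%
```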
    

    Reconstructing files without CDR protection

    The conform_only processing mode does not provide CDR protection, and requires only an input directory -i and an output directory -o. See conform_only.

    glasswall_conform conform_only -i /home/azureuser/input_files -o /home/azureuser/output_files
    

    Fast Mode

    Fast Mode is a Conform option that processes PDF files more quickly while preserving a higher level of visual similarity to the original document. However, in fast mode, embedded fonts are not replaced, and other features such as watermarking are unavailable.

    If fast mode is disabled, the standard processing method is used. This may take longer, and replacing embedded fonts may reduce visual fidelity, but it maximises risk reduction and enables full feature support, including font replacement and watermarking.

    • With fast mode enabled: Faster processing with higher visual fidelity, but embedded fonts are not replaced and features such as watermarking are unavailable.
    • With fast mode disabled: Slower processing, but embedded fonts are replaced to maximise risk reduction, and features such as watermarking are available.

    Fast mode is enabled by default, but can be disabled using the optional --disable-fast-mode command line argument.

      --disable-fast-mode   Optional. Disables attempting fast mode. Fast mode offers quicker processing and higher visual fidelity, but does not replace embedded fonts and disables features such as font replacement and watermarking. Disabling it falls back to the standard method, which may take longer but replaces embedded fonts and enables full feature support, including font replacement and watermarking. Default: False.
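
    Several later sections of this guide (content handling rates, watermarking, CID suppression, and font replacement) are marked as applicable only when fast mode is disabled. The guide does not state whether Conform disables fast mode automatically when those arguments are supplied, so a wrapper script may prefer to make the choice explicit. The following sketch does that; the helper name with_fast_mode_check is hypothetical.

```python
# Arguments this guide describes as applicable only when fast mode is disabled.
STANDARD_MODE_ONLY = {
    "--watermark",
    "--suppress-cid",
    "--text-min-success-rate",
    "--image-min-success-rate",
    "--graphic-min-success-rate",
    "--disable-base-14-fonts",
    "--disable-custom-fonts",
    "--disable-sans-serif-replacement",
}


def with_fast_mode_check(args):
    """Append --disable-fast-mode when any standard-mode-only argument is present."""
    args = list(args)
    needs_standard = any(arg in STANDARD_MODE_ONLY for arg in args)
    if needs_standard and "--disable-fast-mode" not in args:
        args.append("--disable-fast-mode")
    return args


print(with_fast_mode_check(["--watermark", "Glasswall Conform"]))
# → ['--watermark', 'Glasswall Conform', '--disable-fast-mode']
```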
    

    Glasswall Python Wrapper functionality

    In the engine processing mode, the protect_file function from the Glasswall Python Wrapper is used by default to process files using the Embedded Engine. This can be changed using the optional -f command line argument.

    A default sanitise content management policy is applied if a policy file is not specified using the optional -c command line argument.

    The required -l command line argument should point to a directory containing the Embedded Engine.

    The following arguments relate to the Glasswall Python Wrapper:

      -l LIBRARY_DIRECTORY, --library-directory LIBRARY_DIRECTORY
                            Required. Path to directory containing the Embedded Engine.
      -f FUNCTION_NAME, --function-name FUNCTION_NAME
                            Optional. Glasswall Python Wrapper function name to call during multiprocessing, such as 'protect_file' or 'export_file'. Default: 'protect_file'.
      -c CONTENT_MANAGEMENT_POLICY, --content-management-policy CONTENT_MANAGEMENT_POLICY
                            Optional. Path to Embedded Engine content management policy file. If not provided, the default 'sanitise' policy is used.
      --log-level-console-wrapper {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing Glasswall Python Wrapper logs to console. Default INFO.
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 -f protect_file -c /home/azureuser/glasswall/config.xml
    

    Multiprocessing

    All processing modes leverage the Glasswall Python Wrapper's GlasswallProcessManager to efficiently process files concurrently.

    The following arguments relate to multiprocessing:

      -w MAX_WORKERS, --max-workers MAX_WORKERS
                            Optional. Maximum workers for multiprocessing, 0=auto. Default: 0.
      -t TIMEOUT_SECONDS, --timeout-seconds TIMEOUT_SECONDS
                            Optional. Multiprocessing timeout per file in seconds. Default: 180.
      -m MEMORY_LIMIT_GIB, --memory-limit-gib MEMORY_LIMIT_GIB
                            Optional. Multiprocessing memory limit per file in GiB, 0=auto (4GiB min, worker distributed max). Default: 0.
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 -t 300 -m 12
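
    When choosing explicit values for -w, -t, and -m, it can help to estimate the worst case first. The sketch below is simple arithmetic and rests on assumptions not confirmed by this guide: every worker reaches its per-file memory limit at the same time, and every file runs until its timeout. The function name worst_case_budget is hypothetical.

```python
import math


def worst_case_budget(file_count, max_workers, memory_limit_gib, timeout_seconds=180):
    """Return (peak memory in GiB, wall time in seconds) if every file hits both limits."""
    if max_workers < 1:
        raise ValueError("pass an explicit worker count; 0 means auto in Conform")
    # Each worker handles one file at a time, so peak memory scales with workers.
    peak_memory_gib = max_workers * memory_limit_gib
    # Files are processed in waves of at most max_workers files.
    waves = math.ceil(file_count / max_workers)
    return peak_memory_gib, waves * timeout_seconds


print(worst_case_budget(file_count=100, max_workers=4, memory_limit_gib=12, timeout_seconds=300))
# → (48, 7500)
```

    With the values from the example above (-t 300 -m 12) and four workers, 100 files would need at most 48 GiB of memory and 7500 seconds under these assumptions.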
    

    Logging

    The default logging level for Conform and the Glasswall Python Wrapper is INFO. The following arguments relate to logging:

      --log-level-console {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing logs to console. Default INFO.
      --log-level-file {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing logs to file. If not provided, logs will not be written to file.
      --log-path LOG_PATH   Optional. Path to output log file. Default is a timestamp-named file located at: '%TEMP%/glasswall_conform/logs'.
      --log-level-console-wrapper {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing Glasswall Python Wrapper logs to console. Default INFO.
    

    To suppress most logging:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --log-level-console CRITICAL --log-level-console-wrapper CRITICAL
    

    Customise content handling rates

    This section is only applicable when Fast Mode is disabled.

    By default, Conform generates an output file whenever possible, even if only a portion of the original document's content has been successfully handled. This behaviour might not always be desirable, and can be customised for different types of content within each document.

    Conform uses "best guesses" when handling malformed, corrupt, or unsupported text content to ensure that as much text as possible is transferred from the original document to the conformed document. For example, if the stroke colour of the text is malformed or in an unsupported colour format, the text is retained in the output document, with the stroke colour defaulting to black.

    This "best guess" approach may result in text that appears similar to the original, or in some cases, text that is not visible but still present in the output document. As we cannot guarantee that our best guess will handle the text in the same way as in the original document, the handling rate reflects this as content that has not been fully handled. Consequently, a low handling rate for text does not always indicate that the document will look visually different when best guesses are applied.

    There are three arguments available to set the minimum success rates when handling content:

      --text-min-success-rate TEXT_MIN_SUCCESS_RATE
                            Optional. The minimum success rate for processing text. Default: 0.0.
      --image-min-success-rate IMAGE_MIN_SUCCESS_RATE
                            Optional. The minimum success rate for processing images. Default: 0.0.
      --graphic-min-success-rate GRAPHIC_MIN_SUCCESS_RATE
                            Optional. The minimum success rate for processing graphics. Default: 0.0.
    

    If a minimum content handling rate is not met, processing of the given file is deemed a failure and the output file is not written.
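
    The acceptance check can be pictured as a comparison per content type. This sketch is illustrative only: it assumes rates are expressed as fractions between 0.0 and 1.0 (the guide shows only the default of 0.0, so confirm the scale with -h), and that missing any one minimum fails the file. The sample rates are made up.

```python
def meets_minimums(rates, minimums):
    """Return True when every content type's handling rate reaches its minimum."""
    return all(rates.get(kind, 0.0) >= minimum for kind, minimum in minimums.items())


# Hypothetical handling rates for one document, as fractions (assumed scale).
rates = {"text": 0.72, "image": 1.0, "graphic": 0.95}

print(meets_minimums(rates, {"text": 0.0, "image": 0.0, "graphic": 0.0}))  # → True
print(meets_minimums(rates, {"text": 0.9, "image": 0.0, "graphic": 0.0}))  # → False
```

    With the default minimums of 0.0, every document that produces output passes, which matches the default behaviour described at the start of this section.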


    Watermarking

    This section is only applicable when Fast Mode is disabled.

    Watermarking is disabled by default, but can be enabled using the --watermark argument. Text will be added to the bottom-left of the document in a turquoise colour.

      --watermark WATERMARK
                            Optional. Adds a watermark to each page of the reconstructed document. Default '' (disabled).
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --watermark "Glasswall Conform"
    

    CID suppression

    This section is only applicable when Fast Mode is disabled.

    In PDFs, some fonts use a system called CID (Character Identifier) to manage large sets of characters. When Conform constructs a new PDF and encounters characters that cannot be processed, it replaces them with a default black square character (■). You can change how unprocessable CIDs are represented using the --suppress-cid argument:

      --suppress-cid SUPPRESS_CID
                            Optional. Replace CID metadata that may be printed to the visual layer due to font array omissions with the supplied string, with placeholder text.
                            Glasswall Conform restricts the processing of PDFs to only known secure fonts. This is a deliberate security feature to make the PDF conform safely. Default '■'.
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --suppress-cid "?"
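
    The visible effect of suppression can be modelled as a character substitution. Conform works on font CIDs rather than Python strings, so the following is only an analogy with a made-up set of supported characters; suppress_unsupported is a hypothetical name.

```python
def suppress_unsupported(text, supported, placeholder="■"):
    """Replace each character missing from `supported` with the placeholder."""
    return "".join(ch if ch in supported else placeholder for ch in text)


# Hypothetical glyph coverage: lowercase ASCII letters and the space character.
supported = set("abcdefghijklmnopqrstuvwxyz ")

print(suppress_unsupported("caf\u00e9 menu", supported))       # → caf■ menu
print(suppress_unsupported("caf\u00e9 menu", supported, "?"))  # → caf? menu
```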
    

    Font replacement

    This section is only applicable when Fast Mode is disabled.

    Conform supports bold, italic, and bold italic variants of the base 14 Type1 fonts and the Cambria font. Conform also supports some custom fonts.

    The base 14 Type1 fonts are:

    • Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
    • Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
    • Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
    • Symbol
    • ZapfDingbats

    Embedded fonts that are not supported may be replaced with the Cambria font. If Cambria does not support a glyph from an embedded font, the character is suppressed. For more information, see CID suppression.

    By default, some commonly embedded sans serif fonts are replaced with Helvetica instead of Cambria for visual similarity. This, and other font replacement features, can be modified using these arguments:

      --disable-base-14-fonts
                            Optional. Disable matching embedded fonts to base 14 fonts.
                            This will result in more fonts being replaced by the fallback font, Cambria. Default False.
      --disable-custom-fonts
                            Optional. Disable matching embedded fonts to custom fonts.
                            This will result in lower support for custom embedded fonts, and more fonts being replaced by the fallback font, Cambria. Default False.
      --disable-sans-serif-replacement
                            Optional. Disable replacing some sans serif fonts with Helvetica instead of the fallback font, Cambria.
                            This will result in some replaced sans serif fonts looking more visually different when compared to the original file. Default False.
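
    One way to picture how these arguments interact is as a chain of fallbacks ending in Cambria. The sketch below is an illustrative model, not Conform's implementation: it assumes exact name matching and a particular order of checks, neither of which this guide specifies, and the custom and sans serif font sets are hypothetical inputs.

```python
# Base 14 Type1 font names, as listed in this guide.
BASE_14 = frozenset({
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Symbol", "ZapfDingbats",
})


def replacement_font(name, custom_fonts=frozenset(), sans_serif_fonts=frozenset(),
                     disable_base_14=False, disable_custom=False,
                     disable_sans_serif_replacement=False):
    """Pick an output font for an embedded font name (illustrative model only)."""
    if not disable_base_14 and name in BASE_14:
        return name
    if not disable_custom and name in custom_fonts:
        return name
    if not disable_sans_serif_replacement and name in sans_serif_fonts:
        return "Helvetica"
    return "Cambria"  # fallback font named in this guide


sans = frozenset({"ExampleSans"})  # hypothetical font name
print(replacement_font("Times-Bold"))                          # → Times-Bold
print(replacement_font("ExampleSans", sans_serif_fonts=sans))  # → Helvetica
print(replacement_font("ExampleSans", sans_serif_fonts=sans,
                       disable_sans_serif_replacement=True))   # → Cambria
```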
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --disable-custom-fonts