Conform User Guide

    Article summary

    Glasswall Conform is designed to preprocess PDF files to meet standards for further processing. It extracts and reconstructs visual content and should be used in conjunction with the Glasswall Embedded Engine for complete Content Disarm and Reconstruction (CDR) protection.

    This document offers instructions on using Conform for reconstructing PDF documents, along with several examples for invoking the executable.


    Setup

    Before calling the glasswall_conform executable, ensure that your environment is set up correctly.

    Linux

    For processing modes that utilise the Embedded Engine, LD_LIBRARY_PATH must include the directory containing the Embedded Engine. For example, if the Embedded Engine is at /home/azureuser/glasswall/Release-16.2.0, you can set LD_LIBRARY_PATH for the current shell session:

    export LD_LIBRARY_PATH=/home/azureuser/glasswall/Release-16.2.0:$LD_LIBRARY_PATH
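
When Conform is launched from a script rather than an interactive shell, the same variable can be set for the child process only. The following is a minimal Python sketch and not part of Conform; the helper name conform_env is hypothetical, and the paths are the example paths used in this guide.

```python
import os


def conform_env(engine_dir, base_env=None):
    """Return a copy of the environment with engine_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so that the Embedded Engine directory is searched first.
    env["LD_LIBRARY_PATH"] = engine_dir + (":" + existing if existing else "")
    return env


env = conform_env(
    "/home/azureuser/glasswall/Release-16.2.0",
    {"LD_LIBRARY_PATH": "/usr/local/lib"},
)
print(env["LD_LIBRARY_PATH"])
```

The returned mapping can be passed as the env argument of subprocess.run when invoking glasswall_conform, which leaves the parent shell's environment unchanged.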
    

    Ubuntu

    On Ubuntu-based systems, if you encounter the error message libgthread-2.0.so.0: cannot open shared object file: No such file or directory, you can resolve it by installing the necessary package with the following command:

    apt update && DEBIAN_FRONTEND=noninteractive apt install -y libglib2.0-0
    

    Processing modes

    Conform is run from the command line. When calling the executable, the first positional argument specifies the processing mode. Available processing modes are:

    • engine: Protects files using the Engine. Non-conforming files are processed through Conform and then the Engine.
    • conform_only: Reconstructs files using Conform only, without providing CDR protection.

    To show available processing modes:

    glasswall_conform -h
    

    engine

    This processing mode is the intended default and cleans files using Glasswall CDR technology. It requires access to the Embedded Engine and a valid licence.

    For an example of invoking this processing mode, see: End to end protection.

    Processed files are sorted into one of three output subdirectories:

    1. 01_engine_success: Files successfully processed by the Embedded Engine without the need for reconstruction by Conform.
    2. 02_conform_engine_success: PDF files that were initially unable to be processed by the Embedded Engine, but were reconstructed by Conform and then successfully processed by the Embedded Engine.
    3. 03_failure: Files that failed to be processed using both the Embedded Engine and Conform.
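
After a run, the number of files that landed in each of these subdirectories can be tallied with a few lines of Python. This is a sketch rather than a Conform feature: the helper name count_outputs is hypothetical, and it simply counts regular files under every immediate subdirectory of the output directory, whatever those subdirectories are called.

```python
from pathlib import Path


def count_outputs(output_dir):
    """Count regular files under each immediate subdirectory of output_dir."""
    counts = {}
    for sub in sorted(Path(output_dir).iterdir()):
        if sub.is_dir():
            # rglob also covers files nested deeper inside the subdirectory.
            counts[sub.name] = sum(1 for p in sub.rglob("*") if p.is_file())
    return counts
```

For example, count_outputs("/home/azureuser/output_files") would return a dictionary keyed by subdirectory name, such as 01_engine_success or 03_failure.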

    To show the command line arguments for the engine processing mode:

    glasswall_conform engine -h
    

    conform_only

    This processing mode reconstructs files without utilising the Embedded Engine. It does not provide CDR protection.

    For an example of invoking this processing mode, see: Reconstructing files without CDR protection.

    Processed files are sorted into one of two output subdirectories:

    1. 01_conform_success: Files successfully reconstructed by Conform.
    2. 02_failure: Files that failed to be reconstructed by Conform.

    To show the command line arguments for the conform_only processing mode:

    glasswall_conform conform_only -h
    

    Testing

    A dataset of PDF test files for evaluating Conform is available upon request. Please contact us to request access to the test files via Kiteworks.


    Examples


    End to end protection

    This example demonstrates using the engine processing mode at its most basic level.

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0
    

    Example input directory:

    /home/azureuser/input_files
        conforming_docx.docx
        conforming_pdf.pdf
        corrupt_docx.docx
        nonconforming_pdf.pdf
        unsupported_filetype.txt
    

    Example output directory after processing:

    /home/azureuser/output_files
    ├───01_engine_success
    │       conforming_docx.docx
    │       conforming_pdf.pdf
    │
    ├───02_conform_engine_success
    │       nonconforming_pdf.pdf
    │
    └───03_failure
            corrupt_docx.docx
            unsupported_filetype.txt
    

    Note that the subdirectory names can be customised using the following arguments:

    • --engine-success-path: Optional. Output subdirectory name for files that were successfully processed by the Embedded Engine without the need for reconstruction by Conform. Default: 01_engine_success.
    • --conform-success-path: Optional. Output subdirectory name for files that were initially unable to be processed by the Embedded Engine, but were reconstructed by Conform and then successfully processed by the Embedded Engine. Default: 02_conform_engine_success.
    • --failure-path: Optional. Output subdirectory name for files that failed to be processed using both the Embedded Engine and Conform. Default: 03_failure.

    To write all successfully protected files to the same output subdirectory, regardless of whether Conform was used to reconstruct them, pass the same subdirectory name to --engine-success-path and --conform-success-path. For example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --engine-success-path success --conform-success-path success --failure-path failure
    

    Example truncated terminal output after processing:

    Glasswall Conform processed 3/5 files (60.00%)
    Glasswall Conform failed to process 2/5 files. (40.00%)
    Exceptions:
      PdfExtractionError (Total: 2)
        - 1x Unable to extract content from PDF: '/home/azureuser/input_files/corrupt_docx.docx'
        - 1x Unable to extract content from PDF: '/home/azureuser/input_files/unsupported_filetype.txt'
    2024-11-06 14:28:50.242 glasswall_conform.config.logging INFO     engine_mode                   Total elapsed time: 5.55 seconds
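
If the console output is captured by a calling script, the two tally lines can be extracted with a regular expression. The sketch below assumes the wording shown in the example output above; the exact format is not specified in this guide and may differ between versions. The name parse_tally is hypothetical.

```python
import re

# Matches the two tally lines shown in the example terminal output.
LINE = re.compile(r"Glasswall Conform (processed|failed to process) (\d+)/(\d+) files")


def parse_tally(output):
    """Map each tally phrase to a (count, total) tuple."""
    result = {}
    for match in LINE.finditer(output):
        result[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return result


sample = """Glasswall Conform processed 3/5 files (60.00%)
Glasswall Conform failed to process 2/5 files. (40.00%)"""
print(parse_tally(sample))
```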
    

    Reconstructing files without CDR protection

    The conform_only processing mode does not provide CDR protection, and requires only an input directory -i and an output directory -o. See conform_only.

    glasswall_conform conform_only -i /home/azureuser/input_files -o /home/azureuser/output_files
    

    Fast mode

    Fast mode offers quicker processing of PDF files while maintaining a higher level of visual similarity to the original document. However, in fast mode, embedded fonts are not replaced.

    If fast mode is disabled, cautious mode is used. Cautious mode may take longer and may reduce visual fidelity, but it replaces embedded fonts as part of its processing, which maximises risk reduction. Disabling fast mode enables full feature support, including font replacement.

    • With fast mode enabled: Faster processing with higher visual fidelity, embedded fonts are not replaced.
    • With fast mode disabled: Slower processing, but embedded fonts are replaced to maximise risk reduction.

    Fast mode is enabled by default, but can be disabled using the optional --disable-fast-mode command line argument.

      --disable-fast-mode   Optional. Disables attempting fast mode. Fast mode offers quicker processing and higher visual fidelity, but does not replace embedded fonts. Disabling it falls back to cautious mode, which may take longer and could result in lower visual fidelity. Default: False.
    

    Glasswall Python Wrapper functionality

    In the engine processing mode, the protect_file function from the Glasswall Python Wrapper is used by default to process files using the Embedded Engine. This can be changed using the optional -f command line argument.

    A default sanitise content management policy is applied if a policy file is not specified using the optional -c command line argument.

    The required -l command line argument should point to a directory containing the Embedded Engine.

    The following arguments relate to the Glasswall Python Wrapper:

      -l LIBRARY_DIRECTORY, --library-directory LIBRARY_DIRECTORY
                            Required. Path to directory containing the Embedded Engine.
      -f FUNCTION_NAME, --function-name FUNCTION_NAME
                            Optional. Glasswall Python Wrapper function name to call during multiprocessing, such as 'protect_file' or 'export_file'. Default: 'protect_file'.
      -c CONTENT_MANAGEMENT_POLICY, --content-management-policy CONTENT_MANAGEMENT_POLICY
                            Optional. Path to Embedded Engine content management policy file. If not provided, the default 'sanitise' policy is used.
      --log-level-console-wrapper {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing Glasswall Python Wrapper logs to console. Default INFO.
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 -f protect_file -c /home/azureuser/glasswall/config.xml
    

    Multiprocessing

    All processing modes leverage the Glasswall Python Wrapper's GlasswallProcessManager to efficiently process files concurrently.

    The following arguments relate to multiprocessing:

      -w MAX_WORKERS, --max-workers MAX_WORKERS
                            Optional. Maximum workers for multiprocessing, 0=auto. Default: 0.
      -t TIMEOUT_SECONDS, --timeout-seconds TIMEOUT_SECONDS
                            Optional. Multiprocessing timeout per file in seconds. Default: 180.
      -m MEMORY_LIMIT_GIB, --memory-limit-gib MEMORY_LIMIT_GIB
                            Optional. Multiprocessing memory limit per file in GiB, 0=auto (4GiB min, worker distributed max). Default: 0.
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 -t 300 -m 12
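
When Conform is driven from Python, the argument list can be assembled programmatically instead of by string concatenation. The helper below is a hypothetical sketch: it maps keyword names to the long option names documented in this guide (for example timeout_seconds to --timeout-seconds) and does not validate them against the executable.

```python
def build_engine_command(input_dir, output_dir, library_dir, **options):
    """Build an argv list for the engine processing mode."""
    command = [
        "glasswall_conform", "engine",
        "-i", input_dir, "-o", output_dir, "-l", library_dir,
    ]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            command.append(flag)  # switch without a value, e.g. --disable-fast-mode
        elif value is not None and value is not False:
            command.extend([flag, str(value)])
    return command


command = build_engine_command(
    "/home/azureuser/input_files",
    "/home/azureuser/output_files",
    "/home/azureuser/glasswall/Release-16.2.0",
    timeout_seconds=300,
    memory_limit_gib=12,
)
print(" ".join(command))
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting issues with paths that contain spaces.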
    

    Logging

    The default logging level for Conform and the Glasswall Python Wrapper is INFO. The following arguments relate to logging:

      --log-level-console {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing logs to console. Default INFO.
      --log-level-file {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing logs to file. If not provided, logs will not be written to file.
      --log-path LOG_PATH   Optional. Path to output log file. Default is a timestamp-named file located at: '%TEMP%/glasswall_conform/logs'.
      --log-level-console-wrapper {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing Glasswall Python Wrapper logs to console. Default INFO.
    

    To suppress most logging:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --log-level-console CRITICAL --log-level-console-wrapper CRITICAL
    

    Customise content handling rates

    This section is only applicable when fast mode is disabled.

    By default, Conform generates an output file whenever possible, even if only a portion of the original document's content has been successfully handled. This behaviour might not always be desirable, and can be customised for different types of content within each document.

    Conform uses "best guesses" when handling malformed, corrupt, or unsupported text content to ensure that as much text as possible is transferred from the original document to the conformed document. For example, if the stroke colour of the text is malformed or in an unsupported colour format, the text is retained in the output document, with the stroke colour defaulting to black.

    This "best guess" approach may result in text that appears similar to the original, or in some cases, text that is not visible but still present in the output document. As we cannot guarantee that our best guess will handle the text in the same way as in the original document, the handling rate reflects this as content that has not been fully handled. Consequently, a low handling rate for text does not always indicate that the document will look visually different when best guesses are applied.

    There are three arguments available to set the minimum success rates when handling content:

      --text-min-success-rate TEXT_MIN_SUCCESS_RATE
                            Optional. The minimum success rate for processing text. Default: 0.0.
      --image-min-success-rate IMAGE_MIN_SUCCESS_RATE
                            Optional. The minimum success rate for processing images. Default: 0.0.
      --graphic-min-success-rate GRAPHIC_MIN_SUCCESS_RATE
                            Optional. The minimum success rate for processing graphics. Default: 0.0.
    

    If the minimum content handling rate value is not met then processing for the given file will be deemed a failure and the output file will not be written.
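
The pass/fail rule described above can be illustrated as follows. This is a sketch of the rule only, not Conform's implementation: the function name, the example rates, and the assumption that rates are expressed as fractions between 0.0 and 1.0 are all hypothetical, since this guide does not state the scale of the values.

```python
def meets_minimums(rates, minimums):
    """Return True when every handling rate reaches its configured minimum."""
    return all(rates[kind] >= minimum for kind, minimum in minimums.items())


# Hypothetical handling rates for one document, expressed as fractions.
rates = {"text": 0.82, "image": 1.0, "graphic": 0.95}

defaults = {"text": 0.0, "image": 0.0, "graphic": 0.0}
strict = {"text": 0.9, "image": 0.0, "graphic": 0.0}

print(meets_minimums(rates, defaults))  # True: the default minimums accept any rate
print(meets_minimums(rates, strict))    # False: the text rate is below its minimum
```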


    Watermarking

    Watermarking is disabled by default, but can be enabled using the --watermark argument. Text will be added to the document in a turquoise colour.

      --watermark WATERMARK
                            Optional. Adds a watermark to each page of the reconstructed document. Default '' (disabled).
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --watermark "Glasswall Conform"
    

    CID suppression

    This section is only applicable when fast mode is disabled.

    In PDFs, some fonts use a system called CID (Character Identifier) to manage large sets of characters. When constructing a new PDF, if Conform encounters characters that cannot be processed, it replaces them with a default black square character (■). You can change the replacement string using the --suppress-cid argument:

      --suppress-cid SUPPRESS_CID
                            Optional. Replace CID metadata that may be printed to the visual layer due to font array omissions with the supplied string, with placeholder text.
                            Glasswall Conform restricts the processing of PDFs to only known secure fonts. This is a deliberate security feature to make the PDF conform safely. Default '■'.
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --suppress-cid "?"
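
The effect of the replacement string can be pictured with a toy model that works on plain characters. Conform operates on PDF character identifiers and glyphs, so this is only an analogy; the function name and the notion of a "supported" character set are hypothetical.

```python
def suppress_unsupported(text, supported, placeholder="■"):
    """Replace every character not in supported with the placeholder string."""
    return "".join(ch if ch in supported else placeholder for ch in text)


supported = set("AB")
print(suppress_unsupported("A→B", supported))       # default placeholder
print(suppress_unsupported("A→B", supported, "?"))  # custom placeholder
```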
    

    Font replacement

    This section is only applicable when fast mode is disabled.

    Conform supports bold, italic, and bold italic variants of the base 14 Type1 fonts and the Cambria font. Conform also supports some custom fonts.

    The base 14 Type1 fonts are:

    • Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
    • Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
    • Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
    • Symbol
    • ZapfDingbats

    Embedded fonts that are not supported may be replaced with the Cambria font. If Cambria does not support a glyph from an embedded font, the character is suppressed. For more information on this, see CID suppression.

    By default, some commonly embedded sans serif fonts are replaced with Helvetica instead of Cambria for visual similarity. This, and other font replacement features, can be modified using these arguments:

      --disable-base-14-fonts
                            Optional. Disable matching embedded fonts to base 14 fonts.
                            This will result in more fonts being replaced by the fallback font, Cambria. Default False.
      --disable-custom-fonts
                            Optional. Disable matching embedded fonts to custom fonts.
                            This will result in lower support for custom embedded fonts, and more fonts being replaced by the fallback font, Cambria. Default False.
      --disable-sans-serif-replacement
                            Optional. Disable replacing some sans serif fonts with Helvetica instead of the fallback font, Cambria.
                            This will result in some replaced sans serif fonts looking more visually different when compared to the original file. Default False.
    

    Example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --disable-custom-fonts
    

    File inclusion and exclusion filtering

    Conform allows additional control over which files in the input directory are processed by using include and exclude filters. These filters let you specify which files to process or ignore using basic Unix shell-style wildcards directly from the command line. If a file matches both an inclusion and an exclusion rule, it will be excluded.

    By default, if the --include-files and --exclude-files arguments are omitted, Conform will process all files that are present in the input directory.

    The following arguments relate to file inclusion and exclusion:

      --include-files INCLUDE_FILES
                            Optional. Can be either a path to a file containing file paths/patterns or a semicolon-separated list of patterns (e.g. '*.pdf;*/SET_03/*'). Only matching files will be processed.   
                            If None, all files are included. Default: None.
      --exclude-files EXCLUDE_FILES
                            Optional. Can be either a path to a file containing file paths/patterns or a semicolon-separated list of patterns. Any matching files will be excluded from processing. If None, no   
                            files are excluded. Default: None.
    

    The following table demonstrates examples of some patterns that can be used:

    Pattern | Meaning                          | Example        | Matches                | Does Not Match
    --------|----------------------------------|----------------|------------------------|-----------------------
    *       | Matches everything               | *.pdf          | file.pdf, report.pdf   | file.docx
    ?       | Matches any single character     | file_?.pdf     | file_1.pdf, file_A.pdf | file_10.pdf
    [seq]   | Matches any character in seq     | file_[AB].pdf  | file_A.pdf, file_B.pdf | file_C.pdf
    [!seq]  | Matches any character not in seq | file_[!AB].pdf | file_C.pdf, file_D.pdf | file_A.pdf, file_B.pdf
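
The wildcard syntax in the table is the same as that of Python's fnmatch module, so patterns can be tried out before passing them to Conform. This sketch assumes that Conform's matching behaves like fnmatchcase; this guide does not state which matcher Conform uses.

```python
from fnmatch import fnmatchcase

# (pattern, name that should match, name that should not match)
examples = [
    ("*.pdf", "file.pdf", "file.docx"),
    ("file_?.pdf", "file_1.pdf", "file_10.pdf"),
    ("file_[AB].pdf", "file_A.pdf", "file_C.pdf"),
    ("file_[!AB].pdf", "file_C.pdf", "file_A.pdf"),
]

for pattern, should_match, should_not_match in examples:
    print(pattern, fnmatchcase(should_match, pattern), fnmatchcase(should_not_match, pattern))
```

Each line prints the pattern followed by True for the matching name and False for the non-matching name.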

    Case sensitivity considerations

    File names are case-sensitive on Linux but case-insensitive on Windows. This affects how file paths or patterns are interpreted across different operating systems.

    • On Linux, report.pdf and Report.pdf are treated as different files.
    • On Windows, both are considered the same file.

    Recommendation:
    To ensure consistency across platforms, use consistent casing in file names and patterns. If working across multiple environments, consider using wildcard patterns where appropriate to avoid mismatches. For example, [Rr]eport.pdf matches both report.pdf and Report.pdf.

    Handling single file inclusions

    If specifying a single file with --include-files, be aware that Conform first checks whether the provided value is a file on disk, and if it is not then the value is treated as a pattern.

    Potential issue:
    If a user specifies:

    --include-files /home/azureuser/input_files/first.pdf
    

    Conform will see that /home/azureuser/input_files/first.pdf exists as a file, and attempt to read from it as a list file that contains multiple paths or patterns.

    Solution:
    To explicitly indicate that this is a pattern for a single file, append a trailing semicolon. Quote the value so that the shell does not interpret the semicolon as a command separator:

    --include-files "/home/azureuser/input_files/first.pdf;"
    

    This ensures that Conform treats the path as a pattern rather than a list file.
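
The decision described in this section can be sketched in Python. This is an illustration of the documented behaviour, not Conform's source code; the function name resolve_filter and the one-pattern-per-line list file layout are assumptions based on the examples in this guide.

```python
import os


def resolve_filter(value):
    """Return a list of patterns from a list file or a semicolon-separated string."""
    if os.path.isfile(value):
        # The value names an existing file, so it is read as a list file.
        with open(value, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    # Otherwise split on semicolons; the empty item left by a trailing
    # semicolon is dropped.
    return [item for item in value.split(";") if item]


print(resolve_filter("*.pdf;*/SET_03/*"))
```

With a trailing semicolon, the value no longer names an existing file, so the first branch is skipped and the path is returned as a single pattern.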

    Include specific PDF files

    To process only PDFs with "report" in the filename:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --include-files "*report*.pdf"
    

    Result: Only files like annual_report.pdf, summary_report_2023.pdf, etc., are processed.

    Exclude specific PDF files

    To process all PDFs except ones containing "draft" in the name:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --exclude-files "*draft*.pdf"
    

    Result: All PDFs are processed, except files like proposal_draft.pdf and internal_draft_v2.pdf.

    Exclude an entire directory

    To exclude all files inside /home/azureuser/input_files/archive/:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --exclude-files "*/archive/*"
    

    Result: Everything inside /home/azureuser/input_files/archive/ is skipped.

    Include and exclude together

    If a file matches both an inclusion and an exclusion rule, it will be excluded.

    To process all files from SET_03, but exclude files containing "error_log":

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --include-files "*/SET_03/*" --exclude-files "*error_log*"
    

    Result: Only files from SET_03/ are processed, except any containing "error_log" in the filename.
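
The precedence rule, where exclusion wins over inclusion, can be sketched as below. The function name select_files and the example paths are hypothetical, and the sketch assumes patterns are matched against full paths with fnmatch-style wildcards.

```python
from fnmatch import fnmatchcase


def select_files(paths, include=None, exclude=None):
    """Keep paths matching any include pattern and no exclude pattern."""
    selected = []
    for path in paths:
        included = include is None or any(fnmatchcase(path, p) for p in include)
        excluded = exclude is not None and any(fnmatchcase(path, p) for p in exclude)
        if included and not excluded:
            selected.append(path)
    return selected


paths = [
    "/home/azureuser/input_files/SET_03/a.pdf",
    "/home/azureuser/input_files/SET_03/error_log_1.txt",
    "/home/azureuser/input_files/SET_02/b.pdf",
]
print(select_files(paths, include=["*/SET_03/*"], exclude=["*error_log*"]))
```

Only the first path is printed: the second matches both filters and is therefore excluded, and the third matches no inclusion pattern.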

    Using a file for large lists

    For more complex filtering, you can provide a file containing multiple patterns or absolute file paths instead of specifying them directly.

    Example using an inclusion list file:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --include-files include_list.txt
    

    Example include_list.txt:

    */SET_03/*.pdf
    *reports_2023_*.pdf
    /home/azureuser/input_files/SET_02/splat.pdf
    

    Result: Processes only PDF files from SET_03/, PDF files whose paths contain reports_2023_, and the specific file /home/azureuser/input_files/SET_02/splat.pdf.


    Output file structure and categorisation

    The directory structure for output files can be customised for both the engine and conform_only processing modes using the --output-structure command line argument.

      --output-structure {categorised,mirrored}
                            Optional. Defines the directory structure of output files. 'categorised' organises output files into subdirectories based on processing status ('engine_success', 'conform_success', 'failure').
                            'mirrored' places successfully processed output files directly in the output directory, maintaining the original input directory structure, and failed files will not be copied. Default: categorised.   
    

    If omitted, the default categorised structure is used. Additional options are available to customise the category subdirectory names:

    • --engine-success-path: Optional. Output subdirectory name for files that were successfully processed by the Embedded Engine without the need for reconstruction by Conform. Default: 01_engine_success.
    • --conform-success-path: Optional. Output subdirectory name for files that were initially unable to be processed by the Embedded Engine, but were reconstructed by Conform and then successfully processed by the Embedded Engine. Default: 02_conform_engine_success.
    • --failure-path: Optional. Output subdirectory name for files that failed to be processed using both the Embedded Engine and Conform. Default: 03_failure.

    Example categorised output structure

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0
    

    Example input directory:

    /home/azureuser/input_files
        conforming_docx.docx
        conforming_pdf.pdf
        corrupt_docx.docx
        nonconforming_pdf.pdf
        unsupported_filetype.txt
    

    Example output directory after processing:

    /home/azureuser/output_files
    ├───01_engine_success
    │       conforming_docx.docx
    │       conforming_pdf.pdf
    │
    ├───02_conform_engine_success
    │       nonconforming_pdf.pdf
    │
    └───03_failure
            corrupt_docx.docx
            unsupported_filetype.txt
    

    When using the categorised output structure, you can write all successfully protected files to the same output subdirectory, regardless of whether Conform was used to reconstruct them, by passing the same subdirectory name to --engine-success-path and --conform-success-path. For example:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --engine-success-path success --conform-success-path success --failure-path failure
    

    Example mirrored output structure

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --output-structure mirrored
    

    Example input directory:

    /home/azureuser/input_files
        conforming_docx.docx
        conforming_pdf.pdf
        corrupt_docx.docx
        nonconforming_pdf.pdf
        unsupported_filetype.txt
    

    Example output directory after processing:

    /home/azureuser/output_files
        conforming_docx.docx
        conforming_pdf.pdf
        nonconforming_pdf.pdf
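
Because failed files are not copied in the mirrored structure, one way to identify them afterwards is to compare the two directory trees. The sketch below assumes that each output file keeps the relative path and name of its input file; the helper name missing_outputs is hypothetical.

```python
from pathlib import Path


def missing_outputs(input_dir, output_dir):
    """Return relative paths of input files with no counterpart under output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    missing = []
    for path in sorted(input_dir.rglob("*")):
        if path.is_file():
            relative = path.relative_to(input_dir)
            if not (output_dir / relative).is_file():
                missing.append(relative.as_posix())
    return missing
```

For the example directories above, this would list corrupt_docx.docx and unsupported_filetype.txt.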
    

    Post processing summary

    By default, a summary is written after Conform has finished processing files. The summary provides detailed information such as return statuses, processing time, and memory usage for each file. The --summary-verbosity argument controls which files are included in the summary. This setting is independent of the logging level and does not affect detailed log outputs.

    Available options

    • all (default) - Includes both successfully processed and failed files.
    • failure - Includes only failed files.
    • success - Includes only successfully processed files.
    • none - Disables the summary output completely.

    Example to include only failed files in the summary output:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --summary-verbosity failure
    

    Example to disable the summary output entirely:

    glasswall_conform engine -i /home/azureuser/input_files -o /home/azureuser/output_files -l /home/azureuser/glasswall/Release-16.2.0 --summary-verbosity none
    

    Example summary output:

    Glasswall Conform Summary (--summary-verbosity: 'all')
    
    Processing success:
    
            input_file: /home/azureuser/input_files/pal1.bmp
            output_file: /home/azureuser/output_files/01_engine_success/pal1.bmp
            engine_status: OK(0)
            max_memory_used_in_gib: 0.10964584350585938
            elapsed_time: 0.7697250843048096
            success: True
    
            input_file: /home/azureuser/input_files/Set-08-016599.pdf
            output_file: /home/azureuser/output_files/02_conform_engine_success/Set-08-016599.pdf
            engine_status: GeneralFail(-1)
            engine_GW2FileErrorMsg: [FAILURE_LOG_SEM_FONTS_0021897368] Key /FirstChar must be present in a Type 1 Font dictionary other than for standard 14. fonts.
            engine_conform_fast_status: GeneralFail(-1)
            engine_conform_fast_GW2FileErrorMsg: [FAILURE_LOG_SEM_FONTS_0021897368] Key /FirstChar must be present in a Type 1 Font dictionary other than for standard 14. fonts.
            engine_conform_cautious_status: OK(0)
            max_memory_used_in_gib: 0.20969390869140625
            elapsed_time: 1.7218008041381836
            success: True
    
    Processing failure:
    
            input_file: /home/azureuser/input_files/pal1_corrupt.bmp
            engine_status: FileTypeUnknown(-7)
            engine_GW2FileErrorMsg: Unable to determine file type
            engine_conform_fast_status: PdfFastProcessError()
            engine_conform_cautious_status: PdfExtractionError(Unable to extract content from PDF: '/home/azureuser/input_files/pal1_corrupt.bmp')
            exit_code: 0
            timed_out: False
            out_of_memory: False
            max_memory_used_in_gib: 0.10601806640625
            elapsed_time: 0.743659496307373
            success: False
    
            input_file: /home/azureuser/input_files/Straw120556398.pdf
            timed_out: False
            out_of_memory: True
            max_memory_used_in_gib: 4.01513671875
            elapsed_time: 39.11844754219055
            exception: MemoryError()
            success: False
    
    Processing success rate:  50.00% (2/4 files)
    Processing failure rate:  50.00% (2/4 files)
    
    Processing time: 40.05 seconds (0.10 files/sec, 10.01 secs/file)
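
If the summary is captured as text, the per-file records can be turned into dictionaries for further analysis. The parser below is a sketch that relies on the layout shown in the example above (one "key: value" pair per line, records starting with input_file and separated by blank lines); that layout is not a documented interface and may change. The name parse_summary is hypothetical.

```python
def parse_summary(text):
    """Group 'key: value' lines of a summary into one dict per file record."""
    records, current = [], None
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line.startswith("input_file: "):
            if current:
                records.append(current)
            current = {}
        elif not line:
            # A blank line closes the record in progress, if any.
            if current:
                records.append(current)
            current = None
            continue
        if current is not None:
            key, _, value = line.partition(": ")
            current[key] = value
    return records


sample = """
Processing success:

        input_file: /home/azureuser/input_files/pal1.bmp
        engine_status: OK(0)
        success: True

Processing failure:

        input_file: /home/azureuser/input_files/Straw120556398.pdf
        out_of_memory: True
        exception: MemoryError()
        success: False

Processing success rate:  50.00% (1/2 files)
"""

records = parse_summary(sample)
failed = [r["input_file"] for r in records if r.get("success") == "False"]
print(failed)
```

Values are kept as strings, so fields such as elapsed_time need to be converted with float before doing arithmetic on them.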
    
