Glasswall Conform User Guide

    Welcome to the Glasswall Conform User Guide. This guide offers step-by-step instructions on how to effectively use the tool for reconstructing PDF documents.

    Getting Started

    Glasswall Conform pre-processes PDF files that do not conform to the PDF ISO 32000 series of standards so that they can be processed further. It extracts and reconstructs the visual content of each file, and should be used together with the Glasswall Embedded Engine for complete Content Disarm and Reconstruction (CDR) protection.

    Installation

    Before installing the glasswall_conform package, ensure that your environment is set up correctly and has the necessary dependencies installed.

    Example Environment Setup for Rocky Linux 9.3

    sudo dnf --enablerepo=devel -y update
    sudo dnf --enablerepo=devel -y install \
        python3.11-devel \
        python3.11-pip \
        libxml2-devel \
        libxslt-devel \
        platform-python-devel \
        gcc
    sudo ln -fs /usr/bin/python3.11 /usr/bin/python
    sudo ln -fs /usr/bin/pip3.11 /usr/bin/pip
    

    Example Environment Setup for Rocky Linux 8.9

    sudo dnf --enablerepo=powertools -y update
    sudo dnf --enablerepo=powertools -y install \
        python3.11-devel \
        python3.11-pip \
        libxml2-devel \
        libxslt-devel \
        platform-python-devel \
        gcc
    sudo ln -fs /usr/bin/python3.11 /usr/bin/python
    sudo ln -fs /usr/bin/pip3.11 /usr/bin/pip
    

    Example Environment Setup for Ubuntu 24.04

    sudo apt-get update
    sudo apt-get install -y \
        python3-dev \
        python3-pip
    sudo ln -fs /usr/bin/python3 /usr/bin/python
    sudo ln -fs /usr/bin/pip3 /usr/bin/pip
    

    Install Offline:

    Navigate to the directory containing the offline installation files and run:

    pip install --upgrade --no-index --find-links=. glasswall_conform glasswall
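
    After installing, it can be useful to confirm that both packages are visible to the interpreter you intend to use. The snippet below is a minimal sketch using only the standard library; the helper name missing_packages is illustrative and not part of either package.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two top-level packages installed by the pip command above.
missing = missing_packages(["glasswall_conform", "glasswall"])
if missing:
    print(f"Not importable: {', '.join(missing)}")
else:
    print("glasswall_conform and glasswall are importable")
```

    If either name is reported as not importable, check that pip and python point at the same interpreter (see the symbolic links created in the environment setup examples).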
    

    Testing

    A dataset of PDF test files for evaluating Glasswall Conform is available upon request. Please contact us to request access to the test files via Kiteworks.

    Python Code Samples

    Advanced integration with the Glasswall Embedded Engine using Glasswall Multiprocessing

    The Glasswall Python Wrapper provides developers with the ability to handle batch processing of files using multiprocessing, and to set memory and timeout limits for each file. This helps control behaviour when processing large or complex PDF files. Additionally, the wrapper enhances the scalability of Glasswall Conform by enabling efficient parallel processing.

    The code below demonstrates how to batch process files using the Glasswall Embedded Engine and Glasswall Conform, leveraging multiprocessing to optimise the use of CPU cores and memory. Files are handled in multiple stages, with output being organised into four directories depending on the outcome:

    1. 01_engine_success: Files that are successfully processed by the Glasswall Embedded Engine in the first pass.
    2. 02_conformed_files: PDF files that were initially unable to be processed by the Glasswall Embedded Engine but were successfully reconstructed by Glasswall Conform in the second pass.
    3. 03_conform_then_engine_success: PDF files that, after being reconstructed by Glasswall Conform, are successfully processed by the Glasswall Embedded Engine in the third pass.
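    4. 04_failure: Original files that could not be processed: non-PDF files that failed in the first pass, PDF files that Glasswall Conform could not reconstruct, and PDF files that still failed in the Glasswall Embedded Engine after being reconstructed.
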
    import os
    import shutil
    from collections import Counter
    
    import glasswall
    from glasswall.multiprocessing import GlasswallProcessManager, Task
    from glasswall.multiprocessing.memory_usage import get_available_memory_gib
    from tqdm import tqdm
    
    os.environ["GLASSWALL_CONFORM_LOG_LEVEL_CONSOLE"] = "CRITICAL"  # Optionally hide logging to console
    
    from glasswall_conform.pdf import graphic_processing, image_processing, process_pdf, text_processing, watermark
    
    INPUT_DIRECTORY = r"glasswall_conform\input"
    OUTPUT_DIRECTORY = r"glasswall_conform\output"
    LIBRARY_DIRECTORY = r"C:\azure\sdk.editor\2.1111.0"
    
    ENGINE_SUCCESS_DIRECTORY = os.path.join(OUTPUT_DIRECTORY, "01_engine_success")
    CONFORMED_FILES_DIRECTORY = os.path.join(OUTPUT_DIRECTORY, "02_conformed_files")
    CONFORM_THEN_ENGINE_SUCCESS_DIRECTORY = os.path.join(OUTPUT_DIRECTORY, "03_conform_then_engine_success")
    FAILURE_DIRECTORY = os.path.join(OUTPUT_DIRECTORY, "04_failure")
    
    # Conform setup
    # Optionally set minimum threshold for writing conformed output PDFs
    graphic_processing.graphic_min_success_rate = 0
    image_processing.image_min_success_rate = 0
    text_processing.text_min_success_rate = 0
    text_processing.suppress_cid = "■"  # Optionally set character to use for suppressions
    watermark.text = ""  # Optionally set watermark text to apply to each page
    
    # Embedded Engine setup
    glasswall.config.logging.console.setLevel("CRITICAL")  # Optionally hide logging to console
    EDITOR = glasswall.Editor(LIBRARY_DIRECTORY)
    content_management_policy = glasswall.content_management.policies.Editor(default="sanitise")
    
    
    def worker_function(**kwargs):
        EDITOR.protect_file(
            input_file=kwargs["input_file"],
            output_file=kwargs.get("output_file"),
            content_management_policy=kwargs.get("content_management_policy"),
        )
    
    
    def worker_function_conform(**kwargs):
        process_pdf(
            input_file=kwargs["input_file"],
            output_file=kwargs["output_file"],
        )
    
    
    def main():
        input_files = glasswall.utils.list_file_paths(INPUT_DIRECTORY)
        if not input_files:
            print(f"No input files found in directory: {INPUT_DIRECTORY}")
            return
    
        cpu_count = os.cpu_count() or 1
        print(f"Available CPU count: {cpu_count}")
    
        available_memory_gib = round(get_available_memory_gib(), 2)
        print(f"Available memory: {available_memory_gib} GiB")
    
        # Set max_workers to the lower of cpu_count and available memory // 4 (4 GiB per worker), minimum 1
        max_workers = max(1, int(min(cpu_count, available_memory_gib // 4)))
        print(f"Max workers: {max_workers}")
    
        # Calculate memory limit per worker
        memory_limit_in_gib = round(available_memory_gib / max_workers, 2)
        print(f"Memory limit per worker: {memory_limit_in_gib:.2f} GiB\n")
    
        # First pass, Embedded Engine only
        with GlasswallProcessManager(
            max_workers=max_workers,
            worker_timeout_seconds=30,
            memory_limit_in_gib=memory_limit_in_gib,
        ) as gpm:
            for input_file in input_files:
                relative_path = os.path.relpath(input_file, INPUT_DIRECTORY)
                output_file = os.path.join(ENGINE_SUCCESS_DIRECTORY, relative_path)
    
                task = Task(
                    func=worker_function,
                    args=tuple(),
                    kwargs=dict(
                        input_file=input_file,
                        output_file=output_file,
                        content_management_policy=content_management_policy,
                    ),
                )
                gpm.queue_task(task)
    
            results_engine = []
            for task_result in tqdm(
                gpm.as_completed(),
                total=len(input_files),
                desc="Processing files with Embedded Engine",
                miniters=len(input_files) // 100,
            ):
                results_engine.append(task_result)
    
        first_pass_success = [task_result for task_result in results_engine if task_result.success is True]
        first_pass_failure = [task_result for task_result in results_engine if task_result.success is not True]
        first_pass_percent = (len(first_pass_success) / len(results_engine)) * 100
        first_pass_conformable_files = [
            task_result
            for task_result in first_pass_failure
            if task_result.task.kwargs["input_file"].lower().endswith(".pdf")
        ]
        print(
            f"Embedded Engine processed {len(first_pass_success)}/{len(results_engine)} files"
            f" ({first_pass_percent:.2f}%)"
        )
        print(f"Embedded Engine failed to process {len(first_pass_failure)} files. ({100 - first_pass_percent:.2f}%)")
        exceptions = Counter(item.exception.__class__.__name__ for item in results_engine if item.exception)
        if exceptions:
            print("Exceptions:")
            for k, v in exceptions.items():
                print(f"\t{k} = {v}")
        for task_result in first_pass_failure:
            # Copy non-PDF files that Embedded Engine failed to process to the failure directory
            if task_result not in first_pass_conformable_files:
                src = task_result.task.kwargs["input_file"]
                relative_path = os.path.relpath(src, INPUT_DIRECTORY)
                dst = os.path.join(FAILURE_DIRECTORY, relative_path)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)
            # If timeout or memory error, it's possible that execution was terminated while Python opened the file
            # handle to write file to disk, resulting in a 0kb output file existing, check for this and delete if exists
            if os.path.isfile(task_result.task.kwargs["output_file"]):
                os.remove(task_result.task.kwargs["output_file"])
        print()
    
        if first_pass_conformable_files:
            # Second pass, Glasswall Conform the PDF files that failed to process, increased timeout, 30 -> 180 secs
            with GlasswallProcessManager(
                max_workers=max_workers,
                worker_timeout_seconds=180,
                memory_limit_in_gib=memory_limit_in_gib,
            ) as gpm:
                for task_result in first_pass_conformable_files:
                    relative_path = os.path.relpath(task_result.task.kwargs["input_file"], INPUT_DIRECTORY)
                    output_file = os.path.join(CONFORMED_FILES_DIRECTORY, relative_path)
                    task = Task(
                        func=worker_function_conform,
                        args=tuple(),
                        kwargs=dict(
                            input_file=task_result.task.kwargs["input_file"],
                            output_file=output_file,
                            content_management_policy=task_result.task.kwargs["content_management_policy"],
                        ),
                    )
                    gpm.queue_task(task)
    
                results_conform = []
                for task_result in tqdm(
                    gpm.as_completed(),
                    total=len(first_pass_conformable_files),
                    desc="Processing PDF files with Conform",
                    miniters=len(first_pass_conformable_files) // 100,
                ):
                    results_conform.append(task_result)
    
                second_pass_success = [item for item in results_conform if item.success is True]
                second_pass_failure = [item for item in results_conform if item.success is not True]
                second_pass_percent = (len(second_pass_success) / len(results_conform)) * 100 if results_conform else 0
                print(
                    f"Conform processed {len(second_pass_success)}/{len(results_conform)} files"
                    f" ({second_pass_percent:.2f}%)"
                )
                print(f"Conform failed to process {len(second_pass_failure)} files." f" ({100 - second_pass_percent:.2f}%)")
                exceptions = Counter(item.exception.__class__.__name__ for item in results_conform if item.exception)
                if exceptions:
                    print("Exceptions:")
                    for k, v in exceptions.items():
                        print(f"\t{k} = {v}")
                print()
    
                for task_result in second_pass_failure:
                    # Copy files that Conform failed to process to the failure directory
                    src = task_result.task.kwargs["input_file"]
                    relative_path = os.path.relpath(src, INPUT_DIRECTORY)
                    dst = os.path.join(FAILURE_DIRECTORY, relative_path)
                    os.makedirs(os.path.dirname(dst), exist_ok=True)
                    shutil.copy2(src, dst)
                    # Delete 0kb output files from timeout/memoryerror
                    if os.path.isfile(task_result.task.kwargs["output_file"]):
                        os.remove(task_result.task.kwargs["output_file"])
    
                if second_pass_success:
                    # Third pass, Embedded Engine on the Conformed files.
                    with GlasswallProcessManager(
                        max_workers=max_workers,
                        worker_timeout_seconds=30,
                        memory_limit_in_gib=memory_limit_in_gib,
                    ) as gpm:
                        for task_result in second_pass_success:
                            relative_path = os.path.relpath(task_result.task.kwargs["input_file"], INPUT_DIRECTORY)
                            output_file = os.path.join(CONFORM_THEN_ENGINE_SUCCESS_DIRECTORY, relative_path)
                            task = Task(
                                func=worker_function,
                                args=tuple(),
                                kwargs=dict(
                                    input_file=task_result.task.kwargs["output_file"],
                                    output_file=output_file,
                                    content_management_policy=task_result.task.kwargs["content_management_policy"],
                                    original_input_file=task_result.task.kwargs["input_file"],
                                ),
                            )
                            gpm.queue_task(task)
    
                        results_engine_on_conformed_files = []
                        for task_result in tqdm(
                            gpm.as_completed(),
                            total=len(second_pass_success),
                            desc="Processing Conformed PDF files with Embedded Engine",
                            miniters=len(second_pass_success) // 100,
                        ):
                            results_engine_on_conformed_files.append(task_result)
    
                    third_pass_success = [
                        task_result for task_result in results_engine_on_conformed_files if task_result.success is True
                    ]
                    third_pass_failure = [
                        task_result for task_result in results_engine_on_conformed_files if task_result.success is not True
                    ]
                    third_pass_percent = (len(third_pass_success) / len(results_engine_on_conformed_files)) * 100
                    print(
                        f"Embedded Engine processed {len(third_pass_success)}/{len(results_engine_on_conformed_files)}"
                        f" Conformed files ({third_pass_percent:.2f}%)"
                    )
                    print(
                        f"Embedded Engine failed to process {len(third_pass_failure)}"
                        f" Conformed files. ({100 - third_pass_percent:.2f}%)"
                    )
                    exceptions = Counter(
                        item.exception.__class__.__name__ for item in results_engine_on_conformed_files if item.exception
                    )
                    if exceptions:
                        print("Exceptions:")
                        for k, v in exceptions.items():
                            print(f"\t{k} = {v}")
                    print()
    
                    for task_result in third_pass_failure:
                        # Copy original files to failure directory that were processed by Conform but failed Embedded Engine
                        src = task_result.task.kwargs["original_input_file"]
                        relative_path = os.path.relpath(src, INPUT_DIRECTORY)
                        dst = os.path.join(FAILURE_DIRECTORY, relative_path)
                        os.makedirs(os.path.dirname(dst), exist_ok=True)
                        shutil.copy2(src, dst)
                        # Delete 0kb output files from timeout/memoryerror
                        if os.path.isfile(task_result.task.kwargs["output_file"]):
                            os.remove(task_result.task.kwargs["output_file"])
    
                # third_pass_success is only defined when the third pass ran
                total_success = len(first_pass_success) + (len(third_pass_success) if second_pass_success else 0)
                total_percent = (total_success / len(input_files)) * 100
                print(f"In total {total_success}/{len(input_files)} files ({total_percent:.2f}%) were processed.")
    
        # Delete empty dirs due to timeouts, memory errors, etc
        glasswall.utils.delete_empty_subdirectories(OUTPUT_DIRECTORY)
    
    
    if __name__ == "__main__":
        main()
    

    Example output:

    Available CPU count: 20
    Available memory: 16.40 GiB
    Max workers: 4
    Memory limit per worker: 4.10 GiB
    
    Processing files with Embedded Engine: 100%|███████████████████████████████| 7/7 [00:12<00:00,  1.80s/it]
    Embedded Engine processed 1/7 files (14.29%)
    Embedded Engine failed to process 6 files. (85.71%)
    Exceptions:
            GeneralFail = 5
            FileTypeUnknown = 1
    
    Processing PDF files with Conform: 100%|███████████████████████████████████| 5/5 [00:50<00:00, 10.01s/it] 
    Conform processed 5/5 files (100.00%)
    Conform failed to process 0 files. (0.00%)
    
    Processing Conformed PDF files with Embedded Engine: 100%|█████████████████| 5/5 [00:41<00:00,  8.31s/it] 
    Embedded Engine processed 4/5 Conformed files (80.00%)
    Embedded Engine failed to process 1 Conformed files. (20.00%)
    Exceptions:
            TimeoutError = 1
    
    In total 5/7 files (71.43%) were processed.
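
    The worker and memory figures at the top of this output follow from the sizing rule in main(). The sketch below isolates that rule so it can be tried with different machine sizes; the function name plan_workers and the gib_per_worker parameter are illustrative and do not exist in the Glasswall Python Wrapper.

```python
def plan_workers(cpu_count, available_memory_gib, gib_per_worker=4):
    """Return (max_workers, memory_limit_in_gib) using the rule from the script above."""
    # One worker per gib_per_worker of free memory, capped by CPU count, never fewer than one.
    max_workers = max(1, int(min(cpu_count, available_memory_gib // gib_per_worker)))
    # Split the available memory evenly between the workers.
    memory_limit_in_gib = round(available_memory_gib / max_workers, 2)
    return max_workers, memory_limit_in_gib


print(plan_workers(20, 16.40))  # the machine from the example output
print(plan_workers(8, 3.2))  # less than 4 GiB free: a single worker gets all of it
```

    On a machine with many cores but little free memory, memory is the limiting factor, which is why the example run uses 4 workers despite having 20 CPUs.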
    

    Simple integration with the Glasswall Embedded Engine

    The script below demonstrates how to use the Glasswall Python Wrapper to process a file with the Glasswall Embedded Engine. If the initial processing attempt fails, Glasswall Conform is used to reconstruct the PDF, after which the Glasswall Embedded Engine tries to process the reconstructed file again.

    import os
    
    import glasswall
    from glasswall.libraries.editor import errors
    
    import glasswall_conform
    from glasswall_conform.common.errors import GlasswallConformError
    from glasswall_conform.common.utils import TempFilePath
    from glasswall_conform.pdf import process_pdf
    
    # Initialise Glasswall Embedded Engine from the specified path
    EDITOR = glasswall.Editor(r"C:\azure\sdk.editor\2.1111.0")
    
    # Disable further logging for Glasswall Embedded Engine and Glasswall Conform
    glasswall.config.logging.console.setLevel("CRITICAL")
    glasswall_conform.config.logging.console_handler.setLevel("CRITICAL")
    
    # Define input and output directories
    INPUT_DIRECTORY = r"glasswall_conform\input"
    OUTPUT_DIRECTORY = r"glasswall_conform\output"
    
    # Iterate through files in the input directory
    for relative_file_path in glasswall.utils.list_file_paths(INPUT_DIRECTORY, absolute=False):
        input_file = os.path.join(INPUT_DIRECTORY, relative_file_path)
        output_file = os.path.join(OUTPUT_DIRECTORY, relative_file_path)
    
        try:
            # Attempt to protect the input file using Glasswall Embedded Engine
            EDITOR.protect_file(input_file, output_file)
            print(f"Successfully processed file: {input_file}")
    
        # Catch specific Glasswall Embedded Engine errors
        except (errors.GeneralFail, errors.UnexpectedEndOfFile) as glasswall_engine_exception:
            with TempFilePath(directory=glasswall_conform.TEMPDIR, suffix=".pdf") as reconstructed_pdf:
                try:
                    # Attempt to process PDF file using Glasswall Conform
                    process_pdf(input_file, reconstructed_pdf)
                # Catch any exceptions raised during Glasswall Conform processing
                except GlasswallConformError as glasswall_conform_exception:
                    # Raise the Glasswall Embedded Engine exception with context from the Glasswall Conform exception
                    print(f"Failed to Conform file: {input_file}")
                    raise glasswall_engine_exception from glasswall_conform_exception
    
                # Glasswall Conform succeeded, attempt to protect the reconstructed PDF file using Glasswall Embedded Engine
                try:
                    EDITOR.protect_file(reconstructed_pdf, output_file)
                    print(f"Successfully processed Conformed file: {input_file}")
                except Exception:
                    # Glasswall Embedded Engine failed to process the reconstructed PDF file generated by Glasswall Conform
                    # Handle or log the error as needed
                    print(f"Failed to process Conformed file: {input_file}")
                    raise
    
        except errors.FileTypeUnknown:
            # Handle unknown file types here
            print(f"Skipped file, unknown file type: {input_file}")
    

    Example output:

    2024-09-18 12:06:50.620 glasswall_conform.config.logging INFO     log_version_info              Imported Glasswall Conform version 0.6.3 (OS: Windows, Python: 3.11.9)
    2024-09-18 12:06:52.624 glasswall.config.logging        INFO            __init__                Loaded Glasswall Editor version Editor: 2.1111.0 from C:\azure\sdk.editor\2.1111.0\windows-drop-cmake\glasswall_core2.dll
    Successfully processed Conformed file: glasswall_conform\input\05.HAIE-MEDER880.pdf
    Successfully processed Conformed file: glasswall_conform\input\NC-0000016.pdf
    Skipped file, unknown file type: glasswall_conform\input\New Text Document.txt
    Successfully processed Conformed file: glasswall_conform\input\cid\+Heal75050.pdf
    Successfully processed Conformed file: glasswall_conform\input\font_name_is_bytes\130510_Presentation3961.pdf
    Successfully processed Conformed file: glasswall_conform\input\jbig2\NC-0000015.pdf
    Successfully processed file: glasswall_conform\input\standard_fonts.pdf
    

    Simple integration with standalone Glasswall Conform

    Glasswall Conform can also operate independently of the Glasswall Python Wrapper.

    Process a single PDF, writing the new PDF to the output file path if it was handled successfully:

    from glasswall_conform.pdf import (
        graphic_processing,
        image_processing,
        process_pdf,
        text_processing,
    )
    
    # Set minimum success rates for graphics, images, and text
    graphic_processing.graphic_min_success_rate = 80
    image_processing.image_min_success_rate = 80
    text_processing.text_min_success_rate = 80
    
    # Process a single PDF
    process_pdf(
        r"<input_file_path>",
        r"<output_file_path>",
    )
    

    Process a directory of PDF files recursively, writing the successfully handled new PDFs to the output directory while maintaining the structure of the input directory:

    from glasswall_conform.pdf import (
        graphic_processing,
        image_processing,
        process_pdf_directory,
        text_processing,
    )
    
    # Set minimum success rates for graphics, images, and text
    graphic_processing.graphic_min_success_rate = 80
    image_processing.image_min_success_rate = 80
    text_processing.text_min_success_rate = 80
    
    # Process a directory of PDF files
    process_pdf_directory(
        r"<input_directory_path>",
        r"<output_directory_path>",
    )
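
    To make "maintaining the structure of the input directory" concrete, the sketch below maps an input file to its mirrored output location using os.path.relpath, the same approach the multiprocessing sample in this guide uses. Whether process_pdf_directory computes paths this way internally is an assumption; mirrored_output_path is an illustrative helper.

```python
import os


def mirrored_output_path(input_file, input_directory, output_directory):
    """Map an input file to the same relative location under output_directory."""
    relative_path = os.path.relpath(input_file, input_directory)
    return os.path.join(output_directory, relative_path)


input_file = os.path.join("input", "jbig2", "NC-0000015.pdf")
print(mirrored_output_path(input_file, "input", "output"))
```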
    

    When processing all files in a directory, the default behaviour when encountering a file that cannot be handled successfully is to continue processing the remaining files. This can be changed to terminate on the first error by setting the continue_on_error parameter to False:

    # Process a directory of PDF files, terminating on the first error
    process_pdf_directory(
        r"<input_directory_path>",
        r"<output_directory_path>",
        continue_on_error=False,
    )
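
    The difference between the two modes can be pictured with a small stand-in loop. This is only a sketch of the behaviour described above, not the implementation of process_pdf_directory: process_all and process_one are hypothetical names, and re-raising the first error is an assumption about what "terminate" means.

```python
def process_all(paths, process_one, continue_on_error=True):
    """Return {path: True/False}; re-raise the first error when continue_on_error is False."""
    results = {}
    for path in paths:
        try:
            process_one(path)
            results[path] = True
        except Exception:
            results[path] = False
            if not continue_on_error:
                raise
    return results


def fake_process(path):
    """Stand-in processor that fails for one file name."""
    if path == "bad.pdf":
        raise ValueError(path)


print(process_all(["a.pdf", "bad.pdf", "c.pdf"], fake_process))
```

    With continue_on_error=True the file after the failure is still processed; with False, the stand-in stops at bad.pdf and c.pdf is never attempted.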
    

    Python How To

    Disable or modify the logging level

    Valid logging levels are: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET

    import glasswall_conform
    
    glasswall_conform.config.logging.console_handler.setLevel("CRITICAL")
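
    The level names listed above are those of Python's standard logging module. Assuming console_handler is a regular logging.Handler (the setLevel call suggests so, but this is not confirmed here), the effect of each level can be explored with a standard-library stand-in that writes to an in-memory stream:

```python
import io
import logging

stream = io.StringIO()
console_handler = logging.StreamHandler(stream)  # stand-in for the Conform console handler
console_handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("conform_logging_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo records away from the root logger
logger.addHandler(console_handler)

console_handler.setLevel("CRITICAL")  # level names are accepted as strings
logger.error("dropped, ERROR is below CRITICAL")
logger.critical("kept")

console_handler.setLevel("INFO")
logger.debug("dropped, DEBUG is below INFO")
logger.info("kept as well")

print(stream.getvalue(), end="")
```

    A handler only emits records at or above its own level, so CRITICAL silences almost everything while NOTSET lets every record through.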
    

    Customise content handling rates

    By default, Glasswall Conform generates an output file whenever possible, even if only a portion of the original document's content has been successfully handled. This behaviour might not always be desirable, and can be customised for different types of content within each document.

    Glasswall Conform uses "best guesses" when handling malformed, corrupt, or unsupported text content to ensure that as much text as possible is transferred from the original document to the conformed document. For example, if the stroke colour of the text is malformed or in an unsupported colour format, the text is retained in the output document, with the stroke colour defaulting to black.

    This "best guess" approach may result in text that appears similar to the original, or in some cases, text that is not visible but still present in the output document. As we cannot guarantee that our best guess will handle the text in the same way as in the original document, the handling rate reflects this as content that has not been fully handled. Consequently, a low handling rate for text does not always indicate that the document will look visually different when best guesses are applied.

    import json
    
    from glasswall_conform.pdf import graphic_processing, image_processing, process_pdf_directory, text_processing
    
    graphic_processing.graphic_min_success_rate = 80
    image_processing.image_min_success_rate = 80
    text_processing.text_min_success_rate = 80
    
    
    processing_results = process_pdf_directory(
        r"glasswall_conform\input",
        r"glasswall_conform\output",
    )
    
    print(json.dumps(processing_results, indent=4))

    Example output:

    INFO     check_content_handling_rate   Handled 100.00% of graphics total (503/503) in file: 'glasswall_conform\input\NC-0000016.pdf'
    INFO     check_content_handling_rate   Handled 100.00% of   images total (6/6) in file: 'glasswall_conform\input\NC-0000016.pdf'
    INFO     check_content_handling_rate   Handled 100.00% of    texts total (22,768/22,768) in file: 'glasswall_conform\input\NC-0000016.pdf'
    INFO     process_pdf                     1.26s to successfully process PDF: 'glasswall_conform\input\NC-0000016.pdf' -> 'glasswall_conform\output\NC-0000016.pdf'
    ERROR    ltimage_handler               Unable to handle LTImage on page 1: Mask not supported
    ERROR    element_handler               Mask not supported
    INFO     check_content_handling_rate   Handled 100.00% of graphics total (118/118) in file: 'glasswall_conform\input\Set-08-059514.pdf'
    INFO     check_content_handling_rate   Handled   0.00% of   images total (0/1) in file: 'glasswall_conform\input\Set-08-059514.pdf'
    INFO     check_content_handling_rate   Handled 100.00% of    texts total (14,963/14,963) in file: 'glasswall_conform\input\Set-08-059514.pdf'
    ERROR    create_new_pdf                  0.54s to fail creating new PDF. Minimum content handling rate not met, output file not created: 'glasswall_conform\output\Set-08-059514.pdf'
    ERROR    create_new_pdf                Unable to create new PDF: 'glasswall_conform\output\Set-08-059514.pdf'
    INFO     process_pdf_directory           1.90s to process directory: 'glasswall_conform\input' -> 'glasswall_conform\output'
    {
        "glasswall_conform\\input\\NC-0000016.pdf": true,
        "glasswall_conform\\input\\Set-08-059514.pdf": false
    }
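
    The decision visible in these log lines can be reproduced with a short calculation. The sketch below is not Glasswall Conform's implementation: meets_minimum_rates is an illustrative helper, and both the "at or above" comparison and the treatment of documents with no content of a given type are assumptions.

```python
def meets_minimum_rates(handled_counts, minimum_rates):
    """Return (ok, rates): ok is False if any content type falls below its minimum.

    handled_counts maps a content type to (handled, total); rates are percentages.
    """
    rates = {}
    for content_type, (handled, total) in handled_counts.items():
        # Assumption: a document with none of this content type counts as fully handled.
        rates[content_type] = 100.0 if total == 0 else handled / total * 100
    ok = all(rates[name] >= minimum_rates.get(name, 0) for name in rates)
    return ok, rates


minimums = {"graphics": 80, "images": 80, "texts": 80}
# Counts taken from the Set-08-059514.pdf log lines above.
counts = {"graphics": (118, 118), "images": (0, 1), "texts": (14963, 14963)}
ok, rates = meets_minimum_rates(counts, minimums)
print(ok, rates)
```

    A single unhandled image is enough to fail this file because it is the only image in the document, so the image handling rate is 0% against a minimum of 80%.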
    

    Error handling

    Errors are categorised into specific classes under their respective module to provide clarity on the type of exception that occurred during the execution of functions. All error classes inherit from the base class GlasswallConformError.

    class GlasswallConformError(Exception):
        pass
    

    PDF Errors

    PdfExtractionError

    • Raised by: extract_pdf function
    • Description: This error occurs when there is a problem with extracting content from a PDF file. It may be due to corrupted files, the file format not being PDF, or other issues related to the extraction process.

    PdfCreationError

    • Raised by: create_new_pdf function
    • Description: This error occurs when there is a problem with creating a new PDF file. It may be due to issues with file permissions, disk space, or other issues preventing the creation of the PDF file.

    PdfWatermarkError

    • Raised by: add_watermark function
    • Description: This error occurs when there is a problem with adding a watermark to a PDF file. It may be due to invalid watermark settings, or other issues related to the watermarking process.

    PDF error handling example

    from glasswall_conform.pdf import process_pdf
    from glasswall_conform.pdf.errors import GlasswallConformError, PdfCreationError, PdfExtractionError, PdfWatermarkError
    
    # Catch individual exceptions
    try:
        process_pdf("<input_file_path>", "<output_file_path>")
    except PdfExtractionError as e:
        # Handle PdfExtractionError here
        pass
    except PdfCreationError as e:
        # Handle PdfCreationError here
        pass
    except PdfWatermarkError as e:
        # Handle PdfWatermarkError here
        pass
    
    # Or catch any exception raised by Glasswall Conform:
    try:
        process_pdf("<input_file_path>", "<output_file_path>")
    except GlasswallConformError as e:
        # Handle GlasswallConformError here
        pass
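
    When specific and general handlers are combined in one try statement, the specific handlers must come first, because the base class handler also matches every subclass. The sketch below demonstrates this with local stand-in classes that only borrow the names used in this guide; it assumes each error subclasses GlasswallConformError directly and does not import the real package.

```python
class GlasswallConformError(Exception):
    """Local stand-in for the base class described above."""


class PdfExtractionError(GlasswallConformError):
    """Local stand-in for the extraction error."""


class PdfCreationError(GlasswallConformError):
    """Local stand-in for the creation error."""


def classify(exc):
    """Return a label for the first handler that matches exc."""
    try:
        raise exc
    except PdfExtractionError:
        return "extraction"
    except GlasswallConformError:
        # Also reached by PdfCreationError, because it subclasses the base class.
        return "conform"
    except Exception:
        return "other"


print(classify(PdfExtractionError("unreadable")), classify(PdfCreationError("disk full")))
```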
    

    Watermarking

    Watermarking is enabled by default with the text "Glasswall Conform" in turquoise, located at the bottom-left of each page.

    Each customer can decide whether to watermark a PDF. Glasswall Conform exists to enforce compliance with the PDF ISO 32000 series of standards. It processes non-conformant PDFs to produce conformant versions, allowing the Glasswall Embedded Engine to perform Content Disarm and Reconstruction (CDR) on the intermediate file. This pre-processing stage is for PDF files that would otherwise be unprocessable by the Embedded Engine.

    Glasswall Conform does not guarantee a lossless conversion of data. Extracting visual content from a non-conformant PDF may result in some data being lost. The watermark indicates that the visual layer of the PDF has been transferred to a new document, which may have reduced data fidelity.

    Important: The Glasswall Embedded Engine's CDR procedure is lossless within the visual layer of the PDF. Any changes to the visual layer, such as deactivating hyperlinks or removing Acroforms, are controlled by the content management policy.

    The watermark can be customised:

    from glasswall_conform.pdf import watermark
    
    watermark.text = "Forced PDF Conform"  # "" to disable watermarking
    watermark.font_size = 10
    watermark.font_colour = (88, 230, 197, 1)  # RGBA
    watermark.x = 10  # 0, 0 is the bottom-left of the page
    watermark.y = 10
    

    CID Suppression

    In PDFs, some fonts use a system called CID (Character Identifier) to manage large sets of characters. When constructing a new PDF, if the tool encounters characters that cannot be processed, it replaces them with a default black square:

    suppress_cid: Optional[str] = "■"  # Black square: u'\u25A0' / chr(9632)
    

    Why is CID suppression important?

    1. Security: Custom fonts in PDFs can sometimes pose security risks. Malicious PDFs might use complex fonts to exploit vulnerabilities in PDF readers. Although this is rare, it could potentially lead to security issues, such as arbitrary code execution.

    2. Obfuscation: CIDs can be used to hide or obscure content within a PDF. For example, hidden text or encoded information might be embedded in a way that is not immediately visible, which could pose a risk if the content is malicious.

    Customising CID suppression

    You can adjust how unprocessable CIDs are represented in your PDFs:

    from glasswall_conform.pdf import text_processing
    
    # Use a str to replace CIDs, None to display CID numbers, or an empty string to hide the missing character
    text_processing.suppress_cid = "?"
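
    The three settings described in the comment above can be illustrated with a small stand-alone sketch. It does not use Glasswall Conform's internals: the "(cid:N)" token format and the apply_cid_suppression helper are assumptions made purely for illustration.

```python
import re
from typing import Optional

# Hypothetical token format for an unprocessable character, e.g. "(cid:42)".
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def apply_cid_suppression(text: str, suppress_cid: Optional[str] = "■") -> str:
    """Mimic the three documented suppress_cid behaviours on a plain string."""
    if suppress_cid is None:
        # None: leave the CID numbers visible.
        return text
    # Any string, including "": substitute it for every CID token.
    # A function is used so the replacement is always treated literally.
    return CID_TOKEN.sub(lambda _match: suppress_cid, text)


sample = "Total: (cid:17)(cid:42) units"
print(apply_cid_suppression(sample))        # Total: ■■ units
print(apply_cid_suppression(sample, "?"))   # Total: ?? units
print(apply_cid_suppression(sample, ""))    # Total:  units
print(apply_cid_suppression(sample, None))  # Total: (cid:17)(cid:42) units
```

    An empty string removes the placeholder entirely, which keeps the page clean but also hides the fact that a character was dropped. A visible replacement such as "?" makes the loss easier to spot during review.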
    

    Font replacement

    Glasswall Conform supports bold, italic, and bold italic variants of the base 14 Type 1 fonts and the Cambria font. Glasswall Conform also supports some custom fonts.

    The base 14 Type 1 fonts are:

    • Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
    • Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
    • Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
    • Symbol
    • ZapfDingbats

    You can find details on supported custom fonts in the glasswall_conform/pdf/fonts/definitions/custom directory.

    Embedded fonts that are not supported are replaced with the Cambria font. If Cambria does not support a glyph from an embedded font, the character is suppressed. For more information on this, see CID Suppression.

    Currently, font replacement cannot be disabled entirely.

    By default, some commonly embedded sans-serif fonts are replaced with Helvetica instead of Cambria for visual similarity. This, and other font replacement features, can be modified using the font_handler:

    from glasswall_conform.pdf.fonts import font_handler
    
    # Optionally modify font handling settings
    font_handler.use_base_14_fonts = True  # Default True
    font_handler.use_custom_fonts = True  # Default True
    font_handler.sans_serif_replacement = True  # Default True
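
    The replacement rules described above can be read as a simple decision chain. The sketch below is one plausible interpretation, not Glasswall Conform's implementation: the FontSettings class, the choose_font helper, the family lists, and the way the three flags gate each step are all assumptions for illustration.

```python
from dataclasses import dataclass

# Family names of the base 14 fonts listed in this guide.
BASE_14_FAMILIES = {"Courier", "Helvetica", "Times", "Symbol", "ZapfDingbats"}
# Hypothetical stand-ins; the real lists live inside Glasswall Conform.
CUSTOM_FAMILIES = {"ExampleCustomFont"}
COMMON_SANS_SERIF = {"ExampleSans"}


@dataclass
class FontSettings:
    """Stand-in for the three font_handler flags (all default to True)."""
    use_base_14_fonts: bool = True
    use_custom_fonts: bool = True
    sans_serif_replacement: bool = True


def choose_font(embedded_family: str, settings: FontSettings) -> str:
    """Return the family that would be used for an embedded font family."""
    # Supported fonts are kept as they are.
    if settings.use_base_14_fonts and embedded_family in BASE_14_FAMILIES:
        return embedded_family
    if settings.use_custom_fonts and embedded_family in CUSTOM_FAMILIES:
        return embedded_family
    # Unsupported but commonly embedded sans-serif fonts map to Helvetica.
    if settings.sans_serif_replacement and embedded_family in COMMON_SANS_SERIF:
        return "Helvetica"
    # Everything else falls back to Cambria.
    return "Cambria"


defaults = FontSettings()
for family in ("Courier", "ExampleSans", "SomeUnknownSerif"):
    print(family, "->", choose_font(family, defaults))
```

    In this reading, setting sans_serif_replacement to False would send the sans-serif fonts to Cambria along with every other unsupported font.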
    

    Adding a new custom font

    You can register additional custom fonts with Glasswall Conform. This requires a .ttf file, as collections like .ttc and other formats are not supported. Make sure to extract or convert your fonts to .ttf format if needed.

    The matches list of strings determines whether a font family matches the name of a font embedded in the PDF document. The family_name and regular arguments are required. If the bold, italic, or bolditalic variants are not provided and an embedded font is identified as one of those variants of the family, the regular font is used instead.

    The Font hash_value argument is optional, and if provided the hash will be verified before loading the font.
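
    To obtain a value for hash_value, the font file can be hashed locally. The hash values in this guide's registration example are 64 hexadecimal characters long, which is consistent with SHA-256, although the algorithm is not stated here. The sketch below therefore assumes SHA-256, and sha256_of_file is a hypothetical helper rather than part of Glasswall Conform.

```python
import hashlib
import sys
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Print one digest per font file passed on the command line.
    for name in sys.argv[1:]:
        print(name, sha256_of_file(Path(name)))
```

    If the digest printed for a .ttf file differs from the value you expect, treat the file as modified and do not register it.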

    from glasswall_conform.pdf.fonts import Font, FontFamily, font_handler
    
    # Optionally modify font handling settings
    font_handler.use_base_14_fonts = True  # Default True
    font_handler.use_custom_fonts = True  # Default True
    font_handler.sans_serif_replacement = True  # Default True
    
    # Register a new custom font
    font_handler.add_font_family(
        FontFamily(
            family_name="Arial",
            regular=Font(
                name="Arial",
                path="glasswall_conform/pdf/fonts/ttfs/custom/microsoft/arial.ttf",
                hash_value="baa251526d6862712a58e613ef451d8a2b60482142ec6aab1d47fb8e23e21a7c",
            ),
            bold=Font(
                name="Arial-Bold",
                path="glasswall_conform/pdf/fonts/ttfs/custom/microsoft/arialbd.ttf",
                hash_value="8df7a2c69fc4044835814899534e5fee6e72f78285b5a6dcb19531142b51d742",
            ),
            italic=Font(
                name="Arial-Italic",
                path="glasswall_conform/pdf/fonts/ttfs/custom/microsoft/ariali.ttf",
                hash_value="090b89742910172c69e1fd3b1814ad4e482a1c712b87d24e96b377beaac3a6d1",
            ),
            bolditalic=Font(
                name="Arial-BoldItalic",
                path="glasswall_conform/pdf/fonts/ttfs/custom/microsoft/arialbi.ttf",
                hash_value="94d0872622e6d592f01440b58dac8f5d7e010509bd76bb71cbed71fc5f4dc173",
            ),
            matches=["*Arial*"],
            custom=True,
        )
    )
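
    The matching and fallback behaviour described before this example can be sketched in isolation. The code below assumes that entries in matches are glob-style, case-sensitive patterns, which the "*Arial*" entry suggests but this guide does not confirm. FamilySketch, family_matches, and pick_variant are hypothetical names and are not part of Glasswall Conform.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class FamilySketch:
    """Simplified stand-in for FontFamily; fonts are plain file names here."""
    regular: str
    matches: List[str]
    bold: Optional[str] = None
    italic: Optional[str] = None
    bolditalic: Optional[str] = None


def family_matches(family: FamilySketch, embedded_name: str) -> bool:
    """True if any pattern in matches fits the embedded font name."""
    return any(fnmatchcase(embedded_name, pattern) for pattern in family.matches)


def pick_variant(family: FamilySketch, variant: str) -> str:
    """Return the requested variant, or the regular font if it is missing."""
    candidates = {
        "bold": family.bold,
        "italic": family.italic,
        "bolditalic": family.bolditalic,
    }
    return candidates.get(variant) or family.regular


arial = FamilySketch(regular="arial.ttf", bold="arialbd.ttf", matches=["*Arial*"])
print(family_matches(arial, "ABCDEF+Arial-BoldMT"))  # True
print(family_matches(arial, "Helvetica"))            # False
print(pick_variant(arial, "bold"))                   # arialbd.ttf
print(pick_variant(arial, "italic"))                 # arial.ttf
```

    Broad patterns such as "*Arial*" also match subset-prefixed or suffixed names, so it is worth checking that a pattern does not capture unrelated families.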
    

    Command line testing

    Glasswall Conform can also be run from the command line for basic testing.

    Basic Usage

    python -m glasswall_conform -i <input_dir> -o <output_dir>
    

    Watermarking within PDFs

    python -m glasswall_conform -i <input_dir> -o <output_dir> --suppress-cid "" --watermark "Forced PDF Conform"
    

    Note: Passing --suppress-cid "" prevents CID metadata relating to unsupported fonts from being placed within the visual layer of the PDF.
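
    When driving test runs from a script, the same command can be assembled as an argument list. This is a minimal sketch that uses only the -i, -o, --watermark, and --suppress-cid options shown in this guide. The build_conform_command and run_conform helpers are hypothetical names, and the directory names are placeholders.

```python
import subprocess
import sys
from typing import List, Optional


def build_conform_command(
    input_dir: str,
    output_dir: str,
    watermark: Optional[str] = None,
    suppress_cid: Optional[str] = None,
) -> List[str]:
    """Build the argument list for 'python -m glasswall_conform'."""
    command = [sys.executable, "-m", "glasswall_conform", "-i", input_dir, "-o", output_dir]
    # Only pass optional flags that were explicitly set; "" is a valid value.
    if watermark is not None:
        command += ["--watermark", watermark]
    if suppress_cid is not None:
        command += ["--suppress-cid", suppress_cid]
    return command


def run_conform(command: List[str]) -> int:
    """Run the command and return its exit code (requires glasswall_conform)."""
    return subprocess.run(command, check=False).returncode


command = build_conform_command(
    "input", "output", watermark="Forced PDF Conform", suppress_cid=""
)
print(command)
```

    Passing the arguments as a list avoids shell quoting problems, so the empty string for --suppress-cid reaches the tool as a real, empty argument.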

    Show CLI help

    python -m glasswall_conform -h
    
    usage: __main__.py [-h] -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [--log-path LOG_PATH] [--log-level-console {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       [--log-level-file {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}] [--display-image-errors] [--watermark WATERMARK] [--suppress-cid SUPPRESS_CID] [--custom-font-statistics]
    
    Glasswall Conform - Malformed file reconstruction tool
    
    options:
      -h, --help            show this help message and exit
      -i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
                            Required. Path to input directory of non-conforming files.
      -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                            Required. Path to output directory for reconstructed files.
      --log-path LOG_PATH   Optional. Path to output log file. Default is a timestamp-named file located at: '%TEMP%/glasswall_conform/logs'.
      --log-level-console {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing logs to console. Default INFO
      --log-level-file {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                            Optional. Set logging level for writing log files. Default INFO
      --display-image-errors
                            Optional. Enable logging of image related issues. Default False
      --watermark WATERMARK
                            Optional. Adds a watermark to each page of the reconstructed document. Default '' (disabled)
      --suppress-cid SUPPRESS_CID
                            Optional. Replace CID metadata that may be printed to the visual layer due to font array omissions with the supplied string, with placeholder text.
                            Glasswall Conform restricts the processing of PDFs to only known secure fonts. This is a deliberate security feature to make the PDF conform safely. Default '■'.
      --custom-font-statistics
                            Optional. Collect statistics on custom font files which are currently rejected due to security concerns. These statistics can be used to idenfity potential additions to font whitelists.
    

    CLI Production Use

    The CLI does not support all advanced features. For full functionality and greater control, it is recommended to use the sample code or similar Python scripts instead of the CLI.

