## Features

Glasswall Conform is a command-line tool designed for preprocessing PDF documents. It extracts and reconstructs visual content to ensure documents meet PDF standards, preparing them for further processing by the **Glasswall Embedded Engine**, which provides comprehensive Content Disarm and Reconstruction (CDR) protection.

**Key Features:**
- **Text, Graphic, and Image Extraction:** Extracts and reconstructs text, graphics, and images from PDFs, producing a clean, standards-compliant output document.
- **Handling Rate Threshold:** Allows setting a minimum handling rate for graphics, images, or text. Files that fail to meet this threshold are classified as failures and will not be saved.
- **Custom Watermarking:** Supports adding custom watermark text on each page of the reconstructed PDF, enabling personalised branding or messaging.
- **Character Identifier (CID) and Glyph Suppression:** Suppresses unsupported glyphs and character identifiers (CIDs), replacing them with the default question mark character (?).
- **Font Replacement:** In Cautious Mode, converts custom embedded fonts to known-good Microsoft fonts, falling back to Cambria Math when necessary. This aims to provide the best possible text display, even when custom fonts are not supported.
- **Standards Compliance:** Produces a reconstructed PDF that adheres to PDF standards, so that the Glasswall Embedded Engine can then apply full Content Disarm and Reconstruction (CDR) protection.
- **Fast Mode:** Enabled by default, Fast Mode processes files quickly while maintaining accurate visual appearance.
  - Fastest processing speed.
  - Best visual appearance.
  - Custom embedded fonts are not replaced.
  - May not be suitable for scenarios requiring very strict compliance with PDF standards.
- **Cautious Mode:** This fallback mode is automatically used when Fast Mode cannot process a file or is disabled.
  - Slower processing speed.
  - In a small number of cases, may reduce visual fidelity, for example:
    - Degraded or missing images and graphics.
    - Differences in text appearance (e.g. size, font style, or spacing).
    - Missing text when unknown embedded fonts are in use.
  - Processes PDFs with stricter compliance to specifications.
  - Replaces custom embedded fonts with known-good fonts.
  - Preferable only for scenarios requiring very strict compliance with PDF standards, even at the cost of visual fidelity.
- **File Inclusion and Exclusion Filtering:** Specify which files to process or exclude using absolute paths or wildcard patterns.
- **Output File Categorisation:** Defines how output files are organised:
  - `categorised` organises output files into subdirectories based on processing status (`engine_success`, `conform_success`, `failure`).
  - `mirrored` places successfully processed output files directly in the output directory, maintaining the original input directory structure. Failed files are not copied.
- **Post-Processing Summary:** Provides detailed information on processing results, including file statuses, memory usage, and processing time.
- **In-Memory Processing:** Supports `engine_memory` and `conform_only_memory` modes, which process files entirely in memory: input is supplied base64-encoded via standard input, and the output files are returned base64-encoded via standard output. Ideal for integration with systems that avoid disk-based I/O.
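
As an illustration of the in-memory modes, the following Python sketch shows how a caller might base64-encode a PDF, pipe it to a process over standard input, and decode base64 from standard output. It is not taken from the Glasswall Conform documentation: the function names, the `command` argument, and the client-side timeout are hypothetical, and it assumes that standard output carries a single base64 payload and nothing else. The actual executable name, arguments, and output framing should be taken from the tool's CLI reference.

```python
import base64
import subprocess
import sys
from typing import Sequence


def encode_pdf_for_stdin(pdf_bytes: bytes) -> bytes:
    """Base64-encode raw PDF bytes so they can be written to a process's stdin."""
    return base64.b64encode(pdf_bytes)


def decode_stdout_payload(payload: bytes) -> bytes:
    """Decode a base64 payload read from stdout back into raw bytes.

    Assumption: stdout holds exactly one base64 payload, optionally followed
    by a newline. The tool's real output framing may differ.
    """
    return base64.b64decode(payload.strip(), validate=True)


def process_in_memory(
    command: Sequence[str],
    pdf_bytes: bytes,
    timeout_s: float = 300.0,  # hypothetical client-side limit, not a tool setting
) -> bytes:
    """Pipe base64 input to `command` and decode its base64 output.

    `command` is supplied by the caller; the executable name and arguments
    for the in-memory modes are deliberately not hard-coded here.
    """
    result = subprocess.run(
        list(command),
        input=encode_pdf_for_stdin(pdf_bytes),
        capture_output=True,
        timeout=timeout_s,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return decode_stdout_payload(result.stdout)


if __name__ == "__main__":
    # Stand-in command that copies stdin to stdout. It is NOT Glasswall Conform
    # and only demonstrates the base64 round trip.
    echo = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
    ]
    print(process_in_memory(echo, b"%PDF-1.7 demo"))
```

Passing the command in as a parameter keeps the sketch independent of any particular installation path or flag set, and no temporary files are written at any point.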

## Constraints and limitations

While Glasswall Conform is a powerful tool, certain constraints and limitations should be considered:

- **Image Handling:** Some image colour spaces are unsupported and may be ignored. Additionally, image processing may convert compressed images to a lossless format, which can increase file size.
- **Font Handling:** Glasswall Conform supports the Base 14 fonts and many Microsoft fonts. Unsupported custom fonts are replaced to mitigate potential risks.
- **PDF Structure:** PDFs missing essential structural elements (e.g., root catalog, cross-reference tables) may not be recoverable.
- **Memory Usage:** PDFs with many images may consume significant memory. The tool has been tested with files up to 50 MB; larger files may cause performance issues.
- **Colour Spaces:** The CalRGB colour space is not supported.
- **Graphics Handling:** Support for complex graphics, such as shapes, charts, and graphs, is limited. This version prioritises text integrity.
- **Document Recovery:** Severely corrupted PDFs may be unrecoverable.
- **Platform Support:** Glasswall Conform is available for both **Windows** and **Linux**. For Windows we provide an `.exe` installer. For Linux we provide `.rpm` and `.deb` packages which support Linux distributions such as **Rocky 9**, **Rocky 8**, **Ubuntu 24**, and **Ubuntu 22**.
- **Timeout and Memory Configuration:** The following table presents our findings on how configurable timeout and memory settings impact the overall processing success rate and total runtime when running on **d16-v3 VMs**, each with **16 vCPUs and 64GB RAM**:

  | **Timeout** | **Memory Limit** | **Runtime (7 VMs)** | **Aggregate Processing Time** | **Processing Time Increase** | **Files Processed** | **Success Rate** | **Success Rate Increase** |
  |-------------|------------------|---------------------|-------------------------------|------------------------------|---------------------|------------------|---------------------------|
  | **180s**    | **4GB**          | 65 minutes          | 350 minutes                   | *Baseline*                   | 2,875 / 3,073       | 93.56%           | *Baseline*                |
  | **300s**    | **8GB**          | 79 minutes          | 428 minutes                   | +23%                         | 2,939 / 3,073       | 95.64%           | +2.08 pp                  |
  | **600s**    | **12GB**         | 96 minutes          | 514 minutes                   | +47%                         | 2,947 / 3,073       | 95.90%           | +2.34 pp                  |
  | **1200s**   | **20GB**         | 145 minutes         | 689 minutes                   | +97%                         | 2,952 / 3,073       | 96.06%           | +2.50 pp                  |
  
  - Increasing the timeout and memory limit yields a **higher success rate** at the cost of increased runtime.
  - The **300s / 8GB configuration** improves the success rate by **2.08 percentage points over 180s / 4GB**, with a **23% increase in processing time**.
  - The **600s / 12GB configuration** improves the success rate by **2.34 percentage points over 180s / 4GB**, with a **47% increase in processing time**.
  - The **1200s / 20GB configuration** provides **only a marginal increase** in processed files (+5 over 600s / 12GB), with a **97% increase in processing time** over 180s / 4GB.
  - The optimal configuration depends on whether speed or processing success rate is the higher priority. The **300s / 8GB configuration** is a well-balanced choice when at least **64GB RAM** is available: it allows **8+ files to be processed in parallel**, improves the success rate by **2.08 percentage points over 180s / 4GB**, and keeps the processing time increase to a **reasonable 23%**.
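
The figures in the table can be rechecked with a few lines of arithmetic. The Python sketch below recomputes each success rate from the files-processed counts and the increase over the baseline in percentage points, and estimates how many files fit in memory at once. The parallelism estimate is an assumption rather than a documented behaviour: it divides the 64GB of VM RAM by the per-file memory limit and ignores operating system overhead. All names in the code are illustrative.

```python
# (timeout in seconds, memory limit in GB, files processed), copied from the table.
RESULTS = [
    (180, 4, 2875),
    (300, 8, 2939),
    (600, 12, 2947),
    (1200, 20, 2952),
]
TOTAL_FILES = 3073
VM_RAM_GB = 64


def success_rate(processed: int, total: int = TOTAL_FILES) -> float:
    """Success rate as a percentage, rounded to two decimals like the table."""
    return round(100.0 * processed / total, 2)


def max_parallel_files(memory_limit_gb: int, ram_gb: int = VM_RAM_GB) -> int:
    """Assumed upper bound on concurrent files if each uses its full memory limit."""
    return ram_gb // memory_limit_gb


def summarise() -> list[tuple[int, int, float, float, int]]:
    """Return (timeout, memory, rate, gain in percentage points, parallel files)."""
    baseline = success_rate(RESULTS[0][2])
    rows = []
    for timeout_s, mem_gb, processed in RESULTS:
        rate = success_rate(processed)
        # Difference of the rounded rates, which is how the table reports it.
        gain_pp = round(rate - baseline, 2)
        rows.append((timeout_s, mem_gb, rate, gain_pp, max_parallel_files(mem_gb)))
    return rows


for timeout_s, mem_gb, rate, gain_pp, parallel in summarise():
    print(
        f"{timeout_s}s / {mem_gb}GB: {rate:.2f}% "
        f"(+{gain_pp:.2f} pp), up to {parallel} files at the full memory limit"
    )
```

Under this assumption, the 8GB limit leaves room for 8 files at their full limit on a 64GB machine, while the 12GB and 20GB limits leave room for 5 and 3 respectively, which is one reason the larger limits cost more wall-clock time.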

## Licensing

Glasswall Conform includes [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright) software, which is available under both the open-source [AGPL](https://www.gnu.org/licenses/agpl-3.0.html) and commercial license agreements via [Artifex](https://artifex.com/licensing/). Glasswall holds a commercial distribution license agreement covering Glasswall Conform.