Content Management

    Content Management policies are a set of content management switches that can be applied to a particular file type.

    A content management switch identifies a file element type and the action associated with it.

    The content management setting specifies the action to be carried out by Glasswall for a particular content management switch. Each content management switch can be set to one of three settings:

    • Allow - The Glasswall Embedded Engine processes any associated file element types and they remain in the regenerated file. The associated structure is logged in the Analysis report as an Allowed Item.

    • Disallow - If any of the associated file element types are identified in the file, the Glasswall Embedded Engine identifies the file as being non-conforming and the file will not be regenerated. The associated structure is logged in the Analysis report as an Issue Item.

    • Sanitise - If any of the associated file element types are identified in the file, the Glasswall Embedded Engine removes them from the regenerated document. The associated structure is logged in the Analysis report as a Sanitisation Item.
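
    As a rough illustration of the three settings above, the following Python sketch models how each setting maps to a report item type and to whether the file is regenerated. The names `Setting`, `Outcome` and `outcome` are hypothetical and are not part of any Glasswall API; the sketch only restates the behaviour described in the list.

```python
from enum import Enum
from typing import NamedTuple


class Setting(Enum):
    ALLOW = "allow"
    DISALLOW = "disallow"
    SANITISE = "sanitise"


class Outcome(NamedTuple):
    report_item: str        # item type logged in the Analysis report
    regenerated: bool       # whether a regenerated file is produced
    element_retained: bool  # whether the element survives in the output


def outcome(setting: Setting) -> Outcome:
    """Outcome when an element governed by a switch is found in a file.

    Assumes nothing else in the file makes it non-conforming.
    """
    if setting is Setting.ALLOW:
        return Outcome("Allowed Item", True, True)
    if setting is Setting.DISALLOW:
        return Outcome("Issue Item", False, False)
    return Outcome("Sanitisation Item", True, False)
```

    For example, `outcome(Setting("sanitise"))` reports a Sanitisation Item and a regenerated file without the element.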

    Content Management Reporting

    The following sections show how content that is under the control of a content management switch is presented in the XML Analysis report, depending on the content switch setting.

    Allow

    This is an excerpt from the XML report for a Word (.doc) Binary file, which contains metadata. The content management switch metadata has been set to allow.

        <gw:Camera cameraName="wordConfig">
        <gw:ContentSwitch>
            <gw:ContentName>metadata</gw:ContentName>
            <gw:ContentValue>allow</gw:ContentValue>
        </gw:ContentSwitch>
        ...
        <gw:AllowedItems itemCount="1">
            <gw:AllowedItem>
                <gw:TechnicalDescription>Metadata detected in #05SummaryInformation</gw:TechnicalDescription>
                <gw:InstanceCount>1</gw:InstanceCount>
                <gw:TotalSizeInBytes>4096</gw:TotalSizeInBytes>
            </gw:AllowedItem>
        </gw:AllowedItems>
    

    Disallow

    This is an excerpt from the XML report for a Word (.doc) Binary file which has metadata inside it. The content management switch metadata has been set to disallow. In Protect Mode, this would cause the file to be marked as non-conforming.

        <gw:Camera cameraName = "wordConfig">
        <gw:ContentSwitch>
            <gw:ContentName>metadata</gw:ContentName>
            <gw:ContentValue>disallow</gw:ContentValue>
        </gw:ContentSwitch>
        ...
        <gw:IssueItem>
            <gw:TechnicalDescription> Metadata detected in #05SummaryInformation</gw:TechnicalDescription>
            <gw:IssueId>96</gw:IssueId>
            <gw:InstanceCount>1</gw:InstanceCount>
            <gw:RiskLevel>Medium</gw:RiskLevel>
        </gw:IssueItem>
    

    Sanitise

    This is an excerpt from the XML report for a Word (.doc) Binary file which has metadata inside it. The content management switch metadata has been set to sanitise. In Protect Mode, this would result in the metadata being removed from the regenerated file.

        <gw:Camera cameraName = "wordConfig">
        <gw:ContentSwitch>
            <gw:ContentName>metadata</gw:ContentName>
            <gw:ContentValue>sanitise</gw:ContentValue>
        </gw:ContentSwitch>
        ...
        <gw:SanitisationItem>
            <gw:TechnicalDescription>Metadata detected in #05SummaryInformation</gw:TechnicalDescription>
            <gw:InstanceCount>1</gw:InstanceCount>
            <gw:TotalSizeInBytes>4096</gw:TotalSizeInBytes>
        </gw:SanitisationItem>
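
    Report excerpts like the ones above can be summarised programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` module. The namespace URI, the `gw:Report` wrapper element and the `summarise` function are assumptions made for illustration, since the excerpts do not show the namespace declaration or the enclosing document structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the excerpts do not show the real one.
GW_NS = "urn:example:glasswall-report"
NS = {"gw": GW_NS}

# Hypothetical wrapper around the element names shown in the excerpts.
SAMPLE = f"""
<gw:Report xmlns:gw="{GW_NS}">
  <gw:Camera cameraName="wordConfig">
    <gw:ContentSwitch>
      <gw:ContentName>metadata</gw:ContentName>
      <gw:ContentValue>sanitise</gw:ContentValue>
    </gw:ContentSwitch>
  </gw:Camera>
  <gw:SanitisationItem>
    <gw:TechnicalDescription>Metadata detected in #05SummaryInformation</gw:TechnicalDescription>
    <gw:InstanceCount>1</gw:InstanceCount>
    <gw:TotalSizeInBytes>4096</gw:TotalSizeInBytes>
  </gw:SanitisationItem>
</gw:Report>
"""


def summarise(report_xml: str) -> dict:
    """Collect switch settings and reported items from a report string."""
    root = ET.fromstring(report_xml.strip())
    switches = {
        sw.findtext("gw:ContentName", namespaces=NS):
            sw.findtext("gw:ContentValue", namespaces=NS)
        for sw in root.iterfind(".//gw:ContentSwitch", NS)
    }
    items = []
    for kind in ("AllowedItem", "IssueItem", "SanitisationItem"):
        for item in root.iterfind(f".//gw:{kind}", NS):
            text = item.findtext("gw:TechnicalDescription", default="", namespaces=NS)
            items.append((kind, text.strip()))
    return {"switches": switches, "items": items}


result = summarise(SAMPLE)
```

    The search uses `.//` paths so that it does not depend on exactly where the item elements sit in the real report.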
    

    Content Management Policies

    These are the available content management policies:

    Content Management Switch | Description
    pdfConfig | Content management switch for PDF file type
    wordConfig | Content management switch for Word file type
    pptConfig | Content management switch for PowerPoint file type
    xlsConfig | Content management switch for Excel file type
    tiffConfig | Content management switch for TIFF file type
    svgConfig | Content management switch for SVG file type
    webpConfig | Content management switch for WebP file type
    jpegConfig | Content management switch for JPEG file type
    sysConfig | Content management switch to control different Engine settings

    Note: The xlsConfig, pptConfig and wordConfig content management policies cover both Office Open XML and Office Binary file types.

    The available content management switches are described in the table below:

    Content Management Switch | Description
    acroform | Controls interactive form (AcroForm) content
    javascript | Controls JavaScript code embedded in files
    external_hyperlinks | Controls hyperlinks to locations outside the file
    embedded_files | Controls embedded file content
    metadata | Controls file metadata
    actions_all | Controls PDF Actions such as Rendition, Sound, Movie, Hide, SetOCGState, GoTo3DView
    internal_hyperlinks | Controls hyperlinks to locations within the file
    value_outside_reasonable_limits | Controls Glasswall defined restrictions such as values exceeding a reasonable range, e.g. object sizes
    digital_signatures | Controls digital signature content for signed files or signed objects within files. NOTE: the 'allow' setting cannot be used for the digital_signatures content management switch.
    macros | Controls VBA Macros which use Visual Basic code to create custom user-generated functions
    review_comments | Controls document review comments within a file
    embedded_images | Controls embedded image content for the Glasswall supported image formats
    dynamic_data_exchange | Controls DDE commands and DDE content in documents
    tracked_changes | Controls tracked changes in documents
    hidden_data | Controls hidden data in documents
    in_text_comments | Controls in-text comments in documents
    slide_notes | Controls slide notes in documents
    connections | Controls connections to external data sources and information for constructs such as OLAP formulas, QueryTables or PivotTables
    scripts | Controls XML Scripts that allow for the creation, storage and manipulation of variables and data during processing
    foreign_objects | Controls embedded objects in XML based formats such as SVG
    hyperlinks | Controls external and internal hyperlinks
    geotiff | Controls georeferencing information embedded within a TIFF file
    jfif | Controls JFIF marker segments within a JPEG image file
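
    A policy can be checked against the switch names and settings listed above before it is applied. The Python sketch below is a minimal illustration, assuming a policy is held as a plain `{switch: setting}` mapping; `validate_policy` is a hypothetical helper, not a Glasswall function, and the switch names are copied from the table above.

```python
ALLOWED_SETTINGS = {"allow", "disallow", "sanitise"}

# Switch names as listed in the description table.
KNOWN_SWITCHES = {
    "acroform", "javascript", "external_hyperlinks", "embedded_files",
    "metadata", "actions_all", "internal_hyperlinks",
    "value_outside_reasonable_limits", "digital_signatures", "macros",
    "review_comments", "embedded_images", "dynamic_data_exchange",
    "tracked_changes", "hidden_data", "in_text_comments", "slide_notes",
    "connections", "scripts", "foreign_objects", "hyperlinks", "geotiff",
    "jfif",
}


def validate_policy(policy: dict[str, str]) -> list[str]:
    """Return a list of problems found in a {switch: setting} mapping."""
    problems = []
    for switch, setting in policy.items():
        if switch not in KNOWN_SWITCHES:
            problems.append(f"unknown switch: {switch}")
        elif setting not in ALLOWED_SETTINGS:
            problems.append(f"{switch}: unknown setting '{setting}'")
        elif switch == "digital_signatures" and setting == "allow":
            # Per the note in the table, 'allow' is not valid for this switch.
            problems.append("digital_signatures: 'allow' cannot be used")
    return problems
```

    An empty list means the mapping uses only known switch names and valid settings.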

    The switches currently available for each format are depicted in the table below:

    Switch | PDF | DOC | DOCX | PPT | PPTX | XLS | XLSX | GIF | JPEG | SVG | WEBP | TIFF
    acroform
    actions_all
    connections
    digital_signatures *
    dynamic_data_exchange
    embedded_files
    embedded_images
    external_hyperlinks
    foreign_objects
    geotiff
    hidden_data
    hyperlinks
    internal_hyperlinks
    in_text_comment
    javascript
    jfif
    macros
    metadata
    retain_exported_streams *
    review_comments
    slide_notes *
    scripts
    tracked_changes
    value_outside_reasonable_limits

    [*]: Content management switch available in Editor's "enablerebuild" (default) mode or Rebuild only
    [†]: Content management switch available in Editor's "editoronly" mode, which can only be used with the Export/Import feature

    All content types not represented by a content management type for a specific file format will be automatically remediated by the Glasswall engine if identified as malicious.

    Embedded Files

    The "Embedded Files" content management type applies to non-image file formats which are located within a distinct container file. For MS-Office formats, the policy for embedded files is applied differently depending on whether the file considered is supported and accessible to the engine:

    Action applied to embedded file according to content management policy for Microsoft Office files:

    Embedded file format | Allow | Sanitise | Disallow
    Supported | Treated as standalone file. If file is non-conforming, containing file is rejected and reason for non-conformance reported as an Issue Item. | Treated as standalone file. If file is non-conforming, containing file is rejected and reason for non-conformance reported as an Issue Item. | Containing file is rejected, with the embedded file described in an Issue Item.
    Unsupported | Regenerated without alteration and reported as an Allowed Item. | Removed from containing file, alongside all references to it, and reported as a Sanitisation Item. | Containing file is rejected, with the embedded file described in an Issue Item.
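
    The decision table above can be read as a small function of two inputs: whether the embedded format is supported, and the switch setting. The Python sketch below restates it; `embedded_file_action` is a hypothetical name and the returned strings are paraphrases of the table cells, not engine output.

```python
def embedded_file_action(supported: bool, setting: str) -> str:
    """Action applied to a file embedded in a Microsoft Office document."""
    if setting not in {"allow", "sanitise", "disallow"}:
        raise ValueError(f"unknown setting: {setting}")
    if setting == "disallow":
        # Same result whether or not the embedded format is supported.
        return "reject containing file; report embedded file as Issue Item"
    if supported:
        # Allow and Sanitise behave identically for supported formats.
        return ("process as standalone file; reject containing file "
                "with an Issue Item if non-conforming")
    if setting == "allow":
        return "regenerate without alteration; report as Allowed Item"
    return "remove file and all references to it; report as Sanitisation Item"
```

    Note that for supported formats the Allow and Sanitise settings lead to the same treatment, so the setting only changes the result for unsupported formats or when Disallow is chosen.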

    The table below shows which embedded file formats are considered supported () for each containing file format versus those which are unsupported ():

    Embedded File Format | DOCX/XLSX/PPTX | DOC/XLS/PPT | PDF
    Office 2003
    Office 1997
    PDF
    MP3n/a
    MP4n/a
    MPEGn/a
    WAV
    Formats unsupported by Glasswall

    [†]: Disallowed by container format

    [‡]: Not removed by Embedded Files switch but may be removed by All Actions switch. Embedded file is regenerated without being processed.

    ⚠️ Note: To preserve visual integrity between the original and sanitised versions of files, associated visual elements (such as thumbnails and blip references) of unsupported embedded files are not removed during sanitisation. This ensures that post-processed files remain visually consistent with their original versions.

    Embedding Depth Support

    The Embedded Engine supports processing OfficeXML files down to nine levels of embedded content. It can traverse and analyze files nested within other files to a maximum depth of nine. OfficeXML files with ten or more levels of embedding are not supported and will be rejected.
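
    The depth limit can be illustrated with a short Python sketch that measures how deeply files are nested and compares the result with the limit of nine. The `Node` type and the way levels are counted (a file with no embedded content has depth 0) are assumptions made for this example, not a description of the engine's internals.

```python
from dataclasses import dataclass, field

MAX_EMBEDDING_DEPTH = 9  # limit stated in the paragraph above


@dataclass
class Node:
    """A file and the files embedded directly inside it (hypothetical model)."""
    name: str
    embedded: list["Node"] = field(default_factory=list)


def embedding_depth(node: Node) -> int:
    """Number of embedding levels below this file (0 if nothing is embedded)."""
    if not node.embedded:
        return 0
    return 1 + max(embedding_depth(child) for child in node.embedded)


def is_depth_supported(node: Node) -> bool:
    return embedding_depth(node) <= MAX_EMBEDDING_DEPTH


def chain(levels: int) -> Node:
    """Build a file with `levels` levels of single-file nesting, for testing."""
    root = Node("level-0")
    current = root
    for i in range(1, levels + 1):
        child = Node(f"level-{i}")
        current.embedded.append(child)
        current = child
    return root
```

    Under this counting, `is_depth_supported(chain(9))` is true and `is_depth_supported(chain(10))` is false.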

    Embedded Images

    For image file formats, the "Embedded Images" content management switch should be used. This has the following behaviour depending on switch setting:

    • Sanitise - For MS Office, embedded images in supported formats are processed as standalone files. If the embedded image is conforming, it is regenerated; otherwise, both the containing file and the embedded image are rejected. Unsupported image formats are removed. In PDFs, embedded images are not processed and are always regenerated if the entry is structurally correct.
    • Disallow - Embedded images are forbidden. If one is found, the containing file is rejected.
    • Allow - Embedded images are not processed and are always regenerated as long as they are a supported file format.

    The table below shows which image formats we attempt to regenerate () when "Embedded Images" is set to "Sanitise" versus those which are removed ():

    Embedded Image Format | DOCX/XLSX/PPTX | DOC/XLS/PPT | PDF
    BMP, JPEG, GIF, PNG, EMF, SVG, TIFF
    WMF, EMF
    WebP
    Formats unsupported by Glasswall

    [⸸]: Will be converted to a different format by container file

    Please note that when the "Embedded Images" switch is set to "Disallow", any image encountered results in the rejection of the containing file. This includes thumbnails of the containing or embedded documents, so this setting may supersede the "Embedded Files" content management switch.

    Macros

    The macros content switch for MS Office files applies to both Microsoft Visual Basic for Applications (VBA) and Excel 4.0 macros.

    Microsoft Visual Basic for Applications

    VBA macros are written in the Visual Basic programming language and can be included in any MS Office file format. The handling of VBA macros can be configured as follows:

    • Sanitise - VBA macros are removed from files.
    • Disallow - VBA macros are forbidden. If one is found, the containing file is rejected.
    • Allow - VBA macros are processed and regenerated as part of the containing file providing they conform to specification.

    Excel 4.0 Macros

    Excel 4.0 macros are a legacy feature included in XLSX and XLS files. XLSX files containing Excel 4.0 macros will be saved using the ".xlsm" file extension and will produce an error if this extension is modified. The handling of Excel 4.0 macros can be configured as follows:

    • Sanitise - In XLS files, the file will be blocked and "Excel 4.0 Macro found: Not supported" reported as an Issue Item. In XLSX/XLSM files, sheets containing macros will be removed from the document and reported as a Sanitisation Item. If this causes the file to be malformed (i.e. by reducing the number of visible sheets to zero), the file will be rejected and an appropriate Issue Item reported.
    • Disallow - Excel 4.0 macros are forbidden. If one is found, the containing file is rejected.
    • Allow - In XLS files, the file will be blocked and "Excel 4.0 Macro found: Not supported" reported as an Issue Item. In XLSX/XLSM files, the file will be regenerated with macros intact.
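
    Because the Excel 4.0 macro behaviour depends on both the file format and the setting, it can help to see it as a single function. The Python sketch below restates the three bullets above; `excel4_macro_outcome` and its `visible_sheets_remain` parameter are hypothetical names, and the returned strings are summaries rather than engine output.

```python
def excel4_macro_outcome(file_format: str, setting: str,
                         visible_sheets_remain: bool = True) -> str:
    """Outcome for a workbook that contains Excel 4.0 macros."""
    fmt = file_format.lower()
    if fmt not in {"xls", "xlsx", "xlsm"}:
        raise ValueError(f"not an Excel format: {file_format}")
    if setting not in {"allow", "sanitise", "disallow"}:
        raise ValueError(f"unknown setting: {setting}")

    if setting == "disallow":
        return "rejected"
    if fmt == "xls":
        # XLS is blocked under both Allow and Sanitise.
        return "blocked: Excel 4.0 Macro found: Not supported"
    if setting == "allow":
        return "regenerated with macros intact"
    # Sanitise on XLSX/XLSM: macro sheets are removed, unless that would
    # leave the workbook malformed (no visible sheets left).
    if not visible_sheets_remain:
        return "rejected: removing macro sheets leaves no visible sheets"
    return "regenerated with macro sheets removed"
```

    The sketch makes one consequence of the rules explicit: an XLS file containing Excel 4.0 macros is never regenerated, whatever the setting.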

    Metadata

    In OOXML, metadata refers to information that describes the content, structure, and properties of a document but is not part of the document's main content. Metadata in OOXML documents is primarily stored in XML files located within the docProps directory:

    1. core.xml: Contains core properties based on the Dublin Core Metadata Element Set.
    2. app.xml: Contains extended properties specific to Microsoft Office applications.
    3. custom.xml: Contains custom properties.

    The handling of OOXML metadata can be configured as follows:

    • Sanitise - The file is regenerated with metadata removed (see below for all the properties currently sanitised)
    • Disallow - Metadata is forbidden. If any metadata (properties listed below) is found, the containing file is rejected.
    • Allow - The file is processed and the metadata is regenerated.

    The 'metadata' content management switch currently sanitises the following properties:

    • core.xml: title, subject, creator, keywords, description, lastModifiedBy, revision, lastPrinted, created, modified, category, contentStatus, language, and version.
    • app.xml: manager, company, and hyperlinkBase
    • custom.xml: any custom properties added to the OOXML document.
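
    To see which of the core properties listed above are present in a given OOXML file, the package can be inspected with Python's standard library. This is only an inspection aid written for this article, not a description of how the Glasswall engine processes metadata; `core_metadata` is a hypothetical helper, and the namespace URIs are the standard ones used by docProps/core.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Qualified names of the core.xml properties listed above.
CORE_PROPERTIES = [
    "dc:title", "dc:subject", "dc:creator", "cp:keywords", "dc:description",
    "cp:lastModifiedBy", "cp:revision", "cp:lastPrinted", "dcterms:created",
    "dcterms:modified", "cp:category", "cp:contentStatus", "dc:language",
    "cp:version",
]


def core_metadata(ooxml) -> dict[str, str]:
    """Return the non-empty core properties found in an OOXML package.

    `ooxml` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(ooxml) as package:
        if "docProps/core.xml" not in package.namelist():
            return {}
        root = ET.fromstring(package.read("docProps/core.xml"))
    found = {}
    for qname in CORE_PROPERTIES:
        element = root.find(qname, NS)
        if element is not None and element.text:
            found[qname.split(":")[1]] = element.text
    return found


# Usage with a minimal in-memory package; a real .docx path works the same way.
SAMPLE_CORE = (
    f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}">'
    "<dc:title>Report</dc:title><dc:creator>Alice</dc:creator>"
    "<cp:revision>3</cp:revision></cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("docProps/core.xml", SAMPLE_CORE)
properties = core_metadata(buffer)
```

    Running the same helper on a file before and after sanitisation gives a quick view of which properties were removed.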

    OfficeXML (DOCX, XLSX, PPTX) Exclusive Switches

    Hidden Data

    Office file formats offer multiple ways of legitimately "hiding" text or data, including whole Excel sheets, PowerPoint slides or lines of text in a Word document. The Glasswall engine deals with hidden data in the following ways, depending on the content management switch setting:

    • Sanitise - The file is regenerated with all hidden data "unhidden" so it is completely visible to the user.
    • Disallow - Hidden data is forbidden. If any hidden data is found, the containing file is rejected.
    • Allow - Any hidden data is regenerated and remains hidden.

    Note: For the purposes of this content management setting, "Hidden Data" does not refer to the varied ways to obfuscate or bury data in Office 2007 files. Rather, it is specific to the methods of hiding data that are readily available in the Office 2007 GUI. Obfuscated or concealed data is managed by the policy setting corresponding to the method used, e.g. the metadata switch removes data concealed within free-text fields in the document's metadata.

    Tracked Changes

    The tracked_changes content management switch refers to content added by the "Track Changes" functionality in DOCX and XLSX files, also known as "revisions". These can contain historic data related to previous versions of the document, including names of contributors and records of content that has since been removed or obfuscated. The handling of tracked changes can be configured as follows:

    • Sanitise - All historic data is removed and "Track Changes" disabled. The regenerated document will be equivalent to the final state of the original document.
    • Disallow - Tracked changes are forbidden. If there is any evidence of previous revisions or tracked changes still present in the file, the file will be rejected.
    • Allow - The file is regenerated with all historic changes, revisions and tracked changes intact.

    Slide Notes

    The slide_notes content management switch refers to content added by the "Notes" functionality in PPTX files, also known as "slide notes" (and/or "speaker notes"). The Glasswall engine deals with these slide notes in the following ways, depending on the configuration of the content management switch setting:

    • Sanitise - The file is regenerated with all slide notes removed.
    • Disallow - Slide notes are forbidden. If any slide notes are found, the containing file is rejected.
    • Allow - Any slide notes are regenerated and remain in the file.

    In-Text Comments

    The in_text_comment switch refers to content added by the "In-Text Comments" functionality in DOCX files. The handling of the switch can be configured as follows:

    • Sanitise - In-text comments are removed alongside the corresponding document metadata found in core.xml.
    • Disallow - In-text comments are forbidden. Any DOCX file containing an in-text comment is rejected and not regenerated.
    • Allow - The file is regenerated with the in-text comments present in the regenerated DOCX file.

    Note: When the in_text_comment switch is set to allow and the metadata switch is set to sanitise, the regenerated file retains the in-text comment but without any data, because the metadata switch sanitises the corresponding description from the core.xml file.

    PDF Exclusive Switches

    Digital Signatures

    Overview
    PDF files may contain digital signatures and AcroForms, and certain types of AcroForm can contain digital signatures. While digital signatures are used to verify the authenticity and integrity of a document, AcroForms provide the structural foundation for interactive form fields. When a digital signature is present in the PDF, the AcroForm holds the visible representation of the signature.

    When processing PDF files that include digital signatures, the Glasswall CDR engine applies a sanitisation process designed to preserve visual integrity while removing active and/or potentially risky content.

    How the CDR Engine handles Digital Signatures
    To ensure both document safety and consistency, the Glasswall CDR engine performs the following actions during sanitisation:

    • Removes the cryptographic signature data, including any embedded certificates, validation logic, or scripts.
    • Strips signature-related metadata and interactive behavior to eliminate execution pathways or potential exploits.
    • Preserves the visual appearance of the signature widget, such as the signature image, signer name, and date/time text. This is achieved by flattening it into the static content layer of the PDF.

    AcroForm | Digital Signature | Expected AcroForm behavior | Expected Digital Signature behavior | Behavior of AcroForm section containing Digital Signature | Is File Regenerated?
    Allow | Allow | Regenerated without sanitisation | Regenerated without sanitisation | Entire section (including interactive form and digital signature) is preserved as-is | Yes
    Sanitise | Allow | Sanitised (removed or flattened) | Regenerated without sanitisation | Visual digital signature is preserved; AcroForm field it resides in is sanitised or removed | Yes
    Allow | Sanitise | Regenerated without sanitisation | Sanitised (cryptographic elements removed) | Visual part of digital signature is preserved as part of the AcroForm; signature becomes non-functional | Yes
    Sanitise | Sanitise | Sanitised | Sanitised | Entire digital signature section, including AcroForm fields, is removed or flattened visually | Yes
    Disallow | * | Not applicable | Not applicable | File is not regenerated due to disallowed AcroForm presence | No
    * | Disallow | Not applicable | Not applicable | File is not regenerated due to disallowed Digital Signature presence | No
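
    The combinations in the table above can be expressed as a lookup keyed on the two switch settings. The Python sketch below mirrors the table rows for a PDF that contains both an AcroForm and a digital signature; `signed_form_outcome` is a hypothetical name and the descriptions are shortened paraphrases of the table cells.

```python
SETTINGS = {"allow", "sanitise", "disallow"}

# (acroform, digital_signatures) -> behaviour of the AcroForm section that
# contains the signature, summarised from the table rows.
SECTION_BEHAVIOUR = {
    ("allow", "allow"): "entire section preserved as-is",
    ("sanitise", "allow"): "visual signature preserved; AcroForm field sanitised or removed",
    ("allow", "sanitise"): "visual signature preserved; signature becomes non-functional",
    ("sanitise", "sanitise"): "entire section removed or flattened visually",
}


def signed_form_outcome(acroform: str, digital_signatures: str) -> tuple[bool, str]:
    """Return (is_file_regenerated, description) for the two settings."""
    for value in (acroform, digital_signatures):
        if value not in SETTINGS:
            raise ValueError(f"unknown setting: {value}")
    if acroform == "disallow":
        return False, "not regenerated: disallowed AcroForm present"
    if digital_signatures == "disallow":
        return False, "not regenerated: disallowed digital signature present"
    return True, SECTION_BEHAVIOUR[(acroform, digital_signatures)]
```

    A Disallow setting on either switch takes precedence over the other switch, which corresponds to the "*" wildcard in the last two rows.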

    Auditability and Chain of Custody

    To support traceability and accountability in secure environments, the Glasswall CDR engine records the cryptographic hashes of both the input and output files. This enables a system integrator:

    • To verify file provenance through hash comparison.
    • To provide assurance that, where a digital signature is no longer valid, the chain of custody is maintained and can be proven.
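
    On the integrator side, hash comparison can be sketched with Python's standard `hashlib` module. The source does not state which hash algorithm the engine records, so SHA-256 is an assumption here, and `sha256_of`, `custody_record` and `matches` are hypothetical helper names.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def custody_record(original: BinaryIO, regenerated: BinaryIO) -> dict[str, str]:
    """Pair the hashes of the input file and the regenerated output file."""
    return {
        "input_sha256": sha256_of(original),
        "output_sha256": sha256_of(regenerated),
    }


def matches(stream: BinaryIO, recorded_hash: str) -> bool:
    """Check a file against a previously recorded hash."""
    return sha256_of(stream) == recorded_hash


# Usage with in-memory data; files opened with open(path, "rb") work the same way.
record = custody_record(io.BytesIO(b"original bytes"), io.BytesIO(b"regenerated bytes"))
```

    Keeping both hashes in one record links the regenerated file back to the exact input it was produced from, even when the original signature no longer validates.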
