Content Management

Content Management policies are a set of content management switches that can be applied to a particular file type.

Each content management switch identifies a file element type and the action associated with it.

The content management setting specifies the action to be carried out by Glasswall for a particular content management switch. Each content management switch can be set to one of three settings:

  • Allow - The Glasswall Embedded Engine processes any associated file element types and they remain in the regenerated file. The associated structure is logged in the Analysis report as an Allowed Item.

  • Disallow - If any of the associated file element types are identified in the file, the Glasswall Embedded Engine identifies the file as being non-conforming and the file will not be regenerated. The associated structure is logged in the Analysis report as an Issue Item.

  • Sanitise - If any of the associated file element types are identified in the file, the Glasswall Embedded Engine removes them from the regenerated document. The associated structure is logged in the Analysis report as a Sanitisation Item.
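
The relationship between a setting and the report item it produces can be summarised in a short sketch. The names `Setting`, `REPORT_ITEM` and `file_is_regenerated` are illustrative only and are not part of any Glasswall API.

```python
from enum import Enum


class Setting(Enum):
    """The three settings a content management switch can take."""
    ALLOW = "allow"
    DISALLOW = "disallow"
    SANITISE = "sanitise"


# Report item type logged in the Analysis report for each setting,
# following the descriptions in the list above.
REPORT_ITEM = {
    Setting.ALLOW: "Allowed Item",
    Setting.DISALLOW: "Issue Item",
    Setting.SANITISE: "Sanitisation Item",
}


def file_is_regenerated(setting: Setting, element_found: bool) -> bool:
    """Considering one switch in isolation: only 'disallow' stops
    regeneration, and only when the element is actually present."""
    return not (setting is Setting.DISALLOW and element_found)
```

This models a single switch only; a file can still be rejected for reasons unrelated to that switch.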

Content Management Reporting

The following sections show how content that is under the control of a content management switch is presented in the XML Analysis report, depending on the content switch setting.

Allow

This is an excerpt from the XML report for a Word (.doc) Binary file, which contains metadata. The content management switch metadata has been set to allow.

    <gw:Camera cameraName="wordConfig">
    <gw:ContentSwitch>
        <gw:ContentName>metadata</gw:ContentName>
        <gw:ContentValue>allow</gw:ContentValue>
    </gw:ContentSwitch>
    ...
    <gw:AllowedItems itemCount="1">
        <gw:AllowedItem>
            <gw:TechnicalDescription>Metadata detected in #05SummaryInformation</gw:TechnicalDescription>
            <gw:InstanceCount>1</gw:InstanceCount>
            <gw:TotalSizeInBytes>4096</gw:TotalSizeInBytes>
        </gw:AllowedItem>
    </gw:AllowedItems>

Disallow

This is an excerpt from the XML report for a Word (.doc) Binary file which has metadata inside it. The content management switch metadata has been set to disallow. In Protect Mode, this would cause the file to be marked as non-conforming.

    <gw:Camera cameraName = "wordConfig">
    <gw:ContentSwitch>
        <gw:ContentName>metadata</gw:ContentName>
        <gw:ContentValue>disallow</gw:ContentValue>
    </gw:ContentSwitch>
    ...
    <gw:IssueItem>
        <gw:TechnicalDescription>Metadata detected in #05SummaryInformation</gw:TechnicalDescription>
        <gw:IssueId>96</gw:IssueId>
        <gw:InstanceCount>1</gw:InstanceCount>
        <gw:RiskLevel>Medium</gw:RiskLevel>
    </gw:IssueItem>

Sanitise

This is an excerpt from the XML report for a Word (.doc) Binary file which has metadata inside it. The content management switch metadata has been set to sanitise. In Protect Mode, this would result in the metadata being removed from the regenerated file.

    <gw:Camera cameraName = "wordConfig">
    <gw:ContentSwitch>
        <gw:ContentName>metadata</gw:ContentName>
        <gw:ContentValue>sanitise</gw:ContentValue>
    </gw:ContentSwitch>
    ...
    <gw:SanitisationItem>
        <gw:TechnicalDescription>Metadata detected in #05SummaryInformation</gw:TechnicalDescription>
        <gw:InstanceCount>1</gw:InstanceCount>
        <gw:TotalSizeInBytes>4096</gw:TotalSizeInBytes>
    </gw:SanitisationItem>
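
As a rough illustration of how a system integrator might read these items back out of a report, the sketch below collects Allowed, Issue and Sanitisation items with Python's standard library. The excerpts above show only the `gw:` prefix, so the namespace URI and the `gw:Report` wrapper element used here are placeholders, not the real report schema.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption): substitute the URI that the
# real report declares for the "gw" prefix.
GW_NS = "urn:example:glasswall-report"
NS = {"gw": GW_NS}

ITEM_TAGS = ("AllowedItem", "IssueItem", "SanitisationItem")


def list_items(report_xml: str) -> list[tuple[str, str]]:
    """Return (item type, technical description) pairs found in a report."""
    root = ET.fromstring(report_xml)
    found = []
    for tag in ITEM_TAGS:
        for item in root.iter(f"{{{GW_NS}}}{tag}"):
            desc = item.findtext("gw:TechnicalDescription", default="", namespaces=NS)
            found.append((tag, desc.strip()))
    return found


# Hypothetical, well-formed wrapper around the Sanitise excerpt above.
SAMPLE = f"""
<gw:Report xmlns:gw="{GW_NS}">
  <gw:SanitisationItem>
    <gw:TechnicalDescription>Metadata detected in #05SummaryInformation</gw:TechnicalDescription>
    <gw:InstanceCount>1</gw:InstanceCount>
    <gw:TotalSizeInBytes>4096</gw:TotalSizeInBytes>
  </gw:SanitisationItem>
</gw:Report>
"""

print(list_items(SAMPLE))
```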

Content Management Policies

These are the available content management policies:

Content Management Switch Description
pdfConfig Content management switch for PDF file type
wordConfig Content management switch for Word file type
pptConfig Content management switch for PowerPoint file type
xlsConfig Content management switch for Excel file type
tiffConfig Content management switch for TIFF file type
svgConfig Content management switch for SVG file type
webpConfig Content management switch for WebP file type
jpegConfig Content management switch for JPEG file type
sysConfig Content management switch to control different Engine settings

Note: The xlsConfig, pptConfig and wordConfig content management policies cover both Office Open XML and Office Binary file types.

The available content management switches are described in the table below:

Content Management Switch Description
acroform Controls Interactive form (AcroForm) content
javascript Controls JavaScript code embedded in files
external_hyperlinks Controls hyperlinks to locations outside the file
embedded_files Controls Embedded file content
metadata Controls file metadata
actions_all Controls PDF Actions such as Rendition, Sound, Movie, Hide, SetOCGState, GoTo3DView
internal_hyperlinks Controls hyperlinks to locations within the file
value_outside_reasonable_limits Controls Glasswall defined restrictions such as values exceeding a reasonable range e.g. object sizes
digital_signatures Controls digital signature content for signed files or signed objects within files. NOTE: the 'allow' setting cannot be used for the digital_signatures content management switch.
macros Controls VBA Macros which use Visual Basic code to create custom user-generated functions
review_comments Controls document review comments within a file
embedded_images Controls embedded image content for the Glasswall supported image formats
dynamic_data_exchange Controls DDE commands and DDE content in documents
tracked_changes Controls tracked changes in documents
hidden_data Controls hidden data in documents
in_text_comments Controls in text comments in documents
slide_notes Controls slide notes in documents
connections Controls connections to external data sources and information for constructs such as OLAP formulas, QueryTables or PivotTables
scripts Controls XML Scripts that allow for the creation, storage and manipulation of variables and data during processing
foreign_objects Controls embedded objects in XML based formats such as SVG
hyperlinks Controls external and internal hyperlinks
geotiff Controls georeferencing information embedded within a TIFF file
jfif Controls JFIF marker segments within a JPEG image file
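
Before applying a policy, it can be useful to check that every switch name and setting is recognised. The following is a minimal sketch of such a check, using the switch names exactly as listed in the table above and the restriction on `digital_signatures` noted there. The function name `validate_switches` and the dictionary layout are assumptions made for illustration; this is not how the Glasswall engine itself validates policies.

```python
SETTINGS = {"allow", "disallow", "sanitise"}

# Switch names copied from the table above.
KNOWN_SWITCHES = {
    "acroform", "javascript", "external_hyperlinks", "embedded_files",
    "metadata", "actions_all", "internal_hyperlinks",
    "value_outside_reasonable_limits", "digital_signatures", "macros",
    "review_comments", "embedded_images", "dynamic_data_exchange",
    "tracked_changes", "hidden_data", "in_text_comments", "slide_notes",
    "connections", "scripts", "foreign_objects", "hyperlinks", "geotiff",
    "jfif",
}


def validate_switches(switches: dict[str, str]) -> list[str]:
    """Return a list of problems found in a {switch: setting} mapping."""
    problems = []
    for name, setting in switches.items():
        if name not in KNOWN_SWITCHES:
            problems.append(f"unknown switch: {name}")
        elif setting not in SETTINGS:
            problems.append(f"invalid setting for {name}: {setting}")
        elif name == "digital_signatures" and setting == "allow":
            # The table above states that 'allow' cannot be used here.
            problems.append("digital_signatures cannot be set to allow")
    return problems
```

For example, `validate_switches({"metadata": "sanitise"})` returns an empty list, while a mapping containing `"digital_signatures": "allow"` returns one problem.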

The switches currently available for each format are depicted in the table below:

Switch PDF DOC DOCX PPT PPTX XLS XLSX GIF JPEG SVG WEBP TIFF
acroform ✓
actions_all ✓
connections ✓
digital_signatures ✓ *
dynamic_data_exchange ✓ ✓
embedded_files ✓ ✓ ✓ ✓ ✓ ✓ ✓
embedded_images ✓ ✓ ✓ ✓ ✓ ✓ ✓
external_hyperlinks ✓ ✓ ✓ ✓ ✓ ✓ ✓
foreign_objects ✓
geotiff ✓
hidden_data ✓ ✓ ✓
hyperlinks ✓
internal_hyperlinks ✓ ✓ ✓ ✓ ✓ ✓ ✓
in_text_comment ✓
javascript ✓
jfif ✓ †
macros ✓ ✓ ✓ ✓ ✓ ✓
metadata ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ † ✓
retain_exported_streams ✓ *
review_comments ✓ ✓ ✓ ✓ ✓ ✓
slide_notes ✓ *
scripts ✓
tracked_changes ✓ ✓
value_outside_reasonable_limits ✓

[*]: Content management switch available in Editor's "enablerebuild" (default) mode or Rebuild only
[†]: Content management switch available in Editor's "editoronly" mode, which can only be used with the Export/Import feature

All content types not represented by a content management type for a specific file format will be automatically remediated by the Glasswall engine if identified as malicious.

Embedded Files

The "Embedded Files" content management type applies to non-image file formats which are located within a distinct container file. For MS-Office formats, the policy for embedded files is applied differently depending on whether the file considered is supported and accessible to the engine:

Action applied to embedded file according to content management policy for Microsoft Office files:

- | Allow | Sanitise | Disallow
Supported | Treated as standalone file. If file is non-conforming, containing file is rejected and reason for non-conformance reported as an Issue Item. | Treated as standalone file. If file is non-conforming, containing file is rejected and reason for non-conformance reported as an Issue Item. | Containing file is rejected, with the embedded file described in an Issue Item.
Unsupported | Regenerated without alteration and reported as an Allowed Item. | Removed from containing file, alongside all references to it, and reported as a Sanitisation Item. | Containing file is rejected, with the embedded file described in an Issue Item.
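
The decision table above can be expressed as a small lookup. This is only a sketch of the documented outcomes; the function name `embedded_file_action` and the returned strings are invented for illustration.

```python
def embedded_file_action(supported: bool, setting: str) -> str:
    """Describe what happens to an embedded file in an MS Office container.

    `supported` says whether the embedded format is supported and
    accessible to the engine; `setting` is the embedded_files setting.
    """
    if setting not in ("allow", "sanitise", "disallow"):
        raise ValueError(f"unknown setting: {setting}")
    if setting == "disallow":
        # Same outcome for supported and unsupported embedded files.
        return "containing file rejected; embedded file reported as Issue Item"
    if supported:
        # Allow and Sanitise behave identically for supported formats.
        return ("processed as standalone file; containing file rejected "
                "if the embedded file is non-conforming")
    if setting == "allow":
        return "regenerated without alteration; reported as Allowed Item"
    return "removed with all references; reported as Sanitisation Item"
```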

The table below shows which embedded file formats are considered supported (✓) for each containing file format and which are unsupported (✗):

Container Format ↓ / Embedded File Format → DOCX/XLSX/PPTX DOC/XLS/PPT PDF
Office 2007 ✓ ✓ ✗
Office 2003 ✗ ✓ ✗
Office 1997 ✗ ✓ ✗
PDF ✗ ✗ ✗
MP3 ✗ n/a † ✓ ‡
MP4 ✗ n/a † ✓ ‡
MPEG ✗ n/a † ✓ ‡
WAV ✗ ✗ ✓ ‡
Formats unsupported by Glasswall ✗ ✗ ✗

[†]: Disallowed by container format

[‡]: Not removed by Embedded Files switch but may be removed by All Actions switch. Embedded file is regenerated without being processed.

⚠️ Note: To preserve visual integrity between the original and sanitised versions of files, associated visual elements (such as thumbnails and blip references) of unsupported embedded files are not removed during sanitisation. This ensures that post-processed files remain visually consistent with their original versions.

Embedding Depth Support

The Embedded Engine supports processing OfficeXML files down to nine levels of embedded content. It can traverse and analyze files nested within other files to a maximum depth of nine. OfficeXML files with ten or more levels of embedding are not supported and will be rejected.
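
The depth limit can be illustrated with a short recursive check. The sketch below represents a file as a nested dictionary with an `embedded` list, which is a made-up structure for illustration, and it assumes that a file with nothing embedded has depth 0 and each level of nesting adds 1. How the engine itself counts levels is not specified here.

```python
MAX_EMBEDDING_DEPTH = 9  # limit stated in the paragraph above


def embedding_depth(node: dict) -> int:
    """Number of embedding levels below this file (0 if nothing is embedded)."""
    children = node.get("embedded", [])
    if not children:
        return 0
    return 1 + max(embedding_depth(child) for child in children)


def is_depth_supported(node: dict) -> bool:
    """True when the deepest chain of embedded files is within the limit."""
    return embedding_depth(node) <= MAX_EMBEDDING_DEPTH


def nest(levels: int) -> dict:
    """Build a test chain with the given number of embedding levels."""
    node = {"name": f"level{levels}"}
    for level in range(levels - 1, -1, -1):
        node = {"name": f"level{level}", "embedded": [node]}
    return node
```

Under these assumptions, `is_depth_supported(nest(9))` is true and `is_depth_supported(nest(10))` is false.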

Embedded Images

For image file formats, the "Embedded Images" content management switch should be used. This has the following behaviour depending on switch setting:

  • Sanitise - For MS Office, embedded images in supported formats are processed as standalone files. If the embedded image is conforming, it will be regenerated; otherwise, both the containing file and the embedded image will be rejected. Unsupported image formats are removed. In PDFs, embedded images are not processed and will always be regenerated if the entry is structurally correct.
  • Disallow - Embedded images are forbidden. If one is found, the containing file is rejected.
  • Allow - Embedded images are not processed and are always regenerated as long as they are a supported file format.

The table below shows which image formats we attempt to regenerate (✓) when "Embedded Images" is set to "Sanitise" versus those which are removed (✗):

Embedded Image Format DOCX/XLSX/PPTX DOC/XLS/PPT PDF
BMP, JPEG, GIF, PNG, EMF, SVG, TIFF ✓ ✓ ✓ ⸸
WMF, EMF ✓ ✓ ✓ ⸸
WebP ✓ ⸸ ✓ ⸸ ✓ ⸸
Formats unsupported by Glasswall ✗ ✗ ✓ ⸸

[⸸]: Will be converted to a different format by container file

Please note that when the "Embedded Images" switch is set to "Disallow", any image encountered will result in the rejection of the containing file. This includes thumbnails of the containing or embedded documents, and so may supersede the "Embedded Files" content management switch.

Macros

The macros content switch for MS Office files applies to both Microsoft Visual Basic for Applications (VBA) and Excel 4.0 macros.

Microsoft Visual Basic for Applications

VBA macros are written in the Visual Basic programming language and can be included in any MS Office file format. The handling of VBA macros can be configured as follows:

  • Sanitise - VBA macros are removed from files.
  • Disallow - VBA macros are forbidden. If one is found, the containing file is rejected.
  • Allow - VBA macros are processed and regenerated as part of the containing file providing they conform to specification.

Excel 4.0 Macros

Excel 4.0 macros are a legacy feature included in XLSX and XLS files. XLSX files containing Excel 4.0 macros will be saved using the ".xlsm" file extension and will produce an error if this extension is modified. The handling of Excel 4.0 macros can be configured as follows:

  • Sanitise - In XLS files, the file will be blocked and "Excel 4.0 Macro found: Not supported" reported as an issue item. In XLSX/XLSM files, sheets containing macros will be removed from the document and reported as a sanitisation item. If this causes the file to be malformed (e.g. by reducing the number of visible sheets to zero), the file will be rejected and an appropriate issue item reported.
  • Disallow - Excel 4.0 macros are forbidden. If one is found, the containing file is rejected.
  • Allow - In XLS files, the file will be blocked and "Excel 4.0 Macro found: Not supported" reported as an issue item. In XLSX/XLSM files, the file will be regenerated with macros intact.
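
Because the outcome for Excel 4.0 macros depends on both the file format and the setting, it may help to see the combinations side by side. The sketch below encodes the behaviour described in the list above; the function name and the returned strings are illustrative, not engine output.

```python
XLS_BLOCK_MESSAGE = "Excel 4.0 Macro found: Not supported"


def excel4_macro_outcome(file_format: str, setting: str) -> str:
    """Outcome for a file that contains Excel 4.0 macros."""
    fmt = file_format.lower().lstrip(".")
    if fmt not in {"xls", "xlsx", "xlsm"}:
        raise ValueError(f"not an Excel format: {file_format}")
    if setting not in {"allow", "sanitise", "disallow"}:
        raise ValueError(f"unknown setting: {setting}")
    if setting == "disallow":
        return "rejected: Excel 4.0 macros are forbidden"
    if fmt == "xls":
        # Allow and Sanitise both block binary XLS files.
        return f"blocked: {XLS_BLOCK_MESSAGE}"
    if setting == "sanitise":
        # The file is still rejected if removing the sheets leaves it
        # malformed, for example with no visible sheets.
        return "macro sheets removed"
    return "regenerated with macros intact"
```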

Metadata

In OOXML, metadata refers to information that describes the content, structure, and properties of a document but is not part of the document's main content. Metadata in OOXML documents is primarily stored in XML files located within the docProps directory:

  1. core.xml: Contains core properties based on the Dublin Core Metadata Element Set.
  2. app.xml: Contains extended properties specific to Microsoft Office applications.
  3. custom.xml: Contains custom properties.

The handling of OOXML metadata can be configured as follows:

  • Sanitise - The file is regenerated with metadata removed (see below for the properties currently sanitised).
  • Disallow - Metadata is forbidden. If any metadata (properties listed below) is found, the containing file is rejected.
  • Allow - The file is processed and the metadata is regenerated.

As part of the 'metadata' content management switch, the following properties are currently sanitised:

  • core.xml: title, subject, creator, keywords, description, lastModifiedBy, revision, lastPrinted, created, modified, category, contentStatus, language, and version.
  • app.xml: manager, company, and hyperlinkBase.
  • custom.xml: any custom properties added to the OOXML document.
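
To see which of the listed core properties a document actually carries, the contents of `docProps/core.xml` can be inspected with the standard library. The sketch below is an inspection aid only and does not reproduce how the Glasswall engine sanitises metadata; the helper names and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Core properties listed above as covered by the 'metadata' switch.
CORE_PROPERTIES = {
    "title", "subject", "creator", "keywords", "description",
    "lastModifiedBy", "revision", "lastPrinted", "created", "modified",
    "category", "contentStatus", "language", "version",
}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def core_properties_present(core_xml: str) -> list[str]:
    """List the properties from the set above that carry a value."""
    root = ET.fromstring(core_xml)
    return sorted(
        local_name(child.tag)
        for child in root
        if local_name(child.tag) in CORE_PROPERTIES and (child.text or "").strip()
    )


# Small hand-written core.xml used as sample input.
SAMPLE_CORE = """<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Quarterly report</dc:title>
  <dc:creator>A. Author</dc:creator>
  <cp:revision>3</cp:revision>
</cp:coreProperties>"""

print(core_properties_present(SAMPLE_CORE))  # ['creator', 'revision', 'title']
```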

OfficeXML (DOCX, XLSX, PPTX) Exclusive Switches

Hidden Data

Office file formats offer multiple ways of legitimately "hiding" text or data, including whole Excel sheets, PowerPoint slides or lines of text in a Word document. The Glasswall engine deals with hidden data in the following ways, depending on the content management switch setting:

  • Sanitise - The file is regenerated with all hidden data "unhidden" so it is completely visible to the user.
  • Disallow - Hidden data is forbidden. If any hidden data is found, the containing file is rejected.
  • Allow - Any hidden data is regenerated and remains hidden.

Note: For the purposes of this content management setting, "Hidden Data" does not refer to the varied ways to obfuscate or bury data in Office 2007 files. Rather, it is specific to the methods of hiding data that are readily available in the Office 2007 GUI. Obfuscated or concealed data is managed by the policy setting corresponding to the method used, e.g. the metadata switch will remove data concealed within free-text fields contained within the document's metadata.

Tracked Changes

The tracked_changes content management switch refers to content added by the "Track Changes" functionality in DOCX and XLSX files, also known as "revisions". These can contain historic data related to previous versions of the document, including names of contributors and records of content that has since been removed or obfuscated. The handling of tracked changes can be configured as follows:

  • Sanitise - All historic data is removed and "Track Changes" disabled. The regenerated document will be equivalent to the final state of the original document.
  • Disallow - Tracked changes are forbidden. If there is any evidence of previous revisions or tracked changes still present in the file, the file will be rejected.
  • Allow - The file is regenerated with all historic changes, revisions and tracked changes intact.

Slide Notes

The slide_notes content management switch refers to content added by the "Notes" functionality in PPTX files, also known as "slide notes" (and/or "speaker notes"). The Glasswall engine deals with these slide notes in the following ways, depending on the configuration of the content management switch setting:

  • Sanitise - The file is regenerated with all slide notes removed.
  • Disallow - Slide notes are forbidden. If any slide notes are found, the containing file is rejected.
  • Allow - Any slide notes are regenerated and remain in the file.

In-Text Comments

The in_text_comment switch refers to content added by the "In-Text Comments" functionality in DOCX files. The handling of the switch can be configured as follows:

  • Sanitise - In-Text Comment is removed alongside the corresponding document metadata found in core.xml.
  • Disallow - In-Text Comment is forbidden. Any DOCX file containing an in-text comment will not be regenerated.
  • Allow - The file is regenerated with the In-Text Comment present in the regenerated DOCX file.

Note: When the in_text_comment switch is set to allow and the metadata switch is set to sanitise, the regenerated file will contain the in-text comment without any data, since the metadata switch sanitises the corresponding description from the core.xml file.

PDF Exclusive Switches

Digital Signatures

Overview
PDF files may contain Digital Signatures and AcroForms, and certain types of AcroForms can themselves contain digital signatures. While digital signatures are used to verify the authenticity and integrity of a document, AcroForms provide the structural foundation for interactive form fields. When a digital signature is present in a PDF, the AcroForm holds the visible representation of the signature.

When processing PDF files that include digital signatures, the Glasswall CDR engine applies a sanitisation process designed to preserve visual integrity while removing active and/or potentially risky content.

How the CDR Engine handles Digital Signatures
To ensure both document safety and consistency, the Glasswall CDR engine performs the following actions during sanitisation:

  • Removes the cryptographic signature data, including any embedded certificates, validation logic, or scripts.
  • Strips signature-related metadata and interactive behavior to eliminate execution pathways or potential exploits.
  • Preserves the visual appearance of the signature widget, such as the signature image, signer name, and date/time text. This is achieved by flattening it into the static content layer of the PDF.

AcroForm | Digital Signature | Expected AcroForm behavior | Expected Digital Signature behavior | Behavior of AcroForm section containing Digital Signature | Is File Regenerated?
Allow | Allow | Regenerated without sanitisation | Regenerated without sanitisation | Entire section (including interactive form and digital signature) is preserved as-is | Yes
Sanitise | Allow | Sanitised (removed or flattened) | Regenerated without sanitisation | Visual digital signature is preserved; AcroForm field it resides in is sanitised or removed | Yes
Allow | Sanitise | Regenerated without sanitisation | Sanitised (cryptographic elements removed) | Visual part of digital signature is preserved as part of the AcroForm; signature becomes non-functional | Yes
Sanitise | Sanitise | Sanitised | Sanitised | Entire digital signature section, including AcroForm fields, is removed or flattened visually | Yes
Disallow | * | Not applicable | Not applicable | File is not regenerated due to disallowed AcroForm presence | No
* | Disallow | Not applicable | Not applicable | File is not regenerated due to disallowed Digital Signature presence | No
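
The combinations in the table above reduce to a simple rule: the file is regenerated unless either switch is set to disallow. The sketch below encodes that rule. The function name and result keys are illustrative, and checking the AcroForm switch before the Digital Signature switch simply follows the row order of the table, which is an assumption about precedence.

```python
VALID_SETTINGS = {"allow", "sanitise", "disallow"}


def pdf_signature_outcome(acroform: str, digital_signature: str) -> dict:
    """Summarise the table row matching the two switch settings, for a
    PDF that contains both an AcroForm and a digital signature."""
    for value in (acroform, digital_signature):
        if value not in VALID_SETTINGS:
            raise ValueError(f"unknown setting: {value}")
    if acroform == "disallow":
        return {"regenerated": False, "reason": "disallowed AcroForm presence"}
    if digital_signature == "disallow":
        return {"regenerated": False,
                "reason": "disallowed Digital Signature presence"}
    return {
        "regenerated": True,
        "acroform": "sanitised" if acroform == "sanitise" else "unchanged",
        "digital_signature": (
            "sanitised" if digital_signature == "sanitise" else "unchanged"
        ),
    }
```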

Auditability and Chain of Custody

To support traceability and accountability in secure environments, the Glasswall CDR engine records the cryptographic hashes of both the input and output files. This enables a system integrator:

  • To verify file provenance through hash comparison.
  • To provide assurance that, where a digital signature is no longer valid, the chain of custody is maintained and can be proven.
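
A system integrator could keep a comparable record on their own side. The sketch below hashes the input and output bytes with SHA-256; the hash algorithm is an assumption, since the text above does not state which one the engine records, and the function and key names are illustrative.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def custody_record(original: bytes, regenerated: bytes) -> dict:
    """Pair the input and output hashes so that provenance can be checked later."""
    input_hash = sha256_of(original)
    output_hash = sha256_of(regenerated)
    return {
        "input_sha256": input_hash,
        "output_sha256": output_hash,
        # True when regeneration altered the file, for example because
        # signature data was removed.
        "content_changed": input_hash != output_hash,
    }
```

Storing such a record alongside the regenerated file lets a later reviewer confirm which input produced which output, even where the original signature no longer validates.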