Content Export and Import

    Glasswall provides the ability to export and import content items for supported file types.

    This allows internal components of processed files to be made available to external processes and applications for additional processing outside of the Glasswall Embedded Engine domain. Once exported, these components can be validated externally before the Glasswall Engine imports the components and recomposes the files.

    To enable additional analysis of the components within a file, the file must be processed by the Glasswall Embedded Engine twice: once to extract a package containing the components that make up the file (export), and once more to reintegrate the externally analysed and/or modified components back into the file (import). By default, the files are re-validated and regenerated on each pass to ensure file integrity, but validation may be disabled for certain use cases where only transformation is required.

    Example Use Cases

    Example use cases for Export-Import processing include but are not limited to:

    • Pattern for safely importing data - Glasswall exposes the internal file structure in a standard form such as XML, enabling third parties to carry out hardware verification as part of the pattern for safely importing data.
    • Data loss prevention - exported content such as text is annotated to allow for all text to be identified, enabling users to carry out DLP processes such as text search and redaction.
    • Image analysis - additional image processing to detect and/or prevent steganography attacks.
    • Image resizing - images can be resized via the Glasswall Image Resizer tool.

    Exportable Content

    Glasswall provides the ability to export a document object model (DOM), which includes all content, for all supported file formats. The exported DOM is presented in one of two intermediate formats, XML or SISL. Users have the option to export embedded images in their original form or as a DOM representation.

    Importable Content

    Glasswall provides the ability to re-import a document object model (DOM) for all supported file formats, regardless of external modification made to the exported content (provided that modifications comply with the file format specification).

    Export Package Content

    The Export package is a ZIP archive with XML or SISL file streams, embedded images and corresponding JSON files with metadata (for PDF images).
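
    As an illustration of the package layout described above, the following minimal sketch groups the entries of an export ZIP by content category using Python's standard zipfile module. The file extensions, the category names, and the entry names in the demo package are assumptions made for this example; they are not taken from the Glasswall documentation, and a real package may name its entries differently.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed extension-to-category mapping (hypothetical; adjust to the real package).
CATEGORIES = {
    ".xml": "xml_stream",
    ".sisl": "sisl_stream",
    ".json": "image_metadata",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".gif": "image",
    ".bmp": "image",
}


def summarise_export_package(source):
    """Group the entries of an export ZIP by assumed content category."""
    groups = defaultdict(list)
    with zipfile.ZipFile(source) as package:
        for info in package.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only file streams
            suffix = PurePosixPath(info.filename).suffix.lower()
            groups[CATEGORIES.get(suffix, "other")].append(info.filename)
    return {category: sorted(names) for category, names in groups.items()}


def build_demo_package():
    """Create an in-memory ZIP with made-up entry names for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("stream_0.xml", "<S/>")
        package.writestr("image_0.png", b"\x89PNG")
        package.writestr("image_0.json", "{}")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(summarise_export_package(build_demo_package()))
```

    In practice, summarise_export_package would be given the path of the ZIP archive produced by the export pass instead of the in-memory demo package.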

    Internal names of XML tags/attributes and SISL types/parameters are shortened to minimize the size of exported file streams.

    XML Tags and SISL Types

    Tag / Type (Shortened)   Tag / Type (Full)   Description
    "S"                      STRUCT              Represents a structure node from our tree.
    "SA"                     STRUCTARRAY         The array of STRUCT objects.
    "I"                      ITEM                A property within a STRUCT object (e.g., whitespace indicators, end-of-file markers, etc.).
    "V"                      VALUE               Represents the stored integer value that was read from a file.
    "VA"                     VALUEARRAY          Represents a data block read from a file.

    XML Attributes and SISL Parameters

    Attribute / Parameter (Shortened)   Attribute / Parameter (Full)   Data Type                       Description
    "o"                                 offset                         <integer string>                The attribute contains the offset of the current item in the buffer. The buffer can represent things such as the file, a file within an archive, an amalgamation of streams from a CFB object, and more.
    "s"                                 size                           <integer string>                The total length of the current structure in bytes.
    "i"                                 itemEnum                       <integer string>                The internal numerical representation of the current ITEM.
    "n"                                 name                           <string>                        The internal name of the current structure.
    "t"                                 isText                         <string> [ "true" | "false" ]   Indicates if the element contains text or not. Only applicable to items which are marked as text within internal schemas.
    "se"                                structEnum                     <integer string>                The internal numerical representation of the current STRUCT.
    "sn"                                streamName                     <string>                        The current stream name.
    "c"                                 cameraName                     <string>                        The current camera (parser/validator/writer) name.
    "st"                                isStructuralText               <string> [ "true" | "false" ]   The attribute to distinguish between structural information and the file's visible text content. Only applicable to items which are marked as text within internal schemas.
    "e"                                 encoding                       <string> [ UTF 8 | "Base64" ]   The attribute specifies the encoding of data within the current element. Only applicable to items which are marked as text within internal schemas.
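
    The shortened names in the two tables above can be mapped back to their full names when an exported XML stream is inspected externally. The sketch below does this with Python's standard xml.etree.ElementTree module and also collects the visible text of a stream, which is the kind of step a DLP text search would start from. The sample document is hypothetical, and the code assumes that item data is stored as element text and that Base64-encoded data decodes to UTF-8; neither assumption is confirmed by this article.

```python
import base64
import xml.etree.ElementTree as ET

# Shortened-to-full name mappings, taken from the tables above.
TAG_NAMES = {"S": "STRUCT", "SA": "STRUCTARRAY", "I": "ITEM", "V": "VALUE", "VA": "VALUEARRAY"}
ATTRIBUTE_NAMES = {
    "o": "offset", "s": "size", "i": "itemEnum", "n": "name", "t": "isText",
    "se": "structEnum", "sn": "streamName", "c": "cameraName",
    "st": "isStructuralText", "e": "encoding",
}


def expand_names(element):
    """Return a copy of the tree with full tag and attribute names."""
    expanded = ET.Element(
        TAG_NAMES.get(element.tag, element.tag),
        {ATTRIBUTE_NAMES.get(key, key): value for key, value in element.attrib.items()},
    )
    expanded.text = element.text
    expanded.tail = element.tail
    expanded.extend([expand_names(child) for child in element])
    return expanded


def visible_text(root):
    """Collect data from ITEMs marked as text but not as structural text."""
    found = []
    for item in root.iter("I"):
        if item.get("t") != "true" or item.get("st") == "true":
            continue
        data = item.text or ""  # assumption: item data is the element text
        if item.get("e") == "Base64":
            # assumption: the decoded bytes are UTF-8 text
            data = base64.b64decode(data).decode("utf-8", errors="replace")
        found.append(data)
    return found


# Hypothetical sample stream; real exported streams will differ.
SAMPLE = (
    '<S n="document" se="1" o="0" s="24">'
    '<I n="body" i="7" t="true" st="false" e="Base64">SGVsbG8=</I>'
    '<I n="marker" i="8" t="true" st="true">EOF</I>'
    '<V n="count" o="20" s="4">3</V>'
    '</S>'
)

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(visible_text(root))
    print(ET.tostring(expand_names(root), encoding="unicode"))
```

    Expanding the names is only a reading aid. A stream intended for re-import should keep the shortened names it was exported with.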

    SISL Specific Parameters

    Parameter (Shortened)   Parameter (Full)   Data Type          Description
    "__s"                   struct             <dictionary>       General SISL structure of type: [ S | SA | I | V | VA ]
    "__m"                   meta               <dictionary>       Dictionary of the current SISL structure parameters
    "__d"                   data               <string>           The stored data of ITEM, VALUE or VALUEARRAY
    "__l"                   length             <integer string>   The original size of data stored in __d before non-printable characters were escaped.
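
    The SISL-specific parameter names can be expanded in the same way once a SISL stream has been loaded into nested Python dictionaries. The sketch below only renames keys. It does not parse SISL text, and the nesting of the sample structure is purely illustrative and should not be read as the actual layout of an exported stream.

```python
# Shortened-to-full SISL parameter names, taken from the table above.
SISL_KEY_NAMES = {"__s": "struct", "__m": "meta", "__d": "data", "__l": "length"}


def expand_sisl_keys(node):
    """Recursively replace shortened SISL parameter names with full names."""
    if isinstance(node, dict):
        return {SISL_KEY_NAMES.get(key, key): expand_sisl_keys(value)
                for key, value in node.items()}
    if isinstance(node, list):
        return [expand_sisl_keys(value) for value in node]
    return node  # strings and other leaf values are returned unchanged


# Hypothetical, already-parsed structure; the real nesting may differ.
SAMPLE = {"__s": {"I": {"__m": {"n": "body", "t": "true"}, "__d": "Hello", "__l": "5"}}}

if __name__ == "__main__":
    print(expand_sisl_keys(SAMPLE))
```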

    Export Text Dump

    The Export Text Dump feature can produce a file containing all the text within the input file being exported. The file is stored in the same directory as the output ZIP file.

    Text Dump is currently a beta feature. A text dump can be produced with or without a content export ZIP.

    File Format    Supported
    Office 2003
    Office 1997
    PDF*
    Binary formats
    Audio formats
    Image formats
    MPEG formats

    [ * ]: Text Dump of a PDF file is not currently available when the sysConfig switch export_embedded_images is set to true. It is also limited to very simple PDF files.

