Engine Reporting

    In Analysis Mode, a file-type agnostic description of the data is logged to an XML report. The structure of the XML report is defined by an Analysis Report XSD, which is designed to make the report as easy as possible to parse and process.

    The analysis report contains the following file information:

    • Document Summary — file type, file format version, file size, input file hash, output file hash, complexity estimate.
    • Content Management Policy — the settings of content management switches that have been applied to the processed file.
    • Content Groups — the main grouping of content detected in the processed file.
    • Content Items — the low-level structures detected in the processed file.
    • Issue Items — the detected structures that do not match the manufacturer's specification.
    • Sanitisation Items — the detected structures under content management that are marked for removal by policy.
    • Remedy Items — automatic corrections applied to the processed file in order to bring the file in line with the manufacturer's specification.
    • Allowed Content Items — the detected structures under content management that are permitted by policy.

    Analysis Process

    The Glasswall library receives a file through a published API and passes it through a number of process cycles. The output from each cycle becomes the input for the next, hence maintaining a level of separation between processes. Analysis of the file occurs in each of the cycles. Early cycles elicit the structure of the file and the sizes of its constituent parts. The later cycles are concerned with conducting syntactic and semantic checks, which identify possible sources of risk, out-of-range fields or malformed structures.

    Where elements of the file are compressed, these are expanded and the results assessed, analysed and verified. This enables the analysis report published at the end of the process to give a thorough assessment of the contents and structure of the file. By stepping through the sanitisation and remediation processes, Glasswall is able to provide an accurate report of the actions that could be carried out by Glasswall's regeneration functionality, thus making the file conformant with the specification.

    During each cycle, the file being processed is transformed into Glasswall's own internal representation. This simplifies the parsing and traversing processes and helps provide isolation. As the analysis process navigates through the Glasswall structures, the detailed checks are not only made on individual components but also at a higher level on the relationships between file components. These higher level checks enable the semantic structure and consistency of the file to be properly verified.

    The analysis aspects of the Glasswall functionality provide two forms of reporting. The principal output of the analysis process is the analysis report. This is an XML document that enables the detailed information generated by Glasswall to be interrogated and interpreted by third-party applications. The secondary output is an engineering report, which is technical in nature and provides detailed information about the analysis process in an ASCII log format.

    Sample Analysis Reports

    This section presents an abridged Glasswall Analysis Report that contains all the principal elements of a typical report. This particular example is based on a PDF file.

    Document Summary

    Each XML report starts with a document summary shown below:

        <?xml version="1.0" encoding="UTF-8"?>
        <gw:GWallInfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://glasswall.com/namespace/GWallInfo.xsd"
        xmlns:gw="http://glasswall.com/namespace">
            <gw:DocumentStatistics>
                <gw:DocumentSummary>
                    <gw:TotalSizeInBytes>2293803</gw:TotalSizeInBytes>
                    <gw:FileType>pdf</gw:FileType>
                    <gw:InputSHA256>DEE7CEB7AB57227121FF65F0D8D0878CBEDF90864069D9525698257972498099</gw:InputSHA256>
                    <gw:OutputSHA256>83EE59FAB5972429CE65CBB22EBF8B592D53F47927E0FA751584875D4F80CA1E</gw:OutputSHA256>
                    <gw:ProcessingTimeMilliseconds>101</gw:ProcessingTimeMilliseconds>
                    <gw:ComplexityLevel>4.41</gw:ComplexityLevel>
                </gw:DocumentSummary>
            </gw:DocumentStatistics>
    
    • TotalSizeInBytes - Size of input file
    • FileType - Glasswall detected file type
    • InputSHA256 - SHA-256 of the original input file
    • OutputSHA256 - SHA-256 of the output file after processing by Glasswall
    • ProcessingTimeMilliseconds - Glasswall measured time taken to process the file
    • ComplexityLevel - Glasswall estimate of the level of complexity of the file
      • This is produced by calculating log10 of the count of artefacts identified in the file, rounded to 2 decimal places.
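
    The summary fields described above can be read with any namespace-aware XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function names read_document_summary and approximate_artefact_count are illustrative and are not part of any Glasswall API, and the sample string is the abridged report with a closing root tag added so that it parses on its own. Because ComplexityLevel is a base-10 logarithm, raising 10 to that value gives an approximate artefact count.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared by xmlns:gw in the sample report.
NS = {"gw": "http://glasswall.com/namespace"}

# Abridged sample report, with the root element closed for this example.
SAMPLE_REPORT = """\
<gw:GWallInfo xmlns:gw="http://glasswall.com/namespace">
  <gw:DocumentStatistics>
    <gw:DocumentSummary>
      <gw:TotalSizeInBytes>2293803</gw:TotalSizeInBytes>
      <gw:FileType>pdf</gw:FileType>
      <gw:InputSHA256>DEE7CEB7AB57227121FF65F0D8D0878CBEDF90864069D9525698257972498099</gw:InputSHA256>
      <gw:OutputSHA256>83EE59FAB5972429CE65CBB22EBF8B592D53F47927E0FA751584875D4F80CA1E</gw:OutputSHA256>
      <gw:ProcessingTimeMilliseconds>101</gw:ProcessingTimeMilliseconds>
      <gw:ComplexityLevel>4.41</gw:ComplexityLevel>
    </gw:DocumentSummary>
  </gw:DocumentStatistics>
</gw:GWallInfo>
"""


def read_document_summary(report_xml: str) -> dict:
    """Return the DocumentSummary fields as a dict of typed values."""
    root = ET.fromstring(report_xml)
    summary = root.find("gw:DocumentStatistics/gw:DocumentSummary", NS)
    if summary is None:
        raise ValueError("report contains no DocumentSummary element")

    def field(tag: str) -> str:
        # Missing elements come back as an empty string.
        return summary.findtext(f"gw:{tag}", default="", namespaces=NS).strip()

    return {
        "total_size_in_bytes": int(field("TotalSizeInBytes")),
        "file_type": field("FileType"),
        "input_sha256": field("InputSHA256"),
        "output_sha256": field("OutputSHA256"),
        "processing_time_ms": int(field("ProcessingTimeMilliseconds")),
        "complexity_level": float(field("ComplexityLevel")),
    }


def approximate_artefact_count(complexity_level: float) -> int:
    """Invert the log10 estimate. Approximate, as the level is rounded."""
    return round(10 ** complexity_level)


summary = read_document_summary(SAMPLE_REPORT)
print(summary["file_type"], summary["total_size_in_bytes"])
print(approximate_artefact_count(summary["complexity_level"]))
```

    For the sample value of 4.41, the inverse calculation suggests roughly 25,700 artefacts. Comparing input_sha256 with output_sha256 is also a quick way to tell whether processing changed the file.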

    Content Management

    The content management policies that were used on the file are then listed.

    Note: All policies for all file types are listed in each report. Where a policy is available but has not been configured, the default settings that were applied are also listed.

    Some of the PDF content management switch settings are shown below:

        <gw:ContentManagementPolicy>
            <gw:Camera cameraName="pdfConfig">
                <gw:ContentSwitch>
                    <gw:ContentName>javascript</gw:ContentName>
                    <gw:ContentValue>sanitise</gw:ContentValue>
                </gw:ContentSwitch>
                <gw:ContentSwitch>
                    <gw:ContentName>acroform</gw:ContentName>
                    <gw:ContentValue>sanitise</gw:ContentValue>
                </gw:ContentSwitch>
                <gw:ContentSwitch>
                    <gw:ContentName>embedded_files</gw:ContentName>
                    <gw:ContentValue>sanitise</gw:ContentValue>
                </gw:ContentSwitch>
            </gw:Camera>
        </gw:ContentManagementPolicy>
    
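
    A consuming application will often want the policy as a simple lookup table. The sketch below, again using Python's xml.etree.ElementTree, flattens the switches into a dictionary keyed by cameraName. The function name read_policy is hypothetical, and the sample wraps the fragment above in a root element that declares the gw namespace so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared by xmlns:gw in the sample report.
NS = {"gw": "http://glasswall.com/namespace"}

# Policy fragment from the sample, wrapped in a namespaced root element.
SAMPLE_POLICY = """\
<gw:GWallInfo xmlns:gw="http://glasswall.com/namespace">
  <gw:ContentManagementPolicy>
    <gw:Camera cameraName="pdfConfig">
      <gw:ContentSwitch>
        <gw:ContentName>javascript</gw:ContentName>
        <gw:ContentValue>sanitise</gw:ContentValue>
      </gw:ContentSwitch>
      <gw:ContentSwitch>
        <gw:ContentName>acroform</gw:ContentName>
        <gw:ContentValue>sanitise</gw:ContentValue>
      </gw:ContentSwitch>
      <gw:ContentSwitch>
        <gw:ContentName>embedded_files</gw:ContentName>
        <gw:ContentValue>sanitise</gw:ContentValue>
      </gw:ContentSwitch>
    </gw:Camera>
  </gw:ContentManagementPolicy>
</gw:GWallInfo>
"""


def read_policy(report_xml: str) -> dict:
    """Map each cameraName to a {switch name: switch value} dictionary."""
    root = ET.fromstring(report_xml)
    policy = {}
    for camera in root.iterfind(".//gw:ContentManagementPolicy/gw:Camera", NS):
        switches = {}
        for switch in camera.iterfind("gw:ContentSwitch", NS):
            name = switch.findtext("gw:ContentName", default="", namespaces=NS).strip()
            value = switch.findtext("gw:ContentValue", default="", namespaces=NS).strip()
            switches[name] = value
        policy[camera.get("cameraName")] = switches
    return policy


policy = read_policy(SAMPLE_POLICY)
for switch_name, switch_value in policy["pdfConfig"].items():
    print(f"pdfConfig.{switch_name} = {switch_value}")
```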

    Content Groups and Items

    The number of different content groups found in the file (16) and an example of a content item from the first group are shown below:

        <gw:ContentGroups groupCount="16">
            <gw:ContentGroup>
                <gw:BriefDescription>PDF document has Basic File Section structure instances</gw:BriefDescription>
                <gw:ContentItems itemCount="5">
                    <gw:ContentItem>
                        <gw:TechnicalDescription>PDF Header Instances</gw:TechnicalDescription>
                        <gw:InstanceCount>1</gw:InstanceCount>
                        <gw:TotalSizeInBytes>15</gw:TotalSizeInBytes>
                        <gw:AverageSizeInBytes>15</gw:AverageSizeInBytes>
                        <gw:MinSizeInBytes>15</gw:MinSizeInBytes>
                        <gw:MaxSizeInBytes>15</gw:MaxSizeInBytes>
                    </gw:ContentItem>
                </gw:ContentItems>
            </gw:ContentGroup>
        ...
        </gw:ContentGroups>
    
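
    The nesting of groups and items lends itself to a flat table of rows. The following sketch, under the same assumptions as the sample above (Python's xml.etree.ElementTree, a hypothetical function name, and a namespaced root element added around the fragment), walks each ContentGroup and emits one row per ContentItem. Because the sample is abridged, the groupCount and itemCount attributes are larger than the number of elements actually present, so the sketch does not rely on them.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared by xmlns:gw in the sample report.
NS = {"gw": "http://glasswall.com/namespace"}

# Abridged content-group fragment, wrapped in a namespaced root element.
SAMPLE_GROUPS = """\
<gw:GWallInfo xmlns:gw="http://glasswall.com/namespace">
  <gw:ContentGroups groupCount="16">
    <gw:ContentGroup>
      <gw:BriefDescription>PDF document has Basic File Section structure instances</gw:BriefDescription>
      <gw:ContentItems itemCount="5">
        <gw:ContentItem>
          <gw:TechnicalDescription>PDF Header Instances</gw:TechnicalDescription>
          <gw:InstanceCount>1</gw:InstanceCount>
          <gw:TotalSizeInBytes>15</gw:TotalSizeInBytes>
          <gw:AverageSizeInBytes>15</gw:AverageSizeInBytes>
          <gw:MinSizeInBytes>15</gw:MinSizeInBytes>
          <gw:MaxSizeInBytes>15</gw:MaxSizeInBytes>
        </gw:ContentItem>
      </gw:ContentItems>
    </gw:ContentGroup>
  </gw:ContentGroups>
</gw:GWallInfo>
"""


def summarise_content_groups(report_xml: str) -> list:
    """Return one row per ContentItem, labelled with its group description."""
    root = ET.fromstring(report_xml)
    rows = []
    for group in root.iterfind(".//gw:ContentGroups/gw:ContentGroup", NS):
        group_name = group.findtext("gw:BriefDescription", default="", namespaces=NS).strip()
        for item in group.iterfind("gw:ContentItems/gw:ContentItem", NS):
            rows.append({
                "group": group_name,
                "item": item.findtext("gw:TechnicalDescription", default="", namespaces=NS).strip(),
                "instances": int(item.findtext("gw:InstanceCount", default="0", namespaces=NS)),
                "total_bytes": int(item.findtext("gw:TotalSizeInBytes", default="0", namespaces=NS)),
            })
    return rows


for row in summarise_content_groups(SAMPLE_GROUPS):
    print(row["group"], "|", row["item"], "|", row["instances"], "|", row["total_bytes"])
```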

    Sanitisation Items

    In this example, as the metadata switch has been set to sanitise, a dictionary structure is shown as tagged for removal. See Section 4 Configuration Management for details on content management switches.

        <gw:SanitisationItems itemCount="1">
            <gw:SanitisationItem>
                <gw:TechnicalDescription>Document information dictionary detected in a document trailer dictionary.</gw:TechnicalDescription>
                <gw:SanitisationId>16872998749</gw:SanitisationId>
                <gw:InstanceCount>1</gw:InstanceCount>
                <gw:TotalSizeInBytes>0</gw:TotalSizeInBytes>
            </gw:SanitisationItem>
        </gw:SanitisationItems>
    

    Remedy Items

    Not all XML reports include remedy items, because these record automatic corrections made to bring a regenerated file in line with the file specification, and not every file needs correcting. In this example, a remedy item has been reported in the file.

        <gw:RemedyItems itemCount="1">
            <gw:RemedyItem>
                <gw:TechnicalDescription>
                    PDF Stream is missing an End-Of-Line before the &apos;EndStream&apos; marker.
                </gw:TechnicalDescription>
                <gw:RemedyId>1605893787</gw:RemedyId>
                <gw:InstanceCount>7</gw:InstanceCount>
            </gw:RemedyItem>
        </gw:RemedyItems>
    

    Issue Items

    Very few files have an issue. An issue means that the file is not just non-conformant with the file specification, but that Glasswall has been unable to remedy the non-conformance and bring the file back in line with the specification. A file with an issue item cannot be regenerated.

        <gw:IssueItems itemCount="1">
            <gw:IssueItem>
                <gw:TechnicalDescription>
                    /Info dictionary contained an unexpected key (/GTS_PDFXConformance).
                </gw:TechnicalDescription>
                <gw:IssueId>1670998746</gw:IssueId>
                <gw:InstanceCount>1</gw:InstanceCount>
            </gw:IssueItem>
        </gw:IssueItems>
    

    Issue items are also reported when a file has been determined to be non-conformant due to content management policy.

    Each Sanitisation item, Remedy item or Issue item has a unique numeric ID associated with it, so the item can be uniquely identified by other applications that may wish to process the XML reports.
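
    As an illustration of how another application might use these IDs, the sketch below gathers every sanitisation, remedy and issue item into a single list and applies the rule stated above that a file with an issue item cannot be regenerated. It uses Python's xml.etree.ElementTree. The function names collect_items and can_be_regenerated are hypothetical, and the sample combines the three fragments from this section inside a namespaced root element so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared by xmlns:gw in the sample report.
NS = {"gw": "http://glasswall.com/namespace"}

# The three item fragments from this section, inside a namespaced root element.
SAMPLE_ITEMS = """\
<gw:GWallInfo xmlns:gw="http://glasswall.com/namespace">
  <gw:SanitisationItems itemCount="1">
    <gw:SanitisationItem>
      <gw:TechnicalDescription>Document information dictionary detected in a document trailer dictionary.</gw:TechnicalDescription>
      <gw:SanitisationId>16872998749</gw:SanitisationId>
      <gw:InstanceCount>1</gw:InstanceCount>
      <gw:TotalSizeInBytes>0</gw:TotalSizeInBytes>
    </gw:SanitisationItem>
  </gw:SanitisationItems>
  <gw:RemedyItems itemCount="1">
    <gw:RemedyItem>
      <gw:TechnicalDescription>
        PDF Stream is missing an End-Of-Line before the &apos;EndStream&apos; marker.
      </gw:TechnicalDescription>
      <gw:RemedyId>1605893787</gw:RemedyId>
      <gw:InstanceCount>7</gw:InstanceCount>
    </gw:RemedyItem>
  </gw:RemedyItems>
  <gw:IssueItems itemCount="1">
    <gw:IssueItem>
      <gw:TechnicalDescription>
        /Info dictionary contained an unexpected key (/GTS_PDFXConformance).
      </gw:TechnicalDescription>
      <gw:IssueId>1670998746</gw:IssueId>
      <gw:InstanceCount>1</gw:InstanceCount>
    </gw:IssueItem>
  </gw:IssueItems>
</gw:GWallInfo>
"""

# Element names follow the pattern <Kind>Items / <Kind>Item / <Kind>Id.
ITEM_KINDS = ("Sanitisation", "Remedy", "Issue")


def collect_items(report_xml: str) -> list:
    """Return every sanitisation, remedy and issue item with its numeric ID."""
    root = ET.fromstring(report_xml)
    items = []
    for kind in ITEM_KINDS:
        for element in root.iterfind(f".//gw:{kind}Items/gw:{kind}Item", NS):
            items.append({
                "kind": kind,
                "id": int(element.findtext(f"gw:{kind}Id", default="0", namespaces=NS)),
                "description": element.findtext(
                    "gw:TechnicalDescription", default="", namespaces=NS).strip(),
                "instances": int(element.findtext("gw:InstanceCount", default="0", namespaces=NS)),
            })
    return items


def can_be_regenerated(items: list) -> bool:
    """A file with any issue item cannot be regenerated."""
    return not any(item["kind"] == "Issue" for item in items)


items = collect_items(SAMPLE_ITEMS)
for item in items:
    print(item["kind"], item["id"], item["instances"], item["description"])
print("Can be regenerated:", can_be_regenerated(items))
```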

