LuaExpat: XML Expat parsing for the Lua programming language

Introduction

Threat protection enables validation of structure and size of a document while parsing it.

The threat parser is used in the same way as the regular parser, with the following notes:

It has the same methods.
It is created through new, using the same signature.
The callbacks table should get an additional entry, threat, containing the configuration of the limits.
Any callback not defined by the user will be added to the callbacks table as a no-op function (exceptions are Default and DefaultExpand).
The separator parameter of the constructor is required when any of the following checks has been added (since they require namespace-aware parsing):

maxNamespaces
prefix
namespaceUri
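
The rule above can be checked before the parser is created. The following is a minimal sketch in plain Lua; needs_separator is a hypothetical helper written for illustration and is not part of LuaExpat. It only looks at which keys are present in the threat table.

```lua
-- Hypothetical helper (not part of LuaExpat): reports whether a threat
-- configuration contains a check that requires namespace-aware parsing,
-- and therefore a separator argument for the constructor.
local namespace_checks = { "maxNamespaces", "prefix", "namespaceUri" }

local function needs_separator(threat)
	for _, name in ipairs(namespace_checks) do
		if threat[name] ~= nil then
			return true, name -- also report the first check that was found
		end
	end
	return false
end
```

A caller could use the result to decide whether to pass a separator such as "\1" to new, or to raise a clear error early when a namespace check is configured without one.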

Limitations

Due to the way the parser works, an element of a document must first be parsed completely before the callback that verifies its maximum size is issued. For example, even if the maximum size for an attribute is set to 50 bytes, a 2 MB attribute will first be parsed entirely before the parser bails out with a size error. To protect against this, make sure to set the maximum buffer size (option buffer).
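
The buffer option is described in terms of data being fed to the parser, so it is most useful when the document is passed in pieces rather than as one large string. Below is a minimal sketch of such a feeding loop. It assumes the parser follows the regular parser's convention: parse is called as a method (parser:parse(chunk)), returns a false value plus an error message on failure, and is called without arguments to signal the end of the document. The names feed_in_chunks and chunk_size are hypothetical and not part of LuaExpat.

```lua
-- Hypothetical helper (not part of LuaExpat): feeds `data` to `parser`
-- in pieces of at most `chunk_size` bytes and stops at the first piece
-- that the parser rejects.
local function feed_in_chunks(parser, data, chunk_size)
	for pos = 1, #data, chunk_size do
		local chunk = data:sub(pos, pos + chunk_size - 1)
		local ok, err = parser:parse(chunk)
		if not ok then
			-- pos is the byte offset at which the rejected chunk started
			return nil, err, pos
		end
	end
	-- assumption: calling parse without arguments finishes the document
	return parser:parse()
end
```

With a small chunk size, an oversized element can be rejected after only a few chunks instead of after the whole element has been read into memory.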

Options

Structural checks:

depth: maximum depth of tags; child nodes such as Text or Comment are not counted as another level. Default 50.
allowDTD: boolean indicating whether DTDs are allowed. Default true.
maxChildren: maximum number of children (Element, Text, Comment, ProcessingInstruction, CDATASection). Default 100.
NOTE: adjacent Text/CDATA sections are counted as 1 (so text-cdata-text-cdata is 1 child).
maxAttributes: maximum number of attributes (including default ones). Default 100.
NOTE: if not parsing namespaces, the namespace declarations will be counted as attributes.
maxNamespaces: maximum number of namespaces defined on a tag. Default 20.

Size limits (per element, in bytes):

document: size of the entire document. Default 10 MB.
buffer: size of the unparsed buffer (see below). Default 1 MB.
comment: size of a comment. Default 1 KB.
localName: size of a local name; applies to tags and attributes. Default 1 KB.
NOTE: if not parsing namespaces, this limit will count against the full name (prefix + localName).
prefix: size of a prefix; applies to tags and attributes. Default 1 KB.
namespaceUri: size of a namespace URI. Default 1 KB.
attribute: size of an attribute value. Default 1 MB.
text: size of the text inside tags (counted over all adjacent Text/CDATA combined). Default 1 MB.
PITarget: size of a processing instruction target. Default 1 KB.
PIData: size of processing instruction data. Default 1 KB.
entityName: size of the entity name in an EntityDecl. Default 1 KB.
entity: size of the entity value in an EntityDecl. Default 1 KB.
entityProperty: size of the systemId, publicId, or notationName in an EntityDecl. Default 1 KB.

The buffer setting is the maximum size of unparsed data. The unparsed buffer runs from the last byte delivered through a callback to the end of the data fed into the parser so far.

As an example, assume we have set a maximum of 1 attribute, with a maximum name size of 20 bytes and a maximum value size of 20 bytes. This means that the largest allowed opening tag could look like this (give or take some whitespace):

<abcde12345abcde12345 ABCDE12345ABCDE12345="12345678901234567890">

But because of the way Expat works, a user could pass in a 2 MB attribute value, and it would have to be parsed completely before the callback for the new element fires. In this case the maximum expected buffer size would be 2 x 20 (tag name + attribute name) + 1 x 20 (attribute value) + 50 (to account for whitespace and other overhead characters) = 110 bytes. If buffer is set to this value and the parser is fed in chunks, it will bail out after the first 110 bytes of the faulty oversized tag.
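
The arithmetic above can be written down as a small helper. This is only a sketch: max_opening_tag_buffer is a hypothetical name, the extension to more than one attribute is an assumption that is not spelled out in the text, prefixes are ignored, and the overhead of 50 bytes is simply the allowance used in the example.

```lua
-- Hypothetical helper (not part of LuaExpat): estimates a value for the
-- `buffer` limit from the largest opening tag the other limits allow.
local function max_opening_tag_buffer(limits, overhead)
	-- one tag name plus one name per attribute
	local names = limits.localName * (1 + limits.maxAttributes)
	-- one value per attribute
	local values = limits.attribute * limits.maxAttributes
	return names + values + overhead
end

-- the example from the text: 1 attribute, names and values of 20 bytes each
local buffer = max_opening_tag_buffer(
	{ localName = 20, attribute = 20, maxAttributes = 1 }, 50)
print(buffer) --> 110
```

The result could then be assigned to the buffer entry of the threat table, possibly rounded up to leave room for closing tags, comments, or other constructs that are also held in the unparsed buffer.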

Example of threat protected parsing

local threat_parser = require "lxp.threat"

local separator = "\1"
local callbacks = {
	-- add your regular callbacks here
}

local threat = {

	-- structure
	depth = 3,
	maxChildren = 3,
	maxAttributes = 3,
	maxNamespaces = 3,

	-- sizes
	document = 2000,
	buffer = 1000,
	comment = 20,
	localName = 20,
	prefix = 20,
	namespaceUri = 20,
	attribute = 20,
	text = 20,
	PITarget = 20,
	PIData = 20,
}

callbacks.threat = threat

local parser = threat_parser.new(callbacks, separator)

-- xml_data is assumed to be a string holding the XML document to parse
assert(parser:parse(xml_data))