LuaExpat: XML Expat parsing for the Lua programming language

Introduction

Lua Object Model (LOM) is a representation of XML elements using Lua data types. It aims to be simple rather than 100% complete. LuaExpat provides an implementation of LOM that takes an XML document and transforms it into a Lua table.

Characteristics

The model represents each XML element as a Lua table. A LOM table has three special characteristics:

a special field called tag that holds the element's name;
an optional field called attr that stores the element's attributes; and
the element's children, stored in the array part of the table. A child is either an ordinary string or another XML element, which is itself represented by a Lua table following these same rules.

The special field attr is a Lua table that stores the XML element's attributes as <key>=<value> pairs. To preserve the order of the attributes (if necessary), the sequence of keys can also be stored in the array part of this same table.
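The rules above can be illustrated without the parser at all. The following is a minimal sketch in plain Lua that builds a LOM table by hand and walks it; the element and attribute names (book, id, lang) are invented for illustration and do not come from LuaExpat.

```lua
-- Hand-built LOM table for: <book id="42" lang="en">Lua <b>rocks</b></book>
-- (hypothetical document, used only to show the layout)
local book = {
  tag = "book",
  attr = {
    "id", "lang",            -- array part: attribute names in document order
    id = "42", lang = "en",  -- hash part: attribute name -> value
  },
  "Lua ",                             -- child 1: a text node (plain string)
  { tag = "b", attr = {}, "rocks" },  -- child 2: a nested element (LOM table)
}

-- Visit the attributes in their recorded order.
local attrs = {}
for _, name in ipairs(book.attr) do
  attrs[#attrs + 1] = name .. "=" .. book.attr[name]
end

-- Classify each child as text or element.
local kinds = {}
for i, child in ipairs(book) do
  kinds[i] = type(child) == "table" and ("element:" .. child.tag) or "text"
end

print(table.concat(attrs, " "))   -- id=42 lang=en
print(table.concat(kinds, ", "))  -- text, element:b
```

Because children and attribute names both live in array parts, ipairs is enough to traverse either one in document order.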

Functions

lom.parse(string|function|table|file[, opts])

Parses the input into the LOM table format and returns it. The input can be:

string: the entire XML document as a string
function: an iterator that returns the next chunk of the XML document on each call, and returns nil when finished
table: an array-like table containing the chunks that, combined, make up the XML document
file: an open file handle from which the XML document will be read line-by-line, using read(). Note: the file will not be closed when done.

The second parameter opts is an options table that supports the following options:

separator (string): the namespace separator character to use. Setting this enables namespace-aware parsing.
threat (table): a threat protection options table. If provided, the threat protection parser is used instead of the regular lxp parser.

On a parsing error it returns nil, err, line, col, pos: nil, followed by the error message and the line, column, and position at which the error occurred.
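As a rough sketch of the input forms and the error return described above, the snippet below parses the same small document from a string and from an array of chunks, then feeds the parser a malformed document. It assumes the module is loaded with require("lxp.lom"); the sample XML is invented for illustration.

```lua
local lom = require("lxp.lom")  -- assumption: LuaExpat is installed under this name

-- 1. The whole document as a single string.
local doc, err, line, col = lom.parse("<root><item>one</item></root>")
if not doc then
  error(("XML error '%s' at line %d, column %d"):format(err, line, col))
end
print(doc.tag, doc[1].tag, doc[1][1])

-- 2. The same document as an array-like table of chunks.
local chunked = assert(lom.parse({ "<root><item>", "one</item>", "</root>" }))
print(chunked[1][1])

-- 3. A malformed document: the first return value is nil,
--    followed by the error message and its location.
local bad, bad_err, bad_line, bad_col = lom.parse("<root><item></root>")
print(bad, bad_err, bad_line, bad_col)
```

Checking the first return value before touching the result keeps the error details (message, line, column) available for reporting.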

lom.find_elem(node, tag)

Traverses the tree recursively, and returns the first element that matches the tag. Parameter tag (string) is the tag name to look for. The node table can be the result from the parse function, or any of its children.
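A short sketch of find_elem on a nested document follows. The XML and tag names are made up for the example, and the nil result for a tag that does not occur is an assumption, since the description above only covers the matching case.

```lua
local lom = require("lxp.lom")  -- assumption: LuaExpat is installed under this name

local doc = assert(lom.parse(
  [[<library><shelf><book id="1">Dune</book></shelf></library>]]
))

-- The search descends through <shelf> to reach the first <book>.
local book = lom.find_elem(doc, "book")
print(book.tag, book.attr.id, book[1])

-- Any element can serve as the starting node, not only the root.
local shelf = lom.find_elem(doc, "shelf")
local same_book = lom.find_elem(shelf, "book")
print(same_book == book)

-- Assumption: a tag that is absent from the tree yields nil.
local missing = lom.find_elem(doc, "magazine")
print(missing)
```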

lom.list_children(node[, tag])

Iterator returning all child tags of a node (non-recursive). It returns only children that are tags and skips text nodes. The node table can be the result from the parse function, or any of its children. If the optional parameter tag (string) is given, the iterator returns only tags that match the tag name.
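The following sketch shows list_children with and without the tag filter. The document is invented for illustration; note that the whitespace between the child elements is parsed as text nodes, which the iterator skips.

```lua
local lom = require("lxp.lom")  -- assumption: LuaExpat is installed under this name

local doc = assert(lom.parse([[<list>
  <item>a</item>
  <note>skip me</note>
  <item>b</item>
</list>]]))

-- Without a tag: every direct child element, text nodes skipped.
local all = {}
for child in lom.list_children(doc) do
  all[#all + 1] = child.tag
end

-- With a tag: only the direct children named "item".
local items = {}
for item in lom.list_children(doc, "item") do
  items[#items + 1] = item[1]
end

print(table.concat(all, " "))    -- item note item
print(table.concat(items, " "))  -- a b
```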

Examples

For a simple string like

    s = [[<abc a1="A1" a2="A2">inside tag `abc'</abc>]]

A call like

    tab = lxp.lom.parse (s)

Would result in a table equivalent to

tab = {
        ["attr"] = {
                [1] = "a1",
                [2] = "a2",
                ["a2"] = "A2",
                ["a1"] = "A1",
        },
        [1] = "inside tag `abc'",
        ["tag"] = "abc",
}

Now an example with an element nested inside another element

tab = lxp.lom.parse(
[[<qwerty q1="q1" q2="q2">
    <asdf>some text</asdf>
</qwerty>]]
)

The result would be a table equivalent to

tab = {
        [1] = "\
        ",
        [2] = {
                ["attr"] = {
                },
                [1] = "some text",
                ["tag"] = "asdf",
        },
        ["attr"] = {
                [1] = "q1",
                [2] = "q2",
                ["q2"] = "q2",
                ["q1"] = "q1",
        },
        [3] = "\
",
        ["tag"] = "qwerty",
}

Note that even the whitespace between elements (the new-line and indentation characters) is stored in the table as text children.
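When that whitespace is not wanted, it can be removed after parsing. The helper below is a hypothetical sketch in plain Lua (strip_blank_text is not part of LuaExpat) that drops whitespace-only text children in place. It assumes such text carries no meaning in the document, which is not true for all XML vocabularies.

```lua
-- Hypothetical helper, not part of LuaExpat: recursively removes text
-- children that consist solely of whitespace, modifying the table in place.
local function strip_blank_text(node)
  local kept = {}
  for _, child in ipairs(node) do
    if type(child) == "table" then
      kept[#kept + 1] = strip_blank_text(child)  -- recurse into elements
    elseif child:find("%S") then
      kept[#kept + 1] = child                    -- keep text with visible content
    end
  end
  -- Clear the old array part, then write the surviving children back.
  for i = #node, 1, -1 do node[i] = nil end
  for i, child in ipairs(kept) do node[i] = child end
  return node
end

-- Hand-built table equivalent to the nested example above.
local tab = {
  tag = "qwerty",
  attr = { "q1", "q2", q1 = "q1", q2 = "q2" },
  "\n    ",
  { tag = "asdf", attr = {}, "some text" },
  "\n",
}

strip_blank_text(tab)
print(#tab, tab[1].tag, tab[1][1])  -- prints 1, asdf, some text (tab-separated)
```

Only the array part is rewritten, so the tag and attr fields are left untouched.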