Written 2017-01-05

Tags: XML MadSax Expat BeyondThunderDom LoadWarrior

Expat Sax Parsing

Expat is a stream-oriented (SAX-style) parser for XML documents. However, parsing documents directly with Expat is a little cumbersome. Expat triggers a callback for every start and end XML tag, along with the string naming that tag, but it is left to the developer to convert this stream of named tags into usable events for processing. Various methods exist for this, including keeping a list of known tag names in the program and scanning this list for each tag to see what should be done. Commonly, this results in a string table that maps to an enumeration, and a switch-case.
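As a rough illustration of that conventional approach, here is a minimal sketch of a string table, an enumeration, and a switch-case. The callback has the same shape as an Expat start-element handler, but nothing here links against Expat, and every name (tag_id, tag_table, lookup_tag, counts, on_start) is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tag ids for a small SVG subset. */
enum tag_id { TAG_UNKNOWN = 0, TAG_RECT, TAG_CIRCLE };

/* String table mapping element names to enum values. */
static const struct { const char *name; enum tag_id id; } tag_table[] = {
    { "rect",   TAG_RECT   },
    { "circle", TAG_CIRCLE },
};

/* Linear scan: up to one strcmp per known tag, for every element seen. */
static enum tag_id lookup_tag(const char *el)
{
    assert(el != NULL);
    for (size_t i = 0; i < sizeof tag_table / sizeof tag_table[0]; i++) {
        if (strcmp(el, tag_table[i].name) == 0)
            return tag_table[i].id;
    }
    return TAG_UNKNOWN;
}

/* Counters standing in for real application logic. */
struct counts { int rects; int circles; int ignored; };

/* Same parameter shape as an Expat start-element callback. */
static void on_start(void *data, const char *el, const char **attr)
{
    struct counts *c = data;
    (void)attr; /* attributes are not used in this sketch */
    switch (lookup_tag(el)) {
    case TAG_RECT:   c->rects++;   break;
    case TAG_CIRCLE: c->circles++; break;
    default:         c->ignored++; break;
    }
}
```

The cost of the scan grows with the number of known tags, and the table, the enumeration, and the switch all have to be kept in sync by hand.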

MadSax works a little differently.

MadSax sits directly between Expat and higher-level logic. Instead of a callback API for arbitrary tags, MadSax is built at compile time with a list of the tags the application is interested in. These tags are compiled into a minimal perfect hash map using gperf.
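To give a feel for what hash-indexed dispatch looks like, here is a hand-written sketch for just two tags. It is not gperf output and not MadSax code: the hash (derived from the name length) only happens to be collision-free for "rect" and "circle", and all names (slot, find_tag, dispatch_start, seen) are assumptions for illustration. The general pattern is the one a generated perfect hash lookup follows: hash the name, index a table, then confirm with a single string comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*start_handler)(void *data, const char *el, const char **attr);

/* Hypothetical application state. */
struct seen { int rects; int circles; };

static void start_rect(void *data, const char *el, const char **attr)
{
    (void)el; (void)attr;
    ((struct seen *)data)->rects++;
}

static void start_circle(void *data, const char *el, const char **attr)
{
    (void)el; (void)attr;
    ((struct seen *)data)->circles++;
}

/* One slot per known tag, stored at the index its hash produces. */
struct slot { const char *name; start_handler on_start; };

static const struct slot slots[2] = {
    { "rect",   start_rect   },  /* length 4 -> slot 0 */
    { "circle", start_circle },  /* length 6 -> slot 1 */
};

/* Toy hash, valid only for lengths 4..6; collision-free for the two keys. */
static size_t tag_hash(size_t len) { return (len - 4) / 2; }

/* Hash, index, then one strcmp to reject names that are not in the set. */
static const struct slot *find_tag(const char *el)
{
    assert(el != NULL);
    size_t len = strlen(el);
    if (len < 4 || len > 6)
        return NULL;
    const struct slot *s = &slots[tag_hash(len)];
    return strcmp(el, s->name) == 0 ? s : NULL;
}

/* Returns 1 if a handler ran, 0 if the tag is not one we care about. */
static int dispatch_start(void *data, const char *el, const char **attr)
{
    const struct slot *s = find_tag(el);
    if (s == NULL)
        return 0;
    s->on_start(data, el, attr);
    return 1;
}
```

Unlike a linear scan, the lookup cost here does not grow with the number of known tags: there is at most one string comparison per element.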

Example MadSax Usage

This MadSax Definition File:


Generates the following hash-indexed tag handlers, which are used to trigger the higher-level application logic. These intentionally mirror the API of Expat, except that the handler no longer needs to parse the element name; the el parameter may be removed in the future.

static void handle_tag_start__svg__rect(void *data, const char *el, const char **attr){}
static void handle_tag_end__svg__rect(void *data, const char *el){}
static void handle_tag_data__svg__rect(void *data, const char *content, int length){}
static void handle_tag_start__svg__circle(void *data, const char *el, const char **attr){}
static void handle_tag_end__svg__circle(void *data, const char *el){}
static void handle_tag_data__svg__circle(void *data, const char *content, int length){}
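The generated handlers are empty stubs for the application to fill in. As a sketch of what one body might look like, the example below reads the width and height attributes of a rect. It relies on the Expat convention that attributes arrive as alternating name and value strings terminated by NULL; the rect_state struct and the choice of attributes are assumptions for illustration, not part of MadSax.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical application state passed through the user-data pointer. */
struct rect_state { float width; float height; int seen; };

static void handle_tag_start__svg__rect(void *data, const char *el, const char **attr)
{
    struct rect_state *st = data;
    (void)el; /* the tag is already known from the handler's name */
    assert(st != NULL);

    /* Attributes are laid out as name, value, name, value, ..., NULL. */
    for (size_t i = 0; attr && attr[i] && attr[i + 1]; i += 2) {
        if (strcmp(attr[i], "width") == 0)
            st->width = strtof(attr[i + 1], NULL);
        else if (strcmp(attr[i], "height") == 0)
            st->height = strtof(attr[i + 1], NULL);
    }
    st->seen++;
}
```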

What comes after MadSax?

Two more XML parsing projects are planned after MadSax.

The Load Warrior

The first, The Load Warrior, will be a thin layer on top of MadSax, and will support tagging MadSax definition lines with types. In most cases, this will abstract the current three-step parsing (start/data/end) into a simpler API consisting of a single callback per XML element. Start and end callbacks will still be used to delineate more complex objects, but single callbacks will be used to represent simple tags that enclose a single value.
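Since The Load Warrior is only planned, the following is purely a sketch of the idea, not its API. It shows one way the start/data/end triple for a float-typed element could be collapsed into a single typed callback: buffer the character data, then convert it once when the element ends. Buffering matters because Expat may deliver the text of one element in several chunks, and those chunks are not NUL-terminated. All names (float_collector, collector_start, collector_data, collector_end, store_float) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The single typed callback the application would implement. */
typedef void (*float_callback)(void *data, float value);

/* Hypothetical per-element state for collecting one float value. */
struct float_collector {
    char buf[64];            /* text seen so far; overlong values are truncated */
    size_t used;
    float_callback on_value;
    void *data;
};

/* Start tag: reset the buffer. */
static void collector_start(struct float_collector *c)
{
    assert(c != NULL);
    c->used = 0;
}

/* Character data: append the chunk, leaving room for the terminator. */
static void collector_data(struct float_collector *c, const char *content, int length)
{
    if (length <= 0)
        return;
    size_t room = sizeof c->buf - 1 - c->used;
    size_t n = (size_t)length < room ? (size_t)length : room;
    memcpy(c->buf + c->used, content, n);
    c->used += n;
}

/* End tag: terminate, convert once, and fire the single callback. */
static void collector_end(struct float_collector *c)
{
    c->buf[c->used] = '\0';
    c->on_value(c->data, strtof(c->buf, NULL));
}

/* Example application callback: store the value it receives. */
static void store_float(void *data, float value)
{
    *(float *)data = value;
}
```

A real implementation would also need error handling for text that is not a valid number, which this sketch leaves out.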

Beyond ThunderDom

Beyond ThunderDom will sit above The Load Warrior, and serve to aggregate objects converted by The Load Warrior into structures directly usable by higher-level application logic. For example, the rectangle example above becomes:

/* Aggregated rectangle; the struct name matches the handler below. */
struct svg_rect {
    float x;
    float y;
    float width;
    float height;
    const char *style;
};

static void handle_object__svg__rect(void *data, const struct svg_rect * rect){}