Two standard methods are currently available for parsing XML documents: SAX and DOM.
The DOM method consists of transforming the document into an in-memory tree and then processing this tree to extract the data. This allows the document to be processed and manipulated conveniently, but the document has to be entirely parsed and stored in memory before processing can begin.
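As a rough illustration of the DOM approach, the following sketch uses only the JDK's javax.xml.parsers API. The DomSketch class and its sample document are assumptions made for this illustration; they are not part of Xineo OAX.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Xineo OAX.
class DomSketch {
    static final String XML =
        "<person firstName=\"James\" lastName=\"Bond\">"
        + "<birthDate year=\"1940\" month=\"6\" day=\"24\"/></person>";

    // Parses the whole document into a tree first, then navigates that tree.
    static String describe(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element person = doc.getDocumentElement();
            Element birth = (Element) person.getElementsByTagName("birthDate").item(0);
            return person.getAttribute("firstName") + " "
                + person.getAttribute("lastName") + ", born "
                + birth.getAttribute("year");
        } catch (Exception e) {
            // Sketch-level shortcut: real code would handle parser errors explicitly.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(XML));
    }
}
```

Nothing can be read from the tree until builder.parse() has returned, which is the memory cost described above.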
The SAX method is event-based: as the document is processed, the parser emits an event each time an element starts or ends in the input document. These events must be caught by the user's implementation to handle the data. Although this method allows the document to be processed as a stream (without being entirely stored in memory), the data handling process is not intuitive and encourages neither object-oriented design nor code reuse.
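For comparison, here is a minimal sketch of the event-based style, again using only the JDK. The SaxSketch class is a hypothetical name chosen for this illustration; it simply records the events in the order the parser emits them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Xineo OAX.
class SaxSketch {
    // Returns the start/end events in the order the parser emits them.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                log.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Sketch-level shortcut: real code would handle parser errors explicitly.
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(events("<person><birthDate/></person>"));
    }
}
```

All elements arrive through the same two callbacks, so the user code has to track by itself which element it is currently inside. This is the bookkeeping that makes plain SAX code hard to structure.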
Xineo OAX is a Java package offering an alternative to pure DOM or SAX for parsing XML documents. It is based on SAX, and thus allows large XML documents and XML streams to be parsed without being loaded into memory. It also offers a convenient, object-oriented way to handle parsed data, allowing clean design and code reuse.
The main concept of Xineo OAX (Object-oriented API for XML) is the object handler. An object handler is an instance of a class that is able to handle one or several XML elements. An object handler also contains sub-handlers, which are the object handlers able to handle the elements embedded in those handled by their parent handler. Some handlers may be virtual, in order to properly handle recursive schemas by instantiating handlers dynamically.
To parse a document using Xineo OAX, you have to define a handler for each element that may be contained in the parsed documents and from which you want to extract data. An object handler is created by subclassing a Xineo object and implementing the necessary methods. The resulting handler set is structured just like your XML schema, each handler embedding sub-handlers when elements embed other elements.
The resulting structure is a tree where each node is an object handler. The child nodes are sub-handlers, and the leaves are object handlers that don't embed any other handler. The root of the tree is an object handler that maps to the document itself (the root handler). The root will usually embed only one object handler, which maps to the main element that will be found in the document. Thus, each object handler tree maps to a given XML schema, and documents conforming to this schema can be parsed by this object handler set. As we'll see later, handler trees can also be described in an XML document, which can be used to dynamically instantiate the handlers and build the tree.
Note that any node of an object handler tree can appear under different parent nodes of the tree. A node can also appear in different tree structures. This is how Xineo allows code reuse: every object handler (with its embedded handlers) is an independent entity that can be reused wherever you need to parse the corresponding elements.
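To make the idea of a handler tree more concrete, here is a deliberately simplified sketch built directly on the JDK's SAX API. It is not the Xineo implementation: SketchHandler, SketchDispatcher and ReuseSketch are hypothetical names, and the dispatch logic only assumes that each node is matched by element name. It shows a single date node being embedded under two different parents.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for an object handler (NOT the Xineo OAX class).
class SketchHandler {
    final String elementName;
    final List<SketchHandler> subHandlers = new ArrayList<>();
    int handled = 0; // number of elements this node has handled

    SketchHandler(String elementName) { this.elementName = elementName; }

    SketchHandler add(SketchHandler sub) { subHandlers.add(sub); return this; }
}

// Routes SAX events to the tree by keeping the current node on a stack.
class SketchDispatcher extends DefaultHandler {
    private final Deque<SketchHandler> stack = new ArrayDeque<>();

    SketchDispatcher(SketchHandler root) { stack.push(root); }

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        SketchHandler current = stack.peek();
        SketchHandler next = current; // unhandled elements stay with the current node
        for (SketchHandler sub : current.subHandlers) {
            if (sub.elementName.equals(qName)) { next = sub; break; }
        }
        if (next != current) next.handled++;
        stack.push(next);
    }

    @Override
    public void endElement(String uri, String local, String qName) { stack.pop(); }
}

class ReuseSketch {
    // The same date node is embedded under two different parents.
    static int countDates(String xml) {
        SketchHandler date = new SketchHandler("date");
        SketchHandler book = new SketchHandler("book")
            .add(new SketchHandler("person").add(date))
            .add(new SketchHandler("company").add(date));
        SketchHandler root = new SketchHandler("#document").add(book);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new SketchDispatcher(root));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return date.handled;
    }
}
```

Calling ReuseSketch.countDates("<book><person><date/></person><company><date/></company></book>") counts both date elements through the one shared node.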
Here is a very short sample (a complete one is presented later). Consider the following XML elements:
<person firstName="James" lastName="Bond">
  <birthDate year="1940" month="6" day="24" />
</person>
In this case, we will first create an object handler to handle person elements.
This handler will embed another handler, which will be able to handle date elements (like birthDate ones).
Note that the latter, being targeted at a very common data type (dates), could also be used anywhere else you need it.
The Xineo OAX API is mainly defined in the net.xineo.xml.handler package.
The main class of the Xineo API is ObjectHandler. This class is at the same time:
ContentHandler SAX interface that will be used to handle document content.
ObjectHandler is an abstract class that must be subclassed to become an effective object handler.
Its abstract methods are in fact defined in the ObjectHandlerInterface interface, which
contains a set of methods that will be called by the Xineo engine during the document parsing.
These methods are (see the API docs for more detailed information):
handlesElement(): this method is called to test whether the object handler is able to
handle elements of the given name.
initialize(): this method is called each time an element that
is handled by this object handler starts.
terminate(): this method is called each time an element that
is handled by this object handler ends.
subHandlerInitialized(): this method is called each time a
sub-handler of this object handler is about to be initialized.
subHandlerTerminated(): this method is called each time a
sub-handler of this object handler is about to be terminated.
When pure handlers are not suitable or not necessary, some SAX equivalents are available and may be implemented if needed:
startElement(): this method is called when an element that is not handled
by any sub-handler starts.
endElement(): this method is called when an element that is not handled
by any sub-handler ends.
text(): this method is called when a text chunk is found. Note that in any case
the text contained in elements is also concatenated and passed to the terminate()
method.
These methods can throw a HandlerException if an error occurs. In this case,
the parsing will be aborted, and the thrown exception will be found embedded in the
SAXException thrown by the SAX parser.
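The wrapping described above relies on standard SAX behaviour: a SAXException can carry another exception, which the caller retrieves with getException(). The sketch below demonstrates the mechanism with the JDK only. DemoHandlerException is a hypothetical stand-in for Xineo's HandlerException, and UnwrapSketch is likewise an assumed name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for Xineo's HandlerException.
class DemoHandlerException extends Exception {
    DemoHandlerException(String message) { super(message); }
}

class UnwrapSketch {
    // Returns "ok", or the message of the exception embedded in the SAXException.
    static String parseAndUnwrap(String xml) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes attrs) throws SAXException {
                if (attrs.getValue("year") == null) {
                    // SAX callbacks may only throw SAXException, so the cause is wrapped.
                    throw new SAXException(
                        new DemoHandlerException("missing year on " + qName));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return "ok";
        } catch (SAXException e) {
            Exception cause = e.getException(); // the embedded exception, or null
            return cause instanceof DemoHandlerException
                ? cause.getMessage() : e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With Xineo, the same pattern would apply to the SAXException caught around the parse call, checking whether the embedded exception is a HandlerException.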
The ObjectHandlerAdapter class provides a default implementation of an object handler,
that can be used in two different ways:
ObjectHandlerInterface interface.
ObjectHandlerInterface
(this is how an object handler can be implemented without subclassing the ObjectHandler class).
As mentioned before, any object handler can act as a ContentHandler for any SAX parser
using the ObjectHandler.getAsContentHandler() method.
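As a sketch of what this looks like on the SAX side, the following JDK-only code plugs an arbitrary ContentHandler into an XMLReader. ReaderSketch is a hypothetical name, and the DefaultHandler used here merely stands in for the object returned by getAsContentHandler().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Xineo OAX.
class ReaderSketch {
    // Feeds the given XML text to any SAX ContentHandler.
    static void parseWith(ContentHandler handler, String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Sketch-level shortcut: real code would handle parser errors explicitly.
            throw new IllegalStateException(e);
        }
    }

    // A plain SAX handler standing in for an object handler's ContentHandler view.
    static int countElements(String xml) {
        int[] count = {0};
        parseWith(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes attrs) {
                count[0]++;
            }
        }, xml);
        return count[0];
    }
}
```

Because parseWith() accepts any ContentHandler, the same call site would work unchanged with a handler obtained from an object handler.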
Xineo also provides some convenience classes to make document parsing even simpler:
The RootHandler class may be used to create the root object handler of an object handler set
(remember that the root handler is the handler that maps to the document itself).
It just requires the object handler that will handle the main element of the document
to be passed to its constructor.
The Parser class may be used to parse a document using an object handler set.
The handler tree is usually built by embedding handlers in one another using the addSubHandler method.
But in some cases this is not possible, because it would lead to infinite recursion during the instantiation of
handlers. The simplest case is when an element may recursively embed elements of the same type, but the same
problem also appears as soon as a cycle appears in the schema.
To solve this problem, the Xineo OAX API provides a special object handler, named VirtualHandler.
This handler is able to dynamically instantiate the actual handlers only when they are needed. This way, you
can recursively embed handlers in one another.
There are two ways to create virtual handlers. The first way is to pass to its constructor the name of
the element to be handled, and the class of the objects that will actually handle elements of the corresponding type.
The given class must be a subclass of ObjectHandler, which will be properly instantiated when needed.
The other way is to pass to its constructor the name of the element, and an instance of an object that implements
the HandlerFactory interface, which contains the definition of the createHandler() method.
This method will be called during document parsing, when a handler is needed for an element of the given name.
So you must implement this interface to properly instantiate the correct handler.
Important: when using virtual handlers, you must take into account that the subHandlerInitialized and
subHandlerTerminated methods are passed the instance of the virtual handler, and not
the instance of the actual object handler. To retrieve the actual handler, you must call the
getDelegate() method on the virtual handler.
Note that although the tree could be built entirely from virtual handlers, this is not a good idea, since a virtual handler introduces an indirection that costs more processing time than a direct instantiation in the constructor. It is therefore better to use them only where they are needed.
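The recursion problem and its lazy-instantiation remedy can be sketched in a few lines of plain Java. LazyHandler and SectionNode are hypothetical names, not Xineo classes. The sketch also simplifies things by caching a single delegate, which may differ from how VirtualHandler actually manages its instances.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a virtual handler: creates its delegate on demand.
class LazyHandler<T> {
    private final Supplier<T> factory;
    private T delegate; // stays null until first needed

    LazyHandler(Supplier<T> factory) { this.factory = factory; }

    T getDelegate() {
        if (delegate == null) {
            delegate = factory.get();
        }
        return delegate;
    }
}

// Hypothetical stand-in for a handler of a recursive element.
class SectionNode {
    static int instances = 0; // counts constructions, for demonstration only

    // Writing "new SectionNode()" here would recurse forever at construction time;
    // the lazy wrapper defers creation until a nested section is actually met.
    final LazyHandler<SectionNode> subSection = new LazyHandler<>(SectionNode::new);

    SectionNode() { instances++; }

    public static void main(String[] args) {
        SectionNode top = new SectionNode();
        System.out.println(instances); // only the top-level node exists so far
        top.subSection.getDelegate().subSection.getDelegate();
        System.out.println(instances); // two nested levels were created on demand
    }
}
```

The extra null check and indirection in getDelegate() also illustrate the small per-use cost mentioned above.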
In this section, we will present an implementation sample for a simple
XML document format. Note that the complete and functional implementation
can be found in the test package bundled in the Xineo distribution.
Suppose that we have to handle documents describing a very simple address book, which contains a set of persons. A short example follows:
<?xml version="1.0" ?>
<addressBook>
  <person firstName="James" lastName="Bond">
    <address>30 Dummy Street, London</address>
    <birthDate year="1940" month="6" day="24"/>
  </person>
  ...
</addressBook>
We will first define an object handler for birthDate elements. Since a date handler
may be a commonly used one, we will define a generic object handler that is not bound to a specific
element name. We will only define the initialize() method, since date elements
will be empty, and we just want to extract data from their attributes. We also define a getDate()
method to return the date created for each date element.
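import java.util.Calendar;
import org.xml.sax.Attributes;

// Generic handler for empty date elements carrying year/month/day attributes.
public class DateHandler extends ObjectHandlerAdapter
{
    public DateHandler(String elementName)
    {
        super(elementName);
    }

    public void initialize(Attributes attrs) throws HandlerException
    {
        try
        {
            m_date = Calendar.getInstance();
            m_date.set(Calendar.YEAR, Integer.parseInt(attrs.getValue("year")));
            // Calendar.MONTH is zero-based (0 = January); this assumes the XML month is 1-12.
            m_date.set(Calendar.MONTH, Integer.parseInt(attrs.getValue("month")) - 1);
            m_date.set(Calendar.DAY_OF_MONTH, Integer.parseInt(attrs.getValue("day")));
        }
        catch(NumberFormatException e)
        {
            // Also raised when an attribute is missing, since parseInt(null) throws it.
            throw new HandlerException(e);
        }
    }

    public Calendar getDate()
    {
        return m_date;
    }

    private Calendar m_date;
}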
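One detail worth keeping in mind when building a Calendar from XML attributes is that Calendar.MONTH is zero-based, so a month written as 6 in a document does not map directly to June. The small JDK-only sketch below shows the conversion. MonthSketch is a hypothetical helper, and it assumes that months are written 1-12 in the XML.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

// Hypothetical helper, not part of Xineo OAX.
class MonthSketch {
    // Converts a 1-based month, as assumed to be written in the XML, to a Calendar.
    static Calendar toCalendar(int year, int month, int day) {
        // GregorianCalendar expects a zero-based month (0 = January).
        return new GregorianCalendar(year, month - 1, day);
    }

    public static void main(String[] args) {
        Calendar date = toCalendar(1940, 6, 24);
        System.out.println(date.get(Calendar.MONTH) == Calendar.JUNE);
    }
}
```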
On top of that, we may define another class to handle birthDate elements
more specifically:
public class BirthDateHandler extends DateHandler
{
    public BirthDateHandler()
    {
        // DateHandler's constructor, as defined above, takes the element name as a String.
        super("birthDate");
    }
}
We will now define another object handler for person elements.
The initialize() method is implemented to instantiate a new person, using
the data contained in the attributes.
To handle the birthDate elements, we will register an instance of
BirthDateHandler as a sub-handler, and implement the subHandlerTerminated() method to grab the date.
But we won't define a specific handler for address elements, since each one is just
a simple element containing some text. Instead, we will implement the endElement()
method to get the person's address:
class PersonHandler extends ObjectHandlerAdapter
{
    public PersonHandler()
    {
        super("person");
        addSubHandler(m_birthDateHandler);
    }

    // Called when a person element starts: create the object from its attributes.
    public void initialize(Attributes attributes) throws HandlerException
    {
        m_person = new Person(attributes.getValue("firstName"), attributes.getValue("lastName"));
    }

    // Called for elements without a dedicated sub-handler, such as address.
    public void endElement(ElementName elementName, String characters) throws HandlerException
    {
        if(elementName.equals("address"))
            m_person.setAddress(characters);
    }

    // Called when the birthDate sub-handler has finished: grab the parsed date.
    public void subHandlerTerminated(ObjectHandler subHandler) throws HandlerException
    {
        if(subHandler == m_birthDateHandler)
            m_person.setBirthDate(m_birthDateHandler.getDate());
    }

    public Person getPerson()
    {
        return m_person;
    }

    private Person m_person;
    private BirthDateHandler m_birthDateHandler = new BirthDateHandler();
}
We can now define a handler for addressBook elements following the same model:
public class AddressBookHandler extends ObjectHandlerAdapter
{
    public AddressBookHandler()
    {
        super("addressBook");
        addSubHandler(m_personHandler);
    }

    public void initialize(Attributes attributes) throws HandlerException
    {
        m_addressBook = new AddressBook();
    }

    public void subHandlerTerminated(ObjectHandler subHandler) throws HandlerException
    {
        if(subHandler == m_personHandler)
            m_addressBook.addPerson(m_personHandler.getPerson());
    }

    public AddressBook getAddressBook()
    {
        return m_addressBook;
    }

    private PersonHandler m_personHandler = new PersonHandler();
    private AddressBook m_addressBook;
}
To parse an XML "address book" document using these object handlers, we can use the Parser
class (although it is not required since each object handler can be used as a SAX content handler).
Note that a RootHandler object is used to embed the address book handler.
    AddressBookHandler addressBookHandler = new AddressBookHandler();
    Parser parser = new Parser(new RootHandler(addressBookHandler));
    parser.parse("addressbook.xml");
    AddressBook result = addressBookHandler.getAddressBook();
To demonstrate the use of virtual handlers, we will now consider another type of XML document. In the following one, a document is divided into sections, which may themselves be divided into sub-sections. Sections can also contain paragraphs.
<?xml version="1.0" ?>
<document>
  <section title="Section 1">
    <section title="Section 1.1">
      <paragraph> ... </paragraph>
    </section>
    ...
  </section>
  ...
</document>
As usual, we will define an object handler to handle the section elements.
But of course, we can't register a section handler as a sub-handler of itself, since
this would lead to an infinite recursion. So we'll have to use the VirtualHandler
class to enable sub-sections to be properly handled. Here is the code of the resulting handler.
Pay special attention to the instantiation of the sub-section handler (at the bottom),
and the implementation of the subHandlerTerminated() method.
class SectionHandler extends ObjectHandlerAdapter
{
    public SectionHandler()
    {
        super("section");
        addSubHandler(m_paragraphHandler);
        addSubHandler(m_subSectionHandler);
    }

    public void initialize(Attributes attributes) throws HandlerException
    {
        m_section = new Section(attributes.getValue("title"));
    }

    public void subHandlerTerminated(ObjectHandler subHandler) throws HandlerException
    {
        if(subHandler == m_paragraphHandler)
        {
            m_section.addParagraph(m_paragraphHandler.getParagraph());
        }
        else if(subHandler == m_subSectionHandler)
        {
            SectionHandler sectionHandler = (SectionHandler)m_subSectionHandler.getDelegate();
            m_section.addSubSection(sectionHandler.getSection());
        }
    }

    public Section getSection()
    {
        return m_section;
    }

    private ParagraphHandler m_paragraphHandler = new ParagraphHandler();
    private VirtualHandler m_subSectionHandler = new VirtualHandler("section", SectionHandler.class);
    private Section m_section;
}
Xineo OAX offers the ability to dynamically build handler trees using an XML description of their structure.
This feature is useful when a lot of flexibility is needed, for instance when you have an existing
set of object handlers that you want to put together in various ways. The XML document schema is
defined in the handler-tree-1.0.xsd file distributed in the Xineo package.
Let's consider again the case of a document containing sections, sub-sections, and paragraphs. Here is a sample of how the corresponding handler tree can be described in XML:
<?xml version="1.0"?>
<tree xmlns="http://www.xineo.net/HANDLER-TREE-1.0">
  <adapter elementName="*">
    <handler className="test.SectionHandler" name="sectionHandler">
      <handler className="test.ParagraphHandler"/>
      <ref name="sectionHandler" elementName="section"/>
    </handler>
  </adapter>
</tree>
The root element, tree, declares the namespace that must be used to define
a handler tree as the default one. Note that the use of this namespace,
http://www.xineo.net/HANDLER-TREE-1.0, is mandatory. The tree
must contain one (and only one) object handler. Each object handler can be defined either
by a handler, adapter, or ref element.
The root of the handler tree is defined by the first element, of type adapter.
An adapter is an empty handler (see the ObjectHandlerAdapter class), that can
be used (like above) to handle an element without having a specific handler. The name of the
element to be handled by the adapter must be given in the elementName attribute.
In this case, the given name is a wildcard, so the handler will accept any element.
Note that the adapter element can also be given a className
attribute. In this case, it will act as a virtual handler using the given class to dynamically
instantiate its delegate only when needed.
The next element is the section handler, which is implemented by the test.SectionHandler class,
given in the className attribute. This one is given using the handler element.
When the tree is built, this class will be loaded and instantiated, and the resulting object handler
will be registered as a sub-handler of its parent in the tree. You can notice that this handler is given a
name in the name attribute, which will be used later.
Since sections may contain paragraphs, a paragraph handler is embedded in the section handler.
This one is also given in a simple handler element by its class name.
Finally, we must correctly handle the fact that a section may recursively embed sub-sections.
At this point, the same problem as described earlier appears: we can't produce an infinite tree
of section handlers embedded in one another. To solve this problem, we use a ref
element to point to the parent section handler by its name. This way, a virtual handler will
be created and registered as a sub-handler of the section handler. When needed, new copies of
this handler will be instantiated to handle recursion. Note that the ref element
must contain the name of the handled element in the elementName attribute.
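To give an idea of how such a description could be turned into objects, here is a simplified, JDK-only sketch. It is not the Xineo tree builder: TreeDescriptionSketch is a hypothetical class, it ignores the namespace, the adapter element and the elementName attribute, and it uses ordinary JDK classes as stand-ins for handler classes. It only illustrates two mechanisms: loading a class from a className attribute by reflection, and resolving a ref by name without instantiating anything immediately.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not the Xineo tree builder.
class TreeDescriptionSketch {
    // Returns one line per handler or ref element found in the description.
    static List<String> build(String xml) {
        List<String> out = new ArrayList<>();
        Map<String, Class<?>> named = new HashMap<>();
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            walk(root, named, out);
        } catch (Exception e) {
            // Sketch-level shortcut: covers parse errors and unknown classes or names.
            throw new IllegalStateException(e);
        }
        return out;
    }

    private static void walk(Element e, Map<String, Class<?>> named, List<String> out)
            throws ReflectiveOperationException {
        if (e.getTagName().equals("handler")) {
            // Load and instantiate the class named in the className attribute.
            Class<?> type = Class.forName(e.getAttribute("className"));
            Object instance = type.getDeclaredConstructor().newInstance();
            if (e.hasAttribute("name")) {
                named.put(e.getAttribute("name"), type);
            }
            out.add("handler " + instance.getClass().getName());
        } else if (e.getTagName().equals("ref")) {
            // Only the class is looked up; an instance would be created on demand.
            out.add("ref " + named.get(e.getAttribute("name")).getName());
        }
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                walk((Element) n, named, out);
            }
        }
    }
}
```

Because the ref only records which class to use, the description stays finite even though the structure it describes is recursive.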
This tutorial explains the basic principles of Xineo OAX and shows an example based on a simple document schema. To learn more about Xineo and how to use it in your applications, please refer to:
The readme.html file, if you have not done so yet.