Considering Other Schema Languages

Historically, DTDs were the only way to describe the valid structure of SGML and XML documents, but that is no longer the case. At the time of this writing (January 2001), DocBook is experimentally available in three other schema languages:

XML Schema

The schema language being defined by the W3C as the successor to DTDs for describing the structure of XML documents. XML Schema is likely to become a W3C Recommendation in 2001.

RELAX

RELAX (the Regular Language description for XML) is a less complex alternative to XML Schema. The RELAX Core module is defined by ISO in ISO/IEC DTR 22250-1, Document Description and Processing Languages -- Regular Language Description for XML (RELAX) -- Part 1: RELAX Core, 2000. The RELAX Namespaces module is currently under development.

TREX

TREX (Tree Regular Expressions for XML) is another less complex alternative to XML Schema. It is concise, powerful, and datatype neutral.

Parsing and Validation

Before we look more closely at these new schema languages, there's one significant difference between DTDs and all of them that we should get out of the way: XML parsers (which may understand DTDs) build an XML information set out of a stream of characters, whereas all of these other schema languages begin with an information set and perform validation on it.

What I mean is that an XML parser reads a stream of bytes:

"<" "?" "x" "m" "l" " " "v" "e" ...
"<" "!" "D" "O" "C" "T" "Y" "P" "E" " " "b" "o" "o" "k" ...
"<" "b" "o" "o" "k" " " "i" "d" "=" "'" "f" "o" "o" "'" ">"
...
"<" "/" "b" "o" "o" "k" ">"
interprets them as a stream of characters (which may change the interpretation of some sequences of bytes) and constructs some representation of the XML document. This representation is the set of all the XML information items encountered: the information set of the document. The W3C XML Core Working Group is in the process of defining what an XML Information Set contains.
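The following sketch illustrates the point that the byte stream and the resulting representation are different things. It uses Python's standard xml.etree.ElementTree module, whose tree is only a simplified stand-in for a real XML information set; the sample documents and the summarize helper are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The same document serialized two ways: different character encodings
# and different quote characters around the attribute value.
utf8_bytes = (
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
    "<book id='foo'>caf\u00e9</book>"
).encode("utf-8")
latin1_bytes = (
    "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
    "<book id=\"foo\">caf\u00e9</book>"
).encode("iso-8859-1")


def summarize(data):
    """Parse raw bytes and return the pieces of the tree we care about."""
    root = ET.fromstring(data)
    return (root.tag, dict(root.attrib), root.text)


# The byte streams differ, but the parser reports the same element,
# the same attribute, and the same character data for both.
print(utf8_bytes == latin1_bytes)
print(summarize(utf8_bytes) == summarize(latin1_bytes))
```

Anything that works on the parsed tree, as the newer schema languages do, cannot tell these two serializations apart.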

The other schema languages are defined not in terms of the sequence of characters in the file but in terms of the information set of the XML document. They have to work this way because the XML Recommendation defines what an XML document is, and they are all designed to work on top of XML.

So what, you might ask? Well, it turns out that this has at least one very significant implication: there's no way for these languages to provide support for entity declarations.

An entity, like "&ora;" as a shortcut for "O'Reilly & Associates" or "&eacute;" as a mnemonic for "é", is a feature of the character stream seen by the XML parser; it doesn't exist in the information set of valid XML documents. More importantly, this means that even if the schema language had a syntax for declaring entities, it wouldn't help the XML parser, which needs to know the definitions long before the schema language processor comes into play.
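A small experiment makes this concrete. The sketch below again assumes Python's xml.etree.ElementTree as the parser; the two sample documents and the helper names are hypothetical. With the entity declared in the internal DTD subset, the parser expands the reference before anything downstream sees the tree. Without a declaration, parsing fails outright, so a later validation step never gets a chance to supply the definition.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset, where the parser sees it.
WITH_DECLARATION = """<!DOCTYPE book [
  <!ENTITY ora "O'Reilly &amp; Associates">
]>
<book>&ora;</book>"""

# The same content with no declaration available to the parser.
WITHOUT_DECLARATION = "<book>&ora;</book>"


def parse_text(document):
    """Return the character data of the root element after parsing."""
    return ET.fromstring(document).text


def parse_fails(document):
    """Report whether the parser rejects the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return True
    return False


# The tree holds the expansion; no trace of the "&ora;" reference remains.
print(parse_text(WITH_DECLARATION))
# The undeclared reference stops the parser before any tree exists.
print(parse_fails(WITHOUT_DECLARATION))
```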

There are a couple of other XML features that are affected, though not necessarily as significantly: notations and default attribute values. One use for notations is on external entity declarations, and as we've already seen, the schema language arrives too late to be useful for anything entity-related. Default attribute values are also problematic, since you would like them to be in the information set produced by the parser so that the schema language sees them, but a parser that never reads the schema cannot supply defaults that only the schema declares.
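The default attribute case can be sketched the same way, again assuming Python's xml.etree.ElementTree and a made-up status attribute. When the default is declared in the internal DTD subset, the parser adds it to the tree even though the document never writes it; with no DTD declaration, nothing is added at parse time.

```python
import xml.etree.ElementTree as ET

# "status" is a hypothetical attribute whose default lives in the DTD.
DOCUMENT = """<!DOCTYPE book [
  <!ATTLIST book status CDATA "draft">
]>
<book/>"""


def parsed_attributes(document):
    """Return the attributes the parser reports for the root element."""
    return dict(ET.fromstring(document).attrib)


# The defaulted attribute shows up in the parsed tree.
print(parsed_attributes(DOCUMENT))
# With no declaration, the parser has nothing to add.
print(parsed_attributes("<book/>"))
```

A default declared only in one of the newer schema languages would be in the position of the second call: the tree handed to the schema processor would not contain it.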

A Coarse Comparison of Three XML Schema Languages

FIXME: write a short synopsis of how these languages compare.