Entities are a fundamental concept in SGML and XML, and can be somewhat daunting at first. They serve a number of related, but slightly different functions, and this makes them a little bit complicated.
In the most general terms, entities allow you to assign a name to some chunk of data, and use that name to refer to that data. The complexity arises because there are two different contexts in which you can use entities (in the DTD and in your documents), two types of entities (parsed and unparsed), and two or three different ways in which the entities can point to the chunk of data that they name.
In the rest of this section, we'll describe each of the commonly encountered entity types. If you find this material confusing, feel free to skip over it for now. We'll refer to the different types of entities as the need arises in our discussion of DocBook, and you can come back to this section when you're looking for more detail.
Entities can be divided into two broad categories: general entities and parameter entities. Parameter entities are most often used in the DTD, not in documents, so we'll describe them last. Before you can use any type of entity, it must be formally declared. This is typically done in the document prologue, as we'll explain in Chapter 2, but we will show you how to declare each of the entities discussed here.
In use, general entities are introduced with an ampersand (&) and end with a semicolon (;). Within the category of general entities, there are two types: internal general entities and external general entities.
With internal entities, you can associate an essentially arbitrary piece of text (which may have other markup, including references to other entities) with a name. You can then include that text by referring to its name. For example, if your document frequently refers to, say, “O'Reilly & Associates,” you might declare it as an entity:
<!ENTITY ora "O'Reilly &amp; Associates">
Then, instead of typing it out each time, you can insert it wherever you need it with the entity reference &ora;. Note that this entity declaration includes another entity reference (&amp;) within it. That's perfectly valid as long as the reference isn't directly or indirectly recursive.
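As a minimal sketch, the declaration and a reference to it might sit together in one XML document as shown here. The para wrapper and the sentence text are assumptions made for illustration, and the declaration is placed in the internal subset of the document type declaration.

```xml
<?xml version="1.0"?>
<!-- Illustrative only: the wrapper element and sentence are assumptions. -->
<!DOCTYPE para [
  <!-- Internal general entity whose text contains another entity reference. -->
  <!ENTITY ora "O'Reilly &amp; Associates">
]>
<para>This book is published by &ora;.</para>
```

After parsing, the paragraph text reads "This book is published by O'Reilly & Associates."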
If you find that you use a number of entities across many documents, you can add them directly to the DTD and avoid having to include the declarations in each document. See the discussion of dbgenent.mod in Chapter 5.
With external entities, you can reference other documents from within your document. If these entities contain document text (SGML or XML), then references to them cause the parser to insert the text of the external file directly into your document (these are called parsed entities). In this way, you can use entities to divide your single, logical document into physically distinct chunks. For example, you might break your document into four chapters and store them in separate files. At the top of your document, you would include entity declarations to reference the four files:
<!ENTITY ch01 SYSTEM "ch01.sgm">
<!ENTITY ch02 SYSTEM "ch02.sgm">
<!ENTITY ch03 SYSTEM "ch03.sgm">
<!ENTITY ch04 SYSTEM "ch04.sgm">
Your book now consists simply of references to the entities:
<book>
&ch01;
&ch02;
&ch03;
&ch04;
</book>
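Put together, the top-level file might look something like the following sketch. For brevity it declares the entities in the internal subset and omits the external identifier that a real DocBook document would use to name its DTD; that omission is an assumption of this example, not a recommendation.

```xml
<!-- Sketch: a real document would also identify its DTD in the DOCTYPE. -->
<!DOCTYPE book [
  <!-- Each external parsed entity points at one chapter file. -->
  <!ENTITY ch01 SYSTEM "ch01.sgm">
  <!ENTITY ch02 SYSTEM "ch02.sgm">
  <!ENTITY ch03 SYSTEM "ch03.sgm">
  <!ENTITY ch04 SYSTEM "ch04.sgm">
]>
<book>
  <!-- The parser replaces each reference with the content of its file. -->
  &ch01;
  &ch02;
  &ch03;
  &ch04;
</book>
```

Each chapter file would then contain just the chapter markup, with no document type declaration of its own.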
Sometimes it's useful to reference external files that don't contain document text. For example, you might want to reference an external graphic. You can do this with entities by declaring the type of data that's in the entity using a notation (these are called unparsed entities). For example, the following declaration declares the entity tree as an encapsulated PostScript image:
<!ENTITY tree SYSTEM "tree.eps" NDATA EPS>
Entities declared this way cannot be inserted directly into your document. Instead, they must be used as entity attributes to elements:
<graphic entityref="tree"></graphic>
Conversely, you cannot use entities declared without a notation as the value of an entity attribute.
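The pieces involved in an unparsed entity can be sketched in one small, self-contained XML document. The tiny DTD below is a hypothetical stand-in for the real DocBook DTD (which supplies its own element and notation declarations), and the notation's system identifier is only a placeholder.

```xml
<?xml version="1.0"?>
<!-- Hypothetical stand-in DTD; DocBook declares these things for you. -->
<!DOCTYPE figure [
  <!ELEMENT figure (graphic)>
  <!ELEMENT graphic EMPTY>
  <!-- An attribute of type ENTITY accepts only unparsed entity names. -->
  <!ATTLIST graphic entityref ENTITY #REQUIRED>
  <!-- The notation named after NDATA must itself be declared. -->
  <!NOTATION EPS SYSTEM "application/postscript">
  <!ENTITY tree SYSTEM "tree.eps" NDATA EPS>
]>
<figure>
  <graphic entityref="tree"/>
</figure>
```

Note that the attribute value is the bare entity name, tree, with no ampersand or semicolon.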
In order for the parser to recognize markup in your document, it must be able to distinguish markup from content. It does this with two special characters: “<”, which identifies the beginning of a start or end tag, and “&”, which identifies the beginning of an entity reference.[1] If you want these characters to have their literal value, they must be encoded as entity references in your document. The entity reference &lt; produces a left angle bracket; &amp; produces the ampersand.[2]
If you do not encode each of these as their respective entity references, then an SGML parser or application is likely to interpret them as characters introducing elements or entities (an XML parser will always interpret them this way); consequently, they won't appear as you intended. If you wish to cite text that contains literal ampersands and less-than signs, you need to transform these two characters into entity references before they are included in a DocBook document. The only other alternative is to incorporate text that includes them in your document through some process that avoids the parser.
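For instance, a paragraph that quotes a fragment of program code might be encoded as in this sketch. The sentence itself is made up for illustration; in XML, lt and amp are predefined, so no declarations are needed.

```xml
<?xml version="1.0"?>
<!-- Every literal less-than sign and ampersand is written as a reference. -->
<para>The test "a &lt; b &amp;&amp; b &lt; c" compares three values.</para>
```

A parser delivers the text as: The test "a < b && b < c" compares three values.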
In SGML, character entities are frequently declared using a third entity category (one that we deliberately chose to overlook), called data entities. In XML, these are declared using numeric character references. Numeric character references resemble entity references, but technically aren't the same. They have the form &#999;, in which “999” is the numeric character number.
In XML, the numeric character number is always the Unicode character number. In addition, XML allows hexadecimal numeric character references of the form &#xhhhh;. In SGML, the numeric character number is a number from the document character set that's declared in the SGML declaration.
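As a short illustration of the two XML forms, the copyright sign is Unicode character 169 in decimal, which is A9 in hexadecimal, so both references in this sketch produce the same character. The surrounding sentence is an assumption made for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Decimal and hexadecimal references to the same Unicode character. -->
<para>Copyright &#169; and copyright &#xA9; are identical to the parser.</para>
```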
Character entities are also used to give a name to special characters that can't otherwise be typed or are not portable across applications and operating systems. You can then include these characters in your document by referring to their entity name. Instead of using the often obscure and inconsistent key combinations of your particular word processor to type, say, an uppercase letter U with an umlaut (Ü), you type in an entity reference for it instead. For instance, the entity for an uppercase letter U with an umlaut has been defined as Uuml, so you would type &Uuml; to reference it instead of the actual character. The SGML application that eventually processes your document for presentation will match the entity to your platform's handling of special characters in order to render it appropriately.
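In XML, a named character entity of this kind can be declared in terms of a numeric character reference. The following sketch declares Uuml locally so that the example is self-contained; in practice the declaration normally comes from a standard entity set pulled in by the DTD rather than from your document, and the sample sentence is only an illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE para [
  <!-- 220 is the Unicode number of uppercase U with an umlaut. -->
  <!ENTITY Uuml "&#220;">
]>
<para>&Uuml;bung macht den Meister.</para>
```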
Parameter entities are only recognized in markup declarations (in the DTD, for example). Instead of beginning with an ampersand, they begin with a percent sign. Parameter entities are most frequently used to customize the DTD. For a detailed discussion of this topic, see Chapter 5. Following are some other uses for them.
You might use a parameter entity reference in an SGML document in a marked section. Marking sections is a mechanism for indicating that special processing should apply to a particular block of text. Marked sections are introduced by the special sequence <![keyword[ and end with ]]>. In SGML, marked sections can appear in both DTDs and document instances. In XML, they're only allowed in the DTD.[3]
The most common keywords are INCLUDE, which indicates that the text in the marked section should be included in the document; IGNORE, which indicates that the text in the marked section should be ignored (it completely disappears from the parsed document); and CDATA, which indicates that all markup characters within that section should be ignored except for the closing characters ]]>.
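A CDATA marked section is convenient when a block of text is full of markup characters. In this sketch the programlisting element is borrowed from DocBook and the code fragment inside it is invented for illustration.

```xml
<?xml version="1.0"?>
<!-- Inside the CDATA section, < and & need no entity references. -->
<programlisting><![CDATA[
if (a < b && b < c) {
    printf("<ordered>\n");
}
]]></programlisting>
```

Everything between the opening sequence and the closing ]]> is passed through as literal text.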
In SGML, these keywords can be parameter entities. For example, you might declare the following parameter entity in your document:
<!ENTITY % draft "INCLUDE">
Then you could put the sections of the document that are only applicable in a draft within marked sections:
<![%draft;[
<para>
This paragraph only appears in the draft version.
</para>
]]>
When you're ready to print the final version, simply change the draft parameter entity declaration:
<!ENTITY % draft "IGNORE">
and publish the document. None of the draft sections will appear.
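Inside a DTD, parameter entities work in much the same way. The following fragment is a hypothetical sketch of an external DTD subset, not an excerpt from DocBook: every name in it is invented for illustration. It shows one parameter entity that holds part of a content model and another that serves as a marked-section keyword, which XML permits in the external subset.

```xml
<!-- Hypothetical external DTD fragment; all names are illustrative. -->

<!-- A parameter entity holding part of a content model. -->
<!ENTITY % inline.class "emphasis | literal">
<!ELEMENT para (#PCDATA | %inline.class;)*>
<!ELEMENT emphasis (#PCDATA)>
<!ELEMENT literal (#PCDATA)>

<!-- A parameter entity used as a marked-section keyword. -->
<!ENTITY % draft "INCLUDE">
<![%draft;[
<!ELEMENT remark (#PCDATA)>
]]>
```

Changing the replacement text of inline.class alters the content model of para without touching its element declaration, and setting draft to IGNORE removes the remark declaration entirely.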
[1] In XML, these characters are fixed. In SGML, it is possible to change the markup start characters, but we won't consider that case here. If you change the markup start characters, you know what you're doing. While we're on the subject, in SGML, these characters only have their special meaning if they are followed by a name character. It is, in fact, valid in an SGML (but not an XML) document to write “O'Reilly & Associates” because the ampersand is not followed by a name character. Don't do this, however.
[2] The sequence of characters that ends a marked section, ]]> (see the section on marked sections), must also be encoded with at least one entity reference if it is not being used to end a marked section. For this purpose, you can use the entity reference &gt; for the final right angle bracket.
[3] Actually, CDATA marked sections are allowed in an XML document, but the keyword cannot be a parameter entity, and it must be typed literally. See the examples on this page.