gbXML - XML Language File Format
The XML language files created by gbXML consist of the following hierarchy
of elements: language > tokenset > validscope|tokens|tokens2. The basic
element structure of an XML language file is shown in the following example:
<language>
<tokenset>
<validscope ... />
<tokens> ... </tokens>
<tokens2> ... </tokens2>
</tokenset>
</language>
A typical language definition file might consist of 5-10 tokenset
elements, each with 2-4 validscope elements, 1 tokens element, and 1 tokens2
element. gbXML allows languages to have up to 50 tokensets and up to 10
validscopes.
Additional details on each element type are provided below.
Tokens and Tokenset
The text which makes up language source code can be thought of as consisting
entirely of 'tokens' - groups of text characters, such as keywords, variables,
operators, and other symbols specific to the language. Tokens are often
thought of as 'words', but a token can also be multiple words, such as the text
string 'End Function', which is used to terminate functions in some languages.
The purpose of a language definition file is to identify groups, or lists, of tokens
and to specify the formatting options to be applied to those tokens.
Sometimes a token pair needs to be identified, where all text found within the
token pair can be given specific formatting instructions. In this case, a second list
of tokens may be required in the language definition - one for the starting token
of the pair and a second for the ending token of the pair.
The token list (or lists, if token pairs are involved) is placed within a tokenset
element. Tokenset attributes may be defined which describe the formatting to be
applied to its tokens or to the content between token pairs. Formatting information
can also be entered on a token-by-token basis, overriding the formatting instructions
entered at the tokenset level.
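As a rough sketch of a token-level override, the fragment below sets a color on the tokenset and a different color on one token. The forecolor attribute on the tokenset matches the examples later in this document; placing the same attribute on an individual token element is an assumption made here for illustration, and the attribute names gbXML actually accepts at the token level may differ.

```xml
<language name="java">
<tokenset name="Common Words" id="keywords" type="list" forecolor="red">
<tokens>
<token>if</token>
<token>while</token>
<!-- Assumption: a forecolor attribute on a token overrides the tokenset color -->
<token forecolor="green">end</token>
</tokens>
</tokenset>
</language>
```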
Token lists are typically simple multi-line listings of every token to which the formatting
will be applied. However, a list of tokens may also be defined through the use of regular expressions.
Using a single regular expression to define an entire list of tokens is a powerful simplifying
tool for creating language definition files.
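For illustration, a list tokenset defined by a single regular expression might look like the sketch below. The regexp="yes" attribute is taken from the hyperlink example later in this document; the tokenset name, id, and color are hypothetical, and the pattern assumes the regular expression engine supports \b word boundaries.

```xml
<language name="java">
<!-- Hypothetical tokenset: matches integer literals such as 42 or 1000 -->
<tokenset name="Numbers" id="numbers" type="list" forecolor="green">
<tokens regexp="yes">
<token>\b[0-9]+\b</token>
</tokens>
</tokenset>
</language>
```

One pattern here stands in for what would otherwise be an unbounded list of individual number tokens.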
Tokenset Types
There are two types of tokensets - list and scope. As the name implies, a list
tokenset is simply a listing of all text strings (tokens) which belong to the tokenset.
As noted above, a regular expression may also be used to define the list of tokens.
Here's a simple example of a list tokenset with only a few tokens. Assigning properties
(attributes) to XML elements will be discussed later.
<language name="java">
<tokenset name="Common Words" id="keywords" type="list" forecolor="red">
<tokens>
<token>if</token>
<token>while</token>
<token>end</token>
</tokens>
</tokenset>
</language>
In this example, three tokens (if, while, end) are defined and will be displayed
in the color red.
The second kind of tokenset, a scope tokenset, is a list of one or more token pairs. Each
pair consists of a starting and ending token, where specific display characteristics
are applied to the source code between the two tokens (as well as to the tokens themselves).
For example, in most languages double-quotes are used to enclose strings. A scope tokenset which
defines a pair of double-quote tokens would be used to apply formatting to all source code between
the two double-quote characters.
A scope tokenset can include multiple token pairs. The first token of a pair is placed in a
tokens element. The second, corresponding token of a pair is placed in a tokens2 element.
The tokens and tokens2 elements can each contain any number of tokens, but both must
contain the same number, so that each starting token has a corresponding ending token.
Here's an example of a scope tokenset.
<language name="java">
<tokenset name="String Tokens" id="strings" type="scope" forecolor="blue">
<tokens>
<token>"</token>
<token>'</token>
</tokens>
<tokens2>
<token>"</token>
<token>'</token>
</tokens2>
</tokenset>
</language>
In this example, two pairs of tokens are defined - a pair of double-quotes and a
pair of single quotes - both of which are used to enclose strings in many languages.
Source code between either token pair would be colored blue in this example.
gbXML language definition files also support special, single-token scope definitions -
where a single token is used to define the start of a scope and the end of the
line of text defines the end of the scope. In such cases, only the tokens element
is needed - no tokens2 element is required.
For example, a single quote is used in Visual Basic to begin a comment. The
end of the line defines the end of the comment scope.
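A single-token scope for Visual Basic comments could be sketched as follows. The language name, tokenset name, id, and color are hypothetical values chosen for this example; the structure follows the scope tokenset shown earlier, with the tokens2 element left out.

```xml
<language name="vb">
<!-- Hypothetical values: only the starting token is listed, so the scope ends at the end of the line -->
<tokenset name="Comments" id="comments" type="scope" forecolor="green">
<tokens>
<token>'</token>
</tokens>
</tokenset>
</language>
```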
Validscopes
Sometimes, language elements may be embedded within one another. For example,
a comment string may have a hypertext link embedded within it. The language
definition files can be written to recognize such occurrences, applying
formatting to the embedded elements that differs from the formatting
applied to the enclosing element.
The validscope element is used to specify that a tokenset is to be valid (recognized)
within another element. A tokenset may contain any number of validscope elements.
Typically, only 2-4 validscopes are required to describe most languages. If no
validscopes are enclosed in a tokenset element, the tokenset is treated as valid
everywhere.
Here's an XML example showing how to indicate that a hyperlink should be valid
within a string scope tokenset. In this case, note that the hyperlink token is defined
as a regular expression.
<language name="java">
<tokenset name="String Tokens" id="strings" type="scope" forecolor="blue">
<tokens>
<token>"</token>
</tokens>
<tokens2>
<token>"</token>
</tokens2>
</tokenset>
<tokenset name="Active Links" id="hyperlinks" type="scope" forecolor="red">
<validscope name="String Tokens" />
<tokens regexp="yes" >
<token> https?://([\.~:?#=\w]+\w+)(/[\.~:?#=\w]+\w)* </token>
</tokens>
</tokenset>
</language>
In this example, the hyperlink would be displayed as red text within a blue text string.
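Because a tokenset may contain more than one validscope element, the same hyperlink tokenset could be made valid in several places. The sketch below assumes a second, hypothetical scope tokenset named "Comments" in a Visual Basic style language and lists both scopes; the hyperlink tokenset itself is copied from the example above.

```xml
<language name="vb">
<tokenset name="String Tokens" id="strings" type="scope" forecolor="blue">
<tokens>
<token>"</token>
</tokens>
<tokens2>
<token>"</token>
</tokens2>
</tokenset>
<!-- Hypothetical single-token scope: a comment running to the end of the line -->
<tokenset name="Comments" id="comments" type="scope" forecolor="green">
<tokens>
<token>'</token>
</tokens>
</tokenset>
<tokenset name="Active Links" id="hyperlinks" type="scope" forecolor="red">
<!-- One validscope element per enclosing tokenset, referenced by name -->
<validscope name="String Tokens" />
<validscope name="Comments" />
<tokens regexp="yes" >
<token> https?://([\.~:?#=\w]+\w+)(/[\.~:?#=\w]+\w)* </token>
</tokens>
</tokenset>
</language>
```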