In jEdit versions prior to 2.4, each syntax highlighting mode was actually a Java class known as a token marker. Token markers required a good knowledge of Java to write, and were tricky to debug. In jEdit 2.4 and later, Mike Dillon's XMode plugin is integrated into the core. XMode is a generic syntax parser that reads highlighting rules from XML files. XML files are much easier to write, change and debug than token markers.
XML is a simplified subset of SGML, the markup language on which HTML is based. Because of this, XML looks very similar to HTML, and anyone who has done a bit of web page authoring will pick it up quickly.
Here is a sample XML file that also happens to be a definition for a (useless) edit mode called "my-mode":
<?xml version="1.0"?>
<!DOCTYPE MODE SYSTEM "xmode.dtd">
<MODE NAME="my-mode">
    <PROPS>
        <PROPERTY NAME="label" VALUE="My First Edit Mode" />
        <PROPERTY NAME="filenameGlob" VALUE="*.my" />
    </PROPS>
    <RULES>
        <!-- syntax highlighting rules go here -->
    </RULES>
</MODE>
If you're familiar with HTML, keep the following in mind when writing XML:
<?xml version="1.0"?> - strings enclosed in "<?" and "?>" are called processing instructions. The only processing instruction that you need to know to write edit modes is the one that specifies the XML version.
<!DOCTYPE MODE SYSTEM "xmode.dtd"> - tags beginning with "<!" give extra information to the parser. This DOCTYPE declaration tells the parser that the following XML document is of type "MODE".
<PROPERTY NAME="label" VALUE="My First Edit Mode" /> - unlike HTML, where some tags stand on their own (IMG, for example), all tags must be closed in XML. Even if there is nothing between the opening and closing tags, both must still be specified. To avoid clutter, the XML standard specifies that <TAG /> is the same as <TAG></TAG>.
All attribute values must be quoted. In HTML, you can get away with writing NAME=label inside a tag; in XML, you must write NAME="label".
XML is case sensitive. SPAN is not the same as Span or span.
To insert a special character such as < or > literally in XML (for example, inside an attribute value), you must write it as an entity. An entity consists of the character's symbolic name enclosed with "&" and ";". A full list of entities is out of the scope of this chapter, but the most important are:
&lt; - The less-than (<) character
&gt; - The greater-than (>) character
&amp; - The ampersand (&) character
For example, to highlight "<" as an OPERATOR in Java mode, you can't write "<SEQ TYPE="OPERATOR"><</SEQ>" because that would cause a syntax error. Instead, you must write:
<SEQ TYPE="OPERATOR">&lt;</SEQ>
Now that you know the basics of XML, read on to find out how to write edit modes.
Each mode definition must start with the following:
<?xml version="1.0"?>
<!DOCTYPE MODE SYSTEM "xmode.dtd">
Each mode definition must contain exactly one MODE tag. All other tags (PROPS, RULES) must be placed inside the MODE tag. The MODE tag has one required attribute, NAME, which must be set to the edit mode's name. The MODE tag for Java mode looks as follows:
<MODE NAME="java">
    <!-- definition for Java mode goes here -->
</MODE>
The PROPS tag and the PROPERTY tags inside it are used to define mode-specific properties. Each PROPERTY tag must have a NAME attribute set to the property's name, and a VALUE attribute with the property's value.
In addition to the properties listed in the section called Buffer-Local Properties in Chapter 8, you can use the following properties in modes:
label - this property must be specified. It is the full name of the edit mode, for example "MS-DOS Batch File".
filenameGlob - if a file's name matches this glob pattern, it will be opened with this edit mode. See Appendix D for information about glob patterns. In Java mode, for example, the value of this property is "*.java".
firstlineGlob - similar to filenameGlob, except that it is applied to a buffer's first line, instead of file name. In Perl mode, for example, the value of this property is "#!/*perl*".
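As a rough sketch, the three properties above might be combined in a single PROPS block as follows. The label and the filenameGlob value are hypothetical, chosen only for illustration; the firstlineGlob value is the one quoted above for Perl mode:

```xml
<PROPS>
    <!-- Full name of the edit mode; this property is required -->
    <PROPERTY NAME="label" VALUE="Perl Script" />
    <!-- Use this mode for files whose names end in .pl (assumed value) -->
    <PROPERTY NAME="filenameGlob" VALUE="*.pl" />
    <!-- Also use it when the first line of the buffer matches this glob -->
    <PROPERTY NAME="firstlineGlob" VALUE="#!/*perl*" />
</PROPS>
```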
RULES tags must be placed inside the MODE tag. Each RULES tag defines a ruleset. A ruleset consists of a number of parser rules, with each parser rule specifying how to highlight a specific syntax token. There must be at least one ruleset in each edit mode. There can also be more than one, with different rulesets being used to highlight different parts of a buffer (for example, in HTML mode, different rulesets are used to highlight tags and inline JavaScript). For information about using more than one ruleset, see the section called SPAN rule.
The RULES tag supports the following attributes, all of which are optional:
HIGHLIGHT_DIGITS - if set to TRUE, digits (0-9, as well as hexadecimal literals prefixed with "0x") will be highlighted with the DIGIT token type. Default is FALSE.
IGNORE_CASE - if set to FALSE, matches will be case sensitive. Otherwise, case will not matter. Default is TRUE.
DEFAULT - the token type for text which doesn't match any specific rule. Default is NULL. See the section called Token Types for a list of token types.
SET - the name of this ruleset. All rulesets other than the first must have a name.
Each child element of the RULES tag defines a parser rule. Rules are checked in order; that means that if you define a rule that matches on "h" and another subsequent rule that matches on "hello", the "h" rule will handle all cases before the "hello" rule gets a chance. For the ruleset to work correctly, you must instead place the "hello" rule before the "h" one.
Here is an example RULES tag:
<RULES IGNORE_CASE="FALSE" HIGHLIGHT_DIGITS="TRUE">
    ...
</RULES>
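To make the ordering requirement concrete, here is a minimal sketch of a ruleset containing the "hello" and "h" rules discussed above. It uses the SEQ rule, which is described later in this chapter; the token types are arbitrary and were picked only so that the two rules would be highlighted differently:

```xml
<RULES IGNORE_CASE="FALSE">
    <!-- The longer sequence is listed first, so it gets a chance to match -->
    <SEQ TYPE="KEYWORD1">hello</SEQ>
    <!-- If this rule came first, it would claim the "h" of every "hello" -->
    <SEQ TYPE="KEYWORD2">h</SEQ>
</RULES>
```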
There can only be one TERMINATE tag per ruleset. The TERMINATE rule specifies that parsing should stop after the specified number of characters has been read from a line. The number of characters is given by the AT_CHAR attribute. This is used in patch mode, for example, because only the first character of each line affects highlighting. Here is an example:
<TERMINATE AT_CHAR="1" />
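The following sketch shows how TERMINATE might be combined with other rules in a patch-like mode. It is not the actual patch mode definition: the token types are arbitrary, and it assumes that an EOL_SPAN rule (described later in this chapter) which matches the first character still highlights the rest of the line.

```xml
<RULES>
    <!-- Stop looking for matches after the first character of each line -->
    <TERMINATE AT_CHAR="1" />
    <!-- Added lines begin with "+" (assumed token type) -->
    <EOL_SPAN TYPE="KEYWORD1" AT_LINE_START="TRUE">+</EOL_SPAN>
    <!-- Removed lines begin with "-" (assumed token type) -->
    <EOL_SPAN TYPE="KEYWORD2" AT_LINE_START="TRUE">-</EOL_SPAN>
</RULES>
```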
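The WHITESPACE rule specifies characters which are to be treated as whitespace; in other words, keyword delimiters. Most rulesets will have WHITESPACE tags for spaces and tabs. Here is an example (the first tag contains a space; the second contains a tab, written here as the character reference &#9; so that it is visible):
<WHITESPACE> </WHITESPACE>
<WHITESPACE>&#9;</WHITESPACE>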
The SPAN rule highlights ranges of text between a start and end string. The start and end strings are specified inside child elements of the SPAN tag, like so:
<SPAN TYPE="COMMENT1">
    <BEGIN>/*</BEGIN>
    <END>*/</END>
</SPAN>
The following attributes are supported:
TYPE - The token type to highlight the span with. See the section called Token Types for a list of token types
AT_LINE_START - If set to TRUE, the span will only be highlighted if the start sequence occurs at the beginning of a line
EXCLUDE_MATCH - If set to TRUE, the start and end sequences will not be highlighted, only the text between them will
NO_LINE_BREAK - If set to TRUE, the span will be highlighted with the INVALID token type if it spans more than one line
NO_WORD_BREAK - If set to TRUE, the span will be highlighted with the INVALID token type if it includes whitespace
DELEGATE - text inside the span will be highlighted with the specified ruleset. To delegate to a ruleset defined in the current mode, just specify its name. To delegate to a ruleset defined in another mode, specify a name of the form mode::ruleset. Note that the first (unnamed) ruleset in a mode is called "MAIN".
Here is a SPAN that highlights Java string literals, which cannot include line breaks:
<SPAN TYPE="LITERAL1" NO_LINE_BREAK="TRUE">
    <BEGIN>"</BEGIN>
    <END>"</END>
</SPAN>
Here is a SPAN that highlights Java documentation comments by delegating to the "JAVADOC" ruleset defined elsewhere in the current mode:
<SPAN TYPE="COMMENT2" DELEGATE="JAVADOC">
    <BEGIN>/**</BEGIN>
    <END>*/</END>
</SPAN>
Here is a SPAN that highlights HTML cascading stylesheets inside STYLE tags by delegating to the CSS ruleset in another mode:
<SPAN TYPE="MARKUP" DELEGATE="css::MAIN">
    <BEGIN>&lt;style&gt;</BEGIN>
    <END>&lt;/style&gt;</END>
</SPAN>
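To show where a delegated ruleset lives, here is a sketch of a mode with two rulesets. The mode name "my-java", its label, and the contents of the JAVADOC ruleset are hypothetical; the MARK_FOLLOWING rule it uses is described later in this chapter.

```xml
<MODE NAME="my-java">
    <PROPS>
        <PROPERTY NAME="label" VALUE="My Java" />
    </PROPS>
    <!-- First ruleset: it has no SET attribute, so it is called MAIN -->
    <RULES IGNORE_CASE="FALSE">
        <SPAN TYPE="COMMENT2" DELEGATE="JAVADOC">
            <BEGIN>/**</BEGIN>
            <END>*/</END>
        </SPAN>
    </RULES>
    <!-- Second ruleset: it must be named because it is not the first -->
    <RULES SET="JAVADOC" DEFAULT="COMMENT2">
        <!-- Highlight tags such as @param inside the comment -->
        <MARK_FOLLOWING TYPE="LABEL">@</MARK_FOLLOWING>
    </RULES>
</MODE>
```

Setting DEFAULT="COMMENT2" on the second ruleset is one way to keep the ordinary comment text in the comment color while only the text matched by its own rules is highlighted differently.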
An EOL_SPAN is similar to a SPAN except that highlighting stops at the end of the line, not after the end sequence is found. The text to match is specified between the opening and closing EOL_SPAN tags. The following attributes are supported:
TYPE - The token type to highlight the span with. See the section called Token Types for a list of token types
AT_LINE_START - If set to TRUE, the span will only be highlighted if the start sequence occurs at the beginning of a line
EXCLUDE_MATCH - If set to TRUE, the start sequence will not be highlighted, only the text after it will
Here is an EOL_SPAN that highlights C++-style comments:
<EOL_SPAN TYPE="COMMENT1">//</EOL_SPAN>
The MARK_PREVIOUS rule highlights from the end of the previous syntax token to the matched text. The text to match is specified between opening and closing MARK_PREVIOUS tags. The following attributes are supported:
TYPE - The token type to highlight the text with. See the section called Token Types for a list of token types
AT_LINE_START - If set to TRUE, the text will only be highlighted if it occurs at the beginning of the line
EXCLUDE_MATCH - If set to TRUE, the match will not be highlighted, only the text before it will
Here is a rule that highlights labels in Java mode ("XXX:"):
<MARK_PREVIOUS TYPE="LABEL" AT_LINE_START="TRUE" EXCLUDE_MATCH="TRUE">:</MARK_PREVIOUS>
The MARK_FOLLOWING rule highlights from the start of the match to the next syntax token or white space. The text to match is specified between opening and closing MARK_FOLLOWING tags. The following attributes are supported:
TYPE - The token type to highlight the text with. See the section called Token Types for a list of token types
AT_LINE_START - If set to TRUE, the text will only be highlighted if the start sequence occurs at the beginning of a line
EXCLUDE_MATCH - If set to TRUE, the match will not be highlighted, only the text after it will
Here is a rule that highlights variables in Unix shell scripts ("$CLASSPATH", "$IFS", etc):
<MARK_FOLLOWING TYPE="KEYWORD2">$</MARK_FOLLOWING>
The SEQ rule highlights fixed sequences of text. The text to highlight is specified between opening and closing SEQ tags. The following attributes are supported:
TYPE - The token type to highlight the sequence with. See the section called Token Types for a list of token types
AT_LINE_START - If set to TRUE, the sequence will only be highlighted if it occurs at the beginning of a line
The following SEQs highlight a few Java operators:
<SEQ TYPE="OPERATOR">+</SEQ>
<SEQ TYPE="OPERATOR">-</SEQ>
<SEQ TYPE="OPERATOR">*</SEQ>
<SEQ TYPE="OPERATOR">/</SEQ>
There can only be one KEYWORDS tag per ruleset. The KEYWORDS rule allows you to define keywords to highlight. Keywords are similar to SEQs, except that SEQs match anywhere in the text, whereas keywords only match whole words.
The KEYWORDS tag supports only one attribute, IGNORE_CASE. If set to FALSE, keywords will be case sensitive. Otherwise, case will not matter. Default is TRUE.
Each child element of the KEYWORDS tag should be named after the desired token type, with the keyword text between the start and end tags. For example, to highlight the most common Java keywords, you would write:
<KEYWORDS IGNORE_CASE="FALSE">
    <KEYWORD1>if</KEYWORD1>
    <KEYWORD1>else</KEYWORD1>
    <KEYWORD3>int</KEYWORD3>
    <KEYWORD3>char</KEYWORD3>
</KEYWORDS>
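The difference between SEQ and KEYWORDS can be sketched with a small ruleset that uses both. The particular operator and keyword are chosen only for illustration:

```xml
<RULES IGNORE_CASE="FALSE">
    <!-- Spaces delimit keywords -->
    <WHITESPACE> </WHITESPACE>
    <!-- A SEQ matches anywhere in the text, for example inside "i++;" -->
    <SEQ TYPE="OPERATOR">++</SEQ>
    <KEYWORDS IGNORE_CASE="FALSE">
        <!-- A keyword matches whole words only: "int" but not "print" -->
        <KEYWORD3>int</KEYWORD3>
    </KEYWORDS>
</RULES>
```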
Each syntax token is of one of the following types:
NULL (this is the default type. No special highlighting is performed on tokens of type NULL)
COMMENT1
COMMENT2
DIGIT (tokens of this type are automatically added if the HIGHLIGHT_DIGITS attribute of a ruleset is set; you should not explicitly define rules with this token type)
FUNCTION
INVALID (tokens of this type are automatically added when an invalid sequence of characters is found; you should not explicitly define rules with this token type)
KEYWORD1
KEYWORD2
KEYWORD3
LABEL
LITERAL1
LITERAL2
OPERATOR
There are no formal conventions specifying which token types should be used for what; instead, just take a look at how syntax is highlighted in some existing modes and decide for yourself what token type you should use.
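Putting the pieces of this chapter together, a complete mode file might look something like the sketch below. The file format it highlights, the mode name "my-config", the "*.myconf" glob and the choice of token types are all hypothetical; treat it as a starting point to adapt rather than a definitive mode.

```xml
<?xml version="1.0"?>
<!DOCTYPE MODE SYSTEM "xmode.dtd">

<!-- Hypothetical mode for a simple "name = value" configuration file -->
<MODE NAME="my-config">
    <PROPS>
        <PROPERTY NAME="label" VALUE="My Config File" />
        <PROPERTY NAME="filenameGlob" VALUE="*.myconf" />
    </PROPS>

    <RULES IGNORE_CASE="TRUE" HIGHLIGHT_DIGITS="TRUE">
        <!-- Keyword delimiters: a space, and a tab written as &#9; -->
        <WHITESPACE> </WHITESPACE>
        <WHITESPACE>&#9;</WHITESPACE>

        <!-- Comments run from "#" to the end of the line -->
        <EOL_SPAN TYPE="COMMENT1">#</EOL_SPAN>

        <!-- Quoted strings may not span more than one line -->
        <SPAN TYPE="LITERAL1" NO_LINE_BREAK="TRUE">
            <BEGIN>"</BEGIN>
            <END>"</END>
        </SPAN>

        <!-- The text before "=" is the setting name -->
        <MARK_PREVIOUS TYPE="KEYWORD2" EXCLUDE_MATCH="TRUE">=</MARK_PREVIOUS>

        <!-- Whole-word values that deserve their own color -->
        <KEYWORDS IGNORE_CASE="TRUE">
            <LITERAL2>true</LITERAL2>
            <LITERAL2>false</LITERAL2>
        </KEYWORDS>
    </RULES>
</MODE>
```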