source: other-projects/rsyntax-textarea/devel-packages/jflex-1.4.3/doc/manual.html@ 25584

Last change on this file since 25584 was 25584, checked in by davidb, 12 years ago

Initial cut an a text edit area for GLI that supports color syntax highlighting

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2
3<!--Converted with LaTeX2HTML 2002-2-1 (1.71)
4original version by: Nikos Drakos, CBLU, University of Leeds
5* revised and updated by: Marcus Hennecke, Ross Moore, Herb Swan
6* with significant contributions from:
7 Jens Lippmann, Marek Rouchal, Martin Wilck and others -->
8<HTML>
9<HEAD>
10<TITLE>JFlex User's Manual</TITLE>
11<META NAME="description" CONTENT="JFlex User's Manual">
12<META NAME="keywords" CONTENT="manual">
13<META NAME="resource-type" CONTENT="document">
14<META NAME="distribution" CONTENT="global">
15
16<META NAME="Generator" CONTENT="LaTeX2HTML v2002-2-1">
17<META HTTP-EQUIV="Content-Style-Type" CONTENT="text/css">
18
19<LINK REL="STYLESHEET" HREF="manual.css">
20
21</HEAD>
22
23<BODY >
24
25<P>
26
27<CENTER>
28<A NAME="TOP"></a>
29<A HREF="http://www.jflex.de"><IMG SRC="logo.png" BORDER=0 HEIGHT=223 WIDTH=577></a>
30</CENTER>
31
32<P>
33<DIV ALIGN="CENTER">
34<I><FONT SIZE="+2">The Fast Lexical Analyser Generator</FONT>
35<BR></I></DIV>
36<P></P>
37<DIV ALIGN="CENTER"></DIV>
38<P></P>
39<DIV ALIGN="CENTER"><I>Copyright &#169;1998-2009 by <A NAME="tex2html1"
40 HREF="http://www.doclsf.de">Gerwin Klein</A>
41<BR></I></DIV>
42<P><P><BR>
43<DIV ALIGN="CENTER"><I><FONT SIZE="+4"><I><B>JFlex User's Manual</B></I></FONT>
44<BR></I></DIV>
45<P><P><BR>
46<DIV ALIGN="CENTER"><I>Version 1.4.3, January 31, 2009
47
48</I></DIV>
49
50<P>
51<BR>
52
53<H2><A NAME="SECTION00010000000000000000">
54Contents</A>
55</H2>
56<!--Table of Contents-->
57
58<UL>
59<LI><A NAME="tex2html79"
60 HREF="manual.html#SECTION00020000000000000000">Introduction</A>
61<UL>
62<LI><A NAME="tex2html80"
63 HREF="manual.html#SECTION00021000000000000000">Design goals</A>
64<LI><A NAME="tex2html81"
65 HREF="manual.html#SECTION00022000000000000000">About this manual</A>
66</UL><BR>
67<LI><A NAME="tex2html82"
68 HREF="manual.html#SECTION00030000000000000000">Installing and Running JFlex</A>
69<UL>
70<LI><A NAME="tex2html83"
71 HREF="manual.html#SECTION00031000000000000000">Installing JFlex</A>
72<LI><A NAME="tex2html84"
73 HREF="manual.html#SECTION00032000000000000000">Running JFlex</A>
74</UL><BR>
75<LI><A NAME="tex2html85"
76 HREF="manual.html#SECTION00040000000000000000">A simple Example: How to work with JFlex</A>
77<UL>
78<LI><A NAME="tex2html86"
79 HREF="manual.html#SECTION00041000000000000000">Code to include</A>
80<LI><A NAME="tex2html87"
81 HREF="manual.html#SECTION00042000000000000000">Options and Macros</A>
82<LI><A NAME="tex2html88"
83 HREF="manual.html#SECTION00043000000000000000">Rules and Actions</A>
84<LI><A NAME="tex2html89"
85 HREF="manual.html#SECTION00044000000000000000">How to get it going</A>
86</UL><BR>
87<LI><A NAME="tex2html90"
88 HREF="manual.html#SECTION00050000000000000000">Lexical Specifications</A>
89<UL>
90<LI><A NAME="tex2html91"
91 HREF="manual.html#SECTION00051000000000000000">User code</A>
92<LI><A NAME="tex2html92"
93 HREF="manual.html#SECTION00052000000000000000">Options and declarations</A>
94<LI><A NAME="tex2html93"
95 HREF="manual.html#SECTION00053000000000000000">Lexical rules</A>
96</UL><BR>
97<LI><A NAME="tex2html94"
98 HREF="manual.html#SECTION00060000000000000000">Encodings, Platforms, and Unicode</A>
99<UL>
100<LI><A NAME="tex2html95"
101 HREF="manual.html#SECTION00061000000000000000">The Problem</A>
102<LI><A NAME="tex2html96"
103 HREF="manual.html#SECTION00062000000000000000">Scanning text files</A>
104<LI><A NAME="tex2html97"
105 HREF="manual.html#SECTION00063000000000000000">Scanning binaries</A>
106</UL><BR>
107<LI><A NAME="tex2html98"
108 HREF="manual.html#SECTION00070000000000000000">A few words on performance</A>
109<UL>
110<LI><A NAME="tex2html99"
111 HREF="manual.html#SECTION00071000000000000000">Comparison of JLex and JFlex</A>
112<LI><A NAME="tex2html100"
113 HREF="manual.html#SECTION00072000000000000000">How to write a faster specification</A>
114</UL><BR>
115<LI><A NAME="tex2html101"
116 HREF="manual.html#SECTION00080000000000000000">Porting Issues</A>
117<UL>
118<LI><A NAME="tex2html102"
119 HREF="manual.html#SECTION00081000000000000000">Porting from JLex</A>
120<LI><A NAME="tex2html103"
121 HREF="manual.html#SECTION00082000000000000000">Porting from lex/flex</A>
122</UL><BR>
123<LI><A NAME="tex2html104"
124 HREF="manual.html#SECTION00090000000000000000">Working together</A>
125<UL>
126<LI><A NAME="tex2html105"
127 HREF="manual.html#SECTION00091000000000000000">JFlex and CUP</A>
128<LI><A NAME="tex2html106"
129 HREF="manual.html#SECTION00092000000000000000">JFlex and BYacc/J</A>
130</UL><BR>
131<LI><A NAME="tex2html107"
132 HREF="manual.html#SECTION000100000000000000000">Bugs and Deficiencies</A>
133<UL>
134<LI><A NAME="tex2html108"
135 HREF="manual.html#SECTION000101000000000000000">Deficiencies</A>
136<LI><A NAME="tex2html109"
137 HREF="manual.html#SECTION000102000000000000000">Bugs</A>
138</UL><BR>
139<LI><A NAME="tex2html110"
140 HREF="manual.html#SECTION000110000000000000000">Copying and License</A>
141<LI><A NAME="tex2html111"
142 HREF="manual.html#SECTION000120000000000000000">Bibliography</A>
143</UL>
144<!--End of Table of Contents-->
145
146<H1><A NAME="SECTION00020000000000000000"></A><A NAME="Intro"></A><BR>
147Introduction
148</H1>
JFlex is a lexical analyser generator for Java<A NAME="tex2html2"
 HREF="#foot33"><SUP><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="footnote.png"></SUP></A>, written in Java. It is also a rewrite of the very useful tool JLex [<A
 HREF="manual.html#JLex">3</A>], which
was developed by Elliot Berk at Princeton University. As Vern Paxson states
for his C/C++ tool flex [<A
 HREF="manual.html#flex">11</A>], they do not share any code, though.
155
156<P>
157
158<H2><A NAME="SECTION00021000000000000000">
159Design goals</A>
160</H2>
161The main design goals of JFlex are:
162
163<UL>
164<LI><B>Full unicode support</B>
165</LI>
166<LI><B>Fast generated scanners </B>
167</LI>
168<LI><B>Fast scanner generation</B>
169</LI>
170<LI><B>Convenient specification syntax</B>
171</LI>
172<LI><B>Platform independence</B>
173</LI>
174<LI><B>JLex compatibility</B>
175</LI>
176</UL>
177
178<P>
179
180<H2><A NAME="SECTION00022000000000000000">
181About this manual</A>
182</H2>
183This manual gives a brief but complete description of the tool JFlex. It
184assumes that you are familiar with the issue of lexical analysis. The references [<A
185 HREF="manual.html#Aho">1</A>],
186[<A
187 HREF="manual.html#Appel">2</A>], and [<A
188 HREF="manual.html#Maurer">13</A>] provide a good introduction to this topic.
189
190<P>
191The next section of this manual describes <A HREF="#Installing"><I>installation procedures</I></A>
for JFlex. If you have never worked with JLex or
193just want to compare a JLex and a JFlex scanner specification you
194should also read <A HREF="#Example"><I>Working with JFlex - an example</I></A>
195(section <A HREF="#Example">3</A>). All options and the complete
196specification syntax are presented in
197<A HREF="#Specifications"><I>Lexical specifications</I></A> (section <A HREF="#Specifications">4</A>);
198<A HREF="#sec:encodings"><I>Encodings, Platforms, and Unicode</I></A> (section <A HREF="#sec:encodings">5</A>)
199provides information about scanning text vs.&nbsp;binary files.
200If you are interested in performance
201considerations and comparing JLex with JFlex speed,
202<A HREF="#performance"><I>a few words on performance</I></A> (section <A HREF="#performance">6</A>)
203might be just right for you. Those who want to
204use their old JLex specifications may want to check out section <A HREF="#Porting">7.1</A>
205<A HREF="#Porting"><I>Porting from JLex</I></A> to avoid possible problems
with non-portable or non-standard JLex behaviour that has been fixed in
207JFlex. Section <A HREF="#lexport">7.2</A> talks about porting scanners from the
208Unix tools lex and flex. Interfacing JFlex scanners with the LALR
209parser generators CUP and BYacc/J is explained in <A HREF="#WorkingTog"><I>working
210 together</I></A> (section <A HREF="#WorkingTog">8</A>). Section <A HREF="#Bugs">9</A>
211<A HREF="#Bugs"><I>Bugs</I></A> gives a list of currently known active bugs.
212The manual concludes with notes about
213<A HREF="#Copyright"><I>Copying and License</I></A> (section <A HREF="#Copyright">10</A>) and
214<A HREF="#References">references</A>.
215
216<P>
217
218<H1><A NAME="SECTION00030000000000000000"></A><A NAME="Installing"></A><BR>
219Installing and Running JFlex
220</H1>
221
222<P>
223
224<H2><A NAME="SECTION00031000000000000000">
225Installing JFlex</A>
226</H2>
227
228<P>
229
230<H3><A NAME="SECTION00031100000000000000"></A><A NAME="install:windows"></A><BR>
231Windows
232</H3>
233To install JFlex on Windows 95/98/NT/XP, follow these three steps:
234
235<OL>
236<LI>Unzip the file you downloaded into the directory you want JFlex in (using
237something like
238<A NAME="tex2html3"
239 HREF="http://www.winzip.com">WinZip</A>).
240If you unzipped it to say <code>C:\</code>, the following directory structure
241should be generated:
242
<PRE>
C:\JFlex\
    +--bin\                   (start scripts)
    +--doc\                   (FAQ and manual)
    +--examples\
        +--binary\            (scanning binary files)
        +--byaccj\            (calculator example for BYacc/J)
        +--cup\               (calculator example for cup)
        +--interpreter\       (interpreter example for cup)
        +--java\              (Java lexer specification)
        +--simple\            (example scanner)
        +--simple-maven\      (example with maven)
        +--standalone\        (a simple standalone scanner)
        +--standalone-maven\  (above with maven)
    +--lib\                   (the precompiled classes)
    +--src\
        +--JFlex\             (source code of JFlex)
        +--JFlex\gui          (source code of JFlex UI classes)
        +--java_cup\runtime\  (source code of cup runtime classes)
</PRE>
263
264<P>
265</LI>
266<LI>Edit the file <B><code>bin\jflex.bat</code></B>
267(in the example it's <code>C:\JFlex\bin\jflex.bat</code>)
268such that
269
270<P>
271
272<UL>
273<LI><B><TT>JAVA_HOME</TT></B> contains the directory where your Java JDK is installed
274 (for instance <code>C:\java</code>) and
275</LI>
276<LI><B><TT>JFLEX_HOME</TT></B> the directory that contains JFlex (in the example:
277 <code>C:\JFlex</code>)
278</LI>
279</UL>
280
281<P>
282</LI>
<LI>Include the <code>bin\</code> directory of JFlex in your path
(the one that contains the start script; in the example: <code>C:\JFlex\bin</code>).
285</LI>
286</OL>
287
288<P>
289
290<H3><A NAME="SECTION00031200000000000000">
291Unix with tar archive</A>
292</H3>
293
294<P>
295To install JFlex on a Unix system, follow these two steps:
296
297<UL>
298<LI>Decompress the archive into a directory of your choice
299 with GNU tar, for instance to <TT>/usr/share</TT>:
300
301<P>
302<TT>tar -C /usr/share -xvzf jflex-1.4.3.tar.gz</TT>
303
304<P>
(The example is for a site-wide installation. You need to
 be root for that. User installation works exactly the
 same way: just choose a directory where you have write
 permission.)
309
310<P>
311</LI>
312<LI>Make a symbolic link from somewhere in your binary
313 path to <TT>bin/jflex</TT>, for instance:
314
315<P>
316<TT>ln -s /usr/share/JFlex/bin/jflex /usr/bin/jflex</TT>
317
318<P>
319If the Java interpreter is not in your binary path, you
320 need to supply its location in the script <TT>bin/jflex</TT>.
321</LI>
322</UL>
323
324<P>
325You can verify the integrity of the downloaded file with
326the MD5 checksum available on the <A NAME="tex2html4"
327 HREF="http://www.jflex.de/download.html">JFlex download page</A>.
If you put the checksum file in the same directory
as the archive, you can run:
330
331<P>
332<code>md5sum --check </code><TT>jflex-1.4.3.tar.gz.md5</TT>
333
334<P>
335It should tell you
336
337<P>
338<TT>jflex-1.4.3.tar.gz: OK</TT>
339
340<P>
341
342<H3><A NAME="SECTION00031300000000000000">
343Linux with RPM</A>
344</H3>
345
346<P>
347
348<UL>
349<LI>become root
350</LI>
351<LI>issue
352<BR> <TT>rpm -U jflex-1.4.3-0.rpm</TT>
353</LI>
354</UL>
355
356<P>
357You can verify the integrity of the downloaded <TT>rpm</TT> file with
358
359<P>
360<code>rpm --checksig </code><TT>jflex-1.4.3-0.rpm</TT>
361
362<P>
363
364<H2><A NAME="SECTION00032000000000000000">
365Running JFlex</A>
366</H2>
367You run JFlex with:
368
369<P>
370<TT>jflex &lt;options&gt; &lt;inputfiles&gt;</TT>
371
372<P>
373It is also possible to skip the start script in <code>bin\</code>
374and include the file <code>lib\JFlex.jar</code>
375in your <TT>CLASSPATH</TT> environment variable instead.
376
377<P>
378Then you run JFlex with:
379
380<P>
381<TT>java JFlex.Main &lt;options&gt; &lt;inputfiles&gt;</TT>
382
383<P>
384The input files and options are in both cases optional. If you don't provide a file name on
385the command line, JFlex will pop up a window to ask you for one.
386
387<P>
388JFlex knows about the following options:
389
390<P>
391<DL>
392<DT></DT>
393<DD><code>-d &lt;directory&gt;</code>
394<BR> writes the generated file to the directory <code>&lt;directory&gt;</code>
395
396<P>
397</DD>
398<DT></DT>
399<DD><code>--skel &lt;file&gt;</code>
400<BR> uses external skeleton <code>&lt;file&gt;</code>. This is mainly for JFlex
401 maintenance and special low level customisations. Use only when you
402 know what you are doing! JFlex comes with a skeleton file in the
403 <TT>src</TT> directory that reflects exactly the internal, pre-compiled
 skeleton and can be used with the <TT>--skel</TT> option.
405
406<P>
407</DD>
408<DT></DT>
409<DD><code>--nomin</code>
410<BR> skip the DFA minimisation step during scanner generation.
411
412<P>
413</DD>
414<DT></DT>
415<DD><code>--jlex</code>
<BR> tries even harder to comply with the JLex interpretation of specs.
417
418<P>
419</DD>
420<DT></DT>
421<DD><code>--dot</code>
422<BR> generate graphviz dot files for the NFA, DFA and minimised
423 DFA. This feature is still in alpha status, and not
424 fully implemented yet.
425
426<P>
427</DD>
428<DT></DT>
429<DD><code>--dump</code>
430<BR> display transition tables of NFA, initial DFA, and minimised DFA
431
432<P>
433</DD>
434<DT></DT>
435<DD><code>--verbose</code> or <TT>-v</TT>
436<BR> display generation progress messages (enabled by default)
437
438<P>
439</DD>
440<DT></DT>
441<DD><code>--quiet</code> or <TT>-q</TT>
442<BR> display error messages only (no chatter about what JFlex is
443 currently doing)
444
445<P>
446</DD>
447<DT></DT>
448<DD><code>--time</code>
449<BR> display time statistics about the code generation process
450 (not very accurate)
451
452<P>
453</DD>
454<DT></DT>
455<DD><code>--version</code>
456<BR> print version number
457
458<P>
459</DD>
460<DT></DT>
461<DD><code>--info</code>
462<BR> print system and JDK information (useful if you'd like
463 to report a problem)
464
465<P>
466</DD>
467<DT></DT>
468<DD><code>--pack</code>
469<BR> use the %pack code generation method by default
470
471<P>
472</DD>
473<DT></DT>
474<DD><code>--table</code>
475<BR> use the %table code generation method by default
476
477<P>
478</DD>
479<DT></DT>
480<DD><code>--switch</code>
481<BR> use the %switch code generation method by default
482
483<P>
484</DD>
485<DT></DT>
486<DD><code>--help</code> or <TT>-h</TT>
487<BR> print a help message explaining options and usage of JFlex.
488</DD>
489</DL>
490
491<P>
492
493<H1><A NAME="SECTION00040000000000000000"></A><A NAME="Example"></A><BR>
494A simple Example: How to work with JFlex
495</H1>
496To demonstrate what a lexical specification with JFlex looks like, this
497section presents a part of the specification for the Java language.
498The example does not describe the whole lexical structure of Java programs,
499but only a small and simplified part of it (some keywords, some operators,
500comments and only two kinds of literals). It also shows how to interface
501with the LALR parser generator CUP [<A
502 HREF="manual.html#CUP">8</A>] and therefore
503uses a class <TT>sym</TT> (generated by CUP), where integer constants for
504the terminal tokens of the CUP grammar are declared. JFlex comes with a
505directory <TT>examples</TT>, where you can find a small standalone scanner
506that doesn't need other tools like CUP to give you a running example.
507The "<TT>examples</TT>" directory also contains a <EM>complete</EM> JFlex
508specification of the lexical structure of Java programs together with the
509CUP parser specification for Java by
510<A NAME="tex2html5"
511 HREF="mailto:[email protected]">C. Scott Ananian</A>, obtained
512from the CUP [<A
513 HREF="manual.html#CUP">8</A>] web site (it was modified to interface with the JFlex scanner).
514Both specifications adhere to the Java Language Specification [<A
515 HREF="manual.html#LangSpec">7</A>].
516
517<P>
<FONT SIZE="-1"><A NAME="CodeTop"></A></FONT><PRE>
/* JFlex example: part of Java language lexer specification */
import java_cup.runtime.*;

/**
 * This class is a simple example lexer.
 */
%%
</PRE><FONT SIZE="-1">
<A NAME="CodeOptions"></A></FONT><PRE>
%class Lexer
%unicode
%cup
%line
%column
</PRE><FONT SIZE="-1">
<A NAME="CodeScannerCode"></A></FONT><PRE>
%{
  StringBuffer string = new StringBuffer();

  private Symbol symbol(int type) {
    return new Symbol(type, yyline, yycolumn);
  }
  private Symbol symbol(int type, Object value) {
    return new Symbol(type, yyline, yycolumn, value);
  }
%}
</PRE><FONT SIZE="-1">
<A NAME="CodeMacros"></A></FONT><PRE>
LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]
WhiteSpace     = {LineTerminator} | [ \t\f]

/* comments */
Comment = {TraditionalComment} | {EndOfLineComment} | {DocumentationComment}

TraditionalComment   = "/*" [^*] ~"*/" | "/*" "*"+ "/"
EndOfLineComment     = "//" {InputCharacter}* {LineTerminator}
DocumentationComment = "/**" {CommentContent} "*"+ "/"
CommentContent       = ( [^*] | \*+ [^/*] )*

Identifier = [:jletter:] [:jletterdigit:]*

DecIntegerLiteral = 0 | [1-9][0-9]*
</PRE><FONT SIZE="-1">
<A NAME="CodeStateDecl"></A></FONT><PRE>
%state STRING

%%
</PRE><FONT SIZE="-1">
<A NAME="CodeRulesYYINITIAL"></A></FONT><PRE>
/* keywords */
&lt;YYINITIAL&gt; "abstract"           { return symbol(sym.ABSTRACT); }
&lt;YYINITIAL&gt; "boolean"            { return symbol(sym.BOOLEAN); }
&lt;YYINITIAL&gt; "break"              { return symbol(sym.BREAK); }
</PRE><FONT SIZE="-1">
<A NAME="CodeRulesBunch"></A></FONT><PRE>
&lt;YYINITIAL&gt; {
  /* identifiers */
  {Identifier}                   { return symbol(sym.IDENTIFIER); }

  /* literals */
  {DecIntegerLiteral}            { return symbol(sym.INTEGER_LITERAL); }
  \"                             { string.setLength(0); yybegin(STRING); }

  /* operators */
  "="                            { return symbol(sym.EQ); }
  "=="                           { return symbol(sym.EQEQ); }
  "+"                            { return symbol(sym.PLUS); }

  /* comments */
  {Comment}                      { /* ignore */ }

  /* whitespace */
  {WhiteSpace}                   { /* ignore */ }
}
</PRE><FONT SIZE="-1">
<A NAME="CodeRulesYYtext"></A></FONT><PRE>
&lt;STRING&gt; {
  \"                             { yybegin(YYINITIAL);
                                   return symbol(sym.STRING_LITERAL,
                                   string.toString()); }
  [^\n\r\"\\]+                   { string.append( yytext() ); }
  \\t                            { string.append('\t'); }
  \\n                            { string.append('\n'); }

  \\r                            { string.append('\r'); }
  \\\"                           { string.append('\"'); }
  \\                             { string.append('\\'); }
}
</PRE><FONT SIZE="-1">
<A NAME="CodeRulesAllStates"></A></FONT><PRE>
/* error fallback */
.|\n                             { throw new Error("Illegal character &lt;"+
                                                    yytext()+"&gt;"); }
</PRE>
614
615<P>
616From this specification JFlex generates a <TT>.java</TT> file with one
617class that contains code for the scanner. The class will have a
618constructor taking a <TT>java.io.Reader</TT> from which the input is
619read. The class will also have a function <TT>yylex()</TT> that runs the
620scanner and that can be used to get the next token from the input (in this
621example the function actually has the name <TT>next_token()</TT> because
622the specification uses the <TT>%cup</TT> switch).
623
624<P>
625As with JLex, the specification consists of three parts, divided by <TT>%%</TT>:
626
627<UL>
628<LI><A HREF="#ExampleUserCode">usercode</A>,
629</LI>
630<LI><A HREF="#ExampleOptions">options and declarations</A> and
631</LI>
632<LI><A HREF="#ExampleLexRules">lexical rules</A>.
633</LI>
634</UL>
635
636<P>
637
638<H2><A NAME="SECTION00041000000000000000"></A><A NAME="ExampleUserCode"></A><BR>
639Code to include
640</H2>
Let's take a look at the first section, "user code": the text up to the
first line starting with <TT>%%</TT> is copied verbatim to the top
of the generated lexer class (before the actual class declaration).
Besides <TT>package</TT> and <TT>import</TT> statements there is usually not much
to do here. If the code ends with a javadoc class comment, the generated class
will get this comment; if not, JFlex will generate one automatically.
647
648<P>
649
650<H2><A NAME="SECTION00042000000000000000"></A><A NAME="ExampleOptions"></A><BR>
651Options and Macros
652</H2>
653The second section ``options and declarations'' is more interesting. It consists
654of a set of options, code that is included inside the generated scanner
655class, lexical states and macro declarations. Each JFlex option must begin
656a line of the specification and starts with a <TT>%</TT>. In our example
657the following options are used:
658
659<P>
660
661<UL>
662<LI><TT><A HREF="#CodeOptions">%class Lexer</A></TT> tells JFlex to give the
663 generated class the name ``Lexer'' and to write the code to a file ``<TT>Lexer.java</TT>''.
664
665<P>
666</LI>
667<LI><TT><A HREF="#CodeOptions">%unicode</A></TT> defines the set of characters the scanner will
668 work on. For scanning text files, <TT>%unicode</TT> should always be used. See also
669 section <A HREF="#sec:encodings">5</A> for more information on character sets, encodings, and
670 scanning text vs. binary files.
671
672<P>
673</LI>
674<LI><TT><A HREF="#CodeOptions">%cup</A></TT> switches to CUP compatibility
675 mode to interface with a CUP generated parser.
676
677<P>
678</LI>
679<LI><TT><A HREF="#CodeOptions">%line</A></TT> switches line counting on (the
680 current line number can be accessed via the variable <TT>yyline</TT>)
681
682<P>
683</LI>
684<LI><TT><A HREF="#CodeOptions">%column</A></TT> switches column counting on
685 (current column is accessed via <TT>yycolumn</TT>)
686
687<P>
688</LI>
689</UL>
690<A NAME="ExampleScannerCode"></A>
691<P>
The code included in <TT><A HREF="#CodeScannerCode">%{...%}</A></TT>
is copied verbatim into the generated lexer class source.
Here you can declare member variables and functions that are used
inside scanner actions. In our example we declare a <TT>StringBuffer</TT> "<TT>string</TT>"
in which we will store parts of string literals and two helper functions
"<TT>symbol</TT>" that create <TT>java_cup.runtime.Symbol</TT> objects
with position information of the current token (see section <A HREF="#CUPWork">8.1</A>
<A HREF="#CUPWork"><I>JFlex and CUP</I></A>
for how to interface with the parser generator CUP). As with JFlex options, both
<code>%{</code> and <code>%}</code> must begin a line.
702<A NAME="ExampleMacros"></A>
703<P>
704The specification continues with macro declarations. Macros are
705abbreviations for regular expressions, used to make lexical specifications
706easier to read and understand. A macro declaration
707consists of a macro identifier followed by <TT>=</TT>, then followed by
708the regular expression it represents. This regular expression may
itself contain macro usages. Although this allows a grammar-like specification
style, macros are still just abbreviations and not nonterminals: they
cannot be recursive or mutually recursive. Cycles in macro definitions
712are detected and reported at generation time by JFlex.
713
714<P>
Here are some of the example macros in more detail:
716
717<UL>
<LI><TT><A HREF="#CodeMacros">LineTerminator</A></TT> stands for the regular
 expression that matches an ASCII CR, an ASCII LF, or a CR followed by an LF.
720
721<P>
722</LI>
723<LI><TT><A HREF="#CodeMacros">InputCharacter</A></TT> stands for all characters
724 that are not a CR or LF.
725
726<P>
727</LI>
728<LI><TT><A HREF="#CodeMacros">TraditionalComment</A></TT> is the expression
729 that matches the string <TT>"/*"</TT> followed by a character that
 is not a <TT>*</TT>, followed by anything that does not contain, but
 ends in <TT>"*/"</TT>. As this would not match comments like
732 <TT>/****/</TT>, we add <TT>"/*"</TT> followed by an arbitrary
733 number (at least one) of <TT>"*"</TT> followed by the closing
734 <TT>"/"</TT>. This is not the only, but one of the simpler
735 expressions matching non-nesting Java comments. It is tempting to
736 just write something like the expression <TT>"/*" .* "*/"</TT>, but
737 this would match more than we want. It would for instance match the
738 whole of <TT>/* */ x = 0; /* */</TT>, instead of two comments and
739 four real tokens. See DocumentationComment and CommentContent for an
740 alternative.
741
742<P>
743</LI>
<LI><TT><A HREF="#CodeMacros">CommentContent</A></TT> matches zero or more
 occurrences of either any character except a <TT>*</TT>, or one or more
 <TT>*</TT> followed by a character that is neither a <TT>/</TT> nor a <TT>*</TT>.
747
748<P>
749</LI>
750<LI><TT><A HREF="#CodeMacros">Identifier</A></TT> matches each string that
751 starts with a character of class <TT>jletter</TT> followed by zero or more characters
752 of class <TT>jletterdigit</TT>. <TT>jletter</TT> and <TT>jletterdigit</TT>
753 are predefined character classes. <TT>jletter</TT> includes all characters for which
754 the Java function <TT>Character.isJavaIdentifierStart</TT> returns <TT>true</TT> and
 <TT>jletterdigit</TT> all characters for which <TT>Character.isJavaIdentifierPart</TT>
756 returns <TT>true</TT>.
757</LI>
758</UL>
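<P>
Two of the macros above can be tried out in plain Java. The following is only a sketch under stated assumptions: the class <TT>MacroDemo</TT> and its members are made-up names, and the <TT>java.util.regex</TT> patterns merely approximate the JFlex expressions (in particular, the reluctant <TT>.*?</TT> stands in for the JFlex operator <TT>~</TT>). It shows why the naive comment pattern matches too much, and how the <TT>Identifier</TT> macro relates to the two <TT>Character</TT> methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: approximates two of the example macros with plain Java. */
final class MacroDemo {
    // Naive comment pattern: the greedy .* runs on to the LAST "*/" of the line.
    static final String GREEDY = "/\\*.*\\*/";
    // Rough java.util.regex counterpart of TraditionalComment; the reluctant
    // .*? is an approximation of the JFlex "upto" operator ~ (an assumption).
    static final String TRADITIONAL = "(?s)/\\*[^*].*?\\*/|/\\*\\*+/";

    /** Returns the text matched at the start of input, or null if nothing matches. */
    static String matchAtStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    /** Same test as the Identifier macro: one jletter, then any number of jletterdigit. */
    static boolean isIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i != s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String input = "/* */ x = 0; /* */";
        System.out.println(matchAtStart(GREEDY, input));      // the whole line
        System.out.println(matchAtStart(TRADITIONAL, input)); // only the first comment
        System.out.println(isIdentifier("foo1") + " " + isIdentifier("1foo"));
    }
}
```

<P>
The greedy pattern swallows the assignment between the two comments, which is exactly the problem described for <TT>"/*" .* "*/"</TT>.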
759<A NAME="ExampleStateDecl"></A>
760<P>
761The last part of the second section in our
762lexical specification is a lexical state declaration:
763<TT><A HREF="#CodeStateDecl">%state STRING</A></TT>
764declares a lexical state <TT>STRING</TT> that can be
765used in the ``lexical rules'' part of the specification. A state declaration
766is a line starting with <TT>%state</TT> followed by a space or comma
767separated list of state identifiers. There can be more than one line starting
768with <TT>%state</TT>.
769
770<P>
771
772<H2><A NAME="SECTION00043000000000000000"></A><A NAME="ExampleLexRules"></A><BR>
773Rules and Actions
774</H2>
775The "lexical rules" section of a JFlex specification contains regular expressions
776and actions (Java code) that are executed when the scanner matches the
777associated regular expression. As the scanner reads its input, it keeps
778track of all regular expressions and activates the action of the expression
that has the longest match. For the input
"<TT>breaker</TT>", for instance, our specification above matches the regular expression for <TT><A HREF="#CodeMacros">Identifier</A></TT>
781and not the keyword "<TT><A HREF="#CodeRulesYYINITIAL">break</A></TT>"
782followed by the Identifier "<TT>er</TT>", because rule <code>{Identifier}</code>
783matches more of this input at once (i.e. it matches all of it)
784than any other rule in the specification. If two regular expressions both
785have the longest match for a certain input, the scanner chooses the action
786of the expression that appears first in the specification. In that way, we
787get for input "<TT>break</TT>" the keyword "<TT>break</TT>" and not an
788Identifier "<TT>break</TT>".
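<P>
The two selection criteria (longest match first, then order in the specification) can be sketched in a few lines of Java. This is a hand-written illustration, not the code JFlex generates: the class <TT>LongestMatchDemo</TT>, its rule table, and the simplified identifier pattern are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hand-written illustration of the rule selection; not generated scanner code. */
final class LongestMatchDemo {
    // Rules in specification order: the keyword rule precedes the identifier rule.
    private static final String[] NAMES = { "BREAK", "IDENTIFIER" };
    private static final Pattern[] RULES = {
        Pattern.compile("break"),
        Pattern.compile("[A-Za-z_][A-Za-z_0-9]*") // simplified stand-in for {Identifier}
    };

    /** Returns "NAME:lexeme" for the rule that wins at the start of input, or null. */
    static String firstToken(String input) {
        int bestLength = 0;
        String bestName = null;
        for (int i = 0; i != RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            // lookingAt() anchors the match at the start of the input. A later
            // rule must match strictly more text, so on a tie the earlier rule stays.
            if (m.lookingAt() && m.end() > bestLength) {
                bestLength = m.end();
                bestName = NAMES[i];
            }
        }
        return bestName == null ? null : bestName + ":" + input.substring(0, bestLength);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("breaker")); // identifier rule: longer match
        System.out.println(firstToken("break"));   // keyword rule: same length, listed first
    }
}
```

<P>
A generated scanner reaches the same decision with a single pass over a DFA rather than by trying each rule in turn; the loop here only makes the selection criteria visible.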
789
790<P>
In addition to regular expression matches, one can use lexical states to
792refine a specification. A lexical state acts like a start condition.
793If the scanner is in lexical state <TT>STRING</TT>, only expressions that
794are preceded by the start condition <TT>&lt;STRING&gt;</TT> can be matched.
795A start condition of a regular expression can contain more than one lexical
796state. It is then matched when the lexer is in any of these lexical states.
797The lexical state <TT>YYINITIAL</TT> is predefined and is also the state
798in which the lexer begins scanning. If a regular expression has no start
799conditions it is matched in <EM>all</EM> lexical states.
800<A NAME="ExampleRulesStateBunch"></A>
801<P>
802Since you often have a bunch of expressions with the same start conditions,
803JFlex allows the same abbreviation as the Unix tool <TT>flex</TT>:
<PRE>
&lt;STRING&gt; {
  expr1   { action1 }
  expr2   { action2 }
}
</PRE>
810means that both <TT>expr1</TT> and <TT>expr2</TT> have start condition <TT>&lt;STRING&gt;</TT>.
811<A NAME="ExampleRulesYYINITIAL"></A>
812<P>
813The first three rules in our example demonstrate the syntax of a regular
814expression preceded by the start condition <TT>&lt;YYINITIAL&gt;</TT>.
815
816<P>
817<TT><A HREF="#CodeRulesYYINITIAL">&lt;YYINITIAL&gt; "abstract"</A><code> {</code> return symbol(sym.ABSTRACT); <code>}</code></TT>
818
819<P>
820matches the input "<TT>abstract</TT>" only if the scanner is in its
821start state "<TT>YYINITIAL</TT>". When the string "<TT>abstract</TT>" is
822matched, the scanner function returns the CUP symbol <TT>sym.ABSTRACT</TT>.
823If an action does not return a value, the scanning process is resumed immediately
824after executing the action.
825<A NAME="ExampleRulesBunch"></A>
826<P>
827The rules enclosed in
828
829<P>
830<TT><A HREF="#CodeRulesBunch">&lt;YYINITIAL&gt; {
831<BR> ...
832<BR>}</A></TT>
833
834<P>
835demonstrate the abbreviated syntax and are also only matched in state <TT>YYINITIAL</TT>.
836<A NAME="ExampleRulesYYbegin"></A>
837<P>
838Of these rules, one may be of special interest:
839
840<P>
841<code>\" { </code> <TT><A HREF="#CodeRulesBunch">string.setLength(0); yybegin(STRING);</A></TT><code> }</code>
842
843<P>
844If the scanner matches a double quote in state <TT>YYINITIAL</TT> we
845have recognised the start of a string literal. Therefore we clear our <TT>StringBuffer</TT>
846that will hold the content of this string literal and tell the scanner
847with <TT>yybegin(STRING)</TT> to switch into the lexical state <TT>STRING</TT>.
848Because we do not yet return a value to the parser, our scanner proceeds
849immediately.
850<A NAME="ExampleRulesYYtext"></A>
851<P>
852In lexical state <TT>STRING</TT> another
853rule demonstrates how to refer to the input that has been matched:
854
855<P>
<code>[^\n\r\"\\]+ { </code> <TT><A HREF="#CodeRulesYYtext">string.append( yytext() );</A></TT><code> }</code>

<P>
The expression <code>[^\n\r\"\\]+</code> matches
860all characters in the input up to the next backslash (indicating an
861escape sequence such as <code>\n</code>), double quote (indicating the end
862of the string), or line terminator (which must not occur in a string literal).
863The matched region of the input is referred to with <TT><A HREF="#CodeRulesYYtext">yytext()</A></TT>
864and appended to the content of the string literal parsed so far.
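<P>
To make the combined effect of the <TT>STRING</TT> rules concrete, here is a hand-written Java sketch of the same logic. It is not what JFlex generates; the class <TT>StringStateDemo</TT> and the method <TT>scanStringBody</TT> are hypothetical names, and the exception type is an arbitrary choice standing in for the error fallback rule.

```java
/** Illustration only: what the STRING-state rules of the example do, written by hand. */
final class StringStateDemo {
    /** Scans the text after an opening double quote and returns the literal's content. */
    static String scanStringBody(String input) {
        StringBuilder string = new StringBuilder(); // plays the role of the spec's "string" buffer
        int pos = 0;
        while (pos != input.length()) {
            char c = input.charAt(pos);
            if (c == '"') {
                return string.toString();           // closing quote: the literal is complete
            } else if (c == '\n' || c == '\r') {
                // in the specification only the error fallback rule matches here
                throw new IllegalArgumentException("line terminator in string literal");
            } else if (c == '\\') {
                char next = pos + 1 == input.length() ? '\0' : input.charAt(pos + 1);
                switch (next) {                      // the two-character rules win (longest match)
                    case 't': string.append('\t'); pos += 2; break;
                    case 'n': string.append('\n'); pos += 2; break;
                    case 'r': string.append('\r'); pos += 2; break;
                    case '"': string.append('"');  pos += 2; break;
                    default:  string.append('\\'); pos += 1; break; // lone backslash rule
                }
            } else {
                int start = pos;                     // run of ordinary characters
                while (pos != input.length() && "\n\r\"\\".indexOf(input.charAt(pos)) == -1) {
                    pos++;
                }
                string.append(input, start, pos);    // the text yytext() would return
            }
        }
        throw new IllegalArgumentException("unterminated string literal");
    }

    public static void main(String[] args) {
        System.out.println(scanStringBody("a\\tb\\\"c\" rest"));
    }
}
```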
865<A NAME="ExampleRuleLast"></A>
866<P>
867The last lexical rule in the example specification
868is used as an error fallback. It matches any character in any state that
869has not been matched by another rule. It doesn't conflict with any other
rule because it has the lowest priority (because it's the last rule) and
871because it matches only one character (so it can't have longest match
872precedence over any other rule).
873
874<P>
875
876<H2><A NAME="SECTION00044000000000000000">
877How to get it going</A>
878</H2>
879
880<UL>
881<LI>Install JFlex (see section <A HREF="#Installing">2</A> <A HREF="#Installing"><I>Installing JFlex</I></A>)
882
883<P>
884</LI>
885<LI>If you have written your specification file (or chosen one from the <TT>examples</TT>
886directory), save it (say under the name <TT>java-lang.flex</TT>).
887
888<P>
889</LI>
890<LI>Run JFlex with
891
892<P>
893<TT>jflex java-lang.flex</TT>
894
895<P>
896</LI>
897<LI>JFlex should then report some progress messages about generating the scanner
898and write the generated code to the directory of your specification file.
899
900<P>
901</LI>
902<LI>Compile the generated <TT>.java</TT> file and your own classes. (If you
use CUP, generate your parser classes first.)
904
905<P>
906</LI>
907<LI>That's it.
908</LI>
909</UL>
910
911<P>
912
913<H1><A NAME="SECTION00050000000000000000"></A><A NAME="Specifications"></A><BR>
914Lexical Specifications
915</H1>
916As shown above, a lexical specification file for JFlex consists of three
917parts divided by a single line starting with <TT>%%</TT>:
918
919<P>
920<TT><A HREF="#SpecUsercode">UserCode</A></TT>
921<BR><TT>%%</TT>
922<BR><TT><A HREF="#SpecOptions">Options and declarations</A></TT>
923<BR><TT>%%</TT>
924<BR><TT><A HREF="#LexRules">Lexical rules</A></TT>
925
926<P>
927In all parts of the specification comments of the form
928<TT>/* comment text */</TT> and the Java style end of line comments starting with <TT>//</TT>
929are permitted. JFlex comments do nest - so the number of <TT>/*</TT> and <TT>*/</TT>
930should be balanced.
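
<P>
As a minimal sketch of this three-part layout, a complete specification could look
like the one below. The package name <TT>example</TT> and the class name
<TT>ExampleLexer</TT> are made up for illustration, and <TT>%int</TT> is chosen only so
that no token class is needed.

<P>
<PRE>
/* part 1: user code, copied to the top of the generated file */
package example;

%%

/* part 2: options and declarations */
%class ExampleLexer
%int
%unicode

%%

/* part 3: lexical rules */
[0-9]+   { System.out.println("number: " + yytext()); }
.|\n     { /* skip everything else */ }
</PRE>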
931
932<P>
933
934<H2><A NAME="SECTION00051000000000000000"></A><A NAME="SpecUsercode"></A><BR>
935User code
936</H2>
937The first part contains user code that is copied verbatim into the beginning
938of the source file of the generated lexer before the scanner class is declared.
939As shown in the example above, this is the place to put <TT>package</TT>
940declarations and <TT>import</TT>
statements. It is possible, but not considered good Java programming
style, to put your own helper classes (such as token classes) in this section.
They should get their own <TT>.java</TT> files instead.
944
945<P>
946
947<H2><A NAME="SECTION00052000000000000000"></A><A NAME="SpecOptions"></A><BR>
948Options and declarations
949</H2>
950The second part of the lexical specification contains <A HREF="#SpecOptDirectives">options</A>
951to customise your generated lexer (JFlex directives and Java code to include in
952different parts of the lexer), declarations of <A HREF="#StateDecl">lexical states</A> and
953<A HREF="#MacroDefs">macro definitions</A> for use in the third section
954<A HREF="#LexRules">``Lexical rules''</A> of the lexical specification file.
955<A NAME="SpecOptDirectives"></A>
956<P>
957Each JFlex directive must be situated at the beginning of a line
958and starts with the <TT>%</TT> character. Directives that have one or
959more parameters are described as follows:
960
961<P>
962<TT>%class "classname"</TT>
963
964<P>
965means that you start a line with <TT>%class</TT> followed by a space followed
966by the name of the class for the generated scanner (the double quotes are
967<I>not</I> to be entered, see the <A HREF="#CodeOptions">example specification</A> in
968section <A HREF="#CodeOptions">3</A>).
969
970<P>
971
972<H3><A NAME="SECTION00052100000000000000"></A><A NAME="ClassOptions"></A><BR>
973Class options and user class code
974</H3>
975These options regard name, constructor, API, and related parts of the
976generated scanner class.
977
978<UL>
979<LI><B><TT>%class "classname"</TT></B>
980
981<P>
982Tells JFlex to give the generated class the name "<TT>classname</TT>" and to
983write the generated code to a file "<TT>classname.java</TT>". If the
984<TT>-d &lt;directory&gt;</TT> command line option is not used, the code
985will be written to the directory where the specification file resides. If
986no <TT>%class</TT> directive is present in the specification, the generated
987class will get the name "<TT>Yylex</TT>" and will be written to a file
988"<TT>Yylex.java</TT>". There should be only one <TT>%class</TT> directive
989in a specification.
990
991<P>
992</LI>
993<LI><B><TT>%implements "interface 1"[, "interface 2", ..]</TT></B>
994
995<P>
996Makes the generated class implement the specified interfaces. If more than
997one <TT>%implements</TT> directive is present, all the specified interfaces
998will be implemented.
999
1000<P>
1001</LI>
1002<LI><B><TT>%extends "classname"</TT></B>
1003
1004<P>
1005Makes the generated class a subclass of the class ``<TT>classname</TT>''.
1006There should be only one <TT>%extends</TT> directive in a specification.
1007
1008<P>
1009</LI>
1010<LI><B><TT>%public</TT></B>
1011
1012<P>
1013Makes the generated class public (the class is only accessible in its
1014own package by default).
1015
1016<P>
1017</LI>
1018<LI><B><TT>%final</TT></B>
1019
1020<P>
1021Makes the generated class final.
1022
1023<P>
1024</LI>
1025<LI><B><TT>%abstract</TT></B>
1026
1027<P>
1028Makes the generated class abstract.
1029
1030<P>
1031</LI>
1032<LI><B><TT>%apiprivate</TT></B>
1033
1034<P>
1035Makes all generated methods and fields of the class
1036private. Exceptions are the constructor, user code in the
1037specification, and, if <code>%cup</code> is present, the method
1038<TT>next_token</TT>. All occurrences of
1039<TT>" public "</TT> (one space character before and after <TT>public</TT>)
1040in the skeleton file are replaced by
1041<TT>" private "</TT> (even if a user-specified skeleton is used).
1042Access to the generated class is expected to be mediated by user class
1043code (see next switch).
1044
1045<P>
1046</LI>
1047<LI><B><code>%{</code></B>
1048<BR><B><TT>...</TT></B>
1049<BR><B><code>%}</code></B>
1050
1051<P>
1052The code enclosed in <code>%{</code> and <code>%}</code> is copied verbatim
1053into the generated class. Here you can define your own member variables
1054and functions in the generated scanner. Like all options, both <code>%{</code>
1055and <code>%}</code> must start a line in the specification. If more than one
1056class code directive <code>%{...%}</code> is present, the code is concatenated
1057in order of appearance in the specification.
1058
1059<P>
1060</LI>
1061<LI><B><code>%init{</code></B>
1062<BR><B><TT>...</TT></B>
1063<BR><B><code>%init}</code></B>
1064
1065<P>
1066The code enclosed in <code>%init{</code> and <code>%init}</code> is copied
1067verbatim into the constructor of the generated class. Here, member
1068variables declared in the <code>%{...%}</code> directive can be initialised.
1069If more than one initialiser option is present, the code is concatenated
1070in order of appearance in the specification.
1071
1072<P>
1073</LI>
1074<LI><B><code>%initthrow{</code></B>
1075<BR><B><TT>"exception1"[, "exception2", ...]</TT></B>
1076<BR><B><code>%initthrow}</code></B>
1077
1078<P>
1079or (on a single line) just
1080
1081<P>
1082<B><TT>%initthrow "exception1" [, "exception2", ...]</TT></B>
1083
1084<P>
1085Causes the specified exceptions to be declared in the <TT>throws</TT>
1086clause of the constructor. If more than one <code>%initthrow{</code> <TT>...</TT> <code>%initthrow}</code>
1087directive is present in the specification, all specified exceptions will
1088be declared.
1089
1090<P>
1091</LI>
1092<LI><B><TT>%ctorarg "type" "ident"</TT></B>
1093
1094<P>
1095Adds the specified argument to the constructors of the generated scanner.
1096If more than one such directive is present, the arguments are added in order
1097of occurrence in the specification. Note that this option conflicts with
1098the <code>%standalone</code> and <code>%debug</code> directives, because there is no
1099sensible default that can be created automatically for such parameters
1100in the generated <TT>main</TT> methods. JFlex will warn in this case and
1101generate an additional default constructor without these parameters and without user init code (which might potentially refer to the parameters).
1102
1103<P>
1104</LI>
1105<LI><B><TT>%scanerror "exception"</TT></B>
1106
1107<P>
1108Causes the generated scanner to throw an instance of the specified
1109exception in case of an internal error (default is
1110<TT>java.lang.Error</TT>). Note that this exception is only for
1111internal scanner errors. With usual specifications it should never
1112occur (i.e.&nbsp;if there is an error fallback rule in the specification
1113and only the documented scanner API is used).
1114
1115<P>
1116</LI>
1117<LI><B><TT>%buffer "size"</TT></B>
1118
1119<P>
1120Set the initial size of the scan buffer to the specified value
1121(decimal, in bytes). The default value is 16384.
1122
1123<P>
1124</LI>
1125<LI><B><TT>%include "filename"</TT></B>
1126
1127<P>
1128Replaces the <TT>%include</TT> verbatim by the specified file. This
1129feature is still experimental. It works, but error reporting can be
1130strange if a syntax error occurs on the last token in the included
1131file.
1132
1133<P>
1134</LI>
1135</UL>
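
<P>
The following fragment of the options section sketches how several of these
directives might be combined. The names <TT>ConfigLexer</TT>, <TT>sourceName</TT> and
<TT>source</TT> are hypothetical; only the directives themselves are taken from the
list above.

<P>
<PRE>
%class ConfigLexer
%public
%final
%ctorarg String sourceName
%scanerror IllegalStateException
%buffer 4096

%{
  /* member variable of the generated class, set in the constructor */
  private String source;
%}

%init{
  /* copied into the constructor; may refer to the %ctorarg parameter */
  this.source = sourceName;
%init}
</PRE>

<P>
With these settings JFlex would be expected to write a public final class
<TT>ConfigLexer</TT> to <TT>ConfigLexer.java</TT>, with constructors that take the
additional <TT>String sourceName</TT> argument.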
1136
1137<P>
1138
1139<H3><A NAME="SECTION00052200000000000000"></A><A NAME="ScanningMethod"></A><BR>
1140Scanning method
1141</H3>
1142This section shows how the scanning method can be customised. You can redefine
1143the name and return type of the method and it is possible to declare
1144exceptions that may be thrown in one of the actions of the specification.
1145If no return type is specified, the scanning method will be declared as
1146returning values of class <TT>Yytoken</TT>.
1147
1148<UL>
1149<LI><B><TT>%function "name"</TT></B>
1150
1151<P>
1152Causes the scanning method to get the specified name. If no <TT>%function</TT>
1153directive is present in the specification, the scanning method gets the
1154name ``<TT>yylex</TT>''. This directive overrides settings of the
1155<TT><A HREF="#CupMode">%cup</A></TT> switch. Please note that the default name
1156of the scanning method with the <TT><A HREF="#CupMode">%cup</A></TT> switch is
1157<TT>next_token</TT>. Overriding this name might lead to the generated scanner
1158being implicitly declared as <TT>abstract</TT>, because it does not provide
1159the method <TT>next_token</TT> of the interface <TT>java_cup.runtime.Scanner</TT>.
1160It is of course possible to provide a dummy implementation of that method
1161in the class code section if you still want to override the function name.
1162
1163<P>
1164</LI>
1165<LI><B><TT>%integer</TT></B>
1166<BR><B><TT>%int</TT></B>
1167
1168<P>
1169Both cause the scanning method to be declared as of Java type <TT>int</TT>.
1170Actions in the specification can then return <TT>int</TT> values as tokens.
1171The default end of file value under this setting is <TT>YYEOF</TT>, which is a <TT>public
1172static final int</TT> member of the generated class.
1173
1174<P>
1175</LI>
1176<LI><B><TT>%intwrap</TT></B>
1177
1178<P>
1179Causes the scanning method to be declared as of the Java wrapper type
1180<TT>Integer</TT>. Actions in the specification can then return <TT>Integer</TT>
1181values as tokens. The default end of file value under this setting is <TT>null</TT>.
1182
1183<P>
1184</LI>
1185<LI><B><TT>%type "typename"</TT></B>
1186
1187<P>
1188Causes the scanning method to be declared as returning values of the specified type.
1189Actions in the specification can then return values of <TT>typename</TT>
1190as tokens. The default end of file value under this setting is <TT>null</TT>.
1191If <TT>typename</TT> is not a subclass of <TT>java.lang.Object</TT>,
1192you should specify another end of file value using the
1193<A HREF="#eofval"><TT>%eofval{</TT> <TT>...</TT> <TT>%eofval}</TT></A>
1194directive or the <A HREF="#EOFRule"><TT>&lt;&lt;EOF&gt;&gt;</TT> rule</A>.
1195The <TT>%type</TT> directive overrides settings of the
1196<TT><A HREF="#CupMode">%cup</A></TT> switch.
1197
1198<P>
1199</LI>
1200<LI><B><code>%yylexthrow{</code></B>
1201<BR><B><TT>"exception1"[, "exception2", ... ]</TT></B>
1202<BR><B><code>%yylexthrow}</code></B>
1203
1204<P>
1205or (on a single line) just
1206
1207<P>
1208<B><TT>%yylexthrow "exception1" [, "exception2", ...]</TT></B>
1209
1210<P>
1211The exceptions listed inside <code>%yylexthrow{</code> <TT>...</TT> <code>%yylexthrow}</code>
1212will be declared in the throws clause of the scanning method. If there is
1213more than one <code>%yylexthrow{</code> <TT>...</TT> <code>%yylexthrow}</code> clause in
1214the specification, all specified exceptions will be declared.
1215</LI>
1216</UL>
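
<P>
As an illustration of these directives, the fragment below renames the scanning
method and gives it a user defined return type and an additional exception.
<TT>Token</TT> and <TT>LexerException</TT> are hypothetical user classes that would
have to be supplied separately.

<P>
<PRE>
%function nextToken
%type Token
%yylexthrow{
  LexerException
%yylexthrow}

%%

[0-9]+   { return new Token(yytext()); }
.|\n     { throw new LexerException("unexpected input: " + yytext()); }
</PRE>

<P>
The generated scanning method would then be named <TT>nextToken</TT>, return values
of type <TT>Token</TT>, list <TT>LexerException</TT> in its throws clause, and return
<TT>null</TT> at the end of file unless another end of file value is specified.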
1217
1218<P>
1219
1220<H3><A NAME="SECTION00052300000000000000"></A><A NAME="EOF"></A><BR>
1221The end of file
1222</H3>
1223There is always a default value that the scanning method will return when
1224the end of file has been reached. You may however define a specific value
1225to return and a specific piece of code that should be executed when the
1226end of file is reached.
1227
1228<P>
The default end of file value depends on the return type of the scanning method:
1230
1231<UL>
1232<LI>For <B><TT>%integer</TT></B>, the scanning method will return the value
1233<B><TT>YYEOF</TT></B>, which is a <TT>public static final int</TT> member
1234of the generated class.
1235
1236<P>
1237</LI>
<LI>For <B><TT>%intwrap</TT></B>, no specified type at all, or a user defined type declared using <B><TT>%type</TT></B>, the value is <B><TT>null</TT></B>.
1243
1244<P>
1245</LI>
1246<LI>In CUP compatibility mode, using <B><TT>%cup</TT></B>, the value is
1247
1248<P>
1249<B><TT>new java_cup.runtime.Symbol(sym.EOF)</TT></B>
1250</LI>
1251</UL>
1252
1253<P>
1254User values and code to be executed at the end of file can be defined using these directives:
1255
1256<A NAME="eofval"></A><UL>
1257<LI><B><code>%eofval{</code></B>
1258<BR><B><TT>...</TT></B>
1259<BR><B><code>%eofval}</code></B>
1260
1261<P>
1262The code included in <code>%eofval{</code> <TT>...</TT> <code>%eofval}</code> will
1263be copied verbatim into the scanning method and will be executed <EM>each time</EM>
1264when the end of file is reached (this is possible when
1265the scanning method is called again after the end of file has been
1266reached). The code should return the value that indicates the end of
1267file to the parser. There should be only one <code>%eofval{</code>
1268<TT>...</TT> <code>%eofval}</code> clause in the specification.
1269The <code>%eofval{ ... %eofval}</code> directive overrides settings of the
1270<TT><A HREF="#CupMode">%cup</A></TT> switch and <TT><A HREF="#YaccMode">%byaccj</A></TT> switch.
1271As of version 1.2 JFlex provides
1272a more readable way to specify the end of file value using the
1273<A HREF="#EOFRule"><TT>&lt;&lt;EOF&gt;&gt;</TT> rule</A> (see also section <A HREF="#EOFRule">4.3.2</A>).
1274
1275<P>
1276</LI>
1277<LI><A NAME="eof"></A> <B><code>%eof{</code></B>
1278<BR> <B><TT>...</TT></B>
1279<BR> <B><code>%eof}</code></B>
1280
1281<P>
The code included in <code>%eof{ ... %eof}</code> will be executed
1283 exactly once, when the end of file is reached. The code is included
1284 inside a method <TT>void yy_do_eof()</TT> and should not return any
1285 value (use <code>%eofval{...%eofval}</code> or
1286 <A HREF="#EOFRule"><TT>&lt;&lt;EOF&gt;&gt;</TT></A> for this purpose). If more than one
1287 end of file code directive is present, the code will be concatenated
1288 in order of appearance in the specification.
1289
1290<P>
1291</LI>
1292<LI><B><code>%eofthrow{</code></B>
1293<BR> <B><TT>"exception1"[,"exception2", ... ]</TT></B>
1294<BR> <B><code>%eofthrow}</code></B>
1295
1296<P>
1297or (on a single line) just
1298
1299<P>
1300<B><TT>%eofthrow "exception1" [, "exception2", ...]</TT></B>
1301
1302<P>
1303The exceptions listed inside <code>%eofthrow{...%eofthrow}</code> will
1304 be declared in the throws clause of the method <TT>yy_do_eof()</TT>
1305 (see <A HREF="#eof"><TT>%eof</TT></A> for more on that method).
1306 If there is more than one <code>%eofthrow{...%eofthrow}</code> clause
1307 in the specification, all specified exceptions will be declared.
1308
1309<P>
1310<A NAME="eofclose"></A></LI>
1311<LI><B><TT>%eofclose</TT></B>
1312
1313<P>
1314Causes JFlex to close the input stream at the end of file. The code
1315 <TT>yyclose()</TT> is appended to the method <TT>yy_do_eof()</TT>
1316 (together with the code specified in <code>%eof{...%eof}</code>) and
1317 the exception <TT>java.io.IOException</TT> is declared in the throws
1318 clause of this method (together with those of
 <code>%eofthrow{...%eofthrow}</code>).
1320
1321<P>
1322</LI>
1323<LI><B><TT>%eofclose false</TT></B>
1324
1325<P>
1326Turns the effect of <TT>%eofclose</TT> off again (e.g. in case closing of
1327 input stream is not wanted after <TT>%cup</TT>).
1328
1329<P>
1330</LI>
1331</UL>
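
<P>
A short sketch of how the end of file directives might be used together follows.
The class <TT>Token</TT> and its constant <TT>Token.EOF</TT> are assumptions made for
this example and are not part of JFlex.

<P>
<PRE>
%type Token

%eofval{
  /* returned each time the scanning method is called at the end of file */
  return Token.EOF;
%eofval}

%eof{
  /* executed exactly once, when the end of file is first reached */
  System.err.println("end of input reached");
%eof}

%eofclose
</PRE>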
1332
1333<P>
1334
1335<H3><A NAME="SECTION00052400000000000000"></A><A NAME="Standalone"></A><BR>
1336Standalone scanners
1337</H3>
1338
1339<UL>
1340<LI><B><TT>%debug</TT></B>
1341
1342<P>
1343Creates a main function in the generated class that expects the name
1344of an input file on the command line and then runs the scanner on this
1345input file by printing information about each returned token to the Java
1346console until the end of file is reached. The information includes:
1347line number (if line counting is enabled), column (if column counting is enabled),
1348the matched text, and the executed action (with line number in the specification).
1349
1350<P>
1351</LI>
1352<LI><B><TT>%standalone</TT></B>
1353
1354<P>
1355Creates a main function in the generated class that expects the name
1356of an input file on the command line and then runs the scanner on this
1357input file. The values returned by the scanner are ignored, but any unmatched
1358text is printed to the Java console instead (as the C/C++ tool flex does, if
1359run as standalone program). To avoid having to use an extra token class, the
scanning method will be declared as having default type <TT>int</TT>, not <TT>Yytoken</TT>
1361(if there isn't any other type explicitly specified).
1362This is in most cases irrelevant, but could be useful to know when making
1363another scanner standalone for some purpose. You should also consider using
1364the <TT>%debug</TT> directive, if you just want to be able to run the scanner
1365without a parser attached for testing etc.
1366
1367<P>
1368</LI>
1369</UL>
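
<P>
A complete standalone specification can be very small. The sketch below (the
class name <TT>Upper</TT> is arbitrary) prints lowercase words in uppercase; all
other input is unmatched and, as described above, printed to the console unchanged.

<P>
<PRE>
%%

%class Upper
%standalone
%unicode

%%

[a-z]+   { System.out.print(yytext().toUpperCase()); }
</PRE>

<P>
After generating and compiling the scanner, it could be run on a file with
<TT>java Upper input.txt</TT>, where <TT>input.txt</TT> stands for any text file.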
1370
1371<P>
1372
1373<H3><A NAME="SECTION00052500000000000000"></A><A NAME="CupMode"></A><BR>
1374CUP compatibility
1375</H3>
1376You may also want to read section <A HREF="#CUPWork">8.1</A> <A HREF="#CUPWork"><I>JFlex and CUP</I></A>
1377if you are interested in how to interface your generated
1378scanner with CUP.
1379
1380<UL>
1381<LI><B><TT>%cup</TT></B>
1382
1383<P>
1384The <TT>%cup</TT> directive enables the CUP compatibility mode and is equivalent
1385to the following set of directives:
1386
1387<P>
1388<PRE>
1389%implements java_cup.runtime.Scanner
1390%function next_token
1391%type java_cup.runtime.Symbol
1392%eofval{
1393 return new java_cup.runtime.Symbol(&lt;CUPSYM&gt;.EOF);
1394%eofval}
1395%eofclose
1396</PRE>
1397
1398<P>
1399The value of <TT>&lt;CUPSYM&gt;</TT> defaults to <TT>sym</TT> and can be
1400changed with the <TT>%cupsym</TT> directive. In JLex compatibility
1401mode (<TT>-jlex</TT> switch on the command line), <TT>%eofclose</TT>
1402will not be turned on.
1403
1404<P>
1405</LI>
1406<LI><B><TT>%cupsym "classname"</TT></B>
1407
1408<P>
1409Customises the name of the CUP generated class/interface
1410containing the names of terminal tokens. Default is <TT>sym</TT>.
1411The directive should not be used after <TT>%cup</TT>, but before.
1412
1413<P>
1414</LI>
1415<LI><B><TT>%cupdebug</TT></B>
1416
1417<P>
1418Creates a main function in the generated class that expects the name
1419of an input file on the command line and then runs the scanner on this
1420input file. Prints line, column, matched text, and CUP symbol name for
1421each returned token to standard out.
1422
1423<P>
1424</LI>
1425</UL>
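
<P>
The following sketch shows <TT>%cupsym</TT> placed before <TT>%cup</TT>, as required
above. The names <TT>CupLexer</TT>, <TT>TokenKinds</TT> and <TT>NUMBER</TT> are
hypothetical; <TT>TokenKinds</TT> stands for whatever symbol class CUP has been told
to generate.

<P>
<PRE>
%%

%class CupLexer
%unicode
%cupsym TokenKinds
%cup

%%

[0-9]+   { return new java_cup.runtime.Symbol(TokenKinds.NUMBER, yytext()); }
.|\n     { throw new Error("Illegal character &lt;" + yytext() + "&gt;"); }
</PRE>

<P>
At the end of file this scanner would return
<TT>new java_cup.runtime.Symbol(TokenKinds.EOF)</TT>.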
1426
1427<P>
1428
1429<H3><A NAME="SECTION00052600000000000000"></A><A NAME="YaccMode"></A><BR>
1430BYacc/J compatibility
1431</H3>
1432You may also want to read section <A HREF="#YaccWork">8.2</A> <A HREF="#YaccWork"><I>JFlex and BYacc/J</I></A>
1433if you are interested in how to interface your generated
scanner with BYacc/J.
1435
1436<UL>
1437<LI><B><TT>%byacc</TT></B>
1438
1439<P>
1440The <TT>%byacc</TT> directive enables the BYacc/J compatibility mode and is equivalent
1441to the following set of directives:
1442
1443<P>
1444<PRE>
1445%integer
1446%eofval{
1447 return 0;
1448%eofval}
1449%eofclose
1450</PRE>
1451
1452<P>
1453</LI>
1454</UL>
1455
1456<P>
1457
1458<H3><A NAME="SECTION00052700000000000000"></A><A NAME="CodeGeneration"></A><BR>
1459Code generation
1460</H3>
1461The following options define what kind of lexical analyser code JFlex
1462will produce. <TT>%pack</TT> is the default setting and will be used,
1463when no code generation method is specified.
1464
1465<P>
1466
1467<UL>
1468<LI><B><TT>%switch</TT></B>
1469
1470<P>
1471With <TT>%switch</TT> JFlex will generate a scanner that has
1472 the DFA hard coded into a nested switch statement. This method gives
1473 a good deal of compression in terms of the size of the compiled
1474 <TT>.class</TT> file while still providing very good performance. If your
 scanner gets too big though (say more than about 200 states)
 performance may degrade considerably and you should consider using one
 of the <TT>%table</TT> or <TT>%pack</TT> directives. If your scanner
 gets even bigger (about 300 states), the Java compiler <TT>javac</TT>
 could produce corrupted code that will crash when executed or will
 give you a <TT>java.lang.VerifyError</TT> when checked by the virtual
1481 machine. This is due to the size limitation of 64 KB of Java
1482 methods as described in the Java Virtual Machine Specification
1483 [<A
1484 HREF="manual.html#MachineSpec">10</A>]. In this case you will be forced to use the
1485 <TT>%pack</TT> directive, since <TT>%switch</TT>
1486 usually provides more compression of the DFA table than the
1487 <TT>%table</TT> directive.
1488
1489<P>
1490</LI>
1491<LI><B><TT>%table</TT></B>
1492
1493<P>
The <TT>%table</TT> directive causes JFlex to produce a classical
1495 table driven scanner that encodes its DFA table in an array. In
1496 this mode, JFlex only does a small amount of table compression (see
1497 [<A
1498 HREF="manual.html#ParseTable">6</A>], [<A
1499 HREF="manual.html#SparseTable">12</A>], [<A
1500 HREF="manual.html#Aho">1</A>] and [<A
1501 HREF="manual.html#Maurer">13</A>]
1502 for more details on the matter of table compression) and uses the
1503 same method that JLex did up to version 1.2.1. See section <A HREF="#performance">6</A>
1504 <A HREF="#performance">performance</A> of this manual to compare
1505 these methods. The same reason as above (64 KB size limitation of
1506 methods) causes the same problem, when the scanner gets too big.
1507 This is, because the virtual machine treats static initialisers of
1508 arrays as normal methods. You will in this case again be forced to
1509 use the <TT>%pack</TT> directive to avoid the problem.
1510
1511<P>
1512</LI>
1513<LI><B><TT>%pack</TT></B>
1514
1515<P>
1516<TT>%pack</TT> causes JFlex to compress the generated DFA table and to
1517 store it in one or more string literals. JFlex takes care that the
1518 strings are not longer than permitted by the class file format.
1519 The strings have to be unpacked when
1520 the first scanner object is created and initialised.
1521 After unpacking the internal access to the DFA table is exactly the
1522 same as with option <TT>%table</TT> -- the only extra work to be done
1523 at runtime is the unpacking process which is quite fast (not noticeable
1524 in normal cases). It is in time complexity proportional to the
1525 size of the expanded DFA table, and it is static,
1526 i.e. it is done only once for a certain scanner class -- no matter
1527 how often it is instantiated. Again, see section
1528 <A HREF="#performance">6</A> <A HREF="#performance">performance</A>
 on the performance of these scanners.
1530 With <TT>%pack</TT>, there should be practically no
1531 limitation to the size of the scanner. <TT>%pack</TT> is the default
1532 setting and will be used when no code generation method is specified.
1533</LI>
1534</UL>
1535
1536<P>
1537
1538<H3><A NAME="SECTION00052800000000000000"></A><A NAME="CharacterSets"></A><BR>
1539Character sets
1540</H3>
1541
1542<UL>
1543<LI><B><TT>%7bit</TT></B>
1544
1545<P>
Causes the generated scanner to use a 7 bit input character set (character
codes 0-127). If an input character with a code greater than 127 is
encountered in an input at runtime, the scanner will throw an <TT>ArrayIndexOutOfBoundsException</TT>.
For this and other reasons, you should consider using the <TT>%unicode</TT> directive.
1550See also section <A HREF="#sec:encodings">5</A> for information about character encodings. This is the default in JLex compatibility mode.
1551
1552<P>
1553</LI>
1554<LI><B><TT>%full</TT></B>
1555<BR><B><TT>%8bit</TT></B>
1556
1557<P>
1558Both options cause the generated scanner to use an 8 bit input character
1559set (character codes 0-255). If an input character with a code greater
1560than 255 is encountered in an input at runtime, the scanner will throw
an <TT>ArrayIndexOutOfBoundsException</TT>. Note that even if your platform
1562uses only one byte per character, the Unicode value of a character may
1563still be greater than 255. If you are scanning text files, you should
1564consider using the <TT>%unicode</TT> directive. See also section <A HREF="#sec:encodings">5</A>
1565for more information about character encodings.
1566
1567<P>
1568</LI>
1569<LI><B><TT>%unicode</TT></B>
1570<BR><B><TT>%16bit</TT></B>
1571
1572<P>
1573Both options cause the generated scanner to use the full 16 bit Unicode input
1574character set that Java supports natively (character code points 0-65535).
1575There will be no runtime overflow when using this set of input characters.
1576<TT>%unicode</TT> does not mean that the scanner will read two bytes at a
1577time. What is read and what constitutes a character depends on the runtime
1578platform. See also section <A HREF="#sec:encodings">5</A> for more information about
1579character encodings. This is the default unless the JLex compatibility mode is
1580used (command line option <TT>-jlex</TT>).
1581
1582<P>
1583<A NAME="caseless"></A></LI>
1584<LI><B><TT>%caseless</TT></B>
1585<BR><B><TT>%ignorecase</TT></B>
1586
1587<P>
1588This option causes JFlex to handle all characters and strings in the
1589specification as if they were specified in both uppercase and lowercase form.
1590This enables an easy way to specify a scanner for a language with case
1591insensitive keywords. The string "<TT>break</TT>" in a specification is for
1592instance handled like the expression <TT>([bB][rR][eE][aA][kK])</TT>. The
1593<TT>%caseless</TT> option does not change the matched text and does not
1594effect character classes. So <TT>[a]</TT> still only matches the character
1595<TT>a</TT> and not <TT>A</TT>, too. Which letters are uppercase and which
1596lowercase letters, is defined by the Unicode standard and determined by JFlex
1597with the Java methods <TT>Character.toUpperCase</TT> and
1598<TT>Character.toLowerCase</TT>. In JLex compatibility mode (<TT>-jlex</TT>
1599switch on the command line), <TT>%caseless</TT> and <TT>%ignorecase</TT>
1600also affect character classes.
1601
1602<P>
1603</LI>
1604</UL>
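
<P>
To illustrate <TT>%caseless</TT>, consider the sketch below (the class name
<TT>KeywordLexer</TT> is arbitrary, and <TT>%int</TT> is used only to avoid a token
class). Following the description above, the first rule should match
<TT>break</TT>, <TT>BREAK</TT> or <TT>Break</TT> alike, with <TT>yytext()</TT>
returning the text exactly as it appeared in the input, while the character class
in the second rule should still match only a lowercase <TT>a</TT>.

<P>
<PRE>
%%

%class KeywordLexer
%int
%unicode
%caseless

%%

"break"   { System.out.println("keyword: " + yytext()); }
[a]       { System.out.println("lowercase a"); }
.|\n      { /* ignore everything else */ }
</PRE>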
1605<H3><A NAME="SECTION00052900000000000000"></A><A NAME="Counting"></A><BR>
1606Line, character and column counting
1607</H3>
1608
1609<UL>
1610<LI><B><TT>%char</TT></B>
1611
1612<P>
1613Turns character counting on. The <TT>int</TT> member variable <TT>yychar</TT>
1614contains the number of characters (starting with 0) from the beginning
1615of input to the beginning of the current token.
1616
1617<P>
1618</LI>
1619<LI><B><TT>%line</TT></B>
1620
1621<P>
1622Turns line counting on. The <TT>int</TT> member variable <TT>yyline</TT>
1623contains the number of lines (starting with 0) from the beginning of input
1624to the beginning of the current token.
1625
1626<P>
1627</LI>
1628<LI><B><TT>%column</TT></B>
1629
1630<P>
1631Turns column counting on. The <TT>int</TT> member variable <TT>yycolumn</TT>
1632contains the number of characters (starting with 0) from the beginning
1633of the current line to the beginning of the current token.
1634
1635<P>
1636</LI>
1637</UL>
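
<P>
The three counters can be read directly in actions. In the following sketch (the
class name <TT>Positions</TT> is arbitrary) 1 is added to <TT>yyline</TT> and
<TT>yycolumn</TT> because both start counting at 0, whereas most editors number
lines and columns from 1.

<P>
<PRE>
%%

%class Positions
%standalone
%unicode
%line
%column
%char

%%

[a-zA-Z]+   { System.out.println(yytext()
                + " at line " + (yyline + 1)
                + ", column " + (yycolumn + 1)
                + ", offset " + yychar); }
.|\n        { /* ignore everything else */ }
</PRE>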
1638
1639<P>
1640
1641<H3><A NAME="SECTION000521000000000000000"></A><A NAME="Obsolete"></A><BR>
1642Obsolete JLex options
1643</H3>
1644
1645<UL>
1646<LI><B><TT>%notunix</TT></B>
1647
1648<P>
This JLex option is obsolete in JFlex but still recognised as a valid directive.
It used to switch between Windows and Unix style line terminators (<code>\r\n</code>
1651and <code>\n</code>) for the <TT>$</TT> operator in regular expressions. JFlex
1652always recognises both styles of platform dependent line terminators.
1653
1654<P>
1655</LI>
1656<LI><B><TT>%yyeof</TT></B>
1657
1658<P>
This JLex option is obsolete in JFlex but still recognised as a valid directive.
1660In JLex it declares a public member constant <TT>YYEOF</TT>. JFlex declares it in any case.
1661</LI>
1662</UL>
1663
1664<P>
1665
1666<H3><A NAME="SECTION000521100000000000000"></A><A NAME="StateDecl"></A><BR>
1667State declarations
1668</H3>
State declarations have the following form:
1670
1671<P>
1672<TT>%s[tate] "state identifier" [, "state identifier", ... ]</TT> for inclusive or
1673<BR><TT>%x[state] "state identifier" [, "state identifier", ... ]</TT> for exclusive states
1674
1675<P>
1676There may be more than one line of state declarations, each starting with
1677<TT>%state</TT> or <TT>%xstate</TT> (the first character is sufficient,
<TT>%s</TT> and <TT>%x</TT> work, too). State identifiers are letters followed
1679by a sequence of letters, digits or underscores. State identifiers can be separated
1680by white-space or comma.
1681
1682<P>
1683The sequence
1684
1685<P>
1686<TT>%state STATE1</TT>
1687<BR><TT>%xstate STATE3, XYZ, STATE_10</TT>
1688<BR><TT>%state ABC STATE5</TT>
1689
1690<P>
1691declares the set of identifiers <TT>STATE1, STATE3, XYZ,
1692 STATE_10, ABC, STATE5</TT> as lexical states, <TT>STATE1</TT>, <TT>ABC</TT>, <TT>STATE5</TT>
1693as inclusive, and <TT>STATE3</TT>, <TT>XYZ</TT>, <TT>STATE_10</TT> as exclusive.
1694See also section
1695<A HREF="#HowMatched">4.3.3</A> on the way lexical states influence how the input is
1696matched.
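
<P>
The difference between the two kinds of states shows up in the rules section. The
sketch below is only an illustration (the class name <TT>StateDemo</TT> is
arbitrary): the last rule has no state list, so it is active in
<TT>YYINITIAL</TT> and in the inclusive state <TT>STRING</TT>, but not in the
exclusive state <TT>COMMENT</TT>, where only the rules listed for
<TT>COMMENT</TT> apply.

<P>
<PRE>
%%

%class StateDemo
%int
%unicode
%state  STRING
%xstate COMMENT

%%

&lt;YYINITIAL&gt; {
  "/*"    { yybegin(COMMENT); }
  \"      { yybegin(STRING); }
}

&lt;STRING&gt; \"    { yybegin(YYINITIAL); }

&lt;COMMENT&gt; {
  "*/"    { yybegin(YYINITIAL); }
  .|\n    { /* skip comment text */ }
}

.|\n      { System.out.print(yytext()); }
</PRE>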
1697
1698<P>
1699
1700<H3><A NAME="SECTION000521200000000000000"></A><A NAME="MacroDefs"></A><BR>
1701Macro definitions
1702</H3>
1703A macro definition has the form
1704
1705<P>
1706<TT>macroidentifier = regular expression</TT>
1707
1708<P>
1709That means, a macro definition is a macro identifier (letter followed
1710by a sequence of letters, digits or underscores), that can later be
1711used to reference the macro, followed by optional white-space, followed
1712by an "<TT>=</TT>", followed by optional white-space, followed by a
1713regular expression (see section <A HREF="#LexRules">4.3</A> <A HREF="#LexRules"><I>lexical
1714 rules</I></A> for more information about regular expressions).
1715
1716<P>
1717The regular expression on the right hand side must be well formed and
1718must not contain the <code>^</code>, <TT>/</TT> or <TT>$</TT> operators. <B>Differently
1719to JLex, macros are not just pieces of text that are expanded by copying</B>
1720- they are parsed and must be well formed.
1721
1722<P>
1723<B>This is a feature.</B> It eliminates some very hard to find bugs in
1724lexical specifications (such like not having parentheses around more
1725complicated macros - which is not necessary with JFlex). See section
1726<A HREF="#Porting">7.1</A> <A HREF="#Porting"><I>Porting from JLex</I></A> for more
1727details on the problems of JLex style macros.
1728
1729<P>
Since it is allowed to have macro usages in macro definitions, it is
possible to use a grammar-like notation to specify the desired lexical
structure. Macros, however, remain just abbreviations of the regular expressions
they represent. They are not nonterminals of a grammar and cannot be used
recursively in any way. JFlex detects cycles in macro definitions and reports
them at generation time. JFlex also warns you about macros that have been
defined but never used in the ``lexical rules'' section of the specification.
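<P>
The cycle check mentioned above can be pictured as a depth-first walk over
the "macro uses macro" relation. The following Java sketch is only an
illustration under that assumption; the class <TT>MacroCycleCheck</TT> and its
map-based representation are hypothetical and are not taken from the JFlex
sources.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration; this is not JFlex's own implementation. */
class MacroCycleCheck {

    /** Returns true if expanding 'name' leads back to a macro that is still being expanded. */
    static boolean hasCycle(String name, Map<String, List<String>> uses, Set<String> onPath) {
        if (!onPath.add(name)) {
            return true;                 // 'name' is already on the current path: cycle
        }
        for (String used : uses.getOrDefault(name, List.of())) {
            if (hasCycle(used, uses, onPath)) {
                return true;
            }
        }
        onPath.remove(name);             // finished expanding 'name'
        return false;
    }

    static boolean hasCycle(String name, Map<String, List<String>> uses) {
        return hasCycle(name, uses, new HashSet<>());
    }

    public static void main(String[] args) {
        // Identifier = {Letter} ({Letter} | {Digit})*   uses other macros, no cycle
        Map<String, List<String>> ok = Map.of(
            "Identifier", List.of("Letter", "Digit"),
            "Letter", List.of(),
            "Digit", List.of());
        // Expr = {Term} "+" {Expr}   refers to itself, which macros must not do
        Map<String, List<String>> bad = Map.of(
            "Expr", List.of("Term", "Expr"),
            "Term", List.of());
        System.out.println(hasCycle("Identifier", ok));  // false
        System.out.println(hasCycle("Expr", bad));       // true
    }
}
```

<P>
The second map is the kind of definition that would be legal for a grammar
nonterminal but is rejected for a macro.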
1737
1738<P>
1739
1740<H2><A NAME="SECTION00053000000000000000"></A><A NAME="LexRules"></A><BR>
1741Lexical rules
1742</H2>
The ``lexical rules'' section of a JFlex specification contains a set of
1744regular expressions and actions (Java code) that are executed when the
1745scanner matches the associated regular expression.
1746
1747<P>
1748
1749<H3><A NAME="SECTION00053100000000000000"></A><A NAME="Grammar"></A><BR>
1750Syntax
1751</H3>
1752The syntax of the "lexical rules" section is described by the following
1753BNF grammar (terminal symbols are enclosed in 'quotes'):
1754
1755<P>
<PRE>
LexicalRules ::= Rule+
Rule         ::= [StateList] ['^'] RegExp [LookAhead] Action
               | [StateList] '&lt;&lt;EOF&gt;&gt;' Action
               | StateGroup
StateGroup   ::= StateList '{' Rule+ '}'
StateList    ::= '&lt;' Identifier (',' Identifier)* '&gt;'
LookAhead    ::= '$' | '/' RegExp
Action       ::= '{' JavaCode '}' | '|'

RegExp       ::= RegExp '|' RegExp
               | RegExp RegExp
               | '(' RegExp ')'
               | ('!'|'~') RegExp
               | RegExp ('*'|'+'|'?')
               | RegExp '{' Number [',' Number] '}'
               | '[' ['^'] (Character|Character'-'Character)* ']'
               | PredefinedClass
               | '{' Identifier '}'
               | '"' StringCharacter+ '"'
               | Character

PredefinedClass ::= '[:jletter:]'
                  | '[:jletterdigit:]'
                  | '[:letter:]'
                  | '[:digit:]'
                  | '[:uppercase:]'
                  | '[:lowercase:]'
                  | '.'
</PRE>
1786
1787<P>
1788<A NAME="Terminals"></A>The grammar uses the following terminal symbols:
1789
1790<UL>
1791<LI><TT>JavaCode</TT>
1792<BR> a sequence of <EM><TT>BlockStatements</TT></EM> as described in the Java
1793 Language Specification [<A
1794 HREF="manual.html#LangSpec">7</A>], section 14.2.
1795
1796<P>
1797</LI>
1798<LI><TT>Number</TT>
1799<BR> a non negative decimal integer.
1800
1801<P>
1802</LI>
1803<LI><TT>Identifier</TT>
1804<BR> a letter <code>[a-zA-Z]</code> followed by a sequence of zero or more
1805 letters, digits or underscores <code>[a-zA-Z0-9_]</code>
1806
1807<P>
1808</LI>
1809<LI><TT>Character</TT>
<BR> an escape sequence or any Unicode character that is not one of these
 meta characters:
 <code> | ( ) { } [ ] &lt; &gt; \ . * + ? ^ $ / " ~ !</code>
1813
1814<P>
1815</LI>
1816<LI><TT>StringCharacter</TT>
<BR> an escape sequence or any Unicode character that is not one of these
1818 meta characters:
1819 <code> \ "</code>
1820
1821<P>
1822</LI>
1823<LI>An escape sequence
1824
1825<P>
1826
1827<UL>
1828<LI><code>\n</code> <code>\r</code> <code>\t</code> <code>\f</code> <code>\b</code>
1829</LI>
1830<LI>a <code>\x</code> followed by two hexadecimal digits <TT>[a-fA-F0-9]</TT> (denoting
1831 a standard ASCII escape sequence),
1832
1833<P>
1834</LI>
<LI>a <code>\u</code> followed by four hexadecimal digits <TT>[a-fA-F0-9]</TT>
 (denoting a Unicode escape sequence),
1837
1838<P>
1839</LI>
1840<LI>a backslash followed by a three digit octal number from 000 to 377 (denoting
1841 a standard ASCII escape sequence), or
1842
1843<P>
1844</LI>
<LI>a backslash followed by any other Unicode character that stands for this
1846 character.
1847
1848<P>
1849</LI>
1850</UL>
1851
1852<P>
1853</LI>
1854</UL>
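<P>
The numeric escape forms listed above differ only in the base and number of
digits; each one simply names a character code. As a rough illustration, the
following Java sketch decodes the digit part of such escapes. The class
<TT>EscapeDemo</TT> and its methods are made up for this example and do not
correspond to anything in JFlex.

```java
/** Sketch: the numeric escape forms all denote a character code. */
class EscapeDemo {

    /** Decodes the digits of a \x (two digits) or \u (four digits) escape. */
    static char fromHex(String digits) {
        return (char) Integer.parseInt(digits, 16);
    }

    /** Decodes the digits of a three digit octal escape (000 to 377). */
    static char fromOctal(String digits) {
        return (char) Integer.parseInt(digits, 8);
    }

    public static void main(String[] args) {
        System.out.println(fromHex("41"));            // \x41   denotes A
        System.out.println(fromHex("0041"));          // \u0041 denotes A
        System.out.println(fromOctal("101"));         // \101   denotes A
        System.out.println((int) fromOctal("377"));   // largest octal escape: 255
    }
}
```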
1855
1856<P>
1857Please note that the <code>\n</code> escape sequence stands for the ASCII
1858LF character - not for the end of line. If you would like to match the
1859line terminator, you should use the expression <code>\r|\n|\r\n</code> if you want
1860the Java conventions, or <code>\r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>
1861if you want to be fully Unicode compliant (see also [<A
1862 HREF="manual.html#unicode_rep">5</A>]).
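<P>
To get a feeling for what the Unicode line terminator expression covers, the
same set of terminators can be tried out with <TT>java.util.regex</TT>. This is
only an analogy: <TT>java.util.regex</TT> is a backtracking matcher that tries
alternatives from left to right, so <code>\r\n</code> is listed first here,
whereas a JFlex scanner picks the longest match regardless of order. The class
name <TT>LineTerminators</TT> is an assumption of this sketch.

```java
import java.util.regex.Pattern;

/** Sketch: splitting text at the Unicode line terminators named above. */
class LineTerminators {

    // \r\n first, so that CR LF counts as one terminator, not two.
    static final Pattern UNICODE_EOL =
        Pattern.compile("\\r\\n|[\\r\\n\\u2028\\u2029\\u000B\\u000C\\u0085]");

    /** Splits text into lines; the limit -1 keeps trailing empty lines. */
    static String[] lines(String text) {
        return UNICODE_EOL.split(text, -1);
    }

    public static void main(String[] args) {
        System.out.println(lines("a\r\nb\u0085c\nd").length);  // 4
    }
}
```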
1863
1864<P>
1865As of version 1.1 of JFlex the white-space characters <TT>" "</TT>
1866(space) and <code>"\t"</code> (tab) can be used to improve the readability of
1867regular expressions. They will be ignored by JFlex. In character
1868classes and strings however, white-space characters keep standing for
1869themselves (so the string <TT>" "</TT> still matches exactly one space
1870character and <code>[ \n]</code> still matches an ASCII LF or a space
1871character).
1872
1873<P>
JFlex applies the following standard operator precedences in regular
expressions (from highest to lowest):
1876
1877<P>
1878
1879<UL>
1880<LI>unary postfix operators (<code>'*', '+', '?', {n}, {n,m}</code>)
1881
1882<P>
1883</LI>
1884<LI>unary prefix operators (<code>'!', '~'</code>)
1885
1886<P>
1887</LI>
<LI>concatenation (<TT>RegExp ::= RegExp RegExp</TT>)
1889
1890<P>
1891</LI>
1892<LI>union (<code>RegExp::= RegExp '|' RegExp</code>)
1893</LI>
1894</UL>
1895
1896<P>
1897So the expression <code>a | abc | !cd*</code> for instance is parsed as
1898<code>(a|(abc)) | ((!c)(d*))</code>.
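<P>
The relative order of postfix operators, concatenation and union is the same
as in most regular expression dialects, so its effect can be checked with
<TT>java.util.regex</TT> (which has no <code>!</code> or <code>~</code>
operators, so those are left out). The sketch below, with the made-up class
name <TT>PrecedenceDemo</TT>, shows that <code>ab|cd*</code> groups as
<code>(ab)|(c(d*))</code> and not as <code>a(b|c)d*</code>.

```java
import java.util.regex.Pattern;

/** Sketch: postfix binds tighter than concatenation, which binds tighter than union. */
class PrecedenceDemo {

    /** Whole-input match against ab|cd*, i.e. (ab)|(c(d*)). */
    static boolean matches(String input) {
        return Pattern.matches("ab|cd*", input);
    }

    /** Whole-input match against the other conceivable grouping a(b|c)d*. */
    static boolean otherGrouping(String input) {
        return Pattern.matches("a(b|c)d*", input);
    }

    public static void main(String[] args) {
        System.out.println(matches("cddd"));        // true
        System.out.println(matches("acdd"));        // false
        System.out.println(otherGrouping("acdd"));  // true
    }
}
```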
1899
1900<P>
1901
1902<H3><A NAME="SECTION00053200000000000000"></A><A NAME="Semantics"></A><BR>
1903Semantics
1904</H3>
1905This section gives an informal description of which text is matched by
1906a regular expression (i.e. an expression described by the <TT>RegExp</TT>
1907production of the grammar presented <A HREF="#Grammar">above</A>).
1908
1909<P>
1910A regular expression that consists solely of
1911
1912<UL>
1913<LI>a <TT>Character</TT> matches this character.
1914
1915<P>
1916</LI>
1917<LI>a character class <code>'[' (Character|Character'-'Character)* ']'</code> matches
1918 any character in that class. A <TT>Character</TT> is to be considered an
1919 element of a class, if it is listed in the class or if its code lies within
1920 a listed character range <TT>Character'-'Character</TT>. So <code>[a0-3\n]</code>
1921 for instance matches the characters
1922
1923<P>
1924<code>a 0 1 2 3 \n</code>
1925
1926<P>
1927If the list of characters is empty (i.e.&nbsp;just <code>[]</code>), the expression
1928 matches nothing at all (the empty set), not even the empty string. This
1929 may be useful in combination with the negation operator <code>'!'</code>.
1930
1931<P>
1932</LI>
1933<LI>a negated character class <code>'[^' (Character|Character'-'Character)* ']'</code>
1934 matches all characters not listed in the class. If the list of characters
1935 is empty (i.e. <code>[^]</code>), the expression matches any character of the
1936 input character set.
1937
1938<P>
1939</LI>
<LI>a string <code>'"' StringCharacter+ '"'</code> matches the exact
 text enclosed in double quotes. All meta characters but <code>\</code> and
 <TT>"</TT> lose their special meaning inside a string. See also the
 <A HREF="#caseless"><TT>%ignorecase</TT></A> switch.
1944
1945<P>
1946</LI>
1947<LI>a macro usage <code>'{' Identifier '}'</code> matches the input that is matched
1948 by the right hand side of the macro with name "<TT>Identifier</TT>".
1949
1950<P>
1951<A NAME="predefCharCl"></A></LI>
1952<LI>a predefined character class matches any of
1953 the characters in that class. There are the following predefined character
1954 classes:
1955
1956<P>
1957<TT>.</TT> contains all characters but <code>\n</code>.
1958
1959<P>
1960All other predefined character classes are defined in the Unicode
1961 specification or the Java Language Specification and determined by
1962 Java functions of class
1963 <TT>java</TT>.<TT>lang</TT>.<TT>Character</TT>.
1964
1965<P>
<PRE>
[:jletter:]       isJavaIdentifierStart()
[:jletterdigit:]  isJavaIdentifierPart()
[:letter:]        isLetter()
[:digit:]         isDigit()
[:uppercase:]     isUpperCase()
[:lowercase:]     isLowerCase()
</PRE>
1974
1975<P>
They are especially useful when working with the Unicode character set.
1977
1978<P>
1979</LI>
1980</UL>
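<P>
Since the predefined classes are determined by methods of
<TT>java.lang.Character</TT>, membership of a given character can be checked
directly in Java. The helper class <TT>PredefinedClasses</TT> below is a
hypothetical convenience for experimenting; only the
<TT>java.lang.Character</TT> methods themselves come from the table above.

```java
/** Sketch: which predefined JFlex classes would contain a given character. */
class PredefinedClasses {

    static String classify(char c) {
        StringBuilder sb = new StringBuilder();
        if (Character.isJavaIdentifierStart(c)) sb.append("[:jletter:] ");
        if (Character.isJavaIdentifierPart(c))  sb.append("[:jletterdigit:] ");
        if (Character.isLetter(c))              sb.append("[:letter:] ");
        if (Character.isDigit(c))               sb.append("[:digit:] ");
        if (Character.isUpperCase(c))           sb.append("[:uppercase:] ");
        if (Character.isLowerCase(c))           sb.append("[:lowercase:] ");
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(classify('$'));       // [:jletter:] [:jletterdigit:]
        System.out.println(classify('7'));       // [:jletterdigit:] [:digit:]
        System.out.println(classify('\u00C4'));  // A with diaeresis: a letter and upper case
    }
}
```

<P>
Note that <code>$</code> is a <TT>[:jletter:]</TT> but not a
<TT>[:letter:]</TT>, which is one reason to choose between the two classes
deliberately.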
1981
1982<P>
1983If <TT>a</TT> and <TT>b</TT> are regular expressions, then
1984
1985<P>
1986<DL COMPACT>
1987<DT><TT>a | b</TT></DT>
1988<DD>(union)
1989
1990<P>
1991is the regular expression, that matches
1992 all input that is matched by <TT>a</TT> or by <TT>b</TT>.
1993
1994<P>
1995</DD>
1996<DT><TT>a b</TT></DT>
1997<DD>(concatenation)
1998
1999<P>
2000is the regular expression,
2001 that matches the input matched by <TT>a</TT> followed by the
2002 input matched by <TT>b</TT>.
2003
2004<P>
2005</DD>
2006<DT><TT>a*</TT></DT>
2007<DD>(Kleene closure)
2008
2009<P>
2010matches zero or more repetitions
2011 of the input matched by <TT>a</TT>
2012
2013<P>
2014</DD>
2015<DT><TT>a+</TT></DT>
2016<DD>(iteration)
2017
2018<P>
2019is equivalent to <TT>aa*</TT>
2020
2021<P>
2022</DD>
2023<DT><TT>a?</TT></DT>
2024<DD>(option)
2025
2026<P>
2027matches the empty input or the input matched
2028 by <TT>a</TT>
2029
2030<P>
2031</DD>
2032<DT><TT>!a</TT></DT>
2033<DD>(negation)
2034
2035<P>
2036matches everything but the strings matched by <TT>a</TT>.
2037 Use with care: the construction of <code>!a</code> involves
2038 an additional, possibly exponential NFA to DFA transformation
2039 on the NFA for <TT>a</TT>. Note that
2040 with negation and union you also have (by applying DeMorgan)
2041 intersection and set difference: the intersection of
2042 <TT>a</TT> and <TT>b</TT> is <code>!(!a|!b)</code>, the expression
2043 that matches everything of <TT>a</TT> not matched by <TT>b</TT> is
2044 <code>!(!a|b)</code>
2045
2046<P>
2047</DD>
2048<DT><TT>~a</TT></DT>
2049<DD>(upto)
2050
2051<P>
2052matches everything up to (and including) the first occurrence of a text
2053 matched by <TT>a</TT>. The expression <code>~a</code> is equivalent
2054 to <code>!([^]* a [^]*) a</code>. A traditional C-style comment
2055 is matched by <code>"/*" ~"*/"</code>
2056
2057<P>
2058</DD>
2059<DT><TT>a{n}</TT></DT>
2060<DD>(repeat)
2061
2062<P>
2063is equivalent to <TT>n</TT> times the concatenation of <TT>a</TT>.
2064 So <code>a{4}</code> for instance is equivalent to the expression <TT>a a a a</TT>.
2065 The decimal integer <TT>n</TT> must be positive.
2066
2067<P>
2068</DD>
2069<DT><TT>a{n,m}</TT></DT>
2070<DD>is equivalent to at least <TT>n</TT> times and at most <TT>m</TT> times the
2071 concatenation of <TT>a</TT>. So <code>a{2,4}</code> for instance is equivalent
2072 to the expression <code>a a a? a?</code>. Both <TT>n</TT> and <TT>m</TT> are non
2073 negative decimal integers and <TT>m</TT> must not be smaller than <TT>n</TT>.
2074
2075<P>
2076</DD>
2077<DT><TT>( a )</TT></DT>
2078<DD>matches the same input as <TT>a</TT>.
2079
2080<P>
2081</DD>
2082</DL>
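<P>
The DeMorgan identities given for negation can be checked on a small scale by
treating each expression as the set of strings it matches and <code>!</code>
as the complement within a finite universe. This is a simplification made for
the sketch (real regular languages are usually infinite), and the class
<TT>NegationAlgebra</TT> is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch: intersection and difference built from complement and union only. */
class NegationAlgebra {

    /** Complement relative to a finite universe, standing in for '!'. */
    static Set<String> not(Set<String> a, Set<String> universe) {
        Set<String> result = new HashSet<>(universe);
        result.removeAll(a);
        return result;
    }

    /** Union, standing in for '|'. */
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.addAll(b);
        return result;
    }

    /** !(!a | !b) */
    static Set<String> intersection(Set<String> a, Set<String> b, Set<String> u) {
        return not(or(not(a, u), not(b, u)), u);
    }

    /** !(!a | b) */
    static Set<String> difference(Set<String> a, Set<String> b, Set<String> u) {
        return not(or(not(a, u), b), u);
    }

    public static void main(String[] args) {
        Set<String> u = Set.of("x", "y", "z", "w");
        Set<String> a = Set.of("x", "y");
        Set<String> b = Set.of("y", "z");
        System.out.println(intersection(a, b, u));  // [y]
        System.out.println(difference(a, b, u));    // [x]
    }
}
```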
2083
2084<P>
2085In a lexical rule, a regular expression <TT>r</TT> may be preceded by a
2086'<code>^</code>' (the beginning of line operator). <TT>r</TT> is then
2087only matched at the beginning of a line in the input. A line begins
2088after each occurrence of <code>\r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>
2089(see also [<A
2090 HREF="manual.html#unicode_rep">5</A>]) and at the beginning of input.
2091The preceding line terminator in the input is not consumed and can
2092be matched by another rule.
2093
2094<P>
2095In a lexical rule, a regular expression <TT>r</TT> may be followed by a
2096look-ahead expression. A look-ahead expression is either a '<TT>$</TT>'
2097(the end of line operator) or a <code>'/'</code> followed by an arbitrary
2098regular expression. In both cases the look-ahead is not consumed and
2099not included in the matched text region, but it <EM>is</EM> considered
2100while determining which rule has the longest match (see also
2101<A HREF="#HowMatched">4.3.3</A> <A HREF="#HowMatched"><I>How the input is matched</I></A>).
2102
2103<P>
2104In the '<TT>$</TT>' case <TT>r</TT> is only matched at the end of a line in
2105the input. The end of a line is denoted by the regular expression
2106<code>\r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>.
So <code>a$</code> is equivalent to <code>a / \r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>. This is a bit different from the situation described in [<A
 HREF="manual.html#unicode_rep">5</A>]:
since in JFlex <code>$</code> is a true trailing context, the end of file
does <B>not</B> count as end of line.
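<P>
The behaviour of <code>a$</code> as a true trailing context can be imitated
with a positive look-ahead in <TT>java.util.regex</TT>. This is an analogy
only: Java's own <code>$</code> does match at the end of input, which is
exactly the difference described above. The class name
<TT>EndOfLineContext</TT> is an assumption of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: 'a' followed by a line terminator that is inspected but not consumed. */
class EndOfLineContext {

    static final Pattern A_BEFORE_EOL =
        Pattern.compile("a(?=\\r\\n|[\\r\\n\\u2028\\u2029\\u000B\\u000C\\u0085])");

    /** Length of the match at the start of input, or -1 if there is none. */
    static int matchLength(String input) {
        Matcher m = A_BEFORE_EOL.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(matchLength("a\n"));  // 1: the terminator is not part of the match
        System.out.println(matchLength("a"));    // -1: end of input is not an end of line
    }
}
```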
2111
2112<P>
2113<A NAME="trailingContext"></A>For arbitrary look-ahead (also called <EM>trailing context</EM>) the
2114expression is matched only when followed by input that matches the
2115trailing context.
2116
2117<P>
<A NAME="EOFRule"></A>As of version 1.2, JFlex allows lex/flex style <TT>&lt;&lt;EOF&gt;&gt;</TT> rules in
lexical specifications. A rule
<PRE>
[StateList] &lt;&lt;EOF&gt;&gt; { some action code }
</PRE>
is very similar to the <A HREF="#eofval"><TT>%eofval</TT> directive</A> (section <A HREF="#eofval">4.2.3</A>).
The difference lies in the optional <TT>StateList</TT> that may precede the <TT>&lt;&lt;EOF&gt;&gt;</TT> rule. The
action code will only be executed when the end of file is read and the
scanner is currently in one of the lexical states listed in <TT>StateList</TT>.
The same <TT>StateGroup</TT> (see section <A HREF="#HowMatched">4.3.3</A>
<A HREF="#HowMatched"><I>How the input is matched</I></A>) and precedence
rules as in the ``normal'' rule case apply
(i.e. if there is more than one <TT>&lt;&lt;EOF&gt;&gt;</TT>
rule for a certain lexical state, the action of the one appearing
earlier in the specification will be executed). <TT>&lt;&lt;EOF&gt;&gt;</TT> rules
override settings of the <TT>%cup</TT> and <TT>%byaccj</TT> options and
should not be mixed with the <TT>%eofval</TT> directive.
2135
2136<P>
2137An <TT>Action</TT> consists either of a piece of Java code enclosed in
2138curly braces or is the special <code>|</code> action. The <code>|</code> action is
2139an abbreviation for the action of the following expression.
2140
2141<P>
Example:
<PRE>
expression1   |
expression2   |
expression3   { some action }
</PRE>
is equivalent to the expanded form
<PRE>
expression1   { some action }
expression2   { some action }
expression3   { some action }
</PRE>
2154
2155<P>
The <code>|</code> action is useful when you work with trailing context expressions. The
expression <TT>a | (c / d) | b</TT> is not syntactically legal, but can
easily be expressed using the <code>|</code> action:
<PRE>
a       |
c / d   |
b       { some action }
</PRE>
2164
2165<P>
2166
2167<H3><A NAME="SECTION00053300000000000000"></A><A NAME="HowMatched"></A><BR>
2168How the input is matched
2169</H3>
2170When consuming its input, the scanner determines the regular expression
2171that matches the longest portion of the input (longest match rule). If
2172there is more than one regular expression that matches the longest portion
2173of input (i.e. they all match the same input), the generated scanner chooses
2174the expression that appears first in the specification. After determining
2175the active regular expression, the associated action is executed. If there
2176is no matching regular expression, the scanner terminates the program with
2177an error message (if the <TT>%standalone</TT> directive has been used, the
2178scanner prints the unmatched input to <TT>java.lang.System.out</TT> instead
2179and resumes scanning).
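<P>
The longest match rule and its tie-break can be modelled in a few lines of
Java. The sketch below is a deliberately naive model: it tries every rule
separately with <TT>java.util.regex</TT>, whereas a generated scanner runs a
single DFA over the input, and <TT>lookingAt()</TT> is only guaranteed to
find the longest prefix for simple patterns like the ones used here. The
class <TT>LongestMatch</TT> is hypothetical.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: longest match wins, the earliest rule wins among equally long matches. */
class LongestMatch {

    /** Index of the winning rule, or -1 if no rule matches at the start of input. */
    static int select(List<Pattern> rules, String input) {
        int best = -1;
        int bestLength = -1;
        for (int i = 0; i < rules.size(); i++) {
            Matcher m = rules.get(i).matcher(input);
            // Only a strictly longer match replaces the current winner,
            // so on a tie the rule that appears first is kept.
            if (m.lookingAt() && m.end() > bestLength) {
                best = i;
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Pattern> rules = List.of(
            Pattern.compile("if"),        // rule 0: keyword
            Pattern.compile("[a-z]+"));   // rule 1: identifier
        System.out.println(select(rules, "if("));   // 0: same length, first rule wins
        System.out.println(select(rules, "iffy"));  // 1: the identifier match is longer
        System.out.println(select(rules, "123"));   // -1: no rule matches
    }
}
```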
2180
2181<P>
2182Lexical states can be used to further restrict the set of regular expressions
2183that match the current input.
2184
2185<P>
2186
2187<UL>
2188<LI>A regular expression can only be matched when its associated set of lexical
2189states includes the currently active lexical state of the scanner or if
2190the set of associated lexical states is empty and the currently active lexical
state is inclusive. Exclusive and inclusive states differ only in this respect:
how rules with an empty set of associated states are treated.
2193
2194<P>
2195</LI>
2196<LI>The currently active lexical state of the scanner can be changed from within
2197an action of a regular expression using the method <TT>yybegin()</TT>.
2198
2199<P>
2200</LI>
2201<LI>The scanner starts in the inclusive lexical state
2202<TT>YYINITIAL</TT>, which is always declared by default.
2203
2204<P>
2205</LI>
2206<LI>The set of lexical states associated with a regular expression is
2207the <TT>StateList</TT> that precedes the expression. If a rule is
2208contained in one or more <TT>StateGroups</TT>, then the states of
2209these are also associated with the rule, i.e.&nbsp;they accumulate over
2210<TT>StateGroups</TT>.
2211
2212<P>
Example:
<PRE>
%states A, B
%xstates C
%%
expr1                   { yybegin(A); action }
&lt;YYINITIAL, A&gt; expr2    { action }
&lt;A&gt; {
  expr3                 { action }
  &lt;B,C&gt; expr4           { action }
}
</PRE>
2225The first line declares two (inclusive) lexical states <TT>A</TT> and <TT>B</TT>,
2226the second line an exclusive lexical state <TT>C</TT>.
2227The default (inclusive) state <TT>YYINITIAL</TT> is always implicitly there and
2228doesn't need to be declared. The rule with <TT>expr1</TT> has no
2229states listed, and is thus matched in all states but the exclusive
2230ones, i.e.&nbsp;<TT>A</TT>, <TT>B</TT>, and <TT>YYINITIAL</TT>. In its
2231action, the scanner is switched to state <TT>A</TT>. The second rule
2232<TT>expr2</TT> can only match when the scanner is in state
2233<TT>YYINITIAL</TT> or <TT>A</TT>. The rule <TT>expr3</TT> can only be
2234matched in state <TT>A</TT> and <TT>expr4</TT> in states <TT>A</TT>, <TT>B</TT>,
2235and <TT>C</TT>.
2236
2237<P>
2238</LI>
2239<LI>Lexical states are declared and used as Java <TT>int</TT> constants in
2240the generated class under the same name as they are used in the specification.
2241There is no guarantee that the values of these integer constants are
2242distinct. They are pointers into the generated DFA table, and if JFlex
2243recognises two states as lexically equivalent (if they are used with the
2244exact same set of regular expressions), then the two constants will get
2245the same value.
2246
2247<P>
2248</LI>
2249</UL>
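<P>
The condition under which a rule may be matched in a given lexical state can
be written down as a small predicate. The following Java sketch models states
as strings and is purely illustrative; the class <TT>StateRule</TT> is
hypothetical, and the generated scanner does not work on sets of names but on
DFA tables.

```java
import java.util.Set;

/** Sketch: when may a rule be matched in the current lexical state? */
class StateRule {

    /**
     * ruleStates      the states associated with the rule (accumulated over StateGroups)
     * current         the currently active lexical state
     * inclusiveStates all states declared inclusive, including YYINITIAL
     */
    static boolean canMatch(Set<String> ruleStates, String current, Set<String> inclusiveStates) {
        if (ruleStates.isEmpty()) {
            // Rules without states are active in inclusive states only.
            return inclusiveStates.contains(current);
        }
        return ruleStates.contains(current);
    }

    public static void main(String[] args) {
        Set<String> inclusive = Set.of("YYINITIAL", "A", "B");
        System.out.println(canMatch(Set.of(), "B", inclusive));               // true  (expr1)
        System.out.println(canMatch(Set.of(), "C", inclusive));               // false (expr1, C is exclusive)
        System.out.println(canMatch(Set.of("A", "B", "C"), "C", inclusive));  // true  (expr4)
    }
}
```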
2250
2251<P>
2252
2253<H3><A NAME="SECTION00053400000000000000">
2254The generated class</A>
2255</H3>
2256JFlex generates exactly one file containing one class from the specification
2257(unless you have declared another class in the first specification section).
2258
2259<P>
2260The generated class contains (among other things) the DFA tables, an input buffer,
2261the lexical states of the specification, a constructor, and the scanning method
2262with the user supplied actions.
2263
2264<P>
The name of the class is <TT>Yylex</TT> by default; it is customisable
2266with the <TT>%class</TT> directive (see also section
2267<A HREF="#ClassOptions">4.2.1</A>). The input buffer of the lexer is connected with an
2268input stream over the <TT>java.io.Reader</TT> object which is passed
2269to the lexer in the generated constructor. If you want to provide your
2270own constructor for the lexer, you should always call the generated
2271one in it to initialise the input buffer. The input buffer should not
2272be accessed directly, but only over the advertised API (see also
2273section <A HREF="#ScannerMethods">4.3.5</A>). Its internal implementation may change
2274between releases or skeleton files without notice.
2275
2276<P>
2277The main interface to the outside world is the generated scanning
2278method (default name <TT>yylex</TT>, default return type
2279<TT>Yytoken</TT>). Most of its aspects are customisable (name, return
2280type, declared exceptions etc., see also section
2281<A HREF="#ScanningMethod">4.2.2</A>). If it is called, it will consume input until
2282one of the expressions in the specification is matched or an error
2283occurs. If an expression is matched, the corresponding action is
2284executed. It may return a value of the specified return type (in which
case the scanning method returns with this value), or if it doesn't
2286return a value, the scanner resumes consuming input until the next
2287expression is matched. If the end of file is reached, the scanner
2288executes the EOF action, and (also upon each further call to the scanning
2289method) returns the specified EOF value (see also section <A HREF="#EOF">4.2.3</A>).
2290
2291<P>
2292
2293<H3><A NAME="SECTION00053500000000000000"></A><A NAME="ScannerMethods"></A><BR>
2294Scanner methods and fields accessible in actions (API)
2295</H3>
2296Generated methods and member fields in JFlex scanners are prefixed
2297with <TT>yy</TT> to indicate that they are generated and to avoid name
2298conflicts with user code copied into the class. Since user code is
2299part of the same class, JFlex has no language means like the
2300<TT>private</TT> modifier to indicate which members and methods are
2301internal and which ones belong to the API. Instead, JFlex follows a
2302naming convention: everything starting with a <TT>zz</TT> prefix like
2303<TT>zzStartRead</TT> is to be considered internal and subject to
2304change without notice between JFlex releases. Methods and members of
2305the generated class that do not have a <TT>zz</TT> prefix like
2306<TT>yycharat</TT> belong to the API that the scanner class provides to
users in action code of the specification. They will remain stable
2308and supported between JFlex releases as long as possible.
2309
2310<P>
2311Currently, the API consists of the following methods and member fields:
2312
2313<UL>
2314<LI><TT>String yytext()</TT>
2315<BR> returns the matched input text region
2316
2317<P>
2318</LI>
2319<LI><TT>int yylength()</TT>
2320<BR> returns the length of the matched input text region (does not require
2321 a <TT>String</TT> object to be created)
2322
2323<P>
2324</LI>
2325<LI><TT>char yycharat(int pos)</TT>
2326<BR> returns the character at position <TT>pos</TT> from the matched text.
2327 It is equivalent to <TT>yytext().charAt(pos)</TT>, but faster. <TT> pos</TT> must be a value from <TT>0</TT> to <TT>yylength()-1</TT>.
2328
2329<P>
2330</LI>
2331<LI><TT>void yyclose()</TT>
2332<BR> closes the input stream. All subsequent calls to the scanning method will
2333 return the end of file value
2334
2335<P>
2336</LI>
2337<LI><TT>void yyreset(java.io.Reader reader)</TT>
2338<BR> closes the current input stream, and resets the scanner to read from
2339 a new input stream. All internal variables are reset, the old input
2340 stream <EM>cannot</EM> be reused (content of the internal buffer is
 discarded and lost). The lexical state is set to <TT>YYINITIAL</TT>.
2342
2343<P>
2344</LI>
2345<LI><TT>void yypushStream(java.io.Reader reader)</TT>
2346<BR> Stores the current input stream on a stack, and
2347 reads from a new stream. Lexical state, line,
2348 char, and column counting remain untouched.
2349 The current input stream can be restored with
 <TT>yypopStream</TT> (usually in an <TT>&lt;&lt;EOF&gt;&gt;</TT> action).
2351
2352<P>
2353A typical example for this are include files in
2354 style of the C pre-processor. The corresponding
2355 JFlex specification could look somewhat like this:
<PRE>
"#include" {FILE}  { yypushStream(new FileReader(getFile(yytext()))); }
..
&lt;&lt;EOF&gt;&gt;            { if (yymoreStreams()) yypopStream(); else return EOF; }
</PRE>
2361
2362<P>
2363This method is only available in the skeleton file
2364 <TT>skeleton.nested</TT>. You can find it in the
2365 <TT>src</TT> directory of the JFlex distribution.
2366
2367<P>
2368</LI>
2369<LI><TT>void yypopStream()</TT>
2370<BR> Closes the current input stream and continues to
2371 read from the one on top of the stream stack.
2372
2373<P>
2374This method is only available in the skeleton file
2375 <TT>skeleton.nested</TT>. You can find it in the
2376 <TT>src</TT> directory of the JFlex distribution.
2377
2378<P>
2379</LI>
2380<LI><TT>boolean yymoreStreams()</TT>
2381<BR> Returns true iff there are still streams for <TT>yypopStream</TT>
2382 left to read from on the stream stack.
2383
2384<P>
2385This method is only available in the skeleton file
2386 <TT>skeleton.nested</TT>. You can find it in the
2387 <TT>src</TT> directory of the JFlex distribution.
2388
2389<P>
2390</LI>
2391<LI><TT>int yystate()</TT>
2392<BR> returns the current lexical state of the scanner.
2393
2394<P>
2395</LI>
2396<LI><TT>void yybegin(int lexicalState)</TT>
2397<BR> enters the lexical state <TT>lexicalState</TT>
2398
2399<P>
2400</LI>
2401<LI><TT>void yypushback(int number)</TT>
<BR> pushes <TT>number</TT> characters of the matched text back into the input stream.
 They will be read again in the next call of the scanning method.
 The number of characters to be read again must not be greater than the length
 of the matched text. After the call of <TT>yypushback</TT>, the pushed back
 characters are no longer included in <TT>yylength()</TT> and <TT>yytext()</TT>.
 Please note that Java strings are immutable, i.e. action code like
 <PRE>
 String matched = yytext();
 yypushback(1);
 return matched;
</PRE>
 will return the whole matched text, while
 <PRE>
 yypushback(1);
 return yytext();
</PRE>
 will return the matched text minus the last character.
2419
2420<P>
2421</LI>
2422<LI><TT>int yyline</TT>
2423<BR> contains the current line of input (starting with 0, only active with
2424 the <TT><A HREF="#Counting">%line</A></TT> directive)
2425
2426<P>
2427</LI>
2428<LI><TT>int yychar</TT>
2429<BR> contains the current character count in the input (starting with 0,
2430 only active with the <TT><A HREF="#Counting">%char</A></TT> directive)
2431
2432<P>
2433</LI>
2434<LI><TT>int yycolumn</TT>
2435<BR> contains the current column of the current line (starting with 0, only
2436 active with the <TT><A HREF="#Counting">%column</A></TT> directive)
2437
2438<P>
2439</LI>
2440</UL>
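<P>
The idea behind the stream stack methods can be sketched with a plain stack
of <TT>java.io.Reader</TT> objects. The class <TT>StreamStack</TT> below is a
hypothetical stand-in written for this illustration; it is not the code of
<TT>skeleton.nested</TT>, and it folds the "pop at end of stream" step, which
a specification would put into an EOF action, into its <TT>read()</TT> method.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: nested input in the style of include files. */
class StreamStack {
    private final Deque<Reader> stack = new ArrayDeque<>();
    private Reader current;

    StreamStack(Reader initial) {
        current = initial;
    }

    /** Remembers the current stream and continues reading from a new one. */
    void pushStream(Reader reader) {
        stack.push(current);
        current = reader;
    }

    boolean moreStreams() {
        return !stack.isEmpty();
    }

    /** Closes the current stream and returns to the one on top of the stack. */
    void popStream() throws IOException {
        current.close();
        current = stack.pop();
    }

    /** Next character, resuming an outer stream when an inner one ends; -1 at the very end. */
    int read() {
        try {
            int c = current.read();
            while (c == -1 && moreStreams()) {
                popStream();
                c = current.read();
            }
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StreamStack in = new StreamStack(new StringReader("ab"));
        StringBuilder seen = new StringBuilder();
        seen.append((char) in.read());            // 'a' from the outer stream
        in.pushStream(new StringReader("XY"));    // an "include" occurs here
        for (int c = in.read(); c != -1; c = in.read()) {
            seen.append((char) c);
        }
        System.out.println(seen);                 // aXYb
    }
}
```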
2441
2442<P>
2443
2444<H1><A NAME="SECTION00060000000000000000"></A><A NAME="sec:encodings"></A><BR>
2445Encodings, Platforms, and Unicode
2446</H1>
2447
2448<P>
2449This section tries to shed some light on the issues of Unicode and
2450encodings, cross platform scanning, and how to deal with binary data.
2451My thanks go to Stephen Ostermiller for his input on this topic.
2452
2453<P>
2454
2455<H2><A NAME="SECTION00061000000000000000"></A><A NAME="sec:howtoencoding"></A><BR>
2456The Problem
2457</H2>
2458
2459<P>
2460Before we dive straight into details, let's take a look at what the
2461problem is. The problem is Java's platform independence when you want
2462to use it. For scanners the interesting part about platform
2463independence is character encodings and how they are handled.
2464
2465<P>
2466If a program reads a file from disk, it gets a stream of bytes. In
2467earlier times, when the grass was green, and the world was much
2468simpler, everybody knew that the byte value 65 is, of course, an A.
2469It was no problem to see which bytes meant which characters (actually
2470these times never existed, but anyway). The normal Latin alphabet
2471only has 26 characters, so 7 bits or 128 distinct values should surely
2472be enough to map them, even if you allow yourself the luxury of upper
2473and lower case. Nowadays, things are different. The world suddenly
2474grew much larger, and all kinds of people wanted all kinds of special
2475characters, just because they use them in their language and writing.
This is where the mess starts. Since the 128 distinct values were
2477already filled up with other stuff, people began to use all 8 bits of
2478the byte, and extended the byte/character mappings to fit their need,
2479and of course everybody did it differently. Some people for instance
2480may have said ``let's use the value 213 for the German character &#228;''. Others
2481may have found that 213 should much rather mean &#233;, because they didn't need
2482German and wrote French instead. As long as you use your program and
2483data files only on one platform, this is no problem, as all know what
2484means what, and everything gets used consistently.
2485
2486<P>
2487Now Java comes into play, and wants to run everywhere (once written,
2488that is) and now there suddenly is a problem: how do I get the same
2489program to say &#228; to a certain byte when it runs in Germany and maybe &#233;
2490when it runs in France? And also the other way around: when I want to
2491say &#233; on the screen, which byte value should I send to the operating
2492system?
2493
2494<P>
2495Java's solution to this is to use Unicode internally. Unicode aims to
2496be a superset of all known character sets and is therefore a perfect base
2497for encoding things that might get used all over the world. To make
2498things work correctly, you still have to know where you are and how to
2499map byte values to Unicode characters and vice versa, but the
2500important thing is, that this mapping is at least possible (you can
2501map Kanji characters to Unicode, but you cannot map them to ASCII or
2502iso-latin-1).
2503
2504<P>
2505
2506<H2><A NAME="SECTION00062000000000000000"></A><A NAME="sec:howtotext"></A><BR>
2507Scanning text files
2508</H2>
2509
2510<P>
2511Scanning text files is the standard application for scanners like
2512JFlex. Therefore it should also be the most convenient one. Most times
2513it is.
2514
2515<P>
2516The following scenario works like a breeze:
2517You work on a platform X, write your lexer specification there, can
2518use any obscure Unicode character in it as you like, and compile the
2519program. Your users work on any platform Y (possibly but not
2520necessarily something different from X), they write their input files
2521on Y and they run your program on Y. No problems.
2522
2523<P>
2524Java does this as follows:
2525If you want to read anything in Java that is supposed to contain text,
2526you use a <TT>FileReader</TT> or some <TT>InputStream</TT> together with
2527an <TT>InputStreamReader</TT>. <TT>InputStreams</TT> return the raw bytes, the
2528<TT>InputStreamReader</TT> converts the bytes into Unicode characters with
2529the platform's default encoding. If a text file is produced on the
2530same platform, the platform's default encoding should do the mapping
2531correctly. Since JFlex also uses readers and Unicode internally, this
2532mechanism also works for the scanner specifications. If you write an
2533<TT>A</TT> in your text editor and the editor uses the platform's encoding (say <TT>A</TT> is 65),
2534then Java translates this into the logical Unicode <TT>A</TT> internally.
2535If a user writes an <TT>A</TT> on a completely different platform (say <TT>A</TT> is 237 there),
2536then Java also translates this into the logical Unicode <TT>A</TT> internally. Scanning
2537is performed after that translation and both match.
2538
2539<P>
Note that because of this mapping from bytes to characters, you should always
use the <TT>%unicode</TT> switch in your lexer specification if you want to scan
text files. <TT>%8bit</TT> may not be enough, even if
you know that your platform only uses one byte per character. The encoding
Cp1252 used on many Windows machines for instance knows 256 characters, but
the character &#8217; with Cp1252 code <code>\x92</code> has the Unicode value <code>\u2019</code>, which
is larger than 255 and which would make your scanner throw an
<TT>ArrayIndexOutOfBoundsException</TT> if it is encountered.
2548
2549<P>
2550So for the usual case you don't have to do anything but use the
2551<TT>%unicode</TT> switch in your lexer specification.
2552
2553<P>
2554Things may break when you produce a text file on platform X and
2555consume it on a different platform Y. Let's say you have a file
2556written on a Windows PC using the encoding Cp1252. Then you move
2557this file to a Linux PC with encoding ISO 8859-1 and there you want
2558to run your scanner on it. Java now thinks the file is encoded
2559in ISO 8859-1 (the platform's default encoding) while it really is
2560encoded in Cp1252. For most characters
2561Cp1252 and ISO 8859-1 are the same, but for the byte values <code>\x80</code>
2562to <code>\x9f</code> they disagree: ISO 8859-1 is undefined there. You can fix
2563the problem by telling Java explicitly which encoding to use. When
2564constructing the <TT>InputStreamReader</TT>, you can give the encoding
2565as argument. The line
2566<DIV ALIGN="CENTER">
2567<TT>Reader r = new InputStreamReader(input, "Cp1252"); </TT>
2568
2569</DIV>
2570will do the trick.
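<P>
As a hedged illustration of why the explicit encoding matters, the following
self-contained sketch decodes the single byte <code>\x92</code> once as Cp1252 and once
as ISO 8859-1. The class name <TT>EncodingDemo</TT> and the helper <TT>firstChar</TT>
are made up for this example; only <TT>InputStreamReader</TT> with an encoding
argument comes from the text above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of JFlex.
class EncodingDemo {

    /** Decodes the bytes with the named encoding and returns the first character code. */
    static int firstChar(byte[] bytes, String encoding) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), encoding)) {
            return r.read();
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for unknown encoding names.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0x92 };
        // Same byte, two encodings, two different characters.
        System.out.println(Integer.toHexString(firstChar(data, "Cp1252")));
        System.out.println(Integer.toHexString(firstChar(data, "ISO-8859-1")));
    }
}
```

<P>
With the Cp1252 decoder the byte becomes the character code <code>\u2019</code>, while
Java's ISO 8859-1 decoder passes it through as code <code>\x92</code>, so a rule written
for the quotation mark would only match in the first case.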
2571
2572<P>
Of course the encoding to use can also come from the data itself:
for instance, when you scan an HTML page, it may have embedded
information about its character encoding in the headers.
2576
2577<P>
2578More information about encodings, which ones are supported, how
2579they are called, and how to set them may be found in the
2580official Java documentation in the chapter about
2581internationalisation.
2582The link
2583<A NAME="tex2html6"
2584 HREF="http://java.sun.com/j2se/1.3/docs/guide/intl/"><TT>http://java.sun.com/j2se/1.3/docs/guide/intl/</TT></A>
2585leads to an online version of this for Sun's JDK 1.3.
2586
2587<P>
2588
2589<H2><A NAME="SECTION00063000000000000000"></A><A NAME="sec:howtobinary"></A><BR>
2590Scanning binaries
2591</H2>
2592
2593<P>
2594Scanning binaries is both easier and more difficult
2595than scanning text files. It's easier because you want
2596the raw bytes and not their meaning, i.e.&nbsp;you don't want
2597any translation.
2598It's more difficult because it's not so easy to get
2599``no translation'' when you use Java readers.
2600
2601<P>
2602The problem (for binaries) is that JFlex scanners are
2603designed to work on text. Therefore the interface is
2604the <TT>Reader</TT> class (there is a constructor
2605for <TT>InputStream</TT> instances, but it's just there
2606for convenience and wraps an <TT>InputStreamReader</TT>
2607around it to get characters, not bytes).
You can still get a binary scanner when you write
your own custom <TT>InputStreamReader</TT> class that
explicitly does no translation, but just copies
byte values to character codes instead. It sounds
2612quite easy, and actually it is no big deal, but there
2613are a few little pitfalls on the way. In the scanner
2614specification you can only enter positive character
2615codes (for bytes that is <code>\x00</code>
2616to <code>\xFF</code>). Java's <TT>byte</TT> type on the other hand
2617is a signed 8 bit integer (-128 to 127), so you have to convert
2618them properly in your custom <TT>Reader</TT>. Also, you should
2619take care when you write your lexer spec: if you
2620use text in there, it gets interpreted by an encoding
2621first, and what scanner you get as result might depend
2622on which platform you run JFlex on when you generate
2623the scanner (this is what you want for text, but for binaries it
2624gets in the way). If you are not sure, or if the development
2625platform might change, it's probably best to use character
2626code escapes in all places, since they don't change their
2627meaning.
2628
2629<P>
To illustrate these points, the example in <TT>examples/binary</TT>
contains a very small binary scanner that tries to
detect if a file is a Java <TT>class</TT> file. For that
purpose it checks whether the file begins with the magic number <code>\xCAFEBABE</code>.
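<P>
Independently of the bundled example, a reader that performs no translation
could look roughly like the sketch below. The class name <TT>RawByteReader</TT> and
the helper <TT>decode</TT> are assumptions made for illustration; the class shipped
in <TT>examples/binary</TT> may be organised differently. The important detail is the
mask <code>&amp; 0xFF</code>, which turns Java's signed <TT>byte</TT> into a character code
between <code>\x00</code> and <code>\xFF</code>.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical reader for illustration; not the class from examples/binary.
class RawByteReader extends Reader {
    private final InputStream in;
    private final byte[] buffer = new byte[4096];

    RawByteReader(InputStream in) {
        this.in = in;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = in.read(buffer, 0, Math.min(len, buffer.length));
        if (n < 0) {
            return -1; // end of stream
        }
        for (int i = 0; i < n; i++) {
            // byte is signed (-128..127); masking yields the code 0..255
            cbuf[off + i] = (char) (buffer[i] & 0xFF);
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Reads all bytes through the reader and returns the resulting character codes. */
    static int[] decode(byte[] data) {
        int[] codes = new int[data.length];
        try (Reader r = new RawByteReader(new ByteArrayInputStream(data))) {
            for (int i = 0; i < codes.length; i++) {
                codes[i] = r.read();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return codes;
    }

    public static void main(String[] args) {
        byte[] magic = { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE };
        for (int c : decode(magic)) {
            System.out.printf("\\x%02X%n", c);
        }
    }
}
```

<P>
A scanner constructed with such a reader sees exactly one character per input
byte, so rules written with escapes like <code>\xCA</code> match the raw byte values.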
2634
2635<P>
2636
2637<H1><A NAME="SECTION00070000000000000000"></A><A NAME="performance"></A><BR>
2638A few words on performance
2639</H1>
2640This section gives some empirical results about the speed of JFlex generated
2641scanners in comparison to those generated by JLex,
2642compares a JFlex scanner with a <A HREF="#PerformanceHandwritten">handwritten</A>
2643one, and presents some <A HREF="#PerformanceTips">tips</A> on how to make
2644your specification produce a faster scanner.
2645
2646<P>
2647
2648<H2><A NAME="SECTION00071000000000000000"></A><A NAME="PerformanceJLex"></A><BR>
2649Comparison of JLex and JFlex
2650</H2>
Scanners generated by the tool JLex are quite fast. It was however
possible to further improve the performance of generated scanners
using JFlex. The following table shows the results that were produced
by the scanner specification of a small toy programming language (the
example from the JLex web site). The scanner was generated using JLex
1.2.6 and JFlex version 1.3.5 with all three different JFlex code
generation methods. Then it was run on a Windows 98 (W98) system using Sun's JDK
1.3 with different sample inputs of that toy programming language. All
test runs were made under the same conditions on an otherwise idle
machine.
2661
2662<P>
The values presented in the table denote the time from the first call
to the scanning method to returning the EOF value and the speedup in
percent. The tests were run both in the mixed (HotSpot) JVM mode and
the pure interpreted mode. The mixed mode JVM brings
about a factor of 10 performance improvement, while the difference between
JLex and JFlex decreases only slightly.
2669
2670<P>
2671<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
2672<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">KB</TD>
2673<TD ALIGN="CENTER">JVM</TD>
2674<TD ALIGN="RIGHT">JLex</TD>
2675<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD>
2676<TD ALIGN="RIGHT">speedup</TD>
2677<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD>
2678<TD ALIGN="RIGHT">speedup</TD>
2679<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD>
2680<TD ALIGN="RIGHT">speedup</TD>
2681</TR>
2682<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">496</TD>
2683<TD ALIGN="CENTER">hotspot</TD>
2684<TD ALIGN="RIGHT">325 ms</TD>
2685<TD ALIGN="RIGHT">261 ms</TD>
2686<TD ALIGN="RIGHT">24.5 %</TD>
2687<TD ALIGN="RIGHT">261 ms</TD>
2688<TD ALIGN="RIGHT">24.5 %</TD>
2689<TD ALIGN="RIGHT">261 ms</TD>
2690<TD ALIGN="RIGHT">24.5 %</TD>
2691</TR>
2692<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">187</TD>
2693<TD ALIGN="CENTER">hotspot</TD>
2694<TD ALIGN="RIGHT">127 ms</TD>
2695<TD ALIGN="RIGHT">98 ms</TD>
2696<TD ALIGN="RIGHT">29.6 %</TD>
2697<TD ALIGN="RIGHT">94 ms</TD>
2698<TD ALIGN="RIGHT">35.1 %</TD>
2699<TD ALIGN="RIGHT">96 ms</TD>
2700<TD ALIGN="RIGHT">32.3 %</TD>
2701</TR>
2702<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">93</TD>
2703<TD ALIGN="CENTER">hotspot</TD>
2704<TD ALIGN="RIGHT">66 ms</TD>
2705<TD ALIGN="RIGHT">50 ms</TD>
2706<TD ALIGN="RIGHT">32.0 %</TD>
2707<TD ALIGN="RIGHT">50 ms</TD>
2708<TD ALIGN="RIGHT">32.0 %</TD>
2709<TD ALIGN="RIGHT">48 ms</TD>
2710<TD ALIGN="RIGHT">37.5 %</TD>
2711</TR>
2712<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">496</TD>
2713<TD ALIGN="CENTER">interpr.</TD>
2714<TD ALIGN="RIGHT">4009 ms</TD>
2715<TD ALIGN="RIGHT">3025 ms</TD>
2716<TD ALIGN="RIGHT">32.5 %</TD>
2717<TD ALIGN="RIGHT">3258 ms</TD>
2718<TD ALIGN="RIGHT">23.1 %</TD>
2719<TD ALIGN="RIGHT">3231 ms</TD>
2720<TD ALIGN="RIGHT">24.1 %</TD>
2721</TR>
2722<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">187</TD>
2723<TD ALIGN="CENTER">interpr.</TD>
2724<TD ALIGN="RIGHT">1641 ms</TD>
2725<TD ALIGN="RIGHT">1155 ms</TD>
2726<TD ALIGN="RIGHT">42.1 %</TD>
2727<TD ALIGN="RIGHT">1245 ms</TD>
2728<TD ALIGN="RIGHT">31.8 %</TD>
2729<TD ALIGN="RIGHT">1234 ms</TD>
2730<TD ALIGN="RIGHT">33.0 %</TD>
2731</TR>
2732<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">93</TD>
2733<TD ALIGN="CENTER">interpr.</TD>
2734<TD ALIGN="RIGHT">817 ms</TD>
2735<TD ALIGN="RIGHT">573 ms</TD>
2736<TD ALIGN="RIGHT">42.6 %</TD>
2737<TD ALIGN="RIGHT">617 ms</TD>
2738<TD ALIGN="RIGHT">32.4 %</TD>
2739<TD ALIGN="RIGHT">613 ms</TD>
2740<TD ALIGN="RIGHT">33.3 %</TD>
2741</TR>
2742</TABLE>
2743
2744<P><BR>
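<P>
The manual does not spell out how the speedup column is computed. The values
in the table above are consistent with the formula
(JLex time / JFlex time - 1) * 100, which the following small sketch
reproduces for the first row. The class and method names are hypothetical and
the formula is an inference from the published numbers, not a statement by the
author.

```java
// Hypothetical helper; the formula is inferred from the table values.
class Speedup {

    /** Speedup in percent of a candidate time relative to a baseline time. */
    static double speedupPercent(double baselineMs, double candidateMs) {
        return (baselineMs / candidateMs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // First row of the table: JLex 325 ms, JFlex 261 ms.
        double percent = speedupPercent(325, 261);
        // Round to one decimal place, as the table does.
        System.out.println(Math.round(percent * 10) / 10.0);
    }
}
```

<P>
Read this way, a speedup of 100 % means that the JFlex scanner needs half the
time of the JLex scanner.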
2745
2746<P>
2747Since the scanning time of the lexical analyser examined in the table
2748above includes lexical actions that often need to create new object instances,
2749another table shows the execution time for the same specification with empty
2750lexical actions to compare the pure scanning engines.
2751
2752<P>
2753<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
2754<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">KB</TD>
2755<TD ALIGN="CENTER">JVM</TD>
2756<TD ALIGN="RIGHT">JLex</TD>
2757<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD>
2758<TD ALIGN="RIGHT">speedup</TD>
2759<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD>
2760<TD ALIGN="RIGHT">speedup</TD>
2761<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD>
2762<TD ALIGN="RIGHT">speedup</TD>
2763</TR>
2764<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">496</TD>
2765<TD ALIGN="CENTER">hotspot</TD>
2766<TD ALIGN="RIGHT">204 ms</TD>
2767<TD ALIGN="RIGHT">140 ms</TD>
2768<TD ALIGN="RIGHT">45.7 %</TD>
2769<TD ALIGN="RIGHT">138 ms</TD>
2770<TD ALIGN="RIGHT">47.8 %</TD>
2771<TD ALIGN="RIGHT">140 ms</TD>
2772<TD ALIGN="RIGHT">45.7 %</TD>
2773</TR>
2774<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">187</TD>
2775<TD ALIGN="CENTER">hotspot</TD>
2776<TD ALIGN="RIGHT">83 ms</TD>
2777<TD ALIGN="RIGHT">55 ms</TD>
2778<TD ALIGN="RIGHT">50.9 %</TD>
2779<TD ALIGN="RIGHT">52 ms</TD>
2780<TD ALIGN="RIGHT">59.6 %</TD>
2781<TD ALIGN="RIGHT">52 ms</TD>
2782<TD ALIGN="RIGHT">59.6 %</TD>
2783</TR>
2784<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">93</TD>
2785<TD ALIGN="CENTER">hotspot</TD>
2786<TD ALIGN="RIGHT">41 ms</TD>
2787<TD ALIGN="RIGHT">28 ms</TD>
2788<TD ALIGN="RIGHT">46.4 %</TD>
2789<TD ALIGN="RIGHT">26 ms</TD>
2790<TD ALIGN="RIGHT">57.7 %</TD>
2791<TD ALIGN="RIGHT">26 ms</TD>
2792<TD ALIGN="RIGHT">57.7 %</TD>
2793</TR>
2794<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">496</TD>
2795<TD ALIGN="CENTER">interpr.</TD>
2796<TD ALIGN="RIGHT">2983 ms</TD>
2797<TD ALIGN="RIGHT">2036 ms</TD>
2798<TD ALIGN="RIGHT">46.5 %</TD>
2799<TD ALIGN="RIGHT">2230 ms</TD>
2800<TD ALIGN="RIGHT">33.8 %</TD>
2801<TD ALIGN="RIGHT">2232 ms</TD>
2802<TD ALIGN="RIGHT">33.6 %</TD>
2803</TR>
2804<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">187</TD>
2805<TD ALIGN="CENTER">interpr.</TD>
2806<TD ALIGN="RIGHT">1260 ms</TD>
2807<TD ALIGN="RIGHT">793 ms</TD>
2808<TD ALIGN="RIGHT">58.9 %</TD>
2809<TD ALIGN="RIGHT">865 ms</TD>
2810<TD ALIGN="RIGHT">45.7 %</TD>
2811<TD ALIGN="RIGHT">867 ms</TD>
2812<TD ALIGN="RIGHT">45.3 %</TD>
2813</TR>
2814<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">93</TD>
2815<TD ALIGN="CENTER">interpr.</TD>
2816<TD ALIGN="RIGHT">628 ms</TD>
2817<TD ALIGN="RIGHT">395 ms</TD>
2818<TD ALIGN="RIGHT">59.0 %</TD>
2819<TD ALIGN="RIGHT">432 ms</TD>
2820<TD ALIGN="RIGHT">45.4 %</TD>
2821<TD ALIGN="RIGHT">432 ms</TD>
2822<TD ALIGN="RIGHT">45.4 %</TD>
2823</TR>
2824</TABLE>
2825
2826<P><BR>
2827
2828<P>
2829Execution time of single instructions depends on the platform and
2830the implementation of the Java Virtual Machine the program is executed
2831on. Therefore the tables above cannot be used as a reference to which
2832code generation method of JFlex is the right one to choose in general.
2833The following table was produced by the same lexical specification and
2834the same input on a Linux system also using Sun's JDK 1.3.
2835
2836<P>
2837With actions:
2838
2839<P>
2840<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
2841<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">KB</TD>
2842<TD ALIGN="CENTER">JVM</TD>
2843<TD ALIGN="RIGHT">JLex</TD>
2844<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD>
2845<TD ALIGN="RIGHT">speedup</TD>
2846<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD>
2847<TD ALIGN="RIGHT">speedup</TD>
2848<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD>
2849<TD ALIGN="RIGHT">speedup</TD>
2850</TR>
2851<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">496</TD>
2852<TD ALIGN="CENTER">hotspot</TD>
2853<TD ALIGN="RIGHT">246 ms</TD>
2854<TD ALIGN="RIGHT">203 ms</TD>
2855<TD ALIGN="RIGHT">21.2 %</TD>
2856<TD ALIGN="RIGHT">193 ms</TD>
2857<TD ALIGN="RIGHT">27.5 %</TD>
2858<TD ALIGN="RIGHT">190 ms</TD>
2859<TD ALIGN="RIGHT">29.5 %</TD>
2860</TR>
2861<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">187</TD>
2862<TD ALIGN="CENTER">hotspot</TD>
2863<TD ALIGN="RIGHT">99 ms</TD>
2864<TD ALIGN="RIGHT">76 ms</TD>
2865<TD ALIGN="RIGHT">30.3 %</TD>
2866<TD ALIGN="RIGHT">69 ms</TD>
2867<TD ALIGN="RIGHT">43.5 %</TD>
2868<TD ALIGN="RIGHT">70 ms</TD>
2869<TD ALIGN="RIGHT">41.4 %</TD>
2870</TR>
2871<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">93</TD>
2872<TD ALIGN="CENTER">hotspot</TD>
2873<TD ALIGN="RIGHT">48 ms</TD>
2874<TD ALIGN="RIGHT">36 ms</TD>
2875<TD ALIGN="RIGHT">33.3 %</TD>
2876<TD ALIGN="RIGHT">34 ms</TD>
2877<TD ALIGN="RIGHT">41.2 %</TD>
2878<TD ALIGN="RIGHT">35 ms</TD>
2879<TD ALIGN="RIGHT">37.1 %</TD>
2880</TR>
2881<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">496</TD>
2882<TD ALIGN="CENTER">interpr.</TD>
2883<TD ALIGN="RIGHT">3251 ms</TD>
2884<TD ALIGN="RIGHT">2247 ms</TD>
2885<TD ALIGN="RIGHT">44.7 %</TD>
2886<TD ALIGN="RIGHT">2430 ms</TD>
2887<TD ALIGN="RIGHT">33.8 %</TD>
2888<TD ALIGN="RIGHT">2444 ms</TD>
2889<TD ALIGN="RIGHT">33.0 %</TD>
2890</TR>
2891<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">187</TD>
2892<TD ALIGN="CENTER">interpr.</TD>
2893<TD ALIGN="RIGHT">1320 ms</TD>
2894<TD ALIGN="RIGHT">848 ms</TD>
2895<TD ALIGN="RIGHT">55.7 %</TD>
2896<TD ALIGN="RIGHT">958 ms</TD>
2897<TD ALIGN="RIGHT">37.8 %</TD>
2898<TD ALIGN="RIGHT">920 ms</TD>
2899<TD ALIGN="RIGHT">43.5 %</TD>
2900</TR>
2901<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">93</TD>
2902<TD ALIGN="CENTER">interpr.</TD>
2903<TD ALIGN="RIGHT">658 ms</TD>
2904<TD ALIGN="RIGHT">423 ms</TD>
2905<TD ALIGN="RIGHT">55.6 %</TD>
2906<TD ALIGN="RIGHT">456 ms</TD>
2907<TD ALIGN="RIGHT">44.3 %</TD>
2908<TD ALIGN="RIGHT">452 ms</TD>
2909<TD ALIGN="RIGHT">45.6 %</TD>
2910</TR>
2911</TABLE>
2912
2913<P><BR>
2914
2915<P>
2916Without actions:
2917
2918<P>
2919<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
2920<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">KB</TD>
2921<TD ALIGN="CENTER">JVM</TD>
2922<TD ALIGN="RIGHT">JLex</TD>
2923<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD>
2924<TD ALIGN="RIGHT">speedup</TD>
2925<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD>
2926<TD ALIGN="RIGHT">speedup</TD>
2927<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD>
2928<TD ALIGN="RIGHT">speedup</TD>
2929</TR>
2930<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">496</TD>
2931<TD ALIGN="CENTER">hotspot</TD>
2932<TD ALIGN="RIGHT">136 ms</TD>
2933<TD ALIGN="RIGHT">78 ms</TD>
2934<TD ALIGN="RIGHT">74.4 %</TD>
2935<TD ALIGN="RIGHT">76 ms</TD>
2936<TD ALIGN="RIGHT">78.9 %</TD>
2937<TD ALIGN="RIGHT">77 ms</TD>
2938<TD ALIGN="RIGHT">76.6 %</TD>
2939</TR>
2940<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">187</TD>
2941<TD ALIGN="CENTER">hotspot</TD>
2942<TD ALIGN="RIGHT">59 ms</TD>
2943<TD ALIGN="RIGHT">31 ms</TD>
2944<TD ALIGN="RIGHT">90.3 %</TD>
2945<TD ALIGN="RIGHT">48 ms</TD>
2946<TD ALIGN="RIGHT">22.9 %</TD>
2947<TD ALIGN="RIGHT">32 ms</TD>
2948<TD ALIGN="RIGHT">84.4 %</TD>
2949</TR>
2950<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">93</TD>
2951<TD ALIGN="CENTER">hotspot</TD>
2952<TD ALIGN="RIGHT">28 ms</TD>
2953<TD ALIGN="RIGHT">15 ms</TD>
2954<TD ALIGN="RIGHT">86.7 %</TD>
2955<TD ALIGN="RIGHT">15 ms</TD>
2956<TD ALIGN="RIGHT">86.7 %</TD>
2957<TD ALIGN="RIGHT">15 ms</TD>
2958<TD ALIGN="RIGHT">86.7 %</TD>
2959</TR>
2960<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">496</TD>
2961<TD ALIGN="CENTER">interpr.</TD>
2962<TD ALIGN="RIGHT">1992 ms</TD>
2963<TD ALIGN="RIGHT">1047 ms</TD>
2964<TD ALIGN="RIGHT">90.3 %</TD>
2965<TD ALIGN="RIGHT">1246 ms</TD>
2966<TD ALIGN="RIGHT">59.9 %</TD>
2967<TD ALIGN="RIGHT">1215 ms</TD>
2968<TD ALIGN="RIGHT">64.0 %</TD>
2969</TR>
2970<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">187</TD>
2971<TD ALIGN="CENTER">interpr.</TD>
2972<TD ALIGN="RIGHT">859 ms</TD>
2973<TD ALIGN="RIGHT">408 ms</TD>
2974<TD ALIGN="RIGHT">110.5 %</TD>
2975<TD ALIGN="RIGHT">479 ms</TD>
2976<TD ALIGN="RIGHT">79.3 %</TD>
2977<TD ALIGN="RIGHT">487 ms</TD>
2978<TD ALIGN="RIGHT">76.4 %</TD>
2979</TR>
2980<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">93</TD>
2981<TD ALIGN="CENTER">interpr.</TD>
2982<TD ALIGN="RIGHT">435 ms</TD>
2983<TD ALIGN="RIGHT">200 ms</TD>
2984<TD ALIGN="RIGHT">117.5 %</TD>
2985<TD ALIGN="RIGHT">237 ms</TD>
2986<TD ALIGN="RIGHT">83.5 %</TD>
2987<TD ALIGN="RIGHT">242 ms</TD>
2988<TD ALIGN="RIGHT">79.8 %</TD>
2989</TR>
2990</TABLE>
2991
2992<P><BR>
2993
2994<P>
2995Although all JFlex scanners were faster than those generated by JLex,
2996slight differences between JFlex code generation methods show up when compared
2997to the run on the W98 system.
2998<A NAME="PerformanceHandwritten"></A>
2999<P>
3000The following table compares a hand-written scanner for the Java language
3001obtained from the web site of CUP with the JFlex generated scanner for Java
3002that comes with JFlex in the <TT>examples</TT> directory. They were tested
3003on different <TT>.java</TT> files on a Linux machine with Sun's JDK 1.3.
3004
3005<P>
3006<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
3007<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">lines</TD>
3008<TD ALIGN="RIGHT">KB</TD>
3009<TD ALIGN="CENTER">JVM</TD>
3010<TD ALIGN="RIGHT">hand-written scanner</TD>
3011<TD ALIGN="CENTER" COLSPAN=2>JFlex generated scanner</TD>
3012</TR>
3013<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">19050</TD>
3014<TD ALIGN="RIGHT">496</TD>
3015<TD ALIGN="CENTER">hotspot</TD>
3016<TD ALIGN="RIGHT">824 ms</TD>
3017<TD ALIGN="RIGHT">248 ms</TD>
3018<TD ALIGN="RIGHT">235 % faster</TD>
3019</TR>
3020<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">6350</TD>
3021<TD ALIGN="RIGHT">165</TD>
3022<TD ALIGN="CENTER">hotspot</TD>
3023<TD ALIGN="RIGHT">272 ms</TD>
3024<TD ALIGN="RIGHT">84 ms</TD>
3025<TD ALIGN="RIGHT">232 % faster</TD>
3026</TR>
3027<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">1270</TD>
3028<TD ALIGN="RIGHT">33</TD>
3029<TD ALIGN="CENTER">hotspot</TD>
3030<TD ALIGN="RIGHT">53 ms</TD>
3031<TD ALIGN="RIGHT">18 ms</TD>
3032<TD ALIGN="RIGHT">194 % faster</TD>
3033</TR>
3034<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">19050</TD>
3035<TD ALIGN="RIGHT">496</TD>
3036<TD ALIGN="CENTER">interpreted</TD>
3037<TD ALIGN="RIGHT">5.83 s</TD>
3038<TD ALIGN="RIGHT">3.85 s</TD>
3039<TD ALIGN="RIGHT">51 % faster</TD>
3040</TR>
3041<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">6350</TD>
3042<TD ALIGN="RIGHT">165</TD>
3043<TD ALIGN="CENTER">interpreted</TD>
3044<TD ALIGN="RIGHT">1.95 s</TD>
3045<TD ALIGN="RIGHT">1.29 s</TD>
3046<TD ALIGN="RIGHT">51 % faster</TD>
3047</TR>
3048<TR><TD ALIGN="LEFT">&nbsp;</TD><TD ALIGN="RIGHT">1270</TD>
3049<TD ALIGN="RIGHT">33</TD>
3050<TD ALIGN="CENTER">interpreted</TD>
3051<TD ALIGN="RIGHT">0.38 s</TD>
3052<TD ALIGN="RIGHT">0.25 s</TD>
3053<TD ALIGN="RIGHT">52 % faster</TD>
3054</TR>
3055</TABLE>
3056
3057<P><BR>
3058
3059<P>
Although JDK 1.3 seems to speed up the hand-written scanner more than the
generated one when compared to JDK 1.1 or 1.2, the generated scanner is
still up to 3.3 times as fast as the hand-written one. One example of
a hand-written scanner that is
considerably slower than the equivalent generated one is certainly no
proof that all generated scanners are faster than hand-written ones. It is
clearly impossible to prove something like that, since you could
always write the generated scanner by hand. From a software
engineering point of view, however, there is no excuse for writing a
scanner by hand, since this task takes more time, is more difficult and
therefore more error prone than writing a compact, readable and easy
to change lexical specification. (I'd like to add that I do <EM>not</EM>
think that the hand-written scanner from the CUP web site used here in
the test is stupid or badly written or anything like that. I actually
think Scott did a great job with it.)
3075
3076<P>
3077
3078<H2><A NAME="SECTION00072000000000000000"></A><A NAME="PerformanceTips"></A><BR>
3079How to write a faster specification
3080</H2>
3081Although JFlex generated scanners show good performance without
3082special optimisations, there are some heuristics that can make a
3083lexical specification produce an even faster scanner. Those are
3084(roughly in order of performance gain):
3085
3086<P>
3087
3088<UL>
3089<LI>Avoid rules that require backtracking
3090
3091<P>
3092From the C/C++ flex [<A
3093 HREF="manual.html#flex">11</A>] man page: <EM>``Getting rid
3094of backtracking is messy and often may be an enormous amount of work for
3095a complicated scanner.''</EM> Backtracking is introduced by the longest match
3096rule and occurs for instance on this set of expressions:
3097
3098<P>
3099<TT> "averylongkeyword"</TT>
3100<BR><TT> .</TT>
3101
3102<P>
With input <TT>"averylongjoke"</TT> the scanner has to read all characters
up to <TT>'j'</TT> to decide that rule <TT>.</TT> should be matched. All
characters of <TT>"verylong"</TT> have to be read again for the next
matching process. Backtracking can be avoided in general by adding
error rules that match those error conditions:
3108
3109<P>
3110<code> "av"|"ave"|"avery"|"averyl"|..</code>
3111
3112<P>
While this is impractical in most scanners, there is still the
possibility to add a ``catch all'' rule for a lengthy list of keywords:
3115<PRE>
3116"keyword1" { return symbol(KEYWORD1); }
3117..
3118"keywordn" { return symbol(KEYWORDn); }
3119[a-z]+ { error("not a keyword"); }
3120</PRE>
3121Most programming language scanners already have a rule like this for
3122some kind of variable length identifiers.
3123
3124<P>
3125</LI>
3126<LI>Avoid line and column counting
3127
3128<P>
3129It costs multiple additional comparisons per input character and the
3130 matched text has to be re-scanned for counting. In most scanners it
3131 is possible to do the line counting in the specification by
3132 incrementing <TT>yyline</TT> each time a line terminator has been
3133 matched. Column counting could also be included in actions. This
3134 will be faster, but can in some cases become quite messy.
3135
3136<P>
3137</LI>
3138<LI>Avoid look-ahead expressions and the end of line operator '$'
3139
3140<P>
3141In the best case, the trailing context will first have to be read and
3142 then (because it is not to be consumed) re-read again. The cases of
3143 fixed-length look-ahead and fixed-length base expressions are handled efficiently
3144 by matching the concatenation and then pushing back the required amount
3145 of characters. This extends to the case of a disjunction of fixed-length
3146 look-ahead expressions such as <code>r1 / \r|\n|\r\n</code>. All other cases
3147 <code>r1 / r2</code> are handled by first scanning the concatenation of
3148 <code>r1</code> and <code>r2</code>, and then finding the correct end of <code>r1</code>.
3149 The end of <code>r1</code> is found by scanning forwards in the match again,
3150 marking all possible <code>r1</code> terminations, and then scanning the reverse
3151 of <code>r2</code> backwards from the end until a start of <code>r2</code> intersects
3152 with an end of <code>r1</code>. This algorithm is linear in the size of the input
3153 (not quadratic or worse as backtracking is), but about a factor of 2 slower
3154 than normal scanning. It also consumes memory proportional to the size
3155 of the matched input for <code>r1 r2</code>.
3156
3157<P>
3158</LI>
3159<LI>Avoid the beginning of line operator '<code>^</code>'
3160
3161<P>
3162It costs multiple additional comparisons per match. In some
3163 cases one extra look-ahead character is needed (when the last character read is
3164 <code>\r</code> the scanner has to read one character ahead to check if
3165 the next one is an <code>\n</code> or not).
3166
3167<P>
3168</LI>
3169<LI>Match as much text as possible in a rule.
3170
3171<P>
3172One rule is matched in the innermost loop of the scanner. After
3173 each action some overhead for setting up the internal state of the
3174 scanner is necessary.
3175</LI>
3176</UL>
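<P>
The backtracking tip at the top of the list above can be made concrete with a
toy model that counts how many characters are examined for the input
<TT>"averylongjoke"</TT>. This is only a sketch under simplifying assumptions: it
hard-codes the two rule sets from the example instead of running a generated
DFA, and the class and method names are invented for illustration.

```java
// Toy model of longest-match scanning; not code generated by JFlex.
class BacktrackDemo {
    static final String KEYWORD = "averylongkeyword";

    /** Rules: the keyword and "." (any single character). Returns characters examined. */
    static int readsWithoutCatchAll(String input) {
        int reads = 0;
        int pos = 0;
        while (pos < input.length()) {
            int i = 0;
            // Follow the keyword for as long as the input agrees with it.
            while (i < KEYWORD.length() && pos + i < input.length()) {
                reads++; // one character examined
                if (input.charAt(pos + i) != KEYWORD.charAt(i)) {
                    break;
                }
                i++;
            }
            // Full keyword: consume it. Otherwise rule "." consumes one character
            // and everything read beyond it must be read again.
            pos += (i == KEYWORD.length()) ? KEYWORD.length() : 1;
        }
        return reads;
    }

    /** Same rules plus a catch-all [a-z]+ rule. Returns characters examined. */
    static int readsWithCatchAll(String input) {
        int reads = 0;
        int pos = 0;
        while (pos < input.length()) {
            int i = 0;
            // The catch-all rule keeps matching while lower-case letters follow.
            while (pos + i < input.length()) {
                reads++;
                char c = input.charAt(pos + i);
                if (c < 'a' || c > 'z') {
                    break;
                }
                i++;
            }
            // Consume the whole run of letters, or one character for rule ".".
            pos += Math.max(i, 1);
        }
        return reads;
    }

    public static void main(String[] args) {
        String input = "averylongjoke";
        System.out.println("without catch-all: " + readsWithoutCatchAll(input));
        System.out.println("with catch-all:    " + readsWithCatchAll(input));
    }
}
```

<P>
In this model the 13 input characters cost 22 character reads without the
catch-all rule and 13 with it. The difference is the re-reading of
<TT>"verylong"</TT> and <TT>'j'</TT> described above.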
3177
3178<P>
3179Note that writing more rules in a specification does not make the generated
3180scanner slower (except when you have to switch to another code generation
3181method because of the larger size).
3182
3183<P>
3184The two main rules of optimisation apply also for lexical specifications:
3185
3186<OL>
3187<LI><B>don't do it</B>
3188</LI>
3189<LI><B>(for experts only) don't do it yet</B>
3190</LI>
3191</OL>
3192
3193<P>
Some of the performance tips above contradict a readable and compact
specification style. When in doubt, or when requirements are not (or not
yet) fixed: don't use them -- the specification can always be optimised
in a later stage of the development process.
3198
3199<P>
3200
3201<H1><A NAME="SECTION00080000000000000000">
3202Porting Issues</A>
3203</H1>
3204
3205<P>
3206
3207<H2><A NAME="SECTION00081000000000000000"></A><A NAME="Porting"></A><BR>
3208Porting from JLex
3209</H2>
3210JFlex was designed to read old JLex specifications unchanged and to
3211generate a scanner which behaves exactly the same as the one generated
3212by JLex with the only difference of being faster.
3213
3214<P>
3215This works as expected on all well formed JLex specifications.
3216
3217<P>
3218Since the statement above is somewhat absolute, let's take a look at
3219what ``well formed'' means here. A JLex specification is well formed, when
3220it
3221
3222<UL>
3223<LI>generates a working scanner with JLex
3224
3225<P>
3226</LI>
3227<LI>doesn't contain the unescaped characters <TT>!</TT> and <TT>~</TT>
3228
3229<P>
3230They are operators in JFlex while JLex treats them as normal
3231 input characters. You can easily port such a JLex specification
3232 to JFlex by replacing every <TT>!</TT> with <code>\!</code> and every
3233 <code>~</code> with <code>\~</code> in all regular expressions.
3234
3235<P>
3236</LI>
3237<LI>has only complete regular expressions surrounded by parentheses in
3238 macro definitions
3239
3240<P>
3241This may sound a bit harsh, but could otherwise be a major problem
3242 - it can also help you find some disgusting bugs in your
3243 specification that didn't show up in the first place. In JLex, a
3244 right hand side of a macro is just a piece of text, that is copied
3245 to the point where the macro is used. With this, some weird kind of
3246 stuff like
3247 <PRE>
3248 macro1 = ("hello"
3249 macro2 = {macro1})*
3250</PRE>
3251 was possible (with <TT>macro2</TT> expanding to <code>("hello")*</code>). This
3252 is not allowed in JFlex and you will have to transform such
3253 definitions. There are however some more subtle kinds of errors that
3254 can be introduced by JLex macros. Let's consider a definition like
3255 <code>macro = a|b</code> and a usage like <code>{macro}*</code>.
3256 This expands in JLex to <code>a|b*</code> and not to the probably intended
3257 <code>(a|b)*</code>.
3258
3259<P>
JFlex always uses the second form of expansion, since this is the natural
 way of thinking about abbreviations for regular expressions.
3262
3263<P>
3264Most specifications shouldn't suffer from this problem, because
3265 macros often only contain (harmless) character classes like
3266 <TT>alpha = [a-zA-Z]</TT> and more dangerous definitions like
3267
3268<P>
3269<code> ident = {alpha}({alpha}|{digit})*</code>
3270
3271<P>
3272are only used to write rules like
3273
3274<P>
3275<code> {ident} { .. action .. }</code>
3276
3277<P>
3278and not more complex expressions like
3279
3280<P>
3281<code> {ident}* { .. action .. }</code>
3282
3283<P>
3284where the kind of error presented above would show up.
3285</LI>
3286</UL>
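<P>
The macro expansion difference described in the list above can be tried out
with <TT>java.util.regex</TT>, whose syntax is close enough for this particular
case. The sketch below is an illustration only: the class name and the two
helper methods are hypothetical, and neither JLex nor JFlex is involved. It
expands <code>{macro}*</code> once as plain text and once as a parenthesised unit.

```java
import java.util.regex.Pattern;

// Hypothetical demo; mimics the two expansion styles with plain string replacement.
class MacroExpansionDemo {
    static final String MACRO_BODY = "a|b";

    /** Textual expansion as described for JLex: the body is pasted in as-is. */
    static String expandTextually(String usage) {
        return usage.replace("{macro}", MACRO_BODY);
    }

    /** Expansion as a unit, as described for JFlex: the body acts as one expression. */
    static String expandAsUnit(String usage) {
        return usage.replace("{macro}", "(" + MACRO_BODY + ")");
    }

    public static void main(String[] args) {
        String usage = "{macro}*";
        String input = "abab";
        String textual = expandTextually(usage); // a|b*
        String unit = expandAsUnit(usage);       // (a|b)*
        System.out.println(textual + " matches " + input + ": " + Pattern.matches(textual, input));
        System.out.println(unit + " matches " + input + ": " + Pattern.matches(unit, input));
    }
}
```

<P>
The textual expansion <code>a|b*</code> does not match <TT>abab</TT>, while
<code>(a|b)*</code> does, which is exactly the kind of silent change in meaning to
look for when porting macros.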
3287
3288<P>
3289
3290<H2><A NAME="SECTION00082000000000000000"></A><A NAME="lexport"></A><BR>
3291Porting from lex/flex
3292</H2>
3293This section tries to give an overview of activities and possible
3294problems when porting a lexical specification from the C/C++ tools lex
3295and flex [<A
3296 HREF="manual.html#flex">11</A>] available on most Unix systems to JFlex.
3297
3298<P>
3299Most of the C/C++ specific features are naturally not present in JFlex,
3300but most ``clean'' lex/flex lexical specifications can be ported to
3301JFlex without very much work.
3302
3303<P>
This section is far from complete and is based mainly on a survey of
the flex man page and very little personal experience. If you do
3306engage in any porting activity from lex/flex to JFlex and encounter
3307problems, have better solutions for points presented here or have just
3308some tips you would like to share, please do <A NAME="tex2html7"
3309 HREF="mailto:[email protected]">contact me</A>. I will
3310incorporate your experiences in this manual (with all due credit to you,
3311of course).
3312
3313<P>
3314
3315<H3><A NAME="SECTION00082100000000000000">
3316Basic structure</A>
3317</H3>
3318A lexical specification for flex has the following basic structure:
3319<PRE>
3320definitions
3321%%
3322rules
3323%%
3324user code
3325</PRE>
3326
3327<P>
The <TT>user code</TT> section usually contains some C code that is used
in actions of the <TT>rules</TT> part of the specification. For JFlex most
of this code will have to be included in the class code <code>%{..%}</code>
directive in the <TT>options and declarations</TT> section (after
translating the C code to Java, of course).
3333
3334<P>
3335
3336<H3><A NAME="SECTION00082200000000000000">
3337Macros and Regular Expression Syntax</A>
3338</H3>
3339The <TT>definitions</TT> section of a flex specification is quite similar
3340to the <TT>options and declarations</TT> part of JFlex specs.
3341
3342<P>
3343Macro definitions in flex have the form:
3344<PRE>
3345&lt;identifier&gt; &lt;expression&gt;
3346</PRE>
3347To port them to JFlex macros, just insert a <TT>=</TT> between <TT>&lt;identifier&gt;</TT>
3348and <TT>&lt;expression&gt;</TT>.
3349
3350<P>
The syntax and semantics of regular expressions in flex are pretty much the
same as in JFlex. A little attention is needed for some escape sequences
present in flex (such as <code>\a</code>) that are not supported in JFlex. These
escape sequences should be transformed into their octal or hexadecimal
equivalent; <code>\a</code>, the alert character with code 7, becomes <code>\x07</code>, for example.
3356
3357<P>
Another point is predefined character classes. Flex offers the ones directly
supported by C, JFlex offers the ones supported by Java. These classes will
sometimes have to be listed manually (if there is need for this feature, it
may be implemented in a future JFlex version).
3362
3363<P>
3364
3365<H3><A NAME="SECTION00082300000000000000">
3366Lexical Rules</A>
3367</H3>
Since flex is mostly Unix based, the '<code>^</code>' (beginning of line) and
'<code>$</code>' (end of line) operators in flex consider the <code>\n</code> character
as the only line terminator. This should usually not cause many problems, but you
should be prepared for occurrences of <code>\r</code> or <code>\r\n</code> or one of
the characters <code>\u2028</code>, <code>\u2029</code>, <code>\u000B</code>, <code>\u000C</code>,
or <code>\u0085</code>. They are considered to be line terminators in Unicode and
therefore may not be consumed when
<code>^</code> or <code>$</code> is present in a rule.
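<P>
When porting, it can help to check test inputs for line terminators other than
<code>\n</code>. The following sketch splits a string at every terminator from the list
above using <TT>java.util.regex</TT>. It is a diagnostic aid with invented names,
not a description of how JFlex implements <code>^</code> and <code>$</code>.

```java
import java.util.regex.Pattern;

// Hypothetical helper for inspecting test input; not part of JFlex.
class LineTerminatorDemo {
    // The two-character sequence CR LF comes first so that it counts as one terminator.
    static final Pattern LINE_END =
        Pattern.compile("\\r\\n|[\\n\\r\\u2028\\u2029\\u000B\\u000C\\u0085]");

    /** Splits text at every line terminator from the list, keeping empty lines. */
    static String[] splitLines(String text) {
        return LINE_END.split(text, -1);
    }

    public static void main(String[] args) {
        String text = "first\r\nsecond\u0085third\nfourth";
        String[] lines = splitLines(text);
        System.out.println(lines.length + " lines");
        for (String line : lines) {
            System.out.println("[" + line + "]");
        }
    }
}
```

<P>
A flex scanner that only treats <code>\n</code> as a line end would see two lines in this
sample text, whereas the list above yields four.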
3375<P>
3376
3377<H1><A NAME="SECTION00090000000000000000"></A><A NAME="WorkingTog"></A><BR>
3378Working together
3379</H1>
3380
3381<P>
3382
3383<H2><A NAME="SECTION00091000000000000000"></A><A NAME="CUPWork"></A><BR>
3384JFlex and CUP
3385</H2>
One of the main design goals of JFlex was to make interfacing with the free
Java parser generator CUP [<A
 HREF="manual.html#CUP">8</A>] as easy as possible.
This has been done by giving
the <TT><A HREF="#CupMode">%cup</A></TT> directive a special meaning. An
interface however always has two sides. This section concentrates on the
CUP side of the story.
3393
3394<P>
3395
3396<H3><A NAME="SECTION00091100000000000000">
3397CUP version 0.10j and above</A>
3398</H3>
Since CUP version 0.10j, this has been simplified greatly by the new
CUP scanner interface <TT>java_cup.runtime.Scanner</TT>. JFlex lexers now implement
this interface automatically when the <TT><A HREF="#CupMode">%cup</A></TT>
switch is used. There are no special <TT>parser code</TT>, <TT>init
 code</TT> or <TT>scan with</TT> options any more that you have to provide
in your CUP parser specification. You can just concentrate on your grammar.
3405
3406<P>
If your generated lexer has the class name <TT>Scanner</TT>, the parser
is started from a main program like this:
3409
3410<P>
3411<PRE>
3412...
3413 try {
3414 parser p = new parser(new Scanner(new FileReader(fileName)));
3415 Object result = p.parse().value;
3416 }
3417 catch (Exception e) {
3418...
3419</PRE>
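<P>
The contract behind this can be illustrated without CUP on the class path. The sketch below
uses stand-in types: <TT>TokenSymbol</TT> and <TT>TokenScanner</TT> only imitate
<TT>java_cup.runtime.Symbol</TT> and <TT>java_cup.runtime.Scanner</TT>, and <TT>WordScanner</TT> and
<TT>TokenPump</TT> are hypothetical. The token codes are assumptions as well (CUP-generated symbol
classes normally define their own constants). The point is that the consumer depends on the
interface only and pulls tokens until it sees the end-of-file symbol.

```java
/** Stand-in for java_cup.runtime.Symbol; only the fields used here are modelled. */
final class TokenSymbol {
    final int sym;
    final Object value;

    TokenSymbol(int sym, Object value) {
        this.sym = sym;
        this.value = value;
    }
}

/** Stand-in for java_cup.runtime.Scanner. */
interface TokenScanner {
    TokenSymbol next_token() throws Exception;
}

/** Hypothetical toy lexer: one WORD token per whitespace-separated word, then EOF. */
final class WordScanner implements TokenScanner {
    static final int EOF = 0;  // assumption: plays the role of sym.EOF
    static final int WORD = 2; // hypothetical token code

    private final String[] words;
    private int next = 0;

    WordScanner(String input) {
        String trimmed = input.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public TokenSymbol next_token() {
        return next < words.length
                ? new TokenSymbol(WORD, words[next++])
                : new TokenSymbol(EOF, null);
    }
}

/** Plays the role of the parser: it knows the interface, not the concrete lexer. */
final class TokenPump {
    /** Joins the values of all tokens before EOF with commas. */
    static String drain(TokenScanner scanner) {
        StringBuilder out = new StringBuilder();
        try {
            for (TokenSymbol s = scanner.next_token(); s.sym != WordScanner.EOF; s = scanner.next_token()) {
                if (out.length() > 0) {
                    out.append(',');
                }
                out.append(s.value);
            }
        } catch (Exception e) {
            throw new IllegalStateException("scanner failed", e);
        }
        return out.toString();
    }
}
```

<P>
Any class with a matching <TT>next_token()</TT> method can be handed to <TT>TokenPump.drain</TT>,
which is the same kind of decoupling the real scanner interface gives a CUP parser.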
3420
3421<P>
3422
3423<H3><A NAME="SECTION00091200000000000000">
3424Custom symbol interface</A>
3425</H3>
If you have used the <TT>-symbol</TT> command line switch of CUP to change
the name of the generated symbol interface, you have to tell JFlex about
this change of interface so that correct end-of-file code is generated.
You can do so either by using an <code>%eofval{</code> directive or by using
an <TT>&lt;&lt;EOF&gt;&gt;</TT> rule.
3431
3432<P>
If your new symbol interface is called <TT>mysym</TT>, for example, the
corresponding code in the JFlex specification would be either
3435
3436<P>
3437
3438<PRE>
3439%eofval{
3440 return mysym.EOF;
3441%eofval}
3442</PRE>
3443
3444<P>
3445in the macro/directives section of the spec, or it would be
3446
3447<P>
3448
3449<PRE>
3450 &lt;&lt;EOF&gt;&gt; { return mysym.EOF; }
3451</PRE>
3452
3453<P>
3454in the rules section of your spec.
3455
3456<P>
3457
3458<H3><A NAME="SECTION00091300000000000000">
3459Using existing JFlex/CUP specifications with CUP 0.10j</A>
3460</H3>
3461If you already have an existing specification and you would like to upgrade
3462both JFlex and CUP to their newest version, you will probably have to adjust
3463your specification.
3464
3465<P>
The main difference between the <TT><A HREF="#CupMode">%cup</A></TT> switch in
JFlex 1.2.1 and lower and the one in the current JFlex version is that JFlex scanners
now automatically implement the <TT>java_cup.runtime.Scanner</TT> interface.
This means that the scanning function changes its name from <TT>yylex()</TT>
to <TT>next_token()</TT>.
3471
3472<P>
The main difference between older CUP versions and 0.10j is that CUP-generated parsers now
have a default constructor that accepts a <TT>java_cup.runtime.Scanner</TT>
as argument and uses this scanner by
default (so no <TT>scan with</TT> code is necessary any more).
3477
3478<P>
3479If you have an existing CUP specification, it will probably look somewhat like this:
3480<PRE>
3481parser code {:
3482 Lexer lexer;
3483
3484 public parser (java.io.Reader input) {
3485 lexer = new Lexer(input);
3486 }
3487:};
3488
3489scan with {: return lexer.yylex(); :};
3490</PRE>
3491
3492<P>
3493To upgrade to CUP 0.10j, you could change it to look like this:
3494<PRE>
3495parser code {:
3496 public parser (java.io.Reader input) {
3497 super(new Lexer(input));
3498 }
3499:};
3500</PRE>
3501
3502<P>
If you do not mind changing the method that calls the parser,
you could remove the constructor entirely (and, if there is nothing else
in it, the whole <TT>parser code</TT> section as well). The calling
main procedure would then construct the parser as shown in the section above.
3507
3508<P>
3509The JFlex specification does not need to be changed.
3510
3511<P>
3512
3513<H3><A NAME="SECTION00091400000000000000">
3514Using older versions of CUP</A>
3515</H3>
For people who like or have to use older versions of CUP, the following section
explains ``the old way''. Please note that the standard name of the scanning
function with the <TT><A HREF="#CupMode">%cup</A></TT> switch is not
<TT>yylex()</TT>, but <TT>next_token()</TT>.
3520
3521<P>
3522If you have a scanner specification that begins like this:
3523
3524<P>
3525<PRE>
3526package PACKAGE;
3527import java_cup.runtime.*; /* this is convenience, but not necessary */
3528
3529%%
3530
3531%class Lexer
3532%cup
3533..
3534</PRE>
3535
3536<P>
3537then it matches a CUP specification starting like
3538
3539<P>
3540<PRE>
3541package PACKAGE;
3542
3543parser code {:
3544 Lexer lexer;
3545
3546 public parser (java.io.Reader input) {
3547 lexer = new Lexer(input);
3548 }
3549:};
3550
3551scan with {: return lexer.next_token(); :};
3552
3553..
3554</PRE>
3555
3556<P>
3557This assumes that the generated parser will get the name <TT>parser</TT>.
3558If it doesn't, you have to adjust the constructor name.
3559
3560<P>
3561The parser can then be started in a main routine like this:
3562
3563<P>
3564<PRE>
3565..
3566 try {
3567 parser p = new parser(new FileReader(fileName));
3568 Object result = p.parse().value;
3569 }
3570 catch (Exception e) {
3571..
3572</PRE>
3573
3574<P>
3575If you want the parser specification to be independent of the name of the generated
3576scanner, you can instead write an interface <TT>Lexer</TT>
3577
3578<P>
3579<PRE>
3580public interface Lexer {
3581 public java_cup.runtime.Symbol next_token() throws java.io.IOException;
3582}
3583</PRE>
3584
3585<P>
3586change the parser code to:
3587
3588<P>
3589<PRE>
3590package PACKAGE;
3591
3592parser code {:
3593 Lexer lexer;
3594
3595 public parser (Lexer lexer) {
3596 this.lexer = lexer;
3597 }
3598:};
3599
3600scan with {: return lexer.next_token(); :};
3601
3602..
3603</PRE>
3604
3605<P>
3606tell JFlex about the lexer
3607interface using the <TT>%implements</TT>
3608directive:
3609
3610<P>
3611<PRE>
3612..
3613%class Scanner /* not Lexer now since that is our interface! */
3614%implements Lexer
3615%cup
3616..
3617</PRE>
3618
3619<P>
3620and finally change the main routine to look like
3621
3622<P>
3623<PRE>
3624...
3625 try {
3626 parser p = new parser(new Scanner(new FileReader(fileName)));
3627 Object result = p.parse().value;
3628 }
3629 catch (Exception e) {
3630...
3631</PRE>
3632
3633<P>
If you want to improve the error messages that CUP-generated parsers
produce, you can also override the methods <TT>report_error</TT> and <TT>report_fatal_error</TT>
in the ``parser code'' section of the CUP specification. The new methods
could, for instance, use <TT>yyline</TT> and <TT>yycolumn</TT> (stored in
the <TT>left</TT> and <TT>right</TT> members of class <TT>java_cup.runtime.Symbol</TT>)
to report error positions more conveniently for the user. The lexer and
parser for the Java language in the <TT>examples/java</TT> directory of the
JFlex distribution use this style of error reporting. These specifications
also demonstrate the techniques above in action.
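<P>
As a rough illustration of such an override, the sketch below builds the message text only.
It is not the code from <TT>examples/java</TT>: <TT>PositionedSymbol</TT> is a stand-in for
<TT>java_cup.runtime.Symbol</TT>, <TT>ErrorMessages.describe</TT> is a hypothetical helper that a
<TT>report_error(String, Object)</TT> override could call, and treating negative positions as
``unknown'' is an assumption. Since <TT>yyline</TT> and <TT>yycolumn</TT> count from 0, the
sketch adds 1 to both.

```java
/** Stand-in for java_cup.runtime.Symbol: left and right carry yyline and yycolumn. */
final class PositionedSymbol {
    final int sym;
    final int left;  // yyline of the token, counted from 0
    final int right; // yycolumn of the token, counted from 0

    PositionedSymbol(int sym, int left, int right) {
        this.sym = sym;
        this.left = left;
        this.right = right;
    }
}

/** Hypothetical helper that a report_error override could delegate to. */
final class ErrorMessages {
    private ErrorMessages() {}

    /** Appends a 1-based line and column when info is a symbol with a known position. */
    static String describe(String message, Object info) {
        if (info instanceof PositionedSymbol) {
            PositionedSymbol s = (PositionedSymbol) info;
            // assumption: negative values mean that no position was recorded
            if (s.left >= 0 && s.right >= 0) {
                return message + " at line " + (s.left + 1) + ", column " + (s.right + 1);
            }
        }
        return message;
    }
}
```

<P>
For this to work in a real specification, the lexer actions have to pass <TT>yyline</TT> and
<TT>yycolumn</TT> to the symbol constructor when they create each token.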
3643
3644<P>
3645
3646<H2><A NAME="SECTION00092000000000000000"></A><A NAME="YaccWork"></A><BR>
3647JFlex and BYacc/J
3648</H2>
3649
3650<P>
3651JFlex has built-in support for the Java extension
3652<A NAME="tex2html8"
3653 HREF="http://byaccj.sourceforge.net/">BYacc/J</A>
3654[<A
3655 HREF="manual.html#BYaccJ">9</A>] by Bob Jamison
3656to the classical Berkeley Yacc parser generator.
3657This section describes how to interface BYacc/J with JFlex. It
3658builds on many helpful suggestions and comments from Larry Bell.
3659
3660<P>
3661Since Yacc's architecture is a bit different from CUP's, the
3662interface setup also works in a slightly different manner.
3663BYacc/J expects a function <TT>int yylex()</TT> in the parser
3664class that returns each next token. Semantic values are expected
3665in a field <TT>yylval</TT> of type <TT>parserval</TT> where ``<TT>parser</TT>''
3666is the name of the generated parser class.
3667
3668<P>
For a small calculator example, one could use a setup like the
following on the JFlex side:
3671
3672<P>
3673<PRE>
3674%%
3675
3676%byaccj
3677
3678%{
3679 /* store a reference to the parser object */
3680 private parser yyparser;
3681
3682 /* constructor taking an additional parser object */
3683 public Yylex(java.io.Reader r, parser yyparser) {
3684 this(r);
3685 this.yyparser = yyparser;
3686 }
3687%}
3688
3689NUM = [0-9]+ ("." [0-9]+)?
3690NL = \n | \r | \r\n
3691
3692%%
3693
3694/* operators */
3695"+" |
3696..
3697"(" |
3698")" { return (int) yycharat(0); }
3699
3700/* newline */
3701{NL} { return parser.NL; }
3702
3703/* float */
3704{NUM} { yyparser.yylval = new parserval(Double.parseDouble(yytext()));
3705 return parser.NUM; }
3706</PRE>
3707
3708<P>
3709The lexer expects a reference to the parser in its constructor.
3710Since Yacc allows direct use of terminal characters like <TT>'+'</TT>
3711in its specifications, we just return the character code for
3712single char matches (e.g. the operators in the example). Symbolic
3713token names are stored as <TT>public static int</TT> constants in
3714the generated parser class. They are used as in the <TT>NL</TT> token
3715above. Finally, for some tokens, a semantic value may have to be
3716communicated to the parser. The <TT>NUM</TT> rule demonstrates that
3717bit.
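<P>
The two conventions described above (character codes for single-character tokens, a symbolic
constant plus <TT>yylval</TT> for everything else) can be tried out in isolation. In the
sketch below all classes are stand-ins: <TT>NumberValue</TT> imitates the generated
<TT>parserval</TT> class, <TT>CalcParserStub</TT> imitates the generated parser, and the value
257 for <TT>NUM</TT> is an assumption about how token codes are numbered.

```java
/** Stand-in for the generated parser value class; only the double field is modelled. */
final class NumberValue {
    final double dval;

    NumberValue(double dval) {
        this.dval = dval;
    }
}

/** Stand-in for the generated parser: one token constant and the yylval field. */
final class CalcParserStub {
    static final int NUM = 257; // assumption: symbolic tokens lie above the character range
    NumberValue yylval;
}

/** Hypothetical hand-written counterpart of the lexer actions, for one lexeme at a time. */
final class CalcTokens {
    private CalcTokens() {}

    /** Returns the token code for a lexeme and stores a semantic value for numbers. */
    static int classify(String lexeme, CalcParserStub parser) {
        if (lexeme.length() == 1 && "+-*/^()".indexOf(lexeme.charAt(0)) >= 0) {
            return (int) lexeme.charAt(0); // the character code itself is the token
        }
        // anything else is treated as a number; parseDouble throws on malformed input
        parser.yylval = new NumberValue(Double.parseDouble(lexeme));
        return CalcParserStub.NUM;
    }
}
```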
3718
3719<P>
3720A matching BYacc/J parser specification could look like this:
3721<PRE>
3722%{
3723 import java.io.*;
3724%}
3725
3726%token NL /* newline */
3727%token &lt;dval&gt; NUM /* a number */
3728
3729%type &lt;dval&gt; exp
3730
3731%left '-' '+'
3732..
3733%right '^' /* exponentiation */
3734
3735%%
3736
3737..
3738
3739exp: NUM { $$ = $1; }
3740 | exp '+' exp { $$ = $1 + $3; }
3741 ..
3742 | exp '^' exp { $$ = Math.pow($1, $3); }
3743 | '(' exp ')' { $$ = $2; }
3744 ;
3745
3746%%
3747 /* a reference to the lexer object */
3748 private Yylex lexer;
3749
3750 /* interface to the lexer */
3751 private int yylex () {
3752 int yyl_return = -1;
3753 try {
3754 yyl_return = lexer.yylex();
3755 }
3756 catch (IOException e) {
3757 System.err.println("IO error :"+e);
3758 }
3759 return yyl_return;
3760 }
3761
3762 /* error reporting */
3763 public void yyerror (String error) {
3764 System.err.println ("Error: " + error);
3765 }
3766
3767 /* lexer is created in the constructor */
3768 public parser(Reader r) {
3769 lexer = new Yylex(r, this);
3770 }
3771
3772 /* that's how you use the parser */
3773 public static void main(String args[]) throws IOException {
3774 parser yyparser = new parser(new FileReader(args[0]));
3775 yyparser.yyparse();
3776 }
3777</PRE>
3778
3779<P>
3780Here, the customised part is mostly in the user code section:
3781We create the lexer in the constructor of the parser and store
3782a reference to it for later use in the parser's <TT>int yylex()</TT>
3783method. This <TT>yylex</TT> in the parser only calls <TT>int yylex()</TT>
3784of the generated lexer and passes the result on. If something goes
3785wrong, it returns -1 to indicate an error.
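<P>
The delegation pattern of the parser's <TT>yylex()</TT> method can be sketched on its own,
assuming stand-in types: <TT>TokenLexer</TT> imitates the scanning method of the generated
lexer, and <TT>LexerAdapter</TT> is a hypothetical name for the parser-side wrapper.

```java
import java.io.IOException;

/** Stand-in for the JFlex-generated lexer: only the scanning method is modelled. */
interface TokenLexer {
    int yylex() throws IOException;
}

/** Hypothetical parser-side adapter mirroring the yylex() wrapper shown above. */
final class LexerAdapter {
    private final TokenLexer lexer;

    LexerAdapter(TokenLexer lexer) {
        this.lexer = lexer;
    }

    /** Passes the lexer's token on, or returns -1 if reading the input failed. */
    int yylex() {
        try {
            return lexer.yylex();
        } catch (IOException e) {
            System.err.println("IO error: " + e);
            return -1;
        }
    }
}
```

<P>
Catching the exception here keeps the checked <TT>IOException</TT> out of the signature that the
generated parser code calls.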
3786
3787<P>
3788Runnable versions of the specifications above
3789are located in the <TT>examples/byaccj</TT> directory of the JFlex
3790distribution.
3791
3792<P>
3793
3794<H1><A NAME="SECTION000100000000000000000"></A><A NAME="Bugs"></A><BR>
3795Bugs and Deficiencies
3796</H1>
3797
3798<P>
3799
3800<H2><A NAME="SECTION000101000000000000000">
3801Deficiencies</A>
3802</H2>
Unicode matching does not fully conform to the relevant current Unicode report. Instead, the Unicode support in JFlex is the one native to Java. That means that only 16-bit code points are supported and that most Unicode character classes are not directly supported (although they can be custom-defined in macros). The Java 5 development version of JFlex contains better support for Unicode, as will the next major release.
3804
3805<P>
3806
3807<H2><A NAME="SECTION000102000000000000000">
3808Bugs</A>
3809</H2>
3810As of January 31, 2009, no bugs have been reported for JFlex version 1.4.3. All
3811bugs reported for earlier versions have been fixed.
3812
3813<P>
3814If you find new problems, please use the bugs section of the
3815<A NAME="tex2html9"
3816 HREF="http://www.jflex.de/">JFlex web site</A>
3817to report them.
3818
3819<P>
3820
3821<H1><A NAME="SECTION000110000000000000000"></A><A NAME="Copyright"></A><BR>
3822Copying and License
3823</H1>
3824JFlex is free software, published under the terms of the
3825<A NAME="tex2html10"
3826 HREF="http://www.fsf.org/copyleft/gpl.html">GNU General Public License</A>.
3827
3828<P>
3829There is absolutely NO WARRANTY for JFlex, its code and its documentation.
3830
3831<P>
3832The code generated by JFlex inherits the copyright of the specification it
3833was produced from. If it was your specification, you may use the generated
3834code without restriction.
3835
3836<P>
3837See the file <A NAME="tex2html11"
3838 HREF="COPYRIGHT"><TT>COPYRIGHT</TT></A>
3839for more information.
3840
3841<P>
3842
3843<H2><A NAME="SECTION000120000000000000000"></A><A NAME="References"></A><BR>
3844Bibliography
3845</H2><DL COMPACT><DD>
3846
3847<P>
3848<P></P><DT><A NAME="Aho">1</A>
3849<DD>
3850 A.&nbsp;Aho, R.&nbsp;Sethi, J.&nbsp;Ullman, <EM>Compilers: Principles, Techniques, and Tools</EM>, 1986
3851
3852<P>
3853<P></P><DT><A NAME="Appel">2</A>
3854<DD>
3855 A.&nbsp;W.&nbsp;Appel, <EM>Modern Compiler Implementation in Java: basic techniques</EM>, 1997
3856
3857<P>
3858<P></P><DT><A NAME="JLex">3</A>
3859<DD>
3860 E.&nbsp;Berk, <EM>JLex: A lexical analyser generator for Java</EM>,
3861<BR> <A NAME="tex2html12"
3862 HREF="http://www.cs.princeton.edu/~appel/modern/java/JLex/"><TT>http://www.cs.princeton.edu/~appel/modern/java/JLex/</TT></A>
3863<P>
3864<P></P><DT><A NAME="fast">4</A>
3865<DD>
 K.&nbsp;Brouwer, W.&nbsp;Gellerich, E.&nbsp;Ploedereder,
3867 <EM>Myths and Facts about the Efficient Implementation of Finite Automata and Lexical Analysis</EM>,
3868 in: Proceedings of the 7th International Conference on Compiler Construction (CC '98), 1998
3869
3870<P>
3871<P></P><DT><A NAME="unicode_rep">5</A>
3872<DD>
3873 M.&nbsp;Davis, <EM>Unicode Regular Expression Guidelines</EM>, Unicode Technical Report #18, 2000
3874<BR> <A NAME="tex2html13"
3875 HREF="http://www.unicode.org/unicode/reports/tr18/tr18-5.1.html"><TT>http://www.unicode.org/unicode/reports/tr18/tr18-5.1.html</TT></A>
3876<P>
3877<P></P><DT><A NAME="ParseTable">6</A>
3878<DD>
3879 P.&nbsp;Dencker, K.&nbsp;D&#252;rre, J.&nbsp;Henft, <EM>Optimization of Parser Tables for portable Compilers</EM>,
3880 in: ACM Transactions on Programming Languages and Systems 6(4), 1984
3881
3882<P>
3883<P></P><DT><A NAME="LangSpec">7</A>
3884<DD>
 J.&nbsp;Gosling, B.&nbsp;Joy, G.&nbsp;Steele, <EM>The Java Language Specification</EM>, 1996,
3886<BR> <A NAME="tex2html14"
3887 HREF="http://java.sun.com/docs/books/jls/"><TT>http://java.sun.com/docs/books/jls/</TT></A>
3888<P>
3889<P></P><DT><A NAME="CUP">8</A>
3890<DD>
3891 S.&nbsp;E.&nbsp;Hudson, <EM>CUP LALR Parser Generator for Java</EM>,
3892<BR> <A NAME="tex2html15"
3893 HREF="http://www.cs.princeton.edu/~appel/modern/java/CUP/"><TT>http://www.cs.princeton.edu/~appel/modern/java/CUP/</TT></A>
3894<P>
3895<P></P><DT><A NAME="BYaccJ">9</A>
3896<DD>
3897 B.&nbsp;Jamison, <EM>BYacc/J</EM>,
3898<BR> <A NAME="tex2html16"
3899 HREF="http://byaccj.sourceforge.net"><TT>http://byaccj.sourceforge.net/</TT></A>
3900<P>
3901<P></P><DT><A NAME="MachineSpec">10</A>
3902<DD>
3903 T.&nbsp;Lindholm, F.&nbsp;Yellin, <EM>The Java Virtual Machine Specification</EM>, 1996,
3904<BR> <A NAME="tex2html17"
3905 HREF="http://java.sun.com/docs/books/vmspec/"><TT>http://java.sun.com/docs/books/vmspec/</TT></A>
3906<P>
3907<P></P><DT><A NAME="flex">11</A>
3908<DD>
3909 V.&nbsp;Paxson, <EM>flex - The fast lexical analyzer generator</EM>, 1995
3910
3911<P>
3912<P></P><DT><A NAME="SparseTable">12</A>
3913<DD>
3914 R.&nbsp;E. Tarjan, A.&nbsp;Yao, <EM>Storing a Sparse Table</EM>, in: Communications of the ACM 22(11), 1979
3915
3916<P>
3917<P></P><DT><A NAME="Maurer">13</A>
3918<DD>
3919 R.&nbsp;Wilhelm, D.&nbsp;Maurer, <EM>&#220;bersetzerbau</EM>, Berlin 1997<SUP>2</SUP>
3921
3922<P>
3923</DL>
3924
3925<P>
3926<BR><HR><H4>Footnotes</H4>
3927<DL>
3928<DT><A NAME="foot33">... Java</A><A
3929 HREF="manual.html#tex2html2"><SUP><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="footnote.png"></SUP></A></DT>
3930<DD>Java is a trademark of
3931Sun Microsystems, Inc., and refers to Sun's Java programming language.
3932JFlex is not sponsored by or affiliated with Sun Microsystems, Inc.
3933
3934</DD>
3935</DL><BR><HR>
3936<ADDRESS>
3937Sat 31 Jan 2009 23:43:28 EST, <a href="http://www.doclsf.de">Gerwin Klein</a>
3938</ADDRESS>
3939</BODY>
3940</HTML>