manual.html@ 25584

Last change on this file since 25584 was 25584, checked in by davidb, 12 years ago
Initial cut an a text edit area for GLI that supports color syntax highlighting
File size: 128.8 KB

1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2
3	<!--Converted with LaTeX2HTML 2002-2-1 (1.71)
4	original version by: Nikos Drakos, CBLU, University of Leeds
5	* revised and updated by: Marcus Hennecke, Ross Moore, Herb Swan
6	* with significant contributions from:
7	Jens Lippmann, Marek Rouchal, Martin Wilck and others -->
8	<HTML>
9	<HEAD>
10	<TITLE>JFlex User's Manual</TITLE>
11	<META NAME="description" CONTENT="JFlex User's Manual">
12	<META NAME="keywords" CONTENT="manual">
13	<META NAME="resource-type" CONTENT="document">
14	<META NAME="distribution" CONTENT="global">
15
16	<META NAME="Generator" CONTENT="LaTeX2HTML v2002-2-1">
17	<META HTTP-EQUIV="Content-Style-Type" CONTENT="text/css">
18
19	<LINK REL="STYLESHEET" HREF="manual.css">
20
21	</HEAD>
22
23	<BODY >
24
25	<P>
26
27	<CENTER>
28	<A NAME="TOP"></a>
29	<A HREF="http://www.jflex.de"><IMG SRC="logo.png" BORDER=0 HEIGHT=223 WIDTH=577></a>
30	</CENTER>
31
32	<P>
33	<DIV ALIGN="CENTER">
34	<I><FONT SIZE="+2">The Fast Lexical Analyser Generator</FONT>
35	<BR></I></DIV>
36	<P></P>
37	<DIV ALIGN="CENTER"></DIV>
38	<P></P>
39	<DIV ALIGN="CENTER"><I>Copyright ©1998-2009 by <A NAME="tex2html1"
40	HREF="http://www.doclsf.de">Gerwin Klein</A>
41	<BR></I></DIV>
42	<P><P><BR>
43	<DIV ALIGN="CENTER"><I><FONT SIZE="+4"><I><B>JFlex User's Manual</B></I></FONT>
44	<BR></I></DIV>
45	<P><P><BR>
46	<DIV ALIGN="CENTER"><I>Version 1.4.3, January 31, 2009
47
48	</I></DIV>
49
50	<P>
51	<BR>
52
53	<H2><A NAME="SECTION00010000000000000000">
54	Contents</A>
55	</H2>
56	<!--Table of Contents-->
57
58	<UL>
59	<LI><A NAME="tex2html79"
60	HREF="manual.html#SECTION00020000000000000000">Introduction</A>
61	<UL>
62	<LI><A NAME="tex2html80"
63	HREF="manual.html#SECTION00021000000000000000">Design goals</A>
64	<LI><A NAME="tex2html81"
65	HREF="manual.html#SECTION00022000000000000000">About this manual</A>
66	</UL><BR>
67	<LI><A NAME="tex2html82"
68	HREF="manual.html#SECTION00030000000000000000">Installing and Running JFlex</A>
69	<UL>
70	<LI><A NAME="tex2html83"
71	HREF="manual.html#SECTION00031000000000000000">Installing JFlex</A>
72	<LI><A NAME="tex2html84"
73	HREF="manual.html#SECTION00032000000000000000">Running JFlex</A>
74	</UL><BR>
75	<LI><A NAME="tex2html85"
76	HREF="manual.html#SECTION00040000000000000000">A simple Example: How to work with JFlex</A>
77	<UL>
78	<LI><A NAME="tex2html86"
79	HREF="manual.html#SECTION00041000000000000000">Code to include</A>
80	<LI><A NAME="tex2html87"
81	HREF="manual.html#SECTION00042000000000000000">Options and Macros</A>
82	<LI><A NAME="tex2html88"
83	HREF="manual.html#SECTION00043000000000000000">Rules and Actions</A>
84	<LI><A NAME="tex2html89"
85	HREF="manual.html#SECTION00044000000000000000">How to get it going</A>
86	</UL><BR>
87	<LI><A NAME="tex2html90"
88	HREF="manual.html#SECTION00050000000000000000">Lexical Specifications</A>
89	<UL>
90	<LI><A NAME="tex2html91"
91	HREF="manual.html#SECTION00051000000000000000">User code</A>
92	<LI><A NAME="tex2html92"
93	HREF="manual.html#SECTION00052000000000000000">Options and declarations</A>
94	<LI><A NAME="tex2html93"
95	HREF="manual.html#SECTION00053000000000000000">Lexical rules</A>
96	</UL><BR>
97	<LI><A NAME="tex2html94"
98	HREF="manual.html#SECTION00060000000000000000">Encodings, Platforms, and Unicode</A>
99	<UL>
100	<LI><A NAME="tex2html95"
101	HREF="manual.html#SECTION00061000000000000000">The Problem</A>
102	<LI><A NAME="tex2html96"
103	HREF="manual.html#SECTION00062000000000000000">Scanning text files</A>
104	<LI><A NAME="tex2html97"
105	HREF="manual.html#SECTION00063000000000000000">Scanning binaries</A>
106	</UL><BR>
107	<LI><A NAME="tex2html98"
108	HREF="manual.html#SECTION00070000000000000000">A few words on performance</A>
109	<UL>
110	<LI><A NAME="tex2html99"
111	HREF="manual.html#SECTION00071000000000000000">Comparison of JLex and JFlex</A>
112	<LI><A NAME="tex2html100"
113	HREF="manual.html#SECTION00072000000000000000">How to write a faster specification</A>
114	</UL><BR>
115	<LI><A NAME="tex2html101"
116	HREF="manual.html#SECTION00080000000000000000">Porting Issues</A>
117	<UL>
118	<LI><A NAME="tex2html102"
119	HREF="manual.html#SECTION00081000000000000000">Porting from JLex</A>
120	<LI><A NAME="tex2html103"
121	HREF="manual.html#SECTION00082000000000000000">Porting from lex/flex</A>
122	</UL><BR>
123	<LI><A NAME="tex2html104"
124	HREF="manual.html#SECTION00090000000000000000">Working together</A>
125	<UL>
126	<LI><A NAME="tex2html105"
127	HREF="manual.html#SECTION00091000000000000000">JFlex and CUP</A>
128	<LI><A NAME="tex2html106"
129	HREF="manual.html#SECTION00092000000000000000">JFlex and BYacc/J</A>
130	</UL><BR>
131	<LI><A NAME="tex2html107"
132	HREF="manual.html#SECTION000100000000000000000">Bugs and Deficiencies</A>
133	<UL>
134	<LI><A NAME="tex2html108"
135	HREF="manual.html#SECTION000101000000000000000">Deficiencies</A>
136	<LI><A NAME="tex2html109"
137	HREF="manual.html#SECTION000102000000000000000">Bugs</A>
138	</UL><BR>
139	<LI><A NAME="tex2html110"
140	HREF="manual.html#SECTION000110000000000000000">Copying and License</A>
141	<LI><A NAME="tex2html111"
142	HREF="manual.html#SECTION000120000000000000000">Bibliography</A>
143	</UL>
144	<!--End of Table of Contents-->
145
146	<H1><A NAME="SECTION00020000000000000000"></A><A NAME="Intro"></A><BR>
147	Introduction
148	</H1>
149	JFlex is a lexical analyser generator for Java<A NAME="tex2html2"
150	HREF="#foot33"><SUP><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="footnote.png"></SUP></A>, written in Java. It is also a rewrite of the very useful tool JLex [<A
151	HREF="manual.html#JLex">3</A>], which
152	was developed by Elliot Berk at Princeton University. As Vern Paxson states
153	for his C/C++ tool flex [<A
154	HREF="manual.html#flex">11</A>], though, the two do not share any code.
155
156	<P>
157
158	<H2><A NAME="SECTION00021000000000000000">
159	Design goals</A>
160	</H2>
161	The main design goals of JFlex are:
162
163	<UL>
164	<LI><B>Full Unicode support</B>
165	</LI>
166	<LI><B>Fast generated scanners </B>
167	</LI>
168	<LI><B>Fast scanner generation</B>
169	</LI>
170	<LI><B>Convenient specification syntax</B>
171	</LI>
172	<LI><B>Platform independence</B>
173	</LI>
174	<LI><B>JLex compatibility</B>
175	</LI>
176	</UL>
177
178	<P>
179
180	<H2><A NAME="SECTION00022000000000000000">
181	About this manual</A>
182	</H2>
183	This manual gives a brief but complete description of the tool JFlex. It
184	assumes that you are familiar with the issue of lexical analysis. The references [<A
185	HREF="manual.html#Aho">1</A>],
186	[<A
187	HREF="manual.html#Appel">2</A>], and [<A
188	HREF="manual.html#Maurer">13</A>] provide a good introduction to this topic.
189
190	<P>
191	The next section of this manual describes <A HREF="#Installing"><I>installation procedures</I></A>
192	for JFlex. If you have never worked with JLex, or
193	just want to compare a JLex and a JFlex scanner specification, you
194	should also read <A HREF="#Example"><I>Working with JFlex - an example</I></A>
195	(section <A HREF="#Example">3</A>). All options and the complete
196	specification syntax are presented in
197	<A HREF="#Specifications"><I>Lexical specifications</I></A> (section <A HREF="#Specifications">4</A>);
198	<A HREF="#sec:encodings"><I>Encodings, Platforms, and Unicode</I></A> (section <A HREF="#sec:encodings">5</A>)
199	provides information about scanning text vs. binary files.
200	If you are interested in performance
201	considerations and comparing JLex with JFlex speed,
202	<A HREF="#performance"><I>a few words on performance</I></A> (section <A HREF="#performance">6</A>)
203	might be just right for you. Those who want to
204	use their old JLex specifications may want to check out section <A HREF="#Porting">7.1</A>
205	<A HREF="#Porting"><I>Porting from JLex</I></A> to avoid possible problems
206	with non-portable or non-standard JLex behaviour that has been fixed in
207	JFlex. Section <A HREF="#lexport">7.2</A> talks about porting scanners from the
208	Unix tools lex and flex. Interfacing JFlex scanners with the LALR
209	parser generators CUP and BYacc/J is explained in <A HREF="#WorkingTog"><I>working
210	together</I></A> (section <A HREF="#WorkingTog">8</A>). Section <A HREF="#Bugs">9</A>
211	<A HREF="#Bugs"><I>Bugs</I></A> gives a list of currently known active bugs.
212	The manual concludes with notes about
213	<A HREF="#Copyright"><I>Copying and License</I></A> (section <A HREF="#Copyright">10</A>) and
214	<A HREF="#References">references</A>.
215
216	<P>
217
218	<H1><A NAME="SECTION00030000000000000000"></A><A NAME="Installing"></A><BR>
219	Installing and Running JFlex
220	</H1>
221
222	<P>
223
224	<H2><A NAME="SECTION00031000000000000000">
225	Installing JFlex</A>
226	</H2>
227
228	<P>
229
230	<H3><A NAME="SECTION00031100000000000000"></A><A NAME="install:windows"></A><BR>
231	Windows
232	</H3>
233	To install JFlex on Windows 95/98/NT/XP, follow these three steps:
234
235	<OL>
236	<LI>Unzip the file you downloaded into the directory you want JFlex in (using
237	something like
238	<A NAME="tex2html3"
239	HREF="http://www.winzip.com">WinZip</A>).
240	If you unzipped it to, say, <code>C:\</code>, the following directory structure
241	should be generated:
242
243	<PRE>
244	C:\JFlex\
245	    +--bin\                   (start scripts)
246	    +--doc\                   (FAQ and manual)
247	    +--examples\
248	        +--binary\            (scanning binary files)
249	        +--byaccj\            (calculator example for BYacc/J)
250	        +--cup\               (calculator example for cup)
251	        +--interpreter\       (interpreter example for cup)
252	        +--java\              (Java lexer specification)
253	        +--simple\            (example scanner)
254	        +--simple-maven\      (example with maven)
255	        +--standalone\        (a simple standalone scanner)
256	        +--standalone-maven\  (above with maven)
257	    +--lib\                   (the precompiled classes)
258	    +--src\
259	        +--JFlex\             (source code of JFlex)
260	        +--JFlex\gui          (source code of JFlex UI classes)
261	        +--java_cup\runtime\  (source code of cup runtime classes)
262	</PRE>
263
264	<P>
265	</LI>
266	<LI>Edit the file <B><code>bin\jflex.bat</code></B>
267	(in the example it's <code>C:\JFlex\bin\jflex.bat</code>)
268	such that
269
270	<P>
271
272	<UL>
273	<LI><B><TT>JAVA_HOME</TT></B> contains the directory where your Java JDK is installed
274	(for instance <code>C:\java</code>) and
275	</LI>
276	<LI><B><TT>JFLEX_HOME</TT></B> the directory that contains JFlex (in the example:
277	<code>C:\JFlex</code>)
278	</LI>
279	</UL>
280
281	<P>
282	</LI>
283	<LI>Include the <code>bin\</code> directory of JFlex in your path
284	(the one that contains the start script, in the example: <code>C:\JFlex\bin</code>).
285	</LI>
286	</OL>
287
288	<P>
289
290	<H3><A NAME="SECTION00031200000000000000">
291	Unix with tar archive</A>
292	</H3>
293
294	<P>
295	To install JFlex on a Unix system, follow these two steps:
296
297	<UL>
298	<LI>Decompress the archive into a directory of your choice
299	with GNU tar, for instance to <TT>/usr/share</TT>:
300
301	<P>
302	<TT>tar -C /usr/share -xvzf jflex-1.4.3.tar.gz</TT>
303
304	<P>
305	(The example is for a site-wide installation, for which you need to
306	be root. User installation works exactly the
307	same way: just choose a directory where you have write
308	permission.)
309
310	<P>
311	</LI>
312	<LI>Make a symbolic link from somewhere in your binary
313	path to <TT>bin/jflex</TT>, for instance:
314
315	<P>
316	<TT>ln -s /usr/share/JFlex/bin/jflex /usr/bin/jflex</TT>
317
318	<P>
319	If the Java interpreter is not in your binary path, you
320	need to supply its location in the script <TT>bin/jflex</TT>.
321	</LI>
322	</UL>
323
324	<P>
325	You can verify the integrity of the downloaded file with
326	the MD5 checksum available on the <A NAME="tex2html4"
327	HREF="http://www.jflex.de/download.html">JFlex download page</A>.
328	If you put the checksum file in the same directory
329	as the archive, run:
330
331	<P>
332	<code>md5sum --check </code><TT>jflex-1.4.3.tar.gz.md5</TT>
333
334	<P>
335	It should tell you
336
337	<P>
338	<TT>jflex-1.4.3.tar.gz: OK</TT>
339
340	<P>
341
342	<H3><A NAME="SECTION00031300000000000000">
343	Linux with RPM</A>
344	</H3>
345
346	<P>
347
348	<UL>
349	<LI>become root
350	</LI>
351	<LI>issue
352	<BR> <TT>rpm -U jflex-1.4.3-0.rpm</TT>
353	</LI>
354	</UL>
355
356	<P>
357	You can verify the integrity of the downloaded <TT>rpm</TT> file with
358
359	<P>
360	<code>rpm --checksig </code><TT>jflex-1.4.3-0.rpm</TT>
361
362	<P>
363
364	<H2><A NAME="SECTION00032000000000000000">
365	Running JFlex</A>
366	</H2>
367	You run JFlex with:
368
369	<P>
370	<TT>jflex <options> <inputfiles></TT>
371
372	<P>
373	It is also possible to skip the start script in <code>bin\</code>
374	and include the file <code>lib\JFlex.jar</code>
375	in your <TT>CLASSPATH</TT> environment variable instead.
376
377	<P>
378	Then you run JFlex with:
379
380	<P>
381	<TT>java JFlex.Main <options> <inputfiles></TT>
382
383	<P>
384	The input files and options are in both cases optional. If you don't provide a file name on
385	the command line, JFlex will pop up a window to ask you for one.
386
387	<P>
388	JFlex knows about the following options:
389
390	<P>
391	<DL>
392	<DT></DT>
393	<DD><code>-d <directory></code>
394	<BR> writes the generated file to the directory <code><directory></code>
395
396	<P>
397	</DD>
398	<DT></DT>
399	<DD><code>--skel <file></code>
400	<BR> uses external skeleton <code><file></code>. This is mainly for JFlex
401	maintenance and special low level customisations. Use only when you
402	know what you are doing! JFlex comes with a skeleton file in the
403	<TT>src</TT> directory that reflects exactly the internal, pre-compiled
404	skeleton and can be used with the <TT>--skel</TT> option.
405
406	<P>
407	</DD>
408	<DT></DT>
409	<DD><code>--nomin</code>
410	<BR> skip the DFA minimisation step during scanner generation.
411
412	<P>
413	</DD>
414	<DT></DT>
415	<DD><code>--jlex</code>
416	<BR> tries even harder to comply with the JLex interpretation of specs.
417
418	<P>
419	</DD>
420	<DT></DT>
421	<DD><code>--dot</code>
422	<BR> generate graphviz dot files for the NFA, DFA and minimised
423	DFA. This feature is still in alpha status, and not
424	fully implemented yet.
425
426	<P>
427	</DD>
428	<DT></DT>
429	<DD><code>--dump</code>
430	<BR> display transition tables of NFA, initial DFA, and minimised DFA
431
432	<P>
433	</DD>
434	<DT></DT>
435	<DD><code>--verbose</code> or <TT>-v</TT>
436	<BR> display generation progress messages (enabled by default)
437
438	<P>
439	</DD>
440	<DT></DT>
441	<DD><code>--quiet</code> or <TT>-q</TT>
442	<BR> display error messages only (no chatter about what JFlex is
443	currently doing)
444
445	<P>
446	</DD>
447	<DT></DT>
448	<DD><code>--time</code>
449	<BR> display time statistics about the code generation process
450	(not very accurate)
451
452	<P>
453	</DD>
454	<DT></DT>
455	<DD><code>--version</code>
456	<BR> print version number
457
458	<P>
459	</DD>
460	<DT></DT>
461	<DD><code>--info</code>
462	<BR> print system and JDK information (useful if you'd like
463	to report a problem)
464
465	<P>
466	</DD>
467	<DT></DT>
468	<DD><code>--pack</code>
469	<BR> use the %pack code generation method by default
470
471	<P>
472	</DD>
473	<DT></DT>
474	<DD><code>--table</code>
475	<BR> use the %table code generation method by default
476
477	<P>
478	</DD>
479	<DT></DT>
480	<DD><code>--switch</code>
481	<BR> use the %switch code generation method by default
482
483	<P>
484	</DD>
485	<DT></DT>
486	<DD><code>--help</code> or <TT>-h</TT>
487	<BR> print a help message explaining options and usage of JFlex.
488	</DD>
489	</DL>
490
491	<P>
492
493	<H1><A NAME="SECTION00040000000000000000"></A><A NAME="Example"></A><BR>
494	A simple Example: How to work with JFlex
495	</H1>
496	To demonstrate what a lexical specification with JFlex looks like, this
497	section presents a part of the specification for the Java language.
498	The example does not describe the whole lexical structure of Java programs,
499	but only a small and simplified part of it (some keywords, some operators,
500	comments and only two kinds of literals). It also shows how to interface
501	with the LALR parser generator CUP [<A
502	HREF="manual.html#CUP">8</A>] and therefore
503	uses a class <TT>sym</TT> (generated by CUP), where integer constants for
504	the terminal tokens of the CUP grammar are declared. JFlex comes with a
505	directory <TT>examples</TT>, where you can find a small standalone scanner
506	that doesn't need other tools like CUP to give you a running example.
507	The "<TT>examples</TT>" directory also contains a <EM>complete</EM> JFlex
508	specification of the lexical structure of Java programs together with the
509	CUP parser specification for Java by
510	<A NAME="tex2html5"
511	HREF="mailto:[email protected]">C. Scott Ananian</A>, obtained
512	from the CUP [<A
513	HREF="manual.html#CUP">8</A>] web site (it was modified to interface with the JFlex scanner).
514	Both specifications adhere to the Java Language Specification [<A
515	HREF="manual.html#LangSpec">7</A>].
516
517	<P>
518	<FONT SIZE="-1"><A NAME="CodeTop"></A></FONT><PRE>
519	/* JFlex example: part of Java language lexer specification */
520	import java_cup.runtime.*;
521
522	/**
523	* This class is a simple example lexer.
524	*/
525	%%
526	</PRE><FONT SIZE="-1">
527	<A NAME="CodeOptions"></A></FONT><PRE>
528	%class Lexer
529	%unicode
530	%cup
531	%line
532	%column
533	</PRE><FONT SIZE="-1">
534	<A NAME="CodeScannerCode"></A></FONT><PRE>
535	%{
536	StringBuffer string = new StringBuffer();
537
538	private Symbol symbol(int type) {
539	return new Symbol(type, yyline, yycolumn);
540	}
541	private Symbol symbol(int type, Object value) {
542	return new Symbol(type, yyline, yycolumn, value);
543	}
544	%}
545	</PRE><FONT SIZE="-1">
546	<A NAME="CodeMacros"></A></FONT><PRE>
547	LineTerminator = \r|\n|\r\n
548	InputCharacter = [^\r\n]
549	WhiteSpace     = {LineTerminator} | [ \t\f]
550
551	/* comments */
552	Comment = {TraditionalComment} | {EndOfLineComment} | {DocumentationComment}
553
554	TraditionalComment   = "/*" [^*] ~"*/" | "/*" "*"+ "/"
555	EndOfLineComment     = "//" {InputCharacter}* {LineTerminator}
556	DocumentationComment = "/**" {CommentContent} "*"+ "/"
557	CommentContent       = ( [^*] | \*+ [^/*] )*
558
559	Identifier = [:jletter:] [:jletterdigit:]*
560
561	DecIntegerLiteral = 0 | [1-9][0-9]*
562	</PRE><FONT SIZE="-1">
563	<A NAME="CodeStateDecl"></A></FONT><PRE>
564	%state STRING
565
566	%%
567	</PRE><FONT SIZE="-1">
568	<A NAME="CodeRulesYYINITIAL"></A></FONT><PRE>
569	/* keywords */
570	<YYINITIAL> "abstract" { return symbol(sym.ABSTRACT); }
571	<YYINITIAL> "boolean" { return symbol(sym.BOOLEAN); }
572	<YYINITIAL> "break" { return symbol(sym.BREAK); }
573	</PRE><FONT SIZE="-1">
574	<A NAME="CodeRulesBunch"></A></FONT><PRE>
575	<YYINITIAL> {
576	/* identifiers */
577	{Identifier} { return symbol(sym.IDENTIFIER); }
578
579	/* literals */
580	{DecIntegerLiteral} { return symbol(sym.INTEGER_LITERAL); }
581	\" { string.setLength(0); yybegin(STRING); }
582
583	/* operators */
584	"=" { return symbol(sym.EQ); }
585	"==" { return symbol(sym.EQEQ); }
586	"+" { return symbol(sym.PLUS); }
587
588	/* comments */
589	{Comment} { /* ignore */ }
590
591	/* whitespace */
592	{WhiteSpace} { /* ignore */ }
593	}
594	</PRE><FONT SIZE="-1">
595	<A NAME="CodeRulesYYtext"></A></FONT><PRE>
596	<STRING> {
597	\" { yybegin(YYINITIAL);
598	return symbol(sym.STRING_LITERAL,
599	string.toString()); }
600	[^\n\r\"\\]+ { string.append( yytext() ); }
601	\\t { string.append('\t'); }
602	\\n { string.append('\n'); }
603
604	\\r { string.append('\r'); }
605	\\\" { string.append('\"'); }
606	\\ { string.append('\\'); }
607	}
608	</PRE><FONT SIZE="-1">
609	<A NAME="CodeRulesAllStates"></A></FONT><PRE>
610	/* error fallback */
611	.|\n { throw new Error("Illegal character <"+
612	yytext()+">"); }
613	</PRE>
614
615	<P>
616	From this specification JFlex generates a <TT>.java</TT> file with one
617	class that contains code for the scanner. The class will have a
618	constructor taking a <TT>java.io.Reader</TT> from which the input is
619	read. The class will also have a function <TT>yylex()</TT> that runs the
620	scanner and that can be used to get the next token from the input (in this
621	example the function actually has the name <TT>next_token()</TT> because
622	the specification uses the <TT>%cup</TT> switch).
623
624	<P>
625	As with JLex, the specification consists of three parts, divided by <TT>%%</TT>:
626
627	<UL>
628	<LI><A HREF="#ExampleUserCode">usercode</A>,
629	</LI>
630	<LI><A HREF="#ExampleOptions">options and declarations</A> and
631	</LI>
632	<LI><A HREF="#ExampleLexRules">lexical rules</A>.
633	</LI>
634	</UL>
635
636	<P>
637
638	<H2><A NAME="SECTION00041000000000000000"></A><A NAME="ExampleUserCode"></A><BR>
639	Code to include
640	</H2>
641	Let's take a look at the first section, ``user code'': The text up to the
642	first line starting with <TT>%%</TT> is copied verbatim to the top
643	of the generated lexer class (before the actual class declaration).
644	Besides <TT>package</TT> and <TT>import</TT> statements there is usually not much
645	to do here. If the code ends with a javadoc class comment, the generated class
646	will get this comment; if not, JFlex will generate one automatically.
647
648	<P>
649
650	<H2><A NAME="SECTION00042000000000000000"></A><A NAME="ExampleOptions"></A><BR>
651	Options and Macros
652	</H2>
653	The second section ``options and declarations'' is more interesting. It consists
654	of a set of options, code that is included inside the generated scanner
655	class, lexical states and macro declarations. Each JFlex option must begin
656	a line of the specification and starts with a <TT>%</TT>. In our example
657	the following options are used:
658
659	<P>
660
661	<UL>
662	<LI><TT><A HREF="#CodeOptions">%class Lexer</A></TT> tells JFlex to give the
663	generated class the name ``Lexer'' and to write the code to a file ``<TT>Lexer.java</TT>''.
664
665	<P>
666	</LI>
667	<LI><TT><A HREF="#CodeOptions">%unicode</A></TT> defines the set of characters the scanner will
668	work on. For scanning text files, <TT>%unicode</TT> should always be used. See also
669	section <A HREF="#sec:encodings">5</A> for more information on character sets, encodings, and
670	scanning text vs. binary files.
671
672	<P>
673	</LI>
674	<LI><TT><A HREF="#CodeOptions">%cup</A></TT> switches to CUP compatibility
675	mode to interface with a CUP generated parser.
676
677	<P>
678	</LI>
679	<LI><TT><A HREF="#CodeOptions">%line</A></TT> switches line counting on (the
680	current line number can be accessed via the variable <TT>yyline</TT>)
681
682	<P>
683	</LI>
684	<LI><TT><A HREF="#CodeOptions">%column</A></TT> switches column counting on
685	(current column is accessed via <TT>yycolumn</TT>)
686
687	<P>
688	</LI>
689	</UL>
690	<A NAME="ExampleScannerCode"></A>
691	<P>
692	The code included in <TT><A HREF="#CodeScannerCode">%{...%}</A></TT>
693	is copied verbatim into the generated lexer class source.
694	Here you can declare member variables and functions that are used
695	inside scanner actions. In our example we declare a <TT>StringBuffer</TT> ``<TT>string</TT>''
696	in which we will store parts of string literals and two helper functions
697	``<TT>symbol</TT>'' that create <TT>java_cup.runtime.Symbol</TT> objects
698	with position information of the current token (see section <A HREF="#CUPWork">8.1</A>
699	<A HREF="#CUPWork"><I>JFlex and CUP</I></A>
700	for how to interface with the parser generator CUP). Like JFlex options, both
701	<code>%{</code> and <code>%}</code> must begin a line.
702	<A NAME="ExampleMacros"></A>
703	<P>
704	The specification continues with macro declarations. Macros are
705	abbreviations for regular expressions, used to make lexical specifications
706	easier to read and understand. A macro declaration
707	consists of a macro identifier followed by <TT>=</TT>, then followed by
708	the regular expression it represents. This regular expression may
709	itself contain macro usages. Although this allows a grammar-like specification
710	style, macros are still just abbreviations and not non-terminals: they
711	cannot be recursive or mutually recursive. Cycles in macro definitions
712	are detected and reported at generation time by JFlex.
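
<P>
The following plain Java sketch illustrates why macros behave as textual
abbreviations and how a cycle can be detected. It is not JFlex's implementation:
the class <TT>MacroExpander</TT> and its methods are hypothetical names invented
for this illustration, and the sketch assumes that a macro usage is written as
<TT>{Name}</TT>.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; hypothetical names, not JFlex's own code.
class MacroExpander {
    // Assumption: a macro usage has the form {Name}.
    private static final Pattern USAGE =
        Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    private final Map<String, String> macros = new LinkedHashMap<>();

    void define(String name, String regex) {
        macros.put(name, regex);
    }

    // Returns the definition of `name` with every macro usage replaced
    // by the (parenthesised) text of the macro it refers to.
    String expand(String name) {
        return expand(name, new ArrayDeque<String>());
    }

    private String expand(String name, Deque<String> inProgress) {
        String body = macros.get(name);
        if (body == null) {
            throw new IllegalArgumentException("Undefined macro: " + name);
        }
        // A macro that is reached again while it is still being expanded
        // is (mutually) recursive, which plain abbreviations cannot be.
        if (inProgress.contains(name)) {
            throw new IllegalStateException("Cycle in macro definition: " + name);
        }
        inProgress.push(name);
        Matcher m = USAGE.matcher(body);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(body, last, m.start());
            out.append('(').append(expand(m.group(1), inProgress)).append(')');
            last = m.end();
        }
        out.append(body.substring(last));
        inProgress.pop();
        return out.toString();
    }

    public static void main(String[] args) {
        MacroExpander e = new MacroExpander();
        e.define("LineTerminator", "\\r|\\n|\\r\\n");
        e.define("WhiteSpace", "{LineTerminator} | [ \\t\\f]");
        System.out.println(e.expand("WhiteSpace"));
    }
}
```

<P>
In this sketch, defining two macros that refer to each other makes
<TT>expand</TT> throw an exception, which mirrors the generation-time
error that JFlex reports for cyclic macro definitions.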
713
714	<P>
715	Here are some of the example macros in more detail:
716
717	<UL>
718	<LI><TT><A HREF="#CodeMacros">LineTerminator</A></TT> stands for the regular
719	expression that matches an ASCII CR, an ASCII LF, or a CR followed by an LF.
720
721	<P>
722	</LI>
723	<LI><TT><A HREF="#CodeMacros">InputCharacter</A></TT> stands for all characters
724	that are not a CR or LF.
725
726	<P>
727	</LI>
728	<LI><TT><A HREF="#CodeMacros">TraditionalComment</A></TT> is the expression
729	that matches the string <TT>"/*"</TT> followed by a character that
730	is not a <TT>*</TT>, followed by anything that does not contain, but
731	ends in <TT>"*/"</TT>. As this would not match comments like
732	<TT>/***/</TT>, we add <TT>"/*"</TT> followed by an arbitrary
733	number (at least one) of <TT>"*"</TT> followed by the closing
734	<TT>"/"</TT>. This is not the only expression matching non-nesting
735	Java comments, but it is one of the simpler ones. It is tempting to
736	just write something like the expression <TT>"/*" .* "*/"</TT>, but
737	this would match more than we want. It would for instance match the
738	whole of <TT>/* */ x = 0; /* */</TT>, instead of two comments and
739	four real tokens. See DocumentationComment and CommentContent for an
740	alternative.
741
742	<P>
743	</LI>
744	<LI><TT><A HREF="#CodeMacros">CommentContent</A></TT> matches zero or more
745	occurrences of any character except a <TT>*</TT> or any number of
746	<TT>*</TT> followed by a character that is not a <TT>/</TT>
747
748	<P>
749	</LI>
750	<LI><TT><A HREF="#CodeMacros">Identifier</A></TT> matches each string that
751	starts with a character of class <TT>jletter</TT> followed by zero or more characters
752	of class <TT>jletterdigit</TT>. <TT>jletter</TT> and <TT>jletterdigit</TT>
753	are predefined character classes. <TT>jletter</TT> includes all characters for which
754	the Java function <TT>Character.isJavaIdentifierStart</TT> returns <TT>true</TT> and
755	<TT>jletterdigit</TT> all characters for which <TT>Character.isJavaIdentifierPart</TT>
756	returns <TT>true</TT>.
757	</LI>
758	</UL>
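
<P>
The over-matching problem described for <TT>TraditionalComment</TT> can be
reproduced with <TT>java.util.regex</TT>. The sketch below is only an analogy:
the class name <TT>CommentMatchDemo</TT> is hypothetical, and the patterns are a
rough translation into Java regex syntax, in which the JFlex <TT>~</TT> operator
is approximated by a reluctant quantifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch using java.util.regex, not JFlex syntax.
class CommentMatchDemo {
    // Naive pattern: "/*", then anything (greedy), then "*/".
    static final Pattern NAIVE =
        Pattern.compile("/\\*.*\\*/", Pattern.DOTALL);

    // Rough analogue of TraditionalComment (an approximation):
    //   "/*" [^*] ~"*/"  is written as  /\*[^*].*?\*/  (stop at the first "*/")
    //   "/*" "*"+ "/"    is written as  /\*\*+/
    static final Pattern TRADITIONAL =
        Pattern.compile("/\\*[^*].*?\\*/|/\\*\\*+/", Pattern.DOTALL);

    // Collects every match of the pattern in the input, left to right.
    static List<String> findAll(Pattern p, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "/* */ x = 0; /* */";
        System.out.println(findAll(NAIVE, input));       // one match: the whole input
        System.out.println(findAll(TRADITIONAL, input)); // two matches: the two comments
    }
}
```

<P>
The naive pattern swallows the assignment between the two comments, while
the more careful pattern stops at the first closing delimiter.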
759	<A NAME="ExampleStateDecl"></A>
760	<P>
761	The last part of the second section in our
762	lexical specification is a lexical state declaration:
763	<TT><A HREF="#CodeStateDecl">%state STRING</A></TT>
764	declares a lexical state <TT>STRING</TT> that can be
765	used in the ``lexical rules'' part of the specification. A state declaration
766	is a line starting with <TT>%state</TT> followed by a space or comma
767	separated list of state identifiers. There can be more than one line starting
768	with <TT>%state</TT>.
769
770	<P>
771
772	<H2><A NAME="SECTION00043000000000000000"></A><A NAME="ExampleLexRules"></A><BR>
773	Rules and Actions
774	</H2>
775	The "lexical rules" section of a JFlex specification contains regular expressions
776	and actions (Java code) that are executed when the scanner matches the
777	associated regular expression. As the scanner reads its input, it keeps
778	track of all regular expressions and activates the action of the expression
779	that has the longest match. For instance, on the input
780	"<TT>breaker</TT>" our specification above would match the regular expression for <TT><A HREF="#CodeMacros">Identifier</A></TT>
781	and not the keyword "<TT><A HREF="#CodeRulesYYINITIAL">break</A></TT>"
782	followed by the Identifier "<TT>er</TT>", because rule <code>{Identifier}</code>
783	matches more of this input at once (i.e. it matches all of it)
784	than any other rule in the specification. If two regular expressions both
785	have the longest match for a certain input, the scanner chooses the action
786	of the expression that appears first in the specification. In that way, we
787	get for input "<TT>break</TT>" the keyword "<TT>break</TT>" and not an
788	Identifier "<TT>break</TT>".
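
<P>
The two selection criteria, longest match first and then the order of the
rules, can be sketched in plain Java. This is only an illustration under
simplifying assumptions: a generated scanner works on a DFA rather than trying
regular expressions one by one, and the class <TT>LongestMatchDemo</TT>, the rule
names and the simplified identifier pattern are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; a generated scanner uses a DFA, not a loop over regexes.
class LongestMatchDemo {
    // Rules in specification order; LinkedHashMap preserves that order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("BREAK", Pattern.compile("break"));
        // Simplified stand-in for [:jletter:] [:jletterdigit:]*
        RULES.put("IDENTIFIER", Pattern.compile("[A-Za-z_][A-Za-z0-9_]*"));
    }

    // Returns "RULE:lexeme" for the rule that wins at the start of the
    // input, or null if no rule matches there.
    static String firstToken(String input) {
        String bestRule = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The comparison is strict, so on a tie the earlier rule is kept.
            if (m.lookingAt() && m.end() > bestLength) {
                bestRule = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(0, bestLength);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("breaker")); // IDENTIFIER:breaker
        System.out.println(firstToken("break"));   // BREAK:break
    }
}
```

<P>
For "<TT>breaker</TT>" the identifier rule wins because it matches seven
characters instead of five; for "<TT>break</TT>" both rules match five
characters and the rule listed first is chosen.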
789
790	<P>
791	In addition to regular expression matches, one can use lexical states to
792	refine a specification. A lexical state acts like a start condition.
793	If the scanner is in lexical state <TT>STRING</TT>, only expressions that
794	are preceded by the start condition <TT><STRING></TT> can be matched.
795	A start condition of a regular expression can contain more than one lexical
796	state. It is then matched when the lexer is in any of these lexical states.
797	The lexical state <TT>YYINITIAL</TT> is predefined and is also the state
798	in which the lexer begins scanning. If a regular expression has no start
799	conditions it is matched in <EM>all</EM> lexical states.
800	<A NAME="ExampleRulesStateBunch"></A>
801	<P>
802	Since you often have a bunch of expressions with the same start conditions,
803	JFlex allows the same abbreviation as the Unix tool <TT>flex</TT>:
804	<PRE>
805	<STRING> {
806	expr1 { action1 }
807	expr2 { action2 }
808	}
809	</PRE>
810	means that both <TT>expr1</TT> and <TT>expr2</TT> have start condition <TT><STRING></TT>.
811	<A NAME="ExampleRulesYYINITIAL"></A>
812	<P>
813	The first three rules in our example demonstrate the syntax of a regular
814	expression preceded by the start condition <TT><YYINITIAL></TT>.
815
816	<P>
817	<TT><A HREF="#CodeRulesYYINITIAL"><YYINITIAL> "abstract"</A><code> {</code> return symbol(sym.ABSTRACT); <code>}</code></TT>
818
819	<P>
820	matches the input "<TT>abstract</TT>" only if the scanner is in its
821	start state "<TT>YYINITIAL</TT>". When the string "<TT>abstract</TT>" is
822	matched, the scanner function returns the CUP symbol <TT>sym.ABSTRACT</TT>.
823	If an action does not return a value, the scanning process is resumed immediately
824	after executing the action.
825	<A NAME="ExampleRulesBunch"></A>
826	<P>
827	The rules enclosed in
828
829	<P>
830	<TT><A HREF="#CodeRulesBunch"><YYINITIAL> {
831	<BR> ...
832	<BR>}</A></TT>
833
834	<P>
835	demonstrate the abbreviated syntax and are also only matched in state <TT>YYINITIAL</TT>.
836	<A NAME="ExampleRulesYYbegin"></A>
837	<P>
838	Of these rules, one may be of special interest:
839
840	<P>
841	<code>\" { </code> <TT><A HREF="#CodeRulesBunch">string.setLength(0); yybegin(STRING);</A></TT><code> }</code>
842
843	<P>
844	If the scanner matches a double quote in state <TT>YYINITIAL</TT> we
845	have recognised the start of a string literal. Therefore we clear our <TT>StringBuffer</TT>
846	that will hold the content of this string literal and tell the scanner
847	with <TT>yybegin(STRING)</TT> to switch into the lexical state <TT>STRING</TT>.
848	Because we do not yet return a value to the parser, our scanner proceeds
849	immediately.
850	<A NAME="ExampleRulesYYtext"></A>
851	<P>
852	In lexical state <TT>STRING</TT> another
853	rule demonstrates how to refer to the input that has been matched:
854
855	<P>
856	<code>[^\n\r\"\\]+ { </code> <TT><A HREF="#CodeRulesYYtext">string.append( yytext() );</A></TT><code> }</code>
857
858	<P>
859	The expression <code>[^\n\r\"\\]+</code> matches
860	all characters in the input up to the next backslash (indicating an
861	escape sequence such as <code>\n</code>), double quote (indicating the end
862	of the string), or line terminator (which must not occur in a string literal).
863	The matched region of the input is referred to with <TT><A HREF="#CodeRulesYYtext">yytext()</A></TT>
864	and appended to the content of the string literal parsed so far.
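
<P>
To make the interplay of the two lexical states more concrete, here is a
hand-written Java sketch that mimics the <TT>YYINITIAL</TT> and <TT>STRING</TT>
rules of the example. It is an approximation only: JFlex generates a
table-driven scanner instead, the class <TT>StringStateDemo</TT> and the method
<TT>stringLiterals</TT> are hypothetical, and everything outside string literals
is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative hand-written sketch; not the code that JFlex generates.
class StringStateDemo {
    enum State { YYINITIAL, STRING }

    // Collects the contents of all double-quoted string literals in the input.
    static List<String> stringLiterals(String input) {
        List<String> literals = new ArrayList<>();
        StringBuilder string = new StringBuilder();
        State state = State.YYINITIAL;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (state == State.YYINITIAL) {
                if (c == '"') {            // \"  { string.setLength(0); yybegin(STRING); }
                    string.setLength(0);
                    state = State.STRING;
                }
                i++;                       // all other input is skipped in this sketch
            } else if (c == '"') {         // \"  { yybegin(YYINITIAL); return ...; }
                literals.add(string.toString());
                state = State.YYINITIAL;
                i++;
            } else if (c == '\\') {        // escape sequence rules
                char next = (i + 1 < input.length()) ? input.charAt(i + 1) : '\0';
                switch (next) {
                    case 't': string.append('\t'); i += 2; break;
                    case 'n': string.append('\n'); i += 2; break;
                    case 'r': string.append('\r'); i += 2; break;
                    case '"': string.append('"');  i += 2; break;
                    default:  string.append('\\'); i += 1; break; // lone backslash
                }
            } else if (c == '\n' || c == '\r') {
                // Corresponds to the error fallback rule of the example.
                throw new IllegalStateException("Line terminator in string literal");
            } else {
                // [^\n\r\"\\]+  { string.append( yytext() ); }
                int start = i;
                while (i < input.length() && "\n\r\"\\".indexOf(input.charAt(i)) < 0) {
                    i++;
                }
                string.append(input, start, i); // the matched region plays the role of yytext()
            }
        }
        return literals;
    }

    public static void main(String[] args) {
        System.out.println(stringLiterals("x = \"a\\tb\" + \"c\";"));
    }
}
```

<P>
The variable <TT>state</TT> plays the role of the current lexical state:
the same character, a double quote, triggers a different action depending
on whether the scanner is in <TT>YYINITIAL</TT> or in <TT>STRING</TT>.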
865	<A NAME="ExampleRuleLast"></A>
866	<P>
867	The last lexical rule in the example specification
868	is used as an error fallback. It matches any character in any state that
869	has not been matched by another rule. It doesn't conflict with any other
870	rule because it has the lowest priority (it is the last rule) and
871	because it matches only one character (so it can't have longest match
872	precedence over any other rule).
873
874	<P>
875
876	<H2><A NAME="SECTION00044000000000000000">
877	How to get it going</A>
878	</H2>
879
880	<UL>
881	<LI>Install JFlex (see section <A HREF="#Installing">2</A> <A HREF="#Installing"><I>Installing JFlex</I></A>)
882
883	<P>
884	</LI>
885	<LI>If you have written your specification file (or chosen one from the <TT>examples</TT>
886	directory), save it (say under the name <TT>java-lang.flex</TT>).
887
888	<P>
889	</LI>
890	<LI>Run JFlex with
891
892	<P>
893	<TT>jflex java-lang.flex</TT>
894
895	<P>
896	</LI>
897	<LI>JFlex should then report some progress messages about generating the scanner
898	and write the generated code to the directory of your specification file.
899
900	<P>
901	</LI>
902	<LI>Compile the generated <TT>.java</TT> file and your own classes. (If you
903	use CUP, generate your parser classes first.)
904
905	<P>
906	</LI>
907	<LI>That's it.
908	</LI>
909	</UL>
910
911	<P>
912
913	<H1><A NAME="SECTION00050000000000000000"></A><A NAME="Specifications"></A><BR>
914	Lexical Specifications
915	</H1>
916	As shown above, a lexical specification file for JFlex consists of three
917	parts divided by a single line starting with <TT>%%</TT>:
918
919	<P>
920	<TT><A HREF="#SpecUsercode">UserCode</A></TT>
921	<BR><TT>%%</TT>
922	<BR><TT><A HREF="#SpecOptions">Options and declarations</A></TT>
923	<BR><TT>%%</TT>
924	<BR><TT><A HREF="#LexRules">Lexical rules</A></TT>
925
926	<P>
927	In all parts of the specification, comments of the form
928	<TT>/* comment text */</TT> and Java-style end-of-line comments starting with <TT>//</TT>
929	are permitted. JFlex comments do nest, so the number of <TT>/*</TT> and <TT>*/</TT>
930	should be balanced.
931
932	<P>
933
934	<H2><A NAME="SECTION00051000000000000000"></A><A NAME="SpecUsercode"></A><BR>
935	User code
936	</H2>
937	The first part contains user code that is copied verbatim into the beginning
938	of the source file of the generated lexer before the scanner class is declared.
939	As shown in the example above, this is the place to put <TT>package</TT>
940	declarations and <TT>import</TT>
941	statements. It is possible, but not considered good Java programming
942	style, to put your own helper classes (such as token classes) in this section.
943	They should get their own <TT>.java</TT> file instead.
944
945	<P>
946
947	<H2><A NAME="SECTION00052000000000000000"></A><A NAME="SpecOptions"></A><BR>
948	Options and declarations
949	</H2>
950	The second part of the lexical specification contains <A HREF="#SpecOptDirectives">options</A>
951	to customise your generated lexer (JFlex directives and Java code to include in
952	different parts of the lexer), declarations of <A HREF="#StateDecl">lexical states</A> and
953	<A HREF="#MacroDefs">macro definitions</A> for use in the third section
954	<A HREF="#LexRules">``Lexical rules''</A> of the lexical specification file.
955	<A NAME="SpecOptDirectives"></A>
956	<P>
957	Each JFlex directive must be situated at the beginning of a line
958	and starts with the <TT>%</TT> character. Directives that have one or
959	more parameters are described as follows:
960
961	<P>
962	<TT>%class "classname"</TT>
963
964	<P>
965	means that you start a line with <TT>%class</TT> followed by a space followed
966	by the name of the class for the generated scanner (the double quotes are
967	<I>not</I> to be entered, see the <A HREF="#CodeOptions">example specification</A> in
968	section <A HREF="#CodeOptions">3</A>).
969
970	<P>
971
972	<H3><A NAME="SECTION00052100000000000000"></A><A NAME="ClassOptions"></A><BR>
973	Class options and user class code
974	</H3>
975	These options concern the name, constructor, API, and related parts of the
976	generated scanner class.
977
978	<UL>
979	<LI><B><TT>%class "classname"</TT></B>
980
981	<P>
982	Tells JFlex to give the generated class the name "<TT>classname</TT>" and to
983	write the generated code to a file "<TT>classname.java</TT>". If the
984	<TT>-d &lt;directory&gt;</TT> command line option is not used, the code
985	will be written to the directory where the specification file resides. If
986	no <TT>%class</TT> directive is present in the specification, the generated
987	class will get the name "<TT>Yylex</TT>" and will be written to a file
988	"<TT>Yylex.java</TT>". There should be only one <TT>%class</TT> directive
989	in a specification.
990
991	<P>
992	</LI>
993	<LI><B><TT>%implements "interface 1"[, "interface 2", ..]</TT></B>
994
995	<P>
996	Makes the generated class implement the specified interfaces. If more than
997	one <TT>%implements</TT> directive is present, all the specified interfaces
998	will be implemented.
999
1000	<P>
1001	</LI>
1002	<LI><B><TT>%extends "classname"</TT></B>
1003
1004	<P>
1005	Makes the generated class a subclass of the class ``<TT>classname</TT>''.
1006	There should be only one <TT>%extends</TT> directive in a specification.
1007
1008	<P>
1009	</LI>
1010	<LI><B><TT>%public</TT></B>
1011
1012	<P>
1013	Makes the generated class public (the class is only accessible in its
1014	own package by default).
1015
1016	<P>
1017	</LI>
1018	<LI><B><TT>%final</TT></B>
1019
1020	<P>
1021	Makes the generated class final.
1022
1023	<P>
1024	</LI>
1025	<LI><B><TT>%abstract</TT></B>
1026
1027	<P>
1028	Makes the generated class abstract.
1029
1030	<P>
1031	</LI>
1032	<LI><B><TT>%apiprivate</TT></B>
1033
1034	<P>
1035	Makes all generated methods and fields of the class
1036	private. Exceptions are the constructor, user code in the
1037	specification, and, if <code>%cup</code> is present, the method
1038	<TT>next_token</TT>. All occurrences of
1039	<TT>" public "</TT> (one space character before and after <TT>public</TT>)
1040	in the skeleton file are replaced by
1041	<TT>" private "</TT> (even if a user-specified skeleton is used).
1042	Access to the generated class is expected to be mediated by user class
1043	code (see the next directive).
1044
1045	<P>
1046	</LI>
1047	<LI><B><code>%{</code></B>
1048	<BR><B><TT>...</TT></B>
1049	<BR><B><code>%}</code></B>
1050
1051	<P>
1052	The code enclosed in <code>%{</code> and <code>%}</code> is copied verbatim
1053	into the generated class. Here you can define your own member variables
1054	and functions in the generated scanner. Like all options, both <code>%{</code>
1055	and <code>%}</code> must start a line in the specification. If more than one
1056	class code directive <code>%{...%}</code> is present, the code is concatenated
1057	in order of appearance in the specification.
1058
1059	<P>
1060	</LI>
1061	<LI><B><code>%init{</code></B>
1062	<BR><B><TT>...</TT></B>
1063	<BR><B><code>%init}</code></B>
1064
1065	<P>
1066	The code enclosed in <code>%init{</code> and <code>%init}</code> is copied
1067	verbatim into the constructor of the generated class. Here, member
1068	variables declared in the <code>%{...%}</code> directive can be initialised.
1069	If more than one initialiser option is present, the code is concatenated
1070	in order of appearance in the specification.
1071
1072	<P>
1073	</LI>
1074	<LI><B><code>%initthrow{</code></B>
1075	<BR><B><TT>"exception1"[, "exception2", ...]</TT></B>
1076	<BR><B><code>%initthrow}</code></B>
1077
1078	<P>
1079	or (on a single line) just
1080
1081	<P>
1082	<B><TT>%initthrow "exception1" [, "exception2", ...]</TT></B>
1083
1084	<P>
1085	Causes the specified exceptions to be declared in the <TT>throws</TT>
1086	clause of the constructor. If more than one <code>%initthrow{</code> <TT>...</TT> <code>%initthrow}</code>
1087	directive is present in the specification, all specified exceptions will
1088	be declared.
1089
1090	<P>
1091	</LI>
1092	<LI><B><TT>%ctorarg "type" "ident"</TT></B>
1093
1094	<P>
1095	Adds the specified argument to the constructors of the generated scanner.
1096	If more than one such directive is present, the arguments are added in order
1097	of occurrence in the specification. Note that this option conflicts with
1098	the <code>%standalone</code> and <code>%debug</code> directives, because there is no
1099	sensible default that can be created automatically for such parameters
1100	in the generated <TT>main</TT> methods. JFlex will warn in this case and
1101	generate an additional default constructor without these parameters and without user init code (which might potentially refer to the parameters).
1102
1103	<P>
1104	</LI>
1105	<LI><B><TT>%scanerror "exception"</TT></B>
1106
1107	<P>
1108	Causes the generated scanner to throw an instance of the specified
1109	exception in case of an internal error (default is
1110	<TT>java.lang.Error</TT>). Note that this exception is only for
1111	internal scanner errors. With usual specifications it should never
1112	occur (i.e. if there is an error fallback rule in the specification
1113	and only the documented scanner API is used).
1114
1115	<P>
1116	</LI>
1117	<LI><B><TT>%buffer "size"</TT></B>
1118
1119	<P>
1120	Set the initial size of the scan buffer to the specified value
1121	(decimal, in bytes). The default value is 16384.
1122
1123	<P>
1124	</LI>
1125	<LI><B><TT>%include "filename"</TT></B>
1126
1127	<P>
1128	Replaces the <TT>%include</TT> verbatim by the specified file. This
1129	feature is still experimental. It works, but error reporting can be
1130	strange if a syntax error occurs on the last token in the included
1131	file.
1132
1133	<P>
1134	</LI>
1135	</UL>
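<P>
As a minimal sketch of how several of these directives combine (the class name <TT>ExampleLexer</TT> and the member <TT>tokenCount</TT> are made up for illustration and are not part of JFlex), the second part of a specification could contain:

<PRE>
%class ExampleLexer
%public
%final

%{
  /* copied verbatim into the body of the generated class */
  private int tokenCount;

  private void countToken() {
    tokenCount++;
  }
%}

%init{
  /* copied verbatim into the constructor */
  tokenCount = 0;
%init}
</PRE>

<P>
Under these assumptions, and following the descriptions above, JFlex would generate a public final class <TT>ExampleLexer</TT> and write it to the file <TT>ExampleLexer.java</TT>.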
1136
1137	<P>
1138
1139	<H3><A NAME="SECTION00052200000000000000"></A><A NAME="ScanningMethod"></A><BR>
1140	Scanning method
1141	</H3>
1142	This section shows how the scanning method can be customised. You can redefine
1143	the name and return type of the method and it is possible to declare
1144	exceptions that may be thrown in one of the actions of the specification.
1145	If no return type is specified, the scanning method will be declared as
1146	returning values of class <TT>Yytoken</TT>.
1147
1148	<UL>
1149	<LI><B><TT>%function "name"</TT></B>
1150
1151	<P>
1152	Causes the scanning method to get the specified name. If no <TT>%function</TT>
1153	directive is present in the specification, the scanning method gets the
1154	name ``<TT>yylex</TT>''. This directive overrides settings of the
1155	<TT><A HREF="#CupMode">%cup</A></TT> switch. Please note that the default name
1156	of the scanning method with the <TT><A HREF="#CupMode">%cup</A></TT> switch is
1157	<TT>next_token</TT>. Overriding this name might lead to the generated scanner
1158	being implicitly declared as <TT>abstract</TT>, because it does not provide
1159	the method <TT>next_token</TT> of the interface <TT>java_cup.runtime.Scanner</TT>.
1160	It is of course possible to provide a dummy implementation of that method
1161	in the class code section if you still want to override the function name.
1162
1163	<P>
1164	</LI>
1165	<LI><B><TT>%integer</TT></B>
1166	<BR><B><TT>%int</TT></B>
1167
1168	<P>
1169	Both cause the scanning method to be declared as returning the Java type <TT>int</TT>.
1170	Actions in the specification can then return <TT>int</TT> values as tokens.
1171	The default end of file value under this setting is <TT>YYEOF</TT>, which is a <TT>public
1172	static final int</TT> member of the generated class.
1173
1174	<P>
1175	</LI>
1176	<LI><B><TT>%intwrap</TT></B>
1177
1178	<P>
1179	Causes the scanning method to be declared as returning the Java wrapper type
1180	<TT>Integer</TT>. Actions in the specification can then return <TT>Integer</TT>
1181	values as tokens. The default end of file value under this setting is <TT>null</TT>.
1182
1183	<P>
1184	</LI>
1185	<LI><B><TT>%type "typename"</TT></B>
1186
1187	<P>
1188	Causes the scanning method to be declared as returning values of the specified type.
1189	Actions in the specification can then return values of <TT>typename</TT>
1190	as tokens. The default end of file value under this setting is <TT>null</TT>.
1191	If <TT>typename</TT> is not a subclass of <TT>java.lang.Object</TT>,
1192	you should specify another end of file value using the
1193	<A HREF="#eofval"><TT>%eofval{</TT> <TT>...</TT> <TT>%eofval}</TT></A>
1194	directive or the <A HREF="#EOFRule"><TT>&lt;&lt;EOF&gt;&gt;</TT> rule</A>.
1195	The <TT>%type</TT> directive overrides settings of the
1196	<TT><A HREF="#CupMode">%cup</A></TT> switch.
1197
1198	<P>
1199	</LI>
1200	<LI><B><code>%yylexthrow{</code></B>
1201	<BR><B><TT>"exception1"[, "exception2", ... ]</TT></B>
1202	<BR><B><code>%yylexthrow}</code></B>
1203
1204	<P>
1205	or (on a single line) just
1206
1207	<P>
1208	<B><TT>%yylexthrow "exception1" [, "exception2", ...]</TT></B>
1209
1210	<P>
1211	The exceptions listed inside <code>%yylexthrow{</code> <TT>...</TT> <code>%yylexthrow}</code>
1212	will be declared in the throws clause of the scanning method. If there is
1213	more than one <code>%yylexthrow{</code> <TT>...</TT> <code>%yylexthrow}</code> clause in
1214	the specification, all specified exceptions will be declared.
1215	</LI>
1216	</UL>
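<P>
The directives above might be combined as in the following sketch. The method name <TT>nextToken</TT> and the class <TT>Token</TT> are hypothetical and would have to be supplied by you; only the directives themselves are taken from this section.

<PRE>
%function nextToken
%type Token
%yylexthrow java.text.ParseException
</PRE>

<P>
With these settings the scanning method would be called <TT>nextToken</TT>, its actions could return <TT>Token</TT> objects, and <TT>java.text.ParseException</TT> would be listed in its throws clause. Since <TT>Token</TT> is assumed to be a class here, the default end of file value would be <TT>null</TT>.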
1217
1218	<P>
1219
1220	<H3><A NAME="SECTION00052300000000000000"></A><A NAME="EOF"></A><BR>
1221	The end of file
1222	</H3>
1223	There is always a default value that the scanning method will return when
1224	the end of file has been reached. You may however define a specific value
1225	to return and a specific piece of code that should be executed when the
1226	end of file is reached.
1227
1228	<P>
1229	The default end of file value depends on the return type of the scanning method:
1230
1231	<UL>
1232	<LI>For <B><TT>%integer</TT></B>, the scanning method will return the value
1233	<B><TT>YYEOF</TT></B>, which is a <TT>public static final int</TT> member
1234	of the generated class.
1235
1236	<P>
1237	</LI>
1238	<LI>For <B><TT>%intwrap</TT></B>,
1239
1240	no specified type at all, or a
1241
1242	user defined type, declared using <B><TT>%type</TT></B>, the value is <B><TT>null</TT></B>.
1243
1244	<P>
1245	</LI>
1246	<LI>In CUP compatibility mode, using <B><TT>%cup</TT></B>, the value is
1247
1248	<P>
1249	<B><TT>new java_cup.runtime.Symbol(sym.EOF)</TT></B>
1250	</LI>
1251	</UL>
1252
1253	<P>
1254	User values and code to be executed at the end of file can be defined using these directives:
1255
1256	<A NAME="eofval"></A><UL>
1257	<LI><B><code>%eofval{</code></B>
1258	<BR><B><TT>...</TT></B>
1259	<BR><B><code>%eofval}</code></B>
1260
1261	<P>
1262	The code included in <code>%eofval{</code> <TT>...</TT> <code>%eofval}</code> will
1263	be copied verbatim into the scanning method and will be executed <EM>each time</EM>
1264	when the end of file is reached (this is possible when
1265	the scanning method is called again after the end of file has been
1266	reached). The code should return the value that indicates the end of
1267	file to the parser. There should be only one <code>%eofval{</code>
1268	<TT>...</TT> <code>%eofval}</code> clause in the specification.
1269	The <code>%eofval{ ... %eofval}</code> directive overrides settings of the
1270	<TT><A HREF="#CupMode">%cup</A></TT> switch and <TT><A HREF="#YaccMode">%byacc</A></TT> switch.
1271	As of version 1.2 JFlex provides
1272	a more readable way to specify the end of file value using the
1273	<A HREF="#EOFRule"><TT>&lt;&lt;EOF&gt;&gt;</TT> rule</A> (see also section <A HREF="#EOFRule">4.3.2</A>).
1274
1275	<P>
1276	</LI>
1277	<LI><A NAME="eof"></A> <B><code>%eof{</code></B>
1278	<BR> <B><TT>...</TT></B>
1279	<BR> <B><code>%eof}</code></B>
1280
1281	<P>
1282	The code included in <code>%eof{ ... %eof}</code> will be executed
1283	exactly once, when the end of file is reached. The code is included
1284	inside a method <TT>void yy_do_eof()</TT> and should not return any
1285	value (use <code>%eofval{...%eofval}</code> or
1286	<A HREF="#EOFRule"><TT>&lt;&lt;EOF&gt;&gt;</TT></A> for this purpose). If more than one
1287	end of file code directive is present, the code will be concatenated
1288	in order of appearance in the specification.
1289
1290	<P>
1291	</LI>
1292	<LI><B><code>%eofthrow{</code></B>
1293	<BR> <B><TT>"exception1"[,"exception2", ... ]</TT></B>
1294	<BR> <B><code>%eofthrow}</code></B>
1295
1296	<P>
1297	or (on a single line) just
1298
1299	<P>
1300	<B><TT>%eofthrow "exception1" [, "exception2", ...]</TT></B>
1301
1302	<P>
1303	The exceptions listed inside <code>%eofthrow{...%eofthrow}</code> will
1304	be declared in the throws clause of the method <TT>yy_do_eof()</TT>
1305	(see <A HREF="#eof"><TT>%eof</TT></A> for more on that method).
1306	If there is more than one <code>%eofthrow{...%eofthrow}</code> clause
1307	in the specification, all specified exceptions will be declared.
1308
1309	<P>
1310	<A NAME="eofclose"></A></LI>
1311	<LI><B><TT>%eofclose</TT></B>
1312
1313	<P>
1314	Causes JFlex to close the input stream at the end of file. The code
1315	<TT>yyclose()</TT> is appended to the method <TT>yy_do_eof()</TT>
1316	(together with the code specified in <code>%eof{...%eof}</code>) and
1317	the exception <TT>java.io.IOException</TT> is declared in the throws
1318	clause of this method (together with those of
1319	<code>%eofthrow{...%eofthrow}</code>).
1320
1321	<P>
1322	</LI>
1323	<LI><B><TT>%eofclose false</TT></B>
1324
1325	<P>
1326	Turns the effect of <TT>%eofclose</TT> off again (e.g. if closing the
1327	input stream is not wanted after <TT>%cup</TT>).
1328
1329	<P>
1330	</LI>
1331	</UL>
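<P>
To illustrate the difference between the end of file value and the end of file code, here is a small sketch. The class <TT>Token</TT> and its constructor taking a string are assumptions made for this example only.

<PRE>
%type Token

%eofval{
  /* executed each time the scanning method is called at end of file */
  return new Token("EOF");
%eofval}

%eof{
  /* executed exactly once, inside yy_do_eof() */
  System.out.println("end of input reached");
%eof}

%eofclose
</PRE>

<P>
Following the descriptions above, <TT>%eofclose</TT> would additionally append a call to <TT>yyclose()</TT> to <TT>yy_do_eof()</TT> and declare <TT>java.io.IOException</TT> in the throws clause of that method.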
1332
1333	<P>
1334
1335	<H3><A NAME="SECTION00052400000000000000"></A><A NAME="Standalone"></A><BR>
1336	Standalone scanners
1337	</H3>
1338
1339	<UL>
1340	<LI><B><TT>%debug</TT></B>
1341
1342	<P>
1343	Creates a main function in the generated class that expects the name
1344	of an input file on the command line and then runs the scanner on this
1345	input file by printing information about each returned token to the Java
1346	console until the end of file is reached. The information includes:
1347	line number (if line counting is enabled), column (if column counting is enabled),
1348	the matched text, and the executed action (with line number in the specification).
1349
1350	<P>
1351	</LI>
1352	<LI><B><TT>%standalone</TT></B>
1353
1354	<P>
1355	Creates a main function in the generated class that expects the name
1356	of an input file on the command line and then runs the scanner on this
1357	input file. The values returned by the scanner are ignored, but any unmatched
1358	text is printed to the Java console instead (as the C/C++ tool flex does, if
1359	run as a standalone program). To avoid having to use an extra token class, the
1360	scanning method will be declared as having default type <TT>int</TT>, not <TT>Yytoken</TT>
1361	(if there isn't any other type explicitly specified).
1362	This is in most cases irrelevant, but could be useful to know when making
1363	another scanner standalone for some purpose. You should also consider using
1364	the <TT>%debug</TT> directive, if you just want to be able to run the scanner
1365	without a parser attached for testing etc.
1366
1367	<P>
1368	</LI>
1369	</UL>
1370
1371	<P>
1372
1373	<H3><A NAME="SECTION00052500000000000000"></A><A NAME="CupMode"></A><BR>
1374	CUP compatibility
1375	</H3>
1376	You may also want to read section <A HREF="#CUPWork">8.1</A> <A HREF="#CUPWork"><I>JFlex and CUP</I></A>
1377	if you are interested in how to interface your generated
1378	scanner with CUP.
1379
1380	<UL>
1381	<LI><B><TT>%cup</TT></B>
1382
1383	<P>
1384	The <TT>%cup</TT> directive enables the CUP compatibility mode and is equivalent
1385	to the following set of directives:
1386
1387	<P>
1388	<PRE>
1389	%implements java_cup.runtime.Scanner
1390	%function next_token
1391	%type java_cup.runtime.Symbol
1392	%eofval{
1393	return new java_cup.runtime.Symbol(&lt;CUPSYM&gt;.EOF);
1394	%eofval}
1395	%eofclose
1396	</PRE>
1397
1398	<P>
1399	The value of <TT>&lt;CUPSYM&gt;</TT> defaults to <TT>sym</TT> and can be
1400	changed with the <TT>%cupsym</TT> directive. In JLex compatibility
1401	mode (<TT>-jlex</TT> switch on the command line), <TT>%eofclose</TT>
1402	will not be turned on.
1403
1404	<P>
1405	</LI>
1406	<LI><B><TT>%cupsym "classname"</TT></B>
1407
1408	<P>
1409	Customises the name of the CUP generated class/interface
1410	containing the names of terminal tokens. Default is <TT>sym</TT>.
1411	The directive should be placed before <TT>%cup</TT>, not after it.
1412
1413	<P>
1414	</LI>
1415	<LI><B><TT>%cupdebug</TT></B>
1416
1417	<P>
1418	Creates a main function in the generated class that expects the name
1419	of an input file on the command line and then runs the scanner on this
1420	input file. Prints line, column, matched text, and CUP symbol name for
1421	each returned token to standard out.
1422
1423	<P>
1424	</LI>
1425	</UL>
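<P>
As a short sketch of the ordering described above, assume CUP has been told to generate its symbol class under the hypothetical name <TT>MySymbols</TT> instead of <TT>sym</TT>. The corresponding JFlex directives would then be:

<PRE>
%cupsym MySymbols
%cup
</PRE>

<P>
Going by the expansion of <TT>%cup</TT> shown above, the end of file value would in this case be <TT>new java_cup.runtime.Symbol(MySymbols.EOF)</TT>.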
1426
1427	<P>
1428
1429	<H3><A NAME="SECTION00052600000000000000"></A><A NAME="YaccMode"></A><BR>
1430	BYacc/J compatibility
1431	</H3>
1432	You may also want to read section <A HREF="#YaccWork">8.2</A> <A HREF="#YaccWork"><I>JFlex and BYacc/J</I></A>
1433	if you are interested in how to interface your generated
1434	scanner with BYacc/J.
1435
1436	<UL>
1437	<LI><B><TT>%byacc</TT></B>
1438
1439	<P>
1440	The <TT>%byacc</TT> directive enables the BYacc/J compatibility mode and is equivalent
1441	to the following set of directives:
1442
1443	<P>
1444	<PRE>
1445	%integer
1446	%eofval{
1447	return 0;
1448	%eofval}
1449	%eofclose
1450	</PRE>
1451
1452	<P>
1453	</LI>
1454	</UL>
1455
1456	<P>
1457
1458	<H3><A NAME="SECTION00052700000000000000"></A><A NAME="CodeGeneration"></A><BR>
1459	Code generation
1460	</H3>
1461	The following options define what kind of lexical analyser code JFlex
1462	will produce. <TT>%pack</TT> is the default setting and will be used
1463	when no code generation method is specified.
1464
1465	<P>
1466
1467	<UL>
1468	<LI><B><TT>%switch</TT></B>
1469
1470	<P>
1471	With <TT>%switch</TT> JFlex will generate a scanner that has
1472	the DFA hard coded into a nested switch statement. This method gives
1473	a good deal of compression in terms of the size of the compiled
1474	<TT>.class</TT> file while still providing very good performance. If your
1475	scanner gets too big though (say more than about 200 states),
1476	performance may degrade considerably and you should consider using one
1477	of the <TT>%table</TT> or <TT>%pack</TT> directives. If your scanner
1478	gets even bigger (about 300 states), the Java compiler <TT>javac</TT>
1479	could produce corrupted code that will crash when executed or will
1480	give you a <TT>java.lang.VerifyError</TT> when checked by the virtual
1481	machine. This is due to the size limitation of 64 KB of Java
1482	methods as described in the Java Virtual Machine Specification
1483	[<A
1484	HREF="manual.html#MachineSpec">10</A>]. In this case you will be forced to use the
1485	<TT>%pack</TT> directive, since <TT>%switch</TT>
1486	usually provides more compression of the DFA table than the
1487	<TT>%table</TT> directive.
1488
1489	<P>
1490	</LI>
1491	<LI><B><TT>%table</TT></B>
1492
1493	<P>
1494	The <TT>%table</TT> directive causes JFlex to produce a classical
1495	table driven scanner that encodes its DFA table in an array. In
1496	this mode, JFlex only does a small amount of table compression (see
1497	[<A
1498	HREF="manual.html#ParseTable">6</A>], [<A
1499	HREF="manual.html#SparseTable">12</A>], [<A
1500	HREF="manual.html#Aho">1</A>] and [<A
1501	HREF="manual.html#Maurer">13</A>]
1502	for more details on the matter of table compression) and uses the
1503	same method that JLex did up to version 1.2.1. See section <A HREF="#performance">6</A>
1504	<A HREF="#performance">performance</A> of this manual to compare
1505	these methods. The same limitation as above (64 KB size limit for
1506	methods) causes the same problem when the scanner gets too big,
1507	because the virtual machine treats static initialisers of
1508	arrays as normal methods. In this case you will again be forced to
1509	use the <TT>%pack</TT> directive to avoid the problem.
1510
1511	<P>
1512	</LI>
1513	<LI><B><TT>%pack</TT></B>
1514
1515	<P>
1516	<TT>%pack</TT> causes JFlex to compress the generated DFA table and to
1517	store it in one or more string literals. JFlex takes care that the
1518	strings are not longer than permitted by the class file format.
1519	The strings have to be unpacked when
1520	the first scanner object is created and initialised.
1521	After unpacking the internal access to the DFA table is exactly the
1522	same as with option <TT>%table</TT> -- the only extra work to be done
1523	at runtime is the unpacking process which is quite fast (not noticeable
1524	in normal cases). It is in time complexity proportional to the
1525	size of the expanded DFA table, and it is static,
1526	i.e. it is done only once for a certain scanner class -- no matter
1527	how often it is instantiated. Again, see section
1528	<A HREF="#performance">6</A> <A HREF="#performance">performance</A>
1529	on the performance of these scanners.
1530	With <TT>%pack</TT>, there should be practically no
1531	limitation to the size of the scanner. <TT>%pack</TT> is the default
1532	setting and will be used when no code generation method is specified.
1533	</LI>
1534	</UL>
1535
1536	<P>
1537
1538	<H3><A NAME="SECTION00052800000000000000"></A><A NAME="CharacterSets"></A><BR>
1539	Character sets
1540	</H3>
1541
1542	<UL>
1543	<LI><B><TT>%7bit</TT></B>
1544
1545	<P>
1546	Causes the generated scanner to use a 7 bit input character set (character
1547	codes 0-127). If an input character with a code greater than 127 is
1548	encountered in an input at runtime, the scanner will throw an <TT>ArrayIndexOutOfBoundsException</TT>.
1549	Not only because of this, you should consider using the <TT>%unicode</TT> directive.
1550	See also section <A HREF="#sec:encodings">5</A> for information about character encodings. This is the default in JLex compatibility mode.
1551
1552	<P>
1553	</LI>
1554	<LI><B><TT>%full</TT></B>
1555	<BR><B><TT>%8bit</TT></B>
1556
1557	<P>
1558	Both options cause the generated scanner to use an 8 bit input character
1559	set (character codes 0-255). If an input character with a code greater
1560	than 255 is encountered in an input at runtime, the scanner will throw
1561	an <TT>ArrayIndexOutOfBoundsException</TT>. Note that even if your platform
1562	uses only one byte per character, the Unicode value of a character may
1563	still be greater than 255. If you are scanning text files, you should
1564	consider using the <TT>%unicode</TT> directive. See also section <A HREF="#sec:encodings">5</A>
1565	for more information about character encodings.
1566
1567	<P>
1568	</LI>
1569	<LI><B><TT>%unicode</TT></B>
1570	<BR><B><TT>%16bit</TT></B>
1571
1572	<P>
1573	Both options cause the generated scanner to use the full 16 bit Unicode input
1574	character set that Java supports natively (character code points 0-65535).
1575	There will be no runtime overflow when using this set of input characters.
1576	<TT>%unicode</TT> does not mean that the scanner will read two bytes at a
1577	time. What is read and what constitutes a character depends on the runtime
1578	platform. See also section <A HREF="#sec:encodings">5</A> for more information about
1579	character encodings. This is the default unless the JLex compatibility mode is
1580	used (command line option <TT>-jlex</TT>).
1581
1582	<P>
1583	<A NAME="caseless"></A></LI>
1584	<LI><B><TT>%caseless</TT></B>
1585	<BR><B><TT>%ignorecase</TT></B>
1586
1587	<P>
1588	This option causes JFlex to handle all characters and strings in the
1589	specification as if they were specified in both uppercase and lowercase form.
1590	This enables an easy way to specify a scanner for a language with case
1591	insensitive keywords. The string "<TT>break</TT>" in a specification is for
1592	instance handled like the expression <TT>([bB][rR][eE][aA][kK])</TT>. The
1593	<TT>%caseless</TT> option does not change the matched text and does not
1594	affect character classes. So <TT>[a]</TT> still only matches the character
1595	<TT>a</TT> and not <TT>A</TT>, too. Which letters are uppercase and which
1596	are lowercase is defined by the Unicode standard and determined by JFlex
1597	with the Java methods <TT>Character.toUpperCase</TT> and
1598	<TT>Character.toLowerCase</TT>. In JLex compatibility mode (<TT>-jlex</TT>
1599	switch on the command line), <TT>%caseless</TT> and <TT>%ignorecase</TT>
1600	also affect character classes.
1601
1602	<P>
1603	</LI>
1604	</UL>
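<P>
The effect of <TT>%caseless</TT> can be tried out with a small standalone specification such as the following sketch. The rules and the printed messages are chosen for illustration only; the first part (user code) is left empty.

<PRE>
%%

%standalone
%caseless

%%

"break"   { System.out.println("keyword: " + yytext()); }
[a]       { System.out.println("letter a"); }
.|\n      { /* ignore everything else */ }
</PRE>

<P>
According to the description above, the first rule should also match inputs such as <TT>BREAK</TT> or <TT>Break</TT>, with <TT>yytext()</TT> returning the text exactly as it appeared in the input, while the character class <TT>[a]</TT> should still match only the lowercase letter (outside JLex compatibility mode).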
1605	<H3><A NAME="SECTION00052900000000000000"></A><A NAME="Counting"></A><BR>
1606	Line, character and column counting
1607	</H3>
1608
1609	<UL>
1610	<LI><B><TT>%char</TT></B>
1611
1612	<P>
1613	Turns character counting on. The <TT>int</TT> member variable <TT>yychar</TT>
1614	contains the number of characters (starting with 0) from the beginning
1615	of input to the beginning of the current token.
1616
1617	<P>
1618	</LI>
1619	<LI><B><TT>%line</TT></B>
1620
1621	<P>
1622	Turns line counting on. The <TT>int</TT> member variable <TT>yyline</TT>
1623	contains the number of lines (starting with 0) from the beginning of input
1624	to the beginning of the current token.
1625
1626	<P>
1627	</LI>
1628	<LI><B><TT>%column</TT></B>
1629
1630	<P>
1631	Turns column counting on. The <TT>int</TT> member variable <TT>yycolumn</TT>
1632	contains the number of characters (starting with 0) from the beginning
1633	of the current line to the beginning of the current token.
1634
1635	<P>
1636	</LI>
1637	</UL>
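<P>
A minimal sketch of how these counters might be used in an action is shown below. The rules are made up for this example. Because all three counters start with 0, the sketch adds 1 to the line and column to obtain the numbering most editors display.

<PRE>
%%

%standalone
%line
%column

%%

[a-zA-Z]+   { System.out.println((yyline + 1) + ":" + (yycolumn + 1) + " " + yytext()); }
.|\n        { /* ignore everything else */ }
</PRE>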
1638
1639	<P>
1640
1641	<H3><A NAME="SECTION000521000000000000000"></A><A NAME="Obsolete"></A><BR>
1642	Obsolete JLex options
1643	</H3>
1644
1645	<UL>
1646	<LI><B><TT>%notunix</TT></B>
1647
1648	<P>
1649	This JLex option is obsolete in JFlex but still recognised as a valid directive.
1650	It used to switch between Windows and Unix style line terminators (<code>\r\n</code>
1651	and <code>\n</code>) for the <TT>$</TT> operator in regular expressions. JFlex
1652	always recognises both styles of platform dependent line terminators.
1653
1654	<P>
1655	</LI>
1656	<LI><B><TT>%yyeof</TT></B>
1657
1658	<P>
1659	This JLex option is obsolete in JFlex but still recognised as a valid directive.
1660	In JLex it declares a public member constant <TT>YYEOF</TT>. JFlex declares it in any case.
1661	</LI>
1662	</UL>
1663
1664	<P>
1665
1666	<H3><A NAME="SECTION000521100000000000000"></A><A NAME="StateDecl"></A><BR>
1667	State declarations
1668	</H3>
1669	State declarations have the following form:
1670
1671	<P>
1672	<TT>%s[tate] "state identifier" [, "state identifier", ... ]</TT> for inclusive or
1673	<BR><TT>%x[state] "state identifier" [, "state identifier", ... ]</TT> for exclusive states
1674
1675	<P>
1676	There may be more than one line of state declarations, each starting with
1677	<TT>%state</TT> or <TT>%xstate</TT> (the first character is sufficient,
1678	<TT>%s</TT> and <TT>%x</TT> work, too). State identifiers are letters followed
1679	by a sequence of letters, digits or underscores. State identifiers can be separated
1680	by whitespace or commas.
1681
1682	<P>
1683	The sequence
1684
1685	<P>
1686	<TT>%state STATE1</TT>
1687	<BR><TT>%xstate STATE3, XYZ, STATE_10</TT>
1688	<BR><TT>%state ABC STATE5</TT>
1689
1690	<P>
1691	declares the set of identifiers <TT>STATE1, STATE3, XYZ,
1692	STATE_10, ABC, STATE5</TT> as lexical states, <TT>STATE1</TT>, <TT>ABC</TT>, <TT>STATE5</TT>
1693	as inclusive, and <TT>STATE3</TT>, <TT>XYZ</TT>, <TT>STATE_10</TT> as exclusive.
1694	See also section
1695	<A HREF="#HowMatched">4.3.3</A> on the way lexical states influence how the input is
1696	matched.
1697
1698	<P>
1699
1700	<H3><A NAME="SECTION000521200000000000000"></A><A NAME="MacroDefs"></A><BR>
1701	Macro definitions
1702	</H3>
1703	A macro definition has the form
1704
1705	<P>
1706	<TT>macroidentifier = regular expression</TT>
1707
1708	<P>
1709	That means, a macro definition is a macro identifier (letter followed
1710	by a sequence of letters, digits or underscores), that can later be
1711	used to reference the macro, followed by optional white-space, followed
1712	by an "<TT>=</TT>", followed by optional white-space, followed by a
1713	regular expression (see section <A HREF="#LexRules">4.3</A> <A HREF="#LexRules"><I>lexical
1714	rules</I></A> for more information about regular expressions).
1715
1716	<P>
1717	The regular expression on the right hand side must be well formed and
1718	must not contain the <code>^</code>, <TT>/</TT> or <TT>$</TT> operators. <B>Unlike
1719	in JLex, macros are not just pieces of text that are expanded by copying</B>
1720	- they are parsed and must be well formed.
1721
1722	<P>
1723	<B>This is a feature.</B> It eliminates some very hard to find bugs in
1724	lexical specifications (such as forgetting parentheses around more
1725	complicated macros - which are not necessary with JFlex). See section
1726	<A HREF="#Porting">7.1</A> <A HREF="#Porting"><I>Porting from JLex</I></A> for more
1727	details on the problems of JLex style macros.
1728
1729	<P>
1730	Since macro usages are allowed in macro definitions, it is
1731	possible to use a grammar-like notation to specify the desired lexical
1732	structure. Macros, however, remain just abbreviations of the regular expressions
1733	they represent. They are not nonterminals of a grammar and cannot be used
1734	recursively in any way. JFlex detects cycles in macro definitions and reports
1735	them at generation time. JFlex also warns you about macros that have been
1736	defined but never used in the ``lexical rules'' section of the specification.
1737
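<P>
To illustrate what a cycle in macro definitions means, here is a minimal sketch of a depth-first cycle check. It is not JFlex's own implementation: <TT>MacroCycles</TT> and <TT>hasCycle</TT> are assumed names, and macro usages are found with a naive pattern for <TT>{Identifier}</TT> that would also pick up braces inside strings or character classes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch for illustration; not part of JFlex. */
class MacroCycles {

    // Naive recognition of a macro usage such as {Letter}.
    private static final Pattern USAGE =
        Pattern.compile("\\{([a-zA-Z][a-zA-Z0-9_]*)\\}");

    /** Returns true if some macro refers to itself, directly or indirectly. */
    static boolean hasCycle(Map<String, String> macros) {
        Set<String> done = new HashSet<>();
        for (String name : macros.keySet()) {
            if (visit(name, macros, new HashSet<String>(), done)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String name, Map<String, String> macros,
                                 Set<String> onPath, Set<String> done) {
        if (done.contains(name)) {
            return false; // already checked, no cycle through this macro
        }
        if (!onPath.add(name)) {
            return true; // name is already on the current path: cycle
        }
        String body = macros.get(name);
        if (body != null) {
            Matcher m = USAGE.matcher(body);
            while (m.find()) {
                if (visit(m.group(1), macros, onPath, done)) {
                    return true;
                }
            }
        }
        onPath.remove(name);
        done.add(name);
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> macros = new HashMap<>();
        macros.put("Letter", "[a-zA-Z]");
        macros.put("Digit", "[0-9]");
        macros.put("Identifier", "{Letter}({Letter}|{Digit})*");
        System.out.println(hasCycle(macros));
    }
}
```

A definition set such as <TT>A = a{B}</TT> together with <TT>B = {A}b</TT> is reported as cyclic by this sketch, while the grammar-like <TT>Identifier</TT> example is not.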
1738	<P>
1739
1740	<H2><A NAME="SECTION00053000000000000000"></A><A NAME="LexRules"></A><BR>
1741	Lexical rules
1742	</H2>
1743	The ``lexical rules'' section of a JFlex specification contains a set of
1744	regular expressions and actions (Java code) that are executed when the
1745	scanner matches the associated regular expression.
1746
1747	<P>
1748
1749	<H3><A NAME="SECTION00053100000000000000"></A><A NAME="Grammar"></A><BR>
1750	Syntax
1751	</H3>
1752	The syntax of the "lexical rules" section is described by the following
1753	BNF grammar (terminal symbols are enclosed in 'quotes'):
1754
1755	<P>
1756	<PRE>
1757	LexicalRules ::= Rule+
1758	Rule         ::= [StateList] ['^'] RegExp [LookAhead] Action
1759	               | [StateList] '<<EOF>>' Action
1760	               | StateGroup
1761	StateGroup   ::= StateList '{' Rule+ '}'
1762	StateList    ::= '<' Identifier (',' Identifier)* '>'
1763	LookAhead    ::= '$' | '/' RegExp
1764	Action       ::= '{' JavaCode '}' | '|'
1765
1766	RegExp       ::= RegExp '|' RegExp
1767	               | RegExp RegExp
1768	               | '(' RegExp ')'
1769	               | ('!'|'~') RegExp
1770	               | RegExp ('*'|'+'|'?')
1771	               | RegExp "{" Number ["," Number] "}"
1772	               | '[' ['^'] (Character|Character'-'Character)* ']'
1773	               | PredefinedClass
1774	               | '{' Identifier '}'
1775	               | '"' StringCharacter+ '"'
1776	               | Character
1777
1778	PredefinedClass ::= '[:jletter:]'
1779	                  | '[:jletterdigit:]'
1780	                  | '[:letter:]'
1781	                  | '[:digit:]'
1782	                  | '[:uppercase:]'
1783	                  | '[:lowercase:]'
1784	                  | '.'
1785	</PRE>
1786
1787	<P>
1788	<A NAME="Terminals"></A>The grammar uses the following terminal symbols:
1789
1790	<UL>
1791	<LI><TT>JavaCode</TT>
1792	<BR> a sequence of <EM><TT>BlockStatements</TT></EM> as described in the Java
1793	Language Specification [<A
1794	HREF="manual.html#LangSpec">7</A>], section 14.2.
1795
1796	<P>
1797	</LI>
1798	<LI><TT>Number</TT>
1799	<BR> a non negative decimal integer.
1800
1801	<P>
1802	</LI>
1803	<LI><TT>Identifier</TT>
1804	<BR> a letter <code>[a-zA-Z]</code> followed by a sequence of zero or more
1805	letters, digits or underscores <code>[a-zA-Z0-9_]</code>
1806
1807	<P>
1808	</LI>
1809	<LI><TT>Character</TT>
1810	<BR> an escape sequence or any unicode character that is not one of these
1811	meta characters:
1812	<code> | ( ) { } [ ] < > \ . * + ? ^ $ / " ~ !</code>
1813
1814	<P>
1815	</LI>
1816	<LI><TT>StringCharacter</TT>
1817	<BR> an escape sequence or any unicode character that is not one of these
1818	meta characters:
1819	<code> \ "</code>
1820
1821	<P>
1822	</LI>
1823	<LI>An escape sequence
1824
1825	<P>
1826
1827	<UL>
1828	<LI><code>\n</code> <code>\r</code> <code>\t</code> <code>\f</code> <code>\b</code>
1829	</LI>
1830	<LI>a <code>\x</code> followed by two hexadecimal digits <TT>[a-fA-F0-9]</TT> (denoting
1831	a standard ASCII escape sequence),
1832
1833	<P>
1834	</LI>
1835	<LI>a <code>\u</code> followed by four hexadecimal digits <TT>[a-fA-F0-9]</TT>
1836	(denoting a Unicode escape sequence),
1837
1838	<P>
1839	</LI>
1840	<LI>a backslash followed by a three digit octal number from 000 to 377 (denoting
1841	a standard ASCII escape sequence), or
1842
1843	<P>
1844	</LI>
1845	<LI>a backslash followed by any other unicode character that stands for this
1846	character.
1847
1848	<P>
1849	</LI>
1850	</UL>
1851
1852	<P>
1853	</LI>
1854	</UL>
1855
1856	<P>
1857	Please note that the <code>\n</code> escape sequence stands for the ASCII
1858	LF character - not for the end of line. If you would like to match the
1859	line terminator, you should use the expression <code>\r|\n|\r\n</code> if you want
1860	the Java conventions, or <code>\r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>
1861	if you want to be fully Unicode compliant (see also [<A
1862	HREF="manual.html#unicode_rep">5</A>]).
1863
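<P>
The two line terminator expressions can be tried out with <TT>java.util.regex</TT>, as in the following sketch (<TT>LineTerminators</TT> and <TT>count</TT> are names assumed for this example). One difference has to be accounted for: JFlex picks the longest match, while <TT>java.util.regex</TT> takes the first alternative that matches, so <code>\r\n</code> is listed first here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: line terminator expressions in java.util.regex. */
class LineTerminators {

    // Java conventions: CR LF, CR, or LF. The two-character form comes
    // first because java.util.regex tries alternatives from left to right.
    static final Pattern JAVA_STYLE = Pattern.compile("\\r\\n|\\r|\\n");

    // The fully Unicode compliant variant from the text above.
    static final Pattern UNICODE_STYLE = Pattern.compile(
        "\\r\\n|[\\r\\n\\u2028\\u2029\\u000B\\u000C\\u0085]");

    /** Counts the line terminators found in the input. */
    static int count(Pattern terminator, String input) {
        Matcher m = terminator.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "a\r\nb\nc" + (char) 0x85 + "d";
        System.out.println(count(JAVA_STYLE, input));
        System.out.println(count(UNICODE_STYLE, input));
    }
}
```

For the sample input, the Java-style expression sees two terminators, and the Unicode-style expression also counts the NEL character (code 0x85) as a third.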
1864	<P>
1865	As of version 1.1 of JFlex the white-space characters <TT>" "</TT>
1866	(space) and <code>"\t"</code> (tab) can be used to improve the readability of
1867	regular expressions. They will be ignored by JFlex. In character
1868	classes and strings however, white-space characters keep standing for
1869	themselves (so the string <TT>" "</TT> still matches exactly one space
1870	character and <code>[ \n]</code> still matches an ASCII LF or a space
1871	character).
1872
1873	<P>
1874	JFlex applies the following standard operator precedences in regular
1875	expressions (from highest to lowest):
1876
1877	<P>
1878
1879	<UL>
1880	<LI>unary postfix operators (<code>'*', '+', '?', {n}, {n,m}</code>)
1881
1882	<P>
1883	</LI>
1884	<LI>unary prefix operators (<code>'!', '~'</code>)
1885
1886	<P>
1887	</LI>
1888	<LI>concatenation (<TT>RegExp::= RegExp RegExp</TT>)
1889
1890	<P>
1891	</LI>
1892	<LI>union (<code>RegExp::= RegExp '|' RegExp</code>)
1893	</LI>
1894	</UL>
1895
1896	<P>
1897	So the expression <code>a | abc | !cd*</code> for instance is parsed as
1898	<code>(a|(abc)) | ((!c)(d*))</code>.
1899
1900	<P>
1901
1902	<H3><A NAME="SECTION00053200000000000000"></A><A NAME="Semantics"></A><BR>
1903	Semantics
1904	</H3>
1905	This section gives an informal description of which text is matched by
1906	a regular expression (i.e. an expression described by the <TT>RegExp</TT>
1907	production of the grammar presented <A HREF="#Grammar">above</A>).
1908
1909	<P>
1910	A regular expression that consists solely of
1911
1912	<UL>
1913	<LI>a <TT>Character</TT> matches this character.
1914
1915	<P>
1916	</LI>
1917	<LI>a character class <code>'[' (Character|Character'-'Character)* ']'</code> matches
1918	any character in that class. A <TT>Character</TT> is to be considered an
1919	element of a class, if it is listed in the class or if its code lies within
1920	a listed character range <TT>Character'-'Character</TT>. So <code>[a0-3\n]</code>
1921	for instance matches the characters
1922
1923	<P>
1924	<code>a 0 1 2 3 \n</code>
1925
1926	<P>
1927	If the list of characters is empty (i.e. just <code>[]</code>), the expression
1928	matches nothing at all (the empty set), not even the empty string. This
1929	may be useful in combination with the negation operator <code>'!'</code>.
1930
1931	<P>
1932	</LI>
1933	<LI>a negated character class <code>'[^' (Character|Character'-'Character)* ']'</code>
1934	matches all characters not listed in the class. If the list of characters
1935	is empty (i.e. <code>[^]</code>), the expression matches any character of the
1936	input character set.
1937
1938	<P>
1939	</LI>
1940	<LI>a string <TT>'"' StringCharacter+ '"'</TT> matches the exact
1941	text enclosed in double quotes. All meta characters but <code>\</code> and
1942	<TT>"</TT> lose their special meaning inside a string. See also the
1943	<A HREF="#caseless"><TT>%ignorecase</TT></A> switch.
1944
1945	<P>
1946	</LI>
1947	<LI>a macro usage <code>'{' Identifier '}'</code> matches the input that is matched
1948	by the right hand side of the macro with name "<TT>Identifier</TT>".
1949
1950	<P>
1951	<A NAME="predefCharCl"></A></LI>
1952	<LI>a predefined character class matches any of
1953	the characters in that class. There are the following predefined character
1954	classes:
1955
1956	<P>
1957	<TT>.</TT> contains all characters but <code>\n</code>.
1958
1959	<P>
1960	All other predefined character classes are defined in the Unicode
1961	specification or the Java Language Specification and determined by
1962	Java functions of class
1963	<TT>java</TT>.<TT>lang</TT>.<TT>Character</TT>.
1964
1965	<P>
1966	<PRE>
1967	[:jletter:] isJavaIdentifierStart()
1968	[:jletterdigit:] isJavaIdentifierPart()
1969	[:letter:] isLetter()
1970	[:digit:] isDigit()
1971	[:uppercase:] isUpperCase()
1972	[:lowercase:] isLowerCase()
1973	</PRE>
1974
1975	<P>
1976	They are especially useful when working with the unicode character set.
1977
1978	<P>
1979	</LI>
1980	</UL>
1981
1982	<P>
1983	If <TT>a</TT> and <TT>b</TT> are regular expressions, then
1984
1985	<P>
1986	<DL COMPACT>
1987	<DT><TT>a | b</TT></DT>
1988	<DD>(union)
1989
1990	<P>
1991	is the regular expression that matches
1992	all input that is matched by <TT>a</TT> or by <TT>b</TT>.
1993
1994	<P>
1995	</DD>
1996	<DT><TT>a b</TT></DT>
1997	<DD>(concatenation)
1998
1999	<P>
2000	is the regular expression
2001	that matches the input matched by <TT>a</TT> followed by the
2002	input matched by <TT>b</TT>.
2003
2004	<P>
2005	</DD>
2006	<DT><TT>a*</TT></DT>
2007	<DD>(Kleene closure)
2008
2009	<P>
2010	matches zero or more repetitions
2011	of the input matched by <TT>a</TT>
2012
2013	<P>
2014	</DD>
2015	<DT><TT>a+</TT></DT>
2016	<DD>(iteration)
2017
2018	<P>
2019	is equivalent to <TT>aa*</TT>
2020
2021	<P>
2022	</DD>
2023	<DT><TT>a?</TT></DT>
2024	<DD>(option)
2025
2026	<P>
2027	matches the empty input or the input matched
2028	by <TT>a</TT>
2029
2030	<P>
2031	</DD>
2032	<DT><TT>!a</TT></DT>
2033	<DD>(negation)
2034
2035	<P>
2036	matches everything but the strings matched by <TT>a</TT>.
2037	Use with care: the construction of <code>!a</code> involves
2038	an additional, possibly exponential NFA to DFA transformation
2039	on the NFA for <TT>a</TT>. Note that
2040	with negation and union you also have (by applying DeMorgan)
2041	intersection and set difference: the intersection of
2042	<TT>a</TT> and <TT>b</TT> is <code>!(!a|!b)</code>, the expression
2043	that matches everything of <TT>a</TT> not matched by <TT>b</TT> is
2044	<code>!(!a|b)</code>
2045
2046	<P>
2047	</DD>
2048	<DT><TT>~a</TT></DT>
2049	<DD>(upto)
2050
2051	<P>
2052	matches everything up to (and including) the first occurrence of a text
2053	matched by <TT>a</TT>. The expression <code>~a</code> is equivalent
2054	to <code>!([^]* a [^]*) a</code>. A traditional C-style comment
2055	is matched by <code>"/*" ~"*/"</code>
2056
2057	<P>
2058	</DD>
2059	<DT><TT>a{n}</TT></DT>
2060	<DD>(repeat)
2061
2062	<P>
2063	is equivalent to <TT>n</TT> times the concatenation of <TT>a</TT>.
2064	So <code>a{4}</code> for instance is equivalent to the expression <TT>a a a a</TT>.
2065	The decimal integer <TT>n</TT> must be positive.
2066
2067	<P>
2068	</DD>
2069	<DT><TT>a{n,m}</TT></DT>
2070	<DD>is equivalent to at least <TT>n</TT> times and at most <TT>m</TT> times the
2071	concatenation of <TT>a</TT>. So <code>a{2,4}</code> for instance is equivalent
2072	to the expression <code>a a a? a?</code>. Both <TT>n</TT> and <TT>m</TT> are non
2073	negative decimal integers and <TT>m</TT> must not be smaller than <TT>n</TT>.
2074
2075	<P>
2076	</DD>
2077	<DT><TT>( a )</TT></DT>
2078	<DD>matches the same input as <TT>a</TT>.
2079
2080	<P>
2081	</DD>
2082	</DL>
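<P>
The DeMorgan identities given for the negation operator can be checked on individual strings with ordinary Java predicates. The sketch below is only an analogy: <TT>RegexSetOps</TT> and its methods are assumed names, each "language" is a <TT>java.util.regex</TT> pattern that has to match the whole string, and nothing here reflects how JFlex builds its automata.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Illustration only: intersection and difference via negation and union. */
class RegexSetOps {

    /** The set of strings matched completely by the given expression. */
    static Predicate<String> lang(String regex) {
        Pattern p = Pattern.compile(regex);
        return s -> p.matcher(s).matches();
    }

    /** Intersection of a and b, written as !(!a|!b). */
    static Predicate<String> intersection(Predicate<String> a, Predicate<String> b) {
        return a.negate().or(b.negate()).negate();
    }

    /** Everything of a not matched by b, written as !(!a|b). */
    static Predicate<String> difference(Predicate<String> a, Predicate<String> b) {
        return a.negate().or(b).negate();
    }

    public static void main(String[] args) {
        Predicate<String> word = lang("[a-z]+");
        Predicate<String> keyword = lang("if|else");
        Predicate<String> identifier = difference(word, keyword);
        System.out.println(identifier.test("iffy"));
        System.out.println(identifier.test("if"));
    }
}
```

A typical use of set difference is the one shown in <TT>main</TT>: lower case words that are not keywords.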
2083
2084	<P>
2085	In a lexical rule, a regular expression <TT>r</TT> may be preceded by a
2086	'<code>^</code>' (the beginning of line operator). <TT>r</TT> is then
2087	only matched at the beginning of a line in the input. A line begins
2088	after each occurrence of <code>\r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>
2089	(see also [<A
2090	HREF="manual.html#unicode_rep">5</A>]) and at the beginning of input.
2091	The preceding line terminator in the input is not consumed and can
2092	be matched by another rule.
2093
2094	<P>
2095	In a lexical rule, a regular expression <TT>r</TT> may be followed by a
2096	look-ahead expression. A look-ahead expression is either a '<TT>$</TT>'
2097	(the end of line operator) or a <code>'/'</code> followed by an arbitrary
2098	regular expression. In both cases the look-ahead is not consumed and
2099	not included in the matched text region, but it <EM>is</EM> considered
2100	while determining which rule has the longest match (see also
2101	<A HREF="#HowMatched">4.3.3</A> <A HREF="#HowMatched"><I>How the input is matched</I></A>).
2102
2103	<P>
2104	In the '<TT>$</TT>' case <TT>r</TT> is only matched at the end of a line in
2105	the input. The end of a line is denoted by the regular expression
2106	<code>\r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>.
2107	So <code>a$</code> is equivalent to <code>a / \r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>. This is a bit different to the situation described in [<A
2108	HREF="manual.html#unicode_rep">5</A>]:
2109	since in JFlex <code>$</code> is a true trailing context, the end of file
2110	does <B>not</B> count as end of line.
2111
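<P>
The behaviour of <code>a$</code> described above (the line terminator is required but not consumed, and the end of file does not count) resembles a positive lookahead in <TT>java.util.regex</TT>. The following sketch uses that analogy; <TT>EndOfLineLookahead</TT> and <TT>matchStarts</TT> are assumed names, and the pattern is an approximation rather than what JFlex generates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: a$ approximated with a java.util.regex lookahead. */
class EndOfLineLookahead {

    // An "a" that is followed by a line terminator. The terminator is
    // checked by the lookahead but is not part of the matched text.
    static final Pattern A_AT_EOL = Pattern.compile(
        "a(?=\\r\\n|[\\r\\n\\u2028\\u2029\\u000B\\u000C\\u0085])");

    /** Returns the start offsets of all matches in the input. */
    static List<Integer> matchStarts(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = A_AT_EOL.matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The first "a" is followed by LF, the last one only by end of input.
        System.out.println(matchStarts("a\nba"));
    }
}
```

In the sample input only the first <TT>a</TT> matches, because the second one is followed by the end of input rather than by a line terminator.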
2112	<P>
2113	<A NAME="trailingContext"></A>For arbitrary look-ahead (also called <EM>trailing context</EM>) the
2114	expression is matched only when followed by input that matches the
2115	trailing context.
2116
2117	<P>
2118	<A NAME="EOFRule"></A>As of version 1.2, JFlex allows lex/flex style <TT><<EOF>></TT> rules in
2119	lexical specifications. A rule
2120	<PRE>
2121	[StateList] <<EOF>> { some action code }
2122	</PRE>
2123	is very similar to the <A HREF="#eofval"><TT>%eofval</TT> directive</A> (section <A HREF="#eofval">4.2.3</A>).
2124	The difference lies in the optional <TT>StateList</TT> that may precede the <TT><<EOF>></TT> rule. The
2125	action code will only be executed when the end of file is read and the
2126	scanner is currently in one of the lexical states listed in <TT>StateList</TT>.
2127	The same <TT>StateGroup</TT> (see section <A HREF="#HowMatched">4.3.3</A>
2128	<A HREF="#HowMatched"><I>How the input is matched</I></A>) and precedence
2129	rules as in the ``normal'' rule case apply
2130	(i.e. if there is more than one <TT><<EOF>></TT>
2131	rule for a certain lexical state, the action of the one appearing
2132	earlier in the specification will be executed). <TT><<EOF>></TT> rules
2133	override settings of the <TT>%cup</TT> and <TT>%byaccj</TT> options and
2134	should not be mixed with the <TT>%eofval</TT> directive.
2135
2136	<P>
2137	An <TT>Action</TT> consists either of a piece of Java code enclosed in
2138	curly braces or is the special <code>|</code> action. The <code>|</code> action is
2139	an abbreviation for the action of the following expression.
2140
2141	<P>
2142	Example:
2143	<PRE>
2144	expression1 |
2145	expression2 |
2146	expression3 { some action }
2147	</PRE>
2148	is equivalent to the expanded form
2149	<PRE>
2150	expression1 { some action }
2151	expression2 { some action }
2152	expression3 { some action }
2153	</PRE>
2154
2155	<P>
2156	The <code>|</code> action is useful when you work with trailing context expressions. The
2157	expression <TT>a | (c / d) | b</TT> is not syntactically legal, but can
2158	easily be expressed using the <code>|</code> action:
2159	<PRE>
2160	a |
2161	c / d |
2162	b { some action }
2163	</PRE>
2164
2165	<P>
2166
2167	<H3><A NAME="SECTION00053300000000000000"></A><A NAME="HowMatched"></A><BR>
2168	How the input is matched
2169	</H3>
2170	When consuming its input, the scanner determines the regular expression
2171	that matches the longest portion of the input (longest match rule). If
2172	there is more than one regular expression that matches the longest portion
2173	of input (i.e. they all match the same input), the generated scanner chooses
2174	the expression that appears first in the specification. After determining
2175	the active regular expression, the associated action is executed. If there
2176	is no matching regular expression, the scanner terminates the program with
2177	an error message (if the <TT>%standalone</TT> directive has been used, the
2178	scanner prints the unmatched input to <TT>java.lang.System.out</TT> instead
2179	and resumes scanning).
2180
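<P>
The longest match rule and its tie-break can be sketched with a list of <TT>java.util.regex</TT> patterns. This is a simplified model, not the generated DFA: <TT>LongestMatch</TT>, <TT>RULES</TT> and <TT>choose</TT> are assumed names, and the sketch relies on each example pattern matching greedily, which does not hold for arbitrary alternations in <TT>java.util.regex</TT>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: longest match with "first rule wins" on ties. */
class LongestMatch {

    // Example rules in specification order: keyword, identifier, number.
    static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-z]+"),
        Pattern.compile("[0-9]+")
    };

    /** Returns the index of the rule chosen at position pos, or -1 if none matches. */
    static int choose(Pattern[] rules, String input, int pos) {
        int best = -1;
        int bestLength = 0;
        for (int i = 0; i < rules.length; i++) {
            Matcher m = rules[i].matcher(input).region(pos, input.length());
            // Strictly longer only, so an earlier rule keeps a tie.
            if (m.lookingAt() && m.end() - pos > bestLength) {
                best = i;
                bestLength = m.end() - pos;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(choose(RULES, "if", 0));   // keyword and identifier tie
        System.out.println(choose(RULES, "iffy", 0)); // identifier is longer
    }
}
```

For the input <TT>if</TT> both the keyword rule and the identifier rule match two characters, so the earlier keyword rule is chosen; for <TT>iffy</TT> the identifier rule matches more input and wins.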
2181	<P>
2182	Lexical states can be used to further restrict the set of regular expressions
2183	that match the current input.
2184
2185	<P>
2186
2187	<UL>
2188	<LI>A regular expression can only be matched when its associated set of lexical
2189	states includes the currently active lexical state of the scanner or if
2190	the set of associated lexical states is empty and the currently active lexical
2191	state is inclusive. Exclusive and inclusive states only differ at this point:
2192	rules with an empty set of associated states.
2193
2194	<P>
2195	</LI>
2196	<LI>The currently active lexical state of the scanner can be changed from within
2197	an action of a regular expression using the method <TT>yybegin()</TT>.
2198
2199	<P>
2200	</LI>
2201	<LI>The scanner starts in the inclusive lexical state
2202	<TT>YYINITIAL</TT>, which is always declared by default.
2203
2204	<P>
2205	</LI>
2206	<LI>The set of lexical states associated with a regular expression is
2207	the <TT>StateList</TT> that precedes the expression. If a rule is
2208	contained in one or more <TT>StateGroups</TT>, then the states of
2209	these are also associated with the rule, i.e. they accumulate over
2210	<TT>StateGroups</TT>.
2211
2212	<P>
2213	Example:
2214	<PRE>
2215	%states A, B
2216	%xstates C
2217	%%
2218	expr1                   { yybegin(A); action }
2219	<YYINITIAL, A> expr2    { action }
2220	<A> {
2221	  expr3                 { action }
2222	  <B,C> expr4           { action }
2223	}
2224	</PRE>
2225	The first line declares two (inclusive) lexical states <TT>A</TT> and <TT>B</TT>,
2226	the second line an exclusive lexical state <TT>C</TT>.
2227	The default (inclusive) state <TT>YYINITIAL</TT> is always implicitly there and
2228	doesn't need to be declared. The rule with <TT>expr1</TT> has no
2229	states listed, and is thus matched in all states but the exclusive
2230	ones, i.e. <TT>A</TT>, <TT>B</TT>, and <TT>YYINITIAL</TT>. In its
2231	action, the scanner is switched to state <TT>A</TT>. The second rule
2232	<TT>expr2</TT> can only match when the scanner is in state
2233	<TT>YYINITIAL</TT> or <TT>A</TT>. The rule <TT>expr3</TT> can only be
2234	matched in state <TT>A</TT> and <TT>expr4</TT> in states <TT>A</TT>, <TT>B</TT>,
2235	and <TT>C</TT>.
2236
2237	<P>
2238	</LI>
2239	<LI>Lexical states are declared and used as Java <TT>int</TT> constants in
2240	the generated class under the same name as they are used in the specification.
2241	There is no guarantee that the values of these integer constants are
2242	distinct. They are pointers into the generated DFA table, and if JFlex
2243	recognises two states as lexically equivalent (if they are used with the
2244	exact same set of regular expressions), then the two constants will get
2245	the same value.
2246
2247	<P>
2248	</LI>
2249	</UL>
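<P>
The rule that decides whether an expression is active in a given lexical state can be written down in a few lines of Java. The sketch below models the example specification with states <TT>A</TT>, <TT>B</TT> (inclusive) and <TT>C</TT> (exclusive); <TT>StateActivation</TT>, <TT>active</TT> and <TT>states</TT> are assumed names, and states are plain strings here rather than the <TT>int</TT> constants of a generated scanner.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustration only: when is a rule active in the current lexical state? */
class StateActivation {

    // The exclusive states of the example specification.
    static final Set<String> EXCLUSIVE = new HashSet<>(Arrays.asList("C"));

    /**
     * A rule is active if its state list contains the current state, or
     * if its state list is empty and the current state is inclusive.
     */
    static boolean active(Set<String> ruleStates, String current) {
        if (ruleStates.isEmpty()) {
            return !EXCLUSIVE.contains(current);
        }
        return ruleStates.contains(current);
    }

    /** Convenience constructor for a state list. */
    static Set<String> states(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        Set<String> expr1 = states();                 // no state list
        Set<String> expr4 = states("A", "B", "C");    // <B,C> nested in <A> { }
        System.out.println(active(expr1, "C"));
        System.out.println(active(expr4, "C"));
    }
}
```

The state list of <TT>expr4</TT> is written as the accumulated set <TT>A</TT>, <TT>B</TT>, <TT>C</TT>, which is what nesting <code>&lt;B,C&gt;</code> inside the <code>&lt;A&gt;</code> group amounts to.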
2250
2251	<P>
2252
2253	<H3><A NAME="SECTION00053400000000000000">
2254	The generated class</A>
2255	</H3>
2256	JFlex generates exactly one file containing one class from the specification
2257	(unless you have declared another class in the first specification section).
2258
2259	<P>
2260	The generated class contains (among other things) the DFA tables, an input buffer,
2261	the lexical states of the specification, a constructor, and the scanning method
2262	with the user supplied actions.
2263
2264	<P>
2265	The name of the class is by default <TT>Yylex</TT>; it is customisable
2266	with the <TT>%class</TT> directive (see also section
2267	<A HREF="#ClassOptions">4.2.1</A>). The input buffer of the lexer is connected with an
2268	input stream over the <TT>java.io.Reader</TT> object which is passed
2269	to the lexer in the generated constructor. If you want to provide your
2270	own constructor for the lexer, you should always call the generated
2271	one in it to initialise the input buffer. The input buffer should not
2272	be accessed directly, but only over the advertised API (see also
2273	section <A HREF="#ScannerMethods">4.3.5</A>). Its internal implementation may change
2274	between releases or skeleton files without notice.
2275
2276	<P>
2277	The main interface to the outside world is the generated scanning
2278	method (default name <TT>yylex</TT>, default return type
2279	<TT>Yytoken</TT>). Most of its aspects are customisable (name, return
2280	type, declared exceptions etc., see also section
2281	<A HREF="#ScanningMethod">4.2.2</A>). If it is called, it will consume input until
2282	one of the expressions in the specification is matched or an error
2283	occurs. If an expression is matched, the corresponding action is
2284	executed. It may return a value of the specified return type (in which
2285	case the scanning method returns with this value), or if it doesn't
2286	return a value, the scanner resumes consuming input until the next
2287	expression is matched. If the end of file is reached, the scanner
2288	executes the EOF action, and (also upon each further call to the scanning
2289	method) returns the specified EOF value (see also section <A HREF="#EOF">4.2.3</A>).
2290
2291	<P>
2292
2293	<H3><A NAME="SECTION00053500000000000000"></A><A NAME="ScannerMethods"></A><BR>
2294	Scanner methods and fields accessible in actions (API)
2295	</H3>
2296	Generated methods and member fields in JFlex scanners are prefixed
2297	with <TT>yy</TT> to indicate that they are generated and to avoid name
2298	conflicts with user code copied into the class. Since user code is
2299	part of the same class, JFlex has no language means like the
2300	<TT>private</TT> modifier to indicate which members and methods are
2301	internal and which ones belong to the API. Instead, JFlex follows a
2302	naming convention: everything starting with a <TT>zz</TT> prefix like
2303	<TT>zzStartRead</TT> is to be considered internal and subject to
2304	change without notice between JFlex releases. Methods and members of
2305	the generated class that do not have a <TT>zz</TT> prefix like
2306	<TT>yycharat</TT> belong to the API that the scanner class provides to
2307	users in action code of the specification. They will remain stable
2308	and supported between JFlex releases as long as possible.
2309
2310	<P>
2311	Currently, the API consists of the following methods and member fields:
2312
2313	<UL>
2314	<LI><TT>String yytext()</TT>
2315	<BR> returns the matched input text region
2316
2317	<P>
2318	</LI>
2319	<LI><TT>int yylength()</TT>
2320	<BR> returns the length of the matched input text region (does not require
2321	a <TT>String</TT> object to be created)
2322
2323	<P>
2324	</LI>
2325	<LI><TT>char yycharat(int pos)</TT>
2326	<BR> returns the character at position <TT>pos</TT> from the matched text.
2327	It is equivalent to <TT>yytext().charAt(pos)</TT>, but faster. <TT> pos</TT> must be a value from <TT>0</TT> to <TT>yylength()-1</TT>.
2328
2329	<P>
2330	</LI>
2331	<LI><TT>void yyclose()</TT>
2332	<BR> closes the input stream. All subsequent calls to the scanning method will
2333	return the end of file value
2334
2335	<P>
2336	</LI>
2337	<LI><TT>void yyreset(java.io.Reader reader)</TT>
2338	<BR> closes the current input stream, and resets the scanner to read from
2339	a new input stream. All internal variables are reset, the old input
2340	stream <EM>cannot</EM> be reused (content of the internal buffer is
2341	discarded and lost). The lexical state is set to <TT>YYINITIAL</TT>.
2342
2343	<P>
2344	</LI>
2345	<LI><TT>void yypushStream(java.io.Reader reader)</TT>
2346	<BR> Stores the current input stream on a stack, and
2347	reads from a new stream. Lexical state, line,
2348	char, and column counting remain untouched.
2349	The current input stream can be restored with
2350	<TT>yypopStream</TT> (usually in an <TT><<EOF>></TT> action).
2351
2352	<P>
2353	A typical example for this are include files in
2354	style of the C pre-processor. The corresponding
2355	JFlex specification could look somewhat like this:
2356	<PRE>
2357	"#include" {FILE} { yypushStream(new FileReader(getFile(yytext()))); }
2358	..
2359	<<EOF>> { if (yymoreStreams()) yypopStream(); else return EOF; }
2360	</PRE>
2361
2362	<P>
2363	This method is only available in the skeleton file
2364	<TT>skeleton.nested</TT>. You can find it in the
2365	<TT>src</TT> directory of the JFlex distribution.
2366
2367	<P>
2368	</LI>
2369	<LI><TT>void yypopStream()</TT>
2370	<BR> Closes the current input stream and continues to
2371	read from the one on top of the stream stack.
2372
2373	<P>
2374	This method is only available in the skeleton file
2375	<TT>skeleton.nested</TT>. You can find it in the
2376	<TT>src</TT> directory of the JFlex distribution.
2377
2378	<P>
2379	</LI>
2380	<LI><TT>boolean yymoreStreams()</TT>
2381	<BR> Returns true iff there are still streams for <TT>yypopStream</TT>
2382	left to read from on the stream stack.
2383
2384	<P>
2385	This method is only available in the skeleton file
2386	<TT>skeleton.nested</TT>. You can find it in the
2387	<TT>src</TT> directory of the JFlex distribution.
2388
2389	<P>
2390	</LI>
2391	<LI><TT>int yystate()</TT>
2392	<BR> returns the current lexical state of the scanner.
2393
2394	<P>
2395	</LI>
2396	<LI><TT>void yybegin(int lexicalState)</TT>
2397	<BR> enters the lexical state <TT>lexicalState</TT>
2398
2399	<P>
2400	</LI>
2401	<LI><TT>void yypushback(int number)</TT>
2402	<BR> pushes <TT>number</TT> characters of the matched text back into the input stream.
2403	They will be read again in the next call of the scanning method.
2404	The number of characters to be read again must not be greater than the length
2405	of the matched text. After the call of <TT>yypushback</TT>, the pushed back
2406	characters are no longer included in <TT>yylength()</TT> and <TT>yytext()</TT>.
2407	Please note that in Java strings are unchangeable, i.e. an action code like
2408	<PRE>
2409	String matched = yytext();
2410	yypushback(1);
2411	return matched;
2412	</PRE>
2413	will return the whole matched text, while
2414	<PRE>
2415	yypushback(1);
2416	return yytext();
2417	</PRE>
2418	will return the matched text minus the last character.
2419
2420	<P>
2421	</LI>
2422	<LI><TT>int yyline</TT>
2423	<BR> contains the current line of input (starting with 0, only active with
2424	the <TT><A HREF="#Counting">%line</A></TT> directive)
2425
2426	<P>
2427	</LI>
2428	<LI><TT>int yychar</TT>
2429	<BR> contains the current character count in the input (starting with 0,
2430	only active with the <TT><A HREF="#Counting">%char</A></TT> directive)
2431
2432	<P>
2433	</LI>
2434	<LI><TT>int yycolumn</TT>
2435	<BR> contains the current column of the current line (starting with 0, only
2436	active with the <TT><A HREF="#Counting">%column</A></TT> directive)
2437
2438	<P>
2439	</LI>
2440	</UL>
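<P>
The idea behind <TT>yypushStream</TT>, <TT>yypopStream</TT> and <TT>yymoreStreams</TT> is a stack of readers. The following sketch shows that idea in isolation. It is not the code of <TT>skeleton.nested</TT>: <TT>NestedInput</TT> and its methods are assumed names, input is read character by character without a buffer, and finished streams are popped automatically instead of in an end of file action.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustration only: a stack of input streams for include-style processing. */
class NestedInput {

    private final Deque<Reader> stack = new ArrayDeque<>();
    private Reader current;

    NestedInput(Reader initial) {
        this.current = initial;
    }

    /** Suspends the current stream and continues reading from the new one. */
    void pushStream(Reader reader) {
        stack.push(current);
        current = reader;
    }

    /** True if a suspended stream is waiting on the stack. */
    boolean moreStreams() {
        return !stack.isEmpty();
    }

    /** Closes the current stream and resumes the most recently suspended one. */
    void popStream() {
        try {
            current.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        current = stack.pop();
    }

    /** Reads one character; returns -1 only when all streams are exhausted. */
    int read() {
        try {
            int c = current.read();
            while (c == -1 && moreStreams()) {
                popStream();
                c = current.read();
            }
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NestedInput in = new NestedInput(new StringReader("ab"));
        StringBuilder seen = new StringBuilder();
        seen.append((char) in.read());
        in.pushStream(new StringReader("XY")); // "include" another source
        int c;
        while ((c = in.read()) != -1) {
            seen.append((char) c);
        }
        System.out.println(seen);
    }
}
```

After the included text is consumed, reading continues in the outer stream exactly where it was suspended.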
2441
2442	<P>
2443
2444	<H1><A NAME="SECTION00060000000000000000"></A><A NAME="sec:encodings"></A><BR>
2445	Encodings, Platforms, and Unicode
2446	</H1>
2447
2448	<P>
2449	This section tries to shed some light on the issues of Unicode and
2450	encodings, cross platform scanning, and how to deal with binary data.
2451	My thanks go to Stephen Ostermiller for his input on this topic.
2452
2453	<P>
2454
2455	<H2><A NAME="SECTION00061000000000000000"></A><A NAME="sec:howtoencoding"></A><BR>
2456	The Problem
2457	</H2>
2458
2459	<P>
2460	Before we dive straight into details, let's take a look at what the
2461	problem is. The problem is Java's platform independence when you want
2462	to use it. For scanners the interesting part about platform
2463	independence is character encodings and how they are handled.
2464
2465	<P>
2466	If a program reads a file from disk, it gets a stream of bytes. In
2467	earlier times, when the grass was green, and the world was much
2468	simpler, everybody knew that the byte value 65 is, of course, an A.
2469	It was no problem to see which bytes meant which characters (actually
2470	these times never existed, but anyway). The normal Latin alphabet
2471	only has 26 characters, so 7 bits or 128 distinct values should surely
2472	be enough to map them, even if you allow yourself the luxury of upper
2473	and lower case. Nowadays, things are different. The world suddenly
2474	grew much larger, and all kinds of people wanted all kinds of special
2475	characters, just because they use them in their language and writing.
2476	This is where the mess starts. Since the 128 distinct values were
2477	already filled up with other stuff, people began to use all 8 bits of
2478	the byte, and extended the byte/character mappings to fit their need,
2479	and of course everybody did it differently. Some people for instance
2480	may have said ``let's use the value 213 for the German character ä''. Others
2481	may have found that 213 should much rather mean é, because they didn't need
2482	German and wrote French instead. As long as you use your program and
2483	data files only on one platform, this is no problem, as all know what
2484	means what, and everything gets used consistently.
2485
2486	<P>
2487	Now Java comes into play, and wants to run everywhere (once written,
2488	that is) and now there suddenly is a problem: how do I get the same
2489	program to say ä to a certain byte when it runs in Germany and maybe é
2490	when it runs in France? And also the other way around: when I want to
2491	say é on the screen, which byte value should I send to the operating
2492	system?
2493
2494	<P>
2495	Java's solution to this is to use Unicode internally. Unicode aims to
2496	be a superset of all known character sets and is therefore a perfect base
2497	for encoding things that might get used all over the world. To make
2498	things work correctly, you still have to know where you are and how to
2499	map byte values to Unicode characters and vice versa, but the
2500	important thing is, that this mapping is at least possible (you can
2501	map Kanji characters to Unicode, but you cannot map them to ASCII or
2502	iso-latin-1).
2503
2504	<P>
2505
2506	<H2><A NAME="SECTION00062000000000000000"></A><A NAME="sec:howtotext"></A><BR>
2507	Scanning text files
2508	</H2>
2509
2510	<P>
2511	Scanning text files is the standard application for scanners like
2512	JFlex. Therefore it should also be the most convenient one. Most times
2513	it is.
2514
2515	<P>
2516	The following scenario works like a breeze:
2517	You work on a platform X, write your lexer specification there, can
2518	use any obscure Unicode character in it as you like, and compile the
2519	program. Your users work on any platform Y (possibly but not
2520	necessarily something different from X), they write their input files
2521	on Y and they run your program on Y. No problems.
2522
2523	<P>
2524	Java does this as follows:
2525	If you want to read anything in Java that is supposed to contain text,
2526	you use a <TT>FileReader</TT> or some <TT>InputStream</TT> together with
2527	an <TT>InputStreamReader</TT>. <TT>InputStreams</TT> return the raw bytes, the
2528	<TT>InputStreamReader</TT> converts the bytes into Unicode characters with
2529	the platform's default encoding. If a text file is produced on the
2530	same platform, the platform's default encoding should do the mapping
2531	correctly. Since JFlex also uses readers and Unicode internally, this
2532	mechanism also works for the scanner specifications. If you write an
2533	<TT>A</TT> in your text editor and the editor uses the platform's encoding (say <TT>A</TT> is 65),
2534	then Java translates this into the logical Unicode <TT>A</TT> internally.
2535	If a user writes an <TT>A</TT> on a completely different platform (say <TT>A</TT> is 237 there),
2536	then Java also translates this into the logical Unicode <TT>A</TT> internally. Scanning
2537	is performed after that translation and both match.
2538
2539	<P>
2540	Note that because of this mapping from bytes to characters, you should always
2541	use the <TT>%unicode</TT> switch in your lexer specification if you want to scan
2542	text files. <TT>%8bit</TT> may not be enough, even if
2543	you know that your platform only uses one byte per character. The encoding
2544	Cp1252 used on many Windows machines, for instance, defines at most 256 characters, but
2545	the character ’ (right single quotation mark) with Cp1252 code <code>\x92</code> has the Unicode value <code>\u2019</code>, which
2546	is larger than 255 and which would make your scanner throw an
2547	<TT>ArrayIndexOutOfBoundsException</TT> if it is encountered.
2548
2549	<P>
2550	So for the usual case you don't have to do anything but use the
2551	<TT>%unicode</TT> switch in your lexer specification.
2552
2553	<P>
2554	Things may break when you produce a text file on platform X and
2555	consume it on a different platform Y. Let's say you have a file
2556	written on a Windows PC using the encoding Cp1252. Then you move
2557	this file to a Linux PC with encoding ISO 8859-1 and there you want
2558	to run your scanner on it. Java now thinks the file is encoded
2559	in ISO 8859-1 (the platform's default encoding) while it really is
2560	encoded in Cp1252. For most characters
2561	Cp1252 and ISO 8859-1 are the same, but for the byte values <code>\x80</code>
2562	to <code>\x9f</code> they disagree: ISO 8859-1 is undefined there. You can fix
2563	the problem by telling Java explicitly which encoding to use. When
2564	constructing the <TT>InputStreamReader</TT>, you can give the encoding
2565	as argument. The line
2566	<DIV ALIGN="CENTER">
2567	<TT>Reader r = new InputStreamReader(input, "Cp1252"); </TT>
2568
2569	</DIV>
2570	will do the trick.
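
<P>
As a minimal sketch of the difference an explicit encoding makes, the
following stand-alone program decodes the single byte <code>\x92</code> once as
Cp1252 and once as ISO 8859-1. The class name <TT>EncodingDemo</TT> and the
helper <TT>firstChar</TT> are made up for this illustration and are not part
of JFlex; only the <TT>InputStreamReader</TT> constructor is taken from the
text above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of JFlex.
class EncodingDemo {

    // Decodes the given bytes with the named encoding and returns the
    // first resulting character as an int (or -1 for empty input).
    static int firstChar(byte[] bytes, String encoding) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), encoding)) {
            return r.read();
        } catch (IOException e) {
            // also covers UnsupportedEncodingException for unknown names
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0x92 };
        // Same byte, two different characters depending on the encoding.
        System.out.printf("Cp1252:     U+%04X%n", firstChar(data, "Cp1252"));
        System.out.printf("ISO-8859-1: U+%04X%n", firstChar(data, "ISO-8859-1"));
    }
}
```

<P>
With Cp1252 the byte becomes <code>\u2019</code>, a value above 255, which is
also why <TT>%8bit</TT> is not sufficient for such input.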
2571
2572	<P>
2573	Of course the encoding to use can also come from the data itself:
2574	for instance, when you scan an HTML page, it may have embedded
2575	information about its character encoding in the headers.
2576
2577	<P>
2578	More information about encodings, which ones are supported, what
2579	they are called, and how to set them may be found in the
2580	official Java documentation in the chapter about
2581	internationalisation.
2582	The link
2583	<A NAME="tex2html6"
2584	HREF="http://java.sun.com/j2se/1.3/docs/guide/intl/"><TT>http://java.sun.com/j2se/1.3/docs/guide/intl/</TT></A>
2585	leads to an online version of this for Sun's JDK 1.3.
2586
2587	<P>
2588
2589	<H2><A NAME="SECTION00063000000000000000"></A><A NAME="sec:howtobinary"></A><BR>
2590	Scanning binaries
2591	</H2>
2592
2593	<P>
2594	Scanning binaries is both easier and more difficult
2595	than scanning text files. It's easier because you want
2596	the raw bytes and not their meaning, i.e. you don't want
2597	any translation.
2598	It's more difficult because it's not so easy to get
2599	"no translation" when you use Java readers.
2600
2601	<P>
2602	The problem (for binaries) is that JFlex scanners are
2603	designed to work on text. Therefore the interface is
2604	the <TT>Reader</TT> class (there is a constructor
2605	for <TT>InputStream</TT> instances, but it's just there
2606	for convenience and wraps an <TT>InputStreamReader</TT>
2607	around it to get characters, not bytes).
2608	You can still get a binary scanner when you write
2609	your own custom <TT>InputStreamReader</TT> class that
2610	explicitly does no translation, but just copies
2611	byte values to character codes instead. It sounds
2612	quite easy, and actually it is no big deal, but there
2613	are a few little pitfalls on the way. In the scanner
2614	specification you can only enter positive character
2615	codes (for bytes that is <code>\x00</code>
2616	to <code>\xFF</code>). Java's <TT>byte</TT> type on the other hand
2617	is a signed 8 bit integer (-128 to 127), so you have to convert
2618	them properly in your custom <TT>Reader</TT>. Also, you should
2619	take care when you write your lexer spec: if you
2620	use text in there, it gets interpreted by an encoding
2621	first, and what scanner you get as result might depend
2622	on which platform you run JFlex on when you generate
2623	the scanner (this is what you want for text, but for binaries it
2624	gets in the way). If you are not sure, or if the development
2625	platform might change, it's probably best to use character
2626	code escapes in all places, since they don't change their
2627	meaning.
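
<P>
The signed byte conversion mentioned above can be sketched as follows. This
is an illustration only: the class <TT>RawByteReader</TT> is a hypothetical
name and is not the reader shipped with the JFlex examples. It extends
<TT>Reader</TT> directly and maps every byte to the character code with the
same unsigned value.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

// Hypothetical helper, not part of JFlex: passes bytes through as chars 0..255.
class RawByteReader extends Reader {
    private final InputStream in;

    RawByteReader(InputStream in) {
        this.in = in;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        byte[] buf = new byte[len];
        int n = in.read(buf, 0, len);
        if (n < 0) return -1;                        // end of stream
        for (int i = 0; i < n; i++) {
            // Java bytes are signed (-128..127); the mask yields 0..255.
            cbuf[off + i] = (char) (buf[i] & 0xFF);
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    public static void main(String[] args) throws IOException {
        byte[] magic = { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE };
        try (Reader r = new RawByteReader(new ByteArrayInputStream(magic))) {
            int c;
            while ((c = r.read()) != -1) {
                System.out.printf("\\x%02X ", c);    // print each char as an escape
            }
            System.out.println();
        }
    }
}
```

<P>
Without the mask <code>&amp; 0xFF</code>, the byte <code>\xCA</code> would be
sign-extended and end up as the character <code>\uFFCA</code>, which no rule
written with escapes in the range <code>\x00</code> to <code>\xFF</code> could
match.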
2628
2629	<P>
2630	To illustrate these points, the example in <TT>examples/binary</TT>
2631	contains a very small binary scanner that tries to
2632	detect if a file is a Java <TT>class</TT> file. For that
2633	purpose it checks whether the file begins with the magic number <code>0xCAFEBABE</code>, i.e. the four bytes <code>\xCA\xFE\xBA\xBE</code>.
2634
2635	<P>
2636
2637	<H1><A NAME="SECTION00070000000000000000"></A><A NAME="performance"></A><BR>
2638	A few words on performance
2639	</H1>
2640	This section gives some empirical results about the speed of JFlex generated
2641	scanners in comparison to those generated by JLex,
2642	compares a JFlex scanner with a <A HREF="#PerformanceHandwritten">handwritten</A>
2643	one, and presents some <A HREF="#PerformanceTips">tips</A> on how to make
2644	your specification produce a faster scanner.
2645
2646	<P>
2647
2648	<H2><A NAME="SECTION00071000000000000000"></A><A NAME="PerformanceJLex"></A><BR>
2649	Comparison of JLex and JFlex
2650	</H2>
2651	Scanners generated by the tool JLex are quite fast. It was however
2652	possible to further improve the performance of generated scanners
2653	using JFlex. The following table shows the results that were produced
2654	by the scanner specification of a small toy programming language (the
2655	example from the JLex web site). The scanner was generated using JLex
2656	1.2.6 and JFlex version 1.3.5 with all three different JFlex code
2657	generation methods. Then it was run on a Windows 98 system using Sun's JDK
2658	1.3 with different sample inputs of that toy programming language. All
2659	test runs were made under the same conditions on an otherwise idle
2660	machine.
2661
2662	<P>
2663	The values presented in the table denote the time from the first call
2664	to the scanning method to returning the EOF value and the speedup over
2665	JLex in percent. The tests were run both in the mixed (HotSpot) JVM mode and
2666	the pure interpreted mode. Although the mixed mode JVM brings
2667	about a factor of 10 performance improvement, the difference between
2668	JLex and JFlex only decreases slightly.
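
<P>
The manual does not spell out how the speedup column is computed. The table
values are consistent with the formula (JLex time / JFlex time - 1) * 100,
which the following sketch assumes. The class name <TT>Speedup</TT> is
hypothetical.

```java
// Hypothetical helper; the formula is inferred from the table values,
// it is not stated in the manual.
class Speedup {

    // Speedup of the improved time over the baseline in percent,
    // rounded to one decimal place.
    static double percent(double baselineMs, double improvedMs) {
        double raw = (baselineMs / improvedMs - 1.0) * 100.0;
        return Math.round(raw * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // 496 KB input, HotSpot: JLex 325 ms, JFlex %switch 261 ms
        System.out.println(percent(325, 261) + " %");   // prints "24.5 %"
    }
}
```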
2669
2670	<P>
2671	<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
2672	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">KB</TD>
2673	<TD ALIGN="CENTER">JVM</TD>
2674	<TD ALIGN="RIGHT">JLex</TD>
2675	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD>
2676	<TD ALIGN="RIGHT">speedup</TD>
2677	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD>
2678	<TD ALIGN="RIGHT">speedup</TD>
2679	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD>
2680	<TD ALIGN="RIGHT">speedup</TD>
2681	</TR>
2682	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD>
2683	<TD ALIGN="CENTER">hotspot</TD>
2684	<TD ALIGN="RIGHT">325 ms</TD>
2685	<TD ALIGN="RIGHT">261 ms</TD>
2686	<TD ALIGN="RIGHT">24.5 %</TD>
2687	<TD ALIGN="RIGHT">261 ms</TD>
2688	<TD ALIGN="RIGHT">24.5 %</TD>
2689	<TD ALIGN="RIGHT">261 ms</TD>
2690	<TD ALIGN="RIGHT">24.5 %</TD>
2691	</TR>
2692	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD>
2693	<TD ALIGN="CENTER">hotspot</TD>
2694	<TD ALIGN="RIGHT">127 ms</TD>
2695	<TD ALIGN="RIGHT">98 ms</TD>
2696	<TD ALIGN="RIGHT">29.6 %</TD>
2697	<TD ALIGN="RIGHT">94 ms</TD>
2698	<TD ALIGN="RIGHT">35.1 %</TD>
2699	<TD ALIGN="RIGHT">96 ms</TD>
2700	<TD ALIGN="RIGHT">32.3 %</TD>
2701	</TR>
2702	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD>
2703	<TD ALIGN="CENTER">hotspot</TD>
2704	<TD ALIGN="RIGHT">66 ms</TD>
2705	<TD ALIGN="RIGHT">50 ms</TD>
2706	<TD ALIGN="RIGHT">32.0 %</TD>
2707	<TD ALIGN="RIGHT">50 ms</TD>
2708	<TD ALIGN="RIGHT">32.0 %</TD>
2709	<TD ALIGN="RIGHT">48 ms</TD>
2710	<TD ALIGN="RIGHT">37.5 %</TD>
2711	</TR>
2712	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD>
2713	<TD ALIGN="CENTER">interpr.</TD>
2714	<TD ALIGN="RIGHT">4009 ms</TD>
2715	<TD ALIGN="RIGHT">3025 ms</TD>
2716	<TD ALIGN="RIGHT">32.5 %</TD>
2717	<TD ALIGN="RIGHT">3258 ms</TD>
2718	<TD ALIGN="RIGHT">23.1 %</TD>
2719	<TD ALIGN="RIGHT">3231 ms</TD>
2720	<TD ALIGN="RIGHT">24.1 %</TD>
2721	</TR>
2722	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD>
2723	<TD ALIGN="CENTER">interpr.</TD>
2724	<TD ALIGN="RIGHT">1641 ms</TD>
2725	<TD ALIGN="RIGHT">1155 ms</TD>
2726	<TD ALIGN="RIGHT">42.1 %</TD>
2727	<TD ALIGN="RIGHT">1245 ms</TD>
2728	<TD ALIGN="RIGHT">31.8 %</TD>
2729	<TD ALIGN="RIGHT">1234 ms</TD>
2730	<TD ALIGN="RIGHT">33.0 %</TD>
2731	</TR>
2732	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD>
2733	<TD ALIGN="CENTER">interpr.</TD>
2734	<TD ALIGN="RIGHT">817 ms</TD>
2735	<TD ALIGN="RIGHT">573 ms</TD>
2736	<TD ALIGN="RIGHT">42.6 %</TD>
2737	<TD ALIGN="RIGHT">617 ms</TD>
2738	<TD ALIGN="RIGHT">32.4 %</TD>
2739	<TD ALIGN="RIGHT">613 ms</TD>
2740	<TD ALIGN="RIGHT">33.3 %</TD>
2741	</TR>
2742	</TABLE>
2743
2744	<P><BR>
2745
2746	<P>
2747	Since the scanning time of the lexical analyser examined in the table
2748	above includes lexical actions that often need to create new object instances,
2749	another table shows the execution time for the same specification with empty
2750	lexical actions to compare the pure scanning engines.
2751
2752	<P>
2753	<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
2754	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">KB</TD>
2755	<TD ALIGN="CENTER">JVM</TD>
2756	<TD ALIGN="RIGHT">JLex</TD>
2757	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD>
2758	<TD ALIGN="RIGHT">speedup</TD>
2759	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD>
2760	<TD ALIGN="RIGHT">speedup</TD>
2761	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD>
2762	<TD ALIGN="RIGHT">speedup</TD>
2763	</TR>
2764	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD>
2765	<TD ALIGN="CENTER">hotspot</TD>
2766	<TD ALIGN="RIGHT">204 ms</TD>
2767	<TD ALIGN="RIGHT">140 ms</TD>
2768	<TD ALIGN="RIGHT">45.7 %</TD>
2769	<TD ALIGN="RIGHT">138 ms</TD>
2770	<TD ALIGN="RIGHT">47.8 %</TD>
2771	<TD ALIGN="RIGHT">140 ms</TD>
2772	<TD ALIGN="RIGHT">45.7 %</TD>
2773	</TR>
2774	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD>
2775	<TD ALIGN="CENTER">hotspot</TD>
2776	<TD ALIGN="RIGHT">83 ms</TD>
2777	<TD ALIGN="RIGHT">55 ms</TD>
2778	<TD ALIGN="RIGHT">50.9 %</TD>
2779	<TD ALIGN="RIGHT">52 ms</TD>
2780	<TD ALIGN="RIGHT">59.6 %</TD>
2781	<TD ALIGN="RIGHT">52 ms</TD>
2782	<TD ALIGN="RIGHT">59.6 %</TD>
2783	</TR>
2784	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD>
2785	<TD ALIGN="CENTER">hotspot</TD>
2786	<TD ALIGN="RIGHT">41 ms</TD>
2787	<TD ALIGN="RIGHT">28 ms</TD>
2788	<TD ALIGN="RIGHT">46.4 %</TD>
2789	<TD ALIGN="RIGHT">26 ms</TD>
2790	<TD ALIGN="RIGHT">57.7 %</TD>
2791	<TD ALIGN="RIGHT">26 ms</TD>
2792	<TD ALIGN="RIGHT">57.7 %</TD>
2793	</TR>
2794	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD>
2795	<TD ALIGN="CENTER">interpr.</TD>
2796	<TD ALIGN="RIGHT">2983 ms</TD>
2797	<TD ALIGN="RIGHT">2036 ms</TD>
2798	<TD ALIGN="RIGHT">46.5 %</TD>
2799	<TD ALIGN="RIGHT">2230 ms</TD>
2800	<TD ALIGN="RIGHT">33.8 %</TD>
2801	<TD ALIGN="RIGHT">2232 ms</TD>
2802	<TD ALIGN="RIGHT">33.6 %</TD>
2803	</TR>
2804	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD>
2805	<TD ALIGN="CENTER">interpr.</TD>
2806	<TD ALIGN="RIGHT">1260 ms</TD>
2807	<TD ALIGN="RIGHT">793 ms</TD>
2808	<TD ALIGN="RIGHT">58.9 %</TD>
2809	<TD ALIGN="RIGHT">865 ms</TD>
2810	<TD ALIGN="RIGHT">45.7 %</TD>
2811	<TD ALIGN="RIGHT">867 ms</TD>
2812	<TD ALIGN="RIGHT">45.3 %</TD>
2813	</TR>
2814	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD>
2815	<TD ALIGN="CENTER">interpr.</TD>
2816	<TD ALIGN="RIGHT">628 ms</TD>
2817	<TD ALIGN="RIGHT">395 ms</TD>
2818	<TD ALIGN="RIGHT">59.0 %</TD>
2819	<TD ALIGN="RIGHT">432 ms</TD>
2820	<TD ALIGN="RIGHT">45.4 %</TD>
2821	<TD ALIGN="RIGHT">432 ms</TD>
2822	<TD ALIGN="RIGHT">45.4 %</TD>
2823	</TR>
2824	</TABLE>
2825
2826	<P><BR>
2827
2828	<P>
2829	Execution time of single instructions depends on the platform and
2830	the implementation of the Java Virtual Machine the program is executed
2831	on. Therefore the tables above cannot be used as a guide to which
2832	code generation method of JFlex is the right one to choose in general.
2833	The following table was produced by the same lexical specification and
2834	the same input on a Linux system also using Sun's JDK 1.3.
2835
2836	<P>
2837	With actions:
2838
2839	<P>
2840	<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
2841	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">KB</TD>
2842	<TD ALIGN="CENTER">JVM</TD>
2843	<TD ALIGN="RIGHT">JLex</TD>
2844	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD>
2845	<TD ALIGN="RIGHT">speedup</TD>
2846	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD>
2847	<TD ALIGN="RIGHT">speedup</TD>
2848	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD>
2849	<TD ALIGN="RIGHT">speedup</TD>
2850	</TR>
2851	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD>
2852	<TD ALIGN="CENTER">hotspot</TD>
2853	<TD ALIGN="RIGHT">246 ms</TD>
2854	<TD ALIGN="RIGHT">203 ms</TD>
2855	<TD ALIGN="RIGHT">21.2 %</TD>
2856	<TD ALIGN="RIGHT">193 ms</TD>
2857	<TD ALIGN="RIGHT">27.5 %</TD>
2858	<TD ALIGN="RIGHT">190 ms</TD>
2859	<TD ALIGN="RIGHT">29.5 %</TD>
2860	</TR>
2861	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD>
2862	<TD ALIGN="CENTER">hotspot</TD>
2863	<TD ALIGN="RIGHT">99 ms</TD>
2864	<TD ALIGN="RIGHT">76 ms</TD>
2865	<TD ALIGN="RIGHT">30.3 %</TD>
2866	<TD ALIGN="RIGHT">69 ms</TD>
2867	<TD ALIGN="RIGHT">43.5 %</TD>
2868	<TD ALIGN="RIGHT">70 ms</TD>
2869	<TD ALIGN="RIGHT">41.4 %</TD>
2870	</TR>
2871	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD>
2872	<TD ALIGN="CENTER">hotspot</TD>
2873	<TD ALIGN="RIGHT">48 ms</TD>
2874	<TD ALIGN="RIGHT">36 ms</TD>
2875	<TD ALIGN="RIGHT">33.3 %</TD>
2876	<TD ALIGN="RIGHT">34 ms</TD>
2877	<TD ALIGN="RIGHT">41.2 %</TD>
2878	<TD ALIGN="RIGHT">35 ms</TD>
2879	<TD ALIGN="RIGHT">37.1 %</TD>
2880	</TR>
2881	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD>
2882	<TD ALIGN="CENTER">interpr.</TD>
2883	<TD ALIGN="RIGHT">3251 ms</TD>
2884	<TD ALIGN="RIGHT">2247 ms</TD>
2885	<TD ALIGN="RIGHT">44.7 %</TD>
2886	<TD ALIGN="RIGHT">2430 ms</TD>
2887	<TD ALIGN="RIGHT">33.8 %</TD>
2888	<TD ALIGN="RIGHT">2444 ms</TD>
2889	<TD ALIGN="RIGHT">33.0 %</TD>
2890	</TR>
2891	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD>
2892	<TD ALIGN="CENTER">interpr.</TD>
2893	<TD ALIGN="RIGHT">1320 ms</TD>
2894	<TD ALIGN="RIGHT">848 ms</TD>
2895	<TD ALIGN="RIGHT">55.7 %</TD>
2896	<TD ALIGN="RIGHT">958 ms</TD>
2897	<TD ALIGN="RIGHT">37.8 %</TD>
2898	<TD ALIGN="RIGHT">920 ms</TD>
2899	<TD ALIGN="RIGHT">43.5 %</TD>
2900	</TR>
2901	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD>
2902	<TD ALIGN="CENTER">interpr.</TD>
2903	<TD ALIGN="RIGHT">658 ms</TD>
2904	<TD ALIGN="RIGHT">423 ms</TD>
2905	<TD ALIGN="RIGHT">55.6 %</TD>
2906	<TD ALIGN="RIGHT">456 ms</TD>
2907	<TD ALIGN="RIGHT">44.3 %</TD>
2908	<TD ALIGN="RIGHT">452 ms</TD>
2909	<TD ALIGN="RIGHT">45.6 %</TD>
2910	</TR>
2911	</TABLE>
2912
2913	<P><BR>
2914
2915	<P>
2916	Without actions:
2917
2918	<P>
2919	<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
2920	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">KB</TD>
2921	<TD ALIGN="CENTER">JVM</TD>
2922	<TD ALIGN="RIGHT">JLex</TD>
2923	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD>
2924	<TD ALIGN="RIGHT">speedup</TD>
2925	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD>
2926	<TD ALIGN="RIGHT">speedup</TD>
2927	<TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD>
2928	<TD ALIGN="RIGHT">speedup</TD>
2929	</TR>
2930	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD>
2931	<TD ALIGN="CENTER">hotspot</TD>
2932	<TD ALIGN="RIGHT">136 ms</TD>
2933	<TD ALIGN="RIGHT">78 ms</TD>
2934	<TD ALIGN="RIGHT">74.4 %</TD>
2935	<TD ALIGN="RIGHT">76 ms</TD>
2936	<TD ALIGN="RIGHT">78.9 %</TD>
2937	<TD ALIGN="RIGHT">77 ms</TD>
2938	<TD ALIGN="RIGHT">76.6 %</TD>
2939	</TR>
2940	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD>
2941	<TD ALIGN="CENTER">hotspot</TD>
2942	<TD ALIGN="RIGHT">59 ms</TD>
2943	<TD ALIGN="RIGHT">31 ms</TD>
2944	<TD ALIGN="RIGHT">90.3 %</TD>
2945	<TD ALIGN="RIGHT">48 ms</TD>
2946	<TD ALIGN="RIGHT">22.9 %</TD>
2947	<TD ALIGN="RIGHT">32 ms</TD>
2948	<TD ALIGN="RIGHT">84.4 %</TD>
2949	</TR>
2950	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD>
2951	<TD ALIGN="CENTER">hotspot</TD>
2952	<TD ALIGN="RIGHT">28 ms</TD>
2953	<TD ALIGN="RIGHT">15 ms</TD>
2954	<TD ALIGN="RIGHT">86.7 %</TD>
2955	<TD ALIGN="RIGHT">15 ms</TD>
2956	<TD ALIGN="RIGHT">86.7 %</TD>
2957	<TD ALIGN="RIGHT">15 ms</TD>
2958	<TD ALIGN="RIGHT">86.7 %</TD>
2959	</TR>
2960	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD>
2961	<TD ALIGN="CENTER">interpr.</TD>
2962	<TD ALIGN="RIGHT">1992 ms</TD>
2963	<TD ALIGN="RIGHT">1047 ms</TD>
2964	<TD ALIGN="RIGHT">90.3 %</TD>
2965	<TD ALIGN="RIGHT">1246 ms</TD>
2966	<TD ALIGN="RIGHT">59.9 %</TD>
2967	<TD ALIGN="RIGHT">1215 ms</TD>
2968	<TD ALIGN="RIGHT">64.0 %</TD>
2969	</TR>
2970	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD>
2971	<TD ALIGN="CENTER">interpr.</TD>
2972	<TD ALIGN="RIGHT">859 ms</TD>
2973	<TD ALIGN="RIGHT">408 ms</TD>
2974	<TD ALIGN="RIGHT">110.5 %</TD>
2975	<TD ALIGN="RIGHT">479 ms</TD>
2976	<TD ALIGN="RIGHT">79.3 %</TD>
2977	<TD ALIGN="RIGHT">487 ms</TD>
2978	<TD ALIGN="RIGHT">76.4 %</TD>
2979	</TR>
2980	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD>
2981	<TD ALIGN="CENTER">interpr.</TD>
2982	<TD ALIGN="RIGHT">435 ms</TD>
2983	<TD ALIGN="RIGHT">200 ms</TD>
2984	<TD ALIGN="RIGHT">117.5 %</TD>
2985	<TD ALIGN="RIGHT">237 ms</TD>
2986	<TD ALIGN="RIGHT">83.5 %</TD>
2987	<TD ALIGN="RIGHT">242 ms</TD>
2988	<TD ALIGN="RIGHT">79.8 %</TD>
2989	</TR>
2990	</TABLE>
2991
2992	<P><BR>
2993
2994	<P>
2995	Although all JFlex scanners were faster than those generated by JLex,
2996	slight differences between JFlex code generation methods show up when compared
2997	to the run on the Windows 98 system.
2998	<A NAME="PerformanceHandwritten"></A>
2999	<P>
3000	The following table compares a hand-written scanner for the Java language
3001	obtained from the web site of CUP with the JFlex generated scanner for Java
3002	that comes with JFlex in the <TT>examples</TT> directory. They were tested
3003	on different <TT>.java</TT> files on a Linux machine with Sun's JDK 1.3.
3004
3005	<P>
3006	<TABLE CELLPADDING=3 BORDER="1" WIDTH="100%">
3007	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">lines</TD>
3008	<TD ALIGN="RIGHT">KB</TD>
3009	<TD ALIGN="CENTER">JVM</TD>
3010	<TD ALIGN="RIGHT">hand-written scanner</TD>
3011	<TD ALIGN="CENTER" COLSPAN=2>JFlex generated scanner</TD>
3012	</TR>
3013	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">19050</TD>
3014	<TD ALIGN="RIGHT">496</TD>
3015	<TD ALIGN="CENTER">hotspot</TD>
3016	<TD ALIGN="RIGHT">824 ms</TD>
3017	<TD ALIGN="RIGHT">248 ms</TD>
3018	<TD ALIGN="RIGHT">235 % faster</TD>
3019	</TR>
3020	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">6350</TD>
3021	<TD ALIGN="RIGHT">165</TD>
3022	<TD ALIGN="CENTER">hotspot</TD>
3023	<TD ALIGN="RIGHT">272 ms</TD>
3024	<TD ALIGN="RIGHT">84 ms</TD>
3025	<TD ALIGN="RIGHT">232 % faster</TD>
3026	</TR>
3027	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">1270</TD>
3028	<TD ALIGN="RIGHT">33</TD>
3029	<TD ALIGN="CENTER">hotspot</TD>
3030	<TD ALIGN="RIGHT">53 ms</TD>
3031	<TD ALIGN="RIGHT">18 ms</TD>
3032	<TD ALIGN="RIGHT">194 % faster</TD>
3033	</TR>
3034	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">19050</TD>
3035	<TD ALIGN="RIGHT">496</TD>
3036	<TD ALIGN="CENTER">interpreted</TD>
3037	<TD ALIGN="RIGHT">5.83 s</TD>
3038	<TD ALIGN="RIGHT">3.85 s</TD>
3039	<TD ALIGN="RIGHT">51 % faster</TD>
3040	</TR>
3041	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">6350</TD>
3042	<TD ALIGN="RIGHT">165</TD>
3043	<TD ALIGN="CENTER">interpreted</TD>
3044	<TD ALIGN="RIGHT">1.95 s</TD>
3045	<TD ALIGN="RIGHT">1.29 s</TD>
3046	<TD ALIGN="RIGHT">51 % faster</TD>
3047	</TR>
3048	<TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">1270</TD>
3049	<TD ALIGN="RIGHT">33</TD>
3050	<TD ALIGN="CENTER">interpreted</TD>
3051	<TD ALIGN="RIGHT">0.38 s</TD>
3052	<TD ALIGN="RIGHT">0.25 s</TD>
3053	<TD ALIGN="RIGHT">52 % faster</TD>
3054	</TR>
3055	</TABLE>
3056
3057	<P><BR>
3058
3059	<P>
3060	Although JDK 1.3 seems to speed up the hand-written scanner more than the
3061	generated one when compared to JDK 1.1 or 1.2, the generated scanner is
3062	still up to 3.3 times as fast as the hand-written one. One example of
3063	a hand-written scanner that is
3064	considerably slower than the equivalent generated one is surely no
3065	proof that all generated scanners are faster than hand-written ones. It is
3066	clearly impossible to prove something like that, since you could
3067	always write the generated scanner by hand. From a software
3068	engineering point of view, however, there is no excuse for writing a
3069	scanner by hand, since this task takes more time, is more difficult and
3070	therefore more error prone than writing a compact, readable and easy
3071	to change lexical specification. (I'd like to add that I do <EM>not</EM>
3072	think that the hand-written scanner from the CUP web site used here in
3073	the test is stupid or badly written or anything like that. I actually
3074	think Scott did a great job with it.)
3075
3076	<P>
3077
3078	<H2><A NAME="SECTION00072000000000000000"></A><A NAME="PerformanceTips"></A><BR>
3079	How to write a faster specification
3080	</H2>
3081	Although JFlex generated scanners show good performance without
3082	special optimisations, there are some heuristics that can make a
3083	lexical specification produce an even faster scanner. Those are
3084	(roughly in order of performance gain):
3085
3086	<P>
3087
3088	<UL>
3089	<LI>Avoid rules that require backtracking
3090
3091	<P>
3092	From the C/C++ flex [<A
3093	HREF="manual.html#flex">11</A>] man page: <EM>"Getting rid
3094	of backtracking is messy and often may be an enormous amount of work for
3095	a complicated scanner."</EM> Backtracking is introduced by the longest match
3096	rule and occurs for instance on this set of expressions:
3097
3098	<P>
3099	<TT> "averylongkeyword"</TT>
3100	<BR><TT> .</TT>
3101
3102	<P>
3103	With input <TT>"averylongjoke"</TT> the scanner has to read all characters
3104	up to <TT>'j'</TT> to decide that rule <TT>.</TT> should be matched. All
3105	characters of <TT>"verylong"</TT> have to be read again for the next
3106	matching process. Backtracking can be avoided in general by adding
3107	error rules that match those error conditions:
3108
3109	<P>
3110	<code> "av"|"ave"|"aver"|"avery"|"averyl"|..</code>
3111
3112	<P>
3113	While this is impractical in most scanners, there is still the
3114	possibility to add a "catch all" rule for a lengthy list of keywords:
3115	<PRE>
3116	"keyword1" { return symbol(KEYWORD1); }
3117	..
3118	"keywordn" { return symbol(KEYWORDn); }
3119	[a-z]+ { error("not a keyword"); }
3120	</PRE>
3121	Most programming language scanners already have a rule like this for
3122	some kind of variable length identifiers.
3123
3124	<P>
3125	</LI>
3126	<LI>Avoid line and column counting
3127
3128	<P>
3129	It costs multiple additional comparisons per input character and the
3130	matched text has to be re-scanned for counting. In most scanners it
3131	is possible to do the line counting in the specification by
3132	incrementing <TT>yyline</TT> each time a line terminator has been
3133	matched. Column counting could also be included in actions. This
3134	will be faster, but can in some cases become quite messy.
3135
3136	<P>
3137	</LI>
3138	<LI>Avoid look-ahead expressions and the end of line operator '$'
3139
3140	<P>
3141	In the best case, the trailing context will first have to be read and
3142	then (because it is not to be consumed) re-read. The cases of
3143	fixed-length look-ahead and fixed-length base expressions are handled efficiently
3144	by matching the concatenation and then pushing back the required amount
3145	of characters. This extends to the case of a disjunction of fixed-length
3146	look-ahead expressions such as <code>r1 / \r|\n|\r\n</code>. All other cases
3147	<code>r1 / r2</code> are handled by first scanning the concatenation of
3148	<code>r1</code> and <code>r2</code>, and then finding the correct end of <code>r1</code>.
3149	The end of <code>r1</code> is found by scanning forwards in the match again,
3150	marking all possible <code>r1</code> terminations, and then scanning the reverse
3151	of <code>r2</code> backwards from the end until a start of <code>r2</code> intersects
3152	with an end of <code>r1</code>. This algorithm is linear in the size of the input
3153	(not quadratic or worse as backtracking is), but about a factor of 2 slower
3154	than normal scanning. It also consumes memory proportional to the size
3155	of the matched input for <code>r1 r2</code>.
3156
3157	<P>
3158	</LI>
3159	<LI>Avoid the beginning of line operator '<code>^</code>'
3160
3161	<P>
3162	It costs multiple additional comparisons per match. In some
3163	cases one extra look-ahead character is needed (when the last character read is
3164	<code>\r</code> the scanner has to read one character ahead to check if
3165	the next one is an <code>\n</code> or not).
3166
3167	<P>
3168	</LI>
3169	<LI>Match as much text as possible in a rule.
3170
3171	<P>
3172	One rule is matched in the innermost loop of the scanner. After
3173	each action some overhead for setting up the internal state of the
3174	scanner is necessary.
3175	</LI>
3176	</UL>
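
<P>
The cost of backtracking from the first tip in the list above can be made
concrete with a toy model. The sketch below is <EM>not</EM> how a JFlex
scanner is implemented (generated scanners run a DFA); it only counts how many
characters a naive longest-match loop has to look at when the two rules are
the keyword and the single-character rule <TT>.</TT>. The class and method
names are made up for this illustration.

```java
// Toy model, not JFlex code: counts character reads of a naive
// longest-match loop with the rules  "keyword"  and  .
class BacktrackDemo {

    // Returns how many characters are examined in total while scanning
    // the input, including characters that are read again after a failed
    // keyword attempt.
    static int charsRead(String input, String keyword) {
        int reads = 0;
        int pos = 0;
        while (pos < input.length()) {
            int i = 0;
            // Read ahead as long as the input agrees with the keyword.
            while (pos + i < input.length() && i < keyword.length()) {
                reads++;                                   // one character examined
                if (input.charAt(pos + i) != keyword.charAt(i)) break;
                i++;
            }
            if (i == keyword.length()) {
                pos += i;   // the keyword rule matched
            } else {
                pos += 1;   // only "." matched; the rest is read again
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        String keyword = "averylongkeyword";
        // 13 input characters, but 22 reads: 9 characters are looked at twice.
        System.out.println(charsRead("averylongjoke", keyword));     // prints 22
        // A full keyword match needs no re-reading: 16 reads for 16 characters.
        System.out.println(charsRead("averylongkeyword", keyword));  // prints 16
    }
}
```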
3177
3178	<P>
3179	Note that writing more rules in a specification does not make the generated
3180	scanner slower (except when you have to switch to another code generation
3181	method because of the larger size).
3182
3183	<P>
3184	The two main rules of optimisation also apply to lexical specifications:
3185
3186	<OL>
3187	<LI><B>don't do it</B>
3188	</LI>
3189	<LI><B>(for experts only) don't do it yet</B>
3190	</LI>
3191	</OL>
3192
3193	<P>
3194	Some of the performance tips above contradict a readable and compact
3195	specification style. When in doubt, or when requirements are not yet
3196	fixed, don't use them: the specification can always be optimised
3197	at a later stage of the development process.
3198
3199	<P>
3200
3201	<H1><A NAME="SECTION00080000000000000000">
3202	Porting Issues</A>
3203	</H1>
3204
3205	<P>
3206
3207	<H2><A NAME="SECTION00081000000000000000"></A><A NAME="Porting"></A><BR>
3208	Porting from JLex
3209	</H2>
3210	JFlex was designed to read old JLex specifications unchanged and to
3211	generate a scanner which behaves exactly the same as the one generated
3212	by JLex with the only difference of being faster.
3213
3214	<P>
3215	This works as expected on all well formed JLex specifications.
3216
3217	<P>
3218	Since the statement above is somewhat absolute, let's take a look at
3219	what "well formed" means here. A JLex specification is well formed when
3220	it
3221
3222	<UL>
3223	<LI>generates a working scanner with JLex
3224
3225	<P>
3226	</LI>
3227	<LI>doesn't contain the unescaped characters <TT>!</TT> and <TT>~</TT>
3228
3229	<P>
3230	They are operators in JFlex while JLex treats them as normal
3231	input characters. You can easily port such a JLex specification
3232	to JFlex by replacing every <TT>!</TT> with <code>\!</code> and every
3233	<code>~</code> with <code>\~</code> in all regular expressions.
3234
3235	<P>
3236	</LI>
3237	<LI>has only complete regular expressions surrounded by parentheses in
3238	macro definitions
3239
3240	<P>
3241	This may sound a bit harsh, but it avoids what could otherwise be a major
3242	problem. It can also help you find some nasty bugs in your
3243	specification that didn't show up in the first place. In JLex, the
3244	right hand side of a macro is just a piece of text that is copied
3245	to the point where the macro is used. With this, some weird kind of
3246	stuff like
3247	<PRE>
3248	macro1 = ("hello"
3249	macro2 = {macro1})*
3250	</PRE>
3251	was possible (with <TT>macro2</TT> expanding to <code>("hello")*</code>). This
3252	is not allowed in JFlex and you will have to transform such
3253	definitions. There are however some more subtle kinds of errors that
3254	can be introduced by JLex macros. Let's consider a definition like
3255	<code>macro = a|b</code> and a usage like <code>{macro}*</code>.
3256	This expands in JLex to <code>a|b*</code> and not to the probably intended
3257	<code>(a|b)*</code>.
3258
3259	<P>
3260	JFlex always uses the second form of expansion, since this is the natural
3261	form of thinking about abbreviations for regular expressions.
3262
3263	<P>
3264	Most specifications shouldn't suffer from this problem, because
3265	macros often only contain (harmless) character classes like
3266	<TT>alpha = [a-zA-Z]</TT> and more dangerous definitions like
3267
3268	<P>
3269	<code> ident = {alpha}({alpha}|{digit})*</code>
3270
3271	<P>
3272	are only used to write rules like
3273
3274	<P>
3275	<code> {ident} { .. action .. }</code>
3276
3277	<P>
3278	and not more complex expressions like
3279
3280	<P>
3281	<code> {ident}* { .. action .. }</code>
3282
3283	<P>
3284	where the kind of error presented above would show up.
3285	</LI>
3286	</UL>
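
<P>
The two kinds of macro expansion discussed in the last list item can be
compared with a few lines of Java. The sketch uses <TT>java.util.regex</TT>
purely as a stand-in for regular expression semantics; neither JLex nor JFlex
uses that package for matching, and the class and method names are
hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo, not JLex or JFlex code.
class MacroExpansionDemo {

    // JLex style: the macro text is pasted verbatim into the expression.
    static String expandTextually(String macroBody, String suffix) {
        return macroBody + suffix;
    }

    // JFlex style: the macro is treated as one complete sub-expression.
    static String expandAsUnit(String macroBody, String suffix) {
        return "(" + macroBody + ")" + suffix;
    }

    public static void main(String[] args) {
        String textual = expandTextually("a|b", "*");   // a|b*
        String unit = expandAsUnit("a|b", "*");         // (a|b)*
        // Only the parenthesised form accepts a mixed sequence of a and b.
        System.out.println(textual + " matches abab: " + Pattern.matches(textual, "abab"));
        System.out.println(unit + " matches abab: " + Pattern.matches(unit, "abab"));
    }
}
```

<P>
The first line reports <TT>false</TT> and the second <TT>true</TT>: in the
textual expansion the star binds only to <TT>b</TT>.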
3287
3288	<P>
3289
3290	<H2><A NAME="SECTION00082000000000000000"></A><A NAME="lexport"></A><BR>
3291	Porting from lex/flex
3292	</H2>
3293	This section tries to give an overview of activities and possible
3294	problems when porting a lexical specification from the C/C++ tools lex
3295	and flex [<A
3296	HREF="manual.html#flex">11</A>] available on most Unix systems to JFlex.
3297
3298	<P>
3299	Most of the C/C++ specific features are naturally not present in JFlex,
3300	but most "clean" lex/flex lexical specifications can be ported to
3301	JFlex without very much work.
3302
3303	<P>
3304	This section is far from complete and is based mainly on a survey of
3305	the flex man page and very little personal experience. If you do
3306	engage in any porting activity from lex/flex to JFlex and encounter
3307	problems, have better solutions for points presented here or have just
3308	some tips you would like to share, please do <A NAME="tex2html7"
3309	HREF="mailto:[email protected]">contact me</A>. I will
3310	incorporate your experiences in this manual (with all due credit to you,
3311	of course).
3312
3313	<P>
3314
3315	<H3><A NAME="SECTION00082100000000000000">
3316	Basic structure</A>
3317	</H3>
3318	A lexical specification for flex has the following basic structure:
3319	<PRE>
3320	definitions
3321	%%
3322	rules
3323	%%
3324	user code
3325	</PRE>
3326
3327	<P>
3328	The <TT>user code</TT> section usually contains some C code that is used
3329	in actions of the <TT>rules</TT> part of the specification. For JFlex most
3330	of this code will have to be included in the class code <code>%{..%}</code>
3331	directive in the <TT>options and declarations</TT> section (after
3332	translating the C code to Java, of course).
3333
3334	<P>
3335
3336	<H3><A NAME="SECTION00082200000000000000">
3337	Macros and Regular Expression Syntax</A>
3338	</H3>
3339	The <TT>definitions</TT> section of a flex specification is quite similar
3340	to the <TT>options and declarations</TT> part of JFlex specs.
3341
3342	<P>
3343	Macro definitions in flex have the form:
3344	<PRE>
3345	&lt;identifier&gt; &lt;expression&gt;
3346	</PRE>
3347	To port them to JFlex macros, just insert a <TT>=</TT> between <TT>&lt;identifier&gt;</TT>
3348	and <TT>&lt;expression&gt;</TT>.
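
<P>
For specifications with many macros, the rewrite can be scripted. The
following is a rough sketch under the assumption that it is only applied to
lines that really are flex macro definitions (it cannot tell a definition
from, say, a line of C code). The class <TT>FlexMacroPort</TT> and its method
are hypothetical and not part of JFlex.

```java
// Hypothetical porting helper, not part of JFlex.
class FlexMacroPort {

    // Rewrites one flex definition line "NAME  expression" as
    // "NAME = expression". Lines that do not start with a name followed
    // by whitespace and an expression are returned unchanged.
    static String toJFlex(String flexLine) {
        return flexLine.replaceFirst(
                "^([A-Za-z_][A-Za-z0-9_-]*)\\s+(\\S.*)$", "$1 = $2");
    }

    public static void main(String[] args) {
        System.out.println(toJFlex("DIGIT    [0-9]"));   // prints "DIGIT = [0-9]"
        System.out.println(toJFlex("%%"));               // prints "%%" (unchanged)
    }
}
```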
3349
3350	<P>
3351	The syntax and semantics of regular expressions in flex are pretty much the
3352	same as in JFlex. A little attention is needed for some escape sequences
3353	present in flex (such as <code>\a</code>) that are not supported in JFlex. These
3354	escape sequences should be transformed into their octal or hexadecimal
3355	equivalent (for example, <code>\a</code>, the bell character, becomes <code>\x07</code>).
3356
3357	<P>
3358	Another point is predefined character classes. Flex offers the ones directly
3359	supported by C, JFlex offers the ones supported by Java. These classes will
3360	sometimes have to be listed manually (if there is need for this feature, it
3361	may be implemented in a future JFlex version).
3362
3363	<P>
3364
3365	<H3><A NAME="SECTION00082300000000000000">
3366	Lexical Rules</A>
3367	</H3>
3368	Since flex is mostly Unix based, its '<code>^</code>' (beginning of line) and
3369	'<code>$</code>' (end of line) operators consider the <code>\n</code> character as the only line terminator. This should usually not cause many problems, but you
3370	should be prepared for occurrences of <code>\r</code> or <code>\r\n</code> or one of
3371	the characters <code>\u2028</code>, <code>\u2029</code>, <code>\u000B</code>, <code>\u000C</code>,
3372	or <code>\u0085</code>. They are considered to be line terminators in Unicode, so
3373	JFlex treats them as such, and they therefore may not be consumed when
3374	<code>^</code> or <code>$</code> is present in a rule.
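
<P>
When porting, it can help to check test input for line terminators other than <code>\n</code>. The following is a minimal sketch of such a check; the class <TT>LineTerminators</TT> and its method are invented for this example, and the character list is simply the one given above.

```java
public class LineTerminators {
    // Line terminators from the list above, other than the plain newline.
    private static final String EXTRA = "\r\u2028\u2029\u000B\u000C\u0085";

    /** Index of the first terminator other than newline, or -1 if none. */
    public static int firstExtraTerminator(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (EXTRA.indexOf(input.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstExtraTerminator("unix line\n"));
        System.out.println(firstExtraTerminator("dos line\r\n"));
    }
}
```

Input for which this returns -1 contains only the terminator that flex recognises, so rules using <code>^</code> or <code>$</code> would be expected to see the same line boundaries in both tools.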
3375	<P>
3376
3377	<H1><A NAME="SECTION00090000000000000000"></A><A NAME="WorkingTog"></A><BR>
3378	Working together
3379	</H1>
3380
3381	<P>
3382
3383	<H2><A NAME="SECTION00091000000000000000"></A><A NAME="CUPWork"></A><BR>
3384	JFlex and CUP
3385	</H2>
3386	One of the main design goals of JFlex was to make interfacing with the free
3387	Java parser generator CUP [<A
3388	HREF="manual.html#CUP">8</A>] as easy as possible.
3389	This has been done by giving
3390	the <TT><A HREF="#CupMode">%cup</A></TT> directive a special meaning. An
3391	interface however always has two sides. This section concentrates on the
3392	CUP side of the story.
3393
3394	<P>
3395
3396	<H3><A NAME="SECTION00091100000000000000">
3397	CUP version 0.10j and above</A>
3398	</H3>
3399	Since CUP version 0.10j, this has been simplified greatly by the new
3400	CUP scanner interface <TT>java_cup.runtime.Scanner</TT>. JFlex lexers now implement
3401	this interface automatically when the <TT><A HREF="#CupMode">%cup</A></TT>
3402	switch is used. There are no special <TT>parser code</TT>, <TT>init
3403	code</TT> or <TT>scan with</TT> options any more that you have to provide
3404	in your CUP parser specification. You can just concentrate on your grammar.
3405
3406	<P>
3407	If your generated lexer has the class name <TT>Scanner</TT>, the parser
3408	is started from a main program like this:
3409
3410	<P>
3411	<PRE>
3412	...
3413	try {
3414	  parser p = new parser(new Scanner(new FileReader(fileName)));
3415	  Object result = p.parse().value;
3416	}
3417	catch (Exception e) {
3418	...
3419	</PRE>
3420
3421	<P>
3422
3423	<H3><A NAME="SECTION00091200000000000000">
3424	Custom symbol interface</A>
3425	</H3>
3426	If you have used the <TT>-symbol</TT> command line switch of CUP to change
3427	the name of the generated symbol interface, you have to tell JFlex about
3428	this change of interface so that correct end-of-file code is generated.
3429	You can do so either by using an <code>%eofval{</code> directive or by using
3430	an <TT>&lt;&lt;EOF&gt;&gt;</TT> rule.
3431
3432	<P>
3433	If your new symbol interface is called <TT>mysym</TT> for example, the
3434	corresponding code in the JFlex specification would be either
3435
3436	<P>
3437
3438	<PRE>
3439	%eofval{
3440	return mysym.EOF;
3441	%eofval}
3442	</PRE>
3443
3444	<P>
3445	in the macro/directives section of the spec, or it would be
3446
3447	<P>
3448
3449	<PRE>
3450	<<EOF>> { return mysym.EOF; }
3451	</PRE>
3452
3453	<P>
3454	in the rules section of your spec.
3455
3456	<P>
3457
3458	<H3><A NAME="SECTION00091300000000000000">
3459	Using existing JFlex/CUP specifications with CUP 0.10j</A>
3460	</H3>
3461	If you already have an existing specification and you would like to upgrade
3462	both JFlex and CUP to their newest version, you will probably have to adjust
3463	your specification.
3464
3465	<P>
3466	The main difference between the <TT><A HREF="#CupMode">%cup</A></TT> switch in
3467	JFlex 1.2.1 and lower and the current JFlex version is that JFlex scanners
3468	now automatically implement the <TT>java_cup.runtime.Scanner</TT> interface.
3469	This means that the scanning function now changes its name from <TT>yylex()</TT>
3470	to <TT>next_token()</TT>.
3471
3472	<P>
3473	The main difference between older CUP versions and 0.10j is that the generated
3474	parser now has a constructor that accepts a <TT>java_cup.runtime.Scanner</TT>
3475	as argument and uses this scanner by
3476	default (so no <TT>scan with</TT> code is necessary any more).
3477
3478	<P>
3479	If you have an existing CUP specification, it will probably look somewhat like this:
3480	<PRE>
3481	parser code {:
3482	  Lexer lexer;
3483	
3484	  public parser (java.io.Reader input) {
3485	    lexer = new Lexer(input);
3486	  }
3487	:};
3488	
3489	scan with {: return lexer.yylex(); :};
3490	</PRE>
3491
3492	<P>
3493	To upgrade to CUP 0.10j, you could change it to look like this:
3494	<PRE>
3495	parser code {:
3496	  public parser (java.io.Reader input) {
3497	    super(new Lexer(input));
3498	  }
3499	:};
3500	</PRE>
3501
3502	<P>
3503	If you do not mind changing the method that is calling the parser,
3504	you could remove the constructor entirely (and if there is nothing else
3505	in it, the whole <TT>parser code</TT> section as well, of course). The calling
3506	main procedure would then construct the parser as shown in the section above.
3507
3508	<P>
3509	The JFlex specification does not need to be changed.
3510
3511	<P>
3512
3513	<H3><A NAME="SECTION00091400000000000000">
3514	Using older versions of CUP</A>
3515	</H3>
3516	For people who like or have to use older versions of CUP, the following section
3517	explains "the old way". Please note that the standard name of the scanning
3518	function with the <TT><A HREF="#CupMode">%cup</A></TT> switch is not
3519	<TT>yylex()</TT>, but <TT>next_token()</TT>.
3520
3521	<P>
3522	If you have a scanner specification that begins like this:
3523
3524	<P>
3525	<PRE>
3526	package PACKAGE;
3527	import java_cup.runtime.*; /* this is convenience, but not necessary */
3528
3529	%%
3530
3531	%class Lexer
3532	%cup
3533	..
3534	</PRE>
3535
3536	<P>
3537	then it matches a CUP specification starting like
3538
3539	<P>
3540	<PRE>
3541	package PACKAGE;
3542
3543	parser code {:
3544	  Lexer lexer;
3545	
3546	  public parser (java.io.Reader input) {
3547	    lexer = new Lexer(input);
3548	  }
3549	:};
3550	
3551	scan with {: return lexer.next_token(); :};
3552
3553	..
3554	</PRE>
3555
3556	<P>
3557	This assumes that the generated parser will get the name <TT>parser</TT>.
3558	If it doesn't, you have to adjust the constructor name.
3559
3560	<P>
3561	The parser can then be started in a main routine like this:
3562
3563	<P>
3564	<PRE>
3565	..
3566	try {
3567	  parser p = new parser(new FileReader(fileName));
3568	  Object result = p.parse().value;
3569	}
3570	catch (Exception e) {
3571	..
3572	</PRE>
3573
3574	<P>
3575	If you want the parser specification to be independent of the name of the generated
3576	scanner, you can instead write an interface <TT>Lexer</TT>
3577
3578	<P>
3579	<PRE>
3580	public interface Lexer {
3581	  public java_cup.runtime.Symbol next_token() throws java.io.IOException;
3582	}
3583	</PRE>
3584
3585	<P>
3586	change the parser code to:
3587
3588	<P>
3589	<PRE>
3590	package PACKAGE;
3591
3592	parser code {:
3593	  Lexer lexer;
3594	
3595	  public parser (Lexer lexer) {
3596	    this.lexer = lexer;
3597	  }
3598	:};
3599	
3600	scan with {: return lexer.next_token(); :};
3601
3602	..
3603	</PRE>
3604
3605	<P>
3606	tell JFlex about the lexer
3607	interface using the <TT>%implements</TT>
3608	directive:
3609
3610	<P>
3611	<PRE>
3612	..
3613	%class Scanner /* not Lexer now since that is our interface! */
3614	%implements Lexer
3615	%cup
3616	..
3617	</PRE>
3618
3619	<P>
3620	and finally change the main routine to look like
3621
3622	<P>
3623	<PRE>
3624	...
3625	try {
3626	  parser p = new parser(new Scanner(new FileReader(fileName)));
3627	  Object result = p.parse().value;
3628	}
3629	catch (Exception e) {
3630	...
3631	</PRE>
3632
3633	<P>
3634	If you want to improve the error messages that CUP-generated parsers
3635	produce, you can also override the methods <TT>report_error</TT> and <TT>report_fatal_error</TT>
3636	in the "parser code" section of the CUP specification. The new methods
3637	could for instance use <TT>yyline</TT> and <TT>yycolumn</TT> (stored in
3638	the <TT>left</TT> and <TT>right</TT> members of class <TT>java_cup.runtime.Symbol</TT>)
3639	to report error positions more conveniently for the user. The lexer and
3640	parser for the Java language in the <TT>examples/java</TT> directory of the
3641	JFlex distribution use this style of error reporting. These specifications
3642	also demonstrate the techniques above in action.
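
<P>
The formatting part of such an override can be sketched independently of CUP. The helper below is hypothetical (the class <TT>ErrorPositions</TT> and the method <TT>describe</TT> do not exist in CUP or JFlex); it assumes that the scanner stored <TT>yyline</TT> in <TT>left</TT> and <TT>yycolumn</TT> in <TT>right</TT>, and that both count from 0, so it adds 1 to each for display.

```java
public class ErrorPositions {
    /**
     * Builds a message with 1-based positions.
     * left is assumed to hold yyline and right to hold yycolumn,
     * both counting from 0 as stored by the scanner.
     */
    public static String describe(String message, int left, int right) {
        return message + " at line " + (left + 1) + ", column " + (right + 1);
    }

    public static void main(String[] args) {
        // Values as they might be read from the fields of a symbol.
        System.out.println(describe("Syntax error", 2, 4));
    }
}
```

An overridden <TT>report_error</TT> could pass the <TT>left</TT> and <TT>right</TT> members of the offending symbol to a helper of this kind, provided the scanner filled them in as described above.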
3643
3644	<P>
3645
3646	<H2><A NAME="SECTION00092000000000000000"></A><A NAME="YaccWork"></A><BR>
3647	JFlex and BYacc/J
3648	</H2>
3649
3650	<P>
3651	JFlex has built-in support for the Java extension
3652	<A NAME="tex2html8"
3653	HREF="http://byaccj.sourceforge.net/">BYacc/J</A>
3654	[<A
3655	HREF="manual.html#BYaccJ">9</A>] by Bob Jamison
3656	to the classical Berkeley Yacc parser generator.
3657	This section describes how to interface BYacc/J with JFlex. It
3658	builds on many helpful suggestions and comments from Larry Bell.
3659
3660	<P>
3661	Since Yacc's architecture is a bit different from CUP's, the
3662	interface setup also works in a slightly different manner.
3663	BYacc/J expects a function <TT>int yylex()</TT> in the parser
3664	class that returns each next token. Semantic values are expected
3665	in a field <TT>yylval</TT> of type <TT>parserval</TT> where "<TT>parser</TT>"
3666	is the name of the generated parser class.
3667
3668	<P>
3669	For a small calculator example, one could use a setup like the
3670	following on the JFlex side:
3671
3672	<P>
3673	<PRE>
3674	%%
3675
3676	%byaccj
3677
3678	%{
3679	  /* store a reference to the parser object */
3680	  private parser yyparser;
3681	
3682	  /* constructor taking an additional parser object */
3683	  public Yylex(java.io.Reader r, parser yyparser) {
3684	    this(r);
3685	    this.yyparser = yyparser;
3686	  }
3687	%}
3688	
3689	NUM = [0-9]+ ("." [0-9]+)?
3690	NL  = \n | \r | \r\n
3691	
3692	%%
3693	
3694	/* operators */
3695	"+" |
3696	..
3697	"(" |
3698	")"    { return (int) yycharat(0); }
3699	
3700	/* newline */
3701	{NL}   { return parser.NL; }
3702	
3703	/* float */
3704	{NUM}  { yyparser.yylval = new parserval(Double.parseDouble(yytext()));
3705	         return parser.NUM; }
3706	</PRE>
3707
3708	<P>
3709	The lexer expects a reference to the parser in its constructor.
3710	Since Yacc allows direct use of terminal characters like <TT>'+'</TT>
3711	in its specifications, we just return the character code for
3712	single char matches (e.g. the operators in the example). Symbolic
3713	token names are stored as <TT>public static int</TT> constants in
3714	the generated parser class. They are used as in the <TT>NL</TT> token
3715	above. Finally, for some tokens, a semantic value may have to be
3716	communicated to the parser. The <TT>NUM</TT> rule demonstrates that
3717	bit.
3718
3719	<P>
3720	A matching BYacc/J parser specification could look like this:
3721	<PRE>
3722	%{
3723	import java.io.*;
3724	%}
3725
3726	%token NL /* newline */
3727	%token <dval> NUM /* a number */
3728
3729	%type <dval> exp
3730
3731	%left '-' '+'
3732	..
3733	%right '^' /* exponentiation */
3734
3735	%%
3736
3737	..
3738
3739	exp:   NUM            { $$ = $1; }
3740	     | exp '+' exp    { $$ = $1 + $3; }
3741	       ..
3742	     | exp '^' exp    { $$ = Math.pow($1, $3); }
3743	     | '(' exp ')'    { $$ = $2; }
3744	     ;
3745	
3746	%%
3747	  /* a reference to the lexer object */
3748	  private Yylex lexer;
3749	
3750	  /* interface to the lexer */
3751	  private int yylex () {
3752	    int yyl_return = -1;
3753	    try {
3754	      yyl_return = lexer.yylex();
3755	    }
3756	    catch (IOException e) {
3757	      System.err.println("IO error :"+e);
3758	    }
3759	    return yyl_return;
3760	  }
3761	
3762	  /* error reporting */
3763	  public void yyerror (String error) {
3764	    System.err.println ("Error: " + error);
3765	  }
3766	
3767	  /* lexer is created in the constructor */
3768	  public parser(Reader r) {
3769	    lexer = new Yylex(r, this);
3770	  }
3771	
3772	  /* that's how you use the parser */
3773	  public static void main(String args[]) throws IOException {
3774	    parser yyparser = new parser(new FileReader(args[0]));
3775	    yyparser.yyparse();
3776	  }
3777	</PRE>
3778
3779	<P>
3780	Here, the customised part is mostly in the user code section:
3781	We create the lexer in the constructor of the parser and store
3782	a reference to it for later use in the parser's <TT>int yylex()</TT>
3783	method. This <TT>yylex</TT> in the parser only calls <TT>int yylex()</TT>
3784	of the generated lexer and passes the result on. If something goes
3785	wrong, it returns -1 to indicate an error.
3786
3787	<P>
3788	Runnable versions of the specifications above
3789	are located in the <TT>examples/byaccj</TT> directory of the JFlex
3790	distribution.
3791
3792	<P>
3793
3794	<H1><A NAME="SECTION000100000000000000000"></A><A NAME="Bugs"></A><BR>
3795	Bugs and Deficiencies
3796	</H1>
3797
3798	<P>
3799
3800	<H2><A NAME="SECTION000101000000000000000">
3801	Deficiencies</A>
3802	</H2>
3803	Unicode matching does not fully conform to the relevant current Unicode report. Instead, the Unicode support in JFlex is the one native to Java. That means that only 16-bit code points are supported and most Unicode character classes are not directly supported (although they can be custom-defined in macros). The Java 5 development version of JFlex contains better support for Unicode, as will the next major release.
3804
3805	<P>
3806
3807	<H2><A NAME="SECTION000102000000000000000">
3808	Bugs</A>
3809	</H2>
3810	As of January 31, 2009, no bugs have been reported for JFlex version 1.4.3. All
3811	bugs reported for earlier versions have been fixed.
3812
3813	<P>
3814	If you find new problems, please use the bugs section of the
3815	<A NAME="tex2html9"
3816	HREF="http://www.jflex.de/">JFlex web site</A>
3817	to report them.
3818
3819	<P>
3820
3821	<H1><A NAME="SECTION000110000000000000000"></A><A NAME="Copyright"></A><BR>
3822	Copying and License
3823	</H1>
3824	JFlex is free software, published under the terms of the
3825	<A NAME="tex2html10"
3826	HREF="http://www.fsf.org/copyleft/gpl.html">GNU General Public License</A>.
3827
3828	<P>
3829	There is absolutely NO WARRANTY for JFlex, its code and its documentation.
3830
3831	<P>
3832	The code generated by JFlex inherits the copyright of the specification it
3833	was produced from. If it was your specification, you may use the generated
3834	code without restriction.
3835
3836	<P>
3837	See the file <A NAME="tex2html11"
3838	HREF="COPYRIGHT"><TT>COPYRIGHT</TT></A>
3839	for more information.
3840
3841	<P>
3842
3843	<H2><A NAME="SECTION000120000000000000000"></A><A NAME="References"></A><BR>
3844	Bibliography
3845	</H2><DL COMPACT><DD>
3846
3847	<P>
3848	<P></P><DT><A NAME="Aho">1</A>
3849	<DD>
3850	A. Aho, R. Sethi, J. Ullman, <EM>Compilers: Principles, Techniques, and Tools</EM>, 1986
3851
3852	<P>
3853	<P></P><DT><A NAME="Appel">2</A>
3854	<DD>
3855	A. W. Appel, <EM>Modern Compiler Implementation in Java: basic techniques</EM>, 1997
3856
3857	<P>
3858	<P></P><DT><A NAME="JLex">3</A>
3859	<DD>
3860	E. Berk, <EM>JLex: A lexical analyser generator for Java</EM>,
3861	<BR> <A NAME="tex2html12"
3862	HREF="http://www.cs.princeton.edu/~appel/modern/java/JLex/"><TT>http://www.cs.princeton.edu/~appel/modern/java/JLex/</TT></A>
3863	<P>
3864	<P></P><DT><A NAME="fast">4</A>
3865	<DD>
3866	K. Brouwer, W. Gellerich, E. Ploedereder,
3867	<EM>Myths and Facts about the Efficient Implementation of Finite Automata and Lexical Analysis</EM>,
3868	in: Proceedings of the 7th International Conference on Compiler Construction (CC '98), 1998
3869
3870	<P>
3871	<P></P><DT><A NAME="unicode_rep">5</A>
3872	<DD>
3873	M. Davis, <EM>Unicode Regular Expression Guidelines</EM>, Unicode Technical Report #18, 2000
3874	<BR> <A NAME="tex2html13"
3875	HREF="http://www.unicode.org/unicode/reports/tr18/tr18-5.1.html"><TT>http://www.unicode.org/unicode/reports/tr18/tr18-5.1.html</TT></A>
3876	<P>
3877	<P></P><DT><A NAME="ParseTable">6</A>
3878	<DD>
3879	P. Dencker, K. Dürre, J. Henft, <EM>Optimization of Parser Tables for portable Compilers</EM>,
3880	in: ACM Transactions on Programming Languages and Systems 6(4), 1984
3881
3882	<P>
3883	<P></P><DT><A NAME="LangSpec">7</A>
3884	<DD>
3885	J. Gosling, B. Joy, G. Steele, <EM>The Java Language Specification</EM>, 1996,
3886	<BR> <A NAME="tex2html14"
3887	HREF="http://java.sun.com/docs/books/jls/"><TT>http://java.sun.com/docs/books/jls/</TT></A>
3888	<P>
3889	<P></P><DT><A NAME="CUP">8</A>
3890	<DD>
3891	S. E. Hudson, <EM>CUP LALR Parser Generator for Java</EM>,
3892	<BR> <A NAME="tex2html15"
3893	HREF="http://www.cs.princeton.edu/~appel/modern/java/CUP/"><TT>http://www.cs.princeton.edu/~appel/modern/java/CUP/</TT></A>
3894	<P>
3895	<P></P><DT><A NAME="BYaccJ">9</A>
3896	<DD>
3897	B. Jamison, <EM>BYacc/J</EM>,
3898	<BR> <A NAME="tex2html16"
3899	HREF="http://byaccj.sourceforge.net"><TT>http://byaccj.sourceforge.net/</TT></A>
3900	<P>
3901	<P></P><DT><A NAME="MachineSpec">10</A>
3902	<DD>
3903	T. Lindholm, F. Yellin, <EM>The Java Virtual Machine Specification</EM>, 1996,
3904	<BR> <A NAME="tex2html17"
3905	HREF="http://java.sun.com/docs/books/vmspec/"><TT>http://java.sun.com/docs/books/vmspec/</TT></A>
3906	<P>
3907	<P></P><DT><A NAME="flex">11</A>
3908	<DD>
3909	V. Paxson, <EM>flex - The fast lexical analyzer generator</EM>, 1995
3910
3911	<P>
3912	<P></P><DT><A NAME="SparseTable">12</A>
3913	<DD>
3914	R. E. Tarjan, A. Yao, <EM>Storing a Sparse Table</EM>, in: Communications of the ACM 22(11), 1979
3915
3916	<P>
3917	<P></P><DT><A NAME="Maurer">13</A>
3918	<DD>
3919	R. Wilhelm, D. Maurer, <EM>Übersetzerbau</EM>, Berlin 1997<SUP>2</SUP>
3921
3922	<P>
3923	</DL>
3924
3925	<P>
3926	<BR><HR><H4>Footnotes</H4>
3927	<DL>
3928	<DT><A NAME="foot33">... Java</A><A
3929	HREF="manual.html#tex2html2"><SUP><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="footnote.png"></SUP></A></DT>
3930	<DD>Java is a trademark of
3931	Sun Microsystems, Inc., and refers to Sun's Java programming language.
3932	JFlex is not sponsored by or affiliated with Sun Microsystems, Inc.
3933
3934	</DD>
3935	</DL><BR><HR>
3936	<ADDRESS>
3937	Sat 31 Jan 2009 23:43:28 EST, <a href="http://www.doclsf.de">Gerwin Klein</a>
3938	</ADDRESS>
3939	</BODY>
3940	</HTML>
