Constructing Fast Lexical Analyzers with RE/flex
================================================

<p align="right"><i>by Robert van Engelen, January 16, 2017.<br>Genivia Research Labs<br>Updated March 2, 2020.</i></p>

<p><a class="github-button" href="https://github.com/Genivia/RE-flex" data-count-href="/Genivia/RE-flex/stargazers" data-show-count="true" data-count-aria-label="# stargazers on GitHub" aria-label="Star Genivia/RE-flex on GitHub">Star</a> <a class="github-button" href="https://github.com/Genivia/RE-flex/archive/master.zip" data-icon="octicon-cloud-download" aria-label="Download Genivia/RE-flex on GitHub">Download</a> <a style="vertical-align:top;" href="https://github.com/Genivia/RE-flex/actions/workflows/c-cpp.yml"><img src="https://github.com/Genivia/RE-flex/actions/workflows/c-cpp.yml/badge.svg" alt="build status"></a> <a style="vertical-align:top;" target="_blank" href="https://opensource.org/licenses/BSD-3-Clause" rel="nofollow"><img alt="license" src="https://img.shields.io/badge/license-BSD%203--Clause-blue.svg"></a></p>

RE/flex is a fast, free, and open-source lexical analyzer generator, an extension of [Flex](https://en.wikipedia.org/wiki/Flex_(lexical_analyser_generator)). RE/flex generates fast lexical analyzers for C++, also known as "scanners", "lexers", or "tokenizers". The old Flex option for C++ *"is buggy and does not work very well"* [[flex &amp; bison, Ch. 5]](https://www.oreilly.com/library/view/flex-bison/9780596805418/). RE/flex fixes these problems and adds many useful new features compared to Flex++, including Unicode support, indent/nodent/dedent anchors, word boundaries, lazy quantifiers (non-greedy, lazy repeats), and performance tuning options. The RE/flex software also includes a high-performance regex library for C++ (not covered in this article; see the [RE/flex user guide](http://www.genivia.com/doc/reflex/html) for details).

Introduction
------------

This article presents an overview of RE/flex with examples that demonstrate the new features. The full documentation can be found in the [RE/flex user guide.](http://www.genivia.com/doc/reflex/html)

### Licensing

RE/flex is released under a permissive open source license (BSD-3) and is available on [GitHub](https://github.com/Genivia/RE-flex) and on [SourceForge.](https://sourceforge.net/projects/re-flex)

### Speed

RE/flex is faster than Flex, and much faster than regex libraries such as Boost.Regex, C++11 std::regex, PCRE2 and RE2.
For example, tokenizing a representative C source code file into 244 tokens takes only 8.7 microseconds:

<table class="doxtable">
<tr><th>Command / Function</th><th>Software</th><th>Time (μs)</th></tr>
<tr><td><b>reflex --fast --noindent</b></td><td><b>RE/flex 3.4.1</b></td><td><b>8.7</b></td></tr>
<tr><td><b>reflex --fast</b></td><td><b>RE/flex 3.4.1</b></td><td><b>8.9</b></td></tr>
<tr><td>flex -+ --full</td><td>Flex 2.5.35</td><td>9.8</td></tr>
<tr><td>boost::spirit::lex::lexertl::actor_lexer::iterator_type</td><td>Boost.Spirit.Lex 1.82.0</td><td>10.7</td></tr>
<tr><td>reflex --full</td><td>RE/flex 3.4.1</td><td>20.6</td></tr>
<tr><td>pcre2_jit_match()</td><td>PCRE2 (jit) 10.42</td><td>60.8</td></tr>
<tr><td>hs_compile_multi(), hs_scan()</td><td>Hyperscan 5.4.2</td><td>129</td></tr>
<tr><td>reflex -m=boost-perl</td><td>Boost.Regex 1.82.0</td><td>205</td></tr>
<tr><td>RE2::Consume()</td><td>RE2 (pre-compiled) 2023-09-01</td><td>218</td></tr>
<tr><td>reflex -m=boost</td><td>Boost.Regex POSIX 1.82.0</td><td>392</td></tr>
<tr><td>pcre2_match()</td><td>PCRE2 10.42</td><td>500</td></tr>
<tr><td>RE2::Consume()</td><td>RE2 POSIX (pre-compiled) 2023-09-01</td><td>534</td></tr>
<tr><td>flex -+</td><td>Flex 2.5.35</td><td>3759</td></tr>
<tr><td>pcre2_dfa_match()</td><td>PCRE2 POSIX (dfa) 10.42</td><td>4029</td></tr>
<tr><td>regcomp(), regexec()</td><td>GNU C POSIX.2 regex</td><td>4932</td></tr>
<tr><td>std::cregex_iterator()</td><td>C++11 std::regex</td><td>6490</td></tr>
</table>

Note: *performance measured as elapsed time in microseconds (lower is better) over 1000 to 10000 benchmark runs on Mac OS X 12.6.9 with clang 12.0.0 -O2, 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3. Hyperscan is disqualified as a scanner due to its "all matches reported" semantics, which produces 1915 matches for this test, and due to its event handler requirements.* [Download the tests](https://www.genivia.com/files/perfcomp.zip)

The RE/flex matcher tracks line numbers, column numbers, and indentations, whereas Flex does not (benchmarked with option noyylineno) and neither do the other regex matchers in the table, except Boost.Regex when used with reflex. Tracking this information incurs some overhead.

### Features

The **reflex** scanner generator command takes a lexer specification similar to Flex and supports Flex options. This means that if you have prior experience with Flex (or Lex), the learning curve to use RE/flex is low. Because RE/flex is backward compatible with Flex, textbooks and online resources on Flex also apply to RE/flex.
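For instance, a classic Lex/Flex specification such as the following minimal sketch (a hypothetical three-rule tokenizer, shown here only to illustrate compatibility) is accepted by the **reflex** command unchanged when run with option `−−flex`:

    %%
    [0-9]+      printf("number %s\n", yytext);
    [a-zA-Z]+   printf("word %s\n", yytext);
    [ \t\r\n]   /* skip white space */
    %%

With `−−flex`, reflex generates a `lex.yy.cpp` that defines the familiar `yyFlexLexer` class with an `int yylex()` method, just like Flex with option -+.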
RE/flex borrows the expressive specification syntax of Flex while adding many new features and more modern capabilities such as:

- integrated support for Unicode (%option unicode); a smart input class detects a BOM in files (UTF-8/16/32)
- works with Bison/Yacc and supports reentrant, bison-bridge, bison-locations, and bison-complete
- efficient matching in generated DFA code or in DFA tables
- optional "free space mode" (%option freespace), which considerably improves the readability of lexer specifications
- indent/nodent/dedent matching, supported with indent/dedent anchors
- lazy quantifiers are supported (also known as "non-greedy operators", "lazy repeats" or "reluctant quantifiers")
- word boundaries are supported
- intuitive customization of the C++ lexer class code output (%class and %init)
- generates lexers that are thread-safe by default
- multiple generated lexers can be combined in one application
- visualization of finite state machines (output FSM in Graphviz format for the Graphviz dot tool)
- the regex library is extensible with matchers derived from an abstract base matcher class
- the Boost.Regex library is supported as a regex engine
- the PCRE2 library is supported as a regex engine
- conversion of regex expressions, for regex engines that lack certain regex features
- released under a permissive open source license (BSD-3)

Let me highlight some of these improvements briefly.

### Unicode support

In contrast to Flex, you can now use `%option unicode` to write patterns that match Unicode, such as:

    \p{Greek}+  printf("this is greek to me: %s\n", yytext);
    [$€£¥]      printf("currency in %s\n", yytext);
    🐶+          printf("a bunch of dog emojis\n");
    你好        printf("再见\n");
    よかった    printf("ここまで読よんでくれてありがとうございました\n");

In fact, RE/flex comes with standard Unicode character classes, and you can also use classes that are predefined for Java, C# and Python:

    \p{JavaIdentifierStart}\p{JavaIdentifierPart}*  return IDENT;

### Readable specs

Also new is `%option freespace` to enhance the readability of your lexer specifications with spacing. With this option you must always place actions in `{` and `}`:

    [-+]? [0-9] | [-+]? [1-9] [0-9]+    { yylval.num = strtol(yytext, NULL, 10); return NUM; }

### POSIX matching with lazy quantifiers

You can now use lazy quantifiers, such as `*?`, to create simple and elegant patterns:

    "/*".*?"*/"    // ignore multiline C/C++ comments
    <!--.*?-->     // ignore XML comments

### Indent, nodent and dedent matching

RE/flex integrates indent and dedent matching directly in the regular expression syntax with `\i` and `\j` escapes, so many scenarios can be covered instead of just some special cases. A simple indent/nodent/dedent matcher is:

    ^[ \t]+     out() << "| ";    // nodent: text is aligned to current indent margin
    ^[ \t]*\i   out() << "> ";    // indent: matched with \i
    ^[ \t]*\j   out() << "< ";    // dedent: matched with \j
    \j          out() << "< ";    // dedent: for each extra level dedented

### Readable source code output

The source code output produced by RE/flex is substantially more readable and understandable than that of Flex. The basic idea is that any regex engine can in principle be used to tokenize input: given a set of <em>n</em> patterns <em>p<sub>i</sub></em> for <em>i</em> = 1,...,<em>n</em>, a regex of the form <code>"(<em>p</em><sub>1</sub>)|(<em>p</em><sub>2</sub>)|...|(<em>p<sub>n</sub></em>)"</code> can be used to match and tokenize the input.
The group capture index <em>i</em> returned by the regex matcher identifies the pattern <em>p<sub>i</sub></em> that matched, where each partial match continues from the point where the previous match ended. Note that if a pattern uses a group <code>(<em>X</em>)</code>, that group must be converted to the non-capturing form <code>(?:<em>X</em>)</code> in the regex. The generated scanner repeats the scan operation (continuous partial matching) to tokenize input:

    int Lexer::lex()
    {
      if (!has_matcher())
        matcher("(p1)|(p2)|...|(pn)"); // new matcher engine if not setup already
      while (true)
      {
        if (... EOF reached ...)
          return 0;
        switch (matcher().scan()) // scan and match next token, get capture index
        {
          case 1: // pattern 1 matched
            ... // Action for pattern 1
            break;
          case 2: // pattern 2 matched
            ... // Action for pattern 2
            break;
          ... // etc
          default:
            ... // no match: this is the "default rule" to echo input character
        }
      }
    }

The generated scanner source code is structured in this way. Furthermore, the generated code conforms to solid OOP design principles with template classes for lexers, matchers, and input sources.

### Compatible with Flex and Bison

RE/flex produces C++ code, but you can still use RE/flex scanners with Bison (Yacc) and any other C code when everything is compiled together with a C++ compiler. When using RE/flex as a replacement for Flex, make sure to use command-line option `−−flex`. This makes RE/flex behave as Flex for C++ (i.e. Flex with option -+). Use options `−−flex` and `−−bison` to output code that defines the usual global non-reentrant "yy" functions and variables, such as `yylex()` and `yytext`, to work seamlessly with Bison (Yacc). Also make sure to include `reflex/include` on your compiler's include path and link your code against `reflex/lib/libreflex`.

### Development status

RE/flex is production-stable and comes with several examples, including:

- a Python 3.0 source code tokenizer
- a Java source code tokenizer
- a C/C++ source code tokenizer
- a JSON tokenizer and parser
- a YAML tokenizer and parser
- an XML tokenizer
- an HTML reformatter
- and many more...

Installation and setup
----------------------

Download RE/flex from [GitHub](https://github.com/Genivia/RE-flex) or [SourceForge.](https://sourceforge.net/projects/re-flex) You have two options: 1) quick install or 2) configure and make.

### Quick install

For a quick clean build, assuming your environment is pretty much standard:

[command]
sh clean.sh
sh build.sh

Add the location of the reflex tool to your $PATH variable:

[command]
export PATH=$PATH:/...path.../reflex/bin

Make sure that the RE/flex header files in the reflex/include directory are accessible when compiling, i.e. with -Ireflex/include. Also link your code with reflex/lib/libreflex.a. Optionally, install the reflex tool, libreflex library, and manual page:

[command]
sudo sh allinstall.sh

### Configure and make

The configure script accepts configuration and installation options. To view these options, run:

[command]
./configure −−help

Run configure and make:

[command]
./configure && make

After this successfully completes, you can optionally run `make install` to install the `reflex` command and `libreflex` library:

[command]
sudo make install

Unfortunately, cloning from Git does not preserve timestamps, which means that you may run into "WARNING: 'aclocal-1.15' is missing on your system."
If this happens, you can work around this problem by running:

[command]
autoreconf -fi
./configure && make

### Windows users

Use `reflex/bin/reflex.exe` from the command line or add a **Custom Build Step** in MSVC++ as follows:

1. select the project name in **Solution Explorer** then **Property Pages** from the **View** menu (see also [custom-build steps in Visual Studio](http://msdn.microsoft.com/en-us/library/hefydhhy.aspx));
2. add an extra path to the `reflex/include` folder in the **Include Directories** under **VC++ Directories**, which should look like `$(VC_IncludePath);$(WindowsSDK_IncludePath);C:\Users\YourUserName\Documents\reflex\include` (this assumes the `reflex` source package is in your **Documents** folder);
3. enter `"C:\Users\YourUserName\Documents\reflex\bin\reflex.exe" −−header-file "C:\Users\YourUserName\Documents\mylexer.l"` in the **Command Line** property under **Custom Build Step** (this assumes `mylexer.l` is in your **Documents** folder);
4. enter `lex.yy.h lex.yy.cpp` in the **Outputs** property;
5. specify **Execute Before** as `PreBuildEvent`.

To compile your program with MSVC++, make sure to drag the folders `reflex/lib` and `reflex/unicode` to the **Source Files** in the **Solution Explorer** panel of your project. After running `reflex.exe`, drag the generated `lex.yy.h` and `lex.yy.cpp` files there as well. If you are using specific reflex command-line options such as `−−flex`, add these in step 3.

Next up: a demonstration of RE/flex through examples, starting with a simple lexer to count Unicode words. The first example keeps things very simple. No prior knowledge of RE/flex or Flex is needed to follow the examples.

First example: a simple lexer to count Unicode words {#example1}
----------------------------------------------------

### Defining the lexer class

Our lexer class will need three variables to record the total character count `ch`, word count `wd`, and line count `nl` of some given input. To declare these, we add them to a `%class` block that will be part of our generated lexer class. We also add the initializations of our lexer to an `%init` block that is used by the lexer constructor. Because we want to print the results when we're done scanning, we also add a public method `results()` to our lexer class:

    %class{
      int ch, wd, nl;
     public:
      void results() { cout << setw(8) << nl << setw(8) << wd << setw(8) << ch << endl; }
    %}

    %init{
      ch = wd = nl = 0;
    %}

The generated lexer class is named `Lexer` by default. This name can be changed with `%option lexer=NAME`.

### The main program

Our generated lexer class has an `int lex()` method to scan the current input. This input is stdin by default. To change the input we can pass an input source to the constructor, which can be a `FILE*`, an `istream` object, or a (wide) string. We want our lexer to scan files. Our main program opens a file with the file name provided as a command-line argument and instantiates our lexer class to scan it. If no command-line argument is given, we scan stdin. To do so, our program constructs a `Lexer` and then invokes `lex()` to scan the entire input until EOF:

    int main(int argc, char **argv)
    {
      FILE *fd = stdin;
      if (argc > 1 && (fd = fopen(argv[1], "r")) == NULL)
        exit(EXIT_FAILURE);
      // create a lexer that consumes a file or reads stdin
      Lexer lexer(fd);
      // here we go!
      lexer.lex();
      // display the results
      lexer.results();
      if (fd != stdin)
        fclose(fd);
      return 0;
    }

The benefit of using `FILE*` (instead of `istream`) is that our lexer automatically detects a UTF [Byte Order Mark](https://en.wikipedia.org/wiki/Byte_order_mark) (BOM) in files, so it accepts ASCII as well as UTF-8, UTF-16, and UTF-32 encodings. Encodings are internally normalized to UTF-8 automatically for matching, so we do not need to apply any options or mess around with our lexer specifications.

### Lexer specifications

When `lex()` is invoked in our main program, it starts matching patterns against the input and executes actions when patterns match. We put our patterns and actions in the "rules section" of our lexer specification. A specification has the following structure:

[xml]
DEFINITIONS
%%
RULES
%%
USER CODE

We define named patterns as macros: they are substituted in patterns. We put named patterns in the "definitions section" of our lexer specification, preceded by any `%option` options that affect these patterns.

### Definitions section

For our example we need a pattern that captures newlines to bump the `nl` counter and a pattern that captures words to bump the `wd` counter. The character counter does not need a pattern: it is bumped when matching other patterns and when nothing else matches. We define two named patterns in the definitions section:

    %option unicode

    line  \r?\n
    word  (\w|\p{Punctuation})+

    %%

Unicode matching is enabled with `%option unicode`. Note that the `word` pattern matches a non-empty sequence of Unicode word characters and punctuation characters (it's really up to us to define what a "word" is; for example, the Unix "wc" utility considers anything that isn't a space a word).

### Rules section

The two named patterns defined in the definitions section are used in our rules section to trigger actions that bump our counters:

    %%
    {line}  ch += size(); ++nl;
    {word}  ch += size(); ++wd;
    .       ch += size();
    %%

The `ch` counter is increased by the length in bytes `size()` of a match. If you want to count the total number of wide characters scanned instead of bytes, use `wsize()`.

### More user code

We should not forget that our lexer class and our main program require the inclusion of C++ definitions to compile. We put these in a `%top` block:

    %top{
      #include <cstdio>
      #include <iostream>
      #include <iomanip>
      using namespace std;
    %}

As with Flex, we could have used `%{` and `%}` instead of `%top`, but if you do, beware that these declarations appear only after the standard includes and built-in class declarations, such as the abstract base lexer class declaration of our lexer class. Putting critical includes in a `%top` block is recommended.

### Putting it all together

We're now ready to put all of this together in an .l file, say `wordcount.l`, with the main program added to the user section below:

    %top{
      #include <cstdio>
      #include <iostream>
      #include <iomanip>
      using namespace std;
    %}

    %class{
      int ch, wd, nl;
     public:
      void results() { cout << setw(8) << nl << setw(8) << wd << setw(8) << ch << endl; }
    %}

    %init{
      ch = wd = nl = 0;
    %}

    %option unicode

    line  \r?\n
    word  (\w|\p{Punctuation})+

    %%
    {line}  ch += size(); ++nl;
    {word}  ch += size(); ++wd;
    .       ch += size();
    %%

    int main(int argc, char **argv)
    {
      FILE *fd = stdin;
      if (argc > 1 && (fd = fopen(argv[1], "r")) == NULL)
        exit(EXIT_FAILURE);
      // create a lexer that consumes a file or reads stdin
      Lexer lexer(fd);
      // here we go!
      lexer.lex();
      // display the results
      lexer.results();
      if (fd != stdin)
        fclose(fd);
      return 0;
    }

### Compiling the example

We run the reflex command on our `wordcount.l` file and compile the generated `lex.yy.cpp`. To compile, we should use the header files located in `reflex/include` and link our code with the RE/flex library `reflex/lib/libreflex.a`:

[command]
reflex wordcount.l
c++ -Ireflex/include -o wordcount lex.yy.cpp reflex/lib/libreflex.a

When all goes well and no errors are produced, RE/flex saves the lexer in a `lex.yy.cpp` file. To also generate a header file `lex.yy.h`, use option `−−header-file`, which comes in handy when compiling separate project files. Let's run our program on its own source code:

[command]
./wordcount wordcount.l
      45      98     706

You may notice a slight delay before the results are displayed. The design of RE/flex differs from Flex in that it generates a finite state machine (FSM) on demand in our program, but only once, instead of at compile time. This is the default behavior. To fully compile the FSM ahead of run time, use `%option full` or `%option fast` for a faster code-based FSM. All RE/flex `%option` options can also be set via command-line options. So we can use `−−full` or `−−fast` with the reflex command:

[command]
reflex −−full wordcount.l
c++ -Ireflex/include -o wordcount lex.yy.cpp reflex/lib/libreflex.a

There is one last thing we can improve. The direct use of `cout` in our program is a bit cumbersome. In fact, the lexer constructor takes an optional second argument that is the output stream to write output to. This output stream is referenced via the `out()` method of the lexer. I recommend using `out()` instead of `cout` in the `results()` method. In this way, `echo()` (or `ECHO`) will also echo matches to the specified output stream.

RE/flex versus Flex actions {#actions}
---------------------------

After reading the first example, you may wonder why RE/flex has new methods and variables compared to Flex, such as `size()` instead of `yyleng` or `YYLeng()`. In fact, you do not need to rewrite lexer specifications that are intended for Flex: with RE/flex option `−−flex` you can still use the old "yy" Flex functions and variables. However, there are additional built-in RE/flex methods that Flex does not support.
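To give an idea of the two styles, here is the same rule written twice: first with RE/flex lexer methods, then with the classic Flex names enabled by option `−−flex` (a minimal sketch for illustration):

    %%
    [0-9]+    out() << "number " << str() << " at line " << lineno() << std::endl;
    %%

and the Flex-style equivalent:

    %%
    [0-9]+    *yyout << "number " << yytext << " at line " << yylineno << std::endl;
    %%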
A selection of the most common RE/flex actions and their equivalent Flex actions (enabled with option `−−flex`) is shown in this table:

<table class="doxtable">
<tr><th>RE/flex action</th><th>Flex action</th><th>Result</th></tr>
<tr><td><code>text()</code></td><td><code>YYText()</code>, <code>yytext</code></td><td>0-terminated <code>const char*</code> text match</td></tr>
<tr><td><code>str()</code></td><td><em>n/a</em></td><td><code>std::string</code> text match</td></tr>
<tr><td><code>wstr()</code></td><td><em>n/a</em></td><td><code>std::wstring</code> wide text match</td></tr>
<tr><td><code>size()</code></td><td><code>YYLeng()</code>, <code>yyleng</code></td><td>size of the match in bytes</td></tr>
<tr><td><code>wsize()</code></td><td><em>n/a</em></td><td>number of wide chars matched</td></tr>
<tr><td><code>lineno()</code></td><td><code>yylineno</code></td><td>line number of match (&gt;=1)</td></tr>
<tr><td><code>columno()</code></td><td><em>n/a</em></td><td>column number of match (&gt;=0)</td></tr>
<tr><td><code>echo()</code></td><td><code>ECHO</code></td><td><code>out().write(text(), size())</code></td></tr>
<tr><td><code>start()</code></td><td><code>YY_START</code></td><td>get current start condition</td></tr>
<tr><td><code>start(n)</code></td><td><code>BEGIN n</code></td><td>set start condition to <code>n</code></td></tr>
<tr><td><code>push_state(n)</code></td><td><code>yy_push_state(n)</code></td><td>push current state, start <code>n</code></td></tr>
<tr><td><code>pop_state()</code></td><td><code>yy_pop_state()</code></td><td>pop state and make it current</td></tr>
<tr><td><code>top_state()</code></td><td><code>yy_top_state()</code></td><td>get top state start condition</td></tr>
<tr><td><code>in(i)</code></td><td><code>yyin = &amp;i</code></td><td>set input to <code>reflex::Input i</code></td></tr>
<tr><td><code>in()</code></td><td><code>*yyin</code></td><td>get <code>reflex::Input</code> object</td></tr>
<tr><td><code>out(s)</code></td><td><code>yyout = &amp;s</code></td><td>set output to <code>std::ostream s</code></td></tr>
<tr><td><code>out()</code></td><td><code>*yyout</code></td><td>get <code>std::ostream</code> object</td></tr>
<tr><td><code>out().write(s, n)</code></td><td><code>LexerOutput(s, n)</code></td><td>output chars <code>s[0..n-1]</code></td></tr>
<tr><td><code>out().put(c)</code></td><td><code>output(c)</code></td><td>output char <code>c</code></td></tr>
<tr><td><code>matcher().accept()</code></td><td><code>yy_act</code></td><td>number of the matched rule</td></tr>
<tr><td><code>matcher().text()</code></td><td><code>YYText()</code>, <code>yytext</code></td><td>same as <code>text()</code></td></tr>
<tr><td><code>matcher().str()</code></td><td><em>n/a</em></td><td>same as <code>str()</code></td></tr>
<tr><td><code>matcher().wstr()</code></td><td><em>n/a</em></td><td>same as <code>wstr()</code></td></tr>
<tr><td><code>matcher().size()</code></td><td><code>YYLeng()</code>, <code>yyleng</code></td><td>same as <code>size()</code></td></tr>
<tr><td><code>matcher().wsize()</code></td><td><em>n/a</em></td><td>same as <code>wsize()</code></td></tr>
<tr><td><code>matcher().input()</code></td><td><code>yyinput()</code></td><td>get next char from input</td></tr>
<tr><td><code>matcher().winput()</code></td><td><em>n/a</em></td><td>get wide character from input</td></tr>
<tr><td><code>matcher().unput(c)</code></td><td><code>unput(c)</code></td><td>put back char <code>c</code></td></tr>
<tr><td><code>matcher().peek()</code></td><td><em>n/a</em></td><td>peek at next char on input</td></tr>
<tr><td><code>matcher().more()</code></td><td><code>yymore()</code></td><td>concat next match to the match</td></tr>
<tr><td><code>matcher().less(n)</code></td><td><code>yyless(n)</code></td><td>shrink match's length to <code>n</code></td></tr>
<tr><td><code>matcher().first()</code></td><td><em>n/a</em></td><td>first pos of match in input</td></tr>
<tr><td><code>matcher().last()</code></td><td><em>n/a</em></td><td>last pos+1 of match in input</td></tr>
<tr><td><code>matcher().rest()</code></td><td><em>n/a</em></td><td>get rest of input until end</td></tr>
<tr><td><code>matcher().at_bob()</code></td><td><em>n/a</em></td><td>true if at the begin of input</td></tr>
<tr><td><code>matcher().at_end()</code></td><td><em>n/a</em></td><td>true if at the end of input</td></tr>
<tr><td><code>matcher().at_bol()</code></td><td><code>YY_AT_BOL()</code></td><td>true if at begin of a newline</td></tr>
</table>

See the [RE/flex documentation](http://www.genivia.com/doc/reflex/html) for more details.

Second example: a JSON parser and writer {#example2}
----------------------------------------

In this example, I will demonstrate several useful techniques with RE/flex to develop a compact parser for JSON.

### JSON

[JSON](http://json.org) is a simple data format to exchange structured information. JSON and its implementations have been heavily [criticized](http://seriot.ch/parsing_json.php) for lack of standardization. No worries though: the implementation I'm presenting here was successfully tested and validated against the JSON data patterns in that critical article and therefore conforms to the so-called JSON "standard".

### A tagged JSON data structure

To hold JSON values we need a tagged data structure. A tagged data structure holds a value of one of several types, with the tag acting as the value selector. A JSON value is one of undefined (UND), null (NUL), boolean (BOO), number (NUM), string (STR), array (ARR), object (OBJ), or is an error (ERR), which is used when parsing fails. These tags and the values we need to store are defined by the `class JSON`:

    class JSON {
     public:
      JSON() : tag(UND) { }
      enum Tag { UND, NUL, BOO, NUM, STR, ARR, OBJ, ERR } tag;
      bool                        boolean;
      double                      number;
      std::wstring                string;
      std::vector<JSON>           array;
      std::map<std::wstring,JSON> object;
    };

JSON arrays are stored as vectors and objects are maps. Because Unicode must be supported, strings and object property names are wide strings.
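For instance, the JSON array [1,true] would be held by this structure as follows (a hypothetical construction by hand, just to show how the tag selects the active field):

    JSON one, yes, arr;
    one.tag = JSON::NUM;  one.number  = 1.0;
    yes.tag = JSON::BOO;  yes.boolean = true;
    arr.tag = JSON::ARR;
    arr.array.push_back(one);
    arr.array.push_back(yes);

The parser presented below fills in this structure the same way, by assigning the tag and the corresponding value field for each JSON value it encounters.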
### Lexer class

To tokenize JSON we need to define tokens and token values. Any choice of numeric token encoding will do, but let's define tokens for the JSON constants to mean something intuitive, using ASCII code `'0'` for null, `'t'` for true, `'f'` for false, `'#'` for a literal number, and `'$'` for a string literal. Tokens for punctuation are their corresponding ASCII codes, such as `'['` for array, `'{'` for object, and others that we put in a character class `[][}{,:]` to match. Note that we can simply put an unescaped `]` up front in the bracket list, which works because the list can't be empty.

Tokens `'#'` and `'$'` have token values, which we will store in our lexer class instance as lexer state variables:

    %class{
     protected:
      double number;       // token value for token '#' (number)
      std::wstring string; // token value for token '$' (string)
    %}

The tokens and token values are returned by actions defined in the rules section of the specification. The definitions and rules sections of our lexer are mostly straightforward, except for scanning string literals, which must be converted to wide strings on the fly:

    %o flex freespace dotall unicode

    digit   [0-9]
    digit1  [1-9]
    digits  {digit}+
    int     -? {digit} | -? {digit1} {digits}
    frac    \. {digits}
    exp     [eE] [-+]? {digits}
    number  {int} {frac}? {exp}?

    %x STRING

    %%

    [ \t\r\n]+      { } /* ignore white space */
    [][}{,:]        { return yytext[0]; } /* match one of ] [ } { , : */
    null            { return '0'; } /* JSON null */
    true            { return 't'; } /* JSON true */
    false           { return 'f'; } /* JSON false */
    {number}        { number = strtod(yytext, NULL); return '#'; } /* JSON number */
    \"              { string.clear(); BEGIN STRING; } /* JSON string */

    <STRING>{
    \"              { BEGIN INITIAL; return '$'; } /* end of the string */
    \\ ["\\/]       { string.push_back(yytext[1]); } /* escapes \" \\ \/ */
    \\ b            { string.push_back('\b'); } /* backspace \b */
    \\ f            { string.push_back('\f'); } /* form feed \f */
    \\ n            { string.push_back('\n'); } /* newline \n */
    \\ r            { string.push_back('\r'); } /* return \r */
    \\ t            { string.push_back('\t'); } /* tab \t */
    \\ u [[:xdigit:]]{4}  { string.push_back(strtoul(yytext + 2, NULL, 16)); }
    []-\x7f\x20-[]  { string.push_back(yytext[0]); } /* ASCII character */
    \p{Non_ASCII_Unicode} { string.append(wstr()); } /* Unicode character */
    }

    <*> .           { return '!'; /* error */ }

    %%

Let's break this down, shall we?

1. Option `flex` allows us to use the classic "yy" Flex actions and variables such as `yytext` (see the table in the previous section of this article). Option `freespace` improves regex readability by permitting spacing. Option `dotall` forces `.` (dot) to match any character, including newline. Option `unicode` makes `.` (dot) also match Unicode characters, and `\p{Non_ASCII_Unicode}` matches any UTF-8 content that isn't ASCII.
2. The JSON lexical structure is simply copied here from the JSON "standard" into the definitions section with named patterns (macros).
3. As you can see, tokens are returned to the parser by actions in the rules section.
4. The `STRING` condition, declared as an exclusive start condition with `%x STRING`, has several patterns corresponding to the JSON "standard" string contents, which are converted to a wide string by the `STRING` conditional patterns and actions.
5. The `wstr()` call returns the wide character string matched by `\p{Non_ASCII_Unicode}` (which on Windows may consist of a UTF-16 surrogate pair). This gives us a valid non-ASCII Unicode character code point. ASCII characters that are valid in JSON strings are matched by `[]-\x7f\x20-[]`, which looks odd, but it is just a character class `[...]` that also accepts `[` and `]`; placing `]` at the start works because character classes cannot be empty (`[]`).
6. Any non-matched character in any state `<*>.` results in an error token `'!'` returned to the parser. This rule matches any character, including invalid code points, which should be flagged as an error.
7. The `number` and `string` variables are lexer class variables defined by the `%class` block described earlier.

### Parser class

RE/flex generates a `yyFlexLexer` class for this specification, since we use option `%o flex`.
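As with the first example, building this scanner is a matter of running reflex and compiling its output against libreflex (a sketch, assuming the specification is saved in a hypothetical file json.l; option −−flex is not needed on the command line because the specification already sets %o flex):

[command]
reflex json.l
c++ -Ireflex/include -o json lex.yy.cpp reflex/lib/libreflex.a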
Now that we have a lexer class `yyFlexLexer` that returns tokens and holds token values, the next step is to implement a recursive descent parser for the JSON syntax. Since the parser needs to invoke the lexer `yylex()` function to retrieve tokens and access token values from the lexer state, we simply derive the JSON parser from the `yyFlexLexer` class. Parsing a value into a JSON data structure with `parse(JSON& data)` amounts to calling `yylex()` and setting the JSON data to the appropriate tagged value:

    class JSONParser : public yyFlexLexer {
     public:
      JSONParser(FILE *fd = NULL) : yyFlexLexer(fd), level(0) { }
      JSON::Tag parse(JSON& data)
      {
        int token = yylex();
        switch (token) {
          case '0': return data.tag = JSON::NUL;
          case 't': data.boolean = true;  return data.tag = JSON::BOO;
          case 'f': data.boolean = false; return data.tag = JSON::BOO;
          case '#': data.number = number; return data.tag = JSON::NUM;
          case '$': data.string = string; return data.tag = JSON::STR;
          case '[': return parse_array(data);
          case '{': return parse_object(data);
          default : return error(token, data);
        }
      }
     protected:
      ...
      size_t level;
    };

Parsing JSON arrays is done by a protected function `parse_array(JSON& data)` that calls `parse(item)` repeatedly in a loop to parse array items:

     protected:
      JSON::Tag parse_array(JSON& data)
      {
        for (size_t len = 0; ; ++len) {
          JSON item;
          switch (parse(item)) {
            case JSON::NUL:
            case JSON::BOO:
            case JSON::NUM:
            case JSON::STR:
            case JSON::ARR:
            case JSON::OBJ: data.array.push_back(item); break;
            case JSON::UND: --level; return len == 0 ? data.tag = JSON::ARR : JSON::ERR;
            default : return JSON::ERR;
          }
          int token = yylex();
          if (token == ']') { --level; return data.tag = JSON::ARR; }
          if (token != ',') return JSON::ERR;
        }
        return JSON::ERR;
      }

Capturing empty arrays `[]` properly requires some logic, because `parse(item)` should return `UND` in that case and not an error `ERR`. This is done via the error function that is called by `parse(JSON& data)` when looking at a `]`:

     protected:
      JSON::Tag error(int token, JSON& data)
      {
        data.tag = JSON::ERR;
        switch (token) {
          case ']': return level > 0 ? JSON::UND : JSON::ERR;
          default : return JSON::ERR;
        }
      }

Parsing JSON objects is done by a protected function `parse_object(JSON& data)` that calls `parse(item)` repeatedly in a loop to parse property values created with `data.object[string]` (the string token value of the lexer):

     protected:
      JSON::Tag parse_object(JSON& data)
      {
        for (size_t len = 0; ; ++len) {
          int token = yylex();
          if (len == 0 && token == '}') { --level; return data.tag = JSON::OBJ; }
          if (token != '$') return JSON::ERR;
          JSON& item = data.object[string];
          if (yylex() != ':') return JSON::ERR;
          switch (parse(item)) {
            case JSON::NUL:
            case JSON::BOO:
            case JSON::NUM:
            case JSON::STR:
            case JSON::ARR:
            case JSON::OBJ: break;
            default : return JSON::ERR;
          }
          token = yylex();
          if (token == '}') { --level; return data.tag = JSON::OBJ; }
          if (token != ',') return JSON::ERR;
        }
        return JSON::ERR;
      }

### Writing JSON data to a stream

The `<<` operator can be overloaded to write objects to a stream with a custom format. We use this mechanism to write JSON data to a stream. The JSON data is displayed based on its tag:

    std::ostream& operator<<(std::ostream& os, const JSON& data)
    {
      switch (data.tag) {
        case JSON::NUL: os << "null"; break;
        case JSON::BOO: os << (data.boolean ? "true" : "false"); break;
        case JSON::NUM: os << data.number; break;
        case JSON::STR: print_string(os, data.string); break;
        case JSON::ARR:
          os << "[";
          for (std::vector<JSON>::const_iterator i = data.array.begin(); i != data.array.end(); ++i)
            os << (i != data.array.begin() ? "," : "") << *i;
          os << "]";
          break;
        case JSON::OBJ:
          os << "{";
          for (std::map<std::wstring,JSON>::const_iterator i = data.object.begin(); i != data.object.end(); ++i) {
            if (i != data.object.begin())
              os << ",";
            print_string(os, i->first);
            os << ":" << i->second;
          }
          os << "}";
          break;
        default : os << "undefined"; break;
      }
      return os;
    }

Writing JSON strings to a stream is done with `print_string()`, which escapes control characters and converts wide characters to UTF-8:

    void print_string(std::ostream& os, const std::wstring& s)
    {
      os << "\"";
      for (std::wstring::const_iterator i = s.begin(); i != s.end(); ++i) {
        switch (*i) {
          case '"' :
          case '\\': os << "\\" << static_cast<char>(*i); break;
          case '\b': os << "\\b"; break;
          case '\f': os << "\\f"; break;
          case '\n': os << "\\n"; break;
          case '\r': os << "\\r"; break;
          case '\t': os << "\\t"; break;
          default  :
            if (*i >= '\x20' && *i <= '\x7f') {
              // emit printable char
              os << static_cast<char>(*i);
            }
            else if (*i < 0x20) {
              // emit \u00xx for control codes
              os << "\\u" << std::internal << std::setw(4) << std::setfill('0') << std::hex << *i << std::dec;
            }
            else {
              // else emit UTF-8
              char buf[8];
              buf[reflex::utf8(*i, buf)] = '\0'; // convert to UTF-8 and make 0-terminated
              os << buf;
            }
        }
      }
      os << "\"";
    }

### The main program

Our main program simply parses JSON and displays it. The code is self-explanatory:

    int main(int argc, char **argv)
    {
      FILE *fd = NULL;
      if (argc > 1 && (fd = fopen(argv[1], "r")) == NULL)
        exit(EXIT_FAILURE);
      JSON data;
      // parse JSON from stdin or a file provided as a command-line argument (FILE* permits UTF BOM detection)
      JSONParser parser(fd);
      if (parser.parse(data) == JSON::ERR) {
        std::cerr << "Error at (" << parser.lineno() << "," << parser.columno() << ") when looking at " << parser.text() << std::endl;
        exit(EXIT_FAILURE);
      }
      // print JSON value
      std::cout << data;
      if (fd)
        fclose(fd);
      return 0;
    }

Deriving the parser from the lexer class pays off here: we can use the `text()`, `lineno()`, and `columno()` lexer methods in our program to report the details of an error. Note that using `yytext` here would not work outside a lexer method scope, but the Flex-like `YYText()` is fine. Finally, if our test program is part of the .l specification file's user section, then we need to make sure to include the proper C++ definitions by putting these in our .l specification:

    %top{
      #include <stdlib.h>
      #include <iostream>
      #include <iomanip>
      #include <vector>
      #include <map>
    %}

Tips, tricks and gotchas {#tricks}
------------------------

### How to make reflex accept Flex specifications

Use reflex option `−−flex`.

### Speeding things up

Use reflex option `−−full` to include the full FSM tables in the generated code. This avoids the startup overhead of constructing an internal FSM from the specified regular expressions when your scanner starts. Use reflex option `−−fast` to include direct code in the generated code. This produces a fast FSM in C++ code that is faster than tables and as fast as or faster than the fastest option of Flex (`flex -f` full and fast). The FSM encoded in tables and in code is compressed to reduce the number of edges by combining overlapping character ranges.
This means that the tables and code may not appear to be textbook deterministic finite automata (DFAs): the compression benefits the operational efficiency of the RE/flex matcher engine.

### Scanning ISO-8859-1 (ASCII + Latin-1 supplement) files with a Unicode scanner

Scanning files encoded in ISO-8859-1 with a Unicode scanner that expects UTF-8 will cause the scanner to misbehave or throw errors. Many text files are still encoded in ISO-8859-1 (also called Latin-1). To set up your scanner to safely scan ISO-8859-1 content when your scanner rules use Unicode with the `−−unicode` option, set the default file encoding to `latin`:

    reflex::Input input(stdin, reflex::Input::file_encoding::latin);
    Lexer lexer(input);
    lexer.lex();

This scans files from standard input that are encoded in ISO-8859-1, unless the file has a [UTF Byte Order Mark (BOM).](https://www.unicode.org/faq/utf_bom.html) When a BOM is detected, the scanner switches to UTF scanning. RE/flex supports other code pages as well, including CP 1250 to CP 1258, CP 437, CP 850, and EBCDIC. See the [RE/flex user guide](http://www.genivia.com/doc/reflex/html) for details.

### Debugging your scanner

Use reflex option `-d` to generate a scanner that writes its actions to stdout, showing the rules and actions invoked. Use reflex option `-p` to generate a scanner that reports the performance of the rule activations on stderr when EOF is reached. If EOF is not reached, make sure to invoke the lexer method `perf_report()` to view the performance report. To get accurate performance timings, do not combine option `-d` with `-p`.

### Using Boost.Regex

To use Boost.Regex as a matcher engine, use reflex option `−−matcher=boost`. This assumes that you've installed Boost.Regex. Optimization options `−−full` and `−−fast` do not apply. If you feel adventurous, use reflex option `−−matcher=boost-perl` for Perl-based matching instead of POSIX-based matching. Beware that this changes the matching behavior from the POSIX *leftmost longest match* to the *greedy leftmost* match as in Perl.

### Using PCRE2

To use JIT-optimized PCRE2 as a matcher engine, use reflex option `−−matcher=pcre2-perl`, which uses Perl-based matching instead of POSIX-based matching. Again, beware that this changes the matching behavior from the POSIX *leftmost longest match* to the *greedy leftmost* match as in Perl.

### Using RE/flex to convert regex

If you are looking to convert regular expressions with Unicode and UTF-8 to plain 8-bit regex forms, use reflex option `−−regexp-file` to generate a text file with the converted Unicode-based expressions (don't forget to use `%option unicode`). These converted expressions can be used in a C/C++ application directly, e.g. with PCRE2 or Boost.Regex, to scan strings.
For example, `\p{Greek}+` translates to the regex "(?m)((?:\\xcd[\\xb0-\\xb3]|\\xcd[\\xb5-\\xb7]|\\xcd[\\xba-\\xbd]|\\xcd\\xbf|\\xce\\x84|\\xce\\x86|\\xce[\\x88-\\x8a]|\\xce\\x8c|\\xce[\\x8e-\\xa1]|\\xce[\\xa3-\\xbf]|\\xcf[\\x80-\\xa1]|\\xcf[\\xb0-\\xbf]|\\xe1\\xb4[\\xa6-\\xaa]|\\xe1\\xb5[\\x9d-\\xa1]|\\xe1\\xb5[\\xa6-\\xaa]|\\xe1\\xb6\\xbf|\\xe1\\xbc[\\x80-\\x95]|\\xe1\\xbc[\\x98-\\x9d]|\\xe1(\\xbc[\\xa0-\\xbf]|\\xbd[\\x80-\\x85])|\\xe1\\xbd[\\x88-\\x8d]|\\xe1\\xbd[\\x90-\\x97]|\\xe1\\xbd\\x99|\\xe1\\xbd\\x9b|\\xe1\\xbd\\x9d|\\xe1\\xbd[\\x9f-\\xbd]|\\xe1\\xbe[\\x80-\\xb4]|\\xe1(\\xbe[\\xb6-\\xbf]|\\xbf[\\x80-\\x84])|\\xe1\\xbf[\\x86-\\x93]|\\xe1\\xbf[\\x96-\\x9b]|\\xe1\\xbf[\\x9d-\\xaf]|\\xe1\\xbf[\\xb2-\\xb4]|\\xe1\\xbf[\\xb6-\\xbe]|\\xe2\\x84\\xa6|\\xea\\xad\\xa5|\\xf0\\x90(\\x85[\\x80-\\xbf]|\\x86[\\x80-\\x8c])|\\xf0\\x90\\x86\\xa0|\\xf0\\x9d(\\x88[\\x80-\\xbf]|\\x89[\\x80-\\x85]))+)". You can put this to work with any regex matcher that takes UTF-8 input to search for Greek words. You can also convert expressive regex forms to a regex that a particular regex engine library accepts by using the converter directly in your code. For example, to match Unicode with the efficient RE/flex regex library:

    #include <reflex/matcher.h> // reflex::Matcher, reflex::Input, reflex::Pattern

    // a pattern with Unicode; this is allocated static so the internal FSM is constructed only once:
    static const reflex::Pattern pattern(reflex::Matcher::convert("[\\p{Greek}\\p{Zs}\\pP]+"));

    // use a Matcher to check if sentence is in Greek:
    if (reflex::Matcher(pattern, sentence).matches())
      std::cout << "This is Greek" << std::endl;

### Visualizing FSMs

To visualize the FSMs defined in the lexer specification rules section, use reflex option `−−graphs-file`. This generates Graphviz files that can be converted to PDF or to other graphics formats with the [Graphviz dot](http://www.graphviz.org) tool. This option does not work when Boost.Regex is selected as a matcher engine. For example, the JSON example rules consist of the INITIAL and STRING start condition states. Each start condition state has an FSM shown below.

<div class="chart"><a href="images/reflex-json-initial.png" data-title="The INITIAL start condition state FSM for the JSON example" data-lightbox="image-2"><img src="images/reflex-json-initial.png"/></a></div>
<div class="chart"><a href="images/reflex-json-string.png" data-title="The STRING start condition state FSM for the JSON example" data-lightbox="image-2"><img src="images/reflex-json-string.png"/></a></div>

### Changing input sources

Lexers can read from `FILE*`, `istream`, 8-bit strings, and wide strings. Pass the input as the first argument to the Lexer class, for example:

    Lexer lexer("How now brown cow");

or with the `−−flex` option:

    yyFlexLexer lexer("How now brown cow");

The lexer class accepts an optional second argument `ostream` that the `echo()` (Flex `ECHO`) and `out()` (Flex `yyout`) methods use. To change input on the fly, use the lexer class `in(i)` method, where `i` is a `FILE*`, an `istream`, an 8-bit string, or a wide string. Similarly, to change the output stream, use the lexer class `out(o)` method with an `ostream` parameter.

### Default ECHO

It is generally a good idea to use RE/flex option `-s` to remove the default rule that echoes unmatched text. Flex produces an error message when text is unmatched, while RE/flex remains silent and skips to the next match.

Algorithms {#algos}
----------

The RE/flex implementation uses new variations of algorithms and data structures.
To build RE/flex, I've improved a classic algorithm that converts regular expressions directly to deterministic finite automata (DFAs) using "follow sets", described in [Compilers: Principles, Techniques, and Tools](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools) and based on the work by R. McNaughton and H. Yamada, "Regular Expressions and State Graphs for Automata", IRE Transactions on Electronic Computers EC-9 (1960). The improved algorithm handles lazy quantifiers (non-greedy operators) for the optional operator `??` and the repetitions `+?`, `*?` and `{n,m}?`, which previously was not possible without constructing an intermediate non-deterministic finite automaton (NFA), e.g. using submatch-tracking variants of Thompson's algorithm as used by RE2 and PCRE and described in the excellent article [Regular Expression Matching Can Be Simple And Fast](https://swtch.com/~rsc/regexp/regexp1.html). DFAs are constructed efficiently from follow sets (which form the basis of DFA state transitions) by inserting DFA states (sets of positions) in a binary tree. DFA edges are represented by overlapping range sets `reflex::ORanges`, a data structure derived from `std::set` that efficiently merges adjacent ranges automatically when multiple sets are joined (range set union) or split (range set intersection); a sketch of this idea follows below. This turns out to be more space-efficient than bit vectors to represent sets, especially for wide character ranges that can extend up to U+10FFFF, while the time efficiency is comparable to bit vectors. The final finite state machines use compressed edges to minimize the number of conditional jumps.
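The following toy sketch illustrates the range-fusing idea behind `reflex::ORanges` (a simplified illustration of the concept, not the actual library API, which is built on `std::set`):

    #include <iostream>
    #include <map>

    // a toy set of character ranges keyed on range start, stored half-open as
    // [lo,hi); inserting a range that touches or overlaps existing ranges
    // fuses them into a single range, which keeps DFA edges compact
    struct ToyRanges {
      std::map<int,int> ranges; // start -> one-past-end
      void insert(int lo, int hi) // insert closed range [lo,hi]
      {
        ++hi; // work with half-open [lo,hi)
        std::map<int,int>::iterator it = ranges.lower_bound(lo);
        if (it != ranges.begin()) {
          std::map<int,int>::iterator prev = it;
          --prev;
          if (prev->second >= lo)
            it = prev; // predecessor touches or overlaps the new range
        }
        while (it != ranges.end() && it->first <= hi) { // fuse all touching ranges
          if (it->first < lo) lo = it->first;
          if (it->second > hi) hi = it->second;
          ranges.erase(it++);
        }
        ranges[lo] = hi;
      }
    };

    int main()
    {
      ToyRanges set;
      set.insert('a', 'k');
      set.insert('l', 'z'); // adjacent to [a,k]: fuses into a single range
      for (std::map<int,int>::const_iterator i = set.ranges.begin(); i != set.ranges.end(); ++i)
        std::cout << char(i->first) << '-' << char(i->second - 1) << std::endl; // prints a-z
      return 0;
    }

Merging `[a-k]` and `[l-z]` into a single edge `[a-z]` in this fashion is what reduces the number of edge comparisons the generated FSM performs at match time.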
Concluding remarks {#conclusions}
------------------

The idea behind lexical analyzer generators dates back to the 70s and even earlier. Basically, a lexical analyzer generator takes a set of regular expression patterns and actions. The generated lexical analyzer scans the input from beginning to end while matching the patterns and executing actions. It's amazing that this technology has stayed very relevant to this day. Perhaps the Lex manual is right in stating that "The asteroid to kill this dinosaur is still in orbit."

Lexical analyzer generators are typically used together with the Berkeley Yacc parser generator or with GNU Bison (a version of Yacc). A Yacc or Bison parser invokes the scanner as a "tokenizer" to retrieve "tokens" to parse. These tokens are produced by the scanner's actions when the input is scanned.

After teaching Compiler Construction courses for over 17 years, I have often looked at alternatives to Flex. But most alternatives introduce a new specification syntax and aren't fully compatible with Bison (bison-bridge and bison-locations). I wasn't ready to swap out the course textbooks (or write my own), so I needed a generator that accepts Flex specifications. I also wanted a generator that offers features that remove the limitations of Flex and its alternatives. So I decided to write my own lexical analyzer generator that is compatible with Flex.

Getting the highest performance possible wasn't the primary goal. Machines are damn fast these days anyway, and tokenizers take only a fraction of a compiler's execution time. A higher priority is getting more modern capabilities designed and implemented with reasonably good performance.

What about using other regex libraries as matcher engines in scanners and tokenizers? Many have contemplated that option. RE/flex permits this and simplifies the process by producing PCRE2 and Boost.Regex patterns ready to be used for scanning.

[![To top](images/go-up.png) To top](#)

<p align="right"><i>Copyright (c) 2020, Robert van Engelen, Genivia Inc. All rights reserved.</i></p>