"The asteroid to kill this dinosaur is still in orbit." – Lex Manual
"Reflex: a thing that is determined by and reproduces the essential features or qualities of something else." – Oxford Dictionary
A high-performance C++ regex library and a lexical analyzer generator like Flex and Lex.
The RE/flex lexical analyzer generator extends Flex++ with Unicode support and many other useful features, such as regex indentation anchors, lazy quantifiers, word boundaries, methods for error reporting and recovery, and options to simplify integration with Bison and other parsers.
The RE/flex lexical analyzer generator does all the heavy lifting for you to make it easier to integrate advanced tokenizers with Bison and other parsers. It generates the necessary glue code depending on the type of Bison parser used, such as the advanced "Bison complete parsers".
The high-performance RE/flex regex library generates finite state machine tables or direct code to scan and search input efficiently. RE/flex also includes a smart input class to normalize input from files, streams, strings, and memory to standard UTF-8 streams.
In a nutshell, the RE/flex lexical analyzer generator is a fast, Flex-compatible scanner generator for C++ with full Unicode support.

RE/flex includes usability improvements over Flex++, such as:

- yypush_buffer_state saves the scanner state (line, column, and indentation positions), not just the input buffer;
- str() and wstr() to obtain the (wide) string match, and line() and wline() to obtain the current (wide) line for error reporting.

Rule patterns in a lexer specification are converted by the reflex tool to efficient deterministic finite state machines, either in direct code (option −−fast) or in opcode tables (option −−full). Other regex engines to choose from include PCRE2 and Boost.Regex for Perl and POSIX matching modes. UTF-8/16/32 file input normalization for Unicode pattern matching is performed automatically. Other encodings can be specified programmatically with minimal coding. Therefore, RE/flex scanners can work on any type of input.
RE/flex incorporates proper object-oriented design principles and does not rely on macros and globals as Flex does. Macros and globals are added to the source code generated by reflex
only when option −−flex
is specified. A RE/flex scanner is a generated class derived from a base lexer class template, with its matcher engine defined as a template parameter.
RE/flex is compatible with Lex/Flex and Bison/Yacc with options −−flex
and −−bison
, respectively. Option −−yy
forces basic, no-frills Lex POSIX compliance of the lexer input (but with C++ output). RE/flex also offers specific options to seamlessly integrate Bison bridge, Bison locations, Bison C++, Bison complete, and reentrant parsers.
In this document we use regex as a shorthand for regular expression. Strictly speaking, a "regular expression" refers to the formal concept of regular languages, whereas regex often refers to the backtracking-based regex matching that Perl introduced. Both concepts are applicable to RE/flex patterns.
In summary, RE/flex is really several things combined into one package: a lexical analyzer generator and a set of regex matcher classes built on the built-in RE/flex engine and the PCRE2, Boost.Regex, and C++11 std::regex libraries for matching, searching, splitting and scanning of input, with input from (wide) strings, files, and streams of potentially unlimited length.

The typographical conventions used by this document are:

- Courier denotes C and C++ source code.
- Courier denotes lexer specifications and file names.
- Courier denotes commands and command or program output displayed in a terminal window.

Lex, Flex and variants are powerful scanner generators that generate scanners (a.k.a. lexical analyzers and lexers) from lexer specifications. The lexer specifications define patterns with user-defined actions that are executed when their patterns match the input stream. The scanner repeatedly matches patterns and triggers these actions until the end of the input stream is reached.
Both Lex and Flex are popular tools for developing tokenizers in which the user-defined actions emit or return a token when the corresponding pattern matches. These tokenizers are typically implemented to scan and tokenize the source code for a compiler or an interpreter of a programming language. The regular expression patterns in a tokenizer define the make-up of identifiers, constants, keywords, and punctuation, and specify how to skip over white space in the source code that is scanned.
Consider for example the following patterns and associated actions defined in a lexer specification:
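For instance, a small specification along these lines (a sketch; the exact patterns are illustrative, and the token names are the ones used in the discussion below):

```
%%
if                      return KEYWORD_IF;
[a-zA-Z_][a-zA-Z0-9_]*  return ASCII_IDENTIFIER;
"="                     return OP_ASSIGN;
[0-9]+                  return NUMBER;
[ \t\r\n]               /* skip white space */
%%
```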
When the tokenizer matches a pattern, the corresponding action is invoked. The example above returns tokens to the compiler's parser, which repeatedly invokes the tokenizer for more tokens until the tokenizer reaches the end of the input. The tokenizer returns zero (0) when the end of the input is reached.
Lex and Flex have remained relatively stable (inert) tools while the demand has increased for tokenizing Unicode texts encoded in common wide character formats such as UTF-8, UCS/UTF-16, and UTF-32. Also the regular expression syntax in Flex/Lex is limited compared to modern regex syntax. Flex has no support for Unicode patterns, no lazy repetitions, no word boundary anchors, no indentation matching with indent or dedent anchors, and a very limited collection of meta escapes to choose from. To make things even more interesting, it is a challenge to write lexer specifications that avoid the "greedy trap" of POSIX matching.
Flex/Lex scanners use POSIX pattern matching, meaning that the leftmost longest match is returned (among a set of patterns that match the same input). Because POSIX matchers produce the longest match for any given input text, we should be careful when using patterns with "greedy" repetitions (X*
, X+
etc.) because our pattern may gobble up more input than intended. We end up falling into the "greedy trap".
To illustrate this trap consider matching HTML comments <!−− ... −−>
with the pattern <!−−.*−−>
. The problem is that the repetition X*
is greedy and the .*−−>
pattern matches everything until the last −−>
while moving over −−>
that are between the <!−−
and the last −−>
.
A single dot . normally does not match a newline \n in Flex/Lex patterns, unless we use dot-all mode, which is sometimes confusingly called "single line mode".

We can use much more complex patterns such as <!−−([^−]|−[^−]|−−+[^−>])*−*−−>
just to match comments in HTML, by ensuring the pattern ends at the first match of a −−>
in the input and not at the very last −−>
in the input. The POSIX leftmost longest match can be surprisingly effective in rendering our tokenizer into works of ASCII art!
We may claim our intricate pattern trophies as high achievements to the project team, but our team will quickly point out that a regex <!−−.*?−−>
suffices to match HTML comments with the lazy repetition X*?
construct, also known as a non-greedy repeat. The ?
is a lazy quantifier that modifies the behavior of the X*?
repeat to match X only as many times as needed for the rest of the pattern to match. Therefore, the regex <!−−.*?−−>
matches HTML comments and nothing more.
But Flex/Lex does not permit us to be lazy!
Not surprisingly, even the Flex manual shows ad-hoc code rather than a pattern to scan over C/C++ source code input to match multiline comments that start with a /*
and end with the first occurrence of a */
. The Flex manual recommends:
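The recommended workaround uses an exclusive start condition; a sketch along the lines of the Flex manual's approach (adapted here, not quoted verbatim):

```
%x COMMENT
%%
"/*"                 BEGIN(COMMENT);     /* enter comment-skipping mode */
<COMMENT>[^*]*       /* eat anything that is not a '*' */
<COMMENT>"*"+[^*/]*  /* eat '*'s not followed by a '/' */
<COMMENT>"*"+"/"     BEGIN(INITIAL);     /* end of the comment */
%%
```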
Another argument to use this code with Flex is that the internal Flex buffer is limited to 16K. By contrast, RE/flex buffers are dynamically resized and will never run out of buffer space to accept long matches.
Workarounds such as these are not necessary with RE/flex. The RE/flex scanners use regex libraries with expressive pattern syntax. We can use lazy repetition to write a regex pattern for multiline comments as follows:
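For example, a single rule along these lines suffices (a sketch; the (.|\n) alternation is used because dot does not match a newline unless dotall is enabled):

```
"/*"(.|\n)*?"*/"    /* skip a multi-line comment, ending at the first "*/" */
```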
Most regex libraries support syntaxes and features that we have come to rely on for pattern matching. A regex with lazy quantifiers can be much easier to read and comprehend compared to a greedy variant. Most regex libraries that support lazy quantifiers run in Perl mode, using backtracking over the input. Scanners use POSIX mode matching, meaning that the leftmost longest match is found. The difference is important as we saw earlier and even more so when we consider the problems with Perl mode matching when specifying patterns to tokenize input, as we will explain next.
Consider the lexer specification example shown earlier. Suppose the input text to tokenize is iflag = 1
. In POSIX mode we return ASCII_IDENTIFIER
for the name iflag
, OP_ASSIGN
for =
, and NUMBER
for 1
. In Perl mode, we find that iflag
matches if
and the rest of the name is not consumed, which gives KEYWORD_IF
for if
, ASCII_IDENTIFIER
for lag
, OP_ASSIGN
for =
, and a NUMBER
for 1
. Perl mode matching greedily returns leftmost matches.
Using Perl mode in a scanner requires all overlapping patterns to be defined in a lexer specification such that all longest matching patterns are defined first to ensure longest matches. By contrast, POSIX mode is declarative and allows you to define the patterns in the specification in any order. Perhaps the only ordering constraint on patterns is for patterns that match the same input, such as matching the keyword if
in the example: KEYWORD_IF
must be matched before ASCII_IDENTIFIER
.
For this reason, RE/flex scanners use a regex library in POSIX mode by default.
In summary, RE/flex scanners use POSIX mode matching by default, and option −−bison generates a scanner compatible with Bison. RE/flex also offers options to integrate Bison bridge, Bison locations, Bison complete, and reentrant parsers.

The RE/flex scanner generator section has more details on the RE/flex scanner generator tool.
In the next part of this manual, we will take a quick look at the RE/flex regex API that can be used as a stand-alone library for matching, searching, scanning and splitting input from strings, files and streams in regular C++ applications (i.e. applications that are not necessarily tokenizers for compilers).
The RE/flex regex pattern matching classes include two classes for Boost.Regex, two classes for PCRE2, two classes for C++11 std::regex, and a RE/flex class:
Engine | Header file to include | C++ matcher classes |
---|---|---|
RE/flex regex | reflex/matcher.h | Matcher |
PCRE2 | reflex/pcre2matcher.h | PCRE2Matcher , PCRE2UTFMatcher |
Boost.Regex | reflex/boostmatcher.h | BoostMatcher , BoostPosixMatcher |
std::regex | reflex/stdmatcher.h | StdMatcher , StdPosixMatcher |
The RE/flex reflex::Matcher
class compiles regex patterns to efficient non-backtracking deterministic finite state machines (FSM) when instantiated. These deterministic finite automata (DFA) representations speed up matching considerably, at the cost of the initial FSM construction (see further below for hints on how to avoid this run time overhead). RE/flex matchers only support POSIX mode matching, see POSIX versus Perl matching .
The reflex::PCRE2Matcher
and reflex::PCRE2UTFMatcher
classes are for efficient Perl mode matching with PCRE2 using JIT (just-in-time compilation), where the latter uses native PCRE2 Unicode matching with PCRE2_UTF+PCRE2_UCP
. The PCRE2 matchers use JIT optimizations to speed up matching, which comes at a cost of extra processing when the matcher is instantiated. The benefit outweighs the cost when many matches are processed.
The reflex::BoostMatcher
and reflex::BoostPosixMatcher
classes are for Perl mode and POSIX mode matching using the Boost Regex library, respectively.
C++11 std::regex supports ECMAScript and AWK POSIX syntax with the StdMatcher
and reflex::StdPosixMatcher
classes respectively. The std::regex syntax is therefore a lot more limited compared to PCRE2, Boost.Regex, and RE/flex. These regex matchers are considerably slower compared to the other matchers.
The RE/flex regex common interface API is implemented in an abstract base class template reflex::AbstractMatcher
from which all regex matcher engine classes are derived. This regex API offers a uniform common interface. This interface is used in the generated scanner. You can also use this uniform API in your C++ application for pattern matching with any of the regex libraries without having to use library-specific API calls to do so.
The RE/flex abstract matcher offers four operations for matching with the regex engines derived from this base abstract class:
Method | Result |
---|---|
matches() | returns nonzero if the input from begin to end matches |
find() | search the given input and return nonzero if a match was found |
scan() | return nonzero if input at current position matches partially |
split() | return nonzero for a split of the input at the next match |
These methods return a nonzero value for a match, meaning the size_t accept()
value that identifies the regex group pattern that matched. The methods are repeatable, where the last three return additional matches found when repeated.
For example, to check if a string is a valid date using Boost.Regex:
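A minimal sketch (the date regex here is illustrative, not a full validity check):

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  // matches() returns nonzero when the whole input matches the pattern
  if (reflex::BoostMatcher("\\d{4}-\\d{2}-\\d{2}", "2024-03-15").matches() != 0)
    std::cout << "date is valid" << std::endl;
}
```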
We can perform exactly the same check with PCRE2 instead of Boost.Regex. However, the JIT-optimized PCRE2 matcher is better suited when many matches are processed, not just one as shown here:
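The same sketch with PCRE2; only the header and class name change:

```cpp
#include <reflex/pcre2matcher.h>
#include <iostream>

int main()
{
  // same check as above, now using the JIT-optimized PCRE2 engine
  if (reflex::PCRE2Matcher("\\d{4}-\\d{2}-\\d{2}", "2024-03-15").matches() != 0)
    std::cout << "date is valid" << std::endl;
}
```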
Swapping regex libraries is simple. Sometimes we may need a regex converter when a regex feature is used that the regex library does not support.
To search a string for all words matching the pattern \w+
:
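A sketch of the search loop that produces the output shown below:

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  reflex::BoostMatcher matcher("\\w+", "How now brown cow.");
  while (matcher.find() != 0)
    std::cout << "Found " << matcher.text() << std::endl;
}
```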
When executed this code prints:
Found How Found now Found brown Found cow
If we want to match Unicode words, \w+
should be converted to a Unicode pattern; here we convert the pattern for matching with Boost.Regex:
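A sketch of the conversion; the converted pattern is kept in a static string so the conversion runs only once:

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  // convert \w+ to a UTF-8-based Unicode pattern for the 8-bit Boost.Regex engine
  static const std::string regex =
      reflex::BoostMatcher::convert("\\w+", reflex::convert_flag::unicode);
  reflex::BoostMatcher matcher(regex, "Höw nöw bröwn cöw.");
  while (matcher.find() != 0)
    std::cout << "Found " << matcher.text() << std::endl;
}
```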
Conversion to Unicode patterns is necessary for all matchers except reflex::PCRE2UTFMatcher
, since matchers operate in non-Unicode mode by default to match bytes, not wide characters. We will come back again to converters later.
When executed this code prints:
Found Höw Found nöw Found bröwn Found cöw
The same code and results are produced with reflex::PCRE2Matcher
defined in reflex/pcre2matcher.h
. For the following examples we will use Boost.Regex or PCRE2, which may be used interchangeably.
The scan
method is similar to the find
method, but scan
matches only from the current position in the input. It fails when no partial match was possible at the current position. Repeatedly scanning an input source means that matches must be continuous; otherwise scan
returns zero (no match).
The split
method is roughly the inverse of the find
method and returns text located between matches. For example using non-word matching \W+
:
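A sketch of the split loop that produces the output shown below:

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  reflex::BoostMatcher matcher("\\W+", "How now brown cow.");
  while (matcher.split() != 0)
    std::cout << "Found " << matcher.text() << std::endl;
}
```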
When executed this code prints:
Found How Found now Found brown Found cow Found
Note that split also returns the (possibly empty) remaining text after the last match, as you can see in the output above: the last split with \W+
returns an empty string, which is the remaining input after the period in the sentence.
The find()
, scan()
and split()
methods return a nonzero *"accept"* value, which corresponds to the regex group captured, or the methods return zero if no match was found. The methods return 1 for a match when no groups are used. The split()
method has a special case. It returns the value reflex::PCRE2Matcher::Const::EMPTY
(and so does any other matcher) when a match was made at the end of the input and an empty string was split, as is the case of the last split()
match in the example above.
Another example:
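A sketch that scans words and spaces continuously, assuming the same "How now brown cow." input as before:

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  // two grouped alternatives: group 1 matches a word, group 2 matches spacing
  reflex::BoostMatcher matcher("(\\w+)|(\\s+)", "How now brown cow.");
  while (true)
  {
    switch (matcher.scan())
    {
      case 1:  std::cout << "word ";  break;                 // group 1 matched
      case 2:  std::cout << "space "; break;                 // group 2 matched
      default: std::cout << "other" << std::endl; return 0;  // not continuous
    }
  }
}
```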
When executed this code prints:
word space word space word space word other
The regex engines currently available as classes in the reflex
namespace are:
Class | Mode | Engine | Performance |
---|---|---|---|
Matcher | POSIX | RE/flex lib | deterministic finite automaton, no backtracking |
PCRE2Matcher | Perl | PCRE2 | JIT-optimized backtracking |
PCRE2UTFMatcher | Perl | PCRE2 UTF+UPC | JIT-optimized backtracking |
BoostMatcher | Perl | Boost.Regex | backtracking |
BoostPerlMatcher | Perl | Boost.Regex | backtracking |
BoostPosixMatcher | POSIX | Boost.Regex | backtracking |
StdMatcher | ECMA | std::regex | backtracking |
StdEcmaMatcher | ECMA | std::regex | backtracking |
StdPosixMatcher | POSIX | std::regex | backtracking |
The RE/flex regex engine uses a deterministic finite state machine (FSM) to get the best performance when matching. However, constructing an FSM adds overhead. This matcher is therefore better suited for searching long texts. The FSM construction overhead can be eliminated by pre-converting the regex to C++ code tables ahead of time, as we will see shortly.
The Boost.Regex engines normally use Perl mode matching. We added a POSIX mode Boost.Regex engine class for the RE/flex scanner generator. Scanners typically use POSIX mode matching. See POSIX versus Perl matching for more information.
The Boost.Regex engines are all initialized with match_not_dot_newline
, which disables dotall matching as the default setting. Dotall can be re-enabled with the (?s)
regex mode modifier. This is done for compatibility with scanners.
The PCRE2 engines use Perl mode matching. PCRE2 also offers POSIX mode matching with pcre2_dfa_match()
. However, group captures are not supported in this mode. Therefore, no PCRE2 POSIX mode class is included as a choice. JIT optimizations speed up matching. However, this comes at a cost of extra processing when the PCRE2 matcher class is instantiated.
A matcher may be applied to strings and wide strings, such as std::string
and std::wstring
, char*
and wchar_t*
. Wide strings are converted to UTF-8 to enable matching with regular expressions that contain Unicode patterns.
To match Unicode patterns with regex library engines that are 8-bit based or do not support Unicode, we want to convert the regex string first before we use it with a regex matcher engine:
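For example (the \p{Greek} character class is illustrative here):

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  // convert the Unicode character class to a UTF-8-based pattern first
  static const std::string regex =
      reflex::BoostMatcher::convert("[\\p{Greek}]+", reflex::convert_flag::unicode);
  reflex::BoostMatcher matcher(regex, "αβγ and δεζ");
  while (matcher.find() != 0)
    std::cout << "Found " << matcher.text() << std::endl;
}
```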
This converts the Unicode character classes to UTF-8 for matching with an 8-bit regex engine. The convert
static method differs per matcher class. An error reflex::regex_error
is thrown as an exception if conversion was not possible, which is unlikely, or if the regex is syntactically incorrect.
Conversion is fast (it runs in linear time in the size of the regex), but it is not without some overhead. We should make the converted regex patterns static
whenever possible, as shown above, to eliminate the cost of repeated conversions and pattern constructions.
A reflex::Pattern
object is immutable (it stores a constant table) and may be shared among threads.
Use convert
with option reflex::convert_flag::unicode
to change the meaning of .
(dot), \w
, \s
, \l
, \u
, \W
, \S
, \L
, \U
character classes.
File contents are streamed into the matcher using partial matching algorithms and matching happens immediately. This means that the input does not need to be loaded as a whole into memory. This supports interactive matching, i.e. matching the input from a console:
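A sketch that matches words typed on the console:

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  reflex::BoostMatcher matcher("\\w+", std::cin);
  matcher.interactive();               // consume one character of input at a time
  while (matcher.find() != 0)
    std::cout << "Found " << matcher.text() << std::endl;
}
```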
Interactive input is slow to consume due to non-buffered input.
We can also pattern match text from FILE
descriptors. The additional benefit of using FILE
descriptors is the automatic decoding of UTF-16/32 input to UTF-8 by the reflex::Input
class that manages input sources and their state.
For example, pattern matching the content of "cows.txt"
that may use UTF-8, 16, or 32 encodings:
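A sketch; the file name and pattern are illustrative:

```cpp
#include <reflex/boostmatcher.h>
#include <cstdio>
#include <iostream>

int main()
{
  FILE *fd = fopen("cows.txt", "r");
  if (fd != NULL)
  {
    // reflex::Input detects a UTF BOM and normalizes UTF-16/32 input to UTF-8
    reflex::BoostMatcher matcher("\\w+", fd);
    while (matcher.find() != 0)
      std::cout << "Found " << matcher.text() << std::endl;
    fclose(fd);
  }
}
```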
The find
, scan
, and split
methods are also implemented as input iterators that apply filtering, tokenization, and splitting:
Iterator range | Acts as a | Iterates over |
---|---|---|
find.begin() ...find.end() | filter | all matches |
scan.begin() ...scan.end() | tokenizer | continuous matches |
split.begin() ...split.end() | splitter | text between matches |
The type reflex::AbstractMatcher::Operation
is a functor that defines find
, scan
, and split
. The functor operation returns true upon success. The use of an iterator is simply supported by invoking begin()
and end()
methods of the functor, which return reflex::AbstractMatcher::iterator
. Likewise, there are also cbegin()
and cend()
methods that return a const_iterator
.
We can use these RE/flex iterators in C++ for many tasks, including populating containers by stuffing the iterator's text matches into them:
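A sketch using the iterator range constructor of std::vector, assuming the same example input:

```cpp
#include <reflex/boostmatcher.h>
#include <string>
#include <vector>

int main()
{
  reflex::BoostMatcher matcher("\\w+", "How now brown cow.");
  // each iterated match converts to std::string (a copy of the matched text)
  std::vector<std::string> words(matcher.find.begin(), matcher.find.end());
}
```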
As a result, the words
vector contains "How", "now", "brown", "cow".
Casting a matcher object to std::string
is the same as converting text()
to a string with std::string(text(), size())
, which in the example above is done to construct the words
vector. Casting a matcher object to std::wstring
is similar, but also converts the UTF-8 text()
match to a wide string.
RE/flex iterators are useful in C++11 range-based loops. For example:
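A sketch that produces the output shown below:

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  reflex::BoostMatcher matcher("\\w+", "How now brown cow.");
  for (auto& match : matcher.find)
    std::cout << "Found " << match.text() << std::endl;
}
```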
When executed this code prints:
Found How Found now Found brown Found cow
And RE/flex iterators are also useful with algorithms and lambdas, for example to compute a histogram of word frequencies:
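A sketch with three top-level groups; the group index of each match is used to bump a counter:

```cpp
#include <reflex/boostmatcher.h>
#include <algorithm>

int main()
{
  // group 1 is (now), group 2 is (cow), group 3 is (ow)
  reflex::BoostMatcher matcher("(now)|(cow)|(ow)", "How now brown cow.");
  size_t freq[4] = { 0, 0, 0, 0 };
  // each match converts to size_t, the index of the group that matched
  std::for_each(matcher.find.begin(), matcher.find.end(),
                [&](size_t n) { ++freq[n]; });
}
```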
As a result, the freq
array contains 0, 1, 1, and 2.
Casting the matcher object to a size_t
returns the group capture index, which is used in the example shown above. We also use it in the example below, which captures all regex pattern groupings into a vector:
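A sketch, reusing the pattern of the previous example:

```cpp
#include <reflex/boostmatcher.h>
#include <vector>

int main()
{
  reflex::BoostMatcher matcher("(now)|(cow)|(ow)", "How now brown cow.");
  // each match converts to its size_t group capture index: 3, 1, 3, and 2
  std::vector<size_t> captures(matcher.find.begin(), matcher.find.end());
}
```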
As a result, the vector contains the group captures 3, 1, 3, and 2.
Casting the matcher object to size_t
is the same as invoking accept()
.
This method and other methods may be used to obtain the details of a match:
Method | Result |
---|---|
accept() | returns group capture index (or zero if not captured/matched) |
text() | returns const char* to 0-terminated match (ends in \0 ) |
strview() | returns std::string_view text match (preserves \0 s) (C++17) |
str() | returns std::string text match (preserves \0 s) |
wstr() | returns std::wstring wide text match (converted from UTF-8) |
chr() | returns first 8-bit char of the text match (str()[0] as int) |
wchr() | returns first wide char of the text match (wstr()[0] as int) |
pair() | returns std::pair<size_t,std::string>(accept(),str()) |
wpair() | returns std::pair<size_t,std::wstring>(accept(),wstr()) |
size() | returns the length of the text match in bytes |
wsize() | returns the length of the match in number of wide characters |
lines() | returns the number of lines in the text match (>=1) |
columns() | returns the number of columns of the text match (>=0) |
begin() | returns const char* to non-0-terminated text match begin |
end() | returns const char* to non-0-terminated text match end |
rest() | returns const char* to 0-terminated rest of input |
span() | returns const char* to 0-terminated match enlarged to span the line |
line() | returns std::string line with the matched text as a substring |
wline() | returns std::wstring line with the matched text as a substring |
more() | tells the matcher to append the next match (when using scan() ) |
less(n) | cuts text() to n bytes and repositions the matcher |
lineno() | returns line number of the match, starting at line 1 |
columno() | returns column number of the match in characters, starting at 0 |
lineno_end() | returns ending line number of the match, starting at line 1 |
columno_end() | returns ending column number of the match, starting at 0 |
bol() | returns const char* to begin of matching line (not 0-terminated) |
border() | returns the byte offset from the start of the line of the match |
first() | returns input position of the first character of the match |
last() | returns input position + 1 of the last character of the match |
at_bol() | true if matcher reached the begin of a new line \n |
at_bob() | true if matcher is at the begin of input and no input consumed |
at_end() | true if matcher is at the end of input |
[0] | operator returns std::pair<const char*,size_t>(begin(),size()) |
[n] | operator returns n'th capture std::pair<const char*,size_t> |
For a detailed explanation of these methods, see Properties of a match.
The operator[n]
takes the group number n
and returns the n'th group capture match as a pair with a const char*
pointer to the group-matching text and the size of the matched text in bytes. Because the pointer points to a string that is not 0-terminated, use the size to determine the matching part.
The pointer is NULL
when the group capture has no match.
For example:
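A sketch with two group captures that produces the output shown below:

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>
#include <string>

int main()
{
  reflex::BoostMatcher matcher("(\\w+)\\s+(\\d+)", "cow 123");
  if (matcher.matches() != 0)
    std::cout << "name: "     << std::string(matcher[1].first, matcher[1].second)
              << ", number: " << std::string(matcher[2].first, matcher[2].second)
              << std::endl;
}
```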
When executed this code prints:
name: cow, number: 123
- The text() method returns the match by pointing to the const char* string that is stored in an internal buffer. This pointer should not be used after matching continues or when the matcher object is deallocated. To retain the text() value we recommend using the str() method, which returns a copy of text(). Likewise, the strview() method refers to the same text() buffer and should not be used after matching continues.
- The operator[] method returns a pair with the match info of the n'th group: a non-0-terminated const char* pointer (or NULL) and the size in bytes of the captured match. The string should not be used after matching continues.
- For the reflex::Matcher class, the accept() method returns the accepted pattern among the alternations in the regex that are specified only at the top level of the regex. For example, the regex "(a(b)c)|([A-Z])" has two groups, because only the outer top-level groups are recognized. Because groups are specified at the top level only, the grouping parentheses are optional. We can simplify the regex to "a(b)c|[A-Z]" and still capture the two patterns.

The following methods may be used to manipulate the input stream directly:
Method | Result |
---|---|
input() | returns next 8-bit char from the input, matcher then skips it |
winput() | returns next wide character from the input, matcher skips it |
unput(c) | put 8-bit char c back onto the stream, matcher then takes it |
wunput(c) | put (wide) char c back onto the stream, matcher then takes it |
peek() | returns next 8-bit char from the input without consuming it |
skip(c) | skip input until character c (char or wchar_t ) is consumed |
skip(s) | skip input until UTF-8 string s is consumed |
rest() | returns the remaining input as a 0-terminated char* string |
The input()
, winput()
, and peek()
methods return a non-negative character code, or EOF (-1) when the end of input is reached.
To initialize a matcher for interactive use, to assign a new input source or to change its pattern, use the following methods:
Method | Result |
---|---|
input(i) | set input to reflex::Input i (string, stream, or FILE* ) |
pattern(p) | set pattern reflex::Pattern , boost::regex , or a string p |
has_pattern() | true if the matcher has a pattern assigned to it |
own_pattern() | true if the matcher has a pattern to manage and delete |
pattern() | a reference to the pattern object |
buffer() | buffer all input at once, returns true if successful |
buffer(n) | set the buffer size to n bytes to buffer input |
buffer(b, n) | use a buffer of n bytes at address b holding a string of n-1 bytes (zero copy) |
interactive() | set buffer size to 1 for console-based (TTY) input |
flush() | flush the remaining input from the internal buffer |
reset() | resets the matcher, restarting it from the remaining input |
reset(o) | resets the matcher with new options string o ("A?N?T?") |
A reflex::Input
object represents the source of input for a matcher, which is either a file FILE*
, or a string (with UTF-8 character data) of const char*
or std::string
type, or a stream pointer std::istream*
. The reflex::Input
object is implicitly constructed from one of these input sources, for example:
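A sketch of the implicit conversions (the string and stream sources are illustrative):

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  reflex::Input input;
  input = "How now brown cow.";   // a const char* or std::string with UTF-8 data
  input = stdin;                  // a FILE*, with automatic UTF BOM detection
  input = std::cin;               // a std::istream
  reflex::BoostMatcher matcher("\\w+", input);
}
```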
The entire input is buffered in a matcher with buffer()
, or is read piecemeal with buffer(n)
, or is read interactively with interactive()
. These methods should be used after setting the input source. Reading a stream with buffering all input data at once is done with the >>
operator as a shortcut:
Zero-copy overhead is achieved by specifying buffer(b, n)
to read n
-1 bytes located at address b
for in-place matching, where bytes b[0...n]
are possibly modified by the matcher:
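A sketch of in-place matching over a writable buffer; sizeof(buf) counts the terminating \0 as the extra byte that the matcher may modify:

```cpp
#include <reflex/boostmatcher.h>
#include <iostream>

int main()
{
  char buf[] = "How now brown cow.";     // writable buffer, includes the final \0
  reflex::BoostMatcher matcher("\\w+");
  matcher.buffer(buf, sizeof(buf));      // zero copy: match the buffer in place
  while (matcher.find() != 0)
    std::cout << "Found " << matcher.text() << std::endl;
}
```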
- buffer(b, n) specifies n-1 bytes at address b. The length n should include one extra byte that may be modified by the matcher, for example when text() or rest() are used.
- Only unput(c), wunput(c), text(), rest(), and span() modify the buffer contents, because these functions require an extra byte at the end of the buffer to make the strings returned by these methods 0-terminated. This means that we can safely specify read-only memory of n bytes located at address b with buffer(b, n+1), as long as we do not use unput(c), wunput(c), text(), rest(), and span(), for example to search read-only mmap(2) PROT_READ memory.

So far we explained how to use reflex::PCRE2Matcher and reflex::BoostMatcher for pattern matching. We can also use the RE/flex reflex::Matcher class for pattern matching. The API is exactly the same. The reflex::Matcher class uses reflex::Pattern, which internally represents an efficient finite state machine that is compiled from a regex. These state machines are used for fast matching.
The construction of deterministic finite state machines (FSMs) is optimized but can take some time and therefore adds overhead before matching can start. This FSM construction should not be executed repeatedly if it can be avoided. So we recommend to construct static pattern objects to create the FSMs only once:
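A sketch; the static pattern ensures the FSM is built only on the first pass:

```cpp
#include <reflex/matcher.h>
#include <iostream>

int main()
{
  static const reflex::Pattern word_pattern("\\w+");   // FSM is constructed once
  reflex::Matcher matcher(word_pattern, "How now brown cow.");
  while (matcher.find() != 0)
    std::cout << "Found " << matcher.text() << std::endl;
}
```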
A reflex::Pattern
object is immutable (it stores a constant table) and may be shared among threads.
The RE/flex matcher only supports POSIX mode matching and does not support Perl mode matching. See POSIX versus Perl matching for more information.
The RE/flex reflex::Pattern
class has several options that control the regex. Options and modes for the regex are set as a string, for example:
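A sketch using the option string described in the table further below:

```cpp
#include <reflex/matcher.h>
#include <iostream>

int main()
{
  // "i" enables case-insensitive matching, "r" reports regex syntax errors
  static const reflex::Pattern pattern("cow", "ir");
  reflex::Matcher matcher(pattern, "How now brown COW.");
  while (matcher.find() != 0)
    std::cout << "Found " << matcher.text() << std::endl;   // finds "COW"
}
```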
The f=graph.gv
option emits a Graphviz .gv
file that can be visually rendered with the open source Graphviz dot tool by converting the deterministic finite state machine (FSM) to PDF, PNG, or other formats:
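For example, assuming the graph was saved with the option string f=graph.gv:

```
dot -Tpdf graph.gv -o graph.pdf
```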
The f=machine.cpp
option emits opcode tables for the FSM to match regular expressions efficiently. The FSM matcher engine runs as a virtual machine to execute opcodes without backtracking. In this case we get the following FSM table with eleven code words:
Option o
may be used with f=machine.cpp
to emit optimized native C++ code for the FSM that generally runs faster than running the virtual machine on opcode tables:
The compact FSM opcode tables or the optimized FSM code may be used directly in your code. This omits the FSM construction overhead at runtime. Simply include this generated file in your source code and pass it on to the reflex::Pattern
constructor:
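A sketch, assuming the FSM was saved with f=machine.cpp under the default machine name FSM, so that the generated table is named reflex_code_FSM:

```cpp
#include <reflex/matcher.h>
#include <iostream>
#include "machine.cpp"   // generated FSM opcode table (or optimized code with option o)

int main()
{
  // construct the pattern from the pre-built FSM, skipping regex compilation
  static const reflex::Pattern pattern(reflex_code_FSM);
  reflex::Matcher matcher(pattern, "How now brown cow.");
  while (matcher.find() != 0)
    std::cout << "Found " << matcher.text() << std::endl;
}
```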
The RE/flex reflex::Pattern
construction options are given as a string:
Option | Effect |
---|---|
b | bracket lists are parsed without converting escapes |
e=c; | redefine the escape character |
f=file.cpp; | save finite state machine code to file.cpp |
f=file.gv; | save deterministic finite state machine to file.gv |
i | case-insensitive matching, same as (?i)X |
m | multiline mode, same as (?m)X |
n=name; | use reflex_code_name for the machine (instead of FSM ) |
o | only with option f : generate optimized FSM native C++ code |
q | Flex/Lex-style quotations "..." equal \Q...\E , same as (?q)X |
r | throw regex syntax error exceptions, otherwise ignore errors |
s | dot matches all (aka. single line mode), same as (?s)X |
x | free space mode with inline comments, same as (?x)X |
w | display regex syntax errors before raising them as exceptions |
For example, reflex::Pattern pattern(pattern, "isr")
enables case-insensitive dot-all matching with syntax errors thrown as reflex::Pattern::Error
types of exceptions. By default, the reflex::Pattern
constructor solely throws the reflex::regex_error::exceeds_length
and reflex::regex_error::exceeds_limits
exceptions and silently ignores syntax errors, see The reflex::Pattern class .
In summary, the RE/flex regex API offers a uniform interface for matching, searching, scanning and splitting input with the PCRE2, Boost.Regex, std::regex, and built-in RE/flex matcher engines.

The RE/flex regex library section has more information about the RE/flex regex library.
The RE/flex scanner generator reflex
takes a lexer specification and generates a regex-based C++ lexer class that is saved to lex.yy.cpp
, or saved to the file we specified by the -o
command-line option. This file is then compiled and linked with option -lreflex
(and optionally -lboost_regex
if we use Boost.Regex for matching or -lpcre2-8
if we use PCRE2 for matching) to produce a scanner:
reflex lexerspec.l
c++ lex.yy.cpp -lreflex
We use option −−header-file
to generate lex.yy.h
to include in the source code of your lexer application:
reflex −−header-file lexerspec.l
c++ mylexer.cpp lex.yy.cpp -lreflex
If libreflex
was not installed then linking with -lreflex
fails. See Undefined symbols and link errors on how to resolve this.
The scanner can be a stand-alone application based on lex.yy.cpp
alone, or be part of a larger program, such as a compiler:
The RE/flex-generated scanners use the RE/flex regex library API for pattern matching. The RE/flex regex library API is defined by the abstract class reflex::AbstractMatcher
.
There are three regex matching engines to choose from for the generated scanner: the Boost.Regex library, the PCRE2 library, or the built-in RE/flex POSIX matcher engine. In any case, the libreflex
library should be linked. The libboost_regex
library or the libpcre2-8
library should only be linked when the Boost.Regex or PCRE2 engines are used for matching, respectively.
The input class reflex::Input
of the libreflex
library manages input from strings, wide strings, streams, and data from FILE
descriptors. File data may be encoded in ASCII, binary or in UTF-8/16/32. UTF-16/32 is automatically decoded and converted to UTF-8 for UTF-8-based regex matching:
The generated scanner executes actions (typically to produce tokens for a parser). The actions are triggered by matching patterns to the input.
The reflex
command takes a lexer specification from standard input or from the specified file (usually with extension .l
, .ll
, .l++
, .lxx
, or .lpp
) and generates a C++ scanner class that is saved to the lex.yy.cpp
source code file.
The lex.yy.cpp
source code output is structured in sections that are clean, readable, and reusable.
Use reflex
option −−header-file
to generate lex.yy.h
to include in the source code of your application:
reflex −−header-file lexerspec.l
The reflex
command accepts −−flex
and −−bison
options for compatibility with Flex and Bison/Yacc, respectively. These options allow reflex
to be used as a replacement of the classic Flex and Lex tools:
reflex −−flex −−bison lexerspec.l
The first option −−flex
specifies that lexerspec.l
is a classic Flex/Lex specification with yytext
or YYText()
and the usual "yy" variables and functions.
The second option −−bison
generates a scanner class and the usual global "yy" variables and functions such as yytext
, yyleng
, yylineno
, and yylex()
for compatibility with non-reentrant Bison parsers. See Interfacing with Bison/Yacc for more details on Bison parsers that are reentrant and/or use bison-bridge and bison-locations options. For Bison 3.0 C++ parsers, use −−bison-cc
and optionally −−bison-locations
.
Option −−yy
enables both −−flex
and −−bison
and maximizes compatibility with Lex/Flex by generating the global yyin
and yyout
variables and global yy
functions. Otherwise, yyin
points to a reflex::Input
object for advanced input handling, which is more powerful than the traditional FILE*
type yyin
.
To control the output of the reflex
scanner generator use the command-line options described in the next subsections. These options can also be specified in the lexer specification with %option
or %o
for short, for example:
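A sketch matching the equivalent command-line options listed next:

```
%option flex
%option bison
%option graphs-file=mygraph.gv
```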
The above is equivalent to the −−flex
, −−bison
, and −−graphs-file=mygraph.gv
command-line options.
Multiple options can be grouped on a single line:
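For instance, the three options above can be written as:

```
%o flex bison graphs-file=mygraph.gv
```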
An option parameter name may contain hyphens (-), dots (.), and double colons (::). Flex always requires quotes with option parameters, but RE/flex does not require quotes except when special characters are used, for example:
Quotes (") and backslashes (\) should be escaped in an option parameter:
Shorter forms may be used by omitting %o
altogether, requiring each option to be specified on a separate line:
Options that affect the regular expressions such as %option unicode
and %option dotall
should be specified before any regular expressions are defined and used in the specification.
The scanner code generated by reflex
defines options as macros REFLEX_OPTION_name
with a value of true
when the option is enabled or with the value that is assigned to the option. This allows your program code to check and use RE/flex options. For example, the lexer class name is REFLEX_OPTION_lexer
when the lexer class name is redefined with command-line option −−lexer=NAME
or in the lexer specification with %option lexer=NAME
.
−+
, −−flex
This option generates a yyFlexLexer
scanner class that is compatible with the Flex-generated yyFlexLexer
scanner class (assuming Flex with option −+
for C++). The generated yyFlexLexer
class has the usual yytext
and other "yy" variables and functions, as defined by the Flex specification standard. Without this option, RE/flex actions should be used that are lexer class methods such as text()
, echo()
and also the lexer's matcher methods, such as matcher().more()
, see The rules section for more details.
-a
, −−dotall
This option makes dot (.
) in patterns match newline. Normally dot matches a single character except a newline (\n
ASCII 0x0A).
-B
, −−batch
This option generates a batch input scanner that reads the entire input all at once when possible. This scanner is fast, but consumes more memory depending on the input data size. An option argument may be specified to initialize the buffer size to support incremental scanning by reading chunks of input, for example −−batch=1024
reads the input in 1024 byte chunks.
-f
, −−full
(RE/flex matcher only). This option adds the FSM to the generated code as a static opcode table, thus generating the scanner in full. FSM construction overhead is eliminated when the scanner is initialized, resulting in a scanner that starts scanning the input immediately. This option has no effect when option −−fast
is specified.
-F
, −−fast
(RE/flex matcher only). This option adds the FSM to the generated code as optimized native C++ code. FSM construction overhead is eliminated when the scanner is initialized, resulting in a scanner that starts scanning the input immediately. The generated code takes more space compared to the −−full
option.
-S
, −−find
This option generates a search engine to find pattern matches to invoke actions corresponding to matching patterns. Unmatched input is ignored. By contrast, option -s
(or −−nodefault
) produces an error when non-matching input is found.
-i
, −−case-insensitive
This option ignores case in patterns. Patterns match lower and upper case letters in the ASCII range only.
-I
, −−interactive
, −−always-interactive
This option generates an interactive scanner and permits console input by sacrificing speed. This option is essentially the same as −−batch=1
to consume one character at a time. By contrast, the default buffered input strategy is more efficient.
−−indent
and −−noindent
This option enables or disables support for indentation matching with anchors \i
, \j
, and \k
. Indentation matching is enabled by default. Matching speed may be improved by disabling indentation matching, but should only be disabled when none of the indentation anchors is used in any of the patterns.
-m reflex
, −−matcher=reflex
This option generates a scanner that uses the RE/flex reflex::Matcher
class with a POSIX matcher engine. This is the default matcher for scanning. This option is best for Flex compatibility. This matcher supports lazy quantifiers, Unicode mode, Anchors and boundaries, Indent/nodent/dedent matching, and supports FSM output for visualization with Graphviz.
-m boost
, −−matcher=boost
This option generates a scanner that uses the reflex::BoostPosixMatcher
class with a Boost.Regex POSIX matcher engine for scanning. The matcher supports Unicode, word boundary anchors, and more, but not lazy quantifiers. Graphviz output is not supported.
-m boost-perl
, −−matcher=boost-perl
This option generates a scanner that uses the reflex::BoostPerlMatcher
class with a Boost.Regex normal (Perl) matcher engine for scanning. The matching behavior differs from the POSIX leftmost longest rule and results in the first matching rule to be applied instead of the rule that produces the longest match. Graphviz output is not supported.
-m pcre2-perl
, −−matcher=pcre2-perl
This option generates a scanner that uses the reflex::PCRE2Matcher
class with a PCRE2 (Perl) matcher engine for scanning. The matching behavior differs from the POSIX leftmost longest rule and results in the first matching rule to be applied instead of the rule that produces the longest match. Graphviz output is not supported.
−−pattern=NAME
This option defines a custom pattern class NAME
for the custom matcher specified with option -m
.
−−include=FILE
This option defines a custom include FILE.h
to include for the custom matcher specified with option -m
.
-T N
, −−tabs=N
This option sets the default tab size to N
, where N
is 1, 2, 4, or 8. The tab size is used internally to determine the column position for Indent/nodent/dedent matching and to determine the column position returned by columno()
, columno_end()
, and the number of columns returned by columns()
. It has no effect otherwise. This option assigns the T=N
value of the reflex::Matcher
constructor options at runtime. The value may be set at runtime with matcher().tabs(N)
with N
1, 2, 4, or 8.
-u
, −−unicode
This option makes .
, \s
, \w
, \l
, \u
, \S
, \W
, \L
, \U
match Unicode. Also groups UTF-8 sequences in the regex, such that each UTF-8 encoded character in a regex is properly matched as one wide character.
-x
, −−freespace
This option switches the reflex
scanner to free space mode. Regular expressions in free space mode may contain spacing and may be indented to improve readability. All spacing before, within and after regular expressions is ignored. To match a space use " "
or [ ]
, and use \h
to match a space or a tab character. Actions in free space mode MUST be placed in { }
blocks and user code must be placed in %{ %}
blocks. Patterns ending in an escape \
continue on the next line.
-o FILE
, −−outfile=FILE
This option saves the scanner to FILE
instead of lex.yy.cpp
.
-t
, −−stdout
This option writes the scanner to stdout instead of to lex.yy.cpp
.
−−graphs-file[=FILE[.gv]]
(RE/flex matcher only). This option generates a Graphviz file FILE.gv
, where FILE
is optional. When FILE
is omitted the reflex
command generates the file reflex.S.gv
for each start condition state S
defined in the lexer specification. This includes reflex.INITIAL.gv
for the INITIAL
start condition state. This option can be used to visualize the RE/flex matcher's finite state machine with the Graphviz dot tool. For example:
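For instance, using the default file names described above (the spec name is illustrative):

```
reflex −−graphs-file lexerspec.l
dot -Tpdf reflex.INITIAL.gv -o reflex.INITIAL.pdf
```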
In case you are curious: the specification for this FSM digraph has two patterns: [1] a pattern to match dollar amounts with the regex \$\d+(\.\d{2})?
and [2] the regex .|\n
to skip a character and advance to the next match.
−−header-file[=FILE]
This option generates a C++ header file FILE
that declares the lexer class, in addition to the generated lexer class code, where FILE
is optional. When FILE
is omitted the reflex
command generates lex.yy.h
.
−−regexp-file[=FILE[.txt]]
This option generates a text file FILE.txt
that contains the scanner's regular expression patterns, where FILE
is optional. When FILE
is omitted the reflex
command generates reflex.S.txt
for each start condition state S
. The regular expression patterns are converted from the lexer specification and translated into valid C++ strings that can be used with a regex library for pattern matching.
−−tables-file[=FILE[.cpp]]
(RE/flex matcher only). This option generates a C++ file FILE.cpp
with the finite state machine in source code form, where FILE
is optional. When FILE
is omitted the reflex
command generates reflex.S.cpp
for each start condition state S
. This includes the file reflex.INITIAL.cpp
for the INITIAL
start condition state. When this option is specified in combination with −−full
or −−fast
, the reflex::Pattern
is instantiated with the code table defined in this file. Therefore, when we combine this option with −−full
or −−fast
we should also compile the generated table file with the scanner. Options −−full
and −−fast
eliminate the FSM construction overhead when the scanner is initialized.
−−namespace=NAME
This option places the generated scanner class in the C++ namespace NAME
scope, that is NAME::Lexer
(and NAME::yyFlexLexer
when option −−flex
is used). NAME
can be a list of nested namespaces of the form NAME1::NAME2::NAME3
... or by using a dot as in NAME1.NAME2.NAME3
...
−−lexer=NAME
This option defines the NAME
of the generated scanner class and replaces the default name Lexer
(and replaces yyFlexLexer
when option −−flex
is specified). The scanner class members may be declared within a %class{ }
block. The scanner class constructor code may be defined within a %init{ }
block. Additional constructor arguments may be declared with %option ctorarg="argument, argument, ..."
and initializers with %option ctorinit="initializer, initializer, ..."
.
−−lex=NAME
This option defines the NAME
of the generated scanner function to replace the function name lex()
(and yylex()
when option −−flex
is specified).
−−params="TYPE NAME, ..."
This option defines additional parameters for the lex()
scanner function (and yylex()
when option −−flex
is specified). The function signature is extended to include the comma-separated TYPE NAME
parameters. This mechanism replaces Flex YY_DECL
, see YY_DECL alternatives.
−−class=NAME
This option defines the NAME
of the user-defined scanner class that should be derived from the generated base Lexer
class. Use this option when defining your own scanner class named NAME
. You can declare your custom lexer class in the first section of the lexer specification. Because the custom lexer class is user-defined, reflex
generates the implementation of the lex()
scanner function for this specified class.
−−yyclass=NAME
This option combines options −−flex
and −−class=NAME
.
−−main
This option generates a main
function to create a stand-alone scanner that scans data from standard input (using stdin
).
-L
, −−noline
This option suppresses the #line
directives in the generated scanner code.
-P NAME
, −−prefix=NAME
This option specifies NAME
as a prefix for the generated yyFlexLexer
class to replace the default yy
prefix. Also renames the prefix of yylex()
. Generates lex.NAME.cpp
file instead of lex.yy.cpp
, and generates lex.NAME.h
with option −−header-file
.
−−nostdinit
This option initializes input to std::cin
instead of stdin
, if no input was assigned to the scanner. This option also prevents the scanner from automatically reading stdin
before any other input is assigned, when detecting UTF encodings on standard input. Note that automatic UTF decoding is not supported on std::cin
. Use stdin
for automatic UTF BOM detection and UTF decoding of standard input streams, not std::cin
.
−−bison
This option generates a scanner that works with Bison parsers, by defining global (i.e. non-thread-safe and non-reentrant) "yy" variables and functions, such as yytext
, yyleng
, yylineno
, and yylex()
. See Interfacing with Bison/Yacc for more details. Use option −−noyywrap
to remove the dependency on the global yywrap()
function. Use option −−bison-locations
to support the Bison %locations
feature. See also the −−yy
option.
−−bison-bridge
This option generates a scanner that works with Bison pure (reentrant thread-safe) parsers using a Bison bridge for one or more scanner objects. Combine this option with −−bison-locations
to support the Bison %locations
feature. See Bison-bridge for more details.
−−bison-cc
This option generates a scanner that works with Bison 3.0 %skeleton "lalr1.cc"
C++ parsers that are thread-safe. Combine this option with −−bison-locations
to support the Bison %locations
grammar. See Bison-cc for more details.
−−bison-cc-namespace=NAME
This option specifies one or more NAME
namespace(s) for the Bison 3.0 %skeleton "lalr1.cc"
C++ parser, which is yy
by default.
−−bison-cc-parser=NAME
This option specifies the class NAME
of the Bison 3.0 %skeleton "lalr1.cc"
C++ parser, which is parser
by default.
−−bison-complete
This option generates a scanner that works with Bison 3.2 C++ complete symbols, specified by %define api.value.type variant
and %define api.token.constructor
in a Bison grammar file. This option also sets option −−bison-cc
and sets −−token-type
to the parser's symbol_type
, and sets −−token-eof
to 0
, assuming these options are not specified already. Combine this option with −−bison-locations
to support the Bison %locations
feature. See Bison-complete for more details.
−−bison-locations
This option generates a scanner that works with Bison with locations enabled. See Bison-locations for more details.
-R
, −−reentrant
This option generates additional Flex-compatible yylex()
reentrant scanner functions. RE/flex scanners are always reentrant, assuming that %class
variables are used instead of the user declaring global variables. This is a Flex-compatibility option and should only be used with options −−flex
and −−bison
. With this option enabled, most Flex functions take a yyscan_t
scanner as an extra last argument. See Reentrant scanners and also Interfacing with Bison/Yacc .
-y
, −−yy
This option enables −−flex
and −−bison
to generate a scanner that accepts Flex lexer specifications and works with Bison parsers, by defining global (i.e. non-thread-safe and non-reentrant) "yy" variables and functions, such as yyin
, yyout
, yytext
, yyleng
, yylineno
, and yylex()
.
−−yywrap
and −−noyywrap
Option −−yywrap
generates a scanner that calls the global int yywrap()
function when EOF is reached. This option is only applicable when −−flex
is used for compatibility and when −−flex
and −−bison
are used together, or when −−yy
is specified. Wrapping is enabled by default. Use −−noyywrap
to disable the dependence on this global function. This option has no effect for C++ lexer classes, which have a virtual int wrap()
(or yywrap()
with option −−flex
) method that may be overridden.
−−exception=VALUE
This option defines the exception to be thrown by the generated scanner's default rule when no rule matches the input. This option generates a default rule with action throw VALUE
and replaces the standard default rule that echoes all unmatched input text when no rule matches. This option has no effect when option -S
(or −−find
) is specified. See also option -s
(or −−nodefault
). Care should be taken to advance the input explicitly in the exception handler, for example by calling lexer.matcher().winput()
when lexer.size()
is zero.
−−token-type=NAME
This option specifies the type of the token values returned by lex()
and yylex()
. The type of the values returned by lex()
and yylex()
is int
by default. This option may be used to specify an alternate token type. Option −−bison-complete
automatically defines the appropriate token type symbol_type
depending on the parameters specified with options −−bison-cc-namespace
and −−bison-cc-parser
.
−−token-eof=VALUE
This option specifies the value returned by lex()
and yylex()
when the end of the input is reached and when no <<EOF>>
rule is present. By default, a default-constructed token type value is returned when the end of input is reached. For int
this is int()
, which is zero. By setting −−token-type=EOF
the value EOF
is returned, for example.
-d
, −−debug
This enables debug mode in the generated scanner. Running the scanner produces debug messages on std::cerr
standard error and the debug()
function returns nonzero. To temporarily turn off debug messages, use set_debug(0)
in your action code. To turn debug messages back on, use set_debug(1)
. The set_debug()
and debug()
methods are virtual methods of the lexer class, so you can override their behavior in a derived lexer class. This option also enables assertions that check for internal errors. See Debugging and profiling for details.
-p
, −−perf-report
This enables the collection and reporting of statistics by the generated scanner. The scanner reports the performance statistics on std::cerr
when EOF is reached. If your scanner does not reach EOF, then invoke the lexer's perf_report()
method explicitly in your code. Invoking this method also resets the statistics and timers, meaning that this method will report the statistics collected since it was last called. See Debugging and profiling for details.
-s
, −−nodefault
This suppresses the default rule that echoes all unmatched input text when no rule matches. With the −−flex
option, the scanner reports "scanner jammed" when no rule matches by calling yyFlexLexer::LexerError("scanner jammed")
. Without the −−flex
and −−debug
options, a std::runtime_error
exception is raised by invoking AbstractLexer::lexer_error("scanner jammed")
. To throw a custom exception instead, use option −−exception
or override the virtual method lexer_error
in a derived lexer class. The virtual methods LexerError
and lexer_error
may be redefined by a user-specified derived lexer class, see Inheriting Lexer/yyFlexLexer . Without the −−flex
option, but with the −−debug
option, the default rule is suppressed without invoking lexer_error
to raise an exception. See also options −−exception=VALUE
and -S
(or −−find
).
-v
, −−verbose
This displays a summary of scanner statistics.
-w
, −−nowarn
This disables warnings.
-h
, −−help
This displays helpful information about reflex
.
-V
, −−version
This displays the current reflex
release version.
−−yylineno
, −−yymore
These options are enabled by default and have no effect.
A lexer specification consists of three sections that are divided by %%
delimiters that are placed on a single line:
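Schematically:

```
DEFINITIONS
%%
RULES
%%
USER CODE
```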
The definitions section is used to define named regex patterns, to set options for the scanner, and for including C++ declarations.
The rules section is the main workhorse of the scanner and consists of patterns and actions, where patterns may use named regex patterns that are defined in The definitions section. The actions are executed when patterns match. For example, the following lexer specification replaces all occurrences of cow
by chick
in the input to the scanner:
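A minimal sketch of such a specification (with option −−flex the action would use yyout and yytext instead of out()):

```
%%
cow    out() << "chick";
%%
```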
The default rule is to echo any input character that does not match a rule in The rules section, so all other text is faithfully reproduced by this simple scanner example.
Because the pattern cow also matches parts of words, we get chicks for cows. But we also get badly garbled output for words such as coward, and capitalized Cows are skipped entirely. We can improve this with a pattern that anchors word boundaries and accepts a lower or upper case C:
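A sketch of the improved specification described next:

```
cow    \<[Cc]ow\>
%%
{cow}  out() << text()[0] << "hick";
%%
```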
Note that we defined a named pattern cow
in The definitions section to match the start and end of a "cow" or capitalized "Cow" with the regex \<[Cc]ow\>
. We use {cow}
in our rule for matching. The first character of the matched text is emitted with text()[0]
and we simply append a "hick"
to complete our chick.
Note that regex grouping with parentheses to capture text matched by a parenthesized sub-regex is generally not supported by scanner generators, so we have to use the entire matched text()
string.
Flex and Lex do not support word boundary anchors \<
, \>
, \b
, and \B
, so this example only works with RE/flex.
If you are wondering about the action code in our example not exactly reflecting the C code expected with Flex, then rest assured that RE/flex supports the classic Flex and Lex actions such as yytext
instead of text()
and *yyout
instead of out()
. Simply use option −−flex
to regress to the C-style Flex names and actions. Use options −−flex
and −−bison
(or option −−yy
) to regress even further to generate a global yylex()
function and "yy" variables.
To create a stand-alone scanner, we add main
to the User code section:
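For example (a sketch building on the specification above):

```
cow    \<[Cc]ow\>
%%
{cow}  out() << text()[0] << "hick";
%%
int main()
{
  return Lexer().lex();
}
```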
The main function instantiates the lexer class and invokes the scanner, which will not return until the entire input is processed. In fact, you can let reflex
generate this main function for you with option −−main
.
More details on these three lexer specification sections are presented next.
The Definitions section includes name-pattern pairs to define names for patterns. Named patterns can be referenced in regex patterns by enclosing them in {
and }
.
The following example defines two names for two patterns, where the second regex pattern uses the previously named pattern:
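A sketch consistent with the expansion shown below:

```
digit     [0-9]
number    {digit}+
```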
Patterns ending in an escape \
continue on the next line with optional line indentation. This permits you to organize your layout of long patterns. See also Free space mode to improve pattern readability.
Names must be defined before being referenced. Names are expanded as macros in regex patterns. For example, {digit}+
is expanded into [0-9]+
.
When a named pattern defined as φ is referenced with its {name}, the expanded pattern φ is placed in a non-capturing group (?:φ) to preserve its structure. For example, {number} expands to (?:{digit}+), which in turn expands to (?:(?:[0-9])+).

To inject code into the generated scanner, indent the code or place the code within a %{ %}
block. The %{
and the matching %}
should each be placed at the start of a new line. To inject code at the very top of the generated scanner, place this code within a %top{ }
block:
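A sketch of both forms of code injection (the header names and the variable are illustrative):

```
%top{
  #include <cstdio>      /* injected at the very top of the generated scanner */
}
%{
  #include <iostream>    /* injected into the generated scanner */
  int counter = 0;       /* hypothetical global, visible in actions */
%}
```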
The definitions section may also contain one or more options with %option
(or %o
for short). For example:
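For instance, several options may be placed on one line (the particular options shown here are only an illustration):

```
%option unicode dotall main
%o fast
```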
Multiple options can be grouped on the same line as is shown above. See Options for a list of available options.
Options freespace
, case-insensitive
, dotall
, and unicode
affect the named patterns defined in The definitions section. Therefore, we should place these options ahead of all named patterns. If a regex pattern specifically requires one or more of these options, use the (?isux:φ)
modifier(s), see Patterns for details.Consider the following example. Say we want to count the number of occurrences of the word "cow" in some text. We declare a global counter, increment the counter when we see a "cow", and finally report the total tally when we reach the end of the input marked by the <<EOF>>
rule:
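A sketch of such a specification with a global counter (an illustration of the approach described, not necessarily the exact example):

```
%option dotall main
%{
  int herd = 0;          /* a global counter: not thread-safe */
%}
cow        \<[Cc]ow\>
%%
{cow}      herd++;
.          /* ignore everything else; with dotall the dot also matches \n */
<<EOF>>    { out() << herd << " cows!" << std::endl; return 0; }
%%
```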
The above works fine, but we are using a global counter, which is not a best practice and is not thread-safe: multiple Lexer class instances may compete to bump the counter. Another problem is that the Lexer can only be used once; there is no proper initialization to restart the Lexer on new input.
RE/flex allows you to inject code in the generated Lexer class, meaning that class members and constructor code can be added to manage the Lexer class state. All Lexer class members are visible in actions, even when private. New Lexers can be instantiated given some input to scan. Lexers can run in parallel in threads without requiring synchronization when their state is part of the instance and not managed by global variables.
To inject Lexer class member declarations such as variables and methods, place the declarations within a %class{ }
block. The %class{
and the matching }
should each be placed at the start of a new line.
Likewise, to inject Lexer class constructor code, for example to initialize members, place the code within a %init{ }
block. The %init{
and the matching }
should each be placed at the start of a new line. Option %option ctorarg="argument, argument, ..."
may be used to declare the constructor arguments of the Lexer class constructor. Option %option ctorinit="initializer, initializer, ..."
specifies constructor initializers. See also The Lexer/yyFlexLexer class .
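For example, a hypothetical sketch with one constructor argument that initializes a class member:

```
%option ctorarg="size_t limit"
%option ctorinit="max_tokens(limit)"
%class{
  size_t max_tokens;   /* hypothetical member set from the constructor argument */
}
```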
Additional constructors and/or a destructor may be placed in a %class{ }
block, for class Lexer
(or yyFlexLexer
with option −−flex
), unless the class is renamed with option −−lexer=NAME
(%option lexer=NAME
).
For convenience you can use the generated REFLEX_OPTION_lexer
macro in your code that expands to the class name. To do so, use reflex
option −−header-file
to generate a header file to include in your code.
For example, we use these code injectors to make our cow counter herd
part of the Lexer class state. We also add a sound "Moo!" when a cow is matched, to illustrate the use of a static data member that is initialized out of line:
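A sketch of the revised specification, assuming the default lexer class name Lexer:

```
%option dotall main
%class{
  int herd;                    /* per-instance cow counter */
  static const char *sound;    /* static data member, defined out of line below */
}
%init{
  herd = 0;                    /* constructor code: initialize the counter */
}
cow        \<[Cc]ow\>
%%
{cow}      { out() << sound << std::endl; herd++; }
.          /* ignore everything else */
<<EOF>>    { out() << herd << " cows!" << std::endl; return 0; }
%%
const char *Lexer::sound = "Moo!";   /* out-of-line definition in the user code section */
```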
Note that nothing else needs to be changed, because the actions are part of the generated Lexer class and can access the Lexer class members, which in this example is the member variable herd
.
In this example, we just search for pattern matches and ignore everything else with a dot rule with no action. This dot matches newlines too because we specified option dotall
. Searching for pattern matches like this example can be done much more efficiently with option find
to generate a search engine instead of a scanner:
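For instance, by adding find to the options of the previous sketch:

```
%option dotall find main
```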
We should not forget to remove the dot rule from our lexer specification, otherwise we still match a lot that we don't need to match:
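The same sketch with option find and the dot rule removed:

```
%option find main
%class{
  int herd;
  static const char *sound;
}
%init{
  herd = 0;
}
cow        \<[Cc]ow\>
%%
{cow}      { out() << sound << std::endl; herd++; }
<<EOF>>    { out() << herd << " cows!" << std::endl; return 0; }
%%
const char *Lexer::sound = "Moo!";
```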
To modularize specifications of lexers, use %include
(or %i
for short) to include one or more files into The definitions section of a specification. For example:
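For instance:

```
%include "examples/jdefs.l"
```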
This includes examples/jdefs.l
with Java patterns into the current specification so you can match Java lexical structures, such as copying Java identifiers to the output given some Java source program as input:
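A sketch of such a specification, assuming jdefs.l defines a name such as Identifier (the names actually defined in jdefs.l may differ):

```
%include "examples/jdefs.l"
%%
{Identifier}    echo();
.|\n            /* skip everything else */
%%
```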
Multiple files may be specified with one %include
. Quotes may be omitted from the %include
argument if the argument has no punctuation characters except .
and -
, for example %include jdefs.l
.
To declare start condition state names use %state
(or %s
for short) to declare inclusive states and use %xstate
(or %x
for short) to declare exclusive states:
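For example, with illustrative state names:

```
%state QUOTED
%xstate COMMENT STRING
```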
See Start condition states for more information about states.
Each rule in the rules section consists of a pattern-action pair separated by spacing after the pattern (unless free space mode is enabled). For example, the following defines an action for a pattern:
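For instance, a single pattern-action pair (an illustrative rule):

```
[0-9]+     out() << "number " << text() << std::endl;
```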
To add action code that spans multiple lines, indent the code or place the code in a { }
block. When local variables are declared in an action then the code should always be placed in a block.
In free space mode you MUST place actions in { }
blocks and user code in %{ %}
blocks instead of indented, see Free space mode.
Actions in the rules section can use predefined RE/flex variables and functions. With reflex
option −−flex
, the variables and functions are the classic Flex actions shown in the second column of this table:
RE/flex action | Flex action | Result |
---|---|---|
text() | YYText() , yytext | 0-terminated text match |
str() | n/a | std::string text match |
strview() | n/a | std::string_view text match |
wstr() | n/a | std::wstring wide text match |
chr() | yytext[0] | first 8-bit char of text match |
wchr() | n/a | first wide char of text match |
size() | YYLeng() , yyleng | size of the match in bytes |
wsize() | n/a | number of wide chars matched |
lines() | n/a | number of lines matched (>=1) |
columns() | n/a | number of columns matched (>=0) |
lineno(n) | yylineno = n | set line number of the match to n |
lineno() | yylineno | line number of the match (>=1) |
columno(n) | n/a | set column number of the match to n |
columno() | n/a | column number of match (>=0) |
lineno_end() | n/a | ending line number of match (>=1) |
columno_end() | n/a | ending column number of match (>=0) |
border() | n/a | border of the match (>=0) |
echo() | ECHO | out().write(text(), size()) |
in(i) | yyrestart(i) | set input to reflex::Input i |
in() , in() = i | *yyin , yyin = &i | get/set reflex::Input i |
out(o) | yyout = &o | set output to std::ostream o |
out() | *yyout | get std::ostream object |
out().write(s, n) | LexerOutput(s, n) | output chars s[0..n-1] |
out().put(c) | output(c) | output char c |
start(n) | BEGIN n | set start condition to n |
start() | YY_START | get current start condition |
push_state(n) | yy_push_state(n) | push current state, start n |
pop_state() | yy_pop_state() | pop state and make it current |
top_state() | yy_top_state() | get top state start condition |
states_empty() | n/a | true if state stack is empty |
matcher().accept() | yy_act | number of the matched rule |
matcher().text() | YYText() , yytext | same as text() |
matcher().str() | n/a | same as str() |
matcher().wstr() | n/a | same as wstr() |
matcher().chr() | yytext[0] | same as chr() |
matcher().wchr() | n/a | same as wchr() |
matcher().size() | YYLeng() , yyleng | same as size() |
matcher().wsize() | n/a | same as wsize() |
matcher().lines() | n/a | same as lines() |
matcher().columns() | n/a | same as columns() |
matcher().lineno(n) | yylineno = n | same as lineno(n) |
matcher().lineno() | yylineno | same as lineno() |
matcher().columno(n) | n/a | same as columno(n) |
matcher().columno() | n/a | same as columno() |
matcher().lineno_end() | yylineno | same as lineno_end() |
matcher().columno_end() | n/a | same as columno_end() |
matcher().border() | n/a | same as border() |
matcher().begin() | n/a | non-0-terminated text match begin |
matcher().end() | n/a | non-0-terminated text match end |
matcher().input() | yyinput() | get next 8-bit char from input |
matcher().winput() | n/a | get wide character from input |
matcher().unput(c) | unput(c) | put back 8-bit char c |
matcher().wunput(c) | unput(c) | put back (wide) char c |
matcher().peek() | n/a | peek at next 8-bit char on input |
matcher().skip(c) | n/a | skip input to char c |
matcher().skip(s) | n/a | skip input to UTF-8 string s |
matcher().more() | yymore() | append next match to this match |
matcher().less(n) | yyless(n) | shrink match length to n |
matcher().first() | n/a | first pos of match in input |
matcher().last() | n/a | last pos+1 of match in input |
matcher().rest() | n/a | get rest of input until end |
matcher().span() | n/a | enlarge match to span line |
matcher().line() | n/a | get line with the match |
matcher().wline() | n/a | get line with the match |
matcher().at_bob() | n/a | true if at the begin of input |
matcher().at_end() | n/a | true if at the end of input |
matcher().at_bol() | YY_AT_BOL() | true if at begin of a newline |
set_debug(n) | set_debug(n) | reflex option -d sets n=1 |
debug() | debug() | nonzero when debugging |
A reflex::Input
input source is denoted i
in the table, which can be a FILE*
descriptor, std::istream
, a string std::string
or const char*
, or a wide string std::wstring
or const wchar_t*
. Output o
is a std::ostream
object.
Note that Flex switch_streams(i, o)
is the same as invoking the in(i)
and out(o)
methods. Flex yyrestart(i)
is the same as invoking in(i)
to set input to a file, stream, or string. Invoking switch_streams(i, o)
and in(i)
also resets the lexer's matcher (internally with matcher.reset()
). This clears the line and column counters, resets the internal anchor and boundary flags for anchor and word boundary matching, and resets the matcher to consume buffered input.
You can also set the input with in() = i
(or yyin = &i
with option −−flex
). This however does not reset the matcher. This means that when an end of input (EOF) was reached, you should clear the EOF state first with matcher().set_end(false)
or reset the matcher state with matcher().reset()
. Resetting the matcher state also flushes the remaining input from the buffer, which would otherwise still be consumed. Using in(i)
(or yyrestart(i)
with option −−flex
) is therefore preferable.
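For example, a sketch that switches a lexer to a new input file with in(i), assuming a Lexer object named lexer and a hypothetical file name:

```cpp
FILE *fd = fopen("more-input.txt", "r");
if (fd != NULL)
{
  lexer.in(fd);    // switch input and reset the matcher (like yyrestart(fd) with option --flex)
  lexer.lex();     // scan the new input
  fclose(fd);
}
```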
The matcher().input()
, matcher().winput()
, and matcher().peek()
methods return a non-negative character code and EOF (-1) when the end of input is reached. These methods preserve the current text()
match (and yytext
with option −−flex
), but the pointer returned by text()
(and yytext
) may change after these methods are called. However, the yytext
pointer is not preserved when using these methods with reflex
options −−flex
and −−bison
.
The Flex yyinput()
function returns 0 when the end of input is reached, which makes it impossible to distinguish \0
(NUL) from EOF. By contrast, matcher().input()
returns EOF (-1) when the end of the input is reached.matcher()
before the lex()
(or yylex()
with option −−flex
) is invoked! A matcher is not initially assigned to a lexer when the lexer is constructed, leaving matcher()
undefined.The matcher().skip(c)
method skips input until char
or wide wchar_t
character c
is consumed and returns true
when found. This method changes text()
(and yytext
with option −−flex
). This method is more efficient than repeatedly calling matcher().input()
. Likewise, matcher().skip(s)
skips input until UTF-8 string s
is consumed and returns true
when found.
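For example, a rule that uses matcher().skip() to skip a C-style comment efficiently (an illustrative sketch):

```
"/*"      matcher().skip("*/");   /* skip input up to and including the comment's closing delimiter */
```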
Use reflex
options −−flex
and −−bison
(or option −−yy
) to enable global Flex actions and variables. This makes Flex actions and variables globally accessible outside of The rules section, with the exception of yy_push_state()
, yy_pop_state()
, yy_top_state()
. Outside The rules section you must use the global action yyinput()
instead of input()
, global action yyunput()
instead of unput()
, and global action yyoutput()
instead of output()
. Because yyin
and yyout
are macros they cannot be (re)declared or accessed as global variables, but they can be used as if these were variables. To avoid compilation errors, use reflex
option −−header-file
to generate a header file lex.yy.h
to include in your code to use the global Flex actions and variables. See Interfacing with Bison/Yacc for more details on the −−bison
options to use.
When using reflex
options −−flex
, −−bison
and −−reentrant
, most Flex functions take a yyscan_t
scanner as an extra last argument. See Reentrant scanners for details.
From the first couple of entries in the table shown above you may have guessed correctly that text()
is just a shorthand for matcher().text()
, since matcher()
is the matcher object associated with the generated Lexer class. The same shorthands apply to str()
, wstr()
, size()
, wsize()
, lineno()
, columno()
, and border()
. Use text()
for fast access to the matched text. The str()
method returns a string copy of the match and is less efficient. Likewise, wstr()
returns a wide string copy of the match, converted from UTF-8.
The lineno()
method returns the line number of the match, starting at line 1. The ending line number is lineno_end()
, which is identical to the value of lineno()
+ lines()
- 1.
The columno()
method returns the column offset of a match from the start of the line, beginning at column 0. This method takes tab spacing and wide characters into account. The inclusive ending column number is given by columno_end()
, which is equal to or larger than columno()
if the match does not span multiple lines. Otherwise, if the match spans multiple lines, columno_end()
is the ending column of the match on the last matching line.
The lines()
and columns()
methods return the number of lines and columns matched, where columns()
takes tab spacing and wide characters into account. If the match spans multiple lines, columns()
counts columns over all lines, without counting the newline characters.
The starting byte offset of the match on a line is border()
and the inclusive ending byte offset of the match is border() + size() - 1
.
columno()
, columno_end()
, and columns()
do not take the character width of full-width and combining Unicode characters into account. It is recommended to use the wcwidth
function or wcwidth.c to determine Unicode character widths.The matcher().more()
method is used to create longer matches by stringing together consecutive matches in the input after scanning the input with the scan()
method. When this method is invoked, the next match with scan()
has its matched text prepended to it. The matcher().more()
operation is often used in lexers and was introduced in Lex.
The matcher().less(n)
method reduces the size of the matched text to n
bytes. This method has no effect if n
is larger than size()
. The value of n
should not be 0
to prevent infinite looping on the same input as no input is consumed (or you could switch to another start condition state with start(n)
in the action that uses less(0)
). The matcher().less(n)
operation was introduced in Lex and is often used in lexers to place input back into the input stream and as a means to perform sophisticated lookaheads.
The matcher().first()
and matcher().last()
methods return the position of the match in the input stream, counting in bytes from the start of the input at position 0. If the input is a wide character sequence, the positions are byte positions in the internally converted UTF-8 input.
The matcher().rest()
method returns the rest of the input character sequence as a 0-terminated char*
string. This method buffers all remaining input to return the string.
The matcher().span()
method enlarges the text matched to span the entire line and returns the matching line as a 0-terminated char*
string without the \n
.
The matcher().line()
and matcher().wline()
methods return the entire line as a (wide) string with the matched text as a substring. These methods can be used to obtain the context of a match, for example to display the line where a lexical error or syntax error occurred.
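For example, a catch-all rule that reports a lexical error with the offending line for context (a sketch; the include is added here so std::cerr is available in actions):

```
%{
  #include <iostream>
%}
%%
.      std::cerr << "lexical error at line " << lineno() << ": " << matcher().line() << std::endl;
%%
```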
matcher().span()
, matcher().line()
, and matcher().wline()
invalidate the previous text()
, yytext
, strview()
, begin()
, bol()
, and end()
string pointers. Call these methods again to retrieve the updated pointer or call str()
or wstr()
to obtain a string copy of the match. The length of the line that these methods return is limited to at most reflex::AbstractMatcher::Const::BUFSZ. When this length is exceeded, the line's length before the match is truncated. This ensures that pattern matching binary files or files with very long lines cannot cause memory allocation exceptions.

Because matcher() returns the current matcher object, the following Flex-like actions are also supported:
RE/flex action | Flex action | Result |
---|---|---|
matcher().buffer() | n/a | buffer entire input |
matcher().buffer(n) | n/a | set buffer size to n |
matcher().interactive() | yy_set_interactive(1) | set interactive input |
matcher().flush() | YY_FLUSH_BUFFER | flush input buffer |
matcher().set_bol(b) | yy_set_bol(b) | (re)set begin of line |
matcher().set_bob(b) | n/a | (re)set begin of input |
matcher().set_end(b) | n/a | (re)set end of input |
matcher().reset() | n/a | reset the state as new |
You can switch to a new matcher while scanning input, and use operations to create a new matcher, push/pop a matcher on/from a stack, and delete a matcher:
RE/flex action | Flex action | Result |
---|---|---|
matcher(m) | yy_switch_to_buffer(m) | use matcher m |
new_matcher(i) | yy_create_buffer(i, n) | returns new matcher for reflex::Input i |
del_matcher(m) | yy_delete_buffer(m) | delete matcher m |
push_matcher(m) | yypush_buffer_state(m) | push current matcher, then use m |
pop_matcher() | yypop_buffer_state() | pop matcher and delete current |
ptr_matcher() | YY_CURRENT_BUFFER | pointer to current matcher |
has_matcher() | YY_CURRENT_BUFFER != 0 | current matcher is usable |
The matcher type m
is a Lexer class-specific Matcher
type, which depends on the underlying matcher used by the scanner. Therefore, new_matcher(i)
instantiates a reflex::Matcher
or the matcher specified with the −−matcher
option.
The push_matcher()
and pop_matcher()
functions can be used to temporarily switch to another input source while preserving the original input source associated with the matcher on the stack with push_matcher()
. The pop_matcher()
action returns true
when successful and false
otherwise, when the stack is empty. When false
, has_matcher()
returns false
and ptr_matcher()
returns NULL
. See also Multiple input sources .
The following Flex actions are also supported with reflex
option −−flex
:
RE/flex action | Flex action | Result |
---|---|---|
in(s) | yy_scan_string(s) | reset and scan string s (std::string or char* ) |
in(s) | yy_scan_wstring(s) | reset and scan wide string s (std::wstring or wchar_t* ) |
in(b, n) | yy_scan_bytes(b, n) | reset and scan n bytes at address b (buffered) |
buffer(b, n+1) | yy_scan_buffer(b, n+2) | reset and scan n bytes at address b (zero copy) |
These functions create a new buffer (i.e. a new matcher in RE/flex) to incrementally buffer the input on demand, except for yy_scan_buffer
, which scans a string in place (i.e. zero copy); the string should end with two zero bytes, which are included in the specified length. A pointer to the new buffer is returned, which becomes the YY_CURRENT_BUFFER
. You should delete the old buffer with yy_delete_buffer(YY_CURRENT_BUFFER)
before creating a new buffer with one of these functions. See Switching input sources for more details.
The generated scanner reads from the standard input by default or from an input source specified as a reflex::Input
object, such as a string, wide string, file, or a stream. See Switching input sources for more details on managing the input to a scanner.
These functions take an extra last yyscan_t
argument for reentrant scanners generated with option −−reentrant
. This argument is a pointer to a lexer object. See Reentrant scanners for more details.
To inject code at the end of the generated scanner, such as a main
function, we can use the third and final User code section. All of the code in the User code section is copied to the generated scanner.
Below is a User code section example with main
that invokes the lexer to read from standard input (the default input) and display all numbers found:
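A sketch of such a specification, assuming the default lexer class name Lexer:

```
%%
[0-9]+     out() << text() << std::endl;
.|\n       /* skip everything else */
%%
int main()
{
  return Lexer().lex();
}
```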
You can also automatically generate a main
with the reflex
−−main
option, which will produce the same main
function shown in the example above. This creates a stand-alone scanner that instantiates a Lexer that reads input from standard input.
To scan from other input than standard input, such as from files, streams, and strings, instantiate the Lexer class with the input source as the first argument. To set an alternative output stream than standard output, pass a std::ostream
object as the second argument to the Lexer class constructor:
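For example, a sketch that scans a file and writes to standard output (the file name is illustrative, and the header is assumed to be generated with option −−header-file):

```cpp
#include <cstdio>
#include <iostream>
#include "lex.yy.h"   // generated header with the Lexer class declaration

int main()
{
  FILE *fd = fopen("input.txt", "r");
  if (fd == NULL)
    return 1;
  Lexer lexer(fd, std::cout);   // scan the FILE* input, write output to std::cout
  int ret = lexer.lex();
  fclose(fd);
  return ret;
}
```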
The above uses a FILE
descriptor to read input from, which has the advantage of automatically decoding UTF-8/16/32 input. Other permissible input sources are std::istream
, std::string
, std::wstring
, char*
, and wchar_t*
.
The regex pattern syntax you can use generally depends on the regex matcher library that you use. Fortunately, RE/flex accepts a broad pattern syntax for lexer specifications. The reflex
command internally converts the regex patterns to regex forms that the underlying matcher engine library can handle (except when specifically indicated in the tables that follow). This ensures that the same pattern syntax can be used with any matcher engine library that RE/flex currently supports.
A pattern is an extended set of regular expressions, with nested sub-expression patterns φ
and ψ
:
Pattern | Matches |
---|---|
x | matches the character x , where x is not a special character |
. | matches any single character or a byte, except newline (unless in dotall mode) |
\. | matches . (dot), special characters are escaped with a backslash |
\n | matches a newline, others are \a (BEL), \b (BS), \t (HT), \v (VT), \f (FF), and \r (CR) |
\N | matches any single character except newline |
\0 | matches the NUL character |
\cX | matches the control character X mod 32 (e.g. \cA is \x01 ) |
\0141 | matches an 8-bit character with octal value 141 (use \141 in lexer specifications instead, see below) |
\x7f | matches an 8-bit character with hexadecimal value 7f |
\x{3B1} | matches Unicode character U+03B1, i.e. α |
\u{3B1} | matches Unicode character U+03B1, i.e. α |
\o{141} | matches U+0061, i.e. a , in octal |
\p{C} | matches a character in category C of Character categories |
\Q...\E | matches the quoted content between \Q and \E literally |
[abc] | matches one of a , b , or c as Character classes |
[0-9] | matches a digit 0 to 9 as Character classes |
[^0-9] | matches any character except a digit as Character classes |
φ? | matches φ zero or one time (optional) |
φ* | matches φ zero or more times (repetition) |
φ+ | matches φ one or more times (repetition) |
φ{2,5} | matches φ two to five times (repetition) |
φ{2,} | matches φ at least two times (repetition) |
φ{2} | matches φ exactly two times (repetition) |
φ?? | matches φ zero or once as needed (lazy optional) |
φ*? | matches φ a minimum number of times as needed (lazy repetition) |
φ+? | matches φ a minimum number of times at least once as needed (lazy repetition) |
φ{2,5}? | matches φ two to five times as needed (lazy repetition) |
φ{2,}? | matches φ at least two times or more as needed (lazy repetition) |
φψ | matches φ then matches ψ (concatenation) |
φ⎮ψ | matches φ or matches ψ (alternation) |
(φ) | matches φ as a group to capture (this is non-capturing in lexer specifications) |
(?:φ) | matches φ without group capture |
(?=φ) | matches φ without consuming it (Lookahead) |
(?<=φ) | matches φ to the left without consuming it (Lookbehind, not supported by the RE/flex matcher) |
(?^φ) | matches φ and ignores it, marking everything as a non-match to continue matching (RE/flex matcher only) |
^φ | matches φ at the begin of input or begin of a line (requires multi-line mode) (top-level φ only, not nested in a sub-pattern) |
φ$ | matches φ at the end of input or end of a line (requires multi-line mode) (top-level φ only, not nested in a sub-pattern) |
\Aφ | matches φ at the begin of input (top-level φ , not nested in a sub-pattern) |
φ\z | matches φ at the end of input (top-level φ , not nested in a sub-pattern) |
\bφ | matches φ starting at a word boundary |
φ\b | matches φ ending at a word boundary |
\Bφ | matches φ starting at a non-word boundary |
φ\B | matches φ ending at a non-word boundary |
\<φ | matches φ that starts a word |
\>φ | matches φ that starts a non-word |
φ\< | matches φ that ends a non-word |
φ\> | matches φ that ends a word |
\i | matches an indent for Indent/nodent/dedent matching |
\j | matches a dedent for Indent/nodent/dedent matching |
\k | matches if indent depth changed, undoing this change to keep the current indent stops for Indent/nodent/dedent matching |
(?i:φ) | Case-insensitive mode matches φ ignoring case |
(?m:φ) | Multi-line mode ^ and $ in φ match begin and end of a line (default in lexer specifications) |
(?s:φ) | Dotall mode . (dot) in φ matches newline |
(?u:φ) | Unicode mode . , \s , \w , \l , \u , \S , \W , \L , \U match Unicode |
(?x:φ) | Free space mode ignore all whitespace and comments in φ |
(?#:X) | all of X is skipped as a comment |
Word boundaries \<
, \>
, \b
and \B
demarcate words. Word characters are letters, digits, and the underscore. Anchors \A
and \z
demarcate the begin and end of the input, respectively. Anchors ^
and $
demarcate the begin and end of a line, respectively, because multi-line mode is enabled by default in all RE/flex-generated scanners. See also Anchors and boundaries.
Indentation matching with \i
, \j
and \k
is a RE/flex feature available only with the RE/flex regex library that supports it. See also Indent/nodent/dedent for more details.
The lazy quantifier ? for optional patterns φ?? and repetitions φ*? and φ+? is not supported by Boost.Regex in POSIX mode. In general, POSIX matchers do not support lazy quantifiers due to POSIX limitations that are rooted in the theory of formal languages and the FSM construction of regular expressions. The RE/flex regex library is POSIX compliant and supports lazy quantifiers as an addition.

Option −−fast does not produce code that backtracks, which means that patterns such as bar.*\bfoo that require backtracking on \b may not work properly. If necessary, use option −−full when word boundaries require backtracking to find a match.

The following patterns are available in RE/flex and adopt the same Flex/Lex pattern syntax. These patterns should only be used in lexer specifications:
Pattern | Matches |
---|---|
\177 | matches an 8-bit character with octal value 177 |
"..." | matches the quoted content literally |
φ/ψ | matches φ if followed by ψ as a Trailing context |
<S>φ | matches φ only if state S is enabled in Start condition states |
<S1,S2,S3>φ | matches φ only if state S1 , S2 , or state S3 is enabled in Start condition states |
<*>φ | matches φ in any state of the Start condition states |
<<EOF>> | matches EOF in any state of the Start condition states |
<S><<EOF>> | matches EOF only if state S is enabled in Start condition states |
[a-z││[A-Z]] | matches a letter, see Character classes |
[a-z&&[^aeiou]] | matches a consonant, see Character classes |
[a-z−−[aeiou]] | matches a consonant, see Character classes |
[a-z]{+}[A-Z] | matches a letter, same as [a-z││[A-Z]] , see Character classes |
[a-z]{│}[A-Z] | matches a letter, same as [a-z││[A-Z]] , see Character classes |
[a-z]{&}[^aeiou] | matches a consonant, same as [a-z&&[^aeiou]] , see Character classes |
[a-z]{-}[aeiou] | matches a consonant, same as [a-z−−[aeiou]] , see Character classes |
Note that the characters .
(dot), \
, ?
, *
, +
, |
, (
, )
, [
, ]
, {
, }
, ^
, and $
are meta-characters and should be escaped to match. Lexer specifications additionally treat " and / as meta-characters, which should also be escaped to match.
Spaces and tabs cannot be matched in patterns in lexer specifications. To match the space character use " "
or [ ]
and to match the tab character use \t
. Use \h
to match a space or tab.
The order of precedence for composing larger patterns from sub-patterns is as follows, from high to low precedence:
1. (φ), (?:φ), (?=φ), and inline modifiers (?imsux-imsux:φ)
2. ?, *, +, {n,m}
3. φψ (including trailing context φ/ψ)
4. ^, $, \<, \>, \b, \B, \A, \z
5. φ|ψ
6. (?imsux-imsux)φ
Be careful with ?? in regex strings in C++ source code, because a C++ compiler may interpret ?? as the start of a trigraph. Instead, use at least one escaped question mark, such as ?\?, which the compiler will translate to ??. This problem does not apply to lexer specifications that the reflex command converts to regex strings. Fortunately, most C++ compilers ignore trigraphs unless in standard-conforming modes, such as -ansi and -std=c++98.

Character classes in bracket lists represent sets of characters. Sets can be negated (or inverted), subtracted, intersected, and merged (except for the PCRE2Matcher
):
Pattern | Matches |
---|---|
[a-zA-Z] | matches a letter |
[^a-zA-Z] | matches a non-letter (character class negation) |
[a-z││[A-Z]] | matches a letter (character class union) |
[a-z&&[^aeiou]] | matches a consonant (character class intersection) |
[a-z−−[aeiou]] | matches a consonant (character class subtraction) |
Bracket lists cannot be empty, so []
and [^]
are invalid. In fact, the first character after the bracket is always part of the list. So [][]
is a list that matches a ]
and a [
, [^][]
is a list that matches anything but ]
and [
, and [-^]
is a list that matches a -
and a ^
.
It is an error to construct an empty character class by subtraction or by intersection, for example [a&&[b]]
is invalid.
Bracket lists may contain ASCII and Unicode Character categories, for example [a-z\d]
contains the letters a
to z
and digits 0
to 9
(or Unicode digits when Unicode is enabled). To add Unicode character categories and wide characters (encoded in UTF-8) to bracket lists, Unicode mode should be enabled.
A negated Unicode character class is constructed by subtracting the character class from the Unicode range U+0000 to U+D7FF and U+E000 to U+10FFFF.
Character class operations can be chained together in a bracket list. The union ||
, intersection &&
, and subtraction --
operations are left associative and have the same operator precedence. For example, [a-z||[A-Z]--[aeiou]--[AEIOU]]
, [a-z--[aeiou]||[A-Z]--[AEIUO]]
, [a-z&&[^aeiou]||[A-Z]&&[^AEIOU]]
, and [B-DF-HJ-NP-TV-Zb-df-hj-np-tv-z]
are the same character classes.
Character class operations may be nested. For example, [a-z||[A-Z||[0-9]]]
is the same as [a-zA-Z0-9]
.
Character class negation, when specified, is applied to the resulting character class after the character class operations are applied. For example, [^a-z||[A-Z]]
is the same as [^||[a-z]||[A-Z]]
, which is the class [^a-zA-Z]
.
Note that negated character classes such as [^a-zA-Z]
match newlines when \n
is not included in the class. Include \n
in the negated character class to prevent matching newlines. The reflex::convert_flag::notnewline
removes newlines from character classes when used with Regex converters , except for patterns \P{C}
, \R
, \D
, \H
, and \W
.
A lexer specification may use a defined name in place of the second operand of a character class operation. A defined name when used as an operand should expand into a POSIX character class containing ASCII characters only. For example:
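For instance, the names vowel and consonant are illustrative:

```
vowel        [aeiou]
consonant    [a-z--{vowel}]
```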
A defined name may only be placed after a ||
, &&
, and a --
operator in a bracket list. Do not place a defined name as the first operand to a union, intersection, and subtraction operation, because the definition is not expanded. For example, [{lower}||{upper}]
contains [A-Zelorw{}]
. The name and the {
, }
characters are literally included in the resulting character class. Instead, this bracket list should be written as [||{lower}||{upper}]
. Likewise, [^{lower}||{upper}]
should be written as [^||{lower}||{upper}]
.Alternatively, unions may be written as alternations. That is, [||{name1}||{name2}||{name3}||...]
can be written as ({name1}|{name2}|{name3}|...)
, where the latter form supports full Unicode not restricted to ASCII.
The character class operators {+}
(or {|}
), {&}
, and {-}
may be used in lexer specifications. Note that Flex only supports the two operators {+}
and {-}
:
Pattern | Matches |
---|---|
[a-z]{+}[A-Z] | matches a letter, same as [a-z││[A-Z]] |
[a-z]{│}[A-Z] | matches a letter, same as [a-z││[A-Z]] |
[a-z]{&}[^aeiou] | matches a consonant, same as [a-z&&[^aeiou]] |
[a-z]{-}[aeiou] | matches a consonant, same as [a-z−−[aeiou]] |
Multiple operators can be chained together. Unlike Flex, defined names may be used as operands. For example {lower}{+}{upper}
is the same as [a-z]{+}[A-Z]
, i.e. the character class [A-Za-z]
. A defined name when used as an operand should expand into a POSIX character class containing ASCII characters only.
The 7-bit ASCII POSIX character categories are:
POSIX form | Matches |
---|---|
[:ascii:] | matches any ASCII character |
[:space:] | matches a white space character [ \t\n\v\f\r] |
[:xdigit:] | matches a hex digit [0-9A-Fa-f] |
[:cntrl:] | matches a control character [\x00-\x1f\x7f] |
[:print:] | matches a printable character [\x20-\x7e] |
[:alnum:] | matches an alphanumeric character [0-9A-Za-z] |
[:alpha:] | matches a letter [A-Za-z] |
[:blank:] | matches a blank character \h same as [ \t] |
[:digit:] | matches a digit [0-9] |
[:graph:] | matches a visible character [\x21-\x7e] |
[:lower:] | matches a lower case letter [a-z] |
[:punct:] | matches a punctuation character [\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e] |
[:upper:] | matches an upper case letter [A-Z] |
[:word:] | matches a word character [0-9A-Za-z_] |
[:^blank:] | matches a non-blank character \H same as [^ \t] |
[:^digit:] | matches a non-digit [^0-9] |
The POSIX forms are used in bracket lists. For example [[:lower:][:digit:]]
matches an ASCII lower case letter or a digit.
You can also use the upper case \P{C}
form that has the same meaning as \p{^C}
, which matches any character except characters in the class C
. For example, \P{ASCII}
is the same as \p{^ASCII}
which is the same as [^[:ascii:]]
.
When Unicode matching mode is enabled, [^[:ascii:]]
is a Unicode character class that excludes the ASCII character category. Unicode character classes and categories require the reflex
−−unicode
option.
The following Unicode character categories are enabled with the reflex
−−unicode
option or Unicode mode (?u:φ)
and with the regex matcher converter flag reflex::convert_flag::unicode
when using a regex library:
Unicode category | Matches |
---|---|
. | matches any single character (or a byte in Unicode mode, see Invalid UTF encodings and the dot pattern ) |
\a | matches BEL U+0007 |
\d | matches a digit \p{Nd} |
\D | matches a non-digit |
\e | matches ESC U+001b |
\f | matches FF U+000c |
\h | matches a blank [ \t] |
\H | matches a non-blank [^ \t] |
\l | matches a lower case letter \p{Ll} |
\n | matches LF U+000a |
\N | matches any non-LF character |
\r | matches CR U+000d |
\R | matches a Unicode line break |
\s | matches a white space character [ \t\n\v\f\r\x85\p{Z}] |
\S | matches a non-white space character |
\t | matches TAB U+0009 |
\u | matches an upper case letter \p{Lu} |
\v | matches VT U+000b |
\w | matches a Unicode word character [\p{L}\p{Nd}\p{Pc}] |
\W | matches a non-Unicode word character |
\X | matches any ISO-8859-1 or Unicode character |
\p{Space} | matches a white space character [ \t\n\v\f\r\x85\p{Z}] |
\p{Unicode} | matches any Unicode character U+0000 to U+10FFFF minus U+D800 to U+DFFF |
\p{ASCII} | matches an ASCII character U+0000 to U+007F |
\p{Non_ASCII_Unicode} | matches a non-ASCII character U+0080 to U+10FFFF minus U+D800 to U+DFFF |
\p{L&} | matches a character with Unicode property L& (i.e. property Ll, Lu, or Lt) |
\p{Letter} ,\p{L} | matches a character with Unicode property Letter |
\p{Mark} ,\p{M} | matches a character with Unicode property Mark |
\p{Separator} ,\p{Z} | matches a character with Unicode property Separator |
\p{Symbol} ,\p{S} | matches a character with Unicode property Symbol |
\p{Number} ,\p{N} | matches a character with Unicode property Number |
\p{Punctuation} ,\p{P} | matches a character with Unicode property Punctuation |
\p{Other} ,\p{C} | matches a character with Unicode property Other |
\p{Lowercase_Letter} , \p{Ll} | matches a character with Unicode sub-property Ll |
\p{Uppercase_Letter} , \p{Lu} | matches a character with Unicode sub-property Lu |
\p{Titlecase_Letter} , \p{Lt} | matches a character with Unicode sub-property Lt |
\p{Modifier_Letter} , \p{Lm} | matches a character with Unicode sub-property Lm |
\p{Other_Letter} , \p{Lo} | matches a character with Unicode sub-property Lo |
\p{Non_Spacing_Mark} , \p{Mn} | matches a character with Unicode sub-property Mn |
\p{Spacing_Combining_Mark} , \p{Mc} | matches a character with Unicode sub-property Mc |
\p{Enclosing_Mark} , \p{Me} | matches a character with Unicode sub-property Me |
\p{Space_Separator} , \p{Zs} | matches a character with Unicode sub-property Zs |
\p{Line_Separator} , \p{Zl} | matches a character with Unicode sub-property Zl |
\p{Paragraph_Separator} , \p{Zp} | matches a character with Unicode sub-property Zp |
\p{Math_Symbol} , \p{Sm} | matches a character with Unicode sub-property Sm |
\p{Currency_Symbol} , \p{Sc} | matches a character with Unicode sub-property Sc |
\p{Modifier_Symbol} , \p{Sk} | matches a character with Unicode sub-property Sk |
\p{Other_Symbol} , \p{So} | matches a character with Unicode sub-property So |
\p{Decimal_Digit_Number} , \p{Nd} | matches a character with Unicode sub-property Nd |
\p{Letter_Number} , \p{Nl} | matches a character with Unicode sub-property Nl |
\p{Other_Number} , \p{No} | matches a character with Unicode sub-property No |
\p{Dash_Punctuation} , \p{Pd} | matches a character with Unicode sub-property Pd |
\p{Open_Punctuation} , \p{Ps} | matches a character with Unicode sub-property Ps |
\p{Close_Punctuation} , \p{Pe} | matches a character with Unicode sub-property Pe |
\p{Initial_Punctuation} , \p{Pi} | matches a character with Unicode sub-property Pi |
\p{Final_Punctuation} , \p{Pf} | matches a character with Unicode sub-property Pf |
\p{Connector_Punctuation} , \p{Pc} | matches a character with Unicode sub-property Pc |
\p{Other_Punctuation} , \p{Po} | matches a character with Unicode sub-property Po |
\p{Control} , \p{Cc} | matches a character with Unicode sub-property Cc |
\p{Format} , \p{Cf} | matches a character with Unicode sub-property Cf |
\p{UnicodeIdentifierStart} | matches a character in the Unicode IdentifierStart class |
\p{UnicodeIdentifierPart} | matches a character in the Unicode IdentifierPart class |
\p{IdentifierIgnorable} | matches a character in the IdentifierIgnorable class |
\p{JavaIdentifierStart} | matches a character in the Java IdentifierStart class |
\p{JavaIdentifierPart} | matches a character in the Java IdentifierPart class |
\p{CsIdentifierStart} | matches a character in the C# IdentifierStart class |
\p{CsIdentifierPart} | matches a character in the C# IdentifierPart class |
\p{PythonIdentifierStart} | matches a character in the Python IdentifierStart class |
\p{PythonIdentifierPart} | matches a character in the Python IdentifierPart class |
To specify a Unicode block as a category when using the −−unicode
option, use \p{IsBlockName}
. The table below lists the block categories:
IsBlockName | Unicode character range |
---|---|
\p{IsBasicLatin} | U+0000 to U+007F |
\p{IsLatin-1Supplement} | U+0080 to U+00FF |
\p{IsLatinExtended-A} | U+0100 to U+017F |
\p{IsLatinExtended-B} | U+0180 to U+024F |
\p{IsIPAExtensions} | U+0250 to U+02AF |
\p{IsSpacingModifierLetters} | U+02B0 to U+02FF |
\p{IsCombiningDiacriticalMarks} | U+0300 to U+036F |
\p{IsGreekandCoptic} | U+0370 to U+03FF |
\p{IsCyrillic} | U+0400 to U+04FF |
\p{IsCyrillicSupplement} | U+0500 to U+052F |
\p{IsArmenian} | U+0530 to U+058F |
\p{IsHebrew} | U+0590 to U+05FF |
\p{IsArabic} | U+0600 to U+06FF |
\p{IsSyriac} | U+0700 to U+074F |
\p{IsArabicSupplement} | U+0750 to U+077F |
\p{IsThaana} | U+0780 to U+07BF |
\p{IsNKo} | U+07C0 to U+07FF |
\p{IsSamaritan} | U+0800 to U+083F |
\p{IsMandaic} | U+0840 to U+085F |
\p{IsSyriacSupplement} | U+0860 to U+086F |
\p{IsArabicExtended-B} | U+0870 to U+089F |
\p{IsArabicExtended-A} | U+08A0 to U+08FF |
\p{IsDevanagari} | U+0900 to U+097F |
\p{IsBengali} | U+0980 to U+09FF |
\p{IsGurmukhi} | U+0A00 to U+0A7F |
\p{IsGujarati} | U+0A80 to U+0AFF |
\p{IsOriya} | U+0B00 to U+0B7F |
\p{IsTamil} | U+0B80 to U+0BFF |
\p{IsTelugu} | U+0C00 to U+0C7F |
\p{IsKannada} | U+0C80 to U+0CFF |
\p{IsMalayalam} | U+0D00 to U+0D7F |
\p{IsSinhala} | U+0D80 to U+0DFF |
\p{IsThai} | U+0E00 to U+0E7F |
\p{IsLao} | U+0E80 to U+0EFF |
\p{IsTibetan} | U+0F00 to U+0FFF |
\p{IsMyanmar} | U+1000 to U+109F |
\p{IsGeorgian} | U+10A0 to U+10FF |
\p{IsHangulJamo} | U+1100 to U+11FF |
\p{IsEthiopic} | U+1200 to U+137F |
\p{IsEthiopicSupplement} | U+1380 to U+139F |
\p{IsCherokee} | U+13A0 to U+13FF |
\p{IsUnifiedCanadianAboriginalSyllabics} | U+1400 to U+167F |
\p{IsOgham} | U+1680 to U+169F |
\p{IsRunic} | U+16A0 to U+16FF |
\p{IsTagalog} | U+1700 to U+171F |
\p{IsHanunoo} | U+1720 to U+173F |
\p{IsBuhid} | U+1740 to U+175F |
\p{IsTagbanwa} | U+1760 to U+177F |
\p{IsKhmer} | U+1780 to U+17FF |
\p{IsMongolian} | U+1800 to U+18AF |
\p{IsUnifiedCanadianAboriginalSyllabicsExtended} | U+18B0 to U+18FF |
\p{IsLimbu} | U+1900 to U+194F |
\p{IsTaiLe} | U+1950 to U+197F |
\p{IsNewTaiLue} | U+1980 to U+19DF |
\p{IsKhmerSymbols} | U+19E0 to U+19FF |
\p{IsBuginese} | U+1A00 to U+1A1F |
\p{IsTaiTham} | U+1A20 to U+1AAF |
\p{IsCombiningDiacriticalMarksExtended} | U+1AB0 to U+1AFF |
\p{IsBalinese} | U+1B00 to U+1B7F |
\p{IsSundanese} | U+1B80 to U+1BBF |
\p{IsBatak} | U+1BC0 to U+1BFF |
\p{IsLepcha} | U+1C00 to U+1C4F |
\p{IsOlChiki} | U+1C50 to U+1C7F |
\p{IsCyrillicExtended-C} | U+1C80 to U+1C8F |
\p{IsGeorgianExtended} | U+1C90 to U+1CBF |
\p{IsSundaneseSupplement} | U+1CC0 to U+1CCF |
\p{IsVedicExtensions} | U+1CD0 to U+1CFF |
\p{IsPhoneticExtensions} | U+1D00 to U+1D7F |
\p{IsPhoneticExtensionsSupplement} | U+1D80 to U+1DBF |
\p{IsCombiningDiacriticalMarksSupplement} | U+1DC0 to U+1DFF |
\p{IsLatinExtendedAdditional} | U+1E00 to U+1EFF |
\p{IsGreekExtended} | U+1F00 to U+1FFF |
\p{IsGeneralPunctuation} | U+2000 to U+206F |
\p{IsSuperscriptsandSubscripts} | U+2070 to U+209F |
\p{IsCurrencySymbols} | U+20A0 to U+20CF |
\p{IsCombiningDiacriticalMarksforSymbols} | U+20D0 to U+20FF |
\p{IsLetterlikeSymbols} | U+2100 to U+214F |
\p{IsNumberForms} | U+2150 to U+218F |
\p{IsArrows} | U+2190 to U+21FF |
\p{IsMathematicalOperators} | U+2200 to U+22FF |
\p{IsMiscellaneousTechnical} | U+2300 to U+23FF |
\p{IsControlPictures} | U+2400 to U+243F |
\p{IsOpticalCharacterRecognition} | U+2440 to U+245F |
\p{IsEnclosedAlphanumerics} | U+2460 to U+24FF |
\p{IsBoxDrawing} | U+2500 to U+257F |
\p{IsBlockElements} | U+2580 to U+259F |
\p{IsGeometricShapes} | U+25A0 to U+25FF |
\p{IsMiscellaneousSymbols} | U+2600 to U+26FF |
\p{IsDingbats} | U+2700 to U+27BF |
\p{IsMiscellaneousMathematicalSymbols-A} | U+27C0 to U+27EF |
\p{IsSupplementalArrows-A} | U+27F0 to U+27FF |
\p{IsBraillePatterns} | U+2800 to U+28FF |
\p{IsSupplementalArrows-B} | U+2900 to U+297F |
\p{IsMiscellaneousMathematicalSymbols-B} | U+2980 to U+29FF |
\p{IsSupplementalMathematicalOperators} | U+2A00 to U+2AFF |
\p{IsMiscellaneousSymbolsandArrows} | U+2B00 to U+2BFF |
\p{IsGlagolitic} | U+2C00 to U+2C5F |
\p{IsLatinExtended-C} | U+2C60 to U+2C7F |
\p{IsCoptic} | U+2C80 to U+2CFF |
\p{IsGeorgianSupplement} | U+2D00 to U+2D2F |
\p{IsTifinagh} | U+2D30 to U+2D7F |
\p{IsEthiopicExtended} | U+2D80 to U+2DDF |
\p{IsCyrillicExtended-A} | U+2DE0 to U+2DFF |
\p{IsSupplementalPunctuation} | U+2E00 to U+2E7F |
\p{IsCJKRadicalsSupplement} | U+2E80 to U+2EFF |
\p{IsKangxiRadicals} | U+2F00 to U+2FDF |
\p{IsIdeographicDescriptionCharacters} | U+2FF0 to U+2FFF |
\p{IsCJKSymbolsandPunctuation} | U+3000 to U+303F |
\p{IsHiragana} | U+3040 to U+309F |
\p{IsKatakana} | U+30A0 to U+30FF |
\p{IsBopomofo} | U+3100 to U+312F |
\p{IsHangulCompatibilityJamo} | U+3130 to U+318F |
\p{IsKanbun} | U+3190 to U+319F |
\p{IsBopomofoExtended} | U+31A0 to U+31BF |
\p{IsCJKStrokes} | U+31C0 to U+31EF |
\p{IsKatakanaPhoneticExtensions} | U+31F0 to U+31FF |
\p{IsEnclosedCJKLettersandMonths} | U+3200 to U+32FF |
\p{IsCJKCompatibility} | U+3300 to U+33FF |
\p{IsCJKUnifiedIdeographsExtensionA} | U+3400 to U+4DBF |
\p{IsYijingHexagramSymbols} | U+4DC0 to U+4DFF |
\p{IsCJKUnifiedIdeographs} | U+4E00 to U+9FFF |
\p{IsYiSyllables} | U+A000 to U+A48F |
\p{IsYiRadicals} | U+A490 to U+A4CF |
\p{IsLisu} | U+A4D0 to U+A4FF |
\p{IsVai} | U+A500 to U+A63F |
\p{IsCyrillicExtended-B} | U+A640 to U+A69F |
\p{IsBamum} | U+A6A0 to U+A6FF |
\p{IsModifierToneLetters} | U+A700 to U+A71F |
\p{IsLatinExtended-D} | U+A720 to U+A7FF |
\p{IsSylotiNagri} | U+A800 to U+A82F |
\p{IsCommonIndicNumberForms} | U+A830 to U+A83F |
\p{IsPhags-pa} | U+A840 to U+A87F |
\p{IsSaurashtra} | U+A880 to U+A8DF |
\p{IsDevanagariExtended} | U+A8E0 to U+A8FF |
\p{IsKayahLi} | U+A900 to U+A92F |
\p{IsRejang} | U+A930 to U+A95F |
\p{IsHangulJamoExtended-A} | U+A960 to U+A97F |
\p{IsJavanese} | U+A980 to U+A9DF |
\p{IsMyanmarExtended-B} | U+A9E0 to U+A9FF |
\p{IsCham} | U+AA00 to U+AA5F |
\p{IsMyanmarExtended-A} | U+AA60 to U+AA7F |
\p{IsTaiViet} | U+AA80 to U+AADF |
\p{IsMeeteiMayekExtensions} | U+AAE0 to U+AAFF |
\p{IsEthiopicExtended-A} | U+AB00 to U+AB2F |
\p{IsLatinExtended-E} | U+AB30 to U+AB6F |
\p{IsCherokeeSupplement} | U+AB70 to U+ABBF |
\p{IsMeeteiMayek} | U+ABC0 to U+ABFF |
\p{IsHangulSyllables} | U+AC00 to U+D7AF |
\p{IsHangulJamoExtended-B} | U+D7B0 to U+D7FF |
\p{IsHighSurrogates} | U+D800 to U+DB7F |
\p{IsHighPrivateUseSurrogates} | U+DB80 to U+DBFF |
\p{IsLowSurrogates} | U+DC00 to U+DFFF |
\p{IsPrivateUseArea} | U+E000 to U+F8FF |
\p{IsCJKCompatibilityIdeographs} | U+F900 to U+FAFF |
\p{IsAlphabeticPresentationForms} | U+FB00 to U+FB4F |
\p{IsArabicPresentationForms-A} | U+FB50 to U+FDFF |
\p{IsVariationSelectors} | U+FE00 to U+FE0F |
\p{IsVerticalForms} | U+FE10 to U+FE1F |
\p{IsCombiningHalfMarks} | U+FE20 to U+FE2F |
\p{IsCJKCompatibilityForms} | U+FE30 to U+FE4F |
\p{IsSmallFormVariants} | U+FE50 to U+FE6F |
\p{IsArabicPresentationForms-B} | U+FE70 to U+FEFF |
\p{IsHalfwidthandFullwidthForms} | U+FF00 to U+FFEF |
\p{IsSpecials} | U+FFF0 to U+FFFF |
\p{IsLinearBSyllabary} | U+10000 to U+1007F |
\p{IsLinearBIdeograms} | U+10080 to U+100FF |
\p{IsAegeanNumbers} | U+10100 to U+1013F |
\p{IsAncientGreekNumbers} | U+10140 to U+1018F |
\p{IsAncientSymbols} | U+10190 to U+101CF |
\p{IsPhaistosDisc} | U+101D0 to U+101FF |
\p{IsLycian} | U+10280 to U+1029F |
\p{IsCarian} | U+102A0 to U+102DF |
\p{IsCopticEpactNumbers} | U+102E0 to U+102FF |
\p{IsOldItalic} | U+10300 to U+1032F |
\p{IsGothic} | U+10330 to U+1034F |
\p{IsOldPermic} | U+10350 to U+1037F |
\p{IsUgaritic} | U+10380 to U+1039F |
\p{IsOldPersian} | U+103A0 to U+103DF |
\p{IsDeseret} | U+10400 to U+1044F |
\p{IsShavian} | U+10450 to U+1047F |
\p{IsOsmanya} | U+10480 to U+104AF |
\p{IsOsage} | U+104B0 to U+104FF |
\p{IsElbasan} | U+10500 to U+1052F |
\p{IsCaucasianAlbanian} | U+10530 to U+1056F |
\p{IsVithkuqi} | U+10570 to U+105BF |
\p{IsLinearA} | U+10600 to U+1077F |
\p{IsLatinExtended-F} | U+10780 to U+107BF |
\p{IsCypriotSyllabary} | U+10800 to U+1083F |
\p{IsImperialAramaic} | U+10840 to U+1085F |
\p{IsPalmyrene} | U+10860 to U+1087F |
\p{IsNabataean} | U+10880 to U+108AF |
\p{IsHatran} | U+108E0 to U+108FF |
\p{IsPhoenician} | U+10900 to U+1091F |
\p{IsLydian} | U+10920 to U+1093F |
\p{IsMeroiticHieroglyphs} | U+10980 to U+1099F |
\p{IsMeroiticCursive} | U+109A0 to U+109FF |
\p{IsKharoshthi} | U+10A00 to U+10A5F |
\p{IsOldSouthArabian} | U+10A60 to U+10A7F |
\p{IsOldNorthArabian} | U+10A80 to U+10A9F |
\p{IsManichaean} | U+10AC0 to U+10AFF |
\p{IsAvestan} | U+10B00 to U+10B3F |
\p{IsInscriptionalParthian} | U+10B40 to U+10B5F |
\p{IsInscriptionalPahlavi} | U+10B60 to U+10B7F |
\p{IsPsalterPahlavi} | U+10B80 to U+10BAF |
\p{IsOldTurkic} | U+10C00 to U+10C4F |
\p{IsOldHungarian} | U+10C80 to U+10CFF |
\p{IsHanifiRohingya} | U+10D00 to U+10D3F |
\p{IsRumiNumeralSymbols} | U+10E60 to U+10E7F |
\p{IsYezidi} | U+10E80 to U+10EBF |
\p{IsOldSogdian} | U+10F00 to U+10F2F |
\p{IsSogdian} | U+10F30 to U+10F6F |
\p{IsOldUyghur} | U+10F70 to U+10FAF |
\p{IsChorasmian} | U+10FB0 to U+10FDF |
\p{IsElymaic} | U+10FE0 to U+10FFF |
\p{IsBrahmi} | U+11000 to U+1107F |
\p{IsKaithi} | U+11080 to U+110CF |
\p{IsSoraSompeng} | U+110D0 to U+110FF |
\p{IsChakma} | U+11100 to U+1114F |
\p{IsMahajani} | U+11150 to U+1117F |
\p{IsSharada} | U+11180 to U+111DF |
\p{IsSinhalaArchaicNumbers} | U+111E0 to U+111FF |
\p{IsKhojki} | U+11200 to U+1124F |
\p{IsMultani} | U+11280 to U+112AF |
\p{IsKhudawadi} | U+112B0 to U+112FF |
\p{IsGrantha} | U+11300 to U+1137F |
\p{IsNewa} | U+11400 to U+1147F |
\p{IsTirhuta} | U+11480 to U+114DF |
\p{IsSiddham} | U+11580 to U+115FF |
\p{IsModi} | U+11600 to U+1165F |
\p{IsMongolianSupplement} | U+11660 to U+1167F |
\p{IsTakri} | U+11680 to U+116CF |
\p{IsAhom} | U+11700 to U+1174F |
\p{IsDogra} | U+11800 to U+1184F |
\p{IsWarangCiti} | U+118A0 to U+118FF |
\p{IsDivesAkuru} | U+11900 to U+1195F |
\p{IsNandinagari} | U+119A0 to U+119FF |
\p{IsZanabazarSquare} | U+11A00 to U+11A4F |
\p{IsSoyombo} | U+11A50 to U+11AAF |
\p{IsUnifiedCanadianAboriginalSyllabicsExtended-A} | U+11AB0 to U+11ABF |
\p{IsPauCinHau} | U+11AC0 to U+11AFF |
\p{IsBhaiksuki} | U+11C00 to U+11C6F |
\p{IsMarchen} | U+11C70 to U+11CBF |
\p{IsMasaramGondi} | U+11D00 to U+11D5F |
\p{IsGunjalaGondi} | U+11D60 to U+11DAF |
\p{IsMakasar} | U+11EE0 to U+11EFF |
\p{IsLisuSupplement} | U+11FB0 to U+11FBF |
\p{IsTamilSupplement} | U+11FC0 to U+11FFF |
\p{IsCuneiform} | U+12000 to U+123FF |
\p{IsCuneiformNumbersandPunctuation} | U+12400 to U+1247F |
\p{IsEarlyDynasticCuneiform} | U+12480 to U+1254F |
\p{IsCypro-Minoan} | U+12F90 to U+12FFF |
\p{IsEgyptianHieroglyphs} | U+13000 to U+1342F |
\p{IsEgyptianHieroglyphFormatControls} | U+13430 to U+1343F |
\p{IsAnatolianHieroglyphs} | U+14400 to U+1467F |
\p{IsBamumSupplement} | U+16800 to U+16A3F |
\p{IsMro} | U+16A40 to U+16A6F |
\p{IsTangsa} | U+16A70 to U+16ACF |
\p{IsBassaVah} | U+16AD0 to U+16AFF |
\p{IsPahawhHmong} | U+16B00 to U+16B8F |
\p{IsMedefaidrin} | U+16E40 to U+16E9F |
\p{IsMiao} | U+16F00 to U+16F9F |
\p{IsIdeographicSymbolsandPunctuation} | U+16FE0 to U+16FFF |
\p{IsTangut} | U+17000 to U+187FF |
\p{IsTangutComponents} | U+18800 to U+18AFF |
\p{IsKhitanSmallScript} | U+18B00 to U+18CFF |
\p{IsTangutSupplement} | U+18D00 to U+18D7F |
\p{IsKanaExtended-B} | U+1AFF0 to U+1AFFF |
\p{IsKanaSupplement} | U+1B000 to U+1B0FF |
\p{IsKanaExtended-A} | U+1B100 to U+1B12F |
\p{IsSmallKanaExtension} | U+1B130 to U+1B16F |
\p{IsNushu} | U+1B170 to U+1B2FF |
\p{IsDuployan} | U+1BC00 to U+1BC9F |
\p{IsShorthandFormatControls} | U+1BCA0 to U+1BCAF |
\p{IsZnamennyMusicalNotation} | U+1CF00 to U+1CFCF |
\p{IsByzantineMusicalSymbols} | U+1D000 to U+1D0FF |
\p{IsMusicalSymbols} | U+1D100 to U+1D1FF |
\p{IsAncientGreekMusicalNotation} | U+1D200 to U+1D24F |
\p{IsMayanNumerals} | U+1D2E0 to U+1D2FF |
\p{IsTaiXuanJingSymbols} | U+1D300 to U+1D35F |
\p{IsCountingRodNumerals} | U+1D360 to U+1D37F |
\p{IsMathematicalAlphanumericSymbols} | U+1D400 to U+1D7FF |
\p{IsSuttonSignWriting} | U+1D800 to U+1DAAF |
\p{IsLatinExtended-G} | U+1DF00 to U+1DFFF |
\p{IsGlagoliticSupplement} | U+1E000 to U+1E02F |
\p{IsNyiakengPuachueHmong} | U+1E100 to U+1E14F |
\p{IsToto} | U+1E290 to U+1E2BF |
\p{IsWancho} | U+1E2C0 to U+1E2FF |
\p{IsEthiopicExtended-B} | U+1E7E0 to U+1E7FF |
\p{IsMendeKikakui} | U+1E800 to U+1E8DF |
\p{IsAdlam} | U+1E900 to U+1E95F |
\p{IsIndicSiyaqNumbers} | U+1EC70 to U+1ECBF |
\p{IsOttomanSiyaqNumbers} | U+1ED00 to U+1ED4F |
\p{IsArabicMathematicalAlphabeticSymbols} | U+1EE00 to U+1EEFF |
\p{IsMahjongTiles} | U+1F000 to U+1F02F |
\p{IsDominoTiles} | U+1F030 to U+1F09F |
\p{IsPlayingCards} | U+1F0A0 to U+1F0FF |
\p{IsEnclosedAlphanumericSupplement} | U+1F100 to U+1F1FF |
\p{IsEnclosedIdeographicSupplement} | U+1F200 to U+1F2FF |
\p{IsMiscellaneousSymbolsandPictographs} | U+1F300 to U+1F5FF |
\p{IsEmoticons} | U+1F600 to U+1F64F |
\p{IsOrnamentalDingbats} | U+1F650 to U+1F67F |
\p{IsTransportandMapSymbols} | U+1F680 to U+1F6FF |
\p{IsAlchemicalSymbols} | U+1F700 to U+1F77F |
\p{IsGeometricShapesExtended} | U+1F780 to U+1F7FF |
\p{IsSupplementalArrows-C} | U+1F800 to U+1F8FF |
\p{IsSupplementalSymbolsandPictographs} | U+1F900 to U+1F9FF |
\p{IsChessSymbols} | U+1FA00 to U+1FA6F |
\p{IsSymbolsandPictographsExtended-A} | U+1FA70 to U+1FAFF |
\p{IsSymbolsforLegacyComputing} | U+1FB00 to U+1FBFF |
\p{IsCJKUnifiedIdeographsExtensionB} | U+20000 to U+2A6DF |
\p{IsCJKUnifiedIdeographsExtensionC} | U+2A700 to U+2B73F |
\p{IsCJKUnifiedIdeographsExtensionD} | U+2B740 to U+2B81F |
\p{IsCJKUnifiedIdeographsExtensionE} | U+2B820 to U+2CEAF |
\p{IsCJKUnifiedIdeographsExtensionF} | U+2CEB0 to U+2EBEF |
\p{IsCJKCompatibilityIdeographsSupplement} | U+2F800 to U+2FA1F |
\p{IsCJKUnifiedIdeographsExtensionG} | U+30000 to U+3134F |
\p{IsTags} | U+E0000 to U+E007F |
\p{IsVariationSelectorsSupplement} | U+E0100 to U+E01EF |
\p{IsSupplementaryPrivateUseArea-A} | U+F0000 to U+FFFFF |
\p{IsSupplementaryPrivateUseArea-B} | U+100000 to U+10FFFF |
In addition, the −−unicode
option enables standard Unicode language scripts:
\p{Adlam}
, \p{Ahom}
, \p{Anatolian_Hieroglyphs}
, \p{Arabic}
, \p{Armenian}
, \p{Avestan}
, \p{Balinese}
, \p{Bamum}
, \p{Bassa_Vah}
, \p{Batak}
, \p{Bengali}
, \p{Bhaiksuki}
, \p{Bopomofo}
, \p{Brahmi}
, \p{Braille}
, \p{Buginese}
, \p{Buhid}
, \p{Canadian_Aboriginal}
, \p{Carian}
, \p{Caucasian_Albanian}
, \p{Chakma}
, \p{Cham}
, \p{Cherokee}
, \p{Chorasmian}
, \p{Common}
, \p{Coptic}
, \p{Cuneiform}
, \p{Cypriot}
, \p{Cypro_Minoan}
, \p{Cyrillic}
, \p{Deseret}
, \p{Devanagari}
, \p{Dives_Akuru}
, \p{Dogra}
, \p{Duployan}
, \p{Egyptian_Hieroglyphs}
, \p{Elbasan}
, \p{Elymaic}
, \p{Ethiopic}
, \p{Georgian}
, \p{Glagolitic}
, \p{Gothic}
, \p{Grantha}
, \p{Greek}
, \p{Gujarati}
, \p{Gunjala_Gondi}
, \p{Gurmukhi}
, \p{Han}
, \p{Hangul}
, \p{Hanifi_Rohingya}
, \p{Hanunoo}
, \p{Hatran}
, \p{Hebrew}
, \p{Hiragana}
, \p{Imperial_Aramaic}
, \p{Inscriptional_Pahlavi}
, \p{Inscriptional_Parthian}
, \p{Javanese}
, \p{Kaithi}
, \p{Kannada}
, \p{Katakana}
, \p{Kayah_Li}
, \p{Kharoshthi}
, \p{Khitan_Small_Script}
, \p{Khmer}
, \p{Khojki}
, \p{Khudawadi}
, \p{Lao}
, \p{Latin}
, \p{Lepcha}
, \p{Limbu}
, \p{Linear_A}
, \p{Linear_B}
, \p{Lisu}
, \p{Lycian}
, \p{Lydian}
, \p{Mahajani}
, \p{Makasar}
, \p{Malayalam}
, \p{Mandaic}
, \p{Manichaean}
, \p{Marchen}
, \p{Masaram_Gondi}
, \p{Medefaidrin}
, \p{Meetei_Mayek}
, \p{Mende_Kikakui}
, \p{Meroitic_Cursive}
, \p{Meroitic_Hieroglyphs}
, \p{Miao}
, \p{Modi}
, \p{Mongolian}
, \p{Mro}
, \p{Multani}
, \p{Myanmar}
, \p{Nabataean}
, \p{Nandinagari}
, \p{New_Tai_Lue}
, \p{Newa}
, \p{Nko}
, \p{Nushu}
, \p{Nyiakeng_Puachue_Hmong}
, \p{Ogham}
, \p{Old_Uyghur}
, \p{Ol_Chiki}
, \p{Old_Hungarian}
, \p{Old_Italic}
, \p{Old_North_Arabian}
, \p{Old_Permic}
, \p{Old_Persian}
, \p{Old_Sogdian}
, \p{Old_South_Arabian}
, \p{Old_Turkic}
, \p{Oriya}
, \p{Osage}
, \p{Osmanya}
, \p{Pahawh_Hmong}
, \p{Palmyrene}
, \p{Pau_Cin_Hau}
, \p{Phags_Pa}
, \p{Phoenician}
, \p{Psalter_Pahlavi}
, \p{Rejang}
, \p{Runic}
, \p{Samaritan}
, \p{Saurashtra}
, \p{Sharada}
, \p{Shavian}
, \p{Siddham}
, \p{SignWriting}
, \p{Sinhala}
, \p{Sogdian}
, \p{Sora_Sompeng}
, \p{Soyombo}
, \p{Sundanese}
, \p{Syloti_Nagri}
, \p{Syriac}
, \p{Tagalog}
, \p{Tagbanwa}
, \p{Tai_Le}
, \p{Tai_Tham}
, \p{Tai_Viet}
, \p{Takri}
, \p{Tamil}
, \p{Tangut}
, \p{Tangsa}
, \p{Telugu}
, \p{Thaana}
, \p{Thai}
, \p{Tibetan}
, \p{Tifinagh}
, \p{Tirhuta}
, \p{Toto}
, \p{Ugaritic}
, \p{Vai}
, \p{Vithkuqi}
, \p{Wancho}
, \p{Warang_Citi}
, \p{Yezidi}
, \p{Yi}
, \p{Zanabazar_Square}
.

The \p{Greek}
class represents Greek and Coptic letters and differs from the Unicode block \p{IsGreek}
that spans a specific Unicode block of Greek and Coptic characters only, which also includes unassigned characters.

Anchors are used to demarcate the start and end of input or the start and end of a line:
Pattern | Matches |
---|---|
^φ | matches φ at the start of input or start of a line (multi-line mode) |
φ$ | matches φ at the end of input or end of a line (multi-line mode) |
\Aφ | matches φ at the start of input |
φ\z | matches φ at the end of input |
Anchors in lexer specifications require pattern context, meaning that φ
cannot be empty.
Note that <<EOF>>
in lexer specifications matches the end of input, which can be used in place of the pattern \z
.
Actions for the start of input can be specified in an initial code block preceding the rules, see Initial code blocks .
Word boundaries demarcate words. Word characters are letters, digits, and the underscore.
Pattern | Matches |
---|---|
\bφ | matches φ starting at a word boundary |
φ\b | matches φ ending at a word boundary |
\Bφ | matches φ starting at a non-word boundary |
φ\B | matches φ ending at a non-word boundary |
\<φ | matches φ that starts as a word |
\>φ | matches φ that starts as a non-word |
φ\< | matches φ that ends as a non-word |
φ\> | matches φ that ends as a word |
Automatic indent and dedent matching is a special feature of RE/flex and is only available when the RE/flex matcher engine is used (the default matcher). An indent and a dedent position is defined and matched with:
Pattern | Matches |
---|---|
\i | indent: matches and adds a new indent stop position |
\j | dedent: matches a previous indent position, removes one indent stop |
The \i
and \j
anchors should be used in combination with the start of a line anchor ^
followed by a pattern that represents left margin spacing for indentations, followed by a \i
or a \j
at the end of the pattern. The margin spacing pattern may include any characters that are considered part of the left margin, but should exclude \n
. For example:
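A minimal sketch of such a specification, with placeholder actions that only report the indentation events (not the original example):

```
%o tabs=8
%%
^\h+\i     { /* indent: a new indent stop position was added */ }
^\h*\j     { /* dedent: matched a previous indent stop position */ }
\j         { /* extra dedent: reported for each additional level dedented */ }
.|\n       { echo(); }
%%
```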
The \h
pattern matches space and tabs, where tabs advance to the next column that is a multiple of 8. The tab multiplier can be changed by setting the −−tabs=N
option where N
must be 1, 2, 4, or 8. The tabs value can be changed at runtime with matcher().tabs(N)
:
RE/flex action | Result |
---|---|
matcher().tabs() | returns the current tabs value 1, 2, 4, or 8 |
matcher().tabs(n) | set the tabs value n where n is 1, 2, 4 or 8 |
Using negative patterns we can ignore empty lines and multi-line comments that would otherwise affect indent stops:
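For instance, rules along these lines (a sketch using the negative patterns discussed in the following paragraphs) skip such input without touching the indent stops:

```
(?^^\h*\n)        { /* ignore empty lines without affecting indent stops */ }
(?^^\h+/"/*")     { /* eat the left margin before a /*-comment without affecting indent stops */ }
```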
Likewise, we can add rules to ignore inline //
-comments to our lexer specification. To do so, we should add a rule with pattern (?^^\h*"//".*)
to ignore //
-comments without affecting stop positions.
To scan input that continues on the next new line(s) (which may affect indent stops) while preserving the current indent stop positions, use the RE/flex matcher matcher().push_stops()
and matcher().pop_stops()
, or matcher().stops()
to directly access the vector of indent stops to modify:
RE/flex action | Result |
---|---|
matcher().push_stops() | push indent stops on the stack then clear stops |
matcher().pop_stops() | pop indent stops and make them current |
matcher().clear_stops() | clear current indent stops |
matcher().stops() | reference to current std::vector<size_t> stops |
matcher().last_stop() | returns the last indent stop position or 0 |
matcher().insert_stop(n) | inserts/appends an indent stop at position n |
matcher().delete_stop(n) | remove stop positions from position n and up |
For example, to continue scanning after a /*
for multiple lines without indentation matching, allowing for possible nested /*
-comments, up to a */
you can save the current indent stop positions and transition to a new start condition state to scan the content between /*
and */
:
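A sketch of how this might be written, assuming a nesting counter level and a stops vector declared as lexer class members (these member names are chosen here for illustration and details are simplified):

```
%top{
  #include <vector>
%}
%o tabs=8
%x COMMENT
%class{
  int level;                    // nesting depth of /*-comments
  std::vector<size_t> stops;    // saved indent stop positions
%}
%%
^\h+\i           { /* indent */ }
^\h*\j           { /* dedent */ }
\j               { /* extra dedent */ }
(?^^\h*\n)       { /* ignore empty lines */ }
(?^^\h+/"/*")    { /* eat the margin before a /*-comment, keep indent stops */ }
^"/*"\j          |
"/*"             { stops = matcher().stops();    // save the current indent stops
                   level = 1;
                   start(COMMENT); }
.|\n             { echo(); }
<COMMENT>{
"/*"             { ++level; }                     // nested /*-comment
"*/"             { if (--level == 0) {
                     matcher().stops() = stops;   // restore the saved indent stops
                     start(INITIAL);
                   } }
.|\n             { /* skip comment content */ }
}
%%
```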
The multi-line comments enclosed in /*
*/
are processed by the exclusive COMMENT
start condition rules. The rules allow for /*
-comment nesting. We use stops = matcher().stops()
and matcher().stops() = stops
to save and restore stops.
In this example we added rules so that comments on a line do not affect the current indent stops. This is done by using the negative pattern (?^^\h+/"/*")
with a trailing context /"/*"
. Here we used a negative pattern to eat the margin spacing without affecting indent stops. The trailing context looks ahead for a /*
but does not consume the /*
.
However, when a /*
-comment starts at the first column of a line, the pattern (?^^\h+/"/*")
does not match it, even when we change it to (?^^\h*/"/*")
. This is because the \h*
cannot be an empty match since the trailing context does not return a match, and matches cannot be empty. Therefore, adding the rule with pattern ^"/*"\j
adjusts for that, by accepting the dedents caused by the /*
-comment. This is fine, because the stop positions are restored after scanning the /*
-comment.
We added the negative pattern (?^^\h*\n)
to ignore empty lines. This allows empty lines in the input without affecting indent stops.
matcher().stops()
method to access the vector of stops to modify, we must make sure to keep the stop positions in the vector sorted.
In addition to the \i
and \j
indent and dedent anchors, the \k
undent anchor matches when the indent depth changed (before the position of \k
), undoing this change to keep the current indent stops ("undenting"):
Pattern | Matches |
---|---|
\k | undent: matches when indent depth changed, keep current indent stops |
The example shown above can be simplified with \k
. We no longer need to explicitly save and restore indent stops in a variable:
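A sketch of the simplified specification, assuming the same COMMENT state and nesting counter as before:

```
%o tabs=8
%x COMMENT
%class{
  int level;       // nesting depth of /*-comments
%}
%%
^\h+\i           { /* indent */ }
^\h*\j           { /* dedent */ }
\j               { /* extra dedent */ }
\h*"/*"\k?       { level = 1; start(COMMENT); }   // \k undoes any indent change, keeping the stops
.|\n             { echo(); }
<COMMENT>{
"/*"             { ++level; }
"*/"             { if (--level == 0) start(INITIAL); }
.|\n             { /* skip comment content */ }
}
%%
```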
The pattern \h*"/*"\k?
matches a /*
-comment with leading white space. The \k
anchor matches if the indent depth changed in the leading white space, which is also matched by the first three patterns in the lexer specification before their \i
and \j
indent and dedent anchors, respectively. If the indent depth changed, the \k
anchor matches, while keeping the current indent stops unchanged by undoing these changes. Because we also want to match /*
when the indent depth does not change, we made \k
optional in pattern \h*"/*"\k?
. The anchor ^
is not used here either, since comments after any spacing should be matched. Alternatively, two patterns ^\h*"/*"\k
and \h*"/*"
may be used, where the first matches if and only if the indent stops changed on a new line and were undone.
Note that the COMMENT
rules do not use \i
or \j
. This means that the current indent stops are never matched or changed and remain the same as in the INITIAL
state, when returning to the INITIAL
state.
Another use of \k
is to ignore indents to only detect a closing dedent with \j
. For example, when comments are allowed to span multiple lines when indented below the start of the #
comment:
The COMMENT
state checks for an indent to switch to state MORECOM
, which eats the indented comment block. When there is no indent .|\n
is matched, i.e. something must be matched. This match is put back into the input with matcher().less(0)
(or yyless(0)
with −−flex
).
Alternatively, the indent level in the COMMENT
rules could be tracked by incrementing a variable when matching \i
and decrementing the variable when matching \j
until the variable is zero at the final dedent.
\i
, \j
, and \k
should appear at the end of a regex pattern.
See Start condition states for more information about start condition states. See Negative patterns for more information on negative patterns.
When negative patterns of the form (?^φ)
match, they are simply ignored by the matcher and never returned as matches. They are useful to return matches for some given pattern except when this pattern is more specific. For example, to match any sequence of digits except digits starting with a zero the pattern \d+|(?^0\d+)
can be used instead of [1-9]\d+
. While these two patterns may look similar at first glance, these two patterns differ in that the first pattern (with the negative sub-pattern (?^0\d+)
) ignores numbers with leading zeros such as 012
while the second pattern will match the 12
in 012
.
As another example, say we are searching for a given word while ignoring occurrences of the word in quoted strings. We can use the pattern word|(?^".*?")
for this, where (?^".*?")
matches all quoted strings that we want to ignore (to skip C/C++ quoted strings in source code input files, use the longer pattern (?^"(\\\\.|\\\\\\r?\\n|[^\\\\\\n"])*")
).
A negative pattern can also be used to consume line continuations without affecting the indentation stops defined by indent marker \i
. Negative patterns are a RE/flex feature. For example:
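A sketch of rules that use this negative pattern alongside the indent and dedent anchors:

```
^\h+\i           { /* indent */ }
^\h*\j           { /* dedent */ }
(?^\\\n\h+)      { /* line continuation: consumed internally, indent stops are unaffected */ }
```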
The negative pattern (?^\\\n\h+)
consumes input internally as if we are repeatedly calling input()
(or yyinput()
with −−flex
). We used it here to consume the line-ending \
and the indent that followed it, as if this text was not part of the input, which ensures that the current indent positions defined by \i
are not affected. See Indent/nodent/dedent for more details on indentation matching.
X(?^Y)
equals (?^XY)
and the pattern (?^Y)Z
equals (?^YZ)
. At least one character should be matched in a negative pattern for the pattern to be effective. For example, X(?^Y)?
matches X
but not XY
, which is the same as X|(?^XY)
.A lookahead pattern φ(?=ψ)
matches φ
only when followed by pattern ψ
. The text matched by ψ
is not consumed.
Boost.Regex and PCRE2 matchers support lookahead φ(?=ψ)
and lookbehind φ(?<=ψ)
patterns that may appear anywhere in a regex. The RE/flex matcher supports lookahead at the end of a pattern, similar to Trailing context.
A lookbehind pattern φ(?<=ψ)
matches φ
only when it also matches pattern ψ
at its end (that is, .*(?<=ab)
matches anything that ends in ab
).
The RE/flex matcher does not support lookbehind. Lookbehind patterns should not look too far behind, see Limitations .
Flex "trailing context" φ/ψ
matches a pattern φ
only when followed by the lookahead pattern ψ
. A trailing context φ/ψ
has the same meaning as the lookahead φ(?=ψ)
, see Lookahead.
A trailing context can only be used in lexer specifications and should only occur at the end of a pattern, not in the middle of a pattern. There are some important Limitations to consider that are historical and related to the construction of efficient FSMs for regular expressions. The limitations apply to trailing context and lookaheads that the RE/flex matcher implements.
Use reflex
option −−unicode
(or %option unicode
) to globally enable Unicode. Use (?u:φ)
to locally enable Unicode in a pattern φ
. Use (?-u:φ)
to locally disable Unicode in φ
. Unicode mode enables the following patterns to be used:
Pattern | Matches |
---|---|
. | matches any character (or byte in Unicode mode, see Invalid UTF encodings and the dot pattern ) |
€ (UTF-8) | matches wide character € , encoded in UTF-8 |
[€¥£] (UTF-8) | matches wide character € , ¥ or £ , encoded in UTF-8 |
\X | matches any ISO-8859-1 or Unicode character |
\R | matches a Unicode line break \r\n or [\u{000A}-\u{000D}\u{0085}\u{2028}\u{2029}] |
\s | matches a white space character [ \t\n\v\f\r\p{Z}] |
\l | matches a lower case letter with Unicode sub-property Ll |
\u | matches an upper case letter with Unicode sub-property Lu |
\w | matches a Unicode word character with property L, Nd, or Pc |
\u{20AC} | matches Unicode character U+20AC |
\p{C} | matches a character in category C |
\p{^C} ,\P{C} | matches any character except in category C |
When converting regex patterns for use with a C++ regex library, use regex matcher converter flag reflex::convert_flag::unicode
to convert Unicode patterns for use with the 8-bit based RE/flex, Boost.Regex, PCRE2, and std::regex regex libraries, see Regex converters for more details.
Free space mode can be useful to improve the readability of patterns. Free space mode permits spacing between concatenations and alternations in patterns. To match a single space use [ ]
, to match a tab use [\t]
, to match either use \h
. Long patterns may continue on the next line when the line ends with an escape \
. Comments are ignored in patterns in free-space mode. Comments start with a #
and end at the end of the line. To specify a #
use [#]
.
In addition, /*...*/
comments are permitted in lexer specifications in free-space mode when the −−matcher=reflex
option is specified (the default matcher).
Free space mode requires lexer actions in The rules section of a lexer specification to be placed in { }
blocks and user code to be placed in %{ %}
blocks instead of indented.
To enable free space mode in reflex
use the −−freespace
option (or %option freespace
).
Prepend (?x)
to the regex to specify free-space mode or use (?x:φ)
to locally enable free-space mode in the sub-pattern φ
. Use (?-x:φ)
to locally disable free-space mode in φ
. The regex pattern may require conversion when the regex library does not support free-space mode modifiers, see Regex converters for more details.
Multi-line mode makes the anchors ^
and $
match the start and end of a line, respectively. Multi-line mode is the default mode in lexer specifications.
Prepend (?m)
to the regex to specify multi-line mode or use (?m:φ)
to locally enable multi-line mode in the sub-pattern φ
. Use (?-m:φ)
to locally disable multi-line mode in φ
.
To enable dotall mode in reflex
use the -a
or −−dotall
option (or %option dotall
).
Prepend (?s)
to the regex to specify dotall mode or use (?s:φ)
to locally enable dotall mode in the sub-pattern φ
. Use (?-s:φ)
to locally disable dotall mode in φ
. The regex pattern may require conversion when the regex library does not support dotall mode modifiers, see Regex converters for more details.
To enable case-insensitive mode in reflex
use the -i
or −−case-insensitive
option (or %option case-insensitive
).
Prepend (?i)
to the regex to specify case-insensitive mode or use (?i:φ)
to locally enable case-insensitive mode in the sub-pattern φ
. Use (?-i:φ)
to locally disable case-insensitive mode in φ
. The regex pattern may require conversion when the regex library does not support case-insensitive mode modifiers, see Regex converters for more details.
Multiple (?i:φ)
Case-insensitive mode, (?m:φ)
Multi-line mode, (?s:φ)
Dotall mode, (?u:φ)
Unicode mode, and (?x:φ)
Free space mode modifiers may be applied to the same pattern φ
by combining them in one inline modifier (?imsux-imsux:φ)
, where the mode modifiers before the dash are enabled and the mode modifiers after the dash are disabled.
The PCRE and Boost regex libraries support group captures. This feature can be used with RE/flex using named captures. Only named captures can be used and the names must be unique among all lexer patterns, because a single regex pattern is compiled that combines all lexer rules (numeric group captures would apply globally across all rules, which is confusing.) For PCRE, Perl matching is required since PCRE POSIX matching does not support group captures. A named group is defined with (?<name>pattern)
and back-referenced with \g{name}
. The subpattern matched by a name can be retrieved in a lexer rule as follows:
See also POSIX versus Perl matching .
By default, reflex
produces a Lexer class with a virtual lex scanner function. The name of this function as well as the Lexer class name and the namespace can be set with options:
Option | RE/flex default name | Flex default name |
---|---|---|
namespace | n/a | n/a |
lexer | Lexer class | yyFlexLexer class |
lex | lex() function | yylex() function |
To customize the Lexer class use these options and code injection.
You can declare multiple nested namespace names by namespace=NAME1::NAME2::NAME3
, or by separating the names with a dot such as namespace=NAME1.NAME2.NAME3
, to declare the lexer in NAME1::NAME2::NAME3
.
To understand the impact of these options, consider the following lex specification template with upper case names representing the parts specified by the user:
This produces the following Lexer class with the template parts filled in:
The Lexer class produced with option −−flex
is compatible with Flex (assuming Flex with option -+
for C++):
To use a custom lexer class that inherits the generated base Lexer class, use option −−class=NAME
to declare the name of your custom lexer class (or option −−yyclass=NAME
to also enable −−flex
compatibility with the yyFlexLexer
class). For details, see Inheriting Lexer/yyFlexLexer .
To define a custom lexer class that inherits the generated Lexer or the yyFlexLexer class, use option −−class=NAME
or option −−yyclass=NAME
, respectively. Note that −−yyclass=NAME
also enables option −−flex
and therefore enables Flex specification syntax.
When a −−class=NAME
or −−yyclass=NAME
option is specified with the name of your custom lexer class, reflex
generates the lex()
(or yylex()
) method code for your custom lexer class. The custom lexer class should declare a public int lex()
method (or int yylex()
method with option −−yyclass=NAME
). Otherwise, C++ compilation of your custom class will fail.
For example, the following bare-bones custom Lexer class definition simply inherits Lexer and declares a public int lex()
method:
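A sketch of such a class, placed in a header or in the definitions section of the specification:

```cpp
class MyLexer : public Lexer {
 public:
  int lex();   // the body of this method is generated by reflex with option --class=MyLexer
};
```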
The int MyLexer::lex()
method code is generated by reflex
for this lexer specification.
Options −−lexer=NAME
and −−lex=NAME
may be combined with −−class=NAME
to change the name of the inherited Lexer class and change the name of the lex()
method, respectively.
When using option −−yyclass=NAME
the inherited lexer is yyFlexLexer
. The custom lexer class should declare a public yylex()
method similar to Flex. For example:
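A sketch of such a class:

```cpp
class MyLexer : public yyFlexLexer {
 public:
  int yylex();   // the body of this method is generated by reflex with option --yyclass=MyLexer
};
```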
The int MyLexer::yylex()
method code is generated by reflex
for this lexer specification.
To rename a lexer class to combine multiple lexers in one application, use the −−lexer=NAME
option. This option renames the generated lexer class to avoid lexer class name clashes. Use this option in combination with option −−header-file
to output a header file with the lexer class declaration to include in your application source code.
In addition, use option −−prefix=NAME
to output the generated code in file lex.NAME.cpp
instead of the standard lex.yy.cpp
to avoid name clashes. This option also affects the −−flex
option by generating xxFlexLexer
with a xxlex()
method when option −−prefix=xx
is specified. The generated #define
names (some of which are added to support option −−flex
) are prefixed to avoid macro name clashes.
Alternatively to −−lexer=NAME
and −−prefix=NAME
, you can use −−namespace=NAME
to place the generated lexer class in a C++ namespace to avoid lexer class name clashes. Note that unlike the −−prefix=NAME
option, the generated file names are not renamed by including NAME
in the filename. Generate the files in a separate subdirectory for each C++ namespace specified with −−namespace=NAME
. Or you can explicitly specify the lex filename with −−outfile=NAME
and also −−header-file=NAME
and −−tables-file=NAME
when header files and table files are used.
To create a Lexer class instance that reads from a designated input source instead of standard input, pass the input source as the first argument to its constructor and use the second argument to optionally set an std::ostream
that is assigned to out()
and is used by echo()
(likewise, assigned to *yyout
and used by ECHO
when option −−flex
is specified):
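A minimal sketch (the variable names are chosen for illustration):

```cpp
Lexer lexer(input, std::cout);   // scan from `input`; echo() writes to std::cout
```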
likewise, with option −−flex
:
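```cpp
yyFlexLexer lexer(input, std::cout);   // with --flex; ECHO writes to std::cout
```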
where input
is a reflex::Input
object. The reflex::Input
constructor takes a FILE*
descriptor, std::istream
, a string std::string
or const char*
, or a wide string std::wstring
or const wchar_t*
.
std::string
and char*
strings. The lexer reads the specified input while scanning the input with subsequent lex()
(and yylex()
etc.) calls. The input source is copied in chunks of bytes to an internal buffer, depending on the buffering mode.
The following methods are available to specify an input source:
RE/flex action | Flex action | Result |
---|---|---|
in() | *yyin | get pointer to current reflex::Input i |
in() = i | yyin = &i | set input reflex::Input i |
in(i) | yyrestart(i) | reset and scan input from reflex::Input i |
in(s) | yy_scan_string(s) | reset and scan string s (std::string or char* ) |
in(s) | yy_scan_wstring(s) | reset and scan wide string s (std::wstring or wchar_t* ) |
in(b, n) | yy_scan_bytes(b, n) | reset and scan n bytes at address b (buffered) |
buffer(b, n+1) | yy_scan_buffer(b, n+2) | reset and scan n bytes at address b (zero copy) |
For example, to switch input to another source while using the scanner, use in(i)
with reflex::Input i
as an argument:
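A sketch of switching to another input source, here standard input as an illustration:

```cpp
reflex::Input i(stdin);   // some new input source, e.g. a FILE*, stream, or string
lexer.in(i);              // reset the matcher and scan from i
```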
You can assign new input with in() = i
, which does not reset the lexer's matcher. This means that when the end of the input (EOF) is reached, and you want to switch to new input, then you should clear the EOF state first with lexer.matcher().set_end(false)
to reset EOF. Or use lexer.matcher().reset()
to clear the state.
Invoking in(i)
resets the lexer's matcher (i.e. internally with matcher.reset()
). This clears the line and column counters, resets the internal anchor and boundary flags for anchor and word boundary matching, and resets the matcher to consume buffered input.
These in(i)
operations specify strings and bytes that are copied to an internal buffer. This is desirable, because the scanner uses a matcher that initializes a buffer, block-wise copies more input to this internal buffer on demand, and modifies this buffered content, e.g. to allow text()
to return a 0-terminated char
string. Zero copy overhead is obtained with lexer method buffer(b, n)
to assign an external buffer:
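A sketch of scanning a writable external buffer in place:

```cpp
char buf[] = "a string to scan in place";   // ends in one zero byte
lexer.buffer(buf, sizeof(buf));             // zero copy: scans sizeof(buf)-1 bytes
lexer.lex();
```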
buffer(b, n)
scans n
-1 bytes at address b
. The length n
should include the final zero byte at the end of the string.
With options −−flex
and −−bison
you can also use classic Flex functions:
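For instance (a sketch; remember to delete the old buffer first, as explained next):

```cpp
yy_scan_string("a string to scan");   // creates a new buffer and scans the string
yy_scan_bytes("raw bytes", 9);        // creates a new buffer and scans 9 bytes (buffered)
```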
The yy_scan_string
, yy_scan_bytes
, and yy_scan_wstring
functions create a new buffer (i.e. a new matcher in RE/flex) and replace the old buffer without deleting it. A pointer to the new buffer is returned, which becomes the new YY_CURRENT_BUFFER
. You should delete the old buffer with yy_delete_buffer(YY_CURRENT_BUFFER)
before creating a new buffer.
Zero copy overhead is obtained with yy_scan_buffer(b, n)
:
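A sketch, noting that the buffer must end in two zero bytes:

```cpp
char buf[] = "a string to scan\0";   // the array ends in two zero bytes
yy_scan_buffer(buf, sizeof(buf));    // zero copy: scans sizeof(buf)-2 bytes
```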
yy_scan_buffer(b, n)
(when option −−flex
is used) scans n
-2 bytes at address b
. The length n
should include two final zero bytes at the end!
yy_scan_buffer(b, n)
only touches the first final byte and not the second byte, since this function is the same as calling buffer(b, n-1)
. In fact, the specified string may have any final byte value. The final byte of the string will be set to zero when text()
(or yytext
) or rest()
are used. But otherwise the final byte remains completely untouched by the other lexer functions, including echo()
(and Flex-compatible ECHO
). Only unput(c)
, wunput()
, text()
(or yytext
), rest()
, and span()
modify the buffer contents, where text()
and rest()
require an extra byte at the end of the buffer to make the strings returned by these functions 0-terminated. This means that you can scan read-only memory of n
bytes located at address b
by using buffer(b, n+1)
safely, for example to read read-only mmap(2) PROT_READ
memory, as long as unput(c)
,wunput()
, text()
(or yytext
), rest()
, and span()
are not used.
The Flex yy_scan_string
, yy_scan_bytes
, yy_scan_wstring
, and yy_scan_buffer
functions take an extra last yyscan_t
argument for reentrant scanners generated with option −−reentrant
, for example:
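```cpp
yy_scan_string("a string to scan", yyscanner);   // inside a rule; outside a rule pass &lexer instead
```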
The yyscanner
macro is essentially the same is the this
pointer that can only be used in lexer methods and in lexer rules. Outside the scope of lexer methods a pointer to your yyFlexLexer lexer
object should be used instead, for example yyget_in(&lexer)
. Also YY_CURRENT_BUFFER
should be replaced by yyget_current_buffer(&lexer)
. See also Reentrant scanners.
Switching input before the end of the input source is reached discards all remaining input from that source. To switch input without affecting the current input source, switch matchers instead. The matchers buffer the input and manage the input state, in addition to pattern matching the input.
The following methods are available to specify a matcher Matcher m
(a Flex "buffer") for a lexer:
RE/flex action | Flex action | Result |
---|---|---|
matcher(m) | yy_switch_to_buffer(m) | use matcher m |
new_matcher(i) | yy_create_buffer(i, n) | returns new matcher for reflex::Input i |
del_matcher(m) | yy_delete_buffer(m) | delete matcher m |
push_matcher(m) | yypush_buffer_state(m) | push current matcher, then use m |
pop_matcher() | yypop_buffer_state() | pop matcher and delete current |
ptr_matcher() | YY_CURRENT_BUFFER | pointer to current matcher |
has_matcher() | YY_CURRENT_BUFFER != 0 | current matcher is usable |
For example, to switch to a matcher that scans from a new input source, then restores the old input source:
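A sketch of this, assuming i is a reflex::Input for the new source:

```cpp
push_matcher(new_matcher(i));   // save the current matcher and scan reflex::Input i
// ... scan tokens from i ...
pop_matcher();                  // delete that matcher and resume the saved input
```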
the same with the −−flex
option becomes:
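```cpp
yypush_buffer_state(yy_create_buffer(i, YY_BUF_SIZE));   // save the current buffer, scan input i
// ... scan tokens from i ...
yypop_buffer_state();                                    // delete it and resume the saved buffer
```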
This switches the scanner's input by switching to another matcher. Note that matcher(m)
may be used by the virtual wrap()
method (or yywrap()
when option −−flex
is specified) if you use input wrapping after EOF to set things up for continued scanning.
Switching input sources (via either matcher(m)
, in(i)
, or the Flex functions) does not change the current start condition state.
When the scanner reaches the end of the input, it will check the int wrap()
method to determine if scanning should continue. If wrap()
returns one (1) the scanner terminates and returns zero to its caller. If wrap()
returns zero (0) then the scanner continues. In this case wrap()
should set up a new input source to scan.
For example, to continue reading from std::cin after some other input source has reached EOF:
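A sketch that injects a wrap() override into the generated Lexer class with a %class block (with option −−flex the method would be yywrap(); the exact test for remaining input is only illustrative):

```
%class{
  virtual int wrap()
  {
    in(std::cin);                 // continue with standard input
    return in().good() ? 0 : 1;   // 0 means more input is available
  }
%}
```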
To implement a wrap()
(and yywrap()
when option −−flex
is specified) in a derived lexer class with option class=NAME
(or yyclass=NAME
), override the wrap()
(or yywrap()
) method as follows:
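A sketch of such an override in a custom lexer class (names are illustrative):

```cpp
class MyLexer : public Lexer {   // or public yyFlexLexer with option --yyclass=MyLexer
 public:
  virtual int wrap()             // yywrap() with --flex
  {
    // assign another input source here, e.g. with in() = i, and return 0 to keep scanning
    return 1;                    // 1 means no more input
  }
  int lex();
};
```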
You can override the wrap()
method to set up a new input source when the current input is exhausted. Do not use matcher().input(i)
to set a new input source i
, because that resets the internal matcher state.
With the −−flex
option you can override the yyFlexLexer::yywrap()
method that returns an integer 0 (more input available) or 1 (we're done).
With the −−flex
and −−bison
options you should define a global yywrap()
function that returns an integer 0 (more input available) or 1 (we're done).
To set the current input as interactive, such as input from a console, use matcher().interactive()
(yy_set_interactive(1)
with option −−flex
). This disables buffering of the input and makes the scanner responsive to direct input.
To read from the input without pattern matching, use matcher().input()
to read one character at a time (8-bit, ASCII or UTF-8). This function returns EOF if the end of the input was reached. But be careful, the Flex yyinput()
and input()
functions return 0 instead of an EOF
(-1)!
To put back one character onto the input stream, use matcher().unput(c)
(or unput(c)
with option −−flex
) to put byte c
back in the input or matcher().wunput(c)
to put a (wide) character c
back in the input.
unput()
and wunput()
invalidate the previous text()
and yytext
pointers. Basically, text()
and yytext
cannot be used after unput()
.
For example, to crudely scan a C/C++ multiline comment we can use the rule:
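A sketch of such a rule, loosely based on the classic Flex example (with −−flex use yyinput() instead of matcher().input()):

```cpp
"/*"    {   // crude: read one character at a time until the closing */ is seen
            int c1 = 0, c2 = matcher().input();
            while (c2 != EOF && !(c1 == '*' && c2 == '/'))
            {
              c1 = c2;
              c2 = matcher().input();
            }
        }
```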
We actually do not need to keep track of line numbers explicitly, because yyinput()
with RE/flex implicitly updates line numbers, unlike Flex from which this example originates.
Instead of the crude approach shown above, a better alternative is to use a regex /\*.*?\*/
or perhaps use start condition states, see Start condition states .
A simpler and faster approach is to use skip("*/")
to skip comments:
Using skip()
is fast and flushes the internal buffer when searching, unlike yyinput()
that maintains the buffer contents to keep text()
(and yytext
) unchanged.
To grab the rest of the input as a string, use matcher().rest()
which returns a const char*
string that points to the internal buffer that is enlarged to contain all remaining input. Copy the string before using the matcher again.
To read a number of bytes n
into a string buffer s[0..n-1]
, use the virtual matcher().get(s, n)
method. This method is the same as invoking matcher().in.get(s, n)
to directly read data from the reflex::Input
source in
, but also handles interactive input when enabled with matcher().interactive()
to not read beyond the next newline character. The get(s, n) matcher method can be overridden by a derived matcher class to customize reading.
The Flex YY_INPUT
macro is not supported by RE/flex. It is recommended to use YY_BUFFER_STATE
(Flex), which is a reflex::FlexLexer::Matcher
class in RE/flex that holds the matcher state and the state of the current input, including the line and column number positions (so unlike Flex, yylineno
does not have to be saved and restored when switching buffers). See also section Lexer specifications on the actions to use.
To implement a custom input handler you can use a proper object-oriented approach: create a derived class of reflex::Matcher
(or another matcher class derived from reflex::AbstractMatcher
) and in the derived class override the size_t reflex::Matcher::get(char *s, size_t n)
method for input handling. This function is called with a string buffer s
of size n
bytes. Fill the string buffer s
up to n
bytes and return the number of bytes stored in s
. Return zero upon EOF. Use reflex
options −−matcher=NAME
and −−pattern=reflex::Pattern
to use your new matcher class NAME
(or leave out −−pattern
for Boost.Regex derived matchers).
The FlexLexer
lexer class is the base class of the yyFlexLexer
lexer class generated with reflex
option −−flex
, which defines a virtual size_t LexerInput(char*, size_t)
method. This virtual method can be redefined in the generated yyFlexLexer
lexer to consume input from some source of text:
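A sketch that injects a LexerInput() override into the generated yyFlexLexer class; reading from stdin here is only a placeholder for your actual source of text:

```
%top{
  #include <cstdio>
%}
%class{
  virtual size_t LexerInput(char *s, size_t n)
  {
    return ::fread(s, 1, n, stdin);   // fill s with at most n bytes, return 0 at EOF
  }
%}
```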
This approach is compatible with Flex. The LexerInput
method may be invoked multiple times by the matcher engine and should eventually return zero to indicate the end of input is reached (e.g. when at EOF).
To prevent the scanner from initializing the input to stdin
before reading input with LexerInput()
, use option −−nostdinit
.
A typical scenario for a compiler of a programming language is to process include
directives in the source input that should include the source of another file before continuing with the current input.
For example, the following specification defines a lexer that processes #include
directives by switching matchers and using the stack of matchers to permit nested #include
directives up to a depth of at most 99 files:
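A sketch of such a specification; error handling is simplified and the included files are not closed here for brevity, so the original example may differ in detail:

```
%top{
  #include <cstdio>
  #include <iostream>
  #include <string>
%}
%class{
  int depth;    // current #include nesting depth
%}
%init{
  depth = 0;
%}
%x INCLUDE
%%
^\h*#include\h*\"      { start(INCLUDE); }
<INCLUDE>[^"\n]+\"     {
                         std::string name(text(), size() - 1);   // strip the closing quote
                         FILE *fd = fopen(name.c_str(), "r");
                         if (fd == NULL || depth >= 99)
                           std::cerr << "cannot include " << name << std::endl;
                         else
                         {
                           ++depth;
                           push_matcher(new_matcher(fd));        // scan the included file
                         }
                         start(INITIAL);
                       }
<<EOF>>                {
                         if (depth == 0)
                           return 0;          // done with the top-level file
                         --depth;
                         pop_matcher();       // resume the including file
                       }
.|\n                   { echo(); }
%%
```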
With option −−flex
, the statement push_matcher(new_matcher(fd))
above becomes yypush_buffer_state(yy_create_buffer(fd, YY_BUF_SIZE))
and pop_matcher()
becomes yypop_buffer_state()
. For comparison, here is a C-based classic Flex example specification that works with RE/flex too:
Start conditions are used to group rules and selectively activate rules when the start condition state becomes active.
A rule with a pattern that is prefixed with one or more start conditions will only be active when the scanner is in one of these start condition states.
For example:
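For instance (patterns and actions are placeholders; the rule numbers are referred to below):

```
<A,B>pattern1    { /* rule 1 */ }
<A>pattern2      { /* rule 2 */ }
<B>pattern3      { /* rule 3 */ }
```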
When the scanner is in state A
rules 1 and 2 are active. When the scanner is in state B
rules 1 and 3 are active.
Start conditions are declared in The definitions section (the first section) of the lexer specification using %state
or %xstate
(or %s
and %x
for short) followed by a space-separated list of names called start symbols. Start conditions declared with %s
are inclusive start conditions. Start conditions declared with %x
are exclusive start conditions:
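For instance, the following declares an inclusive start condition A and an exclusive start condition X:

```
%s A
%x X
```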
If a start condition is inclusive, then all rules without a start condition and rules with the corresponding start condition will be active.
If a start condition is exclusive, only the rules with the corresponding start condition will be active.
When declaring start symbol names it is recommended to use all upper case to avoid name clashes with other Lexer class members. For example, we cannot use text
as a start symbol name because text()
is a Lexer method. When option −−flex
is specified, start symbol names are macros for compatibility with Lex/Flex.
The scanner is initially in the INITIAL
start condition state. The INITIAL
start condition is inclusive: all rules without a start condition and those prefixed with the INITIAL
start condition are active when the scanner is in the INITIAL
start condition state.
The special start condition prefix <*>
matches every start condition. The prefix <*>
is not needed for <<EOF>>
rules, because unprefixed <<EOF>>
rules are always active as a special case. The <<EOF>>
pattern and this exception were originally introduced by Flex.
For example:
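One arrangement of six rules that produces the activations described below (patterns are placeholders):

```
%s A
%x X
%%
<A,X>pattern1    { /* rule 1 */ }
<A>pattern2      { /* rule 2 */ }
<X>pattern3      { /* rule 3 */ }
<*>pattern4      { /* rule 4 */ }
pattern5         { /* rule 5 */ }
<<EOF>>          { return 0; /* rule 6 */ }
%%
```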
When the scanner is in state INITIAL
rules 4, 5, and 6 are active. When the scanner is in state A
rules 1, 2, 4, 5, and 6 are active. When the scanner is in state X
rules 1, 3, 4, and 6 are active. Note that A
is inclusive whereas X
is exclusive.
To switch to a start condition state, use start(START)
(or BEGIN START
when option −−flex
is specified). To get the current state use start()
(or YY_START
when option −−flex
is specified). Switching start condition states in your scanner allows you to create "mini-scanners" to scan portions of the input that are syntactically different from the rest of the input, such as comments:
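A sketch of a comment mini-scanner built with an exclusive start condition state:

```
%x COMMENT
%%
"/*"             { start(COMMENT); }    // enter the comment mini-scanner
.|\n             { echo(); }
<COMMENT>{
"*/"             { start(INITIAL); }    // leave the comment mini-scanner
.|\n             { /* ignore comment content */ }
}
%%
```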
Start symbols are actually integer values, where INITIAL
is 0. This means that you can store a start symbol value in a variable. You can also push the current start condition on a stack and transition to start condition START
with push_state(START)
. To transition to a start condition that is on the top of the stack and pop it use pop_state()
. The top_state()
returns the start condition that is on the top of the stack:
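For instance, a sketch using the start condition stack:

```
"/*"             { push_state(COMMENT); }   // remember the current state, enter COMMENT
<COMMENT>"*/"    { pop_state(); }           // return to whatever state we came from
```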
When many rules are prefixed by the same start conditions, you can simplify the rules by placing them in a start condition scope:
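For instance, a sketch of a start condition scope:

```
<A,B>{
pattern1    { /* active in states A and B */ }
pattern2    { /* active in states A and B */ }
}
```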
Start condition scopes may be nested. A nested scope extends the scope of start conditions that will be associated with the rules in the nested scope.
For example:
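A sketch of a nested scope:

```
<A>{
pattern1    { /* active in state A */ }
<B>{
pattern2    { /* active in states A and B */ }
}
}
```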
Designating a start condition as inclusive or exclusive is effective only for rules that are not associated with a start condition scope. That is, inclusive start condition states are implicitly associated with rules unless a rule has a start condition scope that explicitly associates start condition states with the rule.
RE/flex extends the syntax of start conditions scopes beyond Flex syntax, allowing the removal of start conditions from the current scope. A start condition name prefixed with the ^
operator is removed from the current scope:
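A sketch of removing a start condition from an enclosing scope:

```
<A,B>{
pattern1    { /* active in states A and B */ }
<^A>{
pattern2    { /* active in state B only, because A is removed from the scope */ }
}
}
```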
Note that scopes should be read from outer to inner scope, and from left to right in a <...>
scope declaration. This means that <*,^A,^C>
first extends the scope to include all start conditions and then removes A
and C
.
A start condition cannot be removed when it is not included in the current scope. For example, <*,^A>
is correct but <^A,*>
is incorrect when used as a top-level scope.
Empty <>
without start condition states cannot be specified because this is a valid regex pattern. To remove all states from a scope use <^*>
. This construct is only useful when the empty scope is extended by start conditions specified in sub-scopes.
%option freespace
allows patterns to be indented. With this option all action code blocks must be bracketed.
An initial code block may be placed at the start of the rules section or in a condition scope. This code block is executed each time the scanner is invoked (i.e. when lex()
or yylex()
is called) before matching a pattern. Initial code blocks may be associated with start condition states as follows:
Initial code blocks should be indented or should be placed within %{ %}
blocks.
An initial code block can be used to configure the lexer's matcher, since a new matcher with the lexer patterns is created by the lexer just before the rules are matched. For example:
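A sketch that configures the matcher from an initial code block; using tab stops of 4 here is only an illustration:

```
%%
%{
  matcher().tabs(4);    // set tab stops at multiples of 4 before matching starts
%}
.|\n    { echo(); }
%%
```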
The Bison tools generate parsers that invoke the global C function yylex()
to get the next token. Tokens are integer values returned by yylex()
.
To support Bison parsers use reflex
option −−bison
. This option generates a scanner with a global lexer object YY_SCANNER
and a global YY_EXTERN_C int yylex()
function. When the Bison parser is compiled in C and the scanner is compiled in C++, you must set YY_EXTERN_C
in the lex specification to extern "C"
to enable C linkage rules:
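A sketch of the relevant parts of such a specification; the token names, the yylval members, and the y.tab.h header are assumptions taken from the surrounding text:

```
%top{
  #include <cstdlib>              // strtol()
  #include <cstring>              // strdup()
  #include "y.tab.h"              // tokens and YYSTYPE generated by Bison/Yacc
  #define YY_EXTERN_C extern "C"  // yylex() is called from C, so give it C linkage
%}
%option bison noyywrap
%%
[0-9]+        { yylval.num = strtol(text(), NULL, 10); return CONST_NUMBER; }
\"[^"\n]*\"   { yylval.str = strdup(text()); return CONST_STRING; }
[ \t\r\n]+    { /* skip white space */ }
%%
```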
Note that −−noyywrap
may be specified to remove the dependency on the global yywrap()
function that is not defined.
This example sets the global yylval.num
to the integer scanned or yylval.str
to the string scanned. It assumes that the Bison/Yacc grammar file defines the tokens CONST_NUMBER
and CONST_STRING
and the type YYSTYPE
of yylval
. For example:
YYSTYPE
is a union defined by Bison or you can set it as an option %option YYSTYPE=type
in a lexer specification.
When option −−flex
is specified with −−bison
, the yytext
, yyleng
, and yylineno
globals are accessible to the Bison/Yacc parser. In fact, all Flex actions and variables are globally accessible (outside The rules section of the lexer specification) with the exception of yy_push_state
, yy_pop_state
, and yy_top_state
that are class methods. Furthermore, yyin
and yyout
are macros and cannot be (re)declared or accessed as global variables, but these can be used as if they are variables to assign a new input source and to set the output stream. To avoid compilation errors when using globals such as yyin
, use reflex
option −−header-file
to generate a header file lex.yy.h
to include in your code. Finally, in code outside of The rules section you must use yyinput()
instead of input()
, use the global action yyunput()
instead of unput()
, and use the global action yyoutput()
instead of output()
.
See the generated lex.yy.cpp
"BISON" section, which contains declarations specific to Bison when the −−bison
option is specified.
There are two approaches for a Bison parser to work with a scanner. Either the Bison/Yacc grammar file should include the externs we need to import from the scanner:
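For instance, a minimal sketch of such declarations in the grammar file:

```
%{
  /* import what the parser needs from the scanner */
  extern int yylex(void);
  void yyerror(const char *msg);
%}
```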
or a better approach is to generate a lex.yy.h
header file with option −−header-file
and use this header file in the Bison/Yacc grammar file:
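For instance:

```
%{
  #include "lex.yy.h"   /* generated with reflex option --header-file */
  void yyerror(const char *msg);
%}
```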
The second option requires the generated parser to be compiled in C++, because lex.yy.h
contains C++ declarations.
YY_DECL
is not supported by RE/flex. The YY_DECL
macro is used or defined by Flex to redeclare the yylex()
function signature, See YY_DECL alternatives for more information.Bison and Yacc are not thread-safe because the generated code uses and updates global variables. Yacc and Bison use the global variable yylval
to exchange token values. By contrast, thread-safe reentrant Bison parsers pass the yylval
to the yylex()
function as a parameter. RE/flex supports all of these Bison-specific features.
The following combinations of options are available to generate scanners for Bison:
Options | Method | Global functions and variables |
---|---|---|
int Lexer::lex() | no global variables, but doesn't work with Bison | |
−−flex | int yyFlexLexer::yylex() | no global variables, but doesn't work with Bison |
−−bison | int Lexer::lex() | Lexer YY_SCANNER , int yylex() , YYSTYPE yylval |
−−flex −−bison | int yyFlexLexer::yylex() | yyFlexLexer YY_SCANNER , int yylex() , YYSTYPE yylval , char *yytext , yy_size_t yyleng , int yylineno |
−−bison −−reentrant | int Lexer::lex() | int yylex(yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−flex −−bison −−reentrant | int yyFlexLexer::lex() | int yylex(yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−bison-locations | int Lexer::lex(YYSTYPE& yylval) | Lexer YY_SCANNER , int yylex(YYSTYPE *yylval, YYLTYPE *yylloc) |
−−flex −−bison-locations | int yyFlexLexer::yylex(YYSTYPE& yylval) | yyFlexLexer YY_SCANNER , int yylex(YYSTYPE *yylval, YYLTYPE *yylloc) |
−−bison-bridge | int Lexer::lex(YYSTYPE& yylval) | int yylex(YYSTYPE *yylval, yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−flex −−bison-bridge | int yyFlexLexer::yylex(YYSTYPE& yylval) | int yylex(YYSTYPE *yylval, yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−bison-bridge −−bison-locations | int Lexer::lex(YYSTYPE& yylval) | int yylex(YYSTYPE *yylval, YYLTYPE *yylloc, yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−flex −−bison-bridge −−bison-locations | int yyFlexLexer::yylex(YYSTYPE& yylval) | int yylex(YYSTYPE *yylval, YYLTYPE *yylloc, yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−bison-cc | int Lexer::yylex(YYSTYPE *yylval) | no global variables |
−−flex −−bison-cc | int yyFlexLexer::yylex(YYSTYPE *yylval) | no global variables |
−−bison-cc −−bison-locations | int Lexer::yylex(YYSTYPE *yylval, YYLTYPE *yylloc) | no global variables |
−−flex −−bison-cc −−bison-locations | int yyFlexLexer::yylex(YYSTYPE *yylval, YYLTYPE *yylloc) | no global variables |
−−bison-complete | PARSER::symbol_type Lexer::yylex() | no global variables |
−−flex −−bison-complete | PARSER::symbol_type yyFlexLexer::yylex() | no global variables |
−−bison-complete −−bison-locations | PARSER::symbol_type Lexer::yylex() | no global variables |
−−flex −−bison-complete −−bison-locations | PARSER::symbol_type yyFlexLexer::yylex() | no global variables |
Option −−prefix
may be used with option −−flex
to change the prefix of the generated yyFlexLexer
and yylex
. This option may be combined with option −−bison
to also change the prefix of the generated yytext
, yyleng
, and yylineno
.
Furthermore, reflex
options −−namespace=NAME
, −−lexer=LEXER
and −−lex=LEX
can be used to add a C++ namespace, to rename the lexer class (Lexer
or yyFlexLexer
by default) and to rename the lexer function (lex
or yylex
by default), respectively.
For option −−bison-complete
the lexer function return type is the parser's symbol_type
as defined in the Bison grammar specification. The parser class is specified with option −−bison-cc-parser=PARSER
and an optional namespace may be specified with −−bison-cc-namespace=NAME
. The lexer function return type may also be explicitly specified with option −−token-type=TYPE
.
The following sections explain the −−bison-cc
, −−bison-complete
, −−bison-bridge
, −−bison-locations
, and −−reentrant
options for reflex
.
Additional parameters may be passed to lex()
and yylex()
by declaring %option params="extra parameters"
in the lexer specification. See YY_DECL alternatives.
The reflex
option −−bison-cc
expects a Bison 3.0 %skeleton "lalr1.cc"
C++ parser that is declared as follows in a Bison grammar file:
With the −−bison-cc
option of reflex
, the yylex()
function takes a yy::parser::semantic_type yylval
argument that makes the yylval
visible in the lexer rules to assign semantic values to.
The scanner is generated with reflex
options −−bison-cc
, −−namespace=yy
and −−lexer=Lexer
. The lexer specification should #include
the Bison-generated header file to ensure that the yy::parser::token
enums CONST_NUMBER
and CONST_STRING
are defined.
Using the code above, we can now initialize a Bison parser. We first should create a scanner and pass it to the parser
constructor as follows:
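A sketch of a main program, assuming the grammar declares a %parse-param that passes the lexer to the parser:

```cpp
#include <iostream>
// plus the headers generated for the scanner and the parser

int main()
{
  yy::Lexer lexer(std::cin);   // the scanner reads from standard input
  yy::parser parser(lexer);    // assumes: %parse-param { yy::Lexer& lexer } in the grammar
  return parser.parse();
}
```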
We use options −−bison-cc-namespace=NAME
and −−bison-cc-parser=NAME
to specify the namespace and parser class name of the Bison 3.0 %skeleton "lalr1.cc"
C++ parser you are generating with Bison. These are yy
and parser
by default, respectively. For option −−bison-cc-namespace=NAME
the NAME
can be a list of nested namespaces of the form NAME1::NAME2::NAME3
or by separating the names by a dot as in NAME1.NAME2.NAME3
.
The reflex
option −−bison-cc
with −−bison-locations
expects a Bison 3.0 %skeleton "lalr1.cc"
C++ parser that is declared as follows in a Bison grammar file:
With the −−bison-cc
and −−bison-locations
options of reflex
, the yylex()
function takes yy::parser::semantic_type yylval
as the first argument that makes the yylval
visible in the lexer rules to assign semantic values to. The second argument yy::location yylloc
is set automatically by invoking the lexer's yylloc_update()
in yylex()
to update the line and column of the match. The auto-generated virtual yylloc_update()
method can be overridden by a user-defined lexer class that extends Lexer
(or extends yyFlexLexer
when option −−flex
is specified).
The scanner is generated with reflex
options −−bison-cc
, −−bison-locations
, −−namespace=yy
and −−lexer=Lexer
. The lexer specification should #include
the Bison-generated header file to ensure that the yy::parser::token
enums CONST_NUMBER
and CONST_STRING
are defined.
Using the code above, we can now initialize a Bison parser. We first should create a scanner and pass it to the parser
constructor as follows:
The reflex
option −−bison-complete
expects a Bison 3.2 C++ parser which uses both %define api.value.type variant
and %define api.token.constructor
. This parser defines the type symbol_type
variant and the parser expects yylex
to have the type yy::parser::symbol_type yylex()
. Here is an example Bison 3.2 C++ complete symbols grammar file:
With the −−bison-complete
option of reflex
, the yylex()
function takes no arguments by default and returns a value of type yy::parser::symbol_type
. This means that the lexer's action should return values of this type, constructed with yy::parser::symbol_type
or with make_TOKENNAME
as follows:
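For instance (a sketch; the token names and their value types are assumptions based on the surrounding text):

```cpp
[0-9]+        { return yy::parser::make_CONST_NUMBER(strtol(text(), NULL, 10)); }
\"[^"\n]*\"   { return yy::parser::make_CONST_STRING(str()); }
<<EOF>>       { return yy::parser::make_EOF(); }
```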
The scanner is generated with reflex
options −−bison-complete
, −−namespace=yy
and −−lexer=Lexer
. Option −−bison-complete
automatically defines the appropriate token type symbol_type
depending on −−bison-cc-namespace
and on −−bison-cc-parser
. We also used options −−bison-cc-namespace=NAME
and −−bison-cc-parser=NAME
to specify the namespace and parser class name of the Bison 3.2 C++ parser. These are yy
and parser
by default, respectively (%define api.namespace {yy}
and %define api.parser.class {parser}
are actually superfluous in the example grammar specification because their values are the defaults). We use option −−exception
to specify that the scanner's default rule should throw a yy::parser::syntax_error("Unknown token.")
. This exception is caught by the parser which calls yy::parser::error
with the string "Unknown token."
as argument.
We have to be careful with option −−exception
. Because no input is consumed, the scanner should not be invoked again or we risk looping on the unmatched input. Alternatively, we can define a "catch all else" rule with pattern .
that consumes the offending input:
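For instance, a rule along these lines (a sketch):

```cpp
.    { throw yy::parser::syntax_error("Unknown token."); }
```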
For option −−bison-cc-namespace=NAME
the NAME
may be a list of nested namespaces of the form NAME1::NAME2::NAME3
or by separating the names by a dot as in NAME1.NAME2.NAME3
.
Using the code above, we can now initialize a Bison parser in our main program. We first should create a scanner and pass it to the parser
constructor as follows:
Note that when the end of input is reached, the lexer returns yy::parser::make_EOF()
upon matching <<EOF>>
. This rule is optional. When omitted, the return value is yy::parser::symbol_type(0)
.
The reflex
option −−bison-complete
expects a Bison 3.2 C++ parser which uses both %define api.value.type variant
and %define api.token.constructor
. This parser defines the type symbol_type
variant and the parser expects yylex
to have the type yy::parser::symbol_type yylex()
. Here is an example Bison 3.2 C++ complete symbols grammar file with Bison %locations
enabled:
With the −−bison-complete
option of reflex
, the yylex()
function takes no arguments by default and returns a value of type yy::parser::symbol_type
. This means that the lexer's action should return values of this type, constructed with yy::parser::symbol_type
or with make_TOKENNAME
as follows:
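For instance, with locations passed to the token constructors (a sketch; the token names and value types are assumptions based on the surrounding text):

```cpp
[0-9]+        { return yy::parser::make_CONST_NUMBER(strtol(text(), NULL, 10), location()); }
\"[^"\n]*\"   { return yy::parser::make_CONST_STRING(str(), location()); }
<<EOF>>       { return yy::parser::make_EOF(location()); }
```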
The scanner is generated with reflex
options −−bison-complete
, −−bison-locations
, −−namespace=yy
and −−lexer=Lexer
. Option −−bison-complete
automatically defines the appropriate token type symbol_type
depending on −−bison-cc-namespace
and on −−bison-cc-parser
. We also used options −−bison-cc-namespace=NAME
and −−bison-cc-parser=NAME
to specify the namespace and parser class name of the Bison 3.2 C++ parser. These are yy
and parser
by default, respectively (i.e. %define api.namespace {yy} and %define api.parser.class {parser} are actually superfluous in the example grammar specification because their values are the defaults). We use option −−exception
to specify that the scanner's default rule should throw a yy::parser::syntax_error(location(), "Unknown token.")
. This exception is caught by the parser which calls yy::parser::error
with the value of location()
and the string "Unknown token."
as arguments. The auto-generated virtual lexer class method location() may be overridden by a user-defined lexer class that extends Lexer
(or extends yyFlexLexer
when option −−flex
is specified).
We have to be careful with option −−exception
. Because no input is consumed, the scanner should not be invoked again or we risk looping on the unmatched input. Alternatively, we can define a "catch all else" rule with pattern .
that consumes the offending input:
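For instance, a rule along these lines (a sketch):

```cpp
.    { throw yy::parser::syntax_error(location(), "Unknown token."); }
```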
For option −−bison-cc-namespace=NAME
the NAME
may be a list of nested namespaces of the form NAME1::NAME2::NAME3
or by separating the names by a dot as in NAME1.NAME2.NAME3
.
Using the code above, we can now initialize a Bison parser in our main program. We first should create a scanner and pass it to the parser
constructor as follows:
Note that when the end of input is reached, the lexer returns yy::parser::make_EOF()
upon matching <<EOF>>
. This rule is optional. When omitted, the return value is yy::parser::symbol_type(0, location())
.
The reflex
option −−bison-bridge
expects a Bison "pure parser" that is declared as follows in a Bison grammar file:
%pure-parser
is deprecated and replaced with %define api.pure
.
With the −−bison-bridge
option of reflex
, the yyscan_t
argument type of yylex()
is a void*
type that passes the scanner object to this global function (as defined by YYPARSE_PARAM
and YYLEX_PARAM
). The function then invokes this scanner's lex function. This option also passes the yylval
value to the lex function, which is a reference to an YYSTYPE
value.
With the −−bison-bridge
option two additional functions are generated that should be used to create a new scanner and delete the scanner in your program:
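A sketch of their use, following the function signatures listed in the options table above:

```cpp
yyscan_t scanner = NULL;
yylex_init(&scanner);      // create the scanner object
// ... call yyparse(scanner), or yylex(&yylval, scanner) directly ...
yylex_destroy(scanner);    // delete the scanner object
```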
The option −−bison-locations
expects a Bison parser with the locations feature enabled. This feature provides line and column numbers of the matched text for error reporting. For example:
The yylval
value is passed to the lex function. The yylloc
structure is automatically updated by the RE/flex scanner, so you do not need to define a YY_USER_ACTION
macro as you have to with Flex. Instead, this is done automatically in yylex()
by invoking the lexer's yylloc_update()
to update the line and column of the match. The auto-generated virtual yylloc_update()
method may be overridden by a user-defined lexer class that extends Lexer
(or extends yyFlexLexer
when option −−flex
is specified).
Note that with the −−bison-location
option, yylex()
takes an additional YYLTYPE
argument that a Bison parser provides. You can set YYLTYPE
as an option %option YYLTYPE=type
in a lexer specification.
Here is a final example that combines options −−bison-locations
and −−bison-bridge
. The Bison parser should be a Bison pure-parser with locations enabled:
%pure-parser
is deprecated and replaced with %define api.pure.
When %locations with %define api.pure full is used, yyerror has the signature void yyerror(YYLTYPE *locp, char const *msg). This function signature is required to obtain the location information with Bison pure-parsers.
yylval is not a pointer but is passed by reference and should be used as such in the scanner's rules.
Because YYSTYPE is declared by the parser, do not forget to add a #include "y.tab.h" to the top of the specification of your lexer.
With the −−bison-bridge and −−bison-locations options two additional functions are generated that should be used to create a new scanner and delete the scanner in your program:
Option -R
or −−reentrant
may be used to generate a reentrant scanner that is compatible with reentrant Flex and Bison. This is mainly useful when you combine −−reentrant
with −−flex
and −−bison
. See also Interfacing with Bison/Yacc .
When using Bison with reentrant scanners, your code should create a yyscan_t
scanner object with yylex_init(&scanner)
and destroy it with yylex_destroy(scanner)
. Reentrant Flex functions take the scanner object as an extra last argument, for example yylex(scanner)
:
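A sketch of a driver for a reentrant scanner:

```cpp
yyscan_t scanner;
yylex_init(&scanner);
while (yylex(scanner) != 0)   // the scanner object is passed as the last argument
  continue;
yylex_destroy(scanner);
```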
Within a rules section we refer to the scanner with macro yyscanner
, for example:
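For instance, a rule along these lines (a sketch, assuming <cstdio> is included in the specification):

```cpp
[0-9]+    { printf("number %s at line %d\n", yyget_text(yyscanner), yyget_lineno(yyscanner)); }
```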
The following functions are available in a reentrant Flex scanner generated with options −−flex
and −−reentrant
. These functions take an extra argument yyscan_t s
that is either yyscanner
when the function is used in a rule or in the scope of a lexer method, or is a pointer to the lexer object when the function is used outside the scope of a lexer method:
Reentrant Flex action | Result |
---|---|
yyget_text(s) | 0-terminated text match |
yyget_leng(s) | size of the match in bytes |
yyget_lineno(s) | line number of the match (>=1) |
yyset_lineno(n, s) | set line number of the match to n |
yyset_columno(n, s) | set column number of the match to n |
yyget_in(s) | get reflex::Input object |
yyset_in(i, s) | set reflex::Input object |
yyget_out(s) | get std::ostream object |
yyset_out(o, s) | set output to std::ostream o |
yyget_debug(s) | reflex option -d sets n=1 |
yyset_debug(n, s) | reflex option -d sets n=1 |
yyget_extra(s) | get user-defined extra parameter |
yyset_extra(x, s) | set user-defined extra parameter |
yyget_current_buffer(s) | the current matcher |
yyrestart(i, s) | set input to reflex::Input i |
yyinput(s) | get next 8-bit char from input |
yyunput(c, s) | put back 8-bit char c |
yyoutput(c, s) | output char c |
yy_create_buffer(i, n, s) | new matcher reflex::Input i |
yy_delete_buffer(m, s) | delete matcher m |
yypush_buffer_state(m, s) | push current matcher, use m |
yypop_buffer_state(s) | pop matcher and delete current |
yy_scan_string(s) | scan string s |
yy_scan_wstring(s) | scan wide string s |
yy_scan_bytes(b, n) | scan n bytes at b (buffered) |
yy_scan_buffer(b, n) | scan n -1 bytes at b (zero copy) |
yy_push_state(n, s) | push current state, go to state n |
yy_pop_state(s) | pop state and make it current |
yy_top_state(s) | get top state start condition |
With respect to the yyget_extra
functions, a scanner object has a YY_EXTRA_TYPE yyextra
value that is user-definable. You can define the type in a lexer specification with the extra-type
option:
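For instance, where struct MyExtra is a hypothetical user-defined type:

```
%option reentrant flex bison
%option extra-type="struct MyExtra*"
```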
This mechanism is somewhat crude, as it originates with the C legacy of Flex for adding extra user-defined values to a scanner class. Because reflex
is C++, it is recommended to define a derived class that extends the Lexer
or FlexLexer
class, see Inheriting Lexer/yyFlexLexer .
Because scanners are C++ classes, the yyscanner
macro is essentially the same is the this
pointer. Outside the scope of lexer methods a pointer to your yyFlexLexer lexer
object should be used instead.
The Flex macro YY_DECL
is not supported by RE/flex. The YY_DECL
macro is typically used with Flex to (re)declare the yylex()
function signature, for example to take an additional yylval
parameter that must be passed through from yyparse()
to yylex()
. Note that the standard cases are already covered by RE/flex using options such as −−bison-cc
, −−bison-bridge
and −−bison-locations
that pass additional parameters to the scanner function invoked by the Bison parser.
There are two better alternatives to YY_DECL
:
%option params="TYPE NAME, ..."
can be defined in the lexer specification to pass additional parameters to the lexer function lex()
and yylex()
. One or more corresponding Bison %param declarations should be included in the grammar specification, to pass the extra parameters via yyparse()
to lex()
/yylex()
. Also yyerror()
is expected to take the extra parameters.%class{ }
can be extended with member declarations to hold state information, such as token-related values. These values can then be exchanged with the parser by accessing them within the parser.The first alternative matches the YY_DECL
use cases. The comma-separated list of additional parameters "TYPE NAME, ..."
are added to the end of the parameter list of the generated lex()
/yylex()
function. The second alternative uses a stateful approach. The values are stored in the scanner object and can be made accessible beyond the scopes of the scanners's rule actions and the parser's semantic actions.
See also The Lexer/yyFlexLexer class .
RE/flex generates an efficient search engine with option -S
(or −−find
). The generated search engine finds all matches while ignoring unmatched input silently, which is different from scanning that matches all input piece-wise.
Searching with this option is more efficient than scanning with a "catch all else" dot-rule to ignore unmatched input, as in:
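For instance, a sketch of the scanning approach that the search option replaces:

```
%%
"TODO"    { out() << "TODO found at line " << lineno() << std::endl; }
.|\n      { /* catch all else: ignore unmatched input (invoked for every unmatched character) */ }
%%
```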
The problem with this rule is that it is invoked for every single unmatched character in the input, which is inefficient and slows down searching for matching patterns significantly when more than a few unmatched characters are encountered in the input. Note that we cannot use .+
to match longer patterns because this overlaps with other patterns and is also likely longer than the other patterns, i.e. the rule subsumes those patterns.
Unless the input contains relatively few unmatched characters or bytes to ignore, option -S
(or −−find
) speeds up searching and matching significantly. This option applies the following optimizations to the RE/flex FSM matcher:
memchr()
. Single long strings are searched with Boyer-Moore-Horspool. Also regex patterns with common prefixes are searched efficiently, e.g. the regex reflex|regex|regular
has common prefix string "re"
that is searched in the input first, then hashing is used to predict a match for the part after "re"
, followed by regex matching with the FSM.
With option -S
(or −−find
), a "catch all else" dot-rule should not be defined, since unmatched input is already ignored with this option and defining a "catch all else" dot-rule actually slows down the search.
With option -S
(or −−find
), option -s
(or −−nodefault
) cannot be used to ignore unmatched input. Option -s
produces runtime errors and exceptions for unmatched input.
This option only applies to the RE/flex matcher and can be combined with options -f
(or −−full
) and -F
(or −−fast
) to further increase performance.
The reflex
scanner generator gives you a choice of matchers to use in the generated scanner, where the default is the POSIX RE/flex matcher engine. Other options are the Boost.Regex matcher in POSIX mode or in Perl mode.
To use a matcher for the generated scanner, use one of these three choices:
Option | Matcher class used | Mode | Engine |
---|---|---|---|
-m reflex | Matcher | POSIX | RE/flex lib (default choice) |
-m pcre2-perl | PCRE2Matcher | Perl | PCRE2 |
-m boost | BoostPosixMatcher | POSIX | Boost.Regex |
-m boost-perl | BoostPerlMatcher | Perl | Boost.Regex |
The POSIX matchers look for the longest possible match among the given set of alternative patterns. Perl matchers look for the first match among the given set of alternative patterns.
POSIX is generally preferred for scanners, since it is easier to arrange rules that may have partially overlapping patterns. Since we are looking for the longest match anyway, it does not matter which rule comes first. The order does not matter as long as the lengths of the matches differ. When matches are of the same length because multiple patterns match, then the first rule is selected.
Consider for example the following lexer.l
specification of a lexer with rules that are intended to match keywords and identifiers in some input text:
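The specification is not reproduced verbatim here; a sketch along the following lines (the output statements are assumptions) illustrates the rules:

    %%
    int         out() << "int keyword\n";
    interface   out() << "interface keyword\n";
    [a-z]+      out() << "identifier\n";
    %%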
When the input to the scanner is the text integer
, a POSIX matcher selects the last rule that matches it by leftmost longest matching policy. This matching policy selects the rule that matches the longest text. If more than one pattern matches the same length of text then the first pattern that matches takes precedence. This is what we want because it is an identifier in our example programming language:
reflex -m reflex −−main lexer.l
c++ -o lexer lex.yy.cpp -lreflex
echo "integer" | ./lexer
=> identifier
By contrast, a Perl matcher uses a greedy matching policy, which selects the first rule that matches. In this case it matches the first part int
of the text integer
and leaves erface
to be matched next as an identifier:
reflex -m boost-perl −−main lexer.l
c++ -o lexer lex.yy.cpp -lreflex -lboost_regex
echo "integer" | ./lexer
=> int keyword
=> identifier
Note that the same greedy matching happens when the text interface
is encountered on the input, which we want to recognize as a separate keyword and not match against int
:
reflex -m boost-perl −−main lexer.l
c++ -o lexer lex.yy.cpp -lreflex -lboost_regex
echo "interface" | ./lexer
=> int keyword
=> identifier
Switching the rules for int
and interface
fixes that specific problem.
reflex -m boost-perl −−main lexer.l
c++ -o lexer lex.yy.cpp -lreflex -lboost_regex
echo "interface" | ./lexer
=> interface keyword
But we cannot do the same to fix matching integer
as an identifier: when moving the last rule up to the top we cannot match int
and interface
any longer!
reflex -m boost-perl −−main lexer.l
c++ -o lexer lex.yy.cpp -lreflex -lboost_regex
echo "int" | ./lexer
=> identifier
echo "interface" | ./lexer
=> identifier
Basically, a Perl matcher works in an operational mode, treating the regex pattern as a sequence of matching operations and usually using backtracking to find a match.
Perl matchers generally support lazy quantifiers and group captures, while most POSIX matchers do not (e.g. Boost.Regex in POSIX mode does not support lazy quantifiers). The RE/flex POSIX matcher supports lazy quantifiers, but not group captures. The added support for lazy quantifiers and word boundary anchors in RE/flex matching offers a reasonably new and useful feature for scanners that require POSIX mode matching.
To prevent a Perl matcher from matching a keyword when an identifier starts with the name of that keyword, we could use a lookahead pattern such as int(?=[^A-Za-z0-9_])
which is written in a lexer specification with a trailing context int/[^A-Za-z0-9_]
with the /
lookahead meta symbol.
A POSIX matcher on the other hand is declarative with a deeper foundation in formal language theory. An advantage of POSIX matchers is that a regular expression can always be compiled to a deterministic finite state machine for efficient matching.
POSIX matching still requires the int
matching rule before the identifier matching rule, as in the original lexer specification shown in this section. Otherwise an int
on the input will be matched by the identifier rule.
Lookaheads can also be used with POSIX matchers to prioritize rules. Adding a lookahead lengthens the pattern while keeping only the part that matches before the lookahead. For example, the following lexer specification attempts to remove leading 0
from numbers:
However, in POSIX mode the first rule only matches if the text is exactly one 0
because the second rule matches longer texts. The trick here is to use a trailing context with the first rule as follows:
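A sketch of this trick, where the first rule uses a trailing context so that leading zeros are consumed separately from the digits that follow (the output statement is an assumption):

    %%
    0+/[0-9]+   { /* skip leading zeros; the digits in the lookahead are matched next */ }
    [0-9]+      out() << text() << "\n";
    %%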
reflex -m reflex −−main lexer.l
c++ -o lexer lex.yy.cpp -lreflex
echo "00123" | ./lexer
=> 123
echo "0" | ./lexer
=> 0
There are several reflex
options to debug a lexer and analyze its performance given some input text to scan:
-d
(or −−debug
) generates a scanner that prints the matched text, which allows you to debug your patterns.
-p (or −−perf-report) generates a scanner that profiles the performance of your lexer and the lexer rules executed, which allows you to find hotspots and performance bottlenecks in your rules.
-s (or −−nodefault) suppresses the default rule that echoes all unmatched text when no rule matches. The scanner reports "scanner jammed" when no rule matches. Without the −−flex option, a std::runtime exception is thrown.
-v (or −−verbose) displays a summary of scanner statistics.
Option -d generates a scanner that prints the matched text while scanning input. The output displayed is of the form:
−−accepting rule at line NNN ("TEXT")
where NNN is the line number of the pattern in the lexer specification and TEXT is the matched text.
Option -p
generates a scanner that profiles the performance of your lexer. The performance report shows the performance statistics obtained for each pattern defined in the lexer specification, i.e. the number of matches per pattern, the total text length of the matches per pattern, and the total time spent matching and executing the rule corresponding to the pattern. The performance profile statistics are collected when the scanner runs on some given input. Profiling allows you to effectively fine-tune the performance of your lexer by focussing on patterns and rules that are frequently matched that turn out to be computationally expensive.
This is perhaps best illustrated with an example. The JSON parser json.l
located in the examples directory of the RE/flex download package was built with reflex option -p
and then run on some given JSON input to analyze its performance:
reflex 0.9.22 json.l performance report:
  INITIAL rules matched:
    rule at line 51 accepted 58 times matching 255 bytes total in 0.009 ms
    rule at line 52 accepted 58 times matching 58 bytes total in 0.824 ms
    rule at line 53 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 54 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 55 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 56 accepted 5 times matching 23 bytes total in 0.007 ms
    rule at line 57 accepted 38 times matching 38 bytes total in 0.006 ms
    rule at line 72 accepted 0 times matching 0 bytes total in 0 ms
    default rule accepted 0 times
  STRING rules matched:
    rule at line 60 accepted 38 times matching 38 bytes total in 0.021 ms
    rule at line 61 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 62 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 63 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 64 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 65 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 66 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 67 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 68 accepted 314 times matching 314 bytes total in 0.04 ms
    rule at line 69 accepted 8 times matching 25 bytes total in 0.006 ms
    rule at line 72 accepted 0 times matching 0 bytes total in 0 ms
    default rule accepted 0 times
WARNING: execution times are relative:
  1) includes caller's execution time between matches when yylex() returns
  2) perf-report instrumentation adds overhead and increases execution times
The timings shown include the time of the pattern match and the time of the code executed by the rule. If the rule returns to the caller, then the time spent by the caller is also included in this timing. For this example, we have two start condition states, namely INITIAL and STRING. The rule at line 52 is:
This returns a [ or ] bracket, a { or } brace, a comma, or a colon to the parser. Since the pattern and rule are very simple, we do not expect these to contribute much to the 0.824 ms time spent on this rule. The parser code executed when the scanner returns could be expensive. Depending on the character returned, the parser constructs a JSON array (bracket) or a JSON object (brace), and populates arrays and objects with an item each time a comma is matched. But which is most expensive? To obtain timings of these events separately, we can split the rule up into three similar rules:
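The exact rules in json.l are not reproduced here; a sketch of the three-way split of that rule, returning the matched character to the parser, could look like this (a fragment of the rules section):

    [][]        return text()[0];   // JSON array brackets
    [{}]        return text()[0];   // JSON object braces
    [,:]        return text()[0];   // commas and colons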
Then we use reflex option -p
, recompile the generated scanner lex.yy.cpp
and rerun our experiment to see these changes:
rule at line 52 accepted 2 times matching 2 bytes total in 0.001 ms
rule at line 53 accepted 14 times matching 14 bytes total in 0.798 ms
rule at line 54 accepted 42 times matching 42 bytes total in 0.011 ms
So it turns out that the construction of a JSON object by the parser is relatively speaking the most expensive part of our application, when { and } are encountered on the input. We should focus our optimization effort there if we want to improve the overall speed of our JSON parser.
Some lexer specification examples to generate scanners with RE/flex.
The following Flex specification counts the lines, words, and characters on the input. We use yyleng
, the match text length, to count 8-bit characters (bytes).
To build this example with RE/flex, use reflex
option −−flex
to generate Flex-compatible "yy" variables and functions. This generates a C++ scanner class yyFlexLexer
that is compatible with the Flex scanner class (assuming Flex with option -+
for C++).
To generate a scanner with a global yylex()
function similar to Flex in C mode (i.e. without Flex option -+
), use reflex
option −−bison
with the specification shown above. This option when combined with −−flex
produces the global "yy" functions and variables. This means that you can use RE/flex scanners with Bison (Yacc) and with any other C code, assuming everything is compiled together with a C++ compiler.
An improved implementation drops the use of global variables in favor of Lexer class member variables. We also want to count Unicode letters with the wd
counter instead of ASCII letters, which are single bytes while Unicode UTF-8 encodings vary in size. So we add the Unicode option and use \w
to match Unicode word characters. Note that .
(dot) matches a Unicode character, so a match may be more than one byte long even though it counts as a single character. We drop the −−flex
option and use RE/flex Lexer methods instead of the Flex "yy" functions:
This simple word count program differs slightly from the Unix wc utility, because the wc utility counts words delimited by wide character spaces (iswspace
) whereas this program counts words made up from word characters combined with punctuation.
The following RE/flex specification filters tags from XML documents and verifies whether or not the tags are properly balanced. Note that this example uses the lazy repetitions to keep the patterns simple. The XML document scanned should not include invalid XML characters such as /
, <
, or >
in attributes (otherwise the tags will not match properly). The dotall
option allows .
(dot) to match newline in all patterns, similar to the (?s)
modifier in regexes.
Note that we restrict XML tag names to valid characters, including all UTF-8 sequences that run in the range \x80
-\xFF
per 8-bit character. This matches all Unicode characters U+0080 to U+10FFFF.
The ATTRIBUTES
state is used to scan attributes and their quoted values separately from the INITIAL
state. The INITIAL
state permits quotes to freely occur in character data, whereas the ATTRIBUTES
state matches quoted attribute values.
We use matcher().less(size() - 1)
to remove the ending >
from the match in text()
. The >
will be matched again, this time by the <*>.
rule that ignores it. We could also have used a lookahead pattern "\</"{name}/"\>"
where X/Y
means look ahead for Y
after X
.
This example Flex specification scans non-Unicode C/C++ source code. It uses free space mode to enhance readability.
Free space mode permits spacing between concatenations and alternations. To match a single space, use " "
or [ ]
. Long patterns can continue on the next line(s) when a line ends with an escape \
.
In free space mode you MUST place actions in { }
blocks and user code in %{ %}
blocks instead of indented.
When used with option unicode
, the scanner automatically recognizes and scans Unicode identifier names. Note that we can use matcher().columno()
or matcher().border()
in the error message to indicate the location on a line of the match. The matcher().columno()
method takes tab spacing and wide characters into account. To obtain the byte offset from the start of the line use matcher().border()
. The matcher()
object associated with the Lexer offers several other methods that Flex does not support.
This example defines a search engine to find C/C++ directives, such as #define
and #include
, quickly in the input.
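A sketch of such a specification; the directive pattern below is an assumption and the example in the distribution may differ:

    %o find
    %o fast
    %o main
    %%
    ^\h*#\h*(define|include).*    echo();   // print each matching directive line
    %%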
Option %o find
(-S
or −−find
) specifies that unmatched input text should be ignored silently instead of being echoed to standard output, see Searching versus scanning . Option %fast
(-F
or −−fast
) generates an efficient FSM in direct code.
The RE/flex matcher engine uses an efficient FSM. There are known limitations to FSM matching that apply to Flex/Lex and therefore also apply to the reflex
scanner generator and to the RE/flex matcher engine:
A trailing context (lookahead) at the end of a pattern, such as a/b (which is the same as lookahead a(?=b)), is permitted, but (a/b)c and a(?=b)c are not.
A lookahead may not be matched properly when the end of the first part of the pattern overlaps with the beginning of the lookahead pattern, as in zx*/xy*, where the x* matches the x at the beginning of the lookahead pattern. This is a common limitation that Lex and Flex (with some ad-hoc exceptions) also have.
The REJECT action is not supported.
The Lex table size declarations %p, %n, %a, %e, %k, and %o are not supported; %o may also be used as a shorthand for %option.
%T declarations are not supported.
The word boundary anchors \<, \>, \b and \B are supported by RE/flex using backtracking (since RE/flex version 3.4.1), except that option −−fast does not produce code that backtracks, which means that patterns such as bar.*\bfoo that require backtracking on \b may not work properly. If necessary, use option −−full when word boundaries are used that require backtracking to find a match.
Some of these limitations may be removed in future versions of RE/flex.
Boost.Regex library limitations:
A lookbehind pattern φ(?<=ψ) used with the Boost.Regex matcher engines should not look too far behind. Any input before the current line may be discarded and is no longer available when new input is shifted into the internal buffer. Only input on the current line from the start of the line to the match is guaranteed.
Matching may not work as expected when greedy repetition patterns such as .* are used. Under certain conditions greedy repetitions may behave as lazy repetitions. For example, the Boost.Regex engine may return the short match abc
when the regex a.*c
is applied to abc abc
, instead of returning the full match abc abc
. The problem is caused by the limitations of Boost.Regex match_partial
matching algorithm. To work around this limitation, we suggest to make the repetition pattern as specific as possible and not overlap with the pattern that follows the repetition. The easiest solution is to read the entire input using reflex
option -B
(batch input). For a stand-alone BoostMatcher
, use the buffer()
method. We consider this Boost.Regex partial match behavior a bug, not a restriction, because as long as backtracking on a repetition pattern is possible given some partial text, Boost.Regex should flag the result as a partial match instead of a full match.
PCRE2 library limitations:
A lookbehind pattern φ(?<=ψ) used with the PCRE2 matcher engines should not look too far behind. Any input before the current line may be discarded and is no longer available when new input is shifted into the internal buffer. Only input on the current line from the start of the line to the match is guaranteed.
The RE/flex regex library consists of a set of C++ templates and classes that encapsulate regex engines in a standard API for scanning, tokenizing, searching, and splitting of strings, wide strings, files, and streams.
The RE/flex regex library is a class hierarchy that has at the root an abstract class reflex::AbstractMatcher
. Pattern types may differ between matchers, so the reflex::PatternMatcher
template class takes a pattern type and creates a class that is complete except for the implementation of the reflex::match()
virtual method that requires a regex engine, such as PCRE2, Boost.Regex, or the RE/flex engine.
To compile your application, simply include the applicable regex matcher of your choice in your source code as we will explain in the next sections. To compile, link your application against the libreflex
library:
c++ myapp.cpp -lreflex
And optionally -lpcre2-8
if you want to use PCRE2 for searching and matching:
c++ myapp.cpp -lreflex -lpcre2-8
or -lboost_regex
if you want to use Boost.Regex for searching and matching:
c++ myapp.cpp -lreflex -lboost_regex
If libreflex
was not installed then linking with -lreflex
fails. See Undefined symbols and link errors on how to resolve this.
The reflex::PCRE2Matcher
inherits reflex::PatternMatcher<std::string>
. The reflex::PCRE2UTFMatcher
is derived from reflex::PCRE2Matcher
:
An instance of reflex::PCRE2Matcher
is initialized with a pattern that is compiled with pcre2_compile()
and pcre2_jit_compile()
for optimal performance with PCRE2 JIT-generated code.
An instance of reflex::PCRE2UTFMatcher
creates a PCRE2 matcher with native Unicode support, using PCRE2 options PCRE2_UTF+PCRE2_UCP
.
PCRE2 is a powerful library. The RE/flex regex API enhances this library with operations to match, search, scan, and split data from a given input. The input may be a file, a string, or a stream. Files that are UTF-8/16/32-encoded are automatically decoded. Further, streams can be of potentially unlimited length because internal buffering is used by the RE/flex regex API enhancements to efficiently apply PCRE2 partial pattern matching to streaming data. This enhancement permits pattern matching of interactive input from the console, such that searching and scanning interactive input for matches will return these matches immediately.
A reflex::PCRE2Matcher
(or reflex::PCRE2UTFMatcher
) engine is created from a string regex and some given input:
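For example, a minimal sketch that searches a string for words (the header name follows the usual libreflex layout; pattern and input are assumptions):

    #include <reflex/pcre2matcher.h>
    #include <iostream>

    int main()
    {
      // construct a PCRE2-based matcher from a string regex and a string input
      reflex::PCRE2Matcher matcher("\\w+", "How now brown cow.");
      while (matcher.find())
        std::cout << matcher.text() << std::endl;
    }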
Likewise, a reflex::PCRE2UTFMatcher
engine is created from a string regex and some given input:
This matcher uses PCRE2 native Unicode matching. Non-UTF input, such as plain binary, is not supported. UTF encoding errors in the input will cause the matcher to terminate.
For input you can specify a string, a wide string, a file, or a stream object.
We use option "N"
to permit empty matches when searching input with reflex::PCRE2Matcher::find
.
You can convert an expressive regex of the form defined in Patterns to a regex that the PCRE2 engine can handle:
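A sketch of such a conversion, using the reflex::convert_flag::unicode flag (the pattern and input are assumptions):

    #include <reflex/pcre2matcher.h>
    #include <iostream>
    #include <string>

    int main()
    {
      // convert an expressive Unicode pattern to a regex string PCRE2 can handle
      static const std::string regex =
        reflex::PCRE2Matcher::convert("\\p{Greek}+", reflex::convert_flag::unicode);
      reflex::PCRE2Matcher matcher(regex, "αβγ");
      if (matcher.matches())
        std::cout << "input is Greek" << std::endl;
    }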
The converter is specific to the matcher selected, i.e. reflex::PCRE2Matcher::convert
and reflex::PCRE2UTFMatcher::convert
. The former converter converts Unicode \p
character classes to UTF-8 patterns, converts bracket character classes containing Unicode, and groups UTF-8 multi-byte sequences in the regex string. The latter converter does not convert these regex constructs, which are matched by the PCRE2 engine initialized with options PCRE2_UTF+PCRE2_UCP
.
The converter throws a reflex::regex_error
exception if conversion fails, for example when the regex syntax is invalid.
To compile your application, link your application against the libreflex
library and -lpcre2-8
:
c++ myapp.cpp -lreflex -lpcre2-8
See Patterns for more details on regex patterns.
See The Input class for more details on the reflex::Input
class.
See Methods and iterators for more details on pattern matching methods.
See Regex converters for more details on regex converters.
The reflex::BoostMatcher
inherits reflex::PatternMatcher<boost::regex>
, and in turn the reflex::BoostPerlMatcher
and reflex::BoostPosixMatcher
are both derived from reflex::BoostMatcher
:
An instance of reflex::BoostPerlMatcher
is initialized with flag match_perl
and the flag match_not_dot_newline
, these are boost::regex_constants
flags. These flags are the only difference with the plain reflex::BoostMatcher
.
An instance of reflex::BoostPosixMatcher
creates a POSIX matcher. This means that lazy quantifiers are not supported and the leftmost longest rule applies to pattern matching. This instance is initialized with the flags match_posix
and match_not_dot_newline
.
Boost.Regex is a powerful library. The RE/flex regex API enhances this library with operations to match, search, scan, and split data from a given input. The input may be a file, a string, or a stream. Files that are UTF-8/16/32-encoded are automatically decoded. Further, streams can be potentially of unlimited length because internal buffering is used by the RE/flex regex API enhancements to efficiently apply Boost.Regex partial pattern matching to streaming data. This enhancement permits pattern matching of interactive input from the console, such that searching and scanning interactive input for matches will return these matches immediately.
reflex::BoostMatcher
extends the capabilities of Boost.Regex, which does not natively support streaming input:
The match_partial flag and boost::regex_iterator support incremental matching, which can be used to match on (possibly infinite) streams. To use this method correctly, a buffer should be created that is large enough to hold the text for each match. The buffer must grow with the size of the matched text, to ensure that long matches that do not fit the buffer are not discarded.
Boost.IOStreams with regex_filter loads the entire stream into memory, which does not permit pattern matching of streaming and interactive input data.
A reflex::BoostMatcher
(or reflex::BoostPerlMatcher
) engine is created from a boost::regex
object, or string regex, and some given input for normal (Perl mode) matching:
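For example, a minimal sketch (the header name follows the usual libreflex layout; pattern and input are assumptions):

    #include <reflex/boostmatcher.h>
    #include <iostream>

    int main()
    {
      // construct a Boost.Regex-based Perl mode matcher to search a string for words
      reflex::BoostMatcher matcher("\\w+", "How now brown cow.");
      while (matcher.find())
        std::cout << matcher.text() << std::endl;
    }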
Likewise, a reflex::BoostPosixMatcher
engine is created from a boost::regex
object, or string regex, and some given input for POSIX mode matching:
For input you can specify a string, a wide string, a file, or a stream object.
We use option "N"
to permit empty matches when searching input with reflex::BoostMatcher::find
.
You can convert an expressive regex of the form defined in Patterns to a regex that the boost::regex engine can handle:
The converter is specific to the matcher selected, i.e. reflex::BoostMatcher::convert
, reflex::BoostPerlMatcher::convert
, and reflex::BoostPosixMatcher::convert
. The converters also translates Unicode \p
character classes to UTF-8 patterns, converts bracket character classes containing Unicode, and groups UTF-8 multi-byte sequences in the regex string.
The converter throws a reflex::regex_error
exception if conversion fails, for example when the regex syntax is invalid.
To compile your application, link your application against the libreflex
library and -lboost_regex
(the exact library name may depend on your Boost installation):
c++ myapp.cpp -lreflex -lboost_regex
See Patterns for more details on regex patterns.
See The Input class for more details on the reflex::Input
class.
See Methods and iterators for more details on pattern matching methods.
See Regex converters for more details on regex converters.
The reflex::StdMatcher
class inherits reflex::PatternMatcher<std::regex>
as a base. The reflex::StdEcmaMatcher
and reflex::StdPosixMatcher
are derived classes from reflex::StdMatcher
:
An instance of reflex::StdEcmaMatcher
is initialized with regex syntax option std::regex::ECMAScript
. This is also the default std::regex syntax.
An instance of reflex::StdPosixMatcher
creates a POSIX AWK-based matcher, so lazy quantifiers are not supported and the leftmost longest rule applies to pattern matching. This instance is initialized with the regex syntax option std::regex::awk
.
The C++11 std::regex library does not support match_partial
that is needed to match patterns on real streams with an adaptive internal buffer that grows when longer matches are made when more input becomes available. Therefore all input is buffered with the C++11 std::regex class matchers.
With respect to performance, at the time of writing, std::regex matching is much slower than the other matchers, by a factor of 10 or more.
The std::regex syntax is more limited than PCRE2, Boost.Regex, and RE/flex. Also the matching behavior differs and cannot be controlled with mode modifiers:
.
(dot) matches anything except \0
(NUL);\177
is erroneously interpreted as a backreference, \0177
does not match;\x7f
is not supported in POSIX mode;\cX
is not supported in POSIX mode;\Q..\E
is not supported;(?imsux-imsux:φ)
;\A
, \z
, \<
and \>
anchors;\b
and \B
anchors in POSIX mode;(?:φ)
in POSIX mode;"N"
(nullable) may cause issues (std::regex match_not_null
appears not to work as documented);interactive()
is not supported.
To work around these limitations that the std::regex syntax imposes, you can convert an expressive regex of the form defined in section Patterns to a regex that the std::regex engine can handle:
The converter is specific to the matcher selected, i.e. reflex::StdMatcher::convert
, reflex::StdEcmaMatcher::convert
, and reflex::StdPosixMatcher::convert
. The converters also translates Unicode \p
character classes to UTF-8 patterns, converts bracket character classes containing Unicode, and groups UTF-8 multi-byte sequences in the regex string.
The converter throws a reflex::regex_error
exception if conversion fails, for example when the regex syntax is invalid.
To compile your application, link your application against the libreflex
library and enable std::regex
with -std=c++11
:
c++ -std=c++11 myapp.cpp -lreflex
See Patterns for more details on regex patterns.
See The Input class for more details on the reflex::Input
class.
See Methods and iterators for more details on pattern matching methods.
See Regex converters for more details on regex converters.
The RE/flex framework includes a POSIX regex matching library reflex::Matcher
that inherits the API from reflex::PatternMatcher<reflex::Pattern>
:
where the RE/flex reflex::Pattern
class represents a regex pattern. Patterns as regex texts are internally compiled into deterministic finite state machines by the reflex::Pattern
class. The machines are used by the reflex::Matcher
for fast matching of regex patterns on some given input. The reflex::Matcher
is faster than the PCRE2 and Boost.Regex matchers. The reflex::FuzzyMatcher
subclass is included and performs approximate pattern matching, see the FuzzyMatcher readme.
A reflex::Matcher
engine is constructed from a reflex::Pattern
object, or a string regex, and some given input:
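For example, a minimal sketch that compiles the pattern to a FSM once and then searches a string (pattern and input are assumptions):

    #include <reflex/matcher.h>
    #include <iostream>

    int main()
    {
      // the reflex::Pattern compiles the regex to a FSM; making it static avoids recompilation
      static const reflex::Pattern word_pattern("\\w+");
      reflex::Matcher matcher(word_pattern, "How now brown cow.");
      while (matcher.find())
        std::cout << matcher.text() << std::endl;
    }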
The regex is specified as a string or a reflex::Pattern
object, see The reflex::Pattern class below.
We use option "N"
to permit empty matches when searching input with reflex::Matcher::find
. Option "T=8"
sets the tab size to 8 for Indent/nodent/dedent matching. Option "W"
makes patterns match as words, i.e. a non-word Unicode character precedes and follows the pattern (only applies to reflex::Matcher
and reflex::FuzzyMatcher
).
For input you can specify a string, a wide string, a file, or a stream object.
A regex string with Unicode patterns can be converted for Unicode matching as follows:
The converter is specific to the matcher and translates Unicode \p
character classes to UTF-8 patterns, converts bracket character classes containing Unicode, and groups UTF-8 multi-byte sequences in the regex string.
To compile your application, link your application against the libreflex
library:
c++ myapp.cpp -lreflex
See Patterns for more details on regex patterns.
See The Input class for more details on the reflex::Input
class.
See Methods and iterators for more details on pattern matching methods.
See Regex converters for more details on regex converters.
The reflex::Pattern
class is used by the reflex::Matcher
for pattern matching. The reflex::Pattern
class converts a regex pattern to an efficient FSM and takes a regex string and options to construct the FSM internally. The pattern instance is passed to a reflex::Matcher
constructor:
It may also be used to replace a matcher's current pattern, see A flexible regex library .
To improve performance, it is recommended to create a static
instance of the pattern if the regex string is fixed. This avoids repeated FSM construction at run time.
The following options are combined in a string and passed to the reflex::Pattern
constructor:
Option | Effect |
---|---|
b | bracket lists are parsed without converting escapes |
e=c; | redefine the escape character |
f=file.cpp; | save finite state machine code to file.cpp |
f=file.gv; | save deterministic finite state machine to file.gv |
i | case-insensitive matching, same as (?i)X |
m | multiline mode, same as (?m)X |
n=name; | use reflex_code_name for the machine (instead of FSM) |
q | Flex/Lex-style quotations "..." equals \Q...\E |
r | throw regex syntax error exceptions, otherwise ignore errors |
s | dot matches all (aka. single line mode), same as (?s)X |
x | inline comments, same as (?x)X |
w | display regex syntax errors before raising them as exceptions |
The compilation of a reflex::Pattern
object into a FSM may throw an exception with option "r"
when the regex string has problems:
By default, the reflex::Pattern
constructor solely throws the reflex::regex_error::exceeds_length
and reflex::regex_error::exceeds_limits
exceptions and silently ignores syntax errors.
Likewise, the reflex::Matcher::convert
, reflex::PCRE2Matcher::convert
, reflex::PCRE2UTFMatcher::convert
, reflex::BoostPerlMatcher::convert
, reflex::BoostMatcher::convert
, and reflex::BoostPosixMatcher::convert
functions may throw a reflex_error
exception. See the next section for details.
The reflex::Pattern
class has the following public methods:
Method | Result |
---|---|
assign(r,o) | (re)assign regex string r with string of options o |
assign(r) | (re)assign regex string r with default options |
=r | same as above |
size() | returns the number of top-level sub-patterns |
[0] | operator returns the regex string of the pattern |
[n] | operator returns the n th sub-pattern regex string |
reachable(n) | true if sub-pattern n is reachable in the FSM |
The assignment methods may throw exceptions, which are the same as the constructor may throw.
The reflex::Pattern::reachable
method verifies which top-level grouped alternations are reachable. This means that the sub-pattern of an alternation has a FSM accepting state that identifies the sub-pattern. For example:
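The example code is not reproduced verbatim; a sketch that produces the output shown below (the exact strings returned by operator[] are assumptions):

    #include <reflex/matcher.h>
    #include <iostream>

    int main()
    {
      reflex::Pattern pattern("(a+)|(a)");
      std::cout << "regex = " << pattern[0] << std::endl;
      // report top-level sub-patterns without a reachable accepting state in the FSM
      for (size_t n = 1; n <= pattern.size(); ++n)
        if (!pattern.reachable(n))
          std::cout << pattern[n] << " is not reachable" << std::endl;
    }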
When executed this code prints:
regex = (a+)|(a)
(a) is not reachable
For this example regex, (a)
is not reachable as the pattern is subsumed by (a+)
. The reflex::Matcher::accept
method will never return 2 when matching the input a
and always return 1, as per leftmost longest match policy. The same observation holds for the reflex::Matcher::matches
, reflex::Matcher::find
, reflex::Matcher::scan
, and reflex::Matcher::split
method and functors. Reversing the alternations resolves this: (a)|(a+)
.
reflex::Pattern
regex forms support capturing groups at the top-level only, i.e. among the top-level alternations.
To work around limitations of regex libraries and to support Unicode matching, RE/flex offers converters to translate expressive regex syntax forms (with Unicode patterns defined in section Patterns ) to regex strings that the selected regex engines can handle.
The converters translate \p
Unicode classes, translate character class set operations such as [a-z−−[aeiou]]
, convert escapes such as \X
, and enable/disable (?imsux-imsux:φ)
mode modifiers to a regex string that the underlying regex library understands and can use.
Each converter is specific to the regex engine. You can use a converter for the matcher of your choice:
std::string reflex::Matcher::convert(const std::string& regex, reflex::convert_flag_type flags)
converts an enhanced regex
for use with the RE/flex POSIX regex library;std::string reflex::PCRE2Matcher::convert(const std::string& regex, reflex::convert_flag_type flags)
converts an enhanced regex
for use with PCRE2;std::string reflex::PCRE2UTFMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)
converts an enhanced regex
for use with PCRE2 native Unicode matching;std::string reflex::BoostMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)
converts an enhanced regex
for use with Boost.Regex;std::string reflex::BoostPerlMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)
converts an enhanced regex
for use with Boost.Regex in Perl mode;std::string reflex::BoostPosixMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)
converts an enhanced regex
for use with Boost.Regex in POSIX mode;std::string reflex::StdMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)
converts an enhanced regex
for use with C++ std::regex;std::string reflex::StdEcmaMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)
converts an enhanced regex
for use with C++ std::regex in ECMA mode;std::string reflex::StdPosixMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)
converts an enhanced regex
for use with C++ std::regex in POSIX mode;
where flags
is optional. When specified, it may be a combination of the following reflex::convert_flag
flags:
Flag | Effect |
---|---|
reflex::convert_flag::none | no conversion |
reflex::convert_flag::basic | convert basic regular expression syntax (BRE) to extended regular expression syntax (ERE) |
reflex::convert_flag::unicode | . , \s , \w , \l , \u , \S , \W , \L , \U match Unicode, same as (?u) |
reflex::convert_flag::recap | remove capturing groups and add capturing groups to the top level |
reflex::convert_flag::lex | convert Flex/Lex regular expression syntax |
reflex::convert_flag::u4 | convert \uXXXX (shorthand for \u{XXXX} ), may conflict with \u (upper case letter). |
reflex::convert_flag::notnewline | character classes do not match newline \n , e.g. [^a-z] does not match \n |
reflex::convert_flag::permissive | when used with unicode , produces a more compact FSM that tolerates some invalid UTF-8 sequences |
reflex::convert_flag::closing | match a ) literally without the presence of an opening ( |
The following reflex::convert_flag
flags correspond to the common (?imsx)
modifiers. Either these flags or the corresponding modifiers may be specified, or both. Modifiers are removed from the converted regex if the regex library does not support them:
Flag | Effect |
---|---|
reflex::convert_flag::anycase | convert regex to ignore case, same as (?i) |
reflex::convert_flag::freespace | convert regex by removing all freespace-mode spacing, same as (?x) |
reflex::convert_flag::dotall | convert . (dot) to match all (match newline), same as (?s) |
reflex::convert_flag::multiline | adds/asserts if (?m) is supported for multiline anchors ^ and $ |
The following example enables Unicode matching by converting the regex pattern with the reflex::convert_flag::unicode
flag:
The following example enables dotall mode to count the number of characters (including newlines) in the given example
input:
Note that we could have used "\\X"
instead to match any wide character without using the (?su)
modifiers.
A converter throws a reflex::regex_error
exception if conversion fails, for example when the regex syntax is invalid:
The RE/flex abstract matcher, that every other RE/flex matcher inherits, provides four operations for matching with an instance of a regex engine:
Method | Result |
---|---|
matches() | returns nonzero if the input from begin to end matches |
find() | search input and return nonzero if a match was found |
scan() | scan input and return nonzero if input at current position matches |
split() | return nonzero for a split of the input at the next match |
These methods return a nonzero "accept" value for a match, meaning the size_t accept()
value that corresponds to a group capture (or one if no groups are used). The methods are repeatable, where the last three return additional matches.
The find
, scan
, and split
methods are also implemented as input iterators that apply filtering, tokenization, and splitting:
Iterator range | Acts as a | Iterates over |
---|---|---|
find.begin() ...find.end() | filter | all matches |
scan.begin() ...scan.end() | tokenizer | continuous matches |
split.begin() ...split.end() | splitter | text between matches |
The matches()
method returns a nonzero "accept" value (the size_t accept()
group capture index value or the value 1 if no groups are used) if the given input from begin to the end matches the specified pattern.
For example, to match a UUID string with PCRE2:
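A sketch that reproduces the output shown below; the UUID regex is an assumption:

    #include <reflex/pcre2matcher.h>
    #include <iostream>
    #include <string>

    int main()
    {
      const char *uuid = "123e4567-e89b-12d3-a456-426655440000";
      std::string regex = "[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}";
      // matches() succeeds only if the entire input matches the pattern
      if (reflex::PCRE2Matcher(regex, uuid).matches())
        std::cout << uuid << " is a string in UUID format" << std::endl;
    }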
When executed this code prints:
123e4567-e89b-12d3-a456-426655440000 is a string in UUID format
The matches()
method returns the group capture index that can be used as a selector. For example:
See also Properties of a match.
The find()
method and find.begin()
...find.end()
iterator range are used to search for a match in the given input.
The find()
method returns a nonzero "accept" value (the size_t accept()
group capture index value or the value 1 if no groups are used) for a match and zero otherwise.
For example, to find all words in a string with PCRE2 on UTF-8 input using reflex::PCRE2UTFMatcher
:
When executed this code prints:
How now brown cow
The iterator range find.begin()
...find.end()
serves as an input filter.
For example, in C++11 we can use a range-based loop to loop over matches using the find
iterator:
Iterators can be used with algorithms and other iterator functions. For example to count words in a string:
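A sketch that counts the matches with std::distance and prints the result shown below:

    #include <reflex/pcre2matcher.h>
    #include <iostream>
    #include <iterator>

    int main()
    {
      reflex::PCRE2Matcher matcher("\\w+", "How now brown cow.");
      // count the words by measuring the distance between the find iterators
      std::cout << std::distance(matcher.find.begin(), matcher.find.end()) << std::endl;
    }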
When executed this code prints:
4
The find()
method returns the group capture index that can be used as a selector.
See also Properties of a match.
The scan()
method and scan.begin()
...scan.end()
iterator range are similar to find
but generate continuous matches in the given input.
The scan()
method returns a nonzero "accept" value (the size_t accept()
group capture index value or the value 1 if no groups are used) for a match and zero otherwise.
For example, to scan for words, spacing, and punctuation in a sentence with PCRE2:
When executed this code prints:
word space word space word space word other
The iterator range scan.begin()
...scan.end()
serves as an input tokenizer and produces continuous matches.
For example, tokenizing a string into a vector of numeric tokens:
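A sketch that produces the token sequence shown below; the grouped pattern and the input string are assumptions chosen to reproduce it:

    #include <reflex/pcre2matcher.h>
    #include <iostream>
    #include <vector>

    int main()
    {
      // accept() returns the index of the matching group: 1 = word, 2 = space, 3 = other
      reflex::PCRE2Matcher matcher("(\\w+)|(\\s+)|(.)", "How now brown cow.");
      std::vector<size_t> tokens;
      for (auto& match : matcher.scan)
        tokens.push_back(match.accept());
      for (size_t token : tokens)
        std::cout << token << " ";
      std::cout << std::endl;
    }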
When executed the code prints:
1 2 1 2 1 2 1 3
If the pattern does not match the input immediately succeeding a previous match, then the scan()
method returns false and the iterator returns scan.end()
. To determine if all input was scanned and end of input was reached, use the at_end()
method, see Properties of a match.
See also Properties of a match.
The split()
method and split.begin()
...split.end()
iterator range return text between matches in the given input.
The split()
method returns a nonzero "accept" value (the size_t accept()
group capture index value or the value 1 if no groups are used) of the matching text (that is, the text that comes after the split part) and zero otherwise.
When matches are adjacent then empty text is returned. Also the start of input and end of input return text that may be empty.
For example, to split text into words by matching non-words with PCRE2:
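A sketch that reproduces the output shown below:

    #include <reflex/pcre2matcher.h>
    #include <iostream>

    int main()
    {
      // split at non-word matches; text() is the part of the input before each match
      reflex::PCRE2Matcher matcher("\\W+", "How now brown cow.");
      while (matcher.split())
        std::cout << "'" << matcher.text() << "' ";
      std::cout << std::endl;
    }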
When executed this code prints:
'How' 'now' 'brown' 'cow' ''
This produces five text splits where the last text is empty because the period at the end of the sentence matches the pattern and empty input remains.
The iterator range split.begin()
...split.end()
serves as an input splitter.
For example, to display the contents of a text file while normalizing spacing:
The final split at the end of the input returns the special value reflex::PCRE2Matcher::Const::EMPTY to indicate that empty text after the split was found and matched. This special value is also returned by accept() and is also used with any other RE/flex matcher's split method.
See also Properties of a match.
To obtain properties of a match, use the following methods:
Method | Result |
---|---|
accept() | returns group capture index (or zero if not captured/matched) |
text() | returns const char* to 0-terminated text match (ends in \0 ) |
strview() | returns std::string_view text match (preserves \0 s) (C++17) |
str() | returns std::string text match (preserves \0 s) |
wstr() | returns std::wstring wide text match (converted from UTF-8) |
chr() | returns first 8-bit char of the text match (str()[0] as int) |
wchr() | returns first wide char of the text match (wstr()[0] as int) |
pair() | returns std::pair<size_t,std::string>(accept(),str()) |
wpair() | returns std::pair<size_t,std::wstring>(accept(),wstr()) |
size() | returns the length of the text match in bytes |
wsize() | returns the length of the match in number of wide characters |
lines() | returns the number of lines in the text match (>=1) |
columns() | returns the number of columns of the text match (>=0) |
begin() | returns const char* to non-0-terminated text match begin |
end() | returns const char* to non-0-terminated text match end |
rest() | returns const char* to 0-terminated rest of input |
span() | returns const char* to 0-terminated match enlarged to span the line |
line() | returns std::string line with the matched text as a substring |
wline() | returns std::wstring line with the matched text as a substring |
more() | tells the matcher to append the next match (when using scan() ) |
less(n) | cuts text() to n bytes and repositions the matcher |
lineno() | returns line number of the match, starting at line 1 |
columno() | returns column number of the match, starting at 0 |
lineno_end() | returns ending line number of the match, starting at line 1 |
columno_end() | returns ending column number of the match, starting at 0 |
bol() | returns const char* to non-0-terminated begin of matching line |
border() | returns byte offset from the start of the line of the match |
first() | returns input position of the first character of the match |
last() | returns input position + 1 of the last character of the match |
at_bol() | true if matcher reached the begin of a new line |
at_bob() | true if matcher is at the begin of input and no input consumed |
at_end() | true if matcher is at the end of input |
[0] | operator returns std::pair<const char*,size_t>(begin(),size()) |
[n] | operator returns n'th capture std::pair<const char*,size_t> |
The accept()
method returns nonzero for a successful match, returning the group capture index. The RE/flex matcher engine reflex::Matcher
only recognizes group captures at the top level of the regex (i.e. among the top-level alternations), because it uses an efficient FSM for matching.
The text()
, strview()
, str()
, and wstr()
methods return the matched text. To get the first character of a match, use chr()
or wchr()
. The chr()
and wchr()
methods are much more efficient than str()[0]
(or text()[0]
) and wstr()[0]
, respectively. Normally, a match cannot be empty unless option "N"
is specified to explicitly initialize a matcher, see The reflex::Matcher class , PCRE2 matcher classes , and Boost matcher classes .
The begin()
, operator[0]
, and operator[n]
return non-0-terminated strings. You must use end()
with begin()
to determine the span of the match. Basically, text()
is the 0-terminated version of the string spanned by begin()
to end()
, where end()
points the next character after the match, which means that end()
= begin()
+ size()
. Use the size of the capture returned by operator[n]
to determine the end of the captured match.
The lineno()
method returns the line number of the match, starting at line 1. The ending line number is lineno_end()
, which is identical to the value of lineno()
+ lines()
- 1.
The columno()
method returns the column offset of a match from the start of the line, beginning at column 0. This method takes tab spacing and wide characters into account. The inclusive ending column number is given by columno_end()
, which is equal to or larger than columno()
if the match does not span multiple lines. Otherwise, if the match spans multiple lines, columno_end()
is the ending column of the match on the last matching line.
The starting byte offset of the match on a line is border()
and the ending byte offset of the match is border() + size() - 1
.
The lines()
and columns()
methods return the number of lines and columns matched, where columns()
takes tab spacing and wide characters into account. If the match spans multiple lines, columns()
counts columns over all lines, without counting the newline characters.
columno()
, columno_end()
, and columns()
do not take the character width of full-width and combining Unicode characters into account. It is recommended to use the wcwidth
function or wcwidth.c to determine Unicode character widths.
The rest()
method returns the rest of the input character sequence as a 0-terminated char*
string. This method buffers all remaining input to return the string.
The span()
method enlarges the text matched to span the entire line and returns the matching line as a 0-terminated char*
string without the \n
.
The line()
and wline()
methods return the entire line as a (wide) string with the matched text as a substring. These methods can be used to obtain the context of a match.
span()
, line()
, and wline()
invalidate the previous text()
, strview()
, begin()
, bol()
, and end()
string pointers. Call these methods again to retrieve the updated pointer or call str()
or wstr()
to obtain a string copy of the match.
The length of the line returned by span(), line(), and wline() is limited to reflex::AbstractMatcher::Const::BUFSZ. When this length is exceeded, the line's length before the match is truncated. This ensures that pattern matching binary files or files with very long lines cannot cause memory allocation exceptions.
The matcher().more()
method is used to create longer matches by stringing together consecutive matches in the input after scanning the input with the scan()
method. When this method is invoked, the next match with scan()
has its matched text prepended to it. The matcher().more()
operation is often used in lexers and was introduced in Lex.
The less(n)
method reduces the size of the matched text to n
bytes. This method has no effect if n
is larger than size()
. The value of n
should not be 0
. The less(n)
operation is often used in lexers and was introduced in Lex.
The first()
and last()
methods return the position in the input stream of the match, counting in bytes from the start of the input at position 0. If the input stream is a wide character sequence, the UTF-8 positions are returned as a result of the internally-converted UTF-8 wide character input.
All methods take constant time to execute except for str()
, wstr()
, pair()
, wpair()
, wsize()
, lines()
, columns()
, and columno()
that require an extra pass over the matched text.
In addition, the following type casts of matcher objects and iterators may be used for convenience:
size_t
gives the matcher's accept()
index.std::string
is the same as invoking str()
std::wstring
is the same as invoking wstr()
.std::pair<size_t,std::string>
is the same as pair()
.std::pair<size_t,std::wstring>
is the same as wpair()
.The following example prints some of the properties of each match:
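A sketch that reproduces the output shown below (pattern and input are assumptions chosen to match it):

    #include <reflex/pcre2matcher.h>
    #include <iostream>

    int main()
    {
      reflex::PCRE2Matcher matcher("\\w+", "How now brown cow.");
      while (matcher.find())
        std::cout << "accept: " << matcher.accept()
                  << " text: " << matcher.text()
                  << " size: " << matcher.size()
                  << " line: " << matcher.lineno()
                  << " column: " << matcher.columno()
                  << " first: " << matcher.first()
                  << " last: " << matcher.last() << std::endl;
    }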
When executed this code prints:
accept: 1 text: How size: 3 line: 1 column: 0 first: 0 last: 3
accept: 1 text: now size: 3 line: 1 column: 4 first: 4 last: 7
accept: 1 text: brown size: 5 line: 1 column: 8 first: 8 last: 13
accept: 1 text: cow size: 3 line: 1 column: 14 first: 14 last: 17
Four public data members of a matcher object are accessible:
Variable | Usage |
---|---|
in | the reflex::Input object used by the matcher |
find | the reflex::AbstractMatcher::Operation functor for searching |
scan | the reflex::AbstractMatcher::Operation functor for scanning |
split | the reflex::AbstractMatcher::Operation functor for splitting |
Normally only the in
variable should be used which holds the current input object of the matcher. See The Input class for details.
The functors provide begin()
and end()
methods that return iterators and hold the necessary state information for the iterators. A functor invocation essentially invokes the corresponding method listed in Methods and iterators .
To change a matcher's pattern or check if a pattern was assigned, you can use the following methods:
Method | Result |
---|---|
pattern(p) | set pattern to p (string regex or reflex::Pattern ) |
has_pattern() | true if the matcher has a pattern assigned to it |
own_pattern() | true if the matcher has a pattern to manage and delete |
pattern() | get the pattern object associated with the matcher |
The first method returns a reference to the matcher, so multiple method invocations may be chained together.
To assign a new input source to a matcher or set the input to buffered or interactive, you can use the following methods:
Method | Result |
---|---|
input(i) | set input to reflex::Input i (string, stream, or FILE* ) |
buffer() | buffer all input at once, returns true if successful |
buffer(n) | set the adaptive buffer size to n bytes to buffer input |
buffer(b, n) | use a buffer of n bytes at address b that holds a string of n-1 bytes (zero copy) |
interactive() | sets buffer size to 1 for console-based (TTY) input |
flush() | flush the remaining input from the internal buffer |
reset() | resets the matcher, restarting it from the remaining input |
reset(o) | resets the matcher with new options string o ("A?N?T?") |
The first method returns a reference to the matcher, so multiple method invocations may be chained together.
The following methods may be used to read the input stream provided to a matcher directly, even when you use the matcher's search and match methods:
Method | Result |
---|---|
input() | returns next 8-bit char from the input, matcher then skips it |
winput() | returns next wide character from the input, matcher skips it |
unput(c) | put 8-bit char c back onto the stream, matcher then takes it |
wunput(c) | put (wide) char c back onto the stream, matcher then takes it |
peek() | returns next 8-bit char from the input without consuming it |
skip(c) | skip input until character c (char or wchar_t ) is consumed |
skip(s) | skip input until UTF-8 string s is consumed |
rest() | returns the remaining input as a non-NULL char* string |
The input()
, winput()
, and peek()
methods return a non-negative character code and EOF (-1) when the end of input is reached.
A matcher reads from the specified input source using its virtual method size_t get(char *s, size_t n)
. This method is the same as invoking matcher().in.get(s, n)
to directly read data from the reflex::Input
source in
, but also handles interactive input when enabled with matcher().interactive()
to not read beyond the next newline character.
The following protected methods may be overridden by a derived matcher class to customize reading:
Method | Result |
---|---|
get(s, n) | fill s[0..n-1] with next input, returns number of bytes read |
wrap() | returns false (may be overridden to wrap input after EOF) |
When a matcher reaches the end of input, it invokes the virtual method wrap()
to check if more input is available. This method returns false by default, but this behavior may be changed by overriding wrap()
to set a new input source and return true
, for example:
Note that the constructor in this example does not specify a pattern and input. To set a pattern for the matcher after its instantiation use the pattern(p)
method. In this case the input does not need to be specified, which allows us to immediately force reading the sources of input that we assigned in our wrap()
method.
For details of the reflex::Input
class, see The Input class .
A matcher may accept several types of input, but can only read from one input source at a time. Input to a matcher is represented by a single reflex::Input
class instance that the matcher uses internally.
An input object is constructed by specifying a string, a file, or a stream to read from. You can also reassign input to read from new input.
More specifically, you can pass a std::string
, char*
, std::wstring
, wchar_t*
, FILE*
, or a std::istream
to the constructor.
A FILE*
file descriptor is a special case. The input object handles various file encodings. If a UTF Byte Order Mark (BOM) is detected then the UTF input will be normalized to UTF-8. When no UTF BOM is detected then the input is considered plain ASCII, binary, or UTF-8 and passed through unconverted. To override the file encoding when no UTF BOM was present, and normalize Latin-1, ISO-8859-1 through ISO-8859-15, CP 1250 through 1258, CP 437, CP 850, CP 858, KOI8, MACROMAN, EBCDIC, and other encodings to UTF-8, see FILE encodings.
An input object constructed from an 8-bit string char*
or std::string
just passes the string to the matcher engine. The string should contain UTF-8 when Unicode patterns are used.
An input object constructed from a wide string wchar_t*
or std::wstring
translates the wide string to UTF-8 for matching, which effectively normalizes the input for matching with Unicode patterns. This conversion is illustrated below. The copyright symbol ©
with Unicode U+00A9 is matched against its UTF-8 sequence C2 A9
of ©
:
To ensure that Unicode patterns in UTF-8 strings are grouped properly, use Regex converters , for example as follows:
Here we made the converted pattern static to avoid repeated conversion and construction overheads.
char*
, wchar_t*
, and std::wstring
strings cannot contain a \0
(NUL) character and the first \0
terminates matching. To match strings and binary input that contain \0
, use std::string
or std::istringstream
.An input object constructed from a std::istream
(or a derived class) just passes the input text to the matcher engine. The stream should contain ASCII and may contain UTF-8.
File content specified with a FILE*
file descriptor can be encoded in ASCII, binary, UTF-8/16/32, ISO-8859-1 through ISO-8859-15, CP 1250 through 1258, CP 437, CP 850, CP 858, or EBCDIC.
A UTF Byte Order Mark (BOM) is detected in the content of a file scanned by the matcher, which enables UTF-8 normalization of the input automatically.
Otherwise, if no file encoding is explicitly specified, the matcher expects raw UTF-8, ASCII, or plain binary by default. File formats can be decoded and translated to UTF-8 on the fly for matching by means of specifying encodings.
The current file encoding used by a matcher is obtained with the reflex::Input::file_encoding()
method, which returns an reflex::Input::file_encoding
constant of type reflex::Input::file_encoding_type
:
To set the file encoding when assigning a file to read with reflex::Input
, use reflex::Input(file, enc)
with one of the encoding constants shown in the table.
For example, use reflex::Input::file_encoding::latin
to override the encoding when the file contains ISO-8859-1. This way you can match its content using Unicode patterns (matcher engines internally normalize ISO-8859-1 to UTF-8):
This sets the standard input encoding to ISO-8859-1, but only if no UTF BOM was detected on the standard input, because the UTF encoding of a FILE*
that starts with a UTF BOM cannot be overruled.
To define a custom code page to translate files, define a code page table with 256 entries that maps each 8-bit input character to a 16-bit Unicode character (UCS-2). Then use reflex::Input::file_encoding::custom
with a pointer to your code page to construct an input object. For example:
This example translates all control characters and characters above 127 to spaces before matching.
To obtain the properties of an input source use the following methods:
Method | Result |
---|---|
size() | size in bytes of the remaining input, zero when EOF or unknown |
good() | input is available to read (no error and not EOF) |
eof() | end of input (but use only at_end() with matchers!) |
cstring() | the current const char* (of a std::string ) or NULL |
wstring() | the current const wchar_t* (of a std::wstring ) or NULL |
file() | the current FILE* file descriptor or NULL |
istream() | a std::istream* pointer to the current stream object or NULL |
We can use a reflex::Input
object as a std::streambuf
and pass it to a std::istream
. This is useful when a std::istream
is required where a reflex::Input
object cannot be directly used. The std::istream
automatically normalizes the input to UTF-8 using the underlying reflex::Input
object. For example:
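A sketch (the file name is illustrative and error handling is minimal):

    FILE *fd = fopen("file.txt", "rb");
    if (fd == NULL)
      exit(EXIT_FAILURE);
    reflex::Input input(fd);
    reflex::Input::streambuf streambuf(input);  // delivers normalized UTF-8
    std::istream stream(&streambuf);
    // pass stream to any code that expects a std::istream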
The reflex::Input
object may be created from strings, wide strings, streams, and FILE*
values. These are readable as a std::istream
via reflex::Input::streambuf
that returns normalized UTF-8 characters. For FILE*
values we can specify FILE encodings to normalize the encoded input to UTF-8.
Keep in mind that adding a std::istream
with reflex::Input::streambuf
layer on top of the efficient reflex::Input
class will impact file reading performance, especially because reflex::Input::streambuf
is unbuffered (despite its name). When performance is important, use the buffered version reflex::BufferedInput::streambuf
:
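For instance, continuing the sketch above with the same input object:

    reflex::BufferedInput::streambuf streambuf(input);  // buffered for better throughput
    std::istream stream(&streambuf);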
Because the buffered version reads ahead to fill its buffer, it may not be suitable for interactive input.
See also Windows CRLF pairs.
Reading files in Windows "binary mode" is recommended when the file is encoded in UTF-16 or UTF-32. Reading a file in the default "text mode" replaces CRLF by LF and interprets ^Z (0x1A) as EOF. Because a ^Z code may be part of a UTF-16 or UTF-32 multibyte sequence, this can cause premature EOF on Windows machines. The latest RE/flex releases automatically switch FILE*
input to binary mode on Windows systems when the file is encoded in UTF-16 or UTF-32.
In addition, DOS files and other DOS or Windows input sources typically end lines with CRLF byte pairs, see Files with CRLF pairs . Reading a file in binary mode retains these CRLF pairs.
To automatically replace CRLF by LF when reading files in binary mode on Windows you can use the reflex::Input::dos_streambuf
class to construct a std::istream
object. This normalized stream can then be used as input to a RE/flex scanner or to a regex matcher:
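A sketch (the file name and error handling are illustrative):

    FILE *fd = fopen("file.txt", "rb");
    if (fd == NULL)
      exit(EXIT_FAILURE);
    reflex::Input::dos_streambuf streambuf(fd);  // replaces CRLF by LF while reading
    std::istream stream(&streambuf);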
Once the stream object is created it can be used to create a new input object for a RE/flex scanner, for example:
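For instance, assuming the scanner class generated by reflex is named Lexer (the default name):

    Lexer lexer(stream);
    lexer.lex();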
or for a regex matcher that uses PCRE2 or Boost.Regex:
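For instance (the pattern is illustrative):

    reflex::BoostMatcher matcher("\\w+", stream);
    while (matcher.find())
      std::cout << matcher.text() << std::endl;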
Note that when the input is a FILE*
, CRLF pairs are replaced by LF and UTF-16/32 encodings are automatically normalized to UTF-8 (when a UTF BOM is present in the file or you can specify FILE encodings).
The reflex::Input::size
method returns the number of bytes available, which includes CRLF pairs. The actual number of bytes read may be smaller after replacing CRLF by LF.
When performance is important, use the buffered version reflex::BufferedInput::dos_streambuf
:
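For instance, with the FILE* fd from the sketch above:

    reflex::BufferedInput::dos_streambuf streambuf(fd);
    std::istream stream(&streambuf);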
Because the buffered version reads ahead to fill its buffer, it may not be suitable for interactive input.
See also Input streambuf.
This section includes several examples to demonstrate the concepts discussed.
This example illustrates the find
and split
methods and iterators with a RE/flex reflex::Matcher
and a reflex::BoostMatcher
using a C++11 range-based loop:
When executed this code prints:
Monty at 1,0 spans 0..5 Python at 2,1 spans 7..13 4 words Monty Python's Flying Circus Circus Flying Monty Python's
This example shows how a URL can be matched by using two patterns: one pattern to extract the host:port/path parts and another pattern to extract the query string key-value pairs in a loop. A PCRE2 matcher or a Boost.Regex matcher may be used, since both support group captures.
See also Example 8 below for a more powerful URL pattern matcher.
This example shows how input can be reassigned in each iteration of a loop that matches wide strings against a word pattern \w+
:
When executed this code prints:
Monty, Flying, Circus,
This example counts the number of words, lines, and chars from the std::cin
stream:
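A minimal sketch of such a program (the pattern and output format are choices made for this sketch, not necessarily those of the original example):

    #include <reflex/matcher.h>
    #include <iostream>

    int main()
    {
      reflex::Matcher matcher("(\\n)|(\\w+)|(.)", std::cin);
      size_t chars = 0, words = 0, lines = 0;
      while (matcher.scan() != 0)
      {
        chars += matcher.size();          // size of the match in bytes
        if (matcher.accept() == 1)        // group 1: a newline
          ++lines;
        else if (matcher.accept() == 2)   // group 2: a word
          ++words;
      }
      std::cout << lines << " " << words << " " << chars << std::endl;
      return 0;
    }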
This example tokenizes a string by grouping the subpatterns in a regex and by using the group index of the capture obtained with accept()
in a C++11 range-based loop:
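A sketch of the technique, assuming the matcher's iterable scan operation (the original example also prints the subpattern that matched, as shown in the output below):

    #include <reflex/matcher.h>
    #include <iostream>

    int main()
    {
      reflex::Matcher matcher("(\\w*cat\\w*)|(\\w*dog\\w*)|(\\w+)|(.)", "cats love hotdogs!");
      for (auto& match : matcher.scan)    // tokenize with a range-based loop
        std::cout << "Token = " << match.accept()
                  << ": matched '" << match.text() << "'" << std::endl;
      return 0;
    }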
When executed this code prints:
Token = 1: matched 'cats' with '(\\w*cat\\w*)' Token = 4: matched ' ' with '(.)' Token = 3: matched 'love' with '(\\w+)' Token = 4: matched ' ' with '(.)' Token = 2: matched 'hotdogs' with '(\\w*dog\\w*)' Token = 4: matched '!' with '(.)'
This example reads a file and extracts the credit card numbers embedded in it. The numbers are sorted into five sets, one for each type of major credit card:
When executed this code prints:
0: 5212345678901234 1: 4123456789012 1: 4123456789012345 2: 371234567890123 3: 601112345678901234 4: 38812345678901
The RE/flex matcher engine reflex::Matcher
only recognizes group captures at the top level of the regex (i.e. among the top-level alternations), because it uses an efficient FSM for matching.
By contrast, the PCRE2 matcher can capture groups within a regex:
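For instance, a sketch with a group nested inside the regex (the [] accessor returning a pointer/size pair for a group capture is an assumption here; the pattern and input are illustrative):

    #include <reflex/pcre2matcher.h>
    #include <iostream>
    #include <string>
    #include <utility>

    int main()
    {
      reflex::PCRE2Matcher matcher("(\\w+(-\\w+)?)", "well-formed");
      if (matcher.matches())
      {
        std::pair<const char*,size_t> group = matcher[2];  // the nested group 2
        if (group.first != NULL)
          std::cout << "group 2 is '" << std::string(group.first, group.second) << "'" << std::endl;
      }
      return 0;
    }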
The PCRE2 and Boost.Regex libraries also support group captures and partial matches, but partial matching appears to be broken in Boost.Regex, so all input must be buffered when Boost.Regex is used:
This is a more advanced example, in which we will use the reflex::BoostMatcher
class to decompose URLs into parts: the host, port, path, optional ?-query string key=value pairs, and an optional #-anchor.
To do so, we change the pattern of the matcher to partially match each of the URL's parts and also use input()
to check the input character:
Note that there are two ways to split the query string into key-value pairs. Both methods are shown in the two #if
branches in the code above, with the first branch disabled with #if 0
.
When executing
./url 'https://localhost:8080/test/me?name=reflex&license=BSD-3'
this code prints:
host: localhost port: 8080 path: test/me query key: name, value: reflex query key: license, value: BSD-3
This example shows how a FILE*
file descriptor is used as input. The file encoding is obtained from the UTF BOM, when present in the file. Note that the file's state is accessed through the matcher's member variable in
:
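A sketch of such a program (the file name and pattern are illustrative):

    FILE *fd = fopen("example.txt", "r");
    if (fd == NULL)
      exit(EXIT_FAILURE);
    reflex::BoostMatcher matcher("\\w+", fd);
    while (matcher.find())
      std::cout << matcher.text() << std::endl;
    // the matcher's member variable in is the reflex::Input that wraps fd
    if (!matcher.in.eof())
      std::cerr << "could not read the complete file" << std::endl;
    fclose(fd);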
The default encoding is reflex::Input::file_encoding::plain
when no UTF BOM is detected at the start of the input file. The encodings reflex::Input::file_encoding::latin
, reflex::Input::file_encoding::cp1252
, reflex::Input::file_encoding::cp437
, reflex::Input::file_encoding::cp850
, reflex::Input::file_encoding::ebcdic
are never detected automatically, because plain encoding is implicitly assumed to be the default encoding. To convert these files, set the file encoding format explicitly in your code. For example, if you expect the source file to contain ISO-8859-1 8-bit characters (ASCII and the latin-1 supplement) then set the default file encoding to reflex::Input::file_encoding::latin
as follows:
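For instance (the file name and pattern are illustrative):

    FILE *fd = fopen("example.txt", "r");
    if (fd == NULL)
      exit(EXIT_FAILURE);
    reflex::Input input(fd, reflex::Input::file_encoding::latin);
    reflex::BoostMatcher matcher("\\w+", input);
    while (matcher.find())
      std::cout << matcher.text() << std::endl;
    fclose(fd);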
This sets the file encoding to ISO-8859-1, but only if no UTF BOM was detected in the file. Files with a UTF BOM are always decoded as UTF, which cannot be overruled.
For backward compatibility with Flex, option −−flex
defines macros to expand yyin
, yyout
, yylineno
, yytext
, and yyleng
. The macro expansion depends on the −−bison
option or −−bison-locations
, −−bison-cc
and so on.
When used with −−flex
, option −−bison
generates global "yy" variables and functions, see Bison and thread-safety for details. This means that yytext
, yyleng
, and yylineno
are global variables. More specifically, the following declarations are generated with −−flex
and −−bison
:
Note that yyin
is not a global variable, because the yyin
macro expands to a pointer to the reflex::Input
of the matcher. This offers advanced input handling capabilities with reflex::Input
that is more useful compared to the traditional global FILE *yyin
variable.
However, the following declaration, when present in a Lex/Flex lexer specification, may cause a compilation error:
Option −−yy
enables −−flex
and −−bison
. In addition, this option generates the following declarations to define the yyin
and yyout
as global FILE*
type variables:
Note that without option −−yy
, when options −−flex
and −−bison
are used, yyin
is a pointer to a reflex::Input
object. This means that yyin
is not restricted to FILE*
types and accepts files, streams, and strings:
See Switching input sources .
To use Flex' yy
functions in your scanner's actions, use option −−flex
for Flex compatibility (see also previous section).
In addition, note that by default the reflex
command generates a reentrant C++ scanner class, unless option −−bison
is used. This means that by default all yy
functions are scanner class methods, not global functions. This obviously means that yy
functions cannot be globally invoked, e.g. from your parser. These are the alternatives:
Use global yy
functions like Flex with option −−yy
(or −−flex
and −−bison
). This approach is not thread safe.
Use the lexer class methods instead of the global yy
functions, see the list of scanner methods listed in The rules section.
Use #define YY_SCANNER
(redefine) in your parser and in other parts of the program that need to invoke yy
functions: yyinput()
macro expands to YY_SCANNER.input()
, where YY_SCANNER
is normally (*this)
, i.e. the current scanner object, or YY_SCANNER
is the global scanner object/state when option −−bison
is used to generate global yy
variables and functions stored in the global YY_SCANNER
object.It may be tempting to write a pattern with .
(dot) as a wildcard in a lexer specification, but beware that in Unicode mode enabled with %option unicode
or with modifier (?u:φ)
, the dot matches any code point, including code points outside of the valid Unicode character range and invalid overlong UTF-8 (except that it won't match newline unless %option dotall
is specified). The reason for this design choice is that a lexer should be able to implement a "catch all else" rule to report errors in the input:
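For instance, a rules-section sketch (the rule set and error message wording are illustrative):

    %option unicode
    %%
    [\p{Unicode}]+    { /* process valid Unicode text */ }
    .                 { std::cerr << "invalid input at line " << lineno() << std::endl; }
    %%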
If dot in Unicode mode with %option unicode
would be restricted to match valid Unicode only, then the action above will never be triggered when invalid input is encountered. Because all non-dot regex patterns are valid Unicode in RE/flex, it would be impossible to write a "catch all else" rule that catches input format errors!
The dot in Unicode mode is self-synchronizing and consumes text up to the next ASCII or Unicode character.
Because the .
is "permissive" by design with %option unicode
, multiple .
dots in sequence can match a single multi-byte Unicode character by its individual bytes.
To accept only valid Unicode input in regex patterns, make sure to avoid .
(dot) and use \p{Unicode}
or \X
instead, and reserve dot to catch anything, such as invalid UTF encodings. We can use .|\n
or %option dotall
to catch anything including \n
and invalid UTF-8/16/32 encodings.
Furthermore, before matching any input, invalid UTF-16 input is detected automatically by the reflex::Input
class and replaced with the REFLEX_NONCHAR
code point U+200000 that lies outside the valid Unicode range. This code point is never matched by non-dot regex patterns and is easy to detect by a regex pattern with a dot and a corresponding error action as shown above.
Note that character classes written as bracket lists may produce invalid Unicode ranges when used improperly. This is not a problem for matching, but may prevent rejecting surrogate halves that are invalid Unicode. For example, [\u{00}-\u{10FFFF}]
obviously includes the invalid range of surrogate halves [\u{D800}-\u{DFFF}]
. You can always remove surrogate halves from any character class by intersecting the class with [\p{Unicode}]
, that is [...&&[\p{Unicode}]]
. Furthermore, character class negation with ^
results in classes that are within range U+0000 to U+10FFFF and excludes surrogate halves.
When your scanner or parser encounters an error in the input, the scanner or parser should report it and attempt to continue processing the input by recovering from the error condition. Most compilers recover from an error to continue processing the input until a threshold on the maximum number of errors is exceeded.
In our lexer specification of a scanner, we may define a "catch all else" rule with pattern .
to report an unmatched "mystery character" that is not recognized. For example:
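A sketch of such a rule, along the lines described below (assuming <iostream> and <string> are included in the definitions section; the message wording is illustrative):

    .    {
           std::cerr << "Error: mystery character on line " << lineno() << ":" << std::endl
                     << matcher().line() << std::endl
                     << std::string(columno(), ' ') << "^" << std::endl;
         }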
Beware that a .
(dot) matches any character or byte, including invalid Unicode. See also Invalid UTF encodings and the dot pattern .
The error message indicates the offending line number with lineno()
and prints the problematic line of input using matcher().line()
. The position on the line is indicated with an arrow placed below the line at offset columno()
from the start of the line, where columno()
takes tabs and wide characters into account.
This error message does not take the window width into account, which may result in misplacing the arrow when the line is too long and overflows onto the next rows in the window, unless changes are made to the code to print the relevant part of the line only.
There are other ways to indicate the location of an error, for example as -->
<--
and highlighting the error using the ANSI SGR escape sequence for bold typeface:
This prints the start of the line up to the mismatching position on the line returned by border()
, followed by the highlighted "mystery character". Beware that this can be a control code or invalid Unicode code point, which should be checked before displaying it.
This scanner terminates when 10 lexical errors are encountered in the input, as defined by max_errors
.
By default, Bison invokes yyerror()
(or yy::parser::error()
with Bison-cc parsers) to report syntax errors. However, it is recommended to use Bison error productions to handle and resolve syntax errors intelligently by synchronizing on tokens that allow the parser to continue, for example on a semicolon in a Bison-bridge parser:
%pure-parser
is deprecated and replaced with %define api.pure
.We keep track of the number of errors by incrementing lexer->errors
. When the maximum number of lexical and syntax errors is reached, we bail out.
The line of input where the syntax error occurs is reported with yyerror()
for the Bison-bridge parser:
With option −−flex
, the definitions part of the lexer specification is updated as follows:
And the yyerror()
function is updated as follows:
%pure-parser
is deprecated and replaced with %define api.pure
.These examples assume that the syntax error was detected immediately at the last token scanned and displayed with lexer->str()
, which may not always be the case.
With Bison-bridge & locations parsers (and optionally −−flex
), we obtain the first and the last line of an error and we can use this information to report the error. For example as follows:
Because we use Flex-compatible reentrant functions yy_create_buffer()
, yypush_buffer_state()
, and yypop_buffer_state()
that take an extra scanner argument, we also use options −−flex
and −−reentrant
in addition to −−bison-bridge
and −−bison-locations
to generate the reentrant scanner for the example shown above.
Similarly, with Bison-complete & locations parsers, syntax errors can be reported as follows (without option −−flex
):
If option −−exception
is specified with a lexer specification, for example as follows:
then we should make sure to consume some input in the exception handler to advance the scanner forward to skip the offending input and to allow the scanner to recover:
Error reporting can be combined with Bison Lookahead Correction (LAC), which is enabled with:
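Per the Bison documentation, LAC is enabled with the following directive in the grammar file:

    %define parse.lac full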
For more details on Bison error messaging, resolution, and LAC, please see the Bison documentation.
The RE/flex scanners and regex matchers use an internal buffer with UTF-8 encoded text content to scan wide strings and UTF-16/UTF-32 input. This means that Unicode input is normalized to UTF-8 prior to matching. This internal conversion is independent of the current C locale and is performed automatically by the reflex::Input
class that passes the UTF-8-normalized input to the matchers.
Furthermore, RE/flex lexers may invoke the wstr()
, wchr()
, and wpair()
methods to extract wide string and wide character matches. These methods are also independent of the current C locale.
This means that setting the C locale in an application will not affect the performance of RE/flex scanners and regex matchers.
As a side note, to display wide strings properly and to save wide strings to UTF-8 text files, it is generally recommended to set the UTF-8 locale. For example:
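For instance (the locale name is system dependent; an empty name selects the environment's locale):

    #include <clocale>

    std::setlocale(LC_ALL, "en_US.UTF-8");   // e.g. at the start of main()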
This displays wide string matches in UTF-8 on most consoles and terminals, but not on all systems (I'm looking at you, Mac OS X terminal!) Instead of std::wcout
we can use std::cout
to display UTF-8 content directly:
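For instance, a sketch using a matcher's str() method, which returns the match as a UTF-8 std::string:

    while (matcher.find())
      std::cout << matcher.str() << std::endl;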
Scanning files encoded in ISO-8859-1 by a Unicode scanner that expects UTF-8 will cause the scanner to misbehave or throw errors.
Many text files are still encoded in ISO-8859-1 (also called latin-1). To set up your scanner to safely scan ISO-8859-1 content when your scanner rules use Unicode (with the −−unicode
option and your patterns that use UTF-8 encodings), set the default file encoding to latin
:
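For instance, assuming the generated scanner class is named Lexer (the default):

    reflex::Input input(stdin, reflex::Input::file_encoding::latin);
    Lexer lexer(input);
    lexer.lex();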
This scans files from standard input that are encoded in ISO-8859-1, unless the file has a UTF Byte Order Mark (BOM). When a BOM is detected the scanner switches to UTF scanning.
See FILE encodings to set file encodings.
DOS files and other DOS or Windows input sources typically end lines with CRLF byte pairs. There are two ways to effectively deal with CRLF pairs:
Use reflex::Input::dos_streambuf
to automatically convert Windows CRLF pairs by creating a std::istream
for the specified reflex::Input::dos_streambuf
. Due to the extra layer introduced in the input processing stack, this option adds some overhead but requires no changes to the patterns and application code.
Use patterns that match both \n
and \r\n
to allow CRLF line endings. This option processes input fast, but requires specialized patterns, and the matched multi-line text will include \r
(CR) characters that may need to be dealt with by the application code.
To rewrite your patterns to support CRLF end-of-line matching:
\n
in patterns by \r?\n
..*
in patterns by ([^\n\r]|\r[^\n])*
to match any non-newline characters. Likewise replace .+
by its longer version. Note that a single .
can still be used in patterns but may match a \r
just before a \n
when a CRLF is encountered.With the above changes, reading files on Windows systems in "binary mode" is recommended, i.e. open FILE*
files with the "rb"
mode.
Reading a file in the default "text mode" interprets ^Z (0x1A) as EOF. The latest RE/flex releases automatically switch FILE*
input to binary mode on Windows systems when the file is encoded in UTF-16 or UTF-32, but not UTF-8.
Old Macintosh OS file formats prior to Mac OS X use CR to end lines instead of LF. To automatically read and normalize files encoded in MacRoman containing CR as newlines, you can use the reflex::Input::file_encoding::macroman
file encoding format. This normalizes the input to UTF-8 and translates CR newlines to LF newlines. See FILE encodings for details.
Alternatively, you can define a custom code page to translate CR to LF without normalizing to UTF-8:
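A sketch of such a code page, mapping every byte to itself except CR:

    static unsigned short codepage[256];
    for (int i = 0; i < 256; ++i)
      codepage[i] = (i == '\r') ? '\n' : i;   // map CR (0x0D) to LF (0x0A)
    reflex::Input input(stdin, reflex::Input::file_encoding::custom, codepage);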
Then use the input
object to read stdin
or any other FILE*
. See also FILE encodings.
Repetitions (*
, +
, and {n,m}
) and the optional pattern (?
) are greedy, unless marked with an extra ?
to make them lazy. Lazy repetitions are useless when the regex pattern after the lazy repetitions permits empty input. For example, .*?a?
only matches one a
or nothing at all, because a?
permits an empty match.
This C/C++ trigraph work-around does not apply to lexer specifications, because the reflex
command converts them while preventing trigraphs.
Trigraphs in C/C++ strings are special triple-character sequences, beginning with two question marks and followed by a character that is translated. For example, "x??(y|z)"
is translated to "x[y|z)"
.
Fortunately, most C++ compilers ignore trigraphs unless in standard-conforming modes, such as -ansi
and -std=c++98
.
When using the lazy optional pattern φ??
in a regex C/C++ string for pattern matching with one of the RE/flex matchers for example, use φ?\?
instead, which the C/C++ compiler translates to φ??
.
Otherwise, lazy optional pattern constructs will appear broken.
The state of the input object reflex::Input
changes as the scanner's matcher consumes more input. If you switch to the same input again (e.g. with in(i)
or switch_stream(i)
for input source i
), a portion of that input may be discarded, because part of the matcher's internal buffer is flushed when the input is assigned. Therefore, the following code will not work, because stdin is flushed repeatedly:
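A sketch of the problematic pattern, assuming a scanner object named lexer:

    // BROKEN: each assignment of stdin flushes input already buffered by the matcher
    while (!feof(stdin))
    {
      lexer.in(stdin);   // re-assigns the input source and discards read-ahead bytes
      lexer.lex();
    }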
If you need to read a file or stream again, you have two options:
Use push_matcher(m)
or yypush_buffer_state(m)
to start using a new matcher m
, e.g. created with Matcher m = new_matcher(i)
to consume the specified input i
. Restore the original matcher with pop_matcher()
or yypop_buffer_state()
. See also Multiple input sources.
FILE*
input is checked against a UTF BOM at the start of a file, which means that you cannot reliably move to an arbitrary location in the file to start reading when the file is encoded in UTF-8, UTF-16, or UTF-32.
RE/flex uses its own header file reflex/flexlexer.h
for compatibility with Flex, instead of the Flex file FlexLexer.h
. The latter is specific to Flex and cannot be used with RE/flex. You should not have to include FlexLexer.h
but if you do, use:
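That is, include the RE/flex header instead:

    #include <reflex/flexlexer.h>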
The FlexLexer
class defined in reflex/flexlexer.h
is the base class of the generated yyFlexLexer
class. A name for the generated lexer class can be specified with option −−lexer=NAME
.
Option -I
for interactive input generates a scanner that uses fgetc()
to read input from a FILE*
descriptor (stdin by default). Interactive input is made more user-friendly with the GNU readline library that provides basic line editing and a history mechanism.
To use readline()
in your lexer, call readline()
in your Lexer's constructor and in the wrap()
method as follows:
With option −−flex
you will need to replace wrap()
by a Flex-like yywrap()
and change it to return 0 on success:
The rules can be matched as usual, where \n
matches the end of a line, for example:
When FILE*
input is read, the read operation performed with an fread
by the reflex::Input
class should normally block until data is available. Otherwise, when no data is available, an EOF condition is set and further reads are blocked.
To support error recovery and non-blocking FILE*
input, an event handler can be registered. This handler is invoked when no input is available (i.e. fread
returns zero) and the end of the file is not reached yet (i.e. feof()
returns zero).
The handler should be derived from the reflex::Input::Handler
abstract base functor class as follows:
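A minimal sketch of such a functor (the operator() signature and return convention shown here are assumptions; check reflex/input.h for the exact declaration):

    struct NonBlockHandler : public reflex::Input::Handler {
      int operator()()
      {
        // return nonzero to let the matcher retry the read, zero to give up
        return 1;
      }
    };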
When your event handler allows non-blocking reads to continue, make sure that it does not return nonzero without some delay; otherwise the result is a busy loop that needlessly burns CPU cycles. Instead of a fixed delay, select()
can be effectively used to wait for input to become ready again:
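For instance, extending the sketch above to poll the file descriptor in one-second periods (fileno() and select() assume a POSIX system; the handler signature remains an assumption):

    #include <cstdio>
    #include <sys/select.h>

    struct NonBlockHandler : public reflex::Input::Handler {
      reflex::Input& in;                  // the input object, part of the handler's state
      NonBlockHandler(reflex::Input& in) : in(in) { }
      int operator()()
      {
        int fd = fileno(in.file());
        while (true)
        {
          fd_set readfds;
          FD_ZERO(&readfds);
          FD_SET(fd, &readfds);
          struct timeval tv;
          tv.tv_sec = 1;                  // wait in periods of one second
          tv.tv_usec = 0;
          if (select(fd + 1, &readfds, NULL, NULL, &tv) > 0)
            return 1;                     // data is pending: continue reading
          // bounding the number of iterations here would implement a timeout
        }
      }
    };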
Here we wait in periods of one second until data is pending on the FILE*
stream in.file()
, where in
is a reflex::Input
object. This object can be part of the NonBlockHandler
state. A timeout can be implemented by bounding the number of loop iterations.
Note that a FILE*
stream is set to non-blocking mode in Unix/Linux with fcntl()
. Your handler is registered with reflex::Input::set_handler()
:
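Registration could then look like this (whether set_handler() takes a pointer or a reference is an assumption to verify in reflex/input.h):

    reflex::Input in(fd);             // fd is the non-blocking FILE*
    NonBlockHandler handler(in);
    in.set_handler(&handler);
    reflex::Matcher matcher("\\w+", in);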
The custom event handler can also be used to detect and clear FILE*
errors by checking if an error conditions exist on the FILE*
input indicated by ferror()
. Errors are cleared with clearerr()
. Note that a non-blocking read that returns zero always produces nonzero ferror()
values.
Some hints when dealing with undefined symbols and link errors when building RE/flex applications:
Compilation requires libreflex
which is linked using compiler option -lreflex
:
c++ ... -lreflex
If libreflex
was not installed on your system then header files cannot be found and linking with -lreflex
fails. Instead, you can specify the include path and link the library with:
c++ -I<path>/reflex/include ... -L<path>/reflex/lib -lreflex
where <path>
is the directory path to the top directory of the downloaded RE/flex package.
The libreflex library source files are located in the reflex/lib
and the reflex/unicode
directories of the RE/flex download package. The header files are located in the reflex/include/reflex
directory.libpcre2-8
: c++ ... -lreflex -lpcre2-8
When the Boost.Regex matcher is used, also link with libboost_regex
: c++ ... -lreflex -lboost_regex
When using the std::regex
matching engine, you should compile the source code as C++11: c++ -std=c++11 ... -lreflex
When generating a scanner with the reflex
tool, the generated lex.yy.cpp
lexer logic should be compiled and linked with your application. We use reflex
option −−header-file
to generate lex.yy.h
with the lexer class to include in the source code of your lexer application.RE/flex scanners generated with reflex
can be linked against a minimized version of the RE/flex library libreflexmin
:
c++ ... -lreflexmin
The regex Unicode converters and the Unicode tables that are not used at run time are excluded from the minimized library.
If the RE/flex library is not installed, for example when cross-compiling a RE/flex scanner to a different platform, then compile directly from the RE/flex C++ source files located in the reflex/lib
and reflex/include
directories as follows:
c++ -I. -Iinclude lex.yy.cpp lib/debug.cpp lib/error.cpp \ lib/input.cpp lib/matcher.cpp lib/pattern.cpp lib/utf8.cpp lib/simd.cpp
This compiles the code without SIMD optimizations, despite compiling lib/simd.cpp
. SIMD intrinsics for SSE/AVX and ARM NEON/AArch64 are used to speed up string search and newline detection in the library. These optimizations are for the most part applicable to speed up searching with the Matcher::find()
method.
To compile with NEON/AArch64 optimizations applied (omit -mfpu=neon
for AArch64):
c++ -DHAVE_NEON -mfpu=neon -I. -Iinclude lex.yy.cpp lib/debug.cpp lib/error.cpp \ lib/input.cpp lib/matcher.cpp lib/pattern.cpp lib/utf8.cpp lib/simd.cpp
To compile with SSE2 optimizations applied:
c++ -DHAVE_SSE2 -msse2 -I. -Iinclude lex.yy.cpp lib/debug.cpp lib/error.cpp \ lib/input.cpp lib/matcher.cpp lib/pattern.cpp lib/utf8.cpp lib/simd.cpp
To compile with AVX2 optimizations applied and run-time detection of AVX2 using SSE2 as a fallback optimization when the CPU does not support AVX2:
c++ -DHAVE_AVX2 -mavx2 -I. -Iinclude lex.yy.cpp lib/debug.cpp lib/error.cpp \ lib/input.cpp lib/matcher.cpp lib/pattern.cpp lib/utf8.cpp lib/simd.cpp \ lib/matcher_avx2.cpp lib/simd_avx2.cpp
To compile with AVX512BW optimizations applied and run-time detection of AVX512BW using AVX2 or SSE2 as a fallback optimization when the CPU does not support AVX512BW:
c++ -DHAVE_AVX512BW -mavx512bw -I. -Iinclude lex.yy.cpp lib/debug.cpp lib/error.cpp \ lib/input.cpp lib/matcher.cpp lib/pattern.cpp lib/utf8.cpp lib/simd.cpp \ lib/matcher_avx2.cpp lib/matcher_avx512bw.cpp lib/simd_avx2.cpp lib/simd_avx512bw.cpp
Runtime memory usage is determined by two entities, the pattern DFA and the input buffer:
Use reflex
option −−full
to create a statically-allocated table DFA for the scanner's regular expression patterns or option −−fast
to generate a direct-coded DFA. Without one of these options, by default a DFA is created at runtime and stored in heap space.
Compile with -DREFLEX_BUFSZ=16384
to override the internal buffer reflex::AbstractMatcher::Const::BUFSZ
size. By default, the buffer size is 256K, which is optimal for high-performance file searching and tokenization. The buffer is a sliding window over the input, i.e. input files may be much larger than the buffer size. A reasonably small REFLEX_BUFSZ
is 16384 for a 16K buffer. A small buffer automatically expands to accommodate larger pattern matches. However, when using the line()
and wline()
methods, very long lines may not fit and the return string values of line()
and wline()
may be truncated as a result. Furthermore, a small buffer increases processing time, because the buffered window must be moved along the file more frequently, and it increases the cost of decoding UTF-16/32 into UTF-8 multibyte sequences. Note that REFLEX_BUFSZ
should not be less than 4096.
All RE/flex matchers, including reflex::Matcher
, reflex::PCRE2Matcher
, reflex::BoostMatcher
, reflex::StdMatcher
and reflex::FuzzyMatcher
, use an internal buffer of 256K. This buffer is used to search and match input by copying the specified input into this buffer. This allows the input to be modified, such as writing a zero byte to make the character strings returned by text()
and rest()
always 0-terminated. The buffer shifts to handle input larger than 256K by consuming the input in blocks of up to 256K at a time. The buffer only grows to accommodate pattern matches longer than 256K, which will not happen when the regex patterns cannot match byte sequences longer than 256K, e.g. when patterns exclude \n
(newline) characters when the input consists of regular lines of text.
When the data you want to search resides in memory, you can eliminate the overhead of buffer copying as follows. Before searching or matching the data, specify the memory region you want to search at address b
of size n
with reflex::Matcher::buffer(b, n + 1)
. Note that an extra byte after the end of the data must be avalable in this memory region, hence we pass n + 1
to search n
bytes. The final byte at the end of the memory region will be set to zero when unput(c)
, wunput()
, text()
, rest()
or span()
is used. But otherwise the memory region, including the final byte, remains completely untouched and you can safely specify n + 1
even when the allocated region has n
bytes of data.
For example:
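A sketch (the data and pattern are illustrative):

    #include <reflex/matcher.h>
    #include <cstring>
    #include <iostream>
    #include <vector>

    int main()
    {
      const char *text = "The quick brown fox jumps over the lazy dog";
      std::vector<char> data(text, text + std::strlen(text));
      data.push_back('\0');                        // the required extra byte at the end
      reflex::Matcher matcher("\\w+");
      matcher.buffer(data.data(), data.size());    // zero copy: size includes the extra byte
      while (matcher.find())
        std::cout << matcher.text() << std::endl;
      return 0;
    }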
See also A flexible regex library (towards the end of the section) and Switching input sources on zero copy overhead with RE/flex lexers.
When a matcher object is constructed as a temporary in a range-based loop, it is destroyed before the loop body can use it to find all matches. This means that the following example crashes:
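A sketch of the problem (pattern and input are illustrative):

    // BROKEN: the temporary matcher is destroyed before the loop body runs
    for (auto& match : reflex::Matcher("\\w+", "how now brown cow").find)
      std::cout << match.text() << std::endl;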
Instead, write:
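For instance, naming the matcher so it outlives the loop:

    reflex::Matcher matcher("\\w+", "how now brown cow");
    for (auto& match : matcher.find)
      std::cout << match.text() << std::endl;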
Note that C++23 compilers handle this correctly, because lifetime extension of temporaries in range-based for loops was adopted into the C++23 standard.
Please report bugs as RE/flex GitHub issues.
Please make sure to install the RE/flex library you download and remove old versions of RE/flex, or otherwise prevent mixing old with new versions. Mixing old with new versions causes problems: for example, when new RE/flex header files are imported into your project but an old version of the RE/flex library is still linked with your code, the library will likely misbehave.
Download RE/flex from SourceForge or visit the RE/flex GitHub repository.
RE/flex software is released under the BSD-3 license. All parts of the software have reasonable copyright terms permitting free redistribution. This includes the ability to reuse all or parts of the RE/flex source tree.
Copyright (c) 2016, Robert van Engelen, Genivia Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
(1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
(2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
(3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The Free Software Foundation maintains a BSD-3 License Wiki.
Copyright (c) 2016,2024, Robert van Engelen. All rights reserved.