API

generic_lexer package

Submodules

generic_lexer.errors module

exception generic_lexer.errors.LexerError(char, text_buffer_pointer)

Bases: Exception

Lexer error exception.

Parameters:

text_buffer_pointer (int) – the position in the text buffer where the error occurred
char (str) – the character that triggered the error

char

text_buffer_pointer

generic_lexer.lexer module

class generic_lexer.lexer.Lexer(rules, skip_whitespace=False, text_buffer='')

Bases: object

A simple pattern-based lexer/tokenizer. All the regexes are concatenated into a single one with named groups. The group names must be valid Python identifiers. Patterns that define no named groups get one generated automatically. Groups are then mapped to token names.

Parameters:

rules (Mapping[str, str]) – A mapping of rules. Each key is the type of the token to return when it is recognized, and each value is the regular expression used to recognize that token.
skip_whitespace (bool) – If True, whitespace (\s+) is skipped and not reported by the lexer. Otherwise, you must define your own rules for whitespace, or it will be flagged as an error.
text_buffer (str) – the string to generate the tokens from
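
The single-regex approach described above can be sketched with the standard re module. This is a minimal illustration, not the library's implementation: the helper build_master_pattern and the _auto_ group prefix are hypothetical names, and the sketch assumes the rule patterns contain no named groups of their own.

```python
import re


def build_master_pattern(rules):
    """Join every rule into one alternation of named groups (hypothetical helper)."""
    group_to_token = {}
    parts = []
    for index, (token_name, pattern) in enumerate(rules.items()):
        # Generate a group name that is always a valid Python identifier.
        group_name = f"_auto_{index}"
        group_to_token[group_name] = token_name
        parts.append(f"(?P<{group_name}>{pattern})")
    return re.compile("|".join(parts)), group_to_token


rules = {"NUMBER": r"\d+", "PLUS": r"\+", "NAME": r"[a-z_]+"}
master, group_to_token = build_master_pattern(rules)

match = master.match("count+1")
# lastgroup names the alternative that matched; map it back to the token name.
token_name = group_to_token[match.lastgroup]
print(token_name, match.group(match.lastgroup))  # NAME count
```

Wrapping each rule in its own group keeps alternations inside one rule from leaking into the next, and the lookup table turns the generated group name back into the token name the caller supplied.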

clear_text_buffer()

Set the text buffer to an empty string and reset the text pointer to 0.

Return type: None

property current_char: str

get_char_at(buffer_pointer)

Return type: str

get_char_at_current_pointer()

Return type: str

get_text_buffer()

Get the text that the lexer is currently set to tokenize.

Return type: str

pattern_token(token_name, pattern)

Return type: None

set_text_buffer(value)

Set the text for the lexer to tokenize and reset the pointer to 0.

Return type: None

property text_buffer: str

Get, set, or clear the text buffer. Using del on this property clears the text buffer.
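
A property that supports get, set, and del can be written with Python's getter, setter, and deleter decorators. The class below is a hypothetical stand-in (BufferHolder and its private attributes are not part of the library); it assumes the property mirrors the behavior documented for get_text_buffer(), set_text_buffer(), and clear_text_buffer().

```python
class BufferHolder:
    """Hypothetical stand-in showing a get/set/delete property."""

    def __init__(self, text=""):
        self._text = text
        self._pointer = 0

    @property
    def text_buffer(self):
        return self._text

    @text_buffer.setter
    def text_buffer(self, value):
        self._text = value
        self._pointer = 0  # new text, so reading starts from the beginning

    @text_buffer.deleter
    def text_buffer(self):
        # "del holder.text_buffer" empties the buffer instead of removing the attribute.
        self._text = ""
        self._pointer = 0


holder = BufferHolder()
holder.text_buffer = "a = 1"
print(repr(holder.text_buffer))  # 'a = 1'
del holder.text_buffer
print(repr(holder.text_buffer))  # ''
```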

tokens(skip_whitespace=False)

Parameters: skip_whitespace (bool) – works like the skip_whitespace argument of lexer.Lexer, applied to the current method call.
Raises: generic_lexer.errors.LexerError – raised with the position and character of the error when the current chunk of the buffer matches no rule.
Yields: the next token (a Token object) found in Lexer.text_buffer.
Return type: Iterator[Token]
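
The behavior described for tokens() (yield one token at a time, optionally skip whitespace, raise when nothing matches) can be sketched as a generator. This is an illustration under stated assumptions, not the library's code: SketchLexerError and sketch_tokens are hypothetical names, tokens are plain tuples, and rule names are used directly as group names, so they must be valid identifiers and the patterns must not contain groups.

```python
import re


class SketchLexerError(Exception):
    """Hypothetical stand-in for LexerError: keeps the offending char and position."""

    def __init__(self, char, text_buffer_pointer):
        super().__init__(f"unexpected {char!r} at position {text_buffer_pointer}")
        self.char = char
        self.text_buffer_pointer = text_buffer_pointer


def sketch_tokens(rules, text, skip_whitespace=False):
    """Yield (name, position, value) tuples; raise when no rule matches."""
    master = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in rules.items()))
    whitespace = re.compile(r"\s+")
    pointer = 0
    while pointer < len(text):
        if skip_whitespace:
            gap = whitespace.match(text, pointer)
            if gap:
                pointer = gap.end()
                continue
        match = master.match(text, pointer)
        # Treat an empty match as a failure so the loop always advances.
        if match is None or match.end() == pointer:
            raise SketchLexerError(text[pointer], pointer)
        yield match.lastgroup, pointer, match.group()
        pointer = match.end()


rules = {"NAME": r"[a-z_]+", "EQUALS": r"=", "NUMBER": r"\d+"}
tokens = list(sketch_tokens(rules, "x = 42", skip_whitespace=True))
print(tokens)  # [('NAME', 0, 'x'), ('EQUALS', 2, '='), ('NUMBER', 4, '42')]

try:
    list(sketch_tokens(rules, "x = 42"))  # no whitespace rule, so the space fails
except SketchLexerError as error:
    print(repr(error.char), error.text_buffer_pointer)  # ' ' 1
```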

generic_lexer.logging module

generic_lexer.token module

class generic_lexer.token.Token(name, position, val)

Bases: object

A simple Token structure. Contains the token name, value and position.

Unlike the original gist, this implementation lets you specify multiple named groups per token.

You may get the values of the tokens this way:

>>> from generic_lexer import Lexer
>>> rules = {
...     "VARIABLE": r"(?P<var_name>[a-z_]+):(?P<var_type>[A-Z]\w+)",
...     "EQUALS": r"=",
...     "STRING": r"\".*\"",
... }
>>> data = "first_word:String = \"Hello\""
>>> variable, equals, string = tuple(Lexer(rules, True, data))

>>> variable
VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0

>>> variable.val
{'var_name': 'first_word', 'var_type': 'String'}
>>> variable["var_name"]
'first_word'
>>> variable["var_type"]
'String'

>>> equals
EQUALS('=') at 18

>>> equals.val
'='

>>> string
STRING('"Hello"') at 20

>>> string.val
'"Hello"'

Parameters:

name (str) – the name of the token
position (int) – the position the token was found in the text buffer
val (Any) – token’s value

lexer: Lexer

name: str

position: int

property type: str

Provided for compatibility.

property val: Union[Dict[str, str], str]
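
The Union[Dict[str, str], str] type of val follows from the doctest above: a rule with named groups produces a dict, and a rule without them produces the matched string. One way such a structure could be put together is sketched below; SketchToken is a hypothetical stand-in, not the library's Token class, and deriving val from groupdict() is an assumption.

```python
import re


class SketchToken:
    """Hypothetical stand-in for Token; not the library's class."""

    def __init__(self, name, position, val):
        self.name = name
        self.position = position
        self.val = val

    def __getitem__(self, group_name):
        # Only meaningful when val is a dict of named groups.
        return self.val[group_name]

    def __repr__(self):
        return f"{self.name}({self.val!r}) at {self.position}"


pattern = r"(?P<var_name>[a-z_]+):(?P<var_type>[A-Z]\w+)"
match = re.match(pattern, "first_word:String")
# Use the named groups when there are any, otherwise the whole matched text.
variable = SketchToken("VARIABLE", match.start(), match.groupdict() or match.group())
print(variable)  # VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0
print(variable["var_type"])  # String

plain = re.match(r"=", "=")
equals = SketchToken("EQUALS", 18, plain.groupdict() or plain.group())
print(equals)  # EQUALS('=') at 18
```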

Module contents

class generic_lexer.Lexer(rules, skip_whitespace=False, text_buffer='')

See generic_lexer.lexer.Lexer above; the documentation is identical.

exception generic_lexer.LexerError(char, text_buffer_pointer)

See generic_lexer.errors.LexerError above; the documentation is identical.

class generic_lexer.Token(name, position, val)

See generic_lexer.token.Token above; the documentation is identical.
