API

generic_lexer package

Submodules

generic_lexer.errors module

exception generic_lexer.errors.LexerError(char, text_buffer_pointer)

Bases: Exception

Lexer error exception.

Parameters:
  • text_buffer_pointer (int) – position in the text buffer where the error occurred.

  • char (str) – the character that triggered the error

char
text_buffer_pointer

generic_lexer.lexer module

class generic_lexer.lexer.Lexer(rules, skip_whitespace=False, text_buffer='')

Bases: object

A simple pattern-based lexer/tokenizer. All the rule regexes are concatenated into a single regex with named groups. Group names must be valid Python identifiers. Patterns that define no named group get one generated automatically. Groups are then mapped to token names.

Parameters:
  • rules (Mapping[str, str]) – A mapping of rules. Each key is the type of the token to return when it is recognized, and each value is the regular expression used to recognize that token.

  • skip_whitespace (bool) – If True, whitespace (\s+) is skipped and not reported by the lexer. Otherwise, you must supply your own rules for whitespace, or it will be flagged as an error.

  • text_buffer (str) – the string to generate the tokens from
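
The concatenation described above can be illustrated with the standard re module. This is a minimal sketch, not the library's actual implementation: the helper combine_rules is a hypothetical name, and it assumes each rule is simply wrapped in an outer group named after its token.

```python
import re


def combine_rules(rules):
    """Join all rules into one alternation (hypothetical helper, not library code).

    Each pattern is wrapped in an outer group named after its token, so the
    token name must be a valid Python identifier.
    """
    parts = []
    for token_name, pattern in rules.items():
        if not token_name.isidentifier():
            raise ValueError(f"{token_name!r} is not a valid group name")
        parts.append(f"(?P<{token_name}>{pattern})")
    return re.compile("|".join(parts))


rules = {
    "NUMBER": r"\d+",
    "NAME": r"[a-z_]+",
    "EQUALS": r"=",
}
combined = combine_rules(rules)

# lastgroup reports which outer group matched, i.e. the token name.
match = combined.match("count=42")
print(match.lastgroup, match.group())  # NAME count
```

Because alternatives are tried left to right, the order of the rules in the mapping decides which token wins when several patterns could match at the same position.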

clear_text_buffer()

Set the text buffer to a blank string and set the text pointer to 0

Return type:

None

property current_char: str

get_char_at(buffer_pointer)
Return type:

str

get_char_at_current_pointer()
Return type:

str

get_text_buffer()

Get the text that the lexer is currently set to tokenize

Return type:

str

pattern_token(token_name, pattern)
Return type:

None

set_text_buffer(value)

Set the text for the lexer to tokenize and reset the pointer to 0

Return type:

None

property text_buffer: str

Set, get, or clear the text buffer. You may use del with this property to clear it.
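
For readers unfamiliar with deletable properties, the following sketch shows how a property can support assignment and del at the same time. The class BufferHolder and its private attributes are hypothetical and exist only for illustration; resetting the pointer on set and delete is assumed from the descriptions of set_text_buffer() and clear_text_buffer().

```python
class BufferHolder:
    """Hypothetical class; only the property wiring is the point."""

    def __init__(self, text_buffer: str = "") -> None:
        self._text_buffer = text_buffer
        self._pointer = 0

    @property
    def text_buffer(self) -> str:
        # Getter: return the stored text.
        return self._text_buffer

    @text_buffer.setter
    def text_buffer(self, value: str) -> None:
        # Setter: store new text and rewind the pointer.
        self._text_buffer = value
        self._pointer = 0

    @text_buffer.deleter
    def text_buffer(self) -> None:
        # Deleter: "del holder.text_buffer" blanks the text instead of
        # removing the attribute.
        self._text_buffer = ""
        self._pointer = 0


holder = BufferHolder("a = 1")
holder.text_buffer = "b = 2"   # calls the setter
del holder.text_buffer         # calls the deleter
print(repr(holder.text_buffer))  # ''
```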

tokens(skip_whitespace=False)
Parameters:

skip_whitespace (bool) – behaves like the skip_whitespace argument passed to lexer.Lexer, but applies only to the current method call.

Raises:

generic_lexer.errors.LexerError – raised with the position and character of the error in case of a lexing error (if the current chunk of the buffer matches no rule).

Yields:

the next token (a Token object) found in the Lexer.text_buffer.

Return type:

Iterator[Token]
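
The scanning behaviour described for tokens() can be sketched as a generator over a combined regex. This is an illustration under stated assumptions rather than the library's code: SketchLexerError and sketch_tokens are hypothetical names, tokens are yielded as plain (name, text, position) tuples instead of Token objects, and a zero-length match is treated as an error so the loop always advances.

```python
import re
from typing import Iterator, Mapping, Tuple


class SketchLexerError(Exception):
    """Hypothetical stand-in for LexerError; stores the same two attributes."""

    def __init__(self, char: str, text_buffer_pointer: int) -> None:
        super().__init__(f"unexpected {char!r} at position {text_buffer_pointer}")
        self.char = char
        self.text_buffer_pointer = text_buffer_pointer


def sketch_tokens(
    rules: Mapping[str, str], text: str, skip_whitespace: bool = False
) -> Iterator[Tuple[str, str, int]]:
    """Yield (token_name, matched_text, position) tuples from text."""
    combined = re.compile(
        "|".join(f"(?P<{name}>{pattern})" for name, pattern in rules.items())
    )
    whitespace = re.compile(r"\s+")
    pointer = 0
    while pointer < len(text):
        if skip_whitespace:
            ws = whitespace.match(text, pointer)
            if ws:
                pointer = ws.end()
                continue
        match = combined.match(text, pointer)
        # No rule matched here (or only an empty match): report the character.
        if match is None or match.end() == pointer:
            raise SketchLexerError(text[pointer], pointer)
        yield match.lastgroup, match.group(), pointer
        pointer = match.end()


rules = {"NAME": r"[a-z_]+", "EQUALS": r"="}
print(list(sketch_tokens(rules, "x = y", skip_whitespace=True)))
# [('NAME', 'x', 0), ('EQUALS', '=', 2), ('NAME', 'y', 4)]
```

With skip_whitespace left at False and no whitespace rule, the same input raises the error at position 1, where the first space sits.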

generic_lexer.logging module

generic_lexer.token module

class generic_lexer.token.Token(name, position, val)

Bases: object

A simple Token structure. Contains the token name, value and position.

Unlike the original gist, a single token pattern can define multiple named groups.

You may get the values of the tokens this way:

>>> from generic_lexer import Lexer
>>> rules = {
...     "VARIABLE": r"(?P<var_name>[a-z_]+):(?P<var_type>[A-Z]\w+)",
...     "EQUALS": r"=",
...     "STRING": r"\".*\"",
... }
>>> data = "first_word:String = \"Hello\""
>>> variable, equals, string = tuple(Lexer(rules, True, data))

>>> variable
VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0

>>> variable.val
{'var_name': 'first_word', 'var_type': 'String'}
>>> variable["var_name"]
'first_word'
>>> variable["var_type"]
'String'

>>> equals
EQUALS('=') at 18

>>> equals.val
'='

>>> string
STRING('"Hello"') at 20

>>> string.val
'"Hello"'

Parameters:
  • name (str) – the name of the token

  • position (int) – the position the token was found in the text buffer

  • val (Any) – token’s value

lexer: Lexer
name: str
position: int
property type: str

For compatibility

property val: Union[Dict[str, str], str]
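
The Union[Dict[str, str], str] type of val reflects the two cases in the example above: a dict when the rule defines named groups, a plain string otherwise. One way such a value could be derived with the standard re module is sketched here; extract_value is a hypothetical helper, and how the library itself makes this decision is an assumption.

```python
import re
from typing import Dict, Union


def extract_value(pattern: str, text: str) -> Union[Dict[str, str], str]:
    """Return the named groups as a dict, or the matched text if there are none."""
    match = re.compile(pattern).match(text)
    if match is None:
        raise ValueError("pattern does not match the start of the text")
    groups = match.groupdict()
    # An empty groupdict means the pattern declared no named groups.
    return groups if groups else match.group()


print(extract_value(r"(?P<var_name>[a-z_]+):(?P<var_type>[A-Z]\w+)", "first_word:String"))
# {'var_name': 'first_word', 'var_type': 'String'}
print(extract_value(r"=", "="))
# =
```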

Module contents

class generic_lexer.Lexer(rules, skip_whitespace=False, text_buffer='')

Bases: object

A simple pattern-based lexer/tokenizer. All the rule regexes are concatenated into a single regex with named groups. Group names must be valid Python identifiers. Patterns that define no named group get one generated automatically. Groups are then mapped to token names.

Parameters:
  • rules (Mapping[str, str]) – A mapping of rules. Each key is the type of the token to return when it is recognized, and each value is the regular expression used to recognize that token.

  • skip_whitespace (bool) – If True, whitespace (\s+) is skipped and not reported by the lexer. Otherwise, you must supply your own rules for whitespace, or it will be flagged as an error.

  • text_buffer (str) – the string to generate the tokens from

clear_text_buffer()

Set the text buffer to a blank string and set the text pointer to 0

Return type:

None

property current_char: str

get_char_at(buffer_pointer)
Return type:

str

get_char_at_current_pointer()
Return type:

str

get_text_buffer()

Get the text that the lexer is currently set to tokenize

Return type:

str

pattern_token(token_name, pattern)
Return type:

None

set_text_buffer(value)

Set the text for the lexer to tokenize and reset the pointer to 0

Return type:

None

property text_buffer: str

Set, get, or clear the text buffer. You may use del with this property to clear it.

tokens(skip_whitespace=False)
Parameters:

skip_whitespace (bool) – behaves like the skip_whitespace argument passed to lexer.Lexer, but applies only to the current method call.

Raises:

generic_lexer.errors.LexerError – raised with the position and character of the error in case of a lexing error (if the current chunk of the buffer matches no rule).

Yields:

the next token (a Token object) found in the Lexer.text_buffer.

Return type:

Iterator[Token]

exception generic_lexer.LexerError(char, text_buffer_pointer)

Bases: Exception

Lexer error exception.

Parameters:
  • text_buffer_pointer (int) – position in the text buffer where the error occurred.

  • char (str) – the character that triggered the error

char
text_buffer_pointer

class generic_lexer.Token(name, position, val)

Bases: object

A simple Token structure. Contains the token name, value and position.

Unlike the original gist, a single token pattern can define multiple named groups.

You may get the values of the tokens this way:

>>> from generic_lexer import Lexer
>>> rules = {
...     "VARIABLE": r"(?P<var_name>[a-z_]+):(?P<var_type>[A-Z]\w+)",
...     "EQUALS": r"=",
...     "STRING": r"\".*\"",
... }
>>> data = "first_word:String = \"Hello\""
>>> variable, equals, string = tuple(Lexer(rules, True, data))

>>> variable
VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0

>>> variable.val
{'var_name': 'first_word', 'var_type': 'String'}
>>> variable["var_name"]
'first_word'
>>> variable["var_type"]
'String'

>>> equals
EQUALS('=') at 18

>>> equals.val
'='

>>> string
STRING('"Hello"') at 20

>>> string.val
'"Hello"'

Parameters:
  • name (str) – the name of the token

  • position (int) – the position the token was found in the text buffer

  • val (Any) – token’s value

lexer: Lexer
name: str
position: int
property type: str

For compatibility

property val: Union[Dict[str, str], str]