Welcome to pressagio’s documentation!
Pressagio is a library that predicts text based on n-gram models. For example, you can send a string and the library will return the most likely word completions for the last token in the string.
Example Usage
The repository contains two example scripts in the folder example that demonstrate how to build a language model and how to use that model for prediction. You can check the code of those two scripts to see how to use pressagio in your own projects. Here is how to use the two scripts to predict the next word in a phrase.
First, you have to build a language model. We will use the script example/text2ngram.py to add 1-, 2- and 3-grams of a given text to a sqlite database. For demonstration purposes we will use a simple text file that comes with pressagio’s tests. You have to run the script three times to create a table for each of the n-grams:
$ python example/text2ngram.py -n 1 -o test.sqlite tests/test_data/der_linksdenker.txt
$ python example/text2ngram.py -n 2 -o test.sqlite tests/test_data/der_linksdenker.txt
$ python example/text2ngram.py -n 3 -o test.sqlite tests/test_data/der_linksdenker.txt
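The commands above do everything through the example script. If you prefer to build the n-gram tables from your own code, the tokenizer and database connector documented below can be combined along these lines. This is only a rough sketch of what example/text2ngram.py does, assuming that NgramMap.items() yields (ngram, count) pairs; see the API documentation below for the exact interfaces:

from pressagio.dbconnector import SqliteDatabaseConnector
from pressagio.tokenizer import forward_tokenize_file

# Build the table of 2-grams; repeat with ngram_size 1 and 3 for the
# other two tables.
ngram_map = forward_tokenize_file(
    "tests/test_data/der_linksdenker.txt", ngram_size=2
)

connector = SqliteDatabaseConnector("test.sqlite", cardinality=2)
connector.open_database()
connector.create_ngram_table(2)  # creates the table _2_gram
for ngram, count in ngram_map.items():  # assumed to yield (ngram, count)
    connector.insert_ngram(ngram, count)
connector.commit()
connector.close_database()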
Running text2ngram.py three times as shown creates a file test.sqlite in the current directory. We can now use this database to get a prediction for a phrase. We will use the script example/predict.py, which uses the configuration file example/example_profile.ini. Note that you will always need a configuration file if you want to use the built-in predictor. To get a prediction, call:
$ python example/predict.py
['warm', 'der', 'und', 'die', 'nicht']
The script will just output a list of predictions.
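The example script feeds the text context into the library through a callback. A minimal sketch of that flow is shown below; it assumes a top-level Pressagio class and a pressagio.callback.Callback base class with past_stream() and future_stream() methods, so check example/predict.py for the exact interface in your version:

import configparser

import pressagio
import pressagio.callback

class StringCallback(pressagio.callback.Callback):
    """Callback that serves a fixed string as the past context (assumed API)."""

    def __init__(self, buffer):
        super().__init__()
        self.buffer = buffer

    def past_stream(self):
        return self.buffer

    def future_stream(self):
        return ""

config = configparser.ConfigParser()
config.read("example/example_profile.ini")

callback = StringCallback("Es war einmal ein ")
prsgio = pressagio.Pressagio(callback, config)
print(prsgio.predict())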
Running the tests
$ python -m unittest discover
API documentation
pressagio Package
pressagio.callback
Base class for callbacks.
pressagio.character
pressagio.combiner
Combiner classes to merge results from several predictors.
pressagio.context_tracker
Class for context tracker.
pressagio.dbconnector
Classes to connect to databases.
class pressagio.dbconnector.DatabaseConnector(dbname, cardinality=1)
Base class for all database connectors.

Methods:
- create_bigram_table(self): Creates a table for n-grams of cardinality 2.
- create_index(self, cardinality): Creates an index for the table with the given cardinality.
- create_ngram_table(self, cardinality): Creates a table for n-grams of a given cardinality.
- create_trigram_table(self): Creates a table for n-grams of cardinality 3.
- create_unigram_table(self): Creates a table for n-grams of cardinality 1.
- delete_index(self, cardinality): Deletes the index for the table with the given cardinality.
- delete_ngram_table(self, cardinality): Deletes the table for n-grams of a given cardinality.
- insert_ngram(self, ngram, count): Inserts a given n-gram with count into the database.
- ngram_count(self, ngram): Gets the count for a given ngram from the database.
- ngrams(self, with_counts=False): Returns all ngrams that are in the table.
- remove_ngram(self, ngram): Removes a given ngram from the database.
- update_ngram(self, ngram, count): Updates a given ngram in the database.
- Further methods: close_database, execute_sql, increment_ngram_count, ngram_like_table, ngram_like_table_filtered, open_database, unigram_counts_sum
__init__(self, dbname, cardinality=1)
Constructor of the base class DatabaseConnector.

Parameters:
- dbname : str
  Path to the database file or database name.
- cardinality : int
  Default cardinality for n-grams.
create_index(self, cardinality)
Creates an index for the table with the given cardinality.

Parameters:
- cardinality : int
  The cardinality to create an index for.
create_ngram_table(self, cardinality)
Creates a table for n-grams of a given cardinality. The table name is constructed from this parameter; for example, for cardinality 2 a table _2_gram will be created.

Parameters:
- cardinality : int
  The cardinality to create a table for.
delete_index(self, cardinality)
Deletes the index for the table with the given cardinality.

Parameters:
- cardinality : int
  The cardinality of the index to delete.
delete_ngram_table(self, cardinality)
Deletes the table for n-grams of a given cardinality. The table name is constructed from this parameter; for example, for cardinality 2 the table _2_gram will be deleted.

Parameters:
- cardinality : int
  The cardinality of the table to delete.
insert_ngram(self, ngram, count)
Inserts a given n-gram with count into the database.

Parameters:
- ngram : iterable of str
  A list, set or tuple of strings.
- count : int
  The count for the given n-gram.
ngram_count(self, ngram)
Gets the count for a given ngram from the database.

Parameters:
- ngram : iterable of str
  A list, set or tuple of strings.

Returns:
- count : int
  The count of the ngram.
ngrams(self, with_counts=False)
Returns all ngrams that are in the table.

Parameters: None

Returns:
- ngrams : generator
  A generator for ngram tuples.
class pressagio.dbconnector.PostgresDatabaseConnector(dbname, cardinality=1, host='localhost', port=5432, user='postgres', password=None, connection=None)
Database connector for postgres databases.

Methods:
- close_database(self): Closes the database connection.
- commit(self): Sends a commit to the database.
- create_bigram_table(self): Creates a table for n-grams of cardinality 2.
- create_database(self): Creates an empty database if it does not exist.
- create_index(self, cardinality): Creates an index for the table with the given cardinality.
- create_ngram_table(self, cardinality): Creates a table for n-grams of a given cardinality.
- create_trigram_table(self): Creates a table for n-grams of cardinality 3.
- create_unigram_table(self): Creates a table for n-grams of cardinality 1.
- delete_index(self, cardinality): Deletes the index for the table with the given cardinality.
- delete_ngram_table(self, cardinality): Deletes the table for n-grams of a given cardinality.
- execute_sql(self, query): Executes a given query string on an open postgres database.
- insert_ngram(self, ngram, count): Inserts a given n-gram with count into the database.
- ngram_count(self, ngram): Gets the count for a given ngram from the database.
- ngrams(self, with_counts=False): Returns all ngrams that are in the table.
- open_database(self): Opens the database connection.
- remove_ngram(self, ngram): Removes a given ngram from the database.
- reset_database(self): Re-creates an empty database.
- update_ngram(self, ngram, count): Updates a given ngram in the database.
- Further methods: increment_ngram_count, ngram_like_table, ngram_like_table_filtered, unigram_counts_sum
__init__(self, dbname, cardinality=1, host='localhost', port=5432, user='postgres', password=None, connection=None)
Constructor for the postgres database connector.

Parameters:
- dbname : str
  The database name.
- cardinality : int
  Default cardinality for n-grams.
- host : str
  Hostname of the postgres database.
- port : int
  Port number of the postgres database.
- user : str
  User name for the postgres database.
- password : str
  User password for the postgres database.
- connection : connection
  An open database connection.
create_index(self, cardinality)
Creates an index for the table with the given cardinality.

Parameters:
- cardinality : int
  The cardinality to create an index for.
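The following is a rough connection sketch for the postgres connector, using only the constructor parameters and methods documented above. The database name and credentials are placeholders, a running PostgreSQL server is assumed, and the call order follows the method descriptions rather than a verified recipe:

from pressagio.dbconnector import PostgresDatabaseConnector

# Placeholder database name and credentials; adjust to your setup.
connector = PostgresDatabaseConnector(
    "pressagio_test", cardinality=2, host="localhost", port=5432,
    user="postgres", password="secret"
)
connector.create_database()      # create the database if it does not exist
connector.open_database()
connector.create_ngram_table(2)  # creates the table _2_gram
connector.insert_ngram(["der", "tag"], 1)
connector.commit()
print(connector.ngram_count(["der", "tag"]))
connector.close_database()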
class pressagio.dbconnector.SqliteDatabaseConnector(dbname, cardinality=1)
Database connector for sqlite databases.

Methods:
- close_database(self): Closes the sqlite database.
- commit(self): Sends a commit to the database.
- create_bigram_table(self): Creates a table for n-grams of cardinality 2.
- create_index(self, cardinality): Creates an index for the table with the given cardinality.
- create_ngram_table(self, cardinality): Creates a table for n-grams of a given cardinality.
- create_trigram_table(self): Creates a table for n-grams of cardinality 3.
- create_unigram_table(self): Creates a table for n-grams of cardinality 1.
- delete_index(self, cardinality): Deletes the index for the table with the given cardinality.
- delete_ngram_table(self, cardinality): Deletes the table for n-grams of a given cardinality.
- execute_sql(self, query): Executes a given query string on an open sqlite database.
- insert_ngram(self, ngram, count): Inserts a given n-gram with count into the database.
- ngram_count(self, ngram): Gets the count for a given ngram from the database.
- ngrams(self, with_counts=False): Returns all ngrams that are in the table.
- open_database(self): Opens the sqlite database.
- remove_ngram(self, ngram): Removes a given ngram from the database.
- update_ngram(self, ngram, count): Updates a given ngram in the database.
- Further methods: increment_ngram_count, ngram_like_table, ngram_like_table_filtered, unigram_counts_sum
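Once the test.sqlite file from the Example Usage section exists, the connector can be used to inspect the stored n-grams directly. This is a small sketch using only the methods documented above; the example n-gram is an illustration, and the rows yielded by ngrams() are assumed to be the raw table rows:

from pressagio.dbconnector import SqliteDatabaseConnector

connector = SqliteDatabaseConnector("test.sqlite", cardinality=2)
connector.open_database()

# Count of a single 2-gram (the words are just an illustration).
print(connector.ngram_count(["der", "tag"]))

# Iterate over all stored 2-grams together with their counts.
for row in connector.ngrams(with_counts=True):
    print(row)

connector.close_database()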
pressagio.predictor
Classes for predictors and to handle suggestions and predictions.
class pressagio.predictor.Prediction
Class for predictions from predictors.

Methods:
Prediction subclasses Python’s built-in list and inherits its standard methods (append, clear, copy, count, extend, index, insert, pop, remove, reverse, sort). Methods specific to Prediction: add_suggestion, suggestion_for_token.
class pressagio.predictor.Predictor(config, context_tracker, predictor_name, short_desc=None, long_desc=None)
Base class for predictors.

Methods: token_satifies_filter
class pressagio.predictor.PredictorActivator(config, registry, context_tracker)
PredictorActivator starts the execution of the active predictors, monitors their execution and collects the returned predictions, or terminates a predictor’s execution if it exceeds its maximum prediction time.
The predictions returned by the individual predictors are combined into a single prediction by the active Combiner.

Attributes:
- combination_policy: The combination_policy property.

Methods: predict

__init__(self, config, registry, context_tracker)
Initialize self. See help(type(self)) for accurate signature.

combination_policy
The combination_policy property.
class pressagio.predictor.PredictorRegistry(config, dbconnection=None)
Manages instantiation and iteration through predictors and aids in generating predictions and learning.
The PredictorRegistry class holds the active predictors and provides the interface required to obtain an iterator over the predictors.
The standard use case is: a Predictor obtains an iterator from PredictorRegistry and invokes the predict() or learn() method on each Predictor pointed to by the iterator.
The predictor registry should eventually just be a simple wrapper around plump.

Attributes:
- context_tracker: The context_tracker property.

Methods:
PredictorRegistry subclasses Python’s built-in list and inherits its standard methods (append, clear, copy, count, extend, index, insert, pop, remove, reverse, sort). Methods specific to PredictorRegistry: add_predictor, close_database, set_predictors.

__init__(self, config, dbconnection=None)
Initialize self. See help(type(self)) for accurate signature.

context_tracker
The context_tracker property.
class pressagio.predictor.SmoothedNgramPredictor(config, context_tracker, predictor_name, short_desc=None, long_desc=None, dbconnection=None)
Calculates predictions from an n-gram model in a sqlite database. You have to create the database with the script text2ngram first.

Attributes:
- database: The database property.
- deltas: The deltas property.
- learn_mode: The learn_mode property.

Methods: close_database, init_database_connector_if_ready, ngram_to_string, predict, token_satifies_filter

__init__(self, config, context_tracker, predictor_name, short_desc=None, long_desc=None, dbconnection=None)
Initialize self. See help(type(self)) for accurate signature.
class pressagio.predictor.Suggestion(word, probability)
Class for a simple suggestion, consisting of a string and a probability for that string.

Attributes:
- probability: The probability property.

__init__(self, word, probability)
Initialize self. See help(type(self)) for accurate signature.

probability
The probability property.
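A small sketch of how Suggestion and Prediction fit together, using the documented constructors. The word attribute on Suggestion and the argument of add_suggestion() are assumptions based on the descriptions above:

from pressagio.predictor import Prediction, Suggestion

# A Suggestion pairs a word with a probability (documented constructor).
warm = Suggestion("warm", 0.3)
kalt = Suggestion("kalt", 0.1)

# Prediction is a list subclass; add_suggestion() is assumed to take a
# Suggestion instance.
prediction = Prediction()
prediction.add_suggestion(warm)
prediction.add_suggestion(kalt)

for suggestion in prediction:
    # word is assumed to be a plain attribute; probability is documented.
    print(suggestion.word, suggestion.probability)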
pressagio.tokenizer
Several classes to tokenize text.
class pressagio.tokenizer.ForwardTokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')

Methods:
- count_characters(self): Counts the number of unicode characters in the IO stream.
- is_blankspace(self, char): Test if a character is a blankspace.
- is_separator(self, char): Test if a character is a separator.
- Further methods: count_tokens, has_more_tokens, next_token, progress, reset_stream

__init__(self, text, blankspaces=..., separators=...)
Constructor of the Tokenizer base class.

Parameters:
- text : str
  The text to tokenize.
- blankspaces : str
  The characters that represent empty spaces.
- separators : str
  The characters that separate token units (e.g. word boundaries).
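A quick sketch of iterating over the tokens of a string with the forward tokenizer, using only methods listed above:

from pressagio.tokenizer import ForwardTokenizer

# Tokenize a short string from left to right with the default
# blankspace and separator characters.
tokenizer = ForwardTokenizer("Der Linksdenker ist ein kurzer Text.")

while tokenizer.has_more_tokens():
    token = tokenizer.next_token()
    if token:  # skip empty tokens the stream may yield at boundaries
        print(token)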
class pressagio.tokenizer.NgramMap
A memory efficient store for ngrams.
This class is optimized for memory consumption; it might be slower than other ngram stores. It is also optimized for a three step process:
- Add all ngrams.
- Perform a cutoff operation (optional).
- Read the list of ngrams.
It might not perform well for other use cases.

Methods:
- add(self, ngram_indices): Add an ngram to the store.
- add_token(self, token): Add a token to the internal string store.
- cutoff(self, cutoff): Perform a cutoff on the ngram store.
- items(self): Get the ngrams from the store.

add(self, ngram_indices)
Add an ngram to the store.
This will add a list of strings as an ngram to the ngram store. In our standard use case the strings are the indices of the strings; you can get those from the add_token() method.

Parameters:
- ngram_indices : list of str
  The indices of the ngram strings, as strings.

add_token(self, token)
Add a token to the internal string store.
This will only add the token to the internal string store. It will return an index that you can use to create your ngram.
The ngrams are represented as strings of the indices, so we return a string here so that the consumer does not have to do the conversion.

Parameters:
- token : str
  The token to add to the string store.

Returns:
- str
  The index of the token as a string.
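A small sketch of the three step process described above. The exact shape of the values yielded by items() is an assumption; it is treated here as (ngram, count) pairs:

from pressagio.tokenizer import NgramMap

ngram_map = NgramMap()

# Step 1: add all ngrams. add_token() returns the index of a token as a
# string; add() stores a list of such index strings as one ngram.
der = ngram_map.add_token("der")
tag = ngram_map.add_token("tag")
ngram_map.add([der, tag])
ngram_map.add([der, tag])

# Step 2 (optional): cutoff, keeping only ngrams with a frequency higher
# than the given value.
ngram_map.cutoff(1)

# Step 3: read the ngrams; items() is assumed to yield (ngram, count) pairs.
for ngram, count in ngram_map.items():
    print(ngram, count)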
class pressagio.tokenizer.ReverseTokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')

Methods:
- count_characters(self): Counts the number of unicode characters in the IO stream.
- is_blankspace(self, char): Test if a character is a blankspace.
- is_separator(self, char): Test if a character is a separator.
- Further methods: count_tokens, has_more_tokens, next_token, progress, reset_stream

__init__(self, text, blankspaces=..., separators=...)
Constructor of the Tokenizer base class.

Parameters:
- text : str
  The text to tokenize.
- blankspaces : str
  The characters that represent empty spaces.
- separators : str
  The characters that separate token units (e.g. word boundaries).
class pressagio.tokenizer.Tokenizer(text, blankspaces=' \x0c\n\r\t\x0b\x85\xa0\u2009', separators='`~!@#$%^&*()_+=\|]}[{";:/?.>, <¡¿†¨„“”«»।॥ו–—―´’‘‚י0123456789ः')
Base class for all tokenizers.

Methods:
- is_blankspace(self, char): Test if a character is a blankspace.
- is_separator(self, char): Test if a character is a separator.
- Further methods: count_characters, count_tokens, has_more_tokens, next_token, progress, reset_stream

__init__(self, text, blankspaces=..., separators=...)
Constructor of the Tokenizer base class.

Parameters:
- text : str
  The text to tokenize.
- blankspaces : str
  The characters that represent empty spaces.
- separators : str
  The characters that separate token units (e.g. word boundaries).
pressagio.tokenizer.forward_tokenize_file(infile: str, ngram_size: int, lowercase: bool = False, cutoff: int = 0, ngram_map: pressagio.tokenizer.NgramMap = None)
Tokenize a file and return an ngram store.

Parameters:
- infile : str
  The file to parse.
- ngram_size : int
  The size of the ngrams to generate.
- lowercase : bool
  Whether or not to lowercase all tokens.
- cutoff : int
  Perform a cutoff after parsing. We will only return ngrams that have a frequency higher than the cutoff.
- ngram_map : NgramMap
  Pass an existing NgramMap if you want to add the ngrams of the given file to that store. A new NgramMap is created if None.

Returns:
- NgramMap
  The ngram map that allows you to iterate over the ngrams.
pressagio.tokenizer.forward_tokenize_files(infiles: List[str], ngram_size: int, lowercase: bool = False, cutoff: int = 0)
Tokenize a list of files and return an ngram store.

Parameters:
- infiles : list of str
  The files to parse.
- ngram_size : int
  The size of the ngrams to generate.
- lowercase : bool
  Whether or not to lowercase all tokens.
- cutoff : int
  Perform a cutoff after parsing. We will only return ngrams that have a frequency higher than the cutoff.

Returns:
- NgramMap
  The ngram map that allows you to iterate over the ngrams.
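A usage sketch for forward_tokenize_files; the file paths are placeholders, and items() on the returned NgramMap is assumed to yield (ngram, count) pairs:

from pressagio.tokenizer import forward_tokenize_files

# Placeholder file paths; any plain-text files will do.
files = ["corpus/part1.txt", "corpus/part2.txt"]

# Collect lowercased 3-grams from all files, keeping only ngrams with a
# frequency higher than 1.
ngram_map = forward_tokenize_files(files, ngram_size=3, lowercase=True, cutoff=1)

for ngram, count in ngram_map.items():  # assumed to yield (ngram, count)
    print(ngram, count)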