Welcome to PyNLPl’s documentation!¶
PyNLPl, pronounced as ‘pineapple’, is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL), as well as clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
The library is divided into several packages and modules. It works on Python 2.7 as well as Python 3.
The following modules are available:
pynlpl.datatypes - Extra datatypes (priority queues, patterns, tries)
pynlpl.evaluation - Evaluation & experiment classes (parameter search, wrapped progressive sampling, class evaluation (precision/recall/f-score/auc), sampler, confusion matrix, multithreaded experiment pool)
pynlpl.formats.cgn - Module for parsing CGN (Corpus Gesproken Nederlands) part-of-speech tags
pynlpl.formats.folia - Extensive library for reading and manipulating documents in the FoLiA format (Format for Linguistic Annotation)
pynlpl.formats.fql - Extensive library for the FoLiA Query Language (FQL), built on top of pynlpl.formats.folia. FQL is currently documented here.
pynlpl.formats.cql - Parser for the Corpus Query Language (CQL), as also used by Corpus Workbench and Sketch Engine. Contains a convertor to FQL.
pynlpl.formats.giza - Module for reading GIZA++ word alignment data
pynlpl.formats.moses - Module for reading Moses phrase-translation tables
pynlpl.formats.sonar - Largely obsolete module for pre-releases of the SoNaR corpus; use pynlpl.formats.folia instead
pynlpl.formats.timbl - Module for reading Timbl output (consider using python-timbl instead)
pynlpl.lm.lm - Module for simple language models, as well as a reader for ARPA language model data (used by SRILM)
pynlpl.search - Various search algorithms (breadth-first, depth-first, beam search, hill climbing, A*, and variants of each)
pynlpl.statistics - Frequency lists, Levenshtein, common statistics and information theory functions
pynlpl.textprocessors - Simple tokeniser, n-gram extraction
Common Functions¶
pynlpl.common.Enum(*names)

pynlpl.common.b(s)

pynlpl.common.isstring(s)

pynlpl.common.log(msg, **kwargs)
    Generic log method. Will prepend timestamp.

    Keyword arguments:
    - system – name of the system/module
    - indent – integer denoting the desired level of indentation
    - streams – list of streams to output to
    - stream – stream to output to

pynlpl.common.u(s, encoding='utf-8', errors='strict')
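A minimal usage sketch based on the signatures above (the message and system name are illustrative):
from pynlpl.common import log, isstring

log("Loading data...", system="myscript", indent=1)  # timestamped message to the configured stream(s)
print(isstring("test"))  # True for both str and unicode instances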
Data Types¶
This library contains various extra data types, based to a certain extent on MIT-licensed code from Peter Norvig's AI: A Modern Approach: http://aima.cs.berkeley.edu/python/utils.html
class pynlpl.datatypes.FIFOQueue(data=[])
    A First-In-First-Out queue.

    append(item)

    extend(items)
        Append all elements from items to the queue

    pop()
        Retrieve the next element in line; this will remove it from the queue
class pynlpl.datatypes.Pattern(data, classdecoder=None)

    static fromstring(s, classencoder)

    iterbytes(begin=0, end=0)
class pynlpl.datatypes.PriorityQueue(data=[], f=<function PriorityQueue.<lambda>>, minimize=False, length=0, blockworse=False, blockequal=False, duplicates=True)
    A queue in which the maximum (or minimum) element is returned first, as determined by an external score function f (by default calling the object's score() method). If minimize=True, the item with minimum f(x) is returned first; otherwise it is the item with maximum f(x) or x.score().

    length can be set to an integer > 0, in which case items will only be added to the queue if they score better than or equal to the worst-scoring item; if set to zero, length is unbounded. blockworse can be set to True if you want to prohibit adding worse-scoring items to the queue: only items scoring better than the best one are added. blockequal can be set to True if you also want to prohibit adding equally-scoring items to the queue. (Both parameters default to False.)

    append(item)
        Adds an item to the priority queue (in the right place); returns True if successful, False if the item was blocked (because of a bad score)

    pop()
        Retrieve the next element in line; this will remove it from the queue

    prune(n)
        Prune all but the first (=best) n items

    prunebyscore(score, retainequalscore=False)
        Deletes all items below/above a certain score from the queue, depending on whether minimize is True or False. Note: it is more efficient to use blockworse=True / blockequal=True instead, preventing the addition of 'worse' items in the first place.

    randomprune(n)
        Prune down to n items at random, disregarding their score

    score(i)
        Return the score for item i (cheap lookup); item 0 is always the best item

    stochasticprune(n)
        Prune down to n items; the chance of an item being pruned is inversely proportional to its score
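For example, a minimal sketch based on the constructor and methods above (the scoring function and items are illustrative):
from pynlpl.datatypes import PriorityQueue

pq = PriorityQueue([], f=lambda x: len(x), minimize=False)
pq.append("pineapple")
pq.append("fig")
print(pq.pop())   # "pineapple": with minimize=False the highest-scoring item comes first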
class pynlpl.datatypes.Queue
    Queue is an abstract class/interface. There are three types:

    - Python list: a Last-In-First-Out queue (no Queue object necessary).
    - FIFOQueue(): a First-In-First-Out queue.
    - PriorityQueue(lt): a queue in which items are sorted by lt (default <).

    Each type supports the following methods and functions:

    - q.append(item) – add an item to the queue
    - q.extend(items) – equivalent to: for item in items: q.append(item)
    - q.pop() – return the top item from the queue
    - len(q) – number of items in q (also q.__len__())

    extend(items)
        Append all elements from items to the queue
class pynlpl.datatypes.Tree(value=None, children=None)
    Simple tree structure. Nodes are themselves trees.

    append(item)
        Add an item to the Tree

    leaf()
        Is this a leaf node or not?
class pynlpl.datatypes.Trie(sequence=None)
    Simple trie structure. Nodes are themselves tries; values are stored on the edges, not the nodes.

    append(sequence)

    depth()
        Returns the depth of the current node

    find(sequence)

    items()

    leaf()
        Is this a leaf node or not?

    path()
        Returns the path to the current node

    root()
        Returns True if this is the root of the Trie

    sequence()

    size()
        Size is the number of nodes under the trie, including the current node

    walk(leavesonly=True, maxdepth=None, _depth=0)
        Depth-first search, walking through the trie, returning all encountered nodes (by default only leaves)
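A small usage sketch based on the methods above (the sequences are illustrative; the exact return value of find() is an implementation detail):
from pynlpl.datatypes import Trie

trie = Trie()
trie.append(['t', 'e', 's', 't'])
trie.append(['t', 'e', 'a'])
node = trie.find(['t', 'e'])   # the node reached by following the edges 't', 'e'
print(trie.size())             # number of nodes, including the current (root) node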
Evaluation & Experiments¶
-
class
pynlpl.evaluation.
AbstractExperiment
(inputdata=None, **parameters)¶ -
defaultparameters
()¶
-
delete
()¶
-
done
(warn=True)¶ Is the subprocess done?
-
duration
()¶
-
run
()¶
-
sample
(size)¶ Return a sample of the input data
-
score
()¶
-
start
()¶ Start as a detached subprocess, immediately returning execution to caller.
-
startcommand
(command, cwd, stdout, stderr, *arguments, **parameters)¶
-
wait
()¶
-
-
class
pynlpl.evaluation.
ClassEvaluation
(goals=[], observations=[], missing={}, encoding='utf-8')¶ -
accuracy
(cls=None)¶
-
append
(goal, observation)¶
-
auc
(cls=None, macro=False)¶
-
compute
()¶
-
confusionmatrix
(casesensitive=True)¶
-
fp_rate
(cls=None, macro=False)¶
-
fscore
(cls=None, beta=1, macro=False)¶
-
outputmetrics
()¶
-
precision
(cls=None, macro=False)¶
-
recall
(cls=None, macro=False)¶
-
specificity
(cls=None, macro=False)¶
-
tp_rate
(cls=None, macro=False)¶
-
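For example, a minimal sketch based on the methods above (the gold and predicted classes are illustrative):
from pynlpl.evaluation import ClassEvaluation

e = ClassEvaluation()
e.append('noun', 'noun')   # (goal, observation)
e.append('verb', 'noun')
e.append('noun', 'noun')
print(e.accuracy())
print(e.precision(cls='noun'), e.recall(cls='noun'), e.fscore(cls='noun'))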
-
class
pynlpl.evaluation.
ConfusionMatrix
(tokens=None, casesensitive=True, dovalidation=True)¶ Confusion Matrix
-
class
pynlpl.evaluation.
ExperimentPool
(size)¶ -
append
(experiment)¶
-
poll
(haltonerror=True)¶
-
run
(haltonerror=True)¶
-
start
(experiment)¶
-
-
class
pynlpl.evaluation.
OrdinalEvaluation
(goals=[], observations=[], missing={}, encoding='utf-8')¶ -
compute
()¶
-
mae
(cls=None)¶
-
rmse
(cls=None)¶
-
-
class
pynlpl.evaluation.
ParamSearch
(experimentclass, inputdata, parameterscope, poolsize=1, constraintfunc=None, delete=True)¶ A simpler version of ParamSearch without Wrapped Progressive Sampling
-
exception
pynlpl.evaluation.
ProcessFailed
¶
-
class
pynlpl.evaluation.
WPSParamSearch
(experimentclass, inputdata, size, parameterscope, poolsize=1, sizefunc=None, prunefunc=None, constraintfunc=None, delete=True)¶ ParamSearch with support for Wrapped Progressive Sampling
-
searchbest
()¶
-
test
(i=None)¶
-
-
pynlpl.evaluation.
auc
(x, y, reorder=False)¶ Compute Area Under the Curve (AUC) using the trapezoidal rule
This is a general function, given points on a curve. For computing the area under the ROC curve, see auc_score().
Parameters:
- x (array, shape = [n]) – x coordinates.
- y (array, shape = [n]) – y coordinates.
- reorder (boolean, optional (default=False)) – If True, assume that the curve is ascending in the case of ties, as for an ROC curve. If the curve is non-ascending, the result will be wrong.
Returns: auc
Return type: float
Examples
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr, tpr)
0.75
See also
auc_score()
- Computes the area under the ROC curve
-
pynlpl.evaluation.
filesampler
(files, testsetsize=0.1, devsetsize=0, trainsetsize=0, outputdir='', encoding='utf-8')¶ Extract a training set, a test set and optionally a development set from one file, or from multiple interdependent files (such as a parallel corpus). It is assumed that each line contains one instance (such as a word or a sentence, for example).
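For example (the filenames and output directory are illustrative):
from pynlpl.evaluation import filesampler

# split a parallel corpus into a test set, a development set and a training set,
# keeping the two files in sync line by line
filesampler(['corpus.en.txt', 'corpus.nl.txt'], testsetsize=0.1, devsetsize=0.1, outputdir='splits')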
-
pynlpl.evaluation.
mae
(absolute_error_values)¶
-
pynlpl.evaluation.
rmse
(squared_error_values)¶
FoLiA library¶
This tutorial will introduce the FoLiA Python library, part of PyNLPl. The FoLiA library provides an Application Programming Interface for the reading, creation and manipulation of FoLiA XML documents. The library works under Python 2.7 as well as Python 3, which is the recommended version. The samples in this documentation follow Python 3 conventions.
Prior to reading this document, it is recommended to first read the FoLiA documentation itself and familiarise yourself with the format and underlying paradigm. The FoLiA documentation can be found on the FoLiA website . It is especially important to understand the way FoLiA handles sets/classes, declarations, common attributes such as annotator/annotatortype and the distinction between various kinds of annotation categories such as token annotation and span annotation.
This Python library is also the foundation of the FoLiA Tools collection, which consists of various command line utilities to perform common tasks on FoLiA documents. If you’re merely interested in performing a certain common task, such as a single query or conversion, you might want to check there first to see whether it already contains a tool that does what you want.
Reading FoLiA¶
Loading a document¶
Any script that uses FoLiA starts with the import:
from pynlpl.formats import folia
At the basis of any FoLiA processing lies the following class:
Document |
This is the FoLiA Document and holds all its data in memory. |
To read a document from file, instantiate a document as follows:
doc = folia.Document(file="/path/to/document.xml")
This returned Document
instance holds the entire document in
memory. Note that for large FoLiA documents this may consume quite some memory!
If you happened to already have the document content in a string, you can load
as follows:
doc = folia.Document(string="<FoLiA ...")
Once you have loaded a document, all data is available for you to read and manipulate as you see fit. We will first illustrate some simple use cases:
To save a document back to the file it was loaded from, we do:
doc.save()
Or we can specify a specific filename:
doc.save("/tmp/document.xml")
Note
Any content that is in a different XML namespace than the FoLiA namespaces or other supported namespaces (XML, Xlink), will be ignored upon loading and lost when saving.
Printing text¶
You may want to simply print all (plain) text contained in the document, which is as easy as:
print(doc)
Obtaining the text as a string is done by invoking the document’s Document.text()
method:
text = doc.text()
Or alternatively as follows:
text = str(doc)
For any subelement of the document, you can obtain its text in the same fashion, by calling its AbstractElement.text() method or by using str(); the only difference is that the former allows for extensive fine-tuning using various extra parameters (see AbstractElement.text()).
Note
In Python 2, both str() and unicode() return a unicode instance. You may need to append .encode('utf-8') for proper output.
Index¶
A document instance has an index which you can use to grab any of its elements by ID. Querying the index works much like using a Python dictionary:
word = doc['example.p.3.s.5.w.1']
print(word)
Note
Python 2 users will have to do print word.text().encode('utf-8')
instead, to ensure non-ascii characters are printed properly.
IDs are unique in the entire document, and preferably even beyond.
Elements¶
All FoLiA elements are derived from AbstractElement
and offer an
identical interface. To quickly check whether you are dealing with a FoLiA
element you can therefore always do the following:
isinstance(word, folia.AbstractElement)
This abstract base element is never instantiated directly. The FoLiA paradigm derives several more abstract base classes which may implement some additional methods or overload some of the original ones:
AbstractElement |
Abstract base class from which all FoLiA elements are derived. |
AbstractStructureElement |
Abstract element, all structure elements inherit from this class. |
AllowTokenAnnotation |
Elements that allow token annotation (including extended annotation) must inherit from this class |
AbstractSpanAnnotation |
Abstract element, all span annotation elements are derived from this class |
AbstractTokenAnnotation |
Abstract element, all token annotation elements are derived from this class |
AbstractAnnotationLayer |
Annotation layers for Span Annotation are derived from this abstract base class |
AbstractTextMarkup |
Abstract class for text markup elements, elements that appear with the TextContent (t ) element. |
Obtaining list of elements¶
The aforementioned index is useful only if you know the ID of the element. This is often not the case, so you will want to iterate through the hierarchy of elements by other means.
If you want to iterate over all of the child elements of a certain element, regardless of what type they are, you can simply do so as follows:
for subelement in element:
if isinstance(subelement, folia.Sentence):
print("this is a sentence")
else:
print("this is something else")
If applied recursively, this allows you to traverse the entire element tree; there are, however, specialised methods available that do this for you.
Select method¶
There is a generic method AbstractElement.select()
available on all
elements to select child elements of any desired class. This method is by
default applied recursively for most element types:
sentence = doc['example.p.3.s.5.w.1']
words = sentence.select(folia.Word)
for word in words:
print(word)
The AbstractElement.select()
method has a sibling AbstractElement.count()
, invoked with the same
arguments, which simply counts how many items it finds, without actually
returning them:
word = sentence.count(folia.Word)
Note
The select() method, and similar high-level methods derived from it, are generators. This implies that the results of the selection are returned one by one in the iteration, as opposed to all being stored in memory. It also implies that you can only iterate over the results once: we cannot do another iteration over the words variable in the above example unless we reinvoke the select() method to get a new generator. Likewise, we cannot do len(words), but have to use the count() method instead.
If you want to have all results in memory in a list, you can simply do the following:
words = list(sentence.select(folia.Word))
The select method is recursive by default; set the third argument to False to make it non-recursive. The second argument can be used for restricting matches to a specific set, and the class argument may also be a tuple of classes. The recursion will not descend into any non-authoritative elements such as alternatives or the originals of corrections.
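For instance, a sketch of both options (the CGN set URL is the one used elsewhere in this tutorial):
# second argument: restrict to a set; third argument: recursion
postags = sentence.select(folia.PosAnnotation, 'http://somewhere/CGN')
words = list(sentence.select(folia.Word, None, False))  # non-recursive selection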
Selection Shortcuts¶
There are various shortcut methods for select()
.
For example, you can iterate over all words in the document using Document.words()
, or
all words under any structural element using AbstractStructureElement.words()
:
for word in doc.words():
print(word)
That however gives you one big iteration of words without boundaries. You may
more likely want to seek words within sentences, provided the document
distinguishes sentences. So we first iterate over all sentences using
Document.sentences()
and then over the
words therein using AbstractStructureElement.words()
:
for sentence in doc.sentences():
for word in sentence.words():
print(word)
Or including paragraphs, assuming the document has them:
for paragraph in doc.paragraphs():
for sentence in paragraph.sentences():
for word in sentence.words():
print(word)
Warning
Do be aware that such constructions make presumptions about the structure of the FoLiA document that may not always apply!
All of these shortcut methods also take an index
parameter to quickly
select a specific item in the sequence:
word = sentence.words(3) #retrieves the fourth word
Structure Annotation Types¶
The FoLiA library discerns various Python classes for structure
annotation, all are subclasses of AbstractStructureElement
, which in
turn is a subclass of AbstractElement
. We list the classes
for structure annotation along with the FoLiA XML tag. Sets and classes can
be associated with most of these elements to make them more specific; these are
never prescribed by FoLiA. The list of classes is as follows:
Cell |
A cell in a Row in a Table |
Definition |
Element used in Entry for the portion that provides a definition for the entry. |
Division |
Structure element representing some kind of division. |
Entry |
Represents an entry in a glossary/lexicon/dictionary. |
Event |
Structural element representing events, often used in new media contexts for things such as tweets, chat messages and forum posts. |
Example |
Element that provides an example. |
Figure |
Element for the representation of a graphical figure. |
Gap |
Gap element, represents skipped portions of the text. |
Head |
Head element; a structure element that acts as the header/title of a Division . |
Linebreak |
Line break element, signals a line break. |
List |
Element for enumeration/itemisation. |
ListItem |
Single element in a List. |
Note |
Element used for notes, such as footnotes or warnings or notice blocks. |
Paragraph |
Paragraph element. |
Part |
Generic structure element used to mark a part inside another block. |
Quote |
Quote: a structure element. |
Reference |
A structural element that denotes a reference, internal or external. |
Row |
A row in a Table |
Sentence |
Sentence element. |
Table |
A table consisting of Row elements that in turn consist of Cell elements |
Term |
A term, often used in the context of an Entry |
TableHead |
Encapsulates the header of a table, contains Cell elements |
Text |
A full text. |
Whitespace |
Whitespace element, signals a vertical whitespace |
Word |
Word (aka token) element. |
The FoLiA documentation explains the exact semantics and use of these in detail. Make sure to consult it to familiarize yourself with how the elements should be used.
FoLiA and this library enforce explicit rules about what elements are allowed in what others. Exceptions will be raised when this is about to be violated.
Common attributes¶
The FoLiA paradigm features sets and classes as primary means to represent the actual value (class) of an annotation. A set often corresponds to a tagset, such as a set of part-of-speech tags, and a class is one selected value in such a set.
The paradigm furthermore introduces other common attributes to set on annotation elements, such as an identifier, information on the annotator, and more. A full list is provided below:
- element.id (str) – The unique identifier of the element
- element.set (str) – The set the element pertains to.
- element.cls (str) – The assigned class, i.e. the actual value of the annotation, defined in the set. Classes correspond with tagsets in the case of many annotation types. Note that since class is already a reserved keyword in Python, the library consistently uses cls everywhere.
- element.annotator (str) – The name or ID of the annotator who added/modified this element
- element.annotatortype – The type of annotator, can be either folia.AnnotatorType.MANUAL or folia.AnnotatorType.AUTO
- element.confidence (float) – A confidence value (between 0 and 1) expressing the degree of confidence in this annotation
- element.datetime (datetime.datetime) – The date and time when the element was added/modified.
- element.n (str) – An ordinal label, used for instance in enumerated list contexts, numbered sections, etc.
The following attributes are specific to a speech context:
- element.src (str) – A URL or filename referring to the audio or video file containing the speech. Access this attribute using the element.speech_src() method, as it is inheritable from ancestors.
- element.speaker (str) – The name or ID of the speaker. Access this attribute using the element.speech_speaker() method, as it is inheritable from ancestors.
- element.begintime (4-tuple) – The time in the above source fragment when the phonetic content of this element starts; this is an (hours, minutes, seconds, milliseconds) tuple.
- element.endtime (4-tuple) – The time in the above source fragment when the phonetic content of this element ends; this is an (hours, minutes, seconds, milliseconds) tuple.
Attributes that are not available for certain elements, or not set, default to None.
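For example, inspecting a few of these attributes on a word (using the ID from the earlier examples; unset attributes are None):
word = doc['example.p.3.s.5.w.1']
print(word.id)
print(word.cls)
print(word.annotator)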
Annotations¶
As FoLiA is a format for linguistic annotation, accessing annotation is one of
the primary functions of this library. This can be done using the methods
AllowTokenAnnotation.annotations()
or AllowTokenAnnotation.annotation()
that are available on many FoLiA elements. These methods are similar to the
AbstractElement.select()
method except they will raise a
NoSuchAnnotation
exception when no such annotation is found. The
difference between annotation()
and annotations()
is that the former
will grab only one and raise an exception if there are more between which it
can’t disambiguate, whereas the second is a generator, but will still raise an
exception if none is found:
for word in doc.words():
try:
pos = word.annotation(folia.PosAnnotation, 'http://somewhere/CGN')
lemma = word.annotation(folia.LemmaAnnotation)
print("Word: ", word)
print("ID: ", word.id)
print("PoS-tag: " , pos.cls)
print("PoS Annotator: ", pos.annotator)
print("Lemma-tag: " , lemma.cls)
except folia.NoSuchAnnotation:
print("No PoS or Lemma annotation")
Note that the second argument of AllowTokenAnnotation.annotation()
, AllowTokenAnnotation.annotations()
or
AbstractElement.select()
can be used to restrict your selection to a certain set. In the
above example we restrict ourselves to Part-of-Speech tags in the CGN set.
Token Annotation Types¶
The following token annotation elements are available in FoLiA; they are embedded under a structural element (not necessarily a token, despite the name).
DomainAnnotation |
Domain annotation: an extended token annotation element |
PosAnnotation |
Part-of-Speech annotation: a token annotation element |
LangAnnotation |
Language annotation: an extended token annotation element |
LemmaAnnotation |
Lemma annotation: a token annotation element |
SenseAnnotation |
Sense annotation: a token annotation element |
SubjectivityAnnotation |
Subjectivity annotation/Sentiment analysis: a token annotation element |
Text and phonetic annotation¶
The actual text of an element, or a phonetic textual representation, are also considered annotations themselves.
TextContent |
Text content element (t ), holds text to be associated with whatever element the text content element is a child of. |
PhonContent |
Phonetic content element (ph ), holds a phonetic representation to be associated with whatever element the phonetic content element is a child of. |
Text is retrieved as a string using AbstractElement.text(), or as an element using AbstractElement.textcontent(). Phonetic content is retrieved as a string using AbstractElement.phon(), or as an element using AbstractElement.phoncontent().
Note
These are the only elements for which FoLiA prescribes a default set and a default class (current
).
This will only be relevant if you work with multiple text layers (current
text vs OCRed text for instance) or with corrections of
orthography or phonetics.
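For example, assuming word carries text content (a hedged sketch; the class is usually "current"):
text = word.text()                # the text as a string
textcontent = word.textcontent()  # the TextContent element itself
print(textcontent.cls)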
Span Annotation¶
FoLiA distinguishes token annotation and span annotation, token annotation is
embedded in-line within a structural element, and the annotation therefore
pertains to that structural element, whereas span annotation is stored in a
stand-off annotation layer outside the element and refers back to it. Span
annotation elements typically span over multiple structural elements, they
are all subclasses of AbstractSpanAnnotation
.
We will discuss three ways of accessing span annotation. As stated, span
annotation is contained within an annotation layer (a subclass of
AbstractAnnotationLayer
) of a certain structure element, often a
sentence. In the first way of accessing span annotation, we do everything
explicitly: We first obtain the layer, then iterate over the span annotation
elements within that layer, and finally iterate over the words to which the
span applies. Assume we have a sentence
and we want to print all the named
entities in it, assuming the entities layer is embedded at sentence level as is
conventional:
for layer in sentence.select(folia.EntitiesLayer):
for entity in layer.select(folia.Entity):
print(" Entity class=", entity.cls, " words=")
for word in entity.wrefs():
print(word, end="") #print without newline
print() #print newline
The AbstractSpanAnnotation.wrefs()
method, available on all span annotation elements, will return
a list of all words (as well as morphemes and phonemes) over which a span
annotation element spans.
This first way is rather verbose. The second way of accessing span annotation
takes another approach, using the Word.findspans()
method available on Word
instances.
Here we start from a word and seek span annotations in which that word occurs.
Assume we have a word
and want to find chunks it occurs in:
for chunk in word.findspans(folia.Chunk):
print(" Chunk class=", chunk.cls, " words=")
for word2 in chunk.wrefs(): #print all words in the chunk (of which the word is a part)
print(word2, end="")
print()
The Word.findspans()
method can be called with either the class of a Span
Annotation Element, such as Chunk
, or with the class of the layer,
such as ChunkingLayer
.
The third way allows us to look for span elements given an annotation layer and
words. In other words, it checks if one or more words form a span. This is an
exact match and not a sub-part match as in the previously described method. To
do this, we use the
method,
available on all annotation layers:
for span in annotationlayer.findspan(word1,word2):
print("Class: ", span.cls)
print("Text: ", span.text()) #same for every span here
Span Annotation Types¶
This section lists the available Span annotation elements, the layer that contains them is explicitly mentioned as well.
Some of the span annotation elements are complex and take span role elements as children; these are span annotation elements that only occur within another span annotation element (of a particular type) and cannot be used standalone.
FoLiA distinguishes the following span annotation elements:
Chunk |
Chunk element, span annotation element to be used in ChunkingLayer |
CoreferenceChain |
Coreference chain. |
Dependency |
Span annotation element to encode dependency relations |
Entity |
Entity element, for entities such as named entities, multi-word expressions, temporal entities. |
Observation |
Observation. |
Predicate |
Predicate, used within SemanticRolesLayer , takes SemanticRole annotations as children, but has its own annotation type and separate declaration |
Sentiment |
Sentiment. |
Statement |
Statement. |
SyntacticUnit |
Syntactic Unit, span annotation element to be used in SyntaxLayer |
SemanticRole |
Semantic Role |
TimeSegment |
A time segment |
These are placed in the following annotation layers:
ChunkingLayer |
Chunking Layer: Annotation layer for Chunk span annotation elements |
CoreferenceLayer |
Coreference Layer: Annotation layer for CoreferenceChain span annotation elements |
DependenciesLayer |
Dependencies Layer: Annotation layer for Dependency span annotation elements. |
EntitiesLayer |
Entities Layer: Annotation layer for Entity span annotation elements. |
ObservationLayer |
Observation Layer: Annotation layer for Observation span annotation elements. |
SentimentLayer |
Sentiment Layer: Annotation layer for Sentiment span annotation elements, used for sentiment analysis. |
StatementLayer |
Statement Layer: Annotation layer for Statement span annotation elements, used for attribution annotation. |
SyntaxLayer |
Syntax Layer: Annotation layer for SyntacticUnit span annotation elements |
SemanticRolesLayer |
Semantic Roles Layer: Annotation layer for SemanticRole span annotation elements |
TimingLayer |
Timing layer: Annotation layer for TimeSegment span annotation elements. |
Some span annotation elements take span roles, depending on their type:
CoreferenceLink |
Coreference link. |
DependencyDependent |
Span role element that marks the dependent in a dependency relation. |
Headspan |
The headspan role is used to mark the head of a span annotation. |
Editing FoLiA¶
Creating a new document¶
Creating a new FoLiA document, rather than loading an existing one from file,
is done by explicitly providing the ID for the new document in the
Document
constructor:
doc = folia.Document(id='example')
Declarations¶
Whenever you add a new type of annotation, or a different set, to a FoLiA document, you have to
first declare it. This is done using the Document.declare()
method. It takes as
arguments the annotation type, the set, and you can optionally pass keyword
arguments to annotator=
and annotatortype=
to set defaults.
An example for Part-of-Speech annotation:
doc.declare(folia.PosAnnotation, 'http://somewhere/brown-tag-set')
An example with a default annotator:
doc.declare(folia.PosAnnotation, 'http://somewhere/brown-tag-set', annotator='proycon', annotatortype=folia.AnnotatorType.MANUAL)
Any additional sets for Part-of-Speech would have to be explicitly declared as
well. To check if a particular annotation type and set is declared, use the
Document.declared()
method.
Adding structure¶
Assuming we begin with an empty document, we should first add a Text element.
Then we can add paragraphs, sentences, or other structural elements. The
AbstractElement.add()
method adds new children to an element:
text = doc.add(folia.Text)
paragraph = text.add(folia.Paragraph)
sentence = paragraph.add(folia.Sentence)
sentence.add(folia.Word, 'This')
sentence.add(folia.Word, 'is')
sentence.add(folia.Word, 'a')
sentence.add(folia.Word, 'test')
sentence.add(folia.Word, '.')
Note
The AbstractElement.add()
method is actually a wrapper around AbstractElement.append()
, which takes the
exact same arguments. It performs extra checks and works for both span
annotation as well as token annotation. Using append()
will be faster
though.
Adding annotations¶
Adding annotations, or any elements for that matter, is done using the
AbstractElement.add()
method on the intended parent element. We assume that the annotations
we add have already been properly declared, otherwise an exception will be
raised as soon as add()
is called. Let’s build on the previous example:
#First we grab the fourth word, 'test', from the sentence
word = sentence.words(3)
#Add Part-of-Speech tag
word.add(folia.PosAnnotation, set='brown-tagset',cls='n')
#Add lemma
word.add(folia.LemmaAnnotation, cls='test')
Note that in the above examples, the add()
method takes a class as first
argument, and subsequently takes keyword arguments that will be passed to the
classes’ constructor.
A second way of using AbstractElement.add()
is by simply passing a fully instantiated child
element, thus constructing it prior to adding. The following is equivalent to the
above example, as the previous method is merely a shortcut for convenience:
#First we grab the fourth word, 'test', from the sentence
word = sentence.words(3)
#Add Part-of-Speech tag
word.add( folia.PosAnnotation(doc, set='brown-tagset',cls='n') )
#Add lemma
word.add( folia.LemmaAnnotation(doc, cls='test') )
The AbstractElement.add()
method always returns that which was added, allowing it to be chained.
In the above example we first explicitly instantiate a PosAnnotation
and a LemmaAnnotation
. Instantiation of any FoLiA element (always
Python class subclassed off AbstractElement
) follows the following
pattern:
Class(document, *children, **kwargs)
Note
See AbstractElement.__init__()
for all details on construction
Note that the document has to be passed explicitly as first argument to the constructor.
The common attributes are set using equally named keyword arguments:
id=
cls=
set=
annotator=
annotatortype=
confidence=
src=
speaker=
begintime=
endtime=
Not all attributes are allowed for all elements, and certain attributes are
required for certain elements. ValueError
exceptions will be raised when these
constraints are not met.
Instead of setting id, you can also set the keyword argument generate_id_in and pass it another element; an ID will then be generated automatically, based on the ID of the element passed. When you use the first method of adding elements, instantiation with generate_id_in will take place automatically behind the scenes when applicable and when id is not explicitly set.
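A sketch of such explicit instantiation with an automatically generated ID (the word text is illustrative):
word = folia.Word(doc, 'example', generate_id_in=sentence)
sentence.append(word)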
Any extra non-keyword arguments should be FoLiA elements and will be appended as the contents of the element, i.e. the children or subelements. Instead of using non-keyword arguments, you can also use the keyword argument contents and pass a list. This is a shortcut made merely for convenience, as Python requires all non-keyword arguments to come before the keyword arguments, which is often aesthetically unpleasing for our purposes. An example of this use case will be shown in the next section.
Adding span annotation¶
Adding span annotation is easy with the FoLiA library. As you know, span
annotation uses a stand-off annotation embedded in annotation layers. These
layers are in turn embedded in structural elements such as sentences. However,
the AbstractElement.add()
method abstracts over this. Consider the following example of a named entity:
doc.declare(folia.Entity, "https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml")
sentence = text.add(folia.Sentence)
sentence.add(folia.Word, 'I',id='example.s.1.w.1')
sentence.add(folia.Word, 'saw',id='example.s.1.w.2')
sentence.add(folia.Word, 'the',id='example.s.1.w.3')
word = sentence.add(folia.Word, 'Dalai',id='example.s.1.w.4')
word2 =sentence.add(folia.Word, 'Lama',id='example.s.1.w.5')
sentence.add(folia.Word, '.', id='example.s.1.w.6')
word.add(folia.Entity, word, word2, cls="per")
To make references to the words, we simply pass the word instances and use the
document’s index to obtain them. Note also that passing a list using the
keyword argument contents
is wholly equivalent to passing the non-keyword
arguments separately:
word.add(folia.Entity, cls="per", contents=[word,word2])
In the next example we do things more explicitly. We first create a sentence and then add a syntax parse, consisting of nested elements:
doc.declare(folia.SyntaxLayer, 'some-syntax-set')
sentence = text.add(folia.Sentence)
sentence.add(folia.Word, 'The',id='example.s.1.w.1')
sentence.add(folia.Word, 'boy',id='example.s.1.w.2')
sentence.add(folia.Word, 'pets',id='example.s.1.w.3')
sentence.add(folia.Word, 'the',id='example.s.1.w.4')
sentence.add(folia.Word, 'cat',id='example.s.1.w.5')
sentence.add(folia.Word, '.', id='example.s.1.w.6')
#Adding Syntax Layer
layer = sentence.add(folia.SyntaxLayer)
#Adding Syntactic Units
layer.add(
    folia.SyntacticUnit(doc, cls='s', contents=[
        folia.SyntacticUnit(doc, cls='np', contents=[
            folia.SyntacticUnit(doc, doc['example.s.1.w.1'], cls='det'),
            folia.SyntacticUnit(doc, doc['example.s.1.w.2'], cls='n'),
        ]),
        folia.SyntacticUnit(doc, cls='vp', contents=[
            folia.SyntacticUnit(doc, doc['example.s.1.w.3'], cls='v'),
            folia.SyntacticUnit(doc, cls='np', contents=[
                folia.SyntacticUnit(doc, doc['example.s.1.w.4'], cls='det'),
                folia.SyntacticUnit(doc, doc['example.s.1.w.5'], cls='n'),
            ]),
        ]),
        folia.SyntacticUnit(doc, doc['example.s.1.w.6'], cls='fin')
    ])
)
Note
The lower-level AbstractElement.append()
method would have had the same effect in the above syntax tree sample.
Deleting annotations¶
Any element can be deleted by calling the AbstractElement.remove()
method on its parent. Suppose we want to delete word
:
word.parent.remove(word)
Copying annotations¶
A deep copy can be made of any element by calling its AbstractElement.copy()
method:
word2 = word.copy()
The copy will be without parent and document. If you intend to associate a copy with a new document, then copy as follows instead:
word2 = word.copy(newdoc)
If you intend to attach the copy somewhere in the same document, you may want to add a suffix for any identifiers in its scope, since duplicate identifiers are not allowed and would raise an exception. This can be specified as the second argument:
word2 = word.copy(doc, ".copy")
Searching in a FoLiA document¶
If you have loaded a FoLiA document into memory, you may want to search for a
particular annotation. You can of course loop over all structural and
annotation elements using AbstractElement.select()
,
AllowTokenAnnotation.annotation()
and
AllowTokenAnnotation.annotations()
. Additionally, Word.findspans()
and AbstractAnnotationLayer.findspan()
are useful methods of finding span
annotations covering particular words, whereas
AbstractSpanAnnotation.wrefs()
does the reverse and finds the words for a
given span annotation element. In addition to these main methods of navigation
and selection, there is a higher-level function available for searching; this
uses the FoLiA Query Language (FQL) or the Corpus Query Language (CQL).
These two languages are part of separate libraries that need to be imported:
from pynlpl.formats import fql, cql
Corpus Query Language (CQL)¶
CQL is the easier language of the two and the most suitable for corpus searching. It is, however, less flexible than FQL, which is designed specifically for FoLiA and can not just query but also manipulate FoLiA documents in great detail.
CQL was developed for the IMS Corpus Workbench at Stuttgart University, and is also implemented by Sketch Engine, who provide good CQL documentation.
CQL has to be converted to FQL first, which is then executed on the given document. This is a simple example querying for the word “house”:
doc = folia.Document(file="/path/to/some/document.folia.xml")
query = fql.Query(cql.cql2fql('"house"'))
for word in query(doc):
print(word) #these will be folia.Word instances (all matching house)
Multiple words can be queried:
query = fql.Query(cql.cql2fql('"the" "big" "house"'))
for word1,word2,word3 in query(doc):
print(word1, word2,word3)
Queries may contain wildcard expressions to match multiple text patterns. Gaps can be specified using []. The following will match any three-word combination starting with “the” and ending with something that starts with “house”. It will thus match things like “the big house” or “the small household”:
query = fql.Query(cql.cql2fql('"the" [] "house.*"'))
for word1,word2,word3 in query(doc):
...
We can make the gap optional with a question mark; it can be lengthened with + or *, as in regular expressions:
query = fql.Query(cql.cql2fql('"the" []? "house.*"'))
for match in query(doc):
print("We matched ", len(match), " words")
Querying is not limited to text; all of FoLiA’s annotations can be used. To force our gap to consist of one or more adjectives, we do:
query = fql.Query(cql.cql2fql('"the" [ pos = "a" ]+ "house.*"'))
for match in query(doc):
...
The original CQL attribute here is tag
rather than pos
, which can be used too. In addition, all FoLiA element types can be used! Just use their FoLiA tagname.
Consult the CQL documentation for more. Do note that CQL is very word/token-centred; for searching other types of elements, use FQL instead.
FoLiA Query Language (FQL)¶
FQL is documented here; a full overview is beyond the scope of this documentation. We will just introduce some basic selection queries so you can develop an initial impression of the language’s abilities.
All FQL processing is done via the following class, as already seen in the previous section:
Query |
This class represents an FQL query. |
Selecting a word with a particular text is done as follows:
query = fql.Query('SELECT w WHERE text = "house"')
for word in query(doc):
print(word) #this will be an instance of folia.Word
Regular expression matching can be done using the MATCHES
operator:
query = fql.Query('SELECT w WHERE text MATCHES "^house.*$"')
for word in query(doc):
print(word)
The classes of other annotation types can be easily queried as follows:
query = fql.Query('SELECT w WHERE :pos = "v" AND :lemma = "be"')
for word in query(doc):
print(word)
You can constrain your queries to a particular target selection using the FOR
keyword:
query = fql.Query('SELECT w WHERE text MATCHES "^house.*$" FOR s WHERE text CONTAINS "sell"')
for word in query(doc):
print(word)
This construction also allows you to select the actual annotations. To select all people (a named entity) for words that are not John:
query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John"')
for entity in query(doc):
print(entity) #this will be an instance of folia.Entity
FOR statements may be chained, and explicit IDs can be passed using the ID
keyword:
query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John" FOR div ID "section.21"')
for entity in query(doc):
print(entity)
Sets are specified using the OF keyword; it can be omitted if there is only one set for the annotation type, but is required otherwise:
query = fql.Query('SELECT su OF "http://some/syntax/set" WHERE class = "np"')
for su in query(doc):
print(su) #this will be an instance of folia.SyntacticUnit
We have just covered the SELECT keyword, FQL has other keywords for manipulating documents, such as EDIT, ADD, APPEND and PREPEND.
Note
Consult the FQL documentation at https://github.com/proycon/foliadocserve/blob/master/README.rst for further documentation on the language.
Streaming Reader¶
Throughout this tutorial you have seen the Document
class as a means
of reading FoLiA documents. This class always loads the entire document in
memory, which can be a considerable resource demand. The following class
provides an alternative to loading FoLiA documents:
Reader |
Streaming FoLiA reader. |
It does not load the entire document in memory but merely returns the elements you are interested in. This results in far less memory usage and also provides a speed-up.
A reader is constructed as follows; the second argument is the class of the element you want:
reader = folia.Reader("my.folia.xml", folia.Word)
for word in reader:
print(word.id)
Higher-Order Annotations¶
Text Markup¶
FoLiA has a number of text markup elements, which appear within the TextContent (t) element. Iterating over a TextContent element will first and foremost produce strings, but will also uncover these markup elements when present; see the sketch after the table below. The following markup types exist:
TextMarkupGap |
Markup element to mark gaps in text content (TextContent ) |
TextMarkupString |
Markup element to mark arbitrary substrings in text content (TextContent ) |
TextMarkupStyle |
Markup element to style text content (TextContent ), e.g. |
TextMarkupCorrection |
Markup element to mark corrections in text content (TextContent ). |
TextMarkupError |
Markup element to mark errors in text content (TextContent ) |
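A small sketch of such an iteration; it assumes word carries a text content element:
textcontent = word.textcontent()
for item in textcontent:
    if isinstance(item, folia.AbstractTextMarkup):
        print("markup element with class:", item.cls)
    else:
        print("plain string:", item)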
Features¶
Features allow a second-order annotation by adding the ability to assign
properties and values to any of the existing annotation elements. They follow
the set/class paradigm by adding the notion of a subset and class relative to
this subset. The AbstractElement.feat()
method provides a shortcut that can be used on any
annotation element to obtain the class of the feature, given a subset. To
illustrate the concept, take a look at part of speech annotation with some
features:
pos = word.annotation(folia.PosAnnotation)
if pos.cls == "n":
if pos.feat('number') == 'plural':
print("We have a plural noun!")
elif pos.feat('number') == 'singular':
print("We have a singular noun!")
The AbstractElement.feat()
method will raise an exception when the feature does not exist.
Note that the actual subset and class values are defined by the set and not
FoLiA itself! They are therefore fictitious in the above example.
The Python class for features is Feature
, in the following example we
add a feature:
pos.add(folia.Feature, subset="gender", cls="f")
Although FoLiA itself does not define any sets or subsets, some annotation types do
come with associated subsets; their use is never mandatory. The advantage
is that these associated subsets can be directly used as an XML attribute in
the FoLiA document. The FoLiA library provides extra classes, all subclassed
off Feature
for these:
Feature |
Feature elements can be used to associate subsets and subclasses with almost any annotation element |
SynsetFeature |
Synset feature, to be used within Sense |
ActorFeature |
Actor feature, to be used within Event |
BegindatetimeFeature |
Begindatetime feature, to be used within Event |
EnddatetimeFeature |
Enddatetime feature, to be used within Event |
Alternatives¶
A key feature of FoLiA is its ability to make explicit alternative annotations,
for token annotations, the Alternative
(alt
) class is used to
this end. Alternative annotations are embedded in this structure. This implies
the annotation is not authoritative, but is merely an alternative to the actual
annotation (if any). Alternatives may typically occur in larger numbers,
representing a distribution each with a confidence value (not mandatory). Each
alternative is wrapped in its own Alternative
element, as multiple
elements inside a single alternative are considered dependent and part of the
same alternative. Combining multiple annotations in one alternative makes sense
for mixed annotation types, where for instance a pos tag alternative is tied to
a particular lemma:
alt = word.add(folia.Alternative)
alt.add(folia.PosAnnotation, set='brown-tagset',cls='n',confidence=0.5)
alt = word.add(folia.Alternative) #note that we reassign the variable!
alt.add(folia.PosAnnotation, set='brown-tagset',cls='a',confidence=0.3)
alt = word.add(folia.Alternative)
alt.add(folia.PosAnnotation, set='brown-tagset',cls='v',confidence=0.2)
Span annotation elements have a different mechanism for alternatives, for those
the entire annotation layer is embedded in a AlternativeLayers
element. This element should be repeated for every type, unless the layers it
describes are dependent on each other:
alt = sentence.add(folia.AlternativeLayers)
layer = alt.add(folia.EntitiesLayer)
entity = layer.add(folia.Entity, word1,word2,cls="person", confidence=0.3)
Because the alternative annotations are non-authoritative, normal selection
methods such as select()
and annotations()
will never yield them,
unless explicitly told to do so. For this reason, there is an
alternatives()
method on structure elements, for the first category of alternatives.
In summary, a list of the two relevant classes for alternatives:
Alternative |
Element grouping alternative token annotation(s). |
AlternativeLayers |
Element grouping alternative subtoken annotation(s). |
Corrections¶
Corrections are one of the most complex annotation types in FoLiA. Corrections can be applied not just over text, but over any type of structure annotation, token annotation or span annotation. Corrections explicitly preserve the original, and recursively so if corrections are done over other corrections.
Despite their complexity, the library treats corrections transparently. Whenever you query for a particular element, and it is part of a correction, you get the corrected version rather than the original. The original is always non-authoritative and normal selection methods will ignore it.
If you want to deal with correction, you have to explicitly handle the
Correction
element. If an element is part of a correction, its
AbstractElement.incorrection()
method will give the correction element, if not, it will
return None
:
pos = word.annotation(folia.PosAnnotation)
correction = pos.incorrection()
if correction:
if correction.hasoriginal():
originalpos = correction.original(0) #assuming it's the only element as is customary
#originalpos will be an instance of folia.PosAnnotation
print("The original pos was", originalpos.cls)
Corrections themselves carry a class too, indicating the type of correction (defined by the set used and not by FoLiA).
Besides Correction.original(), corrections distinguish three other types: Correction.new() (the corrected version), Correction.current() (the current, uncorrected version) and Correction.suggestions() (a suggestion for correction). The former two and the latter two usually form pairs; current() and new() can never be used together. There may be multiple suggestions, hence the index argument to suggestions(index). These return, respectively, instances of Original, New, Current and Suggestion.
Adding a correction can be done explicitly:
wrongpos = word.annotation(folia.PosAnnotation)
word.add(folia.Correction, folia.New(doc, folia.PosAnnotation(doc, cls="n")) , folia.Original(doc, wrongpos), cls="misclassified")
Let’s settle for a suggestion rather than an actual correction:
wrongpos = word.annotation(folia.PosAnnotation)
word.add(folia.Correction, folia.Suggestion(doc, folia.PosAnnotation(doc, cls="n")), cls="misclassified")
In some instances, when correcting text or structural elements, New
may be
empty, which would correspond to a deletion. Similarly, Original
may be
empty, corresponding to an insertion.
The use of Current
is reserved for use with structure elements, such as words, in combination with suggestions. The structure elements then have to be embedded in Current
. This situation arises for instance when making suggestions for a merge or split.
Here is a list of all relevant classes for corrections:
Correction |
Corrections are one of the most complex annotation types in FoLiA. |
Current |
Used in the context of Correction to encapsulate the currently authoritative annotations. |
ErrorDetection |
The ErrorDetection element is used to signal the presence of errors in a structural element. |
New |
|
Original |
Used in the context of Correction to encapsulate the original annotations prior to correction. |
Suggestion |
Suggestions are used in the context of Correction , but rather than provide an authoritative correction, it instead offers a suggestion for correction. |
Alignments¶
Alignments are used to make reference to external documents. It concerns
references as annotation rather than references which are explicitly part of
the text, such as hyperlinks and Reference
.
The following elements are relevant for alignments:
Alignment |
The Alignment element is a form of higher-order annotation that is used to point to an external resource. |
AlignReference |
The AlignReference element is used to point to specific elements inside the aligned source. |
Descriptions, Metrics¶
FoLiA allows arbitrary descriptions to be associated with any element. It also allows assigning metrics to any annotation; these consist of a key/value pair that often expresses a quantitative or qualitative measure. This is accomplished, respectively, with the following element classes (see the sketch after the table below):
Description |
Description is an element that can be used to associate a description with almost any other FoLiA element |
Metric |
Metric elements provide a key/value pair to allow the annotation of any kind of metric with any kind of annotation element. |
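A hedged sketch of adding both to an existing annotation; it assumes Description and Metric accept value= (and cls= for Metric) keyword arguments, mirroring the attributes they carry in FoLiA:
pos.add(folia.Description, value="part-of-speech tag verified manually")  # value= is an assumption
pos.add(folia.Metric, cls="confidence_margin", value="0.25")               # cls=/value= are assumptions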
Metadata¶
FoLiA can be used with a variety of more advanced metadata schemes (e.g. Dublin Core,
CMDI). If this is too much, you can use its own simple native metadata
facility, a simple key/value store. After instantiation of a Document
, the metadata can be
accessed through the metadata
attribute, which behaves like a Python
dictionary:
doc = folia.Document(file="/path/to/document.xml")
doc.metadata['language'] = "en"
Formats¶
Corpus Gesproken Nederlands¶
-
exception
pynlpl.formats.cgn.
InvalidFeatureException
¶
-
exception
pynlpl.formats.cgn.
InvalidTagException
¶
-
pynlpl.formats.cgn.
parse_cgn_postag
(rawtag, raisefeatureexceptions=False)¶
GIZA++¶
-
class
pynlpl.formats.giza.
GizaModel
(filename, encoding='utf-8')¶
-
class
pynlpl.formats.giza.
GizaSentenceAlignment
(sourceline, targetline, index)¶ -
getalignedtarget
(index)¶ Returns target range only if source index aligns to a single consecutive range of target tokens.
-
intersect
(other)¶
-
-
class
pynlpl.formats.giza.
IntersectionAlignment
(source2target, target2source, encoding=False)¶ -
reset
()¶
-
-
class
pynlpl.formats.giza.
MultiWordAlignment
(filename, encoding=False)¶ Source to Target alignment: reads source-target.A3.final files, in which each source word may be aligned to multiple target words (adapted from code by Sander Canisius)
-
reset
()¶
-
targetword
(index, targetwords, alignment)¶ Return the aligned target word for a specified index in the source words. Multiple words are concatenated together with a space in between
-
targetwords
(index, targetwords, alignment)¶ Return the aligned targetwords for a specified index in the source words
-
-
class
pynlpl.formats.giza.
WordAlignment
(filename, encoding=False)¶ Target to Source alignment: reads target-source.A3.final files, in which each source word is aligned to one target word
-
reset
()¶
-
targetword
(index, targetwords, alignment)¶ Return the aligned target word for a specified index in the source words
-
-
pynlpl.formats.giza.
parseAlignment
(tokens)¶
Moses¶
-
class
pynlpl.formats.moses.
PhraseTable
(filename, quiet=False, reverse=False, delimiter='|||', score_column=3, max_sourcen=0, sourceencoder=None, targetencoder=None, scorefilter=None)¶
-
class
pynlpl.formats.moses.
PhraseTableClient
(host='localhost', port=65432)¶
SoNaR¶
-
class
pynlpl.formats.sonar.
Corpus
(corpusdir, extension='pos', restrict_to_collection='', conditionf=<function Corpus.<lambda>>, ignoreerrors=False)¶
-
class
pynlpl.formats.sonar.
CorpusDocument
(filename, encoding='iso-8859-15')¶ This class represents one document/text of the Corpus (read-only)
-
paragraphs
(with_id=False)¶ Extracts paragraphs, returns list of plain-text(!) paragraphs
-
sentences
()¶ Iterate over all sentences (sentence_id, sentence) in the document, sentence is a list of 4-tuples (word,id,pos,lemma)
-
words
()¶
-
-
class
pynlpl.formats.sonar.
CorpusDocumentX
(filename, tree=None, index=True)¶ This class represents one document/text of the Corpus, loaded into memory at once and retaining the full structure
-
paragraphs
(node=None)¶ iterate over paragraphs
-
save
(filename=None, encoding='iso-8859-15')¶
-
sentences
(node=None)¶ iterate over sentences
-
validate
(formats_dir='../formats/')¶ checks if the document is valid
-
words
(node=None)¶ iterate over words
-
xpath
(expression)¶ Executes an xpath expression using the correct namespaces
-
-
class
pynlpl.formats.sonar.
CorpusFiles
(corpusdir, extension='pos', restrict_to_collection='', conditionf=<function Corpus.<lambda>>, ignoreerrors=False)¶
-
class
pynlpl.formats.sonar.
CorpusX
(corpusdir, extension='pos', restrict_to_collection='', conditionf=<function Corpus.<lambda>>, ignoreerrors=False)¶
-
pynlpl.formats.sonar.
ns
(namespace)¶ Resolves the namespace identifier to a full URL
Taggerdata¶
Language Models¶
-
class
pynlpl.lm.lm.
ARPALanguageModel
(filename, encoding='utf-8', encoder=None, base_e=True, dounknown=True, debug=False, mode='simple')¶ Full back-off language model, loaded from file in ARPA format.
This class does not build the model but allows you to use a pre-computed one. You can use the tool ngram-count from for instance SRILM to actually build the model.
-
class
NgramsProbs
(data, mode='simple', delim=' ')¶ Store Ngrams with their probabilities and backoffs.
This class is used to abstract the physical storage layout and enable memory/speed tradeoffs.
-
backoff
(ngram)¶ Return backoff value of a given ngram tuple
-
prob
(ngram)¶ Return probability of given ngram tuple
-
-
score
(data, history=None)¶
-
scoreword
(word, history=None)¶
-
-
class
pynlpl.lm.lm.
SimpleLanguageModel
(n=2, casesensitive=True, beginmarker='<begin>', endmarker='<end>')¶ This is a simple unsmoothed language model. This class can both hold and compute the model.
-
append
(sentence)¶
-
load
(filename)¶
-
save
(filename)¶
-
scoresentence
(sentence)¶
-
-
class
pynlpl.lm.srilm.
SRILM
(filename, n)¶ -
logscore
(ngram)¶
-
scoresentence
(sentence, unknownwordprob=-12)¶
-
-
exception
pynlpl.lm.srilm.
SRILMException
¶ Base Exception for SRILM.
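As a brief illustration of the classes documented above, here is a minimal sketch. It assumes sentences are passed to SimpleLanguageModel as pre-tokenised lists of words; the ARPA filename and the form of the history argument are assumptions for illustration only.
from pynlpl.lm.lm import SimpleLanguageModel, ARPALanguageModel

# Simple unsmoothed bigram model, built in memory
lm = SimpleLanguageModel(n=2)
lm.append(['to', 'be', 'or', 'not', 'to', 'be'])   # add a training sentence
lm.append(['to', 'be', 'is', 'to', 'do'])
print(lm.scoresentence(['to', 'be', 'or', 'not'])) # probability under the model
lm.save('example.lm')                              # persist the model (hypothetical filename)

# Pre-computed ARPA model; built externally, e.g. with SRILM's ngram-count
arpalm = ARPALanguageModel('model.arpa')           # hypothetical filename
print(arpalm.scoreword('be', history=('to',)))     # assumes history is a tuple of preceding words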
Search Algorithms¶
This module contains various search algorithms.
-
class
pynlpl.search.
AbstractSearch
(**kwargs)¶ -
prune
(state)¶ Pruning method is called AFTER expansion of each node
-
reset
()¶
-
searchall
()¶ Returns a list of all solutions
-
searchbest
()¶ Returns the single best result (if multiple have the same score, the first match is returned)
-
searchfirst
()¶ Returns the very first result (regardless of whether it is the best one or not!)
-
searchlast
(n=10)¶ Return the last n results (or possibly fewer if not enough are found). Note that, depending on the search type, the last results are not necessarily the best ones!
-
searchtop
(n=10)¶ Return the top n best results (or possibly fewer if not enough are found)
-
traversal
()¶ Returns all visited states (only when keeptraversal=True). Note that this is not equal to the path; it contains all states that were checked!
-
traversalsize
()¶ Returns the number of nodes visited (also when keeptraversal=False). Note that this is not the length of the path; it counts all states that were checked!
-
visited
(state)¶
-
-
class
pynlpl.search.
AbstractSearchState
(parent=None, cost=0)¶ -
depth
()¶
-
expand
()¶ Generates successor states; implement your custom operators in the derived method.
-
path
()¶
-
pathcost
()¶
-
score
()¶ Should return a heuristic value. This needs to be implemented if you plan to use an informed search algorithm.
-
test
(goalstates=None)¶ Checks whether this state is a valid goal state; returns a boolean. If no goal state is defined, all states will test positively, which is what you usually want for optimisation problems.
-
-
class
pynlpl.search.
BeamSearch
(states, beamsize, **kwargs)¶ Local beam search algorithm
-
class
pynlpl.search.
BeamedBestFirstSearch
(states, beamsize, **kwargs)¶ Best first search with a beamsize (non-optimal!)
-
prune
(state)¶ Pruning method is called AFTER expansion of each node
-
-
class
pynlpl.search.
BestFirstSearch
(state, **kwargs)¶
-
class
pynlpl.search.
BreadthFirstSearch
(state, **kwargs)¶
-
class
pynlpl.search.
DepthFirstSearch
(state, **kwargs)¶
-
class
pynlpl.search.
EarlyEagerBeamSearch
(state, beamsize, **kwargs)¶ A beam search that prunes early (after each state expansion) and eagerly (weeding out worse successors)
-
prune
(state)¶ Pruning method is called AFTER expansion of each node
-
-
class
pynlpl.search.
HillClimbingSearch
(state, **kwargs)¶ (identical to beamsearch with beam 1, but implemented differently)
-
class
pynlpl.search.
IterativeDeepening
(state, **kwargs)¶ -
traversal
()¶ Returns all visited states (only when keeptraversal=True). Note that this is not equal to the path; it contains all states that were checked!
-
traversalsize
()¶ Returns the number of nodes visited (also when keeptraversal=False). Note that this is not the length of the path; it counts all states that were checked!
-
-
class
pynlpl.search.
StochasticBeamSearch
(states, beamsize, **kwargs)¶ -
prune
(state)¶ Pruning method is called AFTER expansion of each node
-
-
pynlpl.search.
binary_search
(a, x, lo=0, hi=None)¶
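To make the relation between these classes concrete, here is a minimal sketch (not taken from PyNLPl itself) of a toy search: states hold a number, successors add 1 or 2, and the goal is reaching 7. The constructor of the derived state class is an assumption for illustration; only expand() and test() are required by the interface documented above.
from pynlpl.search import AbstractSearchState, BreadthFirstSearch

class NumberState(AbstractSearchState):
    def __init__(self, value, parent=None):
        self.value = value
        super(NumberState, self).__init__(parent)

    def expand(self):
        # custom operators: add 1 or add 2 to the current value
        yield NumberState(self.value + 1, parent=self)
        yield NumberState(self.value + 2, parent=self)

    def test(self, goalstates=None):
        # goal: reach the value 7
        return self.value == 7

    def __str__(self):
        return str(self.value)

search = BreadthFirstSearch(NumberState(0))
print(search.searchfirst())   # the first goal state found
The other search classes (DepthFirstSearch, BestFirstSearch, the beam searches) take a state, or a list of states plus a beam size, in the same manner.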
Statistics and Information Theory¶
This module contains classes and functions for statistics and information theory. It is imported as follows:
import pynlpl.statistics
Generic functions¶
Amongst others, the following generic statistical functions are available:
* ``mean(list)`` - Computes the mean of a given list of numbers
* ``median(list)`` - Computes the median of a given list of numbers
* ``stddev(list)`` - Computes the standard deviation of a given list of numbers
* ``normalize(list)`` - Normalizes a list of numbers so that the sum is 1.0
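A brief usage sketch (the values shown in comments follow the examples in the API reference below):
import pynlpl.statistics

values = [1, 2, 1]
print(pynlpl.statistics.mean(values))       # 1.333...
print(pynlpl.statistics.median(values))     # 1
print(pynlpl.statistics.stddev(values))     # standard deviation of the values
print(pynlpl.statistics.normalize(values))  # [0.25, 0.5, 0.25]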
Frequency Lists and Distributions¶
One of the most basic and widespread tasks in NLP is the creation of a frequency list. Counting is done by simply appending lists to the frequency list:
freqlist = pynlpl.statistics.FrequencyList()
freqlist.append(['to','be','or','not','to','be'])
Take care not to append strings rather than lists, unless you mean to create a frequency list over characters rather than words. You may want to use pynlpl.textprocessors.crude_tokeniser
first:
freqlist.append(pynlpl.textprocessors.crude_tokeniser("to be or not to be"))
The count can also be incremented explicitly for a single item:
freqlist.count('shakespeare')
The FrequencyList offers dictionary-like access. For example, the following statement will be true for the frequency list just created:
freqlist['be'] == 2
Normalised counts (pseudo-probabilities) can be obtained using the p()
method:
freqlist.p('be')
Normalised counts can also be obtained by instantiating a Distribution using the frequency list:
dist = pynlpl.statistics.Distribution(freqlist)
This too offers a dictionary-like interface, where values are by definition normalised. The advantage of a Distribution class is that it offers information-theoretic methods such as entropy()
, maxentropy()
, perplexity()
and poslog()
.
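For example, a minimal sketch continuing with the freqlist created above (the exact values depend on the data):
dist = pynlpl.statistics.Distribution(freqlist)
print(dist['be'])              # normalised count, same as freqlist.p('be')
print(dist.entropy())          # entropy of the distribution
print(dist.maxentropy())       # maximum attainable entropy for this number of types
print(dist.perplexity())       # perplexity
print(dist.information('be'))  # information content of 'be'; poslog('be') is an alias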
A frequency list can be saved to file using the save(filename)
method, and loaded back from file using the load(filename)
method. The output()
method is a generator yielding strings for each line of output, in ranked order.
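For instance, a small sketch (the filename is arbitrary):
freqlist.save('freqlist.txt')            # write the frequency list to disk
freqlist2 = pynlpl.statistics.FrequencyList()
freqlist2.load('freqlist.txt')           # read it back in
for line in freqlist2.output():          # one formatted line per type, in ranked order
    print(line)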
API Reference¶
This is a Python library containing classes for statistical and information-theoretical computations. It also contains some code from Peter Norvig, AI: A Modern Approach : http://aima.cs.berkeley.edu/python/utils.html
-
class
pynlpl.statistics.
Distribution
(data, base=2)¶ A distribution can be created over a FrequencyList or a plain dictionary with numeric values. It will be normalized automatically. This implementation uses dictionaries/hashing
-
entropy
(base=2)¶ Compute the entropy of the distribution
-
information
(type)¶ Computes the information content of the specified type: -log_e(p(X))
-
items
()¶ Returns an unranked list of (type, prob) pairs. Use this only if you are not interested in the order.
-
keys
()¶
-
maxentropy
(base=2)¶ Compute the maximum entropy of the distribution: log_e(N)
-
mode
()¶ Returns the type that occurs the most frequently in the probability distribution
-
output
(delimiter='\t', freqlist=None)¶ Generator yielding formatted strings expressing the type and probability for each item in the distribution
-
perplexity
(base=2)¶
-
poslog
(type)¶ alias for information content
-
values
()¶
-
-
class
pynlpl.statistics.
FrequencyList
(tokens=None, casesensitive=True, dovalidation=True)¶ A frequency list (implemented using dictionaries)
-
append
(tokens)¶ Add a list of tokens to the frequencylist. This method will count them for you.
-
count
(type, amount=1)¶ Count a certain type. The counter will increase by the amount specified (defaults to one)
-
dict
()¶
-
items
()¶ Returns an unranked list of (type, count) pairs. Use this only if you are not interested in the order.
-
load
(filename)¶ Load a frequency list from file (in the format produced by the save method)
-
mode
()¶ Returns the type that occurs the most frequently in the frequency list
-
output
(delimiter='\t', addnormalised=False)¶ Generator yielding a formatted string for each type in the frequency list, in ranked order
-
p
(type)¶ Returns the probability (relative frequency) of the token
-
save
(filename, addnormalised=False)¶ Save a frequency list to file, can be loaded later using the load method
-
sum
()¶ Returns the total number of tokens
-
tokens
()¶ Returns the total number of tokens
-
typetokenratio
()¶ Computes the type/token ratio
-
values
()¶
-
-
class
pynlpl.statistics.
HiddenMarkovModel
(startstate, endstate=None)¶ -
print_dptable
(V)¶
-
setemission
(state, distribution)¶
-
viterbi
(observations, doprint=False)¶
-
-
class
pynlpl.statistics.
MarkovChain
(startstate, endstate=None)¶ -
accessible
(fromstate, tostate)¶ Is state tostate directly accessible (in one step) from state fromstate, i.e. is there an edge between the two? If so, return the transition probability, else zero
-
communicates
(fromstate, tostate, maxlength=999999)¶ See if a node communicates (directly or indirectly) with another. Returns the probability of the shortest path (which is likely, but not necessarily, the highest-probability path)
-
p
(sequence, subsequence=True)¶ Returns the probability of the given sequence or subsequence (if subsequence=True, default).
-
reducible
()¶
-
settransitions
(state, distribution)¶
-
size
()¶
-
-
pynlpl.statistics.
dotproduct
(X, Y)¶ Return the sum of the element-wise product of vectors x and y.
>>> dotproduct([1, 2, 3], [1000, 100, 10])
1230
-
pynlpl.statistics.
histogram
(values, mode=0, bin_function=None)¶ Return a list of (value, count) pairs, summarizing the input values. Sorted by increasing value, or if mode=1, by decreasing count. If bin_function is given, map it over values first.
-
pynlpl.statistics.
levenshtein
(s1, s2, maxdistance=9999)¶ Computes the levenshtein distance between two strings. Adapted from: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
-
pynlpl.statistics.
log2
(x)¶ Base 2 logarithm.
>>> log2(1024)
10.0
-
pynlpl.statistics.
mean
(values)¶ Return the arithmetic average of the values.
-
pynlpl.statistics.
median
(values)¶ Return the middle value, when the values are sorted. If there are an even number of elements, try to average the middle two. If they can’t be averaged (e.g. they are strings), choose one at random.
>>> median([10, 100, 11])
11
>>> median([1, 2, 3, 4])
2.5
-
pynlpl.statistics.
mode
(values)¶ Return the most common value in the list of values.
>>> mode([1, 2, 3, 2])
2
-
pynlpl.statistics.
normalize
(numbers, total=1.0)¶ Multiply each number by a constant such that the sum is 1.0 (or total).
>>> normalize([1,2,1])
[0.25, 0.5, 0.25]
-
pynlpl.statistics.
product
(seq)¶ Return the product of a sequence of numerical values.
>>> product([1,2,6])
12
-
pynlpl.statistics.
stddev
(values, meanval=None)¶ The standard deviation of a set of values. Pass in the mean if you already know it.
-
pynlpl.statistics.
vector_add
(a, b)¶ Component-wise addition of two vectors.
>>> vector_add((0, 1), (8, 9))
(8, 10)
Text Processors¶
This module contains classes and functions for text processing. It is imported as follows:
import pynlpl.textprocessors
Tokenisation¶
A very crude tokeniser is available in the form of the function pynlpl.textprocessors.crude_tokeniser(string)
. This will split punctuation characters from words and return a list of tokens. It has, however, no regard for abbreviations or end-of-sentence detection, functionality that a more sophisticated tokeniser can provide:
tokens = pynlpl.textprocessors.crude_tokeniser("to be, or not to be.")
This will result in:
tokens == ['to', 'be', ',', 'or', 'not', 'to', 'be', '.']
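The Tokenizer class documented further below offers somewhat more sophisticated behaviour, including sentence splitting. A small sketch, assuming a stream-like object as input:
import io
from pynlpl.textprocessors import Tokenizer

stream = io.StringIO("This is a test. And this is another sentence!")
for sentence in Tokenizer(stream, splitsentences=True):
    print(sentence)   # each sentence is yielded as a list of tokens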
N-gram extraction¶
The extraction of n-grams is an elemental operation in Natural Language Processing. PyNLPl offers the Windower
class to accomplish this task:
tokens = pynlpl.textprocessors.crude_tokeniser("to be or not to be")
for trigram in Windower(tokens,3):
    print(trigram)
The input to the Windower should be a list of words and a value for n. In addition, the Windower can output extra marker symbols at the beginning and end of the input sequence. By default this behaviour is enabled: the begin marker is <begin> and the end marker is <end>. If this behaviour is unwanted, you can suppress it by instantiating the Windower as follows:
Windower(tokens,3, None, None)
The Windower is implemented as a Python generator and at each iteration yields a tuple of length n.
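Combined with the frequency list from the statistics module, this allows for a quick n-gram count. Continuing from the tokens defined above, a small sketch (assuming tuples are accepted as countable types by FrequencyList):
from pynlpl.statistics import FrequencyList
from pynlpl.textprocessors import Windower

bigrams = FrequencyList()
for bigram in Windower(tokens, 2, None, None):   # markers suppressed, as above
    bigrams.count(bigram)                        # each bigram is a tuple of two tokens
print(bigrams[('to', 'be')])                     # 2 for "to be or not to be"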
-
class
pynlpl.textprocessors.
MultiWindower
(tokens, min_n=1, max_n=9, beginmarker=None, endmarker=None)¶ Extract n-grams of various configurations from a sequence
-
class
pynlpl.textprocessors.
ReflowText
(stream, filternontext=True)¶ Attempts to re-flow a text that has arbitrary line endings in it. Also undoes hyphenisation
-
class
pynlpl.textprocessors.
Tokenizer
(stream, splitsentences=True, onesentenceperline=False, regexps=(re.compile('^(?:(?:https?):(?:(?://)|(?:\\\\))|www\.)(?:[\w\d:#@%/;$()~_?\+-=\\\.&](?:#!)?)*'), re.compile('^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+(?:\.[a-zA-Z]+)+')))¶ A tokenizer and sentence splitter which acts on a file/stream-like object; when iterating over the object it yields lists of tokens (if the sentence splitter is active, the default) or single tokens (if the sentence splitter is deactivated).
-
class
pynlpl.textprocessors.
Windower
(tokens, n=1, beginmarker='<begin>', endmarker='<end>')¶ Moves a sliding window over a list of tokens; upon iteration it yields all n-grams of the specified size as tuples.
Example without markers:
>>> for ngram in Windower("This is a test .",3, None, None):
...     print(" ".join(ngram))
This is a
is a test
a test .
Example with default markers:
>>> for ngram in Windower("This is a test .",3):
...     print(" ".join(ngram))
<begin> <begin> This
<begin> This is
This is a
is a test
a test .
test . <end>
. <end> <end>
-
pynlpl.textprocessors.
calculate_overlap
(haystack, needle, allowpartial=True)¶ Calculate the overlap between two sequences. Yields (overlap, placement) tuples (multiple, because there may be multiple overlaps!). The former is the part of the sequence that overlaps; the latter is -1 if the overlap is on the left side, 0 if it is a subset, 1 if it overlaps on the right side, and 2 if it is an identical match
-
pynlpl.textprocessors.
crude_tokenizer
(text)¶ Deprecated alias for tokenize()
-
pynlpl.textprocessors.
find_keyword_in_context
(tokens, keyword, contextsize=1)¶ Find a keyword in a particular sequence of tokens and return the local context. contextsize is the number of words to the left and right. The keyword may consist of multiple words, in which case it should be passed as a tuple or list
-
pynlpl.textprocessors.
is_end_of_sentence
(tokens, i)¶
-
pynlpl.textprocessors.
split_sentences
(tokens)¶ Split sentences (based on tokenised data); returns a list of sentences, each of which is a list of tokens
-
pynlpl.textprocessors.
strip_accents
(s, encoding='utf-8')¶ Strip characters with diacritics and return a flat ascii representation
-
pynlpl.textprocessors.
swap
(tokens, maxdist=2)¶ Perform a swap operation on a sequence of tokens, exhaustively swapping all tokens up to the maximum specified distance. This is a subset of all permutations.
-
pynlpl.textprocessors.
tokenise
(text, regexps=(re.compile('^(?:(?:https?):(?:(?://)|(?:\\\\\\\\))|www\\.)(?:[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&](?:#!)?)*'), re.compile('^[A-Za-z0-9\\.\\+_-]+@[A-Za-z0-9\\._-]+(?:\\.[a-zA-Z]+)+')))¶ British-spelling alias for tokenize()
-
pynlpl.textprocessors.
tokenize
(text, regexps=(re.compile('^(?:(?:https?):(?:(?://)|(?:\\\\\\\\))|www\\.)(?:[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&](?:#!)?)*'), re.compile('^[A-Za-z0-9\\.\\+_-]+@[A-Za-z0-9\\._-]+(?:\\.[a-zA-Z]+)+')))¶ Tokenizes a string and returns a list of tokens
Parameters: - text (string) – The text to tokenise
- regexps (tuple/list of regular expressions) – Regular expressions to use as tokeniser rules (default: pynlpl.textprocessors.TOKENIZERRULES)
Return type: list of tokens
Examples:
>>> for token in tokenize("This is a test."):
...     print(token)
This
is
a
test
.