Welcome to polyglot’s documentation!

polyglot


Polyglot is a natural language pipeline that supports massive multilingual applications.

Features

  • Tokenization (165 Languages)
  • Language detection (196 Languages)
  • Named Entity Recognition (40 Languages)
  • Part of Speech Tagging (16 Languages)
  • Sentiment Analysis (136 Languages)
  • Word Embeddings (137 Languages)
  • Morphological analysis (135 Languages)
  • Transliteration (69 Languages)

Developer

  • Rami Al-Rfou @ rmyeid gmail com

Quick Tutorial

import polyglot
from polyglot.text import Text, Word

Language Detection

text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))
Language Detected: Code=fr, Name=French

Tokenization

zen = Text("Beautiful is better than ugly. "
           "Explicit is better than implicit. "
           "Simple is better than complex.")
print(zen.words)
[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']
print(zen.sentences)
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]

Part of Speech Tagging

text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")

print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))
Word            POS Tag
------------------------------
O               DET
primeiro        ADJ
uso             NOUN
de              ADP
desobediência   NOUN
civil           ADJ
em              ADP
massa           NOUN
ocorreu         ADJ
em              ADP
setembro        NOUN
de              ADP
1906            NUM
.               PUNCT

Named Entity Recognition

text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
print(text.entities)
[I-LOC([u'Gro\xdfbritannien']), I-PER([u'Gandhi'])]

Polarity

print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in zen.words[:6]:
    print("{:<16}{:>2}".format(w, w.polarity))
Word            Polarity
------------------------------
Beautiful        0
is               0
better           1
than             0
ugly            -1
.                0
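
The polarity values above are per-word scores in {-1, 0, 1}. As a small sketch (this is not a built-in polyglot sentence-level API), they can be averaged into a crude overall score for the snippet:

# Average the per-word polarity values shown above (first six tokens only);
# this is a crude aggregate rather than polyglot's own sentiment output.
scores = [w.polarity for w in zen.words[:6]]
print("Average polarity: {:.2f}".format(sum(scores) / float(len(scores))))
Average polarity: 0.00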

Embeddings

word = Word("Obama", language="en")
print("Neighbors (Synonms) of {}".format(word)+"\n"+"-"*30)
for w in word.neighbors:
    print("{:<16}".format(w))
print("\n\nThe first 10 dimensions out the {} dimensions\n".format(word.vector.shape[0]))
print(word.vector[:10])
Neighbors (Synonyms) of Obama
------------------------------
Bush
Reagan
Clinton
Ahmadinejad
Nixon
Karzai
McCain
Biden
Huckabee
Lula


The first 10 dimensions out of the 256 dimensions

[-2.57382345  1.52175975  0.51070285  1.08678675 -0.74386948 -1.18616164
  2.92784619 -0.25694436 -1.40958667 -2.39675403]

Morphology

word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)
[u'Pre', u'process', u'ing']

Transliteration

from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="ru")
print(transliterator.transliterate(u"preprocessing"))
препрокессинг

Contents:

Installation

Installing/Upgrading From the PyPI

$ pip install polyglot

Dependencies

polyglot depends on numpy and libicu-dev. On Ubuntu/Debian Linux distributions, you can install these packages by executing the following command:

sudo apt-get install python-numpy libicu-dev

From Source

polyglot is actively developed on GitHub.

You can clone the public repo:

git clone https://github.com/aboSamoor/polyglot

Or download one of the following:

Once you have the source, you can install it into your site-packages with:

python setup.py install

Get the bleeding edge version

To get the latest development version of polyglot, run:

$ pip install -U git+https://github.com/aboSamoor/polyglot.git@master

Python

polyglot supports Python >=2.7 or >=3.4.

Language Detection

Polyglot depends on the pycld2 library, which in turn depends on the cld2 library, to detect the language(s) used in plain text.

from polyglot.detect import Detector

Example

arabic_text = u"""
أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم
الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب
انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة
والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ".
"""
detector = Detector(arabic_text)
print(detector.language)
name: Arabic      code: ar       confidence:  99.0 read bytes:   907

Mixed Text

mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
"""

If the text contains snippets from different languages, the detector is able to find the most probable languages used in the text. For each language, we can query the model confidence level:

for language in Detector(mixed_text).languages:
  print(language)
name: English     code: en       confidence:  87.0 read bytes:  1154
name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
name: un          code: un       confidence:   0.0 read bytes:     0

To take a closer look, we can inspect the text line by line. Notice that the detection confidence drops for the first line:

for line in mixed_text.strip().splitlines():
  print(line + u"\n")
  for language in Detector(line).languages:
    print(language)
  print("\n")
China (simplified Chinese: 中国; traditional Chinese: 中國),

name: English     code: en       confidence:  71.0 read bytes:   887
name: Chinese     code: zh_Hant  confidence:  11.0 read bytes:  1755
name: un          code: un       confidence:   0.0 read bytes:     0


officially the People's Republic of China (PRC), is a sovereign state located in East Asia.

name: English     code: en       confidence:  98.0 read bytes:  1291
name: un          code: un       confidence:   0.0 read bytes:     0
name: un          code: un       confidence:   0.0 read bytes:     0

Best Effort Strategy

Sometimes there is not enough text to make a decision, for example when detecting a language from a single word. This forces the detector to switch to a best effort strategy: a warning is logged and the attribute reliable is set to False.

detector = Detector("pizza")
print(detector)
WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
Prediction is reliable: False
Language 1: name: English     code: en       confidence:  85.0 read bytes:  1194
Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
Language 3: name: un          code: un       confidence:   0.0 read bytes:     0

If the detection is not reliable even under the best effort strategy, an UnknownLanguage exception will be thrown.

print(Detector("4"))
---------------------------------------------------------------------------

UnknownLanguage                           Traceback (most recent call last)

<ipython-input-9-de43776398b9> in <module>()
----> 1 print(Detector("4"))


/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc in __init__(self, text, quiet)
     63     self.quiet = quiet
     64     """If true, exceptions will be silenced."""
---> 65     self.detect(text)
     66
     67   @staticmethod


/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc in detect(self, text)
     89
     90       if not reliable and not self.quiet:
---> 91         raise UnknownLanguage("Try passing a longer snippet of text")
     92       else:
     93         logger.warning("Detector is not able to detect the language reliably.")


UnknownLanguage: Try passing a longer snippet of text

Such an exception may not be desirable, especially for trivial cases like characters that could belong to many languages. In this case, we can silence the exception by setting quiet to True:

print(Detector("4", quiet=True))
WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
Prediction is reliable: False
Language 1: name: un          code: un       confidence:   0.0 read bytes:     0
Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
Language 3: name: un          code: un       confidence:   0.0 read bytes:     0

Command Line

!polyglot detect --help
usage: polyglot detect [-h] [--input [INPUT [INPUT ...]]]

optional arguments:
  -h, --help            show this help message and exit
  --input [INPUT [INPUT ...]]

The subcommand detect tries to identify the language code for each line in a text file. This is convenient if each line represents a document or a sentence, as might be produced by a tokenizer:

!polyglot detect --input testdata/cricket.txt
English             Australia posted a World Cup record total of 417-6 as they beat Afghanistan by 275 runs.
English             David Warner hit 178 off 133 balls, Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth.
English             Afghanistan were then dismissed for 142, with Mitchell Johnson and Mitchell Starc taking six wickets between them.
English             Australia's score surpassed the 413-5 India made against Bermuda in 2007.
English             It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages, following South Africa's 408-5 and 411-4 against West Indies and Ireland respectively.
English             The winning margin beats the 257-run amount by which India beat Bermuda in Port of Spain in 2007, which was equalled five days ago by South Africa in their victory over West Indies in Sydney.

Supported Languages

cld2 can detect up to 165 languages.

from polyglot.utils import pretty_list
print(pretty_list(Detector.supported_languages()))
  1. Abkhazian                  2. Afar                       3. Afrikaans
  4. Akan                       5. Albanian                   6. Amharic
  7. Arabic                     8. Armenian                   9. Assamese
 10. Aymara                    11. Azerbaijani               12. Bashkir
 13. Basque                    14. Belarusian                15. Bengali
 16. Bihari                    17. Bislama                   18. Bosnian
 19. Breton                    20. Bulgarian                 21. Burmese
 22. Catalan                   23. Cebuano                   24. Cherokee
 25. Nyanja                    26. Corsican                  27. Croatian
 28. Croatian                  29. Czech                     30. Chinese
 31. Chinese                   32. Chinese                   33. Chinese
 34. Chineset                  35. Chineset                  36. Chineset
 37. Chineset                  38. Chineset                  39. Chineset
 40. Danish                    41. Dhivehi                   42. Dutch
 43. Dzongkha                  44. English                   45. Esperanto
 46. Estonian                  47. Ewe                       48. Faroese
 49. Fijian                    50. Finnish                   51. French
 52. Frisian                   53. Ga                        54. Galician
 55. Ganda                     56. Georgian                  57. German
 58. Greek                     59. Greenlandic               60. Guarani
 61. Gujarati                  62. Haitian_creole            63. Hausa
 64. Hawaiian                  65. Hebrew                    66. Hebrew
 67. Hindi                     68. Hmong                     69. Hungarian
 70. Icelandic                 71. Igbo                      72. Indonesian
 73. Interlingua               74. Interlingue               75. Inuktitut
 76. Inupiak                   77. Irish                     78. Italian
 79. Ignore                    80. Javanese                  81. Javanese
 82. Japanese                  83. Kannada                   84. Kashmiri
 85. Kazakh                    86. Khasi                     87. Khmer
 88. Kinyarwanda               89. Krio                      90. Kurdish
 91. Kyrgyz                    92. Korean                    93. Laothian
 94. Latin                     95. Latvian                   96. Limbu
 97. Limbu                     98. Limbu                     99. Lingala
100. Lithuanian               101. Lozi                     102. Luba_lulua
103. Luo_kenya_and_tanzania   104. Luxembourgish            105. Macedonian
106. Malagasy                 107. Malay                    108. Malayalam
109. Maltese                  110. Manx                     111. Maori
112. Marathi                  113. Mauritian_creole         114. Romanian
115. Mongolian                116. Montenegrin              117. Montenegrin
118. Montenegrin              119. Montenegrin              120. Nauru
121. Ndebele                  122. Nepali                   123. Newari
124. Norwegian                125. Norwegian                126. Norwegian_n
127. Nyanja                   128. Occitan                  129. Oriya
130. Oromo                    131. Ossetian                 132. Pampanga
133. Pashto                   134. Pedi                     135. Persian
136. Polish                   137. Portuguese               138. Punjabi
139. Quechua                  140. Rajasthani               141. Rhaeto_romance
142. Romanian                 143. Rundi                    144. Russian
145. Samoan                   146. Sango                    147. Sanskrit
148. Scots                    149. Scots_gaelic             150. Serbian
151. Serbian                  152. Seselwa                  153. Seselwa
154. Sesotho                  155. Shona                    156. Sindhi
157. Sinhalese                158. Siswant                  159. Slovak
160. Slovenian                161. Somali                   162. Spanish
163. Sundanese                164. Swahili                  165. Swedish
166. Syriac                   167. Tagalog                  168. Tajik
169. Tamil                    170. Tatar                    171. Telugu
172. Thai                     173. Tibetan                  174. Tigrinya
175. Tonga                    176. Tsonga                   177. Tswana
178. Tumbuka                  179. Turkish                  180. Turkmen
181. Twi                      182. Uighur                   183. Ukrainian
184. Urdu                     185. Uzbek                    186. Venda
187. Vietnamese               188. Volapuk                  189. Waray_philippines
190. Welsh                    191. Wolof                    192. Xhosa
193. Yiddish                  194. Yoruba                   195. Zhuang
196. Zulu

Tokenization

Tokenization is the process of identifying the text boundaries of words and sentences. We can identify the boundaries of sentences first and then tokenize each sentence to identify the words that compose it. Of course, we can also do word tokenization first and then segment the token sequence into sentences. Tokenization in polyglot relies on the Unicode Text Segmentation algorithm as implemented by the ICU Project.

You can use the C/C++ ICU library by installing the required package libicu-dev. For example, on Ubuntu/Debian systems you can use the apt-get utility as follows:

sudo apt-get install libicu-dev
from polyglot.text import Text

Word Tokenization

To call our word tokenizer, first we need to construct a Text object.

blob = u"""
两个月前遭受恐怖袭击的法国巴黎的犹太超市在装修之后周日重新开放,法国内政部长以及超市的管理者都表示,这显示了生命力要比野蛮行为更强大。
该超市1月9日遭受枪手袭击,导致4人死亡,据悉这起事件与法国《查理周刊》杂志社恐怖袭击案有关。
"""
text = Text(blob)

The property words will call the word tokenizer.

text.words
WordList(['两', '个', '月', '前', '遭受', '恐怖', '袭击', '的', '法国', '巴黎', '的', '犹太', '超市', '在', '装修', '之后', '周日', '重新', '开放', ',', '法国', '内政', '部长', '以及', '超市', '的', '管理者', '都', '表示', ',', '这', '显示', '了', '生命力', '要', '比', '野蛮', '行为', '更', '强大', '。', '该', '超市', '1', '月', '9', '日', '遭受', '枪手', '袭击', ',', '导致', '4', '人', '死亡', ',', '据悉', '这', '起', '事件', '与', '法国', '《', '查理', '周刊', '》', '杂志', '社', '恐怖', '袭击', '案', '有关', '。'])

Since ICU boundary break algorithms are language aware, polyglot will detect the language used before calling the tokenizer:

print(text.language)
name:             code: zh       confidence:  99.0 read bytes:  1920

Sentence Segmentation

If we are interested in segmenting the text into sentences first, we can query the sentences property:

text.sentences
[Sentence("两个月前遭受恐怖袭击的法国巴黎的犹太超市在装修之后周日重新开放,法国内政部长以及超市的管理者都表示,这显示了生命力要比野蛮行为更强大。"),
 Sentence("该超市1月9日遭受枪手袭击,导致4人死亡,据悉这起事件与法国《查理周刊》杂志社恐怖袭击案有关。")]

The Sentence class inherits from Text; therefore, we can tokenize each sentence into words using the same words property:

first_sentence = text.sentences[0]
first_sentence.words
WordList(['两', '个', '月', '前', '遭受', '恐怖', '袭击', '的', '法国', '巴黎', '的', '犹太', '超市', '在', '装修', '之后', '周日', '重新', '开放', ',', '法国', '内政', '部长', '以及', '超市', '的', '管理者', '都', '表示', ',', '这', '显示', '了', '生命力', '要', '比', '野蛮', '行为', '更', '强大', '。'])

Command Line

By default, the subcommand tokenize performs both sentence segmentation and word tokenization.

! polyglot tokenize --help
usage: polyglot tokenize [-h] [--only-sent | --only-word] [--input [INPUT [INPUT ...]]]

optional arguments:
  -h, --help            show this help message and exit
  --only-sent           Segment sentences without word tokenization
  --only-word           Tokenize words without sentence segmentation
  --input [INPUT [INPUT ...]]

Each line represents a sentence where the words are split by spaces.

!polyglot --lang en tokenize --input testdata/cricket.txt
Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs .
David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth .
Afghanistan were then dismissed for 142 , with Mitchell Johnson and Mitchell Starc taking six wickets between them .
Australia's score surpassed the 413 - 5 India made against Bermuda in 2007 .
It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages , following South Africa's 408 - 5 and 411 - 4 against West Indies and Ireland respectively .
The winning margin beats the 257 - run amount by which India beat Bermuda in Port of Spain in 2007 , which was equalled five days ago by South Africa in their victory over West Indies in Sydney .

Command Line Interface

The polyglot package offers a command line interface alongside library access. For each task in polyglot, there is a subcommand with options specific to that task. Common options are gathered under the main command polyglot:

!polyglot --help
usage: polyglot [-h] [--lang LANG] [--delimiter DELIMITER] [--workers WORKERS] [-l LOG] [--debug]
                {detect,morph,tokenize,download,count,cat,ner,pos,transliteration,sentiment} ...

optional arguments:
  -h, --help            show this help message and exit
  --lang LANG           Language to be processed
  --delimiter DELIMITER
                        Delimiter that seperates documents, records or even sentences.
  --workers WORKERS     Number of parallel processes.
  -l LOG, --log LOG     log verbosity level
  --debug               drop a debugger if an exception is raised.

tools:
  multilingual tools for all languages

  {detect,morph,tokenize,download,count,cat,ner,pos,transliteration,sentiment}
    detect              Detect the language(s) used in text.
    tokenize            Tokenize text into sentences and words.
    download            Download polyglot resources and models.
    count               Count words frequency in a corpus.
    cat                 Print the contents of the input file to the screen.
    ner                 Named entity recognition chunking.
    pos                 Part of Speech tagger.
    transliteration     Rewriting the input in the target language script.
    sentiment           Classify text to positive and negative polarity.

Notice that most of the operations are language specific. For example, tokenization rules and part of speech taggers differ between languages. Therefore, it is important that the language of the input is detected or given. The --lang option allows you to tell polyglot which language the input is written in.

!polyglot --lang en tokenize --input testdata/cricket.txt | head -n 3
Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs .
David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth .
Afghanistan were then dismissed for 142 , with Mitchell Johnson and Mitchell Starc taking six wickets between them .

If the user does not supply the language code, polyglot will peek ahead and read the first 1KB of data to detect the language used in the input.

!polyglot tokenize --input testdata/cricket.txt | head -n 3
2015-03-15 17:06:45 INFO __main__.py: 276 Language English is detected while reading the first 1128 bytes.
Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs .
David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth .
Afghanistan were then dismissed for 142 , with Mitchell Johnson and Mitchell Starc taking six wickets between them .

Input formats

Polyglot will process the input contents line by line, assuming that the lines are separated by “\n”. If the file is formatted differently, you can use the polyglot main command option delimiter to specify any separator string other than “\n”.

You can pass text to the polyglot subcommands in several ways:

  • Standard input: This is usually useful for building processing pipelines.
  • Text file: The file contents will be processed line by line.
  • Collection of text files: Polyglot will iterate over the files one by one. If the polyglot main command option workers is activated, the execution will be parallelized and each file will be processed by a different process.

Word Count Example

This example will demonstrate how to use the polyglot main command options and the subcommand count to generate a count of the words appearing in a collection of text files.

First, let us examine the options of the subcommand count:

!polyglot count --help
usage: polyglot count [-h] [--min-count MIN_COUNT | --most-freq MOST_FREQ] [--input [INPUT [INPUT ...]]]

optional arguments:
  -h, --help            show this help message and exit
  --min-count MIN_COUNT
                        Ignore all words that appear <= min_freq.
  --most-freq MOST_FREQ
                        Consider only the most frequent k words.
  --input [INPUT [INPUT ...]]

To avoid long output, we will restrict the count to the words that appear at least twice:

!polyglot count --input testdata/cricket.txt --min-count 2
in  10
the 6
by  3
and 3
of  3
Bermuda     2
West        2
Mitchell    2
South       2
Indies      2
against     2
beat        2
as  2
India       2
which       2
score       2
Afghanistan 2

Let us consider the scenario where we have hundreds of files containing words we want to count. Notice that we can parallelize the process by passing a number greater than 1 to the polyglot main command option workers.

!polyglot --log debug --workers 5 count --input testdata/cricket.txt testdata/cricket.txt --min-count 3
in  20
the 12
of  6
by  6
and 6
West        4
Afghanistan 4
India       4
beat        4
which       4
Indies      4
Bermuda     4
as  4
South       4
Mitchell    4
against     4
score       4

Building Pipelines

The previous subcommand count assumed that the words are separated by spaces. Given that we never tokenized the text file, this may result in suboptimal word counting. Let us take a closer look at the tail of the word counts:

!polyglot count --input testdata/cricket.txt | tail -n 10
Ireland     1
surpassed   1
amount      1
equalled    1
a   1
The 1
413-5       1
Africa's    1
tournament  1
Johnson     1

Observe that words like “2007.” could have been counted as two words, “2007” and “.”, and the same holds for “Africa’s”. To fix this issue, we can use the polyglot subcommand tokenize to handle these cases. We can stage the counting to happen after the tokenization, using stdin to build a simple pipe:

!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot count --min-count 2
in  10
the 6
.   6
-   5
,   4
of  3
and 3
by  3
South       2
5   2
2007        2
Bermuda     2
which       2
score       2
against     2
Mitchell    2
as  2
West        2
India       2
beat        2
Afghanistan 2
Indies      2

Notice that the word “2007” now appears in the word counts list.

Downloading Models

Polyglot requires a model for each task and language. These models are essential for the library to function. Given the large size of some of the models, we distribute the models through a download manager separately. The download manager has several modes of operation.

Modes of Operation

Command Line Mode

The subcommand download takes one or more packages as arguments and downloads the specified packages into the polyglot_data directory.

!polyglot download --help
usage: polyglot download [-h] [--dir DIR] [--quiet] [--force] [--exit-on-error] [--url SERVER_INDEX_URL] [packages [packages ...]]

positional arguments:
  packages              packages to be downloaded

optional arguments:
  -h, --help            show this help message and exit
  --dir DIR             download package to directory DIR
  --quiet               work quietly
  --force               download even if already installed
  --exit-on-error       exit if an error occurs
  --url SERVER_INDEX_URL
                        download server index url
!polyglot download morph2.en
[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!

Interactive Mode

You can reach this mode by not supplying any arguments to the command line.

!polyglot download
Polyglot Downloader
---------------------------------------------------------------------------
  d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader>

Library Interface

from polyglot.downloader import downloader
downloader.download("embeddings2.en")

Collections

You have noticed by now that we can install a specific model by specifying its name and the target language.

The package name format is task_name.language_code

Packages are grouped by language. For example, if we want to download all the models that are specific to Arabic, the name of the Arabic collection of models is LANG: followed by the language code of Arabic, which is ar.

Therefore, we can just run:

!polyglot download LANG:ar
[polyglot_data] Downloading collection u'LANG:ar'
[polyglot_data]    |
[polyglot_data]    | Downloading package tsne2.ar to
[polyglot_data]    |     /home/rmyeid/polyglot_data...
[polyglot_data]    |   Package tsne2.ar is already up-to-date!
[polyglot_data]    | Downloading package transliteration2.ar to
[polyglot_data]    |     /home/rmyeid/polyglot_data...
[polyglot_data]    |   Package transliteration2.ar is already up-to-
[polyglot_data]    |       date!
[polyglot_data]    | Downloading package morph2.ar to
[polyglot_data]    |     /home/rmyeid/polyglot_data...
[polyglot_data]    |   Package morph2.ar is already up-to-date!
[polyglot_data]    | Downloading package counts2.ar to
[polyglot_data]    |     /home/rmyeid/polyglot_data...
[polyglot_data]    |   Package counts2.ar is already up-to-date!
[polyglot_data]    | Downloading package sentiment2.ar to
[polyglot_data]    |     /home/rmyeid/polyglot_data...
[polyglot_data]    |   Package sentiment2.ar is already up-to-date!
[polyglot_data]    | Downloading package embeddings2.ar to
[polyglot_data]    |     /home/rmyeid/polyglot_data...
[polyglot_data]    |   Package embeddings2.ar is already up-to-date!
[polyglot_data]    | Downloading package ner2.ar to
[polyglot_data]    |     /home/rmyeid/polyglot_data...
[polyglot_data]    |   Package ner2.ar is already up-to-date!
[polyglot_data]    |
[polyglot_data]  Done downloading collection LANG:ar

Packages are also grouped by task. For example, if we want to download all the models that perform transliteration, the collection name is TASK: followed by the task name.

Therefore, we can just run:

downloader.download("TASK:transliteration2", quiet=True)
True

Langauge & Task Support

We can query our download manager for the tasks that are supported by polyglot, as follows:

downloader.supported_tasks(lang="en")
[u'embeddings2',
 u'counts2',
 u'pos2',
 u'ner2',
 u'sentiment2',
 u'morph2',
 u'tsne2']

We can query our download manager for the languages supported by polyglot's named entity recognition subsystem, as follows:

print(downloader.supported_languages_table(task="ner2"))
 1. Polish                     2. Turkish                    3. Russian
 4. Indonesian                 5. Czech                      6. Arabic
 7. Korean                     8. Catalan; Valencian         9. Italian
10. Thai                      11. Romanian, Moldavian, ...  12. Tagalog
13. Danish                    14. Finnish                   15. German
16. Persian                   17. Dutch                     18. Chinese
19. French                    20. Portuguese                21. Slovak
22. Hebrew (modern)           23. Malay                     24. Slovene
25. Bulgarian                 26. Hindi                     27. Japanese
28. Hungarian                 29. Croatian                  30. Ukrainian
31. Serbian                   32. Lithuanian                33. Norwegian
34. Latvian                   35. Swedish                   36. English
37. Greek, Modern             38. Spanish; Castilian        39. Vietnamese
40. Estonian

You can view all the available and/or installed collections or packages through the list function:

downloader.list(show_packages=False)
Using default data directory (/home/rmyeid/polyglot_data)
=========================================
 Data server index for <polyglot-models>
=========================================
Collections:
  [ ] LANG:af............. Afrikaans            packages and models
  [ ] LANG:als............ als                  packages and models
  [ ] LANG:am............. Amharic              packages and models
  [ ] LANG:an............. Aragonese            packages and models
  [ ] LANG:ar............. Arabic               packages and models
  [ ] LANG:arz............ arz                  packages and models
  [ ] LANG:as............. Assamese             packages and models
  [ ] LANG:ast............ Asturian             packages and models
  [ ] LANG:az............. Azerbaijani          packages and models
  [ ] LANG:ba............. Bashkir              packages and models
  [ ] LANG:bar............ bar                  packages and models
  [ ] LANG:be............. Belarusian           packages and models
  [ ] LANG:bg............. Bulgarian            packages and models
  [ ] LANG:bn............. Bengali              packages and models
  [ ] LANG:bo............. Tibetan              packages and models
  [ ] LANG:bpy............ bpy                  packages and models
  [ ] LANG:br............. Breton               packages and models
  [ ] LANG:bs............. Bosnian              packages and models
  [ ] LANG:ca............. Catalan              packages and models
  [ ] LANG:ce............. Chechen              packages and models
  [ ] LANG:ceb............ Cebuano              packages and models
  [ ] LANG:cs............. Czech                packages and models
  [ ] LANG:cv............. Chuvash              packages and models
  [ ] LANG:cy............. Welsh                packages and models
  [ ] LANG:da............. Danish               packages and models
  [ ] LANG:de............. German               packages and models
  [ ] LANG:diq............ diq                  packages and models
  [ ] LANG:dv............. Divehi               packages and models
  [ ] LANG:el............. Greek                packages and models
  [P] LANG:en............. English              packages and models
  [ ] LANG:eo............. Esperanto            packages and models
  [ ] LANG:es............. Spanish              packages and models
  [ ] LANG:et............. Estonian             packages and models
  [ ] LANG:eu............. Basque               packages and models
  [ ] LANG:fa............. Persian              packages and models
  [ ] LANG:fi............. Finnish              packages and models
  [ ] LANG:fo............. Faroese              packages and models
  [ ] LANG:fr............. French               packages and models
  [ ] LANG:fy............. Western Frisian      packages and models
  [ ] LANG:ga............. Irish                packages and models
  [ ] LANG:gan............ gan                  packages and models
  [ ] LANG:gd............. Scottish Gaelic      packages and models
  [ ] LANG:gl............. Galician             packages and models
  [ ] LANG:gu............. Gujarati             packages and models
  [ ] LANG:gv............. Manx                 packages and models
  [ ] LANG:he............. Hebrew               packages and models
  [ ] LANG:hi............. Hindi                packages and models
  [ ] LANG:hif............ hif                  packages and models
  [ ] LANG:hr............. Croatian             packages and models
  [ ] LANG:hsb............ Upper Sorbian        packages and models
  [ ] LANG:ht............. Haitian              packages and models
  [ ] LANG:hu............. Hungarian            packages and models
  [ ] LANG:hy............. Armenian             packages and models
  [ ] LANG:ia............. Interlingua          packages and models
  [ ] LANG:id............. Indonesian           packages and models
  [ ] LANG:ilo............ Iloko                packages and models
  [ ] LANG:io............. Ido                  packages and models
  [ ] LANG:is............. Icelandic            packages and models
  [ ] LANG:it............. Italian              packages and models
  [ ] LANG:ja............. Japanese             packages and models
  [ ] LANG:jv............. Javanese             packages and models
  [ ] LANG:ka............. Georgian             packages and models
  [ ] LANG:kk............. Kazakh               packages and models
  [ ] LANG:km............. Khmer                packages and models
  [ ] LANG:kn............. Kannada              packages and models
  [ ] LANG:ko............. Korean               packages and models
  [ ] LANG:ku............. Kurdish              packages and models
  [ ] LANG:ky............. Kyrgyz               packages and models
  [ ] LANG:la............. Latin                packages and models
  [ ] LANG:lb............. Luxembourgish        packages and models
  [ ] LANG:li............. Limburgish           packages and models
  [ ] LANG:lmo............ lmo                  packages and models
  [ ] LANG:lt............. Lithuanian           packages and models
  [ ] LANG:lv............. Latvian              packages and models
  [ ] LANG:mg............. Malagasy             packages and models
  [ ] LANG:mk............. Macedonian           packages and models
  [ ] LANG:ml............. Malayalam            packages and models
  [ ] LANG:mn............. Mongolian            packages and models
  [ ] LANG:mr............. Marathi              packages and models
  [ ] LANG:ms............. Malay                packages and models
  [ ] LANG:mt............. Maltese              packages and models
  [ ] LANG:my............. Burmese              packages and models
  [ ] LANG:ne............. Nepali               packages and models
  [ ] LANG:nl............. Dutch                packages and models
  [ ] LANG:nn............. Norwegian Nynorsk    packages and models
  [ ] LANG:no............. Norwegian            packages and models
  [ ] LANG:oc............. Occitan              packages and models
  [ ] LANG:or............. Oriya                packages and models
  [ ] LANG:os............. Ossetic              packages and models
  [ ] LANG:pa............. Punjabi              packages and models
  [ ] LANG:pam............ Pampanga             packages and models
  [ ] LANG:pl............. Polish               packages and models
  [ ] LANG:pms............ pms                  packages and models
  [ ] LANG:ps............. Pashto               packages and models
  [ ] LANG:pt............. Portuguese           packages and models
  [ ] LANG:qu............. Quechua              packages and models
  [ ] LANG:rm............. Romansh              packages and models
  [ ] LANG:ro............. Romanian             packages and models
  [ ] LANG:ru............. Russian              packages and models
  [ ] LANG:sa............. Sanskrit             packages and models
  [ ] LANG:sah............ Sakha                packages and models
  [ ] LANG:scn............ Sicilian             packages and models
  [ ] LANG:sco............ Scots                packages and models
  [ ] LANG:se............. Northern Sami        packages and models
  [ ] LANG:sh............. Serbo-Croatian       packages and models
  [ ] LANG:si............. Sinhala              packages and models
  [ ] LANG:sk............. Slovak               packages and models
  [ ] LANG:sl............. Slovenian            packages and models
  [ ] LANG:sq............. Albanian             packages and models
  [ ] LANG:sr............. Serbian              packages and models
  [ ] LANG:su............. Sundanese            packages and models
  [ ] LANG:sv............. Swedish              packages and models
  [ ] LANG:sw............. Swahili              packages and models
  [ ] LANG:szl............ szl                  packages and models
  [ ] LANG:ta............. Tamil                packages and models
  [ ] LANG:te............. Telugu               packages and models
  [ ] LANG:tg............. Tajik                packages and models
  [ ] LANG:th............. Thai                 packages and models
  [ ] LANG:tk............. Turkmen              packages and models
  [ ] LANG:tl............. Tagalog              packages and models
  [ ] LANG:tr............. Turkish              packages and models
  [ ] LANG:tt............. Tatar                packages and models
  [ ] LANG:ug............. Uyghur               packages and models
  [ ] LANG:uk............. Ukrainian            packages and models
  [ ] LANG:ur............. Urdu                 packages and models
  [ ] LANG:uz............. Uzbek                packages and models
  [ ] LANG:vec............ vec                  packages and models
  [ ] LANG:vi............. Vietnamese           packages and models
  [ ] LANG:vls............ vls                  packages and models
  [ ] LANG:vo............. Volapük              packages and models
  [ ] LANG:wa............. Walloon              packages and models
  [ ] LANG:war............ Waray                packages and models
  [ ] LANG:yi............. Yiddish              packages and models
  [ ] LANG:yo............. Yoruba               packages and models
  [ ] LANG:zh............. Chinese              packages and models
  [ ] LANG:zhc............ Chinese Character    packages and models
  [ ] LANG:zhw............ zhw                  packages and models
  [ ] TASK:counts2........ counts2
  [ ] TASK:embeddings2.... embeddings2
  [ ] TASK:ner2........... ner2
  [P] TASK:sentiment2..... sentiment2
  [ ] TASK:tsne2.......... tsne2

([*] marks installed packages; [P] marks partially installed collections)

Word Embeddings

A word embedding is a mapping of a word to a d-dimensional vector space. This real-valued vector representation captures semantic and syntactic features. Polyglot offers a simple interface to load several formats of word embeddings.

from polyglot.mapping import Embedding

Formats

The Embedding class can read word embeddings from different sources:

  • Gensim word2vec objects: (from_gensim method)
  • Word2vec binary/text models: (from_word2vec method)
  • GloVe models (from_glove method)
  • polyglot pickle files: (load method)
embeddings = Embedding.load("/home/rmyeid/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2")
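
The other constructors listed above follow the same pattern as load. For example (the file paths below are placeholders you would replace with your own model files, and the exact signatures may vary between versions, so treat this as a sketch):

# Sketch only: the paths are placeholders, not files shipped with polyglot.
w2v_embeddings = Embedding.from_word2vec("/path/to/word2vec_vectors.txt")
glove_embeddings = Embedding.from_glove("/path/to/glove.6B.50d.txt")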

Nearest Neighbors

A common way to investigate the space captured by the embeddings is to query for the nearest neighbors of any word.

neighbors = embeddings.nearest_neighbors("green")
neighbors
[u'blue',
 u'white',
 u'red',
 u'yellow',
 u'black',
 u'grey',
 u'purple',
 u'pink',
 u'light',
 u'gray']

To calculate the distance between a word and its neighbors, we can call the distances method:

embeddings.distances("green", neighbors)
array([ 1.34894466,  1.37864077,  1.39504588,  1.39524949,  1.43183875,
        1.68007386,  1.75897062,  1.88401115,  1.89186132,  1.902614  ], dtype=float32)

The word embeddings are not unit vectors; in fact, the more frequent a word is, the larger the norm of its own vector.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
norms = np.linalg.norm(embeddings.vectors, axis=1)
window = 300
smooth_line = np.convolve(norms, np.ones(window)/float(window), mode='valid')
plt.plot(smooth_line)
plt.xlabel("Word Rank"); _ = plt.ylabel("$L_2$ norm")

This could be problematic for some applications and training algorithms. We can normalize the vectors by their \(L_2\) norms to get unit vectors and reduce the effect of word frequency, as follows:

embeddings = embeddings.normalize_words()
neighbors = embeddings.nearest_neighbors("green")
for w,d in zip(neighbors, embeddings.distances("green", neighbors)):
  print("{:<8}{:.4f}".format(w,d))
white   0.4261
blue    0.4451
black   0.4591
red     0.4786
yellow  0.4947
grey    0.6072
purple  0.6392
light   0.6483
pink    0.6574
colour  0.6824

Vocabulary Expansion

from polyglot.mapping import CaseExpander, DigitExpander

Not all words are available in the vocabulary defined by the word embeddings. Sometimes it would be useful to map new words to similar ones for which we do have embeddings.

Case Expansion

For example, the word GREEN is not available in the embeddings:

"GREEN" in embeddings
False

We would like to return the vector that represents the word Green; to do that, we apply case expansion:

embeddings.apply_expansion(CaseExpander)
"GREEN" in embeddings
True
embeddings.nearest_neighbors("GREEN")
[u'White',
 u'Black',
 u'Brown',
 u'Blue',
 u'Diamond',
 u'Wood',
 u'Young',
 u'Hudson',
 u'Cook',
 u'Gold']

Digit Expansion

We reduce the size of the vocabulary while training the embeddings by grouping special classes of words. One common case of such grouping is digits. Every digit in the training corpus gets replaced by the symbol #. For example, a number like 123.54 becomes ###.##. Therefore, querying the embedding for a new number like 434 will result in a failure:

"434" in embeddings
False

To fix that, we apply another type of vocabulary expansion, DigitExpander. It will map any number to a sequence of #s.

embeddings.apply_expansion(DigitExpander)
"434" in embeddings
True

As expected, the neighbors of the new number 434 will be other numbers:

embeddings.nearest_neighbors("434")
[u'##',
 u'#',
 u'3',
 u'#####',
 u'#,###',
 u'##,###',
 u'##EN##',
 u'####',
 u'###EN###',
 u'n']

Demo

Demo is available here.

Citation

This work is a direct implementation of the research described in the Polyglot: Distributed Word Representations for Multilingual NLP paper. The author of this library strongly encourages you to cite the following paper if you are using this software.

@InProceedings{polyglot:2013:ACL-CoNLL,
 author    = {Al-Rfou, Rami  and  Perozzi, Bryan  and  Skiena, Steven},
 title     = {Polyglot: Distributed Word Representations for Multilingual NLP},
 booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
 month     = {August},
 year      = {2013},
 address   = {Sofia, Bulgaria},
 publisher = {Association for Computational Linguistics},
 pages     = {183--192},
 url       = {http://www.aclweb.org/anthology/W13-3520}
}

Part of Speech Tagging

The part of speech tagging task aims to assign to every word/token in plain text a category that identifies the syntactic functionality of that word occurrence.

Polyglot recognizes 17 parts of speech; this set is called the universal part of speech tag set:

  • ADJ: adjective
  • ADP: adposition
  • ADV: adverb
  • AUX: auxiliary verb
  • CONJ: coordinating conjunction
  • DET: determiner
  • INTJ: interjection
  • NOUN: noun
  • NUM: numeral
  • PART: particle
  • PRON: pronoun
  • PROPN: proper noun
  • PUNCT: punctuation
  • SCONJ: subordinating conjunction
  • SYM: symbol
  • VERB: verb
  • X: other

Languages Coverage

The models were trained on a combination of:

  • Original CoNLL datasets after the tags were converted using the universal POS tables.
  • Universal Dependencies 1.0 corpora whenever they are available.
from polyglot.downloader import downloader
print(downloader.supported_languages_table("pos2"))
 1. German                     2. Italian                    3. Danish
 4. Czech                      5. Slovene                    6. French
 7. English                    8. Swedish                    9. Bulgarian
10. Spanish; Castilian        11. Indonesian                12. Portuguese
13. Finnish                   14. Irish                     15. Hungarian
16. Dutch

Download Necessary Models

%%bash
polyglot download embeddings2.en pos2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package pos2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package pos2.en is already up-to-date!

Example

We tag each word in the text with one part of speech.

from polyglot.text import Text
blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)

# We can also specify language of that text by using
# text = Text(blob, hint_language_code='en')

We can query all the tagged words:

text.pos_tags
[(u'We', u'PRON'),
 (u'will', u'AUX'),
 (u'meet', u'VERB'),
 (u'at', u'ADP'),
 (u'eight', u'NUM'),
 (u"o'clock", u'NOUN'),
 (u'on', u'ADP'),
 (u'Thursday', u'PROPN'),
 (u'morning', u'NOUN'),
 (u'.', u'PUNCT')]

After calling the pos_tags property once, the word objects will carry their POS tags.

text.words[0].pos_tag
u'PRON'
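
Because pos_tags yields (word, tag) pairs, it is easy to filter tokens by tag. As a small sketch reusing the text object above (given the tags listed earlier, this keeps the two NOUN tokens):

# Keep only the tokens tagged NOUN in the example sentence above.
nouns = [w for w, t in text.pos_tags if t == u"NOUN"]
print(nouns)
[u"o'clock", u'morning']
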
!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en pos | tail -n 30
which           DET
India           PROPN
beat            VERB
Bermuda         PROPN
in              ADP
Port            PROPN
of              ADP
Spain           PROPN
in              ADP
2007            NUM
,               PUNCT
which           DET
was             AUX
equalled        VERB
five            NUM
days            NOUN
ago             ADV
by              ADP
South           PROPN
Africa          PROPN
in              ADP
their           PRON
victory         NOUN
over            ADP
West            PROPN
Indies          PROPN
in              ADP
Sydney          PROPN
.               PUNCT

This work is a direct implementation of the research described in the Polyglot: Distributed Word Representations for Multilingual NLP paper. The author of this library strongly encourages you to cite the following paper if you are using this software.

@InProceedings{polyglot:2013:ACL-CoNLL,
  author    = {Al-Rfou, Rami  and  Perozzi, Bryan  and  Skiena, Steven},
  title     = {Polyglot: Distributed Word Representations for Multilingual NLP},
  booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
  month     = {August},
  year      = {2013},
  address   = {Sofia, Bulgaria},
  publisher = {Association for Computational Linguistics},
  pages     = {183--192},
  url       = {http://www.aclweb.org/anthology/W13-3520}
}

Named Entity Extraction

The named entity extraction task aims to extract phrases from plain text that correspond to entities. Polyglot recognizes 3 categories of entities:

  • Locations (Tag: I-LOC): cities, countries, regions, continents, neighborhoods, administrative divisions …
  • Organizations (Tag: I-ORG): sports teams, newspapers, banks, universities, schools, non-profits, companies, …
  • Persons (Tag: I-PER): politicians, scientists, artists, athletes …

Languages Coverage

The models were trained on datasets extracted automatically from Wikipedia. Polyglot currently supports 40 major languages.

from polyglot.downloader import downloader
print(downloader.supported_languages_table("ner2", 3))
 1. Polish                     2. Turkish                    3. Russian
 4. Indonesian                 5. Czech                      6. Arabic
 7. Korean                     8. Catalan; Valencian         9. Italian
10. Thai                      11. Romanian, Moldavian, ...  12. Tagalog
13. Danish                    14. Finnish                   15. German
16. Persian                   17. Dutch                     18. Chinese
19. French                    20. Portuguese                21. Slovak
22. Hebrew (modern)           23. Malay                     24. Slovene
25. Bulgarian                 26. Hindi                     27. Japanese
28. Hungarian                 29. Croatian                  30. Ukrainian
31. Serbian                   32. Lithuanian                33. Norwegian
34. Latvian                   35. Swedish                   36. English
37. Greek, Modern             38. Spanish; Castilian        39. Vietnamese
40. Estonian

Download Necessary Models

%%bash
polyglot download embeddings2.en ner2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package ner2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package ner2.en is already up-to-date!

Example

Entities inside a text object or a sentence are represented as chunks. Each chunk identifies the start and the end indices of the word subsequence within the text.

from polyglot.text import Text
blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
text = Text(blob)

# We can also specify language of that text by using
# text = Text(blob, hint_language_code='en')

We can query all entities mentioned in a text.

text.entities
[I-ORG([u'Israeli']), I-PER([u'Benjamin', u'Netanyahu']), I-LOC([u'Iran'])]

Or, we can query the entities per sentence:

for sent in text.sentences:
  print(sent, "\n")
  for entity in sent.entities:
    print(entity.tag, entity)
The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world".

I-ORG [u'Israeli']
I-PER [u'Benjamin', u'Netanyahu']
I-LOC [u'Iran']

By inspecting the second entity, Benjamin Netanyahu, more closely, we can locate its position within the sentence.

benjamin = sent.entities[1]
sent.words[benjamin.start: benjamin.end]
WordList([u'Benjamin', u'Netanyahu'])
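
The tag, start and end attributes are available on every chunk, so we can list all entities together with their word-index spans (a small sketch reusing the text object above; output omitted):

# Print each entity's tag and the indices of its word span within the text.
for entity in text.entities:
  print(entity.tag, entity.start, entity.end)
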
!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en ner | tail -n 20
,               O
which           O
was             O
equalled        O
five            O
days            O
ago             O
by              O
South           I-LOC
Africa          I-LOC
in              O
their           O
victory         O
over            O
West            I-ORG
Indies          I-ORG
in              O
Sydney          I-LOC
.               O

Demo

This work is a direct implementation of the research described in the Polyglot-NER: Multilingual Named Entity Recognition paper. The author of this library strongly encourages you to cite the following paper if you are using this software.

@article{polyglotner,
        author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},
        title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},
        journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30 - May 2, 2015}},
        month     = {April},
        year      = {2015},
        publisher = {SIAM}
}

Morphological Analysis

Polyglot offers trained morfessor models to generate morphemes from words. The goal of the Morpho project is to develop unsupervised, data-driven methods that discover the regularities behind word forming in natural languages. In particular, the Morpho project focuses on the discovery of morphemes, which are the primitive units of syntax, the smallest individually meaningful elements in the utterances of a language. Morphemes are important for the automatic generation and recognition of a language, especially for languages in which words may have many different inflected forms.

Languages Coverage

Using polyglot vocabulary dictionaries, we trained morfessor models on the 50,000 most frequent words of each language.

from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))
  1. Piedmontese language       2. Lombard language           3. Gan Chinese
  4. Sicilian                   5. Scots                      6. Kirghiz, Kyrgyz
  7. Pashto, Pushto             8. Kurdish                    9. Portuguese
 10. Kannada                   11. Korean                    12. Khmer
 13. Kazakh                    14. Ilokano                   15. Polish
 16. Panjabi, Punjabi          17. Georgian                  18. Chuvash
 19. Alemannic                 20. Czech                     21. Welsh
 22. Chechen                   23. Catalan; Valencian        24. Northern Sami
 25. Sanskrit (Saṁskṛta)       26. Slovene                   27. Javanese
 28. Slovak                    29. Bosnian-Croatian-Serbian  30. Bavarian
 31. Swedish                   32. Swahili                   33. Sundanese
 34. Serbian                   35. Albanian                  36. Japanese
 37. Western Frisian           38. French                    39. Finnish
 40. Upper Sorbian             41. Faroese                   42. Persian
 43. Sinhala, Sinhalese        44. Italian                   45. Amharic
 46. Aragonese                 47. Volapük                   48. Icelandic
 49. Sakha                     50. Afrikaans                 51. Indonesian
 52. Interlingua               53. Azerbaijani               54. Ido
 55. Arabic                    56. Assamese                  57. Yoruba
 58. Yiddish                   59. Waray-Waray               60. Croatian
 61. Hungarian                 62. Haitian; Haitian Creole   63. Quechua
 64. Armenian                  65. Hebrew (modern)           66. Silesian
 67. Hindi                     68. Divehi; Dhivehi; Mald...  69. German
 70. Danish                    71. Occitan                   72. Tagalog
 73. Turkmen                   74. Thai                      75. Tajik
 76. Greek, Modern             77. Telugu                    78. Tamil
 79. Oriya                     80. Ossetian, Ossetic         81. Tatar
 82. Turkish                   83. Kapampangan               84. Venetian
 85. Manx                      86. Gujarati                  87. Galician
 88. Irish                     89. Scottish Gaelic; Gaelic   90. Nepali
 91. Cebuano                   92. Zazaki                    93. Walloon
 94. Dutch                     95. Norwegian                 96. Norwegian Nynorsk
 97. West Flemish              98. Chinese                   99. Bosnian
100. Breton                   101. Belarusian               102. Bulgarian
103. Bashkir                  104. Egyptian Arabic          105. Tibetan Standard, Tib...
106. Bengali                  107. Burmese                  108. Romansh
109. Marathi (Marāṭhī)        110. Malay                    111. Maltese
112. Russian                  113. Macedonian               114. Malayalam
115. Mongolian                116. Malagasy                 117. Vietnamese
118. Spanish; Castilian       119. Estonian                 120. Basque
121. Bishnupriya Manipuri     122. Asturian                 123. English
124. Esperanto                125. Luxembourgish, Letzeb... 126. Latin
127. Uighur, Uyghur           128. Ukrainian                129. Limburgish, Limburgan...
130. Latvian                  131. Urdu                     132. Lithuanian
133. Fiji Hindi               134. Uzbek                    135. Romanian, Moldavian, ...

Download Necessary Models

%%bash
polyglot download morph2.en morph2.ar
[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!

Example

from polyglot.text import Text, Word
words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
  w = Word(w, language="en")
  print("{:<20}{}".format(w, w.morphemes))
preprocessing       ['pre', 'process', 'ing']
processor           ['process', 'or']
invaluable          ['in', 'valuable']
thankful            ['thank', 'ful']
crossed             ['cross', 'ed']

If the text is not tokenized properly, morphological analysis can offer a smart way of splitting the text into its original units. Here is an example:

blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"
text.morphemes
WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])
!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en morph | tail -n 30
which           which
India           In_dia
beat            beat
Bermuda         Ber_mud_a
in              in
Port            Port
of              of
Spain           Spa_in
in              in
2007            2007
,               ,
which           which
was             wa_s
equalled        equal_led
five            five
days            day_s
ago             ago
by              by
South           South
Africa          Africa
in              in
their           t_heir
victory         victor_y
over            over
West            West
Indies          In_dies
in              in
Sydney          Syd_ney
.               .

Demo

This demo does not reflect the models supplied by polyglot; however, we think it is indicative of what you should expect from morfessor.

This is an interface to the implementation described in the Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline technical report.

@InProceedings{morfessor2,
               title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
               author = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},
               year = {2013},
               publisher = {Department of Signal Processing and Acoustics, Aalto University},
               booktitle = {Aalto University publication series}
}

Transliteration

Transliteration is the conversion of a text from one script to another. For instance, a Latin transliteration of the Greek phrase “Ελληνική Δημοκρατία”, usually translated as ‘Hellenic Republic’, is “Ellēnikḗ Dēmokratía”.

from polyglot.transliteration import Transliterator

Languages Coverage

from polyglot.downloader import downloader
print(downloader.supported_languages_table("transliteration2"))
 1. Haitian; Haitian Creole    2. Tamil                      3. Vietnamese
 4. Telugu                     5. Croatian                   6. Hungarian
 7. Thai                       8. Kannada                    9. Tagalog
10. Armenian                  11. Hebrew (modern)           12. Turkish
13. Portuguese                14. Belarusian                15. Norwegian Nynorsk
16. Norwegian                 17. Dutch                     18. Japanese
19. Albanian                  20. Bulgarian                 21. Serbian
22. Swahili                   23. Swedish                   24. French
25. Latin                     26. Czech                     27. Yiddish
28. Hindi                     29. Danish                    30. Finnish
31. German                    32. Bosnian-Croatian-Serbian  33. Slovak
34. Persian                   35. Lithuanian                36. Slovene
37. Latvian                   38. Bosnian                   39. Gujarati
40. Italian                   41. Icelandic                 42. Spanish; Castilian
43. Ukrainian                 44. Georgian                  45. Urdu
46. Indonesian                47. Marathi (Marāṭhī)         48. Korean
49. Galician                  50. Khmer                     51. Catalan; Valencian
52. Romanian, Moldavian, ...  53. Basque                    54. Macedonian
55. Russian                   56. Azerbaijani               57. Chinese
58. Estonian                  59. Welsh                     60. Arabic
61. Bengali                   62. Amharic                   63. Irish
64. Malay                     65. Afrikaans                 66. Polish
67. Greek, Modern             68. Esperanto                 69. Maltese

Downloading Necessary Models

%%bash
polyglot download embeddings2.en transliteration2.ar
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package transliteration2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package transliteration2.ar is already up-to-date!

Example

We transliterate each word in the text into the target script.

from polyglot.text import Text
blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)

We can query the transliteration of each word:

for x in text.transliterate("ar"):
  print(x)
وي
ويل
ميت
ات
ييايت
أوكلوك
ون
ثورسداي
مورنينغ
!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en transliteration --target ar | tail -n 30
which           ويكه
India           ينديا
beat            بيت
Bermuda         بيرمودا
in              ين
Port            بورت
of              وف
Spain           سباين
in              ين
2007
,
which           ويكه
was             واس
equalled        يكالليد
five            فيفي
days            دايس
ago             اغو
by              بي
South           سووث
Africa          افريكا
in              ين
their           ثير
victory         فيكتوري
over            وفير
West            ويست
Indies          يندييس
in              ين
Sydney          سيدني
.

Citation

This work is a direct implementation of the research described in the paper False-Friend Detection and Entity Matching via Unsupervised Transliteration. The author of this library strongly encourages you to cite the following paper if you are using this software.

@article{chen2016false,
title={False-Friend Detection and Entity Matching via Unsupervised Transliteration},
author={Chen, Yanqing and Skiena, Steven},
journal={arXiv preprint arXiv:1611.06722},
year={2016}}

Sentiment

Polyglot has polarity lexicons for 136 languages. Each word's polarity takes one of three values: +1 for positive words, -1 for negative words, and 0 for neutral words.

Languages Coverage

from polyglot.downloader import downloader
print(downloader.supported_languages_table("sentiment2", 3))
  1. Turkmen                    2. Thai                       3. Latvian
  4. Zazaki                     5. Tagalog                    6. Tamil
  7. Tajik                      8. Telugu                     9. Luxembourgish, Letzeb...
 10. Alemannic                 11. Latin                     12. Turkish
 13. Limburgish, Limburgan...  14. Egyptian Arabic           15. Tatar
 16. Lithuanian                17. Spanish; Castilian        18. Basque
 19. Estonian                  20. Asturian                  21. Greek, Modern
 22. Esperanto                 23. English                   24. Ukrainian
 25. Marathi (Marāṭhī)         26. Maltese                   27. Burmese
 28. Kapampangan               29. Uighur, Uyghur            30. Uzbek
 31. Malagasy                  32. Yiddish                   33. Macedonian
 34. Urdu                      35. Malayalam                 36. Mongolian
 37. Breton                    38. Bosnian                   39. Bengali
 40. Tibetan Standard, Tib...  41. Belarusian                42. Bulgarian
 43. Bashkir                   44. Vietnamese                45. Volapük
 46. Gan Chinese               47. Manx                      48. Gujarati
 49. Yoruba                    50. Occitan                   51. Scottish Gaelic; Gaelic
 52. Irish                     53. Galician                  54. Ossetian, Ossetic
 55. Oriya                     56. Walloon                   57. Swedish
 58. Silesian                  59. Lombard language          60. Divehi; Dhivehi; Mald...
 61. Danish                    62. German                    63. Armenian
 64. Haitian; Haitian Creole   65. Hungarian                 66. Croatian
 67. Bishnupriya Manipuri      68. Hindi                     69. Hebrew (modern)
 70. Portuguese                71. Afrikaans                 72. Pashto, Pushto
 73. Amharic                   74. Aragonese                 75. Bavarian
 76. Assamese                  77. Panjabi, Punjabi          78. Polish
 79. Azerbaijani               80. Italian                   81. Arabic
 82. Icelandic                 83. Ido                       84. Scots
 85. Sicilian                  86. Indonesian                87. Chinese Word
 88. Interlingua               89. Waray-Waray               90. Piedmontese language
 91. Quechua                   92. French                    93. Dutch
 94. Norwegian Nynorsk         95. Norwegian                 96. Western Frisian
 97. Upper Sorbian             98. Nepali                    99. Persian
100. Ilokano                  101. Finnish                  102. Faroese
103. Romansh                  104. Javanese                 105. Romanian, Moldavian, ...
106. Malay                    107. Japanese                 108. Russian
109. Catalan; Valencian       110. Fiji Hindi               111. Chinese
112. Cebuano                  113. Czech                    114. Chuvash
115. Welsh                    116. West Flemish             117. Kirghiz, Kyrgyz
118. Kurdish                  119. Kazakh                   120. Korean
121. Kannada                  122. Khmer                    123. Georgian
124. Sakha                    125. Serbian                  126. Albanian
127. Swahili                  128. Chechen                  129. Sundanese
130. Sanskrit (Saṁskṛta)      131. Venetian                 132. Northern Sami
133. Slovak                   134. Sinhala, Sinhalese       135. Bosnian-Croatian-Serbian
136. Slovene
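
Download Necessary Models

The lexicons are distributed as sentiment2 packages. Following the same download pattern used for morph2 and transliteration2 above, the English lexicon should be fetchable like this (the exact package id is an assumption based on the task name):

%%bash
polyglot download sentiment2.en
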
from polyglot.text import Text

Polarity

To query the polarity of a word, we can simply access its polarity attribute:

text = Text("The movie was really good.")
print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in text.words:
    print("{:<16}{:>2}".format(w, w.polarity))
Word            Polarity
------------------------------
The              0
movie            0
was              0
really           0
good             1
.                0

Entity Sentiment

We can calculate a more sophisticated sentiment score for an entity mentioned in the text as follows:

blob = ("Barack Obama gave a fantastic speech last night. "
        "Reports indicate he will move next to New Hampshire.")
text = Text(blob)

First, we need to split the text into sentences; this limits the words that affect an entity's sentiment to those mentioned in the same sentence.

first_sentence = text.sentences[0]
print(first_sentence)
Barack Obama gave a fantastic speech last night.

Second, we extract the entities

first_entity = first_sentence.entities[0]
print(first_entity)
[u'Obama']

Finally, for each entity we identified, we can calculate the strength of its positive or negative sentiment on a scale from 0 to 1.

first_entity.positive_sentiment
0.9375
first_entity.negative_sentiment
0
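
Putting these steps together, the following short loop (a sketch built only from the calls shown above) scores every entity, sentence by sentence:

from polyglot.text import Text

blob = ("Barack Obama gave a fantastic speech last night. "
        "Reports indicate he will move next to New Hampshire.")
text = Text(blob)
for sentence in text.sentences:       # limits sentiment to words in the same sentence
    for entity in sentence.entities:  # entities detected by the NER models
        print("{}: +{} / -{}".format(entity,
                                     entity.positive_sentiment,
                                     entity.negative_sentiment))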

Citation

This work is a direct implementation of the research described in the paper Building sentiment lexicons for all major languages. The author of this library strongly encourages you to cite the following paper if you are using this software.

@inproceedings{chen2014building,
title={Building sentiment lexicons for all major languages},
author={Chen, Yanqing and Skiena, Steven},
booktitle={Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers)},
pages={383--389},
year={2014}}

polyglot

polyglot package

Subpackages

polyglot.detect package
Submodules
polyglot.detect.base module

Detecting languages

class polyglot.detect.base.Detector(text, quiet=False)[source]

Bases: object

Detect the language used in a snippet of text.

detect(text)[source]

Decide which language is used to write the text.

The method tries first to detect the language with high reliability. If that is not possible, the method switches to best effort strategy.

Parameters:text (string) – A snippet of text; the longer it is, the more reliably we can detect the language used to write it.
quiet = None

If true, exceptions will be silenced.

reliable = None

False if the detector used Best Effort strategy in detection.

static supported_languages()[source]

Returns a list of the languages that can be detected by pycld2.
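
A minimal usage sketch (assuming, as the reliable attribute implies, that detection runs when the object is constructed):

from polyglot.detect import Detector
from polyglot.detect.base import UnknownLanguage

try:
    detector = Detector("This is written in English.")
    print(detector.reliable)                   # False if the best-effort strategy was used
except UnknownLanguage:
    print("Could not detect the language.")    # raised when detection fails entirely
print(Detector.supported_languages()[:3])      # languages pycld2 can detect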

exception polyglot.detect.base.Error[source]

Bases: exceptions.Exception

Base exception class for this module.

class polyglot.detect.base.Language(choice)[source]

Bases: object

code
static from_code(code)[source]
name
exception polyglot.detect.base.UnknownLanguage[source]

Bases: polyglot.detect.base.Error

Raised if we can not detect the language of a text snippet.

polyglot.detect.langids module
Module contents
class polyglot.detect.Detector(text, quiet=False)[source]

Bases: object

Detect the language used in a snippet of text.

detect(text)[source]

Decide which language is used to write the text.

The method tries first to detect the language with high reliability. If that is not possible, the method switches to best effort strategy.

Parameters:text (string) – A snippet of text; the longer it is, the more reliably we can detect the language used to write it.
static supported_languages()[source]

Returns a list of the languages that can be detected by pycld2.

class polyglot.detect.Language(choice)[source]

Bases: object

code
static from_code(code)[source]
name
polyglot.mapping package
Subpackages
Submodules
polyglot.mapping.base module

Supports word embeddings.

class polyglot.mapping.base.CountedVocabulary(word_count=None)[source]

Bases: polyglot.mapping.base.OrderedVocabulary

List of words and counts sorted according to word count.

classmethod from_textfile(textfile, workers=1, job_size=1000)[source]

Count the set of words appeared in a text file.

Parameters:
  • textfile (string) – The name of the text file or TextFile object.
  • min_count (integer) – Minimum number of times a word/token appeared in the document to be considered part of the vocabulary.
  • workers (integer) – Number of parallel workers to read the file simultaneously.
  • job_size (integer) – Size of the batch sent to each worker.
  • most_frequent (integer) – if no min_count is specified, consider the most frequent k words for the vocabulary.
Returns:

A vocabulary of the most frequent words appeared in the document.

static from_textfiles(files, workers=1, job_size=1000)[source]
static from_vocabfile(filename)[source]

Construct a CountedVocabulary out of a vocabulary file.

Note

File has the following format:
word1 count1
word2 count2
getstate()[source]
min_count(n=1)[source]

Returns a vocabulary after eliminating the words that appear < n.

Parameters:n (integer) – specifies the minimum word frequency allowed.
most_frequent(k)[source]

Returns a vocabulary with the most frequent k words.

Parameters:k (integer) – specifies the top k most frequent words to be returned.
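
For example, a sketch of the vocabulary workflow (words.vocab is a hypothetical file in the word/count-per-line format described in the note above):

from polyglot.mapping import CountedVocabulary

vocab = CountedVocabulary.from_vocabfile("words.vocab")
top = vocab.most_frequent(1000)    # keep only the 1,000 most frequent words
frequent = vocab.min_count(5)      # drop words that appear fewer than 5 times
print(len(top.words))
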
class polyglot.mapping.base.OrderedVocabulary(words=None)[source]

Bases: polyglot.mapping.base.VocabularyBase

An ordered list of words/tokens according to their frequency.

Note

The words order is assumed to be sorted according to the word frequency. Most frequent words appear first in the list.

word_id

dictionary – Mapping from words to IDs.

id_word

dictionary – A reverse map of word_id.

most_frequent(k)[source]

Returns a vocabulary with the most frequent k words.

Parameters:k (integer) – specifies the top k most frequent words to be returned.
class polyglot.mapping.base.VocabularyBase(words=None)[source]

Bases: object

A set of words/tokens that have consistent IDs.

Note

Words will be sorted according to their lexicographic order.

word_id

dictionary – Mapping from words to IDs.

id_word

dictionary – A reverse map of word_id.

classmethod from_vocabfile(filename)[source]

Construct a CountedVocabulary out of a vocabulary file.

Note

File has the following format:
word1
word2
get(k, default=None)[source]
getstate()[source]
sanitize_words(words)[source]

Guarantees that all textual symbols are unicode.

Note

We do not convert numbers, only strings to unicode. We assume that the strings are encoded in utf-8.

words

Ordered list of words according to their IDs.

polyglot.mapping.base.count(lines)[source]

Counts the word frequencies in a list of sentences.

Note

This is a helper function for parallel execution of Vocabulary.from_text method.

polyglot.mapping.embeddings module

Defines classes related to mapping vocabulary to n-dimensional points.

class polyglot.mapping.embeddings.Embedding(vocabulary, vectors)[source]

Bases: object

Mapping a vocabulary to d-dimensional points.

apply_expansion(expansion)[source]

Apply a vocabulary expansion to the current embeddings.

distances(word, words)[source]

Calculate Euclidean pairwise distances between word and words.

Parameters:
  • word (string) – single word.
  • words (list) – list of strings.
Returns:

numpy array of the distances.

Note

L2 metric is used to calculate distances.

static from_gensim(model)[source]
static from_glove(fname)[source]
static from_word2vec(fname, fvocab=None, binary=False)[source]

Load the input-hidden weight matrix from the original C word2vec-tool format.

Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

binary is a boolean indicating whether the data is in binary word2vec format. Word counts are read from fvocab filename, if set (this is the file generated by -save-vocab flag of the original C tool).

static from_word2vec_vocab(fvocab)[source]
get(k, default=None)[source]
static load(fname)[source]

Load an embedding dump generated by save

most_frequent(k, inplace=False)[source]

Keep only the most frequent k words in the embeddings.

nearest_neighbors(word, top_k=10)[source]

Return the nearest k words to the given word.

Parameters:
  • word (string) – single word.
  • top_k (integer) – decides how many neighbors to report.
Returns:

A list of words sorted by the distances. The closest is the first.

Note

L2 metric is used to calculate distances.

normalize_words(ord=2, inplace=False)[source]

Normalize embeddings matrix row-wise.

Parameters:ord – normalization order. Possible values {1, 2, ‘inf’, ‘-inf’}
save(fname)[source]

Save a pickled version of the embedding into fname.

shape
words
zero_vector()[source]

Returns a zero vector of embedding dimension.
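
A usage sketch for the embedding class (the path is an assumption that mirrors the polyglot_data layout used by the downloader):

from polyglot.mapping import Embedding

embeddings = Embedding.load("/home/rmyeid/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2")
neighbors = embeddings.nearest_neighbors("green", top_k=5)
print(neighbors)                                  # closest words first
print(embeddings.distances("green", neighbors))   # the corresponding L2 distances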

polyglot.mapping.expansion module
class polyglot.mapping.expansion.CaseExpander(vocabulary, strategy='most_frequent')[source]

Bases: polyglot.mapping.expansion.VocabExpander

class polyglot.mapping.expansion.DigitExpander(vocabulary, strategy='most_frequent')[source]

Bases: polyglot.mapping.expansion.VocabExpander

class polyglot.mapping.expansion.VocabExpander(vocabulary, formatters, strategy)[source]

Bases: polyglot.mapping.base.OrderedVocabulary

approximate(w)[source]
approximate_ids(key)[source]
expand(formatters)[source]
format(w)[source]
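
The expanders let an embedding answer queries for word forms that are missing from its vocabulary, such as different casings or digit patterns. A sketch, assuming apply_expansion accepts the expander class and reusing the assumed path from above:

from polyglot.mapping import CaseExpander, Embedding

embeddings = Embedding.load("/home/rmyeid/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2")
embeddings.apply_expansion(CaseExpander)     # map casing variants onto the known lowercase vectors
print(embeddings.get("GREEN") is not None)   # lookups for unseen casings now succeed
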
Module contents
class polyglot.mapping.CountedVocabulary(word_count=None)[source]

Bases: polyglot.mapping.base.OrderedVocabulary

List of words and counts sorted according to word count.

classmethod from_textfile(textfile, workers=1, job_size=1000)[source]

Count the set of words appeared in a text file.

Parameters:
  • textfile (string) – The name of the text file or TextFile object.
  • min_count (integer) – Minimum number of times a word/token appeared in the document to be considered part of the vocabulary.
  • workers (integer) – Number of parallel workers to read the file simultaneously.
  • job_size (integer) – Size of the batch sent to each worker.
  • most_frequent (integer) – if no min_count is specified, consider the most frequent k words for the vocabulary.
Returns:

A vocabulary of the most frequent words appeared in the document.

static from_textfiles(files, workers=1, job_size=1000)[source]
static from_vocabfile(filename)[source]

Construct a CountedVocabulary out of a vocabulary file.

Note

File has the following format:
word1 count1
word2 count2
getstate()[source]
min_count(n=1)[source]

Returns a vocabulary after eliminating the words that appear < n.

Parameters:n (integer) – specifies the minimum word frequency allowed.
most_frequent(k)[source]

Returns a vocabulary with the most frequent k words.

Parameters:k (integer) – specifies the top k most frequent words to be returned.
class polyglot.mapping.OrderedVocabulary(words=None)[source]

Bases: polyglot.mapping.base.VocabularyBase

An ordered list of words/tokens according to their frequency.

Note

The words order is assumed to be sorted according to the word frequency. Most frequent words appear first in the list.

word_id

dictionary – Mapping from words to IDs.

id_word

dictionary – A reverse map of word_id.

most_frequent(k)[source]

Returns a vocabulary with the most frequent k words.

Parameters:k (integer) – specifies the top k most frequent words to be returned.
class polyglot.mapping.VocabularyBase(words=None)[source]

Bases: object

A set of words/tokens that have consistent IDs.

Note

Words will be sorted according to their lexicographic order.

word_id

dictionary – Mapping from words to IDs.

id_word

dictionary – A reverse map of word_id.

classmethod from_vocabfile(filename)[source]

Construct a CountedVocabulary out of a vocabulary file.

Note

File has the following format:
word1
word2
get(k, default=None)[source]
getstate()[source]
sanitize_words(words)[source]

Guarantees that all textual symbols are unicode.

Note

We do not convert numbers, only strings to unicode. We assume that the strings are encoded in utf-8.

words

Ordered list of words according to their IDs.

class polyglot.mapping.Embedding(vocabulary, vectors)[source]

Bases: object

Mapping a vocabulary to d-dimensional points.

apply_expansion(expansion)[source]

Apply a vocabulary expansion to the current embeddings.

distances(word, words)[source]

Calculate Euclidean pairwise distances between word and words.

Parameters:
  • word (string) – single word.
  • words (list) – list of strings.
Returns:

numpy array of the distances.

Note

L2 metric is used to calculate distances.

static from_gensim(model)[source]
static from_glove(fname)[source]
static from_word2vec(fname, fvocab=None, binary=False)[source]

Load the input-hidden weight matrix from the original C word2vec-tool format.

Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

binary is a boolean indicating whether the data is in binary word2vec format. Word counts are read from fvocab filename, if set (this is the file generated by -save-vocab flag of the original C tool).

static from_word2vec_vocab(fvocab)[source]
get(k, default=None)[source]
static load(fname)[source]

Load an embedding dump generated by save

most_frequent(k, inplace=False)[source]

Keep only the most frequent k words in the embeddings.

nearest_neighbors(word, top_k=10)[source]

Return the nearest k words to the given word.

Parameters:
  • word (string) – single word.
  • top_k (integer) – decides how many neighbors to report.
Returns:

A list of words sorted by the distances. The closest is the first.

Note

L2 metric is used to calculate distances.

normalize_words(ord=2, inplace=False)[source]

Normalize embeddings matrix row-wise.

Parameters:ord – normalization order. Possible values {1, 2, ‘inf’, ‘-inf’}
save(fname)[source]

Save a pickled version of the embedding into fname.

shape
words
zero_vector()[source]

Returns a zero vector of embedding dimension.

class polyglot.mapping.CaseExpander(vocabulary, strategy='most_frequent')[source]

Bases: polyglot.mapping.expansion.VocabExpander

class polyglot.mapping.DigitExpander(vocabulary, strategy='most_frequent')[source]

Bases: polyglot.mapping.expansion.VocabExpander

polyglot.tag package
Subpackages
Submodules
polyglot.tag.base module
Module contents
polyglot.tokenize package
Subpackages
Submodules
polyglot.tokenize.base module

Basic text segmenters.

class polyglot.tokenize.base.Breaker(locale)[source]

Bases: object

Base class to segment text.

transform(sequence)[source]
class polyglot.tokenize.base.SentenceTokenizer(locale='en')[source]

Bases: polyglot.tokenize.base.Breaker

Segment text to sentences.

class polyglot.tokenize.base.WordTokenizer(locale='en')[source]

Bases: polyglot.tokenize.base.Breaker

Segment text to words or tokens.

Module contents
class polyglot.tokenize.WordTokenizer(locale='en')[source]

Bases: polyglot.tokenize.base.Breaker

Segment text to words or tokens.

class polyglot.tokenize.SentenceTokenizer(locale='en')[source]

Bases: polyglot.tokenize.base.Breaker

Segment text to sentences.
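
A sketch of driving the tokenizers directly, assuming transform accepts a Sequence and returns a Sequence whose tokens() method yields the segments:

from polyglot.base import Sequence
from polyglot.tokenize import SentenceTokenizer, WordTokenizer

seq = Sequence("We will meet at eight o'clock on Thursday morning. The agenda is short.")
print(SentenceTokenizer(locale="en").transform(seq).tokens())   # sentence boundaries
print(WordTokenizer(locale="en").transform(seq).tokens())       # word boundaries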

polyglot.transliteration package
Subpackages
Submodules
polyglot.transliteration.base module
Module contents

Submodules

polyglot.base module

Basic data types.

class polyglot.base.Sequence(text)[source]

Bases: object

Text with indices indicating boundaries.

empty()[source]
split(sequence)[source]

Split into subsequences according to sequence.

text
tokens()[source]

Returns segmented text after stripping whitespace.

class polyglot.base.TextFile(file, delimiter=u'\n')[source]

Bases: object

Wrapper around text files.

It uses io.open to guarantee reading text files with unicode encoding. It has an iterator that supports an arbitrary delimiter instead of only newlines.
delimiter

string – A string that defines the limit of each chunk.

file

string – A path to a file.

buf

StringIO – a buffer to store the results of peeking into the file.

apply(func, workers=1, job_size=10000)[source]

Apply func to lines of text, in parallel or sequentially.

Parameters:func – a function that takes a list of lines.
iter_chunks(chunksize)[source]
iter_delimiter(byte_size=8192)[source]

Generalization of the default file iteration, with chunks delimited by '\n'.

Note:
The newline string can be arbitrarily long; it need not be restricted to a single character. You can also set the read size and control whether or not the newline string is left on the end of the iterated lines. Setting newline to ‘\0’ is particularly good for use with an input file created with something like “os.popen(‘find -print0’)”.
Args:
byte_size (integer): Number of bytes to be read at each time.
peek(size)[source]
read(size=None)[source]

Read size of bytes.

readline()[source]
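
A small sketch (corpus.txt is a hypothetical UTF-8 text file):

from polyglot.base import TextFile

corpus = TextFile("corpus.txt")
print(corpus.peek(80))       # look ahead without consuming the stream
print(corpus.readline())     # read a single delimited chunk
print(len(corpus.read()))    # read the remainder of the file
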
class polyglot.base.TextFiles(files, delimiter=u'\n')[source]

Bases: polyglot.base.TextFile

Interface for a sequence of files.

names
peek(size)[source]
read(size=None)[source]
readline()[source]
class polyglot.base.TokenSequence[source]

Bases: list

A list of tokens.

Parameters:tokens (list) – list of symbols.
sliding_window(width=2, padding=None)[source]
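
A sketch, assuming sliding_window yields fixed-width windows over the tokens, padded with padding at the boundaries:

from polyglot.base import TokenSequence

tokens = TokenSequence(["we", "will", "meet", "today"])
for window in tokens.sliding_window(width=2):
    print(window)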

polyglot.decorators module

class polyglot.decorators.cached_property(func)[source]

Bases: object

A property that is only computed once per instance and then replaces itself with an ordinary attribute. Deleting the attribute resets the property. Credit to Marcel Hellkamp, author of bottle.py.
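
For illustration, a hypothetical class using the decorator:

from polyglot.decorators import cached_property

class Corpus(object):
    @cached_property
    def vocabulary(self):
        print("computed once")
        return {"we", "will", "meet"}

c = Corpus()
c.vocabulary   # computes the value and stores it on the instance
c.vocabulary   # served from the instance attribute, no recomputation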

polyglot.decorators.memoize(obj)[source]

polyglot.downloader module

The Polyglot corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with polyglot.

Downloading Packages

If called with no arguments, download() will display an interactive interface which can be used to download and install new packages. If Tkinter is available, then a graphical interface will be shown, otherwise a simple text interface will be provided.

Individual packages can be downloaded by calling the download() function with a single argument, giving the package identifier for the package that should be downloaded:

>>> download('treebank') 
[polyglot_data] Downloading package 'treebank'...
[polyglot_data]   Unzipping corpora/treebank.zip.

Polyglot also provides a number of “package collections”, consisting of a group of related packages. To download all packages in a collection, simply call download() with the collection’s identifier:

>>> download('all-corpora') 
[polyglot_data] Downloading package 'abc'...
[polyglot_data]   Unzipping corpora/abc.zip.
[polyglot_data] Downloading package 'alpino'...
[polyglot_data]   Unzipping corpora/alpino.zip.
  ...
[polyglot_data] Downloading package 'words'...
[polyglot_data]   Unzipping corpora/words.zip.
Download Directory

By default, packages are installed in either a system-wide directory (if Python has sufficient access to write to it) or in the current user’s home directory. However, the download_dir argument may be used to specify a different installation target, if desired.

See Downloader.default_download_dir() for a more detailed description of how the default download directory is chosen.
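
For example, to download a single package into a custom location (the package id and path here are purely illustrative):

>>> download('embeddings2.en', download_dir='/tmp/polyglot_data')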

Polyglot Download Server

Before downloading any packages, the corpus and module downloader contacts the Polyglot download server, to retrieve an index file describing the available packages. By default, this index file is loaded from http://nltk.googlecode.com/svn/trunk/polyglot_data/index.xml. If necessary, it is possible to create a new Downloader object, specifying a different URL for the package index file.

Usage:

python polyglot/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or:

python -m polyglot.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
class polyglot.downloader.Collection(id, children, name=None, **kw)[source]

Bases: object

A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by Downloader.

children = None

A list of the Collections or Packages directly contained by this collection.

id = None

A unique identifier for this collection.

name = None

A string name for this collection.

packages = None

A list of Packages contained by this collection or any collections it recursively contains.

class polyglot.downloader.Downloader(server_index_url=None, source=None, download_dir=None)[source]

Bases: object

A class used to access the Polyglot data server, which can be used to download corpora and other data packages.

DEFAULT_SOURCE = u'mirror'

The source for index and other data files. Two values are supported: ‘mirror’ or ‘google’.

For ‘mirror’, the DEFAULT_URL should be set to a prefix of the mirrored directory, like ‘http://address.of.mirror/dir/’, and the downloader expects a file named ‘index.json’ as the index file.

For ‘google’, the DEFAULT_URL should be the Google Cloud bucket, and the downloader retrieves the index through the Google API.

So set the following DEFAULT_URL properly.

DEFAULT_URL = u'http://polyglot.cs.stonybrook.edu/~polyglot/'

The default URL for the Polyglot data server’s index. An alternative URL can be specified when creating a new Downloader object.

For ‘google’ as DEFAULT_SOURCE, ‘polyglot-models’ is the default place. For ‘mirror’ as DEFAULT_SOURCE, use a proper mirror.

INDEX_TIMEOUT = 3600

The amount of time after which the cached copy of the data server index will be considered ‘stale,’ and will be re-downloaded.

INSTALLED = u'installed'

A status string indicating that a package or collection is installed and up-to-date.

LANG_PREFIX = u'LANG:'

Collection ID prefix for collections that gather models of a specific language.

NOT_INSTALLED = u'not installed'

A status string indicating that a package or collection is not installed.

PARTIAL = u'partial'

A status string indicating that a collection is partially installed (i.e., only some of its packages are installed.)

STALE = u'out of date'

A status string indicating that a package or collection is corrupt or out-of-date.

TASK_PREFIX = u'TASK:'

Collection ID prefix for collections that gather models of a specific task.

clear_status_cache(id=None)[source]
collections()[source]
corpora()[source]
default_download_dir()[source]

Return the directory to which packages will be downloaded by default. This value can be overridden using the constructor, or on a case-by-case basis using the download_dir argument when calling download().

By default, this directory is ~/polyglot_data.

download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix=u'[polyglot_data] ', halt_on_error=True, raise_on_error=False)[source]
download_dir

The default directory to which packages will be downloaded. This defaults to the value returned by default_download_dir(). To override this default on a case-by-case basis, use the download_dir argument when calling download().

get_collection(lang=None, task=None)[source]

Return the collection that represents a specific language or task.

Parameters:
  • lang (string) – Language code.
  • task (string) – Task name.
incr_download(info_or_id, download_dir=None, force=False)[source]
index()[source]

Return the XML index describing the packages available from the data server. If necessary, this index will be downloaded from the data server.

info(id)[source]

Return the Package or Collection record for the given item.

is_installed(info_or_id, download_dir=None)[source]
is_stale(info_or_id, download_dir=None)[source]
list(download_dir=None, show_packages=False, show_collections=True, header=True, more_prompt=False, skip_installed=False)[source]
models()[source]
packages()[source]
status(info_or_id, download_dir=None)[source]

Return a constant describing the status of the given package or collection. Status can be one of INSTALLED, NOT_INSTALLED, STALE, or PARTIAL.

supported_language(lang)[source]

Return True if polyglot supports the language.

Parameters:lang (string) – Language code.
supported_languages(task=None)[source]

Languages that are covered by a specific task.

Parameters:task (string) – Task name.
supported_languages_table(task, cols=3)[source]
supported_tasks(lang=None)[source]

Tasks that are supported for a specific language.

Parameters:lang (string) – Language code name.
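
For example, the documented query helpers can be combined as follows (the task name sentiment2 appears in the coverage tables above):

from polyglot.downloader import downloader

print(downloader.supported_language("en"))                 # True if polyglot covers English
print(downloader.supported_tasks(lang="en"))               # tasks available for English
print(downloader.supported_languages(task="sentiment2"))   # languages with sentiment lexicons
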
update(quiet=False, prefix=u'[polyglot_data] ')[source]

Re-download any packages whose status is STALE.

url

The URL for the data server’s index file.

xmlinfo(id)[source]

Return the XML info record for the given item

class polyglot.downloader.DownloaderMessage[source]

Bases: object

A status message object, used by incr_download to communicate its progress.

class polyglot.downloader.DownloaderShell(dataserver)[source]

Bases: object

run()[source]
class polyglot.downloader.ErrorMessage(package, message)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server encountered an error

exception polyglot.downloader.ExceptionBase[source]

Bases: exceptions.Exception

General base exception for the downloader module.

class polyglot.downloader.FinishCollectionMessage(collection)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has finished working on a collection of packages.

class polyglot.downloader.FinishDownloadMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has finished downloading a package.

class polyglot.downloader.FinishPackageMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has finished working on a package.

class polyglot.downloader.FinishUnzipMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has finished unzipping a package.

exception polyglot.downloader.LanguageNotSupported[source]

Bases: polyglot.downloader.ExceptionBase

Raised if the language is not covered by polyglot.

class polyglot.downloader.Package(id, url, name=None, subdir=u'', size=None, filename=u'', task=u'', language=u'', attrs=None, **kw)[source]

Bases: object

A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by Downloader. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.

attrs = None

Extra attributes generated by Google Cloud Storage.

filename = None

The filename that should be used for this package’s file.

static fromcsobj(csobj)[source]
id = None

A unique identifier for this package.

language = None

The language code this package belongs to.

name = None

A string name for this package.

size = None

The filesize (in bytes) of the package file.

subdir = None

The subdirectory where this package should be installed. E.g., 'corpora' or 'taggers'.

task = None

The task this package is serving.

url = None

A URL that can be used to download this package’s file.

class polyglot.downloader.ProgressMessage(progress)[source]

Bases: polyglot.downloader.DownloaderMessage

Indicates how much progress the data server has made

class polyglot.downloader.SelectDownloadDirMessage(download_dir)[source]

Bases: polyglot.downloader.DownloaderMessage

Indicates what download directory the data server is using

class polyglot.downloader.StaleMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

The package download file is out-of-date or corrupt

class polyglot.downloader.StartCollectionMessage(collection)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has started working on a collection of packages.

class polyglot.downloader.StartDownloadMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has started downloading a package.

class polyglot.downloader.StartPackageMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has started working on a package.

class polyglot.downloader.StartUnzipMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has started unzipping a package.

exception polyglot.downloader.TaskNotSupported[source]

Bases: polyglot.downloader.ExceptionBase

Raised if the task is not covered by polyglot.

class polyglot.downloader.UpToDateMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

The package download file is already up-to-date

polyglot.downloader.build_index(root, base_url)[source]

Create a new data.xml index file by combining the xml description files for various packages and collections. root should be the path to a directory containing the package xml and zip files, and the collection xml files. The root directory is expected to have the following subdirectories:

root/
packages/ .................. subdirectory for packages
  corpora/ ................. zip & xml files for corpora
  grammars/ ................ zip & xml files for grammars
  taggers/ ................. zip & xml files for taggers
  tokenizers/ .............. zip & xml files for tokenizers
  etc.
collections/ ............... xml files for collections

For each package, there should be two files: package.zip (where package is the package name) which contains the package itself as a compressed zip file; and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package’s xml file.

For each collection, there should be a single file collection.xml describing the collection, where collection is the name of the collection.

All identifiers (for both packages and collections) must be unique.

polyglot.downloader.download_gui()[source]
polyglot.downloader.download_shell()[source]
polyglot.downloader.is_writable(path)[source]
polyglot.downloader.unzip(filename, root, verbose=True)[source]

Extract the contents of the zip file filename into the directory root.

polyglot.downloader.update()[source]

polyglot.load module

polyglot.mixins module

class polyglot.mixins.BlobComparableMixin[source]

Bases: polyglot.mixins.ComparableMixin

Allow blob objects to be comparable with both strings and blobs.

class polyglot.mixins.ComparableMixin[source]

Bases: object

Implements rich operators for an object.

class polyglot.mixins.StringlikeMixin[source]

Bases: object

Make blob objects behave like Python strings.

Expects classes that use this mixin to have a _strkey() method that returns the string to apply string methods to. Using _strkey() instead of __str__ ensures consistent behavior between Python 2 and 3.

ends_with(suffix, start=0, end=9223372036854775807)[source]

Returns True if the blob ends with the given suffix.

endswith(suffix, start=0, end=9223372036854775807)[source]

Returns True if the blob ends with the given suffix.

find(sub, start=0, end=9223372036854775807)[source]

Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].

format(*args, **kwargs)[source]

Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.

index(sub, start=0, end=9223372036854775807)[source]

Like blob.find() but raise ValueError when the substring is not found.

join(iterable)[source]

Behaves like the built-in str.join(iterable) method, except returns a blob object.

Returns a blob which is the concatenation of the strings or blobs in the iterable.

lower()[source]

Like str.lower(), returns new object with all lower-cased characters.

replace(old, new, count=9223372036854775807)[source]

Return a new blob object with all occurrences of old replaced by new.

rfind(sub, start=0, end=9223372036854775807)[source]

Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].

rindex(sub, start=0, end=9223372036854775807)[source]

Like blob.rfind() but raise ValueError when substring is not found.

split(sep=None, maxsplit=9223372036854775807)[source]

Behaves like the built-in str.split().

starts_with(prefix, start=0, end=9223372036854775807)[source]

Returns True if the blob starts with the given prefix.

startswith(prefix, start=0, end=9223372036854775807)[source]

Returns True if the blob starts with the given prefix.

strip(chars=None)[source]

Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.

title()[source]

Returns a blob object with the text in title-case.

upper()[source]

Like str.upper(), returns new object with all upper-cased characters.

polyglot.mixins.implements_to_string(cls)[source]

Class decorator that renames __str__ to __unicode__ and adds a __str__ method that returns UTF-8.

polyglot.text module

polyglot.utils module

Collection of general utilities.

polyglot.utils.pretty_list(items, cols=3)[source]

Module contents

class polyglot.Sequence(text)[source]

Bases: object

Text with indices indicating boundaries.

empty()[source]
split(sequence)[source]

Split into subsequences according to sequence.

text
tokens()[source]

Returns segmented text after stripping whitespace.

class polyglot.TokenSequence[source]

Bases: list

A list of tokens.

Parameters:tokens (list) – list of symbols.
sliding_window(width=2, padding=None)[source]