The Classical Language Toolkit (CLTK)¶
About¶
The Classical Language Toolkit (CLTK) offers natural language processing (NLP) support for the languages of Ancient, Classical, and Medieval Eurasia. Greek, Latin, Akkadian, and the Germanic languages are currently most complete. The goals of the CLTK are to:
- compile analysis-friendly corpora;
- collect and generate linguistic data;
- act as a free and open platform for generating scientific research.
The project’s source is hosted on GitHub and the homepage is http://cltk.org.
Citation¶
For guidance on citing the CLTK, see Citation in the project’s README.
Installation¶
Please note that the CLTK is built, tested, and supported only on POSIX-compliant operating systems (namely Linux, macOS, and the BSDs).
With Pip¶
Note
The CLTK is only officially supported with Python 3.7 on POSIX-compliant operating systems (Linux, Mac OS X, FreeBSD, etc.).
First, you’ll need a working installation of Python 3.7, which now includes Pip. Create a virtual environment and activate it as follows:
$ python3.7 -m venv venv
$ source venv/bin/activate
Then, install the CLTK, which automatically includes all dependencies.
$ pip install cltk
Second, if you want to automatically import any of the CLTK's corpora, you will need an installation of Git, which the CLTK uses to download and update corpora. Installation of Git will depend on your operating system.
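For example, on Debian or Ubuntu, Git can be installed with apt, and on macOS with Homebrew (standard commands for those package managers; adapt to your own system):
$ sudo apt install git   # Debian/Ubuntu
$ brew install git       # macOS with Homebrew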
Tip
For a user-friendly interactive shell environment, try IPython, which may be invoked with ipython from the command line. You may install it with pip install ipython.
Microsoft Windows¶
Warning
CLTK on Windows is not officially supported; however, we do encourage Windows 10 users to give the following a try. Others have reported success. If this should fail for you, please open an issue on GitHub.
Windows 10 features a beta of “Bash on Ubuntu on Windows”, which creates a fully functional POSIX environment. For an introduction, see Microsoft’s docs here.
Once you have enabled Bash on Windows, installation and use are just the same as on Ubuntu. For instance, you would use the following:
sudo apt update
sudo apt install git
sudo apt-get install python-setuptools
sudo apt install python-virtualenv
virtualenv -p python3 ~/venv
source ~/venv/bin/activate
pip3 install cltk
Tip
Some fonts do not render Unicode well in the Bash terminal. Try SimSun-ExtB or Courier New.
Older releases¶
For reproduction of scholarship, the CLTK archives past versions of its software releases. To get an older release by version, say v0.1.32, use:
$ pip install cltk==0.1.32
If you do not know a release's version number but have its DOI (for instance, if you want to install version 10.5281/zenodo.51144), then you can search Zenodo and learn that this DOI corresponds to version v0.1.34.
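You can then install that version as above:
$ pip install cltk==0.1.34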
The above will work for most researchers seeking to reproduce results. It will give you CLTK code identical to what the original researcher was using. However, it is possible that you will want to use the exact same CLTK dependencies the researcher was using, too. In this case, consult the CLTK GitHub Releases page and download a .tar.gz file of the desired version. Then, you may do the following:
$ tar zxvf cltk-0.1.34.tar.gz
$ cd cltk-0.1.34
$ python3.6 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
This will give you CLTK and immediate dependencies identical to your target codebase.
The CLTK’s repositories are versioned, too, using Git. Should there have been changes to a target corpus, you may acquire your needed version by manually cloning the entire repo, then checking out the past version by commit log. For example, if you need commit 0ed43e025df276e95768038eb3692ba155cc78c9
from the repo latin_text_perseus
:
$ cd ~/cltk_data/latin/text/
$ rm -rf latin_text_perseus/
$ git clone https://github.com/cltk/latin_text_perseus.git
$ cd latin_text_perseus/
$ git checkout 0ed43e025df276e95768038eb3692ba155cc78c9
From source¶
The CLTK source is available at GitHub. To build from source, clone the repository, make a virtual environment (as above), and run:
$ pip install -U -r requirements.txt
$ python setup.py install
If you have modified the CLTK source, rebuild the project with this same command. If you make any changes, it is a good idea to run the test suite to ensure you did not introduce any breakage. Test with nose:
$ nosetests --with-doctest
Importing Corpora¶
The CLTK stores all data in the local directory cltk_data, which is created in the user's home directory upon first initialization of the CorpusImporter() class. Within it are an originals directory, in which untouched copies of downloaded or copied files are preserved, and a directory for every language for which a corpus has been downloaded. It also contains cltk.log for all CLTK logging.
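For illustration, once at least one corpus has been imported you can inspect this layout from Python (the exact listing will depend on which languages and corpora you have downloaded):
In [1]: import os
In [2]: cltk_data = os.path.expanduser('~/cltk_data')
In [3]: sorted(os.listdir(cltk_data))  # e.g. ['cltk.log', 'greek', 'latin', 'originals']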
Listing corpora¶
To see all of the corpora available for importing, use the list_corpora property.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: corpus_importer = CorpusImporter('greek') # e.g., or CorpusImporter('latin')
In [3]: corpus_importer.list_corpora
Out[3]:
['greek_software_tlgu',
'greek_text_perseus',
'phi7',
'tlg',
'greek_proper_names_cltk',
'greek_models_cltk',
'greek_treebank_perseus',
'greek_lexica_perseus',
'greek_training_set_sentence_cltk',
'greek_word2vec_cltk',
'greek_text_lacus_curtius']
Importing a corpus¶
To download a remote corpus, use the following, for example, for the Latin Library.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: corpus_importer = CorpusImporter('latin') # e.g., or CorpusImporter('greek')
In [3]: corpus_importer.import_corpus('latin_text_latin_library')
Downloaded 100%, 35.53 MiB | 3.28 MiB/s
For a local corpus, such as the TLG, you must give a second argument of the filepath to the corpus, e.g.:
In [4]: corpus_importer.import_corpus('tlg', '~/Documents/corpora/TLG_E/')
User-defined, distributed corpora¶
Most users will want to use the CLTK's publicly available corpora. However, users can import any repository that is hosted on a Git server. The benefit of this is that users can work with corpora that the CLTK organization is not able to distribute itself (e.g., because they are too specific or carry license restrictions).
Let's say a user wants to keep a particular Git-backed corpus at git@github.com:kylepjohnson/latin_corpus_newton_example.git. It can be cloned into the ~/cltk_data/ directory by declaring it in a manually created YAML file at ~/cltk_data/distributed_corpora.yaml like the following:
example_distributed_latin_corpus:
    origin: https://github.com/kylepjohnson/latin_corpus_newton_example.git
    language: latin
    type: text

example_distributed_greek_corpus:
    origin: https://github.com/kylepjohnson/a_nonexistent_repo.git
    language: pali
    type: treebank
Each block defines a separate corpus. The first line of a block (e.g., example_distributed_latin_corpus) gives the unique name of the custom corpus. This first example block would allow a user to fetch the repo and install it at ~/cltk_data/latin/text/latin_corpus_newton_example.
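If you prefer to create the file from Python rather than by hand, a minimal sketch (assuming ~/cltk_data/ already exists) is:
In [1]: import os
In [2]: yaml_path = os.path.expanduser('~/cltk_data/distributed_corpora.yaml')
In [3]: entry = (
   ...:     "example_distributed_latin_corpus:\n"
   ...:     "    origin: https://github.com/kylepjohnson/latin_corpus_newton_example.git\n"
   ...:     "    language: latin\n"
   ...:     "    type: text\n"
   ...: )
In [4]: with open(yaml_path, 'a', encoding='utf-8') as f:
   ...:     f.write(entry)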
Corpus Readers¶
After a corpus has been imported into the library, users will want to access the data through a CorpusReader object.
The CorpusReader API follows the NLTK CorpusReader API paradigm. It offers a way for users to access the documents, paragraphs, sentences, and words of all the available documents in a corpus, or of a specified collection of documents.
Not every corpus will support every method; e.g., a corpus of inscriptions may not support paragraphs via a paras method, but the corpus provider should try to provide the interfaces that they can.
Reading a Corpus¶
Use the get_corpus_reader method in the readers module.
In [1]: from cltk.corpus.readers import get_corpus_reader
In [2]: latin_corpus = get_corpus_reader(corpus_name = 'latin_text_latin_library', language = 'latin')
In [3]: len(list(latin_corpus.docs()))
Out[3]: 2141
In [4]: len(list(latin_corpus.paras()))
Out[4]: 212130
In [5]: len(list(latin_corpus.sents()))
Out[5]: 1038668
In [6]: len(list(latin_corpus.words()))
Out[6]: 16455728
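Because the reader follows the NLTK CorpusReader API, the standard NLTK accessors are also available; for example, a sketch of restricting the reader to specific documents (fileids() and per-file selection are standard NLTK CorpusReader behaviour, so confirm your reader supports them):
In [7]: files = latin_corpus.fileids()
In [8]: len(list(latin_corpus.words(files[:2])))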
Adding a Corpus to the CLTK Reader¶
Modify the cltk.corpus.readers module, updating SUPPORTED_CORPORA with your language and the specific corpus name. In the get_corpus_reader method, implement the checks and mappings needed to return an NLTK-compliant CorpusReader object; see the sketch below.
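A hypothetical sketch of those two changes follows; the real shape of SUPPORTED_CORPORA and the reader class you return depend on your corpus, so treat every name and path below as illustrative rather than as the CLTK's actual source:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Illustrative only: register your language and corpus name.
SUPPORTED_CORPORA = {
    'latin': ['latin_text_latin_library'],
    'my_language': ['my_language_text_my_corpus'],
}

def get_corpus_reader(corpus_name, language):
    """Return an NLTK-compliant corpus reader for a supported corpus (sketch)."""
    if corpus_name not in SUPPORTED_CORPORA.get(language, []):
        raise ValueError('Unsupported corpus: {}'.format(corpus_name))
    root = os.path.join(os.path.expanduser('~/cltk_data'), language, 'text', corpus_name)
    # Map the corpus onto a plaintext reader; richer corpora may need a custom reader class.
    return PlaintextCorpusReader(root, r'.*\.txt', encoding='utf-8')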
Providing Metadata for Corpus Filtration¶
If you're adding a corpus to the CLTK, please also consider providing a genre mapping, especially if your corpus is large or easily segmented into genres. Consider creating a file containing mappings of categories to directories and files, e.g.:
In [1]: from cltk.corpus.latin.latin_library_corpus_types import corpus_directories_by_type
In [2]: corpus_directories_by_type.keys()
Out [2]: dict_keys(['republican', 'augustan', 'early_silver', 'late_silver', 'old', 'christian', 'medieval', 'renaissance', 'neo_latin', 'misc', 'early'])
In [3]: from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type
In [4]: list(corpus_directories_by_type.values())[:2]
Out [4]: [['./caesar', './lucretius', './nepos', './cicero'], ['./livy', './ovid', './horace', './vergil', './hyginus']]
In [5]: from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type
In [6]: list(corpus_texts_by_type.values())[:2]
Out [6]: [['sall.1.txt', 'sall.2.txt', 'sall.cotta.txt', 'sall.ep1.txt', 'sall.ep2.txt', 'sall.frag.txt', 'sall.invectiva.txt', 'sall.lep.txt', 'sall.macer.txt', 'sall.mithr.txt', 'sall.phil.txt', 'sall.pomp.txt', 'varro.frag.txt', 'varro.ll10.txt', 'varro.ll5.txt', 'varro.ll6.txt', 'varro.ll7.txt', 'varro.ll8.txt', 'varro.ll9.txt', 'varro.rr1.txt', 'varro.rr2.txt', 'varro.rr3.txt', 'sulpicia.txt'], ['resgestae.txt', 'resgestae1.txt', 'manilius1.txt', 'manilius2.txt', 'manilius3.txt', 'manilius4.txt', 'manilius5.txt', 'catullus.txt', 'vitruvius1.txt', 'vitruvius10.txt', 'vitruvius2.txt', 'vitruvius3.txt', 'vitruvius4.txt', 'vitruvius5.txt', 'vitruvius6.txt', 'vitruvius7.txt', 'vitruvius8.txt', 'vitruvius9.txt', 'propertius1.txt', 'tibullus1.txt', 'tibullus2.txt', 'tibullus3.txt']]
The mapping is a dictionary of genre types or periods, and the values are lists of files or directories for each type.
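A mapping for your own corpus would follow the same shape; the names and paths below are hypothetical:
# Hypothetical genre mapping: keys are periods or genres, values are directories or files.
my_corpus_directories_by_type = {
    'republican': ['./caesar', './cicero'],
    'augustan': ['./livy', './ovid'],
}
my_corpus_texts_by_type = {
    'republican': ['sall.1.txt'],
    'augustan': ['catullus.txt'],
}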
Helper Methods for Corpus Filtration¶
Users will typically construct a CorpusReader by selecting category types of directories or files.
The assemble_corpus method allows users to take a CorpusReader and filter the files used to provide the data for the reader.
In [1]: from cltk.corpus.readers import assemble_corpus, get_corpus_reader
In [2]: from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type, corpus_directories_by_type
In [3]: latin_corpus = get_corpus_reader(corpus_name = 'latin_text_latin_library', language = 'latin')
In [4]: filtered_reader, fileids, categories = assemble_corpus(latin_corpus, types_requested=['republican', 'augustan'], type_dirs=corpus_directories_by_type,
... type_files=corpus_texts_by_type)
In [5]: len(list(filtered_reader.docs()))
Out [5]: 510
In [6]: categories
Out [6]: {'republican', 'augustan'}
In [7]: len(fileids)
Out [7]: 510
Akkadian¶
Akkadian is an extinct East Semitic language (part of the greater Afroasiatic language family) that was spoken in ancient Mesopotamia. The earliest attested Semitic language, it used the cuneiform writing system, which was originally used to write the unrelated Ancient Sumerian, a language isolate. From the second half of the third millennium BC (ca. 2500 BC), texts fully written in Akkadian begin to appear. Hundreds of thousands of texts and text fragments have been excavated to date, covering a vast textual tradition of mythological narrative, legal texts, scientific works, correspondence, political and military events, and many other examples. By the second millennium BC, two variant forms of the language were in use in Assyria and Babylonia, known as Assyrian and Babylonian respectively. (Source: Wikipedia)
Workflow Sample Model¶
A sample workflow using the Akkadian tools is shown below. In this example, we take a text file downloaded from CDLI, import it, and have it read and ingested. From there, we look at the table of contents, select a text, convert the text into Unicode, and pretty print the result.
Note: this workflow uses a set of test documents, available for download here:
https://github.com/cltk/cltk/tree/master/cltk/tests/test_akkadian
When you have downloaded these files, use their file location within os.path.join(), e.g. os.path.join('downloads', 'single_text.txt'). This tutorial assumes that you are using a fork of CLTK.
In[1]: from cltk.corpus.akkadian.file_importer import FileImport
In[2]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus
In[3]: from cltk.corpus.akkadian.pretty_print import PrettyPrint
In[4]: from cltk.corpus.akkadian.tokenizer import Tokenizer
In[5]: from cltk.tokenize.word import WordTokenizer
In[6]: from cltk.stem.akkadian.atf_converter import ATFConverter
In[7]: import os
# import a text and read it
In[8]: fi = FileImport(os.path.join('test_akkadian', 'single_text.txt'))
In[9]: fi.read_file()
# output = fi.raw_file or fi.file_lines; for folder catalog = fi.file_catalog()
# ingest your file lines
In[10]: cc = CDLICorpus()
In[11]: cc.parse_file(fi.file_lines)
# this creates disparate sections of the text ingested (edition, metadata, etc)
In[12]: transliteration = [cc.catalog[text]['transliteration'] for text in cc.catalog]
# access the data through cc.catalog (e.g. above) or the printing helpers (e.g. below):
# look through the file's contents
In[13]: print(cc.toc())
Out[13]: ['Pnum: P254202, Edition: ARM 01, 001, length: 23 line(s)']
# select a text through edition or cdli number (there's also .print_metadata):
In[14]: selected_text = cc.catalog['P254202']['transliteration']
# otherwise use the above 'transliteration'; same thing:
In[15]: print(selected_text)
Out[15]: ['a-na ia-ah-du-li-[im]', 'qi2-bi2-[ma]', 'um-ma a-bi-sa-mar#-[ma]', 'sa-li-ma-am e-pu-[usz]',
'asz-szum mu-sze-zi-ba-am# [la i-szu]', '[sa]-li#-ma-am sza e-[pu-szu]', '[u2-ul] e-pu-usz sa#-[li-mu-um]',
'[u2-ul] sa-[li-mu-um-ma]', 'isz#-tu mu#-[sze-zi-ba-am la i-szu]', 'a-la-nu-ia sza la is,-s,a-ab#-[tu]',
'i-na-an-na is,-s,a-ab-[tu]', 'i-na ne2-kur-ti _lu2_ ha-szi-[im{ki}]',
'ur-si-im{ki} _lu2_ ka-ar-ka#-[mi-is{ki}]', 'u3 ia-am-ha-ad[{ki}]', 'a-la-nu an-nu-tum u2-ul ih-li-qu2#',
'i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur#-ma', 'ih-ta-al-qu2',
'u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib#', 'u3 na-pa-asz2-ti u2-ba-li-it,', 'pi2-qa-at ha-s,e-ra#-at',
'asz-szum a-la-nu-ka', 'u3 ma-ru-ka sza-al#-[mu]', '[a-na na-pa]-asz2#-ti-ia i-tu-ur']
In[16]: print(transliteration[0])
Out[16]: ['a-na ia-ah-du-li-[im]', 'qi2-bi2-[ma]', 'um-ma a-bi-sa-mar#-[ma]', 'sa-li-ma-am e-pu-[usz]',
'asz-szum mu-sze-zi-ba-am# [la i-szu]', '[sa]-li#-ma-am sza e-[pu-szu]', '[u2-ul] e-pu-usz sa#-[li-mu-um]',
'[u2-ul] sa-[li-mu-um-ma]', 'isz#-tu mu#-[sze-zi-ba-am la i-szu]', 'a-la-nu-ia sza la is,-s,a-ab#-[tu]',
'i-na-an-na is,-s,a-ab-[tu]', 'i-na ne2-kur-ti _lu2_ ha-szi-[im{ki}]',
'ur-si-im{ki} _lu2_ ka-ar-ka#-[mi-is{ki}]', 'u3 ia-am-ha-ad[{ki}]', 'a-la-nu an-nu-tum u2-ul ih-li-qu2#',
'i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur#-ma', 'ih-ta-al-qu2',
'u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib#', 'u3 na-pa-asz2-ti u2-ba-li-it,', 'pi2-qa-at ha-s,e-ra#-at',
'asz-szum a-la-nu-ka', 'u3 ma-ru-ka sza-al#-[mu]', '[a-na na-pa]-asz2#-ti-ia i-tu-ur']
# tokenize by word or sign
In[17]: atf = ATFConverter()
In[18]: tk = Tokenizer()
In[19]: wtk = WordTokenizer('akkadian')
In[20]: lines = [tk.string_tokenizer(text, include_blanks=False)
   ...:          for text in atf.process(selected_text)]
In[21]: words = [wtk.tokenize(line[0]) for line in lines]
# optionally, slice with [4:] to skip the first four lines and focus on the body of the text
In[22]: print(lines)
Out[22]: [['a-na ia-ah-du-li-im'], ['qi2-bi2-ma'], ['um-ma a-bi-sa-mar-ma'], ['sa-li-ma-am e-pu-usz'],
['asz-szum mu-sze-zi-ba-am la i-szu'], ['sa-li-ma-am sza e-pu-szu'], ['u2-ul e-pu-usz sa-li-mu-um'],
['u2-ul sa-li-mu-um-ma'], ['isz-tu mu-sze-zi-ba-am la i-szu'], ['a-la-nu-ia sza la is,-s,a-ab-tu'],
['i-na-an-na is,-s,a-ab-tu'], ['i-na ne2-kur-ti _lu2_ ha-szi-im{ki}'],
['ur-si-im{ki} _lu2_ ka-ar-ka-mi-is{ki}'], ['u3 ia-am-ha-ad{ki}'], ['a-la-nu an-nu-tum u2-ul ih-li-qu2'],
['i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur-ma'], ['ih-ta-al-qu2'],
['u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib'], ['u3 na-pa-asz2-ti u2-ba-li-it,'],
['pi2-qa-at ha-s,e-ra-at'], ['asz-szum a-la-nu-ka'], ['u3 ma-ru-ka sza-al-mu'],
['a-na na-pa-asz2-ti-ia i-tu-ur']]
In[23]: print(words)
Out[23]: [[('a-na', 'akkadian'), ('ia-ah-du-li-im', 'akkadian')], [('qi2-bi2-ma', 'akkadian')],
[('um-ma', 'akkadian'), ('a-bi-sa-mar-ma', 'akkadian')], [('sa-li-ma-am', 'akkadian'),
('e-pu-usz', 'akkadian')],
[('asz-szum', 'akkadian'), ('mu-sze-zi-ba-am', 'akkadian'), ('la', 'akkadian'), ('i-szu', 'akkadian')],
[('sa-li-ma-am', 'akkadian'), ('sza', 'akkadian'), ('e-pu-szu', 'akkadian')],
[('u2-ul', 'akkadian'), ('e-pu-usz', 'akkadian'), ('sa-li-mu-um', 'akkadian')],
[('u2-ul', 'akkadian'), ('sa-li-mu-um-ma', 'akkadian')],
[('isz-tu', 'akkadian'), ('mu-sze-zi-ba-am', 'akkadian'), ('la', 'akkadian'), ('i-szu', 'akkadian')],
[('a-la-nu-ia', 'akkadian'), ('sza', 'akkadian'), ('la', 'akkadian'), ('is,-s,a-ab-tu', 'akkadian')],
[('i-na-an-na', 'akkadian'), ('is,-s,a-ab-tu', 'akkadian')],
[('i-na', 'akkadian'), ('ne2-kur-ti', 'akkadian'), ('_lu2_', 'sumerian'), ('ha-szi-im{ki}', 'akkadian')],
[('ur-si-im{ki}', 'akkadian'), ('_lu2_', 'sumerian'), ('ka-ar-ka-mi-is{ki}', 'akkadian')],
[('u3', 'akkadian'), ('ia-am-ha-ad{ki}', 'akkadian')],
[('a-la-nu', 'akkadian'), ('an-nu-tum', 'akkadian'), ('u2-ul', 'akkadian'), ('ih-li-qu2', 'akkadian')],
[('i-na', 'akkadian'), ('ne2-kur-ti', 'akkadian'), ('{disz}sa-am-si-{d}iszkur-ma', 'akkadian')],
[('ih-ta-al-qu2', 'akkadian')],
[('u3', 'akkadian'), ('a-la-nu', 'akkadian'), ('sza', 'akkadian'), ('ki-ma', 'akkadian'),
('u2-hu-ru', 'akkadian'), ('u2-sze-zi-ib', 'akkadian')],
[('u3', 'akkadian'), ('na-pa-asz2-ti', 'akkadian'), ('u2-ba-li-it,', 'akkadian')],
[('pi2-qa-at', 'akkadian'), ('ha-s,e-ra-at', 'akkadian')],
[('asz-szum', 'akkadian'), ('a-la-nu-ka', 'akkadian')],
[('u3', 'akkadian'), ('ma-ru-ka', 'akkadian'), ('sza-al-mu', 'akkadian')],
[('a-na', 'akkadian'), ('na-pa-asz2-ti-ia', 'akkadian'), ('i-tu-ur', 'akkadian')]]
In[24]: for signs in words:
   ...:     sign = [tk.sign_tokenizer(x) for x in signs]
# Note: Not printing 'signs' due to length. Try it!
# Pretty printing:
In[25]: pp = PrettyPrint()
In[26]: destination = os.path.join('test_akkadian', 'html_single_text.html')
In[27]: pp.html_print_single_text(cc.catalog, '&P254202', destination)
Read File¶
Reads a .txt file and saves to memory the text in .raw_file and .file_lines. These two instance attributes are used for the ATFConverter.
In[1]: import os
In[2]: from cltk.corpus.akkadian.file_importer import FileImport
In[3]: text_location = os.path.join('test_akkadian', 'single_text.txt')
In[4]: text = FileImport(text_location)
In[5]: text.read_file()
To access the text file, use .raw_file or .file_lines: .raw_file is the file in its entirety, while .file_lines is the same text split into lines with splitlines().
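For example, continuing the session above:
In[6]: whole_text = text.raw_file   # the entire file as one string
In[7]: lines = text.file_lines      # the same text as a list of lines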
File Catalog¶
This method looks at the folder containing the file and lists its contents.
In[1]: import os
In[2]: from cltk.corpus.akkadian.file_importer import FileImport
In[3]: text_location = os.path.join('test_akkadian', 'single_text.txt')
In[4]: folder = FileImport(text_location)
In[5]: folder.file_catalog()
Out[5]: ['html_file.html', 'html_single_text.html', 'single_text.txt',
'test_akkadian.py', 'two_text_abnormalities.txt']
Parse File¶
This method captures the information in a text file and formats it into a clear, separate record for every text found. It saves to memory dictionaries that split up each text by edition, CDLI number, metadata, and the various text sections, all of which are callable.
In[1]: import os
In[2]: from cltk.corpus.akkadian.file_importer import FileImport
In[3]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus
In[4]: cdli = CDLICorpus()
In[5]: f_i = FileImport(os.path.join('test_akkadian', 'single_text.txt'))
In[6]: f_i.read_file()
In[7]: cdli.parse_file(f_i.file_lines)
To access the text, use .catalog.
In[8]: print(cdli.catalog)
Out[8]: {'P254202': {'metadata': ['Primary publication: ARM 01, 001', 'Author(s): Dossin, Georges',
'Publication date: 1946',
'Secondary publication(s): Durand, Jean-Marie, LAPO 16, 0305',
'Collection: National Museum of Syria, Damascus, Syria',
'Museum no.: NMSD —', 'Accession no.:', 'Provenience: Mari (mod. Tell Hariri)',
'Excavation no.:', 'Period: Old Babylonian (ca. 1900-1600 BC)',
'Dates referenced:', 'Object type: tablet', 'Remarks:', 'Material: clay',
'Language: Akkadian', 'Genre: Letter', 'Sub-genre:', 'CDLI comments:',
'Catalogue source: 20050104 cdliadmin', 'ATF source: cdlistaff',
'Translation: Durand, Jean-Marie (fr); Guerra, Dylan M. (en)',
'UCLA Library ARK: 21198/zz001rsp8x', 'Composite no.:', 'Seal no.:',
'CDLI no.: P254202'],
'pnum': 'P254202',
'edition': 'ARM 01, 001',
'raw_text': ['@obverse', '1. a-na ia-ah-du-li-[im]', '2. qi2-bi2-[ma]',
'3. um-ma a-bi-sa-mar#-[ma]', '4. sa-li-ma-am e-pu-[usz]',
'5. asz-szum mu-sze-zi-ba-am# [la i-szu]', '6. [sa]-li#-ma-am sza e-[pu-szu]',
'7. [u2-ul] e-pu-usz sa#-[li-mu-um]', '8. [u2-ul] sa-[li-mu-um-ma]',
'$ rest broken', '@reverse', '$ beginning broken',
"1'. isz#-tu mu#-[sze-zi-ba-am la i-szu]",
"2'. a-la-nu-ia sza la is,-s,a-ab#-[tu]", "3'. i-na-an-na is,-s,a-ab-[tu]",
"4'. i-na ne2-kur-ti _lu2_ ha-szi-[im{ki}]",
"5'. ur-si-im{ki} _lu2_ ka-ar-ka#-[mi-is{ki}]", "6'. u3 ia-am-ha-ad[{ki}]",
"7'. a-la-nu an-nu-tum u2-ul ih-li-qu2#",
"8'. i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur#-ma", "9'. ih-ta-al-qu2",
"10'. u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib#",
"11'. u3 na-pa-asz2-ti u2-ba-li-it,", "12'. pi2-qa-at ha-s,e-ra#-at",
"13'. asz-szum a-la-nu-ka", "14'. u3 ma-ru-ka sza-al#-[mu]",
"15'. [a-na na-pa]-asz2#-ti-ia i-tu-ur"],
'transliteration': ['a-na ia-ah-du-li-[im]', 'qi2-bi2-[ma]', 'um-ma a-bi-sa-mar#-[ma]',
'sa-li-ma-am e-pu-[usz]', 'asz-szum mu-sze-zi-ba-am# [la i-szu]',
'[sa]-li#-ma-am sza e-[pu-szu]', '[u2-ul] e-pu-usz sa#-[li-mu-um]',
'[u2-ul] sa-[li-mu-um-ma]', 'isz#-tu mu#-[sze-zi-ba-am la i-szu]',
'a-la-nu-ia sza la is,-s,a-ab#-[tu]', 'i-na-an-na is,-s,a-ab-[tu]',
'i-na ne2-kur-ti _lu2_ ha-szi-[im{ki}]',
'ur-si-im{ki} _lu2_ ka-ar-ka#-[mi-is{ki}]', 'u3 ia-am-ha-ad[{ki}]',
'a-la-nu an-nu-tum u2-ul ih-li-qu2#',
'i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur#-ma', 'ih-ta-al-qu2',
'u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib#', 'u3 na-pa-asz2-ti u2-ba-li-it,',
'pi2-qa-at ha-s,e-ra#-at', 'asz-szum a-la-nu-ka', 'u3 ma-ru-ka sza-al#-[mu]',
'[a-na na-pa]-asz2#-ti-ia i-tu-ur'],
'normalization': [],
'translation': []}}
Table of Contents¶
Prints a table of contents from which one can identify the edition and cdli number for printing purposes.
In[1]: import os
In[2]: from cltk.corpus.akkadian.file_importer import FileImport
In[3]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus
In[4]: cdli = CDLICorpus()
In[5]: f_i = FileImport(os.path.join('test_akkadian', 'single_text.txt'))
In[6]: f_i.read_file()
In[7]: cdli.parse_file(f_i.file_lines)
In[8]: cdli.toc()
Out[8]: ['Pnum: P254202, Edition: ARM 01, 001, length: 23 line(s)']
List Pnums¶
Prints the CDLI numbers (pnums) of the ingested texts, which can be used to select a text for printing.
In[1]: import os
In[2]: from cltk.corpus.akkadian.file_importer import FileImport
In[3]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus
In[4]: cdli = CDLICorpus()
In[5]: f_i = FileImport(os.path.join('test_akkadian', 'single_text.txt'))
In[6]: f_i.read_file()
In[7]: cdli.parse_file(f_i.file_lines)
In[8]: cdli.list_pnums()
Out[8]: ['P254202']
List Editions¶
Prints the editions of the ingested texts, which can be used to identify a text for printing.
In[1]: import os
In[2]: from cltk.corpus.akkadian.file_importer import FileImport
In[3]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus
In[4]: cdli = CDLICorpus()
In[5]: f_i = FileImport(os.path.join('test_akkadian', 'single_text.txt'))
In[6]: f_i.read_file()
In[7]: cdli.parse_file(f_i.file_lines)
In[8]: cdli.list_editions()
Out[8]: ['ARM 01, 001']
Print Catalog¶
Prints cdli_corpus.catalog with bite-sized information, rather than the text in its entirety.
In[1]: import os
In[2]: from cltk.corpus.akkadian.file_importer import FileImport
In[3]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus
In[4]: cdli = CDLICorpus()
In[5]: f_i = FileImport(os.path.join('test_akkadian', 'single_text.txt'))
In[6]: f_i.read_file()
In[7]: cdli.parse_file(f_i.file_lines)
In[8]: cdli.print_catalog()
Out[8]: Pnum: P254202
Edition: ARM 01, 001
Metadata: True
Transliteration: True
Normalization: False
Translation: False
Tokenization¶
The Akkadian tokenizer reads ATF material and converts the data into readable, mutable tokens. There is an option to preserve or discard damage markers in the text.
The ATFConverter depends upon the word and sign tokenizer outputs.
String Tokenization:
This function is based on CLTK's line tokenizer. Use it for strings (e.g. lines copied and pasted from a document) rather than .txt files.
In[1]: from cltk.corpus.akkadian.tokenizer import Tokenizer
In[2]: line_tokenizer = Tokenizer(preserve_damage=False)
In[3]: text = '20. u2-sza-bi-la-kum\n1. a-na ia-as2-ma-ah-{d}iszkur#\n' \
'2. qi2-bi2-ma\n3. um-ma {d}utu-szi-{d}iszkur\n' \
'4. a-bu-ka-a-ma\n5. t,up-pa-[ka] sza#-[tu]-sza-bi-lam esz-me' \
'\n' '6. asz-szum t,e4#-em# {d}utu-illat-su2\n'\
'7. u3 ia#-szu-ub-dingir sza a-na la i-[zu]-zi-im\n'
In[4]: line_tokenizer.string_token(text)
Out[4]: ['20. u2-sza-bi-la-kum',
'1. a-na ia-as2-ma-ah-{d}iszkur',
'2. qi2-bi2-ma',
'3. um-ma {d}utu-szi-{d}iszkur',
'4. a-bu-ka-a-ma',
'5. t,up-pa-ka sza-tu-sza-bi-lam esz-me',
'6. asz-szum t,e4-em {d}utu-illat-su2',
'7. u3 ia-szu-ub-dingir sza a-na la i-zu-zi-im']
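To keep the damage markers instead, instantiate the tokenizer with preserve_damage=True; a sketch based on the parameter shown above (output omitted):
In[5]: damage_tokenizer = Tokenizer(preserve_damage=True)
In[6]: damage_tokenizer.string_token(text)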
Line Tokenization:
Line tokenization works for any text, from FileImport.raw_file to CDLICorpus.texts.
In[1]: import os
In[2]: from cltk.corpus.akkadian.tokenizer import Tokenizer
In[3]: line_tokenizer = Tokenizer(preserve_damage=False)
In[4]: text = os.path.join('test_akkadian', 'single_text.txt')
In[5]: line_tokenizer.line_token(text[1:8])
Out[5]: ['a-na ia-ah-du-li-[im]',
'qi2-bi2-[ma]',
'um-ma a-bi-sa-mar#-[ma]',
'sa-li-ma-am e-pu-[usz]',
'asz-szum mu-sze-zi-ba-am# [la i-szu]',
'[sa]-li#-ma-am sza e-[pu-szu]',
'[u2-ul] e-pu-usz sa#-[li-mu-um]',
'[u2-ul] sa-[li-mu-um-ma]']
Word Tokenization:
Word tokenization operates on a single line of text and returns all words in the line as (word, language) tuples in a list.
In[1]: import os
In[2]: from cltk.tokenize.word import WordTokenizer
In[3]: word_tokenizer = WordTokenizer('akkadian')
In[4]: line = 'u2-wa-a-ru at-ta e2-kal2-la-ka _e2_-ka wu-e-er'
In[5]: output = word_tokenizer.tokenize(line)
Out[5]: [('u2-wa-a-ru', 'akkadian'), ('at-ta', 'akkadian'),
('e2-kal2-la-ka', 'akkadian'), ('_e2_-ka', 'sumerian'),
('wu-e-er', 'akkadian')]
Sign Tokenization:
Sign tokenization takes a (word, language) tuple and splits the word up into individual sign tuples (sign, language) in a list.
In[1]: import os
In[2]: from cltk.tokenize.word import WordTokenizer
In[3]: word_tokenizer = WordTokenizer('akkadian')
In[4]: word = ("{gisz}isz-pur-ram", "akkadian")
In[5]: word_tokenizer.tokenize_sign(word)
Out[5]: [("gisz", "determinative"), ("isz", "akkadian"),
("pur", "akkadian"), ("ram", "akkadian")]
Unicode Conversion¶
From a list of tokens, this module returns the list converted from CDLI standards to print-publication standards. two_three is a parameter that lets the user switch between subscript numbering and accent marking for signs (a₂ versus á).
In[1]: from cltk.stem.akkadian.atf_converter import ATFConverter
In[2]: atf = ATFConverter(two_three=False)
In[3]: test = ['as,', 'S,ATU', 'tet,', 'T,et', 'sza', 'ASZ', "a", "a2", "a3", "be2", "bad3", "buru14"]
In[4]: atf.process(test)
Out[4]: ['aṣ', 'ṢATU', 'teṭ', 'Ṭet', 'ša', 'AŠ', "a", "á", "à", "bé", "bàd", "buru₁₄"]
Pretty Printing¶
Pretty Print allows you to take a .txt file and render it as an HTML file.
In[1]: import os
In[2]: from cltk.corpus.akkadian.file_importer import FileImport
In[3]: from cltk.corpus.akkadian.pretty_print import PrettyPrint
In[4]: origin = os.path.join('test_akkadian', 'single_text.txt')
In[5]: destination = os.path.join('..', 'Akkadian_test_text', 'html_single_text.html')
In[6]: f_i = FileImport(origin)
In[7]: f_i.read_file()
In[8]: p_p = PrettyPrint()
In[9]: p_p.html_print(f_i.raw_file, destination)
In[10]: f_o = FileImport(destination)
In[11]: f_o.read_file()
In[12]: output = f_o.raw_file
Syllabifier¶
Syllabify Akkadian words.
In [1]: from cltk.stem.akkadian.syllabifier import Syllabifier
In [2]: word = "epištašu"
In [3]: syll = Syllabifier()
In [4]: syll.syllabify(word)
Out[4]: ['e', 'piš', 'ta', 'šu']
Stress¶
This function identifies the stress on an Akkadian word.
In[2]: from cltk.phonology.akkadian.stress import StressFinder
In[3]: stresser = StressFinder()
In[4]: word = "šarrātim"
In[5]: stresser.find_stress(word)
Out[5]: ['šar', '[rā]', 'tim']
Decliner¶
This method outputs a list of tuples, the first element of each being a declined noun and the second a dictionary containing its attributes.
In[2]: from cltk.stem.akkadian.declension import NaiveDecliner
In[3]: word = 'ilum'
In[4]: decliner = NaiveDecliner()
In[5]: decliner.decline_noun(word, 'm')
Out[5]:
[('ilam', {'case': 'accusative', 'number': 'singular'}),
('ilim', {'case': 'genitive', 'number': 'singular'}),
('ilum', {'case': 'nominative', 'number': 'singular'}),
('ilīn', {'case': 'oblique', 'number': 'dual'}),
('ilān', {'case': 'nominative', 'number': 'dual'}),
('ilī', {'case': 'oblique', 'number': 'plural'}),
('ilū', {'case': 'nominative', 'number': 'plural'})]
Stems and Bound Forms¶
These two methods reduce a noun to its stem or bound form.
In[2]: from cltk.stem.akkadian.stem import Stemmer
In[3]: stemmer = Stemmer()
In[4]: word = "ilātim"
In[5]: stemmer.get_stem(word, 'f')
Out[5]: 'ilt'
In[2]: from cltk.stem.akkadian.bound_form import BoundForm
In[3]: bound_former = BoundForm()
In[4]: word = "kalbim"
In[5]: bound_former.get_bound_form(word, 'm')
Out[5]: 'kalab'
Consonant and Vowel patterns¶
It’s useful to be able to parse Akkadian words as sequences of consonants and vowels.
In[2]: from cltk.stem.akkadian.cv_pattern import CVPattern
In[3]: cv_patterner = CVPattern()
In[4]: word = "iparras"
In[5]: cv_patterner.get_cv_pattern(word)
Out[5]:
[('V', 1, 'i'),
('C', 1, 'p'),
('V', 2, 'a'),
('C', 2, 'r'),
('C', 2, 'r'),
('V', 2, 'a'),
('C', 3, 's')]
In[6]: cv_patterner.get_cv_pattern(word, pprint=True)
Out[6]: 'V₁C₁V₂C₂C₂V₂C₃'
Stopword Filtering¶
To use the CLTK’s built-in stopwords list for Akkadian:
In[2]: from nltk.tokenize.punkt import PunktLanguageVars
In[3]: from cltk.stop.akkadian.stops import STOP_LIST
In[4]: sentence = "šumma awīlum ina dīnim ana šībūt sarrātim ūṣiamma awat iqbû la uktīn šumma dīnum šû dīn napištim awīlum šû iddâk"
In[5]: p = PunktLanguageVars()
In[6]: tokens = p.word_tokenize(sentence.lower())
In[7]: [w for w in tokens if not w in STOP_LIST]
Out[7]:
['awīlum',
'dīnim',
'šībūt',
'sarrātim',
'ūṣiamma',
'awat',
'iqbû',
'uktīn',
'dīnum',
'dīn',
'napištim',
'awīlum',
'iddâk']
Arabic¶
Classical Arabic is the form of the Arabic language used in Umayyad and Abbasid literary texts from the 7th century AD to the 9th century AD. The orthography of the Qurʾān was not developed for the standardized form of Classical Arabic; rather, it shows the attempt on the part of writers to utilize a traditional writing system for recording a non-standardized form of Classical Arabic. (Source: Wikipedia)
Corpora¶
Use CorpusImporter()
.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('arabic')
In [3]: c.list_corpora
Out[3]:
['arabic_text_perseus','arabic_text_quranic_corpus','arabic_morphology_quranic-corpus']
In [4]: c.import_corpus('arabic_text_perseus') # ~/cltk_data/arabic/text/arabic_text_perseus/
Alphabet¶
The Arabic alphabet constants are located in cltk/corpus/arabic/alphabet.py.
In [1]: from cltk.corpus.arabic.alphabet import *
# all Hamza forms
In [2]: HAMZAT
Out[2]: ('ء', 'أ', 'إ', 'آ', 'ؤ', 'ؤ', 'ٔ', 'ٕ')
# print HAMZA from hamza const and from HAMZAT list
In [3] HAMZA
Out[3] 'ء'
In [4] HAMZAT[0]
Out[4] 'ء'
# listing all Arabic letters
In [5] LETTERS
out [5] 'ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ء آ أ ؤ إ ؤ'
# Listing all shaped forms for example Beh letter
In [6] SHAPED_FORMS[BEH]
Out[6] ('ﺏ', 'ﺐ', 'ﺑ', 'ﺒ')
# Listing all Punctuation marks
In [7] PUNCTUATION_MARKS
Out[7] ['،', '؛', '؟']
# Listing all diacritics: Fathatan, Dammatan, Kasratan, Fatha, Damma, Kasra, Sukun, Shadda
In [8] TASHKEEL
Out[8] ('ً', 'ٌ', 'ٍ', 'َ', 'ُ', 'ِ', 'ْ', 'ّ')
# Listing HARAKAT
In [9] HARAKAT
Out[9] ('ً', 'ٌ', 'ٍ', 'َ', 'ُ', 'ِ', 'ْ')
# Listing SHORTHARAKAT
In [10] SHORTHARAKAT
Out[10] ('َ', 'ُ', 'ِ', 'ْ')
# Listing Tanween
In [11] TANWEEN
Out[11] ('ً', 'ٌ', 'ٍ')
# Kasheeda, Tatweel
In [12] NOT_DEF_HARAKA
Out[12] 'ـ'
# WESTERN_ARABIC_NUMERALS numerals
In [13] WESTERN_ARABIC_NUMERALS
Out[13] ['0','1','2','3','4','5','6','7','8','9']
# EASTERN ARABIC NUMERALS from 0 to 9
In [14] EASTERN_ARABIC_NUMERALS
Out[14] ['۰', '۱', '۲', '۳', '٤', '۵', '٦', '۷', '۸', '۹']
# Listing The Weak letters .
In [15] WEAK
Out[15] ('ا', 'و', 'ي', 'ى')
# Listing all Ligatures Lam-Alef
In [16] LIGATURES_LAM_ALEF
Out[16] ('ﻻ', 'ﻷ', 'ﻹ', 'ﻵ')
# listing small letters
In [17] SMALL
Out[17] ('ٰ', 'ۥ', 'ۦ')
# Import letters names in arabic
In [18] Names[ALEF]
Out[18] 'ألف'
CLTK Arabic Support¶
1. Pyarabic¶
Pyarabic is an Arabic-specific library for Python that provides basic functions for manipulating Arabic letters and text, such as detecting Arabic letters, classifying letter groups and characteristics, and removing diacritics. Developed by Taha Zerrouki.
1.1. Features¶
- Arabic letters classification
- Text tokenization
- Strip Harakat (all, except Shadda, tatweel, last_haraka)
- Separate and join Letters and Harakat
- Reduce tashkeel
- Measure tashkeel similarity (Harakat, fully or partially vocalized, similarity with a template)
- Letters normalization (Ligatures and Hamza)
- Numbers to words
- Extract numerical phrases
- Pre-vocalization of numerical phrases
- Unshaping texts
1.2. Applications¶
- Arabic text processing
1.3. Usage¶
In [1] from cltk.corpus.arabic.utils.pyarabic import araby
In [2] char = 'ْ'
In [3] araby.is_sukun(char) # Checks for Arabic Sukun Mark
Out[3] True
In [4] char = 'ّ'
In [5] araby.is_shadda(char) # Checks for Arabic Shadda Mark
Out[5] True
In [6] text = "الْعَرَبِيّةُ"
In [7] araby.strip_harakat(text) # Strip Harakat from arabic word except Shadda.
Out[7] العربيّة
In [8] text = "الْعَرَبِيّةُ"
In [9] araby.strip_lastharaka(text)# Strip the last Haraka from arabic word except Shadda
Out[9] الْعَرَبِيّة
In [10] text = "الْعَرَبِيّةُ"
In [11] araby.strip_tashkeel(text) # Strip vowels from a text, include Shadda
Out[11] العربية
Stopword Filtering¶
To use the CLTK’s built-in stopwords list:
In [1]: from cltk.stop.arabic.stopword_filter import stopwords_filter as ar_stop_filter
In [2]: text = 'سُئِل بعض الكُتَّاب عن الخَط، متى يَسْتحِقُ أن يُوصَف بِالجَودةِ؟'
In [3]: ar_stop_filter(text)
Out[3]: ['سئل', 'الكتاب', 'الخط', '،', 'يستحق', 'يوصف', 'بالجودة', '؟']
Swadesh¶
The corpus module has a class for generating a Swadesh list for Arabic.
In[1]: from cltk.corpus.swadesh import Swadesh
In[2]: swadesh = Swadesh('ar')
In[3]: swadesh.words()[:10]
Out[3]: ['أنا' ,'أنت, أنتِ', 'هو,هي' ,'نحن' ,'أنتم, أنتن, أنتما', 'هم, هن, هما' ,'هذا' ,'ذلك' ,'هنا']
Word Tokenization¶
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: word_tokenizer = WordTokenizer('arabic')
In [3]: text = 'اللُّغَةُ الْعَرَبِيَّةُ جَمِيلَةٌ.'
In [4]: word_tokenizer.tokenize(text)
Out[4]: ['اللُّغَةُ', 'الْعَرَبِيَّةُ', 'جَمِيلَةٌ', '.']
Transliteration¶
The CLTK provides Buckwalter and ISO233-2 transliteration systems for the Arabic language.
Available Transliteration Systems¶
In [1] from cltk.phonology.arabic.romanization import available_transliterate_systems
In [2] available_transliterate_systems()
Out[2] ['buckwalter', 'iso233-2', 'asmo449']
Usage¶
In [1] from cltk.phonology.arabic.romanization import transliterate
In [2] mode = 'buckwalter'
In [3] ar_string = 'بِسْمِ اللهِ الرَّحْمٰنِ الرَّحِيْمِ' # translate in English: In the name of Allah, the Most Merciful, the Most Compassionate
In [4] ignore = '' # Arabic characters to ignore during the transliteration operation
In [5] reverse = True # True means transliterate from Arabic script to a Roman scheme such as Buckwalter
In [6] transliterate(mode, ar_string, ignore, reverse)
Out[6] 'bisomi Allhi Alra~Hom`ni Alra~Hiyomi'
Aramaic¶
Aramaic is a language or group of languages belonging to the Semitic subfamily of the Afroasiatic language family. More specifically, it is part of the Northwest Semitic group, which also includes the Canaanite languages such as Hebrew and Phoenician. The Aramaic alphabet was widely adopted for other languages and is ancestral to the Hebrew, Syriac and Arabic alphabets. During its approximately 3,100 years of written history, Aramaic has served variously as a language of administration of empires, as a language of divine worship and religious study, and as the spoken tongue of a number of Semitic peoples from the Near East. (Source: Wikipedia)
Transliterate Square Script To Imperial Aramaic¶
Unicode recently included a separate code block for encoding characters in Imperial Aramaic. Traditionally documents written in Imperial Aramaic are taught and shared using square script. Here is a small function for converting a string written in square script to its Imperial Aramaic version.
Usage:
Import the function:
In [1]: from cltk.corpus.aramaic.transliterate import square_to_imperial
Take a string written in square script:
In [2]: mystring = "פדי בר דג[נ]מלך לאחא בר חפיו נתנת לך"
Convert it to Imperial Aramaic by passing it to our function:
In [3]: square_to_imperial(mystring)
Out[3]: "𐡐𐡃𐡉 𐡁𐡓 𐡃𐡂[𐡍]𐡌𐡋𐡊 𐡋𐡀𐡇𐡀 𐡁𐡓 𐡇𐡐𐡉𐡅 𐡍𐡕𐡍𐡕 𐡋𐡊"
Bengali¶
Bengali also known by its endonym Bangla is an Indo-Aryan language spoken in South Asia. It is the national and official language of the People’s Republic of Bangladesh, and the official language of several northeastern states of the Republic of India, including West Bengal, Tripura, Assam (Barak Valley) and Andaman and Nicobar Islands. With over 210 million speakers, Bengali is the seventh most spoken native language in the world. Source: Wikipedia.
Corpora¶
Use CorpusImporter()
or browse the CLTK GitHub organization (anything beginning with bengali_
) to discover available Bengali corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('bengali')
In [3]: c.list_corpora
Out[3]:
['bengali_text_wikisource']
Tokenizer¶
This tool can help break up a sentence into smaller constituents.
In [1]: from cltk.tokenize.sentence import TokenizeSentence
In [2]: sentence = "রাজপণ্ডিত হব মনে আশা করে | সপ্তশ্লোক ভেটিলাম রাজা গৌড়েশ্বরে ||"
In [3]: tokenizer = TokenizeSentence('bengali')
In [4]: bengali_text_tokenize = tokenizer.tokenize(sentence)
In [5]: bengali_text_tokenize
['রাজপণ্ডিত', 'হব', 'মনে', 'আশা', 'করে', '|', 'সপ্তশ্লোক', 'ভেটিলাম', 'রাজা', 'গৌড়েশ্বরে', '|', '|']
Chinese¶
Chinese can be traced back to a hypothetical Sino-Tibetan proto-language. The first written records appeared over 3,000 years ago during the Shang dynasty. The earliest examples of Chinese are divinatory inscriptions on oracle bones from around 1250 BCE in the late Shang dynasty. Old Chinese was the language of the Western Zhou period (1046–771 BCE), recorded in inscriptions on bronze artifacts, the Classic of Poetry and portions of the Book of Documents and I Ching. Middle Chinese was the language used during Northern and Southern dynasties and the Sui, Tang, and Song dynasties (6th through 10th centuries CE). (Source: Wikipedia)
Corpora¶
Use CorpusImporter()
or browse the CLTK GitHub organization (anything beginning with chinese_
) to discover available Chinese corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('chinese')
In [3]: c.list_corpora
Out[3]:
['chinese_text_cbeta_01',
'chinese_text_cbeta_02',
'chinese_text_cbeta_indices']
Coptic¶
Coptic is the latest stage of the Egyptian language, a northern Afroasiatic language spoken in Egypt until at least the 17th century. Coptic flourished as a literary language from the second to thirteenth centuries, and its Bohairic dialect continues to be the liturgical language of the Coptic Orthodox Church of Alexandria. (Source: Wikipedia)
Corpora¶
Use CorpusImporter()
or browse the CLTK GitHub organization (anything beginning with coptic_
) to discover available Coptic corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('coptic')
In [3]: c.list_corpora
Out[3]: ['coptic_text_scriptorium']
Swadesh¶
The corpus module has a class for generating a Swadesh list for Coptic.
In[1]: from cltk.corpus.swadesh import Swadesh
In[2]: swadesh = Swadesh('cop')
In[3]: swadesh.words()[:10]
Out[3]: ['ⲁⲛⲟⲕ', 'ⲛⲧⲟⲕ, ⲛⲧⲟ', 'ⲛⲧⲟϥ, ⲛⲧⲟⲥ', 'ⲁⲛⲟⲛ', 'ⲛⲧⲟⲧⲛ', 'ⲛⲧⲟⲩ', '-ⲉⲓ', 'ⲡⲓ-, ϯ-, ⲛⲓ-', 'ⲡⲉⲓⲙⲁ', 'ⲙⲙⲁⲩ']
Ancient Egyptian¶
The language spoken in ancient Egypt was a branch of the Afroasiatic language family. The earliest known complete written sentence in the Egyptian language has been dated to about 2690 BCE, making it one of the oldest recorded languages known, along with Sumerian. Egyptian was spoken until the late seventeenth century in the form of Coptic. (Source: Wikipedia)
Important
For correct visualisation of transliterated Unicode text you will probably need a special font. We recommend either egyptoserif (handles the yod better) or noto.
Transliterate MdC¶
MdC (Manuel de Codage) is the standard encoding scheme and a series of conventions for transliterating Egyptian texts. It was originally also conceived as a system for representing positional relations between hieroglyphic signs, but it was soon realised that the scheme used by MdC was not really appropriate for that task; hence current software for hieroglyphic typesetting often uses schemes slightly different from MdC. For more on MdC, see here and here.
The transliteration conventions proposed by MdC are widely accepted, though. Since at the time the transliteration conventions of Egyptology were not covered by Unicode, MdC's all-ASCII proposal made it possible to exchange at least transliterations in a digital environment. It is the de facto transliteration system used by the Thesaurus Linguae Aegyptiae, which includes transliterations from several different scripts used in Ancient Egypt; a good discussion can be found here.
Here are the Unicode equivalents of the MdC transliteration scheme as represented in transliterate_mdc:
Note
reStructuredText tables cannot display all characters in the Character column. The several that cannot be displayed are: U+0056: ; U+003c: 〈; U+003e: 〉; U+0024, U+00a3: H̱.
MdC Unicode Number | MdC Character | Unicode Number | Unicode Character
------------------ | ------------- | -------------- | -----------------
U+0041 | A | U+a723 | ꜣ
U+0061 | a | U+a725 | ꜥ
U+0048 | H | U+1e25 | ḥ
U+0078 | x | U+1e2b | ḫ
U+0058 | X | U+1e96 | ẖ
U+0056 | V | U+0068+U+032d | See note
U+0053 | S | U+0161 | š
U+0063 | c | U+015b | ś
U+0054 | T | U+1e6f | ṯ
U+0076 | v | U+1e71 | ṱ
U+0044 | D | U+1e0f | ḏ
U+0069 | i | U+0069+U+0486 | i҆
U+003d | = | U+2e17 | ⸗
U+003c | < | U+2329 | See note
U+003e | > | U+232a | See note
U+0071 | q | U+1e33 | ḳ
U+0051 | Q | U+1e32 | Ḳ
U+00a1, U+0040 | ¡, @ | U+1e24 | Ḥ
U+0023, U+00a2 | #, ¢ | U+1e2a | Ḫ
U+0024, U+00a3 | $, £ | U+0048+U+0331 | See note
U+00a5, U+005e | ¥, ^ | U+0160 | Š
U+00a9, U+002b | ©, + | U+1e0e | Ḏ
U+0043 | C | U+015a | Ś
U+002a, U+00a7 | *, § | U+1e6e | Ṯ
Unicode still does not cover all of the transliteration conventions used within Egyptology, but there has been a lot of progress. Only three characters remain problematic and are not covered by precomposed characters from the Unicode Consortium:
- Egyptological Yod
- Capital H4
- Small and Capital H5: almost exclusively used for transliterating demotic script.
The function was created with the transliteration font provided by the CCER in view, which maps a couple of extra characters to transliterated equivalents, such as '¡' or '@' for Ḥ.
There is also a q_kopf flag for choosing between 'q' and 'ḳ' in the resulting text.
Usage:
Import the function:
In [1]: from cltk.corpus.egyptian.transliterate_mdc import mdc_unicode
Take an MdC-encoded string (P. Berlin 3022:28-31):
In [2]: mdc_string = """rdi.n wi xAst n xAst
fx.n.i r kpny Hs.n.i r qdmi
ir.n.i rnpt wa gs im in wi amw-nnSi
HqA pw n rtnw Hrt"""
Ensure that mdc_string is encoded in Unicode characters (this is mostly unnecessary):
In [3]: mdc_string.encode().decode("utf-8")
Out[3]: 'rdi.n wi xAst n xAst\nfx.n.i r kpny Hs.n.i r qdmi\nir.n.i rnpt wa gs im in wi amw-nnSi\nHqA pw n rtnw Hrt'
Apply the function to obtain the Unicode map result:
In [4]: unicode_string = mdc_unicode(mdc_string)
In [5]: print(unicode_string)
rdi҆.n wi҆ ḫꜣst n ḫꜣst
fḫ.n.i҆ r kpny ḥs.n.i҆ r qdmi҆
i҆r.n.i҆ rnpt wꜥ gs i҆m i҆n wi҆ ꜥmw-nnši҆
ḥqꜣ pw n rtnw ḥrt
If you disable the q_kopf option, the result would be the following:
In [6]: unicode_string = mdc_unicode(mdc_string, q_kopf=False)
In [7]: print(unicode_string)
rdi҆.n wi҆ ḫꜣst n ḫꜣst
fḫ.n.i҆ r kpny ḥs.n.i҆ r ḳdmi҆
i҆r.n.i҆ rnpt wꜥ gs i҆m i҆n wi҆ ꜥmw-nnši҆
ḥḳꜣ pw n rtnw ḥrt
Notice the q -> ḳ transformation.
If you are going to pass a string object read from a file, be sure to specify the encoding when opening the file:
with open("~/mdc_text.txt", "r", encoding="utf-8") as f:
mdc_text = f.read()
unicode_text = mdc_unicode(mdc_text)
Note the encoding="utf-8" argument; also note that open() does not expand "~", hence os.path.expanduser above.
TODO¶
- Add support for different transliteration systems used within Egyptology.
- Add an option for i -> j transformation, to facilitate computer-based operations.
- Add support for the problematic characters in the future.
Old English¶
Old English is the earliest historical form of the English language, spoken in England and southern and eastern Scotland in the early Middle Ages. It was brought to Great Britain by Anglo-Saxon settlers probably in the mid 5th century, and the first Old English literary works date from the mid-7th century. (Source: Wikipedia)
IPA Transcription¶
CLTK's IPA transcriber for Old English can be found in the orthophonology module.
In [1]: from cltk.phonology.old_english.orthophonology import OldEnglishOrthophonology as oe
In [2]: oe('Fæder ūre þū þe eeart on heofonum')
Out[2]: 'fæder u:re θu: θe eæɑrt on heovonum'
In [3]: oe('Hwæt! wē Gārdena in ġēardagum')
Out[3]: 'hwæt we: gɑ:rdenɑ in jæ:ɑrdɑyum'
The callable OldEnglishOrthophonology object can also return phonemes objects instead of IPA strings. The string representation of a phoneme object lists the distinctive features of the phoneme and its IPA representation.
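A sketch of that usage (the as_phonemes keyword argument is an assumption based on this description and may differ in your CLTK version):
In [4]: oe('Fæder ūre', as_phonemes=True)  # phoneme objects rather than an IPA string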
Corpora¶
Use CorpusImporter()
or browse the CLTK GitHub organization (anything beginning with old_english_
) to discover available Old English corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: corpus_importer = CorpusImporter("old_english")
In [3]: corpus_importer.list_corpora
['old_english_text_sacred_texts', 'old_english_models_cltk']
To download a corpus, use the import_corpus method. The following will download pre-trained POS models for Old English:
In [4]: corpus_importer.import_corpus('old_english_models_cltk')
Stopword Filtering¶
To use the CLTK's built-in stopwords list, we use an example from Beowulf:
In [1]: from nltk.tokenize.punkt import PunktLanguageVars
In [2]: from cltk.stop.old_english.stops import STOPS_LIST
In [3]: sentence = 'þe hie ær drugon aldorlease lange hwile.'
In [4]: p = PunktLanguageVars()
In [5]: tokens = p.word_tokenize(sentence.lower())
In [6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]:
['hie',
'drugon',
'aldorlease',
'lange',
'hwile',
'.']
Text Normalization¶
Diacritic Stripping¶
The Word module provides a method for stripping various diacritical marks:
In [1]: from cltk.phonology.old_english.phonology import Word
In [2]: Word('ġelǣd').remove_diacritics()
Out[2]: 'gelæd'
ASCII Encoding¶
For converting to ASCII, you can call ascii_encoding:
In [3]: Word('oðþæt').ascii_encoding()
Out[3]: 'odthaet'
In [4]: Word('ƿeorðunga').ascii_encoding()
Out[4]: 'weordunga'
Transliteration¶
Anglo-Saxon runic transliteration¶
You can call the runic transliteration module for converting runic script into latin characters:
In [1]: from cltk.phonology.old_english.phonology import Transliterate as t
In [2]: t.transliterate('ᚩᚠᛏ ᛋᚳᚣᛚᛞ ᛋᚳᛖᚠᛁᛝ ᛋᚳᛠᚦᛖᚾᚪ ᚦᚱᛠᛏᚢᛗ', 'Latin')
Out[2]: 'oft scyld scefin sceathena threatum'
The reverse process is also possible:
In [3]: t.transliterate('Hƿæt Ƿe Gardena in geardagum', 'Anglo-Saxon')
Out[3]: 'ᚻᚹᚫᛏ ᚹᛖ ᚷᚪᚱᛞᛖᚾᚪ ᛁᚾ ᚷᛠᚱᛞᚪᚷᚢᛗ'
Syllabification¶
There is a facility for using the pre-specified sonority hierarchy for Old English to syllabify words.
In [1]: from cltk.phonology.syllabify import Syllabifier
In [2]: s = Syllabifier(language='old_english')
In [3]: s.syllabify('geardagum')
Out[3]: ['gear', 'da', 'gum']
Lemmatization¶
A basic lemmatizer is provided, based on a hand-built dictionary of word forms.
In [1]: import cltk.lemmatize.old_english.lemma as oe_l
In [2]: lemmatizer = oe_l.OldEnglishDictionaryLemmatizer()
In [3]: lemmatizer.lemmatize('Næs him fruma æfre, or geworden, ne nu ende cymþ ecean')
Out [3]: [('Næs', 'næs'), ('him', 'he'), ('fruma', 'fruma'), ('æfre', 'æfre'), (',', ','), ('or', 'or'), ('geworden', 'weorþan'), (',', ','), ('ne', 'ne'), ('nu', 'nu'), ('ende', 'ende'), ('cymþ', 'cuman'), ('ecean', 'ecean')]
If an input word form has multiple possible lemmatizations, the system will select the lemma that occurs most frequently in a large corpus of Old English texts. If an input word form is not found in the dictionary, then it is simply returned.
Note, however, that by passing the extra parameter best_guess=False to the lemmatize function, one gains access to the underlying dictionary. In this case, a list is returned for each token. The list will contain:
- Nothing, if the word form is not found;
- A single string if the form maps to a unique lemma (the usual case);
- Multiple strings if the form maps to several lemmata.
In [1]: lemmatizer.lemmatize('Næs him fruma æfre, or geworden, ne nu ende cymþ ecean', best_guess=False)
Out [1]: [('Næs', ['nesan', 'næs']), ('him', ['him', 'he', 'hi']), ('fruma', ['fruma']), ('æfre', ['æfre']), (',', []), ('or', []), ('geworden', ['weorþan', 'geweorþan']), (',', []), ('ne', ['ne']), ('nu', ['nu']), ('ende', ['ende']), ('cymþ', ['cuman']), ('ecean', [])]
By specifying return_frequencies=True, the log of the relative frequency of each lemma is also returned:
In [1]: lemmatizer.lemmatize('Næs him fruma æfre, or geworden, ne nu ende cymþ ecean', best_guess=False, return_frequencies=True)
Out [1]: [('Næs', [('nesan', -11.498420778767347), ('næs', -5.340383031833549)]), ('him', [('him', -2.1288142618657147), ('he', -1.4098446677862744), ('hi', -2.3713533259849857)]), ('fruma', [('fruma', -7.3395376954076745)]), ('æfre', [('æfre', -4.570372796517447)]), (',', []), ('or', []), ('geworden', [('weorþan', -8.608049020871182), ('geweorþan', -9.100525505968976)]), (',', []), ('ne', [('ne', -1.9050995182359884)]), ('nu', [('nu', -3.393566264402446)]), ('ende', [('ende', -5.038516324389812)]), ('cymþ', [('cuman', -5.943525084818863)]), ('ecean', [])]
POS tagging¶
You can get the POS tags of Old English texts using the CLTK's wrapper around the NLTK taggers. First, download the models by importing the old_english_models_cltk corpus.
There are a number of different pre-trained models available for POS tagging of Old English. Each represents a trade-off between accuracy of tagging and speed of tagging. Listed in order of increasing accuracy (= decreasing speed), the models are:
- Unigram
- Trigram -> Bigram -> Unigram n-gram backoff model
- Conditional Random Field (CRF) model
- Perceptron model
(Bigram and trigram models are also available, but unsuitable due to low recall.)
The taggers were trained on annotated data from the ISWOC Treebank (license: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License).
The POS tag scheme is explained here: https://proiel.github.io/handbook/developer/
Bech, Kristin and Kristine Eide. 2014. The ISWOC corpus.
Department of Literature, Area Studies and European Languages,
University of Oslo. http://iswoc.github.com.
Example: Tagging with the CRF tagger¶
The following sentence is from the beginning of Beowulf:
In [1]: from cltk.tag.pos import POSTag
In [2]: tagger = POSTag('old_english')
In [3]: sent = 'Hwæt! We Gardena in geardagum, þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon.'
In [4]: tagger.tag_crf(sent)
Out[4]:[('Hwæt', 'I-'), ('!', 'C-'),
('We', 'NE'), ('Gardena', 'NE'), ('in', 'R-'), ('geardagum', 'NB'), (',', 'C-'),
('þeodcyninga', 'NB'), (',', 'C-'), ('þrym', 'PY'), ('gefrunon', 'NB'),
(',', 'C-'), ('hu', 'DU'), ('ða', 'PD'), ('æþelingas', 'NB'), ('ellen', 'V-'),
('fremedon', 'V-'), ('.', 'C-')]
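The other models are invoked through analogous methods on the same POSTag object; for example, the unigram tagger (the tag_unigram method name follows the CLTK's usual tag_* pattern for other languages, so verify it against your installed version; output omitted):
In [5]: tagger.tag_unigram(sent)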
Swadesh¶
The corpus module has a class for generating a Swadesh list for Old English.
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('eng_old')
In [3]: swadesh.words()[:10]
Out[3]: ['ic, iċċ, ih', 'þū', 'hē', 'wē', 'ġē', 'hīe', 'þēs, þēos, þis', 'sē, sēo, þæt', 'hēr', 'þār, þāra, þǣr, þēr']
Middle English¶
Middle English is collectively the varieties of the English language spoken after the Norman Conquest (1066) until the late 15th century; scholarly opinion varies but the Oxford English Dictionary specifies the period of 1150 to 1500. (Source: Wikipedia)
Text Normalization¶
CLTK’s normalizer attempts to clean the given text, converting it into a canonical form.
Lowercase Conversion¶
The to_lower parameter converts the string into lowercase.
In [1]: from cltk.corpus.middle_english.alphabet import normalize_middle_english
In [2]: normalize_middle_english("Whan Phebus in the Crabbe had nere hys cours ronne And toward the leon his journé gan take", to_lower=True)
Out [2]: 'whan phebus in the crabbe had nere hys cours ronne and toward the leon his journé gan take'
Punctuation Removal¶
The punct parameter is responsible for punctuation removal.
In [3]: normalize_middle_english("Thus he hath me dryven agen myn entent, And contrary to my course naturall.", punct=True)
Out [3]: 'thus he hath me dryven agen myn entent and contrary to my course naturall'
Canonical Form¶
The alpha_conv parameter follows the established spelling conventions developed throughout the last century: þ and ð are both converted to th, while 3 is converted to y at the start of a word and to gh otherwise.
In [4]: normalize_middle_english("as 3e lykeþ best", alpha_conv=True)
Out [4]: 'as ye liketh best'
Stemming¶
CLTK supports a rule-based affix stemmer for ME.
Keep in mind that while Middle English is considered a weakly inflected language with a grammatical structure resembling that of Modern English, its lack of orthographical conventions presents a difficulty when accounting for various affixes.
In [1]: from cltk.stem.middle_english import affix_stemmer
In [2]: from cltk.corpus.middle_english.alphabet import normalize_middle_english
In [3]: text = normalize_middle_english('The speke the henmest kyng, in the hillis he beholdis.').split(" ")
In [4]: affix_stemmer(text)
Out [4]: 'the spek the henm kyng in the hill he behold'
The stemmer can also take an additional parameter of a hard-coded exception dictionary. An example follows utilizing the compiled stopwords list.
In[7]: from cltk.stop.middle_english.stops import STOPS_LIST
In[8]: exceptions = dict(zip(STOPS_LIST, STOPS_LIST))
In[9]: affix_stemmer('byfore him'.split(" "), exception_list=exceptions)
Out[9]: 'byfore him'
Stopword Filtering¶
To use the CLTK's built-in stopwords list, we use an example from Chaucer's "The Summoner's Tale":
In [1]: from nltk.tokenize.punkt import PunktLanguageVars
In [2]: from cltk.stop.middle_english.stops import STOPS_LIST
In [3]: sentence = 'This frere bosteth that he knoweth helle'
In [4]: p = PunktLanguageVars()
In [5]: tokens = p.word_tokenize(sentence.lower())
In [6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]:
['frere',
'bosteth',
'knoweth',
'helle']
Stresser¶
The historical events of early 11th century Britain were intertwined with its phonological development. The Norman Conquest in 1066 is mainly responsible for the influx of both Francien and Latin words and by extension for the highly variable spelling and phonology of ME.
While the Stresser provided by CLTK is unable to determine the stress of a given word on its own, it does accept the most common stressing rules as parameters (Latin, Germanic, French).
In [1]: from cltk.phonology.middle_english.transcription import Word
In [2]: ".".join(Word('beren').stresser(stress_rule = "FSR"))
Out[2]: "ber.'en"
In [3]: ".".join(Word('yisterday').stresser(stress_rule = "GSR"))
Out [3]: "yi.ster.'day"
In [4]: ".".join(Word('verbum').stresser(stress_rule = "LSR"))
Out [4]: "ver.'bum"
Syllabify¶
The Word class provides a syllabification module for ME words.
In [1]: from cltk.phonology.middle_english.transcription import Word
In [2]: w = Word("hymsylf")
In [3]: w.syllabify()
Out [3]: ['hym', 'sylf']
In [4]: w.syllabified_str()
Out[4]: 'hym.sylf'
French¶
Old French (franceis, françois, romanz; Modern French ancien français) was the language spoken in Northern France from the 8th century to the 14th century. In the 14th century, these dialects came to be collectively known as the langue d’oïl, contrasting with the langue d’oc or Occitan language in the south of France. The mid-14th century is taken as the transitional period to Middle French, the language of the French Renaissance, specifically based on the dialect of the Île-de-France region. The place and area where Old French was spoken natively roughly extended to the historical Kingdom of France and its vassals, but the influence of Old French was much wider, as it was carried to England, Sicily and the Crusader states as the language of a feudal elite and of commerce.
(Source: Wikipedia)
Lemmatizer¶
The lemmatizer takes as its input a list of tokens, previously tokenized and lower-cased. It first seeks a match between each token and a list of potential lemmas taken from Godefroy (1901)’s Lexique, the Tobler-Lommatzsch, and the DECT. If a match is not found, the lemmatizer then seeks a match between the token and the forms that different lemmas are known to take (at present this only applies to lemmas from A–D and W–Z). If no match is returned at this stage, a set of rules is applied to the token. These rules are similar to those applied by the stemmer but aim to bring forms in line with lemmas rather than truncating them. Finally, if no match is found between the modified token and the list of lemmas, a result of ‘None’ is returned.
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: from cltk.lemmatize.french.lemma import LemmaReplacer
In [3]: text = "Li rois pense que par folie, Sire Tristran, vos aie amé ; Mais Dé plevis ma loiauté, Qui sor mon cors mete flaele, S’onques fors cil qui m’ot pucele Out m’amistié encor nul jor !"
In [4]: text = str.lower(text)
In [5]: tokenizer = WordTokenizer('french')
In [6]: lemmatizer = LemmaReplacer()
In [7]: tokens = tokenizer.tokenize(text)
In [8]: lemmatizer.lemmatize(tokens)
Out [8]: [('li', 'li'), ('rois', 'rois'), ('pense', 'pense'), ('que', 'que'), ('par', 'par'), ('folie', 'folie'), (',', ['PUNK']), ('sire', 'sire'), ('tristran', 'None'), (',', ['PUNK']), ('vos', 'vos'), ('aie', ['avoir']), ('amé', 'amer'), (';', ['PUNK']), ('mais', 'mais'), ('dé', 'dé'), ('plevis', 'plevir'), ('ma', 'ma'), ('loiauté', 'loiauté'), (',', ['PUNK']), ('qui', 'qui'), ('sor', 'sor'), ('mon', 'mon'), ('cors', 'cors'), ('mete', 'mete'), ('flaele', 'flaele'), (',', ['PUNK']), ("s'", "s'"), ('onques', 'onques'), ('fors', 'fors'), ('cil', 'cil'), ('qui', 'qui'), ("m'", "m'"), ('ot', 'ot'), ('pucele', 'pucele'), ('out', ['avoir']), ("m'", "m'"), ('amistié', 'amistié'), ('encor', 'encor'), ('nul', 'nul'), ('jor', 'jor'), ('!', ['PUNK'])]
Line Tokenization¶
The line tokenizer takes a string as its input and returns a list of strings.
In [1]: from cltk.tokenize.line import LineTokenizer
In [2]: tokenizer = LineTokenizer('french')
In [3]: untokenized_text = """Ki de bone matire traite,\nmult li peise, se bien n’est faite.\nOëz, seignur, que dit Marie,\nki en sun tens pas ne s’oblie."""
In [4]: tokenizer.tokenize(untokenized_text)
Out [4]: ['Ki de bone matire traite,', 'mult li peise, se bien n’est faite.', 'Oëz, seignur, que dit Marie,', 'ki en sun tens pas ne s’oblie. ']
Named Entity Recognition¶
The named entity recognizer for French takes a string as its input and returns a list of tuples. It tags named entities from a predefined list and also reports the category to which each entity belongs.
Categories are modeled on those found in (Moisan, 1986) and include:
- Locations “LOC” (e.g. Girunde)
- Nationalities/places of origin “NAT” (e.g. Grius)
- Individuals:
  - animals “ANI” (i.e. horses, e.g. Veillantif; cows, e.g. Blerain; dogs, e.g. Husdent)
  - authors “AUT” (e.g. Marie, Chrestïen)
  - nobility “CHI” (e.g. Rolland, Artus); n.b. characters such as Turpin are counted as nobility rather than religious figures
  - characters from classical sources “CLAS” (e.g. Echo)
  - feasts “F” (e.g. Pentecost)
  - religious things “REL” (i.e. saints, e.g. St Alexis; deities, e.g. Deus; and Old Testament people, e.g. Adam)
  - swords “SW” (e.g. Hautecler)
  - commoners “VIL” (e.g. Pathelin)
In [1]: from cltk.tag.ner import NamedEntityReplacer
In [2]: text_str = """Berte fu mere Charlemaine, qui pukis tint France et tot le Maine."""
In [3]: ner_replacer = NamedEntityReplacer()
In [4]: ner_replacer.tag_ner_fr(text_str)
Out [4]: [[('Berte', 'entity', 'CHI')], ('fu',), ('mere',), [('Charlemaine', 'entity', 'CHI')], (',',), ('qui',), ('pukis',), ('tint',), [('France', 'entity', 'LOC')], ('et',), ('tot',), ('le',), [('Maine', 'entity', 'LOC')], ('.',)]
Normalizer¶
The normalizer aims to reduce, as far as possible, variation in the orthography of texts written in the Anglo-Norman dialect, bringing them into line with the “orthographe commune”. It is heavily inspired by Pope (1956). It takes a string as its input. Spelling variation is not consistent enough to guarantee high accuracy, so the normalizer should be used as a last resort.
In [1]: from cltk.corpus.utils.formatter import normalize_fr
In [2]: text = "viw"
In [3]: normalize_fr(text)
Out [3]: ['vieux']
Stemmer¶
The stemmer strips morphological endings from an input string. Morphological endings are taken from Brunot & Bruneau (1949) and include both nominal and verbal inflexion. A list of exceptions can be found at cltk.stem.french.exceptions.
In [1]: from cltk.stem.french.stem import stem
In [2]: text = "ja departissent a itant quant par la vile vint errant tut a cheval une pucele en tut le siecle n’ot si bele un blanc palefrei chevalchot"
In [3]: stem(text)
Out [3]: "j depart a it quant par la vil v err tut a cheval un pucel en tut le siecl n' o si bel un blanc palefre chevalcho"
Stopword Filtering¶
The stopword filterer removes function words from a string of Old French or Middle French text. The list includes the function words among the 100 most common words in the corpus, as well as all conjugated forms of the auxiliaries estre and avoir.
In [1]: from cltk.stop.french.stops import STOPS_LIST as FRENCH_STOPS
In [2]: from cltk.tokenize.word import WordTokenizer
In [3]: tokenizer = WordTokenizer('french')
In [4]: text = "En pensé ai e en talant que d’ Yonec vus die avant dunt il fu nez, e de sun pere cum il vint primes a sa mere ."
In [5]: text = text.lower()
In [6]: tokens = tokenizer.tokenize(text)
In [7]: no_stops = [w for w in tokens if w not in FRENCH_STOPS]
In [8]: no_stops
Out [8]: ['pensé', 'talant', 'yonec', 'die', 'avant', 'dunt', 'nez', ',', 'pere', 'cum', 'primes', 'mere', '.']
Word Tokenization¶
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: word_tokenizer = WordTokenizer('french')
In [3]: text = "S'a table te veulz maintenir, Honnestement te dois tenir Et garder les enseignemens Dont cilz vers sont commancemens."
In [4]: word_tokenizer.tokenize(text)
Out [4]: ["S'", 'a', 'table', 'te', 'veulz', 'maintenir', ',', 'Honnestement', 'te', 'dois', 'tenir', 'Et', 'garder', 'les', 'enseignemens', 'Dont', 'cilz', 'vers', 'sont', 'commancemens', '.']
Apostrophes are considered part of the first word of the two they separate. Apostrophes are also normalized from the straight apostrophe (') to the typographic apostrophe (’).
Swadesh¶
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('fr_old')
In [3]: swadesh.words()[:10]
Out[3]: ['jo, jou, je, ge', 'tu', 'il', 'nos, nous', 'vos, vous', 'il, eles', 'cist, cest, cestui', 'ci', 'la']
Middle High German¶
Middle High German (abbreviated MHG, German: Mittelhochdeutsch, abbr. Mhd.) is the term for the form of German spoken in the High Middle Ages. It is conventionally dated between 1050 and 1350, developing from Old High German and into Early New High German. High German is defined as those varieties of German which were affected by the Second Sound Shift; the Middle Low German and Middle Dutch languages spoken to the North and North West, which did not participate in this sound change, are not part of MHG. (Source: Wikipedia)
ASCII Encoding¶
Using the Word
class, you can easily convert a string to its ASCII encoding, essentially stripping it of its diacritics.
In [1]: from cltk.phonology.middle_high_german.transcription import Word
In [2]: w = Word("vogellîn")
In [3]: w.ASCII_encoding()
Out[3]: 'vogellin'
Stemming¶
Note
The stemming algorithm is still under development and can sometimes produce inaccurate results.
The CLTK’s stemming function attempts to reduce inflected words to their stem by suffix stripping.
In [1]: from cltk.stem.middle_high_german.stem import stemmer_middle_high_german
In [2]: stemmer_middle_high_german("Man lūte dā zem münster nāch gewoneheit")
Out[2]: ['Man', 'lut', 'dâ', 'zem', 'munst', 'nâch', 'gewoneheit']
The stemmer strips umlauts by default; to turn this off, set rem_umlauts = False:
In [3]: stemmer_middle_high_german("Man lūte dā zem münster nāch gewoneheit", rem_umlauts = False)
Out[3]: ['Man', 'lût', 'dâ', 'zem', 'münst', 'nâch', 'gewoneheit']
The stemmer can also take a user-defined exception dictionary as an optional parameter.
In [4]: stemmer_middle_high_german("swaȥ kriuchet unde fliuget und bein zer erden biuget", rem_umlauts = False)
Out[4]: ['swaȥ', 'kriuchet', 'unde', 'fliuget', 'und', 'bein', 'zer', 'erden', 'biuget']
In [5]: stemmer_middle_high_german("swaȥ kriuchet unde fliuget und bein zer erden biuget", rem_umlauts = False, exceptions = {"biuget" : "biegen"})
Out[5]: ['swaȥ', 'kriuchet', 'unde', 'fliuget', 'und', 'bein', 'zer', 'erden', 'biegen']
Syllabification¶
A syllabifier is contained in the Word
module:
In [1]: from cltk.phonology.middle_high_german.transcription import Word
In [2]: Word('entslâfen').syllabify()
Out[2]: ['ent', 'slâ', 'fen']
Note that the syllabifier is case-insensitive:
In [3]: Word('Fröude').syllabify()
Out[3]: ['fröu', 'de']
You can also load the sonority hierarchy of MHG phonemes into the phonology module’s Syllabifier:
In [4]: from cltk.phonology.syllabify import Syllabifier
In [5]: s = Syllabifier(language='middle high german')
In [6]: s.syllabify('lobebæren')
Out[6]: ['lo', 'be', 'bæ', 'ren']
Stopword Filtering¶
CLTK offers a built-in stop word list for Middle High German.
In [1]: from cltk.stop.middle_high_german.stops import STOPS_LIST
In [2]: from cltk.tokenize.word import WordTokenizer
In [3]: word_tokenizer = WordTokenizer('middle_high_german')
In [4]: sentence = "Wol mich lieber mære diu ich hān vernomen daȥ der winter swære welle ze ende komen"
In [5]: tokens = word_tokenizer.tokenize(sentence.lower())
In [6]: [word for word in tokens if word not in STOPS_LIST]
Out[6]: ['lieber', 'mære', 'hān', 'vernomen', 'winter', 'swære', 'welle', 'komen']
Text Normalization¶
Text normalization attempts to narrow the discrepancies between various corpora.
Lowercase Conversion¶
By default, the function converts the whole string to lowercase. However, since in MHG uppercase is used only at the start of a sentence or to denote eponyms, you may instead set to_lower_all = False and to_lower_beginning = True to lowercase only the first word of each sentence.
In [1]: from cltk.corpus.middle_high_german.alphabet import normalize_middle_high_german
In [2]: normalize_middle_high_german("Dô erbiten si der nahte und fuoren über Rîn")
Out[2]: 'dô erbiten si der nahte und fuoren über rîn'
In [3]: normalize_middle_high_german("Dô erbiten si der nahte und fuoren über Rîn",to_lower_all = False, to_lower_beginning = True)
Out[3]: 'dô erbiten si der nahte und fuoren über Rîn'
Alphabet Conversion¶
Various online corpora use the characters ā, ō, ū, ē, ī to represent â, ô, û, ê and î respectively. Sometimes, ae and oe are also used instead of æ and œ. By default, the normalizer converts the text to the canonical form.
In [4]: normalize_middle_high_german("Mit ūf erbürten schilden in was ze strīte nōt", alpha_conv = True)
Out[4]: 'mit ûf erbürten schilden in was ze strîte nôt'
Punctuation¶
Punctuation is also handled by the normalizer.
In [5]: normalize_middle_high_german("Si sprach: ‘herre Sigemunt, ir sult iȥ lāȥen stān", punct = True)
Out[5]: 'si sprach herre sigemunt ir sult iȥ lâȥen stân'
Phonetic Indexing¶
Phonetic indexing helps with identifying and processing homophones.
Soundex¶
The Word
class provides a Soundex algorithm adapted for MHG.
In [1]: from cltk.phonology.middle_high_german.transcription import Word
In [2]: w1 = Word("krippe")
In [3]: w1.phonetic_indexing(p = "SE")
Out[3]: 'K510'
In [4]: w2 = Word("krîbbe")
In [5]: w2.phonetic_indexing(p = "SE")
Out[5]: 'K510'
Transliteration¶
CLTK’s transcriber rewrites a word in the International Phonetic Alphabet (IPA). As of this version, the Transcriber class does not support any specific dialect and serves as a superset encompassing various regional accents.
In [1]: from cltk.phonology.middle_high_german.transcription import Transcriber
In [2]: tr = Transcriber()
In [3]: tr.transcribe("Slâfest du, friedel ziere?", punctuation = True)
Out[3]: '[Slɑːfest d̥ʊ, frɪ͡əd̥el t͡sɪ͡əre?]'
In [4]: tr.transcribe("Slâfest du, friedel ziere?", punctuation = False)
Out[4]: '[Slɑːfest d̥ʊ frɪ͡əd̥el t͡sɪ͡əre]'
Word Tokenization¶
The WordTokenizer
class takes a string as input and returns a list of tokens.
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: word_tokenizer = WordTokenizer('middle_high_german')
In [3]: text = "Mīn ougen wurden liebes alsō vol, \n\n\ndō ich die minneclīchen ērst gesach,\ndaȥ eȥ mir hiute und iemer mē tuot wol."
In [4]: word_tokenizer.tokenize(text)
Out[4]: ['Mīn', 'ougen', 'wurden', 'liebes', 'alsō', 'vol', ',', 'dō', 'ich', 'die', 'minneclīchen', 'ērst', 'gesach', ',', 'daȥ', 'eȥ', 'mir', 'hiute', 'und', 'iemer', 'mē', 'tuot', 'wol', '.']
Lemmatization¶
The CLTK offers a series of lemmatizers that can be combined in a backoff chain, i.e. if one lemmatizer is unable to return a headword for a token, the token can be passed on to another lemmatizer until either a headword is returned or the sequence ends. There is a generic version of the backoff Middle High German lemmatizer which requires data from the CLTK Middle High German models. The lemmatizer expects this model to be stored in a folder called cltk_data in the user’s home directory.
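If the model is not yet on your machine, it can be downloaded with the corpus importer. This is a minimal sketch: list_corpora and import_corpus are the CLTK’s standard importer calls, but the exact name of the Middle High German models corpus used below is an assumption and should be confirmed from the listing.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: corpus_importer = CorpusImporter('middle_high_german')
In [3]: corpus_importer.list_corpora  # confirm the exact name of the models corpus
In [4]: corpus_importer.import_corpus('middle_high_german_models_cltk')  # assumed corpus name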
To use the generic version of the backoff Middle High German Lemmatizer:
In [1]: from cltk.lemmatize.middle_high_german.backoff import BackoffMHGLemmatizer
In [2]: lemmatizer = BackoffMHGLemmatizer()
In [3]: tokens = "uns ist in alten mæren".split(" ")
In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('uns', {'uns', 'unser', 'unz', 'wir'}), ('ist', {'sîn/wider(e)+', 'ist', 'sîn/inne+', 'sîn/mit(e)<+', 'sîn/vür(e)+', 'sîn/abe+', 'sîn/obe+', 'sîn/vor(e)+', 'sîn/vür(e)>+', 'sîn/ûze+', 'sîn/ûz+', 'sîn/bî<.+', 'sîn/vür(e)<+', 'sîn/innen+', 'sîn/âne+', 'sîn/bî+', 'sîn/ûz<+', 'sîn', 'sîn/ûf<.+'}), ('in', {'ër', 'in/hin(e)+', 'in/>+gân', 'in/+gân', 'în/+gân', 'in/+lâzen', 'în', 'in/<.+wintel(e)n', 'in/>+rinnen', 'in/dar(e)+', 'in/.>+slîzen', 'în/hin(e)+', 'în/+lèiten', 'în/+var(e)n', 'in', 'in/>+tragen', 'in/+tropfen', 'în/+lègen', 'in/>+winten', 'în/+brèngen', 'in/>+büègen', 'ërr', 'în/+zièhen', 'in/<.+gân', 'in/+zièhen', 'in/>+tûchen', 'dër', 'în/dâr+', 'in/war(e).+', 'in/<.+lâzen', 'in/>+rîten', 'în/+lâzen', 'in/>+lâzen', 'in/+stapfen', 'în/+sènten', 'in/>.+lâzen', 'in/>+stân', 'in/+drücken', 'in/>+ligen', 'in/dâr+ ', 'in/+var(e)n', 'in/+vüèren', 'in/<.+vallen', 'in/>+vlièzen', 'in/<.+rîten', 'in/hër(e).+', 'ne', 'in/>+wonen', 'in/<.+sigel(e)n', 'in/+lègen', 'în/+dringen', 'in/>+ge-trîben', 'in/+diènen', 'in/>+ge-stëchen', 'in/>+stècken', 'in/hër(e)+', 'in/>+stëchen', 'in/dâr+', 'in/+blâsen', 'în/dâr.+', 'in/>+wîsen', 'în/+îlen', 'in/>+laden', 'în/+komen', 'în/+ge-lèiten', 'in/<.+vloèzen', 'ër ', 'in/>+sètzen', 'in/hièr+', 'in/>+bûwen', 'in/>+lèiten', 'în/+ge-binten', '[!]', 'în/+trîben', 'in/<.+blâsen', 'in/+komen', 'în/+krièchen', 'in/+trîben', 'in/<.+ligen', 'in/+stëchen', 'in/<+gân', 'in/dâr.+', 'în/hër(e)+', 'in/+kêren', 'in/<.+var(e)n', 'in/+rîten', 'in/>+vallen', 'in/<.+vüèren'}), ('alten', {'alt', 'alter', 'alten'}), ('mæren', {'mæren', 'mære'})]
POS tagging¶
In [1]: from cltk.tag.pos import POSTag
In [2]: mhg_pos_tagger = POSTag("middle_high_german")
In [3]: mhg_pos_tagger.tag_tnt("uns ist in alten mæren wunders vil geseit")
Out[3]: [('uns', 'PPER'), ('ist', 'VAFIN'), ('in', 'APPR'), ('alten', 'ADJA'), ('mæren', 'ADJA'),
('wunders', 'NA'), ('vil', 'AVD'), ('geseit', 'VVPP')]
Middle Low German¶
Middle Low German or Middle Saxon is a language that is the descendant of Old Saxon and the ancestor of modern Low German. It served as the international lingua franca of the Hanseatic League. It was spoken from about 1100 to 1600, or 1200 to 1650. (Source: Wikipedia)
POS Tagging¶
The POS taggers were trained with the NLTK’s tagger models on the ReN training set.
1–2–gram backoff tagger¶
In [1]: from cltk.tag.pos import POSTag
In [2]: tagger = POSTag('middle_low_german')
In [3]: tagger.tag_ngram_12_backoff('Jck Johannes Veghe preister verwarer vnde voirs tender des Juncfrouwen kloisters to Mariendale')
Out[3]: [('Jck', 'PPER'),
('Johannes', 'NE'),
('Veghe', 'NE'),
('preister', 'NA'),
('verwarer', 'NA'),
('vnde', 'KON'),
('voirs', 'NA'),
('tender', 'NA'),
('des', 'DDARTA'),
('Juncfrouwen', 'NA'),
('kloisters', 'NA'),
('to', 'APPR'),
('Mariendale', 'NE')]
Gothic¶
Gothic is an extinct East Germanic language that was spoken by the Goths. It is known primarily from the Codex Argenteus, a 6th-century copy of a 4th-century Bible translation, and is the only East Germanic language with a sizable text corpus. (Source: Wikipedia)
Phonological transcription¶
In [1]: from cltk.phonology.gothic import transcription as gt
In [2]: from cltk.phonology import utils as ut
In [3]: sentence = "Anastodeins aiwaggeljons Iesuis Xristaus sunaus gudis."
In [4]: tr = ut.Transcriber(gt.DIPHTHONGS_IPA, gt.DIPHTHONGS_IPA_class, gt.IPA_class, gt.gothic_rules)
In [5]: tr.main(sentence, gt.gothic_rules)
Out [5]:
"[anastoːðiːns ɛwaŋgeːljoːns jeːsuis kristɔs sunɔs guðis]"
Greek¶
Greek is an independent branch of the Indo-European family of languages, native to Greece and other parts of the Eastern Mediterranean. It has the longest documented history of any living language, spanning 34 centuries of written records. Its writing system has been the Greek alphabet for the major part of its history; other systems, such as Linear B and the Cypriot syllabary, were used previously. The alphabet arose from the Phoenician script and was in turn the basis of the Latin, Cyrillic, Armenian, Coptic, Gothic and many other writing systems. (Source: Wikipedia)
Note
For most of the following operations, you must first import the CLTK Greek linguistic data (named greek_models_cltk
).
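If you have not already downloaded these models, the corpus importer can fetch them. A minimal sketch, using the CLTK’s standard import_corpus method and the corpus name given above:
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: corpus_importer = CorpusImporter('greek')
In [3]: corpus_importer.import_corpus('greek_models_cltk')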
Alphabet¶
The Greek vowels and consonants in upper and lower case are placed in cltk/corpus/greek/alphabet.py.
Greek vowels can occur without any breathing or accent, or with rough or smooth breathing, different accents, diaereses, macrons, breves and combinations thereof; Greek consonants have none of these features, except ρ, which can carry rough or smooth breathing.
In alphabet.py the vowels and consonants are grouped by upper or lower case, accent, breathing, diaeresis and possible combinations thereof.
These groupings are stored in lists or, in case of a single letter like ρ, as strings with descriptive names structured like CASE_SPECIFIERS
, e.g. LOWER_DIARESIS_CIRCUMFLEX
.
For example, to use upper-case vowels with rough breathing and an acute accent:
In[1]: from cltk.corpus.greek.alphabet import UPPER_ROUGH_ACUTE
In[2]: print(UPPER_ROUGH_ACUTE)
Out[2]: ['Ἅ', 'Ἕ', 'Ἥ', 'Ἵ', 'Ὅ', 'Ὕ', 'Ὥ', 'ᾍ', 'ᾝ', 'ᾭ']
Accents indicate the pitch of vowels. An acute accent or ὀξεῖα (oxeîa) indicates a rising pitch on a long vowel or a high pitch on a short vowel, a grave accent or βαρεῖα (bareîa) indicates a normal or low pitch and a circumflex or περισπωμένη (perispōménē) indicates high or falling pitch within one syllable.
Breathings, which are used not only on vowels but also on ρ, indicate the presence or absence of a voiceless glottal fricative: rough breathing indicates a voiceless glottal fricative before a vowel, as in αἵρεσις (haíresis), and smooth breathing indicates its absence.
Diaereses are placed on ι and υ to indicate that two vowels do not form a diphthong, and macrons and breves are placed on α, ι, and υ to indicate the length of these vowels.
For more information on Greek diacritics see the corresponding Wikipedia page.
Accentuation and diacritics¶
James Tauber has created a Python 3 based library to enable working with the accentuation of Ancient Greek words. Installing it is optional for working with CLTK.
For further information please see the original docs, as this is just an abridged version.
The library can be installed with pip
:
pip install greek-accentuation
Contrary to the original docs, to use the functions from this module you must explicitly import every function you need, rather than relying on a star import (e.g. from greek_accentuation.characters import *).
The Characters Module:
base
returns a given character without diacritics. For example:
In[1]: from greek_accentuation.characters import base
In[2]: base('ᾳ')
Out[2]: 'α'
add_diacritic
and add_breathing
add diacritics (accents, diaresis, macrons, breves) and breathing symbols to the given character. add_diacritic
is stackable, for example:
In[1]: from greek_accentuation.characters import add_diacritic, ROUGH, ACUTE
In[2]: add_diacritic(add_diacritic('ο', ROUGH), ACUTE)
Out[2]: 'ὅ'
accent
and strip_accents
return the accent of a character as a Unicode escape and the character stripped of its accent, respectively. breathing
, strip_breathing
, length
and strip_length
work analogously, for example:
In[1]: from greek_accentuation.characters import length, strip_length, SHORT
In[2]: length('ῠ') == SHORT
Out[2]: True
In[3]: strip_length('ῡ')
Out[3]: 'υ'
If a length diacritic becomes redundant because of a circumflex it can be stripped with remove_redundant_macron
just like strip_length
above.
The Syllabify Module:
syllabify
splits the given word into syllables, which are returned as a list of strings. Words without vowels are syllabified as a single syllable. The syllabification can also be displayed as a word with the syllables separated by periods using display_word
.
In[1]: from greek_accentuation.syllabify import syllabify, display_word
In[2]: syllabify('γυναικός')
Out[2]: ['γυ', 'ναι', 'κός']
In[3]: syllabify('γγγ')
Out[3]: ['γγγ']
In[4]: display_word(syllabify('καταλλάσσω'))
Out[4]: 'κα.ταλ.λάσ.σω'
is_vowel
and is_diphthong
return a boolean value to determine whether a given character is a vowel or two given characters are a diphthong.
In[1]: from greek_accentuation.syllabify import is_diphthong
In[2]: is_diphthong('αι')
Out[2]: True
ultima
, antepenult
and penult
return the ultima, antepenult or penult (i.e. the last, third-from-last or next-to-last syllable, respectively) of the given word. A syllable can also be further broken down into its onset, nucleus and coda (i.e. its initial consonant(s), middle part and final consonant(s)) with the functions named accordingly. rime
returns the sequence of a syllable’s nucleus and coda and body
returns the sequence of a syllable’s onset and nucleus.
onset_nucleus_coda
returns a syllable’s onset, nucleus and coda all at once as a triple.
In[1]: from greek_accentuation.syllabify import ultima, rime, onset_nucleus_coda
In[2]: ultima('γυναικός')
Out[2]: 'κός'
In[3]: rime('κός')
Out[3]: 'ός'
In[4]: onset_nucleus_coda('ναι')
Out[4]: ('ν', 'αι', '')
debreath
returns a word with the smooth breathing removed and the rough breathing replaced with an h. rebreath
reverses debreath
.
In[1]: from greek_accentuation.syllabify import debreath, rebreath
In[2]: debreath('οἰκία')
Out[2]: 'οικία'
In[3]: rebreath('οικία')
Out[3]: 'οἰκία'
In[3]: debreath('ἑξεῖ')
Out[3]: 'hεξεῖ'
In[4]: rebreath('hεξεῖ')
Out[4]: 'ἑξεῖ'
syllable_length
returns the length of a syllable (in the linguistic sense) and syllable_accent
extracts a syllable’s accent.
In[1]: from greek_accentuation.syllabify import syllable_length, syllable_accent
In[2]: syllable_length('σω') == LONG
Out[2]: True
In[3]: syllable_accent('ναι') is None
Out[3]: True
The accentuation class of a word such as oxytone, paroxytone, proparoxytone, perispomenon, properispomenon or barytone can be tested with the functions named accordingly.
add_necessary_breathing
adds smooth breathing to a word if necessary.
In[1]: from greek_accentuation.syllabify import add_necessary_breathing
In[2]: add_necessary_breathing('οι')
Out[2]: 'οἰ'
In[3]: add_necessary_breathing('οἰ')
Out[3]: 'οἰ'
The Accentuation Module:
get_accent_type
returns the accent type of a word as a tuple of the syllable number and accent, which is comparable to the constants provided. The accent type can also be displayed as a string with display_accent_type
.
In[1]: from greek_accentuation.accentuation import get_accent_type, display_accent_type
In[2]: get_accent_type('ἀγαθοῦ') == PERISPOMENON
Out[2]: True
In[3]: display_accent_type(get_accent_type('ψυχή'))
Out[3]: 'oxytone'
syllable_add_accent(syllable, accent)
adds the given accent to a syllable. It is also possible to add an accent class to a syllable, for example:
In[1]: from greek_accentuation.accentuation import syllable_add_accent, make_paroxytone
In[2]: syllable_add_accent('ου', CIRCUMFLEX)
Out[2]: 'οῦ'
In[3]: make_paroxytone('λογος')
Out[3]: 'λόγος'
possible_accentuations
returns all possible accentuations of a given syllabification according to Ancient Greek accentuation rules. To treat vowels of unmarked length as short vowels set default_short = True
in the function parameters.
In[1]: from greek_accentuation.accentuation import possible_accentuations, add_accent
In[2]: s = syllabify('εγινωσκου')
In[3]: for accent_class in possible_accentuations(s):
In[4]: print(add_accent(s, accent_class))
Out[4]: εγινώσκου
Out[4]: εγινωσκού
Out[4]: εγινωσκοῦ
In[5]: s = syllabify('κυριος')
In[6]: for accent_class in possible_accentuations(s, default_short=True):
In[7]: print(add_accent(s, accent_class))
Out[7]: κύριος
Out[7]: κυρίος
Out[7]: κυριός
recessive
finds the most recessive (i.e. as far away from the end of the word as possible) accent and returns the given word with that accent. A |
can be placed to set a point past which the accent will not recede. on_penult
places the accent on the penult (next-to-last syllable).
In[1]: from greek_accentuation.accentuation import recessive, on_penult
In[2]: recessive('εἰσηλθον')
Out[2]: 'εἴσηλθον'
In[3]: recessive('εἰσ|ηλθον')
Out[3]: 'εἰσῆλθον'
In[4]: on_penult('φωνησαι')
Out[4]: 'φωνῆσαι'
persistent
gets passed a word and a lemma (i.e. the canonical form of a set of words) and derives the accent from these two words.
In[1]: from greek_accentuation.accentuation import persistent
In[2]: persistent('ἀνθρωπου', 'ἄνθρωπος')
Out[2]: 'ἀνθρώπου'
Expand iota subscript:
The CLTK offers one transformation that can be useful in certain types of processing: expanding the iota subscript from a single Unicode point into a separate character placed beside, to the right of, the base character.
In [1]: from cltk.corpus.greek.alphabet import expand_iota_subscript
In [2]: s = 'εἰ δὲ καὶ τῷ ἡγεμόνι πιστεύσομεν ὃν ἂν Κῦρος διδῷ'
In [3]: expand_iota_subscript(s)
Out[3]: 'εἰ δὲ καὶ τῶΙ ἡγεμόνι πιστεύσομεν ὃν ἂν Κῦρος διδῶΙ'
In [4]: expand_iota_subscript(s, lowercase=True)
Out[4]: 'εἰ δὲ καὶ τῶι ἡγεμόνι πιστεύσομεν ὃν ἂν κῦρος διδῶι'
Converting Beta Code to Unicode¶
Note that incoming strings need to begin with an r
and that the Beta Code must follow immediately after the initial """
, as in input line 2, below.
In [1]: from cltk.corpus.greek.beta_to_unicode import Replacer
In [2]: BETA_EXAMPLE = r"""O(/PWS OU)=N MH\ TAU)TO\ PA/QWMEN E)KEI/NOIS, E)PI\ TH\N DIA/GNWSIN AU)TW=N E)/RXESQAI DEI= PRW=TON. TINE\S ME\N OU)=N AU)TW=N EI)SIN A)KRIBEI=S, TINE\S DE\ OU)K A)KRIBEI=S O)/NTES METAPI/-PTOUSIN EI)S TOU\S E)PI\ SH/YEI: OU(/TW GA\R KAI\ LOU=SAI KAI\ QRE/YAI KALW=S KAI\ MH\ LOU=SAI PA/LIN, O(/TE MH\ O)RQW=S DUNHQEI/HMEN."""
In [3]: r = Replacer()
In [4]: r.beta_code(BETA_EXAMPLE)
Out[4]: 'ὅπως οὖν μὴ ταὐτὸ πάθωμεν ἐκείνοις, ἐπὶ τὴν διάγνωσιν αὐτῶν ἔρχεσθαι δεῖ πρῶτον. τινὲς μὲν οὖν αὐτῶν εἰσιν ἀκριβεῖς, τινὲς δὲ οὐκ ἀκριβεῖς ὄντες μεταπίπτουσιν εἰς τοὺς ἐπὶ σήψει· οὕτω γὰρ καὶ λοῦσαι καὶ θρέψαι καλῶς καὶ μὴ λοῦσαι πάλιν, ὅτε μὴ ὀρθῶς δυνηθείημεν.'
The beta code converter can also handle lowercase notation:
In [5]: BETA_EXAMPLE_2 = r"""me/xri me\n w)/n tou/tou a(rpaga/s mou/nas ei)=nai par' a)llh/lwn, to\ de\ a)po\ tou/tou *(/ellhnas dh\ mega/lws ai)ti/ous gene/sqai: prote/rous ga\r a)/rcai strateu/esqai e)s th\n *)asi/hn h)\ sfe/as e)s th\n *eu)rw/phn. """
In [6]: r.beta_code(BETA_EXAMPLE_2)
Out[6]: 'μέχρι μὲν ὤν τούτου ἁρπαγάς μούνας εἶναι παρ’ ἀλλήλων, τὸ δὲ ἀπὸ τούτου Ἕλληνας δὴ μεγάλως αἰτίους γενέσθαι· προτέρους γὰρ ἄρξαι στρατεύεσθαι ἐς τὴν Ἀσίην ἢ σφέας ἐς τὴν Εὐρώπην.'
Converting TLG texts with TLGU¶
The TLGU is excellent C language software for converting the TLG and PHI corpora into human-readable Unicode. The CLTK has an automated downloader and installer, as well as a wrapper which facilitates its use. When TLGU()
is instantiated, it checks the local OS for a functioning version of the software. If one is not found, it is downloaded and installed after the user’s confirmation.
Most users will want to do a bulk conversion of the entirety of a corpus without any text markup (such as chapter or line numbers). Note that you must import a local corpus before converting it.
In [1]: from cltk.corpus.greek.tlgu import TLGU
In [2]: t = TLGU()
In [3]: t.convert_corpus(corpus='tlg') # writes to: ~/cltk_data/greek/text/tlg/plaintext/
For the PHI7, you may declare whether you want the corpus to be written to the greek
or latin
directories. By default, it writes to greek
.
In [5]: t.convert_corpus(corpus='phi7') # ~/cltk_data/greek/text/phi7/plaintext/
In [6]: t.convert_corpus(corpus='phi7', latin=True) # ~/cltk_data/latin/text/phi7/plaintext/
The above commands take each author file and convert it into a new author file. But the software has a useful option to divide each author file into a new file for each work it contains. Thus, Homer’s file, TLG0012.TXT
, becomes TLG0012.TXT-001.txt
, TLG0012.TXT-002.txt
, and TLG0012.TXT-003.txt
. To achieve this, use the following command for the TLG
:
In [7]: t.divide_works('tlg') # ~/cltk_data/greek/text/tlg/individual_works/
You may also convert individual files, with options for how the conversion happens.
In [3]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt')
In [4]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', markup='full')
In [5]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', rm_newlines=True)
In [6]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', divide_works=True)
For convert()
, plain arguments may be sent directly to the TLGU
, as well, via extra_args
:
In [7]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', extra_args=['p', 'B'])
Even after plaintext conversion, the TLG will still need some cleanup. The CLTK contains some code for post-TLGU cleanup.
You may read about these arguments in the TLGU manual.
Once these files are created, see TLG Indices below for accessing these newly created files.
Corpus Readers¶
Most users will want to access words, sentences, paragraphs and even whole documents via a CorpusReader object. All Corpus contributors should provide a suitable reader. There is one for Perseus Greek, and others will be made available. The CorpusReader methods: paras()
returns paragraphs, if possible; words()
returns a generator of words; sents()
returns a generator of sentences; docs()
returns a generator of Python dictionary objects representing each document.
In [1]: from cltk.corpus.readers import get_corpus_reader
...: reader = get_corpus_reader( corpus_name = 'greek_text_perseus', language = 'greek')
...: # get all the docs
...: docs = list(reader.docs())
...: len(docs)
...:
Out[1]: 222
In [2]: # or set just one
...: reader._fileids = ['plato__apology__grc.json']
In [3]: # get all the sentences
In [4]: sentences = list(reader.sents())
...: len(sentences)
...:
Out[4]: 4983
In [5]: # Or just one
In [6]: sentences[0]
Out[6]: '\n \n \n \n \n ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ\n τῶν ἐμῶν κατηγόρων, οὐκ οἶδα· ἐγὼ δʼ οὖν καὶ αὐτὸς ὑπʼ αὐτῶν ὀλίγου ἐμαυτοῦ\n ἐπελαθόμην, οὕτω πιθανῶς ἔλεγον.'
In [7]: # access an individual doc as a dictionary of dictionaries
...: doc = list(reader.docs())[0]
...: doc.keys()
...:
Out[7]: dict_keys(['language', 'englishTitle', 'original-urn', 'author', 'urn', 'text', 'source', 'originalTitle', 'edition', 'sourceLink', 'meta', 'filename'])
Information Retrieval¶
See Multilingual Information Retrieval for Greek–specific search options.
Lemmatization¶
Tip
For ambiguous forms, which could belong to several headwords, the current lemmatizer chooses the more commonly occurring headword (code here). For any errors that you spot, please open a ticket.
The CLTK’s lemmatizer is based on a key-value store, whose code is available at the CLTK’s Latin lemma/POS repository.
The lemmatizer offers several input and output options. For text input, it can take a string or a list of tokens. Here is an example of the lemmatizer taking a string:
In [1]: from cltk.stem.lemma import LemmaReplacer
In [2]: sentence = 'τὰ γὰρ πρὸ αὐτῶν καὶ τὰ ἔτι παλαίτερα σαφῶς μὲν εὑρεῖν διὰ χρόνου πλῆθος ἀδύνατα ἦν'
In [3]: from cltk.corpus.utils.formatter import cltk_normalize
In [4]: sentence = cltk_normalize(sentence) # can help when using certain texts
In [5]: lemmatizer = LemmaReplacer('greek')
In [6]: lemmatizer.lemmatize(sentence)
Out[6]:
['τὰ',
'γὰρ',
'πρὸ',
'αὐτός',
'καὶ',
'τὰ',
'ἔτι',
'παλαιός',
'σαφής',
'μὲν',
'εὑρίσκω',
'διὰ',
'χρόνος',
'πλῆθος',
'ἀδύνατος',
'εἰμί']
And here taking a list:
In [5]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'])
Out[5]: ['χρόνος', 'πλῆθος', 'ἀδύνατος', 'εἰμί']
The lemmatizer takes several optional arguments for controlling output: return_raw=True
and return_string=True
. return_raw
returns the original inflection along with its headword:
In [6]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'], return_raw=True)
Out[6]: ['χρόνου/χρόνος', 'πλῆθος/πλῆθος', 'ἀδύνατα/ἀδύνατος', 'ἦν/εἰμί']
And return_string
wraps the list in ' '.join()
:
In [7]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'], return_string=True)
Out[7]: 'χρόνος πλῆθος ἀδύνατος εἰμί'
These two arguments can be combined, as well.
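For instance, combining them should yield a single space-joined string of inflection/headword pairs. This is a sketch of the expected output, assuming the two options compose exactly as described above:
In [8]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'], return_raw=True, return_string=True)
Out[8]: 'χρόνου/χρόνος πλῆθος/πλῆθος ἀδύνατα/ἀδύνατος ἦν/εἰμί'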
Lemmatization, backoff method¶
The CLTK offers a series of lemmatizers that can be combined in a backoff chain, i.e. if one lemmatizer is unable to return a headword for a token, the token can be passed on to another lemmatizer until either a headword is returned or the sequence ends.
There is a generic version of the backoff Greek lemmatizer which requires data from the CLTK Greek models, found at https://github.com/cltk/greek_models_cltk/tree/master/lemmata/backoff. The lemmatizer expects this model to be stored in a folder called cltk_data in the user’s home directory.
To use the generic version of the backoff Greek Lemmatizer:
In [1]: from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer
In [2]: lemmatizer = BackoffGreekLemmatizer()
In [3]: tokens = 'κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος'.split()
In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('κατέβην', 'καταβαίνω'), ('χθὲς', 'χθές'), ('εἰς', 'εἰς'), ('Πειραιᾶ', 'Πειραιᾶ'), ('μετὰ', 'μετά'), ('Γλαύκωνος', 'Γλαύκων'), ('τοῦ', 'ὁ'), ('Ἀρίστωνος', 'Ἀρίστων')]
NB: The backoff chain for this lemmatizer is defined as follows:
1. a dictionary-based lemmatizer with high-frequency, unambiguous forms;
2. a training-data-based lemmatizer based on sentences from the Perseus Latin Dependency Treebanks (https://perseusdl.github.io/treebank_data/);
3. a regular-expression-based lemmatizer transforming unambiguous endings (currently very limited);
4. a dictionary-based lemmatizer with the complete set of Morpheus lemmas;
5. an ‘identity’ lemmatizer returning the token as the lemma.
Each of these sub-lemmatizers is explained in the documents for “Multilingual”.
Named Entity Recognition¶
A simple interface to a list of Greek proper nouns is available (see the repo for how the list was created). By default tag_ner()
takes a string input and returns a list of tuples. However, it can also take pre-tokenized input and return a string.
In [1]: from cltk.tag import ner
In [2]: text_str = 'τὰ Σίλαριν Σιννᾶν Κάππαρος Πρωτογενείας Διονυσιάδες τὴν'
In [3]: ner.tag_ner('greek', input_text=text_str, output_type=list)
Out[3]:
[('τὰ',),
('Σίλαριν', 'Entity'),
('Σιννᾶν', 'Entity'),
('Κάππαρος', 'Entity'),
('Πρωτογενείας', 'Entity'),
('Διονυσιάδες', 'Entity'),
('τὴν',)]
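To get string output instead, pass output_type=str. This is a sketch of the expected form of the result; the exact tagging format, here assumed to append '/Entity' to recognized tokens, should be checked against your output:
In [4]: ner.tag_ner('greek', input_text=text_str, output_type=str)
Out[4]: 'τὰ Σίλαριν/Entity Σιννᾶν/Entity Κάππαρος/Entity Πρωτογενείας/Entity Διονυσιάδες/Entity τὴν'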
Normalization¶
Normalizing polytonic Greek is a problem that has been mostly solved; however, issues still arise when working with legacy applications. We recommend normalizing Greek vowels in order to ensure reliable string matching.
One type of normalization issue comes from tonos accents (intended for Modern Greek) being used instead of oxia accents (for Ancient Greek). Here is an example of two characters that appear identical but are in fact distinct:
In [1]: from cltk.corpus.utils.formatter import tonos_oxia_converter
In [2]: char_tonos = "ά" # with tonos, for Modern Greek
In [3]: char_oxia = "ά" # with oxia, for Ancient Greek
In [4]: char_tonos == char_oxia
Out[4]: False
In [5]: ord(char_tonos)
Out[5]: 940
In [6]: ord(char_oxia)
Out[6]: 8049
In [7]: char_oxia == tonos_oxia_converter(char_tonos)
Out[7]: True
If for any reason you want to go from oxia to tonos, just add the reverse=True
parameter:
In [8]: char_tonos == tonos_oxia_converter(char_oxia, reverse=True)
Out[8]: True
Another approach to normalization is to use the Python language’s built-in normalize()
. The CLTK provides a wrapper for this as a convenience. Here’s an example of its use in “compatibility” mode (NFKC
):
In [1]: from cltk.corpus.utils.formatter import cltk_normalize
In [2]: tonos = "ά"
In [3]: oxia = "ά"
In [4]: tonos == oxia
Out[4]: False
In [5]: tonos == cltk_normalize(oxia)
Out[5]: True
One can turn off compatibility with:
In [6]: tonos == cltk_normalize(oxia, compatibility=False)
Out[6]: True
For more on normalize()
see the Python Unicode docs.
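For reference, the same result can be reproduced directly with the standard library’s unicodedata.normalize(), since the oxia code point (U+1F71) canonically decomposes to the tonos one (U+03AC). A minimal sketch using only the standard library:
In [7]: from unicodedata import normalize
In [8]: tonos, oxia = '\u03ac', '\u1f71'  # alpha with tonos vs. alpha with oxia (cf. ord() values above)
In [9]: tonos == oxia
Out[9]: False
In [10]: tonos == normalize('NFKC', oxia)
Out[10]: True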
POS tagging¶
These taggers were built with the assistance of the NLTK. The backoff tagger is Bayesian and the TnT tagger is an HMM. To obtain the models, first import the greek_models_cltk
corpus.
1–2–3–gram backoff tagger¶
In [1]: from cltk.tag.pos import POSTag
In [2]: tagger = POSTag('greek')
In [3]: tagger.tag_ngram_123_backoff('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
Out[3]:
[('θεοὺς', 'N-P---MA-'),
('μὲν', 'G--------'),
('αἰτῶ', 'V1SPIA---'),
('τῶνδ', 'P-P---MG-'),
('᾽', None),
('ἀπαλλαγὴν', 'N-S---FA-'),
('πόνων', 'N-P---MG-'),
('φρουρᾶς', 'N-S---FG-'),
('ἐτείας', 'A-S---FG-'),
('μῆκος', 'N-S---NA-')]
TnT tagger¶
In [4]: tagger.tag_tnt('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
Out[4]:
[('θεοὺς', 'N-P---MA-'),
('μὲν', 'G--------'),
('αἰτῶ', 'V1SPIA---'),
('τῶνδ', 'P-P---NG-'),
('᾽', 'Unk'),
('ἀπαλλαγὴν', 'N-S---FA-'),
('πόνων', 'N-P---MG-'),
('φρουρᾶς', 'N-S---FG-'),
('ἐτείας', 'A-S---FG-'),
('μῆκος', 'N-S---NA-')]
CRF tagger¶
Warning
This tagger’s accuracy has not yet been tested.
We use the NLTK’s CRF tagger. For information on it, see the NLTK docs.
In [5]: tagger.tag_crf('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
Out[5]:
[('θεοὺς', 'N-P---MA-'),
('μὲν', 'G--------'),
('αἰτῶ', 'V1SPIA---'),
('τῶνδ', 'P-P---NG-'),
('᾽', 'A-S---FA-'),
('ἀπαλλαγὴν', 'N-S---FA-'),
('πόνων', 'N-P---MG-'),
('φρουρᾶς', 'A-S---FG-'),
('ἐτείας', 'N-S---FG-'),
('μῆκος', 'N-S---NA-')]
Prosody Scanning¶
There is a prosody scanner for scanning rhythms in Greek texts. It returns a list of strings of long and short marks for each sentence. Note that the last syllable of each sentence string is marked with an anceps so that specific clausulae are delineated.
In [1]: from cltk.prosody.greek.scanner import Scansion
In [2]: scanner = Scansion()
In [3]: scanner.scan_text('νέος μὲν καὶ ἄπειρος, δικῶν ἔγωγε ἔτι. μὲν καὶ ἄπειρος.')
Out[3]: ['˘¯¯¯˘¯¯˘¯˘¯˘˘x', '¯¯˘¯x']
Sentence Tokenization¶
Sentence tokenization for Ancient Greek is available using (by default) a regular-expression based tokenizer. To tokenize a Greek text by sentences…
In [1]: from cltk.tokenize.greek.sentence import SentenceTokenizer
In [2]: sent_tokenizer = SentenceTokenizer()
In [3]: untokenized_text = """ὅλως δ’ ἀντεχόμενοί τινες, ὡς οἴονται, δικαίου τινός (ὁ γὰρ νόμος δίκαιόν τἰ τὴν κατὰ πόλεμον δουλείαν τιθέασι δικαίαν, ἅμα δ’ οὔ φασιν· τήν τε γὰρ ἀρχὴν ἐνδέχεται μὴ δικαίαν εἶναι τῶν πολέμων, καὶ τὸν ἀνάξιον δουλεύειν οὐδαμῶς ἂν φαίη τις δοῦλον εἶναι· εἰ δὲ μή, συμβήσεται τοὺς εὐγενεστάτους εἶναι δοκοῦντας δούλους εἶναι καὶ ἐκ δούλων, ἐὰν συμβῇ πραθῆναι ληφθέντας."""
In [4]: sent_tokenizer.tokenize(untokenized_text)
Out[4]: ['ὅλως δ’ ἀντεχόμενοί τινες, ὡς οἴονται, δικαίου τινός (ὁ γὰρ νόμος δίκαιόν τἰ τὴν κατὰ πόλεμον δουλείαν τιθέασι δικαίαν, ἅμα δ’ οὔ φασιν·', 'τήν τε γὰρ ἀρχὴν ἐνδέχεται μὴ δικαίαν εἶναι τῶν πολέμων, καὶ τὸν ἀνάξιον δουλεύειν οὐδαμῶς ἂν φαίη τις δοῦλον εἶναι·', 'εἰ δὲ μή, συμβήσεται τοὺς εὐγενεστάτους εἶναι δοκοῦντας δούλους εἶναι καὶ ἐκ δούλων, ἐὰν συμβῇ πραθῆναι ληφθέντας.']
The sentence tokenizer takes a string input into tokenize_sentences()
and returns a list of strings. For more on the tokenizer, or to make your own, see the CLTK’s Greek sentence tokenizer training set repository.
There is also an experimental Punkt tokenizer (https://www.nltk.org/_modules/nltk/tokenize/punkt.html) trained on the Greek Tesserae texts. The model for this tokenizer can be found in the CLTK corpora under greek_model_cltk/tokenizers/sentence/greek_punkt.
In [5]: from cltk.tokenize.greek.sentence import SentenceTokenizer
In [6]: sent_tokenizer = SentenceTokenizer(tokenizer='punkt')
etc.
NB: The old method for sentence tokenization, i.e. TokenizeSentence, is still available, but will soon be replaced by the method above.
In [7]: from cltk.tokenize.sentence import TokenizeSentence
In [8]: tokenizer = TokenizeSentence('greek')
etc.
Stopword Filtering¶
To use the CLTK’s built-in stopwords list:
In [1]: from nltk.tokenize.punkt import PunktLanguageVars
In [2]: from cltk.stop.greek.stops import STOPS_LIST
In [3]: sentence = 'Ἅρπαγος δὲ καταστρεψάμενος Ἰωνίην ἐποιέετο στρατηίην ἐπὶ Κᾶρας καὶ Καυνίους καὶ Λυκίους, ἅμα ἀγόμενος καὶ Ἴωνας καὶ Αἰολέας.'
In [4]: p = PunktLanguageVars()
In [5]: tokens = p.word_tokenize(sentence.lower())
In [6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]:
['ἅρπαγος',
'καταστρεψάμενος',
'ἰωνίην',
'ἐποιέετο',
'στρατηίην',
'κᾶρας',
'καυνίους',
'λυκίους',
',',
'ἅμα',
'ἀγόμενος',
'ἴωνας',
'αἰολέας.']
Swadesh¶
The corpus module has a class for generating a Swadesh list for Greek.
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('gr')
In [3]: swadesh.words()[:10]
Out[3]: ['ἐγώ', 'σύ', 'αὐτός, οὗ, ὅς, ὁ, οὗτος', 'ἡμεῖς', 'ὑμεῖς', 'αὐτοί', 'ὅδε', 'ἐκεῖνος', 'ἔνθα, ἐνθάδε, ἐνταῦθα', 'ἐκεῖ']
TEI XML¶
There are several rudimentary corpus converters for the “First 1K Years of Greek” project (download the corpus 'greek_text_first1kgreek'
). Both write files to ~/cltk_data/greek/text/greek_text_first1kgreek_plaintext.
This one is built upon the MyCapytain
library (pip install lxml MyCapytain
), which allows very precise chunking of TEI XML. The following function only preserves numbers:
In [1]: from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text_capitains
In [2]: onekgreek_tei_xml_to_text_capitains()
For the following, install the BeautifulSoup
library (pip install bs4
). Note that this will just dump all text not contained within a node’s bracket (including sometimes metadata).
In [1]: from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text
In [2]: onekgreek_tei_xml_to_text()
Text Cleanup¶
Intended for use on the TLG after processing by TLGU()
.
In [1]: from cltk.corpus.utils.formatter import tlg_plaintext_cleanup
In [2]: import os
In [3]: file = os.path.expanduser('~/cltk_data/greek/text/tlg/individual_works/TLG0035.TXT-001.txt')
In [4]: with open(file) as f:
...: r = f.read()
...:
In [5]: r[:500]
Out[5]: "\n{ΜΟΣΧΟΥ ΕΡΩΣ ΔΡΑΠΕΤΗΣ} \n Ἁ Κύπρις τὸν Ἔρωτα τὸν υἱέα μακρὸν ἐβώστρει: \n‘ὅστις ἐνὶ τριόδοισι πλανώμενον εἶδεν Ἔρωτα, \nδραπετίδας ἐμός ἐστιν: ὁ μανύσας γέρας ἑξεῖ. \nμισθός τοι τὸ φίλημα τὸ Κύπριδος: ἢν δ' ἀγάγῃς νιν, \nοὐ γυμνὸν τὸ φίλημα, τὺ δ', ὦ ξένε, καὶ πλέον ἑξεῖς. \nἔστι δ' ὁ παῖς περίσαμος: ἐν εἴκοσι πᾶσι μάθοις νιν. \nχρῶτα μὲν οὐ λευκὸς πυρὶ δ' εἴκελος: ὄμματα δ' αὐτῷ \nδριμύλα καὶ φλογόεντα: κακαὶ φρένες, ἁδὺ λάλημα: \nοὐ γὰρ ἴσον νοέει καὶ φθέγγεται: ὡς μέλι φωνά, \nὡς δὲ χολὰ νόος ἐστίν: "
In [7]: tlg_plaintext_cleanup(r, rm_punctuation=True, rm_periods=False)[:500]
Out[7]: ' Ἁ Κύπρις τὸν Ἔρωτα τὸν υἱέα μακρὸν ἐβώστρει ὅστις ἐνὶ τριόδοισι πλανώμενον εἶδεν Ἔρωτα δραπετίδας ἐμός ἐστιν ὁ μανύσας γέρας ἑξεῖ. μισθός τοι τὸ φίλημα τὸ Κύπριδος ἢν δ ἀγάγῃς νιν οὐ γυμνὸν τὸ φίλημα τὺ δ ὦ ξένε καὶ πλέον ἑξεῖς. ἔστι δ ὁ παῖς περίσαμος ἐν εἴκοσι πᾶσι μάθοις νιν. χρῶτα μὲν οὐ λευκὸς πυρὶ δ εἴκελος ὄμματα δ αὐτῷ δριμύλα καὶ φλογόεντα κακαὶ φρένες ἁδὺ λάλημα οὐ γὰρ ἴσον νοέει καὶ φθέγγεται ὡς μέλι φωνά ὡς δὲ χολὰ νόος ἐστίν ἀνάμερος ἠπεροπευτάς οὐδὲν ἀλαθεύων δόλιον βρέφος ἄγρια π'
TLG Indices¶
The TLG comes with some old, difficult-to-parse index files which have been made available as Python dictionaries (in cltk/corpus/greek/tlg
). Below are some functions to make accessing these easy. The outputs are variously a dict
of an index or set
if the function returns unique author ids.
Tip
Python sets are like lists, but contain only unique values. Multiple sets can be conveniently combined (see the Python docs on sets); a combined example appears after the index listing below.
In [1]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_female_authors
In [2]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_index
In [3]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithets
In [4]: from cltk.corpus.greek.tlg.parse_tlg_indices import select_authors_by_epithet
In [5]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_of_author
In [6]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_geo_index
In [7]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_geographies
In [8]: from cltk.corpus.greek.tlg.parse_tlg_indices import select_authors_by_geo
In [9]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_geo_of_author
In [10]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_lists
In [11]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_id_author
In [12]: from cltk.corpus.greek.tlg.parse_tlg_indices import select_id_by_name
In [13]: get_female_authors()
Out[13]:
{'0009',
'0051',
'0054',
…}
In [14]: get_epithet_index()
Out[14]:
{'Lexicographi': {'3136', '4040', '4085', '9003'},
'Lyrici/-ae': {'0009',
'0033',
'0199',
…}}
In [15]: get_epithets()
Out[15]:
['Alchemistae',
'Apologetici',
'Astrologici',
…]
In [16]: select_authors_by_epithet('Tactici')
Out[16]: {'0058', '0546', '0556', '0648', '3075', '3181'}
In [17]: get_epithet_of_author('0016')
Out[17]: 'Historici/-ae'
In [18]: get_geo_index()
Out[18]:
{'Alchemistae': {'1016',
'2019',
'2140',
'2181',
…}}
In [19]: get_geographies()
Out[19]:
['Abdera',
'Adramytteum',
'Aegae',
…]
In [20]: select_authors_by_geo('Thmuis')
Out[20]: {'2966'}
In [21]: get_geo_of_author('0216')
Out[21]: 'Aetolia'
In [22]: get_lists()
Out[22]:
{'Lists pertaining to all works in Canon (by TLG number)': {'LIST3CLA.BIN': 'Literary classifications of works',
'LIST3CLX.BIN': 'Literary classifications of works (with x-refs)',
'LIST3DAT.BIN': 'Chronological classifications of authors',
…}}
In [23]: get_id_author()
Out[23]:
{'1139': 'Anonymi Historici (FGrH)',
'4037': 'Anonymi Paradoxographi',
'0616': 'Polyaenus Rhet.',
…}
In [28]: select_id_by_name('hom')
Out[28]:
[('0012', 'Homerus Epic., Homer'),
('1252', 'Certamen Homeri Et Hesiodi'),
('1805', 'Vitae Homeri'),
('5026', 'Scholia In Homerum'),
('1375', 'Evangelium Thomae'),
('2038', 'Acta Thomae'),
('0013', 'Hymni Homerici, Homeric Hymns'),
('0253', '[Homerus] [Epic.]'),
('1802', 'Homerica'),
('1220', 'Batrachomyomachia'),
('9023', 'Thomas Magister Philol.')]
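As the Tip above notes, the sets returned by these functions can be combined with ordinary Python set operations. For example, intersecting the female authors with the authors carrying the 'Lyrici/-ae' epithet, both shown (abridged) above, is a minimal sketch; the full result set will depend on the index data:
In [29]: get_female_authors() & select_authors_by_epithet('Lyrici/-ae')
Out[29]: {'0009', …}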
In addition to these indices there are several helper functions which will build filepaths for your particular computer. Note that you will need to have run convert_corpus(corpus='tlg')
and divide_works('tlg')
from the TLGU()
class, respectively, for the following two functions.
In [1]: from cltk.corpus.utils.formatter import assemble_tlg_author_filepaths
In [2]: assemble_tlg_author_filepaths()
Out[2]:
['/Users/kyle/cltk_data/greek/text/tlg/plaintext/TLG1167.TXT',
'/Users/kyle/cltk_data/greek/text/tlg/plaintext/TLG1584.TXT',
'/Users/kyle/cltk_data/greek/text/tlg/plaintext/TLG1196.TXT',
'/Users/kyle/cltk_data/greek/text/tlg/plaintext/TLG1201.TXT',
...]
In [3]: from cltk.corpus.utils.formatter import assemble_tlg_works_filepaths
In [4]: assemble_tlg_works_filepaths()
Out[4]:
['/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG1585.TXT-001.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG0038.TXT-001.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG1607.TXT-002.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG0468.TXT-001.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG0468.TXT-002.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-001.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-002.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-003.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-004.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-005.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-006.txt',
'/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-007.txt',
...]
These two functions are useful when, for example, you need to process all authors of the TLG corpus, all works of the corpus, or all works of one particular author.
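For instance, a simple processing loop over every author file might look like this (a minimal sketch in plain Python, using the function shown above):
In [5]: for filepath in assemble_tlg_author_filepaths():
   ...:     with open(filepath) as f:
   ...:         text = f.read()
   ...:     # process `text` here, e.g. clean it with tlg_plaintext_cleanup()
   ...: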
Transliteration¶
The CLTK provides IPA phonetic transliteration for the Greek language. Currently, the only available dialect is Attic as reconstructed by Philomen Probert (taken from A Companion to the Ancient Greek Language, 85-103). Example:
In [1]: from cltk.phonology.greek.transcription import Transcriber
In [2]: transcriber = Transcriber(dialect="Attic", reconstruction="Probert")
In [3]: transcriber.transcribe("Διόθεν καὶ δισκήπτρου τιμῆς ὀχυρὸν ζεῦγος Ἀτρειδᾶν στόλον Ἀργείων")
Out[3]: '[di.ó.tʰen kɑj dis.kɛ́ːp.trọː ti.mɛ̂ːs o.kʰy.ron zdêw.gos ɑ.trẹː.dɑ̂n stó.lon ɑr.gẹ́ː.ɔːn]'
Word Tokenization¶
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: word_tokenizer = WordTokenizer('greek')
In [3]: text = 'Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων,'
In [4]: word_tokenizer.tokenize(text)
Out[4]: ['Θουκυδίδης', 'Ἀθηναῖος', 'ξυνέγραψε', 'τὸν', 'πόλεμον', 'τῶν', 'Πελοποννησίων', 'καὶ', 'Ἀθηναίων', ',']
Word2Vec¶
Note
The Word2Vec models have not been fully vetted and are offered in the spirit of a beta. The CLTK’s API for it will be revised.
Note
You will need to install Gensim to use these features.
Word2Vec is a vector space model especially powerful for comparing words in relation to each other. For instance, it is commonly used to discover words which appear in similar contexts (something akin to synonyms; think of them as lexical clusters).
The CLTK repository contains pre-trained Word2Vec models for Greek (import as greek_word2vec_cltk
), one lemmatized and the other not. They were trained on the TLG corpus. To train your own, see the README at the Greek Word2Vec repository.
One of the most common uses of Word2Vec is as a keyword expander. Keyword expansion is the taking of a query term, finding synonyms, and searching for those, too. Here’s an example of its use:
In [1]: from cltk.ir.query import search_corpus
In [2]: for x in search_corpus('πνεῦμα', 'tlg', context='sentence', case_insensitive=True, expand_keyword=True, threshold=0.5):
   ...:     print(x)
   ...:
The following similar terms will be added to the 'πνεῦμα' query: '['γεννώμενον', 'ἔντερον', 'βάπτισμα', 'εὐαγγέλιον', 'δέρμα', 'ἐπιῤῥέον', 'ἔμβρυον', 'ϲῶμα', 'σῶμα', 'συγγενὲς']'.
('Lucius Annaeus Cornutus Phil.', "μυθολογεῖται δ' ὅτι διασπασθεὶς ὑπὸ τῶν Τιτά-\nνων συνετέθη πάλιν ὑπὸ τῆς Ῥέας, αἰνιττομένων τῶν \nπαραδόντων τὸν μῦθον ὅτι οἱ γεωργοί, θρέμματα γῆς \nὄντες, συνέχεαν τοὺς βότρυς καὶ τοῦ ἐν αὐτοῖς Διονύσου \nτὰ μέρη ἐχώρισαν ἀπ' ἀλλήλων, ἃ δὴ πάλιν ἡ εἰς ταὐτὸ \nσύρρυσις τοῦ γλεύκους συνήγαγε καὶ ἓν *σῶμα* ἐξ αὐτῶν \nἀπετέλεσε.")
('Metopus Phil.', '\nκαὶ ταὶ νόσοι δὲ γίνονται τῶ σώματος <τῷ> θερμότερον ἢ κρυμωδέσ-\nτερον γίνεσθαι τὸ *σῶμα*.')
…
threshold
is the closeness of the query term to its neighboring words. Note that when expand_keyword=True
, the search term will be stripped of any regular expression syntax.
The keyword expander leverages get_sims()
(which in turn leverages functionality of the Gensim package) to find similar terms. Some examples of it in action:
In [3]: from cltk.vector.word2vec import get_sims
In [4]: get_sims('βασιλεύς', 'greek', lemmatized=False, threshold=0.5)
"word 'βασιλεύς' not in vocabulary"
The following terms in the Word2Vec model you may be looking for: '['βασκαίνων', 'βασκανίας', 'βασιλάκιος', 'βασιλίδων', 'βασανισθέντα', 'βασιλήϊον', 'βασιλευόμενα', 'βασανιστηρίων', … ]'.
In [36]: get_sims('τυραννος', 'greek', lemmatized=True, threshold=0.7)
"word 'τυραννος' not in vocabulary"
The following terms in the Word2Vec model you may be looking for: '['τυραννίσιν', 'τυρόριζαν', 'τυρεύοντες', 'τυρρηνοὶ', 'τυραννεύοντα', 'τυροὶ', 'τυραννικά', 'τυρσηνίαν', 'τυρώ', 'τυρσηνίας', … ]'.
To add and subtract vectors, you need to load the models yourself with Gensim.
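A sketch of what that might look like with Gensim, assuming the greek_word2vec_cltk corpus has already been imported; the model filename under ~/cltk_data is an assumption and should be checked on disk:
In [1]: import os
In [2]: from gensim.models import Word2Vec
In [3]: model_path = os.path.expanduser('~/cltk_data/greek/model/greek_word2vec_cltk/tlg.model')  # hypothetical filename; check your local directory
In [4]: model = Word2Vec.load(model_path)
In [5]: model.wv.most_similar(positive=['βασιλεύς', 'γυνή'], negative=['ἀνήρ'], topn=3)  # vector arithmetic: king + woman - man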
Gujarati¶
Gujarati is an Indo-Aryan language native to the Indian state of Gujarat. It is part of the greater Indo-European language family. Gujarati is descended from Old Gujarati (circa 1100–1500 AD). In India, it is the official language in the state of Gujarat, as well as an official language in the union territories of Daman and Diu and Dadra and Nagar Haveli. Gujarati is spoken by 4.5% of the Indian population, which amounts to 46 million speakers in India. Altogether, there are about 50 million speakers of Gujarati worldwide. (Source: Wikipedia)
Alphabet¶
The Gujarati alphabet is defined in cltk/corpus/gujarati/alphabet.py.
There are 13 vowels in Gujarati. Like Hindi and other similar languages, vowels in Gujarati have an independent form and a matra form used to modify consonants in word formation.
VOWELS = [ 'અ' , 'આ' , 'ઇ' , 'ઈ' , 'ઉ' , 'ઊ' , 'ઋ' , 'એ' , 'ઐ' , 'ઓ' , 'ઔ' , 'અં' , 'અઃ' ]
The International Alphabet of Sanskrit Transliteration (I.A.S.T.) is a transliteration scheme that allows the lossless romanization of Indic scripts as employed by Sanskrit and related Indic languages. IAST makes it possible for the reader to read the Indic text unambiguously, exactly as if it were in the original Indic script.
IAST_VOWELS_REPRESENTATION = ['a', 'ā', 'i', 'ī', 'u', 'ū','ṛ','e','ai','o','au','ṁ','ḥ']
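Because the two lists above are parallel, each vowel can be paired with its IAST romanization, for example with Python’s zip. A minimal sketch, assuming the names shown above are importable from the alphabet module:
In[1]: from cltk.corpus.gujarati.alphabet import VOWELS, IAST_VOWELS_REPRESENTATION
In[2]: dict(zip(VOWELS, IAST_VOWELS_REPRESENTATION))['આ']
Out[2]: 'ā'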
There are 33 consonants. They are grouped in accordance with the traditional Sanskrit scheme of arrangement.
1. Velar: A velar consonant is a consonant that is pronounced with the back part of the tongue against the soft palate, also known as the velum, which is the back part of the roof of the mouth (e.g., k).
2. Palatal: A palatal consonant is a consonant that is pronounced with the body (the middle part) of the tongue against the hard palate (which is the middle part of the roof of the mouth) (e.g., j).
3. Retroflex: A retroflex consonant is a coronal consonant where the tongue has a flat, concave, or even curled shape, and is articulated between the alveolar ridge and the hard palate (e.g., English t).
4. Dental: A dental consonant is a consonant articulated with the tongue against the upper teeth (e.g., Spanish t).
5. Labial: Labial consonants are articulated or made with the lips (e.g., p).
# Digits
In[1]: from cltk.corpus.gujarati.alphabet import DIGITS
In[2]: print(DIGITS)
Out[2]: ['૦','૧','૨','૩','૪','૫','૬','૭','૮','૯','૧૦']
# Velar consonants
In[3]: from cltk.corpus.gujarati.alphabet import VELAR_CONSONANTS
In[4]: print(VELAR_CONSONANTS)
Out[4]: [ 'ક' , 'ખ' , 'ગ' , 'ઘ' , 'ઙ' ]
# Palatal consonants
In[5]: from cltk.corpus.gujarati.alphabet import PALATAL_CONSONANTS
In[6]: print(PALATAL_CONSONANTS)
Out[6]: ['ચ' , 'છ' , 'જ' , 'ઝ' , 'ઞ' ]
# Retroflex consonants
In[7]: from cltk.corpus.gujarati.alphabet import RETROFLEX_CONSONANTS
In[8]: print(RETROFLEX_CONSONANTS)
Out[8]: ['ટ' , 'ઠ' , 'ડ' , 'ઢ' , 'ણ']
# Dental consonants
In[9]: from cltk.corpus.gujarati.alphabet import DENTAL_CONSONANTS
In[10]: print(DENTAL_CONSONANTS)
Out[10]: ['ત' , 'થ' , 'દ' , 'ધ' , 'ન' ]
# Labial consonants
In[11]: from cltk.corpus.gujarati.alphabet import LABIAL_CONSONANTS
In[12]: print(LABIAL_CONSONANTS)
Out[12]: ['પ' , 'ફ' , 'બ' , 'ભ' , 'મ']
There are 4 sonorant consonants in Gujarati:
# Sonorant consonants
In[1]: from cltk.corpus.gujarati.alphabet import SONORANT_CONSONANTS
In[2]: print(SONORANT_CONSONANTS)
Out[2]: ['ય' , 'ર' , 'લ' , 'વ']
There are 3 sibilants in Gujarati:
# Sibilant consonants
In[1]: from cltk.corpus.gujarati.alphabet import SIBILANT_CONSONANTS
In[2]: print(SIBILANT_CONSONANTS)
Out[2]: ['શ' , 'ષ' , 'સ']
There is one guttural consonant also:
# Guttural consonant
In[1]: from cltk.corpus.gujarati.alphabet import GUTTURAL_CONSONANT
In[2]: print(GUTTURAL_CONSONANT)
Out[2]: ['હ']
There are also three additional consonants in Gujarati:
# Additional consonants
In[1]: from cltk.corpus.gujarati.alphabet import ADDITIONAL_CONSONANTS
In[2]: print(ADDITIONAL_CONSONANTS)
Out[2]: ['ળ' , 'ક્ષ' , 'જ્ઞ']
Hebrew¶
Hebrew is a language native to Israel, spoken by over 9 million people worldwide, of whom over 5 million are in Israel. Historically, it is regarded as the language of the Israelites and their ancestors, although the language was not referred to by the name Hebrew in the Tanakh. The earliest examples of written Paleo-Hebrew date from the 10th century BCE. Hebrew belongs to the West Semitic branch of the Afroasiatic language family. The Hebrew language is the only living Canaanite language left. Hebrew had ceased to be an everyday spoken language somewhere between 200 and 400 CE, declining since the aftermath of the Bar Kokhba revolt. (Source: Wikipedia)
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with hebrew_) to discover available Hebrew corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: corpus_importer = CorpusImporter('hebrew')
In [3]: corpus_importer.list_corpora
Out[3]:
['hebrew_text_sefaria']
Hindi¶
Hindi is a standardised and Sanskritised register of the Hindustani language. Like other Indo-Aryan languages, Hindi is considered to be a direct descendant of an early form of Sanskrit, through Sauraseni Prakrit and Śauraseni Apabhraṃśa. It has been influenced by Dravidian languages, Turkic languages, Persian, Arabic, Portuguese and English. Hindi emerged as Apabhramsha, a degenerated form of Prakrit, in the 7th century A.D. By the 10th century A.D., it became stable. (Source: Wikipedia)
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with hindi_) to discover available Hindi corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('hindi')
In [3]: c.list_corpora
Out[3]:
['hindi_text_ltrc']
Stopword Filtering¶
To use the CLTK’s built-in stopwords list:
In [1]: from cltk.stop.classical_hindi.stops import STOPS_LIST
In [2]: print(STOPS_LIST[:5])
Out[2]: ["हें", "है", "हैं", "हि", "ही"]
Swadesh¶
The corpus module has a class for generating a Swadesh list for Classical Hindi.
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('hi')
In [3]: swadesh.words()[:10]
Out[3]: ['मैं', 'तू', 'वह', 'हम', 'तुम', 'वे', 'यह', 'वह', 'यहाँ', 'वहाँ' ]
Tokenizer¶
This tool can break a sentence into its constituent words. It simply splits the text into tokens of words and punctuation.
In [1]: from cltk.tokenize.sentence import TokenizeSentence
In [2]: import os
In [3]: root = os.path.expanduser('~')
In [4]: hindi_corpus = os.path.join(root,'cltk_data/hindi/text/hindi_text_ltrc')
In [5]: hindi_text_path = os.path.join(hindi_corpus, 'miscellaneous/gandhi/main.txt')
In [6]: hindi_text = open(hindi_text_path,'r').read()
In [7]: tokenizer = TokenizeSentence('hindi')
In [8]: hindi_text_tokenize = tokenizer.tokenize(hindi_text)
In [9]: print(hindi_text_tokenize[0:100])
['10्र', 'प्रति', 'ा', 'वापस', 'नहीं', 'ली', 'जातीएक', 'बार', 'कस्तुरबा', 'गांधी', 'बहुत', 'बीमार', 'हो', 'गईं', '।', 'जलर्', 'चिकित्सा', 'से', 'उन्हें', 'कोई', 'लाभ', 'नहीं', 'हुआ', '।', 'दूसरे', 'उपचार', 'किये', 'गये', '।', 'उनमे', 'भी', 'सफलता', 'नहीं', 'मिली', '।', 'अंत', 'में', 'गांधीजी', 'ने', 'उन्हें', 'नमक', 'और', 'दाल', 'छोडने', 'की', 'सलाह', 'दी', '।', 'परन्तु', 'इसके', 'लिए', 'बा', 'तैयार', 'नहीं', 'हुईं', '।', 'गांधीजी', 'ने', 'बहुत', 'समझाया', '.', 'पोथियों', 'से', 'प्रमाण', 'पढकर', 'सुनाये', '.', 'लेकर', 'सब', 'व्यर्थ', '।', 'बा', 'बोलीं', '.', '"', 'कोई', 'आपसे', 'कहे', 'कि', 'दाल', 'और', 'नमक', 'छोड', 'दो', 'तो', 'आप', 'भी', 'नहीं', 'छोडेंगे', '।', '"', 'गांधीजी', 'ने', 'तुरन्त', 'प्रसÙ', 'होकर', 'कहा', '.', '"', 'तुम']
Javanese¶
Javanese is the language of the Javanese people from the central and eastern parts of the island of Java, in Indonesia. Javanese is one of the Austronesian languages, but it is not particularly close to other languages and is difficult to classify. The 8th and 9th centuries are marked by the emergence of the Javanese literary tradition – with Sang Hyang Kamahayanikan, a Buddhist treatise; and the Kakawin Rāmâyaṇa, a Javanese rendering in Indian metres of the Vaishnavist Sanskrit epic Rāmāyaṇa. (Source: Wikipedia)
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with javanese_) to discover available Javanese corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('javanese')
In [3]: c.list_corpora
Out[3]:
['javanese_text_gretil']
Kannada¶
Kannada is a Dravidian language spoken predominantly by Kannada people in India, mainly in the state of Karnataka, and by significant linguistic minorities in the states of Andhra Pradesh, Telangana, Tamil Nadu, Maharashtra, Kerala, Goa and abroad. The language has roughly 38 million native speakers who are called Kannadigas (Kannadigaru), and a total of 51 million speakers according to a 2001 census. It is one of the scheduled languages of India and the official and administrative language of the state of Karnataka. (Source: Wikipedia)
Alphabet¶
The Kannada alphabet and digits are placed in cltk/corpus/kannada/alphabet.py.
The digits are placed in a list NUMERALS, with each digit at its corresponding list index (0–9). For example, the Kannada digit for 6 can be accessed in this manner:
In [1]: from cltk.corpus.kannada.alphabet import NUMERALS
In [2]: NUMERALS[6]
Out[2]: '೬'
The vowels are placed in a list VOWELS and can be accessed in this manner:
In [1]: from cltk.corpus.kannada.alphabet import VOWELS
In [2]: VOWELS
Out[2]: ['ಅ', 'ಆ', 'ಇ', 'ಈ', 'ಉ', 'ಊ', 'ಋ','ೠ', 'ಎ', 'ಏ', 'ಐ', 'ಒ', 'ಓ', 'ಔ']
The rest of the alphabet is placed in the lists VOWEL_SIGNS, YOGAVAAHAKAS, UNSTRUCTURED_CONSONANTS and STRUCTURED_CONSONANTS, which can be accessed in a similar way.
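For example (a minimal sketch, assuming the names above are exported from cltk/corpus/kannada/alphabet.py):
from cltk.corpus.kannada.alphabet import VOWEL_SIGNS, STRUCTURED_CONSONANTS

# Both are plain Python lists, like NUMERALS and VOWELS above.
print(VOWEL_SIGNS[:3])
print(STRUCTURED_CONSONANTS[:5])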
Latin¶
Latin is a classical language belonging to the Italic branch of the Indo-European languages. The Latin alphabet is derived from the Etruscan and Greek alphabets, and ultimately from the Phoenician alphabet. Latin was originally spoken in Latium, in the Italian Peninsula. Through the power of the Roman Republic, it became the dominant language, initially in Italy and subsequently throughout the Roman Empire. Vulgar Latin developed into the Romance languages, such as Italian, Portuguese, Spanish, French, and Romanian. (Source: Wikipedia)
Note
For most of the following operations, you must first import the CLTK Latin linguistic data (named latin_models_cltk).
Note
For most of the following operations, the j/i and v/u replacer JVReplacer() and .lower() should be applied to the input string first, if necessary.
Corpus Readers¶
Most users will want to access words, sentences, paragraphs and even whole documents via a CorpusReader object. All corpus contributors should provide a suitable reader. There are corpus readers for the Perseus Latin collection in JSON format and for the Latin Library; others will be made available. The CorpusReader methods: paras() returns paragraphs, if possible; words() returns a generator of words; sents() returns a generator of sentences; docs() returns a generator of Python dictionary objects representing each document.
In [1]: from cltk.corpus.readers import get_corpus_reader
...: reader = get_corpus_reader(language='latin', corpus_name='latin_text_perseus')
...: # get all the docs
...: docs = list(reader.docs())
...: len(docs)
...:
Out[1]: 293
In [2]: # or set just one
...: reader._fileids = ['cicero__on-behalf-of-aulus-caecina__latin.json']
...:
In [3]: # get all the sentences
...: sentences = list(reader.sents())
...: len(sentences)
...:
Out[3]: 25435
In [4]: # or one at a time
...: sentences[0]
...:
Out[4]: '\n\t\t\t si , quantum in agro locisque desertis audacia potest, tantum in foro atque\n\t\t\t\tin iudiciis impudentia valeret, non minus nunc in causa cederet A. Caecina Sex.'
In [5]: # access an individual doc as a dictionary of dictionaries
...: doc = list(reader.docs())[0]
...: doc.keys()
...:
Out[5]: dict_keys(['meta', 'author', 'text', 'edition', 'englishTitle', 'source', 'originalTitle', 'original-urn', 'language', 'sourceLink', 'urn', 'filename'])
Clausulae Analysis¶
Clausulae analysis is an integral part of Latin prosimetrics. The clausulae analysis module analyzes prose rhythm data generated by the prosody module to produce a dictionary of common rhythm types and their frequencies.
The list of rhythms which the module tallies is from the paper Keeline, T. and Kirby, J “Auceps syllabarum: A Digital Analysis of Latin Prose Rhythm,” Journal of Roman Studies, 2019.
In [1]: from cltk.prosody.latin.scanner import Scansion
In [2]: from cltk.prosody.latin.clausulae_analysis import Clausulae
In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'
In [4]: s = Scansion()
In [5]: c = Clausulae()
In [6]: prosody = s.scan_text(text)
Out[6]: ['-uuu-uuu-u--x', 'uu-uu-uu----x']
In [7]: c.clausulae_analysis(prosody)
Out[7]: [{'cretic_trochee': 1}, {'cretic_trochee_resolved_a': 0}, {'cretic_trochee_resolved_b': 0}, {'cretic_trochee_resolved_c': 0}, {'double_cretic': 0}, {'molossus_cretic': 0}, {'double_molossus_cretic_resolved_a': 0}, {'double_molossus_cretic_resolved_b': 0}, {'double_molossus_cretic_resolved_c': 0}, {'double_molossus_cretic_resolved_d': 0}, {'double_molossus_cretic_resolved_e': 0}, {'double_molossus_cretic_resolved_f': 0}, {'double_molossus_cretic_resolved_g': 0}, {'double_molossus_cretic_resolved_h': 0}, {'double_trochee': 0}, {'double_trochee_resolved_a': 0}, {'double_trochee_resolved_b': 0}, {'hypodochmiac': 0}, {'hypodochmiac_resolved_a': 0}, {'hypodochmiac_resolved_b': 0}, {'spondaic': 1}, {'heroic': 0}]
Converting J to I, V to U¶
In [1]: from cltk.stem.latin.j_v import JVReplacer
In [2]: j = JVReplacer()
In [3]: j.replace('vem jam')
Out[3]: 'uem iam'
Converting PHI texts with TLGU¶
Note
- Update this section with new post-TLGU processors in formatter.py
The TLGU is C-language software which does an excellent job at converting the TLG and PHI corpora into various forms of human-readable Unicode plaintext. The CLTK has an automated downloader and installer, as well as a wrapper which facilitates its use. Download and installation is handled in the background. When TLGU() is instantiated, it checks the local OS for a functioning version of the software; if none is found, it is installed.
Most users will want to do a bulk conversion of the entirety of a corpus without any text markup (such as chapter or line numbers).
In [1]: from cltk.corpus.greek.tlgu import TLGU
In [2]: t = TLGU()
In [3]: t.convert_corpus(corpus='phi5') # ~/cltk_data/latin/text/phi5/plaintext/
You can also divide the texts into a file for each individual work.
In [4]: t.divide_works('phi5') # ~/cltk_data/latin/text/phi5/individual_works/
Once these files are created, see PHI Indices below for accessing these newly created files.
See also Text Cleanup for removing extraneous non-textual characters from these files.
Information Retrieval¶
See Multilingual Information Retrieval for Latin–specific search options.
Declining¶
The CollatinusDecliner() attempts to retrieve all possible forms of a lemma. This may be useful if you want to search for all forms of a word across a repository of non-lemmatized texts. This class is based on lexical and linguistic data built by the Collatinus Team. Data corrections and additions can be contributed back to the Collatinus project (in particular, into bin/data).
Example use, assuming you have already imported the latin_models_cltk corpus:
In [1]: from cltk.stem.latin.declension import CollatinusDecliner
In [2]: decliner = CollatinusDecliner()
In [3]: print(decliner.decline("via"))
Out[3]:
[('via', '--s----n-'),
('via', '--s----v-'),
('viam', '--s----a-'),
('viae', '--s----g-'),
('viae', '--s----d-'),
('via', '--s----b-'),
('viae', '--p----n-'),
('viae', '--p----v-'),
('vias', '--p----a-'),
('viarum', '--p----g-'),
('viis', '--p----d-'),
('viis', '--p----b-')]
In [4]: decliner.decline("via", flatten=True)
Out[4]:
['via',
'via',
'viam',
'viae',
'viae',
'via',
'viae',
'viae',
'vias',
'viarum',
'viis',
'viis']
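Because the flattened output is a plain list of strings, it can be used directly to find every occurrence of a lemma in non-lemmatized text, as suggested above. A minimal sketch follows; the sample sentence is made up for illustration:
from cltk.stem.latin.declension import CollatinusDecliner

decliner = CollatinusDecliner()
forms = set(decliner.decline("via", flatten=True))

sample = "longa est via sed viam brevem facit comes facundus"  # hypothetical text
print([token for token in sample.split() if token in forms])  # ['via', 'viam']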
Lemmatization¶
*This lemmatizer is deprecated. It is recommended that you use the Backoff Lemmatizer described below.*
Tip
For ambiguous forms, which could belong to several headwords, the current lemmatizer chooses the more commonly occurring headword (code here). For any errors that you spot, please open a ticket.
The CLTK’s lemmatizer is based on a key-value store, whose code is available at the CLTK’s Latin lemma/POS repository.
The lemmatizer offers several input and output options. For text input, it can take a string or a list of tokens (which, by the way, need j and v replaced first). Here is an example of the lemmatizer taking a string:
In [1]: from cltk.stem.lemma import LemmaReplacer
In [2]: from cltk.stem.latin.j_v import JVReplacer
In [3]: sentence = 'Aeneadum genetrix, hominum divomque voluptas, alma Venus, caeli subter labentia signa quae mare navigerum, quae terras frugiferentis concelebras, per te quoniam genus omne animantum concipitur visitque exortum lumina solis.'
In [6]: sentence = sentence.lower()
In [7]: lemmatizer = LemmaReplacer('latin')
In [8]: lemmatizer.lemmatize(sentence)
Out[8]:
['aeneadum',
'genetrix',
',',
'homo',
'divus',
'voluptas',
',',
'almus',
...]
And here taking a list:
In [9]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'])
Out[9]: ['qui1', 'terra', 'frugiferens', 'concelebro']
The lemmatizer takes several optional arguments for controlling output: return_raw=True and return_string=True. return_raw returns the original inflection along with its headword:
In [10]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_raw=True)
Out[10]:
['quae/qui1',
'terras/terra',
'frugiferentis/frugiferens',
'concelebras/concelebro']
And return_string=True wraps the list in ' '.join():
In [11]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_string=True)
Out[11]: 'qui1 terra frugiferens concelebro'
These two arguments can be combined, as well.
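For example (a sketch continuing the session above; the output itself is omitted), both flags can be passed in the same call to get the inflection/headword pairs joined into a single string:
lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'],
                     return_raw=True, return_string=True)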
Lemmatization, backoff method¶
The CLTK offers a series of lemmatizers that can be combined in a backoff chain, i.e. if one lemmatizer is unable to return a headword for a token, this token can be passed onto another lemmatizer until either a headword is returned or the sequence ends.
There is a generic version of the backoff Latin lemmatizer which requires data from the CLTK latin models data found here. The lemmatizer expects this model to be stored in a folder called cltk_data in the user’s home directory.
To use the generic version of the backoff Latin Lemmatizer:
In [1]: from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
In [2]: lemmatizer = BackoffLatinLemmatizer()
In [3]: tokens = ['Quo', 'usque', 'tandem', 'abutere', ',', 'Catilina', ',', 'patientia', 'nostra', '?']
In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutor'), (',', 'punc'), ('Catilina', 'Catilina'), (',', 'punc'), ('patientia', 'patientia'), ('nostra', 'noster'), ('?', 'punc')]
NB: The backoff chain for this lemmatizer is defined as follows: 1. a dictionary-based lemmatizer with high-frequency, unambiguous forms; 2. a training-data-based lemmatizer based on 4,000 sentences from the [Perseus Latin Dependency Treebanks](https://perseusdl.github.io/treebank_data/); 3. a regular-expression-based lemmatizer transforming unambiguous endings; 4. a dictionary-based lemmatizer with the complete set of Morpheus lemmas; 5. an ‘identity’ lemmatizer returning the token as the lemma. Each of these sub-lemmatizers is explained in the documents for “Multilingual”.
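The backoff logic itself is simple: try each sub-lemmatizer in order and fall through to the next on failure. The snippet below is only a toy illustration of that control flow, with plain dictionaries standing in for the sub-lemmatizers listed above; it is not the BackoffLatinLemmatizer implementation.
# Toy backoff chain; the two dictionaries are hypothetical stand-ins for
# the dictionary-based sub-lemmatizers described above.
high_frequency_forms = {'abutere': 'abutor', ',': 'punc', '?': 'punc'}
morpheus_forms = {'patientia': 'patientia', 'nostra': 'noster'}

def backoff_lemmatize(token, chain):
    for lookup in chain:
        if token in lookup:
            return lookup[token]
    return token  # final 'identity' step: return the token itself

chain = [high_frequency_forms, morpheus_forms]
print([(t, backoff_lemmatize(t, chain)) for t in ['abutere', 'nostra', 'Catilina']])
# [('abutere', 'abutor'), ('nostra', 'noster'), ('Catilina', 'Catilina')]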
Line Tokenization¶
The line tokenizer takes a string input into tokenize() and returns a list of strings.
In [1]: from cltk.tokenize.line import LineTokenizer
In [2]: tokenizer = LineTokenizer('latin')
In [3]: untokenized_text = """49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""
In [4]: tokenizer.tokenize(untokenized_text)
Out[4]: ['49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']
The line tokenizer by default removes multiple line breaks. If you wish to retain blank lines in the returned list, set include_blanks to True.
In [5]: untokenized_text = """48. Cum tibi contigerit studio cognoscere multa,\nFac discas multa, vita nil discere velle.\n\n49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""
In [6]: tokenizer.tokenize(untokenized_text, include_blanks=True)
Out[6]: ['48. Cum tibi contigerit studio cognoscere multa,','Fac discas multa, vita nil discere velle.','','49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']
Macronizer¶
Automatically mark long Latin vowels with a macron. The algorithm used in this module is largely based on Johan Winge’s, which is detailed in his thesis.
Note that the macronizer’s accuracy varies depending on which tagger is used. Currently, the macronizer supports the following taggers: tag_ngram_123_backoff, tag_tnt, and tag_crf. The tagger is selected when calling the class, as seen on line 2. Be sure to first import the data models from latin_models_cltk, via the corpus importer, since both the taggers and the macronizer rely on them.
The macronizer can either macronize text, as seen on line 4 below, or return a list of tagged tokens containing the macronized forms, as on line 5.
In [1]: from cltk.prosody.latin.macronizer import Macronizer
In [2]: macronizer = Macronizer('tag_ngram_123_backoff')
In [3]: text = 'Quo usque tandem, O Catilina, abutere nostra patientia?'
In [4]: macronizer.macronize_text(text)
Out[4]: 'quō usque tandem , ō catilīnā , abūtēre nostrā patientia ?'
In [5]: macronizer.macronize_tags(text)
Out[5]: [('quo', 'd--------', 'quō'), ('usque', 'd--------', 'usque'), ('tandem', 'd--------', 'tandem'), (',', 'u--------', ','), ('o', 'e--------', 'ō'), ('catilina', 'n-s---mb-', 'catilīnā'), (',', 'u--------', ','), ('abutere', 'v2sfip---', 'abūtēre'), ('nostra', 'a-s---fb-', 'nostrā'), ('patientia', 'n-s---fn-', 'patientia'), ('?', None, '?')]
Making POS training sets¶
Warning
POS tagging is a work in progress. A new tagging dictionary has been created, though a tagger has not yet been written.
First, obtain the Latin POS tagging files. The important file here is cltk_latin_pos_dict.txt, which is saved at ~/cltk_data/compiled/pos_latin. This file is a Python dict type which aims to give all possible parts-of-speech for any given form, though this is based on the incomplete Perseus latin-analyses.txt. Thus, there may be gaps in (i) the inflected forms defined and (ii) the comprehensiveness of the analyses of any given form. cltk_latin_pos_dict.txt looks like:
{'-nam': {'perseus_pos': [{'pos0': {'case': 'indeclform',
'gloss': '',
'type': 'conj'}}]},
'-namque': {'perseus_pos': [{'pos0': {'case': 'indeclform',
'gloss': '',
'type': 'conj'}}]},
'-sed': {'perseus_pos': [{'pos0': {'case': 'indeclform',
'gloss': '',
'type': 'conj'}}]},
'Aaron': {'perseus_pos': [{'pos0': {'case': 'nom',
'gender': 'masc',
'gloss': 'Aaron',
'number': 'sg',
'type': 'substantive'}}]},
}
If you wish to edit the POS dictionary creator, see cltk_latin_pos_dict.txt. For more, see the [pos_latin](https://github.com/cltk/latin_pos_lemmata_cltk) repository.
Named Entity Recognition¶
A simple interface to a list of Latin proper nouns is available (see the repo for how the list was created). By default tag_ner() takes a string input and returns a list of tuples. However, it can also take pre-tokenized forms and return a string.
In [1]: from cltk.tag import ner
In [2]: from cltk.stem.latin.j_v import JVReplacer
In [3]: text_str = """ut Venus, ut Sirius, ut Spica, ut aliae quae primae dicuntur esse mangitudinis."""
In [4]: jv_replacer = JVReplacer()
In [5]: text_str_iu = jv_replacer.replace(text_str)
In [7]: ner.tag_ner('latin', input_text=text_str_iu, output_type=list)
Out[7]:
[('ut',),
('Uenus', 'Entity'),
(',',),
('ut',),
('Sirius', 'Entity'),
(',',),
('ut',),
('Spica', 'Entity'),
(',',),
('ut',),
('aliae',),
('quae',),
('primae',),
('dicuntur',),
('esse',),
('mangitudinis',),
('.',)]
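As noted above, tag_ner() can also return a string rather than a list of tuples. A sketch continuing the session above, assuming the string form is requested with output_type=str (the output itself is not shown here):
ner.tag_ner('latin', input_text=text_str_iu, output_type=str)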
PHI Indices¶
Located at cltk/corpus/latin/phi5_index.py of the source are indices for the PHI5, one of just id and name (PHI5_INDEX) and another also containing information on the authors’ works (PHI5_WORKS_INDEX).
In [1]: from cltk.corpus.latin.phi5_index import PHI5_INDEX
In [2]: PHI5_INDEX
Out[2]:
{'LAT1050': 'Lucius Verginius Rufus',
'LAT2335': 'Anonymi de Differentiis [Fronto]',
'LAT1345': 'Silius Italicus',
... }
In [3]: from cltk.corpus.latin.phi5_index import PHI5_WORKS_INDEX
In [4]: PHI5_WORKS_INDEX
Out [4]:
{'LAT2335': {'works': ['001'], 'name': 'Anonymi de Differentiis [Fronto]'},
'LAT1345': {'works': ['001'], 'name': 'Silius Italicus'},
'LAT1351': {'works': ['001', '002', '003', '004', '005'],
'name': 'Cornelius Tacitus'},
'LAT2349': {'works': ['001', '002', '003', '004', '005', '006', '007'],
'name': 'Maurus Servius Honoratus, Servius'},
...}
In addition to these indices there are several helper functions which will build filepaths for your particular computer. Note that you will need to have run convert_corpus(corpus='phi5') and divide_works('phi5') from the TLGU() class, respectively, for the following two functions.
In [1]: from cltk.corpus.utils.formatter import assemble_phi5_author_filepaths
In [2]: assemble_phi5_author_filepaths()
Out[2]:
['/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0636.TXT',
'/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0658.TXT',
'/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0827.TXT',
...]
In [3]: from cltk.corpus.utils.formatter import assemble_phi5_works_filepaths
In [4]: assemble_phi5_works_filepaths()
Out[4]:
['/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0636.TXT-001.txt',
'/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0902.TXT-001.txt',
'/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-001.txt',
'/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-002.txt',
...]
These two functions are useful when, for example, needing to process all authors of the PHI5 corpus, all works of the corpus, or all works of one particular author.
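For example, to read every author file of the PHI5 corpus into memory (a minimal sketch; it assumes the conversion steps above have already been run):
from cltk.corpus.utils.formatter import assemble_phi5_author_filepaths

filepaths = assemble_phi5_author_filepaths()
texts = {}
for path in filepaths:
    with open(path) as f:
        texts[path] = f.read()  # plaintext of one author
print(len(texts))  # number of author files read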
POS tagging¶
These taggers were built with the assistance of the NLTK. The backoff tagger is Bayesian and the TnT tagger is an HMM. To obtain the models, first import the latin_models_cltk corpus.
1–2–3–gram backoff tagger¶
In [1]: from cltk.tag.pos import POSTag
In [2]: tagger = POSTag('latin')
In [3]: tagger.tag_ngram_123_backoff('Gallia est omnis divisa in partes tres')
Out[3]:
[('Gallia', None),
('est', 'V3SPIA---'),
('omnis', 'A-S---MN-'),
('divisa', 'T-PRPPNN-'),
('in', 'R--------'),
('partes', 'N-P---FA-'),
('tres', 'M--------')]
TnT tagger¶
In [4]: tagger.tag_tnt('Gallia est omnis divisa in partes tres')
Out[4]:
[('Gallia', 'Unk'),
('est', 'V3SPIA---'),
('omnis', 'N-S---MN-'),
('divisa', 'T-SRPPFN-'),
('in', 'R--------'),
('partes', 'N-P---FA-'),
('tres', 'M--------')]
CRF tagger¶
Warning
This tagger’s accuracy has not yet been evaluated.
We use the NLTK’s CRF tagger. For information on it, see the NLTK docs.
In [5]: tagger.tag_crf('Gallia est omnis divisa in partes tres')
Out[5]:
[('Gallia', 'A-P---NA-'),
('est', 'V3SPIA---'),
('omnis', 'A-S---FN-'),
('divisa', 'N-S---FN-'),
('in', 'R--------'),
('partes', 'N-P---FA-'),
('tres', 'M--------')]
Lapos tagger¶
Note
The Lapos tagger is available in its own repo, with the master branch for Linux and the apple branch for Mac. See directions there on how to use it.
Prosody Scanning¶
A prosody scanner is available for text which already has had its natural lengths marked with macrons. It returns a list of strings of long and short marks for each sentence, with an anceps marking the last syllable of each sentence.
The algorithm is designed only for Latin prose rhythms. It is detailed in Keeline, T. and Kirby, J “Auceps syllabarum: A Digital Analysis of Latin Prose Rhythm,” Journal of Roman Studies, 2019.
In [1]: from cltk.prosody.latin.scanner import Scansion
In [2]: scanner = Scansion()
In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'
In [4]: scanner.scan_text(text)
Out[4]: ['-uuu-uuu-u--x', 'uu-uu-uu----x']
Scansion of Poetry¶
About the use of macrons in poetry¶
Most Latin poetry arrives to us without macrons. Some lines of Latin poetry can be scanned and fit a poetic meter without any macrons at all, due to the rules of meter and positional accentuation.
Automatically macronizing every word in a line of Latin poetry does not mean that it will automatically scan correctly. Poets often diverge from standard usage: regularly long vowels can appear short (the verb nesciō in poetry scans the final personal ending as a short o), and regularly short vowels can appear long (e.g. Lucretius regularly writes rēligiō, which scans, instead of the usual religiō). There is also a prosodic device, diastole, in which the short final vowel of a word is lengthened to fit the meter, e.g. tibī in Lucretius I.104 and III.899.
However, some macrons are necessary for scansion: Lucretius I.12 begins with “aeriae” which will not scan in hexameter unless one substitutes its macronized form “āeriae”.
HexameterScanner¶
The HexameterScanner class scans lines of Latin hexameter (with or without macrons) and determines if the line is a valid hexameter and what its scansion pattern is.
If the line is not properly macronized to scan, the scanner tries to determine whether the line:
- Scans merely by position.
- Syllabifies according to the common rules.
- Is complete (e.g. some hexameter lines are partial).
The scanner also determines which syllables would have to be made long to make the line scan as a valid hexameter. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The HexameterScanner’s scan method returns a Verse class object.
In [1]: from cltk.prosody.latin.hexameter_scanner import HexameterScanner
In [2]: scanner = HexameterScanner()
In [3]: scanner.scan("impulerit. Tantaene animis caelestibus irae?")
Out[3]: Verse(original='impulerit. Tantaene animis caelestibus irae?', scansion='- U U - - - U U - - - U U - - ', meter='hexameter', valid=True, syllable_count=15, accented='īmpulerīt. Tāntaene animīs caelēstibus īrae?', scansion_notes=['Valid by positional stresses.'], syllables = ['īm', 'pu', 'le', 'rīt', 'Tān', 'taen', 'a', 'ni', 'mīs', 'cae', 'lēs', 'ti', 'bus', 'i', 'rae'])
PentameterScanner¶
The PentameterScanner class scans lines of Latin pentameter (with or without macrons) and determines if the line is a valid pentameter and what its scansion pattern is.
If the line is not properly macronized to scan, the scanner tries to determine whether the line:
- Scans merely by position.
- Syllabifies according to the common rules.
The scanner also determines which syllables would have to be made long to make the line scan as a valid pentameter. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The PentameterScanner’s scan method returns a Verse class object.
In [1]: from cltk.prosody.latin.pentameter_scanner import PentameterScanner
In [2]: scanner = PentameterScanner()
In [3]: scanner.scan("ex hoc ingrato gaudia amore tibi.")
Out[3]: Verse(original='ex hoc ingrato gaudia amore tibi.', scansion='- - - - - - U U - U U U ', meter='pentameter', valid=True, syllable_count=12, accented='ēx hōc īngrātō gaudia amōre tibi.', scansion_notes=['Spondaic pentameter'], syllables = ['ēx', 'hoc', 'īn', 'gra', 'to', 'gau', 'di', 'a', 'mo', 're', 'ti', 'bi'])
HendecasyllableScanner¶
The HendecasyllableScanner class scans lines of Latin hendecasyllables (with or without macrons) and determines if the line is a valid example of the hendecasyllablic meter and what its scansion pattern is.
If the line is not properly macronized to scan, the scanner tries to determine whether the line:
- Scans merely by position.
- Syllabifies according to the common rules.
The scanner also determines which syllables would have to be made long to make the line scan as a valid hendecasyllable. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The HendecasyllableScanner’s scan method returns a Verse class object.
In [1]: from cltk.prosody.latin.hendecasyllable_scanner import HendecasyllableScanner
In [2]: scanner = HendecasyllableScanner()
In [3]: scanner.scan("Iam tum, cum ausus es unus Italorum")
Out[3]: Verse(original='Iam tum, cum ausus es unus Italorum', scansion=' - - - U U - U - U - U ', meter='hendecasyllable', valid=True, syllable_count=11, accented='Iām tūm, cum ausus es ūnus Ītalōrum', scansion_notes=['antepenult foot onward normalized.'], syllables = ['Jām', 'tūm', 'c', 'au', 'sus', 'es', 'u', 'nus', 'I', 'ta', 'lo', 'rum'])
Verse¶
The Verse class object returned by the HexameterScanner, PentameterScanner, and HendecasyllableScanner provides slots for:
- original - original line of verse
- scansion - the scansion pattern
- meter - the meter of the verse
- valid - whether or not the hexameter is valid
- syllable_count - number of syllables according to common syllabification rules
- accented - if the hexameter is valid, a version of the line with accented vowels (dipthongs are not accented)
- scansion_notes - a list recording the characteristics of the transformations made to the original line
- syllables - a list of the syllables into which the line is divided at the scansion level; elided syllables are not included.
The Scansion notes are defined in a NOTE_MAP dictionary object contained in the ScansionConstants class.
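The slots are ordinary attributes on the returned object; for example (a sketch reusing the HexameterScanner example above):
from cltk.prosody.latin.hexameter_scanner import HexameterScanner

verse = HexameterScanner().scan("impulerit. Tantaene animis caelestibus irae?")
print(verse.valid)           # True
print(verse.scansion)        # the scansion pattern as a string
print(verse.scansion_notes)  # ['Valid by positional stresses.']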
ScansionConstants¶
The ScansionConstants class is a configuration class for specifying scansion constants. It also allows users to customize scansion constants and scanner behavior; for example, a user may alter the symbols used for stressed and unstressed syllables:
In [1]: from cltk.prosody.latin.scansion_constants import ScansionConstants
In [2]: constants = ScansionConstants(unstressed="U",stressed= "-", optional_terminal_ending="X")
In [3]: constants.DACTYL
Out[3]: '-UU'
In [4]: smaller_constants = ScansionConstants(unstressed="˘",stressed= "¯", optional_terminal_ending="x")
In [5]: smaller_constants.DACTYL
Out[5]: '¯˘˘'
Constants containing strings have characters in upper and lower case since they will often be used in regular expressions, and are used to preserve a verse’s original case.
Syllabifier¶
The Syllabifier class is a Latin language syllabifier. It parses a Latin word or a space separated list of words into a list of syllables. Consonantal I is transformed into a J at the start of a word as necessary. Tuned for poetry and verse, this class is tolerant of isolated single character consonants that may appear due to elision.
In [1]: from cltk.prosody.latin.syllabifier import Syllabifier
In [1]: syllabifier = Syllabifier()
In [2]: syllabifier.syllabify("libri")
Out[2]: ['li', 'bri']
In [3]: syllabifier.syllabify("contra")
Out[3]: ['con', 'tra']
Metrical Validator¶
The MetricalValidator class is a utility class for validating scansion patterns. Users may configure the scansion symbols internally via passing a customized ScansionConstants via a constructor argument:
In [1]: from cltk.prosody.latin.metrical_validator import MetricalValidator
In [2]: MetricalValidator().is_valid_hexameter("-UU---UU---UU-U")
Out[2]: 'True'
ScansionFormatter¶
The ScansionFormatter class is a utility class for formatting scansion patterns.
In [1]: from cltk.prosody.latin.scansion_formatter import ScansionFormatter
In [2]: ScansionFormatter().hexameter("-UU-UU-UU---UU--")
Out[2]: '-UU|-UU|-UU|--|-UU|--'
In [3]: constants = ScansionConstants(unstressed="˘", stressed= "¯", optional_terminal_ending="x")
In [4]: formatter = ScansionFormatter(constants)
In [5]: formatter.hexameter( "¯˘˘¯˘˘¯˘˘¯¯¯˘˘¯¯")
Out[5]: '¯˘˘|¯˘˘|¯˘˘|¯¯|¯˘˘|¯¯'
string_utils module¶
The string_utils module contains utility methods for processing scansion and text. For example, punctuation_for_spaces_dict() returns a dictionary object that maps Unicode punctuation to blank spaces, which is essential for scansion to keep stress patterns in alignment with the original vowel positions in the verse.
In [1]: import cltk.prosody.latin.string_utils as string_utils
In [2]: "I'm ok! Oh #%&*()[]{}!? Fine!".translate(string_utils.punctuation_for_spaces_dict()).strip()
Out[2]: 'I m ok Oh Fine'
Sentence Tokenization¶
Sentence tokenization for Latin is available using a [Punkt](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) tokenizer trained on the Latin Library. The model for this tokenizer can be found in the CLTK corpora under latin_models_cltk/tokenizers/sentence/latin_punkt. The training process considers Latin punctuation patterns as well as common abbreviations (e.g. nomina). To tokenize a Latin text by sentences…
In [1]: from cltk.tokenize.latin.sentence import SentenceTokenizer
In [2]: sent_tokenizer = SentenceTokenizer()
In [3]: untokenized_text = 'Meministine me ante diem XII Kalendas Novembris dicere in senatu fore in armis certo die, qui dies futurus esset ante diem VI Kal. Novembris, C. Manlium, audaciae satellitem atque administrum tuae? Num me fefellit, Catilina, non modo res tanta, tam atrox tamque incredibilis, verum, id quod multo magis est admirandum, dies? Dixi ego idem in senatu caedem te optumatium contulisse in ante diem V Kalendas Novembris, tum cum multi principes civitatis Roma non tam sui conservandi quam tuorum consiliorum reprimendorum causa profugerunt.'
In [4]: sent_tokenizer.tokenize(untokenized_text)
Out[4]: ['Meministine me ante diem XII Kalendas Novembris dicere in senatu fore in armis certo die, qui dies futurus esset ante diem VI Kal. Novembris, C. Manlium, audaciae satellitem atque administrum tuae?', 'Num me fefellit, Catilina, non modo res tanta, tam atrox tamque incredibilis, verum, id quod multo magis est admirandum, dies?', 'Dixi ego idem in senatu caedem te optumatium contulisse in ante diem V Kalendas Novembris, tum cum multi principes civitatis Roma non tam sui conservandi quam tuorum consiliorum reprimendorum causa profugerunt.']
Note that the Latin sentence tokenizer takes account of abbreviations like ‘Kal.’ and ‘C.’ and does not split sentences at these points.
By default, the Latin Punkt Sentence Tokenizer splits on period, question mark, and exclamation point. There is a strict parameter that adds colon, semicolon, and hyphen to this set.
In [5]: sent_tokenizer = SentenceTokenizer(strict=True)
In [6]: untokenized_text = 'In principio creavit Deus caelum et terram; terra autem erat inanis et vacua et tenebrae super faciem abyssi et spiritus Dei ferebatur super aquas; dixitque Deus fiat lux et facta est lux; et vidit Deus lucem quod esset bona et divisit lucem ac tenebras.'
In [7]: sent_tokenizer.tokenize(untokenized_text)
Out[7]: ['In principio creavit Deus caelum et terram;', 'terra autem erat inanis et vacua et tenebrae super faciem abyssi et spiritus Dei ferebatur super aquas;', 'dixitque Deus fiat lux et facta est lux;', 'et vidit Deus lucem quod esset bona et divisit lucem ac tenebras.']
NB: The old method for sentence tokenizer, i.e. TokenizeSentence, is still available, but now calls the tokenizer described above.
In [5]: from cltk.tokenize.sentence import TokenizeSentence
In [6]: tokenizer = TokenizeSentence('latin')
etc.
Semantics¶
The Semantics module allows for the lookup of Latin lemmata, synonyms, and translations into Greek. Lemma, synonym, and translation dictionaries are drawn from the open-source Tesserae Project (http://github.com/tesserae/tesserae).
The dictionaries used by this module are stored in https://github.com/cltk/latin_models_cltk/tree/master/semantics and https://github.com/cltk/greek_models_cltk/tree/master/semantics for Latin and Greek, respectively. In order to use the Semantics module, it is necessary to import those repos first (see http://docs.cltk.org/en/latest/importing_corpora.html#importing-a-corpus).
Tip
When lemmatizing ambiguous forms, the Semantics module is designed to return all possibilities. A probability distribution is included with the list of results, but as of June 8, 2018 the total probability is evenly distributed over all possibilities. Future updates will include a more intelligent system for determining the most likely lemma, synonym, or translation.
The Lemmata class includes two relevant methods: lookup() takes a list of tokens standardized for spelling and returns a complex object which includes a probability distribution; isolate() takes the object returned by lookup() and discards everything but the lemmata.
In [1]: from cltk.semantics.latin.lookup import Lemmata
In [2]: lemmatizer = Lemmata(dictionary='lemmata', language='latin')
In [3]: tokens = ['ceterum', 'antequam', 'destinata', 'componam']
In [4]: lemmas = lemmatizer.lookup(tokens)
Out[4]:
[('ceterum', [('ceterus', 1.0)]), ('antequam', [('antequam', 1.0)]), ('destinata', [('destinatus', 0.25), ('destinatum', 0.25), ('destinata', 0.25), ('destino', 0.25)]), ('componam', [('compono', 1.0)])]
In [5]: just_lemmas = Lemmata.isolate(lemmas)
Out[5]:['ceterus', 'antequam', 'destinatus', 'destinatum', 'destinata', 'destino', 'compono']
The Synonym class can be initialized to lookup either synonyms or translations. It expects a list of lemmata, not inflected forms. Only successful ‘lookups’ will return results.
In [1]: from cltk.semantics.latin.lookup import Synonyms
In [2]: translator = Synonyms(dictionary='translations', language='latin')
In [3]: lemmas = ['ceterus', 'antequam', 'destinatus', 'destinatum', 'destinata', 'destino', 'compono']
In [4]: translations = translator.lookup(lemmas)
Out[4]:[('destino', [('σκοπός', 1.0)]), ('compono', [('συντίθημι', 1.0)])]
In [5]: just_translations = Lemmata.isolate(translations)
Out[5]:['σκοπός', 'συντίθημι']
A raw list of translations can be obtained from the translation object using Lemmata.isolate().
Stemming¶
The stemmer strips suffixes via an algorithm. It is much faster than the lemmatizer, which uses a replacement list.
In [1]: from cltk.stem.latin.stem import Stemmer
In [2]: sentence = 'Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiuerunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem ciuem existimarint foeneratorem quam furem, hinc licet existimare. Et uirum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, uerum, ut supra dixi, periculosum et calamitosum. At ex agricolis et uiri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque inuidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit.'
In [3]: stemmer = Stemmer()
In [4]: stemmer.stem(sentence.lower())
Out[4]: 'est interd praestar mercatur r quaerere, nisi tam periculos sit, et it foenerari, si tam honestum. maior nostr sic habueru et ita in leg posiuerunt: fur dupl condemnari, foenerator quadrupli. quant peior ciu existimari foenerator quam furem, hinc lice existimare. et uir bon quo laudabant, ita laudabant: bon agricol bon colonum; amplissim laudar existimaba qui ita laudabatur. mercator autem strenu studios re quaerend existimo, uerum, ut supr dixi, periculos et calamitosum. at ex agricol et uir fortissim et milit strenuissim gignuntur, maxim p quaest stabilissim consequi minim inuidiosus, minim mal cogitant su qui in e studi occupat sunt. nunc, ut ad r redeam, quod promis institut principi hoc erit. '
Stoplist Construction¶
To extract a stoplist from a collection of documents:
In [1]: test_1 = """cogitanti mihi saepe numero et memoria vetera repetenti perbeati fuisse, quinte frater, illi videri solent, qui in optima re publica, cum et honoribus et rerum gestarum gloria florerent, eum vitae cursum tenere potuerunt, ut vel in negotio sine periculo vel in otio cum dignitate esse possent; ac fuit cum mihi quoque initium requiescendi atque animum ad utriusque nostrum praeclara studia referendi fore iustum et prope ab omnibus concessum arbitrarer, si infinitus forensium rerum labor et ambitionis occupatio decursu honorum, etiam aetatis flexu constitisset. quam spem cogitationum et consiliorum meorum cum graves communium temporum tum varii nostri casus fefellerunt; nam qui locus quietis et tranquillitatis plenissimus fore videbatur, in eo maximae moles molestiarum et turbulentissimae tempestates exstiterunt; neque vero nobis cupientibus atque exoptantibus fructus oti datus est ad eas artis, quibus a pueris dediti fuimus, celebrandas inter nosque recolendas. nam prima aetate incidimus in ipsam perturbationem disciplinae veteris, et consulatu devenimus in medium rerum omnium certamen atque discrimen, et hoc tempus omne post consulatum obiecimus eis fluctibus, qui per nos a communi peste depulsi in nosmet ipsos redundarent. sed tamen in his vel asperitatibus rerum vel angustiis temporis obsequar studiis nostris et quantum mihi vel fraus inimicorum vel causae amicorum vel res publica tribuet oti, ad scribendum potissimum conferam; tibi vero, frater, neque hortanti deero neque roganti, nam neque auctoritate quisquam apud me plus valere te potest neque voluntate."""
In [2]: test_2 = """ac mihi repetenda est veteris cuiusdam memoriae non sane satis explicata recordatio, sed, ut arbitror, apta ad id, quod requiris, ut cognoscas quae viri omnium eloquentissimi clarissimique senserint de omni ratione dicendi. vis enim, ut mihi saepe dixisti, quoniam, quae pueris aut adulescentulis nobis ex commentariolis nostris incohata ac rudia exciderunt, vix sunt hac aetate digna et hoc usu, quem ex causis, quas diximus, tot tantisque consecuti sumus, aliquid eisdem de rebus politius a nobis perfectiusque proferri; solesque non numquam hac de re a me in disputationibus nostris dissentire, quod ego eruditissimorum hominum artibus eloquentiam contineri statuam, tu autem illam ab elegantia doctrinae segregandam putes et in quodam ingeni atque exercitationis genere ponendam. ac mihi quidem saepe numero in summos homines ac summis ingeniis praeditos intuenti quaerendum esse visum est quid esset cur plures in omnibus rebus quam in dicendo admirabiles exstitissent; nam quocumque te animo et cogitatione converteris, permultos excellentis in quoque genere videbis non mediocrium artium, sed prope maximarum. quis enim est qui, si clarorum hominum scientiam rerum gestarum vel utilitate vel magnitudine metiri velit, non anteponat oratori imperatorem? quis autem dubitet quin belli duces ex hac una civitate praestantissimos paene innumerabilis, in dicendo autem excellentis vix paucos proferre possimus? iam vero consilio ac sapientia qui regere ac gubernare rem publicam possint, multi nostra, plures patrum memoria atque etiam maiorum exstiterunt, cum boni perdiu nulli, vix autem singulis aetatibus singuli tolerabiles oratores invenirentur. ac ne qui forte cum aliis studiis, quae reconditis in artibus atque in quadam varietate litterarum versentur, magis hanc dicendi rationem, quam cum imperatoris laude aut cum boni senatoris prudentia comparandam putet, convertat animum ad ea ipsa artium genera circumspiciatque, qui in eis floruerint quamque multi sint; sic facillime, quanta oratorum sit et semper fuerit paucitas, iudicabit."""
In [3]: test_corpus = [test_1, test_2]
In [4]: from cltk.stop.latin import CorpusStoplist
In [5]: S = CorpusStoplist()
In [6]: print(S.build_stoplist(test_corpus, size=10))
Out[6]: ['ac', 'atque', 'cum', 'et', 'in', 'mihi', 'neque', 'qui', 'rerum', 'vel']
Stopword Filtering¶
To use a pre-built stoplist (created originally by the Perseus Project):
In [1]: from nltk.tokenize.punkt import PunktLanguageVars
In [2]: from cltk.stop.latin.stops import STOPS_LIST
In [3]: sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'
In [4]: p = PunktLanguageVars()
In [5]: tokens = p.word_tokenize(sentence.lower())
In [6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]:
['usque',
'tandem',
'abutere',
',',
'catilina',
',',
'patientia',
'nostra',
'?']
Swadesh¶
The corpus module has a class for generating a Swadesh list for Latin.
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('la')
In [3]: swadesh.words()[:10]
Out[3]: ['ego', 'tū', 'is, ea, id', 'nōs', 'vōs', 'eī, iī, eae, ea', 'hic, haec, hoc', 'ille, illa, illud', 'hīc', 'illic, ibi']
Syllabifier¶
The syllabifier splits a given input Latin word into a list of syllables based on an algorithm and set of syllable specifications for Latin.
In [1]: from cltk.stem.latin.syllabifier import Syllabifier
In [2]: word = 'sidere'
In [3]: syllabifier = Syllabifier()
In [4]: syllabifier.syllabify(word)
Out[4]: ['si', 'de', 're']
Text Cleanup¶
Intended for use on the PHI5 corpus after processing by TLGU().
In [1]: from cltk.corpus.utils.formatter import phi5_plaintext_cleanup
In [2]: import os
In [3]: file = os.path.expanduser('~/cltk_data/latin/text/phi5/individual_works/LAT0031.TXT-001.txt')
In [4]: with open(file) as f:
...: r = f.read()
...:
In [5]: r[:500]
Out[5]: '\nDices pulchrum esse inimicos \nulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uide-\ntur, sed si liceat re publica salua ea persequi. sed quatenus id fieri non \npotest, multo tempore multisque partibus inimici nostri non peribunt \natque, uti nunc sunt, erunt potius quam res publica profligetur atque \npereat. \n Verbis conceptis deierare ausim, praeterquam qui \nTiberium Gracchum necarunt, neminem inimicum tantum molestiae \ntantumque laboris, quantum te ob has res, mihi tradidis'
In [6]: phi5_plaintext_cleanup(r, rm_punctuation=True, rm_periods=False)[:500]
Out[6]: ' Dices pulchrum esse inimicos ulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uidetur sed si liceat re publica salua ea persequi. sed quatenus id fieri non potest multo tempore multisque partibus inimici nostri non peribunt atque uti nunc sunt erunt potius quam res publica profligetur atque pereat. Verbis conceptis deierare ausim praeterquam qui Tiberium Gracchum necarunt neminem inimicum tantum molestiae tantumque laboris quantum te ob has res mihi tradidisse quem oportebat omni'
If you have a text of a language in Latin characters which contains a lot of junk, remove_non_ascii() and remove_non_latin() might be of use.
In [1]: from cltk.corpus.utils.formatter import remove_non_ascii
In [2]: text = 'Dices ἐστιν ἐμός pulchrum esse inimicos ulcisci.'
In [3]: remove_non_ascii(text)
Out[3]: 'Dices  pulchrum esse inimicos ulcisci.'
In [4]: from cltk.corpus.utils.formatter import remove_non_latin
In [5]: remove_non_latin(text)
Out[5]: ' Dices pulchrum esse inimicos ulcisci'
In [6]: remove_non_latin(text, also_keep=['.', ','])
Out[6]: ' Dices pulchrum esse inimicos ulcisci.'
Transliteration¶
The CLTK provides IPA phonetic transliteration for the Latin language. Currently, the only available dialect is Classical as reconstructed by W. Sidney Allen (taken from Vox Latina, 85-103). Example:
In [1]: from cltk.phonology.latin.transcription import Transcriber
In [2]: transcriber = Transcriber(dialect="Classical", reconstruction="Allen")
In [3]: transcriber.transcribe("Quo usque tandem, O Catilina, abutere nostra patientia?")
Out[3]: "['kʷoː 'ʊs.kʷɛ 't̪an̪.d̪ẽː 'oː ka.t̪ɪ.'liː.n̪aː a.buː.'t̪eː.rɛ 'n̪ɔs.t̪raː pa.t̪ɪ̣.'jɛn̪.t̪ɪ̣.ja]"
Word Tokenization¶
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: word_tokenizer = WordTokenizer('latin')
In [3]: text = 'atque haec abuterque puerve paterne nihil'
In [4]: word_tokenizer.tokenize(text)
Out[4]: ['atque', 'haec', 'abuter', '-que', 'puer', '-ve', 'pater', '-ne', 'nihil']
Word2Vec¶
Note
The Word2Vec models have not been fully vetted and are offered in the spirit of a beta. The CLTK’s API for it will be revised.
Note
You will need to install Gensim to use these features.
Word2Vec is a vector space model especially powerful for comparing words in relation to each other. For instance, it is commonly used to discover words which appear in similar contexts (something akin to synonyms; think of them as lexical clusters).
The CLTK repository contains pre-trained Word2Vec models for Latin (import as latin_word2vec_cltk), one lemmatized and the other not. They were trained on the PHI5 corpus. To train your own, see the README at the Latin Word2Vec repository.
One of the most common uses of Word2Vec is as a keyword expander. Keyword expansion is the taking of a query term, finding synonyms, and searching for those, too. Here’s an example of its use:
In [1]: from cltk.ir.query import search_corpus
In [2]: for x in search_corpus('amicitia', 'phi5', context='sentence', case_insensitive=True, expand_keyword=True, threshold=0.25):
print(x)
...:
Out[2]: The following similar terms will be added to the 'amicitia' query: '['societate', 'praesentia', 'uita', 'sententia', 'promptu', 'beneuolentia', 'dignitate', 'monumentis', 'somnis', 'philosophia']'.
('L. Iunius Moderatus Columella', 'hospitem, nisi ex *amicitia* domini, quam raris-\nsime recipiat.')
('L. Iunius Moderatus Columella', ' \n Xenophon Atheniensis eo libro, Publi Siluine, qui Oeconomicus \ninscribitur, prodidit maritale coniugium sic comparatum esse \nnatura, ut non solum iucundissima, uerum etiam utilissima uitae \nsocietas iniretur: nam primum, quod etiam Cicero ait, ne genus \nhumanum temporis longinquitate occideret, propter \nhoc marem cum femina esse coniunctum, deinde, ut ex \nhac eadem *societate* mortalibus adiutoria senectutis nec \nminus propugnacula praeparentur.')
('L. Iunius Moderatus Columella', 'ac ne ista quidem \npraesidia, ut diximus, non adsiduus labor et experientia \nuilici, non facultates ac uoluntas inpendendi tantum pollent \nquantum uel una *praesentia* domini, quae nisi frequens \noperibus interuenerit, ut in exercitu, cum abest imperator, \ncuncta cessant officia.')
['…']
threshold is the closeness of the query term to its neighboring words. Note that when expand_keyword=True, the search term will be stripped of any regular expression syntax.
The keyword expander leverages get_sims() (which in turn leverages functionality of the Gensim package) to find similar terms. Some examples of it in action:
In [3]: from cltk.vector.word2vec import get_sims
In [4]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.7)
Matches found, but below the threshold of 'threshold=0.7'. Lower it to see these results.
Out[4]: []
In [5]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.2)
Out[5]:
['lictor',
'extemplo',
'cena',
'nuntio',
'aduenio',
'iniussus2',
'forum',
'dictator',
'fabium',
'caesarem']
In [6]: get_sims('iube', 'latin', lemmatized=True, threshold=0.7)
Out[6]: "word 'iube' not in vocabulary"
The following terms in the Word2Vec model you may be looking for: '['iubet”', 'iubet', 'iubilo', 'iubĕ', 'iubar', 'iubes', 'iubatus', 'iuba1', 'iubeo']'.
In [7]: get_sims('dictator', 'latin', lemmatized=False, threshold=0.7)
Out[7]:
['consul',
'caesar',
'seruilius',
'praefectus',
'flaccus',
'manlius',
'sp',
'fuluius',
'fabio',
'ualerius']
To add and subtract vectors, you need to load the models yourself with Gensim.
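A minimal sketch of loading a model with Gensim and doing vector arithmetic via most_similar(). The model filename below is hypothetical (it depends on which pre-trained Latin model you downloaded and where the corpus importer placed it), and the query lemmas are illustrative; they may not all be in the model's vocabulary.
import os
from gensim.models import Word2Vec

# Hypothetical path; point this at whichever downloaded model file you want to use.
model_path = os.path.expanduser('~/cltk_data/latin/model/latin_word2vec_cltk/latin_lemmed.model')
model = Word2Vec.load(model_path)

# 'king' + 'woman' - 'man', expressed with (illustrative) Latin lemmas.
print(model.wv.most_similar(positive=['rex', 'femina'], negative=['vir'], topn=5))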
Malayalam¶
Malayalam is a language spoken in India, predominantly in the state of Kerala. Malayalam originated from Middle Tamil (Sen-Tamil) in the 7th century. An alternative theory proposes a split in even more ancient times. Malayalam incorporated many elements from Sanskrit through the ages. Many medieval liturgical texts were written in an admixture of Sanskrit and early Malayalam, called Manipravalam. The oldest literary work in Malayalam, distinct from the Tamil tradition, is dated from between the 9th and 11th centuries. (Source: Wikipedia)
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with malayalam_) to discover available Malayalam corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('malayalam')
In [3]: c.list_corpora
Out[3]:
['malayalam_text_gretil']
Marathi¶
Marathi is an Indian language spoken predominantly by the Marathi people of Maharashtra. Marathi has some of the oldest literature of all modern Indo-Aryan languages, dating from about 900 AD. Early Marathi literature, written during the Yadava period (850-1312 CE), was mostly religious and philosophical in nature. Dnyaneshwar (1275–1296) was the first Marathi literary figure who had wide readership and profound influence. His major works are Amrutanubhav and Bhavarth Deepika (popularly known as Dnyaneshwari), a 9000-couplet long commentary on the Bhagavad Gita. (Source: Wikipedia)
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with Marathi_) to discover available Marathi corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('Marathi')
In [3]: c.list_corpora
Out[3]:
['Marathi_text_wikisource']
Tokenizer¶
In [1]: from cltk.tokenize.sentence import TokenizeSentence
In [2]: tokenizer = TokenizeSentence('marathi')
In [3]: sentence = "आतां विश्वात्मके देवे, येणे वाग्यज्ञे तोषावे, तोषोनि मज द्यावे, पसायदान हे"
In [4]: tokenized_sentence = tokenizer.tokenize(sentence)
In [5]: print(tokenized_sentence)
['आतां', 'विश्वात्मके', 'देवे', ',', 'येणे', 'वाग्यज्ञे', 'तोषावे', ',', 'तोषोनि', 'मज', 'द्यावे', ',', 'पसायदान', 'हे']
Stopwords¶
Stop words of Classical Marathi, calculated from the “Dnyaneshwari” and the “Haripath”.
In [1]: from cltk.stop.marathi.stops import STOP_LIST
In [2]: print(STOP_LIST[1])
"तरी"
Alphabet¶
The alphabet of the Marathi language is placed in cltk/corpus/marathi/alphabet.py.
In [1]: from cltk.corpus.marathi.alphabet import DIGITS
In [2]: print(DIGITS)
['०', '१', '२', '३', '४', '५', '६', '७', '८', '९']
There are 13 vowels in Marathi. All vowels have their independent form and a matra form, which are used for modifying consonants: VOWELS = ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'अॅ', 'ऑ']
.
The International Alphabet of Sanskrit Transliteration (I.A.S.T.) is a transliteration scheme that allows the lossless romanization of Indic scripts as employed by Sanskrit and related Indic languages. IAST makes it possible for the reader to read the Indic text unambiguously, exactly as if it were in the original Indic script. The vowels would be represented thus: IAST_REPRESENTATION_VOWELS = ['a', 'ā', 'i', 'ī', 'u', 'ū', 'ṛ', 'e', 'ai', 'o', 'au', 'ae', 'ao']
.
In [1]: from cltk.corpus.marathi.alphabet import VOWELS
In [2]: VOWELS
Out[2]: ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'अॅ', 'ऑ']
In [3]: from cltk.corpus.marathi.alphabet import IAST_REPRESENTATION_VOWELS
In [4]: IAST_REPRESENTATION_VOWELS
Out[4]: ['a', 'ā', 'i', 'ī', 'u', 'ū', 'ṛ', 'e', 'ai', 'o', 'au', 'ae', 'ao']
Similarly, we can import other vowels and consonants. There are 25 regular consonants (consonants that stop air from moving out of the mouth) in Marathi, and they are organized into groups (“vargas”) of five. The vargas are ordered according to where the tongue is in the mouth. Each successive varga refers to a successively forward position of the tongue. The vargas are ordered and named thus (with an example of a corresponding consonant):
- Velar: A velar consonant is a consonant that is pronounced with the back part of the tongue against the soft palate, also known as the velum, which is the back part of the roof of the mouth (e.g., k).
- Palatal: A palatal consonant is a consonant that is pronounced with the body (the middle part) of the tongue against the hard palate (which is the middle part of the roof of the mouth) (e.g., j).
- Retroflex: A retroflex consonant is a coronal consonant where the tongue has a flat, concave, or even curled shape, and is articulated between the alveolar ridge and the hard palate (e.g., English t).
- Dental: A dental consonant is a consonant articulated with the tongue against the upper teeth (e.g., Spanish t).
- Labial: Labials or labial consonants are articulated or made with the lips (e.g., p).
VELAR_CONSONANTS = ['क', 'ख', 'ग', 'घ', 'ङ']
PALATAL_CONSONANTS = ['च', 'छ', 'ज', 'झ', 'ञ']
RETROFLEX_CONSONANTS = ['ट','ठ', 'ड', 'ढ', 'ण']
DENTAL_CONSONANTS = ['त', 'थ', 'द', 'ध', 'न']
LABIAL_CONSONANTS = ['प', 'फ', 'ब', 'भ', 'म']
IAST_VELAR_CONSONANTS = ['k', 'kh', 'g', 'gh', 'ṅ']
IAST_PALATAL_CONSONANTS = ['c', 'ch', 'j', 'jh', 'ñ']
IAST_RETROFLEX_CONSONANTS = ['ṭ', 'ṭh', 'ḍ', 'ḍh', 'ṇ']
IAST_DENTAL_CONSONANTS = ['t', 'th', 'd', 'dh', 'n']
IAST_LABIAL_CONSONANTS = ['p', 'ph', 'b', 'bh', 'm']
There are four semi-vowels in Marathi:
SEMI_VOWELS = ['य', 'र', 'ल', 'व']
IAST_SEMI_VOWELS = ['y', 'r', 'l', 'w']
There are three sibilants in Marathi:
SIBILANTS = ['श', 'ष', 'स']
IAST_SIBILANTS = ['ś', 'ṣ', 's']
There is one fricative consonant in Marathi:
FRIACTIVE_CONSONANTS = ['ह']
IAST_FRIACTIVE_CONSONANTS = ['h']
There are three additional consonants:
ADDITIONAL_CONSONANTS = ['ळ', 'क्ष', 'ज्ञ']
IAST_ADDITIONAL_CONSONANTS = ['La', 'kSha', 'dnya']
Multilingual¶
Some functions in the CLTK are language independent.
Concordance¶
Note
This is a new feature. Advice regarding readability is encouraged!
The philology
module can produce a concordance. Currently there are two methods that write a concordance to file, one which takes one or more paths and another which takes a text string. Texts in Latin characters are alphabetized.
In [1]: from cltk.utils import philology
In [2]: iliad = '~/cltk_data/greek/text/tlg/individual_works/TLG0012.TXT-001.txt'
In [3]: philology.write_concordance_from_file(iliad, 'iliad')
This will write a traditional, human-readable concordance of about 120,000 lines to ~/cltk_data/user_data/concordance_iliad.txt
.
Multiple files can be passed as a list into this method.
In [5]: odyssey = '~/cltk_data/greek/text/tlg/individual_works/TLG0012.TXT-002.txt'
In [6]: philology.write_concordance_from_file([iliad, odyssey], 'homer')
This creates the file ~/cltk_data/user_data/concordance_homer.txt
.
write_concordance_from_string()
takes a string and will build the concordance from it.
In [7]: from cltk.corpus.utils.formatter import phi5_plaintext_cleanup
In [8]: import os
In [9]: tibullus = os.path.expanduser('~/cltk_data/latin/text/phi5/plaintext/LAT0660.TXT')
In [10]: with open(tibullus) as f:
....: tib_read = f.read()
In [10]: tib_clean = phi5_plaintext_cleanup(tib_read).lower()
In [11]: philology.write_concordance_from_string(tib_clean, 'tibullus')
The resulting concordance looks like:
modulatus eburno felices cantus ore sonante dedit. sed postquam fuerant digiti cum voce locuti , edidit haec tristi dulcia verba modo : 'salve , cura
caveto , neve cubet laxo pectus aperta sinu , neu te decipiat nutu , digitoque liquorem ne trahat et mensae ducat in orbe notas. exibit quam saepe ,
acerbi : non ego sum tanti , ploret ut illa semel. nec lacrimis oculos digna est foedare loquaces : lena nocet nobis , ipsa puella bona est. lena ne
eaera tamen.” “carmine formosae , pretio capiuntur avarae. gaudeat , ut digna est , versibus illa tuis. lutea sed niveum involvat membrana libellum ,
umnus olympo mille habet ornatus , mille decenter habet. sola puellarum digna est , cui mollia caris vellera det sucis bis madefacta tyros , possidea
velim , sed peccasse iuvat , voltus conponere famae taedet : cum digno digna fuisse ferar. invisus natalis adest , qui rure molesto et sine cerintho
a phoebe superbe lyra. hoc sollemne sacrum multos consummet in annos : dignior est vestro nulla puella choro. parce meo iuveni , seu quis bona pascua
a para. sic bene conpones : ullae non ille puellae servire aut cuiquam dignior illa viro. nec possit cupidos vigilans deprendere custos , fallendique
Corpora¶
The CLTK organizes its data by language; however, some good corpora do not and cannot be easily broken apart. Furthermore, some, such as parallel text corpora, are inherently multilingual. Use CorpusImporter()
or browse the CLTK GitHub organization (anything beginning with multilingual_
) to discover available multilingual corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('multilingual')
In [3]: c.list_corpora
Out[3]: ['multilingual_treebank_proiel']
Information Retrieval (regex, keyword expansion)¶
Tip
To begin working with regular expressions, try Pythex, a handy tool for developing patterns. For more thorough lessons, try Learn Regex The Hard Way.
Tip
Read about Word2Vec for Latin or Greek for the powerful keyword expansion functionality.
Several functions are available for querying text in order to match regular expression patterns. match_regex()
is the most basic. Punctuation rules are included for texts using Latin sentence–final punctuation (‘.’, ‘!’, ‘?’) and Greek (‘.’, ‘;’). For returned strings, you may choose between a context of the match’s sentence, paragraph, or custom number of characters on each side of a hit. Note that this function and the next each return a generator.
Here is an example in Latin with a sentence context, case-insensitive:
In [1]: from cltk.ir.query import match_regex
In [2]: text = 'Ita fac, mi Lucili; vindica te tibi. et tempus, quod adhuc aut auferebatur aut subripiebatur aut excidebat, collige et serva.'
In [3]: matches = match_regex(text, r'tempus', language='latin', context='sentence', case_insensitive=True)
In [4]: for match in matches:
   ...:     print(match)
   ...:
et *tempus*, quod adhuc aut auferebatur aut subripiebatur aut excidebat, collige et serva.
And here with context of 40 characters:
In [5]: matches = match_regex(text, r'tempus', language='latin', context=40, case_insensitive=True)
In [6]: for match in matches:
   ...:     print(match)
   ...:
Ita fac, mi Lucili; vindica te tibi. et *tempus*, quod adhuc aut auferebatur aut subripi
For querying the entirety of a corpus, see search_corpus()
, which returns a tuple of ('author_name', 'match_context')
.
In [7]: from cltk.ir.query import search_corpus
In [8]: for match in search_corpus('ὦ ἄνδρες Ἀθηναῖοι', 'tlg', context='sentence'):
   ...:     print(match)
   ...:
('Ammonius Phil.', ' \nκαλοῦντας ἑτέρους ἢ προστάσσοντας ἢ ἐρωτῶντας ἢ εὐχομένους περί τινων, \nπολλάκις δὲ καὶ αὐτοπροσώπως κατά τινας τῶν ἐνεργειῶν τούτων ἐνεργοῦ-\nσι “πρῶτον μέν, *ὦ ἄνδρες Ἀθηναῖοι*, τοῖς θεοῖς εὔχομαι πᾶσι καὶ πάσαις” \nλέγοντες ἢ “ἀπόκριναι γὰρ δεῦρό μοι ἀναστάς”. οἱ οὖν περὶ τῶν τεχνῶν \nτούτων πραγματευόμενοι καὶ τοὺς λόγους εἰς θεωρίαν ')
('Sopater Rhet.', "θόντα, ἢ συγγνωμονηκέναι καὶ ἐλεῆσαι. ψυχῆς γὰρ \nπάθος ἐπὶ συγγνώμῃ προτείνεται. παθητικὴν οὖν ποιή-\nσῃ τοῦ πρώτου προοιμίου τὴν ἔννοιαν: ἁπάντων, ὡς ἔοι-\nκεν, *ὦ ἄνδρες Ἀθηναῖοι*, πειρασθῆναί με τῶν παραδό-\nξων ἀπέκειτο, πόλιν ἰδεῖν ἐν μέσῃ Βοιωτίᾳ κειμένην. καὶ \nμετὰ Θήβας οὐκ ἔτ' οὔσας, ὅτι μὴ στεφανοῦντας Ἀθη-\nναίους ἀπέδειξα παρὰ τὴ")
…
Information Retrieval (boolean)¶
Note
The API for the CLTK index and query will likely change. Consider this module an alpha. Please report improvements or problems.
An index to a corpus allows for faster, and sometimes more nuanced, searches. The CLTK has built some indexing and querying functionality with the Whoosh library. The following shows how to make an index and then query it:
First, ensure that you have imported and converted the PHI5 or TLG disks. If you want to use the author chunking, convert with convert_corpus()
, but for searching by work, convert with divide_works()
. CLTKIndex()
has an optional argument chunk
, which defaults to chunk='author'
. chunk='work'
is also available.
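Building the index itself might look like the following. This is a sketch only: the index_corpus() method name is an assumption here, so check the cltk.ir.boolean module of your installed version.

from cltk.ir.boolean import CLTKIndex

# Index the converted PHI5 corpus by individual work; this only needs to be run once
# and may take a while for a large corpus.
cltk_index = CLTKIndex('latin', 'phi5', chunk='work')
cltk_index.index_corpus()  # assumed method name; verify against your CLTK version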
An index only needs to be made once. Then it can be queried with, e.g.:
In [1]: from cltk.ir.boolean import CLTKIndex
In [2]: cltk_index = CLTKIndex('latin', 'phi5', chunk='work')
In [3]: results = cltk_index.corpus_query('amicitia')
In [4]: results[:500]
Out[4]: 'Docs containing hits: 836.</br></br>Marcus Tullius Cicero, Cicero, Tully</br>/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0474.TXT-052.TXT</br>Approximate hits: 132.</br>LAELIUS DE <b class="match term0">AMICITIA</b> LIBER </br> AD T. POMPONIUM ATTICUM.} </br> Q. Mucius augur multa narrare...incidisset, exposuit </br>nobis sermonem Laeli de <b class="match term1">amicitia</b> habitum ab illo </br>secum et cum altero genero, C. Fannio...videretur. </br> Cum enim saepe me'
The function returns a string in HTML markup, which you can then parse yourself.
To save results, use the save_file
parameter:
In [4]: cltk_index.corpus_query('amicitia', save_file='2016_amicitia')
This will save a file at ~/cltk_data/user_data/search/2016_amicitia.html
, being a human-readable output with word-matches highlighted, of all authors (or texts, if chunk='work'
).
Lemmatization, backoff and others¶
CLTK offers ‘multiplex’ lemmatization, i.e. a series of lexicon-, rules-, or training-data-based lemmatizers that can be chained together. The multiplex lemmatizers are based on backoff POS tagging in NLTK:
1. with Backoff lemmatization, tagging stops at the first successful instance of tagging by a sub-tagger;
2. with Ensemble lemmatization, tagging continues through the entire sequence, all possible lemmas are returned, and a method for scoring/selecting from possible lemmas can be specified.
All of the examples below are in Latin, but these lemmatizers are language-independent (at least, where lemmatization is a meaningful NLP task) and can be made language-specific by providing different training sentences, regex patterns, etc.
The backoff module offers DefaultLemmatizer which returns the same “lemma” for all tokens:
In [1]: from cltk.lemmatize.backoff import DefaultLemmatizer
In [2]: lemmatizer = DefaultLemmatizer()
In [3]: tokens = ['Quo', 'usque', 'tandem', 'abutere', ',', 'Catilina', ',', 'patientia', 'nostra', '?']
In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('Quo', None), ('usque', None), ('tandem', None), ('abutere', None), (',', None), ('Catilina', None), (',', None), ('patientia', None), ('nostra', None), ('?', None)]
DefaultLemmatizer can take as a parameter what “lemma” should be returned:
In [5]: lemmatizer = DefaultLemmatizer('UNK')
In [6]: lemmatizer.lemmatize(tokens)
Out[6]: [('Quo', 'UNK'), ('usque', 'UNK'), ('tandem', 'UNK'), ('abutere', 'UNK'), (',', 'UNK'), ('Catilina', 'UNK'), (',', 'UNK'), ('patientia', 'UNK'), ('nostra', 'UNK'), ('?', 'UNK')]
The backoff module also offers IdentityLemmatizer which returns the given token as the lemma:
In [7]: from cltk.lemmatize.backoff import IdentityLemmatizer
In [8]: lemmatizer = IdentityLemmatizer()
In [9]: lemmatizer.lemmatize(tokens)
Out[9]: [('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutere'), (',', ','), ('Catilina', 'Catilina'), (',', ','), ('patientia', 'patientia'), ('nostra', 'nostra'), ('?', '?')]
With the DictLemmatizer, the backoff module allows you to provide a dictionary of the form {‘TOKEN1’: ‘LEMMA1’, ‘TOKEN2’: ‘LEMMA2’} for lemmatization.
In [10]: tokens = ['arma', 'uirum', '-que', 'cano', ',', 'troiae', 'qui', 'primus', 'ab', 'oris']
In [11]: lemmas = {'arma': 'arma', 'uirum': 'uir', 'troiae': 'troia', 'oris': 'ora'}
In [12]: from cltk.lemmatize.backoff import DictLemmatizer
In [13]: lemmatizer = DictLemmatizer(lemmas=lemmas)
In [14]: lemmatizer.lemmatize(tokens)
Out[14]: [('arma', 'arma'), ('uirum', 'uir'), ('-que', None), ('cano', None), (',', None), ('troiae', 'troia'), ('qui', None), ('primus', None), ('ab', None), ('oris', 'ora')]
The DictLemmatizer—like all of the lemmatizers in this module—can take a second lemmatizer (or backoff lemmatizer) for any of the tokens that return ‘None’. This is done with a ‘backoff’ parameter:
In [15]: default = DefaultLemmatizer('UNK')
In [16]: lemmatizer = DictLemmatizer(lemmas=lemmas, backoff=default)
In [17]: lemmatizer.lemmatize(tokens)
Out[17]: [('arma', 'arma'), ('uirum', 'uir'), ('-que', 'UNK'), ('cano', 'UNK'), (',', 'UNK'), ('troiae', 'troia'), ('qui', 'UNK'), ('primus', 'UNK'), ('ab', 'UNK'), ('oris', 'ora')]
These lemmatizers also have a verbose mode that returns the specific tagger used for each lemma returned.
In [18]: default = DefaultLemmatizer('UNK', verbose=True)
In [19]: lemmatizer = DictLemmatizer(lemmas=lemmas, backoff=default, verbose=True)
In [20]: lemmatizer.lemmatize(tokens)
Out[20]: [('arma', 'arma', "<DictLemmatizer: {'arma': 'arma', ...}>"), ('uirum', 'uir', "<DictLemmatizer: {'arma': 'arma', ...}>"), ('-que', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('cano', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), (',', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('troiae', 'troia', "<DictLemmatizer: {'arma': 'arma', ...}>"), ('qui', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('primus', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('ab', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('oris', 'ora', "<DictLemmatizer: {'arma': 'arma', ...}>")]
You can provide a name for the data source to make the verbose output clearer:
In [21]: default = DefaultLemmatizer('UNK')
In [22]: lemmatizer = DictLemmatizer(lemmas=lemmas, source="CLTK Docs Example", backoff=default, verbose=True)
In [23]: lemmatizer.lemmatize(tokens)
Out[23]: [('arma', 'arma', '<DictLemmatizer: CLTK Docs Example>'), ('uirum', 'uir', '<DictLemmatizer: CLTK Docs Example>'), ('-que', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('cano', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), (',', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('troiae', 'troia', '<DictLemmatizer: CLTK Docs Example>'), ('qui', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('primus', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('ab', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('oris', 'ora', '<DictLemmatizer: CLTK Docs Example>')]
With the UnigramLemmatizer, the backoff module allows you to provide training data of the form [[('TOKEN1', 'LEMMA1'), ('TOKEN2', 'LEMMA2')], [('TOKEN3', 'LEMMA3'), ('TOKEN4', 'LEMMA4')], … ], i.e. a list of sentences, each a list of (token, lemma) tuples. The lemmatizer returns the lemma that has the highest frequency based on the training sentences. So, for example, if the tuple ('est', 'sum') appears in the training sentences 99 times and ('est', 'edo') appears 1 time, the lemmatizer would return the lemma 'sum'.
Here is an example of the UnigramLemmatizer():
In [24]: train_data = [[('cum', 'cum2'), ('esset', 'sum'), ('caesar', 'caesar'), ('in', 'in'), ('citeriore', 'citer'), ('gallia', 'gallia'), ('in', 'in'), ('hibernis', 'hibernus'), (',', 'punc'), ('ita', 'ita'), ('uti', 'ut'), ('supra', 'supra'), ('demonstrauimus', 'demonstro'), (',', 'punc'), ('crebri', 'creber'), ('ad', 'ad'), ('eum', 'is'), ('rumores', 'rumor'), ('adferebantur', 'affero'), ('litteris', 'littera'), ('-que', '-que'), ('item', 'item'), ('labieni', 'labienus'), ('certior', 'certus'), ('fiebat', 'fio'), ('omnes', 'omnis'), ('belgas', 'belgae'), (',', 'punc'), ('quam', 'qui'), ('tertiam', 'tertius'), ('esse', 'sum'), ('galliae', 'gallia'), ('partem', 'pars'), ('dixeramus', 'dico'), (',', 'punc'), ('contra', 'contra'), ('populum', 'populus'), ('romanum', 'romanus'), ('coniurare', 'coniuro'), ('obsides', 'obses'), ('-que', '-que'), ('inter', 'inter'), ('se', 'sui'), ('dare', 'do'), ('.', 'punc')], [('coniurandi', 'coniuro'), ('has', 'hic'), ('esse', 'sum'), ('causas', 'causa'), ('primum', 'primus'), ('quod', 'quod'), ('uererentur', 'uereor'), ('ne', 'ne'), (',', 'punc'), ('omni', 'omnis'), ('pacata', 'paco'), ('gallia', 'gallia'), (',', 'punc'), ('ad', 'ad'), ('eos', 'is'), ('exercitus', 'exercitus'), ('noster', 'noster'), ('adduceretur', 'adduco'), (';', 'punc')]]
In [25]: default = DefaultLemmatizer('UNK')
In [26]: from cltk.lemmatize.backoff import UnigramLemmatizer
In [26]: lemmatizer = UnigramLemmatizer(train=train_data, backoff=default)
In [27]: lemmatizer.lemmatize(tokens)
Out[27]: [('arma', 'UNK'), ('uirum', 'UNK'), ('-que', '-que'), ('cano', 'UNK'), (',', 'punc'), ('troiae', 'UNK'), ('qui', 'UNK'), ('primus', 'UNK'), ('ab', 'UNK'), ('oris', 'UNK')]
There is also a regular-expression based lemmatizer that uses a tuple with substitution patterns to return lemmas:
In [28]: regexps = [(r'(.)tat(is|i|em|e|es|um|ibus)$', r'\1tas'), (r'(.)ion(is|i|em|e|es|um|ibus)$', r'\1io'), (r'(.)av(i|isti|it|imus|istis|erunt|)$', r'\1o')]
In [29]: tokens = "iam a principio nobilitatis factionem disturbavit".split()
In [30]: from cltk.lemmatize.backoff import RegexpLemmatizer
In [31]: lemmatizer = RegexpLemmatizer(regexps=regexps)
In [32]: lemmatizer.lemmatize(tokens)
Out[32]: [('iam', None), ('a', None), ('principio', None), ('nobilitatis', 'nobilitas'), ('factionem', 'factio'), ('disturbavit', 'disturbo')]
Ensemble lemmatizers are constructed in a similar manner, but all sub-lemmatizers return tags. A selection mechanism can be applied to the output. (NB: Selection and scoring mechanisms for use with the Ensemble Lemmatizer are under development.)
In [33]: from cltk.lemmatize.ensemble import EnsembleDictLemmatizer, EnsembleUnigramLemmatizer, EnsembleRegexpLemmatizer
In [34]: patterns = [(r'\b(.+)(o|is|it|imus|itis|unt)\b', r'\1o'), (r'\b(.+)(o|as|at|amus|atis|ant)\b', r'\1o')]
In [35]: tokens = "arma virumque cano qui".split()
In [36]: EDL = EnsembleDictLemmatizer(lemmas={'cano': 'cano'}, source='EDL', verbose=True)
In [37]: EUL = EnsembleUnigramLemmatizer(train=[[('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano')], [('arma', 'arma'), ('virumque', 'virus'), ('cano', 'canus')], [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'canis')], [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano')]], verbose=True, backoff=EDL)
In [38]: ERL = EnsembleRegexpLemmatizer(regexps=patterns, source='Latin Regex Patterns', verbose=True, backoff=EUL)
In [39]: ERL.lemmatize(tokens, lemmas_only=True)
Out[39]: [['arma'], ['vir', 'virus'], ['canis', 'cano', 'canus'], []]
N–grams¶
In [1]: from nltk.tokenize.punkt import PunktLanguageVars
In [2]: from nltk.util import bigrams
In [3]: from nltk.util import trigrams
In [4]: from nltk.util import ngrams
In [5]: s = 'Ut primum nocte discussa sol novus diem fecit, et somno simul emersus et lectulo, anxius alioquin et nimis cupidus cognoscendi quae rara miraque sunt, reputansque me media Thessaliae loca tenere qua artis magicae nativa cantamina totius orbis consono orbe celebrentur fabulamque illam optimi comitis Aristomenis de situ civitatis huius exortam, suspensus alioquin et voto simul et studio, curiose singula considerabam. Nec fuit in illa civitate quod aspiciens id esse crederem quod esset, sed omnia prorsus ferali murmure in aliam effigiem translata, ut et lapides quos offenderem de homine duratos et aves quas audirem indidem plumatas et arbores quae pomerium ambirent similiter foliatas et fontanos latices de corporibus humanis fluxos crederem; iam statuas et imagines incessuras, parietes locuturos, boves et id genus pecua dicturas praesagium, de ipso vero caelo et iubaris orbe subito venturum oraculum.'.lower()
In [6]: p = PunktLanguageVars()
In [7]: tokens = p.word_tokenize(s)
In [8]: b = bigrams(tokens)
In [8]: [x for x in b]
Out[8]: [('ut', 'primum'), ('primum', 'nocte'), ('nocte', 'discussa'), ('discussa', 'sol'), ('sol', 'novus'), ('novus', 'diem'), ...]
In [9]: t = trigrams(tokens)
In [9]: [x for x in t]
Out[9]: [('ut', 'primum', 'nocte'), ('primum', 'nocte', 'discussa'), ('nocte', 'discussa', 'sol'), ('discussa', 'sol', 'novus'), ('sol', 'novus', 'diem'), …]
In [10]: five_gram = ngrams(tokens, 5)
In [11]: [x for x in five_gram]
Out[11]: [('ut', 'primum', 'nocte', 'discussa', 'sol'), ('primum', 'nocte', 'discussa', 'sol', 'novus'), ('nocte', 'discussa', 'sol', 'novus', 'diem'), ('discussa', 'sol', 'novus', 'diem', 'fecit'), ('sol', 'novus', 'diem', 'fecit', ','), ('novus', 'diem', 'fecit', ',', 'et'), …]
Normalization¶
If you are working with texts from different resources, it is likely a good idea to normalize them before further processing (such as string comparison). The CLTK provides a wrapper to Python's built-in unicodedata.normalize(). Here's an example of its use in “compatibility” mode (NFKC):
In [1]: from cltk.corpus.utils.formatter import cltk_normalize
In [2]: tonos = "ά"
In [3]: oxia = "ά"
In [4]: tonos == oxia
Out[4]: False
In [5]: tonos == cltk_normalize(oxia)
Out[5]: True
One can turn off compatibility with:
In [6]: tonos == cltk_normalize(oxia, compatibility=False)
Out[6]: True
For more on normalize()
see the Python Unicode docs.
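If you prefer to call the standard library directly, the underlying operation looks like this (the escape sequences below spell out the two codepoints that render identically):

import unicodedata

tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS
oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA
# NFKC normalization maps the oxia form onto the tonos form, so the strings compare equal.
print(unicodedata.normalize('NFKC', oxia) == tonos)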
Skipgrams¶
The NLTK has a handy skipgram function. Use it like this:
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: from nltk.util import skipgrams
In [3]: text = 'T. Pomponis Atticus, ab origine ultima stirpis Romanae generatus, \
...: perpetuo a maioribus acceptam equestrem obtinuit dignitatem.'
In [4]: word_tokenizer = WordTokenizer('latin')
In [5]: unigrams = word_tokenizer.tokenize(text)
In [6]: for ngram in skipgrams(unigrams, 3, 5):
...: print(ngram)
...:
('T.', 'Pomponis', 'Atticus')
('T.', 'Pomponis', ',')
('T.', 'Pomponis', 'ab')
('T.', 'Pomponis', 'origine')
('T.', 'Pomponis', 'ultima')
('T.', 'Pomponis', 'stirpis')
('T.', 'Atticus', ',')
('T.', 'Atticus', 'ab')
('T.', 'Atticus', 'origine')
('T.', 'Atticus', 'ultima')
…
('equestrem', 'obtinuit', '.')
('equestrem', 'dignitatem', '.')
('obtinuit', 'dignitatem', '.')
The first parameter is the length of the output n-gram and the second parameter is how many tokens to skip.
The NLTK’s skipgrams()
produces a generator whose values can be turned into a list like so:
In [8]: list(skipgrams(unigrams, 3, 5))
Out[8]:
[('T.', 'Pomponis', 'Atticus'),
('T.', 'Pomponis', ','),
('T.', 'Pomponis', 'ab'),
…
('equestrem', 'dignitatem', '.'),
('obtinuit', 'dignitatem', '.')]
Stoplist Construction¶
The Stop
module offers an abstract class for constructing stoplists: BaseCorpusStoplist
.
Child classes must implement vectorizer and tfidf_vectorizer. For now, only Latin and Classical Chinese have implemented a child class of BaseCorpusStoplist (named CorpusStoplist).
Parameters control the size of the stoplist and the criterion used to select stop words: the basis parameter weights words within the collection using different measures. The bases currently available are: frequency, mean (mean probability), variance (variance probability), entropy (entropy), and zou (a composite measure based on mean, variance, and entropy, as described in [Zou 2006]).
Other parameters for both StringStoplist
and CorpusStoplist
include boolean preprocessing options (lower
, remove_numbers
, remove_punctuation
) and override lists of words to add or subtract from stoplists (include
, exclude
).
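As a minimal sketch (assuming the Latin implementation is exposed as CorpusStoplist in cltk.stop.latin and offers a build_stoplist() method; check your installed version for the exact module layout):

from cltk.stop.latin import CorpusStoplist  # assumed location of the Latin child class

# Two toy documents; in practice, pass real texts from your corpus.
documents = [
    'amicitia inter bonos firma est',
    'nihil est amicitia dulcius',
]

stoplist_builder = CorpusStoplist()
# Build a 5-word stoplist, weighting words with the composite 'zou' basis.
stops = stoplist_builder.build_stoplist(documents, size=5, basis='zou')
print(stops)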
Syllabification¶
CLTK provides a language-agnostic syllabifier module as part of phonology
. The syllabifier works by following the Sonority Sequencing Principle. The default phonetic scale (from most to least sonorous):
low vowels > mid vowels > high vowels > flaps > laterals > nasals > fricatives > plosives
In [1]: from cltk.phonology.syllabify import Syllabifier
In [2]: high_vowels = ['a']
In [3]: mid_vowels = ['e']
In [4]: low_vowels = ['i', 'u']
In [5]: flaps = ['r']
In [6]: nasals = ['m', 'n']
In [7]: fricatives = ['f']
In [8]: s = Syllabifier(high_vowels=high_vowels, mid_vowels=mid_vowels, low_vowels=low_vowels, flaps=flaps, nasals=nasals, fricatives=fricatives)
In [9]: s.syllabify("feminarum")
Out[9]: ['fe', 'mi', 'na', 'rum']
Additionally, you can override the default sonority hierarchy by calling set_hierarchy
. However, you must also re-define the
vowel list for the nuclei to be correctly identified.
In [10]: s = Syllabifier()
In [11]: s.set_hierarchy([['i', 'u'], ['e'], ['a'], ['r'], ['m', 'n'], ['f']])
In [12]: s.set_vowels(['i', 'u', 'e', 'a'])
In [13]: s.syllabify('feminarum')
Out[13]: ['fe', 'mi', 'na', 'rum']
For a language-dependent approach, you can call the predefined sonority dictionary by setting the language
parameter:
In [14]: s = Syllabifier(language='middle high german')
In [15]: s.syllabify('lobebæren')
Out[15]: ['lo', 'be', 'bæ', 'ren']
Text Reuse¶
The text reuse module offers a few tools to get started with studying text reuse (i.e., allusion and intertext). The major goals of this module are to leverage conventional text reuse strategies and to create comparison methods designed specifically for the languages of the corpora included in the CLTK.
This module is under active development, so if you experience a bug or have a suggestion for something to include, please create an issue on GitHub.
Levenshtein distance calculation¶
Note
You will need to install two packages to use Levenshtein measures. Install them with pip install fuzzywuzzy python-Levenshtein. python-Levenshtein is optional but gives speed improvements.
The Levenshtein distance comparison is a commonly-used method for fuzzy string comparison. The CLTK Levenshtein class offers a few helpers for getting started with creating comparisons between documents.
This simple example compares a line from Vergil’s Georgics with a line from Propertius (Elegies III.13.41):
In [1]: from cltk.text_reuse.levenshtein import Levenshtein
In [2]: l = Levenshtein()
In [3]: l.ratio("dique deaeque omnes, studium quibus arua tueri,", "dique deaeque omnes, quibus est tutela per agros,")
Out[3]: 0.71
You can also calculate the Levenshtein distance of two words, defined as the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one word into another.
In [4]: l.levenshtein_distance("deaeque", "deaeuqe")
Out[4]: 2
Damerau-Levenshtein algorithm¶
Note
You will need to install pyxDamerauLevenshtein to use these features.
The Damerau-Levenshtein algorithm computes a distance metric between two strings, i.e., finite sequences of symbols or letters. It is an enhancement over the Levenshtein algorithm in that it also allows transposition operations.
This simple example compares two Latin words to find the distance between them:
In [1]: from pyxdameraulevenshtein import damerau_levenshtein_distance
In [2]: damerau_levenshtein_distance("deaeque", "deaque")
Out[2]: 1
Alternatively, you can also use CLTK’s native Levenshtein
class:
In [3]: from cltk.text_reuse.levenshtein import Levenshtein
In [4]: Levenshtein.damerau_levenshtein_distance("deaeque", "deaque")
Out[4]: 1
In [5]: Levenshtein.damerau_levenshtein_distance("deaeque", "deaeuqe")
Out[5]: 1
Needleman-Wunsch Algorithm¶
The Needleman-Wunsch Algorithm calculates the optimal global alignment between two strings given a scoring matrix.
There are two optional parameters: S
specifying a weighted similarity square matrix, and alphabet
(where |alphabet| = rows(S) = cols(S)
). By default, the algorithm assumes the Latin alphabet and a default matrix (1 for match, -1 for substitution).
In [1]: from cltk.text_reuse.comparison import Needleman_Wunsch as NW
In [2]: NW("abba", "ababa", alphabet = "ab", S = [[1, -3],[-3, 1]])
Out[2]: ('ab-ba', 'ababa')
In this case, the similarity matrix will be:
  | a  | b
a | 1  | -3
b | -3 | 1
Longest Common Substring¶
The longest common substring function takes two strings and returns the longest substring common to both. The example below compares a line from Vergil’s Georgics with a line from Propertius (Elegies III.13.41):
In [1]: from cltk.text_reuse.comparison import long_substring
In [2]: print(long_substring("dique deaeque omnes, studium quibus arua tueri,", "dique deaeque omnes, quibus est tutela per agros,"))
Out[2]: dique deaeque omnes,
MinHash¶
The MinHash algorithm generates a score based on the similarity of two strings. It takes two strings as parameters and returns a float.
In [1]: from cltk.text_reuse.comparison import minhash
In [2]: a = 'dique deaeque omnes, studium quibus arua tueri,'
In [3]: b = 'dique deaeque omnes, quibus est tutela per agros,'
In [4]: print(minhash(a, b))
Out[4]: 0.171631205673
Treebank label dict¶
You can generate a nested Python dict from a treebank in string format. Currently, only treebanks following the Penn notation are supported.
In [1]: from cltk.tags.treebanks import parse_treebanks
In [2]: st = "((IP-MAT-SPE (' ') (INTJ Yes) (, ,) (' ') (IP-MAT-PRN (NP-SBJ (PRO he)) (VBD seyde)) (, ,) (' ') (NP-SBJ (PRO I)) (MD shall) (VB promyse) (NP-OB2 (PRO you)) (IP-INF (TO to) (VB fullfylle) (NP-OB1 (PRO$ youre) (N desyre))) (. .) (' '))"
In [3]: treebank = parse_treebanks(st)
In [4]: treebank['IP-MAT-SPE']['INTJ']
Out[4]: ['Yes']
In [5]: treebank
Out[5]: {'IP-MAT-SPE': {"'": ["'", "'", "'"], 'INTJ': ['Yes'], ',': [',', ','], 'IP-MAT-PRN': {'NP-SBJ': {'PRO': ['he']}, 'VBD': ['seyde']}, 'NP-SBJ': {'PRO': ['I']}, 'MD': ['shall'], '\t': {'VB': ['promyse'], 'NP-OB2': {'PRO': ['you']}, 'IP-INF': {'TO': ['to'], '\t': {'VB': ['fullfylle'], 'NP-OB1': {'PRO$': ['youre'], 'N': ['desyre']}}, '.': ['.'], "'": ["'"]}}}}
Word count¶
For a dictionary-like object of word frequencies, use the NLTK’s Text()
.
In [1]: from nltk.tokenize.punkt import PunktLanguageVars
In [2]: from nltk.text import Text
In [3]: s = 'At at at ego ego tibi'.lower()
In [4]: p = PunktLanguageVars()
In [5]: tokens = p.word_tokenize(s)
In [6]: t = Text(tokens)
In [7]: vocabulary_count = t.vocab()
In [8]: vocabulary_count['at']
Out[8]: 3
In [9]: vocabulary_count['ego']
Out[9]: 2
In [10]: vocabulary_count['tibi']
Out[10]: 1
Word frequency lists¶
The CLTK has a module for computing word frequencies. The output is a Counter, a type of dictionary.
In [1]: from cltk.utils.frequency import Frequency
In [2]: from cltk.corpus.utils.formatter import tlg_plaintext_cleanup
In [3]: import os
In [4]: freq = Frequency()
In [6]: file = os.path.expanduser('~/cltk_data/greek/text/tlg/plaintext/TLG0012.TXT')
In [7]: with open(file) as f:
...: text = f.read().lower()
...:
In [8]: text = tlg_plaintext_cleanup(text)
In [9]: freq.counter_from_str(text)
Out[9]: Counter({'δ': 6507, 'καὶ': 4799, 'δὲ': 3194, 'τε': 2645, 'μὲν': 1628, 'ἐν': 1420, 'δέ': 1267, 'ὣς': 1203, 'οἱ': 1126, 'τ': 1101, 'γὰρ': 969, 'ἀλλ': 936, 'τὸν': 904, 'ἐπὶ': 830, 'τοι': 772, 'αὐτὰρ': 761, 'δὴ': 748, 'μοι': 745, 'μιν': 645, 'γε': 632, 'ἐπεὶ': 611, 'ἄρ': 603, 'ἦ': 598, 'νῦν': 581, 'ἄρα': 576, 'κατὰ': 572, 'ἐς': 571, 'ἐκ': 554, 'ἐνὶ': 544, 'ὡς': 541, 'ὃ': 533, 'οὐ': 530, 'οἳ': 527, 'περ': 491, 'τις': 491, 'οὐδ': 482, 'καί': 481, 'οὔ': 476, 'γάρ': 435, 'κεν': 407, 'τι': 407, 'γ': 406, 'ἐγὼ': 404, 'ἐπ': 397, … })
If you have access to the TLG or PHI5 disc, and have already imported it and converted it with the CLTK, you can build your own custom lists off of that.
In [11]: freq.make_list_from_corpus('phi5', 200, save=False) # or 'tlg'; both take a while to run
Out[11]: Counter({',': 749396, 'et': 196410, 'in': 141035, 'non': 89836, 'est': 86472, ':': 76915, 'ut': 70516, ';': 69901, 'cum': 61454, 'si': 59578, 'ad': 59248, 'quod': 52896, 'qui': 46385, 'sed': 41546, '?': 40717, 'quae': 38085, 'ex': 36996, 'quam': 34431, "'": 33596, 'de': 31331, 'esse': 31066, 'aut': 30568, 'a': 29871, 'hoc': 26266, 'nec': 26027, 'etiam': 22540, 'se': 22486, 'enim': 22104, 'ab': 21336, 'quid': 21269, 'per': 20981, 'atque': 20201, 'sunt': 20025, 'sit': 19123, 'autem': 18853, 'id': 18846, 'quo': 18204, 'me': 17713, 'ne': 17265, 'ac': 17007, 'te': 16880, 'nam': 16640, 'tamen': 15560, 'eius': 15306, 'haec': 15080, 'ita': 14752, 'iam': 14532, 'mihi': 14440, 'neque': 13833, 'eo': 13125, 'quidem': 13063, 'est.': 12767, 'quoque': 12561, 'ea': 12389, 'pro': 12259, 'uel': 11824, 'quia': 11518, 'tibi': 11493, … })
Word tokenization¶
The CLTK wraps one of the NLTK’s tokenizers (TreebankWordTokenizer
), which with the multilingual
parameter works for most languages that use Latin-style whitespace and punctuation to indicate word division. There are some language-specific tokenizers, too, which do extra work to subdivide words when they are combined into one string (e.g., “armaque” in Latin). See WordTokenizer.available_languages
for supported languages for such sub-string tokenization.
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: tok = WordTokenizer(language='multilingual')
In [3]: tok.available_languages
Out[3]:
['akkadian',
'arabic',
'french',
'greek',
'latin',
'middle_english',
'middle_french',
'middle_high_german',
'old_french',
'old_norse',
'sanskrit',
'multilingual']
In [3]: luke_ocs = "рєчє жє притъчѫ к н҄имъ глагол҄ѧ чловѣкѹ єтєрѹ богатѹ ѹгобьѕи сѧ н҄ива"
In [4]: tok = WordTokenizer(language='multilingual')
In [5]: tok.tokenize(luke_ocs)
Out[5]:
['рєчє',
'жє',
'притъчѫ',
'к',
'н҄имъ',
'глагол҄ѧ',
'чловѣкѹ',
'єтєрѹ',
'богатѹ',
'ѹгобьѕи',
'сѧ',
'н҄ива']
If this default does not work for your texts, consider the NLTK’s RegexpTokenizer
, which splits on a regular expression pattern of your choosing. Here, for instance, on whitespace and punctuation:
In [6]: from nltk.tokenize import RegexpTokenizer
In [7]: word_toker = RegexpTokenizer(r'\w+')
In [8]: word_toker.tokenize(luke_ocs)
Out[8]:
['рєчє',
'жє',
'притъчѫ',
'к',
'н',
'имъ',
'глагол',
'ѧ',
'чловѣкѹ',
'єтєрѹ',
'богатѹ',
'ѹгобьѕи',
'сѧ',
'н',
'ива']
Old Norse¶
Old Norse was a North Germanic language that was spoken by inhabitants of Scandinavia and inhabitants of their overseas settlements during about the 9th to 13th centuries. The Proto-Norse language developed into Old Norse by the 8th century, and Old Norse began to develop into the modern North Germanic languages in the mid- to late-14th century, ending the language phase known as Old Norse. These dates, however, are not absolute, since written Old Norse is found well into the 15th century. (Source: Wikipedia)
Corpora¶
Use CorpusImporter()
or browse the CLTK GitHub organization (anything beginning with old_norse_
) to discover available Old Norse corpora.
In[1]: from cltk.corpus.utils.importer import CorpusImporter
In[2]: corpus_importer = CorpusImporter("old_norse")
In[3]: corpus_importer.list_corpora
Out[3]: ['old_norse_text_perseus', 'old_norse_models_cltk', 'old_norse_texts_heimskringla', 'old_norse_runic_transcriptions', 'old_norse_dictionary_zoega']
Zoëga’s dictionary¶
This dictionary was compiled in the early 20th century. It contains Old Norse entries with descriptions given in English. Each entry has possible POS tags for its word and the translations/meanings.
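The dictionary data can be fetched like any other corpus:
In[1]: from cltk.corpus.utils.importer import CorpusImporter
In[2]: corpus_importer = CorpusImporter('old_norse')
In[3]: corpus_importer.import_corpus('old_norse_dictionary_zoega')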
Stopword Filtering¶
To use the CLTK’s built-in stopword list, we use an example from Eiríks saga rauða:
In[1]: from nltk.tokenize.punkt import PunktLanguageVars
In[2]: from cltk.stop.old_norse.stops import STOPS_LIST
In[3]: sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'
In[4]: p = PunktLanguageVars()
In[5]: tokens = p.word_tokenize(sentence.lower())
In[6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]: ['var',
'einn',
'morgin',
',',
'karlsefni',
'rjóðrit',
'flekk',
'nökkurn',
',',
'glitraði']
Swadesh¶
The corpus module has a class for generating a Swadesh list for Old Norse.
In[1]: from cltk.corpus.swadesh import Swadesh
In[2]: swadesh = Swadesh('old_norse')
In[3]: swadesh.words()[:10]
Out[3]: ['ek', 'þú', 'hann', 'vér', 'þér', 'þeir', 'sjá, þessi', 'sá', 'hér', 'þar']
Word Tokenizing¶
A very simple tokenizer is available for Old Norse. For now, it does not take into account specific Old Norse constructions like the merging of conjugated verbs with þú and with sik. Here is a sentence extracted from Gylfaginning in the Edda by Snorri Sturluson.
In[1]: from cltk.tokenize.word import WordTokenizer
In[2]: word_tokenizer = WordTokenizer('old_norse')
In[3]: sentence = "Gylfi konungr var maðr vitr ok fjölkunnigr."
In[4]: word_tokenizer.tokenize(sentence)
Out[4]: ['Gylfi', 'konungr', 'var', 'maðr', 'vitr', 'ok', 'fjölkunnigr', '.']
POS tagging¶
You can get the POS tags of Old Norse texts using the CLTK’s wrapper around the NLTK’s TnT tagger. First, download the model by importing the old_norse_models_cltk
corpus. This TnT tagger was trained from annotated data from Icelandic Parsed Historical Corpus (version 0.9, license: LGPL).
TnT tagger¶
The following sentence is from the first verse of Völuspá (a poem describing the destiny of the gods of Asgard).
In[1]: from cltk.tag.pos import POSTag
In[2]: tagger = POSTag('old_norse')
In[3]: sent = 'Hlióðs bið ek allar.'
In[4]: tagger.tag_tnt(sent)
Out[4]: [('Hlióðs', 'Unk'),
('bið', 'VBPI'),
('ek', 'PRO-N'),
('allar', 'Q-A'),
('.', '.')]
Phonology transcription¶
According to phonological rules (available at Wikipedia - Old Norse orthography and Altnordisches Elementarbuch by Friedrich Ranke and Dietrich Hofmann), a reconstructed pronunciation of Old Norse words is implemented.
In[1]: from cltk.phonology import utils as ut
In[2]: from cltk.phonology.old_norse import transcription as ont
In[3]: sentence = "Gylfi konungr var maðr vitr ok fjölkunnigr"
In[4]: tr = ut.Transcriber(ont.DIPHTHONGS_IPA, ont.DIPHTHONGS_IPA_class, ont.IPA_class, ont.old_norse_rules)
In[5]: tr.main(sentence)
Out[5]: "[gylvi kɔnungr var maðr vitr ɔk fjœlkunːiɣr]"
Runes¶
The oldest runic inscriptions found are from about 200 AD. They have always denoted Germanic languages. Until the 8th century, the elder futhark alphabet was used. It comprised 24 characters: ᚠ, ᚢ, ᚦ, ᚨ, ᚱ, ᚲ, ᚷ, ᚹ, ᚺ, ᚾ, ᛁ, ᛃ, ᛇ, ᛈ, ᛉ, ᛊ, ᛏ, ᛒ, ᛖ, ᛗ, ᛚ, ᛜ, ᛟ, ᛞ. The word Futhark comes from the first six characters of the alphabet: ᚠ (f), ᚢ (u), ᚦ (th), ᚨ (a), ᚱ (r), ᚲ (k). Later, this alphabet was reduced to 16 runes, the younger futhark: ᚠ, ᚢ, ᚦ, ᚭ, ᚱ, ᚴ, ᚼ, ᚾ, ᛁ, ᛅ, ᛋ, ᛏ, ᛒ, ᛖ, ᛘ, ᛚ, ᛦ, with more ambiguity on sounds. The shapes of runes may vary according to the material they are carved on, which is why there is a variant of the younger futhark like this: ᚠ, ᚢ, ᚦ, ᚭ, ᚱ, ᚴ, ᚽ, ᚿ, ᛁ, ᛅ, ᛌ, ᛐ, ᛓ, ᛖ, ᛙ, ᛚ, ᛧ.
In[1]: from cltk.corpus.old_norse.runes import Rune, Transcriber, ELDER_FUTHARK, YOUNGER_FUTHARK
In[2]: " ".join(Rune.display_runes(ELDER_FUTHARK))
Out[2]: ᚠ ᚢ ᚦ ᚨ ᚱ ᚲ ᚷ ᚹ ᚺ ᚾ ᛁ ᛃ ᛇ ᛈ ᛉ ᛊ ᛏ ᛒ ᛖ ᛗ ᛚ ᛜ ᛟ ᛞ
In[3]: little_jelling_stone = "᛬ᚴᚢᚱᛘᛦ᛬ᚴᚢᚾᚢᚴᛦ᛬ᚴ(ᛅᚱ)ᚦᛁ᛬ᚴᚢᛒᛚ᛬ᚦᚢᛋᛁ᛬ᛅ(ᚠᛏ)᛬ᚦᚢᚱᚢᛁ᛬ᚴᚢᚾᚢ᛬ᛋᛁᚾᛅ᛬ᛏᛅᚾᛘᛅᚱᚴᛅᛦ᛬ᛒᚢᛏ᛬"
In[4]: Transcriber.transcribe(little_jelling_stone, YOUNGER_FUTHARK)
Out[4]: "᛫kurmR᛫kunukR᛫k(ar)þi᛫kubl᛫þusi᛫a(ft)᛫þurui᛫kunu᛫sina᛫tanmarkaR᛫but᛫"
Syllabification¶
For a language-dependent approach, you can call the predefined sonority dictionary by setting the language
parameter:
In[1]: from cltk.phonology.syllabify import Syllabifier
In[2]: s = Syllabifier(language='old_norse')
In[3]: s.syllabify("danmarkar")
Out[3]: ['dan', 'mar', 'kar']
Syllable length plays a great role in Old Norse poetry. To measure it, words first have to be phonetically transcribed; this is why the "old_norse_ipa" language is used:
In[1]: import cltk.phonology.old_norse.transcription as ont
In[2]: from cltk.phonology.syllabify import Syllabifier
In[3]: syllabifier = Syllabifier(language="old_norse_ipa")
In[4]: word = [ont.a, ont.s, ont.g, ont.a, ont.r, ont.dh, ont.r]
In[5]: syllabified_word = syllabifier.syllabify_phonemes(word)
In[6]: [ont.measure_old_norse_syllable(syllable) for syllable in syllabified_word]
Out[6]: [<Length.short: 'short'>, <Length.long: 'long'>]
Old Norse prosody¶
Old Norse poetry is traditionally divided into skaldic poetry and eddic poetry.
Eddic poetry¶
Eddic poems designate the poems of the Poetic Edda. Stanza, line, and verse are the three levels that characterize eddic poetry. The Poetic Edda is mainly composed in three kinds of poetic meters: fornyrðislag, ljóðaháttr and málaháttr.
- Fornyrðislag
A stanza of fornyrðislag has 8 short lines (or verses), that is, 4 long lines (or lines); each long line is made of two short lines. The first verse of a line usually alliterates with the second verse of the same line.
In[1]: text1 = "Hljóðs bið ek allar\nhelgar kindir,\nmeiri ok minni\nmögu Heimdallar;\nviltu at ek, Valföðr,\nvel fyr telja\nforn spjöll fira,\nþau er fremst of man."
In[2]: from cltk.prosody.old_norse.verse import VerseManager, Fornyrdhislag, Ljoodhhaatr
In[2]: VerseManager.is_fornyrdhislag(text1)
Out[2]: True
In[3]: fo = Fornyrdhislag()
In[4]: fo.from_short_lines_text(text1)
In[5]: fo.short_lines
Out[5]: ['Hljóðs bið ek allar', 'helgar kindir,', 'meiri ok minni', 'mögu Heimdallar;', 'viltu at ek, Valföðr,', 'vel fyr telja', 'forn spjöll fira,', 'þau er fremst of man.']
In[6]: fo.long_lines
Out[6]: [['Hljóðs bið ek allar', 'helgar kindir,'], ['meiri ok minni', 'mögu Heimdallar;'], ['viltu at ek, Valföðr,', 'vel fyr telja'], ['forn spjöll fira,', 'þau er fremst of man.']]
In[7]: fo.syllabify()
In[8]: fo.syllabified_text
Out[8]: [[[[['hljóðs'], ['bið'], ['ek'], ['al', 'lar']]], [[['hel', 'gar'], ['kin', 'dir']]]], [[[['meir', 'i'], ['ok'], ['min', 'ni']]], [[['mög', 'u'], ['heim', 'dal', 'lar']]]], [[[['vil', 'tu'], ['at'], ['ek'], ['val', 'föðr']]], [[['vel'], ['fyr'], ['tel', 'ja']]]], [[[['forn'], ['spjöll'], ['fir', 'a']]], [[['þau'], ['er'], ['fremst'], ['of'], ['man']]]]]
In[9]: fo.to_phonetics()
In[10]: fo.transcribed_text
Out[10]: [[['[hljoːðs]', '[bið]', '[ɛk]', '[alːar]'], ['[hɛlɣar]', '[kindir]']], [['[mɛiri]', '[ɔk]', '[minːi]'], ['[mœɣu]', '[hɛimdalːar]']], [['[viltu]', '[at]', '[ɛk]', '[valvœðr]'], ['[vɛl]', '[fyr]', '[tɛlja]']], [['[fɔrn]', '[spjœlː]', '[fira]'], ['[θɒu]', '[ɛr]', '[frɛmst]', '[ɔv]', '[man]']]]
In[11]: fo.find_alliteration()
Out[11]: ([[('hljóðs', 'helgar')], [('meiri', 'mögu'), ('minni', 'mögu')], [], [('forn', 'fremst'), ('fira', 'fremst')]], [1, 2, 0, 2])
- Ljóðaháttr
A stanza of ljóðaháttr has 6 short lines (or verses), 4 long-lines (or lines). The first and the third lines have two verses, while the second and the fourth lines have only one (longer) verse. The first verse of the first and third lines alliterates with the second verse of these lines. The second and the fourth lines contain alliterations.
In[1]: text2 = "Deyr fé,\ndeyja frændr,\ndeyr sjalfr it sama,\nek veit einn,\nat aldrei deyr:\ndómr um dauðan hvern."
In[2]: VerseManager.is_ljoodhhaattr(text2)
Out[2]: True
In[3]: lj = Ljoodhhaatr()
In[4]: lj.from_short_lines_text(text2)
In[5]: lj.short_lines
Out[5]: ['Deyr fé,', 'deyja frændr,', 'deyr sjalfr it sama,', 'ek veit einn,', 'at aldrei deyr:', 'dómr um dauðan hvern.']
In[6]: lj.long_lines
Out[6]: [['Deyr fé,', 'deyja frændr,'], ['deyr sjalfr it sama,'], ['ek veit einn,', 'at aldrei deyr:'], ['dómr um dauðan hvern.']]
In[7]: lj.syllabify()
In[8]: lj.syllabified_text
Out[8]: [[[['deyr'], ['fé']], [['deyj', 'a'], ['frændr']]], [[['deyr'], ['sjalfr'], ['it'], ['sam', 'a']]], [[['ek'], ['veit'], ['einn']], [['at'], ['al', 'drei'], ['deyr']]], [[['dómr'], ['um'], ['dau', 'ðan'], ['hvern']]]]
In[9]: lj.to_phonetics()
In[10]: lj.transcribed_text
Out[10]: [[['[dɐyr]', '[feː]'], ['[dɐyja]', '[frɛːndr]']], [['[dɐyr]', '[sjalvr]', '[it]', '[sama]']], [['[ɛk]', '[vɛit]', '[ɛinː]'], ['[at]', '[aldrɛi]', '[dɐyr]']], [['[doːmr]', '[um]', '[dɒuðan]', '[hvɛrn]']]]
In[11]: verse_alliterations, n_alliterations_lines = lj.find_alliteration()
In[12]: verse_alliterations
Out[12]: [[('deyr', 'deyja'), ('fé', 'frændr')], [('sjalfr', 'sjalfr')], [('einn', 'aldrei')], [('dómr', 'um')]]
In[13]: n_alliterations_lines
Out[13]: [2, 1, 1, 1]
- Málaháttr
Málaháttr is very similar to ljóðaháttr, except that verses are longer. No special code has been written for this.
Skaldic poetry¶
Dróttkvætt and hrynhenda are examples of skaldic poetic meters.
Old Norse pronouns declension¶
Old Norse, like other ancient Germanic languages, is highly inflected. With the declension module, you can retrieve the declined forms of pronouns that are already stored.
In[1]: from cltk.declension import utils as decl_utils
In[2]: from cltk.declension.old_norse import pronouns
In[3]: pro_demonstrative_pronouns_this = decl_utils.Pronoun("demonstrative pronouns this")
In[4]: demonstrative_pronouns_this = [[["þessi", "þenna", "þessum", "þessa"], ["þessir", "þessa", "þessum", "þessa"]], [["þessi", "þessa", "þessi", "þessar"], ["þessar", "þessar", "þessum", "þessa"]], [["þetta", "þetta", "þessu", "þessa"], ["þessi", "þessi", "þessum", "þessa"]]]
In[5]: pro_demonstrative_pronouns_this.set_declension(demonstrative_pronouns_this)
In[6]: pro_demonstrative_pronouns_this.get_declined(decl_utils.Case.accusative, decl_utils.Number.singular, decl_utils.Gender.feminine)
Out[6]: 'þessa'
Old Norse noun declension¶
Old Norse nouns vary according to case (nominative, accusative, dative, genitive), gender (masculine, feminine, neuter) and number (singular, plural). Nouns are considered either weak or strong. Weak nouns have a simpler declension than strong ones.
If you want a simple way to define the inflection of an Old Norse noun, you can do as follows:
In[1]: from cltk.inflection.utils import Noun, Gender
In[2]: sumar = [["sumar", "sumar", "sumri", "sumars"], ["sumur", "sumur", "sumrum", "sumra"]]
In[3]: noun_sumar = Noun("sumar", Gender.neuter)
In[4]: noun_sumar.set_declension(sumar)
To decline a noun, if you know its nominative singular, genitive singular, and nominative plural forms, you can use the following functions (see the sketch after the table).
       | masculine | feminine | neuter
strong | decline_strong_masculine_noun | decline_strong_feminine_noun | decline_strong_neuter_noun
weak | decline_weak_masculine_noun | decline_weak_feminine_noun | decline_weak_neuter_noun
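For example, a minimal sketch: the module path cltk.inflection.old_norse.nouns and the exact argument order are assumptions here; the three forms passed are the nominative singular, genitive singular, and nominative plural.

# Sketch only: verify the module path and signature in your installed CLTK version.
from cltk.inflection.old_norse.nouns import decline_strong_masculine_noun

# armr ("arm"): nominative sg. armr, genitive sg. arms, nominative pl. armar
forms = decline_strong_masculine_noun("armr", "arms", "armar")
print(forms)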
Old Norse verb conjugation¶
Old Norse verbs vary according to:
- person (first, second, third),
- number (singular, plural),
- tense (past, present),
- voice (active and medio-passive),
- mood (indicative, subjunctive, imperative, infinitive, past participle and present participle).
They may be classified into three categories:
- strong verbs, they form their past root with a stem vowel change,
- weak verbs, they form their past root by adding a dental consonant,
- preterito-present verbs, their present conjugates like verbs in past but have present meanings.
Two examples are given below: one strong verb and one weak verb.
In[1]: from cltk.inflection.old_norse.verbs import StrongOldNorseVerb
In[2]: lita = StrongOldNorseVerb()
In[3]: lita.set_canonic_forms(["líta", "lítr", "leit", "litu", "litinn"])
In[4]: lita.subclass
Out[4]: 1
In[5]: lita.present_active()
Out[5]: ['lít', 'lítr', 'lítr', 'lítum', 'lítið', 'líta']
In[6]: lita.past_active()
Out[6]: ['leit', 'leizt', 'leit', 'litum', 'lituð', 'litu']
In[7]: lita.present_active_subjunctive()
Out[7]: ['líta', 'lítir', 'líti', 'lítim', 'lítið', 'líti']
In[8]: lita.past_active_subjunctive()
Out[9]: ['lita', 'litir', 'liti', 'litim', 'litið', 'liti']
In[9]: lita.past_participle()
Out[9]: [['litinn', 'litinn', 'litnum', 'litins', 'litnir', 'litna', 'litnum', 'litinna'], ['litin', 'litna', 'litinni', 'litinnar', 'litnar', 'litnar', 'litnum', 'litinna'], ['litit', 'litit', 'litnu', 'litins', 'litit', 'litit', 'litnum', 'litinna']]
In[1]: from cltk.inflection.old_norse.verbs import WeakOldNorseVerb
In[2]: kalla = WeakOldNorseVerb()
In[3]: kalla.set_canonic_forms(["kalla", "kallaði", "kallaðinn"])
In[4]: kalla.subclass
Out[4]: 1
In[5]: kalla.present_active()
Out[5]: ['kalla', 'kallar', 'kallar', 'köllum', 'kallið', 'kalla']
In[6]: kalla.past_active()
Out[6]: ['kallaða', 'kallaðir', 'kallaði', 'kölluðum', 'kölluðuð', 'kölluðu']
In[7]: kalla.present_active_subjunctive()
Out[7]: ['kalla', 'kallir', 'kalli', 'kallim', 'kallið', 'kalli']
In[8]: kalla.past_active_subjunctive()
Out[9]: ['kallaða', 'kallaðir', 'kallaði', 'kallaðim', 'kallaðið', 'kallaði']
In[9]: kalla.past_participle()
Out[9]: [['kallaðr', 'kallaðan', 'kölluðum', 'kallaðs', 'kallaðir', 'kallaða', 'kölluðum', 'kallaðra'], ['kölluð', 'kallaða', 'kallaðri', 'kallaðrar', 'kallaðar', 'kallaðar', 'kölluðum', 'kallaðra'], ['kallatt', 'kallatt', 'kölluðu', 'kallaðs', 'kölluð', 'kölluð', 'kölluðum', 'kallaðra']]
Odia¶
Odia is an Eastern Indo-Aryan language, belonging to the Indo-Aryan branch of the Indo-European language family. It is thought to be directly descended from an Odra Magadhi Prakrit similar to Ardha Magadhi, which was spoken in eastern India over 1,500 years ago and is the primary language used in early Jain texts. Odia appears to have had relatively little influence from Persian and Arabic compared to other major North Indian languages. It is mainly spoken in the Indian state of Odisha and in parts of West Bengal, Jharkhand, Chhattisgarh and Andhra Pradesh. (Source: Wikipedia)
Alphabet¶
The Odia alphabet and digits are placed in cltk/corpus/odia/alphabet.py.
The digits are placed in a list NUMERALS
with the digit the same as the list index (0-9). For example, the Odia digit for 4 can be accessed in this manner:
In [1]: from cltk.corpus.odia.alphabet import NUMERALS
In [2]: NUMERALS[4]
Out[2]: '୪'
The vowels are placed in a list VOWELS
and can be accessed in this manner:
In [1]: from cltk.corpus.odia.alphabet import VOWELS
In [2]: VOWELS
Out[2]: ['ଅ', 'ଆ', 'ଇ', 'ଈ', 'ଉ', 'ଊ', 'ଋ', 'ୠ', 'ଌ', 'ୡ', 'ଏ', 'ଐ', 'ଓ', 'ଔ']
The rest of the alphabet is available as UNSTRUCTURED_CONSONANTS and STRUCTURED_CONSONANTS, which can be accessed in a similar way.
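For example (output omitted here), assuming both lists live in the same alphabet module named above:
In [1]: from cltk.corpus.odia.alphabet import STRUCTURED_CONSONANTS, UNSTRUCTURED_CONSONANTS
In [2]: STRUCTURED_CONSONANTS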
Ottoman¶
Ottoman Turkish, or the Ottoman language, is the variety of the Turkish language that was used in the Ottoman Empire. Ottoman Turkish was highly influenced by Arabic and Persian. Arabic and Persian words in the language accounted for up to 88% of its vocabulary. As in most other Turkic and other foreign languages of Islamic communities, the Arabic borrowings were not originally the result of a direct exposure of Ottoman Turkish to Arabic, a fact that is evidenced by the typically Persian phonological mutation of the words of Arabic origin. (Source: Wikipedia)
Alphabet¶
The Ottoman digits and alphabet are placed in cltk/corpus/ottoman/alphabet.py.
The digits are placed in a dict NUMERALS
with the digit the same as the index (0-9). There is also a dictionary named NUMERALS_WRITINGS with their written forms. For example, the Ottoman digit for 5 can be accessed in this manner:
In [1]: from cltk.corpus.ottoman.alphabet import NUMERALS, NUMERALS_WRITINGS
In [2]: NUMERALS[5]
Out[2]: '۵'
In [3]: NUMERALS_WRITINGS[5]
Out[3]: 'بش'
One can also get the alphabetic order of the characters from the ALPHABETIC_ORDER dictionary. The keys are the characters and the values are their order. The dictionary can be imported as follows:
In [1]: from cltk.corpus.ottoman.alphabet import ALPHABETIC_ORDER, CIM
In [2]: ALPHABETIC_ORDER[CIM]
Out[2]: 6
Pali¶
Pali is a Prakrit language native to the Indian subcontinent, which flourished between the 5th and 1st centuries BC and is now used only as a liturgical language. It is widely studied because it is the language of much of the earliest extant literature of Buddhism, as collected in the Pāli Canon or Tipiṭaka, and is the sacred language of Theravāda Buddhism. (Source: Wikipedia)
Alphabet¶
In [1]: from cltk.corpus.pali.alphabet import CONSONANTS, DEPENDENT_VOWELS, INDEPENDENT_VOWELS
In [2]: print(CONSONANTS)
['ක', 'ඛ', 'ග', 'ඝ', 'ඞ', 'ච', 'ඡ', 'ජ', 'ඣ', 'ඤ', 'ට', 'ඨ', 'ඩ', 'ඪ', 'ණ', 'ත', 'ථ', 'ද', 'ධ', 'න', 'ප', 'ඵ', 'බ', 'භ', 'ම', 'ය', 'ර', 'ල', 'ව', 'ස', 'හ', 'ළ', 'අං']
Corpora¶
Use CorpusImporter()
or browse the CLTK GitHub organization (anything beginning with pali_
) to discover available Pali corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('pali')
In [3]: c.list_corpora
Out[3]: ['pali_text_ptr_tipitaka']
Persian¶
Persian is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. The Old Persian language is one of the two directly attested Old Iranian languages (the other being Avestan). Old Persian appears primarily in the inscriptions, clay tablets, and seals of the Achaemenid era (c. 600 BCE to 300 BCE). Examples of Old Persian have been found in what is now Iran, Romania (Gherla), Armenia, Bahrain, Iraq, Turkey and Egypt, the most important attestation by far being the contents of the Behistun Inscription (dated to 525 BCE). Avestan is one of the Eastern Iranian languages within the Indo-European language family known only from its use as the language of Zoroastrian scripture, i.e. the Avesta. (Source: Wikipedia)
Alphabet¶
The Persian digits and alphabet are placed in cltk/corpus/persian/alphabet.py.
The digits are placed in a dict NUMERALS
with the digit the same as the index (0-9). There is also a dictionary named NUMERALS_WRITINGS with their written forms. For example, the Persian digit for 5 can be accessed in this manner:
In [1]: from cltk.corpus.persian.alphabet import NUMERALS, NUMERALS_WRITINGS
In [2]: NUMERALS[5]
Out[2]: '۵'
In [3]: NUMERALS_WRITINGS[5]
Out[3]: 'پنج'
One can also get the alphabetic order of the characters from the ALPHABETIC_ORDER dictionary. The keys are the characters and the values are their order. The dictionary can be imported as follows:
In [1]: from cltk.corpus.persian.alphabet import ALPHABETIC_ORDER, JIM
In [2]: ALPHABETIC_ORDER[JIM]
Out[2]: 6
Phonology¶
The aim of phonological/phonetic reconstruction of ancient words is to provide a probable and realistic pronunciation of past languages. See the language-specific pages for phonological/phonetic support.
Old Portuguese¶
Galician-Portuguese, also known as Old Portuguese or Medieval Galician, was a West Iberian Romance language spoken in the Middle Ages, in the northwest area of the Iberian Peninsula. Alternatively, it can be considered a historical period of the Galician and Portuguese languages. The language was used for literary purposes from the final years of the 12th century to roughly the middle of the 14th century in what are now Spain and Portugal and was, almost without exception, the only language used for the composition of lyric poetry. (Source: Wikipedia)
Swadesh¶
The corpus module has a class for generating a Swadesh list for Old Portuguese.
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('pt_old')
In [3]: swadesh.words()[:10]
Out[3]: ['eu', 'tu', 'ele', 'nos', 'vos', 'eles', 'esto, aquesto', 'aquelo', 'aqui', 'ali']
Prakrit¶
About¶
A Prakrit is any of several Middle Indo-Aryan languages. The Ardhamagadhi (“half-Magadhi”) Prakrit, which was used extensively to write the scriptures of Jainism, is often considered to be the definitive form of Prakrit, while others are considered variants thereof. Pali, the Prakrit used in Theravada Buddhism, tends to be treated as a special exception from the variants of the Ardhamagadhi language, as Classical Sanskrit grammars do not consider it as a Prakrit per se, presumably for sectarian rather than linguistic reasons. Other Prakrits are reported in old historical sources but are not attested, such as Paiśācī. (Source: Wikipedia)
Corpora¶
Use CorpusImporter()
or browse the CLTK GitHub organization (anything beginning with prakrit_
) to discover available Prakrit corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('prakrit')
In [3]: c.list_corpora
Out[3]: ['prakrit_texts_gretil']
Punjabi¶
Punjabi is an Indo-Aryan language, the native language of the Punjabi people, who inhabit the historical Punjab region of Pakistan and India. Punjabi developed from Sanskrit through Prakrit and later Apabhraṃśa. Punjabi emerged as an Apabhramsha, a degenerated form of Prakrit, in the 7th century A.D. and became stable by the 10th century. By the 10th century, many Nath poets were associated with earlier Punjabi works. Arabic and Persian influence in the historical Punjab region began with the late first millennium Muslim conquests on the Indian subcontinent. (Source: Wikipedia)
Corpora¶
Use CorpusImporter or browse the CLTK GitHub organization (anything beginning with punjabi_) to discover available Punjabi corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('punjabi')
In [3]: c.list_corpora
Out[3]:
['punjabi_text_gurban']
Now, from the list of available corpora, import any one you like.
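For example, a corpus from the list can be downloaded with the importer's import_corpus() method, which clones the corpus repository into the local ~/cltk_data directory:
In [4]: c.import_corpus('punjabi_text_gurban')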
Alphabet¶
Punjabi is written in two scripts: Gurmukhi and Shahmukhi. Gurmukhi has its origins in Brahmi, while Shahmukhi is a Perso-Arabic script.
The Punjabi digits, vowels, consonants, and symbols for both scripts are placed in cltk/corpus/punjabi/alphabet.py. Look there for more information about the language’s phonology.
For example, to use Punjabi’s independent vowels in each script:
In [1]: from cltk.corpus.punjabi.alphabet import INDEPENDENT_VOWELS_GURMUKHI
In [2]: from cltk.corpus.punjabi.alphabet import INDEPENDENT_VOWELS_SHAHMUKHI
In [3]: INDEPENDENT_VOWELS_GURMUKHI
Out[3]: ['ਆ', 'ਇ', 'ਈ', 'ਉ', 'ਊ', 'ਏ', 'ਐ', 'ਓ', 'ਔ']
In [4]: INDEPENDENT_VOWELS_SHAHMUKHI
Out[4]: ['ا', 'و', 'ی', 'ے']
Similarly, there are lists for DIGITS, DEPENDENT_VOWELS, CONSONANTS, BINDI_CONSONANTS (nasal pronunciation), and some OTHER_SYMBOLS (mostly for pronunciation).
Numerifier¶
These functions convert Western Arabic (English) numerals into Punjabi (Gurmukhi) numerals and vice versa.
In[1]: from cltk.corpus.punjabi.numerifier import punToEnglish_number
In[2]: from cltk.corpus.punjabi.numerifier import englishToPun_number
In[3]: c = punToEnglish_number('੧੨੩੪੫੬੭੮੯੦')
In[4]: print(c)
Out[4]: 1234567890
In[5]: c = englishToPun_number(1234567890)
In[6]: print(c)
Out[6]: ੧੨੩੪੫੬੭੮੯੦
Stopword Filtering¶
To use the CLTK’s built-in stopwords list:
In[1]: from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex
In[2]: from cltk.stop.punjabi.stops import STOPS_LIST
In[3]: sample = "ਪੰਜਾਬੀ ਪੰਜਾਬ ਦੀ ਮੁਖੱ ਬੋੋਲਣ ਜਾਣ ਵਾਲੀ ਭਾਸ਼ਾ ਹੈ।"
In[4]: x = indian_punctuation_tokenize_regex(sample)
In[5]: print(x)
Out[5]: ['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਦੀ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਵਾਲੀ', 'ਭਾਸ਼ਾ', 'ਹੈ', '।']
In[6]: lis = [w for w in x if w not in STOPS_LIST]
In[7]: print(lis)
Out[7]: ['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਭਾਸ਼ਾ', '।']
Sanskrit¶
Sanskrit is the primary liturgical language of Hinduism, a philosophical language of Hinduism, Jainism, Buddhism and Sikhism, and a literary language of ancient and medieval South Asia that also served as a lingua franca. It is a standardised dialect of Old Indo-Aryan, originating as Vedic Sanskrit and tracing its linguistic ancestry back to Proto-Indo-Iranian and Proto-Indo-European. As one of the oldest Indo-European languages for which substantial written documentation exists, Sanskrit holds a prominent position in Indo-European studies. (Source: Wikipedia)
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with sanskrit_) to discover available Sanskrit corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('sanskrit')
In [3]: c.list_corpora
Out[3]:
['sanskrit_text_jnu', 'sanskrit_text_dcs', 'sanskrit_parallel_sacred_texts', 'sanskrit_text_sacred_texts', 'sanskrit_parallel_gitasupersite', 'sanskrit_text_gitasupersite', 'sanskrit_text_wikipedia', 'sanskrit_text_sanskrit_documents']
Transliterator¶
This tool has been derived from the IndicNLP Project, courtesy of anoopkunchukuttan. It transliterates ITRANS text into the Devanagari (Unicode) script and can also romanize Devanagari script.
Script Conversion¶
Convert from one Indic script to another. This is a simple conversion that exploits the fact that the Unicode code points of the various Indic scripts lie at corresponding offsets from each script’s base code point.
In [1]: from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator
In [2]: input_text=u'राजस्थान'
In [3]: UnicodeIndicTransliterator.transliterate(input_text,"hi","pa")
Out[3]: 'ਰਾਜਸ੍ਥਾਨ'
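As a minimal sketch of the offset idea (illustrative only, not the CLTK implementation itself): the Devanagari block begins at U+0900 and the Gurmukhi block at U+0A00, so a character can be mapped between the two scripts by preserving its offset from the block base.
DEVANAGARI_BASE, GURMUKHI_BASE = 0x0900, 0x0A00
chr(GURMUKHI_BASE + (ord('र') - DEVANAGARI_BASE))  # Devanagari RA (U+0930) -> 'ਰ', Gurmukhi RA (U+0A30)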
Romanization¶
Convert Indic script text to Roman text in the ITRANS notation.
In [4]: from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator
In [5]: input_text=u'राजस्थान'
In [6]: lang='hi'
In [7]: ItransTransliterator.to_itrans(input_text,lang)
Out[7]: 'rAjasthAna'
Indicization (ITRANS to Indic Script)¶
Conversion of an ITRANS transliteration into the Devanagari (Unicode) script.
In [8]: from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator
In [9]: input_text=u'pitL^In'
In [10]: lang='hi'
In [11]: x=ItransTransliterator.from_itrans(input_text,lang)
In [12]: x
Out[12]: 'पितॣन्'
Query Script Information¶
Indic scripts are designed according to phonetic principles, and their organization makes it easy to obtain phonetic information about individual characters.
In [13]: from cltk.corpus.sanskrit.itrans.langinfo import *
In [14]: c = 'क'
In [15]: lang='hi'
In [16]: is_vowel(c,lang)
Out[16]: False
In [17]: is_consonant(c,lang)
Out[17]: True
In [18]: is_velar(c,lang)
Out[18]: True
In [19]: is_palatal(c,lang)
Out[19]: False
In [20]: is_aspirated(c,lang)
Out[20]: False
In [21]: is_unvoiced(c,lang)
Out[21]: True
In [22]: is_nasal(c,lang)
Out[22]: False
Other similar functions can be listed with dir() (note that the module itself must be imported for this to work):
In [29]: import cltk.corpus.sanskrit.itrans.langinfo
In [30]: dir(cltk.corpus.sanskrit.itrans.langinfo)
Out[30]: ['APPROXIMANT_LIST', 'ASPIRATED_LIST', 'AUM_OFFSET', 'COORDINATED_RANGE_END_INCLUSIVE', 'COORDINATED_RANGE_START_INCLUSIVE', 'DANDA', 'DENTAL_RANGE', 'DOUBLE_DANDA', 'FRICATIVE_LIST', 'HALANTA_OFFSET', 'LABIAL_RANGE', 'LC_TA', 'NASAL_LIST', 'NUKTA_OFFSET', 'NUMERIC_OFFSET_END', 'NUMERIC_OFFSET_START', 'PALATAL_RANGE', 'RETROFLEX_RANGE', 'RUPEE_SIGN', 'SCRIPT_RANGES', 'UNASPIRATED_LIST', 'UNVOICED_LIST', 'URDU_RANGES', 'VELAR_RANGE', 'VOICED_LIST', '__author__', '__builtins__', '__cached__', '__doc__', '__file__', '__license__', '__loader__', '__name__', '__package__', '__spec__', 'get_offset', 'in_coordinated_range', 'is_approximant', 'is_aspirated', 'is_aum', 'is_consonant', 'is_dental', 'is_fricative', 'is_halanta', 'is_indiclang_char', 'is_labial', 'is_nasal', 'is_nukta', 'is_number', 'is_palatal', 'is_retroflex', 'is_unaspirated', 'is_unvoiced', 'is_velar', 'is_voiced', 'is_vowel', 'is_vowel_sign', 'offset_to_char']
Swadesh¶
The corpus module has a class for generating a Swadesh list for Sanskrit.
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('sa')
In [3]: swadesh.words()[:10]
Out[3]: ['अहम्' , 'त्वम्', 'स', 'वयम्, नस्', 'यूयम्, वस्', 'ते', 'इदम्', 'तत्', 'अत्र', 'तत्र']
Syllabifier¶
This tool has also been derived from the IndicNLP Project, courtesy of anoopkunchukuttan. It breaks a word into its syllables and can be applied across 17 Indian languages written in Unicode scripts, including Devanagari.
In [23]: from cltk.stem.sanskrit.indian_syllabifier import Syllabifier
In [24]: input_text = 'नमस्ते'
In [26]: lang = 'hindi'
In [27]: x = Syllabifier(lang)
In [28]: x.orthographic_syllabify(input_text)
Out[28]: ['न', 'म', 'स्ते']
Tokenizer¶
This tool has also been derived from the IndicNLP Project, courtesy of anoopkunchukuttan. It breaks a sentence into its constituent words on the basis of punctuation and spaces.
In [29]: from cltk.tokenize.sentence import TokenizeSentence
In [30]: tokenizer = TokenizeSentence('sanskrit')
In [31]: input_text = "हिन्दी भारत की सबसे अधिक बोली और समझी जाने वाली भाषा है"
In [32]: tokenizer.tokenize(input_text)
Out[32]: ['हिन्दी', 'भारत', 'की', 'सबसे', 'अधिक', 'बोली', 'और', 'समझी', 'जाने', 'वाली', 'भाषा', 'है']
Stopword Filtering¶
To use the CLTK’s built-in stopwords list:
In [1]: from cltk.stop.sanskrit.stops import STOPS_LIST
In [2]: from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex
In [3]: s = "हमने पिछले पाठ मे सीखा था कि “अहम् गच्छामि” का मतलब “मै जाता हूँ” है। आप ऊपर
...: की तालिकाँओ "
In [4]: tokens = indian_punctuation_tokenize_regex(s)
In [5]: len(tokens)
Out[5]: 20
In [6]: no_stops = [w for w in tokens if w not in STOPS_LIST]
In [7]: len(no_stops)
Out[7]: 18
In [8]: no_stops
Out[8]:
['हमने',
'पिछले',
'पाठ',
'सीखा',
'था',
'कि',
'“अहम्',
'गच्छामि”',
'मतलब',
'“मै',
'जाता',
'हूँ”',
'है',
'।',
'आप',
'ऊपर',
'की',
'तालिकाँओ']
Old Swedish¶
Old Swedish (fornsvenska) is the language spoken in Sweden during the medieval period, roughly from the 13th century until the early 16th century. (Source: Wikipedia)
Phonological transcription¶
A reconstructed pronunciation of Old Swedish words is implemented according to phonological rules.
In [1]: from cltk.phonology.old_swedish import transcription as old_swedish
In [2]: from cltk.phonology import utils as ut
In [3]: sentence = "Far man kunu oc dör han för en hun far barn. oc sigher hun oc hænnæ frændær."
In [4]: tr = ut.Transcriber(old_swedish.DIPHTHONGS_IPA, old_swedish.DIPHTHONGS_IPA_class, old_swedish.IPA_class, old_swedish.old_swedish_rules)
In [5]: tr.main(sentence)
Out [5]: "[far man kunu ok dør han før ɛn hun far barn ok siɣɛr hun ok hɛnːɛ frɛndɛr]"
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with old_swedish_) to discover available Old Swedish corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: corpus_importer = CorpusImporter("old_swedish")
In [3]: corpus_importer.list_corpora
Out [3]: ['old_swedish_texts']
Tamil¶
Tamil is a Dravidian language predominantly spoken by the Tamil people of India and Sri Lanka. It is one of the longest-surviving classical languages in the world, and Tamil literature has been documented for over 2,000 years. The earliest period of Tamil literature, Sangam literature, is dated from ca. 300 BC – AD 300. Tamil has the oldest extant literature among the Dravidian languages. The earliest epigraphic records found on rock edicts and hero stones date from around the 3rd century BC. More than 55% of the epigraphical inscriptions (about 55,000) found by the Archaeological Survey of India are in the Tamil language. Tamil-language inscriptions written in Brahmi script have been discovered in Sri Lanka, and on trade goods in Thailand and Egypt. (Source: Wikipedia)
Alphabet¶
In [1]: from cltk.corpus.tamil.alphabet import VOWELS, CONSONANTS, GRANTHA_CONSONANTS
In [2]: print(VOWELS)
Out[2]: ['அ', 'ஆ', 'இ', 'ஈ', 'உ', 'ஊ', 'எ', 'ஏ', 'ஐ', 'ஒ', 'ஓ', 'ஔ']
In [3]: print(CONSONANTS)
Out[3]: ['க்', 'ங்', 'ச்', 'ஞ்', 'ட்', 'ண்', 'த்', 'ந்', 'ப்', 'ம்', 'ய்', 'ர்', 'ல்', 'வ்', 'ழ்', 'ள்', 'ற்', 'ன்']
In [4]: print(GRANTHA_CONSONANTS)
Out[4]: ['ஜ்', 'ஶ்', 'ஷ்', 'ஸ்', 'ஹ்', 'க்ஷ்']
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with tamil_) to discover available Tamil corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('tamil')
In [3]: c.list_corpora
Out[3]: ['tamil_text_ptr_tipitaka']
Telugu¶
Telugu is a Dravidian language native to India. Inscriptions with Telugu words dating back to between 400 BC and 100 BC have been discovered in Bhattiprolu in the Guntur district of Andhra Pradesh. Telugu literature can be traced back to the early 11th century, when the Mahabharata was first translated into Telugu from Sanskrit by Nannaya. It flourished under the rule of the Vijayanagar empire, where Telugu was one of the empire’s official languages. (Source: Wikipedia)
Alphabet¶
The Telugu alphabet and digits are placed in cltk/corpus/telugu/alphabet.py.
The digits are placed in a list NUMERALS, with each digit at its corresponding list index (0-9). For example, the Telugu digit for 4 can be accessed in this manner:
In [1]: from cltk.corpus.telugu.alphabet import NUMERALS
In [2]: NUMERALS[4]
Out[2]: '౪'
The vowels are placed in a list VOWELS and can be accessed in this manner:
In [1]: from cltk.corpus.telugu.alphabet import VOWELS
In [2]: VOWELS
Out[2]: ['అ ','ఆ','ఇ','ఈ ','ఉ ','ఊ ','ఋ ','ౠ ','ఌ ','ౡ','ఎ','ఏ','ఐ','ఒ','ఓ','ఔ ','అం','అః']
The rest of the alphabet consists of CONSONANTS, which can be accessed in a similar way.
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with telugu_) to discover available Telugu corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('telugu')
In [3]: c.list_corpora
Out[3]:
['telugu_text_wikisource']
Tokenizer¶
This tool breaks a sentence into smaller constituents, i.e. into words.
In [1]: from cltk.tokenize.sentence import TokenizeSentence
In [2]: tokenizer = TokenizeSentence('telugu')
In [3]: sentence = "క్లేశభూర్యల్పసారాణి కర్మాణి విఫలాని వా దేహినాం విషయార్తానాం న తథైవార్పితం త్వయి"
In [4]: telugu_text_tokenize = tokenizer.tokenize(sentence)
In [5]: telugu_text_tokenize
Out[5]:
['క్లేశభూర్యల్పసారాణి',
'కర్మాణి',
'విఫలాని',
'వా',
'దేహినాం',
'విషయార్తానాం',
'న',
'తథైవార్పితం',
'త్వయి']
Tibetan¶
Classical Tibetan refers to the language of any text written in Tibetic after the Old Tibetan period; though it extends from the 7th century until the modern day, it particularly refers to the language of early canonical texts translated from other languages, especially Sanskrit. In 816, during the reign of King Sadnalegs, literary Tibetan underwent a thorough reform aimed at standardizing the language and vocabulary of the translations being made from Indian texts, which resulted in what is now called Classical Tibetan. (Source: Wikipedia)
Corpora¶
Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with tibetan_) to discover available Tibetan corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('tibetan')
In [3]: c.list_corpora
Out[3]: ['tibetan_pos_tdc', 'tibetan_lexica_tdc']
Tocharian B¶
Tocharian, also spelled Tokharian, is an extinct branch of the Indo-European language family. It is known from manuscripts dating from the 6th to the 8th century AD, which were found in oasis cities on the northern edge of the Tarim Basin (now part of Xinjiang in northwest China). The documents record two closely related languages, called Tocharian A (“East Tocharian”, Agnean or Turfanian) and Tocharian B (“West Tocharian” or Kuchean). The subject matter of the texts suggests that Tocharian A was more archaic and used as a Buddhist liturgical language, while Tocharian B was more actively spoken in the entire area from Turfan in the east to Tumshuq in the west. Tocharian A is found only in the eastern part of the Tocharian-speaking area, and all extant texts are of a religious nature. Tocharian B, however, is found throughout the range and in both religious and secular texts. (Source: Wikipedia)
Swadesh¶
The corpus module has a class for generating a Swadesh list for Tocharian B.
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('txb')
In [3]: swadesh.words()[:10]
Out[3]: ['ñäś', 'tuwe', 'su', 'wes', 'yes', 'cey', 'se', 'su, samp', 'tane', 'tane, omp']
For interactive tutorials, in the form of Jupyter Notebooks, see https://github.com/cltk/tutorials.
Urdu¶
Alphabet¶
The Urdu alphabet and digits are placed in cltk/corpus/urdu/alphabet.py.
The digits are placed in a list DIGITS, with each digit at its corresponding list index (0-9). For example, the Urdu digit for 4 can be accessed in this manner:
In [1]: from cltk.corpus.urdu.alphabet import DIGITS
In [2]: DIGITS[4]
Out[2]: '٤'
Urdu has three SHORT_VOWELS, which are essentially diacritics used in the script, and four LONG_VOWELS, which are actual letters of the alphabet. The corresponding lists can be imported:
In [1]: from cltk.corpus.urdu.alphabet import SHORT_VOWELS
In [2]: SHORT_VOWELS
Out[2]: ['َ', 'ِ', 'ُ']
In [3]: from cltk.corpus.urdu.alphabet import LONG_VOWELS
In [4]: LONG_VOWELS
Out[4]: ['ا', 'و', 'ی', 'ے']
The rest of the alphabet consists of CONSONANTS, which can be accessed in a similar way.
There are three SPECIAL characters, which are ligatures or alternative orthographic shapes of letters of the alphabet.
In [1]: from cltk.corpus.urdu.alphabet import SPECIAL
In [2]: SPECIAL
Out[2]: ['ﺁ', 'ۀ', 'ﻻ']