The Classical Language Toolkit (CLTK)

Contents

About

The Classical Language Toolkit (CLTK) offers natural language processing (NLP) support for the languages of Ancient, Classical, and Medieval Eurasia. Greek, Latin, Akkadian, and the Germanic languages are currently most complete. The goals of the CLTK are to:

  • compile analysis-friendly corpora;
  • collect and generate linguistic data;
  • act as a free and open platform for generating scientific research.

The project’s source is hosted on GitHub and the homepage is http://cltk.org.

Citation

For guidance on citing the CLTK, see Citation in the project’s README.

Installation

Please note that the CLTK is built, tested, and supported only on POSIX-compliant operating systems (namely, Linux, macOS, and the BSDs).

With Pip

Note

The CLTK is only officially supported with Python 3.7 on POSIX-compliant operating systems (Linux, Mac OS X, FreeBSD, etc.).

First, you’ll need a working installation of Python 3.7, which now includes Pip. Create a virtual environment and activate it as follows:

$ python3.7 -m venv venv

$ source venv/bin/activate

Then, install the CLTK, which automatically installs all dependencies:

$ pip install cltk

Second, if you want to automatically import any of the CLTK's corpora, you will need an installation of Git, which the CLTK uses to download and update corpora. How you install Git will depend on your operating system.
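
For example, on Debian or Ubuntu, or on a Mac with Homebrew, Git can typically be installed with one of the following commands (adjust to your own package manager):

$ sudo apt install git

$ brew install git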

Tip

For a user-friendly interactive shell environment, try IPython, which may be invoked with ipython from the command line. You may install it with pip install ipython.

Microsoft Windows

Warning

CLTK on Windows is not officially supported; however, we do encourage Windows 10 users to give the following a try. Others have reported success. If this should fail for you, please open an issue on GitHub.

Windows 10 features a beta of “Bash on Ubuntu on Windows”, which creates a fully functional POSIX environment. For an introduction, see Microsoft’s docs here.

Once you have enabled Bash on Windows, installation and use are just the same as on Ubuntu. For instance, you would use the following:

sudo apt update
sudo apt install git
sudo apt-get install python-setuptools
sudo apt install python-virtualenv
virtualenv -p python3 ~/venv
source ~/venv/bin/activate
pip3 install cltk

Tip

Some fonts do not render Unicode well in the Bash terminal. Try SimSun-ExtB or Courier New.

Older releases

For reproduction of scholarship, the CLTK archives past versions of its software releases. To get an older release by version, say v0.1.32, use:

$ pip install cltk==0.1.32

If you do not know a release’s version number but have its DOI (for instance, 10.5281/zenodo.51144), then you can search Zenodo and learn that this DOI corresponds to version v0.1.34.

The above will work for most researchers seeking to reproduce results. It will give you CLTK code identical to what the original researcher was using. However, it is possible that you will want to use the exact same CLTK dependencies the researcher was using, too. In this case, consult the CLTK GitHub Releases page and download a .tar.gz file of the desired version. Then, you may do the following:

$ tar zxvf cltk-0.1.34.tar.gz
$ cd cltk-0.1.34
$ python3.6 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

This will give you CLTK and immediate dependencies identical to your target codebase.

The CLTK’s repositories are versioned, too, using Git. Should there have been changes to a target corpus, you may acquire your needed version by manually cloning the entire repo, then checking out the past version by commit log. For example, if you need commit 0ed43e025df276e95768038eb3692ba155cc78c9 from the repo latin_text_perseus:

$ cd ~/cltk_data/latin/text/
$ rm -rf text/latin_text_perseus/
$ git clone https://github.com/cltk/latin_text_perseus.git
$ cd latin_text_perseus/
$ git checkout 0ed43e025df276e95768038eb3692ba155cc78c9

From source

The CLTK source is available at GitHub. To build from source, clone the repository, make a virtual environment (as above), and run:

$ pip install -U -r requirements.txt
$ python setup.py install

If you modify the CLTK source, rebuild the project with these same commands. It is a good idea to run the test suite afterwards to ensure you did not introduce any breakage. Test with nose:

$ nosetests --with-doctest

Importing Corpora

The CLTK stores all data in the local directory cltk_data, which is created in the user’s home directory upon first initialization of the CorpusImporter() class. Within this are an originals directory, in which untouched copies of downloaded or copied files are preserved, and a directory for every language for which a corpus has been downloaded. It also contains cltk.log for all CLTK logging.
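
For illustration, after importing one Greek and one Latin corpus the layout looks roughly like this (the exact contents depend on which corpora you have imported):

cltk_data/
├── cltk.log
├── originals/
├── greek/
│   └── text/
│       └── greek_text_perseus/
└── latin/
    └── text/
        └── latin_text_latin_library/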

Listing corpora

To see all of the corpora available for importing, use list_corpora().

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter('greek')  # e.g., or CorpusImporter('latin')

In [3]: corpus_importer.list_corpora

Out[3]:
['greek_software_tlgu',
 'greek_text_perseus',
 'phi7',
 'tlg',
 'greek_proper_names_cltk',
 'greek_models_cltk',
 'greek_treebank_perseus',
 'greek_lexica_perseus',
 'greek_training_set_sentence_cltk',
 'greek_word2vec_cltk',
 'greek_text_lacus_curtius']

Importing a corpus

To download a remote corpus, use the following (here, for example, the Latin Library):

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter('latin')  # e.g., or CorpusImporter('greek')

In [3]: corpus_importer.import_corpus('latin_text_latin_library')
Downloaded 100%, 35.53 MiB | 3.28 MiB/s

For a local corpus, such as the TLG, you must give a second argument of the filepath to the corpus, e.g.:

In [4]: corpus_importer.import_corpus('tlg', '~/Documents/corpora/TLG_E/')

User-defined, distributed corpora

Most users will want to use the CLTK’s publicly available corpora. However, users can also import any corpus repository hosted on a Git server. The benefit of this is that users can work with corpora that the CLTK organization is not able to distribute itself (because they are too specific, have license restrictions, etc.).

Let’s say a user wants to keep a particular Git-backed corpus at git@github.com:kylepjohnson/latin_corpus_newton_example.git. It can be cloned into the ~/cltk_data/ directory by declaring it in a manually created YAML file at ~/cltk_data/distributed_corpora.yaml like the following:

example_distributed_latin_corpus:
    origin: https://github.com/kylepjohnson/latin_corpus_newton_example.git
    language: latin
    type: text

example_distributed_greek_corpus:
    origin: https://github.com/kylepjohnson/a_nonexistent_repo.git
    language: pali
    type: treebank

Each block defines a separate corpus. The first line of a block (e.g., example_distributed_latin_corpus) gives the unique name to the custom corpus. This first example block would allow a user to fetch the repo and install it at ~/cltk_data/latin/text/latin_corpus_newton_example.
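
A minimal sketch of fetching such a corpus, assuming that a corpus declared in distributed_corpora.yaml is picked up by CorpusImporter under its block name:

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter('latin')

In [3]: corpus_importer.import_corpus('example_distributed_latin_corpus')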

Corpus Readers

After a corpus has been imported into the library, users will want to access the data through a CorpusReader object. The CorpusReader API follows the NLTK CorpusReader API paradigm. It offers a way for users to access the documents, paragraphs, sentences, and words of all the available documents in a corpus, or a specified collection of documents. Not every corpus will support every method; e.g., a corpus of inscriptions may not support paragraphs via a paras method, but corpus providers should try to provide whichever interfaces they can.

Reading a Corpus

Use the get_corpus_reader method in the readers module.

In [1]: from cltk.corpus.readers import get_corpus_reader

In [2]: latin_corpus = get_corpus_reader(corpus_name = 'latin_text_latin_library', language = 'latin')

In [3]: len(list(latin_corpus.docs()))

Out[3]: 2141

In [4]: len(list(latin_corpus.paras()))

Out[4]: 212130

In [5]: len(list(latin_corpus.sents()))

Out[5]: 1038668

In [6]: len(list(latin_corpus.words()))

Out[6]: 16455728

Adding a Corpus to the CLTK Reader

Modify the cltk.corpus.readers module, updating SUPPORTED_CORPORA with your language and the specific corpus name. In the get_corpus_reader method, implement the checks and mappings needed to return an NLTK-compliant CorpusReader object.
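
A minimal sketch of what such a registration might look like, assuming a plain-text corpus laid out under ~/cltk_data; the dictionary entries and the reader choice here are illustrative, not the module's actual contents:

import os

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Illustrative sketch only -- not the real SUPPORTED_CORPORA mapping.
SUPPORTED_CORPORA = {
    'latin': ['latin_text_latin_library'],
    # add your language and corpus name here
    'sanskrit': ['sanskrit_text_example'],
}

def get_corpus_reader(corpus_name, language):
    """Return an NLTK-compliant corpus reader for a registered corpus (sketch)."""
    if corpus_name not in SUPPORTED_CORPORA.get(language, []):
        raise ValueError('Unsupported corpus: {}'.format(corpus_name))
    root = os.path.expanduser(
        os.path.join('~', 'cltk_data', language, 'text', corpus_name))
    # A plain-text reader is the simplest mapping; a real corpus may need
    # custom paragraph/sentence tokenizers to support paras() and sents().
    return PlaintextCorpusReader(root, r'.*\.txt', encoding='utf-8')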

Providing Metadata for Corpus Filtration

If you’re adding a corpus to the CLTK, please also consider providing a genre mapping if your corpus is large or is easily segmented into genres. Consider creating a file containing mappings of categories to directories and files, e.g.:

In [1]: from cltk.corpus.latin.latin_library_corpus_types import corpus_directories_by_type

In [2]: corpus_directories_by_type.keys()

Out [2]: dict_keys(['republican', 'augustan', 'early_silver', 'late_silver', 'old', 'christian', 'medieval', 'renaissance', 'neo_latin', 'misc', 'early'])

In [3]: from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type

In [4]: list(corpus_directories_by_type.values())[:2]

Out [4]: [['./caesar', './lucretius', './nepos', './cicero'], ['./livy', './ovid', './horace', './vergil', './hyginus']]

In [6]: list(corpus_texts_by_type.values())[:2]

Out [6]: [['sall.1.txt', 'sall.2.txt', 'sall.cotta.txt', 'sall.ep1.txt', 'sall.ep2.txt', 'sall.frag.txt', 'sall.invectiva.txt', 'sall.lep.txt', 'sall.macer.txt', 'sall.mithr.txt', 'sall.phil.txt', 'sall.pomp.txt', 'varro.frag.txt', 'varro.ll10.txt', 'varro.ll5.txt', 'varro.ll6.txt', 'varro.ll7.txt', 'varro.ll8.txt', 'varro.ll9.txt', 'varro.rr1.txt', 'varro.rr2.txt', 'varro.rr3.txt', 'sulpicia.txt'], ['resgestae.txt', 'resgestae1.txt', 'manilius1.txt', 'manilius2.txt', 'manilius3.txt', 'manilius4.txt', 'manilius5.txt', 'catullus.txt', 'vitruvius1.txt', 'vitruvius10.txt', 'vitruvius2.txt', 'vitruvius3.txt', 'vitruvius4.txt', 'vitruvius5.txt', 'vitruvius6.txt', 'vitruvius7.txt', 'vitruvius8.txt', 'vitruvius9.txt', 'propertius1.txt', 'tibullus1.txt', 'tibullus2.txt', 'tibullus3.txt']]

The mapping is a dictionary of genre types or periods, and the values are lists of files or directories for each type.

Helper Methods for Corpus Filtration

Users will typically construct a CorpusReader by selecting category types of directories or files. The assemble_corpus method allows users to take a CorpusReader and filter the files used to provide the data for the reader.

In [1]: from cltk.corpus.readers import assemble_corpus, get_corpus_reader

In [2]: from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type, corpus_directories_by_type

In [3]: latin_corpus = get_corpus_reader(corpus_name = 'latin_text_latin_library', language = 'latin')

In [4]: filtered_reader, fileids, categories = assemble_corpus(latin_corpus, types_requested=['republican', 'augustan'], type_dirs=corpus_directories_by_type,
...     type_files=corpus_texts_by_type)

In [5]: len(list(filtered_reader.docs()))

Out [5]: 510

In [6]: categories

Out [6]: {'republican', 'augustan'}

In [7]: len(fileids)

Out [7]: 510

Akkadian

Akkadian is an extinct East Semitic language (part of the greater Afroasiatic language family) that was spoken in ancient Mesopotamia. The earliest attested Semitic language, it used the cuneiform writing system, which was originally used to write the unrelated Ancient Sumerian, a language isolate. From the second half of the third millennium BC (ca. 2500 BC), texts fully written in Akkadian begin to appear. Hundreds of thousands of texts and text fragments have been excavated to date, covering a vast textual tradition of mythological narrative, legal texts, scientific works, correspondence, political and military events, and many other examples. By the second millennium BC, two variant forms of the language were in use in Assyria and Babylonia, known as Assyrian and Babylonian respectively. (Source: Wikipedia)

Workflow Sample Model

A sample workflow using the Akkadian tools is shown below. In this example, we take a text file downloaded from CDLI, import it, and have it read and ingested. From there, we look at the table of contents, select a text, convert it to Unicode, and pretty-print the result.

Note: this workflow uses a set of test documents, which can be downloaded here:

https://github.com/cltk/cltk/tree/master/cltk/tests/test_akkadian

When you have downloaded these files, use their location within os.path.join(), e.g. os.path.join('downloads', 'single_text.txt'). This tutorial assumes that you are using a fork of the CLTK.

In[1]: from cltk.corpus.akkadian.file_importer import FileImport

In[2]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus

In[3]: from cltk.corpus.akkadian.pretty_print import PrettyPrint

In[4]: from cltk.corpus.akkadian.tokenizer import Tokenizer

In[5]: from cltk.tokenize.word import WordTokenizer

In[6]: from cltk.stem.akkadian.atf_converter import ATFConverter

In[7]: import os

# import a text and read it
In[8]: fi = FileImport(os.path.join('test_akkadian', 'single_text.txt'))

In[9]: fi.read_file()

# output = fi.raw_file or fi.file_lines; for folder catalog = fi.file_catalog()
# ingest your file lines
In[10]: cc = CDLICorpus()

In[11]: cc.parse_file(fi.file_lines)

# this creates disparate sections of the text ingested (edition, metadata, etc)
In[12]: transliteration = [cc.catalog[text]['transliteration'] for text in cc.catalog]

# access the data through cc.texts (e.g. above) or initial prints (e.g. below):
# look through the file's contents
In[13]: print(cc.toc())
Out[13]: ['Pnum: P254202, Edition: ARM 01, 001, length: 23 line(s)']

# select a text through edition or cdli number (there's also .print_metadata):
In[14]: selected_text = cc.catalog['P254202']['transliteration']

# otherwise use the above 'transliteration'; same thing:
In[15]: print(selected_text)
Out[15]: ['a-na ia-ah-du-li-[im]', 'qi2-bi2-[ma]', 'um-ma a-bi-sa-mar#-[ma]', 'sa-li-ma-am e-pu-[usz]',
          'asz-szum mu-sze-zi-ba-am# [la i-szu]', '[sa]-li#-ma-am sza e-[pu-szu]', '[u2-ul] e-pu-usz sa#-[li-mu-um]',
          '[u2-ul] sa-[li-mu-um-ma]', 'isz#-tu mu#-[sze-zi-ba-am la i-szu]', 'a-la-nu-ia sza la is,-s,a-ab#-[tu]',
          'i-na-an-na is,-s,a-ab-[tu]', 'i-na ne2-kur-ti _lu2_ ha-szi-[im{ki}]',
          'ur-si-im{ki} _lu2_ ka-ar-ka#-[mi-is{ki}]', 'u3 ia-am-ha-ad[{ki}]', 'a-la-nu an-nu-tum u2-ul ih-li-qu2#',
          'i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur#-ma', 'ih-ta-al-qu2',
          'u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib#', 'u3 na-pa-asz2-ti u2-ba-li-it,', 'pi2-qa-at ha-s,e-ra#-at',
          'asz-szum a-la-nu-ka', 'u3 ma-ru-ka sza-al#-[mu]', '[a-na na-pa]-asz2#-ti-ia i-tu-ur']

In[16]: print(transliteration[0])
Out[16]: ['a-na ia-ah-du-li-[im]', 'qi2-bi2-[ma]', 'um-ma a-bi-sa-mar#-[ma]', 'sa-li-ma-am e-pu-[usz]',
          'asz-szum mu-sze-zi-ba-am# [la i-szu]', '[sa]-li#-ma-am sza e-[pu-szu]', '[u2-ul] e-pu-usz sa#-[li-mu-um]',
          '[u2-ul] sa-[li-mu-um-ma]', 'isz#-tu mu#-[sze-zi-ba-am la i-szu]', 'a-la-nu-ia sza la is,-s,a-ab#-[tu]',
          'i-na-an-na is,-s,a-ab-[tu]', 'i-na ne2-kur-ti _lu2_ ha-szi-[im{ki}]',
          'ur-si-im{ki} _lu2_ ka-ar-ka#-[mi-is{ki}]', 'u3 ia-am-ha-ad[{ki}]', 'a-la-nu an-nu-tum u2-ul ih-li-qu2#',
          'i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur#-ma', 'ih-ta-al-qu2',
          'u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib#', 'u3 na-pa-asz2-ti u2-ba-li-it,', 'pi2-qa-at ha-s,e-ra#-at',
          'asz-szum a-la-nu-ka', 'u3 ma-ru-ka sza-al#-[mu]', '[a-na na-pa]-asz2#-ti-ia i-tu-ur']

# tokenize by word or sign
In[17]: atf = ATFConverter()

In[18]: tk = Tokenizer()

In[19]: wtk = WordTokenizer('akkadian')

In[18]: lines = [tk.string_tokenizer(text, include_blanks=False)
                 for text in atf.process(selected_text)]

In[20]: words = [wtk.tokenize(line[0]) for line in lines]

# you could take off the first four lines to focus on the text with lines[4:]
In[21]: print(lines)
In[21]: [['a-na ia-ah-du-li-im'], ['qi2-bi2-ma'], ['um-ma a-bi-sa-mar-ma'], ['sa-li-ma-am e-pu-usz'],
         ['asz-szum mu-sze-zi-ba-am la i-szu'], ['sa-li-ma-am sza e-pu-szu'], ['u2-ul e-pu-usz sa-li-mu-um'],
         ['u2-ul sa-li-mu-um-ma'], ['isz-tu mu-sze-zi-ba-am la i-szu'], ['a-la-nu-ia sza la is,-s,a-ab-tu'],
         ['i-na-an-na is,-s,a-ab-tu'], ['i-na ne2-kur-ti _lu2_ ha-szi-im{ki}'],
         ['ur-si-im{ki} _lu2_ ka-ar-ka-mi-is{ki}'], ['u3 ia-am-ha-ad{ki}'], ['a-la-nu an-nu-tum u2-ul ih-li-qu2'],
         ['i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur-ma'], ['ih-ta-al-qu2'],
         ['u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib'], ['u3 na-pa-asz2-ti u2-ba-li-it,'],
         ['pi2-qa-at ha-s,e-ra-at'], ['asz-szum a-la-nu-ka'], ['u3 ma-ru-ka sza-al-mu'],
         ['a-na na-pa-asz2-ti-ia i-tu-ur']]

In[22]: print(words)
In[22]: [[('a-na', 'akkadian'), ('ia-ah-du-li-im', 'akkadian')], [('qi2-bi2-ma', 'akkadian')],
         [('um-ma', 'akkadian'), ('a-bi-sa-mar-ma', 'akkadian')], [('sa-li-ma-am', 'akkadian'),
          ('e-pu-usz', 'akkadian')],
          [('asz-szum', 'akkadian'), ('mu-sze-zi-ba-am', 'akkadian'), ('la', 'akkadian'), ('i-szu', 'akkadian')],
         [('sa-li-ma-am', 'akkadian'), ('sza', 'akkadian'), ('e-pu-szu', 'akkadian')],
         [('u2-ul', 'akkadian'), ('e-pu-usz', 'akkadian'), ('sa-li-mu-um', 'akkadian')],
         [('u2-ul', 'akkadian'), ('sa-li-mu-um-ma', 'akkadian')],
         [('isz-tu', 'akkadian'), ('mu-sze-zi-ba-am', 'akkadian'), ('la', 'akkadian'), ('i-szu', 'akkadian')],
         [('a-la-nu-ia', 'akkadian'), ('sza', 'akkadian'), ('la', 'akkadian'), ('is,-s,a-ab-tu', 'akkadian')],
         [('i-na-an-na', 'akkadian'), ('is,-s,a-ab-tu', 'akkadian')],
         [('i-na', 'akkadian'), ('ne2-kur-ti', 'akkadian'), ('_lu2_', 'sumerian'), ('ha-szi-im{ki}', 'akkadian')],
         [('ur-si-im{ki}', 'akkadian'), ('_lu2_', 'sumerian'), ('ka-ar-ka-mi-is{ki}', 'akkadian')],
         [('u3', 'akkadian'), ('ia-am-ha-ad{ki}', 'akkadian')],
         [('a-la-nu', 'akkadian'), ('an-nu-tum', 'akkadian'), ('u2-ul', 'akkadian'), ('ih-li-qu2', 'akkadian')],
         [('i-na', 'akkadian'), ('ne2-kur-ti', 'akkadian'), ('{disz}sa-am-si-{d}iszkur-ma', 'akkadian')],
         [('ih-ta-al-qu2', 'akkadian')],
         [('u3', 'akkadian'), ('a-la-nu', 'akkadian'), ('sza', 'akkadian'), ('ki-ma', 'akkadian'),
          ('u2-hu-ru', 'akkadian'), ('u2-sze-zi-ib', 'akkadian')],
         [('u3', 'akkadian'), ('na-pa-asz2-ti', 'akkadian'), ('u2-ba-li-it,', 'akkadian')],
         [('pi2-qa-at', 'akkadian'), ('ha-s,e-ra-at', 'akkadian')],
         [('asz-szum', 'akkadian'), ('a-la-nu-ka', 'akkadian')],
         [('u3', 'akkadian'), ('ma-ru-ka', 'akkadian'), ('sza-al-mu', 'akkadian')],
         [('a-na', 'akkadian'), ('na-pa-asz2-ti-ia', 'akkadian'), ('i-tu-ur', 'akkadian')]]

In[23]: for line in words:
   ...:     signs = [tk.sign_tokenizer(word) for word in line]
# Note: Not printing 'signs' due to length. Try it!

# Pretty printing:
In[25]: pp = PrettyPrint()

In[26]: destination = os.path.join('test_akkadian', 'html_single_text.html')

In[27]: pp.html_print_single_text(cc.catalog, '&P254202', destination)

Read File

Reads a .txt file and saves to memory the text in .raw_file and .file_lines. These two instance attributes are used for the ATFConverter.

In[1]: import os

In[2]: from cltk.corpus.akkadian.file_importer import FileImport

In[3]: text_location = os.path.join('test_akkadian', 'single_text.txt')

In[4]: text = FileImport(text_location)

In[5]: text.read_file()

To access the text of the file, use .raw_file or .file_lines: .raw_file is the file in its entirety, and .file_lines splits the text using .splitlines().

File Catalog

This function looks at the folder storing a file and outputs its contents.

In[1]: import os

In[2]: from cltk.corpus.akkadian.file_importer import FileImport

In[3]: text_location = os.path.join('test_akkadian', 'single_text.txt')

In[4]: folder = FileImport(text_location)

In[5]: folder.file_catalog()

Out[5]: ['html_file.html', 'html_single_text.html', 'single_text.txt',
         'test_akkadian.py', 'two_text_abnormalities.txt']

Parse File

This method captures the information in a text file and organizes it separately for every text found. It saves to memory a dictionary for each text, split up by edition, CDLI number, metadata, and the various text sections, all of which are accessible.

In[1]: import os

In[2]: from cltk.corpus.akkadian.file_importer import FileImport

In[3]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus

In[4]: cdli = CDLICorpus()

In[5]: f_i = FileImport(os.path.join('test_akkadian', 'single_text.txt'))

In[6]: f_i.read_file()

In[7]: cdli.parse_file(f_i.file_lines)

To access the text, use .catalog.

In[8]: print(cdli.catalog)
Out[8]: {'P254202': {'metadata': ['Primary publication: ARM 01, 001', 'Author(s): Dossin, Georges',
                                  'Publication date: 1946',
                                  'Secondary publication(s): Durand, Jean-Marie, LAPO 16, 0305',
                                  'Collection: National Museum of Syria, Damascus, Syria',
                                  'Museum no.: NMSD —', 'Accession no.:', 'Provenience: Mari (mod. Tell Hariri)',
                                  'Excavation no.:', 'Period: Old Babylonian (ca. 1900-1600 BC)',
                                  'Dates referenced:', 'Object type: tablet', 'Remarks:', 'Material: clay',
                                  'Language: Akkadian', 'Genre: Letter', 'Sub-genre:', 'CDLI comments:',
                                  'Catalogue source: 20050104 cdliadmin', 'ATF source: cdlistaff',
                                  'Translation: Durand, Jean-Marie (fr); Guerra, Dylan M. (en)',
                                  'UCLA Library ARK: 21198/zz001rsp8x', 'Composite no.:', 'Seal no.:',
                                  'CDLI no.: P254202'],
                          'pnum': 'P254202',
                       'edition': 'ARM 01, 001',
                     'raw_text': ['@obverse', '1. a-na ia-ah-du-li-[im]', '2. qi2-bi2-[ma]',
                                  '3. um-ma a-bi-sa-mar#-[ma]', '4. sa-li-ma-am e-pu-[usz]',
                                  '5. asz-szum mu-sze-zi-ba-am# [la i-szu]', '6. [sa]-li#-ma-am sza e-[pu-szu]',
                                  '7. [u2-ul] e-pu-usz sa#-[li-mu-um]', '8. [u2-ul] sa-[li-mu-um-ma]',
                                  '$ rest broken', '@reverse', '$ beginning broken',
                                  "1'. isz#-tu mu#-[sze-zi-ba-am la i-szu]",
                                  "2'. a-la-nu-ia sza la is,-s,a-ab#-[tu]", "3'. i-na-an-na is,-s,a-ab-[tu]",
                                  "4'. i-na ne2-kur-ti _lu2_ ha-szi-[im{ki}]",
                                  "5'. ur-si-im{ki} _lu2_ ka-ar-ka#-[mi-is{ki}]", "6'. u3 ia-am-ha-ad[{ki}]",
                                  "7'. a-la-nu an-nu-tum u2-ul ih-li-qu2#",
                                  "8'. i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur#-ma", "9'. ih-ta-al-qu2",
                                  "10'. u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib#",
                                  "11'. u3 na-pa-asz2-ti u2-ba-li-it,", "12'. pi2-qa-at ha-s,e-ra#-at",
                                  "13'. asz-szum a-la-nu-ka", "14'. u3 ma-ru-ka sza-al#-[mu]",
                                  "15'. [a-na na-pa]-asz2#-ti-ia i-tu-ur"],
              'transliteration': ['a-na ia-ah-du-li-[im]', 'qi2-bi2-[ma]', 'um-ma a-bi-sa-mar#-[ma]',
                                  'sa-li-ma-am e-pu-[usz]', 'asz-szum mu-sze-zi-ba-am# [la i-szu]',
                                  '[sa]-li#-ma-am sza e-[pu-szu]', '[u2-ul] e-pu-usz sa#-[li-mu-um]',
                                  '[u2-ul] sa-[li-mu-um-ma]', 'isz#-tu mu#-[sze-zi-ba-am la i-szu]',
                                  'a-la-nu-ia sza la is,-s,a-ab#-[tu]', 'i-na-an-na is,-s,a-ab-[tu]',
                                  'i-na ne2-kur-ti _lu2_ ha-szi-[im{ki}]',
                                  'ur-si-im{ki} _lu2_ ka-ar-ka#-[mi-is{ki}]', 'u3 ia-am-ha-ad[{ki}]',
                                  'a-la-nu an-nu-tum u2-ul ih-li-qu2#',
                                  'i-na ne2-kur-ti {disz}sa-am-si-{d}iszkur#-ma', 'ih-ta-al-qu2',
                                  'u3 a-la-nu sza ki-ma u2-hu-ru u2-sze-zi-ib#', 'u3 na-pa-asz2-ti u2-ba-li-it,',
                                  'pi2-qa-at ha-s,e-ra#-at', 'asz-szum a-la-nu-ka', 'u3 ma-ru-ka sza-al#-[mu]',
                                  '[a-na na-pa]-asz2#-ti-ia i-tu-ur'],
                'normalization': [],
                  'translation': []}}

Table of Contents

Prints a table of contents from which one can identify the edition and cdli number for printing purposes.

In[1]: import os

In[2]: from cltk.corpus.akkadian.file_importer import FileImport

In[3]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus

In[4]: cdli = CDLICorpus()

In[5]: path = os.path.join('test_akkadian', 'single_text.txt')

In[6]: f_i = FileImport(path)

In[7]: f_i.read_file()

In[8]: cdli.parse_file(f_i.file_lines)

In[9]: cdli.toc()
Out[9]: ['Pnum: P254202, Edition: ARM 01, 001, length: 23 line(s)']

List Pnums

Prints the CDLI numbers (pnums) of the ingested texts, from which one can select a text for printing.

In[1]: import os

In[2]: from cltk.corpus.akkadian.file_importer import FileImport

In[3]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus

In[4]: cdli = CDLICorpus()

In[5]: path = os.path.join('test_akkadian', 'single_text.txt')

In[6]: f_i = FileImport(path)

In[7]: f_i.read_file()

In[8]: cdli.parse_file(f_i.file_lines)

In[9]: cdli.list_pnums()
Out[9]: ['P254202']

List Editions

Prints the editions of the ingested texts, from which one can select a text for printing.

In[1]: import os

In[2]: from cltk.corpus.akkadian.file_importer import FileImport

In[3]: from cltk.corpus.akkadian.cdli_corpus import CDLICorpus

In[4]: cdli = CDLICorpus()

In[5]: path = os.path.join('test_akkadian', 'single_text.txt')

In[6]: f_i = FileImport(path)

In[7]: f_i.read_file()

In[8]: cdli.parse_file(f_i.file_lines)

In[9]: cdli.list_editions()
Out[9]: ['ARM 01, 001']

Tokenization

The Akkadian tokenizer reads ATF material and converts the data into readable, mutable tokens. There is an option to either preserve or strip damage markers in the text.

The ATFConverter depends upon the word and sign tokenizer outputs.

String Tokenization:

This function is based on the CLTK’s line tokenizer. Use it for strings (e.g. lines copied and pasted from a document) rather than .txt files.

In[1]: from cltk.corpus.akkadian.tokenizer import Tokenizer

In[2]: line_tokenizer = Tokenizer(preserve_damage=False)

In[3]: text = '20. u2-sza-bi-la-kum\n1. a-na ia-as2-ma-ah-{d}iszkur#\n' \
            '2. qi2-bi2-ma\n3. um-ma {d}utu-szi-{d}iszkur\n' \
            '4. a-bu-ka-a-ma\n5. t,up-pa-[ka] sza#-[tu]-sza-bi-lam esz-me' \
            '\n' '6. asz-szum t,e4#-em# {d}utu-illat-su2\n'\
            '7. u3 ia#-szu-ub-dingir sza a-na la i-[zu]-zi-im\n'

In[4]: line_tokenizer.string_token(text)
Out[4]: ['20. u2-sza-bi-la-kum',
         '1. a-na ia-as2-ma-ah-{d}iszkur',
         '2. qi2-bi2-ma',
         '3. um-ma {d}utu-szi-{d}iszkur',
         '4. a-bu-ka-a-ma',
         '5. t,up-pa-ka sza-tu-sza-bi-lam esz-me',
         '6. asz-szum t,e4-em {d}utu-illat-su2',
         '7. u3 ia-szu-ub-dingir sza a-na la i-zu-zi-im']

Line Tokenization:

Line tokenization works on any text, from FileImport.raw_file to CDLICorpus.texts.

In[1]: import os

In[2]: from cltk.corpus.akkadian.tokenizer import Tokenizer

In[3]: line_tokenizer = Tokenizer(preserve_damage=False)

In[4]: text = os.path.join('test_akkadian', 'single_text.txt')

In[5]: line_tokenizer.line_token(text[1:8])
Out[5]: ['a-na ia-ah-du-li-[im]',
         'qi2-bi2-[ma]',
         'um-ma a-bi-sa-mar#-[ma]',
         'sa-li-ma-am e-pu-[usz]',
         'asz-szum mu-sze-zi-ba-am# [la i-szu]',
         '[sa]-li#-ma-am sza e-[pu-szu]',
         '[u2-ul] e-pu-usz sa#-[li-mu-um]',
         '[u2-ul] sa-[li-mu-um-ma]']

Word Tokenization:

Word tokenization operates on a single line of text and returns all words in the line as (word, language) tuples in a list.

In[1]: import os

In[2]: from cltk.tokenize.word import  WordTokenizer

In[3]: word_tokenizer = WordTokenizer('akkadian')

In[4]: line = 'u2-wa-a-ru at-ta e2-kal2-la-ka _e2_-ka wu-e-er'

In[5]: word_tokenizer.tokenize(line)
Out[5]: [('u2-wa-a-ru', 'akkadian'), ('at-ta', 'akkadian'),
         ('e2-kal2-la-ka', 'akkadian'), ('_e2_-ka', 'sumerian'),
         ('wu-e-er', 'akkadian')]

Sign Tokenization:

Sign Tokenization takes a tuple (word, language) and splits the word up into individual sign tuples (sign, language) in a list.

In[1]: import os

In[2]: from cltk.tokenize.word import  WordTokenizer

In[3]: word_tokenizer = WordTokenizer('akkadian')

In[4]: word = ("{gisz}isz-pur-ram", "akkadian")

In[5]: word_tokenizer.tokenize_sign(word)
Out[5]: [("gisz", "determinative"), ("isz", "akkadian"),
         ("pur", "akkadian"), ("ram", "akkadian")]

Unicode Conversion

Given a list of tokens, this module returns the list converted from CDLI standards to print-publication standards. two_three is a flag that allows the user to turn accent marking for signs on or off (a₂ versus á).

In[1]: from cltk.stem.akkadian.atf_converter import ATFConverter

In[2]: atf = ATFConverter(two_three=False)

In[3]: test = ['as,', 'S,ATU', 'tet,', 'T,et', 'sza', 'ASZ', "a", "a2", "a3", "be2", "bad3", "buru14"]

In[4]: atf.process(test)

Out[4]: ['aṣ', 'ṢATU', 'teṭ', 'Ṭet', 'ša', 'AŠ', "a", "á", "à", "bé", "bàd", "buru₁₄"]

Pretty Printing

Pretty printing takes a .txt file and renders it as an HTML file.

In[1]: import os

In[2]: from cltk.corpus.akkadian.file_importer import FileImport

In[3]: from cltk.corpus.akkadian.pretty_print import PrettyPrint

In[4]: origin = os.path.join('test_akkadian', 'single_text.txt')

In[5]: destination = os.path.join('..', 'Akkadian_test_text', 'html_single_text.html')

In[6]: f_i = FileImport(origin)

In[7]: f_i.read_file()

In[8]: text = f_i.raw_file

In[9]: p_p = PrettyPrint()

In[10]: p_p.html_print(text, destination)

In[11]: f_o = FileImport(destination)  # read the generated HTML back in

In[12]: f_o.read_file()

In[13]: output = f_o.raw_file

Syllabifier

Syllabify Akkadian words.

In [1]: from cltk.stem.akkadian.syllabifier import Syllabifier

In [2]: word = "epištašu"

In [3]: syll = Syllabifier()

In [4]: syll.syllabify(word)
['e', 'piš', 'ta', 'šu']

Stress

This function identifies the stress on an Akkadian word.

In[2]: from cltk.phonology.akkadian.stress import StressFinder

In[3]: stresser = StressFinder()

In[4]: word = "šarrātim"

In[5]: stresser.find_stress(word)

Out[5]: ['šar', '[rā]', 'tim']

Decliner

This method outputs a list of tuples, the first element of each being a declined form of the noun and the second a dictionary containing its attributes.

In[2]: from cltk.stem.akkadian.declension import NaiveDecliner

In[3]: word = 'ilum'

In[4]: decliner = NaiveDecliner()

In[5]: decliner.decline_noun(word, 'm')

Out[5]:
[('ilam', {'case': 'accusative', 'number': 'singular'}),
 ('ilim', {'case': 'genitive', 'number': 'singular'}),
 ('ilum', {'case': 'nominative', 'number': 'singular'}),
 ('ilīn', {'case': 'oblique', 'number': 'dual'}),
 ('ilān', {'case': 'nominative', 'number': 'dual'}),
 ('ilī', {'case': 'oblique', 'number': 'plural'}),
 ('ilū', {'case': 'nominative', 'number': 'plural'})]

Stems and Bound Forms

These two methods reduce a noun to its stem or bound form.

In[2]: from cltk.stem.akkadian.stem import Stemmer

In[3]: stemmer = Stemmer()

In[4]: word = "ilātim"

In[5]: stemmer.get_stem(word, 'f')

Out[5]: 'ilt'

In[2]: from cltk.stem.akkadian.bound_form import BoundForm

In[3]: bound_former = BoundForm()

In[4]: word = "kalbim"

In[5]: bound_former.get_bound_form(word, 'm')

Out[5]: 'kalab'

Consonant and Vowel patterns

It’s useful to be able to parse Akkadian words as sequences of consonants and vowels.

In[2]: from cltk.stem.akkadian.cv_pattern import CVPattern

In[3]: cv_patterner = CVPattern()

In[4]: word = "iparras"

In[5]: cv_patterner.get_cv_pattern(word)

Out[5]:
[('V', 1, 'i'),
 ('C', 1, 'p'),
 ('V', 2, 'a'),
 ('C', 2, 'r'),
 ('C', 2, 'r'),
 ('V', 2, 'a'),
 ('C', 3, 's')]

In[6]: cv_patterner.get_cv_pattern(word, pprint=True)

Out[6]: 'V₁C₁V₂C₂C₂V₂C₃'

Stopword Filtering

To use the CLTK’s built-in stopwords list for Akkadian:

In[2]: from nltk.tokenize.punkt import PunktLanguageVars

In[3]: from cltk.stop.akkadian.stops import STOP_LIST

In[4]: sentence = "šumma awīlum ina dīnim ana šībūt sarrātim ūṣiamma awat iqbû la uktīn šumma dīnum šû dīn napištim awīlum šû iddâk"

In[5]: p = PunktLanguageVars()

In[6]: tokens = p.word_tokenize(sentence.lower())

In[7]: [w for w in tokens if not w in STOP_LIST]
Out[7]:
['awīlum',
 'dīnim',
 'šībūt',
 'sarrātim',
 'ūṣiamma',
 'awat',
 'iqbû',
 'uktīn',
 'dīnum',
 'dīn',
 'napištim',
 'awīlum',
 'iddâk']

Arabic

Classical Arabic is the form of the Arabic language used in Umayyad and Abbasid literary texts from the 7th century AD to the 9th century AD. The orthography of the Qurʾān was not developed for the standardized form of Classical Arabic; rather, it shows the attempt on the part of writers to utilize a traditional writing system for recording a non-standardized form of Classical Arabic. (Source: Wikipedia)

Corpora

Use CorpusImporter().

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('arabic')

In [3]: c.list_corpora
Out[3]:
['arabic_text_perseus','arabic_text_quranic_corpus','arabic_morphology_quranic-corpus']

In [4]: c.import_corpus('arabic_text_perseus')  # ~/cltk_data/arabic/text/arabic_text_perseus/

Alphabet

The Arabic alphabet is defined in cltk/corpus/arabic/alphabet.py.

In [1]: from cltk.corpus.arabic.alphabet import *

# All Hamza forms
In [2]: HAMZAT
Out[2]: ('ء', 'أ', 'إ', 'آ', 'ؤ', 'ؤ', 'ٔ', 'ٕ')

# Print HAMZA from the HAMZA constant and from the HAMZAT tuple
In [3]: HAMZA
Out[3]: 'ء'

In [4]: HAMZAT[0]
Out[4]: 'ء'

# List all Arabic letters
In [5]: LETTERS
Out[5]: 'ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ء آ أ ؤ إ ؤ'

# List all shaped forms of a letter, for example Beh
In [6]: SHAPED_FORMS[BEH]
Out[6]: ('ﺏ', 'ﺐ', 'ﺑ', 'ﺒ')

# List all punctuation marks
In [7]: PUNCTUATION_MARKS
Out[7]: ['،', '؛', '؟']

# List all diacritics: FatHatanً, Dammatanٌ, Kasratanٍ, FatHaَ, Dammaُ, Kasraِ, Sukunْ, Shaddaّ
In [8]: TASHKEEL
Out[8]: ('ً', 'ٌ', 'ٍ', 'َ', 'ُ', 'ِ', 'ْ', 'ّ')

# List HARAKAT
In [9]: HARAKAT
Out[9]: ('ً', 'ٌ', 'ٍ', 'َ', 'ُ', 'ِ', 'ْ')

# List SHORTHARAKAT
In [10]: SHORTHARAKAT
Out[10]: ('َ', 'ُ', 'ِ', 'ْ')

# List Tanween
In [11]: TANWEEN
Out[11]: ('ً', 'ٌ', 'ٍ')

# Kasheeda, Tatweel
In [12]: NOT_DEF_HARAKA
Out[12]: 'ـ'

# Western Arabic numerals
In [13]: WESTERN_ARABIC_NUMERALS
Out[13]: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

# Eastern Arabic numerals from 0 to 9
In [14]: EASTERN_ARABIC_NUMERALS
Out[14]: ['۰', '۱', '۲', '۳', '٤', '۵', '٦', '۷', '۸', '۹']

# List the weak letters
In [15]: WEAK
Out[15]: ('ا', 'و', 'ي', 'ى')

# List all Lam-Alef ligatures
In [16]: LIGATURES_LAM_ALEF
Out[16]: ('ﻻ', 'ﻷ', 'ﻹ', 'ﻵ')

# List small letters
In [17]: SMALL
Out[17]: ('ٰ', 'ۥ', 'ۦ')

# Letter names in Arabic
In [18]: Names[ALEF]
Out[18]: 'ألف'

CLTK Arabic Support

1. Pyarabic

PyArabic is a dedicated Arabic-language library for Python that provides basic functions to manipulate Arabic letters and text, such as detecting Arabic letters, identifying letter groups and characteristics, and removing diacritics. Developed by Taha Zerrouki.

1.1. Features
  1. Arabic letters classification
  2. Text tokenization
  3. Strip Harakat (all, except Shadda, tatweel, last_haraka)
  4. Separate and join Letters and Harakat
  5. Reduce tashkeel
  6. Measure tashkeel similarity (Harakat, fully or partially vocalized, similarity with a template)
  7. Letters normalization (Ligatures and Hamza)
  8. Numbers to words
  9. Extract numerical phrases
  10. Pre-vocalization of numerical phrases
  11. Unshaping texts
1.2. Applications
  • Arabic text processing
1.3. Usage

In [1]: from cltk.corpus.arabic.utils.pyarabic import araby

In [2]: char = 'ْ'

In [3]: araby.is_sukun(char)  # Checks for the Arabic Sukun mark
Out[3]: True

In [4]: char = 'ّ'

In [5]: araby.is_shadda(char)  # Checks for the Arabic Shadda mark
Out[5]: True

In [6]: text = "الْعَرَبِيّةُ"

In [7]: araby.strip_harakat(text)  # Strips Harakat from an Arabic word, except Shadda
Out[7]: العربيّة

In [8]: text = "الْعَرَبِيّةُ"

In [9]: araby.strip_lastharaka(text)  # Strips the last Haraka from an Arabic word, except Shadda
Out[9]: الْعَرَبِيّة

In [10]: text = "الْعَرَبِيّةُ"

In [11]: araby.strip_tashkeel(text)  # Strips vowels from a text, including Shadda
Out[11]: العربية

Stopword Filtering

To use the CLTK’s built-in stopwords list:

In [1]: from cltk.stop.arabic.stopword_filter import stopwords_filter as ar_stop_filter

In [2]: text = 'سُئِل بعض الكُتَّاب عن الخَط، متى يَسْتحِقُ أن يُوصَف بِالجَودةِ؟'

In [3]: ar_stop_filter(text)
Out[3]: ['سئل', 'الكتاب', 'الخط', '،', 'يستحق', 'يوصف', 'بالجودة', '؟']

Swadesh

The corpus module has a class for generating a Swadesh list for Arabic.

In[1]: from cltk.corpus.swadesh import Swadesh

In[2]: swadesh = Swadesh('ar')

In[3]: swadesh.words()[:10]

Out[3]: ['أنا' ,'أنت‎, أنتِ‎', 'هو‎,هي' ,'نحن' ,'أنتم‎,‎ أنتن‎,‎ أنتما‎', 'هم‎,‎ هن‎,‎ هما' ,'هذا' ,'ذلك' ,'هنا‎']

Word Tokenization

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: word_tokenizer = WordTokenizer('arabic')

In [3]: text = 'اللُّغَةُ الْعَرَبِيَّةُ جَمِيلَةٌ.'

In [4]: word_tokenizer.tokenize(text)
Out[4]: ['اللُّغَةُ', 'الْعَرَبِيَّةُ', 'جَمِيلَةٌ', '.']

Transliteration

The CLTK provides Buckwalter and ISO 233-2 transliteration systems for the Arabic language.

Available Transliteration Systems

In [1]: from cltk.phonology.arabic.romanization import available_transliterate_systems

In [2]: available_transliterate_systems()
Out[2]: ['buckwalter', 'iso233-2', 'asmo449']

Usage

In [1]: from cltk.phonology.arabic.romanization import transliterate

In [2]: mode = 'buckwalter'

In [3]: ar_string = 'بِسْمِ اللهِ الرَّحْمٰنِ الرَّحِيْمِ'  # in English: In the name of Allah, the Most Merciful, the Most Compassionate

In [4]: ignore = ''  # Arabic characters to exclude from the transliteration operation

In [5]: reverse = True  # True means transliterating from native Arabic script to a Roman scheme such as Buckwalter

In [6]: transliterate(mode, ar_string, ignore, reverse)
Out[6]: 'bisomi Allhi Alra~Hom`ni Alra~Hiyomi'

Aramaic

Aramaic is a language or group of languages belonging to the Semitic subfamily of the Afroasiatic language family. More specifically, it is part of the Northwest Semitic group, which also includes the Canaanite languages such as Hebrew and Phoenician. The Aramaic alphabet was widely adopted for other languages and is ancestral to the Hebrew, Syriac and Arabic alphabets. During its approximately 3,100 years of written history, Aramaic has served variously as a language of administration of empires, as a language of divine worship and religious study, and as the spoken tongue of a number of Semitic peoples from the Near East. (Source: Wikipedia)

Transliterate Square Script To Imperial Aramaic

Unicode recently included a separate code block for encoding characters in Imperial Aramaic. Traditionally documents written in Imperial Aramaic are taught and shared using square script. Here is a small function for converting a string written in square script to its Imperial Aramaic version.

Usage:

Import the function:

In [1]: from cltk.corpus.aramaic.transliterate import square_to_imperial

Take a string written in square script:

In [2]: mystring = "פדי בר דג[נ]מלך לאחא בר חפיו נתנת לך"

Convert it to Imperial Aramaic by passing it to our function:

In [3]: square_to_imperial(mystring)
Out[3]: "𐡐𐡃𐡉 𐡁𐡓 𐡃𐡂[𐡍]𐡌𐡋𐡊 𐡋𐡀𐡇𐡀 𐡁𐡓 𐡇𐡐𐡉𐡅 𐡍𐡕𐡍𐡕 𐡋𐡊"

Bengali

Bengali also known by its endonym Bangla is an Indo-Aryan language spoken in South Asia. It is the national and official language of the People’s Republic of Bangladesh, and the official language of several northeastern states of the Republic of India, including West Bengal, Tripura, Assam (Barak Valley) and Andaman and Nicobar Islands. With over 210 million speakers, Bengali is the seventh most spoken native language in the world. Source: Wikipedia.

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with bengali_) to discover available Bengali corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('bengali')

In [3]: c.list_corpora
Out[3]:
['bengali_text_wikisource']

Tokenizer

This tool can help break up a sentence into smaller constituents.

In [1]: from cltk.tokenize.sentence import TokenizeSentence

In [2]: sentence = "রাজপণ্ডিত হব মনে আশা করে | সপ্তশ্লোক ভেটিলাম রাজা গৌড়েশ্বরে ||"

In [3]: tokenizer = TokenizeSentence('bengali')

In [4]: bengali_text_tokenize = tokenizer.tokenize(sentence)

In [5]: bengali_text_tokenize
['রাজপণ্ডিত', 'হব', 'মনে', 'আশা', 'করে', '|', 'সপ্তশ্লোক', 'ভেটিলাম', 'রাজা', 'গৌড়েশ্বরে', '|', '|']

Chinese

Chinese can be traced back to a hypothetical Sino-Tibetan proto-language. The first written records appeared over 3,000 years ago during the Shang dynasty. The earliest examples of Chinese are divinatory inscriptions on oracle bones from around 1250 BCE in the late Shang dynasty. Old Chinese was the language of the Western Zhou period (1046–771 BCE), recorded in inscriptions on bronze artifacts, the Classic of Poetry and portions of the Book of Documents and I Ching. Middle Chinese was the language used during Northern and Southern dynasties and the Sui, Tang, and Song dynasties (6th through 10th centuries CE). (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with chinese_) to discover available Chinese corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('chinese')

In [3]: c.list_corpora
Out[3]:
['chinese_text_cbeta_01',
 'chinese_text_cbeta_02',
 'chinese_text_cbeta_indices']

Coptic

Coptic is the latest stage of the Egyptian language, a northern Afroasiatic language spoken in Egypt until at least the 17th century. Coptic flourished as a literary language from the second to thirteenth centuries, and its Bohairic dialect continues to be the liturgical language of the Coptic Orthodox Church of Alexandria. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with coptic_) to discover available Coptic corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('coptic')

In [3]: c.list_corpora
Out[3]: ['coptic_text_scriptorium']

Swadesh

The corpus module has a class for generating a Swadesh list for Coptic.

In[1]: from cltk.corpus.swadesh import Swadesh

In[2]: swadesh = Swadesh('cop')

In[3]: swadesh.words()[:10]

Out[3]: ['ⲁⲛⲟⲕ', 'ⲛⲧⲟⲕ, ⲛⲧⲟ', 'ⲛⲧⲟϥ, ⲛⲧⲟⲥ', 'ⲁⲛⲟⲛ', 'ⲛⲧⲟⲧⲛ', 'ⲛⲧⲟⲩ', '-ⲉⲓ', 'ⲡⲓ-, ϯ-, ⲛⲓ-', 'ⲡⲉⲓⲙⲁ', 'ⲙⲙⲁⲩ']

Ancient Egyptian

The language spoken in ancient Egypt was a branch of the Afroasiatic language family. The earliest known complete written sentence in the Egyptian language has been dated to about 2690 BCE, making it one of the oldest recorded languages known, along with Sumerian. Egyptian was spoken until the late seventeenth century in the form of Coptic. (Source: Wikipedia)

Important

For correct visualisation of transliterated Unicode text you will probably need a special font. We recommend either egyptoserif (handles the yod better) or noto.

Transliterate MdC

MdC (Manuel de Codage) is the standard encoding scheme and a series of conventions for transliterating Egyptian texts. At first it was also conceived as a system to represent positional relations between hieroglyphic signs. However, it was soon realised that the scheme used by MdC was not really appropriate for this last task, so current software for hieroglyphic typesetting often uses slightly different schemes than MdC. For more on MdC, see here and here.

The transliteration conventions proposed by MdC, though, are widely accepted. Since at the time the transliteration conventions of Egyptology were not covered by Unicode, MdC’s all-ASCII proposal made it possible to exchange at least transliterations in a digital environment. It is the de facto transliteration system used by the Thesaurus Linguae Aegyptiae, which includes transliterations from several different scripts used in Ancient Egypt; a good discussion can be found here.

Here are the Unicode equivalents of the MdC transliteration scheme as represented in transliterate_mdc:

Note

reStructuredText tables cannot display all characters in the Character column. The several that cannot be displayed are: U+0056: ; U+003c: ; U+003e: ; U+0024, U+00a3: .

MdC                                     Unicode
Unicode Number      Character           Unicode Number      Character
U+0041              A                   U+A723              ꜣ
U+0061              a                   U+A725              ꜥ
U+0048              H                   U+1E25              ḥ
U+0078              x                   U+1E2B              ḫ
U+0058              X                   U+1E96              ẖ
U+0056              V                   U+0068+U+032D       See note
U+0053              S                   U+0161              š
U+0063              c                   U+015B              ś
U+0054              T                   U+1E6F              ṯ
U+0076              v                   U+1E71              ṱ
U+0044              D                   U+1E0F              ḏ
U+0069              i                   U+0069+U+0486       i҆
U+003D              =                   U+2E17              ⸗
U+003C              <                   U+2329              See note
U+003E              >                   U+232A              See note
U+0071              q                   U+1E33              ḳ
U+0051              Q                   U+1E32              Ḳ
U+00A1, U+0040      ¡, @                U+1E24              Ḥ
U+0023, U+00A2      #, ¢                U+1E2A              Ḫ
U+0024, U+00A3      $, £                U+0048+U+0331       See note
U+00A5, U+005E      ¥, ^                U+0160              Š
U+00A9, U+002B      ©, +                U+1E0E              Ḏ
U+0043              C                   U+015A              Ś
U+002A, U+00A7      *, §                U+1E6E              Ṯ

Unicode still does not cover all of the transliteration conventions used within Egyptology, but there has been a lot of progress. Only three characters remain problematic and are not covered by precomposed characters from the Unicode Consortium:

  • Egyptological Yod
  • Capital H4
  • Small and Capital H5: almost exclusively used for transliterating demotic script.

The function was created with the transliteration font provided by CCER in view, which maps a couple of extra characters to transliterated equivalents, such as ‘¡’ or ‘@’ for Ḥ.

There is also a q_kopf flag for choosing between ‘q’ and ‘ḳ’ in the resulting text.

Usage:

Import the function:

In [1]: from cltk.corpus.egyptian.transliterate_mdc import mdc_unicode

Take an MdC-encoded string (P.Berlin 3022:28-31):

In [1]: mdc_string = """rdi.n wi xAst n xAst
fx.n.i r kpny Hs.n.i r qdmi
ir.n.i rnpt wa gs im in wi amw-nnSi
HqA pw n rtnw Hrt"""

Ensure that mdc_string is encoded in Unicode characters (this is mostly unnecessary):

In [2]: mdc_string.encode().decode("utf-8")
Out[2]:
'rdi.n wi xAst n xAst\nfx.n.i r kpny Hs.n.i r qdmi\nir.n.i rnpt wa gs im in wi amw-nnSi\nHqA pw n rtnw Hrt'

Apply the function to obtain the Unicode map result:

In [3]: unicode_string = mdc_unicode(mdc_string)
In [4]: print(unicode_string)
rdi҆.n wi҆ ḫꜣst n ḫꜣst
fḫ.n.i҆ r kpny ḥs.n.i҆ r qdmi҆
i҆r.n.i҆ rnpt wꜥ gs i҆m i҆n wi҆ ꜥmw-nnši҆
ḥqꜣ pw n rtnw ḥrt

If you disable the q_kopf option, the result would be the following:

In [5]: unicode_string = mdc_unicode(mdc_string, q_kopf=False)

In [6]: print(unicode_string)
rdi҆.n wi҆ ḫꜣst n ḫꜣst
fḫ.n.i҆ r kpny ḥs.n.i҆ r ḳdmi҆
i҆r.n.i҆ rnpt wꜥ gs i҆m i҆n wi҆ ꜥmw-nnši҆
ḥḳꜣ pw n rtnw ḥrt

Notice the q -> ḳ transformation.

If you are going to pass a string read from a file, be sure to specify the encoding when opening the file:

with open("~/mdc_text.txt", "r", encoding="utf-8") as f:
    mdc_text = f.read()
    unicode_text = mdc_unicode(mdc_text)

Notice encoding="utf-8".

TODO
  • Add support for different transliteration systems used within Egyptology.
  • Add an option for i -> j transformation to facilitate computer-based operations.
  • Add support for the problematic characters in the future.

Old English

Old English is the earliest historical form of the English language, spoken in England and southern and eastern Scotland in the early Middle Ages. It was brought to Great Britain by Anglo-Saxon settlers probably in the mid 5th century, and the first Old English literary works date from the mid-7th century. (Source: Wikipedia)

IPA Transcription

The CLTK’s IPA transcriber for OE can be found in the orthophonology module.

In [1]: from cltk.phonology.old_english.orthophonology import OldEnglishOrthophonology as oe

In [2]: oe('Fæder ūre þū þe eeart on heofonum')
Out[2]: 'fæder u:re θu: θe eæɑrt on heovonum'

In [3]: oe('Hwæt! wē Gārdena in ġēardagum')
Out[3]: 'hwæt we: gɑ:rdenɑ in jæ:ɑrdɑyum'

The callable OldEnglishOrthophonology object can also return phoneme objects instead of IPA strings. The string representation of a phoneme object lists the distinctive features of the phoneme and its IPA representation.

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with old_english_) to discover available Old English corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter("old_english")

In [3]: corpus_importer.list_corpora
['old_english_text_sacred_texts', 'old_english_models_cltk']

To download a corpus, use the import_corpus method. The following will download pre-trained POS models for Old English:

In [4]: corpus_importer.import_corpus('old_english_models_cltk')

Stopword Filtering

To use the CLTK’s built-in stopwords list, consider an example from Beowulf:

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from cltk.stop.old_english.stops import STOPS_LIST

In [3]: sentence = 'þe hie ær drugon aldorlease lange hwile.'

In [4]: p = PunktLanguageVars()

In [5]: tokens = p.word_tokenize(sentence.lower())

In [6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]:
['hie',
 'drugon',
 'aldorlease',
 'lange',
 'hwile',
 '.']

Text Normalization

Diacritic Stripping

The Word module provides a method useful for stripping various diacritical marks:

In [1]: from cltk.phonology.old_english.phonology import Word

In [2]: Word('ġelǣd').remove_diacritics()
Out[2]: 'gelæd'

ASCII Encoding

For converting to ASCII, you can call ascii_encoding:

In [3]: Word('oðþæt').ascii_encoding()
Out[3]: 'odthaet'

In [4]: Word('ƿeorðunga').ascii_encoding()
Out[4]: 'weordunga'

Transliteration

Anglo-Saxon runic transliteration

You can call the runic transliteration module for converting runic script into latin characters:

In [1]: from cltk.phonology.old_english.phonology import Transliterate as t

In [2]: t.transliterate('ᚩᚠᛏ ᛋᚳᚣᛚᛞ ᛋᚳᛖᚠᛁᛝ ᛋᚳᛠᚦᛖᚾᚪ ᚦᚱᛠᛏᚢᛗ', 'Latin')
Out[2]: 'oft scyld scefin sceathena threatum'

The reverse process is also possible:

In [3]: t.transliterate('Hƿæt Ƿe Gardena in geardagum', 'Anglo-Saxon')
Out[3]: 'ᚻᚹᚫᛏ ᚹᛖ ᚷᚪᚱᛞᛖᚾᚪ ᛁᚾ ᚷᛠᚱᛞᚪᚷᚢᛗ'

Syllabification

There is a facility for using the pre-specified sonority hierarchy for Old English to syllabify words.

In [1]: from cltk.phonology.syllabify import Syllabifier

In [2]: s = Syllabifier(language='old_english')

In [3]: s.syllabify('geardagum')
Out [3]:['gear', 'da', 'gum']

Lemmatization

A basic lemmatizer is provided, based on a hand-built dictionary of word forms.

In [1]: import cltk.lemmatize.old_english.lemma as oe_l
In [2]: lemmatizer = oe_l.OldEnglishDictionaryLemmatizer()
In [3]: lemmatizer.lemmatize('Næs him fruma æfre, or geworden, ne nu ende cymþ ecean')
Out [3]: [('Næs', 'næs'), ('him', 'he'), ('fruma', 'fruma'), ('æfre', 'æfre'), (',', ','), ('or', 'or'), ('geworden', 'weorþan'), (',', ','), ('ne', 'ne'), ('nu', 'nu'), ('ende', 'ende'), ('cymþ', 'cuman'), ('ecean', 'ecean')]

If an input word form has multiple possible lemmatizations, the system will select the lemma that occurs most frequently in a large corpus of Old English texts. If an input word form is not found in the dictionary, then it is simply returned.

Note, however, that by passing the extra parameter best_guess=False to the lemmatize function, one gains access to the underlying dictionary. In this case, a list is returned for each token. The list will contain:

  • Nothing, if the word form is not found;
  • A single string if the form maps to a unique lemma (the usual case);
  • Multiple strings if the form maps to several lemmata.

In [1]: lemmatizer.lemmatize('Næs him fruma æfre, or geworden, ne nu ende cymþ ecean', best_guess=False)
Out [1]: [('Næs', ['nesan', 'næs']), ('him', ['him', 'he', 'hi']), ('fruma', ['fruma']), ('æfre', ['æfre']), (',', []), ('or', []), ('geworden', ['weorþan', 'geweorþan']), (',', []), ('ne', ['ne']), ('nu', ['nu']), ('ende', ['ende']), ('cymþ', ['cuman']), ('ecean', [])]

By specifying return_frequencies=True the log of the relative frequencies of the lemmata is also returned:

In [1]: lemmatizer.lemmatize('Næs him fruma æfre, or geworden, ne nu ende cymþ ecean', best_guess=False, return_frequencies=True)

Out [1]: [('Næs', [('nesan', -11.498420778767347), ('næs', -5.340383031833549)]), ('him', [('him', -2.1288142618657147), ('he', -1.4098446677862744), ('hi', -2.3713533259849857)]), ('fruma', [('fruma', -7.3395376954076745)]), ('æfre', [('æfre', -4.570372796517447)]), (',', []), ('or', []), ('geworden', [('weorþan', -8.608049020871182), ('geweorþan', -9.100525505968976)]), (',', []), ('ne', [('ne', -1.9050995182359884)]), ('nu', [('nu', -3.393566264402446)]), ('ende', [('ende', -5.038516324389812)]), ('cymþ', [('cuman', -5.943525084818863)]), ('ecean', [])]

POS tagging

You can get the POS tags of Old English texts using the CLTK’s wrapper around the NLTK tagger. First, download the models by importing the old_english_models_cltk corpus.

There are a number of different pre-trained models available for POS tagging of Old English. Each represents a trade-off between accuracy of tagging and speed of tagging. Listed in order of increasing accuracy (= decreasing speed), the models are:

  • Unigram
  • Trigram -> Bigram -> Unigram n-gram backoff model
  • Conditional Random Field (CRF) model
  • Perceptron model

(Bigram and trigram models are also available, but unsuitable due to low recall.)

The taggers were trained on annotated data from the ISWOC Treebank (license: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License).

The POS tag scheme is explained here: https://proiel.github.io/handbook/developer/

Bech, Kristin and Kristine Eide. 2014. The ISWOC corpus. Department of Literature, Area Studies and European Languages, University of Oslo. http://iswoc.github.com.

Example: Tagging with the CRF tagger

The following sentence is from the beginning of Beowulf:

In [1]: from cltk.tag.pos import POSTag

In [2]: tagger = POSTag('old_english')

In [3]: sent = 'Hwæt! We Gardena in geardagum, þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon.'

In [4]: tagger.tag_crf(sent)

Out[4]:[('Hwæt', 'I-'), ('!', 'C-'),
('We', 'NE'), ('Gardena', 'NE'), ('in', 'R-'), ('geardagum', 'NB'), (',', 'C-'),
('þeodcyninga', 'NB'), (',', 'C-'), ('þrym', 'PY'), ('gefrunon', 'NB'),
(',', 'C-'), ('hu', 'DU'), ('ða', 'PD'), ('æþelingas', 'NB'), ('ellen', 'V-'),
('fremedon', 'V-'), ('.', 'C-')]
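
The other models are invoked analogously. For instance, assuming the unigram model exposes a method named tag_unigram following the same pattern as tag_crf (an assumption; check the POSTag class for the exact method names), the call would be:

In [5]: tagger.tag_unigram(sent)  # assumed method name, analogous to tag_crf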

Swadesh

The corpus module has a class for generating a Swadesh list for Old English.

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('eng_old')

In [3]: swadesh.words()[:10]
Out[3]: ['ic, iċċ, ih', 'þū', 'hē', 'wē', 'ġē', 'hīe', 'þēs, þēos, þis', 'sē, sēo, þæt', 'hēr', 'þār, þāra, þǣr, þēr']

Middle English

Middle English is collectively the varieties of the English language spoken after the Norman Conquest (1066) until the late 15th century; scholarly opinion varies but the Oxford English Dictionary specifies the period of 1150 to 1500. (Source: Wikipedia)

Text Normalization

CLTK’s normalizer attempts to clean the given text, converting it into a canonical form.

Lowercase Conversion

The to_lower parameter converts the string into lowercase.

In [1]: from cltk.corpus.middle_english.alphabet import normalize_middle_english

In [2]: normalize_middle_english("Whan Phebus in the Crabbe had nere hys cours ronne And toward the leon his journé gan take", to_lower=True)
Out [2]: 'whan phebus in the crabbe had nere hys cours ronne and toward the leon his journé gan take'

Punctuation Removal

The punct parameter is responsible for punctuation removal:

In [3]: normalize_middle_english("Thus he hath me dryven agen myn entent, And contrary to my course naturall.", punct=True)
Out [3]: 'thus he hath me dryven agen myn entent and contrary to my course naturall'

Canonical Form

The alpha_conv parameter follows the spelling conventions established throughout the last century: þ and ð are both converted to th, while 3 is converted to y at the start of a word and to gh otherwise.

In [4]: normalize_middle_english("as 3e lykeþ best", alpha_conv=True)
Out [4]: 'as ye liketh best'

Stemming

CLTK supports a rule-based affix stemmer for ME.

Keep in mind that while Middle English is considered a weakly inflected language with a grammatical structure resembling that of Modern English, its lack of orthographical conventions presents a difficulty when accounting for various affixes.

In [1]: from cltk.stem.middle_english import affix_stemmer

In [2]: from cltk.corpus.middle_english.alphabet import normalize_middle_english

In [3]: text = normalize_middle_english('The speke the henmest kyng, in the hillis he beholdis.').split(" ")

In [4]: affix_stemmer(text)
Out [4]: 'the spek the henm kyng in the hill he behold'

The stemmer can also take a hard-coded exception dictionary as an additional parameter. The example below uses the compiled stopwords list.

In[7]: from cltk.stop.middle_english.stops import STOPS_LIST

In[8]: exceptions = dict(zip(STOPS_LIST, STOPS_LIST))

In[9]: affix_stemmer('byfore him'.split(" "), exception_list = exceptions)
Out[9]: 'byfore him'

Stopword Filtering

To use the CLTK’s built-in stopwords list, we use an example from Chaucer’s “The Summoner’s Tale”:

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from cltk.stop.middle_english.stops import STOPS_LIST

In [3]: sentence = 'This frere bosteth that he knoweth helle'

In [4]: p = PunktLanguageVars()

In [5]: tokens = p.word_tokenize(sentence.lower())

In [6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]:
['frere',
 'bosteth',
 'knoweth',
 'helle',
 '.']

Stresser

The historical events of 11th-century Britain were intertwined with its phonological development. The Norman Conquest of 1066 is largely responsible for the influx of both Francien and Latin words and, by extension, for the highly variable spelling and phonology of ME.

While the Stresser provided by the CLTK cannot automatically determine the stress pattern of a given word, it does accept some of the most common stress rules as parameters (Latin/Germanic/French):

In [1]: from cltk.phonology.middle_english.transcription import Word

In [2]: ".".join(Word('beren').stresser(stress_rule = "FSR"))
Out[2]: "ber.'en"

In [3]: ".".join(Word('yisterday').stresser(stress_rule = "GSR"))
Out [3]: "yi.ster.'day"

In [4]: ".".join(Word('verbum').stresser(stress_rule = "LSR"))
Out [4]: "ver.'bum"

Syllabify

The Word class provides a syllabification module for ME words.

In [1]: from cltk.phonology.middle_english.transcription import Word

In [2]: w = Word("hymsylf")

In [3]: w.syllabify()
Out [3]: ['hym', 'sylf']

In [4]: w.syllabified_str()
Out[4]: 'hym.sylf'

French

Old French (franceis, françois, romanz; Modern French ancien français) was the language spoken in Northern France from the 8th century to the 14th century. In the 14th century, these dialects came to be collectively known as the langue d’oïl, contrasting with the langue d’oc or Occitan language in the south of France. The mid-14th century is taken as the transitional period to Middle French, the language of the French Renaissance, specifically based on the dialect of the Île-de-France region. The place and area where Old French was spoken natively roughly extended to the historical Kingdom of France and its vassals, but the influence of Old French was much wider, as it was carried to England, Sicily and the Crusader states as the language of a feudal elite and of commerce.

(Source: Wikipedia)

LEMMATIZER

The lemmatizer takes as its input a list of tokens, previously tokenized and made lower-case. It first seeks a match between each token and a list of potential lemmas taken from Godefroy’s (1901) Lexique, the Tobler-Lommatzsch, and the DECT. If a match is not found, the lemmatizer then seeks a match between the token and the forms that different lemmas are known to take (at present this only applies to lemmas from A-D and W-Z). If no match is returned at this stage, a set of rules is applied to the token. These rules are similar to those applied by the stemmer, but aim to bring forms in line with lemmas rather than truncating them. Finally, if no match is found between the modified token and the list of lemmas, a result of ‘None’ is returned.

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: from cltk.lemmatize.french.lemma import LemmaReplacer

In [3]: text = "Li rois pense que par folie, Sire Tristran, vos aie amé ; Mais Dé plevis ma loiauté, Qui sor mon cors mete flaele, S’onques fors cil qui m’ot pucele Out m’amistié encor nul jor !"

In [4]: text = str.lower(text)

In [5]: tokenizer = WordTokenizer('french')

In [6]: lemmatizer = LemmaReplacer()

In [7]: tokens = tokenizer.tokenize(text)

In [8]: lemmatizer.lemmatize(tokens)
Out [8]: [('li', 'li'), ('rois', 'rois'), ('pense', 'pense'), ('que', 'que'), ('par', 'par'), ('folie', 'folie'), (',', ['PUNK']), ('sire', 'sire'), ('tristran', 'None'), (',', ['PUNK']), ('vos', 'vos'), ('aie', ['avoir']), ('amé', 'amer'), (';', ['PUNK']), ('mais', 'mais'), ('dé', 'dé'), ('plevis', 'plevir'), ('ma', 'ma'), ('loiauté', 'loiauté'), (',', ['PUNK']), ('qui', 'qui'), ('sor', 'sor'), ('mon', 'mon'), ('cors', 'cors'), ('mete', 'mete'), ('flaele', 'flaele'), (',', ['PUNK']), ("s'", "s'"), ('onques', 'onques'), ('fors', 'fors'), ('cil', 'cil'), ('qui', 'qui'), ("m'", "m'"), ('ot', 'ot'), ('pucele', 'pucele'), ('out', ['avoir']), ("m'", "m'"), ('amistié', 'amistié'), ('encor', 'encor'), ('nul', 'nul'), ('jor', 'jor'), ('!', ['PUNK'])]

LINE TOKENIZATION

The line tokenizer takes a string as its input and returns a list of strings.

In [1]: from cltk.tokenize.line import LineTokenizer

In [2]: tokenizer = LineTokenizer('french')

In [3]: untokenized_text = """Ki de bone matire traite,\nmult li peise, se bien n’est faite.\nOëz, seignur, que dit Marie,\nki en sun tens pas ne s’oblie."""

In [4]: tokenizer.tokenize(untokenized_text)
Out [4]: ['Ki de bone matire traite,', 'mult li peise, se bien n’est faite.','Oëz, seignur, que dit Marie,', 'ki en sun tens pas ne s’oblie. ']

NAMED ENTITY RECOGNITION

The named entity recognizer for French takes as its input a string and returns a list of tuples. It tags named entities from a list, and also displays the category to which this named entity belongs.

Categories are modeled on those found in (Moisan, 1986) and include:

  • Locations “LOC” (e.g. Girunde)
  • Nationalities/places of origin “NAT” (e.g. Grius)
  • Individuals:
      - animals “ANI” (i.e. horses, e.g. Veillantif; cows, e.g. Blerain; dogs, e.g. Husdent)
      - authors “AUT” (e.g. Marie, Chrestïen)
      - nobility “CHI” (e.g. Rolland, Artus); n.b. characters such as Turpin are counted as nobility rather than religious figures
      - characters from classical sources “CLAS” (e.g. Echo)
      - feasts “F” (e.g. Pentecost)
      - religious things “REL” (i.e. saints, e.g. St Alexis; deities, e.g. Deus; and Old Testament people, e.g. Adam)
      - swords “SW” (e.g. Hautecler)
      - commoners “VIL” (e.g. Pathelin)

In [1]: from cltk.tag.ner import NamedEntityReplacer

In [2]: text_str = """Berte fu mere Charlemaine, qui pukis tint France et tot le Maine."""

In [3]: ner_replacer = NamedEntityReplacer()

In [4]: ner_replacer.tag_ner_fr(text_str)
Out [4]: [[('Berte', 'entity', 'CHI')], ('fu',), ('mere',), [('Charlemaine', 'entity', 'CHI')], (',',), ('qui',), ('pukis',), ('tint',), [('France', 'entity', 'LOC')], ('et',), ('tot',), ('le',), [('Maine', 'entity', 'LOC')], ('.',)]

NORMALIZER

The normalizer aims to maximally reduce orthographic variation in texts written in the Anglo-Norman dialect, bringing them in line with the “orthographe commune”. It is heavily inspired by Pope (1956). It takes a string as its input. Spelling variation is not consistent enough to ensure the highest accuracy; the normalizer should therefore be used as a last resort.

In [1]: from cltk.corpus.utils.formatter import normalize_fr

In [2]: text = "viw"

In [3]: normalize_fr(text)
Out [3]: ['vieux']

STEMMER

The stemmer strips morphological endings from an input string. Morphological endings are taken from Brunot & Bruneau (1949) and include both nominal and verbal inflexion. A list of exceptions can be found at cltk.stem.french.exceptions.

In [1]: from cltk.stem.french.stem import stem

In [2]: text = "ja departissent a itant quant par la vile vint errant tut a cheval une pucele en tut le siecle n’ot si bele un blanc palefrei chevalchot"

In [3]: stem(text)
Out [3]: "j depart a it quant par la vil v err tut a cheval un pucel en tut le siecl n' o si bel un blanc palefre chevalcho"

STOPWORD FILTERING

The stopword filter removes function words from a string of Old French (OF) or Middle French (MF) text. The list includes function words from among the 100 most common words in the corpus, as well as all conjugated forms of the auxiliaries estre and avoir.

In [1]: from cltk.stop.french.stops import STOPS_LIST as FRENCH_STOPS

In [2]: from cltk.tokenize.word import WordTokenizer

In [3]: tokenizer = WordTokenizer('french')

In [4]: text = "En pensé ai e en talant que d’ Yonec vus die avant dunt il fu nez, e de sun pere cum il vint primes a sa mere ."

In [5]: text = text.lower()

In [6]: tokens = tokenizer.tokenize(text)

In [7]: no_stops = [w for w in tokens if w not in FRENCH_STOPS]

In [8]: no_stops
Out [8]: ['pensé', 'talant', 'yonec', 'die', 'avant', 'dunt', 'nez', ',', 'pere', 'cum', 'primes', 'mere', '.']

WORD TOKENIZATION

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: word_tokenizer = WordTokenizer('french')

In [3]: text = "S'a table te veulz maintenir, Honnestement te dois tenir Et garder les enseignemens Dont cilz vers sont commancemens."

In [4]: word_tokenizer.tokenize(text)
Out [4]: ["S'", 'a', 'table', 'te', 'veulz', 'maintenir', ',', 'Honnestement', 'te', 'dois', 'tenir', 'Et', 'garder', 'les', 'enseignemens', 'Dont', 'cilz', 'vers', 'sont', 'commancemens', '.']

Apostrophes are treated as part of the first of the two words they separate. Variant apostrophe characters are also normalized to a single form.

Swadesh

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('fr_old')

In [3]: swadesh.words()[:10]
Out[3]: ['jo, jou, je, ge', 'tu', 'il', 'nos, nous', 'vos, vous', 'il, eles', 'cist, cest, cestui', 'ci', 'la']

Middle High German

Middle High German (abbreviated MHG, German: Mittelhochdeutsch, abbr. Mhd.) is the term for the form of German spoken in the High Middle Ages. It is conventionally dated between 1050 and 1350, developing from Old High German and into Early New High German. High German is defined as those varieties of German which were affected by the Second Sound Shift; the Middle Low German and Middle Dutch languages spoken to the North and North West, which did not participate in this sound change, are not part of MHG. (Source: Wikipedia)

ASCII Encoding

Using the Word class, you can easily convert a string to its ASCII encoding, essentially stripping it of its diacritics.

In [1]: from cltk.phonology.middle_high_german.transcription import Word

In [2]: w = Word("vogellîn")

In [3]: w.ASCII_encoding()
Out[3]: 'vogellin'

Stemming

Note

The stemming algorithm is still under development and can sometimes produce inaccurate results.

CLTK’s stemming function attempts to reduce inflected words to their stem by suffix stripping.

In [1]: from cltk.stem.middle_high_german.stem import stemmer_middle_high_german

In [2]: stemmer_middle_high_german("Man lūte dā zem münster nāch gewoneheit")
Out[2]: ['Man', 'lut', 'dâ', 'zem', 'munst', 'nâch', 'gewoneheit']

The stemmer strips umlauts by default; to turn this off, set rem_umlauts = False:

In [3]: stemmer_middle_high_german("Man lūte dā zem münster nāch gewoneheit", rem_umlauts = False)
Out[3]: ['Man', 'lût', 'dâ', 'zem', 'münst', 'nâch', 'gewoneheit']

The stemmer can also take a user-defined exception dictionary as an optional parameter.

In [4]: stemmer_middle_high_german("swaȥ kriuchet unde fliuget und bein zer erden biuget", rem_umlauts = False)
Out[4]: ['swaȥ', 'kriuchet', 'unde', 'fliuget', 'und', 'bein', 'zer', 'erden', 'biuget']

In [5]: stemmer_middle_high_german("swaȥ kriuchet unde fliuget und bein zer erden biuget", rem_umlauts = False, exceptions = {"biuget" : "biegen"})
Out[5]: ['swaȥ', 'kriuchet', 'unde', 'fliuget', 'und', 'bein', 'zer', 'erden', 'biegen']

Syllabification

A syllabifier is contained in the Word module:

In [1]: from cltk.phonology.middle_high_german.transcription import Word

In [2]: Word('entslâfen').syllabify()
Out[2]: ['ent', 'slâ', 'fen']

Note that the syllabifier is case-insensitive:

In [3]: Word('Fröude').syllabify()
Out[3]: ['fröu', 'de']

You can also load the sonority of MHG phonemes to the phonology syllabifier:

In [4]: from cltk.phonology.syllabify import Syllabifier

In [5]: s = Syllabifier(language='middle high german')

In [6]: s.syllabify('lobebæren')
Out[6]: ['lo', 'be', 'bæ', 'ren']

Stopword Filtering

CLTK offers a built-in stop word list for Middle High German.

In [1]: from cltk.stop.middle_high_german.stops import STOPS_LIST

In [2]: from cltk.tokenize.word import WordTokenizer

In [3]: word_tokenizer = WordTokenizer('middle_high_german')

In [4]: sentence = "Wol mich lieber mære diu ich hān vernomen daȥ der winter swære welle ze ende komen"

In [5]: tokens = word_tokenizer.tokenize(sentence.lower())

In [6]: [word for word in tokens if word not in STOPS_LIST]
Out[6]: ['lieber', 'mære', 'hān', 'vernomen', 'winter', 'swære', 'welle', 'komen']

Text Normalization

Text normalization attempts to narrow the discrepancies between various corpora.

Lowercase Conversion

By default, the function converts the whole string to lowercase. However, since in MHG uppercase is used only at the start of a sentence or to denote eponyms, you may instead set to_lower_all = False and to_lower_beginning = True to convert only the words at the beginning of a sentence.

In [1]: from cltk.corpus.middle_high_german.alphabet import normalize_middle_high_german

In [2]: normalize_middle_high_german("Dô erbiten si der nahte und fuoren über Rîn")
Out[2]: 'dô erbiten si der nahte und fuoren über rîn'

In [3]: normalize_middle_high_german("Dô erbiten si der nahte und fuoren über Rîn",to_lower_all = False, to_lower_beginning = True)
Out[3]: 'dô erbiten si der nahte und fuoren über Rîn'
Alphabet Conversion

Various online corpora use the characters ā, ō, ū, ē, ī to represent â, ô, û, ê and î respectively. Sometimes, ae and oe are also used instead of æ and œ. By default, the normalizer converts the text to the canonical form.

In [4]: normalize_middle_high_german("Mit ūf erbürten schilden in was ze strīte nōt", alpha_conv = True)
Out[4]: 'mit ûf erbürten schilden in was ze strîte nôt'
Punctuation

Punctuation is also handled by the normalizer.

In [5]: normalize_middle_high_german("Si sprach: ‘herre Sigemunt, ir sult iȥ lāȥen stān", punct = True)
Out[5]: 'si sprach herre sigemunt ir sult iȥ lâȥen stân'

Phonetic Indexing

Phonetic indexing helps with identifying and processing homophones.

Soundex

The Word class provides a Soundex algorithm modified for MHG.

In [1]: from cltk.phonology.middle_high_german.transcription import Word

In [2]: w1 = Word("krippe")

In [3]: w1.phonetic_indexing(p = "SE")
Out[3]: 'K510'

In [4]: w2 = Word("krîbbe")

In [5]: w2.phonetic_indexing(p = "SE")
Out[5]: 'K510'

Transliteration

CLTK’s transcriber rewrites a word into the International Phonetic Alphabet (IPA). As of this version, the Transcriber class doesn’t support any specific dialects and serves as a superset encompassing various regional accents.

In [1]: from cltk.phonology.middle_high_german.transcription import Transcriber

In [2]: tr = Transcriber()

In [3]: tr.transcribe("Slâfest du, friedel ziere?", punctuation = True)
Out[3]: '[Slɑːfest d̥ʊ, frɪ͡əd̥el t͡sɪ͡əre?]'

In [4]: tr.transcribe("Slâfest du, friedel ziere?", punctuation = False)
Out[4]: '[Slɑːfest d̥ʊ frɪ͡əd̥el t͡sɪ͡əre]'

Word Tokenization

The WordTokenizer class takes a string as input and returns a list of tokens.

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: word_tokenizer = WordTokenizer('middle_high_german')

In [3]: text = "Mīn ougen   wurden liebes alsō vol, \n\n\ndō ich die minneclīchen ērst gesach,\ndaȥ eȥ mir hiute und   iemer mē tuot wol."

In [4]: word_tokenizer.tokenize(text)
Out[4]: ['Mīn', 'ougen', 'wurden', 'liebes', 'alsō', 'vol', ',', 'dō', 'ich', 'die', 'minneclīchen', 'ērst', 'gesach', ',', 'daȥ', 'eȥ', 'mir', 'hiute', 'und', 'iemer', 'mē', 'tuot', 'wol', '.']

Lemmatization

The CLTK offers a series of lemmatizers that can be combined in a backoff chain, i.e. if one lemmatizer is unable to return a headword for a token, this token can be passed onto another lemmatizer until either a headword is returned or the sequence ends. There is a generic version of the backoff Middle High German lemmatizer which requires data from the CLTK Middle High German models data found here. The lemmatizer expects this model to be stored in a folder called cltk_data in the user’s home directory.

To use the generic version of the backoff Middle High German Lemmatizer:

In [1]: from cltk.lemmatize.middle_high_german.backoff import BackoffMHGLemmatizer

In [2]: lemmatizer = BackoffMHGLemmatizer()

In [3]: tokens = "uns ist in alten mæren".split(" ")

In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('uns', {'uns', 'unser', 'unz', 'wir'}), ('ist', {'sîn/wider(e)+', 'ist', 'sîn/inne+', 'sîn/mit(e)<+', 'sîn/vür(e)+', 'sîn/abe+', 'sîn/obe+', 'sîn/vor(e)+', 'sîn/vür(e)>+', 'sîn/ûze+', 'sîn/ûz+', 'sîn/bî<.+', 'sîn/vür(e)<+', 'sîn/innen+', 'sîn/âne+', 'sîn/bî+', 'sîn/ûz<+', 'sîn', 'sîn/ûf<.+'}), ('in', {'ër', 'in/hin(e)+', 'in/>+gân', 'in/+gân', 'în/+gân', 'in/+lâzen', 'în', 'in/<.+wintel(e)n', 'in/>+rinnen', 'in/dar(e)+', 'in/.>+slîzen', 'în/hin(e)+', 'în/+lèiten', 'în/+var(e)n', 'in', 'in/>+tragen', 'in/+tropfen', 'în/+lègen', 'in/>+winten', 'în/+brèngen', 'in/>+büègen', 'ërr', 'în/+zièhen', 'in/<.+gân', 'in/+zièhen', 'in/>+tûchen', 'dër', 'în/dâr+', 'in/war(e).+', 'in/<.+lâzen', 'in/>+rîten', 'în/+lâzen', 'in/>+lâzen', 'in/+stapfen', 'în/+sènten', 'in/>.+lâzen', 'in/>+stân', 'in/+drücken', 'in/>+ligen', 'in/dâr+ ', 'in/+var(e)n', 'in/+vüèren', 'in/<.+vallen', 'in/>+vlièzen', 'in/<.+rîten', 'in/hër(e).+', 'ne', 'in/>+wonen', 'in/<.+sigel(e)n', 'in/+lègen', 'în/+dringen', 'in/>+ge-trîben', 'in/+diènen', 'in/>+ge-stëchen', 'in/>+stècken', 'in/hër(e)+', 'in/>+stëchen', 'in/dâr+', 'in/+blâsen', 'în/dâr.+', 'in/>+wîsen', 'în/+îlen', 'in/>+laden', 'în/+komen', 'în/+ge-lèiten', 'in/<.+vloèzen', 'ër ', 'in/>+sètzen', 'in/hièr+', 'in/>+bûwen', 'in/>+lèiten', 'în/+ge-binten', '[!]', 'în/+trîben', 'in/<.+blâsen', 'in/+komen', 'în/+krièchen', 'in/+trîben', 'in/<.+ligen', 'in/+stëchen', 'in/<+gân', 'in/dâr.+', 'în/hër(e)+', 'in/+kêren', 'in/<.+var(e)n', 'in/+rîten', 'in/>+vallen', 'in/<.+vüèren'}), ('alten', {'alt', 'alter', 'alten'}), ('mæren', {'mæren', 'mære'})]

POS tagging

In [1]: from cltk.tag.pos import POSTag

In [2]: mhg_pos_tagger = POSTag("middle_high_german")

In [3]: mhg_pos_tagger.tag_tnt("uns ist in alten mæren wunders vil geseit")
Out[3]: [('uns', 'PPER'), ('ist', 'VAFIN'), ('in', 'APPR'), ('alten', 'ADJA'), ('mæren', 'ADJA'),
         ('wunders', 'NA'), ('vil', 'AVD'), ('geseit', 'VVPP')]

Middle Low German

Middle Low German or Middle Saxon is a language that is the descendant of Old Saxon and the ancestor of modern Low German. It served as the international lingua franca of the Hanseatic League. It was spoken from about 1100 to 1600, or 1200 to 1650. (Source: Wikipedia)

POS Tagging

The POS taggers were trained using NLTK’s tagger implementations on the ReN training set.

1–2–gram backoff tagger
In [1]: from cltk.tag.pos import POSTag

In [2]: tagger = POSTag('middle_low_german')

In [3]: tagger.tag_ngram_12_backoff('Jck  Johannes  Veghe  preister  verwarer  vnde  voirs tender  des  Juncfrouwen  kloisters  to Mariendale')
Out[3]: [('Jck', 'PPER'),
       ('Johannes', 'NE'),
       ('Veghe', 'NE'),
       ('preister', 'NA'),
       ('verwarer', 'NA'),
       ('vnde', 'KON'),
       ('voirs', 'NA'),
       ('tender', 'NA'),
       ('des', 'DDARTA'),
       ('Juncfrouwen', 'NA'),
       ('kloisters', 'NA'),
       ('to', 'APPR'),
       ('Mariendale', 'NE')]

Gothic

Gothic is an extinct East Germanic language that was spoken by the Goths. It is known primarily from the Codex Argenteus, a 6th-century copy of a 4th-century Bible translation, and is the only East Germanic language with a sizable text corpus. (Source: Wikipedia)

Phonological transcription

In [1]: from cltk.phonology.gothic import transcription as gt

In [2]: from cltk.phonology import utils as ut

In [3]: sentence = "Anastodeins aiwaggeljons Iesuis Xristaus sunaus gudis."

In [4]: tr = ut.Transcriber(gt.DIPHTHONGS_IPA, gt.DIPHTHONGS_IPA_class, gt.IPA_class, gt.gothic_rules)

In [5]: tr.main(sentence, gt.gothic_rules)
Out[5]: "[anastoːðiːns ɛwaŋgeːljoːns jeːsuis kristɔs sunɔs guðis]"

Greek

Greek is an independent branch of the Indo-European family of languages, native to Greece and other parts of the Eastern Mediterranean. It has the longest documented history of any living language, spanning 34 centuries of written records. Its writing system has been the Greek alphabet for the major part of its history; other systems, such as Linear B and the Cypriot syllabary, were used previously. The alphabet arose from the Phoenician script and was in turn the basis of the Latin, Cyrillic, Armenian, Coptic, Gothic and many other writing systems. (Source: Wikipedia)

Note

For most of the following operations, you must first import the CLTK Greek linguistic data (named greek_models_cltk).

Alphabet

The Greek vowels and consonants in upper and lower case are placed in cltk/corpus/greek/alphabet.py.

Greek vowels can occur without any breathing or accent, or can carry rough or smooth breathing, different accents, diaereses, macrons, breves, and combinations thereof. Greek consonants have none of these features, except ρ, which can have rough or smooth breathing.

In alphabet.py the vowels and consonants are grouped by case, accent, breathing, diaeresis, and possible combinations thereof. These groupings are stored in lists or, in the case of a single letter like ρ, as strings, with descriptive names structured like CASE_SPECIFIERS, e.g. LOWER_DIARESIS_CIRCUMFLEX.

For example, to use upper-case vowels with rough breathing and an acute accent:

In[1]: from cltk.corpus.greek.alphabet import UPPER_ROUGH_ACUTE
In[2]: print(UPPER_ROUGH_ACUTE)
Out[2]: ['Ἅ', 'Ἕ', 'Ἥ', 'Ἵ', 'Ὅ', 'Ὕ', 'Ὥ', 'ᾍ', 'ᾝ', 'ᾭ']

Accents indicate the pitch of vowels. An acute accent or ὀξεῖα (oxeîa) indicates a rising pitch on a long vowel or a high pitch on a short vowel; a grave accent or βαρεῖα (bareîa) indicates a normal or low pitch; and a circumflex or περισπωμένη (perispōménē) indicates a high or falling pitch within one syllable.

Breathings, which are used not only on vowels but also on ρ, indicate the presence or absence of a voiceless glottal fricative: rough breathing indicates a voiceless glottal fricative before a vowel, as in αἵρεσις (haíresis), and smooth breathing indicates its absence.

Diaereses are placed on ι and υ to indicate that two vowels do not form a diphthong; macrons and breves are placed on α, ι, and υ to indicate the length of these vowels.

For more information on Greek diacritics, see the corresponding Wikipedia page.

Accentuation and diacritics

James Tauber has created a Python 3 based library to enable working with the accentuation of Ancient Greek words. Installing it is optional for working with CLTK.

For further information please see the original docs, as this is just an abridged version.

The library can be installed with pip:

pip install greek-accentuation

Contrary to the original docs, to use the functions from this module you must explicitly import every function you need, as opposed to relying on a single star import.

The Characters Module:

base returns a given character without diacritics. For example:

In[1]: from greek_accentuation.characters import base

In[2]: base('ᾳ')
Out[2]: 'α'

add_diacritic and add_breathing add diacritics (accents, diaresis, macrons, breves) and breathing symbols to the given character. add_diacritic is stackable, for example:

In[1]: from greek_accentuation.characters import add_diacritic

In[2]: add_diacritic(add_diacritic('ο', ROUGH), ACUTE)
Out[2]: 'ὅ'

accent and strip_accents return the accent of a character as a Unicode escape and the character stripped of its accent, respectively. breathing, strip_breathing, length and strip_length work analogously, for example:

In[1]: from greek_accentuation.characters import length, strip_length

In[2]: length('ῠ') == SHORT
Out[2]: True

In[3]: strip_length('ῡ')
Out[3]: 'υ'

If a length diacritic becomes redundant because of a circumflex it can be stripped with remove_redundant_macron just like strip_length above.

The Syllabify Module:

syllabify splits the given word into syllables, which are returned as a list of strings. Words without vowels are syllabified as a single syllable. The syllabification can also be displayed as a word with the syllables separated by periods using display_word.

In[1]: from greek_accentuation.syllabify import syllabify, display_word

In[2]: syllabify('γυναικός')
Out[2]: ['γυ', 'ναι', 'κός']

In[3]: syllabify('γγγ')
Out[3]: ['γγγ']

In[4]: display_word(syllabify('καταλλάσσω'))
Out[4]: 'κα.ταλ.λάσ.σω'

is_vowel and is_diphthong return a boolean value to determine whether a given character is a vowel or two given characters are a diphthong.

In[1]: from greek_accentuation.syllabify import is_diphthong

In[2]: is_diphthong('αι')
Out[2]: True
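
is_vowel can be illustrated on a single character in the same way (a minimal example added here for completeness):

In[3]: from greek_accentuation.syllabify import is_vowel

In[4]: is_vowel('ο')
Out[4]: True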

ultima, penult and antepenult return the last, next-to-last and third-from-last syllables of the given word, respectively. A syllable can also be further broken down into its onset, nucleus and coda (i.e. the starting consonant, middle part and ending consonant) with the functions named accordingly. rime returns the sequence of a syllable’s nucleus and coda, and body returns the sequence of a syllable’s onset and nucleus.

onset_nucleus_coda returns a syllable’s onset, nucleus and coda all at once as a triple.
In[1]: from greek_accentuation.syllabify import ultima, rime, onset_nucleus_coda

In[2]: ultima('γυναικός')
Out[2]: 'κός'

In[3]: rime('κός')
Out[3]: 'ός'

In[4]: onset_nucleus_coda('ναι')
Out[4]: ('ν', 'αι', '')
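
By the definitions above, body returns the onset plus the nucleus; a minimal illustration on the same syllables:

In[5]: from greek_accentuation.syllabify import body

In[6]: body('κός')
Out[6]: 'κό'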

debreath returns a word with the smooth breathing removed and the rough breathing replaced with an h. rebreath reverses debreath.

In[1]: from greek_accentuation.syllabify import debreath, rebreath

In[2]: debreath('οἰκία')
Out[2]: 'οικία'

In[3]: rebreath('οικία')
Out[3]: 'οἰκία'

In[3]: debreath('ἑξεῖ')
Out[3]: 'hεξεῖ'

In[4]: rebreath('hεξεῖ')
Out[4]: 'ἑξεῖ'

syllable_length returns the length of a syllable (in the linguistic sense) and syllable_accent extracts a syllable’s accent.

In[1]: from greek_accentuation.syllabify import syllable_length, syllable_accent

In[2]: syllable_length('σω') == LONG
Out[2]: True

In[3]: syllable_accent('ναι') is None
Out[3]: True

The accentuation class of a word such as oxytone, paroxytone, proparoxytone, perispomenon, properispomenon or barytone can be tested with the functions named accordingly.
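
For instance, assuming these class-testing functions live in the syllabify module alongside the functions above:

In[1]: from greek_accentuation.syllabify import oxytone, paroxytone

In[2]: oxytone('θεός')
Out[2]: True

In[3]: paroxytone('λόγος')
Out[3]: True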

add_necessary_breathing adds smooth breathing to a word if necessary.

In[1]: from greek_accentuation.syllabify import add_necessary_breathing

In[2]: add_necessary_breathing('οι')
Out[2]: 'οἰ'

In[3]: add_necessary_breathing('οἰ')
Out[3]: 'οἰ'

The Accentuation Module:

get_accent_type returns the accent type of a word as a tuple of the syllable number and accent, which is comparable to the constants provided. The accent type can also be displayed as a string with display_accent_type.

In[1]: from greek_accentuation.accentuation import get_accent_type, display_accent_type

In[2]: get_accent_type('ἀγαθοῦ') == PERISPOMENON
Out[2]: True

In[3]: display_accent_type(get_accent_type('ψυχή'))
Out[3]: 'oxytone'

syllable_add_accent(syllable, accent) adds the given accent to a syllable. It is also possible to add an accent class to a syllable, for example:

In[1]: from greek_accentuation.accentuation import syllable_add_accent, make_paroxytone

In[2]: syllable_add_accent('ου', CIRCUMFLEX)
Out[2]: 'οῦ'

In[3]: make_paroxytone('λογος')
Out[3]: 'λόγος'

possible_accentuations returns all possible accentuations of a given syllabification according to Ancient Greek accentuation rules. To treat vowels of unmarked length as short vowels set default_short = True in the function parameters.

In[1]: from greek_accentuation.syllabify import syllabify

In[2]: from greek_accentuation.accentuation import possible_accentuations

In[3]: s = syllabify('εγινωσκου')

In[4]: for accent_class in possible_accentuations(s):
   ...:     print(add_accent(s, accent_class))
   ...:
εγινώσκου
εγινωσκού
εγινωσκοῦ

In[5]: s = syllabify('κυριος')

In[6]: for accent_class in possible_accentuations(s, default_short=True):
   ...:     print(add_accent(s, accent_class))
   ...:
κύριος
κυρίος
κυριός

recessive finds the most recessive accent (i.e. the accent as far from the end of the word as allowed) and returns the given word with that accent. A | can be placed to set a point past which the accent will not recede. on_penult places the accent on the penult (the next-to-last syllable).

In[1]: from greek_accentuation.accentuation import recessive, on_penult

In[2]: recessive('εἰσηλθον')
Out[2]: 'εἴσηλθον'

In[3]: recessive('εἰσ|ηλθον')
Out[3]: 'εἰσῆλθον'

In[4]: on_penult('φωνησαι')
Out[4]: 'φωνῆσαι'

persistent gets passed a word and a lemma (i.e. the canonical form of a set of words) and derives the accent from these two words.

In[1]: from greek_accentuation.accentuation import persistent

In[2]: persistent('ἀνθρωπου', 'ἄνθρωπος')
Out[2]: 'ἀνθρώπου'

Expand iota subscript:

The CLTK offers one transformation that can be useful in certain types of processing: expanding the iota subscript from a combining code point into a character placed beside (to the right of) the base character.

In [1]: from cltk.corpus.greek.alphabet import expand_iota_subscript

In [2]: s = 'εἰ δὲ καὶ τῷ ἡγεμόνι πιστεύσομεν ὃν ἂν Κῦρος διδῷ'

In [3]: expand_iota_subscript(s)
Out[3]: 'εἰ δὲ καὶ τῶΙ ἡγεμόνι πιστεύσομεν ὃν ἂν Κῦρος διδῶΙ'

In [4]: expand_iota_subscript(s, lowercase=True)
Out[4]: 'εἰ δὲ καὶ τῶι ἡγεμόνι πιστεύσομεν ὃν ἂν κῦρος διδῶι'

Converting Beta Code to Unicode

Note that incoming strings need to be raw strings (prefixed with r) and that the Beta Code must follow immediately after the opening """, as in input line 2 below.

In [1]: from cltk.corpus.greek.beta_to_unicode import Replacer

In [2]: BETA_EXAMPLE = r"""O(/PWS OU)=N MH\ TAU)TO\ PA/QWMEN E)KEI/NOIS, E)PI\ TH\N DIA/GNWSIN AU)TW=N E)/RXESQAI DEI= PRW=TON. TINE\S ME\N OU)=N AU)TW=N EI)SIN A)KRIBEI=S, TINE\S DE\ OU)K A)KRIBEI=S O)/NTES METAPI/-PTOUSIN EI)S TOU\S E)PI\ SH/YEI: OU(/TW GA\R KAI\ LOU=SAI KAI\ QRE/YAI KALW=S KAI\ MH\ LOU=SAI PA/LIN, O(/TE MH\ O)RQW=S DUNHQEI/HMEN."""

In [3]: r = Replacer()

In [4]: r.beta_code(BETA_EXAMPLE)
Out[4]: 'ὅπως οὖν μὴ ταὐτὸ πάθωμεν ἐκείνοις, ἐπὶ τὴν διάγνωσιν αὐτῶν ἔρχεσθαι δεῖ πρῶτον. τινὲς μὲν οὖν αὐτῶν εἰσιν ἀκριβεῖς, τινὲς δὲ οὐκ ἀκριβεῖς ὄντες μεταπίπτουσιν εἰς τοὺς ἐπὶ σήψει· οὕτω γὰρ καὶ λοῦσαι καὶ θρέψαι καλῶς καὶ μὴ λοῦσαι πάλιν, ὅτε μὴ ὀρθῶς δυνηθείημεν.'

The beta code converter can also handle lowercase notation:

In [5]: BETA_EXAMPLE_2 = r"""me/xri me\n w)/n tou/tou a(rpaga/s mou/nas ei)=nai par' a)llh/lwn, to\ de\ a)po\ tou/tou *(/ellhnas dh\ mega/lws ai)ti/ous gene/sqai: prote/rous ga\r a)/rcai strateu/esqai e)s th\n *)asi/hn h)\ sfe/as e)s th\n *eu)rw/phn. """

In [6]: r.beta_code(BETA_EXAMPLE_2)
Out[6]: 'μέχρι μὲν ὤν τούτου ἁρπαγάς μούνας εἶναι παρ’ ἀλλήλων, τὸ δὲ ἀπὸ τούτου Ἕλληνας δὴ μεγάλως αἰτίους γενέσθαι· προτέρους γὰρ ἄρξαι στρατεύεσθαι ἐς τὴν Ἀσίην ἢ σφέας ἐς τὴν Εὐρώπην.'

Converting TLG texts with TLGU

The TLGU is an excellent C-language program for converting the TLG and PHI corpora into human-readable Unicode. The CLTK has an automated downloader and installer, as well as a wrapper which facilitates its use. When TLGU() is instantiated, it checks the local OS for a functioning version of the software. If one is not found then, following the user’s confirmation, it is downloaded and installed.

Most users will want to do a bulk conversion of the entirety of a corpus without any text markup (such as chapter or line numbers). Note that you must import a local corpus before converting it.

In [1]: from cltk.corpus.greek.tlgu import TLGU

In [2]: t = TLGU()

In [3]: t.convert_corpus(corpus='tlg')  # writes to: ~/cltk_data/greek/text/tlg/plaintext/

For the PHI7, you may declare whether you want the corpus to be written to the greek or latin directories. By default, it writes to greek.

In [5]: t.convert_corpus(corpus='phi7')  # ~/cltk_data/greek/text/phi7/plaintext/

In [6]: t.convert_corpus(corpus='phi7', latin=True)  # ~/cltk_data/latin/text/phi7/plaintext/

The above commands take each author file and convert them into a new author file. But the software has a useful option to divide each author file into a new file for each work it contains. Thus, Homer’s file, TLG0012.TXT, becomes TLG0012.TXT-001.txt, TLG0012.TXT-002.txt, and TLG0012.TXT-003.txt. To achieve this, use the following command for the TLG:

In [7]: t.divide_works('tlg')  # ~/cltk_data/greek/text/tlg/individual_works/

You may also convert individual files, with options for how the conversion happens.

In [3]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt')

In [4]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', markup='full')

In [5]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', rm_newlines=True)

In [6]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', divide_works=True)

For convert(), plain arguments may be sent directly to the TLGU, as well, via extra_args:

In [7]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', extra_args=['p', 'B'])

Even after plaintext conversion, the TLG will still need some cleanup. The CLTK contains some code for post-TLGU cleanup.

You may read about these arguments in the TLGU manual.

Once these files are created, see TLG Indices below for accessing these newly created files.

Corpus Readers

Most users will want to access words, sentences, paragraphs and even whole documents via a CorpusReader object. All corpus contributors should provide a suitable reader. There is one for Perseus Greek, and others will be made available. The CorpusReader methods: paras() returns paragraphs, where possible; words() returns a generator of words; sents() returns a generator of sentences; docs() returns a generator of Python dictionary objects representing each document.

In [1]: from cltk.corpus.readers import get_corpus_reader
   ...: reader = get_corpus_reader( corpus_name = 'greek_text_perseus', language = 'greek')
   ...: # get all the docs
   ...: docs = list(reader.docs())
   ...: len(docs)
   ...:
Out[1]: 222

In [2]: # or set just one
   ...: reader._fileids = ['plato__apology__grc.json']

In [3]: # get all the sentences
In [4]: sentences = list(reader.sents())
   ...: len(sentences)
   ...:
Out[4]: 4983

In [5]: # Or just one

In [6]: sentences[0]
Out[6]: '\n \n \n \n \n ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ\n τῶν ἐμῶν κατηγόρων, οὐκ οἶδα· ἐγὼ δʼ οὖν καὶ αὐτὸς ὑπʼ αὐτῶν ὀλίγου ἐμαυτοῦ\n ἐπελαθόμην, οὕτω πιθανῶς ἔλεγον.'

In [7]: # access an individual doc as a dictionary of dictionaries
   ...: doc = list(reader.docs())[0]
   ...: doc.keys()
   ...:
Out[7]: dict_keys(['language', 'englishTitle', 'original-urn', 'author', 'urn', 'text', 'source', 'originalTitle', 'edition', 'sourceLink', 'meta', 'filename'])
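
The words() generator is consumed in the same way; a brief illustration (output omitted here, since it depends on the fileids currently selected):

In [8]: words = list(reader.words())
   ...: words[:10]
   ...: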

Information Retrieval

See Multilingual Information Retrieval for Greek–specific search options.

Lemmatization

Tip

For ambiguous forms, which could belong to several headwords, the current lemmatizer chooses the more commonly occurring headword (code here). For any errors that you spot, please open a ticket.

The CLTK’s lemmatizer is based on a key-value store, whose code is available at the CLTK’s Latin lemma/POS repository.

The lemmatizer offers several input and output options. For text input, it can take a string or a list of tokens. Here is an example of the lemmatizer taking a string:

In [1]: from cltk.stem.lemma import LemmaReplacer

In [2]: sentence = 'τὰ γὰρ πρὸ αὐτῶν καὶ τὰ ἔτι παλαίτερα σαφῶς μὲν εὑρεῖν διὰ χρόνου πλῆθος ἀδύνατα ἦν'

In [3]: from cltk.corpus.utils.formatter import cltk_normalize

In [4]: sentence = cltk_normalize(sentence)  # can help when using certain texts

In [5]: lemmatizer = LemmaReplacer('greek')

In [6]: lemmatizer.lemmatize(sentence)
Out[6]:
['τὰ',
 'γὰρ',
 'πρὸ',
 'αὐτός',
 'καὶ',
 'τὰ',
 'ἔτι',
 'παλαιός',
 'σαφής',
 'μὲν',
 'εὑρίσκω',
 'διὰ',
 'χρόνος',
 'πλῆθος',
 'ἀδύνατος',
 'εἰμί']

And here taking a list:

In [5]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'])
Out[5]: ['χρόνος', 'πλῆθος', 'ἀδύνατος', 'εἰμί']

The lemmatizer takes several optional arguments for controlling output: return_raw=True and return_string=True. return_raw returns the original inflection along with its headword:

In [6]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'], return_raw=True)
Out[6]: ['χρόνου/χρόνος', 'πλῆθος/πλῆθος', 'ἀδύνατα/ἀδύνατος', 'ἦν/εἰμί']

And return_string=True wraps the list in ' '.join():

In [7]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'], return_string=True)
Out[7]: 'χρόνος πλῆθος ἀδύνατος εἰμί'

These two arguments can be combined, as well.
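
For example, combining return_raw=True and return_string=True yields the inflection/headword pairs joined into a single string (this follows directly from the two examples above):

In [8]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'], return_raw=True, return_string=True)
Out[8]: 'χρόνου/χρόνος πλῆθος/πλῆθος ἀδύνατα/ἀδύνατος ἦν/εἰμί'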

Lemmatization, backoff method

The CLTK offers a series of lemmatizers that can be combined in a backoff chain, i.e. if one lemmatizer is unable to return a headword for a token, this token can be passed onto another lemmatizer until either a headword is returned or the sequence ends.

There is a generic version of the backoff Greek lemmatizer which requires data from the CLTK Greek models, found at https://github.com/cltk/greek_models_cltk/tree/master/lemmata/backoff. The lemmatizer expects this model to be stored in a folder called cltk_data in the user’s home directory.

To use the generic version of the backoff Greek Lemmatizer:

In [1]: from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer

In [2]: lemmatizer = BackoffGreekLemmatizer()

In [3]: tokens = 'κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος'.split()

In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('κατέβην', 'καταβαίνω'), ('χθὲς', 'χθές'), ('εἰς', 'εἰς'), ('Πειραιᾶ', 'Πειραιᾶ'), ('μετὰ', 'μετά'), ('Γλαύκωνος', 'Γλαύκων'), ('τοῦ', 'ὁ'), ('Ἀρίστωνος', 'Ἀρίστων')]

NB: The backoff chain for this lemmatizer is defined as follows:

  1. a dictionary-based lemmatizer with high-frequency, unambiguous forms;
  2. a training-data-based lemmatizer trained on sentences from the [Perseus Ancient Greek Dependency Treebanks](https://perseusdl.github.io/treebank_data/);
  3. a regular-expression-based lemmatizer transforming unambiguous endings (currently very limited);
  4. a dictionary-based lemmatizer with the complete set of Morpheus lemmas;
  5. an ‘identity’ lemmatizer returning the token itself as the lemma.

Each of these sub-lemmatizers is explained in the documents for “Multilingual”.

Named Entity Recognition

A simple interface to a list of Greek proper nouns is available (see the repo for how the list was created). By default tag_ner() takes a string input and returns a list of tuples. However, it can also take pre-tokenized forms and return a string.

In [1]: from cltk.tag import ner

In [2]: text_str = 'τὰ Σίλαριν Σιννᾶν Κάππαρος Πρωτογενείας Διονυσιάδες τὴν'

In [3]: ner.tag_ner('greek', input_text=text_str, output_type=list)
Out[3]:
[('τὰ',),
 ('Σίλαριν', 'Entity'),
 ('Σιννᾶν', 'Entity'),
 ('Κάππαρος', 'Entity'),
 ('Πρωτογενείας', 'Entity'),
 ('Διονυσιάδες', 'Entity'),
 ('τὴν',)]
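
To get a string back instead, pass output_type=str; recognized entities are then marked inline in the returned string (the call is shown here without its output):

In [4]: ner.tag_ner('greek', input_text=text_str, output_type=str)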

Normalization

Normalizing polytonic Greek is a problem that has mostly been solved; however, issues still arise when working with legacy applications. We recommend normalizing Greek vowels in order to ensure string matching.

One type of normalization issue comes from tonos accents (intended for Modern Greek) being used instead of oxia accents (for Ancient Greek). Here is an example of two characters that appear identical but are in fact distinct:

In [1]: from cltk.corpus.utils.formatter import tonos_oxia_converter

In [2]: char_tonos = "ά"  # with tonos, for Modern Greek

In [3]: char_oxia = "ά"  # with oxia, for Ancient Greek

In [4]: char_tonos == char_oxia
Out[4]: False

In [5]: ord(char_tonos)
Out[5]: 940

In [6]: ord(char_oxia)
Out[6]: 8049

In [7]: char_oxia == tonos_oxia_converter(char_tonos)
Out[7]: True

If for any reason you want to go from oxia to tonos, just add the reverse=True parameter:

In [8]: char_tonos == tonos_oxia_converter(char_oxia, reverse=True)
Out[8]: True

Another approach to normalization is to use Python’s built-in normalize() (from the unicodedata module). The CLTK provides a wrapper for this as a convenience. Here’s an example of its use in “compatibility” mode (NFKC):

In [1]: from cltk.corpus.utils.formatter import cltk_normalize

In [2]: tonos = "ά"

In [3]: oxia = "ά"

In [4]: tonos == oxia
Out[4]: False

In [5]: tonos == cltk_normalize(oxia)
Out[5]: True

One can turn off compatibility with:

In [6]: tonos == cltk_normalize(oxia, compatibility=False)
Out[6]: True

For more on normalize() see the Python Unicode docs.

POS tagging

These taggers were built with the assistance of the NLTK. The backoff tagger is Bayesian and the TnT is an HMM. To obtain the models, first import the greek_models_cltk corpus.

1–2–3–gram backoff tagger
In [1]: from cltk.tag.pos import POSTag

In [2]: tagger = POSTag('greek')

In [3]: tagger.tag_ngram_123_backoff('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
Out[3]:
[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---MG-'),
 ('᾽', None),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'N-S---FG-'),
 ('ἐτείας', 'A-S---FG-'),
 ('μῆκος', 'N-S---NA-')]
TnT tagger
In [4]: tagger.tag_tnt('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
Out[4]:
[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---NG-'),
 ('᾽', 'Unk'),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'N-S---FG-'),
 ('ἐτείας', 'A-S---FG-'),
 ('μῆκος', 'N-S---NA-')]
CRF tagger

Warning

This tagger’s accuracy has not yet been tested.

We use the NLTK’s CRF tagger. For information on it, see the NLTK docs.

In [5]: tagger.tag_crf('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
Out[5]:
[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---NG-'),
 ('᾽', 'A-S---FA-'),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'A-S---FG-'),
 ('ἐτείας', 'N-S---FG-'),
 ('μῆκος', 'N-S---NA-')]

Prosody Scanning

There is a prosody scanner for scanning rhythms in Greek texts. It returns a list of strings of long and short marks for each sentence. Note that the last syllable of each sentence string is marked with an anceps so that specific clausulae are delineated.

In [1]: from cltk.prosody.greek.scanner import Scansion

In [2]: scanner = Scansion()

In [3]: scanner.scan_text('νέος μὲν καὶ ἄπειρος, δικῶν ἔγωγε ἔτι. μὲν καὶ ἄπειρος.')
Out[3]: ['˘¯¯¯˘¯¯˘¯˘¯˘˘x', '¯¯˘¯x']

Sentence Tokenization

Sentence tokenization for Ancient Greek is available using (by default) a regular-expression based tokenizer. To tokenize a Greek text by sentences…

In [1]: from cltk.tokenize.greek.sentence import SentenceTokenizer

In [2]: sent_tokenizer = SentenceTokenizer()

In [3]: untokenized_text = """ὅλως δ’ ἀντεχόμενοί τινες, ὡς οἴονται, δικαίου τινός (ὁ γὰρ νόμος δίκαιόν τἰ τὴν κατὰ πόλεμον δουλείαν τιθέασι δικαίαν, ἅμα δ’ οὔ φασιν· τήν τε γὰρ ἀρχὴν ἐνδέχεται μὴ δικαίαν εἶναι τῶν πολέμων, καὶ τὸν ἀνάξιον δουλεύειν οὐδαμῶς ἂν φαίη τις δοῦλον εἶναι· εἰ δὲ μή, συμβήσεται τοὺς εὐγενεστάτους εἶναι δοκοῦντας δούλους εἶναι καὶ ἐκ δούλων, ἐὰν συμβῇ πραθῆναι ληφθέντας."""

In [4]: sent_tokenizer.tokenize(untokenized_text)
Out[4]: ['ὅλως δ’ ἀντεχόμενοί τινες, ὡς οἴονται, δικαίου τινός (ὁ γὰρ νόμος δίκαιόν τἰ τὴν κατὰ πόλεμον δουλείαν τιθέασι δικαίαν, ἅμα δ’ οὔ φασιν·', 'τήν τε γὰρ ἀρχὴν ἐνδέχεται μὴ δικαίαν εἶναι τῶν πολέμων, καὶ τὸν ἀνάξιον δουλεύειν οὐδαμῶς ἂν φαίη τις δοῦλον εἶναι·', 'εἰ δὲ μή, συμβήσεται τοὺς εὐγενεστάτους εἶναι δοκοῦντας δούλους εἶναι καὶ ἐκ δούλων, ἐὰν συμβῇ πραθῆναι ληφθέντας.']

The sentence tokenizer takes a string input into tokenize() and returns a list of strings. For more on the tokenizer, or to make your own, see the CLTK’s Greek sentence tokenizer training set repository.

There is also an experimental [Punkt](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) tokenizer trained on the Greek Tesserae texts. The model for this tokenizer can be found in the CLTK corpora under greek_model_cltk/tokenizers/sentence/greek_punkt.

In [5]: from cltk.tokenize.greek.sentence import SentenceTokenizer

In [6]: sent_tokenizer = SentenceTokenizer(tokenizer='punkt')

etc.

NB: The old method for sentence tokenization, i.e. TokenizeSentence, is still available, but will soon be replaced by the method above.

In [7]: from cltk.tokenize.sentence import TokenizeSentence

In [8]: tokenizer = TokenizeSentence('greek')

etc.

Stopword Filtering

To use the CLTK’s built-in stopwords list:

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from cltk.stop.greek.stops import STOPS_LIST

In [3]: sentence = 'Ἅρπαγος δὲ καταστρεψάμενος Ἰωνίην ἐποιέετο στρατηίην ἐπὶ Κᾶρας καὶ Καυνίους καὶ Λυκίους, ἅμα ἀγόμενος καὶ Ἴωνας καὶ Αἰολέας.'

In [4]: p = PunktLanguageVars()

In [5]: tokens = p.word_tokenize(sentence.lower())

In [6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]:
['ἅρπαγος',
 'καταστρεψάμενος',
 'ἰωνίην',
 'ἐποιέετο',
 'στρατηίην',
 'κᾶρας',
 'καυνίους',
 'λυκίους',
 ',',
 'ἅμα',
 'ἀγόμενος',
 'ἴωνας',
 'αἰολέας.']

Swadesh

The corpus module has a class for generating a Swadesh list for Greek.

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('gr')

In [3]: swadesh.words()[:10]
Out[3]: ['ἐγώ', 'σύ', 'αὐτός, οὗ, ὅς, ὁ, οὗτος', 'ἡμεῖς', 'ὑμεῖς', 'αὐτοί', 'ὅδε', 'ἐκεῖνος', 'ἔνθα, ἐνθάδε, ἐνταῦθα', 'ἐκεῖ']

TEI XML

There are two rudimentary corpus converters for the “First 1K Years of Greek” project (download the corpus greek_text_first1kgreek). Both write files to ~/cltk_data/greek/text/greek_text_first1kgreek_plaintext.

This one is built upon the MyCapytain library (pip install lxml MyCapytain), which allows very precise chunking of TEI XML. The following function only preserves numbers:

In [1]: from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text_capitains

In [2]: onekgreek_tei_xml_to_text_capitains()

For the following, install the BeautifulSoup library (pip install bs4). Note that this will simply dump all text not contained within a node’s brackets (sometimes including metadata).

In [1]: from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text

In [2]: onekgreek_tei_xml_to_text()

Text Cleanup

Intended for use on the TLG after processing by TLGU().

In [1]: from cltk.corpus.utils.formatter import tlg_plaintext_cleanup

In [2]: import os

In [3]: file = os.path.expanduser('~/cltk_data/greek/text/tlg/individual_works/TLG0035.TXT-001.txt')

In [4]: with open(file) as f:
...:     r = f.read()
...:

In [5]: r[:500]
Out[5]: "\n{ΜΟΣΧΟΥ ΕΡΩΣ ΔΡΑΠΕΤΗΣ} \n  Ἁ Κύπρις τὸν Ἔρωτα τὸν υἱέα μακρὸν ἐβώστρει: \n‘ὅστις ἐνὶ τριόδοισι πλανώμενον εἶδεν Ἔρωτα, \nδραπετίδας ἐμός ἐστιν: ὁ μανύσας γέρας ἑξεῖ. \nμισθός τοι τὸ φίλημα τὸ Κύπριδος: ἢν δ' ἀγάγῃς νιν, \nοὐ γυμνὸν τὸ φίλημα, τὺ δ', ὦ ξένε, καὶ πλέον ἑξεῖς. \nἔστι δ' ὁ παῖς περίσαμος: ἐν εἴκοσι πᾶσι μάθοις νιν. \nχρῶτα μὲν οὐ λευκὸς πυρὶ δ' εἴκελος: ὄμματα δ' αὐτῷ \nδριμύλα καὶ φλογόεντα: κακαὶ φρένες, ἁδὺ λάλημα: \nοὐ γὰρ ἴσον νοέει καὶ φθέγγεται: ὡς μέλι φωνά, \nὡς δὲ χολὰ νόος ἐστίν: "

In [7]: tlg_plaintext_cleanup(r, rm_punctuation=True, rm_periods=False)[:500]
Out[7]: ' Ἁ Κύπρις τὸν Ἔρωτα τὸν υἱέα μακρὸν ἐβώστρει ὅστις ἐνὶ τριόδοισι πλανώμενον εἶδεν Ἔρωτα δραπετίδας ἐμός ἐστιν ὁ μανύσας γέρας ἑξεῖ. μισθός τοι τὸ φίλημα τὸ Κύπριδος ἢν δ ἀγάγῃς νιν οὐ γυμνὸν τὸ φίλημα τὺ δ ὦ ξένε καὶ πλέον ἑξεῖς. ἔστι δ ὁ παῖς περίσαμος ἐν εἴκοσι πᾶσι μάθοις νιν. χρῶτα μὲν οὐ λευκὸς πυρὶ δ εἴκελος ὄμματα δ αὐτῷ δριμύλα καὶ φλογόεντα κακαὶ φρένες ἁδὺ λάλημα οὐ γὰρ ἴσον νοέει καὶ φθέγγεται ὡς μέλι φωνά ὡς δὲ χολὰ νόος ἐστίν ἀνάμερος ἠπεροπευτάς οὐδὲν ἀλαθεύων δόλιον βρέφος ἄγρια π'

TLG Indices

The TLG comes with some old, difficult-to-parse index files, which have been made available as Python dictionaries (in cltk/corpus/greek/tlg). Below are some functions to make accessing these easy. The outputs are variously a dict of an index, or a set if the function returns unique author IDs.

Tip

Python sets are like lists, but contain only unique values. Multiple sets can be conveniently combined (see docs here).
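
For example, two author-ID sets (hypothetical values, purely for illustration) can be intersected or united with plain Python set operators:

In [1]: ids_a = {'0058', '0546', '0556'}  # hypothetical author IDs

In [2]: ids_b = {'0009', '0051', '0058'}  # hypothetical author IDs

In [3]: ids_a & ids_b  # intersection
Out[3]: {'0058'}

In [4]: sorted(ids_a | ids_b)  # union
Out[4]: ['0009', '0051', '0058', '0546', '0556']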

In [1]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_female_authors

In [2]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_index

In [3]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithets

In [4]: from cltk.corpus.greek.tlg.parse_tlg_indices import select_authors_by_epithet

In [5]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_of_author

In [6]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_geo_index

In [7]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_geographies

In [8]: from cltk.corpus.greek.tlg.parse_tlg_indices import select_authors_by_geo

In [9]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_geo_of_author

In [10]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_lists

In [11]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_id_author

In [12]: from cltk.corpus.greek.tlg.parse_tlg_indices import select_id_by_name

In [13]: get_female_authors()
Out[13]:
{'0009',
 '0051',
 '0054',
 …}

In [14]: get_epithet_index()
Out[14]:
{'Lexicographi': {'3136', '4040', '4085', '9003'},
 'Lyrici/-ae': {'0009',
  '0033',
  '0199',
  …}}

In [15]: get_epithets()
Out[15]:
['Alchemistae',
 'Apologetici',
 'Astrologici',
 …]

In [16]: select_authors_by_epithet('Tactici')
Out[16]: {'0058', '0546', '0556', '0648', '3075', '3181'}

In [17]: get_epithet_of_author('0016')
Out[17]: 'Historici/-ae'

In [18]: get_geo_index()
Out[18]:
{'Alchemistae': {'1016',
  '2019',
  '2140',
  '2181',
  …}}

In [19]: get_geographies()
Out[19]:
['Abdera',
 'Adramytteum',
 'Aegae',
 …]

In [20]: select_authors_by_geo('Thmuis')
Out[20]: {'2966'}

In [21]: get_geo_of_author('0216')
Out[21]: 'Aetolia'

In [22]: get_lists()
Out[22]:
{'Lists pertaining to all works in Canon (by TLG number)': {'LIST3CLA.BIN': 'Literary classifications of works',
  'LIST3CLX.BIN': 'Literary classifications of works (with x-refs)',
  'LIST3DAT.BIN': 'Chronological classifications of authors',
   …}}

In [23]: get_id_author()
Out[23]:
{'1139': 'Anonymi Historici (FGrH)',
 '4037': 'Anonymi Paradoxographi',
 '0616': 'Polyaenus Rhet.',
 …}

In [28]: select_id_by_name('hom')
Out[28]:
[('0012', 'Homerus Epic., Homer'),
 ('1252', 'Certamen Homeri Et Hesiodi'),
 ('1805', 'Vitae Homeri'),
 ('5026', 'Scholia In Homerum'),
 ('1375', 'Evangelium Thomae'),
 ('2038', 'Acta Thomae'),
 ('0013', 'Hymni Homerici, Homeric Hymns'),
 ('0253', '[Homerus] [Epic.]'),
 ('1802', 'Homerica'),
 ('1220', 'Batrachomyomachia'),
 ('9023', 'Thomas Magister Philol.')]

In addition to these indices there are several helper functions which will build filepaths for your particular machine. Note that you will need to have run convert_corpus(corpus='tlg') and divide_works('tlg') from the TLGU() class for the first and second of the following functions, respectively.

In [1]: from cltk.corpus.utils.formatter import assemble_tlg_author_filepaths

In [2]: assemble_tlg_author_filepaths()
Out[2]:
['/Users/kyle/cltk_data/greek/text/tlg/plaintext/TLG1167.TXT',
 '/Users/kyle/cltk_data/greek/text/tlg/plaintext/TLG1584.TXT',
 '/Users/kyle/cltk_data/greek/text/tlg/plaintext/TLG1196.TXT',
 '/Users/kyle/cltk_data/greek/text/tlg/plaintext/TLG1201.TXT',
 ...]

In [3]: from cltk.corpus.utils.formatter import assemble_tlg_works_filepaths

In [4]: assemble_tlg_works_filepaths()
Out[4]:
['/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG1585.TXT-001.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG0038.TXT-001.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG1607.TXT-002.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG0468.TXT-001.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG0468.TXT-002.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-001.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-002.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-003.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-004.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-005.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-006.txt',
 '/Users/kyle/cltk_data/greek/text/tlg/individual_works/TLG4175.TXT-007.txt',
 ...]

These two functions are useful when, for example, needing to process all authors of the TLG corpus, all works of the corpus, or all works of one particular author.
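
For example, a minimal sketch of applying the post-TLGU cleanup (described under Text Cleanup above) to every individual work:

In [5]: from cltk.corpus.utils.formatter import tlg_plaintext_cleanup

In [6]: cleaned_texts = []
   ...: for filepath in assemble_tlg_works_filepaths():
   ...:     with open(filepath) as f:
   ...:         cleaned_texts.append(tlg_plaintext_cleanup(f.read(), rm_punctuation=True, rm_periods=False))
   ...: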

Transliteration

The CLTK provides IPA phonetic transliteration for the Greek language. Currently, the only available dialect is Attic as reconstructed by Philomen Probert (taken from A Companion to the Ancient Greek Language, 85-103). Example:

In [1]: from cltk.phonology.greek.transcription import Transcriber

In [2]: transcriber = Transcriber(dialect="Attic", reconstruction="Probert")

In [3]: transcriber.transcribe("Διόθεν καὶ δισκήπτρου τιμῆς ὀχυρὸν ζεῦγος Ἀτρειδᾶν στόλον Ἀργείων")
Out[3]: '[di.ó.tʰen kɑj dis.kɛ́ːp.trọː ti.mɛ̂ːs o.kʰy.ron zdêw.gos ɑ.trẹː.dɑ̂n stó.lon ɑr.gẹ́ː.ɔːn]'

Word Tokenization

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: word_tokenizer = WordTokenizer('greek')

In [3]: text = 'Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων,'

In [4]: word_tokenizer.tokenize(text)
Out[4]: ['Θουκυδίδης', 'Ἀθηναῖος', 'ξυνέγραψε', 'τὸν', 'πόλεμον', 'τῶν', 'Πελοποννησίων', 'καὶ', 'Ἀθηναίων', ',']

Word2Vec

Note

The Word2Vec models have not been fully vetted and are offered in the spirit of a beta. The CLTK’s API for it will be revised.

Note

You will need to install Gensim to use these features.

Word2Vec is a vector space model especially powerful for comparing words in relation to each other. For instance, it is commonly used to discover words which appear in similar contexts (something akin to synonyms; think of them as lexical clusters).

The CLTK repository contains pre-trained Word2Vec models for Greek (import as greek_word2vec_cltk), one lemmatized and the other not. They were trained on the TLG corpus. To train your own, see the README at the Greek Word2Vec repository.

One of the most common uses of Word2Vec is as a keyword expander. Keyword expansion is the taking of a query term, finding synonyms, and searching for those, too. Here’s an example of its use:

In [1]: from cltk.ir.query import search_corpus

In [2]: for x in search_corpus('πνεῦμα', 'tlg', context='sentence', case_insensitive=True, expand_keyword=True, threshold=0.5):
   ...:     print(x)
   ...:
The following similar terms will be added to the 'πνεῦμα' query: '['γεννώμενον', 'ἔντερον', 'βάπτισμα', 'εὐαγγέλιον', 'δέρμα', 'ἐπιῤῥέον', 'ἔμβρυον', 'ϲῶμα', 'σῶμα', 'συγγενὲς']'.
('Lucius Annaeus Cornutus Phil.', "μυθολογεῖται δ' ὅτι διασπασθεὶς ὑπὸ τῶν Τιτά-\nνων συνετέθη πάλιν ὑπὸ τῆς Ῥέας, αἰνιττομένων τῶν \nπαραδόντων τὸν μῦθον ὅτι οἱ γεωργοί, θρέμματα γῆς \nὄντες, συνέχεαν τοὺς βότρυς καὶ τοῦ ἐν αὐτοῖς Διονύσου \nτὰ μέρη ἐχώρισαν ἀπ' ἀλλήλων, ἃ δὴ πάλιν ἡ εἰς ταὐτὸ \nσύρρυσις τοῦ γλεύκους συνήγαγε καὶ ἓν *σῶμα* ἐξ αὐτῶν \nἀπετέλεσε.")
('Metopus Phil.', '\nκαὶ ταὶ νόσοι δὲ γίνονται τῶ σώματος <τῷ> θερμότερον ἢ κρυμωδέσ-\nτερον γίνεσθαι τὸ *σῶμα*.')
…

threshold is the closeness of the query term to its neighboring words. Note that when expand_keyword=True, the search term will be stripped of any regular expression syntax.

The keyword expander leverages get_sims() (which in turn leverages functionality of the Gensim package) to find similar terms. Some examples of it in action:

In [3]: from cltk.vector.word2vec import get_sims

In [4]: get_sims('βασιλεύς', 'greek', lemmatized=False, threshold=0.5)
"word 'βασιλεύς' not in vocabulary"
The following terms in the Word2Vec model you may be looking for: '['βασκαίνων', 'βασκανίας', 'βασιλάκιος', 'βασιλίδων', 'βασανισθέντα', 'βασιλήϊον', 'βασιλευόμενα', 'βασανιστηρίων', … ]'.

In [36]: get_sims('τυραννος', 'greek', lemmatized=True, threshold=0.7)
"word 'τυραννος' not in vocabulary"
The following terms in the Word2Vec model you may be looking for: '['τυραννίσιν', 'τυρόριζαν', 'τυρεύοντες', 'τυρρηνοὶ', 'τυραννεύοντα', 'τυροὶ', 'τυραννικά', 'τυρσηνίαν', 'τυρώ', 'τυρσηνίας', … ]'.

To add and subtract vectors, you need to load the models yourself with Gensim.
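
A minimal sketch with Gensim follows. The model filename below is hypothetical, so point it at wherever the greek_word2vec_cltk model is stored on your machine; the query words must also be present in the model's vocabulary:

In [1]: import os

In [2]: from gensim.models import Word2Vec

In [3]: model_path = os.path.expanduser('~/cltk_data/greek/model/greek_word2vec_cltk/greek.model')  # hypothetical filename

In [4]: model = Word2Vec.load(model_path)

In [5]: model.wv.most_similar(positive=['σῶμα', 'πνεῦμα'], negative=['δέρμα'], topn=5)  # vector addition and subtraction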

Gujarati

Gujarati is an Indo-Aryan language native to the Indian state of Gujarat. It is part of the greater Indo-European language family. Gujarati is descended from Old Gujarati (circa 1100–1500 AD). In India, it is the official language in the state of Gujarat, as well as an official language in the union territories of Daman and Diu and Dadra and Nagar Haveli. Gujarati is spoken by 4.5% of the Indian population, which amounts to 46 million speakers in India. Altogether, there are about 50 million speakers of Gujarati worldwide. (Source: Wikipedia)

Alphabet

The Gujarati alphabet is placed in cltk/corpus/gujarati/alphabet.py.

There are 13 vowels in Gujarati. Like Hindi and other similar languages, vowels in Gujarati have an independent form and a matra form used to modify consonants in word formation.

VOWELS = [ 'અ' , 'આ' , 'ઇ' , 'ઈ' , 'ઉ' , 'ઊ' , 'ઋ' , 'એ' , 'ઐ' , 'ઓ' , 'ઔ' , 'અં' , 'અઃ'  ]

The International Alphabet of Sanskrit Transliteration (I.A.S.T.) is a transliteration scheme that allows the lossless romanization of Indic scripts as employed by Sanskrit and related Indic languages. IAST makes it possible for the reader to read the Indic text unambiguously, exactly as if it were in the original Indic script.

IAST_VOWELS_REPRESENTATION = ['a', 'ā', 'i', 'ī', 'u', 'ū','ṛ','e','ai','o','au','ṁ','ḥ']
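Because the two lists are parallel, they can be zipped into a simple lookup table from each Gujarati vowel to its IAST value (a small illustration, assuming both lists are importable from alphabet.py as shown above):

In[1]: from cltk.corpus.gujarati.alphabet import VOWELS, IAST_VOWELS_REPRESENTATION

In[2]: vowel_to_iast = dict(zip(VOWELS, IAST_VOWELS_REPRESENTATION))

In[3]: vowel_to_iast['આ']
Out[3]: 'ā'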

There are 33 consonants. They are grouped in accordance with the traditional Sanskrit scheme of arrangement.

1. Velar: A velar consonant is a consonant that is pronounced with the back part of the tongue against the soft palate, also known as the velum, which is the back part of the roof of the mouth (e.g., k).

2. Palatal: A palatal consonant is a consonant that is pronounced with the body (the middle part) of the tongue against the hard palate (which is the middle part of the roof of the mouth) (e.g., j).

3. Retroflex: A retroflex consonant is a coronal consonant where the tongue has a flat, concave, or even curled shape, and is articulated between the alveolar ridge and the hard palate (e.g., English t).

4. Dental: A dental consonant is a consonant articulated with the tongue against the upper teeth (e.g., Spanish t).

5. Labial: A labial consonant is articulated with the lips (e.g., p).

# Digits

In[1]: from cltk.corpus.gujarati.alphabet import DIGITS

In[2]: print(DIGITS)
Out[2]:  ['૦','૧','૨','૩','૪','૫','૬','૭','૮','૯','૧૦']

# Velar consonants

In[3]: from cltk.corpus.gujarati.alphabet import VELAR_CONSONANTS

In[4]: print(VELAR_CONSONANTS)
Out[4]: [ 'ક' , 'ખ' , 'ગ' , 'ઘ' , 'ઙ' ]

# Palatal consonants

In[5]: from cltk.corpus.gujarati.alphabet import PALATAL_CONSONANTS

In[6]: print(PALATAL_CONSONANTS)
Out[6]: ['ચ' , 'છ' , 'જ' , 'ઝ' , 'ઞ' ]

# Retroflex consonants

In[7]: from cltk.corpus.gujarati.alphabet import RETROFLEX_CONSONANTS

In[8]: print(RETROFLEX_CONSONANTS)
Out[8]: ['ટ' , 'ઠ' , 'ડ' , 'ઢ' , 'ણ']

# Dental consonants

In[9]: from cltk.corpus.gujarati.alphabet import DENTAL_CONSONANTS

In[10]: print(DENTAL_CONSONANTS)
Out[10]: ['ત' , 'થ' , 'દ' , 'ધ' , 'ન' ]

# Labial consonants

In[11]: from cltk.corpus.gujarati.alphabet import LABIAL_CONSONANTS

In[12]: print(LABIAL_CONSONANTS)
Out[12]: ['પ' , 'ફ' , 'બ' , 'ભ' , 'મ']

There are 4 sonorant consonants in Gujarati:

# Sonorant consonants

In[1]: from cltk.corpus.gujarati.alphabet import SONORANT_CONSONANTS

In[2]: print(SONORANT_CONSONANTS)
Out[2]: ['ય' , 'ર' , 'લ' , 'વ']

There are 3 sibilants in Gujarati:

# Sibilant consonants

In[1]: from cltk.corpus.gujarati.alphabet import SIBILANT_CONSONANTS

In[2]: print(SIBILANT_CONSONANTS)
Out[2]: ['શ' , 'ષ' , 'સ']

There is one guttural consonant also:

# Guttural consonant

In[1]: from cltk.corpus.gujarati.alphabet import GUTTURAL_CONSONANT

In[2]: print(GUTTURAL_CONSONANT)
Out[2]: ['હ']

There are also three additional consonants in Gujarati:

# Additional consonants

In[1]: from cltk.corpus.gujarati.alphabet import ADDITIONAL_CONSONANTS

In[2]: print(ADDITIONAL_CONSONANTS)
Out[2]: ['ળ' , 'ક્ષ' , 'જ્ઞ']

Hebrew

Hebrew is a language native to Israel, spoken by over 9 million people worldwide, of whom over 5 million are in Israel. Historically, it is regarded as the language of the Israelites and their ancestors, although the language was not referred to by the name Hebrew in the Tanakh. The earliest examples of written Paleo-Hebrew date from the 10th century BCE. Hebrew belongs to the West Semitic branch of the Afroasiatic language family. The Hebrew language is the only living Canaanite language left. Hebrew had ceased to be an everyday spoken language somewhere between 200 and 400 CE, declining since the aftermath of the Bar Kokhba revolt. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with hebrew_) to discover available Hebrew corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: corpus_importer = CorpusImporter('hebrew')
In [3]: corpus_importer.list_corpora
Out[3]:
['hebrew_text_sefaria']

Hindi

Hindi is a standardised and Sanskritised register of the Hindustani language. Like other Indo-Aryan languages, Hindi is considered to be a direct descendant of an early form of Sanskrit, through Sauraseni Prakrit and Śauraseni Apabhraṃśa. It has been influenced by Dravidian languages, Turkic languages, Persian, Arabic, Portuguese and English. Hindi emerged as Apabhramsha, a degenerated form of Prakrit, in the 7th century A.D. By the 10th century A.D., it became stable. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with hindi_) to discover available Hindi corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('hindi')

In [3]: c.list_corpora
Out[3]:
['hindi_text_ltrc']

Stopword Filtering

To use the CLTK’s built-in stopwords list:

In [1]: from cltk.stop.classical_hindi.stops import STOPS_LIST

In [2]: print(STOPS_LIST[:5])
Out[2]: ["हें", "है", "हैं", "हि", "ही"]

Swadesh

The corpus module has a class for generating a Swadesh list for Classical Hindi.

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('hi')

In [3]: swadesh.words()[:10]
Out[3]: ['मैं', 'तू', 'वह', 'हम', 'तुम', 'वे', 'यह', 'वह', 'यहाँ', 'वहाँ' ]

Tokenizer

This tool breaks a sentence into its constituent words. It simply splits the text into word and punctuation tokens.

In [1]: from cltk.tokenize.sentence import TokenizeSentence

In [2]: import os

In [3]: root = os.path.expanduser('~')

In [4]: hindi_corpus = os.path.join(root,'cltk_data/hindi/text/hindi_text_ltrc')

In [5]: hindi_text_path = os.path.join(hindi_corpus, 'miscellaneous/gandhi/main.txt')

In [6]: hindi_text = open(hindi_text_path,'r').read()

In [7]: tokenizer = TokenizeSentence('hindi')

In [8]: hindi_text_tokenize = tokenizer.tokenize(hindi_text)

In [9]: print(hindi_text_tokenize[0:100])
['10्र', 'प्रति', 'ा', 'वापस', 'नहीं', 'ली', 'जातीएक', 'बार', 'कस्तुरबा', 'गांधी', 'बहुत', 'बीमार', 'हो', 'गईं', '।', 'जलर्', 'चिकित्सा', 'से', 'उन्हें', 'कोई', 'लाभ', 'नहीं', 'हुआ', '।', 'दूसरे', 'उपचार', 'किये', 'गये', '।', 'उनमे', 'भी', 'सफलता', 'नहीं', 'मिली', '।', 'अंत', 'में', 'गांधीजी', 'ने', 'उन्हें', 'नमक', 'और', 'दाल', 'छोडने', 'की', 'सलाह', 'दी', '।', 'परन्तु', 'इसके', 'लिए', 'बा', 'तैयार', 'नहीं', 'हुईं', '।', 'गांधीजी', 'ने', 'बहुत', 'समझाया', '.', 'पोथियों', 'से', 'प्रमाण', 'पढकर', 'सुनाये', '.', 'लेकर', 'सब', 'व्यर्थ', '।', 'बा', 'बोलीं', '.', '"', 'कोई', 'आपसे', 'कहे', 'कि', 'दाल', 'और', 'नमक', 'छोड', 'दो', 'तो', 'आप', 'भी', 'नहीं', 'छोडेंगे', '।', '"', 'गांधीजी', 'ने', 'तुरन्त', 'प्रसÙ', 'होकर', 'कहा', '.', '"', 'तुम']

Javanese

Javanese is the language of the Javanese people from the central and eastern parts of the island of Java, in Indonesia. Javanese is one of the Austronesian languages, but it is not particularly close to other languages and is difficult to classify. The 8th and 9th centuries are marked by the emergence of the Javanese literary tradition – with Sang Hyang Kamahayanikan, a Buddhist treatise; and the Kakawin Rāmâyaṇa, a Javanese rendering in Indian metres of the Vaishnavist Sanskrit epic Rāmāyaṇa. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with javanese_) to discover available Javanese corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('javanese')

In [3]: c.list_corpora
Out[3]:
['javanese_text_gretil']

Kannada

Kannada is a Dravidian language spoken predominantly by Kannada people in India, mainly in the state of Karnataka, and by significant linguistic minorities in the states of Andhra Pradesh, Telangana, Tamil Nadu, Maharashtra, Kerala, Goa and abroad. The language has roughly 38 million native speakers who are called Kannadigas (Kannadigaru), and a total of 51 million speakers according to a 2001 census. It is one of the scheduled languages of India and the official and administrative language of the state of Karnataka. (Source: Wikipedia)

Alphabet

The Kannada alphabet and digits are placed in cltk/corpus/kannada/alphabet.py.

The digits are placed in a list NUMERALS, with each digit at the corresponding list index (0-9). For example, the Kannada digit for 6 can be accessed in this manner:

In [1]: from cltk.corpus.kannada.alphabet import NUMERALS
In [2]: NUMERALS[6]
Out[2]: '೬'

The vowels are placed in a list VOWELS and can be accessed in this manner:

In [1]: from cltk.corpus.kannada.alphabet import VOWELS
In [2]: VOWELS
Out[2]: ['ಅ', 'ಆ', 'ಇ', 'ಈ', 'ಉ', 'ಊ', 'ಋ','ೠ', 'ಎ', 'ಏ', 'ಐಒ', 'ಒ', 'ಓ', 'ಔ']

The rest of the alphabet is available as VOWEL_SIGNS, YOGAVAAHAKAS, UNSTRUCTURED_CONSONANTS, and STRUCTURED_CONSONANTS, which can be accessed in a similar way, as shown below.
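A brief sketch (output omitted here; these lists live in the same alphabet.py module):

In [3]: from cltk.corpus.kannada.alphabet import VOWEL_SIGNS, YOGAVAAHAKAS, STRUCTURED_CONSONANTS, UNSTRUCTURED_CONSONANTS

In [4]: STRUCTURED_CONSONANTS[:5]  # first five structured consonants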

Latin

Latin is a classical language belonging to the Italic branch of the Indo-European languages. The Latin alphabet is derived from the Etruscan and Greek alphabets, and ultimately from the Phoenician alphabet. Latin was originally spoken in Latium, in the Italian Peninsula. Through the power of the Roman Republic, it became the dominant language, initially in Italy and subsequently throughout the Roman Empire. Vulgar Latin developed into the Romance languages, such as Italian, Portuguese, Spanish, French, and Romanian. (Source: Wikipedia)

Note

For most of the following operations, you must first import the CLTK Latin linguistic data (named latin_models_cltk).
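If you have not yet downloaded the data, it can be fetched with the corpus importer:

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter('latin')

In [3]: corpus_importer.import_corpus('latin_models_cltk')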

Note

Note that for most of the following operations, the j/i and v/u replacer JVReplacer() and .lower() should be used on the input string first, if necessary.

Corpus Readers

Most users will want to access words, sentences, paragraphs and even whole documents via a CorpusReader object. All corpus contributors should provide a suitable reader. There are corpus readers for the Perseus Latin collection in JSON format and for the Latin Library; others will be made available. The CorpusReader methods are: paras(), which returns a generator of paragraphs, if possible; words(), which returns a generator of words; sents(), which returns a generator of sentences; and docs(), which returns a generator of Python dictionary objects representing each document.

In [1]: from cltk.corpus.readers import get_corpus_reader
   ...: reader = get_corpus_reader(language='latin', corpus_name='latin_text_perseus')
   ...: # get all the docs
   ...: docs = list(reader.docs())
   ...: len(docs)
   ...:
Out[1]: 293

In [2]: # or set just one
   ...: reader._fileids = ['cicero__on-behalf-of-aulus-caecina__latin.json']
   ...:

In [3]: # get all the sentences
   ...: sentences = list(reader.sents())
   ...: len(sentences)
   ...:
Out[3]: 25435

In [4]: # or one at a time
   ...: sentences[0]
   ...:
Out[4]: '\n\t\t\t si , quantum in agro locisque desertis audacia potest, tantum in foro atque\n\t\t\t\tin iudiciis impudentia valeret, non minus nunc in causa cederet A. Caecina Sex.'

In [5]: # access an individual doc as a dictionary of dictionaries
   ...: doc = list(reader.docs())[0]
   ...: doc.keys()
   ...:
Out[5]: dict_keys(['meta', 'author', 'text', 'edition', 'englishTitle', 'source', 'originalTitle', 'original-urn', 'language', 'sourceLink', 'urn', 'filename'])
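The paras() and words() methods behave the same way; for example, words() returns a lazy generator of individual tokens (output not shown):

In [6]: # take the first ten word tokens from the selected document
   ...: words = reader.words()
   ...: first_ten = [next(words) for _ in range(10)]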

Clausulae Analysis

Clausulae analysis is an integral part of Latin prosimetrics. The clausulae analysis module analyzes prose rhythm data generated by the prosody module to produce a dictionary of common rhythm types and their frequencies.

The list of rhythms which the module tallies is from the paper Keeline, T. and Kirby, J “Auceps syllabarum: A Digital Analysis of Latin Prose Rhythm,” Journal of Roman Studies, 2019.

In [1]: from cltk.prosody.latin.scanner import Scansion

In [2]: from cltk.prosody.latin.clausulae_analysis import Clausulae

In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'

In [4]: s = Scansion()

In [5]: c = Clausulae()

In [6]: prosody = s.scan_text(text)
Out[6]: ['-uuu-uuu-u--x', 'uu-uu-uu----x']

In [7]: c.clausulae_analysis(prosody)
Out[7]: [{'cretic_trochee': 1}, {'cretic_trochee_resolved_a': 0}, {'cretic_trochee_resolved_b': 0}, {'cretic_trochee_resolved_c': 0}, {'double_cretic': 0}, {'molossus_cretic': 0}, {'double_molossus_cretic_resolved_a': 0}, {'double_molossus_cretic_resolved_b': 0}, {'double_molossus_cretic_resolved_c': 0}, {'double_molossus_cretic_resolved_d': 0}, {'double_molossus_cretic_resolved_e': 0}, {'double_molossus_cretic_resolved_f': 0}, {'double_molossus_cretic_resolved_g': 0}, {'double_molossus_cretic_resolved_h': 0}, {'double_trochee': 0}, {'double_trochee_resolved_a': 0}, {'double_trochee_resolved_b': 0}, {'hypodochmiac': 0}, {'hypodochmiac_resolved_a': 0}, {'hypodochmiac_resolved_b': 0}, {'spondaic': 1}, {'heroic': 0}]

Converting J to I, V to U

In [1]: from cltk.stem.latin.j_v import JVReplacer

In [2]: j = JVReplacer()

In [3]: j.replace('vem jam')
Out[3]: 'uem iam'

Converting PHI texts with TLGU

Note

  1. Update this section with new post-TLGU processors in formatter.py

The TLGU is C-language software which does an excellent job of converting the TLG and PHI corpora into various forms of human-readable Unicode plaintext. The CLTK has an automated downloader and installer, as well as a wrapper which facilitates its use. Download and installation are handled in the background. When TLGU() is instantiated, it checks the local OS for a functioning version of the software; if one is not found, it is installed.

Most users will want to do a bulk conversion of the entirety of a corpus without any text markup (such as chapter or line numbers).

In [1]: from cltk.corpus.greek.tlgu import TLGU

In [2]: t = TLGU()

In [3]: t.convert_corpus(corpus='phi5')  # ~/cltk_data/latin/text/phi5/plaintext/

You can also divide the texts into a file for each individual work.

In [4]: t.divide_works('phi5')  # ~/cltk_data/latin/text/phi5/individual_works/

Once these files are created, see PHI Indices below for accessing these newly created files.

See also Text Cleanup for removing extraneous non-textual characters from these files.

Information Retrieval

See Multilingual Information Retrieval for Latin–specific search options.

Declining

The CollatinusDecliner() attempts to retrieve all possible forms of a lemma. This may be useful if you want to search for all forms of a word across a repository of non-lemmatized texts. This class is based on lexical and linguistic data built by the Collatinus Team. Data corrections and additions can be contributed back to the Collatinus project (in particular, into bin/data).

Example use, assuming you have already imported the latin_models_cltk:

In [1]: from cltk.stem.latin.declension import CollatinusDecliner

In [2]: decliner = CollatinusDecliner()

In [3]: print(decliner.decline("via"))
Out[3]:
 [('via', '--s----n-'),
  ('via', '--s----v-'),
  ('viam', '--s----a-'),
  ('viae', '--s----g-'),
  ('viae', '--s----d-'),
  ('via', '--s----b-'),
  ('viae', '--p----n-'),
  ('viae', '--p----v-'),
  ('vias', '--p----a-'),
  ('viarum', '--p----g-'),
  ('viis', '--p----d-'),
  ('viis', '--p----b-')]

 In [4]: decliner.decline("via", flatten=True)
 Out[4]:
 ['via',
  'via',
  'viam',
  'viae',
  'viae',
  'via',
  'viae',
  'viae',
  'vias',
  'viarum',
  'viis',
  'viis']

Lemmatization

*This lemmatizer is deprecated. It is recommended that you use the Backoff Lemmatizer described below.*

Tip

For ambiguous forms, which could belong to several headwords, the current lemmatizer chooses the more commonly occurring headword (code here). For any errors that you spot, please open a ticket.

The CLTK’s lemmatizer is based on a key-value store, whose code is available at the CLTK’s Latin lemma/POS repository.

The lemmatizer offers several input and output options. For text input, it can take a string or a list of tokens (which, by the way, need j and v replaced first). Here is an example of the lemmatizer taking a string:

In [1]: from cltk.stem.lemma import LemmaReplacer

In [2]: from cltk.stem.latin.j_v import JVReplacer

In [3]: sentence = 'Aeneadum genetrix, hominum divomque voluptas, alma Venus, caeli subter labentia signa quae mare navigerum, quae terras frugiferentis concelebras, per te quoniam genus omne animantum concipitur visitque exortum lumina solis.'

In [6]: sentence = sentence.lower()

In [7]: lemmatizer = LemmaReplacer('latin')

In [8]: lemmatizer.lemmatize(sentence)
Out[8]:
['aeneadum',
 'genetrix',
 ',',
 'homo',
 'divus',
 'voluptas',
 ',',
 'almus',
 ...]

And here taking a list:

In [9]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'])
Out[9]: ['qui1', 'terra', 'frugiferens', 'concelebro']

The lemmatizer takes several optional arguments for controlling output: return_raw=True and return_string=True. return_raw returns the original inflection along with its headword:

In [10]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_raw=True)
Out[10]:
['quae/qui1',
 'terras/terra',
 'frugiferentis/frugiferens',
 'concelebras/concelebro']

And return_string=True wraps the list in ' '.join():

In [11]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_string=True)
Out[11]: 'qui1 terra frugiferens concelebro'

These two arguments can be combined, as well.
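A sketch of the combined call; the result should be the raw inflection/headword pairs from above joined into a single string:

In [12]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_raw=True, return_string=True)
Out[12]: 'quae/qui1 terras/terra frugiferentis/frugiferens concelebras/concelebro'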

Lemmatization, backoff method

The CLTK offers a series of lemmatizers that can be combined in a backoff chain, i.e. if one lemmatizer is unable to return a headword for a token, this token can be passed onto another lemmatizer until either a headword is returned or the sequence ends.

There is a generic version of the backoff Latin lemmatizer which requires data from the CLTK's Latin models corpus (latin_models_cltk). The lemmatizer expects this model to be stored in a folder called cltk_data in the user’s home directory.

To use the generic version of the backoff Latin Lemmatizer:

In [1]: from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer

In [2]: lemmatizer = BackoffLatinLemmatizer()

In [3]: tokens = ['Quo', 'usque', 'tandem', 'abutere', ',', 'Catilina', ',', 'patientia', 'nostra', '?']

In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutor'), (',', 'punc'), ('Catilina', 'Catilina'), (',', 'punc'), ('patientia', 'patientia'), ('nostra', 'noster'), ('?', 'punc')]

NB: The backoff chain for this lemmatizer is defined as follows: 1. a dictionary-based lemmatizer with high-frequency, unambiguous forms; 2. a training-data-based lemmatizer based on 4,000 sentences from the [Perseus Latin Dependency Treebanks](https://perseusdl.github.io/treebank_data/); 3. a regular-expression-based lemmatizer transforming unambiguous endings; 4. a dictionary-based lemmatizer with the complete set of Morpheus lemmas; 5. an ‘identity’ lemmatizer returning the token as the lemma. Each of these sub-lemmatizers is explained in the documents for “Multilingual”.
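You can also assemble a smaller chain yourself from such sub-lemmatizers. The following is only a minimal sketch, assuming the DictLemmatizer and IdentityLemmatizer classes and their backoff keyword as described in the “Multilingual” documents (the module path and class names here are assumptions, not verbatim from this page):

In [5]: from cltk.lemmatize.backoff import DictLemmatizer, IdentityLemmatizer

In [6]: # If the dictionary has no entry for a token, fall back to returning the token itself.
   ...: identity = IdentityLemmatizer()

In [7]: dict_lemmatizer = DictLemmatizer(lemmas={'abutere': 'abutor', 'nostra': 'noster'}, backoff=identity)

In [8]: dict_lemmatizer.lemmatize(['abutere', 'patientia', 'nostra'])
Out[8]: [('abutere', 'abutor'), ('patientia', 'patientia'), ('nostra', 'noster')]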

Line Tokenization

The line tokenizer takes a string input into tokenize() and returns a list of strings.

In [1]: from cltk.tokenize.line import LineTokenizer

In [2]: tokenizer = LineTokenizer('latin')

In [3]: untokenized_text = """49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""

In [4]: tokenizer.tokenize(untokenized_text)

Out[4]: ['49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']

The line tokenizer by default removes multiple line breaks. If you wish to retain blank lines in the returned list, set the include_blanks to True.

In [5]: untokenized_text = """48. Cum tibi contigerit studio cognoscere multa,\nFac discas multa, vita nil discere velle.\n\n49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""

In [6]: tokenizer.tokenize(untokenized_text, include_blanks=True)

Out[6]: ['48. Cum tibi contigerit studio cognoscere multa,','Fac discas multa, vita nil discere velle.','','49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']

Macronizer

Automatically mark long Latin vowels with a macron. The algorithm used in this module is largely based on Johan Winge’s, which is detailed in his thesis.

Note that the macronizer’s accuracy varies depending on which tagger is used. Currently, the macronizer supports the following taggers: tag_ngram_123_backoff, tag_tnt, and tag_crf. The tagger is selected when calling the class, as seen on line 2. Be sure to first import the data models from latin_models_cltk, via the corpus importer, since both the taggers and macronizer rely on them.

The macronizer can either macronize text, as seen at line 4 below, or return a list of tagged tokens containing the macronized form like on line 5.

In [1]: from cltk.prosody.latin.macronizer import Macronizer

In [2]: macronizer = Macronizer('tag_ngram_123_backoff')

In [3]: text = 'Quo usque tandem, O Catilina, abutere nostra patientia?'

In [4]: macronizer.macronize_text(text)
Out[4]: 'quō usque tandem , ō catilīnā , abūtēre nostrā patientia ?'

In [5]: macronizer.macronize_tags(text)
Out[5]: [('quo', 'd--------', 'quō'), ('usque', 'd--------', 'usque'), ('tandem', 'd--------', 'tandem'), (',', 'u--------', ','), ('o', 'e--------', 'ō'), ('catilina', 'n-s---mb-', 'catilīnā'), (',', 'u--------', ','), ('abutere', 'v2sfip---', 'abūtēre'), ('nostra', 'a-s---fb-', 'nostrā'), ('patientia', 'n-s---fn-', 'patientia'), ('?', None, '?')]

Making POS training sets

Warning

POS tagging is a work in progress. A new tagging dictionary has been created, though a tagger has not yet been written.

First, obtain the Latin POS tagging files. The important file here is cltk_latin_pos_dict.txt, which is saved at ~/cltk_data/compiled/pos_latin. This file is a Python dict which aims to give all possible parts of speech for any given form, though it is based on the incomplete Perseus latin-analyses.txt. Thus, there may be gaps in (i) the inflected forms defined and (ii) the comprehensiveness of the analyses of any given form. cltk_latin_pos_dict.txt looks like:

{'-nam': {'perseus_pos': [{'pos0': {'case': 'indeclform',
                                    'gloss': '',
                                    'type': 'conj'}}]},
 '-namque': {'perseus_pos': [{'pos0': {'case': 'indeclform',
                                       'gloss': '',
                                       'type': 'conj'}}]},
 '-sed': {'perseus_pos': [{'pos0': {'case': 'indeclform',
                                    'gloss': '',
                                    'type': 'conj'}}]},
 'Aaron': {'perseus_pos': [{'pos0': {'case': 'nom',
                                     'gender': 'masc',
                                     'gloss': 'Aaron',
                                     'number': 'sg',
                                     'type': 'substantive'}}]},
}

If you wish to edit the POS dictionary creator, see cltk_latin_pos_dict.txt. For more, see the [pos_latin](https://github.com/cltk/latin_pos_lemmata_cltk) repository.

Named Entity Recognition

A simple interface to a list of Latin proper nouns is available (see the repo for how the list was created). By default, tag_ner() takes a string input and returns a list of tuples. However, it can also take pre-tokenized forms and return a string, as shown below.

In [1]: from cltk.tag import ner

In [2]: from cltk.stem.latin.j_v import JVReplacer

In [3]: text_str = """ut Venus, ut Sirius, ut Spica, ut aliae quae primae dicuntur esse mangitudinis."""

In [4]: jv_replacer = JVReplacer()

In [5]: text_str_iu = jv_replacer.replace(text_str)

In [7]: ner.tag_ner('latin', input_text=text_str_iu, output_type=list)
Out[7]:
[('ut',),
 ('Uenus', 'Entity'),
 (',',),
 ('ut',),
 ('Sirius', 'Entity'),
 (',',),
 ('ut',),
 ('Spica', 'Entity'),
 (',',),
 ('ut',),
 ('aliae',),
 ('quae',),
 ('primae',),
 ('dicuntur',),
 ('esse',),
 ('mangitudinis',),
 ('.',)]
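To get a string back instead, pass output_type=str; the recognized entities are then marked inline in the returned text (output omitted here):

In [8]: ner.tag_ner('latin', input_text=text_str_iu, output_type=str)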

PHI Indices

Located at cltk/corpus/latin/phi5_index.py of the source are indices for the PHI5, one of just id and name (PHI5_INDEX) and another also containing information on the authors’ works (PHI5_WORKS_INDEX).

In [1]: from cltk.corpus.latin.phi5_index import PHI5_INDEX

In [2]: PHI5_INDEX
Out[2]:
{'LAT1050': 'Lucius Verginius Rufus',
 'LAT2335': 'Anonymi de Differentiis [Fronto]',
 'LAT1345': 'Silius Italicus',
 ... }

In [3]: from cltk.corpus.latin.phi5_index import PHI5_WORKS_INDEX

In [4]: PHI5_WORKS_INDEX
Out [4]:
{'LAT2335': {'works': ['001'], 'name': 'Anonymi de Differentiis [Fronto]'},
 'LAT1345': {'works': ['001'], 'name': 'Silius Italicus'},
 'LAT1351': {'works': ['001', '002', '003', '004', '005'],
  'name': 'Cornelius Tacitus'},
 'LAT2349': {'works': ['001', '002', '003', '004', '005', '006', '007'],
  'name': 'Maurus Servius Honoratus, Servius'},
  ...}

In addition to these indices, there are several helper functions which will build filepaths for your particular computer. Note that you will need to have run convert_corpus(corpus='phi5') and divide_works('phi5') from the TLGU() class, respectively, for the following two functions.

In [1]: from cltk.corpus.utils.formatter import assemble_phi5_author_filepaths

In [2]: assemble_phi5_author_filepaths()
Out[2]:
['/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0636.TXT',
 '/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0658.TXT',
 '/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0827.TXT',
 ...]

In [3]: from cltk.corpus.utils.formatter import assemble_phi5_works_filepaths

In [4]: assemble_phi5_works_filepaths()
Out[4]:
['/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0636.TXT-001.txt',
 '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0902.TXT-001.txt',
 '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-001.txt',
 '/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-002.txt',
 ...]

These two functions are useful when, for example, needing to process all authors of the PHI5 corpus, all works of the corpus, or all works of one particular author.

POS tagging

These taggers were built with the assistance of the NLTK. The backoff tagger is Bayesian and the TnT tagger uses a hidden Markov model. To obtain the models, first import the latin_models_cltk corpus.

1–2–3–gram backoff tagger
In [1]: from cltk.tag.pos import POSTag

In [2]: tagger = POSTag('latin')

In [3]: tagger.tag_ngram_123_backoff('Gallia est omnis divisa in partes tres')
Out[3]:
[('Gallia', None),
 ('est', 'V3SPIA---'),
 ('omnis', 'A-S---MN-'),
 ('divisa', 'T-PRPPNN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]
TnT tagger
In [4]: tagger.tag_tnt('Gallia est omnis divisa in partes tres')
Out[4]:
[('Gallia', 'Unk'),
 ('est', 'V3SPIA---'),
 ('omnis', 'N-S---MN-'),
 ('divisa', 'T-SRPPFN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]
CRF tagger

Warning

This tagger’s accuracy has not yet been evaluated.

We use the NLTK’s CRF tagger. For information on it, see the NLTK docs.

In [5]: tagger.tag_crf('Gallia est omnis divisa in partes tres')
Out[5]:
[('Gallia', 'A-P---NA-'),
 ('est', 'V3SPIA---'),
 ('omnis', 'A-S---FN-'),
 ('divisa', 'N-S---FN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]
Lapos tagger

Note

The Lapos tagger is available in its own repo, with the master branch for Linux and the apple branch for Mac. See directions there on how to use it.

Prosody Scanning

A prosody scanner is available for text which already has had its natural lengths marked with macrons. It returns a list of strings of long and short marks for each sentence, with an anceps marking the last syllable of each sentence.

The algorithm is designed only for Latin prose rhythms. It is detailed in Keeline, T. and Kirby, J “Auceps syllabarum: A Digital Analysis of Latin Prose Rhythm,” Journal of Roman Studies, 2019.

In [1]: from cltk.prosody.latin.scanner import Scansion

In [2]: scanner = Scansion()

In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'

In [4]: scanner.scan_text(text)
Out[4]: ['-uuu-uuu-u--x', 'uu-uu-uu----x']

Scansion of Poetry

About the use of macrons in poetry

Most Latin poetry comes down to us without macrons. Some lines of Latin poetry can be scanned and fit a poetic meter without any macrons at all, due to the rules of meter and positional accentuation.

Automatically macronizing every word in a line of Latin poetry does not mean that it will automatically scan correctly. Poets often diverge from standard usage: regularly long vowels can appear short (in poetry, for instance, the verb nesciō scans its final personal ending as a short o), and regularly short vowels can appear long (Lucretius regularly writes rēligiō, which scans, instead of the usual religiō). There is also the prosodic device of diastole, in which the short final vowel of a word is lengthened to fit the meter, e.g. tibī in Lucretius I.104 and III.899.

However, some macrons are necessary for scansion: Lucretius I.12 begins with “aeriae” which will not scan in hexameter unless one substitutes its macronized form “āeriae”.

HexameterScanner

The HexameterScanner class scans lines of Latin hexameter (with or without macrons) and determines if the line is a valid hexameter and what its scansion pattern is.

If the line is not properly macronized to scan, the scanner tries to determine whether the line:

  1. Scans merely by position.
  2. Syllabifies according to the common rules.
  3. Is complete (e.g. some hexameter lines are partial).

The scanner also determines which syllables would have to be made long to make the line scan as a valid hexameter. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The HexameterScanner’s scan method returns a Verse class object.

In [1]: from cltk.prosody.latin.hexameter_scanner import HexameterScanner

In [2]: scanner = HexameterScanner()

In [3]: scanner.scan("impulerit. Tantaene animis caelestibus irae?")
Out[3]: Verse(original='impulerit. Tantaene animis caelestibus irae?', scansion='-  U U -    -   -   U U -    - -  U U  -  - ', meter='hexameter', valid=True, syllable_count=15, accented='īmpulerīt. Tāntaene animīs caelēstibus īrae?', scansion_notes=['Valid by positional stresses.'], syllables = ['īm', 'pu', 'le', 'rīt', 'Tān', 'taen', 'a', 'ni', 'mīs', 'cae', 'lēs', 'ti', 'bus', 'i', 'rae'])
PentameterScanner

The PentameterScanner class scans lines of Latin pentameter (with or without macrons) and determines if the line is a valid pentameter and what its scansion pattern is.

If the line is not properly macronized to scan, the scanner tries to determine whether the line:

  1. Scans merely by position.
  2. Syllabifies according to the common rules.

The scanner also determines which syllables would have to be made long to make the line scan as a valid pentameter. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The PentameterScanner’s scan method returns a Verse class object.

In [1]: from cltk.prosody.latin.pentameter_scanner import PentameterScanner

In [2]: scanner = PentameterScanner()

In [3]: scanner.scan("ex hoc ingrato gaudia amore tibi.")
Out[3]: Verse(original='ex hoc ingrato gaudia amore tibi.', scansion='-   -  -   - -   - U  U - U  U U ', meter='pentameter', valid=True, syllable_count=12, accented='ēx hōc īngrātō gaudia amōre tibi.', scansion_notes=['Spondaic pentameter'], syllables = ['ēx', 'hoc', 'īn', 'gra', 'to', 'gau', 'di', 'a', 'mo', 're', 'ti', 'bi'])
HendecasyllableScanner

The HendecasyllableScanner class scans lines of Latin hendecasyllables (with or without macrons) and determines if the line is a valid example of the hendecasyllablic meter and what its scansion pattern is.

If the line is not properly macronized to scan, the scanner tries to determine whether the line:

  1. Scans merely by position.
  2. Syllabifies according to the common rules.

The scanner also determines which syllables would have to be made long to make the line scan as a valid hendecasyllables. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The HendecasyllableScanner’s scan method returns a Verse class object.

In [1]: from cltk.prosody.latin.hendecasyllable_scanner import HendecasyllableScanner

In [2]: scanner = HendecasyllableScanner()

In [3]: scanner.scan("Iam tum, cum ausus es unus Italorum")
Out[3]: Verse(original='Iam tum, cum ausus es unus Italorum', scansion=' -   -        - U  U  - U  - U - U ', meter='hendecasyllable', valid=True, syllable_count=11, accented='Iām tūm, cum ausus es ūnus Ītalōrum', scansion_notes=['antepenult foot onward normalized.'], syllables = ['Jām', 'tūm', 'c', 'au', 'sus', 'es', 'u', 'nus', 'I', 'ta', 'lo', 'rum'])
Verse

The Verse class object returned by the HexameterScanner, PentameterScanner, and HendecasyllableScanner provides slots for:

  1. original - original line of verse
  2. scansion - the scansion pattern
  3. meter - the meter of the verse
  4. valid - whether or not the verse is valid
  5. syllable_count - number of syllables according to common syllabification rules
  6. accented - if the verse is valid, a version of the line with accented vowels (diphthongs are not accented)
  7. scansion_notes - a list recording the characteristics of the transformations made to the original line
  8. syllables - a list of the syllables into which the line is divided at the scansion level; elided syllables are not included.

The Scansion notes are defined in a NOTE_MAP dictionary object contained in the ScansionConstants class.
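Because scan() returns a Verse object, these fields are read as ordinary attributes. Re-using the hexameter example from above:

In [1]: from cltk.prosody.latin.hexameter_scanner import HexameterScanner

In [2]: verse = HexameterScanner().scan("impulerit. Tantaene animis caelestibus irae?")

In [3]: verse.valid, verse.syllable_count
Out[3]: (True, 15)

In [4]: verse.scansion_notes
Out[4]: ['Valid by positional stresses.']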

ScansionConstants

The ScansionConstants class is a configuration class for specifying scansion constants. It also allows users to customize scansion constants and scanner behavior; for example, a user may alter the symbols used for stressed and unstressed syllables:

In [1]: from cltk.prosody.latin.scansion_constants import ScansionConstants

In [2]: constants = ScansionConstants(unstressed="U",stressed= "-", optional_terminal_ending="X")

In [3]: constants.DACTYL
Out[3]: '-UU'

In [4]: smaller_constants = ScansionConstants(unstressed="˘",stressed= "¯", optional_terminal_ending="x")

In [5]: smaller_constants.DACTYL
Out[5]: '¯˘˘'

Constants containing strings have characters in both upper and lower case, since they will often be used in regular expressions and to preserve a verse’s original case.

Syllabifier

The Syllabifier class is a Latin language syllabifier. It parses a Latin word or a space separated list of words into a list of syllables. Consonantal I is transformed into a J at the start of a word as necessary. Tuned for poetry and verse, this class is tolerant of isolated single character consonants that may appear due to elision.

In [1]: from cltk.prosody.latin.syllabifier import Syllabifier

In [2]: syllabifier = Syllabifier()

In [3]: syllabifier.syllabify("libri")
Out[3]: ['li', 'bri']

In [4]: syllabifier.syllabify("contra")
Out[4]: ['con', 'tra']
Metrical Validator

The MetricalValidator class is a utility class for validating scansion patterns. Users may configure the scansion symbols internally via passing a customized ScansionConstants via a constructor argument:

In [1]: from cltk.prosody.latin.metrical_validator import MetricalValidator

In [2]: MetricalValidator().is_valid_hexameter("-UU---UU---UU-U")
Out[2]: True
ScansionFormatter

The ScansionFormatter class is a utility class for formatting scansion patterns.

In [1]: from cltk.prosody.latin.scansion_formatter import ScansionFormatter

In [2]: ScansionFormatter().hexameter("-UU-UU-UU---UU--")
Out[2]: '-UU|-UU|-UU|--|-UU|--'

In [3]: from cltk.prosody.latin.scansion_constants import ScansionConstants
   ...: constants = ScansionConstants(unstressed="˘", stressed="¯", optional_terminal_ending="x")

In [4]: formatter = ScansionFormatter(constants)

In [5]: formatter.hexameter( "¯˘˘¯˘˘¯˘˘¯¯¯˘˘¯¯")
Out[5]: '¯˘˘|¯˘˘|¯˘˘|¯¯|¯˘˘|¯¯'
string_utils module

The string_utils module contains utility methods for processing scansion and text. For example, punctuation_for_spaces_dict() returns a dictionary object that maps Unicode punctuation to blank spaces, which is essential for scansion, keeping stress patterns in alignment with the original vowel positions in the verse.

In [1]: import cltk.prosody.latin.string_utils as string_utils

In [2]: "I'm ok! Oh #%&*()[]{}!? Fine!".translate(string_utils.punctuation_for_spaces_dict()).strip()
Out[2]: 'I m ok  Oh              Fine'

Sentence Tokenization

Sentence tokenization for Latin is available using a [Punkt](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) tokenizer trained on the Latin Library. The model for this tokenizer can be found in the CLTK corpora under latin_models_cltk/tokenizers/sentence/latin_punkt. The training process considers Latin punctuation patterns as well as common abbreviations (e.g. nomina). To tokenize a Latin text by sentences:

In [1]: from cltk.tokenize.latin.sentence import SentenceTokenizer

In [2]: sent_tokenizer = SentenceTokenizer()

In [3]: untokenized_text = 'Meministine me ante diem XII Kalendas Novembris dicere in senatu fore in armis certo die, qui dies futurus esset ante diem VI Kal. Novembris, C. Manlium, audaciae satellitem atque administrum tuae? Num me fefellit, Catilina, non modo res tanta, tam atrox tamque incredibilis, verum, id quod multo magis est admirandum, dies? Dixi ego idem in senatu caedem te optumatium contulisse in ante diem V Kalendas Novembris, tum cum multi principes civitatis Roma non tam sui conservandi quam tuorum consiliorum reprimendorum causa profugerunt.'

In [4]: sent_tokenizer.tokenize(untokenized_text)
Out[4]: ['Meministine me ante diem XII Kalendas Novembris dicere in senatu fore in armis certo die, qui dies futurus esset ante diem VI Kal. Novembris, C. Manlium, audaciae satellitem atque administrum tuae?', 'Num me fefellit, Catilina, non modo res tanta, tam atrox tamque incredibilis, verum, id quod multo magis est admirandum, dies?', 'Dixi ego idem in senatu caedem te optumatium contulisse in ante diem V Kalendas Novembris, tum cum multi principes civitatis Roma non tam sui conservandi quam tuorum consiliorum reprimendorum causa profugerunt.']

Note that the Latin sentence tokenizer takes account of abbreviations like ‘Kal.’ and ‘C.’ and does not split sentences at these points.

By default, the Latin Punkt Sentence Tokenizer splits on period, question mark, and exclamation point. There is a `strict` parameter that adds colon, semicolon, and hyphen to this.


In [5]: sent_tokenizer = SentenceTokenizer(strict=True)

In [6]: untokenized_text = 'In principio creavit Deus caelum et terram; terra autem erat inanis et vacua et tenebrae super faciem abyssi et spiritus Dei ferebatur super aquas; dixitque Deus fiat lux et facta est lux; et vidit Deus lucem quod esset bona et divisit lucem ac tenebras.'

In [7]: sent_tokenizer.tokenize(untokenized_text)
Out[7]: ['In principio creavit Deus caelum et terram;', 'terra autem erat inanis et vacua et tenebrae super faciem abyssi et spiritus Dei ferebatur super aquas;', 'dixitque Deus fiat lux et facta est lux;', 'et vidit Deus lucem quod esset bona et divisit lucem ac tenebras.']

NB: The old sentence tokenization method, i.e. TokenizeSentence, is still available, but it now calls the tokenizer described above.

In [5]: from cltk.tokenize.sentence import TokenizeSentence

In [6]: tokenizer = TokenizeSentence('latin')

etc.

Semantics

The Semantics module allows for the lookup of Latin lemmata, synonyms, and translations into Greek. Lemma, synonym, and translation dictionaries are drawn from the open-source Tesserae Project (http://github.com/tesserae/tesserae).

The dictionaries used by this module are stored in https://github.com/cltk/latin_models_cltk/tree/master/semantics and https://github.com/cltk/greek_models_cltk/tree/master/semantics for Latin and Greek, respectively. In order to use the Semantics module, it is necessary to import those repos first (see http://docs.cltk.org/en/latest/importing_corpora.html#importing-a-corpus).

Tip

When lemmatizing ambiguous forms, the Semantics module is designed to return all possibilities. A probability distribution is included with the list of results, but as of June 8, 2018 the total probability is evenly distributed over all possibilities. Future updates will include a more intelligent system for determining the most likely lemma, synonym, or translation.

The Lemmata class includes two relevant methods: lookup() takes a list of tokens standardized for spelling and returns a complex object which includes a probability distribution; isolate() takes the object returned by lookup() and discards everything but the lemmata.

In [1]: from cltk.semantics.latin.lookup import Lemmata

In [2]: lemmatizer = Lemmata(dictionary='lemmata', language='latin')

In [3]: tokens = ['ceterum', 'antequam', 'destinata', 'componam']

In [4]: lemmas = lemmatizer.lookup(tokens)
Out[4]:
[('ceterum', [('ceterus', 1.0)]), ('antequam', [('antequam', 1.0)]), ('destinata', [('destinatus', 0.25), ('destinatum', 0.25), ('destinata', 0.25), ('destino', 0.25)]), ('componam', [('compono', 1.0)])]

In [5]: just_lemmas = Lemmata.isolate(lemmas)
Out[5]:['ceterus', 'antequam', 'destinatus', 'destinatum', 'destinata', 'destino', 'compono']

The Synonym class can be initialized to lookup either synonyms or translations. It expects a list of lemmata, not inflected forms. Only successful ‘lookups’ will return results.

In [1]: from cltk.semantics.latin.lookup import Synonyms

In [2]: translator = Synonyms(dictionary='translations', language='latin')

In [3]: lemmas = ['ceterus', 'antequam', 'destinatus', 'destinatum', 'destinata', 'destino', 'compono']

In [4]: translations = translator.lookup(lemmas)
Out[4]:[('destino', [('σκοπός', 1.0)]), ('compono', [('συντίθημι', 1.0)])]

In [5]: just_translations = Lemmata.isolate(translations)
Out[5]:['σκοπός', 'συντίθημι']

A raw list of translations can be obtained from the translation object using Lemmata.isolate().

Stemming

The stemmer strips suffixes via an algorithm. It is much faster than the lemmatizer, which uses a replacement list.

In [1]: from cltk.stem.latin.stem import Stemmer

In [2]: sentence = 'Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiuerunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem ciuem existimarint foeneratorem quam furem, hinc licet existimare. Et uirum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, uerum, ut supra dixi, periculosum et calamitosum. At ex agricolis et uiri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque inuidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit.'

In [3]: stemmer = Stemmer()

In [4]: stemmer.stem(sentence.lower())
Out[4]: 'est interd praestar mercatur r quaerere, nisi tam periculos sit, et it foenerari, si tam honestum. maior nostr sic habueru et ita in leg posiuerunt: fur dupl condemnari, foenerator quadrupli. quant peior ciu existimari foenerator quam furem, hinc lice existimare. et uir bon quo laudabant, ita laudabant: bon agricol bon colonum; amplissim laudar existimaba qui ita laudabatur. mercator autem strenu studios re quaerend existimo, uerum, ut supr dixi, periculos et calamitosum. at ex agricol et uir fortissim et milit strenuissim gignuntur, maxim p quaest stabilissim consequi minim inuidiosus, minim mal cogitant su qui in e studi occupat sunt. nunc, ut ad r redeam, quod promis institut principi hoc erit. '

Stoplist Construction

To extract a stoplist from a collection of documents:

In [1]: test_1 = """cogitanti mihi saepe numero et memoria vetera repetenti perbeati fuisse, quinte frater, illi videri solent, qui in optima re publica, cum et honoribus et rerum gestarum gloria florerent, eum vitae cursum tenere potuerunt, ut vel in negotio sine periculo vel in otio cum dignitate esse possent; ac fuit cum mihi quoque initium requiescendi atque animum ad utriusque nostrum praeclara studia referendi fore iustum et prope ab omnibus concessum arbitrarer, si infinitus forensium rerum labor et ambitionis occupatio decursu honorum, etiam aetatis flexu constitisset. quam spem cogitationum et consiliorum meorum cum graves communium temporum tum varii nostri casus fefellerunt; nam qui locus quietis et tranquillitatis plenissimus fore videbatur, in eo maximae moles molestiarum et turbulentissimae tempestates exstiterunt; neque vero nobis cupientibus atque exoptantibus fructus oti datus est ad eas artis, quibus a pueris dediti fuimus, celebrandas inter nosque recolendas. nam prima aetate incidimus in ipsam perturbationem disciplinae veteris, et consulatu devenimus in medium rerum omnium certamen atque discrimen, et hoc tempus omne post consulatum obiecimus eis fluctibus, qui per nos a communi peste depulsi in nosmet ipsos redundarent. sed tamen in his vel asperitatibus rerum vel angustiis temporis obsequar studiis nostris et quantum mihi vel fraus inimicorum vel causae amicorum vel res publica tribuet oti, ad scribendum potissimum conferam; tibi vero, frater, neque hortanti deero neque roganti, nam neque auctoritate quisquam apud me plus valere te potest neque voluntate."""

In [2]: test_2 = """ac mihi repetenda est veteris cuiusdam memoriae non sane satis explicata recordatio, sed, ut arbitror, apta ad id, quod requiris, ut cognoscas quae viri omnium eloquentissimi clarissimique senserint de omni ratione dicendi. vis enim, ut mihi saepe dixisti, quoniam, quae pueris aut adulescentulis nobis ex commentariolis nostris incohata ac rudia exciderunt, vix sunt hac aetate digna et hoc usu, quem ex causis, quas diximus, tot tantisque consecuti sumus, aliquid eisdem de rebus politius a nobis perfectiusque proferri; solesque non numquam hac de re a me in disputationibus nostris dissentire, quod ego eruditissimorum hominum artibus eloquentiam contineri statuam, tu autem illam ab elegantia doctrinae segregandam putes et in quodam ingeni atque exercitationis genere ponendam. ac mihi quidem saepe numero in summos homines ac summis ingeniis praeditos intuenti quaerendum esse visum est quid esset cur plures in omnibus rebus quam in dicendo admirabiles exstitissent; nam quocumque te animo et cogitatione converteris, permultos excellentis in quoque genere videbis non mediocrium artium, sed prope maximarum. quis enim est qui, si clarorum hominum scientiam rerum gestarum vel utilitate vel magnitudine metiri velit, non anteponat oratori imperatorem? quis autem dubitet quin belli duces ex hac una civitate praestantissimos paene innumerabilis, in dicendo autem excellentis vix paucos proferre possimus? iam vero consilio ac sapientia qui regere ac gubernare rem publicam possint, multi nostra, plures patrum memoria atque etiam maiorum exstiterunt, cum boni perdiu nulli, vix autem singulis aetatibus singuli tolerabiles oratores invenirentur. ac ne qui forte cum aliis studiis, quae reconditis in artibus atque in quadam varietate litterarum versentur, magis hanc dicendi rationem, quam cum imperatoris laude aut cum boni senatoris prudentia comparandam putet, convertat animum ad ea ipsa artium genera circumspiciatque, qui in eis floruerint quamque multi sint; sic facillime, quanta oratorum sit et semper fuerit paucitas, iudicabit."""

In [3]: test_corpus = [test_1, test_2]

In [4]: from cltk.stop.latin import CorpusStoplist

In [5]: S = CorpusStoplist()

In [6]: print(S.build_stoplist(test_corpus, size=10))

Out [6]: ['ac', 'atque', 'cum', 'et', 'in', 'mihi', 'neque', 'qui', 'rerum', 'vel']

Stopword Filtering

To use a pre-built stoplist (created originally by the Perseus Project):

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from cltk.stop.latin.stops import STOPS_LIST

In [3]: sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'

In [4]: p = PunktLanguageVars()

In [5]: tokens = p.word_tokenize(sentence.lower())

In [6]: [w for w in tokens if not w in STOPS_LIST]
Out[6]:
['usque',
 'tandem',
 'abutere',
 ',',
 'catilina',
 ',',
 'patientia',
 'nostra',
 '?']

Swadesh

The corpus module has a class for generating a Swadesh list for Latin.

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('la')

In [3]: swadesh.words()[:10]
Out[3]: ['ego', 'tū', 'is, ea, id', 'nōs', 'vōs', 'eī, iī, eae, ea', 'hic, haec, ho', 'ille, illa, illud', 'hīc', 'illic, ibi']

Syllabifier

The syllabifier splits a given input Latin word into a list of syllables based on an algorithm and set of syllable specifications for Latin.

In [1]: from cltk.stem.latin.syllabifier import Syllabifier

In [2]: word = 'sidere'

In [3]: syllabifier = Syllabifier()

In [4]: syllabifier.syllabify(word)
Out[4]: ['si', 'de', 're']

Text Cleanup

Intended for use on PHI5 texts after processing by TLGU().

In [1]: from cltk.corpus.utils.formatter import phi5_plaintext_cleanup

In [2]: import os

In [3]: file = os.path.expanduser('~/cltk_data/latin/text/phi5/individual_works/LAT0031.TXT-001.txt')

In [4]: with open(file) as f:
...:     r = f.read()
...:

In [5]: r[:500]
Out[5]: '\nDices pulchrum esse inimicos \nulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uide-\ntur, sed si liceat re publica salua ea persequi. sed quatenus id fieri non  \npotest, multo tempore multisque partibus inimici nostri non peribunt \natque, uti nunc sunt, erunt potius quam res publica profligetur atque \npereat. \n    Verbis conceptis deierare ausim, praeterquam qui \nTiberium Gracchum necarunt, neminem inimicum tantum molestiae \ntantumque laboris, quantum te ob has res, mihi tradidis'

In [6]: phi5_plaintext_cleanup(r, rm_punctuation=True, rm_periods=False)[:500]
Out[7]: ' Dices pulchrum esse inimicos ulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uidetur sed si liceat re publica salua ea persequi. sed quatenus id fieri non potest multo tempore multisque partibus inimici nostri non peribunt atque uti nunc sunt erunt potius quam res publica profligetur atque pereat. Verbis conceptis deierare ausim praeterquam qui Tiberium Gracchum necarunt neminem inimicum tantum molestiae tantumque laboris quantum te ob has res mihi tradidisse quem oportebat omni'

If you have a text of a language in Latin characters which contains a lot of junk, remove_non_ascii() and remove_non_latin() might be of use.

In [1]: from cltk.corpus.utils.formatter import remove_non_ascii

In [2]: text =  'Dices ἐστιν ἐμός pulchrum esse inimicos ulcisci.'

In [3]: remove_non_ascii(text)
Out[3]: 'Dices   pulchrum esse inimicos ulcisci.'

In [4]: from cltk.corpus.utils.formatter import remove_non_latin

In [5]: remove_non_latin(text)
Out[5]: ' Dices   pulchrum esse inimicos ulcisci'

In [6]: remove_non_latin(text, also_keep=['.', ','])
Out[6]: ' Dices   pulchrum esse inimicos ulcisci.'

Transliteration

The CLTK provides IPA phonetic transliteration for the Latin language. Currently, the only available dialect is Classical as reconstructed by W. Sidney Allen (taken from Vox Latina, 85-103). Example:

In [1]: from cltk.phonology.latin.transcription import Transcriber

In [2]: transcriber = Transcriber(dialect="Classical", reconstruction="Allen")

In [3]: transcriber.transcribe("Quo usque tandem, O Catilina, abutere nostra patientia?")
Out[3]: "['kʷoː 'ʊs.kʷɛ 't̪an̪.d̪ẽː 'oː ka.t̪ɪ.'liː.n̪aː a.buː.'t̪eː.rɛ 'n̪ɔs.t̪raː pa.t̪ɪ̣.'jɛn̪.t̪ɪ̣.ja]"

Word Tokenization

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: word_tokenizer = WordTokenizer('latin')

In [3]: text = 'atque haec abuterque puerve paterne nihil'

In [4]: word_tokenizer.tokenize(text)
Out[4]: ['atque', 'haec', 'abuter', '-que', 'puer', '-ve', 'pater', '-ne', 'nihil']

Word2Vec

Note

The Word2Vec models have not been fully vetted and are offered in the spirit of a beta. The CLTK’s API for it will be revised.

Note

You will need to install Gensim to use these features.

Word2Vec is a vector space model especially powerful for comparing words in relation to each other. For instance, it is commonly used to discover words which appear in similar contexts (something akin to synonyms; think of them as lexical clusters).

The CLTK repository contains pre-trained Word2Vec models for Latin (import as latin_word2vec_cltk), one lemmatized and the other not. They were trained on the PHI5 corpus. To train your own, see the README at the Latin Word2Vec repository.

One of the most common uses of Word2Vec is as a keyword expander. Keyword expansion is the taking of a query term, finding synonyms, and searching for those, too. Here’s an example of its use:

In [1]: from cltk.ir.query import search_corpus

In [2]: for x in search_corpus('amicitia', 'phi5', context='sentence', case_insensitive=True, expand_keyword=True, threshold=0.25):
    print(x)
   ...:
Out[2]: The following similar terms will be added to the 'amicitia' query: '['societate', 'praesentia', 'uita', 'sententia', 'promptu', 'beneuolentia', 'dignitate', 'monumentis', 'somnis', 'philosophia']'.
('L. Iunius Moderatus Columella', 'hospitem, nisi ex *amicitia* domini, quam raris-\nsime recipiat.')
('L. Iunius Moderatus Columella', ' \n    Xenophon Atheniensis eo libro, Publi Siluine, qui Oeconomicus \ninscribitur, prodidit maritale coniugium sic comparatum esse \nnatura, ut non solum iucundissima, uerum etiam utilissima uitae \nsocietas iniretur: nam primum, quod etiam Cicero ait, ne genus \nhumanum temporis longinquitate occideret, propter \nhoc marem cum femina esse coniunctum, deinde, ut ex \nhac eadem *societate* mortalibus adiutoria senectutis nec \nminus propugnacula praeparentur.')
('L. Iunius Moderatus Columella', 'ac ne ista quidem \npraesidia, ut diximus, non adsiduus labor et experientia \nuilici, non facultates ac uoluntas inpendendi tantum pollent \nquantum uel una *praesentia* domini, quae nisi frequens \noperibus interuenerit, ut in exercitu, cum abest imperator, \ncuncta cessant officia.')
['…']

threshold is the minimum closeness required between the query term and a neighboring word for that word to be added to the query. Note that when expand_keyword=True, the search term will be stripped of any regular expression syntax.

The keyword expander leverages get_sims() (which in turn leverages functionality of the Gensim package) to find similar terms. Some examples of it in action:

In [3]: from cltk.vector.word2vec import get_sims

In [4]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.7)
Matches found, but below the threshold of 'threshold=0.7'. Lower it to see these results.
Out[4]: []

In [5]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.2)
Out[5]:
['lictor',
 'extemplo',
 'cena',
 'nuntio',
 'aduenio',
 'iniussus2',
 'forum',
 'dictator',
 'fabium',
 'caesarem']

In [6]: get_sims('iube', 'latin', lemmatized=True, threshold=0.7)
Out[6]: "word 'iube' not in vocabulary"
['The following terms in the Word2Vec model you may be looking for: '['iubet”', 'iubet', 'iubilo', 'iubĕ', 'iubar', 'iubes', 'iubatus', 'iuba1', 'iubeo']'.]'

In [7]: get_sims('dictator', 'latin', lemmatized=False, threshold=0.7)
Out[7]:
['consul',
 'caesar',
 'seruilius',
 'praefectus',
 'flaccus',
 'manlius',
 'sp',
 'fuluius',
 'fabio',
 'ualerius']

To add and subtract vectors, you need to load the models yourself with Gensim.
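
For example, here is a minimal sketch with Gensim; the model filename below is illustrative, so point the path at wherever the imported latin_word2vec_cltk model actually lives on your machine:

In [8]: import os

In [9]: from gensim.models import Word2Vec

In [10]: model_path = os.path.expanduser('~/cltk_data/latin/model/latin_word2vec_cltk/latin_s100_w30_min5_sg.model')  # illustrative filename

In [11]: model = Word2Vec.load(model_path)

In [12]: model.wv.most_similar(positive=['rex', 'femina'], negative=['vir'], topn=3)  # vector arithmetic: rex - vir + femina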

Malayalam

Malayalam is a language spoken in India, predominantly in the state of Kerala. Malayalam originated from Middle Tamil (Sen-Tamil) in the 7th century. An alternative theory proposes a split in even more ancient times. Malayalam incorporated many elements from Sanskrit through the ages. Many medieval liturgical texts were written in an admixture of Sanskrit and early Malayalam, called Manipravalam. The oldest literary work in Malayalam, distinct from the Tamil tradition, is dated from between the 9th and 11th centuries. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with malayalam_) to discover available Malayalam corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('malayalam')

In [3]: c.list_corpora
Out[3]:
['malayalam_text_gretil']

Marathi

Marathi is an Indian language spoken predominantly by the Marathi people of Maharashtra. Marathi has some of the oldest literature of all modern Indo-Aryan languages, dating from about 900 AD. Early Marathi literature written during the Yadava (850-1312 CE) was mostly religious and philosophical in nature. Dnyaneshwar (1275–1296) was the first Marathi literary figure who had wide readership and profound influence. His major works are Amrutanubhav and Bhavarth Deepika (popularly known as Dnyaneshwari), a 9000-couplet long commentary on the Bhagavad Gita. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with marathi_) to discover available Marathi corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('marathi')

In [3]: c.list_corpora
Out[3]:
['marathi_text_wikisource']

Tokenizer

In [1]: from cltk.tokenize.sentence import TokenizeSentence

In [2]: tokenizer = TokenizeSentence('marathi')

In [3]: sentence = "आतां विश्वात्मके देवे, येणे वाग्यज्ञे तोषावे, तोषोनि मज द्यावे, पसायदान हे"

In [4]: tokenized_sentence = tokenizer.tokenize(sentence)

In [5]: print(tokenized_sentence)
['आतां', 'विश्वात्मके', 'देवे', ',', 'येणे', 'वाग्यज्ञे', 'तोषावे', ',', 'तोषोनि', 'मज', 'द्यावे', ',', 'पसायदान', 'हे']

Stopwords

Stopwords for classical Marathi were calculated from the Dnyaneshwari and the Haripath.

In [1]: from cltk.stop.marathi.stops import STOP_LIST

In [2]: print(STOP_LIST[1])
"तरी"

Alphabet

The Marathi alphabet is placed in cltk/corpus/marathi/alphabet.py.

In [1]: from cltk.corpus.marathi.alphabet import DIGITS

In [2]: print(DIGITS)
['०', '१', '२', '३', '४', '५', '६', '७', '८', '९']

There are 13 vowels in Marathi. All vowels have their independent form and a matra form, which are used for modifying consonants: VOWELS = ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'अॅ', 'ऑ'].

The International Alphabet of Sanskrit Transliteration (I.A.S.T.) is a transliteration scheme that allows the lossless romanization of Indic scripts as employed by Sanskrit and related Indic languages. IAST makes it possible for the reader to read the Indic text unambiguously, exactly as if it were in the original Indic script. The vowels would be represented thus: IAST_REPRESENTATION_VOWELS = ['a', 'ā', 'i', 'ī', 'u', 'ū', 'ṛ', 'e', 'ai', 'o', 'au', 'ae', 'ao'].

In [1]: from cltk.corpus.marathi.alphabet import VOWELS

In [2]: VOWELS

Out[2]: ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'अॅ', 'ऑ']

In [3]: from cltk.corpus.marathi.alphabet import IAST_REPRESENTATION_VOWELS

In [4]: IAST_REPRESENTATION_VOWELS

Out[4]: ['a', 'ā', 'i', 'ī', 'u', 'ū', 'ṛ', 'e', 'ai', 'o', 'au', 'ae', 'ao']

Similarly, we can import other vowels and consonants; see the example after the consonant lists below. There are 25 regular consonants (consonants that stop air from moving out of the mouth) in Marathi, and they are organized into groups (“vargas”) of five. The vargas are ordered according to where the tongue is in the mouth. Each successive varga refers to a successively forward position of the tongue. The vargas are ordered and named thus (with an example of a corresponding consonant):

  1. Velar: A velar consonant is a consonant that is pronounced with the back part of the tongue against the soft palate, also known as the velum, which is the back part of the roof of the mouth (e.g., k).
  2. Palatal: A palatal consonant is a consonant that is pronounced with the body (the middle part) of the tongue against the hard palate (which is the middle part of the roof of the mouth) (e.g., j).
  3. Retroflex: A retroflex consonant is a coronal consonant where the tongue has a flat, concave, or even curled shape, and is articulated between the alveolar ridge and the hard palate (e.g., English t).
  4. Dental: A dental consonant is a consonant articulated with the tongue against the upper teeth (e.g., Spanish t).
  5. Labial: Labials or labial consonants are articulated or made with the lips (e.g., p).
VELAR_CONSONANTS = ['क', 'ख', 'ग', 'घ', 'ङ']

PALATAL_CONSONANTS = ['च', 'छ', 'ज', 'झ', 'ञ']

RETROFLEX_CONSONANTS = ['ट','ठ', 'ड', 'ढ', 'ण']

DENTAL_CONSONANTS = ['त', 'थ', 'द', 'ध', 'न']

LABIAL_CONSONANTS = ['प', 'फ', 'ब', 'भ', 'म']

IAST_VELAR_CONSONANTS = ['k', 'kh', 'g', 'gh', 'ṅ']

IAST_PALATAL_CONSONANTS = ['c', 'ch', 'j', 'jh', 'ñ']

IAST_RETROFLEX_CONSONANTS = ['ṭ', 'ṭh', 'ḍ', 'ḍh', 'ṇ']

IAST_DENTAL_CONSONANTS = ['t', 'th', 'd', 'dh', 'n']

IAST_LABIAL_CONSONANTS = ['p', 'ph', 'b', 'bh', 'm']

There are four semi vowels in Marathi:

SEMI_VOWELS = ['य', 'र', 'ल', 'व']

IAST_SEMI_VOWELS = ['y', 'r', 'l', 'w']

There are three sibilants in Marathi:

SIBILANTS = ['श', 'ष', 'स']

IAST_SIBILANTS = ['ś', 'ṣ', 's']

There is one fricative consonant in Marathi:

FRIACTIVE_CONSONANTS = ['ह']

IAST_FRIACTIVE_CONSONANTS = ['h']

There are three additional consonants:

ADDITIONAL_CONSONANTS = ['ळ', 'क्ष', 'ज्ञ']

IAST_ADDITIONAL_CONSONANTS = ['La', 'kSha', 'dnya']
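
For example, importing one of the consonant groups and its IAST counterpart from the same module as the vowels:

In [5]: from cltk.corpus.marathi.alphabet import VELAR_CONSONANTS, IAST_VELAR_CONSONANTS

In [6]: VELAR_CONSONANTS
Out[6]: ['क', 'ख', 'ग', 'घ', 'ङ']

In [7]: IAST_VELAR_CONSONANTS
Out[7]: ['k', 'kh', 'g', 'gh', 'ṅ']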

Multilingual

Some functions in the CLTK are language-independent.

Concordance

Note

This is a new feature. Advice regarding readability is encouraged!

The philology module can produce a concordance. Currently there are two methods that write a concordance to file, one which takes one or more paths and another which takes a text string. Texts in Latin characters are alphabetized.

In [1]: from cltk.utils import philology

In [2]: iliad = '~/cltk_data/greek/text/tlg/individual_works/TLG0012.TXT-001.txt'

In [3]: philology.write_concordance_from_file(iliad, 'iliad')

This will write a traditional, human–readable 120,000–line concordance at ~/cltk_data/user_data/concordance_iliad.txt.

Multiple files can be passed as a list into this method.

In [5]: odyssey = '~/cltk_data/greek/text/tlg/individual_works/TLG0012.TXT-002.txt'

In [6]: philology.write_concordance_from_file([iliad, odyssey], 'homer')

This creates the file ~/cltk_data/user_data/concordance_homer.txt.

write_concordance_from_string() takes a string and will build the concordance from it.

In [7]: from cltk.corpus.utils.formatter import phi5_plaintext_cleanup

In [8]: import os

In [9]: tibullus = os.path.expanduser('~/cltk_data/latin/text/phi5/plaintext/LAT0660.TXT')

In [10]: with open(tibullus) as f:
....:     tib_read = f.read()

In [11]: tib_clean = phi5_plaintext_cleanup(tib_read).lower()

In [12]: philology.write_concordance_from_string(tib_clean, 'tibullus')

The resulting concordance looks like:

 modulatus eburno felices cantus ore sonante dedit. sed postquam fuerant digiti cum voce locuti , edidit haec tristi dulcia verba modo : 'salve , cura
  caveto , neve cubet laxo pectus aperta sinu , neu te decipiat nutu , digitoque liquorem ne trahat et mensae ducat in orbe notas. exibit quam saepe ,
  acerbi : non ego sum tanti , ploret ut illa semel. nec lacrimis oculos digna est foedare loquaces : lena nocet nobis , ipsa puella bona est. lena ne
 eaera tamen.” “carmine formosae , pretio capiuntur avarae. gaudeat , ut digna est , versibus illa tuis. lutea sed niveum involvat membrana libellum ,
 umnus olympo mille habet ornatus , mille decenter habet. sola puellarum digna est , cui mollia caris vellera det sucis bis madefacta tyros , possidea
  velim , sed peccasse iuvat , voltus conponere famae taedet : cum digno digna fuisse ferar. invisus natalis adest , qui rure molesto et sine cerintho
 a phoebe superbe lyra. hoc sollemne sacrum multos consummet in annos : dignior est vestro nulla puella choro. parce meo iuveni , seu quis bona pascua
a para. sic bene conpones : ullae non ille puellae servire aut cuiquam dignior illa viro. nec possit cupidos vigilans deprendere custos , fallendique

Corpora

The CLTK uses languages in its organization of data, however some good corpora do not and cannot be easily broken apart. Furthermore, some, such as parallel text corpora, are inherently multilingual. Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with multilingual_) to discover available multilingual corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('multilingual')

In [3]: c.list_corpora
Out[3]: ['multilingual_treebank_proiel']

Information Retrieval (regex, keyword expansion)

Tip

To begin working with regular expressions, try Pythex, a handy tool for developing patterns. For more thorough lessons, try Learn Regex The Hard Way.

Tip

Read about Word2Vec for Latin or Greek for the powerful keyword expansion functionality.

Several functions are available for querying text in order to match regular expression patterns. match_regex() is the most basic. Punctuation rules are included for texts using Latin sentence–final punctuation (‘.’, ‘!’, ‘?’) and Greek (‘.’, ‘;’). For returned strings, you may choose between a context of the match’s sentence, paragraph, or custom number of characters on each side of a hit. Note that this function and the next each return a generator.

Here is an example in Latin with a sentence context, case-insensitive:

In [1]: from cltk.ir.query import match_regex

In [2]: text = 'Ita fac, mi Lucili; vindica te tibi. et tempus, quod adhuc aut auferebatur aut subripiebatur aut excidebat, collige et serva.'

In [3]: matches = match_regex(text, r'tempus', language='latin', context='sentence', case_insensitive=True)

In [4]: for match in matches:
    print(match)
   ...:
et *tempus*, quod adhuc aut auferebatur aut subripiebatur aut excidebat, collige et serva.

And here with context of 40 characters:

In [5]: matches = match_regex(text, r'tempus', language='latin', context=40, case_insensitive=True)

In [6]: for match in matches:
    print(match)
   ...:
Ita fac, mi Lucili; vindica te tibi. et *tempus*, quod adhuc aut auferebatur aut subripi

For querying the entirety of a corpus, see search_corpus(), which yields tuples of the form ('author_name', 'match_context').

In [7]: from cltk.ir.query import search_corpus

In [8]: for match in search_corpus('ὦ ἄνδρες Ἀθηναῖοι', 'tlg', context='sentence'):
    print(match)
   ...:
('Ammonius Phil.', ' \nκαλοῦντας ἑτέρους ἢ προστάσσοντας ἢ ἐρωτῶντας ἢ εὐχομένους περί τινων, \nπολλάκις δὲ καὶ αὐτοπροσώπως κατά τινας τῶν ἐνεργειῶν τούτων ἐνεργοῦ-\nσι “πρῶτον μέν, *ὦ ἄνδρες Ἀθηναῖοι*, τοῖς θεοῖς εὔχομαι πᾶσι καὶ πάσαις” \nλέγοντες ἢ “ἀπόκριναι γὰρ δεῦρό μοι ἀναστάς”. οἱ οὖν περὶ τῶν τεχνῶν \nτούτων πραγματευόμενοι καὶ τοὺς λόγους εἰς θεωρίαν ')
('Sopater Rhet.', "θόντα, ἢ συγγνωμονηκέναι καὶ ἐλεῆσαι. ψυχῆς γὰρ \nπάθος ἐπὶ συγγνώμῃ προτείνεται. παθητικὴν οὖν ποιή-\nσῃ τοῦ πρώτου προοιμίου τὴν ἔννοιαν: ἁπάντων, ὡς ἔοι-\nκεν, *ὦ ἄνδρες Ἀθηναῖοι*, πειρασθῆναί με τῶν παραδό-\nξων ἀπέκειτο, πόλιν ἰδεῖν ἐν μέσῃ Βοιωτίᾳ κειμένην. καὶ \nμετὰ Θήβας οὐκ ἔτ' οὔσας, ὅτι μὴ στεφανοῦντας Ἀθη-\nναίους ἀπέδειξα παρὰ τὴ")
…

Information Retrieval (boolean)

Note

The API for the CLTK index and query will likely change. Consider this module an alpha. Please report improvements or problems.

An index to a corpus allows for faster, and sometimes more nuanced, searches. The CLTK has built some indexing and querying functionality with the Whoosh library. The following show how to make an index and then query it:

First, ensure that you have imported and converted the PHI5 or TLG disks. If you want author-level chunking, convert with convert_corpus(); for searching by work, convert with divide_works(). CLTKIndex() has an optional argument chunk, which defaults to chunk='author'; chunk='work' is also available.

An index only needs to be made once. Then it can be queried with, e.g.:

In [1]: from cltk.ir.boolean import CLTKIndex

In [2]: cltk_index = CLTKIndex('latin', 'phi5', chunk='work')

In [3]: results = cltk_index.corpus_query('amicitia')

In [4]: results[:500]
Out[4]: 'Docs containing hits: 836.</br></br>Marcus Tullius Cicero, Cicero, Tully</br>/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0474.TXT-052.TXT</br>Approximate hits: 132.</br>LAELIUS DE <b class="match term0">AMICITIA</b> LIBER </br>        AD T. POMPONIUM ATTICUM.} </br>    Q. Mucius augur multa narrare...incidisset, exposuit </br>nobis sermonem Laeli de <b class="match term1">amicitia</b> habitum ab illo </br>secum et cum altero genero, C. Fannio...videretur. </br>    Cum enim saepe me'

The function returns a string in HTML markup, which you can then parse yourself.

To save results, use the save_file parameter:

In [4]: cltk_index.corpus_query('amicitia', save_file='2016_amicitia')

This will save a file at ~/cltk_data/user_data/search/2016_amicitia.html, being a human-readable output with word-matches highlighted, of all authors (or texts, if chunk='work').

Lemmatization, backoff and others

CLTK offers ‘multiplex’ lemmatization, i.e. a series of lexicon-, rules-, or training data-based lemmatizers that can be chained together. The multiplex lemmatizers are based on backoff POS tagging in NLTK: 1. with Backoff lemmatization, tagging stops at the first successful instance of tagging by a sub-tagger; 2. with Ensemble lemmatization, tagging continues through the entire sequence, all possible lemmas are returned, and a method for scoring/selecting from possible lemmas can be specified. All of the examples below are in Latin, but these lemmatizers are language-independent (at least, where lemmatization is a meaningful NLP task) and can be made language-specific by providing different training sentences, regex patterns, etc.

The backoff module offers DefaultLemmatizer which returns the same “lemma” for all tokens:

In [1]: from cltk.lemmatize.backoff import DefaultLemmatizer

In [2]: lemmatizer = DefaultLemmatizer()

In [3]: tokens = ['Quo', 'usque', 'tandem', 'abutere', ',', 'Catilina', ',', 'patientia', 'nostra', '?']

In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('Quo', None), ('usque', None), ('tandem', None), ('abutere', None), (',', None), ('Catilina', None), (',', None), ('patientia', None), ('nostra', None), ('?', None)]

DefaultLemmatizer can take as a parameter what “lemma” should be returned:

In [5]: lemmatizer = DefaultLemmatizer('UNK')

In [6]: lemmatizer.lemmatize(tokens)
Out[6]: [('Quo', 'UNK'), ('usque', 'UNK'), ('tandem', 'UNK'), ('abutere', 'UNK'), (',', 'UNK'), ('Catilina', 'UNK'), (',', 'UNK'), ('patientia', 'UNK'), ('nostra', 'UNK'), ('?', 'UNK')]

The backoff module also offers IdentityLemmatizer which returns the given token as the lemma:

In [7]: from cltk.lemmatize.backoff import IdentityLemmatizer

In [8]: lemmatizer = IdentityLemmatizer()

In [9]: lemmatizer.lemmatize(tokens)
Out[9]: [('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutere'), (',', ','), ('Catilina', 'Catilina'), (',', ','), ('patientia', 'patientia'), ('nostra', 'nostra'), ('?', '?')]

With the DictLemmatizer, the backoff module allows you to provide a dictionary of the form {‘TOKEN1’: ‘LEMMA1’, ‘TOKEN2’: ‘LEMMA2’} for lemmatization.

In [10]: tokens = ['arma', 'uirum', '-que', 'cano', ',', 'troiae', 'qui', 'primus', 'ab', 'oris']

In [11]: lemmas = {'arma': 'arma', 'uirum': 'uir', 'troiae': 'troia', 'oris': 'ora'}

In [12]: from cltk.lemmatize.backoff import DictLemmatizer

In [13]: lemmatizer = DictLemmatizer(lemmas=lemmas)

In [14]: lemmatizer.lemmatize(tokens)
Out[14]: [('arma', 'arma'), ('uirum', 'uir'), ('-que', None), ('cano', None), (',', None), ('troiae', 'troia'), ('qui', None), ('primus', None), ('ab', None), ('oris', 'ora')]

The DictLemmatizer—like all of the lemmatizers in this module—can take a second lemmatizer (or backoff lemmatizer) for any of the tokens that return ‘None’. This is done with a ‘backoff’ parameter:

In [15]: default = DefaultLemmatizer('UNK')

In [16]: lemmatizer = DictLemmatizer(lemmas=lemmas, backoff=default)

In [17]: lemmatizer.lemmatize(tokens)
Out[17]: [('arma', 'arma'), ('uirum', 'uir'), ('-que', 'UNK'), ('cano', 'UNK'), (',', 'UNK'), ('troiae', 'troia'), ('qui', 'UNK'), ('primus', 'UNK'), ('ab', 'UNK'), ('oris', 'ora')]

These lemmatizers also have a verbose mode that returns the specific tagger used for each lemma returned.

In [18]: default = DefaultLemmatizer('UNK', verbose=True)

In [19]: lemmatizer = DictLemmatizer(lemmas=lemmas, backoff=default, verbose=True)

In [20]: lemmatizer.lemmatize(tokens)
Out[20]: [('arma', 'arma', "<DictLemmatizer: {'arma': 'arma', ...}>"), ('uirum', 'uir', "<DictLemmatizer: {'arma': 'arma', ...}>"), ('-que', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('cano', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), (',', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('troiae', 'troia', "<DictLemmatizer: {'arma': 'arma', ...}>"), ('qui', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('primus', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('ab', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('oris', 'ora', "<DictLemmatizer: {'arma': 'arma', ...}>")]

You can provide a name for the data source to make the verbose output clearer:

In [21]: default = DefaultLemmatizer('UNK')

In [22]: lemmatizer = DictLemmatizer(lemmas=lemmas, source="CLTK Docs Example", backoff=default, verbose=True)

In [23]: lemmatizer.lemmatize(tokens)
Out[23]: [('arma', 'arma', '<DictLemmatizer: CLTK Docs Example>'), ('uirum', 'uir', '<DictLemmatizer: CLTK Docs Example>'), ('-que', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('cano', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), (',', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('troiae', 'troia', '<DictLemmatizer: CLTK Docs Example>'), ('qui', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('primus', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('ab', 'UNK', '<DefaultLemmatizer: lemma=UNK>'), ('oris', 'ora', '<DictLemmatizer: CLTK Docs Example>')]

With the UnigramLemmatizer, the backoff module allows you to provide a list of lists of sentences of the form [[(‘TOKEN1’, ‘LEMMA1’), (‘TOKEN2’, ‘LEMMA2’)], [(‘TOKEN3’, ‘LEMMA3’), (‘TOKEN4’, ‘LEMMA4’)], … ] for lemmatization. The lemmatizer returns the lemma that has the highest frequency in the training sentences. So, for example, if the tuple (‘est’, ‘sum’) appears in the training sentences 99 times and (‘est’, ‘edo’) appears 1 time, the lemmatizer would return the lemma ‘sum’.

Here is an example of the UnigramLemmatizer() (also imported from cltk.lemmatize.backoff):

In [24]: train_data = [[('cum', 'cum2'), ('esset', 'sum'), ('caesar', 'caesar'), ('in', 'in'), ('citeriore', 'citer'), ('gallia', 'gallia'), ('in', 'in'), ('hibernis', 'hibernus'), (',', 'punc'), ('ita', 'ita'), ('uti', 'ut'), ('supra', 'supra'), ('demonstrauimus', 'demonstro'), (',', 'punc'), ('crebri', 'creber'), ('ad', 'ad'), ('eum', 'is'), ('rumores', 'rumor'), ('adferebantur', 'affero'), ('litteris', 'littera'), ('-que', '-que'), ('item', 'item'), ('labieni', 'labienus'), ('certior', 'certus'), ('fiebat', 'fio'), ('omnes', 'omnis'), ('belgas', 'belgae'), (',', 'punc'), ('quam', 'qui'), ('tertiam', 'tertius'), ('esse', 'sum'), ('galliae', 'gallia'), ('partem', 'pars'), ('dixeramus', 'dico'), (',', 'punc'), ('contra', 'contra'), ('populum', 'populus'), ('romanum', 'romanus'), ('coniurare', 'coniuro'), ('obsides', 'obses'), ('-que', '-que'), ('inter', 'inter'), ('se', 'sui'), ('dare', 'do'), ('.', 'punc')], [('coniurandi', 'coniuro'), ('has', 'hic'), ('esse', 'sum'), ('causas', 'causa'), ('primum', 'primus'), ('quod', 'quod'), ('uererentur', 'uereor'), ('ne', 'ne'), (',', 'punc'), ('omni', 'omnis'), ('pacata', 'paco'), ('gallia', 'gallia'), (',', 'punc'), ('ad', 'ad'), ('eos', 'is'), ('exercitus', 'exercitus'), ('noster', 'noster'), ('adduceretur', 'adduco'), (';', 'punc')]]

In [25]: default = DefaultLemmatizer('UNK')

In [26]: lemmatizer = UnigramLemmatizer(train_data, backoff=default)

In [27]: lemmatizer.lemmatize(tokens)
Out[27]: [('arma', 'UNK'), ('uirum', 'UNK'), ('-que', '-que'), ('cano', 'UNK'), (',', 'punc'), ('troiae', 'UNK'), ('qui', 'UNK'), ('primus', 'UNK'), ('ab', 'UNK'), ('oris', 'UNK')]

There is also a regular-expression-based lemmatizer that uses a list of substitution patterns to return lemmas:

In [28]: regexps = [('(.)tat(is|i|em|e|es|um|ibus)$', r'\1tas'), ('(.)ion(is|i|em|e|es|um|ibus)$', r'\1io'), ('(.)av(i|isti|it|imus|istis|erunt|)$', r'\1o'),]

In [29]: tokens = "iam a principio nobilitatis factionem disturbavit".split()

In [30]: from cltk.lemmatize.backoff import RegexpLemmatizer

In [31]: lemmatizer = RegexpLemmatizer(regexps=regexps)

In [32]: lemmatizer.lemmatize(tokens)
Out[32]: [('iam', None), ('a', None), ('principio', None), ('nobilitatis', 'nobilitas'), ('factionem', 'factio'), ('disturbavit', 'disturbo')]

Ensemble lemmatizers are constructed in a similar manner, but all sub-lemmatizers return tags, and a selection mechanism can be applied to the output. (NB: Selection and scoring mechanisms for use with the Ensemble Lemmatizer are under development.)

In [33]: from cltk.lemmatize.ensemble import EnsembleDictLemmatizer, EnsembleUnigramLemmatizer, EnsembleRegexpLemmatizer

In [34]: patterns = [(r'\b(.+)(o|is|it|imus|itis|unt)\b', r'\1o'), (r'\b(.+)(o|as|at|amus|atis|ant)\b', r'\1o'),]

In [35]: tokens = "arma virumque cano qui".split()

In [36]: EDL = EnsembleDictLemmatizer(lemmas={'cano': 'cano'}, source='EDL', verbose=True)

In [37]: EUL = EnsembleUnigramLemmatizer(train=[[('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano')], [('arma', 'arma'), ('virumque', 'virus'), ('cano', 'canus')], [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'canis')], [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano')],], verbose=True, backoff=EDL)

In [38]: ERL = EnsembleRegexpLemmatizer(regexps=patterns, source='Latin Regex Patterns', verbose=True, backoff=EUL)

In [39]: ERL.lemmatize(tokens, lemmas_only=True)
Out[39]: [['arma'], ['vir', 'virus'], ['canis', 'cano', 'canus'], []]

N–grams

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from nltk.util import bigrams

In [3]: from nltk.util import trigrams

In [4]: from nltk.util import ngrams

In [5]: s = 'Ut primum nocte discussa sol novus diem fecit, et somno simul emersus et lectulo, anxius alioquin et nimis cupidus cognoscendi quae rara miraque sunt, reputansque me media Thessaliae loca tenere qua artis magicae nativa cantamina totius orbis consono orbe celebrentur fabulamque illam optimi comitis Aristomenis de situ civitatis huius exortam, suspensus alioquin et voto simul et studio, curiose singula considerabam. Nec fuit in illa civitate quod aspiciens id esse crederem quod esset, sed omnia prorsus ferali murmure in aliam effigiem translata, ut et lapides quos offenderem de homine duratos et aves quas audirem indidem plumatas et arbores quae pomerium ambirent similiter foliatas et fontanos latices de corporibus humanis fluxos crederem; iam statuas et imagines incessuras, parietes locuturos, boves et id genus pecua dicturas praesagium, de ipso vero caelo et iubaris orbe subito venturum oraculum.'.lower()

In [6]: p = PunktLanguageVars()

In [7]: tokens = p.word_tokenize(s)

In [8]: b = bigrams(tokens)

In [9]: [x for x in b]
Out[9]:
[('ut', 'primum'),
 ('primum', 'nocte'),
 ('nocte', 'discussa'),
 ('discussa', 'sol'),
 ('sol', 'novus'),
 ('novus', 'diem'),
 ...]

In [10]: t = trigrams(tokens)

In [11]: [x for x in t]
Out[11]:
[('ut', 'primum', 'nocte'),
 ('primum', 'nocte', 'discussa'),
 ('nocte', 'discussa', 'sol'),
 ('discussa', 'sol', 'novus'),
 ('sol', 'novus', 'diem'),
 …]

In [12]: five_gram = ngrams(tokens, 5)

In [13]: [x for x in five_gram]
Out[13]:
[('ut', 'primum', 'nocte', 'discussa', 'sol'),
 ('primum', 'nocte', 'discussa', 'sol', 'novus'),
 ('nocte', 'discussa', 'sol', 'novus', 'diem'),
 ('discussa', 'sol', 'novus', 'diem', 'fecit'),
 ('sol', 'novus', 'diem', 'fecit', ','),
 ('novus', 'diem', 'fecit', ',', 'et'),
 …]

Normalization

If you are working with texts from different resources, it is likely a good idea to normalize them before further processing (such as string comparison). The CLTK provides a wrapper for the Python standard library’s unicodedata.normalize(). Here’s an example of its use in “compatibility” mode (NFKC):

In [1]: from cltk.corpus.utils.formatter import cltk_normalize

In [2]: tonos = "ά"

In [3]: oxia = "ά"

In [4]: tonos == oxia
Out[4]: False

In [5]: tonos == cltk_normalize(oxia)
Out[5]: True

One can turn off compatibility with:

In [6]: tonos == cltk_normalize(oxia, compatibility=False)
Out[6]: True

For more on normalize() see the Python Unicode docs.
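
Since cltk_normalize() is a thin wrapper, you can also call the standard library directly; here is a minimal sketch with unicodedata.normalize(), reusing the tonos and oxia strings from above:

In [7]: import unicodedata

In [8]: unicodedata.normalize('NFKC', oxia) == tonos
Out[8]: True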

Skipgrams

The NLTK has a handy skipgram function. Use it like this:

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: from nltk.util import skipgrams

In [3]: text = 'T. Pomponis Atticus, ab origine ultima stirpis Romanae generatus, \
   ...:    perpetuo a maioribus acceptam equestrem obtinuit dignitatem.'

In [4]: word_tokenizer = WordTokenizer('latin')

In [5]: unigrams = word_tokenizer.tokenize(text)

In [6]: for ngram in skipgrams(unigrams, 3, 5):
   ...:     print(ngram)
   ...:
('T.', 'Pomponis', 'Atticus')
('T.', 'Pomponis', ',')
('T.', 'Pomponis', 'ab')
('T.', 'Pomponis', 'origine')
('T.', 'Pomponis', 'ultima')
('T.', 'Pomponis', 'stirpis')
('T.', 'Atticus', ',')
('T.', 'Atticus', 'ab')
('T.', 'Atticus', 'origine')
('T.', 'Atticus', 'ultima')
…
('equestrem', 'obtinuit', '.')
('equestrem', 'dignitatem', '.')
('obtinuit', 'dignitatem', '.')

The first parameter is the length of the output n-gram and the second parameter is how many tokens to skip.

The NLTK’s skipgrams() produces a generator whose values can be turned into a list like so:

In [8]: list(skipgrams(unigrams, 3, 5))
Out[8]:
[('T.', 'Pomponis', 'Atticus'),
 ('T.', 'Pomponis', ','),
 ('T.', 'Pomponis', 'ab'),
 …
 ('equestrem', 'dignitatem', '.'),
 ('obtinuit', 'dignitatem', '.')]

Stoplist Construction

The Stop module offers an abstract class for constructing stoplists: BaseCorpusStoplist.

Child classes must implement vectorizer and tfidf_vectorizer. For now, only Latin and Classical Chinese have implemented a child class of BaseCorpusStoplist, named CorpusStoplist.

These classes expose parameters for the size of the stoplist and for the criterion (basis) used to weight words within the collection. The bases currently available are: frequency, mean (mean probability), variance (variance probability), entropy (entropy), and zou (a composite measure based on mean, variance, and entropy as described in [Zou 2006]). Other parameters for both StringStoplist and CorpusStoplist include boolean preprocessing options (lower, remove_numbers, remove_punctuation) and override lists of words to add to or subtract from the stoplist (include, exclude).
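
A minimal sketch of corpus-based stoplist construction, assuming the Latin child class is importable as cltk.stop.latin.CorpusStoplist and exposes a build_stoplist() method (check your installed version for the exact names):

In [1]: from cltk.stop.latin import CorpusStoplist  # assumed import path

In [2]: corpus = ['cui dono lepidum novum libellum arida modo pumice expolitum',
   ...:           'corneli tibi namque tu solebas meas esse aliquid putare nugas']

In [3]: S = CorpusStoplist()

In [4]: S.build_stoplist(corpus, size=5, basis='zou')  # the 5 highest-weighted candidate stopwords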

Syllabification

CLTK provides a language-agnostic syllabifier module as part of phonology. The syllabifier works by following the Sonority Sequencing Principle. The default phonetic scale (from most to least sonorous):

low vowels > mid vowels > high vowels > flaps > laterals > nasals > fricatives > plosives

In [1]: from cltk.phonology.syllabify import Syllabifier

In [2]: high_vowels = ['a']

In [3]: mid_vowels = ['e']

In [4]: low_vowels = ['i', 'u']

In [5]: flaps = ['r']

In [6]: nasals = ['m', 'n']

In [7]: fricatives = ['f']

In [8]: s = Syllabifier(high_vowels=high_vowels, mid_vowels=mid_vowels, low_vowels=low_vowels, flaps=flaps, nasals=nasals, fricatives=fricatives)

In [9]: s.syllabify("feminarum")
Out[9]: ['fe', 'mi', 'na', 'rum']

Additionally, you can override the default sonority hierarchy by calling set_hierarchy. However, you must also re-define the vowel list for the nuclei to be correctly identified.

In [10]: s = Syllabifier()

In [11]: s.set_hierarchy([['i', 'u'], ['e'], ['a'], ['r'], ['m', 'n'], ['f']])

In [12]: s.set_vowels(['i', 'u', 'e', 'a'])

In [13]: s.syllabify('feminarum')
Out[13]: ['fe', 'mi', 'na', 'rum']

For a language-dependent approach, you can load the predefined sonority hierarchy by setting the language parameter:

In [14]: s = Syllabifier(language='middle high german')

In [15]: s.syllabify('lobebæren')
Out[15]: ['lo', 'be', 'bæ', 'ren']

Text Reuse

The text reuse module offers a few tools to get started with studying text reuse (i.e., allusion and intertext). The major goals of this module are to leverage conventional text reuse strategies and to create comparison methods designed specifically for the languages of the corpora included in the CLTK.

This module is under active development, so if you experience a bug or have a suggestion for something to include, please create an issue on GitHub.

Levenshtein distance calculation

Note

You will need to install two packages to use Levenshtein measures. Install them with pip install fuzzywuzzy python-Levenshtein. python-Levenshtein is optional but gives speed improvements.

The Levenshtein distance comparison is a commonly used method for fuzzy string comparison. The CLTK Levenshtein class offers a few helpers for getting started with creating comparisons between documents.

This simple example compares a line from Vergil’s Georgics with a line from Propertius (Elegies III.13.41):

In [1]: from cltk.text_reuse.levenshtein import Levenshtein

In [2]: l = Levenshtein()

In [3]: l.ratio("dique deaeque omnes, studium quibus arua tueri,", "dique deaeque omnes, quibus est tutela per agros,")
Out[3]: 0.71

You can also calculate the Levenshtein distance of two words, defined as the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one word into the other.

In [4]: l.levenshtein_distance("deaeque", "deaeuqe")
Out[4]: 2

Damerau-Levenshtein algorithm

Note

You will need to install pyxDamerauLevenshtein to use these features.

The Damerau-Levenshtein algorithm computes a distance metric between any two strings, i.e., between any two finite sequences of symbols or letters. It is an enhancement of the Levenshtein algorithm in that it also allows transposition operations.

This simple example compares two Latin words to find the distance between them:

In [1]: from pyxdameraulevenshtein import damerau_levenshtein_distance

In [2]: damerau_levenshtein_distance("deaeque", "deaque")
Out[2]: 1

Alternatively, you can also use CLTK’s native Levenshtein class:

In [3]: from cltk.text_reuse.levenshtein import Levenshtein

In [4]: Levenshtein.damerau_levenshtein_distance("deaeque", "deaque")
Out[4]: 1

In [5]: Levenshtein.damerau_levenshtein_distance("deaeque", "deaeuqe")
Out[5]: 1

Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm calculates the optimal global alignment between two strings, given a scoring matrix.

There are two optional parameters: S, specifying a weighted similarity square matrix, and alphabet (where |alphabet| = rows(S) = cols(S)). By default, the algorithm assumes the Latin alphabet and a default matrix (1 for a match, -1 for a substitution).

In [1]: from cltk.text_reuse.comparison import Needleman_Wunsch as NW

In [2]: NW("abba", "ababa", alphabet = "ab", S = [[1, -3],[-3, 1]])
Out[2]: ('ab-ba', 'ababa')

In this case, the similarity matrix will be:

       a    b
  a    1   -3
  b   -3    1

Longest Common Substring

The longest common substring function takes two strings as arguments and returns the longest substring common to both. The example below compares a line from Vergil’s Georgics with a line from Propertius (Elegies III.13.41):

In [1]: from cltk.text_reuse.comparison import long_substring

In [2]: print(long_substring("dique deaeque omnes, studium quibus arua tueri,", "dique deaeque omnes, quibus est tutela per agros,"))
Out[2]: dique deaeque omnes,

MinHash

The MinHash algorithm generates a score based on the similarity of two strings. It takes two strings as parameters and returns a float.

In [1]: from cltk.text_reuse.comparison import minhash

In [2]: a = 'dique deaeque omnes, studium quibus arua tueri,'

In [3]: b = 'dique deaeque omnes, quibus est tutela per agros,'

In [4]: print(minhash(a, b))
0.171631205673

Treebank label dict

You can generate a nested Python dict from a treebank in string format. Currently, only treebanks following the Penn notation are supported.

In [1]: from cltk.tags.treebanks import parse_treebanks

In [2]: st = "((IP-MAT-SPE (' ') (INTJ Yes) (, ,) (' ') (IP-MAT-PRN (NP-SBJ (PRO he)) (VBD seyde)) (, ,) (' ') (NP-SBJ (PRO I)) (MD shall)   (VB promyse) (NP-OB2 (PRO you)) (IP-INF (TO to) (VB fullfylle) (NP-OB1 (PRO$ youre) (N desyre))) (. .) (' '))"

In [3]: treebank = parse_treebanks(st)

In [4]: treebank['IP-MAT-SPE']['INTJ']
Out[4]: ['Yes']

In [5]: treebank
Out[5]: {'IP-MAT-SPE': {"'": ["'", "'", "'"], 'INTJ': ['Yes'], ',': [',', ','], 'IP-MAT-PRN': {'NP-SBJ': {'PRO': ['he']}, 'VBD': ['seyde']}, 'NP-SBJ': {'PRO': ['I']}, 'MD': ['shall'], '\t': {'VB': ['promyse'], 'NP-OB2': {'PRO': ['you']}, 'IP-INF': {'TO': ['to'], '\t': {'VB': ['fullfylle'], 'NP-OB1': {'PRO$': ['youre'], 'N': ['desyre']}}, '.': ['.'], "'": ["'"]}}}}

Word count

For a dictionary-like object of word frequencies, use the NLTK’s Text().

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from nltk.text import Text

In [3]: s = 'At at at ego ego tibi'.lower()

In [4]: p = PunktLanguageVars()

In [5]: tokens = p.word_tokenize(s)

In [6]: t = Text(tokens)

In [7]: vocabulary_count = t.vocab()

In [8]: vocabulary_count['at']
Out[8]: 3

In [9]: vocabulary_count['ego']
Out[9]: 2

In [10]: vocabulary_count['tibi']
Out[10]: 1

Word frequency lists

The CLTK has a module for computing word frequencies. It returns a Counter, a dictionary subclass.

In [1]: from cltk.utils.frequency import Frequency

In [2]: from cltk.corpus.utils.formatter import tlg_plaintext_cleanup

In [3]: import os

In [4]: freq = Frequency()

In [6]: file = os.path.expanduser('~/cltk_data/greek/text/tlg/plaintext/TLG0012.TXT')

In [7]: with open(file) as f:
...:     text = f.read().lower()
...:

In [8]: text = tlg_plaintext_cleanup(text)

In [9]: freq.counter_from_str(text)
Out[9]: Counter({'δ': 6507, 'καὶ': 4799, 'δὲ': 3194, 'τε': 2645, 'μὲν': 1628, 'ἐν': 1420, 'δέ': 1267, 'ὣς': 1203, 'οἱ': 1126, 'τ': 1101, 'γὰρ': 969, 'ἀλλ': 936, 'τὸν': 904, 'ἐπὶ': 830, 'τοι': 772, 'αὐτὰρ': 761, 'δὴ': 748, 'μοι': 745, 'μιν': 645, 'γε': 632, 'ἐπεὶ': 611, 'ἄρ': 603, 'ἦ': 598, 'νῦν': 581, 'ἄρα': 576, 'κατὰ': 572, 'ἐς': 571, 'ἐκ': 554, 'ἐνὶ': 544, 'ὡς': 541, 'ὃ': 533, 'οὐ': 530, 'οἳ': 527, 'περ': 491, 'τις': 491, 'οὐδ': 482, 'καί': 481, 'οὔ': 476, 'γάρ': 435, 'κεν': 407, 'τι': 407, 'γ': 406, 'ἐγὼ': 404, 'ἐπ': 397, … })

If you have access to the TLG or PHI5 disc, and have already imported it and converted it with the CLTK, you can build your own custom lists off of that.

In [11]: freq.make_list_from_corpus('phi5', 200, save=False)  # or 'tlg'; both take a while to run
Out[11]: Counter({',': 749396, 'et': 196410, 'in': 141035, 'non': 89836, 'est': 86472, ':': 76915, 'ut': 70516, ';': 69901, 'cum': 61454, 'si': 59578, 'ad': 59248, 'quod': 52896, 'qui': 46385, 'sed': 41546, '?': 40717, 'quae': 38085, 'ex': 36996, 'quam': 34431, "'": 33596, 'de': 31331, 'esse': 31066, 'aut': 30568, 'a': 29871, 'hoc': 26266, 'nec': 26027, 'etiam': 22540, 'se': 22486, 'enim': 22104, 'ab': 21336, 'quid': 21269, 'per': 20981, 'atque': 20201, 'sunt': 20025, 'sit': 19123, 'autem': 18853, 'id': 18846, 'quo': 18204, 'me': 17713, 'ne': 17265, 'ac': 17007, 'te': 16880, 'nam': 16640, 'tamen': 15560, 'eius': 15306, 'haec': 15080, 'ita': 14752, 'iam': 14532, 'mihi': 14440, 'neque': 13833, 'eo': 13125, 'quidem': 13063, 'est.': 12767, 'quoque': 12561, 'ea': 12389, 'pro': 12259, 'uel': 11824, 'quia': 11518, 'tibi': 11493, … })

Word tokenization

The CLTK wraps one of the NLTK’s tokenizers (TreebankWordTokenizer), which with the multilingual parameter works for most languages that use Latin-style whitespace and punctuation to indicate word division. There are some language-specific tokenizers, too, which do extra work to subdivide words when they are combined into one string (e.g., “armaque” in Latin). See WordTokenizer.available_languages for supported languages for such sub-string tokenization.

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: tok = WordTokenizer(language='multilingual')

In [3]: tok.available_languages
Out[3]:
['akkadian',
 'arabic',
 'french',
 'greek',
 'latin',
 'middle_english',
 'middle_french',
 'middle_high_german',
 'old_french',
 'old_norse',
 'sanskrit',
 'multilingual']

In [4]: luke_ocs = "рєчє жє притъчѫ к н҄имъ глагол҄ѧ чловѣкѹ єтєрѹ богатѹ ѹгобьѕи сѧ н҄ива"

In [5]: tok.tokenize(luke_ocs)
Out[5]:
['рєчє',
 'жє',
 'притъчѫ',
 'к',
 'н҄имъ',
 'глагол҄ѧ',
 'чловѣкѹ',
 'єтєрѹ',
 'богатѹ',
 'ѹгобьѕи',
 'сѧ',
 'н҄ива']

If this default does not work for your texts, consider the NLTK’s RegexpTokenizer, which tokenizes according to a regular expression pattern of your choosing. Here, for instance, it captures runs of word characters, in effect splitting at whitespace and punctuation:

In [6]: from nltk.tokenize import RegexpTokenizer

In [7]: word_toker = RegexpTokenizer(r'\w+')

In [8]: word_toker.tokenize(luke_ocs)
Out[8]:
['рєчє',
 'жє',
 'притъчѫ',
 'к',
 'н',
 'имъ',
 'глагол',
 'ѧ',
 'чловѣкѹ',
 'єтєрѹ',
 'богатѹ',
 'ѹгобьѕи',
 'сѧ',
 'н',
 'ива']

Old Norse

Old Norse was a North Germanic language that was spoken by inhabitants of Scandinavia and inhabitants of their overseas settlements during about the 9th to 13th centuries. The Proto-Norse language developed into Old Norse by the 8th century, and Old Norse began to develop into the modern North Germanic languages in the mid- to late-14th century, ending the language phase known as Old Norse. These dates, however, are not absolute, since written Old Norse is found well into the 15th century. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with old_norse_) to discover available Old Norse corpora.

In[1]: from cltk.corpus.utils.importer import CorpusImporter

In[2]: corpus_importer = CorpusImporter("old_norse")

In[3]: corpus_importer.list_corpora

Out[3]: ['old_norse_text_perseus', 'old_norse_models_cltk', 'old_norse_texts_heimskringla', 'old_norse_runic_transcriptions', 'old_norse_dictionary_zoega']

Zoëga’s dictionary

This dictionary was compiled in the last century. It contains Old Norse entries with definitions given in English. Each entry has possible POS tags for its headword along with its translations/meanings.
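
The dictionary data can be fetched like any other corpus, reusing the corpus_importer created above; the files should then be available under ~/cltk_data/old_norse/.

In[4]: corpus_importer.import_corpus("old_norse_dictionary_zoega")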

Stopword Filtering

To use the CLTK’s built-in stopword list, we take an example from Eiríks saga rauða:

In[1]: from nltk.tokenize.punkt import PunktLanguageVars

In[2]: from cltk.stop.old_norse.stops import STOPS_LIST

In[3]: sentence = 'Þat var einn morgin, er þeir Karlsefni sá fyrir ofan rjóðrit flekk nökkurn, sem glitraði við þeim'

In[4]: p = PunktLanguageVars()

In[5]: tokens = p.word_tokenize(sentence.lower())

In[6]: [w for w in tokens if not w in STOPS_LIST]

Out[6]: ['var',
'einn',
'morgin',
',',
'karlsefni',
'rjóðrit',
'flekk',
'nökkurn',
',',
'glitraði']

Swadesh

The corpus module has a class for generating a Swadesh list for Old Norse.

In[1]: from cltk.corpus.swadesh import Swadesh

In[2]: swadesh = Swadesh('old_norse')

In[3]: swadesh.words()[:10]

Out[3]: ['ek', 'þú', 'hann', 'vér', 'þér', 'þeir', 'sjá, þessi', 'sá', 'hér', 'þar']

Word Tokenizing

A very simple tokenizer is available for Old Norse. For now, it does not take into account specific Old Norse constructions like the merge of conjugated verbs with þú and with sik. Here is a sentence extracted from Gylfaginning in the Edda by Snorri Sturluson.

In[1]: from cltk.tokenize.word import WordTokenizer

In[2]: word_tokenizer = WordTokenizer('old_norse')

In[3]: sentence = "Gylfi konungr var maðr vitr ok fjölkunnigr."

In[4]: word_tokenizer.tokenize(sentence)

Out[4]: ['Gylfi', 'konungr', 'var', 'maðr', 'vitr', 'ok', 'fjölkunnigr', '.']

POS tagging

You can get the POS tags of Old Norse texts using the CLTK’s wrapper around the NLTK tagger. First, download the model by importing the old_norse_models_cltk corpus, as shown below. This TnT tagger was trained on annotated data from the Icelandic Parsed Historical Corpus (version 0.9, license: LGPL).
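
For example, the model can be fetched once with the corpus importer, following the same pattern as under Corpora above:

In[1]: from cltk.corpus.utils.importer import CorpusImporter

In[2]: corpus_importer = CorpusImporter("old_norse")

In[3]: corpus_importer.import_corpus("old_norse_models_cltk")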

TnT tagger

The following sentence is from the first verse of Völuspá (a poem describing the destiny of the gods of Asgard).

In[1]: from cltk.tag.pos import POSTag

In[2]: tagger = POSTag('old_norse')

In[3]: sent = 'Hlióðs bið ek allar.'

In[4]: tagger.tag_tnt(sent)

Out[4]: [('Hlióðs', 'Unk'),
('bið', 'VBPI'),
('ek', 'PRO-N'),
('allar', 'Q-A'),
('.', '.')]

Phonology transcription

A reconstructed pronunciation of Old Norse words is implemented according to phonological rules (available at Wikipedia - Old Norse orthography and in Altnordisches Elementarbuch by Friedrich Ranke and Dietrich Hofmann).

In[1]: from cltk.phonology.old_norse import transcription as ont

In[2]: from cltk.phonology import utils as ut

In[3]: sentence = "Gylfi konungr var maðr vitr ok fjölkunnigr"

In[4]: tr = ut.Transcriber(ont.DIPHTHONGS_IPA, ont.DIPHTHONGS_IPA_class, ont.IPA_class, ont.old_norse_rules)

In[5]: tr.main(sentence)

Out[5]: "[gylvi kɔnungr var maðr vitr ɔk fjœlkunːiɣr]"

Runes

The oldest runic inscriptions found date from around 200 AD. Runes have always denoted Germanic languages. Until the 8th century, the elder futhark alphabet was used. It comprised 24 characters: ᚠ, ᚢ, ᚦ, ᚨ, ᚱ, ᚲ, ᚷ, ᚹ, ᚺ, ᚾ, ᛁ, ᛃ, ᛇ, ᛈ, ᛉ, ᛊ, ᛏ, ᛒ, ᛖ, ᛗ, ᛚ, ᛜ, ᛟ, ᛞ. The word futhark comes from the first six characters of the alphabet: ᚠ (f), ᚢ (u), ᚦ (th), ᚨ (a), ᚱ (r), ᚲ (k). Later, this alphabet was reduced to 16 runes, the younger futhark (ᚠ, ᚢ, ᚦ, ᚭ, ᚱ, ᚴ, ᚼ, ᚾ, ᛁ, ᛅ, ᛋ, ᛏ, ᛒ, ᛖ, ᛘ, ᛚ, ᛦ), with more ambiguity in the sounds each rune could represent. The shapes of runes may vary according to the material they are carved on, which is why a variant of the younger futhark exists: ᚠ, ᚢ, ᚦ, ᚭ, ᚱ, ᚴ, ᚽ, ᚿ, ᛁ, ᛅ, ᛌ, ᛐ, ᛓ, ᛖ, ᛙ, ᛚ, ᛧ.

In[1]: from cltk.corpus.old_norse import runes

In[2]: " ".join(Rune.display_runes(ELDER_FUTHARK))

Out[2]: ᚠ ᚢ ᚦ ᚨ ᚱ ᚲ ᚷ ᚹ ᚺ ᚾ ᛁ ᛃ ᛇ ᛈ ᛉ ᛊ ᛏ ᛒ ᛖ ᛗ ᛚ ᛜ ᛟ ᛞ

In[3]: little_jelling_stone = "᛬ᚴᚢᚱᛘᛦ᛬ᚴᚢᚾᚢᚴᛦ᛬ᚴ(ᛅᚱ)ᚦᛁ᛬ᚴᚢᛒᛚ᛬ᚦᚢᛋᛁ᛬ᛅ(ᚠᛏ)᛬ᚦᚢᚱᚢᛁ᛬ᚴᚢᚾᚢ᛬ᛋᛁᚾᛅ᛬ᛏᛅᚾᛘᛅᚱᚴᛅᛦ᛬ᛒᚢᛏ᛬"

In[4]: runes.Transcriber.transcribe(little_jelling_stone, runes.YOUNGER_FUTHARK)

Out[4]: "᛫kurmR᛫kunukR᛫k(ar)þi᛫kubl᛫þusi᛫a(ft)᛫þurui᛫kunu᛫sina᛫tanmarkaR᛫but᛫"

Syllabification

For a language-dependent approach, you can load the predefined sonority hierarchy by setting the language parameter:

In[1]: from cltk.phonology.syllabify import Syllabifier

In[2]: s = Syllabifier(language='old_norse')

In[3]: s.syllabify("danmarkar")

Out[3]: ['dan', 'mar', 'kar']

Syllable length plays a great role in Old Norse poetry. To measure it, words must first be phonetically transcribed; this is why the "old_norse_ipa" language is used:

In[1]: import cltk.phonology.old_norse.transcription as ont

In[2]: from cltk.phonology.syllabify import Syllabifier

In[3]: syllabifier = Syllabifier(language="old_norse_ipa")

In[4]: word = [ont.a, ont.s, ont.g, ont.a, ont.r, ont.dh, ont.r]

In[5]: syllabified_word = syllabifier.syllabify_phonemes(word)

In[6]: [ont.measure_old_norse_syllable(syllable) for syllable in syllabified_word]

Out[6]: [<Length.short: 'short'>, <Length.long: 'long'>]

Old Norse prosody

Old Norse poetry is traditionally divided into skaldic poetry and eddic poetry.

Eddic poetry

Eddic poems are the poems of the Poetic Edda. Stanza, line, and verse are the three levels that characterize eddic poetry. The Poetic Edda is mainly composed in three poetic meters: fornyrðislag, ljóðaháttr, and málaháttr.

  • Fornyrðislag

A stanza of fornyrðislag has 8 short lines (or verses) forming 4 long lines; each long line consists of two short lines. The first verse of a long line usually alliterates with its second verse.

In[1]: text1 = "Hljóðs bið ek allar\nhelgar kindir,\nmeiri ok minni\nmögu Heimdallar;\nviltu at ek, Valföðr,\nvel fyr telja\nforn spjöll fira,\nþau er fremst of man."

In[2]: VerseManager.is_fornyrdhislag(text1)

Out[2]: True

In[3]: fo = Fornyrdhislag()

In[4]: fo.from_short_lines_text(text1)

In[5]: fo.short_lines

Out[5]: ['Hljóðs bið ek allar', 'helgar kindir,', 'meiri ok minni', 'mögu Heimdallar;', 'viltu at ek, Valföðr,', 'vel fyr telja', 'forn spjöll fira,', 'þau er fremst of man.']

In[6]: fo.long_lines

Out[6]: [['Hljóðs bið ek allar', 'helgar kindir,'], ['meiri ok minni', 'mögu Heimdallar;'], ['viltu at ek, Valföðr,', 'vel fyr telja'], ['forn spjöll fira,', 'þau er fremst of man.']]

In[7]: fo.syllabify()

In[8]: fo.syllabified_text

Out[8]: [[[[['hljóðs'], ['bið'], ['ek'], ['al', 'lar']]], [[['hel', 'gar'], ['kin', 'dir']]]], [[[['meir', 'i'], ['ok'], ['min', 'ni']]], [[['mög', 'u'], ['heim', 'dal', 'lar']]]], [[[['vil', 'tu'], ['at'], ['ek'], ['val', 'föðr']]], [[['vel'], ['fyr'], ['tel', 'ja']]]], [[[['forn'], ['spjöll'], ['fir', 'a']]], [[['þau'], ['er'], ['fremst'], ['of'], ['man']]]]]

In[9]: fo.to_phonetics()

In[10]: fo.transcribed_text

Out[10]: [[['[hljoːðs]', '[bið]', '[ɛk]', '[alːar]'], ['[hɛlɣar]', '[kindir]']], [['[mɛiri]', '[ɔk]', '[minːi]'], ['[mœɣu]', '[hɛimdalːar]']], [['[viltu]', '[at]', '[ɛk]', '[valvœðr]'], ['[vɛl]', '[fyr]', '[tɛlja]']], [['[fɔrn]', '[spjœlː]', '[fira]'], ['[θɒu]', '[ɛr]', '[frɛmst]', '[ɔv]', '[man]']]]

In[11]: fo.find_alliteration()

Out[11]: ([[('hljóðs', 'helgar')], [('meiri', 'mögu'), ('minni', 'mögu')], [], [('forn', 'fremst'), ('fira', 'fremst')]], [1, 2, 0, 2])
  • Ljóðaháttr

A stanza of ljóðaháttr has 6 short lines (or verses) forming 4 long lines. The first and the third long lines have two verses each, while the second and the fourth have only one (longer) verse. The first verse of the first and third lines alliterates with the second verse of those lines; the second and the fourth lines contain internal alliteration.

In[1]: text2 = "Deyr fé,\ndeyja frændr,\ndeyr sjalfr it sama,\nek veit einn,\nat aldrei deyr:\ndómr um dauðan hvern."

In[2]: VerseManager.is_ljoodhhaattr(text2)

Out[2]: True

In[3]: lj = Ljoodhhaatr()

In[4]: lj.from_short_lines_text(text2)

In[5]: lj.short_lines

Out[5]: ['Deyr fé,', 'deyja frændr,', 'deyr sjalfr it sama,', 'ek veit einn,', 'at aldrei deyr:', 'dómr um dauðan hvern.']

In[6]: lj.long_lines

Out[6]: [['Deyr fé,', 'deyja frændr,'], ['deyr sjalfr it sama,'], ['ek veit einn,', 'at aldrei deyr:'], ['dómr um dauðan hvern.']]

In[7]: lj.syllabify()

In[8]: lj.syllabified_text

Out[8]: [[[['deyr'], ['fé']], [['deyj', 'a'], ['frændr']]], [[['deyr'], ['sjalfr'], ['it'], ['sam', 'a']]], [[['ek'], ['veit'], ['einn']], [['at'], ['al', 'drei'], ['deyr']]], [[['dómr'], ['um'], ['dau', 'ðan'], ['hvern']]]]

In[9]: lj.to_phonetics()

In[10]: lj.transcribed_text

Out[10]: [[['[dɐyr]', '[feː]'], ['[dɐyja]', '[frɛːndr]']], [['[dɐyr]', '[sjalvr]', '[it]', '[sama]']], [['[ɛk]', '[vɛit]', '[ɛinː]'], ['[at]', '[aldrɛi]', '[dɐyr]']], [['[doːmr]', '[um]', '[dɒuðan]', '[hvɛrn]']]]

In[11]: verse_alliterations, n_alliterations_lines = lj.find_alliteration()

In[12]: verse_alliterations

Out[12]: [[('deyr', 'deyja'), ('fé', 'frændr')], [('sjalfr', 'sjalfr')], [('einn', 'aldrei')], [('dómr', 'um')]]

In[13]: n_alliterations_lines

Out[13]: [2, 1, 1, 1]
  • Málaháttr

Málaháttr is very similar to ljóðaháttr, except that verses are longer. No special code has been written for this.

Skaldic poetry

Dróttkvætt and hrynhenda are examples of skaldic poetic meters.

Old Norse pronouns declension

Old Norse, like other ancient Germanic languages, is highly inflected. With the declension module, you can retrieve the declined forms of pronouns that are already stored.

In[1]: from cltk.declension import utils as decl_utils

In[2]: from cltk.declension.old_norse import pronouns

In[3]: pro_demonstrative_pronouns_this = decl_utils.Pronoun("demonstrative pronouns this")

In[4]: demonstrative_pronouns_this = [[["þessi", "þenna", "þessum", "þessa"], ["þessir", "þessa", "þessum", "þessa"]], [["þessi", "þessa", "þessi", "þessar"], ["þessar", "þessar", "þessum", "þessa"]], [["þetta", "þetta", "þessu", "þessa"], ["þessi", "þessi", "þessum", "þessa"]]]

In[5]: pro_demonstrative_pronouns_this.set_declension(demonstrative_pronouns_this)

In[6]: pro_demonstrative_pronouns_this.get_declined(decl_utils.Case.accusative, decl_utils.Number.singular, decl_utils.Gender.feminine)

Out[6]: 'þessa'

Old Norse noun declension

Old Norse nouns vary according to case (nominative, accusative, dative, genitive), gender (masculine, feminine, neuter) and number (singular, plural). Nouns are considered either weak or strong. Weak nouns have a simpler declension than strong ones.

If you want a simple way to define the inflection of an Old Norse noun, you can do as follows:

In[1]: from cltk.inflection.utils import Noun, Gender

In[2]: sumar = [["sumar", "sumar", "sumri", "sumars"], ["sumur", "sumur", "sumrum", "sumra"]]

In[3]: noun_sumar = Noun("sumar", Gender.neuter)

In[4]: noun_sumar.set_declension(sumar)

To decline a noun when you know its nominative singular, genitive singular, and nominative plural forms, you can use the functions in the table below (a usage sketch follows the table).

          masculine                       feminine                        neuter
strong    decline_strong_masculine_noun   decline_strong_feminine_noun    decline_strong_neuter_noun
weak      decline_weak_masculine_noun     decline_weak_feminine_noun      decline_weak_neuter_noun
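
A minimal sketch of calling one of these helpers; the module path cltk.inflection.old_norse.nouns and the argument order (nominative singular, genitive singular, nominative plural) are assumptions made for illustration, so check the module in your installed version:

In[5]: from cltk.inflection.old_norse.nouns import decline_strong_masculine_noun  # assumed module path

In[6]: decline_strong_masculine_noun("armr", "arms", "armar")  # "arm": nom. sg., gen. sg., nom. pl.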

Old Norse verb conjugation

Old Norse verbs vary according to:

  • person (first, second, third),
  • number (singular, plural),
  • tense (past, present),
  • voice (active and medio-passive),
  • mood (indicative, subjunctive, imperative, infinitive, past participle and present participle).

They may be classified into three categories:

  • strong verbs, they form their past root with a stem vowel change,
  • weak verbs, they form their past root by adding a dental consonant,
  • preterito-present verbs, their present conjugates like verbs in past but have present meanings.

Two examples are given below: one strong verb and one weak verb.

In[1]: from cltk.inflection.old_norse.verbs import StrongOldNorseVerb

In[2]: lita = StrongOldNorseVerb()

In[3]: lita.set_canonic_forms(["líta", "lítr", "leit", "litu", "litinn"])

In[4]: lita.subclass
Out[4]: 1

In[5]: lita.present_active()
Out[5]: ['lít', 'lítr', 'lítr', 'lítum', 'lítið', 'líta']

In[6]: lita.past_active()
Out[6]: ['leit', 'leizt', 'leit', 'litum', 'lituð', 'litu']

In[7]: lita.present_active_subjunctive()
Out[7]: ['líta', 'lítir', 'líti', 'lítim', 'lítið', 'líti']

In[8]: lita.past_active_subjunctive()
Out[8]: ['lita', 'litir', 'liti', 'litim', 'litið', 'liti']

In[9]: lita.past_participle()
Out[9]: [['litinn', 'litinn', 'litnum', 'litins', 'litnir', 'litna', 'litnum', 'litinna'], ['litin', 'litna', 'litinni', 'litinnar', 'litnar', 'litnar', 'litnum', 'litinna'], ['litit', 'litit', 'litnu', 'litins', 'litit', 'litit', 'litnum', 'litinna']]
In[1]: from cltk.inflection.old_norse.verbs import WeakOldNorseVerb

In[2]: kalla = WeakOldNorseVerb()

In[3]: kalla.set_canonic_forms(["kalla", "kallaði", "kallaðinn"])

In[4]: kalla.subclass
Out[4]: 1

In[5]: kalla.present_active()
Out[5]: ['kalla', 'kallar', 'kallar', 'köllum', 'kallið', 'kalla']

In[6]: kalla.past_active()
Out[6]: ['kallaða', 'kallaðir', 'kallaði', 'kölluðum', 'kölluðuð', 'kölluðu']

In[7]: kalla.present_active_subjunctive()
Out[7]: ['kalla', 'kallir', 'kalli', 'kallim', 'kallið', 'kalli']

In[8]: kalla.past_active_subjunctive()
Out[8]: ['kallaða', 'kallaðir', 'kallaði', 'kallaðim', 'kallaðið', 'kallaði']

In[9]: kalla.past_participle()
Out[9]: [['kallaðr', 'kallaðan', 'kölluðum', 'kallaðs', 'kallaðir', 'kallaða', 'kölluðum', 'kallaðra'], ['kölluð', 'kallaða', 'kallaðri', 'kallaðrar', 'kallaðar', 'kallaðar', 'kölluðum', 'kallaðra'], ['kallatt', 'kallatt', 'kölluðu', 'kallaðs', 'kölluð', 'kölluð', 'kölluðum', 'kallaðra']]

Odia

Odia is an Eastern Indo-Aryan language belonging to the Indo-Aryan language family. It is thought to be directly descended from an Odra-Magadhi Prakrit similar to Ardha Magadhi, which was spoken in eastern India over 1,500 years ago and is the primary language used in early Jain texts. Odia appears to have had relatively little influence from Persian and Arabic compared to other major North Indian languages. It is mainly spoken in the Indian state of Odisha and in parts of West Bengal, Jharkhand, Chhattisgarh, and Andhra Pradesh. (Source: Wikipedia)

Alphabet

The Odia alphabet and digits are placed in cltk/corpus/odia/alphabet.py.

The digits are placed in a list NUMERALS with the digit the same as the list index (0-9). For example, the Odia digit for 4 can be accessed in this manner:

In [1]: from cltk.corpus.odia.alphabet import NUMERALS
In [2]: NUMERALS[4]
Out[2]: '୪'

The vowels are placed in a list VOWELS and can be accessed in this manner:

In [1]: from cltk.corpus.odia.alphabet import VOWELS
In [2]: VOWELS
Out[2]: ['ଅ', 'ଆ', 'ଇ', 'ଈ', 'ଉ', 'ଊ', 'ଋ', 'ୠ', 'ଌ', 'ୡ', 'ଏ', 'ଐ', 'ଓ', 'ଔ']

The rest of the alphabet is in UNSTRUCTURED_CONSONANTS and STRUCTURED_CONSONANTS, which can be accessed in a similar way.

Ottoman

Ottoman Turkish, or the Ottoman language, is the variety of the Turkish language that was used in the Ottoman Empire. Ottoman Turkish was highly influenced by Arabic and Persian. Arabic and Persian words in the language accounted for up to 88% of its vocabulary. As in most other Turkic and other foreign languages of Islamic communities, the Arabic borrowings were not originally the result of a direct exposure of Ottoman Turkish to Arabic, a fact that is evidenced by the typically Persian phonological mutation of the words of Arabic origin. (Source: Wikipedia)

Alphabet

The Ottoman digits and alphabet are placed in cltk/corpus/ottoman/alphabet.py.

The digits are placed in a dict NUMERALS with the digit the same as the key (0-9). There is also a dictionary named NUMERALS_WRITINGS for how they are written out. For example, the Ottoman digit for 5 can be accessed in this manner:

In [1]: from cltk.corpus.ottoman.alphabet import NUMERALS, NUMERALS_WRITINGS
In [2]: NUMERALS[5]
Out[2]: '۵'
In [3]: NUMERALS_WRITINGS[5]
Out[3]: 'بش'

One can also get the alphabetical order of the characters from the ALPHABETIC_ORDER dictionary. The keys are the characters and the values are their order. The corresponding dictionary can be imported:

In [1]: from cltk.corpus.ottoman.alphabet import ALPHABETIC_ORDER, CIM
In [2]: ALPHABETIC_ORDER[CIM]
Out[2]: 6

Pali

Pali is a Prakrit language native to the Indian subcontinent which flourished between 5th and 1st century BC, now only used as a liturgical language. It is widely studied because it is the language of much of the earliest extant literature of Buddhism as collected in the Pāli Canon or Tipiṭaka and is the sacred language of Theravāda Buddhism. (Source: Wikipedia)

Alphabet

In [1]: from cltk.corpus.pali.alphabet import CONSONANTS, DEPENDENT_VOWELS, INDEPENDENT_VOWELS

In [2]: print(CONSONANTS)
['ක', 'ඛ', 'ග', 'ඝ', 'ඞ', 'ච', 'ඡ', 'ජ', 'ඣ', 'ඤ', 'ට', 'ඨ', 'ඩ', 'ඪ', 'ණ', 'ත', 'ථ', 'ද', 'ධ', 'න', 'ප', 'ඵ', 'බ', 'භ', 'ම', 'ය', 'ර', 'ල', 'ව', 'ස', 'හ', 'ළ', 'අං']

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with pali_) to discover available Pali corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('pali')

In [3]: c.list_corpora
Out[3]: ['pali_text_ptr_tipitaka']

Persian

Persian is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. The Old Persian language is one of the two directly attested Old Iranian languages (the other being Avestan). Old Persian appears primarily in the inscriptions, clay tablets, and seals of the Achaemenid era (c. 600 BCE to 300 BCE). Examples of Old Persian have been found in what is now Iran, Romania (Gherla), Armenia, Bahrain, Iraq, Turkey and Egypt, the most important attestation by far being the contents of the Behistun Inscription (dated to 525 BCE). Avestan is one of the Eastern Iranian languages within the Indo-European language family known only from its use as the language of Zoroastrian scripture, i.e. the Avesta. (Source: Wikipedia)

Alphabet

The Persian digits and alphabet are placed in cltk/corpus/persian/alphabet.py.

The digits are placed in a dict NUMERALS, with the digit as the key (0-9). There is also a dictionary named NUMERALS_WRITINGS containing their written forms. For example, the Persian digit for 5 can be accessed in this manner:

In [1]: from cltk.corpus.persian.alphabet import NUMERALS, NUMERALS_WRITINGS
In [2]: NUMERALS[5]
Out[2]: '۵'
In [3]: NUMERALS_WRITINGS[5]
Out[3]: 'پنج'

The alphabetical order of the characters is available from the ALPHABETIC_ORDER dictionary, whose keys are the characters and whose values are their positions in the alphabet. The dictionary can be imported as follows:

In [1]: from cltk.corpus.persian.alphabet import ALPHABETIC_ORDER, JIM
In [2]: ALPHABETIC_ORDER[JIM]
Out[2]: 6

Phonology

The aim of phonological/phonetic reconstruction of ancient words is to provide a probable and realistic pronunciation of past languages. See the language-specific pages for phonological/phonetic support.

Old Portuguese

Galician-Portuguese, also known as Old Portuguese or Medieval Galician, was a West Iberian Romance language spoken in the Middle Ages, in the northwest area of the Iberian Peninsula. Alternatively, it can be considered a historical period of the Galician and Portuguese languages. The language was used for literary purposes from the final years of the 12th century to roughly the middle of the 14th century in what are now Spain and Portugal and was, almost without exception, the only language used for the composition of lyric poetry. (Source: Wikipedia)

Swadesh

The corpus module has a class for generating a Swadesh list for Old Portuguese.

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('pt_old')

In [3]: swadesh.words()[:10]
Out[3]: ['eu', 'tu', 'ele', 'nos', 'vos', 'eles', 'esto, aquesto', 'aquelo', 'aqui', 'ali']

Prakrit

About

A Prakrit is any of several Middle Indo-Aryan languages. The Ardhamagadhi (“half-Magadhi”) Prakrit, which was used extensively to write the scriptures of Jainism, is often considered to be the definitive form of Prakrit, while others are considered variants thereof. Pali, the Prakrit used in Theravada Buddhism, tends to be treated as a special exception from the variants of the Ardhamagadhi language, as Classical Sanskrit grammars do not consider it as a Prakrit per se, presumably for sectarian rather than linguistic reasons. Other Prakrits are reported in old historical sources but are not attested, such as Paiśācī. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with prakrit_) to discover available Prakrit corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('prakrit')

In [3]: c.list_corpora
Out[3]: ['prakrit_texts_gretil']

Punjabi

Punjabi is an Indo-Aryan language native to the Punjabi people, who inhabit the historical Punjab region of Pakistan and India. Punjabi developed from Sanskrit through Prakrit and later Apabhraṃśa; it emerged as an Apabhramsha, a degenerated form of Prakrit, in the 7th century A.D. and became stable by the 10th century. By the 10th century, many Nath poets were associated with earlier Punjabi works. Arabic and Persian influence in the historical Punjab region began with the late first-millennium Muslim conquests on the Indian subcontinent. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with punjabi_) to discover available Punjabi corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('punjabi')
In [3]: c.list_corpora
Out[3]:
['punjabi_text_gurban']

Now, from the list of available corpora, import any one you like.
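
For example, to download punjabi_text_gurban with the importer created above (CorpusImporter saves corpora under ~/cltk_data by default):

In [4]: c.import_corpus('punjabi_text_gurban')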

Alphabet

Punjabi is written in two scripts: Gurmukhi and Shahmukhi. Gurmukhi has its origins in Brahmi, while Shahmukhi is a Perso-Arabic script.

The Punjabi digits, vowels, consonants, and symbols for both are placed in cltk/corpus/punjabi/alphabet.py. Look there for more information about the language’s phonology.

For example, to use Punjabi’s independent vowels in each script:

In [1]: from cltk.corpus.punjabi.alphabet import INDEPENDENT_VOWELS_GURMUKHI

In [2]: from cltk.corpus.punjabi.alphabet import INDEPENDENT_VOWELS_SHAHMUKHI

In [3]: INDEPENDENT_VOWELS_GURMUKHI
Out[3]: ['ਆ', 'ਇ', 'ਈ', 'ਉ', 'ਊ', 'ਏ', 'ਐ', 'ਓ', 'ਔ']

In [4]: INDEPENDENT_VOWELS_SHAHMUKHI
Out[4]: ['ا', 'و', 'ی', 'ے']

Similarly there are lists for DIGITS, DEPENDENT_VOWELS, CONSONANTS, BINDI_CONSONANTS (nasal pronunciation) and some OTHER_SYMBOLS (mostly for pronunciation).

Numerifier

These functions convert Western (English) numerals into Punjabi numerals and vice versa.

In [1]: from cltk.corpus.punjabi.numerifier import punToEnglish_number

In [2]: from cltk.corpus.punjabi.numerifier import englishToPun_number

In [3]: c = punToEnglish_number('੧੨੩੪੫੬੭੮੯੦')

In [4]: print(c)
1234567890

In [5]: c = englishToPun_number(1234567890)

In [6]: print(c)
੧੨੩੪੫੬੭੮੯੦

Stopword Filtering

To use the CLTK’s built-in stopwords list:

In [1]: from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex

In [2]: from cltk.stop.punjabi.stops import STOPS_LIST

In [3]: sample = "ਪੰਜਾਬੀ ਪੰਜਾਬ ਦੀ ਮੁਖੱ ਬੋੋਲਣ ਜਾਣ ਵਾਲੀ ਭਾਸ਼ਾ ਹੈ।"

In [4]: tokens = indian_punctuation_tokenize_regex(sample)

In [5]: print(tokens)
['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਦੀ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਵਾਲੀ', 'ਭਾਸ਼ਾ', 'ਹੈ', '।']

In [6]: no_stops = [w for w in tokens if w not in STOPS_LIST]

In [7]: print(no_stops)
['ਪੰਜਾਬੀ', 'ਪੰਜਾਬ', 'ਮੁਖੱ', 'ਬੋੋਲਣ', 'ਜਾਣ', 'ਭਾਸ਼ਾ', '।']

Sanskrit

Sanskrit is the primary liturgical language of Hinduism, a philosophical language of Hinduism, Jainism, Buddhism and Sikhism, and a literary language of ancient and medieval South Asia that also served as a lingua franca. It is a standardised dialect of Old Indo-Aryan, originating as Vedic Sanskrit and tracing its linguistic ancestry back to Proto-Indo-Iranian and Proto-Indo-European. As one of the oldest Indo-European languages for which substantial written documentation exists, Sanskrit holds a prominent position in Indo-European studies. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with sanskrit_) to discover available Sanskrit corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('sanskrit')

In [3]: c.list_corpora
Out[3]:
['sanskrit_text_jnu', 'sanskrit_text_dcs', 'sanskrit_parallel_sacred_texts', 'sanskrit_text_sacred_texts', 'sanskrit_parallel_gitasupersite', 'sanskrit_text_gitasupersite', 'sanskrit_text_wikipedia', 'sanskrit_text_sanskrit_documents']

Transliterator

This tool has been derived from the IndicNLP Project, courtesy of anoopkunchukuttan. It transliterates ITRANS text into the Devanagari (Unicode) script, and it can also romanize Devanagari text.

Script Conversion

Convert from one Indic script to another. This is a simple conversion that exploits the fact that the Unicode code points of the various Indic scripts lie at corresponding offsets from each script's base code point.

In [1]: from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator

In [2]: input_text=u'राजस्थान'

In [3]: UnicodeIndicTransliterator.transliterate(input_text,"hi","pa")
Out[3]: 'ਰਾਜਸ੍ਥਾਨ'
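
A quick check in a plain Python session (independent of the CLTK) shows the offset principle at work: the Devanagari block starts at U+0900 and the Gurmukhi block at U+0A00, and corresponding letters sit at the same offset within their blocks.

>>> hex(ord('र') - 0x0900)  # Devanagari letter RA
'0x30'
>>> hex(ord('ਰ') - 0x0A00)  # Gurmukhi letter RA: same offset
'0x30'
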
Romanization

Convert Indic script text to Roman text in the ITRANS notation:

In [4]: from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator

In [5]: input_text=u'राजस्थान'

In [6]: lang='hi'

In [7]: ItransTransliterator.to_itrans(input_text,lang)
Out[7]: 'rAjasthAna'

Indicization (ITRANS to Indic Script)

Conversion of an ITRANS transliteration to the Devanagari (Unicode) script:

In [8]: from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator

In [9]: input_text=u'pitL^In'

In [10]: lang='hi'

In [11]: x=ItransTransliterator.from_itrans(input_text,lang)

In [12]: x
Out[12]: 'पितॣन्'

Query Script Information

Indic scripts have been designed with phonetic principles in mind, and the design and organization of the scripts make it easy to obtain phonetic information about the characters.

In [13]: from cltk.corpus.sanskrit.itrans.langinfo import *

In [14]: c = 'क'

In [15]: lang='hi'

In [16]: is_vowel(c,lang)
Out[16]: False

In [17]: is_consonant(c,lang)
Out[17]: True

In [18]: is_velar(c,lang)
Out[18]: True

In [19]: is_palatal(c,lang)
Out[19]: False

In [20]: is_aspirated(c,lang)
Out[20]: False

In [21]: is_unvoiced(c,lang)
Out[21]: True

In [22]: is_nasal(c,lang)
Out[22]: False

Other similar functions in the module can be listed with dir():

In [23]: import cltk.corpus.sanskrit.itrans.langinfo

In [24]: dir(cltk.corpus.sanskrit.itrans.langinfo)
Out[24]:
['APPROXIMANT_LIST', 'ASPIRATED_LIST', 'AUM_OFFSET', 'COORDINATED_RANGE_END_INCLUSIVE', 'COORDINATED_RANGE_START_INCLUSIVE', 'DANDA', 'DENTAL_RANGE', 'DOUBLE_DANDA', 'FRICATIVE_LIST', 'HALANTA_OFFSET', 'LABIAL_RANGE', 'LC_TA', 'NASAL_LIST', 'NUKTA_OFFSET', 'NUMERIC_OFFSET_END', 'NUMERIC_OFFSET_START', 'PALATAL_RANGE', 'RETROFLEX_RANGE', 'RUPEE_SIGN', 'SCRIPT_RANGES', 'UNASPIRATED_LIST', 'UNVOICED_LIST', 'URDU_RANGES', 'VELAR_RANGE', 'VOICED_LIST', '__author__', '__builtins__', '__cached__', '__doc__', '__file__', '__license__', '__loader__', '__name__', '__package__', '__spec__', 'get_offset', 'in_coordinated_range', 'is_approximant', 'is_aspirated', 'is_aum', 'is_consonant', 'is_dental', 'is_fricative', 'is_halanta', 'is_indiclang_char', 'is_labial', 'is_nasal', 'is_nukta', 'is_number', 'is_palatal', 'is_retroflex', 'is_unaspirated', 'is_unvoiced', 'is_velar', 'is_voiced', 'is_vowel', 'is_vowel_sign', 'offset_to_char']

Swadesh

The corpus module has a class for generating a Swadesh list for Sanskrit.

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('sa')

In [3]: swadesh.words()[:10]
Out[3]: ['अहम्', 'त्वम्', 'स', 'वयम्, नस्', 'यूयम्, वस्', 'ते', 'इदम्', 'तत्', 'अत्र', 'तत्र']

Syllabifier

This tool has also been derived from the IndicNLP Project, courtesy of anoopkunchukuttan. It breaks a word into its syllables and can be applied to 17 Indian languages written in Unicode scripts such as Devanagari.

In [25]: from cltk.stem.sanskrit.indian_syllabifier import Syllabifier

In [26]: input_text = 'नमस्ते'

In [27]: syllabifier = Syllabifier('hindi')

In [28]: syllabifier.orthographic_syllabify(input_text)
Out[28]: ['न', 'म', 'स्ते']

Tokenizer

This tool has also been derived from the IndicNLP Project, courtesy of anoopkunchukuttan. It breaks a sentence into its constituent words by filtering out punctuation and spaces.

In [29]: from cltk.tokenize.sentence import TokenizeSentence

In [30]: tokenizer = TokenizeSentence('sanskrit')

In [31]: input_text = "हिन्दी भारत की सबसे अधिक बोली और समझी जाने वाली भाषा है"

In [32]: tokenizer.tokenize(input_text)
Out[32]: ['हिन्दी', 'भारत', 'की', 'सबसे', 'अधिक', 'बोली', 'और', 'समझी', 'जाने', 'वाली', 'भाषा', 'है']

Stopword Filtering

To use the CLTK’s built-in stopwords list:

In [1]: from cltk.stop.sanskrit.stops import STOPS_LIST

In [2]: from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex

In [3]: s = "हमने पिछले पाठ मे सीखा था कि “अहम् गच्छामि” का मतलब “मै जाता हूँ” है। आप ऊपर की तालिकाँओ "

In [4]: tokens = indian_punctuation_tokenize_regex(s)

In [5]: len(tokens)
Out[5]: 20

In [6]: no_stops = [w for w in tokens if w not in STOPS_LIST]

In [7]: len(no_stops)
Out[7]: 18

In [8]: no_stops
Out[8]:
['हमने',
 'पिछले',
 'पाठ',
 'सीखा',
 'था',
 'कि',
 '“अहम्',
 'गच्छामि”',
 'मतलब',
 '“मै',
 'जाता',
 'हूँ”',
 'है',
 '।',
 'आप',
 'ऊपर',
 'की',
 'तालिकाँओ']

Old Swedish

Old Swedish (fornsvenska) is the medieval stage of the Swedish language, spoken in Sweden roughly between 1225 and 1526. (Source: Wikipedia)

Phonological transcription

According to phonological rules, a reconstructed phonology/pronunciation of Old Swedish words is implemented.

In [1]: from cltk.phonology.old_swedish import transcription as old_swedish

In [2]: from cltk.phonology import utils as ut

In [3]: sentence = "Far man kunu oc dör han för en hun far barn. oc sigher hun oc hænnæ frændær."

In [4]: tr = ut.Transcriber(old_swedish.DIPHTHONGS_IPA, old_swedish.DIPHTHONGS_IPA_class, old_swedish.IPA_class, old_swedish.old_swedish_rules)

In [5]: tr.main(sentence)
Out[5]: "[far man kunu ok dør han før ɛn hun far barn ok siɣɛr hun ok hɛnːɛ frɛndɛr]"

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with old_swedish_) to discover available Old Swedish corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter("old_swedish")

In [3]: corpus_importer.list_corpora

Out[3]: ['old_swedish_texts']

Tamil

Tamil is a Dravidian language predominantly spoken by the Tamil people of India and Sri Lanka. It is one of the longest-surviving classical languages in the world. A recorded Tamil literature has been documented for over 2000 years. The earliest period of Tamil literature, Sangam literature, is dated from ca. 300 BC – AD 300. It has the oldest extant literature among Dravidian languages. The earliest epigraphic records found on rock edicts and hero stones date from around the 3rd century BC. More than 55% of the epigraphical inscriptions (about 55,000) found by the Archaeological Survey of India are in the Tamil language. Tamil language inscriptions written in Brahmi script have been discovered in Sri Lanka, and on trade goods in Thailand and Egypt. (Source: Wikipedia)

Alphabet

In [1]: from cltk.corpus.tamil.alphabet import VOWELS, CONSONANTS, GRANTHA_CONSONANTS

In [2]: print(VOWELS)
Out[2]: ['அ', 'ஆ', 'இ', 'ஈ', 'உ', 'ஊ', 'எ', 'ஏ', 'ஐ', 'ஒ', 'ஓ', 'ஔ']

In [3]: print(CONSONANTS)
Out[3]: ['க்', 'ங்', 'ச்', 'ஞ்', 'ட்', 'ண்', 'த்', 'ந்', 'ப்', 'ம்', 'ய்', 'ர்', 'ல்', 'வ்', 'ழ்', 'ள்', 'ற்', 'ன்']

In [4]: print(GRANTHA_CONSONANTS)
Out[4]: ['ஜ்', 'ஶ்', 'ஷ்', 'ஸ்', 'ஹ்', 'க்ஷ்']

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with tamil_) to discover available tamil corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('tamil')

In [3]: c.list_corpora
Out[3]: ['tamil_text_ptr_tipitaka']

Telugu

Telugu is a Dravidian language native to India. Inscriptions with Telugu words dating back to 400 BC to 100 BC have been discovered in Bhattiprolu in the Guntur district of Andhra Pradesh. Telugu literature can be traced back to the early 11th century period when Mahabharata was first translated to Telugu from Sanskrit by Nannaya. It flourished under the rule of the Vijayanagar empire, where Telugu was one of the empire’s official languages. (Source: Wikipedia)

Alphabet

The Telugu alphabet and digits are placed in cltk/corpus/telugu/alphabet.py.

The digits are placed in a list NUMERALS, with each digit at the corresponding list index (0-9). For example, the Telugu digit for 4 can be accessed in this manner:

In [1]: from cltk.corpus.telugu.alphabet import NUMERALS
In [2]: NUMERALS[4]
Out[2]: '౪'

The vowels are placed in a list VOWELS and can be accessed in this manner:

In [1]: from cltk.corpus.telugu.alphabet import VOWELS
In [2]: VOWELS
Out[2]: ['అ ', 'ఆ', 'ఇ', 'ఈ ', 'ఉ ', 'ఊ ', 'ఋ  ', 'ౠ  ', 'ఌ ', 'ౡ', 'ఎ', 'ఏ', 'ఐ', 'ఒ', 'ఓ', 'ఔ ', 'అం', 'అః']

The rest of the alphabet consists of CONSONANTS, which can be accessed in a similar way.
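
For example, a minimal check (the value shown is illustrative and assumes the list begins with 'క', the first consonant of the Telugu alphabet; consult alphabet.py for the authoritative contents):

In [1]: from cltk.corpus.telugu.alphabet import CONSONANTS

In [2]: CONSONANTS[0]
Out[2]: 'క'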

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with telugu_) to discover available Telugu corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('telugu')

In [3]: c.list_corpora
Out[3]:
['telugu_text_wikisource']

Tokenizer

This tool breaks a sentence into its smaller constituents, i.e. into words.

In [1]: from cltk.tokenize.sentence import TokenizeSentence

In [2]: tokenizer = TokenizeSentence('telugu')

In [3]: sentence = "క్లేశభూర్యల్పసారాణి కర్మాణి విఫలాని వా దేహినాం విషయార్తానాం న తథైవార్పితం త్వయి"

In [4]: telugu_text_tokenize = tokenizer.tokenize(sentence)

In [5]: telugu_text_tokenize
Out[5]:
['క్లేశభూర్యల్పసారాణి',
 'కర్మాణి',
 'విఫలాని',
 'వా',
 'దేహినాం',
 'విషయార్తానాం',
 'న',
 'తథైవార్పితం',
 'త్వయి']

Tibetan

Classical Tibetan refers to the language of any text written in Tibetic after the Old Tibetan period; though it extends from the 7th century until the modern day, it particularly refers to the language of early canonical texts translated from other languages, especially Sanskrit. In 816, during the reign of King Sadnalegs, literary Tibetan underwent a thorough reform aimed at standardizing the language and vocabulary of the translations being made from Indian texts, which resulted in what is now called Classical Tibetan. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub repository (anything beginning with tibetan_) to discover available Tibetan corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('tibetan')

In [3]: c.list_corpora
Out[3]: ['tibetan_pos_tdc', 'tibetan_lexica_tdc']

Tocharian B

Tocharian, also spelled Tokharian, is an extinct branch of the Indo-European language family. It is known from manuscripts dating from the 6th to the 8th century AD, which were found in oasis cities on the northern edge of the Tarim Basin (now part of Xinjiang in northwest China). The documents record two closely related languages, called Tocharian A (“East Tocharian”, Agnean or Turfanian) and Tocharian B (“West Tocharian” or Kuchean). The subject matter of the texts suggests that Tocharian A was more archaic and used as a Buddhist liturgical language, while Tocharian B was more actively spoken in the entire area from Turfan in the east to Tumshuq in the west. Tocharian A is found only in the eastern part of the Tocharian-speaking area, and all extant texts are of a religious nature. Tocharian B, however, is found throughout the range and in both religious and secular texts. (Source: Wikipedia)

Swadesh

The corpus module has a class for generating a Swadesh list for Tocharian B.

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('txb')

In [3]: swadesh.words()[:10]
Out[3]: ['ñäś', 'tuwe', 'su', 'wes', 'yes', 'cey', 'se', 'su, samp', 'tane', 'tane, omp']

For interactive tutorials, in the form of Jupyter Notebooks, see https://github.com/cltk/tutorials.

Urdu

Alphabet

The Urdu alphabet and digits are placed in cltk/corpus/urdu/alphabet.py.

The digits are placed in a list DIGITS, with each digit at the corresponding list index (0-9). For example, the Urdu digit for 4 can be accessed in this manner:

In [1]: from cltk.corpus.urdu.alphabet import DIGITS
In [2]: DIGITS[4]
Out[2]: '٤'

Urdu has three SHORT_VOWELS that are essentially diacritics used in the script. It also has four LONG_VOWELS that are actual letters of the alphabet. The corresponding lists can be imported:

In [1]: from cltk.corpus.urdu.alphabet import SHORT_VOWELS
In [2]: SHORT_VOWELS
Out[2]: ['َ', 'ِ', 'ُ']
In [3]: from cltk.corpus.urdu.alphabet import LONG_VOWELS
In [4]: LONG_VOWELS
Out[4]: ['ا', 'و', 'ی', 'ے']

The rest of the alphabet consists of CONSONANTS, which can be accessed in a similar way.

There are three SPECIAL characters that are ligatures or alternative orthographic shapes of letters of the alphabet.

In [1]: from cltk.corpus.urdu.alphabet import SPECIAL
In [2]: SPECIAL
Out[2]: ['ﺁ', 'ۀ', 'ﻻ']