wivu2laf 1.0.1¶
Submodules¶
wivu2laf.wivu2laf module¶
wivu2laf.config module¶
wivu2laf.laf module¶
- class Laf(cfg, wv, val)[source]¶
Knows the LAF data format.
All LAF knowledge is stored in template files together with sections in the main configuration file. The LAF class finds those templates, sets up the result files, and fills them.
- Note:
Templates
- template[key] = text
- where key is an entry in the laf_templates section of the main config file.
- Note:
Files and Filetypes
annotation_files[part][subpart] = (ftype, medium, location, requires, annotations, is_region)
The order is important, so we generate a list too:
- file_order
- list of ftypes according to the file_types section in the main config file, expanded, in the order encountered
where
- ftype
comes from the file_types section in the main config file. It has the shape of a LAF file identifier, but with wildcards.
- f.xxxxxx
- not an annotation file, but primary data or a header file
- f_part.subpart
- annotation file for part, subpart
- for each ftype
there is an infostring consisting of fields
- location
- file name of corresponding file, modulo a common prefix
- medium
- file type (text or xml)
- annotations
- space separated annotation labels occurring in this part, subpart
- requires
- space separated list of ftypes of required files
- is_region
- reveals whether the file only contains regions or not. A pure region file needs a different template.
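The structures described above might be built along the following lines. This is a minimal sketch, not the module's actual code: the `FILE_TYPES` sample, the `;`-separated layout of the infostring, and the helper `parse_infostring` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical stand-in for the file_types section of the main config file.
# The field order and the ';' separator of the infostring are assumptions.
FILE_TYPES = {
    "f.primary": "txt;text;;;no",
    "f_monad.words": "monad.words.xml;xml;f.primary;word;no",
    "f_monad.regions": "monad.regions.xml;xml;f.primary;;yes",
}

def parse_infostring(ftype, info):
    """Split an infostring into the tuple stored in annotation_files."""
    location, medium, requires, annotations, is_region = info.split(";")
    return (
        ftype,
        medium,
        location,
        requires.split(),      # space separated list of required ftypes
        annotations.split(),   # space separated annotation labels
        is_region == "yes",
    )

annotation_files = defaultdict(dict)
file_order = []

for ftype, info in FILE_TYPES.items():
    # 'f_part.subpart' names an annotation file;
    # 'f.xxxxxx' is primary data or a header file and lands in part ''.
    prefix, _, subpart = ftype.partition(".")
    part = prefix[2:] if prefix.startswith("f_") else ""
    annotation_files[part][subpart] = parse_infostring(ftype, info)
    file_order.append(ftype)
```

With this sample, `annotation_files["monad"]["regions"]` holds a tuple whose last element is `True`, which is how a pure region file would be told apart when choosing a template.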
- Note:
Header Generation
All header files are generated here:
- the feature declaration file
- the header for the resource as a whole
- the header for the primary data file
The headers of the annotation files are included in those files. Those headers contain statistics: counts of the number of annotations with a given label. We know those numbers only after generation, because the statistics are collected during further processing.
When the annotation files are generated, we use placeholders for the statistics. In a post-generation stage we read and rewrite the annotation files, replacing the placeholders with the true numbers. The files are rewritten in situ, so we must take care that the placeholders have enough space around them to hold the true numbers.
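The in situ replacement can be illustrated as follows. This is only a sketch under stated assumptions: the element name `stat`, its attributes, the placeholder character and the reserved width are all hypothetical and are not taken from the actual templates.

```python
import os
import tempfile

WIDTH = 10                   # assumed number of bytes reserved for a count
PLACEHOLDER = b"#" * WIDTH   # hypothetical placeholder text

def write_with_placeholder(path, label):
    """Write a header line with a fixed-width placeholder; return its offset."""
    with open(path, "wb") as fh:
        fh.write(b'<stat label="' + label.encode() + b'" count="')
        offset = fh.tell()
        fh.write(PLACEHOLDER + b'"')
        fh.write(b"/>\n")
    return offset

def fill_placeholder(path, offset, count):
    """Overwrite the placeholder in place, keeping the file length unchanged."""
    text = str(count).encode()
    if len(text) > WIDTH:
        raise ValueError("count does not fit in the reserved space")
    # The number and its closing quote are padded with spaces up to the
    # size of the original placeholder plus quote, so nothing shifts.
    field = (text + b'"').ljust(WIDTH + 1)
    with open(path, "r+b") as fh:
        fh.seek(offset)
        fh.write(field)

demo_path = os.path.join(tempfile.mkdtemp(), "annot.xml")
offset = write_with_placeholder(demo_path, "word")
size_before = os.path.getsize(demo_path)
fill_placeholder(demo_path, offset, 42)
with open(demo_path, "rb") as fh:
    result = fh.read()
```

Because the padding falls between the closing quote and `/>`, the rewritten line is still well-formed XML, and the file size is the same before and after the replacement.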
- Note:
Processing
This class provides methods to initialize and finalize the generation of primary data files and annotation files. There are methods to open and close all files that are relevant to the part that is being processed. (A part is one of ‘monad’, ‘section’, ‘lingo’.)
- Note:
Statistics
Counts are collected in a stats dictionary.
- stats[statistic_name] = statistic_value
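A counting dictionary of this kind could look like the sketch below. The statistic names used as keys, and the helper `count_annotation`, are hypothetical.

```python
from collections import defaultdict

# Missing keys start at 0, so counters need no explicit initialization.
stats = defaultdict(lambda: 0)

def count_annotation(label):
    """Record one more annotation with the given label."""
    stats[label] += 1
    stats["total"] += 1   # hypothetical overall counter

for label in ["word", "phrase", "word"]:
    count_annotation(label)
```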
Initialization is:
- setting up the list of annotation files.
- reading and storing all templates
- annotation_files = defaultdict(<function <lambda> at 0x39cae60>, {})¶
- cfg = None¶
- file_handles = {}¶
- file_order = []¶
- finish_annot(part)[source]¶
Closes all annotation files belonging to a part.
When needed, it fills in required statistics, such as the number of times an annotation label is used. Uses templates:
- annotation_ftr
- gstats = defaultdict(<function <lambda> at 0x39caf50>, {})¶
- makefeatureheader()[source]¶
Creates a feature declaration file for all features and their values.
Uses the templates:
feature_basic, feature, feature_val1, feature_val, feature_decl
- makeheaders()[source]¶
Creates the headers that occupy separate files.
The resource header is the header file for the resource as a whole. The primary header is a header file for the primary data. The feature header is an xml document that contains feature declarations.
- makeprimaryheader()[source]¶
Creates the primary header.
Uses the templates:
- annotation_item
- primary_hdr
- makeresourceheader()[source]¶
Creates the resource header.
Uses the templates:
- annotation_decl
- resource_hdr
- primary_handle = None¶
- start_annot(part)[source]¶
Creates the annotation headers of the annotation files belonging to a part.
Opens a file for writing, dumps the header to it, and leaves the file open for further writing by other parts of the program.
Uses templates:
- annotation_label
- region_hdr
- annotation_hdr
- start_primary()[source]¶
Opens a file for the primary header and leaves it open for other parts of the program to write to.
- stats = defaultdict(<function <lambda> at 0x39caed8>, {})¶
- template = {}¶
- wv = None¶
wivu2laf.mylib module¶
wivu2laf.transform module¶
wivu2laf.validate module¶
- class Validate(cfg)[source]¶
Validates all generated files, knows the schemas involved.
The main program generates a bunch of XML files, according to various schemas. They can be sent to this object, with or without a schema specification. All files with a schema specification will be validated.
The base locations of the schemas and of the generated files will be retrieved from the main configuration. All schemas will be copied from source to destination.
generated_files = list of [absolute_path, schema in destination, validation result]
Initialization is: get the schema locations from the config and copy all schemas over.
- add(xml, xsd)[source]¶
Add an item to the generated files list. If xsd is given, the file will eventually be validated.
The validation result will be stored in a member of the item, which is initially None. If validation takes place, None will be replaced by True or False, depending on whether the xml is valid with respect to the xsd.
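The bookkeeping described here can be sketched as follows. The class `GeneratedFiles`, the default value `xsd=None`, the method `validate` and the pluggable `is_valid` checker are assumptions for illustration; the real class performs the schema validation itself.

```python
class GeneratedFiles:
    """Hypothetical stand-in for the bookkeeping done by Validate."""

    def __init__(self):
        self.generated_files = []

    def add(self, xml, xsd=None):
        # Each item is [path, schema or None, validation result];
        # the result stays None until validation has taken place.
        self.generated_files.append([xml, xsd, None])

    def validate(self, is_valid):
        """Validate every item that has a schema, using the given checker."""
        for item in self.generated_files:
            xml, xsd, _ = item
            if xsd is not None:
                item[2] = bool(is_valid(xml, xsd))

files = GeneratedFiles()
files.add("primary.txt")                 # no schema: never validated
files.add("monad.words.xml", "graph.xsd")
files.add("broken.xml", "graph.xsd")

# A fake checker for the example; a real one would run an XML validator.
files.validate(lambda xml, xsd: not xml.startswith("broken"))
```

After the call, items without a schema still carry None, so a report can distinguish "not validated" from "invalid".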
- cfg = None¶
- generated_files = []¶
wivu2laf.wivu module¶
- class Wivu(cfg)[source]¶
Knows the WIVU data format.
All WIVU knowledge is stored in a file that describes objects, features and values. These are many items, and we divide them into parts and subparts. We have parts for monads, sections and linguistic objects. When we generate LAF files, they may become unwieldy in size; that is why we also divide parts into subparts. Parts correspond to sets of objects and their features. Subparts correspond to subsets of objects and/or subsets of features. N.B. It is “either or”: either
- a part consists of only one object type, and the subparts divide the features of that object type
or
- a part consists of multiple object types, and the subparts divide the object types of that part. If an object type belongs to a subpart, all its features belong to that subpart too.
In our case, the part ‘monad’ has a single object type, and its features are divided over subparts. The part ‘lingo’ has the object types sentence, sentence_atom, clause, clause_atom, phrase, phrase_atom, subphrase, word. Its subparts partition these object types into several subsets. The part ‘section’ does not have subparts. Note that an object type may occur in multiple parts: consider ‘word’. However, ‘word’ in part ‘monad’ has all non-relational word features, whereas ‘word’ in part ‘lingo’ has only relational features, i.e. features that relate words to other objects.
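The “either or” rule can be made concrete with a small check. This is a sketch only: the function `check_part`, the sample subpart names and the sample feature names are hypothetical, and it assumes that "divide" means that no feature (or object type) occurs in more than one subpart.

```python
def check_part(subparts):
    """Check the "either or" rule for one part.

    subparts maps subpart -> {object_type: set of feature names}.
    Returns "features" if the subparts divide the features of a single
    object type, "objects" if they divide the object types, and raises
    ValueError if neither holds.
    """
    object_types = {ot for objs in subparts.values() for ot in objs}
    if len(object_types) == 1:
        seen = set()
        for objs in subparts.values():
            for features in objs.values():
                if seen & features:
                    raise ValueError("feature occurs in more than one subpart")
                seen |= features
        return "features"
    owner = {}
    for subpart, objs in subparts.items():
        for ot in objs:
            if ot in owner:
                raise ValueError(f"{ot} occurs in subparts {owner[ot]} and {subpart}")
            owner[ot] = subpart
    return "objects"

monad = {
    "basic": {"word": {"text", "suffix"}},
    "morph": {"word": {"tense", "gender"}},
}
lingo = {
    "sentences": {"sentence": {"parents"}, "clause": {"mother"}},
    "phrases": {"phrase": {"mother"}},
}
monad_kind = check_part(monad)
lingo_kind = check_part(lingo)
```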
The Wivu object stores the complete information found in the Wivu config file in a bunch of data structures, and defines accessor functions for it.
The feature information is stored in the following dictionaries:
(Ia) part_info[part][subpart][object_type][feature_name] = None
Stores the organization of individual objects and their features in parts and subparts. NB: object types may occur in multiple parts.
(Ib) part_object[part][object_type] = None
Stores the set of object types of parts.
(Ic) part_feature[part][object_type][feature_name] = None
Stores the set of features of parts.
(Id) object_subpart[part][object_type] = subpart
Stores the subpart in which each object type occurs, per part.
- object_info[object_type] = [attributes]
Stores the information on objects, except their features and values.
- feature_info[object_type][feature_name] = [attributes]
Stores the information on features, except their values.
- value_info[object_type][feature_name][feature_value] = [attributes]
Stores the feature value information.
- reference_feature[feature_name] = True | False
Stores the names of features that reference other objects. The feature ‘self’ is an example, but we skip this feature: ‘self’ gets the value False, while other features, such as mother and parents, get True.
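As an illustration, the first four dictionaries could be filled from (part, subpart, object_type, feature_name) combinations as sketched below. The `ROWS` sample, including the subpart names and the feature name `function`, is hypothetical; dictionaries with None values serve as ordered sets, as in the shapes given above.

```python
from collections import defaultdict

# Hypothetical rows: (part, subpart, object_type, feature_name)
ROWS = [
    ("monad", "basic", "word", "text"),
    ("lingo", "phrases", "phrase", "mother"),
    ("lingo", "phrases", "phrase", "function"),
    ("lingo", "words", "word", "parents"),
]

part_info = {}
part_object = defaultdict(dict)
part_feature = defaultdict(lambda: defaultdict(dict))
object_subpart = defaultdict(dict)

for part, subpart, object_type, feature_name in ROWS:
    # (Ia) the full part / subpart / object type / feature hierarchy
    by_object = part_info.setdefault(part, {}).setdefault(subpart, {})
    by_object.setdefault(object_type, {})[feature_name] = None
    # (Ib) object types per part
    part_object[part][object_type] = None
    # (Ic) features per part and object type
    part_feature[part][object_type][feature_name] = None
    # (Id) the subpart of each object type, per part
    object_subpart[part][object_type] = subpart
```

In this sample the object type `word` ends up in both `part_object["monad"]` and `part_object["lingo"]`, matching the remark that object types may occur in multiple parts.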
- annotation_files[part][subpart] = (ftype, medium, location, requires, annotations, is_region)
Stores information on the files that are generated as the resulting LAF resource. The files are organized by part and subpart. Header files and primary data files are in part ‘’. Other files may or may not contain annotations. If not, they only contain regions, and is_region is True.
- ftype
- the file identifier to be used in header files
- medium
- text or xml
- location
- the last part of the file name. All file names can be obtained by appending location after the absolute path followed by a common prefix.
- requires
- the identifier of a file that is required by the current file
- annotations
- the annotation labels to be declared for this file
- The feature information file contains lines with tab-delimited fields (only the starred ones are used):
- column  field         index among the used fields
  0*      object_type   0
  1*      feature_name  1
  2*      defined_on    2
  3*      wivu_type     3
  4*      feature_value 4
  5*      isocat_key    5
  6       isocat_id
  7*      isocat_name   6
  8       isocat_type
  9       isocat_def
  10      note
  11*     part          7
  12*     subpart       8
Initialization is: reading the excel sheet with feature information.
The sheet should be in the form of a tab-delimited text file.
- There are columns with:
- WIVU information:
- object_type, feature_name, also_defined_on, type, value.
- ISOcat information
- key, id, name, type, definition, note
- LAF sectioning
- part, subpart
See the list of columns above.
So the file gives essential information to map objects/features/values to ISOcat data categories. It indicates how the LAF output can be chunked in parts and subparts.
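Reading such a tab-delimited file and keeping only the used fields might look like this. It is a sketch under assumptions: the `SAMPLE` lines, their values, and the rule to skip lines with fewer than 13 fields are hypothetical; the column indexes follow the 13-column layout listed above.

```python
import csv
import io

# Indexes of the used (starred) columns in the 13-column layout.
USED = (0, 1, 2, 3, 4, 5, 7, 11, 12)

# Hypothetical sample content; real input comes from the feature sheet.
SAMPLE = (
    "word\ttense\t\tenum\tperfect\tk1\t123\ttenseName\t\t\t\tmonad\tmorph\n"
    "phrase\tmother\t\tref\t\tk2\t456\tmotherName\t\t\t\tlingo\tphrases\n"
)

def read_features(stream):
    """Yield the used fields of each line as a 9-tuple."""
    for fields in csv.reader(stream, delimiter="\t"):
        if len(fields) < 13:
            continue  # skip short or empty lines (an assumption)
        yield tuple(fields[i] for i in USED)

rows = list(read_features(io.StringIO(SAMPLE)))
```

Each resulting tuple holds object_type, feature_name, defined_on, wivu_type, feature_value, isocat_key, isocat_name, part and subpart, in that order.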
- cfg = None¶
- check_raw_files(part)[source]¶
Generates the file with raw Emdros output by executing a generated MQL query. This query has been generated during initialization. This happens only when a command line flag requests it.
- feature_atts(object_type, feature_name)[source]¶
Returns a tuple of feature attributes, corresponding with the columns in the feature excel sheet.
The Wivu columns (object type, feature name) are missing, since they are given as arguments. The LAF columns are not included. The attributes returned are:
defined_on, wivu_type, isocat_key, isocat_name
- feature_info = {}¶
- feature_list_part(part, object_type)[source]¶
Answers: which features belong to an object type within a part, excluding the features to be skipped?
- feature_list_subpart(part, subpart, object_type)[source]¶
Answers: which features belong to an object type within a part and subpart, excluding the features to be skipped?
- is_ref_skip(feature_name)[source]¶
Tests whether feature_name is a reference feature that should be skipped.
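How the skip test and the feature lists fit together can be sketched as below. The functions are written at module level rather than as methods, the sample data is hypothetical, and the reading that a reference feature stored with the value False is the one to skip is an assumption based on the description of reference_feature.

```python
# Hypothetical sample data in the shapes described for the Wivu class.
reference_feature = {"self": False, "mother": True, "parents": True}
part_feature = {
    "lingo": {"phrase": {"self": None, "mother": None, "function": None}},
}

def is_ref_skip(feature_name):
    """Assumed reading: a reference feature marked False is to be skipped."""
    return reference_feature.get(feature_name) is False

def feature_list_part(part, object_type):
    """List the features of an object type in a part, minus skipped ones."""
    features = part_feature.get(part, {}).get(object_type, {})
    return [name for name in features if not is_ref_skip(name)]

phrase_features = feature_list_part("lingo", "phrase")
```

Features that are not reference features at all, such as the hypothetical `function`, are absent from `reference_feature` and are therefore kept.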
- make_query_file(part)[source]¶
Generates an Emdros query file to extract the raw data for part from the Emdros database.
- object_atts(object_type)[source]¶
Returns a tuple of object attributes, corresponding with the columns in the feature excel sheet.
The Wivu column (object type) is missing, since it is given as an argument. The LAF columns are not included. The attributes returned are:
isocat_key, isocat_name
- object_info = {}¶
- object_subpart = {}¶
- part_feature = defaultdict(<function <lambda> at 0x3ffd9b0>, {})¶
- part_info = {}¶
- part_object = defaultdict(<function <lambda> at 0x3ffda28>, {})¶
- reference_feature = {}¶
- value_atts(object_type, feature_name, feature_value)[source]¶
Returns a tuple of value attributes, corresponding with the columns in the feature excel sheet.
The Wivu columns (object type, feature name, feature value) are missing, since they are given as arguments. The LAF columns are not included. The attributes returned are:
wivu_type, isocat_key, isocat_name
- value_info = {}¶