What is Gelatin?

Gelatin turns your text soup into something solid. It is a combined lexer, parser, and output generator. It converts text files to XML, JSON, or YAML, using a simple language that describes how to turn text into a structured format.

Development

Gelatin is on GitHub.

License

Gelatin is published under the MIT licence.

Quick Start

Suppose you want to convert the following text file to XML:

User
----
Name: John, Lastname: Doe
Office: 1st Ave
Birth date: 1978-01-01

User
----
Name: Jane, Lastname: Foo
Office: 2nd Ave
Birth date: 1970-01-01

The following Gelatin syntax does the job:

# Define commonly used data types. This is optional, but
# makes your life a little easier by allowing you to reuse regular
# expressions in the grammar.
define nl /[\r\n]/
define ws /\s+/
define fieldname /[\w ]+/
define value /[^\r\n,]+/
define field_end /[\r\n,] */

grammar user:
    match 'Name:' ws value field_end:
        out.add_attribute('.', 'firstname', '$2')
    match 'Lastname:' ws value field_end:
        out.add_attribute('.', 'lastname', '$2')
    match fieldname ':' ws value field_end:
        out.add('$0', '$3')
    match nl:
        do.return()

# The grammar named "input" is the entry point for the converter.
grammar input:
    match 'User' nl '----' nl:
        out.open('user')
        user()

Explanation:

  1. “grammar input:” is the entry point for the converter.
  2. “match” statements in each grammar are tried sequentially. If a match is found, the indented statements in the match block are executed. After reaching the end of a match block, the grammar restarts at the top of the grammar block.
  3. If the end of a grammar is reached before the end of the input document, an error is raised.
  4. out.add('$0', '$3') creates a node in the XML (or JSON, or YAML) if it does not yet exist. The name of the node is the value of the first matched field (the fieldname, in this case). The data of the node is the value of the fourth matched field.
  5. out.open('user') creates a “user” node in the output and selects it such that all following “add” statements generate output relative to the “user” node. Gelatin leaves the user node upon reaching the end of the match block.
  6. user() calls the grammar named “user”.
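
The control flow described in points 2 and 3 can be pictured with a small Python sketch. It is a simplified model built on the standard re module, not Gelatin's actual implementation; the names run_grammar and rules are made up for this illustration.

```python
import re


def run_grammar(rules, text):
    """Try (pattern, action) rules in order at the current position.

    After a rule matches, its text is consumed and matching restarts
    at the first rule. If no rule matches before the input ends, an
    error is raised.
    """
    pos = 0
    results = []
    while pos < len(text):
        for pattern, action in rules:
            m = re.compile(pattern).match(text, pos)
            # Require progress so an empty match cannot loop forever.
            if m and m.end() > pos:
                results.append(action(m.group(0)))
                pos = m.end()
                break  # restart at the top of the rule list
        else:
            raise ValueError('no rule matches at offset %d' % pos)
    return results


rules = [
    (r'[0-9]+', lambda s: ('number', s)),
    (r'[a-z]+', lambda s: ('word', s)),
    (r'\s+', lambda s: ('space', s)),
]

print(run_grammar(rules, 'abc 42'))
```

Input that no rule accepts, such as 'abc!', raises ValueError in this model, mirroring the error described in point 3.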

This produces the following output:

<xml>
  <user lastname="Doe" firstname="John">
    <office>1st Ave</office>
    <birth-date>1978-01-01</birth-date>
  </user>
  <user lastname="Foo" firstname="Jane">
    <office>2nd Ave</office>
    <birth-date>1970-01-01</birth-date>
  </user>
</xml>

Starting the transformation

The following command converts the input to XML:

gel -s mysyntax.gel input.txt

The same for JSON or YAML:

gel -s mysyntax.gel -f json input.txt
gel -s mysyntax.gel -f yaml input.txt

Using Gelatin as a Python Module

Gelatin also provides a Python API for transforming the text:

from Gelatin.util import compile, generate

# Parse your .gel file.
syntax = compile('syntax.gel')

# Convert your input file to XML.
print(generate(syntax, 'input.txt', format='xml'))

Common Operations

Generating XML attributes

There are two ways to create an attribute. The first uses URL notation within a node name:

grammar input:
    match 'User' nl '----' nl 'Name:' ws value field_end:
        out.enter('user?name="$6"')
        user()

The second, equivalent way calls add_attribute() explicitly:

grammar input:
    match 'User' nl '----' nl 'Name:' ws value field_end:
        out.enter('user')
        out.add_attribute('.', 'name', '$6')
        user()
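
To see why the two forms are equivalent, it can help to picture how a node name in URL notation splits into a name and a set of attributes. The following Python sketch is an assumption-laden illustration, not Gelatin's parser: parse_node is a hypothetical helper, and it only handles simple double-quoted values.

```python
import re


def parse_node(node):
    """Split 'name?attr1="a"&attr2="b"' into (name, {attr: value})."""
    name, _, query = node.partition('?')
    attribs = {}
    if query:
        # Each attribute has the form key="value".
        for key, value in re.findall(r'(\w+)="([^"]*)"', query):
            attribs[key] = value
    return name, attribs


print(parse_node('user?name="John"'))
print(parse_node('user'))
```

In this model, 'user?name="John"' carries the same information as creating a "user" node and then adding a "name" attribute to it.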

Skipping Values

match /# .*[\r\n]/:
    do.skip()

Matching Multiple Values

match /# .*[\r\n]/
    | '/*' /[^\r\n]/ '*/' nl:
    do.skip()

Grammar Inheritance

A grammar that uses inheritance executes the inherited match statements before trying its own:

grammar default:
    match nl:
        do.return()
    match ws:
        do.next()

grammar user(default):
    match fieldname ':' ws value field_end:
        out.add('$0', '$3')

In this case, the user grammar inherits the whitespace rules from the default grammar.

Gelatin Syntax

The following functions and types may be used within a Gelatin syntax.

Types

STRING

A string is any series of characters, delimited by the ' character. Escaping is done using the backslash character. Examples:

'test me'
'test \'escaped\' strings'

VARNAME

VARNAMEs are variable names. They may contain the following set of characters:

[a-z0-9_]

NODE

The output that is generated by Gelatin is represented by a tree consisting of nodes. The NODE type is used to describe a single node in a tree. It is a string in URL notation consisting of the node name, optionally followed by attributes. Examples of NODE include:

.
element
element?attribute1="foo"
element?attribute1="foo"&attribute2="foo"

PATH

A PATH addresses a node in the tree. Addressing is relative to the currently selected node. A PATH is a string with the following syntax:

NODE[/NODE[/NODE]...]

Examples:

.
./child
parent/element?attribute="foo"
parent/child1?name="foo"/child?attribute="foobar"

REGEX

This type describes a Python regular expression. The expression MUST NOT extract any subgroups. In other words, when grouping with parentheses, always use the non-capturing form (?:...). Example:

/^(test|foo|bar)$/         # invalid!
/^(?:test|foo|bar)$/       # valid

If you are trying to extract a substring, use a match statement with multiple fields instead.
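
The difference between the two forms can be checked with Python's own re module, which Gelatin's REGEX type is based on. This is a small standalone sketch; has_subgroups is a hypothetical helper, not part of Gelatin.

```python
import re

capturing = re.compile(r'^(test|foo|bar)$')
non_capturing = re.compile(r'^(?:test|foo|bar)$')

# Both patterns accept exactly the same strings.
assert capturing.match('foo') and non_capturing.match('foo')
assert not capturing.match('baz') and not non_capturing.match('baz')


def has_subgroups(pattern):
    """Return True if the pattern contains any capturing group."""
    return re.compile(pattern).groups > 0


# Only the first form extracts a subgroup.
print(has_subgroups(r'^(test|foo|bar)$'))
print(has_subgroups(r'^(?:test|foo|bar)$'))
```

A check like this could be used to validate expressions before putting them into a syntax file.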

Statements

define VARNAME STRING|REGEX|VARNAME

define statements assign a value to a variable. Examples:

define my_test /(?:foo|bar)/
define my_test2 'foobar'
define my_test3 my_test2

match STRING|REGEX|VARNAME ...

Match statements are lists of tokens that are applied against the current input document. They parse the input stream by matching at the current position. On a match, the matching string is consumed from the input such that the next match statement may be applied. In other words, the current position in the document is advanced only when a match is found.

A match statement must be followed by an indented block. In this block, each matching token may be accessed using the $X variables, where X is the number of the match, starting with $0.

Examples:

define digit /[0-9]/
define number /[0-9]+/

grammar input:
    match 'foobar':
        do.say('Match was: $0!')
    match 'foo' 'bar' /[\r\n]/:
        do.say('Match was: $0!')
    match 'foobar' digit /\s+/ number /[\r\n]/:
        do.say('Matches: $1 and $3')
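
The way a token list is applied at the current position, and how $X refers to each token, can be modelled in a few lines of Python. This is only a sketch of the idea under simplifying assumptions (every token is treated as a regular expression); match_tokens and expand are hypothetical names, not Gelatin functions.

```python
import re


def match_tokens(tokens, text, pos=0):
    """Match each token pattern in turn, starting at pos.

    Returns (matches, new_pos) on success, or None if any token
    fails, in which case nothing is consumed.
    """
    matches = []
    for token in tokens:
        m = re.compile(token).match(text, pos)
        if m is None:
            return None
        matches.append(m.group(0))
        pos = m.end()
    return matches, pos


def expand(template, matches):
    """Replace $0, $1, ... with the text of the corresponding token."""
    return re.sub(r'\$(\d+)', lambda m: matches[int(m.group(1))], template)


# Mirrors: match 'foobar' digit /\s+/ number /[\r\n]/
tokens = [r'foobar', r'[0-9]', r'\s+', r'[0-9]+', r'[\r\n]']
matches, pos = match_tokens(tokens, 'foobar7 123\n')
print(expand('Matches: $1 and $3', matches))
```

Because the position only advances when every token in the list matches, a failed attempt leaves the input untouched for the next match statement.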

You may also use multiple matches resulting in a logical OR:

match 'foo' /[0-9]/ /[\r\n]/
    | 'bar' /[a-z]/ /[\r\n]/
    | 'foobar' /[A-Z]/ /[\r\n]/:
    do.say('Match was: $1!')

imatch STRING|REGEX|VARNAME ...

imatch statements are like match statements, except that matching is case-insensitive.

when STRING|REGEX|VARNAME ...

when statements are like match statements, with the difference that upon a match, the string is not consumed from the input stream. In other words, the current position in the document is not advanced, even when a match is found. when statements are generally used in places where you want to “bail out” of a grammar without consuming the token.

Example:

grammar user:
    match 'Name:' /\s+/ /\S+/ /\n/:
        do.say('Name was: $2!')
    when 'User:':
        do.return()

grammar input:
    match 'User:' /\s+/ /\S+/ /\n/:
        out.enter('user/name', '$2')
        user()

skip STRING|REGEX|VARNAME

skip statements are like match statements without any actions. Unlike match statements, they do not support lists of tokens; they accept a single expression only.

Example:

grammar user:
    skip /#.*?[\r\n]+/
    match 'Name:' /\s+/ /\S+/ /\n/:
        do.say('Name was: $2!')
    when 'User:':
        do.return()
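
The difference between match, when, and skip comes down to whether the current position advances and whether an action block runs. The Python sketch below models only the position handling; the Cursor class is a hypothetical illustration and not how Gelatin is implemented.

```python
import re


class Cursor:
    """Tracks the current position in the input text."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def match(self, pattern):
        """Like 'match': return the matched text and consume it."""
        m = re.compile(pattern).match(self.text, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        return m.group(0)

    def when(self, pattern):
        """Like 'when': report a match without consuming anything."""
        m = re.compile(pattern).match(self.text, self.pos)
        return None if m is None else m.group(0)

    def skip(self, pattern):
        """Like 'skip': consume the matched text; no action follows."""
        return self.match(pattern) is not None


cursor = Cursor('# comment\nUser: jane\n')
cursor.skip(r'#.*?[\r\n]+')
print(cursor.when(r'User:'), cursor.pos)   # position is unchanged by when
print(cursor.match(r'User:'), cursor.pos)  # position advances after match
```

This is why when is suited to bailing out of a grammar: the token it detected is still available to the calling grammar.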

Output Generating Functions

out.create(PATH[, STRING])

Creates the leaf node (and attributes) in the given path, regardless of whether or not it already exists. In other words, using this function twice will lead to duplicates. If the given path contains multiple elements, the parent nodes are only created if they do not yet exist. If the STRING argument is given, the new node is also assigned the string as data. For example, the following function call:

out.create('parent/child?name="test"', 'hello world')

leads to the following XML output:

<parent>
    <child name="test">hello world</child>
</parent>

If the same call is made twice, like so:

out.create('parent/child?name="test"', 'hello world')
out.create('parent/child?name="test"', 'hello world')

the resulting XML looks like this:

<parent>
    <child name="test">hello world</child>
    <child name="test">hello world</child>
</parent>

out.replace(PATH[, STRING])

Like out.create(), but replaces the nodes in the given path if they already exist.

out.add(PATH[, STRING])

Like out.create(), but appends the string to the text of the existing node if it already exists.
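
The three functions differ only in what happens when the node already exists. The following Python sketch models that difference on a toy tree. It is an illustration under simplifying assumptions (single-level paths, no attributes), and the Node class here is hypothetical, not Gelatin's own.

```python
class Node:
    """Toy output node with a name, text, and child nodes."""

    def __init__(self, name, text=''):
        self.name = name
        self.text = text
        self.children = []

    def find(self, name):
        """Return the first child with the given name, or None."""
        for child in self.children:
            if child.name == name:
                return child
        return None

    def create(self, name, text=''):
        """Always append a new child, even if one already exists."""
        child = Node(name, text)
        self.children.append(child)
        return child

    def replace(self, name, text=''):
        """Swap an existing child for a fresh one, or create it."""
        old = self.find(name)
        if old is None:
            return self.create(name, text)
        new = Node(name, text)
        self.children[self.children.index(old)] = new
        return new

    def add(self, name, text=''):
        """Append text to an existing child, or create it."""
        child = self.find(name)
        if child is None:
            return self.create(name, text)
        child.text += text
        return child


parent = Node('parent')
parent.create('child', 'hello')
parent.create('child', 'hello')   # a second <child> node is appended
parent.add('note', 'a')
parent.add('note', 'b')           # text is appended to the same <note>
parent.replace('note', 'c')       # the existing <note> is swapped out
print([(c.name, c.text) for c in parent.children])
```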

out.add_attribute(PATH, NAME, STRING)

Adds the attribute with the given name and value to the node with the given path.

out.open(PATH[, STRING])

Like out.create(), but also selects the addressed node, such that the PATH of all subsequent function calls is relative to the selected node until the end of the match block is reached.

out.enter(PATH[, STRING])

Like out.open(), but only creates the nodes in the given path if they do not already exist.
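
Node selection can be pictured as a stack: open and enter push a node, and leaving the block pops it again. The Python sketch below illustrates this with a hypothetical ToyTree class; it ignores attributes and multi-element paths, and it uses an explicit leave() call where Gelatin syntax relies on the end of the match block.

```python
class ToyTree:
    """Toy output tree with a selection stack."""

    def __init__(self):
        self.root = {'name': 'xml', 'children': []}
        self.stack = [self.root]  # the last entry is the selected node

    def _new_child(self, name):
        node = {'name': name, 'children': []}
        self.stack[-1]['children'].append(node)
        return node

    def open(self, name):
        """Always create a new child, then select it."""
        node = self._new_child(name)
        self.stack.append(node)
        return node

    def enter(self, name):
        """Select the first child with this name, creating it if needed."""
        for child in self.stack[-1]['children']:
            if child['name'] == name:
                break
        else:
            child = self._new_child(name)
        self.stack.append(child)
        return child

    def leave(self):
        """Return to the previously selected node."""
        self.stack.pop()


tree = ToyTree()
for _ in range(2):
    tree.open('user')    # creates a second "user" on the second pass
    tree.leave()
for _ in range(2):
    tree.enter('group')  # reuses the existing "group" on the second pass
    tree.leave()
print([child['name'] for child in tree.root['children']])
```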

out.enqueue_before(REGEX, PATH[, STRING])

Like out.add(), but is not immediately executed. Instead, it is executed as soon as the given regular expression matches the input, regardless of the grammar in which the match occurs.

out.enqueue_after(REGEX, PATH[, STRING])

Like out.enqueue_before(), but is executed after the given regular expression matches the input and the next match statement was processed.

out.enqueue_on_add(REGEX, PATH[, STRING])

Like out.enqueue_before(), but is executed after the given regular expression matches the input and the next node is added to the output.

out.clear_queue()

Removes any items from the queue that were previously queued using the out.enqueue_*() functions.

out.set_root_name(STRING)

Specifies the name of the root tag, if the output is XML. Has no effect on JSON and YAML output.

Control Functions

do.skip()

Skip the current match and jump back to the top of the current grammar block.

do.next()

Skip the current match and continue with the next match statement without jumping back to the top of the current grammar block. This function is rarely used and probably not what you want. Use do.skip() in almost all cases, unless you need a performance-specific hack.

do.return()

Immediately leave the current grammar block and return to the calling function. When used at the top level (i.e. in the input grammar), stop parsing.

do.say(STRING)

Prints the given string to stdout, with additional debug information.

do.fail(STRING)

Like do.say(), but immediately terminates with an error.

Python API

Gelatin.util module

Gelatin.util.compile(syntax_file, encoding='utf8')

Like compile_string(), but reads the syntax from the file with the given name.

Parameters:
  • syntax_file (str) – Name of a file containing Gelatin syntax.
  • encoding (str) – Character encoding of the syntax file.
Return type:

compiler.Context

Returns:

The compiled converter.

Gelatin.util.compile_string(syntax)

Builds a converter from the given syntax and returns it.

Parameters:
  • syntax (str) – A Gelatin syntax.
Return type:

compiler.Context

Returns:

The compiled converter.

Gelatin.util.generate(converter, input_file, format='xml', encoding='utf8')

Given a converter (as returned by compile()), this function reads the given input file and converts it to the requested output format.

Supported output formats are ‘xml’, ‘yaml’, ‘json’, or ‘none’.

Parameters:
  • converter (compiler.Context) – The compiled converter.
  • input_file (str) – Name of a file to convert.
  • format (str) – The output format.
  • encoding (str) – Character encoding of the input file.
Return type:

str

Returns:

The resulting output.

Gelatin.util.generate_string(converter, input, format='xml')

Like generate(), but reads the input from a string instead of from a file.

Parameters:
  • converter (compiler.Context) – The compiled converter.
  • input (str) – The string to convert.
  • format (str) – The output format.
Return type:

str

Returns:

The resulting output.
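
As a rough usage sketch, compile_string() and generate_string() can be combined to convert text without touching the file system. The example below reuses the syntax and input from the Quick Start section and relies only on the signatures documented here. The convert helper and the SYNTAX and TEXT names are made up for this illustration, and the snippet has not been checked against any particular Gelatin release.

```python
import importlib.util

# Syntax from the Quick Start section. A raw string keeps the regex
# backslashes intact.
SYNTAX = r"""define nl /[\r\n]/
define ws /\s+/
define fieldname /[\w ]+/
define value /[^\r\n,]+/
define field_end /[\r\n,] */

grammar user:
    match 'Name:' ws value field_end:
        out.add_attribute('.', 'firstname', '$2')
    match 'Lastname:' ws value field_end:
        out.add_attribute('.', 'lastname', '$2')
    match fieldname ':' ws value field_end:
        out.add('$0', '$3')
    match nl:
        do.return()

grammar input:
    match 'User' nl '----' nl:
        out.open('user')
        user()
"""

TEXT = """User
----
Name: John, Lastname: Doe
Office: 1st Ave
Birth date: 1978-01-01
"""


def convert(text, fmt='xml'):
    """Compile SYNTAX and convert text to the requested format."""
    # Imported here so this module loads even without Gelatin installed.
    from Gelatin.util import compile_string, generate_string
    converter = compile_string(SYNTAX)
    return generate_string(converter, text, format=fmt)


if importlib.util.find_spec('Gelatin') is not None:
    print(convert(TEXT, 'json'))
else:
    print('Gelatin is not installed; skipping the conversion.')
```

Compiling once and reusing the converter for many inputs avoids parsing the syntax repeatedly; in a real program the compile_string() call would likely be moved out of the helper for that reason.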

Gelatin.util.generate_string_to_file(converter, input, output_file, format='xml', out_encoding='utf8')

Like generate(), but reads the input from a string instead of from a file, and writes the output to the given output file.

Parameters:
  • converter (compiler.Context) – The compiled converter.
  • input (str) – The string to convert.
  • output_file (str) – The output filename.
  • format (str) – The output format.
  • out_encoding (str) – Character encoding of the output file.
Return type:

str

Returns:

The resulting output.

Gelatin.util.generate_to_file(converter, input_file, output_file, format='xml', in_encoding='utf8', out_encoding='utf8')

Like generate(), but writes the output to the given output file instead.

Parameters:
  • converter (compiler.Context) – The compiled converter.
  • input_file (str) – Name of a file to convert.
  • output_file (str) – The output filename.
  • format (str) – The output format.
  • in_encoding (str) – Character encoding of the input file.
  • out_encoding (str) – Character encoding of the output file.
Return type:

str

Returns:

The resulting output.

Gelatin.compiler package

Module contents

Gelatin.compiler.Context module

class Gelatin.compiler.Context.Context

Bases: object

__init__()
dump()
parse(filename, builder, encoding='utf8', debug=0)
parse_string(input, builder, debug=0)

Gelatin.compiler.Context.do_fail(context, message='No matching statement found')
Gelatin.compiler.Context.do_next(context)
Gelatin.compiler.Context.do_return(context, levels=1)
Gelatin.compiler.Context.do_say(context, message)
Gelatin.compiler.Context.do_skip(context)
Gelatin.compiler.Context.do_warn(context, message)
Gelatin.compiler.Context.out_add(context, path, data=None)
Gelatin.compiler.Context.out_add_attribute(context, path, name, value)
Gelatin.compiler.Context.out_clear_queue(context)
Gelatin.compiler.Context.out_create(context, path, data=None)
Gelatin.compiler.Context.out_enqueue_after(context, regex, path, data=None)
Gelatin.compiler.Context.out_enqueue_before(context, regex, path, data=None)
Gelatin.compiler.Context.out_enqueue_on_add(context, regex, path, data=None)
Gelatin.compiler.Context.out_enter(context, path)
Gelatin.compiler.Context.out_open(context, path)
Gelatin.compiler.Context.out_replace(context, path, data=None)
Gelatin.compiler.Context.out_set_root_name(context, name)

Gelatin.compiler.SyntaxCompiler module

class Gelatin.compiler.SyntaxCompiler.SyntaxCompiler

Bases: simpleparse.dispatchprocessor.DispatchProcessor

Processor sub-class defining processing functions for the productions.

__init__()
define_stmt(token, buffer)
grammar_stmt(token, buffer)
reset()

Gelatin.generator package

Module contents
Gelatin.generator.new(format)

Gelatin.generator.Builder module

class Gelatin.generator.Builder.Builder

Bases: object

Abstract base class for all generators.

__init__()
add(path, data=None, replace=False)

Creates the given node if it does not exist. Returns the (new or existing) node.

add_attribute(path, name, value)

Creates the given attribute and sets it to the given value. Returns the (new or existing) node to which the attribute was added.

create(path, data=None)

Creates the given node, regardless of whether or not it already exists. Returns the new node.

dump()
enter(path)

Enters the given node. Creates it if it does not exist. Returns the node.

leave()

Returns to the node that was selected before the last call to enter(). The history is a stack, so the method may be called multiple times.

open(path)

Creates and enters the given node, regardless of whether it already exists. Returns the new node.

serialize(serializer)
serialize_to_file(filename)
set_root_name(name)

class Gelatin.generator.Builder.Node(name, attribs=None)

Bases: object

__init__(name, attribs=None)
add(child)
dump(indent=0)
get_child(name, attribs=None)

Returns the first child that matches the given name and attributes.

to_dict()

class Gelatin.generator.Builder.OrderedDefaultDict(default_factory=None, *a, **kw)

Bases: collections.OrderedDict

__init__(default_factory=None, *a, **kw)
copy()

Gelatin.generator.Builder.nodehash(name, attribs)

Gelatin.generator.Dummy module

class Gelatin.generator.Dummy.Dummy

Bases: Gelatin.generator.Builder.Builder

__init__()
add(path, data=None, replace=False)
add_attribute(path, name, value)
dump()
enter(path)
leave()
open(path)
serialize()
set_root_name(name)

Gelatin.generator.Json module

class Gelatin.generator.Json.Json

Bases: object

serialize_doc(node)

Gelatin.generator.Xml module

class Gelatin.generator.Xml.Xml

Bases: object

serialize_doc(node)
serialize_node(node)

Gelatin.generator.Yaml module

class Gelatin.generator.Yaml.Yaml

Bases: object

serialize_doc(node)

Gelatin.generator.Yaml.represent_ordereddict(dumper, data)