Welcome to Stetl¶
Stetl, Streaming ETL, is an open source (GNU GPL) toolkit for the transformation (ETL) of geospatial data. Stetl is based on existing ETL tools like GDAL/OGR and XSLT. Stetl processing is driven from a configuration (.ini) file. Stetl is written in Python and is particularly suited for processing GML.
This is the documentation of the Stetl toolkit. The code is on GitHub: https://github.com/geopython/stetl. Since July 2016 the project is a proud member of the GeoPython GitHub organization.
See an introductory Stetl presentation on Slideshare.
This is document version 2.1 generated on Jan 09, 2023.
Intro¶
Stetl, streaming ETL, pronounced “staedl”, is a lightweight ETL-framework for the conversion of rich (such as GML) geospatial data. Stetl is Open Source (GNU GPL v3).
Read a 5-minute introduction here: http://www.slideshare.net/justb4/5-minute-intro-to-setl and a longer presentation here: http://www.slideshare.net/justb4/geospatial-etl-with-stetl-geopython-2016. Plus a presentation of Stetl for use in INSPIRE transformation: http://www.slideshare.net/justb4/2-stetlinspiretransformv1 with even a video recording: https://www.youtube.com/watch?v=vjdpYBm4AaM
Stetl originated in the INSPIRE-FOSS project and was originally created by Just van den Broecke. Subsequently, Stetl found wider use in transforming Dutch GML-based datasets such as Top10NL, IMGEO/BGT (Large Scale Topography) and IMKAD/BRK (Cadastral Data). Therefore Stetl now has a repository of its own at GitHub.
Stetl basically glues together existing parsing and transformation tools like GDAL/OGR (ogr2ogr) and XSLT. By using native tools like libxml2 and libxslt (via Python lxml) Stetl is speed-optimized.
Stetl has (currently) no GUI. There are powerful Open Source ETL tools like GeoKettle and Talend Geospatial with a GUI. Check these out. But some of us would like to stay close to the commandline, be Pythonic and reuse existing tools ‘close to the iron’.
So why and when to use Stetl:
- when ogr2ogr or XSLT alone cannot do the job
- when having to deal with complex GML as source or destination
- when you want to use simple command-line tooling or (Python) program integrations
- when you need speed
- when you are a Pythonista
Stetl is particularly useful for INSPIRE-related transformations and other complex GML-related ETL.
Stetl was presented at FOSS4G 2013 in Nottingham, see http://2013.foss4g.org/conf/programme/presentations/156 and the slides: http://www.slideshare.net/justb4/stetl-foss4g20131024v1
Installation¶
Stetl up to and including version 1.3 only runs with Python 2 (2.7+). Starting with Stetl v2.0 only Python 3 (3.6+) will be supported. You may want to read Upgrade to Python 3 when upgrading from a Stetl pre-v2 version.
Easiest is to first install the Stetl-dependencies (see below) and then install and maintain Stetl on your system as a Python package (pip is preferred).
(sudo) pip install stetl
or
easy_install stetl
Alternatively you can get Stetl from GitHub, either by cloning the repository (preferred) or by downloading https://github.com/geopython/stetl/archive/master.zip, and then install it locally:
(sudo) python setup.py install
Try the examples first. This should work on Linuxes and Mac OSX.
Windows installation may be more involved depending on your local Python setup. Platform-specific installations below.
You may also want to download the complete .tar.gz distro from PyPi: https://pypi.python.org/pypi/Stetl . This includes the examples and tests.
Docker
Since version 1.0.9 Stetl also can be installed and run via Docker. See Install with Docker below.
Debian/Ubuntu
Thanks to Bas Couwenberg, work is performed to provide Stetl as Debian packages on both Debian and Ubuntu, see details:
https://packages.debian.org/search?keywords=stetl (Debian) and
https://launchpad.net/ubuntu/+source/python-stetl (Ubuntu, Xenial and later).
Stetl is split into 2 packages: python-stetl, the Python framework, and stetl, the command line utility.
NB the versions of these packages may be older than when installing Stetl via pip from PyPi or directly from GitHub. Always check this first.
Dependencies¶
Stetl depends on the following Python packages:
- GDAL (>=v2.4) bindings for Python
- psycopg2 (PostgreSQL client)
- lxml >=4.4.2
- Jinja2 templating
- Deprecated
The GDAL Python binding requires the native GDAL/OGR libs and tools to be installed.
lxml (see http://lxml.de/installation.html) requires the native (C) libraries:
- libxslt (required by lxml)
- libxml2 with Python bindings (required by lxml)
When using the Jinja2 templating filter, Jinja2TemplatingFilter, see http://jinja.pocoo.org:
- Python Jinja2 package
Deprecated is used to indicate deprecated functions and classes.
Platform-specific guidelines for dependencies follow next.
Linux¶
For Debian-based distros like Ubuntu and Debian itself, most packages can be installed via apt-get.
Tip: to get the latest versions of GDAL and other Open Source geospatial software, best is to add the UbuntuGIS Repository. Below is a setup that works in Ubuntu 16.04 Xenial using Debian/Ubuntu packages. In some cases you may choose to install the same packages via pip3 to get more recent versions, for example for lxml.
More Linux Tips. See also:
- the install commands for Debian in the [Dockerfile](https://github.com/geopython/stetl/blob/master/Dockerfile).
- the install commands for Ubuntu in the [Travis file](https://github.com/geopython/stetl/blob/master/.travis.yml).
Python dependencies:
apt-get install python3-setuptools
apt-get install python3-dev
apt-get install python3-pip
pip3 install --upgrade pip
libxml2/libxslt libs are usually already installed. Together with Python lxml, the total install for lxml is:
apt-get install python3-libxml2
apt-get install python-libxslt1
apt-get install libxml2-dev libxslt1-dev lib32z1-dev
apt-get install python3-lxml
GDAL (http://gdal.org) version 2+ with Python bindings:
# Add UbuntuGIS repo to get latest GDAL, at least v2 on Ubuntu 16.04, Xenial.
add-apt-repository ppa:ubuntugis/ubuntugis-unstable
apt-get update
apt-get install gdal-bin
gdalinfo --version
# should show something like: GDAL 2.4.0, released 2019/03/04
apt-get install python3-gdal
the PostgreSQL client library for Python psycopg2:
apt-get install python3-psycopg2
for Jinja2:
apt-get install python3-jinja2
Mac OSX¶
Dependencies can best be installed via Homebrew.
Tip: sometimes installing GDAL Python bindings can be tricky as the installed GDAL binaries must be compatible. To install the right version you may use:
pip install GDAL==`gdalinfo --version | cut -d' ' -f2 | cut -d',' -f1`
Windows¶
Best is to install GDAL and python using the OSGeo4W Installer from http://trac.osgeo.org/osgeo4w.
- Download and run the OSGeo4W Installer
- Choose Advanced Install
- On the Select Packages page expand Commandline_Utilities and select gdal and python from the list
- (psycopg2 ??)
- Install easy_install to allow you to install lxml
- Download the ez_setup.py script
- Open the OSGeo4W Shell (Start > Programs > OSGeo4W > OSGeo4W > OSGeo4W Shell)
- Change to the folder that you downloaded ez_setup.py to (if you downloaded to C:\Temp then run cd C:\Temp)
- Install easy_install by running python ez_setup.py
- To install lxml with easy_install run easy_install lxml
Only Psycopg2 needs explicit installation. Many install via: http://www.stickpeople.com/projects/python/win-psycopg. Once the above has been installed you should have everything required to run Stetl.
Alternatively you may use Portable GIS. Still you will need to manually install psycopg2. See http://www.archaeogeek.com/portable-gis.html for details.
Test Installation¶
If you installed via Python ‘pip’ you can check if you run the latest version
stetl -h
You should get meaningful output like
2013-09-16 18:25:12,093 util INFO running with lxml.etree, good!
2013-09-16 18:25:12,100 util INFO running with cStringIO, fabulous!
2013-09-16 18:25:12,122 main INFO Stetl version = 1.0.3
usage: stetl [-h] -c CONFIG_FILE [-s CONFIG_SECTION] [-a CONFIG_ARGS]
Especially check the Stetl version number. You can also use the -v or --version option for stetl.
Try running the examples when running with a downloaded distro.
cd examples/basics
./runall.sh
Look for any error messages in your output.
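If a run fails, it can help to first check whether the Python dependencies listed earlier are importable at all. The following is a minimal sketch, not part of Stetl; the helper name check_dependencies and the mapping from package names to import names (for example GDAL to osgeo) are choices made for this illustration.

```python
import importlib.util

# Import names of the dependencies listed in the Dependencies section.
# The helper and this mapping are part of the sketch, not of Stetl itself.
DEPENDENCIES = {
    "GDAL": "osgeo",
    "psycopg2": "psycopg2",
    "lxml": "lxml",
    "Jinja2": "jinja2",
    "Deprecated": "deprecated",
}


def check_dependencies(modules=DEPENDENCIES):
    """Return {package name: True/False} depending on importability."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }


if __name__ == "__main__":
    for package, found in check_dependencies().items():
        print(f"{package:12} {'ok' if found else 'MISSING'}")
```

This only checks that the modules can be found; it does not verify versions or that the native GDAL, libxml2 and libxslt libraries match.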
Run Unit Tests¶
You can run unit tests to completely verify your installation. First install some extra packages:
pip install -r requirements-dev.txt
Then run the tests using nose2.
nose2
Install with Docker¶
The fastest way to use Stetl is via Docker. The Stetl Docker Image is lightweight, compressed just over 100MB, based on a Debian “slim” Docker Image.
Your environment needs to be set up to use Docker and you probably want to use some tooling like Vagrant. The author uses a combination of VirtualBox with Ubuntu and Vagrant on Mac OSX to run Docker, but this is a bit out of scope here.
Assuming you have a working Docker environment, there are two ways to install Stetl with Docker:
- build a Docker image yourself using the Dockerfile in https://github.com/geopython/stetl/blob/master/Dockerfile
- use a prebuilt public Stetl Docker image from Docker Hub: https://hub.docker.com/r/geopython/stetl
When rebuilding you can add build arguments for your environment, defaults:
ARG TIMEZONE="Europe/Amsterdam"
ARG LOCALE="en_US.UTF-8"
ARG ADD_PYTHON_DEB_PACKAGES=""
ARG ADD_PYTHON_PIP_PACKAGES=""
For example building with extra Python packages, building your local Docker Image:
docker build --build-arg ADD_PYTHON_DEB_PACKAGES="python-requests python-tz" -t geopython/stetl:local .
docker build --build-arg ADD_PYTHON_PIP_PACKAGES="scikit-learn==0.18 influxdb" -t geopython/stetl:local .
Or you may extend the Stetl Dockerfile with your own Dockerfile.
For running Stetl using Docker see Using Docker.
Upgrade to Python 3¶
Stetl development started in Python 2. With PEP 373 the EOL of Python 2.7 was announced and Python 2 will not be officially supported after 2020. Stetl was therefore upgraded to Python 3.
Python 3¶
Work started early 2019 to upgrade Stetl
from Python 2 to Python 3. The last version of Stetl
that supports Python 2 is version 1.3. This version might receive quick fixes and updates, but
users are encouraged to upgrade to Stetl version 2 or higher and thus use Python 3.
For the full discussion on the Python 2 to Python 3 migration: see the conversation in pull request #81 within the GitHub repository.
Important changes for developers¶
Python 2 and 3 are very similar, but there are a couple of important changes that developers need to keep in mind and are worth mentioning:
- Stetl 2 supports Python 3.6 and higher (3.4 and 3.5 were dropped), so f-strings are available
- Python 3 uses Unicode strings, meaning encoding/decoding is a bit different
- StringIO and cStringIO were moved around
- slight syntax change on calling next() for iterators
- update on import statements
- differences in urllib to make HTTP calls (although issue 80 might change it to the requests library).
Important changes for users¶
The specification of the Stetl tool chain uses a configuration file. You can use the Inputs, Filters, and Outputs that are provided by Stetl, or write your own. If you use Stetl Components in your configuration, you must specify the stetl. package prefix in the class specification. For example before Stetl version 2 the input XML file was specified as
[input_xml_file]
class = inputs.fileinput.XmlFileInput
file_path = input/cities.xml
for Stetl version 2 this is changed to
[input_xml_file]
class = stetl.inputs.fileinput.XmlFileInput
file_path = input/cities.xml
Note the extra stetl. part in the class specification.
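For configs with many sections, adding the prefix can be scripted. The sketch below is an illustration only: the function add_stetl_prefix is hypothetical, and it assumes that built-in components are exactly those whose class path starts with inputs., filters. or outputs., so that user-defined classes are left alone.

```python
import re

# Matches "class = inputs...", "class = filters..." or "class = outputs..."
# lines that lack the "stetl." prefix. Restricting the match to these three
# package names is an assumption, so that user-defined classes stay untouched.
_CLASS_LINE = re.compile(
    r"^([ \t]*class[ \t]*=[ \t]*)(inputs|filters|outputs)\.", re.MULTILINE
)


def add_stetl_prefix(config_text):
    """Return config_text with 'stetl.' prepended to built-in class paths."""
    return _CLASS_LINE.sub(r"\1stetl.\2.", config_text)


old_config = """[input_xml_file]
class = inputs.fileinput.XmlFileInput
file_path = input/cities.xml
"""
print(add_stetl_prefix(old_config))
```

Running the function on an already upgraded config leaves it unchanged, because a class path starting with stetl. no longer matches the pattern.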
Background¶
The text below gives some introduction to ETL, the rationale why Stetl was developed and where and how it attempts to fit in.
Problem¶
Data conversion combined with model and coordinate transformation from a source to a target datastore (files, databases) is a recurring task in almost every geospatial project. This process is often referred to as ETL (Extract Transform Load). Source and/or target geo-data formats are increasingly encoded as GML (Geography Markup Language), either as flat records, so-called Simple Features, or, more and more, using domain-specific, object-oriented OGC/ISO GML Application Schemas.
GML Application Schemas are for example heavily used within the INSPIRE Data Harmonization effort in Europe. Many National Mapping and Cadastral Agencies (NMCAs) use GML-encoded datasets as their bulk format for download and exchange and via Web Feature Services (WFSs). As geospatial professionals we are often confronted with ETL-tasks involving (complex) GML or worse: “GML-lookalikes”, which are often XML Schemas embedded with GML-namespaced elements.
Luckily, in many cases GDAL/OGR, the Swiss Army Knife for geo-data conversion, can do the job. If “ogr2ogr” sounds like gibberish to you, check out http://gdal.org ! But when complex, some say rich, GML Application Schemas are involved, data conversion can be a daunting task when GDAL/OGR alone is not sufficient. Firstly, often complex data model transformations have to be applied.
In addition we may be confronted with the bulkiness of GML:
- Megabyte/Gigabyte-files.
- Deeply nested elements where the nuggets, the actual attribute values, reside.
- Trees of .zip files and possibly more nasty surprises once we have unboxed a GML-delivery.
- High resource consumption in memory and CPU and long processing hours, up to complete machine-lockup, can be the side-effects of naive GML-processing.
Existing (partial) solutions¶
Within the FOSS4G world we can resort to high level, GUI-based, ETL-tools such as GeoKettle, Humboldt tools and Talend GeoSpatial. These are very powerful tools by themselves, check them out as well. Some of us, like the author, like to stay closer to GDAL/OGR and XSLT for model transforms, some command line tools and a bit of Python scripting, but without having to write a complete, ad-hoc ETL-program each time. This is the space where Stetl tries to fit in, so read on.
We already have great FOSS tools for XML/GML parsing, data-conversion and model-transformation like GDAL/OGR (ogr2ogr!), XSLT (Extensible Stylesheet Language Transformations, for transforming XML) and native XML-parsing libraries like libxml2. Each individual tool/library is extremely powerful and performant by itself. But we would like to combine these tools. Take for example flat, national address data in a PostGIS database that we need to transform to multiple INSPIRE Application Schema GML files. Each individual FOSS tool can handle part of the ETL: ogr2ogr for converting from PostGIS (including coordinate transformation) into simple feature GML, XSLT (xsltproc/libxslt) to transform the resulting flat GML to rich INSPIRE GML. But with millions of addresses we cannot simply use a single GML memory datastructure (DOM) or a single intermediate GML-file.
Stetl: Python, streaming and configuration¶
Add Python and a configuration convention to this equation and we have Stetl: Streaming ETL. Stetl is a lightweight, geospatial ETL (Extract Transform Load) framework written in Python. ETL-processing with Stetl is driven from a configuration file. Within a Stetl configuration file a chain of ETL-processing modules is declared through which the data flows (“streams”). A module may be an input, filter or output module. Modules have input and output data types declared such that only compatible modules can be connected. However, Stetl does not define a grand internal data structure to which all data is mapped as many ETL-tools do. Data formats are kept close to the external tools that Stetl uses.
Stetl comes with pre-defined modules for:
- GML-parsing
- XSLT processing
- XSD Validation
- PostGIS/OGR input and output
- GML-splitting
- … and many more.
Stetl calls on the above tools like OGR, libxslt and libxml2 via their native interfaces. Stetl is even more speed-optimized as no intermediate file-storage is used: we use other means such as native string buffers. For example large XML/GML-files can be split into manageable documents and streamed into an XSLT filter module. Stetl-modules are of course extensible and can be user-defined. Reusable ETL-configurations invoked through parameterized commandline scripts can be defined without programming.
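The idea of splitting a large XML file into manageable pieces without building one big DOM can be illustrated with the Python standard library. This is a sketch of the general streaming technique, not Stetl's implementation (which relies on lxml); the function stream_features, the element name city and the batch size are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def stream_features(source, feature_tag, batch_size=2):
    """Yield lists of serialized feature elements, batch_size at a time."""
    batch = []
    # iterparse reports each element as soon as its end tag has been read,
    # so the whole document never has to be held in memory at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == feature_tag:
            batch.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # drop the children of the processed element
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # the last, possibly smaller, batch


SAMPLE = (
    b"<cities>"
    b"<city><name>Amsterdam</name></city>"
    b"<city><name>Bonn</name></city>"
    b"<city><name>Rome</name></city>"
    b"</cities>"
)

for features in stream_features(io.BytesIO(SAMPLE), "city"):
    print(len(features), "feature(s) in this batch")
```

Each batch could then be assembled into a small document and handed to a transformation step, which is the role the filter modules play in a Stetl chain.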
Stetl evolved from and is used within the INSPIRE-FOSS project (http://inspire-foss.org). Here for example, Dutch national addresses (BAG) were transformed into INSPIRE Addresses GML (files and database). Special Stetl integration modules are available to extract and publish data from/to a deegree WFS and deegree “Blobstore-database”. The combination Stetl/deegree is an ideal setup for INSPIRE deployments.
Other Dutch national datasets like Top10NL and BGT (Dutch topo vector datasets) have been completely and successfully transformed. Work is in progress to use Stetl as the basis for NLExtract (http://nlextract.nl), a project that provides ETL tools for Dutch open geo-datasets. Stetl development is now (April 2013) in an initial phase and takes place in GitHub. The current version is workable but we hope to present a v1.0 at FOSS4G with more documentation and as a standard Python Package via PyPi. The main link is: http://stetl.org (now links to GitHub). To get started find some basic examples here: https://github.com/geopython/stetl/tree/master/examples/basics.
Using Stetl¶
This section explains how to use Stetl for your ETL. It assumes Stetl is installed and you are able to run the examples. It may be useful to study some of the examples, especially the core ones found in the examples/basics directory. These examples start numbering from 1, building up more complex ETL cases like (INSPIRE) transformation using Jinja2 Templating.
In addition there are example cases like the Dutch Topo map (Top10NL) ETL in the examples/top10nl directory.
The core concepts of Stetl remain pretty simple: an input resource like a file or a database table is mapped to an output resource (also a file, a database, a remote HTTP server etc) via one or more filters. The input, filters and output are connected in a pipeline called a processing chain or Chain. This is a bit similar to a current in electrical engineering: an input flows through several filters, that each modify the current. In our case the current is (geospatial) data. Stetl design follows the so-called Pipes and Filters Architectural Pattern.
Stetl Config¶
Stetl components (Inputs, Filters, Outputs) and their interconnection (the Pipeline/Chain) are specified in a Stetl config file. The file format follows the Python .ini file-format.
To illustrate, let’s look at the example 2_xslt. This example takes the input file input/cities.xml and transforms this file to a valid GML file called output/gmlcities.gml. The Stetl config file looks as follows.
[etl]
chains = input_xml_file|transformer_xslt|output_file
[input_xml_file]
class = stetl.inputs.fileinput.XmlFileInput
file_path = input/cities.xml
[transformer_xslt]
class = stetl.filters.xsltfilter.XsltFilter
script = cities2gml.xsl
[output_file]
class = stetl.outputs.fileoutput.FileOutput
file_path = output/gmlcities.gml
Most of the sections in this ini-file specify a Stetl component: an Input, Filter or Output component. Each component is specified by its (Python) class and per-component specific parameters. For example [input_xml_file] uses the class stetl.inputs.fileinput.XmlFileInput, reading and parsing the file input/cities.xml specified by the file_path property. [transformer_xslt] is a Filter that applies XSLT with the script file cities2gml.xsl that is in the same directory. The [output_file] component specifies the output, in this case a file.
These components are coupled in a Stetl Chain using the special .ini section [etl]. That section specifies one or more processing chains. Each Chain is specified by the names of the component sections, interconnected using the Unix pipe symbol “|”.
So the above Chain is input_xml_file|transformer_xslt|output_file. The names of the component sections like [input_xml_file] are arbitrary.
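To make the relation between the [etl] section and the component sections concrete, the sketch below reads the example config with Python's standard configparser and lists the class behind each step. It is an illustration only and not how Stetl itself builds a Chain; the helper describe_chain is hypothetical and handles a single linear chain.

```python
import configparser

CONFIG = """
[etl]
chains = input_xml_file|transformer_xslt|output_file

[input_xml_file]
class = stetl.inputs.fileinput.XmlFileInput
file_path = input/cities.xml

[transformer_xslt]
class = stetl.filters.xsltfilter.XsltFilter
script = cities2gml.xsl

[output_file]
class = stetl.outputs.fileoutput.FileOutput
file_path = output/gmlcities.gml
"""


def describe_chain(config_text):
    """Return [(section name, class path), ...] for one linear chain."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # The chain is a list of section names joined by the pipe symbol.
    names = [name.strip() for name in parser["etl"]["chains"].split("|")]
    return [(name, parser[name]["class"]) for name in names]


for name, class_path in describe_chain(CONFIG):
    print(f"{name} -> {class_path}")
```

Because the section names are only labels, renaming [input_xml_file] works as long as the name in the chains value is changed accordingly.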
Note: since v1.1.0 a datastream can be split (see below) to multiple Outputs using () like:
[etl]
chains = input_xml_file|transformer_xslt|(output_gml_file)(output_wfs)
Or multiple Input streams can be combined/merged like:
[etl]
chains = (input_http_api_1) (input_http_api_2) | data_transformer | output_db
It is even possible to have both Splitting and Merging together with filtering:
[etl]
chains = (input_http_api_1 | cleaner_filter) (input_http_api_2) | data_transformer | (output_db) (output_file)
Note: since Stetl version 2 the class names of Stetl components must start with the stetl. package prefix. This is not necessary when you write your own components (see example 7).
Configuring Components¶
Most Stetl Components, i.e. inputs, filters, outputs, have properties that can be configured within their respective [section] in the config file. But what are the possible properties, values and defaults? This is documented within each Component class using the @Config decorator, much like the standard Python @property, only with some more intelligence for type conversions, defaults, required presence and documentation. It is loosely based on https://wiki.python.org/moin/PythonDecoratorLibrary#Cached_Properties and Bruce Eckel’s http://www.artima.com/weblogs/viewpost.jsp?thread=240845 with a fix/hack for Sphinx documentation.
See for example the stetl.inputs.fileinput.FileInput documentation.
For class authors: this information is added via Python Decorators, much like @property. The stetl.component.Config class is used to define read-only properties for each Component instance. For example,
class FileInput(Input):
    """
    Abstract base class for specific FileInputs, use derived classes.
    """

    # Start attribute config meta
    # Applying Decorator pattern with the Config class to provide
    # read-only config values from the configured properties.

    @Config(ptype=str, default=None, required=False)
    def file_path(self):
        """
        Path to file or files or URLs: can be a dir or files or URLs
        or even multiple, comma separated. For URLs only JSON is supported now.
        """
        pass

    @Config(ptype=str, default='*.[gxGX][mM][lL]', required=False)
    def filename_pattern(self):
        """
        Filename pattern according to Python ``glob.glob`` for example:
        '\\*.[gxGX][mM][lL]'
        """
        pass

    @Config(ptype=bool, default=False, required=False)
    def depth_search(self):
        """
        Should we recurse into sub-directories to find files?
        """
        pass

    # End attribute config meta

    def __init__(self, configdict, section, produces):
        Input.__init__(self, configdict, section, produces)

        # Create the list of files to be used as input
        self.file_list = Util.make_file_list(self.file_path, None, self.filename_pattern, self.depth_search)
This defines three configurable properties for the class FileInput. Each @Config has three parameters: ptype, the Python type (str, list, dict, bool, int), default (default value if not present) and required (whether the property is mandatory or optional).
Within the config one can set specific config values like,
[input_xml_file]
class = stetl.inputs.fileinput.XmlFileInput
file_path = input/cities.xml
This automagically assigns file_path to self.file_path without any custom code and assigns the default value to filename_pattern. Automatic checks are performed: whether a property declared with required=True is present and whether its value has the declared type, here a string. In some cases type conversions may be applied e.g. when type is dict or list. The value is guarded against being overwritten, and the docstrings will appear in the auto-generated documentation, each entry prepended with a CONFIG tag.
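How a decorator can turn an empty method into a read-only, type-checked config property may be easier to see in a stripped-down form. The class SimpleConfig below is a simplified stand-in written for this explanation; it is not the code of stetl.component.Config. It assumes the component keeps its section values in a dict attribute called cfg and only converts bool and simple scalar types.

```python
class SimpleConfig:
    """Simplified stand-in for a @Config-style decorator (not Stetl's code)."""

    def __init__(self, ptype=str, default=None, required=False):
        self.ptype = ptype
        self.default = default
        self.required = required

    def __call__(self, fget):
        # Used as a decorator: remember the property name and docstring.
        self.name = fget.__name__
        self.__doc__ = fget.__doc__
        return self

    def __get__(self, instance, owner):
        if instance is None:
            return self
        raw = instance.cfg.get(self.name)
        if raw is None:
            if self.required:
                raise ValueError(f"missing required property: {self.name}")
            return self.default
        if self.ptype is bool:
            return str(raw).strip().lower() in ("1", "true", "yes", "on")
        return self.ptype(raw)

    def __set__(self, instance, value):
        # Defining __set__ makes the property read-only.
        raise AttributeError(f"{self.name} is read-only")


class DemoInput:
    """Hypothetical component holding the values of its config section."""

    def __init__(self, cfg):
        self.cfg = cfg

    @SimpleConfig(ptype=str, required=True)
    def file_path(self):
        """Path to the input file."""

    @SimpleConfig(ptype=bool, default=False)
    def depth_search(self):
        """Should we recurse into sub-directories?"""


demo = DemoInput({"file_path": "input/cities.xml", "depth_search": "true"})
print(demo.file_path, demo.depth_search)
```

The decorated method bodies stay empty: they only carry the name and the docstring, while the descriptor supplies the value from the configuration.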
Running Stetl¶
The above ETL spec can be found in the file etl.cfg. Now Stetl can be run, simply by typing
stetl -c etl.cfg
Stetl will parse etl.cfg, create all Components by their class name, link them in a Chain and execute that Chain. Of course this example is very trivial, as we could just call XSLT without Stetl. But it becomes interesting with more complex transformations.
Suppose we want to convert the resulting GML to an ESRI Shapefile. As we cannot use GDAL ogr2ogr on the input file, we need to combine XSLT and ogr2ogr. See example 3_shape. Now we replace the output by using stetl.outputs.ogroutput.Ogr2OgrOutput, which can execute any ogr2ogr command, converting whatever it gets as input from the previous Filter in the Chain.
[etl]
chains = input_xml_file|transformer_xslt|output_ogr_shape
[input_xml_file]
class = stetl.inputs.fileinput.XmlFileInput
file_path = input/cities.xml
[transformer_xslt]
class = stetl.filters.xsltfilter.XsltFilter
script = cities2gml.xsl
# The ogr2ogr command-line. May be split over multiple lines for readability.
# Backslashes not required in that case.
[output_ogr_shape]
class = stetl.outputs.ogroutput.Ogr2OgrOutput
temp_file = temp/gmlcities.gml
ogr2ogr_cmd = ogr2ogr
    -overwrite
    -f "ESRI Shapefile"
    -a_srs epsg:4326
    output/gmlcities.shp
    temp/gmlcities.gml
Using Docker¶
The most convenient way to run Stetl is via Docker. See the installation instructions at Install with Docker. A full example can be viewed in the Smart Emission project: https://github.com/Geonovum/smartemission/tree/master/etl.
In the simplest case you run a Stetl Docker container from your own built image or from the one provided on Docker Hub, geopython/stetl:<version>, as follows (latest version):
sudo docker run -v <host dir>:<container dir> -w <work dir> geopython/stetl:latest stetl <any Stetl arguments>
For example within the current directory you may have an etl.cfg Stetl file:
WORK_DIR=`pwd`
sudo docker run -v ${WORK_DIR}:${WORK_DIR} -w ${WORK_DIR} geopython/stetl:latest stetl -c etl.cfg
# or leaner
sudo docker run --rm -v $(pwd):/work -w /work geopython/stetl:latest stetl -c etl.cfg
A more advanced setup would be (network-)linking to a PostGIS Docker image like kartoza/postgis:
# First run Postgis, remains running,
sudo docker run --name postgis -d -t kartoza/postgis:9.4-2.1
# Then later run Stetl
STETL_ARGS="-c etl.cfg -a local.args"
WORK_DIR="`pwd`"
sudo docker run --name stetl --link postgis:postgis -v ${WORK_DIR}:${WORK_DIR} -w ${WORK_DIR} geopython/stetl:latest stetl ${STETL_ARGS}
The last example is used within the SmartEmission project, there in more detail and keeping all dynamic data (like the PostGIS DB), your Stetl config, results and logs on the host. For PostGIS see: https://github.com/Geonovum/smartemission/tree/master/services/postgis and for Stetl see: https://github.com/Geonovum/smartemission/tree/master/etl.
Even better is to use docker-compose.
Stetl Integration¶
Note: one can also run Stetl via its main ETL class: stetl.etl.ETL. This may be useful for integrations in for example Python programs or even OGC WPS servers (planned).
Reusable Stetl Configs¶
What we saw in the last example is that it is hard to reuse this etl.cfg when we have for example a different input file or want to map to different output files. For this Stetl supports config parameter substitution.
Dynamic or secret (e.g. database credentials) parameters in etl.cfg are declared symbolically and substituted at runtime via the commandline or the OS environment.
A variable is declared between curly brackets like {out_xml}. See example 6_cmdargs.
[etl]
chains = input_xml_file|transformer_xslt|output_file
[input_xml_file]
class = stetl.inputs.fileinput.XmlFileInput
file_path = {in_xml}
[transformer_xslt]
class = stetl.filters.xsltfilter.XsltFilter
script = {in_xsl}
[output_file]
class = stetl.outputs.fileoutput.FileOutput
file_path = {out_xml}
Note the symbolic input, xsl and output files. We can now perform the ETL using the stetl -a option in two basic ways. One, passing the arguments on the commandline, like
stetl -c etl.cfg -a "in_xml=input/cities.xml in_xsl=cities2gml.xsl out_xml=output/gmlcities.gml"
Two, passing the arguments in a properties file, here called etl.args (the name of the suffix .args is not significant, could be .env as well).
stetl -c etl.cfg -a etl.args
Where the content of the etl.args properties file is:
# Arguments in properties file
in_xml=input/cities.xml
in_xsl=cities2gml.xsl
out_xml=output/gmlcities.gml
It is also possible to specify multiple -a arguments. This covers situations where default.args contains all default arguments, while my.args or explicit -a settings override the default values in default.args. Overriding is determined by the order of the -a arguments. Examples:
stetl -c etl.cfg -a default.args -a my.args
stetl -c etl.cfg -a default.args -a "db_user=docker db_password=pass"
stetl -c etl.cfg -a default.args -a db_user=docker -a db_password=pass
It is also possible to pass these key/value pairs via OS Environment variables. This is especially handy in Docker-based deployments like Docker Compose and Kubernetes. In this case the variable names need to be prepended with STETL_ or stetl_ as to not mix-up with other non-related OS-env vars. A mixture of commandline args (file) and environment vars is possible. The rule is that OS Environment variables always override/overrule arguments specified with -a option(s).
For example, the above args could also be passed as follows:
export stetl_in_xml="input/cities.xml"
export stetl_in_xsl="cities2gml.xsl"
export stetl_out_xml="output/gmlcities.gml"
stetl -c etl.cfg
or only override the input file name in_xml from etl.args:
export stetl_in_xml="input/cities2.xml"
stetl -c etl.cfg -a etl.args
or even with multiple -a args:
export stetl_in_xml="input/cities2.xml"
stetl -c etl.cfg -a etl.args -a my.args
This makes an ETL chain highly reusable. A very elaborate Stetl config with parameter substitution can be seen in the Top10NL ETL.
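The precedence rules described above can be summarized in a few lines of Python. This is a sketch of the described behaviour under stated assumptions, not Stetl's actual substitution code: the functions parse_pairs, resolve_args and substitute are hypothetical, later -a sources are assumed to override earlier ones, values are assumed not to contain spaces, and the part of an environment variable name after the STETL_ or stetl_ prefix is used unchanged as the key.

```python
import re


def parse_pairs(text):
    """Parse key=value pairs from an -a string or a properties file body."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        for token in line.split():
            key, _, value = token.partition("=")
            pairs[key] = value
    return pairs


def resolve_args(arg_texts, environ):
    """Merge -a sources in order, then let environment variables override."""
    values = {}
    for text in arg_texts:
        values.update(parse_pairs(text))  # assumption: later -a wins
    for name, value in environ.items():
        if name.startswith(("STETL_", "stetl_")):
            values[name[len("stetl_"):]] = value  # environment always wins
    return values


def substitute(config_text, values):
    """Replace each {name} in config_text by its resolved value."""
    return re.sub(r"\{(\w+)\}", lambda m: values[m.group(1)], config_text)


args = resolve_args(
    ["in_xml=input/cities.xml out_xml=output/gmlcities.gml"],
    {"stetl_in_xml": "input/cities2.xml"},
)
print(substitute("file_path = {in_xml}", args))
```

In a real deployment the second argument would be os.environ; passing a plain dict keeps the example independent of the machine it runs on.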
Connection Compatibility¶
During ETL Chain processing Components typically pass data to a next stetl.component.Component. A stetl.filter.Filter Component both consumes and produces data, Inputs produce data and Outputs only consume data.
Data and status flows as stetl.packet.Packet objects between the Components. The type of the data in these Packets needs to be compatible only between two coupled Components.
Stetl does not define one unifying data structure, but leaves this to the Components themselves.
Each Component provides the type of data it consumes (Filters, Outputs) and/or produces (Inputs, Filters). This is indicated in its class definition using the consumes and produces object constructor parameters. Some Components can produce and/or consume multiple data types, like a single stream of records or a record array. In those cases the produces or consumes parameter can be a list (array) of data types.
During Chain construction Stetl will check for compatible formats when connecting Components. If one of the formats is a list of formats, the actual format is determined by:
- explicit setting: the actual input_format and/or output_format is set in the Component .ini configuration
- no setting provided: the first format in the list is taken as default
Stetl will only check if these input and output-formats for connecting Components are compatible when constructing a Chain.
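The selection and check described above can be sketched as follows. This is an illustration of the rules as stated in this section, not Stetl's implementation; the function names are hypothetical, and treating the catch-all format any as compatible with every other format is an assumption of the sketch.

```python
def as_list(formats):
    """Accept a single format name or a list of format names."""
    return list(formats) if isinstance(formats, (list, tuple)) else [formats]


def pick_format(declared, configured=None):
    """Choose the actual format: explicit setting first, else first declared."""
    options = as_list(declared)
    if configured is None:
        return options[0]
    if configured not in options:
        raise ValueError(f"{configured!r} is not one of {options}")
    return configured


def compatible(produces, consumes, output_format=None, input_format=None):
    """Check whether a producing and a consuming Component can be connected."""
    out_fmt = pick_format(produces, output_format)
    in_fmt = pick_format(consumes, input_format)
    # Assumption: 'any' matches every format.
    return out_fmt == in_fmt or "any" in (out_fmt, in_fmt)


# A producer offering two formats defaults to the first one, "record" ...
print(compatible(["record", "record_array"], "record_array"))
# ... unless output_format is set explicitly in the configuration.
print(compatible(["record", "record_array"], "record_array",
                 output_format="record_array"))
```

The example shows why an explicit output_format or input_format setting matters: without it the first format in the list is taken, which may not be the one the neighbouring Component expects.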
The following data types are currently symbolically defined in the stetl.packet.FORMAT class:
- any - ‘catch-all’ type, may be any of the types below.
- etree_doc - a complete in-memory XML DOM structure using the lxml etree
- etree_element - each Packet contains a single DOM Element (usually a Feature) in lxml etree format
- etree_feature_array - each Packet contains an array of DOM Elements (usually Features) in lxml etree format
- geojson_feature - as struct but following naming conventions for a single Feature according to the GeoJSON spec: http://geojson.org
- geojson_collection - as struct but following naming conventions for a FeatureCollection according to the GeoJSON spec: http://geojson.org
- gdal_vsi_path - a single file path in the GDAL Virtual File System (VSI) format (via GDAL Python bindings)
- ogr_feature - a single Feature object from an OGR source (via GDAL Python bindings)
- ogr_feature_array - a Python list (array) of Feature objects from an OGR source
- record - a Python dict (hashmap)
- record_array - a Python list (array) of dict
- string - a general string
- struct - a JSON-like generic tree structure
- xml_doc_as_string - a string representation of a complete XML document
- xml_line_stream - each Packet contains a line (string) from an XML file or string representation (DEPRECATED)
Many Components, in particular Filters, are able to transform data formats. For example the XmlElementStreamerFileInput can produce an etree_element, a subsequent XmlAssembler can create small in-memory etree_docs that can be fed into an XsltFilter, which outputs a transformed etree_doc. The type any is a catch-all, used for example for printing any object to standard output in the stetl.outputs.standardoutput.StandardOutput Component.
An etree_element is also useful for processing single features.
Starting with Stetl 1.0.7 the stetl.filters.formatconverter.FormatConverter class provides a Stetl Filter that allows almost any conversion between otherwise incompatible Components.
TODO: the Packet typing system is still under constant review and extension. Soon it will be possible to add new data types and converters. We have deliberately chosen not to define a single internal datatype like a “Feature”, both for flexibility and performance reasons.
Multiple Chains¶
Usually a complete ETL will require multiple steps/commands. For example we may need to create a database and tables, or empty existing tables. We may also need postprocessing, like removing duplicates in a table. In order to have repeatable/reusable ETL without any manual steps, we can specify multiple Chains within a single Stetl config. The syntax: Chains are separated by commas (steps are still separated by pipe symbols).
Chains are executed in order. We can even reuse the specified components from within the same file. Each will have a separate instance within a Chain.
For example in the Top10NL example we see three Chains:
[etl]
chains = input_sql_pre|schema_name_filter|output_postgres,
input_big_gml_files|xml_assembler|transformer_xslt|output_ogr2ogr,
input_sql_post|schema_name_filter|output_postgres
Here the Chain input_sql_pre|schema_name_filter|output_postgres sets up a PostgreSQL schema and creates tables. input_big_gml_files|xml_assembler|transformer_xslt|output_ogr2ogr does the actual ETL and input_sql_post|schema_name_filter|output_postgres does some PostgreSQL postprocessing.
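The comma/pipe syntax can be illustrated with a small parser. This is a sketch under stated assumptions, not the parsing code Stetl uses: `parse_chains` is a hypothetical helper and it deliberately ignores the `()` notation for splitting and merging.

```python
def parse_chains(chains_value):
    """Split a `chains` value into Chains, each a list of section names."""
    chains = []
    for chain_str in chains_value.split(","):        # Chains: comma-separated
        steps = [step.strip() for step in chain_str.split("|")]  # steps: pipes
        steps = [step for step in steps if step]
        if steps:
            chains.append(steps)
    return chains


chains_value = """input_sql_pre|schema_name_filter|output_postgres,
    input_big_gml_files|xml_assembler|transformer_xslt|output_ogr2ogr,
    input_sql_post|schema_name_filter|output_postgres"""

# Chains are listed, and would be executed, in the order they appear.
for number, steps in enumerate(parse_chains(chains_value), start=1):
    print(number, " -> ".join(steps))
```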
Chain Splitting¶
In some cases we may want to split processed data to multiple Filters or Outputs. For example to produce output files in multiple formats like GML, GeoJSON etc, or to publish converted (Filtered) data to multiple remote services (SOS, SensorThings API), or just for simple debugging to a target Output and StandardOutput.
See issue https://github.com/geopython/stetl/issues/35 and the Chain Split example.
Here the Chains are split by using () in the ETL Chain definition:
# Transform input xml to valid GML file using an XSLT filter and pass to multiple outputs.
# Below are two Chains: simple Output splitting and splitting to 3 sub-Chains at Filter level.
[etl]
chains = input_xml_file | transformer_xslt |(output_file)(output_std),
input_xml_file | (transformer_xslt|output_file) (output_std) (transformer_xslt|output_std)
[input_xml_file]
class = stetl.inputs.fileinput.XmlFileInput
file_path = input/cities.xml
[transformer_xslt]
class = stetl.filters.xsltfilter.XsltFilter
script = cities2gml.xsl
[output_file]
class = stetl.outputs.fileoutput.FileOutput
file_path = output/gmlcities.gml
[output_std]
class = stetl.outputs.standardoutput.StandardOutput
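The effect of the `(output_file)(output_std)` notation can be pictured as a fan-out step that hands every data item to each of its children. The sketch below is a toy illustration of that idea and not Stetl's own implementation; the class names `CollectingOutput` and `FanOut` are hypothetical.

```python
class CollectingOutput:
    """Hypothetical stand-in for an Output: remembers what it was given."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def invoke(self, data):
        self.received.append(data)


class FanOut:
    """Hands each incoming item to every child, in the order listed."""

    def __init__(self, children):
        self.children = list(children)

    def invoke(self, data):
        for child in self.children:
            child.invoke(data)


output_file = CollectingOutput("output_file")
output_std = CollectingOutput("output_std")
splitter = FanOut([output_file, output_std])

for item in ["<City>Amsterdam</City>", "<City>Utrecht</City>"]:
    splitter.invoke(item)

# Both children have seen the same items, in the same order.
print(output_file.received == output_std.received)
```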
Chain Merging¶
In some cases we may want to merge (combine, join) multiple input streams.
For example to harvest data from multiple HTTP REST APIs, or to realize a Filter that integrates data from two data-sources.
See issue https://github.com/geopython/stetl/issues/59 and the Chain Merge example.
Here the Chains are merged by using () notation in the ETL Chain definition, possibly even combined with Splitting Outputs:
# Merge two inputs into single Filter.
[etl]
chains = (input_1) (input_2)|transformer_xslt|output_std,
(input_1) (input_2)|transformer_xslt|(output_file)(output_std)
[input_1]
class = stetl.inputs.fileinput.XmlFileInput
file_path = input1/cities.xml
[input_2]
class = stetl.inputs.fileinput.XmlFileInput
file_path = input2/cities.xml
[transformer_xslt]
class = stetl.filters.xsltfilter.XsltFilter
script = cities2gml.xsl
[output_file]
class = stetl.outputs.fileoutput.FileOutput
file_path = output/gmlcities.gml
[output_std]
class = stetl.outputs.standardoutput.StandardOutput
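Conceptually, merging turns several input streams into one stream for the downstream Filter. The following sketch shows that idea with plain Python generators. It is not Stetl's implementation, and draining the inputs one after another is an assumption made for simplicity; `merge` and `read_cities` are hypothetical names.

```python
from itertools import chain


def merge(*inputs):
    """Combine several input streams into one, draining them in turn."""
    return chain.from_iterable(inputs)


def read_cities(names):
    """Hypothetical Input: yields one record per city name."""
    for name in names:
        yield {"name": name}


input_1 = read_cities(["Amsterdam", "Utrecht"])
input_2 = read_cities(["Groningen"])

# A single downstream step (here: upper-casing the name) sees both streams.
merged = [record["name"].upper() for record in merge(input_1, input_2)]
print(merged)
```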
Cases¶
This chapter lists various cases/projects where Stetl is used.
NLExtract¶
NLExtract https://nlextract.nl is a development project that aims to provide ETL-tooling for all Dutch Open Geo-Datasets, in particular the country wide “Key Registries” (Dutch: Basisregistraties) like Cadastral Parcels (BRK), Topography (BRT+BGT) and Buildings and Addresses (BAG). These datasets are provided as XML/GML. The ETL mostly provides a transformation to PostGIS. For all Key Registries, except for the BAG, Stetl is used, basically as-is, without extra (Python) programming. See also the NLExtract GitHub: https://github.com/nlextract/NLExtract
Addresses and Buildings (BAG v2)¶
BAG version 1 ETL was developed as a custom Python program. In 2021 the Dutch Kadaster released BAG version 2. For NLExtract this was the moment to switch the BAG ETL to Stetl. In particular use is made of the recent (v3.2.1+) GDAL/OGR LVBAG Driver and the Python bindings for GDAL VSI /vsizip file handling.
See https://github.com/nlextract/NLExtract/tree/master/bagv2/etl and the Stetl conf at https://github.com/nlextract/NLExtract/tree/master/bagv2/etl/conf/
Topography (BRT)¶
Includes ETL for 5 scale levels: TOP10NL through TOP1000NL.
See https://github.com/nlextract/NLExtract/tree/master/brt/top10nl/etl and the Stetl conf at e.g. https://github.com/nlextract/NLExtract/tree/master/brt/top10nl/etl/conf/
Detailed Topography (BGT)¶
This is a very large and heavy dataset based on CityGML. Stetl streaming ETL is here at its best.
See https://github.com/nlextract/NLExtract/tree/master/bgt and the Stetl conf at https://github.com/nlextract/NLExtract/blob/master/bgt/etl/conf/
Cadastral Parcels (BRK)¶
See https://github.com/nlextract/NLExtract/tree/master/brk/etl and the Stetl conf at https://github.com/nlextract/NLExtract/tree/master/brk/etl/conf
INSPIRE¶
These were the origins of Stetl. This project was sponsored by Kadaster. See https://github.com/justb4/inspire-foss. The ETL involved the transformation of Dutch Key Registries (see above) to harmonized INSPIRE GML according to the Annexes.
Addresses¶
BAG to INSPIRE Addresses Annex II Theme.
See https://github.com/justb4/inspire-foss/blob/master/etl/NL.Kadaster/Addresses/
Ordnance Survey¶
A successful Proof-of-Concept to convert Ordnance Survey Mastermap GML to PostGIS:
https://github.com/geopython/stetl/tree/master/examples/ordnancesurvey
SOSPilot¶
A SensorWeb project by Geonovum, see http://sensors.geonovum.nl.
Dutch AQ to WFS/WMS(-Time) and SOS¶
Stetl was used for ETL from Dutch Air Quality Data from RIVM (XML) to WMS(-Time), WFS and SOS. The latter was effected by SOS-Transactional publication. Documentation at http://sospilot.readthedocs.org and ETL on GitHub at https://github.com/Geonovum/sospilot/tree/master/src/rivm-lml
Dutch AQ to EAI Reporting¶
Stetl was used to generate XML-based reports for the EU EAI:
https://github.com/Geonovum/sospilot/tree/master/src/aq-report
This involved the first use of Jinja2 templating for complex XML/GML generation.
Smart Emission¶
Sensors for air quality, meteo and audio placed with citizens. Project by University of Nijmegen/Gemeente Nijmegen with participation by Geonovum. Stetl is used to transform data from a low-level sensor API to PostGIS and later on to WMS/WFS/SOS and the SensorThings API. InfluxDB output was also developed here.
This is also an example of how to use a Stetl Docker image: see https://github.com/Geonovum/smartemission/tree/master/etl
API and Code¶
Below is the API documentation for the Stetl Python code.
Main Entry Points¶
There are several entry points through which Stetl can be called. The most common is to use the commandline script bin/stetl. This command should be available after doing an install.
In some contexts like integrations you may want to call Stetl via Python. The entries are then.
Core Framework¶
The core framework is directly under the directory src/stetl. Below are the main seven classes. Their interrelation is as follows:
One or more stetl.chain.Chain objects are built from a Stetl ETL configuration via the stetl.factory.Factory class. A stetl.chain.Chain consists of a set of connected stetl.component.Component objects. A stetl.component.Component is either an stetl.input.Input, an stetl.output.Output or a stetl.filter.Filter. Data and status flows as stetl.packet.Packet objects from an stetl.input.Input via zero or more stetl.filter.Filter objects to a final stetl.output.Output.
As a trivial example: an stetl.input.Input could be an XML file, a stetl.filter.Filter could represent an XSLT file and an stetl.output.Output a PostGIS database. This is effected by specialized classes in the subpackages inputs, filters, and outputs. New in 1.1.0: stetl.Splitter to split data to multiple Outputs and stetl.Merger to combine multiple Inputs.
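The interrelation described above can be mimicked with a few toy classes. The sketch below reuses the names `Packet` and `Component` purely for readability. These are simplified re-definitions, not the Stetl classes, and the `process`/`link` mechanics as well as the three example Components are assumptions for illustration.

```python
class Packet:
    """Toy unit of data passed along the chain."""

    def __init__(self, data=None):
        self.data = data


class Component:
    """Toy Component: does its own work, then hands the Packet on."""

    def __init__(self):
        self.next = None

    def invoke(self, packet):
        return packet  # overridden by Inputs, Filters and Outputs

    def process(self, packet):
        packet = self.invoke(packet)
        return self.next.process(packet) if self.next else packet


class WordInput(Component):
    def invoke(self, packet):
        packet.data = "amsterdam"
        return packet


class UpperFilter(Component):
    def invoke(self, packet):
        packet.data = packet.data.upper()
        return packet


class CollectOutput(Component):
    def __init__(self):
        super().__init__()
        self.written = []

    def invoke(self, packet):
        self.written.append(packet.data)
        return packet


def link(*components):
    """Connect Components into a singly linked list and return the head."""
    for current, following in zip(components, components[1:]):
        current.next = following
    return components[0]


output = CollectOutput()
head = link(WordInput(), UpperFilter(), output)
head.process(Packet())
print(output.written)  # ['AMSTERDAM']
```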
class stetl.factory.Factory
Object and class Factory (Pattern). Based on: http://stackoverflow.com/questions/2226330/instantiate-a-python-class-from-a-name
class stetl.component.Component(configdict, section, consumes='none', produces='none')
Abstract Base class for all Input, Filter and Output Components.
input_format() CONFIG
- The specific input format if the consumes parameter is a list or the format to be converted to the output_format.
- type: str
- required: False
- default: None
invoke(packet)
Components override for Component-specific behaviour, typically read, filter or write actions.
class stetl.component.Config(ptype=<class 'str'>, default=None, required=False)
Decorator class to tie config values from the .ini file to object instance property values. Somewhat like the Python standard @property but with the possibility to define default values, typing and making properties required.
Each property is defined by @Config(type, default, required). Basic idea comes from: https://wiki.python.org/moin/PythonDecoratorLibrary#Cached_Properties
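To make the idea of such a decorator concrete, here is a minimal sketch of a `Config`-like descriptor. It is a simplified stand-in and not the code in `stetl.component`: the `cfg` dictionary, the `DemoInput` class, the `database` option and the boolean parsing rule are all assumptions for illustration.

```python
class Config:
    """Minimal stand-in for a Config-style decorator (not Stetl's code)."""

    def __init__(self, ptype=str, default=None, required=False):
        self.ptype = ptype
        self.default = default
        self.required = required
        self.name = None

    def __call__(self, func):
        # Used as @Config(...): remember the property name and its docstring.
        self.name = func.__name__
        self.__doc__ = func.__doc__
        return self

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        if self.name not in instance.cfg:
            if self.required:
                raise KeyError(f"missing required config value: {self.name}")
            return self.default
        value = instance.cfg[self.name]
        if self.ptype is bool:
            return str(value).lower() in ("1", "true", "yes")
        return self.ptype(value)


class DemoInput:
    def __init__(self, cfg):
        self.cfg = cfg  # hypothetical: the options of one .ini section

    @Config(ptype=str, default="localhost", required=False)
    def host(self):
        """Host name or IP address."""

    @Config(ptype=bool, default=False, required=False)
    def read_once(self):
        """Read once?"""

    @Config(ptype=str, required=True)
    def database(self):
        """Database name (hypothetical option)."""


demo = DemoInput({"read_once": "True", "database": "top10nl"})
print(demo.host, demo.read_once, demo.database)
```

The decorated method body stays empty: its name selects the option and its docstring documents it, while type, default and required-ness live in the decorator arguments.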
class stetl.chain.Chain(chain_str, config_dict)
Holder for a single invokable pipeline of Components. A Chain is basically a singly linked list of Components. Each Component executes a part of the total ETL. Data along the Chain is passed within a Packet object. The compatibility of input and output for linked Components is checked when adding a Component to the Chain.
get_by_class(clazz)
Get Component instance from Chain by class, mainly for testing. Returns: Component.
get_by_id(id)
Get Component instance from Chain by id, mainly for testing. Returns: Component.
class stetl.packet.FORMAT
Format of Packet (enumeration).
Current possible values:
- ‘none’
- ‘xml_line_stream’
- ‘line_stream’
- ‘etree_doc’
- ‘etree_element’
- ‘etree_feature_array’
- ‘xml_doc_as_string’
- ‘string’
- ‘record’
- ‘record_array’
- ‘struct’
- ‘geojson_feature’
- ‘geojson_collection’
- ‘gdal_vsi_path’
- ‘ogr_feature’
- ‘ogr_feature_array’
- ‘any’
class stetl.packet.Packet(data=None)
Represents units of (any) data and status passed along Chain of Components.
class stetl.input.Input(configdict, section, produces)
Bases: stetl.component.Component
Abstract Base class for all Input Components.
class stetl.output.Output(configdict, section, consumes)
Bases: stetl.component.Component
Abstract Base class for all Output Components.
class stetl.filter.Filter(configdict, section, consumes, produces)
Bases: stetl.component.Component
Maps input to output. Abstract base class for specific Filters.
class stetl.splitter.Splitter(config_dict, child_list)
Bases: stetl.component.Component
Component that splits a single input to multiple output Components. Use this for example to produce multiple output file formats (GML, GeoJSON etc) or to publish to multiple remote services (SOS, SensorThings API) or for simple debugging: target Output and StandardOutput.
class stetl.merger.Merger(config_dict, child_list)
Bases: stetl.component.Component
Component that merges multiple Input Components into a single Component. Use this for example to combine multiple input streams like API endpoints. The Merger will embed Child Components to which actions are delegated. A Child Component may be a sub-Chain e.g. (Input|Filter|Filter..) sequence. Hence the “next” should be coupled to the last Component in that sub-Chain with the degenerate case where the sub-Chain is a single (Input) Component. NB this Component can only be used for Inputs.
Components: Inputs¶
class stetl.inputs.dbinput.DbInput(configdict, section, produces)
Bases: stetl.input.Input
Input from any database (abstract base class).
class stetl.inputs.dbinput.PostgresDbInput(configdict, section)
Bases: stetl.inputs.dbinput.SqlDbInput
Input by querying records from a Postgres database. Input is a query, like SELECT * from mytable. Output is zero or more records as record array (array of dict) or single record (dict).
produces=FORMAT.record_array (default) or FORMAT.record
host() CONFIG
- host name or host IP-address, defaults to ‘localhost’
- type: str
- required: False
- default: localhost
password() CONFIG
- User password, defaults to ‘postgres’
- type: str
- required: False
- default: postgres
class stetl.inputs.dbinput.SqlDbInput(configdict, section)
Bases: stetl.inputs.dbinput.DbInput
Input using a query from any SQL-based RDBMS (abstract base class).
column_names() CONFIG
- Column names to populate records with. If empty taken from table metadata.
- type: str
- required: False
- default: None
read_once() CONFIG
- Read once? i.e. only do query once and stop
- type: bool
- required: False
- default: False
class stetl.inputs.dbinput.SqliteDbInput(configdict, section)
Bases: stetl.inputs.dbinput.SqlDbInput
Input by querying records from a SQLite database. Input is a query, like SELECT * from mytable. Output is zero or more records as record array (array of dict) or single record (dict).
produces=FORMAT.record_array (default) or FORMAT.record
class stetl.inputs.httpinput.ApacheDirInput(configdict, section, produces='record')
Bases: stetl.inputs.httpinput.HttpInput
Read file data from an Apache directory “index” HTML page. Uses http://stackoverflow.com/questions/686147/url-tree-walker-in-python. Each record contains file_name and file_data (other meta data like date time is too fragile over different Apache servers).
produces=FORMAT.record
file_ext() CONFIG
- The file extension for target files in Apache dir.
- type: str
- required: False
- default: xml
class stetl.inputs.httpinput.HttpInput(configdict, section, produces='any')
Bases: stetl.input.Input
Fetch data from remote services like WFS via HTTP protocol. Base class: subclasses will do datatype-specific formatting of the returned data.
produces=FORMAT.any
Add authorization from config data. Authorization scheme-specific. May be extended or overloaded for additional schemes.
Parameters: request (the HTTP Request)
auth() CONFIG
Authentication data: flat JSON-like struct dependent on auth type/schema. Only the type field is required, other fields depend on auth schema. Supported values:
type: basic|token
If the type is basic (HTTP Basic Authentication) two additional fields user and password are required. If the type is token (HTTP Token) two additional fields keyword and token are required. Any required Base64 encoding is provided by HttpInput.
Examples:
# Basic Auth
url = https://some.rest.api.com
auth = {
  type: basic,
  user: myname
  password: mypassword
  }
# Token Auth
url = https://some.rest.api.com
auth = {
  type: token,
  keyword: Bearer
  token: mytoken
  }
- type: dict
- required: False
- default: None
format_data(data)
Format response data, override in subclasses, defaults to returning original data.
parameters() CONFIG
Flat JSON-like struct of the parameters to be appended to the url.
Example (parameters require quotes):
url = http://geodata.nationaalgeoregister.nl/natura2000/wfs
parameters = {
  service : WFS,
  version : 1.1.0,
  request : GetFeature,
  srsName : EPSG:28992,
  outputFormat : text/xml; subtype=gml/2.1.2,
  typename : natura2000
  }
- type: dict
- required: False
- default: None
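How the auth and parameters settings of HttpInput could translate into an actual HTTP request is sketched below with the Python standard library. This is an illustration only, not the HttpInput code: `build_request` is a hypothetical helper, and sending the token as an `Authorization: <keyword> <token>` header is an assumption based on the `keyword: Bearer` example.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, parameters=None, auth=None):
    """Build (but do not send) a request from url, parameters and auth."""
    if parameters:
        separator = "&" if "?" in url else "?"
        url = url + separator + urlencode(parameters)
    request = Request(url)
    if auth:
        if auth["type"] == "basic":
            # HTTP Basic Authentication: base64 of "user:password".
            credentials = f"{auth['user']}:{auth['password']}".encode("utf-8")
            encoded = base64.b64encode(credentials).decode("ascii")
            request.add_header("Authorization", f"Basic {encoded}")
        elif auth["type"] == "token":
            request.add_header("Authorization",
                               f"{auth['keyword']} {auth['token']}")
        else:
            raise ValueError(f"unsupported auth type: {auth['type']}")
    return request


request = build_request(
    "http://geodata.nationaalgeoregister.nl/natura2000/wfs",
    parameters={"service": "WFS", "version": "1.1.0", "request": "GetFeature"},
    auth={"type": "token", "keyword": "Bearer", "token": "mytoken"},
)
print(request.full_url)
print(request.get_header("Authorization"))
```

No network traffic occurs here; the request object is only constructed so that the resulting URL and header can be inspected.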
class stetl.inputs.ogrinput.OgrInput(configdict, section)
Bases: stetl.input.Input
Direct GDAL OGR input via Python OGR wrapper. Via the Python API http://gdal.org/python an OGR data source is accessed and from each layer the Features are read. Each Layer corresponds to a “doc”, so for multi-layer sources the ‘end-of-doc’ flag is set after a Layer has been read.
This input can read almost any geospatial dataformat. One can use the features directly in a Stetl Filter or use a converter to e.g. convert to GeoJSON structures.
produces=FORMAT.ogr_feature or FORMAT.ogr_feature_array (all features)
data_source() CONFIG
- String denoting the OGR datasource. Usually a path to a file like “path/rivers.shp” or connection string to PostgreSQL like “PG: host=localhost dbname='rivers' user='postgres'”.
- type: str
- required: True
- default: None
source_format() CONFIG
Instructs GDAL to use driver by that name to open datasource. Not required for many standard formats that are self-describing like ESRI Shapefile.
Examples: ‘PostgreSQL’, ‘GeoJSON’ etc
- type: str
- required: False
- default: None
class stetl.inputs.ogrinput.OgrPostgisInput(configdict, section)
Bases: stetl.input.Input
Input from PostGIS via ogr2ogr command. For now hardcoded to produce an ogr GML line stream. OgrInput may be a better alternative.
Alternatives: either stetl.input.PostgresqlInput or stetl.input.OgrInput.
produces=FORMAT.xml_line_stream
in_pg_sql() CONFIG
- The input query (string) to fire.
- type: str
- required: False
- default: None
in_srs() CONFIG
- SRS (projection) (ogr2ogr -s_srs) input DB e.g. ‘EPSG:28992’.
- type: str
- required: False
- default: None
out_dimension() CONFIG
- Dimension (OGR: DIM=N) of features in output stream.
- type: str
- required: False
- default: 2
out_geotype() CONFIG
- OGR Geometry type new layer in output stream, e.g. POINT.
- type: str
- required: False
- default: None
out_gml_format() CONFIG
- GML format OGR name in output stream, e.g. ‘GML3’.
- type: str
- required: False
- default: None
Components: Filters¶
class stetl.filters.xsltfilter.XsltFilter(configdict, section)
Bases: stetl.filter.Filter
Invokes XSLT processor (via lxml) for given XSLT script on an etree doc.
consumes=FORMAT.etree_doc, produces=FORMAT.etree_doc
class stetl.filters.xmlassembler.XmlAssembler(configdict, section)
Bases: stetl.filter.Filter
Split a stream of etree DOM XML elements (usually Features) into etree DOM docs. Consumes and buffers elements until max_elements reached, will then produce an etree doc.
consumes=FORMAT.etree_element, produces=FORMAT.etree_doc
class stetl.filters.xmlelementreader.XmlElementReader(configdict, section)
Bases: stetl.filter.Filter
Extracts XML elements from a file, outputs each feature element in Packet. Parsing is streaming (no internal DOM buildup) so any file size can be handled. Use this class for your big GML files!
consumes=FORMAT.string, produces=FORMAT.etree_element
CONFIG
- Comma-separated string of XML (feature) element tag names of the elements that should be extracted and added to the output element stream.
- type: list
- required: True
- default: None
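The streaming approach of XmlElementReader can be illustrated with the standard library's incremental parser. This is a sketch and not the Stetl code: Stetl uses lxml, whereas the example uses `xml.etree.ElementTree` as a stand-in, and the helper name `stream_elements` and the sample XML are made up for the illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_elements(source, element_tags):
    """Yield elements whose (namespace-stripped) tag is in element_tags."""
    wanted = set(element_tags)
    for _event, element in ET.iterparse(source, events=("end",)):
        local_name = element.tag.rsplit("}", 1)[-1]  # drop '{namespace}' part
        if local_name in wanted:
            yield element
            # Once the consumer is done with it, drop the element's content
            # so memory use does not grow with the size of the file.
            element.clear()


xml_bytes = (
    b"<cities>"
    b"<city><name>Amsterdam</name></city>"
    b"<city><name>Utrecht</name></city>"
    b"<note>ignored</note>"
    b"</cities>"
)

names = [city.findtext("name")
         for city in stream_elements(io.BytesIO(xml_bytes), ["city"])]
print(names)
```

Each matching element is handed over as soon as its closing tag has been read, which is what allows arbitrarily large files to be processed feature by feature.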
class stetl.filters.xmlvalidator.XmlSchemaValidator(configdict, section)
Bases: stetl.filter.Filter
Validates an etree doc and prints result to log.
consumes=FORMAT.etree_doc, produces=FORMAT.etree_doc
class stetl.filters.sieve.AttrValueRecordSieve(configdict, section)
Bases: stetl.filters.sieve.Sieve
Sieves by attr/value(s) in Record Packets.
attr_name() CONFIG
- Name of attribute whose value(s) are to be sieved.
- type: str
- required: True
- default: None
class stetl.filters.sieve.Sieve(configdict, section, consumes, produces)
Bases: stetl.filter.Filter
ABC for specific Sieves that pass-through, “sieve”, Packets based on criteria in their data.
class stetl.filters.stringfilter.StringConcatFilter(configdict, section)
Bases: stetl.filters.stringfilter.StringFilter
Concatenates a specified string with the input string (packet.data) either/or as prefix (prepend) or postfix (append) and outputs that concatenation.
consumes=FORMAT.string, produces=FORMAT.string
class stetl.filters.stringfilter.StringFilter(configdict, section, consumes, produces)
Bases: stetl.filter.Filter
Base class for any string filtering.
class stetl.filters.stringfilter.StringSubstitutionFilter(configdict, section)
Bases: stetl.filters.stringfilter.StringFilter
String filtering using Python advanced String formatting. String should have substitutable values like {schema} {foo}. format_args should be of the form format_args = schema:test foo:bar …
consumes=FORMAT.string, produces=FORMAT.string
format_args() CONFIG
Provides a list of format arguments used by the string substitution filter. Formatting of content according to Python String.format(). String should have substitutable values like {schema} {foo}.
Example: format_args = schema:test foo:bar
- type: str
- required: True
- default: None
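The format_args convention of StringSubstitutionFilter maps naturally onto `str.format()`. The sketch below shows one way the `name:value` pairs could be parsed and applied. The helper `parse_format_args` and the sample SQL are hypothetical and not taken from the Stetl source.

```python
def parse_format_args(format_args):
    """Turn 'schema:test foo:bar' into {'schema': 'test', 'foo': 'bar'}."""
    pairs = (item.split(":", 1) for item in format_args.split())
    return {name: value for name, value in pairs}


sql = "CREATE SCHEMA {schema}; COMMENT ON SCHEMA {schema} IS '{foo}';"
args = parse_format_args("schema:test foo:bar")
print(sql.format(**args))
```

Splitting on the first colon only keeps values that themselves contain a colon intact.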
class stetl.filters.templatingfilter.Jinja2TemplatingFilter(configdict, section)
Bases: stetl.filters.templatingfilter.TemplatingFilter
Implements Templating using Jinja2. Jinja2, http://jinja.pocoo.org, is a modern and designer-friendly templating language for Python modelled after Django’s templates. A ‘struct’ format as input provides a tree-like structure that could originate from a JSON file or REST service. This input struct provides all the variables to be inserted into the template. The template itself can be configured in this component as a Jinja2 string or file. An optional ‘template_search_paths’ provides a list of directories from which templates can be fetched. Default is the current working directory. Via the optional ‘globals_path’ a JSON structure can be inserted into the Template environment. The variables in this globals structure are typically “boilerplate” constants like: id-prefixes, point of contacts etc.
consumes=FORMAT.struct, produces=FORMAT.string
add_env_filters(jinja2_env)
Register additional Filters on the template environment by updating the filters dict. Somehow min and max of list are not present so add them as well.
static geojson2gml_filter(value, source_crs=4326, target_crs=None, gml_id=None, gml_format='GML2', gml_longsrs='NO')
Jinja2 custom Filter: generates any GML geometry from a GeoJSON geometry. By specifying a target_crs we can even reproject from the source CRS. The gml_format=GML2|GML3 determines the general GML form: e.g. pos/posList or coordinates. gml_longsrs=YES|NO determines the srsName format like EPSG:4326 or urn:ogc:def:crs:EPSG::4326 (long).
class stetl.filters.templatingfilter.StringTemplatingFilter(configdict, section)
Bases: stetl.filters.templatingfilter.TemplatingFilter
Implements Templating using Python’s internal string.Template. A template string or file should be configured. The input record contains the actual values to be substituted in the template string as a record (key/value pairs). Output is a regular string.
consumes=FORMAT.record or FORMAT.record_array, produces=FORMAT.string
safe_substitution() CONFIG
- Apply safe substitution? With this method, string.Template.safe_substitute will be invoked, instead of string.Template.substitute. If placeholders are missing from mapping and keywords, instead of raising an exception, the original placeholder will appear in the resulting string intact.
- type: bool
- required: False
- default: False
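The difference that safe_substitution makes can be seen directly with Python's `string.Template`. The template text and the record below are made-up examples:

```python
from string import Template

template = Template("<city name='$name' population='$population'/>")
record = {"name": "Utrecht"}  # 'population' is deliberately missing

# safe_substitute leaves unknown placeholders in place.
print(template.safe_substitute(record))

# substitute raises KeyError for the missing placeholder.
try:
    template.substitute(record)
except KeyError as missing:
    print("substitute() failed on missing key:", missing)
```

With safe substitution a partially filled template passes through silently, so the stricter default is the better choice when missing values should be noticed early.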
class stetl.filters.templatingfilter.TemplatingFilter(configdict, section, consumes='any', produces='string')
Bases: stetl.filter.Filter
Abstract base class for specific template-based filters. See https://wiki.python.org/moin/Templating. Subclasses implement a specific template language like Python string.Template, Mako, Genshi, Jinja2.
consumes=FORMAT.any, produces=FORMAT.string
invoke(packet)
Components override for Component-specific behaviour, typically read, filter or write actions.
class stetl.filters.gmlfeatureextractor.GmlFeatureExtractor(configdict, section='gml_feature_extractor')
Bases: stetl.filter.Filter
Extract arrays of GML features etree elements from etree docs.
consumes=FORMAT.etree_doc, produces=FORMAT.etree_feature_array
class stetl.filters.formatconverter.FormatConverter(configdict, section)
Bases: stetl.filter.Filter
Converts (almost) any packet format (if converter available).
consumes=FORMAT.any, produces=FORMAT.any but actual formats are changed at initialization based on the input to output format to be converted via the input_format and output_format config parameters.
converter_args() CONFIG
- Custom converter-specific arguments.
- type: dict
- required: False
- default: None
static etree_doc2geojson_collection(packet, converter_args=None)
Use converter_args to determine XML tag names for features and GeoJSON feature id. For example:
converter_args = { 'root_tag': 'FeatureCollection', 'feature_tag': 'featureMember', 'feature_id_attr': 'fid' }
Parameters: packet, converter_args
static etree_doc2struct(packet, strip_space=True, strip_ns=True, sub=False, attr_prefix='', gml2ogr=True, ogr2json=True)
Parameters: packet, strip_space, strip_ns, sub, attr_prefix, gml2ogr, ogr2json
class stetl.filters.execfilter.CommandExecFilter(configdict, section)
Bases: stetl.filters.execfilter.ExecFilter
Executes an arbitrary command and captures the output.
consumes=FORMAT.string, produces=FORMAT.string
class stetl.filters.execfilter.ExecFilter(configdict, section, consumes, produces)
Bases: stetl.filter.Filter
Executes any command (abstract base class).
env_args() CONFIG
Provides a list of environment variables which will be used when executing the given command.
Example: env_args = pgpassword=postgres othersetting=value~with~spaces
- type: str
- required: False
- default:
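A possible reading of the env_args example is sketched below: space-separated `name=value` pairs, with `~` standing in for a space inside a value. That interpretation of the tilde is an assumption derived only from the `value~with~spaces` example, and `parse_env_args` is a hypothetical helper, not Stetl's code.

```python
import os


def parse_env_args(env_args, base_env=None):
    """Return a copy of the environment extended with the given settings."""
    env = dict(os.environ if base_env is None else base_env)
    for item in env_args.split():
        name, value = item.split("=", 1)
        env[name] = value.replace("~", " ")  # assumption: '~' encodes a space
    return env


env = parse_env_args(
    "pgpassword=postgres othersetting=value~with~spaces", base_env={}
)
print(env)
```

A dictionary like this could then be passed as the `env` argument of `subprocess.run` when executing the command.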
class stetl.filters.nullfilter.NullFilter(configdict, section, consumes='any', produces='any')
Bases: stetl.filter.Filter
Pass-through Filter, does nothing. Mainly used in Test Cases.
class stetl.filters.packetbuffer.PacketBuffer(configdict, section)
Bases: stetl.filter.Filter
Buffers all incoming Packets, main use is unit-testing to inspect Packets after ETL is done.
class stetl.filters.packetwriter.PacketWriter(configdict, section)
Bases: stetl.filter.Filter
Writes the payload of a packet as a string to a file.
consumes=FORMAT.any, produces=FORMAT.string
class stetl.filters.regexfilter.RegexFilter(configdict, section, consumes='string', produces='record')
Bases: stetl.filter.Filter
Extracts data from a string using a regular expression and returns the named groups as a record.
consumes=FORMAT.string, produces=FORMAT.record
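Turning the named groups of a regular expression into a record (a Python dict) is a one-liner with the `re` module. The pattern and file name below are invented for illustration, and whether RegexFilter searches or matches from the start of the string is not stated here, so the use of `search` is an assumption.

```python
import re

# Hypothetical pattern: dataset name, 8-digit date and file extension.
pattern = re.compile(r"(?P<dataset>[a-z]+)_(?P<date>\d{8})\.(?P<ext>\w+)$")


def to_record(text):
    """Return the named groups as a dict, or None when nothing matches."""
    match = pattern.search(text)
    return match.groupdict() if match else None


print(to_record("bgt_20230109.gml"))
print(to_record("no match here"))
```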
class stetl.filters.fileextractor.FileExtractor(configdict, section, consumes='any', produces='string')
Bases: stetl.filter.Filter
Abstract Base Class. Extracts a file from an archive and saves it as the configured file name.
consumes=FORMAT.any, produces=FORMAT.string
buffer_size() CONFIG
- Buffer size for read buffer during extraction.
- type: int
- required: False
- default: 1073741824
delete_file() CONFIG
- Delete the file when the chain has been completed?
- type: bool
- required: False
- default: True
class stetl.filters.fileextractor.VsiFileExtractor(configdict, section)
Bases: stetl.filters.fileextractor.FileExtractor
Extracts a file from a GDAL /vsi path spec, and saves it as the given file name.
Example paths:
/vsizip/{/project/nlextract/data/BAG-2.0/BAGNLDL-08112020.zip}/9999STA08112020.zip
/vsizip/{/vsizip/{BAGGEM0221L-15022021.zip}/GEM-WPL-RELATIE-15022021.zip}/GEM-WPL-RELATIE-15022021-000001.xml
See also stetl.inputs.fileinput.VsiZipFileInput that generates these paths.
Author: Just van den Broecke
consumes=FORMAT.gdal_vsi_path, produces=FORMAT.string
class stetl.filters.fileextractor.ZipFileExtractor(configdict, section)
Bases: stetl.filters.fileextractor.FileExtractor
Extracts a file from a ZIP file, and saves it as the given file name.
Author: Frank Steggink
consumes=FORMAT.record, produces=FORMAT.string
class stetl.filters.vsifilter.VsiFilter(configdict, section, vsiname)
Bases: stetl.filter.Filter
Abstract base class for applying a GDAL/OGR virtual file system (VSI) filter.
class stetl.filters.vsifilter.VsiZipFilter(configdict, section)
Bases: stetl.filters.vsifilter.VsiFilter
Applies a VSIZIP filter to the input record.
consumes=FORMAT.record, produces=FORMAT.string
NB: stetl.filters.zipfileextractor is deprecated. ZipFileExtractor is now part of module stetl.filters.fileextractor.
Components: Outputs¶
class stetl.outputs.fileoutput.FileOutput(configdict, section)
Bases: stetl.output.Output
Pretty print input to file. Input may be an etree doc or any other stringify-able input.
consumes=FORMAT.any
class stetl.outputs.fileoutput.MultiFileOutput(configdict, section)
Bases: stetl.outputs.fileoutput.FileOutput
Print to multiple files from subsequent packets like strings or etree docs, file_path must be of a form like: gmlcities-%03d.gml.
consumes=FORMAT.any
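The `%03d` in the MultiFileOutput file_path is a printf-style placeholder for a zero-padded sequence number. The sketch below shows how such a pattern expands. Starting the numbering at 1 is an assumption, as the text above does not say where counting begins.

```python
file_path = "gmlcities-%03d.gml"

# One file name per incoming packet; the counter start is an assumption.
paths = [file_path % number for number in range(1, 4)]
print(paths)
```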
class stetl.outputs.standardoutput.StandardOutput(configdict, section)
Bases: stetl.output.Output
Print any input to standard output.
consumes=FORMAT.any
class stetl.outputs.standardoutput.StandardXmlOutput(configdict, section)
Bases: stetl.output.Output
Pretty print XML from etree doc to standard output. OBSOLETE, can be done with StandardOutput.
consumes=FORMAT.etree_doc
class stetl.outputs.ogroutput.Ogr2OgrOutput(configdict, section)
Bases: stetl.output.Output
Output from GML etree doc to any OGR2OGR output using the GDAL/OGR ogr2ogr command.
consumes=FORMAT.etree_doc
-
class
stetl.outputs.ogroutput.
OgrOutput
(configdict, section)[source]¶ Bases:
stetl.output.Output
Direct GDAL OGR output via Python OGR wrapper. Via the Python API http://gdal.org/python OGR Features are written.
This output can write almost any geospatial, OGR-defined, dataformat.
consumes=FORMAT.ogr_feature or FORMAT.ogr_feature_array
-
always_apply_lco
()[source]¶ CONFIG
- Flag to indicate whether the layer creation options should be applied to all runs.
- type: bool
- required: False
- default: False
-
append
()[source]¶ CONFIG
- Add to destination destination if it extists (ogr2ogr -append option).
- type: bool
- required: False
- default: False
-
dest_create_options
()[source]¶ CONFIG
Creation options.
Examples: ..
- type: list
- required: False
- default: []
-
dest_data_source
()[source]¶ CONFIG
- String denoting the OGR data destination. Usually a path to a file like “path/rivers.shp” or connection string to PostgreSQL like “PG: host=localhost dbname=’rivers’ user=’postgres’”.
- type: str
- required: True
- default: None
-
dest_format
()[source]¶ CONFIG
Instructs GDAL to use driver by that name to open data destination. Not required for many standard formats that are self-describing like ESRI Shapefile.
Examples: ‘PostgreSQL’, ‘GeoJSON’ etc
- type: str
- required: False
- default: None
-
dest_options
()[source]¶ CONFIG
- Custom data destination-specific options. Used in gdal.SetConfigOption().
- type: dict
- required: False
- default: None
-
layer_create_options
()[source]¶ CONFIG
- Options for newly created layer (-lco).
- type: list
- required: False
- default: []
-
new_layer_name
()[source]¶ CONFIG
- Name of the layer created in the destination data source.
- type: str
- required: True
- default: None
-
overwrite
()[source]¶ CONFIG
- Overwrite the destination if it exists (ogr2ogr -overwrite option).
- type: bool
- required: False
- default: False
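Since Stetl is driven from an .ini file, the OgrOutput options above would normally appear as keys in a config section. The sketch below parses such a fragment with Python's standard configparser and checks that the two options documented as required are present. The section name, the values and the form of the class path are illustrative assumptions; only the option names come from the reference above, and this is not how Stetl itself loads its configuration.

```python
import configparser

# Hypothetical Stetl config fragment: the section name, the values and the
# exact class path are assumptions; the option names are the documented ones.
CONFIG_TEXT = """
[output_ogr_postgis]
class = stetl.outputs.ogroutput.OgrOutput
dest_data_source = PG: host=localhost dbname='rivers' user='postgres'
dest_format = PostgreSQL
new_layer_name = rivers
overwrite = True
append = False
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
section = parser["output_ogr_postgis"]

# dest_data_source and new_layer_name are documented as required.
required = ("dest_data_source", "new_layer_name")
missing = [name for name in required if name not in section]

print("missing required options:", missing)
print("overwrite:", section.getboolean("overwrite"))
```

A check like this can catch a missing required option before a long-running ETL chain is started.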
-
-
class
stetl.outputs.execoutput.
CommandExecOutput
(configdict, section)[source]¶ Bases:
stetl.outputs.execoutput.ExecOutput
Executes an arbitrary command.
consumes=FORMAT.string
-
class
stetl.outputs.execoutput.
ExecOutput
(configdict, section, consumes)[source]¶ Bases:
stetl.output.Output
Executes any command (abstract base class).
-
class
stetl.outputs.execoutput.
Ogr2OgrExecOutput
(configdict, section)[source]¶ Bases:
stetl.outputs.execoutput.ExecOutput
Executes an Ogr2Ogr command. Input is a file name to be processed. Output by calling Ogr2Ogr command.
consumes=FORMAT.string
-
always_apply_lco
()[source]¶ CONFIG
- Flag to indicate whether the layer creation options should be applied to all runs.
- type: bool
- required: False
- default: False
-
cleanup_input
()[source]¶ CONFIG
- Flag to indicate whether the input file to ogr2ogr should be cleaned up.
- type: bool
- required: False
- default: False
-
dest_data_source
()[source]¶ CONFIG
- String denoting the OGR data destination. Usually a path to a file like “path/rivers.shp” or connection string to PostgreSQL like “PG: host=localhost dbname=’rivers’ user=’postgres’”.
- type: str
- required: True
- default: None
-
dest_format
()[source]¶ CONFIG
Instructs GDAL to use driver by that name to open data destination. Not required for many standard formats that are self-describing like ESRI Shapefile.
Examples: ‘PostgreSQL’, ‘GeoJSON’ etc
- type: str
- required: False
- default: None
-
gfs_template
()[source]¶ CONFIG
- Name of GFS template file to use during loading. Passed to ogr2ogr as --config GML_GFS_TEMPLATE <name>.
- type: str
- required: False
- default: None
-
lco
()[source]¶ CONFIG
- Options for newly created layer (-lco).
- type: str
- required: False
- default: None
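To make the relation between these options and the ogr2ogr command line concrete, the sketch below assembles an argument list from them. The function `build_ogr2ogr_args` is hypothetical and is not Stetl's implementation. It also assumes that the lco string already contains ready-made `-lco NAME=VALUE` text, which the reference above does not state. The file names and connection string are illustrative.

```python
import shlex


def build_ogr2ogr_args(input_file, dest_data_source, dest_format=None,
                       lco=None, gfs_template=None):
    """Assemble an ogr2ogr argument list from the documented option names.

    Hypothetical helper for illustration; not Stetl's actual code.
    """
    args = ["ogr2ogr"]
    if gfs_template:
        # Documented above: passed as --config GML_GFS_TEMPLATE <name>.
        args += ["--config", "GML_GFS_TEMPLATE", gfs_template]
    if dest_format:
        args += ["-f", dest_format]
    if lco:
        # Assumption: lco holds text such as '-lco LAUNDER=YES'.
        args += shlex.split(lco)
    # ogr2ogr expects the destination first, then the source.
    args += [dest_data_source, input_file]
    return args


args = build_ogr2ogr_args(
    "input/rivers.gml",
    "PG:host=localhost dbname=rivers user=postgres",
    dest_format="PostgreSQL",
    lco="-lco LAUNDER=YES",
    gfs_template="rivers.gfs",
)
print(" ".join(shlex.quote(arg) for arg in args))
```

Keeping the arguments in a list rather than one string avoids quoting problems when the connection string contains spaces.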
-
-
class
stetl.outputs.dboutput.
DbOutput
(configdict, section, consumes)[source]¶ Bases:
stetl.output.Output
Output to any database (abstract base class).
-
class
stetl.outputs.dboutput.
PostgresDbOutput
(configdict, section)[source]¶ Bases:
stetl.outputs.dboutput.DbOutput
Output to PostgreSQL database. Input is an SQL string. Output by executing input SQL string.
consumes=FORMAT.string
-
class
stetl.outputs.dboutput.
PostgresInsertOutput
(configdict, section, consumes='record')[source]¶ Bases:
stetl.outputs.dboutput.PostgresDbOutput
Output by inserting a single record in a Postgres database table. Input is a Stetl record (Python dict structure) or a list of records. Creates an INSERT for Postgres to insert each single record. When the “replace” parameter is True, an UPDATE of any existing record keyed by “key” is attempted first.
NB: a constraint is that the first and each subsequent record must contain all values, because the INSERT and UPDATE query templates are built once from the columns in the first record.
consumes=[FORMAT.record_array, FORMAT.record]
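The constraint that every record carries the same columns as the first one can be illustrated with a small sketch. It builds a parameterized INSERT once from the first record and then extracts values in that fixed column order. The helper names, the table name and the use of `%s` placeholders (the psycopg2 convention) are assumptions for illustration and not Stetl's actual code. Table and column names are assumed to be trusted identifiers.

```python
def build_insert_template(table, first_record):
    """Build a parameterized INSERT from the columns of the first record.

    Hypothetical sketch; identifiers are assumed to be trusted.
    """
    columns = list(first_record.keys())
    placeholders = ", ".join(["%s"] * len(columns))
    query = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(columns), placeholders)
    return query, columns


def values_for(record, columns):
    """Return values in template order; raises KeyError if a column is missing."""
    return [record[name] for name in columns]


records = [
    {"id": 1, "name": "Rhine"},
    {"id": 2, "name": "Meuse"},
]

# The template is built once, from the first record only.
query, columns = build_insert_template("rivers", records[0])
rows = [values_for(record, columns) for record in records]

print(query)
print(rows)
```

A later record that lacks one of the columns fails in `values_for`, which is why all records need to contain all values.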
-
class
stetl.outputs.deegreeoutput.
DeegreeBlobstoreOutput
(configdict, section)[source]¶ Bases:
stetl.output.Output
Insert features into deegree Blobstore from an etree doc.
consumes=FORMAT.etree_doc
-
class
stetl.outputs.deegreeoutput.
DeegreeFSLoaderOutput
(configdict, section)[source]¶ Bases:
stetl.output.Output
Insert features via deegree using deegree’s FSLoader tool from an etree doc.
consumes=FORMAT.etree_doc
Contact¶
The website stetl.org is the main entry point for all of Stetl.
All development is done via GitHub: see https://github.com/geopython/stetl.
Contact the main author Just van den Broecke via email at just@justobjects.nl.
Online chat via Gitter: https://gitter.im/geopython/stetl
Links¶
Below are links relevant to Stetl.
Presentations¶
Below are several presentations on Stetl given at various events, with the most recent/relevant at the top.