Welcome to the Good Tables documentation!¶
Good Tables is a Python library and command line tool for validating and transforming tabular data.
Tabular data, in the form of CSV or Excel, is passed through a pipeline of validators. These validators can check structure (for example: are there blank rows or columns, do rows have the same length as the header), and they can also validate against a schema (for example: does the data have the expected columns, is the data of the right type, are dates actually dates).
Optionally, the data source is transformed as it passes through the pipeline.
In return, the client receives a report on the processing performed and, optionally, the output data.
Get involved¶
You can contribute to the project with content, code, and ideas!
Start at one of the following channels:
Documentation: An overview of the features that are currently in place.
Issues: See current issues, the backlog, and/or file a new issue.
Code: Get the code here.
Table of contents¶
Quickstart¶
Let’s get started.
Installation¶
Good Tables runs on Python 2.7, 3.3 and 3.4.
PyPI:

    pip install goodtables

Git:

    git clone https://github.com/okfn/goodtables.git
Tutorials¶
Some tutorials for using and extending Good Tables.
1. Implementing a custom processor¶
# TODO: This is unfinished.
Implementing a custom validator that can be invoked in a pipeline is easy.
Let’s write one that checks that values are in a certain range.
For data, see the file custom-range.csv in the examples directory.
In our data, we have name, age and city data for a group of people.
We want to ensure that all the people in our data are in the 25-50 age range.
For demonstration, we’ll write a pretty specific validator for this.
Of course, the implementation could be made more generic to handle a range of scenarios.
Our validator class:
    class AgeRangeValidator(object):

        column_name = 'age'
        column_type = int
        column_range = (25, 50)
        report = {}

        def run_row(self, index, headers, row):
            # validate the 'age' value on a single row
            valid = True
            return valid

        def run(self, data_source):
            # iterate over the rows in data_source, calling run_row on each
            valid = True
            return valid, self.report

        def generate_report(self):
            return self.report
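Filling in the range check, run_row could look something like the following. This is a self-contained sketch: the attribute names and the run_row(index, headers, row) signature follow the skeleton above and are assumptions for illustration, not code from the goodtables source.

```python
class AgeRangeValidator(object):

    column_name = 'age'
    column_type = int
    column_range = (25, 50)

    def run_row(self, index, headers, row):
        # Locate the target column by header name, coerce the value,
        # and check that it falls within the configured range.
        try:
            position = headers.index(self.column_name)
            value = self.column_type(row[position])
        except (ValueError, IndexError):
            return False
        lower, upper = self.column_range
        return lower <= value <= upper


validator = AgeRangeValidator()
headers = ['name', 'age', 'city']
print(validator.run_row(0, headers, ['Alice', '30', 'Berlin']))  # in range
print(validator.run_row(1, headers, ['Bob', '61', 'London']))    # out of range
```

Rows with a missing 'age' column or a non-integer value are simply reported as invalid here; a production validator would likely distinguish those cases in its report.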
As you can see, we hard coded column_name, column_type and column_range.
Also, we are running our validation through the run_row method, which is the most common method used for validations.
However, we could easily run the same validation through the run_column method instead:

    # inside AgeRangeValidator
    def run_column(self, *args):
        valid = True
        return valid
So, let’s see it in action. First, we’ll run the validator standalone, via its run method:
validator = AgeRangeValidator()
filepath = 'examples/custom-range.csv'
valid, report = validator.run(filepath)
And the same again, but as part of a validation pipeline that combines the builtin structure validator with our AgeRangeValidator:
validators = ('structure', 'my_module.AgeRangeValidator')
filepath = 'examples/custom-range.csv'
validation_pipeline = ValidationPipeline(filepath, validators)
valid, report = validation_pipeline.run()
Pipeline¶
Naturally, the pipeline.Pipeline class implements the processing pipeline.
Validator registration¶
Register by constructor¶
The pipeline.Pipeline constructor takes a validators keyword argument, which is a list of validators to run in the pipeline.
Each value in the validators list is expected to be a string describing the path to a validator class, for import via importlib.
Optionally, for builtin validators, the validator.name property can be used as a shorthand convenience.
Example:

    validators = ['structure', 'schema']  # shorthand names for builtin validators
    validators = ['my_module.CustomValidatorOne', 'my_module.CustomValidatorTwo']  # import from string
    validators = ['structure', 'my_module.CustomValidatorTwo']  # both combined
Register by instance method¶
Once you have a pipeline.Pipeline instance, you can also register validators via the register_validator method.
Registering new validators this way will by default append the new validators to any existing pipeline.
You can define the position in the pipeline explicitly using the position argument.
Example:

    pipeline = Pipeline(args, kwargs)
    pipeline.register_validator('structure', structure_options)
    pipeline.register_validator('spec', spec_options, 0)
Validator options¶
Pipeline takes an options keyword argument to pass options into each validator in the pipeline.
options should be a dict, with each top-level key being the name of the validator.
Example:

    pipeline_options = {
        'structure': {
            # keyword args for the StructureValidator
        },
        'schema': {
            # keyword args for the SchemaValidator
        }
    }
Running the pipeline¶
Run the pipeline with the run method.
run in turn calls the supported validator methods of each validator.
Once the data table has been run through all validators, run returns a tuple of valid, report, where:
- valid is a boolean, indicating if the data table is valid according to the pipeline validation
- report is a tellme.Report instance, which can be used to generate a report in various formats
Validator arguments¶
Most validators will have custom keyword arguments for their configuration.
Additionally, all validators are expected to take the following keyword arguments, and exhibit certain behaviour based on their values.
The base.Validator signature implements these arguments.
fail_fast¶
fail_fast is a boolean that defaults to False.
If fail_fast is True, the validator is expected to stop processing as soon as an error occurs.
transform¶
transform is a boolean that defaults to True.
If transform is True, then the validator is “allowed” to return transformed data.
The caller (e.g., the pipeline class) is responsible for persisting transformed data.
report_limit¶
report_limit is an int that defaults to 1000, and refers to the maximum number of entries that this validator can write to a report.
If this number is reached, the validator should stop processing.
row_limit¶
row_limit is an int that defaults to 20000, and refers to the maximum number of rows that this validator will process.
report_stream¶
report_stream allows calling code to pass in a writable, seekable text stream to write report entries to.
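Taken together, these arguments suggest a constructor and run loop along the following lines. This is an illustrative sketch of the expected behaviour, not the actual base.Validator implementation; the check_row placeholder and report_entries counter are invented for the example.

```python
class SketchValidator(object):

    def __init__(self, fail_fast=False, transform=True,
                 report_limit=1000, row_limit=20000, report_stream=None):
        self.fail_fast = fail_fast
        self.transform = transform
        self.report_limit = report_limit
        self.row_limit = row_limit
        self.report_stream = report_stream
        self.report_entries = 0

    def run(self, rows):
        valid = True
        for index, row in enumerate(rows):
            # row_limit: stop once the maximum number of rows is processed
            if index >= self.row_limit:
                break
            if not self.check_row(row):
                valid = False
                self.report_entries += 1
                # fail_fast: stop at the first error
                if self.fail_fast:
                    break
                # report_limit: stop once the report is full
                if self.report_entries >= self.report_limit:
                    break
        return valid

    def check_row(self, row):
        # placeholder check: treat an empty row as an error
        return len(row) > 0
```

The key point is that each of the shared keyword arguments maps to a short-circuit in the processing loop, so all validators behave consistently when composed in a pipeline.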
Validator attributes¶
Validators are also expected to have the following attributes.
report¶
A tellme.Report instance. See TellMe
Validators are expected to write report entries to the report instance.
pipeline.Pipeline will call validator.report.generate for each validator to build the pipeline report.
name¶
A shorthand name for this validator. name should be unique when called in a pipeline.
Validators that inherit from base.Validator have a name that defaults to a lower-cased version of the class name.
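One way this default could be implemented (a sketch, not the actual base.Validator source):

```python
class Validator(object):
    # default: the lower-cased class name, overridable by subclasses
    @property
    def name(self):
        return type(self).__name__.lower()


class AgeRangeValidator(Validator):
    pass


print(AgeRangeValidator().name)  # 'agerangevalidator'
```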
Batch¶
pipeline.Batch allows the configuration and running of pipelines over multiple data sources.
Data sources can be extracted from either a CSV file, or a Data Package.
Arguments¶
- source: Filepath to the list of data sources to run the batch against.
- source_type: ‘csv’ (CSV file) or ‘dp’ (Data Package file).
- data_key: If source_type is ‘csv’, then this is the name of the header that indicates the data URL.
- schema_key: If source_type is ‘csv’, then this is the name of the header that indicates the schema URL.
- pipeline_options: The options keyword argument for the pipeline.Pipeline constructor.
- post_task: Any callable that takes the batch instance as its only argument. Runs after the batch processing is complete.
- pipeline_post_task: Any callable that takes a pipeline instance as its only argument. Runs on completion of each pipeline.
For an example of the batch processor at work, including use of post_task and pipeline_post_task, see spd-admin.
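For source_type 'csv', the source file is itself a CSV listing one data source per row, with headers matching data_key and schema_key. The snippet below builds such a file in memory; the header names 'data' and 'schema' and the example URLs are illustrative assumptions, not prescribed by the library.

```python
import csv
import io

# Build a hypothetical batch source file: one row per data source,
# with headers matching the data_key and schema_key arguments.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['data', 'schema'])
writer.writeheader()
writer.writerows([
    {'data': 'http://example.com/data1.csv',
     'schema': 'http://example.com/schema1.json'},
    {'data': 'http://example.com/data2.csv',
     'schema': 'http://example.com/schema2.json'},
])
sources_csv = buf.getvalue()
print(sources_csv)
```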
Reports¶
The results of any run over data, either by a standalone processor or a pipeline, are written to a report.
Each report is an instance of tellme.Report, a small library we also developed (see the TellMe library for more information on its API).
Reports can then be generated in a variety of output formats supported by TellMe.
Pipeline reports¶
In a pipeline, each processor writes its report results to the pipeline.Pipeline instance’s report.
After processing of the data is complete, additional calculations are performed for a summary.
Finally, the report is generated to an output format (a Python dict in this case) and returned.
From a top-level view, a pipeline report will have the following structure:
{
'success': True,
'meta': {'name': 'Pipeline'},
'summary': {#summary},
'results': [...]
}
All the interesting stuff is happening in the results array and the summary object.
See below for a description of each object in the results array, and likewise a description of the summary object.
Standalone processor reports¶
Standalone processors (for example, the built-in StructureProcessor) have a report object almost identical to that of a pipeline report, except they do not have a summary object.
Report result schema¶
{
    'result_type': '# type of this result',
    'result_category': '# category of this result (row/header)',
    'result_level': '# level of this result (info/warning/error)',
    'result_message': '# message of this result',
    'result_context': [],  # a list of the values of the row that the result was generated from
    'row_index': '# index of the row',
    'row_name': '',  # 'headers', or the value of id or _id if present, or empty
    'column_index': '# index of the column (can be None)',
    'column_name': "# name of the column (can be '')",
}
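Given a report dict with the shape above, client code might pick out the errors like this. The result entries are hand-written for the example; only the keys shown in the schema are assumed.

```python
report = {
    'success': False,
    'meta': {'name': 'Pipeline'},
    'summary': {},
    'results': [
        {'result_level': 'error', 'result_message': 'Row has a blank header',
         'row_index': 0, 'column_index': 2},
        {'result_level': 'info', 'result_message': 'Row is valid',
         'row_index': 1, 'column_index': None},
        {'result_level': 'error', 'result_message': 'Value is not an integer',
         'row_index': 3, 'column_index': 1},
    ],
}

# Collect only the error-level results, and the rows they point at.
errors = [r for r in report['results'] if r['result_level'] == 'error']
bad_rows = sorted({r['row_index'] for r in errors})
print(len(errors), bad_rows)
```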
Report summary schema¶
{
'message': '# a summary message',
'total_rows': # int,
'total_columns': # int,
'bad_rows': # int,
'bad_columns': # int,
'columns': [# list of dicts with position, name, type conformance (%) per column]
}
CLI¶
Good Tables includes a command line interface, goodtables.
Pipeline¶
Run a Good Tables pipeline.
Example¶
    goodtables pipeline <data_source> --schema <filepath_or_url> --fail_fast --dry_run --row_limit 20000 --report_limit 1000
Design goals¶
High-level design goals for Good Tables:
- Process tabular data in CSV, Excel and JSON formats
- Provide a suite of small tools that each implement a type of processing to run
- Provide a pipeline API for registering built-in and custom processors
- Components should be easily usable in 3rd party (Python) code