Welcome to databuild’s documentation!¶
Databuild is an automation tool for data manipulation.
The general principles in Databuild are:
- Low entry barrier
- Easy to install
- Easy to grasp
- Extensible
Databuild can be useful for scenarios such as:
- Documenting data transformations in your infoviz project
- Automating data processing in a declarative way
Contents¶
Quickstart¶
Run databuild using a buildfile:
$ data-build.py buildfile.json
buildfile.json contains a list of operations to be performed on data. Think of it as a script for a spreadsheet.
An example buildfile might look like this:
[
  {
    "operation": "sheets.import_data",
    "description": "Importing data from csv file",
    "params": {
      "sheet": "dataset1",
      "format": "csv",
      "filename": "dataset1.csv",
      "skip_last_lines": 1
    }
  },
  {
    "operation": "columns.add_column",
    "description": "Calculate the gender ratio",
    "params": {
      "sheet": "dataset1",
      "name": "Gender Ratio",
      "expression": {
        "language": "python",
        "content": "return float(row['Male Total']) / float(row['Female Total'])"
      }
    }
  },
  {
    "operation": "sheets.export_data",
    "description": "save the data",
    "params": {
      "sheet": "dataset1",
      "format": "csv",
      "filename": "dataset2.csv"
    }
  }
]
YAML buildfiles are also supported. Databuild will guess the type based on the extension.
Philosophy¶
Databuild is an alternative to more complex and complete tools like pandas, NumPy, and R.
It's aimed at users who are not necessarily data scientists and who are looking for a simpler alternative to such software.
It's admittedly less performant than those tools and is not optimized for huge datasets, but Databuild is much easier to get started with.
Buildfiles¶
A buildfile contains a list of operations to be performed on data. Think of it as a script for a spreadsheet.
JSON and YAML formats are supported; Databuild guesses the format from the file extension.
A complete JSON example appears in the Quickstart above. The same buildfile in YAML:
- operation: sheets.import_data
  description: Importing data from csv file
  params:
    sheet: dataset1
    format: csv
    filename: dataset1.csv
    skip_last_lines: 1
- operation: columns.add_column
  description: Calculate the gender ratio
  params:
    sheet: dataset1
    name: Gender Ratio
    expression:
      language: python
      content: "return float(row['Male Total']) / float(row['Female Total'])"
- operation: sheets.export_data
  description: save the data
  params:
    sheet: dataset1
    format: csv
    filename: dataset2.csv
Python API¶
Databuild can be integrated into your Python project. Just import the build function:
from databuild.builder import build
build('buildfile.json')
- Supported arguments (example below):
- build_file: Required. Path to the buildfile.
- settings: Optional. Python module path containing the settings. Defaults to databuild.settings.
- echo: Optional. Set this to True if you want the operations' descriptions printed to the screen. Defaults to False.
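As a minimal sketch, assuming the optional arguments are passed as keywords:
from databuild.builder import build

# Build the workbook described by buildfile.json, echoing each
# operation's description as it runs; `settings` stays at its default.
build('buildfile.json', echo=True)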
Operation Functions¶
Operation functions are regular Python functions that perform actions on the book. Examples of operations are sheets.import_data, columns.add_column, columns.update_column, and more.
Each has a function name that identifies it, an optional description, and a number of parameters that it accepts. Different operation functions accept different parameters.
Available Operation Functions¶
sheets.import_data¶
Creates a new sheet by importing data from an external source.
- arguments (an example follows the list):
- filename: Required.
- sheet: Optional. Defaults to filename's basename.
- format: Values currently supported are 'csv' and 'json'.
- headers: Optional. Defaults to null, meaning that the importer tries to autodetect the header names.
- encoding: Optional. Defaults to 'utf-8'.
- skip_first_lines: Optional. Defaults to 0. Supported only by the CSV importer.
- skip_last_lines: Optional. Defaults to 0. Supported only by the CSV importer.
- guess_types: Optional. If set to true, the CSV importer will try to guess the data type. Defaults to true.
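A sketch of a single import step (the file and sheet names are illustrative):
{
  "operation": "sheets.import_data",
  "params": {
    "filename": "population.csv",
    "sheet": "population",
    "format": "csv",
    "skip_first_lines": 1,
    "guess_types": false
  }
}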
sheets.copy¶
- arguments:
- source
- destination
- headers (optional)
Creates a copy of the source sheet under the name destination. Optionally copies only the headers listed in headers.
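For instance (sheet names are illustrative):
{
  "operation": "sheets.copy",
  "params": {
    "source": "dataset1",
    "destination": "dataset1_copy"
  }
}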
sheets.export_data¶
- arguments:
- sheet
- format
- filename
- headers (optional)
Exports the datasheet named sheet to the file named filename in the specified format. Optionally exports only the headers specified in headers.
sheets.print_data¶
- arguments:
- sheet
Prints the contents of sheet to the screen.
columns.update_column¶
- arguments:
- sheet
- column
- facets (optional)
- values
- expression
Either values or expression is required.
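A sketch of an update step using an expression (sheet and column names are illustrative):
{
  "operation": "columns.update_column",
  "description": "Round the gender ratio to two decimals",
  "params": {
    "sheet": "dataset1",
    "column": "Gender Ratio",
    "expression": {
      "language": "python",
      "content": "return round(row['Gender Ratio'], 2)"
    }
  }
}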
columns.add_column¶
- arguments:
- sheet
- name
- expression (optional)
columns.remove_column¶
- arguments:
- sheet
- name
columns.rename_column¶
- arguments:
- sheet
- old_name
- new_name
columns.to_float¶
- arguments:
- sheet
- column
- facets (optional)
columns.to_integer¶
- arguments:
- sheet
- column
- facets (optional)
columns.to_decimal¶
- arguments:
- sheet
- column
- facets (optional)
columns.to_text¶
- arguments:
- sheet
- column
- facets (optional)
columns.to_datetime¶
- arguments:
- sheet
- column
- facets (optional)
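All of the conversion operations above share the same shape. A sketch of one step (sheet and column names are illustrative):
{
  "operation": "columns.to_integer",
  "params": {
    "sheet": "dataset1",
    "column": "Male Total"
  }
}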
Custom Operation¶
You can add your own custom operations and use them in your buildfiles.
An operation is just a regular Python function. The first argument has to be the workbook; the remaining arguments will be pulled in from the params property of the operation in the buildfile.
def myoperation(workbook, foo, bar, baz):
    # `foo`, `bar` and `baz` arrive from the "params" object in the buildfile.
    pass
Operations are defined in modules, which are just regular Python files.
As long as your operation modules are on your PYTHONPATH, you can add them to your OPERATION_MODULES setting (see the OPERATION_MODULES setting below) and then call an operation in your buildfile by referencing its import path:
[
  ...,
  {
    "operation": "mymodule.myoperation",
    "description": "",
    "params": {
      "foo": "foos",
      "bar": "bars",
      "baz": "bazes"
    }
  }
]
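Putting it together, a hypothetical mymodule.py might look like this (a sketch only; a real operation would act on the workbook):
# mymodule.py -- a hypothetical operation module
def myoperation(workbook, foo, bar, baz):
    # A real operation would manipulate `workbook` and its sheets; this
    # sketch only shows how "params" values flow into the arguments.
    print(foo, bar, baz)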
Expressions¶
Expressions are objects encapsulating code for situations such as filtering or calculations.
An expression has the following properties:
- language: The name of the environment where the expression will be executed, as specified in settings.LANGUAGES (see the LANGUAGES setting).
- content: The actual code to run, or
- path: The path to a file containing the code to run (an example follows).
The expression will be evaluated inside a function and run against every row in the datasheet. The following context variables will be available:
- row: A dictionary representing the currently selected row.
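For instance, an expression that loads its code from a file might look like this (the path is hypothetical):
{
  "language": "python",
  "path": "expressions/gender_ratio.py"
}
The referenced file would contain the same code that content would otherwise hold, with row available in scope.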
Environments¶
Expressions are evaluated in the environment specified by their language property.
The value maps to a specific environment as specified in settings.LANGUAGES (See the LANGUAGES setting).
Included Environments¶
Currently, the following environments are shipped with databuild:
Python¶
An unsafe Python environment. Use it only with trusted buildfiles.
Writing Custom Environments¶
An environment is a subclass of databuild.environments.base.BaseEnvironment that implements the following methods (a minimal sketch follows the list):
- __init__(self, book): Initializes the environment with the appropriate global variables.
- copy(self, iterable): Copies a variable from the databuild process to the hosted environment.
- eval(self, expression): Evaluates the string expression to an actual function and returns it.
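A minimal sketch, assuming a plain in-process Python environment where copy can be the identity and expressions are Python function bodies that receive row (the real base class may require extra setup):
from databuild.environments.base import BaseEnvironment

class NaivePythonEnvironment(BaseEnvironment):
    # Hypothetical example; method signatures follow the list above.

    def __init__(self, book):
        self.book = book

    def copy(self, iterable):
        # No process or VM boundary to cross in this sketch.
        return iterable

    def eval(self, expression):
        # Wrap the expression body into a function of `row` and return it.
        source = "def _expr(row):\n" + "\n".join(
            "    " + line for line in expression.splitlines()
        )
        namespace = {}
        exec(source, namespace)
        return namespace["_expr"]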
Add-on Environments¶
Lua¶
An additional Lua environment is available at http://github.com/databuild/databuild-lua
Requires Lua or LuaJIT (Note: LuaJIT is currently unsupported on OS X).
Functions¶
Functions are additional methods that can be used inside Expressions.
Available Functions¶
cross¶
Returns a single value from a column in a different sheet.
- arguments (example below):
- row: reference to the current row
- sheet_source: name of the sheet that you want to get the data from
- column_source: name of the column that you want to get the data from
- column_key: name of the column used to match rows between the two sheets
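A sketch of cross inside an expression; this assumes functions are exposed by name in the expression's scope, and the sheet and column names are illustrative:
{
  "operation": "columns.add_column",
  "params": {
    "sheet": "dataset1",
    "name": "Population",
    "expression": {
      "language": "python",
      "content": "return cross(row, 'population', 'Total', 'Country')"
    }
  }
}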
column¶
Returns an array of values from a column in a different sheet, ordered by the key.
- arguments:
- sheet_name: name of the current sheet
- sheet_source: name of the sheet that you want to get the data from
- column_source: name of the column that you want to get the data from
- column_key: name of the column used to match rows between the two sheets
Custom Functions Modules¶
You can write your own custom function modules.
A function module is a regular Python module containing Python functions with the following signature:
def myfunction(environment, book, **kwargs)
Functions must accept the environment and book positional arguments. After those, every other argument is up to the function.
Another requirement is that the function must return a value wrapped by the environment's copy method:
return environment.copy(my_return_value)
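Putting it together, a hypothetical module (the names are illustrative):
# myfunctions.py -- a hypothetical function module
def scale(environment, book, value=0, factor=1):
    # Arguments beyond `environment` and `book` are up to the function.
    # The result must cross back through the environment's copy method.
    return environment.copy(value * factor)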
Function modules must be made available by adding them to the FUNCTION_MODULES settings variable.
Settings¶
ADAPTER¶
Import path of the adapter class. Defaults to 'databuild.adapters.locmem.models.LocMemBook'.
LANGUAGES¶
A dict mapping language names to environments. Defaults to:
LANGUAGES = {
    'python': 'databuild.environments.python.PythonEnvironment',
}
FUNCTION_MODULES¶
A tuple of module paths to import Functions from. Defaults to:
FUNCTION_MODULES = (
    'databuild.functions.data',
)
OPERATION_MODULES¶
A tuple of module paths to import Operation Functions from. Defaults to:
OPERATION_MODULES = (
    "databuild.operations.sheets",
    "databuild.operations.columns",
)
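For instance, to enable the custom operation module from the Custom Operation section above, a settings module might extend this tuple (mymodule is hypothetical):
OPERATION_MODULES = (
    "databuild.operations.sheets",
    "databuild.operations.columns",
    "mymodule",  # hypothetical custom operation module
)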