Pachyderm Developer Documentation

Welcome to the Pachyderm documentation portal! Below you’ll find guides and information for beginners and experienced Pachyderm users. You’ll also find API reference docs.

If you can’t find what you’re looking for or have an issue not mentioned here, we’d love to hear from you on GitHub, in our Users Slack channel, or by email at support@pachyderm.io.

Note: If you are using a Pachyderm version < 1.4, you can find relevant docs here.

Getting Started

Welcome to the documentation portal for first-time Pachyderm users! We’ve organized information into three major sections:

Local Installation: Get Pachyderm deployed locally on OSX or Linux.

Beginner Tutorial: Learn to use Pachyderm through a quick and simple tutorial.

Troubleshooting: Common getting started issues and how to fix them.

If you’d like to read about the technical concepts in Pachyderm before actually running it, check out our reference docs.

If you’ve already got a Kubernetes cluster running or would rather use AWS, GCE or Azure, check out our Intro.

Looking for in-depth development docs?

Learn how to create analysis pipelines, or check out more advanced Pachyderm examples such as word count or machine learning with TensorFlow.

Local Installation

This guide will walk you through the recommended path to get Pachyderm running locally on OSX or Linux.

If you hit any errors not covered in this guide, check our troubleshooting docs for common errors, submit an issue on GitHub, join our users channel on Slack, or email us at support@pachyderm.io and we can help you right away.

Prerequisites

Minikube

Kubernetes offers a fantastic guide to install minikube. Follow the Kubernetes installation guide to install Virtual Box, Minikube, and Kubectl. Then come back here to install Pachyderm.

Note: Any time you want to stop and restart Pachyderm, you should start fresh with minikube delete and minikube start. Minikube isn’t meant to be a production environment and doesn’t handle being restarted well without a full wipe.

Pachctl

pachctl is a command-line utility used for interacting with a Pachyderm cluster.

# For OSX:
$ brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.4

# For Linux (64 bit):
$ curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.4.7-RC1/pachctl_1.4.7-RC1_amd64.deb && sudo dpkg -i /tmp/pachctl.deb

Note: To install an older version of Pachyderm, navigate to that version using the menu in the bottom left.

To check that installation was successful, you can try running pachctl help, which should return a list of Pachyderm commands.

Deploy Pachyderm

Now that you have Minikube running, it’s incredibly easy to deploy Pachyderm.

pachctl deploy local

This generates a Pachyderm manifest and deploys Pachyderm on Kubernetes. It may take a few minutes for the pachd nodes to be running because it’s pulling containers from DockerHub. You can see the cluster status by using kubectl get all:

$ kubectl get all
NAME             READY     STATUS    RESTARTS   AGE
po/etcd-hvb78    1/1       Running   0          55s
po/pachd-1kwsx   1/1       Running   0          55s

NAME       DESIRED   CURRENT   READY     AGE
rc/etcd    1         1         1         55s
rc/pachd   1         1         1         55s

NAME             CLUSTER-IP   EXTERNAL-IP   PORT(S)                         AGE
svc/etcd         10.0.0.105   <nodes>       2379:32379/TCP,2380:30003/TCP   55s
svc/kubernetes   10.0.0.1     <none>        443/TCP                         4m
svc/pachd        10.0.0.144   <nodes>       650:30650/TCP,651:30651/TCP     55s

Note: If you see a few restarts on the pachd nodes, that’s ok. That simply means that Kubernetes tried to bring up those containers before etcd was ready so it restarted them.

Port Forwarding

The last step is to set up port forwarding so commands you send can reach Pachyderm within the VM. We background this process since port forwarding blocks.

$ pachctl port-forward &

Once port forwarding is complete, pachctl should automatically be connected. Try pachctl version to make sure everything is working.

$ pachctl version
COMPONENT           VERSION
pachctl             1.4.0
pachd               1.4.0

We’re good to go!

If for any reason port-forward doesn’t work, you can connect directly by setting ADDRESS to the minikube IP with port 30650.

$ minikube ip
192.168.99.100
$ export ADDRESS=192.168.99.100:30650
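
If you like, these two steps can be combined into a single command (a small convenience sketch using shell substitution):

# equivalent one-liner
$ export ADDRESS=$(minikube ip):30650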

Next Steps

Now that you have everything installed and working, check out our Beginner Tutorial to learn the basics of Pachyderm such as adding data and building analysis pipelines.

Beginner Tutorial

Welcome to the beginner tutorial for Pachyderm. If you’ve already got Pachyderm installed, this guide should take about 15 minutes and you’ll be introduced to the basic concepts of Pachyderm.

Image processing with OpenCV

In this guide we’re going to create a Pachyderm pipeline to do some simple edge detection on a few images. Thanks to Pachyderm’s processing system, we’ll be able to run the pipeline in a distributed, streaming fashion. As new data is added, the pipeline will automatically process it and output the results.

If you hit any errors not covered in this guide, check our Troubleshooting docs for common errors, submit an issue on GitHub, join our users channel on Slack, or email us at support@pachyderm.io and we can help you right away.

Prerequisites

This guide assumes that you already have Pachyderm running locally. Check out our Local Installation instructions if you haven’t done that yet and then come back here to continue.

Create a Repo

A repo is the highest level data primitive in Pachyderm. Like many things in Pachyderm, it shares its name with primitives in Git and is designed to behave analogously. Generally, repos should be dedicated to a single source of data such as log messages from a particular service, a users table, or training data for an ML model. Repos are dirt cheap so don’t be shy about making tons of them.

For this demo, we’ll simply create a repo called “images” to hold the data we want to process:

$ pachctl create-repo images

# See the repo we just created
$ pachctl list-repo
NAME                CREATED             SIZE
images              2 minutes ago       0 B

Adding Data to Pachyderm

Now that we’ve created a repo it’s time to add some data. In Pachyderm, you write data to an explicit commit (again, similar to Git). Commits are immutable snapshots of your data which give Pachyderm its version control properties. Files can be added, removed, or updated in a given commit.

Let’s start by just adding a file, in this case an image, to a new commit. We’ve provided some sample images for you that we host on Imgur.

We’ll use the put-file command along with two flags, -c and -f. -f can take either a local file or a URL which it’ll automatically scrape. In our case, we’ll simply pass the URL.

Unlike Git though, commits in Pachyderm must be explicitly started and finished as they can contain huge amounts of data and we don’t want that much “dirty” data hanging around in an unpersisted state. The -c flag specifies that we want to start a new commit, add data, and finish the commit in a convenient one-liner.

We also specify the repo name “images”, the branch name “master”, and what we want to name the file, “liberty.png”.

$ pachctl put-file images master liberty.png -c -f http://imgur.com/46Q8nDz.png

Finally, we check to make sure the data we just added is in Pachyderm.

# If we list the repos, we can see that there is now data
$ pachctl list-repo
NAME                CREATED             SIZE
images              5 minutes ago   57.27 KiB

# We can view the commit we just created
$ pachctl list-commit images
REPO                ID                                 PARENT              STARTED            DURATION            SIZE
images              7162f5301e494ec8820012576476326c   <none>              2 minutes ago      38 seconds          57.27 KiB

# And view the file in that commit
$ pachctl list-file images master
NAME                TYPE                SIZE
liberty.png         file                57.27 KiB

We can view the file we just added to Pachyderm. Since this is an image, we can’t just print it out in the terminal, but the following commands will let you view it easily.

# on OSX
$ pachctl get-file images master liberty.png | open -f -a /Applications/Preview.app

# on Linux
$ pachctl get-file images master liberty.png | display

Create a Pipeline

Now that we’ve got some data in our repo, it’s time to do something with it. Pipelines are the core processing primitive in Pachyderm and they’re specified with a JSON encoding. For this example, we’ve already created the pipeline for you and you can find the code on GitHub.

When you want to create your own pipelines later, you can refer to the full Pipeline Specification to use more advanced options. This includes building your own code into a container instead of the pre-built Docker image we’ll be using here.

For now, we’re going to create a single pipeline that takes in images and does some simple edge detection.

[Image: opencv-liberty.jpg]

Below is the pipeline spec and python code we’re using. Let’s walk through the details.

# edges.json
{
  "pipeline": {
    "name": "edges"
  },
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  },
  "input": {
    "atom": {
      "repo": "images",
      "glob": "/*"
    }
  }
}

Our pipeline spec contains a few simple sections. First is the pipeline name, edges. Then we have the transform, which specifies the Docker image we want to use, pachyderm/opencv (DockerHub is the default registry), and the entry point edges.py. Lastly, we specify the input. Here we only have one “atom” input, our images repo with a particular glob pattern.

The glob pattern defines how the input data can be broken up if we wanted to distribute our computation. /* means that each file can be processed individually, which makes sense for images. Glob patterns are one of the most powerful features of Pachyderm so when you start creating your own pipelines, check out the Pipeline Specification.

# edges.py
import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
   img = cv2.imread(image)
   tail = os.path.split(image)[1]
   edges = cv2.Canny(img,100,200)
   plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0]+'.png'), edges, cmap = 'gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
   for file in files:
       make_edges(os.path.join(dirpath, file))

Our Python code is really straightforward. We simply walk over all the images in /pfs/images, run edge detection on each one, and write the results to /pfs/out.

/pfs/images and /pfs/out are special local directories that Pachyderm creates within the container for you. All the input data for a pipeline will be found in /pfs/[input_repo_name] and your code should always write to /pfs/out.

Now let’s create the pipeline in Pachyderm:

$ pachctl create-pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.4.0/doc/examples/opencv/edges.json

What Happens When You Create a Pipeline

Creating a pipeline tells Pachyderm to run your code on every finished commit in a repo as well as all future commits that happen after the pipeline is created. Our repo already had a commit, so Pachyderm automatically launched a job to process that data.

The first time a pipeline runs, it needs to download the image from DockerHub, so this might take a minute. Every subsequent run will be much faster.

You can view the job with:

$ pachctl list-job
ID                                     OUTPUT COMMIT                            STARTED             DURATION            STATE
a6c70aa5-9f0c-4e36-b30a-4387fac54eac   edges/1a9c76a2cd154e6e90f200fb80c46d2f   2 minutes ago      About a minute      success

Every pipeline creates a corresponding repo with the same name where it stores its output results. In our example, the “edges” pipeline created a repo called “edges” to store the results.

$ pachctl list-repo
NAME                CREATED            SIZE
edges               2 minutes ago      22.22 KiB
images              10 minutes ago     57.27 KiB

Reading the Output

We can view the output data from the “edges” repo in the same fashion that we viewed the input data.

# on OSX
$ pachctl get-file edges master liberty.png | open -f -a /Applications/Preview.app

# on Linux
$ pachctl get-file edges master liberty.png | display

Processing More Data

Pipelines will also automatically process the data from new commits as they are created. Think of pipelines as being subscribed to any new commits on their input repo(s). Also similar to Git, commits have a parental structure that tracks which files have changed. In this case we’re going to be adding more images.

Let’s create two new commits in a parental structure. To do this, we’ll simply run two more put-file commands with -c; by specifying master as the branch, our commits will automatically be parented onto each other. Branch names are just references to a particular HEAD commit.

$ pachctl put-file images master AT-AT.png -c -f http://imgur.com/8MN9Kg0.png

$ pachctl put-file images master kitten.png -c -f http://imgur.com/g2QnNqa.png

Adding a new commit of data will automatically trigger the pipeline to run on the new data we’ve added. We’ll see corresponding jobs get started and commits to the output “edges” repo. Let’s also view our new outputs.

# view the jobs that were kicked off
$ pachctl list-job
ID                                     OUTPUT COMMIT                            STARTED             DURATION             STATE
7395c7c9-df0e-4ea8-8202-ec846970b982   edges/8848e11056c04518a8d128b6939d9985   2 minutes ago      Less than a second   success
b90afeb1-c12b-4ca5-a4f4-50c50efb20bb   edges/da51395708cb4812bc8695bb151b69e3   2 minutes ago      1 seconds            success
9182d65e-ea36-4b98-bb07-ebf40fefcce5   edges/4dd2459531414d80936814b13b1a3442   5 minutes ago      3 seconds            success
# View the output data

# on OSX
$ pachctl get-file edges master AT-AT.png | open -f -a /Applications/Preview.app

$ pachctl get-file edges master kitten.png | open -f -a /Applications/Preview.app

# on Linux
$ pachctl get-file edges master AT-AT.png | display

$ pachctl get-file edges master kitten.png | display

Exploring the File System (optional)

Another nifty feature of Pachyderm is that you can mount the file system locally to poke around and explore your data using FUSE. FUSE comes pre-installed on most Linux distributions. For OS X, you’ll need to install OSX FUSE. This is just an optional step if you want another view of your data and system and can be useful for local development.

The first thing we need to do is mount Pachyderm’s filesystem (pfs).

First create the mount point:

$ mkdir ~/pfs

And then mount it:

# We background this process because it blocks.
$ pachctl mount ~/pfs &

Note

If you get any errors on OSX, they are most likely benign; it’s just Spotlight trying to index the FUSE volume and not having access.

This will mount pfs on ~/pfs. You can then inspect the filesystem like you would any other local filesystem, for example by using ls or pointing your browser at it.

Note

Use pachctl unmount ~/pfs to unmount the filesystem. You can also use the -a flag to remove all Pachyderm FUSE mounts.

Next Steps

We’ve now got Pachyderm running locally with data and a pipeline! If you want to keep playing with Pachyderm locally, you can use what you’ve learned to build on or change this pipeline. You can also start learning some of the more advanced topics to develop analyses in Pachyderm.

We’d love to help and see what you come up with, so submit any issues/questions you come across on GitHub or Slack, or email us at support@pachyderm.io if you want to show off anything nifty you’ve created!

Getting Your Data into Pachyderm

Data that you put (or “commit”) into Pachyderm ultimately lives in an object store of your choice (S3, Minio, GCS, etc.). This data is content-addressed by Pachyderm to build our version-control semantics and is therefore not “human-readable” directly in the object store. That being said, Pachyderm allows you and your pipeline stages to interact with versioned files like you would in a normal file system.

Jargon associated with putting data in Pachyderm

“Data Repositories”

Versioned data in Pachyderm lives in repositories (again think about something similar to “git for data”). Each data “repository” can contain one file, multiple files, multiple files arranged in directories, etc. Regardless of the structure, Pachyderm will version the state of each data repository as it changes over time.

“Commits”

Regardless of the method you use, data gets into Pachyderm via a “commit” into a data repository. In order to put data into Pachyderm, a commit must be “started” (aka an “open commit”). The data put into Pachyderm in that open commit will only be available once the commit is “finished” (aka a “closed commit”). Although you have to do this opening, putting, and closing for all data that is committed into Pachyderm, we provide some convenient ways to do that with our CLI tool and clients (see below).

How to get data into Pachyderm

In terms of actually getting data into Pachyderm via “commits,” there are a couple of options:

  • The pachctl CLI tool: This is a great option for testing and for users who prefer to script their data input.
  • One of the Pachyderm language clients: This option is ideal for Go, Python, or Scala users who want to push data to Pachyderm from services or applications written in those languages. Even if you don’t use Go, Python, or Scala, Pachyderm exposes a protobuf API that supports many other languages; we just haven’t built the full clients yet.

pachctl

To get data into Pachyderm using pachctl, you first need to create one or more data repositories to hold your data:

$ pachctl create-repo <repo name>

Then to put data into the created repo, you use the put-file command. Below are a few example uses of put-file, but you can see the complete documentation here. Note again that commits in Pachyderm must be explicitly started and finished, so put-file can only be called on an open commit (started, but not finished). The -c option allows you to start and finish a commit in addition to putting data, as a one-line command.

Add a single file to a new branch:

# first start a commit
$ pachctl start-commit <repo> -b <branch>

# then utilize the returned <commit-id> in the put-file request
# to put <file> at <path> in the <repo>
$ pachctl put-file <repo> <commit-id> </path/to/file> -f <file>

# then finish the commit
$ pachctl finish-commit <repo> <commit-id>

Start and finish a commit while adding a file using -c:

$ pachctl put-file <repo> <branch> </path/to/file> -c -f <file> 

Put data from a URL:

$ pachctl put-file <repo> <branch> </path/to/file> -c -f http://url_path

Put data directly from an object store:

# here you can use s3://, gcs://, or as://
$ pachctl put-file <repo> <branch> </path/to/file> -c -f s3://object_store_url

Put data directly from another location within Pachyderm:

$ pachctl put-file <repo> <branch> </path/to/file> -c -f pfs://pachyderm_location

Add multiple files at once by using the -i option or multiple -f flags. In the case of -i, the target file should be a list of files, paths, or URLs that you want to input all at once:

$ pachctl put-file <repo> <branch> -c -i <file containing list of files, paths, or URLs>

Pipe data from stdin into a data repository:

$ echo "data" | pachctl put-file <repo> <branch> </path/to/file> -c

Add an entire directory by using the recursive flag, -r:

$ pachctl put-file <repo> <branch> -c -r <dir>

Pachyderm language clients

  • Go: We have a complete Golang client that will let you easily integrate pushing data to Pachyderm into your Go programs. Check out the godocs for put-file.
  • Python: Our great community of users has created a Python client for Pachyderm. Check out the included instructions to see how you can put data into Pachyderm from Python.
  • Scala: Our users are currently working on a Scala client for Pachyderm. Please contact us if you are interested in helping with this or testing it out.
  • Other languages: Pachyderm uses a simple protocol buffer API. Protobufs support a bunch of other languages, any of which can be used to programmatically use Pachyderm. We haven’t built clients for them yet, but it’s not too hard. It’s an easy way to contribute to Pachyderm if you’re looking to get involved.

Creating Analysis Pipelines

There are three steps to running an analysis in a Pachyderm “pipeline”:

  1. Write your code.
  2. Build a Docker image that includes your code and dependencies.
  3. Create a Pachyderm “pipeline” referencing that Docker image.

Multi-stage pipelines (e.g., parsing -> modeling -> output) can be created by repeating these three steps to build up a graph of processing steps. For more tips on composing pipelines see “Composing Pipelines”.
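
For instance, because every pipeline writes its results to an output repo with the same name, a downstream pipeline can simply use an upstream pipeline’s output repo as its input. Below is a minimal sketch of a hypothetical second stage; the “modeling” name, image, command, and the upstream “parsing” repo are placeholders for illustration:

{
  "pipeline": {
    "name": "modeling"
  },
  "transform": {
    "cmd": [ "python3", "/model.py" ],
    "image": "my-org/modeling-image"
  },
  "input": {
    "atom": {
      "repo": "parsing",
      "glob": "/*"
    }
  }
}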

1. Writing your analysis code

Code used to process data in Pachyderm can be written using any languages or libraries you want. It can be as simple as a bash command or as complicated as a TensorFlow neural network. At the end of the day, all your code and dependencies will be built into a container that can run anywhere (including inside of Pachyderm). We’ve got demonstrative examples on GitHub using bash, Python, TensorFlow, and OpenCV and we’re constantly adding more.

As we touch on briefly in the beginner tutorial, your code itself only needs to read and write files from a local file system. It does NOT have to import any special Pachyderm functionality or libraries. You just need to be able to read files and write files.

For the reading files part, Pachyderm automatically mounts each input data repository as /pfs/<repo_name> in the running instances of your Docker image (called “containers”). The code that you write just needs to read input data from this directory, just like in any other file system. Your analysis code also does NOT have to deal with data sharding or parallelization, as Pachyderm will automatically shard the input data across parallel containers. For example, if you’ve got four containers running your Python code, Pachyderm will automatically supply 1/4 of the input data to /pfs/<repo_name> in each running container. That being said, you also have a lot of control over how that input data is split across containers. Check out our guide on parallelization to see the details of that.

For the writing files part (saving results, etc.), your code simply needs to write to /pfs/out. This is a special directory mounted by Pachyderm in all of your running containers. Similar to reading data, your code doesn’t have to manage parallelization or sharding, just write data to /pfs/out and Pachyderm will make sure it all ends up in the correct place.
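
As a minimal sketch of this pattern (assuming a single input repo named “data” and a hypothetical process.py entry point), analysis code can be as simple as:

# process.py -- a hypothetical example
import os
import shutil

# Read every file Pachyderm has mounted under /pfs/data and write one result
# file per input to /pfs/out. Pachyderm handles sharding the inputs and
# collecting the outputs.
for dirpath, dirs, files in os.walk("/pfs/data"):
    for name in files:
        in_path = os.path.join(dirpath, name)
        out_path = os.path.join("/pfs/out", name)
        # Replace this copy with whatever processing your analysis needs.
        shutil.copy(in_path, out_path)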

2. Building a Docker Image

When you create a Pachyderm pipeline (which will be discussed next), you need to specify a Docker image including the code or binary you want to run. Please refer to the official documentation to learn how to build a Docker image. Note, your Docker image should NOT specify a CMD. Rather, you specify what commands are to be run in the container when you create your pipeline.

Unless Pachyderm is running on the same host that you used to build your image, you’ll need to use a public or private registry to get your image into the Pachyderm cluster. One (free) option is to use Docker’s DockerHub registry. You can refer to the official documentation to learn how to push your images to DockerHub. That being said, you are more than welcome to use any other public or private Docker registry.

Note, it is best practice to uniquely tag your Docker images with something other than :latest. This allows you to track which Docker images were used to process which data, and will help you as you update your pipelines. You can also utilize the --push-images flag on update-pipeline to help you tag your images as they are updated. See the updating pipelines docs for more information.
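
For example, a typical build, tag, and push flow with an explicit version tag might look like the following (the image name, tag, and registry account are placeholders):

# Build and tag the image with a unique version, then push it to your registry
$ docker build -t my-dockerhub-user/my-analysis:v1.0.1 .
$ docker push my-dockerhub-user/my-analysis:v1.0.1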

3. Creating a Pipeline

Now that you’ve got your code and image built, the final step is to tell Pachyderm to run the code in your image on certain input data. To do this, you need to supply Pachyderm with a JSON pipeline specification. There are four main components to a pipeline specification: name, transform, parallelism and input. Detailed explanations of the specification parameters and how they work can be found in the pipeline specification docs.

Here’s an example pipeline spec:

{
  "pipeline": {
    "name": "wordcount"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["/binary", "/pfs/data", "/pfs/out"]
  },
  "input": {
      "atom": {
        "repo": "data",
        "glob": "/*",
      }
  }
}

After you create the JSON pipeline spec (and save it, e.g., as your_pipeline.json), you can create the pipeline in Pachyderm using pachctl:

$ pachctl create-pipeline -f your_pipeline.json

(-f can also take a URL if your JSON manifest is hosted on GitHub or elsewhere. Keeping pipeline specifications under version control is a great idea so you can track changes and seamlessly view or deploy older pipelines if needed.)

Creating a pipeline tells Pachyderm to run the cmd (i.e., your code) in your image on the data in every finished commit on the input repo(s) as well as all future commits to the input repo(s). You can think of this pipeline as being “subscribed” to any new commits that are made on any of its input repos. It will automatically process the new data as it comes in.

Note - In Pachyderm 1.4+, as soon as you create your pipeline, Pachyderm will launch worker pods on Kubernetes, such that they are ready to process any data committed to their input repos.

Distributed Computing

Distributing computation across multiple workers is a fundamental part of processing any big data or computationally intensive workload. There are two main questions to think about when trying to distribute computation:

  1. How many workers to spread computation across?
  2. How to define which workers are responsible for which data?

Pachyderm Workers

Before we dive into the above questions, there are a few details you should understand about Pachyderm workers.

Every worker for a given pipeline is an identical pod running the Docker image you specified in the pipeline spec. Your analysis code does not need to do anything special to run in a distributed fashion. Instead, Pachyderm will spread out the data that needs to be processed across the various workers and make that data available for your code.

Pachyderm workers are spun up when you create the pipeline and are left running in the cluster waiting for new jobs (data) to be available for processing (committed). This saves having to recreate and schedule the worker for every new job.

Controlling the Number of Workers (Parallelism)

The number of workers that are used for a given pipeline is controlled by the parallelism_spec defined in the pipeline specification.

  "parallelism_spec": {
    "strategy": "CONSTANT"|"COEFFICIENT"
    "constant": int        // if strategy == CONSTANT
    "coefficient": double  // if strategy == COEFFICIENT

Pachyderm has two parallelism strategies: CONSTANT and COEFFICIENT.

If you use the CONSTANT strategy, Pachyderm will start the number of workers that you specify. To use this strategy, set the field strategy to CONSTANT, and set the field constant to an integer value (e.g. "constant":10 to use 10 workers).

If you use the COEFFICIENT strategy, Pachyderm will start a number of workers that is a multiple of your Kubernetes cluster’s size. To use this strategy, set the field coefficient to a double. For example, if your Kubernetes cluster has 10 nodes, and you set "coefficient": 0.5, Pachyderm will start five workers. If you set it to 2.0, Pachyderm will start 20 workers (two per Kubernetes node).

NOTE: The parallelism_spec is optional and will default to "coefficient": 1, which means that it’ll spawn one worker per Kubernetes node for this pipeline if left unset.
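
For example, a pipeline that should always run with exactly ten workers would include a section like this in its spec:

  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": 10
  }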

Spreading Data Across Workers (Glob Patterns)

Defining how your data is spread out among workers is arguably the most important aspect of distributed computation and is the fundamental idea around concepts like Map/Reduce.

Instead of confining users to just data-distribution patterns such as Map (split everything as much as possible) and Reduce (all the data must be grouped together), Pachyderm uses Glob Patterns to offer incredible flexibility in defining your data distribution.

Glob patterns are defined by the user for each atom within the input of a pipeline, and they tell Pachyderm how to divide the input data into individual “datums” that can be processed independently.

"input": {
    "atom": {
        "repo": "string",
        "glob": "string",
    }
}

That means you could easily define multiple “atoms”, one with the data highly distributed and another where it’s grouped together. You can then join the datums in these atoms via a cross product or union (as shown below) for combined, distributed processing.

"input": {
    "cross" or "union": [
        {
            "atom": {
                "repo": "string",
                "glob": "string",
            }
        },
        {
            "atom": {
                "repo": "string",
                "glob": "string",
            }
        },
        etc...
    ]
}

More information about “atoms,” unions, and crosses can be found in our Pipeline Specification.

Datums

Pachyderm uses the glob pattern to determine how many “datums” an input atom consists of. Datums are the unit of parallelism in Pachyderm. That is, Pachyderm attempts to process datums in parallel whenever possible.

If you have two workers and define 2 datums, Pachyderm will send one datum to each worker. In a scenario where there are more datums than workers, Pachyderm will queue up extra datums and send them to workers as they finish processing previous datums.

Defining Datums via Glob Patterns

Intuitively, you should think of the input atom repo as a file system where the glob pattern is being applied to the root of the file system. The files and directories that match the glob pattern are considered datums.

For example, a glob pattern of just / would denote the entire input repo as a single datum. All of the input data would be given to a single worker similar to a typical reduce-style operation.

Another commonly used glob pattern is /*. /* would define each top level object (file or directory) in the input atom repo as its own datum. If you have a repo with just 10 files in it and no directory structure, every file would be a datum and could be processed independently. This is similar to a typical map-style operation.

But Pachyderm can do anything in between too. If you have a directory structure with each state as a directory and a file for each city such as:

/California
   /San-Francisco.json
   /Los-Angeles.json
   ...
/Colorado
   /Denver.json
   /Boulder.json
   ...
...

and you need to process all the data for a given state together, /* would also be the desired glob pattern. You’d have one datum per state, meaning all the cities for a given state would be processed together by a single worker, but each state can be processed independently.

If we instead used the glob pattern /*/* for the states example above, each <city>.json would be its own datum.

Glob patterns also let you take only a particular directory (or subset of directories) as an input atom instead of the whole input repo. If we create a pipeline that is specifically only for California, we can use a glob pattern of /California/* to only use the data in that directory as input to our pipeline.
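
As a sketch, the input section of such a California-only pipeline might look like this (the repo name “states” is illustrative):

"input": {
    "atom": {
        "repo": "states",
        "glob": "/California/*"
    }
}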

Only Processing New Data

A datum defines the granularity at which Pachyderm decides what data is new and what data has already been processed. Pachyderm will never reprocess datums it’s already seen with the same analysis code. But if any part of a datum changes, the entire datum will be reprocessed.

Note: If you change your code (or pipeline spec), Pachyderm will of course allow you to process all of the past data through the new analysis code.

Let’s look at our states example with a few different glob patterns to demonstrate what gets processed and what doesn’t. Suppose we have an input data layout such as:

/California
   /San-Francisco.json
   /Los-Angeles.json
   ...
/Colorado
   /Denver.json
   /Boulder.json
   ...
...

If our glob pattern is /, then the entire input atom is a single datum, which means anytime any file or directory is changed in our input, all the data will be processed from scratch. There are plenty of use cases where this is exactly what we need (e.g. some machine learning training algorithms).

If our glob pattern is /*, then each state directory is its own datum and we’ll only process the ones that have changed. So if we add a new city file, Sacramento.json, to the /California directory, only the California datum will be reprocessed.

If our glob pattern was /*/*, then each <city>.json file would be its own datum. That means if we added a Sacramento.json file, only that specific file would be processed by Pachyderm.

Getting Data Out of Pachyderm

Once you’ve got one or more pipelines built and have data flowing through Pachyderm, you need to be able to track that data flowing through your pipeline(s) and get results out of Pachyderm. Let’s use the OpenCV pipeline as an example.

Here’s what our pipeline and the corresponding data repositories look like:

[Diagram: the OpenCV pipeline with its “images” input repo and “edges” output repo]

Every commit of new images into the “images” data repository results in a corresponding output commit of results into the “edges” data repository. But how do we get our results out of Pachyderm? Moreover, how would we get the particular result corresponding to a particular input image? That’s what we will explore here.

Getting files with pachctl

The pachctl CLI tool command get-file can be used to get versioned data out of any data repository:

pachctl get-file <repo> <commit-id or branch> path/to/file

In the case of the OpenCV pipeline, we could get out an image named example_pic.jpg:

pachctl get-file edges master example_pic.jpg

But how do we know which files to get? Of course we can use the pachctl list-file command to see what files are available. But how do we know which results are the latest, came from certain input, etc.? In this case, we would like to know which edge detected images in the edges repo come from which input images in the images repo. This is where provenance and the flush-commit command come in handy.

Examining file provenance with flush-commit

Generally, flush-commit will let our process block on an input commit until all of the output results are ready to read. In other words, flush-commit lets you view a consistent global snapshot of all your data at a given commit. Note, we are just going to cover a few aspects of flush-commit here.

Let’s demonstrate a typical workflow using flush-commit. First, we’ll make a few commits of data into the images repo on the master branch. That will then trigger our edges pipeline and generate three output commits in our edges repo:

$ pachctl list-commit images
REPO                ID                                 PARENT                             STARTED              DURATION             SIZE                
images              c721c4bb9a8046f3a7319ed97d256bb9   a9678d2a439648c59636688945f3c6b5   About a minute ago   1 seconds            932.2 KiB           
images              a9678d2a439648c59636688945f3c6b5   87f5266ef44f4510a7c5e046d77984a6   About a minute ago   Less than a second   238.3 KiB           
images              87f5266ef44f4510a7c5e046d77984a6   <none>                             10 minutes ago       Less than a second   57.27 KiB           
$ pachctl list-commit edges
REPO                ID                                 PARENT                             STARTED              DURATION             SIZE                
edges               f716eabf95854be285c3ef23570bd836   026536b547a44a8daa2db9d25bf88b79   About a minute ago   Less than a second   233.7 KiB           
edges               026536b547a44a8daa2db9d25bf88b79   754542b89c1c47a5b657e60381c06c71   About a minute ago   Less than a second   133.6 KiB           
edges               754542b89c1c47a5b657e60381c06c71   <none>                             2 minutes ago        Less than a second   22.22 KiB

In this case, we have one output commit per input commit on images. However, this might get more complicated for pipelines with multiple branches, multiple input atoms, etc. To confirm which commits correspond to which outputs, we can use flush-commit. In particular, we can call flush-commit on any one of our commits into images to see which output came from this particular commit:

$ pachctl flush-commit images/a9678d2a439648c59636688945f3c6b5
REPO                ID                                 PARENT                             STARTED             DURATION             SIZE                
edges               026536b547a44a8daa2db9d25bf88b79   754542b89c1c47a5b657e60381c06c71   3 minutes ago       Less than a second   133.6 KiB

Exporting data via egress

In addition to getting data out of Pachyderm with pachctl get-file, you can add an optional egress field to your pipeline specification. egress allows you to push the results of a Pipeline to an external data store such as S3, Google Cloud Storage or Azure Blob Storage. Data will be pushed after the user code has finished running but before the job is marked as successful.
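
As a sketch, an egress section added to a pipeline spec looks something like the following (the bucket URL is a placeholder):

"egress": {
  "URL": "s3://my-bucket/destination-dir"
}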

Other ways to view, interact with, or export data in Pachyderm

Although pachctl and egress provide easy ways to interact with data in Pachyderm repos, they are by no means the only ways. For example, you can:

  • Have one or more of your pipeline stages connect and export data to databases running outside of Pachyderm.
  • Use a Pachyderm service to launch a long running service, like Jupyter, that has access to internal Pachyderm data and can be accessed externally via a specified port.
  • Mount versioned data from the distributed file system via pachctl mount ... (a feature best suited for experimentation and testing).

Updating Pipelines

During development, it’s very common to update pipelines, whether it’s changing your code or just cranking up parallelism. For example, when developing a machine learning model you will likely need to try out a bunch of different versions of your model while your training data stays relatively constant. This is where update-pipeline comes in.

Updating your pipeline specification

In cases in which you are updating parallelism, adding another input repo, or otherwise modifying your pipeline specification, you just need to update your JSON file and call update-pipeline:

$ pachctl update-pipeline -f pipeline.json 

Similar to create-pipeline, update-pipeline with the -f flag can also take a URL if your JSON manifest is hosted on GitHub or elsewhere.

Updating the code used in a pipeline

You can also use update-pipeline to update the code you are using in one or more of your pipelines. To update the code in your pipeline:

  1. Make the code changes.
  2. Re-build your Docker image.
  3. Call update-pipeline with the --push-images flag.

You need to call update-pipeline with the --push-images flag because, if you have already run your pipeline, Pachyderm has already pulled the specified images. It won’t re-pull new versions of the images, unless we tell it to (which ensures that we don’t waste time pulling images when we don’t need to). When --push-images is specified, Pachyderm will do the following:

  1. Tag your image with a new unique tag.
  2. Push that tagged image to your registry (e.g., DockerHub).
  3. Update the pipeline specification that you previously gave to Pachyderm with the new unique tag.

For example, you could update the Python code used in the OpenCV pipeline via:

pachctl update-pipeline -f edges.json --push-images --password <registry password> -u <registry user>

Re-processing commits, from commit

Changing your pipeline code implies that your previously computed results aren’t in sync with (or generated by) your most recent code. By default (if the “from-commit” field in the pipeline spec is not given), Pachyderm will start a new “commit tree” for your new code and re-compute the results with your new code (committing to the new commit tree).

In some cases, such as changing parallelism, you don’t want to archive previous data and re-compute results. Or maybe you want to only utilize your new code for new input data (from that point on). If so, you can specify the “from” field in your pipeline specification with a commit ID. Pachyderm will then only process new data from that commit ID on with the new code.

Note that from can take a branch name. As such, you can just specify "from": "master" to process only the new data with the updated pipeline (because master points to the latest commit).

Examples

OpenCV Edge Detection

This example does edge detection using OpenCV. This is our canonical starter demo. If you haven’t used Pachyderm before, start here. We’ll get you started running Pachyderm locally in just a few minutes and processing sample images.

Open CV

Word Count (Map/Reduce)

Word count is basically the “hello world” of distributed computation. This example is great for benchmarking in distributed deployments on large swaths of text data.

Word Count

Machine Learning

Sentiment analysis with Neon

This example implements the machine learning template pipeline discussed in this blog post. It trains and utilizes a neural network (implemented in Python using Nervana Neon) to infer the sentiment of movie reviews based on data from IMDB.

Neon - Sentiment Analysis

pix2pix with TensorFlow

If you haven’t seen pix2pix, check out this great demo. In this example, we implement the training and image translation of the pix2pix model in Pachyderm, so you can generate cat images from edge drawings, daytime photos from nighttime photos, etc.

TensorFlow - pix2pix

Intro

Pachyderm runs on Kubernetes and is backed by an object store of your choice. As such, Pachyderm can run on any platform that supports Kubernetes and an object store. These docs cover the following commonly used deployments:

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the env variable METRICS to false in the pachd container.
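
One possible way to do that (a sketch; it assumes pachd is managed by the rc/pachd replication controller shown in the deploy output and that your kubectl supports set env) is:

# Set METRICS=false on the pachd pod template, then delete the running pachd
# pod so the replication controller recreates it with the new setting
$ kubectl set env rc/pachd METRICS=false
$ kubectl delete pod <pachd pod name from `kubectl get all`>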

Google Cloud Platform

Google Cloud Platform has excellent support for Kubernetes through the Google Container Engine.

Prerequisites

If this is your first time using the SDK, make sure to follow the quick start guide. This may update your ~/.bash_profile and point your $PATH at the location where you extracted google-cloud-sdk. We recommend extracting it to ~/bin.

Note, you can also install kubectl via the SDK using:

$ gcloud components install kubectl

This will download the kubectl binary to google-cloud-sdk/bin.

Deploy Kubernetes

To create a new Kubernetes cluster in GKE, run:

$ CLUSTER_NAME=[any unique name, e.g. pach-cluster]

$ GCP_ZONE=[a GCP availability zone. e.g. us-west1-a]

$ gcloud config set compute/zone ${GCP_ZONE}

$ gcloud config set container/cluster ${CLUSTER_NAME}

$ MACHINE_TYPE=[machine for the k8s nodes. We recommend "n1-standard-4" or larger.]

# By default this spins up a 3-node cluster. You can change the default with `--num-nodes VAL`
$ gcloud container clusters create ${CLUSTER_NAME} --scopes storage-rw --machine-type ${MACHINE_TYPE}

This may take a few minutes to start up. You can check the status on the GCP Console. Then, after the cluster is up, you can point kubectl to this cluster via:

# Update your kubeconfig to point at your newly created cluster
$ gcloud container clusters get-credentials ${CLUSTER_NAME}

As a sanity check, make sure your cluster is up and running via kubectl:

$ kubectl get all
NAME             CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
svc/kubernetes   10.0.0.1     <none>        443/TCP   22s

Deploy Pachyderm

To deploy Pachyderm we will need to:

  1. Add some storage resources on Google,
  2. Install the Pachyderm CLI tool, pachctl, and
  3. Deploy Pachyderm on top of the storage resources.

Set up the Storage Resources

Pachyderm needs a GCS bucket and a persistent disk to function correctly. To create the persistent disk:

# For a demo you should only need 10 GB. This stores PFS metadata. For reference, 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the volume that you are going to create, in GBs. e.g. "10"]

# Name this whatever you want, we chose pach-disk as a default
$ STORAGE_NAME=pach-disk

$ gcloud compute disks create --size=${STORAGE_SIZE}GB ${STORAGE_NAME}

Then we need to specify the bucket name and create the bucket:

# BUCKET_NAME needs to be globally unique across the entire GCP region.
$ BUCKET_NAME=[The name of the GCS bucket where your data will be stored]

# Create the bucket.
$ gsutil mb gs://${BUCKET_NAME}

To check that everything has been set up correctly, try:

$ gcloud compute instances list
# should see a number of instances

$ gsutil ls
# should see a bucket

$ gcloud compute disks list
# should see a number of disks, including the one you specified

Install pachctl

pachctl is a command-line utility for interacting with a Pachyderm cluster.

# For OSX:
$ brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.4

# For Linux (64 bit):
$ curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.4.7-RC1/pachctl_1.4.7-RC1_amd64.deb && sudo dpkg -i /tmp/pachctl.deb

You can try running pachctl version to check that this worked correctly, but Pachyderm itself isn’t deployed yet so you won’t get a pachd version.

$ pachctl version
COMPONENT           VERSION             
pachctl             1.4.0           
pachd               (version unknown) : error connecting to pachd server at address (0.0.0.0:30650): context deadline exceeded

please make sure pachd is up (`kubectl get all`) and portforwarding is enabled

Deploy Pachyderm

Now we’re ready to deploy Pachyderm itself. This can be done in one command:

pachctl deploy google ${BUCKET_NAME} ${STORAGE_SIZE} --static-etcd-volume=${STORAGE_NAME}

It may take a few minutes for the pachd nodes to be running because it’s pulling containers from DockerHub. You can see the cluster status by using:

$ kubectl get all
NAME             READY     STATUS    RESTARTS   AGE
po/etcd-wn317    1/1       Running   0          5m
po/pachd-mljp6   1/1       Running   3          5m

NAME       DESIRED   CURRENT   READY     AGE
rc/etcd    1         1         1         5m
rc/pachd   1         1         1         5m

NAME             CLUSTER-IP   EXTERNAL-IP   PORT(S)                         AGE
svc/etcd         10.0.0.165   <nodes>       2379:32379/TCP,2380:32686/TCP   5m
svc/kubernetes   10.0.0.1     <none>        443/TCP                         5m
svc/pachd        10.0.0.214   <nodes>       650:30650/TCP,651:30651/TCP     5m

Note: If you see a few restarts on the pachd nodes, that’s totally ok. That simply means that Kubernetes tried to bring up those containers before other components were ready so it restarted them.

Finally, assuming your pachd is running as shown above, we need to forward a port so that pachctl can talk to the cluster.

# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &

And you’re done! You can test to make sure the cluster is working by trying pachctl version or even creating a new repo.

$ pachctl version
COMPONENT           VERSION
pachctl             1.4.0
pachd               1.4.0

Amazon Web Services

Below, we show how to deploy Pachyderm on AWS in a couple of different ways:

  1. By manually deploying Kubernetes and Pachyderm.
  2. By executing a one-shot deploy script that will deploy both Kubernetes and Pachyderm.

If you already have a Kubernetes deployment or would like to customize the types of instances, size of volumes, etc. in your Kubernetes cluster, you should follow option (1). If you just want a quick deploy to experiment with Pachyderm in AWS or would just like to use our default configuration, you might want to try option (2).

Prerequisites

Manual Pachyderm Deploy

Deploy Kubernetes

The easiest way to install Kubernetes on AWS is with kops. Kubernetes has provided a step-by-step guide for the deploy. Please follow this guide to deploy Kubernetes on AWS.

Once you have a Kubernetes cluster up and running in AWS, you should be able to see the following output from kubectl:

$ kubectl get all
NAME             CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
svc/kubernetes   10.0.0.1     <none>        443/TCP   22s

Deploy Pachyderm

To deploy Pachyderm we will need to:

  1. Install the pachctl CLI tool,
  2. Add some storage resources on AWS,
  3. Deploy Pachyderm on top of the storage resources.

Install pachctl

To deploy and interact with Pachyderm, you will need pachctl, a command-line utility used for Pachyderm. To install pachctl run one of the following:

# For OSX:
$ brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.4

# For Linux (64 bit):
$ curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.4.7-RC1/pachctl_1.4.7-RC1_amd64.deb && sudo dpkg -i /tmp/pachctl.deb

You can try running pachctl version to check that this worked correctly, but Pachyderm itself isn’t deployed yet so you won’t get a pachd version.

$ pachctl version
COMPONENT           VERSION
pachctl             1.4.0
pachd               (version unknown) : error connecting to pachd server at address (0.0.0.0:30650): context deadline exceeded.

Set up the Storage Resources

Pachyderm needs an S3 bucket, and a persistent disk (EBS) to function correctly.

Here are the environmental variables you should set up to create these resources:

$ kubectl cluster-info
  Kubernetes master is running at https://1.2.3.4
  ...
$ KUBECTLFLAGS="-s [The public IP of the Kubernetes master. e.g. 1.2.3.4]"

# BUCKET_NAME needs to be globally unique across the entire AWS region
$ BUCKET_NAME=[The name of the S3 bucket where your data will be stored]

# We recommend between 1 and 10 GB. This stores PFS metadata. For reference 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the EBS volume that you are going to create, in GBs. e.g. "10"]

$ AWS_REGION=[the AWS region of your Kubernetes cluster. e.g. "us-west-2" (not us-west-2a)]

$ AWS_AVAILABILITY_ZONE=[the AWS availability zone of your Kubernetes cluster. e.g. "us-west-2a"]

Then to actually create the resources, you can run:

$ aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION} --create-bucket-configuration LocationConstraint=${AWS_REGION}

$ aws ec2 create-volume --size ${STORAGE_SIZE} --region ${AWS_REGION} --availability-zone ${AWS_AVAILABILITY_ZONE} --volume-type gp2

Record the “volume-id” that is output (e.g. “vol-8050b807”) from the above create-volume command as shown below (you can also view it in the AWS console or with aws ec2 describe-volumes):

$ STORAGE_NAME=<volume id>

Now, as a sanity check, you should be able to see the bucket and the EBS volume that you just created:

aws s3api list-buckets --query 'Buckets[].Name'
aws ec2 describe-volumes --query 'Volumes[].VolumeId'

Deploy Pachyderm

When you installed kops, you should have created a dedicated IAM user (see here for details). To deploy Pachyderm you will need to export these credentials to the following environmental variables:

$ AWS_ID=[access key ID]

$ AWS_KEY=[secret access key]

Run the following command to deploy your Pachyderm cluster:

$ pachctl deploy amazon ${BUCKET_NAME} ${AWS_ID} ${AWS_KEY} " " ${AWS_REGION} ${STORAGE_SIZE} --static-etcd-volume=${STORAGE_NAME}

(Note, the " " in the deploy command is for an optional temporary AWS token, if you are just experimenting with a deploy. Such a token should NOT be used for a production deploy). It may take a few minutes for the pachd nodes to be running because it’s pulling containers from DockerHub. You can see the cluster status by using:

$ kubectl get all
NAME             READY     STATUS    RESTARTS   AGE
po/etcd-j834q    1/1       Running   0          1m
po/pachd-hq4r1   1/1       Running   3          1m

NAME       DESIRED   CURRENT   READY     AGE
rc/etcd    1         1         1         1m
rc/pachd   1         1         1         1m

NAME             CLUSTER-IP       EXTERNAL-IP   PORT(S)                       AGE
svc/etcd         100.64.95.15     <nodes>       2379:30049/TCP                1m
svc/kubernetes   100.64.0.1       <none>        443/TCP                       16m
svc/pachd        100.64.189.246   <nodes>       650:30650/TCP,651:30651/TCP   1m

Note: If you see a few restarts on the pachd nodes, that’s totally ok. That simply means that Kubernetes tried to bring up those containers before etcd was ready so it restarted them.

Finally, we need to forward a port so that pachctl can talk to the cluster.

# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &

And you’re done! You can test to make sure the cluster is working by trying pachctl version or even creating a new repo.

$ pachctl version
COMPONENT           VERSION
pachctl             1.4.0
pachd               1.4.0

One Shot Script

Install additional prerequisites

This scripted deploy requires a couple of prerequisites in addition to the ones listed under Prerequisites.

Run the deploy script

Once you have the prerequisites mentioned above, download and run our AWS deploy script by running:

curl -o aws.sh https://raw.githubusercontent.com/pachyderm/pachyderm/master/etc/deploy/aws.sh
chmod +x aws.sh
sudo -E ./aws.sh

This script will use kops to deploy Kubernetes and Pachyderm in AWS. The script will ask you for your AWS credentials, region preference, etc. If you would like to customize the number of nodes in the cluster, node types, etc., you can open up the deploy script and modify the respective fields.

The script will take a few minutes, and Pachyderm will take an additional couple of minutes to spin up. Once it is up, kubectl get all should return something like:

NAME             READY     STATUS    RESTARTS   AGE
po/etcd-wn317    1/1       Running   0          5m
po/pachd-mljp6   1/1       Running   3          5m

NAME       DESIRED   CURRENT   READY     AGE
rc/etcd    1         1         1         5m
rc/pachd   1         1         1         5m

NAME             CLUSTER-IP   EXTERNAL-IP   PORT(S)                         AGE
svc/etcd         10.0.0.165   <nodes>       2379:32379/TCP,2380:32686/TCP   5m
svc/kubernetes   10.0.0.1     <none>        443/TCP                         5m
svc/pachd        10.0.0.214   <nodes>       650:30650/TCP,651:30651/TCP     5m

Connect pachctl

Finally, we need to forward a port so that pachctl can talk to the cluster.

# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &

And you’re done! You can test to make sure the cluster is working by trying pachctl version:

$ pachctl version
COMPONENT           VERSION
pachctl             1.4.0
pachd               1.4.0

Azure

Prerequisites

Deploy Kubernetes

The easiest way to deploy a Kubernetes cluster is to use the official Kubernetes guide.

Deploy Pachyderm

To deploy Pachyderm we will need to:

  1. Add some storage resources on Azure,
  2. Install the Pachyderm CLI tool, pachctl, and
  3. Deploy Pachyderm on top of the storage resources.

Set up the Storage Resources

Pachyderm requires an object store (Azure Storage) and a data disk to function correctly.

Here are the parameters required to create these resources:

# Needs to be globally unique across the entire Azure location
$ RESOURCE_GROUP=[The name of the resource group where the Azure resources will be organized]

$ LOCATION=[The Azure region of your Kubernetes cluster. e.g. "West US2"]

# Needs to be globally unique across the entire Azure location
$ STORAGE_ACCOUNT=[The name of the storage account where your data will be stored]

$ CONTAINER_NAME=[The name of the Azure blob container where your data will be stored]

# Needs to end in a ".vhd" extension
$ STORAGE_NAME=pach-disk.vhd

# We recommend between 1 and 10 GB. This stores PFS metadata. For reference 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the data disk volume that you are going to create, in GBs. e.g. "10"]

And then run:

# Create a resource group
$ az group create --name=${RESOURCE_GROUP} --location=${LOCATION}

# Create azure storage account
az storage account create \
  --resource-group="${RESOURCE_GROUP}" \
  --location="${LOCATION}" \
  --sku=Standard_LRS \
  --name="${STORAGE_ACCOUNT}" \
  --kind=Storage

# Build microsoft tool for creating Azure VMs from an image
$ STORAGE_KEY="$(az storage account keys list \
                 --account-name="${STORAGE_ACCOUNT}" \
                 --resource-group="${RESOURCE_GROUP}" \
                 --output=json \
                 | jq .[0].value -r
              )"
$ make docker-build-microsoft-vhd 
$ VOLUME_URI="$(docker run -it microsoft_vhd \
                "${STORAGE_ACCOUNT}" \
                "${STORAGE_KEY}" \
                "${CONTAINER_NAME}" \
                "${STORAGE_NAME}" \
                "${STORAGE_SIZE}G"
             )"

To check that everything has been set up correctly, try:

$ az storage account list | jq '.[].name'
$ az storage blob list \
  --container=${CONTAINER_NAME} \
  --account-name=${STORAGE_ACCOUNT} \
  --account-key=${STORAGE_KEY}

Install pachctl

pachctl is a command-line utility used for interacting with a Pachyderm cluster.

# For OSX:
$ brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.4

# For Linux (64 bit):
$ curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.4.7-RC1/pachctl_1.4.7-RC1_amd64.deb && sudo dpkg -i /tmp/pachctl.deb

You can try running pachctl version to check that this worked correctly, but Pachyderm itself isn’t deployed yet so you won’t get a pachd version.

$ pachctl version
COMPONENT           VERSION
pachctl             1.4.0
pachd               (version unknown) : error connecting to pachd server at address (0.0.0.0:30650): context deadline exceeded.

Deploy Pachyderm

Now we’re ready to boot up Pachyderm:

$ pachctl deploy microsoft ${CONTAINER_NAME} ${STORAGE_ACCOUNT} ${STORAGE_KEY} ${STORAGE_SIZE} --static-etcd-volume=${VOLUME_URI}

It may take a few minutes for the pachd nodes to be running because it’s pulling containers from Docker Hub. You can see the cluster status by using kubectl get all:

NAME             READY     STATUS    RESTARTS   AGE
po/etcd-wn317    1/1       Running   0          5m
po/pachd-mljp6   1/1       Running   3          5m

NAME       DESIRED   CURRENT   READY     AGE
rc/etcd    1         1         1         5m
rc/pachd   1         1         1         5m

NAME             CLUSTER-IP   EXTERNAL-IP   PORT(S)                         AGE
svc/etcd         10.0.0.165   <nodes>       2379:32379/TCP,2380:32686/TCP   5m
svc/kubernetes   10.0.0.1     <none>        443/TCP                         5m
svc/pachd        10.0.0.214   <nodes>       650:30650/TCP,651:30651/TCP     5m

Note: If you see a few restarts on the pachd nodes, that’s totally ok. That simply means that Kubernetes tried to bring up those containers before etcd was ready so it restarted them.

Finally, we need to forward a port so that pachctl can talk to the cluster.

# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &

And you’re done! You can test to make sure the cluster is working by trying pachctl version or even creating a new repo.

$ pachctl version
COMPONENT           VERSION
pachctl             1.4.0
pachd               1.4.0

OpenShift

OpenShift is a popular enterprise Kubernetes distribution. Pachyderm can run on OpenShift with a few additional steps:

  1. Make sure that privileged containers are allowed (they are not allowed by default). You can add the privileged scc (SecurityContextConstraints) to the pachyderm service account:
oadm policy add-scc-to-user privileged system:serviceaccount:<PROJECT_NAME>:pachyderm

or manually edit oc edit scc privileged:

users:
- system:serviceaccount:<PROJECT_NAME>:pachyderm
  2. Replace hostPath with emptyDir in your cluster manifest (your manifest is generated by the pachctl deploy ... command, or can be generated manually. To only generate the manifest, run pachctl deploy ... with the --dry-run flag).
      "spec": {
        "volumes": [
          {
            "name": "pach-disk",
            "emptyDir": {}
          }
        ],

 ... <snip>  ...

      "spec": {
        "volumes": [
          {
            "name": "etcd-storage",
            "emptyDir": {}
          }
        ],

Please note that emptyDir does not persist your data. You need to configure a persistent volume or hostPath to persist your data.

  3. Deploy the Pachyderm manifest you modified.
$ oc create -f pachyderm.json

You can see the cluster status by using oc get all, just like with Kubernetes:

$ oc get all
NAME             DESIRED          CURRENT       AGE
rc/etcd          1                1             5m
rc/pachd         1                1             5m
NAME             CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
svc/etcd         172.30.170.24    <nodes>       2379/TCP          5m
svc/pachd        172.30.194.202   <nodes>       650/TCP,651/TCP   5m
NAME             READY            STATUS        RESTARTS          AGE
po/etcd-7m5r1    1/1              Running       0                 5m
po/pachd-foq68   1/1              Running       0                 5m

Problems related to OpenShift deployment are tracked in this issue: https://github.com/pachyderm/pachyderm/issues/336. If you have additional related questions, please ask them on Pachyderm’s Slack channel or via email at support@pachyderm.io.

On Premises

Pachyderm is built on Kubernetes and can be backed by an object store of your choice. As such, Pachyderm can run on any on-premises platform/framework that supports Kubernetes, a persistent disk/volume, and an object store.

Prerequisites

  1. kubectl
  2. pachctl

Kubernetes

The Kubernetes docs have instructions for deploying Kubernetes in a variety of on-premise scenarios. We recommend following one of these guides to get Kubernetes running on premise.

Object Store

Once you have Kubernetes up and running, deploying Pachyderm is a matter of supplying Kubernetes with a JSON/yaml manifest to create the Pachyderm resources. This includes providing information that Pachyderm will use to connect to a backing object store.

For on premise deployments, we recommend using Minio as a backing object store. However, at this point, you could utilize any backing object store that has an S3 compatible API. To create a manifest template for your on premise deployment, run:

pachctl deploy custom --persistent-disk google --object-store s3 <persistent disk name> <persistent disk size> <object store bucket> <object store id> <object store secret> <object store endpoint> --static-etcd-volume=${STORAGE_NAME} --dry-run > deployment.json

Then you can modify deployment.json to fit your environment and Kubernetes deployment. Once you have your manifest ready, deploying Pachyderm is as simple as:

kubectl create -f deployment.json

Need Help?

If you need help with your on premises deploy, please reach out to us on Pachyderm’s slack channel or via email at support@pachyderm.io. We are happy to help!

Custom Object Stores

In other sections of this guide we have demonstrated how to deploy Pachyderm in a single cloud using that cloud’s object store offering. However, Pachyderm can be backed by any object store, and you are not restricted to the object store service provided by the cloud in which you are deploying.

As long as you are running an object store that has an S3 compatible API, you can easily deploy Pachyderm in a way that will allow you to back Pachyderm by that object store. For example, we have seen Pachyderm be backed by Minio, GlusterFS, Ceph, and more.

To deploy Pachyderm with your choice of object store in Google, Azure, or AWS, see the below guides. To deploy Pachyderm on premise with a custom object store, see the on premise docs.

Common Prerequisites

  1. A working Kubernetes cluster and kubectl.
  2. An account on or running instance of an object store with an S3 compatible API. You should be able to get an ID, secret, bucket name, and endpoint that point to this object store.

Google + Custom Object Store

Additional prerequisites:

First, we need to create a persistent disk for Pachyderm’s metadata:

# Name this whatever you want, we chose pach-disk as a default
$ STORAGE_NAME=pach-disk

# For a demo you should only need 10 GB. This stores PFS metadata. For reference, 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the volume that you are going to create, in GBs. e.g. "10"]

# Create the disk.
gcloud compute disks create --size=${STORAGE_SIZE}GB ${STORAGE_NAME}

Then we can deploy Pachyderm:

pachctl deploy custom --persistent-disk google --object-store s3 ${STORAGE_NAME} ${STORAGE_SIZE} <object store bucket> <object store id> <object store secret> <object store endpoint> --static-etcd-volume=${STORAGE_NAME}

AWS + Custom Object Store

Additional prerequisites:

First, we need to create a persistent disk for Pachyderm’s metadata:

# We recommend between 1 and 10 GB. This stores PFS metadata. For reference 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the EBS volume that you are going to create, in GBs. e.g. "10"]

$ AWS_REGION=[the AWS region of your Kubernetes cluster. e.g. "us-west-2" (not us-west-2a)]

$ AWS_AVAILABILITY_ZONE=[the AWS availability zone of your Kubernetes cluster. e.g. "us-west-2a"]

# Create the volume.
$ aws ec2 create-volume --size ${STORAGE_SIZE} --region ${AWS_REGION} --availability-zone ${AWS_AVAILABILITY_ZONE} --volume-type gp2

# Store the volume ID.
$ aws ec2 describe-volumes
$ STORAGE_NAME=[volume id]

Then we can deploy Pachyderm:

pachctl deploy custom --persistent-disk aws --object-store s3 ${STORAGE_NAME} ${STORAGE_SIZE} <object store bucket> <object store id> <object store secret> <object store endpoint> --static-etcd-volume=${STORAGE_NAME}

Azure + Custom Object Store

Additional prerequisites:

  • Install Azure CLI >= 2.0.1
  • Install jq
  • Clone github.com/pachyderm/pachyderm and work from the root of that project.

First, we need to create a persistent disk for Pachyderm’s metadata. To do this, start by declaring some environmental variables:

# Needs to be globally unique across the entire Azure location
$ RESOURCE_GROUP=[The name of the resource group where the Azure resources will be organized]

$ LOCATION=[The Azure region of your Kubernetes cluster. e.g. "West US2"]

# Needs to be globally unique across the entire Azure location
$ STORAGE_ACCOUNT=[The name of the storage account where your data will be stored]

# Needs to end in a ".vhd" extension
$ STORAGE_NAME=pach-disk.vhd

# We recommend between 1 and 10 GB. This stores PFS metadata. For reference 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the data disk volume that you are going to create, in GBs. e.g. "10"]

And then run:

# Create a resource group
$ az group create --name=${RESOURCE_GROUP} --location=${LOCATION}

# Create azure storage account
az storage account create \
  --resource-group="${RESOURCE_GROUP}" \
  --location="${LOCATION}" \
  --sku=Standard_LRS \
  --name="${STORAGE_ACCOUNT}" \
  --kind=Storage

# Build microsoft tool for creating Azure VMs from an image
$ STORAGE_KEY="$(az storage account keys list \
                 --account-name="${STORAGE_ACCOUNT}" \
                 --resource-group="${RESOURCE_GROUP}" \
                 --output=json \
                 | jq .[0].value -r
              )"
$ make docker-build-microsoft-vhd 
$ VOLUME_URI="$(docker run -it microsoft_vhd \
                "${STORAGE_ACCOUNT}" \
                "${STORAGE_KEY}" \
                "${CONTAINER_NAME}" \
                "${STORAGE_NAME}" \
                "${STORAGE_SIZE}G"
             )"

To check that everything has been set up correctly, try:

$ az storage account list | jq '.[].name'

Then we can deploy Pachyderm:

pachctl deploy custom --persistent-disk azure --object-store s3 ${VOLUME_URI} ${STORAGE_SIZE} <object store bucket> <object store id> <object store secret> <object store endpoint> --static-etcd-volume=${VOLUME_URI}

Migrations

Occasionally, Pachyderm introduces changes that are backward-incompatible: repos/commits/files created on an old version of Pachyderm may be unusable on a new version of Pachyderm. When that happens, we try our best to write a migration script that “upgrades” your data so it’s usable by the new version of Pachyderm.

To upgrade from version X to version Y, look under the directory named migration/X-Y. For instance, to upgrade from 1.3.12 to 1.4.0, look under migration/1.3.12-1.4.0.

Note - If you are migrating from Pachyderm <= 1.3 to 1.4+, you should read this guide. In this particular case, a migration script is NOT provided due to significant changes in our processing and metadata structures.

Backup

It’s paramount that you back up your data before running a migration script. While we’ve tested the scripts extensively, it’s still possible that they contain bugs, or that you accidentally use them in the wrong way.

In general, there are two data storage systems that you might consider backing up: the metadata storage and the data storage. Not all migration scripts touch both systems, so you might only need to back up one of them. Look at the README for a particular migration script for details.

Backup the metadata store

Assuming you’ve deployed Pachyderm on a public cloud, your metadata is probably stored on a persistent volume. See the respective Deploying Pachyderm guide for details.

Here are official guides on backing up persistent volumes for each cloud provider:

Backup the object store

We don’t currently have migration scripts that touch the data storage system.

Creating Machine Learning Workflows

Because Pachyderm is language/framework agnostic and because it easily distributes analyses over large data sets, data scientists can use whatever tooling they like for ML. Even if that tooling isn’t familiar to the rest of an engineering organization, data scientists can autonomously develop and deploy scalable solutions via containers. Moreover, Pachyderm’s pipelining logic, paired with data versioning, allows any results to be exactly reproduced (e.g., for debugging or during the development of improvements to a model).

We recommend combining model training processes, persisted models, and a model utilization process (e.g., making inferences or generating results) into a single Pachyderm pipeline DAG (Directed Acyclic Graph). Such a pipeline allows us to:

  • Keep a rigorous historical record of exactly what models were used on what data to produce which results.
  • Automatically update online ML models when training data or parameterization changes.
  • Easily revert to other versions of an ML model when a new model is not performing or when “bad data” is introduced into a training data set.

This sort of sustainable ML pipeline looks like this:

[Diagram: sustainable ML pipeline DAG]

A data scientist can update the training dataset at any time to automatically train a new persisted model. This training could utilize any language or framework (Spark, Tensorflow, scikit-learn, etc.) and output any format of persisted model (pickle, XML, POJO, etc.). Regardless of framework, the model will be versioned by Pachyderm, and you will be able to track what “Input data” was input into which model AND exactly what “Training data” was used to train that model.

Any new input data coming into the “Input data” repository will be processed with the updated model. Old predictions can be re-computed with the updated model, or new models could be backtested on previously input and versioned data. This will allow you to avoid manual updates to historical results or having to worry about how to swap out ML models in production!

Examples

We have implemented this machine learning workflow in some example pipelines using a couple of different frameworks. These examples are a great starting point if you are trying to implement ML in Pachyderm.

Processing Time-Windowed Data

If you are analyzing data that is changing over time, chances are that you will want to perform some sort of analysis on “the last two weeks of data,” “January’s data,” or some other moving or static time window of data. There are a few different ways of doing these types of analyses in Pachyderm, depending on your use case. We recommend one of the following patterns for:

  1. Fixed time windows - for rigid, fixed time windows, such as months (Jan, Feb, etc.) or days (01-01-17, 01-02-17, etc.).
  2. Moving or rolling time windows - for rolling time windows of data, such as three day windows or two week windows.

Fixed time windows

As further discussed in Creating Analysis Pipelines and Distributed Computing, the basic unit of data partitioning in Pachyderm is a “datum,” which is defined by a glob pattern. When analyzing data within fixed time windows (e.g., corresponding to fixed calendar times/dates), we recommend organizing your data repositories such that each of the time windows that you are going to analyze corresponds to a separate file or directory in your repository. By doing this, you will be able to:

  • Analyze each time window in parallel.
  • Only re-process data within a time window when that data, or a corresponding data pipeline, changes.

For example, if you have monthly time windows of JSON sales data that need to be analyzed, you could create a sales data repository and structure it like:

sales
├── January
|   ├── 01-01-17.json
|   ├── 01-02-17.json
|   └── etc...
├── February
|   ├── 02-01-17.json
|   ├── 02-02-17.json
|   └── etc...
└── March
    ├── 03-01-17.json
    ├── 03-02-17.json
    └── etc...

When you run a pipeline with an input repo of sales having a glob pattern of /*, each month’s worth of sales data is processed in parallel (if possible). Further, when you add new data into a subset of the months or add data into a new month (e.g., May), only those updated datums will be re-processed.
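
For instance, a minimal pipeline spec for this layout might look like the following sketch (the pipeline name, image, and command are placeholders; only the sales repo and the /* glob come from the example above):

{
  "pipeline": {
    "name": "monthly_sales_totals"
  },
  "transform": {
    "image": "monthly-aggregator-image",
    "cmd": ["/aggregate.sh", "/pfs/sales", "/pfs/out"]
  },
  "input": {
    "atom": {
      "repo": "sales",
      "glob": "/*"
    }
  }
}

Because the glob is /*, each month’s directory is its own datum, so committing new data into March only re-processes March.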

More generally, this structure allows you to create:

  • Pipelines that aggregate, or otherwise process, daily data on a monthly basis via a /* glob pattern.
  • Pipelines that only analyze a certain month’s data via, e.g., a /January/* or /January/ glob pattern.
  • Pipelines that process data on a daily basis via a /*/* glob pattern.
  • Any combination of the above.

Moving or rolling time windows

In certain use cases, you need to run analyses for moving or rolling time windows, even when those don’t correspond to certain calendar months, days, etc. For example, you may need to analyze the last three days of data, the three days of data prior to that, the three days of data prior to that, etc. In other words, you need to run an analysis for every rolling length of time.

For rolling or moving time windows, there are a couple of recommended patterns:

  1. Bin your data in repository folders for each of the rolling/moving time windows.
  2. Maintain a time windowed set of data corresponding to the latest of the rolling/moving time windows.

Binning data into rolling/moving time windows

In this method of processing rolling time windows, we’ll use a two-pipeline DAG to analyze time windows efficiently:

  • Pipeline 1 - Read in data, determine which bins the data corresponds to, and write the data into those bins
  • Pipeline 2 - Read in and analyze the binned data.

By splitting this analysis into two pipelines we can benefit from parallelism at the file level. In other words, Pipeline 1 can be easily parallelized for each file, and Pipeline 2 can be parallelized per bin. Now we can scale the pipelines easily as the number of files increases.

Let’s take three-day rolling time windows as an example and say that we want to analyze rolling windows of sales data. In a first repo, called sales, a first day’s worth of sales data is committed:

sales
└── 01-01-17.json

We then create a first pipeline to bin this into a repository directory corresponding to our first rolling time window from 01-01-17 to 01-03-17:

binned_sales
└── 01-01-17_to_01-03-17
    └── 01-01-17.json

When our next day’s worth of sales is committed,

sales
├── 01-01-17.json
└── 01-02-17.json

the first pipeline executes again to bin the 01-02-17 data into any relevant bins. In this case, we would put it in the previously created bin for 01-01-17 to 01-03-17, but we would also put it into a bin starting on 01-02-17:

binned_sales
├── 01-01-17_to_01-03-17
|   ├── 01-01-17.json
|   └── 01-02-17.json
└── 01-02-17_to_01-04-17
    └── 01-02-17.json

As more and more daily data is added, you will end up with a directory structure that looks like:

binned_sales
├── 01-01-17_to_01-03-17
|   ├── 01-01-17.json
|   ├── 01-02-17.json
|   └── 01-03-17.json
├── 01-02-17_to_01-04-17
|   ├── 01-02-17.json
|   ├── 01-03-17.json
|   └── 01-04-17.json
├── 01-03-17_to_01-05-17
|   ├── 01-03-17.json
|   ├── 01-04-17.json
|   └── 01-05-17.json
└── etc...

and is maintained over time as new data is committed:

[Diagram: binned_sales directory structure maintained over time as new data is committed]

Your second pipeline can then process these bins in parallel, via a glob pattern of /*, or in any other relevant way as discussed further in the “Fixed time windows” section. Both your first and second pipelines can be easily parallelized.

Note - When looking at the above directory structure, it may seem like there is an unnecessary duplication of the data. However, under the hood Pachyderm deduplicates all of these files and maintains a space-efficient representation of your data. The binning of the data is merely a structural re-arrangement to allow you to process these types of rolling time windows.

Note - It might also seem as if there are unnecessary data transfers over the network to perform the above binning. However, the Pachyderm team is currently working on enhancements to ensure that performing these types of “shuffles” doesn’t actually require transferring data over the network. This work is being tracked here.

Maintaining a single time-windowed data set

The advantage of the binning pattern above is that any of the rolling time windows are available for processing. They can be compared, aggregated, combined, etc. in any way, and any results or aggregations are kept in sync with updates to the bins. However, you do need to put in some logic to maintain the binning directory structure.

There is another pattern for moving time windows that avoids the binning of the above approach and maintains an up-to-date version of a moving time-windowed data set. It also involves two pipelines:

  • Pipeline 1 - Read in data, determine which files belong in your moving time window, and write the relevant files into an updated version of the moving time-windowed data set.
  • Pipeline 2 - Read in and analyze the moving time-windowed data set.

Let’s utilize our sales example again to see how this would work. In the example, we want to keep a moving time window of the last three days’ worth of data. Now say that our daily sales repo looks like the following:

sales
├── 01-01-17.json
├── 01-02-17.json
├── 01-03-17.json
└── 01-04-17.json

When the January 4th file, 01-04-17.json, is committed, our first pipeline pulls out the last three days of data and arranges it like so:

last_three_days
├── 01-02-17.json
├── 01-03-17.json
└── 01-04-17.json

Think of this as a “shuffle” step. Then, when the January 5th file, 01-05-17.json, is committed,

sales
├── 01-01-17.json
├── 01-02-17.json
├── 01-03-17.json
├── 01-04-17.json
└── 01-05-17.json

the first pipeline would again update the moving window:

last_three_days
├── 01-03-17.json
├── 01-04-17.json
└── 01-05-17.json

Whatever analysis we need to run on the moving time-windowed data set in last_three_days can use a glob pattern of / or /* (depending on whether we need to process all of the time-windowed files together or they can be processed in parallel).
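
As a sketch, a pipeline that analyzes the whole window as a single datum could look like this (the pipeline name, image, and command are placeholders; the last_three_days repo and the / glob come from the example above):

{
  "pipeline": {
    "name": "three_day_analysis"
  },
  "transform": {
    "image": "window-analysis-image",
    "cmd": ["/analyze.sh", "/pfs/last_three_days", "/pfs/out"]
  },
  "input": {
    "atom": {
      "repo": "last_three_days",
      "glob": "/"
    }
  }
}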

Warning - When creating this type of moving time-windowed data set, the concept of “now” or “today” is relative. It is important that you make a sound choice for how to define time based on your use case (e.g., by defaulting to UTC). You should not use a function such as time.now() to figure out a current day. The actual time at which this analysis is run may vary. If you have further questions about this issue, please do not hesitate to reach out to us via Slack or at support@pachyderm.io.

Pipeline Specification

This document discusses each of the fields present in a pipeline specification. To see how to use a pipeline spec, refer to the pachctl create-pipeline doc.

JSON Manifest Format

{
  "pipeline": {
    "name": string
  },
  "transform": {
    "image": string,
    "cmd": [ string ],
    "stdin": [ string ]
    "env": {
        string: string
    },
    "secrets": [ {
        "name": string,
        "mountPath": string
    } ],
    "imagePullSecrets": [ string ],
    "acceptReturnCode": [ int ]
  },
  "parallelism_spec": {
    "strategy": "CONSTANT"|"COEFFICIENT"
    "constant": int        // if strategy == CONSTANT
    "coefficient": double  // if strategy == COEFFICIENT
  },
  "resource_spec": {
    "memory": string
    "cpu": double
  },
  "input": {
    <"atom" or "cross" or "union", see below> 
  },
  "outputBranch": string,
  "egress": {
    "URL": "s3://bucket/dir"
  },
  "scaleDownThreshold": string
}

------------------------------------
"atom" input
------------------------------------

"atom": {
  "name": string,
  "repo": string,
  "branch": string,
  "glob": string,
  "lazy" bool,
  "from_commit": string
}

------------------------------------
"cross" or "union" input
------------------------------------

"cross" or "union": [ 
  {
    "atom": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy" bool,
      "from_commit": string
    }
  },
  {
    "atom": {
      "name": string,
      "repo": string,
      "branch": string,
      "glob": string,
      "lazy" bool,
      "from_commit": string
    }
  }
  etc...
]

In practice, you rarely need to specify all the fields. Most fields either come with sensible defaults or can be nil. Following is an example of a minimal spec:

{
  "pipeline": {
    "name": "wordcount"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["/binary", "/pfs/data", "/pfs/out"]
  },
  "input": {
        "atom": {
            "repo": "data",
            "glob": "/*"
        }
    }
}

Following is a walk-through of all the fields.

Name (required)

pipeline.name is the name of the pipeline that you are creating. Each pipeline needs to have a unique name.

Transform (required)

transform.image is the name of the Docker image that your jobs run in.

transform.cmd is the command passed to the Docker run invocation. Note that, as with Docker, cmd is not run inside a shell, which means that things like wildcard globbing (*), pipes (|), and file redirects (> and >>) will not work. To get that behavior, you can set cmd to be a shell of your choice (e.g. sh) and pass a shell script to stdin.

transform.stdin is an array of lines that are sent to your command on stdin. Lines need not end in newline characters.
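
For example, this hypothetical transform (the image and the script line are illustrative, and it assumes an input named data, as in the minimal spec above) runs sh and feeds it a script via stdin so that the wildcard is expanded by the shell:

"transform": {
  "image": "ubuntu:16.04",
  "cmd": ["sh"],
  "stdin": ["cp /pfs/data/* /pfs/out/"]
}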

transform.env is a map from key to value of environment variables that will be injected into the container.

transform.secrets is an array of secrets. Secrets reference Kubernetes secrets by name and specify a path that the secrets should be mounted to. Secrets are useful for embedding sensitive data such as credentials. Read more about secrets in Kubernetes here.

transform.imagePullSecrets is an array of image pull secrets. Image pull secrets are similar to secrets except that they’re mounted before the containers are created, so they can be used to provide credentials for image pulling. For example, if you are using a private Docker registry for your images, you can specify it via:

$ kubectl create secret docker-registry myregistrykey --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL

And then tell your pipeline about it via "imagePullSecrets": [ "myregistrykey" ]. Read more about image pull secrets here.

transform.acceptReturnCode is an array of return codes (i.e. exit codes) from your docker command that are considered acceptable, which means that if your docker command exits with one of the codes in this array, it will be considered a successful run for the purpose of setting job status. 0 is always considered a successful exit code.

Parallelism Spec (optional)

parallelism_spec describes how Pachyderm should parallelize your pipeline. Currently, Pachyderm has two parallelism strategies: CONSTANT and COEFFICIENT.

If you use the CONSTANT strategy, Pachyderm will start a number of workers that you give it. To use this strategy, set the field strategy to CONSTANT, and set the field constant to an integer value (e.g. 10 to start 10 workers).

If you use the COEFFICIENT strategy, Pachyderm will start a number of workers that is a multiple of your Kubernetes cluster’s size. To use this strategy, set the field coefficient to a double. For example, if your Kubernetes cluster has 10 nodes, and you set coefficient to 0.5, Pachyderm will start five workers. If you set it to 2.0, Pachyderm will start 20 workers (two per Kubernetes node).

By default, we use the parallelism spec “coefficient=1”, which means that we spawn one worker per node for this pipeline.
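
For instance, the 10-worker CONSTANT example above would be written in the pipeline spec as:

"parallelism_spec": {
  "strategy": "CONSTANT",
  "constant": 10
}

The COEFFICIENT example above (half a worker per node) would instead set "strategy": "COEFFICIENT" and "coefficient": 0.5.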

Resource Spec (optional)

resource_spec describes the amount of resources you expect the workers for a given pipeline to consume. Knowing this in advance lets us schedule big jobs on separate machines, so that they don’t conflict and either slow down or die.

The memory field is a string that describes the amount of memory, in bytes, each worker needs (with allowed SI suffixes: M, K, G, Mi, Ki, Gi, etc.). For example, a worker that needs to read a 1GB file into memory might set "memory": "1.2GB" (with a little extra for the code to use in addition to the file). Workers for this pipeline will only be placed on machines with at least 1.2GB of free memory, and other large workers will be prevented from using it (if they also set their resource_spec).

The cpu field is a double that describes the amount of CPU time (in (cpu seconds)/(real seconds)) each worker needs. Setting "cpu": 0.5 indicates that the worker should get 500ms of CPU time per second. Setting "cpu": 2 indicates that the worker should get 2000ms of CPU time per second (i.e. it’s using 2 CPUs, essentially, though worker threads might spend e.g. 500ms on four physical CPUs instead of one second on two physical CPUs).

In both cases, the resource requests are not upper bounds. If the worker uses more memory than it’s requested, it will not (necessarily) be killed. However, if the whole node runs out of memory, Kubernetes will start killing pods that have been placed on it and exceeded their memory request, to reclaim memory. To prevent your worker getting killed, you must set your memory request to a sufficiently large value. However, if the total memory requested by all workers in the system is too large, Kubernetes will be unable to schedule new workers (because no machine will have enough unclaimed memory). cpu works similarly, but for CPU time.

By default, workers are scheduled with an effective resource request of 0 (to avoid scheduling problems that would prevent users from being able to run pipelines). This means that if a node runs out of memory, any such worker might be killed.
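
Putting the two fields together, a pipeline whose workers need roughly 1.2GB of memory and half a CPU (mirroring the examples above, and using the G suffix from the suffix list above) could request:

"resource_spec": {
  "memory": "1.2G",
  "cpu": 0.5
}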

Input (required)

input specifies repos that will be visible to the jobs during runtime. Commits to these repos will automatically trigger the pipeline to create new jobs to process them. Input is a recursive type: there are multiple kinds of inputs, which can be combined together. The input object is a container for the different input types, with a field for each; only one of these fields should be set for any instantiation of the object.

{
    "atom": atom_input,
    "union": [input],
    "cross": [input],
}
Atom Input

Atom inputs are the simplest inputs; they take input from a single branch on a single repo.

{
    "name": string,
    "repo": string,
    "branch": string,
    "glob": string,
    "lazy" bool,
    "from_commit": string
}

input.atom.name is the name of the input. An input with name XXX will be visible under the path /pfs/XXX when a job runs. Input names must be unique. If an input’s name is not specified, it defaults to the name of the repo. Therefore, if you have two inputs from the same repo, you’ll need to give at least one of them a unique name.

input.atom.repo is the repo to be used for the input.

input.atom.branch is the branch to watch for commits on. It may be left blank, in which case "master" will be used.

input.atom.commit is the repo and branch (specified as id) to be used for the input. repo is required, but id may be left blank, in which case "master" will be used.

input.atom.glob is a glob pattern that’s used to determine how the input data is partitioned. It’s explained in detail in the next section.

input.atom.lazy controls how the data is exposed to jobs. The default is false, which means the job will eagerly download the data it needs to process, and the data will be exposed as normal files on disk. If lazy is set to true, data will be exposed as named pipes instead, and no data will be downloaded until the job opens the pipe and reads it; if the pipe is never opened, then no data will be downloaded. Some applications won’t work with pipes, for example if they make syscalls such as Seek, which pipes don’t support. Applications that can work with pipes should use them, since they’re more performant; the difference will be especially notable if the job only reads a subset of the files that are available to it. Note that lazy currently doesn’t support datums that contain more than 10000 files.

input.atom.from_commit specifies the starting point of the input branch. If from_commit is not specified, then the entire input branch will be processed. Otherwise, only commits since the from_commit (not including the commit itself) will be processed.

Union Input

Union inputs take the union of other inputs. For example:

| inputA | inputB | inputA ∪ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | foo             |
| bar    | buzz   | fizz            |
|        |        | bar             |
|        |        | buzz            |

Notice that union inputs do not take a name and maintain the names of the sub-inputs. In the above example you would see files under /pfs/inputA/... and /pfs/inputB/....

input.union is an array of inputs to union. Note that these need not be atom inputs; they can also be union and cross inputs, although there’s no reason to take a union of unions, since union is associative.
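
Concretely, the inputA/inputB union from the table above could be written as follows (the /* globs are just for illustration):

"input": {
  "union": [
    { "atom": { "repo": "inputA", "glob": "/*" } },
    { "atom": { "repo": "inputB", "glob": "/*" } }
  ]
}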

Cross Input

Cross inputs take the cross product of other inputs; in other words, they create tuples of the datums in the inputs. For example:

| inputA | inputB | inputA ⨯ inputB |
| ------ | ------ | --------------- |
| foo    | fizz   | (foo, fizz)     |
| bar    | buzz   | (foo, buzz)     |
|        |        | (bar, fizz)     |
|        |        | (bar, buzz)     |

Notice that cross inputs do not take a name and maintain the names of the sub-inputs. In the above example you would see files under /pfs/inputA/... and /pfs/inputB/....

input.cross is an array of inputs to cross. Note that these need not be atom inputs; they can also be union and cross inputs, although there’s no reason to take a cross of crosses, since cross products are associative.
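
Concretely, the inputA/inputB cross from the table above could be written as follows (again with illustrative /* globs); each job then sees one datum from inputA under /pfs/inputA and one datum from inputB under /pfs/inputB:

"input": {
  "cross": [
    { "atom": { "repo": "inputA", "glob": "/*" } },
    { "atom": { "repo": "inputB", "glob": "/*" } }
  ]
}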

OutputBranch (optional)

This is the branch where the pipeline outputs new commits. By default, it’s “master”.

Egress (optional)

egress allows you to push the results of a pipeline to an external data store such as S3, Google Cloud Storage, or Azure Storage. Data will be pushed after the user code has finished running but before the job is marked as successful.
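
For example, following the URL format shown in the manifest above (the bucket and path here are placeholders):

"egress": {
  "URL": "s3://my-bucket/pipeline-output"
}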

Scale-down threshold (optional)

scaleDownThreshold specifies when the worker pods of a pipeline should be terminated.

By default, a pipeline’s worker pods are always running. When scaleDownThreshold is set, the worker pods are terminated after they have been idle for the given duration. When a new input commit comes in, the worker pods are then re-created.

scaleDownThreshold is a string that needs to be a sequence of decimal numbers with a unit suffix, such as “300ms”, “1.5h” or “2h45m”. Valid time units are “s”, “m”, “h”.
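
For example, to tear down idle workers after 30 minutes (an arbitrary value chosen for illustration):

"scaleDownThreshold": "30m"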

The Input Glob Pattern

Each atom input needs to specify a glob pattern.

Pachyderm uses the glob pattern to determine how many “datums” an input consists of. Datums are the unit of parallelism in Pachyderm. That is, Pachyderm attempts to process datums in parallel whenever possible.

Intuitively, you may think of the input repo as a file system, and you are applying the glob pattern to the root of the file system. The files and directories that match the glob pattern are considered datums.

For instance, let’s say your input repo has the following structure:

/foo-1
/foo-2
/bar
  /bar-1
  /bar-2

Now let’s consider what the following glob patterns would match respectively:

  • /: this pattern matches /, the root directory itself, meaning all the data would be a single large datum.
  • /*: this pattern matches everything under the root directory, giving us 3 datums: /foo-1, /foo-2, and everything under the directory /bar.
  • /bar/*: this pattern matches files only under the /bar directory: /bar-1 and /bar-2
  • /foo*: this pattern matches files under the root directory that start with the characters foo
  • /*/*: this pattern matches everything that’s two levels deep relative to the root: /bar/bar-1 and /bar/bar-2

The datums are defined as whichever files or directories are matched by the glob pattern. For instance, if we used /*, then the job will process three datums (potentially in parallel): /foo-1, /foo-2, and /bar. Both the bar-1 and bar-2 files within the directory bar would be grouped together and always processed by the same worker.

Multiple Inputs

It’s important to note that if a pipeline takes multiple atom inputs (via cross or union) then the pipeline will not get triggered until all of the atom inputs have at least one commit on the branch.

PPS Mounts and File Access

Mount Paths

The root mount point is at /pfs, which contains:

  • /pfs/input_name which is where you would find the datum.
    • Each input will be found here by its name, which defaults to the repo name if not specified.
  • /pfs/out which is where you write any output.

Output Formats

PFS supports data delimited by lines, JSON, or binary blobs.

Best Practices

This document discusses best practices for common use cases.

Shuffling files

Certain pipelines simply shuffle files around. If you find yourself writing a pipeline that does a lot of copying, such as Time Windowing, it probably falls into this category.

The best way to shuffle files, especially large files, is to create symlinks in the output directory that point to files in the input directory.

For instance, to move a file log.txt to logs/log.txt, you might be tempted to write a transform like this:

cp /pfs/input/log.txt /pfs/out/logs/log.txt

However, it’s more efficient to create a symlink:

ln -s /pfs/input/log.txt /pfs/out/logs/log.txt

Under the hood, Pachyderm is smart enough to recognize that the output file simply symlinks to a file that already exists in Pachyderm, and therefore skips the upload altogether.

Note that if your shuffling pipeline only needs the names of the input files but not their content, you can use lazy input. That way, your shuffling pipeline can skip both the download and the upload.
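
A minimal shuffle pipeline along these lines might look like the following sketch (the pipeline name and image are placeholders, and it assumes an input repo named input containing log.txt, as in the example above):

{
  "pipeline": {
    "name": "shuffle_logs"
  },
  "transform": {
    "image": "ubuntu:16.04",
    "cmd": ["sh"],
    "stdin": [
      "mkdir -p /pfs/out/logs",
      "ln -s /pfs/input/log.txt /pfs/out/logs/log.txt"
    ]
  },
  "input": {
    "atom": {
      "repo": "input",
      "glob": "/",
      "lazy": true
    }
  }
}

Because the input is lazy and the script only creates a symlink, neither the download nor the upload of the file’s contents is needed.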

Pachctl Command Line Tool

pachctl is the command line interface for Pachyderm. To install pachctl, follow the Local Installation instructions.

Synopsis

Access the Pachyderm API.

Environment variables:

ADDRESS=<host>:<port>, the pachd server to connect to (e.g. 127.0.0.1:30650).

Options

--no-metrics Don’t report user metrics for this command
-v, --verbose Output verbose logs

./pachctl commit

Docs for commits.

Synopsis

Commits are atomic transactions on the content of a repo.

Creating a commit is a multistep process:

  • start a new commit with start-commit
  • write files to it through fuse or with put-file
  • finish the new commit with finish-commit

Commits that have been started but not finished are NOT durable storage. Commits become reliable (and immutable) when they are finished.

Commits can be created with another commit as a parent. This layers the data in the commit over the data in the parent.

./pachctl commit
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl create-job

Create a new job. Returns the id of the created job.

Synopsis

Create a new job from a spec. The spec looks like this:

{
  "transform": {
    "cmd": [ "cmd", "args..." ],
    "env": { "foo": "bar" },
    "secrets": [ {
      "name": "secret_name",
      "mountPath": "/path/in/container"
    } ],
    "imagePullSecrets": [ "my-secret" ],
    "acceptReturnCode": [ "1" ]
  },
  "parallelismSpec": { "constant": "1" },
  "inputs": [ {
    "commit": {
      "repo": { "name": "in_repo" },
      "id": "10cf676b626044f9a405235bf7660959"
    },
    "glob": "*",
    "lazy": true
  } ]
}

./pachctl create-job -f job.json
Options
  -f, --file string       The file containing the job, it can be a url or local file. - reads from stdin. (default "-")
      --password string   Your password for the registry being pushed to.
  -p, --push-images       If true, push local docker images into the cluster registry.
  -r, --registry string   The registry to push images to. (default "docker.io")
  -u, --username string   The username to push images as, defaults to your OS username.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl create-pipeline

Create a new pipeline.

Synopsis

Create a new pipeline from a Pipeline Specification

./pachctl create-pipeline -f pipeline.json
Options
  -d, --description string   A description of the repo.
  -f, --file string          The file containing the pipeline, it can be a url or local file. - reads from stdin. (default "-")
      --password string      Your password for the registry being pushed to.
  -p, --push-images          If true, push local docker images into the cluster registry.
  -r, --registry string      The registry to push images to. (default "docker.io")
  -u, --username string      The username to push images as, defaults to your OS username.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl create-repo

Create a new repo.

Synopsis

Create a new repo.

./pachctl create-repo repo-name
Options
  -d, --description string   A description of the repo.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl delete-all

Delete everything.

Synopsis

Delete all repos, commits, files, pipelines and jobs. This resets the cluster to its initial state.

./pachctl delete-all
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl delete-branch

Delete a branch

Synopsis

Delete a branch, while leaving the commits intact

./pachctl delete-branch <repo-name> <branch-name>
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl delete-file

Delete a file.

Synopsis

Delete a file.

./pachctl delete-file repo-name commit-id path/to/file
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl delete-job

Delete a job.

Synopsis

Delete a job.

./pachctl delete-job job-id
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl delete-pipeline

Delete a pipeline.

Synopsis

Delete a pipeline.

./pachctl delete-pipeline pipeline-name
Options
      --delete-jobs   delete the jobs in this pipeline as well
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl delete-repo

Delete a repo.

Synopsis

Delete a repo.

./pachctl delete-repo repo-name
Options
  -f, --force   remove the repo regardless of errors; use with care
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl deploy

Deploy a Pachyderm cluster.

Synopsis

Deploy a Pachyderm cluster.

Options
      --block-cache-size string       Size of pachd's in-memory cache for PFS files. Size is specified in bytes, with allowed SI suffixes (M, K, G, Mi, Ki, Gi, etc).
      --dash-image string             Image URL for pachyderm dashboard (default "pachyderm/dash:0.3.21")
      --dashboard                     Deploy the Pachyderm UI along with Pachyderm (experimental)
      --dashboard-only                Only deploy the Pachyderm UI (experimental), without the rest of pachyderm. This is for launching the UI adjacent to an existing Pachyderm cluster
      --dry-run                       Don't actually deploy pachyderm to Kubernetes, instead just print the manifest.
      --dynamic-etcd-nodes int        Deploy etcd as a StatefulSet with the given number of pods.  The persistent volumes used by these pods are provisioned dynamically.  Note that StatefulSet is currently a beta kubernetes feature, which might be unavailable in older versions of kubernetes.
      --etcd-cpu-request string       (rarely set) The size of etcd's CPU request, which we give to Kubernetes. Size is in cores (with partial cores allowed and encouraged).
      --etcd-memory-request string    (rarely set) The size of etcd's memory request. Size is in bytes, with SI suffixes (M, K, G, Mi, Ki, Gi, etc).
      --log-level string              The level of log messages to print options are, from least to most verbose: "error", "info", "debug". (default "info")
      --pachd-cpu-request string      (rarely set) The size of Pachd's CPU request, which we give to Kubernetes. Size is in cores (with partial cores allowed and encouraged).
      --pachd-memory-request string   (rarely set) The size of PachD's memory request in addition to its block cache (set via --block-cache-size). Size is in bytes, with SI suffixes (M, K, G, Mi, Ki, Gi, etc).
      --shards int                    Number of Pachd nodes (stateless Pachyderm API servers). (default 16)
      --static-etcd-volume string     Deploy etcd as a ReplicationController with one pod.  The pod uses the given persistent volume.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl file

Docs for files.

Synopsis

Files are the lowest level data object in Pachyderm.

Files can be written to started (but not finished) commits with put-file.
Files can be read from finished commits with get-file.
./pachctl file
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl finish-commit

Finish a started commit.

Synopsis

Finish a started commit. Commit-id must be a writeable commit.

./pachctl finish-commit repo-name commit-id
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl flush-commit

Wait for all commits caused by the specified commits to finish and return them.

Synopsis

Wait for all commits caused by the specified commits to finish and return them.

Examples:

# return commits caused by foo/XXX and bar/YYY
$ pachctl flush-commit foo/XXX bar/YYY

# return commits caused by foo/XXX leading to repos bar and baz
$ pachctl flush-commit foo/XXX -r bar -r baz
./pachctl flush-commit commit [commit ...]
Options
  -r, --repos value   Wait only for commits leading to a specific set of repos (default [])
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl get-file

Return the contents of a file.

Synopsis

Return the contents of a file.

./pachctl get-file repo-name commit-id path/to/file
Options
  -o, --output string      The path where data will be downloaded.
  -p, --parallelism uint   The maximum number of files that can be downloaded in parallel (default 10)
  -r, --recursive          Recursively download a directory.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl get-logs

Return logs from a job.

Synopsis

Return logs from a job.

Examples:

# return logs emitted by recent jobs in the "filter" pipeline
$ pachctl get-logs --pipeline=filter

# return logs emitted by the job aedfa12aedf
$ pachctl get-logs --job=aedfa12aedf

# return logs emitted by the pipeline "filter" while processing /apple.txt and a file with the hash 123aef
$ pachctl get-logs --pipeline=filter --inputs=/apple.txt,123aef
./pachctl get-logs [--pipeline=pipeline-name|--job=job-id]
Options
      --inputs string     Filter for log lines generated while processing these files (accepts PFS paths or file hashes)
      --job string        Filter for log lines from this job (accepts job ID)
      --pipeline string   Filter the log for lines from this pipeline (accepts pipeline name)
      --raw               Return log messages verbatim from server.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl get-object

Return the contents of an object

Synopsis

Return the contents of an object

./pachctl get-object hash
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl get-tag

Return the contents of a tag

Synopsis

Return the contents of a tag

./pachctl get-tag tag
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl inspect-commit

Return info about a commit.

Synopsis

Return info about a commit.

./pachctl inspect-commit repo-name commit-id
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl inspect-file

Return info about a file.

Synopsis

Return info about a file.

./pachctl inspect-file repo-name commit-id path/to/file
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl inspect-job

Return info about a job.

Synopsis

Return info about a job.

./pachctl inspect-job job-id
Options
  -b, --block   block until the job has either succeeded or failed
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl inspect-pipeline

Return info about a pipeline.

Synopsis

Return info about a pipeline.

./pachctl inspect-pipeline pipeline-name
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl inspect-repo

Return info about a repo.

Synopsis

Return info about a repo.

./pachctl inspect-repo repo-name
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl job

Docs for jobs.

Synopsis

Jobs are the basic unit of computation in Pachyderm.

Jobs run a containerized workload over a set of finished input commits. Creating a job will also create a new repo and a commit in that repo which contains the output of the job, unless the job is created with another job as a parent. If the job is created with a parent, it will use the same repo as its parent job, and the commit it creates will use the parent job’s commit as a parent. If the job fails, the commit it creates will not be finished. To increase the throughput of a job, increase the Shard parameter.

./pachctl job
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl list-branch

Return all branches on a repo.

Synopsis

Return all branches on a repo.

./pachctl list-branch <repo-name>
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl list-commit

Return all commits on a set of repos.

Synopsis

Return all commits on a set of repos.

Examples:

# return commits in repo "foo"
$ pachctl list-commit foo

# return commits in repo "foo" on branch "master"
$ pachctl list-commit foo master

# return the last 20 commits in repo "foo" on branch "master"
$ pachctl list-commit foo master -n 20

# return commits that are the ancestors of XXX
$ pachctl list-commit foo XXX

# return commits in repo "foo" since commit XXX
$ pachctl list-commit foo master --from XXX
./pachctl list-commit repo-name
Options
  -f, --from string   list all commits since this commit
  -n, --number int    list only this many commits; if set to zero, list all commits
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl list-file

Return the files in a directory.

Synopsis

Return the files in a directory.

./pachctl list-file repo-name commit-id path/to/dir
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl list-job

Return info about jobs.

Synopsis

Return info about jobs.

Examples:

# return all jobs
$ pachctl list-job

# return all jobs in pipeline foo
$ pachctl list-job -p foo

# return all jobs whose input commits include foo/XXX and bar/YYY
$ pachctl list-job foo/XXX bar/YYY

# return all jobs in pipeline foo and whose input commits include bar/YYY
$ pachctl list-job -p foo bar/YYY
./pachctl list-job [-p pipeline-name] [commits]
Options
  -p, --pipeline string   Limit to jobs made by pipeline.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl list-pipeline

Return info about all pipelines.

Synopsis

Return info about all pipelines.

./pachctl list-pipeline
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl list-repo

Return all repos.

Synopsis

Return all repos.
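
For example, assuming repos named training and model exist, where model was created with training as its provenance (all names are illustrative):

# list every repo
$ pachctl list-repo

# list only repos whose provenance includes "training"
$ pachctl list-repo -p training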

./pachctl list-repo
Options
  -p, --provenance value   list only repos with the specified repos provenance (default [])
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl mount

Mount pfs locally. This command blocks.

Synopsis

Mount pfs locally. This command blocks.
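
For example, to browse pfs with ordinary filesystem tools (this requires FUSE; the mount point is illustrative, and the command should be backgrounded or run in a separate terminal since it blocks):

# mount pfs at ~/pfs
$ pachctl mount ~/pfs &

# repos appear as top-level directories
$ ls ~/pfs

# when you're done, unmount it
$ pachctl unmount ~/pfs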

./pachctl mount path/to/mount/point
Options
  -a, --all-commits   Show archived and cancelled commits.
  -d, --debug         Turn on debug messages.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl pipeline

Docs for pipelines.

Synopsis

Pipelines are a powerful abstraction for automating jobs.

Pipelines take a set of repos as inputs, rather than the set of commits that jobs take. Pipelines then subscribe to commits on those repos and launch a job to process each incoming commit. Creating a pipeline also creates a repo of the same name. All jobs created by a pipeline create commits in the pipeline’s repo.
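
As a rough sketch (the pipeline name, input repo, image, and command below are placeholders, and the exact spec fields can vary between versions; see the Pipeline Specification reference), a pipeline is typically created from a JSON spec:

# pipeline.json (illustrative)
{
  "pipeline": { "name": "my-pipeline" },
  "transform": {
    "image": "my-image",
    "cmd": ["/bin/my-binary", "/pfs/images", "/pfs/out"]
  },
  "inputs": [
    { "repo": { "name": "images" }, "glob": "/*" }
  ]
}

# create the pipeline; it will launch a job for each new commit in "images"
$ pachctl create-pipeline -f pipeline.json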

./pachctl pipeline
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl port-forward

Forward a port on the local machine to pachd. This command blocks.

Synopsis

Forward a port on the local machine to pachd. This command blocks.
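
For example, when working against a remote cluster, you can background the command and then use pachctl through the forwarded local port (30650 by default):

$ pachctl port-forward &

# pachctl now reaches the cluster through localhost:30650
$ pachctl version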

./pachctl port-forward
Options
  -k, --kubectlflags string   Any kubectl flags to proxy, e.g. --kubectlflags='--kubeconfig /some/path/kubeconfig'
  -p, --port int              The local port to bind to. (default 30650)
  -x, --proxy-port int        The local port to bind to. (default 38081)
  -u, --ui-port int           The local port to bind to. (default 38080)
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl put-file

Put a file into the filesystem.

Synopsis

Put-file supports a number of ways to insert data into pfs:

# Put data from stdin as repo/branch/path:
echo "data" | pachctl put-file repo branch path

# Put data from stdin as repo/branch/path and start/finish a new commit on the branch.
echo "data" | pachctl put-file -c repo branch path

# Put a file from the local filesystem as repo/branch/path:
pachctl put-file repo branch path -f file

# Put a file from the local filesystem as repo/branch/file:
pachctl put-file repo branch -f file

# Put the contents of a directory as repo/branch/path/dir/file:
pachctl put-file -r repo branch path -f dir

# Put the contents of a directory as repo/branch/dir/file:
pachctl put-file -r repo branch -f dir

# Put the data from a URL as repo/branch/path:
pachctl put-file repo branch path -f http://host/path

# Put the data from a URL as repo/branch/path, where path is taken from the URL:
pachctl put-file repo branch -f http://host/path

# Put several files or URLs that are listed in file.
# Files and URLs should be newline delimited.
pachctl put-file repo branch -i file

# Put several files or URLs that are listed at URL.
# NOTE this URL can reference local files, so it could cause you to put sensitive
# files into your Pachyderm cluster.
pachctl put-file repo branch -i http://host/path
./pachctl put-file repo-name branch path/to/file/in/pfs
Options
  -c, --commit                    Put file(s) in a new commit.
  -f, --file value                The file to be put, it can be a local file or a URL. (default [-])
  -i, --input-file string         Read filepaths or URLs from a file.  If - is used, paths are read from the standard input.
  -p, --parallelism uint          The maximum number of files that can be uploaded in parallel (default 10)
  -r, --recursive                 Recursively put the files in a directory.
      --split string              Split the input file into smaller files, subject to the constraints of --target-file-datums and --target-file-bytes
      --target-file-bytes uint    the target upper bound of the number of bytes that each file contains; needs to be used with --split
      --target-file-datums uint   the target upper bound of the number of datums that each file contains; needs to be used with --split
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl repo

Docs for repos.

Synopsis

Repos, short for repository, are the top level data object in Pachyderm.

Repos are created with create-repo.
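
For example, to create a repo and confirm it exists (the name training is illustrative):

$ pachctl create-repo training
$ pachctl list-repo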
./pachctl repo
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl run-pipeline

Run a pipeline once.

Synopsis

Run a pipeline once, optionally overriding some pipeline options by providing a spec. The spec looks like this:

{
  "parallelismSpec": { "constant": "3" },
  "inputs": [
    {
      "commit": {
        "repo": { "name": "in_repo" },
        "id": "10cf676b626044f9a405235bf7660959"
      },
      "glob": "*"
    }
  ]
}
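
For example, assuming a pipeline named my-pipeline already exists (the pipeline name and spec file are illustrative):

# run the pipeline once with its current settings
$ pachctl run-pipeline my-pipeline

# run it once, overriding options with a spec read from a file
$ pachctl run-pipeline my-pipeline -f override.json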

./pachctl run-pipeline pipeline-name [-f job.json]
Options
  -f, --file string   The file containing the run-pipeline spec, - reads from stdin.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl set-branch

Set a commit and its ancestors to a branch.

Synopsis

Set a commit and its ancestors to a branch.

Examples:

# Set commit XXX and its ancestors as branch master in repo foo.
$ pachctl set-branch foo XXX master

# Set the head of branch test as branch master in repo foo.
# After running this command, "test" and "master" both point to the
# same commit.
$ pachctl set-branch foo test master
./pachctl set-branch <repo-name> <commit-id/branch-name> <new-branch-name>
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl start-commit

Start a new commit.

Synopsis

Start a new commit with parent-commit as the parent, or start a commit on the given branch; if the branch does not exist, it will be created.

Examples:

# Start a new commit in repo "test" that's not on any branch
$ pachctl start-commit test

# Start a commit in repo "test" on branch "master"
$ pachctl start-commit test master

# Start a commit with "master" as the parent in repo "test", on a new branch "patch"; essentially a fork.
$ pachctl start-commit test patch -p master

# Start a commit with XXX as the parent in repo "test", not on any branch
$ pachctl start-commit test -p XXX
./pachctl start-commit repo-name [branch]
Options
  -p, --parent string   The parent of the new commit, unneeded if branch is specified and you want to use the previous head of the branch as the parent.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl start-pipeline

Restart a stopped pipeline.

Synopsis

Restart a stopped pipeline.

./pachctl start-pipeline pipeline-name
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl stop-pipeline

Stop a running pipeline.

Synopsis

Stop a running pipeline.
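
For example, assuming a pipeline named foo (the name is illustrative), you can pause processing and later resume it with start-pipeline:

$ pachctl stop-pipeline foo

# ... later, resume processing new commits
$ pachctl start-pipeline foo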

./pachctl stop-pipeline pipeline-name
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl undeploy

Tear down a deployed Pachyderm cluster.

Synopsis

Tear down a deployed Pachyderm cluster.
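
For example (be careful with -a, as described below, since it can delete the volumes backing your metadata):

# tear down the cluster, leaving persistent volumes in place
$ pachctl undeploy

# tear down the cluster and delete everything, including metadata
$ pachctl undeploy -a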

./pachctl undeploy
Options
  -a, --all   
Delete everything, including the persistent volumes where metadata
is stored.  If your persistent volumes were dynamically provisioned (i.e. if
you used the "--dynamic-etcd-nodes" flag), the underlying volumes will be
removed, making metadata such as repos, commits, pipelines, and jobs
unrecoverable. If your persistent volume was manually provisioned (i.e. if
you used the "--static-etcd-volume" flag), the underlying volume will not be
removed.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl unmount

Unmount pfs.

Synopsis

Unmount pfs.

./pachctl unmount path/to/mount/point
Options
  -a, --all   unmount all pfs mounts
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl update-pipeline

Update an existing Pachyderm pipeline.

Synopsis

Update a Pachyderm pipeline with a new Pipeline Specification.
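
For example, after editing your spec file (pipeline.json and the username are placeholders; the image-push flags are only needed if you also rebuilt the image locally):

# update the pipeline from the edited spec
$ pachctl update-pipeline -f pipeline.json

# update it and push a locally built image into the cluster registry
$ pachctl update-pipeline -f pipeline.json --push-images --username myuser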

./pachctl update-pipeline -f pipeline.json
Options
  -f, --file string       The file containing the pipeline, it can be a url or local file. - reads from stdin. (default "-")
      --password string   Your password for the registry being pushed to.
  -p, --push-images       If true, push local docker images into the cluster registry.
  -r, --registry string   The registry to push images to. (default "docker.io")
  -u, --username string   The username to push images as, defaults to your OS username.
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

./pachctl version

Return version information.

Synopsis

Return version information.

./pachctl version
Options inherited from parent commands
      --no-metrics   Don't report user metrics for this command
  -v, --verbose      Output verbose logs
SEE ALSO
Auto generated by spf13/cobra on 10-May-2017

Golang Client

For Go users, we’ve built a Golang client so you can easily script Pachyderm commands. Check out the autogenerated godocs as a reference.