ExTASY Workflows 0.2

Project Repository

https://bitbucket.org/extasy-project/extasy-workflows


Contents

Introduction

What is ExTASY?

ExTASY, the Extensible Toolkit for Advanced Sampling and analYsis, is a flexible toolkit for efficient sampling of complex macromolecules, using molecular dynamics in combination with on-the-fly analysis tools to drive the sampling process towards regions of interest. In particular, compared with existing approaches like metadynamics, ExTASY requires no a priori assumptions about the behaviour of the system. ExTASY consists of several interoperable Python tools, which are coupled together into pre-defined patterns that may be executed on compute resources ranging from PCs and small clusters to large-scale HPC systems.

ExTASY execution model

ExTASY provides a command-line interface that, together with specific configuration files, keeps the user's job minimal and insulates it from the resource-specific execution methods and data management. The ExTASY user interface runs on your local machine and handles data staging, job scheduling and execution on the target machine in a uniform manner, making it easy to test small systems locally before moving to larger HPC resources as needed.

The coupled simulation-analysis execution pattern (aka ExTASY pattern) currently supports two use cases:

  • Gromacs as the “Simulator” and LSDMap as the “Analyzer”
  • AMBER as the “Simulator” and CoCo as the “Analyzer”

The ExTASY approach

ExTASY uses swarm/ensemble simulation strategies that map efficiently onto HPC services. It uses smart collective coordinate strategies to focus sampling in interesting regions, and relies on machine learning methods rather than user expertise to select and refine (on the fly) the collective coordinates. ExTASY is compatible with standard MD codes out of the box - without requiring software patches.

ExTASY workflow

Background

Why do enhanced sampling?

To efficiently and accurately identify particular alternative conformations of a molecule.

  • E.g., starting from an apo-conformation, identify alternative low-energy conformations of a protein relevant to ligand binding (induced fit/conformational selection).

To efficiently and accurately sample ALL conformational space for a molecule.

  • E.g., calculation of thermodynamic and kinetic parameters.

How to do enhanced sampling?

Faster MD through hardware and software developments, e.g.:

  • multicore architectures and domain composition.
  • specialized hardware (ANTON, GRAPE,...).

Faster MD through manipulation of the effective potential energy surface, e.g.:

  • meta-dynamics,
  • accelerated dynamics.

Faster sampling through multiple simulation strategies, e.g.:

  • Replica exchange.
  • Swarm/ensemble simulations and Markov chain models.

Installation & Setup

Installation

This section describes the requirements and procedure to be followed to install the ExTASY package.

Note

Pre-requisites. The following are the minimal requirements to install the ExTASY module.

  • python >= 2.7
  • virtualenv >= 1.11
  • pip >= 1.5
  • Password-less ssh login to the Stampede and/or Archer machines (see the HPC Cluster Access section below for help)

The easiest way to install ExTASY is within a virtualenv. This way, ExTASY and its dependencies can easily be installed in user-space without clashing with potentially incompatible system-wide packages.

Tip

If the virtualenv command is not available, try the following set of commands,

wget --no-check-certificate https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.tar.gz
tar xzf virtualenv-1.11.tar.gz
python virtualenv-1.11/virtualenv.py --system-site-packages $HOME/ExTASY-tools/
source $HOME/ExTASY-tools/bin/activate

Step 1: Create the virtualenv,

virtualenv $HOME/ExTASY-tools/

If your shell is BASH,

source $HOME/ExTASY-tools/bin/activate

If your shell is CSH,

source $HOME/ExTASY-tools/bin/activate.csh

Note

Setuptools might not get installed with virtualenv, in which case using pip would fail. Please look at https://pypi.python.org/pypi/setuptools for installation instructions.

Step 2: To install the Ensemble MD Toolkit Python modules in the virtual environment, run:

pip install radical.ensemblemd

You can check the version of Ensemble MD Toolkit with the ensemblemd-version command-line tool:

ensemblemd-version
0.3.14

Tip

If your shell is CSH, you will also need to run:

rehash

This rebuilds the shell's command lookup table so that the newly installed commands are found.

Installation is complete!

Preparing the Environment

ExTASY is developed using the Ensemble MD Toolkit, a client-side library that relies on a set of external software packages. One of these packages is radical.pilot, an HPC cluster resource access and management library. It can access HPC clusters remotely via SSH and GSISSH, but it requires (a) a MongoDB server and (b) a properly set-up SSH environment.

MongoDB and SSH ports.

Note

For the purposes of the examples in this guide, we provide access to a mongodb url (mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot). This is for trying out these examples only and is periodically purged. We recommend setting up your own mongodb instances for production simulations/experiments.

MongoDB Server

The MongoDB server is used to store and retrieve operational data during the execution of an Ensemble MD Toolkit application. The MongoDB server must be reachable on port 27017 from both the host that runs the Ensemble MD Toolkit application and the host that executes the MD tasks, i.e., the HPC cluster (see the blue arrows in the figure above). In our experience, a small VM instance (e.g., on Amazon AWS) works exceptionally well for this.

Warning

If you want to run your application on your laptop or private workstation, but run your MD tasks on a remote HPC cluster, installing MongoDB on your laptop or workstation won’t work. Your laptop or workstation usually does not have a public IP address and is hidden behind a NATed and firewalled home or office network. This means that the components running on the HPC cluster will not be able to reach the MongoDB server.

A MongoDB server can support more than one user. In an environment where multiple users use Ensemble MD Toolkit applications, a single MongoDB server for all users / hosts is usually sufficient.

Install your own MongoDB

Once you have identified a host that can serve as the new home for MongoDB, installation is straightforward. You can either install the MongoDB server package that is provided by most Linux distributions, or follow the installation instructions on the MongoDB website:

http://docs.mongodb.org/manual/installation/
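
As a minimal sketch for a Debian/Ubuntu host (package and service names vary by distribution and MongoDB version, and <mongodb-host> is a placeholder for the machine you install it on):

sudo apt-get install -y mongodb        # install the distribution package
sudo service mongodb start             # start the server; it listens on port 27017 by default
nc -zv <mongodb-host> 27017            # run from your workstation and the HPC login node to check reachability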

MongoDB-as-a-Service

There are multiple commercial providers of hosted MongoDB services, some of them offering free usage tiers; we have had good experience with several of these.

HPC Cluster Access

In order to execute MD tasks on a remote HPC cluster, you need to set-up password-less SSH login for that host. This can either be achieved via an ssh-agent that stores your SSH key’s password (e.g., default on OS X) or by setting up password-less SSH keys.

Password-less SSH with ssh-agent

An ssh-agent asks you for your key’s password the first time you use it and then stores it for you so that you don’t have to enter it again. On OS X (>= 10.5), an ssh-agent is running by default. On Linux you might have to install or launch it manually.

You can test whether an ssh-agent is running on your system by logging in to the remote host via SSH twice in a row: the first time, the ssh-agent should ask you for a password; the second time, it shouldn’t. You can use the ssh-add command to list all keys that are currently managed by your ssh-agent:

%> ssh-add -l
4096 c3:d6:4b:fb:ce:45:b7:f0:2e:05:b1:81:87:24:7f:3f /Users/enmdtk/.ssh/rsa_work (RSA)


Password-less SSH keys

Warning

Using password-less SSH keys is discouraged, and some sites may even have a policy in place prohibiting their use. Use an ssh-agent if possible.

These instructions were taken from http://www.linuxproblem.org/art_9.html

Follow these instructions to create and set-up a public-private key pair that doesn’t have a password.

As user_a on host workstation, generate a pair of keys. Do not enter a passphrase:

user_a@workstation:~> ssh-keygen -t rsa

Generating public/private rsa key pair.
Enter file in which to save the key (/home/a/.ssh/id_rsa):
Created directory '/home/a/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/a/.ssh/id_rsa.
Your public key has been saved in /home/a/.ssh/id_rsa.pub.
The key fingerprint is:
3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A

Now use ssh to create a directory ~/.ssh as user_b on cluster. (The directory may already exist, which is fine):

user_a@workstation:~> ssh user_b@cluster mkdir -p .ssh
user_b@cluster's password:

Finally, append user_a's new public key to user_b@cluster:.ssh/authorized_keys and enter user_b's password one last time:

user_a@workstation:~> cat .ssh/id_rsa.pub | ssh user_b@cluster 'cat >> .ssh/authorized_keys'
user_b@cluster's password:

From now on you can log into cluster as user_b from workstation as user_a without a password:

user_a@workstation:~> ssh user_b@cluster

Note

Depending on your version of SSH, you might also have to make the following changes (example commands are shown after the list):

  • Put the public key in .ssh/authorized_keys2 (note the 2)
  • Change the permissions of .ssh to 700
  • Change the permissions of .ssh/authorized_keys2 to 640
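
For example, on the cluster (a sketch; adjust the file name if your SSH version uses authorized_keys instead of authorized_keys2):

chmod 700 ~/.ssh
chmod 640 ~/.ssh/authorized_keys2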

Getting Started

ExTASY 0.2 uses the Ensemble Toolkit API for composing the application. In this section we will run you through the basic building blocks of the API. We will introduce the SimulationAnalysisLoop pattern and then work through simple examples using this pattern in the next section. Once you are comfortable with these examples, we will move on to two molecular dynamics applications created using this API.

SimulationAnalysisLoop (SAL): The Pattern

The SAL pattern supports multiple iterations of two chained bags of tasks (BoTs). The first bag consists of ‘n’ instances of simulations, followed by the second bag, which consists of ‘m’ instances of analysis. The analysis instances work on the output of the simulation instances, and hence the two bags are chained. There can be multiple iterations of these two BoTs and, depending on the application, the simulation instances of iteration ‘i+1’ can work on the output of the analysis instances of iteration ‘i’. There are also two optional steps, pre_loop and post_loop, to perform any pre- or post-processing. A graphical representation of the pattern is given below:

SAL pattern image

There are also a set of data references that can be used to reference the data in a particular step or instance.

  • $PRE_LOOP - References the pre_loop step
  • $PREV_SIMULATION - References the previous simulation step with the same instance number.
  • $PREV_SIMULATION_INSTANCE_Y - References instance Y of the previous simulation step.
  • $SIMULATION_ITERATION_X_INSTANCE_Y - References instance Y of the simulation step of iteration number X.
  • $PREV_ANALYSIS - References the previous analysis step with the same instance number.
  • $PREV_ANALYSIS_INSTANCE_Y - References instance Y of the previous analysis step.
  • $ANALYSIS_ITERATION_X_INSTANCE_Y - References instance Y of the analysis step of iteration number X.
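
For instance, an analysis kernel can stage in the output of instance 1 of the previous simulation step using such a reference (a sketch mirroring the complete examples later in this guide):

k = Kernel(name="misc.ccount")
k.arguments       = ["--inputfile=asciifile-1.dat", "--outputfile=cfreqs.dat"]
# link the file produced by simulation instance 1 of the previous step into this task's workspace
k.link_input_data = ["$PREV_SIMULATION_INSTANCE_1/asciifile.dat > asciifile-1.dat"]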

Components of the API

There are three components that the user interacts with in order to implement the application:

  • Resource Handle: The resource handle can be seen as a container that acquires the resources on the remote machine and provides application-level control of these resources.
  • Execution Pattern: A pattern can be seen as a parameterized template for an execution trajectory that implements a specific algorithm. A pattern provides placeholder methods for the individual steps or stages of an execution trajectory. These placeholders are populated with Kernels that get executed when the corresponding step runs. In ExTASY, we will be using the SAL pattern.
  • Application Kernel: A kernel is an object that abstracts a computational task in EnsembleMD. It represents an instantiation of a specific science tool along with its resource-specific environment.
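
A bare skeleton showing how the three components fit together might look as follows (kernel names and arguments are illustrative; the complete, runnable versions appear in the next section):

from radical.ensemblemd import Kernel, SimulationAnalysisLoop, ResourceHandle

class MyLoop(SimulationAnalysisLoop):                        # Execution Pattern
    def simulation_stage(self, iteration, instance):
        k = Kernel(name="misc.mkfile")                       # Application Kernel
        k.arguments = ["--size=1000", "--filename=asciifile.dat"]
        return [k]

    def analysis_stage(self, iteration, instance):
        k = Kernel(name="misc.ccount")                       # Application Kernel
        k.arguments = ["--inputfile=asciifile.dat", "--outputfile=cfreqs.dat"]
        k.link_input_data = ["$PREV_SIMULATION/asciifile.dat"]
        return [k]

cluster = ResourceHandle(resource="local.localhost",         # Resource Handle
                         cores=1, walltime=15,
                         database_url="mongodb://hostname:port")
cluster.allocate()
cluster.run(MyLoop(iterations=1, simulation_instances=1, analysis_instances=1))
cluster.deallocate()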

Running Generic Examples

Multiple Simulations Single Analysis Application with SAL pattern

This example shows how to use the SAL pattern to execute 4 iterations of a simulation-analysis loop with multiple simulation instances and a single analysis instance. We skip the pre_loop step in this example. Each simulation stage generates 16 new random ASCII files, one per simulation instance. In the analysis stage, the ASCII files from all simulation instances are gathered and a character count is performed on each of them by a single analysis instance. The output is downloaded to the user machine.

[S]    [S]    [S]    [S]    [S]    [S]    [S]
 |      |      |      |      |      |      |
 \-----------------------------------------/
                      |
                     [A]
                      |
 /-----------------------------------------\
 |      |      |      |      |      |      |
[S]    [S]    [S]    [S]    [S]    [S]    [S]
 |      |      |      |      |      |      |
 \-----------------------------------------/
                      |
                     [A]
                      .
                      .

Warning

In order to run this example, you need access to a MongoDB server and set the RADICAL_PILOT_DBURL in your environment accordingly. The format is mongodb://hostname:port.
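
For example, to use the tutorial MongoDB instance mentioned earlier (a temporary, periodically purged database):

export RADICAL_PILOT_DBURL='mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot'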

Run locally

  • Step 1: View the example source below. You can download the generic examples using the following:
wget https://bitbucket.org/extasy-project/extasy-workflows/downloads/generic.tar
tar xf generic.tar
cd generic

Note

The files in the above link are configured to run for the tutorial. The source at the end of this page is generic and might require changes.

  • Step 2: Run the multiple_simulations_single_analysis:
python multiple_simulations_single_analysis.py

Once the script has finished running, you should see the character frequency files generated by the individual iterations (cfreqs-1.dat, cfreqs-2.dat, ...) in the same directory you launched the script in. You should see as many such files as there were iterations. Each analysis stage generates the character frequency file for all the files generated in the simulation stage of that iteration.

Note

Environment variable RADICAL_ENMD_VERBOSE is set to REPORT in the python script. This specifies the verbosity of the output. For more verbose output, you can use INFO or DEBUG.

Run remotely

By default, the simulation and analysis stages run on one core of your local machine:

ResourceHandle(
    resource="local.localhost",
    cores=1,
    walltime=30,
    username=None,
    project=None
)

You can change the script to use a remote HPC cluster and increase the number of cores to see how this affects the runtime of the script, since the individual simulation instances can run in parallel. For example, execution on xsede.stampede using 16 cores would require:

ResourceHandle(
    resource="xsede.stampede",
    cores=16,
    walltime=30,    # minutes
    username=None,  # add your username here
    project=None    # add your allocation or project id here if required
)

Example Script

Download multiple_simulations_single_analysis.py

#!/usr/bin/env python



__author__       = "Vivek <vivek.balasubramanian@rutgers.edu>"
__copyright__    = "Copyright 2014, http://radical.rutgers.edu"
__license__      = "MIT"
__example_name__ = "Multiple Simulations Instances, Single Analysis Instance Example (MSSA)"


import sys
import os
import json

from radical.ensemblemd import Kernel
from radical.ensemblemd import SimulationAnalysisLoop
from radical.ensemblemd import EnsemblemdError
from radical.ensemblemd import ResourceHandle


# ------------------------------------------------------------------------------
# Set default verbosity

if os.environ.get('RADICAL_ENTK_VERBOSE') == None:
	os.environ['RADICAL_ENTK_VERBOSE'] = 'REPORT'

# ------------------------------------------------------------------------------
#
class MSSA(SimulationAnalysisLoop):
	"""MSMA exemplifies how the MSMA (Multiple-Simulations / Multiple-Analsysis)
	   scheme can be implemented with the SimulationAnalysisLoop pattern.
	"""
	def __init__(self, iterations, simulation_instances, analysis_instances):
		SimulationAnalysisLoop.__init__(self, iterations, simulation_instances, analysis_instances)


	def simulation_stage(self, iteration, instance):
		"""In the simulation step we
		"""
		k = Kernel(name="misc.mkfile")
		k.arguments = ["--size=1000", "--filename=asciifile.dat"]
		return [k]

	def analysis_stage(self, iteration, instance):
		"""In the analysis step we use the ``$PREV_SIMULATION`` data reference
		   to refer to the previous simulation. The same
		   instance is picked implicitly, i.e., if this is instance 5, the
		   previous simulation with instance 5 is referenced.
		"""
		link_input_data = []
		for i in range(1, self.simulation_instances+1):
			link_input_data.append("$PREV_SIMULATION_INSTANCE_{instance}/asciifile.dat > asciifile-{instance}.dat".format(instance=i))

		k = Kernel(name="misc.ccount")
		k.arguments            = ["--inputfile=asciifile-*.dat", "--outputfile=cfreqs.dat"]
		k.link_input_data      = link_input_data
		k.download_output_data = "cfreqs.dat > cfreqs-{iteration}.dat".format(iteration=iteration)
		return [k]


# ------------------------------------------------------------------------------
#
if __name__ == "__main__":

	try:

		# Create a new static execution context with one resource and a fixed
		# number of cores and runtime.
		cluster = ResourceHandle(
				resource='local.localhost',
				cores=1,
				walltime=15,
				#username=None,

				#project=None,
				#queue = None,

				database_url='mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot',
				#database_name=None,
				#access_schema=None
		)

		# Allocate the resources.
		cluster.allocate()

		# We set the simulation step 'instances' to 16 and the
		# analysis step 'instances' to 1.
		mssa = MSSA(iterations=4, simulation_instances=16, analysis_instances=1)

		cluster.run(mssa)

		cluster.deallocate()

	except EnsemblemdError, er:

		print "Ensemble MD Toolkit Error: {0}".format(str(er))
		raise # Just raise the execption again to get the backtrace

In the __main__ block, a ResourceHandle (execution context) is created, targeted to reserve 1 core on localhost for a duration of 15 minutes. An allocation request is then made for this execution context with allocate().

The MSSA class derives from the SAL pattern. We skip the definition of the pre_loop step since it is not required for this example. In simulation_stage, we define the kernel to be executed during the simulation stage (misc.mkfile) as well as its arguments. In analysis_stage, we define the kernel to be executed during the analysis stage (misc.ccount); before that, we create a list of references to the output data created in each of the simulation instances, in order to stage them in (link_input_data) for the single analysis instance.

Finally, we create an instance of this MSSA class to run 4 iterations with 16 simulation instances and 1 analysis instance, run this pattern in the execution context, and, once it has completed, deallocate the acquired resources.

Multiple Simulations Multiple Analysis Application with SAL pattern

This example shows how to use the SAL pattern to execute multiple iterations of a simulation-analysis loop with multiple simulation instances and multiple analysis instances. We skip the pre_loop step in this example. Each simulation stage generates one new random ASCII file per simulation instance. In the analysis stage, the ASCII files from the simulation instances are analyzed and a character count is performed; each analysis instance uses the file generated by the corresponding simulation instance, which is possible since we use the same number of instances for simulation and analysis. The output is downloaded to the user machine.

[S]    [S]    [S]    [S]    [S]    [S]    [S]    [S]
 |      |      |      |      |      |      |      |
[A]    [A]    [A]    [A]    [A]    [A]    [A]    [A]
 |      |      |      |      |      |      |      |
[S]    [S]    [S]    [S]    [S]    [S]    [S]    [S]
 |      |      |      |      |      |      |      |
[A]    [A]    [A]    [A]    [A]    [A]    [A]    [A]

Warning

In order to run this example, you need access to a MongoDB server and set the RADICAL_PILOT_DBURL in your environment accordingly. The format is mongodb://hostname:port.

Run locally

  • Step 1: View the example sources below. You can download the generic examples using the following (same link as above):
wget https://bitbucket.org/extasy-project/extasy-workflows/downloads/generic.tar
tar xf generic.tar
cd generic

Note

The files in the above link are configured to run for the CECAM workshop. The source at the end of this page is generic and might require changes.

  • Step 2: Run the multiple_simulations_multiple_analysis:
python multiple_simulations_multiple_analysis.py

Once the script has finished running, you should see the character frequency files generated by the individual instances (cfreqs-1-1.dat, cfreqs-1-2.dat, ...) in the same directory you launched the script in. You should see as many such files as the number of iterations times the number of instances (i.e., the simulation/analysis width). Each analysis stage generates a character frequency file for each of the files generated in the simulation stage of that iteration.

Note

Environment variable RADICAL_ENMD_VERBOSE is set to REPORT in the python script. This specifies the verbosity of the output. For more verbose output, you can use INFO or DEBUG.

Run remotely

By default, the simulation and analysis stages run on one core of your local machine:

ResourceHandle(
    resource="local.localhost",
    cores=1,
    walltime=30,
    username=None,
    project=None
)

You can change the script to use a remote HPC cluster and increase the number of cores to see how this affects the runtime of the script, since the individual simulation instances can run in parallel. For example, execution on xsede.stampede using 16 cores would require:

ResourceHandle(
    resource="xsede.stampede",
    cores=16,
    walltime=30,
    username=None,  # add your username here
    project=None    # add your allocation or project id here if required
)

Example Script

Download multiple_simulations_multiple_analysis.py

#!/usr/bin/env python

__author__       = "Vivek <vivek.balasubramanian@rutgers.edu>"
__copyright__    = "Copyright 2014, http://radical.rutgers.edu"
__license__      = "MIT"
__example_name__ = "Multiple Simulations Instances, Multiple Analysis Instances Example (MSMA)"

import sys
import os
import json


from radical.ensemblemd import Kernel
from radical.ensemblemd import SimulationAnalysisLoop
from radical.ensemblemd import EnsemblemdError
from radical.ensemblemd import ResourceHandle


# ------------------------------------------------------------------------------
# Set default verbosity

if os.environ.get('RADICAL_ENTK_VERBOSE') == None:
	os.environ['RADICAL_ENTK_VERBOSE'] = 'REPORT'

# ------------------------------------------------------------------------------
#
class MSMA(SimulationAnalysisLoop):
	"""MSMA exemplifies how the MSMA (Multiple-Simulations / Multiple-Analsysis)
	   scheme can be implemented with the SimulationAnalysisLoop pattern.
	"""
	def __init__(self, iterations, simulation_instances, analysis_instances):
		SimulationAnalysisLoop.__init__(self, iterations, simulation_instances, analysis_instances)


	def simulation_stage(self, iteration, instance):
		"""In the simulation step we
		"""
		k = Kernel(name="misc.mkfile")
		k.arguments = ["--size=1000", "--filename=asciifile.dat"]
		return k

	def analysis_stage(self, iteration, instance):
		"""In the analysis step we use the ``$PREV_SIMULATION`` data reference
		   to refer to the previous simulation. The same
		   instance is picked implicitly, i.e., if this is instance 5, the
		   previous simulation with instance 5 is referenced.
		"""
		k = Kernel(name="misc.ccount")
		k.arguments            = ["--inputfile=asciifile.dat", "--outputfile=cfreqs.dat"]
		k.link_input_data      = "$PREV_SIMULATION/asciifile.dat"
		k.download_output_data = "cfreqs.dat > cfreqs-{iteration}-{instance}.dat".format(instance=instance, iteration=iteration)
		return k


# ------------------------------------------------------------------------------
#
if __name__ == "__main__":

	try:

		# Create a new static execution context with one resource and a fixed
		# number of cores and runtime.
		cluster = ResourceHandle(
				resource='local.localhost',
				cores=1,
				walltime=15,
				#username=None,
				#project=None,
				#queue = None,

				database_url='mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot',
				#database_name=,
				#access_schema=None,
		)

		# Allocate the resources.
		cluster.allocate()

		# We set both the simulation and the analysis step 'instances' to 8.
		msma = MSMA(iterations=2, simulation_instances=8, analysis_instances=8)

		cluster.run(msma)

		cluster.deallocate()

	except EnsemblemdError, er:

		print "Ensemble MD Toolkit Error: {0}".format(str(er))
		raise # Just raise the execption again to get the backtrace

In the __main__ block, a ResourceHandle (execution context) is created, targeted to reserve 1 core on localhost for a duration of 15 minutes. An allocation request is then made for this execution context with allocate().

The MSMA class derives from the SAL pattern. We skip the definition of the pre_loop step since it is not required for this example. In simulation_stage, we define the kernel to be executed during the simulation stage (misc.mkfile) as well as its arguments. In analysis_stage, we define the kernel to be executed during the analysis stage (misc.ccount) and, via the $PREV_SIMULATION data reference, stage in the output created in the previous simulation stage by the instance with the same number as the current analysis instance.

Finally, we create an instance of this MSMA class to run 2 iterations with 8 simulation instances and 8 analysis instances, run this pattern in the execution context, and, once it has completed, deallocate the acquired resources.

Running a Coco/Amber Workload

Introduction

CoCo (“Complementary Coordinates”) uses a PCA-based method to analyse trajectory data and identify potentially undersampled regions of conformational space. For an introduction to CoCo, including how it can be used as a stand-alone tool, see the CoCo documentation.

The ExTASY workflow uses cycles of CoCo and MD simulation to rapidly generate a diverse ensemble of conformations of a chosen molecule. A typical use could be in exploring the conformational flexibility of a protein’s ligand binding site and the generation of diverse conformations for docking studies. The basic workflow is as follows:

  1. The input ensemble (typically the trajectory file from a short preliminary simulation, but potentially just a single structure) is analysed by CoCo and N possible, but so far unsampled, conformations identified.
  2. N independent short MD simulations are run, starting from each of these points.
  3. The resulting trajectory files are added to the input ensemble, and CoCo analysis performed on them all, identifying N new points.
  4. Steps 2 and 3 are repeated, building up an ensemble of an increasing number of short, but diverse, trajectory files, for as many cycles as the user chooses.

In common with the other ExTASY workflows, a user prepares the necessary input files and ExTASY configuration files on their local workstation, and launches the job from there, but the calculations are then performed on the execution host, which is typically an HPC resource.

This release of ExTASY has a few restrictions:

  1. The MD simulations can only be performed using AMBER or GROMACS.
  2. The system to be simulated cannot contain any non-standard residues (i.e., any not found in the default AMBER residue library).

Required Input files

The Amber/CoCo workflow requires the user to prepare four AMBER-style files and two ExTASY configuration files. For more information about Amber-specific file formats, see the AMBER documentation.

  1. A topology file for the system (Amber .top format).
  2. An initial structure (Amber .crd format) or ensemble (any Amber trajectory format).
  3. A simple minimisation input script (.in format). This will be used to refine each structure produced by CoCo before it is used for MD.
  4. An MD input script (.in format).
  5. An ExTASY Resource configuration (.rcfg) file.
  6. An ExTASY Workload configuration (.wcfg) file.

Here is an example of a typical minimisation input script (min.in):

Basic minimisation, weakly restraining backbone so it does not drift too far
from CoCo-generated conformation
    &cntrl
        imin=1, maxcyc=500,
        ntpr=50,
        ntr=1,
        ntb=0, cut=25.0, igb=2,
    &end
Atoms to be restrained
0.1
FIND
CA * * *
N * * *
C * * *
O * * *
SEARCH
RES 1 999
END
END

Here is an example of a typical MD input script (mdshort.in):

0.1 ns GBSA sim
    &cntrl
        imin=0, ntx=1,
        ntpr=1000, ntwr=1000, ntwx=500,
        ioutfm=1,
        nstlim=50000, dt=0.002,
        ntt=3, ig=-1, gamma_ln=5.0,
        ntc=2, ntf=2,
        ntb=0, cut=25.0, igb=2,
    &end

The resource and workload configuration files are resource specific and are discussed in the following sections: first execution on Stampede, then execution on Archer.

Running on Stampede

This section is to be done entirely on your laptop. The ExTASY tool expects two input files:

  1. The resource configuration file sets the parameters of the HPC resource we want to run the workload on, in this case Stampede.
  2. The workload configuration file defines the CoCo/Amber workload itself. The configuration file given in this example is strictly meant for the coco-amber use case only.

Step 1: Create a new directory for the example,

mkdir $HOME/extasy-tutorial/
cd $HOME/extasy-tutorial/

Step 2: Download the config files and the input files directly using the following commands:

wget https://bitbucket.org/extasy-project/extasy-workflows/downloads/coam-on-stampede.tar
tar xf coam-on-stampede.tar
cd coam-on-stampede

Step 3: In the coam-on-stampede folder, a resource configuration file stampede.rcfg exists. Details and modifications required are as follows:

Note

For the purposes of this example, you only need to change:

  • UNAME
  • ALLOCATION

The other parameters in the resource configuration are already set up to successfully execute the workload in this example.

REMOTE_HOST = 'xsede.stampede'  # Label/Name of the Remote Machine
UNAME       = 'username'                  # Username on the Remote Machine
ALLOCATION  = 'TG-MCB090174'              # Allocation to be charged
WALLTIME    = 20                          # Walltime to be requested for the pilot
PILOTSIZE   = 16                          # Number of cores to be reserved
WORKDIR     = None                        # Working directory on the remote machine
QUEUE       = 'development'                    # Name of the queue in the remote machine

DBURL       = 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot'          #MongoDB link to be used for coordination purposes

Step 4: In the coam-on-stampede folder, a workload configuration file cocoamber.wcfg exists. Details and modifications required are as follows:

#-------------------------Applications----------------------
simulator                = 'Amber'          # Simulator to be loaded
analyzer                 = 'CoCo'           # Analyzer to be loaded

#-------------------------General---------------------------
num_iterations          = 2                 # Number of iterations of Simulation-Analysis
start_iter              = 0                 # Iteration number with which to start
num_CUs                 = 16                # Number of tasks or Compute Units
nsave                   = 2                 # Iterations after which output is transferred to local machine

#-------------------------Simulation-----------------------
num_cores_per_sim_cu    = 2                 # Number of cores per Simulation Compute Units
md_input_file           = './inp_files/mdshort.in'    # Entire path to MD Input file - Do not use $HOME or the likes
minimization_input_file = './inp_files/min.in'        # Entire path to Minimization file - Do not use $HOME or the likes
initial_crd_file        = './inp_files/penta.crd'     # Entire path to Coordinates file - Do not use $HOME or the likes
top_file                = './inp_files/penta.top'     # Entire path to Topology file - Do not use $HOME or the likes
ref_file                = './inp_files/penta.pdb'     # Path to file with reference coordinates that will be used as an auxiliary file to read the trajectory files
logfile                 = 'coco.log'        # Name of the log file created by pyCoCo
atom_selection          = 'protein'

#-------------------------Analysis--------------------------
grid                    = '5'               # Number of points along each dimension of the CoCo histogram
dims                    = '3'               # The number of projections to consider from the input pcz file

Note

All the parameters in the above example file are mandatory for amber-coco. There are no other parameters currently supported.

Step 5: You can find the executable script `extasy_amber_coco.py` in the coam-on-stampede folder.

Now you can run the workload using:

python extasy_amber_coco.py --RPconfig stampede.rcfg --Kconfig cocoamber.wcfg

Note

Environment variable RADICAL_ENMD_VERBOSE is set to REPORT in the python script. This specifies the verbosity of the output. For more verbose output, you can use INFO or DEBUG.

Note

Time to completion: ~240 seconds (from the time the job goes through the LRMS)

Running on Archer

This section is to be done entirely on your laptop. The ExTASY tool expects two input files:

  1. The resource configuration file sets the parameters of the HPC resource we want to run the workload on, in this case Archer.
  2. The workload configuration file defines the CoCo/Amber workload itself. The configuration file given in this example is strictly meant for the coco-amber use case only.

Step 1: Create a new directory for the example,

mkdir $HOME/extasy-tutorial/
cd $HOME/extasy-tutorial/

Step 2: Download the config files and the input files directly using the following commands:

wget https://bitbucket.org/extasy-project/extasy-workflows/downloads/coam-on-archer.tar
tar xf coam-on-archer.tar
cd coam-on-archer

Step 3: In the coam-on-archer folder, a resource configuration file archer.rcfg exists. Details and modifications required are as follows:

Note

For the purposes of this example, you only need to change:

  • UNAME
  • ALLOCATION

The other parameters in the resource configuration are already set up to successfully execute the workload in this example.

REMOTE_HOST = 'epsrc.archer'  # Label/Name of the Remote Machine
UNAME       = 'username'                  # Username on the Remote Machine
ALLOCATION  = 'e290'              # Allocation to be charged
WALLTIME    = 20                          # Walltime to be requested for the pilot
PILOTSIZE   = 24                          # Number of cores to be reserved
WORKDIR     = None                        # Working directory on the remote machine
QUEUE       = 'standard'                    # Name of the queue in the remote machine

DBURL       = 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot'          #MongoDB link to be used for coordination purposes

Step 4: In the coam-on-archer folder, a workload configuration file cocoamber.wcfg exists. Details and modifications required are as follows:

#-------------------------Applications----------------------
simulator                = 'Amber'          # Simulator to be loaded
analyzer                 = 'CoCo'           # Analyzer to be loaded

#-------------------------General---------------------------
num_iterations          = 2                 # Number of iterations of Simulation-Analysis
start_iter              = 0                 # Iteration number with which to start
num_CUs                 = 16                # Number of tasks or Compute Units
nsave                   = 2                 # Iterations after which output is transferred to local machine

#-------------------------Simulation-----------------------
num_cores_per_sim_cu    = 2                 # Number of cores per Simulation Compute Units
md_input_file           = './inp_files/mdshort.in'    # Entire path to MD Input file - Do not use $HOME or the likes
minimization_input_file = './inp_files/min.in'        # Entire path to Minimization file - Do not use $HOME or the likes
initial_crd_file        = './inp_files/penta.crd'     # Entire path to Coordinates file - Do not use $HOME or the likes
top_file                = './inp_files/penta.top'     # Entire path to Topology file - Do not use $HOME or the likes
ref_file                = './inp_files/penta.pdb'     # Path to file with reference coordinates that will be used as an auxiliary file to read the trajectory files
logfile                 = 'coco.log'        # Name of the log file created by pyCoCo
atom_selection          = 'protein'

#-------------------------Analysis--------------------------
grid                    = '5'               # Number of points along each dimension of the CoCo histogram
dims                    = '3'               # The number of projections to consider from the input pcz file

Note

All the parameters in the above example file are mandatory for amber-coco. There are no other parameters currently supported.

Step 5: You can find the executable script `extasy_amber_coco.py` in the coam-on-archer folder.

Now you can run the workload using:

python extasy_amber_coco.py --RPconfig archer.rcfg --Kconfig cocoamber.wcfg

Note

Environment variable RADICAL_ENMD_VERBOSE is set to REPORT in the python script. This specifies the verbosity of the output. For more verbose output, you can use INFO or DEBUG.

Note

Time to completion: ~600 seconds (from the time the job goes through the LRMS)

Running on localhost

The above two sections describe execution on XSEDE Stampede and EPSRC Archer, assuming you have access to these machines. This section describes the changes required to the existing scripts in order to get CoCo-Amber running on your local machine (label to be used = local.localhost, as in the generic examples).

Step 1: You might have already guessed the first step. You need to create a SingleClusterEnvironment object targeting the localhost machine. You can either directly make changes to the extasy_amber_coco.py script or create a separate resource configuration file and provide it as an argument.
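
A localhost resource configuration file might look like the following sketch, modelled on the Stampede/Archer examples above (the values are illustrative and may need adjusting; username, allocation and queue are typically not needed for a local run):

REMOTE_HOST = 'local.localhost'           # Label/Name of the machine
UNAME       = None                        # Not needed for a local run
ALLOCATION  = None                        # Not needed for a local run
WALLTIME    = 60                          # Walltime to be requested for the pilot
PILOTSIZE   = 4                           # Number of cores to be reserved
WORKDIR     = None                        # Working directory on the machine
QUEUE       = None                        # Not needed for a local run

DBURL       = 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot'   # MongoDB link to be used for coordination purposes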

Step 2: The MD tools require some tool-specific environment variables to be set up (AMBERHOME, PYTHONPATH, GCC, GROMACS_DIR, etc.). Along with this, you need to set the PATH environment variable to point to the binaries (if any) of the MD tool. Once you have determined all the environment variables to be set, set them in your terminal and test them by executing the MD command (possibly for a sample case). For example, if you have AMBER installed in $HOME as $HOME/amber14, you probably need to set AMBERHOME to $HOME/amber14 and append $HOME/amber14/bin to PATH. Please check the official documentation of the MD tool.
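
For example, assuming AMBER 14 is installed under $HOME/amber14 (adjust the paths to your installation):

export AMBERHOME=$HOME/amber14
export PATH=$AMBERHOME/bin:$PATH
which sander            # should now resolve to $HOME/amber14/bin/sander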

Step 3: There are three options to proceed.

  • Once you have tested the environment setup, you need to add it to the particular kernel definition. First, locate the file to be modified: all the files related to Ensemble Toolkit are located within the virtualenv (say “myenv”), under the path myenv/lib/python2.7/site-packages/radical/ensemblemd/kernel_plugins/md. This path contains all the kernels used for the MD examples. Open the amber.py file and add an entry for local.localhost (in "machine_configs") as follows:
..
..
"machine_configs":
{

    ..
    ..

    "local.localhost":
    {
        "pre_exec"    : ["export AMBERHOME=$HOME/amber14", "export PATH=$HOME/amber14/bin:$PATH"],
        "executable"  : ["sander"],
        "uses_mpi"    : False       # Could be True or False
    },

    ..
    ..

}
..
..

This would have to be repeated for all the kernels.

  • Another option is to perform the same steps as above, but leave the "pre_exec" value as an empty list and set all the environment variables in your bashrc ($HOME/.bashrc). Remember that you still need to set the executable as above.
  • The third option is to create your own kernel plugin as part of your user script. This avoids the procedure of locating the existing kernel plugin files entirely, and also gets you comfortable with using kernels other than the ones currently available as part of the package. Creating your own kernel plugins is discussed in the Ensemble MD Toolkit documentation.

Understanding the Output of the Examples

On the local machine, an “output” folder is created, and at the end of every checkpoint interval (=nsave) an “iter*” folder is created inside it which contains the necessary files to start the next iteration.

For example, in the case of CoCo-Amber on stampede, for 4 iterations with nsave=2:

coam-on-stampede$ ls
output/  cocoamber.wcfg  mdshort.in  min.in  penta.crd  penta.top  stampede.rcfg

coam-on-stampede/output$ ls
iter1/  iter3/

The “iter*” folder will not contain any of the initial files such as the topology file, minimization file, etc since they already exist on the local machine. In coco-amber, the “iter*” folder contains the NetCDF files required to start the next iteration and a logfile of the CoCo stage of the current iteration.

coam-on-stampede/output/iter1$ ls
1_coco.log    md_0_11.ncdf  md_0_14.ncdf  md_0_2.ncdf  md_0_5.ncdf  md_0_8.ncdf  md_1_10.ncdf  md_1_13.ncdf  md_1_1.ncdf  md_1_4.ncdf  md_1_7.ncdf
md_0_0.ncdf   md_0_12.ncdf  md_0_15.ncdf  md_0_3.ncdf  md_0_6.ncdf  md_0_9.ncdf  md_1_11.ncdf  md_1_14.ncdf  md_1_2.ncdf  md_1_5.ncdf  md_1_8.ncdf
md_0_10.ncdf  md_0_13.ncdf  md_0_1.ncdf   md_0_4.ncdf  md_0_7.ncdf  md_1_0.ncdf  md_1_12.ncdf  md_1_15.ncdf  md_1_3.ncdf  md_1_6.ncdf  md_1_9.ncdf

It is important to note that, since in coco-amber all the NetCDF files of the previous and current iterations are transferred at each checkpoint, longer checkpoint intervals can be useful: smaller intervals lead to heavy transfer of redundant data.

On the remote machine, inside the pilot-* folder you can find a folder called “unit.00000”. This location is used to exchange/link/move intermediate data. The shared data is kept in “unit.00000/” and the iteration specific inputs/outputs can be found in their specific folders (=”unit.00000/iter*”).

$ cd unit.00000/
$ ls
iter0/  iter1/  iter2/  iter3/  mdshort.in  min.in  penta.crd  penta.top  postexec.py

Running a Gromacs/LSDMap Workload

This section discusses the execution phase in detail. The input to the tool is given in terms of a resource configuration file and a workload configuration file, and the execution is started based on the parameters set in these configuration files. In the following sections, we discuss execution on Stampede and on Archer.

Introduction

DM-d-MD (Diffusion-Map directed Molecular Dynamics) is an adaptive sampling algorithm based on LSDMap (Locally Scaled Diffusion Map), a nonlinear dimensionality reduction technique which provides a set of collective variables associated with slow time scales of Molecular Dynamics simulations (MD).

For an introduction to DM-d-MD, including how it can be used as a stand-alone tool, see J.Preto and C. Clementi, Phys. Chem. Chem. Phys., 2014, 16, 19181-19191 .

In a nutshell, DM-d-MD consists of periodically restarting multiple parallel GROMACS MD trajectories from a distribution of configurations uniformly sampled along the LSDMap coordinates. In this way, during each DM-d-MD cycle, it becomes possible to visit a wider area of the configuration space without remaining trapped in local minima, as can be the case for plain MD simulations. As another feature, DM-d-MD includes a reweighting scheme that is used to keep track of the free energy landscape all along the procedure. A typical DM-d-MD cycle includes the following steps:

  1. Simulation: Short MD trajectories are run using GROMACS starting from the set of configurations selected in step 3 (one trajectory per configuration). For the first cycle, the trajectories start from configurations specified within an input file provided by the user (option md_input_file in the Workload configuration file).
  2. Analysis: LSDMap is computed from the endpoints of each trajectory. The LSDMap coordinates are stored in a file called lsdmap.ev.
  3. Select + Reweighting: New configurations are selected among the endpoints so that the distribution of new configurations is uniform along LSDMap coordinates. The same endpoint can be selected as a new configuration more than once or can be not selected. At the same time, a statistical weight is provided to each new configuration in order to recover the free energy landscape associated with regular MD. The weights are stored in a file called weight.w.

In common with the other ExTASY workflows, a user prepares the necessary input files and ExTASY configuration files on their local workstation, and launches the job from there, but the calculations are then performed on the execution host, which is typically an HPC resource.

Required Input files

The GROMACS/LSDMap (DM-d-MD) workflow requires the user to prepare at least three GROMACS-style files, one configuration file used for LSDMap, and two ExTASY configuration files.

  1. A topology file (.top format) (specified via the option top_file in the Workload configuration file).
  2. An initial structure file (.gro format) (specified via the option md_input_file in the Workload configuration file).
  3. A parameter file (.mdp format) for MD simulations (specified via the option mdp_file in the Workload configuration file).
  4. A configuration file used for LSDMap (.ini format) (specified via the option lsdm_config_file in the Workload configuration file)
  5. An ExTASY Resource configuration (*.rcfg) file.
  6. An ExTASY Workload configuration (*.wcfg) file.

For more information about .top, .gro and .mdp formats, we refer the user to the following website http://manual.gromacs.org/current/online/files.html. Please note that the parameter “nsteps” specified in the .mdp file should correspond to the number of MD time steps of each DM-d-MD cycle. Documentation on GROMACS can be found on the official website: http://www.gromacs.org.
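
For illustration, the relevant part of a .mdp file might look like the following fragment (the values are illustrative; nsteps times dt gives the length of each DM-d-MD cycle):

; MD parameters controlling the length of each DM-d-MD cycle
integrator = md
dt         = 0.002      ; 2 fs time step
nsteps     = 50000      ; 50000 steps x 2 fs = 100 ps of MD per cycle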

Here is an example of a typical LSDMap configuration file (config.ini):

[LSDMAP]
;metric used to compute the distance matrix (rmsd, cmd, dihedral)
metric=rmsd

;constant r0 used with cmd metric
r0=0.05

[LOCALSCALE]
;status (constant, kneighbor, kneighbor_mean)
status=constant

;constant epsilon used in case status is constant
epsilon=0.05

;value of k in case status is kneighbor or kneighbor_mean
k=30

Notes:

  1. See the paper W. Zheng, M. A. Rohrdanz, M. Maggioni and C. Clementi, J. Chem. Phys., 2011, 134, 144109 for more information on how LSDMap works.
  2. metric is the metric used with LSDMap (only rmsd, cmd (contact map distance) and dihedral metrics are currently supported; see the paper P. Cossio, A. Laio and F. Pietrucci, Phys. Chem. Chem. Phys., 2011, 13, 10421–10425, http://pubs.rsc.org/en/Content/ArticleLanding/2011/CP/c0cp02675a, for more information).
  3. status in the section LOCALSCALE refers to the way the local scale is computed when performing LSDMap. constant means that the local scale is the same for all configurations and is equal to the value specified via the parameter epsilon (in nm). kneighbor means that the local scale of each MD configuration is given by the distance to its kth nearest neighbor, where k is given by the parameter k. kneighbor_mean means that the local scale is the same for all configurations and is equal to the average kth-neighbor distance.

The resource and workload configuration files are resource specific and are discussed in the following sections: first execution on Stampede, then execution on Archer.

Running on Stampede

This section is to be done entirely on your laptop. The ExTASY tool expects two input files:

  1. The resource configuration file sets the parameters of the HPC resource we want to run the workload on, in this case Stampede.
  2. The workload configuration file defines the GROMACS/LSDMap workload itself. The configuration file given in this example is strictly meant for the gromacs-lsdmap use case only.

Step 1: Create a new directory for the example,

mkdir $HOME/extasy-tutorial/
cd $HOME/extasy-tutorial/

Step 2: Download the config files and the input files directly using the following commands:

wget https://bitbucket.org/extasy-project/extasy-workflows/downloads/grlsd-on-stampede.tar
tar xf grlsd-on-stampede.tar
cd grlsd-on-stampede

Step 3: In the grlsd-on-stampede folder, a resource configuration file stampede.rcfg exists. Details and modifications required are as follows:

Note

For the purposes of this example, you only need to change:

  • UNAME
  • ALLOCATION

The other parameters in the resource configuration are already set up to successfully execute the workload in this example.

Step 4: In the grlsd-on-stampede folder, a workload configuration file gromacslsdmap.wcfg exists. Details and modifications are as follows:

Note

All the parameters in the above example file are mandatory for gromacs-lsdmap. If ndxfile, grompp_options, mdrun_options and itp_file_loc are not required, they should be set to None; but they still have to be mentioned in the configuration file. There are no other parameters currently supported for these examples.

Step 5: You can find the executable script `extasy_gromacs_lsdmap.py` in the grlsd-on-stampede folder.

Now you can run the workload using:

python extasy_gromacs_lsdmap.py --RPconfig stampede.rcfg --Kconfig gromacslsdmap.wcfg

Note

Environment variable RADICAL_ENMD_VERBOSE is set to REPORT in the python script. This specifies the verbosity of the output. For more verbose output, you can use INFO or DEBUG.

Note

Time to completion: ~13 mins (from the time the job goes through the LRMS)

Running on Archer

This section is to be done entirely on your laptop. The ExTASY tool expects two input files:

  1. The resource configuration file sets the parameters of the HPC resource we want to run the workload on, in this case Archer.
  2. The workload configuration file defines the GROMACS/LSDMap workload itself. The configuration file given in this example is strictly meant for the gromacs-lsdmap use case only.

Step 1: Create a new directory for the example,

mkdir $HOME/extasy-tutorial/
cd $HOME/extasy-tutorial/

Step 2: Download the config files and the input files directly using the following commands:

wget https://bitbucket.org/extasy-project/extasy-workflows/downloads/grlsd-on-archer.tar
tar xf grlsd-on-archer.tar
cd grlsd-on-archer

Step 3: In the grlsd-on-archer folder, a resource configuration file archer.rcfg exists. Details and modifications required are as follows:

Note

For the purposes of this example, you only need to change:

  • UNAME
  • ALLOCATION

The other parameters in the resource configuration are already set up to successfully execute the workload in this example.

Step 4: In the grlsd-on-archer folder, a workload configuration file gromacslsdmap.wcfg exists. Details and modifications required are as follows:

Note

All the parameters in the above example file are mandatory for gromacs-lsdmap. If ndxfile, grompp_options, mdrun_options and itp_file_loc are not required, they should be set to None; but they still have to be mentioned in the configuration file. There are no other parameters currently supported.

Step 5: You can find the executable script `extasy_gromacs_lsdmap.py` in the grlsd-on-archer folder.

Now you can run the workload using:

python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

Note

Environment variable RADICAL_ENMD_VERBOSE is set to REPORT in the python script. This specifies the verbosity of the output. For more verbose output, you can use INFO or DEBUG.

Note

Time to completion: ~15 mins (from the time the job goes through the LRMS)

Running on localhost

The above two sections describe execution on XSEDE Stampede and EPSRC Archer, assuming you have access to these machines. This section describes the changes required to the existing scripts in order to get Gromacs-LSDMap running on your local machine (label to be used = local.localhost, as in the generic examples).

Step 1: You might have already guessed the first step. You need to create a SingleClusterEnvironment object targeting the localhost machine. You can either directly make changes to the extasy_gromacs_lsdmap.py script or create a separate resource configuration file and provide it as an argument.

Step 2: The MD tools require some tool-specific environment variables to be set up (AMBERHOME, PYTHONPATH, GCC, GROMACS_DIR, etc.). Along with this, you need to set the PATH environment variable to point to the binaries (if any) of the MD tool. Once you have determined all the environment variables to be set, set them in your terminal and test them by executing the MD command (possibly for a sample case). For example, if you have GROMACS installed in $HOME as $HOME/gromacs-5, you probably need to set GROMACS_DIR to $HOME/gromacs-5 and append $HOME/gromacs-5/bin to PATH. Please check the official documentation of the MD tool.
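
For example, assuming GROMACS 5 is installed under $HOME/gromacs-5 (adjust the paths to your installation; depending on your GROMACS build, the binary may be gmx mdrun rather than mdrun):

export GROMACS_DIR=$HOME/gromacs-5
export PATH=$GROMACS_DIR/bin:$PATH
which mdrun             # should now resolve to $HOME/gromacs-5/bin/mdrun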

Step 3: There are three options to proceed.

  • Once you have tested the environment setup, you need to add it to the particular kernel definition. First, locate the file to be modified: all the files related to Ensemble Toolkit are located within the virtualenv (say “myenv”), under the path myenv/lib/python2.7/site-packages/radical/ensemblemd/kernel_plugins/md. This path contains all the kernels used for the MD examples. Open the gromacs.py file and add an entry for local.localhost (in "machine_configs") as follows:
..
..
"machine_configs":
{

    ..
    ..

    "local.localhost":
    {
        "pre_exec"    : ["export GROMACS_DIR=$HOME/gromacs-5", "export PATH=$HOME/gromacs-5/bin:$PATH"],
        "executable"  : ["mdrun"],
        "uses_mpi"    : False       # Could be True or False
    },

    ..
    ..

}
..
..

This would have to be repeated for all the kernels.

  • Another option is to perform the same steps as above, but leave the "pre_exec" value as an empty list and set all the environment variables in your bashrc ($HOME/.bashrc). Remember that you still need to set the executable as above.
  • The third option is to create your own kernel plugin as part of your user script. This avoids the procedure of locating the existing kernel plugin files entirely, and also gets you comfortable with using kernels other than the ones currently available as part of the package. Creating your own kernel plugins is discussed in the Ensemble MD Toolkit documentation.

Understanding the Output of the Examples

On the local machine, an “output” folder is created, and at the end of every checkpoint interval (=nsave) an “iter*” folder is created inside it which contains the necessary files to start the next iteration.

For example, in the case of gromacs-lsdmap on stampede, for 4 iterations with nsave=2:

grlsd-on-stampede$ ls
output/  config.ini  gromacslsdmap.wcfg  grompp.mdp  input.gro  stampede.rcfg  topol.top

grlsd-on-stampede/output$ ls
iter1/  iter3/

The “iter*” folder will not contain any of the initial files such as the topology file, minimization file, etc since they already exist on the local machine. In gromacs-lsdmap, the “iter*” folder contains the coordinate file and weight file required in the next iteration. It also contains a logfile about the lsdmap stage of the current iteration.

grlsd-on-stampede/output/iter1$ ls
2_input.gro  lsdmap.log  weight.w

On the remote machine, inside the pilot-* folder you can find a folder called “unit.00000”. This location is used to exchange/link/move intermediate data. The shared data is kept in “unit.00000/” and the iteration-specific inputs/outputs can be found in their specific folders (=”unit.00000/iter*”).

$ cd unit.00000/
$ ls
config.ini  gro.py   input.gro   iter1/  iter3/    post_analyze.py  reweighting.py   run.py     spliter.py
grompp.mdp  gro.pyc  iter0/      iter2/  lsdm.py   pre_analyze.py   run_analyzer.sh  select.py  topol.top

As specified above, the outputs of the DM-d-MD procedure can be used to recover the free energy landscape of the system. It is, however, the responsibility of the user to decide how many DM-d-MD cycles to perform, depending on the region of configuration space they want to explore. In general, the larger the number of DM-d-MD cycles, the better; however, different systems may require more or fewer cycles to achieve a complete exploration of their free energy landscape. The free energy landscape can be plotted every nsave cycles. The .gro file in backup/iterX can be used to compute any specific collective variables to build the free energy plot. The weights contained in the .w file should be used to “reweight” each configuration when computing the free energy histogram.
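
As a sketch of this reweighting step, assuming the collective variable values have already been extracted from the .gro files into a plain text file (colvar.dat is a hypothetical name) and the weights come from the .w file, the weighted free energy histogram could be built roughly as follows:

import numpy as np

# Hypothetical input: one collective variable value per configuration,
# extracted beforehand from the .gro files in backup/iterX.
cv = np.loadtxt("colvar.dat")
# Per-configuration weights from the corresponding .w file.
w  = np.loadtxt("weight.w")

# Weighted, normalised histogram of the collective variable.
hist, edges = np.histogram(cv, bins=50, weights=w, density=True)
centers = 0.5 * (edges[1:] + edges[:-1])

# Free energy in units of kT: F = -ln P (an arbitrary constant offset is ignored).
with np.errstate(divide="ignore"):
    free_energy = -np.log(hist)

for x, f in zip(centers, free_energy):
    print x, f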

API Reference

Execution Context API

class radical.ensemblemd.SingleClusterEnvironment (resource, cores, walltime, database_url, queue = None, username = None, allocation = None, cleanup = False)

A static execution context provides a fixed set of computational resources. A minimal usage sketch follows the method list below.

  • name: Returns the name of the execution context
  • allocate(): Allocates the resources
  • deallocate(): Deallocates the resources
  • run(pattern, force_plugin = None): Runs the given execution pattern on the allocated resources
  • get_name(): Returns the name of the execution context
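
A minimal usage sketch of this lifecycle (assuming a previously constructed pattern object and a reachable MongoDB instance; the values shown are placeholders):

from radical.ensemblemd import SingleClusterEnvironment

cluster = SingleClusterEnvironment(
        resource="local.localhost",
        cores=1,
        walltime=15,
        database_url="mongodb://localhost:27017/"    # placeholder
)

cluster.allocate()       # acquire the resources
cluster.run(pattern)     # execute a previously constructed pattern object
cluster.deallocate()     # release the resources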

Execution Pattern API

pre_loop()

The radical.ensemblemd.Kernel returned by pre_loop is executed before the main simulation-analysis loop is started. It can be used, for example, to set up structures, initialize experimental environments, perform one-time data transfers, and so on.

Returns:
Implementations of this method must return either a single or a list of radical.ensemblemd.Kernel object(s). An exception is thrown otherwise.

simulation_step(iteration, instance)

The radical.ensemblemd.Kernel returned by simulation_step is executed once per loop iteration before analysis_step.

Arguments:
  • iteration [int] - The iteration parameter is a positive integer and references the current iteration of the simulation-analysis loop.
  • instance [int] - The instance parameter is a positive integer and references the instance of the simulation step, which is in the range [1 .. simulation_instances].
Returns:
Implementations of this method must return either a single or a list of radical.ensemblemd.Kernel object(s). An exception is thrown otherwise.

analysis_step(iteration, instance)

The radical.ensemblemd.Kernel returned by analysis_step is executed once per loop iteration after simulation_step.

Arguments:
  • iteration [int] - The iteration parameter is a positive integer and references the current iteration of the simulation-analysis loop.
  • instance [int] - The instance parameter is a positive integer and references the instance of the analysis step, which is in the range [1 .. analysis_instances].
Returns:
Implementations of this method must return either a single or a list of radical.ensemblemd.Kernel object(s). An exception is thrown otherwise.

post_loop()

The radical.ensemblemd.Kernel returned by post_loop is executed after the main simulation-analysis loop has finished. It can be used, for example, to set up structures, initialize experimental environments, and so on.

Returns:
Implementations of this method must return a single radical.ensemblemd.Kernel object. An exception is thrown otherwise.
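
The four methods above are implemented by subclassing the simulation-analysis pattern. The following is a minimal sketch only; it assumes the pattern base class radical.ensemblemd.SimulationAnalysisLoop and reuses the misc.ccount kernel from the examples below as a stand-in for real MD and analysis kernels:

from radical.ensemblemd import Kernel
from radical.ensemblemd import SimulationAnalysisLoop    # assumed base class

class MySimulationAnalysisLoop(SimulationAnalysisLoop):

    def pre_loop(self):
        # One-time setup before the loop starts.
        k = Kernel(name="misc.ccount")
        k.arguments = ["--inputfile=input.txt", "--outputfile=setup.txt"]
        return k

    def simulation_step(self, iteration, instance):
        # Stand-in for an MD kernel; one task per (iteration, instance).
        k = Kernel(name="misc.ccount")
        k.arguments = ["--inputfile=input.txt",
                       "--outputfile=sim-%d-%d.txt" % (iteration, instance)]
        return k

    def analysis_step(self, iteration, instance):
        # Stand-in for an analysis kernel, consuming the simulation output.
        k = Kernel(name="misc.ccount")
        k.arguments = ["--inputfile=sim-%d-%d.txt" % (iteration, instance),
                       "--outputfile=ana-%d-%d.txt" % (iteration, instance)]
        return k

    def post_loop(self):
        # Executed once after the loop has finished.
        k = Kernel(name="misc.ccount")
        k.arguments = ["--inputfile=input.txt", "--outputfile=final.txt"]
        return k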

Application Kernel API

class radical.ensemblemd.Kernel (name, args = None)

The Kernel provides functions to support file movement as required by the pattern; a combined usage sketch follows the attribute list below.

  • cores: number of cores the kernel is using.
  • upload_input_data: Instructs the application to upload one or more files or directories from the host the script is running on into the kernel’s execution directory.
Example:
k = Kernel(name="misc.ccount")
k.arguments = ["--inputfile=input.txt", "--outputfile=output.txt"]
k.upload_input_data = ["/location/on/HOST/RUNNING/THE/SCRIPT/data.txt > input.txt"]
  • download_input_data: Instructs the kernel to download one or more files or directories from a remote HTTP server into the kernel’s execution directory.
Example:
k = Kernel(name="misc.ccount")
k.arguments = ["--inputfile=input.txt", "--outputfile=output.txt"]
k.download_input_data = ["http://REMOTE.WEBSERVER/location/data.txt > input.txt"]
  • copy_input_data: Instructs the kernel to copy one or more files or directories from the execution host’s filesystem into the kernel’s execution directory.
Example:
k = Kernel(name="misc.ccount")
k.arguments = ["--inputfile=input.txt", "--outputfile=output.txt"]
k.copy_input_data = ["/location/on/EXECUTION/HOST/data.txt > input.txt"]
  • link_input_data: Instructs the kernel to create a link to one or more files or directories on the execution host’s filesystem in the kernel’s execution directory.
Example:
k = Kernel(name="misc.ccount")
k.arguments = ["--inputfile=input.txt", "--outputfile=output.txt"]
k.link_input_data = ["/location/on/EXECUTION/HOST/data.txt > input.txt"]
  • download_output_data: Instructs the application to download one or more files or directories from the kernel’s execution directory back to the host the script is running on.
Example:
k = Kernel(name="misc.ccount")
k.arguments = ["--inputfile=input.txt", "--outputfile=output.txt"]
k.download_output_data = ["output.txt > output-run-1.txt"]
  • copy_output_data: Instructs the application to copy one or more files or directories from the kernel’s execution directory to a directory on the execution host’s filesystem.
Example:
k = Kernel(name="misc.ccount")
k.arguments = ["--inputfile=input.txt", "--outputfile=output.txt"]
k.download_output_data = ["output.txt > /location/on/EXECUTION/HOST/output.txt"]
  • get_raw_args(): Returns the arguments passed to the kernel.
  • get_arg(arg_name): Returns the value of the kernel argument given by ‘arg_name’.
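
Putting several of these attributes together, a single kernel description might look like the following sketch (again using the misc.ccount kernel and placeholder paths):

k = Kernel(name="misc.ccount")
k.arguments            = ["--inputfile=input.txt", "--outputfile=output.txt"]
k.upload_input_data    = ["/location/on/HOST/RUNNING/THE/SCRIPT/data.txt > input.txt"]
k.download_output_data = ["output.txt > output-run-1.txt"]

print k.get_raw_args()   # the arguments passed to the kernel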

Exceptions & Errors

This module defines and implements all ensemblemd exceptions; a minimal handling sketch follows the list below.

  • exception radical.ensemblemd.exceptions.EnsemblemdError(msg): EnsemblemdError is the base exception thrown by the ensemblemd library. [source]
    Bases: exceptions.Exception
    
  • exception radical.ensemblemd.exceptions.NotImplementedError(method_name, class_name): NotImplementedError is thrown if a class method or function is not implemented. [source]
    Bases: radical.ensemblemd.exceptions.EnsemblemdError
    
  • exception radical.ensemblemd.exceptions.TypeError(expected_type, actual_type): TypeError is thrown if a parameter of a wrong type is passed to a method or function. [source]
    Bases: radical.ensemblemd.exceptions.EnsemblemdError
    
  • exception radical.ensemblemd.exceptions.FileError(message): FileError is thrown if something goes wrong related to file operations, i.e., if a file doesn’t exist, cannot be copied and so on. [source]
    Bases: radical.ensemblemd.exceptions.EnsemblemdError
    
  • exception radical.ensemblemd.exceptions.ArgumentError(kernel_name, message, valid_arguments_set): An ArgumentError is thrown if a wrong set of arguments was passed to a kernel. [source]
    Bases: radical.ensemblemd.exceptions.EnsemblemdError
    
  • exception radical.ensemblemd.exceptions.NoKernelPluginError(kernel_name): NoKernelPluginError is thrown if no kernel plug-in could be found for a given kernel name. [source]
    Bases: radical.ensemblemd.exceptions.EnsemblemdError
    
  • exception radical.ensemblemd.exceptions.NoKernelConfigurationError(kernel_name, resource_key): NoKernelConfigurationError is thrown if no kernel configuration could be found for the provided resource key. [source]
    Bases: radical.ensemblemd.exceptions.EnsemblemdError
    
  • exception radical.ensemblemd.exceptions.NoExecutionPluginError(pattern_name, context_name, plugin_name): NoExecutionPluginError is thrown if a pattern is passed to an execution context via execute() but no execution plugin for the pattern exists. [source]
    Bases: radical.ensemblemd.exceptions.EnsemblemdError
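
Since all of these exceptions derive from EnsemblemdError, a user script can handle them with a single except clause; a minimal sketch:

from radical.ensemblemd import Kernel
from radical.ensemblemd import EnsemblemdError

try:
    k = Kernel(name="misc.ccount")
    k.arguments = ["--inputfile=input.txt", "--outputfile=output.txt"]
    # ... build the pattern and run it on an execution context ...
except EnsemblemdError as er:
    print "Ensemble MD Toolkit Error: {0}".format(str(er))
    raise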
    

Customization

Writing New Application Kernels

While the current set of available application kernels might provide a good set of tools to start with, sooner or later you will probably want to use a tool for which no application Kernel exists. This section describes how you can add your custom kernels.

We have two files: user_script.py, which contains the user application that uses our custom kernel, and new_kernel.py, which contains the definition of the custom kernel. You can download them from the following links:

Let’s first take a look at new_kernel.py.

from radical.ensemblemd.kernel_plugins.kernel_base import KernelBase

# ------------------------------------------------------------------------------
#
_KERNEL_INFO = {
    "name":         "sleep",        #Mandatory
    "description":  "sleeping kernel",        #Optional
    "arguments":   {                #Mandatory
        "--interval=": {
            "mandatory": True,        #Mandatory argument? True or False
            "description": "Number of seconds to do nothing."
        },
    },
    "machine_configs":             #Use a dictionary with keys as resource names and values specific to the resource
        {
            "local.localhost":
            {
                "environment" : None,        #list or None, can be used to set environment variables
                "pre_exec"    : None,            #list or None, can be used to load modules
                "executable"  : ["/bin/sleep"],        #specify the executable to be used
                "uses_mpi"    : False            #mpi-enabled? True or False
            },
        }
}

# ------------------------------------------------------------------------------
#
class MyUserDefinedKernel(KernelBase):

    def __init__(self):

        """Le constructor."""
        super(MyUserDefinedKernel, self).__init__(_KERNEL_INFO)

    # --------------------------------------------------------------------------
    #
    @staticmethod
    def get_name():
        return _KERNEL_INFO["name"]

    def _bind_to_resource(self, resource_key):
        """This function binds the Kernel to a specific resource defined in
        "resource_key".
        """        
        arguments  = ['{0}'.format(self.get_arg("--interval="))]

        self._executable  = _KERNEL_INFO["machine_configs"][resource_key]["executable"]
        self._arguments   = arguments
        self._environment = _KERNEL_INFO["machine_configs"][resource_key]["environment"]
        self._uses_mpi    = _KERNEL_INFO["machine_configs"][resource_key]["uses_mpi"]
        self._pre_exec    = _KERNEL_INFO["machine_configs"][resource_key]["pre_exec"]

# ------------------------------------------------------------------------------

Lines 5-24 contain information about the kernel to be defined. The “name” and “arguments” keys are mandatory. The “arguments” key needs to specify the arguments the kernel expects, and you can specify whether the individual arguments are mandatory or not. “machine_configs” is not mandatory, but creating a dictionary with resource names (the same as defined in the SingleClusterEnvironment) as keys and resource-specific values lets the same kernel be used on different machines.

In lines 28-50, we define a user-defined class (of “KernelBase” type) with three mandatory functions. First, the constructor, which is self-explanatory. Second, a static method that is used by EnsembleMD to differentiate kernels. Third, _bind_to_resource, which is the function that (as the name suggests) binds the kernel to its resource-specific values during execution. In lines 41, 43-45, you can see how the “machine_configs” dictionary approach helps us address tool-level heterogeneity across resources. There might be other ways to do this (if conditions, etc.), but we feel this can be quite convenient.

Now, let’s take a look at user_script.py

from radical.ensemblemd import Kernel
from radical.ensemblemd import Pipeline
from radical.ensemblemd import EnsemblemdError
from radical.ensemblemd import SingleClusterEnvironment

#Used to register user defined kernels
from radical.ensemblemd.engine import get_engine

#Import our new kernel
from new_kernel import MyUserDefinedKernel

# Register the user-defined kernel with Ensemble MD Toolkit.
get_engine().add_kernel_plugin(MyUserDefinedKernel)


#Now carry on with your application as usual !
class Sleep(Pipeline):

    def __init__(self, instances,steps):
        Pipeline.__init__(self, instances,steps)

    def step_1(self, instance):
        """This step sleeps for 60 seconds."""

        k = Kernel(name="sleep")
        k.arguments = ["--interval=10"]
        return k


# ------------------------------------------------------------------------------
#
if __name__ == "__main__":

    try:
        # Create a new static execution context with one resource and a fixed
        # number of cores and runtime.
        cluster = SingleClusterEnvironment(
                resource="local.localhost",
                cores=1,
                walltime=15,
                username=None,
                project=None
        )

        # Allocate the resources.
        cluster.allocate()

        # Set the 'instances' of the pipeline to 16. This means that 16 instances
        # of each pipeline step are executed.
        #
        # Execution of the 16 pipeline instances can happen concurrently or
        # sequentially, depending on the resources (cores) available in the
        # SingleClusterEnvironment.
        sleep = Sleep(steps=1,instances=16)

        cluster.run(sleep)

        cluster.deallocate()

    except EnsemblemdError, er:

        print "Ensemble MD Toolkit Error: {0}".format(str(er))
        raise # Just raise the exception again to get the backtrace

There are three important lines in this script. In line 7, we import the get_engine function in order to register our new kernel. In line 10, we import our new kernel, and in line 13, we register it. THAT’S IT. We can continue with the application as in the previous examples.

Writing a Custom Resource Configuration File

A number of resources are already supported by RADICAL-Pilot; they are listed in List of Pre-Configured Resources. If you want to use RADICAL-Pilot with a resource that is not in any of the provided configuration files, you can write your own, and drop it in $HOME/.radical/pilot/configs/<your_site>.json.

Note

Be advised that you may need system admin level knowledge of the target cluster to do so. Also, while RADICAL-Pilot can handle very different types of systems and batch systems, it may run into trouble on specific configurations or versions we did not encounter before. If you run into trouble using a cluster not in our list of officially supported ones, please drop us a note on the users mailing list.

A configuration file has to be valid JSON. The structure is as follows:

# filename: lrz.json
{
    "supermuc":
    {
        "description"                 : "The SuperMUC petascale HPC cluster at LRZ.",
        "notes"                       : "Access only from registered IP addresses.",
        "schemas"                     : ["gsissh", "ssh"],
        "ssh"                         :
        {
            "job_manager_endpoint"    : "loadl+ssh://supermuc.lrz.de/",
            "filesystem_endpoint"     : "sftp://supermuc.lrz.de/"
        },
        "gsissh"                      :
        {
            "job_manager_endpoint"    : "loadl+gsissh://supermuc.lrz.de:2222/",
            "filesystem_endpoint"     : "gsisftp://supermuc.lrz.de:2222/"
        },
        "default_queue"               : "test",
        "lrms"                        : "LOADL",
        "task_launch_method"          : "SSH",
        "mpi_launch_method"           : "MPIEXEC",
        "forward_tunnel_endpoint"     : "login03",
        "global_virtenv"              : "/home/hpc/pr87be/di29sut/pilotve",
        "pre_bootstrap"               : ["source /etc/profile",
                                        "source /etc/profile.d/modules.sh",
                                        "module load python/2.7.6",
                                        "module unload mpi.ibm", "module load mpi.intel",
                                        "source /home/hpc/pr87be/di29sut/pilotve/bin/activate"
                                        ],
        "valid_roots"                 : ["/home", "/gpfs/work", "/gpfs/scratch"],
        "pilot_agent"                 : "radical-pilot-agent-multicore.py"
    },
    "ANOTHER_KEY_NAME":
    {
        ...
    }
}

The name of your file (here lrz.json) together with the name of the resource (supermuc) form the resource key which is used in the ComputePilotDescription resource attribute (lrz.supermuc).
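
With this file in place, the new resource key can be used like any pre-configured one, for example when creating an execution context (a sketch; the queue, allocation and MongoDB URL are placeholders that depend on your account and setup):

from radical.ensemblemd import SingleClusterEnvironment

cluster = SingleClusterEnvironment(
        resource="lrz.supermuc",                     # <filename>.<key> from the JSON above
        cores=32,
        walltime=60,
        queue="test",
        allocation="YOUR_PROJECT_ID",                # placeholder
        database_url="mongodb://localhost:27017/"    # placeholder
)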

All fields are mandatory, unless indicated otherwise below.

  • description: a human readable description of the resource
  • notes: information needed to form valid pilot descriptions, such as which parameters are required, etc.
  • schemas: allowed values for the access_schema parameter of the pilot description. The first schema in the list is used by default. For each schema, a subsection is needed which specifies job_manager_endpoint and filesystem_endpoint.
  • job_manager_endpoint: access url for pilot submission (interpreted by SAGA)
  • filesystem_endpoint: access url for file staging (interpreted by SAGA)
  • default_queue: queue to use for pilot submission (optional)
  • lrms: type of job management system (LOADL, LSF, PBSPRO, SGE, SLURM, TORQUE, FORK)
  • task_launch_method: type of compute node access (required for non-MPI units: SSH, APRUN or LOCAL)
  • mpi_launch_method: type of MPI support (required for MPI units: MPIRUN, MPIEXEC, APRUN, IBRUN or POE)
  • python_interpreter: path to python (optional)
  • pre_bootstrap: list of commands to execute for initialization (optional)
  • valid_roots: list of shared file system roots (optional). Pilot sandboxes must lie under these roots.
  • pilot_agent: type of pilot agent to use (radical-pilot-agent-multicore.py)
  • forward_tunnel_endpoint: name of host which can be used to create ssh tunnels from the compute nodes to the outside world (optional)

Several configuration files are part of the RADICAL-Pilot installation, and live under radical/pilot/configs/.

List of Pre-Configured Resources

The following resources are supported by the underlying layers of ExTASY.

Note

To configure your applications to run on these machines, you would need to add entries to your kernel definitions and specify the environment to be loaded for execution, executable, arguments, etc.
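
As a sketch, the corresponding entry inside a kernel's "machine_configs" dictionary for one of the machines below might look like the following; the module name and executable are illustrative and depend on what is actually installed on that resource:

"machine_configs":
{
    ..
    ..

    "xsede.stampede":
    {
        "pre_exec"    : ["module load gromacs"],    # illustrative module name
        "executable"  : ["mdrun"],
        "uses_mpi"    : True
    },

    ..
    ..
}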

RESOURCE_LOCAL

LOCALHOST_SPARK_ANA

Your local machine gets spark.

  • Resource label : local.localhost_spark_ana
  • Raw config : resource_local.json
  • Note : To use the ssh schema, make sure that ssh access to localhost is enabled.
  • Default values for ComputePilotDescription attributes:
  • queue         : None
  • sandbox       : $HOME
  • access_schema : local
  • Available schemas : local, ssh

LOCALHOST_YARN

Your local machine.

Uses the YARN resource management system.

  • Resource label : local.localhost_yarn
  • Raw config : resource_local.json
  • Note : To use the ssh schema, make sure that ssh access to localhost is enabled.
  • Default values for ComputePilotDescription attributes:
  • queue         : None
  • sandbox       : $HOME
  • access_schema : local
  • Available schemas : local, ssh

LOCALHOST_ANACONDA

Your local machine.

To be used when the anaconda python interpreter is enabled.

  • Resource label : local.localhost_anaconda
  • Raw config : resource_local.json
  • Note : To use the ssh schema, make sure that ssh access to localhost is enabled.
  • Default values for ComputePilotDescription attributes:
  • queue         : None
  • sandbox       : $HOME
  • access_schema : local
  • Available schemas : local, ssh

LOCALHOST

Your local machine.

  • Resource label : local.localhost
  • Raw config : resource_local.json
  • Note : To use the ssh schema, make sure that ssh access to localhost is enabled.
  • Default values for ComputePilotDescription attributes:
  • queue         : None
  • sandbox       : $HOME
  • access_schema : local
  • Available schemas : local, ssh

LOCALHOST_SPARK

Your local machine gets spark.

  • Resource label : local.localhost_spark
  • Raw config : resource_local.json
  • Note : To use the ssh schema, make sure that ssh access to localhost is enabled.
  • Default values for ComputePilotDescription attributes:
  • queue         : None
  • sandbox       : $HOME
  • access_schema : local
  • Available schemas : local, ssh

RESOURCE_EPSRC

ARCHER

The EPSRC Archer Cray XC30 system (https://www.archer.ac.uk/)

  • Resource label : epsrc.archer
  • Raw config : resource_epsrc.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : standard
  • sandbox       : /work/`id -gn`/`id -gn`/$USER
  • access_schema : ssh
  • Available schemas : ssh

ARCHER_ORTE

The EPSRC Archer Cray XC30 system (https://www.archer.ac.uk/)

  • Resource label : epsrc.archer_orte
  • Raw config : resource_epsrc.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : standard
  • sandbox       : /work/`id -gn`/`id -gn`/$USER
  • access_schema : ssh
  • Available schemas : ssh

RESOURCE_NERSC

EDISON_CCM

The NERSC Edison Cray XC30 in Cluster Compatibility Mode (https://www.nersc.gov/users/computational-systems/edison/)

  • Resource label : nersc.edison_ccm
  • Raw config : resource_nersc.json
  • Note : For CCM you need to use special ccm_ queues.
  • Default values for ComputePilotDescription attributes:
  • queue         : ccm_queue
  • sandbox       : $SCRATCH
  • access_schema : ssh
  • Available schemas : ssh

EDISON

The NERSC Edison Cray XC30 (https://www.nersc.gov/users/computational-systems/edison/)

  • Resource label : nersc.edison
  • Raw config : resource_nersc.json
  • Note :
  • Default values for ComputePilotDescription attributes:
  • queue         : regular
  • sandbox       : $SCRATCH
  • access_schema : ssh
  • Available schemas : ssh, go

HOPPER

The NERSC Hopper Cray XE6 (https://www.nersc.gov/users/computational-systems/hopper/)

  • Resource label : nersc.hopper
  • Raw config : resource_nersc.json
  • Note :
  • Default values for ComputePilotDescription attributes:
  • queue         : regular
  • sandbox       : $SCRATCH
  • access_schema : ssh
  • Available schemas : ssh, go

HOPPER_APRUN

The NERSC Hopper Cray XE6 (https://www.nersc.gov/users/computational-systems/hopper/)

  • Resource label : nersc.hopper_aprun
  • Raw config : resource_nersc.json
  • Note : Only one CU per node in APRUN mode
  • Default values for ComputePilotDescription attributes:
  • queue         : regular
  • sandbox       : $SCRATCH
  • access_schema : ssh
  • Available schemas : ssh

HOPPER_CCM

The NERSC Hopper Cray XE6 in Cluster Compatibility Mode (https://www.nersc.gov/users/computational-systems/hopper/)

  • Resource label : nersc.hopper_ccm
  • Raw config : resource_nersc.json
  • Note : For CCM you need to use special ccm_ queues.
  • Default values for ComputePilotDescription attributes:
  • queue         : ccm_queue
  • sandbox       : $SCRATCH
  • access_schema : ssh
  • Available schemas : ssh

EDISON_APRUN

The NERSC Edison Cray XC30 (https://www.nersc.gov/users/computational-systems/edison/)

  • Resource label : nersc.edison_aprun
  • Raw config : resource_nersc.json
  • Note : Only one CU per node in APRUN mode
  • Default values for ComputePilotDescription attributes:
  • queue         : regular
  • sandbox       : $SCRATCH
  • access_schema : ssh
  • Available schemas : ssh, go

RESOURCE_STFC

JOULE

The STFC Joule IBM BG/Q system (http://community.hartree.stfc.ac.uk/wiki/site/admin/home.html)

  • Resource label : stfc.joule
  • Raw config : resource_stfc.json
  • Note : This currently needs a centrally administered outbound ssh tunnel.
  • Default values for ComputePilotDescription attributes:
  • queue         : prod
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh

RESOURCE_RICE

DAVINCI

The DAVinCI Linux cluster at Rice University (https://docs.rice.edu/confluence/display/ITDIY/Getting+Started+on+DAVinCI).

  • Resource label : rice.davinci
  • Raw config : resource_rice.json
  • Note : DAVinCI compute nodes have 12 or 16 processor cores per node.
  • Default values for ComputePilotDescription attributes:
  • queue         : parallel
  • sandbox       : $SHARED_SCRATCH/$USER
  • access_schema : ssh
  • Available schemas : ssh

BIOU

The Blue BioU Linux cluster at Rice University (https://docs.rice.edu/confluence/display/ITDIY/Getting+Started+on+Blue+BioU).

  • Resource label : rice.biou
  • Raw config : resource_rice.json
  • Note : Blue BioU compute nodes have 32 processor cores per node.
  • Default values for ComputePilotDescription attributes:
  • queue         : serial
  • sandbox       : $SHARED_SCRATCH/$USER
  • access_schema : ssh
  • Available schemas : ssh

RESOURCE_LRZ

SUPERMUC

The SuperMUC petascale HPC cluster at LRZ, Munich (http://www.lrz.de/services/compute/supermuc/).

  • Resource label : lrz.supermuc
  • Raw config : resource_lrz.json
  • Note : Default authentication to SuperMUC uses X509 and is firewalled, make sure you can gsissh into the machine from your registered IP address. Because of outgoing traffic restrictions your MongoDB needs to run on a port in the range 20000 to 25000.
  • Default values for ComputePilotDescription attributes:
  • queue         : test
  • sandbox       : $HOME
  • access_schema : gsissh
  • Available schemas : gsissh, ssh

RESOURCE_NCSA

BW_CCM

The NCSA Blue Waters Cray XE6/XK7 system in CCM (https://bluewaters.ncsa.illinois.edu/)

  • Resource label : ncsa.bw_ccm
  • Raw config : resource_ncsa.json
  • Note : Running ‘touch .hushlogin’ on the login node will reduce the likelihood of prompt detection issues.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : /scratch/sciteam/$USER
  • access_schema : gsissh
  • Available schemas : gsissh

BW

The NCSA Blue Waters Cray XE6/XK7 system (https://bluewaters.ncsa.illinois.edu/)

  • Resource label : ncsa.bw
  • Raw config : resource_ncsa.json
  • Note : Running ‘touch .hushlogin’ on the login node will reduce the likelihood of prompt detection issues.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : /scratch/sciteam/$USER
  • access_schema : gsissh
  • Available schemas : gsissh

BW_LOCAL

The NCSA Blue Waters Cray XE6/XK7 system (https://bluewaters.ncsa.illinois.edu/)

  • Resource label : ncsa.bw_local
  • Raw config : resource_ncsa.json
  • Note : Running ‘touch .hushlogin’ on the login node will reduce the likelihood of prompt detection issues.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : /scratch/training/$USER
  • access_schema : local
  • Available schemas : local

BW_APRUN

The NCSA Blue Waters Cray XE6/XK7 system (https://bluewaters.ncsa.illinois.edu/)

  • Resource label : ncsa.bw_aprun
  • Raw config : resource_ncsa.json
  • Note : Running ‘touch .hushlogin’ on the login node will reduce the likelihood of prompt detection issues.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : /scratch/sciteam/$USER
  • access_schema : gsissh
  • Available schemas : gsissh

RESOURCE_RADICAL

TUTORIAL

Our private tutorial VM on EC2

  • Resource label : radical.tutorial
  • Raw config : resource_radical.json
  • Default values for ComputePilotDescription attributes:
  • queue         : batch
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh, local

RESOURCE_XSEDE

LONESTAR

The XSEDE ‘Lonestar’ cluster at TACC (https://www.tacc.utexas.edu/resources/hpc/lonestar).

  • Resource label : xsede.lonestar
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh, gsissh

COMET_SPARK

The Comet HPC resource at SDSC ‘HPC for the 99%’ (http://www.sdsc.edu/services/hpc/hpc_systems.html#comet).

  • Resource label : xsede.comet_spark
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : compute
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh, gsissh

WRANGLER

The XSEDE ‘Wrangler’ cluster at TACC (https://www.tacc.utexas.edu/wrangler/).

  • Resource label : xsede.wrangler
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : $WORK
  • access_schema : ssh
  • Available schemas : ssh, gsissh, go

STAMPEDE_YARN

The XSEDE ‘Stampede’ cluster at TACC (https://www.tacc.utexas.edu/stampede/).

  • Resource label : xsede.stampede_yarn
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : $WORK
  • access_schema : gsissh
  • Available schemas : gsissh, ssh, go

STAMPEDE

The XSEDE ‘Stampede’ cluster at TACC (https://www.tacc.utexas.edu/stampede/).

  • Resource label : xsede.stampede
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : $WORK
  • access_schema : gsissh
  • Available schemas : gsissh, ssh, go

COMET_SSH

The Comet HPC resource at SDSC ‘HPC for the 99%’ (http://www.sdsc.edu/services/hpc/hpc_systems.html#comet).

  • Resource label : xsede.comet_ssh
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : compute
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh, gsissh

WRANGLER_YARN

The XSEDE ‘Wrangler’ cluster at TACC (https://www.tacc.utexas.edu/wrangler/).

  • Resource label : xsede.wrangler_yarn
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : hadoop
  • sandbox       : $WORK
  • access_schema : ssh
  • Available schemas : ssh, gsissh, go

GORDON

The XSEDE ‘Gordon’ cluster at SDSC (http://www.sdsc.edu/us/resources/gordon/).

  • Resource label : xsede.gordon
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh, gsissh

BLACKLIGHT

The XSEDE ‘Blacklight’ cluster at PSC (https://www.psc.edu/index.php/computing-resources/blacklight).

  • Resource label : xsede.blacklight
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : batch
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh, gsissh

WRANGLER_SPARK

The XSEDE ‘Wrangler’ cluster at TACC (https://www.tacc.utexas.edu/wrangler/).

  • Resource label : xsede.wrangler_spark
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : $WORK
  • access_schema : ssh
  • Available schemas : ssh, gsissh, go

COMET

The Comet HPC resource at SDSC ‘HPC for the 99%’ (http://www.sdsc.edu/services/hpc/hpc_systems.html#comet).

  • Resource label : xsede.comet
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : compute
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh, gsissh

SUPERMIC

SuperMIC (pronounced ‘Super Mick’) is Louisiana State University’s (LSU) newest supercomputer funded by the National Science Foundation’s (NSF) Major Research Instrumentation (MRI) award to the Center for Computation & Technology. (https://portal.xsede.org/lsu-supermic)

  • Resource label : xsede.supermic
  • Raw config : resource_xsede.json
  • Note : Partially allocated through XSEDE. Primary access through GSISSH. Allows SSH key authentication too.
  • Default values for ComputePilotDescription attributes:
  • queue         : workq
  • sandbox       : /work/$USER
  • access_schema : gsissh
  • Available schemas : gsissh, ssh

STAMPEDE_SPARK

The XSEDE ‘Stampede’ cluster at TACC (https://www.tacc.utexas.edu/stampede/).

  • Resource label : xsede.stampede_spark
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : $WORK
  • access_schema : gsissh
  • Available schemas : gsissh, ssh, go

TRESTLES

The XSEDE ‘Trestles’ cluster at SDSC (http://www.sdsc.edu/us/resources/trestles/).

  • Resource label : xsede.trestles
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : normal
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh, gsissh

GREENFIELD

The XSEDE ‘Greenfield’ cluster at PSC (https://www.psc.edu/index.php/computing-resources/greenfield).

  • Resource label : xsede.greenfield
  • Raw config : resource_xsede.json
  • Note : Always set the project attribute in the ComputePilotDescription or the pilot will fail.
  • Default values for ComputePilotDescription attributes:
  • queue         : batch
  • sandbox       : $HOME
  • access_schema : ssh
  • Available schemas : ssh, gsissh

RESOURCE_ORNL

TITAN_ORTE

The Cray XK7 supercomputer located at the Oak Ridge Leadership Computing Facility (OLCF), (https://www.olcf.ornl.gov/titan/)

  • Resource label : ornl.titan_orte
  • Raw config : resource_ornl.json
  • Note : Requires the use of an RSA SecurID on every connection.
  • Default values for ComputePilotDescription attributes:
  • queue         : batch
  • sandbox       : $MEMBERWORK/`groups | cut -d' ' -f2`
  • access_schema : ssh
  • Available schemas : ssh, local, go

TITAN_APRUN

The Cray XK7 supercomputer located at the Oak Ridge Leadership Computing Facility (OLCF), (https://www.olcf.ornl.gov/titan/)

  • Resource label : ornl.titan_aprun
  • Raw config : resource_ornl.json
  • Note : Requires the use of an RSA SecurID on every connection.
  • Default values for ComputePilotDescription attributes:
  • queue         : batch
  • sandbox       : $MEMBERWORK/`groups | cut -d' ' -f2`
  • access_schema : ssh
  • Available schemas : ssh, local, go

Troubleshooting

Some issues that you might face during the execution are discussed here.

Execution fails with “Couldn’t read packet: Connection reset by peer”

You may encounter the following error when running any of the ExTASY workflows:

...
#######################
##       ERROR       ##
#######################
Pilot 54808707f8cdba339a7204ce has FAILED. Can't recover.
Pilot log: [u'Pilot launching failed: Insufficient system resources: Insufficient system resources: read from process failed \'[Errno 5] Input/output error\' : (Shared connection to stampede.tacc.utexas.edu closed.\n)
...

To fix this, create a file ~/.saga/cfg in your home directory and add the following two lines:

[saga.utils.pty]
ssh_share_mode = no

This switches the SSH transfer layer into “compatibility” mode which should address the “Connection reset by peer” problem.

Configuring SSH Access

From a terminal on your local machine, set up a key pair with your email address.

$ ssh-keygen -t rsa -C "name@email.com"

Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa): [Enter]
Enter passphrase (empty for no passphrase): [Passphrase]
Enter same passphrase again: [Passphrase]
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.ac.uk
The key's randomart image is:
+--[ RSA 2048]----+
|    . ...+o++++. |
| . . . =o..      |
|+ . . .......o o |
|oE .   .         |
|o =     .   S    |
|.    +.+     .   |
|.  oo            |
|.  .             |
| ..              |
+-----------------+

Next you need to transfer it to the remote machine.

To transfer to Stampede,

$ cat ~/.ssh/id_rsa.pub | ssh username@stampede.tacc.utexas.edu 'cat - >> ~/.ssh/authorized_keys'

To transfer to Archer,

cat ~/.ssh/id_rsa.pub | ssh username@login.archer.ac.uk 'cat - >> ~/.ssh/authorized_keys'

Error: Permission denied (publickey,keyboard-interactive) in AGENT.STDERR

The Pilot does not start running and goes to the ‘Done’ state directly from ‘PendingActive’. Please check the AGENT.STDERR file for “Permission denied (publickey,keyboard-interactive)” .

Permission denied (publickey,keyboard-interactive).
kill: 19932: No such process

You need to set up passwordless, intra-node SSH access. Although this is the default on most HPC clusters, it might not always be the case.

On the head-node, run:

cd ~/.ssh/
ssh-keygen -t rsa

Do not enter a passphrase. The result should look like this:

Generating public/private rsa key pair.

Enter file in which to save the key (/home/e290/e290/oweidner/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/e290/e290/oweidner/.ssh/id_rsa.
Your public key has been saved in /home/e290/e290/oweidner/.ssh/id_rsa.pub.
The key fingerprint is:
73:b9:cf:45:3d:b6:a7:22:72:90:28:0a:2f:8a:86:fd oweidner@eslogin001
The key's randomart image is:
+--[ RSA 2048]----+
|    . ...+o++++. |
| . . . =o..      |
|+ . . .......o o |
|oE .   .         |
|o =     .   S    |
|.    +.+     .   |
|.  oo            |
|.  .             |
| ..              |
+-----------------+

Next, you need to add this key to the authorized_keys file.

cat id_rsa.pub >> ~/.ssh/authorized_keys

This should be all. Next time you run radical.pilot, you shouldn’t see that error message anymore.

Error: Couldn’t create new session

If you get an error similar to,

An error occurred: Couldn't create new session (database URL 'mongodb://extasy:extasyproject@extasy-db.epcc.ac.uk/radicalpilot' incorrect?): [Errno -2] Name or service not known
Exception triggered, no session created, exiting now...

This means no session was created, mostly due to an error in the MongoDB URL present in the resource configuration file. Please check the URL that you have used. If the URL is correct, you should check the system on which the MongoDB is hosted.

Error: Prompted for unknown password

If you get an error similar to,

An error occurred: prompted for unknown password (username@stampede.tacc.utexas.edu's password: ) (/experiments/extasy/local/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py +306 (_initialize_pty)  :  % match))

You should check the username that is present in the resource configuration file. If the username is correct, you should check if you have a passwordless login set up for the target machine. You can check this by simply attempting a login to the target machine, if this attempt requires a password, you need to set up a passwordless login to use ExTASY.

Error: Pilot has FAILED. Can’t recover

If you get an error similar to,

ExTASY version :  0.1.3-beta-15-g9e16ce7
Session UID: 55102e9023769c19e7c8a84e
Pilot UID       : 55102e9123769c19e7c8a850
[Callback]: ComputePilot '55102e9123769c19e7c8a850' state changed to Launching.
Loading kernel configurations from /experiments/extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/mmpbsa.json
Loading kernel configurations from /experiments/extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/coco.json
Loading kernel configurations from /experiments/extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/namd.json
Loading kernel configurations from /experiments/extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/lsdmap.json
Loading kernel configurations from /experiments/extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/amber.json
Loading kernel configurations from /experiments/extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/gromacs.json
Loading kernel configurations from /experiments/extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/sleep.json
Loading kernel configurations from /experiments/extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/test.json
Preprocessing stage ....
[Callback]: ComputePilot '55102e9123769c19e7c8a850' state changed to Failed.
#######################
##       ERROR       ##
#######################
Pilot 55102e9123769c19e7c8a850 has FAILED. Can't recover.
Pilot log: [<radical.pilot.logentry.Logentry object at 0x7f41f8043a10>, <radical.pilot.logentry.Logentry object at 0x7f41f8043610>, <radical.pilot.logentry.Logentry object at 0x7f41f80433d0>, <radical.pilot.logentry.Logentry object at 0x7f41f8043750>, <radical.pilot.logentry.Logentry object at 0x7f41f8043710>, <radical.pilot.logentry.Logentry object at 0x7f41f8043690>]
Execution was interrupted
Closing session, exiting now ...

This generally means that either the Allocation ID or the Queue name present in the resource configuration file is incorrect. If this is not the case, please re-run the experiment with the environment variables EXTASY_DEBUG=True, SAGA_VERBOSE=DEBUG, RADICAL_PILOT_VERBOSE=DEBUG. For example:

EXTASY_DEBUG=True SAGA_VERBOSE=DEBUG RADICAL_PILOT_VERBOSE=DEBUG extasy --RPconfig stampede.rcfg --Kconfig gromacslsdmap.wcfg 2> output.log

This should generate a more verbose output. You may look at this verbose output for errors or create a ticket with this log here.

Couldn’t send packet: Broken pipe

If you get an error similar to,

2015:03:30 16:05:07 radical.pilot.MainProcess: [DEBUG   ] read : [   19] [  159] ( ls /work/e290/e290/e290ib/radical.pilot.sandbox/pilot-55196431d7bf7579ecc ^H3f080/unit-551965f7d7bf7579ecc3f09b/lsdmap.log\nCouldn't send packet: Broken pipe\n)
2015:03:30 16:05:08 radical.pilot.MainProcess: [ERROR   ] Output transfer failed: read from process failed '[Errno 5] Input/output error' : (s   --:-- ETA/home/h012/ibethune/testlsdmap2/input.gro     100%  105KB 104.7KB/s   00:00
sftp>  ls /work/e290/e290/e290ib/radical.pilot.sandbox/pilot-55196431d7bf7579ecc ^H3f080/unit-551965f7d7bf7579ecc3f09b/lsdmap.log
Couldn't send packet: Broken pipe

This is mostly because of an older version of sftp/scp being used. This can be fixed by setting an environment variable SAGA_PTY_SSH_SHAREMODE to no.

export SAGA_PTY_SSH_SHAREMODE=no


Indices and tables