Welcome to Ray!


Ray is an open-source unified framework for scaling AI and Python applications. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert.
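
As a minimal sketch of what that means in practice (the same pattern appears in the Ray Core Quickstart later on this page), Ray can run an ordinary Python function as parallel tasks with a decorator and a remote call:

import ray

ray.init()  # Start Ray on this machine.

@ray.remote
def square(x):
    return x * x

# Run four invocations in parallel and gather the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]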

Scaling with Ray


from typing import Dict
import numpy as np

import ray

# Step 1: Create a Ray Dataset from in-memory Numpy arrays.
ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"]))

# Step 2: Define a Predictor class for inference.
class HuggingFacePredictor:
    def __init__(self):
        from transformers import pipeline
        # Initialize a pre-trained GPT2 Huggingface pipeline.
        self.model = pipeline("text-generation", model="gpt2")

    # Logic for inference on 1 batch of data.
    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        # Get the predictions from the input batch.
        predictions = self.model(
            list(batch["data"]), max_length=20, num_return_sequences=1)
        # `predictions` is a list of length-one lists. For example:
        # [[{'generated_text': 'output_1'}], ..., [{'generated_text': 'output_2'}]]
        # Modify the output to get it into the following format instead:
        # ['output_1', 'output_2']
        batch["output"] = [sequences[0]["generated_text"] for sequences in predictions]
        return batch

# Use 2 parallel actors for inference. Each actor predicts on a
# different partition of data.
scale = ray.data.ActorPoolStrategy(size=2)
# Step 3: Map the Predictor over the Dataset to get predictions.
predictions = ds.map_batches(HuggingFacePredictor, compute=scale)
# Step 4: Show one prediction output.
predictions.show(limit=1)

            

from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

# Step 1: Set up PyTorch model training as you normally would
def train_loop_per_worker():
    model = ...
    train_dataset = ...
    for epoch in range(num_epochs):
        ...  # model training logic

# Step 2: Set up Ray's PyTorch Trainer to run on 32 GPUs
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=32, use_gpu=True),
    datasets={"train": train_dataset},
)

# Step 3: run distributed model training on 32 GPUs
result = trainer.fit()
            

from ray import tune
from ray.air.config import ScalingConfig
from ray.train.lightgbm import LightGBMTrainer

train_dataset, eval_dataset = ...

# Step 1: Set up Ray's LightGBM Trainer to train on 64 CPUs
trainer = LightGBMTrainer(
    ...
    scaling_config=ScalingConfig(num_workers=64),
    datasets={"train": train_dataset, "eval": eval_dataset},
)

# Step 2: Set up Ray Tuner to run 1000 trials
tuner = tune.Tuner(
    trainer,
    param_space=hyper_param_space,
    tune_config=tune.TuneConfig(num_samples=1000),
)

# Step 3: run distributed HPO with 1000 trials; each trial runs on 64 CPUs
result_grid = tuner.fit()

            

import pandas as pd

from ray import serve
from starlette.requests import Request


@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str, revision: str = None):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            ...
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def generate(self, text: str) -> pd.DataFrame:
        input_ids = self.tokenizer(text, return_tensors="pt").input_ids.to(
            self.model.device
        )

        gen_tokens = self.model.generate(
            input_ids,
            ...
        )
        return pd.DataFrame(
            self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
        )

    async def __call__(self, http_request: Request) -> pd.DataFrame:
        json_request = await http_request.json()
        prompts: list[str] = json_request["prompts"]
        return self.generate(prompts)


            

from ray.rllib.algorithms.ppo import PPOConfig

# Step 1: configure PPO to run 64 parallel workers to collect samples from the env.
ppo_config = (
    PPOConfig()
    .environment(env="Taxi-v3")
    .rollouts(num_rollout_workers=64)
    .framework("torch")
    .training(model={"use_lstm": True})  # Example model config: use an LSTM-based model.
)

# Step 2: build the PPO algorithm
ppo_algo = ppo_config.build()

# Step 3: train and evaluate PPO
for _ in range(5):
    print(ppo_algo.train())

ppo_algo.evaluate()
            

Getting Started

Beyond the basics

Ray AI Runtime

Scale the entire ML pipeline from data ingest to model serving with high-level Python APIs that integrate with popular ecosystem frameworks.

Learn more about AIR >
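
As a rough sketch of what that looks like end to end, the snippet below chains Ray Data, a Preprocessor, and a Trainer; the dataset path, column names, and parameters are taken from the Ray AI Runtime Quickstart later on this page.

import ray
from ray.air.config import ScalingConfig
from ray.data.preprocessors import StandardScaler
from ray.train.xgboost import XGBoostTrainer

# Load a dataset from cloud storage and split it.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

# Scale two feature columns, then train XGBoost with 2 data-parallel workers.
trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    label_column="target",
    num_boost_round=20,
    params={"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    datasets={"train": train_dataset, "valid": valid_dataset},
    preprocessor=StandardScaler(columns=["mean radius", "mean texture"]),
)
result = trainer.fit()
print(result.metrics)

The same script scales up by increasing num_workers in ScalingConfig and running it against a larger Ray cluster.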

Ray Core

Scale generic Python code with simple, foundational primitives that enable a high degree of control for building distributed applications or custom platforms.

Learn more about Core >
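
For example, the @ray.remote decorator turns a plain Python class into a stateful actor that runs in its own worker process. This is a condensed sketch of the actor example from the Ray Core Quickstart later on this page.

import ray

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

# Create 4 actors; each holds its own state in a separate worker process.
counters = [Counter.remote() for _ in range(4)]
[c.increment.remote() for c in counters]
print(ray.get([c.read.remote() for c in counters]))  # [1, 1, 1, 1]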

Ray Clusters

Deploy a Ray cluster on AWS, GCP, Azure, or Kubernetes, and seamlessly scale workloads from a laptop to a large cluster for production.

Learn more about clusters >
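
Once a cluster is running, application code connects to it instead of starting Ray locally. The sketch below assumes an already-launched cluster and mirrors the node-hostname task from the cluster example later on this page.

import ray

# Connect to a running Ray cluster. "auto" finds a cluster started on this
# node; an explicit head-node address such as "localhost:6379" also works.
ray.init(address="auto")

@ray.remote
def node_hostname():
    import platform
    return platform.node()

# Tasks are scheduled across whichever nodes have joined the cluster.
print(set(ray.get([node_hostname.remote() for _ in range(100)])))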

Getting involved

Overview

Ray is an open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don't need to be a distributed systems expert. Ray minimizes the complexity of running your distributed individual and end-to-end machine learning workflows with these components:

- Scalable libraries for common machine learning tasks such as data preprocessing, distributed training, hyperparameter tuning, reinforcement learning, and model serving.
- Pythonic distributed computing primitives for parallelizing and scaling Python applications.
- Integrations and utilities for integrating and deploying a Ray cluster with existing tools and infrastructure such as Kubernetes, AWS, GCP, and Azure.

For data scientists and machine learning practitioners, Ray lets you scale jobs without needing infrastructure expertise:

- Easily parallelize and distribute workloads across multiple nodes and GPUs.
- Quickly configure and access cloud compute resources.
- Leverage the ML ecosystem with native and extensible integrations.

For distributed systems engineers, Ray automatically handles key processes:

- Orchestration: managing the various components of a distributed system.
- Scheduling: coordinating when and where tasks are executed.
- Fault tolerance: ensuring tasks complete regardless of inevitable points of failure.
- Auto-scaling: adjusting the number of resources allocated to dynamic demand.

What you can do with Ray

These are some of the common ML workloads for which individuals, organizations, and companies leverage Ray to build their AI applications:

- Batch inference on CPUs and GPUs
- Parallel training
- Model serving
- Distributed training of large models
- Parallel hyperparameter tuning experiments
- Reinforcement learning
- ML platform

Ray framework

Stack of Ray libraries - unified toolkit for ML workloads.

Ray's unified compute framework consists of three layers:

- Ray AI Runtime: an open-source, Python, domain-specific set of libraries that equips ML engineers, data scientists, and researchers with a scalable and unified toolkit for ML applications.
- Ray Core: an open-source, Python, general-purpose, distributed computing library that enables ML engineers and Python developers to scale Python applications and accelerate machine learning workloads.
- Ray cluster: a set of worker nodes connected to a common Ray head node. Ray clusters can be fixed-size, or they can autoscale up and down according to the resources requested by applications running on the cluster.

Scale machine learning workloads

Build ML applications with a toolkit of libraries for distributed data processing, model training, tuning, reinforcement learning, model serving, and more.

Ray AIR

Build distributed applications

Build and run distributed applications with a simple and flexible API. Parallelize single-machine code with little to zero code changes.

Ray Core

Deploy large-scale workloads

Deploy workloads on AWS, GCP, Azure, or on premise. Use Ray cluster managers to run Ray on existing Kubernetes, YARN, or Slurm clusters.

Ray Clusters

Each of Ray AIR's five native libraries distributes a specific ML task:

- Data: scalable, framework-agnostic data loading and transformation across training, tuning, and prediction.
- Train: distributed multi-node and multi-core model training with fault tolerance that integrates with popular training libraries.
- Tune: scalable hyperparameter tuning to optimize model performance.
- Serve: scalable and programmable serving to deploy models for online inference, with optional microbatching to improve performance.
- RLlib: scalable distributed reinforcement learning workloads that integrate with the other Ray AIR libraries.

For custom applications, the Ray Core library enables Python developers to easily build scalable, distributed systems that can run on a laptop, cluster, cloud, or Kubernetes. It's the foundation that Ray AIR and third-party integrations (the Ray ecosystem) are built on.

Ray runs on any machine, cluster, cloud provider, and Kubernetes, and features a growing ecosystem of community integrations.

Getting Started

Use Ray to scale applications on your laptop or the cloud. Choose the right guide for your task:

- Scale end-to-end ML applications: Ray AI Runtime Quickstart
- Scale single ML workloads: Ray Libraries Quickstart
- Scale general Python applications: Ray Core Quickstart
- Deploy to the cloud: Ray Clusters Quickstart
- Debug and monitor applications: Debugging and Monitoring Quickstart

Ray AI Runtime Quickstart

Explore Ray's full suite of libraries for end-to-end ML pipelines with the ray[air] package:

pip install -U "ray[air]"

Efficiently process your data into features. Load data into a Dataset:

import ray

# Load data.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

# Create a test dataset by dropping the target column.
test_dataset = valid_dataset.drop_columns(cols=["target"])

Preprocess your data with a Preprocessor:

# Create a preprocessor to scale some columns.
from ray.data.preprocessors import StandardScaler

preprocessor = StandardScaler(columns=["mean radius", "mean texture"])

Scale out model training. This example uses XGBoost to train a machine learning model, so install Ray's wrapper library xgboost_ray:

pip install xgboost_ray

Train a model with an XGBoostTrainer:

from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration.
        use_gpu=False,
        # Make sure to leave some CPUs free for Ray Data operations.
        _max_cpu_fraction_per_node=0.9,
    ),
    label_column="target",
    num_boost_round=20,
    params={
        # XGBoost specific params
        "objective": "binary:logistic",
        # "tree_method": "gpu_hist",  # uncomment this to use GPUs.
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    preprocessor=preprocessor,
)
best_result = trainer.fit()
print(best_result.metrics)

Tune the hyperparameters to find the best model with Ray Tune. Configure the parameters for tuning:

from ray import tune

param_space = {"params": {"max_depth": tune.randint(1, 9)}}
metric = "train-logloss"

Run hyperparameter tuning with Ray Tune to find the best model:

from ray.tune.tuner import Tuner, TuneConfig

tuner = Tuner(
    trainer,
    param_space=param_space,
    tune_config=TuneConfig(num_samples=5, metric=metric, mode="min"),
)
result_grid = tuner.fit()
best_result = result_grid.get_best_result()
print("Best result:", best_result)

Use the trained model for batch prediction with Dataset.map_batches(). To learn more, see End-to-end: Offline Batch Inference.

Learn more about Ray AIR

Ray Libraries Quickstart

Use individual libraries for single ML workloads, without having to install the full AI Runtime package.
Click on the dropdowns for your workload below. ray Data: Scalable Datasets for ML Scale offline inference and training ingest with Ray Data – a data processing library designed for ML. To learn more, see Offline batch inference and Data preprocessing and ingest for ML training. To run this example, install Ray Data: pip install -U "ray[data]" from typing import Dict import numpy as np import ray # Create datasets from on-disk files, Python objects, and cloud storage like S3. ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") # Apply functions to transform data. Ray Data executes transformations in parallel. def compute_area(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: length = batch["petal length (cm)"] width = batch["petal width (cm)"] batch["petal area (cm^2)"] = length * width return batch transformed_ds = ds.map_batches(compute_area) # Iterate over batches of data. for batch in transformed_ds.iter_batches(batch_size=4): print(batch) # Save dataset contents to on-disk files or cloud storage. transformed_ds.write_parquet("local:///tmp/iris/") ... Learn more about Ray Data ray Train: Distributed Model Training Ray Train abstracts away the complexity of setting up a distributed training system. PyTorch This example shows how you can use Ray Train with PyTorch. To run this example install Ray Train and PyTorch packages: pip install -U "ray[train]" torch torchvision Set up your dataset and model. import torch import torch.nn as nn from torch.utils.data import DataLoader from torchvision import datasets from torchvision.transforms import ToTensor def get_dataset(): return datasets.FashionMNIST( root="/tmp/data", train=True, download=True, transform=ToTensor(), ) class NeuralNetwork(nn.Module): def __init__(self): super().__init__() self.flatten = nn.Flatten() self.linear_relu_stack = nn.Sequential( nn.Linear(28 * 28, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10), ) def forward(self, inputs): inputs = self.flatten(inputs) logits = self.linear_relu_stack(inputs) return logits Now define your single-worker PyTorch training function. def train_func(): num_epochs = 3 batch_size = 64 dataset = get_dataset() dataloader = DataLoader(dataset, batch_size=batch_size) model = NeuralNetwork() criterion = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.01) for epoch in range(num_epochs): for inputs, labels in dataloader: optimizer.zero_grad() pred = model(inputs) loss = criterion(pred, labels) loss.backward() optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") This training function can be executed with: train_func() Convert this to a distributed multi-worker training function. Use the ray.train.torch.prepare_model and ray.train.torch.prepare_data_loader utility functions to set up your model and data for distributed training. This automatically wraps the model with DistributedDataParallel and places it on the right device, and adds DistributedSampler to the DataLoaders. 
from ray import train def train_func_distributed(): num_epochs = 3 batch_size = 64 dataset = get_dataset() dataloader = DataLoader(dataset, batch_size=batch_size) dataloader = train.torch.prepare_data_loader(dataloader) model = NeuralNetwork() model = train.torch.prepare_model(model) criterion = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.01) for epoch in range(num_epochs): for inputs, labels in dataloader: optimizer.zero_grad() pred = model(inputs) loss = criterion(pred, labels) loss.backward() optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") Instantiate a TorchTrainer with 4 workers, and use it to run the new training function. from ray.train.torch import TorchTrainer from ray.air.config import ScalingConfig # For GPU Training, set `use_gpu` to True. use_gpu = False trainer = TorchTrainer( train_func_distributed, scaling_config=ScalingConfig(num_workers=4, use_gpu=use_gpu) ) results = trainer.fit() TensorFlow This example shows how you can use Ray Train to set up Multi-worker training with Keras. To run this example install Ray Train and Tensorflow packages: pip install -U "ray[train]" tensorflow Set up your dataset and model. import numpy as np import tensorflow as tf def mnist_dataset(batch_size): (x_train, y_train), _ = tf.keras.datasets.mnist.load_data() # The `x` arrays are in uint8 and have values in the [0, 255] range. # You need to convert them to float32 with values in the [0, 1] range. x_train = x_train / np.float32(255) y_train = y_train.astype(np.int64) train_dataset = tf.data.Dataset.from_tensor_slices( (x_train, y_train)).shuffle(60000).repeat().batch(batch_size) return train_dataset def build_and_compile_cnn_model(): model = tf.keras.Sequential([ tf.keras.layers.InputLayer(input_shape=(28, 28)), tf.keras.layers.Reshape(target_shape=(28, 28, 1)), tf.keras.layers.Conv2D(32, 3, activation='relu'), tf.keras.layers.Flatten(), tf.keras.layers.Dense(128, activation='relu'), tf.keras.layers.Dense(10) ]) model.compile( loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), optimizer=tf.keras.optimizers.SGD(learning_rate=0.001), metrics=['accuracy']) return model Now define your single-worker TensorFlow training function. def train_func(): batch_size = 64 single_worker_dataset = mnist_dataset(batch_size) single_worker_model = build_and_compile_cnn_model() single_worker_model.fit(single_worker_dataset, epochs=3, steps_per_epoch=70) This training function can be executed with: train_func() Now convert this to a distributed multi-worker training function. Set the global batch size - each worker processes the same size batch as in the single-worker code. Choose your TensorFlow distributed training strategy. This examples uses the MultiWorkerMirroredStrategy. import json import os def train_func_distributed(): per_worker_batch_size = 64 # This environment variable will be set by Ray Train. tf_config = json.loads(os.environ['TF_CONFIG']) num_workers = len(tf_config['cluster']['worker']) strategy = tf.distribute.MultiWorkerMirroredStrategy() global_batch_size = per_worker_batch_size * num_workers multi_worker_dataset = mnist_dataset(global_batch_size) with strategy.scope(): # Model building/compiling need to be within `strategy.scope()`. multi_worker_model = build_and_compile_cnn_model() multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70) Instantiate a TensorflowTrainer with 4 workers, and use it to run the new training function. 
from ray.train.tensorflow import TensorflowTrainer from ray.air.config import ScalingConfig # For GPU Training, set `use_gpu` to True. use_gpu = False trainer = TensorflowTrainer(train_func_distributed, scaling_config=ScalingConfig(num_workers=4, use_gpu=use_gpu)) trainer.fit() Learn more about Ray Train ray Tune: Hyperparameter Tuning at Scale Tune is a library for hyperparameter tuning at any scale. With Tune, you can launch a multi-node distributed hyperparameter sweep in less than 10 lines of code. Tune supports any deep learning framework, including PyTorch, TensorFlow, and Keras. To run this example, install Ray Tune: pip install -U "ray[tune]" This example runs a small grid search with an iterative training function. from ray import tune def objective(config): # ① score = config["a"] ** 2 + config["b"] return {"score": score} search_space = { # ② "a": tune.grid_search([0.001, 0.01, 0.1, 1.0]), "b": tune.choice([1, 2, 3]), } tuner = tune.Tuner(objective, param_space=search_space) # ③ results = tuner.fit() print(results.get_best_result(metric="score", mode="min").config) If TensorBoard is installed, automatically visualize all trial results: tensorboard --logdir ~/ray_results Learn more about Ray Tune ray Serve: Scalable Model Serving Ray Serve is a scalable model-serving library built on Ray. To run this example, install Ray Serve and scikit-learn: pip install -U "ray[serve]" scikit-learn This example runs serves a scikit-learn gradient boosting classifier. import requests from starlette.requests import Request from typing import Dict from sklearn.datasets import load_iris from sklearn.ensemble import GradientBoostingClassifier from ray import serve # Train model. iris_dataset = load_iris() model = GradientBoostingClassifier() model.fit(iris_dataset["data"], iris_dataset["target"]) @serve.deployment(route_prefix="/iris") class BoostingModel: def __init__(self, model): self.model = model self.label_list = iris_dataset["target_names"].tolist() async def __call__(self, request: Request) -> Dict: payload = (await request.json())["vector"] print(f"Received http request with data {payload}") prediction = self.model.predict([payload])[0] human_name = self.label_list[prediction] return {"result": human_name} # Deploy model. serve.run(BoostingModel.bind(model)) # Query it! sample_request_input = {"vector": [1.2, 1.0, 1.1, 0.9]} response = requests.get( "http://localhost:8000/iris", json=sample_request_input) print(response.text) As a result you will see {"result": "versicolor"}. Learn more about Ray Serve ray RLlib: Industry-Grade Reinforcement Learning RLlib is an industry-grade library for reinforcement learning (RL) built on top of Ray. RLlib offers high scalability and unified APIs for a variety of industry- and research applications. To run this example, install rllib and either tensorflow or pytorch: pip install -U "ray[rllib]" tensorflow # or torch import gymnasium as gym from ray.rllib.algorithms.ppo import PPOConfig # Define your problem using python and Farama-Foundation's gymnasium API: class SimpleCorridor(gym.Env): """Corridor in which an agent must learn to move right to reach the exit. --------------------- | S | 1 | 2 | 3 | G | S=start; G=goal; corridor_length=5 --------------------- Possible actions to chose from are: 0=left; 1=right Observations are floats indicating the current field index, e.g. 0.0 for starting position, 1.0 for the field next to the starting position, etc.. Rewards are -0.1 for all steps, except when reaching the goal (+1.0). 
""" def __init__(self, config): self.end_pos = config["corridor_length"] self.cur_pos = 0 self.action_space = gym.spaces.Discrete(2) # left and right self.observation_space = gym.spaces.Box(0.0, self.end_pos, shape=(1,)) def reset(self, *, seed=None, options=None): """Resets the episode. Returns: Initial observation of the new episode and an info dict. """ self.cur_pos = 0 # Return initial observation. return [self.cur_pos], {} def step(self, action): """Takes a single step in the episode given `action`. Returns: New observation, reward, terminated-flag, truncated-flag, info-dict (empty). """ # Walk left. if action == 0 and self.cur_pos > 0: self.cur_pos -= 1 # Walk right. elif action == 1: self.cur_pos += 1 # Set `terminated` flag when end of corridor (goal) reached. terminated = self.cur_pos >= self.end_pos truncated = False # +1 when goal reached, otherwise -1. reward = 1.0 if terminated else -0.1 return [self.cur_pos], reward, terminated, truncated, {} # Create an RLlib Algorithm instance from a PPOConfig object. config = ( PPOConfig().environment( # Env class to use (here: our gym.Env sub-class from above). env=SimpleCorridor, # Config dict to be passed to our custom env's constructor. # Use corridor with 20 fields (including S and G). env_config={"corridor_length": 28}, ) # Parallelize environment rollouts. .rollouts(num_rollout_workers=3) ) # Construct the actual (PPO) algorithm object from the config. algo = config.build() # Train for n iterations and report results (mean episode rewards). # Since we have to move at least 19 times in the env to reach the goal and # each move gives us -0.1 reward (except the last move at the end: +1.0), # we can expect to reach an optimal episode reward of -0.1*18 + 1.0 = -0.8 for i in range(5): results = algo.train() print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}") # Perform inference (action computations) based on given env observations. # Note that we are using a slightly different env here (len 10 instead of 20), # however, this should still work as the agent has (hopefully) learned # to "just always walk right!" env = SimpleCorridor({"corridor_length": 10}) # Get the initial observation (should be: [0.0] for the starting position). obs, info = env.reset() terminated = truncated = False total_reward = 0.0 # Play one episode. while not terminated and not truncated: # Compute a single action, given the current observation # from the environment. action = algo.compute_single_action(obs) # Apply the computed action in the environment. obs, reward, terminated, truncated, info = env.step(action) # Sum up rewards for reporting purposes. total_reward += reward # Report results. print(f"Played 1 episode; total-reward={total_reward}") Learn more about Ray RLlib Ray Core Quickstart Turn functions and classes easily into Ray tasks and actors, for Python and Java, with simple primitives for building and running distributed applications. ray Core: Parallelizing Functions with Ray Tasks Python To run this example install Ray Core: pip install -U "ray" Import Ray and and initialize it with ray.init(). Then decorate the function with @ray.remote to declare that you want to run this function remotely. Lastly, call the function with .remote() instead of calling it normally. This remote call yields a future, a Ray object reference, that you can then fetch with ray.get. 
import ray ray.init() @ray.remote def f(x): return x * x futures = [f.remote(i) for i in range(4)] print(ray.get(futures)) # [0, 1, 4, 9] Java To run this example, add the ray-api and ray-runtime dependencies in your project. Use Ray.init to initialize Ray runtime. Then use Ray.task(...).remote() to convert any Java static method into a Ray task. The task runs asynchronously in a remote worker process. The remote method returns an ObjectRef, and you can fetch the actual result with get. import io.ray.api.ObjectRef; import io.ray.api.Ray; import java.util.ArrayList; import java.util.List; public class RayDemo { public static int square(int x) { return x * x; } public static void main(String[] args) { // Intialize Ray runtime. Ray.init(); List> objectRefList = new ArrayList<>(); // Invoke the `square` method 4 times remotely as Ray tasks. // The tasks will run in parallel in the background. for (int i = 0; i < 4; i++) { objectRefList.add(Ray.task(RayDemo::square, i).remote()); } // Get the actual results of the tasks. System.out.println(Ray.get(objectRefList)); // [0, 1, 4, 9] } } In the above code block we defined some Ray Tasks. While these are great for stateless operations, sometimes you must maintain the state of your application. You can do that with Ray Actors. Learn more about Ray Core ray Core: Parallelizing Classes with Ray Actors Ray provides actors to allow you to parallelize an instance of a class in Python or Java. When you instantiate a class that is a Ray actor, Ray will start a remote instance of that class in the cluster. This actor can then execute remote method calls and maintain its own internal state. Python To run this example install Ray Core: pip install -U "ray" import ray ray.init() # Only call this once. @ray.remote class Counter(object): def __init__(self): self.n = 0 def increment(self): self.n += 1 def read(self): return self.n counters = [Counter.remote() for i in range(4)] [c.increment.remote() for c in counters] futures = [c.read.remote() for c in counters] print(ray.get(futures)) # [1, 1, 1, 1] Java To run this example, add the ray-api and ray-runtime dependencies in your project. import io.ray.api.ActorHandle; import io.ray.api.ObjectRef; import io.ray.api.Ray; import java.util.ArrayList; import java.util.List; import java.util.stream.Collectors; public class RayDemo { public static class Counter { private int value = 0; public void increment() { this.value += 1; } public int read() { return this.value; } } public static void main(String[] args) { // Intialize Ray runtime. Ray.init(); List> counters = new ArrayList<>(); // Create 4 actors from the `Counter` class. // They will run in remote worker processes. for (int i = 0; i < 4; i++) { counters.add(Ray.actor(Counter::new).remote()); } // Invoke the `increment` method on each actor. // This will send an actor task to each remote actor. for (ActorHandle counter : counters) { counter.task(Counter::increment).remote(); } // Invoke the `read` method on each actor, and print the results. List> objectRefList = counters.stream() .map(counter -> counter.task(Counter::read).remote()) .collect(Collectors.toList()); System.out.println(Ray.get(objectRefList)); // [1, 1, 1, 1] } } Learn more about Ray Core Ray Cluster Quickstart Deploy your applications on Ray clusters, often with minimal code changes to your existing code. ray Clusters: Launching a Ray Cluster on AWS Ray programs can run on a single machine, or seamlessly scale to large clusters. 
Take this simple example that waits for individual nodes to join the cluster. example.py import sys import time from collections import Counter import ray @ray.remote def get_host_name(x): import platform import time time.sleep(0.01) return x + (platform.node(),) def wait_for_nodes(expected): # Wait for all nodes to join the cluster. while True: num_nodes = len(ray.nodes()) if num_nodes < expected: print( "{} nodes have joined so far, waiting for {} more.".format( num_nodes, expected - num_nodes ) ) sys.stdout.flush() time.sleep(1) else: break def main(): wait_for_nodes(4) # Check that objects can be transferred from each node to each other node. for i in range(10): print("Iteration {}".format(i)) results = [get_host_name.remote(get_host_name.remote(())) for _ in range(100)] print(Counter(ray.get(results))) sys.stdout.flush() print("Success!") sys.stdout.flush() time.sleep(20) if __name__ == "__main__": ray.init(address="localhost:6379") main() You can also download this example from our GitHub repository. Go ahead and store it locally in a file called example.py. To execute this script in the cloud, just download this configuration file, or copy it here: cluster.yaml # An unique identifier for the head node and workers of this cluster. cluster_name: default # The maximum number of workers nodes to launch in addition to the head # node. max_workers: 2 # The autoscaler will scale up the cluster faster with higher upscaling speed. # E.g., if the task requires adding more nodes then autoscaler will gradually # scale up the cluster in chunks of upscaling_speed*currently_running_nodes. # This number should be > 0. upscaling_speed: 1.0 # This executes all commands on all nodes in the docker container, # and opens all the necessary ports to support the Ray cluster. # Empty string means disabled. docker: image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup # image: rayproject/ray:latest-cpu # use this one if you don't need ML dependencies, it's faster to pull container_name: "ray_container" # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image # if no cached version is present. pull_before_run: True run_options: # Extra options to pass into "docker run" - --ulimit nofile=65536:65536 # Example of running a GPU head with CPU workers # head_image: "rayproject/ray-ml:latest-gpu" # Allow Ray to automatically detect GPUs # worker_image: "rayproject/ray-ml:latest-cpu" # worker_run_options: [] # If a node is idle for this many minutes, it will be removed. idle_timeout_minutes: 5 # Cloud-provider specific configuration. provider: type: aws region: us-west-2 # Availability zone(s), comma-separated, that nodes may be launched in. # Nodes will be launched in the first listed availability zone and will # be tried in the subsequent availability zones if launching fails. availability_zone: us-west-2a,us-west-2b # Whether to allow node reuse. If set to False, nodes will be terminated # instead of stopped. cache_stopped_nodes: True # If not present, the default is True. # How Ray will authenticate with newly launched nodes. auth: ssh_user: ubuntu # By default Ray creates a new private keypair, but you can also use your own. # If you do so, make sure to also set "KeyName" in the head and worker node # configurations below. # ssh_private_key: /path/to/your/key.pem # Tell the autoscaler the allowed node types and the resources they provide. 
# The key is the name of the node type, which is just for debugging purposes. # The node config specifies the launch config and physical instance type. available_node_types: ray.head.default: # The node type's CPU and GPU resources are auto-detected based on AWS instance type. # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler. # You can also set custom resources. # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set # resources: {"CPU": 1, "GPU": 1, "custom": 5} resources: {} # Provider-specific config for this node type, e.g. instance type. By default # Ray will auto-configure unspecified fields such as SubnetId and KeyName. # For more documentation on available fields, see: # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances node_config: InstanceType: m5.large # Default AMI for us-west-2. # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py # for default images for other zones. ImageId: ami-0387d929287ab193e # You can provision additional disk space with a conf as follows BlockDeviceMappings: - DeviceName: /dev/sda1 Ebs: VolumeSize: 140 VolumeType: gp3 # Additional options in the boto docs. ray.worker.default: # The minimum number of worker nodes of this type to launch. # This number should be >= 0. min_workers: 1 # The maximum number of worker nodes of this type to launch. # This takes precedence over min_workers. max_workers: 2 # The node type's CPU and GPU resources are auto-detected based on AWS instance type. # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler. # You can also set custom resources. # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set # resources: {"CPU": 1, "GPU": 1, "custom": 5} resources: {} # Provider-specific config for this node type, e.g. instance type. By default # Ray will auto-configure unspecified fields such as SubnetId and KeyName. # For more documentation on available fields, see: # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances node_config: InstanceType: m5.large # Default AMI for us-west-2. # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py # for default images for other zones. ImageId: ami-0387d929287ab193e # Run workers on spot by default. Comment this out to use on-demand. # NOTE: If relying on spot instances, it is best to specify multiple different instance # types to avoid interruption when one instance type is experiencing heightened demand. # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/ InstanceMarketOptions: MarketType: spot # Additional options can be found in the boto docs, e.g. # SpotOptions: # MaxPrice: MAX_HOURLY_PRICE # Additional options in the boto docs. # Specify the node type of the head node (as configured above). head_node_type: ray.head.default # Files or directories to copy to the head and worker nodes. The format is a # dictionary from REMOTE_PATH: LOCAL_PATH, e.g. file_mounts: { # "/path1/on/remote/machine": "/path1/on/local/machine", # "/path2/on/remote/machine": "/path2/on/local/machine", } # Files or directories to copy from the head node to the worker nodes. The format is a # list of paths. The same path on the head node will be copied to the worker node. 
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases # you should just use file_mounts. Only use this if you know what you're doing! cluster_synced_files: [] # Whether changes to directories in file_mounts or cluster_synced_files in the head node # should sync to the worker node continuously file_mounts_sync_continuously: False # Patterns for files to exclude when running rsync up or rsync down rsync_exclude: - "**/.git" - "**/.git/**" # Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for # in the source directory and recursively through all subdirectories. For example, if .gitignore is provided # as a value, the behavior will match git's behavior for finding and using .gitignore files. rsync_filter: - ".gitignore" # List of commands that will be run before `setup_commands`. If docker is # enabled, these commands will run outside the container and before docker # is setup. initialization_commands: [] # List of shell commands to run to set up nodes. setup_commands: [] # Note: if you're developing Ray, you probably want to create a Docker image that # has your Ray repo pre-cloned. Then, you can replace the pip installs # below with a git checkout (and possibly a recompile). # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line: # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl" # Custom commands that will be run on the head node after common setup. head_setup_commands: [] # Custom commands that will be run on worker nodes after common setup. worker_setup_commands: [] # Command to start ray on the head node. You don't need to change this. head_start_ray_commands: - ray stop - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0 # Command to start ray on worker nodes. You don't need to change this. worker_start_ray_commands: - ray stop - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 Assuming you have stored this configuration in a file called cluster.yaml, you can now launch an AWS cluster as follows: ray submit cluster.yaml example.py --start Learn more about launching Ray Clusters Debugging and Monitoring Quickstart Use built-in observability tools to monitor and debug Ray applications and clusters. ray Ray Dashboard: Web GUI to monitor and debug Ray Ray dashboard provides a visual interface that displays real-time system metrics, node-level resource monitoring, job profiling, and task visualizations. The dashboard is designed to help users understand the performance of their Ray applications and identify potential issues. To get started with the dashboard, install the default installation as follows: pip install -U "ray[default]" Access the dashboard through the default URL, http://localhost:8265. Learn more about Ray Dashboard ray Ray State APIs: CLI to access cluster states Ray state APIs allow users to conveniently access the current state (snapshot) of Ray through CLI or Python SDK. To get started with the state API, install the default installation as follows: pip install -U "ray[default]" Run the following code. 
import ray import time ray.init(num_cpus=4) @ray.remote def task_running_300_seconds(): print("Start!") time.sleep(300) @ray.remote class Actor: def __init__(self): print("Actor created") # Create 2 tasks tasks = [task_running_300_seconds.remote() for _ in range(2)] # Create 2 actors actors = [Actor.remote() for _ in range(2)] ray.get(tasks) See the summarized statistics of Ray tasks using ray summary tasks. ray summary tasks ======== Tasks Summary: 2022-07-22 08:54:38.332537 ======== Stats: ------------------------------------ total_actor_scheduled: 2 total_actor_tasks: 0 total_tasks: 2 Table (group by func_name): ------------------------------------ FUNC_OR_CLASS_NAME STATE_COUNTS TYPE 0 task_running_300_seconds RUNNING: 2 NORMAL_TASK 1 Actor.__init__ FINISHED: 2 ACTOR_CREATION_TASK Learn more about Ray State APIs Learn More Here are some talks, papers, and press coverage involving Ray and its libraries. Please raise an issue if any of the below links are broken, or if you’d like to add your own talk! Blog and Press Modern Parallel and Distributed Python: A Quick Tutorial on Ray Why Every Python Developer Will Love Ray Ray: A Distributed System for AI (BAIR) 10x Faster Parallel Python Without Python Multiprocessing Implementing A Parameter Server in 15 Lines of Python with Ray Ray Distributed AI Framework Curriculum RayOnSpark: Running Emerging AI Applications on Big Data Clusters with Ray and Analytics Zoo First user tips for Ray Tune: a Python library for fast hyperparameter tuning at any scale Cutting edge hyperparameter tuning with Ray Tune New Library Targets High Speed Reinforcement Learning Scaling Multi Agent Reinforcement Learning Functional RL with Keras and Tensorflow Eager How to Speed up Pandas by 4x with one line of code Quick Tip – Speed up Pandas using Modin Ray Blog Talks (Videos) Unifying Large Scale Data Preprocessing and Machine Learning Pipelines with Ray Data | PyData 2021 (slides) Programming at any Scale with Ray | SF Python Meetup Sept 2019 Ray for Reinforcement Learning | Data Council 2019 Scaling Interactive Pandas Workflows with Modin Ray: A Distributed Execution Framework for AI | SciPy 2018 Ray: A Cluster Computing Engine for Reinforcement Learning Applications | Spark Summit RLlib: Ray Reinforcement Learning Library | RISECamp 2018 Enabling Composition in Distributed Reinforcement Learning | Spark Summit 2018 Tune: Distributed Hyperparameter Search | RISECamp 2018 Slides Talk given at UC Berkeley DS100 Talk given in October 2019 Talk given at RISECamp 2019 Papers Ray 2.0 Architecture whitepaper Ray 1.0 Architecture whitepaper (old) Ray AIR Technical whitepaper Exoshuffle: large-scale data shuffle in Ray RLlib paper RLlib flow paper Tune paper Ray paper (old) Ray HotOS paper (old) Installing Ray Ray currently officially supports x86_64, aarch64 (ARM) for Linux, and Apple silicon (M1) hardware. Ray on Windows is currently in beta. Official Releases From Wheels You can install the latest official version of Ray from PyPI on Linux, Windows, and macOS by choosing the option that best matches your use case. Recommended For machine learning applications pip install -U "ray[air]" # For reinforcement learning support, install RLlib instead. # pip install -U "ray[rllib]" For general Python applications pip install -U "ray[default]" # If you don't want Ray Dashboard or Cluster Launcher, install Ray with minimal dependencies instead. 
# pip install -U "ray" Advanced Command Installed components pip install -U "ray" Core pip install -U "ray[default]" Core, Dashboard, Cluster Launcher pip install -U "ray[data]" Core, Data pip install -U "ray[train]" Core, Train pip install -U "ray[tune]" Core, Tune pip install -U "ray[serve]" Core, Dashboard, Cluster Launcher, Serve pip install -U "ray[rllib]" Core, Tune, RLlib pip install -U "ray[air]" Core, Dashboard, Cluster Launcher, Data, Train, Tune, Serve pip install -U "ray[all]" Core, Dashboard, Cluster Launcher, Data, Train, Tune, Serve, RLlib You can combine installation extras. For example, to install Ray with Dashboard, Cluster Launcher, and Train support, you can run: pip install -U "ray[default,train]" Daily Releases (Nightlies) You can install the nightly Ray wheels via the following links. These daily releases are tested via automated tests but do not go through the full release process. To install these wheels, use the following pip command and wheels: # Clean removal of previous install pip uninstall -y ray # Install Ray with support for the dashboard + cluster launcher pip install -U "ray[default] @ LINK_TO_WHEEL.whl" # Install Ray with minimal dependencies # pip install -U LINK_TO_WHEEL.whl Linux Linux (x86_64) Linux (arm64/aarch64) Linux Python 3.10 (x86_64) Linux Python 3.10 (aarch64) Linux Python 3.9 (x86_64) Linux Python 3.9 (aarch64) Linux Python 3.8 (x86_64) Linux Python 3.8 (aarch64) Linux Python 3.7 (x86_64) Linux Python 3.7 (aarch64) Linux Python 3.11 (x86_64) (EXPERIMENTAL) Linux Python 3.11 (aarch64) (EXPERIMENTAL) MacOS MacOS (x86_64) MacOS (arm64) MacOS Python 3.10 (x86_64) MacOS Python 3.10 (arm64) MacOS Python 3.9 (x86_64) MacOS Python 3.9 (arm64) MacOS Python 3.8 (x86_64) MacOS Python 3.8 (arm64) MacOS Python 3.7 (x86_64) MacOS Python 3.11 (arm64) (EXPERIMENTAL) MacOS Python 3.11 (x86_64) (EXPERIMENTAL) Windows (beta) Windows (beta) Windows Python 3.10 Windows Python 3.9 Windows Python 3.8 Windows Python 3.7 Windows Python 3.11 (EXPERIMENTAL) On Windows, support for multi-node Ray clusters is currently experimental and untested. If you run into issues please file a report at https://github.com/ray-project/ray/issues. Usage stats collection is enabled by default (can be disabled) for nightly wheels including both local clusters started via ray.init() and remote clusters via cli. Python 3.11 support is experimental. Installing from a specific commit You can install the Ray wheels of any particular commit on master with the following template. You need to specify the commit hash, Ray version, Operating System, and Python version: pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/{COMMIT_HASH}/ray-{RAY_VERSION}-{PYTHON_VERSION}-{PYTHON_VERSION}-{OS_VERSION}.whl For example, here are the Ray 3.0.0.dev0 wheels for Python 3.9, MacOS for commit 4f2ec46c3adb6ba9f412f09a9732f436c4a5d0c9: pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/4f2ec46c3adb6ba9f412f09a9732f436c4a5d0c9/ray-3.0.0.dev0-cp39-cp39-macosx_10_15_x86_64.whl There are minor variations to the format of the wheel filename; it’s best to match against the format in the URLs listed in the Nightlies section. Here’s a summary of the variations: For MacOS, commits predating August 7, 2021 will have macosx_10_13 in the filename instead of macosx_10_15. Install Ray Java with Maven Before installing Ray Java with Maven, you should install Ray Python with pip install -U ray . Note that the versions of Ray Java and Ray Python must match. 
Note that nightly Ray python wheels are also required if you want to install Ray Java snapshot version. The latest Ray Java release can be found in central repository. To use the latest Ray Java release in your application, add the following entries in your pom.xml: io.ray ray-api ${ray.version} io.ray ray-runtime ${ray.version} The latest Ray Java snapshot can be found in sonatype repository. To use the latest Ray Java snapshot in your application, add the following entries in your pom.xml: sonatype https://oss.sonatype.org/content/repositories/snapshots/ false true io.ray ray-api ${ray.version} io.ray ray-runtime ${ray.version} When you run pip install to install Ray, Java jars are installed as well. The above dependencies are only used to build your Java code and to run your code in local mode. If you want to run your Java code in a multi-node Ray cluster, it’s better to exclude Ray jars when packaging your code to avoid jar conficts if the versions (installed Ray with pip install and maven dependencies) don’t match. Install Ray C++ You can install and use Ray C++ API as follows. pip install -U ray[cpp] # Create a Ray C++ project template to start with. ray cpp --generate-bazel-project-template-to ray-template If you build Ray from source, remove the build option build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" from the file cpp/example/.bazelrc before running your application. The related issue is this. M1 Mac (Apple Silicon) Support Ray supports machines running Apple Silicon (such as M1 macs). Multi-node clusters are untested. To get started with local Ray development: Install miniforge. wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh bash Miniforge3-MacOSX-arm64.sh rm Miniforge3-MacOSX-arm64.sh # Cleanup. Ensure you’re using the miniforge environment (you should see (base) in your terminal). source ~/.bash_profile conda activate Install Ray as you normally would. pip install ray Windows Support Windows support is currently in beta, and multi-node Ray clusters are untested. Please submit any issues you encounter on GitHub. Installing Ray on Arch Linux Note: Installing Ray on Arch Linux is not tested by the Project Ray developers. Ray is available on Arch Linux via the Arch User Repository (AUR) as python-ray. You can manually install the package by following the instructions on the Arch Wiki or use an AUR helper like yay (recommended for ease of install) as follows: yay -S python-ray To discuss any issues related to this package refer to the comments section on the AUR page of python-ray here. Installing From conda-forge Ray can also be installed as a conda package on Linux and Windows. # also works with mamba conda create -c conda-forge python=3.9 -n ray conda activate ray # Install Ray with support for the dashboard + cluster launcher conda install -c conda-forge "ray-default" # Install Ray with minimal dependencies # conda install -c conda-forge ray To install Ray libraries, use pip as above or conda/mamba. conda install -c conda-forge "ray-air" # installs Ray + dependencies for Ray AI Runtime conda install -c conda-forge "ray-tune" # installs Ray + dependencies for Ray Tune conda install -c conda-forge "ray-rllib" # installs Ray + dependencies for Ray RLlib conda install -c conda-forge "ray-serve" # installs Ray + dependencies for Ray Serve For a complete list of available ray libraries on Conda-forge, have a look at https://anaconda.org/conda-forge/ray-default Ray conda packages are maintained by the community, not the Ray team. 
While using a conda environment, it is recommended to install Ray from PyPi using pip install ray in the newly created environment. Building Ray from Source Installing from pip should be sufficient for most Ray users. However, should you need to build from source, follow these instructions for building Ray. Docker Source Images Most users should pull a Docker image from the Ray Docker Hub. The rayproject/ray images include Ray and all required dependencies. It comes with anaconda and various versions of Python. The rayproject/ray-ml images include the above as well as many additional ML libraries. The rayproject/base-deps and rayproject/ray-deps images are for the Linux and Python dependencies respectively. Images are tagged with the format {Ray version}[-{Python version}][-{Platform}]. Ray version tag can be one of the following: Ray version tag Description latest The most recent Ray release. x.y.z A specific Ray release, e.g. 1.12.1 nightly The most recent Ray development build (a recent commit from Github master) 6 character Git SHA prefix A specific development build (uses a SHA from the Github master, e.g. 8960af). The optional Python version tag specifies the Python version in the image. All Python versions supported by Ray are available, e.g. py37, py38, py39 and py310. If unspecified, the tag points to an image using Python 3.7. The optional Platform tag specifies the platform where the image is intended for: Platform tag Description -cpu These are based off of an Ubuntu image. -cuXX These are based off of an NVIDIA CUDA image with the specified CUDA version. They require the Nvidia Docker Runtime. -gpu Aliases to a specific -cuXX tagged image. Aliases to -cpu tagged images. For ray-ml image, aliases to -gpu tagged image. Example: for the nightly image based on Python 3.8 and without GPU support, the tag is nightly-py38-cpu. If you want to tweak some aspect of these images and build them locally, refer to the following script: cd ray ./build-docker.sh Beyond creating the above Docker images, this script can also produce the following two images. The rayproject/development image has the ray source code included and is setup for development. The rayproject/examples image adds additional libraries for running examples. Review images by listing them: docker images Output should look something like the following: REPOSITORY TAG IMAGE ID CREATED SIZE rayproject/ray latest 7243a11ac068 2 days ago 1.11 GB rayproject/ray-deps latest b6b39d979d73 8 days ago 996 MB rayproject/base-deps latest 5606591eeab9 8 days ago 512 MB ubuntu focal 1e4467b07108 3 weeks ago 73.9 MB Launch Ray in Docker Start out by launching the deployment container. docker run --shm-size= -t -i rayproject/ray Replace with a limit appropriate for your system, for example 512M or 2G. A good estimate for this is to use roughly 30% of your available memory (this is what Ray uses internally for its Object Store). The -t and -i options here are required to support interactive use of the container. If you use a GPU version Docker image, remember to add --gpus all option. Replace with your target ray version in the following command: docker run --shm-size= -t -i --gpus all rayproject/ray:-gpu Note: Ray requires a large amount of shared memory because each object store keeps all of its objects in shared memory, so the amount of shared memory will limit the size of the object store. 
You should now see a prompt that looks something like:

root@ebc78f68d100:/ray#

Test if the installation succeeded

To test if the installation was successful, try running some tests. This assumes that you’ve cloned the git repository.

python -m pytest -v python/ray/tests/test_mini.py

Installed Python dependencies

Our Docker images are shipped with pre-installed Python dependencies required for Ray and its libraries. We publish the dependencies that are installed in our ray and ray-ml Docker images for Python 3.9.

ray (Python 3.9): Ray version nightly (0d880e3), plus the full pinned pip dependency list for the base image (the list runs alphabetically from adal through zipp and includes aiohttp, the azure-* SDKs, boto3, grpcio, numpy, pandas, pyarrow, and related packages).

ray-ml (Python 3.9): Ray version nightly (0d880e3), plus the full pinned pip dependency list for the ML image (the list runs alphabetically from absl-py through zoopt and additionally covers the ML stack, including accelerate, datasets, deepspeed, gymnasium, jax, lightgbm, mlflow, pytorch-lightning, tensorflow, torch, transformers, and xgboost).

Ray Use Cases

This page indexes common Ray use cases for scaling ML. It contains highlighted references to blogs, examples, and tutorials also located elsewhere in the Ray documentation.

LLMs and Gen AI

Large language models (LLMs) and generative AI are rapidly changing industries, and demand compute at an astonishing pace.
Ray provides a distributed compute framework for scaling these models, allowing developers to train and deploy models faster and more efficiently. With specialized libraries for data streaming, training, fine-tuning, hyperparameter tuning, and serving, Ray simplifies the process of developing and deploying large-scale AI models. Learn more about how Ray scales LLMs and generative AI with the following resources.

[Blog] How Ray solves common production challenges for generative AI infrastructure
[Blog] Training 175B Parameter Language Models at 1000 GPU scale with Alpa and Ray
[Blog] Faster stable diffusion fine-tuning with Ray AIR
[Blog] How to fine tune and serve LLMs simply, quickly and cost effectively using Ray + DeepSpeed + HuggingFace
[Article] How OpenAI Uses Ray to Train Tools like ChatGPT
[Example] GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed
[Example] Fine-tuning DreamBooth with Ray AIR
[Example] Stable Diffusion Batch Prediction with Ray AIR
[Example] GPT-J-6B Serving with Ray AIR
[Intermediate Example] Aviary toolkit serving live traffic for LLMs

Batch Inference

Batch inference is the process of generating model predictions on a large “batch” of input data. Ray for batch inference works with any cloud provider and ML framework, and is fast and cheap for modern deep learning applications. It scales from single machines to large clusters with minimal code changes. As a Python-first framework, you can easily express and interactively develop your inference workloads in Ray. To learn more about running batch inference with Ray, see the batch inference guide.

[Guide] Batch Prediction using Ray Data
[Example] Batch Inference Examples
[Blog] Offline Batch Inference: Comparing Ray, Apache Spark, and SageMaker
[Blog] Streaming distributed execution across CPUs and GPUs
[Blog] Using Ray Data to parallelize LangChain inference

Many Model Training

Many model training is common in ML use cases such as time series forecasting, which require fitting models on multiple data batches corresponding to locations, products, etc. The focus is on training many models on subsets of a dataset. This is in contrast to training a single model on the entire dataset. When any given model you want to train can fit on a single GPU, Ray can assign each training run to a separate Ray Task. In this way, all available workers are utilized to run independent remote training rather than one worker running jobs sequentially.

Data parallelism pattern for distributed training on large datasets.

How do I do many model training on Ray?

To train multiple independent models, use the Ray Tune (Tutorial) library. This is the recommended library for most cases. You can use Tune with your current data preprocessing pipeline if your data source fits into the memory of a single machine (node). If you need to scale your data, or you want to plan for future scaling, use the Ray Data library. To use Ray Data, your data must be in a supported format.

Alternative solutions exist for less common cases:

If your data is not in a supported format, use Ray Core (Tutorial) for custom applications. This is an advanced option and requires an understanding of design patterns and anti-patterns.
If you have a large preprocessing pipeline, you can use the Ray Data library to train multiple models (Tutorial).

A minimal sketch of this pattern appears below.
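As a rough sketch of the Tune-based pattern described above, the snippet below launches one independent training run per data partition. The partition names, the train_partition function, and the score metric are invented placeholders for illustration, not part of the Ray documentation.

from ray import tune
from ray.air import session

# Hypothetical data partitions (for example, one per store or region).
PARTITIONS = ["store_1", "store_2", "store_3"]

def train_partition(config):
    # Placeholder training logic: fit one model on the partition named in
    # config["partition"] and report a quality metric back to Tune.
    partition = config["partition"]
    score = float(len(partition))  # stand-in for a real validation metric
    session.report({"score": score})

# One trial per partition; Ray runs the trials in parallel across the cluster.
tuner = tune.Tuner(
    train_partition,
    param_space={"partition": tune.grid_search(PARTITIONS)},
)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").config)

Each trial here is an ordinary Ray task under the hood, so the same code scales from a laptop to a cluster without modification.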
Learn more about many model training with the following resources.

[Blog] Training One Million ML Models in Record Time with Ray
[Blog] Many Models Batch Training at Scale with Ray Core
[Example] Batch Training with Ray Core
[Example] Batch Training with Ray Data
[Guide] Tune Basic Parallel Experiments
[Example] Batch Training and Tuning using Ray Tune
[Talk] Scaling Instacart fulfillment ML on Ray

Model Serving

Ray Serve is well suited for model composition, enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code. It supports complex model deployment patterns requiring the orchestration of multiple Ray actors, where different actors provide inference for different models. Serve handles both batch and online inference and can scale to thousands of models in production.

Deployment patterns with Ray Serve. (Click image to enlarge.)

Learn more about model serving with the following resources.

[Talk] Productionizing ML at Scale with Ray Serve
[Blog] Simplify your MLOps with Ray & Ray Serve
[Guide] Getting Started with Ray Serve
[Guide] Model Composition in Serve
[Gallery] Serve Examples Gallery
[Gallery] More Serve Use Cases on the Blog

Hyperparameter Tuning

The Ray Tune library enables any parallel Ray workload to be run under a hyperparameter tuning algorithm. Running multiple hyperparameter tuning experiments is a pattern apt for distributed computing because each experiment is independent of the others. Ray Tune handles the hard bit of distributing hyperparameter optimization and provides key features such as checkpointing the best result, optimizing scheduling, and specifying search patterns.

Distributed tuning with distributed training per trial.

Learn more about the Tune library with the following talks and user guides.

[Guide] Getting Started with Ray Tune
[Blog] How to distribute hyperparameter tuning with Ray Tune
[Talk] Simple Distributed Hyperparameter Optimization
[Blog] Hyperparameter Search with 🤗 Transformers
[Gallery] Ray Tune Examples Gallery
More Tune use cases on the Blog

Distributed Training

The Ray Train library integrates many distributed training frameworks under a simple Trainer API, providing distributed orchestration and management capabilities out of the box. In contrast to training many models, model parallelism partitions a large model across many machines for training. Ray Train has built-in abstractions for distributing shards of models and running training in parallel.

Model parallelism pattern for distributed large model training.

Learn more about the Train library with the following talks and user guides.

[Talk] Ray Train, PyTorch, TorchX, and distributed deep learning
[Blog] Elastic Distributed Training with XGBoost on Ray
[Guide] Getting Started with Ray Train
[Example] Fine-tune a 🤗 Transformers model
[Gallery] Ray Train Examples Gallery
[Gallery] More Train Use Cases on the Blog

Reinforcement Learning

RLlib is an open-source library for reinforcement learning (RL), offering support for production-level, highly distributed RL workloads while maintaining unified and simple APIs for a large variety of industry applications. RLlib is used by industry leaders in many different verticals, such as climate control, industrial control, manufacturing and logistics, finance, gaming, automobile, robotics, boat design, and many others.

Decentralized distributed proximal policy optimization (DD-PPO) architecture.

An illustrative RLlib sketch follows.
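As a rough sketch of what an RLlib training loop can look like (the environment name, worker count, and iteration count here are illustrative choices, not taken from this page), you might configure and train a PPO algorithm like this:

from ray.rllib.algorithms.ppo import PPOConfig

# Configure PPO on a toy Gymnasium environment with two rollout workers.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=2)
    .framework("torch")
)
algo = config.build()

# Each call to train() runs one training iteration across the rollout workers.
for _ in range(3):
    result = algo.train()
    print(result["episode_reward_mean"])

algo.stop()

The same config-driven pattern applies to the other RLlib algorithms; scaling up is mostly a matter of raising the worker and resource counts.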
Learn more about reinforcement learning with the following resources.

[Course] Applied Reinforcement Learning with RLlib
[Blog] Intro to RLlib: Example Environments
[Guide] Getting Started with RLlib
[Talk] Deep reinforcement learning at Riot Games
[Gallery] RLlib Examples Gallery
[Gallery] More RL Use Cases on the Blog

ML Platform

Merlin is Shopify’s ML platform built on Ray. It enables fast iteration and scaling of distributed applications such as product categorization and recommendations.

Shopify’s Merlin architecture built on Ray.

Spotify uses Ray for advanced applications that include personalizing content recommendations for home podcasts, and personalizing Spotify Radio track sequencing.

How the Ray ecosystem empowers ML scientists and engineers at Spotify.

The following highlights feature companies leveraging Ray’s unified API to build simpler, more flexible ML platforms.

[Blog] The Magic of Merlin - Shopify’s New ML Platform
[Slides] Large Scale Deep Learning Training and Tuning with Ray
[Blog] Griffin: How Instacart’s ML Platform Tripled in a year
[Talk] Predibase - A low-code deep learning platform built for scale
[Blog] Building a ML Platform with Kubeflow and Ray on GKE
[Talk] Ray Summit Panel - ML Platform on Ray

End-to-End ML Workflows

The following highlights examples utilizing Ray AIR to implement end-to-end ML workflows.

[Example] Text classification with Ray
[Example] Image classification with Ray
[Example] Object detection with Ray
[Example] Credit scoring with Ray and Feast
[Example] Machine learning on tabular data
[Example] AutoML for Time Series with Ray
[Gallery] Full Ray AIR Examples Gallery

Large Scale Workload Orchestration

The following highlights feature projects leveraging Ray Core’s distributed APIs to simplify the orchestration of large scale workloads.

[Blog] Highly Available and Scalable Online Applications on Ray at Ant Group
[Blog] Ray Forward 2022 Conference: Hyper-scale Ray Application Use Cases
[Blog] A new world record on the CloudSort benchmark using Ray
[Example] Speed up your web crawler by parallelizing it with Ray

Ray Examples
Blog How Ray solves common production challenges for generative AI infrastructure Blog Training 175B Parameter Language Models at 1000 GPU scale with Alpa and Ray Blog Faster stable diffusion fine-tuning with Ray AIR Blog How to fine tune and serve LLMs simply, quickly and cost effectively using Ray + DeepSpeed + HuggingFace Blog How OpenAI Uses Ray to Train Tools like ChatGPT Code example GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed Tutorial Get started with Ray AIR from an existing PyTorch codebase Tutorial Get started with Ray AIR from an existing Tensorflow/Keras Code example Distributed training with LightGBM Tutorial Distributed training with XGBoost Tutorial Distributed tuning with XGBoost Code example Integrating with Scikit-Learn (non-distributed) Code example Build an AutoML system for time-series forecasting with Ray AIR Code example Perform batch tuning on NYC Taxi Dataset with Ray AIR Code example Perform batch forecasting on NYC Taxi Dataset with Prophet, ARIMA and Ray AIR Code example How to use Ray AIR to run Hugging Face Transformers with DeepSpeed for fine-tuning a large model Code example How to use Ray AIR to do batch prediction with the Hugging Face Transformers GPT-J model Code example How to use Ray AIR to do online serving with the Hugging Face Transformers GPT-J model Code example How to fine-tune a DreamBooth text-to-image model with your own images. Code example How to fine-tune a dolly-v2-7b model with Ray AIR LightningTrainer and FSDP Code example Torch Image Classification Example with Ray AIR Code example Torch Object Detection Example with Ray AIR Code example Image Classification Batch Inference with PyTorch ResNet152 Code example How to use Ray AIR to do batch prediction with the Stable Diffusion text-to-image model Code example Object Detection Batch Inference with PyTorch FasterRCNN_ResNet50 Code example Image Classification Batch Inference with PyTorch ResNet18 Code example Image Classification Batch Inference with Huggingface Vision Transformer Code example How to log results and upload models to Comet ML Code example How to log results and upload models to Weights and Biases Code example Serving RL models with Ray AIR Code example RL Online Learning with Ray AIR Code example RL Offline Learning with Ray AIR Code example Incrementally train and deploy a PyTorch CV model Code example Integrate with Feast feature store in both train and inference Code example Serving ML models with Ray Serve (Tensorflow, PyTorch, Scikit-Learn, others) Code example Batching tutorial for Ray Serve Code example Serving RLlib Models with Ray Serve Code example Scaling your Gradio app with Ray Serve Code example Visualizing a Deployment Graph with Gradio Code example Java tutorial for Ray Serve Code example Serving a Stable Diffusion Model Code example Serving a Distilbert Model Code example Serving an Object Detection Model Code example Fine-tuning DreamBooth with Ray AIR Code example Stable Diffusion Batch Prediction with Ray AIR Code example GPT-J-6B Serving with Ray AIR Blog Offline Batch Inference: Comparing Ray, Apache Spark, and SageMaker Blog Streaming distributed execution across CPUs and GPUs Blog Using Ray Data to parallelize LangChain inference Blog Batch Prediction using Ray Data Code example Batch Inference on NYC taxi data using Ray Data Code example Batch OCR processing using Ray Data Blog Training One Million ML Models in Record Time with Ray Blog Many Models Batch Training at Scale with Ray Core Code example Batch Training with Ray Core Code example Batch 
Training with Ray Data Tutorial Tune Basic Parallel Experiments Code example Batch Training and Tuning using Ray Tune Video Scaling Instacart fulfillment ML on Ray Code example Using Aim with Ray Tune For Experiment Management Code example Using Comet with Ray Tune For Experiment Management Code example Tracking Your Experiment Process Weights & Biases Code example Using MLflow Tracking & AutoLogging with Tune Code example How To Use Tune With Ax Code example How To Use Tune With Dragonfly Code example How To Use Tune With Scikit-Optimize Code example How To Use Tune With HyperOpt Code example How To Use Tune With BayesOpt Code example How To Use Tune With BlendSearch and CFO Code example How To Use Tune With TuneBOHB Code example How To Use Tune With Nevergrad Code example How To Use Tune With Optuna Code example How To Use Tune With ZOOpt Code example How To Use Tune With SigOpt Code example How To Use Tune With HEBO Video Productionizing ML at Scale with Ray Serve Blog Simplify your MLOps with Ray & Ray Serve Tutorial Getting Started with Ray Serve Tutorial Model Composition in Serve Tutorial Getting Started with Ray Tune Blog How to distribute hyperparameter tuning with Ray Tune Video Simple Distributed Hyperparameter Optimization Blog Hyperparameter Search with 🤗 Transformers Code example How To Use Tune With Keras & TF Models Code example How To Use Tune With PyTorch Models Code example How To Tune PyTorch Lightning Models Code example How To Tune MXNet Models Code example Model Selection & Serving With Ray Serve Code example Tuning RL Experiments With Ray Tune & Ray Serve Code example A Guide To Tuning XGBoost Parameters With Tune Code example A Guide To Tuning LightGBM Parameters With Tune Code example A Guide To Tuning Horovod Parameters With Tune Code example A Guide To Tuning Huggingface Transformers With Tune Code example More Tune use cases on the Blog Video Ray Train, PyTorch, TorchX, and distributed deep learning Code example Elastic Distributed Training with XGBoost on Ray Tutorial Getting Started with Ray Train Code example Fine-tune a 🤗 Transformers model Code example PyTorch Fashion MNIST Training Example Code example Transformers with PyTorch Training Example Code example TensorFlow MNIST Training Example Code example End-to-end Horovod Training Example Code example End-to-end PyTorch Lightning Training Example Code example Use LightningTrainer with Ray Data and Batch Predictor Code example Fine-tune LLM with AIR LightningTrainer and FSDP Code example End-to-end Example for Tuning a TensorFlow Model Code example End-to-end Example for Tuning a PyTorch Model with PBT Code example Logging Training Runs with MLflow Code example Using Experiment Tracking Tools in LightningTrainer Course Applied Reinforcement Learning with RLlib Blog Intro to RLlib: Example Environments Code example A collection of tuned hyperparameters by RLlib algorithm Code example A collection of reasonably optimized Atari and MuJoCo results for RLlib Code example RLlib’s trajectory view API and how it enables implementations of GTrXL (attention net) architectures Code example A how-to on connecting RLlib with the Unity3D game engine for running visual- and physics-based RL experiments Code example How we ported 12 of RLlib’s algorithms from TensorFlow to PyTorch and what we learnt on the way Code example This blog post is a brief tutorial on multi-agent RL and its design in RLlib Code example Exploration of a functional paradigm for implementing reinforcement learning (RL) algorithms Code example 
Example of defining and registering a gym env and model for use with RLlib Code example Rendering and recording of an environment Code example Coin game example with RLlib Code example RecSym environment example (for recommender systems) using the SlateQ algorithm Code example VizDoom example script using RLlib’s auto-attention wrapper Code example Attention Net (GTrXL) learning the “repeat-after-me” environment Code example Working with custom Keras models in RLlib Tutorial Getting Started with RLlib Video Deep reinforcement learning at Riot Games Blog The Magic of Merlin - Shopify’s New ML Platform Tutorial Large Scale Deep Learning Training and Tuning with Ray Blog Griffin: How Instacart’s ML Platform Tripled in a year Video Predibase - A low-code deep learning platform built for scale Blog Building a ML Platform with Kubeflow and Ray on GKE Video Ray Summit Panel - ML Platform on Ray Code example AutoML for Time Series with Ray Blog Highly Available and Scalable Online Applications on Ray at Ant Group Blog Ray Forward 2022 Conference: Hyper-scale Ray Application Use Cases Blog A new world record on the CloudSort benchmark using Ray Code example Speed up your web crawler by parallelizing it with Ray The Ray Ecosystem This page lists libraries that have integrations with Ray for distributed execution in alphabetical order. It’s easy to add your own integration to this list. Simply open a pull request with a few lines of text, see the dropdown below for more information. Adding Your Integration To add an integration, simply add an entry to the projects list of our Gallery YAML on GitHub. - name: the integration link button text section_title: The section title for this integration description: A quick description of your library and its integration with Ray website: The URL of your website repo: The URL of your project on GitHub image: The URL of a logo of your project That’s all! Classy Vision is a new end-to-end, PyTorch-based framework for large-scale training of state-of-the-art image and video classification models. The library features a modular, flexible design that allows anyone to train machine learning models on top of PyTorch using very simple abstractions. Classy Vision Integration Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love. Dask uses existing Python APIs and data structures to make it easy to switch between Numpy, Pandas, Scikit-learn to their Dask-powered equivalents. Dask Integration Flambé is a machine learning experimentation framework built to accelerate the entire research life cycle. Flambé’s main objective is to provide a unified interface for prototyping models, running experiments containing complex pipelines, monitoring those experiments in real-time, reporting results, and deploying a final model for inference. Flambé Integration Flyte is a Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source. Flyte Integration Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. Horovod Integration State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0. It integrates with Ray for distributed hyperparameter tuning of transformer models. 
Hugging Face Transformers Integration Analytics Zoo seamlessly scales TensorFlow, Keras and PyTorch to distributed big data (using Spark, Flink & Ray). Intel Analytics Zoo Integration The power of 350+ pre-trained NLP models, 100+ Word Embeddings, 50+ Sentence Embeddings, and 50+ Classifiers in 46 languages with 1 line of Python code. NLU Integration Ludwig is a toolbox that allows users to train and test deep learning models without the need to write code. With Ludwig, you can train a deep learning model on Ray in zero lines of code, automatically leveraging Dask on Ray for data preprocessing, Horovod on Ray for distributed training, and Ray Tune for hyperparameter optimization. Ludwig Integration Mars is a tensor-based unified framework for large-scale data computation which scales Numpy, Pandas and Scikit-learn. Mars can scale in to a single machine, and scale out to a cluster with thousands of machines. MARS Integration Scale your pandas workflows by changing one line of code. Modin transparently distributes the data and computation so that all you need to do is continue using the pandas API as you were before installing Modin. Modin Integration Prefect is an open source workflow orchestration platform in Python. It allows you to easily define, track and schedule workflows in Python. This integration makes it easy to run a Prefect workflow on a Ray cluster in a distributed way. Prefect Integration PyCaret is an open source low-code machine learning library in Python that aims to reduce the hypothesis to insights cycle time in a ML experiment. It enables data scientists to perform end-to-end experiments quickly and efficiently. PyCaret Integration RayDP (“Spark on Ray”) enables you to easily use Spark inside a Ray program. You can use Spark to read the input data, process the data using SQL, Spark DataFrame, or Pandas (via Koalas) API, extract and transform features using Spark MLLib, and use RayDP Estimator API for distributed training on the preprocessed dataset. RayDP Integration Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit Learn Integration Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The focus of the library is to provide high-quality implementations of black-box, white-box, local and global explanation methods for classification and regression models. Seldon Alibi Integration Sematic is an open-source ML pipelining tool written in Python. It enables users to write end-to-end pipelines that can seamlessly transition between your laptop and the cloud, with rich visualizations, traceability, reproducibility, and usability as first-class citizens. This integration enables dynamic allocation of Ray clusters within Sematic pipelines. Sematic Integration spaCy is a library for advanced Natural Language Processing in Python and Cython. It’s built on the very latest research, and was designed from day one to be used in real products. spaCy Integration XGBoost is a popular gradient boosting library for classification and regression. It is one of the most popular tools in data science and workhorse of many top-performing Kaggle kernels. 
XGBoost Integration

LightGBM is a high-performance gradient boosting library for classification and regression. It is designed to be distributed and efficient.

LightGBM Integration

Volcano is a system for running high-performance workloads on Kubernetes. It features powerful batch scheduling capabilities required by ML and other data-intensive workloads.

Volcano Integration

What is Ray Core?

Ray Core provides a small number of core primitives (i.e., tasks, actors, objects) for building and scaling distributed applications. Below we’ll walk through simple examples that show you how to turn your functions and classes easily into Ray tasks and actors, and how to work with Ray objects.

Getting Started

To get started, install Ray via pip install -U ray. See Installing Ray for more installation options. The following few sections will walk through the basics of using Ray Core. The first step is to import and initialize Ray:

import ray

ray.init()

In recent versions of Ray (>=1.5), ray.init() is automatically called on the first use of a Ray remote API.

Running a Task

Ray lets you run functions as remote tasks in the cluster. To do this, you decorate your function with @ray.remote to declare that you want to run this function remotely. Then, you call that function with .remote() instead of calling it normally. This remote call returns a future, a so-called Ray object reference, that you can then fetch with ray.get:

# Define the square task.
@ray.remote
def square(x):
    return x * x

# Launch four parallel square tasks.
futures = [square.remote(i) for i in range(4)]

# Retrieve results.
print(ray.get(futures))
# -> [0, 1, 4, 9]

Calling an Actor

Ray provides actors to allow you to parallelize computation across multiple actor instances. When you instantiate a class that is a Ray actor, Ray will start a remote instance of that class in the cluster. This actor can then execute remote method calls and maintain its own internal state:

# Define the Counter actor.
@ray.remote
class Counter:
    def __init__(self):
        self.i = 0

    def get(self):
        return self.i

    def incr(self, value):
        self.i += value

# Create a Counter actor.
c = Counter.remote()

# Submit calls to the actor. These calls run asynchronously but in
# submission order on the remote actor process.
for _ in range(10):
    c.incr.remote(1)

# Retrieve final actor state.
print(ray.get(c.get.remote()))
# -> 10

The above covers very basic actor usage. For a more in-depth example, including using both tasks and actors together, check out Monte Carlo Estimation of π.

Passing an Object

As seen above, Ray stores task and actor call results in its distributed object store, returning object references that can be later retrieved. Object references can also be created explicitly via ray.put, and object references can be passed to tasks as substitutes for argument values:

import numpy as np

# Define a task that sums the values in a matrix.
@ray.remote
def sum_matrix(matrix):
    return np.sum(matrix)

# Call the task with a literal argument value.
print(ray.get(sum_matrix.remote(np.ones((100, 100)))))
# -> 10000.0

# Put a large array into the object store.
matrix_ref = ray.put(np.ones((1000, 1000)))

# Call the task with the object reference as an argument.
print(ray.get(sum_matrix.remote(matrix_ref)))
# -> 1000000.0

Next Steps

To check how your application is doing, you can use the Ray dashboard. Ray’s key primitives are simple, but can be composed together to express almost any kind of distributed computation. A short sketch of composing them appears below.
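As a rough, illustrative sketch of that composition (the preprocess function and Accumulator class are made up for this example, not part of the walkthrough above), tasks can feed their object references straight into an actor that keeps running state:

import ray

@ray.remote
def preprocess(x):
    # A stateless task that produces an intermediate result.
    return x * 2

@ray.remote
class Accumulator:
    # A stateful actor that aggregates results across calls.
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total

acc = Accumulator.remote()

# Pass task outputs (object refs) directly to actor method calls;
# Ray resolves each ref to its value before the method runs.
refs = [acc.add.remote(preprocess.remote(i)) for i in range(4)]
print(ray.get(refs)[-1])  # -> 12  (0*2 + 1*2 + 2*2 + 3*2)

Because only object references move between the caller, the tasks, and the actor, intermediate results stay in the object store and are fetched just once, when ray.get is finally called.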
Learn more about Ray’s key concepts with the following user guides: Using remote functions (Tasks) Using remote classes (Actors) Working with Ray Objects Key Concepts This section overviews Ray’s key concepts. These primitives work together to enable Ray to flexibly support a broad range of distributed applications. Tasks Ray enables arbitrary functions to be executed asynchronously on separate Python workers. These asynchronous Ray functions are called “tasks”. Ray enables tasks to specify their resource requirements in terms of CPUs, GPUs, and custom resources. These resource requests are used by the cluster scheduler to distribute tasks across the cluster for parallelized execution. See the User Guide for Tasks. Actors Actors extend the Ray API from functions (tasks) to classes. An actor is essentially a stateful worker (or a service). When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker. Like tasks, actors support CPU, GPU, and custom resource requirements. See the User Guide for Actors. Objects In Ray, tasks and actors create and compute on objects. We refer to these objects as remote objects because they can be stored anywhere in a Ray cluster, and we use object refs to refer to them. Remote objects are cached in Ray’s distributed shared-memory object store, and there is one object store per node in the cluster. In the cluster setting, a remote object can live on one or many nodes, independent of who holds the object ref(s). See the User Guide for Objects. Placement Groups Placement groups allow users to atomically reserve groups of resources across multiple nodes (i.e., gang scheduling). They can be then used to schedule Ray tasks and actors packed as close as possible for locality (PACK), or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks. See the User Guide for Placement Groups. Environment Dependencies When Ray executes tasks and actors on remote machines, their environment dependencies (e.g., Python packages, local files, environment variables) must be available for the code to run. To address this problem, you can (1) prepare your dependencies on the cluster in advance using the Ray Cluster Launcher, or (2) use Ray’s runtime environments to install them on the fly. See the User Guide for Environment Dependencies. User Guides This section explains how to use Ray’s key concepts to build distributed applications. If you’re brand new to Ray, we recommend starting with the walkthrough. Tasks Ray enables arbitrary functions to be executed asynchronously on separate Python workers. Such functions are called Ray remote functions and their asynchronous invocations are called Ray tasks. Here is an example. Python import ray import time # A regular Python function. def normal_function(): return 1 # By adding the `@ray.remote` decorator, a regular Python function # becomes a Ray remote function. @ray.remote def my_function(): return 1 # To invoke this remote function, use the `remote` method. # This will immediately return an object ref (a future) and then create # a task that will be executed on a worker process. obj_ref = my_function.remote() # The result can be retrieved with ``ray.get``. assert ray.get(obj_ref) == 1 @ray.remote def slow_function(): time.sleep(10) return 1 # Ray tasks are executed in parallel. # All computation is performed in the background, driven by Ray's internal event loop. 
for _ in range(4):
    # This doesn't block.
    slow_function.remote()

See the ray.remote package reference page for specific documentation on how to use ray.remote.

Java

public class MyRayApp {
  // A regular Java static method.
  public static int myFunction() {
    return 1;
  }
}

// Invoke the above method as a Ray task.
// This will immediately return an object ref (a future) and then create
// a task that will be executed on a worker process.
ObjectRef<Integer> res = Ray.task(MyRayApp::myFunction).remote();

// The result can be retrieved with ``ObjectRef::get``.
Assert.assertTrue(res.get() == 1);

public class MyRayApp {
  public static int slowFunction() throws InterruptedException {
    TimeUnit.SECONDS.sleep(10);
    return 1;
  }
}

// Ray tasks are executed in parallel.
// All computation is performed in the background, driven by Ray's internal event loop.
for(int i = 0; i < 4; i++) {
  // This doesn't block.
  Ray.task(MyRayApp::slowFunction).remote();
}

C++

// A regular C++ function.
int MyFunction() {
  return 1;
}

// Register as a remote function by `RAY_REMOTE`.
RAY_REMOTE(MyFunction);

// Invoke the above method as a Ray task.
// This will immediately return an object ref (a future) and then create
// a task that will be executed on a worker process.
auto res = ray::Task(MyFunction).Remote();

// The result can be retrieved with ``ray::ObjectRef::Get``.
assert(*res.Get() == 1);

int SlowFunction() {
  std::this_thread::sleep_for(std::chrono::seconds(10));
  return 1;
}
RAY_REMOTE(SlowFunction);

// Ray tasks are executed in parallel.
// All computation is performed in the background, driven by Ray's internal event loop.
for(int i = 0; i < 4; i++) {
  // This doesn't block.
  ray::Task(SlowFunction).Remote();
}

Use ray summary tasks from the State API to see running and finished tasks and counts:

# This API is only available when you download Ray via `pip install "ray[default]"`
ray summary tasks

======== Tasks Summary: 2023-05-26 11:09:32.092546 ========
Stats:
------------------------------------
total_actor_scheduled: 0
total_actor_tasks: 0
total_tasks: 5

Table (group by func_name):
------------------------------------
    FUNC_OR_CLASS_NAME    STATE_COUNTS    TYPE
0   slow_function         RUNNING: 4      NORMAL_TASK
1   my_function           FINISHED: 1     NORMAL_TASK

Specifying required resources

You can specify resource requirements in tasks (see Specifying Task or Actor Resource Requirements for more details).

Python

# Specify required resources.
@ray.remote(num_cpus=4, num_gpus=2)
def my_function():
    return 1

# Override the default resource requirements.
my_function.options(num_cpus=3).remote()

Java

// Specify required resources.
Ray.task(MyRayApp::myFunction).setResource("CPU", 4.0).setResource("GPU", 2.0).remote();

C++

// Specify required resources.
ray::Task(MyFunction).SetResource("CPU", 4.0).SetResource("GPU", 2.0).Remote();

Passing object refs to Ray tasks

In addition to values, object refs can also be passed into remote functions. When the task gets executed, inside the function body the argument will be the underlying value. For example, take this function:

Python

@ray.remote
def function_with_an_argument(value):
    return value + 1

obj_ref1 = my_function.remote()
assert ray.get(obj_ref1) == 1

# You can pass an object ref as an argument to another Ray task.
obj_ref2 = function_with_an_argument.remote(obj_ref1) assert ray.get(obj_ref2) == 2 Java public class MyRayApp { public static int functionWithAnArgument(int value) { return value + 1; } } ObjectRef objRef1 = Ray.task(MyRayApp::myFunction).remote(); Assert.assertTrue(objRef1.get() == 1); // You can pass an object ref as an argument to another Ray task. ObjectRef objRef2 = Ray.task(MyRayApp::functionWithAnArgument, objRef1).remote(); Assert.assertTrue(objRef2.get() == 2); C++ static int FunctionWithAnArgument(int value) { return value + 1; } RAY_REMOTE(FunctionWithAnArgument); auto obj_ref1 = ray::Task(MyFunction).Remote(); assert(*obj_ref1.Get() == 1); // You can pass an object ref as an argument to another Ray task. auto obj_ref2 = ray::Task(FunctionWithAnArgument).Remote(obj_ref1); assert(*obj_ref2.Get() == 2); Note the following behaviors: As the second task depends on the output of the first task, Ray will not execute the second task until the first task has finished. If the two tasks are scheduled on different machines, the output of the first task (the value corresponding to obj_ref1/objRef1) will be sent over the network to the machine where the second task is scheduled. Waiting for Partial Results Calling ray.get on Ray task results will block until the task finished execution. After launching a number of tasks, you may want to know which ones have finished executing without blocking on all of them. This could be achieved by ray.wait(). The function works as follows. Python object_refs = [slow_function.remote() for _ in range(2)] # Return as soon as one of the tasks finished execution. ready_refs, remaining_refs = ray.wait(object_refs, num_returns=1, timeout=None) Java WaitResult waitResult = Ray.wait(objectRefs, /*num_returns=*/0, /*timeoutMs=*/1000); System.out.println(waitResult.getReady()); // List of ready objects. System.out.println(waitResult.getUnready()); // list of unready objects. C++ ray::WaitResult wait_result = ray::Wait(object_refs, /*num_objects=*/0, /*timeout_ms=*/1000); Multiple returns By default, a Ray task only returns a single Object Ref. However, you can configure Ray tasks to return multiple Object Refs, by setting the num_returns option. Python # By default, a Ray task only returns a single Object Ref. @ray.remote def return_single(): return 0, 1, 2 object_ref = return_single.remote() assert ray.get(object_ref) == (0, 1, 2) # However, you can configure Ray tasks to return multiple Object Refs. @ray.remote(num_returns=3) def return_multiple(): return 0, 1, 2 object_ref0, object_ref1, object_ref2 = return_multiple.remote() assert ray.get(object_ref0) == 0 assert ray.get(object_ref1) == 1 assert ray.get(object_ref2) == 2 For tasks that return multiple objects, Ray also supports remote generators that allow a task to return one object at a time to reduce memory usage at the worker. Ray also supports an option to set the number of return values dynamically, which can be useful when the task caller does not know how many return values to expect. See the user guide for more details on use cases. Python @ray.remote(num_returns=3) def return_multiple_as_generator(): for i in range(3): yield i # NOTE: Similar to normal functions, these objects will not be available # until the full task is complete and all returns have been generated. a, b, c = return_multiple_as_generator.remote() Cancelling tasks Ray tasks can be canceled by calling ray.cancel() on the returned Object ref. 
Python @ray.remote def blocking_operation(): time.sleep(10e6) obj_ref = blocking_operation.remote() ray.cancel(obj_ref) try: ray.get(obj_ref) except ray.exceptions.TaskCancelledError: print("Object reference was cancelled.") Scheduling For each task, Ray will choose a node to run it and the scheduling decision is based on a few factors like the task’s resource requirements, the specified scheduling strategy and locations of task arguments. See Ray scheduling for more details. Fault Tolerance By default, Ray will retry failed tasks due to system failures and specified application-level failures. You can change this behavior by setting max_retries and retry_exceptions options in ray.remote() and .options(). See Ray fault tolerance for more details. More about Ray Tasks Nested Remote Functions Remote functions can call other remote functions, resulting in nested tasks. For example, consider the following. import ray @ray.remote def f(): return 1 @ray.remote def g(): # Call f 4 times and return the resulting object refs. return [f.remote() for _ in range(4)] @ray.remote def h(): # Call f 4 times, block until those 4 tasks finish, # retrieve the results, and return the values. return ray.get([f.remote() for _ in range(4)]) Then calling g and h produces the following behavior. >>> ray.get(g.remote()) [ObjectRef(b1457ba0911ae84989aae86f89409e953dd9a80e), ObjectRef(7c14a1d13a56d8dc01e800761a66f09201104275), ObjectRef(99763728ffc1a2c0766a2000ebabded52514e9a6), ObjectRef(9c2f372e1933b04b2936bb6f58161285829b9914)] >>> ray.get(h.remote()) [1, 1, 1, 1] One limitation is that the definition of f must come before the definitions of g and h because as soon as g is defined, it will be pickled and shipped to the workers, and so if f hasn’t been defined yet, the definition will be incomplete. Yielding Resources While Blocked Ray will release CPU resources when being blocked. This prevents deadlock cases where the nested tasks are waiting for the CPU resources held by the parent task. Consider the following remote function. @ray.remote(num_cpus=1, num_gpus=1) def g(): return ray.get(f.remote()) When a g task is executing, it will release its CPU resources when it gets blocked in the call to ray.get. It will reacquire the CPU resources when ray.get returns. It will retain its GPU resources throughout the lifetime of the task because the task will most likely continue to use GPU memory. Generators Python generators are functions that behave like an iterator, yielding one value per iteration. Ray supports remote generators for two use cases: To reduce max heap memory usage when returning multiple values from a remote function. See the design pattern guide for an example. When the number of return values is set dynamically by the remote function instead of by the caller. Remote generators can be used in both actor and non-actor tasks. num_returns set by the task caller Where possible, the caller should set the remote function’s number of return values using @ray.remote(num_returns=x) or foo.options(num_returns=x).remote(). Ray will return this many ObjectRefs to the caller. The remote task should then return the same number of values, usually as a tuple or list. Compared to setting the number of return values dynamically, this adds less complexity to user code and less performance overhead, as Ray will know exactly how many ObjectRefs to return to the caller ahead of time. Without changing the caller’s syntax, we can also use a remote generator function to yield the values iteratively. 
The generator should yield the same number of return values specified by the caller, and these will be stored one at a time in Ray’s object store. An error will be raised for generators that yield a different number of values from the one specified by the caller. For example, we can swap the following code that returns a list of return values: import numpy as np @ray.remote def large_values(num_returns): return [ np.random.randint(np.iinfo(np.int8).max, size=(100_000_000, 1), dtype=np.int8) for _ in range(num_returns) ] for this code, which uses a generator function: @ray.remote def large_values_generator(num_returns): for i in range(num_returns): yield np.random.randint( np.iinfo(np.int8).max, size=(100_000_000, 1), dtype=np.int8 ) print(f"yielded return value {i}") The advantage of doing so is that the generator function does not need to hold all of its return values in memory at once. It can yield the arrays one at a time to reduce memory pressure. num_returns set by the task executor In some cases, the caller may not know the number of return values to expect from a remote function. For example, suppose we want to write a task that breaks up its argument into equal-size chunks and returns these. We may not know the size of the argument until we execute the task, so we don’t know the number of return values to expect. In these cases, we can use a remote generator function that returns a dynamic number of values. To use this feature, set num_returns="dynamic" in the @ray.remote decorator or the remote function’s .options(). Then, when invoking the remote function, Ray will return a single ObjectRef that will get populated with an ObjectRefGenerator when the task completes. The ObjectRefGenerator can be used to iterate over a list of ObjectRefs containing the actual values returned by the task. import numpy as np @ray.remote(num_returns="dynamic") def split(array, chunk_size): while len(array) > 0: yield array[:chunk_size] array = array[chunk_size:] array_ref = ray.put(np.zeros(np.random.randint(1000_000))) block_size = 1000 # Returns an ObjectRef[ObjectRefGenerator]. dynamic_ref = split.remote(array_ref, block_size) print(dynamic_ref) # ObjectRef(c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000) i = -1 ref_generator = ray.get(dynamic_ref) print(ref_generator) # for i, ref in enumerate(ref_generator): # Each ObjectRefGenerator iteration returns an ObjectRef. assert len(ray.get(ref)) <= block_size num_blocks_generated = i + 1 array_size = len(ray.get(array_ref)) assert array_size <= num_blocks_generated * block_size print(f"Split array of size {array_size} into {num_blocks_generated} blocks of " f"size {block_size} each.") # Split array of size 63153 into 64 blocks of size 1000 each. # NOTE: The dynamic_ref points to the generated ObjectRefs. Make sure that this # ObjectRef goes out of scope so that Ray can garbage-collect the internal # ObjectRefs. del dynamic_ref We can also pass the ObjectRef returned by a task with num_returns="dynamic" to another task. The task will receive the ObjectRefGenerator, which it can use to iterate over the task’s return values. Similarly, you can also pass an ObjectRefGenerator as a task argument. @ray.remote def get_size(ref_generator : ObjectRefGenerator): print(ref_generator) num_elements = 0 for ref in ref_generator: array = ray.get(ref) assert len(array) <= block_size num_elements += len(array) return num_elements # Returns an ObjectRef[ObjectRefGenerator]. 
dynamic_ref = split.remote(array_ref, block_size)
assert array_size == ray.get(get_size.remote(dynamic_ref))
# (get_size pid=1504184)

# This also works, but should be avoided because you have to call an additional
# `ray.get`, which blocks the driver.
ref_generator = ray.get(dynamic_ref)
assert array_size == ray.get(get_size.remote(ref_generator))
# (get_size pid=1504184)

Exception handling

If a generator function raises an exception before yielding all its values, the values that it already stored will still be accessible through their ObjectRefs. The remaining ObjectRefs will contain the raised exception. This is true for both static and dynamic num_returns. If the task was called with num_returns="dynamic", the exception will be stored as an additional final ObjectRef in the ObjectRefGenerator.

@ray.remote
def generator():
    for i in range(2):
        yield i
    raise Exception("error")

ref1, ref2, ref3, ref4 = generator.options(num_returns=4).remote()
assert ray.get([ref1, ref2]) == [0, 1]

# All remaining ObjectRefs will contain the error.
try:
    ray.get([ref3, ref4])
except Exception as error:
    print(error)

dynamic_ref = generator.options(num_returns="dynamic").remote()
ref_generator = ray.get(dynamic_ref)

ref1, ref2, ref3 = ref_generator
assert ray.get([ref1, ref2]) == [0, 1]

# Generators with num_returns="dynamic" will store the exception in the final
# ObjectRef.
try:
    ray.get(ref3)
except Exception as error:
    print(error)

Note that there is currently a known bug where exceptions will not be propagated for generators that yield more values than expected. This can occur in two cases:

When num_returns is set by the caller, but the generator task returns more than this value.
When a generator task with num_returns="dynamic" is re-executed, and the re-executed task yields more values than the original execution.

Note that in general, Ray does not guarantee correctness for task re-execution if the task is nondeterministic, and it is recommended to set @ray.remote(max_retries=0) for such tasks.

# Generators that yield more values than expected currently do not throw an
# exception (the error is only logged).
# See https://github.com/ray-project/ray/issues/28689.
ref1, ref2 = generator.options(num_returns=2).remote()
assert ray.get([ref1, ref2]) == [0, 1]
"""
(generator pid=2375938) 2022-09-28 11:08:51,386 ERROR worker.py:755 -- Unhandled error: Task threw exception, but all return values already created. This should only occur when using generator tasks.
...
"""

Limitations

Although a generator function creates ObjectRefs one at a time, currently Ray will not schedule dependent tasks until the entire task is complete and all values have been created. This is similar to the semantics used by tasks that return multiple values as a list.

Actors

Actors extend the Ray API from functions (tasks) to classes. An actor is essentially a stateful worker (or a service). When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker.

Python

The ray.remote decorator indicates that instances of the Counter class will be actors. Each actor runs in its own Python process.

import ray

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def get_counter(self):
        return self.value

# Create an actor from this class.
counter = Counter.remote()

Java

Ray.actor is used to create actors from regular Java classes.

// A regular Java class.
public class Counter { private int value = 0; public int increment() { this.value += 1; return this.value; } } // Create an actor from this class. // `Ray.actor` takes a factory method that can produce // a `Counter` object. Here, we pass `Counter`'s constructor // as the argument. ActorHandle counter = Ray.actor(Counter::new).remote(); C++ ray::Actor is used to create actors from regular C++ classes. // A regular C++ class. class Counter { private: int value = 0; public: int Increment() { value += 1; return value; } }; // Factory function of Counter class. static Counter *CreateCounter() { return new Counter(); }; RAY_REMOTE(&Counter::Increment, CreateCounter); // Create an actor from this class. // `ray::Actor` takes a factory method that can produce // a `Counter` object. Here, we pass `Counter`'s factory function // as the argument. auto counter = ray::Actor(CreateCounter).Remote(); Use ray list actors from State API to see actors states: # This API is only available when you install Ray with `pip install "ray[default]"`. ray list actors ======== List: 2023-05-25 10:10:50.095099 ======== Stats: ------------------------------ Total: 1 Table: ------------------------------ ACTOR_ID CLASS_NAME STATE JOB_ID NAME NODE_ID PID RAY_NAMESPACE 0 9e783840250840f87328c9f201000000 Counter ALIVE 01000000 13a475571662b784b4522847692893a823c78f1d3fd8fd32a2624923 38906 ef9de910-64fb-4575-8eb5-50573faa3ddf Specifying required resources You can specify resource requirements in actors too (see Specifying Task or Actor Resource Requirements for more details.) Python # Specify required resources for an actor. @ray.remote(num_cpus=2, num_gpus=0.5) class Actor: pass Java // Specify required resources for an actor. Ray.actor(Counter::new).setResource("CPU", 2.0).setResource("GPU", 0.5).remote(); C++ // Specify required resources for an actor. ray::Actor(CreateCounter).SetResource("CPU", 2.0).SetResource("GPU", 0.5).Remote(); Calling the actor We can interact with the actor by calling its methods with the remote operator. We can then call get on the object ref to retrieve the actual value. Python # Call the actor. obj_ref = counter.increment.remote() print(ray.get(obj_ref)) 1 Java // Call the actor. ObjectRef objectRef = counter.task(&Counter::increment).remote(); Assert.assertTrue(objectRef.get() == 1); C++ // Call the actor. auto object_ref = counter.Task(&Counter::increment).Remote(); assert(*object_ref.Get() == 1); Methods called on different actors can execute in parallel, and methods called on the same actor are executed serially in the order that they are called. Methods on the same actor will share state with one another, as shown below. Python # Create ten Counter actors. counters = [Counter.remote() for _ in range(10)] # Increment each Counter once and get the results. These tasks all happen in # parallel. results = ray.get([c.increment.remote() for c in counters]) print(results) # Increment the first Counter five times. These tasks are executed serially # and share state. results = ray.get([counters[0].increment.remote() for _ in range(5)]) print(results) [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] [2, 3, 4, 5, 6] Java // Create ten Counter actors. List> counters = new ArrayList<>(); for (int i = 0; i < 10; i++) { counters.add(Ray.actor(Counter::new).remote()); } // Increment each Counter once and get the results. These tasks all happen in // parallel. 
List> objectRefs = new ArrayList<>(); for (ActorHandle counterActor : counters) { objectRefs.add(counterActor.task(Counter::increment).remote()); } // prints [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] System.out.println(Ray.get(objectRefs)); // Increment the first Counter five times. These tasks are executed serially // and share state. objectRefs = new ArrayList<>(); for (int i = 0; i < 5; i++) { objectRefs.add(counters.get(0).task(Counter::increment).remote()); } // prints [2, 3, 4, 5, 6] System.out.println(Ray.get(objectRefs)); C++ // Create ten Counter actors. std::vector> counters; for (int i = 0; i < 10; i++) { counters.emplace_back(ray::Actor(CreateCounter).Remote()); } // Increment each Counter once and get the results. These tasks all happen in // parallel. std::vector> object_refs; for (ray::ActorHandle counter_actor : counters) { object_refs.emplace_back(counter_actor.Task(&Counter::Increment).Remote()); } // prints 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 auto results = ray::Get(object_refs); for (const auto &result : results) { std::cout << *result; } // Increment the first Counter five times. These tasks are executed serially // and share state. object_refs.clear(); for (int i = 0; i < 5; i++) { object_refs.emplace_back(counters[0].Task(&Counter::Increment).Remote()); } // prints 2, 3, 4, 5, 6 results = ray::Get(object_refs); for (const auto &result : results) { std::cout << *result; } Passing Around Actor Handles Actor handles can be passed into other tasks. We can define remote functions (or actor methods) that use actor handles. Python import time @ray.remote def f(counter): for _ in range(10): time.sleep(0.1) counter.increment.remote() Java public static class MyRayApp { public static void foo(ActorHandle counter) throws InterruptedException { for (int i = 0; i < 1000; i++) { TimeUnit.MILLISECONDS.sleep(100); counter.task(Counter::increment).remote(); } } } C++ void Foo(ray::ActorHandle counter) { for (int i = 0; i < 1000; i++) { std::this_thread::sleep_for(std::chrono::milliseconds(100)); counter.Task(&Counter::Increment).Remote(); } } If we instantiate an actor, we can pass the handle around to various tasks. Python counter = Counter.remote() # Start some tasks that use the actor. [f.remote(counter) for _ in range(3)] # Print the counter value. for _ in range(10): time.sleep(0.1) print(ray.get(counter.get_counter.remote())) 0 3 8 10 15 18 20 25 30 30 Java ActorHandle counter = Ray.actor(Counter::new).remote(); // Start some tasks that use the actor. for (int i = 0; i < 3; i++) { Ray.task(MyRayApp::foo, counter).remote(); } // Print the counter value. for (int i = 0; i < 10; i++) { TimeUnit.SECONDS.sleep(1); System.out.println(counter.task(Counter::getCounter).remote().get()); } C++ auto counter = ray::Actor(CreateCounter).Remote(); // Start some tasks that use the actor. for (int i = 0; i < 3; i++) { ray::Task(Foo).Remote(counter); } // Print the counter value. for (int i = 0; i < 10; i++) { std::this_thread::sleep_for(std::chrono::seconds(1)); std::cout << *counter.Task(&Counter::GetCounter).Remote().Get() << std::endl; } Scheduling For each actor, Ray will choose a node to run it and the scheduling decision is based on a few factors like the actor’s resource requirements and the specified scheduling strategy. See Ray scheduling for more details. Fault Tolerance By default, Ray actors won’t be restarted and actor tasks won’t be retried when actors crash unexpectedly. You can change this behavior by setting max_restarts and max_task_retries options in ray.remote() and .options(). 
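For example, here is a minimal sketch (the class name and retry counts are illustrative, not taken from the Ray docs) of opting an actor into restarts and task retries with these options:

import ray

# Hypothetical example: restart the actor up to 3 times if its process dies,
# and retry each failed actor task up to 2 times after a restart.
@ray.remote(max_restarts=3, max_task_retries=2)
class ResilientCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = ResilientCounter.remote()

# The same options can also be supplied per-instance via .options().
counter = ResilientCounter.options(max_restarts=3, max_task_retries=2).remote()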
See Ray fault tolerance for more details. FAQ: Actors, Workers and Resources What’s the difference between a worker and an actor? Each “Ray worker” is a python process. Workers are treated differently for tasks and actors. Any “Ray worker” is either 1. used to execute multiple Ray tasks or 2. is started as a dedicated Ray actor. Tasks: When Ray starts on a machine, a number of Ray workers will be started automatically (1 per CPU by default). They will be used to execute tasks (like a process pool). If you execute 8 tasks with num_cpus=2, and total number of CPUs is 16 (ray.cluster_resources()["CPU"] == 16), you will end up with 8 of your 16 workers idling. Actor: A Ray Actor is also a “Ray worker” but is instantiated at runtime (upon actor_cls.remote()). All of its methods will run on the same process, using the same resources (designated when defining the Actor). Note that unlike tasks, the python processes that runs Ray Actors are not reused and will be terminated when the Actor is deleted. To maximally utilize your resources, you want to maximize the time that your workers are working. You also want to allocate enough cluster resources so that both all of your needed actors can run and any other tasks you define can run. This also implies that tasks are scheduled more flexibly, and that if you don’t need the stateful part of an actor, you’re mostly better off using tasks. More about Ray Actors Named Actors An actor can be given a unique name within their namespace. This allows you to retrieve the actor from any job in the Ray cluster. This can be useful if you cannot directly pass the actor handle to the task that needs it, or if you are trying to access an actor launched by another driver. Note that the actor will still be garbage-collected if no handles to it exist. See Actor Lifetimes for more details. Python import ray @ray.remote class Counter: pass # Create an actor with a name counter = Counter.options(name="some_name").remote() # Retrieve the actor later somewhere counter = ray.get_actor("some_name") Java // Create an actor with a name. ActorHandle counter = Ray.actor(Counter::new).setName("some_name").remote(); ... // Retrieve the actor later somewhere Optional> counter = Ray.getActor("some_name"); Assert.assertTrue(counter.isPresent()); C++ // Create an actor with a globally unique name ActorHandle counter = ray::Actor(CreateCounter).SetGlobalName("some_name").Remote(); ... // Retrieve the actor later somewhere boost::optional> counter = ray::GetGlobalActor("some_name"); We also support non-global named actors in C++, which means that the actor name is only valid within the job and the actor cannot be accessed from another job // Create an actor with a job-scope-unique name ActorHandle counter = ray::Actor(CreateCounter).SetName("some_name").Remote(); ... // Retrieve the actor later somewhere in the same job boost::optional> counter = ray::GetActor("some_name"); Named actors are scoped by namespace. If no namespace is assigned, they will be placed in an anonymous namespace by default. Python import ray @ray.remote class Actor: pass # driver_1.py # Job 1 creates an actor, "orange" in the "colors" namespace. ray.init(address="auto", namespace="colors") Actor.options(name="orange", lifetime="detached").remote() # driver_2.py # Job 2 is now connecting to a different namespace. ray.init(address="auto", namespace="fruit") # This fails because "orange" was defined in the "colors" namespace. ray.get_actor("orange") # You can also specify the namespace explicitly. 
ray.get_actor("orange", namespace="colors") # driver_3.py # Job 3 connects to the original "colors" namespace ray.init(address="auto", namespace="colors") # This returns the "orange" actor we created in the first job. ray.get_actor("orange") Java import ray class Actor { } // Driver1.java // Job 1 creates an actor, "orange" in the "colors" namespace. System.setProperty("ray.job.namespace", "colors"); Ray.init(); Ray.actor(Actor::new).setName("orange").remote(); // Driver2.java // Job 2 is now connecting to a different namespace. System.setProperty("ray.job.namespace", "fruits"); Ray.init(); // This fails because "orange" was defined in the "colors" namespace. Optional> actor = Ray.getActor("orange"); Assert.assertFalse(actor.isPresent()); // actor.isPresent() is false. // Driver3.java System.setProperty("ray.job.namespace", "colors"); Ray.init(); // This returns the "orange" actor we created in the first job. Optional> actor = Ray.getActor("orange"); Assert.assertTrue(actor.isPresent()); // actor.isPresent() is true. Get-Or-Create a Named Actor A common use case is to create an actor only if it doesn’t exist. Ray provides a get_if_exists option for actor creation that does this out of the box. This method is available after you set a name for the actor via .options(). If the actor already exists, a handle to the actor will be returned and the arguments will be ignored. Otherwise, a new actor will be created with the specified arguments. Python import ray @ray.remote class Greeter: def __init__(self, value): self.value = value def say_hello(self): return self.value # Actor `g1` doesn't yet exist, so it is created with the given args. a = Greeter.options(name="g1", get_if_exists=True).remote("Old Greeting") assert ray.get(a.say_hello.remote()) == "Old Greeting" # Actor `g1` already exists, so it is returned (new args are ignored). b = Greeter.options(name="g1", get_if_exists=True).remote("New Greeting") assert ray.get(b.say_hello.remote()) == "Old Greeting" Java // This feature is not yet available in Java. C++ // This feature is not yet available in C++. Actor Lifetimes Separately, actor lifetimes can be decoupled from the job, allowing an actor to persist even after the driver process of the job exits. We call these actors detached. Python counter = Counter.options(name="CounterActor", lifetime="detached").remote() The CounterActor will be kept alive even after the driver running above script exits. Therefore it is possible to run the following script in a different driver: counter = ray.get_actor("CounterActor") Note that an actor can be named but not detached. If we only specified the name without specifying lifetime="detached", then the CounterActor can only be retrieved as long as the original driver is still running. Java System.setProperty("ray.job.namespace", "lifetime"); Ray.init(); ActorHandle counter = Ray.actor(Counter::new).setName("some_name").setLifetime(ActorLifetime.DETACHED).remote(); The CounterActor will be kept alive even after the driver running above process exits. Therefore it is possible to run the following code in a different driver: System.setProperty("ray.job.namespace", "lifetime"); Ray.init(); Optional> counter = Ray.getActor("some_name"); Assert.assertTrue(counter.isPresent()); C++ Customizing lifetime of an actor hasn’t been implemented in C++ yet. Unlike normal actors, detached actors are not automatically garbage-collected by Ray. Detached actors must be manually destroyed once you are sure that they are no longer needed. 
To do this, use ray.kill to manually terminate the actor. After this call, the actor’s name may be reused. Terminating Actors Actor processes will be terminated automatically when all copies of the actor handle have gone out of scope in Python, or if the original creator process dies. Note that automatic termination of actors is not yet supported in Java or C++. Manual termination via an actor handle In most cases, Ray will automatically terminate actors that have gone out of scope, but you may sometimes need to terminate an actor forcefully. This should be reserved for cases where an actor is unexpectedly hanging or leaking resources, and for detached actors, which must be manually destroyed. Python import ray @ray.remote class Actor: pass actor_handle = Actor.remote() ray.kill(actor_handle) # This will not go through the normal Python sys.exit # teardown logic, so any exit handlers installed in # the actor using ``atexit`` will not be called. Java actorHandle.kill(); // This will not go through the normal Java System.exit teardown logic, so any // shutdown hooks installed in the actor using ``Runtime.addShutdownHook(...)`` will // not be called. C++ actor_handle.Kill(); // This will not go through the normal C++ std::exit // teardown logic, so any exit handlers installed in // the actor using ``std::atexit`` will not be called. This will cause the actor to immediately exit its process, causing any current, pending, and future tasks to fail with a RayActorError. If you would like Ray to automatically restart the actor, make sure to set a nonzero max_restarts in the @ray.remote options for the actor, then pass the flag no_restart=False to ray.kill. For named and detached actors, calling ray.kill on an actor handle destroys the actor and allow the name to be reused. Use ray list actors --detail from State API to see the death cause of dead actors: # This API is only available when you download Ray via `pip install "ray[default]"` ray list actors --detail --- - actor_id: e8702085880657b355bf7ef001000000 class_name: Actor state: DEAD job_id: '01000000' name: '' node_id: null pid: 0 ray_namespace: dbab546b-7ce5-4cbb-96f1-d0f64588ae60 serialized_runtime_env: '{}' required_resources: {} death_cause: actor_died_error_context: # <---- You could see the error message w.r.t why the actor exits. error_message: The actor is dead because `ray.kill` killed it. owner_id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff owner_ip_address: 127.0.0.1 ray_namespace: dbab546b-7ce5-4cbb-96f1-d0f64588ae60 class_name: Actor actor_id: e8702085880657b355bf7ef001000000 never_started: true node_ip_address: '' pid: 0 name: '' is_detached: false placement_group_id: null repr_name: '' Manual termination within the actor If necessary, you can manually terminate an actor from within one of the actor methods. This will kill the actor process and release resources associated/assigned to the actor. Python @ray.remote class Actor: def exit(self): ray.actor.exit_actor() actor = Actor.remote() actor.exit.remote() This approach should generally not be necessary as actors are automatically garbage collected. The ObjectRef resulting from the task can be waited on to wait for the actor to exit (calling ray.get() on it will raise a RayActorError). Java Ray.exitActor(); Garbage collection for actors haven’t been implemented yet, so this is currently the only way to terminate an actor gracefully. 
The ObjectRef resulting from the task can be waited on to wait for the actor to exit (calling ObjectRef::get on it will throw a RayActorException). C++ ray::ExitActor(); Garbage collection for actors haven’t been implemented yet, so this is currently the only way to terminate an actor gracefully. The ObjectRef resulting from the task can be waited on to wait for the actor to exit (calling ObjectRef::Get on it will throw a RayActorException). Note that this method of termination waits until any previously submitted tasks finish executing and then exits the process gracefully with sys.exit. You could see the actor is dead as a result of the user’s exit_actor() call: # This API is only available when you download Ray via `pip install "ray[default]"` ray list actors --detail --- - actor_id: 070eb5f0c9194b851bb1cf1602000000 class_name: Actor state: DEAD job_id: '02000000' name: '' node_id: 47ccba54e3ea71bac244c015d680e202f187fbbd2f60066174a11ced pid: 47978 ray_namespace: 18898403-dda0-485a-9c11-e9f94dffcbed serialized_runtime_env: '{}' required_resources: {} death_cause: actor_died_error_context: error_message: 'The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. exit_actor() is called.' owner_id: 02000000ffffffffffffffffffffffffffffffffffffffffffffffff owner_ip_address: 127.0.0.1 node_ip_address: 127.0.0.1 pid: 47978 ray_namespace: 18898403-dda0-485a-9c11-e9f94dffcbed class_name: Actor actor_id: 070eb5f0c9194b851bb1cf1602000000 name: '' never_started: false is_detached: false placement_group_id: null repr_name: '' AsyncIO / Concurrency for Actors Within a single actor process, it is possible to execute concurrent threads. Ray offers two types of concurrency within an actor: async execution threading Keep in mind that the Python’s Global Interpreter Lock (GIL) will only allow one thread of Python code running at once. This means if you are just parallelizing Python code, you won’t get true parallelism. If you call Numpy, Cython, Tensorflow, or PyTorch code, these libraries will release the GIL when calling into C/C++ functions. Neither the Threaded Actors nor AsyncIO for Actors model will allow you to bypass the GIL. AsyncIO for Actors Since Python 3.5, it is possible to write concurrent code using the async/await syntax. Ray natively integrates with asyncio. You can use ray alongside with popular async frameworks like aiohttp, aioredis, etc. import ray import asyncio @ray.remote class AsyncActor: # multiple invocation of this method can be running in # the event loop at the same time async def run_concurrent(self): print("started") await asyncio.sleep(2) # concurrent workload here print("finished") actor = AsyncActor.remote() # regular ray.get ray.get([actor.run_concurrent.remote() for _ in range(4)]) # async ray.get async def async_get(): await actor.run_concurrent.remote() asyncio.run(async_get()) (AsyncActor pid=40293) started (AsyncActor pid=40293) started (AsyncActor pid=40293) started (AsyncActor pid=40293) started (AsyncActor pid=40293) finished (AsyncActor pid=40293) finished (AsyncActor pid=40293) finished (AsyncActor pid=40293) finished # NOTE: The outputs from the previous code block can show up in subsequent tests. # To prevent flakiness, we wait for the async calls finish. import time print("Sleeping...") time.sleep(3) ... ObjectRefs as asyncio.Futures ObjectRefs can be translated to asyncio.Futures. This feature make it possible to await on ray futures in existing concurrent applications. 
Instead of: import ray @ray.remote def some_task(): return 1 ray.get(some_task.remote()) ray.wait([some_task.remote()]) you can do: import ray import asyncio @ray.remote def some_task(): return 1 async def await_obj_ref(): await some_task.remote() await asyncio.wait([some_task.remote()]) asyncio.run(await_obj_ref()) Please refer to asyncio doc for more asyncio patterns including timeouts and asyncio.gather. If you need to directly access the future object, you can call: import asyncio async def convert_to_asyncio_future(): ref = some_task.remote() fut: asyncio.Future = asyncio.wrap_future(ref.future()) print(await fut) asyncio.run(convert_to_asyncio_future()) 1 ObjectRefs as concurrent.futures.Futures ObjectRefs can also be wrapped into concurrent.futures.Future objects. This is useful for interfacing with existing concurrent.futures APIs: import concurrent refs = [some_task.remote() for _ in range(4)] futs = [ref.future() for ref in refs] for fut in concurrent.futures.as_completed(futs): assert fut.done() print(fut.result()) 1 1 1 1 Defining an Async Actor By using async method definitions, Ray will automatically detect whether an actor support async calls or not. import asyncio @ray.remote class AsyncActor: async def run_task(self): print("started") await asyncio.sleep(2) # Network, I/O task here print("ended") actor = AsyncActor.remote() # All 5 tasks should start at once. After 2 second they should all finish. # they should finish at the same time ray.get([actor.run_task.remote() for _ in range(5)]) (AsyncActor pid=3456) started (AsyncActor pid=3456) started (AsyncActor pid=3456) started (AsyncActor pid=3456) started (AsyncActor pid=3456) started (AsyncActor pid=3456) ended (AsyncActor pid=3456) ended (AsyncActor pid=3456) ended (AsyncActor pid=3456) ended (AsyncActor pid=3456) ended Under the hood, Ray runs all of the methods inside a single python event loop. Please note that running blocking ray.get or ray.wait inside async actor method is not allowed, because ray.get will block the execution of the event loop. In async actors, only one task can be running at any point in time (though tasks can be multi-plexed). There will be only one thread in AsyncActor! See Threaded Actors if you want a threadpool. Setting concurrency in Async Actors You can set the number of “concurrent” task running at once using the max_concurrency flag. By default, 1000 tasks can be running concurrently. import asyncio @ray.remote class AsyncActor: async def run_task(self): print("started") await asyncio.sleep(1) # Network, I/O task here print("ended") actor = AsyncActor.options(max_concurrency=2).remote() # Only 2 tasks will be running concurrently. Once 2 finish, the next 2 should run. ray.get([actor.run_task.remote() for _ in range(8)]) (AsyncActor pid=5859) started (AsyncActor pid=5859) started (AsyncActor pid=5859) ended (AsyncActor pid=5859) ended (AsyncActor pid=5859) started (AsyncActor pid=5859) started (AsyncActor pid=5859) ended (AsyncActor pid=5859) ended (AsyncActor pid=5859) started (AsyncActor pid=5859) started (AsyncActor pid=5859) ended (AsyncActor pid=5859) ended (AsyncActor pid=5859) started (AsyncActor pid=5859) started (AsyncActor pid=5859) ended (AsyncActor pid=5859) ended Threaded Actors Sometimes, asyncio is not an ideal solution for your actor. For example, you may have one method that performs some computation heavy task while blocking the event loop, not giving up control via await. 
This would hurt the performance of an Async Actor because Async Actors can only execute one task at a time and rely on await to context switch. Instead, you can use the max_concurrency actor option without any async methods, allowing you to achieve threaded concurrency (like a thread pool). When there is at least one async def method in the actor definition, Ray will recognize the actor as an AsyncActor instead of a ThreadedActor. @ray.remote class ThreadedActor: def task_1(self): print("I'm running in a thread!") def task_2(self): print("I'm running in another thread!") a = ThreadedActor.options(max_concurrency=2).remote() ray.get([a.task_1.remote(), a.task_2.remote()]) (ThreadedActor pid=4822) I'm running in a thread! (ThreadedActor pid=4822) I'm running in another thread! Each invocation of the threaded actor will be running in a thread pool. The size of the thread pool is limited by the max_concurrency value. AsyncIO for Remote Tasks We don't support asyncio for remote tasks. The following snippet will fail: @ray.remote async def f(): pass Instead, you can wrap the async function with a wrapper to run the task synchronously: async def f(): pass @ray.remote def wrapper(): import asyncio asyncio.run(f()) Limiting Concurrency Per-Method with Concurrency Groups Besides setting the max concurrency overall for an asyncio actor, Ray allows methods to be separated into concurrency groups, each with its own asyncio event loop. This allows you to limit the concurrency per-method, e.g., allow a health-check method to be given its own concurrency quota separate from request-serving methods. Concurrency groups are only supported for asyncio actors, not threaded actors. Defining Concurrency Groups This defines two concurrency groups, "io" with max concurrency = 2 and "compute" with max concurrency = 4. The methods f1 and f2 are placed in the "io" group, and the methods f3 and f4 are placed into the "compute" group. Note that there is always a default concurrency group, which has a default concurrency of 1000 in Python and 1 in Java. Python You can define concurrency groups for asyncio actors using the concurrency_group decorator argument: import ray @ray.remote(concurrency_groups={"io": 2, "compute": 4}) class AsyncIOActor: def __init__(self): pass @ray.method(concurrency_group="io") async def f1(self): pass @ray.method(concurrency_group="io") async def f2(self): pass @ray.method(concurrency_group="compute") async def f3(self): pass @ray.method(concurrency_group="compute") async def f4(self): pass async def f5(self): pass a = AsyncIOActor.remote() a.f1.remote() # executed in the "io" group. a.f2.remote() # executed in the "io" group. a.f3.remote() # executed in the "compute" group. a.f4.remote() # executed in the "compute" group. a.f5.remote() # executed in the default group.
Java You can define concurrency groups for concurrent actors using the API setConcurrencyGroups() argument: class ConcurrentActor { public long f1() { return Thread.currentThread().getId(); } public long f2() { return Thread.currentThread().getId(); } public long f3(int a, int b) { return Thread.currentThread().getId(); } public long f4() { return Thread.currentThread().getId(); } public long f5() { return Thread.currentThread().getId(); } } ConcurrencyGroup group1 = new ConcurrencyGroupBuilder() .setName("io") .setMaxConcurrency(1) .addMethod(ConcurrentActor::f1) .addMethod(ConcurrentActor::f2) .build(); ConcurrencyGroup group2 = new ConcurrencyGroupBuilder() .setName("compute") .setMaxConcurrency(1) .addMethod(ConcurrentActor::f3) .addMethod(ConcurrentActor::f4) .build(); ActorHandle myActor = Ray.actor(ConcurrentActor::new) .setConcurrencyGroups(group1, group2) .remote(); myActor.task(ConcurrentActor::f1).remote(); // executed in the "io" group. myActor.task(ConcurrentActor::f2).remote(); // executed in the "io" group. myActor.task(ConcurrentActor::f3, 3, 5).remote(); // executed in the "compute" group. myActor.task(ConcurrentActor::f4).remote(); // executed in the "compute" group. myActor.task(ConcurrentActor::f5).remote(); // executed in the "default" group. Default Concurrency Group By default, methods are placed in a default concurrency group which has a concurrency limit of 1000 in Python, 1 in Java. The concurrency of the default group can be changed by setting the max_concurrency actor option. Python The following AsyncIOActor has 2 concurrency groups: “io” and “default”. The max concurrency of “io” is 2, and the max concurrency of “default” is 10. @ray.remote(concurrency_groups={"io": 2}) class AsyncIOActor: async def f1(self): pass actor = AsyncIOActor.options(max_concurrency=10).remote() Java The following concurrent actor has 2 concurrency groups: “io” and “default”. The max concurrency of “io” is 2, and the max concurrency of “default” is 10. class ConcurrentActor: public long f1() { return Thread.currentThread().getId(); } ConcurrencyGroup group = new ConcurrencyGroupBuilder() .setName("io") .setMaxConcurrency(2) .addMethod(ConcurrentActor::f1) .build(); ActorHandle myActor = Ray.actor(ConcurrentActor::new) .setConcurrencyGroups(group1) .setMaxConcurrency(10) .remote(); Setting the Concurrency Group at Runtime You can also dispatch actor methods into a specific concurrency group at runtime. The following snippet demonstrates setting the concurrency group of the f2 method dynamically at runtime. Python You can use the .options method. # Executed in the "io" group (as defined in the actor class). a.f2.options().remote() # Executed in the "compute" group. a.f2.options(concurrency_group="compute").remote() Java You can use setConcurrencyGroup method. // Executed in the "io" group (as defined in the actor creation). myActor.task(ConcurrentActor::f2).remote(); // Executed in the "compute" group. myActor.task(ConcurrentActor::f2).setConcurrencyGroup("compute").remote(); Utility Classes Actor Pool Python The ray.util module contains a utility class, ActorPool. This class is similar to multiprocessing.Pool and lets you schedule Ray tasks over a fixed pool of actors. import ray from ray.util import ActorPool @ray.remote class Actor: def double(self, n): return n * 2 a1, a2 = Actor.remote(), Actor.remote() pool = ActorPool([a1, a2]) # pool.map(..) 
returns a Python generator object ActorPool.map gen = pool.map(lambda a, v: a.double.remote(v), [1, 2, 3, 4]) print(list(gen)) # [2, 4, 6, 8] See the package reference for more information. Java Actor pool hasn’t been implemented in Java yet. C++ Actor pool hasn’t been implemented in C++ yet. Message passing using Ray Queue Sometimes just using one signal to synchronize is not enough. If you need to send data among many tasks or actors, you can use ray.util.queue.Queue. import ray from ray.util.queue import Queue, Empty ray.init() # You can pass this object around to different tasks/actors queue = Queue(maxsize=100) @ray.remote def consumer(id, queue): try: while True: next_item = queue.get(block=True, timeout=1) print(f"consumer {id} got work {next_item}") except Empty: pass [queue.put(i) for i in range(10)] print("Put work 1 - 10 to queue...") consumers = [consumer.remote(id, queue) for id in range(2)] ray.get(consumers) Ray’s Queue API has a similar API to Python’s asyncio.Queue and queue.Queue. Out-of-band Communication Typically, Ray actor communication is done through actor method calls and data is shared through the distributed object store. However, in some use cases out-of-band communication can be useful. Wrapping Library Processes Many libraries already have mature, high-performance internal communication stacks and they leverage Ray as a language-integrated actor scheduler. The actual communication between actors is mostly done out-of-band using existing communication stacks. For example, Horovod-on-Ray uses NCCL or MPI-based collective communications, and RayDP uses Spark’s internal RPC and object manager. See Ray Distributed Library Patterns for more details. Ray Collective Ray’s collective communication library (ray.util.collective) allows efficient out-of-band collective and point-to-point communication between distributed CPUs or GPUs. See Ray Collective for more details. HTTP Server You can start a http server inside the actor and expose http endpoints to clients so users outside of the ray cluster can communicate with the actor. Python import ray import asyncio import requests from aiohttp import web @ray.remote class Counter: async def __init__(self): self.counter = 0 asyncio.get_running_loop().create_task(self.run_http_server()) async def run_http_server(self): app = web.Application() app.add_routes([web.get("/", self.get)]) runner = web.AppRunner(app) await runner.setup() site = web.TCPSite(runner, "127.0.0.1", 25001) await site.start() async def get(self, request): return web.Response(text=str(self.counter)) async def increment(self): self.counter = self.counter + 1 ray.init() counter = Counter.remote() [ray.get(counter.increment.remote()) for i in range(5)] r = requests.get("http://127.0.0.1:25001/") assert r.text == "5" Similarly, you can expose other types of servers as well (e.g., gRPC servers). Limitations When using out-of-band communication with Ray actors, keep in mind that Ray does not manage the calls between actors. This means that functionality like distributed reference counting will not work with out-of-band communication, so you should take care not to pass object references in this way. Actor Task Execution Order Synchronous, Single-Threaded Actor In Ray, an actor receives tasks from multiple submitters (including driver and workers). For tasks received from the same submitter, a synchronous, single-threaded actor executes them following the submission order. 
In other words, a given task will not be executed until previously submitted tasks from the same submitter have finished execution. Python import ray @ray.remote class Counter: def __init__(self): self.value = 0 def add(self, addition): self.value += addition return self.value counter = Counter.remote() # For tasks from the same submitter, # they are executed according to submission order. value0 = counter.add.remote(1) value1 = counter.add.remote(2) # Output: 1. The first submitted task is executed first. print(ray.get(value0)) # Output: 3. The later submitted task is executed later. print(ray.get(value1)) 1 3 However, the actor does not guarantee the execution order of the tasks from different submitters. For example, suppose an unfulfilled argument blocks a previously submitted task. In this case, the actor can still execute tasks submitted by a different worker. Python import time import ray @ray.remote class Counter: def __init__(self): self.value = 0 def add(self, addition): self.value += addition return self.value counter = Counter.remote() # Submit task from a worker @ray.remote def submitter(value): return ray.get(counter.add.remote(value)) # Simulate delayed result resolution. @ray.remote def delayed_resolution(value): time.sleep(5) return value # Submit tasks from different workers, with # the first submitted task waiting for # dependency resolution. value0 = submitter.remote(delayed_resolution.remote(1)) value1 = submitter.remote(2) # Output: 3. The first submitted task is executed later. print(ray.get(value0)) # Output: 2. The later submitted task is executed first. print(ray.get(value1)) 3 2 Asynchronous or Threaded Actor Asynchronous or threaded actors do not guarantee the task execution order. This means the system might execute a task even though previously submitted tasks are pending execution. Python import time import ray @ray.remote class AsyncCounter: def __init__(self): self.value = 0 async def add(self, addition): self.value += addition return self.value counter = AsyncCounter.remote() # Simulate delayed result resolution. @ray.remote def delayed_resolution(value): time.sleep(5) return value # Submit tasks from the driver, with # the first submitted task waiting for # dependency resolution. value0 = counter.add.remote(delayed_resolution.remote(1)) value1 = counter.add.remote(2) # Output: 3. The first submitted task is executed later. print(ray.get(value0)) # Output: 2. The later submitted task is executed first. print(ray.get(value1)) 3 2 Objects In Ray, tasks and actors create and compute on objects. We refer to these objects as remote objects because they can be stored anywhere in a Ray cluster, and we use object refs to refer to them. Remote objects are cached in Ray’s distributed shared-memory object store, and there is one object store per node in the cluster. In the cluster setting, a remote object can live on one or many nodes, independent of who holds the object ref(s). An object ref is essentially a pointer or a unique ID that can be used to refer to a remote object without seeing its value. If you’re familiar with futures, Ray object refs are conceptually similar. Object refs can be created in two ways. They are returned by remote function calls. They are returned by ray.put(). Python import ray # Put an object in Ray's object store. y = 1 object_ref = ray.put(y) Java // Put an object in Ray's object store. int y = 1; ObjectRef objectRef = Ray.put(y); C++ // Put an object in Ray's object store. 
int y = 1; ray::ObjectRef object_ref = ray::Put(y); Remote objects are immutable. That is, their values cannot be changed after creation. This allows remote objects to be replicated in multiple object stores without needing to synchronize the copies. Fetching Object Data You can use the ray.get() method to fetch the result of a remote object from an object ref. If the current node’s object store does not contain the object, the object is downloaded. Python If the object is a numpy array or a collection of numpy arrays, the get call is zero-copy and returns arrays backed by shared object store memory. Otherwise, we deserialize the object data into a Python object. import ray import time # Get the value of one object ref. obj_ref = ray.put(1) assert ray.get(obj_ref) == 1 # Get the values of multiple object refs in parallel. assert ray.get([ray.put(i) for i in range(3)]) == [0, 1, 2] # You can also set a timeout to return early from a ``get`` # that's blocking for too long. from ray.exceptions import GetTimeoutError # ``GetTimeoutError`` is a subclass of ``TimeoutError``. @ray.remote def long_running_function(): time.sleep(8) obj_ref = long_running_function.remote() try: ray.get(obj_ref, timeout=4) except GetTimeoutError: # You can capture the standard "TimeoutError" instead print("`get` timed out.") `get` timed out. Java // Get the value of one object ref. ObjectRef objRef = Ray.put(1); Assert.assertTrue(objRef.get() == 1); // You can also set a timeout(ms) to return early from a ``get`` that's blocking for too long. Assert.assertTrue(objRef.get(1000) == 1); // Get the values of multiple object refs in parallel. List> objectRefs = new ArrayList<>(); for (int i = 0; i < 3; i++) { objectRefs.add(Ray.put(i)); } List results = Ray.get(objectRefs); Assert.assertEquals(results, ImmutableList.of(0, 1, 2)); // Ray.get timeout example: Ray.get will throw an RayTimeoutException if time out. public class MyRayApp { public static int slowFunction() throws InterruptedException { TimeUnit.SECONDS.sleep(10); return 1; } } Assert.assertThrows(RayTimeoutException.class, () -> Ray.get(Ray.task(MyRayApp::slowFunction).remote(), 3000)); C++ // Get the value of one object ref. ray::ObjectRef obj_ref = ray::Put(1); assert(*obj_ref.Get() == 1); // Get the values of multiple object refs in parallel. std::vector> obj_refs; for (int i = 0; i < 3; i++) { obj_refs.emplace_back(ray::Put(i)); } auto results = ray::Get(obj_refs); assert(results.size() == 3); assert(*results[0] == 0); assert(*results[1] == 1); assert(*results[2] == 2); Passing Object Arguments Ray object references can be freely passed around a Ray application. This means that they can be passed as arguments to tasks, actor methods, and even stored in other objects. Objects are tracked via distributed reference counting, and their data is automatically freed once all references to the object are deleted. There are two different ways one can pass an object to a Ray task or method. Depending on the way an object is passed, Ray will decide whether to de-reference the object prior to task execution. Passing an object as a top-level argument: When an object is passed directly as a top-level argument to a task, Ray will de-reference the object. This means that Ray will fetch the underlying data for all top-level object reference arguments, not executing the task until the object data becomes fully available. 
import ray @ray.remote def echo(a: int, b: int, c: int): """This function prints its input values to stdout.""" print(a, b, c) # Passing the literal values (1, 2, 3) to `echo`. echo.remote(1, 2, 3) # -> prints "1 2 3" # Put the values (1, 2, 3) into Ray's object store. a, b, c = ray.put(1), ray.put(2), ray.put(3) # Passing an object as a top-level argument to `echo`. Ray will de-reference top-level # arguments, so `echo` will see the literal values (1, 2, 3) in this case as well. echo.remote(a, b, c) # -> prints "1 2 3" Passing an object as a nested argument: When an object is passed within a nested object, for example, within a Python list, Ray will not de-reference it. This means that the task will need to call ray.get() on the reference to fetch the concrete value. However, if the task never calls ray.get(), then the object value never needs to be transferred to the machine the task is running on. We recommend passing objects as top-level arguments where possible, but nested arguments can be useful for passing objects on to other tasks without needing to see the data. import ray @ray.remote def echo_and_get(x_list): # List[ObjectRef] """This function prints its input values to stdout.""" print("args:", x_list) print("values:", ray.get(x_list)) # Put the values (1, 2, 3) into Ray's object store. a, b, c = ray.put(1), ray.put(2), ray.put(3) # Passing an object as a nested argument to `echo_and_get`. Ray does not # de-reference nested args, so `echo_and_get` sees the references. echo_and_get.remote([a, b, c]) # -> prints args: [ObjectRef(...), ObjectRef(...), ObjectRef(...)] # values: [1, 2, 3] The top-level vs not top-level passing convention also applies to actor constructors and actor method calls: @ray.remote class Actor: def __init__(self, arg): pass def method(self, arg): pass obj = ray.put(2) # Examples of passing objects to actor constructors. actor_handle = Actor.remote(obj) # by-value actor_handle = Actor.remote([obj]) # by-reference # Examples of passing objects to actor method calls. actor_handle.method.remote(obj) # by-value actor_handle.method.remote([obj]) # by-reference Closure Capture of Objects You can also pass objects to tasks via closure-capture. This can be convenient when you have a large object that you want to share verbatim between many tasks or actors, and don’t want to pass it repeatedly as an argument. Be aware however that defining a task that closes over an object ref will pin the object via reference-counting, so the object will not be evicted until the job completes. import ray # Put the values (1, 2, 3) into Ray's object store. a, b, c = ray.put(1), ray.put(2), ray.put(3) @ray.remote def print_via_capture(): """This function prints the values of (a, b, c) to stdout.""" print(ray.get([a, b, c])) # Passing object references via closure-capture. Inside the `print_via_capture` # function, the global object refs (a, b, c) can be retrieved and printed. print_via_capture.remote() # -> prints [1, 2, 3] Nested Objects Ray also supports nested object references. This allows you to build composite objects that themselves hold references to further sub-objects. # Objects can be nested within each other. Ray will keep the inner object # alive via reference counting until all outer object references are deleted. object_ref_2 = ray.put([object_ref]) Fault Tolerance Ray can automatically recover from object data loss via lineage reconstruction but not owner failure. See Ray fault tolerance for more details. 
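To make the nested-objects example above concrete, here is a short sketch (variable names are illustrative) showing that ray.get only resolves the outer object; inner references must be fetched with a separate ray.get:

import ray

object_ref = ray.put(1)
object_ref_2 = ray.put([object_ref])

# Fetching the outer object returns the list containing the inner ObjectRef,
# not the inner value itself.
inner_refs = ray.get(object_ref_2)
assert ray.get(inner_refs[0]) == 1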
More about Ray Objects Serialization Since Ray processes do not share memory space, data transferred between workers and nodes will need to serialized and deserialized. Ray uses the Plasma object store to efficiently transfer objects across different processes and different nodes. Numpy arrays in the object store are shared between workers on the same node (zero-copy deserialization). Overview Ray has decided to use a customized Pickle protocol version 5 backport to replace the original PyArrow serializer. This gets rid of several previous limitations (e.g. cannot serialize recursive objects). Ray is currently compatible with Pickle protocol version 5, while Ray supports serialization of a wider range of objects (e.g. lambda & nested functions, dynamic classes) with the help of cloudpickle. Plasma Object Store Plasma is an in-memory object store. It has been originally developed as part of Apache Arrow. Prior to Ray’s version 1.0.0 release, Ray forked Arrow’s Plasma code into Ray’s code base in order to disentangle and continue development with respect to Ray’s architecture and performance needs. Plasma is used to efficiently transfer objects across different processes and different nodes. All objects in Plasma object store are immutable and held in shared memory. This is so that they can be accessed efficiently by many workers on the same node. Each node has its own object store. When data is put into the object store, it does not get automatically broadcasted to other nodes. Data remains local to the writer until requested by another task or actor on another node. Numpy Arrays Ray optimizes for numpy arrays by using Pickle protocol 5 with out-of-band data. The numpy array is stored as a read-only object, and all Ray workers on the same node can read the numpy array in the object store without copying (zero-copy reads). Each numpy array object in the worker process holds a pointer to the relevant array held in shared memory. Any writes to the read-only object will require the user to first copy it into the local process memory. You can often avoid serialization issues by using only native types (e.g., numpy arrays or lists/dicts of numpy arrays and other primitive types), or by using Actors hold objects that cannot be serialized. Fixing “assignment destination is read-only” Because Ray puts numpy arrays in the object store, when deserialized as arguments in remote functions they will become read-only. For example, the following code snippet will crash: import ray import numpy as np @ray.remote def f(arr): # arr = arr.copy() # Adding a copy will fix the error. arr[0] = 1 try: ray.get(f.remote(np.zeros(100))) except ray.exceptions.RayTaskError as e: print(e) # ray.exceptions.RayTaskError(ValueError): ray::f() # File "test.py", line 6, in f # arr[0] = 1 # ValueError: assignment destination is read-only To avoid this issue, you can manually copy the array at the destination if you need to mutate it (arr = arr.copy()). Note that this is effectively like disabling the zero-copy deserialization feature provided by Ray. Serialization notes Ray is currently using Pickle protocol version 5. The default pickle protocol used by most python distributions is protocol 3. Protocol 4 & 5 are more efficient than protocol 3 for larger objects. For non-native objects, Ray will always keep a single copy even it is referred multiple times in an object: import ray import numpy as np obj = [np.zeros(42)] * 99 l = ray.get(ray.put(obj)) assert l[0] is l[1] # no problem! 
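As a quick way to observe the read-only, shared-memory behavior described in the Numpy Arrays section above, you can inspect the writeable flag of a fetched array. This is a minimal sketch; the array size is arbitrary and simply chosen to be large enough to live in the object store:

import numpy as np
import ray

arr_ref = ray.put(np.zeros(10**6))
restored = ray.get(arr_ref)

# Arrays fetched from the object store are backed by shared memory, so the
# returned array is expected to be read-only.
print(restored.flags.writeable)  # Typically False for object-store-backed arrays.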
Whenever possible, use numpy arrays or Python collections of numpy arrays for maximum performance. Lock objects are mostly unserializable, because copying a lock is meaningless and could cause serious concurrency problems. You may have to come up with a workaround if your object contains a lock. Customized Serialization Sometimes you may want to customize your serialization process because the default serializer used by Ray (pickle5 + cloudpickle) does not work for you (fail to serialize some objects, too slow for certain objects, etc.). There are at least 3 ways to define your custom serialization process: If you want to customize the serialization of a type of objects, and you have access to the code, you can define __reduce__ function inside the corresponding class. This is commonly done by most Python libraries. Example code: import ray import sqlite3 class DBConnection: def __init__(self, path): self.path = path self.conn = sqlite3.connect(path) # without '__reduce__', the instance is unserializable. def __reduce__(self): deserializer = DBConnection serialized_data = (self.path,) return deserializer, serialized_data original = DBConnection("/tmp/db") print(original.conn) copied = ray.get(ray.put(original)) print(copied.conn) If you want to customize the serialization of a type of objects, but you cannot access or modify the corresponding class, you can register the class with the serializer you use: import ray import threading class A: def __init__(self, x): self.x = x self.lock = threading.Lock() # could not be serialized! try: ray.get(ray.put(A(1))) # fail! except TypeError: pass def custom_serializer(a): return a.x def custom_deserializer(b): return A(b) # Register serializer and deserializer for class A: ray.util.register_serializer( A, serializer=custom_serializer, deserializer=custom_deserializer) ray.get(ray.put(A(1))) # success! # You can deregister the serializer at any time. ray.util.deregister_serializer(A) try: ray.get(ray.put(A(1))) # fail! except TypeError: pass # Nothing happens when deregister an unavailable serializer. ray.util.deregister_serializer(A) NOTE: Serializers are managed locally for each Ray worker. So for every Ray worker, if you want to use the serializer, you need to register the serializer. Deregister a serializer also only applies locally. If you register a new serializer for a class, the new serializer would replace the old serializer immediately in the worker. This API is also idempotent, there are no side effects caused by re-registering the same serializer. We also provide you an example, if you want to customize the serialization of a specific object: import threading class A: def __init__(self, x): self.x = x self.lock = threading.Lock() # could not serialize! try: ray.get(ray.put(A(1))) # fail! except TypeError: pass class SerializationHelperForA: """A helper class for serialization.""" def __init__(self, a): self.a = a def __reduce__(self): return A, (self.a.x,) ray.get(ray.put(SerializationHelperForA(A(1)))) # success! # the serializer only works for a specific object, not all A # instances, so we still expect failure here. try: ray.get(ray.put(A(1))) # still fail! except TypeError: pass Troubleshooting Use ray.util.inspect_serializability to identify tricky pickling issues. This function can be used to trace a potential non-serializable object within any Python object – whether it be a function, class, or object instance. 
Below, we demonstrate this behavior on a function with a non-serializable object (threading lock): from ray.util import inspect_serializability import threading lock = threading.Lock() def test(): print(lock) inspect_serializability(test, name="test") The resulting output is: ============================================================= Checking Serializability of ============================================================= !!! FAIL serialization: cannot pickle '_thread.lock' object Detected 1 global variables. Checking serializability... Serializing 'lock' ... !!! FAIL serialization: cannot pickle '_thread.lock' object WARNING: Did not find non-serializable object in . This may be an oversight. ============================================================= Variable: FailTuple(lock [obj=, parent=]) was found to be non-serializable. There may be multiple other undetected variables that were non-serializable. Consider either removing the instantiation/imports of these variables or moving the instantiation into the scope of the function/class. ============================================================= Check https://docs.ray.io/en/master/ray-core/objects/serialization.html#troubleshooting for more information. If you have any suggestions on how to improve this error message, please reach out to the Ray developers on github.com/ray-project/ray/issues/ ============================================================= For even more detailed information, set environmental variable RAY_PICKLE_VERBOSE_DEBUG='2' before importing Ray. This enables serialization with python-based backend instead of C-Pickle, so you can debug into python code at the middle of serialization. However, this would make serialization much slower. Known Issues Users could experience memory leak when using certain python3.8 & 3.9 versions. This is due to a bug in python’s pickle module. This issue has been solved for Python 3.8.2rc1, Python 3.9.0 alpha 4 or late versions. Object Spilling Ray 1.3+ spills objects to external storage once the object store is full. By default, objects are spilled to Ray’s temporary directory in the local filesystem. Single node Ray uses object spilling by default. Without any setting, objects are spilled to [temp_folder]/spill. On Linux and MacOS, the temp_folder is /tmp by default. To configure the directory where objects are spilled to, use: import ray ray.shutdown() import json import ray ray.init( _system_config={ "object_spilling_config": json.dumps( {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}, ) }, ) You can also specify multiple directories for spilling to spread the IO load and disk space usage across multiple physical devices if needed (e.g., SSD devices): ray.shutdown() import json import ray ray.init( _system_config={ "max_io_workers": 4, # More IO workers for parallelism. "object_spilling_config": json.dumps( { "type": "filesystem", "params": { # Multiple directories can be specified to distribute # IO across multiple mounted physical devices. "directory_path": [ "/tmp/spill", "/tmp/spill_1", "/tmp/spill_2", ] }, } ) }, ) To optimize the performance, it is recommended to use an SSD instead of an HDD when using object spilling for memory-intensive workloads. If you are using an HDD, it is recommended that you specify a large buffer size (> 1MB) to reduce IO requests during spilling. 
ray.shutdown() import json import ray ray.init( _system_config={ "object_spilling_config": json.dumps( { "type": "filesystem", "params": { "directory_path": "/tmp/spill", "buffer_size": 1_000_000, } }, ) }, ) To prevent running out of disk space, local object spilling will throw OutOfDiskError if the disk utilization exceeds the predefined threshold. If multiple physical devices are used, any physical device’s over-usage will trigger the OutOfDiskError. The default threshold is 0.95 (95%). You can adjust the threshold by setting local_fs_capacity_threshold, or set it to 1 to disable the protection. ray.shutdown() import json import ray ray.init( _system_config={ # Allow spilling until the local disk is 99% utilized. # This only affects spilling to the local file system. "local_fs_capacity_threshold": 0.99, "object_spilling_config": json.dumps( { "type": "filesystem", "params": { "directory_path": "/tmp/spill", } }, ) }, ) To enable object spilling to remote storage (any URI supported by smart_open): ray.shutdown() import json import ray ray.init( _system_config={ "max_io_workers": 4, # More IO workers for remote storage. "min_spilling_size": 100 * 1024 * 1024, # Spill at least 100MB at a time. "object_spilling_config": json.dumps( { "type": "smart_open", "params": { "uri": "s3://bucket/path" }, "buffer_size": 100 * 1024 * 1024, # Use a 100MB buffer for writes }, ) }, ) It is recommended that you specify a large buffer size (> 1MB) to reduce IO requests during spilling. Spilling to multiple remote storages is also supported. ray.shutdown() import json import ray ray.init( _system_config={ "max_io_workers": 4, # More IO workers for remote storage. "min_spilling_size": 100 * 1024 * 1024, # Spill at least 100MB at a time. "object_spilling_config": json.dumps( { "type": "smart_open", "params": { "uri": ["s3://bucket/path1", "s3://bucket/path2", "s3://bucket/path3"], }, "buffer_size": 100 * 1024 * 1024, # Use a 100MB buffer for writes }, ) }, ) Remote storage support is still experimental. Cluster mode To enable object spilling in multi node clusters: # Note that `object_spilling_config`'s value should be json format. # You only need to specify the config when starting the head node, all the worker nodes will get the same config from the head node. ray start --head --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}' Stats When spilling is happening, the following INFO level messages will be printed to the raylet logs (e.g., /tmp/ray/session_latest/logs/raylet.out): local_object_manager.cc:166: Spilled 50 MiB, 1 objects, write throughput 230 MiB/s local_object_manager.cc:334: Restored 50 MiB, 1 objects, read throughput 505 MiB/s You can also view cluster-wide spill stats by using the ray memory command: --- Aggregate object store stats across all nodes --- Plasma memory usage 50 MiB, 1 objects, 50.0% full Spilled 200 MiB, 4 objects, avg write throughput 570 MiB/s Restored 150 MiB, 3 objects, avg read throughput 1361 MiB/s If you only want to display cluster-wide spill stats, use ray memory --stats-only. Environment Dependencies Your Ray application may have dependencies that exist outside of your Ray script. For example: Your Ray script may import/depend on some Python packages. Your Ray script may be looking for some specific environment variables to be available. Your Ray script may import some files outside of the script. 
One frequent problem when running on a cluster is that Ray expects these “dependencies” to exist on each Ray node. If these are not present, you may run into issues such as ModuleNotFoundError, FileNotFoundError, and so on. To address this problem, you can (1) prepare your dependencies on the cluster in advance (e.g., using a container image) using the Ray Cluster Launcher, or (2) use Ray’s runtime environments to install them on the fly. For production usage or non-changing environments, we recommend installing your dependencies into a container image and specifying the image using the Cluster Launcher. For dynamic environments (e.g., for development and experimentation), we recommend using runtime environments. Concepts Ray Application. A program including a Ray script that calls ray.init() and uses Ray tasks or actors. Dependencies, or Environment. Anything outside of the Ray script that your application needs to run, including files, packages, and environment variables. Files. Code files, data files, or other files that your Ray application needs to run. Packages. External libraries or executables required by your Ray application, often installed via pip or conda. Local machine and Cluster. Usually, you may want to separate the Ray cluster compute machines/pods from the machine/pod that handles and submits the application. You can submit a Ray Job via the Ray Job Submission mechanism, or use ray attach to connect to a cluster interactively. We call the machine submitting the job your local machine. Job. A Ray job is a single application: it is the collection of Ray tasks, objects, and actors that originate from the same script. Preparing an environment using the Ray Cluster launcher The first way to set up dependencies is to prepare a single environment across the cluster before starting the Ray runtime. You can build all your files and dependencies into a container image and specify this in your Cluster YAML Configuration. You can also install packages using setup_commands in the Ray Cluster configuration file (reference); these commands will be run as each node joins the cluster. Note that for production settings, it is recommended to build any necessary packages into a container image instead. You can push local files to the cluster using ray rsync_up (reference). Runtime environments This feature requires a full installation of Ray using pip install "ray[default]". This feature is available starting with Ray 1.4.0 and is currently supported on macOS and Linux, with beta support on Windows. The second way to set up dependencies is to install them dynamically while Ray is running. A runtime environment describes the dependencies your Ray application needs to run, including files, packages, environment variables, and more. It is installed dynamically on the cluster at runtime and cached for future use (see Caching and Garbage Collection for details about the lifecycle). Runtime environments can be used on top of the prepared environment from the Ray Cluster launcher if it was used. For example, you can use the Cluster launcher to install a base set of packages, and then use runtime environments to install additional packages. In contrast with the base cluster environment, a runtime environment will only be active for Ray processes. (For example, if using a runtime environment specifying a pip package my_pkg, the statement import my_pkg will fail if called outside of a Ray task, actor, or job.)
Runtime environments also allow you to set dependencies per-task, per-actor, and per-job on a long-running Ray cluster. import ray ray.shutdown() import ray runtime_env = {"pip": ["emoji"]} ray.init(runtime_env=runtime_env) @ray.remote def f(): import emoji return emoji.emojize('Python is :thumbs_up:') print(ray.get(f.remote())) Python is 👍 A runtime environment can be described by a Python dict: runtime_env = { "pip": ["emoji"], "env_vars": {"TF_WARNINGS": "none"} } Alternatively, you can use ray.runtime_env.RuntimeEnv: from ray.runtime_env import RuntimeEnv runtime_env = RuntimeEnv( pip=["emoji"], env_vars={"TF_WARNINGS": "none"} ) For more examples, jump to the API Reference. There are two primary scopes for which you can specify a runtime environment: Per-Job, and Per-Task/Actor, within a job. Specifying a Runtime Environment Per-Job You can specify a runtime environment for your whole job, whether running a script directly on the cluster, using the Ray Jobs API: # Option 1: Starting a single-node local Ray cluster or connecting to existing local cluster ray.init(runtime_env=runtime_env) # Option 2: Using Ray Jobs API (Python SDK) from ray.job_submission import JobSubmissionClient client = JobSubmissionClient("http://:8265") job_id = client.submit_job( entrypoint="python my_ray_script.py", runtime_env=runtime_env, ) # Option 3: Using Ray Jobs API (CLI). (Note: can use --runtime-env to pass a YAML file instead of an inline JSON string.) $ ray job submit --address="http://:8265" --runtime-env-json='{"working_dir": "/data/my_files", "pip": ["emoji"]}' -- python my_ray_script.py If using the Ray Jobs API (either the Python SDK or the CLI), specify the runtime_env argument in the submit_job call or the ray job submit, not in the ray.init() call in the entrypoint script (in this example, my_ray_script.py). This ensures the runtime environment is installed on the cluster before the entrypoint script is run. There are two options for when to install the runtime environment: As soon as the job starts (i.e., as soon as ray.init() is called), the dependencies are eagerly downloaded and installed. The dependencies are installed only when a task is invoked or an actor is created. The default is option 1. To change the behavior to option 2, add "eager_install": False to the config of runtime_env. Specifying a Runtime Environment Per-Task or Per-Actor You can specify different runtime environments per-actor or per-task using .options() or the @ray.remote decorator: # Invoke a remote task that will run in a specified runtime environment. f.options(runtime_env=runtime_env).remote() # Instantiate an actor that will run in a specified runtime environment. actor = SomeClass.options(runtime_env=runtime_env).remote() # Specify a runtime environment in the task definition. Future invocations via # `g.remote()` will use this runtime environment unless overridden by using # `.options()` as above. @ray.remote(runtime_env=runtime_env) def g(): pass # Specify a runtime environment in the actor definition. Future instantiations # via `MyClass.remote()` will use this runtime environment unless overridden by # using `.options()` as above. @ray.remote(runtime_env=runtime_env) class MyClass: pass This allows you to have actors and tasks running in their own environments, independent of the surrounding environment. (The surrounding environment could be the job’s runtime environment, or the system environment of the cluster.) 
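For example, a job-level runtime environment and a per-task override can be combined in the same script; the package choices below are illustrative.

import ray

# Job-level runtime environment: applies to all tasks and actors by default.
ray.init(runtime_env={"pip": ["requests"]})

@ray.remote
def fetch_status():
    # Runs in the job's runtime environment, which provides requests.
    import requests
    return requests.get("https://www.ray.io/").status_code

@ray.remote
def emojize():
    # Runs in its own per-task runtime environment (see the override below).
    import emoji
    return emoji.emojize("Ray is :thumbs_up:")

print(ray.get(fetch_status.remote()))
# Per-task override: this invocation runs in an environment with emoji installed.
print(ray.get(emojize.options(runtime_env={"pip": ["emoji"]}).remote()))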
Ray does not guarantee compatibility between tasks and actors with conflicting runtime environments. For example, if an actor whose runtime environment contains a pip package tries to communicate with an actor with a different version of that package, it can lead to unexpected behavior such as unpickling errors. Common Workflows This section describes some common use cases for runtime environments. These use cases are not mutually exclusive; all of the options described below can be combined in a single runtime environment. Using Local Files Your Ray application might depend on source files or data files. For a development workflow, these might live on your local machine, but when it comes time to run things at scale, you will need to get them to your remote cluster. The following simple example explains how to get your local files on the cluster. import ray ray.shutdown() import os import ray os.makedirs("/tmp/runtime_env_working_dir", exist_ok=True) with open("/tmp/runtime_env_working_dir/hello.txt", "w") as hello_file: hello_file.write("Hello World!") # Specify a runtime environment for the entire Ray job ray.init(runtime_env={"working_dir": "/tmp/runtime_env_working_dir"}) # Create a Ray task, which inherits the above runtime env. @ray.remote def f(): # The function will have its working directory changed to its node's # local copy of /tmp/runtime_env_working_dir. return open("hello.txt").read() print(ray.get(f.remote())) Hello World! The example above is written to run on a local machine, but as for all of these examples, it also works when specifying a Ray cluster to connect to (e.g., using ray.init("ray://123.456.7.89:10001", runtime_env=...) or ray.init(address="auto", runtime_env=...)). The specified local directory will automatically be pushed to the cluster nodes when ray.init() is called. You can also specify files via a remote cloud storage URI; see Remote URIs for details. Using conda or pip packages Your Ray application might depend on Python packages (for example, pendulum or requests) via import statements. Ray ordinarily expects all imported packages to be preinstalled on every node of the cluster; in particular, these packages are not automatically shipped from your local machine to the cluster or downloaded from any repository. However, using runtime environments you can dynamically specify packages to be automatically downloaded and installed in a virtual environment for your Ray job, or for specific Ray tasks or actors. import ray ray.shutdown() import ray import requests # This example runs on a local machine, but you can also do # ray.init(address=..., runtime_env=...) to connect to a cluster. ray.init(runtime_env={"pip": ["requests"]}) @ray.remote def reqs(): return requests.get("https://www.ray.io/").status_code print(ray.get(reqs.remote())) 200 You may also specify your pip dependencies either via a Python list or a local requirements.txt file. Alternatively, you can specify a conda environment, either as a Python dictionary or via a local environment.yml file. This conda environment can include pip packages. For details, head to the API Reference. Since the packages in the runtime_env are installed at runtime, be cautious when specifying conda or pip packages whose installations involve building from source, as this can be slow. When using the "pip" field, the specified packages will be installed “on top of” the base environment using virtualenv, so existing packages on your cluster will still be importable. 
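As noted above, the pip field accepts either an inline list of requirement specifiers or a path to a local requirements.txt file, and a conda environment can be given as a dict or a local environment.yml file. A minimal sketch (the versions and file paths are illustrative):

# Equivalent ways to declare pip dependencies:
runtime_env_from_list = {"pip": ["requests==2.26.0", "pendulum"]}
runtime_env_from_file = {"pip": "./requirements.txt"}  # resolved relative to your local working directory

# A conda environment can be given by file instead:
runtime_env_from_conda = {"conda": "./environment.yml"}

# Any of these dictionaries can be passed to ray.init(), .options(), or @ray.remote.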
By contrast, when using the conda field, your Ray tasks and actors will run in an isolated environment. The conda and pip fields cannot both be used in a single runtime_env.

The ray[default] package itself will automatically be installed in the environment. For the conda field only, if you are using any other Ray libraries (for example, Ray Serve), then you will need to specify the library in the runtime environment (e.g. runtime_env = {"conda": {"dependencies": ["pytorch", "pip", {"pip": ["requests", "ray[serve]"]}]}}.)

conda environments must have the same Python version as the Ray cluster. Do not list ray in the conda dependencies, as it will be automatically installed.

Library Development

Suppose you are developing a library my_module on Ray. A typical iteration cycle will involve:

Making some changes to the source code of my_module

Running a Ray script to test the changes, perhaps on a distributed cluster.

To ensure your local changes show up across all Ray workers and can be imported properly, use the py_modules field.

import ray
import my_module

ray.init("ray://123.456.7.89:10001", runtime_env={"py_modules": [my_module]})

@ray.remote
def test_my_module():
    # No need to import my_module inside this function.
    my_module.test()

ray.get(test_my_module.remote())

Note: This feature is currently limited to modules that are packages with a single directory containing an __init__.py file. For single-file modules, you may use working_dir.

API Reference

The runtime_env is a Python dictionary or a Python class ray.runtime_env.RuntimeEnv including one or more of the following fields:

working_dir (str): Specifies the working directory for the Ray workers. This must either be (1) an existing local directory with total size at most 100 MiB, (2) an existing local zipped file with total unzipped size at most 100 MiB (Note: excludes has no effect), or (3) a URI to a remotely-stored zip file containing the working directory for your job. See Remote URIs for details. The specified directory will be downloaded to each node on the cluster, and Ray workers will be started in their node’s copy of this directory.

Examples

"." # cwd

"/src/my_project"

"/src/my_project.zip"

"s3://path/to/my_dir.zip"

Note: Setting a local directory per-task or per-actor is currently unsupported; it can only be set per-job (i.e., in ray.init()).

Note: If the local directory contains a .gitignore file, the files and paths specified there are not uploaded to the cluster. You can disable this by setting the environment variable RAY_RUNTIME_ENV_IGNORE_GITIGNORE=1 on the machine doing the uploading.

py_modules (List[str|module]): Specifies Python modules to be available for import in the Ray workers. (For more ways to specify packages, see also the pip and conda fields below.) Each entry must be either (1) a path to a local directory, (2) a URI to a remote zip file (see Remote URIs for details), (3) a Python module object, or (4) a path to a local whl file.

Examples of entries in the list:

"."

"/local_dependency/my_module"

"s3://bucket/my_module.zip"

my_module # Assumes my_module has already been imported, e.g. via 'import my_module'

my_module.whl

The modules will be downloaded to each node on the cluster.

Note: Setting options (1), (3), and (4) per-task or per-actor is currently unsupported; they can only be set per-job (i.e., in ray.init()).

Note: For option (1), if the local directory contains a .gitignore file, the files and paths specified there are not uploaded to the cluster.
You can disable this by setting the environment variable RAY_RUNTIME_ENV_IGNORE_GITIGNORE=1 on the machine doing the uploading. Note: This feature is currently limited to modules that are packages with a single directory containing an __init__.py file. For single-file modules, you may use working_dir. excludes (List[str]): When used with working_dir or py_modules, specifies a list of files or paths to exclude from being uploaded to the cluster. This field uses the pattern-matching syntax used by .gitignore files: see https://git-scm.com/docs/gitignore for details. Note: In accordance with .gitignore syntax, if there is a separator (/) at the beginning or middle (or both) of the pattern, then the pattern is interpreted relative to the level of the working_dir. In particular, you shouldn’t use absolute paths (e.g. /Users/my_working_dir/subdir/) with excludes; rather, you should use the relative path /subdir/ (written here with a leading / to match only the top-level subdir directory, rather than all directories named subdir at all levels.) Example: {"working_dir": "/Users/my_working_dir/", "excludes": ["my_file.txt", "/subdir/, "path/to/dir", "*.log"]} pip (dict | List[str] | str): Either (1) a list of pip requirements specifiers, (2) a string containing the path to a local pip “requirements.txt” file, or (3) a python dictionary that has three fields: (a) packages (required, List[str]): a list of pip packages, (b) pip_check (optional, bool): whether to enable pip check at the end of pip install, defaults to False. (c) pip_version (optional, str): the version of pip; Ray will spell the package name “pip” in front of the pip_version to form the final requirement string. The syntax of a requirement specifier is defined in full in PEP 508. This will be installed in the Ray workers at runtime. Packages in the preinstalled cluster environment will still be available. To use a library like Ray Serve or Ray Tune, you will need to include "ray[serve]" or "ray[tune]" here. The Ray version must match that of the cluster. Example: ["requests==1.0.0", "aiohttp", "ray[serve]"] Example: "./requirements.txt" Example: {"packages":["tensorflow", "requests"], "pip_check": False, "pip_version": "==22.0.2;python_version=='3.8.11'"} When specifying a path to a requirements.txt file, the file must be present on your local machine and it must be a valid absolute path or relative filepath relative to your local current working directory, not relative to the working_dir specified in the runtime_env. Furthermore, referencing local files within a requirements.txt file is not supported (e.g., -r ./my-laptop/more-requirements.txt, ./my-pkg.whl). conda (dict | str): Either (1) a dict representing the conda environment YAML, (2) a string containing the path to a local conda “environment.yml” file, or (3) the name of a local conda environment already installed on each node in your cluster (e.g., "pytorch_p36"). In the first two cases, the Ray and Python dependencies will be automatically injected into the environment to ensure compatibility, so there is no need to manually include them. The Python and Ray version must match that of the cluster, so you likely should not specify them manually. Note that the conda and pip keys of runtime_env cannot both be specified at the same time—to use them together, please use conda and add your pip dependencies in the "pip" field in your conda environment.yaml. 
Example: {"dependencies": ["pytorch", "torchvision", "pip", {"pip": ["pendulum"]}]} Example: "./environment.yml" Example: "pytorch_p36" When specifying a path to a environment.yml file, the file must be present on your local machine and it must be a valid absolute path or a relative filepath relative to your local current working directory, not relative to the working_dir specified in the runtime_env. Furthermore, referencing local files within a environment.yml file is not supported. env_vars (Dict[str, str]): Environment variables to set. Environment variables already set on the cluster will still be visible to the Ray workers; so there is no need to include os.environ or similar in the env_vars field. By default, these environment variables override the same name environment variables on the cluster. You can also reference existing environment variables using ${ENV_VAR} to achieve the appending behavior. Only PATH, LD_LIBRARY_PATH, DYLD_LIBRARY_PATH, and LD_PRELOAD are supported. See below for an example: Example: {"OMP_NUM_THREADS": "32", "TF_WARNINGS": "none"} Example: {"LD_LIBRARY_PATH": "${LD_LIBRARY_PATH}:/home/admin/my_lib"} container (dict): Require a given (Docker) image, and the worker process will run in a container with this image. The worker_path is the default_worker.py path. It is required only if ray installation directory in the container is different from raylet host. The run_options list spec is here. Example: {"image": "anyscale/ray-ml:nightly-py38-cpu", "worker_path": "/root/python/ray/workers/default_worker.py", "run_options": ["--cap-drop SYS_ADMIN","--log-level=debug"]} Note: container is experimental now. If you have some requirements or run into any problems, raise issues in github. config (dict | ray.runtime_env.RuntimeEnvConfig): config for runtime environment. Either a dict or a RuntimeEnvConfig. Fields: (1) setup_timeout_seconds, the timeout of runtime environment creation, timeout is in seconds. Example: {"setup_timeout_seconds": 10} Example: RuntimeEnvConfig(setup_timeout_seconds=10) (2) eager_install (bool): Indicates whether to install the runtime environment on the cluster at ray.init() time, before the workers are leased. This flag is set to True by default. If set to False, the runtime environment will be only installed when the first task is invoked or when the first actor is created. Currently, specifying this option per-actor or per-task is not supported. Example: {"eager_install": False} Example: RuntimeEnvConfig(eager_install=False) Caching and Garbage Collection Runtime environment resources on each node (such as conda environments, pip packages, or downloaded working_dir or py_modules directories) will be cached on the cluster to enable quick reuse across different runtime environments within a job. Each field (working_dir, py_modules, etc.) has its own cache whose size defaults to 10 GB. To change this default, you may set the environment variable RAY_RUNTIME_ENV__CACHE_SIZE_GB on each node in your cluster before starting Ray e.g. export RAY_RUNTIME_ENV_WORKING_DIR_CACHE_SIZE_GB=1.5. When the cache size limit is exceeded, resources not currently used by any actor, task or job will be deleted. Inheritance The runtime environment is inheritable, so it will apply to all tasks/actors within a job and all child tasks/actors of a task or actor once set, unless it is overridden. 
If an actor or task specifies a new runtime_env, it will override the parent’s runtime_env (i.e., the parent actor/task’s runtime_env, or the job’s runtime_env if there is no parent actor or task) as follows: The runtime_env["env_vars"] field will be merged with the runtime_env["env_vars"] field of the parent. This allows for environment variables set in the parent’s runtime environment to be automatically propagated to the child, even if new environment variables are set in the child’s runtime environment. Every other field in the runtime_env will be overridden by the child, not merged. For example, if runtime_env["py_modules"] is specified, it will replace the runtime_env["py_modules"] field of the parent. Example: # Parent's `runtime_env` {"pip": ["requests", "chess"], "env_vars": {"A": "a", "B": "b"}} # Child's specified `runtime_env` {"pip": ["torch", "ray[serve]"], "env_vars": {"B": "new", "C": "c"}} # Child's actual `runtime_env` (merged with parent's) {"pip": ["torch", "ray[serve]"], "env_vars": {"A": "a", "B": "new", "C": "c"}} Frequently Asked Questions Are environments installed on every node? If a runtime environment is specified in ray.init(runtime_env=...), then the environment will be installed on every node. See Per-Job for more details. (Note, by default the runtime environment will be installed eagerly on every node in the cluster. If you want to lazily install the runtime environment on demand, set the eager_install option to false: ray.init(runtime_env={..., "config": {"eager_install": False}}.) When is the environment installed? When specified per-job, the environment is installed when you call ray.init() (unless "eager_install": False is set). When specified per-task or per-actor, the environment is installed when the task is invoked or the actor is instantiated (i.e. when you call my_task.remote() or my_actor.remote().) See Per-Job Per-Task/Actor, within a job for more details. Where are the environments cached? Any local files downloaded by the environments are cached at /tmp/ray/session_latest/runtime_resources. How long does it take to install or to load from cache? The install time usually mostly consists of the time it takes to run pip install or conda create / conda activate, or to upload/download a working_dir, depending on which runtime_env options you’re using. This could take seconds or minutes. On the other hand, loading a runtime environment from the cache should be nearly as fast as the ordinary Ray worker startup time, which is on the order of a few seconds. A new Ray worker is started for every Ray actor or task that requires a new runtime environment. (Note that loading a cached conda environment could still be slow, since the conda activate command sometimes takes a few seconds.) You can set setup_timeout_seconds config to avoid the installation hanging for a long time. If the installation is not finished within this time, your tasks or actors will fail to start. What is the relationship between runtime environments and Docker? They can be used independently or together. A container image can be specified in the Cluster Launcher for large or static dependencies, and runtime environments can be specified per-job or per-task/actor for more dynamic use cases. The runtime environment will inherit packages, files, and environment variables from the container image. My runtime_env was installed, but when I log into the node I can’t import the packages. 
The runtime environment is only active for the Ray worker processes; it does not install any packages “globally” on the node. Remote URIs The working_dir and py_modules arguments in the runtime_env dictionary can specify either local path(s) or remote URI(s). A local path must be a directory path. The directory’s contents will be directly accessed as the working_dir or a py_module. A remote URI must be a link directly to a zip file. The zip file must contain only a single top-level directory. The contents of this directory will be directly accessed as the working_dir or a py_module. For example, suppose you want to use the contents in your local /some_path/example_dir directory as your working_dir. If you want to specify this directory as a local path, your runtime_env dictionary should contain: runtime_env = {..., "working_dir": "/some_path/example_dir", ...} Suppose instead you want to host your files in your /some_path/example_dir directory remotely and provide a remote URI. You would need to first compress the example_dir directory into a zip file. There should be no other files or directories at the top level of the zip file, other than example_dir. You can use the following command in the Terminal to do this: cd /some_path zip -r zip_file_name.zip example_dir Note that this command must be run from the parent directory of the desired working_dir to ensure that the resulting zip file contains a single top-level directory. In general, the zip file’s name and the top-level directory’s name can be anything. The top-level directory’s contents will be used as the working_dir (or py_module). You can check that the zip file contains a single top-level directory by running the following command in the Terminal: zipinfo -1 zip_file_name.zip # example_dir/ # example_dir/my_file_1.txt # example_dir/subdir/my_file_2.txt Suppose you upload the compressed example_dir directory to AWS S3 at the S3 URI s3://example_bucket/example.zip. Your runtime_env dictionary should contain: runtime_env = {..., "working_dir": "s3://example_bucket/example.zip", ...} Check for hidden files and metadata directories in zipped dependencies. You can inspect a zip file’s contents by running the zipinfo -1 zip_file_name.zip command in the Terminal. Some zipping methods can cause hidden files or metadata directories to appear in the zip file at the top level. To avoid this, use the zip -r command directly on the directory you want to compress from its parent’s directory. For example, if you have a directory structure such as: a/b and you what to compress b, issue the zip -r b command from the directory a. If Ray detects more than a single directory at the top level, it will use the entire zip file instead of the top-level directory, which may lead to unexpected behavior. Currently, three types of remote URIs are supported for hosting working_dir and py_modules packages: HTTPS: HTTPS refers to URLs that start with https. These are particularly useful because remote Git providers (e.g. GitHub, Bitbucket, GitLab, etc.) use https URLs as download links for repository archives. This allows you to host your dependencies on remote Git providers, push updates to them, and specify which dependency versions (i.e. commits) your jobs should use. To use packages via HTTPS URIs, you must have the smart_open library (you can install it using pip install smart_open). 
Example: runtime_env = {"working_dir": "https://github.com/example_username/example_respository/archive/HEAD.zip"} S3: S3 refers to URIs starting with s3:// that point to compressed packages stored in AWS S3. To use packages via S3 URIs, you must have the smart_open and boto3 libraries (you can install them using pip install smart_open and pip install boto3). Ray does not explicitly pass in any credentials to boto3 for authentication. boto3 will use your environment variables, shared credentials file, and/or AWS config file to authenticate access. See the AWS boto3 documentation to learn how to configure these. Example: runtime_env = {"working_dir": "s3://example_bucket/example_file.zip"} GS: GS refers to URIs starting with gs:// that point to compressed packages stored in Google Cloud Storage. To use packages via GS URIs, you must have the smart_open and google-cloud-storage libraries (you can install them using pip install smart_open and pip install google-cloud-storage). Ray does not explicitly pass in any credentials to the google-cloud-storage’s Client object. google-cloud-storage will use your local service account key(s) and environment variables by default. Follow the steps on Google Cloud Storage’s Getting started with authentication guide to set up your credentials, which allow Ray to access your remote package. Example: runtime_env = {"working_dir": "gs://example_bucket/example_file.zip"} Note that the smart_open, boto3, and google-cloud-storage packages are not installed by default, and it is not sufficient to specify them in the pip section of your runtime_env. The relevant packages must already be installed on all nodes of the cluster when Ray starts. Hosting a Dependency on a Remote Git Provider: Step-by-Step Guide You can store your dependencies in repositories on a remote Git provider (e.g. GitHub, Bitbucket, GitLab, etc.), and you can periodically push changes to keep them updated. In this section, you will learn how to store a dependency on GitHub and use it in your runtime environment. These steps will also be useful if you use another large, remote Git provider (e.g. BitBucket, GitLab, etc.). For simplicity, this section refers to GitHub alone, but you can follow along on your provider. First, create a repository on GitHub to store your working_dir contents or your py_module dependency. By default, when you download a zip file of your repository, the zip file will already contain a single top-level directory that holds the repository contents, so you can directly upload your working_dir contents or your py_module dependency to the GitHub repository. Once you have uploaded your working_dir contents or your py_module dependency, you need the HTTPS URL of the repository zip file, so you can specify it in your runtime_env dictionary. You have two options to get the HTTPS URL. Option 1: Download Zip (quicker to implement, but not recommended for production environments) The first option is to use the remote Git provider’s “Download Zip” feature, which provides an HTTPS link that zips and downloads your repository. This is quick, but it is not recommended because it only allows you to download a zip file of a repository branch’s latest commit. 
To find a GitHub URL, navigate to your repository on GitHub, choose a branch, and click on the green “Code” drop down button: This will drop down a menu that provides three options: “Clone” which provides HTTPS/SSH links to clone the repository, “Open with GitHub Desktop”, and “Download ZIP.” Right-click on “Download Zip.” This will open a pop-up near your cursor. Select “Copy Link Address”: Now your HTTPS link is copied to your clipboard. You can paste it into your runtime_env dictionary. Using the HTTPS URL from your Git provider’s “Download as Zip” feature is not recommended if the URL always points to the latest commit. For instance, using this method on GitHub generates a link that always points to the latest commit on the chosen branch. By specifying this link in the runtime_env dictionary, your Ray Cluster always uses the chosen branch’s latest commit. This creates a consistency risk: if you push an update to your remote Git repository while your cluster’s nodes are pulling the repository’s contents, some nodes may pull the version of your package just before you pushed, and some nodes may pull the version just after. For consistency, it is better to specify a particular commit, so all the nodes use the same package. See “Option 2: Manually Create URL” to create a URL pointing to a specific commit. Option 2: Manually Create URL (slower to implement, but recommended for production environments) The second option is to manually create this URL by pattern-matching your specific use case with one of the following examples. This is recommended because it provides finer-grained control over which repository branch and commit to use when generating your dependency zip file. These options prevent consistency issues on Ray Clusters (see the warning above for more info). To create the URL, pick a URL template below that fits your use case, and fill in all parameters in brackets (e.g. [username], [repository], etc.) with the specific values from your repository. For instance, suppose your GitHub username is example_user, the repository’s name is example_repository, and the desired commit hash is abcdefg. If example_repository is public and you want to retrieve the abcdefg commit (which matches the first example use case), the URL would be: runtime_env = {"working_dir": ("https://github.com" "/example_user/example_repository/archive/abcdefg.zip")} Here is a list of different use cases and corresponding URLs: Example: Retrieve package from a specific commit hash on a public GitHub repository runtime_env = {"working_dir": ("https://github.com" "/[username]/[repository]/archive/[commit hash].zip")} Example: Retrieve package from a private GitHub repository using a Personal Access Token during development. For production see this document to learn how to authenticate private dependencies safely. runtime_env = {"working_dir": ("https://[username]:[personal access token]@github.com" "/[username]/[private repository]/archive/[commit hash].zip")} Example: Retrieve package from a public GitHub repository’s latest commit runtime_env = {"working_dir": ("https://github.com" "/[username]/[repository]/archive/HEAD.zip")} Example: Retrieve package from a specific commit hash on a public Bitbucket repository runtime_env = {"working_dir": ("https://bitbucket.org" "/[owner]/[repository]/get/[commit hash].tar.gz")} It is recommended to specify a particular commit instead of always using the latest commit. This prevents consistency issues on a multi-node Ray Cluster. 
See the warning below “Option 1: Download Zip” for more info. Once you have specified the URL in your runtime_env dictionary, you can pass the dictionary into a ray.init() or .options() call. Congratulations! You have now hosted a runtime_env dependency remotely on GitHub! Debugging If runtime_env cannot be set up (e.g., network issues, download failures, etc.), Ray will fail to schedule tasks/actors that require the runtime_env. If you call ray.get, it will raise RuntimeEnvSetupError with the error message in detail. import ray import time @ray.remote def f(): pass @ray.remote class A: def f(self): pass start = time.time() bad_env = {"conda": {"dependencies": ["this_doesnt_exist"]}} # [Tasks] will raise `RuntimeEnvSetupError`. try: ray.get(f.options(runtime_env=bad_env).remote()) except ray.exceptions.RuntimeEnvSetupError: print("Task fails with RuntimeEnvSetupError") # [Actors] will raise `RuntimeEnvSetupError`. a = A.options(runtime_env=bad_env).remote() try: ray.get(a.f.remote()) except ray.exceptions.RuntimeEnvSetupError: print("Actor fails with RuntimeEnvSetupError") Task fails with RuntimeEnvSetupError Actor fails with RuntimeEnvSetupError Full logs can always be found in the file runtime_env_setup-[job_id].log for per-actor, per-task and per-job environments, or in runtime_env_setup-ray_client_server_[port].log for per-job environments when using Ray Client. You can also enable runtime_env debugging log streaming by setting an environment variable RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=1 on each node before starting Ray, for example using setup_commands in the Ray Cluster configuration file (reference). This will print the full runtime_env setup log messages to the driver (the script that calls ray.init()). Example log output: ray.shutdown() ray.init(runtime_env={"pip": ["requests"]}) (pid=runtime_env) 2022-02-28 14:12:33,653 INFO pip.py:188 -- Creating virtualenv at /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv, current python dir /Users/user/anaconda3/envs/ray-py38 (pid=runtime_env) 2022-02-28 14:12:33,653 INFO utils.py:76 -- Run cmd[1] ['/Users/user/anaconda3/envs/ray-py38/bin/python', '-m', 'virtualenv', '--app-data', '/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv_app_data', '--reset-app-data', '--no-periodic-update', '--system-site-packages', '--no-download', '/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv'] (pid=runtime_env) 2022-02-28 14:12:34,267 INFO utils.py:97 -- Output of cmd[1]: created virtual environment CPython3.8.11.final.0-64 in 473ms (pid=runtime_env) creator CPython3Posix(dest=/private/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv, clear=False, no_vcs_ignore=False, global=True) (pid=runtime_env) seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/private/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv_app_data) (pid=runtime_env) added seed packages: pip==22.0.3, setuptools==60.6.0, wheel==0.37.1 (pid=runtime_env) activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator (pid=runtime_env) (pid=runtime_env) 2022-02-28 14:12:34,268 INFO utils.py:76 -- Run cmd[2] 
['/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv/bin/python', '-c', 'import ray; print(ray.__version__, ray.__path__[0])'] (pid=runtime_env) 2022-02-28 14:12:35,118 INFO utils.py:97 -- Output of cmd[2]: 3.0.0.dev0 /Users/user/ray/python/ray (pid=runtime_env) (pid=runtime_env) 2022-02-28 14:12:35,120 INFO pip.py:236 -- Installing python requirements to /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv (pid=runtime_env) 2022-02-28 14:12:35,122 INFO utils.py:76 -- Run cmd[3] ['/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt'] (pid=runtime_env) 2022-02-28 14:12:38,000 INFO utils.py:97 -- Output of cmd[3]: Requirement already satisfied: requests in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from -r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (2.26.0) (pid=runtime_env) Requirement already satisfied: idna<4,>=2.5 in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from requests->-r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (3.2) (pid=runtime_env) Requirement already satisfied: certifi>=2017.4.17 in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from requests->-r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (2021.10.8) (pid=runtime_env) Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from requests->-r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (1.26.7) (pid=runtime_env) Requirement already satisfied: charset-normalizer~=2.0.0 in /Users/user/anaconda3/envs/ray-py38/lib/python3.8/site-packages (from requests->-r /tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/requirements.txt (line 1)) (2.0.6) (pid=runtime_env) (pid=runtime_env) 2022-02-28 14:12:38,001 INFO utils.py:76 -- Run cmd[4] ['/tmp/ray/session_2022-02-28_14-12-29_909064_87908/runtime_resources/pip/0cc818a054853c3841171109300436cad4dcf594/virtualenv/bin/python', '-c', 'import ray; print(ray.__version__, ray.__path__[0])'] (pid=runtime_env) 2022-02-28 14:12:38,804 INFO utils.py:97 -- Output of cmd[4]: 3.0.0.dev0 /Users/user/ray/python/ray See Logging Directory Structure for more details. Scheduling For each task or actor, Ray will choose a node to run it and the scheduling decision is based on the following factors. Resources Each task or actor has the specified resource requirements. Given that, a node can be in one of the following states: Feasible: the node has the required resources to run the task or actor. Depending on the current availability of these resources, there are two sub-states: Available: the node has the required resources and they are free now. 
Unavailable: the node has the required resources but they are currently being used by other tasks or actors.

Infeasible: the node doesn’t have the required resources. For example, a CPU-only node is infeasible for a GPU task.

Resource requirements are hard requirements, meaning that only feasible nodes are eligible to run the task or actor. If there are feasible nodes, Ray will either choose an available node or wait for an unavailable node to become available, depending on other factors discussed below. If all nodes are infeasible, the task or actor cannot be scheduled until feasible nodes are added to the cluster.

Scheduling Strategies

Tasks or actors support a scheduling_strategy option to specify the strategy used to decide the best node among feasible nodes. The currently supported strategies are the following.

“DEFAULT”

"DEFAULT" is the default strategy used by Ray. Ray schedules tasks or actors onto a group of the top k nodes. Specifically, the nodes are sorted to first favor those that already have tasks or actors scheduled (for locality), then to favor those that have low resource utilization (for load balancing). Within the top k group, nodes are chosen randomly to further improve load-balancing and mitigate delays from cold-start in large clusters.

Implementation-wise, Ray calculates a score for each node in a cluster based on the utilization of its logical resources. If the utilization is below a threshold (controlled by the OS environment variable RAY_scheduler_spread_threshold, default is 0.5), the score is 0; otherwise, it is the resource utilization itself (a score of 1 means the node is fully utilized). Ray selects the best node for scheduling by randomly picking from the top k nodes with the lowest scores. The value of k is the max of (number of nodes in the cluster * RAY_scheduler_top_k_fraction environment variable) and the RAY_scheduler_top_k_absolute environment variable. By default, it’s 20% of the total number of nodes.

Currently, Ray handles actors that don’t require any resources (i.e., num_cpus=0 with no other resources) specially by randomly choosing a node in the cluster without considering resource utilization. Since nodes are randomly chosen, actors that don’t require any resources are effectively SPREAD across the cluster.

@ray.remote
def func():
    return 1

@ray.remote(num_cpus=1)
class Actor:
    pass

# If unspecified, "DEFAULT" scheduling strategy is used.
func.remote()
actor = Actor.remote()

# Explicitly set scheduling strategy to "DEFAULT".
func.options(scheduling_strategy="DEFAULT").remote()
actor = Actor.options(scheduling_strategy="DEFAULT").remote()

# Zero-CPU (and no other resources) actors are randomly assigned to nodes.
actor = Actor.options(num_cpus=0).remote()

“SPREAD”

The "SPREAD" strategy will try to spread the tasks or actors among available nodes.

@ray.remote(scheduling_strategy="SPREAD")
def spread_func():
    return 2

@ray.remote(num_cpus=1)
class SpreadActor:
    pass

# Spread tasks across the cluster.
[spread_func.remote() for _ in range(10)]

# Spread actors across the cluster.
actors = [SpreadActor.options(scheduling_strategy="SPREAD").remote() for _ in range(10)]

PlacementGroupSchedulingStrategy

PlacementGroupSchedulingStrategy will schedule the task or actor to where the placement group is located. This is useful for actor gang scheduling. See Placement Group for more details.

NodeAffinitySchedulingStrategy

NodeAffinitySchedulingStrategy is a low-level strategy that allows a task or actor to be scheduled onto a particular node specified by its node id.
The soft flag specifies whether the task or actor is allowed to run somewhere else if the specified node doesn’t exist (e.g. if the node dies) or is infeasible because it does not have the resources required to run the task or actor. In these cases, if soft is True, the task or actor will be scheduled onto a different feasible node. Otherwise, the task or actor will fail with TaskUnschedulableError or ActorUnschedulableError. As long as the specified node is alive and feasible, the task or actor will only run there regardless of the soft flag. This means if the node currently has no available resources, the task or actor will wait until resources become available. This strategy should only be used if other high level scheduling strategies (e.g. placement group) cannot give the desired task or actor placements. It has the following known limitations: It’s a low-level strategy which prevents optimizations by a smart scheduler. It cannot fully utilize an autoscaling cluster since node ids must be known when the tasks or actors are created. It can be difficult to make the best static placement decision especially in a multi-tenant cluster: for example, an application won’t know what else is being scheduled onto the same nodes. @ray.remote def node_affinity_func(): return ray.get_runtime_context().get_node_id() @ray.remote(num_cpus=1) class NodeAffinityActor: pass # Only run the task on the local node. node_affinity_func.options( scheduling_strategy=ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy( node_id=ray.get_runtime_context().get_node_id(), soft=False, ) ).remote() # Run the two node_affinity_func tasks on the same node if possible. node_affinity_func.options( scheduling_strategy=ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy( node_id=ray.get(node_affinity_func.remote()), soft=True, ) ).remote() # Only run the actor on the local node. actor = NodeAffinityActor.options( scheduling_strategy=ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy( node_id=ray.get_runtime_context().get_node_id(), soft=False, ) ).remote() Locality-Aware Scheduling By default, Ray prefers available nodes that have large task arguments local to avoid transferring data over the network. If there are multiple large task arguments, the node with most object bytes local is preferred. This takes precedence over the "DEFAULT" scheduling strategy, which means Ray will try to run the task on the locality preferred node regardless of the node resource utilization. However, if the locality preferred node is not available, Ray may run the task somewhere else. When other scheduling strategies are specified, they have higher precedence and data locality is no longer considered. Locality-aware scheduling is only for tasks not actors. @ray.remote def large_object_func(): # Large object is stored in the local object store # and available in the distributed memory, # instead of returning inline directly to the caller. return [1] * (1024 * 1024) @ray.remote def small_object_func(): # Small object is returned inline directly to the caller, # instead of storing in the distributed memory. return [1] @ray.remote def consume_func(data): return len(data) large_object = large_object_func.remote() small_object = small_object_func.remote() # Ray will try to run consume_func on the same node # where large_object_func runs. consume_func.remote(large_object) # Ray will try to spread consume_func across the entire cluster # instead of only running on the node where large_object_func runs. 
[ consume_func.options(scheduling_strategy="SPREAD").remote(large_object) for i in range(10) ] # Ray won't consider locality for scheduling consume_func # since the argument is small and will be sent to the worker node inline directly. consume_func.remote(small_object) More about Ray Scheduling Resources Ray allows you to seamlessly scale your applications from a laptop to a cluster without code change. Ray resources are key to this capability. They abstract away physical machines and let you express your computation in terms of resources, while the system manages scheduling and autoscaling based on resource requests. A resource in Ray is a key-value pair where the key denotes a resource name, and the value is a float quantity. For convenience, Ray has native support for CPU, GPU, and memory resource types; CPU, GPU and memory are called pre-defined resources. Besides those, Ray also supports custom resources. Physical Resources and Logical Resources Physical resources are resources that a machine physically has such as physical CPUs and GPUs and logical resources are virtual resources defined by a system. Ray resources are logical and don’t need to have 1-to-1 mapping with physical resources. For example, you can start a Ray head node with 3 GPUs via ray start --head --num-gpus=3 even if it physically has zero. They are mainly used for admission control during scheduling. The fact that resources are logical has several implications: Resource requirements of tasks or actors do NOT impose limits on actual physical resource usage. For example, Ray doesn’t prevent a num_cpus=1 task from launching multiple threads and using multiple physical CPUs. It’s your responsibility to make sure tasks or actors use no more resources than specified via resource requirements. Ray doesn’t provide CPU isolation for tasks or actors. For example, Ray won’t reserve a physical CPU exclusively and pin a num_cpus=1 task to it. Ray will let the operating system schedule and run the task instead. If needed, you can use operating system APIs like sched_setaffinity to pin a task to a physical CPU. Ray does provide GPU isolation in the form of visible devices by automatically setting the CUDA_VISIBLE_DEVICES environment variable, which most ML frameworks will respect for purposes of GPU assignment. Physical resources vs logical resources Custom Resources Besides pre-defined resources, you can also specify a Ray node’s custom resources and request them in your tasks or actors. Some use cases for custom resources: Your node has special hardware and you can represent it as a custom resource. Then your tasks or actors can request the custom resource via @ray.remote(resources={"special_hardware": 1}) and Ray will schedule the tasks or actors to the node that has the custom resource. You can use custom resources as labels to tag nodes and you can achieve label based affinity scheduling. For example, you can do ray.remote(resources={"custom_label": 0.001}) to schedule tasks or actors to nodes with custom_label custom resource. For this use case, the actual quantity doesn’t matter, and the convention is to specify a tiny number so that the label resource is not the limiting factor for parallelism. Specifying Node Resources By default, Ray nodes start with pre-defined CPU, GPU, and memory resources. The quantities of these resources on each node are set to the physical quantities auto detected by Ray. By default, logical resources are configured by the following rule. 
Ray does not permit dynamic updates of resource capacities after Ray has been started on a node. Number of logical CPUs (``num_cpus``): Set to the number of CPUs of the machine/container. Number of logical GPUs (``num_gpus``): Set to the number of GPUs of the machine/container. Memory (``memory``): Set to 70% of “available memory” when ray runtime starts. Object Store Memory (``object_store_memory``): Set to 30% of “available memory” when ray runtime starts. Note that the object store memory is not logical resource, and users cannot use it for scheduling. However, you can always override that by manually specifying the quantities of pre-defined resources and adding custom resources. There are several ways to do that depending on how you start the Ray cluster: ray.init() If you are using ray.init() to start a single node Ray cluster, you can do the following to manually specify node resources: # This will start a Ray node with 3 logical cpus, 4 logical gpus, # 1 special_hardware resource and 1 custom_label resource. ray.init(num_cpus=3, num_gpus=4, resources={"special_hardware": 1, "custom_label": 1}) ray start If you are using ray start to start a Ray node, you can run: ray start --head --num-cpus=3 --num-gpus=4 --resources='{"special_hardware": 1, "custom_label": 1}' ray up If you are using ray up to start a Ray cluster, you can set the resources field in the yaml file: available_node_types: head: ... resources: CPU: 3 GPU: 4 special_hardware: 1 custom_label: 1 KubeRay If you are using KubeRay to start a Ray cluster, you can set the rayStartParams field in the yaml file: headGroupSpec: rayStartParams: num-cpus: "3" num-gpus: "4" resources: '"{\"special_hardware\": 1, \"custom_label\": 1}"' Specifying Task or Actor Resource Requirements Ray allows specifying a task or actor’s resource requirements (e.g., CPU, GPU, and custom resources). The task or actor will only run on a node if there are enough required resources available to execute the task or actor. By default, Ray tasks use 1 CPU resource and Ray actors use 1 CPU for scheduling and 0 CPU for running (This means, by default, actors cannot get scheduled on a zero-cpu node, but an infinite number of them can run on any non-zero cpu node. The default resource requirements for actors was chosen for historical reasons. It’s recommended to always explicitly set num_cpus for actors to avoid any surprises. If resources are specified explicitly, they are required for both scheduling and running.) You can also explicitly specify a task’s or actor’s resource requirements (for example, one task may require a GPU) instead of using default ones via ray.remote() and task.options()/actor.options(). Python # Specify the default resource requirements for this remote function. @ray.remote(num_cpus=2, num_gpus=2, resources={"special_hardware": 1}) def func(): return 1 # You can override the default resource requirements. func.options(num_cpus=3, num_gpus=1, resources={"special_hardware": 0}).remote() @ray.remote(num_cpus=0, num_gpus=1) class Actor: pass # You can override the default resource requirements for actors as well. actor = Actor.options(num_cpus=1, num_gpus=0).remote() Java // Specify required resources. Ray.task(MyRayApp::myFunction).setResource("CPU", 1.0).setResource("GPU", 1.0).setResource("special_hardware", 1.0).remote(); Ray.actor(Counter::new).setResource("CPU", 2.0).setResource("GPU", 1.0).remote(); C++ // Specify required resources. 
ray::Task(MyFunction).SetResource("CPU", 1.0).SetResource("GPU", 1.0).SetResource("special_hardware", 1.0).Remote(); ray::Actor(CreateCounter).SetResource("CPU", 2.0).SetResource("GPU", 1.0).Remote(); Task and actor resource requirements have implications for the Ray’s scheduling concurrency. In particular, the sum of the resource requirements of all of the concurrently executing tasks and actors on a given node cannot exceed the node’s total resources. This property can be used to limit the number of concurrently running tasks or actors to avoid issues like OOM. Fractional Resource Requirements Ray supports fractional resource requirements. For example, if your task or actor is IO bound and has low CPU usage, you can specify fractional CPU num_cpus=0.5 or even zero CPU num_cpus=0. The precision of the fractional resource requirement is 0.0001 so you should avoid specifying a double that’s beyond that precision. @ray.remote(num_cpus=0.5) def io_bound_task(): import time time.sleep(1) return 2 io_bound_task.remote() @ray.remote(num_gpus=0.5) class IOActor: def ping(self): import os print(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}") # Two actors can share the same GPU. io_actor1 = IOActor.remote() io_actor2 = IOActor.remote() ray.get(io_actor1.ping.remote()) ray.get(io_actor2.ping.remote()) # Output: # (IOActor pid=96328) CUDA_VISIBLE_DEVICES: 1 # (IOActor pid=96329) CUDA_VISIBLE_DEVICES: 1 Besides resource requirements, you can also specify an environment for a task or actor to run in, which can include Python packages, local files, environment variables, and more—see Runtime Environments for details. GPU Support GPUs are critical for many machine learning applications. Ray natively supports GPU as a pre-defined resource type and allows tasks and actors to specify their GPU resource requirements. Starting Ray Nodes with GPUs By default, Ray will set the quantity of GPU resources of a node to the physical quantities of GPUs auto detected by Ray. If you need to, you can override this. There is nothing preventing you from specifying a larger value of num_gpus than the true number of GPUs on the machine given Ray resources are logical. In this case, Ray will act as if the machine has the number of GPUs you specified for the purposes of scheduling tasks and actors that require GPUs. Trouble will only occur if those tasks and actors attempt to actually use GPUs that don’t exist. You can set CUDA_VISIBLE_DEVICES environment variable before starting a Ray node to limit the GPUs that are visible to Ray. For example, CUDA_VISIBLE_DEVICES=1,3 ray start --head --num-gpus=2 will let Ray only see devices 1 and 3. Using GPUs in Tasks and Actors If a task or actor requires GPUs, you can specify the corresponding resource requirements (e.g. @ray.remote(num_gpus=1)). Ray will then schedule the task or actor to a node that has enough free GPU resources and assign GPUs to the task or actor by setting the CUDA_VISIBLE_DEVICES environment variable before running the task or actor code. 
import os import ray ray.init(num_gpus=2) @ray.remote(num_gpus=1) class GPUActor: def ping(self): print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids())) print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"])) @ray.remote(num_gpus=1) def use_gpu(): print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids())) print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"])) gpu_actor = GPUActor.remote() ray.get(gpu_actor.ping.remote()) # The actor uses the first GPU so the task will use the second one. ray.get(use_gpu.remote()) # (GPUActor pid=52420) ray.get_gpu_ids(): [0] # (GPUActor pid=52420) CUDA_VISIBLE_DEVICES: 0 # (use_gpu pid=51830) ray.get_gpu_ids(): [1] # (use_gpu pid=51830) CUDA_VISIBLE_DEVICES: 1 Inside a task or actor, ray.get_gpu_ids() will return a list of GPU IDs that are available to the task or actor. Typically, it is not necessary to call ray.get_gpu_ids() because Ray will automatically set the CUDA_VISIBLE_DEVICES environment variable, which most ML frameworks will respect for purposes of GPU assignment. Note: The function use_gpu defined above doesn’t actually use any GPUs. Ray will schedule it on a node which has at least one GPU, and will reserve one GPU for it while it is being executed, however it is up to the function to actually make use of the GPU. This is typically done through an external library like TensorFlow. Here is an example that actually uses GPUs. In order for this example to work, you will need to install the GPU version of TensorFlow. @ray.remote(num_gpus=1) def use_gpu(): import tensorflow as tf # Create a TensorFlow session. TensorFlow will restrict itself to use the # GPUs specified by the CUDA_VISIBLE_DEVICES environment variable. tf.Session() Note: It is certainly possible for the person implementing use_gpu to ignore ray.get_gpu_ids() and to use all of the GPUs on the machine. Ray does not prevent this from happening, and this can lead to too many tasks or actors using the same GPU at the same time. However, Ray does automatically set the CUDA_VISIBLE_DEVICES environment variable, which will restrict the GPUs used by most deep learning frameworks assuming it’s not overridden by the user. Fractional GPUs Ray supports fractional resource requirements so multiple tasks and actors can share the same GPU. ray.init(num_cpus=4, num_gpus=1) @ray.remote(num_gpus=0.25) def f(): import time time.sleep(1) # The four tasks created here can execute concurrently # and share the same GPU. ray.get([f.remote() for _ in range(4)]) Note: It is the user’s responsibility to make sure that the individual tasks don’t use more than their share of the GPU memory. TensorFlow can be configured to limit its memory usage. When Ray assigns GPUs of a node to tasks or actors with fractional resource requirements, it will pack one GPU before moving on to the next one to avoid fragmentation. ray.init(num_gpus=3) @ray.remote(num_gpus=0.5) class FractionalGPUActor: def ping(self): print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids())) fractional_gpu_actors = [FractionalGPUActor.remote() for _ in range(3)] # Ray will try to pack GPUs if possible. 
[ray.get(fractional_gpu_actors[i].ping.remote()) for i in range(3)]
# (FractionalGPUActor pid=57417) ray.get_gpu_ids(): [0]
# (FractionalGPUActor pid=57416) ray.get_gpu_ids(): [0]
# (FractionalGPUActor pid=57418) ray.get_gpu_ids(): [1]

Workers not Releasing GPU Resources

Currently, when a worker executes a task that uses a GPU (e.g., through TensorFlow), the task may allocate memory on the GPU and may not release it when the task finishes executing. This can lead to problems the next time a task tries to use the same GPU. To address the problem, Ray disables worker process reuse between GPU tasks by default, so that GPU resources are released when the task process exits. Since this adds overhead to GPU task scheduling, you can re-enable worker reuse by setting max_calls=0 in the ray.remote decorator.

# By default, ray will not reuse workers for GPU tasks to prevent
# GPU resource leakage.
@ray.remote(num_gpus=1)
def leak_gpus():
    import tensorflow as tf

    # This task will allocate memory on the GPU and then never release it.
    tf.Session()

Accelerator Types

Ray supports resource-specific accelerator types. The accelerator_type option can be used to force a task or actor to run on a node with a specific type of accelerator. Under the hood, the accelerator type option is implemented as a custom resource requirement of "accelerator_type:<type>": 0.001. This forces the task or actor to be placed on a node with that particular accelerator type available. This also lets the multi-node-type autoscaler know that there is demand for that type of resource, potentially triggering the launch of new nodes providing that accelerator.

from ray.util.accelerators import NVIDIA_TESLA_V100

@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def train(data):
    return "This function was run on a node with a Tesla V100 GPU"

ray.get(train.remote(1))

See ray.util.accelerators for available accelerator types. Currently, automatically detected accelerator types include Nvidia GPUs.

Placement Groups

Placement groups allow users to atomically reserve groups of resources across multiple nodes (i.e., gang scheduling). They can then be used to schedule Ray tasks and actors packed as close as possible for locality (PACK), or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks.

Here are some real-world use cases:

Distributed Machine Learning Training: Distributed training (e.g., Ray Train and Ray Tune) uses the placement group APIs to enable gang scheduling. In these settings, all resources for a trial must be available at the same time. Gang scheduling is a critical technique to enable all-or-nothing scheduling for deep learning training.

Fault tolerance in distributed training: Placement groups can be used to configure fault tolerance. In Ray Tune, it can be beneficial to pack related resources from a single trial together, so that a node failure impacts a low number of trials. In libraries that support elastic training (e.g., XGBoost-Ray), spreading the resources across multiple nodes can help to ensure that training continues even when a node dies.

Key Concepts

Bundles

A bundle is a collection of "resources". It could be a single resource, {"CPU": 1}, or a group of resources, {"CPU": 1, "GPU": 4}. A bundle is a unit of reservation for placement groups. "Scheduling a bundle" means we find a node that fits the bundle and reserve the resources specified by the bundle. A bundle must be able to fit on a single node on the Ray cluster.
For example, if you only have an 8 CPU node, and if you have a bundle that requires {"CPU": 9}, this bundle cannot be scheduled.

Placement Group

A placement group reserves the resources from the cluster. The reserved resources can only be used by tasks or actors that use the PlacementGroupSchedulingStrategy.

Placement groups are represented by a list of bundles. For example, {"CPU": 1} * 4 means you'd like to reserve 4 bundles of 1 CPU (i.e., it reserves 4 CPUs).

Bundles are then placed according to the placement strategies across nodes on the cluster. After the placement group is created, tasks or actors can then be scheduled according to the placement group and even on individual bundles.

Create a Placement Group (Reserve Resources)

You can create a placement group using ray.util.placement_group(). Placement groups take in a list of bundles and a placement strategy. Note that each bundle must be able to fit on a single node on the Ray cluster. For example, if you only have an 8 CPU node, and if you have a bundle that requires {"CPU": 9}, this bundle cannot be scheduled.

Bundles are specified by a list of dictionaries, e.g., [{"CPU": 1}, {"CPU": 1, "GPU": 1}].

CPU corresponds to num_cpus as used in ray.remote.
GPU corresponds to num_gpus as used in ray.remote.
memory corresponds to memory as used in ray.remote.
Other resources correspond to resources as used in ray.remote (e.g., ray.init(resources={"disk": 1}) can have a bundle of {"disk": 1}).

Placement group scheduling is asynchronous. The ray.util.placement_group call returns immediately.

Python

from pprint import pprint
import time

# Import placement group APIs.
from ray.util.placement_group import (
    placement_group,
    placement_group_table,
    remove_placement_group,
)
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# Initialize Ray.
import ray

# Create a single node Ray cluster with 2 CPUs and 2 GPUs.
ray.init(num_cpus=2, num_gpus=2)

# Reserve a placement group of 1 bundle that reserves 1 CPU and 1 GPU.
pg = placement_group([{"CPU": 1, "GPU": 1}])

Java

// Initialize Ray.
Ray.init();

// Construct a list of bundles.
Map<String, Double> bundle = ImmutableMap.of("CPU", 1.0);
List<Map<String, Double>> bundles = ImmutableList.of(bundle);

// Make a creation option with bundles and strategy.
PlacementGroupCreationOptions options =
    new PlacementGroupCreationOptions.Builder()
        .setBundles(bundles)
        .setStrategy(PlacementStrategy.STRICT_SPREAD)
        .build();

PlacementGroup pg = PlacementGroups.createPlacementGroup(options);

C++

// Initialize Ray.
ray::Init();

// Construct a list of bundles.
std::vector<std::unordered_map<std::string, double>> bundles{{{"CPU", 1.0}}};

// Make a creation option with bundles and strategy.
ray::internal::PlacementGroupCreationOptions options{
    false, "my_pg", bundles, ray::internal::PlacementStrategy::PACK};

ray::PlacementGroup pg = ray::CreatePlacementGroup(options);

You can block your program until the placement group is ready using one of two APIs:

ready, which is compatible with ray.get
wait, which blocks the program until the placement group is ready

Python

# Wait until placement group is created.
ray.get(pg.ready(), timeout=10)

# You can also use ray.wait.
ready, unready = ray.wait([pg.ready()], timeout=10)

# You can look at placement group states using this API.
print(placement_group_table(pg))

Java

// Wait for the placement group to be ready within the specified time (unit: seconds).
boolean ready = pg.wait(60);
Assert.assertTrue(ready);

// You can look at placement group states using this API.
List<PlacementGroup> allPlacementGroup = PlacementGroups.getAllPlacementGroups();
for (PlacementGroup group: allPlacementGroup) {
    System.out.println(group);
}

C++

// Wait for the placement group to be ready within the specified time (unit: seconds).
bool ready = pg.Wait(60);
assert(ready);

// You can look at placement group states using this API.
std::vector<ray::PlacementGroup> all_placement_group = ray::GetAllPlacementGroups();
for (const ray::PlacementGroup &group : all_placement_group) {
    std::cout << group.GetName() << std::endl;
}

Let's verify the placement group is successfully created.

# This API is only available when you download Ray via `pip install "ray[default]"`
ray list placement-groups

======== List: 2023-04-07 01:15:05.682519 ========
Stats:
------------------------------
Total: 1

Table:
------------------------------
    PLACEMENT_GROUP_ID                    NAME    CREATOR_JOB_ID    STATE
0   3cd6174711f47c14132155039c0501000000          01000000          CREATED

The placement group is successfully created. Out of the {"CPU": 2, "GPU": 2} resources, the placement group reserves {"CPU": 1, "GPU": 1}. The reserved resources can only be used when you schedule tasks or actors with a placement group. The diagram below demonstrates the "1 CPU and 1 GPU" bundle that the placement group reserved.

Placement groups are atomically created; if a bundle cannot fit in any of the current nodes, the entire placement group is not ready and no resources are reserved. To illustrate, let's create another placement group that requires {"CPU": 1}, {"GPU": 2} (2 bundles).

Python

# Cannot create this placement group because we
# cannot create a {"GPU": 2} bundle.
pending_pg = placement_group([{"CPU": 1}, {"GPU": 2}])
# This raises the timeout exception!
try:
    ray.get(pending_pg.ready(), timeout=5)
except Exception as e:
    print(
        "Cannot create a placement group because "
        "{'GPU': 2} bundle cannot be created."
    )
    print(e)

You can verify the new placement group is pending creation.

# This API is only available when you download Ray via `pip install "ray[default]"`
ray list placement-groups

======== List: 2023-04-07 01:16:23.733410 ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    PLACEMENT_GROUP_ID                    NAME    CREATOR_JOB_ID    STATE
0   3cd6174711f47c14132155039c0501000000          01000000          CREATED
1   e1b043bebc751c3081bddc24834d01000000          01000000          PENDING <---- the new placement group.

You can also verify that the {"CPU": 1} and {"GPU": 2} bundles cannot be allocated, using the ray status CLI command.

ray status

Resources
---------------------------------------------------------------
Usage:
 0.0/2.0 CPU (0.0 used of 1.0 reserved in placement groups)
 0.0/2.0 GPU (0.0 used of 1.0 reserved in placement groups)
 0B/3.46GiB memory
 0B/1.73GiB object_store_memory

Demands:
 {'CPU': 1.0} * 1, {'GPU': 2.0} * 1 (PACK): 1+ pending placement groups <--- 1 placement group is pending creation.

The current cluster has {"CPU": 2, "GPU": 2}. We already created a {"CPU": 1, "GPU": 1} bundle, so only {"CPU": 1, "GPU": 1} is left in the cluster. If we create 2 bundles {"CPU": 1} and {"GPU": 2}, we can create the first bundle successfully, but can't schedule the second bundle. Since we cannot create every bundle on the cluster, the placement group is not created, including the {"CPU": 1} bundle.

When the placement group cannot be scheduled in any way, it is called "infeasible". Imagine you schedule a {"CPU": 4} bundle, but you only have a single node with 2 CPUs. There's no way to create this bundle in your cluster.
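To make the infeasible case concrete, here is a minimal sketch that reuses the placement group APIs imported above and assumes the same single-node, 2 CPU / 2 GPU cluster from this example; the {"CPU": 4} request can never be satisfied, so the group simply stays pending:

# Hypothetical illustration: no node in this example cluster has 4 CPUs,
# so this placement group is infeasible and its ready() future never resolves.
infeasible_pg = placement_group([{"CPU": 4}])

# After the timeout, the placement group is still not ready.
ready, unready = ray.wait([infeasible_pg.ready()], timeout=5)
assert not ready

# Remove the request so it doesn't keep an unsatisfiable demand around.
remove_placement_group(infeasible_pg)

Unlike the merely pending group above, an infeasible group can never be satisfied by the nodes the cluster currently has, so the only remedies are removing the group or adding a node large enough to host the bundle.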
The Ray Autoscaler is aware of placement groups, and auto-scales the cluster to ensure pending groups can be placed as needed.

If Ray Autoscaler cannot provide resources to schedule a placement group, Ray does not print a warning about infeasible groups and the tasks and actors that use the groups. You can observe the scheduling state of the placement group from the dashboard or state APIs.

Schedule Tasks and Actors to Placement Groups (Use Reserved Resources)

In the previous section, we created a placement group that reserved {"CPU": 1, "GPU": 1} from a 2 CPU and 2 GPU node.

Now let's schedule an actor to the placement group. You can schedule actors or tasks to a placement group using options(scheduling_strategy=PlacementGroupSchedulingStrategy(...)).

Python

@ray.remote(num_cpus=1)
class Actor:
    def __init__(self):
        pass

    def ready(self):
        pass

# Create an actor to a placement group.
actor = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
    )
).remote()

# Verify the actor is scheduled.
ray.get(actor.ready.remote(), timeout=10)

Java

public static class Counter {
    private int value;

    public Counter(int initValue) {
        this.value = initValue;
    }

    public int getValue() {
        return value;
    }

    public static String ping() {
        return "pong";
    }
}

// Create GPU actors on a gpu bundle.
for (int index = 0; index < 1; index++) {
    Ray.actor(Counter::new, 1)
        .setPlacementGroup(pg, 0)
        .remote();
}

C++

class Counter {
public:
    Counter(int init_value) : value(init_value) {}
    int GetValue() { return value; }
    std::string Ping() { return "pong"; }

private:
    int value;
};

// Factory function of Counter class.
static Counter *CreateCounter(int init_value) { return new Counter(init_value); }

RAY_REMOTE(&Counter::Ping, &Counter::GetValue, CreateCounter);

// Create GPU actors on a gpu bundle.
for (int index = 0; index < 1; index++) {
    ray::Actor(CreateCounter)
        .SetPlacementGroup(pg, 0)
        .Remote(1);
}

When you use an actor with a placement group, always specify num_cpus. If you don't specify resources (e.g., num_cpus=0), the placement group option is ignored, and the task or actor doesn't use the reserved resources.

Note that by default (with no arguments to ray.remote):

A Ray task requires 1 CPU.
A Ray actor requires 1 CPU when it is scheduled, but occupies 0 CPU after it is created.

So when you schedule a default actor (with no resource requirements) with a placement group, the placement group has to be created before the actor can be scheduled (scheduling requires 1 CPU), but once the actor is created it occupies 0 CPU and ignores the placement group's reserved resources.

The actor is scheduled now! One bundle can be used by multiple tasks and actors (i.e., the bundle-to-task (or actor) relationship is one-to-many). In this case, since the actor uses 1 CPU, 1 GPU remains from the bundle.

You can verify this from the CLI command ray status. You can see that 1 CPU is reserved by the placement group, and 1.0 is used (by the actor we created).

ray status

Resources
---------------------------------------------------------------
Usage:
 1.0/2.0 CPU (1.0 used of 1.0 reserved in placement groups) <---
 0.0/2.0 GPU (0.0 used of 1.0 reserved in placement groups)
 0B/4.29GiB memory
 0B/2.00GiB object_store_memory

Demands:
 (no resource demands)

You can also verify the actor is created using ray list actors.
# This API is only available when you download Ray via `pip install "ray[default]"`
ray list actors --detail

-   actor_id: b5c990f135a7b32bfbb05e1701000000
    class_name: Actor
    death_cause: null
    is_detached: false
    job_id: '01000000'
    name: ''
    node_id: b552ca3009081c9de857a31e529d248ba051a4d3aeece7135dde8427
    pid: 8795
    placement_group_id: d2e660ac256db230dbe516127c4a01000000 <------
    ray_namespace: e5b19111-306c-4cd8-9e4f-4b13d42dff86
    repr_name: ''
    required_resources:
        CPU_group_d2e660ac256db230dbe516127c4a01000000: 1.0
    serialized_runtime_env: '{}'
    state: ALIVE

Since 1 GPU remains, let's create a new actor that requires 1 GPU. This time, we also specify the placement_group_bundle_index. Each bundle is given an "index" within the placement group. For example, a placement group of 2 bundles [{"CPU": 1}, {"GPU": 1}] has index 0 bundle {"CPU": 1} and index 1 bundle {"GPU": 1}. Since we only have 1 bundle, we only have index 0. If you don't specify a bundle, the actor (or task) is scheduled on a random bundle that has unallocated reserved resources.

Python

@ray.remote(num_cpus=0, num_gpus=1)
class Actor:
    def __init__(self):
        pass

    def ready(self):
        pass

# Create a GPU actor on the first bundle of index 0.
actor2 = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
        placement_group_bundle_index=0,
    )
).remote()

# Verify that the GPU actor is scheduled.
ray.get(actor2.ready.remote(), timeout=10)

We succeeded in scheduling the GPU actor! The image below shows the 2 actors scheduled into the placement group.

You can also verify that the reserved resources are all used with the ray status command.

ray status

Resources
---------------------------------------------------------------
Usage:
 1.0/2.0 CPU (1.0 used of 1.0 reserved in placement groups)
 1.0/2.0 GPU (1.0 used of 1.0 reserved in placement groups) <----
 0B/4.29GiB memory
 0B/2.00GiB object_store_memory

Placement Strategy

One of the features the placement group provides is the ability to add placement constraints among bundles. For example, you may want to pack your bundles onto the same node or spread them out across multiple nodes as much as possible. You can specify the strategy via the strategy argument. This way, you can make sure your actors and tasks are scheduled with certain placement constraints.

The example below creates a placement group with 2 bundles with a PACK strategy; both bundles have to be created on the same node. Note that it is a soft policy. If the bundles cannot be packed into a single node, they are spread to other nodes. If you'd like to avoid that, you can instead use the STRICT_PACK policy, which fails to create the placement group if the placement requirement cannot be satisfied.

# Reserve a placement group of 2 bundles
# that have to be packed on the same node.
pg = placement_group([{"CPU": 1}, {"GPU": 1}], strategy="PACK")

The image below demonstrates the PACK policy. The three {"CPU": 2} bundles are located on the same node.

The image below demonstrates the SPREAD policy. Each of the three {"CPU": 2} bundles is located on a different node.

Ray supports four placement group strategies. The default scheduling policy is PACK.

STRICT_PACK: All bundles must be placed into a single node on the cluster. Use this strategy when you want to maximize locality.

PACK: All provided bundles are packed onto a single node on a best-effort basis. If strict packing is not feasible (i.e., some bundles do not fit on the node), bundles can be placed onto other nodes.
STRICT_SPREAD: Each bundle must be scheduled on a separate node.

SPREAD: Each bundle is spread onto separate nodes on a best-effort basis. If strict spreading is not feasible, bundles can be placed on overlapping nodes.

Remove Placement Groups (Free Reserved Resources)

By default, a placement group's lifetime is scoped to the driver that creates the placement group (unless you make it a detached placement group). When the placement group is created from a detached actor, the lifetime is scoped to the detached actor. In Ray, the driver is the Python script that calls ray.init.

Reserved resources (bundles) from the placement group are automatically freed when the driver or detached actor that created the placement group exits. To free the reserved resources manually, remove the placement group using the remove_placement_group API (which is also an asynchronous API).

When you remove the placement group, actors or tasks that still use the reserved resources are forcefully killed.

Python

# This API is asynchronous.
remove_placement_group(pg)

# Wait until placement group is killed.
time.sleep(1)
# Check that the placement group has died.
pprint(placement_group_table(pg))

"""
{'bundles': {0: {'GPU': 1.0}, 1: {'CPU': 1.0}},
 'name': 'unnamed_group',
 'placement_group_id': '40816b6ad474a6942b0edb45809b39c3',
 'state': 'REMOVED',
 'strategy': 'PACK'}
"""

Java

PlacementGroups.removePlacementGroup(placementGroup.getId());

PlacementGroup removedPlacementGroup =
    PlacementGroups.getPlacementGroup(placementGroup.getId());
Assert.assertEquals(removedPlacementGroup.getState(), PlacementGroupState.REMOVED);

C++

ray::RemovePlacementGroup(placement_group.GetID());

ray::PlacementGroup removed_placement_group =
    ray::GetPlacementGroup(placement_group.GetID());
assert(removed_placement_group.GetState() == ray::PlacementGroupState::REMOVED);

Observe and Debug Placement Groups

Ray provides several useful tools to inspect the placement group states and resource usage.

Ray Status is a CLI tool for viewing the resource usage and scheduling resource requirements of placement groups.
Ray Dashboard is a UI tool for inspecting placement group states.
Ray State API is a CLI for inspecting placement group states.

ray status (CLI)

The CLI command ray status provides the autoscaling status of the cluster. It provides the "resource demands" from unscheduled placement groups as well as the resource reservation status.

Resources
---------------------------------------------------------------
Usage:
 1.0/2.0 CPU (1.0 used of 1.0 reserved in placement groups)
 0.0/2.0 GPU (0.0 used of 1.0 reserved in placement groups)
 0B/4.29GiB memory
 0B/2.00GiB object_store_memory

Dashboard

The dashboard job view provides the placement group table that displays the scheduling state and metadata of the placement group. The Ray dashboard is only available when you install Ray with pip install "ray[default]".

Ray State API

The Ray state API is a CLI tool for inspecting the state of Ray resources (tasks, actors, placement groups, etc.).

ray list placement-groups provides the metadata and the scheduling state of the placement group. ray list placement-groups --detail provides statistics and scheduling state in greater detail. The state API is only available when you install Ray with pip install "ray[default]".

Inspect Placement Group Scheduling State

With the above tools, you can see the state of the placement group, as in the sketch below.
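The same information that the CLI shows can also be read programmatically. Below is a minimal sketch; it assumes a Ray version where the state API Python SDK is exposed as ray.util.state and Ray was installed via pip install "ray[default]" (in older releases the equivalent functions live under ray.experimental.state.api):

# List every placement group known to the cluster, similar to
# `ray list placement-groups --detail`, but from Python.
from ray.util.state import list_placement_groups

for pg_state in list_placement_groups(detail=True):
    # Each entry carries the same metadata and scheduling state
    # (id, name, state, bundles, ...) shown by the CLI.
    print(pg_state)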
The definitions of the states are specified in the placement group state table (High level state / Details).

[Advanced] Child Tasks and Actors

By default, child actors and tasks don't share the same placement group that the parent uses. To automatically schedule child actors or tasks to the same placement group, set placement_group_capture_child_tasks to True.

Python

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=2)

# Create a placement group.
pg = placement_group([{"CPU": 2}])
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def child():
    import time

    time.sleep(5)

@ray.remote(num_cpus=1)
def parent():
    # The child task is scheduled to the same placement group as its parent,
    # although it didn't specify the PlacementGroupSchedulingStrategy.
    ray.get(child.remote())

# Since the child and parent use 1 CPU each, the placement group
# bundle {"CPU": 2} is fully occupied.
ray.get(
    parent.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_capture_child_tasks=True
        )
    ).remote()
)

Java

It's not implemented for Java APIs yet.

When placement_group_capture_child_tasks is True, but you don't want to schedule child tasks and actors to the same placement group, specify PlacementGroupSchedulingStrategy(placement_group=None).

@ray.remote
def parent():
    # In this case, the child task isn't
    # scheduled with the parent's placement group.
    ray.get(
        child.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=None)
        ).remote()
    )

# This times out because we cannot schedule the child task.
# The cluster has {"CPU": 2}, and both of them are reserved by
# the placement group with a bundle {"CPU": 2}. Since the child shouldn't
# be scheduled within this placement group, it cannot be scheduled because
# there are no available CPU resources.
try:
    ray.get(
        parent.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg, placement_group_capture_child_tasks=True
            )
        ).remote(),
        timeout=5,
    )
except Exception as e:
    print("Couldn't create a child task!")
    print(e)

[Advanced] Named Placement Group

A placement group can be given a globally unique name. This allows you to retrieve the placement group from any job in the Ray cluster. This can be useful if you cannot directly pass the placement group handle to the actor or task that needs it, or if you are trying to access a placement group launched by another driver. Note that the placement group is still destroyed if its lifetime isn't detached.

Python

# first_driver.py
# Create a placement group with a global name.
pg = placement_group([{"CPU": 1}], name="global_name")
ray.get(pg.ready())

# second_driver.py
# Retrieve a placement group with a global name.
pg = ray.util.get_placement_group("global_name")

Java

// Create a placement group with a unique name.
Map<String, Double> bundle = ImmutableMap.of("CPU", 1.0);
List<Map<String, Double>> bundles = ImmutableList.of(bundle);

PlacementGroupCreationOptions options =
    new PlacementGroupCreationOptions.Builder()
        .setBundles(bundles)
        .setStrategy(PlacementStrategy.STRICT_SPREAD)
        .setName("global_name")
        .build();

PlacementGroup pg = PlacementGroups.createPlacementGroup(options);
pg.wait(60);

...

// Retrieve the placement group later somewhere.
PlacementGroup group = PlacementGroups.getPlacementGroup("global_name");
Assert.assertNotNull(group);

C++

// Create a placement group with a globally unique name.
std::vector<std::unordered_map<std::string, double>> bundles{{{"CPU", 1.0}}};

ray::PlacementGroupCreationOptions options{
    true/*global*/, "global_name", bundles, ray::PlacementStrategy::STRICT_SPREAD};
ray::PlacementGroup pg = ray::CreatePlacementGroup(options);
pg.Wait(60);

...

// Retrieve the placement group later somewhere.
ray::PlacementGroup group = ray::GetGlobalPlacementGroup("global_name");
assert(!group.Empty());

We also support non-global named placement groups in C++, which means that the placement group name is only valid within the job and cannot be accessed from another job.

// Create a placement group with a job-scope-unique name.
std::vector<std::unordered_map<std::string, double>> bundles{{{"CPU", 1.0}}};

ray::PlacementGroupCreationOptions options{
    false/*non-global*/, "non_global_name", bundles, ray::PlacementStrategy::STRICT_SPREAD};
ray::PlacementGroup pg = ray::CreatePlacementGroup(options);
pg.Wait(60);

...

// Retrieve the placement group later somewhere in the same job.
ray::PlacementGroup group = ray::GetPlacementGroup("non_global_name");
assert(!group.Empty());

[Advanced] Detached Placement Group

By default, the lifetime of a placement group belongs to the driver or actor that created it. If the placement group is created from a driver, it is destroyed when the driver is terminated. If it is created from a detached actor, it is killed when the detached actor is killed.

To keep the placement group alive regardless of its job or detached actor, specify lifetime="detached". For example:

Python

# driver_1.py
# Create a detached placement group that survives even after
# the job terminates.
pg = placement_group([{"CPU": 1}], lifetime="detached", name="global_name")
ray.get(pg.ready())

Java

The lifetime argument is not implemented for Java APIs yet.

Let's terminate the current script and start a new Python script. Call ray list placement-groups, and you can see the placement group is not removed.

Note that the lifetime option is decoupled from the name. If we only specified the name without specifying lifetime="detached", then the placement group can only be retrieved as long as the original driver is still running. It is recommended to always specify the name when creating a detached placement group.

[Advanced] Fault Tolerance

Rescheduling Bundles on a Dead Node

If nodes that contain some bundles of a placement group die, the affected bundles are rescheduled on different nodes by GCS (i.e., we try reserving resources again). This means that the initial creation of a placement group is "atomic", but once it is created, there can be partial placement groups. Rescheduling of bundles has higher scheduling priority than other placement group scheduling.

Provide Resources for Partially Lost Bundles

If there are not enough resources to schedule the partially lost bundles, the placement group waits, assuming Ray Autoscaler will start a new node to satisfy the resource requirements. If the additional resources cannot be provided (e.g., you don't use the Autoscaler or the Autoscaler hits the resource limit), the placement group remains in the partially created state indefinitely.

Fault Tolerance of Actors and Tasks that Use the Bundle

Actors and tasks that use the bundle (reserved resources) are rescheduled based on their fault tolerance policies once the bundle is recovered.

API Reference

Placement Group API reference

Memory Management

This page describes how memory management works in Ray. Also view Debugging Out of Memory to learn how to troubleshoot out-of-memory issues.
Concepts

There are several ways that Ray applications use memory:

Ray system memory: memory used internally by Ray.
  GCS: memory used for storing the list of nodes and actors present in the cluster. The amount of memory used for these purposes is typically quite small.
  Raylet: memory used by the C++ raylet process running on each node. This cannot be controlled, but is typically quite small.

Application memory: memory used by your application.
  Worker heap: memory used by your application (e.g., in Python code or TensorFlow), best measured as the resident set size (RSS) of your application minus its shared memory usage (SHR) in commands such as top. The reason you need to subtract SHR is that object store shared memory is reported by the OS as shared with each worker; not subtracting SHR would double count memory usage. For example, a worker showing 3 GiB RSS and 1 GiB SHR is using roughly 2 GiB of heap.
  Object store memory: memory used when your application creates objects in the object store via ray.put and when it returns values from remote functions. Objects are reference counted and evicted when they fall out of scope. An object store server runs on each node. By default, when starting an instance, Ray reserves 30% of available memory. The size of the object store can be controlled by --object-store-memory. The memory is by default allocated to /dev/shm (shared memory) for Linux. For MacOS, Ray uses /tmp (disk), which can impact performance compared to Linux. In Ray 1.3+, objects are spilled to disk if the object store fills up.
  Object store shared memory: memory used when your application reads objects via ray.get. Note that if an object is already present on the node, this does not cause additional allocations. This allows large objects to be efficiently shared among many actors and tasks.

ObjectRef Reference Counting

Ray implements distributed reference counting so that any ObjectRef in scope in the cluster is pinned in the object store. This includes local Python references, arguments to pending tasks, and IDs serialized inside of other objects.

Debugging using 'ray memory'

The ray memory command can be used to help track down what ObjectRef references are in scope and may be causing an ObjectStoreFullError.

Running ray memory from the command line while a Ray application is running will give you a dump of all of the ObjectRef references that are currently held by the driver, actors, and tasks in the cluster.
--- Summary for node address: 192.168.0.15 ---
Mem Used by Objects  Local References  Pinned Count  Pending Tasks  Captured in Objects  Actor Handles
287 MiB              4                 0             0              1                    0

--- Object references for node address: 192.168.0.15 ---
IP Address    PID   Type    Object Ref                                                 Size     Reference Type      Call Site
192.168.0.15  6465  Driver  ffffffffffffffffffffffffffffffffffffffff0100000001000000  15 MiB   LOCAL_REFERENCE     (put object) | test.py: :17
192.168.0.15  6465  Driver  a67dc375e60ddd1affffffffffffffffffffffff0100000001000000  15 MiB   LOCAL_REFERENCE     (task call) | test.py: ::18
192.168.0.15  6465  Driver  ffffffffffffffffffffffffffffffffffffffff0100000002000000  18 MiB   CAPTURED_IN_OBJECT  (put object) | test.py: :19
192.168.0.15  6465  Driver  ffffffffffffffffffffffffffffffffffffffff0100000004000000  21 MiB   LOCAL_REFERENCE     (put object) | test.py: :20
192.168.0.15  6465  Driver  ffffffffffffffffffffffffffffffffffffffff0100000003000000  218 MiB  LOCAL_REFERENCE     (put object) | test.py: :20

--- Aggregate object store stats across all nodes ---
Plasma memory usage 0 MiB, 4 objects, 0.0% full

Each entry in this output corresponds to an ObjectRef that's currently pinning an object in the object store, along with where the reference is (in the driver, in a worker, etc.), what type of reference it is (see below for details on the types of references), the size of the object in bytes, the process ID and IP address where the object was instantiated, and where in the application the reference was created.

ray memory comes with features to make the memory debugging experience more effective. For example, you can add the arguments --sort-by=OBJECT_SIZE and --group-by=STACK_TRACE, which may be particularly helpful for tracking down the line of code where a memory leak occurs. You can see the full suite of options by running ray memory --help.

There are five types of references that can keep an object pinned:

1. Local ObjectRef references

import ray

@ray.remote
def f(arg):
    return arg

a = ray.put(None)
b = f.remote(None)

In this example, we create references to two objects: one that is ray.put() in the object store and another that's the return value from f.remote().

--- Summary for node address: 192.168.0.15 ---
Mem Used by Objects  Local References  Pinned Count  Pending Tasks  Captured in Objects  Actor Handles
30 MiB               2                 0             0              0                    0

--- Object references for node address: 192.168.0.15 ---
IP Address    PID   Type    Object Ref                                                 Size    Reference Type   Call Site
192.168.0.15  6867  Driver  ffffffffffffffffffffffffffffffffffffffff0100000001000000  15 MiB  LOCAL_REFERENCE  (put object) | test.py: :12
192.168.0.15  6867  Driver  a67dc375e60ddd1affffffffffffffffffffffff0100000001000000  15 MiB  LOCAL_REFERENCE  (task call) | test.py: ::13

In the output from ray memory, we can see that each of these is marked as a LOCAL_REFERENCE in the driver process, but the annotation in the "Reference Creation Site" indicates that the first was created as a "put object" and the second from a "task call."

2. Objects pinned in memory

import numpy as np

a = ray.put(np.zeros(1))
b = ray.get(a)
del a

In this example, we create a numpy array and then store it in the object store. Then, we fetch the same numpy array from the object store and delete its ObjectRef. In this case, the object is still pinned in the object store because the deserialized copy (stored in b) points directly to the memory in the object store.
--- Summary for node address: 192.168.0.15 ---
Mem Used by Objects  Local References  Pinned Count  Pending Tasks  Captured in Objects  Actor Handles
243 MiB              0                 1             0              0                    0

--- Object references for node address: 192.168.0.15 ---
IP Address    PID   Type    Object Ref                                                 Size     Reference Type    Call Site
192.168.0.15  7066  Driver  ffffffffffffffffffffffffffffffffffffffff0100000001000000  243 MiB  PINNED_IN_MEMORY  test.py::19

The output from ray memory displays this as the object being PINNED_IN_MEMORY. If we del b, the reference can be freed.

3. Pending task references

@ray.remote
def f(arg):
    while True:
        pass

a = ray.put(None)
b = f.remote(a)

In this example, we first create an object via ray.put() and then submit a task that depends on the object.

--- Summary for node address: 192.168.0.15 ---
Mem Used by Objects  Local References  Pinned Count  Pending Tasks  Captured in Objects  Actor Handles
25 MiB               1                 1             1              0                    0

--- Object references for node address: 192.168.0.15 ---
IP Address    PID   Type    Object Ref                                                 Size    Reference Type        Call Site
192.168.0.15  7207  Driver  a67dc375e60ddd1affffffffffffffffffffffff0100000001000000  ?       LOCAL_REFERENCE       (task call) | test.py: ::29
192.168.0.15  7241  Worker  ffffffffffffffffffffffffffffffffffffffff0100000001000000  10 MiB  PINNED_IN_MEMORY      (deserialize task arg) __main__.f
192.168.0.15  7207  Driver  ffffffffffffffffffffffffffffffffffffffff0100000001000000  15 MiB  USED_BY_PENDING_TASK  (put object) | test.py: :28

While the task is running, we see that ray memory shows both a LOCAL_REFERENCE and a USED_BY_PENDING_TASK reference for the object in the driver process. The worker process also holds a reference to the object because the Python arg is directly referencing the memory in the plasma store, so it can't be evicted; therefore it is PINNED_IN_MEMORY.

4. Serialized ObjectRef references

@ray.remote
def f(arg):
    while True:
        pass

a = ray.put(None)
b = f.remote([a])

In this example, we again create an object via ray.put(), but then pass it to a task wrapped in another object (in this case, a list).

--- Summary for node address: 192.168.0.15 ---
Mem Used by Objects  Local References  Pinned Count  Pending Tasks  Captured in Objects  Actor Handles
15 MiB               2                 0             1              0                    0

--- Object references for node address: 192.168.0.15 ---
IP Address    PID   Type    Object Ref                                                 Size    Reference Type        Call Site
192.168.0.15  7411  Worker  ffffffffffffffffffffffffffffffffffffffff0100000001000000  ?       LOCAL_REFERENCE       (deserialize task arg) __main__.f
192.168.0.15  7373  Driver  a67dc375e60ddd1affffffffffffffffffffffff0100000001000000  ?       LOCAL_REFERENCE       (task call) | test.py: ::38
192.168.0.15  7373  Driver  ffffffffffffffffffffffffffffffffffffffff0100000001000000  15 MiB  USED_BY_PENDING_TASK  (put object) | test.py: :37

Now, both the driver and the worker process running the task hold a LOCAL_REFERENCE to the object, in addition to it being USED_BY_PENDING_TASK on the driver. If this were an actor task, the actor could even hold a LOCAL_REFERENCE after the task completes by storing the ObjectRef in a member variable.

5. Captured ObjectRef references

a = ray.put(None)
b = ray.put([a])
del a

In this example, we first create an object via ray.put(), then capture its ObjectRef inside of another ray.put() object, and delete the first ObjectRef. In this case, both objects are still pinned.
--- Summary for node address: 192.168.0.15 ---
Mem Used by Objects  Local References  Pinned Count  Pending Tasks  Captured in Objects  Actor Handles
233 MiB              1                 0             0              1                    0

--- Object references for node address: 192.168.0.15 ---
IP Address    PID   Type    Object Ref                                                 Size     Reference Type      Call Site
192.168.0.15  7473  Driver  ffffffffffffffffffffffffffffffffffffffff0100000001000000  15 MiB   CAPTURED_IN_OBJECT  (put object) | test.py: :41
192.168.0.15  7473  Driver  ffffffffffffffffffffffffffffffffffffffff0100000002000000  218 MiB  LOCAL_REFERENCE     (put object) | test.py: :42

In the output of ray memory, we see that the second object displays as a normal LOCAL_REFERENCE, but the first object is listed as CAPTURED_IN_OBJECT.

Memory Aware Scheduling

By default, Ray does not take into account the potential memory usage of a task or actor when scheduling. This is simply because it cannot estimate ahead of time how much memory is required. However, if you know how much memory a task or actor requires, you can specify it in the resource requirements of its ray.remote decorator to enable memory-aware scheduling:

Specifying a memory requirement does NOT impose any limits on memory usage. The requirements are used for admission control during scheduling only (similar to how CPU scheduling works in Ray). It is up to the task itself not to use more memory than it requested.

To tell the Ray scheduler a task or actor requires a certain amount of available memory to run, set the memory argument. The Ray scheduler will then reserve the specified amount of available memory during scheduling, similar to how it handles CPU and GPU resources:

# reserve 500MiB of available memory to place this task
@ray.remote(memory=500 * 1024 * 1024)
def some_function(x):
    pass

# reserve 2.5GiB of available memory to place this actor
@ray.remote(memory=2500 * 1024 * 1024)
class SomeActor:
    def __init__(self, a, b):
        pass

In the above example, the memory quota is specified statically by the decorator, but you can also set it dynamically at runtime using .options() as follows:

# override the memory quota to 100MiB when submitting the task
some_function.options(memory=100 * 1024 * 1024).remote(x=1)

# override the memory quota to 1GiB when creating the actor
SomeActor.options(memory=1000 * 1024 * 1024).remote(a=1, b=2)

Questions or Issues?

You can post questions, issues, or feedback through the following channels:

Discussion Board: For questions about Ray usage or feature requests.
GitHub Issues: For bug reports.
Ray Slack: For getting in touch with Ray maintainers.
StackOverflow: Use the [ray] tag for questions about Ray.

Out-Of-Memory Prevention

If application tasks or actors consume a large amount of heap space, it can cause the node to run out of memory (OOM). When that happens, the operating system will start killing worker or raylet processes, disrupting the application. OOM may also stall metrics, and if this happens on the head node, it may stall the dashboard or other control processes and cause the cluster to become unusable.

In this section we will go over:

What the memory monitor is and how it works
How to enable and configure it
How to use the memory monitor to detect and resolve memory issues

Also view Debugging Out of Memory to learn how to troubleshoot out-of-memory issues.

What is the memory monitor?

The memory monitor is a component that runs within the raylet process on each node. It periodically checks the memory usage, which includes the worker heap, the object store, and the raylet, as described in memory management.
If the combined usage exceeds a configurable threshold, the raylet will kill a task or actor process to free up memory and prevent Ray from failing.

The memory monitor is available on Linux and is tested with Ray running inside a container that is using cgroup v1. If you encounter issues when running the memory monitor outside of a container, or when the container is using cgroup v2, please file an issue or post a question.

How do I disable the memory monitor?

The memory monitor is enabled by default and can be disabled by setting the environment variable RAY_memory_monitor_refresh_ms to zero when Ray starts (e.g., RAY_memory_monitor_refresh_ms=0 ray start …).

How do I configure the memory monitor?

The memory monitor is controlled by the following environment variables:

RAY_memory_monitor_refresh_ms (int, defaults to 250) is the interval at which to check memory usage and kill tasks or actors if needed. Task killing is disabled when this value is 0. The memory monitor selects and kills one task at a time and waits for it to be killed before choosing another one, regardless of how frequently the memory monitor runs.

RAY_memory_usage_threshold (float, defaults to 0.95) is the fraction of the node's memory capacity beyond which the node is considered to be under memory pressure. If memory usage is above this fraction, the monitor starts killing processes to free up memory. Ranges from [0, 1].

Using the Memory Monitor

Retry policy

When a task or actor is killed by the memory monitor, it is retried with exponential backoff. There is a cap on the retry delay, which is 60 seconds. If tasks are killed by the memory monitor, they are retried infinitely (not respecting max_retries). If actors are killed by the memory monitor, they are not recreated infinitely (the monitor respects max_restarts, which is 0 by default).

Worker killing policy

The memory monitor avoids infinite loops of task retries by ensuring at least one task is able to run for each caller on each node. If it is unable to ensure this, the workload fails with an OOM error. Note that this is only an issue for tasks, since the memory monitor will not indefinitely retry actors. If the workload fails, refer to how to address memory issues on how to adjust the workload to make it pass. For a code example, see the last task example below.

When a worker needs to be killed, the policy first prioritizes tasks that are retriable, i.e., tasks whose max_retries or max_restarts is > 0. This is done to minimize workload failure. Actors by default are not retriable since max_restarts defaults to 0. Therefore, by default, tasks are preferred to actors when it comes to what gets killed first.

When there are multiple callers that have created tasks, the policy picks a task from the caller with the greatest number of running tasks. If two callers have the same number of tasks, it picks the caller whose earliest task has a later start time. This is done to ensure fairness and allow each caller to make progress.

Amongst the tasks that share the same caller, the latest started task is killed first.

Below is an example to demonstrate the policy. In the example we have a script that creates two tasks, which in turn create four more tasks each. The tasks are colored such that each color forms a "group" of tasks that belong to the same caller.
[Figure: initial state of the task graph]

If, at this point, the node runs out of memory, the policy picks a task from the caller with the greatest number of tasks, and kills the task of that caller that started last:

[Figure: the task graph after the first kill]

If, at this point, the node still runs out of memory, the process repeats:

[Figure: the task graph after the second kill]

Example: Workload fails if the last task of the caller is killed

Let's create an application oom.py that runs a single task that requires more memory than what is available. It is set to infinite retry by setting max_retries to -1. Because the task is the last one of its caller, the worker killing policy fails the workload when it kills the task, even though the task is set to retry forever.

import ray

@ray.remote(max_retries=-1)
def leaks_memory():
    chunks = []
    bits_to_allocate = 8 * 100 * 1024 * 1024  # ~100 MiB
    while True:
        chunks.append([0] * bits_to_allocate)

try:
    ray.get(leaks_memory.remote())
except ray.exceptions.OutOfMemoryError as ex:
    print("task failed with OutOfMemoryError, which is expected")

Set RAY_event_stats_print_interval_ms=1000 so it prints the worker kill summary every second, since by default it prints every minute.

RAY_event_stats_print_interval_ms=1000 python oom.py

(raylet) node_manager.cc:3040: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 2c82620270df6b9dd7ae2791ef51ee4b5a9d5df9f795986c10dd219c, IP: 172.31.183.172) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 172.31.183.172`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
task failed with OutOfMemoryError, which is expected

Verify the task was indeed executed twice via task_oom_retry.

Example: memory monitor prefers to kill a retriable task

Let's first start Ray and specify the memory threshold.

RAY_memory_usage_threshold=0.4 ray start --head

Let's create an application two_actors.py that submits two actors, where the first one is retriable and the second one is non-retriable.

from math import ceil

import ray
from ray._private.utils import (
    get_system_memory,
)  # do not use outside of this example as these are private methods.
from ray._private.utils import (
    get_used_memory,
)  # do not use outside of this example as these are private methods.

# estimates the number of bytes to allocate to reach the desired memory usage percentage.
def get_additional_bytes_to_reach_memory_usage_pct(pct: float) -> int:
    used = get_used_memory()
    total = get_system_memory()
    bytes_needed = int(total * pct) - used
    assert (
        bytes_needed > 0
    ), "memory usage is already above the target. Increase the target percentage."
    return bytes_needed


@ray.remote
class MemoryHogger:
    def __init__(self):
        self.allocations = []

    def allocate(self, bytes_to_allocate: float) -> None:
        # divide by 8 as each element in the array occupies 8 bytes
        new_list = [0] * ceil(bytes_to_allocate / 8)
        self.allocations.append(new_list)


first_actor = MemoryHogger.options(
    max_restarts=1, max_task_retries=1, name="first_actor"
).remote()
second_actor = MemoryHogger.options(
    max_restarts=0, max_task_retries=0, name="second_actor"
).remote()

# each task requests 0.3 of the system memory when the memory threshold is 0.4.
allocate_bytes = get_additional_bytes_to_reach_memory_usage_pct(0.3)

first_actor_task = first_actor.allocate.remote(allocate_bytes)
second_actor_task = second_actor.allocate.remote(allocate_bytes)

error_thrown = False
try:
    ray.get(first_actor_task)
except ray.exceptions.OutOfMemoryError as ex:
    error_thrown = True
    print("First started actor, which is retriable, was killed by the memory monitor.")
assert error_thrown

ray.get(second_actor_task)
print("Second started actor, which is not-retriable, finished.")

Run the application to see that only the first actor was killed.

$ python two_actors.py

First started actor, which is retriable, was killed by the memory monitor.
Second started actor, which is not-retriable, finished.

Addressing memory issues

When the application fails due to OOM, consider reducing the memory usage of the tasks and actors, increasing the memory capacity of the node, or limiting the number of concurrently running tasks.

Questions or Issues?

You can post questions, issues, or feedback through the following channels:

Discussion Board: For questions about Ray usage or feature requests.
GitHub Issues: For bug reports.
Ray Slack: For getting in touch with Ray maintainers.
StackOverflow: Use the [ray] tag for questions about Ray.

Fault Tolerance

Ray is a distributed system, and that means failures can happen. Generally, failures can be classified into two classes: 1) application-level failures, and 2) system-level failures. The former can happen because of bugs in user-level code, or if external systems fail. The latter can be triggered by node failures, network failures, or just bugs in Ray. Here, we describe the mechanisms that Ray provides to allow applications to recover from failures.

To handle application-level failures, Ray provides mechanisms to catch errors, retry failed code, and handle misbehaving code. See the pages for task and actor fault tolerance for more information on these mechanisms.

Ray also provides several mechanisms to automatically recover from internal system-level failures like node failures. In particular, Ray can automatically recover from some failures in the distributed object store.

How to Write Fault Tolerant Ray Applications

There are several recommendations to make Ray applications fault tolerant:

First, if the fault tolerance mechanisms provided by Ray don't work for you, you can always catch exceptions caused by failures and recover manually.

@ray.remote
class Actor:
    def read_only(self):
        import sys
        import random

        rand = random.random()
        if rand < 0.2:
            return 2 / 0
        elif rand < 0.3:
            sys.exit(1)

        return 2

actor = Actor.remote()
# Manually retry the actor task.
while True:
    try:
        print(ray.get(actor.read_only.remote()))
        break
    except ZeroDivisionError:
        pass
    except ray.exceptions.RayActorError:
        # Manually restart the actor
        actor = Actor.remote()

Second, avoid letting an ObjectRef outlive its owner task or actor (the task or actor that creates the initial ObjectRef by calling ray.put() or foo.remote()). As long as there are still references to an object, the owner worker of the object keeps running even after the corresponding task or actor finishes. If the owner worker fails, Ray cannot recover the object automatically for whoever tries to access it.

One example of creating such outlived objects is returning an ObjectRef created by ray.put() from a task:

import ray

# Non-fault tolerant version:
@ray.remote
def a():
    x_ref = ray.put(1)
    return x_ref

x_ref = ray.get(a.remote())
# Object x outlives its owner task A.
try:
    # If owner of x (i.e. the worker process running task A) dies,
    # the application can no longer get value of x.
    print(ray.get(x_ref))
except ray.exceptions.OwnerDiedError:
    pass

In the above example, object x outlives its owner task a. If the worker process running task a fails, calling ray.get on x_ref afterwards will result in an OwnerDiedError exception.

A fault-tolerant version returns x directly, so that it is owned by the driver and is only accessed within the lifetime of the driver. If x is lost, Ray can automatically recover it via lineage reconstruction. See Anti-pattern: Returning ray.put() ObjectRefs from a task harms performance and fault tolerance for more details.

# Fault tolerant version:
@ray.remote
def a():
    # Here we return the value directly instead of calling ray.put() first.
    return 1

# The owner of x is the driver
# so x is accessible and can be auto recovered
# during the entire lifetime of the driver.
x_ref = a.remote()
print(ray.get(x_ref))

Third, avoid using custom resource requirements that can only be satisfied by a particular node. If that particular node fails, the running tasks or actors cannot be retried.

@ray.remote
def b():
    return 1

# If the node with ip 127.0.0.3 fails while task b is running,
# Ray cannot retry the task on other nodes.
b.options(resources={"node:127.0.0.3": 1}).remote()

If you prefer running a task on a particular node, you can use the NodeAffinitySchedulingStrategy. It allows you to specify the affinity as a soft constraint, so even if the target node fails, the task can still be retried on other nodes.

# Prefer running on the particular node specified by node id
# but can also run on other nodes if the target node fails.
b.options(
    scheduling_strategy=ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
        node_id=ray.get_runtime_context().get_node_id(), soft=True
    )
).remote()

More about Ray Fault Tolerance

Task Fault Tolerance

Tasks can fail due to application-level errors, e.g., Python-level exceptions, or system-level failures, e.g., a machine fails. Here, we describe the mechanisms that an application developer can use to recover from these errors.

Catching application-level failures

Ray surfaces application-level failures as Python-level exceptions. When a task on a remote worker or actor fails due to a Python-level exception, Ray wraps the original exception in a RayTaskError and stores this as the task's return value. This wrapped exception will be thrown to any worker that tries to get the result, either by calling ray.get or if the worker is executing another task that depends on the object.
import ray @ray.remote def f(): raise Exception("the real error") @ray.remote def g(x): return try: ray.get(f.remote()) except ray.exceptions.RayTaskError as e: print(e) # ray::f() (pid=71867, ip=XXX.XX.XXX.XX) # File "errors.py", line 5, in f # raise Exception("the real error") # Exception: the real error try: ray.get(g.remote(f.remote())) except ray.exceptions.RayTaskError as e: print(e) # ray::g() (pid=73085, ip=128.32.132.47) # At least one of the input arguments for this task could not be computed: # ray.exceptions.RayTaskError: ray::f() (pid=73085, ip=XXX.XX.XXX.XX) # File "errors.py", line 5, in f # raise Exception("the real error") # Exception: the real error Use ray list tasks from State API CLI to query task exit details: # This API is only available when you download Ray via `pip install "ray[default]"` ray list tasks ======== List: 2023-05-26 10:32:00.962610 ======== Stats: ------------------------------ Total: 3 Table: ------------------------------ TASK_ID ATTEMPT_NUMBER NAME STATE JOB_ID ACTOR_ID TYPE FUNC_OR_CLASS_NAME PARENT_TASK_ID NODE_ID WORKER_ID ERROR_TYPE 0 16310a0f0a45af5cffffffffffffffffffffffff01000000 0 f FAILED 01000000 NORMAL_TASK f ffffffffffffffffffffffffffffffffffffffff01000000 767bd47b72efb83f33dda1b661621cce9b969b4ef00788140ecca8ad b39e3c523629ab6976556bd46be5dbfbf319f0fce79a664122eb39a9 TASK_EXECUTION_EXCEPTION 1 c2668a65bda616c1ffffffffffffffffffffffff01000000 0 g FAILED 01000000 NORMAL_TASK g ffffffffffffffffffffffffffffffffffffffff01000000 767bd47b72efb83f33dda1b661621cce9b969b4ef00788140ecca8ad b39e3c523629ab6976556bd46be5dbfbf319f0fce79a664122eb39a9 TASK_EXECUTION_EXCEPTION 2 c8ef45ccd0112571ffffffffffffffffffffffff01000000 0 f FAILED 01000000 NORMAL_TASK f ffffffffffffffffffffffffffffffffffffffff01000000 767bd47b72efb83f33dda1b661621cce9b969b4ef00788140ecca8ad b39e3c523629ab6976556bd46be5dbfbf319f0fce79a664122eb39a9 TASK_EXECUTION_EXCEPTION Retrying failed tasks When a worker is executing a task, if the worker dies unexpectedly, either because the process crashed or because the machine failed, Ray will rerun the task until either the task succeeds or the maximum number of retries is exceeded. The default number of retries is 3 and can be overridden by specifying max_retries in the @ray.remote decorator. Specifying -1 allows infinite retries, and 0 disables retries. To override the default number of retries for all tasks submitted, set the OS environment variable RAY_TASK_MAX_RETRIES. e.g., by passing this to your driver script or by using runtime environments. You can experiment with this behavior by running the following code. import numpy as np import os import ray import time ray.init(ignore_reinit_error=True) @ray.remote(max_retries=1) def potentially_fail(failure_probability): time.sleep(0.2) if np.random.random() < failure_probability: os._exit(0) return 0 for _ in range(3): try: # If this task crashes, Ray will retry it up to one additional # time. If either of the attempts succeeds, the call to ray.get # below will return normally. Otherwise, it will raise an # exception. ray.get(potentially_fail.remote(0.5)) print('SUCCESS') except ray.exceptions.WorkerCrashedError: print('FAILURE') When a task returns a result in the Ray object store, it is possible for the resulting object to be lost after the original task has already finished. In these cases, Ray will also try to automatically recover the object by re-executing the tasks that created the object. This can be configured through the same max_retries option described here. 
See object fault tolerance for more information. By default, Ray will not retry tasks upon exceptions thrown by application code. However, you may control whether application-level errors are retried, and even which application-level errors are retried, via the retry_exceptions argument. This is False by default. To enable retries upon application-level errors, set retry_exceptions=True to retry upon any exception, or pass a list of retryable exceptions. An example is shown below. import numpy as np import os import ray import time ray.init(ignore_reinit_error=True) class RandomError(Exception): pass @ray.remote(max_retries=1, retry_exceptions=True) def potentially_fail(failure_probability): if failure_probability < 0 or failure_probability > 1: raise ValueError( "failure_probability must be between 0 and 1, but got: " f"{failure_probability}" ) time.sleep(0.2) if np.random.random() < failure_probability: raise RandomError("Failed!") return 0 for _ in range(3): try: # If this task crashes, Ray will retry it up to one additional # time. If either of the attempts succeeds, the call to ray.get # below will return normally. Otherwise, it will raise an # exception. ray.get(potentially_fail.remote(0.5)) print('SUCCESS') except RandomError: print('FAILURE') # Provide the exceptions that we want to retry as an allowlist. retry_on_exception = potentially_fail.options(retry_exceptions=[RandomError]) try: # This will fail since we're passing in -1 for the failure_probability, # which will raise a ValueError in the task and does not match the RandomError # exception that we provided. ray.get(retry_on_exception.remote(-1)) except ValueError: print("FAILED AS EXPECTED") else: raise RuntimeError("An exception should be raised so this shouldn't be reached.") # These will retry on the RandomError exception. for _ in range(3): try: # If this task crashes, Ray will retry it up to one additional # time. If either of the attempts succeeds, the call to ray.get # below will return normally. Otherwise, it will raise an # exception. ray.get(retry_on_exception.remote(0.5)) print('SUCCESS') except RandomError: print('FAILURE AFTER RETRIES') Use ray list tasks -f task_id= from State API CLI to see task attempts failures and retries: # This API is only available when you download Ray via `pip install "ray[default]"` ray list tasks -f task_id=16310a0f0a45af5cffffffffffffffffffffffff01000000 ======== List: 2023-05-26 10:38:08.809127 ======== Stats: ------------------------------ Total: 2 Table: ------------------------------ TASK_ID ATTEMPT_NUMBER NAME STATE JOB_ID ACTOR_ID TYPE FUNC_OR_CLASS_NAME PARENT_TASK_ID NODE_ID WORKER_ID ERROR_TYPE 0 16310a0f0a45af5cffffffffffffffffffffffff01000000 0 potentially_fail FAILED 01000000 NORMAL_TASK potentially_fail ffffffffffffffffffffffffffffffffffffffff01000000 94909e0958e38d10d668aa84ed4143d0bf2c23139ae1a8b8d6ef8d9d b36d22dbf47235872ad460526deaf35c178c7df06cee5aa9299a9255 WORKER_DIED 1 16310a0f0a45af5cffffffffffffffffffffffff01000000 1 potentially_fail FINISHED 01000000 NORMAL_TASK potentially_fail ffffffffffffffffffffffffffffffffffffffff01000000 94909e0958e38d10d668aa84ed4143d0bf2c23139ae1a8b8d6ef8d9d 22df7f2a9c68f3db27498f2f435cc18582de991fbcaf49ce0094ddb0 Cancelling misbehaving tasks If a task is hanging, you may want to cancel the task to continue to make progress. You can do this by calling ray.cancel on an ObjectRef returned by the task. By default, this will send a KeyboardInterrupt to the task’s worker if it is mid-execution. 
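For example, a hanging task can be cancelled with a sketch like the following (a minimal, illustrative example; the blocking_task name is made up):

import time

import ray

@ray.remote
def blocking_task():
    # Simulate a task that hangs.
    time.sleep(10_000)

ref = blocking_task.remote()
# Request cancellation. By default, Ray interrupts the worker that is
# executing the task with a KeyboardInterrupt.
ray.cancel(ref)
try:
    ray.get(ref)
except (ray.exceptions.TaskCancelledError, ray.exceptions.RayTaskError):
    print("The task was cancelled.")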
Passing force=True to ray.cancel will force-exit the worker. See the API reference for ray.cancel for more details. Note that currently, Ray will not automatically retry tasks that have been cancelled. Sometimes, application-level code may cause memory leaks on a worker after repeated task executions, e.g., due to bugs in third-party libraries. To make progress in these cases, you can set the max_calls option in a task’s @ray.remote decorator. Once a worker has executed this many invocations of the given remote function, it will automatically exit. By default, max_calls is set to infinity. Actor Fault Tolerance Actors can fail if the actor process dies, or if the owner of the actor dies. The owner of an actor is the worker that originally created the actor by calling ActorClass.remote(). Detached actors do not have an owner process and are cleaned up when the Ray cluster is destroyed. Actor process failure Ray can automatically restart actors that crash unexpectedly. This behavior is controlled using max_restarts, which sets the maximum number of times that an actor will be restarted. The default value of max_restarts is 0, meaning that the actor won’t be restarted. If set to -1, the actor will be restarted infinitely many times. When an actor is restarted, its state will be recreated by rerunning its constructor. After the specified number of restarts, subsequent actor methods will raise a RayActorError. By default, actor tasks execute with at-most-once semantics (max_task_retries=0 in the @ray.remote decorator). This means that if an actor task is submitted to an actor that is unreachable, Ray will report the error with RayActorError, a Python-level exception that is thrown when ray.get is called on the future returned by the task. Note that this exception may be thrown even though the task did indeed execute successfully. For example, this can happen if the actor dies immediately after executing the task. Ray also offers at-least-once execution semantics for actor tasks (max_task_retries=-1 or max_task_retries > 0). This means that if an actor task is submitted to an actor that is unreachable, the system will automatically retry the task. With this option, the system will only throw a RayActorError to the application if one of the following occurs: (1) the actor’s max_restarts limit has been exceeded and the actor cannot be restarted anymore, or (2) the max_task_retries limit has been exceeded for this particular task. Note that if the actor is currently restarting when a task is submitted, this will count for one retry. The retry limit can be set to infinity with max_task_retries = -1. You can experiment with this behavior by running the following code. import os import ray ray.init() # This actor kills itself after executing 10 tasks. @ray.remote(max_restarts=4, max_task_retries=-1) class Actor: def __init__(self): self.counter = 0 def increment_and_possibly_fail(self): # Exit after every 10 tasks. if self.counter == 10: os._exit(0) self.counter += 1 return self.counter actor = Actor.remote() # The actor will be reconstructed up to 4 times, so we can execute up to 50 # tasks successfully. The actor is reconstructed by rerunning its constructor. # Methods that were executing when the actor died will be retried and will not # raise a `RayActorError`. Retried methods may execute twice, once on the # failed actor and a second time on the restarted actor. for _ in range(50): counter = ray.get(actor.increment_and_possibly_fail.remote()) print(counter) # Prints the sequence 1-10 5 times. 
# After the actor has been restarted 4 times, all subsequent methods will # raise a `RayActorError`. for _ in range(10): try: counter = ray.get(actor.increment_and_possibly_fail.remote()) print(counter) # Unreachable. except ray.exceptions.RayActorError: print("FAILURE") # Prints 10 times. For at-least-once actors, the system will still guarantee execution ordering according to the initial submission order. For example, any tasks submitted after a failed actor task will not execute on the actor until the failed actor task has been successfully retried. The system will not attempt to re-execute any tasks that executed successfully before the failure (unless max_task_retries is nonzero and the task is needed for object reconstruction). For async or threaded actors, tasks might be executed out of order. Upon actor restart, the system will only retry incomplete tasks. Previously completed tasks will not be re-executed. At-least-once execution is best suited for read-only actors or actors with ephemeral state that does not need to be rebuilt after a failure. For actors that have critical state, the application is responsible for recovering the state, e.g., by taking periodic checkpoints and recovering from the checkpoint upon actor restart. Actor checkpointing max_restarts automatically restarts the crashed actor, but it doesn’t automatically restore application level state in your actor. Instead, you should manually checkpoint your actor’s state and recover upon actor restart. For actors that are restarted manually, the actor’s creator should manage the checkpoint and manually restart and recover the actor upon failure. This is recommended if you want the creator to decide when the actor should be restarted and/or if the creator is coordinating actor checkpoints with other execution: import os import sys import ray import json import tempfile import shutil @ray.remote(num_cpus=1) class Worker: def __init__(self): self.state = {"num_tasks_executed": 0} def execute_task(self, crash=False): if crash: sys.exit(1) # Execute the task # ... 
# Update the internal state self.state["num_tasks_executed"] = self.state["num_tasks_executed"] + 1 def checkpoint(self): return self.state def restore(self, state): self.state = state class Controller: def __init__(self): self.worker = Worker.remote() self.worker_state = ray.get(self.worker.checkpoint.remote()) def execute_task_with_fault_tolerance(self): i = 0 while True: i = i + 1 try: ray.get(self.worker.execute_task.remote(crash=(i % 2 == 1))) # Checkpoint the latest worker state self.worker_state = ray.get(self.worker.checkpoint.remote()) return except ray.exceptions.RayActorError: print("Actor crashes, restarting...") # Restart the actor and restore the state self.worker = Worker.remote() ray.get(self.worker.restore.remote(self.worker_state)) controller = Controller() controller.execute_task_with_fault_tolerance() controller.execute_task_with_fault_tolerance() assert ray.get(controller.worker.checkpoint.remote())["num_tasks_executed"] == 2 Alternatively, if you are using Ray’s automatic actor restart, the actor can checkpoint itself manually and restore from a checkpoint in the constructor: @ray.remote(max_restarts=-1, max_task_retries=-1) class ImmortalActor: def __init__(self, checkpoint_file): self.checkpoint_file = checkpoint_file if os.path.exists(self.checkpoint_file): # Restore from a checkpoint with open(self.checkpoint_file, "r") as f: self.state = json.load(f) else: self.state = {} def update(self, key, value): import random if random.randrange(10) < 5: sys.exit(1) self.state[key] = value # Checkpoint the latest state with open(self.checkpoint_file, "w") as f: json.dump(self.state, f) def get(self, key): return self.state[key] checkpoint_dir = tempfile.mkdtemp() actor = ImmortalActor.remote(os.path.join(checkpoint_dir, "checkpoint.json")) ray.get(actor.update.remote("1", 1)) ray.get(actor.update.remote("2", 2)) assert ray.get(actor.get.remote("1")) == 1 shutil.rmtree(checkpoint_dir) If the checkpoint is saved to external storage, make sure it’s accessible to the entire cluster since the actor can be restarted on a different node. For example, save the checkpoint to cloud storage (e.g., S3) or a shared directory (e.g., via NFS). Actor creator failure For non-detached actors, the owner of an actor is the worker that created it, i.e. the worker that called ActorClass.remote(). Similar to objects, if the owner of an actor dies, then the actor will also fate-share with the owner. Ray will not automatically recover an actor whose owner is dead, even if it has a nonzero max_restarts. Since detached actors do not have an owner, they will still be restarted by Ray even if their original creator dies. Detached actors will continue to be automatically restarted until the maximum restarts is exceeded, the actor is destroyed, or until the Ray cluster is destroyed. You can try out this behavior in the following code. import ray import os import signal ray.init() @ray.remote(max_restarts=-1) class Actor: def ping(self): return "hello" @ray.remote class Parent: def generate_actors(self): self.child = Actor.remote() self.detached_actor = Actor.options(name="actor", lifetime="detached").remote() return self.child, self.detached_actor, os.getpid() parent = Parent.remote() actor, detached_actor, pid = ray.get(parent.generate_actors.remote()) os.kill(pid, signal.SIGKILL) try: print("actor.ping:", ray.get(actor.ping.remote())) except ray.exceptions.RayActorError as e: print("Failed to submit actor call", e) # Failed to submit actor call The actor died unexpectedly before finishing this task. 
# class_name: Actor
# actor_id: 56f541b178ff78470f79c3b601000000
# namespace: ea8b3596-7426-4aa8-98cc-9f77161c4d5f
# The actor is dead because all references to the actor were removed.

try:
    print("detached_actor.ping:", ray.get(detached_actor.ping.remote()))
except ray.exceptions.RayActorError as e:
    print("Failed to submit detached actor call", e)
# detached_actor.ping: hello

Force-killing a misbehaving actor

Sometimes application-level code can cause an actor to hang or leak resources. In these cases, Ray allows you to recover from the failure by manually terminating the actor. You can do this by calling ray.kill on any handle to the actor; it does not need to be the original handle. If max_restarts is set, you can also allow Ray to automatically restart the actor by passing no_restart=False to ray.kill.

Object Fault Tolerance

A Ray object has both data (the value returned when calling ray.get) and metadata (e.g., the location of the value). Data is stored in the Ray object store while the metadata is stored at the object's owner. The owner of an object is the worker process that creates the original ObjectRef, e.g., by calling f.remote() or ray.put(). Note that this worker is usually a distinct process from the worker that creates the value of the object, except in the case of ray.put.

import ray
import numpy as np

@ray.remote
def large_array():
    return np.zeros(int(1e5))

x = ray.put(1)  # The driver owns x and also creates the value of x.

y = large_array.remote()
# The driver is the owner of y, even though the value may be stored somewhere else.
# If the node that stores the value of y dies, Ray will automatically recover
# it by re-executing the large_array task.
# If the driver dies, anyone still using y will receive an OwnerDiedError.

Ray can automatically recover from data loss but not owner failure.

Recovering from data loss

When an object value is lost from the object store, such as during node failures, Ray uses lineage reconstruction to recover the object. Ray first attempts to recover the value by looking for copies of the same object on other nodes. If none are found, Ray recovers the value by re-executing the task that previously created it. Arguments to the task are recursively reconstructed through the same mechanism.

Lineage reconstruction currently has the following limitations:

- The object, and any of its transitive dependencies, must have been generated by a task (actor or non-actor). This means that objects created by ray.put are not recoverable.
- Tasks are assumed to be deterministic and idempotent. Thus, by default, objects created by actor tasks are not reconstructable. To allow reconstruction of actor task results, set the max_task_retries parameter to a non-zero value (see actor fault tolerance for more details).
- Tasks will only be re-executed up to their maximum number of retries. By default, a non-actor task can be retried up to 3 times and an actor task cannot be retried. This can be overridden with the max_retries parameter for remote functions and the max_task_retries parameter for actors.
- The owner of the object must still be alive (see below).

Lineage reconstruction can cause higher than usual driver memory usage because the driver keeps the descriptions of any tasks that may be re-executed in case of failure. To limit the amount of memory used by lineage, set the environment variable RAY_max_lineage_bytes (default 1GB) to evict lineage if the threshold is exceeded.
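As noted in the limitations above, results of actor tasks are only reconstructable when retries are enabled for the actor. A minimal, illustrative sketch (the Generator class and make_block method are made up) of opting an actor into lineage reconstruction:

import numpy as np
import ray

# Enable restarts and task retries so that objects returned by this
# actor's tasks can be recovered through lineage reconstruction.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class Generator:
    def make_block(self):
        # A deterministic result that Ray can recreate if the copy is lost.
        return np.zeros(1024 * 1024)

gen = Generator.remote()
block_ref = gen.make_block.remote()
# If the node holding the value of block_ref fails, Ray can re-execute
# make_block on the (restarted) actor to recover the object.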
To disable lineage reconstruction entirely, set the environment variable RAY_TASK_MAX_RETRIES=0 during ray start or ray.init. With this setting, if there are no copies of an object left, an ObjectLostError will be raised.

Recovering from owner failure

The owner of an object can die because of node or worker process failure. Currently, Ray does not support recovery from owner failure. In this case, Ray will clean up any remaining copies of the object's value to prevent a memory leak. Any workers that subsequently try to get the object's value will receive an OwnerDiedError exception, which can be handled manually.

Understanding ObjectLostErrors

Ray throws an ObjectLostError to the application when an object cannot be retrieved due to an application or system error. This can occur during a ray.get() call or when fetching a task's arguments, and can happen for a number of reasons. Here is a guide to understanding the root cause for the different error types:

- OwnerDiedError: The owner of an object, i.e., the Python worker that first created the ObjectRef via .remote() or ray.put(), has died. The owner stores critical object metadata and an object cannot be retrieved if this process is lost.
- ObjectReconstructionFailedError: This error is thrown if an object, or another object that this object depends on, cannot be reconstructed due to one of the limitations described above.
- ReferenceCountingAssertionError: The object has already been deleted, so it cannot be retrieved. Ray implements automatic memory management through distributed reference counting, so this error should not happen in general. However, there is a known edge case that can produce this error.
- ObjectFetchTimedOutError: A node timed out while trying to retrieve a copy of the object from a remote node. This error usually indicates a system-level bug. The timeout period can be configured using the RAY_fetch_fail_timeout_milliseconds environment variable (default 10 minutes).
- ObjectLostError: The object was successfully created, but no copy is reachable. This is a generic error thrown when lineage reconstruction is disabled and all copies of the object are lost from the cluster.

Node Fault Tolerance

A Ray cluster consists of one or more worker nodes, each of which consists of worker processes and system processes (e.g., the raylet). One of the worker nodes is designated as the head node and runs extra processes such as the GCS. Here, we describe node failures and their impact on tasks, actors, and objects.

Worker node failure

When a worker node fails, all the running tasks and actors fail, and all the objects owned by worker processes of that node are lost. In this case, the task, actor, and object fault tolerance mechanisms kick in and try to recover from the failures using other worker nodes.

Head node failure

When the head node fails, the entire Ray cluster fails. To tolerate head node failures, we need to make the GCS fault tolerant so that when we start a new head node, we still have all the cluster-level data.

Raylet failure

When a raylet process fails, the corresponding node is marked as dead and is treated the same as a node failure. Each raylet is associated with a unique ID, so even if the raylet restarts on the same physical machine, it is treated as a new raylet/node by the Ray cluster.

GCS Fault Tolerance

The Global Control Service (GCS) is a server that manages cluster-level metadata. It also provides a handful of cluster-level operations, including actor, placement group, and node management.
By default, the GCS is not fault tolerant since all the data is stored in-memory, and its failure means that the entire Ray cluster fails. To make the GCS fault tolerant, a highly available (HA) Redis instance is required. Then, when the GCS restarts, it loads all the data from the Redis instance and resumes its regular functions.

During the recovery period, the following functions are not available:

- Actor creation, deletion, and reconstruction.
- Placement group creation, deletion, and reconstruction.
- Resource management.
- Worker node registration.
- Worker process creation.

However, running Ray tasks and actors remain alive, and any existing objects continue to be available.

Setting up Redis

KubeRay (officially supported)

If you are using KubeRay, please refer to the KubeRay docs on GCS Fault Tolerance.

ray start

If you are using ray start to start the Ray head node, set the OS environment variable RAY_REDIS_ADDRESS to the Redis address, and supply the --redis-password flag with the password when calling ray start:

RAY_REDIS_ADDRESS=redis_ip:port ray start --head --redis-password PASSWORD

ray up

If you are using ray up to start the Ray cluster, change the head_start_ray_commands field to add RAY_REDIS_ADDRESS and --redis-password to the ray start command:

head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; RAY_REDIS_ADDRESS=redis_ip:port ray start --head --redis-password PASSWORD --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

Kubernetes

If you are using Kubernetes but not KubeRay, please refer to this doc.

Once the GCS is backed by Redis, it recovers its state by reading from Redis when it restarts. While the GCS is recovering from its failed state, each raylet tries to reconnect to it. If a raylet fails to reconnect to the GCS for more than 60 seconds, the raylet exits and the corresponding node fails. This timeout threshold can be tuned with the OS environment variable RAY_gcs_rpc_server_reconnect_timeout_s.

You can also set the OS environment variable RAY_external_storage_namespace to isolate the data stored in Redis. This makes sure that there are no data conflicts if multiple Ray clusters share the same Redis instance.

If the IP address of the GCS may change after a restart, it's better to use a fully qualified domain name and pass it to all raylets at start time. Raylets resolve the domain name and connect to the correct GCS. You need to ensure that at any time, only one GCS is alive.

GCS fault tolerance with external Redis is officially supported ONLY if you are using KubeRay for Ray Serve fault tolerance. For other cases, you can use it at your own risk, and you need to implement additional mechanisms to detect the failure of the GCS or the head node and restart it.

Design Patterns & Anti-patterns

This section is a collection of common design patterns and anti-patterns for writing Ray applications.

Pattern: Using nested tasks to achieve nested parallelism

In this pattern, a remote task can dynamically call other remote tasks (including itself) for nested parallelism. This is useful when sub-tasks can be parallelized. Keep in mind, though, that nested tasks come with their own cost: extra worker processes, scheduling overhead, bookkeeping overhead, etc. To achieve speedup with nested parallelism, make sure each of your nested tasks does significant work. See Anti-pattern: Over-parallelizing with too fine-grained tasks harms speedup for more details.

Example use case

You want to quick-sort a large list of numbers.
By using nested tasks, we can sort the list in a distributed and parallel fashion. Tree of tasks Code example import ray import time from numpy import random def partition(collection): # Use the last element as the pivot pivot = collection.pop() greater, lesser = [], [] for element in collection: if element > pivot: greater.append(element) else: lesser.append(element) return lesser, pivot, greater def quick_sort(collection): if len(collection) <= 200000: # magic number return sorted(collection) else: lesser, pivot, greater = partition(collection) lesser = quick_sort(lesser) greater = quick_sort(greater) return lesser + [pivot] + greater @ray.remote def quick_sort_distributed(collection): # Tiny tasks are an antipattern. # Thus, in our example we have a "magic number" to # toggle when distributed recursion should be used vs # when the sorting should be done in place. The rule # of thumb is that the duration of an individual task # should be at least 1 second. if len(collection) <= 200000: # magic number return sorted(collection) else: lesser, pivot, greater = partition(collection) lesser = quick_sort_distributed.remote(lesser) greater = quick_sort_distributed.remote(greater) return ray.get(lesser) + [pivot] + ray.get(greater) for size in [200000, 4000000, 8000000]: print(f"Array size: {size}") unsorted = random.randint(1000000, size=(size)).tolist() s = time.time() quick_sort(unsorted) print(f"Sequential execution: {(time.time() - s):.3f}") s = time.time() ray.get(quick_sort_distributed.remote(unsorted)) print(f"Distributed execution: {(time.time() - s):.3f}") print("--" * 10) # Outputs: # Array size: 200000 # Sequential execution: 0.040 # Distributed execution: 0.152 # -------------------- # Array size: 4000000 # Sequential execution: 6.161 # Distributed execution: 5.779 # -------------------- # Array size: 8000000 # Sequential execution: 15.459 # Distributed execution: 11.282 # -------------------- We call ray.get() after both quick_sort_distributed function invocations take place. This allows you to maximize parallelism in the workload. See Anti-pattern: Calling ray.get in a loop harms parallelism for more details. Notice in the execution times above that with smaller tasks, the non-distributed version is faster. However, as the task execution time increases, i.e. because the lists to sort are larger, the distributed version is faster. Pattern: Using generators to reduce heap memory usage In this pattern, we use generators in Python to reduce the total heap memory usage during a task. The key idea is that for tasks that return multiple objects, we can return them one at a time instead of all at once. This allows a worker to free the heap memory used by a previous return value before returning the next one. Example use case You have a task that returns multiple large values. Another possibility is a task that returns a single large value, but you want to stream this value through Ray’s object store by breaking it up into smaller chunks. Using normal Python functions, we can write such a task like this. Here’s an example that returns numpy arrays of size 100MB each: import numpy as np @ray.remote def large_values(num_returns): return [ np.random.randint(np.iinfo(np.int8).max, size=(100_000_000, 1), dtype=np.int8) for _ in range(num_returns) ] However, this will require the task to hold all num_returns arrays in heap memory at the same time at the end of the task. If there are many return values, this can lead to high heap memory usage and potentially an out-of-memory error. 
We can fix the above example by rewriting large_values as a generator. Instead of returning all values at once as a tuple or list, we can yield one value at a time. @ray.remote def large_values_generator(num_returns): for i in range(num_returns): yield np.random.randint( np.iinfo(np.int8).max, size=(100_000_000, 1), dtype=np.int8 ) print(f"yielded return value {i}") Code example import sys import ray # fmt: off # __large_values_start__ import numpy as np @ray.remote def large_values(num_returns): return [ np.random.randint(np.iinfo(np.int8).max, size=(100_000_000, 1), dtype=np.int8) for _ in range(num_returns) ] # __large_values_end__ # fmt: on # fmt: off # __large_values_generator_start__ @ray.remote def large_values_generator(num_returns): for i in range(num_returns): yield np.random.randint( np.iinfo(np.int8).max, size=(100_000_000, 1), dtype=np.int8 ) print(f"yielded return value {i}") # __large_values_generator_end__ # fmt: on # A large enough value (e.g. 100). num_returns = int(sys.argv[1]) # Worker will likely OOM using normal returns. print("Using normal functions...") try: ray.get( large_values.options(num_returns=num_returns, max_retries=0).remote( num_returns )[0] ) except ray.exceptions.WorkerCrashedError: print("Worker failed with normal function") # Using a generator will allow the worker to finish. # Note that this will block until the full task is complete, i.e. the # last yield finishes. print("Using generators...") ray.get( large_values_generator.options(num_returns=num_returns, max_retries=0).remote( num_returns )[0] ) print("Success!") $ RAY_IGNORE_UNHANDLED_ERRORS=1 python test.py 100 Using normal functions... ... -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker... Worker failed Using generators... (large_values_generator pid=373609) yielded return value 0 (large_values_generator pid=373609) yielded return value 1 (large_values_generator pid=373609) yielded return value 2 ... Success! Pattern: Using ray.wait to limit the number of pending tasks In this pattern, we use ray.wait() to limit the number of pending tasks. If we continuously submit tasks faster than their process time, we will accumulate tasks in the pending task queue, which can eventually cause OOM. With ray.wait(), we can apply backpressure and limit the number of pending tasks so that the pending task queue won’t grow indefinitely and cause OOM. If we submit a finite number of tasks, it’s unlikely that we will hit the issue mentioned above since each task only uses a small amount of memory for bookkeeping in the queue. It’s more likely to happen when we have an infinite stream of tasks to run. This method is meant primarily to limit how many tasks should be in flight at the same time. It can also be used to limit how many tasks can run concurrently, but it is not recommended, as it can hurt scheduling performance. Ray automatically decides task parallelism based on resource availability, so the recommended method for adjusting how many tasks can run concurrently is to modify each task’s resource requirements instead. Example use case You have a worker actor that process tasks at a rate of X tasks per second and you want to submit tasks to it at a rate lower than X to avoid OOM. For example, Ray Serve uses this pattern to limit the number of pending queries for each worker. 
Limit number of pending tasks

Code example

Without backpressure:

import ray

ray.init()

@ray.remote
class Actor:
    async def heavy_compute(self):
        # taking a long time...
        # await asyncio.sleep(5)
        return

actor = Actor.remote()

NUM_TASKS = 1000
result_refs = []
# When NUM_TASKS is large enough, this will eventually OOM.
for _ in range(NUM_TASKS):
    result_refs.append(actor.heavy_compute.remote())
ray.get(result_refs)

With backpressure:

MAX_NUM_PENDING_TASKS = 100
result_refs = []
for _ in range(NUM_TASKS):
    if len(result_refs) > MAX_NUM_PENDING_TASKS:
        # update result_refs to only
        # track the remaining tasks.
        ready_refs, result_refs = ray.wait(result_refs, num_returns=1)
        ray.get(ready_refs)

    result_refs.append(actor.heavy_compute.remote())

ray.get(result_refs)

Pattern: Using resources to limit the number of concurrently running tasks

In this pattern, we use resources to limit the number of concurrently running tasks.

By default, Ray tasks require 1 CPU each and Ray actors require 0 CPU each, so the scheduler limits task concurrency to the available CPUs and actor concurrency to infinite. Tasks that use more than 1 CPU (e.g., via multithreading) may experience slowdown due to interference from concurrent tasks, but otherwise are safe to run.

However, tasks or actors that use more than their proportionate share of memory may overload a node and cause issues like OOM. If that is the case, we can reduce the number of concurrently running tasks or actors on each node by increasing the amount of resources they request. This works because Ray makes sure that the sum of the resource requirements of all of the concurrently running tasks and actors on a given node does not exceed the node's total resources. For actor tasks, the number of running actors limits the number of concurrently running actor tasks we can have.

Example use case

You have a data processing workload that processes each input file independently using Ray remote functions. Since each task needs to load the input data into heap memory and do the processing, running too many of them can cause OOM. In this case, you can use the memory resource to limit the number of concurrently running tasks (usage of other resources like num_cpus can achieve the same goal as well). Note that similar to num_cpus, the memory resource requirement is logical, meaning that Ray will not enforce the physical memory usage of each task if it exceeds this amount.

Code example

Without limit:

import ray

# Assume this Ray node has 16 CPUs and 16G memory.
ray.init()

@ray.remote
def process(file):
    # Actual work is reading the file and processing the data.
    # Assume it needs to use 2G memory.
    pass

NUM_FILES = 1000
result_refs = []
for i in range(NUM_FILES):
    # By default, process task will use 1 CPU resource and no other resources.
    # This means 16 tasks can run concurrently
    # and will OOM since 32G memory is needed while the node only has 16G.
    result_refs.append(process.remote(f"{i}.csv"))
ray.get(result_refs)

With limit:

result_refs = []
for i in range(NUM_FILES):
    # Now each task will use 2G memory resource
    # and the number of concurrently running tasks is limited to 8.
    # In this case, setting num_cpus to 2 has the same effect.
    result_refs.append(
        process.options(memory=2 * 1024 * 1024 * 1024).remote(f"{i}.csv")
    )
ray.get(result_refs)

Pattern: Using asyncio to run actor methods concurrently

By default, a Ray actor runs in a single thread and actor method calls are executed sequentially. This means that a long running method call blocks all the following ones.
In this pattern, we use await to yield control from the long running method call so other method calls can run concurrently. Normally, control is yielded when the method is doing IO operations, but you can also use await asyncio.sleep(0) to yield control explicitly. You can also use threaded actors to achieve concurrency.

Example use case

You have an actor with a long polling method that continuously fetches tasks from the remote store and executes them. You also want to query the number of tasks executed while the long polling method is running.

With the default actor, the code will look like this:

import ray

@ray.remote
class TaskStore:
    def get_next_task(self):
        return "task"

@ray.remote
class TaskExecutor:
    def __init__(self, task_store):
        self.task_store = task_store
        self.num_executed_tasks = 0

    def run(self):
        while True:
            task = ray.get(self.task_store.get_next_task.remote())
            self._execute_task(task)

    def _execute_task(self, task):
        # Executing the task
        self.num_executed_tasks = self.num_executed_tasks + 1

    def get_num_executed_tasks(self):
        return self.num_executed_tasks

task_store = TaskStore.remote()
task_executor = TaskExecutor.remote(task_store)
task_executor.run.remote()

try:
    # This will timeout since task_executor.run occupies the entire actor thread
    # and get_num_executed_tasks cannot run.
    ray.get(task_executor.get_num_executed_tasks.remote(), timeout=5)
except ray.exceptions.GetTimeoutError:
    print("get_num_executed_tasks didn't finish in 5 seconds")

This is problematic because the TaskExecutor.run method runs forever and never yields control to other methods. We can solve this problem by using an async actor and using await to yield control:

@ray.remote
class AsyncTaskExecutor:
    def __init__(self, task_store):
        self.task_store = task_store
        self.num_executed_tasks = 0

    async def run(self):
        while True:
            # Here we use await instead of ray.get() to
            # wait for the next task and it will yield
            # the control while waiting.
            task = await self.task_store.get_next_task.remote()
            self._execute_task(task)

    def _execute_task(self, task):
        # Executing the task
        self.num_executed_tasks = self.num_executed_tasks + 1

    def get_num_executed_tasks(self):
        return self.num_executed_tasks

async_task_executor = AsyncTaskExecutor.remote(task_store)
async_task_executor.run.remote()

# We are able to run get_num_executed_tasks while the run method is running.
num_executed_tasks = ray.get(async_task_executor.get_num_executed_tasks.remote())
print(f"num of executed tasks so far: {num_executed_tasks}")

Here, instead of using the blocking ray.get() to get the value of an ObjectRef, we use await so the actor can yield control while we are waiting for the object to be fetched.

Pattern: Using an actor to synchronize other tasks and actors

When you have multiple tasks that need to wait on some condition or otherwise need to synchronize across tasks and actors on a cluster, you can use a central actor to coordinate among them.

Example use case

You can use an actor to implement a distributed asyncio.Event that multiple tasks can wait on.

Code example

import asyncio

import ray

# We set num_cpus to zero because this actor will mostly just block on I/O.
@ray.remote(num_cpus=0) class SignalActor: def __init__(self): self.ready_event = asyncio.Event() def send(self, clear=False): self.ready_event.set() if clear: self.ready_event.clear() async def wait(self, should_wait=True): if should_wait: await self.ready_event.wait() @ray.remote def wait_and_go(signal): ray.get(signal.wait.remote()) print("go!") signal = SignalActor.remote() tasks = [wait_and_go.remote(signal) for _ in range(4)] print("ready...") # Tasks will all be waiting for the signals. print("set..") ray.get(signal.send.remote()) # Tasks are unblocked. ray.get(tasks) # Output is: # ready... # set.. # (wait_and_go pid=77366) go! # (wait_and_go pid=77372) go! # (wait_and_go pid=77367) go! # (wait_and_go pid=77358) go! Pattern: Using a supervisor actor to manage a tree of actors Actor supervision is a pattern in which a supervising actor manages a collection of worker actors. The supervisor delegates tasks to subordinates and handles their failures. This pattern simplifies the driver since it manages only a few supervisors and does not deal with failures from worker actors directly. Furthermore, multiple supervisors can act in parallel to parallelize more work. Tree of actors If the supervisor dies (or the driver), the worker actors are automatically terminated thanks to actor reference counting. Actors can be nested to multiple levels to form a tree. Example use case You want to do data parallel training and train the same model with different hyperparameters in parallel. For each hyperparameter, you can launch a supervisor actor to do the orchestration and it will create worker actors to do the actual training per data shard. For data parallel training and hyperparameter tuning, it’s recommended to use Ray AIR (DataParallelTrainer and Tuner) which applies this pattern under the hood. Code example import ray @ray.remote(num_cpus=1) class Trainer: def __init__(self, hyperparameter, data): self.hyperparameter = hyperparameter self.data = data # Train the model on the given training data shard. def fit(self): return self.data * self.hyperparameter @ray.remote(num_cpus=1) class Supervisor: def __init__(self, hyperparameter, data): self.trainers = [Trainer.remote(hyperparameter, d) for d in data] def fit(self): # Train with different data shard in parallel. return ray.get([trainer.fit.remote() for trainer in self.trainers]) data = [1, 2, 3] supervisor1 = Supervisor.remote(1, data) supervisor2 = Supervisor.remote(2, data) # Train with different hyperparameters in parallel. model1 = supervisor1.fit.remote() model2 = supervisor2.fit.remote() assert ray.get(model1) == [1, 2, 3] assert ray.get(model2) == [2, 4, 6] Pattern: Using pipelining to increase throughput If you have multiple work items and each requires several steps to complete, you can use the pipelining technique to improve the cluster utilization and increase the throughput of your system. Pipelining is an important technique to improve the performance and is heavily used by Ray libraries. See Ray Data as an example. Example use case A component of your application needs to do both compute-intensive work and communicate with other processes. Ideally, you want to overlap computation and communication to saturate the CPU and increase the overall throughput. 
Code example import ray @ray.remote class WorkQueue: def __init__(self): self.queue = list(range(10)) def get_work_item(self): if self.queue: return self.queue.pop(0) else: return None @ray.remote class WorkerWithoutPipelining: def __init__(self, work_queue): self.work_queue = work_queue def process(self, work_item): print(work_item) def run(self): while True: # Get work from the remote queue. work_item = ray.get(self.work_queue.get_work_item.remote()) if work_item is None: break # Do work. self.process(work_item) @ray.remote class WorkerWithPipelining: def __init__(self, work_queue): self.work_queue = work_queue def process(self, work_item): print(work_item) def run(self): self.work_item_ref = self.work_queue.get_work_item.remote() while True: # Get work from the remote queue. work_item = ray.get(self.work_item_ref) if work_item is None: break self.work_item_ref = self.work_queue.get_work_item.remote() # Do work while we are fetching the next work item. self.process(work_item) work_queue = WorkQueue.remote() worker_without_pipelining = WorkerWithoutPipelining.remote(work_queue) ray.get(worker_without_pipelining.run.remote()) work_queue = WorkQueue.remote() worker_with_pipelining = WorkerWithPipelining.remote(work_queue) ray.get(worker_with_pipelining.run.remote()) In the example above, a worker actor pulls work off of a queue and then does some computation on it. Without pipelining, we call ray.get() immediately after requesting a work item, so we block while that RPC is in flight, causing idle CPU time. With pipelining, we instead preemptively request the next work item before processing the current one, so we can use the CPU while the RPC is in flight which increases the CPU utilization. Anti-pattern: Returning ray.put() ObjectRefs from a task harms performance and fault tolerance TLDR: Avoid calling ray.put() on task return values and returning the resulting ObjectRefs. Instead, return these values directly if possible. Returning ray.put() ObjectRefs are considered anti-patterns for the following reasons: It disallows inlining small return values: Ray has a performance optimization to return small (<= 100KB) values inline directly to the caller, avoiding going through the distributed object store. On the other hand, ray.put() will unconditionally store the value to the object store which makes the optimization for small return values impossible. Returning ObjectRefs involves extra distributed reference counting protocol which is slower than returning the values directly. It’s less fault tolerant: the worker process that calls ray.put() is the “owner” of the returned ObjectRef and the return value fate shares with the owner. If the worker process dies, the return value is lost. In contrast, the caller process (often the driver) is the owner of the return value if it’s returned directly. Code example If you want to return a single value regardless if it’s small or large, you should return it directly. import ray import numpy as np @ray.remote def task_with_single_small_return_value_bad(): small_return_value = 1 # The value will be stored in the object store # and the reference is returned to the caller. small_return_value_ref = ray.put(small_return_value) return small_return_value_ref @ray.remote def task_with_single_small_return_value_good(): small_return_value = 1 # Ray will return the value inline to the caller # which is faster than the previous approach. 
return small_return_value assert ray.get(ray.get(task_with_single_small_return_value_bad.remote())) == ray.get( task_with_single_small_return_value_good.remote() ) @ray.remote def task_with_single_large_return_value_bad(): large_return_value = np.zeros(10 * 1024 * 1024) large_return_value_ref = ray.put(large_return_value) return large_return_value_ref @ray.remote def task_with_single_large_return_value_good(): # Both approaches will store the large array to the object store # but this is better since it's faster and more fault tolerant. large_return_value = np.zeros(10 * 1024 * 1024) return large_return_value assert np.array_equal( ray.get(ray.get(task_with_single_large_return_value_bad.remote())), ray.get(task_with_single_large_return_value_good.remote()), ) # Same thing applies for actor tasks as well. @ray.remote class Actor: def task_with_single_return_value_bad(self): single_return_value = np.zeros(9 * 1024 * 1024) return ray.put(single_return_value) def task_with_single_return_value_good(self): return np.zeros(9 * 1024 * 1024) actor = Actor.remote() assert np.array_equal( ray.get(ray.get(actor.task_with_single_return_value_bad.remote())), ray.get(actor.task_with_single_return_value_good.remote()), ) If you want to return multiple values and you know the number of returns before calling the task, you should use the num_returns option. # This will return a single object # which is a tuple of two ObjectRefs to the actual values. @ray.remote(num_returns=1) def task_with_static_multiple_returns_bad1(): return_value_1_ref = ray.put(1) return_value_2_ref = ray.put(2) return (return_value_1_ref, return_value_2_ref) # This will return two objects each of which is an ObjectRef to the actual value. @ray.remote(num_returns=2) def task_with_static_multiple_returns_bad2(): return_value_1_ref = ray.put(1) return_value_2_ref = ray.put(2) return (return_value_1_ref, return_value_2_ref) # This will return two objects each of which is the actual value. @ray.remote(num_returns=2) def task_with_static_multiple_returns_good(): return_value_1 = 1 return_value_2 = 2 return (return_value_1, return_value_2) assert ( ray.get(ray.get(task_with_static_multiple_returns_bad1.remote())[0]) == ray.get(ray.get(task_with_static_multiple_returns_bad2.remote()[0])) == ray.get(task_with_static_multiple_returns_good.remote()[0]) ) @ray.remote class Actor: @ray.method(num_returns=1) def task_with_static_multiple_returns_bad1(self): return_value_1_ref = ray.put(1) return_value_2_ref = ray.put(2) return (return_value_1_ref, return_value_2_ref) @ray.method(num_returns=2) def task_with_static_multiple_returns_bad2(self): return_value_1_ref = ray.put(1) return_value_2_ref = ray.put(2) return (return_value_1_ref, return_value_2_ref) @ray.method(num_returns=2) def task_with_static_multiple_returns_good(self): # This is faster and more fault tolerant. return_value_1 = 1 return_value_2 = 2 return (return_value_1, return_value_2) actor = Actor.remote() assert ( ray.get(ray.get(actor.task_with_static_multiple_returns_bad1.remote())[0]) == ray.get(ray.get(actor.task_with_static_multiple_returns_bad2.remote()[0])) == ray.get(actor.task_with_static_multiple_returns_good.remote()[0]) ) If you don’t know the number of returns before calling the task, you should use the dynamic generator pattern if possible. 
@ray.remote(num_returns=1) def task_with_dynamic_returns_bad(n): return_value_refs = [] for i in range(n): return_value_refs.append(ray.put(np.zeros(i * 1024 * 1024))) return return_value_refs @ray.remote(num_returns="dynamic") def task_with_dynamic_returns_good(n): for i in range(n): yield np.zeros(i * 1024 * 1024) assert np.array_equal( ray.get(ray.get(task_with_dynamic_returns_bad.remote(2))[0]), ray.get(next(iter(ray.get(task_with_dynamic_returns_good.remote(2))))), ) Anti-pattern: Calling ray.get in a loop harms parallelism TLDR: Avoid calling ray.get() in a loop since it’s a blocking call; use ray.get() only for the final result. A call to ray.get() fetches the results of remotely executed functions. However, it is a blocking call, which means that it always waits until the requested result is available. If you call ray.get() in a loop, the loop will not continue to run until the call to ray.get() is resolved. If you also spawn the remote function calls in the same loop, you end up with no parallelism at all, as you wait for the previous function call to finish (because of ray.get()) and only spawn the next call in the next iteration of the loop. The solution here is to separate the call to ray.get() from the call to the remote functions. That way all remote functions are spawned before we wait for the results and can run in parallel in the background. Additionally, you can pass a list of object references to ray.get() instead of calling it one by one to wait for all of the tasks to finish. Code example import ray ray.init() @ray.remote def f(i): return i # Anti-pattern: no parallelism due to calling ray.get inside of the loop. sequential_returns = [] for i in range(100): sequential_returns.append(ray.get(f.remote(i))) # Better approach: parallelism because the tasks are executed in parallel. refs = [] for i in range(100): refs.append(f.remote(i)) parallel_returns = ray.get(refs) Calling ray.get() in a loop When calling ray.get() right after scheduling the remote work, the loop blocks until the result is received. We thus end up with sequential processing. Instead, we should first schedule all remote calls, which are then processed in parallel. After scheduling the work, we can then request all the results at once. Other ray.get() related anti-patterns are: Anti-pattern: Calling ray.get unnecessarily harms performance Anti-pattern: Processing results in submission order using ray.get increases runtime Anti-pattern: Calling ray.get unnecessarily harms performance TLDR: Avoid calling ray.get() unnecessarily for intermediate steps. Work with object references directly, and only call ray.get() at the end to get the final result. When ray.get() is called, objects must be transferred to the worker/node that calls ray.get(). If you don’t need to manipulate the object, you probably don’t need to call ray.get() on it! Typically, it’s best practice to wait as long as possible before calling ray.get(), or even design your program to avoid having to call ray.get() at all. Code example Anti-pattern: import ray import numpy as np ray.init() @ray.remote def generate_rollout(): return np.ones((10000, 10000)) @ray.remote def reduce(rollout): return np.sum(rollout) # `ray.get()` downloads the result here. rollout = ray.get(generate_rollout.remote()) # Now we have to reupload `rollout` reduced = ray.get(reduce.remote(rollout)) Better approach: # Don't need ray.get here. rollout_obj_ref = generate_rollout.remote() # Rollout object is passed by reference. 
reduced = ray.get(reduce.remote(rollout_obj_ref)) Notice in the anti-pattern example, we call ray.get() which forces us to transfer the large rollout to the driver, then again to the reduce worker. In the fixed version, we only pass the reference to the object to the reduce task. The reduce worker will implicitly call ray.get() to fetch the actual rollout data directly from the generate_rollout worker, avoiding the extra copy to the driver. Other ray.get() related anti-patterns are: Anti-pattern: Calling ray.get in a loop harms parallelism Anti-pattern: Processing results in submission order using ray.get increases runtime Anti-pattern: Processing results in submission order using ray.get increases runtime TLDR: Avoid processing independent results in submission order using ray.get() since results may be ready in a different order than the submission order. A batch of tasks is submitted, and we need to process their results individually once they’re done. If each task takes a different amount of time to finish and we process results in submission order, we may waste time waiting for all of the slower (straggler) tasks that were submitted earlier to finish while later faster tasks have already finished. Instead, we want to process the tasks in the order that they finish using ray.wait() to speed up total time to completion. Processing results in submission order vs completion order Code example import random import time import ray ray.init() @ray.remote def f(i): time.sleep(random.random()) return i # Anti-pattern: process results in the submission order. sum_in_submission_order = 0 refs = [f.remote(i) for i in range(100)] for ref in refs: # Blocks until this ObjectRef is ready. result = ray.get(ref) # process result sum_in_submission_order = sum_in_submission_order + result # Better approach: process results in the completion order. sum_in_completion_order = 0 refs = [f.remote(i) for i in range(100)] unfinished = refs while unfinished: # Returns the first ObjectRef that is ready. finished, unfinished = ray.wait(unfinished, num_returns=1) result = ray.get(finished[0]) # process result sum_in_completion_order = sum_in_completion_order + result Other ray.get() related anti-patterns are: Anti-pattern: Calling ray.get unnecessarily harms performance Anti-pattern: Calling ray.get in a loop harms parallelism Anti-pattern: Fetching too many objects at once with ray.get causes failure TLDR: Avoid calling ray.get() on too many objects since this will lead to heap out-of-memory or object store out-of-space. Instead fetch and process one batch at a time. If you have a large number of tasks that you want to run in parallel, trying to do ray.get() on all of them at once could lead to failure with heap out-of-memory or object store out-of-space since Ray needs to fetch all the objects to the caller at the same time. Instead you should get and process the results one batch at a time. Once a batch is processed, Ray will evict objects in that batch to make space for future batches. Fetching too many objects at once with ray.get() Code example Anti-pattern: import ray import numpy as np ray.init() def process_results(results): # custom process logic pass @ray.remote def return_big_object(): return np.zeros(1024 * 10) NUM_TASKS = 1000 object_refs = [return_big_object.remote() for _ in range(NUM_TASKS)] # This will fail with heap out-of-memory # or object store out-of-space if NUM_TASKS is large enough. 
results = ray.get(object_refs)
process_results(results)

Better approach:

BATCH_SIZE = 100

while object_refs:
    # Process results in the finish order instead of the submission order.
    ready_object_refs, object_refs = ray.wait(object_refs, num_returns=BATCH_SIZE)
    # The node only needs enough space to store
    # a batch of objects instead of all objects.
    results = ray.get(ready_object_refs)
    process_results(results)

Here, besides getting one batch at a time to avoid failure, we are also using ray.wait() to process results in the finish order instead of the submission order to reduce the runtime. See Anti-pattern: Processing results in submission order using ray.get increases runtime for more details.

Anti-pattern: Over-parallelizing with too fine-grained tasks harms speedup

TLDR: Avoid over-parallelizing. Parallelizing tasks has higher overhead than using normal functions.

Parallelizing or distributing tasks usually comes with higher overhead than an ordinary function call. Therefore, if you parallelize a function that executes very quickly, the overhead could take longer than the actual function call!

To handle this problem, we should be careful not to parallelize too much. If you have a function or task that's too small, you can use a technique called batching to make your tasks do more meaningful work in a single call.

Code example

Anti-pattern:

import ray
import time
import itertools

ray.init()

numbers = list(range(10000))

def double(number):
    time.sleep(0.00001)
    return number * 2

start_time = time.time()
serial_doubled_numbers = [double(number) for number in numbers]
end_time = time.time()
print(f"Ordinary function call takes {end_time - start_time} seconds")
# Ordinary function call takes 0.16506004333496094 seconds

@ray.remote
def remote_double(number):
    return double(number)

start_time = time.time()
doubled_number_refs = [remote_double.remote(number) for number in numbers]
parallel_doubled_numbers = ray.get(doubled_number_refs)
end_time = time.time()
print(f"Parallelizing tasks takes {end_time - start_time} seconds")
# Parallelizing tasks takes 1.6061789989471436 seconds

Better approach: Use batching.

@ray.remote
def remote_double_batch(numbers):
    return [double(number) for number in numbers]

BATCH_SIZE = 1000
start_time = time.time()
doubled_batch_refs = []
for i in range(0, len(numbers), BATCH_SIZE):
    batch = numbers[i : i + BATCH_SIZE]
    doubled_batch_refs.append(remote_double_batch.remote(batch))
parallel_doubled_numbers_with_batching = list(
    itertools.chain(*ray.get(doubled_batch_refs))
)
end_time = time.time()
print(f"Parallelizing tasks with batching takes {end_time - start_time} seconds")
# Parallelizing tasks with batching takes 0.030150890350341797 seconds

As we can see from the example above, over-parallelizing has higher overhead and the program runs slower than the serial version. Through batching with a proper batch size, we are able to amortize the overhead and achieve the expected speedup.

Anti-pattern: Redefining the same remote function or class harms performance

TLDR: Avoid redefining the same remote function or class.

Decorating the same function or class multiple times using the ray.remote decorator leads to slow performance in Ray. For each Ray remote function or class, Ray pickles it and uploads it to the GCS. Later on, the worker that runs the task or actor downloads and unpickles it. Each decoration of the same function or class generates a new remote function or class from Ray's perspective.
As a result, the pickle, upload, download and unpickle work will happen every time we redefine and run the remote function or class. Code example Anti-pattern: import ray ray.init() outputs = [] for i in range(10): @ray.remote def double(i): return i * 2 outputs.append(double.remote(i)) outputs = ray.get(outputs) # The double remote function is pickled and uploaded 10 times. Better approach: @ray.remote def double(i): return i * 2 outputs = [] for i in range(10): outputs.append(double.remote(i)) outputs = ray.get(outputs) # The double remote function is pickled and uploaded 1 time. We should define the same remote function or class outside of the loop instead of multiple times inside a loop so that it’s pickled and uploaded only once. Anti-pattern: Passing the same large argument by value repeatedly harms performance TLDR: Avoid passing the same large argument by value to multiple tasks, use ray.put() and pass by reference instead. When passing a large argument (>100KB) by value to a task, Ray will implicitly store the argument in the object store and the worker process will fetch the argument to the local object store from the caller’s object store before running the task. If we pass the same large argument to multiple tasks, Ray will end up storing multiple copies of the argument in the object store since Ray doesn’t do deduplication. Instead of passing the large argument by value to multiple tasks, we should use ray.put() to store the argument to the object store once and get an ObjectRef, then pass the argument reference to tasks. This way, we make sure all tasks use the same copy of the argument, which is faster and uses less object store memory. Code example Anti-pattern: import ray import numpy as np ray.init() @ray.remote def func(large_arg, i): return len(large_arg) + i large_arg = np.zeros(1024 * 1024) # 10 copies of large_arg are stored in the object store. outputs = ray.get([func.remote(large_arg, i) for i in range(10)]) Better approach: # 1 copy of large_arg is stored in the object store. large_arg_ref = ray.put(large_arg) outputs = ray.get([func.remote(large_arg_ref, i) for i in range(10)]) Anti-pattern: Closure capturing large objects harms performance TLDR: Avoid closure capturing large objects in remote functions or classes, use object store instead. When you define a ray.remote function or class, it is easy to accidentally capture large (more than a few MB) objects implicitly in the definition. This can lead to slow performance or even OOM since Ray is not designed to handle serialized functions or classes that are very large. For such large objects, there are two options to resolve this problem: Use ray.put() to put the large objects in the Ray object store, and then pass object references as arguments to the remote functions or classes (“better approach #1” below) Create the large objects inside the remote functions or classes by passing a lambda method (“better approach #2”). This is also the only option for using unserializable objects. Code example Anti-pattern: import ray import numpy as np ray.init() large_object = np.zeros(10 * 1024 * 1024) @ray.remote def f1(): return len(large_object) # large_object is serialized along with f1! ray.get(f1.remote()) Better approach #1: large_object_ref = ray.put(np.zeros(10 * 1024 * 1024)) @ray.remote def f2(large_object): return len(large_object) # Large object is passed through object store. 
ray.get(f2.remote(large_object_ref)) Better approach #2: large_object_creator = lambda: np.zeros(10 * 1024 * 1024) # noqa E731 @ray.remote def f3(): large_object = ( large_object_creator() ) # Lambda is small compared with the large object. return len(large_object) ray.get(f3.remote()) Anti-pattern: Using global variables to share state between tasks and actors TLDR: Don’t use global variables to share state with tasks and actors. Instead, encapsulate the global variables in an actor and pass the actor handle to other tasks and actors. Ray drivers, tasks and actors are running in different processes, so they don’t share the same address space. This means that if you modify global variables in one process, changes are not reflected in other processes. The solution is to use an actor’s instance variables to hold the global state and pass the actor handle to places where the state needs to be modified or accessed. Note that using class variables to manage state between instances of the same class is not supported. Each actor instance is instantiated in its own process, so each actor will have its own copy of the class variables. Code example Anti-pattern: import ray ray.init() global_var = 3 @ray.remote class Actor: def f(self): return global_var + 3 actor = Actor.remote() global_var = 4 # This returns 6, not 7. It is because the value change of global_var # inside a driver is not reflected to the actor # because they are running in different processes. assert ray.get(actor.f.remote()) == 6 Better approach: @ray.remote class GlobalVarActor: def __init__(self): self.global_var = 3 def set_global_var(self, var): self.global_var = var def get_global_var(self): return self.global_var @ray.remote class Actor: def __init__(self, global_var_actor): self.global_var_actor = global_var_actor def f(self): return ray.get(self.global_var_actor.get_global_var.remote()) + 3 global_var_actor = GlobalVarActor.remote() actor = Actor.remote(global_var_actor) ray.get(global_var_actor.set_global_var.remote(4)) # This returns 7 correctly. assert ray.get(actor.f.remote()) == 7 Advanced Topics This section covers extended topics on how to use Ray. Tips for first-time users Ray provides a highly flexible, yet minimalist and easy to use API. On this page, we describe several tips that can help first-time Ray users to avoid some common mistakes that can significantly hurt the performance of their programs. For an in-depth treatment of advanced design patterns, please read core design patterns. The core Ray API we use in this document. API Description ray.init() Initialize Ray context. @ray.remote Function or class decorator specifying that the function will be executed as a task or the class as an actor in a different process. .remote() Postfix to every remote function, remote class declaration, or invocation of a remote class method. Remote operations are asynchronous. ray.put() Store object in object store, and return its ID. This ID can be used to pass object as an argument to any remote function or method call. This is a synchronous operation. ray.get() Return an object or list of objects from the object ID or list of object IDs. This is a synchronous (i.e., blocking) operation. ray.wait() From a list of object IDs, returns (1) the list of IDs of the objects that are ready, and (2) the list of IDs of the objects that are not ready yet. By default, it returns one ready object ID at a time. All the results reported in this page were obtained on a 13-inch MacBook Pro with a 2.7 GHz Core i7 CPU and 16GB of RAM. 
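To make the table above concrete before moving on to the tips, here is a compact, illustrative sketch of our own (not part of the original benchmark code; the square task and the 4-CPU setting are arbitrary choices) that exercises each of these calls once:

import time

import ray

# Start Ray; num_cpus=4 matches the setup used for the timings on this page.
ray.init(num_cpus=4)

@ray.remote
def square(x):
    time.sleep(0.5)  # Stand-in for real work.
    return x * x

# .remote() returns ObjectRefs immediately; nothing has been computed or fetched yet.
refs = [square.remote(i) for i in range(4)]

# ray.put() stores an object in the object store once and returns a reference to it.
big_list_ref = ray.put(list(range(1000)))

# ray.wait() splits the refs into ready and not-yet-ready subsets.
ready, not_ready = ray.wait(refs, num_returns=1)

# ray.get() blocks until the requested results are available.
print(ray.get(ready[0]))           # The first square that finished.
print(len(ray.get(big_list_ref)))  # 1000
print(ray.get(refs))               # [0, 1, 4, 9]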
While ray.init() automatically detects the number of cores when it runs on a single machine, to reduce the variability of the results you observe on your machine when running the code below, here we specify num_cpus = 4, i.e., a machine with 4 CPUs. Since each task requests by default one CPU, this setting allows us to execute up to four tasks in parallel. As a result, our Ray system consists of one driver executing the program, and up to four workers running remote tasks or actors. Tip 1: Delay ray.get() With Ray, the invocation of every remote operation (e.g., task, actor method) is asynchronous. This means that the operation immediately returns a promise/future, which is essentially an identifier (ID) of the operation’s result. This is key to achieving parallelism, as it allows the driver program to launch multiple operations in parallel. To get the actual results, the programmer needs to call ray.get() on the IDs of the results. This call blocks until the results are available. As a side effect, this operation also blocks the driver program from invoking other operations, which can hurt parallelism. Unfortunately, it is quite natural for a new Ray user to inadvertently use ray.get(). To illustrate this point, consider the following simple Python code which calls the do_some_work() function four times, where each invocation takes around 1 sec: import ray import time def do_some_work(x): time.sleep(1) # Replace this with work you need to do. return x start = time.time() results = [do_some_work(x) for x in range(4)] print("duration =", time.time() - start) print("results =", results) The output of a program execution is below. As expected, the program takes around 4 seconds: duration = 4.0149290561676025 results = [0, 1, 2, 3] Now, let’s parallelize the above program with Ray. Some first-time users will do this by just making the function remote, i.e., import ray ray.shutdown() import time import ray ray.init(num_cpus=4) # Specify this system has 4 CPUs. @ray.remote def do_some_work(x): time.sleep(1) # Replace this with work you need to do. return x start = time.time() results = [do_some_work.remote(x) for x in range(4)] print("duration =", time.time() - start) print("results =", results) However, when executing the above program one gets: duration = 0.0003619194030761719 results = [ObjectRef(df5a1a828c9685d3ffffffff0100000001000000), ObjectRef(cb230a572350ff44ffffffff0100000001000000), ObjectRef(7bbd90284b71e599ffffffff0100000001000000), ObjectRef(bd37d2621480fc7dffffffff0100000001000000)] When looking at this output, two things jump out. First, the program finishes immediately, i.e., in less than 1 ms. Second, instead of the expected results (i.e., [0, 1, 2, 3]), we get a bunch of identifiers. Recall that remote operations are asynchronous and they return futures (i.e., object IDs) instead of the results themselves. This is exactly what we see here. We measure only the time it takes to invoke the tasks, not their running times, and we get the IDs of the results corresponding to the four tasks. To get the actual results, we need to use ray.get(), and here the first instinct is to just call ray.get() on the remote operation invocation, i.e., replace line 12 with: results = [ray.get(do_some_work.remote(x)) for x in range(4)] By re-running the program after this change we get: duration = 4.018050909042358 results = [0, 1, 2, 3] So now the results are correct, but it still takes 4 seconds, so no speedup! What’s going on? 
The observant reader will already have the answer: ray.get() is blocking so calling it after each remote operation means that we wait for that operation to complete, which essentially means that we execute one operation at a time, hence no parallelism! To enable parallelism, we need to call ray.get() after invoking all tasks. We can easily do so in our example by replacing line 12 with: results = ray.get([do_some_work.remote(x) for x in range(4)]) By re-running the program after this change we now get: duration = 1.0064549446105957 results = [0, 1, 2, 3] So finally, success! Our Ray program now runs in just 1 second which means that all invocations of do_some_work() are running in parallel. In summary, always keep in mind that ray.get() is a blocking operation, and thus if called eagerly it can hurt the parallelism. Instead, you should try to write your program such that ray.get() is called as late as possible. Tip 2: Avoid tiny tasks When a first-time developer wants to parallelize their code with Ray, the natural instinct is to make every function or class remote. Unfortunately, this can lead to undesirable consequences; if the tasks are very small, the Ray program can take longer than the equivalent Python program. Let’s consider again the above examples, but this time we make the tasks much shorter (i.e, each takes just 0.1ms), and dramatically increase the number of task invocations to 100,000. import time def tiny_work(x): time.sleep(0.0001) # Replace this with work you need to do. return x start = time.time() results = [tiny_work(x) for x in range(100000)] print("duration =", time.time() - start) By running this program we get: duration = 13.36544418334961 This result should be expected since the lower bound of executing 100,000 tasks that take 0.1ms each is 10s, to which we need to add other overheads such as function calls, etc. Let’s now parallelize this code using Ray, by making every invocation of tiny_work() remote: import time import ray @ray.remote def tiny_work(x): time.sleep(0.0001) # Replace this with work you need to do. return x start = time.time() result_ids = [tiny_work.remote(x) for x in range(100000)] results = ray.get(result_ids) print("duration =", time.time() - start) The result of running this code is: duration = 27.46447515487671 Surprisingly, not only Ray didn’t improve the execution time, but the Ray program is actually slower than the sequential program! What’s going on? Well, the issue here is that every task invocation has a non-trivial overhead (e.g., scheduling, inter-process communication, updating the system state) and this overhead dominates the actual time it takes to execute the task. One way to speed up this program is to make the remote tasks larger in order to amortize the invocation overhead. Here is one possible solution where we aggregate 1000 tiny_work() function calls in a single bigger remote function: import time import ray def tiny_work(x): time.sleep(0.0001) # replace this is with work you need to do return x @ray.remote def mega_work(start, end): return [tiny_work(x) for x in range(start, end)] start = time.time() result_ids = [] [result_ids.append(mega_work.remote(x*1000, (x+1)*1000)) for x in range(100)] results = ray.get(result_ids) print("duration =", time.time() - start) Now, if we run the above program we get: duration = 3.2539820671081543 This is approximately one fourth of the sequential execution, in line with our expectations (recall, we can run four tasks in parallel). 
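The same aggregation idea can be packaged as a small reusable helper. The sketch below is our own illustration (the chunked helper, the 100,000-item workload, and the batch size of 1000 are arbitrary choices, not part of the original example):

import itertools
import time

import ray

ray.init(num_cpus=4)

def tiny_work(x):
    time.sleep(0.0001)  # Replace this with the real per-item work.
    return x

@ray.remote
def process_batch(items):
    # One remote call performs a whole chunk of tiny work items.
    return [tiny_work(x) for x in items]

def chunked(seq, size):
    # Yield consecutive slices of `seq` with at most `size` elements each.
    for i in range(0, len(seq), size):
        yield seq[i : i + size]

start = time.time()
refs = [process_batch.remote(chunk) for chunk in chunked(list(range(100000)), 1000)]
results = list(itertools.chain(*ray.get(refs)))
print("duration =", time.time() - start)

Compared with the explicit loop above, a helper like this makes the batch size a single tunable knob, which is exactly the question addressed next.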
Of course, the natural question is how large is large enough for a task to amortize the remote invocation overhead. One way to find this is to run the following simple program to estimate the per-task invocation overhead: @ray.remote def no_work(x): return x start = time.time() num_calls = 1000 [ray.get(no_work.remote(x)) for x in range(num_calls)] print("per task overhead (ms) =", (time.time() - start)*1000/num_calls) Running the above program on a 2018 MacBook Pro notebook shows: per task overhead (ms) = 0.4739549160003662 In other words, it takes almost half a millisecond to execute an empty task. This suggests that we will need to make sure a task takes at least a few milliseconds to amortize the invocation overhead. One caveat is that the per-task overhead will vary from machine to machine, and between tasks that run on the same machine versus remotely. This being said, making sure that tasks take at least a few milliseconds is a good rule of thumb when developing Ray programs. Tip 3: Avoid passing same object repeatedly to remote tasks When we pass a large object as an argument to a remote function, Ray calls ray.put() under the hood to store that object in the local object store. This can significantly improve the performance of a remote task invocation when the remote task is executed locally, as all local tasks share the object store. However, there are cases when automatically calling ray.put() on a task invocation leads to performance issues. One example is passing the same large object as an argument repeatedly, as illustrated by the program below: import time import numpy as np import ray @ray.remote def no_work(a): return start = time.time() a = np.zeros((5000, 5000)) result_ids = [no_work.remote(a) for x in range(10)] results = ray.get(result_ids) print("duration =", time.time() - start) This program outputs: duration = 1.0837509632110596 This running time is quite large for a program that calls just 10 remote tasks that do nothing. The reason for this unexpected high running time is that each time we invoke no_work(a), Ray calls ray.put(a) which results in copying array a to the object store. Since array a has 2.5 million entries, copying it takes a non-trivial time. To avoid copying array a every time no_work() is invoked, one simple solution is to explicitly call ray.put(a), and then pass a’s ID to no_work(), as illustrated below: import ray ray.shutdown() import time import numpy as np import ray ray.init(num_cpus=4) @ray.remote def no_work(a): return start = time.time() a_id = ray.put(np.zeros((5000, 5000))) result_ids = [no_work.remote(a_id) for x in range(10)] results = ray.get(result_ids) print("duration =", time.time() - start) Running this program takes only: duration = 0.132796049118042 This is 7 times faster than the original program which is to be expected since the main overhead of invoking no_work(a) was copying the array a to the object store, which now happens only once. Arguably a more important advantage of avoiding multiple copies of the same object to the object store is that it precludes the object store filling up prematurely and incur the cost of object eviction. Tip 4: Pipeline data processing If we use ray.get() on the results of multiple tasks we will have to wait until the last one of these tasks finishes. This can be an issue if tasks take widely different amounts of time. 
To illustrate this issue, consider the following example where we run four do_some_work() tasks in parallel, with each task taking a time uniformly distributed between 0 and 4 seconds. Next, assume the results of these tasks are processed by process_results(), which takes 1 sec per result. The expected running time is then (1) the time it takes to execute the slowest of the do_some_work() tasks, plus (2) 4 seconds which is the time it takes to execute process_results(). import time import random import ray @ray.remote def do_some_work(x): time.sleep(random.uniform(0, 4)) # Replace this with work you need to do. return x def process_results(results): sum = 0 for x in results: time.sleep(1) # Replace this with some processing code. sum += x return sum start = time.time() data_list = ray.get([do_some_work.remote(x) for x in range(4)]) sum = process_results(data_list) print("duration =", time.time() - start, "\nresult = ", sum) The output of the program shows that it takes close to 8 sec to run: duration = 7.82636022567749 result = 6 Waiting for the last task to finish when the others tasks might have finished much earlier unnecessarily increases the program running time. A better solution would be to process the data as soon it becomes available. Fortunately, Ray allows you to do exactly this by calling ray.wait() on a list of object IDs. Without specifying any other parameters, this function returns as soon as an object in its argument list is ready. This call has two returns: (1) the ID of the ready object, and (2) the list containing the IDs of the objects not ready yet. The modified program is below. Note that one change we need to do is to replace process_results() with process_incremental() that processes one result at a time. import time import random import ray @ray.remote def do_some_work(x): time.sleep(random.uniform(0, 4)) # Replace this with work you need to do. return x def process_incremental(sum, result): time.sleep(1) # Replace this with some processing code. return sum + result start = time.time() result_ids = [do_some_work.remote(x) for x in range(4)] sum = 0 while len(result_ids): done_id, result_ids = ray.wait(result_ids) sum = process_incremental(sum, ray.get(done_id[0])) print("duration =", time.time() - start, "\nresult = ", sum) This program now takes just a bit over 4.8sec, a significant improvement: duration = 4.852453231811523 result = 6 To aid the intuition, Figure 1 shows the execution timeline in both cases: when using ray.get() to wait for all results to become available before processing them, and using ray.wait() to start processing the results as soon as they become available. Figure 1: (a) Execution timeline when using ray.get() to wait for all results from do_some_work() tasks before calling process_results(). (b) Execution timeline when using ray.wait() to process results as soon as they become available. Starting Ray This page covers how to start Ray on your single machine or cluster of machines. Be sure to have installed Ray before following the instructions on this page. What is the Ray runtime? Ray programs are able to parallelize and distribute by leveraging an underlying Ray runtime. The Ray runtime consists of multiple services/processes started in the background for communication, data transfer, scheduling, and more. The Ray runtime can be started on a laptop, a single server, or multiple servers. 
There are three ways of starting the Ray runtime:
Implicitly via ray.init() (Starting Ray on a single machine)
Explicitly via CLI (Starting Ray via the CLI (ray start))
Explicitly via the cluster launcher (Launching a Ray cluster (ray up))
In all cases, ray.init() will try to automatically find a Ray instance to connect to. It checks, in order:
1. The RAY_ADDRESS OS environment variable.
2. The concrete address passed to ray.init(address=<address>
). 3. If no address is provided, the latest Ray instance that was started on the same machine using ray start. Starting Ray on a single machine Calling ray.init() starts a local Ray instance on your laptop/machine. This laptop/machine becomes the “head node”. In recent versions of Ray (>=1.5), ray.init() will automatically be called on the first use of a Ray remote API. Python import ray ray.shutdown() import ray # Other Ray APIs will not work until `ray.init()` is called. ray.init() Java import io.ray.api.Ray; public class MyRayApp { public static void main(String[] args) { // Other Ray APIs will not work until `Ray.init()` is called. Ray.init(); ... } } C++ #include // Other Ray APIs will not work until `ray::Init()` is called. ray::Init() When the process calling ray.init() terminates, the Ray runtime will also terminate. To explicitly stop or restart Ray, use the shutdown API. Python ray.shutdown() import ray ray.init() ... # ray program ray.shutdown() Java import io.ray.api.Ray; public class MyRayApp { public static void main(String[] args) { Ray.init(); ... // ray program Ray.shutdown(); } } C++ #include ray::Init() ... // ray program ray::Shutdown() To check if Ray is initialized, use the is_initialized API. Python import ray ray.init() assert ray.is_initialized() ray.shutdown() assert not ray.is_initialized() Java import io.ray.api.Ray; public class MyRayApp { public static void main(String[] args) { Ray.init(); Assert.assertTrue(Ray.isInitialized()); Ray.shutdown(); Assert.assertFalse(Ray.isInitialized()); } } C++ #include int main(int argc, char **argv) { ray::Init(); assert(ray::IsInitialized()); ray::Shutdown(); assert(!ray::IsInitialized()); } See the Configuration documentation for the various ways to configure Ray. Starting Ray via the CLI (ray start) Use ray start from the CLI to start a 1 node ray runtime on a machine. This machine becomes the “head node”. $ ray start --head --port=6379 Local node IP: 192.123.1.123 2020-09-20 10:38:54,193 INFO services.py:1166 -- View the Ray dashboard at http://localhost:8265 -------------------- Ray runtime started. -------------------- ... You can connect to this Ray instance by starting a driver process on the same node as where you ran ray start. ray.init() will now automatically connect to the latest Ray instance. Python import ray ray.init() java import io.ray.api.Ray; public class MyRayApp { public static void main(String[] args) { Ray.init(); ... } } java -classpath \ -Dray.address=
<address> \
<classname> <args>
C++
#include <ray/api.h>
int main(int argc, char **argv) { ray::Init(); ... }
RAY_ADDRESS=<address>
./ You can connect other nodes to the head node, creating a Ray cluster by also calling ray start on those nodes. See Launching an On-Premise Cluster for more details. Calling ray.init() on any of the cluster machines will connect to the same Ray cluster. Launching a Ray cluster (ray up) Ray clusters can be launched with the Cluster Launcher. The ray up command uses the Ray cluster launcher to start a cluster on the cloud, creating a designated “head node” and worker nodes. Underneath the hood, it automatically calls ray start to create a Ray cluster. Your code only needs to execute on one machine in the cluster (usually the head node). Read more about running programs on a Ray cluster. To connect to the Ray cluster, call ray.init from one of the machines in the cluster. This will connect to the latest Ray cluster: ray.shutdown() ray.init() Note that the machine calling ray up will not be considered as part of the Ray cluster, and therefore calling ray.init on that same machine will not attach to the cluster. What’s next? Check out our Deployment section for more information on deploying Ray in different settings, including Kubernetes, YARN, and SLURM. Using Namespaces A namespace is a logical grouping of jobs and named actors. When an actor is named, its name must be unique within the namespace. In order to set your applications namespace, it should be specified when you first connect to the cluster. Python import ray ray.init(namespace="hello") Java System.setProperty("ray.job.namespace", "hello"); // set it before Ray.init() Ray.init(); C++ ray::RayConfig config; config.ray_namespace = "hello"; ray::Init(config); Please refer to Driver Options for ways of configuring a Java application. Named actors are only accessible within their namespaces. Python import subprocess import ray try: subprocess.check_output(["ray", "start", "--head"]) @ray.remote class Actor: pass # Job 1 creates two actors, "orange" and "purple" in the "colors" namespace. with ray.init("ray://localhost:10001", namespace="colors"): Actor.options(name="orange", lifetime="detached").remote() Actor.options(name="purple", lifetime="detached").remote() # Job 2 is now connecting to a different namespace. with ray.init("ray://localhost:10001", namespace="fruits"): # This fails because "orange" was defined in the "colors" namespace. try: ray.get_actor("orange") except ValueError: pass # This succceeds because the name "orange" is unused in this namespace. Actor.options(name="orange", lifetime="detached").remote() Actor.options(name="watermelon", lifetime="detached").remote() # Job 3 connects to the original "colors" namespace context = ray.init("ray://localhost:10001", namespace="colors") # This fails because "watermelon" was in the fruits namespace. try: ray.get_actor("watermelon") except ValueError: pass # This returns the "orange" actor we created in the first job, not the second. ray.get_actor("orange") # We are manually managing the scope of the connection in this example. context.disconnect() finally: subprocess.check_output(["ray", "stop", "--force"]) Java // `ray start --head` has been run to launch a local cluster. // Job 1 creates two actors, "orange" and "purple" in the "colors" namespace. System.setProperty("ray.address", "localhost:10001"); System.setProperty("ray.job.namespace", "colors"); try { Ray.init(); Ray.actor(Actor::new).setName("orange").remote(); Ray.actor(Actor::new).setName("purple").remote(); } finally { Ray.shutdown(); } // Job 2 is now connecting to a different namespace. 
System.setProperty("ray.address", "localhost:10001"); System.setProperty("ray.job.namespace", "fruits"); try { Ray.init(); // This fails because "orange" was defined in the "colors" namespace. Ray.getActor("orange").isPresent(); // return false // This succceeds because the name "orange" is unused in this namespace. Ray.actor(Actor::new).setName("orange").remote(); Ray.actor(Actor::new).setName("watermelon").remote(); } finally { Ray.shutdown(); } // Job 3 connects to the original "colors" namespace. System.setProperty("ray.address", "localhost:10001"); System.setProperty("ray.job.namespace", "colors"); try { Ray.init(); // This fails because "watermelon" was in the fruits namespace. Ray.getActor("watermelon").isPresent(); // return false // This returns the "orange" actor we created in the first job, not the second. Ray.getActor("orange").isPresent(); // return true } finally { Ray.shutdown(); } C++ // `ray start --head` has been run to launch a local cluster. // Job 1 creates two actors, "orange" and "purple" in the "colors" namespace. ray::RayConfig config; config.ray_namespace = "colors"; ray::Init(config); ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("orange").Remote(); ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("purple").Remote(); ray::Shutdown(); // Job 2 is now connecting to a different namespace. ray::RayConfig config; config.ray_namespace = "fruits"; ray::Init(config); // This fails because "orange" was defined in the "colors" namespace. ray::GetActor("orange"); // return nullptr; // This succeeds because the name "orange" is unused in this namespace. ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("orange").Remote(); ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("watermelon").Remote(); ray::Shutdown(); // Job 3 connects to the original "colors" namespace. ray::RayConfig config; config.ray_namespace = "colors"; ray::Init(config); // This fails because "watermelon" was in the fruits namespace. ray::GetActor("watermelon"); // return nullptr; // This returns the "orange" actor we created in the first job, not the second. ray::GetActor("orange"); ray::Shutdown(); Specifying namespace for named actors You can specify a namespace for a named actor while creating it. The created actor belongs to the specified namespace, no matter what namespace of the current job is. Python import subprocess import ray try: subprocess.check_output(["ray", "start", "--head"]) @ray.remote class Actor: pass ctx = ray.init("ray://localhost:10001") # Create an actor with specified namespace. Actor.options(name="my_actor", namespace="actor_namespace", lifetime="detached").remote() # It is accessible in its namespace. ray.get_actor("my_actor", namespace="actor_namespace") ctx.disconnect() finally: subprocess.check_output(["ray", "stop", "--force"]) Java // `ray start --head` has been run to launch a local cluster. System.setProperty("ray.address", "localhost:10001"); try { Ray.init(); // Create an actor with specified namespace. Ray.actor(Actor::new).setName("my_actor", "actor_namespace").remote(); // It is accessible in its namespace. Ray.getActor("my_actor", "actor_namespace").isPresent(); // return true } finally { Ray.shutdown(); } C++ // `ray start --head` has been run to launch a local cluster. ray::RayConfig config; ray::Init(config); // Create an actor with specified namespace. ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("my_actor", "actor_namespace").Remote(); // It is accessible in its namespace. 
ray::GetActor("my_actor");
ray::Shutdown();

Anonymous namespaces

When a namespace is not specified, Ray will place your job in an anonymous namespace. In an anonymous namespace, your job will have its own namespace and will not have access to actors in other namespaces.

Python

import subprocess
import ray

try:
    subprocess.check_output(["ray", "start", "--head"])

    @ray.remote
    class Actor:
        pass

    # Job 1 connects to an anonymous namespace by default
    with ray.init("ray://localhost:10001"):
        Actor.options(name="my_actor", lifetime="detached").remote()

    # Job 2 connects to a _different_ anonymous namespace by default
    with ray.init("ray://localhost:10001"):
        # This succeeds because the second job is in its own namespace.
        Actor.options(name="my_actor", lifetime="detached").remote()

finally:
    subprocess.check_output(["ray", "stop", "--force"])

Java

// `ray start --head` has been run to launch a local cluster.
// Job 1 connects to an anonymous namespace by default.
System.setProperty("ray.address", "localhost:10001");
try { Ray.init(); Ray.actor(Actor::new).setName("my_actor").remote(); } finally { Ray.shutdown(); }
// Job 2 connects to a _different_ anonymous namespace by default
System.setProperty("ray.address", "localhost:10001");
try { Ray.init();
// This succeeds because the second job is in its own namespace.
Ray.actor(Actor::new).setName("my_actor").remote(); } finally { Ray.shutdown(); }

C++

// `ray start --head` has been run to launch a local cluster.
// Job 1 connects to an anonymous namespace by default.
ray::RayConfig config;
ray::Init(config);
ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("my_actor").Remote();
ray::Shutdown();
// Job 2 connects to a _different_ anonymous namespace by default
ray::RayConfig config;
ray::Init(config);
// This succeeds because the second job is in its own namespace.
ray::Actor(RAY_FUNC(Counter::FactoryCreate)).SetName("my_actor").Remote();
ray::Shutdown();

Anonymous namespaces are implemented as UUIDs. This makes it possible for a future job to manually connect to an existing anonymous namespace, but it is not recommended.

Getting the current namespace

You can access the current namespace using the runtime_context APIs.

Python

import subprocess
import ray

try:
    subprocess.check_output(["ray", "start", "--head"])

    ray.init(address="auto", namespace="colors")
    # Will print namespace name "colors".
    print(ray.get_runtime_context().namespace)
finally:
    subprocess.check_output(["ray", "stop", "--force"])

Java

System.setProperty("ray.job.namespace", "colors");
try { Ray.init();
// Will print namespace name "colors".
System.out.println(Ray.getRuntimeContext().getNamespace()); } finally { Ray.shutdown(); }

C++

ray::RayConfig config;
config.ray_namespace = "colors";
ray::Init(config);
// Will print namespace name "colors".
std::cout << ray::GetNamespace() << std::endl;
ray::Shutdown();

Cross-Language Programming

This page will show you how to use Ray's cross-language programming feature.

Set up the driver

We need to set the Code Search Path in your driver.

Python

import ray
ray.init(job_config=ray.job_config.JobConfig(code_search_path=["/path/to/code"]))

Java

java -classpath <classpath> \
-Dray.address=<address> \
-Dray.job.code-search-path=/path/to/code/ \
<classname> <args>
You may want to include multiple directories to load both Python and Java code for workers, if they are placed in different directories.
Python
import ray
ray.init(job_config=ray.job_config.JobConfig(code_search_path="/path/to/jars:/path/to/pys"))
Java
java -classpath <classpath> \
-Dray.address=<address>
\ -Dray.job.code-search-path=/path/to/jars:/path/to/pys \ Python calling Java Suppose we have a Java static method and a Java class as follows: package io.ray.demo; public class Math { public static int add(int a, int b) { return a + b; } } package io.ray.demo; // A regular Java class. public class Counter { private int value = 0; public int increment() { this.value += 1; return this.value; } } Then, in Python, we can call the above Java remote function, or create an actor from the above Java class. import ray with ray.init(job_config=ray.job_config.JobConfig(code_search_path=["/path/to/code"])): # Define a Java class. counter_class = ray.cross_language.java_actor_class( "io.ray.demo.Counter") # Create a Java actor and call actor method. counter = counter_class.remote() obj_ref1 = counter.increment.remote() assert ray.get(obj_ref1) == 1 obj_ref2 = counter.increment.remote() assert ray.get(obj_ref2) == 2 # Define a Java function. add_function = ray.cross_language.java_function( "io.ray.demo.Math", "add") # Call the Java remote function. obj_ref3 = add_function.remote(1, 2) assert ray.get(obj_ref3) == 3 Java calling Python Suppose we have a Python module as follows: # /path/to/the_dir/ray_demo.py import ray @ray.remote class Counter(object): def __init__(self): self.value = 0 def increment(self): self.value += 1 return self.value @ray.remote def add(a, b): return a + b The function or class should be decorated by @ray.remote. Then, in Java, we can call the above Python remote function, or create an actor from the above Python class. package io.ray.demo; import io.ray.api.ObjectRef; import io.ray.api.PyActorHandle; import io.ray.api.Ray; import io.ray.api.function.PyActorClass; import io.ray.api.function.PyActorMethod; import io.ray.api.function.PyFunction; import org.testng.Assert; public class JavaCallPythonDemo { public static void main(String[] args) { // Set the code-search-path to the directory of your `ray_demo.py` file. System.setProperty("ray.job.code-search-path", "/path/to/the_dir/"); Ray.init(); // Define a Python class. PyActorClass actorClass = PyActorClass.of( "ray_demo", "Counter"); // Create a Python actor and call actor method. PyActorHandle actor = Ray.actor(actorClass).remote(); ObjectRef objRef1 = actor.task( PyActorMethod.of("increment", int.class)).remote(); Assert.assertEquals(objRef1.get(), 1); ObjectRef objRef2 = actor.task( PyActorMethod.of("increment", int.class)).remote(); Assert.assertEquals(objRef2.get(), 2); // Call the Python remote function. ObjectRef objRef3 = Ray.task(PyFunction.of( "ray_demo", "add", int.class), 1, 2).remote(); Assert.assertEquals(objRef3.get(), 3); Ray.shutdown(); } } Cross-language data serialization The arguments and return values of ray call can be serialized & deserialized automatically if their types are the following: Primitive data types MessagePack Python Java nil None null bool bool Boolean int int Short / Integer / Long / BigInteger float float Float / Double str str String bin bytes byte[] Basic container types MessagePack Python Java array list Array Ray builtin types ActorHandle Be aware of float / double precision between Python and Java. If Java is using a float type to receive the input argument, the double precision Python data will be reduced to float precision in Java. BigInteger can support a max value of 2^64-1, please refer to: https://github.com/msgpack/msgpack/blob/master/spec.md#int-format-family. If the value is larger than 2^64-1, then sending the value to Python will raise an exception. 
The following example shows how to pass these types as parameters and how to return these types. You can write a Python function which returns the input data: # ray_serialization.py import ray @ray.remote def py_return_input(v): return v Then you can transfer the object from Java to Python, and back from Python to Java: package io.ray.demo; import io.ray.api.ObjectRef; import io.ray.api.Ray; import io.ray.api.function.PyFunction; import java.math.BigInteger; import org.testng.Assert; public class SerializationDemo { public static void main(String[] args) { Ray.init(); Object[] inputs = new Object[]{ true, // Boolean Byte.MAX_VALUE, // Byte Short.MAX_VALUE, // Short Integer.MAX_VALUE, // Integer Long.MAX_VALUE, // Long BigInteger.valueOf(Long.MAX_VALUE), // BigInteger "Hello World!", // String 1.234f, // Float 1.234, // Double "example binary".getBytes()}; // byte[] for (Object o : inputs) { ObjectRef res = Ray.task( PyFunction.of("ray_serialization", "py_return_input", o.getClass()), o).remote(); Assert.assertEquals(res.get(), o); } Ray.shutdown(); } } Cross-language exception stacks Suppose we have a Java package as follows: package io.ray.demo; import io.ray.api.ObjectRef; import io.ray.api.Ray; import io.ray.api.function.PyFunction; public class MyRayClass { public static int raiseExceptionFromPython() { PyFunction raiseException = PyFunction.of( "ray_exception", "raise_exception", Integer.class); ObjectRef refObj = Ray.task(raiseException).remote(); return refObj.get(); } } and a Python module as follows: # ray_exception.py import ray @ray.remote def raise_exception(): 1 / 0 Then, run the following code: # ray_exception_demo.py import ray with ray.init(job_config=ray.job_config.JobConfig(code_search_path=["/path/to/ray_exception"])): obj_ref = ray.cross_language.java_function( "io.ray.demo.MyRayClass", "raiseExceptionFromPython").remote() ray.get(obj_ref) # <-- raise exception from here. The exception stack will be: Traceback (most recent call last): File "ray_exception_demo.py", line 9, in ray.get(obj_ref) # <-- raise exception from here. File "ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper return func(*args, **kwargs) File "ray/python/ray/_private/worker.py", line 2247, in get raise value ray.exceptions.CrossLanguageError: An exception raised from JAVA: io.ray.api.exception.RayTaskException: (pid=61894, ip=172.17.0.2) Error executing task c8ef45ccd0112571ffffffffffffffffffffffff01000000 at io.ray.runtime.task.TaskExecutor.execute(TaskExecutor.java:186) at io.ray.runtime.RayNativeRuntime.nativeRunTaskExecutor(Native Method) at io.ray.runtime.RayNativeRuntime.run(RayNativeRuntime.java:231) at io.ray.runtime.runner.worker.DefaultWorker.main(DefaultWorker.java:15) Caused by: io.ray.api.exception.CrossLanguageException: An exception raised from PYTHON: ray.exceptions.RayTaskError: ray::raise_exception() (pid=62041, ip=172.17.0.2) File "ray_exception.py", line 7, in raise_exception 1 / 0 ZeroDivisionError: division by zero Working with Jupyter Notebooks & JupyterLab This document describes best practices for using Ray with Jupyter Notebook / JupyterLab. We use AWS for the purpose of illustration, but the arguments should also apply to other Cloud providers. Feel free to contribute if you think this document is missing anything. Setting Up Notebook 1. Ensure your EC2 instance has enough EBS volume if you plan to run the Notebook on it. The Deep Learning AMI, pre-installed libraries and environmental set-up will by default consume ~76% of the disk prior to any Ray work. 
With additional applications running, the Notebook could fail frequently due to a full disk. A kernel restart loses the outputs of in-progress cells, especially if we rely on them to track experiment progress. Related issue: Autoscaler should allow configuration of disk space and should use a larger default.

2. Avoid unnecessary memory usage. IPython stores the output of every cell in a local Python variable indefinitely. This causes Ray to pin the objects even though your application may not actually be using them. Therefore, explicitly calling print or repr is better than letting the Notebook automatically generate the output. Another option is to disable IPython caching altogether with the following (run from bash/zsh):

echo 'c = get_config()
c.InteractiveShell.cache_size = 0 # disable cache
' >> ~/.ipython/profile_default/ipython_config.py

This will still allow printing, but stop IPython from caching altogether. While the above settings help reduce the memory footprint, it's always a good practice to remove references that are no longer needed in your application to free space in the object store.

3. Understand the node's responsibility. Assuming the Notebook runs on an EC2 instance, do you plan to start a Ray runtime locally on this instance, or do you plan to use this instance as a cluster launcher? Jupyter Notebook is more suitable for the first scenario. CLIs such as ray exec and ray submit fit the second use case better.

4. Forward the ports. Assuming the Notebook runs on an EC2 instance, you should forward both the Notebook port and the Ray Dashboard port. The default ports are 8888 and 8265 respectively. They will increase if the default ones are not available. You can forward them with the following (run from bash/zsh):

ssh -i /path/my-key-pair.pem -N -f -L localhost:8888:localhost:8888 my-instance-user-name@my-instance-IPv6-address
ssh -i /path/my-key-pair.pem -N -f -L localhost:8265:localhost:8265 my-instance-user-name@my-instance-IPv6-address

Lazy Computation Graphs with the Ray DAG API

With ray.remote you have the flexibility of running an application where computation is executed remotely at runtime. For a ray.remote decorated class or function, you can also use .bind on the body to build a static computation graph. Ray DAG is designed to be a developer-facing API, with two recommended use cases:
Locally iterate and test your application authored by higher-level libraries.
Build libraries on top of the Ray DAG APIs.
When .bind() is called on a ray.remote decorated class or function, it generates an intermediate representation (IR) node. These IR nodes are the backbone and building blocks of the DAG: they statically hold the computation graph together, and each IR node is resolved to a value at execution time according to its topological order. An IR node can also be assigned to a variable and passed into other nodes as an argument.

Ray DAG with functions

The IR node generated by .bind() on a ray.remote decorated function is executed as a Ray task at execution time and resolves to the task's output. This example shows how to build a chain of functions where each node can be executed as the root node while iterating, or used as input args or kwargs of other functions to form more complex DAGs. Any IR node can be executed directly with dag_node.execute(); it then acts as the root of the DAG, and all nodes not reachable from that root are ignored.
Python import ray ray.init() @ray.remote def func(src, inc=1): return src + inc a_ref = func.bind(1, inc=2) assert ray.get(a_ref.execute()) == 3 # 1 + 2 = 3 b_ref = func.bind(a_ref, inc=3) assert ray.get(b_ref.execute()) == 6 # (1 + 2) + 3 = 6 c_ref = func.bind(b_ref, inc=a_ref) assert ray.get(c_ref.execute()) == 9 # ((1 + 2) + 3) + (1 + 2) = 9 Ray DAG with classes and class methods The IR node generated by .bind() on a ray.remote decorated class is executed as a Ray Actor upon execution. The Actor will be instantiated every time the node is executed, and the classmethod calls can form a chain of function calls specific to the parent actor instance. DAG IR nodes generated from a function, class or classmethod can be combined together to form a DAG. Python import ray ray.init() @ray.remote class Actor: def __init__(self, init_value): self.i = init_value def inc(self, x): self.i += x def get(self): return self.i a1 = Actor.bind(10) # Instantiate Actor with init_value 10. val = a1.get.bind() # ClassMethod that returns value from get() from # the actor created. assert ray.get(val.execute()) == 10 @ray.remote def combine(x, y): return x + y a2 = Actor.bind(10) # Instantiate another Actor with init_value 10. a1.inc.bind(2) # Call inc() on the actor created with increment of 2. a1.inc.bind(4) # Call inc() on the actor created with increment of 4. a2.inc.bind(6) # Call inc() on the actor created with increment of 6. # Combine outputs from a1.get() and a2.get() dag = combine.bind(a1.get.bind(), a2.get.bind()) # a1 + a2 + inc(2) + inc(4) + inc(6) # 10 + (10 + ( 2 + 4 + 6)) = 32 assert ray.get(dag.execute()) == 32 Ray DAG with custom InputNode InputNode is the singleton node of a DAG that represents user input value at runtime. It should be used within a context manager with no args, and called as args of dag_node.execute() Python import ray ray.init() from ray.dag.input_node import InputNode @ray.remote def a(user_input): return user_input * 2 @ray.remote def b(user_input): return user_input + 1 @ray.remote def c(x, y): return x + y with InputNode() as dag_input: a_ref = a.bind(dag_input) b_ref = b.bind(dag_input) dag = c.bind(a_ref, b_ref) # a(2) + b(2) = c # (2 * 2) + (2 * 1) assert ray.get(dag.execute(2)) == 7 # a(3) + b(3) = c # (3 * 2) + (3 * 1) assert ray.get(dag.execute(3)) == 10 More Resources You can find more application patterns and examples in the following resources from other Ray libraries built on top of Ray DAG API with the same mechanism. Visualization of DAGs DAG Cookbook and patterns Serve Deployment Graph’s original REP Miscellaneous Topics This page will cover some miscellaneous topics in Ray. Dynamic Remote Parameters Overloaded Functions Inspecting Cluster State Node Information Resource Information Running Large Ray Clusters Tuning Operating System Settings Maximum open files ARP cache Tuning Ray Settings Resource broadcasting Benchmark Dynamic Remote Parameters You can dynamically adjust resource requirements or return values of ray.remote during execution with .options. For example, here we instantiate many copies of the same actor with varying resource requirements. 
Note that to create these actors successfully, Ray will need to be started with sufficient CPU resources and the relevant custom resources: import ray @ray.remote(num_cpus=4) class Counter(object): def __init__(self): self.value = 0 def increment(self): self.value += 1 return self.value a1 = Counter.options(num_cpus=1, resources={"Custom1": 1}).remote() a2 = Counter.options(num_cpus=2, resources={"Custom2": 1}).remote() a3 = Counter.options(num_cpus=3, resources={"Custom3": 1}).remote() You can specify different resource requirements for tasks (but not for actor methods): ray.shutdown() ray.init(num_cpus=1, num_gpus=1) @ray.remote def g(): return ray.get_gpu_ids() object_gpu_ids = g.remote() assert ray.get(object_gpu_ids) == [] dynamic_object_gpu_ids = g.options(num_cpus=1, num_gpus=1).remote() assert ray.get(dynamic_object_gpu_ids) == [0] And vary the number of return values for tasks (and actor methods too): @ray.remote def f(n): return list(range(n)) id1, id2 = f.options(num_returns=2).remote(2) assert ray.get(id1) == 0 assert ray.get(id2) == 1 And specify a name for tasks (and actor methods too) at task submission time: import setproctitle @ray.remote def f(x): assert setproctitle.getproctitle() == "ray::special_f" return x + 1 obj = f.options(name="special_f").remote(3) assert ray.get(obj) == 4 This name will appear as the task name in the machine view of the dashboard, will appear as the worker process name when this task is executing (if a Python task), and will appear as the task name in the logs. Overloaded Functions Ray Java API supports calling overloaded java functions remotely. However, due to the limitation of Java compiler type inference, one must explicitly cast the method reference to the correct function type. For example, consider the following. Overloaded normal task call: public static class MyRayApp { public static int overloadFunction() { return 1; } public static int overloadFunction(int x) { return x; } } // Invoke overloaded functions. Assert.assertEquals((int) Ray.task((RayFunc0) MyRayApp::overloadFunction).remote().get(), 1); Assert.assertEquals((int) Ray.task((RayFunc1) MyRayApp::overloadFunction, 2).remote().get(), 2); Overloaded actor task call: public static class Counter { protected int value = 0; public int increment() { this.value += 1; return this.value; } } public static class CounterOverloaded extends Counter { public int increment(int diff) { super.value += diff; return super.value; } public int increment(int diff1, int diff2) { super.value += diff1 + diff2; return super.value; } } ActorHandle a = Ray.actor(CounterOverloaded::new).remote(); // Call an overloaded actor method by super class method reference. Assert.assertEquals((int) a.task(Counter::increment).remote().get(), 1); // Call an overloaded actor method, cast method reference first. a.task((RayFunc1) CounterOverloaded::increment).remote(); a.task((RayFunc2) CounterOverloaded::increment, 10).remote(); a.task((RayFunc3) CounterOverloaded::increment, 10, 10).remote(); Assert.assertEquals((int) a.task(Counter::increment).remote().get(), 33); Inspecting Cluster State Applications written on top of Ray will often want to have some information or diagnostics about the cluster. Some common questions include: How many nodes are in my autoscaling cluster? What resources are currently available in my cluster, both used and total? What are the objects currently in my cluster? For this, you can use the global state API. 
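For example, the short sketch below (our own illustration, not part of the API reference that follows) uses these calls to answer the first two questions; the individual APIs are documented in the next subsections:

import ray

ray.init()

# How many nodes are in my cluster?
alive_nodes = [node for node in ray.nodes() if node["Alive"]]
print("alive nodes:", len(alive_nodes))

# What resources are currently available in my cluster, both used and total?
total = ray.cluster_resources()
available = ray.available_resources()
used_cpus = total.get("CPU", 0) - available.get("CPU", 0)
print("total CPUs:", total.get("CPU", 0), "in use:", used_cpus)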
Node Information To get information about the current nodes in your cluster, you can use ray.nodes(): ray.nodes()[source] Get a list of the nodes in the cluster (for debugging only). Returns Information about the Ray clients in the cluster. DeveloperAPI: This API may change across minor Ray releases. ray.shutdown() import ray ray.init() print(ray.nodes()) [{'NodeID': '2691a0c1aed6f45e262b2372baf58871734332d7', 'Alive': True, 'NodeManagerAddress': '192.168.1.82', 'NodeManagerHostname': 'host-MBP.attlocal.net', 'NodeManagerPort': 58472, 'ObjectManagerPort': 52383, 'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_11-00-17_114725_17883/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-08-04_11-00-17_114725_17883/sockets/raylet', 'MetricsExportPort': 64860, 'alive': True, 'Resources': {'CPU': 16.0, 'memory': 100.0, 'object_store_memory': 34.0, 'node:192.168.1.82': 1.0}}] The above information includes: NodeID: A unique identifier for the raylet. alive: Whether the node is still alive. NodeManagerAddress: PrivateIP of the node that the raylet is on. Resources: The total resource capacity on the node. MetricsExportPort: The port number at which metrics are exposed to through a Prometheus endpoint. Resource Information To get information about the current total resource capacity of your cluster, you can use ray.cluster_resources(). ray.cluster_resources()[source] Get the current total cluster resources. Note that this information can grow stale as nodes are added to or removed from the cluster. Returns A dictionary mapping resource name to the total quantity of that resource in the cluster. DeveloperAPI: This API may change across minor Ray releases. To get information about the current available resource capacity of your cluster, you can use ray.available_resources(). ray.available_resources()[source] Get the current available cluster resources. This is different from cluster_resources in that this will return idle (available) resources rather than total resources. Note that this information can grow stale as tasks start and finish. Returns A dictionary mapping resource name to the total quantity of that resource in the cluster. DeveloperAPI: This API may change across minor Ray releases. Running Large Ray Clusters Here are some tips to run Ray with more than 1k nodes. When running Ray with such a large number of nodes, several system settings may need to be tuned to enable communication between such a large number of machines. Tuning Operating System Settings Because all nodes and workers connect to the GCS, many network connections will be created and the operating system has to support that number of connections. Maximum open files The OS has to be configured to support opening many TCP connections since every worker and raylet connects to the GCS. In POSIX systems, the current limit can be checked by ulimit -n and if it’s small, it should be increased according to the OS manual. ARP cache Another thing that needs to be configured is the ARP cache. In a large cluster, all the worker nodes connect to the head node, which adds a lot of entries to the ARP table. Ensure that the ARP cache size is large enough to handle this many nodes. Failure to do this will result in the head node hanging. When this happens, dmesg will show errors like neighbor table overflow message. In Ubuntu, the ARP cache size can be tuned in /etc/sysctl.conf by increasing the value of net.ipv4.neigh.default.gc_thresh1 - net.ipv4.neigh.default.gc_thresh3. For more details, please refer to the OS manual. 
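Before adjusting anything, it can help to confirm what the current limits are on a given node. The following snippet is our own convenience sketch (POSIX/Linux only, standard library only; it is not part of the Ray tooling) and can be run on the head and worker nodes before launching a large cluster:

import resource

# The soft/hard limit on open file descriptors (what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open file limit:", soft, hard)

# On Linux, the ARP/neighbor cache thresholds can be read from /proc.
for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
    with open(f"/proc/sys/net/ipv4/neigh/default/{name}") as f:
        print(name, "=", f.read().strip())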
Tuning Ray Settings There is an ongoing project focusing on improving Ray’s scalability and stability. Feel free to share your thoughts and use cases. To run a large cluster, several parameters need to be tuned in Ray. Resource broadcasting In Ray 2.3+, lightweight resource broadcasting is supported as an experimental feature. Turning it on can significantly reduce GCS load and thus improve its overall stability and scalability. To turn it on, this OS environment should be set: RAY_use_ray_syncer=true. This feature will be turned on by default in 2.4+. Benchmark The machine setup: 1 head node: m5.4xlarge (16 vCPUs/64GB mem) 2000 worker nodes: m5.large (2 vCPUs/8GB mem) The OS setup: Set the maximum number of opening files to 1048576 Increase the ARP cache size: net.ipv4.neigh.default.gc_thresh1=2048 net.ipv4.neigh.default.gc_thresh2=4096 net.ipv4.neigh.default.gc_thresh3=8192 The Ray setup: RAY_use_ray_syncer=true RAY_event_stats=false Test workload: Test script: code Benchmark result Number of actors Actor launch time Actor ready time Total time 20k (10 actors / node) 14.5s 136.1s 150.7s Authenticating Remote URIs in runtime_env This section helps you: Avoid leaking remote URI credentials in your runtime_env Provide credentials safely in KubeRay Understand best practices for authenticating your remote URI Authenticating Remote URIs You can add dependencies to your runtime_env with remote URIs. This is straightforward for files hosted publicly, because you simply paste the public URI into your runtime_env: runtime_env = {"working_dir": ( "https://github.com/" "username/repo/archive/refs/heads/master.zip" ) } However, dependencies hosted privately, in a private GitHub repo for example, require authentication. One common way to authenticate is to insert credentials into the URI itself: runtime_env = {"working_dir": ( "https://username:personal_access_token@github.com/" "username/repo/archive/refs/heads/master.zip" ) } In this example, personal_access_token is a secret credential that authenticates this URI. While Ray can successfully access your dependencies using authenticated URIs, you should not include secret credentials in your URIs for two reasons: Ray may log the URIs used in your runtime_env, which means the Ray logs could contain your credentials. Ray stores your remote dependency package in a local directory, and it uses a parsed version of the remote URI–including your credential–as the directory’s name. In short, your remote URI is not treated as a secret, so it should not contain secret info. Instead, use a netrc file. Running on VMs: the netrc File The netrc file contains credentials that Ray uses to automatically log into remote servers. Set your credentials in this file instead of in the remote URI: # "$HOME/.netrc" machine github.com login username password personal_access_token In this example, the machine github.com line specifies that any access to github.com should be authenticated using the provided login and password. On Unix, name the netrc file as .netrc. On Windows, name the file as _netrc. The netrc file requires owner read/write access, so make sure to run the chmod command after creating the file: chmod 600 "$HOME/.netrc" Add the netrc file to your VM container’s home directory, so Ray can access the runtime_env’s private remote URIs, even when they don’t contain credentials. Running on KubeRay: Secrets with netrc KubeRay can also obtain credentials from a netrc file for remote URIs. 
Supply your netrc file using a Kubernetes secret and a Kubernetes volume with these steps: 1. Launch your Kubernetes cluster. 2. Create the netrc file locally in your home directory. 3. Store the netrc file's contents as a Kubernetes secret on your cluster: kubectl create secret generic netrc-secret --from-file=.netrc="$HOME/.netrc" 4. Expose the secret to your KubeRay application using a mounted volume, and update the NETRC environment variable to point to the netrc file. Include the following YAML in your KubeRay config. headGroupSpec: ... containers: - name: ... image: rayproject/ray:latest ... volumeMounts: - mountPath: "/home/ray/netrcvolume/" name: netrc-kuberay readOnly: true env: - name: NETRC value: "/home/ray/netrcvolume/.netrc" volumes: - name: netrc-kuberay secret: secretName: netrc-secret workerGroupSpecs: ... containers: - name: ... image: rayproject/ray:latest ... volumeMounts: - mountPath: "/home/ray/netrcvolume/" name: netrc-kuberay readOnly: true env: - name: NETRC value: "/home/ray/netrcvolume/.netrc" volumes: - name: netrc-kuberay secret: secretName: netrc-secret 5. Apply your KubeRay config. Your KubeRay application can use the netrc file to access private remote URIs, even when they don't contain credentials. Ray Tutorials and Examples Machine Learning Examples Build Simple AutoML for Time Series Using Ray Build Batch Prediction Using Ray Build Batch Training Using Ray Build a Simple Parameter Server Using Ray Simple Parallel Model Selection Fault-Tolerant Fairseq Training Reinforcement Learning Examples These are simple examples that show you how to leverage Ray Core. For Ray's production-grade reinforcement learning library, see RLlib. Learning to Play Pong Asynchronous Advantage Actor Critic (A3C) Basic Examples A Gentle Introduction to Ray Core by Example Using Ray for Highly Parallelizable Tasks Running a Simple MapReduce Example with Ray Core A Gentle Introduction to Ray Core by Example Implement a function in Ray Core to understand how Ray works and its basic concepts. Python programmers, from those with less experience to those interested in advanced tasks, can start working with distributed computing by learning the Ray Core API. Install Ray Install Ray with the following command: ! pip install ray Ray Core Start a local cluster by running the following commands: import ray ray.init() Note the following lines in the output: ... INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8265 {'node_ip_address': '192.168.1.41', ... 'node_id': '...'} These messages indicate that the Ray cluster is working as expected. In this example output, the address of the Ray dashboard is http://127.0.0.1:8265. Access the Ray dashboard at the address on the first line of your output. The Ray dashboard displays information such as the number of CPU cores available and the total utilization of the current Ray application. This is a typical output for a laptop: {'CPU': 12.0, 'memory': 14203886388.0, 'node:127.0.0.1': 1.0, 'object_store_memory': 2147483648.0} Next is a brief introduction to the Ray Core API, which we refer to as the Ray API. The Ray API builds on concepts such as decorators, functions, and classes that are familiar to Python programmers. It is a universal programming interface for distributed computing. The engine handles the complicated work, allowing developers to use Ray with existing Python libraries and systems. Your First Ray API Example The following function retrieves and processes data from a database.
The dummy database is a plain Python list containing the words of the title of the "Learning Ray" book. The sleep function pauses for a certain amount of time to simulate the cost of accessing and processing data from the database. import time database = [ "Learning", "Ray", "Flexible", "Distributed", "Python", "for", "Machine", "Learning" ] def retrieve(item): time.sleep(item / 10.) return item, database[item] If the item with index 5 takes half a second (5 / 10.), an estimate of the total runtime to retrieve all eight items sequentially is (0+1+2+3+4+5+6+7)/10. = 2.8 seconds. Run the following code to get the actual time: def print_runtime(input_data, start_time): print(f'Runtime: {time.time() - start_time:.2f} seconds, data:') print(*input_data, sep="\n") start = time.time() data = [retrieve(item) for item in range(8)] print_runtime(data, start) Runtime: 2.82 seconds, data: (0, 'Learning') (1, 'Ray') (2, 'Flexible') (3, 'Distributed') (4, 'Python') (5, 'for') (6, 'Machine') (7, 'Learning') The total time to run the function is 2.82 seconds in this example, but the time may differ on your computer. Note that this basic Python version cannot run the function calls simultaneously. You might expect the Python list comprehension to be more efficient, but the measured runtime of 2.8 seconds is actually the worst-case scenario. Although this program "sleeps" for most of its runtime, it is slow because of the Global Interpreter Lock (GIL). Ray Tasks This task can benefit from parallelization. If it is perfectly distributed, the runtime should not take much longer than the slowest subtask, that is, 7/10. = 0.7 seconds. To extend this example to run in parallel on Ray, start by using the @ray.remote decorator: import ray @ray.remote def retrieve_task(item): return retrieve(item) With the decorator, the function retrieve_task becomes a Ray task. A Ray task is a function that Ray executes on a different process from where it was called, and possibly on a different machine. Ray is convenient to use because you can continue writing Python code, without having to significantly change your approach or programming style. Using the @ray.remote decorator on the retrieve function is the intended use of decorators, and you did not modify the original code in this example. To retrieve database entries and measure performance, you do not need to make many changes to the code. Here's an overview of the process: start = time.time() object_references = [ retrieve_task.remote(item) for item in range(8) ] data = ray.get(object_references) print_runtime(data, start) 2022-12-20 13:52:34,632 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 Runtime: 2.82 seconds, data: (0, 'Learning') (1, 'Ray') (2, 'Flexible') (3, 'Distributed') (4, 'Python') (5, 'for') (6, 'Machine') (7, 'Learning') Running the task in parallel requires two minor code modifications. To execute your Ray task remotely, you must use a .remote() call. Ray executes remote tasks asynchronously, even on a local cluster. The items in the object_references list in the code snippet do not directly contain the results. If you check the Python type of the first item using type(object_references[0]), you see that it is actually an ObjectRef. These object references correspond to futures for which you need to request the result. The call ray.get() is for requesting the result. Whenever you call remote on a Ray task, it immediately returns one or more object references.
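To make the future concept concrete, here is a minimal sketch that reuses the retrieve_task defined above to inspect a single reference before resolving it (the exact type string printed may vary slightly between Ray versions):

# A remote call returns immediately with a reference, not with the result itself.
ref = retrieve_task.remote(2)
print(type(ref))     # an ObjectRef, i.e. a future
print(ray.get(ref))  # blocks until the task finishes, then prints (2, 'Flexible')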
Consider Ray tasks as the primary way of creating objects. The following section is an example that links multiple tasks together and allows Ray to pass and resolve the objects between them. Let’s review the previous steps. You started with a Python function, then decorated it with @ray.remote, making the function a Ray task. Instead of directly calling the original function in the code, you called .remote(...) on the Ray task. Finally, you retrieved the results from the Ray cluster using .get(...). Consider creating a Ray task from one of your own functions as an additional exercise. Let’s review the performance gain from using Ray tasks. On most laptops the runtime is around 0.71 seconds, which is slightly more than the slowest subtask, which is 0.7 seconds. You can further improve the program by leveraging more of Ray’s API. Object Stores The retrieve definition directly accesses items from the database. While this works well on a local Ray cluster, consider how it functions on an actual cluster with multiple computers. A Ray cluster has a head node with a driver process and multiple worker nodes with worker processes executing tasks. In this scenario the database is only defined on the driver, but the worker processes need access to it to run the retrieve task. Ray’s solution for sharing objects between the driver and workers or between workers is to use the ray.put function to place the data into Ray’s distributed object store. In the retrieve_task definition, you can add a db argument to pass later as the db_object_ref object. db_object_ref = ray.put(database) @ray.remote def retrieve_task(item, db): time.sleep(item / 10.) return item, db[item] By using the object store, you allow Ray to manage data access throughout the entire cluster. Although the object store involves some overhead, it improves performance for larger datasets. This step is crucial for a truly distributed environment. Rerun the example with the retrieve_task function to confirm that it executes as you expect. Non-blocking calls In the previous section, you used ray.get(object_references) to retrieve results. This call blocks the driver process until all results are available. This dependency can cause problems if each database item takes several minutes to process. More efficiency gains are possible if you allow the driver process to perform other tasks while waiting for results, and to process results as they are completed rather than waiting for all items to finish. Additionally, if one of the database items cannot be retrieved due to an issue like a deadlock in the database connection, the driver hangs indefinitely. To prevent indefinite hangs, set reasonable timeout values when using the wait function. For example, if you want to wait less than ten times the time of the slowest data retrieval task, use the wait function to stop the task after that time has passed. 
start = time.time() object_references = [ retrieve_task.remote(item, db_object_ref) for item in range(8) ] all_data = [] while len(object_references) > 0: finished, object_references = ray.wait( object_references, timeout=7.0 ) data = ray.get(finished) print_runtime(data, start) all_data.extend(data) print_runtime(all_data, start) Runtime: 0.11 seconds, data: (0, 'Learning') (1, 'Ray') Runtime: 0.31 seconds, data: (2, 'Flexible') (3, 'Distributed') Runtime: 0.51 seconds, data: (4, 'Python') (5, 'for') Runtime: 0.71 seconds, data: (6, 'Machine') (7, 'Learning') Runtime: 0.71 seconds, data: (0, 'Learning') (1, 'Ray') (2, 'Flexible') (3, 'Distributed') (4, 'Python') (5, 'for') (6, 'Machine') (7, 'Learning') Instead of printing the results, you can use the retrieved values within the while loop to initiate new tasks on other workers. Task dependencies You may want to perform an additional processing task on the retrieved data. For example, use the results from the first retrieval task to query other related data from the same database (perhaps from a different table). The code below sets up this follow-up task and executes both the retrieve_task and follow_up_task in sequence. @ray.remote def follow_up_task(retrieve_result): original_item, _ = retrieve_result follow_up_result = retrieve(original_item + 1) return retrieve_result, follow_up_result retrieve_refs = [retrieve_task.remote(item, db_object_ref) for item in [0, 2, 4, 6]] follow_up_refs = [follow_up_task.remote(ref) for ref in retrieve_refs] result = [print(data) for data in ray.get(follow_up_refs)] ((0, 'Learning'), (1, 'Ray')) ((2, 'Flexible'), (3, 'Distributed')) ((4, 'Python'), (5, 'for')) ((6, 'Machine'), (7, 'Learning')) If you’re unfamiliar with asynchronous programming, this example may not be particularly impressive. However, at second glance it might be surprising that the code runs at all. The code appears to be a regular Python function with a few list comprehensions. The function body of follow_up_task expects a Python tuple for its input argument retrieve_result. However, when you use the [follow_up_task.remote(ref) for ref in retrieve_refs] command, you are not passing tuples to the follow-up task. Instead, you are using the retrieve_refs to pass in Ray object references. Behind the scenes, Ray recognizes that the follow_up_task needs actual values, so it automatically uses the ray.get function to resolve these futures. Additionally, Ray creates a dependency graph for all the tasks and executes them in a way that respects their dependencies. You don’t have to explicitly tell Ray when to wait for a previous task to be completed––it infers the order of execution. This feature of the Ray object store is useful because you avoid copying large intermediate values back to the driver by passing the object references to the next task and letting Ray handle the rest. The next steps in the process are only scheduled once the tasks specifically designed to retrieve information are completed. In fact, if retrieve_refs was called retrieve_result, you might not have noticed this crucial and intentional naming nuance. Ray allows you to concentrate on your work rather than the technicalities of cluster computing. The dependency graph for the two tasks looks like this: Task dependency Ray Actors This example covers one more significant aspect of Ray Core. Up until this step, everything is essentially a function. You used the @ray.remote decorator to make certain functions remote, but aside from that, you only used standard Python. 
If you want to keep track of how often the database is being queried, you could count the results of the retrieve tasks. However, is there a more efficient way to do this? Ideally, you want to track this in a distributed manner that can handle a large amount of data. Ray provides a solution with actors, which run stateful computations on a cluster and can also communicate with each other. Similar to how you create Ray tasks using decorated functions, create Ray actors using decorated Python classes. Therefore, you can create a simple counter using a Ray actor to track the number of database calls. @ray.remote class DataTracker: def __init__(self): self._counts = 0 def increment(self): self._counts += 1 def counts(self): return self._counts The DataTracker class becomes an actor when you give it the ray.remote decorator. This actor is capable of tracking state, such as a count, and its methods are Ray actor tasks that you can invoke in the same way as functions using .remote(). Modify the retrieve_task to incorporate this actor. @ray.remote def retrieve_tracker_task(item, tracker, db): time.sleep(item / 10.) tracker.increment.remote() return item, db[item] tracker = DataTracker.remote() object_references = [ retrieve_tracker_task.remote(item, tracker, db_object_ref) for item in range(8) ] data = ray.get(object_references) print(data) print(ray.get(tracker.counts.remote())) [(0, 'Learning'), (1, 'Ray'), (2, 'Flexible'), (3, 'Distributed'), (4, 'Python'), (5, 'for'), (6, 'Machine'), (7, 'Learning')] 8 As expected, the outcome of this calculation is 8. Although you don’t need actors to perform this calculation, this demonstrates a way to maintain state across the cluster, possibly involving multiple tasks. In fact, you could pass the actor into any related task or even into the constructor of a different actor. The Ray API is flexible, allowing for limitless possibilities. It’s rare for distributed Python tools to allow for stateful computations, which is especially useful for running complex distributed algorithms such as reinforcement learning. Summary In this example, you only used six API methods. These included ray.init() to initiate the cluster, @ray.remote to transform functions and classes into tasks and actors, ray.put() to transfer values into Ray’s object store, and ray.get() to retrieve objects from the cluster. Additionally, you used .remote() on actor methods or tasks to execute code on the cluster, and ray.wait to prevent blocking calls. The Ray API consists of more than these six calls, but these six are powerful, if you’re just starting out. To summarize more generally, the methods are as follows: ray.init(): Initializes your Ray cluster. Pass in an address to connect to an existing cluster. @ray.remote: Turns functions into tasks and classes into actors. ray.put(): Puts values into Ray’s object store. ray.get(): Gets values from the object store. Returns the values you’ve put there or that were computed by a task or actor. .remote(): Runs actor methods or tasks on your Ray cluster and is used to instantiate actors. ray.wait(): Returns two lists of object references, one with finished tasks we’re waiting for and one with unfinished tasks. Want to learn more? This example is a simplified version of the Ray Core walkthrough of our “Learning Ray” book. If you liked it, check out the Ray Core Examples Gallery or some of the ML workloads in our Use Case Gallery. 
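To make the summary above concrete, here is a compact, self-contained sketch that exercises all six calls together; the toy squaring workload and Counter actor are ours, purely for illustration:

import ray

ray.init()  # initialize or connect to a cluster

@ray.remote
def square(x, config):                 # @ray.remote turns a function into a task ...
    return x * x * config["scale"]

@ray.remote
class Counter:                         # ... and a class into an actor
    def __init__(self):
        self.n = 0
    def add(self):
        self.n += 1
    def count(self):
        return self.n

config_ref = ray.put({"scale": 10})    # ray.put() places a shared value in the object store
counter = Counter.remote()             # .remote() instantiates the actor

refs = [square.remote(i, config_ref) for i in range(8)]  # .remote() launches tasks
while refs:
    done, refs = ray.wait(refs)        # ray.wait() splits finished from unfinished refs
    print(ray.get(done[0]))            # ray.get() fetches a finished result
    counter.add.remote()               # actor methods are also invoked with .remote()

print("results counted:", ray.get(counter.count.remote()))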
Monte Carlo Estimation of π This tutorial shows you how to estimate the value of π using a Monte Carlo method that works by randomly sampling points within a 2x2 square. We can use the proportion of the points that are contained within the unit circle centered at the origin to estimate the ratio of the area of the circle to the area of the square. Given that we know the true ratio to be π/4, we can multiply our estimated ratio by 4 to approximate the value of π. The more points that we sample to calculate this approximation, the closer the value should be to the true value of π. We use Ray tasks to distribute the work of sampling and Ray actors to track the progress of these distributed sampling tasks. The code can run on your laptop and can be easily scaled to large clusters to increase the accuracy of the estimate. To get started, install Ray via pip install -U ray. See Installing Ray for more installation options. Starting Ray First, let's include all modules needed for this tutorial and start a local Ray cluster with ray.init(): import ray import math import time import random ray.init() In recent versions of Ray (>=1.5), ray.init() is automatically called on the first use of a Ray remote API. Defining the Progress Actor Next, we define a Ray actor that can be called by sampling tasks to update progress. Ray actors are essentially stateful services whose methods anyone holding an instance (a handle) of the actor can call. @ray.remote class ProgressActor: def __init__(self, total_num_samples: int): self.total_num_samples = total_num_samples self.num_samples_completed_per_task = {} def report_progress(self, task_id: int, num_samples_completed: int) -> None: self.num_samples_completed_per_task[task_id] = num_samples_completed def get_progress(self) -> float: return ( sum(self.num_samples_completed_per_task.values()) / self.total_num_samples ) We define a Ray actor by decorating a normal Python class with ray.remote. The progress actor has a report_progress() method that sampling tasks call to update their progress individually, and a get_progress() method to get the overall progress. Defining the Sampling Task After our actor is defined, we now define a Ray task that does the sampling up to num_samples and returns the number of samples that are inside the circle. Ray tasks are stateless functions. They execute asynchronously and run in parallel. @ray.remote def sampling_task(num_samples: int, task_id: int, progress_actor: ray.actor.ActorHandle) -> int: num_inside = 0 for i in range(num_samples): x, y = random.uniform(-1, 1), random.uniform(-1, 1) if math.hypot(x, y) <= 1: num_inside += 1 # Report progress every 1 million samples. if (i + 1) % 1_000_000 == 0: # This is async. progress_actor.report_progress.remote(task_id, i + 1) # Report the final progress. progress_actor.report_progress.remote(task_id, num_samples) return num_inside To convert a normal Python function to a Ray task, we decorate the function with ray.remote. The sampling task takes a progress actor handle as an input and reports progress to it. The above code shows an example of calling actor methods from tasks. Creating a Progress Actor Once the actor is defined, we can create an instance of it. # Change this to match your cluster scale. NUM_SAMPLING_TASKS = 10 NUM_SAMPLES_PER_TASK = 10_000_000 TOTAL_NUM_SAMPLES = NUM_SAMPLING_TASKS * NUM_SAMPLES_PER_TASK # Create the progress actor.
progress_actor = ProgressActor.remote(TOTAL_NUM_SAMPLES) To create an instance of the progress actor, simply call the ActorClass.remote() method with arguments to the constructor. This creates and runs the actor on a remote worker process. The return value of ActorClass.remote(...) is an actor handle that can be used to call its methods. Executing Sampling Tasks Now that the task is defined, we can execute it asynchronously. # Create and execute all sampling tasks in parallel. results = [ sampling_task.remote(NUM_SAMPLES_PER_TASK, i, progress_actor) for i in range(NUM_SAMPLING_TASKS) ] We execute the sampling task by calling its remote() method with arguments to the function. This immediately returns an ObjectRef as a future and then executes the function asynchronously on a remote worker process. Calling the Progress Actor While sampling tasks are running, we can periodically query the progress by calling the actor's get_progress() method. # Query progress periodically. while True: progress = ray.get(progress_actor.get_progress.remote()) print(f"Progress: {int(progress * 100)}%") if progress == 1: break time.sleep(1) To call an actor method, use actor_handle.method.remote(). This invocation immediately returns an ObjectRef as a future and then executes the method asynchronously on the remote actor process. To fetch the actual returned value of the ObjectRef, we use the blocking ray.get(). Calculating π Finally, we get the number of samples inside the circle from the remote sampling tasks and calculate π. # Get all the sampling tasks results. total_num_inside = sum(ray.get(results)) pi = (total_num_inside * 4) / TOTAL_NUM_SAMPLES print(f"Estimated value of π is: {pi}") As we can see from the above code, besides a single ObjectRef, ray.get() can also take a list of ObjectRefs and return a list of results. If you run this tutorial, you will see output like: Progress: 0% Progress: 15% Progress: 28% Progress: 40% Progress: 50% Progress: 60% Progress: 70% Progress: 80% Progress: 90% Progress: 100% Estimated value of π is: 3.1412202 Asynchronous Advantage Actor Critic (A3C) This example explains how to distribute simulations using Ray actors. For an overview of Ray's industry-grade reinforcement learning library, see RLlib. This document walks through A3C, a state-of-the-art reinforcement learning algorithm. In this example, we adapt the OpenAI Universe Starter Agent implementation of A3C to use Ray. View the code for this example. To run the application, first install ray and then some dependencies: pip install tensorflow pip install six pip install gym[atari] pip install scikit-image pip install scipy You can run the code with: rllib train --env=Pong-ram-v4 --run=A3C --config='{"num_workers": N}' Reinforcement Learning Reinforcement Learning is an area of machine learning concerned with learning how an agent should act in an environment so as to maximize some form of cumulative reward. Typically, an agent will observe the current state of the environment and take an action based on its observation. The action will change the state of the environment and will provide some numerical reward (or penalty) to the agent. The agent will then take in another observation and the process will repeat. The mapping from state to action is a policy, and in reinforcement learning, this policy is often represented with a deep neural network.
The environment is often a simulator (for example, a physics engine), and reinforcement learning algorithms often involve trying out many different sequences of actions within these simulators. These rollouts can often be done in parallel. Policies are often initialized randomly and incrementally improved via simulation within the environment. To improve a policy, gradient-based updates may be computed based on the sequences of states and actions that have been observed. The gradient calculation is often delayed until a termination condition is reached (that is, the simulation has finished) so that delayed rewards have been properly accounted for. However, in the Actor Critic model, we can begin the gradient calculation at any point in the simulation rollout by predicting future rewards with a Value Function approximator. In our A3C implementation, each worker, implemented as a Ray actor, continuously simulates the environment. The driver will create a task that runs some steps of the simulator using the latest model, computes a gradient update, and returns the update to the driver. Whenever a task finishes, the driver will use the gradient update to update the model and will launch a new task with the latest model. There are two main parts to the implementation: the driver and the worker. Worker Code Walkthrough We use a Ray Actor to simulate the environment. import numpy as np import ray @ray.remote class Runner: """Actor object to start running simulation on workers. Gradient computation is also executed on this object.""" def __init__(self, env_name, actor_id): # starts simulation environment, policy, and thread. # Thread will continuously interact with the simulation environment self.env = env = create_env(env_name) self.id = actor_id self.policy = LSTMPolicy() self.runner = RunnerThread(env, self.policy, 20) self.start() def start(self): # starts the simulation thread self.runner.start_runner() def pull_batch_from_queue(self): # Implementation details removed - gets partial rollout from queue return rollout def compute_gradient(self, params): self.policy.set_weights(params) rollout = self.pull_batch_from_queue() batch = process_rollout(rollout, gamma=0.99, lambda_=1.0) gradient = self.policy.compute_gradients(batch) info = {"id": self.id, "size": len(batch.a)} return gradient, info Driver Code Walkthrough The driver manages the coordination among workers and handles updating the global model parameters. The main training script looks like the following.
import numpy as np import ray def train(num_workers, env_name="PongDeterministic-v4"): # Setup a copy of the environment # Instantiate a copy of the policy - mainly used as a placeholder env = create_env(env_name, None, None) policy = LSTMPolicy(env.observation_space.shape, env.action_space.n, 0) obs = 0 # Start simulations on actors agents = [Runner.remote(env_name, i) for i in range(num_workers)] # Start gradient calculation tasks on each actor parameters = policy.get_weights() gradient_list = [agent.compute_gradient.remote(parameters) for agent in agents] while True: # Replace with your termination condition # wait for some gradient to be computed - unblock as soon as the earliest arrives done_id, gradient_list = ray.wait(gradient_list) # get the results of the task from the object store gradient, info = ray.get(done_id)[0] obs += info["size"] # apply update, get the weights from the model, start a new task on the same actor object policy.apply_gradients(gradient) parameters = policy.get_weights() gradient_list.extend([agents[info["id"]].compute_gradient.remote(parameters)]) return policy Benchmarks and Visualization For the PongDeterministic-v4 environment on an Amazon EC2 m4.16xlarge instance, we are able to train the agent with 16 workers in around 15 minutes. With 8 workers, we can train the agent in around 25 minutes. You can visualize performance by running tensorboard --logdir [directory] in a separate screen, where [directory] defaults to ~/ray_results/. If you are running multiple experiments, be sure to vary the directory to which TensorFlow saves its progress (found in a3c.py). Fault-Tolerant Fairseq Training For an overview of Ray's distributed training library, see Ray Train. This document provides a walkthrough of adapting the Fairseq library to perform fault-tolerant distributed training on AWS. As an example, we use the WikiText-103 dataset to pretrain the RoBERTa model following this tutorial. The pipeline and configurations in this document will work for other models supported by Fairseq, such as sequence-to-sequence machine translation models. To run this example, you will need to install Ray on your local machine to use the Ray cluster launcher. You can view the code for this example. To use the Ray cluster launcher on AWS, install boto3 (pip install boto3) and configure your AWS credentials in ~/.aws/credentials as described on the Automatic Cluster Setup page. We provide an example config file (lm-cluster.yaml). In the example config file, we use an m5.xlarge on-demand instance as the head node, and use p3.2xlarge GPU spot instances as the worker nodes. We set the minimum number of workers to 1 and the maximum to 2 in the config, which can be modified according to your own demand. We also mount Amazon EFS to store code, data and checkpoints. The {{SecurityGroupId}} and {{FileSystemId}} fields in the config file should be replaced by your own IDs. In setup_commands, we use the PyTorch environment in the Deep Learning AMI, and install Ray and Fairseq: setup_commands: - echo 'export PATH="$HOME/anaconda3/envs/pytorch_p36/bin:$PATH"' >> ~/.bashrc; source ~/.bashrc; pip install -U ray; pip install -U fairseq==0.8.0; Run the following command on your local machine to start the Ray cluster: ray up lm-cluster.yaml ray_train.sh also assumes that all of the lm/ files are in $HOME/efs.
You can move these files manually, or use the following command to upload files from a local path: ray rsync-up lm-cluster.yaml PATH/TO/LM '~/efs/lm' Preprocessing Data Once the cluster is started, you can then SSH into the head node using ray attach lm-cluster.yaml and download or preprocess the data on EFS for training. We can run preprocess.sh (code) to do this, which adapts instructions from the RoBERTa tutorial. Training We provide ray_train.py (code) as an entrypoint to the Fairseq library. Since we are training the model on spot instances, we provide fault tolerance in ray_train.py by checkpointing and restarting when a node fails. The code will also check whether there are new resources available after checkpointing. If so, the program will make use of them by restarting and resizing. Two main components of ray_train.py are a RayDistributedActor class and a function run_fault_tolerant_loop(). The RayDistributedActor sets proper arguments for the different Ray actor processes, adds a checkpoint hook to enable the process to make use of newly available GPUs, and calls the main function of Fairseq: import math import copy import socket import time import ray import fairseq from fairseq import options from fairseq_cli.train import main from contextlib import closing _original_save_checkpoint = fairseq.checkpoint_utils.save_checkpoint class RayDistributedActor: """Actor to perform distributed training.""" def run(self, url, world_rank, args): """Runs the fairseq training. We set args for different ray actors for communication, add a checkpoint hook, and call the main function of fairseq. """ # Set the init_method and rank of the process for distributed training. print("Ray worker at {url} rank {rank}".format( url=url, rank=world_rank)) self.url = url self.world_rank = world_rank args.distributed_rank = world_rank args.distributed_init_method = url # Add a checkpoint hook to make use of new resources. self.add_checkpoint_hook(args) # Call the original main function of fairseq. main(args, init_distributed=(args.distributed_world_size > 1)) def add_checkpoint_hook(self, args): """Add a hook to the original save_checkpoint function. This checks if there are new computational resources available. If so, raise an exception to restart the training process and make use of the new resources. """ if args.cpu: original_n_cpus = args.distributed_world_size def _new_save_checkpoint(*args, **kwargs): _original_save_checkpoint(*args, **kwargs) n_cpus = int(ray.cluster_resources()["CPU"]) if n_cpus > original_n_cpus: raise Exception( "New CPUs found (original %d CPUs, now %d CPUs)" % (original_n_cpus, n_cpus)) else: original_n_gpus = args.distributed_world_size def _new_save_checkpoint(*args, **kwargs): _original_save_checkpoint(*args, **kwargs) n_gpus = int(ray.cluster_resources().get("GPU", 0)) if n_gpus > original_n_gpus: raise Exception( "New GPUs found (original %d GPUs, now %d GPUs)" % (original_n_gpus, n_gpus)) fairseq.checkpoint_utils.save_checkpoint = _new_save_checkpoint def get_node_ip(self): """Returns the IP address of the current node.""" return ray._private.services.get_node_ip_address() def find_free_port(self): """Finds a free port on the current node.""" with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s: s.bind(("", 0)) s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) return s.getsockname()[1] The function run_fault_tolerant_loop() provides fault tolerance by catching failures and restarting the computation: def run_fault_tolerant_loop(): """Entrance function to the fairseq library, providing fault-tolerance.""" # Parse the command line arguments. parser = options.get_training_parser() add_ray_args(parser) args = options.parse_args_and_arch(parser) original_args = copy.deepcopy(args) # Main loop for fault-tolerant training. retry = True while retry: args = copy.deepcopy(original_args) # Initialize Ray. ray.init(address=args.ray_address) set_num_resources(args) set_batch_size(args) # Set up Ray distributed actors. Actor = ray.remote( num_cpus=1, num_gpus=int(not args.cpu))(RayDistributedActor) workers = [Actor.remote() for i in range(args.distributed_world_size)] # Get the IP address and a free port of actor 0, which is used for # fairseq distributed training. ip = ray.get(workers[0].get_node_ip.remote()) port = ray.get(workers[0].find_free_port.remote()) address = "tcp://{ip}:{port}".format(ip=ip, port=port) # Start the remote processes, and check whether any of the processes # fail. If so, restart all the processes. unfinished = [ worker.run.remote(address, i, args) for i, worker in enumerate(workers) ] try: while len(unfinished) > 0: finished, unfinished = ray.wait(unfinished) finished = ray.get(finished) retry = False except Exception as inst: print("Restarting Ray because the following error occurred:") print(inst) retry = True ray.shutdown() In ray_train.py, we also define a set of helper functions. add_ray_args() adds Ray and fault-tolerant training related arguments to the argument parser: def add_ray_args(parser): """Add ray and fault-tolerance related parser arguments to the parser.""" group = parser.add_argument_group("Ray related arguments") group.add_argument( "--ray-address", default="auto", type=str, help="address for ray initialization") group.add_argument( "--fix-batch-size", default=None, metavar="B1,B2,...,B_N", type=lambda uf: options.eval_str_list(uf, type=int), help="fix the actual batch size (max_sentences * update_freq " "* n_GPUs) to be the fixed input values by adjusting update_freq " "according to actual n_GPUs; the batch size is fixed to B_i for " "epoch i; all epochs >N are fixed to B_N") return group set_num_resources() sets the distributed world size to be the number of resources.
Also, if we want to use GPUs but the current number of GPUs is 0, the function will wait until a GPU is available: def set_num_resources(args): """Get the number of resources and set the corresponding fields.""" if args.cpu: args.distributed_world_size = int(ray.cluster_resources()["CPU"]) else: n_gpus = int(ray.cluster_resources().get("GPU", 0)) while n_gpus == 0: print("No GPUs available, waiting 10 seconds") time.sleep(10) n_gpus = int(ray.cluster_resources().get("GPU", 0)) args.distributed_world_size = n_gpus set_batch_size() keeps the effective batch size roughly the same across different numbers of GPUs: def set_batch_size(args): """Fixes the total batch_size to be agnostic to the GPU count.""" if args.fix_batch_size is not None: args.update_freq = [ math.ceil(batch_size / (args.max_sentences * args.distributed_world_size)) for batch_size in args.fix_batch_size ] print("Training on %d GPUs, max_sentences=%d, update_freq=%s" % (args.distributed_world_size, args.max_sentences, repr(args.update_freq))) To start training, run the following commands (ray_train.sh) on the head machine: cd ~/efs/lm TOTAL_UPDATES=125000 # Total number of training steps WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates PEAK_LR=0.0005 # Peak learning rate, adjust as needed TOKENS_PER_SAMPLE=512 # Max sequence length #MAX_POSITIONS=512 # Num. positional embeddings (usually same as above) MAX_SENTENCES=8 # Number of sequences per batch on one GPU (batch size) FIX_BATCH_SIZE=2048 # Total batch size (max_sentences * update_freq * n_gpus) SAVE_INTERVAL_UPDATES=1000 # save a checkpoint every N updates LOG_DIR=$HOME/efs/lm/log/ DATA_DIR=$HOME/efs/lm/data-bin/wikitext-103/ mkdir -p $LOG_DIR python $HOME/efs/lm/ray_train.py --fp16 $DATA_DIR \ --task masked_lm --criterion masked_lm \ --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \ --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \ --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \ --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \ --max-sentences $MAX_SENTENCES \ --fix-batch-size $FIX_BATCH_SIZE \ --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \ --save-interval-updates $SAVE_INTERVAL_UPDATES \ --save-dir $LOG_DIR --ddp-backend=no_c10d SAVE_INTERVAL_UPDATES controls how often to save a checkpoint, which can be tuned based on the stability of the chosen instances. FIX_BATCH_SIZE controls the total batch size to be a roughly fixed number. Helpful Ray Commands To let Ray automatically stop the cluster after training finishes, you can download ray_train.sh to ~/efs on the remote machine, and run the following command on your local machine: ray exec --stop lm-cluster.yaml 'bash $HOME/efs/lm/ray_train.sh' or run the following command on the remote head node: ray exec --stop ~/ray_bootstrap_config.yaml 'bash $HOME/efs/lm/ray_train.sh' To test the fault tolerance, you can run the following command on your local machine to randomly kill one node: ray kill-random-node lm-cluster.yaml Simple Parallel Model Selection For a production-grade implementation of distributed hyperparameter tuning, use Ray Tune, a scalable hyperparameter tuning library built using Ray's Actor API. In this example, we'll demonstrate how to quickly write a hyperparameter tuning script that evaluates a set of hyperparameters in parallel.
This script will demonstrate how to use two important parts of the Ray API: using ray.remote to define remote functions and ray.wait to wait for their results to be ready. Setup: Dependencies First, import some dependencies and define functions to generate random hyperparameters and retrieve data. import os import numpy as np from filelock import FileLock import torch import torch.nn as nn import torch.nn.functional as F import torch.optim as optim from torchvision import datasets, transforms import ray ray.init() # The number of sets of random hyperparameters to try. num_evaluations = 10 # A function for generating random hyperparameters. def generate_hyperparameters(): return { "learning_rate": 10 ** np.random.uniform(-5, 1), "batch_size": np.random.randint(1, 100), "momentum": np.random.uniform(0, 1), } def get_data_loaders(batch_size): mnist_transforms = transforms.Compose( [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] ) # We add FileLock here because multiple workers will want to # download data, and this may cause overwrites since # DataLoader is not threadsafe. with FileLock(os.path.expanduser("~/data.lock")): train_loader = torch.utils.data.DataLoader( datasets.MNIST( "~/data", train=True, download=True, transform=mnist_transforms ), batch_size=batch_size, shuffle=True, ) test_loader = torch.utils.data.DataLoader( datasets.MNIST("~/data", train=False, transform=mnist_transforms), batch_size=batch_size, shuffle=True, ) return train_loader, test_loader Setup: Defining the Neural Network We define a small neural network to use in training. In addition, we created methods to train and test this neural network. class ConvNet(nn.Module): """Simple two layer Convolutional Neural Network.""" def __init__(self): super(ConvNet, self).__init__() self.conv1 = nn.Conv2d(1, 3, kernel_size=3) self.fc = nn.Linear(192, 10) def forward(self, x): x = F.relu(F.max_pool2d(self.conv1(x), 3)) x = x.view(-1, 192) x = self.fc(x) return F.log_softmax(x, dim=1) def train(model, optimizer, train_loader, device=torch.device("cpu")): """Optimize the model with one pass over the data. Cuts off at 1024 samples to simplify training. """ model.train() for batch_idx, (data, target) in enumerate(train_loader): if batch_idx * len(data) > 1024: return data, target = data.to(device), target.to(device) optimizer.zero_grad() output = model(data) loss = F.nll_loss(output, target) loss.backward() optimizer.step() def test(model, test_loader, device=torch.device("cpu")): """Checks the validation accuracy of the model. Cuts off at 512 samples for simplicity. """ model.eval() correct = 0 total = 0 with torch.no_grad(): for batch_idx, (data, target) in enumerate(test_loader): if batch_idx * len(data) > 512: break data, target = data.to(device), target.to(device) outputs = model(data) _, predicted = torch.max(outputs.data, 1) total += target.size(0) correct += (predicted == target).sum().item() return correct / total Evaluating the Hyperparameters For a given configuration, the neural network created previously will be trained and return the accuracy of the model. These trained networks will then be tested for accuracy to find the best set of hyperparameters. The @ray.remote decorator defines a remote process. 
@ray.remote def evaluate_hyperparameters(config): model = ConvNet() train_loader, test_loader = get_data_loaders(config["batch_size"]) optimizer = optim.SGD( model.parameters(), lr=config["learning_rate"], momentum=config["momentum"] ) train(model, optimizer, train_loader) return test(model, test_loader) Synchronous Evaluation of Randomly Generated Hyperparameters We will create multiple sets of random hyperparameters for our neural network that will be evaluated in parallel. # Keep track of the best hyperparameters and the best accuracy. best_hyperparameters = None best_accuracy = 0 # A list holding the object refs for all of the experiments that we have # launched but have not yet been processed. remaining_ids = [] # A dictionary mapping an experiment's object ref to the # hyperparameters used for that experiment. hyperparameters_mapping = {} Launch asynchronous parallel tasks for evaluating different hyperparameters. accuracy_id is an ObjectRef that acts as a handle to the remote task. It is used later to fetch the result of the task when the task finishes. # Randomly generate sets of hyperparameters and launch a task to evaluate each one. for i in range(num_evaluations): hyperparameters = generate_hyperparameters() accuracy_id = evaluate_hyperparameters.remote(hyperparameters) remaining_ids.append(accuracy_id) hyperparameters_mapping[accuracy_id] = hyperparameters Process each hyperparameter and corresponding accuracy in the order that they finish to store the hyperparameters with the best accuracy. # Fetch and print the results of the tasks in the order that they complete. while remaining_ids: # Use ray.wait to get the object ref of the first task that completes. done_ids, remaining_ids = ray.wait(remaining_ids) # There is only one return result by default. result_id = done_ids[0] hyperparameters = hyperparameters_mapping[result_id] accuracy = ray.get(result_id) print( """We achieve accuracy {:.3}% with learning_rate: {:.2} batch_size: {} momentum: {:.2} """.format( 100 * accuracy, hyperparameters["learning_rate"], hyperparameters["batch_size"], hyperparameters["momentum"], ) ) if accuracy > best_accuracy: best_hyperparameters = hyperparameters best_accuracy = accuracy # Record the best performing set of hyperparameters. print( """Best accuracy over {} trials was {:.3} with learning_rate: {:.2} batch_size: {} momentum: {:.2} """.format( num_evaluations, 100 * best_accuracy, best_hyperparameters["learning_rate"], best_hyperparameters["batch_size"], best_hyperparameters["momentum"], ) ) Parameter Server For a production-grade implementation of distributed training, use Ray Train. The parameter server is a framework for distributed machine learning training. In the parameter server framework, a centralized server (or group of server nodes) maintains global shared parameters of a machine-learning model (e.g., a neural network) while the data and computation of calculating updates (i.e., gradient descent updates) are distributed over worker nodes. Parameter servers are a core part of many machine learning applications. This document walks through how to implement simple synchronous and asynchronous parameter servers using Ray actors. To run the application, first install some dependencies. pip install torch torchvision filelock Let's first define some helper functions and import some dependencies.
import os import torch import torch.nn as nn import torch.nn.functional as F from torchvision import datasets, transforms from filelock import FileLock import numpy as np import ray def get_data_loader(): """Safely downloads data. Returns training/validation set dataloader.""" mnist_transforms = transforms.Compose( [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] ) # We add FileLock here because multiple workers will want to # download data, and this may cause overwrites since # DataLoader is not threadsafe. with FileLock(os.path.expanduser("~/data.lock")): train_loader = torch.utils.data.DataLoader( datasets.MNIST( "~/data", train=True, download=True, transform=mnist_transforms ), batch_size=128, shuffle=True, ) test_loader = torch.utils.data.DataLoader( datasets.MNIST("~/data", train=False, transform=mnist_transforms), batch_size=128, shuffle=True, ) return train_loader, test_loader def evaluate(model, test_loader): """Evaluates the accuracy of the model on a validation dataset.""" model.eval() correct = 0 total = 0 with torch.no_grad(): for batch_idx, (data, target) in enumerate(test_loader): # This is only set to finish evaluation faster. if batch_idx * len(data) > 1024: break outputs = model(data) _, predicted = torch.max(outputs.data, 1) total += target.size(0) correct += (predicted == target).sum().item() return 100.0 * correct / total Setup: Defining the Neural Network We define a small neural network to use in training. We provide some helper functions for obtaining data, including getter/setter methods for gradients and weights. class ConvNet(nn.Module): """Small ConvNet for MNIST.""" def __init__(self): super(ConvNet, self).__init__() self.conv1 = nn.Conv2d(1, 3, kernel_size=3) self.fc = nn.Linear(192, 10) def forward(self, x): x = F.relu(F.max_pool2d(self.conv1(x), 3)) x = x.view(-1, 192) x = self.fc(x) return F.log_softmax(x, dim=1) def get_weights(self): return {k: v.cpu() for k, v in self.state_dict().items()} def set_weights(self, weights): self.load_state_dict(weights) def get_gradients(self): grads = [] for p in self.parameters(): grad = None if p.grad is None else p.grad.data.cpu().numpy() grads.append(grad) return grads def set_gradients(self, gradients): for g, p in zip(gradients, self.parameters()): if g is not None: p.grad = torch.from_numpy(g) Defining the Parameter Server The parameter server will hold a copy of the model. During training, it will: Receive gradients and apply them to its model. Send the updated model back to the workers. The @ray.remote decorator defines a remote process. It wraps the ParameterServer class and allows users to instantiate it as a remote actor. @ray.remote class ParameterServer(object): def __init__(self, lr): self.model = ConvNet() self.optimizer = torch.optim.SGD(self.model.parameters(), lr=lr) def apply_gradients(self, *gradients): summed_gradients = [ np.stack(gradient_zip).sum(axis=0) for gradient_zip in zip(*gradients) ] self.optimizer.zero_grad() self.model.set_gradients(summed_gradients) self.optimizer.step() return self.model.get_weights() def get_weights(self): return self.model.get_weights() Defining the Worker The worker will also hold a copy of the model. During training. it will continuously evaluate data and send gradients to the parameter server. The worker will synchronize its model with the Parameter Server model weights. 
@ray.remote class DataWorker(object): def __init__(self): self.model = ConvNet() self.data_iterator = iter(get_data_loader()[0]) def compute_gradients(self, weights): self.model.set_weights(weights) try: data, target = next(self.data_iterator) except StopIteration: # When the epoch ends, start a new epoch. self.data_iterator = iter(get_data_loader()[0]) data, target = next(self.data_iterator) self.model.zero_grad() output = self.model(data) loss = F.nll_loss(output, target) loss.backward() return self.model.get_gradients() Synchronous Parameter Server Training We'll now create a synchronous parameter server training scheme. We'll first instantiate a process for the parameter server, along with multiple workers. iterations = 200 num_workers = 2 ray.init(ignore_reinit_error=True) ps = ParameterServer.remote(1e-2) workers = [DataWorker.remote() for i in range(num_workers)] We'll also instantiate a model on the driver process to evaluate the test accuracy during training. model = ConvNet() test_loader = get_data_loader()[1] Training alternates between: Computing the gradients given the current weights from the server. Updating the parameter server's weights with the gradients. print("Running synchronous parameter server training.") current_weights = ps.get_weights.remote() for i in range(iterations): gradients = [worker.compute_gradients.remote(current_weights) for worker in workers] # Calculate update after all gradients are available. current_weights = ps.apply_gradients.remote(*gradients) if i % 10 == 0: # Evaluate the current model. model.set_weights(ray.get(current_weights)) accuracy = evaluate(model, test_loader) print("Iter {}: \taccuracy is {:.1f}".format(i, accuracy)) print("Final accuracy is {:.1f}.".format(accuracy)) # Clean up Ray resources and processes before the next example. ray.shutdown() Asynchronous Parameter Server Training We'll now create an asynchronous parameter server training scheme. We'll again first instantiate a process for the parameter server, along with multiple workers. print("Running Asynchronous Parameter Server Training.") ray.init(ignore_reinit_error=True) ps = ParameterServer.remote(1e-2) workers = [DataWorker.remote() for i in range(num_workers)] Here, workers will asynchronously compute the gradients given their current weights and send these gradients to the parameter server as soon as they are ready. When the parameter server finishes applying the new gradient, the server will send back a copy of the current weights to the worker. The worker will then update the weights and repeat. current_weights = ps.get_weights.remote() gradients = {} for worker in workers: gradients[worker.compute_gradients.remote(current_weights)] = worker for i in range(iterations * num_workers): ready_gradient_list, _ = ray.wait(list(gradients)) ready_gradient_id = ready_gradient_list[0] worker = gradients.pop(ready_gradient_id) # Compute and apply gradients. current_weights = ps.apply_gradients.remote(*[ready_gradient_id]) gradients[worker.compute_gradients.remote(current_weights)] = worker if i % 10 == 0: # Evaluate the current model after every 10 updates. model.set_weights(ray.get(current_weights)) accuracy = evaluate(model, test_loader) print("Iter {}: \taccuracy is {:.1f}".format(i, accuracy)) print("Final accuracy is {:.1f}.".format(accuracy)) Final Thoughts This approach is powerful because it enables you to implement a parameter server with a few lines of code as part of a Python application.
As a result, this simplifies deploying applications that use parameter servers and modifying the behavior of the parameter server. For example, sharding the parameter server, changing the update rule, switching between asynchronous and synchronous updates, ignoring straggler workers, or any number of other customizations will only require a few extra lines of code. Learning to Play Pong For a production-grade implementation of distributed reinforcement learning, use Ray RLlib. In this example, we'll train a very simple neural network to play Pong using Gymnasium. At a high level, we will use multiple Ray actors to obtain simulation rollouts and calculate gradients simultaneously. We will then centralize these gradients and update the neural network. The updated neural network will then be passed back to each Ray actor for more gradient calculation. This application is adapted, with minimal modifications, from Andrej Karpathy's source code (see the accompanying blog post). To run the application, first install some dependencies. pip install gymnasium[atari] gym==0.26.2 At the moment, on a large machine with 64 physical cores, computing an update with a batch of size 1 takes about 1 second, and a batch of size 10 takes about 2.5 seconds. A batch of size 60 takes about 3 seconds. On a cluster with 11 nodes, each with 18 physical cores, a batch of size 300 takes about 10 seconds. If the numbers you see differ from these by much, take a look at the Troubleshooting section at the bottom of this page and consider submitting an issue. Note that these times depend on how long the rollouts take, which in turn depends on how well the policy is doing. For example, a really bad policy will lose very quickly. As the policy learns, we should expect these numbers to increase. import numpy as np import os import ray import time import gymnasium as gym Hyperparameters Here we'll define a couple of the hyperparameters that are used. H = 200 # The number of hidden layer neurons. gamma = 0.99 # The discount factor for reward. decay_rate = 0.99 # The decay factor for RMSProp leaky sum of grad^2. D = 80 * 80 # The input dimensionality: 80x80 grid. learning_rate = 1e-4 # Magnitude of the update. Helper Functions We first define a few helper functions: Preprocessing: The preprocess function will preprocess the original 210x160x3 uint8 frame into a one-dimensional 6400 float vector. Reward Processing: The process_rewards function will calculate a discounted reward. This formula states that the "value" of a sampled action is the weighted sum of all rewards afterwards, but later rewards are exponentially less important. Rollout: The rollout function plays an entire game of Pong (until either the computer or the RL agent loses). def preprocess(img): # Crop the image. img = img[35:195] # Downsample by factor of 2. img = img[::2, ::2, 0] # Erase background (background type 1). img[img == 144] = 0 # Erase background (background type 2). img[img == 109] = 0 # Set everything else (paddles, ball) to 1. img[img != 0] = 1 return img.astype(float).ravel() def process_rewards(r): """Compute discounted reward from a vector of rewards.""" discounted_r = np.zeros_like(r) running_add = 0 for t in reversed(range(0, r.size)): # Reset the sum, since this was a game boundary (pong specific!). if r[t] != 0: running_add = 0 running_add = running_add * gamma + r[t] discounted_r[t] = running_add return discounted_r def rollout(model, env): """Evaluates env and model until the env returns "Terminated" or "Truncated".
Returns: xs: A list of observations hs: A list of model hidden states per observation dlogps: A list of gradients drs: A list of rewards. """ # Reset the game. observation, info = env.reset() # Note that prev_x is used in computing the difference frame. prev_x = None xs, hs, dlogps, drs = [], [], [], [] terminated = truncated = False while not terminated and not truncated: cur_x = preprocess(observation) x = cur_x - prev_x if prev_x is not None else np.zeros(D) prev_x = cur_x aprob, h = model.policy_forward(x) # Sample an action. action = 2 if np.random.uniform() < aprob else 3 # The observation. xs.append(x) # The hidden state. hs.append(h) y = 1 if action == 2 else 0 # A "fake label". # The gradient that encourages the action that was taken to be # taken (see http://cs231n.github.io/neural-networks-2/#losses if # confused). dlogps.append(y - aprob) observation, reward, terminated, truncated, info = env.step(action) # Record reward (has to be done after we call step() to get reward # for previous action). drs.append(reward) return xs, hs, dlogps, drs Neural Network Here, a neural network is used to define a “policy” for playing Pong (that is, a function that chooses an action given a state). To implement a neural network in NumPy, we need to provide helper functions for calculating updates and computing the output of the neural network given an input, which in our case is an observation. class Model(object): """This class holds the neural network weights.""" def __init__(self): self.weights = {} self.weights["W1"] = np.random.randn(H, D) / np.sqrt(D) self.weights["W2"] = np.random.randn(H) / np.sqrt(H) def policy_forward(self, x): h = np.dot(self.weights["W1"], x) h[h < 0] = 0 # ReLU nonlinearity. logp = np.dot(self.weights["W2"], h) # Softmax p = 1.0 / (1.0 + np.exp(-logp)) # Return probability of taking action 2, and hidden state. return p, h def policy_backward(self, eph, epx, epdlogp): """Backward pass to calculate gradients. Arguments: eph: Array of intermediate hidden states. epx: Array of experiences (observations). epdlogp: Array of logps (output of last layer before softmax). """ dW2 = np.dot(eph.T, epdlogp).ravel() dh = np.outer(epdlogp, self.weights["W2"]) # Backprop relu. dh[eph <= 0] = 0 dW1 = np.dot(dh.T, epx) return {"W1": dW1, "W2": dW2} def update(self, grad_buffer, rmsprop_cache, lr, decay): """Applies the gradients to the model parameters with RMSProp.""" for k, v in self.weights.items(): g = grad_buffer[k] rmsprop_cache[k] = decay * rmsprop_cache[k] + (1 - decay) * g ** 2 self.weights[k] += lr * g / (np.sqrt(rmsprop_cache[k]) + 1e-5) def zero_grads(grad_buffer): """Reset the batch gradient buffer.""" for k, v in grad_buffer.items(): grad_buffer[k] = np.zeros_like(v) Parallelizing Gradients We define an actor, which is responsible for taking a model and an env and performing a rollout + computing a gradient update. # This forces OpenMP to use 1 single thread, which is needed to # prevent contention between multiple actors. # See https://docs.ray.io/en/latest/ray-core/configure.html for # more details. os.environ["OMP_NUM_THREADS"] = "1" # Tell numpy to only use one core. If we don't do this, each actor may # try to use all of the cores and the resulting contention may result # in no speedup over the serial version. Note that if numpy is using # OpenBLAS, then you need to set OPENBLAS_NUM_THREADS=1, and you # probably need to do it from the command line (so it happens before # numpy is imported). 
os.environ["MKL_NUM_THREADS"] = "1" ray.init() @ray.remote class RolloutWorker(object): def __init__(self): self.env = gym.make("GymV26Environment-v0", env_id="ALE/Pong-v5") def compute_gradient(self, model): # Compute a simulation episode. xs, hs, dlogps, drs = rollout(model, self.env) reward_sum = sum(drs) # Vectorize the arrays. epx = np.vstack(xs) eph = np.vstack(hs) epdlogp = np.vstack(dlogps) epr = np.vstack(drs) # Compute the discounted reward backward through time. discounted_epr = process_rewards(epr) # Standardize the rewards to be unit normal (helps control the gradient # estimator variance). discounted_epr -= np.mean(discounted_epr) discounted_epr /= np.std(discounted_epr) # Modulate the gradient with advantage (the policy gradient magic # happens right here). epdlogp *= discounted_epr return model.policy_backward(eph, epx, epdlogp), reward_sum Running This example is easy to parallelize because the network can play ten games in parallel and no information needs to be shared between the games. In the loop, the network repeatedly plays games of Pong and records a gradient from each game. Every ten games, the gradients are combined together and used to update the network. iterations = 20 batch_size = 4 model = Model() actors = [RolloutWorker.remote() for _ in range(batch_size)] running_reward = None # "Xavier" initialization. # Update buffers that add up gradients over a batch. grad_buffer = {k: np.zeros_like(v) for k, v in model.weights.items()} # Update the rmsprop memory. rmsprop_cache = {k: np.zeros_like(v) for k, v in model.weights.items()} for i in range(1, 1 + iterations): model_id = ray.put(model) gradient_ids = [] # Launch tasks to compute gradients from multiple rollouts in parallel. start_time = time.time() gradient_ids = [actor.compute_gradient.remote(model_id) for actor in actors] for batch in range(batch_size): [grad_id], gradient_ids = ray.wait(gradient_ids) grad, reward_sum = ray.get(grad_id) # Accumulate the gradient over batch. for k in model.weights: grad_buffer[k] += grad[k] running_reward = ( reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01 ) end_time = time.time() print( "Batch {} computed {} rollouts in {} seconds, " "running mean is {}".format( i, batch_size, end_time - start_time, running_reward ) ) model.update(grad_buffer, rmsprop_cache, learning_rate, decay_rate) zero_grads(grad_buffer) Using Ray for Highly Parallelizable Tasks While Ray can be used for very complex parallelization tasks, often we just want to do something simple in parallel. For example, we may have 100,000 time series to process with exactly the same algorithm, and each one takes a minute of processing. Clearly running it on a single processor is prohibitive: this would take 70 days. Even if we managed to use 8 processors on a single machine, that would bring it down to 9 days. But if we can use 8 machines, each with 16 cores, it can be done in about 12 hours. How can we use Ray for these types of task? We take the simple example of computing the digits of pi. The algorithm is simple: generate random x and y, and if x^2 + y^2 < 1, it’s inside the circle, we count as in. This actually turns out to be pi/4 (remembering your high school math). The following code (and this notebook) assumes you have already set up your Ray cluster and that you are running on the head node. For more details on how to set up a Ray cluster please see Ray Clusters Getting Started. 
import ray import random import time import math from fractions import Fraction # Let's start Ray ray.init(address='auto') We use the @ray.remote decorator to create a Ray task. A task is like a function, except the result is returned asynchronously. It also may not run on the local machine; it may run elsewhere in the cluster. This way you can run multiple tasks in parallel, beyond the limit of the number of processors you can have in a single machine. @ray.remote def pi4_sample(sample_count): """pi4_sample runs sample_count experiments, and returns the fraction of time it was inside the circle. """ in_count = 0 for i in range(sample_count): x = random.random() y = random.random() if x*x + y*y <= 1: in_count += 1 return Fraction(in_count, sample_count) To get the result of a future, we use ray.get() which blocks until the result is complete. SAMPLE_COUNT = 1000 * 1000 start = time.time() future = pi4_sample.remote(sample_count = SAMPLE_COUNT) pi4 = ray.get(future) end = time.time() dur = end - start print(f'Running {SAMPLE_COUNT} tests took {dur} seconds') Running 1000000 tests took 1.4935967922210693 seconds Now let’s see how good our approximation is. pi = pi4 * 4 float(pi) 3.143024 abs(pi-math.pi)/pi 0.0004554042254233261 Meh. A little off – that’s barely 4 decimal places. Why don’t we do it 100,000 times as much? Let’s do 100 billion! FULL_SAMPLE_COUNT = 100 * 1000 * 1000 * 1000 # 100 billion samples! BATCHES = int(FULL_SAMPLE_COUNT / SAMPLE_COUNT) print(f'Doing {BATCHES} batches') results = [] for _ in range(BATCHES): results.append(pi4_sample.remote(sample_count = SAMPLE_COUNT)) output = ray.get(results) Doing 100000 batches Notice that in the above, we generated a list with 100,000 futures. Now all we have to do is wait for the results. Depending on your Ray cluster’s size, this might take a few minutes. But to give you some idea, when I ran a single batch on one machine it took 0.4 seconds. On a single core, that means we’re looking at 0.4 * 100000 = about 11 hours. Here’s what the Dashboard looks like: View of the dashboard So now, rather than just a single core working on this, I have 168 working on the task together. And it’s ~80% efficient. pi = sum(output)*4/len(output) float(pi) 3.14159518188 abs(pi-math.pi)/pi 8.047791203506436e-07 Not bad at all – we’re off by a millionth. Batch Prediction with Ray Core For a higher level API for batch inference on large datasets, see batch inference with Ray Data. This example is for users who want more control over data sharding and execution. Batch prediction is the process of using a trained model to generate predictions for a collection of observations. It has the following elements: Input dataset: this is a collection of observations to generate predictions for. The data is usually stored in an external storage system like S3, HDFS, or a database, and can be large. ML model: this is a trained ML model which is usually also stored in an external storage system. Predictions: these are the outputs of applying the ML model to the observations. The predictions are usually written back to the storage system. With Ray, you can build scalable batch prediction for large datasets at high prediction throughput. Ray Data provides a higher-level API for offline batch inference, with built-in optimizations. However, for more control, you can use the lower-level Ray Core APIs.
This example demonstrates batch inference with Ray Core by splitting the dataset into disjoint shards and executing them in parallel, with either Ray Tasks or Ray Actors across a Ray Cluster. Task-based batch prediction With Ray tasks, you can build a batch prediction program in this way: load the model, launch Ray tasks that each take in the model and a shard of the input dataset, and have each worker execute predictions on its assigned shard and write out the results. Let’s take the NYC taxi data from 2009 as an example. Suppose we have this simple model: import pandas as pd import numpy as np def load_model(): # A dummy model. def model(batch: pd.DataFrame) -> pd.DataFrame: # Dummy payload so copying the model will actually copy some data # across nodes. model.payload = np.zeros(100_000_000) return pd.DataFrame({"score": batch["passenger_count"] % 2 == 0}) return model The dataset has 12 files (one for each month) so we can naturally have each Ray task take one file. By taking the model and a shard of the input dataset (i.e. a single file), we can define a Ray remote task for prediction: import pyarrow.parquet as pq import ray @ray.remote def make_prediction(model, shard_path): df = pq.read_table(shard_path).to_pandas() result = model(df) # Write out the prediction result. # NOTE: unless the driver needs to further process the # result (other than simply writing it out to the storage system), # writing out inside the remote task is recommended, as it avoids # congesting or overloading the driver. # ... # Here we just return the size of the result in this example. return len(result) The driver launches all tasks for the entire input dataset. # 12 files, one for each remote task. input_files = [ f"s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet" f"/fe41422b01c04169af2a65a83b753e0f_{i:06d}.parquet" for i in range(12) ] # ray.put() the model just once to the local object store, and then pass the # reference to the remote tasks. model = load_model() model_ref = ray.put(model) result_refs = [] # Launch all prediction tasks. for file in input_files: # Launch a prediction task by passing the model reference and shard file to it. # NOTE: it would be highly inefficient to pass the model itself, as in # make_prediction.remote(model, file); to ship the model to a remote node, # Ray would ray.put(model) for each task, potentially overwhelming # the local object store and causing an out-of-disk error. result_refs.append(make_prediction.remote(model_ref, file)) results = ray.get(result_refs) # Let's check the prediction output size. for r in results: print("Prediction output size:", r) In order not to overload the cluster and cause OOM, we can control the parallelism by setting the proper resource requirement for tasks; see details about this design pattern in Pattern: Using resources to limit the number of concurrently running tasks. For example, if it’s easy for you to get a good estimate of the in-memory size of the data loaded from external storage, you can control the parallelism by specifying the amount of memory needed for each task, e.g. launching tasks with make_prediction.options(memory=100*1023*1025).remote(model_ref, file). Ray will then do the right thing and make sure tasks scheduled to a node will not exceed its total memory. To avoid repeatedly storing the same model into the object store (this can cause an out-of-disk error on the driver node), use ray.put() to store the model once, and then pass the reference around.
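For illustration, here is a minimal sketch of that memory-capped launch pattern; the 2 GiB figure is a hypothetical per-task estimate, not a value measured for this dataset.

# Sketch only: cap parallelism by declaring a per-task memory requirement.
# The 2 GiB estimate below is hypothetical; substitute your own measurement.
result_refs = [
    make_prediction.options(memory=2 * 1024 * 1024 * 1024).remote(model_ref, file)
    for file in input_files
]
results = ray.get(result_refs)

Ray then schedules only as many of these tasks onto a node as that node's declared memory allows, which is what limits the number of concurrently running tasks.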
To avoid congesting or overloading the driver node, it’s preferable to have each task write out its predictions (instead of returning results to the driver, which would do nothing but write them out to the storage system anyway). Actor-based batch prediction In the above solution, each Ray task will have to fetch the model from the driver node before it can start performing the prediction. This is an overhead cost that can be significant if the model size is large. We can optimize it by using Ray actors, which will fetch the model just once and reuse it for all tasks assigned to the actor. First, we define a callable class with one interface (the constructor) to load and cache the model, and another (the predict method) to take in a file and perform prediction. import pandas as pd import pyarrow.parquet as pq import ray @ray.remote class BatchPredictor: def __init__(self, model): self.model = model def predict(self, shard_path): df = pq.read_table(shard_path).to_pandas() result = self.model(df) # Write out the prediction result. # NOTE: unless the driver needs to further process the # result (other than simply writing it out to the storage system), # writing out inside the remote task is recommended, as it avoids # congesting or overloading the driver. # ... # Here we just return the size of the result in this example. return len(result) The constructor is called only once per actor worker. We use ActorPool to manage a set of actors that can receive prediction requests. from ray.util.actor_pool import ActorPool model = load_model() model_ref = ray.put(model) num_actors = 4 actors = [BatchPredictor.remote(model_ref) for _ in range(num_actors)] pool = ActorPool(actors) input_files = [ f"s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet" f"/fe41422b01c04169af2a65a83b753e0f_{i:06d}.parquet" for i in range(12) ] for file in input_files: pool.submit(lambda a, v: a.predict.remote(v), file) while pool.has_next(): print("Prediction output size:", pool.get_next()) Note that the ActorPool is fixed in size, unlike the task-based approach, where the number of parallel tasks can be dynamic (as long as it doesn’t exceed max_in_flight_tasks). To have an autoscaling actor pool, you will need to use Ray Data batch prediction. Batch prediction with GPUs If your cluster has GPU nodes and your predictor can utilize the GPUs, you can direct the tasks or actors to those GPU nodes by specifying num_gpus. Ray will schedule them onto GPU nodes accordingly. On the node, you will need to move the model to the GPU. The following is an example for a Torch model. import torch @ray.remote(num_gpus=1) def make_torch_prediction(model: torch.nn.Module, shard_path): # Move the model to the GPU. model.to(torch.device("cuda")) inputs = pq.read_table(shard_path).to_pandas().to_numpy() results = [] # for each tensor in inputs: # results.append(model(tensor)) # # Write out the results right in the task instead of returning them # to the driver node (unless you have to), to avoid congesting or # overloading the driver node. # ... # Here we just return simple/light meta information. return len(results) FAQs How to load and pass the model efficiently in a Ray cluster if the model is large? The recommended way (taking task-based batch prediction as an example; the actor-based approach is the same) is to: (1) let the driver load the model (e.g. from the storage system); (2) let the driver ray.put(model) to store the model into the object store; and (3) pass the same model reference to each remote task when launching it.
The remote task will fetch the model (from the driver’s object store) to its local object store before it starts performing prediction. Note it’s highly inefficient if you skip step 2 and pass the model itself (instead of the reference) to remote tasks. If the model is large and there are many tasks, it’ll likely cause an out-of-disk crash on the driver node. # GOOD: the model will be stored to driver's object store only once model = load_model() model_ref = ray.put(model) for file in input_files: make_prediction.remote(model_ref, file) # BAD: the same model will be stored to driver's object store repeatedly for each task model = load_model() for file in input_files: make_prediction.remote(model, file) For more details, check out Anti-pattern: Passing the same large argument by value repeatedly harms performance. How to improve the GPU utilization rate? To keep GPUs busy, there are the following things to look at: Schedule multiple tasks on the same GPU node if it has multiple GPUs: If there are multiple GPUs on the same node and a single task cannot use them all, you can direct multiple tasks to the node. This is automatically handled by Ray, e.g. if you specify num_gpus=1 and there are 4 GPUs, Ray will schedule 4 tasks to the node, provided there are enough tasks and no other resource constraints. Use the actor-based approach: as mentioned above, the actor-based approach is more efficient because it reuses model initialization across many tasks, so the node will spend more time on the actual workload. Batch Training with Ray Core The workload showcased in this notebook can be expressed using different Ray components, such as Ray Data, Ray Tune and Ray Core. For best practices, see Many Model Training. Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting simple models on multiple data batches corresponding to locations, products, etc. This notebook showcases how to conduct batch training on the NYC Taxi Dataset using only Ray Core and stateless Ray tasks. Batch training in the context of this notebook is understood as creating the same model(s) for different and separate datasets or subsets of a dataset. This task is naively parallelizable and can be easily scaled with Ray. Batch training diagram Contents In this tutorial, we will walk through the following steps: Reading parquet data, Using Ray tasks to preprocess, train and evaluate data batches, Dividing data into batches and spawning a Ray task for each batch to be run in parallel, Starting batch training, [Optional] Optimizing for runtime over memory with centralized data loading. Walkthrough We want to analyze the relationship between the dropoff location and the trip duration. The relationship will be very different for each pickup location, therefore we need to have a separate model for each of those. Furthermore, the relationship can change with time. Therefore, our task is to create separate models for each pickup location-month combination. The dataset we are using is already partitioned into months (each file corresponding to one month), and we can use the pickup_location_id column in the dataset to group it into data batches. We will then fit models for each batch and choose the best one. Let’s start by importing Ray and initializing a local Ray cluster.
from typing import Callable, Optional, List, Union, Tuple, Iterable import time import numpy as np import pandas as pd from sklearn.base import BaseEstimator from sklearn.model_selection import train_test_split from sklearn.metrics import mean_absolute_error import pyarrow as pa from pyarrow import fs from pyarrow import dataset as ds from pyarrow import parquet as pq import pyarrow.compute as pc import ray ray.init(ignore_reinit_error=True) For benchmarking purposes, we can print the times of various operations. In order to reduce clutter in the output, this is set to False by default. PRINT_TIMES = False def print_time(msg: str): if PRINT_TIMES: print(msg) To speed things up, we’ll only use a small subset of the full dataset consisting of two last months of 2019. You can choose to use the full dataset for 2018-2019 by setting the SMOKE_TEST variable to False. SMOKE_TEST = True Reading parquet data The read_data function reads a Parquet file and uses a push-down predicate to extract the data batch we want to fit a model on using the provided index to group the rows. By having each task read the data and extract batches separately, we ensure that memory utilization is minimal - as opposed to requiring each task to load the entire partition into memory first. We are using PyArrow to read the file, as it supports push-down predicates to be applied during file reading. This lets us avoid having to load an entire file into memory, which could cause an OOM error with a large dataset. After the dataset is loaded, we convert it to pandas so that it can be used for training with scikit-learn. def read_data(file: str, pickup_location_id: int) -> pd.DataFrame: return pq.read_table( file, filters=[("pickup_location_id", "=", pickup_location_id)], columns=[ "pickup_at", "dropoff_at", "pickup_location_id", "dropoff_location_id", ], ).to_pandas() Creating Ray tasks to preprocess, train and evaluate data batches As we will be using the NYC Taxi dataset, we define a simple batch transformation function to set correct data types, calculate the trip duration and fill missing values. def transform_batch(df: pd.DataFrame) -> pd.DataFrame: df["pickup_at"] = pd.to_datetime( df["pickup_at"], format="%Y-%m-%d %H:%M:%S" ) df["dropoff_at"] = pd.to_datetime( df["dropoff_at"], format="%Y-%m-%d %H:%M:%S" ) df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds df["pickup_location_id"] = df["pickup_location_id"].fillna(-1) df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1) return df We will be fitting scikit-learn models on data batches. We define a Ray task fit_and_score_sklearn that fits the model and calculates mean absolute error on the validation set. We will be treating this as a simple regression problem where we want to predict the relationship between the drop-off location and the trip duration. # Ray task to fit and score a scikit-learn model. @ray.remote def fit_and_score_sklearn( train: pd.DataFrame, test: pd.DataFrame, model: BaseEstimator ) -> Tuple[BaseEstimator, float]: train_X = train[["dropoff_location_id"]] train_y = train["trip_duration"] test_X = test[["dropoff_location_id"]] test_y = test["trip_duration"] # Start training. model = model.fit(train_X, train_y) pred_y = model.predict(test_X) error = mean_absolute_error(test_y, pred_y) return model, error Next, we will define a train_and_evaluate Ray task which contains all logic necessary to load a data batch, transform it, split it into train and test and fit and evaluate models on it. 
We make sure to return the file and location id used so that we can map the fitted models back to them. For data loading and processing, we are using the read_data and transform_batch functions we have defined earlier. def train_and_evaluate_internal( df: pd.DataFrame, models: List[BaseEstimator], pickup_location_id: int = 0 ) -> List[Tuple[BaseEstimator, float]]: # We need at least 4 rows to create a train / test split. if len(df) < 4: print( f"Dataframe for LocID: {pickup_location_id} is empty or smaller than 4" ) return None # Train / test split. train, test = train_test_split(df) # We put the train & test dataframes into Ray object store # so that they can be reused by all models fitted here. # https://docs.ray.io/en/master/ray-core/patterns/pass-large-arg-by-value.html train_ref = ray.put(train) test_ref = ray.put(test) # Launch a fit and score task for each model. results = ray.get( [ fit_and_score_sklearn.remote(train_ref, test_ref, model) for model in models ] ) results.sort(key=lambda x: x[1]) # sort by error return results @ray.remote def train_and_evaluate( file_name: str, pickup_location_id: int, models: List[BaseEstimator], ) -> Tuple[str, str, List[Tuple[BaseEstimator, float]]]: start_time = time.time() data = read_data(file_name, pickup_location_id) data_loading_time = time.time() - start_time print_time( f"Data loading time for LocID: {pickup_location_id}: {data_loading_time}" ) # Perform transformation start_time = time.time() data = transform_batch(data) transform_time = time.time() - start_time print_time( f"Data transform time for LocID: {pickup_location_id}: {transform_time}" ) # Perform training & evaluation for each model start_time = time.time() results = (train_and_evaluate_internal(data, models, pickup_location_id),) training_time = time.time() - start_time print_time( f"Training time for LocID: {pickup_location_id}: {training_time}" ) return ( file_name, pickup_location_id, results, ) Dividing data into batches and spawning a Ray task for each batch to be run in parallel The run_batch_training driver function generates tasks for each Parquet file it receives (with each file corresponding to one month). We define the function to take in a list of models, so that we can evaluate them all and choose the best one for each batch. The function blocks when it reaches ray.get() and waits for tasks to return their results. def run_batch_training(files: List[str], models: List[BaseEstimator]): print("Starting run...") start = time.time() # Store task references task_refs = [] for file in files: try: locdf = pq.read_table(file, columns=["pickup_location_id"]) except Exception: continue pickup_location_ids = locdf["pickup_location_id"].unique() for pickup_location_id in pickup_location_ids: # Cast PyArrow scalar to Python if needed.
try: pickup_location_id = pickup_location_id.as_py() except Exception: pass task_refs.append( train_and_evaluate.remote(file, pickup_location_id, models) ) # Block to obtain results from each task results = ray.get(task_refs) taken = time.time() - start count = len(results) # If result is None, then it means there weren't enough records to train results_not_none = [x for x in results if x is not None] count_not_none = len(results_not_none) # Sleep a moment for nicer output time.sleep(1) print("", flush=True) print(f"Number of pickup locations: {count}") print( f"Number of pickup locations with enough records to train: {count_not_none}" ) print(f"Number of models trained: {count_not_none * len(models)}") print(f"TOTAL TIME TAKEN: {taken:.2f} seconds") return results Starting batch training We can now tie everything together! First, we obtain the partitions of the dataset from an S3 bucket so that we can pass them to run. The dataset is partitioned by year and month, meaning each file represents one month. # Obtain the dataset. Each month is a separate file. dataset = ds.dataset( "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/", partitioning=["year", "month"], ) starting_idx = -2 if SMOKE_TEST else 0 files = [f"s3://anonymous@{file}" for file in dataset.files][starting_idx:] print(f"Obtained {len(files)} files!") Obtained 2 files! We can now run our script. The output is a list of tuples in the following format: (file name, partition id, list of models and their MAE scores). For brevity, we will print out the first 10 tuples. from sklearn.linear_model import LinearRegression results = run_batch_training(files, models=[LinearRegression()]) print(results[:10]) Starting run... (train_and_evaluate pid=3658) Dataframe for LocID: 214 is empty or smaller than 4 (train_and_evaluate pid=2027) Dataframe for LocID: 204 is empty or smaller than 4 (train_and_evaluate pid=3658) Dataframe for LocID: 176 is empty or smaller than 4 Number of pickup locations: 522 Number of pickup locations with enough records to train: 522 Number of models trained: 522 TOTAL TIME TAKEN: 139.27 seconds [('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 145, ([(LinearRegression(), 811.1991448011532)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 161, ([(LinearRegression(), 753.8173175448575)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 163, ([(LinearRegression(), 735.7208096221053)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 193, ([(LinearRegression(), 915.8790566477112)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 260, ([(LinearRegression(), 626.6908388606766)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 56, ([(LinearRegression(), 902.6575414213821)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 79, ([(LinearRegression(), 710.7781383724797)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 90, 
([(LinearRegression(), 667.0555322496516)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 162, ([(LinearRegression(), 700.0288733783458)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 50, ([(LinearRegression(), 697.2487742278146)],))] Using the output we’ve gotten, we can now tie each model to the given file (month)-pickup location combination and see their predictive power, as measured by the error. At this stage, we can carry on with further analysis if necessary or use them for inference. We can also provide multiple scikit-learn models to our run function and the best one will be chosen for each batch. A common use-case here would be to define several models of the same type with different hyperparameters. from sklearn.tree import DecisionTreeRegressor results = run_batch_training( files, models=[ LinearRegression(), DecisionTreeRegressor(), DecisionTreeRegressor(splitter="random"), ], ) print(results[:10]) Starting run... (train_and_evaluate pid=21437) Dataframe for LocID: 214 is empty or smaller than 4 (train_and_evaluate pid=21888) Dataframe for LocID: 204 is empty or smaller than 4 (train_and_evaluate pid=22358) Dataframe for LocID: 176 is empty or smaller than 4 Number of pickup locations: 522 Number of pickup locations with enough records to train: 522 Number of models trained: 1566 TOTAL TIME TAKEN: 247.80 seconds [('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 145, ([(DecisionTreeRegressor(splitter='random'), 586.3557158021763), (DecisionTreeRegressor(), 587.4490404009856), (LinearRegression(), 867.6406607489587)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 161, ([(DecisionTreeRegressor(), 598.902261739656), (DecisionTreeRegressor(splitter='random'), 598.9147196919863), (LinearRegression(), 760.6576436185691)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 163, ([(DecisionTreeRegressor(splitter='random'), 573.8896116905775), (DecisionTreeRegressor(), 573.8983618518819), (LinearRegression(), 738.3486584028989)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 193, ([(DecisionTreeRegressor(splitter='random'), 743.5483210338866), (DecisionTreeRegressor(), 744.3629120390056), (LinearRegression(), 953.6672220167137)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 260, ([(DecisionTreeRegressor(splitter='random'), 498.29219023609505), (DecisionTreeRegressor(), 501.13978495420673), (LinearRegression(), 665.543426962402)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 56, ([(LinearRegression(), 1516.8825665745849), (DecisionTreeRegressor(), 1572.7744375553175), (DecisionTreeRegressor(splitter='random'), 1572.7744375553175)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 79, ([(DecisionTreeRegressor(), 567.3130440323552), (DecisionTreeRegressor(splitter='random'), 567.3722846337118), 
(LinearRegression(), 701.2370802810619)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 90, ([(DecisionTreeRegressor(splitter='random'), 513.5831366488217), (DecisionTreeRegressor(), 513.6235175626782), (LinearRegression(), 666.2786163862434)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 162, ([(DecisionTreeRegressor(splitter='random'), 557.7537740834588), (DecisionTreeRegressor(), 557.7568050908675), (LinearRegression(), 701.2237669363365)],)), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 50, ([(DecisionTreeRegressor(), 563.7371119126768), (DecisionTreeRegressor(splitter='random'), 563.8079887794675), (LinearRegression(), 714.1553440667034)],))] [Optional] Optimizing for runtime over memory with centralized data loading In order to ensure that the data can always fit in memory, each task reads the files independently and extracts the desired data batch. This, however, negatively impacts the runtime. If we have sufficient memory in our Ray cluster, we can instead load each partition once, extract the batches, and save them in the Ray object store, reducing time required dramatically at a cost of higher memory usage. In other words, we perform centralized data loading using Ray object store as opposed to distributed data loading. Notice we do not call ray.get() on the references of the read_into_object_store. Instead, we pass the reference itself as the argument to the train_and_evaluate.remote dispatch, allowing for the data to stay in the object store until it is actually needed. This avoids a situation where all the data would be loaded into the memory of the process calling ray.get(). You can use the Ray Dashboard to compare the memory usage between the previous approach and this one. # Redefine the train_and_evaluate task to use in-memory data. # We still keep file_name and pickup_location_id for identification purposes. @ray.remote def train_and_evaluate( pickup_location_id_and_data: Tuple[int, pd.DataFrame], file_name: str, models: List[BaseEstimator], ) -> Tuple[str, str, List[Tuple[BaseEstimator, float]]]: pickup_location_id, data = pickup_location_id_and_data # Perform transformation start_time = time.time() # The underlying numpy arrays are stored in the Ray object # store for efficient access, making them immutable. We therefore # copy the DataFrame to obtain a mutable copy we can transform. data = data.copy() data = transform_batch(data) transform_time = time.time() - start_time print_time( f"Data transform time for LocID: {pickup_location_id}: {transform_time}" ) return ( file_name, pickup_location_id, train_and_evaluate_internal(data, models, pickup_location_id), ) # This allows us to create a Ray Task that is also a generator, returning object references. @ray.remote(num_returns="dynamic") def read_into_object_store(file: str) -> ray.ObjectRefGenerator: print(f"Loading {file}") # Read the entire file into memory. try: locdf = pq.read_table( file, columns=[ "pickup_at", "dropoff_at", "pickup_location_id", "dropoff_location_id", ], ) except Exception: return [] pickup_location_ids = locdf["pickup_location_id"].unique() for pickup_location_id in pickup_location_ids: # Each id-data batch tuple will be put as a separate object into the Ray object store. # Cast PyArrow scalar to Python if needed. 
try: pickup_location_id = pickup_location_id.as_py() except Exception: pass yield ( pickup_location_id, locdf.filter( pc.equal(locdf["pickup_location_id"], pickup_location_id) ).to_pandas(), ) def run_batch_training_with_object_store( files: List[str], models: List[BaseEstimator] ): print("Starting run...") start = time.time() # Store task references task_refs = [] # Use a SPREAD scheduling strategy to load each # file on a separate node as an OOM safeguard. # This is not foolproof though! We can also specify a resource # requirement for memory, if we know what is the maximum # memory requirement for a single file. read_into_object_store_spread = read_into_object_store.options( scheduling_strategy="SPREAD" ) # Dictionary of references to read tasks with file names as keys read_tasks_by_file = { files[file_id]: read_into_object_store_spread.remote(file) for file_id, file in enumerate(files) } for file, read_task_ref in read_tasks_by_file.items(): # We iterate over references and pass them to the tasks directly for pickup_location_id_and_data_batch_ref in iter(ray.get(read_task_ref)): task_refs.append( train_and_evaluate.remote( pickup_location_id_and_data_batch_ref, file, models ) ) # Block to obtain results from each task results = ray.get(task_refs) taken = time.time() - start count = len(results) # If result is None, then it means there weren't enough records to train results_not_none = [x for x in results if x is not None] count_not_none = len(results_not_none) # Sleep a moment for nicer output time.sleep(1) print("", flush=True) print(f"Number of pickup locations: {count}") print( f"Number of pickup locations with enough records to train: {count_not_none}" ) print(f"Number of models trained: {count_not_none * len(models)}") print(f"TOTAL TIME TAKEN: {taken:.2f} seconds") return results results = run_batch_training_with_object_store( files, models=[LinearRegression()] ) print(results[:10]) Starting run... 
(read_into_object_store pid=22584) Loading s3://air-example-data/ursa-labs-taxi-data/by_year/2019/06/data.parquet/ab5b9d2b8cc94be19346e260b543ec35_000000.parquet (read_into_object_store pid=22586) Loading s3://air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet (train_and_evaluate pid=22584) Dataframe for LocID: 214 is empty or smaller than 4 (train_and_evaluate pid=23204) Dataframe for LocID: 204 is empty or smaller than 4 (train_and_evaluate pid=23204) Dataframe for LocID: 176 is empty or smaller than 4 Number of pickup locations: 522 Number of pickup locations with enough records to train: 522 Number of models trained: 522 TOTAL TIME TAKEN: 19.47 seconds [('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 145, [(LinearRegression(), 851.6540137470965)]), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 161, [(LinearRegression(), 759.3457730674915)]), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 163, [(LinearRegression(), 743.6905538807495)]), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 193, [(LinearRegression(), 857.6903787276541)]), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 260, [(LinearRegression(), 646.4703728065817)]), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 56, [(LinearRegression(), 1372.6945225983686)]), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 79, [(LinearRegression(), 701.0097453726145)]), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 90, [(LinearRegression(), 650.179758287182)]), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 162, [(LinearRegression(), 706.316835556958)]), ('s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 50, [(LinearRegression(), 694.0467262859878)])] We can see that this approach allowed us to finish training much faster, but it would not have been possible if the dataset was too large to fit into our cluster memory. Therefore, this pattern is only recommended if the data you are working with is small. Otherwise, it is recommended to load the data inside the tasks right before its used. As always, your mileage may vary - we recommend you try both approaches for your workload and see what works best for you! Simple AutoML for time series with Ray Core We strongly recommend using Ray AIR Tuner for hyperparameter tuning/AutoML, which will enable you to build it faster and more easily, and get the built-in benefits like logging, fault tolerance and many more. If you think your use case cannot be supported by Ray AIR, we’d love to get your feedback e.g. through a Ray GitHub issue. 
AutoML (Automatic Machine Learning) is a broad topic, but in essence, it boils down to choosing the best model (and possibly preprocessing) for the task and dataset at hand. While there exist multiple advanced AutoML frameworks, we can quickly build a simple solution using just Ray Core and stateless tasks. If you are interested in applying more advanced optimization algorithms or would like to take advantage of a greater level of abstraction and multiple built-in features, we highly recommend using a Ray AIR Tuner. In this notebook, we will build an AutoML (or more precisely, an AutoTS) system which will choose the best combination of a statsforecast model and hyperparameters for a time series regression task - here, we will be using a partition of the M5 dataset. Simple AutoML consists of running different functions (hyperparameter configurations) on the same data independently of each other. We will want to train models with different configurations and evaluate them to obtain various metrics, such as mean squared error. After all configurations have been evaluated, we will be able to choose the best configuration according to the metric we want to use. AutoML To make this example more practical, we will be using time series cross-validation (CV) as our evaluation strategy. Cross-validation works by evaluating a model k times, each time choosing a different subset (fold) of the data for training and evaluation. This allows for more robust estimation of performance and helps prevent overfitting, especially with small data. In other words, we will be running n * k separate evaluations, where n is the number of configurations and k is the number of folds. Walkthrough Let’s start by importing Ray and initializing a local Ray cluster. from typing import List, Union, Callable, Dict, Type, Tuple import time import ray import itertools import pandas as pd import numpy as np from collections import defaultdict from statsforecast import StatsForecast from statsforecast.models import ETS, AutoARIMA, _TS from pyarrow import parquet as pq from sklearn.model_selection import TimeSeriesSplit from sklearn.metrics import mean_squared_error, mean_absolute_error ray.init(ignore_reinit_error=True) We will break up our logic into several functions and a Ray task. The Ray task is train_and_evaluate_fold, which contains all the logic necessary to fit and evaluate a model on a CV fold of data. We structure our task to take in a dataset and indices splitting it into train and test - that way, we can keep one instance of the dataset in the Ray object store and split it in each task separately. We are defining this as a Ray task as we want all folds to be evaluated in parallel on a Ray cluster - Ray will handle all orchestration and execution. Each task will reserve 1 CPU core by default. @ray.remote def train_and_evaluate_fold( model: _TS, df: pd.DataFrame, train_indices: np.ndarray, test_indices: np.ndarray, label_column: str, metrics: Dict[str, Callable[[pd.Series, pd.Series], float]], freq: str = "D", ) -> Dict[str, float]: try: # Create the StatsForecast object with train data & model. statsforecast = StatsForecast( df=df.iloc[train_indices], models=[model], freq=freq ) # Make a forecast and calculate metrics on test data. # This will fit the model first automatically.
forecast = statsforecast.forecast(len(test_indices)) return { metric_name: metric( df.iloc[test_indices][label_column], forecast[model.__class__.__name__] ) for metric_name, metric in metrics.items() } except Exception: # In case the model fit or eval fails, return None for all metrics. return {metric_name: None for metric_name, metric in metrics.items()} evaluate_models_with_cv is a driver function to run our optimization loop. We take in a list of models (with their parameters already set) and the dataframe. The dataframe is put into the Ray object store and reused, which means we only need to serialize it once. That way, we avoid an Anti-pattern: Passing the same large argument by value repeatedly harms performance. We treat the fitting of each fold as a separate task. We generate k-tasks for each model and wait for them to complete by calling ray.get(), which blocks until all tasks finish and the results are collected. We then aggregate the returned metrics to calculate mean metrics from each fold for each model. def evaluate_models_with_cv( models: List[_TS], df: pd.DataFrame, label_column: str, metrics: Dict[str, Callable[[pd.Series, pd.Series], float]], freq: str = "D", cv: Union[int, TimeSeriesSplit] = 5, ) -> Dict[_TS, Dict[str, float]]: # Obtain CV train-test indices for each fold. if isinstance(cv, int): cv = TimeSeriesSplit(cv) train_test_indices = list(cv.split(df)) # Put df into Ray object store for better performance. df_ref = ray.put(df) # Add tasks to be executed for each fold. fold_refs = [] for model in models: fold_refs.extend( [ train_and_evaluate_fold.remote( model, df_ref, train_indices, test_indices, label_column, metrics, freq=freq, ) for train_indices, test_indices in train_test_indices ] ) fold_results = ray.get(fold_refs) # Split fold results into a list of CV splits-sized chunks. # Ray guarantees that order is preserved. fold_results_per_model = [ fold_results[i : i + len(train_test_indices)] for i in range(0, len(fold_results), len(train_test_indices)) ] # Aggregate and average results from all folds per model. # We go from a list of dicts to a dict of lists and then # get a mean of those lists. mean_results_per_model = [] for model_results in fold_results_per_model: aggregated_results = defaultdict(list) for fold_result in model_results: for metric, value in fold_result.items(): aggregated_results[metric].append(value) mean_results = { metric: np.mean(values) for metric, values in aggregated_results.items() } mean_results_per_model.append(mean_results) # Join models and their metrics together. mean_results_per_model = { models[i]: mean_results_per_model[i] for i in range(len(mean_results_per_model)) } return mean_results_per_model Finally, we have to define the logic to translate a dictionary search space into instantiated models we can pass to evaluate_models_with_cv. scikit-learn and statsforecast models can be easily serialized and are very small, meaning instantiated models can be easily passed around the Ray cluster. With other frameworks, such as Torch, you may want to instead instantiate the model in the task that fits it in order to avoid issues. Our generate_configurations generator translates a two-level dictionary, where the keys are the model classes and the values are dictionaries of arguments and lists of their possible values. We want to run a grid search, meaning we want to evaluate every possible hyperparameter combination for the given models. 
The search space we will be using later looks like this: { AutoARIMA: {}, ETS: { "season_length": [6, 7], "model": ["ZNA", "ZZZ"] } } It will translate to the following models: AutoARIMA(), ETS(season_length=6, model="ZNA") ETS(season_length=7, model="ZNA") ETS(season_length=6, model="ZZZ") ETS(season_length=7, model="ZZZ") evaluate_search_space_with_cv is the entry point for our AutoML system, which takes in the search space, dataframe, label column, metrics, the metric to use to choose the best configuration, whether we want to minimize or maximize it, the frequency of the data and the scikit-learn TimeSeriesSplit cross-validation splitter to use. def generate_configurations(search_space: Dict[Type[_TS], Dict[str, list]]) -> _TS: # Convert dict search space into configurations - models instantiated with specific arguments. for model, model_search_space in search_space.items(): kwargs, values = model_search_space.keys(), model_search_space.values() # Get a product - all combinations in the per-model grid. for configuration in itertools.product(*values): yield model(**dict(zip(kwargs, configuration))) def evaluate_search_space_with_cv( search_space: Dict[Type[_TS], Dict[str, list]], df: pd.DataFrame, label_column: str, metrics: Dict[str, Callable[[pd.Series, pd.Series], float]], eval_metric: str, mode: str = "min", freq: str = "D", cv: Union[int, TimeSeriesSplit] = 5, ) -> List[Tuple[_TS, Dict[str, float]]]: assert eval_metric in metrics assert mode in ("min", "max") configurations = list(generate_configurations(search_space)) print( f"Evaluating {len(configurations)} configurations with {cv.get_n_splits()} splits each, " f"totalling {len(configurations)*cv.get_n_splits()} tasks..." ) ret = evaluate_models_with_cv( configurations, df, label_column, metrics, freq=freq, cv=cv ) # Sort the results by eval_metric ret = sorted(ret.items(), key=lambda x: x[1][eval_metric], reverse=(mode == "max")) print("Evaluation complete!") return ret With our system complete, we just need a quick helper function to obtain the data from an S3 bucket and preprocess it to the format statsforecast expects. As the dataset is quite large, we use PyArrow’s push-down predicate as a filter to obtain just the rows we care about without having to load them all into memory. def get_m5_partition(unique_id: str) -> pd.DataFrame: ds1 = pq.read_table( "s3://anonymous@m5-benchmarks/data/train/target.parquet", filters=[("item_id", "=", unique_id)], ) Y_df = ds1.to_pandas() # StatsForecasts expects specific column names! Y_df = Y_df.rename( columns={"item_id": "unique_id", "timestamp": "ds", "demand": "y"} ) Y_df["unique_id"] = Y_df["unique_id"].astype(str) Y_df["ds"] = pd.to_datetime(Y_df["ds"]) Y_df = Y_df.dropna() constant = 10 Y_df["y"] += constant return Y_df[Y_df.unique_id == unique_id] df = get_m5_partition("FOODS_1_001_CA_1") df
unique_id ds y
0 FOODS_1_001_CA_1 2011-01-29 13.0
1 FOODS_1_001_CA_1 2011-01-30 10.0
2 FOODS_1_001_CA_1 2011-01-31 10.0
3 FOODS_1_001_CA_1 2011-02-01 11.0
4 FOODS_1_001_CA_1 2011-02-02 14.0
... ... ... ...
1936 FOODS_1_001_CA_1 2016-05-18 10.0
1937 FOODS_1_001_CA_1 2016-05-19 11.0
1938 FOODS_1_001_CA_1 2016-05-20 10.0
1939 FOODS_1_001_CA_1 2016-05-21 10.0
1940 FOODS_1_001_CA_1 2016-05-22 10.0

1941 rows × 3 columns
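As a quick aside, the sketch below illustrates how scikit-learn's TimeSeriesSplit (the cv argument passed to the run below) produces expanding train windows with a fixed-size test set; the 10-point toy series is made up purely for illustration.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy illustration: 5 folds over 10 points, each test set holding 1 point.
toy_series = np.arange(10)
for train_idx, test_idx in TimeSeriesSplit(n_splits=5, test_size=1).split(toy_series):
    print("train:", train_idx, "test:", test_idx)

Each successive fold trains on a longer prefix of the series and evaluates on the point that immediately follows it.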

We can now run our AutoML system with our search space and obtain the best model with its configuration. We will be using scikit-learn implementations of mean squared error (MSE) and mean absolute error (MAE) as metrics, with the former being what we want to optimize for. tuning_results = evaluate_search_space_with_cv( {AutoARIMA: {}, ETS: {"season_length": [6, 7], "model": ["ZNA", "ZZZ"]}}, df, "y", {"mse": mean_squared_error, "mae": mean_absolute_error}, "mse", cv=TimeSeriesSplit(test_size=1), ) Evaluating 5 configurations with 5 splits each, totalling 25 tasks... Evaluation complete! We can see that the model that minimizes MSE the most from our search space is a ZNA ETS model with a season length of 6. print(tuning_results[0]) # Print arguments of the model: print(tuning_results[0][0].__dict__) (ETS, {'mse': 0.64205205, 'mae': 0.7200615}) {'season_length': 6, 'model': 'ZNA'} Speed up your web crawler by parallelizing it with Ray In this example we’ll quickly demonstrate how to build a simple web scraper in Python and parallelize it with Ray Tasks with minimal code changes. To run this example locally on your machine, please first install ray and beautifulsoup with pip install "beautifulsoup4==4.11.1" "ray>=2.2.0" First, we’ll define a function called find_links which takes a starting page (start_url) to crawl, and we’ll take the Ray documentation as an example of such a starting point. Our crawler simply extracts all available links from the starting URL that contain a given base_url (e.g. in our example we only want to follow links on http://docs.ray.io, not any external links). The find_links function is then called recursively with all the links we found this way, until a certain depth is reached. To extract the links from HTML elements on a site, we define a little helper function called extract_links, which takes care of handling relative URLs properly and sets a limit on the number of links returned from a site (max_results) to control the runtime of the crawler more easily. Here’s the full implementation: import requests from bs4 import BeautifulSoup def extract_links(elements, base_url, max_results=100): links = [] for e in elements: url = e["href"] if "https://" not in url: url = base_url + url if base_url in url: links.append(url) return set(links[:max_results]) def find_links(start_url, base_url, depth=2): if depth == 0: return set() page = requests.get(start_url) soup = BeautifulSoup(page.content, "html.parser") elements = soup.find_all("a", href=True) links = extract_links(elements, base_url) for url in links: new_links = find_links(url, base_url, depth-1) links = links.union(new_links) return links Let’s define a starting and base URL and crawl the Ray docs to a depth of 2. base = "https://docs.ray.io/en/latest/" docs = base + "index.html" %time len(find_links(docs, base)) CPU times: user 19.3 s, sys: 340 ms, total: 19.7 s Wall time: 25.8 s 591 As you can see, crawling the documentation root recursively like this returns a total of 591 pages and the wall time comes in at around 25 seconds. Crawling pages can be parallelized in many ways. Probably the simplest way is to simply start with multiple starting URLs and call find_links in parallel for each of them. We can do this with Ray Tasks in a straightforward way.
We simply use the ray.remote decorator to wrap the find_links function in a task called find_links_task like this: import ray @ray.remote def find_links_task(start_url, base_url, depth=2): return find_links(start_url, base_url, depth) To use this task to kick off a parallel call, the only thing you have to do is use find_links_task.remote(...) instead of calling the underlying Python function directly. Here’s how you run six crawlers in parallel: the first three (redundantly) crawl docs.ray.io again, and the other three crawl the main entry points of the Ray RLlib, Tune, and Serve libraries, respectively: links = [find_links_task.remote(f"{base}{lib}/index.html", base) for lib in ["", "", "", "rllib", "tune", "serve"]] %time for res in ray.get(links): print(len(res)) 591 591 591 105 204 105 CPU times: user 65.5 ms, sys: 47.8 ms, total: 113 ms Wall time: 27.2 s This parallel run crawls around four times the number of pages in roughly the same time as the initial, sequential run. Note the use of ray.get in the timed run to retrieve the results from Ray (the remote call promise gets resolved with get). Of course, there are much smarter ways to create a crawler and efficiently parallelize it, and this example gives you a starting point to work from. A Simple MapReduce Example with Ray Core This example demonstrates how to use Ray for a common distributed computing example: counting word occurrences across multiple documents. The complexity lies in the handling of a large corpus, requiring multiple compute nodes to process the data. The simplicity of implementing MapReduce with Ray is a significant milestone in distributed computing. Many popular big data technologies, such as Hadoop, are built on this programming model, underscoring the impact of using Ray Core. The MapReduce approach has three phases: Map phase The map phase applies a specified function to transform or map elements within a set of data. It produces key-value pairs: the key represents an element and the value is a metric calculated for that element. To count the number of times each word appears in a document, the map function outputs the pair (word, 1) every time a word appears, to indicate that it has been found once. Shuffle phase The shuffle phase collects all the outputs from the map phase and organizes them by key. When the same key is found on multiple compute nodes, this phase includes transferring or shuffling data between different nodes. If the map phase produces four occurrences of the pair (word, 1), the shuffle phase puts all occurrences of the word on the same node. Reduce phase The reduce phase aggregates the elements from the shuffle phase. The total count of each word’s occurrences is the sum of occurrences on each node. For example, four instances of (word, 1) combine for a final count of word: 4. The first and last phases are in the MapReduce name, but the middle phase is equally crucial. These phases appear straightforward, but their power is in running them concurrently on multiple machines. This figure illustrates the three MapReduce phases on a set of documents: Simple Map Reduce Loading Data We use Python to implement the MapReduce algorithm for the word count and Ray to parallelize the computation. We start by loading some sample data from the Zen of Python, a collection of coding guidelines for the Python community. Access to the Zen of Python, according to Easter egg tradition, is by typing import this in a Python session.
We divide the Zen of Python into three separate “documents” by treating each line as a separate entity and then splitting the lines into three partitions. import subprocess zen_of_python = subprocess.check_output(["python", "-c", "import this"]) corpus = zen_of_python.split() num_partitions = 3 chunk = len(corpus) // num_partitions partitions = [ corpus[i * chunk: (i + 1) * chunk] for i in range(num_partitions) ] Mapping Data For the map phase, we require a map function to use on each document. The output is the pair (word, 1) for every word found in a document. For basic text documents we load as Python strings, the process is as follows: def map_function(document): for word in document.lower().split(): yield word, 1 We use the apply_map function on a large collection of documents by marking it as a task in Ray using the @ray.remote decorator. When we call apply_map, we apply it to three sets of document data (num_partitions=3). The apply_map function returns three lists, one for each partition, so that Ray can rearrange the results of the map phase and distribute them to the appropriate nodes. import ray @ray.remote def apply_map(corpus, num_partitions=3): map_results = [list() for _ in range(num_partitions)] for document in corpus: for result in map_function(document): first_letter = result[0].decode("utf-8")[0] word_index = ord(first_letter) % num_partitions map_results[word_index].append(result) return map_results For text corpora that can be stored on a single machine, the map phase is not necessary. However, when the data needs to be divided across multiple nodes, the map phase is useful. To apply the map phase to the corpus in parallel, we use a remote call on apply_map, similar to the previous examples. The main difference is that we want three results returned (one for each partition) using the num_returns argument. map_results = [ apply_map.options(num_returns=num_partitions) .remote(data, num_partitions) for data in partitions ] for i in range(num_partitions): mapper_results = ray.get(map_results[i]) for j, result in enumerate(mapper_results): print(f"Mapper {i}, return value {j}: {result[:2]}") Mapper 0, return value 0: [(b'of', 1), (b'is', 1)] Mapper 0, return value 1: [(b'python,', 1), (b'peters', 1)] Mapper 0, return value 2: [(b'the', 1), (b'zen', 1)] Mapper 1, return value 0: [(b'unless', 1), (b'in', 1)] Mapper 1, return value 1: [(b'although', 1), (b'practicality', 1)] Mapper 1, return value 2: [(b'beats', 1), (b'errors', 1)] Mapper 2, return value 0: [(b'is', 1), (b'is', 1)] Mapper 2, return value 1: [(b'although', 1), (b'a', 1)] Mapper 2, return value 2: [(b'better', 1), (b'than', 1)] This example demonstrates how to collect data on the driver with ray.get. To continue with another task after the mapping phase, you wouldn’t do this. The following section shows how to run all phases together efficiently. Shuffling and Reducing Data The objective for the reduce phase is to transfer all pairs from the j-th return value to the same node. In the reduce phase we create a dictionary that adds up all word occurrences on each partition: @ray.remote def apply_reduce(*results): reduce_results = dict() for res in results: for key, value in res: if key not in reduce_results: reduce_results[key] = 0 reduce_results[key] += value return reduce_results We can take the j-th return value from each mapper and send it to the j-th reducer using the following method.
Note that this code works for large datasets that don’t fit on one machine, because we pass references to the data (Ray objects) rather than the data itself. Both the map and reduce phases can run on any Ray cluster, and Ray handles the data shuffling between them.

outputs = []
for i in range(num_partitions):
    outputs.append(
        apply_reduce.remote(*[partition[i] for partition in map_results])
    )

counts = {k: v for output in ray.get(outputs) for k, v in output.items()}

sorted_counts = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for count in sorted_counts:
    print(f"{count[0].decode('utf-8')}: {count[1]}")

is: 10
better: 8
than: 8
the: 6
to: 5
of: 3
although: 3
be: 3
unless: 2
one: 2
if: 2
implementation: 2
idea.: 2
special: 2
should: 2
do: 2
may: 2
a: 2
never: 2
way: 2
explain,: 2
ugly.: 1
implicit.: 1
complex.: 1
complex: 1
complicated.: 1
flat: 1
readability: 1
counts.: 1
cases: 1
rules.: 1
in: 1
face: 1
refuse: 1
one--: 1
only: 1
--obvious: 1
it.: 1
obvious: 1
first: 1
often: 1
*right*: 1
it's: 1
it: 1
idea: 1
--: 1
let's: 1
python,: 1
peters: 1
simple: 1
sparse: 1
dense.: 1
aren't: 1
practicality: 1
purity.: 1
pass: 1
silently.: 1
silenced.: 1
ambiguity,: 1
guess.: 1
and: 1
preferably: 1
at: 1
you're: 1
dutch.: 1
good: 1
are: 1
great: 1
more: 1
zen: 1
by: 1
tim: 1
beautiful: 1
explicit: 1
nested.: 1
enough: 1
break: 1
beats: 1
errors: 1
explicitly: 1
temptation: 1
there: 1
that: 1
not: 1
now: 1
never.: 1
now.: 1
hard: 1
bad: 1
easy: 1
namespaces: 1
honking: 1
those!: 1

For a thorough understanding of scaling MapReduce tasks across multiple nodes with Ray, including memory management, read the blog post on the topic.

Wrapping up

This MapReduce example demonstrates how flexible Ray’s programming model is. A production-grade MapReduce implementation requires more effort, but being able to reproduce a common algorithm like this one so quickly goes a long way. In the earlier years of MapReduce, around 2010, this paradigm was often the only model available for expressing distributed workloads. With Ray, an entire range of interesting distributed computing patterns is accessible to any intermediate Python programmer.

To learn more about Ray, and Ray Core in particular, see the Ray Core Examples Gallery, or the ML workloads in our Use Case Gallery. This MapReduce example can also be found in “Learning Ray”, which contains more examples similar to this one.

Ray Core API

Core API

ray.init([address, num_cpus, num_gpus, ...]) Connect to an existing Ray cluster or start one and connect to it.
ray.shutdown([_exiting_interpreter]) Disconnect the worker, and terminate processes started by ray.init().
ray.is_initialized() Check if ray.init has been called yet.
ray.job_config.JobConfig([jvm_options, ...]) A class used to store the configurations of a job.
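Before the detailed entries that follow, here is a minimal sketch of how these four Core APIs fit together in a single driver script. The namespace value is only an illustrative placeholder.

import ray
from ray.job_config import JobConfig

# Start (or connect to) Ray with an explicit job configuration.
ray.init(job_config=JobConfig(ray_namespace="example-namespace"))

# The driver is now attached to a cluster.
assert ray.is_initialized()

# Tear everything down again, for example between tests.
ray.shutdown()
assert not ray.is_initialized()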
ray.init ray.init(address: Optional[str] = None, *, num_cpus: Optional[int] = None, num_gpus: Optional[int] = None, resources: Optional[Dict[str, float]] = None, labels: Optional[Dict[str, str]] = None, object_store_memory: Optional[int] = None, local_mode: bool = False, ignore_reinit_error: bool = False, include_dashboard: Optional[bool] = None, dashboard_host: str = '127.0.0.1', dashboard_port: Optional[int] = None, job_config: ray.job_config.JobConfig = None, configure_logging: bool = True, logging_level: int = 'info', logging_format: Optional[str] = None, log_to_driver: bool = True, namespace: Optional[str] = None, runtime_env: Optional[Union[Dict[str, Any], RuntimeEnv]] = None, storage: Optional[str] = None, **kwargs) -> ray._private.worker.BaseContext[source] Connect to an existing Ray cluster or start one and connect to it. This method handles two cases; either a Ray cluster already exists and we just attach this driver to it or we start all of the processes associated with a Ray cluster and attach to the newly started cluster. In most cases, it is enough to just call this method with no arguments. This will autodetect an existing Ray cluster or start a new Ray instance if no existing cluster is found: ray.init() To explicitly connect to an existing local cluster, use this as follows. A ConnectionError will be thrown if no existing local cluster is found. ray.init(address="auto") To connect to an existing remote cluster, use this as follows (substituting in the appropriate address). Note the addition of “ray://” at the beginning of the address. ray.init(address="ray://123.45.67.89:10001") More details for starting and connecting to a remote cluster can be found here: https://docs.ray.io/en/master/cluster/getting-started.html You can also define an environment variable called RAY_ADDRESS in the same format as the address parameter to connect to an existing cluster with ray.init() or ray.init(address=”auto”). Parameters address – The address of the Ray cluster to connect to. The provided address is resolved as follows: 1. If a concrete address (e.g., localhost:) is provided, try to connect to it. Concrete addresses can be prefixed with “ray://” to connect to a remote cluster. For example, passing in the address “ray://123.45.67.89:50005” will connect to the cluster at the given address. 2. If no address is provided, try to find an existing Ray instance to connect to. This is done by first checking the environment variable RAY_ADDRESS. If this is not defined, check the address of the latest cluster started (found in /tmp/ray/ray_current_cluster) if available. If this is also empty, then start a new local Ray instance. 3. If the provided address is “auto”, then follow the same process as above. However, if there is no existing cluster found, this will throw a ConnectionError instead of starting a new local Ray instance. 4. If the provided address is “local”, start a new local Ray instance, even if there is already an existing local Ray instance. num_cpus – Number of CPUs the user wishes to assign to each raylet. By default, this is set based on virtual cores. num_gpus – Number of GPUs the user wishes to assign to each raylet. By default, this is set based on detected GPUs. resources – A dictionary mapping the names of custom resources to the quantities for them available. labels – [Experimental] The key-value labels of the node. object_store_memory – The amount of memory (in bytes) to start the object store with. By default, this is automatically set based on available system memory. 
local_mode – Deprecated: consider using the Ray Debugger instead. ignore_reinit_error – If true, Ray suppresses errors from calling ray.init() a second time. Ray won’t be restarted. include_dashboard – Boolean flag indicating whether or not to start the Ray dashboard, which displays the status of the Ray cluster. If this argument is None, then the UI will be started if the relevant dependencies are present. dashboard_host – The host to bind the dashboard server to. Can either be localhost (127.0.0.1) or 0.0.0.0 (available from all interfaces). By default, this is set to localhost to prevent access from external machines. dashboard_port (int, None) – The port to bind the dashboard server to. Defaults to 8265 and Ray will automatically find a free port if 8265 is not available. job_config (ray.job_config.JobConfig) – The job configuration. configure_logging – True (default) if configuration of logging is allowed here. Otherwise, the user may want to configure it separately. logging_level – Logging level, defaults to logging.INFO. Ignored unless “configure_logging” is true. logging_format – Logging format, defaults to string containing a timestamp, filename, line number, and message. See the source file ray_constants.py for details. Ignored unless “configure_logging” is true. log_to_driver – If true, the output from all of the worker processes on all nodes will be directed to the driver. namespace – A namespace is a logical grouping of jobs and named actors. runtime_env – The runtime environment to use for this job (see Runtime environments for details). storage – [Experimental] Specify a URI for persistent cluster-wide storage. This storage path must be accessible by all nodes of the cluster, otherwise an error will be raised. This option can also be specified as the RAY_STORAGE env var. _enable_object_reconstruction – If True, when an object stored in the distributed plasma store is lost due to node failure, Ray will attempt to reconstruct the object by re-executing the task that created the object. Arguments to the task will be recursively reconstructed. If False, then ray.ObjectLostError will be thrown. _redis_max_memory – Redis max memory. _plasma_directory – Override the plasma mmap file directory. _node_ip_address – The IP address of the node that we are on. _driver_object_store_memory – Deprecated. _memory – Amount of reservable memory resource in bytes rounded down to the nearest integer. _redis_password – Prevents external clients without the password from connecting to Redis if provided. _temp_dir – If provided, specifies the root temporary directory for the Ray process. Must be an absolute path. Defaults to an OS-specific conventional location, e.g., “/tmp/ray”. _metrics_export_port – Port number Ray exposes system metrics through a Prometheus endpoint. It is currently under active development, and the API is subject to change. _system_config – Configuration for overriding RayConfig defaults. For testing purposes ONLY. _tracing_startup_hook – If provided, turns on and sets up tracing for Ray. Must be the name of a function that takes no arguments and sets up a Tracer Provider, Remote Span Processors, and (optional) additional instruments. See more at docs.ray.io/tracing.html. It is currently under active development, and the API is subject to change. _node_name – User-provided node name or identifier. Defaults to the node IP address. 
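To make the parameter list above more concrete, here is a sketch of a local ray.init call that combines several of these options. The specific values are arbitrary examples, not recommendations.

import logging
import ray

ray.init(
    num_cpus=4,                          # cap the CPUs this local instance advertises
    num_gpus=0,                          # advertise no GPUs
    resources={"custom_resource": 2},    # register a custom resource
    include_dashboard=False,             # skip starting the dashboard
    ignore_reinit_error=True,            # tolerate a repeated ray.init() call
    logging_level=logging.DEBUG,         # verbose driver-side logging
    namespace="scratch",                 # logical grouping for jobs and named actors
    runtime_env={"pip": ["requests"]},   # per-job runtime environment
)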
Returns If the provided address includes a protocol, for example by prepending “ray://” to the address to get “ray://1.2.3.4:10001”, then a ClientContext is returned with information such as settings, server versions for ray and python, and the dashboard_url. Otherwise, a RayContext is returned with ray and python versions, and address information about the started processes. Raises Exception – An exception is raised if an inappropriate combination of arguments is passed in. PublicAPI: This API is stable across Ray releases.ray.shutdown ray.shutdown(_exiting_interpreter: bool = False)[source] Disconnect the worker, and terminate processes started by ray.init(). This will automatically run at the end when a Python process that uses Ray exits. It is ok to run this twice in a row. The primary use case for this function is to cleanup state between tests. Note that this will clear any remote function definitions, actor definitions, and existing actors, so if you wish to use any previously defined remote functions or actors after calling ray.shutdown(), then you need to redefine them. If they were defined in an imported module, then you will need to reload the module. Parameters _exiting_interpreter – True if this is called by the atexit hook and false otherwise. If we are exiting the interpreter, we will wait a little while to print any extra error messages. PublicAPI: This API is stable across Ray releases.ray.is_initialized ray.is_initialized() -> bool[source] Check if ray.init has been called yet. Returns True if ray.init has already been called and false otherwise. PublicAPI: This API is stable across Ray releases.ray.job_config.JobConfig class ray.job_config.JobConfig(jvm_options: Optional[List[str]] = None, code_search_path: Optional[List[str]] = None, runtime_env: Optional[dict] = None, _client_job: bool = False, metadata: Optional[dict] = None, ray_namespace: Optional[str] = None, default_actor_lifetime: str = 'non_detached', _py_driver_sys_path: Optional[List[str]] = None)[source] Bases: object A class used to store the configurations of a job. Examples import ray ray.shutdown() import ray from ray.job_config import JobConfig ray.init(job_config=JobConfig(default_actor_lifetime="non_detached")) Parameters jvm_options – The jvm options for java workers of the job. code_search_path – A list of directories or jar files that specify the search path for user code. This will be used as CLASSPATH in Java and PYTHONPATH in Python. See Ray cross-language programming for more details. runtime_env – A runtime environment dictionary. metadata – An opaque metadata dictionary. ray_namespace – A namespace is a logical grouping of jobs and named actors. default_actor_lifetime – The default value of actor lifetime, can be “detached” or “non_detached”. See actor lifetimes for more details. PublicAPI: This API is stable across Ray releases. Methods from_json(job_config_json) Generates a JobConfig object from json. set_default_actor_lifetime(...) Set the default actor lifetime, which can be "detached" or "non_detached". set_metadata(key, value) Add key-value pair to the metadata dictionary. set_ray_namespace(ray_namespace) Set Ray namespace. set_runtime_env(runtime_env[, validate]) Modify the runtime_env of the JobConfig. ray.job_config.JobConfig.from_json classmethod JobConfig.from_json(job_config_json)[source] Generates a JobConfig object from json. 
Examples from ray.job_config import JobConfig job_config = JobConfig.from_json( {"runtime_env": {"working_dir": "uri://abc"}}) Parameters job_config_json – The job config json dictionary.ray.job_config.JobConfig.set_default_actor_lifetime JobConfig.set_default_actor_lifetime(default_actor_lifetime: str) -> None[source] Set the default actor lifetime, which can be “detached” or “non_detached”. See actor lifetimes for more details. Parameters default_actor_lifetime – The default actor lifetime to set.ray.job_config.JobConfig.set_metadata JobConfig.set_metadata(key: str, value: str) -> None[source] Add key-value pair to the metadata dictionary. If the key already exists, the value is overwritten to the new value. Examples import ray from ray.job_config import JobConfig job_config = JobConfig() job_config.set_metadata("submitter", "foo") Parameters key – The key of the metadata. value – The value of the metadata.ray.job_config.JobConfig.set_ray_namespace JobConfig.set_ray_namespace(ray_namespace: str) -> None[source] Set Ray namespace. Parameters ray_namespace – The namespace to set.ray.job_config.JobConfig.set_runtime_env JobConfig.set_runtime_env(runtime_env: Optional[Union[Dict[str, Any], RuntimeEnv]], validate: bool = False) -> None[source] Modify the runtime_env of the JobConfig. We don’t validate the runtime_env by default here because it may go through some translation before actually being passed to C++ (e.g., working_dir translated from a local directory to a URI). Parameters runtime_env – A runtime environment dictionary. validate – Whether to validate the runtime env. Attributes jvm_options The jvm options for java workers of the job. code_search_path A list of directories or jar files that specify the search path for user code. metadata An opaque metadata dictionary. ray_namespace A namespace is a logical grouping of jobs and named actors. ray.job_config.JobConfig.jvm_options JobConfig.jvm_options The jvm options for java workers of the job.ray.job_config.JobConfig.code_search_path JobConfig.code_search_path A list of directories or jar files that specify the search path for user code.ray.job_config.JobConfig.metadata JobConfig.metadata An opaque metadata dictionary.ray.job_config.JobConfig.ray_namespace JobConfig.ray_namespace A namespace is a logical grouping of jobs and named actors. Tasks ray.remote() Defines a remote function or an actor class. ray.remote_function.RemoteFunction.options(...) Configures and overrides the task invocation parameters. ray.cancel(object_ref, *[, force, recursive]) Cancels a task according to the following conditions. 
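Before the full ray.remote reference below, the following sketch exercises the three task APIs listed above. The slow task and the two-CPU override are purely illustrative.

import time
import ray

ray.init()

@ray.remote
def slow_square(x: int) -> int:
    # A deliberately slow task so that cancellation has something to interrupt.
    time.sleep(30)
    return x * x

# Override the invocation parameters for a single call with .options().
ref = slow_square.options(num_cpus=2).remote(4)

# Cancel the pending or running task; ray.get on it then raises an error
# (for example, TaskCancelledError).
ray.cancel(ref)
try:
    ray.get(ref)
except ray.exceptions.RayError as err:
    print(f"task did not complete: {err}")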
ray.remote ray.remote(__function: Callable[[], ray._private.worker.R]) -> ray._private.worker.RemoteFunctionNoArgs[ray._private.worker.R][source] ray.remote(__function: Callable[[ray._private.worker.T0], ray._private.worker.R]) -> ray._private.worker.RemoteFunction0[ray._private.worker.R, ray._private.worker.T0] ray.remote(__function: Callable[[ray._private.worker.T0, ray._private.worker.T1], ray._private.worker.R]) -> ray._private.worker.RemoteFunction1[ray._private.worker.R, ray._private.worker.T0, ray._private.worker.T1] ray.remote(__function: Callable[[ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2], ray._private.worker.R]) -> ray._private.worker.RemoteFunction2[ray._private.worker.R, ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2] ray.remote(__function: Callable[[ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3], ray._private.worker.R]) -> ray._private.worker.RemoteFunction3[ray._private.worker.R, ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3] ray.remote(__function: Callable[[ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4], ray._private.worker.R]) -> ray._private.worker.RemoteFunction4[ray._private.worker.R, ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4] ray.remote(__function: Callable[[ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, ray._private.worker.T5], ray._private.worker.R]) -> ray._private.worker.RemoteFunction5[ray._private.worker.R, ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, ray._private.worker.T5] ray.remote(__function: Callable[[ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, ray._private.worker.T5, ray._private.worker.T6], ray._private.worker.R]) -> ray._private.worker.RemoteFunction6[ray._private.worker.R, ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, ray._private.worker.T5, ray._private.worker.T6] ray.remote(__function: Callable[[ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, ray._private.worker.T5, ray._private.worker.T6, ray._private.worker.T7], ray._private.worker.R]) -> ray._private.worker.RemoteFunction7[ray._private.worker.R, ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, ray._private.worker.T5, ray._private.worker.T6, ray._private.worker.T7] ray.remote(__function: Callable[[ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, ray._private.worker.T5, ray._private.worker.T6, ray._private.worker.T7, ray._private.worker.T8], ray._private.worker.R]) -> ray._private.worker.RemoteFunction8[ray._private.worker.R, ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, ray._private.worker.T5, ray._private.worker.T6, ray._private.worker.T7, ray._private.worker.T8] ray.remote(__function: Callable[[ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, 
ray._private.worker.T5, ray._private.worker.T6, ray._private.worker.T7, ray._private.worker.T8, ray._private.worker.T9], ray._private.worker.R]) -> ray._private.worker.RemoteFunction9[ray._private.worker.R, ray._private.worker.T0, ray._private.worker.T1, ray._private.worker.T2, ray._private.worker.T3, ray._private.worker.T4, ray._private.worker.T5, ray._private.worker.T6, ray._private.worker.T7, ray._private.worker.T8, ray._private.worker.T9] ray.remote(__t: type) -> Any ray.remote(*, num_returns: Union[int, float] = 'Undefined', num_cpus: Union[int, float] = 'Undefined', num_gpus: Union[int, float] = 'Undefined', resources: Dict[str, float] = 'Undefined', accelerator_type: str = 'Undefined', memory: Union[int, float] = 'Undefined', max_calls: int = 'Undefined', max_restarts: int = 'Undefined', max_task_retries: int = 'Undefined', max_retries: int = 'Undefined', runtime_env: Dict[str, Any] = 'Undefined', retry_exceptions: bool = 'Undefined', scheduling_strategy: Union[None, typing_extensions.Literal[DEFAULT], typing_extensions.Literal[SPREAD], ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy] = 'Undefined') -> ray._private.worker.RemoteDecorator Defines a remote function or an actor class. This function can be used as a decorator with no arguments to define a remote function or actor as follows: import ray @ray.remote def f(a, b, c): return a + b + c object_ref = f.remote(1, 2, 3) result = ray.get(object_ref) assert result == (1 + 2 + 3) @ray.remote class Foo: def __init__(self, arg): self.x = arg def method(self, a): return self.x + a actor_handle = Foo.remote(123) object_ref = actor_handle.method.remote(321) result = ray.get(object_ref) assert result == (123 + 321) Equivalently, use a function call to create a remote function or actor. def g(a, b, c): return a + b + c remote_g = ray.remote(g) object_ref = remote_g.remote(1, 2, 3) assert ray.get(object_ref) == (1 + 2 + 3) class Bar: def __init__(self, arg): self.x = arg def method(self, a): return self.x + a RemoteBar = ray.remote(Bar) actor_handle = RemoteBar.remote(123) object_ref = actor_handle.method.remote(321) result = ray.get(object_ref) assert result == (123 + 321) It can also be used with specific keyword arguments as follows: @ray.remote(num_gpus=1, max_calls=1, num_returns=2) def f(): return 1, 2 @ray.remote(num_cpus=2, resources={"CustomResource": 1}) class Foo: def method(self): return 1 Remote task and actor objects returned by @ray.remote can also be dynamically modified with the same arguments as above using .options() as follows: >>> @ray.remote(num_gpus=1, max_calls=1, num_returns=2) ... def f(): ... return 1, 2 >>> >>> f_with_2_gpus = f.options(num_gpus=2) >>> object_ref = f_with_2_gpus.remote() >>> assert ray.get(object_ref) == (1, 2) >>> @ray.remote(num_cpus=2, resources={"CustomResource": 1}) ... class Foo: ... def method(self): ... return 1 >>> >>> Foo_with_no_resources = Foo.options(num_cpus=1, resources=None) >>> foo_actor = Foo_with_no_resources.remote() >>> assert ray.get(foo_actor.method.remote()) == 1 A remote actor will be terminated when all actor handle to it in Python is deleted, which will cause them to complete any outstanding work and then shut down. If you only have 1 reference to an actor handle, calling del actor could trigger actor deletion. Note that your program may have multiple references to the same ActorHandle, and actor termination will not occur until the reference count goes to 0. See the Python documentation for more context about object deletion. 
https://docs.python.org/3.9/reference/datamodel.html#object.__del__ If you want to kill actors immediately, you can also call ray.kill(actor). Avoid repeatedly passing in large arguments to remote task or method calls. Instead, use ray.put to create a copy of the object in the object store. See more info here. Parameters num_returns – This is only for remote functions. It specifies the number of object refs returned by the remote function invocation. The default value is 1. Pass “dynamic” to allow the task to decide how many return values to return during execution, and the caller will receive an ObjectRef[ObjectRefGenerator]. See dynamic generators for more details. num_cpus – The quantity of CPU resources to reserve for this task or for the lifetime of the actor. By default, tasks use 1 CPU resource and actors use 1 CPU for scheduling and 0 CPU for running (This means, by default, actors cannot get scheduled on a zero-cpu node, but an infinite number of them can run on any non-zero cpu node. The default value for actors was chosen for historical reasons. It’s recommended to always explicitly set num_cpus for actors to avoid any surprises. If resources are specified explicitly, they are required for both scheduling and running.) See specifying resource requirements for more details. num_gpus – The quantity of GPU resources to reserve for this task or for the lifetime of the actor. The default value is 0. See Ray GPU support for more details. resources (Dict[str, float]) – The quantity of various custom resources to reserve for this task or for the lifetime of the actor. This is a dictionary mapping strings (resource names) to floats. By default it is empty. accelerator_type – If specified, requires that the task or actor run on a node with the specified type of accelerator. See ray.util.accelerators for accelerator types. memory – The heap memory request in bytes for this task/actor, rounded down to the nearest integer. max_calls – Only for remote functions. This specifies the maximum number of times that a given worker can execute the given remote function before it must exit (this can be used to address memory leaks in third-party libraries or to reclaim resources that cannot easily be released, e.g., GPU memory that was acquired by TensorFlow). By default this is infinite for CPU tasks and 1 for GPU tasks (to force GPU tasks to release resources after finishing). max_restarts – Only for actors. This specifies the maximum number of times that the actor should be restarted when it dies unexpectedly. The minimum valid value is 0 (default), which indicates that the actor doesn’t need to be restarted. A value of -1 indicates that an actor should be restarted indefinitely. See actor fault tolerance for more details. max_task_retries – Only for actors. How many times to retry an actor task if the task fails due to a system error, e.g., the actor has died. If set to -1, the system will retry the failed task until the task succeeds, or the actor has reached its max_restarts limit. If set to n > 0, the system will retry the failed task up to n times, after which the task will throw a RayActorError exception upon ray.get. Note that Python exceptions are not considered system errors and will not trigger retries. The default value is 0. See actor fault tolerance for more details. max_retries – Only for remote functions. This specifies the maximum number of times that the remote function should be rerun when the worker process executing it crashes unexpectedly. 
The minimum valid value is 0, the default value is 3, and a value of -1 indicates infinite retries. See task fault tolerance for more details. runtime_env (Dict[str, Any]) – Specifies the runtime environment for this actor or task and its children. See Runtime environments for detailed documentation. retry_exceptions – Only for remote functions. This specifies whether application-level errors should be retried up to max_retries times. This can be a boolean or a list of exceptions that should be retried. See task fault tolerance for more details. scheduling_strategy – Strategy about how to schedule a remote function or actor. Possible values are None: ray will figure out the scheduling strategy to use, it will either be the PlacementGroupSchedulingStrategy using parent’s placement group if parent has one and has placement_group_capture_child_tasks set to true, or “DEFAULT”; “DEFAULT”: default hybrid scheduling; “SPREAD”: best effort spread scheduling; PlacementGroupSchedulingStrategy: placement group based scheduling; NodeAffinitySchedulingStrategy: node id based affinity scheduling. See Ray scheduling strategies for more details. _metadata – Extended options for Ray libraries. For example, _metadata={“workflows.io/options”: } for Ray workflows. PublicAPI: This API is stable across Ray releases.ray.remote_function.RemoteFunction.options RemoteFunction.options(**task_options)[source] Configures and overrides the task invocation parameters. The arguments are the same as those that can be passed to ray.remote. Overriding max_calls is not supported. Parameters num_returns – It specifies the number of object refs returned by the remote function invocation. num_cpus – The quantity of CPU cores to reserve for this task or for the lifetime of the actor. num_gpus – The quantity of GPUs to reserve for this task or for the lifetime of the actor. resources (Dict[str, float]) – The quantity of various custom resources to reserve for this task or for the lifetime of the actor. This is a dictionary mapping strings (resource names) to floats. accelerator_type – If specified, requires that the task or actor run on a node with the specified type of accelerator. See ray.util.accelerators for accelerator types. memory – The heap memory request in bytes for this task/actor, rounded down to the nearest integer. object_store_memory – The object store memory request for actors only. max_calls – This specifies the maximum number of times that a given worker can execute the given remote function before it must exit (this can be used to address memory leaks in third-party libraries or to reclaim resources that cannot easily be released, e.g., GPU memory that was acquired by TensorFlow). By default this is infinite for CPU tasks and 1 for GPU tasks (to force GPU tasks to release resources after finishing). max_retries – This specifies the maximum number of times that the remote function should be rerun when the worker process executing it crashes unexpectedly. The minimum valid value is 0, the default is 3 (default), and a value of -1 indicates infinite retries. runtime_env (Dict[str, Any]) – Specifies the runtime environment for this actor or task and its children. See Runtime environments for detailed documentation. retry_exceptions – This specifies whether application-level errors should be retried up to max_retries times. scheduling_strategy – Strategy about how to schedule a remote function or actor. 
Possible values are None: ray will figure out the scheduling strategy to use, it will either be the PlacementGroupSchedulingStrategy using parent’s placement group if parent has one and has placement_group_capture_child_tasks set to true, or “DEFAULT”; “DEFAULT”: default hybrid scheduling; “SPREAD”: best effort spread scheduling; PlacementGroupSchedulingStrategy: placement group based scheduling; NodeAffinitySchedulingStrategy: node id based affinity scheduling. _metadata – Extended options for Ray libraries. For example, _metadata={“workflows.io/options”: } for Ray workflows. Examples: @ray.remote(num_gpus=1, max_calls=1, num_returns=2) def f(): return 1, 2 # Task g will require 2 gpus instead of 1. g = f.options(num_gpus=2)ray.cancel ray.cancel(object_ref: ray.ObjectRef, *, force: bool = False, recursive: bool = True)[source] Cancels a task according to the following conditions. If the specified task is pending execution, it will not be executed. If the task is currently executing, the behavior depends on the force flag. When force=False, a KeyboardInterrupt will be raised in Python and when force=True, the executing task will immediately exit. If the task is already finished, nothing will happen. Only non-actor tasks can be canceled. Canceled tasks will not be retried (max_retries will not be respected). Calling ray.get on a canceled task will raise a TaskCancelledError or a WorkerCrashedError if force=True. Parameters object_ref – ObjectRef returned by the task that should be canceled. force – Whether to force-kill a running task by killing the worker that is running the task. recursive – Whether to try to cancel tasks submitted by the task specified. Raises TypeError – This is also raised for actor tasks. PublicAPI: This API is stable across Ray releases. Actors ray.remote() Defines a remote function or an actor class. ray.actor.ActorClass.options(**actor_options) Configures and overrides the actor instantiation parameters. ray.method(*args, **kwargs) Annotate an actor method. ray.get_actor(name[, namespace]) Get a handle to a named actor. ray.kill(actor, *[, no_restart]) Kill an actor forcefully. ray.actor.ActorClass.options ActorClass.options(**actor_options)[source] Configures and overrides the actor instantiation parameters. The arguments are the same as those that can be passed to ray.remote. Parameters num_cpus – The quantity of CPU cores to reserve for this task or for the lifetime of the actor. num_gpus – The quantity of GPUs to reserve for this task or for the lifetime of the actor. resources (Dict[str, float]) – The quantity of various custom resources to reserve for this task or for the lifetime of the actor. This is a dictionary mapping strings (resource names) to floats. accelerator_type – If specified, requires that the task or actor run on a node with the specified type of accelerator. See ray.util.accelerators for accelerator types. memory – The heap memory request in bytes for this task/actor, rounded down to the nearest integer. object_store_memory – The object store memory request for actors only. max_restarts – This specifies the maximum number of times that the actor should be restarted when it dies unexpectedly. The minimum valid value is 0 (default), which indicates that the actor doesn’t need to be restarted. A value of -1 indicates that an actor should be restarted indefinitely. max_task_retries – How many times to retry an actor task if the task fails due to a system error, e.g., the actor has died. 
If set to -1, the system will retry the failed task until the task succeeds, or the actor has reached its max_restarts limit. If set to n > 0, the system will retry the failed task up to n times, after which the task will throw a RayActorError exception upon ray.get. Note that Python exceptions are not considered system errors and will not trigger retries. max_pending_calls – Set the max number of pending calls allowed on the actor handle. When this value is exceeded, PendingCallsLimitExceeded will be raised for further tasks. Note that this limit is counted per handle. -1 means that the number of pending calls is unlimited. max_concurrency – The max number of concurrent calls to allow for this actor. This only works with direct actor calls. The max concurrency defaults to 1 for threaded execution, and 1000 for asyncio execution. Note that the execution order is not guaranteed when max_concurrency > 1. name – The globally unique name for the actor, which can be used to retrieve the actor via ray.get_actor(name) as long as the actor is still alive. namespace – Override the namespace to use for the actor. By default, actors are created in an anonymous namespace. The actor can be retrieved via ray.get_actor(name=name, namespace=namespace). lifetime – Either None, which defaults to the actor will fate share with its creator and will be deleted once its refcount drops to zero, or “detached”, which means the actor will live as a global object independent of the creator. runtime_env (Dict[str, Any]) – Specifies the runtime environment for this actor or task and its children. See Runtime environments for detailed documentation. scheduling_strategy – Strategy about how to schedule a remote function or actor. Possible values are None: ray will figure out the scheduling strategy to use, it will either be the PlacementGroupSchedulingStrategy using parent’s placement group if parent has one and has placement_group_capture_child_tasks set to true, or “DEFAULT”; “DEFAULT”: default hybrid scheduling; “SPREAD”: best effort spread scheduling; PlacementGroupSchedulingStrategy: placement group based scheduling; NodeAffinitySchedulingStrategy: node id based affinity scheduling. _metadata – Extended options for Ray libraries. For example, _metadata={“workflows.io/options”: } for Ray workflows. Examples: @ray.remote(num_cpus=2, resources={"CustomResource": 1}) class Foo: def method(self): return 1 # Class Bar will require 1 cpu instead of 2. # It will also require no custom resources. Bar = Foo.options(num_cpus=1, resources=None)ray.method ray.method(*args, **kwargs)[source] Annotate an actor method. @ray.remote class Foo: @ray.method(num_returns=2) def bar(self): return 1, 2 f = Foo.remote() _, _ = f.bar.remote() Parameters num_returns – The number of object refs that should be returned by invocations of this actor method. PublicAPI: This API is stable across Ray releases.ray.get_actor ray.get_actor(name: str, namespace: Optional[str] = None) -> ray.actor.ActorHandle[source] Get a handle to a named actor. Gets a handle to an actor with the given name. The actor must have been created with Actor.options(name=”name”).remote(). This works for both detached & non-detached actors. This method is a sync call and it’ll timeout after 60s. This can be modified by setting OS env RAY_gcs_server_request_timeout_seconds before starting the cluster. Parameters name – The name of the actor. namespace – The namespace of the actor, or None to specify the current namespace. Returns ActorHandle to the actor. 
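Putting the actor APIs in this section together (options with a name, get_actor, and kill), a typical named-actor workflow looks roughly like the following sketch. The Counter class and the actor name are illustrative.

import ray

ray.init(namespace="example")

@ray.remote
class Counter:
    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> int:
        self.value += 1
        return self.value

# Create a named, detached actor so it can be looked up later by name.
counter = Counter.options(name="global_counter", lifetime="detached").remote()
print(ray.get(counter.increment.remote()))  # 1

# Elsewhere in the job (or from another driver in the same namespace),
# retrieve the same actor by name.
same_counter = ray.get_actor("global_counter")
print(ray.get(same_counter.increment.remote()))  # 2

# Forcefully terminate the actor when it is no longer needed.
ray.kill(same_counter)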
Raises ValueError if the named actor does not exist. – PublicAPI: This API is stable across Ray releases.ray.kill ray.kill(actor: ray.actor.ActorHandle, *, no_restart: bool = True)[source] Kill an actor forcefully. This will interrupt any running tasks on the actor, causing them to fail immediately. atexit handlers installed in the actor will not be run. If you want to kill the actor but let pending tasks finish, you can call actor.__ray_terminate__.remote() instead to queue a termination task. Any atexit handlers installed in the actor will be run in this case. If the actor is a detached actor, subsequent calls to get its handle via ray.get_actor will fail. Parameters actor – Handle to the actor to kill. no_restart – Whether or not this actor should be restarted if it’s a restartable actor. PublicAPI: This API is stable across Ray releases. Objects ray.get() Get a remote object or a list of remote objects from the object store. ray.wait(object_refs, *[, num_returns, ...]) Return a list of IDs that are ready and a list of IDs that are not. ray.put(value, *[, _owner]) Store an object in the object store. ray.get ray.get(object_refs: Sequence[ObjectRef[Any]], *, timeout: Optional[float] = 'None') -> List[Any][source] ray.get(object_refs: Sequence[ObjectRef[R]], *, timeout: Optional[float] = 'None') -> List[R] ray.get(object_refs: ObjectRef[R], *, timeout: Optional[float] = 'None') -> R Get a remote object or a list of remote objects from the object store. This method blocks until the object corresponding to the object ref is available in the local object store. If this object is not in the local object store, it will be shipped from an object store that has it (once the object has been created). If object_refs is a list, then the objects corresponding to each object in the list will be returned. Ordering for an input list of object refs is preserved for each object returned. That is, if an object ref to A precedes an object ref to B in the input list, then A will precede B in the returned list. This method will issue a warning if it’s running inside async context, you can use await object_ref instead of ray.get(object_ref). For a list of object refs, you can use await asyncio.gather(*object_refs). Related patterns and anti-patterns: Anti-pattern: Calling ray.get in a loop harms parallelism Anti-pattern: Calling ray.get unnecessarily harms performance Anti-pattern: Processing results in submission order using ray.get increases runtime Anti-pattern: Fetching too many objects at once with ray.get causes failure Parameters object_refs – Object ref of the object to get or a list of object refs to get. timeout (Optional[float]) – The maximum amount of time in seconds to wait before returning. Set this to None will block until the corresponding object becomes available. Setting timeout=0 will return the object immediately if it’s available, else raise GetTimeoutError in accordance with the above docstring. Returns A Python object or a list of Python objects. Raises GetTimeoutError – A GetTimeoutError is raised if a timeout is set and the get takes longer than timeout to return. Exception – An exception is raised if the task that created the object or that created one of the objects raised an exception. 
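The Objects APIs listed above are usually combined: the following sketch puts a value into the object store with ray.put, fans the reference out to several tasks, and collects results with ray.wait and ray.get. The toy task is illustrative.

import random
import time
import ray

ray.init()

@ray.remote
def jittered_double(x: int) -> int:
    # Ray resolves ObjectRef arguments to their values before the task runs.
    time.sleep(random.random())
    return x * 2

# Put a (potentially large) object into the object store once...
value_ref = ray.put(21)

# ...and share the reference with many tasks instead of copying the value.
result_refs = [jittered_double.remote(value_ref) for _ in range(4)]

# Process results as they become ready instead of blocking on all of them.
pending = result_refs
while pending:
    ready, pending = ray.wait(pending, num_returns=1)
    print(ray.get(ready[0]))  # prints 42 four times, in completion order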
PublicAPI: This API is stable across Ray releases.ray.wait ray.wait(object_refs: List[ray.ObjectRef], *, num_returns: int = 1, timeout: Optional[float] = None, fetch_local: bool = True) -> Tuple[List[ray.ObjectRef], List[ray.ObjectRef]][source] Return a list of IDs that are ready and a list of IDs that are not. If timeout is set, the function returns either when the requested number of IDs are ready or when the timeout is reached, whichever occurs first. If it is not set, the function simply waits until that number of objects is ready and returns that exact number of object refs. This method returns two lists. The first list consists of object refs that correspond to objects that are available in the object store. The second list corresponds to the rest of the object refs (which may or may not be ready). Ordering of the input list of object refs is preserved. That is, if A precedes B in the input list, and both are in the ready list, then A will precede B in the ready list. This also holds true if A and B are both in the remaining list. This method will issue a warning if it’s running inside an async context. Instead of ray.wait(object_refs), you can use await asyncio.wait(object_refs). Related patterns and anti-patterns: Pattern: Using ray.wait to limit the number of pending tasks Anti-pattern: Processing results in submission order using ray.get increases runtime Parameters object_refs – List of ObjectRefs or StreamingObjectRefGenerators for objects that may or may not be ready. Note that these must be unique. num_returns – The number of object refs that should be returned. timeout – The maximum amount of time in seconds to wait before returning. fetch_local – If True, wait for the object to be downloaded onto the local node before returning it as ready. If False, ray.wait() will not trigger fetching of objects to the local node and will return immediately once the object is available anywhere in the cluster. Returns A list of object refs that are ready and a list of the remaining object IDs. PublicAPI: This API is stable across Ray releases.ray.put ray.put(value: Any, *, _owner: Optional[ray.actor.ActorHandle] = None) -> ray.ObjectRef[source] Store an object in the object store. The object may not be evicted while a reference to the returned ID exists. Related patterns and anti-patterns: Anti-pattern: Returning ray.put() ObjectRefs from a task harms performance and fault tolerance Anti-pattern: Passing the same large argument by value repeatedly harms performance Anti-pattern: Closure capturing large objects harms performance Parameters value – The Python object to be stored. [Experimental] (_owner) – The actor that should own this object. This allows creating objects with lifetimes decoupled from that of the creating process. The owner actor must be passed a reference to the object prior to the object creator exiting, otherwise the reference will still be lost. Note that this argument is an experimental API and should be avoided if possible. Returns The object ref assigned to this value. PublicAPI: This API is stable across Ray releases. Runtime Context ray.runtime_context.get_runtime_context() Get the runtime context of the current driver/worker. ray.runtime_context.RuntimeContext(worker) A class used for getting runtime context. ray.get_gpu_ids() Get the IDs of the GPUs that are available to the worker. ray.runtime_context.get_runtime_context ray.runtime_context.get_runtime_context() -> ray.runtime_context.RuntimeContext[source] Get the runtime context of the current driver/worker. 
The obtained runtime context can be used to get the metadata of the current task and actor. Example import ray # Get the job id. ray.get_runtime_context().get_job_id() # Get the actor id. ray.get_runtime_context().get_actor_id() # Get the task id. ray.get_runtime_context().get_task_id() PublicAPI: This API is stable across Ray releases.ray.runtime_context.RuntimeContext class ray.runtime_context.RuntimeContext(worker)[source] Bases: object A class used for getting runtime context. PublicAPI: This API is stable across Ray releases. Methods get() Get a dictionary of the current context. get_actor_id() Get the current actor ID in this worker. get_assigned_resources() Get the assigned resources to this worker. get_job_id() Get current job ID for this worker or driver. get_node_id() Get current node ID for this worker or driver. get_placement_group_id() Get the current Placement group ID of this worker. get_runtime_env_string() Get the runtime env string used for the current driver or worker. get_task_id() Get current task ID for this worker or driver. get_worker_id() Get current worker ID for this worker or driver process. ray.runtime_context.RuntimeContext.get RuntimeContext.get() -> Dict[str, Any][source] Get a dictionary of the current context. Returns Dictionary of the current context. Return type dict DEPRECATED: This API is deprecated and may be removed in future Ray releases. Use get_xxx_id() methods to get relevant ids insteadray.runtime_context.RuntimeContext.get_actor_id RuntimeContext.get_actor_id() -> Optional[str][source] Get the current actor ID in this worker. ID of the actor of the current process. This shouldn’t be used in a driver process. The ID will be in hex format. Returns The current actor id in hex format in this worker. None if there’s no actor id.ray.runtime_context.RuntimeContext.get_assigned_resources RuntimeContext.get_assigned_resources()[source] Get the assigned resources to this worker. By default for tasks, this will return {“CPU”: 1}. By default for actors, this will return {}. This is because actors do not have CPUs assigned to them by default. Returns A dictionary mapping the name of a resource to a float, where the float represents the amount of that resource reserved for this worker.ray.runtime_context.RuntimeContext.get_job_id RuntimeContext.get_job_id() -> str[source] Get current job ID for this worker or driver. Job ID is the id of your Ray drivers that create tasks or actors. Returns If called by a driver, this returns the job ID. If called in a task, return the job ID of the associated driver. The job ID will be hex format. Raises AssertionError – If not called in a driver or worker. Generally, this means that ray.init() was not called.ray.runtime_context.RuntimeContext.get_node_id RuntimeContext.get_node_id() -> str[source] Get current node ID for this worker or driver. Node ID is the id of a node that your driver, task, or actor runs. The ID will be in hex format. Returns A node id in hex format for this worker or driver. Raises AssertionError – If not called in a driver or worker. Generally, this means that ray.init() was not called.ray.runtime_context.RuntimeContext.get_placement_group_id RuntimeContext.get_placement_group_id() -> Optional[str][source] Get the current Placement group ID of this worker. Returns The current placement group id in hex format of this worker.ray.runtime_context.RuntimeContext.get_runtime_env_string RuntimeContext.get_runtime_env_string()[source] Get the runtime env string used for the current driver or worker. 
Returns The runtime env string currently using by this worker.ray.runtime_context.RuntimeContext.get_task_id RuntimeContext.get_task_id() -> Optional[str][source] Get current task ID for this worker or driver. Task ID is the id of a Ray task. The ID will be in hex format. This shouldn’t be used in a driver process. Example import ray @ray.remote class Actor: def get_task_id(self): return ray.get_runtime_context().get_task_id() @ray.remote def get_task_id(): return ray.get_runtime_context().get_task_id() # All the below code generates different task ids. a = Actor.remote() # Task ids are available for actor tasks. print(ray.get(a.get_task_id.remote())) # Task ids are available for normal tasks. print(ray.get(get_task_id.remote())) 16310a0f0a45af5c2746a0e6efb235c0962896a201000000 c2668a65bda616c1ffffffffffffffffffffffff01000000 Returns The current worker’s task id in hex. None if there’s no task id.ray.runtime_context.RuntimeContext.get_worker_id RuntimeContext.get_worker_id() -> str[source] Get current worker ID for this worker or driver process. Returns A worker id in hex format for this worker or driver process. Attributes actor_id Get the current actor ID in this worker. current_actor Get the current actor handle of this actor itsself. current_placement_group_id Get the current Placement group ID of this worker. gcs_address Get the GCS address of the ray cluster. job_id Get current job ID for this worker or driver. namespace Get the current namespace of this worker. node_id Get current node ID for this worker or driver. runtime_env Get the runtime env used for the current driver or worker. should_capture_child_tasks_in_placement_group Get if the current task should capture parent's placement group. task_id Get current task ID for this worker or driver. was_current_actor_reconstructed Check whether this actor has been restarted. ray.runtime_context.RuntimeContext.actor_id property RuntimeContext.actor_id Get the current actor ID in this worker. ID of the actor of the current process. This shouldn’t be used in a driver process. Returns The current actor id in this worker. None if there’s no actor id. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Use get_actor_id() insteadray.runtime_context.RuntimeContext.current_actor property RuntimeContext.current_actor Get the current actor handle of this actor itsself. Returns The handle of current actor.ray.runtime_context.RuntimeContext.current_placement_group_id property RuntimeContext.current_placement_group_id Get the current Placement group ID of this worker. Returns The current placement group id of this worker. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Use get_placement_group_id() insteadray.runtime_context.RuntimeContext.gcs_address property RuntimeContext.gcs_address Get the GCS address of the ray cluster. :returns: The GCS address of the cluster.ray.runtime_context.RuntimeContext.job_id property RuntimeContext.job_id Get current job ID for this worker or driver. Job ID is the id of your Ray drivers that create tasks or actors. Returns If called by a driver, this returns the job ID. If called in a task, return the job ID of the associated driver. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Use get_job_id() insteadray.runtime_context.RuntimeContext.namespace property RuntimeContext.namespace Get the current namespace of this worker. 
Returns The current namespace of this worker.ray.runtime_context.RuntimeContext.node_id property RuntimeContext.node_id Get current node ID for this worker or driver. Node ID is the id of a node that your driver, task, or actor runs. Returns A node id for this worker or driver. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Use get_node_id() insteadray.runtime_context.RuntimeContext.runtime_env property RuntimeContext.runtime_env Get the runtime env used for the current driver or worker. Returns The runtime env currently using by this worker. The type of return value is ray.runtime_env.RuntimeEnv.ray.runtime_context.RuntimeContext.should_capture_child_tasks_in_placement_group property RuntimeContext.should_capture_child_tasks_in_placement_group Get if the current task should capture parent’s placement group. This returns True if it is called inside a driver. Returns Return True if the current task should implicitly capture the parent placement group.ray.runtime_context.RuntimeContext.task_id property RuntimeContext.task_id Get current task ID for this worker or driver. Task ID is the id of a Ray task. This shouldn’t be used in a driver process. Example import ray @ray.remote class Actor: def ready(self): return True @ray.remote def f(): return True # All the below code generates different task ids. # Task ids are available for actor creation. a = Actor.remote() # Task ids are available for actor tasks. a.ready.remote() # Task ids are available for normal tasks. f.remote() Returns The current worker’s task id. None if there’s no task id. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Use get_task_id() insteadray.runtime_context.RuntimeContext.was_current_actor_reconstructed property RuntimeContext.was_current_actor_reconstructed Check whether this actor has been restarted. Returns Whether this actor has been ever restarted.ray.get_gpu_ids ray.get_gpu_ids()[source] Get the IDs of the GPUs that are available to the worker. If the CUDA_VISIBLE_DEVICES environment variable was set when the worker started up, then the IDs returned by this method will be a subset of the IDs in CUDA_VISIBLE_DEVICES. If not, the IDs will fall in the range [0, NUM_GPUS - 1], where NUM_GPUS is the number of GPUs that the node has. Returns A list of GPU IDs. PublicAPI: This API is stable across Ray releases. Cross Language ray.cross_language.java_function(class_name, ...) Define a Java function. ray.cross_language.java_actor_class(class_name) Define a Java actor class. ray.cross_language.java_function ray.cross_language.java_function(class_name: str, function_name: str)[source] Define a Java function. Parameters class_name – Java class name. function_name – Java function name. PublicAPI (beta): This API is in beta and may change before becoming stable.ray.cross_language.java_actor_class ray.cross_language.java_actor_class(class_name: str)[source] Define a Java actor class. Parameters class_name – Java class name. PublicAPI (beta): This API is in beta and may change before becoming stable. Scheduling API Scheduling Strategy ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(...) Placement group based scheduling strategy. ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(...) Static scheduling strategy used to run a task or actor on a particular node. 
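Both strategies documented below are passed to a task or actor through the scheduling_strategy option. The following sketch shows the general shape; the task body, the resource shape of the placement group, and the soft=False choice are illustrative.

import ray
from ray.util.scheduling_strategies import (
    NodeAffinitySchedulingStrategy,
    PlacementGroupSchedulingStrategy,
)

ray.init()

@ray.remote(num_cpus=1)
def where_am_i() -> str:
    # Report the node this task was scheduled on.
    return ray.get_runtime_context().get_node_id()

# Pin a task to the driver's node; soft=False fails the task instead of
# letting it run elsewhere if that node becomes unavailable.
driver_node = ray.get_runtime_context().get_node_id()
pinned_ref = where_am_i.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=driver_node, soft=False)
).remote()
assert ray.get(pinned_ref) == driver_node

# Alternatively, schedule the task into a placement group's reserved resources.
pg = ray.util.placement_group([{"CPU": 1}])
ray.get(pg.ready())
grouped_ref = where_am_i.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
print(ray.get(grouped_ref))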
ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy class ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group: PlacementGroup, placement_group_bundle_index: int = - 1, placement_group_capture_child_tasks: Optional[bool] = None)[source] Bases: object Placement group based scheduling strategy. placement_group the placement group this actor belongs to, or None if it doesn’t belong to any group. placement_group_bundle_index the index of the bundle if the actor belongs to a placement group, which may be -1 to specify any available bundle. placement_group_capture_child_tasks Whether or not children tasks of this actor should implicitly use the same placement group as its parent. It is False by default. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy class ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(node_id: str, soft: bool, _spill_on_unavailable: bool = False, _fail_on_unavailable: bool = False)[source] Bases: object Static scheduling strategy used to run a task or actor on a particular node. node_id the hex id of the node where the task or actor should run. soft whether the scheduler should run the task or actor somewhere else if the target node doesn’t exist (e.g. the node dies) or is infeasible during scheduling. If the node exists and is feasible, the task or actor will only be scheduled there. This means if the node doesn’t have the available resources, the task or actor will wait indefinitely until resources become available. If the node doesn’t exist or is infeasible, the task or actor will fail if soft is False or be scheduled somewhere else if soft is True. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods Placement Group ray.util.placement_group(bundles[, ...]) Asynchronously creates a PlacementGroup. ray.util.placement_group.PlacementGroup(id) A handle to a placement group. ray.util.placement_group_table([placement_group]) Get the state of the placement group from GCS. ray.util.remove_placement_group(placement_group) Asynchronously remove placement group. ray.util.get_current_placement_group() Get the current placement group which a task or actor is using. ray.util.placement_group ray.util.placement_group(bundles: List[Dict[str, float]], strategy: str = 'PACK', name: str = '', lifetime: Optional[str] = None, _max_cpu_fraction_per_node: float = 1.0) -> ray.util.placement_group.PlacementGroup[source] Asynchronously creates a PlacementGroup. Parameters bundles – A list of bundles which represent the resources requirements. strategy – The strategy to create the placement group.”PACK”: Packs Bundles into as few nodes as possible. ”SPREAD”: Places Bundles across distinct nodes as even as possible. ”STRICT_PACK”: Packs Bundles into one node. The group is not allowed to span multiple nodes. ”STRICT_SPREAD”: Packs Bundles across distinct nodes. name – The name of the placement group. lifetime – Either None, which defaults to the placement group will fate share with its creator and will be deleted once its creator is dead, or “detached”, which means the placement group will live as a global object independent of the creator. _max_cpu_fraction_per_node – (Experimental) Disallow placing bundles on nodes if it would cause the fraction of CPUs used by bundles from any placement group on the node to exceed this fraction. This effectively sets aside CPUs that placement groups cannot occupy on nodes. 
when max_cpu_fraction_per_node < 1.0, at least 1 CPU will be excluded from placement group scheduling. Note: This feature is experimental and is not recommended for use with autoscaling clusters (scale-up will not trigger properly). Raises ValueError if bundle type is not a list. – ValueError if empty bundle or empty resource bundles are given. – ValueError if the wrong lifetime arguments are given. – Returns Placement group object. Return type PlacementGroup PublicAPI: This API is stable across Ray releases.ray.util.placement_group.PlacementGroup class ray.util.placement_group.PlacementGroup(id: ray._raylet.PlacementGroupID, bundle_cache: Optional[List[Dict]] = None)[source] Bases: object A handle to a placement group. PublicAPI: This API is stable across Ray releases. Methods ready() Returns an ObjectRef to check ready status. wait([timeout_seconds]) Wait for the placement group to be ready within the specified time. ray.util.placement_group.PlacementGroup.ready PlacementGroup.ready() -> ray._raylet.ObjectRef[source] Returns an ObjectRef to check ready status. This API runs a small dummy task to wait for placement group creation. It is compatible to ray.get and ray.wait. Example import ray pg = ray.util.placement_group([{"CPU": 1}]) ray.get(pg.ready()) pg = ray.util.placement_group([{"CPU": 1}]) ray.wait([pg.ready()])ray.util.placement_group.PlacementGroup.wait PlacementGroup.wait(timeout_seconds: Union[float, int] = 30) -> bool[source] Wait for the placement group to be ready within the specified time. :param timeout_seconds: Timeout in seconds. :type timeout_seconds: float|int Returns True if the placement group is created. False otherwise. Attributes bundle_count bundle_specs Return bundles belonging to this placement group. is_empty ray.util.placement_group.PlacementGroup.bundle_count property PlacementGroup.bundle_count: int ray.util.placement_group.PlacementGroup.bundle_specs property PlacementGroup.bundle_specs: List[Dict] Return bundles belonging to this placement group. Type List[Dict]ray.util.placement_group.PlacementGroup.is_empty property PlacementGroup.is_empty ray.util.placement_group_table ray.util.placement_group_table(placement_group: ray.util.placement_group.PlacementGroup = None) -> dict[source] Get the state of the placement group from GCS. Parameters placement_group – placement group to see states. DeveloperAPI: This API may change across minor Ray releases.ray.util.remove_placement_group ray.util.remove_placement_group(placement_group: ray.util.placement_group.PlacementGroup) -> None[source] Asynchronously remove placement group. Parameters placement_group – The placement group to delete. PublicAPI: This API is stable across Ray releases.ray.util.get_current_placement_group ray.util.get_current_placement_group() -> Optional[ray.util.placement_group.PlacementGroup][source] Get the current placement group which a task or actor is using. It returns None if there’s no current placement group for the worker. For example, if you call this method in your driver, it returns None (because drivers never belong to any placement group). Examples import ray from ray.util.placement_group import get_current_placement_group from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy @ray.remote def f(): # This returns the placement group the task f belongs to. # It means this pg is identical to the pg created below. 
return get_current_placement_group() pg = ray.util.placement_group([{"CPU": 2}]) assert ray.get(f.options( scheduling_strategy=PlacementGroupSchedulingStrategy( placement_group=pg)).remote()) == pg # Driver doesn't belong to any placement group, # so it returns None. assert get_current_placement_group() is None Returns Placement group object. None if the current task or actor wasn’t created with any placement group. Return type PlacementGroup PublicAPI: This API is stable across Ray releases. Runtime Env API ray.runtime_env.RuntimeEnvConfig([...]) Used to specify configuration options for a runtime environment. ray.runtime_env.RuntimeEnv(*[, py_modules, ...]) This class is used to define a runtime environment for a job, task, or actor. ray.runtime_env.RuntimeEnvConfig class ray.runtime_env.RuntimeEnvConfig(setup_timeout_seconds: int = 600, eager_install: bool = True)[source] Bases: dict Used to specify configuration options for a runtime environment. The config is not included when calculating the runtime_env hash, which means that two runtime_envs with the same options but different configs are considered the same for caching purposes. Parameters setup_timeout_seconds – The timeout of runtime environment creation, timeout is in seconds. The value -1 means disable timeout logic, except -1, setup_timeout_seconds cannot be less than or equal to 0. The default value of setup_timeout_seconds is 600 seconds. eager_install – Indicates whether to install the runtime environment on the cluster at ray.init() time, before the workers are leased. This flag is set to True by default. PublicAPI: This API is stable across Ray releases. Methods clear() copy() fromkeys([value]) Create a new dictionary with keys from iterable and values set to value. get(key[, default]) Return the value for key if key is in the dictionary, else default. items() keys() pop(k[,d]) If key is not found, d is returned if given, otherwise KeyError is raised popitem() 2-tuple; but raise KeyError if D is empty. setdefault(key[, default]) Insert key with a value of default if key is not in the dictionary. update([E, ]**F) If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k] values() ray.runtime_env.RuntimeEnvConfig.clear RuntimeEnvConfig.clear() -> None. Remove all items from D. ray.runtime_env.RuntimeEnvConfig.copy RuntimeEnvConfig.copy() -> a shallow copy of D ray.runtime_env.RuntimeEnvConfig.fromkeys RuntimeEnvConfig.fromkeys(value=None, /) Create a new dictionary with keys from iterable and values set to value.ray.runtime_env.RuntimeEnvConfig.get RuntimeEnvConfig.get(key, default=None, /) Return the value for key if key is in the dictionary, else default.ray.runtime_env.RuntimeEnvConfig.items RuntimeEnvConfig.items() -> a set-like object providing a view on D's items ray.runtime_env.RuntimeEnvConfig.keys RuntimeEnvConfig.keys() -> a set-like object providing a view on D's keys ray.runtime_env.RuntimeEnvConfig.pop RuntimeEnvConfig.pop(k, [d]) -> v, remove specified key and return the corresponding value. 
If key is not found, d is returned if given, otherwise KeyError is raisedray.runtime_env.RuntimeEnvConfig.popitem RuntimeEnvConfig.popitem() -> (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.ray.runtime_env.RuntimeEnvConfig.setdefault RuntimeEnvConfig.setdefault(key, default=None, /) Insert key with a value of default if key is not in the dictionary. Return the value for key if key is in the dictionary, else default.ray.runtime_env.RuntimeEnvConfig.update RuntimeEnvConfig.update([E], **F) -> None. Update D from dict/iterable E and F. If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]ray.runtime_env.RuntimeEnvConfig.values RuntimeEnvConfig.values() -> an object providing a view on D's values Attributes known_fields ray.runtime_env.RuntimeEnvConfig.known_fields RuntimeEnvConfig.known_fields: Set[str] = {'eager_install', 'setup_timeout_seconds'} ray.runtime_env.RuntimeEnv class ray.runtime_env.RuntimeEnv(*, py_modules: Optional[List[str]] = None, working_dir: Optional[str] = None, pip: Optional[List[str]] = None, conda: Optional[Union[Dict[str, str], str]] = None, container: Optional[Dict[str, str]] = None, env_vars: Optional[Dict[str, str]] = None, worker_process_setup_hook: Optional[Union[Callable, str]] = None, config: Optional[Union[Dict, ray.runtime_env.runtime_env.RuntimeEnvConfig]] = None, _validate: bool = True, **kwargs)[source] Bases: dict This class is used to define a runtime environment for a job, task, or actor. See Runtime environments for detailed documentation. This class can be used interchangeably with an unstructured dictionary in the relevant API calls. Can specify a runtime environment whole job, whether running a script directly on the cluster, using Ray Job submission, or using Ray Client: from ray.runtime_env import RuntimeEnv # Starting a single-node local Ray cluster ray.init(runtime_env=RuntimeEnv(...)) from ray.runtime_env import RuntimeEnv # Connecting to remote cluster using Ray Client ray.init("ray://123.456.7.89:10001", runtime_env=RuntimeEnv(...)) Can specify different runtime environments per-actor or per-task using .options() or the @ray.remote decorator: from ray.runtime_env import RuntimeEnv # Invoke a remote task that will run in a specified runtime environment. f.options(runtime_env=RuntimeEnv(...)).remote() # Instantiate an actor that will run in a specified runtime environment. actor = SomeClass.options(runtime_env=RuntimeEnv(...)).remote() # Specify a runtime environment in the task definition. Future invocations via # `g.remote()` will use this runtime environment unless overridden by using # `.options()` as above. @ray.remote(runtime_env=RuntimeEnv(...)) def g(): pass # Specify a runtime environment in the actor definition. Future instantiations # via `MyClass.remote()` will use this runtime environment unless overridden by # using `.options()` as above. 
@ray.remote(runtime_env=RuntimeEnv(...)) class MyClass: pass Here are some examples of RuntimeEnv initialization: # Example for using conda RuntimeEnv(conda={ "channels": ["defaults"], "dependencies": ["codecov"]}) RuntimeEnv(conda="pytorch_p36") # Found on DLAMIs # Example for using container RuntimeEnv( container={"image": "anyscale/ray-ml:nightly-py38-cpu", "worker_path": "/root/python/ray/_private/workers/default_worker.py", "run_options": ["--cap-drop SYS_ADMIN","--log-level=debug"]}) # Example for set env_vars RuntimeEnv(env_vars={"OMP_NUM_THREADS": "32", "TF_WARNINGS": "none"}) # Example for set pip RuntimeEnv( pip={"packages":["tensorflow", "requests"], "pip_check": False, "pip_version": "==22.0.2;python_version=='3.8.11'"}) Parameters py_modules – List of URIs (either in the GCS or external storage), each of which is a zip file that will be unpacked and inserted into the PYTHONPATH of the workers. working_dir – URI (either in the GCS or external storage) of a zip file that will be unpacked in the directory of each task/actor. pip – Either a list of pip packages, a string containing the path to a pip requirements.txt file, or a python dictionary that has three fields: 1) packages (required, List[str]): a list of pip packages, 2) pip_check (optional, bool): whether enable pip check at the end of pip install, defaults to False. 3) pip_version (optional, str): the version of pip, Ray will spell the package name “pip” in front of the pip_version to form the final requirement string, the syntax of a requirement specifier is defined in full in PEP 508. conda – Either the conda YAML config, the name of a local conda env (e.g., “pytorch_p36”), or the path to a conda environment.yaml file. The Ray dependency will be automatically injected into the conda env to ensure compatibility with the cluster Ray. The conda name may be mangled automatically to avoid conflicts between runtime envs. This field cannot be specified at the same time as the ‘pip’ field. To use pip with conda, please specify your pip dependencies within the conda YAML config: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually container – Require a given (Docker) container image, The Ray worker process will run in a container with this image. The worker_path is the default_worker.py path. The run_options list spec is here: https://docs.docker.com/engine/reference/run/ env_vars – Environment variables to set. worker_process_setup_hook – (Experimental) The setup hook that’s called after workers start and before Tasks and Actors are scheduled. The value has to be a callable when passed to the Job, Task, or Actor. The callable is then exported and this value is converted to the setup hook’s function name for observability. config – config for runtime environment. Either a dict or a RuntimeEnvConfig. Field: (1) setup_timeout_seconds, the timeout of runtime environment creation, timeout is in seconds. PublicAPI: This API is stable across Ray releases. Methods clear() copy() fromkeys([value]) Create a new dictionary with keys from iterable and values set to value. items() keys() plugin_uris() Not implemented yet, always return a empty list pop(k[,d]) If key is not found, d is returned if given, otherwise KeyError is raised popitem() 2-tuple; but raise KeyError if D is empty. setdefault(key[, default]) Insert key with a value of default if key is not in the dictionary. 
update([E, ]**F) If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k] values() ray.runtime_env.RuntimeEnv.clear RuntimeEnv.clear() -> None. Remove all items from D. ray.runtime_env.RuntimeEnv.copy RuntimeEnv.copy() -> a shallow copy of D ray.runtime_env.RuntimeEnv.fromkeys RuntimeEnv.fromkeys(value=None, /) Create a new dictionary with keys from iterable and values set to value.ray.runtime_env.RuntimeEnv.items RuntimeEnv.items() -> a set-like object providing a view on D's items ray.runtime_env.RuntimeEnv.keys RuntimeEnv.keys() -> a set-like object providing a view on D's keys ray.runtime_env.RuntimeEnv.plugin_uris RuntimeEnv.plugin_uris() -> List[str][source] Not implemented yet, always return a empty listray.runtime_env.RuntimeEnv.pop RuntimeEnv.pop(k, [d]) -> v, remove specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raisedray.runtime_env.RuntimeEnv.popitem RuntimeEnv.popitem() -> (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.ray.runtime_env.RuntimeEnv.setdefault RuntimeEnv.setdefault(key, default=None, /) Insert key with a value of default if key is not in the dictionary. Return the value for key if key is in the dictionary, else default.ray.runtime_env.RuntimeEnv.update RuntimeEnv.update([E], **F) -> None. Update D from dict/iterable E and F. If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]ray.runtime_env.RuntimeEnv.values RuntimeEnv.values() -> an object providing a view on D's values Attributes extensions_fields known_fields ray.runtime_env.RuntimeEnv.extensions_fields RuntimeEnv.extensions_fields: Set[str] = {'_inject_current_ray', '_ray_commit', '_ray_release'} ray.runtime_env.RuntimeEnv.known_fields RuntimeEnv.known_fields: Set[str] = {'_inject_current_ray', '_ray_commit', '_ray_release', 'conda', 'config', 'container', 'docker', 'env_vars', 'excludes', 'java_jars', 'pip', 'py_modules', 'worker_process_setup_hook', 'working_dir'} Utility ray.util.ActorPool(actors) Utility class to operate on a fixed pool of actors. ray.util.queue.Queue([maxsize, actor_options]) A first-in, first-out queue implementation on Ray. ray.nodes() Get a list of the nodes in the cluster (for debugging only). ray.cluster_resources() Get the current total cluster resources. ray.available_resources() Get the current available cluster resources. ray.util.ActorPool class ray.util.ActorPool(actors: list)[source] Bases: object Utility class to operate on a fixed pool of actors. Parameters actors – List of Ray actor handles to use in this pool. Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1, a2 = Actor.remote(), Actor.remote() pool = ActorPool([a1, a2]) print(list(pool.map(lambda a, v: a.double.remote(v), [1, 2, 3, 4]))) [2, 4, 6, 8] DeveloperAPI: This API may change across minor Ray releases. Methods get_next([timeout, ignore_if_timedout]) Returns the next pending result in order. get_next_unordered([timeout, ignore_if_timedout]) Returns any of the next pending results. has_free() Returns whether there are any idle actors available. 
has_next() Returns whether there are any pending results to return. map(fn, values) Apply the given function in parallel over the actors and values. map_unordered(fn, values) Similar to map(), but returning an unordered iterator. pop_idle() Removes an idle actor from the pool. push(actor) Pushes a new actor into the current list of idle actors. submit(fn, value) Schedule a single task to run in the pool. ray.util.ActorPool.get_next ActorPool.get_next(timeout=None, ignore_if_timedout=False)[source] Returns the next pending result in order. This returns the next result produced by submit(), blocking for up to the specified timeout until it is available. Returns The next result. Raises TimeoutError if the timeout is reached. – Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1, a2 = Actor.remote(), Actor.remote() pool = ActorPool([a1, a2]) pool.submit(lambda a, v: a.double.remote(v), 1) print(pool.get_next()) 2ray.util.ActorPool.get_next_unordered ActorPool.get_next_unordered(timeout=None, ignore_if_timedout=False)[source] Returns any of the next pending results. This returns some result produced by submit(), blocking for up to the specified timeout until it is available. Unlike get_next(), the results are not always returned in same order as submitted, which can improve performance. Returns The next result. Raises TimeoutError if the timeout is reached. – Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1, a2 = Actor.remote(), Actor.remote() pool = ActorPool([a1, a2]) pool.submit(lambda a, v: a.double.remote(v), 1) pool.submit(lambda a, v: a.double.remote(v), 2) print(pool.get_next_unordered()) print(pool.get_next_unordered()) 4 2ray.util.ActorPool.has_free ActorPool.has_free()[source] Returns whether there are any idle actors available. Returns True if there are any idle actors and no pending submits. Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1 = Actor.remote() pool = ActorPool([a1]) pool.submit(lambda a, v: a.double.remote(v), 1) print(pool.has_free()) print(pool.get_next()) print(pool.has_free()) False 2 Trueray.util.ActorPool.has_next ActorPool.has_next()[source] Returns whether there are any pending results to return. Returns True if there are any pending results not yet returned. Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1, a2 = Actor.remote(), Actor.remote() pool = ActorPool([a1, a2]) pool.submit(lambda a, v: a.double.remote(v), 1) print(pool.has_next()) print(pool.get_next()) print(pool.has_next()) True 2 Falseray.util.ActorPool.map ActorPool.map(fn: Callable[[Any], Any], values: List[Any])[source] Apply the given function in parallel over the actors and values. This returns an ordered iterator that will return results of the map as they finish. Note that you must iterate over the iterator to force the computation to finish. Parameters fn – Function that takes (actor, value) as argument and returns an ObjectRef computing the result over the value. The actor will be considered busy until the ObjectRef completes. values – List of values that fn(actor, value) should be applied to. Returns Iterator over results from applying fn to the actors and values. 
Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1, a2 = Actor.remote(), Actor.remote() pool = ActorPool([a1, a2]) print(list(pool.map(lambda a, v: a.double.remote(v), [1, 2, 3, 4]))) [2, 4, 6, 8]ray.util.ActorPool.map_unordered ActorPool.map_unordered(fn: Callable[[Any], Any], values: List[Any])[source] Similar to map(), but returning an unordered iterator. This returns an unordered iterator that will return results of the map as they finish. This can be more efficient that map() if some results take longer to compute than others. Parameters fn – Function that takes (actor, value) as argument and returns an ObjectRef computing the result over the value. The actor will be considered busy until the ObjectRef completes. values – List of values that fn(actor, value) should be applied to. Returns Iterator over results from applying fn to the actors and values. Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1, a2 = Actor.remote(), Actor.remote() pool = ActorPool([a1, a2]) print(list(pool.map_unordered(lambda a, v: a.double.remote(v), [1, 2, 3, 4]))) [6, 8, 4, 2]ray.util.ActorPool.pop_idle ActorPool.pop_idle()[source] Removes an idle actor from the pool. Returns An idle actor if one is available. None if no actor was free to be removed. Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1 = Actor.remote() pool = ActorPool([a1]) pool.submit(lambda a, v: a.double.remote(v), 1) assert pool.pop_idle() is None assert pool.get_next() == 2 assert pool.pop_idle() == a1ray.util.ActorPool.push ActorPool.push(actor)[source] Pushes a new actor into the current list of idle actors. Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1, a2 = Actor.remote(), Actor.remote() pool = ActorPool([a1]) pool.push(a2)ray.util.ActorPool.submit ActorPool.submit(fn, value)[source] Schedule a single task to run in the pool. This has the same argument semantics as map(), but takes on a single value instead of a list of values. The result can be retrieved using get_next() / get_next_unordered(). Parameters fn – Function that takes (actor, value) as argument and returns an ObjectRef computing the result over the value. The actor will be considered busy until the ObjectRef completes. value – Value to compute a result for. Examples import ray from ray.util.actor_pool import ActorPool @ray.remote class Actor: def double(self, v): return 2 * v a1, a2 = Actor.remote(), Actor.remote() pool = ActorPool([a1, a2]) pool.submit(lambda a, v: a.double.remote(v), 1) pool.submit(lambda a, v: a.double.remote(v), 2) print(pool.get_next(), pool.get_next()) 2 4ray.util.queue.Queue class ray.util.queue.Queue(maxsize: int = 0, actor_options: Optional[Dict] = None)[source] Bases: object A first-in, first-out queue implementation on Ray. The behavior and use cases are similar to those of the asyncio.Queue class. Features both sync and async put and get methods. Provides the option to block until space is available when calling put on a full queue, or to block until items are available when calling get on an empty queue. Optionally supports batched put and get operations to minimize serialization overhead. Parameters maxsize (optional, int) – maximum size of the queue. If zero, size is unbounded. 
actor_options (optional, Dict) – Dictionary of options to pass into the QueueActor during creation. These are directly passed into QueueActor.options(…). This could be useful if you need to pass in custom resource requirements, for example. Examples >>> from ray.util.queue import Queue >>> q = Queue() >>> items = list(range(10)) >>> for item in items: ... q.put(item) >>> for item in items: ... assert item == q.get() >>> # Create Queue with the underlying actor reserving 1 CPU. >>> q = Queue(actor_options={"num_cpus": 1}) PublicAPI (beta): This API is in beta and may change before becoming stable. Methods empty() Whether the queue is empty. full() Whether the queue is full. get([block, timeout]) Gets an item from the queue. get_async([block, timeout]) Gets an item from the queue. get_nowait() Equivalent to get(block=False). get_nowait_batch(num_items) Gets items from the queue and returns them in a list in order. put(item[, block, timeout]) Adds an item to the queue. put_async(item[, block, timeout]) Adds an item to the queue. put_nowait(item) Equivalent to put(item, block=False). put_nowait_batch(items) Takes in a list of items and puts them into the queue in order. qsize() The size of the queue. shutdown([force, grace_period_s]) Terminates the underlying QueueActor. size() The size of the queue. ray.util.queue.Queue.empty Queue.empty() -> bool[source] Whether the queue is empty.ray.util.queue.Queue.full Queue.full() -> bool[source] Whether the queue is full.ray.util.queue.Queue.get Queue.get(block: bool = True, timeout: Optional[float] = None) -> Any[source] Gets an item from the queue. If block is True and the queue is empty, blocks until the queue is no longer empty or until timeout. There is no guarantee of order if multiple consumers get from the same empty queue. Returns The next item in the queue. Raises Empty – if the queue is empty and blocking is False. Empty – if the queue is empty, blocking is True, and it timed out. ValueError – if timeout is negative.ray.util.queue.Queue.get_async async Queue.get_async(block: bool = True, timeout: Optional[float] = None) -> Any[source] Gets an item from the queue. There is no guarantee of order if multiple consumers get from the same empty queue. Returns The next item in the queue. Raises Empty – if the queue is empty and blocking is False. Empty – if the queue is empty, blocking is True, and it timed out. ValueError – if timeout is negative.ray.util.queue.Queue.get_nowait Queue.get_nowait() -> Any[source] Equivalent to get(block=False). Raises Empty – if the queue is empty.ray.util.queue.Queue.get_nowait_batch Queue.get_nowait_batch(num_items: int) -> List[Any][source] Gets items from the queue and returns them in a list in order. Raises Empty – if the queue does not contain the desired number of itemsray.util.queue.Queue.put Queue.put(item: Any, block: bool = True, timeout: Optional[float] = None) -> None[source] Adds an item to the queue. If block is True and the queue is full, blocks until the queue is no longer full or until timeout. There is no guarantee of order if multiple producers put to the same full queue. Raises Full – if the queue is full and blocking is False. Full – if the queue is full, blocking is True, and it timed out. ValueError – if timeout is negative.ray.util.queue.Queue.put_async async Queue.put_async(item: Any, block: bool = True, timeout: Optional[float] = None) -> None[source] Adds an item to the queue. If block is True and the queue is full, blocks until the queue is no longer full or until timeout. 
There is no guarantee of order if multiple producers put to the same full queue. Raises Full – if the queue is full and blocking is False. Full – if the queue is full, blocking is True, and it timed out. ValueError – if timeout is negative.ray.util.queue.Queue.put_nowait Queue.put_nowait(item: Any) -> None[source] Equivalent to put(item, block=False). Raises Full – if the queue is full.ray.util.queue.Queue.put_nowait_batch Queue.put_nowait_batch(items: collections.abc.Iterable) -> None[source] Takes in a list of items and puts them into the queue in order. Raises Full – if the items will not fit in the queueray.util.queue.Queue.qsize Queue.qsize() -> int[source] The size of the queue.ray.util.queue.Queue.shutdown Queue.shutdown(force: bool = False, grace_period_s: int = 5) -> None[source] Terminates the underlying QueueActor. All of the resources reserved by the queue will be released. Parameters force – If True, forcefully kill the actor, causing an immediate failure. If False, graceful actor termination will be attempted first, before falling back to a forceful kill. grace_period_s – If force is False, how long in seconds to wait for graceful termination before falling back to forceful kill.ray.util.queue.Queue.size Queue.size() -> int[source] The size of the queue.ray.nodes ray.nodes()[source] Get a list of the nodes in the cluster (for debugging only). Returns Information about the Ray clients in the cluster. DeveloperAPI: This API may change across minor Ray releases.ray.cluster_resources ray.cluster_resources()[source] Get the current total cluster resources. Note that this information can grow stale as nodes are added to or removed from the cluster. Returns A dictionary mapping resource name to the total quantity of that resource in the cluster. DeveloperAPI: This API may change across minor Ray releases.ray.available_resources ray.available_resources()[source] Get the current available cluster resources. This is different from cluster_resources in that this will return idle (available) resources rather than total resources. Note that this information can grow stale as tasks start and finish. Returns A dictionary mapping resource name to the total quantity of that resource in the cluster. DeveloperAPI: This API may change across minor Ray releases. Custom Metrics ray.util.metrics.Counter(name[, ...]) A cumulative metric that is monotonically increasing. ray.util.metrics.Gauge(name[, description, ...]) Gauges keep the last recorded value and drop everything before. ray.util.metrics.Histogram(name[, ...]) Tracks the size and number of events in buckets. ray.util.metrics.Counter class ray.util.metrics.Counter(name: str, description: str = '', tag_keys: Optional[Tuple[str]] = None)[source] Bases: ray.util.metrics.Metric A cumulative metric that is monotonically increasing. This corresponds to Prometheus’ counter metric: https://prometheus.io/docs/concepts/metric_types/#counter Parameters name – Name of the metric. description – Description of the metric. tag_keys – Tag keys of the metric. DeveloperAPI: This API may change across minor Ray releases. Methods inc([value, tags]) Increment the counter by value (defaults to 1). record(value[, tags, _internal]) Record the metric point of the metric. set_default_tags(default_tags) Set default tags of metrics. ray.util.metrics.Counter.inc Counter.inc(value: Union[int, float] = 1.0, tags: Optional[Dict[str, str]] = None)[source] Increment the counter by value (defaults to 1). Tags passed in will take precedence over the metric’s default tags. 
Parameters value (int, float) – Value to increment the counter by (default=1). tags (Dict[str, str]) – Tags to set or override for this counter.ray.util.metrics.Counter.record Counter.record(value: Union[int, float], tags: Optional[Dict[str, str]] = None, _internal=False) -> None Record the metric point of the metric. Tags passed in will take precedence over the metric’s default tags. Parameters value – The value to be recorded as a metric point.ray.util.metrics.Counter.set_default_tags Counter.set_default_tags(default_tags: Dict[str, str]) Set default tags of metrics. Example >>> from ray.util.metrics import Counter >>> # Note that set_default_tags returns the instance itself. >>> counter = Counter("name", tag_keys=("a",)) >>> counter2 = counter.set_default_tags({"a": "b"}) >>> assert counter is counter2 >>> # this means you can instantiate it in this way. >>> counter = Counter("name", tag_keys=("a",)).set_default_tags({"a": "b"}) Parameters default_tags – Default tags that are used for every record method. Returns it returns the instance itself. Return type Metric Attributes info Return the information of this metric. ray.util.metrics.Counter.info property Counter.info: Dict[str, Any] Return the information of this metric. Example >>> from ray.util.metrics import Counter >>> counter = Counter("name", description="desc") >>> print(counter.info) {'name': 'name', 'description': 'desc', 'tag_keys': (), 'default_tags': {}}ray.util.metrics.Gauge class ray.util.metrics.Gauge(name: str, description: str = '', tag_keys: Optional[Tuple[str]] = None)[source] Bases: ray.util.metrics.Metric Gauges keep the last recorded value and drop everything before. Unlike counters, gauges can go up or down over time. This corresponds to Prometheus’ gauge metric: https://prometheus.io/docs/concepts/metric_types/#gauge Parameters name – Name of the metric. description – Description of the metric. tag_keys – Tag keys of the metric. DeveloperAPI: This API may change across minor Ray releases. Methods record(value[, tags, _internal]) Record the metric point of the metric. set(value[, tags]) Set the gauge to the given value. set_default_tags(default_tags) Set default tags of metrics. ray.util.metrics.Gauge.record Gauge.record(value: Union[int, float], tags: Optional[Dict[str, str]] = None, _internal=False) -> None Record the metric point of the metric. Tags passed in will take precedence over the metric’s default tags. Parameters value – The value to be recorded as a metric point.ray.util.metrics.Gauge.set Gauge.set(value: Union[int, float], tags: Optional[Dict[str, str]] = None)[source] Set the gauge to the given value. Tags passed in will take precedence over the metric’s default tags. Parameters value (int, float) – Value to set the gauge to. tags (Dict[str, str]) – Tags to set or override for this gauge.ray.util.metrics.Gauge.set_default_tags Gauge.set_default_tags(default_tags: Dict[str, str]) Set default tags of metrics. Example >>> from ray.util.metrics import Counter >>> # Note that set_default_tags returns the instance itself. >>> counter = Counter("name", tag_keys=("a",)) >>> counter2 = counter.set_default_tags({"a": "b"}) >>> assert counter is counter2 >>> # this means you can instantiate it in this way. >>> counter = Counter("name", tag_keys=("a",)).set_default_tags({"a": "b"}) Parameters default_tags – Default tags that are used for every record method. Returns it returns the instance itself. Return type Metric Attributes info Return the information of this metric. 
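As a rough usage sketch (the metric names, descriptions, and tag values below are made up for illustration), a Counter and a Gauge are created once and then recorded from driver, task, or actor code:

import ray
from ray.util.metrics import Counter, Gauge

ray.init()

# A cumulative counter, tagged by the route that handled the request.
request_counter = Counter(
    "num_requests",
    description="Total requests handled.",
    tag_keys=("route",),
)

# A gauge that keeps only the last recorded value and can go up or down.
queue_length = Gauge("queue_length", description="Current work queue length.")

request_counter.inc(tags={"route": "/predict"})
queue_length.set(42)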
ray.util.metrics.Gauge.info property Gauge.info: Dict[str, Any] Return the information of this metric. Example >>> from ray.util.metrics import Counter >>> counter = Counter("name", description="desc") >>> print(counter.info) {'name': 'name', 'description': 'desc', 'tag_keys': (), 'default_tags': {}}ray.util.metrics.Histogram class ray.util.metrics.Histogram(name: str, description: str = '', boundaries: Optional[List[float]] = None, tag_keys: Optional[Tuple[str]] = None)[source] Bases: ray.util.metrics.Metric Tracks the size and number of events in buckets. Histograms allow you to calculate aggregate quantiles such as 25, 50, 95, 99 percentile latency for an RPC. This corresponds to Prometheus’ histogram metric: https://prometheus.io/docs/concepts/metric_types/#histogram Parameters name – Name of the metric. description – Description of the metric. boundaries – Boundaries of histogram buckets. tag_keys – Tag keys of the metric. DeveloperAPI: This API may change across minor Ray releases. Methods observe(value[, tags]) Observe a given value and add it to the appropriate bucket. record(value[, tags, _internal]) Record the metric point of the metric. set_default_tags(default_tags) Set default tags of metrics. ray.util.metrics.Histogram.observe Histogram.observe(value: Union[int, float], tags: Optional[Dict[str, str]] = None)[source] Observe a given value and add it to the appropriate bucket. Tags passed in will take precedence over the metric’s default tags. Parameters value (int, float) – Value to set the gauge to. tags (Dict[str, str]) – Tags to set or override for this gauge.ray.util.metrics.Histogram.record Histogram.record(value: Union[int, float], tags: Optional[Dict[str, str]] = None, _internal=False) -> None Record the metric point of the metric. Tags passed in will take precedence over the metric’s default tags. Parameters value – The value to be recorded as a metric point.ray.util.metrics.Histogram.set_default_tags Histogram.set_default_tags(default_tags: Dict[str, str]) Set default tags of metrics. Example >>> from ray.util.metrics import Counter >>> # Note that set_default_tags returns the instance itself. >>> counter = Counter("name", tag_keys=("a",)) >>> counter2 = counter.set_default_tags({"a": "b"}) >>> assert counter is counter2 >>> # this means you can instantiate it in this way. >>> counter = Counter("name", tag_keys=("a",)).set_default_tags({"a": "b"}) Parameters default_tags – Default tags that are used for every record method. Returns it returns the instance itself. Return type Metric Attributes info Return information about histogram metric. ray.util.metrics.Histogram.info property Histogram.info Return information about histogram metric. Debugging ray.util.pdb.set_trace([breakpoint_uuid]) Interrupt the flow of the program and drop into the Ray debugger. ray.util.inspect_serializability(base_obj[, ...]) Identifies what objects are preventing serialization. ray.timeline([filename]) Return a list of profiling events that can viewed as a timeline. ray.util.pdb.set_trace pdb.set_trace(*, header=None)[source] ray.util.inspect_serializability ray.util.inspect_serializability(base_obj: Any, name: Optional[str] = None, depth: int = 3, print_file: Optional[Any] = None) -> Tuple[bool, Set[ray.util.check_serialize.FailureTuple]][source] Identifies what objects are preventing serialization. Parameters base_obj – Object to be serialized. name – Optional name of string. depth – Depth of the scope stack to walk through. Defaults to 3. 
print_file – file argument that will be passed to print(). Returns True if serializable. set[FailureTuple]: Set of unserializable objects. Return type bool New in version 1.1.0. DeveloperAPI: This API may change across minor Ray releases.ray.timeline ray.timeline(filename=None)[source] Return a list of profiling events that can viewed as a timeline. Ray profiling must be enabled by setting the RAY_PROFILING=1 environment variable prior to starting Ray, and RAY_task_events_report_interval_ms set to be positive (default 1000) To view this information as a timeline, simply dump it as a json file by passing in “filename” or using using json.dump, and then load go to chrome://tracing in the Chrome web browser and load the dumped file. Parameters filename – If a filename is provided, the timeline is dumped to that file. Returns If filename is not provided, this returns a list of profiling events. Each profile event is a dictionary. DeveloperAPI: This API may change across minor Ray releases. Exceptions ray.exceptions.RayError Super class of all ray exception types. ray.exceptions.RayTaskError(function_name, ...) Indicates that a task threw an exception during execution. ray.exceptions.RayActorError(cause, ]] = None) Indicates that the actor died unexpectedly before finishing a task. ray.exceptions.TaskCancelledError(task_id) Raised when this task is cancelled. ray.exceptions.TaskUnschedulableError(...) Raised when the task cannot be scheduled. ray.exceptions.ActorUnschedulableError(...) Raised when the actor cannot be scheduled. ray.exceptions.AsyncioActorExit Raised when an asyncio actor intentionally exits via exit_actor(). ray.exceptions.LocalRayletDiedError Indicates that the task's local raylet died. ray.exceptions.WorkerCrashedError Indicates that the worker died unexpectedly while executing a task. ray.exceptions.TaskPlacementGroupRemoved Raised when the corresponding placement group was removed. ray.exceptions.ActorPlacementGroupRemoved Raised when the corresponding placement group was removed. ray.exceptions.ObjectStoreFullError Indicates that the object store is full. ray.exceptions.OutOfDiskError Indicates that the local disk is full. ray.exceptions.ObjectLostError(...) Indicates that the object is lost from distributed memory, due to node failure or system error. ray.exceptions.ObjectFetchTimedOutError(...) Indicates that an object fetch timed out. ray.exceptions.GetTimeoutError Indicates that a call to the worker timed out. ray.exceptions.OwnerDiedError(...) Indicates that the owner of the object has died while there is still a reference to the object. ray.exceptions.PlasmaObjectNotAvailable Called when an object was not available within the given timeout. ray.exceptions.ObjectReconstructionFailedError(...) Indicates that the object cannot be reconstructed. ray.exceptions.ObjectReconstructionFailedMaxAttemptsExceededError(...) Indicates that the object cannot be reconstructed because the maximum number of task retries has been exceeded. ray.exceptions.ObjectReconstructionFailedLineageEvictedError(...) Indicates that the object cannot be reconstructed because its lineage was evicted due to memory pressure. ray.exceptions.RuntimeEnvSetupError([...]) Raised when a runtime environment fails to be set up. ray.exceptions.CrossLanguageError(ray_exception) Raised from another language. ray.exceptions.RaySystemError(client_exc[, ...]) Indicates that Ray encountered a system error. ray.exceptions.RayError exception ray.exceptions.RayError[source] Super class of all ray exception types. 
PublicAPI: This API is stable across Ray releases.ray.exceptions.RayTaskError exception ray.exceptions.RayTaskError(function_name, traceback_str, cause, proctitle=None, pid=None, ip=None, actor_repr=None, actor_id=None)[source] Indicates that a task threw an exception during execution. If a task throws an exception during execution, a RayTaskError is stored in the object store for each of the task’s outputs. When an object is retrieved from the object store, the Python method that retrieved it checks to see if the object is a RayTaskError and if it is then an exception is thrown propagating the error message. PublicAPI: This API is stable across Ray releases.ray.exceptions.RayActorError exception ray.exceptions.RayActorError(cause: Optional[Union[ray.exceptions.RayTaskError, ]] = None)[source] Indicates that the actor died unexpectedly before finishing a task. This exception could happen either because the actor process dies while executing a task, or because a task is submitted to a dead actor. If the actor is dead because of an exception thrown in its creation tasks, RayActorError will contain the creation_task_error, which is used to reconstruct the exception on the caller side. Parameters cause – The cause of the actor error. RayTaskError type means the actor has died because of an exception within __init__. ActorDiedErrorContext means the actor has died because of unexepected system error. None means the cause is not known. Theoretically, this should not happen, but it is there as a safety check. PublicAPI: This API is stable across Ray releases.ray.exceptions.TaskCancelledError exception ray.exceptions.TaskCancelledError(task_id: Optional[] = None)[source] Raised when this task is cancelled. Parameters task_id – The TaskID of the function that was directly cancelled. PublicAPI: This API is stable across Ray releases.ray.exceptions.TaskUnschedulableError exception ray.exceptions.TaskUnschedulableError(error_message: str)[source] Raised when the task cannot be scheduled. One example is that the node specified through NodeAffinitySchedulingStrategy is dead. PublicAPI: This API is stable across Ray releases.ray.exceptions.ActorUnschedulableError exception ray.exceptions.ActorUnschedulableError(error_message: str)[source] Raised when the actor cannot be scheduled. One example is that the node specified through NodeAffinitySchedulingStrategy is dead. PublicAPI: This API is stable across Ray releases.ray.exceptions.AsyncioActorExit exception ray.exceptions.AsyncioActorExit[source] Raised when an asyncio actor intentionally exits via exit_actor(). PublicAPI: This API is stable across Ray releases.ray.exceptions.LocalRayletDiedError exception ray.exceptions.LocalRayletDiedError[source] Indicates that the task’s local raylet died. PublicAPI: This API is stable across Ray releases.ray.exceptions.WorkerCrashedError exception ray.exceptions.WorkerCrashedError[source] Indicates that the worker died unexpectedly while executing a task. PublicAPI: This API is stable across Ray releases.ray.exceptions.TaskPlacementGroupRemoved exception ray.exceptions.TaskPlacementGroupRemoved[source] Raised when the corresponding placement group was removed. PublicAPI: This API is stable across Ray releases.ray.exceptions.ActorPlacementGroupRemoved exception ray.exceptions.ActorPlacementGroupRemoved[source] Raised when the corresponding placement group was removed. 
PublicAPI: This API is stable across Ray releases.ray.exceptions.ObjectStoreFullError exception ray.exceptions.ObjectStoreFullError[source] Indicates that the object store is full. This is raised if the attempt to store the object fails because the object store is full even after multiple retries. PublicAPI: This API is stable across Ray releases.ray.exceptions.OutOfDiskError exception ray.exceptions.OutOfDiskError[source] Indicates that the local disk is full. This is raised if the attempt to store the object fails because both the object store and disk are full. PublicAPI: This API is stable across Ray releases.ray.exceptions.ObjectLostError exception ray.exceptions.ObjectLostError(object_ref_hex, owner_address, call_site)[source] Indicates that the object is lost from distributed memory, due to node failure or system error. Parameters object_ref_hex – Hex ID of the object. PublicAPI: This API is stable across Ray releases.ray.exceptions.ObjectFetchTimedOutError exception ray.exceptions.ObjectFetchTimedOutError(object_ref_hex, owner_address, call_site)[source] Indicates that an object fetch timed out. Parameters object_ref_hex – Hex ID of the object. PublicAPI: This API is stable across Ray releases.ray.exceptions.GetTimeoutError exception ray.exceptions.GetTimeoutError[source] Indicates that a call to the worker timed out. PublicAPI: This API is stable across Ray releases.ray.exceptions.OwnerDiedError exception ray.exceptions.OwnerDiedError(object_ref_hex, owner_address, call_site)[source] Indicates that the owner of the object has died while there is still a reference to the object. Parameters object_ref_hex – Hex ID of the object. PublicAPI: This API is stable across Ray releases.ray.exceptions.PlasmaObjectNotAvailable exception ray.exceptions.PlasmaObjectNotAvailable[source] Called when an object was not available within the given timeout. PublicAPI: This API is stable across Ray releases.ray.exceptions.ObjectReconstructionFailedError exception ray.exceptions.ObjectReconstructionFailedError(object_ref_hex, owner_address, call_site)[source] Indicates that the object cannot be reconstructed. Parameters object_ref_hex – Hex ID of the object. PublicAPI: This API is stable across Ray releases.ray.exceptions.ObjectReconstructionFailedMaxAttemptsExceededError exception ray.exceptions.ObjectReconstructionFailedMaxAttemptsExceededError(object_ref_hex, owner_address, call_site)[source] Indicates that the object cannot be reconstructed because the maximum number of task retries has been exceeded. Parameters object_ref_hex – Hex ID of the object. PublicAPI: This API is stable across Ray releases.ray.exceptions.ObjectReconstructionFailedLineageEvictedError exception ray.exceptions.ObjectReconstructionFailedLineageEvictedError(object_ref_hex, owner_address, call_site)[source] Indicates that the object cannot be reconstructed because its lineage was evicted due to memory pressure. Parameters object_ref_hex – Hex ID of the object. PublicAPI: This API is stable across Ray releases.ray.exceptions.RuntimeEnvSetupError exception ray.exceptions.RuntimeEnvSetupError(error_message: Optional[str] = None)[source] Raised when a runtime environment fails to be set up. Parameters error_message – The error message that explains why runtime env setup has failed. PublicAPI: This API is stable across Ray releases.ray.exceptions.CrossLanguageError exception ray.exceptions.CrossLanguageError(ray_exception)[source] Raised from another language. 
PublicAPI: This API is stable across Ray releases.ray.exceptions.RaySystemError exception ray.exceptions.RaySystemError(client_exc, traceback_str=None)[source] Indicates that Ray encountered a system error. This exception can be thrown when the raylet is killed. PublicAPI: This API is stable across Ray releases. Ray Core CLI Debugging applications This section contains commands for inspecting and debugging the current cluster. ray stack Take a stack dump of all Python workers on the local machine. ray stack [OPTIONS] ray memory Print object references held in a Ray cluster. ray memory [OPTIONS] Options --address
Override the address to connect to. --redis_password Connect to ray with redis_password. --group-by Group object references by a GroupByType (e.g. NODE_ADDRESS or STACK_TRACE). Options NODE_ADDRESS | STACK_TRACE --sort-by Sort object references in ascending order by a SortingType (e.g. PID, OBJECT_SIZE, or REFERENCE_TYPE). Options PID | OBJECT_SIZE | REFERENCE_TYPE --units Specify unit metrics for displaying object sizes (e.g. B, KB, MB, GB). Options B | KB | MB | GB --no-format Display unformatted results. Defaults to true when terminal width is less than 137 characters. --stats-only Display plasma store stats only. --num-entries, --n Specify number of sorted entries per group. ray timeline Take a Chrome tracing timeline for a Ray cluster. ray timeline [OPTIONS] Options --address
Override the Ray address to connect to. ray status Print cluster status, including autoscaling info. PublicAPI: This API is stable across Ray releases. ray status [OPTIONS] Options --address
Override the address to connect to. --redis_password Connect to ray with redis_password. ray debug Show all active breakpoints and exceptions in the Ray debugger. ray debug [OPTIONS] Options --address
Override the address to connect to. Usage Stats This section contains commands to enable/disable Ray usage stats. ray disable-usage-stats Disable usage stats collection. This does not affect currently running clusters, only clusters launched in the future. ray disable-usage-stats [OPTIONS] ray enable-usage-stats Enable usage stats collection. This does not affect currently running clusters, only clusters launched in the future. ray enable-usage-stats [OPTIONS] State CLI State This section contains commands to access the live state of Ray resources (actor, task, object, etc.). APIs are alpha. This feature requires a full installation of Ray using pip install "ray[default]". This feature also requires the dashboard component to be available. The dashboard component needs to be included when starting the Ray cluster, which is the default behavior for ray start and ray.init(). For more in-depth debugging, you can check the dashboard log at /dashboard.log, which is usually /tmp/ray/session_latest/logs/dashboard.log. The State CLI allows users to access the state of various resources (e.g., actor, task, object). ray summary tasks Summarize the task state of the cluster. By default, the output contains the information grouped by task function names. The output schema is TaskSummaries. Raises: RayStateApiException if the CLI fails to query the data. PublicAPI: This API is stable across Ray releases. ray summary tasks [OPTIONS] Options --timeout Timeout in seconds for the API requests. Default is 30. --address
The address of the Ray API server. If not provided, it will be configured automatically by querying the GCS server. ray summary actors Summarize the actor state of the cluster. By default, the output contains the information grouped by actor class names. The output schema is ray.util.state.common.ActorSummaries. Raises: RayStateApiException if the CLI fails to query the data. PublicAPI: This API is stable across Ray releases. ray summary actors [OPTIONS] Options --timeout Timeout in seconds for the API requests. Default is 30. --address
The address of the Ray API server. If not provided, it will be configured automatically by querying the GCS server. ray summary objects Summarize the object state of the cluster. This API is recommended when debugging memory leaks. See Debugging with Ray Memory for more details. (Note that this command is almost equivalent to ray memory, but it returns easier-to-understand output.) By default, the output contains the information grouped by object callsite. Note that if the env var RAY_record_ref_creation_sites is not configured, callsites are not collected and all data is aggregated under the "disable" callsite. To enable callsite collection, set the following environment variable when starting Ray. Example: ` RAY_record_ref_creation_sites=1 ray start --head ` ` RAY_record_ref_creation_sites=1 ray_script.py ` The output schema is ray.util.state.common.ObjectSummaries. Raises: RayStateApiException if the CLI fails to query the data. PublicAPI: This API is stable across Ray releases. ray summary objects [OPTIONS] Options --timeout Timeout in seconds for the API requests. Default is 30. --address
The address of the Ray API server. If not provided, it will be configured automatically by querying the GCS server. ray list List all states of a given resource. Normally, the summary APIs are recommended before listing all resources. The output schema is defined in the State API Schema section. For example, the output schema of ray list tasks is TaskState. Usage: List all actor information from the cluster. ` ray list actors ` List 50 actors from the cluster. The sorting order cannot be controlled. ` ray list actors --limit 50 ` List 10 actors with state PENDING. ` ray list actors --limit 10 --filter "state=PENDING" ` List actors in YAML format. ` ray list actors --format yaml ` List actors with details. When --detail is specified, the API might query more data sources to obtain data in more detail. ` ray list actors --detail ` The API queries one or more components from the cluster to obtain the data. The returned state snapshot could be stale, and it is not guaranteed to return the live data. The API can return partial or missing output in the following scenarios. When the API queries more than one component and some of them fail, the API returns a partial result (with a suppressible warning). When the API returns too many entries, the API truncates the output. Currently, truncated data cannot be selected by users. Args: resource: The type of the resource to query. Raises: RayStateApiException if the CLI fails to query the data. PublicAPI: This API is stable across Ray releases. ray list [OPTIONS] {actors|jobs|placement-groups|nodes|workers|tasks|objects|runtime-envs|cluster-events} Options --format Options default | json | yaml | table -f, --filter A key, predicate, and value to filter the result. E.g., --filter 'key=value' or --filter 'key!=value'. You can specify multiple --filter options; in this case all predicates are concatenated as AND. For example, --filter key=value --filter key2=value2 means (key == value) AND (key2 == value2). --limit Maximum number of entries to return. 100 by default. --detail If the flag is set, the output will contain data in more detail. Note that the API could query more sources to obtain information in greater detail. --timeout Timeout in seconds for the API requests. Default is 30. --address
The address of the Ray API server. If not provided, it will be configured automatically by querying the GCS server. Arguments RESOURCE Required argument ray get Get the state of a given resource by ID. Get by ID is currently NOT supported for jobs and runtime-envs. The output schema is defined in the State API Schema section. For example, the output schema of ray get tasks is TaskState. Usage: Get an actor with an actor ID. ` ray get actors ` Get placement group information with a placement group ID. ` ray get placement-groups ` The API queries one or more components from the cluster to obtain the data. The returned state snapshot could be stale, and it is not guaranteed to return the live data. Args: resource: The type of the resource to query. id: The ID of the resource. Raises: RayStateApiException if the CLI fails to query the data. PublicAPI: This API is stable across Ray releases. ray get [OPTIONS] {actors|placement-groups|nodes|workers|tasks|objects|cluster-events} ID Options --address
The address of the Ray API server. If not provided, it will be configured automatically by querying the GCS server. --timeout Timeout in seconds for the API requests. Default is 30. Arguments RESOURCE Required argument ID Required argument Log This section contains commands to access logs from Ray clusters. APIs are alpha. This feature requires a full installation of Ray using pip install "ray[default]". The Log CLI allows users to access logs from the cluster. Note that only the logs from alive nodes are available through this API. ray logs Get logs based on filename (cluster) or resource identifiers (actor). Example: Get all the log files available on a node (the Ray address can be obtained from ray start --head or ray.init()). ` ray logs cluster ` [ray logs cluster] Print the last 500 lines of raylet.out on a head node. ` ray logs cluster raylet.out --tail 500 ` Or simply, using ray logs as an alias for ray logs cluster: ` ray logs raylet.out --tail 500 ` Print the last 500 lines of raylet.out on a worker node with id A. ` ray logs raylet.out --tail 500 --node-id A ` [ray logs actor] Follow the log file with an actor id ABC. ` ray logs actor --id ABC --follow ` [ray logs task] Get the stderr generated by a task. Note: If a task is from a concurrent actor (i.e. an async actor or a threaded actor), the logs of the tasks are expected to be interleaved. Please use ray logs actor --id for the entire actor log. ` ray logs task --id --err ` ray logs [OPTIONS] COMMAND [ARGS]... Commands actor Get/List logs associated with an actor. cluster Get/List logs that match the GLOB_FILTER… job Get logs associated with a submission job. task Get logs associated with a task. worker Get logs associated with a worker process. State API APIs are alpha. This feature requires a full installation of Ray using pip install "ray[default]". For an overview with examples see Monitoring Ray States. For the CLI reference see Ray State CLI Reference or Ray Log CLI Reference. State Python SDK State APIs are also exported as functions. Summary APIs ray.util.state.summarize_actors([address, ...]) Summarize the actors in cluster. ray.util.state.summarize_objects([address, ...]) Summarize the objects in cluster. ray.util.state.summarize_tasks([address, ...]) Summarize the tasks in cluster. ray.util.state.summarize_actors ray.util.state.summarize_actors(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) -> Dict[source] Summarize the actors in cluster. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. timeout – Max timeout for requests made when getting the states. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns Dictionarified ActorSummaries Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases. ray.util.state.summarize_objects ray.util.state.summarize_objects(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) -> Dict[source] Summarize the objects in cluster. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. timeout – Max timeout for requests made when getting the states.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns Dictionarified ObjectSummaries Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases.ray.util.state.summarize_tasks ray.util.state.summarize_tasks(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) -> Dict[source] Summarize the tasks in cluster. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. timeout – Max timeout for requests made when getting the states. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns Dictionarified TaskSummaries Raises Exceptions – RayStateApiException if the CLI is failed to query the data. DeveloperAPI: This API may change across minor Ray releases. List APIs ray.util.state.list_actors([address, ...]) List actors in the cluster. ray.util.state.list_placement_groups([...]) List placement groups in the cluster. ray.util.state.list_nodes([address, ...]) List nodes in the cluster. ray.util.state.list_jobs([address, filters, ...]) List jobs submitted to the cluster by :ref: ray job submission. ray.util.state.list_workers([address, ...]) List workers in the cluster. ray.util.state.list_tasks([address, ...]) List tasks in the cluster. ray.util.state.list_objects([address, ...]) List objects in the cluster. ray.util.state.list_runtime_envs([address, ...]) List runtime environments in the cluster. ray.util.state.list_actors ray.util.state.list_actors(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) -> List[ray.util.state.common.ActorState][source] List actors in the cluster. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("id", "=", "abcd") limit – Max number of entries returned by the state backend. timeout – Max timeout value for the state APIs requests made. detail – When True, more details info (specified in ActorState) will be queried and returned. See ActorState. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns List of ActorState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases.ray.util.state.list_placement_groups ray.util.state.list_placement_groups(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) -> List[ray.util.state.common.PlacementGroupState][source] List placement groups in the cluster. 
Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("state", "=", "abcd") limit – Max number of entries returned by the state backend. timeout – Max timeout value for the state APIs requests made. detail – When True, more details info (specified in PlacementGroupState) will be queried and returned. See PlacementGroupState. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns List of PlacementGroupState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases.ray.util.state.list_nodes ray.util.state.list_nodes(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) -> List[ray.util.state.common.NodeState][source] List nodes in the cluster. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("node_name", "=", "abcd") limit – Max number of entries returned by the state backend. timeout – Max timeout value for the state APIs requests made. detail – When True, more details info (specified in NodeState) will be queried and returned. See NodeState. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns List of dictionarified NodeState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases.ray.util.state.list_jobs ray.util.state.list_jobs(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) -> List[ray.util.state.common.JobState][source] List jobs submitted to the cluster by :ref: ray job submission. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("status", "=", "abcd") limit – Max number of entries returned by the state backend. timeout – Max timeout value for the state APIs requests made. detail – When True, more details info (specified in JobState) will be queried and returned. See JobState. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns List of dictionarified JobState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. 
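To make the calling convention of these list APIs concrete, here is a minimal sketch that assumes an initialized Ray cluster; the filter values are illustrative, and any of the filterable columns documented in the State APIs Schema section can be used as filter keys.

import ray
from ray.util.state import list_jobs, list_nodes

ray.init()  # connect to (or start) a cluster

# List only alive nodes and request the detailed NodeState columns.
alive_nodes = list_nodes(filters=[("state", "=", "ALIVE")], detail=True)
for node in alive_nodes:
    print(node.node_id, node.node_ip, node.is_head_node)

# List submission jobs with a given status (the status value shown is illustrative).
succeeded_jobs = list_jobs(filters=[("status", "=", "SUCCEEDED")])
print(f"{len(succeeded_jobs)} submission jobs succeeded")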
DeveloperAPI: This API may change across minor Ray releases.ray.util.state.list_workers ray.util.state.list_workers(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) -> List[ray.util.state.common.WorkerState][source] List workers in the cluster. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("is_alive", "=", "True") limit – Max number of entries returned by the state backend. timeout – Max timeout value for the state APIs requests made. detail – When True, more details info (specified in WorkerState) will be queried and returned. See WorkerState. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns List of WorkerState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases.ray.util.state.list_tasks ray.util.state.list_tasks(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) -> List[ray.util.state.common.TaskState][source] List tasks in the cluster. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("is_alive", "=", "True") limit – Max number of entries returned by the state backend. timeout – Max timeout value for the state APIs requests made. detail – When True, more details info (specified in WorkerState) will be queried and returned. See WorkerState. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns List of TaskState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases.ray.util.state.list_objects ray.util.state.list_objects(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) -> List[ray.util.state.common.ObjectState][source] List objects in the cluster. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("ip", "=", "0.0.0.0") limit – Max number of entries returned by the state backend. timeout – Max timeout value for the state APIs requests made. detail – When True, more details info (specified in ObjectState) will be queried and returned. See ObjectState. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. 
_explain – Print the API information such as API latency or failed query information. Returns List of ObjectState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases.ray.util.state.list_runtime_envs ray.util.state.list_runtime_envs(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) -> List[ray.util.state.common.RuntimeEnvState][source] List runtime environments in the cluster. Parameters address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("node_id", "=", "abcdef") limit – Max number of entries returned by the state backend. timeout – Max timeout value for the state APIs requests made. detail – When True, more details info (specified in RuntimeEnvState) will be queried and returned. See RuntimeEnvState. raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable. _explain – Print the API information such as API latency or failed query information. Returns List of RuntimeEnvState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases. Get APIs ray.util.state.get_actor(id[, address, ...]) Get an actor by id. ray.util.state.get_placement_group(id[, ...]) Get a placement group by id. ray.util.state.get_node(id[, address, ...]) Get a node by id. ray.util.state.get_worker(id[, address, ...]) Get a worker by id. ray.util.state.get_task(id[, address, ...]) Get task attempts of a task by id. ray.util.state.get_objects(id[, address, ...]) Get objects by id. ray.util.state.get_actor ray.util.state.get_actor(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) -> Optional[Dict][source] Get an actor by id. Parameters id – Id of the actor address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. timeout – Max timeout value for the state API requests made. _explain – Print the API information such as API latency or failed query information. Returns None if actor not found, or ActorState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases.ray.util.state.get_placement_group ray.util.state.get_placement_group(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) -> Optional[ray.util.state.common.PlacementGroupState][source] Get a placement group by id. Parameters id – Id of the placement group address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. timeout – Max timeout value for the state APIs requests made. _explain – Print the API information such as API latency or failed query information. Returns None if actor not found, or PlacementGroupState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. 
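As a hedged illustration of the get APIs above, the following minimal sketch creates an actor only so that there is a real ID to look up; the actor class itself is illustrative.

import ray
from ray.util.state import get_actor, list_actors

ray.init()

@ray.remote
class Pinger:  # illustrative actor class
    def ping(self):
        return "pong"

pinger = Pinger.remote()
ray.get(pinger.ping.remote())

# Take an actor ID reported by the list API and fetch its state by ID.
actor_id = list_actors()[0].actor_id
state = get_actor(id=actor_id)
print(state)  # None if the actor is not found, otherwise its ActorState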
DeveloperAPI: This API may change across minor Ray releases. ray.util.state.get_node ray.util.state.get_node(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) -> Optional[ray.util.state.common.NodeState][source] Get a node by id. Parameters id – Id of the node. address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. timeout – Max timeout value for the state APIs requests made. _explain – Print the API information such as API latency or failed query information. Returns None if the node is not found, or NodeState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases. ray.util.state.get_worker ray.util.state.get_worker(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) -> Optional[ray.util.state.common.WorkerState][source] Get a worker by id. Parameters id – Id of the worker. address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. timeout – Max timeout value for the state APIs requests made. _explain – Print the API information such as API latency or failed query information. Returns None if the worker is not found, or WorkerState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases. ray.util.state.get_task ray.util.state.get_task(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) -> Optional[ray.util.state.common.TaskState][source] Get task attempts of a task by id. Parameters id – Id of the task. address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. timeout – Max timeout value for the state APIs requests made. _explain – Print the API information such as API latency or failed query information. Returns None if the task is not found, or a list of TaskState from the task attempts. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases. ray.util.state.get_objects ray.util.state.get_objects(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) -> List[ray.util.state.common.ObjectState][source] Get objects by id. There could be more than 1 entry returned since an object could be referenced in different places. Parameters id – Id of the object. address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray. timeout – Max timeout value for the state APIs requests made. _explain – Print the API information such as API latency or failed query information. Returns List of ObjectState. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases. Log APIs ray.util.state.list_logs([address, node_id, ...]) Listing log files available. ray.util.state.get_log([address, node_id, ...]) Retrieve log file based on file name or some entity ids (pid, actor id, task id). ray.util.state.list_logs ray.util.state.list_logs(address: Optional[str] = None, node_id: Optional[str] = None, node_ip: Optional[str] = None, glob_filter: Optional[str] = None, timeout: int = 30) -> Dict[str, List[str]][source] Listing log files available.
Parameters address – Ray bootstrap address, could be auto, localhost:6379. If not specified, it will be retrieved from the initialized ray cluster. node_id – Id of the node containing the logs. node_ip – Ip of the node containing the logs. glob_filter – Name of the file (relative to the ray log directory) to be retrieved. E.g. glob_filter="*worker*" for all worker logs. actor_id – Id of the actor if getting logs from an actor. timeout – Max timeout for requests made when getting the logs. _interval – The interval in secs to print new logs when follow=True. Returns A dictionary where the keys are log groups (e.g. gcs, raylet, worker), and values are list of log filenames. Raises Exceptions – RayStateApiException if the CLI failed to query the data, or ConnectionError if failed to resolve the ray address. DeveloperAPI: This API may change across minor Ray releases.ray.util.state.get_log ray.util.state.get_log(address: Optional[str] = None, node_id: Optional[str] = None, node_ip: Optional[str] = None, filename: Optional[str] = None, actor_id: Optional[str] = None, task_id: Optional[str] = None, pid: Optional[int] = None, follow: bool = False, tail: int = - 1, timeout: int = 30, suffix: str = 'out', encoding: Optional[str] = 'utf-8', errors: Optional[str] = 'strict', submission_id: Optional[str] = None, attempt_number: int = 0, _interval: Optional[float] = None) -> Generator[str, None, None][source] Retrieve log file based on file name or some entities ids (pid, actor id, task id). Examples import ray import time ray.shutdown() ray.init() # Wait for the node to be registered to the dashboard time.sleep(5) import ray from ray.util.state import get_log # Node id could be retrieved from list_nodes() or ray.nodes() node_id = ray.nodes()[0]["NodeID"] filename = "raylet.out" for l in get_log(filename=filename, node_id=node_id): print(l) [2023-05-19 12:35:18,347 I 4259 68399276] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service. [2023-05-19 12:35:18,348 I 4259 68399276] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 2.14748GB of memory. [2023-05-19 12:35:18,348 I 4259 68399276] (raylet) store_runner.cc:48: Starting object store with directory /tmp, fallback /tmp/ray, and huge page support disabled Parameters address – Ray bootstrap address, could be auto, localhost:6379. If not specified, it will be retrieved from the initialized ray cluster. node_id – Id of the node containing the logs . node_ip – Ip of the node containing the logs. (At least one of the node_id and node_ip have to be supplied when identifying a node). filename – Name of the file (relative to the ray log directory) to be retrieved. actor_id – Id of the actor if getting logs from an actor. task_id – Id of the task if getting logs from a non concurrent actor. For concurrent actor, please query the log with actor_id. pid – PID of the worker if getting logs generated by a worker. When querying with pid, either node_id or node_ip must be supplied. follow – When set to True, logs will be streamed and followed. tail – Number of lines to get from the end of the log file. Set to -1 for getting the entire log. timeout – Max timeout for requests made when getting the logs. suffix – The suffix of the log file if query by id of tasks/workers/actors. Default to “out”. encoding – The encoding used to decode the content of the log file. Default is “utf-8”. Use None to get binary data directly. errors – The error handling scheme to use for decoding errors. Default is “strict”. 
See https://docs.python.org/3/library/codecs.html#error-handlers submission_id – Job submission ID if getting log from a submission job. attempt_number – The attempt number of the task if getting logs generated by a task. _interval – The interval in secs to print new logs when follow=True. Returns A Generator of log lines, None for SendType and ReturnType. Raises Exceptions – RayStateApiException if the CLI failed to query the data. DeveloperAPI: This API may change across minor Ray releases. State APIs Schema ray.util.state.common.ActorState(actor_id, ...) Actor State ray.util.state.common.TaskState(task_id, ...) Task State ray.util.state.common.NodeState(node_id, ...) Node State ray.util.state.common.PlacementGroupState(...) PlacementGroup State ray.util.state.common.WorkerState(worker_id, ...) Worker State ray.util.state.common.ObjectState(object_id, ...) Object State ray.util.state.common.RuntimeEnvState(...[, ...]) Runtime Environment State ray.util.state.common.JobState The state of the job that's submitted by Ray's Job APIs or driver jobs ray.util.state.common.StateSummary(...) ray.util.state.common.TaskSummaries(summary, ...) ray.util.state.common.TaskSummaryPerFuncOrClassName(...) ray.util.state.common.ActorSummaries(...[, ...]) ray.util.state.common.ActorSummaryPerClass(...) ray.util.state.common.ObjectSummaries(...[, ...]) ray.util.state.common.ObjectSummaryPerKey(...) ray.util.state.common.ActorState class ray.util.state.common.ActorState(actor_id: str, class_name: str, state: typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD], job_id: str, name: Optional[str], node_id: Optional[str], pid: Optional[int], ray_namespace: Optional[str], serialized_runtime_env: Optional[str] = None, required_resources: Optional[dict] = None, death_cause: Optional[dict] = None, is_detached: Optional[bool] = None, placement_group_id: Optional[str] = None, repr_name: Optional[str] = None)[source] Bases: ray.util.state.common.StateSchema Actor State Below columns can be used for the --filter option. pid state class_name name job_id repr_name actor_id ray_namespace placement_group_id node_id Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs. pid state class_name name job_id is_detached repr_name actor_id serialized_runtime_env ray_namespace placement_group_id death_cause required_resources node_id actor_id: str The id of the actor. class_name: str The class name of the actor. state: typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD] The state of the actor. DEPENDENCIES_UNREADY: The actor is waiting for its dependencies to be ready. E.g., a new actor is waiting for an object ref that’s created from another remote task. PENDING_CREATION: The actor’s dependencies are ready, but it is not created yet. It could be because there are not enough resources, too many actor entries in the scheduler queue, or the actor creation is slow (e.g., slow runtime environment creation, slow worker startup, etc.). ALIVE: The actor is created, and it is alive. RESTARTING: The actor is dead, and it is restarting. It is equivalent to PENDING_CREATION, but means the actor was dead more than once. DEAD: The actor is permanently dead. job_id: str The job id of this actor. name: Optional[str] The name of the actor given by the name argument. node_id: Optional[str] The node id of this actor.
If the actor is restarting, it could be the node id of the dead actor (and it will be re-updated when the actor is successfully restarted). pid: Optional[int] The pid of the actor. 0 if it is not created yet. ray_namespace: Optional[str] The namespace of the actor. serialized_runtime_env: Optional[str] = None The runtime environment information of the actor. required_resources: Optional[dict] = None The resource requirement of the actor. death_cause: Optional[dict] = None Actor’s death information in detail. None if the actor is not dead yet. is_detached: Optional[bool] = None True if the actor is detached. False otherwise. placement_group_id: Optional[str] = None The placement group id that’s associated with this actor. repr_name: Optional[str] = None Actor’s repr name if a customized __repr__ method exists, else empty string.ray.util.state.common.TaskState class ray.util.state.common.TaskState(task_id: str, attempt_number: int, name: str, state: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], job_id: str, actor_id: Optional[str], type: typing_extensions.Literal[NORMAL_TASK, ACTOR_CREATION_TASK, ACTOR_TASK, DRIVER_TASK], func_or_class_name: str, parent_task_id: str, node_id: Optional[str], worker_id: Optional[str], error_type: Optional[str], language: Optional[str] = None, required_resources: Optional[dict] = None, runtime_env_info: Optional[dict] = None, placement_group_id: Optional[str] = None, events: Optional[List[dict]] = None, profiling_data: Optional[dict] = None, creation_time_ms: Optional[int] = None, start_time_ms: Optional[int] = None, end_time_ms: Optional[int] = None, task_log_info: Optional[dict] = None, error_message: Optional[str] = None)[source] Bases: ray.util.state.common.StateSchema Task State Below columns can be used for the --filter option. node_id error_type language state attempt_number name job_id func_or_class_name worker_id actor_id placement_group_id task_id type parent_task_id Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs. name actor_id error_message task_id node_id end_time_ms events state job_id worker_id runtime_env_info placement_group_id type creation_time_ms parent_task_id error_type attempt_number start_time_ms profiling_data language func_or_class_name task_log_info required_resources task_id: str The id of the task. attempt_number: int The attempt (retry) number of the task. name: str The name of the task if it is given by the name argument. state: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED] The state of the task. Refer to src/ray/protobuf/common.proto for a detailed explanation of the state breakdowns and typical state transition flow. job_id: str The job id of this task. actor_id: Optional[str] The actor id that’s associated with this task. It is empty if there’s no relevant actors. type: typing_extensions.Literal[NORMAL_TASK, ACTOR_CREATION_TASK, ACTOR_TASK, DRIVER_TASK] The type of the task. NORMAL_TASK: Tasks created by func.remote()` ACTOR_CREATION_TASK: Actors created by class.remote() ACTOR_TASK: Actor tasks submitted by actor.method.remote() DRIVER_TASK: Driver (A script that calls ray.init). 
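As a brief, illustrative sketch, the state and type columns documented above can be combined as filter keys with the list API (assuming an initialized Ray cluster):

from ray.util.state import list_tasks

# Count actor tasks that are currently running, filtering on the
# TaskState columns described above.
running_actor_tasks = list_tasks(
    filters=[("type", "=", "ACTOR_TASK"), ("state", "=", "RUNNING")]
)
print(f"{len(running_actor_tasks)} actor tasks are running")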
func_or_class_name: str The name of the task. It is the name of the function if the type is a task or an actor task, and the name of the class if it is an actor scheduling task. parent_task_id: str The parent task id. If the parent is a normal task, it will be the task’s id. If the parent runs in a concurrent actor (async actor or threaded actor), it will be the actor’s creation task id. node_id: Optional[str] Id of the node that runs the task. If the task is retried, it could contain the node id of the previously executed task. If empty, it means the task hasn’t been scheduled yet. worker_id: Optional[str] The worker id that’s associated with this task. error_type: Optional[str] Task error type. language: Optional[str] = None The language of the task. E.g., Python, Java, or Cpp. required_resources: Optional[dict] = None The required resources to execute the task. runtime_env_info: Optional[dict] = None The runtime environment information for the task. placement_group_id: Optional[str] = None The placement group id that’s associated with this task. events: Optional[List[dict]] = None The list of events of the given task. Refer to src/ray/protobuf/common.proto for a detailed explanation of the state breakdowns and typical state transition flow. profiling_data: Optional[dict] = None The list of profile events of the given task. creation_time_ms: Optional[int] = None The time when the task is created. A Unix timestamp in ms. start_time_ms: Optional[int] = None The time when the task starts to run. A Unix timestamp in ms. end_time_ms: Optional[int] = None The time when the task is finished or failed. A Unix timestamp in ms. task_log_info: Optional[dict] = None The task logs info, e.g. the offset into the worker log file when the task starts/finishes. None if the task is from a concurrent actor (e.g. async actor or threaded actor). error_message: Optional[str] = None Task error detail info. ray.util.state.common.NodeState class ray.util.state.common.NodeState(node_id: str, node_ip: str, is_head_node: bool, state: typing_extensions.Literal[ALIVE, DEAD], node_name: str, resources_total: dict, start_time_ms: Optional[int] = None, end_time_ms: Optional[int] = None)[source] Bases: ray.util.state.common.StateSchema Node State Below columns can be used for the --filter option. node_name is_head_node state node_ip node_id Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs. node_name end_time_ms is_head_node state start_time_ms resources_total node_ip node_id node_id: str The id of the node. node_ip: str The ip address of the node. is_head_node: bool If this is a head node. state: typing_extensions.Literal[ALIVE, DEAD] The state of the node. ALIVE: The node is alive. DEAD: The node is dead. node_name: str The name of the node if it is given by the name argument. resources_total: dict The total resources of the node. start_time_ms: Optional[int] = None The time when the node (raylet) starts. ray.util.state.common.PlacementGroupState class ray.util.state.common.PlacementGroupState(placement_group_id: str, name: str, creator_job_id: str, state: typing_extensions.Literal[PENDING, CREATED, REMOVED, RESCHEDULING], bundles: Optional[List[dict]] = None, is_detached: Optional[bool] = None, stats: Optional[dict] = None)[source] Bases: ray.util.state.common.StateSchema PlacementGroup State Below columns can be used for the --filter option.
creator_job_id state name is_detached placement_group_id Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs. creator_job_id state name is_detached stats bundles placement_group_id placement_group_id: str The id of the placement group. name: str The name of the placement group if it is given by the name argument. creator_job_id: str The job id of the placement group. state: typing_extensions.Literal[PENDING, CREATED, REMOVED, RESCHEDULING] The state of the placement group. PENDING: The placement group creation is pending scheduling. It could be because there are not enough resources, or some creation stage has failed (e.g., failed to commit placement groups because the node is dead). CREATED: The placement group is created. REMOVED: The placement group is removed. RESCHEDULING: The placement group is rescheduling because some of its bundles are dead because they were on dead nodes. bundles: Optional[List[dict]] = None The bundle specification of the placement group. is_detached: Optional[bool] = None True if the placement group is detached. False otherwise. stats: Optional[dict] = None The scheduling stats of the placement group. ray.util.state.common.WorkerState class ray.util.state.common.WorkerState(worker_id: str, is_alive: bool, worker_type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER], exit_type: Optional[typing_extensions.Literal[SYSTEM_ERROR, INTENDED_SYSTEM_EXIT, USER_ERROR, INTENDED_USER_EXIT, NODE_OUT_OF_MEMORY]], node_id: str, ip: str, pid: int, exit_detail: Optional[str] = None, worker_launch_time_ms: Optional[int] = None, worker_launched_time_ms: Optional[int] = None, start_time_ms: Optional[int] = None, end_time_ms: Optional[int] = None)[source] Bases: ray.util.state.common.StateSchema Worker State Below columns can be used for the --filter option. is_alive pid worker_id worker_type exit_type ip node_id Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs. worker_launch_time_ms end_time_ms is_alive exit_detail pid worker_id start_time_ms worker_type exit_type ip worker_launched_time_ms node_id worker_id: str The id of the worker. is_alive: bool Whether or not the worker is alive. worker_type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER] The type of the worker. WORKER: The regular Ray worker process that executes tasks or instantiates an actor. DRIVER: The driver (Python script that calls ray.init). SPILL_WORKER: The worker that spills objects. RESTORE_WORKER: The worker that restores objects. exit_type: Optional[typing_extensions.Literal[SYSTEM_ERROR, INTENDED_SYSTEM_EXIT, USER_ERROR, INTENDED_USER_EXIT, NODE_OUT_OF_MEMORY]] The exit type of the worker if the worker is dead. SYSTEM_ERROR: Worker exit due to system level failures (i.e. worker crash). INTENDED_SYSTEM_EXIT: System-level exit that is intended. E.g., workers are killed because they are idle for a long time. USER_ERROR: Worker exits because of user error. E.g., exceptions from the actor initialization. INTENDED_USER_EXIT: Intended exit from users (e.g., users exit workers with exit code 0 or exit initiated by a Ray API such as ray.kill). node_id: str The node id of the worker. ip: str The ip address of the worker. pid: int The pid of the worker. exit_detail: Optional[str] = None The exit detail of the worker if the worker is dead. worker_launch_time_ms: Optional[int] = None The time when the worker is first launched. -1 if the value doesn’t exist. The lifecycle of a worker is as follows.
worker_launch_time_ms (process startup requested) -> worker_launched_time_ms (process started) -> start_time_ms (worker is ready to be used) -> end_time_ms (worker is destroyed). worker_launched_time_ms: Optional[int] = None The time when the worker is successfully launched. -1 if the value doesn’t exist. start_time_ms: Optional[int] = None The time when the worker is started and initialized. 0 if the value doesn’t exist. end_time_ms: Optional[int] = None The time when the worker exits. The timestamp could be delayed if the worker is dead unexpectedly. 0 if the value doesn’t exist. ray.util.state.common.ObjectState class ray.util.state.common.ObjectState(object_id: str, object_size: int, task_status: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], reference_type: typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS], call_site: str, type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER], pid: int, ip: str)[source] Bases: ray.util.state.common.StateSchema Object State Below columns can be used for the --filter option. task_status object_id reference_type object_size pid call_site ip type Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs. task_status object_id reference_type object_size pid call_site ip type object_id: str The id of the object. object_size: int The size of the object in mb. task_status: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED] The status of the task that creates the object. NIL: We don’t have a status for this task because we are not the owner or the task metadata has already been deleted. WAITING_FOR_DEPENDENCIES: The task is waiting for its dependencies to be created. SCHEDULED: All dependencies have been created and the task is scheduled to execute. It could be because the task is waiting for resources, runtime environment creation, fetching dependencies to the local node, etc. FINISHED: The task finished successfully. WAITING_FOR_EXECUTION: The task is scheduled properly and waiting for execution. It includes time to deliver the task to the remote worker + queueing time from the execution side. RUNNING: The task is running. reference_type: typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS] The reference type of the object. See Debugging with Ray Memory for more details. ACTOR_HANDLE: The reference is an actor handle. PINNED_IN_MEMORY: The object is pinned in memory, meaning there’s an in-flight ray.get on this reference. LOCAL_REFERENCE: There’s a local reference (e.g., Python reference) to this object reference. The object won’t be GC’ed until all of them are gone. USED_BY_PENDING_TASK: The object reference is passed to other tasks. E.g., a = ray.put() -> task.remote(a). In this case, a is used by a pending task task. CAPTURED_IN_OBJECT: The object is serialized by other objects. E.g., a = ray.put(1) -> b = ray.put([a]). a is serialized within a list. UNKNOWN_STATUS: The object ref status is unknown. call_site: str The callsite of the object.
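As a hedged sketch of how the reference_type column above is used in practice (assuming an initialized Ray cluster):

from ray.util.state import list_objects

# Objects pinned in memory by an in-flight ray.get, filtered on the
# reference_type column documented above.
pinned = list_objects(filters=[("reference_type", "=", "PINNED_IN_MEMORY")])
for obj in pinned:
    print(obj.object_id, obj.object_size, obj.call_site)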
type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER] The worker type that creates the object. WORKER: The regular Ray worker process that executes tasks or instantiates an actor. DRIVER: The driver (Python script that calls ray.init). SPILL_WORKER: The worker that spills objects. RESTORE_WORKER: The worker that restores objects. pid: int The pid of the owner. ip: str The ip address of the owner. ray.util.state.common.RuntimeEnvState class ray.util.state.common.RuntimeEnvState(runtime_env: dict, success: bool, creation_time_ms: Optional[float], node_id: str, ref_cnt: Optional[int] = None, error: Optional[str] = None)[source] Bases: ray.util.state.common.StateSchema Runtime Environment State Below columns can be used for the --filter option. success error runtime_env node_id Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs. ref_cnt success runtime_env error creation_time_ms node_id runtime_env: dict The runtime environment spec. success: bool Whether or not the runtime env creation has succeeded. creation_time_ms: Optional[float] The latency of creating the runtime environment. Available if the runtime env is successfully created. node_id: str The node id of this runtime environment. ref_cnt: Optional[int] = None The number of actors and tasks that use this runtime environment. error: Optional[str] = None The error message if the runtime environment creation has failed. Available if the runtime env failed to be created. ray.util.state.common.JobState class ray.util.state.common.JobState(*, type: ray.dashboard.modules.job.pydantic_models.JobType, job_id: str = None, submission_id: str = None, driver_info: ray.dashboard.modules.job.pydantic_models.DriverInfo = None, status: ray.dashboard.modules.job.common.JobStatus, entrypoint: str, message: str = None, error_type: str = None, start_time: int = None, end_time: int = None, metadata: Dict[str, str] = None, runtime_env: Dict[str, Any] = None, driver_agent_http_address: str = None, driver_node_id: str = None)[source] Bases: ray.util.state.common.StateSchema, ray.dashboard.modules.job.pydantic_models.JobDetails The state of the job that’s submitted by Ray’s Job APIs or driver jobs. Below columns can be used for the --filter option. status job_id type submission_id classmethod filterable_columns() -> Set[str][source] Return a list of filterable columns. classmethod humanify(state: dict) -> dict[source] Convert the given state object into something human readable. classmethod list_columns(detail: bool = False) -> List[str][source] Return a list of columns. ray.util.state.common.StateSummary class ray.util.state.common.StateSummary(node_id_to_summary: Dict[str, Union[ray.util.state.common.TaskSummaries, ray.util.state.common.ActorSummaries, ray.util.state.common.ObjectSummaries]])[source] Bases: object node_id_to_summary: Dict[str, Union[ray.util.state.common.TaskSummaries, ray.util.state.common.ActorSummaries, ray.util.state.common.ObjectSummaries]] Node ID -> summary per node. If the data is not required to be organized per node, it will contain a single key, “cluster”. ray.util.state.common.TaskSummaries class ray.util.state.common.TaskSummaries(summary: Union[Dict[str, ray.util.state.common.TaskSummaryPerFuncOrClassName], List[ray.util.state.common.NestedTaskSummary]], total_tasks: int, total_actor_tasks: int, total_actor_scheduled: int, summary_by: str = 'func_name')[source] Bases: object total_tasks: int Total Ray tasks.
total_actor_tasks: int Total actor tasks. total_actor_scheduled: int Total scheduled actors. classmethod to_summary_by_lineage(*, tasks: List[Dict], actors: List[Dict]) -> ray.util.state.common.TaskSummaries[source] This summarizes tasks by lineage. i.e. A task will be grouped with another task if they have the same parent. This does things in 4 steps. Step 1: Iterate through all tasks and keep track of them by id and ownership Step 2: Put the tasks in a tree structure based on ownership Step 3: Merge together siblings in the tree if there are more than one with the same name. Step 4: Total the children This can probably be more efficient if we merge together some steps to reduce the amount of iterations but this algorithm produces very easy to understand code. We can optimize in the future.ray.util.state.common.TaskSummaryPerFuncOrClassName class ray.util.state.common.TaskSummaryPerFuncOrClassName(func_or_class_name: str, type: str, state_counts: Dict[typing_extensions.Literal['NIL', 'PENDING_ARGS_AVAIL', 'PENDING_NODE_ASSIGNMENT', 'PENDING_OBJ_STORE_MEM_AVAIL', 'PENDING_ARGS_FETCH', 'SUBMITTED_TO_WORKER', 'RUNNING', 'RUNNING_IN_RAY_GET', 'RUNNING_IN_RAY_WAIT', 'FINISHED', 'FAILED'], int] = )[source] Bases: object func_or_class_name: str The function or class name of this task. type: str The type of the class. Equivalent to protobuf TaskType. state_counts: Dict[typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], int] State name to the count dict. State name is equivalent to the protobuf TaskStatus.ray.util.state.common.ActorSummaries class ray.util.state.common.ActorSummaries(summary: Dict[str, ray.util.state.common.ActorSummaryPerClass], total_actors: int, summary_by: str = 'class')[source] Bases: object summary: Dict[str, ray.util.state.common.ActorSummaryPerClass] Group key (actor class name) -> summary total_actors: int Total number of actorsray.util.state.common.ActorSummaryPerClass class ray.util.state.common.ActorSummaryPerClass(class_name: str, state_counts: Dict[typing_extensions.Literal['DEPENDENCIES_UNREADY', 'PENDING_CREATION', 'ALIVE', 'RESTARTING', 'DEAD'], int] = )[source] Bases: object class_name: str The class name of the actor. state_counts: Dict[typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD], int] State name to the count dict. State name is equivalent to the protobuf ActorState.ray.util.state.common.ObjectSummaries class ray.util.state.common.ObjectSummaries(summary: Dict[str, ray.util.state.common.ObjectSummaryPerKey], total_objects: int, total_size_mb: float, callsite_enabled: bool, summary_by: str = 'callsite')[source] Bases: object summary: Dict[str, ray.util.state.common.ObjectSummaryPerKey] Group key (actor class name) -> summary total_objects: int Total number of referenced objects in the cluster. total_size_mb: float Total size of referenced objects in the cluster in MB. 
callsite_enabled: bool Whether or not the callsite collection is enabled.ray.util.state.common.ObjectSummaryPerKey class ray.util.state.common.ObjectSummaryPerKey(total_objects: int, total_size_mb: float, total_num_workers: int, total_num_nodes: int, task_state_counts: Dict[typing_extensions.Literal['NIL', 'PENDING_ARGS_AVAIL', 'PENDING_NODE_ASSIGNMENT', 'PENDING_OBJ_STORE_MEM_AVAIL', 'PENDING_ARGS_FETCH', 'SUBMITTED_TO_WORKER', 'RUNNING', 'RUNNING_IN_RAY_GET', 'RUNNING_IN_RAY_WAIT', 'FINISHED', 'FAILED'], int] = , ref_type_counts: Dict[typing_extensions.Literal['ACTOR_HANDLE', 'PINNED_IN_MEMORY', 'LOCAL_REFERENCE', 'USED_BY_PENDING_TASK', 'CAPTURED_IN_OBJECT', 'UNKNOWN_STATUS'], int] = )[source] Bases: object total_objects: int Total number of objects of the type. total_size_mb: float Total size in mb. total_num_workers: int Total number of workers that reference the type of objects. total_num_nodes: int Total number of nodes that reference the type of objects. task_state_counts: Dict[typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], int] State name to the count dict. State name is equivalent to ObjectState. ref_type_counts: Dict[typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS], int] Ref count type to the count dict. State name is equivalent to ObjectState. State APIs Exceptions ray.util.state.exception.RayStateApiException ray.util.state.exception.RayStateApiException exception ray.util.state.exception.RayStateApiException[source] Ray AI Runtime (AIR) AIR is currently in beta. Fill out this short form to get involved. We’ll be holding office hours, development sprints, and other activities as we get closer to the GA release. Join us! Ray AI Runtime (AIR) is a scalable and unified toolkit for ML applications. AIR enables simple scaling of individual workloads, end-to-end workflows, and popular ecosystem frameworks, all in just Python. https://docs.google.com/drawings/d/1atB1dLjZIi8ibJ2-CoHdd3Zzyl_hDRWyK2CJAVBBLdU/edit AIR builds on Ray’s best-in-class libraries for Preprocessing, Training, Tuning, Scoring, Serving, and Reinforcement Learning to bring together an ecosystem of integrations. ML Compute, Simplified Ray AIR aims to simplify the ecosystem of machine learning frameworks, platforms, and tools. It does this by leveraging Ray to provide a seamless, unified, and open experience for scalable ML: https://docs.google.com/drawings/d/1oi_JwNHXVgtR_9iTdbecquesUd4hOk0dWgHaTaFj6gk/edit 1. Seamless Dev to Prod: AIR reduces friction going from development to production. With Ray and AIR, the same Python code scales seamlessly from a laptop to a large cluster. 2. Unified ML API: AIR’s unified ML API enables swapping between popular frameworks, such as XGBoost, PyTorch, and Hugging Face, with just a single class change in your code. 3. Open and Extensible: AIR and Ray are fully open-source and can run on any cluster, cloud, or Kubernetes. Build custom components and integrations on top of scalable developer APIs. When to use AIR? AIR is for both data scientists and ML engineers alike. https://docs.google.com/drawings/d/1Qw_h457v921jWQkx63tmKAsOsJ-qemhwhCZvhkxWrWo/edit For data scientists, AIR can be used to scale individual workloads, and also end-to-end ML applications. 
For ML Engineers, AIR provides scalable platform abstractions that can be used to easily onboard and integrate tooling from the broader ML ecosystem. Quick Start Below, we walk through how AIR’s unified ML API enables scaling of end-to-end ML workflows, focusing on a few of the popular frameworks AIR integrates with (XGBoost, Pytorch, and Tensorflow). The ML workflow we’re going to build is summarized by the following diagram: https://docs.google.com/drawings/d/1z0r_Yc7-0NAPVsP2jWUkLV2jHVHdcJHdt9uN1GDANSY/edit AIR provides a unified API for the ML ecosystem. This diagram shows how AIR enables an ecosystem of libraries to be run at scale in just a few lines of code. Get started by installing Ray AIR: pip install -U "ray[air]" # The below Ray AIR tutorial was written with the following libraries. # Consider running the following to ensure that the code below runs properly: pip install -U pandas>=1.3.5 pip install -U torch>=1.12 pip install -U numpy>=1.19.5 pip install -U tensorflow>=2.6.2 pip install -U pyarrow>=6.0.1 Preprocessing First, let’s start by loading a dataset from storage: import ray # Load data. dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") # Split data into train and validation. train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) # Create a test dataset by dropping the target column. test_dataset = valid_dataset.drop_columns(cols=["target"]) Then, we define a Preprocessor pipeline for our task: XGBoost Pytorch Tensorflow # Create a preprocessor to scale some columns. from ray.data.preprocessors import StandardScaler preprocessor = StandardScaler(columns=["mean radius", "mean texture"]) import numpy as np from ray.data.preprocessors import Concatenator, Chain, StandardScaler # Create a preprocessor to scale some columns and concatenate the result. preprocessor = Chain( StandardScaler(columns=["mean radius", "mean texture"]), Concatenator(exclude=["target"], dtype=np.float32), ) import numpy as np from ray.data.preprocessors import Concatenator, Chain, StandardScaler # Create a preprocessor to scale some columns and concatenate the result. preprocessor = Chain( StandardScaler(columns=["mean radius", "mean texture"]), Concatenator(exclude=["target"], dtype=np.float32), ) Training Train a model with a Trainer with common ML frameworks: XGBoost Pytorch Tensorflow from ray.air.config import ScalingConfig from ray.train.xgboost import XGBoostTrainer trainer = XGBoostTrainer( scaling_config=ScalingConfig( # Number of workers to use for data parallelism. num_workers=2, # Whether to use GPU acceleration. use_gpu=False, # Make sure to leave some CPUs free for Ray Data operations. _max_cpu_fraction_per_node=0.9, ), label_column="target", num_boost_round=20, params={ # XGBoost specific params "objective": "binary:logistic", # "tree_method": "gpu_hist", # uncomment this to use GPUs. 
"eval_metric": ["logloss", "error"], }, datasets={"train": train_dataset, "valid": valid_dataset}, preprocessor=preprocessor, ) best_result = trainer.fit() print(best_result.metrics) import torch import torch.nn as nn from ray import train from ray.air import session from ray.air.config import ScalingConfig from ray.train.torch import TorchCheckpoint, TorchTrainer def create_model(input_features): return nn.Sequential( nn.Linear(in_features=input_features, out_features=16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid(), ) def train_loop_per_worker(config): batch_size = config["batch_size"] lr = config["lr"] epochs = config["num_epochs"] num_features = config["num_features"] # Get the Dataset shard for this data parallel worker, # and convert it to a PyTorch Dataset. train_data = session.get_dataset_shard("train") # Create model. model = create_model(num_features) model = train.torch.prepare_model(model) loss_fn = nn.BCELoss() optimizer = torch.optim.SGD(model.parameters(), lr=lr) for cur_epoch in range(epochs): for batch in train_data.iter_torch_batches( batch_size=batch_size, dtypes=torch.float32 ): # "concat_out" is the output column of the Concatenator. inputs, labels = batch["concat_out"], batch["target"] optimizer.zero_grad() predictions = model(inputs) train_loss = loss_fn(predictions, labels.unsqueeze(1)) train_loss.backward() optimizer.step() loss = train_loss.item() session.report({"loss": loss}, checkpoint=TorchCheckpoint.from_model(model)) num_features = len(train_dataset.schema().names) - 1 trainer = TorchTrainer( train_loop_per_worker=train_loop_per_worker, train_loop_config={ "batch_size": 128, "num_epochs": 20, "num_features": num_features, "lr": 0.001, }, scaling_config=ScalingConfig( num_workers=3, # Number of workers to use for data parallelism. use_gpu=False, trainer_resources={"CPU": 0}, # so that the example works on Colab. ), datasets={"train": train_dataset}, preprocessor=preprocessor, ) # Execute training. best_result = trainer.fit() print(f"Last result: {best_result.metrics}") # Last result: {'loss': 0.6559339960416158, ...} import tensorflow as tf from tensorflow import keras from tensorflow.keras import layers from ray.air import session from ray.air.config import ScalingConfig from ray.air.integrations.keras import ReportCheckpointCallback from ray.train.tensorflow import TensorflowTrainer def create_keras_model(input_features): return keras.Sequential( [ keras.Input(shape=(input_features,)), layers.Dense(16, activation="relu"), layers.Dense(16, activation="relu"), layers.Dense(1), ] ) def train_loop_per_worker(config): batch_size = config["batch_size"] lr = config["lr"] epochs = config["num_epochs"] num_features = config["num_features"] # Get the Dataset shard for this data parallel worker, # and convert it to a Tensorflow Dataset. train_data = session.get_dataset_shard("train") strategy = tf.distribute.MultiWorkerMirroredStrategy() with strategy.scope(): # Model building/compiling need to be within `strategy.scope()`. 
multi_worker_model = create_keras_model(num_features) multi_worker_model.compile( optimizer=tf.keras.optimizers.SGD(learning_rate=lr), loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), metrics=[ tf.keras.metrics.BinaryCrossentropy( name="loss", ) ], ) for _ in range(epochs): tf_dataset = train_data.to_tf( feature_columns="concat_out", label_columns="target", batch_size=batch_size ) multi_worker_model.fit( tf_dataset, callbacks=[ReportCheckpointCallback()], verbose=0, ) num_features = len(train_dataset.schema().names) - 1 trainer = TensorflowTrainer( train_loop_per_worker=train_loop_per_worker, train_loop_config={ "batch_size": 128, "num_epochs": 50, "num_features": num_features, "lr": 0.0001, }, scaling_config=ScalingConfig( num_workers=2, # Number of data parallel training workers use_gpu=False, trainer_resources={"CPU": 0}, # so that the example works on Colab. ), datasets={"train": train_dataset}, preprocessor=preprocessor, ) best_result = trainer.fit() print(f"Last result: {best_result.metrics}") # Last result: {'loss': 8.997025489807129, ...} Hyperparameter Tuning You can specify a hyperparameter space to search over for each trainer: XGBoost Pytorch Tensorflow from ray import tune param_space = {"params": {"max_depth": tune.randint(1, 9)}} metric = "train-logloss" from ray import tune param_space = {"train_loop_config": {"lr": tune.loguniform(0.0001, 0.01)}} metric = "loss" from ray import tune param_space = {"train_loop_config": {"lr": tune.loguniform(0.0001, 0.01)}} metric = "loss" Then use the Tuner to run the search: from ray.tune.tuner import Tuner, TuneConfig from ray.air.config import RunConfig tuner = Tuner( trainer, param_space=param_space, tune_config=TuneConfig(num_samples=5, metric=metric, mode="min"), ) # Execute tuning. result_grid = tuner.fit() # Fetch the best result. best_result = result_grid.get_best_result() print("Best Result:", best_result) # Best Result: Result(metrics={'loss': 0.278409322102863, ...}) Batch Inference After running the steps in Training or Tuning, use the trained model for scalable batch prediction with Dataset.map_batches(). To learn more, see End-to-end: Offline Batch Inference. Project Status AIR is currently in beta. If you have questions for the team or are interested in getting involved in the development process, fill out this short form. For an overview of the AIR libraries, ecosystem integrations, and their readiness, check out the latest AIR ecosystem map. Next Steps Key Concepts Examples API reference Technical whitepaper To check how your application is doing, you can use the Ray dashboard. Key Concepts Here, we cover the main concepts in AIR. Datasets Preprocessors Trainers Tuner Checkpoints Batch Predictor Deployments Datasets Ray Data is the standard way to load and exchange data in Ray AIR. It provides a Dataset concept which is used extensively for data loading, preprocessing, and batch inference. Preprocessors Preprocessors are primitives that can be used to transform input data into features. Preprocessors operate on Datasets, which makes them scalable and compatible with a variety of datasources and dataframe libraries. A Preprocessor is fitted during Training, and applied at runtime in both Training and Serving on data batches in the same way. AIR comes with a collection of built-in preprocessors, and you can also define your own with simple templates. See the documentation on Preprocessors. 
import ray import pandas as pd from sklearn.datasets import load_breast_cancer from ray.data.preprocessors import * # Split data into train and validation. dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) test_dataset = valid_dataset.drop_columns(["target"]) columns_to_scale = ["mean radius", "mean texture"] preprocessor = StandardScaler(columns=columns_to_scale) Trainers Trainers are wrapper classes around third-party training frameworks such as XGBoost and Pytorch. They are built to help integrate with core Ray actors (for distribution), Ray Tune, and Ray Data. See the documentation on Trainers. from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig num_workers = 2 use_gpu = False # XGBoost specific params params = { "tree_method": "approx", "objective": "binary:logistic", "eval_metric": ["logloss", "error"], "max_depth": 2, } trainer = XGBoostTrainer( scaling_config=ScalingConfig( num_workers=num_workers, use_gpu=use_gpu, # Make sure to leave some CPUs free for Ray Data operations. _max_cpu_fraction_per_node=0.9, ), label_column="target", params=params, datasets={"train": train_dataset, "valid": valid_dataset}, preprocessor=preprocessor, num_boost_round=5, ) result = trainer.fit() Trainer objects produce a Result object after calling .fit(). These objects contain training metrics as well as checkpoints to retrieve the best model. print(result.metrics) print(result.checkpoint) Tuner Tuners offer scalable hyperparameter tuning as part of Ray Tune. Tuners can work seamlessly with any Trainer but also can support arbitrary training functions. from ray import tune from ray.tune.tuner import Tuner, TuneConfig tuner = Tuner( trainer, param_space={"params": {"max_depth": tune.randint(1, 9)}}, tune_config=TuneConfig(num_samples=5, metric="train-logloss", mode="min"), ) result_grid = tuner.fit() best_result = result_grid.get_best_result() print(best_result) Checkpoints The AIR trainers, tuners, and custom pretrained model generate a framework-specific Checkpoint object. Checkpoints are a common interface for models that are used across different AIR components and libraries. There are two main ways to generate a checkpoint. Checkpoint objects can be retrieved from the Result object returned by a Trainer or Tuner .fit() call. checkpoint = result.checkpoint print(checkpoint) # Checkpoint(local_path=..../checkpoint_000005) tuned_checkpoint = result_grid.get_best_result().checkpoint print(tuned_checkpoint) # Checkpoint(local_path=..../checkpoint_000005) You can also generate a checkpoint from a pretrained model. Each AIR supported machine learning (ML) framework has a Checkpoint object that can be used to generate an AIR checkpoint: from ray.train.tensorflow import TensorflowCheckpoint import tensorflow as tf # This can be a trained model. def build_model() -> tf.keras.Model: model = tf.keras.Sequential( [ tf.keras.layers.InputLayer(input_shape=(1,)), tf.keras.layers.Dense(1), ] ) return model model = build_model() checkpoint = TensorflowCheckpoint.from_model(model) Checkpoints can be used to instantiate a Predictor, BatchPredictor, or PredictorDeployment classes, as seen below. Batch Predictor You can take a checkpoint and do batch inference using the BatchPredictor object. 
from ray.train.batch_predictor import BatchPredictor from ray.train.xgboost import XGBoostPredictor batch_predictor = BatchPredictor.from_checkpoint(result.checkpoint, XGBoostPredictor) # Bulk batch prediction. predicted_probabilities = batch_predictor.predict(test_dataset) predicted_probabilities.show() # Pipelined batch prediction: instead of processing the data in bulk, process it # incrementally in windows of the given size. pipeline = batch_predictor.predict_pipelined(test_dataset, bytes_per_window=1048576) pipeline.show() Deployments Deploy the model as an inference service by using Ray Serve and the PredictorDeployment class. from ray import serve from fastapi import Request from ray.serve import PredictorDeployment from ray.serve.http_adapters import json_request async def adapter(request: Request): content = await request.json() print(content) return pd.DataFrame.from_dict(content) serve.run( PredictorDeployment.options(name="XGBoostService").bind( XGBoostPredictor, result.checkpoint, batching_params=False, http_adapter=adapter ) ) After deploying the service, you can send requests to it. import requests sample_input = test_dataset.take(1) sample_input = dict(sample_input[0]) output = requests.post("http://localhost:8000/", json=[sample_input]).json() print(output) User Guides AIR User Guides Using Preprocessors Using Trainers Configuring Training Datasets Configuring Hyperparameter Tuning Using Predictors for Inference Deploying Predictors with Serve How to Deploy AIR Environment variables Some behavior of Ray AIR can be controlled using environment variables. Please also see the Ray Tune environment variables. RAY_AIR_FULL_TRACEBACKS: If set to 1, will print full tracebacks for training functions, including internal code paths. Otherwise, abbreviated tracebacks that only show user code are printed. Defaults to 0 (disabled). RAY_AIR_NEW_OUTPUT: If set to 0, this disables the experimental new console output. RAY_AIR_RICH_LAYOUT: If set to 1, this enables the stick table layout (only available for Ray Tune). Running multiple AIR jobs concurrently on a single cluster Running multiple AIR training or tuning jobs at the same time on a single cluster is not officially supported. We don’t test this workflow and recommend the use of multiple smaller clusters instead. If you still want to do this, refer to the Ray Tune multi-tenancy docs for potential pitfalls. Using Preprocessors Data preprocessing is a common technique for transforming raw data into features for a machine learning model. In general, you may want to apply the same preprocessing logic to your offline training data and online inference data. Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic. https://docs.google.com/drawings/d/1ZIbsXv5vvwTVIEr2aooKxuYJ_VL7-8VMNlRinAiPaTI/edit Overview The most common way of using a preprocessor is by passing it as an argument to the constructor of a Trainer in conjunction with a Ray Data. For example, the following code trains a model with a preprocessor that normalizes the data. 
import ray from ray.data.preprocessors import MinMaxScaler from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig train_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(0, 32, 3)]) valid_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(1, 32, 3)]) preprocessor = MinMaxScaler(["x"]) trainer = XGBoostTrainer( label_column="y", params={"objective": "reg:squarederror"}, scaling_config=ScalingConfig(num_workers=2), datasets={"train": train_dataset, "valid": valid_dataset}, preprocessor=preprocessor, ) result = trainer.fit() The Preprocessor class has four public methods that can be used separately from a trainer: fit(): Compute state information about a Dataset (e.g., the mean or standard deviation of a column) and save it to the Preprocessor. This information is used to perform transform(), and the method is typically called on a training dataset. transform(): Apply a transformation to a Dataset. If the Preprocessor is stateful, then fit() must be called first. This method is typically called on training, validation, and test datasets. transform_batch(): Apply a transformation to a single batch of data. This method is typically called on online or offline inference data. fit_transform(): Syntactic sugar for calling both fit() and transform() on a Dataset. To show these methods in action, let’s walk through a basic example. First, we’ll set up two simple Ray Datasets. import pandas as pd import ray from ray.data.preprocessors import MinMaxScaler from ray.data.preprocessors.scaler import StandardScaler # Generate two simple datasets. dataset = ray.data.range(8) dataset1, dataset2 = dataset.split(2) print(dataset1.take()) # [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}] print(dataset2.take()) # [{'id': 4}, {'id': 5}, {'id': 6}, {'id': 7}] Next, fit the Preprocessor on one Dataset, and then transform both Datasets with this fitted information. # Fit the preprocessor on dataset1, and transform both dataset1 and dataset2. preprocessor = MinMaxScaler(["id"]) dataset1_transformed = preprocessor.fit_transform(dataset1) print(dataset1_transformed.take()) # [{'id': 0.0}, {'id': 0.3333333333333333}, {'id': 0.6666666666666666}, {'id': 1.0}] dataset2_transformed = preprocessor.transform(dataset2) print(dataset2_transformed.take()) # [{'id': 1.3333333333333333}, {'id': 1.6666666666666667}, {'id': 2.0}, {'id': 2.3333333333333335}] Finally, call transform_batch on a single batch of data. batch = pd.DataFrame({"id": list(range(8, 12))}) batch_transformed = preprocessor.transform_batch(batch) print(batch_transformed) # id # 0 2.666667 # 1 3.000000 # 2 3.333333 # 3 3.666667 Life of an AIR preprocessor Now that we’ve gone over the basics, let’s dive into how Preprocessors fit into an end-to-end application built with AIR. The diagram below depicts an overview of the main steps of a Preprocessor: Passed into a Trainer to fit and transform input Datasets Saved as a Checkpoint Reconstructed in a Predictor to call transform_batch on batches of data Throughout this section we’ll go through this workflow in more detail, with code examples using XGBoost. The same logic is applicable to other machine learning framework integrations as well. Trainer The journey of the Preprocessor starts with the Trainer. If the Trainer is instantiated with a Preprocessor, then the following logic is executed when Trainer.fit() is called: If a "train" Dataset is passed in, then the Preprocessor calls fit() on it.
The Preprocessor then calls transform() on all Datasets, including the "train" Dataset. The Trainer then performs training on the preprocessed Datasets. import ray from ray.data.preprocessors import MinMaxScaler from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig train_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(0, 32, 3)]) valid_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(1, 32, 3)]) preprocessor = MinMaxScaler(["x"]) trainer = XGBoostTrainer( label_column="y", params={"objective": "reg:squarederror"}, scaling_config=ScalingConfig(num_workers=2), datasets={"train": train_dataset, "valid": valid_dataset}, preprocessor=preprocessor, ) result = trainer.fit() If you’re passing a Preprocessor that is already fitted, it is refitted on the "train" Dataset. Adding the functionality to support passing in a fitted Preprocessor is being tracked here. TODO: Remove the note above once the issue is resolved. Tune If you’re using Ray Tune for hyperparameter optimization, be aware that each Trial instantiates its own copy of the Preprocessor and the fitting and transforming logic occur once per Trial. Checkpoint Trainer.fit() returns a Result object which contains a Checkpoint. If a Preprocessor is passed into the Trainer, then it is saved in the Checkpoint along with any fitted state. As a sanity check, let’s confirm the Preprocessor is available in the Checkpoint. In practice, you don’t need to check. import os import ray.cloudpickle as cpickle from ray.air.constants import PREPROCESSOR_KEY checkpoint = result.checkpoint with checkpoint.as_directory() as checkpoint_path: path = os.path.join(checkpoint_path, PREPROCESSOR_KEY) with open(path, "rb") as f: preprocessor = cpickle.load(f) print(preprocessor) # MixMaxScaler(columns=['x'], stats={'min(x)': 0, 'max(x)': 30}) Predictor A Predictor can be constructed from a saved Checkpoint. If the Checkpoint contains a Preprocessor, then the Preprocessor calls transform_batch on input batches prior to performing inference. In the following example, we show the Batch Predictor flow. The same logic applies to the Online Inference flow. from ray.train.batch_predictor import BatchPredictor from ray.train.xgboost import XGBoostPredictor test_dataset = ray.data.from_items([{"x": x} for x in range(2, 32, 3)]) batch_predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor) predicted_probabilities = batch_predictor.predict(test_dataset) predicted_probabilities.show() # {'predictions': 0.09843720495700836} # {'predictions': 5.604666709899902} # {'predictions': 11.405311584472656} # {'predictions': 15.684700012207031} # {'predictions': 23.990947723388672} # {'predictions': 29.900211334228516} # {'predictions': 34.59944152832031} # {'predictions': 40.6968994140625} # {'predictions': 45.68107604980469} Types of preprocessors Built-in preprocessors Ray AIR provides a handful of preprocessors out of the box. Generic preprocessors ray.data.preprocessors.BatchMapper Apply an arbitrary operation to a dataset. ray.data.preprocessors.Chain Combine multiple preprocessors into a single Preprocessor. ray.data.preprocessors.Concatenator Combine numeric columns into a column of type TensorDtype. ray.data.preprocessor.Preprocessor Implements an ML preprocessing operation. ray.data.preprocessors.SimpleImputer Replace missing values with imputed values. Categorical encoders ray.data.preprocessors.Categorizer Convert columns to pd.CategoricalDtype. 
ray.data.preprocessors.LabelEncoder Encode labels as integer targets. ray.data.preprocessors.MultiHotEncoder Multi-hot encode categorical data. ray.data.preprocessors.OneHotEncoder One-hot encode categorical data. ray.data.preprocessors.OrdinalEncoder Encode values within columns as ordered integer values. Feature scalers ray.data.preprocessors.MaxAbsScaler Scale each column by its absolute max value. ray.data.preprocessors.MinMaxScaler Scale each column by its range. ray.data.preprocessors.Normalizer Scales each sample to have unit norm. ray.data.preprocessors.PowerTransformer Apply a power transform to make your data more normally distributed. ray.data.preprocessors.RobustScaler Scale and translate each column using quantiles. ray.data.preprocessors.StandardScaler Translate and scale each column by its mean and standard deviation, respectively. Text encoders ray.data.preprocessors.CountVectorizer Count the frequency of tokens in a column of strings. ray.data.preprocessors.HashingVectorizer Count the frequency of tokens using the hashing trick. ray.data.preprocessors.Tokenizer Replace each string with a list of tokens. ray.data.preprocessors.FeatureHasher Apply the hashing trick to a table that describes token frequencies. Utilities ray.data.Dataset.train_test_split Materialize and split the dataset into train and test subsets. Which preprocessor should you use? The type of preprocessor you use depends on what your data looks like. This section provides tips on handling common data formats. Categorical data Most models expect numerical inputs. To represent your categorical data in a way your model can understand, encode categories using one of the preprocessors described below. Categorical Data Type Example Preprocessor Labels "cat", "dog", "airplane" LabelEncoder Ordered categories "bs", "md", "phd" OrdinalEncoder Unordered categories "red", "green", "blue" OneHotEncoder Lists of categories ("sci-fi", "action"), ("action", "comedy", "animated") MultiHotEncoder If you’re using LightGBM, you don’t need to encode your categorical data. Instead, use Categorizer to convert your data to pandas.CategoricalDtype. Numerical data To ensure your models behaves properly, normalize your numerical data. Reference the table below to determine which preprocessor to use. Data Property Preprocessor Your data is approximately normal StandardScaler Your data is sparse MaxAbsScaler Your data contains many outliers RobustScaler Your data isn’t normal, but you need it to be PowerTransformer You need unit-norm rows Normalizer You aren’t sure what your data looks like MinMaxScaler These preprocessors operate on numeric columns. If your dataset contains columns of type TensorDtype, you may need to implement a custom preprocessor. Additionally, if your model expects a tensor or ndarray, create a tensor using Concatenator. Built-in feature scalers like StandardScaler don’t work on TensorDtype columns, so apply Concatenator after feature scaling. Combine feature scaling and concatenation into a single preprocessor with Chain. from ray.data.preprocessors import Chain, Concatenator, StandardScaler # Generate a simple dataset. 
dataset = ray.data.from_items([{"X": 1.0, "Y": 2.0}, {"X": 4.0, "Y": 0.0}]) print(dataset.take()) # [{'X': 1.0, 'Y': 2.0}, {'X': 4.0, 'Y': 0.0}] preprocessor = Chain(StandardScaler(columns=["X", "Y"]), Concatenator()) dataset_transformed = preprocessor.fit_transform(dataset) print(dataset_transformed.take()) # [{'concat_out': array([-1., 1.])}, {'concat_out': array([ 1., -1.])}] Text data A document-term matrix is a table that describes text data, often used in natural language processing. To generate a document-term matrix from a collection of documents, use HashingVectorizer or CountVectorizer. If you already know the frequency of tokens and want to store the data in a document-term matrix, use FeatureHasher. Requirement Preprocessor You care about memory efficiency HashingVectorizer You care about model interpretability CountVectorizer Filling in missing values If your dataset contains missing values, replace them with SimpleImputer. from ray.data.preprocessors import SimpleImputer # Generate a simple dataset. dataset = ray.data.from_items([{"id": 1.0}, {"id": None}, {"id": 3.0}]) print(dataset.take()) # [{'id': 1.0}, {'id': None}, {'id': 3.0}] imputer = SimpleImputer(columns=["id"], strategy="mean") dataset_transformed = imputer.fit_transform(dataset) print(dataset_transformed.take()) # [{'id': 1.0}, {'id': 2.0}, {'id': 3.0}] Chaining preprocessors If you need to apply more than one preprocessor, compose them together with Chain. Chain applies fit and transform sequentially. For example, if you construct Chain(preprocessorA, preprocessorB), then preprocessorB.transform is applied to the result of preprocessorA.transform. import ray from ray.data.preprocessors import Chain, MinMaxScaler, SimpleImputer # Generate one simple dataset. dataset = ray.data.from_items( [{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": None}] ) print(dataset.take()) # [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': None}] preprocessor = Chain(SimpleImputer(["id"]), MinMaxScaler(["id"])) dataset_transformed = preprocessor.fit_transform(dataset) print(dataset_transformed.take()) # [{'id': 0.0}, {'id': 0.3333333333333333}, {'id': 0.6666666666666666}, {'id': 1.0}, {'id': 0.5}] Implementing custom preprocessors If you want to implement a custom preprocessor that needs to be fit, extend the Preprocessor base class. from typing import Dict import ray from pandas import DataFrame from ray.data.preprocessor import Preprocessor from ray.data import Dataset from ray.data.aggregate import Max class CustomPreprocessor(Preprocessor): def _fit(self, dataset: Dataset) -> Preprocessor: self.stats_ = dataset.aggregate(Max("id")) def _transform_pandas(self, df: DataFrame) -> DataFrame: return df * self.stats_["max(id)"] # Generate a simple dataset. dataset = ray.data.range(4) print(dataset.take()) # [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}] # Create a stateful preprocessor that finds the max id and scales each id by it. preprocessor = CustomPreprocessor() dataset_transformed = preprocessor.fit_transform(dataset) print(dataset_transformed.take()) # [{'id': 0}, {'id': 3}, {'id': 6}, {'id': 9}] If your preprocessor doesn’t need to be fit, construct a BatchMapper to apply a UDF in parallel over your data. BatchMapper can drop, add, or modify columns, and you can specify a batch_size to control the size of the data batches provided to your UDF. import ray from ray.data.preprocessors import BatchMapper # Generate a simple dataset. 
dataset = ray.data.range(4) print(dataset.take()) # [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}] # Create a stateless preprocess that multiplies ids by 2. preprocessor = BatchMapper(lambda df: df * 2, batch_size=2, batch_format="pandas") dataset_transformed = preprocessor.transform(dataset) print(dataset_transformed.take()) # [{'id': 0}, {'id': 2}, {'id': 4}, {'id': 6}] Using Trainers https://docs.google.com/drawings/d/1anmT0JVFH9abR5wX5_WcxNHJh6jWeDL49zWxGpkfORA/edit Ray AIR Trainers provide a way to scale out training with popular machine learning frameworks. As part of Ray Train, Trainers enable users to run distributed multi-node training with fault tolerance. Fully integrated with the Ray ecosystem, Trainers leverage Ray Data to enable scalable preprocessing and performant distributed data ingestion. Also, Trainers can be composed with Tuners for distributed hyperparameter tuning. After executing training, Trainers output the trained model in the form of a Checkpoint, which can be used for batch or online prediction inference. There are three broad categories of Trainers that AIR offers: Deep Learning Trainers (Pytorch, Tensorflow, Horovod) Tree-based Trainers (XGboost, LightGBM) Other ML frameworks (Hugging Face, Scikit-Learn, RLlib) Trainer Basics All trainers inherit from the BaseTrainer interface. To construct a Trainer, you can provide: A scaling_config, which specifies how many parallel training workers and what type of resources (CPUs/GPUs) to use per worker during training. A run_config, which configures a variety of runtime parameters such as fault tolerance, logging, and callbacks. A collection of datasets and a preprocessor for the provided datasets, which configures preprocessing and the datasets to ingest from. resume_from_checkpoint, which is a checkpoint path to resume from, should your training run be interrupted. After instantiating a Trainer, you can invoke it by calling Trainer.fit(). import ray from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)]) trainer = XGBoostTrainer( label_column="y", params={"objective": "reg:squarederror"}, scaling_config=ScalingConfig(num_workers=3), datasets={"train": train_dataset}, ) result = trainer.fit() Deep Learning Trainers Ray Train offers 3 main deep learning trainers: TorchTrainer, TensorflowTrainer, and HorovodTrainer. These three trainers all take a train_loop_per_worker parameter, which is a function that defines the main training logic that runs on each training worker. Under the hood, Ray AIR will use the provided scaling_config to instantiate the correct number of workers. Upon instantiation, each worker will be able to reference a global Session object, which provides functionality for reporting metrics, saving checkpoints, and more. You can provide multiple datasets to a trainer via the datasets parameter. If datasets includes a training dataset (denoted by the “train” key), then it will be split into multiple dataset shards, with each worker training on a single shard. All other datasets will not be split. You can access the data shard within a worker via get_dataset_shard(), and use to_tf() or iter_torch_batches to generate batches of Tensorflow or Pytorch tensors. You can read more about data ingest here. Read more about Ray Train’s Deep Learning Trainers. 
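Distilled down, the per-worker pattern described above looks roughly like the following sketch (illustrative only: the "train" dataset key, the epoch count, and the elided training step are assumptions; complete, runnable versions follow under the code examples below).

from ray.air import session

def train_loop_per_worker():
    # Each worker receives only its shard of the "train" dataset.
    shard = session.get_dataset_shard("train")
    for epoch in range(2):  # assumed epoch count, for illustration
        for batch in shard.iter_torch_batches(batch_size=32):
            ...  # run the forward/backward pass on this batch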
Code examples Torch import torch import torch.nn as nn import ray from ray import train from ray.air import session, Checkpoint from ray.train.torch import TorchTrainer from ray.air.config import ScalingConfig # If using GPUs, set this to True. use_gpu = False input_size = 1 layer_size = 15 output_size = 1 num_epochs = 3 class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.layer1 = nn.Linear(input_size, layer_size) self.relu = nn.ReLU() self.layer2 = nn.Linear(layer_size, output_size) def forward(self, input): return self.layer2(self.relu(self.layer1(input))) def train_loop_per_worker(): dataset_shard = session.get_dataset_shard("train") model = NeuralNetwork() loss_fn = nn.MSELoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.1) model = train.torch.prepare_model(model) for epoch in range(num_epochs): for batches in dataset_shard.iter_torch_batches( batch_size=32, dtypes=torch.float ): inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"] output = model(inputs) loss = loss_fn(output, labels) optimizer.zero_grad() loss.backward() optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") session.report( {}, checkpoint=Checkpoint.from_dict( dict(epoch=epoch, model=model.state_dict()) ), ) train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)]) scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu) trainer = TorchTrainer( train_loop_per_worker=train_loop_per_worker, scaling_config=scaling_config, datasets={"train": train_dataset}, ) result = trainer.fit() Tensorflow import ray import tensorflow as tf from ray.air import session from ray.air.integrations.keras import ReportCheckpointCallback from ray.train.tensorflow import TensorflowTrainer from ray.air.config import ScalingConfig # If using GPUs, set this to True. use_gpu = False a = 5 b = 10 size = 100 def build_model() -> tf.keras.Model: model = tf.keras.Sequential( [ tf.keras.layers.InputLayer(input_shape=()), # Add feature dimension, expanding (batch_size,) to (batch_size, 1). tf.keras.layers.Flatten(), tf.keras.layers.Dense(10), tf.keras.layers.Dense(1), ] ) return model def train_func(config: dict): batch_size = config.get("batch_size", 64) epochs = config.get("epochs", 3) strategy = tf.distribute.MultiWorkerMirroredStrategy() with strategy.scope(): # Model building/compiling need to be within `strategy.scope()`. 
multi_worker_model = build_model() multi_worker_model.compile( optimizer=tf.keras.optimizers.SGD(learning_rate=config.get("lr", 1e-3)), loss=tf.keras.losses.mean_squared_error, metrics=[tf.keras.metrics.mean_squared_error], ) dataset = session.get_dataset_shard("train") results = [] for _ in range(epochs): tf_dataset = dataset.to_tf( feature_columns="x", label_columns="y", batch_size=batch_size ) history = multi_worker_model.fit( tf_dataset, callbacks=[ReportCheckpointCallback()] ) results.append(history.history) return results config = {"lr": 1e-3, "batch_size": 32, "epochs": 4} train_dataset = ray.data.from_items( [{"x": x / 200, "y": 2 * x / 200} for x in range(200)] ) scaling_config = ScalingConfig(num_workers=2, use_gpu=use_gpu) trainer = TensorflowTrainer( train_loop_per_worker=train_func, train_loop_config=config, scaling_config=scaling_config, datasets={"train": train_dataset}, ) result = trainer.fit() print(result.metrics) Horovod import ray import ray.train as train import ray.train.torch # Need this to use `train.torch.get_device()` import horovod.torch as hvd import torch import torch.nn as nn from ray.air import session, Checkpoint from ray.train.horovod import HorovodTrainer from ray.air.config import ScalingConfig # If using GPUs, set this to True. use_gpu = False input_size = 1 layer_size = 15 output_size = 1 num_epochs = 3 class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.layer1 = nn.Linear(input_size, layer_size) self.relu = nn.ReLU() self.layer2 = nn.Linear(layer_size, output_size) def forward(self, input): return self.layer2(self.relu(self.layer1(input))) def train_loop_per_worker(): hvd.init() dataset_shard = session.get_dataset_shard("train") model = NeuralNetwork() device = train.torch.get_device() model.to(device) loss_fn = nn.MSELoss() lr_scaler = 1 optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * lr_scaler) # Horovod: wrap optimizer with DistributedOptimizer. optimizer = hvd.DistributedOptimizer( optimizer, named_parameters=model.named_parameters(), op=hvd.Average, ) for epoch in range(num_epochs): model.train() for batch in dataset_shard.iter_torch_batches( batch_size=32, dtypes=torch.float ): inputs, labels = torch.unsqueeze(batch["x"], 1), batch["y"] outputs = model(inputs) loss = loss_fn(outputs, labels) optimizer.zero_grad() loss.backward() optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") session.report( {}, checkpoint=Checkpoint.from_dict(dict(model=model.state_dict())), ) train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)]) scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu) trainer = HorovodTrainer( train_loop_per_worker=train_loop_per_worker, scaling_config=scaling_config, datasets={"train": train_dataset}, ) result = trainer.fit() How to report metrics and checkpoints? During model training, you may want to save training metrics and checkpoints for downstream processing (e.g., serving the model). Use the Session API to gather metrics and save checkpoints. Checkpoints are synced to driver or the cloud storage based on user’s configurations, as specified in Trainer(run_config=...). 
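As a minimal sketch of that reporting pattern (the metric names, epoch count, and checkpoint contents here are placeholders; a full Tensorflow example follows below):

from ray.air import session
from ray.air.checkpoint import Checkpoint

def train_loop_per_worker():
    for epoch in range(3):  # placeholder epoch count
        loss = 0.0  # placeholder; replace with the loss from your training step
        # Attach a checkpoint to the reported metrics; AIR persists it
        # according to the RunConfig passed to the Trainer.
        session.report(
            {"loss": loss, "epoch": epoch},
            checkpoint=Checkpoint.from_dict({"epoch": epoch}),
        )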
Code example import tensorflow as tf from ray.air import session from ray.air.checkpoint import Checkpoint from ray.air.config import ScalingConfig from ray.train.tensorflow import TensorflowTrainer def build_model() -> tf.keras.Model: model = tf.keras.Sequential( [ tf.keras.layers.InputLayer(input_shape=(1,)), tf.keras.layers.Dense(10), tf.keras.layers.Dense(1), ] ) return model def train_func(): ckpt = session.get_checkpoint() if ckpt: with ckpt.as_directory() as loaded_checkpoint_dir: import tensorflow as tf model = tf.keras.models.load_model(loaded_checkpoint_dir) else: model = build_model() model.save("my_model", overwrite=True) session.report( metrics={"iter": 1}, checkpoint=Checkpoint.from_directory("my_model") ) scaling_config = ScalingConfig(num_workers=2) trainer = TensorflowTrainer( train_loop_per_worker=train_func, scaling_config=scaling_config ) result = trainer.fit() # trainer2 will pick up from the checkpoint saved by trainer1. trainer2 = TensorflowTrainer( train_loop_per_worker=train_func, scaling_config=scaling_config, # this is ultimately what is accessed through # ``Session.get_checkpoint()`` resume_from_checkpoint=result.checkpoint, ) result2 = trainer2.fit() Tree-based Trainers Ray Train offers 2 main tree-based trainers: XGBoostTrainer and LightGBMTrainer. See here for a more detailed user-guide. XGBoost Trainer Ray AIR also provides an easy to use XGBoostTrainer for training XGBoost models at scale. To use this trainer, you will need to first run: pip install -U xgboost-ray. import ray from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)]) trainer = XGBoostTrainer( label_column="y", params={"objective": "reg:squarederror"}, scaling_config=ScalingConfig(num_workers=3), datasets={"train": train_dataset}, ) result = trainer.fit() LightGBMTrainer Similarly, Ray AIR comes with a LightGBMTrainer for training LightGBM models at scale. To use this trainer, you will need to first run pip install -U lightgbm-ray. import ray from ray.train.lightgbm import LightGBMTrainer from ray.air.config import ScalingConfig train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)]) trainer = LightGBMTrainer( label_column="y", params={"objective": "regression"}, scaling_config=ScalingConfig(num_workers=3), datasets={"train": train_dataset}, ) result = trainer.fit() Other Trainers Hugging Face TransformersTrainer TransformersTrainer further extends TorchTrainer, built for interoperability with the HuggingFace Transformers library. Users are required to provide a trainer_init_per_worker function which returns a transformers.Trainer object. The trainer_init_per_worker function will have access to preprocessed train and evaluation datasets. Upon calling TransformersTrainer.fit(), multiple workers (ray actors) will be spawned, and each worker will create its own copy of a transformers.Trainer. Each worker will then invoke transformers.Trainer.train(), which will perform distributed training via Pytorch DDP. Code example # Based on # huggingface/notebooks/examples/language_modeling_from_scratch.ipynb # Hugging Face imports from datasets import load_dataset import transformers from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer import ray from ray.train.huggingface import TransformersTrainer from ray.air.config import ScalingConfig # If using GPUs, set this to True. 
use_gpu = False model_checkpoint = "gpt2" tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer" block_size = 128 datasets = load_dataset("wikitext", "wikitext-2-raw-v1") tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint) def tokenize_function(examples): return tokenizer(examples["text"]) tokenized_datasets = datasets.map( tokenize_function, batched=True, num_proc=1, remove_columns=["text"] ) def group_texts(examples): # Concatenate all texts. concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} total_length = len(concatenated_examples[list(examples.keys())[0]]) # We drop the small remainder, we could add padding if the model # supported it. # instead of this drop, you can customize this part to your needs. total_length = (total_length // block_size) * block_size # Split by chunks of max_len. result = { k: [t[i : i + block_size] for i in range(0, total_length, block_size)] for k, t in concatenated_examples.items() } result["labels"] = result["input_ids"].copy() return result lm_datasets = tokenized_datasets.map( group_texts, batched=True, batch_size=1000, num_proc=1, ) ray_train_ds = ray.data.from_huggingface(lm_datasets["train"]) ray_evaluation_ds = ray.data.from_huggingface(lm_datasets["validation"]) def trainer_init_per_worker(train_dataset, eval_dataset, **config): model_config = AutoConfig.from_pretrained(model_checkpoint) model = AutoModelForCausalLM.from_config(model_config) args = transformers.TrainingArguments( output_dir=f"{model_checkpoint}-wikitext2", evaluation_strategy="epoch", save_strategy="epoch", logging_strategy="epoch", learning_rate=2e-5, weight_decay=0.01, no_cuda=(not use_gpu), ) return transformers.Trainer( model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset, ) scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu) trainer = TransformersTrainer( trainer_init_per_worker=trainer_init_per_worker, scaling_config=scaling_config, datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds}, ) result = trainer.fit() AccelerateTrainer If you prefer a more fine-grained Hugging Face API than what Transformers provides, you can use AccelerateTrainer to run training functions making use of Hugging Face Accelerate. Similarly to TransformersTrainer, AccelerateTrainer is also an extension of TorchTrainer. AccelerateTrainer allows you to pass an Accelerate configuration file generated with accelerate config to be applied on all training workers. This ensures that the worker environments are set up correctly for Accelerate, allowing you to take advantage of Accelerate APIs and integrations such as DeepSpeed and FSDP just as you would if you were running Accelerate without Ray. AccelerateTrainer will override some settings set with accelerate config, mainly related to the topology and networking. See the AccelerateTrainer API reference for more details. Aside from Accelerate support, the usage is identical to TorchTrainer, meaning you define your own training function and use the Session API to report metrics, save checkpoints etc. Code example import torch import torch.nn as nn from accelerate import Accelerator import ray from ray.air import session, Checkpoint from ray.train.huggingface import AccelerateTrainer from ray.air.config import ScalingConfig # If using GPUs, set this to True. 
use_gpu = False input_size = 1 layer_size = 15 output_size = 1 num_epochs = 3 class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.layer1 = nn.Linear(input_size, layer_size) self.relu = nn.ReLU() self.layer2 = nn.Linear(layer_size, output_size) def forward(self, input): return self.layer2(self.relu(self.layer1(input))) def train_loop_per_worker(): accelerator = Accelerator() dataset_shard = session.get_dataset_shard("train") model = NeuralNetwork() loss_fn = nn.MSELoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.1) model, optimizer = accelerator.prepare(model, optimizer) for epoch in range(num_epochs): for batches in dataset_shard.iter_torch_batches( batch_size=32, dtypes=torch.float ): inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"] output = model(inputs) loss = loss_fn(output, labels) optimizer.zero_grad() accelerator.backward(loss) optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") session.report( {}, checkpoint=Checkpoint.from_dict( dict(epoch=epoch, model=accelerator.unwrap_model(model).state_dict()) ), ) train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)]) scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu) trainer = AccelerateTrainer( train_loop_per_worker=train_loop_per_worker, # Instead of using a dict, you can run ``accelerate config``. # The default value of None will then load that configuration # file. accelerate_config={}, scaling_config=scaling_config, datasets={"train": train_dataset}, ) result = trainer.fit() Scikit-Learn Trainer This trainer is not distributed. The Scikit-Learn Trainer is a thin wrapper to launch scikit-learn training within Ray AIR. Even though this trainer is not distributed, you can still benefit from its integration with Ray Tune for distributed hyperparameter tuning and scalable batch/online prediction. import ray from ray.train.sklearn import SklearnTrainer from sklearn.ensemble import RandomForestRegressor train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)]) trainer = SklearnTrainer( estimator=RandomForestRegressor(), label_column="y", scaling_config=ray.air.config.ScalingConfig(trainer_resources={"CPU": 4}), datasets={"train": train_dataset}, ) result = trainer.fit() RLlib Trainer RLTrainer provides an interface to RL Trainables. This enables you to use the same abstractions as in the other trainers to define the scaling behavior, and to use Ray Data for offline training. Please note that some scaling behavior still has to be defined separately. The scaling_config will set the number of training workers (“Rollout workers”). To set the number of e.g. evaluation workers, you will have to specify this in the config parameter of the RLTrainer: from ray.air.config import RunConfig, ScalingConfig from ray.train.rl import RLTrainer trainer = RLTrainer( run_config=RunConfig(stop={"training_iteration": 5}), scaling_config=ScalingConfig(num_workers=2, use_gpu=False), algorithm="PPO", config={ "env": "CartPole-v0", "framework": "tf", "evaluation_num_workers": 1, "evaluation_interval": 1, "evaluation_config": {"input": "sampler"}, }, ) result = trainer.fit() How to interpret training results? Calling Trainer.fit() returns a Result, providing you access to metrics, checkpoints, and errors. 
You can interact with a Result object as follows: result = trainer.fit() # returns the last saved checkpoint result.checkpoint # returns the N best saved checkpoints, as configured in ``RunConfig.CheckpointConfig`` result.best_checkpoints # returns the final metrics as reported result.metrics # returns the Exception if training failed. result.error # Returns a pandas dataframe of all reported results result.metrics_dataframe See the Result docstring for more details. Configuring Training Datasets This guide covers how to leverage Ray Data to load data for distributed training jobs. You may want to use Ray Data for training over framework built-in data loading utilities for a few reasons: To leverage the full Ray cluster to speed up preprocessing of your data. To make data loading agnostic of the underlying framework. Advanced Ray Data features such as global shuffles. Basics Let’s use a single Torch training workload as a running example. A very basic example of using Ray Data with TorchTrainer looks like this: import ray from ray.air import session from ray.air.config import ScalingConfig from ray.train.torch import TorchTrainer import numpy as np from typing import Dict # Load the data. train_ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet") ## Uncomment to randomize the block order each epoch. # train_ds = train_ds.randomize_block_order() # Define a preprocessing function. def normalize_length(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: new_col = batch["sepal.length"] / np.max(batch["sepal.length"]) batch["normalized.sepal.length"] = new_col del batch["sepal.length"] return batch # Preprocess your data any way you want. This will be re-run each epoch. # You can use Ray Data preprocessors here as well, # e.g., preprocessor.fit_transform(train_ds) train_ds = train_ds.map_batches(normalize_length) def train_loop_per_worker(): # Get an iterator to the dataset we passed in below. it = session.get_dataset_shard("train") # Train for 10 epochs over the data. We'll use a shuffle buffer size # of 10k elements, and prefetch up to 10 batches of size 128 each. for _ in range(10): for batch in it.iter_batches( local_shuffle_buffer_size=10000, batch_size=128, prefetch_batches=10 ): print("Do some training on batch", batch) my_trainer = TorchTrainer( train_loop_per_worker, scaling_config=ScalingConfig(num_workers=2), datasets={"train": train_ds}, ) my_trainer.fit() In this basic example, the train_ds object is created in your Ray script before the Trainer is even instantiated. The train_ds object is passed to the Trainer via the datasets argument, and is accessible to the train_loop_per_worker function via the session.get_dataset_shard method. Splitting data across workers By default, Train will split the "train" dataset across workers using Dataset.streaming_split. This means that each worker sees a disjoint subset of the data, instead of iterating over the entire dataset. To customize this, we can pass in a DataConfig to the Trainer constructor. For example, the following splits dataset "a" but not "b". dataset_a = ray.data.read_text( "s3://anonymous@ray-example-data/sms_spam_collection_subset.txt" ) dataset_b = ray.data.read_csv("s3://anonymous@ray-example-data/dow_jones.csv") my_trainer = TorchTrainer( train_loop_per_worker, scaling_config=ScalingConfig(num_workers=2), datasets={"a": dataset_a, "b": dataset_b}, dataset_config=ray.train.DataConfig( datasets_to_split=["a"], ), ) Performance This section covers common options for improving ingest performance. 
Materializing your dataset Datasets are lazy and their execution is streamed, which means that on each epoch, all preprocessing operations will be re-run. If this loading/preprocessing is expensive, you may benefit from materializing your dataset in memory. This tells Ray Data to compute all the blocks of the dataset fully and pin them in Ray object store memory. This means that when iterating over the dataset repeatedly, the preprocessing operations do not need to be re-run, greatly improving performance. However, the trade-off is that if the preprocessed data is too large to fit into Ray object store memory, this could slow things down because data needs to be spilled to disk. # Load the data. train_ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet") # Preprocess the data. Transformations that are made before the materialize call below # will only be run once. train_ds = train_ds.map_batches(normalize_length) # Materialize the dataset in object store memory. train_ds = train_ds.materialize() # Add per-epoch preprocessing. Transformations that you want to run per-epoch, such # as data augmentation, should go after the materialize call. train_ds = train_ds.map_batches(augment_data) Ray Data execution options Under the hood, Train configures some default Data options for ingest: limiting the data ingest memory usage to 2GB per worker, and telling it to optimize the locality of the output data for ingest. See help(DataConfig.default_ingest_options()) if you want to learn more and further customize these settings. Common options you may want to adjust: resource_limits.object_store_memory, which sets the amount of Ray object memory to use for Data ingestion. Increasing this can improve performance up to a point where it can trigger disk spilling and slow things down. preserve_order. This is off by default, and lets Ray Data compute blocks out of order. Setting this to True will avoid this source of nondeterminism. You can pass in custom execution options to the data config, which will apply to all data executions for the Trainer. For example, if you want to adjust the ingest memory size to 10GB per worker: from ray.train import DataConfig options = DataConfig.default_ingest_options() options.resource_limits.object_store_memory = 10e9 my_trainer = TorchTrainer( train_loop_per_worker, scaling_config=ScalingConfig(num_workers=2), dataset_config=ray.train.DataConfig( execution_options=options, ), ) Other performance tips Adjust the prefetch_batches argument for DataIterator.iter_batches. This can be useful if bottlenecked on the network. Finally, you can use print(ds.stats()) or print(iterator.stats()) to print detailed timing information about Ray Data performance. Custom data config (advanced) For use cases not covered by the default config class, you can also fully customize exactly how your input datasets are split. To do this, you need to define a custom DataConfig class (DeveloperAPI). The DataConfig class is responsible for the shared setup and splitting of data across nodes. # Note that this example class is doing the same thing as the basic DataConfig # impl included with Ray Train.
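# (For orientation: `configure` receives all of the Trainer's datasets along with
# the worker count and worker node IDs, and must return one
# {dataset_name: DataIterator} dict per training worker.)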
from typing import Optional, Dict, List from ray.data import Dataset, DataIterator, NodeIdStr from ray.actor import ActorHandle class MyCustomDataConfig(DataConfig): def configure( self, datasets: Dict[str, Dataset], world_size: int, worker_handles: Optional[List[ActorHandle]], worker_node_ids: Optional[List[NodeIdStr]], **kwargs, ) -> List[Dict[str, DataIterator]]: assert len(datasets) == 1, "This example only handles the simple case" # Configure Ray Data for ingest. ctx = ray.data.DataContext.get_current() ctx.execution_options = DataConfig.default_ingest_options() # Split the stream into shards. iterator_shards = datasets["train"].streaming_split( world_size, equal=True, locality_hints=worker_node_ids ) # Return the assigned iterators for each worker. return [{"train": it} for it in iterator_shards] my_trainer = TorchTrainer( train_loop_per_worker, scaling_config=ScalingConfig(num_workers=2), datasets={"train": train_ds}, dataset_config=MyCustomDataConfig(), ) my_trainer.fit() What do you need to know about this DataConfig class? It must be serializable, since it will be copied from the driver script to the driving actor of the Trainer. Its configure method is called on the main actor of the Trainer group to create the data iterators for each worker. In general, you can use DataConfig for any shared setup that has to occur ahead of time before the workers start reading data. The setup will be run at the start of each Trainer run. Migrating from the legacy DatasetConfig API Starting from Ray 2.6, the DatasetConfig API is deprecated, and it will be removed in a future release. If your workloads are still using it, consider migrating to the new DataConfig API as soon as possible. The main difference is that preprocessing is no longer part of the Trainer. Because Dataset operations are lazy, you can apply any operations to your Datasets before passing them to the Trainer. The operations will be re-executed before each epoch. In the following example with the legacy DatasetConfig API, we pass 2 Datasets (“train” and “test”) to the Trainer and apply an “add_noise” preprocessor per epoch to the “train” Dataset. Also, we will split the “train” Dataset, but not the “test” Dataset. import random import ray from ray.air.config import ScalingConfig, DatasetConfig from ray.data.preprocessors.batch_mapper import BatchMapper from ray.train.torch import TorchTrainer train_ds = ray.data.range_tensor(1000) test_ds = ray.data.range_tensor(10) # A randomized preprocessor that adds a random float to all values. add_noise = BatchMapper(lambda df: df + random.random(), batch_format="pandas") my_trainer = TorchTrainer( lambda: None, scaling_config=ScalingConfig(num_workers=1), datasets={ "train": train_ds, "test": test_ds, }, dataset_config={ "train": DatasetConfig( split=True, # Apply the preprocessor for each epoch. per_epoch_preprocessor=add_noise, ), "test": DatasetConfig( split=False, ), }, ) my_trainer.fit() To migrate this example to the new DataConfig API, we apply the “add_noise” preprocessor to the “train” Dataset prior to passing it to the Trainer. We then use DataConfig(datasets_to_split=["train"]) to specify which Datasets need to be split. Note that the datasets_to_split argument is optional. By default, only the “train” Dataset will be split. If you don’t want to split the “train” Dataset either, use datasets_to_split=[]. from ray.train import DataConfig train_ds = ray.data.range_tensor(1000) test_ds = ray.data.range_tensor(10) # Apply the preprocessor before passing the Dataset to the Trainer.
# This operation is lazy. It will be re-executed for each epoch. train_ds = add_noise.transform(train_ds) my_trainer = TorchTrainer( lambda: None, scaling_config=ScalingConfig(num_workers=1), datasets={ "train": train_ds, "test": test_ds, }, # Specify which datasets to split. dataset_config=DataConfig( datasets_to_split=["train"], ), ) my_trainer.fit() Configuring Hyperparameter Tuning The Ray AIR Tuner is the recommended way to tune hyperparameters in Ray AIR. https://docs.google.com/drawings/d/1yMd12iMkyo6DGrFoET1TIlKfFnXX9dfh2u3GSdTz6W4/edit The Tuner will take in a Trainer and execute multiple training runs, each with different hyperparameter configurations. As part of Ray Tune, the Tuner provides an interface that works with AIR Trainers to perform distributed hyperparameter tuning. It provides a variety of state-of-the-art hyperparameter tuning algorithms for optimizing model performance. What follows next is basic coverage of what a Tuner is and how you can use it for basic examples. If you are interested in reading more, please take a look at the Ray Tune documentation. Key Concepts There are a number of key concepts that dictate proper use of a Tuner: A set of hyperparameters you want to tune in a search space. A search algorithm to effectively optimize your parameters and optionally use a scheduler to stop searches early and speed up your experiments. The search space, search algorithm, scheduler, and Trainer are passed to a Tuner, which runs the hyperparameter tuning workload by evaluating multiple hyperparameters in parallel. Each individual hyperparameter evaluation run is called a trial. The Tuner returns its results in a ResultGrid. Tuners can also be used to launch hyperparameter tuning without using Ray AIR Trainers. See the Ray Tune documentation for more guides and examples. Basic usage Below, we demonstrate how you can use a Trainer object with a Tuner. import ray from ray import tune from ray.tune import Tuner from ray.train.xgboost import XGBoostTrainer dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") trainer = XGBoostTrainer( label_column="target", params={ "objective": "binary:logistic", "eval_metric": ["logloss", "error"], "max_depth": 4, }, datasets={"train": dataset}, ) # Create Tuner tuner = Tuner( trainer, # Add some parameters to tune param_space={"params": {"max_depth": tune.choice([4, 5, 6])}}, # Specify tuning behavior tune_config=tune.TuneConfig(metric="train-logloss", mode="min", num_samples=2), ) # Run tuning job tuner.fit() How to configure a search space? A Tuner takes in a param_space argument where you can define the search space from which hyperparameter configurations will be sampled. Depending on the model and dataset, you may want to tune: The training batch size The learning rate for deep learning training (e.g., image classification) The maximum depth for tree-based models (e.g., XGBoost) The following shows some example code on how to specify the param_space. 
XGBoost import ray from ray import tune from ray.tune import Tuner from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig, RunConfig dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") # Create an XGBoost trainer trainer = XGBoostTrainer( label_column="target", params={ "objective": "binary:logistic", "eval_metric": ["logloss", "error"], "max_depth": 4, }, num_boost_round=10, datasets={"train": dataset}, ) param_space = { # Tune parameters directly passed into the XGBoostTrainer "num_boost_round": tune.randint(5, 20), # `params` will be merged with the `params` defined in the above XGBoostTrainer "params": { "min_child_weight": tune.uniform(0.8, 1.0), # Below will overwrite the XGBoostTrainer setting "max_depth": tune.randint(1, 5), }, # Tune the number of distributed workers "scaling_config": ScalingConfig(num_workers=tune.grid_search([1, 2])), } tuner = Tuner( trainable=trainer, run_config=RunConfig(name="test_tuner"), param_space=param_space, tune_config=tune.TuneConfig( mode="min", metric="train-logloss", num_samples=2, max_concurrent_trials=2 ), ) result_grid = tuner.fit() Pytorch from ray import tune from ray.tune import Tuner from ray.train.examples.pytorch.torch_linear_example import ( train_func as linear_train_func, ) from ray.train.torch import TorchTrainer trainer = TorchTrainer( train_loop_per_worker=linear_train_func, train_loop_config={"lr": 1e-2, "batch_size": 4, "epochs": 10}, scaling_config=ScalingConfig(num_workers=1, use_gpu=False), ) param_space = { # The params will be merged with the ones defined in the TorchTrainer "train_loop_config": { # This is a parameter that hasn't been set in the TorchTrainer "hidden_size": tune.randint(1, 4), # This will overwrite whatever was set when TorchTrainer was instantiated "batch_size": tune.choice([4, 8]), }, # Tune the number of distributed workers "scaling_config": ScalingConfig(num_workers=tune.grid_search([1, 2])), } tuner = Tuner( trainable=trainer, run_config=RunConfig(name="test_tuner", storage_path="~/ray_results"), param_space=param_space, tune_config=tune.TuneConfig( mode="min", metric="loss", num_samples=2, max_concurrent_trials=2 ), ) result_grid = tuner.fit() Read more about Tune search spaces here. You can use a Tuner to tune most arguments and configurations in Ray AIR, including but not limited to: Ray Data Preprocessors Scaling configurations and other hyperparameters. There are a couple gotchas about parameter specification when using Tuners with Trainers: By default, configuration dictionaries and config objects will be deep-merged. Parameters that are duplicated in the Trainer and Tuner will be overwritten by the Tuner param_space. Exception: all arguments of the RunConfig and TuneConfig are inherently un-tunable. See Getting Data in and out of Tune for an example. How to configure a Tuner? There are two main configuration objects that can be passed into a Tuner: the TuneConfig and the RunConfig. The TuneConfig contains tuning specific settings, including: the tuning algorithm to use the metric and mode to rank results the amount of parallelism to use Here are some common configurations for TuneConfig: from ray.tune import TuneConfig from ray.tune.search.bayesopt import BayesOptSearch tune_config = TuneConfig( metric="loss", mode="min", max_concurrent_trials=10, num_samples=100, search_alg=BayesOptSearch(), ) See the TuneConfig API reference for more details. The RunConfig contains configurations that are more generic than tuning specific settings. 
This may include: failure/retry configurations verbosity levels the name of the experiment the logging directory checkpoint configurations custom callbacks integration with cloud storage Below we showcase some common configurations of RunConfig. from ray import air from ray.air.config import RunConfig run_config = RunConfig( name="MyExperiment", storage_path="s3://...", verbose=2, checkpoint_config=air.CheckpointConfig(checkpoint_frequency=2), ) See the RunConfig API reference for more details. How to specify parallelism? You can specify parallelism via the TuneConfig by setting the following flags: num_samples which specifies the number of trials to run in total max_concurrent_trials which specifies the max number of trials to run concurrently Note that actual parallelism can be less than max_concurrent_trials and will be determined by how many trials can fit in the cluster at once (i.e., if you have a trial that requires 16 GPUs, your cluster has 32 GPUs, and max_concurrent_trials=10, the Tuner can only run 2 trials concurrently). from ray.tune import TuneConfig config = TuneConfig( # ... num_samples=100, max_concurrent_trials=10, ) Read more about this in A Guide To Parallelism and Resources for Ray Tune section. How to specify an optimization algorithm? You can specify your hyperparameter optimization method via the TuneConfig by setting the following flags: search_alg which provides an optimizer for selecting the optimal hyperparameters scheduler which provides a scheduling/resource allocation algorithm for accelerating the search process from ray.tune.search.bayesopt import BayesOptSearch from ray.tune.schedulers import HyperBandScheduler from ray.tune import TuneConfig config = TuneConfig( # ... search_alg=BayesOptSearch(), scheduler=HyperBandScheduler(), ) Read more about this in the Search Algorithm and Scheduler section. How to analyze results? Tuner.fit() generates a ResultGrid object. This object contains metrics, results, and checkpoints of each trial. Below is a simple example: from ray.tune import Tuner, TuneConfig tuner = Tuner( trainable=trainer, param_space=param_space, tune_config=TuneConfig(mode="min", metric="loss", num_samples=5), ) result_grid = tuner.fit() num_results = len(result_grid) # Check if there have been errors if result_grid.errors: print("At least one trial failed.") # Get the best result best_result = result_grid.get_best_result() # And the best checkpoint best_checkpoint = best_result.checkpoint # And the best metrics best_metric = best_result.metrics # Or a dataframe for further analysis results_df = result_grid.get_dataframe() print("Shortest training time:", results_df["time_total_s"].min()) # Iterate over results for result in result_grid: if result.error: print("The trial had an error:", result.error) continue print("The trial finished successfully with the metrics:", result.metrics["loss"]) See Analyzing Tune Experiment Results for more usage examples. Advanced Tuning Tuners also offer the ability to tune different data preprocessing steps, as shown in the following snippet. 
from ray.data.preprocessors import StandardScaler
from ray.tune import Tuner

prep_v1 = StandardScaler(["worst radius", "worst area"])
prep_v2 = StandardScaler(["worst concavity", "worst smoothness"])

tuner = Tuner(
    trainer,
    param_space={
        "preprocessor": tune.grid_search([prep_v1, prep_v2]),
        # Your other parameters go here
    },
)

Additionally, you can sample different train/validation datasets:

def get_dataset():
    return ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

def get_another_dataset():
    # imagine this is a different dataset
    return ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

dataset_1 = get_dataset()
dataset_2 = get_another_dataset()

tuner = tune.Tuner(
    trainer,
    param_space={
        "datasets": {
            "train": tune.grid_search([dataset_1, dataset_2]),
        }
        # Your other parameters go here
    },
)

Restoring and resuming

A Tuner regularly saves its state, so that a tuning run can be resumed after being interrupted. Additionally, if trials fail during a tuning run, they can be retried, either from scratch or from the latest available checkpoint.

To restore the Tuner state, pass the path to the experiment directory as an argument to Tuner.restore(...). This path is obtained from the output of a tuning run, namely the "Result logdir". However, if you specify a name in the RunConfig, it is located under ~/ray_results/<name>.

tuner = Tuner.restore(
    path="~/ray_results/test_tuner", trainable=trainer, restart_errored=True
)
tuner.fit()

For more resume options, please see the documentation of Tuner.restore().

Using Predictors for Inference

Refer to the blog on Model Batch Inference in Ray for an overview of batch inference strategies in Ray and additional examples.

After you train a model, you will often want to use it for inference and prediction. To do so, you can use a Ray AIR Predictor. In this guide, we'll cover how to use the Predictor on different types of data.

What are predictors?

Ray AIR Predictors are classes that load models from Checkpoints to perform inference. Predictors are used by BatchPredictor and PredictorDeployment to do large-scale scoring or online inference.

Let's walk through basic usage of the Predictor. In the example below, we create a Checkpoint object from a model definition. Checkpoints can be generated in a variety of ways – see the Checkpoints user guide for more details. The checkpoint is then used to create a framework-specific Predictor (in our example, a TensorflowPredictor), which can then be used for inference:

import numpy as np
import tensorflow as tf

import ray
from ray.train.batch_predictor import BatchPredictor
from ray.train.tensorflow import (
    TensorflowCheckpoint,
    TensorflowPredictor,
)

def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=()),
            # Add feature dimension, expanding (batch_size,) to (batch_size, 1).
tf.keras.layers.Flatten(), tf.keras.layers.Dense(1), ] ) return model model = build_model() checkpoint = TensorflowCheckpoint.from_model(model) predictor = TensorflowPredictor.from_checkpoint( checkpoint, model_definition=build_model ) data = np.array([1, 2, 3, 4]) predictions = predictor.predict(data) print(predictions) # [[-1.6930283] # [-3.3860567] # [-5.079085 ] # [-6.7721133]] Predictors expose a predict method that accepts an input batch of type DataBatchType (which is a typing union of different standard Python ecosystem data types, such as Pandas Dataframe or Numpy Array) and outputs predictions of the same type as the input batch. Life of a prediction: Underneath the hood, when the Predictor.predict method is called the following occurs: The input batch is converted into a Pandas DataFrame. Tensor input (like a np.ndarray) will be converted into a single-column Pandas Dataframe. If there is a Preprocessor saved in the provided Checkpoint, the preprocessor will be used to transform the DataFrame. The transformed DataFrame will be passed to the model for inference. The predictions will be outputted by predict in the same type as the original input. Batch Prediction Ray AIR provides a BatchPredictor utility for large-scale batch inference. The BatchPredictor takes in a checkpoint and a predictor class and executes large-scale batch prediction on a given dataset in a parallel/distributed fashion when calling predict(). predict() will load the entire given dataset into memory, which may be a problem if your dataset size is larger than your available cluster memory. See the Lazy/Pipelined Prediction (experimental) section for a workaround. import pandas as pd from ray.train.batch_predictor import BatchPredictor batch_predictor = BatchPredictor( checkpoint, TensorflowPredictor, model_definition=build_model ) # Create a dummy dataset. ds = ray.data.from_pandas(pd.DataFrame({"feature_1": [1, 2, 3], "label": [1, 2, 3]})) # Use `feature_columns` to specify the input columns to your model. predictions = batch_predictor.predict(ds, feature_columns=["feature_1"]) print(predictions.show()) # {'predictions': array([-1.2789773], dtype=float32)} # {'predictions': array([-2.5579545], dtype=float32)} # {'predictions': array([-3.8369317], dtype=float32)} Additionally, you can compute metrics from the predictions. Do this by: specifying a function for computing metrics using keep_columns to keep the label column in the returned dataset using map_batches to compute metrics on a batch-by-batch basis Aggregate batch metrics via mean() def calculate_accuracy(df): return pd.DataFrame({"correct": int(df["predictions"][0]) == df["label"]}) predictions = batch_predictor.predict( ds, feature_columns=["feature_1"], keep_columns=["label"] ) print(predictions.show()) # {'predictions': array([-1.2789773], dtype=float32), 'label': 0} # {'predictions': array([-2.5579545], dtype=float32), 'label': 1} # {'predictions': array([-3.8369317], dtype=float32), 'label': 0} correct = predictions.map_batches(calculate_accuracy) print("Final accuracy: ", correct.mean(on="correct")) # Final accuracy: 0.5 Configuring Batch Prediction To configure the computation resources for your BatchPredictor, you have to set the following parameters in predict(): min_scoring_workers and max_scoring_workers The BatchPredictor will internally create an actor pool to autoscale the number of workers from [min, max] to execute your transforms. If not set, the auto-scaling range will be set to [1, inf) by default. 
num_gpus_per_worker: If you want to use GPUs for batch prediction, set this parameter explicitly. If it is not specified, the BatchPredictor will perform inference on CPUs by default.

num_cpus_per_worker: Set the number of CPUs for a worker.

separate_gpu_stage: If using GPUs, whether to use separate stages for GPU inference and data preprocessing. Enabled by default to avoid excessive preprocessing workload on GPU workers. You may disable it if your preprocessor is very lightweight.

Here are some examples:

1. Use multiple CPUs for Batch Prediction:

If num_gpus_per_worker is not specified, batch prediction uses CPUs by default. Two workers with 3 CPUs each:

predictions = batch_predictor.predict(
    ds,
    feature_columns=["feature_1"],
    min_scoring_workers=2,
    max_scoring_workers=2,
    num_cpus_per_worker=3,
)

2. Use multiple GPUs for Batch Prediction:

Two workers, each with 1 GPU and 1 CPU (by default):

predictions = batch_predictor.predict(
    ds,
    feature_columns=["feature_1"],
    min_scoring_workers=2,
    max_scoring_workers=2,
    num_gpus_per_worker=1,
)

3. Configure Auto-scaling:

Scale from 1 to 4 workers, depending on your dataset size and cluster resources. If no min/max values are provided, BatchPredictor will scale from 1 to inf workers by default.

predictions = batch_predictor.predict(
    ds,
    feature_columns=["feature_1"],
    min_scoring_workers=1,
    max_scoring_workers=4,
    num_cpus_per_worker=3,
)

Developer Guide: Implementing your own Predictor

If you're using an unsupported framework, or if built-in predictors are too inflexible, you may need to implement a custom predictor.

To implement a custom Predictor, subclass Predictor and implement:

__init__()
_predict_numpy() or _predict_pandas()
from_checkpoint()

You don't need to implement both _predict_numpy() and _predict_pandas(). Pick the method that's easiest to implement. If both are implemented, override preferred_batch_format() to specify which format is more performant. This allows upstream producers to choose the best format.

Examples

We'll walk through how to implement a predictor for two frameworks:

MXNet – a deep learning framework like Torch.
statsmodels – a Python library that provides regression and linear models.

For more examples, read the source code of built-in predictors like TorchPredictor, XGBoostPredictor, and SklearnPredictor.

Before you begin

MXNet

First, install MXNet and Ray AIR.

pip install mxnet 'ray[air]'

Then, import the objects required for this example.

import os
from typing import Dict, Optional, Union

import mxnet as mx
import numpy as np
from mxnet import gluon

import ray
from ray.air import Checkpoint
from ray.data.preprocessor import Preprocessor
from ray.data.preprocessors import BatchMapper
from ray.train.batch_predictor import BatchPredictor
from ray.train.predictor import Predictor

Finally, create a stub for the MXNetPredictor class.

class MXNetPredictor(Predictor):
    ...

statsmodels

First, install statsmodels and Ray AIR.

pip install statsmodels 'ray[air]'

Then, import the objects required for this example.

import os
from typing import Optional

import numpy as np  # noqa: F401
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.base.model import Results
from statsmodels.regression.linear_model import OLSResults

import ray
from ray.air import Checkpoint
from ray.data.preprocessor import Preprocessor
from ray.train.batch_predictor import BatchPredictor
from ray.train.predictor import Predictor

Finally, create a stub for the StatsmodelPredictor class.
class StatsmodelPredictor(Predictor):
    ...

Create a model

MXNet

You'll need to pass a model to the MXNetPredictor constructor. To create the model, load a pre-trained computer vision model from the MXNet model zoo.

net = gluon.model_zoo.vision.resnet50_v1(pretrained=True)

statsmodels

You'll need to pass a model to the StatsmodelPredictor constructor. To create the model, fit a linear model on the Guerry dataset.

data: pd.DataFrame = sm.datasets.get_rdataset("Guerry", "HistData").data
results = smf.ols("Lottery ~ Literacy + np.log(Pop1831)", data=data).fit()

Implement __init__

MXNet

Use the constructor to set instance attributes required for prediction. In the code snippet below, we assign the model to an attribute named net.

def __init__(
    self,
    net: gluon.Block,
    preprocessor: Optional[Preprocessor] = None,
):
    self.net = net
    super().__init__(preprocessor)

You must call the base class' constructor; otherwise, Predictor.predict raises a NotImplementedError.

statsmodels

Use the constructor to set instance attributes required for prediction. In the code snippet below, we assign the fitted model to an attribute named results.

def __init__(self, results: Results, preprocessor: Optional[Preprocessor] = None):
    self.results = results
    super().__init__(preprocessor)

You must call the base class' constructor; otherwise, Predictor.predict raises a NotImplementedError.

Implement from_checkpoint

MXNet

from_checkpoint() creates a Predictor from a Checkpoint. Before implementing from_checkpoint(), save the model parameters to a directory, and create a Checkpoint from that directory.

os.makedirs("checkpoint", exist_ok=True)
net.save_parameters("checkpoint/net.params")
checkpoint = Checkpoint.from_directory("checkpoint")

Then, implement from_checkpoint().

@classmethod
def from_checkpoint(
    cls,
    checkpoint: Checkpoint,
    net: gluon.Block,
) -> Predictor:
    with checkpoint.as_directory() as directory:
        path = os.path.join(directory, "net.params")
        net.load_parameters(path)
    return cls(net, preprocessor=checkpoint.get_preprocessor())

statsmodels

from_checkpoint() creates a Predictor from a Checkpoint. Before implementing from_checkpoint(), save the fitted model to a directory, and create a Checkpoint from that directory.

os.makedirs("checkpoint", exist_ok=True)
results.save("checkpoint/guerry.pickle")
checkpoint = Checkpoint.from_directory("checkpoint")

Then, implement from_checkpoint().

@classmethod
def from_checkpoint(
    cls,
    checkpoint: Checkpoint,
    filename: str,
) -> Predictor:
    with checkpoint.as_directory() as directory:
        path = os.path.join(directory, filename)
        results = OLSResults.load(path)
    return cls(results, checkpoint.get_preprocessor())

Implement _predict_numpy or _predict_pandas

MXNet

Because MXNet models accept tensors as input, you should implement _predict_numpy().

_predict_numpy() performs inference on a batch of NumPy data. It accepts a np.ndarray or dict[str, np.ndarray] as input and returns a np.ndarray or dict[str, np.ndarray] as output. The input type is determined by the type of Dataset passed to BatchPredictor.predict. If your dataset has columns, the input is a dict; otherwise, the input is a np.ndarray.

def _predict_numpy(
    self,
    data: Union[np.ndarray, Dict[str, np.ndarray]],
    dtype: Optional[np.dtype] = None,
) -> Dict[str, np.ndarray]:
    # If `data` looks like `{"features": array([...])}`, unwrap the `dict` and pass
    # the array directly to the model.
if isinstance(data, dict) and len(data) == 1: data = next(iter(data.values())) inputs = mx.nd.array(data, dtype=dtype) outputs = self.net(inputs).asnumpy() return {"predictions": outputs} Because your OLS model accepts dataframes as input, you should implement _predict_pandas(). _predict_pandas() performs inference on a batch of pandas data. It accepts a pandas.DataFrame as input and return a pandas.DataFrame as output. def _predict_pandas(self, data: pd.DataFrame) -> pd.DataFrame: predictions: pd.Series = self.results.predict(data) return predictions.to_frame(name="predictions") Perform inference MXNet statsmodel To perform inference with the completed MXNetPredictor: Create a Preprocessor and set it in the Checkpoint. You can also use any of the out-of-the-box preprocessors instead of implementing your own: Preprocessor. Create a BatchPredictor from your checkpoint. Read sample images into a Dataset. Call predict to classify the images in the dataset. # These images aren't normalized. In practice, normalize images before inference. dataset = ray.data.read_images( "s3://anonymous@air-example-data-2/imagenet-sample-images", size=(224, 224) ) def preprocess(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: # (B, H, W, C) -> (B, C, H, W) batch["image"] = batch["image"].transpose(0, 3, 1, 2) return batch # Create the preprocessor and set it in the checkpoint. # This preprocessor will be used to transform the data prior to prediction. preprocessor = BatchMapper(preprocess, batch_format="numpy") checkpoint.set_preprocessor(preprocessor=preprocessor) predictor = BatchPredictor.from_checkpoint( checkpoint, MXNetPredictor, net=net ) predictor.predict(dataset) To perform inference with the completed StatsmodelPredictor: Create a BatchPredictor from your checkpoint. Read the Guerry dataset into a Dataset. Call predict to perform regression on the samples in the dataset. predictor = BatchPredictor.from_checkpoint( checkpoint, StatsmodelPredictor, filename="guerry.pickle" ) # This is the same data we trained our model on. Don't do this in practice. dataset = ray.data.from_pandas(data) predictions = predictor.predict(dataset) predictions.show() Lazy/Pipelined Prediction (experimental) If you have a large dataset but not a lot of available memory, you can use the predict_pipelined method. Unlike predict() which will load the entire data into memory, predict_pipelined will create a DatasetPipeline object, which will lazily load the data and perform inference on a smaller batch of data at a time. The lazy loading of the data will allow you to operate on datasets much greater than your available memory. Execution can be triggered by pulling from the pipeline, as shown in the example below. import pandas as pd import ray from ray.air import Checkpoint from ray.train.predictor import Predictor from ray.train.batch_predictor import BatchPredictor # Create a BatchPredictor that always returns `42` for each input. batch_pred = BatchPredictor.from_pandas_udf( lambda data: pd.DataFrame({"a": [42] * len(data)}) ) # Create a dummy dataset. ds = ray.data.range_tensor(200, parallelism=4) # Setup a prediction pipeline. pipeline = batch_pred.predict_pipelined(ds, blocks_per_window=1) for batch in pipeline.iter_batches(): print("Pipeline result", batch) # 0 42 # 1 42 # ... Online Inference Check out the Deploying Predictors with Serve for details on how to perform online inference with AIR. 
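As a quick preview of online inference with a custom predictor, the following sketch deploys the statsmodels checkpoint built above behind a Serve endpoint. This is a minimal, illustrative sketch: the deployment name "GuerryOLS" is an arbitrary choice, and the pandas_read_json adapter and the forwarded filename keyword argument follow the same patterns used in the Serve guide later in this document.

from ray import serve
from ray.serve import PredictorDeployment
from ray.serve.http_adapters import pandas_read_json

# Deploy the custom statsmodels predictor behind an HTTP endpoint.
# Extra keyword arguments (here, `filename`) are forwarded to
# StatsmodelPredictor.from_checkpoint, mirroring the BatchPredictor usage above.
serve.run(
    PredictorDeployment.options(name="GuerryOLS").bind(
        predictor_cls=StatsmodelPredictor,
        checkpoint=checkpoint,
        http_adapter=pandas_read_json,
        filename="guerry.pickle",
    )
)

A client could then POST pandas-parsable JSON to http://localhost:8000/ and receive regression predictions back, as demonstrated with other predictors in the Serve guide below.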
Computer Vision This guide explains how to perform common computer vision tasks like: Reading image data Transforming images Training vision models Batch predicting images Serving vision models Reading image data Raw images Datasets like ImageNet store files like this: root/dog/xxx.png root/dog/xxy.png root/dog/[...]/xxz.png root/cat/123.png root/cat/nsdf3.png root/cat/[...]/asd932_.png To load images stored in this layout, read the raw images and include the class names. import ray from ray.data.datasource.partitioning import Partitioning root = "s3://anonymous@air-example-data/cifar-10/images" partitioning = Partitioning("dir", field_names=["class"], base_dir=root) dataset = ray.data.read_images(root, partitioning=partitioning) Then, apply a user-defined function to encode the class names as integer targets. from typing import Dict import numpy as np CLASS_TO_LABEL = { "airplane": 0, "automobile": 1, "bird": 2, "cat": 3, "deer": 4, "dog": 5, "frog": 6, "horse": 7, "ship": 8, "truck": 9, } def add_label_column(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: labels = [] for name in batch["class"]: label = CLASS_TO_LABEL[name] labels.append(label) batch["label"] = np.array(labels) return batch def remove_class_column(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: del batch["class"] return batch dataset = dataset.map_batches(add_label_column).map_batches(remove_class_column) You can also use LabelEncoder to encode labels. NumPy To load NumPy arrays into a Dataset, separately read the image and label arrays. import ray images = ray.data.read_numpy("s3://anonymous@air-example-data/cifar-10/images.npy") labels = ray.data.read_numpy("s3://anonymous@air-example-data/cifar-10/labels.npy") Then, combine the datasets and rename the columns. dataset = images.zip(labels) dataset = dataset.map_batches( lambda batch: batch.rename(columns={"data": "image", "data_1": "label"}), batch_format="pandas", ) TFRecords Image datasets often contain tf.train.Example messages that look like this: features { feature { key: "image" value { bytes_list { value: ... # Raw image bytes } } } feature { key: "label" value { int64_list { value: 3 } } } } To load examples stored in this format, read the TFRecords into a Dataset. import ray dataset = ray.data.read_tfrecords( "s3://anonymous@air-example-data/cifar-10/tfrecords" ) Then, apply a user-defined function to decode the raw image bytes. import io from typing import Dict import numpy as np from PIL import Image def decode_bytes(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: images = [] for data in batch["image"]: image = Image.open(io.BytesIO(data)) images.append(np.array(image)) batch["image"] = np.array(images) return batch dataset = dataset.map_batches(decode_bytes, batch_format="numpy") Parquet To load image data stored in Parquet files, call ray.data.read_parquet(). import ray dataset = ray.data.read_parquet("s3://anonymous@air-example-data/cifar-10/parquet") For more information on creating datasets, see Loading Data. Transforming images To transform images, create a Preprocessor. They’re the standard way to preprocess data with Ray. Torch To apply TorchVision transforms, create a TorchVisionPreprocessor. Create two TorchVisionPreprocessors – one to normalize images, and another to augment images. Later, you’ll pass the preprocessors to Trainers, Predictors, and PredictorDeployments. 
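# The first preprocessor normalizes images and is passed to the Trainer;
# the per-epoch preprocessor randomly flips images for augmentation each epoch.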
from torchvision import transforms

from ray.data.preprocessors import TorchVisionPreprocessor

transform = transforms.Compose([transforms.ToTensor(), transforms.CenterCrop(224)])
preprocessor = TorchVisionPreprocessor(columns=["image"], transform=transform)

per_epoch_transform = transforms.RandomHorizontalFlip(p=0.5)
per_epoch_preprocessor = TorchVisionPreprocessor(
    columns=["image"], transform=per_epoch_transform
)

TensorFlow

To preprocess images with TensorFlow, create a BatchMapper. Create two BatchMappers – one to normalize images, and another to augment images. Later, you'll pass the preprocessors to Trainers, Predictors, and PredictorDeployments.

from typing import Dict

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import imagenet_utils

from ray.data.preprocessors import BatchMapper

def preprocess(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    batch["image"] = imagenet_utils.preprocess_input(batch["image"])
    batch["image"] = tf.image.resize(batch["image"], (224, 224)).numpy()
    return batch

preprocessor = BatchMapper(preprocess, batch_format="numpy")

def augment(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    batch["image"] = tf.image.random_flip_left_right(batch["image"]).numpy()
    return batch

per_epoch_preprocessor = BatchMapper(augment, batch_format="numpy")

For more information on transforming data, see Using Preprocessors and Transforming Data.

Training vision models

Trainers let you train models in parallel.

Torch

To train a vision model, define the training loop per worker.

import torch.nn as nn
import torch.optim as optim
from torchvision import models

from ray import train
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchCheckpoint, TorchTrainer

def train_one_epoch(model, *, criterion, optimizer, batch_size, epoch):
    dataset_shard = session.get_dataset_shard("train")

    running_loss = 0
    for i, batch in enumerate(
        dataset_shard.iter_torch_batches(
            batch_size=batch_size, local_shuffle_buffer_size=256
        )
    ):
        inputs, labels = batch["image"], batch["label"]

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 2000 == 1999:
            session.report(
                metrics={
                    "epoch": epoch,
                    "batch": i,
                    "running_loss": running_loss / 2000,
                },
                checkpoint=TorchCheckpoint.from_model(model),
            )
            running_loss = 0

def train_loop_per_worker(config):
    model = train.torch.prepare_model(models.resnet50())
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=config["lr"])

    for epoch in range(config["epochs"]):
        train_one_epoch(
            model,
            criterion=criterion,
            optimizer=optimizer,
            batch_size=config["batch_size"],
            epoch=epoch,
        )

Then, create a TorchTrainer and call fit().

dataset = per_epoch_preprocessor.transform(dataset)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 32, "lr": 0.02, "epochs": 1},
    datasets={"train": dataset},
    scaling_config=ScalingConfig(num_workers=2),
    preprocessor=preprocessor,
)
results = trainer.fit()

For more in-depth examples, read Training a Torch Image Classifier and Using Trainers.

TensorFlow

To train a vision model, define the training loop per worker.
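# Each worker reads its shard of the "train" dataset and builds and compiles
# the Keras model inside the MultiWorkerMirroredStrategy scope.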
import tensorflow as tf from ray.air import session from ray.air.integrations.keras import ReportCheckpointCallback def train_loop_per_worker(config): strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() train_shard = session.get_dataset_shard("train") train_dataset = train_shard.to_tf( "image", "label", batch_size=config["batch_size"], local_shuffle_buffer_size=256, ) with strategy.scope(): model = tf.keras.applications.resnet50.ResNet50(weights=None) optimizer = tf.keras.optimizers.Adam(config["lr"]) model.compile( optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"], ) model.fit( train_dataset, epochs=config["epochs"], callbacks=[ReportCheckpointCallback()], ) Then, create a TensorflowTrainer and call fit(). from ray.air import ScalingConfig from ray.train.tensorflow import TensorflowTrainer # The following transform operation is lazy. # It will be re-run every epoch. dataset = per_epoch_preprocessor.transform(dataset) trainer = TensorflowTrainer( train_loop_per_worker=train_loop_per_worker, train_loop_config={"batch_size": 32, "lr": 0.02, "epochs": 1}, datasets={"train": dataset}, scaling_config=ScalingConfig(num_workers=2), preprocessor=preprocessor, ) results = trainer.fit() For more information, read Using Trainers. Creating checkpoints Checkpoints are required for batch inference and model serving. They contain model state and optionally a preprocessor. If you’re going from training to prediction, don’t create a new checkpoint. Trainer.fit() returns a Result object. Use Result.checkpoint instead. Torch To create a TorchCheckpoint, pass a Torch model and the Preprocessor you created in Transforming images to TorchCheckpoint.from_model(). from torchvision import models from ray.train.torch import TorchCheckpoint model = models.resnet50(pretrained=True) checkpoint = TorchCheckpoint.from_model(model, preprocessor=preprocessor) TensorFlow To create a TensorflowCheckpoint, pass a TensorFlow model and the Preprocessor you created in Transforming images to TensorflowCheckpoint.from_model(). import tensorflow as tf from ray.train.tensorflow import TensorflowCheckpoint model = tf.keras.applications.resnet50.ResNet50() checkpoint = TensorflowCheckpoint.from_model(model, preprocessor=preprocessor) Batch predicting images BatchPredictor lets you perform inference on large image datasets. Torch To create a BatchPredictor, call BatchPredictor.from_checkpoint and pass the checkpoint you created in Creating checkpoints. from ray.train.batch_predictor import BatchPredictor from ray.train.torch import TorchPredictor predictor = BatchPredictor.from_checkpoint(checkpoint, TorchPredictor) predictor.predict(dataset, feature_columns=["image"], keep_columns=["label"]) For more in-depth examples, read Using Predictors for Inference. TensorFlow To create a BatchPredictor, call BatchPredictor.from_checkpoint and pass the checkpoint you created in Creating checkpoints. import tensorflow as tf from ray.train.batch_predictor import BatchPredictor from ray.train.tensorflow import TensorflowPredictor predictor = BatchPredictor.from_checkpoint( checkpoint, TensorflowPredictor, model_definition=tf.keras.applications.resnet50.ResNet50, ) predictor.predict(dataset, feature_columns=["image"], keep_columns=["label"]) For more information, read Using Predictors for Inference. Serving vision models PredictorDeployment lets you deploy a model to an endpoint and make predictions over the Internet. 
Deployments use HTTP adapters to define how HTTP messages are converted to model inputs. For example, json_to_ndarray() converts HTTP messages like this: {"array": [[1, 2], [3, 4]]} To NumPy ndarrays like this: array([[1., 2.], [3., 4.]]) Torch To deploy a Torch model to an endpoint, pass the checkpoint you created in Creating checkpoints to PredictorDeployment.bind and specify json_to_ndarray() as the HTTP adapter. from ray import serve from ray.serve import PredictorDeployment from ray.serve.http_adapters import json_to_ndarray from ray.train.torch import TorchPredictor serve.run( PredictorDeployment.bind( TorchPredictor, checkpoint, http_adapter=json_to_ndarray, ) ) Then, make a request to classify an image. from io import BytesIO import numpy as np import requests from PIL import Image response = requests.get("http://placekitten.com/200/300") image = Image.open(BytesIO(response.content)) payload = {"array": np.array(image).tolist(), "dtype": "float32"} response = requests.post("http://localhost:8000/", json=payload) predictions = response.json() For more in-depth examples, read Training a Torch Image Classifier and Deploying Predictors with Serve. TensorFlow To deploy a TensorFlow model to an endpoint, pass the checkpoint you created in Creating checkpoints to PredictorDeployment.bind and specify json_to_multi_ndarray() as the HTTP adapter. import tensorflow as tf from ray import serve from ray.serve import PredictorDeployment from ray.serve.http_adapters import json_to_multi_ndarray from ray.train.tensorflow import TensorflowPredictor serve.run( PredictorDeployment.bind( TensorflowPredictor, checkpoint, http_adapter=json_to_multi_ndarray, model_definition=tf.keras.applications.resnet50.ResNet50, ) ) Then, make a request to classify an image. from io import BytesIO import numpy as np import requests from PIL import Image response = requests.get("http://placekitten.com/200/300") image = Image.open(BytesIO(response.content)) payload = {"image": {"array": np.array(image).tolist(), "dtype": "float32"}} response = requests.post("http://localhost:8000/", json=payload) predictions = response.json() For more information, read Deploying Predictors with Serve. Deploying Predictors with Serve Ray Serve is the recommended tool to deploy models trained with AIR. After training a model with Ray Train, you can serve a model using Ray Serve. In this guide, we will cover how to use Ray AIR’s PredictorDeployment, Predictor, and Checkpoint abstractions to quickly deploy a model for online inference. But before that, let’s review the key concepts: Checkpoint represents a trained model stored in memory, file, or remote uri. Predictors understand how to perform a model inference given checkpoints and the model definition. Ray AIR comes with predictors for each supported frameworks. Deployment is a Ray Serve construct that represent an HTTP endpoint along with scalable pool of models. The core concept for model deployment is the PredictorDeployment. The PredictorDeployment takes a predictor class and a checkpoint and transforms them into a live HTTP endpoint. We’ll start with a simple quick-start demo showing how you can use the PredictorDeployment to deploy your model for online inference. Let’s first make sure Ray AIR is installed. For the quick-start, we’ll also use Ray AIR to train and serve a XGBoost model. !pip install "ray[air]" xgboost scikit-learn You can find the preprocessor and trainer in the key concepts walk-through. 
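# Train a small XGBoost model on the breast cancer dataset; the checkpoint in
# result.checkpoint is what gets deployed with Serve below.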
import ray import pandas as pd from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_split from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig from ray.data.preprocessors import StandardScaler data_raw = load_breast_cancer() dataset_df = pd.DataFrame(data_raw["data"], columns=data_raw["feature_names"]) dataset_df["target"] = data_raw["target"] train_df, test_df = train_test_split(dataset_df, test_size=0.3) train_dataset = ray.data.from_pandas(train_df) valid_dataset = ray.data.from_pandas(test_df) test_dataset = ray.data.from_pandas(test_df.drop("target", axis=1)) # Define preprocessor columns_to_scale = ["mean radius", "mean texture"] preprocessor = StandardScaler(columns=columns_to_scale) # Define trainer trainer = XGBoostTrainer( scaling_config=ScalingConfig(num_workers=1), label_column="target", params={ "tree_method": "approx", "objective": "binary:logistic", "eval_metric": ["logloss", "error"], "max_depth": 2, }, datasets={"train": train_dataset, "valid": valid_dataset}, preprocessor=preprocessor, num_boost_round=5, ) result = trainer.fit() 2022-06-02 19:31:31,356 INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8265 == Status ==
Current time: 2022-06-02 19:31:48 (running for 00:00:13.38)
Memory usage on this node: 37.9/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/25.71 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/simonmo/ray_results/XGBoostTrainer_2022-06-02_19-31-34
Number of trials: 1/1 (1 TERMINATED)
Trial name                  status      loc              iter  total time (s)  train-logloss  train-error  valid-logloss
XGBoostTrainer_4930d_00000  TERMINATED  127.0.0.1:60303     5         8.72108       0.190254     0.035176        0.20535


(GBDTTrainable pid=60303) UserWarning: `num_actors` in `ray_params` is smaller than 2 (1). XGBoost will NOT be distributed! (GBDTTrainable pid=60303) 2022-06-02 19:31:42,283 INFO main.py:980 -- [RayXGBoost] Created 1 new actors (1 total actors). Waiting until actors are ready for training. (GBDTTrainable pid=60303) 2022-06-02 19:31:46,324 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training. (_RemoteRayXGBoostActor pid=60578) [19:31:46] task [xgboost.ray]:140298197243216 got new rank 0 Result for XGBoostTrainer_4930d_00000: date: 2022-06-02_19-31-47 done: false experiment_id: 171c25bee8e7490f933cc082daf7e6e0 hostname: Simons-MacBook-Pro.local iterations_since_restore: 1 node_ip: 127.0.0.1 pid: 60303 should_checkpoint: true time_since_restore: 8.666727781295776 time_this_iter_s: 8.666727781295776 time_total_s: 8.666727781295776 timestamp: 1654223507 timesteps_since_restore: 0 train-error: 0.047739 train-logloss: 0.483805 training_iteration: 1 trial_id: 4930d_00000 valid-error: 0.05848 valid-logloss: 0.488357 warmup_time: 0.0035247802734375 (GBDTTrainable pid=60303) 2022-06-02 19:31:47,421 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=398 in 5.16 seconds (1.09 pure XGBoost training time). Result for XGBoostTrainer_4930d_00000: date: 2022-06-02_19-31-47 done: true experiment_id: 171c25bee8e7490f933cc082daf7e6e0 experiment_tag: '0' hostname: Simons-MacBook-Pro.local iterations_since_restore: 5 node_ip: 127.0.0.1 pid: 60303 should_checkpoint: true time_since_restore: 8.72108268737793 time_this_iter_s: 0.011542558670043945 time_total_s: 8.72108268737793 timestamp: 1654223507 timesteps_since_restore: 0 train-error: 0.035176 train-logloss: 0.190254 training_iteration: 5 trial_id: 4930d_00000 valid-error: 0.046784 valid-logloss: 0.20535 warmup_time: 0.0035247802734375 2022-06-02 19:31:48,266 INFO tune.py:753 -- Total run time: 13.77 seconds (13.38 seconds for the tuning loop). The following block serves a Ray AIR model from a checkpoint, using the built-in XGBoostPredictor. from ray.train.xgboost import XGBoostPredictor from ray import serve from ray.serve import PredictorDeployment from ray.serve.http_adapters import pandas_read_json serve.run( PredictorDeployment.options(name="XGBoostService").bind( XGBoostPredictor, result.checkpoint, http_adapter=pandas_read_json ) ) (ServeController pid=60981) INFO 2022-06-02 19:31:52,825 controller 60981 checkpoint_path.py:17 - Using RayInternalKVStore for controller checkpoint and recovery. (ServeController pid=60981) INFO 2022-06-02 19:31:52,828 controller 60981 http_state.py:115 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:127.0.0.1-0' on node 'node:127.0.0.1-0' listening on '127.0.0.1:8000' (HTTPProxyActor pid=60984) INFO: Started server process [60984] (ServeController pid=60981) INFO 2022-06-02 19:31:55,191 controller 60981 deployment_state.py:1221 - Adding 1 replicas to deployment 'XGBoostService'. Let’s send a request through HTTP. 
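# Take one sample row from the test dataset and POST it to the endpoint as JSON.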
import requests

sample_input = test_dataset.take(1)
sample_input = dict(sample_input[0])

output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)

[{'predictions': 0.1142289936542511}]

(HTTPProxyActor pid=60984) INFO 2022-06-02 19:32:00,604 http_proxy 127.0.0.1 http_proxy.py:320 - POST /XGBoostService 307 5.4ms
(XGBoostService pid=60988) INFO 2022-06-02 19:32:00,603 XGBoostService XGBoostService#LOYoUm replica.py:484 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=60984) INFO 2022-06-02 19:32:00,658 http_proxy 127.0.0.1 http_proxy.py:320 - POST /XGBoostService 200 49.8ms
(XGBoostService pid=60988) INFO 2022-06-02 19:32:00,656 XGBoostService XGBoostService#LOYoUm replica.py:484 - HANDLE __call__ OK 46.8ms

It works! As you can see, you can use the PredictorDeployment to deploy checkpoints trained in Ray AIR as live endpoints. You can find more end-to-end examples for your specific frameworks in the examples page.

This tutorial aims to provide an in-depth understanding of PredictorDeployments. In particular, it'll demonstrate:

How to serve a predictor accepting array input.
How to serve a predictor accepting dataframe input.
How to serve a predictor accepting custom input that can be transformed to array or dataframe.
How to configure micro-batching to enhance performance.

1. Predictor accepting NumPy array

We'll use a simple predictor implementation that adds an increment to an input array.

import numpy as np

from ray.train.predictor import Predictor
from ray.air.checkpoint import Checkpoint

class AdderPredictor(Predictor):
    """Dummy predictor that increments input by a static value."""

    def __init__(self, increment: int):
        self.increment = increment

    @classmethod
    def from_checkpoint(cls, ckpt: Checkpoint):
        """Create predictor from checkpoint.

        Args:
            ckpt: The AIR checkpoint representing a single dictionary. The
                dictionary should have key `increment` and an integer value.
        """
        return cls(ckpt.to_dict()["increment"])

    def predict(self, inp: np.ndarray) -> np.ndarray:
        return inp + self.increment

Let's first test it locally.

local_checkpoint = Checkpoint.from_dict({"increment": 2})
local_predictor = AdderPredictor.from_checkpoint(local_checkpoint)
assert local_predictor.predict(np.array([40])) == np.array([42])

It worked! Now let's serve it behind HTTP. In Ray Serve, the core unit of an HTTP service is called a Deployment. It turns a Python class into a queryable HTTP endpoint. For Ray AIR, Serve provides a PredictorDeployment to simplify this transformation. You don't need to implement any Python classes; you just pass in your predictor and checkpoint instead.

The deployment takes several arguments. It requires two arguments to start:

predictor_cls (Type[Predictor] | str): The predictor Python class. Typically you can use built-in integrations from Ray AIR like the TorchPredictor. Alternatively, you can specify the class path to import a predictor like "ray.air.integrations.torch.TorchPredictor".
checkpoint (Checkpoint | str): A checkpoint instance, or a URI to load the checkpoint from.

The following cell showcases how to create a deployment with our AdderPredictor. To learn more about Ray Serve, check out its documentation.
from ray import serve from ray.serve import PredictorDeployment # Deploy the model behind HTTP endpoint serve.run( PredictorDeployment.options(name="Adder").bind( predictor_cls=AdderPredictor, checkpoint=local_checkpoint ) ) (ServeController pid=60981) INFO 2022-06-02 19:32:07,559 controller 60981 deployment_state.py:1221 - Adding 1 replicas to deployment 'Adder'. After the model has been deployed, let’s send an HTTP request. import requests resp = requests.post("http://localhost:8000/", json={"array": [40]}) resp.raise_for_status() resp.json() [42.0] (HTTPProxyActor pid=60984) INFO 2022-06-02 19:32:18,864 http_proxy 127.0.0.1 http_proxy.py:320 - POST /Adder 200 18.0ms (Adder pid=60999) INFO 2022-06-02 19:32:18,863 Adder Adder#aqYgDS replica.py:484 - HANDLE __call__ OK 13.1ms Nice! We sent [40] as our input and got [42] as our output in JSON format. You can also specify multi-dimensional arrays in the JSON payload, as well as “dtype” and “shape” fields to process to array. For more information about the array input schema, see Ndarray. That’s it for arrays! Let’s take a look at tabular input. 2. Predictor accepting Pandas DataFrame Let’s now take a look at a predictor accepting dataframe inputs. We’ll perform some simple column-wise transformations on the input data. import pandas as pd class DataFramePredictor(Predictor): """Dummy predictor that first multiplies input then increment it.""" def __init__(self, increment: int): self.increment = increment @classmethod def from_checkpoint(cls, ckpt: Checkpoint): return cls(ckpt.to_dict()["increment"]) def predict(self, inp: pd.DataFrame) -> pd.DataFrame: inp["prediction"] = inp["base"] * inp["multiplier"] + self.increment return inp local_df_predictor = DataFramePredictor.from_checkpoint(local_checkpoint) Just like the AdderPredictor, we’ll use the same PredictorDeployment approach to make it queryable with HTTP. Note that we added http_adapter=pandas_read_json as the keyword argument. This tells Serve how to convert incoming JSON requests into a DataFrame. The pandas_read_json adapter accepts: Pandas-parsable JSON in HTTP body Optional keyword arguments to the pandas.read_json function via HTTP url parameters. To learn more, see HTTP Adapters. You might wonder why the previous array predictor doesn’t need to specify any http adapter. This is because Ray Serve defaults to a built-in adapter called json_to_ndarray(ray.serve.http_adapters.json_to_ndarray)! from ray.serve.http_adapters import pandas_read_json serve.run( PredictorDeployment.options(name="DataFramePredictor").bind( predictor_cls=DataFramePredictor, checkpoint=local_checkpoint, http_adapter=pandas_read_json ) ) (ServeController pid=60981) INFO 2022-06-02 19:32:24,396 controller 60981 deployment_state.py:1221 - Adding 1 replicas to deployment 'DataFramePredictor'. Let’s send a request to our endpoint. resp = requests.post( "http://localhost:8000/", json=[{"base": 1, "multiplier": 2}, {"base": 3, "multiplier": 4}], params={"orient": "records"}, ) resp.raise_for_status() resp.text '[{"base":1,"multiplier":2,"prediction":4},{"base":3,"multiplier":4,"prediction":14}]' (HTTPProxyActor pid=60984) INFO 2022-06-02 19:32:28,751 http_proxy 127.0.0.1 http_proxy.py:320 - POST /DataFramePredictor 200 21.0ms (DataFramePredictor pid=61006) INFO 2022-06-02 19:32:28,750 DataFramePredictor DataFramePredictor#IJcHCI replica.py:484 - HANDLE __call__ OK 17.2ms Great! 
You can see that the input JSON has been converted to a dataframe, so our predictor can work with pure dataframes instead of raw HTTP requests. But what if we need to configure the HTTP request? You can do that as well. 3. Accepting custom inputs via http_adapter The http_adapter field accepts any callable function that’s type annotated. You can also bring in additional types that are accepted by FastAPI’s dependency injection framework. For more detail, see HTTP Adapters. In the following example, instead of using the pandas adapter Serve provides, we’ll implement our own request adapter that works with just http parameters instead of JSON. def our_own_http_adapter(base: int, multiplier: int): return pd.DataFrame([{"base": base, "multiplier": multiplier}]) Let’s deploy it. from ray.serve.http_adapters import pandas_read_json serve.run( PredictorDeployment.options(name="DataFramePredictor").bind( predictor_cls=DataFramePredictor, checkpoint=local_checkpoint, http_adapter=our_own_http_adapter ) ) (ServeController pid=60981) INFO 2022-06-02 19:33:31,010 controller 60981 deployment_state.py:1180 - Stopping 1 replicas of deployment 'DataFramePredictor' with outdated versions. (ServeController pid=60981) INFO 2022-06-02 19:33:33,165 controller 60981 deployment_state.py:1221 - Adding 1 replicas to deployment 'DataFramePredictor'. Let’s now send a request. Note that the new predictor accepts our specified input via HTTP parameters. The equivalent curl request would be curl -X POST http://localhost:8000/DataFramePredictor/?base=10&multiplier=4. resp = requests.post( "http://localhost:8000/", params={"base": 10, "multiplier": 4} ) resp.raise_for_status() resp.text '[{"base":10,"multiplier":4,"prediction":42}]' (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:36,070 http_proxy 127.0.0.1 http_proxy.py:320 - POST /DataFramePredictor 200 21.6ms (DataFramePredictor pid=61037) INFO 2022-06-02 19:33:36,069 DataFramePredictor DataFramePredictor#QzQiec replica.py:484 - HANDLE __call__ OK 17.5ms 4. PredictorDeployment performs microbatching to improve performance Common machine learning models take a batch of inputs for prediction. Common ML Frameworks are optimized with vectorized instruction to make inference on batch requests almost as fast as single requests. In Serve’s PredictorDeployment, the incoming requests are automatically batched. When multiple clients send requests at the same time, Serve will combine the requests into a single batch (array or dataframe). Then, Serve calls predict on the entire batch. Let’s take a look at a predictor that returns each row’s content, batch_size, and batch group. import time class BatchSizePredictor(Predictor): @classmethod def from_checkpoint(cls, _: Checkpoint): return cls() def predict(self, inp: np.ndarray): time.sleep(0.5) # simulate model inference. return [(i, len(inp), inp) for i in inp] serve.run( PredictorDeployment.options(name="BatchSizePredictor").bind( predictor_cls=BatchSizePredictor, checkpoint=local_checkpoint, ) ) (ServeController pid=60981) INFO 2022-06-02 19:33:39,305 controller 60981 deployment_state.py:1221 - Adding 1 replicas to deployment 'BatchSizePredictor'. Let’s use a threadpool executor to send ten requests at the same time to simulate multiple clients. 
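# Send ten single-element requests concurrently; Serve groups requests that
# arrive close together into one batch before calling predict.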
from concurrent.futures import ThreadPoolExecutor, wait with ThreadPoolExecutor() as pool: futs = [ pool.submit( requests.post, "http://localhost:8000/", json={"array": [i]}, ) for i in range(10) ] wait(futs) for fut in futs: i, batch_size, batch_group = fut.result().json() print(f"Request id: {i} is part of batch group: {batch_group}, with batch size {batch_size}") (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:43,141 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 525.9ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:43,139 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 519.1ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:43,647 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 1030.2ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:43,645 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 1013.6ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:44,155 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 1015.0ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:44,155 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 511.8ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:44,155 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 511.4ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:44,155 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 511.0ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:44,661 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 2043.3ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:44,662 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 2042.9ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:44,662 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 2039.5ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:44,662 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 2038.1ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:44,663 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 2038.9ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:44,663 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 2036.8ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:44,664 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 2036.5ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:44,661 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 1016.0ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:44,661 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 1015.6ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:44,662 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 1015.5ms Request id: [0.0] is part of batch group: [[3.0], [0.0], [4.0], [7.0]], with batch size 4 Request id: [1.0] is part of batch group: [[1.0]], with batch size 1 Request id: [2.0] is part of batch group: [[2.0]], with batch size 1 Request id: [3.0] is part of batch group: [[3.0], [0.0], [4.0], [7.0]], with batch size 4 Request id: [4.0] is part of batch group: [[3.0], [0.0], [4.0], [7.0]], with batch size 4 Request id: [5.0] is part of batch group: [[6.0], [5.0], [9.0]], with batch size 3 Request id: [6.0] is part of batch group: [[6.0], [5.0], [9.0]], with batch size 3 Request id: [7.0] is part of batch group: [[3.0], 
[0.0], [4.0], [7.0]], with batch size 4 Request id: [8.0] is part of batch group: [[8.0]], with batch size 1 Request id: [9.0] is part of batch group: [[6.0], [5.0], [9.0]], with batch size 3 (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:45,167 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 2539.1ms (BatchSizePredictor pid=61041) INFO 2022-06-02 19:33:45,165 BatchSizePredictor BatchSizePredictor#QQPBXh replica.py:484 - HANDLE __call__ OK 1516.7ms As you can see, some of the requests are part of a bigger group that’s run together. You can also configure the exact details of batching parameters: max_batch_size(int): the maximum batch size that will be executed in one call to predict. batch_wait_timeout_s (float): the maximum duration to wait for max_batch_size elements before running the predict call. Let’s set a max_batch_size of 10 to group our requests into the same batch. serve.run( PredictorDeployment.options(name="BatchSizePredictor").bind( predictor_cls=BatchSizePredictor, checkpoint=local_checkpoint, batching_params={"max_batch_size": 10, "batch_wait_timeout_s": 5} ) ) (ServeController pid=60981) INFO 2022-06-02 19:33:47,081 controller 60981 deployment_state.py:1180 - Stopping 1 replicas of deployment 'BatchSizePredictor' with outdated versions. (ServeController pid=60981) INFO 2022-06-02 19:33:49,234 controller 60981 deployment_state.py:1221 - Adding 1 replicas to deployment 'BatchSizePredictor'. Let’s call them again! You should see all ten requests executed as part of the same group. from concurrent.futures import ThreadPoolExecutor, wait with ThreadPoolExecutor() as pool: futs = [ pool.submit( requests.post, "http://localhost:8000/", json={"array": [i]}, ) for i in range(10) ] wait(futs) for fut in futs: i, batch_size, batch_group = fut.result().json() print(f"Request id: {i} is part of batch group: {batch_group}, with batch size {batch_size}") Request id: [0.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 Request id: [1.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 Request id: [2.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 Request id: [3.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 Request id: [4.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 Request id: [5.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 Request id: [6.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 Request id: [7.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 Request id: [8.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 Request id: [9.0] is part of batch group: [[0.0], [5.0], [1.0], [2.0], [3.0], [4.0], [7.0], [6.0], [8.0], [9.0]], with batch size 10 (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:52,751 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 538.8ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:52,752 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 526.8ms (HTTPProxyActor pid=60984) INFO 2022-06-02 
19:33:52,753 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 535.1ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:52,753 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 528.0ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:52,754 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 533.4ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:52,754 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 528.0ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:52,754 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 526.3ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:52,754 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 525.0ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:52,755 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 524.5ms (HTTPProxyActor pid=60984) INFO 2022-06-02 19:33:52,755 http_proxy 127.0.0.1 http_proxy.py:320 - POST /BatchSizePredictor 200 524.0ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,746 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 530.1ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,746 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 514.7ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,747 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 514.4ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,747 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 513.6ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,747 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 513.4ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,748 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 511.6ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,748 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 510.6ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,748 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 510.4ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,749 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 510.3ms (BatchSizePredictor pid=61046) INFO 2022-06-02 19:33:52,749 BatchSizePredictor BatchSizePredictor#mlVwXr replica.py:484 - HANDLE __call__ OK 509.9ms The batching behavior is well-defined: When batching arrays, they are all concatenated into a new array with an added batch dimension. When batching dataframes, they are all concatenated row-wise. You can also turn off this behavior by setting batching_params=False. How to Deploy AIR Here, we describe how you might use or deploy AIR in your infrastructure. There are two main deployment patterns – pick and choose and within existing platforms. The core idea is that AIR can be complementary to your existing infrastructure and integration tools. Design Principles Ray AIR handles the heavyweight compute aspects of AI apps and services. Ray AIR relies on external integrations (e.g., Tecton, MLFlow, W&B) for Storage and Tracking. Workflow Orchestrators (e.g., AirFlow) are an optional component that can be used for scheduling recurring jobs, launching new Ray clusters for jobs, and running non-Ray compute steps. Lightweight orchestration of task graphs within a single Ray AIR app can be handled using Ray tasks. 
Ray AIR libraries can be used independently, within an existing ML platform, or to build a Ray-native ML platform. Pick and choose your own libraries You can pick and choose which Ray AIR libraries you want to use. This is applicable if you are an ML engineer who wants to independently use a Ray AIR library for a specific AI app or service use case and do not need to integrate with existing ML platforms. For example, Alice wants to use RLlib to train models for her work project. Bob wants to use Ray Serve to deploy his model pipeline. In both cases, Alice and Bob can leverage these libraries independently without any coordination. This scenario describes most usages of Ray libraries today. In the above diagram: Only one library is used – showing that you can pick and choose and do not need to replace all of your ML infrastructure to use Ray AIR. You can use one of Ray’s many deployment modes to launch and manage Ray clusters and Ray applications. AIR libraries can read data from external storage systems such as Amazon S3 / Google Cloud Storage, as well as store results there. Existing ML Platform integration You may already have an existing machine learning platform but want to use some subset of Ray AIR. For example, an ML engineer wants to use Ray AIR within the ML Platform their organization has purchased (e.g., SageMaker, Vertex). Ray AIR can complement existing machine learning platforms by integrating with existing pipeline/workflow orchestrators, storage, and tracking services, without requiring a replacement of your entire ML platform. In the above diagram: A workflow orchestrator such as AirFlow, Oozie, SageMaker Pipelines, etc. is responsible for scheduling and creating Ray clusters and running Ray AIR apps and services. The Ray AIR app may be part of a larger orchestrated workflow (e.g., Spark ETL, then Training on Ray). Lightweight orchestration of task graphs can be handled entirely within Ray AIR. External workflow orchestrators will integrate nicely but are only needed if running non-Ray steps. Ray AIR clusters can also be created for interactive use (e.g., Jupyter notebooks, Google Colab, Databricks Notebooks, etc.). Ray Train, Datasets, and Serve provide integration with Feature Stores like Feast for Training and Serving. Ray Train and Tune provide integration with tracking services such as MLFlow and Weights & Biases. Examples Framework-specific Examples Convert existing PyTorch code to Ray AIR: Get started with Ray AIR from an existing PyTorch codebase Convert existing Tensorflow/Keras code to Ray AIR: Get started with Ray AIR from an existing Tensorflow/Keras codebase. Training a model with distributed LightGBM: Distributed training with LightGBM Training a model with distributed XGBoost: Distributed training with XGBoost Hyperparameter tuning with XGBoostTrainer: Distributed tuning with XGBoost Training a model with Sklearn: Integrating with Scikit-Learn (non-distributed) Simple Machine Learning AutoML for time series forecasting with Ray AIR: Build an AutoML system for time-series forecasting with Ray AIR Batch training & tuning on Ray Tune: Perform batch tuning on NYC Taxi Dataset with Ray AIR Parallel demand forecasting at scale using Ray Tune: Perform batch forecasting on NYC Taxi Dataset with Prophet, ARIMA and Ray AIR Text/NLP Fine-tune a 🤗 Transformers model: How to use Ray AIR to run Hugging Face Transformers fine-tuning on a text classification task. 
GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed: How to use Ray AIR to run Hugging Face Transformers with DeepSpeed for fine-tuning a large model. GPT-J-6B Batch Prediction with Ray AIR: How to use Ray AIR to do batch prediction with the Hugging Face Transformers GPT-J model. GPT-J-6B Serving with Ray AIR: How to use Ray AIR to do online serving with the Hugging Face Transformers GPT-J model. Fine-tuning DreamBooth with Ray AIR: How to fine-tune a DreamBooth text-to-image model with your own images. Fine-tune dolly-v2-7b with Ray AIR LightningTrainer and FSDP: How to fine-tune a dolly-v2-7b model with Ray AIR LightningTrainer and FSDP. Image/CV Training a Torch Image Classifier Fine-tuning a Torch object detection model Stable Diffusion Batch Prediction with Ray AIR: How to use Ray AIR to do batch prediction with the Stable Diffusion text-to-image model. Logging & Observability Logging results and uploading models to Comet ML: How to log results and upload models to Comet ML. Logging results and uploading models to Weights & Biases: How to log results and upload models to Weights and Biases. RL (RLlib) Serving reinforcement learning policy models Online reinforcement learning with Ray AIR Offline reinforcement learning with Ray AIR Advanced Incremental Learning with Ray AIR: Incrementally train and deploy a PyTorch CV model Integrate Ray AIR with Feast feature store: Integrate with Feast feature store in both train and inference Training a Torch Image Classifier This tutorial shows you how to train an image classifier using the Ray AI Runtime (AIR). You should be familiar with PyTorch before starting the tutorial. If you need a refresher, read PyTorch’s training a classifier tutorial. Before you begin Install the Ray AI Runtime. You need Ray 2.0 or later to run this example. !pip install 'ray[air]' Install requests, torch, and torchvision. !pip install requests torch torchvision Load and normalize CIFAR-10 We’ll train our classifier on a popular image dataset called CIFAR-10. First, let’s load CIFAR-10 into a Dataset. import ray import torchvision import torchvision.transforms as transforms train_dataset = torchvision.datasets.CIFAR10("data", download=True, train=True) test_dataset = torchvision.datasets.CIFAR10("data", download=True, train=False) train_dataset: ray.data.Dataset = ray.data.from_torch(train_dataset) test_dataset: ray.data.Dataset = ray.data.from_torch(test_dataset) Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar-10-python.tar.gz 100%|██████████| 170498071/170498071 [00:21<00:00, 7792736.24it/s] Extracting data/cifar-10-python.tar.gz to data Files already downloaded and verified 2022-10-23 10:33:48,403 INFO worker.py:1518 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265  train_dataset from_torch doesn’t parallelize reads, so you shouldn’t use it with larger datasets. Next, let’s represent our data using a dictionary of ndarrays instead of tuples. This lets us call Dataset.iter_torch_batches later in the tutorial. 
from typing import Dict, Tuple import numpy as np from PIL.Image import Image import torch def convert_batch_to_numpy(batch) -> Dict[str, np.ndarray]: images = np.stack([np.array(image) for image, _ in batch["item"]]) labels = np.array([label for _, label in batch["item"]]) return {"image": images, "label": labels} train_dataset = train_dataset.map_batches(convert_batch_to_numpy).materialize() test_dataset = test_dataset.map_batches(convert_batch_to_numpy).materialize() Read->Map_Batches: 0%| | 0/1 [00:00Map_Batches: 100%|██████████| 1/1 [00:04<00:00, 4.27s/it] Read->Map_Batches: 0%| | 0/1 [00:00Map_Batches: 100%|██████████| 1/1 [00:01<00:00, 1.40s/it] train_dataset Train a convolutional neural network Now that we’ve created our datasets, let’s define the training logic. import torch import torch.nn as nn import torch.nn.functional as F class Net(nn.Module): def __init__(self): super().__init__() self.conv1 = nn.Conv2d(3, 6, 5) self.pool = nn.MaxPool2d(2, 2) self.conv2 = nn.Conv2d(6, 16, 5) self.fc1 = nn.Linear(16 * 5 * 5, 120) self.fc2 = nn.Linear(120, 84) self.fc3 = nn.Linear(84, 10) def forward(self, x): x = self.pool(F.relu(self.conv1(x))) x = self.pool(F.relu(self.conv2(x))) x = torch.flatten(x, 1) # flatten all dimensions except batch x = F.relu(self.fc1(x)) x = F.relu(self.fc2(x)) x = self.fc3(x) return x We define our training logic in a function called train_loop_per_worker. This function contains regular PyTorch code with a few notable exceptions: We wrap our model with train.torch.prepare_model. We call session.get_dataset_shard and Dataset.iter_torch_batches to get a subset of our training data. We save model state using session.report. from ray import train from ray.air import session, Checkpoint from ray.train.torch import TorchCheckpoint import torch.nn as nn import torch.optim as optim import torchvision def train_loop_per_worker(config): model = train.torch.prepare_model(Net()) criterion = nn.CrossEntropyLoss() optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9) train_dataset_shard = session.get_dataset_shard("train") for epoch in range(2): running_loss = 0.0 train_dataset_batches = train_dataset_shard.iter_torch_batches( batch_size=config["batch_size"], ) for i, batch in enumerate(train_dataset_batches): # get the inputs and labels inputs, labels = batch["image"], batch["label"] # zero the parameter gradients optimizer.zero_grad() # forward + backward + optimize outputs = model(inputs) loss = criterion(outputs, labels) loss.backward() optimizer.step() # print statistics running_loss += loss.item() if i % 2000 == 1999: # print every 2000 mini-batches print(f"[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}") running_loss = 0.0 metrics = dict(running_loss=running_loss) checkpoint = TorchCheckpoint.from_state_dict(model.state_dict()) session.report(metrics, checkpoint=checkpoint) To improve our model’s accuracy, we’ll also define a Preprocessor to normalize the images. from ray.data.preprocessors import TorchVisionPreprocessor transform = transforms.Compose( [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))] ) preprocessor = TorchVisionPreprocessor(columns=["image"], transform=transform) Finally, we can train our model. This should take a few minutes to run. 
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 2},
    datasets={"train": train_dataset},
    scaling_config=ScalingConfig(num_workers=2),
    preprocessor=preprocessor
)
result = trainer.fit()
latest_checkpoint = result.checkpoint
== Status ==
Current time: 2022-08-30 15:31:37 (running for 00:00:45.17)
Memory usage on this node: 16.9/32.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/14.83 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/bveeramani/ray_results/TorchTrainer_2022-08-30_15-30-52
Number of trials: 1/1 (1 TERMINATED)
Trial name                  status      loc               iter    total time (s)    running_loss    _timestamp    _time_this_iter_s
TorchTrainer_6799a_00000    TERMINATED  127.0.0.1:3978       2           43.7121         595.445    1661898697              20.8503


(RayTrainWorker pid=3979) 2022-08-30 15:30:54,566 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=2] (RayTrainWorker pid=3979) 2022-08-30 15:30:55,727 INFO train_loop_utils.py:300 -- Moving model to device: cpu (RayTrainWorker pid=3979) 2022-08-30 15:30:55,728 INFO train_loop_utils.py:347 -- Wrapping provided model in DDP. (RayTrainWorker pid=3980) [1, 2000] loss: 2.276 (RayTrainWorker pid=3979) [1, 2000] loss: 2.270 (RayTrainWorker pid=3980) [1, 4000] loss: 1.964 (RayTrainWorker pid=3979) [1, 4000] loss: 1.936 (RayTrainWorker pid=3980) [1, 6000] loss: 1.753 (RayTrainWorker pid=3979) [1, 6000] loss: 1.754 (RayTrainWorker pid=3980) [1, 8000] loss: 1.638 (RayTrainWorker pid=3979) [1, 8000] loss: 1.661 (RayTrainWorker pid=3980) [1, 10000] loss: 1.586 (RayTrainWorker pid=3979) [1, 10000] loss: 1.547 (RayTrainWorker pid=3980) [1, 12000] loss: 1.489 (RayTrainWorker pid=3979) [1, 12000] loss: 1.476 Result for TorchTrainer_6799a_00000: _time_this_iter_s: 20.542800188064575 _timestamp: 1661898676 _training_iteration: 1 date: 2022-08-30_15-31-16 done: false experiment_id: c25700542bc348dbbeaf54e46f1fc84c hostname: MBP.local.meter iterations_since_restore: 1 node_ip: 127.0.0.1 pid: 3978 running_loss: 687.5853321105242 should_checkpoint: true time_since_restore: 22.880314111709595 time_this_iter_s: 22.880314111709595 time_total_s: 22.880314111709595 timestamp: 1661898676 timesteps_since_restore: 0 training_iteration: 1 trial_id: 6799a_00000 warmup_time: 0.0025300979614257812 (RayTrainWorker pid=3980) [2, 2000] loss: 1.417 (RayTrainWorker pid=3979) [2, 2000] loss: 1.431 (RayTrainWorker pid=3980) [2, 4000] loss: 1.403 (RayTrainWorker pid=3979) [2, 4000] loss: 1.404 (RayTrainWorker pid=3980) [2, 6000] loss: 1.394 (RayTrainWorker pid=3979) [2, 6000] loss: 1.368 (RayTrainWorker pid=3980) [2, 8000] loss: 1.343 (RayTrainWorker pid=3979) [2, 8000] loss: 1.363 (RayTrainWorker pid=3980) [2, 10000] loss: 1.340 (RayTrainWorker pid=3979) [2, 10000] loss: 1.297 (RayTrainWorker pid=3980) [2, 12000] loss: 1.253 (RayTrainWorker pid=3979) [2, 12000] loss: 1.276 Result for TorchTrainer_6799a_00000: _time_this_iter_s: 20.850306034088135 _timestamp: 1661898697 _training_iteration: 2 date: 2022-08-30_15-31-37 done: false experiment_id: c25700542bc348dbbeaf54e46f1fc84c hostname: MBP.local.meter iterations_since_restore: 2 node_ip: 127.0.0.1 pid: 3978 running_loss: 595.4451928250492 should_checkpoint: true time_since_restore: 43.71214985847473 time_this_iter_s: 20.831835746765137 time_total_s: 43.71214985847473 timestamp: 1661898697 timesteps_since_restore: 0 training_iteration: 2 trial_id: 6799a_00000 warmup_time: 0.0025300979614257812 Result for TorchTrainer_6799a_00000: _time_this_iter_s: 20.850306034088135 _timestamp: 1661898697 _training_iteration: 2 date: 2022-08-30_15-31-37 done: true experiment_id: c25700542bc348dbbeaf54e46f1fc84c experiment_tag: '0' hostname: MBP.local.meter iterations_since_restore: 2 node_ip: 127.0.0.1 pid: 3978 running_loss: 595.4451928250492 should_checkpoint: true time_since_restore: 43.71214985847473 time_this_iter_s: 20.831835746765137 time_total_s: 43.71214985847473 timestamp: 1661898697 timesteps_since_restore: 0 training_iteration: 2 trial_id: 6799a_00000 warmup_time: 0.0025300979614257812 2022-08-30 15:31:37,386 INFO tune.py:758 -- Total run time: 45.32 seconds (45.16 seconds for the tuning loop). To scale your training script, create a Ray Cluster and increase the number of workers. If your cluster contains GPUs, add "use_gpu": True to your scaling config. 
scaling_config=ScalingConfig(num_workers=8, use_gpu=True) Test the network on the test data Let’s see how our model performs. To classify images in the test dataset, we’ll need to create a Predictor. Predictors load data from checkpoints and efficiently perform inference. In contrast to TorchPredictor, which performs inference on a single batch, BatchPredictor performs inference on an entire dataset. Because we want to classify all of the images in the test dataset, we’ll use a BatchPredictor. from ray.train.torch import TorchPredictor from ray.train.batch_predictor import BatchPredictor batch_predictor = BatchPredictor.from_checkpoint( checkpoint=latest_checkpoint, predictor_cls=TorchPredictor, model=Net(), ) outputs: ray.data.Dataset = batch_predictor.predict( data=test_dataset, dtype=torch.float, feature_columns=["image"], keep_columns=["label"], ) Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:01<00:00, 1.59s/it] Our model outputs a list of energies for each class. To classify an image, we choose the class that has the highest energy. import numpy as np def convert_logits_to_classes(df): best_class = df["predictions"].map(lambda x: x.argmax()) df["prediction"] = best_class return df[["prediction", "label"]] predictions = outputs.map_batches(convert_logits_to_classes, batch_format="pandas") predictions.show(1) Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 59.42it/s] {'prediction': 3, 'label': 3} Now that we’ve classified all of the images, let’s figure out which images were classified correctly. The predictions dataset contains predicted labels and the test_dataset contains the true labels. To determine whether an image was classified correctly, we join the two datasets and check if the predicted labels are the same as the actual labels. def calculate_prediction_scores(df): df["correct"] = df["prediction"] == df["label"] return df scores = predictions.map_batches(calculate_prediction_scores, batch_format="pandas") scores.show(1) Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 132.06it/s] {'prediction': 3, 'label': 3, 'correct': True} To compute our test accuracy, we’ll count how many images the model classified correctly and divide that number by the total number of test images. scores.sum(on="correct") / scores.count() Shuffle Map: 100%|██████████| 1/1 [00:00<00:00, 152.00it/s] Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 219.54it/s] 0.557 Deploy the network and make a prediction Our model seems to perform decently, so let’s deploy the model to an endpoint. This allows us to make predictions over the Internet. from ray import serve from ray.serve import PredictorDeployment from ray.serve.http_adapters import json_to_ndarray serve.run( PredictorDeployment.bind( TorchPredictor, latest_checkpoint, model=Net(), http_adapter=json_to_ndarray, ) ) (ServeController pid=3987) INFO 2022-08-30 15:31:39,948 controller 3987 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-4b114e48c80d3549aa5da89fa16707e0334a0bafde984fd8b8618e47' on node '4b114e48c80d3549aa5da89fa16707e0334a0bafde984fd8b8618e47' listening on '127.0.0.1:8000' (HTTPProxyActor pid=3988) INFO: Started server process [3988] (ServeController pid=3987) INFO 2022-08-30 15:31:40,567 controller 3987 deployment_state.py:1232 - Adding 1 replica to deployment 'PredictorDeployment'. RayServeSyncHandle(deployment='PredictorDeployment') Let’s classify a test image. 
image = test_dataset.take(1)[0]["image"]

You can perform inference against a deployed model by posting a dictionary with an "array" key. To learn more about the default input schema, read the NdArray documentation.

import requests

payload = {"array": image.tolist(), "dtype": "float32"}
response = requests.post("http://localhost:8000/", json=payload)
response.json()

[-1.1342155933380127, -1.854529857635498, 1.2062205076217651, 2.6219608783721924, 0.5199968218803406, 2.2016565799713135, 0.9447429180145264, -0.5387609004974365, -1.9515650272369385, -1.676588773727417]

(HTTPProxyActor pid=3988) INFO 2022-08-30 15:31:41,713 http_proxy 127.0.0.1 http_proxy.py:315 - POST / 200 12.9ms
(ServeReplica:PredictorDeployment pid=3995) INFO 2022-08-30 15:31:41,712 PredictorDeployment PredictorDeployment#pTPSPE replica.py:482 - HANDLE __call__ OK 9.9ms

Fine-tuning a Torch object detection model

This tutorial explains how to fine-tune fasterrcnn_resnet50_fpn using the Ray AI Runtime for parallel data ingest and training. Here’s what you’ll do:
Load raw images and VOC-style annotations into a Dataset
Fine-tune fasterrcnn_resnet50_fpn (the backbone is pre-trained on ImageNet)
Evaluate the model’s accuracy
You should be familiar with PyTorch before starting the tutorial. If you need a refresher, read PyTorch’s training a classifier tutorial.

Before you begin

Install the Ray AI Runtime.
!pip install 'ray[air]'
Install torch, torchmetrics, torchvision, and xmltodict.
!pip install torch torchmetrics torchvision xmltodict

Create a Dataset

You’ll work with a subset of Pascal VOC that contains cats and dogs (the full dataset has 20 classes).

CLASS_TO_LABEL = {
    "background": 0,
    "cat": 1,
    "dog": 2,
}

The dataset contains two subdirectories: JPEGImages and Annotations. JPEGImages contains raw images, and Annotations contains XML annotations.

AnimalDetection
├── Annotations
│   ├── 2007_000063.xml
│   ├── 2007_000528.xml
│   └── ...
└── JPEGImages
    ├── 2007_000063.jpg
    ├── 2007_000528.jpg
    └── ...

Define a custom datasource

Each annotation describes the objects in an image. For example, view this image of a dog:

import io
from PIL import Image
import requests

response = requests.get("https://s3-us-west-2.amazonaws.com/air-example-data/AnimalDetection/JPEGImages/2007_000063.jpg")
image = Image.open(io.BytesIO(response.content))
image

Then, print the image’s annotation:

!curl "https://s3-us-west-2.amazonaws.com/air-example-data/AnimalDetection/Annotations/2007_000063.xml"

<annotation>
  <folder>VOC2012</folder>
  <filename>2007_000063.jpg</filename>
  <source>
    <database>The VOC2007 Database</database>
    <annotation>PASCAL VOC2007</annotation>
    <image>flickr</image>
  </source>
  <size>
    <width>500</width>
    <height>375</height>
    <depth>3</depth>
  </size>
  <segmented>1</segmented>
  <object>
    <name>dog</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>123</xmin>
      <ymin>115</ymin>
      <xmax>379</xmax>
      <ymax>275</ymax>
    </bndbox>
  </object>
</annotation>

Notice how there’s one object labeled “dog”:

  <object>
    <name>dog</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>123</xmin>
      <ymin>115</ymin>
      <xmax>379</xmax>
      <ymax>275</ymax>
    </bndbox>
  </object>

Ray Data lets you read and preprocess data in parallel. Ray Data doesn’t have built-in support for VOC-style annotations, so you’ll need to define a custom datasource. A Datasource is an object that reads data of a particular type. For example, Ray Data implements a Datasource that reads CSV files. Your datasource will parse labels and bounding boxes from XML files. Later, you’ll read the corresponding images.
To implement the datasource, extend the built-in FileBasedDatasource class and override the _read_file method.
from typing import List, Tuple import xmltodict import pandas as pd import pyarrow as pa from ray.data.datasource import FileBasedDatasource from ray.data.extensions import TensorArray class VOCAnnotationDatasource(FileBasedDatasource): def _read_file(self, f: pa.NativeFile, path: str, **reader_args) -> pd.DataFrame: text = f.read().decode("utf-8") annotation = xmltodict.parse(text)["annotation"] objects = annotation["object"] # If there's one object, `objects` is a `dict`; otherwise, it's a `list[dict]`. if isinstance(objects, dict): objects = [objects] boxes: List[Tuple] = [] for obj in objects: x1 = float(obj["bndbox"]["xmin"]) y1 = float(obj["bndbox"]["ymin"]) x2 = float(obj["bndbox"]["xmax"]) y2 = float(obj["bndbox"]["ymax"]) boxes.append((x1, y1, x2, y2)) labels: List[int] = [CLASS_TO_LABEL[obj["name"]] for obj in objects] filename = annotation["filename"] return pd.DataFrame( { "boxes": TensorArray([boxes]), "labels": TensorArray([labels]), "filename": [filename], } ) def _rows_per_file(self): return 1 Read annotations To load the annotations into a Dataset, call ray.data.read_datasource and pass the custom datasource to the constructor. Ray will read the annotations in parallel. import os import ray annotations: ray.data.Dataset = ray.data.read_datasource( VOCAnnotationDatasource(), paths="s3://anonymous@air-example-data/AnimalDetection/Annotations" ) find: ‘.git’: No such file or directory 2023-03-01 13:05:51,314 INFO worker.py:1360 -- Connecting to existing Ray cluster at address: 10.0.26.109:6379... 2023-03-01 13:05:51,327 INFO worker.py:1548 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_mf1limh36cs2yrh9wkf6h2a75k/services?redirect_to=dashboard  2023-03-01 13:05:52,269 INFO packaging.py:330 -- Pushing file package 'gcs://_ray_pkg_00aff5a3a84ab6438be1961b97a5beaa.zip' (266.32MiB) to Ray cluster... 2023-03-01 13:05:58,529 INFO packaging.py:343 -- Successfully pushed file package 'gcs://_ray_pkg_00aff5a3a84ab6438be1961b97a5beaa.zip'. Look at the first two samples. VOCAnnotationDatasource should’ve correctly parsed labels and bounding boxes. annotations.take(2) [{'boxes': array([[123., 115., 379., 275.]]), 'labels': 2, 'filename': '2007_000063.jpg'}, {'boxes': array([[124., 68., 319., 310.]]), 'labels': 1, 'filename': '2007_000528.jpg'}] Load images into memory Each row of annotations contains the filename of an image. Write a user-defined function that loads these images. For each annotation, your function will: Open the image associated with the annotation. Add the image to a new "image" column. from typing import Dict import numpy as np from PIL import Image def read_images(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: images: List[np.ndarray] = [] for filename in batch["filename"]: url = os.path.join("https://s3-us-west-2.amazonaws.com/air-example-data/AnimalDetection/JPEGImages", filename) response = requests.get(url) image = Image.open(io.BytesIO(response.content)) images.append(np.array(image)) batch["image"] = np.array(images, dtype=object) return batch dataset = annotations.map_batches(read_images) dataset 2023-03-01 13:06:08,005 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[read->MapBatches(read_images)] read->MapBatches(read_images): 100%|██████████| 128/128 [00:24<00:00, 5.25it/s] Split the dataset into train and test sets Once you’ve created a Dataset, split the dataset into train and test sets. 
train_dataset, test_dataset = dataset.train_test_split(0.2) Define preprocessing logic A Preprocessor is an object that defines preprocessing logic. It’s the standard way to preprocess data with Ray. Create two preprocessors: one to transpose and scale images (ToTensor), and another to randomly augment images every epoch (RandomHorizontalFlip). You’ll later pass these preprocessors to a Trainer. from torchvision import transforms from ray.data.preprocessors import TorchVisionPreprocessor transform = transforms.ToTensor() preprocessor = TorchVisionPreprocessor(columns=["image"], transform=transform) per_epoch_transform = transforms.RandomHorizontalFlip(p=0.5) per_epoch_preprocessor = TorchVisionPreprocessor(columns=["image"], transform=per_epoch_transform) Fine-tune the object detection model Define the training loop Write a function that trains fasterrcnn_resnet50_fpn. Your code will look like standard Torch code with a few changes. Here are a few things to point out: Distribute the model with ray.train.torch.prepare_model. Don’t use DistributedDataParallel. Pass your Dataset to the Trainer. The Trainer automatically shards the data across workers. Iterate over data with DataIterator.iter_batches. Don’t use a Torch DataLoader. Pass preprocessors to the Trainer. In addition, report metrics and checkpoints with session.report. session.report tracks these metrics in Ray AIR’s internal bookkeeping, allowing you to monitor training and analyze training runs after they’ve finished. import torch from torchvision import models from ray.air import Checkpoint from ray.air import session def train_one_epoch(*, model, optimizer, batch_size, epoch): model.train() lr_scheduler = None if epoch == 0: warmup_factor = 1.0 / 1000 lr_scheduler = torch.optim.lr_scheduler.LinearLR( optimizer, start_factor=warmup_factor, total_iters=250 ) device = ray.train.torch.get_device() train_dataset_shard = session.get_dataset_shard("train") batches = train_dataset_shard.iter_batches(batch_size=batch_size) for batch in batches: inputs = [torch.as_tensor(image).to(device) for image in batch["image"]] targets = [ { "boxes": torch.as_tensor(boxes).to(device), "labels": torch.as_tensor(labels).to(device), } for boxes, labels in zip(batch["boxes"], batch["labels"]) ] loss_dict = model(inputs, targets) losses = sum(loss for loss in loss_dict.values()) optimizer.zero_grad() losses.backward() optimizer.step() if lr_scheduler is not None: lr_scheduler.step() session.report( { "losses": losses.item(), "epoch": epoch, "lr": optimizer.param_groups[0]["lr"], **{key: value.item() for key, value in loss_dict.items()}, } ) def train_loop_per_worker(config): # By default, `fasterrcnn_resnet50_fpn`'s backbone is pre-trained on ImageNet. 
model = models.detection.fasterrcnn_resnet50_fpn(num_classes=3) model = ray.train.torch.prepare_model(model) parameters = [p for p in model.parameters() if p.requires_grad] optimizer = torch.optim.SGD( parameters, lr=config["lr"], momentum=config["momentum"], weight_decay=config["weight_decay"], ) lr_scheduler = torch.optim.lr_scheduler.MultiStepLR( optimizer, milestones=config["lr_steps"], gamma=config["lr_gamma"] ) for epoch in range(0, config["epochs"]): train_one_epoch( model=model, optimizer=optimizer, batch_size=config["batch_size"], epoch=epoch, ) lr_scheduler.step() checkpoint = Checkpoint.from_dict( { "model": model.module.state_dict(), "optimizer": optimizer.state_dict(), "lr_scheduler": lr_scheduler.state_dict(), "config": config, "epoch": epoch, } ) session.report({}, checkpoint=checkpoint) Fine-tune the model Once you’ve defined the training loop, create a TorchTrainer and pass the training loop to the constructor. Then, call TorchTrainer.fit to train the model. from ray.air.config import ScalingConfig from ray.train.torch import TorchTrainer # The following transform operation is lazy. # It will be re-run every epoch. train_dataset = per_epoch_preprocessor.transform(train_dataset) trainer = TorchTrainer( train_loop_per_worker=train_loop_per_worker, train_loop_config={ "batch_size": 2, "lr": 0.02, "epochs": 1, # You'd normally train for 26 epochs. "momentum": 0.9, "weight_decay": 1e-4, "lr_steps": [16, 22], "lr_gamma": 0.1, }, scaling_config=ScalingConfig(num_workers=4, use_gpu=True), datasets={"train": train_dataset}, preprocessor=preprocessor, ) results = trainer.fit() 2023-03-01 13:06:39,486 INFO instantiator.py:21 -- Created a temporary directory at /tmp/tmp1stz0z_r 2023-03-01 13:06:39,488 INFO instantiator.py:76 -- Writing /tmp/tmp1stz0z_r/_remote_module_non_scriptable.py

Tune Status

Current time: 2023-03-01 13:08:45
Running for: 00:02:05.37
Memory: 50.5/480.2 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/64 CPUs, 0/8 GPUs, 0.0/324.83 GiB heap, 0.0/143.21 GiB objects (0.0/1.0 accelerator_type:V100)

Trial Status

Trial name                  status      loc                   iter    total time (s)
TorchTrainer_f5aa9_00000    TERMINATED  10.0.26.109:175347     244           108.703
(RayTrainWorker pid=175611) 2023-03-01 13:06:56,331 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=4] (TorchTrainer pid=175347) 2023-03-01 13:07:00,615 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[TorchVisionPreprocessor] -> AllToAllOperator[randomize_block_order] (autoscaler +1m25s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0. (autoscaler +1m25s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster. (TorchTrainer pid=175347) /home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/dataset_iterator.py:64: UserWarning: session.get_dataset_shard returns a ray.data.DataIterator instead of a Dataset/DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DataIterator docs. (TorchTrainer pid=175347) warnings.warn( Stage 0: 0%| | 0/1 [00:00 TaskPoolMapOperator[TorchVisionPreprocessor] (PipelineSplitExecutorCoordinator pid=191352) Stage 0: : 2it [00:08, 4.31s/it] 2023-03-01 13:07:33,990 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[TorchVisionPreprocessor] (RayTrainWorker pid=175612) 2023-03-01 13:07:34,394 WARNING plan.py:527 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#data-and-tune (PipelineSplitExecutorCoordinator pid=191352) Stage 0: : 3it [00:13, 4.48s/it]2023-03-01 13:07:38,660 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[TorchVisionPreprocessor] (RayTrainWorker pid=175612) /tmp/ipykernel_160001/3839218723.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:199.) (RayTrainWorker pid=175614) /tmp/ipykernel_160001/3839218723.py:26: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:199.) (RayTrainWorker pid=175611) /tmp/ipykernel_160001/3839218723.py:26: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:199.) 
(RayTrainWorker pid=175613) /tmp/ipykernel_160001/3839218723.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:199.)

Trial Progress

Trial name                date                 done    experiment_tag    hostname          iterations_since_restore    node_ip        pid       should_checkpoint    time_since_restore    time_this_iter_s    time_total_s    timestamp     training_iteration    trial_id
TorchTrainer_f5aa9_00000  2023-03-01_13-08-41  True    0                 ip-10-0-26-109    244                         10.0.26.109    175347    True                 108.703               4.2088              108.703         1677704918    244                   f5aa9_00000
(RayTrainWorker pid=175612) 2023-03-01 13:07:41,980 INFO distributed.py:1027 -- Reducer buckets have been rebuilt in this iteration. (PipelineSplitExecutorCoordinator pid=191352) Stage 0: : 4it [01:11, 25.77s/it]2023-03-01 13:08:37,068 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[TorchVisionPreprocessor] (RayTrainWorker pid=175614) 2023-03-01 13:08:37,464 WARNING plan.py:527 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#data-and-tune 2023-03-01 13:08:45,074 INFO tune.py:825 -- Total run time: 125.51 seconds (125.36 seconds for the tuning loop). Next steps End-to-end: Offline Batch Inference Convert existing PyTorch code to Ray AIR If you already have working PyTorch code, you don’t have to start from scratch to utilize the benefits of Ray AIR. Instead, you can continue to use your existing code and incrementally add Ray AIR components as needed. Some of the benefits you’ll get by using Ray AIR with your existing PyTorch training code: Easy distributed data-parallel training on a cluster Automatic checkpointing/fault tolerance and result tracking Parallel data preprocessing Seamless integration with hyperparameter tuning Scalable model serving This tutorial will show you how to start with Ray AIR from your existing PyTorch training code and learn how to distribute your training. The example code The example code we’ll be using is that of the PyTorch quickstart tutorial. This code trains a neural network classifier on the FashionMNIST dataset. You can find the code we used for this tutorial here on GitHub. Unmodified Let’s start with the unmodified code from the example. A thorough explanation of the parts is given in the full tutorial - we’ll just focus on the code here. We start with some imports: import torch from torch import nn from torch.utils.data import DataLoader from torchvision import datasets from torchvision.transforms import ToTensor Then we download the data: This tutorial assumes that your existing code is using the torch.utils.data.Dataset native to PyTorch. It continues to use torch.utils.data.Dataset to allow you to make as few code changes as possible. This tutorial also runs with Ray Data, which gives you the benefits of efficient parallel preprocessing. See an example of using Ray Data for the CIFAR-10 dataset here. # Download training data from open datasets. training_data = datasets.FashionMNIST( root="data", train=True, download=True, transform=ToTensor(), ) # Download test data from open datasets. test_data = datasets.FashionMNIST( root="data", train=False, download=True, transform=ToTensor(), ) We can now define the dataloaders: batch_size = 64 # Create data loaders. train_dataloader = DataLoader(training_data, batch_size=batch_size) test_dataloader = DataLoader(test_data, batch_size=batch_size) We can then define and instantiate the neural network: # Get cpu or gpu device for training. 
device = "cuda" if torch.cuda.is_available() else "cpu" print(f"Using {device} device") # Define model class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.flatten = nn.Flatten() self.linear_relu_stack = nn.Sequential( nn.Linear(28*28, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10) ) def forward(self, x): x = self.flatten(x) logits = self.linear_relu_stack(x) return logits model = NeuralNetwork().to(device) print(model) Using cpu device NeuralNetwork( (flatten): Flatten(start_dim=1, end_dim=-1) (linear_relu_stack): Sequential( (0): Linear(in_features=784, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() (4): Linear(in_features=512, out_features=10, bias=True) ) ) Define our optimizer and loss: loss_fn = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) And finally our training loop. Note that we renamed the function from train to train_epoch to avoid conflicts with the Ray Train module later (which is also called train): def train_epoch(dataloader, model, loss_fn, optimizer): size = len(dataloader.dataset) model.train() for batch, (X, y) in enumerate(dataloader): X, y = X.to(device), y.to(device) # Compute prediction error pred = model(X) loss = loss_fn(pred, y) # Backpropagation optimizer.zero_grad() loss.backward() optimizer.step() if batch % 100 == 0: loss, current = loss.item(), batch * len(X) print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]") And while we’re at it, here is our validation loop (note that we sneaked in a return test_loss statement and also renamed the function): def test_epoch(dataloader, model, loss_fn): size = len(dataloader.dataset) num_batches = len(dataloader) model.eval() test_loss, correct = 0, 0 with torch.no_grad(): for X, y in dataloader: X, y = X.to(device), y.to(device) pred = model(X) test_loss += loss_fn(pred, y).item() correct += (pred.argmax(1) == y).type(torch.float).sum().item() test_loss /= num_batches correct /= size print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n") return test_loss Now we can trigger training and save a model: epochs = 5 for t in range(epochs): print(f"Epoch {t+1}\n-------------------------------") train_epoch(train_dataloader, model, loss_fn, optimizer) test_epoch(test_dataloader, model, loss_fn) print("Done!") Epoch 1 ------------------------------- loss: 2.295566 [ 0/60000] loss: 2.291762 [ 6400/60000] loss: 2.268867 [12800/60000] loss: 2.262820 [19200/60000] loss: 2.256001 [25600/60000] loss: 2.204572 [32000/60000] loss: 2.225075 [38400/60000] loss: 2.184233 [44800/60000] loss: 2.182663 [51200/60000] loss: 2.154192 [57600/60000] Test Error: Accuracy: 36.5%, Avg loss: 2.146461 Epoch 2 ------------------------------- loss: 2.150961 [ 0/60000] loss: 2.147769 [ 6400/60000] loss: 2.085719 [12800/60000] loss: 2.107859 [19200/60000] loss: 2.066872 [25600/60000] loss: 1.978430 [32000/60000] loss: 2.029306 [38400/60000] loss: 1.939256 [44800/60000] loss: 1.951516 [51200/60000] loss: 1.881199 [57600/60000] Test Error: Accuracy: 55.0%, Avg loss: 1.879711 Epoch 3 ------------------------------- loss: 1.907144 [ 0/60000] loss: 1.879325 [ 6400/60000] loss: 1.765395 [12800/60000] loss: 1.815291 [19200/60000] loss: 1.708041 [25600/60000] loss: 1.641765 [32000/60000] loss: 1.687605 [38400/60000] loss: 1.581743 [44800/60000] loss: 1.615951 [51200/60000] loss: 1.507691 [57600/60000] Test Error: Accuracy: 62.3%, Avg loss: 1.523205 Epoch 4 
------------------------------- loss: 1.589735 [ 0/60000] loss: 1.549950 [ 6400/60000] loss: 1.404985 [12800/60000] loss: 1.479113 [19200/60000] loss: 1.362190 [25600/60000] loss: 1.348071 [32000/60000] loss: 1.376365 [38400/60000] loss: 1.297325 [44800/60000] loss: 1.336892 [51200/60000] loss: 1.234042 [57600/60000] Test Error: Accuracy: 63.8%, Avg loss: 1.255606 Epoch 5 ------------------------------- loss: 1.334560 [ 0/60000] loss: 1.311746 [ 6400/60000] loss: 1.151140 [12800/60000] loss: 1.254679 [19200/60000] loss: 1.132061 [25600/60000] loss: 1.149663 [32000/60000] loss: 1.179779 [38400/60000] loss: 1.117024 [44800/60000] loss: 1.159811 [51200/60000] loss: 1.072276 [57600/60000] Test Error: Accuracy: 65.0%, Avg loss: 1.088372 Done! torch.save(model.state_dict(), "model.pth") print("Saved PyTorch Model State to model.pth") Saved PyTorch Model State to model.pth We’ll cover the rest of the tutorial (loading the model and doing batch prediction) later! Introducing a wrapper function (no Ray AIR, yet!) The notebook-style from the tutorial is great for tutorials, but in your production code you probably wrapped the actual training logic in a function. So let’s do this here, too. Note that we do not add or alter any code here (apart from variable definitions) - we just take the loose bits of code in the current tutorial and put them into one function. def train_func(): batch_size = 64 lr = 1e-3 epochs = 5 # Create data loaders. train_dataloader = DataLoader(training_data, batch_size=batch_size) test_dataloader = DataLoader(test_data, batch_size=batch_size) # Get cpu or gpu device for training. device = "cuda" if torch.cuda.is_available() else "cpu" print(f"Using {device} device") model = NeuralNetwork().to(device) print(model) loss_fn = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=lr) for t in range(epochs): print(f"Epoch {t+1}\n-------------------------------") train_epoch(train_dataloader, model, loss_fn, optimizer) test_epoch(test_dataloader, model, loss_fn) print("Done!") Let’s see it in action again: train_func() Using cpu device NeuralNetwork( (flatten): Flatten(start_dim=1, end_dim=-1) (linear_relu_stack): Sequential( (0): Linear(in_features=784, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() (4): Linear(in_features=512, out_features=10, bias=True) ) ) Epoch 1 ------------------------------- loss: 2.311088 [ 0/60000] loss: 2.295296 [ 6400/60000] loss: 2.271576 [12800/60000] loss: 2.258537 [19200/60000] loss: 2.250895 [25600/60000] loss: 2.216462 [32000/60000] loss: 2.222296 [38400/60000] loss: 2.189997 [44800/60000] loss: 2.188647 [51200/60000] loss: 2.145895 [57600/60000] Test Error: Accuracy: 44.8%, Avg loss: 2.144711 Epoch 2 ------------------------------- loss: 2.164661 [ 0/60000] loss: 2.150512 [ 6400/60000] loss: 2.085597 [12800/60000] loss: 2.099732 [19200/60000] loss: 2.047274 [25600/60000] loss: 1.980986 [32000/60000] loss: 2.014364 [38400/60000] loss: 1.930184 [44800/60000] loss: 1.941903 [51200/60000] loss: 1.856329 [57600/60000] Test Error: Accuracy: 56.2%, Avg loss: 1.857978 Epoch 3 ------------------------------- loss: 1.901466 [ 0/60000] loss: 1.867397 [ 6400/60000] loss: 1.739829 [12800/60000] loss: 1.784509 [19200/60000] loss: 1.677714 [25600/60000] loss: 1.621924 [32000/60000] loss: 1.652736 [38400/60000] loss: 1.549752 [44800/60000] loss: 1.583215 [51200/60000] loss: 1.469457 [57600/60000] Test Error: Accuracy: 62.0%, Avg loss: 1.491323 Epoch 4 
------------------------------- loss: 1.564052 [ 0/60000] loss: 1.533092 [ 6400/60000] loss: 1.374619 [12800/60000] loss: 1.450151 [19200/60000] loss: 1.340597 [25600/60000] loss: 1.326336 [32000/60000] loss: 1.345804 [38400/60000] loss: 1.269192 [44800/60000] loss: 1.307673 [51200/60000] loss: 1.200916 [57600/60000] Test Error: Accuracy: 63.8%, Avg loss: 1.232803 Epoch 5 ------------------------------- loss: 1.311137 [ 0/60000] loss: 1.301159 [ 6400/60000] loss: 1.127901 [12800/60000] loss: 1.233908 [19200/60000] loss: 1.118969 [25600/60000] loss: 1.134692 [32000/60000] loss: 1.157277 [38400/60000] loss: 1.094546 [44800/60000] loss: 1.135308 [51200/60000] loss: 1.043909 [57600/60000] Test Error: Accuracy: 65.0%, Avg loss: 1.072193 Done! The output should look very similar to the previous output. Starting with Ray AIR: Distribute the training As a first step, we want to distribute the training across multiple workers. For this we want to: use data-parallel training by sharding the training data, set up the model to communicate gradient updates across machines, and report the results back to Ray Train. To facilitate this, we only need a few changes to the code: We import Ray Train and Ray AIR Session: import ray.train as train from ray.air import session We use a config dict to configure some hyperparameters (this is not strictly needed but good practice, especially if you want to do hyperparameter tuning later): def train_func(config: dict): batch_size = config["batch_size"] lr = config["lr"] epochs = config["epochs"] We dynamically adjust the worker batch size according to the number of workers: batch_size_per_worker = batch_size // session.get_world_size() We prepare the data loader for distributed data sharding: train_dataloader = train.torch.prepare_data_loader(train_dataloader) test_dataloader = train.torch.prepare_data_loader(test_dataloader) We prepare the model for distributed gradient updates: model = train.torch.prepare_model(model) Note that train.torch.prepare_model() also automatically takes care of setting up devices (e.g. GPU training) - so we can get rid of those lines in our current code! We capture the validation loss and report it to Ray Train: test_loss = test_epoch(test_dataloader, model, loss_fn) session.report(dict(loss=test_loss)) In the train_epoch() and test_epoch() functions we divide the size by the world size: # Divide by world size size = len(dataloader.dataset) // session.get_world_size() In the train_epoch() function we can get rid of the device mapping. Ray Train does this for us: # We don't need this anymore! Ray Train does this automatically: # X, y = X.to(device), y.to(device) That’s it - you need less than 10 lines of Ray Train-specific code and can otherwise continue to use your original code. Let’s take a look at the resulting code. First the train_epoch() function (2 lines changed, and we also commented out the print statement): def train_epoch(dataloader, model, loss_fn, optimizer): size = len(dataloader.dataset) // session.get_world_size() # Divide by world size model.train() for batch, (X, y) in enumerate(dataloader): # We don't need this anymore!
Ray Train does this automatically: # X, y = X.to(device), y.to(device) # Compute prediction error pred = model(X) loss = loss_fn(pred, y) # Backpropagation optimizer.zero_grad() loss.backward() optimizer.step() if batch % 100 == 0: loss, current = loss.item(), batch * len(X) # print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]") Then the test_epoch() function (1 line changed, and we also commented out the print statement): def test_epoch(dataloader, model, loss_fn): size = len(dataloader.dataset) // session.get_world_size() # Divide by world size num_batches = len(dataloader) model.eval() test_loss, correct = 0, 0 with torch.no_grad(): for X, y in dataloader: X, y = X.to(device), y.to(device) pred = model(X) test_loss += loss_fn(pred, y).item() correct += (pred.argmax(1) == y).type(torch.float).sum().item() test_loss /= num_batches correct /= size # print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n") return test_loss And lastly, the wrapping train_func() where we added 4 lines and modified 2 (apart from the config dict): import ray.train as train from ray.air import session def train_func(config: dict): batch_size = config["batch_size"] lr = config["lr"] epochs = config["epochs"] batch_size_per_worker = batch_size // session.get_world_size() # Create data loaders. train_dataloader = DataLoader(training_data, batch_size=batch_size_per_worker) test_dataloader = DataLoader(test_data, batch_size=batch_size_per_worker) train_dataloader = train.torch.prepare_data_loader(train_dataloader) test_dataloader = train.torch.prepare_data_loader(test_dataloader) model = NeuralNetwork() model = train.torch.prepare_model(model) loss_fn = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=lr) for t in range(epochs): train_epoch(train_dataloader, model, loss_fn, optimizer) test_loss = test_epoch(test_dataloader, model, loss_fn) session.report(dict(loss=test_loss)) print("Done!") Now we’ll use Ray Train’s TorchTrainer to kick off the training. Note that we can set the hyperparameters here! In the scaling_config we can also configure how many parallel workers to use and if we want to enable GPU training or not. from ray.train.torch import TorchTrainer from ray.air.config import ScalingConfig trainer = TorchTrainer( train_loop_per_worker=train_func, train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4}, scaling_config=ScalingConfig(num_workers=2, use_gpu=False), ) result = trainer.fit() print(f"Last result: {result.metrics}") Great, this works! You’re now training your model in parallel. You could now scale this up to more nodes and workers on your Ray cluster. But there are a few improvements we can make to the code in order to get the most out of the system. For one, we should enable checkpointing to get access to the trained model afterwards. Additionally, we should optimize the data loading to take place within the workers. Enabling checkpointing to retrieve the model Enabling checkpointing is pretty easy - we just need to pass a Checkpoint object with the model state to the session.report() API. from ray.air import Checkpoint checkpoint = Checkpoint.from_dict( dict(epoch=t, model=model.state_dict()) ) session.report(dict(loss=test_loss), checkpoint=checkpoint) Move the data loader to the training function You may have noticed a warning: Warning: The actor TrainTrainable is very large (52 MiB). Check that its definition is not implicitly capturing a large array or other object in scope.
Tip: use ray.put() to put large objects in the Ray object store.. This is because we load the data outside the training function. Ray then serializes it to make it accessible to the remote tasks (that may get executed on a remote node!). This is not too bad with just 52 MB of data, but imagine this were a full image dataset - you wouldn’t want to ship this around the cluster unnecessarily. Instead, you should move the dataset loading part into the train_func(). This will then download the data to disk once per machine and result in much more efficient data loading. The result looks like this: from ray.air import Checkpoint def load_data(): # Download training data from open datasets. training_data = datasets.FashionMNIST( root="data", train=True, download=True, transform=ToTensor(), ) # Download test data from open datasets. test_data = datasets.FashionMNIST( root="data", train=False, download=True, transform=ToTensor(), ) return training_data, test_data def train_func(config: dict): batch_size = config["batch_size"] lr = config["lr"] epochs = config["epochs"] batch_size_per_worker = batch_size // session.get_world_size() training_data, test_data = load_data() # <- this is new! # Create data loaders. train_dataloader = DataLoader(training_data, batch_size=batch_size_per_worker) test_dataloader = DataLoader(test_data, batch_size=batch_size_per_worker) train_dataloader = train.torch.prepare_data_loader(train_dataloader) test_dataloader = train.torch.prepare_data_loader(test_dataloader) model = NeuralNetwork() model = train.torch.prepare_model(model) loss_fn = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=lr) for t in range(epochs): train_epoch(train_dataloader, model, loss_fn, optimizer) test_loss = test_epoch(test_dataloader, model, loss_fn) checkpoint = Checkpoint.from_dict( dict(epoch=t, model=model.state_dict()) ) session.report(dict(loss=test_loss), checkpoint=checkpoint) print("Done!") Let’s train again: trainer = TorchTrainer( train_loop_per_worker=train_func, train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4}, scaling_config=ScalingConfig(num_workers=2, use_gpu=False), ) result = trainer.fit() We can see our results here: print(f"Last result: {result.metrics}") print(f"Checkpoint: {result.checkpoint}") Last result: {'loss': 1.215654496934004, '_timestamp': 1657734050, '_time_this_iter_s': 10.695234060287476, '_training_iteration': 4, 'time_this_iter_s': 10.697366952896118, 'should_checkpoint': True, 'done': True, 'timesteps_total': None, 'episodes_total': None, 'training_iteration': 4, 'trial_id': 'b43fc_00000', 'experiment_id': '3b3c6e36d57a4e7993aacdbe6cd4c8ed', 'date': '2022-07-13_10-40-50', 'timestamp': 1657734050, 'time_total_s': 96.68163204193115, 'pid': 65706, 'hostname': 'Jiaos-MacBook-Pro-16-inch-2019', 'node_ip': '127.0.0.1', 'config': {}, 'time_since_restore': 96.68163204193115, 'timesteps_since_restore': 0, 'iterations_since_restore': 4, 'warmup_time': 0.0036132335662841797, 'experiment_tag': '0'} Checkpoint: Summary This tutorial demonstrated how to turn your existing PyTorch code into code you can use with Ray AIR. We learned how to enable distributed training using Ray Train abstractions save and retrieve model checkpoints via Ray AIR load a model for batch prediction In our other examples you can learn how to do more things with the Ray AIR API, such as serving your model with Ray Serve or tune your hyperparameters with Ray Tune. You can also learn how to perform offline batch inference with Ray Data. 
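The batch prediction step listed in the summary is not shown in this section. As a minimal sketch of how the retrieved checkpoint could be loaded back into the model (assuming the dict-style checkpoint layout reported above, and using plain PyTorch rather than a specific AIR predictor API):

import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

# Assumption: `result.checkpoint` is the dict checkpoint reported during training,
# holding the epoch number and the model state dict under the "model" key.
checkpoint_data = result.checkpoint.to_dict()
state_dict = checkpoint_data["model"]

# The state dict may carry a "module." prefix if the model was wrapped in
# DistributedDataParallel by prepare_model(); strip it before loading.
consume_prefix_in_state_dict_if_present(state_dict, "module.")

model = NeuralNetwork()
model.load_state_dict(state_dict)
model.eval()

# Quick sanity check on a single test image.
X, y = test_data[0]
with torch.no_grad():
    logits = model(X.unsqueeze(0))
print("Predicted class:", logits.argmax(1).item(), "- actual label:", y)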
We hope this tutorial gave you a good starting point to leverage Ray AIR. If you have any questions, suggestions, or run into any problems please reach out on Discuss or GitHub! Convert existing Tensorflow/Keras code to Ray AIR If you already have working Tensorflow code, you don’t have to start from scratch to utilize the benefits of Ray AIR. Instead, you can continue to use your existing code and incrementally add Ray AIR components as needed. Some of the benefits you’ll get by using Ray AIR with your existing Tensorflow training code: Easy distributed data-parallel training on a cluster Automatic checkpointing/fault tolerance and result tracking Parallel data preprocessing Seamless integration with hyperparameter tuning Scalable model serving This tutorial will show you how to start with Ray AIR from your existing Tensorflow training code. We will learn how to perform distributed data-parallel training. Example Code The example code we’ll be converting to Ray AIR is that of the Tensorflow quickstart tutorial. This code trains a neural network classifier on the MNIST dataset. Existing Tensorflow Code Let’s start with the unmodified code from the example. A thorough explanation of the parts is given in the full tutorial - we’ll just focus on the code here. import tensorflow as tf print("TensorFlow version:", tf.__version__) TensorFlow version: 2.9.2 First, we load and preprocess the MNIST dataset. Assumption for this tutorial: your existing code is using the tf.data.Dataset native to Tensorflow. This tutorial continues to use tf.data.Dataset to allow you to make as few code changes as possible. Everything in this tutorial is also possible if you choose to use Ray Data, and you will also get the benefits of efficient preprocessing and multi-worker batch prediction. See here for resources to get started with Ray Data. mnist = tf.keras.datasets.mnist (x_train, y_train), (x_test, y_test) = mnist.load_data() x_train, x_test = x_train / 255.0, x_test / 255.0 train_ds = tf.data.Dataset.from_tensor_slices( (x_train, y_train)).shuffle(len(x_train)).batch(32) test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32) print(f"Training Dataset: {len(x_train)} samples") print(f"Test Dataset: {len(x_test)} samples") Training Dataset: 60000 samples Test Dataset: 10000 samples Next, we define the model architecture of the neural network. We wrap the model definition inside a function for easy reuse later. def build_model() -> tf.keras.Model: return tf.keras.Sequential( [ tf.keras.layers.InputLayer(input_shape=(28, 28)), tf.keras.layers.Flatten(), tf.keras.layers.Dense(128, activation="relu"), tf.keras.layers.Dropout(0.2), tf.keras.layers.Dense(10), ] ) Next, initialize the model, loss, optimizer, and define some metrics that we want to track during training. We recommend using the Keras Model.fit API, as it simplifies distributing your training with tf.distribute and Ray AIR. Compile your model with a loss function and optimizer, then run model.fit(train_ds). loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) optimizer = tf.keras.optimizers.Adam() model = build_model() model.compile( optimizer=optimizer, loss=loss_object, metrics=["accuracy"], ) Next, we train the model for some number of epochs, updating the model parameters to minimize the loss. Each epoch loops through the entire training dataset and performs gradient descent steps.
train_history = model.fit(train_ds, epochs=5, verbose=2) Epoch 1/5 1875/1875 - 3s - loss: 0.2954 - accuracy: 0.9134 - 3s/epoch - 2ms/step Epoch 2/5 1875/1875 - 3s - loss: 0.1437 - accuracy: 0.9567 - 3s/epoch - 2ms/step Epoch 3/5 1875/1875 - 3s - loss: 0.1078 - accuracy: 0.9673 - 3s/epoch - 1ms/step Epoch 4/5 1875/1875 - 3s - loss: 0.0860 - accuracy: 0.9736 - 3s/epoch - 1ms/step Epoch 5/5 1875/1875 - 3s - loss: 0.0746 - accuracy: 0.9760 - 3s/epoch - 2ms/step After training, we evaluate the model’s performance on the test set. # Evaluate on the test set and report metrics eval_result = model.evaluate(test_ds, return_dict=True, verbose=0) test_loss = eval_result["loss"] test_accuracy = eval_result["accuracy"] print( f"Final Test Loss: {test_loss:.4f}, " f"Final Test Accuracy: {test_accuracy:.4f}" ) 313/313 - 0s - loss: 0.0735 - accuracy: 0.9788 - 457ms/epoch - 1ms/step Final Test Loss: 0.0735, Final Test Accuracy: 0.9788 Wrap everything in a training loop function Later on, we might want to perform hyperparameter optimization and launch multiple training runs, so it is useful to wrap the training logic we have so far in a function. We also introduce a function to get the training and test datasets, which is used within the training function. def get_train_test_datasets(batch_size): train_ds = tf.data.Dataset.from_tensor_slices( (x_train, y_train)).shuffle(len(x_train)).batch(batch_size) test_ds = tf.data.Dataset.from_tensor_slices( (x_test, y_test)).batch(batch_size) return train_ds, test_ds def train_func(): epochs = 5 batch_size = 32 loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) optimizer = tf.keras.optimizers.Adam() model = build_model() model.compile( optimizer=optimizer, loss=loss_object, metrics=["accuracy"], ) train_ds, test_ds = get_train_test_datasets(batch_size) model.fit(train_ds, epochs=epochs, verbose=2) eval_result = model.evaluate(test_ds, return_dict=True, verbose=0) test_loss = eval_result["loss"] test_accuracy = eval_result["accuracy"] print( f"Final Test Loss: {test_loss:.4f}, " f"Final Test Accuracy: {test_accuracy:.4f}" ) Introduce Ray AIR for Distributed Data-Parallel Training Now that we have set up a training loop that runs on a single worker, let’s use Ray AIR to implement distributed training, allowing us to train using any number of workers! Ray Train, the model training library within Ray AIR, implements a TensorflowTrainer that allows you to do distributed training with Tensorflow without needing to create and handle workers manually. Ray Train creates workers in a Ray cluster and configures the TF_CONFIG environment variable for you. This way, you can use simply use a strategy from tf.distribute to run your training loop across multiple workers in a distributed data-parallel fashion! Currently, the only multi-worker strategy that Train supports is tf.distribute.MultiWorkerMirroredStrategy, which shards the dataset evenly across workers and synchronizes parameter updates so that workers share the same weights at all times. Let’s start by installing Ray and AIR modules if we haven’t already: !pip install "ray[air]" Update the train function As a first step, let’s implement the following: Add a config argument as an easy way to pass in hyperparameters such as batch_size_per_worker through Ray Train. Set up the model to communicate gradients and synchronize model weights between workers under the tf.distribute.MultiWorkerMirroredStrategy strategy. 
Enable data-parallel distributed training by sharding the training data (and test data) so that each worker only deals with a subset of the data. Enable checkpointing and metric reporting to get access to the trained model and results after our training job has finished. We only need to change a few lines of code: from ray.air import session from ray.air.integrations.keras import ReportCheckpointCallback # 1. Add a `config` argument to the train function to pass in hyperparameters def train_func(config: dict): epochs = config.get("epochs", 5) batch_size_per_worker = config.get("batch_size", 32) # 2. Build and compile the model within tf.distribute strategy scope # Important: The strategy must be instantiated at the beginning # of the function, since the tf.Dataset that we load later needs # to be auto-sharded. # See https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras # for more details. strategy = tf.distribute.MultiWorkerMirroredStrategy() with strategy.scope(): loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) optimizer = tf.keras.optimizers.Adam() model = build_model() model.compile( optimizer=optimizer, loss=loss_object, metrics=["accuracy"], ) # 3. Set a `global_batch_size` so that every worker gets the specified # `batch_size_per_worker` regardless of the number of workers. # This is needed because the datasets are sharded across workers. global_batch_size = batch_size_per_worker * session.get_world_size() train_ds, test_ds = get_train_test_datasets(global_batch_size) # ^ Even though we are loading the datasets the same way as before, the # TF dataset will automatically shard the datasets across workers, # according to the strategy. # ... # 4. Use a Keras callback provided by Ray AIR to report metrics and checkpoint report_metrics_and_checkpoint_callback = ReportCheckpointCallback(report_metrics_on="epoch_end") model.fit( ..., callbacks=[report_metrics_and_checkpoint_callback] ) We see above that we pass a Keras ReportCheckpointCallback into Model.fit, which is an AIR integration that reports metrics and saves checkpoints after each epoch (configurable via the on parameter). The callback will automatically report metrics such as loss and accuracy that are specified when compiling the model. Let’s see the updated training function after these additions: from ray.air import session from ray.air.integrations.keras import ReportCheckpointCallback # 1. Pass in the hyperparameter config def train_func(config: dict): epochs = config.get("epochs", 5) batch_size_per_worker = config.get("batch_size", 32) # 2. Synchronized model setup strategy = tf.distribute.MultiWorkerMirroredStrategy() with strategy.scope(): loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) optimizer = tf.keras.optimizers.Adam() model = build_model() model.compile( optimizer=optimizer, loss=loss_object, metrics=["accuracy"], ) # 3. Shard the dataset across `session.get_world_size()` workers global_batch_size = batch_size_per_worker * session.get_world_size() train_ds, test_ds = get_train_test_datasets(global_batch_size) if session.get_world_rank() == 0: print(f"\nDataset is sharded across {session.get_world_size()} workers:") # The number of samples is approximate, because the dataset size is not always # a multiple of batch_size, so some batches could contain fewer than # `batch_size_per_worker` samples.
print( f"# training batches per worker = {len(train_ds)} " f"(~{len(train_ds) * batch_size_per_worker} samples)" ) print( f"# test batches per worker = {len(test_ds)} " f"(~{len(test_ds) * batch_size_per_worker} samples)" ) # 4. Report metrics and checkpoint the model report_metrics_and_checkpoint_callback = ReportCheckpointCallback(report_metrics_on="epoch_end") model.fit( train_ds, epochs=epochs, callbacks=[report_metrics_and_checkpoint_callback], verbose=(0 if session.get_world_rank() != 0 else 2), ) eval_result = model.evaluate(test_ds, return_dict=True, verbose=0) test_loss = eval_result["loss"] test_accuracy = eval_result["accuracy"] if session.get_world_rank() == 0: print( f"Final Test Loss: {test_loss:.4f}, " f"Final Test Accuracy: {test_accuracy:.4f}" ) A few notes on the session API introduced by Ray AIR: session.get_world_size() is a Ray AIR helper that gets the number of workers doing training. In the updated code below, we also use the helper session.get_world_rank() to only print logs on the head worker node (with rank 0) so that the output isn’t spammed by logs from all workers. Move data loading inside of the training function One important detail is that we should not try to use loaded data from outside of the training function. If we try to reference the training data from outside the training function, Ray serializes it to make it accessible to the remote tasks (that may get executed on a remote node!), and it’s not ideal to ship the data around the cluster unnecessarily. Instead, move the dataset loading part into the train_func(). This will download the data to disk once per machine and result in much more efficient data loading. Let’s update the get_train_test_datasets method to load the MNIST dataset rather than use a reference from outside the train function. def get_train_test_datasets(batch_size): # NEW: Now, the dataset will be downloaded to disk once per machine mnist = tf.keras.datasets.mnist (x_train, y_train), (x_test, y_test) = mnist.load_data() x_train, x_test = x_train / 255.0, x_test / 255.0 train_ds = tf.data.Dataset.from_tensor_slices( (x_train, y_train)).shuffle(len(x_train)).batch(batch_size) test_ds = tf.data.Dataset.from_tensor_slices( (x_test, y_test)).batch(batch_size) return train_ds, test_ds Start training with TensorflowTrainer Now, we’ll use Ray Train’s TensorflowTrainer to kick off the distributed training. A few notes on the configs set below: train_loop_config sets the hyperparameters passed into the training loop as the config parameter scaling_config configures how many parallel workers to use, the resources required per worker, and whether we want to enable GPU training or not. See this configuration guide for more details on how to configure the trainer. from ray import air from ray.train.tensorflow import TensorflowTrainer num_workers = 2 use_gpu = False trainer = TensorflowTrainer( train_loop_per_worker=train_func, train_loop_config={ "batch_size": 32, "epochs": 4, }, scaling_config=air.ScalingConfig( num_workers=num_workers, use_gpu=use_gpu, ), ) result = trainer.fit() Great, this works 🎉! You’re now training your model in parallel. You could now scale this up to more nodes and workers on your Ray cluster. We can use the training Result output of trainer.fit() to view some reported metrics. See the Result documentation for a full list of what’s available. Let’s plot the training loss vs. training iteration. 
result.metrics_dataframe.plot("training_iteration", "loss") Summary This tutorial demonstrated how a few lines of code with Ray AIR APIs can allow you to scale up your Tensorflow model training. We learned how to: enable distributed training using Ray Train abstractions save and retrieve model checkpoints via Ray AIR load a model for batch prediction In our other examples, you can learn how to do more things with the Ray AIR API, such as serving your model with Ray Serve or tuning your hyperparameters with Ray Tune. You can also learn how to perform offline batch inference with Ray Data. See this table for a full catalog of frameworks that AIR supports out of the box. We hope this tutorial gave you a good starting point to leverage Ray AIR. If you have any questions, suggestions, or run into any problems please reach out on Discuss, GitHub, or the Ray Slack! Tabular data training and serving with Keras and Ray AIR This notebook is adapted from a Keras tutorial. It uses the Chicago Taxi dataset and a DNN Keras model to predict whether a trip may generate a big tip. In this example, we showcase how to achieve the same tasks as the Keras Tutorial using Ray AIR, covering every step from data ingestion to pushing a model to serving. Read a CSV into a Dataset. Process the dataset by chaining Ray AIR preprocessors. Train the model using the TensorflowTrainer from AIR. Serve the model using Ray Serve and the above preprocessors. Uncomment and run the following line in order to install all the necessary dependencies: # ! pip install "tensorflow>=2.8.0" "ray[air]>=2.0.0" Set up Ray We will use ray.init() to initialize a local cluster. By default, this cluster will be composed of only the machine you are running this notebook on. If you wish to attach to an existing Ray cluster, you can do so through ray.init(address="auto"). from pprint import pprint import ray ray.shutdown() ray.init() 2022-11-08 22:33:29,918 INFO worker.py:1528 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

Python version: 3.8.6
Ray version: 2.6.3
Dashboard: http://127.0.0.1:8265
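If you already have a cluster running, a minimal sketch of attaching to it instead of starting a fresh local instance could look like this (the addresses below are assumptions; substitute the address of your own cluster):

import ray

# Sketch: connect to an existing Ray cluster rather than starting a new local one.
# "auto" picks up a cluster started on this machine; a "ray://<head-node>:10001"
# address would target a remote cluster through Ray Client.
if ray.is_initialized():
    ray.shutdown()
ray.init(address="auto")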
We can check the resources our cluster is composed of. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on that machine. pprint(ray.cluster_resources()) {'CPU': 16.0, 'memory': 6000536781.0, 'node:127.0.0.1': 1.0, 'object_store_memory': 2147483648.0} Getting the data Let’s start by defining a helper function to get the data to work with. Some columns are dropped for simplicity. import pandas as pd INPUT = "input" LABEL = "is_big_tip" def get_data() -> pd.DataFrame: """Fetch the taxi fare data to work on.""" _data = pd.read_csv( "https://raw.githubusercontent.com/tensorflow/tfx/master/" "tfx/examples/chicago_taxi_pipeline/data/simple/data.csv" ) _data[LABEL] = _data["tips"] / _data["fare"] > 0.2 # We drop some columns here for the sake of simplicity. return _data.drop( [ "tips", "fare", "dropoff_latitude", "dropoff_longitude", "pickup_latitude", "pickup_longitude", "pickup_census_tract", ], axis=1, ) data = get_data() Now let’s take a look at the data. Notice that some values are missing. This is exactly where preprocessing comes into the picture. We will come back to this in the preprocessing section below. data.head(5)
pickup_community_area trip_start_month trip_start_hour trip_start_day trip_start_timestamp trip_miles dropoff_census_tract payment_type company trip_seconds dropoff_community_area is_big_tip
0 NaN 5 19 6 1400269500 0.0 NaN Credit Card Chicago Elite Cab Corp. (Chicago Carriag 0.0 NaN False
1 NaN 3 19 5 1362683700 0.0 NaN Unknown Chicago Elite Cab Corp. 300.0 NaN False
2 60.0 10 2 3 1380593700 12.6 NaN Cash Taxi Affiliation Services 1380.0 NaN False
3 10.0 10 1 2 1382319000 0.0 NaN Cash Taxi Affiliation Services 180.0 NaN False
4 14.0 5 7 5 1369897200 0.0 NaN Cash Dispatch Taxi Affiliation 1080.0 NaN False
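As a quick check on those missing values, a small sketch using plain pandas (not part of the original notebook) counts the NaNs per column; the Imputer preprocessors defined below fill these in:

# Sketch: count missing values per column of the taxi DataFrame returned by get_data().
missing_counts = data.isnull().sum().sort_values(ascending=False)
print(missing_counts[missing_counts > 0])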
We continue to split the data into training and test data. For the test data, we separate out the features to run serving on as well as labels to compare serving results with. import numpy as np from sklearn.model_selection import train_test_split from typing import Tuple def split_data(data: pd.DataFrame) -> Tuple[ray.data.Dataset, pd.DataFrame, np.array]: """Split the data in a stratified way. Returns: A tuple containing train dataset, test data and test label. """ # There is a native offering in Dataset for split as well. # However, supporting stratification is a TODO there. So use # scikit-learn equivalent here. train_data, test_data = train_test_split( data, stratify=data[[LABEL]], random_state=1113 ) _train_ds = ray.data.from_pandas(train_data) _test_label = test_data[LABEL].values _test_df = test_data.drop([LABEL], axis=1) return _train_ds, _test_df, _test_label train_ds, test_df, test_label = split_data(data) print(f"There are {train_ds.count()} samples for training and {test_df.shape[0]} samples for testing.") There are 11251 samples for training and 3751 samples for testing. Preprocessing Let’s focus on preprocessing first. Usually, input data needs to go through some preprocessing before being fed into model. It is a good idea to package preprocessing logic into a modularized component so that the same logic can be applied to both training data as well as data for online serving or offline batch prediction. In AIR, this component is a Preprocessor. It is constructed in a way that allows easy composition. Now let’s construct a chained preprocessor composed of simple preprocessors, including Imputer for filling missing features; OneHotEncoder for encoding categorical features; BatchMapper where arbitrary user-defined function can be applied to batches of records. Here, we implement a custom BatchMapper for extracting year information out of the timestamp. Concatenator to combine multiple features into a single tensor feature which is used as the input to our model. Take a look at Preprocessor for more information on the built-in preprocessors. The output of the preprocessing step goes into model for training. from ray.data.preprocessors import ( BatchMapper, Concatenator, Chain, OneHotEncoder, SimpleImputer, ) def get_preprocessor(): """Construct a chain of preprocessors.""" imputer1 = SimpleImputer( ["dropoff_census_tract"], strategy="most_frequent" ) imputer2 = SimpleImputer( ["pickup_community_area", "dropoff_community_area"], strategy="most_frequent", ) imputer3 = SimpleImputer(["payment_type"], strategy="most_frequent") imputer4 = SimpleImputer( ["company"], strategy="most_frequent") imputer5 = SimpleImputer( ["trip_start_timestamp", "trip_miles", "trip_seconds"], strategy="mean" ) ohe = OneHotEncoder( columns=[ "trip_start_hour", "trip_start_day", "trip_start_month", "dropoff_census_tract", "pickup_community_area", "dropoff_community_area", "payment_type", "company", ], max_categories={ "dropoff_census_tract": 25, "pickup_community_area": 20, "dropoff_community_area": 20, "payment_type": 2, "company": 7, }, ) def batch_mapper_fn(df): df["trip_start_year"] = pd.to_datetime(df["trip_start_timestamp"], unit="s").dt.year df = df.drop(["trip_start_timestamp"], axis=1) return df chained_pp = Chain( imputer1, imputer2, imputer3, imputer4, imputer5, ohe, BatchMapper(batch_mapper_fn, batch_format="pandas"), # Concatenate all columns, except LABEL into a single tensor column with name INPUT. 
Concatenator(output_column_name=INPUT, exclude=[LABEL]) ) return chained_pp Now let’s define some constants for clarity. # Note that `INPUT_SIZE` here corresponds to the dimension # of the previously created tensor column during preprocessing. # This is used to specify the input shape of the Keras model. INPUT_SIZE = 120 # The global training batch size. Based on `NUM_WORKERS`, each worker # will get its own share of this batch size. For example, if # `NUM_WORKERS = 2`, each worker will work on 4 samples per batch. BATCH_SIZE = 8 # Number of epochs. Adjust it based on how long you want the run to take. EPOCH = 1 # Number of training workers. # Adjust this accordingly based on the resources you have! NUM_WORKERS = 2 Training Let’s start by defining a simple Keras model for the classification task. import tensorflow as tf def build_model(): model = tf.keras.models.Sequential() model.add(tf.keras.Input(shape=(INPUT_SIZE,))) model.add(tf.keras.layers.Dense(50, activation="relu")) model.add(tf.keras.layers.Dense(1, activation="sigmoid")) return model Now let’s define the training loop. This code will be run on each training worker in a distributed fashion. See more details here. from ray.air import session, Checkpoint from ray.train.tensorflow import TensorflowCheckpoint def train_loop_per_worker(): dataset_shard = session.get_dataset_shard("train") strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() with strategy.scope(): model = build_model() model.compile( loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"], ) for epoch in range(EPOCH): tf_dataset = dataset_shard.to_tf(feature_columns=INPUT, label_columns=LABEL, batch_size=BATCH_SIZE, drop_last=True) model.fit(tf_dataset, verbose=0) # This saves the checkpoint in a way that can be used by Ray Serve coherently. session.report( {}, checkpoint=TensorflowCheckpoint.from_model(model), ) Now let’s define a trainer that takes in the training loop, the training dataset as well as the preprocessor that we just defined. And run it! Notice that you can tune how long you want the run to be by changing EPOCH. from ray.train.tensorflow import TensorflowTrainer from ray.air.config import ScalingConfig trainer = TensorflowTrainer( train_loop_per_worker=train_loop_per_worker, scaling_config=ScalingConfig(num_workers=NUM_WORKERS), datasets={"train": train_ds}, preprocessor=get_preprocessor(), ) result = trainer.fit() Moving on to Serve We will use Ray Serve to serve the trained model. A core concept of Ray Serve is a Deployment. It allows you to define and update your business logic or models that will handle incoming requests as well as how this is exposed over HTTP or in Python. In the case of serving a model, ray.serve.air_integrations.Predictor and ray.serve.air_integrations.PredictorDeployment wrap a ray.air.checkpoint.Checkpoint into a Ray Serve deployment that can readily serve HTTP requests. Note that the Checkpoint captures both the model and the preprocessing steps in a way compatible with Ray Serve and ensures that the ML workload can transition seamlessly between training and serving. This removes the boilerplate code and minimizes the effort to bring your model to production! Let’s first wrap our checkpoint in a serve endpoint that exposes a URL to which requests can be sent.
Our Serve endpoint will take in JSON data as input, so we also specify an adapter to convert the JSON data to a Pandas Dataframe so it can be inputted to the TensorflowPredictor from ray import serve from ray.air.checkpoint import Checkpoint from ray.train.tensorflow import TensorflowPredictor from ray.serve import PredictorDeployment from ray.serve.http_adapters import pandas_read_json def serve_model(checkpoint: Checkpoint, model_definition, name="Model") -> str: """Expose a serve endpoint. Returns: serve URL. """ serve.run( PredictorDeployment.options(name=name).bind( TensorflowPredictor, checkpoint, model_definition=model_definition, http_adapter=pandas_read_json, ) ) return f"http://localhost:8000/" import ray # Generally speaking, training and serving are done in totally different ray clusters. # To simulate that, let's shutdown the old ray cluster in preparation for serving. ray.shutdown() endpoint_uri = serve_model(result.checkpoint, build_model) Let’s write a helper function to send requests to this endpoint and compare the results with labels. import json import requests import pandas as pd import numpy as np NUM_SERVE_REQUESTS = 10 def send_requests(df: pd.DataFrame, label: np.array): for i in range(NUM_SERVE_REQUESTS): one_row = df.iloc[[i]].to_dict() serve_result = requests.post(endpoint_uri, data=json.dumps(one_row), headers={"Content-Type": "application/json"}).json() print( f"request{i} prediction: {serve_result[0]['predictions']} " f"- label: {str(label[i])}" ) send_requests(test_df, test_label) Fine-tune a 🤗 Transformers model This notebook is based on an official 🤗 notebook - “How to fine-tune a model on text classification”. The main aim of this notebook is to show the process of conversion from vanilla 🤗 to Ray AIR 🤗 without changing the training logic unless necessary. In this notebook, we will: Set up Ray Load the dataset Preprocess the dataset with Ray AIR Run the training with Ray AIR Optionally, share the model with the community Uncomment and run the following line in order to install all the necessary dependencies (this notebook is being tested with transformers==4.19.1): #! pip install "datasets" "transformers>=4.19.0" "torch>=1.10.0" "mlflow" "ray[air]>=1.13" Set up Ray We will use ray.init() to initialize a local cluster. By default, this cluster will be comprised of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster. from pprint import pprint import ray ray.init() 2022-08-25 10:09:51,282 INFO worker.py:1223 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS 2022-08-25 10:09:51,697 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.31.80.117:9031... 2022-08-25 10:09:51,706 INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at https://session-i8ddtfaxhwypbvnyb9uzg7xs.i.anyscaleuserdata-staging.com/auth/?token=agh0_CkcwRQIhAJXwvxwq31GryaWthvXGCXZebsijbuqi7qL2pCa5uROOAiBGjzsyXAJFHLlaEI9zSlNI8ewtghKg5UV3t8NmlxuMcRJmEiCtvjcKE0VPiU7iQx51P9oPQjfpo5g1RJXccVSS5005cBgCIgNuL2E6DAj9xazjBhDwj4veAUIMCP3ClJgGEPCPi94B-gEeChxzZXNfaThERFRmQVhId1lwYlZueWI5dVpnN3hT&redirect_to=dashboard  2022-08-25 10:09:51,709 INFO packaging.py:342 -- Pushing file package 'gcs://_ray_pkg_3332f64b0a461fddc20be71129115d0a.zip' (0.34MiB) to Ray cluster... 2022-08-25 10:09:51,714 INFO packaging.py:351 -- Successfully pushed file package 'gcs://_ray_pkg_3332f64b0a461fddc20be71129115d0a.zip'. We can check the resources our cluster is composed of. 
If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on the said machine. pprint(ray.cluster_resources()) {'CPU': 208.0, 'GPU': 16.0, 'accelerator_type:T4': 4.0, 'memory': 616693614180.0, 'node:172.31.76.237': 1.0, 'node:172.31.80.117': 1.0, 'node:172.31.85.193': 1.0, 'node:172.31.85.32': 1.0, 'node:172.31.90.137': 1.0, 'object_store_memory': 259318055729.0} In this notebook, we will see how to fine-tune one of the 🤗 Transformers model to a text classification task of the GLUE Benchmark. We will be running the training using Ray AIR. You can change those two variables to control whether the training (which we will get to later) uses CPUs or GPUs, and how many workers should be spawned. Each worker will claim one CPU or GPU. Make sure not to request more resources than the resources present! By default, we will run the training with one GPU worker. use_gpu = True # set this to False to run on CPUs num_workers = 1 # set this to number of GPUs/CPUs you want to use Fine-tuning a model on a text classification task The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. If you would like to learn more, refer to the original notebook. Each task is named by its acronym, with mnli-mm standing for the mismatched version of MNLI (so same training set as mnli but different validation and test sets): GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"] This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the Model Hub as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly: task = "cola" model_checkpoint = "distilbert-base-uncased" batch_size = 16 Loading the dataset We will use the 🤗 Datasets library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions load_dataset and load_metric. Apart from mnli-mm being a special code, we can directly pass our task name to those functions. As Ray AIR doesn’t provide integrations for 🤗 Datasets yet, we will simply run the normal 🤗 Datasets code to load the dataset from the Hub. from datasets import load_dataset actual_task = "mnli" if task == "mnli-mm" else task datasets = load_dataset("glue", actual_task) The dataset object itself is DatasetDict, which contains one key for the training, validation, and test set (with more keys for the mismatched validation and test set in the special case of mnli). We will also need the metric. In order to avoid serialization errors, we will load the metric inside the training workers later. Therefore, now we will just define the function we will use. from datasets import load_metric def load_metric_fn(): return load_metric('glue', actual_task) The metric is an instance of datasets.Metric. Preprocessing the data with Ray AIR Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers’ Tokenizer, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires. 
To do all of this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which will ensure that: we get a tokenizer that corresponds to the model architecture we want to use, we download the vocabulary used when pretraining this specific checkpoint. from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True) We pass along use_fast=True to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument. To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary keeps track of the correspondence task to column names: task_to_keys = { "cola": ("sentence", None), "mnli": ("premise", "hypothesis"), "mnli-mm": ("premise", "hypothesis"), "mrpc": ("sentence1", "sentence2"), "qnli": ("question", "sentence"), "qqp": ("question1", "question2"), "rte": ("sentence1", "sentence2"), "sst2": ("sentence", None), "stsb": ("sentence1", "sentence2"), "wnli": ("sentence1", "sentence2"), } For Ray AIR, instead of using 🤗 Dataset objects directly, we will convert them to Ray Data. Both are backed by Arrow tables, so the conversion is straightforward. We will use the built-in ray.data.from_huggingface function. import ray.data ray_datasets = ray.data.from_huggingface(datasets) ray_datasets {'train': Dataset(num_blocks=1, num_rows=8551, schema={sentence: string, label: int64, idx: int32}), 'validation': Dataset(num_blocks=1, num_rows=1043, schema={sentence: string, label: int64, idx: int32}), 'test': Dataset(num_blocks=1, num_rows=1063, schema={sentence: string, label: int64, idx: int32})} We can then write the function that will preprocess our samples. We just feed them to the tokenizer with the argument truncation=True. This will ensure that an input longer than what the model selected can handle will be truncated to the maximum length accepted by the model. We use a BatchMapper to create a Ray AIR preprocessor that will map the function to the dataset in a distributed fashion. It will run during training and prediction. import pandas as pd from ray.data.preprocessors import BatchMapper def preprocess_function(examples: pd.DataFrame): # if we only have one column, we are inferring. # no need to tokenize in that case. if len(examples.columns) == 1: return examples examples = examples.to_dict("list") sentence1_key, sentence2_key = task_to_keys[task] if sentence2_key is None: ret = tokenizer(examples[sentence1_key], truncation=True) else: ret = tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True) # Add back the original columns ret = {**examples, **ret} return pd.DataFrame.from_dict(ret) batch_encoder = BatchMapper(preprocess_function, batch_format="pandas") Fine-tuning the model with Ray AIR Now that our data is ready, we can download the pretrained model and fine-tune it. Since all our tasks are about sentence classification, we use the AutoModelForSequenceClassification class. We will not go into details about each specific component of the training (see the original notebook for that). The tokenizer is the same as we have used to encoded the dataset before. The main difference when using the Ray AIR is that we need to create our 🤗 Transformers Trainer inside a function (trainer_init_per_worker) and return it. 
That function will be passed to the TransformersTrainer and will run on every Ray worker. The training will then proceed by means of PyTorch DDP. Make sure that you initialize the model, metric, and tokenizer inside that function. Otherwise, you may run into serialization errors. Furthermore, push_to_hub=True is not yet supported. Ray will, however, checkpoint the model at every epoch, allowing you to push it to the Hub manually. We will do that after the training. If you wish to use third-party logging libraries, such as MLflow or Weights&Biases, do not set them in TrainingArguments (they will be automatically disabled) - instead, you should pass Ray AIR callbacks to TransformersTrainer’s run_config. In this example, we will use MLflow. from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer import numpy as np import torch num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2 metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy" model_name = model_checkpoint.split("/")[-1] validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation" name = f"{model_name}-finetuned-{task}" def trainer_init_per_worker(train_dataset, eval_dataset = None, **config): print(f"Is CUDA available: {torch.cuda.is_available()}") metric = load_metric_fn() tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True) model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels) args = TrainingArguments( name, evaluation_strategy="epoch", save_strategy="epoch", logging_strategy="epoch", learning_rate=config.get("learning_rate", 2e-5), per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size, num_train_epochs=config.get("epochs", 2), weight_decay=config.get("weight_decay", 0.01), push_to_hub=False, disable_tqdm=True, # declutter the output a little no_cuda=not use_gpu, # you need to explicitly set no_cuda if you want CPUs ) def compute_metrics(eval_pred): predictions, labels = eval_pred if task != "stsb": predictions = np.argmax(predictions, axis=1) else: predictions = predictions[:, 0] return metric.compute(predictions=predictions, references=labels) trainer = Trainer( model, args, train_dataset=train_dataset, eval_dataset=eval_dataset, tokenizer=tokenizer, compute_metrics=compute_metrics ) print("Starting training") return trainer With our trainer_init_per_worker complete, we can now instantiate the TransformersTrainer. Aside from the function, we set the scaling_config, controlling the number of workers and resources used, and the datasets we will use for training and evaluation. We specify the MLflowLoggerCallback inside the run_config, and pass the preprocessor we have defined earlier as an argument. The preprocessor will be included with the returned Checkpoint, meaning it will also be applied during inference.
from ray.train.huggingface import TransformersTrainer from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig from ray.air.integrations.mlflow import MLflowLoggerCallback trainer = TransformersTrainer( trainer_init_per_worker=trainer_init_per_worker, scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu), datasets={ "train": ray_datasets["train"], "evaluation": ray_datasets[validation_key], }, run_config=RunConfig( callbacks=[MLflowLoggerCallback(experiment_name=name)], checkpoint_config=CheckpointConfig( num_to_keep=1, checkpoint_score_attribute="eval_loss", checkpoint_score_order="min", ), ), preprocessor=batch_encoder, ) Finally, we call the fit method to start training with Ray AIR. We will save the Result object to a variable so we can access metrics and checkpoints. result = trainer.fit() == Status ==
Current time: 2022-08-25 10:14:09 (running for 00:04:06.45)
Memory usage on this node: 4.3/62.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/208 CPUs, 0/16 GPUs, 0.0/574.34 GiB heap, 0.0/241.51 GiB objects (0.0/4.0 accelerator_type:T4)
Result logdir: /home/ray/ray_results/TransformersTrainer_2022-08-25_10-10-02
Number of trials: 1/1 (1 TERMINATED)
Trial name                       status      loc                 iter   total time (s)   loss     learning_rate   epoch
TransformersTrainer_c1ff5_00000  TERMINATED  172.31.90.137:947   2      200.217          0.3886   0               2


(RayTrainWorker pid=1114, ip=172.31.90.137) 2022-08-25 10:10:44,617 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4] (RayTrainWorker pid=1114, ip=172.31.90.137) Is CUDA available: True (RayTrainWorker pid=1116, ip=172.31.90.137) Is CUDA available: True (RayTrainWorker pid=1117, ip=172.31.90.137) Is CUDA available: True (RayTrainWorker pid=1115, ip=172.31.90.137) Is CUDA available: True Downloading builder script: 5.76kB [00:00, 6.45MB/s] Downloading builder script: 5.76kB [00:00, 6.91MB/s] Downloading builder script: 5.76kB [00:00, 6.44MB/s] Downloading builder script: 5.76kB [00:00, 6.94MB/s] Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 30.5kB/s] Downloading config.json: 100%|██████████| 483/483 [00:00<00:00, 817kB/s] Downloading vocab.txt: 0%| | 0.00/226k [00:00 If we would like to tune any hyperparameters of the model, we can do so by simply passing our TransformersTrainer into a Tuner and defining the search space. We can also take advantage of the advanced search algorithms and schedulers provided by Ray Tune. In this example, we will use an ASHAScheduler to aggressively terminate underperforming trials. from ray import tune from ray.tune import Tuner from ray.tune.schedulers.async_hyperband import ASHAScheduler tune_epochs = 4 tuner = Tuner( trainer, param_space={ "trainer_init_config": { "learning_rate": tune.grid_search([2e-5, 2e-4, 2e-3, 2e-2]), "epochs": tune_epochs, } }, tune_config=tune.TuneConfig( metric="eval_loss", mode="min", num_samples=1, scheduler=ASHAScheduler( max_t=tune_epochs, ) ), run_config=RunConfig( checkpoint_config=CheckpointConfig(num_to_keep=1, checkpoint_score_attribute="eval_loss", checkpoint_score_order="min") ), ) tune_results = tuner.fit() == Status ==
Current time: 2022-08-25 10:20:13 (running for 00:06:01.75)
Memory usage on this node: 4.4/62.0 GiB
Using AsyncHyperBand: num_stopped=4 Bracket: Iter 4.000: -0.8064090609550476 | Iter 1.000: -0.6378736793994904
Resources requested: 0/208 CPUs, 0/16 GPUs, 0.0/574.34 GiB heap, 0.0/241.51 GiB objects (0.0/4.0 accelerator_type:T4)
Current best trial: 5654d_00001 with eval_loss=0.6492420434951782 and parameters={'trainer_init_config': {'learning_rate': 0.0002, 'epochs': 4}}
Result logdir: /home/ray/ray_results/TransformersTrainer_2022-08-25_10-14-11
Number of trials: 4/4 (4 TERMINATED)
Trial name                       status      loc                  trainer_init_conf...   iter   total time (s)   loss     learning_rate   epoch
TransformersTrainer_5654d_00000  TERMINATED  172.31.90.137:1729   2e-05                  4      347.171          0.1958   0               4
TransformersTrainer_5654d_00001  TERMINATED  172.31.76.237:1805   0.0002                 1      95.2492          0.6225   0.00015         1
TransformersTrainer_5654d_00002  TERMINATED  172.31.85.32:1322    0.002                  1      93.7613          0.6463   0.0015          1
TransformersTrainer_5654d_00003  TERMINATED  172.31.85.193:1060   0.02                   1      99.3677          0.926    0.015           1


(RayTrainWorker pid=1789, ip=172.31.90.137) 2022-08-25 10:14:23,379 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4] (RayTrainWorker pid=1792, ip=172.31.90.137) Is CUDA available: True (RayTrainWorker pid=1790, ip=172.31.90.137) Is CUDA available: True (RayTrainWorker pid=1791, ip=172.31.90.137) Is CUDA available: True (RayTrainWorker pid=1789, ip=172.31.90.137) Is CUDA available: True (RayTrainWorker pid=1974, ip=172.31.76.237) 2022-08-25 10:14:29,354 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4] (RayTrainWorker pid=1977, ip=172.31.76.237) Is CUDA available: True (RayTrainWorker pid=1976, ip=172.31.76.237) Is CUDA available: True (RayTrainWorker pid=1975, ip=172.31.76.237) Is CUDA available: True (RayTrainWorker pid=1974, ip=172.31.76.237) Is CUDA available: True (RayTrainWorker pid=1483, ip=172.31.85.32) 2022-08-25 10:14:35,313 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4] (RayTrainWorker pid=1790, ip=172.31.90.137) Starting training (RayTrainWorker pid=1792, ip=172.31.90.137) Starting training (RayTrainWorker pid=1791, ip=172.31.90.137) Starting training (RayTrainWorker pid=1789, ip=172.31.90.137) Starting training (RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running training ***** (RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 8551 (RayTrainWorker pid=1789, ip=172.31.90.137) Num Epochs = 4 (RayTrainWorker pid=1789, ip=172.31.90.137) Instantaneous batch size per device = 16 (RayTrainWorker pid=1789, ip=172.31.90.137) Total train batch size (w. parallel, distributed & accumulation) = 64 (RayTrainWorker pid=1789, ip=172.31.90.137) Gradient Accumulation steps = 1 (RayTrainWorker pid=1789, ip=172.31.90.137) Total optimization steps = 2140 (RayTrainWorker pid=1483, ip=172.31.85.32) Is CUDA available: True (RayTrainWorker pid=1485, ip=172.31.85.32) Is CUDA available: True (RayTrainWorker pid=1486, ip=172.31.85.32) Is CUDA available: True (RayTrainWorker pid=1484, ip=172.31.85.32) Is CUDA available: True (RayTrainWorker pid=1977, ip=172.31.76.237) Starting training (RayTrainWorker pid=1976, ip=172.31.76.237) Starting training (RayTrainWorker pid=1975, ip=172.31.76.237) Starting training (RayTrainWorker pid=1974, ip=172.31.76.237) Starting training (RayTrainWorker pid=1974, ip=172.31.76.237) ***** Running training ***** (RayTrainWorker pid=1974, ip=172.31.76.237) Num examples = 8551 (RayTrainWorker pid=1974, ip=172.31.76.237) Num Epochs = 4 (RayTrainWorker pid=1974, ip=172.31.76.237) Instantaneous batch size per device = 16 (RayTrainWorker pid=1974, ip=172.31.76.237) Total train batch size (w. parallel, distributed & accumulation) = 64 (RayTrainWorker pid=1974, ip=172.31.76.237) Gradient Accumulation steps = 1 (RayTrainWorker pid=1974, ip=172.31.76.237) Total optimization steps = 2140 (RayTrainWorker pid=1483, ip=172.31.85.32) Starting training (RayTrainWorker pid=1485, ip=172.31.85.32) Starting training (RayTrainWorker pid=1486, ip=172.31.85.32) Starting training (RayTrainWorker pid=1484, ip=172.31.85.32) Starting training (RayTrainWorker pid=1483, ip=172.31.85.32) ***** Running training ***** (RayTrainWorker pid=1483, ip=172.31.85.32) Num examples = 8551 (RayTrainWorker pid=1483, ip=172.31.85.32) Num Epochs = 4 (RayTrainWorker pid=1483, ip=172.31.85.32) Instantaneous batch size per device = 16 (RayTrainWorker pid=1483, ip=172.31.85.32) Total train batch size (w. 
parallel, distributed & accumulation) = 64 (RayTrainWorker pid=1483, ip=172.31.85.32) Gradient Accumulation steps = 1 (RayTrainWorker pid=1483, ip=172.31.85.32) Total optimization steps = 2140 (RayTrainWorker pid=1223, ip=172.31.85.193) 2022-08-25 10:14:48,193 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4] (RayTrainWorker pid=1223, ip=172.31.85.193) Is CUDA available: True (RayTrainWorker pid=1224, ip=172.31.85.193) Is CUDA available: True (RayTrainWorker pid=1226, ip=172.31.85.193) Is CUDA available: True (RayTrainWorker pid=1225, ip=172.31.85.193) Is CUDA available: True Downloading builder script: 5.76kB [00:00, 6.59MB/s] Downloading builder script: 5.76kB [00:00, 6.52MB/s] Downloading builder script: 5.76kB [00:00, 6.07MB/s] Downloading builder script: 5.76kB [00:00, 6.81MB/s] Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 46.0kB/s] Downloading config.json: 100%|██████████| 483/483 [00:00<00:00, 766kB/s] Downloading vocab.txt: 0%| | 0.00/226k [00:00
loss learning_rate epoch step eval_loss eval_matthews_correlation eval_runtime eval_samples_per_second eval_steps_per_second _timestamp ... pid hostname node_ip time_since_restore timesteps_since_restore iterations_since_restore warmup_time config/trainer_init_config/epochs config/trainer_init_config/learning_rate logdir
1 0.6225 0.00015 1.0 535 0.649242 0.000000 1.0157 267.792 4.923 1661447759 ... 1805 ip-172-31-76-237 172.31.76.237 95.249164 0 1 0.003661 4 0.00020 /home/ray/ray_results/TransformersTrainer_2022-...
3 0.9260 0.01500 1.0 535 0.652943 0.000000 0.9428 288.510 5.303 1661447782 ... 1060 ip-172-31-85-193 172.31.85.193 99.367746 0 1 0.004133 4 0.02000 /home/ray/ray_results/TransformersTrainer_2022-...
2 0.6463 0.00150 1.0 535 0.658653 0.000000 0.9576 284.050 5.222 1661447764 ... 1322 ip-172-31-85-32 172.31.85.32 93.761317 0 1 0.004533 4 0.00200 /home/ray/ray_results/TransformersTrainer_2022-...
0 0.1958 0.00000 4.0 2140 0.806409 0.532286 1.0006 271.827 4.997 1661448005 ... 1729 ip-172-31-90-137 172.31.90.137 347.170584 0 4 0.003702 4 0.00002 /home/ray/ray_results/TransformersTrainer_2022-...

4 rows × 33 columns
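To dig further into these results, one option (a sketch; it assumes the ResultGrid API of recent Ray versions and the column names visible in the table above) is to pull the full results dataframe and rank the trials by the metric we asked Tune to minimize:

# Sketch: rank all trials by eval_loss, the metric passed to TuneConfig above.
df = tune_results.get_dataframe()
cols = [
    "eval_loss",
    "eval_matthews_correlation",
    "config/trainer_init_config/learning_rate",
]
print(df[cols].sort_values("eval_loss"))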

best_result = tune_results.get_best_result() Share the model To be able to share your model with the community, there are a few more steps to follow. We have conducted the training on the Ray cluster, but share the model from the local environment - this will allow us to easily authenticate. First, you have to store your authentication token from the Hugging Face website (sign up here if you haven’t already!) then execute the following cell and input your username and password: from huggingface_hub import notebook_login notebook_login() Then you need to install Git-LFS. Uncomment the following instructions: # !apt install git-lfs Now, load the model and tokenizer locally, and recreate the 🤗 Transformers Trainer: from ray.train.huggingface import TransformersCheckpoint checkpoint = TransformersCheckpoint.from_checkpoint(result.checkpoint) hf_trainer = checkpoint.get_model(model=AutoModelForSequenceClassification) You can now upload the result of the training to the Hub, just execute this instruction: hf_trainer.push_to_hub() You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier "your-username/the-name-you-picked" so for instance: from transformers import AutoModelForSequenceClassification model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model") Next steps End-to-end: Offline Batch Inference Training a model with Sklearn In this example, we will train a model in Ray AIR using a Sklearn classifier. Let’s start by installing our dependencies: !pip install -qU "ray[tune]" sklearn Then we need some imports: from typing import Tuple import ray from ray.data import Dataset from ray.train.sklearn import SklearnPredictor from ray.data.preprocessors import Chain, OrdinalEncoder, StandardScaler from ray.air.result import Result from ray.train.sklearn import SklearnTrainer from ray.air.config import ScalingConfig from sklearn.ensemble import RandomForestClassifier try: from cuml.ensemble import RandomForestClassifier as cuMLRandomForestClassifier except ImportError: cuMLRandomForestClassifier = None Next we define a function to load our train, validation, and test datasets. def prepare_data() -> Tuple[Dataset, Dataset, Dataset]: dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer_with_categorical.csv") train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) test_dataset = valid_dataset.drop_columns(["target"]) return train_dataset, valid_dataset, test_dataset The following function will create a Sklearn trainer, train it, and return the result.
def train_sklearn(num_cpus: int, use_gpu: bool = False) -> Result: if use_gpu and not cuMLRandomForestClassifier: raise RuntimeError("cuML must be installed for GPU enabled sklearn estimators.") train_dataset, valid_dataset, _ = prepare_data() # Scale some random columns columns_to_scale = ["mean radius", "mean texture"] preprocessor = Chain( OrdinalEncoder(["categorical_column"]), StandardScaler(columns=columns_to_scale) ) if use_gpu: trainer_resources = {"CPU": 1, "GPU": 1} estimator = cuMLRandomForestClassifier() else: trainer_resources = {"CPU": num_cpus} estimator = RandomForestClassifier() trainer = SklearnTrainer( estimator=estimator, label_column="target", datasets={"train": train_dataset, "valid": valid_dataset}, preprocessor=preprocessor, cv=5, scaling_config=ScalingConfig(trainer_resources=trainer_resources), ) result = trainer.fit() print(result.metrics) return result Now we can run the training: result = train_sklearn(num_cpus=2, use_gpu=False) 2022-06-22 17:27:37,741 INFO services.py:1477 -- View the Ray dashboard at http://127.0.0.1:8269 2022-06-22 17:27:39,822 WARNING read_api.py:260 -- The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks. Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 44.05it/s] == Status ==
Current time: 2022-06-22 17:27:59 (running for 00:00:18.31)
Memory usage on this node: 10.7/31.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/12.9 GiB heap, 0.0/6.45 GiB objects
Result logdir: /home/ubuntu/ray_results/SklearnTrainer_2022-06-22_17-27-40
Number of trials: 1/1 (1 TERMINATED)
Trial name                  status      loc                     iter   total time (s)   fit_time
SklearnTrainer_9dec8_00000  TERMINATED  172.31.43.110:1492629   1      15.6842          2.31571


(SklearnTrainer pid=1492629) 2022-06-22 17:27:45,647 WARNING pool.py:591 -- The 'context' argument is not supported using ray. Please refer to the documentation for how to control ray initialization. Result for SklearnTrainer_9dec8_00000: cv: fit_time: - 2.221003770828247 - 2.215489387512207 - 2.2075674533843994 - 2.222351312637329 - 2.312389612197876 fit_time_mean: 2.235760307312012 fit_time_std: 0.03866614559685742 score_time: - 0.022464990615844727 - 0.0230865478515625 - 0.02564835548400879 - 0.029137849807739258 - 0.021221637725830078 score_time_mean: 0.02431187629699707 score_time_std: 0.0028120522003997595 test_score: - 0.9625 - 0.9125 - 0.9875 - 1.0 - 0.9367088607594937 test_score_mean: 0.9598417721518986 test_score_std: 0.032128186960552516 date: 2022-06-22_17-27-59 done: false experiment_id: f8215019c10e4a81ba2187c38e875365 fit_time: 2.3157050609588623 hostname: ip-172-31-43-110 iterations_since_restore: 1 node_ip: 172.31.43.110 pid: 1492629 should_checkpoint: true time_since_restore: 15.684244871139526 time_this_iter_s: 15.684244871139526 time_total_s: 15.684244871139526 timestamp: 1655918879 timesteps_since_restore: 0 training_iteration: 1 trial_id: 9dec8_00000 valid: score_time: 0.03549623489379883 test_score: 0.9532163742690059 warmup_time: 0.0057866573333740234 Result for SklearnTrainer_9dec8_00000: cv: fit_time: - 2.221003770828247 - 2.215489387512207 - 2.2075674533843994 - 2.222351312637329 - 2.312389612197876 fit_time_mean: 2.235760307312012 fit_time_std: 0.03866614559685742 score_time: - 0.022464990615844727 - 0.0230865478515625 - 0.02564835548400879 - 0.029137849807739258 - 0.021221637725830078 score_time_mean: 0.02431187629699707 score_time_std: 0.0028120522003997595 test_score: - 0.9625 - 0.9125 - 0.9875 - 1.0 - 0.9367088607594937 test_score_mean: 0.9598417721518986 test_score_std: 0.032128186960552516 date: 2022-06-22_17-27-59 done: true experiment_id: f8215019c10e4a81ba2187c38e875365 experiment_tag: '0' fit_time: 2.3157050609588623 hostname: ip-172-31-43-110 iterations_since_restore: 1 node_ip: 172.31.43.110 pid: 1492629 should_checkpoint: true time_since_restore: 15.684244871139526 time_this_iter_s: 15.684244871139526 time_total_s: 15.684244871139526 timestamp: 1655918879 timesteps_since_restore: 0 training_iteration: 1 trial_id: 9dec8_00000 valid: score_time: 0.03549623489379883 test_score: 0.9532163742690059 warmup_time: 0.0057866573333740234 2022-06-22 17:27:59,333 INFO tune.py:734 -- Total run time: 19.09 seconds (18.31 seconds for the tuning loop). {'valid': {'score_time': 0.03549623489379883, 'test_score': 0.9532163742690059}, 'cv': {'fit_time': array([2.22100377, 2.21548939, 2.20756745, 2.22235131, 2.31238961]), 'score_time': array([0.02246499, 0.02308655, 0.02564836, 0.02913785, 0.02122164]), 'test_score': array([0.9625 , 0.9125 , 0.9875 , 1. 
, 0.93670886]), 'fit_time_mean': 2.235760307312012, 'fit_time_std': 0.03866614559685742, 'score_time_mean': 0.02431187629699707, 'score_time_std': 0.0028120522003997595, 'test_score_mean': 0.9598417721518986, 'test_score_std': 0.032128186960552516}, 'fit_time': 2.3157050609588623, 'time_this_iter_s': 15.684244871139526, 'should_checkpoint': True, 'done': True, 'timesteps_total': None, 'episodes_total': None, 'training_iteration': 1, 'trial_id': '9dec8_00000', 'experiment_id': 'f8215019c10e4a81ba2187c38e875365', 'date': '2022-06-22_17-27-59', 'timestamp': 1655918879, 'time_total_s': 15.684244871139526, 'pid': 1492629, 'hostname': 'ip-172-31-43-110', 'node_ip': '172.31.43.110', 'config': {}, 'time_since_restore': 15.684244871139526, 'timesteps_since_restore': 0, 'iterations_since_restore': 1, 'warmup_time': 0.0057866573333740234, 'experiment_tag': '0'} Next steps End-to-end: Offline Batch Inference Training a model with distributed XGBoost In this example we will train a model in Ray AIR using distributed XGBoost. Let’s start with installing our dependencies: !pip install -qU "ray[tune]" xgboost_ray [notice] A new release of pip available: 22.3.1 -> 23.1.2 [notice] To update, run: pip install --upgrade pip Then we need some imports: from typing import Tuple import ray from ray.train.xgboost import XGBoostPredictor from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig from ray.data import Dataset from ray.air.result import Result from ray.data.preprocessors import StandardScaler Next we define a function to load our train, validation, and test datasets. def prepare_data() -> Tuple[Dataset, Dataset, Dataset]: dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) test_dataset = valid_dataset.drop_columns(["target"]) return train_dataset, valid_dataset, test_dataset The following function will create a XGBoost trainer, train it, and return the result. def train_xgboost(num_workers: int, use_gpu: bool = False) -> Result: train_dataset, valid_dataset, _ = prepare_data() # Scale some random columns columns_to_scale = ["mean radius", "mean texture"] preprocessor = StandardScaler(columns=columns_to_scale) # XGBoost specific params params = { "tree_method": "approx", "objective": "binary:logistic", "eval_metric": ["logloss", "error"], } trainer = XGBoostTrainer( scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu), label_column="target", params=params, datasets={"train": train_dataset, "valid": valid_dataset}, preprocessor=preprocessor, num_boost_round=100, ) result = trainer.fit() print(result.metrics) return result Once we have the result, we can do batch inference on the obtained model. Let’s define a utility function for this. 
import pandas as pd from ray.air import Checkpoint from ray.data import ActorPoolStrategy class Predict: def __init__(self, checkpoint: Checkpoint): self.predictor = XGBoostPredictor.from_checkpoint(checkpoint) def __call__(self, batch: pd.DataFrame) -> pd.DataFrame: return self.predictor.predict(batch) def predict_xgboost(result: Result): _, _, test_dataset = prepare_data() scores = test_dataset.map_batches( Predict, fn_constructor_args=[result.checkpoint], compute=ActorPoolStrategy(), batch_format="pandas" ) predicted_labels = scores.map_batches(lambda df: (df > 0.5).astype(int), batch_format="pandas") print(f"PREDICTED LABELS") predicted_labels.show() Now we can run the training: result = train_xgboost(num_workers=2, use_gpu=False)

Tune Status

Current time: 2023-07-06 18:33:25
Running for: 00:00:06.19
Memory: 14.9/64.0 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 2.0/10 CPUs, 0/0 GPUs

Trial Status

Trial name                  status      loc               iter   total time (s)   train-logloss   train-error   valid-logloss
XGBoostTrainer_40fed_00000  TERMINATED  127.0.0.1:40725   101    4.90132          0.00587595      0             0.06215
(XGBoostTrainer pid=40725) The `preprocessor` arg to Trainer is deprecated. Apply preprocessor transformations ahead of time by calling `preprocessor.transform(ds)`. Support for the preprocessor arg will be dropped in a future release. (XGBoostTrainer pid=40725) Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format. (XGBoostTrainer pid=40725) Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Aggregate] (XGBoostTrainer pid=40725) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) (XGBoostTrainer pid=40725) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`    (pid=40725) Running: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.0 MiB/512.0 MiB object_store_memory: 0%| | 0/14 [00:00 TaskPoolMapOperator[MapBatches(StandardScaler._transform_pandas)]  (pid=40725) Running: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.0 MiB/512.0 MiB object_store_memory: 0%| | 0/14 [00:01 TaskPoolMapOperator[MapBatches(StandardScaler._transform_pandas)]  (pid=40725) Running: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.0 MiB/512.0 MiB object_store_memory: 0%| | 0/14 [00:01 ActorPoolMapOperator[MapBatches()->MapBatches(Predict)] -> TaskPoolMapOperator[MapBatches()] 2023-07-06 18:33:28,112 INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) 2023-07-06 18:33:28,114 INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` 2023-07-06 18:33:28,150 INFO actor_pool_map_operator.py:117 -- MapBatches()->MapBatches(Predict): Waiting for 1 pool actors to start... PREDICTED LABELS {'predictions': 1} {'predictions': 1} {'predictions': 0} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 0} {'predictions': 1} {'predictions': 0} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 0} {'predictions': 0} {'predictions': 1} {'predictions': 1} {'predictions': 0} Hyperparameter tuning with XGBoostTrainer In this example, we will go through how you can use Ray AIR to run a distributed hyperparameter experiment to find optimal hyperparameters for an XGBoost model. What we’ll cover: How to load data from an Sklearn example dataset How to initialize an XGBoost trainer How to define a search space for regular XGBoost parameters and for data preprocessors How to fetch the best obtained result from the tuning run How to fetch a dataframe to do further analysis on the results We’ll use the Covertype dataset provided from sklearn to train a multiclass classification task using XGBoost. In this dataset, we try to predict the forst cover type (e.g. “lodgehole pine”) from cartographic variables, like the distance to the closest road, or the hillshade at different times of the day. The features are binary, discrete and continuous and thus well suited for a decision-tree based classification task. You can find more information about the dataset on the dataset homepage. We will train XGBoost models on this dataset. 
Because model training performance can be influenced by hyperparameter choices, we will generate several different configurations and train them in parallel. Notably each of these trials will itself start a distributed training job to speed up training. All of this happens automatically within Ray AIR. First, let’s make sure we have all dependencies installed: !pip install -q "ray[air]" scikit-learn WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available. You should consider upgrading via the '/Users/kai/.pyenv/versions/3.7.7/bin/python3.7 -m pip install --upgrade pip' command. Then we can start with some imports. import pandas as pd from sklearn.datasets import fetch_covtype import ray from ray import tune from ray.air import RunConfig, ScalingConfig from ray.train.xgboost import XGBoostTrainer from ray.tune.tune_config import TuneConfig from ray.tune.tuner import Tuner We’ll define a utility function to create a Dataset from the Sklearn dataset. We expect the target column to be in the dataframe, so we’ll add it to the dataframe manually. def get_training_data() -> ray.data.Dataset: data_raw = fetch_covtype() df = pd.DataFrame(data_raw["data"], columns=data_raw["feature_names"]) df["target"] = data_raw["target"] return ray.data.from_pandas(df) train_dataset = get_training_data() 2022-05-13 12:31:51,444 INFO services.py:1484 -- View the Ray dashboard at http://127.0.0.1:8265 Let’s take a look at the schema here: print(train_dataset) Dataset(num_blocks=1, num_rows=581012, schema={Elevation: float64, Aspect: float64, Slope: float64, Horizontal_Distance_To_Hydrology: float64, Vertical_Distance_To_Hydrology: float64, Horizontal_Distance_To_Roadways: float64, Hillshade_9am: float64, Hillshade_Noon: float64, Hillshade_3pm: float64, Horizontal_Distance_To_Fire_Points: float64, Wilderness_Area_0: float64, Wilderness_Area_1: float64, Wilderness_Area_2: float64, Wilderness_Area_3: float64, Soil_Type_0: float64, Soil_Type_1: float64, Soil_Type_2: float64, Soil_Type_3: float64, Soil_Type_4: float64, Soil_Type_5: float64, Soil_Type_6: float64, Soil_Type_7: float64, Soil_Type_8: float64, Soil_Type_9: float64, Soil_Type_10: float64, Soil_Type_11: float64, Soil_Type_12: float64, Soil_Type_13: float64, Soil_Type_14: float64, Soil_Type_15: float64, Soil_Type_16: float64, Soil_Type_17: float64, Soil_Type_18: float64, Soil_Type_19: float64, Soil_Type_20: float64, Soil_Type_21: float64, Soil_Type_22: float64, Soil_Type_23: float64, Soil_Type_24: float64, Soil_Type_25: float64, Soil_Type_26: float64, Soil_Type_27: float64, Soil_Type_28: float64, Soil_Type_29: float64, Soil_Type_30: float64, Soil_Type_31: float64, Soil_Type_32: float64, Soil_Type_33: float64, Soil_Type_34: float64, Soil_Type_35: float64, Soil_Type_36: float64, Soil_Type_37: float64, Soil_Type_38: float64, Soil_Type_39: float64, target: int32}) Since we’ll be training a multiclass prediction model, we have to pass some information to XGBoost. For instance, XGBoost expects us to provide the number of classes, and multiclass-enabled evaluation metrices. For a good overview of commonly used hyperparameters, see our tutorial in the docs. # XGBoost specific params params = { "tree_method": "approx", "objective": "multi:softmax", "eval_metric": ["mlogloss", "merror"], "num_class": 8, "min_child_weight": 2 } With these parameters in place, we’ll create a Ray AIR XGBoostTrainer. Note that we pass in a scaling_config to configure the distributed training behavior of each individual XGBoost training job. 
We want to distribute training across 2 workers. We also keep some CPU resources free for Ray Data operations.

The label_column specifies which column in the dataset contains the target values.
params are the XGBoost training params defined above - we can tune these later!
The datasets dict contains the dataset we would like to train on.
Lastly, we pass the number of boosting rounds to XGBoost.

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2, _max_cpu_fraction_per_node=0.9),
    label_column="target",
    params=params,
    datasets={"train": train_dataset},
    num_boost_round=10,
)

We can now create the Tuner with a search space to override some of the default parameters in the XGBoost trainer. Here, we just want to tune the XGBoost max_depth and min_child_weight parameters. Note that we explicitly set min_child_weight=2 in the default XGBoost trainer - this value will be overwritten during tuning.

We configure Tune to minimize the train-mlogloss metric. In random search, this doesn't affect the evaluated configurations, but it will affect our default results fetching for analysis later.

By the way, the name train-mlogloss is provided by the XGBoost library - train is the name of the dataset and mlogloss is the metric we passed in the XGBoost params above. Trainables can report any number of results (in this case we report 2), but most search algorithms only act on one of them - here we chose the mlogloss.

tuner = Tuner(
    trainer,
    run_config=RunConfig(verbose=1),
    param_space={
        "params": {
            "max_depth": tune.randint(2, 8),
            "min_child_weight": tune.randint(1, 10),
        },
    },
    tune_config=TuneConfig(num_samples=8, metric="train-mlogloss", mode="min"),
)

Let's run the tuning. This will take a few minutes to complete.

results = tuner.fit()

== Status ==
Current time: 2022-05-13 12:35:33 (running for 00:03:37.49)
Memory usage on this node: 10.0/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/6.73 GiB heap, 0.0/2.0 GiB objects
Current best trial: 4ab2f_00007 with train-mlogloss=0.560217 and parameters={'params': {'max_depth': 7, 'min_child_weight': 4}}
Result logdir: /Users/kai/ray_results/XGBoostTrainer_2022-05-13_12-31-55
Number of trials: 8/8 (8 TERMINATED)

(GBDTTrainable pid=62456) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks. (GBDTTrainable pid=62456) 2022-05-13 12:32:02,793 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training. (GBDTTrainable pid=62464) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks. (GBDTTrainable pid=62463) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks. (GBDTTrainable pid=62465) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks. (GBDTTrainable pid=62466) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks. (GBDTTrainable pid=62463) 2022-05-13 12:32:05,102 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training. (GBDTTrainable pid=62466) 2022-05-13 12:32:05,204 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training. (GBDTTrainable pid=62464) 2022-05-13 12:32:05,338 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training. (GBDTTrainable pid=62465) 2022-05-13 12:32:07,164 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training. (GBDTTrainable pid=62456) 2022-05-13 12:32:10,549 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training. (_RemoteRayXGBoostActor pid=62495) [12:32:10] task [xgboost.ray]:6975277392 got new rank 1 (_RemoteRayXGBoostActor pid=62494) [12:32:10] task [xgboost.ray]:4560390352 got new rank 0 (raylet) Spilled 2173 MiB, 22 objects, write throughput 402 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message. (GBDTTrainable pid=62463) 2022-05-13 12:32:17,848 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training. (_RemoteRayXGBoostActor pid=62523) [12:32:18] task [xgboost.ray]:4441524624 got new rank 0 (_RemoteRayXGBoostActor pid=62524) [12:32:18] task [xgboost.ray]:6890641808 got new rank 1 (GBDTTrainable pid=62465) 2022-05-13 12:32:21,253 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training. (GBDTTrainable pid=62466) 2022-05-13 12:32:21,529 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training. (_RemoteRayXGBoostActor pid=62563) [12:32:21] task [xgboost.ray]:4667801680 got new rank 1 (_RemoteRayXGBoostActor pid=62562) [12:32:21] task [xgboost.ray]:6856360848 got new rank 0 (_RemoteRayXGBoostActor pid=62530) [12:32:21] task [xgboost.ray]:6971527824 got new rank 0 (_RemoteRayXGBoostActor pid=62532) [12:32:21] task [xgboost.ray]:4538321232 got new rank 1 (GBDTTrainable pid=62464) 2022-05-13 12:32:21,937 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training. (_RemoteRayXGBoostActor pid=62544) [12:32:21] task [xgboost.ray]:7005661840 got new rank 1 (_RemoteRayXGBoostActor pid=62543) [12:32:21] task [xgboost.ray]:4516088080 got new rank 0 (raylet) Spilled 4098 MiB, 83 objects, write throughput 347 MiB/s. (GBDTTrainable pid=62456) 2022-05-13 12:32:41,289 INFO main.py:1109 -- Training in progress (31 seconds since last restart). 
(GBDTTrainable pid=62463) 2022-05-13 12:32:48,617 INFO main.py:1109 -- Training in progress (31 seconds since last restart). (GBDTTrainable pid=62465) 2022-05-13 12:32:52,110 INFO main.py:1109 -- Training in progress (31 seconds since last restart). (GBDTTrainable pid=62466) 2022-05-13 12:32:52,448 INFO main.py:1109 -- Training in progress (31 seconds since last restart). (GBDTTrainable pid=62464) 2022-05-13 12:32:52,692 INFO main.py:1109 -- Training in progress (31 seconds since last restart). (GBDTTrainable pid=62456) 2022-05-13 12:33:11,960 INFO main.py:1109 -- Training in progress (61 seconds since last restart). (GBDTTrainable pid=62463) 2022-05-13 12:33:19,076 INFO main.py:1109 -- Training in progress (61 seconds since last restart). (GBDTTrainable pid=62464) 2022-05-13 12:33:23,409 INFO main.py:1109 -- Training in progress (61 seconds since last restart). (GBDTTrainable pid=62465) 2022-05-13 12:33:23,420 INFO main.py:1109 -- Training in progress (62 seconds since last restart). (GBDTTrainable pid=62466) 2022-05-13 12:33:23,541 INFO main.py:1109 -- Training in progress (62 seconds since last restart). (GBDTTrainable pid=62463) 2022-05-13 12:33:23,693 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 78.74 seconds (65.79 pure XGBoost training time). (GBDTTrainable pid=62464) 2022-05-13 12:33:24,802 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 79.62 seconds (62.85 pure XGBoost training time). (GBDTTrainable pid=62648) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks. (GBDTTrainable pid=62651) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks. (GBDTTrainable pid=62648) 2022-05-13 12:33:38,788 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training. (GBDTTrainable pid=62651) 2022-05-13 12:33:38,766 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training. (GBDTTrainable pid=62456) 2022-05-13 12:33:42,168 INFO main.py:1109 -- Training in progress (92 seconds since last restart). (GBDTTrainable pid=62456) 2022-05-13 12:33:46,177 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 103.54 seconds (95.60 pure XGBoost training time). (GBDTTrainable pid=62651) 2022-05-13 12:33:51,825 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training. (_RemoteRayXGBoostActor pid=62670) [12:33:51] task [xgboost.ray]:4623186960 got new rank 1 (_RemoteRayXGBoostActor pid=62669) [12:33:51] task [xgboost.ray]:4707639376 got new rank 0 (GBDTTrainable pid=62648) 2022-05-13 12:33:52,036 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training. (_RemoteRayXGBoostActor pid=62672) [12:33:52] task [xgboost.ray]:4530073552 got new rank 1 (_RemoteRayXGBoostActor pid=62671) [12:33:52] task [xgboost.ray]:6824757200 got new rank 0 (GBDTTrainable pid=62466) 2022-05-13 12:33:54,229 INFO main.py:1109 -- Training in progress (92 seconds since last restart). (GBDTTrainable pid=62465) 2022-05-13 12:33:54,355 INFO main.py:1109 -- Training in progress (93 seconds since last restart). (GBDTTrainable pid=62730) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks. 
(GBDTTrainable pid=62730) 2022-05-13 12:34:04,708 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training. (GBDTTrainable pid=62466) 2022-05-13 12:34:11,126 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 126.08 seconds (109.48 pure XGBoost training time). (GBDTTrainable pid=62730) 2022-05-13 12:34:15,175 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training. (_RemoteRayXGBoostActor pid=62753) [12:34:15] task [xgboost.ray]:4468564048 got new rank 1 (_RemoteRayXGBoostActor pid=62752) [12:34:15] task [xgboost.ray]:6799468304 got new rank 0 (GBDTTrainable pid=62648) 2022-05-13 12:34:22,167 INFO main.py:1109 -- Training in progress (30 seconds since last restart). (GBDTTrainable pid=62651) 2022-05-13 12:34:22,147 INFO main.py:1109 -- Training in progress (30 seconds since last restart). (GBDTTrainable pid=62465) 2022-05-13 12:34:24,646 INFO main.py:1109 -- Training in progress (123 seconds since last restart). (GBDTTrainable pid=62465) 2022-05-13 12:34:24,745 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 137.75 seconds (123.36 pure XGBoost training time). (GBDTTrainable pid=62651) 2022-05-13 12:34:40,173 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 61.63 seconds (48.34 pure XGBoost training time). (GBDTTrainable pid=62730) 2022-05-13 12:34:45,745 INFO main.py:1109 -- Training in progress (31 seconds since last restart). (GBDTTrainable pid=62648) 2022-05-13 12:34:52,543 INFO main.py:1109 -- Training in progress (60 seconds since last restart). (GBDTTrainable pid=62648) 2022-05-13 12:35:14,888 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 96.35 seconds (82.83 pure XGBoost training time). (GBDTTrainable pid=62730) 2022-05-13 12:35:16,197 INFO main.py:1109 -- Training in progress (61 seconds since last restart). (GBDTTrainable pid=62730) 2022-05-13 12:35:33,441 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 88.89 seconds (78.26 pure XGBoost training time). 2022-05-13 12:35:33,610 INFO tune.py:753 -- Total run time: 218.52 seconds (217.48 seconds for the tuning loop). Now that we obtained the results, we can analyze them. 
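As a quick first look (a minimal sketch: results is the ResultGrid returned by tuner.fit() above, and train-merror is one of the metrics we asked XGBoost to report), we can loop over the individual trials and print each trial's sampled parameters next to its final error:

# Iterate over all trial results in the ResultGrid and print their outcomes.
for trial_result in results:
    if trial_result.error:
        print("Trial failed:", trial_result.error)
        continue
    sampled_params = trial_result.config["params"]
    print(
        "max_depth:", sampled_params["max_depth"],
        "min_child_weight:", sampled_params["min_child_weight"],
        "train-merror:", trial_result.metrics["train-merror"],
    )

The same information is available in aggregated form through the helpers shown next.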
For instance, we can fetch the best observed result according to the configured metric and mode and print it:

# This will fetch the best result according to the `metric` and `mode` specified
# in the `TuneConfig` above:
best_result = results.get_best_result()
print("Best result error rate", best_result.metrics["train-merror"])

Best result error rate 0.196929

For more sophisticated analysis, we can get a pandas dataframe with all trial results:

df = results.get_dataframe()
print(df.columns)

Index(['train-mlogloss', 'train-merror', 'time_this_iter_s', 'should_checkpoint', 'done', 'timesteps_total', 'episodes_total', 'training_iteration', 'trial_id', 'experiment_id', 'date', 'timestamp', 'time_total_s', 'pid', 'hostname', 'node_ip', 'time_since_restore', 'timesteps_since_restore', 'iterations_since_restore', 'warmup_time', 'config/params/max_depth', 'config/params/min_child_weight', 'logdir'], dtype='object')

As an example, let's group the results per min_child_weight parameter and fetch the minimal obtained values:

groups = df.groupby("config/params/min_child_weight")
mins = groups.min()

for min_child_weight, row in mins.iterrows():
    print("Min child weight", min_child_weight, "error", row["train-merror"], "logloss", row["train-mlogloss"])

Min child weight 1 error 0.262468 logloss 0.69843
Min child weight 2 error 0.311035 logloss 0.79498
Min child weight 3 error 0.240916 logloss 0.651457
Min child weight 4 error 0.196929 logloss 0.560217
Min child weight 6 error 0.219665 logloss 0.608005
Min child weight 7 error 0.311035 logloss 0.794983
Min child weight 8 error 0.311035 logloss 0.794983

As you can see in our example run, a min child weight of 4 yielded the lowest error of 0.196929. That's the same error rate that results.get_best_result() gave us!

results.get_dataframe() returns the last reported results per trial. If you want to obtain the best ever observed results, you can pass the filter_metric and filter_mode arguments to results.get_dataframe(). In our example, we'll filter the minimum ever observed train-merror for each trial:

df_min_error = results.get_dataframe(filter_metric="train-merror", filter_mode="min")
df_min_error["train-merror"]

0    0.262468
1    0.310307
2    0.310307
3    0.219665
4    0.240916
5    0.220801
6    0.310307
7    0.196929
Name: train-merror, dtype: float64

The best ever observed train-merror is 0.196929, the same as the minimum error in our grouped results. This is expected, as the classification error in XGBoost usually goes down over time - meaning our last results are usually the best results.

And that's how you analyze your hyperparameter tuning results. If you would like to have access to more analytics, please feel free to file a feature request, e.g. as a GitHub issue or on our Discuss platform!

Training a model with distributed LightGBM

In this example we will train a model in Ray AIR using distributed LightGBM.
Let’s start with installing our dependencies: !pip install -qU "ray[tune]" lightgbm_ray [notice] A new release of pip available: 22.3.1 -> 23.1.2 [notice] To update, run: pip install --upgrade pip Then we need some imports: from typing import Tuple import ray from ray.train.lightgbm import LightGBMPredictor from ray.data.preprocessors.chain import Chain from ray.data.preprocessors.encoder import Categorizer from ray.train.lightgbm import LightGBMTrainer from ray.air.config import ScalingConfig from ray.data import Dataset from ray.air.result import Result from ray.data.preprocessors import StandardScaler /Users/balaji/Documents/GitHub/ray/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm 2023-07-07 14:34:14,951 INFO util.py:159 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output. 2023-07-07 14:34:15,892 INFO util.py:159 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output. Next we define a function to load our train, validation, and test datasets. def prepare_data() -> Tuple[Dataset, Dataset, Dataset]: dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer_with_categorical.csv") train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) test_dataset = valid_dataset.drop_columns(cols=["target"]) return train_dataset, valid_dataset, test_dataset The following function will create a LightGBM trainer, train it, and return the result. def train_lightgbm(num_workers: int, use_gpu: bool = False) -> Result: train_dataset, valid_dataset, _ = prepare_data() # Scale some random columns, and categorify the categorical_column, # allowing LightGBM to use its built-in categorical feature support preprocessor = Chain( Categorizer(["categorical_column"]), StandardScaler(columns=["mean radius", "mean texture"]) ) # LightGBM specific params params = { "objective": "binary", "metric": ["binary_logloss", "binary_error"], } trainer = LightGBMTrainer( scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu), label_column="target", params=params, datasets={"train": train_dataset, "valid": valid_dataset}, preprocessor=preprocessor, num_boost_round=100, ) result = trainer.fit() print(result.metrics) return result Once we have the result, we can do batch inference on the obtained model. Let’s define a utility function for this. import pandas as pd from ray.air import Checkpoint from ray.data import ActorPoolStrategy class Predict: def __init__(self, checkpoint: Checkpoint): self.predictor = LightGBMPredictor.from_checkpoint(checkpoint) def __call__(self, batch: pd.DataFrame) -> pd.DataFrame: return self.predictor.predict(batch) def predict_lightgbm(result: Result): _, _, test_dataset = prepare_data() scores = test_dataset.map_batches( Predict, fn_constructor_args=[result.checkpoint], compute=ActorPoolStrategy(), batch_format="pandas" ) predicted_labels = scores.map_batches(lambda df: (df > 0.5).astype(int), batch_format="pandas") print(f"PREDICTED LABELS") predicted_labels.show() Now we can run the training: result = train_lightgbm(num_workers=2, use_gpu=False)

Tune Status

Current time:2023-07-07 14:34:34
Running for: 00:00:06.06
Memory: 12.2/64.0 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 4.0/10 CPUs, 0/0 GPUs

Trial Status

Trial name status loc iter total time (s) train-binary_logloss train-binary_error valid-binary_logloss
LightGBMTrainer_0c5ae_00000TERMINATED127.0.0.1:10027 101 4.5829 0.000202293 0 0.130232
(LightGBMTrainer pid=10027) The `preprocessor` arg to Trainer is deprecated. Apply preprocessor transformations ahead of time by calling `preprocessor.transform(ds)`. Support for the preprocessor arg will be dropped in a future release. (LightGBMTrainer pid=10027) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(get_pd_value_counts)] (LightGBMTrainer pid=10027) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) (LightGBMTrainer pid=10027) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` (LightGBMTrainer pid=10027) Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format. (LightGBMTrainer pid=10027) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(Categorizer._transform_pandas)] -> AllToAllOperator[Aggregate] (LightGBMTrainer pid=10027) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) (LightGBMTrainer pid=10027) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`    (pid=10027) Running: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.0 MiB/512.0 MiB object_store_memory: 0%| | 0/14 [00:00 TaskPoolMapOperator[MapBatches(Categorizer._transform_pandas)->MapBatches(StandardScaler._transform_pandas)]  (pid=10027) Running: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.0 MiB/512.0 MiB object_store_memory: 7%|▋ | 1/14 [00:00<00:01, 7.59it/s]   (LightGBMTrainer pid=10027) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)  (pid=10027) Running: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.0 MiB/512.0 MiB object_store_memory: 7%|▋ | 1/14 [00:00<00:01, 6.59it/s]   (LightGBMTrainer pid=10027) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`    (LightGBMTrainer pid=10027) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(Categorizer._transform_pandas)->MapBatches(StandardScaler._transform_pandas)] (LightGBMTrainer pid=10027) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) (LightGBMTrainer pid=10027) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` (_RemoteRayLightGBMActor pid=10063) [LightGBM] [Info] Trying to bind port 51134... (_RemoteRayLightGBMActor pid=10063) [LightGBM] [Info] Binding port 51134 succeeded (_RemoteRayLightGBMActor pid=10063) [LightGBM] [Info] Listening... (_RemoteRayLightGBMActor pid=10062) [LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds (_RemoteRayLightGBMActor pid=10063) [LightGBM] [Info] Connected to rank 0 (_RemoteRayLightGBMActor pid=10063) [LightGBM] [Info] Local rank: 1, total number of machines: 2 (_RemoteRayLightGBMActor pid=10063) [LightGBM] [Warning] num_threads is set=2, n_jobs=-1 will be ignored. 
Current value: num_threads=2 (_RemoteRayLightGBMActor pid=10062) /Users/balaji/Documents/GitHub/ray/.venv/lib/python3.11/site-packages/lightgbm/basic.py:1780: UserWarning: Overriding the parameters from Reference Dataset. (_RemoteRayLightGBMActor pid=10062) _log_warning('Overriding the parameters from Reference Dataset.') (_RemoteRayLightGBMActor pid=10062) /Users/balaji/Documents/GitHub/ray/.venv/lib/python3.11/site-packages/lightgbm/basic.py:1513: UserWarning: categorical_column in param dict is overridden. (_RemoteRayLightGBMActor pid=10062) _log_warning(f'{cat_alias} in param dict is overridden.') 2023-07-07 14:34:34,087 INFO tune.py:1148 -- Total run time: 7.18 seconds (6.05 seconds for the tuning loop). {'train-binary_logloss': 0.00020229312743896637, 'train-binary_error': 0.0, 'valid-binary_logloss': 0.13023245107091222, 'valid-binary_error': 0.023529411764705882, 'time_this_iter_s': 0.021785974502563477, 'should_checkpoint': True, 'done': True, 'training_iteration': 101, 'trial_id': '0c5ae_00000', 'date': '2023-07-07_14-34-34', 'timestamp': 1688765674, 'time_total_s': 4.582904100418091, 'pid': 10027, 'hostname': 'Balajis-MacBook-Pro-16', 'node_ip': '127.0.0.1', 'config': {}, 'time_since_restore': 4.582904100418091, 'iterations_since_restore': 101, 'experiment_tag': '0'} And perform inference on the obtained model: predict_lightgbm(result) 2023-07-07 14:34:36,769 INFO read_api.py:374 -- To satisfy the requested parallelism of 20, each read task output will be split into 20 smaller blocks. 2023-07-07 14:34:38,655 WARNING plan.py:567 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune 2023-07-07 14:34:38,668 INFO dataset.py:2180 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format. 2023-07-07 14:34:38,674 INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches()->MapBatches(Predict)] -> TaskPoolMapOperator[MapBatches()] 2023-07-07 14:34:38,674 INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) 2023-07-07 14:34:38,676 INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` 2023-07-07 14:34:38,701 INFO actor_pool_map_operator.py:117 -- MapBatches()->MapBatches(Predict): Waiting for 1 pool actors to start... PREDICTED LABELS {'predictions': 1} {'predictions': 1} {'predictions': 0} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 0} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 0} {'predictions': 1} {'predictions': 1} {'predictions': 1} {'predictions': 0} This example is adapted from Continual AI Avalanche quick start https://avalanche.continualai.org/ Incremental Learning with Ray AIR In this example, we show how to use Ray AIR to incrementally train a simple image classification PyTorch model on a stream of incoming tasks. 
Each task is a random permutation of the MNIST Dataset, which is a common benchmark used for continual training. After training on all the tasks, the model is expected to be able to make predictions on data from any task.

In this example, we use just a naive finetuning strategy, where the model is trained on each task, without any special methods to prevent catastrophic forgetting. Model performance is expected to be poor.

More precisely, this example showcases domain incremental training, in which at prediction/testing time the model is asked to predict on data from tasks trained on so far, with the task ID not provided. This is opposed to task incremental training, where the task ID is provided during prediction/testing time. For more information on the three different categories of incremental/continual learning, please see "Three scenarios for continual learning" by van de Ven and Tolias.

This example will cover the following:
Loading a PyTorch Dataset into Ray Data
Creating an Iterator[ray.data.Dataset] abstraction to represent a stream of data to train on for incremental training
Implementing a custom Ray AIR preprocessor to preprocess the dataset
Incrementally training a model using data parallel training
Incrementally deploying our trained model with Ray Serve and performing online prediction queries

Step 1: Installations and Initializing Ray

To get started, let's first install the necessary packages: Ray AIR, torch, and torchvision. Uncomment the lines below and run the cell to install them.

# !pip install -q "ray[air]"
# !pip install -q torch
# !pip install -q torchvision

Then, let's initialize Ray! We can just import and call ray.init(). If you are running on a Ray cluster, then you can do ray.init("auto") to connect to the cluster instead of initializing a new local Ray instance.

import ray

ray.init()
# If running on a cluster, use the below line instead.
# ray.init("auto")

2022-09-23 16:31:18,554 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

Ray

Python version: 3.10.6
Ray version: 2.6.3
Dashboard: http://127.0.0.1:8265
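Once Ray is initialized, you can optionally sanity-check the resources it detected before starting any training. A minimal sketch (the exact numbers depend on your machine or cluster):

import ray

# Show the resources available to this Ray instance (CPUs, GPUs, memory, ...).
print(ray.cluster_resources())

This is also a quick way to confirm whether any GPUs were picked up; the training script later uses ray.available_resources() for the same purpose when deciding whether to set use_gpu.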
Step 2: Define our PyTorch Model Now that we have the necessary installations, let’s define our PyTorch model. For this example to classify MNIST images, we will use a simple multi-layer perceptron. import torch.nn as nn class SimpleMLP(nn.Module): def __init__(self, num_classes=10, input_size=28 * 28): super(SimpleMLP, self).__init__() self.features = nn.Sequential( nn.Linear(input_size, 512), nn.ReLU(inplace=True), nn.Dropout(), ) self.classifier = nn.Linear(512, num_classes) self._input_size = input_size def forward(self, x): x = x.contiguous() x = x.view(-1, self._input_size) x = self.features(x) x = self.classifier(x) return x /home/pdmurray/.pyenv/versions/mambaforge/envs/ray/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm Step 3: Create the Stream of tasks We can now create a stream of tasks (where each task contains a dataset to train on). For this example, we will create an artificial stream of tasks consisting of permuted variations of MNIST, which is a classic benchmark in continual learning research. For real-world scenarios, this step is not necessary as fresh data will already be arriving as a stream of tasks. It does not need to be artificially created. 3a: Load MNIST Dataset to a Dataset Let’s first define a simple function that will return the original MNIST Dataset as a distributed Dataset. Ray Data is the standard way to load and exchange data in Ray libraries and applications, read more about the library here! The function in the below code snippet does the following: Downloads the MNIST Dataset from torchvision in-memory Loads the in-memory Torch Dataset into a Dataset Converts the Dataset into Numpy format. Instead of the Dataset iterating over tuples, it will have 2 columns: “image” & “label”. This will allow us to apply built-in preprocessors to the Dataset and allow Datasets to be used with Ray AIR Predictors. For this example, since we are just working with MNIST dataset, which is small, we use the from_torch which just loads the full MNIST dataset into memory. For loading larger datasets in a parallel fashion, you should use Dataset’s additional read APIs to load data from parquet, csv, image files, and more! import pandas as pd import torchvision from torchvision.transforms import RandomCrop import ray def get_mnist_dataset(train: bool = True) -> ray.data.Dataset: """Returns MNIST Dataset as a ray.data.Dataset. Args: train: Whether to return the train dataset or test dataset. """ if train: # Only perform random cropping on the Train dataset. transform = RandomCrop(28, padding=4) else: transform = None mnist_dataset = torchvision.datasets.MNIST("./data", download=True, train=train, transform=transform) mnist_dataset = ray.data.from_torch(mnist_dataset) def convert_batch_to_numpy(batch): images = np.array([np.array(item[0]) for item in batch["item"]]) labels = np.array([item[1] for item in batch["item"]]) return {"image": images, "label": labels} mnist_dataset = mnist_dataset.map_batches(convert_batch_to_numpy).materialize() return mnist_dataset 3b: Create our Stream abstraction Now we can create our “stream” abstraction. This abstraction provides two methods (generate_train_stream and generate_test_stream) that each returns an Iterator over Ray Data. Each item in this iterator contains a unique permutation of MNIST, and is one task that we want to train on. 
In this example, "the stream of tasks" is contrived since all the data for all tasks exists already in an offline setting. For true online continual learning, you would want to implement a custom dataset iterator that reads from some stream datasource to produce new tasks. The only abstraction that's needed is Iterator[ray.data.Dataset].

Note that the test dataset stream uses the same permutations that are used for the training dataset stream. In general for continual learning, it is expected that the data distribution of the test/prediction data follows what the model was trained on. If you notice that the distribution of new prediction queries is changing compared to the distribution of the training data, then you should probably trigger training of a new task.

from typing import Dict, Iterator, List
import random

import numpy as np

from ray.data import ActorPoolStrategy


class PermutedMNISTStream:
    """Generates streams of permuted MNIST Datasets.

    Example:

        permuted_mnist = PermutedMNISTStream(n_tasks=3)
        train_stream = permuted_mnist.generate_train_stream()

        # Iterate through the train_stream
        for train_dataset in train_stream:
            ...

    Args:
        n_tasks: The number of tasks to generate.
    """

    def __init__(self, n_tasks: int = 3):
        self.n_tasks = n_tasks
        self.permutations = [
            np.random.permutation(28 * 28) for _ in range(self.n_tasks)
        ]

        self.train_mnist_dataset = get_mnist_dataset(train=True)
        self.test_mnist_dataset = get_mnist_dataset(train=False)

    def random_permute_dataset(
        self, dataset: ray.data.Dataset, permutation: np.ndarray
    ):
        """Randomly permutes the pixels for each image in the dataset."""

        class PixelsPermutation(object):
            def __call__(self, batch):
                batch["image"] = batch["image"].map(
                    lambda image: image.reshape(-1)[permutation].reshape(28, 28)
                )
                return batch

        return dataset.map_batches(
            PixelsPermutation, compute=ActorPoolStrategy(), batch_format="pandas"
        )

    def generate_train_stream(self) -> Iterator[ray.data.Dataset]:
        for permutation in self.permutations:
            permuted_mnist_dataset = self.random_permute_dataset(
                self.train_mnist_dataset, permutation
            )
            yield permuted_mnist_dataset

    def generate_test_stream(self) -> Iterator[ray.data.Dataset]:
        for permutation in self.permutations:
            mnist_dataset = get_mnist_dataset(train=False)
            permuted_mnist_dataset = self.random_permute_dataset(
                self.test_mnist_dataset, permutation
            )
            yield permuted_mnist_dataset

    def generate_test_samples(self, num_samples: int = 10) -> List[np.ndarray]:
        """Generates num_samples permuted MNIST images."""
        random_permutation = random.choice(self.permutations)
        return list(
            self.random_permute_dataset(
                self.test_mnist_dataset.random_shuffle().limit(num_samples),
                random_permutation,
            ).to_pandas()["image"].to_numpy()
        )

Step 4: Define the logic for Training and Inference/Prediction

Now that we can get an Iterator over Ray Data, we can incrementally train our model in a data parallel fashion via Ray Train, while incrementally deploying our model via Ray Serve. Let's define some helper functions to allow us to do this!

If you are not familiar with data parallel training, it is a distributed training strategy in which we have multiple model replicas, and each replica trains on a different batch of data. After each batch, the gradients are synchronized across the replicas. This effectively allows us to train on more data in a shorter amount of time.

4a: Define our training logic for each Data Parallel worker

The first thing we need to do is to define the training loop that will be run on each training worker.
The training loop takes in a config dict as an argument that we can use to pass in any configurations for training.

This is just standard PyTorch training, with the difference being that we can leverage Ray Train's utility functions and the Ray AIR Session:

ray.train.torch.prepare_model(...): This will prepare the model for distributed training by wrapping it in either PyTorch DistributedDataParallel or FullyShardedDataParallel and moving it to the correct accelerator device.
ray.air.session.get_dataset_shard(...): This will get the Dataset shard for this particular Data Parallel worker.
ray.air.session.report({}, checkpoint=...): This will tell Ray Train to persist the provided Checkpoint object.
ray.air.session.get_checkpoint(): This returns a checkpoint to resume from, which is useful either for fault tolerance or, as in our case, to continue training the same model on a new incoming dataset.

from ray import train
from ray.air import session, Checkpoint
from torch.optim import SGD
from torch.nn import CrossEntropyLoss


def train_loop_per_worker(config: dict):
    num_epochs = config["num_epochs"]
    learning_rate = config["learning_rate"]
    momentum = config["momentum"]
    batch_size = config["batch_size"]

    model = SimpleMLP(num_classes=10)

    # Load model from checkpoint if there is a checkpoint to load from.
    checkpoint_to_load = session.get_checkpoint()
    if checkpoint_to_load:
        state_dict_to_resume_from = checkpoint_to_load.to_dict()["model"]
        model.load_state_dict(state_dict=state_dict_to_resume_from)

    model = train.torch.prepare_model(model)

    optimizer = SGD(model.parameters(), lr=learning_rate, momentum=momentum)
    criterion = CrossEntropyLoss()

    # Get the Dataset shard for this data parallel worker,
    # and convert it to an iterator over PyTorch batches.
    dataset_shard = session.get_dataset_shard("train").iter_torch_batches(
        batch_size=batch_size,
    )

    for epoch_idx in range(num_epochs):
        running_loss = 0
        for iteration, batch in enumerate(dataset_shard):
            optimizer.zero_grad()
            train_mb_x, train_mb_y = batch["image"], batch["label"]
            train_mb_x = train_mb_x.to(train.torch.get_device())
            train_mb_y = train_mb_y.to(train.torch.get_device())

            # Forward
            logits = model(train_mb_x)
            # Loss
            loss = criterion(logits, train_mb_y)
            # Backward
            loss.backward()
            # Update
            optimizer.step()

            running_loss += loss.item()
            if session.get_world_rank() == 0 and iteration % 500 == 0:
                print(f"loss: {loss.item():>7f}, epoch: {epoch_idx}, iteration: {iteration}")

        # Checkpoint model after every epoch.
        state_dict = model.state_dict()
        checkpoint = Checkpoint.from_dict(dict(model=state_dict))
        session.report({"loss": running_loss}, checkpoint=checkpoint)

4b: Define our Preprocessor

Next, we define our Preprocessor to preprocess our data before training and prediction. Our preprocessor will normalize the MNIST images by the mean and standard deviation of the MNIST training dataset. This is a common operation to do on MNIST to improve training: https://discuss.pytorch.org/t/normalization-in-the-mnist-example/457

from typing import Dict

import numpy as np
import torch
from torchvision import transforms

from ray.data.preprocessors import TorchVisionPreprocessor

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

mnist_normalize_preprocessor = TorchVisionPreprocessor(columns=["image"], transform=transform)

4c: Define logic for Deploying and Querying our model

In addition to batch inference, we also want to deploy our model so that we can submit live queries to it for online inference.
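Concretely, each live query will be an HTTP POST whose JSON body carries a flattened image. Since NumPy arrays are not JSON serializable, the image is converted to a nested Python list first. Here is a small standalone sketch of such a payload (the image values are made up; the helper functions defined below build the real payloads the same way):

import json

import numpy as np

# A fake 28x28 MNIST-style image standing in for a real query.
sample_image = np.random.rand(28, 28).astype("float32")

# Convert to a plain Python list so it can be JSON-encoded.
payload = {"array": sample_image.tolist(), "dtype": "float32"}
print(json.dumps(payload)[:80], "...")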
We use Ray Serve’s PredictorDeployment utility to deploy our trained model. Once we deploy the model, we can send HTTP requests to our deployment. from typing import List import requests from requests import Response import numpy as np from ray.serve.http_adapters import json_to_ndarray def deploy_model(checkpoint: ray.air.Checkpoint) -> str: """Deploys the model from the provided Checkpoint and returns the URL for the endpoint of the model deployment.""" serve.run( PredictorDeployment.options( name="mnist_model", route_prefix="/mnist_predict", num_replicas=2, ).bind( http_adapter=json_to_ndarray, predictor_cls=TorchPredictor, checkpoint=latest_checkpoint, model=SimpleMLP(num_classes=10), ) ) return "http://localhost:8000/mnist_predict" # Function that queries our deployed model def query_deployment(test_samples: List[np.ndarray], endpoint_uri: str) -> List[Response]: """Given a set of test samples, queries the model deployment at the provided endpoint and returns the results.""" results = [] # Convert to Python List since Numpy arrays are not Json serializable. for sample in test_samples: results.append(requests.post(endpoint_uri, json={"array": sample.tolist(), "dtype": "float32"})) return results Step 5: Putting it all together Once we have defined our training logic and our preprocessor, we can put everything together! For each dataset in our stream, we do the following: Train on the dataset in Data Parallel fashion. We create a TorchTrainer, specify the config for the training loop we defined above, the dataset to train on, and how much we want to scale. TorchTrainer also accepts a checkpoint arg to continue training from a previously saved checkpoint. Get the saved checkpoint from the training run. After training on each task, we deploy our model so we can query it for predictions. In this example, the training data for each task is well-defined beforehand by the benchmark. For real-world scenarios, this probably will not be the case. It is very likely that the prediction requests after training on one task will become the training data for the next task. from ray.train.torch import TorchTrainer from ray.air.config import ScalingConfig from ray.train.torch import TorchPredictor from ray import serve from ray.serve import PredictorDeployment from ray.serve.http_adapters import json_to_ndarray # The number of tasks (i.e. datasets in our stream) that we want to use for this example. n_tasks = 3 # Number of epochs to train each task for. num_epochs = 4 # Batch size. batch_size = 32 # Optimizer args. learning_rate = 0.001 momentum = 0.9 # Number of data parallel workers to use for training. num_workers = 1 # Whether to use GPU or not. use_gpu = ray.available_resources().get("GPU", 0) > 0 permuted_mnist = PermutedMNISTStream(n_tasks=n_tasks) train_stream = permuted_mnist.generate_train_stream() test_stream = permuted_mnist.generate_test_stream() latest_checkpoint = None accuracy_for_all_tasks = [] task_idx = 0 all_test_datasets_seen_so_far = [] for train_dataset, test_dataset in zip(train_stream, test_stream): print(f"Starting training for task: {task_idx}") task_idx += 1 # *********Training***************** trainer = TorchTrainer( train_loop_per_worker=train_loop_per_worker, train_loop_config={ "num_epochs": num_epochs, "learning_rate": learning_rate, "momentum": momentum, "batch_size": batch_size, }, # Have to specify trainer_resources as 0 so that the example works on Colab. 
scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu, trainer_resources={"CPU": 0}), datasets={"train": train_dataset}, preprocessor=mnist_normalize_preprocessor, resume_from_checkpoint=latest_checkpoint, ) result = trainer.fit() latest_checkpoint = result.checkpoint # *************Model Deployment & Online Inference*************************** # We can also deploy our model to do online inference with Ray Serve. # Start Ray Serve. test_samples = permuted_mnist.generate_test_samples() endpoint_uri = deploy_model(latest_checkpoint) online_inference_results = query_deployment(test_samples, endpoint_uri) if ray.available_resources().get("CPU", 0) < num_workers+1: # If there are no more CPUs left, then shutdown the Serve replicas so we can continue training on the next task. serve.shutdown() serve.shutdown() Read->Map_Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.42s/it] Read->Map_Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5.27it/s] Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.40it/s] Read->Map_Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.17it/s] Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.78it/s] Starting training for task: 0

Tune Status

Current time:2022-09-23 16:31:51
Running for: 00:00:20.79
Memory: 17.1/62.7 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/32.53 GiB heap, 0.0/16.26 GiB objects

Trial Status

Trial name status loc iter total time (s) loss _timestamp _time_this_iter_s
TorchTrainer_da157_00000TERMINATED10.109.175.190:856770 4 17.0121 0 1663975908 0.0839479
(RayTrainWorker pid=856836) 2022-09-23 16:31:37,847 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=1] (RayTrainWorker pid=856836) 2022-09-23 16:31:38,047 INFO train_loop_utils.py:354 -- Moving model to device: cuda:0 (RayTrainWorker pid=856836) loss: 2.436360, epoch: 0, iteration: 0 (RayTrainWorker pid=856836) loss: 1.608793, epoch: 0, iteration: 500 (RayTrainWorker pid=856836) loss: 1.285775, epoch: 0, iteration: 1000 (RayTrainWorker pid=856836) loss: 0.785092, epoch: 0, iteration: 1500

Trial Progress

Trial name _time_this_iter_s _timestamp _training_iterationdate done episodes_total experiment_id experiment_taghostname iterations_since_restore lossnode_ip pidshould_checkpoint time_since_restore time_this_iter_s time_total_s timestamp timesteps_since_restoretimesteps_total training_iterationtrial_id warmup_time
TorchTrainer_da157_00000 0.0839479 1663975908 42022-09-23_16-31-49True 96c794a64d6f43d79b87130a76d21f1f 0corvus 4 010.109.175.190856770True 17.0121 0.11111 17.0121 1663975909 0 4da157_00000 0.00297165
2022-09-23 16:31:51,231 INFO tune.py:762 -- Total run time: 20.91 seconds (20.79 seconds for the tuning loop). Map_Batches: 0%| | 0/1 [00:00Map_Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.26it/s] Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.72it/s] Starting training for task: 1

Tune Status

Current time:2022-09-23 16:33:08
Running for: 00:00:19.49
Memory: 18.2/62.7 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/32.53 GiB heap, 0.0/16.26 GiB objects

Trial Status

Trial name status loc iter total time (s) loss _timestamp _time_this_iter_s
TorchTrainer_09424_00000TERMINATED10.109.175.190:857781 4 15.3611 0 1663975986 0.0699804
(RayTrainWorker pid=857818) 2022-09-23 16:32:55,672 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=1] (RayTrainWorker pid=857818) 2022-09-23 16:32:55,954 INFO train_loop_utils.py:354 -- Moving model to device: cuda:0 (RayTrainWorker pid=857818) loss: 2.457292, epoch: 0, iteration: 0 (RayTrainWorker pid=857818) loss: 1.339169, epoch: 0, iteration: 500 (RayTrainWorker pid=857818) loss: 1.032746, epoch: 0, iteration: 1000 (RayTrainWorker pid=857818) loss: 0.707931, epoch: 0, iteration: 1500

Trial Progress

Trial name _time_this_iter_s _timestamp _training_iterationdate done episodes_total experiment_id experiment_taghostname iterations_since_restore lossnode_ip pidshould_checkpoint time_since_restore time_this_iter_s time_total_s timestamp timesteps_since_restoretimesteps_total training_iteration trial_id warmup_time
TorchTrainer_09424_00000 0.0699804 1663975986 42022-09-23_16-33-06True 77c9c5f109fa4a47b459b0afadf3ba33 0corvus 4 010.109.175.190857781True 15.3611 0.0725608 15.3611 1663975986 0 409424_00000 0.00418878
2022-09-23 16:33:09,072 INFO tune.py:762 -- Total run time: 19.62 seconds (19.49 seconds for the tuning loop). Map Progress (1 actors 1 pending): 0%| | 0/2 [00:01Map_Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5.31it/s] Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.76it/s] Starting training for task: 2

Tune Status

Current time:2022-09-23 16:34:33
Running for: 00:00:19.45
Memory: 18.4/62.7 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/32.53 GiB heap, 0.0/16.26 GiB objects

Trial Status

Trial name status loc iter total time (s) loss _timestamp _time_this_iter_s
TorchTrainer_3b7e3_00000TERMINATED10.109.175.190:858536 4 15.3994 0 1663976070 0.0710998
(RayTrainWorker pid=858579) 2022-09-23 16:34:19,902 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=1] (RayTrainWorker pid=858579) 2022-09-23 16:34:20,191 INFO train_loop_utils.py:354 -- Moving model to device: cuda:0 (RayTrainWorker pid=858579) loss: 2.515887, epoch: 0, iteration: 0 (RayTrainWorker pid=858579) loss: 1.260738, epoch: 0, iteration: 500 (RayTrainWorker pid=858579) loss: 0.892560, epoch: 0, iteration: 1000 (RayTrainWorker pid=858579) loss: 0.497198, epoch: 0, iteration: 1500

Trial Progress

Trial name _time_this_iter_s _timestamp _training_iterationdate done episodes_total experiment_id experiment_taghostname iterations_since_restore lossnode_ip pidshould_checkpoint time_since_restore time_this_iter_s time_total_s timestamp timesteps_since_restoretimesteps_total training_iterationtrial_id warmup_time
TorchTrainer_3b7e3_00000 0.0710998 1663976070 42022-09-23_16-34-30True c9312be01e964b958b931d1796623509 0corvus 4 010.109.175.190858536True 15.3994 0.0705044 15.3994 1663976070 0 43b7e3_00000 0.00414133
2022-09-23 16:34:33,315 INFO tune.py:762 -- Total run time: 19.59 seconds (19.45 seconds for the tuning loop).
Map Progress (1 actors 1 pending): 0%| | 0/3

def train_rl_ppo_online(num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting online training")
    trainer = RLTrainer(
        run_config=RunConfig(
            stop={"training_iteration": 5},
            checkpoint_config=CheckpointConfig(checkpoint_at_end=True)
        ),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        algorithm="PPO",
        config={
            "env": "CartPole-v1",
            "framework": "tf",
        },
    )
    result = trainer.fit()
    return result

Once we have obtained a trained checkpoint, we will want to serve it using Ray Serve:

def serve_rl_model(checkpoint: Checkpoint, name="RLModel") -> str:
    """Serve a RL model and return deployment URI.

    This function will start Ray Serve and deploy a model wrapper
    that loads the RL checkpoint into a RLPredictor.
    """
    serve.run(
        PredictorDeployment.options(name=name).bind(
            RLPredictor, checkpoint
        )
    )
    return f"http://localhost:8000/"

And to make sure everything works well, we can kick off an evaluation run on a fresh environment. This will query the served policy model to obtain actions using HTTP.

def evaluate_served_policy(endpoint_uri: str, num_episodes: int = 3) -> list:
    """Evaluate a served RL policy on a local environment.

    This function will create an RL environment and step through it.
    To obtain the actions, it will query the deployed RL model.
    """
    env = gym.make("CartPole-v1")

    rewards = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        reward = 0.0
        terminated = truncated = False
        while not terminated and not truncated:
            action = query_action(endpoint_uri, obs)
            obs, r, terminated, truncated, _ = env.step(action)
            reward += r
        rewards.append(reward)

    return rewards


def query_action(endpoint_uri: str, obs: np.ndarray):
    """Perform inference on a served RL model.

    This will send a HTTP request to the Ray Serve endpoint of the served
    RL policy model and return the result.
    """
    action_dict = requests.post(endpoint_uri, json={"array": obs.tolist()}).json()
    return action_dict

Let's put it all together. First, we train the model:

num_workers = 2
use_gpu = False

result = train_rl_ppo_online(num_workers=num_workers, use_gpu=use_gpu)

2022-05-19 14:19:32,791 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
2022-05-19 14:19:32,816 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.agents.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!
Starting online training
2022-05-19 14:19:35,724 INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8269
== Status ==
Current time: 2022-05-19 14:20:14 (running for 00:00:36.01)
Memory usage on this node: 9.7/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.44 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_14-19-32
Number of trials: 1/1 (1 TERMINATED)
Trial name status loc iter total time (s) ts reward episode_reward_max episode_reward_min episode_len_mean
AIRPPOTrainer_55884_00000TERMINATED127.0.0.1:15610 5 16.489720000 131.8 200 16 131.8


(raylet) 2022-05-19 14:19:39,542 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=51686 --object-store-name=/tmp/ray/session_2022-05-19_14-19-32_884042_15394/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_14-19-32_884042_15394/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=52347 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65218 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134 (pid=15610) 2022-05-19 14:19:47,006 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future! (AIRPPOTrainer pid=15610) 2022-05-19 14:19:47,485 INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode. (AIRPPOTrainer pid=15610) 2022-05-19 14:19:47,485 INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you. (AIRPPOTrainer pid=15610) 2022-05-19 14:19:47,485 INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags. (raylet) 2022-05-19 14:19:48,495 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=51686 --object-store-name=/tmp/ray/session_2022-05-19_14-19-32_884042_15394/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_14-19-32_884042_15394/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=52347 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65218 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331134 (raylet) 2022-05-19 14:19:48,495 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=51686 --object-store-name=/tmp/ray/session_2022-05-19_14-19-32_884042_15394/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_14-19-32_884042_15394/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=52347 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65218 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331134 (RolloutWorker pid=15616) 2022-05-19 14:19:56,315 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future! (RolloutWorker pid=15615) 2022-05-19 14:19:56,315 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future! 
(AIRPPOTrainer pid=15610) 2022-05-19 14:19:57,667 INFO trainable.py:163 -- Trainable.setup took 10.183 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(AIRPPOTrainer pid=15610) 2022-05-19 14:19:57,668 WARNING util.py:65 -- Install gputil for GPU system monitoring.
(AIRPPOTrainer pid=15610) 2022-05-19 14:19:59,362 WARNING deprecation.py:47 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!

Result for AIRPPOTrainer_55884_00000 (abridged):
  training_iteration: 1
  done: false
  episodes_this_iter: 194
  episodes_total: 194
  timesteps_total: 4000
  episode_len_mean: 20.4020618556701
  episode_reward_mean: 20.4020618556701
  episode_reward_min: 9.0
  episode_reward_max: 91.0
  time_total_s: 3.4948909282684326

Result for AIRPPOTrainer_55884_00000 (abridged):
  training_iteration: 3
  done: false
  episodes_this_iter: 30
  episodes_total: 309
  timesteps_total: 12000
  episode_len_mean: 71.1
  episode_reward_mean: 71.1
  episode_reward_min: 11.0
  episode_reward_max: 200.0
  time_total_s: 10.045520067214966

Result for AIRPPOTrainer_55884_00000 (abridged):
  training_iteration: 5
  done: true
  episodes_this_iter: 21
  episodes_total: 354
  timesteps_total: 20000
  episode_len_mean: 131.8
  episode_reward_mean: 131.8
  episode_reward_min: 16.0
  episode_reward_max: 200.0
  time_total_s: 16.48974895477295

2022-05-19 14:20:14,687 INFO tune.py:753 -- Total run time: 36.43 seconds (35.98 seconds for the tuning loop).

Then, we serve it using Ray Serve:

endpoint_uri = serve_rl_model(result.checkpoint)
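The serve_rl_model helper is defined earlier in this example. As a rough, hypothetical sketch of what such a helper can look like (not the example's exact implementation: the RLModel deployment name and the /RLModel route are taken from the logs below, and restoring the policy via RLPredictor.from_checkpoint is an assumption):

import numpy as np
from starlette.requests import Request

from ray import serve
from ray.train.rl import RLPredictor  # AIR's RL predictor utility (assumed here)


@serve.deployment(route_prefix="/RLModel")
class RLModel:
    def __init__(self, checkpoint):
        # Restore the trained PPO policy from the Tune result's checkpoint.
        self.predictor = RLPredictor.from_checkpoint(checkpoint)

    async def __call__(self, request: Request) -> int:
        # The client posts a single observation as a JSON list.
        obs = np.array([await request.json()])
        action = self.predictor.predict(obs)
        # Return a plain int so the response is JSON-serializable.
        return int(np.asarray(action).ravel()[0])


def serve_rl_model(checkpoint) -> str:
    # Start Serve and deploy one replica of the policy deployment.
    serve.start(detached=True)
    RLModel.deploy(checkpoint)
    return "http://127.0.0.1:8000/RLModel"

The Serve output below shows the controller starting the HTTP proxy on 127.0.0.1:8000, adding a replica for the RLModel deployment, and restoring the policy from the training checkpoint.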
(ServeController pid=15625) INFO 2022-05-19 14:20:16,749 controller 15625 checkpoint_path.py:17 - Using RayInternalKVStore for controller checkpoint and recovery.
(ServeController pid=15625) INFO 2022-05-19 14:20:16,751 controller 15625 http_state.py:115 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:127.0.0.1-0' on node 'node:127.0.0.1-0' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=15630) INFO: Started server process [15630]
(ServeController pid=15625) INFO 2022-05-19 14:20:26,056 controller 15625 deployment_state.py:1217 - Adding 1 replicas to deployment 'RLModel'.
(RLModel pid=15633) 2022-05-19 14:20:34,143 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RLModel pid=15633) 2022-05-19 14:20:34,700 INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(RLModel pid=15633) 2022-05-19 14:20:34,701 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(RLModel pid=15633) 2022-05-19 14:20:34,701 INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(RLModel pid=15633) 2022-05-19 14:20:34,701 INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=15636) 2022-05-19 14:20:42,714 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=15637) 2022-05-19 14:20:42,714 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RLModel pid=15633) 2022-05-19 14:20:44,085 WARNING util.py:65 -- Install gputil for GPU system monitoring.
(RLModel pid=15633) 2022-05-19 14:20:44,143 INFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /var/folders/b2/0_91bd757rz02lrmr920v0gw0000gn/T/checkpoint_tmp_3whnb5ef/checkpoint-5
(RLModel pid=15633) 2022-05-19 14:20:44,143 INFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 16.48974895477295, '_episodes_total': 354}

And then we evaluate the served model on a fresh environment:

rewards = evaluate_served_policy(endpoint_uri=endpoint_uri)
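evaluate_served_policy is likewise a helper defined earlier in this example; the Serve access logs that follow are produced while it queries the endpoint over HTTP. A minimal sketch of such an evaluation loop (hypothetical code, assuming the classic Gym API, a CartPole-style environment as suggested by the rewards capping at 200 above, and that the endpoint accepts a JSON-encoded observation and returns the chosen action):

import gym
import requests


def evaluate_served_policy(endpoint_uri: str, num_episodes: int = 3) -> list:
    # Evaluate the served policy by playing full episodes in a fresh environment.
    env = gym.make("CartPole-v0")
    rewards = []
    for _ in range(num_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0.0
        while not done:
            # Ask the served policy for an action for the current observation.
            action = requests.post(endpoint_uri, json=obs.tolist()).json()
            obs, reward, done, _ = env.step(action)
            episode_reward += reward
        rewards.append(episode_reward)
    return rewards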
INFO 2022-05-19 14:20:46,464 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,469 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,476 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,481 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,487 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,492 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,498 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,503 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,509 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,515 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,520 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,526 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,532 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,538 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,435 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,441 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,446 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,452 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.6ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,457 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,463 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,468 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,475 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,480 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,486 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,491 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,497 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,502 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,508 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,514 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,519 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,525 RLModel 
RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,531 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,537 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,545 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,550 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,556 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,561 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,567 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,572 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,579 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,584 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,590 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,595 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,601 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,606 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,613 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,618 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,624 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,629 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,635 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,640 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,646 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,544 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,549 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,555 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,560 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,566 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,571 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,577 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,583 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,589 RLModel 
RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,594 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,600 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,605 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,612 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,617 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,623 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,628 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,634 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,639 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,645 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,652 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,659 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,664 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,670 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,675 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,682 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,687 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,693 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,698 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,704 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,709 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,715 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,720 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,726 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,731 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,737 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,742 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,748 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,754 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,651 RLModel 
RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,658 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,663 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,669 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,674 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,680 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,686 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,692 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,697 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,703 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,708 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,714 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,719 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,725 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,731 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,736 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,741 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,747 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,753 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,762 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,767 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,773 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,778 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,784 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,790 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,796 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,801 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,807 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,812 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,819 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,824 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 
2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,831 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,836 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,842 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,847 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,853 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,858 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,760 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,766 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,772 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,777 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,783 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,789 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,795 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,800 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,806 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,811 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,818 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,823 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,830 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,835 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,841 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,846 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,852 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,857 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,863 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,865 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,871 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,877 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,882 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,889 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 
14:20:46,894 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,900 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,904 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,911 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,916 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,922 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,928 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,934 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,939 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,944 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,949 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,955 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,961 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,968 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,870 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,876 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,882 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,887 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,893 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,899 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,904 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,910 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,915 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,921 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,927 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,933 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,938 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,943 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,948 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,954 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,960 RLModel RLModel#OeYEbL 
replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,967 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,972 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,973 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,981 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,986 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,992 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:46,999 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 4.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,006 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,012 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,018 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,023 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,030 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,035 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,041 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,046 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,053 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,059 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,065 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,070 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,076 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,081 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,980 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,985 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,991 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:46,998 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,005 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,011 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,017 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,022 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,029 RLModel RLModel#OeYEbL 
replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,034 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,040 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,045 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,052 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.1ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,058 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,064 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,069 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,075 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,080 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,088 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,093 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,099 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,104 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,110 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,115 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,121 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,126 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,132 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,137 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,144 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,149 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,155 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,161 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,167 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,172 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,179 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,185 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,086 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,092 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,098 RLModel RLModel#OeYEbL 
replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,103 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,109 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,114 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,120 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,125 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,131 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,136 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,142 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,148 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,154 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,160 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,166 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,171 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,177 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,183 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,191 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,192 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,198 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,204 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,210 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,216 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,221 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,228 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,233 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,239 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,243 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,249 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,254 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,260 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,265 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 
2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,271 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,276 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,283 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,288 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,295 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,197 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,203 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,209 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,215 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,220 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,227 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,232 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,238 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,242 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,248 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,253 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,259 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,264 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,270 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,275 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,282 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.0ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,287 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,294 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,299 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,300 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,308 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,314 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,320 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,325 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,332 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 
14:20:47,338 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,344 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,350 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,356 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,362 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,369 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,374 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,381 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,387 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,393 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,399 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,406 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,307 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.0ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,313 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,319 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,324 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,331 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,337 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,343 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,349 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,355 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,361 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,368 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,373 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,380 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,386 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,392 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.0ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,398 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,405 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,412 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,410 RLModel RLModel#OeYEbL replica.py:483 
- HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,420 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,425 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,432 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,437 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,444 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,449 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,456 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,461 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,468 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,473 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,480 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,485 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,492 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,497 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,504 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,510 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,516 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,418 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,424 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,431 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,436 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,442 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,448 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,455 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,460 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,467 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,472 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,479 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,484 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,491 RLModel RLModel#OeYEbL replica.py:483 - HANDLE 
__call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,496 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,503 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,509 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.3ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,515 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,521 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,522 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,529 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,534 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,540 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,545 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,551 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,557 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,563 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,568 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,574 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,580 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,586 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,591 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,597 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,602 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,608 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,613 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,620 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,626 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,528 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,533 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,539 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,544 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,550 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,556 RLModel RLModel#OeYEbL replica.py:483 - HANDLE 
__call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,562 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,567 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,573 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,579 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,585 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,590 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,596 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,601 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,607 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,613 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,619 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,624 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:47,631 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,632 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,638 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,645 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,650 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,656 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,662 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,668 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,673 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,679 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,684 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,691 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,696 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,702 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,707 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,713 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,718 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,724 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 
(HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,729 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms
(HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:47,735 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms
(RLModel pid=15633) INFO 2022-05-19 14:20:47,637 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms
(RLModel pid=15633) INFO 2022-05-19 14:20:47,644 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms
... (output truncated: the HTTPProxyActor and RLModel replica lines above repeat once per request; each POST /RLModel round trip completes in roughly 2-4 ms at the proxy, and each replica __call__ is handled in under 2 ms.)
replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,551 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.6ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,556 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,562 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,567 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,573 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,579 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,599 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 21.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,641 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 22.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,663 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 3.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,671 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,677 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,685 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,691 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,698 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,704 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,711 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,716 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,722 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,622 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.4ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,662 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,670 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.1ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,676 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,684 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.0ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,690 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,697 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.0ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,703 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,710 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.0ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,715 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,721 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,726 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms 
(HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,727 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,734 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,739 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,745 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,750 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,756 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,761 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,767 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,772 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,778 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,783 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,789 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,794 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,800 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,806 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,813 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,818 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,824 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,829 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,835 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,733 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,738 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,744 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,749 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,755 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,760 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,766 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,771 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,777 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,782 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 
0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,788 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,793 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,799 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,805 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,812 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,817 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,823 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,828 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,834 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,841 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,847 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,852 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,858 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,864 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,870 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,875 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,882 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,887 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,893 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,898 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,904 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,910 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,916 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,921 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,927 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,932 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,938 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,943 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,840 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,846 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 
1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,851 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,857 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,863 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,869 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,874 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,881 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,886 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,892 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,897 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,903 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,909 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,915 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,920 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,926 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,931 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,937 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,942 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,950 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,956 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,962 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,967 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,973 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,979 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,985 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,990 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:49,996 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,001 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,010 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 6.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,015 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,021 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms (HTTPProxyActor pid=15630) INFO 
2022-05-19 14:20:50,027 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,033 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,038 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,045 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,949 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,955 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,961 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,966 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,972 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,978 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,984 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,989 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:49,995 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,000 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,008 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,014 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,020 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,026 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,032 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,037 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,044 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,049 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,050 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,058 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,063 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,069 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,074 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,081 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,086 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,093 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,098 http_proxy 127.0.0.1 
http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,104 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,109 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,115 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,120 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,127 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,132 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,137 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,142 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,149 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,154 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,057 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,062 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,068 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,073 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,080 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,085 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,092 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,097 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,103 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,108 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,114 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,119 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,126 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,131 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,136 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.6ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,141 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,147 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,153 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,160 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,161 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms 
(HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,167 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,173 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,178 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,184 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,189 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,195 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,200 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,207 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,213 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,219 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,224 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,231 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,236 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,242 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,247 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,253 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,259 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,265 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,166 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,172 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,177 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,183 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,188 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,194 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,200 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,206 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,212 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,218 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,223 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,229 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel 
pid=15633) INFO 2022-05-19 14:20:50,235 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,241 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,246 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,252 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,257 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,264 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,269 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,270 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,277 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,282 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,288 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,293 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,300 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,304 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,346 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 20.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,276 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,282 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,287 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,293 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,299 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,304 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,329 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.4ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,366 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,384 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 19.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,393 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,399 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,407 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,413 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,420 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,426 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.6ms (HTTPProxyActor pid=15630) INFO 
2022-05-19 14:20:50,433 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,439 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,445 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,450 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,455 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,461 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,467 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,473 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,480 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,485 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,491 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,392 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,398 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,405 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.1ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,412 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,419 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,425 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,432 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.0ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,438 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,444 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,449 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,454 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,460 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,466 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,471 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,478 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,484 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,490 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,496 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,495 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,504 http_proxy 127.0.0.1 
http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,509 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,515 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,520 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,526 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,532 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,538 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,543 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,549 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,554 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,561 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,566 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,572 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,577 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,584 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,589 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,595 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,600 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,503 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,508 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,514 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,519 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,525 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,531 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.3ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,537 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,542 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,548 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,553 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,559 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,565 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,571 RLModel RLModel#OeYEbL 
replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,576 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,583 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,588 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,594 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,599 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,607 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,606 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,613 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,619 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,624 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,630 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,635 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,641 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,645 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,652 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,657 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,663 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,668 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,674 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,680 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 3.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,686 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,692 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,698 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,704 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,710 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.1ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,612 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,618 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,623 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,629 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,634 RLModel RLModel#OeYEbL 
replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,640 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,645 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,651 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,656 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,662 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,667 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,673 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,679 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,685 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,691 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,697 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,703 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,709 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.9ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,715 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,716 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,722 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.9ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,727 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,733 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,738 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,744 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,749 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,755 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,760 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,767 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,772 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.3ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,778 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 4.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,783 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,789 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,794 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,800 http_proxy 127.0.0.1 http_proxy.py:320 - POST 
/RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,805 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,811 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,816 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.1ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,822 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,721 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,726 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,733 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,737 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,743 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.6ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,748 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,754 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,759 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,766 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,771 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,777 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,782 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,788 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,794 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,800 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,805 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,810 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.7ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,816 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms (RLModel pid=15633) INFO 2022-05-19 14:20:50,821 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 1.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,828 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.4ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,834 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.8ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,839 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,845 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.5ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,850 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.0ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,856 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.6ms (HTTPProxyActor pid=15630) INFO 2022-05-19 14:20:50,861 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 307 2.2ms (HTTPProxyActor pid=15630) INFO 
2022-05-19 14:20:50,867 http_proxy 127.0.0.1 http_proxy.py:320 - POST /RLModel 200 3.7ms
(RLModel pid=15633) INFO 2022-05-19 14:20:50,827 RLModel RLModel#OeYEbL replica.py:483 - HANDLE __call__ OK 0.2ms
... (hundreds of similar HTTPProxyActor and RLModel request log lines truncated) ...
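The POST /RLModel entries above come from querying the deployed policy over HTTP once per environment step. Below is a minimal client sketch of such a loop; the endpoint URL, the JSON payload format, the way the action is read from the response, and the evaluate_served_policy name are illustrative assumptions rather than the exact code used in this example:

import gymnasium as gym
import numpy as np
import requests

def evaluate_served_policy(endpoint: str = "http://localhost:8000/RLModel", num_episodes: int = 3) -> list:
    # Hypothetical helper: roll out episodes against a served policy.
    env = gym.make("CartPole-v1")
    rewards = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        total_reward = 0.0
        terminated = truncated = False
        while not terminated and not truncated:
            # Each step issues one POST /RLModel request, which is what fills
            # the proxy and replica logs above. The payload and response
            # formats here are assumptions.
            response = requests.post(endpoint, json={"array": np.array([obs]).tolist()})
            action = response.json()[0]
            obs, r, terminated, truncated, _ = env.step(action)
            total_reward += r
        rewards.append(total_reward)
    return rewards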
print("Episode rewards:", rewards)

Episode rewards: [200.0, 200.0, 200.0]

After we're done, we can shut down Ray Serve.

serve.shutdown()
(ServeController pid=15625) INFO 2022-05-19 14:20:52,519 controller 15625 deployment_state.py:1241 - Removing 1 replicas from deployment 'RLModel'.

Online reinforcement learning with Ray AIR

In this example, we'll train a reinforcement learning agent using online training. Online training means that data is sampled from the environment while the algorithm is running. In contrast, offline training uses data that was collected and stored on disk beforehand.

Let's start by installing our dependencies:

!pip install -qU "ray[rllib]" gymnasium

Now we can run some imports:

import gymnasium as gym
import numpy as np

from ray.air import Checkpoint
from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
from ray.air.result import Result
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.tune.tuner import Tuner

2022-05-19 13:54:16,520 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
2022-05-19 13:54:16,531 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.marwil` has been deprecated. Use `ray.rllib.algorithms.marwil` instead. This will raise an error in the future!

Here we define the training function. It creates an RLTrainer that uses the PPO algorithm and kicks off training on the CartPole-v1 environment:

def train_rl_ppo_online(num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting online training")
    trainer = RLTrainer(
        run_config=RunConfig(
            stop={"training_iteration": 5},
            checkpoint_config=CheckpointConfig(checkpoint_at_end=True),
        ),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        algorithm="PPO",
        config={
            "env": "CartPole-v1",
            "framework": "tf",
        },
    )
    result = trainer.fit()
    return result

Once we've trained our RL policy, we want to evaluate it on a fresh environment. For this, we also define a utility function:

def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes: int) -> list:
    predictor = RLPredictor.from_checkpoint(checkpoint)
    env = gym.make("CartPole-v1")
    rewards = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        reward = 0.0
        terminated = truncated = False
        while not terminated and not truncated:
            action = predictor.predict(np.array([obs]))
            obs, r, terminated, truncated, _ = env.step(action[0])
            reward += r
        rewards.append(reward)
    return rewards

Let's put it all together. First, we run training:

result = train_rl_ppo_online(num_workers=2, use_gpu=False)

2022-05-19 13:54:16,582 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.agents.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!
Starting online training
2022-05-19 13:54:19,326 INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8267
== Status ==
Current time: 2022-05-19 13:54:57 (running for 00:00:35.99)
Memory usage on this node: 9.6/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.54 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16
Number of trials: 1/1 (1 TERMINATED)
Trial name                  status      loc              iter   total time (s)      ts   reward   episode_reward_max   episode_reward_min   episode_len_mean
AIRPPOTrainer_cd8d6_00000   TERMINATED  127.0.0.1:14174     5          16.7029   20000   124.79                  200                    9             124.79


(raylet) 2022-05-19 13:54:23,061 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 ... (two further worker processes started with similar commands)
(pid=14174) 2022-05-19 13:54:30,271 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,749 INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,750 INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,750 INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=14179) 2022-05-19 13:54:39,442 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:40,836 INFO trainable.py:163 -- Trainable.setup took 10.087 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(further RolloutWorker deprecation warnings and "Install gputil for GPU system monitoring" warnings omitted)

Result for AIRPPOTrainer_cd8d6_00000 (training_iteration: 1):
  episode_reward_mean: 22.12
  episode_reward_max: 87.0
  episode_reward_min: 8.0
  episode_len_mean: 22.12
  episodes_this_iter: 179
  timesteps_total: 4000
  time_total_s: 3.73
  (full result dict with learner stats, sampler performance, and per-episode histograms omitted)

Result for AIRPPOTrainer_cd8d6_00000 (training_iteration: 3):
  episode_reward_mean: 65.15
  episode_reward_max: 200.0
  episode_reward_min: 9.0
  episode_len_mean: 65.15
  episodes_this_iter: 44
  timesteps_total: 12000
  time_total_s: 10.29

Result for AIRPPOTrainer_cd8d6_00000 (training_iteration: 5, done: true):
  episode_reward_mean: 124.79
  episode_reward_max: 200.0
  episode_reward_min: 9.0
  episode_len_mean: 124.79
  episodes_this_iter: 20
  timesteps_total: 20000
  time_total_s: 16.70

2022-05-19 13:54:58,548 INFO tune.py:753 -- Total run time: 36.92 seconds (35.95 seconds for the tuning loop).

And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:

num_eval_episodes = 3

rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: " f"{np.mean(rewards)}")

2022-05-19 13:54:58,589 INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-05-19 13:54:58,590 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
2022-05-19 13:54:58,591 INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker deprecation warnings and "Install gputil for GPU system monitoring" warnings omitted)
2022-05-19 13:55:08,021 INFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16/AIRPPOTrainer_cd8d6_00000_0_2022-05-19_13-54-22/checkpoint_000005/checkpoint-5
2022-05-19 13:55:08,021 INFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 16.702913284301758, '_episodes_total': 354}

Average reward over 3 episodes: 200.0
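The Result object returned by trainer.fit() also exposes the last reported metrics and the checkpoint directly, so the numbers shown in the trial table above can be read back programmatically. The following is a minimal sketch, not part of the original example; it assumes the training call above has completed and that the standard RLlib metric keys (as seen in the result output) are present:

# `result` is the ray.air.result.Result returned by train_rl_ppo_online() above.
# `result.metrics` is a dict with the last reported training result;
# "episode_reward_mean" and "timesteps_total" are the usual RLlib keys.
print("Final mean episode reward:", result.metrics["episode_reward_mean"])
print("Total timesteps sampled:", result.metrics["timesteps_total"])
print("Checkpoint used for evaluation:", result.checkpoint)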
Offline reinforcement learning with Ray AIR

In this example, we train a reinforcement learning agent using offline training. Offline training means that the data from the environment (and the actions performed by the agent) has been stored on disk beforehand. In contrast, online training samples experiences live by interacting with the environment.

Let's start by installing our dependencies:

# !pip install -qU "ray[rllib]" gymnasium

Now we can run some imports:

import gymnasium as gym
import numpy as np

import ray
from ray.air import Checkpoint
from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
from ray.air.result import Result
from ray.rllib.algorithms.bc import BC
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.tune.tuner import Tuner

(Importing these modules prints a long series of TensorFlow protobuf DeprecationWarning messages of the form "Call to deprecated create function FileDescriptor()/FieldDescriptor()/EnumValueDescriptor()/Descriptor()", plus distutils version warnings from tensorflow_probability; this output is omitted here.)

WARNING:tensorflow:From /Users/avnish/miniforge3/envs/ray/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:561: calling function (from tensorflow.python.eager.polymorphic_function.polymorphic_function) with experimental_relax_shapes is deprecated and will be removed in a future version.
Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead

We will be training on offline data. This means we have full agent trajectories stored on disk and want to train on these past experiences. Usually this data could come from external systems or a database of historical data. For this example, however, we'll generate some offline data ourselves and store it using RLlib's output_config.
def generate_offline_data(path: str):
    print(f"Generating offline data for training at {path}")
    trainer = RLTrainer(
        algorithm="PPO",
        run_config=RunConfig(stop={"timesteps_total": 5000}),
        config={
            "env": "CartPole-v1",
            "output": "dataset",
            "output_config": {
                "format": "json",
                "path": path,
                "max_num_samples_per_file": 1,
            },
            "batch_mode": "complete_episodes",
            "framework": "torch",
        },
    )
    trainer.fit()

Here we define the training function. It creates an RLTrainer that runs the BC (behavior cloning) algorithm on the CartPole-v1 environment, using the offline data provided in path:

def train_rl_bc_offline(path: str, num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting offline training")
    dataset = ray.data.read_json(
        path, parallelism=num_workers, ray_remote_args={"num_cpus": 1}
    )
    trainer = RLTrainer(
        run_config=RunConfig(
            stop={"training_iteration": 5},
            checkpoint_config=CheckpointConfig(checkpoint_at_end=True),
        ),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        datasets={"train": dataset},
        algorithm=BC,
        config={
            "env": "CartPole-v1",
            "framework": "torch",
            "evaluation_num_workers": 1,
            "evaluation_interval": 1,
            "evaluation_config": {"input": "sampler"},
        },
    )
    result = trainer.fit()
    return result

Once we have trained our RL policy, we want to evaluate it on a fresh environment. For this, we again define a utility function:

def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes) -> list:
    predictor = RLPredictor.from_checkpoint(checkpoint)
    env = gym.make("CartPole-v1")
    rewards = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        reward = 0.0
        terminated = truncated = False
        while not terminated and not truncated:
            action = predictor.predict(np.array([obs]))
            obs, r, terminated, truncated, _ = env.step(action[0])
            reward += r
        rewards.append(reward)
    return rewards

Let's put it all together. First, we initialize Ray and create the offline data:

# ray.init(num_cpus=8)

path = "/tmp/out"
generate_offline_data(path)

Generating offline data for training at /tmp/out

Tune Status

Current time: 2023-03-30 10:01:06
Running for: 00:00:16.15
Memory: 18.1/32.0 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 0/8 CPUs, 0/0 GPUs

Trial Status

Trial name          status      loc              iter   total time (s)   ts     reward   episode_reward_max   episode_reward_min   episode_len_mean
AIRPPO_6c451_00000  TERMINATED  127.0.0.1:82000  2      7.14829          8179   43.54    143                  10                   43.54

Trial Progress (selected metrics)

Trial name: AIRPPO_6c451_00000
done: True
training_iteration: 2
timesteps_total: 8179
episodes_total: 266
episode_reward_mean: 43.54
episode_reward_max: 143.0
episode_reward_min: 10.0
episode_len_mean: 43.54
time_total_s: 7.14829
date: 2023-03-30_10-01-06
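Before training on the generated data, it can be useful to sanity-check what the PPO run actually wrote to disk. A minimal sketch, assuming the JSON episode files ended up under the path used above (the training function defined earlier performs its own ray.data.read_json call when it runs):

# Optional: peek at the generated offline dataset (assumes `path` from above).
check_ds = ray.data.read_json(path)
print("Number of records:", check_ds.count())
print("Schema:", check_ds.schema())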
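Next, we run the offline BC training itself. A minimal sketch of how train_rl_bc_offline might be invoked to produce the result used below (the num_workers value here is an assumption):

# Hypothetical invocation; num_workers=2 is an assumed value.
result = train_rl_bc_offline(path=path, num_workers=2, use_gpu=False)

The Tune output below is from this offline training run.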

Tune Status

Current time: 2023-03-30 10:02:06
Running for: 00:00:14.18
Memory: 18.3/32.0 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 0/8 CPUs, 0/0 GPUs

Trial Status

Trial name         status      loc              iter   total time (s)   ts      reward   episode_reward_max   episode_reward_min   episode_len_mean
AIRBC_914d8_00000  TERMINATED  127.0.0.1:82274  5      0.829429         20384   nan      nan                  nan                  nan

The reward columns are nan because BC trains purely on the stored offline data and samples no online episodes during training; rewards are instead reported by the evaluation workers, as shown in the trial progress below.

Trial Progress

Trial name   agent_timesteps_total   connector_metrics   counters   custom_metrics   date   done   episode_len_mean   episode_media   episode_reward_max   episode_reward_mean   episode_reward_min   episodes_this_iter   episodes_total   evaluation   hostname   info   iterations_since_restore   node_ip   num_agent_steps_sampled   num_agent_steps_trained   num_env_steps_sampled   num_env_steps_sampled_this_iter   num_env_steps_trained   num_env_steps_trained_this_iter   num_faulty_episodes   num_healthy_workers   num_in_flight_async_reqs   num_remote_worker_restarts   num_steps_trained_this_iter   perf   pid   policy_reward_max   policy_reward_mean   policy_reward_min   sampler_perf   sampler_results   time_since_restore   time_this_iter_s   time_total_s   timers   timestamp   timesteps_total   training_iteration   trial_id
AIRBC_914d8_00000 20384{} {'num_env_steps_sampled': 20384, 'num_env_steps_trained': 20384, 'num_agent_steps_sampled': 20384, 'num_agent_steps_trained': 20384}{} 2023-03-30_10-02-06True nan{} nan nan nan 0 0{'episode_reward_max': 76.0, 'episode_reward_min': 11.0, 'episode_reward_mean': 29.6, 'episode_len_mean': 29.6, 'episode_media': {}, 'episodes_this_iter': 10, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [13.0, 13.0, 11.0, 76.0, 48.0, 36.0, 14.0, 34.0, 14.0, 37.0], 'episode_lengths': [13, 13, 11, 76, 48, 36, 14, 34, 14, 37]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.12616764114882342, 'mean_inference_ms': 0.2590957092866947, 'mean_action_processing_ms': 0.04460762030537042, 'mean_env_wait_ms': 0.021067194550248115, 'mean_env_render_ms': 0.0}, 'num_faulty_episodes': 0, 'connector_metrics': {'ObsPreprocessorConnector_ms': 0.0019359588623046875, 'StateBufferConnector_ms': 0.0017905235290527344, 'ViewRequirementAgentConnector_ms': 0.03647565841674805}, 'num_agent_steps_sampled_this_iter': 296, 'num_env_steps_sampled_this_iter': 296, 'timesteps_this_iter': 296, 'num_healthy_workers': 1, 'num_in_flight_async_reqs': 0, 'num_remote_worker_restarts': 0}avnishs-mbp-3.lan{'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.0902162678539753, 'policy_loss': 0.6931167542934418, 'total_loss': 0.6931167542934418}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 2000.0, 'num_grad_updates_lifetime': 9.5, 'diff_num_grad_updates_vs_sampler_policy': 8.5}}, 'num_env_steps_sampled': 20384, 'num_env_steps_trained': 20384, 'num_agent_steps_sampled': 20384, 'num_agent_steps_trained': 20384} 5127.0.0.1 20384 20384 20384 4026 20384 4026 0 2 0 0 4026{'cpu_util_percent': 31.2, 'ram_util_percent': 57.0}82274{} {} {} {} {'episode_reward_max': nan, 'episode_reward_min': nan, 'episode_reward_mean': nan, 'episode_len_mean': nan, 'episode_media': {}, 'episodes_this_iter': 0, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [], 'episode_lengths': []}, 'sampler_perf': {}, 'num_faulty_episodes': 0, 'connector_metrics': {}} 0.829429 0.184188 0.829429{'training_iteration_time_ms': 45.667, 'sample_time_ms': 26.685, 'load_time_ms': 0.193, 'load_throughput': 21115508.208, 'learn_time_ms': 17.555, 'learn_throughput': 232232.678, 'synch_weights_time_ms': 1.169} 1680195726 20384 5914d8_00000
[TensorFlow protobuf DeprecationWarning messages repeated across the cluster ("Call to deprecated create function FileDescriptor()/FieldDescriptor()/Descriptor()/EnumValueDescriptor(). ..."); omitted for brevity.]
(pid=82313) 2023-03-30 10:02:05,854 INFO streaming_executor.py:83 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[RandomShuffle]
2023-03-30 10:02:06,881 INFO tune.py:945 -- Total run time: 14.67 seconds (14.17 seconds for the tuning loop).

And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:

num_eval_episodes = 3

rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: " f"{np.mean(rewards)}")

2023-03-30 10:02:11,829 WARNING checkpoints.py:109 -- No `rllib_checkpoint.json` file found in checkpoint directory /var/folders/jr/6lgb7_ln64v1kppw9szl17rc0000gn/T/tmp5kq688t7!
Trying to extract checkpoint info from other files found in that dir.
2023-03-30 10:02:11,841 INFO policy.py:1285 -- Policy (worker=local) running on CPU.
2023-03-30 10:02:11,841 INFO torch_policy_v2.py:110 -- Found 0 visible cuda devices.

Average reward over 3 episodes: 24.666666666666668

Logging results and uploading models to Comet ML

In this example, we train a simple XGBoost model and log the training results to Comet ML. We also save the resulting model checkpoints as artifacts.

Let's start with installing our dependencies:

!pip install -qU "ray[tune]" scikit-learn xgboost_ray comet_ml

Then we need some imports:

import ray

from ray.air.config import RunConfig, ScalingConfig
from ray.air.result import Result
from ray.train.xgboost import XGBoostTrainer
from ray.air.integrations.comet import CometLoggerCallback

We define a simple function that returns our training dataset as a Dataset:

def get_train_dataset() -> ray.data.Dataset:
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    return dataset

Now we define a simple training function. All the magic happens within the CometLoggerCallback:

CometLoggerCallback(
    project_name=comet_project,
    save_checkpoints=True,
)

It will automatically log all results to Comet ML and upload the checkpoints as artifacts. It assumes you're logged in to Comet via an API key or your ~/.comet.config.

def train_model(train_dataset: ray.data.Dataset, comet_project: str) -> Result:
    """Train a simple XGBoost model and return the result."""
    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=2),
        params={"tree_method": "auto"},
        label_column="target",
        datasets={"train": train_dataset},
        num_boost_round=10,
        run_config=RunConfig(
            callbacks=[
                # This is the part needed to enable logging to Comet ML.
                # It assumes Comet ML can find a valid API (e.g. by setting
                # the ``COMET_API_KEY`` environment variable).
                CometLoggerCallback(
                    project_name=comet_project,
                    save_checkpoints=True,
                )
            ]
        ),
    )
    result = trainer.fit()
    return result

Let's kick off a run:

comet_project = "ray_air_example"

train_dataset = get_train_dataset()
result = train_model(train_dataset=train_dataset, comet_project=comet_project)

2022-05-19 15:19:17,237 INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8265

== Status ==
Current time: 2022-05-19 15:19:35 (running for 00:00:14.95)
Memory usage on this node: 10.2/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/5.12 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-19-19
Number of trials: 1/1 (1 TERMINATED)
Trial name status loc iter total time (s) train-rmse
XGBoostTrainer_ac544_00000   TERMINATED   127.0.0.1:19852      10          9.7203    0.030717


COMET WARNING: As you are running in a Jupyter environment, you will need to call `experiment.end()` when finished to ensure all metrics and code are logged before exiting.
(raylet) 2022-05-19 15:19:21,584 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py ... [similar worker startup messages repeated for each new worker process; omitted for brevity]
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/krfricke/ray-air-example/ecd3726ca127497ba7386003a249fad6
COMET WARNING: Failed to add tag(s) None to the experiment
COMET WARNING: Empty mapping given to log_params({}); ignoring
(GBDTTrainable pid=19852) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.
(GBDTTrainable pid=19852) 2022-05-19 15:19:25,961 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.
(GBDTTrainable pid=19852) 2022-05-19 15:19:29,272 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=19876) [15:19:29] task [xgboost.ray]:4505889744 got new rank 1
(_RemoteRayXGBoostActor pid=19875) [15:19:29] task [xgboost.ray]:6941849424 got new rank 0
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 1.0.0 created

Result for XGBoostTrainer_ac544_00000:
  date: 2022-05-19_15-19-30
  done: false
  experiment_id: d3007bd6a2734b328fd90385485c5a8d
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  pid: 19852
  should_checkpoint: true
  time_since_restore: 6.529659032821655
  time_this_iter_s: 6.529659032821655
  time_total_s: 6.529659032821655
  timestamp: 1652969970
  timesteps_since_restore: 0
  train-rmse: 0.357284
  training_iteration: 1
  trial_id: ac544_00000
  warmup_time: 0.003961086273193359

COMET INFO: Scheduling the upload of 3 assets for a size of 2.48 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:1.0.0' has started uploading asynchronously
[Comet creates and uploads one checkpoint artifact per training iteration; versions 2.0.0 through 11.0.0 follow with analogous "created", "scheduling upload", and "fully uploaded" messages, omitted here for brevity.]
(GBDTTrainable pid=19852) 2022-05-19 15:19:33,890 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.96 seconds (4.61 pure XGBoost training time).
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/krfricke/ray-air-example/ecd3726ca127497ba7386003a249fad6
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     iterations_since_restore [10] : (1, 10)
COMET INFO:     time_since_restore [10]       : (6.529659032821655, 9.720295906066895)
COMET INFO:     time_this_iter_s [10]         : (0.3124058246612549, 6.529659032821655)
COMET INFO:     time_total_s [10]             : (6.529659032821655, 9.720295906066895)
COMET INFO:     timestamp [10]                : (1652969970, 1652969973)
COMET INFO:     timesteps_since_restore       : 0
COMET INFO:     train-rmse [10]               : (0.030717, 0.357284)
COMET INFO:     training_iteration [10]       : (1, 10)
COMET INFO:     warmup_time                   : 0.003961086273193359
COMET INFO:   Others:
COMET INFO:     Created from  : Ray
COMET INFO:     Name          : XGBoostTrainer_ac544_00000
COMET INFO:     experiment_id : d3007bd6a2734b328fd90385485c5a8d
COMET INFO:     trial_id      : ac544_00000
COMET INFO:   System Information:
COMET INFO:     date     : 2022-05-19_15-19-33
COMET INFO:     hostname : Kais-MacBook-Pro.local
COMET INFO:     node_ip  : 127.0.0.1
COMET INFO:     pid      : 19852
COMET INFO:   Uploads:
COMET INFO:     artifact assets     : 33 (107.92 KB)
COMET INFO:     artifacts           : 11
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     installed packages  : 1
COMET INFO:     notebook            : 1
COMET INFO:     source_code         : 1
COMET INFO: ---------------------------
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: Waiting for completion of the file uploads (may take several seconds)

Result for XGBoostTrainer_ac544_00000:
  date: 2022-05-19_15-19-33
  done: true
  experiment_id: d3007bd6a2734b328fd90385485c5a8d
  experiment_tag: '0'
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  pid: 19852
  should_checkpoint: true
  time_since_restore: 9.720295906066895
  time_this_iter_s: 0.39761900901794434
  time_total_s: 9.720295906066895
  timestamp: 1652969973
  timesteps_since_restore: 0
  train-rmse: 0.030717
  training_iteration: 10
  trial_id: ac544_00000
  warmup_time: 0.003961086273193359

2022-05-19 15:19:35,621 INFO tune.py:753 -- Total run time: 15.75 seconds (14.94 seconds for the tuning loop).

Check out your Comet ML project to see the results!
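Besides browsing the Comet ML UI, you can also work with the returned Result object directly in Python. A minimal sketch, reusing the result returned by trainer.fit() above (the exact metric keys may vary by Ray version):

# The last reported training metrics for this trial, e.g. the final train-rmse.
print(result.metrics.get("train-rmse"))

# The checkpoint uploaded to Comet as an artifact is also available locally
# and can be loaded into a predictor for inference.
from ray.train.xgboost import XGBoostPredictor

predictor = XGBoostPredictor.from_checkpoint(result.checkpoint)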
Logging results and uploading models to Weights & Biases

In this example, we train a simple XGBoost model and log the training results to Weights & Biases. We also save the resulting model checkpoints as artifacts.

There are two ways to achieve this:

Automatically, using the ray.air.integrations.wandb.WandbLoggerCallback
Manually, using the wandb API

This tutorial will walk you through both options.

Let's start with installing our dependencies:

!pip install -qU "ray[tune]" scikit-learn xgboost_ray wandb

Then we need some imports:

import ray

from ray.air.config import RunConfig, ScalingConfig
from ray.air.result import Result
from ray.air.integrations.wandb import WandbLoggerCallback

We define a simple function that returns our training dataset as a Dataset:

def get_train_dataset() -> ray.data.Dataset:
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    return dataset

And that's the common part. We now dive into the two options to interact with Weights and Biases.

Using the WandbLoggerCallback

The WandbLoggerCallback does all the logging and reporting for you. It is especially useful when you use an out-of-the-box trainer like XGBoostTrainer. In these trainers, you don't define your own training loop, so using the AIR W&B callback is the best way to log your results to Weights and Biases.

First we define a simple training function. All the magic happens within the WandbLoggerCallback:

WandbLoggerCallback(
    project=wandb_project,
    save_checkpoints=True,
)

It will automatically log all results to Weights & Biases and upload the checkpoints as artifacts. It assumes you're logged in to W&B via an API key or wandb login.

from ray.train.xgboost import XGBoostTrainer

def train_model_xgboost(train_dataset: ray.data.Dataset, wandb_project: str) -> Result:
    """Train a simple XGBoost model and return the result."""
    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=2),
        params={"tree_method": "auto"},
        label_column="target",
        datasets={"train": train_dataset},
        num_boost_round=10,
        run_config=RunConfig(
            callbacks=[
                # This is the part needed to enable logging to Weights & Biases.
                # It assumes you've logged in before, e.g. with `wandb login`.
                WandbLoggerCallback(
                    project=wandb_project,
                    save_checkpoints=True,
                )
            ]
        ),
    )
    result = trainer.fit()
    return result

Let's kick off a run:

wandb_project = "ray_air_example_xgboost"

train_dataset = get_train_dataset()
result = train_model_xgboost(train_dataset=train_dataset, wandb_project=wandb_project)

2022-10-28 16:28:19,325 INFO worker.py:1524 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2022-10-28 16:28:22,993 WARNING read_api.py:297 -- ⚠️ The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
2022-10-28 16:28:26,033 INFO wandb.py:267 -- Already logged into W&B.

Check out your WandB project to see the results!

Using the wandb API

When you define your own training loop, you sometimes want to manually interact with the Weights and Biases API. Ray AIR provides a setup_wandb() function that takes care of the initialization. The main benefit here is that authentication to Weights and Biases is automatically set up for you, and sensible default names for your runs are set. Of course, you can override these.

Additionally, in distributed training you often only want to report the results of the rank 0 worker. This can also be done automatically using our setup.

Let's define a distributed training loop.
The important parts here are:

wandb = setup_wandb(config)

and later

wandb.log({"loss": loss.item()})

The call to setup_wandb() will set up your session, for instance calling wandb.init() with sensible defaults. Because we are in a distributed training setting, this will only happen for the rank 0 worker - all other workers get a mock object back, and any subsequent calls to wandb.XXX will be a no-op for these.

You can then use wandb as usual:

from ray.air import session
from ray.air.integrations.wandb import setup_wandb
from ray.data.preprocessors import Concatenator

import numpy as np
import torch.optim as optim
import torch.nn as nn

def train_loop(config):
    wandb = setup_wandb(config, project=config.get("wandb_project"))

    dataset = session.get_dataset_shard("train")

    model = nn.Linear(30, 2)

    optimizer = optim.SGD(
        model.parameters(),
        lr=config.get("lr", 0.01),
    )
    loss_fn = nn.CrossEntropyLoss()

    for batch in dataset.iter_torch_batches(batch_size=32):
        X = batch["data"]
        y = batch["target"]

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        session.report({"loss": loss.item()})
        wandb.log({"loss": loss.item()})

Let's define a function to kick off the training - again, we can configure Weights and Biases settings in the config. But you could also just pass it to the setup function, e.g. like this: setup_wandb(config, project="my_project")

from ray.train.torch import TorchTrainer

def train_model_torch(train_dataset: ray.data.Dataset, wandb_project: str) -> Result:
    """Train a simple Torch model and return the result."""
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop,
        scaling_config=ScalingConfig(num_workers=2),
        train_loop_config={"lr": 0.01, "wandb_project": wandb_project},
        datasets={"train": train_dataset},
        preprocessor=Concatenator("data", dtype=np.float32, exclude=["target"]),
    )
    result = trainer.fit()
    return result

Let's kick off this run:

wandb_project = "ray_air_example_torch"

train_dataset = get_train_dataset()
result = train_model_torch(train_dataset=train_dataset, wandb_project=wandb_project)

Check out your WandB project to see the results!

Integrate Ray AIR with Feast feature store

# !pip install feast==0.20.1 ray[air]>=1.13 xgboost_ray

In this example, we showcase how to use Ray AIR with the Feast feature store, leveraging both historical features for training a model and online features for inference. The task is adapted from the Feast credit scoring tutorial: we train an XGBoost model and run a prediction on an incoming loan request to see whether it is approved or rejected.

Let's first set up our workspace and prepare the data to work with.

! wget --no-check-certificate https://github.com/ray-project/air-sample-data/raw/main/air-feast-example.zip
! unzip air-feast-example.zip
%cd air-feast-example

--2022-09-12 19:24:21--  https://github.com/ray-project/air-sample-data/raw/main/air-feast-example.zip
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ray-project/air-sample-data/main/air-feast-example.zip [following]
--2022-09-12 19:24:21--  https://raw.githubusercontent.com/ray-project/air-sample-data/main/air-feast-example.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23715107 (23M) [application/zip]
Saving to: 'air-feast-example.zip'

air-feast-example.z 100%[===================>]  22.62M  8.79MB/s    in 2.6s

2022-09-12 19:24:25 (8.79 MB/s) - 'air-feast-example.zip' saved [23715107/23715107]

Archive:  air-feast-example.zip
   creating: air-feast-example/
   creating: air-feast-example/feature_repo/
  inflating: air-feast-example/feature_repo/.DS_Store
 extracting: air-feast-example/feature_repo/__init__.py
  inflating: air-feast-example/feature_repo/features.py
   creating: air-feast-example/feature_repo/data/
  inflating: air-feast-example/feature_repo/data/.DS_Store
  inflating: air-feast-example/feature_repo/data/credit_history_sample.csv
  inflating: air-feast-example/feature_repo/data/zipcode_table_sample.csv
  inflating: air-feast-example/feature_repo/data/credit_history.parquet
  inflating: air-feast-example/feature_repo/data/zipcode_table.parquet
  inflating: air-feast-example/feature_repo/feature_store.yaml
  inflating: air-feast-example/.DS_Store
   creating: air-feast-example/data/
  inflating: air-feast-example/data/loan_table.parquet
  inflating: air-feast-example/data/loan_table_sample.csv
/home/ray/Desktop/workspace/ray/doc/source/ray-air/examples/air-feast-example

! ls

data  feature_repo

There is already a feature repository set up in feature_repo/. It isn't necessary to create a new feature repository, but it can be done using the following command: feast init -t local feature_repo.

Now let's take a look at the schema in the Feast feature store, which is defined by feature_repo/features.py. There are mainly two feature views: zipcode_features and credit_history, both generated from parquet files - feature_repo/data/zipcode_table.parquet and feature_repo/data/credit_history.parquet.
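The repository also contains feature_repo/feature_store.yaml, which configures the store itself (typically the Feast project name, registry path, provider, and online store settings). The example never prints it, but you can inspect it directly if you're curious; a minimal sketch:

from pathlib import Path

# Dump the local Feast store configuration that ships with the example repo.
print(Path("feature_repo/feature_store.yaml").read_text())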
!pygmentize feature_repo/features.py

from datetime import timedelta

from feast import (Entity, Field, FeatureView, FileSource, ValueType)
from feast.types import Float32, Int64, String

zipcode = Entity(name="zipcode", value_type=Int64)

zipcode_source = FileSource(
    path="feature_repo/data/zipcode_table.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

zipcode_features = FeatureView(
    name="zipcode_features",
    entities=["zipcode"],
    ttl=timedelta(days=3650),
    schema=[
        Field(name="city", dtype=String),
        Field(name="state", dtype=String),
        Field(name="location_type", dtype=String),
        Field(name="tax_returns_filed", dtype=Int64),
        Field(name="population", dtype=Int64),
        Field(name="total_wages", dtype=Int64),
    ],
    source=zipcode_source,
)

dob_ssn = Entity(
    name="dob_ssn",
    value_type=ValueType.STRING,
    description="Date of birth and last four digits of social security number",
)

credit_history_source = FileSource(
    path="feature_repo/data/credit_history.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

credit_history = FeatureView(
    name="credit_history",
    entities=["dob_ssn"],
    ttl=timedelta(days=90),
    schema=[
        Field(name="credit_card_due", dtype=Int64),
        Field(name="mortgage_due", dtype=Int64),
        Field(name="student_loan_due", dtype=Int64),
        Field(name="vehicle_loan_due", dtype=Int64),
        Field(name="hard_pulls", dtype=Int64),
        Field(name="missed_payments_2y", dtype=Int64),
        Field(name="missed_payments_1y", dtype=Int64),
        Field(name="missed_payments_6m", dtype=Int64),
        Field(name="bankruptcies", dtype=Int64),
    ],
    source=credit_history_source,
)

Deploy the above defined feature store by running apply from within the feature_repo/ folder.

! cd feature_repo && feast apply

Created entity zipcode
Created entity dob_ssn
Created feature view credit_history
Created feature view zipcode_features

Created sqlite table feature_repo_credit_history
Created sqlite table feature_repo_zipcode_features

import feast

fs = feast.FeatureStore(repo_path="feature_repo")

Generate training data

On top of the features in Feast, we also have labeled training data at data/loan_table.parquet. At training time, the loan table is passed into Feast as an entity dataframe for training data generation. Feast intelligently joins the credit_history and zipcode_features tables to create the feature vectors that augment the training data.

import pandas as pd

loan_df = pd.read_parquet("data/loan_table.parquet")
display(loan_df)
loan_id dob_ssn zipcode person_age person_income person_home_ownership person_emp_length loan_intent loan_amnt loan_int_rate loan_status event_timestamp created_timestamp
0 10000 19530219_5179 76104 22 59000 RENT 123.0 PERSONAL 35000 16.02 1 2021-08-25 20:34:41.361000+00:00 2021-08-25 20:34:41.361000+00:00
1 10001 19520816_8737 70380 21 9600 OWN 5.0 EDUCATION 1000 11.14 0 2021-08-25 20:16:20.128000+00:00 2021-08-25 20:16:20.128000+00:00
2 10002 19860413_2537 97039 25 9600 MORTGAGE 1.0 MEDICAL 5500 12.87 1 2021-08-25 19:57:58.896000+00:00 2021-08-25 19:57:58.896000+00:00
3 10003 19760701_8090 63785 23 65500 RENT 4.0 MEDICAL 35000 15.23 1 2021-08-25 19:39:37.663000+00:00 2021-08-25 19:39:37.663000+00:00
4 10004 19830125_8297 82223 24 54400 RENT 8.0 MEDICAL 35000 14.27 1 2021-08-25 19:21:16.430000+00:00 2021-08-25 19:21:16.430000+00:00
... ... ... ... ... ... ... ... ... ... ... ... ... ...
28633 38633 19491126_1487 43205 57 53000 MORTGAGE 1.0 PERSONAL 5800 13.16 0 2020-08-25 21:48:06.292000+00:00 2020-08-25 21:48:06.292000+00:00
28634 38634 19681208_6537 24872 54 120000 MORTGAGE 4.0 PERSONAL 17625 7.49 0 2020-08-25 21:29:45.059000+00:00 2020-08-25 21:29:45.059000+00:00
28635 38635 19880422_2592 68826 65 76000 RENT 3.0 HOMEIMPROVEMENT 35000 10.99 1 2020-08-25 21:11:23.826000+00:00 2020-08-25 21:11:23.826000+00:00
28636 38636 19901017_6108 92014 56 150000 MORTGAGE 5.0 PERSONAL 15000 11.48 0 2020-08-25 20:53:02.594000+00:00 2020-08-25 20:53:02.594000+00:00
28637 38637 19960703_3449 69033 66 42000 RENT 2.0 MEDICAL 6475 9.99 0 2020-08-25 20:34:41.361000+00:00 2020-08-25 20:34:41.361000+00:00

28638 rows × 13 columns

feast_features = [
    "zipcode_features:city",
    "zipcode_features:state",
    "zipcode_features:location_type",
    "zipcode_features:tax_returns_filed",
    "zipcode_features:population",
    "zipcode_features:total_wages",
    "credit_history:credit_card_due",
    "credit_history:mortgage_due",
    "credit_history:student_loan_due",
    "credit_history:vehicle_loan_due",
    "credit_history:hard_pulls",
    "credit_history:missed_payments_2y",
    "credit_history:missed_payments_1y",
    "credit_history:missed_payments_6m",
    "credit_history:bankruptcies",
]

loan_w_offline_feature = fs.get_historical_features(
    entity_df=loan_df, features=feast_features
).to_df()

# Drop some unnecessary columns for simplicity
loan_w_offline_feature = loan_w_offline_feature.drop(
    ["event_timestamp", "created_timestamp__", "loan_id", "zipcode", "dob_ssn"], axis=1
)

Now let's take a look at the training data as it is augmented by Feast.

display(loan_w_offline_feature)
person_age person_income person_home_ownership person_emp_length loan_intent loan_amnt loan_int_rate loan_status city state ... total_wages credit_card_due mortgage_due student_loan_due vehicle_loan_due hard_pulls missed_payments_2y missed_payments_1y missed_payments_6m bankruptcies
1358886 55 24543 RENT 3.0 VENTURE 4000 13.92 0 SLIDELL LA ... 315061217 1777 690650 46372 10439 5 1 2 1 0
1358815 58 20000 RENT 0.0 EDUCATION 4000 9.99 0 CHOUTEAU OK ... 59412230 1791 462670 19421 3583 8 7 1 0 2
1353348 64 24000 RENT 1.0 MEDICAL 3000 6.99 0 BISMARCK ND ... 469621263 5917 1780959 11835 27910 8 3 2 1 0
1354200 55 34000 RENT 0.0 DEBTCONSOLIDATION 12000 6.92 1 SANTA BARBARA CA ... 24537583 8091 364271 30248 22640 2 7 3 0 0
1354271 51 74628 MORTGAGE 3.0 PERSONAL 3000 13.49 0 HUNTINGTON BEACH CA ... 19749601 3679 1659968 37582 20284 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
674285 23 74000 RENT 3.0 MEDICAL 25000 10.36 1 MANSFIELD MO ... 33180988 5176 1089963 44642 2877 1 6 1 0 0
668250 21 200000 MORTGAGE 2.0 DEBTCONSOLIDATION 25000 13.99 0 SALISBURY MD ... 470634058 5297 1288915 22471 22630 0 5 2 1 0
668321 24 200000 MORTGAGE 3.0 VENTURE 24000 7.49 0 STRUNK KY ... 10067358 6549 22399 11806 13005 0 1 0 0 0
670025 23 215000 MORTGAGE 7.0 MEDICAL 35000 14.79 0 HAWTHORN PA ... 5956835 9079 876038 4556 21588 0 1 0 0 0
2034006 22 59000 RENT 123.0 PERSONAL 35000 16.02 1 FORT WORTH TX ... 142325465 8419 91803 22328 15078 0 1 0 0 0

28638 rows × 23 columns

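Point-in-time joins can leave missing values for entities that have no matching feature rows within the configured TTL, so a quick sanity check before training can be useful. A minimal sketch using pandas (not part of the original example):

# Count missing values per column after the Feast join; all zeros means every
# loan row found matching zipcode and credit-history features.
print(loan_w_offline_feature.isnull().sum().sort_values(ascending=False).head(10))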
# Convert into Train and Validation datasets.
import ray

loan_ds = ray.data.from_pandas(loan_w_offline_feature)
train_ds, validation_ds = loan_ds.split_proportionately([0.8])

2022-09-12 19:25:14,018 INFO worker.py:1508 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

Define Preprocessors

A Preprocessor performs last-mile processing on the Ray Dataset before it is fed into the training model.

categorical_features = [
    "person_home_ownership",
    "loan_intent",
    "city",
    "state",
    "location_type",
]

from ray.data.preprocessors import Chain, OrdinalEncoder, SimpleImputer

imputer = SimpleImputer(categorical_features, strategy="most_frequent")
encoder = OrdinalEncoder(columns=categorical_features)
chained_preprocessor = Chain(imputer, encoder)

Train XGBoost model using Ray AIR Trainer

Ray AIR provides a variety of Trainers that are integrated with popular machine learning frameworks. You can train a distributed model at scale leveraging Ray using the intuitive API trainer.fit(). The output is a Ray AIR Checkpoint, which seamlessly transfers the workload from training to prediction. Let's take a look!

LABEL = "loan_status"
CHECKPOINT_PATH = "checkpoint"
NUM_WORKERS = 1  # Change this based on the resources in the cluster.

from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

params = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
}

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        num_workers=NUM_WORKERS,
        use_gpu=0,
    ),
    label_column=LABEL,
    params=params,
    datasets={"train": train_ds, "validation": validation_ds},
    preprocessor=chained_preprocessor,
    num_boost_round=100,
)
checkpoint = trainer.fit().checkpoint

# This saves the checkpoint onto disk
checkpoint.to_directory(CHECKPOINT_PATH)

Tune Status

Current time: 2022-09-12 19:25:28
Running for: 00:00:09.09
Memory: 12.3/62.7 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/32.5 GiB heap, 0.0/16.25 GiB objects

Trial Status

Trial name status loc iter total time (s) train-logloss train-error validation-logloss
XGBoostTrainer_4f411_00000   TERMINATED   10.108.96.251:348845    101          7.67137       0.0578837     0.0127019             0.225994
(XGBoostTrainer pid=348845) /home/ray/.pyenv/versions/mambaforge/envs/ray/lib/python3.9/site-packages/xgboost_ray/main.py:431: UserWarning: `num_actors` in `ray_params` is smaller than 2 (1). XGBoost will NOT be distributed!
(XGBoostTrainer pid=348845)   warnings.warn(
(_RemoteRayXGBoostActor pid=348922) [19:25:23] task [xgboost.ray]:140319682474864 got new rank 0

Trial Progress

Trial name   date   done   episodes_total   experiment_id   experiment_tag   hostname   iterations_since_restore   node_ip   pid   time_since_restore   time_this_iter_s   time_total_s   timestamp   timesteps_since_restore   timesteps_total   train-error   train-logloss   training_iteration   trial_id   validation-error   validation-logloss   warmup_time
XGBoostTrainer_4f411_00000   2022-09-12_19-25-28   True      83cacc5068a84efc8998c269bc054088   0   corvus   101   10.108.96.251   348845   7.67137   1.01445   7.67137   1663035928   0      0.0127019   0.0578837   101   4f411_00000   0.0825768   0.225994   0.00293422
2022-09-12 19:25:28,422 INFO tune.py:762 -- Total run time: 9.86 seconds (9.09 seconds for the tuning loop).

'checkpoint'

Inference

Now, from the Checkpoint object obtained above, we can construct a Ray AIR Predictor that encapsulates everything needed for inference. The API for using the Predictor is also very intuitive - simply call Predictor.predict().

from ray.air.checkpoint import Checkpoint
from ray.train.xgboost import XGBoostPredictor

predictor = XGBoostPredictor.from_checkpoint(Checkpoint.from_directory(CHECKPOINT_PATH))

import numpy as np

## Now let's do some prediction.
loan_request_dict = {
    "zipcode": [76104],
    "dob_ssn": ["19630621_4278"],
    "person_age": [133],
    "person_income": [59000],
    "person_home_ownership": ["RENT"],
    "person_emp_length": [123.0],
    "loan_intent": ["PERSONAL"],
    "loan_amnt": [35000],
    "loan_int_rate": [16.02],
}

# Now augment the request with online features.
zipcode = loan_request_dict["zipcode"][0]
dob_ssn = loan_request_dict["dob_ssn"][0]
online_features = fs.get_online_features(
    entity_rows=[{"zipcode": zipcode, "dob_ssn": dob_ssn}],
    features=feast_features,
).to_dict()
loan_request_dict.update(online_features)

loan_request_df = pd.DataFrame.from_dict(loan_request_dict)
loan_request_df = loan_request_df.drop(["zipcode", "dob_ssn"], axis=1)
display(loan_request_df)
person_age person_income person_home_ownership person_emp_length loan_intent loan_amnt loan_int_rate location_type city population ... total_wages hard_pulls bankruptcies missed_payments_1y mortgage_due credit_card_due missed_payments_2y missed_payments_6m student_loan_due vehicle_loan_due
0 133 59000 RENT 123.0 PERSONAL 35000 16.02 None None None ... None None None None None None None None None None

1 rows × 22 columns

# Run through our predictor using the `Predictor.predict()` API.
loan_result = np.round(predictor.predict(loan_request_df)["predictions"][0])

if loan_result == 0:
    print("Loan approved!")
elif loan_result == 1:
    print("Loan rejected!")

Loan rejected!

AutoML for time series forecasting with Ray AIR

AutoML (Automatic Machine Learning) boils down to picking the best model for a given task and dataset. In this Ray Core example, we showed how to build an AutoML system that chooses the best statsforecast model and its corresponding hyperparameters for a time series regression task on the M5 dataset.

The basic steps were:

Define a set of autoregressive forecasting models to search over. For each model type, we also define a set of model parameters to search over.
Perform temporal cross-validation on each model configuration in parallel.
Pick the best performing model as the output of the AutoML system.

We see that these steps fit into the framework of a hyperparameter optimization problem that can be tackled with the Ray AIR Tuner!

In this notebook, we will show how to:

Create an AutoML system with Ray AIR for time series forecasting.
Leverage the higher-level Tuner API to define the model and hyperparameter search space, as well as parallelize cross-validation of different models.
Analyze results to identify the best-performing model type and model parameters for the time-series dataset.

Similar to the Ray Core example, we will be using only one partition of the M5 dataset for this example.

Setup

Let's first start by installing the statsforecast and ray[air] packages.

!pip install statsforecast
!pip install ray[air]

Next, we'll make the necessary imports, then initialize and connect to our Ray cluster!

import time
import itertools
import pandas as pd
import numpy as np

from collections import defaultdict

from statsforecast import StatsForecast
from statsforecast.models import ETS, AutoARIMA, _TS
from pyarrow import parquet as pq
from sklearn.metrics import mean_squared_error, mean_absolute_error

import ray
from ray import air, tune

if ray.is_initialized():
    ray.shutdown()
ray.init(runtime_env={"pip": ["statsforecast"]})

We may want to run on multiple nodes, and setting the runtime_env to include the statsforecast module will guarantee that we can access it on each worker, regardless of which node it lives on.

Read a partition of the M5 dataset from S3

We first obtain the data from an S3 bucket and preprocess it to the format that statsforecast expects. As the dataset is quite large, we use PyArrow's push-down predicate as a filter to obtain just the rows we care about without having to load them all into memory.

def get_m5_partition(unique_id: str) -> pd.DataFrame:
    ds1 = pq.read_table(
        "s3://anonymous@m5-benchmarks/data/train/target.parquet",
        filters=[("item_id", "=", unique_id)],
    )
    Y_df = ds1.to_pandas()
    # StatsForecasts expects specific column names!
    Y_df = Y_df.rename(
        columns={"item_id": "unique_id", "timestamp": "ds", "demand": "y"}
    )
    Y_df["unique_id"] = Y_df["unique_id"].astype(str)
    Y_df["ds"] = pd.to_datetime(Y_df["ds"])
    Y_df = Y_df.dropna()
    constant = 10
    Y_df["y"] += constant
    Y_df = Y_df[Y_df.unique_id == unique_id]
    return Y_df

train_df = get_m5_partition("FOODS_1_001_CA_1")
train_df
unique_id ds y
0 FOODS_1_001_CA_1 2011-01-29 13.0
1 FOODS_1_001_CA_1 2011-01-30 10.0
2 FOODS_1_001_CA_1 2011-01-31 10.0
3 FOODS_1_001_CA_1 2011-02-01 11.0
4 FOODS_1_001_CA_1 2011-02-02 14.0
... ... ... ...
1936 FOODS_1_001_CA_1 2016-05-18 10.0
1937 FOODS_1_001_CA_1 2016-05-19 11.0
1938 FOODS_1_001_CA_1 2016-05-20 10.0
1939 FOODS_1_001_CA_1 2016-05-21 10.0
1940 FOODS_1_001_CA_1 2016-05-22 10.0

1941 rows × 3 columns

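Before moving on to model selection, it can help to quickly eyeball the series. A minimal sketch, not part of the original example, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# Plot the daily demand for this M5 item (recall that a constant offset of 10 was added to y).
train_df.plot(x="ds", y="y", figsize=(10, 3), legend=False, title="FOODS_1_001_CA_1")
plt.show()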
Create a function that performs cross-validation

Next, we will define two methods below:

cross_validation performs temporal cross-validation on the dataset and reports the mean prediction error across cross-validation splits. See the visualizations in the analysis section below to see what the cross-validation splits look like and what we are averaging across. The n_splits and test_size parameters are used to configure the cross-validation splits, similar to TimeSeriesSplit from sklearn.

compute_metrics_and_aggregate is a helper function used in cross_validation that calculates the aggregated metrics using the dataframe output of StatsForecast.cross_validation. For example, we will calculate the mean squared error between the model predictions and the actual observed data, averaged over all training splits. This metric gets reported to Tune as mse_mean, which is the metric we will use to define the best-performing model.

We will run this cross-validation function on all the model types and model parameters we are searching over, and the model that produces the lowest error metric will be the output of this AutoML example.

Notice that model_cls_and_params is passed to the function via the config parameter. This is how Tune will set the corresponding model class and parameters for each trial.

from ray.air import Checkpoint, session

def cross_validation(config, Y_train_df=None):
    assert Y_train_df is not None, "Must pass in the dataset"

    # Get the model class
    model_cls, model_params = config.get("model_cls_and_params")
    freq = config.get("freq")
    metrics = config.get("metrics", {"mse": mean_squared_error})

    # CV params
    test_size = config.get("test_size", None)
    n_splits = config.get("n_splits", 5)

    model = model_cls(**model_params)

    # Default the parallelism to the # of cross-validation splits
    parallelism_kwargs = {"n_jobs": n_splits}

    # Initialize statsforecast with the model
    statsforecast = StatsForecast(
        df=Y_train_df,
        models=[model],
        freq=freq,
        **parallelism_kwargs,
    )

    # Perform temporal cross-validation (see `sklearn.TimeSeriesSplit`)
    test_size = test_size or len(Y_train_df) // (n_splits + 1)

    start_time = time.time()
    forecasts_cv = statsforecast.cross_validation(
        h=test_size,
        n_windows=n_splits,
        step_size=test_size,
    )
    cv_time = time.time() - start_time

    # Compute metrics (according to `metrics`)
    cv_results = compute_metrics_and_aggregate(forecasts_cv, model, metrics)

    # Report metrics and save cross-validation output DataFrame
    results = {
        **cv_results,
        "cv_time": cv_time,
    }
    checkpoint_dict = {
        "cross_validation_df": forecasts_cv,
    }
    checkpoint = Checkpoint.from_dict(checkpoint_dict)
    session.report(results, checkpoint=checkpoint)

def compute_metrics_and_aggregate(
    forecasts_df: pd.DataFrame, model: _TS, metrics: dict
):
    unique_ids = forecasts_df.index.unique()
    assert len(unique_ids) == 1, "This example only expects a single time series."
    cutoff_values = forecasts_df["cutoff"].unique()

    # Calculate metrics of the predictions of the models fit on
    # each training split
    cv_metrics = defaultdict(list)
    for ct in cutoff_values:
        # Get CV metrics for a specific training split
        # All forecasts made with the same `cutoff` date
        split_df = forecasts_df[forecasts_df["cutoff"] == ct]
        for metric_name, metric_fn in metrics.items():
            cv_metrics[metric_name].append(
                metric_fn(
                    split_df["y"], split_df[model.__class__.__name__]
                )
            )

    # Calculate aggregated metrics (mean, std) across training splits
    cv_aggregates = {}
    for metric_name, metric_vals in cv_metrics.items():
        try:
            cv_aggregates[f"{metric_name}_mean"] = np.nanmean(metric_vals)
            cv_aggregates[f"{metric_name}_std"] = np.nanstd(metric_vals)
        except Exception:
            cv_aggregates[f"{metric_name}_mean"] = np.nan
            cv_aggregates[f"{metric_name}_std"] = np.nan

    return {
        "unique_ids": list(unique_ids),
        **cv_aggregates,
        "cutoff_values": cutoff_values,
    }

Define the model search space

We want to search over the following set of models and their corresponding parameters:

search_space = {
    AutoARIMA: {},
    ETS: {
        "season_length": [6, 7],
        "model": ["ZNA", "ZZZ"],
    },
}

This translates to 5 possible (model_class, model_params) configurations, which we generate using the helper function below.

def generate_configurations(search_space):
    for model, params in search_space.items():
        if not params:
            yield model, {}
        else:
            configurations = itertools.product(*params.values())
            for config in configurations:
                config_dict = {k: v for k, v in zip(params.keys(), config)}
                yield model, config_dict

configs = list(generate_configurations(search_space))
configs

[(statsforecast.models.AutoARIMA, {}),
 (statsforecast.models.ETS, {'season_length': 6, 'model': 'ZNA'}),
 (statsforecast.models.ETS, {'season_length': 6, 'model': 'ZZZ'}),
 (statsforecast.models.ETS, {'season_length': 7, 'model': 'ZNA'}),
 (statsforecast.models.ETS, {'season_length': 7, 'model': 'ZZZ'})]

Create a Tuner to run a grid search over configurations

Now that we have defined the search space as well as the cross-validation function to apply to each configuration inside that search space, we can define our Ray AIR Tuner to launch the trials in parallel. Here's a summary of what we are doing in the code below:

First, we include the training dataset using tune.with_parameters, which puts the dataset into the Ray object store so that it can be retrieved as a common reference from every Tune trial.
Next, we define the Tuner param_space. We use Tune's tune.grid_search to create one trial for each (model_class, model_params) tuple that we want to try. The rest of the parameters are constants that will be passed into the config parameter along with model_cls_and_params.
Finally, we specify that we want to minimize the reported mse_mean metric.

We can launch the trials by using Tuner.fit, which returns a ResultGrid that we can use for analysis.

tuner = tune.Tuner(
    tune.with_parameters(cross_validation, Y_train_df=train_df),
    param_space={
        "model_cls_and_params": tune.grid_search(configs),
        "n_splits": 5,
        "test_size": 1,
        "freq": "D",
        "metrics": {"mse": mean_squared_error, "mae": mean_absolute_error},
    },
    tune_config=tune.TuneConfig(
        metric="mse_mean",
        mode="min",
    ),
)

result_grid = tuner.fit()

Great, we've computed cross-validation metrics for all the models! Let's get the result of this AutoML system by selecting the best-performing trial using ResultGrid.get_best_result!
best_result = result_grid.get_best_result()

We can take a look at the hyperparameter config of the best result:

best_result.config

{'model_cls_and_params': (statsforecast.models.ETS, {'season_length': 6, 'model': 'ZNA'}),
 'n_splits': 5,
 'test_size': 1,
 'freq': 'D',
 'metrics': {'mse': , 'mae': }}

Within this config, we can pull out the model type and parameters that resulted in the lowest forecast error!

best_model_cls, best_model_params = best_result.config["model_cls_and_params"]
print("Best model type:", best_model_cls)
print("Best model params:", best_model_params)

Best model type:
Best model params: {'season_length': 6, 'model': 'ZNA'}

We can also inspect the reported metrics:

print("Best mse_mean:", best_result.metrics["mse_mean"])
print("Best mae_mean:", best_result.metrics["mae_mean"])

Best mse_mean: 0.64205205
Best mae_mean: 0.7200615

Analysis

Finally, let's wrap up this AutoML example by performing some basic analysis and plotting.

Visualize Temporal Cross-validation Splits

Let's first take a look at how cross-validation is being performed. This plot shows how our parameters of n_splits=5 and test_size=1 are being used to generate the cross-validation splits. Only the last 50 points in the dataset are shown for visualization purposes.

For each of the 5 splits, the blue ticks represent the data used to train the model. The orange tick is the index that the model is trying to predict, and it's just a single point due to setting test_size=1. The metrics are calculated by comparing the predicted value to the actual data at the orange data point. The grey points represent data that is not considered for the split.

cutoff_values_for_cv = best_result.metrics["cutoff_values"]
test_size = best_result.config.get("test_size")
mse_per_split = best_result.metrics["mse_mean"]
cutoff_idxs = [np.where(train_df["ds"] == ct)[0][0] for ct in cutoff_values_for_cv]

colors = np.array(["blue", "orange", "grey"])

import matplotlib.pyplot as plt

show_last_n = 50

plt.figure(figsize=(8, 3))
for i, cutoff_idx in enumerate(cutoff_idxs):
    dataset_idxs = np.arange(len(train_df))[-show_last_n:]
    color_idxs = np.zeros_like(dataset_idxs)
    color_idxs[dataset_idxs > cutoff_idx] = 1
    color_idxs[dataset_idxs > cutoff_idx + test_size] = 2
    plt.scatter(
        x=dataset_idxs,
        y=np.ones_like(dataset_idxs) * i,
        c=colors[color_idxs],
        marker="_",
        lw=8,
    )
plt.title(
    f"Showing last {show_last_n} training samples of the {len(cutoff_idxs)} splits\n"
    "Blue=Training, Orange=Test, Grey=Unused"
)
plt.show()

Visualize model forecasts

Earlier, we saved the cross-validation output DataFrame inside a Ray AIR Checkpoint. We can use this to visualize some predictions of the best model! The predictions are pulled from the cross-validation results, where each step is predicted with horizon=1. Again, we only show the last 50 timesteps for visualization purposes.
def plot_model_predictions(result, train_df): model_cls, model_params = result.config["model_cls_and_params"] # Get the predictions from the data stored within this result's checkpoint checkpoint_dict = result.checkpoint.to_dict() forecast_df = checkpoint_dict["cross_validation_df"] # Only show the last 50 timesteps of the ground truth data max_points_to_show = 50 plt.figure(figsize=(10, 4)) plt.plot( train_df["ds"][-max_points_to_show:], train_df["y"][-max_points_to_show:], label="Ground Truth" ) plt.plot( forecast_df["ds"], forecast_df[model_cls.__name__], label="Forecast Predictions" ) plt.title( f"{model_cls.__name__}({model_params}), " f"mse_mean={result.metrics['mse_mean']:.4f}\n" f"Showing last {max_points_to_show} points" ) plt.legend() plt.show() plot_model_predictions(best_result, train_df) We can also visualize the predictions of the other models. # Plot for all results for result in result_grid: plot_model_predictions(result, train_df) Batch training & tuning on Ray Tune Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting simple models on data batches corresponding to different locations, products, etc. Batch training can process all of the data in less time, but only if those batches can run in parallel! This notebook showcases how to conduct batch regression with algorithms from XGBoost and Scikit-learn with Ray Tune. XGBoost is a popular open-source library used for regression and classification. Scikit-learn is a popular open-source library with a vast assortment of well-known ML algorithms. Batch training diagram For the data, we will use the NYC Taxi dataset. This popular tabular dataset contains historical taxi pickups by timestamp and location in NYC. For the training, we will train separate regression models to predict trip_duration, with a different model for each dropoff location in NYC. Specifically, we will conduct an experiment for each dropoff_location_id, to find the best model, either XGBoost or Scikit-learn, for each location. Contents In this tutorial, you will learn how to: Define how to load and prepare Parquet data Define a Trainable (callable) function Run batch training and inference with Ray Tune Load a model from checkpoint and perform batch prediction Walkthrough Prerequisite for this notebook: Read the Key Concepts page for Ray Tune. First, let’s make sure we have all Python packages we need installed. !pip install -q "ray[air]" scikit-learn Next, let’s import a few required libraries, including open-source Ray itself!
import os print(f"Number of CPUs in this system: {os.cpu_count()}") from typing import Tuple, List, Union, Optional, Callable import time import pandas as pd import numpy as np print(f"numpy: {np.__version__}") import pyarrow import pyarrow.parquet as pq import pyarrow.dataset as pds print(f"pyarrow: {pyarrow.__version__}") Number of CPUs in this system: 8 numpy: 1.21.6 pyarrow: 10.0.0 import ray if ray.is_initialized(): ray.shutdown() ray.init() print(ray.cluster_resources()) {'memory': 451212691046.0, 'object_store_memory': 175243542524.0, 'node:172.31.206.67': 1.0, 'CPU': 152.0, 'node:172.31.138.114': 1.0, 'node:172.31.221.253': 1.0, 'node:172.31.144.75': 1.0, 'node:172.31.169.100': 1.0, 'node:172.31.136.199': 1.0, 'node:172.31.251.87': 1.0, 'node:172.31.249.240': 1.0, 'node:172.31.252.125': 1.0, 'node:172.31.211.165': 1.0} # import standard sklearn libraries import sklearn from sklearn.base import BaseEstimator from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.tree import DecisionTreeRegressor from sklearn.metrics import mean_absolute_error print(f"sklearn: {sklearn.__version__}") import xgboost as xgb print(f"xgboost: {xgb.__version__}") # import ray libraries from ray import air, tune from ray.air import session from ray.air.checkpoint import Checkpoint # set global random seed for sklearn models np.random.seed(415) sklearn: 1.2.0 xgboost: 1.3.3 /home/ray/anaconda3/lib/python3.8/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. from pandas import MultiIndex, Int64Index # For benchmarking purposes, we can print the times of various operations. # In order to reduce clutter in the output, this is set to False by default. PRINT_TIMES = False def print_time(msg: str): if PRINT_TIMES: print(msg) # To speed things up, we’ll only use a small subset of the full dataset consisting of two last months of 2019. # You can choose to use the full dataset for 2018-2019 by setting the SMOKE_TEST variable to False. SMOKE_TEST = True Define how to load and prepare Parquet data First, we need to load some data. Since the NYC Taxi dataset is fairly large, we will filter files first into a PyArrow dataset. And then in the next cell after, we will filter the data on read into a PyArrow table and convert that to a pandas dataframe. Use PyArrow dataset and table for reading or writing large parquet files, since its native multithreaded C++ adapter is faster than pandas read_parquet, even using engine=pyarrow. # Define some global variables. TARGET = "trip_duration" s3_partitions = pds.dataset( "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/", partitioning=["year", "month"], ) s3_files = [f"s3://anonymous@{file}" for file in s3_partitions.files] # Obtain all location IDs all_location_ids = ( pq.read_table(s3_files[0], columns=["dropoff_location_id"])["dropoff_location_id"] .unique() .to_pylist() ) # drop [264, 265] all_location_ids.remove(264) all_location_ids.remove(265) # Use smoke testing or not. starting_idx = -1 if SMOKE_TEST else 0 # TODO: drop location 199 to test error-handling before final git checkin sample_locations = [141, 229, 173] if SMOKE_TEST else all_location_ids # Display what data will be used. s3_files = s3_files[starting_idx:] print(f"NYC Taxi using {len(s3_files)} file(s)!") print(f"s3_files: {s3_files}") print(f"Locations: {sample_locations}") NYC Taxi using 1 file(s)! 
s3_files: ['s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/06/data.parquet/ab5b9d2b8cc94be19346e260b543ec35_000000.parquet'] Locations: [141, 229, 173] ############ # STEP 1. Define Python functions to # a) read and prepare a segment of data. ############ # Function to read a pyarrow.Table object using pyarrow parquet def read_data(file: str, sample_id: np.int32) -> pd.DataFrame: df = pq.read_table( file, filters=[ ("passenger_count", ">", 0), ("trip_distance", ">", 0), ("fare_amount", ">", 0), ("pickup_location_id", "not in", [264, 265]), ("dropoff_location_id", "not in", [264, 265]), ("dropoff_location_id", "=", sample_id), ], columns=[ "pickup_at", "dropoff_at", "pickup_location_id", "dropoff_location_id", "passenger_count", "trip_distance", "fare_amount", ], ).to_pandas() return df # Function to transform a pandas dataframe def transform_df(input_df: pd.DataFrame) -> pd.DataFrame: df = input_df.copy() # calculate trip_duration df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds # filter trip_durations > 1 minute and less than 24 hours df = df[df["trip_duration"] > 60] df = df[df["trip_duration"] < 24 * 60 * 60] # keep only necessary columns df = df[ ["dropoff_location_id", "passenger_count", "trip_distance", "trip_duration"] ].copy() df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1) return df Define a Trainable (callable) function Next, we define a trainable function, called train_model(), in order to train and evaluate a model on a data partition. This function will be called in parallel for every permutation in the Tune search space! Inside this trainable function: 📖 The input must include a config argument. 📈 Inside the function, the tuning metric (a model’s loss or error) must be calculated and reported using session.report(). ✔️ Optionally checkpoint (save) the model for fault tolerance and easy deployment later. Ray Tune has two ways of defining a trainable, namely the Function API and the Class API. Both are valid ways of defining a trainable, but the Function API is generally recommended. ############ # STEP 1. Define Python functions to # b) train and evaluate a model on a segment of data. ############ def train_model(config: dict) -> None: algorithm = config["algorithm"] sample_location_id = config["location"] # Load data. df_list = [read_data(f, sample_location_id) for f in s3_files] df_raw = pd.concat(df_list, ignore_index=True) # Transform data. df = transform_df(df_raw) # We need at least 10 rows to create a train / test split. if df.shape[0] < 10: print_time(f"Location {sample_location_id} has only {df.shape[0]} rows.") session.report(dict(error=None)) return None # Train/valid split. train_df, valid_df = train_test_split(df, test_size=0.2, shuffle=True) train_X = train_df[["passenger_count", "trip_distance"]] train_y = train_df[TARGET] valid_X = valid_df[["passenger_count", "trip_distance"]] valid_y = valid_df[TARGET] # Train model. model = algorithm.fit(train_X, train_y) pred_y = model.predict(valid_X) # Evaluate. error = sklearn.metrics.mean_absolute_error(valid_y, pred_y) # Define a model checkpoint using AIR API. # https://docs.ray.io/en/latest/tune/tutorials/tune-checkpoints.html checkpoint = ray.air.checkpoint.Checkpoint.from_dict( {"model": algorithm, "location_id": sample_location_id} ) # Save checkpoint and report back metrics, using ray.air.session.report() # The metrics you specify here will appear in Tune summary table. # They will also be recorded in Tune results under `metrics`. 
metrics = dict(error=error) session.report(metrics, checkpoint=checkpoint) Run batch training on Ray Tune Recall that, at a high level, we are training several different models per dropoff location. We are using Ray Tune so we can run all these trials in parallel on a Ray cluster. At the end, we will inspect the results of the experiment and deploy only the best model per dropoff location. Step 1. Define Python functions to read and prepare a segment of data and train and evaluate one or many models per segment of data. We already did this, above. Step 2. Scaling: Below, we use the default resources config, which is 1 CPU core for each task. For more information about configuring resource allocations, see A Guide To Parallelism and Resources. Step 3. Search Space: Below, we define our Tune search space, which consists of: Different algorithms: XGBoost Scikit-learn LinearRegression Some or all NYC taxi drop-off locations. Step 4. Search Algorithm or Strategy: Below, our Tune jobs will be defined using a search space and simple grid search. The typical use case for Tune search spaces is hyperparameter tuning. In our case, we are defining the Tune search space in order to run distributed tuning jobs automatically. Each training job will use a different data partition (taxi dropoff location), a different algorithm, and the compute resources we defined in the Scaling config. Step 5. Now we are ready to kick off a Ray Tune experiment! Define a tuner object. Put the training function train_model() inside the tuner object. Run the experiment using tuner.fit(). 💡 After you run the cell below, right-click on it and choose “Enable Scrolling for Outputs”! This will make it easier to view, since tuning output can be very long! Setting SMOKE_TEST=False, running on Anyscale: 518 models, using 18 NYC Taxi S3 files dating from 2018/01 to 2019/06 (split into partitions approx 1GiB each), simultaneously trained on a 10-node AWS cluster of m5.4xlarges. Total data reading and train time was 37 minutes. ############ # STEP 2. Customize distributed compute scaling. ############ # Use Ray AIR default resources config which is 1 CPU core for each task. ############ # STEP 3. Define a search space dict of all config parameters. ############ search_space = { "algorithm": tune.grid_search( [LinearRegression(fit_intercept=True), xgb.XGBRegressor(max_depth=4)] ), "location": tune.grid_search(sample_locations), } # Optional STEP 4. Specify the hyperparameter tuning search strategy. ############ # STEP 5. Run the experiment with Ray AIR APIs. # https://docs.ray.io/en/latest/tune/examples/tune-pytorch-lightning.html ############ start = time.time() # Define a tuner object. tuner = tune.Tuner( train_model, param_space=search_space, run_config=air.RunConfig( # redirect logs to relative path instead of default ~/ray_results/ storage_path="my_Tune_logs", name="batch_tuning", # Set Ray Tune verbosity. Print summary table only with levels 2 or 3. verbose=2, ), ) # Fit the tuner object. results = tuner.fit() total_time_taken = time.time() - start print(f"Total number of models: {len(results)}") print(f"TOTAL TIME TAKEN: {total_time_taken/60:.2f} minutes") # Total number of models: 6 # TOTAL TIME TAKEN: 0.37 minutes

Tune Status

Current time: 2023-01-10 16:26:11
Running for: 00:00:20.45
Memory: 3.0/30.9 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/152 CPUs, 0/0 GPUs, 0.0/420.22 GiB heap, 0.0/163.21 GiB objects

Trial Status

Trial name               status      loc                   algorithm             location  iter  total time (s)  error
train_model_7fd9c_00000  TERMINATED  172.31.211.165:3629   LinearRegression()    141       1     1.90341         500.005
train_model_7fd9c_00001  TERMINATED  172.31.252.125:17717  XGBRegressor(ba_9dc0  141       1     2.41094         523.611
train_model_7fd9c_00002  TERMINATED  172.31.251.87:4579    LinearRegression()    229       1     1.86279         568.826
train_model_7fd9c_00003  TERMINATED  172.31.138.114:11079  XGBRegressor(ba_0040  229       1     2.53176         583.261
train_model_7fd9c_00004  TERMINATED  172.31.221.253:3999   LinearRegression()    173       1     1.8416          950.346
train_model_7fd9c_00005  TERMINATED  172.31.136.199:12355  XGBRegressor(ba_0160  173       1     2.02936         2046.04

Trial Progress

Trial name               error    should_checkpoint
train_model_7fd9c_00000  500.005  True
train_model_7fd9c_00001  523.611  True
train_model_7fd9c_00002  568.826  True
train_model_7fd9c_00003  583.261  True
train_model_7fd9c_00004  950.346  True
train_model_7fd9c_00005  2046.04  True
2023-01-10 16:26:11,740 INFO tune.py:762 -- Total run time: 22.07 seconds (20.27 seconds for the tuning loop). Total number of models: 6 TOTAL TIME TAKEN: 0.37 minutes
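As noted in Step 2, each trial above ran with Tune's default allocation of one CPU core. If an individual model needed more compute, the trainable could be wrapped with tune.with_resources before being handed to the Tuner. The snippet below is only an illustrative sketch; the 4-CPU figure is an assumption, not something this run uses.

# Hypothetical variant: request 4 CPUs per trial instead of the default 1.
from ray import tune

train_model_4cpu = tune.with_resources(train_model, {"cpu": 4})

tuner_4cpu = tune.Tuner(
    train_model_4cpu,
    param_space=search_space,
)
# tuner_4cpu.fit() would then run at most (cluster CPUs // 4) trials concurrently.

With fewer concurrent trials, total wall-clock time can go up, so only raise per-trial resources when an individual model actually needs them.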
After the Tune experiment has finished, select the best model per dropoff location. We can assemble the Tune results into a pandas dataframe, then sort by minimum error, to select the best model per dropoff location. # get a list of training loss errors errors = [i.metrics.get("error", 10000.0) for i in results] # get a list of checkpoints checkpoints = [i.checkpoint for i in results] # get a list of locations locations = [i.config["location"] for i in results] # get a list of model params algorithms = [i.config["algorithm"] for i in results] # Assemble a pandas dataframe from Tune results results_df = pd.DataFrame( zip(locations, errors, algorithms, checkpoints), columns=["location_id", "error", "algorithm", "checkpoint"], ) results_df.head(8)
location_id error algorithm checkpoint
0 141 500.005318 LinearRegression() Checkpoint(local_path=/home/ray/christy-air/fo...
1 141 523.610705 XGBRegressor(base_score=0.5, booster='gbtree',... Checkpoint(local_path=/home/ray/christy-air/fo...
2 229 568.826123 LinearRegression() Checkpoint(local_path=/home/ray/christy-air/fo...
3 229 583.261077 XGBRegressor(base_score=0.5, booster='gbtree',... Checkpoint(local_path=/home/ray/christy-air/fo...
4 173 950.345817 LinearRegression() Checkpoint(local_path=/home/ray/christy-air/fo...
5 173 2046.043927 XGBRegressor(base_score=0.5, booster='gbtree',... Checkpoint(local_path=/home/ray/christy-air/fo...
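If you prefer not to assemble this dataframe by hand, the ResultGrid can also produce one directly. The sketch below assumes a Ray version where ResultGrid.get_dataframe() is available and that it flattens the trial config into config/... columns; note that checkpoints are not part of this table, which is why the manual assembly above is still used to carry the checkpoint column.

# Alternative view of the same results, built by Tune itself.
tune_df = results.get_dataframe()
tune_df[["config/location", "config/algorithm", "error"]].sort_values("error").head()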
# Keep only 1 model per location_id with minimum error final_df = results_df.copy() final_df = final_df.loc[(final_df.error > 0), :] final_df = final_df.loc[final_df.groupby("location_id")["error"].idxmin()] final_df.sort_values(by=["error"], inplace=True) final_df.set_index("location_id", inplace=True, drop=True) final_df
error algorithm checkpoint
location_id
141 500.005318 LinearRegression() Checkpoint(local_path=/home/ray/christy-air/fo...
229 568.826123 LinearRegression() Checkpoint(local_path=/home/ray/christy-air/fo...
173 950.345817 LinearRegression() Checkpoint(local_path=/home/ray/christy-air/fo...
final_df[["algorithm"]].astype("str").value_counts(normalize=True) # 0.67 XGB # 0.33 Linear Regression algorithm LinearRegression() 1.0 dtype: float64 Load a model from checkpoint and perform batch prediction Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference. Finally, we will restore the best and worst models from checkpoint and make predictions. We will easily obtain AIR Checkpoint objects from the Tune results. We will restore a regression model directly from checkpoint, and demonstrate it can be used for prediction. # Choose a dropoff location sample_location_id = final_df.index[0] sample_location_id 141 # Get the algorithm used sample_algorithm = final_df.loc[[sample_location_id]].algorithm.values[0] print(f"algorithm type:: {type(sample_algorithm)}") # Get a checkpoint directly from the pandas dataframe of Tune results checkpoint = final_df.checkpoint[sample_location_id] print(f"checkpoint type:: {type(checkpoint)}") # Restore a model from checkpoint sample_model = checkpoint.to_dict()["model"] algorithm type:: checkpoint type:: # Create some test data df_list = [read_data(f, sample_location_id) for f in s3_files[:1]] df_raw = pd.concat(df_list, ignore_index=True) df = transform_df(df_raw) _, test_df = train_test_split(df, test_size=0.2, shuffle=True) test_X = test_df[["passenger_count", "trip_distance"]] test_y = np.array(test_df.trip_duration) # actual values # Perform batch prediction using restored model from checkpoint pred_y = sample_model.predict(test_X) # Zip together predictions and actuals to visualize pd.DataFrame(zip(pred_y, test_y), columns=["pred_y", TARGET])[0:10]
pred_y trip_duration
0 1153.574219 1174
1 870.131592 299
2 1065.683105 1206
3 591.070801 566
4 766.853149 630
5 1037.557861 852
6 1540.295410 1596
7 827.835510 801
8 1871.982422 1363
9 960.105408 715
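The single-location check above can also be fanned out across every location with plain Ray tasks. The following is only a sketch built from the helpers already defined in this notebook (read_data, transform_df, final_df); it recomputes the MAE of each location's best model on that location's data as a quick sanity check, and the function name score_location is ours, not part of any Ray API.

@ray.remote
def score_location(location_id, checkpoint, files):
    # Rebuild this location's data and score the restored model on it.
    model = checkpoint.to_dict()["model"]
    df = transform_df(
        pd.concat([read_data(f, location_id) for f in files], ignore_index=True)
    )
    features = df[["passenger_count", "trip_distance"]]
    mae = sklearn.metrics.mean_absolute_error(df[TARGET], model.predict(features))
    return location_id, mae

score_refs = [
    score_location.remote(loc, final_df.checkpoint[loc], s3_files)
    for loc in final_df.index
]
print(dict(ray.get(score_refs)))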
Compare validation and test error. During model training we reported error on “validation” data (random sample). Below, we will report error on a pretend “test” data set (a different random sample). Do a quick validation that both errors are reasonably close together. # Evaluate restored model on test data. error = sklearn.metrics.mean_absolute_error(test_y, pred_y) print(f"Test error: {error}") Test error: 513.4911755733472 # Compare test error with training validation error print(f"Validation error: {final_df.error[sample_location_id]}") # Validation and test errors should be reasonably close together. Validation error: 500.0053176600036 Parallel demand forecasting at scale using Ray Tune Batch training and tuning are common tasks in machine learning use-cases. They require training simple models on data batches, typically corresponding to different locations, products, etc. Batch training can process all of the data in less time, but only if those batches can run in parallel! This notebook showcases how to conduct batch forecasting with Prophet and ARIMA. Prophet is a popular open-source library developed by Facebook and designed for automatic forecasting of univariate time series data. ARIMA is an older, well-known algorithm for forecasting univariate time series at less fine-grained detail than Prophet. Batch training diagram For the data, we will use the NYC Taxi dataset. This popular tabular dataset contains historical taxi pickups by timestamp and location in NYC. For the training, we will train a separate forecasting model to predict the number of pickups at each location in NYC at a daily level for the next 28 days. Specifically, we will use the pickup_location_id column in the dataset to group the dataset into data batches. Then we will conduct an experiment for each location, to find the best model, either Prophet or ARIMA, for each location. Contents In this tutorial, you will learn how to: Define how to load and prepare Parquet data Define a Trainable (callable) function Run batch training and inference with Ray Tune Load a model from checkpoint Create a forecast from model restored from checkpoint Walkthrough Prerequisite for this notebook: Read the Key Concepts page for Ray Tune. First, let’s make sure we have all Python packages we need installed. !pip install -q "ray[air]" scikit-learn prophet statsmodels statsforecast Next, let’s import a few required libraries, including open-source Ray itself! import os num_cpu = os.cpu_count() print(f"Number of CPUs in this system: {num_cpu}") from typing import Tuple, List, Union, Optional, Callable from datetime import datetime, timedelta import time import pandas as pd import numpy as np print(f"numpy: {np.__version__}") import matplotlib.pyplot as plt %matplotlib inline import scipy print(f"scipy: {scipy.__version__}") import pyarrow import pyarrow.parquet as pq import pyarrow.dataset as pds print(f"pyarrow: {pyarrow.__version__}") Number of CPUs in this system: 64 numpy: 1.24.3 scipy: 1.9.1 pyarrow: 11.0.0 import ray if ray.is_initialized(): ray.shutdown() ray.init() 2023-05-17 14:24:46,542 INFO worker.py:1380 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS find: ‘.git’: No such file or directory 2023-05-17 14:24:47,257 INFO worker.py:1498 -- Connecting to existing Ray cluster at address: 172.31.213.59:9031... 2023-05-17 14:24:47,273 INFO worker.py:1673 -- Connected to Ray cluster.
View the dashboard at https://console.anyscale.com/api/v2/sessions/ses_jgkdnu2723aleytwqqhebr12vs/services?redirect_to=dashboard 2023-05-17 14:24:47,296 INFO packaging.py:347 -- Pushing file package 'gcs://_ray_pkg_e219f8b9b77b196e3d63ced7d9917421.zip' (5.45MiB) to Ray cluster... 2023-05-17 14:24:47,314 INFO packaging.py:360 -- Successfully pushed file package 'gcs://_ray_pkg_e219f8b9b77b196e3d63ced7d9917421.zip'. print(ray.cluster_resources()) {'memory': 319463062119.0, 'object_store_memory': 141198455193.0, 'CPU': 64.0, 'node:172.31.213.59': 1.0, 'accelerator_type:V100': 1.0, 'GPU': 8.0} # Import forecasting libraries. import prophet from prophet import Prophet print(f"prophet: {prophet.__version__}") import statsforecast from statsforecast import StatsForecast from statsforecast.models import AutoARIMA print(f"statsforecast: {statsforecast.__version__}") # Import ray libraries. from ray import air, tune, serve from ray.air import session, ScalingConfig from ray.air.checkpoint import Checkpoint RAY_IGNORE_UNHANDLED_ERRORS = 1 prophet: 1.1.3 statsforecast: 1.5.0 # For benchmarking purposes, we can print the times of various operations. # In order to reduce clutter in the output, this is set to False by default. PRINT_TIMES = False def print_time(msg: str): if PRINT_TIMES: print(msg) # To speed things up, we’ll only use a small subset of the full dataset consisting of two last months of 2019. # You can choose to use the full dataset for 2018-2019 by setting the SMOKE_TEST variable to False. SMOKE_TEST = True Define how to load and prepare Parquet data First, we need to load some data. Since the NYC Taxi dataset is fairly large, we will filter files first into a PyArrow dataset. And then in the next cell after, we will filter the data on read into a PyArrow table and convert that to a pandas dataframe. Use PyArrow dataset and table for reading or writing large parquet files, since its native multithreaded C++ adapter is faster than pandas read_parquet, even using engine=pyarrow. # Define some global variables. TARGET = "y" FORECAST_LENGTH = 28 MAX_DATE = datetime(2019, 6, 30) s3_partitions = pds.dataset( "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/", partitioning=["year", "month"], ) s3_files = [f"s3://anonymous@{file}" for file in s3_partitions.files] # Obtain all location IDs all_location_ids = ( pq.read_table(s3_files[0], columns=["pickup_location_id"])["pickup_location_id"] .unique() .to_pylist() ) # drop [264, 265, 199] all_location_ids.remove(264) all_location_ids.remove(265) all_location_ids.remove(199) # Use smoke testing or not. starting_idx = -2 if SMOKE_TEST else 0 # TODO: drop location 199 to test error-handling before final git checkin sample_locations = [141, 229, 173] if SMOKE_TEST else all_location_ids # Display what data will be used. s3_files = s3_files[starting_idx:] print(f"NYC Taxi using {len(s3_files)} file(s)!") print(f"s3_files: {s3_files}") print(f"Locations: {sample_locations}") NYC Taxi using 2 file(s)! s3_files: ['s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 's3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/06/data.parquet/ab5b9d2b8cc94be19346e260b543ec35_000000.parquet'] Locations: [141, 229, 173] ############ # STEP 1. 
Define Python functions to # a) read and prepare a segment of data, and ############ # Function to read a pyarrow.Table object using pyarrow parquet def read_data(file: str, sample_id: np.int32) -> pd.DataFrame: # parse out min expected date part_zero = "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/" split_text = file.split(part_zero)[1] min_year = split_text.split("/")[0] min_month = split_text.split("/")[1] string_date = min_year + "-" + min_month + "-" + "01" + " 00:00:00" min_date = datetime.strptime(string_date, "%Y-%m-%d %H:%M:%S") df = pq.read_table( file, filters=[ ("pickup_at", ">", min_date), ("pickup_at", "<=", MAX_DATE), ("passenger_count", ">", 0), ("trip_distance", ">", 0), ("fare_amount", ">", 0), ("pickup_location_id", "not in", [264, 265]), ("dropoff_location_id", "not in", [264, 265]), ("pickup_location_id", "=", sample_id), ], columns=[ "pickup_at", "dropoff_at", "pickup_location_id", "dropoff_location_id", "passenger_count", "trip_distance", "fare_amount", ], ).to_pandas() return df # Function to transform a pandas dataframe def transform_df(input_df: pd.DataFrame) -> pd.DataFrame: df = input_df.copy() # calculate trip_duration df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds # filter trip_durations > 1 minute and less than 24 hours df = df[df["trip_duration"] > 60] df = df[df["trip_duration"] < 24 * 60 * 60] # Prophet requires the timestamp column to be named 'ds' and the target column 'y' # Prophet requires at least 2 data points per timestamp # StatsForecast requires the series id column to be named 'unique_id' # add year_month_day and concat into a unique column to use as groupby key df["ds"] = df["pickup_at"].dt.to_period("D").dt.to_timestamp() df["loc_year_month_day"] = ( df["pickup_location_id"].astype(str) + "_" + df["pickup_at"].dt.year.astype(str) + "_" + df["pickup_at"].dt.month.astype(str) + "_" + df["pickup_at"].dt.day.astype(str) ) # add target_value quantity for groupby count later df["y"] = 1 # rename pickup_location_id to unique_id df.rename(columns={"pickup_location_id": "unique_id"}, inplace=True) # keep only necessary columns df = df[["loc_year_month_day", "unique_id", "ds", "y"]].copy() # groupby aggregate g = df.groupby("loc_year_month_day").agg({"unique_id": min, "ds": min, "y": sum}) # keep only days with more than 2 pickups g.dropna(inplace=True) g = g[g["y"] > 2].copy() # Drop groupby variable since we do not need it anymore g.reset_index(inplace=True) g.drop(["loc_year_month_day"], axis=1, inplace=True) return g def prepare_data(sample_location_id: np.int32) -> pd.DataFrame: # Load data. df_list = [read_data(f, sample_location_id) for f in s3_files] df_raw = pd.concat(df_list, ignore_index=True) # Abort Tune to avoid Tune Error if df has too few rows if df_raw.shape[0] < FORECAST_LENGTH: print_time(f"Location {sample_location_id} has only {df_raw.shape[0]} rows") session.report(dict(error=None)) return None # Transform data. df = transform_df(df_raw) # Abort Tune to avoid Tune Error if df has too few rows if df.shape[0] < FORECAST_LENGTH: print_time(f"Location {sample_location_id} has only {df.shape[0]} rows") session.report(dict(error=None)) return None else: df.sort_values(by="ds", inplace=True) return df Define a Trainable (callable) function Next, we define a trainable function, called train_model(), in order to train and evaluate a model on a data partition. This function will be called in parallel for every permutation in the Tune search space! Inside this trainable function: 📖 The input must include a config argument.
📈 Inside the function, the tuning metric (a model’s loss or error) must be calculated and reported using session.report(). ✔️ Optionally checkpoint (save) the model for fault tolerance and easy deployment later. Ray Tune has two ways of defining a trainable, namely the Function API and the Class API. Both are valid ways of defining a trainable, but the Function API is generally recommended. ############ # STEP 1. Define Python functions to # b) train and evaluate a model on a segment of data. ############ def evaluate_model_prophet( model: "prophet.forecaster.Prophet", ) -> Tuple[float, pd.DataFrame]: # Forecast FORECAST_LENGTH days into the future. future_dates = model.make_future_dataframe(periods=FORECAST_LENGTH, freq="D") future = model.predict(future_dates) # Calculate mean absolute forecast error. temp = future.copy() temp["forecast_error"] = np.abs(temp["yhat"] - temp["trend"]) error = np.mean(temp["forecast_error"]) return error, future def evaluate_model_statsforecast( model: "statsforecast.models.AutoARIMA", test_df: pd.DataFrame ) -> Tuple[float, pd.DataFrame]: # Forecast over the held-out test data. forecast = model.forecast(FORECAST_LENGTH + 1).reset_index() forecast.set_index(["ds"], inplace=True) test_df.set_index("ds", inplace=True) future = pd.concat([test_df, forecast[["AutoARIMA"]]], axis=1) future.dropna(inplace=True) future.columns = ["unique_id", "trend", "yhat"] # Calculate mean absolute forecast error. temp = future.copy() temp["forecast_error"] = np.abs(temp["yhat"] - temp["trend"]) error = np.mean(temp["forecast_error"]) return error, future # 2. Define a custom train function def train_model(config: dict) -> None: # Get Tune parameters sample_location_id = config["params"]["location"] model_type = config["params"]["algorithm"] # Define Prophet model with 75% confidence interval if model_type == "prophet_additive": model = Prophet(interval_width=0.75, seasonality_mode="additive") elif model_type == "prophet_multiplicative": model = Prophet(interval_width=0.75, seasonality_mode="multiplicative") # Define ARIMA model with daily frequency which implies seasonality = 7 elif model_type == "arima": model = [AutoARIMA(season_length=7, approximation=True)] # Read and transform data. df = prepare_data(sample_location_id) # prepare_data already reported to Tune if there was too little data. if df is None: return None # Train model. if model_type == "arima": try: # split data into train, test. train_end = df.ds.max() - timedelta(days=FORECAST_LENGTH + 1) train_df = df.loc[(df.ds <= train_end), :].copy() test_df = df.iloc[-FORECAST_LENGTH:, :].copy() # fit AutoARIMA. model = StatsForecast(df=train_df, models=model, freq="D") # Inference model and evaluate error. error, future = evaluate_model_statsforecast(model, test_df) except Exception as e: print(f"ARIMA error processing location {sample_location_id}: {e}") session.report(dict(error=None)) return None else: # model type is Prophet try: # fit Prophet. model = model.fit(df[["ds", "y"]]) # Inference model and evaluate error. error, future = evaluate_model_prophet(model) except Exception as e: print(f"Prophet error processing location {sample_location_id}: {e}") session.report(dict(error=None)) return None # Define a model checkpoint using AIR API. # https://docs.ray.io/en/latest/tune/tutorials/tune-checkpoints.html checkpoint = ray.air.checkpoint.Checkpoint.from_dict( { "model": model, "forecast_df": future, "location_id": sample_location_id, } ) # Save checkpoint and report back metrics, using ray.air.session.report() # The metrics you specify here will appear in Tune summary table. # They will also be recorded in Tune results under `metrics`.
metrics = dict(error=error) session.report(metrics, checkpoint=checkpoint) Run batch training on Ray Tune Recall what we are doing, high level, is training several different models per pickup location. We are using Ray Tune so we can run all these trials in parallel on a Ray cluster. At the end, we will inspect the results of the experiment and deploy only the best model per pickup location. Step 1. Define Python functions to read and prepare a segment of data and train and evaluate one or many models per segment of data. We already did this, above. Step 2. Scaling: Below, we specify training resources in a ray.air.ScalingConfig object inside the Tune search space. For more information about configuring resource allocations, see A Guide To Parallelism and Resources. Step 3. Search Space: Below, we define our Tune search space, which consists of: Different algorithms, either: Prophet with multiplicative or additive seasonal effects AutoARIMA. NYC taxi pick-up locations. Scaling config Step 4. Search Algorithm or Strategy: Below, our Tune jobs will be defined using a search space and simple grid search. The typical use case for Tune search spaces is for hyperparameter tuning. In our case, we are defining the Tune search space in order to run distributed tuning jobs automatically. Each training job will use a different data partition (taxi pickup location), different algorithm, and the compute resources we defined in the Scaling config. Step 5. Now we are ready to kick off a Ray Tune experiment! Define a tuner object. Put the training function train_model() inside the tuner object. Run the experiment using tuner.fit(). 💡 After you run the cell below, right-click on it and choose “Enable Scrolling for Outputs”! This will make it easier to view, since tuning output can be very long! Setting SMOKE_TEST=False, running on Anyscale: 771 models, using 18 NYC Taxi S3 files dating from 2018/01 to 2019/06 (split into partitions approx 1GiB each), were simultaneously trained on a 7-node AWS cluster of m5.4xlarges, within 40 minutes. ############ # STEP 2. Customize distributed compute scaling. ############ num_training_workers = min(num_cpu - 2, 32) scaling_config = ScalingConfig( # Number of distributed workers. num_workers=num_training_workers, # Turn on/off GPU. use_gpu=False, # Specify resources used for trainer. trainer_resources={"CPU": 1}, # Try to schedule workers on different nodes. placement_strategy="SPREAD", ) ############ # STEP 3. Define a search space dict of all config parameters. ############ SEARCH_SPACE = { "scaling_config": scaling_config, "params": { "algorithm": tune.grid_search( ["prophet_additive", "prophet_multiplicative", "arima"] ), "location": tune.grid_search(sample_locations), }, } # Optional STEP 4. Specify the hyperparameter tuning search strategy. ############ # STEP 5. Run the experiment with Ray AIR APIs. # https://docs.ray.io/en/latest/ray-air/examples/huggingface_text_classification.html ############ start = time.time() # Define a tuner object. tuner = tune.Tuner( train_model, param_space=SEARCH_SPACE, tune_config=tune.TuneConfig( metric="error", mode="min", ), run_config=air.RunConfig( # Redirect logs to relative path instead of default ~/ray_results/. storage_path="my_Tune_logs", # Specify name to make logs easier to find in log path. name="ptf_nyc", ), ) # Fit the tuner object. 
results = tuner.fit() total_time_taken = time.time() - start print(f"Total number of models: {len(results)}") print(f"TOTAL TIME TAKEN: {total_time_taken/60:.2f} minutes") # Total number of models: 771 # TOTAL TIME TAKEN: 44.64 minutes

Tune Status

Current time: 2023-05-17 14:25:23
Running for: 00:00:29.24
Memory: 10.1/480.3 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 1.0/64 CPUs, 0/8 GPUs (0.0/1.0 accelerator_type:V100)

Trial Status

Trial name               status      loc                  params/algorithm      params/location  iter  total time (s)  error
train_model_42de5_00000  TERMINATED  172.31.213.59:14115  prophet_additive      141              1     5.73623         502.849
train_model_42de5_00001  TERMINATED  172.31.213.59:14116  prophet_multipl_43f0  141              1     5.94609         483.067
train_model_42de5_00002  TERMINATED  172.31.213.59:14117  arima                 141              1     21.574          342.35
train_model_42de5_00003  TERMINATED  172.31.213.59:14118  prophet_additive      229              1     5.54851         539.389
train_model_42de5_00004  TERMINATED  172.31.213.59:14119  prophet_multipl_43f0  229              1     5.4158          529.743
train_model_42de5_00005  TERMINATED  172.31.213.59:14122  arima                 229              1     20.862          480.844
train_model_42de5_00006  TERMINATED  172.31.213.59:14120  prophet_additive      173              1     4.70871         2.55585
train_model_42de5_00007  TERMINATED  172.31.213.59:14121  prophet_multipl_43f0  173              1     4.64482         2.52897
train_model_42de5_00008  TERMINATED  172.31.213.59:14123  arima                 173              1     21.2637         2.81589
(train_model pid=14121) 14:25:05 - cmdstanpy - INFO - Chain [1] start processing
(train_model pid=14121) 14:25:06 - cmdstanpy - INFO - Chain [1] done processing

Trial Progress

Trial name  date  done  error  experiment_tag  hostname  iterations_since_restore  node_ip  pid  should_checkpoint  time_since_restore  time_this_iter_s  time_total_s  timestamp  training_iteration  trial_id
train_model_42de5_00000  2023-05-17_14-25-07  True  502.849  0_algorithm=prophet_additive,location=141  ip-172-31-213-59  1  172.31.213.59  14115  True  5.73623  5.73623  5.73623  1684358707  1  42de5_00000
train_model_42de5_00001  2023-05-17_14-25-07  True  483.067  1_algorithm=prophet_multiplicative,location=141  ip-172-31-213-59  1  172.31.213.59  14116  True  5.94609  5.94609  5.94609  1684358707  1  42de5_00001
train_model_42de5_00002  2023-05-17_14-25-23  True  342.35  2_algorithm=arima,location=141  ip-172-31-213-59  1  172.31.213.59  14117  True  21.574  21.574  21.574  1684358723  1  42de5_00002
train_model_42de5_00003  2023-05-17_14-25-07  True  539.389  3_algorithm=prophet_additive,location=229  ip-172-31-213-59  1  172.31.213.59  14118  True  5.54851  5.54851  5.54851  1684358707  1  42de5_00003
train_model_42de5_00004  2023-05-17_14-25-07  True  529.743  4_algorithm=prophet_multiplicative,location=229  ip-172-31-213-59  1  172.31.213.59  14119  True  5.4158  5.4158  5.4158  1684358707  1  42de5_00004
train_model_42de5_00005  2023-05-17_14-25-22  True  480.844  5_algorithm=arima,location=229  ip-172-31-213-59  1  172.31.213.59  14122  True  20.862  20.862  20.862  1684358722  1  42de5_00005
train_model_42de5_00006  2023-05-17_14-25-06  True  2.55585  6_algorithm=prophet_additive,location=173  ip-172-31-213-59  1  172.31.213.59  14120  True  4.70871  4.70871  4.70871  1684358706  1  42de5_00006
train_model_42de5_00007  2023-05-17_14-25-06  True  2.52897  7_algorithm=prophet_multiplicative,location=173  ip-172-31-213-59  1  172.31.213.59  14121  True  4.64482  4.64482  4.64482  1684358706  1  42de5_00007
train_model_42de5_00008  2023-05-17_14-25-22  True  2.81589  8_algorithm=arima,location=173  ip-172-31-213-59  1  172.31.213.59  14123  True  21.2637  21.2637  21.2637  1684358722  1  42de5_00008
2023-05-17 14:25:23,079 INFO tune.py:1100 -- Total run time: 30.71 seconds (29.24 seconds for the tuning loop). Total number of models: 9 TOTAL TIME TAKEN: 0.51 minutes Load a model from checkpoint After the Tune experiment has finished, we can assemble the Tune ResultGrid object into a pandas dataframe. Next, we’ll sort the pandas dataframe by pickup location and error, and keep only the best model with minimum error per pickup location. # get a list of training loss errors errors = [i.metrics.get("error", 10000.0) for i in results] # get a list of checkpoints checkpoints = [i.checkpoint for i in results] # get a list of locations locations = [i.config["params"]["location"] for i in results] # get a list of model params algorithm = [i.config["params"]["algorithm"] for i in results] # Assemble a pandas dataframe from Tune results results_df = pd.DataFrame( zip(locations, errors, algorithm, checkpoints), columns=["location_id", "error", "algorithm", "checkpoint"], ) results_df.head(8)
location_id error algorithm checkpoint
0 141 502.848601 prophet_additive Checkpoint(local_path=/home/ray/default/doc/so...
1 141 483.067259 prophet_multiplicative Checkpoint(local_path=/home/ray/default/doc/so...
2 141 342.350202 arima Checkpoint(local_path=/home/ray/default/doc/so...
3 229 539.389339 prophet_additive Checkpoint(local_path=/home/ray/default/doc/so...
4 229 529.743081 prophet_multiplicative Checkpoint(local_path=/home/ray/default/doc/so...
5 229 480.844291 arima Checkpoint(local_path=/home/ray/default/doc/so...
6 173 2.555847 prophet_additive Checkpoint(local_path=/home/ray/default/doc/so...
7 173 2.528968 prophet_multiplicative Checkpoint(local_path=/home/ray/default/doc/so...
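Before selecting winners, it can also be worth confirming that no trial failed outright (for example, if a location's data triggers an uncaught exception). A small sketch, assuming a Ray version where the ResultGrid exposes the num_errors and errors properties:

# Check for trials that raised exceptions before selecting the best models.
print(f"Trials that errored: {results.num_errors}")
for trial_error in results.errors:
    print(trial_error)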
# Keep only 1 model per location_id with minimum error final_df = results_df.copy() final_df = final_df.loc[(final_df.error > 0), :] final_df = final_df.loc[final_df.groupby("location_id")["error"].idxmin()] final_df.sort_values(by=["error"], inplace=True) final_df.set_index("location_id", inplace=True, drop=True) final_df
error algorithm checkpoint
location_id
173 2.528968 prophet_multiplicative Checkpoint(local_path=/home/ray/default/doc/so...
141 342.350202 arima Checkpoint(local_path=/home/ray/default/doc/so...
229 480.844291 arima Checkpoint(local_path=/home/ray/default/doc/so...
final_df[["algorithm"]].value_counts(normalize=True) algorithm arima 0.666667 prophet_multiplicative 0.333333 dtype: float64 Create a forecast from model restored from checkpoint Finally, we will restore the best and worst models from checkpoint, generate predictions, and inspect the forecasts. Prophet includes a convenient plot library which displays actual data along with backtest predictions and confidence intervals and future forecasts. With ARIMA, you have to create a prediciton manually. We will easily obtain AIR Checkpoint objects from the Tune results. We will restore a Prophet or ARIMA model directly from checkpoint, and demonstrate it can be used for prediction. Ray AIR Predictors make batch inference easy since they have internal logic to parallelize the inference. # Get the pickup location for the best model if SMOKE_TEST: sample_location_id = final_df.index[0] else: sample_location_id = final_df.index[120] # Get the algorithm used sample_algorithm = final_df.loc[[sample_location_id]].algorithm.values[0] # Get a checkpoint directly from the pandas dataframe of Tune results checkpoint = final_df.checkpoint[sample_location_id] print(f"checkpoint type:: {type(checkpoint)}") # Restore a model from checkpoint sample_model = checkpoint.to_dict()["model"] # Prophet .fit() performs inference + prediction. # Arima train only performs inference; prediction is an extra step. if sample_algorithm == "arima": prediction = ( sample_model.forecast(2 * (FORECAST_LENGTH + 1)).reset_index().set_index("ds") ) prediction["trend"] = None prediction.rename(columns={"AutoARIMA": "yhat"}, inplace=True) prediction = prediction.tail(FORECAST_LENGTH + 1) # Restore already-created predictions from model training and eval forecast_df = checkpoint.to_dict()["forecast_df"] # Print pickup location ID, algorithm used, and model validation error. sample_error = final_df.loc[[sample_location_id]].error.values[0] print( f"location {sample_location_id}, algorithm {sample_algorithm}, best error {sample_error}" ) # Plot forecast prediction using best model for this pickup location ID. # If prophet model, use prophet built-in plot if sample_algorithm == "arima": forecast_df[["trend", "yhat"]].plot() else: plot1 = sample_model.plot(forecast_df) checkpoint type:: location 173, algorithm prophet_multiplicative, best error 2.528968219366575 # Get the pickup location for the worst model sample_location_id = final_df.index[len(final_df) - 2] # Get the algorithm used sample_algorithm = final_df.loc[[sample_location_id]].algorithm.values[0] # Get a checkpoint directly from the pandas dataframe of Tune results checkpoint = final_df.checkpoint[sample_location_id] print(f"checkpoint type:: {type(checkpoint)}") # Restore a model from checkpoint sample_model = checkpoint.to_dict()["model"] # Prophet .fit() performs inference + prediction. # Arima train only performs inference; prediction is an extra step. if sample_algorithm == "arima": prediction = ( sample_model.forecast(2 * (FORECAST_LENGTH + 1)).reset_index().set_index("ds") ) prediction["trend"] = None prediction.rename(columns={"AutoARIMA": "yhat"}, inplace=True) prediction = prediction.tail(FORECAST_LENGTH + 1) # Restore already-created inferences from model training and eval forecast_df = checkpoint.to_dict()["forecast_df"] # Append the prediction to the inferences forecast_df = pd.concat([forecast_df, prediction]) # Print pickup location ID, algorithm used, and model validation error. 
sample_error = final_df.loc[[sample_location_id]].error.values[0] print( f"location {sample_location_id}, algorithm {sample_algorithm}, best error {sample_error}" ) # Plot forecast prediction using best model for this pickup location ID. if sample_algorithm == "arima": forecast_df[["trend", "yhat"]].plot() else: plot1 = sample_model.plot(forecast_df) checkpoint type:: <__array_function__ internals>:200: RuntimeWarning: invalid value encountered in cast /home/ray/anaconda3/lib/python3.8/site-packages/statsforecast/arima.py:914: UserWarning: possible convergence problem: minimize gave code 1] warnings.warn( /home/ray/anaconda3/lib/python3.8/site-packages/statsforecast/arima.py:914: UserWarning: possible convergence problem: minimize gave code 2] warnings.warn( /home/ray/anaconda3/lib/python3.8/site-packages/statsforecast/arima.py:914: UserWarning: possible convergence problem: minimize gave code 2] warnings.warn( /home/ray/anaconda3/lib/python3.8/site-packages/statsforecast/arima.py:914: UserWarning: possible convergence problem: minimize gave code 2] warnings.warn( /home/ray/anaconda3/lib/python3.8/site-packages/statsforecast/arima.py:914: UserWarning: possible convergence problem: minimize gave code 2] warnings.warn( /home/ray/anaconda3/lib/python3.8/site-packages/statsforecast/arima.py:914: UserWarning: possible convergence problem: minimize gave code 2] warnings.warn( /home/ray/anaconda3/lib/python3.8/site-packages/statsforecast/arima.py:914: UserWarning: possible convergence problem: minimize gave code 2] warnings.warn( location 141, algorithm arima, best error 342.35020228794644 Stable Diffusion Batch Prediction with Ray AIR In this example, we will showcase how to use the Ray AIR for Stable Diffusion batch inference. Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512x512 images from a subset of the LAION-5B database. LAION-5B is the largest, freely accessible multi-modal dataset that currently exists. For more information on Stable Diffusion, click here. We will use Ray Data and a pretrained model from Hugging Face hub. Note that you can easily adapt this example to use other similar models. It is highly recommended to read Ray AIR Key Concepts and Ray Data Key Concepts before starting this example. In order to run this example, make sure your Ray cluster has access to at least one GPU with 16 or more GBs of memory. The amount of memory needed will depend on the model. model_id = "stabilityai/stable-diffusion-2-1" prompt = "a photo of an astronaut riding a horse on mars" import ray We define a runtime environment to ensure that the Ray workers have access to all the necessary packages. You can omit the runtime_env argument if you have all of the packages already installed on each node in your cluster. ray.init( runtime_env={ "pip": [ "accelerate>=0.16.0", "transformers>=4.26.0", "diffusers>=0.13.1", "xformers>=0.0.16", "torch", ] } ) For the purposes of this example, we will use a very small toy dataset composed of multiple copies of our prompt. Ray Data can handle much bigger datasets with ease. import ray.data import pandas as pd ds = ray.data.from_pandas(pd.DataFrame([prompt] * 4, columns=["prompt"])) Since we will be using a pretrained model from Hugging Face hub, the simplest way is to use map_batches with a callable class UDF. This will allow us to save time by initializing a model just once and then feed it multiple batches of data. 
class PredictCallable: def __init__(self, model_id: str, revision: str = None): from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler # Use xformers for better memory usage from xformers.ops import MemoryEfficientAttentionFlashAttentionOp import torch self.pipe = StableDiffusionPipeline.from_pretrained( model_id, torch_dtype=torch.float16 ) self.pipe.scheduler = DPMSolverMultistepScheduler.from_config( self.pipe.scheduler.config ) self.pipe.enable_xformers_memory_efficient_attention( attention_op=MemoryEfficientAttentionFlashAttentionOp ) # Workaround for not accepting attention shape using VAE for Flash Attention self.pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None) self.pipe = self.pipe.to("cuda") def __call__(self, batch: pd.DataFrame) -> pd.DataFrame: import torch import numpy as np # Set a different seed for every image in batch self.pipe.generator = [ torch.Generator(device="cuda").manual_seed(i) for i in range(len(batch)) ] images = self.pipe(list(batch["prompt"])).images return {"images": np.array(images, dtype=object)} All that is left is to run the map_batches method on the dataset. We specify that we want to use one GPU for each Ray Actor that will be running our callable class. If you have access to large GPUs, you may want to increase the batch size to better saturate them. preds = ds.map_batches( PredictCallable, batch_size=1, fn_constructor_kwargs=dict(model_id=model_id), compute=ray.data.ActorPoolStrategy(), batch_format="pandas", num_gpus=1, ) results = preds.take_all() 2023-02-28 10:38:32,723 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(PredictCallable)] MapBatches(PredictCallable), 0 actors [0 locality hits, 1 misses]: 100%|██████████| 1/1 [01:46<00:00, 106.33s/it] After map_batches is done, we can view our images. results[0]["images"] results[1]["images"] You may notice that we are not using an AIR Predictor here. This is because AIR does not implement an out of the box Predictor for Diffusers. We could implement it ourselves, but Predictors are mainly intended to be used with AIR Checkpoints, and those are not necessary for this example. See Using Predictors for Inference for more information and usage examples. GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed In this example, we will showcase how to use the Ray AIR for GPT-J fine-tuning. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information on GPT-J, click here. We will use Ray AIR (with the 🤗 Transformers integration) and a pretrained model from Hugging Face hub. Note that you can easily adapt this example to use other similar models. This example focuses more on the performance and distributed computing aspects of Ray AIR. If you are looking for a more beginner-friendly introduction to Ray AIR 🤗 Transformers integration, see this example. It is highly recommended to read Ray AIR Key Concepts and Ray Data Key Concepts before starting this example. To run this example, make sure your Ray cluster has access to at least one GPU with 16 or more GBs of memory. The required amount of memory depends on the model. This notebook is tested with 16 g4dn.4xlarge instances (including the head node). If you wish to use a CPU head node, turn on cloud checkpointing to avoid OOM errors that may happen due to the default behavior of syncing the checkpoint files to the head node. 
In this notebook, we will: Set up Ray Load the dataset Preprocess the dataset with Ray AIR Run the training with Ray AIR Generate text from prompt with Ray AIR Uncomment and run the following line in order to install all the necessary dependencies (this notebook is being tested with transformers==4.26.0): #! pip install "datasets" "evaluate" "accelerate==0.18.0" "transformers>=4.26.0" "torch>=1.12.0" "deepspeed==0.8.3" import numpy as np import pandas as pd import os Set up Ray First, let’s set some global variables. We will use 16 workers, each being assigned 1 GPU and 8 CPUs. model_name = "EleutherAI/gpt-j-6B" use_gpu = True num_workers = 16 cpus_per_worker = 8 We will use ray.init() to initialize a local cluster. By default, this cluster will be comprised of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster. We define a runtime environment to ensure that the Ray workers have access to all the necessary packages. You can omit the runtime_env argument if you have all of the packages already installed on each node in your cluster. import ray ray.init( runtime_env={ "pip": [ "datasets", "evaluate", # Latest combination of accelerate==0.19.0 and transformers==4.29.0 # seems to have issues with DeepSpeed process group initialization, # and will result in a batch_size validation problem. # TODO(jungong) : get rid of the pins once the issue is fixed. "accelerate==0.16.0", "transformers==4.26.0", "torch>=1.12.0", "deepspeed==0.9.2", ] } ) # THIS SHOULD BE HIDDEN IN DOCS AND ONLY RAN IN CI # Download the model from our S3 mirror as it's faster import ray import subprocess import ray.util.scheduling_strategies def force_on_node(node_id: str, remote_func_or_actor_class): scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy( node_id=node_id, soft=False ) options = {"scheduling_strategy": scheduling_strategy} return remote_func_or_actor_class.options(**options) def run_on_every_node(remote_func_or_actor_class, **remote_kwargs): refs = [] for node in ray.nodes(): if node["Alive"] and node["Resources"].get("GPU", None): refs.append( force_on_node(node["NodeID"], remote_func_or_actor_class).remote( **remote_kwargs ) ) return ray.get(refs) @ray.remote(num_gpus=1) def download_model(): from transformers.utils.hub import TRANSFORMERS_CACHE path = os.path.expanduser( os.path.join(TRANSFORMERS_CACHE, "models--EleutherAI--gpt-j-6B") ) subprocess.run(["mkdir", "-p", os.path.join(path, "snapshots", "main")]) subprocess.run(["mkdir", "-p", os.path.join(path, "refs")]) if os.path.exists(os.path.join(path, "refs", "main")): return subprocess.run( [ "aws", "s3", "sync", "--no-sign-request", "s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/", os.path.join(path, "snapshots", "main"), ] ) with open(os.path.join(path, "snapshots", "main", "hash"), "r") as f: f_hash = f.read().strip() with open(os.path.join(path, "refs", "main"), "w") as f: f.write(f_hash) os.rename( os.path.join(path, "snapshots", "main"), os.path.join(path, "snapshots", f_hash) ) _ = run_on_every_node(download_model) Loading the dataset We will be fine-tuning the model on the tiny_shakespeare dataset, comprised of 40,000 lines of Shakespeare from a variety of Shakespeare’s plays. The aim will be to make the GPT-J model better at generating text in the style of Shakespeare. 
from datasets import load_dataset print("Loading tiny_shakespeare dataset") current_dataset = load_dataset("tiny_shakespeare") current_dataset Loading tiny_shakespeare dataset Found cached dataset tiny_shakespeare (/home/ray/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e) DatasetDict({ train: Dataset({ features: ['text'], num_rows: 1 }) validation: Dataset({ features: ['text'], num_rows: 1 }) test: Dataset({ features: ['text'], num_rows: 1 }) }) We will use Ray Data for distributed preprocessing and data ingestion. We can easily convert the dataset obtained from Hugging Face Hub to Ray Data by using ray.data.from_huggingface(). import ray.data ray_datasets = ray.data.from_huggingface(current_dataset) ray_datasets {'train': Dataset(num_blocks=1, num_rows=1, schema={text: string}), 'validation': Dataset(num_blocks=1, num_rows=1, schema={text: string}), 'test': Dataset(num_blocks=1, num_rows=1, schema={text: string})} Because the dataset is represented by a single large string, we will need to do some preprocessing. For that, we will define two Ray AIR Preprocessors using the BatchMapper API, allowing us to define functions that will be applied on batches of data. The split_text function will take the single string and split it into separate lines, removing empty lines and character names ending with ‘:’ (e.g. ‘ROMEO:’). The tokenize function will take the lines and tokenize them using the 🤗 Tokenizer associated with the model, ensuring each entry has the same length (block_size) by padding and truncating. This is necessary for training. This preprocessing can be done in other ways. A common pattern is to tokenize first, and then split the obtained tokens into equally-sized blocks. We will use the splitter and tokenizer Preprocessors below. block_size = 512 from transformers import AutoTokenizer from ray.data.preprocessors import BatchMapper def split_text(batch: pd.DataFrame) -> pd.DataFrame: text = list(batch["text"]) flat_text = "".join(text) split_text = [ x.strip() for x in flat_text.split("\n") if x.strip() and not x.strip()[-1] == ":" ] return pd.DataFrame(split_text, columns=["text"]) def tokenize(batch: pd.DataFrame) -> dict: tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False) tokenizer.pad_token = tokenizer.eos_token ret = tokenizer( list(batch["text"]), truncation=True, max_length=block_size, padding="max_length", return_tensors="np", ) ret["labels"] = ret["input_ids"].copy() return dict(ret) splitter = BatchMapper(split_text, batch_format="pandas") tokenizer = BatchMapper(tokenize, batch_format="pandas") Fine-tuning the model with Ray AIR We can now configure Ray AIR’s TransformersTrainer to perform distributed fine-tuning of the model. In order to do that, we specify a trainer_init_per_worker function, which creates a 🤗 Transformers Trainer that will be distributed by Ray using Distributed Data Parallelism (using the PyTorch Distributed backend internally). This means that each worker will have its own copy of the model, but operate on different data. At the end of each step, all the workers will sync gradients. Because GPT-J is a relatively large model, it may not be possible to fit it on smaller GPU types (<=16 GB GRAM). To deal with that issue, we can use DeepSpeed, a library to optimize the training process and allow us to (among other things) offload and partition optimizer and parameter states, reducing GRAM usage.
Furthermore, DeepSpeed ZeRO Stage 3 allows us to load large models without running out of memory.

🤗 Transformers and Ray AIR's integration (TransformersTrainer) allow you to easily configure and use DDP and DeepSpeed. All you need to do is specify the DeepSpeed configuration in the TrainingArguments object.

There are many DeepSpeed settings that allow you to trade off speed for memory usage. The settings used below are tailored to the cluster setup used (16 g4dn.4xlarge nodes) and a per-device batch size of 16. Some things to keep in mind:

If your GPUs support bfloat16, use that instead of float16 mixed precision to get better performance and prevent overflows. Replace fp16=True with bf16=True in TrainingArguments.
If you are running out of GRAM: try reducing the batch size (defined in the cell below the next one), or set "overlap_comm": False in the DeepSpeed config.
If you are running out of RAM: add more nodes to your cluster, use nodes with more RAM, set "pin_memory": False in the DeepSpeed config, reduce the batch size, or remove "offload_param" from the DeepSpeed config.

For more information on DeepSpeed configuration, refer to the Hugging Face documentation and the DeepSpeed documentation.

Additionally, if you prefer a lower-level API, the logic below can be expressed as an Accelerate training loop distributed by a Ray AIR TorchTrainer.

Training speed

As we are using data parallelism, each worker operates on its own shard of the data. The batch size set in TrainingArguments is the per-device batch size (per-worker batch size). By changing the number of workers, we can change the effective batch size and thus the time needed for training to complete. The effective batch size is calculated as per-device batch size * number of workers * number of gradient accumulation steps. As we add more workers, the effective batch size rises, and thus we need less time to complete a full epoch. While the speedup is not exactly linear due to extra communication overheads, in many cases it can be close to linear.

The preprocessed dataset has 1348 examples. We have set the per-device batch size to 16.

With 16 g4dn.4xlarge nodes, the effective batch size was 256, which equals 85 steps per epoch. One epoch took ~2440 seconds (including initialization time).
With 32 g4dn.4xlarge nodes, the effective batch size was 512, which equals 43 steps per epoch. One epoch took ~1280 seconds (including initialization time).
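As a quick sanity check, the effective batch size formula above can be reproduced in a couple of lines. The variable names below are illustrative and not part of the example code:

# Effective batch size = per-device batch size * number of workers * gradient accumulation steps.
per_device_batch_size = 16
num_workers = 16
gradient_accumulation_steps = 1

effective_batch_size = (
    per_device_batch_size * num_workers * gradient_accumulation_steps
)
print(effective_batch_size)  # 256, matching the 16-node setup described above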
import evaluate
from transformers import Trainer, TrainingArguments
from transformers import (
    GPTJForCausalLM,
    AutoTokenizer,
    default_data_collator,
)
from transformers.utils.logging import disable_progress_bar, enable_progress_bar
import torch

from ray.air import session


def trainer_init_per_worker(train_dataset, eval_dataset=None, **config):
    # Use the actual number of CPUs assigned by Ray
    os.environ["OMP_NUM_THREADS"] = str(
        session.get_trial_resources().bundles[-1].get("CPU", 1)
    )
    # Enable tf32 for better performance
    torch.backends.cuda.matmul.allow_tf32 = True

    batch_size = config.get("batch_size", 4)
    epochs = config.get("epochs", 2)
    warmup_steps = config.get("warmup_steps", 0)
    learning_rate = config.get("learning_rate", 0.00002)
    weight_decay = config.get("weight_decay", 0.01)

    deepspeed = {
        "fp16": {
            "enabled": "auto",
            "initial_scale_power": 8,
        },
        "bf16": {"enabled": "auto"},
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
            },
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
            },
            "offload_param": {
                "device": "cpu",
                "pin_memory": True,
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "gather_16bit_weights_on_model_save": True,
            "round_robin_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 10,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": False,
    }

    print("Preparing training arguments")
    training_args = TrainingArguments(
        "output",
        per_device_train_batch_size=batch_size,
        logging_steps=1,
        save_strategy="no",
        per_device_eval_batch_size=batch_size,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        warmup_steps=warmup_steps,
        label_names=["input_ids", "attention_mask"],
        num_train_epochs=epochs,
        push_to_hub=False,
        disable_tqdm=True,  # declutter the output a little
        fp16=True,
        gradient_checkpointing=True,
        deepspeed=deepspeed,
    )
    disable_progress_bar()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    print("Loading model")
    model = GPTJForCausalLM.from_pretrained(model_name, use_cache=False)
    model.resize_token_embeddings(len(tokenizer))
    print("Model loaded")

    enable_progress_bar()

    metric = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )
    return trainer

With our trainer_init_per_worker complete, we can now instantiate the TransformersTrainer. Aside from the function, we set the scaling_config, controlling the number of workers and the resources used, and the datasets we will use for training and evaluation.

We pass the preprocessors we have defined earlier as an argument, wrapped in a Chain. The preprocessor will be included with the returned Checkpoint, meaning it will also be applied during inference.

Since this example runs with multiple nodes, we need to persist checkpoints and other outputs to some external storage for access after training has completed. You should set up cloud storage or NFS, then replace storage_path with your own cloud bucket URI or NFS path. See the storage guide for more details.
storage_path="s3://your-bucket-here"  # TODO: Set up cloud storage
# storage_path="/mnt/path/to/nfs"     # TODO: Alternatively, set up NFS

from ray.train.huggingface import TransformersTrainer
from ray.air import RunConfig, ScalingConfig
from ray.data.preprocessors import Chain

trainer = TransformersTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={
        "batch_size": 16,  # per device
        "epochs": 1,
    },
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets={"train": ray_datasets["train"], "evaluation": ray_datasets["validation"]},
    preprocessor=Chain(splitter, tokenizer),
    run_config=RunConfig(storage_path=storage_path),
)

Finally, we call the fit() method to start training with Ray AIR. We will save the Result object to a variable so we can access metrics and checkpoints.

results = trainer.fit()

Tune Status

Current time: 2023-03-06 17:18:41
Running for: 00:43:11.46
Memory: 31.9/62.0 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/256 CPUs, 0/16 GPUs, 0.0/675.29 GiB heap, 0.0/291.99 GiB objects (0.0/16.0 accelerator_type:T4)

Trial Status

Trial name                        status       loc                 iter   total time (s)   loss     learning_rate   epoch
TransformersTrainer_f623d_00000   TERMINATED   10.0.30.196:30861   85     2579.3           0.0715   4.70588e-07     1
(RayTrainWorker pid=31281) 2023-03-06 16:36:00,447 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] (RayTrainWorker pid=1964, ip=10.0.26.83) /tmp/ray/session_2023-03-06_15-55-37_997701_162/runtime_resources/py_modules_files/_ray_pkg_f864ba6869d6802c/ray/train/_internal/dataset_iterator.py:64: UserWarning: session.get_dataset_shard returns a ray.data.DataIterator instead of a Dataset/DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DataIterator docs. (RayTrainWorker pid=1964, ip=10.0.26.83) warnings.warn( (RayTrainWorker pid=1964, ip=10.0.26.83) 2023-03-06 16:36:00,453 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] (RayTrainWorker pid=1963, ip=10.0.54.163) /tmp/ray/session_2023-03-06_15-55-37_997701_162/runtime_resources/py_modules_files/_ray_pkg_f864ba6869d6802c/ray/train/_internal/dataset_iterator.py:64: UserWarning: session.get_dataset_shard returns a ray.data.DataIterator instead of a Dataset/DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DataIterator docs. (RayTrainWorker pid=1963, ip=10.0.54.163) warnings.warn( (RayTrainWorker pid=1963, ip=10.0.54.163) 2023-03-06 16:36:00,452 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] (RayTrainWorker pid=1954, ip=10.0.15.115) /tmp/ray/session_2023-03-06_15-55-37_997701_162/runtime_resources/py_modules_files/_ray_pkg_f864ba6869d6802c/ray/train/_internal/dataset_iterator.py:64: UserWarning: session.get_dataset_shard returns a ray.data.DataIterator instead of a Dataset/DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DataIterator docs. (RayTrainWorker pid=1954, ip=10.0.15.115) warnings.warn( (RayTrainWorker pid=1954, ip=10.0.15.115) 2023-03-06 16:36:00,452 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] (RayTrainWorker pid=1955, ip=10.0.58.255) /tmp/ray/session_2023-03-06_15-55-37_997701_162/runtime_resources/py_modules_files/_ray_pkg_f864ba6869d6802c/ray/train/_internal/dataset_iterator.py:64: UserWarning: session.get_dataset_shard returns a ray.data.DataIterator instead of a Dataset/DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DataIterator docs. 
(RayTrainWorker pid=1955, ip=10.0.58.255) warnings.warn( (RayTrainWorker pid=1955, ip=10.0.58.255) 2023-03-06 16:36:00,453 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] (RayTrainWorker pid=1942, ip=10.0.57.85) 2023-03-06 16:36:00,452 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] (RayTrainWorker pid=1963, ip=10.0.29.205) 2023-03-06 16:36:00,452 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] (RayTrainWorker pid=1942, ip=10.0.51.113) 2023-03-06 16:36:00,454 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] (RayTrainWorker pid=31281) Preparing training arguments (RayTrainWorker pid=31281) Loading model (RayTrainWorker pid=31281) [2023-03-06 16:37:21,252] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 6.05B parameters (RayTrainWorker pid=31281) Model loaded (RayTrainWorker pid=31281) Using cuda_amp half precision backend (RayTrainWorker pid=31281) [2023-03-06 16:38:03,431] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown (RayTrainWorker pid=31281) [2023-03-06 16:38:03,450] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False (RayTrainWorker pid=31281) ***** Running training ***** (RayTrainWorker pid=31281) Num examples = 1348 (RayTrainWorker pid=31281) Num Epochs = 1 (RayTrainWorker pid=31281) Instantaneous batch size per device = 16 (RayTrainWorker pid=31281) Total train batch size (w. parallel, distributed & accumulation) = 256 (RayTrainWorker pid=31281) Gradient Accumulation steps = 1 (RayTrainWorker pid=31281) Total optimization steps = 85 (RayTrainWorker pid=31281) Number of trainable parameters = 0 (RayTrainWorker pid=31281) /home/ray/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. (RayTrainWorker pid=31281) warnings.warn( (RayTrainWorker pid=31281) [2023-03-06 16:38:25,024] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw (RayTrainWorker pid=31281) [2023-03-06 16:38:25,024] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler (RayTrainWorker pid=31281) [2023-03-06 16:38:25,025] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed LR Scheduler = (RayTrainWorker pid=31281) [2023-03-06 16:38:25,025] [INFO] [logging.py:75:log_dist] [Rank 0] step=0, skipped=0, lr=[2e-05], mom=[[0.9, 0.999]] (RayTrainWorker pid=31281) [2023-03-06 16:38:25,025] [INFO] [config.py:1009:print] DeepSpeedEngine configuration: (RayTrainWorker pid=31281) [2023-03-06 16:38:25,026] [INFO] [config.py:1013:print] activation_checkpointing_config { (RayTrainWorker pid=31281) "partition_activations": false, (RayTrainWorker pid=31281) "contiguous_memory_optimization": false, (RayTrainWorker pid=31281) "cpu_checkpointing": false, (RayTrainWorker pid=31281) "number_checkpoints": null, (RayTrainWorker pid=31281) "synchronize_checkpoint_boundary": false, (RayTrainWorker pid=31281) "profile": false (RayTrainWorker pid=31281) } (RayTrainWorker pid=31281) [2023-03-06 16:38:25,026] [INFO] [config.py:1013:print] aio_config ................... 
{'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} (RayTrainWorker pid=31281) [2023-03-06 16:38:25,026] [INFO] [config.py:1013:print] amp_enabled .................. False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,026] [INFO] [config.py:1013:print] amp_params ................... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] autotuning_config ............ { (RayTrainWorker pid=31281) "enabled": false, (RayTrainWorker pid=31281) "start_step": null, (RayTrainWorker pid=31281) "end_step": null, (RayTrainWorker pid=31281) "metric_path": null, (RayTrainWorker pid=31281) "arg_mappings": null, (RayTrainWorker pid=31281) "metric": "throughput", (RayTrainWorker pid=31281) "model_info": null, (RayTrainWorker pid=31281) "results_dir": "autotuning_results", (RayTrainWorker pid=31281) "exps_dir": "autotuning_exps", (RayTrainWorker pid=31281) "overwrite": true, (RayTrainWorker pid=31281) "fast": true, (RayTrainWorker pid=31281) "start_profile_step": 3, (RayTrainWorker pid=31281) "end_profile_step": 5, (RayTrainWorker pid=31281) "tuner_type": "gridsearch", (RayTrainWorker pid=31281) "tuner_early_stopping": 5, (RayTrainWorker pid=31281) "tuner_num_trials": 50, (RayTrainWorker pid=31281) "model_info_path": null, (RayTrainWorker pid=31281) "mp_size": 1, (RayTrainWorker pid=31281) "max_train_batch_size": null, (RayTrainWorker pid=31281) "min_train_batch_size": 1, (RayTrainWorker pid=31281) "max_train_micro_batch_size_per_gpu": 1.024000e+03, (RayTrainWorker pid=31281) "min_train_micro_batch_size_per_gpu": 1, (RayTrainWorker pid=31281) "num_tuning_micro_batch_sizes": 3 (RayTrainWorker pid=31281) } (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] bfloat16_enabled ............. False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] checkpoint_parallel_write_pipeline False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] checkpoint_tag_validation_enabled True (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] checkpoint_tag_validation_fail False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] comms_config ................. (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] communication_data_type ...... None (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] curriculum_enabled_legacy .... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] curriculum_params_legacy ..... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] data_efficiency_enabled ...... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] dataloader_drop_last ......... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] disable_allgather ............ False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] dump_state ................... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] dynamic_loss_scale_args ...... {'init_scale': 256, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1} (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] eigenvalue_enabled ........... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] eigenvalue_gas_boundary_resolution 1 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] eigenvalue_layer_name ........ bert.encoder.layer (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] eigenvalue_layer_num ......... 0 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] eigenvalue_max_iter .......... 100 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] eigenvalue_stability ......... 1e-06 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] eigenvalue_tol ............... 0.01 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] eigenvalue_verbose ........... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] elasticity_enabled ........... 
False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] flops_profiler_config ........ { (RayTrainWorker pid=31281) "enabled": false, (RayTrainWorker pid=31281) "profile_step": 1, (RayTrainWorker pid=31281) "module_depth": -1, (RayTrainWorker pid=31281) "top_modules": 1, (RayTrainWorker pid=31281) "detailed": true, (RayTrainWorker pid=31281) "output_file": null (RayTrainWorker pid=31281) } (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] fp16_auto_cast ............... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] fp16_enabled ................. True (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] fp16_master_weights_and_gradients False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] global_rank .................. 0 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] grad_accum_dtype ............. None (RayTrainWorker pid=31281) [2023-03-06 16:38:25,027] [INFO] [config.py:1013:print] gradient_accumulation_steps .. 1 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] gradient_clipping ............ 1.0 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] gradient_predivide_factor .... 1.0 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] initial_dynamic_scale ........ 256 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] load_universal_checkpoint .... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] loss_scale ................... 0 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] memory_breakdown ............. False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] nebula_config ................ { (RayTrainWorker pid=31281) "enabled": false, (RayTrainWorker pid=31281) "persistent_storage_path": null, (RayTrainWorker pid=31281) "persistent_time_interval": 100, (RayTrainWorker pid=31281) "num_of_version_in_retention": 2, (RayTrainWorker pid=31281) "enable_nebula_load": true, (RayTrainWorker pid=31281) "load_path": null (RayTrainWorker pid=31281) } (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] optimizer_legacy_fusion ...... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] optimizer_name ............... adamw (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.999], 'eps': 1e-08} (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] pld_enabled .................. False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] pld_params ................... 
False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] prescale_gradients ........... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] scheduler_name ............... None (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] scheduler_params ............. None (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] sparse_attention ............. None (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] sparse_gradients_enabled ..... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] steps_per_print .............. 10 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] train_batch_size ............. 256 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] train_micro_batch_size_per_gpu 16 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] use_node_local_storage ....... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] wall_clock_breakdown ......... False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] world_size ................... 16 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] zero_allow_untested_optimizer False (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=True (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] zero_enabled ................. True (RayTrainWorker pid=31281) [2023-03-06 16:38:25,028] [INFO] [config.py:1013:print] zero_optimization_stage ...... 
3 (RayTrainWorker pid=31281) [2023-03-06 16:38:25,029] [INFO] [config.py:998:print_user_config] json = { (RayTrainWorker pid=31281) "fp16": { (RayTrainWorker pid=31281) "enabled": true, (RayTrainWorker pid=31281) "initial_scale_power": 8 (RayTrainWorker pid=31281) }, (RayTrainWorker pid=31281) "bf16": { (RayTrainWorker pid=31281) "enabled": false (RayTrainWorker pid=31281) }, (RayTrainWorker pid=31281) "optimizer": { (RayTrainWorker pid=31281) "type": "AdamW", (RayTrainWorker pid=31281) "params": { (RayTrainWorker pid=31281) "lr": 2e-05, (RayTrainWorker pid=31281) "betas": [0.9, 0.999], (RayTrainWorker pid=31281) "eps": 1e-08 (RayTrainWorker pid=31281) } (RayTrainWorker pid=31281) }, (RayTrainWorker pid=31281) "zero_optimization": { (RayTrainWorker pid=31281) "stage": 3, (RayTrainWorker pid=31281) "offload_optimizer": { (RayTrainWorker pid=31281) "device": "cpu", (RayTrainWorker pid=31281) "pin_memory": true (RayTrainWorker pid=31281) }, (RayTrainWorker pid=31281) "offload_param": { (RayTrainWorker pid=31281) "device": "cpu", (RayTrainWorker pid=31281) "pin_memory": true (RayTrainWorker pid=31281) }, (RayTrainWorker pid=31281) "overlap_comm": true, (RayTrainWorker pid=31281) "contiguous_gradients": true, (RayTrainWorker pid=31281) "reduce_bucket_size": 1.677722e+07, (RayTrainWorker pid=31281) "stage3_prefetch_bucket_size": 1.509949e+07, (RayTrainWorker pid=31281) "stage3_param_persistence_threshold": 4.096000e+04, (RayTrainWorker pid=31281) "gather_16bit_weights_on_model_save": true, (RayTrainWorker pid=31281) "round_robin_gradients": true (RayTrainWorker pid=31281) }, (RayTrainWorker pid=31281) "gradient_accumulation_steps": 1, (RayTrainWorker pid=31281) "gradient_clipping": 1.0, (RayTrainWorker pid=31281) "steps_per_print": 10, (RayTrainWorker pid=31281) "train_batch_size": 256, (RayTrainWorker pid=31281) "train_micro_batch_size_per_gpu": 16, (RayTrainWorker pid=31281) "wall_clock_breakdown": false (RayTrainWorker pid=31281) } (RayTrainWorker pid=31281) Model weights saved in output/checkpoint-85/pytorch_model.bin (RayTrainWorker pid=31281) tokenizer config file saved in output/checkpoint-85/tokenizer_config.json (RayTrainWorker pid=31281) Special tokens file saved in output/checkpoint-85/special_tokens_map.json (RayTrainWorker pid=31281) [2023-03-06 17:18:13,320] [INFO] [engine.py:3516:save_16bit_model] Saving model weights to output/checkpoint-85/pytorch_model.bin (RayTrainWorker pid=31281) [2023-03-06 17:18:13,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/pytorch_model.bin... (RayTrainWorker pid=31281) [2023-03-06 17:18:29,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved output/checkpoint-85/pytorch_model.bin. (RayTrainWorker pid=31281) [2023-03-06 17:18:29,087] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint global_step85 is begin to save! (RayTrainWorker pid=31281) [2023-03-06 17:18:29,109] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: output/checkpoint-85/global_step85/zero_pp_rank_0_mp_rank_00_model_states.pt (RayTrainWorker pid=31281) [2023-03-06 17:18:29,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_0_mp_rank_00_model_states.pt... (RayTrainWorker pid=31281) [2023-03-06 17:18:37,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved output/checkpoint-85/global_step85/zero_pp_rank_0_mp_rank_00_optim_states.pt. 
(RayTrainWorker pid=31281) [2023-03-06 17:18:37,984] [INFO] [engine.py:3407:_save_zero_checkpoint] zero checkpoint saved output/checkpoint-85/global_step85/zero_pp_rank_0_mp_rank_00_optim_states.pt
(RayTrainWorker pid=31281)
(RayTrainWorker pid=31281)
(RayTrainWorker pid=31281) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=31281)
(RayTrainWorker pid=31281)
(RayTrainWorker pid=31281) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=31281) {'train_runtime': 2413.1243, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}
2023-03-06 17:18:41,018 INFO tune.py:825 -- Total run time: 2591.59 seconds (2591.46 seconds for the tuning loop).

You can use the returned Result object to access metrics and the Ray AIR Checkpoint associated with the last iteration.

checkpoint = results.checkpoint
checkpoint

TransformersCheckpoint(local_path=/home/ray/ray_results/TransformersTrainer_2023-03-06_16-35-29/TransformersTrainer_f623d_00000_0_2023-03-06_16-35-30/checkpoint_000000)

Generate text from prompt

We can use the TransformersPredictor to generate predictions from our fine-tuned model. For large-scale batch inference, see End-to-end: Offline Batch Inference.

Because the TransformersPredictor uses a 🤗 Transformers pipeline under the hood, we disable the tokenizer AIR Preprocessor we have used for training and let the pipeline tokenize the data itself.

checkpoint.set_preprocessor(None)

We also set device_map="auto" so that the model is automatically placed on the right device and set the task to "text-generation". The predict method passes the arguments to a 🤗 Transformers pipeline call.

from ray.train.huggingface import TransformersPredictor
import pandas as pd

prompts = pd.DataFrame(["Romeo and Juliet", "Romeo", "Juliet"], columns=["text"])

# Predict on the head node.
predictor = TransformersPredictor.from_checkpoint(
    checkpoint=checkpoint,
    task="text-generation",
    torch_dtype=torch.float16 if use_gpu else None,
    device_map="auto",
    use_gpu=use_gpu,
)
prediction = predictor.predict(
    prompts,
    do_sample=True,
    temperature=0.9,
    min_length=32,
    max_length=128,
)
prediction
generated_text
0 Romeo and Juliet, they are married: and it is ...
1 Romeo, thou art Romeo and a Montague; for only...
2 Juliet's name; but I do not sound an ear to na...
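If you come back to these results in a new process, the persisted checkpoint directory can be re-wrapped and fed to the same predictor API. The following is a minimal sketch, not part of the original run; the local path is hypothetical and should point at a checkpoint produced by trainer.fit() (for example, one synced to the storage_path configured earlier):

from ray.train.huggingface import TransformersCheckpoint, TransformersPredictor

# Hypothetical local copy of a checkpoint directory produced by trainer.fit().
checkpoint = TransformersCheckpoint.from_directory("/path/to/checkpoint_000000")
checkpoint.set_preprocessor(None)  # let the pipeline tokenize raw text, as above

predictor = TransformersPredictor.from_checkpoint(
    checkpoint=checkpoint,
    task="text-generation",
    device_map="auto",
    use_gpu=True,
)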
GPT-J-6B Batch Prediction with Ray AIR

This example showcases how to use Ray AIR for GPT-J batch inference. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This model has 6 billion parameters. For more information on GPT-J, click here.

We use Ray Data and a pretrained model from the Hugging Face Hub. Note that you can easily adapt this example to use other similar models.

It is highly recommended to read Ray AIR Key Concepts and Ray Data Key Concepts before starting this example.

If you are interested in serving (online inference), see GPT-J-6B Serving with Ray AIR.

In order to run this example, make sure your Ray cluster has access to at least one GPU with 16 GB or more of memory. The amount of memory needed will depend on the model.

model_id = "EleutherAI/gpt-j-6B"
revision = "float16"  # use float16 weights to fit in 16GB GPUs
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

import ray

We define a runtime environment to ensure that the Ray workers have access to all the necessary packages. You can omit the runtime_env argument if you have all of the packages already installed on each node in your cluster.

ray.init(
    runtime_env={
        "pip": [
            "accelerate>=0.16.0",
            "transformers>=4.26.0",
            "numpy<1.24",  # remove when mlflow updates beyond 2.2
            "torch",
        ]
    }
)

For the purposes of this example, we will use a very small toy dataset composed of multiple copies of our prompt. Ray Data can handle much bigger datasets with ease.

import ray.data
import pandas as pd

ds = ray.data.from_pandas(pd.DataFrame([prompt] * 10, columns=["prompt"]))

Since we will be using a pretrained model from the Hugging Face Hub, the simplest way is to use map_batches with a callable class UDF. This will allow us to save time by initializing a model just once and then feeding it multiple batches of data.

class PredictCallable:
    def __init__(self, model_id: str, revision: str = None):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            revision=revision,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map="auto",  # automatically makes use of all GPUs available to the Actor
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        tokenized = self.tokenizer(
            list(batch["prompt"]), return_tensors="pt"
        )
        input_ids = tokenized.input_ids.to(self.model.device)
        attention_mask = tokenized.attention_mask.to(self.model.device)

        gen_tokens = self.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            do_sample=True,
            temperature=0.9,
            max_length=100,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        return pd.DataFrame(
            self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
        )

All that is left is to run the map_batches method on the dataset. We specify that we want to use one GPU for each Ray Actor that will be running our callable class.

Also notice that we repartition the dataset into 100 partitions before mapping batches. This is to make sure there will be enough parallel tasks to take advantage of all the GPUs. 100 is an arbitrary number. You can pick any other number, as long as it is greater than the number of available GPUs in the cluster. If you have access to large GPUs, you may want to increase the batch size to better saturate them.
If you want to use intra-node model parallelism, you can also increase num_gpus. As we have created the model with device_map="auto", it will be automatically placed on the correct devices. Note that this requires nodes with multiple GPUs.

preds = (
    ds
    .repartition(100)
    .map_batches(
        PredictCallable,
        batch_size=4,
        fn_constructor_kwargs=dict(model_id=model_id, revision=revision),
        batch_format="pandas",
        compute=ray.data.ActorPoolStrategy(),
        num_gpus=1,
    )
)

After map_batches is done, we can view our generated text.

preds.take_all()

2023-02-28 10:40:50,530 INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(PredictCallable)]
MapBatches(PredictCallable), 0 actors [0 locality hits, 1 misses]: 100%|██████████| 1/1 [12:10<00:00, 730.80s/it]

[{'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nThe finding comes from the team of researchers, which includes Dr. Michael Goldberg, a professor and chair of the Zoology Department at the University of Maryland. Dr. Goldberg spent a year collecting and conducting research in the Ecuadorian Andes, including the Pinchahu'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nThe team of British, Argentine and Chilean scientists found that the elusive unicorns had been living in the valley for at least 50 years, and had even interacted with humans.\n\nThe team’s findings published in the journal Scientific Reports has been hailed as a'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nAs far as the rest of human kind knew, unicorns had never existed on Earth, but the presence of this herd has left some very confused. Are the scientists simply overreacting? Or has the valley become the new Unicorn Valley?\n\nThere are only'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. The discovery was announced by Oxford University and was published in the journal Science. According to the researchers, this is proof of an alien life. This time around the aliens are definitely not from outer space – they are quite cozy.\n\n“I saw the herd for the'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nIn the article, The Daily Beast and NewScientist report on these "extraordinary find[s], reported this week to the Royal Society." According to the article:\n\nThe authors, who were part of a team from the University of Lincoln’s'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains.
Even more surprising to the researchers was the fact that the unicorns spoke perfect English. This was no ordinary herd of animals.\n\nThe discovery was made by the team while they were riding horses in the wilds of the Peruvian Andes. As they rode through the area, they came upon a herd of white alpacas, which were quite exotic'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nThe mountain valley that the unicorns lived in sat under the shadow of an active volcano emitting smoke as big as Mount St. Helens. The scientists named the newly discovered unicorn herd the Andes Biodiversity Center—or ABC for short.\n\nThe discovery'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nUnicorns have been depicted in the fairy tales and legends of many cultures throughout history, but scientists have been unable to explain the species.\n\nIn a paper published in the journal Biology Letters, the researchers studied five male and five female unicorns and their offspring'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. How did this amazing discovery occur?\n\nBefore I tell you exactly how unicorns managed to come into existence, allow me to explain how I think unicorns occur. I think they exist in the same way some people believe Jesus rose from the dead.\n\nI think'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nIt is well known that horses, zebras, and other hoofed beasts have long since left their ancestral lands in the grasslands of South America, and are now found throughout Eurasia and North Africa. However, there are also a number of other,'}]

You may notice that we are not using an AIR Predictor here. This is because Predictors are mainly intended to be used with AIR Checkpoints, which we don't use in this example. See Using Predictors for Inference for more information and usage examples.

GPT-J-6B Serving with Ray AIR

In this example, we will showcase how to use Ray AIR for GPT-J serving (online inference). GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information on GPT-J, click here.

We will use Ray Serve for online inference and a pretrained model from the Hugging Face Hub. Note that you can easily adapt this example to use other similar models.

It is highly recommended to read Ray AIR Key Concepts and Ray Serve Key Concepts before starting this example.

If you are interested in batch prediction (offline inference), see GPT-J-6B Batch Prediction with Ray AIR.

In order to run this example, make sure your Ray cluster has access to at least one GPU with 16 GB or more of memory. The amount of memory needed will depend on the model.
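Before starting, it can be useful to confirm that the cluster actually exposes a GPU to Ray. This is a minimal sketch, not part of the original example; it assumes Ray has already been initialized (for example by the ray.init call in the next cell):

import ray

# Aggregate resources Ray sees across the whole cluster.
resources = ray.cluster_resources()
num_gpus = resources.get("GPU", 0)
assert num_gpus >= 1, "This example requires at least one GPU in the cluster."
print(f"GPUs visible to Ray: {num_gpus}")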
model_id = "EleutherAI/gpt-j-6B"
revision = "float16"  # use float16 weights to fit in 16GB GPUs
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

import ray

We define a runtime environment to ensure that the Ray workers have access to all the necessary packages. You can omit the runtime_env argument if you have all of the packages already installed on each node in your cluster.

ray.init(
    runtime_env={
        "pip": [
            "accelerate>=0.16.0",
            "transformers>=4.26.0",
            "numpy<1.24",  # remove when mlflow updates beyond 2.2
            "torch",
        ]
    }
)

Setting up basic serving with Ray Serve is very similar to batch inference with Ray Data. First, we define a callable class that will serve as the Serve deployment. At runtime, a deployment consists of a number of replicas, which are individual copies of the class or function that are started in separate Ray Actors (processes). The number of replicas can be scaled up or down (or even autoscaled) to match the incoming request load.

We make sure to set the deployment to use 1 GPU by setting "num_gpus" in ray_actor_options. We load the model in __init__, which will allow us to save time by initializing a model just once and then using it to handle multiple requests.

If you want to use intra-node model parallelism, you can also increase num_gpus. As we have created the model with device_map="auto", it will be automatically placed on the correct devices. Note that this requires nodes with multiple GPUs.

import pandas as pd

from ray import serve
from starlette.requests import Request


@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str, revision: str = None):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            revision=revision,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map="auto",  # automatically makes use of all GPUs available to the Actor
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def generate(self, text: str) -> pd.DataFrame:
        input_ids = self.tokenizer(text, return_tensors="pt").input_ids.to(
            self.model.device
        )

        gen_tokens = self.model.generate(
            input_ids,
            do_sample=True,
            temperature=0.9,
            max_length=100,
        )
        return pd.DataFrame(
            self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
        )

    async def __call__(self, http_request: Request) -> str:
        json_request: str = await http_request.json()
        prompts = []
        for prompt in json_request:
            text = prompt["text"]
            if isinstance(text, list):
                prompts.extend(text)
            else:
                prompts.append(text)
        return self.generate(prompts)

We can now bind the deployment with our arguments, and use run() to start it.

If you were running this script outside of a Jupyter notebook, the recommended way is to use the serve run CLI command. In this case, you would remove the serve.run(deployment) line, and instead start the deployment by calling serve run FILENAME:deployment. For more information, see Serve Development Workflow.

deployment = PredictDeployment.bind(model_id=model_id, revision=revision)
serve.run(deployment)

RayServeSyncHandle(deployment='PredictDeployment')

Let's try submitting a request to our deployment. We will use the same prompt as before, and send a POST request. The deployment will generate a response and return it.
import requests

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

sample_input = {"text": prompt}
output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)

(ServeReplica:PredictDeployment pid=651, ip=10.0.8.161) The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
(ServeReplica:PredictDeployment pid=651, ip=10.0.8.161) Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

[{'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nThe findings come from a recent expedition to the region of Cordillera del Divisor, in northern Peru. The region was previously known to have an unusually high number of native animals.\n\n"Our team was conducting a population census of the region’'}]

You may notice that we are not using an AIR Predictor here. This is because Predictors are mainly intended to be used with AIR Checkpoints, which we don't use in this example. See Using Predictors for Inference for more information and usage examples.

Fine-tuning DreamBooth with Ray AIR

This example shows how to fine-tune a DreamBooth model using Ray AIR. Because of the large model sizes, you'll need 2 A10G GPUs per worker.

The example can leverage data-parallel training to speed up training time. Of course, this will require more GPUs.

The demo tunes both the text_encoder and unet parts of Stable Diffusion, and utilizes the prior preserving loss function.

DreamBooth example

The full code repository can be found here: https://github.com/ray-project/ray/blob/master/python/ray/air/examples/dreambooth/

How it works

This example leverages Ray Data for data loading and Ray Train for distributed training.

Data loading

You can find the latest version of the code here: dataset.py

The latest version might differ slightly from the code presented here.

We use Ray Data for data loading. The code has three interesting parts.

First, we load two datasets using ray.data.read_images():

instance_dataset = read_images(args.instance_images_dir)
class_dataset = read_images(args.class_images_dir)

Then, we tokenize the prompt that generated these images:

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=args.model_dir,
    subfolder="tokenizer",
)


def _tokenize(prompt):
    return tokenizer(
        prompt,
        truncation=True,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        return_tensors="pt",
    ).input_ids.numpy()


# Get the token ids for both prompts.
class_prompt_ids = _tokenize(args.class_prompt)[0]
instance_prompt_ids = _tokenize(args.instance_prompt)[0]

And lastly, we apply a torchvision preprocessing pipeline to the images:

transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.RandomCrop(image_resolution),
        transforms.Normalize([0.5], [0.5]),
    ]
)
preprocessor = TorchVisionPreprocessor(["image"], transform=transform)

We apply all of this in a final step:

instance_dataset = preprocessor.transform(instance_dataset).add_column(
    "prompt_ids", lambda df: [instance_prompt_ids] * len(df)
)
class_dataset = preprocessor.transform(class_dataset).add_column(
    "prompt_ids", lambda df: [class_prompt_ids] * len(df)
)

Distributed training

You can find the latest version of the code here: train.py

The latest version might differ slightly from the code presented here.

The central part of the training code is the training function. This function accepts a configuration dict that contains the hyperparameters. It then defines a regular PyTorch training loop.

There are only a few locations where we interact with the Ray AIR API. We marked them with in-line comments in the snippet below.

Remember that we want to do data-parallel training for all our models.

We load the data shard for each worker with session.get_dataset_shard("train")
We iterate over the dataset with train_dataset.iter_torch_batches()
We report results to Ray AIR with session.report(results)

The code was compacted for brevity. The full code is more thoroughly annotated.

def train_fn(config):
    cuda = get_cuda_devices()

    # Load pre-trained models.
    text_encoder, noise_scheduler, vae, unet = load_models(config, cuda)

    # Wrap in DDP
    text_encoder = DistributedDataParallel(
        text_encoder, device_ids=[cuda[1]], output_device=cuda[1]
    )
    unet = DistributedDataParallel(unet, device_ids=[cuda[0]], output_device=cuda[0])

    # Use the regular AdamW optimizer to work with bfloat16 weights.
    optimizer = torch.optim.AdamW(
        itertools.chain(text_encoder.parameters(), unet.parameters()),
        lr=config["lr"],
    )

    train_dataset = session.get_dataset_shard("train")

    # Train!
    num_train_epochs = config["num_epochs"]

    print(f"Running {num_train_epochs} epochs.")

    global_step = 0
    for epoch in range(num_train_epochs):
        for step, batch in enumerate(
            train_dataset.iter_torch_batches(
                batch_size=config["train_batch_size"], device=cuda[1]
            )
        ):
            # Load batch on GPU 2 because VAE and text encoder are there.
            batch = collate(batch, cuda[1], torch.bfloat16)

            optimizer.zero_grad()

            # Convert images to latent space
            latents = vae.encode(batch["images"]).latent_dist.sample() * 0.18215

            # Sample noise that we'll add to the latents
            noise = torch.randn_like(latents)
            bsz = latents.shape[0]
            # Sample a random timestep for each image
            timesteps = torch.randint(
                0,
                noise_scheduler.config.num_train_timesteps,
                (bsz,),
                device=latents.device,
            )
            timesteps = timesteps.long()

            # Add noise to the latents according to the noise magnitude at each timestep
            # (this is the forward diffusion process)
            noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

            # Get the text embedding for conditioning
            encoder_hidden_states = text_encoder(batch["prompt_ids"])[0]

            # Predict the noise residual. We need to move all data bits to GPU 1.
            model_pred = unet(
                noisy_latents.to(cuda[0]),
                timesteps.to(cuda[0]),
                encoder_hidden_states.to(cuda[0]),
            ).sample
            target = get_target(noise_scheduler, noise, latents, timesteps).to(cuda[0])

            # Now, move model prediction to GPU 2 for loss calculation.
            loss = prior_preserving_loss(
                model_pred, target, config["prior_loss_weight"]
            )
            loss.backward()

            # Gradient clipping before optimizer stepping.
            clip_grad_norm_(
                itertools.chain(text_encoder.parameters(), unet.parameters()),
                config["max_grad_norm"],
            )
            optimizer.step()  # Step all optimizers.

            global_step += 1
            results = {
                "step": global_step,
                "loss": loss.detach().item(),
            }
            session.report(results)

We can then run this training loop with Ray AIR's TorchTrainer:

args = train_arguments().parse_args()

# Build training dataset.
train_dataset = get_train_dataset(args)

print(f"Loaded training dataset (size: {train_dataset.count()})")

# Train with Ray AIR TorchTrainer.
trainer = TorchTrainer(
    train_fn,
    train_loop_config=vars(args),
    scaling_config=ScalingConfig(
        use_gpu=True,
        num_workers=args.num_workers,
        resources_per_worker={
            "GPU": 2,
        },
    ),
    datasets={
        "train": train_dataset,
    },
)
result = trainer.fit()

Configuring the scale

In the TorchTrainer, we can easily configure our scale. The above example uses the num_workers argument to specify the number of workers. This defaults to 2 workers with 2 GPUs each - so 4 GPUs in total.

To run the example on 8 GPUs, just set the number of workers to 4 using --num-workers=4! Or you can change the scaling config directly:

 scaling_config=ScalingConfig(
     use_gpu=True,
-    num_workers=args.num_workers,
+    num_workers=4,
     resources_per_worker={
         "GPU": 2,
     },
 )

If you're running multi-node training, you should make sure that all nodes have access to shared storage (e.g. via NFS or EFS). In the example script below, you can adjust this location with the DATA_PREFIX environment variable.

Training throughput

We ran training using 1, 2, 4, and 8 workers (and 2, 4, 8, and 16 GPUs, respectively) to compare throughput.

Setup:
2 x g5.12xlarge nodes with 4 A10G GPUs each
Model as configured below
Data from this example
200 regularization images
Training for 4 epochs (800 steps)
Use a mounted External File System to share data between nodes
3 runs per configuration

Because network storage can be slow, we excluded the time it takes to save the final model from the training time.

We expect that the training time should benefit from scale and decrease when running with more workers and GPUs.

DreamBooth training times

Number of workers   Number of GPUs   Training time
1                   2                458.16 (3.82)
2                   4                364.61 (1.65)
4                   8                252.37 (3.18)
8                   16               160.97 (1.36)

While the training time decreases linearly with the number of workers/GPUs, we observe some penalty. Specifically, doubling the number of workers does not halve the training time. This is most likely due to additional communication between processes and the transfer of large model weights. We are also only training with a batch size of one because our GPU memory is limited. On larger GPUs with higher batch sizes we would expect a greater benefit from scaling out.

Run the example

First, we download the pre-trained stable diffusion model as a starting point.

We will then train this model with a few images of our subject.

To achieve this, we choose a non-word as an identifier, e.g. unqtkn. When fine-tuning the model with our subject, we will teach it that the prompt is A photo of a unqtkn . After fine-tuning we can run inference with this specific prompt. For instance: A photo of a unqtkn will create an image of our subject.

Step 0: Preparation

Clone the Ray repository, go to the example directory, and install dependencies.
git clone https://github.com/ray-project/ray.git
cd ray/python/ray/air/examples/dreambooth
pip install -Ur requirements.txt

Prepare some directories and environment variables.

export DATA_PREFIX="./"
export ORIG_MODEL_NAME="CompVis/stable-diffusion-v1-4"
export ORIG_MODEL_HASH="249dd2d739844dea6a0bc7fc27b3c1d014720b28"
export ORIG_MODEL_DIR="$DATA_PREFIX/model-orig"
export ORIG_MODEL_PATH="$ORIG_MODEL_DIR/models--${ORIG_MODEL_NAME/\//--}/snapshots/$ORIG_MODEL_HASH"
export TUNED_MODEL_DIR="$DATA_PREFIX/model-tuned"
export IMAGES_REG_DIR="$DATA_PREFIX/images-reg"
export IMAGES_OWN_DIR="$DATA_PREFIX/images-own"
export IMAGES_NEW_DIR="$DATA_PREFIX/images-new"

export CLASS_NAME="toy car"

mkdir -p $ORIG_MODEL_DIR $TUNED_MODEL_DIR $IMAGES_REG_DIR $IMAGES_OWN_DIR $IMAGES_NEW_DIR

Copy some images for fine-tuning into $IMAGES_OWN_DIR.

Step 1: Download the pre-trained model

Download and cache a pre-trained Stable-Diffusion model locally. Default model and version are CompVis/stable-diffusion-v1-4 at git hash 3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4.

python cache_model.py --model_dir=$ORIG_MODEL_DIR --model_name=$ORIG_MODEL_NAME --revision=$ORIG_MODEL_HASH

Note that the actual model files will be downloaded into the snapshots directory.

Step 2: Create the regularization images

Create a regularization image set for a class of subjects:

python run_model.py \
  --model_dir=$ORIG_MODEL_PATH \
  --output_dir=$IMAGES_REG_DIR \
  --prompts="photo of a $CLASS_NAME" \
  --num_samples_per_prompt=200

Step 3: Fine-tune the model

Save a few (4 to 5) images of the subject being fine-tuned in a local directory. Then launch the training job with:

python train.py \
  --model_dir=$ORIG_MODEL_PATH \
  --output_dir=$TUNED_MODEL_DIR \
  --instance_images_dir=$IMAGES_OWN_DIR \
  --instance_prompt="a photo of unqtkn $CLASS_NAME" \
  --class_images_dir=$IMAGES_REG_DIR \
  --class_prompt="a photo of a $CLASS_NAME"

Step 4: Generate images of our subject

Try your model with the same command line as Step 2, but point to your own model this time!

python run_model.py \
  --model_dir=$TUNED_MODEL_DIR \
  --output_dir=$IMAGES_NEW_DIR \
  --prompts="photo of a unqtkn $CLASS_NAME" \
  --num_samples_per_prompt=20

Fine-tune dolly-v2-7b with Ray AIR LightningTrainer and FSDP

In this example, we demonstrate how to use Ray AIR to fine-tune a dolly-v2-7b model. dolly-v2-7b is a 7 billion parameter causal language model created by Databricks, derived from EleutherAI's Pythia-6.9b, and fine-tuned on a ~15K record instruction corpus.

We load the pre-trained model from the Hugging Face model hub into a LightningModule and launch an FSDP fine-tuning job across 16 T4 GPUs with the help of Ray LightningTrainer. It is straightforward to fine-tune other similar large language models in the same manner, as shown in this example.

Before starting this example, we highly recommend reading Ray AIR Key Concepts and Ray Data Key Concepts.

Set up Ray cluster

In this example, we are using a Ray cluster with 16 g4dn.4xlarge instances. Each instance has one Tesla T4 GPU (16GiB memory).

We define a runtime_env to install the necessary Python libraries on each node. You can skip this step if you have already installed all the required packages in your workers' base image. We tested this example with pytorch_lightning==2.0.2 and transformers==4.29.2.
import ray ray.init( runtime_env={ "pip": [ "datasets", "evaluate", "transformers>=4.26.0", "torch>=1.12.0", "pytorch_lightning>=2.0", ] } ) MODEL_NAME = "databricks/dolly-v2-7b" Prepare your data We are using tiny_shakespeare for fine-tuning, which contains 40,000 lines of Shakespeare from a variety of Shakespeare’s plays. Featured in Andrej Karpathy’s blog post ‘The Unreasonable Effectiveness of Recurrent Neural Networks’. Dataset samples: BAPTISTA: I know him well: you are welcome for his sake. GREMIO: Saving your tale, Petruchio, I pray, Let us, that are poor petitioners, speak too: Baccare! you are marvellous forward. PETRUCHIO: O, pardon me, Signior Gremio; I would fain be doing. Here, we have adopted similar pre-processing logic from another demo: GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed. import ray import pandas as pd from datasets import load_dataset from ray.data.preprocessors import BatchMapper, Chain from transformers import AutoTokenizer, AutoModelForCausalLM def split_text(batch: pd.DataFrame) -> pd.DataFrame: text = list(batch["text"]) flat_text = "".join(text) split_text = [ x.strip() for x in flat_text.split("\n") if x.strip() and not x.strip()[-1] == ":" ] return pd.DataFrame(split_text, columns=["text"]) def tokenize(batch: pd.DataFrame) -> dict: tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left") tokenizer.pad_token = tokenizer.eos_token ret = tokenizer( list(batch["text"]), truncation=True, max_length=256, padding="max_length", return_tensors="np", ) ret["labels"] = ret["input_ids"].copy() return dict(ret) splitter = BatchMapper(split_text, batch_format="pandas") tokenizer = BatchMapper(tokenize, batch_format="pandas") preprocessor = Chain(splitter, tokenizer) hf_dataset = load_dataset("tiny_shakespeare") ray_datasets = ray.data.from_huggingface(hf_dataset) We first split the original paragraphs into multiple sentences, then tokenize them. Here are some samples: ds = ray_datasets["train"] splitter.fit_transform(ds).take(10) [{'text': 'Before we proceed any further, hear me speak.'}, {'text': 'Speak, speak.'}, {'text': 'You are all resolved rather to die than to famish?'}, {'text': 'Resolved. resolved.'}, {'text': 'First, you know Caius Marcius is chief enemy to the people.'}, {'text': "We know't, we know't."}, {'text': "Let us kill him, and we'll have corn at our own price."}, {'text': "Is't a verdict?"}, {'text': "No more talking on't; let it be done: away, away!"}, {'text': 'One word, good citizens.'}] Define your lightning model In this example, we use the dolly-v2-7b model for finetuning. It is an instruction-following large language model trained on the Databricks machine learning platform that is licensed for commercial use. We load the model weights from Huggingface Model Hub and encapsulate it into a pl.LightningModule. Make sure you pass the FSDP wrapped model parameters self.trainer.model.parameters() into the optimizer, instead of self.model.parameters(). 
import torch
import pytorch_lightning as pl


class DollyV2Model(pl.LightningModule):
    def __init__(self, lr=2e-5, eps=1e-8):
        super().__init__()
        self.lr = lr
        self.eps = eps
        self.model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
        self.predictions = []
        self.references = []

    def forward(self, batch):
        outputs = self.model(
            batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        )
        return outputs.loss

    def training_step(self, batch, batch_idx):
        loss = self.forward(batch)
        self.log("train_loss", loss, prog_bar=True, on_step=True)
        return loss

    def configure_optimizers(self):
        if self.global_rank == 0:
            print(self.trainer.model)
        return torch.optim.AdamW(self.trainer.model.parameters(), lr=self.lr, eps=self.eps)

Configure your FSDP strategy

As dolly-v2-7b is a relatively large model, it cannot fit on a single commodity GPU. In this example, we use the FSDP strategy to shard model parameters across multiple workers. This lets us avoid GPU out-of-memory issues and supports a larger global batch size.

Image source: Fully Sharded Data Parallel: faster AI training with fewer GPUs

FSDP is a type of data parallelism that shards model parameters, optimizer states, and gradients across DDP ranks. It was inspired by Xu et al. as well as ZeRO Stage 3 from DeepSpeed. You can refer to these blogs for more information:

Fully Sharded Data Parallel: faster AI training with fewer GPUs
Getting Started with Fully Sharded Data Parallel (FSDP)
PyTorch FSDP Tutorial

To start training with Lightning's FSDPStrategy, you only need to provide the initialization arguments in LightningConfigBuilder.strategy(). Behind the scenes, LightningTrainer handles the cluster environment settings and job launching.

import functools

from ray.train.lightning import LightningTrainer, LightningConfigBuilder
from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.fsdp import ShardingStrategy, BackwardPrefetch
from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXLayer

# Define the model sharding policy:
# Wrap every GPTNeoXLayer as its own FSDP instance.
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={GPTNeoXLayer},
)

# Aggregate all arguments for LightningTrainer.
lightning_config = (
    LightningConfigBuilder()
    .module(cls=DollyV2Model, lr=2e-5, eps=1e-8)
    .trainer(
        max_epochs=1,
        accelerator="gpu",
        precision="16-mixed",
    )
    .strategy(
        name="fsdp",
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        forward_prefetch=True,
        auto_wrap_policy=auto_wrap_policy,
        limit_all_gathers=True,
        activation_checkpointing=[GPTNeoXLayer],
    )
    .checkpointing(save_top_k=0, save_weights_only=True, save_last=True)
)

Some tips for FSDP configuration:

sharding_strategy:
ShardingStrategy.NO_SHARD: Parameters, gradients, and optimizer states are not sharded. Similar to DDP.
ShardingStrategy.SHARD_GRAD_OP: Gradients and optimizer states are sharded during computation, and parameters are additionally sharded outside computation. Similar to ZeRO stage-2.
ShardingStrategy.FULL_SHARD: Parameters, gradients, and optimizer states are all sharded. It has the lowest GPU memory usage of the three options. Similar to ZeRO stage-3.
auto_wrap_policy: Model layers are often wrapped with FSDP in a layered fashion. This means that only the layers in a single FSDP instance need to gather all of their parameters on a single device during the forward or backward pass. Use transformer_auto_wrap_policy to automatically wrap each Transformer block into its own FSDP instance.
backward_prefetch and forward_prefetch: Overlap the upcoming all-gather with the current forward/backward pass. This can improve throughput but may slightly increase peak memory usage.

Fine-tune with LightningTrainer

num_workers = 16
batch_size_per_worker = 10

Since this example runs on multiple nodes, we need to persist checkpoints and other outputs to external storage so that they remain accessible after training has completed. You should set up cloud storage or NFS, then replace storage_path with your own cloud bucket URI or NFS path. See the storage guide for more details.

storage_path="s3://your-bucket-here"  # TODO: Set up cloud storage
# storage_path="/mnt/path/to/nfs"     # TODO: Alternatively, set up NFS

from ray.tune.syncer import SyncConfig

# Save AIR checkpoints according to the performance on the validation set.
run_config = RunConfig(
    storage_path=storage_path,
    name="finetune_dolly-v2-7b",
    checkpoint_config=CheckpointConfig(),
    sync_config=SyncConfig(sync_artifacts=False),
)

# Scale the training workload across 16 GPUs.
# You can change this config based on your compute resources.
scaling_config = ScalingConfig(
    num_workers=num_workers,
    use_gpu=True,
    resources_per_worker={"CPU": 12, "GPU": 1},
)

trainer = LightningTrainer(
    lightning_config=lightning_config.build(),
    run_config=run_config,
    scaling_config=scaling_config,
    datasets={"train": ray_datasets["train"]},
    datasets_iter_config={"batch_size": batch_size_per_worker},
    preprocessor=preprocessor,
)
result = trainer.fit()
result
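The call to trainer.fit() blocks until the run finishes and returns a Ray AIR Result object; the Tune progress output from this run is reproduced below. As a minimal sketch of how you might inspect the returned object afterwards (Result.metrics, Result.checkpoint, and Result.error are part of the Ray AIR Result API; the specific metric keys are assumptions based on what this example logs):

# Minimal sketch: inspect the Result returned by trainer.fit().
print(result.metrics)      # last reported metrics, e.g. train_loss, epoch, step
print(result.checkpoint)   # checkpoint with the fine-tuned weights (used for inference below)
if result.error:
    raise result.error     # surface any failure from the training workers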

Tune Status

Current time: 2023-05-05 01:03:12
Running for: 00:45:50.28
Memory: 35.4/124.4 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 0/272 CPUs, 0/16 GPUs (0.0/16.0 accelerator_type:T4)

Trial Status

Trial name                     status       loc                   iter   total time (s)   train_loss   epoch   step
LightningTrainer_e0990_00000   TERMINATED   10.0.102.147:41219       1          2699.78     0.166992       0    135
2023-05-05 00:17:21,842 WARNING trial_runner.py:1607 -- The maximum number of pending trials has been automatically set to the number of available cluster CPUs, which is high (299 CPUs/pending trials). If you're running an experiment with a large number of trials, this could lead to scheduling overhead. In this case, consider setting the `TUNE_MAX_PENDING_TRIALS_PG` environment variable to the desired maximum number of concurrent trials.
(LightningTrainer pid=41219) 2023-05-05 00:17:28,673 INFO backend_executor.py:128 -- Starting distributed worker processes: ['41376 (10.0.102.147)', '8301 (10.0.67.96)', '8263 (10.0.103.36)', '27794 (10.0.105.149)', '8088 (10.0.110.210)', '8238 (10.0.106.19)', '8225 (10.0.81.63)', '8200 (10.0.106.22)', '8231 (10.0.90.160)', '8345 (10.0.98.168)', '28207 (10.0.76.146)', '8213 (10.0.115.72)', '8272 (10.0.92.209)', '8247 (10.0.74.31)', '27629 (10.0.68.102)', '8224 (10.0.88.86)']
(RayTrainWorker pid=41376) 2023-05-05 00:17:30,953 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=16]
(pid=41219) ... TaskPoolMapOperator[BatchMapper->BatchMapper] -> AllToAllOperator[RandomizeBlockOrder]

Trial Progress

Trial name: LightningTrainer_e0990_00000
_report_on: train_epoch_end
date: 2023-05-05_01-02-26
done: True
epoch: 0
experiment_tag: 0
hostname: ip-10-0-102-147
iterations_since_restore: 1
node_ip: 10.0.102.147
pid: 41219
should_checkpoint: True
step: 135
time_since_restore: 2699.78
time_this_iter_s: 2699.78
time_total_s: 2699.78
timestamp: 1683273746
train_loss: 0.166992
training_iteration: 1
trial_id: e0990_00000
(RayTrainWorker pid=41376) `Trainer.fit` stopped: `max_epochs=1` reached. (RayTrainWorker pid=41376) RayFSDPStrategy: tearing down strategy... We finished training in 2361s. The price for an on-demand g4dn.4xlarge instance is $1.204/hour, while a g4dn.4xlarge instance costs $2.176/hour. The total cost would be ($1.204 * 15 + $2.176) * 2699 / 3600 = $15.17. Text-generation with HuggingFace Pipeline We can use the HuggingFace Pipeline to generate predictions from our fine-tuned model. Let’s input some prompts and see if our tuned Dolly can speak like Shakespeare: from transformers import pipeline tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="right") dolly = result.checkpoint.get_model(model_class=DollyV2Model, map_location=torch.device("cpu")) nlp_pipeline = pipeline( task="text-generation", model=dolly.model, tokenizer=tokenizer, device_map="auto" ) for prompt in ["This is", "I am", "Once more"]: print(nlp_pipeline(prompt, max_new_tokens=20, do_sample=True, pad_token_id=tokenizer.eos_token_id)) [{'generated_text': 'This is the very place, my lord, where I was born.'}] [{'generated_text': 'I am a man of a thousand lives, and I will live.'}] [{'generated_text': 'Once more, my lord, I beseech you, hear me speak.'}] References: PyTorch FSDP Tutorial Getting Started with Fully Sharded Data Parallel(FSDP) Fully Sharded Data Parallel: faster AI training with fewer GPUs Hugging Face: dolly-v2-7b Model Card Hugging Face: Handling big models for inference Ray AIR API Preprocessor Preprocessor Interface Constructor Preprocessor() Implements an ML preprocessing operation. ray.data.preprocessor.Preprocessor class ray.data.preprocessor.Preprocessor[source] Bases: abc.ABC Implements an ML preprocessing operation. Preprocessors are stateful objects that can be fitted against a Dataset and used to transform both local data batches and distributed data. For example, a Normalization preprocessor may calculate the mean and stdev of a field during fitting, and uses these attributes to implement its normalization transform. Preprocessors can also be stateless and transform data without needed to be fitted. For example, a preprocessor may simply remove a column, which does not require any state to be fitted. If you are implementing your own Preprocessor sub-class, you should override the following: _fit if your preprocessor is stateful. Otherwise, set _is_fittable=False. _transform_pandas and/or _transform_numpy for best performance, implement both. Otherwise, the data will be converted to the match the implemented method. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods __init__() fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessor.Preprocessor.__init__ Preprocessor.__init__() ray.data.preprocessor.Preprocessor.fit Preprocessor.fit(ds: Dataset) -> Preprocessor[source] Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. 
Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessor.Preprocessor.fit_transform Preprocessor.fit_transform(ds: Dataset) -> Dataset[source] Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessor.Preprocessor.preferred_batch_format classmethod Preprocessor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat[source] Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessor.Preprocessor.transform Preprocessor.transform(ds: Dataset) -> Dataset[source] Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessor.Preprocessor.transform_batch Preprocessor.transform_batch(data: DataBatchType) -> DataBatchType[source] Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessor.Preprocessor.transform_stats Preprocessor.transform_stats() -> Optional[str][source] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Fit/Transform APIs fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. Generic Preprocessors BatchMapper(fn, batch_format[, batch_size]) Apply an arbitrary operation to a dataset. Chain(*preprocessors) Combine multiple preprocessors into a single Preprocessor. Concatenator([output_column_name, include, ...]) Combine numeric columns into a column of type TensorDtype. SimpleImputer(columns[, strategy, fill_value]) Replace missing values with imputed values. ray.data.preprocessors.BatchMapper class ray.data.preprocessors.BatchMapper(fn: Union[Callable[[pandas.DataFrame], pandas.DataFrame], Callable[[Union[numpy.ndarray, Dict[str, numpy.ndarray]]], Union[numpy.ndarray, Dict[str, numpy.ndarray]]]], batch_format: Optional[ray.air.util.data_batch_conversion.BatchFormat], batch_size: Optional[Union[int, typing_extensions.Literal[default]]] = 'default')[source] Bases: ray.data.preprocessor.Preprocessor Apply an arbitrary operation to a dataset. BatchMapper applies a user-defined function to batches of a dataset. A batch is a Pandas DataFrame that represents a small amount of data. By modifying batches instead of individual records, this class can efficiently transform a dataset with vectorized operations. 
Use this preprocessor to apply stateless operations that aren’t already built-in. BatchMapper doesn’t need to be fit. You can call transform without calling fit. Examples Use BatchMapper to apply arbitrary operations like dropping a column. >>> import pandas as pd >>> import numpy as np >>> from typing import Dict >>> import ray >>> from ray.data.preprocessors import BatchMapper >>> >>> df = pd.DataFrame({"X": [0, 1, 2], "Y": [3, 4, 5]}) >>> ds = ray.data.from_pandas(df) >>> >>> def fn(batch: pd.DataFrame) -> pd.DataFrame: ... return batch.drop("Y", axis="columns") >>> >>> preprocessor = BatchMapper(fn, batch_format="pandas") >>> preprocessor.transform(ds) Dataset(num_blocks=1, num_rows=3, schema={X: int64}) >>> >>> def fn_numpy(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: ... return {"X": batch["X"]} >>> preprocessor = BatchMapper(fn_numpy, batch_format="numpy") >>> preprocessor.transform(ds) Dataset(num_blocks=1, num_rows=3, schema={X: int64}) Parameters fn – The function to apply to data batches. batch_size – The desired number of rows in each data batch provided to fn. Semantics are the same as in `dataset.map_batches(): specifying None wil use the entire underlying blocks as batches (blocks may contain different number of rows) and the actual size of the batch provided to fn may be smaller than batch_size if batch_size doesn’t evenly divide the block(s) sent to a given map task. Defaults to 4096, which is the same default value as dataset.map_batches(). batch_format – The preferred batch format to use in UDF. If not given, we will infer based on the input dataset data format. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.BatchMapper.fit BatchMapper.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.BatchMapper.fit_transform BatchMapper.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.BatchMapper.preferred_batch_format classmethod BatchMapper.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. 
DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.BatchMapper.transform BatchMapper.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.BatchMapper.transform_batch BatchMapper.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.BatchMapper.transform_stats BatchMapper.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.Chain class ray.data.preprocessors.Chain(*preprocessors: ray.data.preprocessor.Preprocessor)[source] Bases: ray.data.preprocessor.Preprocessor Combine multiple preprocessors into a single Preprocessor. When you call fit, each preprocessor is fit on the dataset produced by the preceeding preprocessor’s fit_transform. Example >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import * >>> >>> df = pd.DataFrame({ ... "X0": [0, 1, 2], ... "X1": [3, 4, 5], ... "Y": ["orange", "blue", "orange"], ... }) >>> ds = ray.data.from_pandas(df) >>> >>> preprocessor = Chain( ... StandardScaler(columns=["X0", "X1"]), ... Concatenator(include=["X0", "X1"], output_column_name="X"), ... LabelEncoder(label_column="Y") ... ) >>> preprocessor.fit_transform(ds).to_pandas() Y X 0 1 [-1.224744871391589, -1.224744871391589] 1 0 [0.0, 0.0] 2 1 [1.224744871391589, 1.224744871391589] Parameters preprocessors – The preprocessors to sequentially compose. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.Chain.fit Chain.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.Chain.preferred_batch_format classmethod Chain.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.Chain.transform Chain.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. 
Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.Chain.transform_batch Chain.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.Chain.transform_stats Chain.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.Concatenator class ray.data.preprocessors.Concatenator(output_column_name: str = 'concat_out', include: Optional[List[str]] = None, exclude: Optional[Union[str, List[str]]] = None, dtype: Optional[numpy.dtype] = None, raise_if_missing: bool = False)[source] Bases: ray.data.preprocessor.Preprocessor Combine numeric columns into a column of type TensorDtype. This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape (m,), where m is the number of columns concatenated. The m concatenated columns are dropped after concatenation. Examples >>> import numpy as np >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import Concatenator Concatenator combines numeric columns into a column of TensorDtype. >>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]}) >>> ds = ray.data.from_pandas(df) >>> concatenator = Concatenator() >>> concatenator.fit_transform(ds).to_pandas() concat_out 0 [0.0, 0.5] 1 [3.0, 0.2] 2 [1.0, 0.9] By default, the created column is called "concat_out", but you can specify a different name. >>> concatenator = Concatenator(output_column_name="tensor") >>> concatenator.fit_transform(ds).to_pandas() tensor 0 [0.0, 0.5] 1 [3.0, 0.2] 2 [1.0, 0.9] Sometimes, you might not want to concatenate all of of the columns in your dataset. In this case, you can exclude columns with the exclude parameter. >>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["blue", "orange", "blue"]}) >>> ds = ray.data.from_pandas(df) >>> concatenator = Concatenator(exclude=["Y"]) >>> concatenator.fit_transform(ds).to_pandas() Y concat_out 0 blue [0.0, 0.5] 1 orange [3.0, 0.2] 2 blue [1.0, 0.9] Alternatively, you can specify which columns to concatenate with the include parameter. >>> concatenator = Concatenator(include=["X0", "X1"]) >>> concatenator.fit_transform(ds).to_pandas() Y concat_out 0 blue [0.0, 0.5] 1 orange [3.0, 0.2] 2 blue [1.0, 0.9] Note that if a column is in both include and exclude, the column is excluded. >>> concatenator = Concatenator(include=["X0", "X1", "Y"], exclude=["Y"]) >>> concatenator.fit_transform(ds).to_pandas() Y concat_out 0 blue [0.0, 0.5] 1 orange [3.0, 0.2] 2 blue [1.0, 0.9] By default, the concatenated tensor is a dtype common to the input columns. However, you can also explicitly set the dtype with the dtype parameter. >>> concatenator = Concatenator(include=["X0", "X1"], dtype=np.float32) >>> concatenator.fit_transform(ds) Dataset(num_blocks=1, num_rows=3, schema={Y: object, concat_out: TensorDtype(shape=(2,), dtype=float32)}) Parameters output_column_name – The desired name for the new column. Defaults to "concat_out". include – A list of columns to concatenate. 
If None, all columns are concatenated. exclude – A list of column to exclude from concatenation. If a column is in both include and exclude, the column is excluded from concatenation. dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules. raise_if_missing – If True, an error is raised if any of the columns in include or exclude don’t exist. Defaults to False. Raises ValueError – if raise_if_missing is True and a column in include or exclude doesn’t exist in the dataset. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.Concatenator.fit Concatenator.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.Concatenator.fit_transform Concatenator.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.Concatenator.preferred_batch_format classmethod Concatenator.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.Concatenator.transform Concatenator.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.Concatenator.transform_batch Concatenator.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.Concatenator.transform_stats Concatenator.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. 
DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.SimpleImputer class ray.data.preprocessors.SimpleImputer(columns: List[str], strategy: str = 'mean', fill_value: Optional[Union[str, numbers.Number]] = None)[source] Bases: ray.data.preprocessor.Preprocessor Replace missing values with imputed values. Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import SimpleImputer >>> df = pd.DataFrame({"X": [0, None, 3, 3], "Y": [None, "b", "c", "c"]}) >>> ds = ray.data.from_pandas(df) >>> ds.to_pandas() X Y 0 0.0 None 1 NaN b 2 3.0 c 3 3.0 c The "mean" strategy imputes missing values with the mean of non-missing values. This strategy doesn’t work with categorical data. >>> preprocessor = SimpleImputer(columns=["X"], strategy="mean") >>> preprocessor.fit_transform(ds).to_pandas() X Y 0 0.0 None 1 2.0 b 2 3.0 c 3 3.0 c The "most_frequent" strategy imputes missing values with the most frequent value in each column. >>> preprocessor = SimpleImputer(columns=["X", "Y"], strategy="most_frequent") >>> preprocessor.fit_transform(ds).to_pandas() X Y 0 0.0 c 1 3.0 b 2 3.0 c 3 3.0 c The "constant" strategy imputes missing values with the value specified by fill_value. >>> preprocessor = SimpleImputer( ... columns=["Y"], ... strategy="constant", ... fill_value="?", ... ) >>> preprocessor.fit_transform(ds).to_pandas() X Y 0 0.0 ? 1 NaN b 2 3.0 c 3 3.0 c Parameters columns – The columns to apply imputation to. strategy – How imputed values are chosen."mean": The mean of non-missing values. This strategy only works with numeric columns. "most_frequent": The most common value. "constant": The value passed to fill_value. fill_value – The value to use when strategy is "constant". Raises ValueError – if strategy is not "mean", "most_frequent", or "constant". PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.SimpleImputer.fit SimpleImputer.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.SimpleImputer.fit_transform SimpleImputer.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.SimpleImputer.preferred_batch_format classmethod SimpleImputer.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. 
Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.SimpleImputer.transform SimpleImputer.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.SimpleImputer.transform_batch SimpleImputer.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.SimpleImputer.transform_stats SimpleImputer.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Categorical Encoders Categorizer(columns[, dtypes]) Convert columns to pd.CategoricalDtype. LabelEncoder(label_column) Encode labels as integer targets. MultiHotEncoder(columns, *[, max_categories]) Multi-hot encode categorical data. OneHotEncoder(columns, *[, max_categories]) One-hot encode categorical data. OrdinalEncoder(columns, *[, encode_lists]) Encode values within columns as ordered integer values. ray.data.preprocessors.Categorizer class ray.data.preprocessors.Categorizer(columns: List[str], dtypes: Optional[Dict[str, pandas.core.dtypes.dtypes.CategoricalDtype]] = None)[source] Bases: ray.data.preprocessor.Preprocessor Convert columns to pd.CategoricalDtype. Use this preprocessor with frameworks that have built-in support for pd.CategoricalDtype like LightGBM. If you don’t specify dtypes, fit this preprocessor before splitting your dataset into train and test splits. This ensures categories are consistent across splits. Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import Categorizer >>> >>> df = pd.DataFrame( ... { ... "sex": ["male", "female", "male", "female"], ... "level": ["L4", "L5", "L3", "L4"], ... }) >>> ds = ray.data.from_pandas(df) >>> categorizer = Categorizer(columns=["sex", "level"]) >>> categorizer.fit_transform(ds).schema().types [CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)] If you know the categories in advance, you can specify the categories with the dtypes parameter. >>> categorizer = Categorizer( ... columns=["sex", "level"], ... dtypes={"level": pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)}, ... ) >>> categorizer.fit_transform(ds).schema().types [CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5', 'L6'], ordered=True)] Parameters columns – The columns to convert to pd.CategoricalDtype. dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects. If you don’t include a column in dtypes, the categories are inferred. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. 
preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.Categorizer.fit Categorizer.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.Categorizer.fit_transform Categorizer.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.Categorizer.preferred_batch_format classmethod Categorizer.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.Categorizer.transform Categorizer.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.Categorizer.transform_batch Categorizer.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.Categorizer.transform_stats Categorizer.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.LabelEncoder class ray.data.preprocessors.LabelEncoder(label_column: str)[source] Bases: ray.data.preprocessor.Preprocessor Encode labels as integer targets. LabelEncoder encodes labels as integer targets that range from 0 to n - 1, where n is the number of unique labels. If you transform a label that isn’t in the fitted datset, then the label is encoded as float("nan"). Examples >>> import pandas as pd >>> import ray >>> df = pd.DataFrame({ ... "sepal_width": [5.1, 7, 4.9, 6.2], ... "sepal_height": [3.5, 3.2, 3, 3.4], ... "species": ["setosa", "versicolor", "setosa", "virginica"] ... 
}) >>> ds = ray.data.from_pandas(df) >>> >>> from ray.data.preprocessors import LabelEncoder >>> encoder = LabelEncoder(label_column="species") >>> encoder.fit_transform(ds).to_pandas() sepal_width sepal_height species 0 5.1 3.5 0 1 7.0 3.2 1 2 4.9 3.0 0 3 6.2 3.4 2 If you transform a label not present in the original dataset, then the new label is encoded as float("nan"). >>> df = pd.DataFrame({ ... "sepal_width": [4.2], ... "sepal_height": [2.7], ... "species": ["bracteata"] ... }) >>> ds = ray.data.from_pandas(df) >>> encoder.transform(ds).to_pandas() sepal_width sepal_height species 0 4.2 2.7 NaN Parameters label_column – A column containing labels that you want to encode. OrdinalEncoder If you’re encoding ordered features, use OrdinalEncoder instead of LabelEncoder. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.LabelEncoder.fit LabelEncoder.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.LabelEncoder.fit_transform LabelEncoder.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.LabelEncoder.preferred_batch_format classmethod LabelEncoder.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.LabelEncoder.transform LabelEncoder.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.LabelEncoder.transform_batch LabelEncoder.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. 
Return type DataBatchTyperay.data.preprocessors.LabelEncoder.transform_stats LabelEncoder.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.MultiHotEncoder class ray.data.preprocessors.MultiHotEncoder(columns: List[str], *, max_categories: Optional[Dict[str, int]] = None)[source] Bases: ray.data.preprocessor.Preprocessor Multi-hot encode categorical data. This preprocessor replaces each list of categories with an m-length binary list, where m is the number of unique categories in the column or the value specified in max_categories. The i\text{-th} element of the binary list is 1 if category i is in the input list and 0 otherwise. Columns must contain hashable objects or lists of hashable objects. Also, you can’t have both types in the same column. The logic is similar to scikit-learn’s MultiLabelBinarizer. Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import MultiHotEncoder >>> >>> df = pd.DataFrame({ ... "name": ["Shaolin Soccer", "Moana", "The Smartest Guys in the Room"], ... "genre": [ ... ["comedy", "action", "sports"], ... ["animation", "comedy", "action"], ... ["documentary"], ... ], ... }) >>> ds = ray.data.from_pandas(df) >>> >>> encoder = MultiHotEncoder(columns=["genre"]) >>> encoder.fit_transform(ds).to_pandas() name genre 0 Shaolin Soccer [1, 0, 1, 0, 1] 1 Moana [1, 1, 1, 0, 0] 2 The Smartest Guys in the Room [0, 0, 0, 1, 0] If you specify max_categories, then MultiHotEncoder creates features for only the most frequent categories. >>> encoder = MultiHotEncoder(columns=["genre"], max_categories={"genre": 3}) >>> encoder.fit_transform(ds).to_pandas() name genre 0 Shaolin Soccer [1, 1, 1] 1 Moana [1, 1, 0] 2 The Smartest Guys in the Room [0, 0, 0] >>> encoder.stats_ OrderedDict([('unique_values(genre)', {'comedy': 0, 'action': 1, 'sports': 2})]) Parameters columns – The columns to separately encode. max_categories – The maximum number of features to create for each column. If a value isn’t specified for a column, then a feature is created for every unique category in that column. OneHotEncoder If you’re encoding individual categories instead of lists of categories, use OneHotEncoder. OrdinalEncoder If your categories are ordered, you may want to use OrdinalEncoder. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.MultiHotEncoder.fit MultiHotEncoder.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.MultiHotEncoder.fit_transform MultiHotEncoder.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. 
Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.MultiHotEncoder.preferred_batch_format classmethod MultiHotEncoder.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.MultiHotEncoder.transform MultiHotEncoder.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.MultiHotEncoder.transform_batch MultiHotEncoder.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.MultiHotEncoder.transform_stats MultiHotEncoder.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.OneHotEncoder class ray.data.preprocessors.OneHotEncoder(columns: List[str], *, max_categories: Optional[Dict[str, int]] = None)[source] Bases: ray.data.preprocessor.Preprocessor One-hot encode categorical data. This preprocessor creates a column named {column}_{category} for each unique {category} in {column}. The value of a column is 1 if the category matches and 0 otherwise. If you encode an infrequent category (see max_categories) or a category that isn’t in the fitted dataset, then the category is encoded as all 0s. Columns must contain hashable objects or lists of hashable objects. Lists are treated as categories. If you want to encode individual list elements, use MultiHotEncoder. Example >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import OneHotEncoder >>> >>> df = pd.DataFrame({"color": ["red", "green", "red", "red", "blue", "green"]}) >>> ds = ray.data.from_pandas(df) >>> encoder = OneHotEncoder(columns=["color"]) >>> encoder.fit_transform(ds).to_pandas() color_blue color_green color_red 0 0 0 1 1 0 1 0 2 0 0 1 3 0 0 1 4 1 0 0 5 0 1 0 If you one-hot encode a value that isn’t in the fitted dataset, then the value is encoded with zeros. >>> df = pd.DataFrame({"color": ["yellow"]}) >>> batch = ray.data.from_pandas(df) >>> encoder.transform(batch).to_pandas() color_blue color_green color_red 0 0 0 0 Likewise, if you one-hot encode an infrequent value, then the value is encoded with zeros. >>> encoder = OneHotEncoder(columns=["color"], max_categories={"color": 2}) >>> encoder.fit_transform(ds).to_pandas() color_red color_green 0 1 0 1 0 1 2 1 0 3 1 0 4 0 0 5 0 1 Parameters columns – The columns to separately encode. max_categories – The maximum number of features to create for each column. 
If a value isn’t specified for a column, then a feature is created for every category in that column. MultiHotEncoder If you want to encode individual list elements, use MultiHotEncoder. OrdinalEncoder If your categories are ordered, you may want to use OrdinalEncoder. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.OneHotEncoder.fit OneHotEncoder.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.OneHotEncoder.fit_transform OneHotEncoder.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.OneHotEncoder.preferred_batch_format classmethod OneHotEncoder.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.OneHotEncoder.transform OneHotEncoder.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.OneHotEncoder.transform_batch OneHotEncoder.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.OneHotEncoder.transform_stats OneHotEncoder.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.OrdinalEncoder class ray.data.preprocessors.OrdinalEncoder(columns: List[str], *, encode_lists: bool = True)[source] Bases: ray.data.preprocessor.Preprocessor Encode values within columns as ordered integer values. OrdinalEncoder encodes categorical features as integers that range from 0 to n - 1, where n is the number of categories. 
If you transform a value that isn’t in the fitted datset, then the value is encoded as float("nan"). Columns must contain either hashable values or lists of hashable values. Also, you can’t have both scalars and lists in the same column. Examples Use OrdinalEncoder to encode categorical features as integers. >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import OrdinalEncoder >>> df = pd.DataFrame({ ... "sex": ["male", "female", "male", "female"], ... "level": ["L4", "L5", "L3", "L4"], ... }) >>> ds = ray.data.from_pandas(df) >>> encoder = OrdinalEncoder(columns=["sex", "level"]) >>> encoder.fit_transform(ds).to_pandas() sex level 0 1 1 1 0 2 2 1 0 3 0 1 If you transform a value not present in the original dataset, then the value is encoded as float("nan"). >>> df = pd.DataFrame({"sex": ["female"], "level": ["L6"]}) >>> ds = ray.data.from_pandas(df) >>> encoder.transform(ds).to_pandas() sex level 0 0 NaN OrdinalEncoder can also encode categories in a list. >>> df = pd.DataFrame({ ... "name": ["Shaolin Soccer", "Moana", "The Smartest Guys in the Room"], ... "genre": [ ... ["comedy", "action", "sports"], ... ["animation", "comedy", "action"], ... ["documentary"], ... ], ... }) >>> ds = ray.data.from_pandas(df) >>> encoder = OrdinalEncoder(columns=["genre"]) >>> encoder.fit_transform(ds).to_pandas() name genre 0 Shaolin Soccer [2, 0, 4] 1 Moana [1, 2, 0] 2 The Smartest Guys in the Room [3] Parameters columns – The columns to separately encode. encode_lists – If True, encode list elements. If False, encode whole lists (i.e., replace each list with an integer). True by default. OneHotEncoder Another preprocessor that encodes categorical data. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.OrdinalEncoder.fit OrdinalEncoder.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.OrdinalEncoder.fit_transform OrdinalEncoder.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.OrdinalEncoder.preferred_batch_format classmethod OrdinalEncoder.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. 
DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.OrdinalEncoder.transform OrdinalEncoder.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.OrdinalEncoder.transform_batch OrdinalEncoder.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.OrdinalEncoder.transform_stats OrdinalEncoder.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Feature Scalers MaxAbsScaler(columns) Scale each column by its absolute max value. MinMaxScaler(columns) Scale each column by its range. Normalizer(columns[, norm]) Scales each sample to have unit norm. PowerTransformer(columns, power[, method]) Apply a power transform to make your data more normally distributed. RobustScaler(columns[, quantile_range]) Scale and translate each column using quantiles. StandardScaler(columns) Translate and scale each column by its mean and standard deviation, respectively. ray.data.preprocessors.MaxAbsScaler class ray.data.preprocessors.MaxAbsScaler(columns: List[str])[source] Bases: ray.data.preprocessor.Preprocessor Scale each column by its absolute max value. The general formula is given by x' = \frac{x}{\max{\vert x \vert}} where x is the column and x' is the transformed column. If \max{\vert x \vert} = 0 (i.e., the column contains all zeros), then the column is unmodified. This is the recommended way to scale sparse data. If your data isn’t sparse, you can use MinMaxScaler or StandardScaler instead. Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import MaxAbsScaler >>> >>> df = pd.DataFrame({"X1": [-6, 3], "X2": [2, -4], "X3": [0, 0]}) # noqa: E501 >>> ds = ray.data.from_pandas(df) >>> ds.to_pandas() X1 X2 X3 0 -6 2 0 1 3 -4 0 Columns are scaled separately. >>> preprocessor = MaxAbsScaler(columns=["X1", "X2"]) >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 -1.0 0.5 0 1 0.5 -1.0 0 Zero-valued columns aren’t scaled. >>> preprocessor = MaxAbsScaler(columns=["X3"]) >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 -6 2 0.0 1 3 -4 0.0 Parameters columns – The columns to separately scale. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.MaxAbsScaler.fit MaxAbsScaler.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor.
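The categorical encoders above and the feature scalers in this section are often used together. The hedged sketch below assumes the Chain preprocessor from ray.data.preprocessors (documented elsewhere in this reference) and shows the usual workflow: fit the combined preprocessor on training data, then reuse the fitted state to transform evaluation data. Column names and values are illustrative.

import pandas as pd
import ray
from ray.data.preprocessors import Chain, MinMaxScaler, OrdinalEncoder

train_df = pd.DataFrame({"level": ["L3", "L4", "L5", "L4"], "salary": [100, 120, 150, 125]})
eval_df = pd.DataFrame({"level": ["L4", "L5"], "salary": [130, 160]})
train_ds = ray.data.from_pandas(train_df)
eval_ds = ray.data.from_pandas(eval_df)

# Encode the categorical column first, then scale the numeric column.
preprocessor = Chain(
    OrdinalEncoder(columns=["level"]),
    MinMaxScaler(columns=["salary"]),
)

# Fit on the training data only; the same fitted state transforms the eval data.
train_transformed = preprocessor.fit_transform(train_ds)
eval_transformed = preprocessor.transform(eval_ds)
print(train_transformed.to_pandas())
print(eval_transformed.to_pandas())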
Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.MaxAbsScaler.fit_transform MaxAbsScaler.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.MaxAbsScaler.preferred_batch_format classmethod MaxAbsScaler.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.MaxAbsScaler.transform MaxAbsScaler.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.MaxAbsScaler.transform_batch MaxAbsScaler.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.MaxAbsScaler.transform_stats MaxAbsScaler.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.MinMaxScaler class ray.data.preprocessors.MinMaxScaler(columns: List[str])[source] Bases: ray.data.preprocessor.Preprocessor Scale each column by its range. The general formula is given by x' = \frac{x - \min(x)}{\max{x} - \min{x}} where x is the column and x' is the transformed column. If \max{x} - \min{x} = 0 (i.e., the column is constant-valued), then the transformed column will get filled with zeros. Transformed values are always in the range [0, 1]. This can be used as an alternative to StandardScaler. Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import MinMaxScaler >>> >>> df = pd.DataFrame({"X1": [-2, 0, 2], "X2": [-3, -3, 3], "X3": [1, 1, 1]}) # noqa: E501 >>> ds = ray.data.from_pandas(df) >>> ds.to_pandas() X1 X2 X3 0 -2 -3 1 1 0 -3 1 2 2 3 1 Columns are scaled separately. >>> preprocessor = MinMaxScaler(columns=["X1", "X2"]) >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 0.0 0.0 1 1 0.5 0.0 1 2 1.0 1.0 1 Constant-valued columns get filled with zeros. >>> preprocessor = MinMaxScaler(columns=["X3"]) >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 -2 -3 0.0 1 0 -3 0.0 2 2 3 0.0 Parameters columns – The columns to separately scale. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. 
fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.MinMaxScaler.fit MinMaxScaler.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.MinMaxScaler.fit_transform MinMaxScaler.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.MinMaxScaler.preferred_batch_format classmethod MinMaxScaler.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overridden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.MinMaxScaler.transform MinMaxScaler.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.MinMaxScaler.transform_batch MinMaxScaler.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.MinMaxScaler.transform_stats MinMaxScaler.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.Normalizer class ray.data.preprocessors.Normalizer(columns: List[str], norm='l2')[source] Bases: ray.data.preprocessor.Preprocessor Scales each sample to have unit norm. This preprocessor works by dividing each sample (i.e., row) by the sample’s norm. The general formula is given by s' = \frac{s}{\lVert s \rVert_p} where s is the sample, s' is the transformed sample, \lVert s \rVert_p is the p-norm of the sample, and p is the norm type. The following norms are supported: "l1" (L^1): Sum of the absolute values. "l2" (L^2): Square root of the sum of the squared values. "max" (L^\infty): Maximum value.
Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import Normalizer >>> >>> df = pd.DataFrame({"X1": [1, 1], "X2": [1, 0], "X3": [0, 1]}) >>> ds = ray.data.from_pandas(df) >>> ds.to_pandas() X1 X2 X3 0 1 1 0 1 1 0 1 The L^2-norm of the first sample is \sqrt{2}, and the L^2-norm of the second sample is 1. >>> preprocessor = Normalizer(columns=["X1", "X2"]) >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 0.707107 0.707107 0 1 1.000000 0.000000 1 The L^1-norm of the first sample is 2, and the L^1-norm of the second sample is 1. >>> preprocessor = Normalizer(columns=["X1", "X2"], norm="l1") >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 0.5 0.5 0 1 1.0 0.0 1 The L^\infty-norm of both samples is 1. >>> preprocessor = Normalizer(columns=["X1", "X2"], norm="max") >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 1.0 1.0 0 1 1.0 0.0 1 Parameters columns – The columns to scale. For each row, these columns are scaled to unit-norm. norm – The norm to use. The supported values are "l1", "l2", or "max". Defaults to "l2". Raises ValueError – if norm is not "l1", "l2", or "max". PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.Normalizer.fit Normalizer.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.Normalizer.fit_transform Normalizer.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.Normalizer.preferred_batch_format classmethod Normalizer.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overridden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.Normalizer.transform Normalizer.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.Normalizer.transform_batch Normalizer.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data.
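In practice, transform_batch is what you call at inference time: after fitting a preprocessor on a Dataset, you can apply it directly to a single in-memory batch. A minimal, hedged sketch using MinMaxScaler from earlier in this section; the same pattern applies to Normalizer and the other preprocessors here.

import pandas as pd
import ray
from ray.data.preprocessors import MinMaxScaler

# Fit the preprocessor on a Ray Dataset.
ds = ray.data.from_pandas(pd.DataFrame({"X1": [-2, 0, 2]}))
scaler = MinMaxScaler(columns=["X1"])
scaler.fit(ds)

# Apply the fitted preprocessor to a single pandas batch.
batch = pd.DataFrame({"X1": [1, 2]})
print(scaler.transform_batch(batch))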
The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.Normalizer.transform_stats Normalizer.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.PowerTransformer class ray.data.preprocessors.PowerTransformer(columns: List[str], power: float, method: str = 'yeo-johnson')[source] Bases: ray.data.preprocessor.Preprocessor Apply a power transform to make your data more normally distributed. Some models expect data to be normally distributed. By making your data more Gaussian-like, you might be able to improve your model’s performance. This preprocessor supports the following transformations: Yeo-Johnson and Box-Cox. Box-Cox requires all data to be positive. You need to manually specify the transform’s power parameter. If you choose a bad value, the transformation might not work well. Parameters columns – The columns to separately transform. power – A parameter that determines how your data is transformed. Practitioners typically set power between -2.5 and 2.5, although you may need to try different values to find one that works well. method – A string representing which transformation to apply. Supports "yeo-johnson" and "box-cox". If you choose "box-cox", your data needs to be positive. Defaults to "yeo-johnson". PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.PowerTransformer.fit PowerTransformer.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.PowerTransformer.fit_transform PowerTransformer.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.PowerTransformer.preferred_batch_format classmethod PowerTransformer.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overridden by Preprocessor classes depending on which transform path is the most optimal.
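PowerTransformer is the only scaler in this section without a worked example, so here is a hedged usage sketch based on the parameters just described. The power values and data are illustrative only; in practice you would try several values.

import pandas as pd
import ray
from ray.data.preprocessors import PowerTransformer

df = pd.DataFrame({"X1": [0.2, 1.5, 4.0, 9.5], "X2": [1.0, 2.0, 3.0, 40.0]})
ds = ray.data.from_pandas(df)

# Yeo-Johnson (the default method) also handles zero and negative values.
yeo = PowerTransformer(columns=["X1", "X2"], power=0.5)
print(yeo.fit_transform(ds).to_pandas())

# Box-Cox requires strictly positive data.
box_cox = PowerTransformer(columns=["X1", "X2"], power=0.5, method="box-cox")
print(box_cox.fit_transform(ds).to_pandas())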
DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.PowerTransformer.transform PowerTransformer.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.PowerTransformer.transform_batch PowerTransformer.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.PowerTransformer.transform_stats PowerTransformer.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.RobustScaler class ray.data.preprocessors.RobustScaler(columns: List[str], quantile_range: Tuple[float, float] = (0.25, 0.75))[source] Bases: ray.data.preprocessor.Preprocessor Scale and translate each column using quantiles. The general formula is given by x' = \frac{x - \mu_{1/2}}{\mu_h - \mu_l} where x is the column, x' is the transformed column, \mu_{1/2} is the column median. \mu_{h} and \mu_{l} are the high and low quantiles, respectively. By default, \mu_{h} is the third quartile and \mu_{l} is the first quartile. This scaler works well when your data contains many outliers. Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import RobustScaler >>> >>> df = pd.DataFrame({ ... "X1": [1, 2, 3, 4, 5], ... "X2": [13, 5, 14, 2, 8], ... "X3": [1, 2, 2, 2, 3], ... }) >>> ds = ray.data.from_pandas(df) >>> ds.to_pandas() X1 X2 X3 0 1 13 1 1 2 5 2 2 3 14 2 3 4 2 2 4 5 8 3 RobustScaler separately scales each column. >>> preprocessor = RobustScaler(columns=["X1", "X2"]) >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 -1.0 0.625 1 1 -0.5 -0.375 2 2 0.0 0.750 2 3 0.5 -0.750 2 4 1.0 0.000 3 Parameters columns – The columns to separately scale. quantile_range – A tuple that defines the lower and upper quantiles. Values must be between 0 and 1. Defaults to the 1st and 3rd quartiles: (0.25, 0.75). PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.RobustScaler.fit RobustScaler.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.RobustScaler.fit_transform RobustScaler.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. 
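The quantile_range parameter above defaults to the first and third quartiles. A hedged sketch of passing a custom range, here the 10th and 90th percentiles, with illustrative data containing an outlier:

import pandas as pd
import ray
from ray.data.preprocessors import RobustScaler

# 100 is an outlier relative to the other values.
ds = ray.data.from_pandas(pd.DataFrame({"X1": [1, 2, 3, 4, 100]}))

# Scale using the 10th and 90th percentiles instead of the default quartiles.
scaler = RobustScaler(columns=["X1"], quantile_range=(0.1, 0.9))
print(scaler.fit_transform(ds).to_pandas())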
Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.RobustScaler.preferred_batch_format classmethod RobustScaler.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.RobustScaler.transform RobustScaler.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.RobustScaler.transform_batch RobustScaler.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.RobustScaler.transform_stats RobustScaler.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.StandardScaler class ray.data.preprocessors.StandardScaler(columns: List[str])[source] Bases: ray.data.preprocessor.Preprocessor Translate and scale each column by its mean and standard deviation, respectively. The general formula is given by x' = \frac{x - \bar{x}}{s} where x is the column, x' is the transformed column, \bar{x} is the column average, and s is the column’s sample standard deviation. If s = 0 (i.e., the column is constant-valued), then the transformed column will contain zeros. StandardScaler works best when your data is normal. If your data isn’t approximately normal, then the transformed features won’t be meaningful. Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import StandardScaler >>> >>> df = pd.DataFrame({"X1": [-2, 0, 2], "X2": [-3, -3, 3], "X3": [1, 1, 1]}) >>> ds = ray.data.from_pandas(df) >>> ds.to_pandas() X1 X2 X3 0 -2 -3 1 1 0 -3 1 2 2 3 1 Columns are scaled separately. >>> preprocessor = StandardScaler(columns=["X1", "X2"]) >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 -1.224745 -0.707107 1 1 0.000000 -0.707107 1 2 1.224745 1.414214 1 Constant-valued columns get filled with zeros. >>> preprocessor = StandardScaler(columns=["X3"]) >>> preprocessor.fit_transform(ds).to_pandas() X1 X2 X3 0 -2 -3 0.0 1 0 -3 0.0 2 2 3 0.0 Parameters columns – The columns to separately scale. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. 
transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.StandardScaler.fit StandardScaler.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.StandardScaler.fit_transform StandardScaler.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.StandardScaler.preferred_batch_format classmethod StandardScaler.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.StandardScaler.transform StandardScaler.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.StandardScaler.transform_batch StandardScaler.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.StandardScaler.transform_stats StandardScaler.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases. K-Bins Discretizers CustomKBinsDiscretizer(columns, bins, *[, ...]) Bin values into discrete intervals using custom bin edges. UniformKBinsDiscretizer(columns, bins, *[, ...]) Bin values into discrete intervals (bins) of uniform width. ray.data.preprocessors.CustomKBinsDiscretizer class ray.data.preprocessors.CustomKBinsDiscretizer(columns: List[str], bins: Union[Iterable[float], pandas.core.indexes.interval.IntervalIndex, Dict[str, Union[Iterable[float], pandas.core.indexes.interval.IntervalIndex]]], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Optional[Dict[str, Union[pandas.core.dtypes.dtypes.CategoricalDtype, Type[numpy.integer]]]] = None)[source] Bases: ray.data.preprocessors.discretizer._AbstractKBinsDiscretizer Bin values into discrete intervals using custom bin edges. Columns must contain numerical values. Examples Use CustomKBinsDiscretizer to bin continuous features. 
>>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import CustomKBinsDiscretizer >>> df = pd.DataFrame({ ... "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1], ... "value_2": [10, 15, 13, 12, 23, 25], ... }) >>> ds = ray.data.from_pandas(df) >>> discretizer = CustomKBinsDiscretizer( ... columns=["value_1", "value_2"], ... bins=[0, 1, 4, 10, 25] ... ) >>> discretizer.transform(ds).to_pandas() value_1 value_2 0 0 2 1 1 3 2 1 3 3 2 3 4 2 3 5 1 3 You can also specify different bin edges per column. >>> discretizer = CustomKBinsDiscretizer( ... columns=["value_1", "value_2"], ... bins={"value_1": [0, 1, 4], "value_2": [0, 18, 35, 70]}, ... ) >>> discretizer.transform(ds).to_pandas() value_1 value_2 0 0.0 0 1 1.0 0 2 1.0 0 3 NaN 0 4 NaN 1 5 1.0 1 Parameters columns – The columns to discretize. bins – Defines custom bin edges. Can be an iterable of numbers, a pd.IntervalIndex, or a dict mapping columns to either of them. Note that pd.IntervalIndex for bins must be non-overlapping. right – Indicates whether bins include the rightmost edge. include_lowest – Indicates whether the first interval should be left-inclusive. duplicates – Can be either ‘raise’ or ‘drop’. If bin edges are not unique, raise ValueError or drop non-uniques. dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you don’t include a column in dtypes or specify it as an integer dtype, the outputted column will consist of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the outputted column will be a pd.CategoricalDtype with the categories being mapped to bins. You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order. UniformKBinsDiscretizer If you want to bin data into uniform width bins. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.CustomKBinsDiscretizer.fit CustomKBinsDiscretizer.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.CustomKBinsDiscretizer.fit_transform CustomKBinsDiscretizer.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.CustomKBinsDiscretizer.preferred_batch_format classmethod CustomKBinsDiscretizer.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. 
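The dtypes parameter above is easy to miss: passing a pd.CategoricalDtype keeps the output as an ordered categorical rather than plain integers. A hedged sketch reusing the value_1 data from the example; the category labels are illustrative, and per the documented behavior they are mapped to the four bins in order.

import pandas as pd
import ray
from ray.data.preprocessors import CustomKBinsDiscretizer

ds = ray.data.from_pandas(pd.DataFrame({"value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1]}))

# Request an ordered categorical output; one category per bin.
dtype = pd.CategoricalDtype(categories=["low", "medium", "high", "very_high"], ordered=True)
discretizer = CustomKBinsDiscretizer(
    columns=["value_1"],
    bins=[0, 1, 4, 10, 25],
    dtypes={"value_1": dtype},
)
print(discretizer.transform(ds).to_pandas())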
The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.CustomKBinsDiscretizer.transform CustomKBinsDiscretizer.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.CustomKBinsDiscretizer.transform_batch CustomKBinsDiscretizer.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.CustomKBinsDiscretizer.transform_stats CustomKBinsDiscretizer.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.UniformKBinsDiscretizer class ray.data.preprocessors.UniformKBinsDiscretizer(columns: List[str], bins: Union[int, Dict[str, int]], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Optional[Dict[str, Union[pandas.core.dtypes.dtypes.CategoricalDtype, Type[numpy.integer]]]] = None)[source] Bases: ray.data.preprocessors.discretizer._AbstractKBinsDiscretizer Bin values into discrete intervals (bins) of uniform width. Columns must contain numerical values. Examples Use UniformKBinsDiscretizer to bin continuous features. >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import UniformKBinsDiscretizer >>> df = pd.DataFrame({ ... "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1], ... "value_2": [10, 15, 13, 12, 23, 25], ... }) >>> ds = ray.data.from_pandas(df) >>> discretizer = UniformKBinsDiscretizer( ... columns=["value_1", "value_2"], bins=4 ... ) >>> discretizer.fit_transform(ds).to_pandas() value_1 value_2 0 0 0 1 0 1 2 0 0 3 2 0 4 3 3 5 0 3 You can also specify different number of bins per column. >>> discretizer = UniformKBinsDiscretizer( ... columns=["value_1", "value_2"], bins={"value_1": 4, "value_2": 3} ... ) >>> discretizer.fit_transform(ds).to_pandas() value_1 value_2 0 0 0 1 0 0 2 0 0 3 2 0 4 3 2 5 0 2 Parameters columns – The columns to discretize. bins – Defines the number of equal-width bins. Can be either an integer (which will be applied to all columns), or a dict that maps columns to integers. The range is extended by .1% on each side to include the minimum and maximum values. right – Indicates whether bins includes the rightmost edge or not. include_lowest – Whether the first interval should be left-inclusive or not. duplicates – Can be either ‘raise’ or ‘drop’. If bin edges are not unique, raise ValueError or drop non-uniques. dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you don’t include a column in dtypes or specify it as an integer dtype, the outputted column will consist of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the outputted column will be a pd.CategoricalDtype with the categories being mapped to bins. 
You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order. CustomKBinsDiscretizer If you want to specify your own bin edges. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.UniformKBinsDiscretizer.fit UniformKBinsDiscretizer.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.UniformKBinsDiscretizer.fit_transform UniformKBinsDiscretizer.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.UniformKBinsDiscretizer.preferred_batch_format classmethod UniformKBinsDiscretizer.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.UniformKBinsDiscretizer.transform UniformKBinsDiscretizer.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.UniformKBinsDiscretizer.transform_batch UniformKBinsDiscretizer.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.UniformKBinsDiscretizer.transform_stats UniformKBinsDiscretizer.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Image Preprocessors TorchVisionPreprocessor(columns, transform) Apply a TorchVision transform to image columns. 
ray.data.preprocessors.TorchVisionPreprocessor class ray.data.preprocessors.TorchVisionPreprocessor(columns: List[str], transform: Callable[[Union[np.ndarray, torch.Tensor]], torch.Tensor], output_columns: Optional[List[str]] = None, batched: bool = False)[source] Bases: ray.data.preprocessor.Preprocessor Apply a TorchVision transform to image columns. Examples Torch models expect inputs of shape (B, C, H, W) in the range [0.0, 1.0]. To convert images to this format, add ToTensor to your preprocessing pipeline. from torchvision import transforms import ray from ray.data.preprocessors import TorchVisionPreprocessor transform = transforms.Compose([ transforms.ToTensor(), transforms.Resize((224, 224)), ]) preprocessor = TorchVisionPreprocessor(["image"], transform=transform) dataset = ray.data.read_images("s3://anonymous@air-example-data-2/imagenet-sample-images") dataset = preprocessor.transform(dataset) For better performance, set batched to True and replace ToTensor with a batch-supporting Lambda. import numpy as np import torch def to_tensor(batch: np.ndarray) -> torch.Tensor: tensor = torch.as_tensor(batch, dtype=torch.float) # (B, H, W, C) -> (B, C, H, W) tensor = tensor.permute(0, 3, 1, 2).contiguous() # [0., 255.] -> [0., 1.] tensor = tensor.div(255) return tensor transform = transforms.Compose([ transforms.Lambda(to_tensor), transforms.Resize((224, 224)) ]) preprocessor = TorchVisionPreprocessor(["image"], transform=transform, batched=True) dataset = ray.data.read_images("s3://anonymous@air-example-data-2/imagenet-sample-images") dataset = preprocessor.transform(dataset) Parameters columns – The columns to apply the TorchVision transform to. transform – The TorchVision transform you want to apply. This transform should accept a np.ndarray or torch.Tensor as input and return a torch.Tensor as output. output_columns – The output name for each input column. If not specified, this defaults to the same set of columns as the columns. batched – If True, apply transform to batches of shape (B, H, W, C). Otherwise, apply transform to individual images. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.TorchVisionPreprocessor.fit TorchVisionPreprocessor.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.TorchVisionPreprocessor.fit_transform TorchVisionPreprocessor.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.TorchVisionPreprocessor.transform TorchVisionPreprocessor.transform(ds: Dataset) -> Dataset Transform the given dataset. 
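The output_columns argument above is not shown in the examples; it lets the transformed tensors be written under a different column name than the input column. A hedged sketch, reusing the unbatched transform defined earlier; the column name "image_tensor" is illustrative.

from torchvision import transforms
import ray
from ray.data.preprocessors import TorchVisionPreprocessor

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
])

# Write the transformed tensors under a new column name, "image_tensor".
preprocessor = TorchVisionPreprocessor(
    ["image"], transform=transform, output_columns=["image_tensor"]
)
dataset = ray.data.read_images("s3://anonymous@air-example-data-2/imagenet-sample-images")
dataset = preprocessor.transform(dataset)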
Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.TorchVisionPreprocessor.transform_batch TorchVisionPreprocessor.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.TorchVisionPreprocessor.transform_stats TorchVisionPreprocessor.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases. Text Encoders CountVectorizer(columns[, tokenization_fn, ...]) Count the frequency of tokens in a column of strings. FeatureHasher(columns, num_features) Apply the hashing trick to a table that describes token frequencies. HashingVectorizer(columns, num_features[, ...]) Count the frequency of tokens using the hashing trick. Tokenizer(columns[, tokenization_fn]) Replace each string with a list of tokens. ray.data.preprocessors.CountVectorizer class ray.data.preprocessors.CountVectorizer(columns: List[str], tokenization_fn: Optional[Callable[[str], List[str]]] = None, max_features: Optional[int] = None)[source] Bases: ray.data.preprocessor.Preprocessor Count the frequency of tokens in a column of strings. CountVectorizer operates on columns that contain strings. For example: corpus 0 I dislike Python 1 I like Python This preprocessors creates a column named like {column}_{token} for each unique token. These columns represent the frequency of token {token} in column {column}. For example: corpus_I corpus_Python corpus_dislike corpus_like 0 1 1 1 0 1 1 1 0 1 Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import CountVectorizer >>> >>> df = pd.DataFrame({ ... "corpus": [ ... "Jimmy likes volleyball", ... "Bob likes volleyball too", ... "Bob also likes fruit jerky" ... ] ... }) >>> ds = ray.data.from_pandas(df) >>> >>> vectorizer = CountVectorizer(["corpus"]) >>> vectorizer.fit_transform(ds).to_pandas() corpus_likes corpus_volleyball corpus_Bob corpus_Jimmy corpus_too corpus_also corpus_fruit corpus_jerky 0 1 1 0 1 0 0 0 0 1 1 1 1 0 1 0 0 0 2 1 0 1 0 0 1 1 1 You can limit the number of tokens in the vocabulary with max_features. >>> vectorizer = CountVectorizer(["corpus"], max_features=3) >>> vectorizer.fit_transform(ds).to_pandas() corpus_likes corpus_volleyball corpus_Bob 0 1 1 0 1 1 1 1 2 1 0 1 Parameters columns – The columns to separately tokenize and count. tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" "). max_features – The maximum number of tokens to encode in the transformed dataset. If specified, only the most frequent tokens are encoded. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. 
transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.CountVectorizer.fit CountVectorizer.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.CountVectorizer.fit_transform CountVectorizer.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.CountVectorizer.preferred_batch_format classmethod CountVectorizer.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.CountVectorizer.transform CountVectorizer.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.CountVectorizer.transform_batch CountVectorizer.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.CountVectorizer.transform_stats CountVectorizer.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.FeatureHasher class ray.data.preprocessors.FeatureHasher(columns: List[str], num_features: int)[source] Bases: ray.data.preprocessor.Preprocessor Apply the hashing trick to a table that describes token frequencies. FeatureHasher creates num_features columns named hash_{index}, where index ranges from 0 to num_features- 1. The column hash_{index} describes the frequency of tokens that hash to index. Distinct tokens can correspond to the same index. However, if num_features is large enough, then columns probably correspond to a unique token. This preprocessor is memory efficient and quick to pickle. However, given a transformed column, you can’t know which tokens correspond to it. This might make it hard to determine which tokens are important to your model. Sparse matrices aren’t supported. If you use a large num_features, this preprocessor might behave poorly. 
Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import FeatureHasher The data below describes the frequencies of tokens in "I like Python" and "I dislike Python". >>> df = pd.DataFrame({ ... "I": [1, 1], ... "like": [1, 0], ... "dislike": [0, 1], ... "Python": [1, 1] ... }) >>> ds = ray.data.from_pandas(df) FeatureHasher hashes each token to determine its index. For example, the index of "I" is hash(\texttt{"I"}) \pmod 8 = 5. >>> hasher = FeatureHasher(columns=["I", "like", "dislike", "Python"], num_features=8) >>> hasher.fit_transform(ds).to_pandas().to_numpy() array([[0, 0, 0, 2, 0, 1, 0, 0], [0, 0, 0, 1, 0, 1, 1, 0]]) Notice the hash collision: both "like" and "Python" correspond to index 3. You can avoid hash collisions like these by increasing num_features. Parameters columns – The columns to apply the hashing trick to. Each column should describe the frequency of a token. num_features – The number of features used to represent the vocabulary. You should choose a value large enough to prevent hash collisions between distinct tokens. CountVectorizer Use this preprocessor to generate inputs for FeatureHasher. ray.data.preprocessors.HashingVectorizer If your input data describes documents rather than token frequencies, use HashingVectorizer. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.FeatureHasher.fit FeatureHasher.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.FeatureHasher.fit_transform FeatureHasher.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.FeatureHasher.preferred_batch_format classmethod FeatureHasher.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.FeatureHasher.transform FeatureHasher.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. 
Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.FeatureHasher.transform_batch FeatureHasher.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.FeatureHasher.transform_stats FeatureHasher.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.HashingVectorizer class ray.data.preprocessors.HashingVectorizer(columns: List[str], num_features: int, tokenization_fn: Optional[Callable[[str], List[str]]] = None)[source] Bases: ray.data.preprocessor.Preprocessor Count the frequency of tokens using the hashing trick. This preprocessors creates num_features columns named like hash_{column_name}_{index}. If num_features is large enough relative to the size of your vocabulary, then each column approximately corresponds to the frequency of a unique token. HashingVectorizer is memory efficient and quick to pickle. However, given a transformed column, you can’t know which tokens correspond to it. This might make it hard to determine which tokens are important to your model. This preprocessor transforms each input column to a document-term matrix. A document-term matrix is a table that describes the frequency of tokens in a collection of documents. For example, the strings "I like Python" and "I dislike Python" might have the document-term matrix below: corpus_I corpus_Python corpus_dislike corpus_like 0 1 1 1 0 1 1 1 0 1 To generate the matrix, you typically map each token to a unique index. For example: token index 0 I 0 1 Python 1 2 dislike 2 3 like 3 The problem with this approach is that memory use scales linearly with the size of your vocabulary. HashingVectorizer circumvents this problem by computing indices with a hash function: \texttt{index} = hash(\texttt{token}). Sparse matrices aren’t currently supported. If you use a large num_features, this preprocessor might behave poorly. Examples >>> import pandas as pd >>> import ray >>> from ray.data.preprocessors import HashingVectorizer >>> >>> df = pd.DataFrame({ ... "corpus": [ ... "Jimmy likes volleyball", ... "Bob likes volleyball too", ... "Bob also likes fruit jerky" ... ] ... }) >>> ds = ray.data.from_pandas(df) >>> >>> vectorizer = HashingVectorizer(["corpus"], num_features=8) >>> vectorizer.fit_transform(ds).to_pandas() hash_corpus_0 hash_corpus_1 hash_corpus_2 hash_corpus_3 hash_corpus_4 hash_corpus_5 hash_corpus_6 hash_corpus_7 0 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0 1 1 2 0 0 1 1 0 2 1 0 Parameters columns – The columns to separately tokenize and count. num_features – The number of features used to represent the vocabulary. You should choose a value large enough to prevent hash collisions between distinct tokens. tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" "). CountVectorizer Another method for counting token frequencies. 
Unlike HashingVectorizer, CountVectorizer creates a feature for each unique token. This enables you to compute the inverse transformation. FeatureHasher This preprocessor is similar to HashingVectorizer, except it expects a table describing token frequencies. In contrast, HashingVectorizer expects a column containing documents. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.HashingVectorizer.fit HashingVectorizer.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.HashingVectorizer.fit_transform HashingVectorizer.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.HashingVectorizer.preferred_batch_format classmethod HashingVectorizer.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overridden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.HashingVectorizer.transform HashingVectorizer.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.HashingVectorizer.transform_batch HashingVectorizer.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchTyperay.data.preprocessors.HashingVectorizer.transform_stats HashingVectorizer.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.preprocessors.Tokenizer class ray.data.preprocessors.Tokenizer(columns: List[str], tokenization_fn: Optional[Callable[[str], List[str]]] = None)[source] Bases: ray.data.preprocessor.Preprocessor Replace each string with a list of tokens.
Examples >>> import pandas as pd >>> import ray >>> df = pd.DataFrame({"text": ["Hello, world!", "foo bar\nbaz"]}) >>> ds = ray.data.from_pandas(df) The default tokenization_fn delimits strings using the space character. >>> from ray.data.preprocessors import Tokenizer >>> tokenizer = Tokenizer(columns=["text"]) >>> tokenizer.transform(ds).to_pandas() text 0 [Hello,, world!] 1 [foo, bar\nbaz] If the default logic isn’t adequate for your use case, you can specify a custom tokenization_fn. >>> import string >>> def tokenization_fn(s): ... for character in string.punctuation: ... s = s.replace(character, "") ... return s.split() >>> tokenizer = Tokenizer(columns=["text"], tokenization_fn=tokenization_fn) >>> tokenizer.transform(ds).to_pandas() text 0 [Hello, world] 1 [foo, bar, baz] Parameters columns – The columns to tokenize. tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" "). PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods fit(ds) Fit this Preprocessor to the Dataset. fit_transform(ds) Fit this Preprocessor to the Dataset and then transform the Dataset. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. transform(ds) Transform the given dataset. transform_batch(data) Transform a single batch of data. transform_stats() Return Dataset stats for the most recent transform call, if any. ray.data.preprocessors.Tokenizer.fit Tokenizer.fit(ds: Dataset) -> Preprocessor Fit this Preprocessor to the Dataset. Fitted state attributes will be directly set in the Preprocessor. Calling it more than once will overwrite all previously fitted state: preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B). Parameters ds – Input dataset. Returns The fitted Preprocessor with state attributes. Return type Preprocessorray.data.preprocessors.Tokenizer.fit_transform Tokenizer.fit_transform(ds: Dataset) -> Dataset Fit this Preprocessor to the Dataset and then transform the Dataset. Calling it more than once will overwrite all previously fitted state: preprocessor.fit_transform(A).fit_transform(B) is equivalent to preprocessor.fit_transform(B). Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Datasetray.data.preprocessors.Tokenizer.preferred_batch_format classmethod Tokenizer.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _transform_pandas and _transform_numpy are implemented. Defaults to Pandas. Can be overriden by Preprocessor classes depending on which transform path is the most optimal. DeveloperAPI: This API may change across minor Ray releases.ray.data.preprocessors.Tokenizer.transform Tokenizer.transform(ds: Dataset) -> Dataset Transform the given dataset. Parameters ds – Input Dataset. Returns The transformed Dataset. Return type ray.data.Dataset Raises PreprocessorNotFittedException – if fit is not called yet.ray.data.preprocessors.Tokenizer.transform_batch Tokenizer.transform_batch(data: DataBatchType) -> DataBatchType Transform a single batch of data. The data will be converted to the format supported by the Preprocessor, based on which _transform_* methods are implemented. Parameters data – Input data batch. Returns The transformed data batch. 
This may differ from the input type depending on which _transform_* methods are implemented. Return type DataBatchType

ray.data.preprocessors.Tokenizer.transform_stats Tokenizer.transform_stats() -> Optional[str] Return Dataset stats for the most recent transform call, if any. DEPRECATED: This API is deprecated and may be removed in future Ray releases.

Ray Data Ingest into AIR Trainers See this AIR Data ingest guide for usage examples. air.session.get_dataset_shard([dataset_name]) Returns the ray.data.DataIterator shard for this worker. DataIterator() An iterator for reading records from a Dataset or DatasetPipeline. ray.train.DataConfig([datasets_to_split, ...]) Class responsible for configuring Train dataset preprocessing.

Debugging Utilities make_local_dataset_iterator(dataset, ...) A helper function to create a local DataIterator, like the one returned by get_dataset_shard(). DummyTrainer(*args, **kwargs) A Trainer that does nothing except read the data for a given number of epochs.

ray.air.util.check_ingest.make_local_dataset_iterator ray.air.util.check_ingest.make_local_dataset_iterator(dataset: ray.data.dataset.Dataset, preprocessor: ray.data.preprocessor.Preprocessor, dataset_config: ray.air.config.DatasetConfig) -> ray.data.iterator.DataIterator[source] A helper function to create a local DataIterator, like the one returned by get_dataset_shard(). This function should only be used for development and debugging. It will raise an exception if called by a worker instead of the driver. Parameters dataset – The input Dataset. preprocessor – The preprocessor that will be applied to the input dataset. dataset_config – The dataset config normally passed to the trainer. DeveloperAPI: This API may change across minor Ray releases.

ray.air.util.check_ingest.DummyTrainer class ray.air.util.check_ingest.DummyTrainer(*args, **kwargs)[source] Bases: ray.train.data_parallel_trainer.DataParallelTrainer A Trainer that does nothing except read the data for a given number of epochs. It prints out as many debugging statistics as possible. This is useful for debugging data ingest problems. This trainer supports the same scaling options as any other Trainer (e.g., num_workers, use_gpu). Parameters scaling_config – Configuration for how to scale training. This is the same as for BaseTrainer. num_epochs – How many times to iterate through the datasets. prefetch_batches – The number of batches to prefetch ahead of the current block during the scan. This is the same as the corresponding argument to iter_batches(). DeveloperAPI: This API may change across minor Ray releases.

Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. get_dataset_config() Return a copy of this Trainer's final dataset configs. make_train_loop(num_epochs, ...) Make a debug train loop that runs for the given number of epochs. restore(path[, train_loop_per_worker, ...]) Restores a DataParallelTrainer from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer.

ray.air.util.check_ingest.DummyTrainer.as_trainable DummyTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.

ray.air.util.check_ingest.DummyTrainer.can_restore classmethod DummyTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment.
This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.air.util.check_ingest.DummyTrainer.fit DummyTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.air.util.check_ingest.DummyTrainer.get_dataset_config DummyTrainer.get_dataset_config() -> ray.train._internal.data_config.DataConfig Return a copy of this Trainer’s final dataset configs. Returns The merged default + user-supplied dataset config.ray.air.util.check_ingest.DummyTrainer.make_train_loop static DummyTrainer.make_train_loop(num_epochs: int, prefetch_batches: int, prefetch_blocks: int, batch_size: Optional[int])[source] Make a debug train loop that runs for the given amount of epochs.ray.air.util.check_ingest.DummyTrainer.restore classmethod DummyTrainer.restore(path: str, train_loop_per_worker: Optional[Union[Callable[[], None], Callable[[Dict], None]]] = None, train_loop_config: Optional[Dict] = None, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None) -> DataParallelTrainer Restores a DataParallelTrainer from a previously interrupted/failed run. Parameters train_loop_per_worker – Optionally re-specified train loop function. This should be used to re-specify a function that is not restorable in a new Ray cluster (e.g., it holds onto outdated object references). This should be the same training loop that was passed to the original trainer constructor. train_loop_config – Optionally re-specified train config. This should similarly be used if the original train_loop_config contained outdated object references, and it should not be modified from what was originally passed in. See BaseTrainer.restore() for descriptions of the other arguments. Returns A restored instance of the DataParallelTrainer subclass that is calling this method. Return type DataParallelTrainerray.air.util.check_ingest.DummyTrainer.setup DummyTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop. Ray AIR Configurations TODO(ml-team): Add a general AIR configuration guide that covers all of these configs. See this Ray Train configuration user guide for more details. air.RunConfig([name, storage_path, ...]) Runtime configuration for training and tuning runs. air.ScalingConfig([trainer_resources, ...]) Configuration for scaling training. air.CheckpointConfig([num_to_keep, ...]) Configurable parameters for defining the checkpointing strategy. air.FailureConfig([max_failures, fail_fast]) Configuration related to failure handling of each training/tuning run. 
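These four configuration objects are usually built together and handed to a Trainer. The snippet below is a minimal sketch of how they compose, assuming a TorchTrainer and a toy training loop; the metric name "loss", the experiment name, and the worker count are illustrative choices, not values prescribed by this reference.

from ray.air import Checkpoint, CheckpointConfig, FailureConfig, RunConfig, ScalingConfig, session
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # A toy loop that reports a dummy "loss" metric and a checkpoint each epoch.
    for epoch in range(3):
        session.report(
            {"loss": 1.0 / (epoch + 1)},
            checkpoint=Checkpoint.from_dict({"epoch": epoch}),
        )

run_config = RunConfig(
    name="air_config_demo",
    # Keep only the 2 checkpoints with the lowest reported "loss".
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    ),
    # Retry a failed run up to 3 times from the latest checkpoint.
    failure_config=FailureConfig(max_failures=3),
)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    run_config=run_config,
)
result = trainer.fit()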
ray.air.RunConfig class ray.air.RunConfig(name: Optional[str] = None, storage_path: Optional[str] = None, callbacks: Optional[List[Callback]] = None, stop: Optional[Union[Mapping, Stopper, Callable[[str, Mapping], bool]]] = None, failure_config: Optional[ray.air.config.FailureConfig] = None, sync_config: Optional[SyncConfig] = None, checkpoint_config: Optional[ray.air.config.CheckpointConfig] = None, progress_reporter: Optional[ProgressReporter] = None, verbose: Optional[Union[int, AirVerbosity, Verbosity]] = None, log_to_file: Union[bool, str, Tuple[str, str]] = False, local_dir: Optional[str] = None)[source] Bases: object Runtime configuration for training and tuning runs. Upon resuming from a training or tuning run checkpoint, Ray Train/Tune will automatically apply the RunConfig from the previously checkpointed run. Parameters name – Name of the trial or experiment. If not provided, will be deduced from the Trainable. storage_path – Path to store results at. Can be a local directory or a destination on cloud storage. If Ray storage is set up, defaults to the storage location. Otherwise, this defaults to the local ~/ray_results directory. stop – Stop conditions to consider. Refer to ray.tune.stopper.Stopper for more info. Stoppers should be serializable. callbacks – Callbacks to invoke. Refer to ray.tune.callback.Callback for more info. Callbacks should be serializable. Currently only stateless callbacks are supported for resumed runs. (any state of the callback will not be checkpointed by Tune and thus will not take effect in resumed runs). failure_config – Failure mode configuration. sync_config – Configuration object for syncing. See tune.SyncConfig. checkpoint_config – Checkpointing configuration. progress_reporter – Progress reporter for reporting intermediate experiment progress. Defaults to CLIReporter if running in command-line, or JupyterNotebookReporter if running in a Jupyter notebook. verbose – 0, 1, or 2. Verbosity mode. 0 = silent, 1 = default, 2 = verbose. Defaults to 1. If the RAY_AIR_NEW_OUTPUT=1 environment variable is set, uses the old verbosity settings: 0 = silent, 1 = only status updates, 2 = status and brief results, 3 = status and detailed results. log_to_file – Log stdout and stderr to files in trial directories. If this is False (default), no files are written. If true, outputs are written to trialdir/stdout and trialdir/stderr, respectively. If this is a single string, this is interpreted as a file relative to the trialdir, to which both streams are written. If this is a Sequence (e.g. a Tuple), it has to have length 2 and the elements indicate the files to which stdout and stderr are written, respectively. PublicAPI (beta): This API is in beta and may change before becoming stable. 
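As a small illustration of the less obvious parameters above, the hypothetical configuration below stops each run after 10 reported training iterations and splits stdout/stderr into two files inside the trial directory; the file names and the iteration threshold are made up for the example.

from ray.air import RunConfig

run_config = RunConfig(
    name="stopping_and_logging_demo",
    # Stop once the reported "training_iteration" metric reaches 10.
    stop={"training_iteration": 10},
    # Write each trial's stdout and stderr to these files in its trial directory.
    log_to_file=("my_stdout.log", "my_stderr.log"),
    verbose=1,
)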
Methods Attributes callbacks checkpoint_config failure_config local_dir log_to_file name progress_reporter stop storage_path sync_config verbose ray.air.RunConfig.callbacks RunConfig.callbacks: Optional[List[Callback]] = None ray.air.RunConfig.checkpoint_config RunConfig.checkpoint_config: Optional[ray.air.config.CheckpointConfig] = None ray.air.RunConfig.failure_config RunConfig.failure_config: Optional[ray.air.config.FailureConfig] = None ray.air.RunConfig.local_dir RunConfig.local_dir: Optional[str] = None ray.air.RunConfig.log_to_file RunConfig.log_to_file: Union[bool, str, Tuple[str, str]] = False ray.air.RunConfig.name RunConfig.name: Optional[str] = None ray.air.RunConfig.progress_reporter RunConfig.progress_reporter: Optional[ProgressReporter] = None ray.air.RunConfig.stop RunConfig.stop: Optional[Union[Mapping, Stopper, Callable[[str, Mapping], bool]]] = None ray.air.RunConfig.storage_path RunConfig.storage_path: Optional[str] = None ray.air.RunConfig.sync_config RunConfig.sync_config: Optional[SyncConfig] = None ray.air.RunConfig.verbose RunConfig.verbose: Optional[Union[int, AirVerbosity, Verbosity]] = None ray.air.ScalingConfig class ray.air.ScalingConfig(trainer_resources: Optional[Union[Dict, Domain, Dict[str, List]]] = None, num_workers: Optional[Union[int, Domain, Dict[str, List]]] = None, use_gpu: Union[bool, Domain, Dict[str, List]] = False, resources_per_worker: Optional[Union[Dict, Domain, Dict[str, List]]] = None, placement_strategy: Union[str, Domain, Dict[str, List]] = 'PACK', _max_cpu_fraction_per_node: Optional[Union[float, Domain, Dict[str, List]]] = None)[source] Bases: object Configuration for scaling training. Parameters trainer_resources – Resources to allocate for the trainer. If None is provided, will default to 1 CPU. num_workers – The number of workers (Ray actors) to launch. Each worker will reserve 1 CPU by default. The number of CPUs reserved by each worker can be overridden with the resources_per_worker argument. use_gpu – If True, training will be done on GPUs (1 per worker). Defaults to False. The number of GPUs reserved by each worker can be overridden with the resources_per_worker argument. resources_per_worker – If specified, the resources defined in this Dict will be reserved for each worker. The CPU and GPU keys (case-sensitive) can be defined to override the number of CPU/GPUs used by each worker. placement_strategy – The placement strategy to use for the placement group of the Ray actors. See Placement Group Strategies for the possible options. _max_cpu_fraction_per_node – [Experimental] The max fraction of CPUs per node that Train will use for scheduling training actors. The remaining CPUs can be used for dataset tasks. It is highly recommended that you set this to less than 1.0 (e.g., 0.8) when passing datasets to trainers, to avoid hangs / CPU starvation of dataset tasks. Warning: this feature is experimental and is not recommended for use with autoscaling (scale-up will not trigger properly). PublicAPI (beta): This API is in beta and may change before becoming stable. Methods as_placement_group_factory() Returns a PlacementGroupFactory to specify resources for Tune. 
from_placement_group_factory(pgf) Create a ScalingConfig from a Tune's PlacementGroupFactory ray.air.ScalingConfig.as_placement_group_factory ScalingConfig.as_placement_group_factory() -> PlacementGroupFactory[source] Returns a PlacementGroupFactory to specify resources for Tune.ray.air.ScalingConfig.from_placement_group_factory classmethod ScalingConfig.from_placement_group_factory(pgf: PlacementGroupFactory) -> ScalingConfig[source] Create a ScalingConfig from a Tune’s PlacementGroupFactory Attributes additional_resources_per_worker Resources per worker, not including CPU or GPU resources. num_cpus_per_worker The number of CPUs to set per worker. num_gpus_per_worker The number of GPUs to set per worker. num_workers placement_strategy resources_per_worker total_resources Map of total resources required for the trainer. trainer_resources use_gpu ray.air.ScalingConfig.additional_resources_per_worker property ScalingConfig.additional_resources_per_worker Resources per worker, not including CPU or GPU resources.ray.air.ScalingConfig.num_cpus_per_worker property ScalingConfig.num_cpus_per_worker The number of CPUs to set per worker.ray.air.ScalingConfig.num_gpus_per_worker property ScalingConfig.num_gpus_per_worker The number of GPUs to set per worker.ray.air.ScalingConfig.num_workers ScalingConfig.num_workers: Optional[Union[int, Domain, Dict[str, List]]] = None ray.air.ScalingConfig.placement_strategy ScalingConfig.placement_strategy: Union[str, Domain, Dict[str, List]] = 'PACK' ray.air.ScalingConfig.resources_per_worker ScalingConfig.resources_per_worker: Optional[Union[Dict, Domain, Dict[str, List]]] = None ray.air.ScalingConfig.total_resources property ScalingConfig.total_resources Map of total resources required for the trainer.ray.air.ScalingConfig.trainer_resources ScalingConfig.trainer_resources: Optional[Union[Dict, Domain, Dict[str, List]]] = None ray.air.ScalingConfig.use_gpu ScalingConfig.use_gpu: Union[bool, Domain, Dict[str, List]] = False ray.air.CheckpointConfig class ray.air.CheckpointConfig(num_to_keep: Optional[int] = None, checkpoint_score_attribute: Optional[str] = None, checkpoint_score_order: Optional[str] = 'max', checkpoint_frequency: Optional[int] = 0, checkpoint_at_end: Optional[bool] = None, _checkpoint_keep_all_ranks: Optional[bool] = False, _checkpoint_upload_from_workers: Optional[bool] = False)[source] Bases: object Configurable parameters for defining the checkpointing strategy. Default behavior is to persist all checkpoints to disk. If num_to_keep is set, the default retention policy is to keep the checkpoints with maximum timestamp, i.e. the most recent checkpoints. Parameters num_to_keep – The number of checkpoints to keep on disk for this run. If a checkpoint is persisted to disk after there are already this many checkpoints, then an existing checkpoint will be deleted. If this is None then checkpoints will not be deleted. Must be >= 1. checkpoint_score_attribute – The attribute that will be used to score checkpoints to determine which checkpoints should be kept on disk when there are greater than num_to_keep checkpoints. This attribute must be a key from the checkpoint dictionary which has a numerical value. Per default, the last checkpoints will be kept. checkpoint_score_order – Either “max” or “min”. If “max”, then checkpoints with highest values of checkpoint_score_attribute will be kept. If “min”, then checkpoints with lowest values of checkpoint_score_attribute will be kept. checkpoint_frequency – Number of iterations between checkpoints. 
If 0 this will disable checkpointing. Please note that most trainers will still save one checkpoint at the end of training. This attribute is only supported by trainers that don’t take in custom training loops. checkpoint_at_end – If True, will save a checkpoint at the end of training. This attribute is only supported by trainers that don’t take in custom training loops. Defaults to True for trainers that support it and False for generic function trainables. _checkpoint_keep_all_ranks – If True, will save checkpoints from all ranked training workers. If False, only checkpoint from rank 0 worker is kept. NOTE: This API is experimental and subject to change between minor releases. _checkpoint_upload_from_workers – If True, distributed workers will upload their checkpoints to cloud directly. This is to avoid the need for transferring large checkpoint files to the training worker group coordinator for persistence. NOTE: This API is experimental and subject to change between minor releases. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods Attributes checkpoint_at_end checkpoint_frequency checkpoint_score_attribute checkpoint_score_order num_to_keep ray.air.CheckpointConfig.checkpoint_at_end CheckpointConfig.checkpoint_at_end: Optional[bool] = None ray.air.CheckpointConfig.checkpoint_frequency CheckpointConfig.checkpoint_frequency: Optional[int] = 0 ray.air.CheckpointConfig.checkpoint_score_attribute CheckpointConfig.checkpoint_score_attribute: Optional[str] = None ray.air.CheckpointConfig.checkpoint_score_order CheckpointConfig.checkpoint_score_order: Optional[str] = 'max' ray.air.CheckpointConfig.num_to_keep CheckpointConfig.num_to_keep: Optional[int] = None ray.air.FailureConfig class ray.air.FailureConfig(max_failures: int = 0, fail_fast: Union[bool, str] = False)[source] Bases: object Configuration related to failure handling of each training/tuning run. Parameters max_failures – Tries to recover a run at least this many times. Will recover from the latest checkpoint if present. Setting to -1 will lead to infinite recovery retries. Setting to 0 will disable retries. Defaults to 0. fail_fast – Whether to fail upon the first error. Only used for Ray Tune - this does not apply to single training runs (e.g. with Trainer.fit()). If fail_fast=’raise’ provided, Ray Tune will automatically raise the exception received by the Trainable. fail_fast=’raise’ can easily leak resources and should be used with caution (it is best used with ray.init(local_mode=True)). PublicAPI (beta): This API is in beta and may change before becoming stable. Methods Attributes fail_fast max_failures ray.air.FailureConfig.fail_fast FailureConfig.fail_fast: Union[bool, str] = False ray.air.FailureConfig.max_failures FailureConfig.max_failures: int = 0 tune.TuneConfig([mode, metric, search_alg, ...]) Tune specific configs. tune.syncer.SyncConfig([upload_dir, syncer, ...]) Configuration object for Tune syncing. Ray Train API This page covers framework specific integrations with Ray Train and Ray Train Developer APIs. For core Ray AIR APIs, take a look at the AIR package reference. Ray Train Base Classes (Developer APIs) Trainer Base Classes BaseTrainer(*args, **kwargs) Defines interface for distributed training on Ray. DataParallelTrainer(*args, **kwargs) A Trainer for data parallel training. DataConfig([datasets_to_split, ...]) Class responsible for configuring Train dataset preprocessing. 
GBDTTrainer(*args, **kwargs) Abstract class for scaling gradient-boosting decision tree (GBDT) frameworks. ray.train.trainer.BaseTrainer class ray.train.trainer.BaseTrainer(*args, **kwargs)[source] Bases: abc.ABC Defines interface for distributed training on Ray. Note: The base BaseTrainer class cannot be instantiated directly. Only one of its subclasses can be used. Note to AIR developers: If a new AIR trainer is added, please update air/_internal/usage.py. How does a trainer work? First, initialize the Trainer. The initialization runs locally, so heavyweight setup should not be done in __init__. Then, when you call trainer.fit(), the Trainer is serialized and copied to a remote Ray actor. The following methods are then called in sequence on the remote actor. trainer.setup(): Any heavyweight Trainer setup should be specified here. trainer.preprocess_datasets(): The datasets passed to the Trainer will be setup here. trainer.train_loop(): Executes the main training logic. Calling trainer.fit() will return a ray.result.Result object where you can access metrics from your training run, as well as any checkpoints that may have been saved. How do I create a new Trainer? Subclass ray.train.trainer.BaseTrainer, and override the training_loop method, and optionally setup. import torch from ray.train.trainer import BaseTrainer from ray import tune from ray.air import session class MyPytorchTrainer(BaseTrainer): def setup(self): self.model = torch.nn.Linear(1, 1) self.optimizer = torch.optim.SGD( self.model.parameters(), lr=0.1) def training_loop(self): # You can access any Trainer attributes directly in this method. # self.datasets["train"] has already been dataset = self.datasets["train"] torch_ds = dataset.iter_torch_batches(dtypes=torch.float) loss_fn = torch.nn.MSELoss() for epoch_idx in range(10): loss = 0 num_batches = 0 torch_ds = dataset.iter_torch_batches( dtypes=torch.float, batch_size=2 ) for batch in torch_ds: X = torch.unsqueeze(batch["x"], 1) y = torch.unsqueeze(batch["y"], 1) # Compute prediction error pred = self.model(X) batch_loss = loss_fn(pred, y) # Backpropagation self.optimizer.zero_grad() batch_loss.backward() self.optimizer.step() loss += batch_loss.item() num_batches += 1 loss /= num_batches # Use Tune functions to report intermediate # results. session.report({"loss": loss, "epoch": epoch_idx}) # Initialize the Trainer, and call Trainer.fit() import ray train_dataset = ray.data.from_items( [{"x": i, "y": i} for i in range(10)]) my_trainer = MyPytorchTrainer(datasets={"train": train_dataset}) result = my_trainer.fit() ... Parameters scaling_config – Configuration for how to scale training. run_config – Configuration for the execution of the training run. datasets – Any Datasets to use for training. Use the key “train” to denote which dataset is the training dataset. resume_from_checkpoint – A checkpoint to resume training from. DeveloperAPI: This API may change across minor Ray releases. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. preprocess_datasets() Called during fit() to preprocess dataset attributes with preprocessor. restore(path[, datasets, preprocessor, ...]) Restores a Train experiment from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. training_loop() Loop called by fit() to run training and report results to Tune. 
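As noted in the class description, trainer.fit() returns a ray.air.result.Result. Continuing the MyPytorchTrainer example above (and assuming the run completed), the attributes you typically inspect first look like this; the exact metric keys depend on what the training loop passed to session.report().

# `result` is the Result returned by my_trainer.fit() above.
# "loss" and "epoch" are present only because the example loop reported them.
print(result.metrics["loss"], result.metrics["epoch"])

# The most recently reported checkpoint, if the loop saved any; the example
# loop above reports metrics only, so this may be None.
print(result.checkpoint)

# If the run failed instead of raising, the captured exception is stored here.
print(result.error)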
ray.train.trainer.BaseTrainer.as_trainable BaseTrainer.as_trainable() -> Type[Trainable][source] Convert self to a tune.Trainable class.ray.train.trainer.BaseTrainer.can_restore classmethod BaseTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool[source] Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.trainer.BaseTrainer.fit BaseTrainer.fit() -> ray.air.result.Result[source] Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.trainer.BaseTrainer.preprocess_datasets BaseTrainer.preprocess_datasets() -> None[source] Called during fit() to preprocess dataset attributes with preprocessor. This method is run on a remote process. This method is called prior to entering the training_loop. If the Trainer has both a datasets dict and a preprocessor, the datasets dict contains a training dataset (denoted by the “train” key), and the preprocessor has not yet been fit, then it will be fit on the train dataset. Then, all Trainer’s datasets will be transformed by the preprocessor. The transformed datasets will be set back in the self.datasets attribute of the Trainer to be used when overriding training_loop.ray.train.trainer.BaseTrainer.restore classmethod BaseTrainer.restore(path: str, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None, **kwargs) -> BaseTrainer[source] Restores a Train experiment from a previously interrupted/failed run. Restore should be used for experiment-level fault tolerance in the event that the head node crashes (e.g., OOM or some other runtime error) or the entire cluster goes down (e.g., network error affecting all nodes). The following example can be paired with implementing job retry using Ray Jobs to produce a Train experiment that will attempt to resume on both experiment-level and trial-level failures: import os import ray from ray import air from ray.data.preprocessors import BatchMapper from ray.train.trainer import BaseTrainer experiment_name = "unique_experiment_name" local_dir = "~/ray_results" experiment_dir = os.path.join(local_dir, experiment_name) # Define some dummy inputs for demonstration purposes datasets = {"train": ray.data.from_items([{"a": i} for i in range(10)])} preprocessor = BatchMapper(lambda x: x, batch_format="numpy") class CustomTrainer(BaseTrainer): def training_loop(self): pass if CustomTrainer.can_restore(experiment_dir): trainer = CustomTrainer.restore( experiment_dir, datasets=datasets, ) else: trainer = CustomTrainer( datasets=datasets, preprocessor=preprocessor, run_config=air.RunConfig( name=experiment_name, local_dir=local_dir, # Tip: You can also enable retries on failure for # worker-level fault tolerance failure_config=air.FailureConfig(max_failures=3), ), ) result = trainer.fit() ... Parameters path – The path to the experiment directory of the training run to restore. 
This can be a local path or a remote URI if the experiment was uploaded to the cloud. datasets – Re-specified datasets used in the original training run. This must include all the datasets that were passed in the original trainer constructor. preprocessor – Optionally re-specified preprocessor that was passed in the original trainer constructor. This should be used to re-supply the preprocessor if it is not restorable in a new Ray cluster. This preprocessor will be fit at the start before resuming training. If no preprocessor is passed in restore, then the old preprocessor will be loaded from the latest checkpoint and will not be re-fit. scaling_config – Optionally re-specified scaling config. This can be modified to be different from the original spec. **kwargs – Other optionally re-specified arguments, passed in by subclasses. Raises ValueError – If all datasets were not re-supplied on restore. Returns A restored instance of the class that is calling this method. Return type BaseTrainerray.train.trainer.BaseTrainer.setup BaseTrainer.setup() -> None[source] Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop.ray.train.trainer.BaseTrainer.training_loop abstract BaseTrainer.training_loop() -> None[source] Loop called by fit() to run training and report results to Tune. This method runs on a remote process. self.datasets have already been preprocessed by self.preprocessor. You can use the Tune Function API functions (session.report() and session.get_checkpoint()) inside this training loop. Example: from ray.train.trainer import BaseTrainer from ray.air import session class MyTrainer(BaseTrainer): def training_loop(self): for epoch_idx in range(5): ... session.report({"epoch": epoch_idx})ray.train.data_parallel_trainer.DataParallelTrainer class ray.train.data_parallel_trainer.DataParallelTrainer(*args, **kwargs)[source] Bases: ray.train.base_trainer.BaseTrainer A Trainer for data parallel training. You should subclass this Trainer if your Trainer follows SPMD (single program, multiple data) programming paradigm - you want multiple processes to run the same function, but on different data. This Trainer runs the function train_loop_per_worker on multiple Ray Actors. The train_loop_per_worker function is expected to take in either 0 or 1 arguments: def train_loop_per_worker(): ... def train_loop_per_worker(config: Dict): ... If train_loop_per_worker accepts an argument, then train_loop_config will be passed in as the argument. This is useful if you want to tune the values in train_loop_config as hyperparameters. If the datasets dict contains a training dataset (denoted by the “train” key), then it will be split into multiple dataset shards that can then be accessed by session.get_dataset_shard("train") inside train_loop_per_worker. All the other datasets will not be split and session.get_dataset_shard(...) will return the the entire Dataset. Inside the train_loop_per_worker function, you can use any of the Ray AIR session methods. def train_loop_per_worker(): # Report intermediate results for callbacks or logging and # checkpoint data. session.report(...) # Returns dict of last saved checkpoint. session.get_checkpoint() # Returns the Dataset shard for the given key. 
session.get_dataset_shard("my_dataset") # Returns the total number of workers executing training. session.get_world_size() # Returns the rank of this worker. session.get_world_rank() # Returns the rank of the worker on the current node. session.get_local_rank() Any returns from the train_loop_per_worker will be discarded and not used or persisted anywhere. How do I use DataParallelTrainer or any of its subclasses? Example: import ray from ray.air import session from ray.air.config import ScalingConfig from ray.train.data_parallel_trainer import DataParallelTrainer def train_loop_for_worker(): dataset_shard_for_this_worker = session.get_dataset_shard("train") # 3 items for 3 workers, each worker gets 1 item batches = list(dataset_shard_for_this_worker.iter_batches(batch_size=1)) assert len(batches) == 1 train_dataset = ray.data.from_items([1, 2, 3]) assert train_dataset.count() == 3 trainer = DataParallelTrainer( train_loop_for_worker, scaling_config=ScalingConfig(num_workers=3), datasets={"train": train_dataset}, ) result = trainer.fit() ... How do I develop on top of DataParallelTrainer? In many cases, using DataParallelTrainer directly is sufficient to execute functions on multiple actors. However, you may want to subclass DataParallelTrainer and create a custom Trainer for the following 2 use cases: Use Case 1: You want to do data parallel training, but want to have a predefined training_loop_per_worker. Use Case 2: You want to implement a custom Backend that automatically handles additional setup or teardown logic on each actor, so that the users of this new trainer do not have to implement this logic. For example, a TensorflowTrainer can be built on top of DataParallelTrainer that automatically handles setting the proper environment variables for distributed Tensorflow on each actor. For 1, you can set a predefined training loop in __init__ from ray.train.data_parallel_trainer import DataParallelTrainer class MyDataParallelTrainer(DataParallelTrainer): def __init__(self, *args, **kwargs): predefined_train_loop_per_worker = lambda: 1 super().__init__(predefined_train_loop_per_worker, *args, **kwargs) For 2, you can implement the ray.train.Backend and ray.train.BackendConfig interfaces. from dataclasses import dataclass from ray.train.backend import Backend, BackendConfig class MyBackend(Backend): def on_start(self, worker_group, backend_config): def set_env_var(env_var_value): import os os.environ["MY_ENV_VAR"] = env_var_value worker_group.execute(set_env_var, backend_config.env_var) @dataclass class MyBackendConfig(BackendConfig): env_var: str = "default_value" def backend_cls(self): return MyBackend class MyTrainer(DataParallelTrainer): def __init__(self, train_loop_per_worker, my_backend_config: MyBackendConfig, **kwargs): super().__init__( train_loop_per_worker, backend_config=my_backend_config, **kwargs) Parameters train_loop_per_worker – The training function to execute. This can either take in no arguments or a config dict. train_loop_config – Configurations to pass into train_loop_per_worker if it accepts an argument. backend_config – Configuration for setting up a Backend (e.g. Torch, Tensorflow, Horovod) on each worker to enable distributed communication. If no Backend should be set up, then set this to None. scaling_config – Configuration for how to scale data parallel training. dataset_config – Configuration for dataset ingest. This is merged with the default dataset config for the given trainer (cls._dataset_config). 
run_config – Configuration for the execution of the training run. datasets – Any Datasets to use for training. Use the key “train” to denote which dataset is the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. DeveloperAPI: This API may change across minor Ray releases. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. get_dataset_config() Return a copy of this Trainer's final dataset configs. restore(path[, train_loop_per_worker, ...]) Restores a DataParallelTrainer from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.data_parallel_trainer.DataParallelTrainer.as_trainable DataParallelTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.data_parallel_trainer.DataParallelTrainer.can_restore classmethod DataParallelTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.data_parallel_trainer.DataParallelTrainer.fit DataParallelTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.data_parallel_trainer.DataParallelTrainer.get_dataset_config DataParallelTrainer.get_dataset_config() -> ray.train._internal.data_config.DataConfig[source] Return a copy of this Trainer’s final dataset configs. Returns The merged default + user-supplied dataset config.ray.train.data_parallel_trainer.DataParallelTrainer.restore classmethod DataParallelTrainer.restore(path: str, train_loop_per_worker: Optional[Union[Callable[[], None], Callable[[Dict], None]]] = None, train_loop_config: Optional[Dict] = None, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None) -> DataParallelTrainer[source] Restores a DataParallelTrainer from a previously interrupted/failed run. Parameters train_loop_per_worker – Optionally re-specified train loop function. This should be used to re-specify a function that is not restorable in a new Ray cluster (e.g., it holds onto outdated object references). This should be the same training loop that was passed to the original trainer constructor. train_loop_config – Optionally re-specified train config. This should similarly be used if the original train_loop_config contained outdated object references, and it should not be modified from what was originally passed in. See BaseTrainer.restore() for descriptions of the other arguments. Returns A restored instance of the DataParallelTrainer subclass that is calling this method. 
Return type DataParallelTrainerray.train.data_parallel_trainer.DataParallelTrainer.setup DataParallelTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop.ray.train.DataConfig class ray.train.DataConfig(datasets_to_split: Optional[List[str]] = None, execution_options: Optional[ray.data._internal.execution.interfaces.ExecutionOptions] = None)[source] Bases: object Class responsible for configuring Train dataset preprocessing. For advanced use cases, this class can be subclassed and the configure() method overriden for custom data preprocessing. PublicAPI: This API is stable across Ray releases. Methods __init__([datasets_to_split, execution_options]) Construct a DataConfig. configure(datasets, world_size, ...) Configure how Train datasets should be assigned to workers. default_ingest_options() The default Ray Data options used for data ingest. ray.train.DataConfig.__init__ DataConfig.__init__(datasets_to_split: Optional[List[str]] = None, execution_options: Optional[ray.data._internal.execution.interfaces.ExecutionOptions] = None)[source] Construct a DataConfig. Parameters datasets_to_split – The list of dataset names to split between workers. By default, only the “train” dataset will be split. execution_options – The execution options to pass to Ray Data. By default, the options will be optimized for data ingest. When overriding this, base your options off of DataConfig.default_ingest_options().ray.train.DataConfig.configure DataConfig.configure(datasets: Dict[str, ray.data.dataset.Dataset], world_size: int, worker_handles: Optional[List[ray.actor.ActorHandle]], worker_node_ids: Optional[List[str]], **kwargs) -> List[Dict[str, ray.data.iterator.DataIterator]][source] Configure how Train datasets should be assigned to workers. Parameters datasets – The datasets dict passed to Train by the user. world_size – The number of Train workers in total. worker_handles – The actor handles of the Train workers. worker_node_ids – The node ids of the Train workers. kwargs – Forwards compatibility placeholder. Returns A list of dataset splits for each worker. The size of the list must be equal to world_size. Each element of the list contains the assigned DataIterator instances by name for the worker. DeveloperAPI: This API may change across minor Ray releases.ray.train.DataConfig.default_ingest_options static DataConfig.default_ingest_options() -> ray.data._internal.execution.interfaces.ExecutionOptions[source] The default Ray Data options used for data ingest. We enable output locality, which means that Ray Data will try to place tasks on the node the data will be consumed. We also set the object store memory limit to a fixed smaller value, to avoid using too much memory per Train worker.ray.train.gbdt_trainer.GBDTTrainer class ray.train.gbdt_trainer.GBDTTrainer(*args, **kwargs)[source] Bases: ray.train.base_trainer.BaseTrainer Abstract class for scaling gradient-boosting decision tree (GBDT) frameworks. Inherited by XGBoostTrainer and LightGBMTrainer. Parameters datasets – Datasets to use for training and validation. Must include a “train” key denoting the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. 
All non-training datasets will be used as separate validation sets, each reporting a separate metric. label_column – Name of the label column. A column with this name must be present in the training dataset. params – Framework specific training parameters. dmatrix_params – Dict of dataset name:dict of kwargs passed to respective xgboost_ray.RayDMatrix initializations. num_boost_round – Target number of boosting iterations (trees in the model). scaling_config – Configuration for how to scale data parallel training. run_config – Configuration for the execution of the training run. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. **train_kwargs – Additional kwargs passed to framework train() function. DeveloperAPI: This API may change across minor Ray releases. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. preprocess_datasets() Called during fit() to preprocess dataset attributes with preprocessor. restore(path[, datasets, preprocessor, ...]) Restores a Train experiment from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.gbdt_trainer.GBDTTrainer.as_trainable GBDTTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.gbdt_trainer.GBDTTrainer.can_restore classmethod GBDTTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.gbdt_trainer.GBDTTrainer.fit GBDTTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.gbdt_trainer.GBDTTrainer.preprocess_datasets GBDTTrainer.preprocess_datasets() -> None Called during fit() to preprocess dataset attributes with preprocessor. This method is run on a remote process. This method is called prior to entering the training_loop. If the Trainer has both a datasets dict and a preprocessor, the datasets dict contains a training dataset (denoted by the “train” key), and the preprocessor has not yet been fit, then it will be fit on the train dataset. Then, all Trainer’s datasets will be transformed by the preprocessor. The transformed datasets will be set back in the self.datasets attribute of the Trainer to be used when overriding training_loop.ray.train.gbdt_trainer.GBDTTrainer.restore classmethod GBDTTrainer.restore(path: str, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None, **kwargs) -> BaseTrainer Restores a Train experiment from a previously interrupted/failed run. Restore should be used for experiment-level fault tolerance in the event that the head node crashes (e.g., OOM or some other runtime error) or the entire cluster goes down (e.g., network error affecting all nodes). 
The following example can be paired with implementing job retry using Ray Jobs to produce a Train experiment that will attempt to resume on both experiment-level and trial-level failures: import os import ray from ray import air from ray.data.preprocessors import BatchMapper from ray.train.trainer import BaseTrainer experiment_name = "unique_experiment_name" local_dir = "~/ray_results" experiment_dir = os.path.join(local_dir, experiment_name) # Define some dummy inputs for demonstration purposes datasets = {"train": ray.data.from_items([{"a": i} for i in range(10)])} preprocessor = BatchMapper(lambda x: x, batch_format="numpy") class CustomTrainer(BaseTrainer): def training_loop(self): pass if CustomTrainer.can_restore(experiment_dir): trainer = CustomTrainer.restore( experiment_dir, datasets=datasets, ) else: trainer = CustomTrainer( datasets=datasets, preprocessor=preprocessor, run_config=air.RunConfig( name=experiment_name, local_dir=local_dir, # Tip: You can also enable retries on failure for # worker-level fault tolerance failure_config=air.FailureConfig(max_failures=3), ), ) result = trainer.fit() ... Parameters path – The path to the experiment directory of the training run to restore. This can be a local path or a remote URI if the experiment was uploaded to the cloud. datasets – Re-specified datasets used in the original training run. This must include all the datasets that were passed in the original trainer constructor. preprocessor – Optionally re-specified preprocessor that was passed in the original trainer constructor. This should be used to re-supply the preprocessor if it is not restorable in a new Ray cluster. This preprocessor will be fit at the start before resuming training. If no preprocessor is passed in restore, then the old preprocessor will be loaded from the latest checkpoint and will not be re-fit. scaling_config – Optionally re-specified scaling config. This can be modified to be different from the original spec. **kwargs – Other optionally re-specified arguments, passed in by subclasses. Raises ValueError – If all datasets were not re-supplied on restore. Returns A restored instance of the class that is calling this method. Return type BaseTrainerray.train.gbdt_trainer.GBDTTrainer.setup GBDTTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop. BaseTrainer API fit() Runs training. setup() Called during fit() to perform initial setup on the Trainer. preprocess_datasets() Called during fit() to preprocess dataset attributes with preprocessor. training_loop() Loop called by fit() to run training and report results to Tune. as_trainable() Convert self to a tune.Trainable class. Train Backend Base Classes Backend(*args, **kwargs) Singleton for distributed communication backend. BackendConfig() Parent class for configurations of training backend. ray.train.backend.Backend class ray.train.backend.Backend(*args, **kwargs)[source] Bases: object Singleton for distributed communication backend. share_cuda_visible_devices If True, each worker process will have CUDA_VISIBLE_DEVICES set as the visible device IDs of all workers on the same node for this training instance. If False, each worker will have CUDA_VISIBLE_DEVICES set to the device IDs allocated by Ray for that worker. 
Type bool DeveloperAPI: This API may change across minor Ray releases. on_start(worker_group: ray.train._internal.worker_group.WorkerGroup, backend_config: ray.train.backend.BackendConfig)[source] Logic for starting this backend. on_shutdown(worker_group: ray.train._internal.worker_group.WorkerGroup, backend_config: ray.train.backend.BackendConfig)[source] Logic for shutting down the backend. on_training_start(worker_group: ray.train._internal.worker_group.WorkerGroup, backend_config: ray.train.backend.BackendConfig)[source] Logic ran right before training is started. Session API is available at this point. static encode_data(data_dict: Dict) -> ray.train.backend.EncodedData[source] Logic to encode a data dict before sending to the driver. This function will be called on the workers for any data that is sent to the driver via session.report(). static decode_data(encoded_data: ray.train.backend.EncodedData) -> Dict[source] Logic to decode an encoded data dict. This function will be called on the driver after receiving the encoded data dict from the worker.ray.train.backend.BackendConfig class ray.train.backend.BackendConfig[source] Bases: object Parent class for configurations of training backend. DeveloperAPI: This API may change across minor Ray releases. Ray Train Integrations PyTorch TorchTrainer(*args, **kwargs) A Trainer for data parallel PyTorch training. TorchConfig([backend, init_method, timeout_s]) Configuration for torch process group setup. TorchCheckpoint([local_path, data_dict, uri]) A Checkpoint with Torch-specific functionality. ray.train.torch.TorchTrainer class ray.train.torch.TorchTrainer(*args, **kwargs)[source] Bases: ray.train.data_parallel_trainer.DataParallelTrainer A Trainer for data parallel PyTorch training. This Trainer runs the function train_loop_per_worker on multiple Ray Actors. These actors already have the necessary torch process group configured for distributed PyTorch training. The train_loop_per_worker function is expected to take in either 0 or 1 arguments: def train_loop_per_worker(): ... from typing import Dict, Any def train_loop_per_worker(config: Dict[str, Any]): ... If train_loop_per_worker accepts an argument, then train_loop_config will be passed in as the argument. This is useful if you want to tune the values in train_loop_config as hyperparameters. If the datasets dict contains a training dataset (denoted by the “train” key), then it will be split into multiple dataset shards that can then be accessed by session.get_dataset_shard("train") inside train_loop_per_worker. All the other datasets will not be split and session.get_dataset_shard(...) will return the the entire Dataset. Inside the train_loop_per_worker function, you can use any of the Ray AIR session methods. See full example code below. def train_loop_per_worker(): # Report intermediate results for callbacks or logging and # checkpoint data. session.report(...) # Get dict of last saved checkpoint. session.get_checkpoint() # Session returns the Dataset shard for the given key. session.get_dataset_shard("my_dataset") # Get the total number of workers executing training. session.get_world_size() # Get the rank of this worker. session.get_world_rank() # Get the rank of the worker on the current node. 
session.get_local_rank() You can also use any of the Torch-specific function utils, such as ray.train.torch.get_device() and ray.train.torch.prepare_model() def train_loop_per_worker(): # Prepares model for distributed training by wrapping in # `DistributedDataParallel` and moving to correct device. train.torch.prepare_model(...) # Configures the dataloader for distributed training by adding a # `DistributedSampler`. # You should NOT use this if you are doing # `session.get_dataset_shard(...).iter_torch_batches(...)` train.torch.prepare_data_loader(...) # Get the current torch device. train.torch.get_device() Any returns from the train_loop_per_worker will be discarded and not used or persisted anywhere. To save a model to use for the TorchPredictor, you must save it under the “model” kwarg in the Checkpoint passed to session.report(). When you wrap the model with prepare_model, the keys of its state_dict are prefixed by module.. For example, layer1.0.bn1.bias becomes module.layer1.0.bn1.bias. However, when saving the model through session.report() all module. prefixes are stripped. As a result, when you load from a saved checkpoint, make sure that you first load the state_dict into the model before calling prepare_model. Otherwise, you will run into errors like Error(s) in loading state_dict for DistributedDataParallel: Missing key(s) in state_dict: "module.conv1.weight", .... See the snippet below. from torchvision.models import resnet18 from ray.air import session from ray.air.checkpoint import Checkpoint import ray.train as train def train_func(): ... model = resnet18() model = train.torch.prepare_model(model) for epoch in range(3): ... ckpt = Checkpoint.from_dict({ "epoch": epoch, "model": model.state_dict(), # "model": model.module.state_dict(), # ** The above two are equivalent ** }) session.report({"foo": "bar"}, ckpt) Example import torch import torch.nn as nn import ray from ray import train from ray.air import session, Checkpoint from ray.train.torch import TorchTrainer from ray.air.config import ScalingConfig from ray.air.config import RunConfig from ray.air.config import CheckpointConfig # If using GPUs, set this to True. use_gpu = False # Define NN layer architecture, epochs, and number of workers input_size = 1 layer_size = 32 output_size = 1 num_epochs = 20 num_workers = 3 # Define your network structure class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.layer1 = nn.Linear(input_size, layer_size) self.relu = nn.ReLU() self.layer2 = nn.Linear(layer_size, output_size) def forward(self, input): return self.layer2(self.relu(self.layer1(input))) # Define your train worker loop def train_loop_per_worker(): torch.manual_seed(42) # Fetch training set from the session dataset_shard = session.get_dataset_shard("train") model = NeuralNetwork() # Loss function, optimizer, prepare model for training.
# This moves the data and prepares model for distributed # execution loss_fn = nn.MSELoss() optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.01) model = train.torch.prepare_model(model) # Iterate over epochs and batches for epoch in range(num_epochs): for batches in dataset_shard.iter_torch_batches(batch_size=32, dtypes=torch.float): # Unsqueeze the inputs to add a feature dimension: [32] -> [32, 1] inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"] output = model(inputs) # Make the output shape the same as the labels loss = loss_fn(output.squeeze(), labels) # Zero out grads, do backward, and update optimizer optimizer.zero_grad() loss.backward() optimizer.step() # Print the loss every 20 epochs if epoch % 20 == 0: print(f"epoch: {epoch}/{num_epochs}, loss: {loss:.3f}") # Report and record metrics, checkpoint model at end of each # epoch session.report({"loss": loss.item(), "epoch": epoch}, checkpoint=Checkpoint.from_dict( dict(epoch=epoch, model=model.state_dict())) ) train_dataset = ray.data.from_items( [{"x": x, "y": 2 * x + 1} for x in range(2000)] ) # Define scaling and run configs scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu) run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=1)) trainer = TorchTrainer( train_loop_per_worker=train_loop_per_worker, scaling_config=scaling_config, run_config=run_config, datasets={"train": train_dataset}) result = trainer.fit() best_checkpoint_loss = result.metrics['loss'] # Assert the loss is at most 0.09 assert best_checkpoint_loss <= 0.09 ... Parameters train_loop_per_worker – The training function to execute. This can either take in no arguments or a config dict. train_loop_config – Configurations to pass into train_loop_per_worker if it accepts an argument. torch_config – Configuration for setting up the PyTorch backend. If set to None, use the default configuration. This replaces the backend_config arg of DataParallelTrainer. scaling_config – Configuration for how to scale data parallel training. dataset_config – Configuration for dataset ingest. run_config – Configuration for the execution of the training run. datasets – Any Datasets to use for training. Use the key “train” to denote which dataset is the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. get_dataset_config() Return a copy of this Trainer's final dataset configs. restore(path[, train_loop_per_worker, ...]) Restores a DataParallelTrainer from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.torch.TorchTrainer.as_trainable TorchTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.torch.TorchTrainer.can_restore classmethod TorchTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment.
This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.torch.TorchTrainer.fit TorchTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.torch.TorchTrainer.get_dataset_config TorchTrainer.get_dataset_config() -> ray.train._internal.data_config.DataConfig Return a copy of this Trainer’s final dataset configs. Returns The merged default + user-supplied dataset config.ray.train.torch.TorchTrainer.restore classmethod TorchTrainer.restore(path: str, train_loop_per_worker: Optional[Union[Callable[[], None], Callable[[Dict], None]]] = None, train_loop_config: Optional[Dict] = None, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None) -> DataParallelTrainer Restores a DataParallelTrainer from a previously interrupted/failed run. Parameters train_loop_per_worker – Optionally re-specified train loop function. This should be used to re-specify a function that is not restorable in a new Ray cluster (e.g., it holds onto outdated object references). This should be the same training loop that was passed to the original trainer constructor. train_loop_config – Optionally re-specified train config. This should similarly be used if the original train_loop_config contained outdated object references, and it should not be modified from what was originally passed in. See BaseTrainer.restore() for descriptions of the other arguments. Returns A restored instance of the DataParallelTrainer subclass that is calling this method. Return type DataParallelTrainerray.train.torch.TorchTrainer.setup TorchTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop.ray.train.torch.TorchConfig class ray.train.torch.TorchConfig(backend: Optional[str] = None, init_method: str = 'env', timeout_s: int = 1800)[source] Bases: ray.train.backend.BackendConfig Configuration for torch process group setup. See https://pytorch.org/docs/stable/distributed.html for more info. Parameters backend – The backend to use for training. See torch.distributed.init_process_group for more info and valid values. If set to None, nccl will be used if GPUs are requested, else gloo will be used. init_method – The initialization method to use. Either “env” for environment variable initialization or “tcp” for TCP initialization. Defaults to “env”. timeout_s – Seconds for process group operations to timeout. PublicAPI (beta): This API is in beta and may change before becoming stable. 
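For illustration, here is a minimal sketch (not part of the reference itself) of overriding these defaults and passing the config to a TorchTrainer through its torch_config argument; the train_loop_per_worker body and worker count are placeholders:

from ray.air.config import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

def train_loop_per_worker():
    ...  # placeholder training logic

# Force the gloo backend (e.g. on a CPU-only cluster) and double the
# default 1800 s process group timeout.
torch_config = TorchConfig(backend="gloo", init_method="env", timeout_s=3600)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    torch_config=torch_config,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()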
Methods Attributes backend backend_cls init_method timeout_s ray.train.torch.TorchConfig.backend TorchConfig.backend: Optional[str] = None ray.train.torch.TorchConfig.backend_cls property TorchConfig.backend_cls ray.train.torch.TorchConfig.init_method TorchConfig.init_method: str = 'env' ray.train.torch.TorchConfig.timeout_s TorchConfig.timeout_s: int = 1800 ray.train.torch.TorchCheckpoint class ray.train.torch.TorchCheckpoint(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None)[source] Bases: ray.air.checkpoint.Checkpoint A Checkpoint with Torch-specific functionality. Create this from a generic Checkpoint by calling TorchCheckpoint.from_checkpoint(ckpt). PublicAPI (beta): This API is in beta and may change before becoming stable. Methods __init__([local_path, data_dict, uri]) DeveloperAPI: This API may change across minor Ray releases. as_directory() Return checkpoint directory path in a context. from_bytes(data) Create a checkpoint from the given byte string. from_checkpoint(other) Create a checkpoint from a generic Checkpoint. from_dict(data) Create checkpoint object from dictionary. from_directory(path) Create checkpoint object from directory. from_model(model, *[, preprocessor]) Create a Checkpoint that stores a Torch model. from_state_dict(state_dict, *[, preprocessor]) Create a Checkpoint that stores a model state dictionary. from_uri(uri) Create checkpoint object from location URI (e.g. get_internal_representation() Return tuple of (type, data) for the internal representation. get_model([model]) Retrieve the model stored in this checkpoint. get_preprocessor() Return the saved preprocessor, if one exists. set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. to_bytes() Return Checkpoint serialized as bytes object. to_dict() Return checkpoint data as dictionary. to_directory([path]) Write checkpoint data to directory. to_uri(uri) Write checkpoint data to location URI (e.g. ray.train.torch.TorchCheckpoint.__init__ TorchCheckpoint.__init__(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None) DeveloperAPI: This API may change across minor Ray releases.ray.train.torch.TorchCheckpoint.as_directory TorchCheckpoint.as_directory() -> Iterator[str] Return checkpoint directory path in a context. This function makes checkpoint data available as a directory while avoiding unnecessary copies and left-over temporary data. If the checkpoint is already a directory checkpoint, it will return the existing path. If it is not, it will create a temporary directory, which will be deleted after the context is exited. Users should treat the returned checkpoint directory as read-only and avoid changing any data within it, as it might get deleted when exiting the context. Example: with checkpoint.as_directory() as checkpoint_dir: # Do some read-only processing of files within checkpoint_dir pass # At this point, if a temporary directory was created, it will have # been deleted.ray.train.torch.TorchCheckpoint.from_bytes classmethod TorchCheckpoint.from_bytes(data: bytes) -> ray.air.checkpoint.Checkpoint Create a checkpoint from the given byte string. Parameters data – Data object containing pickled checkpoint data. Returns checkpoint object. 
Return type Checkpointray.train.torch.TorchCheckpoint.from_checkpoint classmethod TorchCheckpoint.from_checkpoint(other: ray.air.checkpoint.Checkpoint) -> ray.air.checkpoint.Checkpoint Create a checkpoint from a generic Checkpoint. This method can be used to create a framework-specific checkpoint from a generic Checkpoint object. Examples >>> result = TorchTrainer.fit(...) >>> checkpoint = TorchCheckpoint.from_checkpoint(result.checkpoint) >>> model = checkpoint.get_model() Linear(in_features=1, out_features=1, bias=True) DeveloperAPI: This API may change across minor Ray releases.ray.train.torch.TorchCheckpoint.from_dict classmethod TorchCheckpoint.from_dict(data: dict) -> ray.air.checkpoint.Checkpoint Create checkpoint object from dictionary. Parameters data – Dictionary containing checkpoint data. Returns checkpoint object. Return type Checkpointray.train.torch.TorchCheckpoint.from_directory classmethod TorchCheckpoint.from_directory(path: Union[str, os.PathLike]) -> ray.air.checkpoint.Checkpoint Create checkpoint object from directory. Parameters path – Directory containing checkpoint data. The caller promises to not delete the directory (gifts ownership of the directory to this Checkpoint). Returns checkpoint object. Return type Checkpointray.train.torch.TorchCheckpoint.from_model classmethod TorchCheckpoint.from_model(model: torch.nn.modules.module.Module, *, preprocessor: Optional[Preprocessor] = None) -> TorchCheckpoint[source] Create a Checkpoint that stores a Torch model. PyTorch recommends storing state dictionaries. To create a TorchCheckpoint from a state dictionary, call from_state_dict(). To learn more about state dictionaries, read Saving and Loading Models. # noqa: E501 Parameters model – The Torch model to store in the checkpoint. preprocessor – A fitted preprocessor to be applied before inference. Returns A TorchCheckpoint containing the specified model. Examples from ray.train.torch import TorchCheckpoint from ray.train.torch import TorchPredictor import torch # Set manual seed torch.manual_seed(42) # Create model identity and send a random tensor to it model = torch.nn.Identity() input = torch.randn(2, 2) output = model(input) # Create a checkpoint checkpoint = TorchCheckpoint.from_model(model) # You can use a class TorchCheckpoint to create an # a class ray.train.torch.TorchPredictor and perform inference. predictor = TorchPredictor.from_checkpoint(checkpoint) pred = predictor.predict(input.numpy()) # Convert prediction dictionary value into a tensor pred = torch.tensor(pred['predictions']) # Assert the output from the original and checkoint model are the same assert torch.equal(output, pred) print("worked") ...ray.train.torch.TorchCheckpoint.from_state_dict classmethod TorchCheckpoint.from_state_dict(state_dict: Dict[str, Any], *, preprocessor: Optional[Preprocessor] = None) -> TorchCheckpoint[source] Create a Checkpoint that stores a model state dictionary. This is the recommended method for creating TorchCheckpoints. Parameters state_dict – The model state dictionary to store in the checkpoint. preprocessor – A fitted preprocessor to be applied before inference. Returns A TorchCheckpoint containing the specified state dictionary. 
Examples import torch import torch.nn as nn from ray.train.torch import TorchCheckpoint # Set manual seed torch.manual_seed(42) # Function to create a NN model def create_model() -> nn.Module: model = nn.Sequential(nn.Linear(1, 10), nn.ReLU(), nn.Linear(10,1)) return model # Create a TorchCheckpoint from our model's state_dict model = create_model() checkpoint = TorchCheckpoint.from_state_dict(model.state_dict()) # Now load the model from the TorchCheckpoint by providing the # model architecture model_from_chkpt = checkpoint.get_model(create_model()) # Assert they have the same state dict assert str(model.state_dict()) == str(model_from_chkpt.state_dict()) print("worked") ...ray.train.torch.TorchCheckpoint.from_uri classmethod TorchCheckpoint.from_uri(uri: str) -> ray.air.checkpoint.Checkpoint Create checkpoint object from location URI (e.g. cloud storage). Valid locations currently include AWS S3 (s3://), Google cloud storage (gs://), HDFS (hdfs://), and local files (file://). Parameters uri – Source location URI to read data from. Returns checkpoint object. Return type Checkpointray.train.torch.TorchCheckpoint.get_internal_representation TorchCheckpoint.get_internal_representation() -> Tuple[str, Union[dict, str, ray.ObjectRef]] Return tuple of (type, data) for the internal representation. The internal representation can be used e.g. to compare checkpoint objects for equality or to access the underlying data storage. The returned type is a string and one of ["local_path", "data_dict", "uri"]. The data is the respective data value. Note that paths converted from file://... will be returned as local_path (without the file:// prefix) and not as uri. Returns Tuple of type and data. DeveloperAPI: This API may change across minor Ray releases.ray.train.torch.TorchCheckpoint.get_model TorchCheckpoint.get_model(model: Optional[torch.nn.modules.module.Module] = None) -> torch.nn.modules.module.Module[source] Retrieve the model stored in this checkpoint. Parameters model – If the checkpoint contains a model state dict, and not the model itself, then the state dict will be loaded to this model. Otherwise, the model will be discarded.ray.train.torch.TorchCheckpoint.get_preprocessor TorchCheckpoint.get_preprocessor() -> Optional[Preprocessor] Return the saved preprocessor, if one exists.ray.train.torch.TorchCheckpoint.set_preprocessor TorchCheckpoint.set_preprocessor(preprocessor: Optional[Preprocessor]) Saves the provided preprocessor to this Checkpoint.ray.train.torch.TorchCheckpoint.to_bytes TorchCheckpoint.to_bytes() -> bytes Return Checkpoint serialized as bytes object. Returns Bytes object containing checkpoint data. Return type bytesray.train.torch.TorchCheckpoint.to_dict TorchCheckpoint.to_dict() -> dict Return checkpoint data as dictionary. Returns Dictionary containing checkpoint data. Return type dictray.train.torch.TorchCheckpoint.to_directory TorchCheckpoint.to_directory(path: Optional[str] = None) -> str Write checkpoint data to directory. Parameters path – Target directory to restore data in. If not specified, will create a temporary directory. Returns Directory containing checkpoint data. Return type strray.train.torch.TorchCheckpoint.to_uri TorchCheckpoint.to_uri(uri: str) -> str Write checkpoint data to location URI (e.g. cloud storage). Parameters uri – Target location URI to write data to. Returns Cloud location containing checkpoint data. Return type str Attributes path Return path to checkpoint, if available. uri Return checkpoint URI, if available. 
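To show how the conversion methods above fit together, here is a rough sketch of an in-memory and an on-disk round trip; it assumes, per the method descriptions above, that a directory checkpoint written with to_directory() still exposes the stored state dict through get_model(), and the temporary directory is only illustrative:

import tempfile

import torch
from ray.train.torch import TorchCheckpoint

model = torch.nn.Linear(1, 1)

# In-memory round trip: dict representation and back.
checkpoint = TorchCheckpoint.from_state_dict(model.state_dict())
data = checkpoint.to_dict()
restored = TorchCheckpoint.from_dict(data)

# On-disk round trip: write to a directory and load it back later.
ckpt_dir = checkpoint.to_directory(tempfile.mkdtemp())
restored = TorchCheckpoint.from_directory(ckpt_dir)

# Load the stored state dict into a freshly constructed model.
print(restored.get_model(torch.nn.Linear(1, 1)))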
ray.train.torch.TorchCheckpoint.path property TorchCheckpoint.path: Optional[str] Return path to checkpoint, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local path if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.path == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.path == None Returns Checkpoint path if this checkpoint is reachable from the current node (e.g. cloud storage or locally available directory).ray.train.torch.TorchCheckpoint.uri property TorchCheckpoint.uri: Optional[str] Return checkpoint URI, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local file:// URI if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Users can then choose to persist to cloud with Checkpoint.to_uri(). Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.uri == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.uri == None Returns Checkpoint URI if this URI is reachable from the current node (e.g. cloud storage or locally available file URI). PyTorch Training Loop Utilities prepare_model(model[, move_to_device, ...]) Prepares the model for distributed execution. prepare_optimizer(optimizer) Wraps optimizer to support automatic mixed precision. prepare_data_loader(data_loader[, ...]) Prepares DataLoader for distributed execution. get_device() Gets the correct torch device configured for this process. accelerate([amp]) Enables training optimizations. backward(tensor) Computes the gradient of the specified tensor w.r.t. enable_reproducibility([seed]) Limits sources of nondeterministic behavior. ray.train.torch.prepare_model ray.train.torch.prepare_model(model: torch.nn.modules.module.Module, move_to_device: Union[bool, torch.device] = True, parallel_strategy: Optional[str] = 'ddp', parallel_strategy_kwargs: Optional[Dict[str, Any]] = None) -> torch.nn.modules.module.Module[source] Prepares the model for distributed execution. This allows you to use the same exact code regardless of number of workers or the device type being used (CPU, GPU). Parameters model (torch.nn.Module) – A torch model to prepare. move_to_device – Either a boolean indicating whether to move the model to the correct device or an actual device to move the model to. If set to False, the model needs to manually be moved to the correct device. parallel_strategy ("ddp", "fsdp", or None) – Whether to wrap models in DistributedDataParallel, FullyShardedDataParallel, or neither. parallel_strategy_kwargs (Dict[str, Any]) – Args to pass into DistributedDataParallel or FullyShardedDataParallel initialization if parallel_strategy is set to “ddp” or “fsdp”, respectively. PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.torch.prepare_optimizer ray.train.torch.prepare_optimizer(optimizer: torch.optim.optimizer.Optimizer) -> torch.optim.optimizer.Optimizer[source] Wraps optimizer to support automatic mixed precision. Parameters optimizer (torch.optim.Optimizer) – The optimizer to prepare. Returns A wrapped optimizer.
PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.torch.prepare_data_loader ray.train.torch.prepare_data_loader(data_loader: torch.utils.data.dataloader.DataLoader, add_dist_sampler: bool = True, move_to_device: bool = True, auto_transfer: bool = True) -> torch.utils.data.dataloader.DataLoader[source] Prepares DataLoader for distributed execution. This allows you to use the same exact code regardless of number of workers or the device type being used (CPU, GPU). Parameters data_loader (torch.utils.data.DataLoader) – The DataLoader to prepare. add_dist_sampler – Whether to add a DistributedSampler to the provided DataLoader. move_to_device – If set, automatically move the data returned by the data loader to the correct device. auto_transfer – If set and device is GPU, another CUDA stream is created to automatically copy data from host (CPU) memory to device (GPU) memory (the default CUDA stream still runs the training procedure). If device is CPU, it will be disabled regardless of the setting. This configuration will be ignored if move_to_device is False. PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.torch.get_device ray.train.torch.get_device() -> Union[torch.device, List[torch.device]][source] Gets the correct torch device configured for this process. Returns a list of devices if more than 1 GPU per worker is requested. Assumes that CUDA_VISIBLE_DEVICES is set and is a superset of the ray.get_gpu_ids(). Example >>> # os.environ["CUDA_VISIBLE_DEVICES"] = "3,4" >>> # ray.get_gpu_ids() == [3] >>> # torch.cuda.is_available() == True >>> # get_device() == torch.device("cuda:0") >>> # os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4" >>> # ray.get_gpu_ids() == [4] >>> # torch.cuda.is_available() == True >>> # get_device() == torch.device("cuda:4") >>> # os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5" >>> # ray.get_gpu_ids() == [4,5] >>> # torch.cuda.is_available() == True >>> # get_device() == torch.device("cuda:4") PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.torch.accelerate ray.train.torch.accelerate(amp: bool = False) -> None[source] Enables training optimizations. Parameters amp – If true, perform training with automatic mixed precision. Otherwise, use full precision. train.torch.accelerate cannot be called more than once, and it must be called before any other train.torch utility function. PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.torch.backward ray.train.torch.backward(tensor: torch.Tensor) -> None[source] Computes the gradient of the specified tensor w.r.t. graph leaves. Parameters tensor (torch.Tensor) – Tensor of which the derivative will be computed. PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.torch.enable_reproducibility ray.train.torch.enable_reproducibility(seed: int = 0) -> None[source] Limits sources of nondeterministic behavior. This function: Seeds PyTorch, Python, and NumPy. Disables CUDA convolution benchmarking. Configures PyTorch to use deterministic algorithms. Seeds workers spawned for multi-process data loading. Parameters seed – The number to seed libraries and data workers with. train.torch.enable_reproducibility() can’t guarantee completely reproducible results across executions. To learn more, read the PyTorch notes on randomness. PublicAPI (beta): This API is in beta and may change before becoming stable.
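Taken together, the utilities above slot into an ordinary PyTorch loop as in the following hedged sketch; the model, data, and hyperparameters are placeholders rather than part of the API:

import torch
from ray import train
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # Seed libraries and limit sources of nondeterministic behavior.
    train.torch.enable_reproducibility(seed=42)

    # For mixed precision you could instead call train.torch.accelerate(amp=True)
    # first, wrap the optimizer with train.torch.prepare_optimizer(), and
    # replace loss.backward() with train.torch.backward(loss).

    # Placeholder model and data loader; substitute your own.
    model = torch.nn.Linear(1, 1)
    data = [(torch.randn(1), torch.randn(1)) for _ in range(64)]
    loader = torch.utils.data.DataLoader(data, batch_size=8)

    # Wrap the model in DistributedDataParallel and move it to the device
    # reported by get_device(); add a DistributedSampler and automatic
    # device transfer to the data loader.
    model = train.torch.prepare_model(model)
    loader = train.torch.prepare_data_loader(loader)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()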
PyTorch Lightning LightningTrainer(*args, **kwargs) A Trainer for data parallel PyTorch Lightning training. LightningConfigBuilder() Configuration Class to pass into LightningTrainer. LightningCheckpoint(*args, **kwargs) A Checkpoint with Lightning-specific functionality. LightningPredictor(model[, preprocessor, ...]) A predictor for PyTorch Lightning modules. ray.train.lightning.LightningTrainer class ray.train.lightning.LightningTrainer(*args, **kwargs)[source] Bases: ray.train.torch.torch_trainer.TorchTrainer A Trainer for data parallel PyTorch Lightning training. This Trainer runs the pytorch_lightning.Trainer.fit() method on multiple Ray Actors. The training is carried out in a distributed fashion through PyTorch DDP. These actors already have the necessary Torch process group configured for distributed data parallel training. We will support more distributed training strategies in the future. The training function run on every Actor will first initialize an instance of the user-provided lightning_module class, which is a subclass of pytorch_lightning.LightningModule, using the arguments provided in LightningConfigBuilder.module(). For data ingestion, the LightningTrainer will then either convert the Ray Dataset shards to a pytorch_lightning.LightningDataModule, or directly use the datamodule or dataloaders if provided by users. The trainer also creates a ModelCheckpoint callback based on the configuration provided in LightningConfigBuilder.checkpointing(). In addition to checkpointing, this callback also calls session.report() to report the latest metrics along with the checkpoint to the AIR session. For logging, users can continue to use Lightning’s native loggers, such as WandbLogger, TensorboardLogger, etc. LightningTrainer will also log the latest metrics to the training results directory whenever a new checkpoint is saved. Then, the training function will initialize an instance of pl.Trainer using the arguments provided in LightningConfigBuilder.fit_params() and then run pytorch_lightning.Trainer.fit.
Example import torch import torch.nn.functional as F from torchmetrics import Accuracy from torch.utils.data import DataLoader, Subset from torchvision.datasets import MNIST from torchvision import transforms import pytorch_lightning as pl from ray.air.config import ScalingConfig from ray.train.lightning import LightningTrainer, LightningConfigBuilder class MNISTClassifier(pl.LightningModule): def __init__(self, lr, feature_dim): super(MNISTClassifier, self).__init__() self.fc1 = torch.nn.Linear(28 * 28, feature_dim) self.fc2 = torch.nn.Linear(feature_dim, 10) self.lr = lr self.accuracy = Accuracy() self.val_loss = [] self.val_acc = [] def forward(self, x): x = x.view(-1, 28 * 28) x = torch.relu(self.fc1(x)) x = self.fc2(x) return x def training_step(self, batch, batch_idx): x, y = batch y_hat = self(x) loss = torch.nn.functional.cross_entropy(y_hat, y) self.log("train_loss", loss) return loss def validation_step(self, val_batch, batch_idx): x, y = val_batch logits = self.forward(x) loss = F.nll_loss(logits, y) acc = self.accuracy(logits, y) self.val_loss.append(loss) self.val_acc.append(acc) return {"val_loss": loss, "val_accuracy": acc} def on_validation_epoch_end(self): avg_loss = torch.stack(self.val_loss).mean() avg_acc = torch.stack(self.val_acc).mean() self.log("ptl/val_loss", avg_loss) self.log("ptl/val_accuracy", avg_acc) self.val_acc.clear() self.val_loss.clear() def configure_optimizers(self): optimizer = torch.optim.Adam(self.parameters(), lr=self.lr) return optimizer # Prepare MNIST Datasets transform = transforms.Compose( [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] ) mnist_train = MNIST( './data', train=True, download=True, transform=transform ) mnist_val = MNIST( './data', train=False, download=True, transform=transform ) # Take small subsets for smoke test # Please remove these two lines if you want to train the full dataset mnist_train = Subset(mnist_train, range(1000)) mnist_val = Subset(mnist_val, range(500)) train_loader = DataLoader(mnist_train, batch_size=128, shuffle=True) val_loader = DataLoader(mnist_val, batch_size=128, shuffle=False) lightning_config = ( LightningConfigBuilder() .module(cls=MNISTClassifier, lr=1e-3, feature_dim=128) .trainer(max_epochs=3, accelerator="cpu") .fit_params(train_dataloaders=train_loader, val_dataloaders=val_loader) .build() ) scaling_config = ScalingConfig( num_workers=4, use_gpu=False, resources_per_worker={"CPU": 1} ) trainer = LightningTrainer( lightning_config=lightning_config, scaling_config=scaling_config, ) result = trainer.fit() result ... Parameters lightning_config – Configuration for setting up the PyTorch Lightning Trainer. You can set up the configurations with LightningConfigBuilder, and generate this config dictionary through LightningConfigBuilder.build(). torch_config – Configuration for setting up the PyTorch backend. If set to None, use the default configuration. This replaces the backend_config arg of DataParallelTrainer. Same as in TorchTrainer. scaling_config – Configuration for how to scale data parallel training. dataset_config – Configuration for dataset ingest. run_config – Configuration for the execution of the training run. datasets – A dictionary of Ray Datasets to use for training. Use the key “train” to denote which dataset is the training dataset and (optionally) key “val” to denote the validation dataset.
Internally, LightningTrainer shards the training dataset across all workers, and creates a PyTorch Dataloader for each shard.The datasets will be transformed by preprocessor if it is provided. If the preprocessor has not already been fit, it will be fit on the training dataset.If datasets is not specified, LightningTrainer will use datamodule or dataloaders specified in LightningConfigBuilder.fit_params instead. datasets_iter_config – Configuration for iterating over the input ray datasets. You can configure the per-device batch size, prefetch batch size, collate function, and more. For valid arguments to pass, please refer to: Dataset.iter_torch_batchesNote that if you provide a datasets parameter, you must always specify datasets_iter_config for it. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. get_dataset_config() Return a copy of this Trainer's final dataset configs. restore(path[, datasets, preprocessor, ...]) Restores a LightningTrainer from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.lightning.LightningTrainer.as_trainable LightningTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.lightning.LightningTrainer.can_restore classmethod LightningTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.lightning.LightningTrainer.fit LightningTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.lightning.LightningTrainer.get_dataset_config LightningTrainer.get_dataset_config() -> ray.train._internal.data_config.DataConfig Return a copy of this Trainer’s final dataset configs. Returns The merged default + user-supplied dataset config.ray.train.lightning.LightningTrainer.restore classmethod LightningTrainer.restore(path: str, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None, **kwargs) -> LightningTrainer[source] Restores a LightningTrainer from a previously interrupted/failed run. See BaseTrainer.restore() for descriptions of the arguments. Returns A restored instance of LightningTrainer Return type LightningTrainerray.train.lightning.LightningTrainer.setup LightningTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. 
This method is called prior to preprocess_datasets and training_loop.ray.train.lightning.LightningConfigBuilder class ray.train.lightning.LightningConfigBuilder[source] Bases: object Configuration Class to pass into LightningTrainer. Example import torch import torch.nn as nn import pytorch_lightning as pl from ray.train.lightning import LightningConfigBuilder class LinearModule(pl.LightningModule): def __init__(self, input_dim, output_dim) -> None: super().__init__() self.linear = nn.Linear(input_dim, output_dim) def forward(self, input): return self.linear(input) def training_step(self, batch): output = self.forward(batch) loss = torch.sum(output) self.log("loss", loss) return loss def predict_step(self, batch, batch_idx): return self.forward(batch) def configure_optimizers(self): return torch.optim.SGD(self.parameters(), lr=0.1) class MyDataModule(pl.LightningDataModule): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) # ... lightning_config = ( LightningConfigBuilder() .module( cls=LinearModule, input_dim=32, output_dim=4, ) .trainer(max_epochs=5, accelerator="gpu") .fit_params(datamodule=MyDataModule()) .strategy(name="ddp") .checkpointing(monitor="loss", save_top_k=2, mode="min") .build() ) PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods __init__() Initialize the configurations with default values. build() Build and return a config dictionary to pass into LightningTrainer. checkpointing(**kwargs) Set up the configurations of pytorch_lightning.callbacks.ModelCheckpoint. fit_params(**kwargs) The parameter lists for pytorch_lightning.Trainer.fit() module([cls]) Set up the Pytorch Lightning module class. strategy([name]) Set up the configurations of pytorch_lightning.strategies.Strategy. trainer(**kwargs) Set up the configurations of pytorch_lightning.Trainer. ray.train.lightning.LightningConfigBuilder.__init__ LightningConfigBuilder.__init__() -> None[source] Initialize the configurations with default values.ray.train.lightning.LightningConfigBuilder.build LightningConfigBuilder.build() -> Dict[str, Any][source] Build and return a config dictionary to pass into LightningTrainer.ray.train.lightning.LightningConfigBuilder.checkpointing LightningConfigBuilder.checkpointing(**kwargs) -> ray.train.lightning.lightning_trainer.LightningConfigBuilder[source] Set up the configurations of pytorch_lightning.callbacks.ModelCheckpoint. LightningTrainer creates a subclass instance of the ModelCheckpoint callback with the kwargs. It handles checkpointing and metrics logging logics. Specifically, the callback periodically reports the latest metrics and checkpoint to the AIR session via session.report(). The report frequency matches the checkpointing frequency here. You have to make sure that the target metrics (e.g. metrics defined in TuneConfig or CheckpointConfig) are ready when a new checkpoint is being saved. Note that this method is not a replacement for the ray.air.configs.CheckpointConfig. You still need to specify your AIR checkpointing strategy in CheckpointConfig. Otherwise, AIR stores all the reported checkpoints by default. 
Parameters kwargs – For valid arguments to pass, please refer to: https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ModelCheckpoint.htmlray.train.lightning.LightningConfigBuilder.fit_params LightningConfigBuilder.fit_params(**kwargs) -> ray.train.lightning.lightning_trainer.LightningConfigBuilder[source] The parameter lists for pytorch_lightning.Trainer.fit() LightningTrainer creates a model instance with the parameters provided in module() and feeds it into the pl.Trainer.fit() method. Therefore, you do not need to provide a model instance here. Parameters kwargs – The parameter lists for pytorch_lightning.Trainer.fit() For valid arguments to pass, please refer to: https://lightning.ai/docs/pytorch/stable/common/trainer.html#fit.ray.train.lightning.LightningConfigBuilder.module LightningConfigBuilder.module(cls: Optional[Type[pytorch_lightning.core.lightning.LightningModule]] = None, **kwargs) -> ray.train.lightning.lightning_trainer.LightningConfigBuilder[source] Set up the Pytorch Lightning module class. Parameters cls – A subclass of pytorch_lightning.LightningModule that defines your model and training logic. Note that this is a class definition instead of a class instance. **kwargs – The initialization argument list of your lightning module.ray.train.lightning.LightningConfigBuilder.strategy LightningConfigBuilder.strategy(name: str = 'ddp', **kwargs) -> ray.train.lightning.lightning_trainer.LightningConfigBuilder[source] Set up the configurations of pytorch_lightning.strategies.Strategy. Parameters name – The name of your distributed strategy. You can choose from “ddp”, “fsdp”, and “deepspeed”. Default: “ddp”. kwargs – For valid arguments to pass, please refer to: https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.DDPStrategy.html , https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.FSDPStrategy.html and https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.DeepSpeedStrategy.htmlray.train.lightning.LightningConfigBuilder.trainer LightningConfigBuilder.trainer(**kwargs) -> ray.train.lightning.lightning_trainer.LightningConfigBuilder[source] Set up the configurations of pytorch_lightning.Trainer. Note that you don’t have to specify the strategy, device and num_nodes arguments here, since the LightningTrainer creates a PyTorch Lightning Strategy object with the configurations specified in the strategy() method. The device and num_nodes are also configured automatically by the LightningTrainer. If no configuration is specified, it creates a DDPStrategy by default. For accelerator, currently only "cpu" and "gpu" are supported. Parameters kwargs – The initialization arguments for pytorch_lightning.Trainer For valid arguments to pass, please refer to: https://lightning.ai/docs/pytorch/stable/common/trainer.html#init.ray.train.lightning.LightningCheckpoint class ray.train.lightning.LightningCheckpoint(*args, **kwargs)[source] Bases: ray.train.torch.torch_checkpoint.TorchCheckpoint A Checkpoint with Lightning-specific functionality. LightningCheckpoint only supports file-based checkpoint loading. Create this by calling LightningCheckpoint.from_directory(ckpt_dir), LightningCheckpoint.from_uri(uri), or LightningCheckpoint.from_path(path). LightningCheckpoint loads a file named model under the specified directory.
Examples >>> from ray.train.lightning import LightningCheckpoint >>> >>> # Suppose we saved a checkpoint in "./checkpoint_00000/model": >>> # Option 1: Load from a file >>> checkpoint = LightningCheckpoint.from_path( ... path="./checkpoint_00000/model" ... ) >>> >>> # Option 2: Load from a directory >>> checkpoint = LightningCheckpoint.from_directory( ... path="./checkpoint_00000/" ... ) >>> >>> # Suppose we saved a checkpoint in an S3 bucket: >>> # Option 3: Load from URI >>> checkpoint = LightningCheckpoint.from_uri( ... path="s3://path/to/checkpoint/directory/" ... ) >>> >>> PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods as_directory() Return checkpoint directory path in a context. from_bytes(data) Create a checkpoint from the given byte string. from_checkpoint(other) Create a checkpoint from a generic Checkpoint. from_dict(data) Create checkpoint object from dictionary. from_directory(path) Create checkpoint object from directory. from_model(model, *[, preprocessor]) Create a Checkpoint that stores a Torch model. from_path(path, *[, preprocessor]) Create a ray.air.lightning.LightningCheckpoint from a checkpoint file. from_state_dict(state_dict, *[, preprocessor]) Create a Checkpoint that stores a model state dictionary. from_uri(uri) Create checkpoint object from location URI (e.g. get_internal_representation() Return tuple of (type, data) for the internal representation. get_model(model_class, ...) Retrieve the model stored in this checkpoint. get_preprocessor() Return the saved preprocessor, if one exists. set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. to_bytes() Return Checkpoint serialized as bytes object. to_dict() Return checkpoint data as dictionary. to_directory([path]) Write checkpoint data to directory. to_uri(uri) Write checkpoint data to location URI (e.g. ray.train.lightning.LightningCheckpoint.as_directory LightningCheckpoint.as_directory() -> Iterator[str] Return checkpoint directory path in a context. This function makes checkpoint data available as a directory while avoiding unnecessary copies and left-over temporary data. If the checkpoint is already a directory checkpoint, it will return the existing path. If it is not, it will create a temporary directory, which will be deleted after the context is exited. Users should treat the returned checkpoint directory as read-only and avoid changing any data within it, as it might get deleted when exiting the context. Example: with checkpoint.as_directory() as checkpoint_dir: # Do some read-only processing of files within checkpoint_dir pass # At this point, if a temporary directory was created, it will have # been deleted.ray.train.lightning.LightningCheckpoint.from_bytes classmethod LightningCheckpoint.from_bytes(data: bytes) -> ray.air.checkpoint.Checkpoint Create a checkpoint from the given byte string. Parameters data – Data object containing pickled checkpoint data. Returns checkpoint object. Return type Checkpointray.train.lightning.LightningCheckpoint.from_checkpoint classmethod LightningCheckpoint.from_checkpoint(other: ray.air.checkpoint.Checkpoint) -> ray.air.checkpoint.Checkpoint Create a checkpoint from a generic Checkpoint. This method can be used to create a framework-specific checkpoint from a generic Checkpoint object. Examples >>> result = TorchTrainer.fit(...) 
>>> checkpoint = TorchCheckpoint.from_checkpoint(result.checkpoint) >>> model = checkpoint.get_model() Linear(in_features=1, out_features=1, bias=True) DeveloperAPI: This API may change across minor Ray releases.ray.train.lightning.LightningCheckpoint.from_dict classmethod LightningCheckpoint.from_dict(data: dict) -> ray.air.checkpoint.Checkpoint Create checkpoint object from dictionary. Parameters data – Dictionary containing checkpoint data. Returns checkpoint object. Return type Checkpointray.train.lightning.LightningCheckpoint.from_directory classmethod LightningCheckpoint.from_directory(path: Union[str, os.PathLike]) -> ray.air.checkpoint.Checkpoint Create checkpoint object from directory. Parameters path – Directory containing checkpoint data. The caller promises to not delete the directory (gifts ownership of the directory to this Checkpoint). Returns checkpoint object. Return type Checkpointray.train.lightning.LightningCheckpoint.from_model classmethod LightningCheckpoint.from_model(model: torch.nn.modules.module.Module, *, preprocessor: Optional[Preprocessor] = None) -> TorchCheckpoint Create a Checkpoint that stores a Torch model. PyTorch recommends storing state dictionaries. To create a TorchCheckpoint from a state dictionary, call from_state_dict(). To learn more about state dictionaries, read Saving and Loading Models. # noqa: E501 Parameters model – The Torch model to store in the checkpoint. preprocessor – A fitted preprocessor to be applied before inference. Returns A TorchCheckpoint containing the specified model. Examples from ray.train.torch import TorchCheckpoint from ray.train.torch import TorchPredictor import torch # Set manual seed torch.manual_seed(42) # Create model identity and send a random tensor to it model = torch.nn.Identity() input = torch.randn(2, 2) output = model(input) # Create a checkpoint checkpoint = TorchCheckpoint.from_model(model) # You can use a class TorchCheckpoint to create an # a class ray.train.torch.TorchPredictor and perform inference. predictor = TorchPredictor.from_checkpoint(checkpoint) pred = predictor.predict(input.numpy()) # Convert prediction dictionary value into a tensor pred = torch.tensor(pred['predictions']) # Assert the output from the original and checkoint model are the same assert torch.equal(output, pred) print("worked") ...ray.train.lightning.LightningCheckpoint.from_path classmethod LightningCheckpoint.from_path(path: str, *, preprocessor: Optional[ray.data.preprocessor.Preprocessor] = None) -> ray.train.lightning.lightning_checkpoint.LightningCheckpoint[source] Create a ray.air.lightning.LightningCheckpoint from a checkpoint file. Parameters path – The file path to the PyTorch Lightning checkpoint file. preprocessor – A fitted preprocessor to be applied before inference. Returns An LightningCheckpoint containing the model.ray.train.lightning.LightningCheckpoint.from_state_dict classmethod LightningCheckpoint.from_state_dict(state_dict: Dict[str, Any], *, preprocessor: Optional[Preprocessor] = None) -> TorchCheckpoint Create a Checkpoint that stores a model state dictionary. This is the recommended method for creating TorchCheckpoints. Parameters state_dict – The model state dictionary to store in the checkpoint. preprocessor – A fitted preprocessor to be applied before inference. Returns A TorchCheckpoint containing the specified state dictionary. 
Examples import torch import torch.nn as nn from ray.train.torch import TorchCheckpoint # Set manual seed torch.manual_seed(42) # Function to create a NN model def create_model() -> nn.Module: model = nn.Sequential(nn.Linear(1, 10), nn.ReLU(), nn.Linear(10, 1)) return model # Create a TorchCheckpoint from our model's state_dict model = create_model() checkpoint = TorchCheckpoint.from_state_dict(model.state_dict()) # Now load the model from the TorchCheckpoint by providing the # model architecture model_from_chkpt = checkpoint.get_model(create_model()) # Assert they have the same state dict assert str(model.state_dict()) == str(model_from_chkpt.state_dict()) print("worked") ...ray.train.lightning.LightningCheckpoint.from_uri classmethod LightningCheckpoint.from_uri(uri: str) -> ray.air.checkpoint.Checkpoint Create checkpoint object from location URI (e.g. cloud storage). Valid locations currently include AWS S3 (s3://), Google cloud storage (gs://), HDFS (hdfs://), and local files (file://). Parameters uri – Source location URI to read data from. Returns checkpoint object. Return type Checkpointray.train.lightning.LightningCheckpoint.get_internal_representation LightningCheckpoint.get_internal_representation() -> Tuple[str, Union[dict, str, ray.ObjectRef]] Return tuple of (type, data) for the internal representation. The internal representation can be used e.g. to compare checkpoint objects for equality or to access the underlying data storage. The returned type is a string and one of ["local_path", "data_dict", "uri"]. The data is the respective data value. Note that paths converted from file://... will be returned as local_path (without the file:// prefix) and not as uri. Returns Tuple of type and data. DeveloperAPI: This API may change across minor Ray releases.ray.train.lightning.LightningCheckpoint.get_model LightningCheckpoint.get_model(model_class: Type[pytorch_lightning.core.lightning.LightningModule], **load_from_checkpoint_kwargs: Optional[Dict[str, Any]]) -> pytorch_lightning.core.lightning.LightningModule[source] Retrieve the model stored in this checkpoint. Example import pytorch_lightning as pl import torch.nn as nn from ray.train.lightning import LightningCheckpoint, LightningPredictor class MyLightningModule(pl.LightningModule): def __init__(self, input_dim, output_dim) -> None: super().__init__() self.linear = nn.Linear(input_dim, output_dim) self.save_hyperparameters() # ... # After the training is finished, LightningTrainer saves AIR # checkpoints in the result directory, for example: # ckpt_dir = "{storage_path}/LightningTrainer_.*/checkpoint_000000" # You can load a model checkpoint with the model init arguments def load_checkpoint(ckpt_dir): ckpt = LightningCheckpoint.from_directory(ckpt_dir) # `get_model()` takes the argument list of # `LightningModule.load_from_checkpoint()` as additional kwargs. # Please refer to PyTorch Lightning API for more details. return ckpt.get_model( model_class=MyLightningModule, input_dim=32, output_dim=10, ) # You can also load a checkpoint with a hyperparameter file def load_checkpoint_with_hparams( ckpt_dir, hparam_file="./hparams.yaml" ): ckpt = LightningCheckpoint.from_directory(ckpt_dir) return ckpt.get_model( model_class=MyLightningModule, hparams_file=hparam_file ) Parameters model_class – A subclass of pytorch_lightning.LightningModule that defines your model and training logic. **load_from_checkpoint_kwargs – Arguments to pass into pl.LightningModule.load_from_checkpoint. Returns An instance of the loaded model.
Return type pl.LightningModuleray.train.lightning.LightningCheckpoint.get_preprocessor LightningCheckpoint.get_preprocessor() -> Optional[Preprocessor] Return the saved preprocessor, if one exists.ray.train.lightning.LightningCheckpoint.set_preprocessor LightningCheckpoint.set_preprocessor(preprocessor: Optional[Preprocessor]) Saves the provided preprocessor to this Checkpoint.ray.train.lightning.LightningCheckpoint.to_bytes LightningCheckpoint.to_bytes() -> bytes Return Checkpoint serialized as bytes object. Returns Bytes object containing checkpoint data. Return type bytesray.train.lightning.LightningCheckpoint.to_dict LightningCheckpoint.to_dict() -> dict Return checkpoint data as dictionary. Returns Dictionary containing checkpoint data. Return type dictray.train.lightning.LightningCheckpoint.to_directory LightningCheckpoint.to_directory(path: Optional[str] = None) -> str Write checkpoint data to directory. Parameters path – Target directory to restore data in. If not specified, will create a temporary directory. Returns Directory containing checkpoint data. Return type strray.train.lightning.LightningCheckpoint.to_uri LightningCheckpoint.to_uri(uri: str) -> str Write checkpoint data to location URI (e.g. cloud storage). Parameters uri – Target location URI to write data to. Returns Cloud location containing checkpoint data. Return type str Attributes path Return path to checkpoint, if available. uri Return checkpoint URI, if available. ray.train.lightning.LightningCheckpoint.path property LightningCheckpoint.path: Optional[str] Return path to checkpoint, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local path if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.path == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.path == None Returns Checkpoint path if this checkpoint is reachable from the current node (e.g. cloud storage or locally available directory).ray.train.lightning.LightningCheckpoint.uri property LightningCheckpoint.uri: Optional[str] Return checkpoint URI, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local file:// URI if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Users can then choose to persist to cloud with Checkpoint.to_uri(). Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.uri == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.uri == None Returns Checkpoint URI if this URI is reachable from the current node (e.g. cloud storage or locally available file URI).ray.train.lightning.LightningPredictor class ray.train.lightning.LightningPredictor(model: pytorch_lightning.core.lightning.LightningModule, preprocessor: Optional[ray.data.preprocessor.Preprocessor] = None, use_gpu: bool = False)[source] Bases: ray.train.torch.torch_predictor.TorchPredictor A predictor for PyTorch Lightning modules. 
Example import torch import numpy as np import pytorch_lightning as pl from ray.train.lightning import LightningPredictor class MyModel(pl.LightningModule): def __init__(self, input_dim, output_dim): super().__init__() self.linear = torch.nn.Linear(input_dim, output_dim) def forward(self, x): out = self.linear(x) return out def training_step(self, batch, batch_idx): x, y = batch y_hat = self.forward(x) loss = torch.nn.functional.mse_loss(y_hat, y) self.log("train_loss", loss) return loss def configure_optimizers(self): optimizer = torch.optim.Adam(self.parameters(), lr=1e-3) return optimizer batch_size, input_dim, output_dim = 10, 3, 5 model = MyModel(input_dim=input_dim, output_dim=output_dim) predictor = LightningPredictor(model=model, use_gpu=False) batch = np.random.rand(batch_size, input_dim).astype(np.float32) # Internally, LightningPredictor.predict() invokes the forward() method # of the model to generate predictions output = predictor.predict(batch) assert output["predictions"].shape == (batch_size, output_dim) Parameters model – The PyTorch Lightning module to use for predictions. preprocessor – A preprocessor used to transform data batches prior to prediction. use_gpu – If set, the model will be moved to GPU on instantiation and prediction happens on GPU. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods call_model(inputs) Runs inference on a single batch of tensor data. from_checkpoint(checkpoint, model_class, *) Instantiate the LightningPredictor from a Checkpoint. from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data[, dtype]) Run inference on data batch. preferred_batch_format() DeveloperAPI: This API may change across minor Ray releases. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.lightning.LightningPredictor.call_model LightningPredictor.call_model(inputs: Union[torch.Tensor, Dict[str, torch.Tensor]]) -> Union[torch.Tensor, Dict[str, torch.Tensor]] Runs inference on a single batch of tensor data. This method is called by TorchPredictor.predict after converting the original data batch to torch tensors. Override this method to add custom logic for processing the model input or output. Parameters inputs – A batch of data to predict on, represented as either a single PyTorch tensor or for multi-input models, a dictionary of tensors. Returns The model outputs, either as a single tensor or a dictionary of tensors. Example import numpy as np import torch from ray.train.torch import TorchPredictor # List outputs are not supported by default TorchPredictor. # So let's define a custom TorchPredictor and override call_model class MyModel(torch.nn.Module): def forward(self, input_tensor): return [input_tensor, input_tensor] # Use a custom predictor to format model output as a dict. 
class CustomPredictor(TorchPredictor): def call_model(self, inputs): model_output = super().call_model(inputs) return { str(i): model_output[i] for i in range(len(model_output)) } # create our data batch data_batch = np.array([1, 2]) # create custom predictor and predict predictor = CustomPredictor(model=MyModel()) predictions = predictor.predict(data_batch) print(f"Predictions: {predictions.get('0')}, {predictions.get('1')}") Predictions: [1 2], [1 2] DeveloperAPI: This API may change across minor Ray releases.ray.train.lightning.LightningPredictor.from_checkpoint classmethod LightningPredictor.from_checkpoint(checkpoint: ray.train.lightning.lightning_checkpoint.LightningCheckpoint, model_class: Type[pytorch_lightning.core.lightning.LightningModule], *, preprocessor: Optional[ray.data.preprocessor.Preprocessor] = None, use_gpu: bool = False, **load_from_checkpoint_kwargs) -> ray.train.lightning.lightning_predictor.LightningPredictor[source] Instantiate the LightningPredictor from a Checkpoint. The checkpoint is expected to be a result of LightningTrainer. Example import pytorch_lightning as pl from ray.train.lightning import LightningCheckpoint, LightningPredictor class MyLightningModule(pl.LightningModule): def __init__(self, input_dim, output_dim) -> None: super().__init__() self.linear = nn.Linear(input_dim, output_dim) # ... # After the training is finished, LightningTrainer saves AIR # checkpoints in the result directory, for example: # ckpt_dir = "{storage_path}/LightningTrainer_.*/checkpoint_000000" def load_predictor_from_checkpoint(ckpt_dir): checkpoint = LightningCheckpoint.from_directory(ckpt_dir) # `from_checkpoint()` takes the argument list of # `LightningModule.load_from_checkpoint()` as additional kwargs. return LightningPredictor.from_checkpoint( checkpoint=checkpoint, use_gpu=False, model_class=MyLightningModule, input_dim=32, output_dim=10, ) Parameters checkpoint – The checkpoint to load the model and preprocessor from. It is expected to be from the result of a LightningTrainer run. model_class – A subclass of pytorch_lightning.LightningModule that defines your model and training logic. Note that this is a class type instead of a model instance. preprocessor – A preprocessor used to transform data batches prior to prediction. use_gpu – If set, the model will be moved to GPU on instantiation and prediction happens on GPU. **load_from_checkpoint_kwargs – Arguments to pass into pl.LightningModule.load_from_checkpoint.ray.train.lightning.LightningPredictor.from_pandas_udf classmethod LightningPredictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.predictor.Predictor Create a Predictor from a Pandas UDF. Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.lightning.LightningPredictor.get_preprocessor LightningPredictor.get_preprocessor() -> Optional[ray.data.preprocessor.Preprocessor] Get the preprocessor to use prior to executing predictions.ray.train.lightning.LightningPredictor.predict LightningPredictor.predict(data: Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]], dtype: Optional[Union[torch.dtype, Dict[str, torch.dtype]]] = None) -> Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]] Run inference on data batch. If the provided data is a single array or a dataframe/table with a single column, it will be converted into a single PyTorch tensor before being inputted to the model. 
If the provided data is a multi-column table or a dict of numpy arrays, it will be converted into a dict of tensors before being inputted to the model. This is useful for multi-modal inputs (for example, your model accepts both image and text). Parameters data – A batch of input data of DataBatchType. dtype – The dtypes to use for the tensors. Either a single dtype for all tensors or a mapping from column name to dtype. Returns Prediction result. The return type will be the same as the input type. Return type DataBatchType Example import numpy as np import pandas as pd import torch import ray from ray.train.torch import TorchPredictor # Define a custom PyTorch module class CustomModule(torch.nn.Module): def __init__(self): super().__init__() self.linear1 = torch.nn.Linear(1, 1) self.linear2 = torch.nn.Linear(1, 1) def forward(self, input_dict: dict): out1 = self.linear1(input_dict["A"].unsqueeze(1)) out2 = self.linear2(input_dict["B"].unsqueeze(1)) return out1 + out2 # Set a manual seed so we get consistent output torch.manual_seed(42) # Use a standard PyTorch model model = torch.nn.Linear(2, 1) predictor = TorchPredictor(model=model) # Define our data data = np.array([[1, 2], [3, 4]]) predictions = predictor.predict(data, dtype=torch.float) print(f"Standard model predictions: {predictions}") print("---") # Use a custom PyTorch model with TorchPredictor predictor = TorchPredictor(model=CustomModule()) # Define our data and predict with the custom model data = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"]) predictions = predictor.predict(data, dtype=torch.float) print(f"Custom model predictions: {predictions}") Standard model predictions: {'predictions': array([[1.5487633], [3.8037925]], dtype=float32)} --- Custom model predictions: predictions 0 [0.61623406] 1 [2.857038]ray.train.lightning.LightningPredictor.preferred_batch_format classmethod LightningPredictor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat DeveloperAPI: This API may change across minor Ray releases.ray.train.lightning.LightningPredictor.set_preprocessor LightningPredictor.set_preprocessor(preprocessor: Optional[ray.data.preprocessor.Preprocessor]) -> None Set the preprocessor to use prior to executing predictions. Tensorflow/Keras TensorflowTrainer(*args, **kwargs) A Trainer for data parallel Tensorflow training. TensorflowConfig() PublicAPI (beta): This API is in beta and may change before becoming stable. TensorflowCheckpoint(*args, **kwargs) A Checkpoint with TensorFlow-specific functionality. ray.train.tensorflow.TensorflowTrainer class ray.train.tensorflow.TensorflowTrainer(*args, **kwargs)[source] Bases: ray.train.data_parallel_trainer.DataParallelTrainer A Trainer for data parallel Tensorflow training. This Trainer runs the function train_loop_per_worker on multiple Ray Actors. These actors already have the necessary TensorFlow process group configured for distributed TensorFlow training. The train_loop_per_worker function is expected to take in either 0 or 1 arguments: def train_loop_per_worker(): ... def train_loop_per_worker(config: Dict): ... If train_loop_per_worker accepts an argument, then train_loop_config will be passed in as the argument. This is useful if you want to tune the values in train_loop_config as hyperparameters. If the datasets dict contains a training dataset (denoted by the “train” key), then it will be split into multiple dataset shards that can then be accessed by session.get_dataset_shard("train") inside train_loop_per_worker.
All the other datasets will not be split and session.get_dataset_shard(...) will return the entire Dataset. Inside the train_loop_per_worker function, you can use any of the Ray AIR session methods. Ray will not automatically set any environment variables or configuration related to local parallelism / threading aside from “OMP_NUM_THREADS”. If you desire greater control over TensorFlow threading, use the tf.config.threading module (e.g., tf.config.threading.set_inter_op_parallelism_threads(num_cpus)) at the beginning of your train_loop_per_worker function. def train_loop_per_worker(): # Report intermediate results for callbacks or logging and # checkpoint data. session.report(...) # Returns dict of last saved checkpoint. session.get_checkpoint() # Returns the Dataset shard for the given key. session.get_dataset_shard("my_dataset") # Returns the total number of workers executing training. session.get_world_size() # Returns the rank of this worker. session.get_world_rank() # Returns the rank of the worker on the current node. session.get_local_rank() Any returns from the train_loop_per_worker will be discarded and not used or persisted anywhere. To save a model to use for the TensorflowPredictor, you must save it under the “model” kwarg in the Checkpoint passed to session.report(). Example: import tensorflow as tf import ray from ray.air import session, Checkpoint from ray.air.config import ScalingConfig from ray.train.tensorflow import TensorflowTrainer def build_model(): # toy neural network: 1-layer return tf.keras.Sequential( [tf.keras.layers.Dense( 1, activation="linear", input_shape=(1,))] ) def train_loop_per_worker(config): dataset_shard = session.get_dataset_shard("train") strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() with strategy.scope(): model = build_model() model.compile( optimizer="Adam", loss="mean_squared_error", metrics=["mse"]) tf_dataset = dataset_shard.to_tf( feature_columns="x", label_columns="y", batch_size=1 ) for epoch in range(config["num_epochs"]): model.fit(tf_dataset) # You can also use ray.air.integrations.keras.Callback # for reporting and checkpointing instead of reporting manually. session.report( {}, checkpoint=Checkpoint.from_dict( dict(epoch=epoch, model=model.get_weights()) ), ) train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)]) trainer = TensorflowTrainer( train_loop_per_worker=train_loop_per_worker, scaling_config=ScalingConfig(num_workers=3, use_gpu=True), datasets={"train": train_dataset}, train_loop_config={"num_epochs": 2}, ) result = trainer.fit() ... Parameters train_loop_per_worker – The training function to execute. This can either take in no arguments or a config dict. train_loop_config – Configurations to pass into train_loop_per_worker if it accepts an argument. tensorflow_config – Configuration for setting up the TensorFlow backend. If set to None, use the default configuration. This replaces the backend_config arg of DataParallelTrainer. scaling_config – Configuration for how to scale data parallel training. dataset_config – Configuration for dataset ingest. run_config – Configuration for the execution of the training run. datasets – Any Datasets to use for training. Use the key “train” to denote which dataset is the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets.
resume_from_checkpoint – A checkpoint to resume training from. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. get_dataset_config() Return a copy of this Trainer's final dataset configs. restore(path[, train_loop_per_worker, ...]) Restores a DataParallelTrainer from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.tensorflow.TensorflowTrainer.as_trainable TensorflowTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.tensorflow.TensorflowTrainer.can_restore classmethod TensorflowTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.tensorflow.TensorflowTrainer.fit TensorflowTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.tensorflow.TensorflowTrainer.get_dataset_config TensorflowTrainer.get_dataset_config() -> ray.train._internal.data_config.DataConfig Return a copy of this Trainer’s final dataset configs. Returns The merged default + user-supplied dataset config.ray.train.tensorflow.TensorflowTrainer.restore classmethod TensorflowTrainer.restore(path: str, train_loop_per_worker: Optional[Union[Callable[[], None], Callable[[Dict], None]]] = None, train_loop_config: Optional[Dict] = None, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None) -> DataParallelTrainer Restores a DataParallelTrainer from a previously interrupted/failed run. Parameters train_loop_per_worker – Optionally re-specified train loop function. This should be used to re-specify a function that is not restorable in a new Ray cluster (e.g., it holds onto outdated object references). This should be the same training loop that was passed to the original trainer constructor. train_loop_config – Optionally re-specified train config. This should similarly be used if the original train_loop_config contained outdated object references, and it should not be modified from what was originally passed in. See BaseTrainer.restore() for descriptions of the other arguments. Returns A restored instance of the DataParallelTrainer subclass that is calling this method. Return type DataParallelTrainerray.train.tensorflow.TensorflowTrainer.setup TensorflowTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. 
This method is called prior to preprocess_datasets and training_loop.ray.train.tensorflow.TensorflowConfig class ray.train.tensorflow.TensorflowConfig[source] Bases: ray.train.backend.BackendConfig PublicAPI (beta): This API is in beta and may change before becoming stable. Methods Attributes backend_cls ray.train.tensorflow.TensorflowConfig.backend_cls property TensorflowConfig.backend_cls ray.train.tensorflow.TensorflowCheckpoint class ray.train.tensorflow.TensorflowCheckpoint(*args, **kwargs)[source] Bases: ray.air.checkpoint.Checkpoint A Checkpoint with TensorFlow-specific functionality. Create this from a generic Checkpoint by calling TensorflowCheckpoint.from_checkpoint(ckpt). PublicAPI (beta): This API is in beta and may change before becoming stable. Methods as_directory() Return checkpoint directory path in a context. from_bytes(data) Create a checkpoint from the given byte string. from_checkpoint(other) Create a checkpoint from a generic Checkpoint. from_dict(data) Create checkpoint object from dictionary. from_directory(path) Create checkpoint object from directory. from_h5(file_path, *[, preprocessor]) Create a Checkpoint that stores a Keras model from H5 format. from_model(model, *[, preprocessor]) Create a Checkpoint that stores a Keras model. from_saved_model(dir_path, *[, preprocessor]) Create a Checkpoint that stores a Keras model from SavedModel format. from_uri(uri) Create checkpoint object from location URI (e.g. get_internal_representation() Return tuple of (type, data) for the internal representation. get_model([model, model_definition]) Retrieve the model stored in this checkpoint. get_preprocessor() Return the saved preprocessor, if one exists. set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. to_bytes() Return Checkpoint serialized as bytes object. to_dict() Return checkpoint data as dictionary. to_directory([path]) Write checkpoint data to directory. to_uri(uri) Write checkpoint data to location URI (e.g. ray.train.tensorflow.TensorflowCheckpoint.as_directory TensorflowCheckpoint.as_directory() -> Iterator[str] Return checkpoint directory path in a context. This function makes checkpoint data available as a directory while avoiding unnecessary copies and left-over temporary data. If the checkpoint is already a directory checkpoint, it will return the existing path. If it is not, it will create a temporary directory, which will be deleted after the context is exited. Users should treat the returned checkpoint directory as read-only and avoid changing any data within it, as it might get deleted when exiting the context. Example: with checkpoint.as_directory() as checkpoint_dir: # Do some read-only processing of files within checkpoint_dir pass # At this point, if a temporary directory was created, it will have # been deleted.ray.train.tensorflow.TensorflowCheckpoint.from_bytes classmethod TensorflowCheckpoint.from_bytes(data: bytes) -> ray.air.checkpoint.Checkpoint Create a checkpoint from the given byte string. Parameters data – Data object containing pickled checkpoint data. Returns checkpoint object. Return type Checkpointray.train.tensorflow.TensorflowCheckpoint.from_checkpoint classmethod TensorflowCheckpoint.from_checkpoint(other: ray.air.checkpoint.Checkpoint) -> ray.air.checkpoint.Checkpoint Create a checkpoint from a generic Checkpoint. This method can be used to create a framework-specific checkpoint from a generic Checkpoint object. Examples >>> result = TorchTrainer.fit(...) 
>>> checkpoint = TorchCheckpoint.from_checkpoint(result.checkpoint) >>> model = checkpoint.get_model() Linear(in_features=1, out_features=1, bias=True) DeveloperAPI: This API may change across minor Ray releases.ray.train.tensorflow.TensorflowCheckpoint.from_dict classmethod TensorflowCheckpoint.from_dict(data: dict) -> ray.air.checkpoint.Checkpoint Create checkpoint object from dictionary. Parameters data – Dictionary containing checkpoint data. Returns checkpoint object. Return type Checkpointray.train.tensorflow.TensorflowCheckpoint.from_directory classmethod TensorflowCheckpoint.from_directory(path: Union[str, os.PathLike]) -> ray.air.checkpoint.Checkpoint Create checkpoint object from directory. Parameters path – Directory containing checkpoint data. The caller promises to not delete the directory (gifts ownership of the directory to this Checkpoint). Returns checkpoint object. Return type Checkpointray.train.tensorflow.TensorflowCheckpoint.from_h5 classmethod TensorflowCheckpoint.from_h5(file_path: str, *, preprocessor: Optional[Preprocessor] = None) -> TensorflowCheckpoint[source] Create a Checkpoint that stores a Keras model from H5 format. The checkpoint generated by this method contains all the information needed. Thus no model is needed to be supplied when using this checkpoint. file_path must maintain validity even after this function returns. Some new files/directories may be added to the parent directory of file_path, as a side effect of this method. Parameters file_path – The path to the .h5 file to load model from. This is the same path that is used for model.save(path). preprocessor – A fitted preprocessor to be applied before inference. Returns A TensorflowCheckpoint converted from h5 format. Examples >>> import tensorflow as tf >>> import ray >>> from ray.train.batch_predictor import BatchPredictor >>> from ray.train.tensorflow import ( ... TensorflowCheckpoint, TensorflowTrainer, TensorflowPredictor ... ) >>> from ray.air import session >>> from ray.air.config import ScalingConfig >>> def train_func(): ... model = tf.keras.Sequential( ... [ ... tf.keras.layers.InputLayer(input_shape=()), ... tf.keras.layers.Flatten(), ... tf.keras.layers.Dense(10), ... tf.keras.layers.Dense(1), ... ] ... ) ... model.save("my_model.h5") ... checkpoint = TensorflowCheckpoint.from_h5("my_model.h5") ... session.report({"my_metric": 1}, checkpoint=checkpoint) >>> trainer = TensorflowTrainer( ... train_loop_per_worker=train_func, ... scaling_config=ScalingConfig(num_workers=2)) >>> result_checkpoint = trainer.fit().checkpoint >>> batch_predictor = BatchPredictor.from_checkpoint( ... result_checkpoint, TensorflowPredictor) >>> batch_predictor.predict(ray.data.range(3)) ray.train.tensorflow.TensorflowCheckpoint.from_model classmethod TensorflowCheckpoint.from_model(model: keras.engine.training.Model, *, preprocessor: Optional[Preprocessor] = None) -> TensorflowCheckpoint[source] Create a Checkpoint that stores a Keras model. The checkpoint created with this method needs to be paired with model when used. Parameters model – The Keras model, whose weights are stored in the checkpoint. preprocessor – A fitted preprocessor to be applied before inference. Returns A TensorflowCheckpoint containing the specified model. 
Examples >>> from ray.train.tensorflow import TensorflowCheckpoint >>> import tensorflow as tf >>> >>> model = tf.keras.applications.resnet.ResNet101() >>> checkpoint = TensorflowCheckpoint.from_model(model) ray.train.tensorflow.TensorflowCheckpoint.from_saved_model classmethod TensorflowCheckpoint.from_saved_model(dir_path: str, *, preprocessor: Optional[Preprocessor] = None) -> TensorflowCheckpoint[source] Create a Checkpoint that stores a Keras model from SavedModel format. The checkpoint generated by this method contains all the information needed. Thus no model is needed to be supplied when using this checkpoint. dir_path must maintain validity even after this function returns. Some new files/directories may be added to dir_path, as a side effect of this method. Parameters dir_path – The directory containing the saved model. This is the same directory as used by model.save(dir_path). preprocessor – A fitted preprocessor to be applied before inference. Returns A TensorflowCheckpoint converted from SavedModel format. Examples >>> import tensorflow as tf >>> import ray >>> from ray.train.batch_predictor import BatchPredictor >>> from ray.train.tensorflow import ( ... TensorflowCheckpoint, TensorflowTrainer, TensorflowPredictor) >>> from ray.air import session >>> from ray.air.config import ScalingConfig >>> def train_fn(): ... model = tf.keras.Sequential( ... [ ... tf.keras.layers.InputLayer(input_shape=()), ... tf.keras.layers.Flatten(), ... tf.keras.layers.Dense(10), ... tf.keras.layers.Dense(1), ... ]) ... model.save("my_model") ... checkpoint = TensorflowCheckpoint.from_saved_model("my_model") ... session.report({"my_metric": 1}, checkpoint=checkpoint) >>> trainer = TensorflowTrainer( ... train_loop_per_worker=train_fn, ... scaling_config=ScalingConfig(num_workers=2)) >>> result_checkpoint = trainer.fit().checkpoint >>> batch_predictor = BatchPredictor.from_checkpoint( ... result_checkpoint, TensorflowPredictor) >>> batch_predictor.predict(ray.data.range(3)) ray.train.tensorflow.TensorflowCheckpoint.from_uri classmethod TensorflowCheckpoint.from_uri(uri: str) -> ray.air.checkpoint.Checkpoint Create checkpoint object from location URI (e.g. cloud storage). Valid locations currently include AWS S3 (s3://), Google cloud storage (gs://), HDFS (hdfs://), and local files (file://). Parameters uri – Source location URI to read data from. Returns checkpoint object. Return type Checkpointray.train.tensorflow.TensorflowCheckpoint.get_internal_representation TensorflowCheckpoint.get_internal_representation() -> Tuple[str, Union[dict, str, ray.ObjectRef]] Return tuple of (type, data) for the internal representation. The internal representation can be used e.g. to compare checkpoint objects for equality or to access the underlying data storage. The returned type is a string and one of ["local_path", "data_dict", "uri"]. The data is the respective data value. Note that paths converted from file://... will be returned as local_path (without the file:// prefix) and not as uri. Returns Tuple of type and data. DeveloperAPI: This API may change across minor Ray releases.ray.train.tensorflow.TensorflowCheckpoint.get_model TensorflowCheckpoint.get_model(model: Optional[Union[keras.engine.training.Model, Callable[[], keras.engine.training.Model]]] = None, model_definition: Optional[Callable[[], keras.engine.training.Model]] = None) -> keras.engine.training.Model[source] Retrieve the model stored in this checkpoint. 
Parameters model – This arg is expected only if the original checkpoint was created via TensorflowCheckpoint.from_model. model_definition – This parameter is deprecated. Use model instead. Returns The Tensorflow Keras model stored in the checkpoint.ray.train.tensorflow.TensorflowCheckpoint.get_preprocessor TensorflowCheckpoint.get_preprocessor() -> Optional[Preprocessor] Return the saved preprocessor, if one exists.ray.train.tensorflow.TensorflowCheckpoint.set_preprocessor TensorflowCheckpoint.set_preprocessor(preprocessor: Optional[Preprocessor]) Saves the provided preprocessor to this Checkpoint.ray.train.tensorflow.TensorflowCheckpoint.to_bytes TensorflowCheckpoint.to_bytes() -> bytes Return Checkpoint serialized as bytes object. Returns Bytes object containing checkpoint data. Return type bytesray.train.tensorflow.TensorflowCheckpoint.to_dict TensorflowCheckpoint.to_dict() -> dict Return checkpoint data as dictionary. Returns Dictionary containing checkpoint data. Return type dictray.train.tensorflow.TensorflowCheckpoint.to_directory TensorflowCheckpoint.to_directory(path: Optional[str] = None) -> str Write checkpoint data to directory. Parameters path – Target directory to restore data in. If not specified, will create a temporary directory. Returns Directory containing checkpoint data. Return type strray.train.tensorflow.TensorflowCheckpoint.to_uri TensorflowCheckpoint.to_uri(uri: str) -> str Write checkpoint data to location URI (e.g. cloud storage). Parameters uri – Target location URI to write data to. Returns Cloud location containing checkpoint data. Return type str Attributes path Return path to checkpoint, if available. uri Return checkpoint URI, if available. ray.train.tensorflow.TensorflowCheckpoint.path property TensorflowCheckpoint.path: Optional[str] Return path to checkpoint, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local path if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.path == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.path == None Returns Checkpoint path if this checkpoint is reachable from the current node (e.g. cloud storage or locally available directory).ray.train.tensorflow.TensorflowCheckpoint.uri property TensorflowCheckpoint.uri: Optional[str] Return checkpoint URI, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local file:// URI if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Users can then choose to persist to cloud with Checkpoint.to_uri(). Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.uri == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.uri == None Returns Checkpoint URI if this URI is reachable from the current node (e.g. cloud storage or locally available file URI). Tensorflow/Keras Training Loop Utilities prepare_dataset_shard(tf_dataset_shard) A utility function that overrides default config for Tensorflow Dataset. 
ray.train.tensorflow.prepare_dataset_shard ray.train.tensorflow.prepare_dataset_shard(tf_dataset_shard: tensorflow.python.data.ops.dataset_ops.DatasetV2)[source] A utility function that overrides default config for Tensorflow Dataset. This should be used on a TensorFlow Dataset created by calling iter_tf_batches() on a ray.data.Dataset returned by ray.air.session.get_dataset_shard() since the dataset has already been sharded across the workers. Parameters tf_dataset_shard (tf.data.Dataset) – A TensorFlow Dataset. Returns A TensorFlow Dataset with autosharding turned off and prefetching turned on with autotune enabled. PublicAPI (beta): This API is in beta and may change before becoming stable. ReportCheckpointCallback([checkpoint_on, ...]) Keras callback for Ray AIR reporting and checkpointing. Horovod HorovodTrainer(*args, **kwargs) A Trainer for data parallel Horovod training. HorovodConfig([nics, verbose, key, ...]) Configurations for Horovod setup. ray.train.horovod.HorovodTrainer class ray.train.horovod.HorovodTrainer(*args, **kwargs)[source] Bases: ray.train.data_parallel_trainer.DataParallelTrainer A Trainer for data parallel Horovod training. This Trainer runs the function train_loop_per_worker on multiple Ray Actors. These actors already have the necessary Horovod setup configured for distributed Horovod training. The train_loop_per_worker function is expected to take in either 0 or 1 arguments: def train_loop_per_worker(): ... def train_loop_per_worker(config: Dict): ... If train_loop_per_worker accepts an argument, then train_loop_config will be passed in as the argument. This is useful if you want to tune the values in train_loop_config as hyperparameters. If the datasets dict contains a training dataset (denoted by the “train” key), then it will be split into multiple dataset shards that can then be accessed by session.get_dataset_shard("train") inside train_loop_per_worker. All the other datasets will not be split and session.get_dataset_shard(...) will return the entire Dataset. Inside the train_loop_per_worker function, you can use any of the Ray AIR session methods. def train_loop_per_worker(): # Report intermediate results for callbacks or logging and # checkpoint data. session.report(...) # Returns dict of last saved checkpoint. session.get_checkpoint() # Returns the Dataset shard for the given key. session.get_dataset_shard("my_dataset") # Returns the total number of workers executing training. session.get_world_size() # Returns the rank of this worker. session.get_world_rank() # Returns the rank of the worker on the current node. session.get_local_rank() Any returns from the train_loop_per_worker will be discarded and not used or persisted anywhere. You could use TensorflowPredictor or TorchPredictor in conjunction with HorovodTrainer. You must save the model under the “model” kwarg in the Checkpoint passed to session.report(), so that it can be used by corresponding predictors. Example: import ray import ray.train as train import ray.train.torch  # Need this to use `train.torch.get_device()` import horovod.torch as hvd import torch import torch.nn as nn from ray.air import session from ray.train.horovod import HorovodTrainer from ray.train.torch import TorchCheckpoint from ray.air.config import ScalingConfig # If using GPUs, set this to True.
use_gpu = False input_size = 1 layer_size = 15 output_size = 1 num_epochs = 3 class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.layer1 = nn.Linear(input_size, layer_size) self.relu = nn.ReLU() self.layer2 = nn.Linear(layer_size, output_size) def forward(self, input): return self.layer2(self.relu(self.layer1(input))) def train_loop_per_worker(): hvd.init() dataset_shard = session.get_dataset_shard("train") model = NeuralNetwork() device = train.torch.get_device() model.to(device) loss_fn = nn.MSELoss() lr_scaler = 1 optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * lr_scaler) # Horovod: wrap optimizer with DistributedOptimizer. optimizer = hvd.DistributedOptimizer( optimizer, named_parameters=model.named_parameters(), op=hvd.Average, ) for epoch in range(num_epochs): model.train() for batch in dataset_shard.iter_torch_batches( batch_size=32, dtypes=torch.float ): inputs, labels = torch.unsqueeze(batch["x"], 1), batch["y"] outputs = model(inputs) loss = loss_fn(outputs, labels) optimizer.zero_grad() loss.backward() optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") session.report( {}, checkpoint=TorchCheckpoint.from_state_dict( model.state_dict() ), ) train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)]) scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu) trainer = HorovodTrainer( train_loop_per_worker=train_loop_per_worker, scaling_config=scaling_config, datasets={"train": train_dataset}, ) result = trainer.fit() Parameters train_loop_per_worker – The training function to execute. This can either take in no arguments or a config dict. train_loop_config – Configurations to pass into train_loop_per_worker if it accepts an argument. horovod_config – Configuration for setting up the Horovod backend. If set to None, use the default configuration. This replaces the backend_config arg of DataParallelTrainer. scaling_config – Configuration for how to scale data parallel training. dataset_config – Configuration for dataset ingest. run_config – Configuration for the execution of the training run. datasets – Any Datasets to use for training. Use the key “train” to denote which dataset is the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. get_dataset_config() Return a copy of this Trainer's final dataset configs. restore(path[, train_loop_per_worker, ...]) Restores a DataParallelTrainer from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.horovod.HorovodTrainer.as_trainable HorovodTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.horovod.HorovodTrainer.can_restore classmethod HorovodTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. 
This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.horovod.HorovodTrainer.fit HorovodTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If there are any failures during the execution of self.as_trainable() or during the Tune execution loop. PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.horovod.HorovodTrainer.get_dataset_config HorovodTrainer.get_dataset_config() -> ray.train._internal.data_config.DataConfig Return a copy of this Trainer’s final dataset configs. Returns The merged default + user-supplied dataset config.ray.train.horovod.HorovodTrainer.restore classmethod HorovodTrainer.restore(path: str, train_loop_per_worker: Optional[Union[Callable[[], None], Callable[[Dict], None]]] = None, train_loop_config: Optional[Dict] = None, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None) -> DataParallelTrainer Restores a DataParallelTrainer from a previously interrupted/failed run. Parameters train_loop_per_worker – Optionally re-specified train loop function. This should be used to re-specify a function that is not restorable in a new Ray cluster (e.g., it holds onto outdated object references). This should be the same training loop that was passed to the original trainer constructor. train_loop_config – Optionally re-specified train config. This should similarly be used if the original train_loop_config contained outdated object references, and it should not be modified from what was originally passed in. See BaseTrainer.restore() for descriptions of the other arguments. Returns A restored instance of the DataParallelTrainer subclass that is calling this method. Return type DataParallelTrainerray.train.horovod.HorovodTrainer.setup HorovodTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop.ray.train.horovod.HorovodConfig class ray.train.horovod.HorovodConfig(nics: Optional[Set[str]] = None, verbose: int = 1, key: Optional[str] = None, ssh_port: Optional[int] = None, ssh_identity_file: Optional[str] = None, ssh_str: Optional[str] = None, timeout_s: int = 300, placement_group_timeout_s: int = 100)[source] Bases: ray.train.backend.BackendConfig Configurations for Horovod setup. See https://github.com/horovod/horovod/blob/master/horovod/runner/common/util/settings.py Parameters nics (Optional[Set[str]]) – Network interfaces that can be used for communication. verbose – Horovod logging verbosity. key (Optional[str]) – Secret used for communication between workers. ssh_port (Optional[int]) – Port for SSH server running on worker nodes. ssh_identity_file (Optional[str]) – Path to the identity file to ssh into different hosts on the cluster. ssh_str (Optional[str]) – CAUTION WHEN USING THIS. Private key file contents. Writes the private key to ssh_identity_file. timeout_s – Timeout parameter for Gloo rendezvous. placement_group_timeout_s – Timeout parameter for Ray Placement Group creation. Currently unused.
PublicAPI (beta): This API is in beta and may change before becoming stable. Methods Attributes backend_cls key nics placement_group_timeout_s ssh_identity_file ssh_port ssh_str start_timeout timeout_s verbose ray.train.horovod.HorovodConfig.backend_cls property HorovodConfig.backend_cls ray.train.horovod.HorovodConfig.key HorovodConfig.key: Optional[str] = None ray.train.horovod.HorovodConfig.nics HorovodConfig.nics: Optional[Set[str]] = None ray.train.horovod.HorovodConfig.placement_group_timeout_s HorovodConfig.placement_group_timeout_s: int = 100 ray.train.horovod.HorovodConfig.ssh_identity_file HorovodConfig.ssh_identity_file: Optional[str] = None ray.train.horovod.HorovodConfig.ssh_port HorovodConfig.ssh_port: Optional[int] = None ray.train.horovod.HorovodConfig.ssh_str HorovodConfig.ssh_str: Optional[str] = None ray.train.horovod.HorovodConfig.start_timeout property HorovodConfig.start_timeout ray.train.horovod.HorovodConfig.timeout_s HorovodConfig.timeout_s: int = 300 ray.train.horovod.HorovodConfig.verbose HorovodConfig.verbose: int = 1 XGBoost XGBoostTrainer(*args, **kwargs) A Trainer for data parallel XGBoost training. XGBoostCheckpoint([local_path, data_dict, uri]) A Checkpoint with XGBoost-specific functionality. ray.train.xgboost.XGBoostTrainer class ray.train.xgboost.XGBoostTrainer(*args, **kwargs)[source] Bases: ray.train.gbdt_trainer.GBDTTrainer A Trainer for data parallel XGBoost training. This Trainer runs the XGBoost training loop in a distributed manner using multiple Ray Actors. XGBoostTrainer does not modify or otherwise alter the working of the XGBoost distributed training algorithm. Ray only provides orchestration, data ingest and fault tolerance. For more information on XGBoost distributed training, refer to XGBoost documentation. Example import ray from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig train_dataset = ray.data.from_items( [{"x": x, "y": x + 1} for x in range(32)]) trainer = XGBoostTrainer( label_column="y", params={"objective": "reg:squarederror"}, scaling_config=ScalingConfig(num_workers=3), datasets={"train": train_dataset} ) result = trainer.fit() ... Parameters datasets – Datasets to use for training and validation. Must include a “train” key denoting the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. All non-training datasets will be used as separate validation sets, each reporting a separate metric. label_column – Name of the label column. A column with this name must be present in the training dataset. params – XGBoost training parameters. Refer to XGBoost documentation for a list of possible parameters. dmatrix_params – Dict of dataset name:dict of kwargs passed to respective xgboost_ray.RayDMatrix initializations, which in turn are passed to xgboost.DMatrix objects created on each worker. For example, this can be used to add sample weights with the weights parameter. num_boost_round – Target number of boosting iterations (trees in the model). Note that unlike in xgboost.train, this is the target number of trees, meaning that if you set num_boost_round=10 and pass a model that has already been trained for 5 iterations, it will be trained for 5 iterations more, instead of 10 more. scaling_config – Configuration for how to scale data parallel training. run_config – Configuration for the execution of the training run. 
preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. **train_kwargs – Additional kwargs passed to xgboost.train() function. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. restore(path[, datasets, preprocessor, ...]) Restores a Train experiment from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.xgboost.XGBoostTrainer.as_trainable XGBoostTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.xgboost.XGBoostTrainer.can_restore classmethod XGBoostTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.xgboost.XGBoostTrainer.fit XGBoostTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.xgboost.XGBoostTrainer.restore classmethod XGBoostTrainer.restore(path: str, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None, **kwargs) -> BaseTrainer Restores a Train experiment from a previously interrupted/failed run. Restore should be used for experiment-level fault tolerance in the event that the head node crashes (e.g., OOM or some other runtime error) or the entire cluster goes down (e.g., network error affecting all nodes). The following example can be paired with implementing job retry using Ray Jobs to produce a Train experiment that will attempt to resume on both experiment-level and trial-level failures: import os import ray from ray import air from ray.data.preprocessors import BatchMapper from ray.train.trainer import BaseTrainer experiment_name = "unique_experiment_name" local_dir = "~/ray_results" experiment_dir = os.path.join(local_dir, experiment_name) # Define some dummy inputs for demonstration purposes datasets = {"train": ray.data.from_items([{"a": i} for i in range(10)])} preprocessor = BatchMapper(lambda x: x, batch_format="numpy") class CustomTrainer(BaseTrainer): def training_loop(self): pass if CustomTrainer.can_restore(experiment_dir): trainer = CustomTrainer.restore( experiment_dir, datasets=datasets, ) else: trainer = CustomTrainer( datasets=datasets, preprocessor=preprocessor, run_config=air.RunConfig( name=experiment_name, local_dir=local_dir, # Tip: You can also enable retries on failure for # worker-level fault tolerance failure_config=air.FailureConfig(max_failures=3), ), ) result = trainer.fit() ... Parameters path – The path to the experiment directory of the training run to restore. This can be a local path or a remote URI if the experiment was uploaded to the cloud. 
datasets – Re-specified datasets used in the original training run. This must include all the datasets that were passed in the original trainer constructor. preprocessor – Optionally re-specified preprocessor that was passed in the original trainer constructor. This should be used to re-supply the preprocessor if it is not restorable in a new Ray cluster. This preprocessor will be fit at the start before resuming training. If no preprocessor is passed in restore, then the old preprocessor will be loaded from the latest checkpoint and will not be re-fit. scaling_config – Optionally re-specified scaling config. This can be modified to be different from the original spec. **kwargs – Other optionally re-specified arguments, passed in by subclasses. Raises ValueError – If all datasets were not re-supplied on restore. Returns A restored instance of the class that is calling this method. Return type BaseTrainerray.train.xgboost.XGBoostTrainer.setup XGBoostTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop.ray.train.xgboost.XGBoostCheckpoint class ray.train.xgboost.XGBoostCheckpoint(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None)[source] Bases: ray.air.checkpoint.Checkpoint A Checkpoint with XGBoost-specific functionality. Create this from a generic Checkpoint by calling XGBoostCheckpoint.from_checkpoint(ckpt). PublicAPI (beta): This API is in beta and may change before becoming stable. Methods __init__([local_path, data_dict, uri]) DeveloperAPI: This API may change across minor Ray releases. as_directory() Return checkpoint directory path in a context. from_bytes(data) Create a checkpoint from the given byte string. from_checkpoint(other) Create a checkpoint from a generic Checkpoint. from_dict(data) Create checkpoint object from dictionary. from_directory(path) Create checkpoint object from directory. from_model(booster, *[, preprocessor]) Create a Checkpoint that stores an XGBoost model. from_uri(uri) Create checkpoint object from location URI (e.g. get_internal_representation() Return tuple of (type, data) for the internal representation. get_model() Retrieve the XGBoost model stored in this checkpoint. get_preprocessor() Return the saved preprocessor, if one exists. set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. to_bytes() Return Checkpoint serialized as bytes object. to_dict() Return checkpoint data as dictionary. to_directory([path]) Write checkpoint data to directory. to_uri(uri) Write checkpoint data to location URI (e.g. ray.train.xgboost.XGBoostCheckpoint.__init__ XGBoostCheckpoint.__init__(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None) DeveloperAPI: This API may change across minor Ray releases.ray.train.xgboost.XGBoostCheckpoint.as_directory XGBoostCheckpoint.as_directory() -> Iterator[str] Return checkpoint directory path in a context. This function makes checkpoint data available as a directory while avoiding unnecessary copies and left-over temporary data. If the checkpoint is already a directory checkpoint, it will return the existing path. If it is not, it will create a temporary directory, which will be deleted after the context is exited. 
Users should treat the returned checkpoint directory as read-only and avoid changing any data within it, as it might get deleted when exiting the context. Example: with checkpoint.as_directory() as checkpoint_dir: # Do some read-only processing of files within checkpoint_dir pass # At this point, if a temporary directory was created, it will have # been deleted.ray.train.xgboost.XGBoostCheckpoint.from_bytes classmethod XGBoostCheckpoint.from_bytes(data: bytes) -> ray.air.checkpoint.Checkpoint Create a checkpoint from the given byte string. Parameters data – Data object containing pickled checkpoint data. Returns checkpoint object. Return type Checkpointray.train.xgboost.XGBoostCheckpoint.from_checkpoint classmethod XGBoostCheckpoint.from_checkpoint(other: ray.air.checkpoint.Checkpoint) -> ray.air.checkpoint.Checkpoint Create a checkpoint from a generic Checkpoint. This method can be used to create a framework-specific checkpoint from a generic Checkpoint object. Examples >>> result = TorchTrainer.fit(...) >>> checkpoint = TorchCheckpoint.from_checkpoint(result.checkpoint) >>> model = checkpoint.get_model() Linear(in_features=1, out_features=1, bias=True) DeveloperAPI: This API may change across minor Ray releases.ray.train.xgboost.XGBoostCheckpoint.from_dict classmethod XGBoostCheckpoint.from_dict(data: dict) -> ray.air.checkpoint.Checkpoint Create checkpoint object from dictionary. Parameters data – Dictionary containing checkpoint data. Returns checkpoint object. Return type Checkpointray.train.xgboost.XGBoostCheckpoint.from_directory classmethod XGBoostCheckpoint.from_directory(path: Union[str, os.PathLike]) -> ray.air.checkpoint.Checkpoint Create checkpoint object from directory. Parameters path – Directory containing checkpoint data. The caller promises to not delete the directory (gifts ownership of the directory to this Checkpoint). Returns checkpoint object. Return type Checkpointray.train.xgboost.XGBoostCheckpoint.from_model classmethod XGBoostCheckpoint.from_model(booster: xgboost.core.Booster, *, preprocessor: Optional[Preprocessor] = None) -> XGBoostCheckpoint[source] Create a Checkpoint that stores an XGBoost model. Parameters booster – The XGBoost model to store in the checkpoint. preprocessor – A fitted preprocessor to be applied before inference. Returns An XGBoostCheckpoint containing the specified booster. Examples import numpy as np import ray from ray.train.xgboost import XGBoostCheckpoint import xgboost train_X = np.array([[1, 2], [3, 4]]) train_y = np.array([0, 1]) model = xgboost.XGBClassifier().fit(train_X, train_y) checkpoint = XGBoostCheckpoint.from_model(model.get_booster()) You can use an XGBoostCheckpoint to create an XGBoostPredictor and perform inference. from ray.train.xgboost import XGBoostPredictor predictor = XGBoostPredictor.from_checkpoint(checkpoint)ray.train.xgboost.XGBoostCheckpoint.from_uri classmethod XGBoostCheckpoint.from_uri(uri: str) -> ray.air.checkpoint.Checkpoint Create checkpoint object from location URI (e.g. cloud storage). Valid locations currently include AWS S3 (s3://), Google cloud storage (gs://), HDFS (hdfs://), and local files (file://). Parameters uri – Source location URI to read data from. Returns checkpoint object. Return type Checkpointray.train.xgboost.XGBoostCheckpoint.get_internal_representation XGBoostCheckpoint.get_internal_representation() -> Tuple[str, Union[dict, str, ray.ObjectRef]] Return tuple of (type, data) for the internal representation. The internal representation can be used e.g.
to compare checkpoint objects for equality or to access the underlying data storage. The returned type is a string and one of ["local_path", "data_dict", "uri"]. The data is the respective data value. Note that paths converted from file://... will be returned as local_path (without the file:// prefix) and not as uri. Returns Tuple of type and data. DeveloperAPI: This API may change across minor Ray releases.ray.train.xgboost.XGBoostCheckpoint.get_model XGBoostCheckpoint.get_model() -> xgboost.core.Booster[source] Retrieve the XGBoost model stored in this checkpoint.ray.train.xgboost.XGBoostCheckpoint.get_preprocessor XGBoostCheckpoint.get_preprocessor() -> Optional[Preprocessor] Return the saved preprocessor, if one exists.ray.train.xgboost.XGBoostCheckpoint.set_preprocessor XGBoostCheckpoint.set_preprocessor(preprocessor: Optional[Preprocessor]) Saves the provided preprocessor to this Checkpoint.ray.train.xgboost.XGBoostCheckpoint.to_bytes XGBoostCheckpoint.to_bytes() -> bytes Return Checkpoint serialized as bytes object. Returns Bytes object containing checkpoint data. Return type bytesray.train.xgboost.XGBoostCheckpoint.to_dict XGBoostCheckpoint.to_dict() -> dict Return checkpoint data as dictionary. Returns Dictionary containing checkpoint data. Return type dictray.train.xgboost.XGBoostCheckpoint.to_directory XGBoostCheckpoint.to_directory(path: Optional[str] = None) -> str Write checkpoint data to directory. Parameters path – Target directory to restore data in. If not specified, will create a temporary directory. Returns Directory containing checkpoint data. Return type strray.train.xgboost.XGBoostCheckpoint.to_uri XGBoostCheckpoint.to_uri(uri: str) -> str Write checkpoint data to location URI (e.g. cloud storage). Parameters uri – Target location URI to write data to. Returns Cloud location containing checkpoint data. Return type str Attributes path Return path to checkpoint, if available. uri Return checkpoint URI, if available. ray.train.xgboost.XGBoostCheckpoint.path property XGBoostCheckpoint.path: Optional[str] Return path to checkpoint, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local path if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.path == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.path == None Returns Checkpoint path if this checkpoint is reachable from the current node (e.g. cloud storage or locally available directory).ray.train.xgboost.XGBoostCheckpoint.uri property XGBoostCheckpoint.uri: Optional[str] Return checkpoint URI, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local file:// URI if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Users can then choose to persist to cloud with Checkpoint.to_uri(). Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.uri == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.uri == None Returns Checkpoint URI if this URI is reachable from the current node (e.g. cloud storage or locally available file URI). 
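As a brief illustration of how the XGBoost pieces above fit together, the following minimal sketch trains an XGBoostTrainer on a toy dataset, converts the generic result checkpoint with XGBoostCheckpoint.from_checkpoint, and retrieves the trained booster with get_model. The dataset and worker count are illustrative values, not a recommended configuration.

import ray
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostCheckpoint, XGBoostTrainer

# Toy dataset, mirroring the XGBoostTrainer example above.
train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "reg:squarederror"},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_dataset},
)
result = trainer.fit()

# Convert the generic result checkpoint into an XGBoost-specific checkpoint
# and retrieve the trained xgboost.Booster from it.
checkpoint = XGBoostCheckpoint.from_checkpoint(result.checkpoint)
booster = checkpoint.get_model()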
LightGBM LightGBMTrainer(*args, **kwargs) A Trainer for data parallel LightGBM training. LightGBMCheckpoint([local_path, data_dict, uri]) A Checkpoint with LightGBM-specific functionality. ray.train.lightgbm.LightGBMTrainer class ray.train.lightgbm.LightGBMTrainer(*args, **kwargs)[source] Bases: ray.train.gbdt_trainer.GBDTTrainer A Trainer for data parallel LightGBM training. This Trainer runs the LightGBM training loop in a distributed manner using multiple Ray Actors. If you would like to take advantage of LightGBM’s built-in handling for features with the categorical data type, consider using the Categorizer preprocessor to set the dtypes in the dataset. LightGBMTrainer does not modify or otherwise alter the working of the LightGBM distributed training algorithm. Ray only provides orchestration, data ingest and fault tolerance. For more information on LightGBM distributed training, refer to LightGBM documentation. Example import ray from ray.train.lightgbm import LightGBMTrainer from ray.air.config import ScalingConfig train_dataset = ray.data.from_items( [{"x": x, "y": x + 1} for x in range(32)]) trainer = LightGBMTrainer( label_column="y", params={"objective": "regression"}, scaling_config=ScalingConfig(num_workers=3), datasets={"train": train_dataset} ) result = trainer.fit() ... Parameters datasets – Datasets to use for training and validation. Must include a “train” key denoting the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. All non-training datasets will be used as separate validation sets, each reporting a separate metric. label_column – Name of the label column. A column with this name must be present in the training dataset. params – LightGBM training parameters passed to lightgbm.train(). Refer to LightGBM documentation for a list of possible parameters. dmatrix_params – Dict of dataset name:dict of kwargs passed to respective xgboost_ray.RayDMatrix initializations, which in turn are passed to lightgbm.Dataset objects created on each worker. For example, this can be used to add sample weights with the weights parameter. num_boost_round – Target number of boosting iterations (trees in the model). Note that unlike in lightgbm.train, this is the target number of trees, meaning that if you set num_boost_round=10 and pass a model that has already been trained for 5 iterations, it will be trained for 5 iterations more, instead of 10 more. scaling_config – Configuration for how to scale data parallel training. run_config – Configuration for the execution of the training run. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. **train_kwargs – Additional kwargs passed to lightgbm.train() function. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. restore(path[, datasets, preprocessor, ...]) Restores a Train experiment from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. 
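To make the note about categorical features concrete, here is a minimal sketch of passing a Categorizer preprocessor to LightGBMTrainer. It assumes the Categorizer from ray.data.preprocessors with a columns argument; the column names and toy dataset are hypothetical and only illustrate the wiring.

import ray
from ray.air.config import ScalingConfig
from ray.data.preprocessors import Categorizer
from ray.train.lightgbm import LightGBMTrainer

# Hypothetical dataset with a categorical column "cat" and a numeric column "x".
train_dataset = ray.data.from_items(
    [{"cat": str(x % 3), "x": x, "y": x + 1} for x in range(32)]
)

trainer = LightGBMTrainer(
    label_column="y",
    params={"objective": "regression"},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_dataset},
    # Categorizer converts the listed columns to a pandas categorical dtype,
    # which LightGBM then treats as categorical features.
    preprocessor=Categorizer(columns=["cat"]),
)
result = trainer.fit()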
ray.train.lightgbm.LightGBMTrainer.as_trainable LightGBMTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.lightgbm.LightGBMTrainer.can_restore classmethod LightGBMTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.lightgbm.LightGBMTrainer.fit LightGBMTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.lightgbm.LightGBMTrainer.restore classmethod LightGBMTrainer.restore(path: str, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None, **kwargs) -> BaseTrainer Restores a Train experiment from a previously interrupted/failed run. Restore should be used for experiment-level fault tolerance in the event that the head node crashes (e.g., OOM or some other runtime error) or the entire cluster goes down (e.g., network error affecting all nodes). The following example can be paired with implementing job retry using Ray Jobs to produce a Train experiment that will attempt to resume on both experiment-level and trial-level failures: import os import ray from ray import air from ray.data.preprocessors import BatchMapper from ray.train.trainer import BaseTrainer experiment_name = "unique_experiment_name" local_dir = "~/ray_results" experiment_dir = os.path.join(local_dir, experiment_name) # Define some dummy inputs for demonstration purposes datasets = {"train": ray.data.from_items([{"a": i} for i in range(10)])} preprocessor = BatchMapper(lambda x: x, batch_format="numpy") class CustomTrainer(BaseTrainer): def training_loop(self): pass if CustomTrainer.can_restore(experiment_dir): trainer = CustomTrainer.restore( experiment_dir, datasets=datasets, ) else: trainer = CustomTrainer( datasets=datasets, preprocessor=preprocessor, run_config=air.RunConfig( name=experiment_name, local_dir=local_dir, # Tip: You can also enable retries on failure for # worker-level fault tolerance failure_config=air.FailureConfig(max_failures=3), ), ) result = trainer.fit() ... Parameters path – The path to the experiment directory of the training run to restore. This can be a local path or a remote URI if the experiment was uploaded to the cloud. datasets – Re-specified datasets used in the original training run. This must include all the datasets that were passed in the original trainer constructor. preprocessor – Optionally re-specified preprocessor that was passed in the original trainer constructor. This should be used to re-supply the preprocessor if it is not restorable in a new Ray cluster. This preprocessor will be fit at the start before resuming training. If no preprocessor is passed in restore, then the old preprocessor will be loaded from the latest checkpoint and will not be re-fit. scaling_config – Optionally re-specified scaling config. This can be modified to be different from the original spec. 
**kwargs – Other optionally re-specified arguments, passed in by subclasses. Raises ValueError – If all datasets were not re-supplied on restore. Returns A restored instance of the class that is calling this method. Return type BaseTrainerray.train.lightgbm.LightGBMTrainer.setup LightGBMTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop.ray.train.lightgbm.LightGBMCheckpoint class ray.train.lightgbm.LightGBMCheckpoint(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None)[source] Bases: ray.air.checkpoint.Checkpoint A Checkpoint with LightGBM-specific functionality. Create this from a generic Checkpoint by calling LightGBMCheckpoint.from_checkpoint(ckpt). PublicAPI (beta): This API is in beta and may change before becoming stable. Methods __init__([local_path, data_dict, uri]) DeveloperAPI: This API may change across minor Ray releases. as_directory() Return checkpoint directory path in a context. from_bytes(data) Create a checkpoint from the given byte string. from_checkpoint(other) Create a checkpoint from a generic Checkpoint. from_dict(data) Create checkpoint object from dictionary. from_directory(path) Create checkpoint object from directory. from_model(booster, *[, preprocessor]) Create a Checkpoint that stores a LightGBM model. from_uri(uri) Create checkpoint object from location URI (e.g. get_internal_representation() Return tuple of (type, data) for the internal representation. get_model() Retrieve the LightGBM model stored in this checkpoint. get_preprocessor() Return the saved preprocessor, if one exists. set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. to_bytes() Return Checkpoint serialized as bytes object. to_dict() Return checkpoint data as dictionary. to_directory([path]) Write checkpoint data to directory. to_uri(uri) Write checkpoint data to location URI (e.g. ray.train.lightgbm.LightGBMCheckpoint.__init__ LightGBMCheckpoint.__init__(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None) DeveloperAPI: This API may change across minor Ray releases.ray.train.lightgbm.LightGBMCheckpoint.as_directory LightGBMCheckpoint.as_directory() -> Iterator[str] Return checkpoint directory path in a context. This function makes checkpoint data available as a directory while avoiding unnecessary copies and left-over temporary data. If the checkpoint is already a directory checkpoint, it will return the existing path. If it is not, it will create a temporary directory, which will be deleted after the context is exited. Users should treat the returned checkpoint directory as read-only and avoid changing any data within it, as it might get deleted when exiting the context. Example: with checkpoint.as_directory() as checkpoint_dir: # Do some read-only processing of files within checkpoint_dir pass # At this point, if a temporary directory was created, it will have # been deleted.ray.train.lightgbm.LightGBMCheckpoint.from_bytes classmethod LightGBMCheckpoint.from_bytes(data: bytes) -> ray.air.checkpoint.Checkpoint Create a checkpoint from the given byte string. Parameters data – Data object containing pickled checkpoint data. Returns checkpoint object. 
Return type Checkpoint
ray.train.lightgbm.LightGBMCheckpoint.from_checkpoint classmethod LightGBMCheckpoint.from_checkpoint(other: ray.air.checkpoint.Checkpoint) -> ray.air.checkpoint.Checkpoint Create a checkpoint from a generic Checkpoint. This method can be used to create a framework-specific checkpoint from a generic Checkpoint object. Examples >>> result = TorchTrainer.fit(...) >>> checkpoint = TorchCheckpoint.from_checkpoint(result.checkpoint) >>> model = checkpoint.get_model() Linear(in_features=1, out_features=1, bias=True) DeveloperAPI: This API may change across minor Ray releases.
ray.train.lightgbm.LightGBMCheckpoint.from_dict classmethod LightGBMCheckpoint.from_dict(data: dict) -> ray.air.checkpoint.Checkpoint Create checkpoint object from dictionary. Parameters data – Dictionary containing checkpoint data. Returns checkpoint object. Return type Checkpoint
ray.train.lightgbm.LightGBMCheckpoint.from_directory classmethod LightGBMCheckpoint.from_directory(path: Union[str, os.PathLike]) -> ray.air.checkpoint.Checkpoint Create checkpoint object from directory. Parameters path – Directory containing checkpoint data. The caller promises to not delete the directory (gifts ownership of the directory to this Checkpoint). Returns checkpoint object. Return type Checkpoint
ray.train.lightgbm.LightGBMCheckpoint.from_model classmethod LightGBMCheckpoint.from_model(booster: lightgbm.basic.Booster, *, preprocessor: Optional[Preprocessor] = None) -> LightGBMCheckpoint[source] Create a Checkpoint that stores a LightGBM model. Parameters booster – The LightGBM model to store in the checkpoint. preprocessor – A fitted preprocessor to be applied before inference. Returns A LightGBMCheckpoint containing the specified booster. Examples >>> import lightgbm >>> import numpy as np >>> from ray.train.lightgbm import LightGBMCheckpoint >>> >>> train_X = np.array([[1, 2], [3, 4]]) >>> train_y = np.array([0, 1]) >>> >>> model = lightgbm.LGBMClassifier().fit(train_X, train_y) >>> checkpoint = LightGBMCheckpoint.from_model(model.booster_) You can use a LightGBMCheckpoint to create a LightGBMPredictor and perform inference. >>> from ray.train.lightgbm import LightGBMPredictor >>> >>> predictor = LightGBMPredictor.from_checkpoint(checkpoint)
ray.train.lightgbm.LightGBMCheckpoint.from_uri classmethod LightGBMCheckpoint.from_uri(uri: str) -> ray.air.checkpoint.Checkpoint Create checkpoint object from location URI (e.g. cloud storage). Valid locations currently include AWS S3 (s3://), Google cloud storage (gs://), HDFS (hdfs://), and local files (file://). Parameters uri – Source location URI to read data from. Returns checkpoint object. Return type Checkpoint
ray.train.lightgbm.LightGBMCheckpoint.get_internal_representation LightGBMCheckpoint.get_internal_representation() -> Tuple[str, Union[dict, str, ray.ObjectRef]] Return tuple of (type, data) for the internal representation. The internal representation can be used e.g. to compare checkpoint objects for equality or to access the underlying data storage. The returned type is a string and one of ["local_path", "data_dict", "uri"]. The data is the respective data value. Note that paths converted from file://... will be returned as local_path (without the file:// prefix) and not as uri. Returns Tuple of type and data.
DeveloperAPI: This API may change across minor Ray releases.ray.train.lightgbm.LightGBMCheckpoint.get_model LightGBMCheckpoint.get_model() -> lightgbm.basic.Booster[source] Retrieve the LightGBM model stored in this checkpoint.ray.train.lightgbm.LightGBMCheckpoint.get_preprocessor LightGBMCheckpoint.get_preprocessor() -> Optional[Preprocessor] Return the saved preprocessor, if one exists.ray.train.lightgbm.LightGBMCheckpoint.set_preprocessor LightGBMCheckpoint.set_preprocessor(preprocessor: Optional[Preprocessor]) Saves the provided preprocessor to this Checkpoint.ray.train.lightgbm.LightGBMCheckpoint.to_bytes LightGBMCheckpoint.to_bytes() -> bytes Return Checkpoint serialized as bytes object. Returns Bytes object containing checkpoint data. Return type bytesray.train.lightgbm.LightGBMCheckpoint.to_dict LightGBMCheckpoint.to_dict() -> dict Return checkpoint data as dictionary. Returns Dictionary containing checkpoint data. Return type dictray.train.lightgbm.LightGBMCheckpoint.to_directory LightGBMCheckpoint.to_directory(path: Optional[str] = None) -> str Write checkpoint data to directory. Parameters path – Target directory to restore data in. If not specified, will create a temporary directory. Returns Directory containing checkpoint data. Return type strray.train.lightgbm.LightGBMCheckpoint.to_uri LightGBMCheckpoint.to_uri(uri: str) -> str Write checkpoint data to location URI (e.g. cloud storage). Parameters uri – Target location URI to write data to. Returns Cloud location containing checkpoint data. Return type str Attributes path Return path to checkpoint, if available. uri Return checkpoint URI, if available. ray.train.lightgbm.LightGBMCheckpoint.path property LightGBMCheckpoint.path: Optional[str] Return path to checkpoint, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local path if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.path == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.path == None Returns Checkpoint path if this checkpoint is reachable from the current node (e.g. cloud storage or locally available directory).ray.train.lightgbm.LightGBMCheckpoint.uri property LightGBMCheckpoint.uri: Optional[str] Return checkpoint URI, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local file:// URI if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Users can then choose to persist to cloud with Checkpoint.to_uri(). Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.uri == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.uri == None Returns Checkpoint URI if this URI is reachable from the current node (e.g. cloud storage or locally available file URI). Hugging Face Transformers TransformersTrainer(*args, **kwargs) A Trainer for data parallel HuggingFace Transformers on PyTorch training. TransformersCheckpoint([local_path, ...]) A Checkpoint with HuggingFace-specific functionality. 
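Before the TransformersTrainer reference below, here is a short orientation sketch for the TransformersCheckpoint utilities summarized in the table above. It is hedged: the "./ckpt" path and the gpt2 model are arbitrary choices for illustration, and passing an auto class to get_model is an assumption layered on the signature documented later in this section.

from transformers import AutoModelForCausalLM, AutoTokenizer
from ray.train.huggingface import TransformersCheckpoint

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Persist the model and tokenizer into a checkpoint directory.
checkpoint = TransformersCheckpoint.from_model(model, tokenizer, path="./ckpt")

# Later, rebuild the model from the checkpoint; get_model needs to be told
# which model class (or instance) to load into.
restored_model = checkpoint.get_model(AutoModelForCausalLM)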
ray.train.huggingface.TransformersTrainer class ray.train.huggingface.TransformersTrainer(*args, **kwargs)[source] Bases: ray.train.torch.torch_trainer.TorchTrainer A Trainer for data parallel HuggingFace Transformers on PyTorch training. This Trainer runs the transformers.Trainer.train() method on multiple Ray Actors. The training is carried out in a distributed fashion through PyTorch DDP. These actors already have the necessary torch process group configured for distributed PyTorch training. If you have PyTorch >= 1.12.0 installed, you can also run FSDP training by specifying the fsdp argument in TrainingArguments. DeepSpeed is also supported - see GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed. For more information on configuring FSDP or DeepSpeed, refer to the Hugging Face documentation. The training function run on every Actor will first call the specified trainer_init_per_worker function to obtain an instantiated transformers.Trainer object. The trainer_init_per_worker function will have access to the preprocessed train and evaluation datasets. If the datasets dict contains a training dataset (denoted by the “train” key), then it will be split into multiple dataset shards, with each Actor training on a single shard. All the other datasets will not be split. Please note that if you use a custom transformers.Trainer subclass, the get_train_dataloader method will be wrapped to disable sharding by transformers.IterableDatasetShard, as the dataset will already be sharded on the Ray AIR side. You can also provide a datasets.Dataset object or other dataset objects allowed by transformers.Trainer directly in the trainer_init_per_worker function, without specifying the datasets dict. It is recommended to initialize those objects inside the function, as otherwise they will be serialized and passed to the function, which may lead to long runtimes and memory issues with large amounts of data. In this case, the training dataset will be split automatically by Transformers. HuggingFace loggers will be automatically disabled, and the local_rank argument in TrainingArguments will be automatically set. Please note that if you want to use CPU training, you will need to set the no_cuda argument in TrainingArguments manually - otherwise, an exception (segfault) may be thrown. This Trainer requires the transformers>=4.19.0 package. It is tested with transformers==4.19.1. Example # Based on # huggingface/notebooks/examples/language_modeling_from_scratch.ipynb # Hugging Face imports from datasets import load_dataset import transformers from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer import ray from ray.train.huggingface import TransformersTrainer from ray.air.config import ScalingConfig # If using GPUs, set this to True. use_gpu = True model_checkpoint = "gpt2" tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer" block_size = 128 datasets = load_dataset("wikitext", "wikitext-2-raw-v1") tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint) def tokenize_function(examples): return tokenizer(examples["text"]) tokenized_datasets = datasets.map( tokenize_function, batched=True, num_proc=1, remove_columns=["text"] ) def group_texts(examples): # Concatenate all texts. concatenated_examples = { k: sum(examples[k], []) for k in examples.keys() } total_length = len(concatenated_examples[list(examples.keys())[0]]) # We drop the small remainder, we could add padding if the model # supported it. # instead of this drop, you can customize this part to your needs.
total_length = (total_length // block_size) * block_size # Split by chunks of max_len. result = { k: [ t[i : i + block_size] for i in range(0, total_length, block_size) ] for k, t in concatenated_examples.items() } result["labels"] = result["input_ids"].copy() return result lm_datasets = tokenized_datasets.map( group_texts, batched=True, batch_size=1000, num_proc=1, ) ray_train_ds = ray.data.from_huggingface(lm_datasets["train"]) ray_evaluation_ds = ray.data.from_huggingface( lm_datasets["validation"] ) def trainer_init_per_worker(train_dataset, eval_dataset, **config): model_config = AutoConfig.from_pretrained(model_checkpoint) model = AutoModelForCausalLM.from_config(model_config) args = transformers.TrainingArguments( output_dir=f"{model_checkpoint}-wikitext2", evaluation_strategy="epoch", save_strategy="epoch", logging_strategy="epoch", learning_rate=2e-5, weight_decay=0.01, no_cuda=(not use_gpu), # Take a small subset for doctest max_steps=100, ) return transformers.Trainer( model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset, ) scaling_config = ScalingConfig(num_workers=4, use_gpu=use_gpu) trainer = TransformersTrainer( trainer_init_per_worker=trainer_init_per_worker, scaling_config=scaling_config, datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds}, ) result = trainer.fit() ... Parameters trainer_init_per_worker – The function that returns an instantiated transformers.Trainer object and takes in the following arguments: train Torch.Dataset, optional evaluation Torch.Dataset and config as kwargs. The Torch Datasets are automatically created by converting the Ray Datasets internally before they are passed into the function. trainer_init_config – Configurations to pass into trainer_init_per_worker as kwargs. torch_config – Configuration for setting up the PyTorch backend. If set to None, use the default configuration. This replaces the backend_config arg of DataParallelTrainer. Same as in TorchTrainer. scaling_config – Configuration for how to scale data parallel training. dataset_config – Configuration for dataset ingest. run_config – Configuration for the execution of the training run. datasets – Any Ray Datasets to use for training. Use the key “train” to denote which dataset is the training dataset and key “evaluation” to denote the evaluation dataset. Can only contain a training dataset and up to one extra dataset to be used for evaluation. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. get_dataset_config() Return a copy of this Trainer's final dataset configs. restore(path[, trainer_init_per_worker, ...]) Restores a TransformersTrainer from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. 
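One detail the example above leaves implicit is how trainer_init_config reaches trainer_init_per_worker: its entries are forwarded as keyword arguments. The sketch below illustrates that plumbing under stated assumptions; the learning_rate and max_steps keys are hypothetical, and scaling_config, ray_train_ds, and ray_evaluation_ds are reused from the example above.

import transformers
from transformers import AutoConfig, AutoModelForCausalLM
from ray.train.huggingface import TransformersTrainer

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    # Entries of `trainer_init_config` show up here as keyword arguments.
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("gpt2"))
    args = transformers.TrainingArguments(
        output_dir="gpt2-out",
        learning_rate=config.get("learning_rate", 2e-5),  # hypothetical key
        max_steps=config.get("max_steps", 100),           # hypothetical key
        no_cuda=True,  # CPU training; see the note on no_cuda above
    )
    return transformers.Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

trainer = TransformersTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={"learning_rate": 5e-5, "max_steps": 50},
    scaling_config=scaling_config,  # reused from the example above
    datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds},
)
result = trainer.fit()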
ray.train.huggingface.TransformersTrainer.as_trainable TransformersTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.huggingface.TransformersTrainer.can_restore classmethod TransformersTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.huggingface.TransformersTrainer.fit TransformersTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.huggingface.TransformersTrainer.get_dataset_config TransformersTrainer.get_dataset_config() -> ray.train._internal.data_config.DataConfig Return a copy of this Trainer’s final dataset configs. Returns The merged default + user-supplied dataset config.ray.train.huggingface.TransformersTrainer.restore classmethod TransformersTrainer.restore(path: str, trainer_init_per_worker: Optional[Callable[[torch.utils.data.dataset.Dataset, Optional[torch.utils.data.dataset.Dataset], Any], transformers.trainer.Trainer]] = None, trainer_init_config: Optional[Dict] = None, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None) -> TransformersTrainer[source] Restores a TransformersTrainer from a previously interrupted/failed run. Parameters trainer_init_per_worker – Optionally re-specified trainer init function. This should be used to re-specify a function that is not restorable in a new Ray cluster (e.g., it holds onto outdated object references). This should be the same trainer init that was passed to the original trainer constructor. trainer_init_config – Optionally re-specified trainer init config. This should similarly be used if the original train_loop_config contained outdated object references, and it should not be modified from what was originally passed in. See BaseTrainer.restore() for descriptions of the other arguments. Returns A restored instance of TransformersTrainer Return type TransformersTrainerray.train.huggingface.TransformersTrainer.setup TransformersTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop.ray.train.huggingface.TransformersCheckpoint class ray.train.huggingface.TransformersCheckpoint(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None)[source] Bases: ray.air.checkpoint.Checkpoint A Checkpoint with HuggingFace-specific functionality. Use TransformersCheckpoint.from_model to create this type of checkpoint. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods __init__([local_path, data_dict, uri]) DeveloperAPI: This API may change across minor Ray releases. 
as_directory() Return checkpoint directory path in a context. from_bytes(data) Create a checkpoint from the given byte string. from_checkpoint(other) Create a checkpoint from a generic Checkpoint. from_dict(data) Create checkpoint object from dictionary. from_directory(path) Create checkpoint object from directory. from_model(model[, tokenizer, preprocessor]) Create a Checkpoint that stores a HuggingFace model. from_uri(uri) Create checkpoint object from location URI (e.g. get_internal_representation() Return tuple of (type, data) for the internal representation. get_model(model, **pretrained_model_kwargs) Retrieve the model stored in this checkpoint. get_preprocessor() Return the saved preprocessor, if one exists. get_tokenizer(tokenizer, **kwargs) Create a tokenizer using the data stored in this checkpoint. get_training_arguments() Retrieve the training arguments stored in this checkpoint. set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. to_bytes() Return Checkpoint serialized as bytes object. to_dict() Return checkpoint data as dictionary. to_directory([path]) Write checkpoint data to directory. to_uri(uri) Write checkpoint data to location URI (e.g. ray.train.huggingface.TransformersCheckpoint.__init__ TransformersCheckpoint.__init__(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None) DeveloperAPI: This API may change across minor Ray releases.ray.train.huggingface.TransformersCheckpoint.as_directory TransformersCheckpoint.as_directory() -> Iterator[str] Return checkpoint directory path in a context. This function makes checkpoint data available as a directory while avoiding unnecessary copies and left-over temporary data. If the checkpoint is already a directory checkpoint, it will return the existing path. If it is not, it will create a temporary directory, which will be deleted after the context is exited. Users should treat the returned checkpoint directory as read-only and avoid changing any data within it, as it might get deleted when exiting the context. Example: with checkpoint.as_directory() as checkpoint_dir: # Do some read-only processing of files within checkpoint_dir pass # At this point, if a temporary directory was created, it will have # been deleted.ray.train.huggingface.TransformersCheckpoint.from_bytes classmethod TransformersCheckpoint.from_bytes(data: bytes) -> ray.air.checkpoint.Checkpoint Create a checkpoint from the given byte string. Parameters data – Data object containing pickled checkpoint data. Returns checkpoint object. Return type Checkpointray.train.huggingface.TransformersCheckpoint.from_checkpoint classmethod TransformersCheckpoint.from_checkpoint(other: ray.air.checkpoint.Checkpoint) -> ray.air.checkpoint.Checkpoint Create a checkpoint from a generic Checkpoint. This method can be used to create a framework-specific checkpoint from a generic Checkpoint object. Examples >>> result = TorchTrainer.fit(...) >>> checkpoint = TorchCheckpoint.from_checkpoint(result.checkpoint) >>> model = checkpoint.get_model() Linear(in_features=1, out_features=1, bias=True) DeveloperAPI: This API may change across minor Ray releases.ray.train.huggingface.TransformersCheckpoint.from_dict classmethod TransformersCheckpoint.from_dict(data: dict) -> ray.air.checkpoint.Checkpoint Create checkpoint object from dictionary. Parameters data – Dictionary containing checkpoint data. Returns checkpoint object. 
Return type Checkpointray.train.huggingface.TransformersCheckpoint.from_directory classmethod TransformersCheckpoint.from_directory(path: Union[str, os.PathLike]) -> ray.air.checkpoint.Checkpoint Create checkpoint object from directory. Parameters path – Directory containing checkpoint data. The caller promises to not delete the directory (gifts ownership of the directory to this Checkpoint). Returns checkpoint object. Return type Checkpointray.train.huggingface.TransformersCheckpoint.from_model classmethod TransformersCheckpoint.from_model(model: Union[transformers.modeling_utils.PreTrainedModel, torch.nn.Module], tokenizer: Optional[transformers.PreTrainedTokenizer] = None, *, path: os.PathLike, preprocessor: Optional[Preprocessor] = None) -> TransformersCheckpoint[source] Create a Checkpoint that stores a HuggingFace model. Parameters model – The pretrained transformer or Torch model to store in the checkpoint. tokenizer – The Tokenizer to use in the Transformers pipeline for inference. path – The directory where the checkpoint will be stored. preprocessor – A fitted preprocessor to be applied before inference. Returns A TransformersCheckpoint containing the specified model.ray.train.huggingface.TransformersCheckpoint.from_uri classmethod TransformersCheckpoint.from_uri(uri: str) -> ray.air.checkpoint.Checkpoint Create checkpoint object from location URI (e.g. cloud storage). Valid locations currently include AWS S3 (s3://), Google cloud storage (gs://), HDFS (hdfs://), and local files (file://). Parameters uri – Source location URI to read data from. Returns checkpoint object. Return type Checkpointray.train.huggingface.TransformersCheckpoint.get_internal_representation TransformersCheckpoint.get_internal_representation() -> Tuple[str, Union[dict, str, ray.ObjectRef]] Return tuple of (type, data) for the internal representation. The internal representation can be used e.g. to compare checkpoint objects for equality or to access the underlying data storage. The returned type is a string and one of ["local_path", "data_dict", "uri"]. The data is the respective data value. Note that paths converted from file://... will be returned as local_path (without the file:// prefix) and not as uri. Returns Tuple of type and data. 
DeveloperAPI: This API may change across minor Ray releases.ray.train.huggingface.TransformersCheckpoint.get_model TransformersCheckpoint.get_model(model: Union[Type[transformers.modeling_utils.PreTrainedModel], torch.nn.modules.module.Module], **pretrained_model_kwargs) -> Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module][source] Retrieve the model stored in this checkpoint.ray.train.huggingface.TransformersCheckpoint.get_preprocessor TransformersCheckpoint.get_preprocessor() -> Optional[Preprocessor] Return the saved preprocessor, if one exists.ray.train.huggingface.TransformersCheckpoint.get_tokenizer TransformersCheckpoint.get_tokenizer(tokenizer: Type[transformers.tokenization_utils.PreTrainedTokenizer], **kwargs) -> Optional[transformers.tokenization_utils.PreTrainedTokenizer][source] Create a tokenizer using the data stored in this checkpoint.ray.train.huggingface.TransformersCheckpoint.get_training_arguments TransformersCheckpoint.get_training_arguments() -> transformers.training_args.TrainingArguments[source] Retrieve the training arguments stored in this checkpoint.ray.train.huggingface.TransformersCheckpoint.set_preprocessor TransformersCheckpoint.set_preprocessor(preprocessor: Optional[Preprocessor]) Saves the provided preprocessor to this Checkpoint.ray.train.huggingface.TransformersCheckpoint.to_bytes TransformersCheckpoint.to_bytes() -> bytes Return Checkpoint serialized as bytes object. Returns Bytes object containing checkpoint data. Return type bytesray.train.huggingface.TransformersCheckpoint.to_dict TransformersCheckpoint.to_dict() -> dict Return checkpoint data as dictionary. Returns Dictionary containing checkpoint data. Return type dictray.train.huggingface.TransformersCheckpoint.to_directory TransformersCheckpoint.to_directory(path: Optional[str] = None) -> str Write checkpoint data to directory. Parameters path – Target directory to restore data in. If not specified, will create a temporary directory. Returns Directory containing checkpoint data. Return type strray.train.huggingface.TransformersCheckpoint.to_uri TransformersCheckpoint.to_uri(uri: str) -> str Write checkpoint data to location URI (e.g. cloud storage). Parameters uri – Target location URI to write data to. Returns Cloud location containing checkpoint data. Return type str Attributes path Return path to checkpoint, if available. uri Return checkpoint URI, if available. ray.train.huggingface.TransformersCheckpoint.path property TransformersCheckpoint.path: Optional[str] Return path to checkpoint, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local path if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.path == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.path == None Returns Checkpoint path if this checkpoint is reachable from the current node (e.g. cloud storage or locally available directory).ray.train.huggingface.TransformersCheckpoint.uri property TransformersCheckpoint.uri: Optional[str] Return checkpoint URI, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local file:// URI if this checkpoint is persisted on local disk and available on the current node. 
In all other cases, this will return None. Users can then choose to persist to cloud with Checkpoint.to_uri(). Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.uri == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.uri == None Returns Checkpoint URI if this URI is reachable from the current node (e.g. cloud storage or locally available file URI). Accelerate AccelerateTrainer(*args, **kwargs) A Trainer for data parallel HuggingFace Accelerate training with PyTorch. ray.train.huggingface.AccelerateTrainer class ray.train.huggingface.AccelerateTrainer(*args, **kwargs)[source] Bases: ray.train.torch.torch_trainer.TorchTrainer A Trainer for data parallel HuggingFace Accelerate training with PyTorch. This Trainer is a wrapper around the TorchTrainer, providing the following extra functionality: 1. Loading and parsing of Accelerate configuration files (created by accelerate config CLI command), 2. Applying the configuration files on all workers, making sure the environment is set up correctly. This Trainer runs the function train_loop_per_worker on multiple Ray Actors. These actors already have the necessary torch process group configured for distributed PyTorch training, as well as all environment variables required by Accelerate, as defined in the configuration file. This allows you to use Accelerate APIs (such as Accelerator) inside train_loop_per_worker as you would without Ray. Inside the train_loop_per_worker function, In addition to Accelerate APIs, you can use any of the Ray AIR session methods. See full example code below. def train_loop_per_worker(): # Report intermediate results for callbacks or logging and # checkpoint data. session.report(...) # Get dict of last saved checkpoint. session.get_checkpoint() # Session returns the Dataset shard for the given key. session.get_dataset_shard("my_dataset") # Get the total number of workers executing training. session.get_world_size() # Get the rank of this worker. session.get_world_rank() # Get the rank of the worker on the current node. session.get_local_rank() For more information, see the documentation of TorchTrainer. You need to use session.report() to communicate results and checkpoints back to Ray Train. Accelerate integrations with DeepSpeed, FSDP, MegatronLM etc. are fully supported. If the Accelerate configuration contains a path to a DeepSpeed config file (deepspeed_config_file), that file will also be loaded and applied on the workers. The following Accelerate configuration options will be ignored and automatically set by the Trainer according to Ray AIR configs (eg. ScalingConfig): - Number of machines (num_machines) - Number of processes (num_processes) - Rank of the current machine (machine_rank) - Local rank of the current machine - GPU IDs (gpu_ids) - Number of CPU threads per process (num_cpu_threads_per_process) - IP of the head process (main_process_ip) - Port of the head process (main_process_port) - Whether all machines are on the same network (same_network) - Whether to force a CPU-only mode (cpu/use_cpu) - rdzv backend (rdzv_backend) - Main training function (main_training_function) - Type of launcher This Trainer requires accelerate>=0.17.0 package. 
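As described above, the Trainer can also consume a configuration file produced by the accelerate config CLI command. A minimal sketch under stated assumptions: the YAML path is hypothetical, and train_loop_per_worker and train_dataset are taken from the full example that follows.

from ray.air.config import ScalingConfig
from ray.train.huggingface import AccelerateTrainer

trainer = AccelerateTrainer(
    train_loop_per_worker=train_loop_per_worker,  # defined as in the example below
    # Hypothetical path to a file previously written by `accelerate config`.
    # Passing None instead loads the file from Accelerate's default location.
    accelerate_config="configs/accelerate_ddp.yaml",
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    datasets={"train": train_dataset},  # defined as in the example below
)
result = trainer.fit()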
Example

import torch
import torch.nn as nn
from accelerate import Accelerator

import ray
from ray.air import session, Checkpoint
from ray.train.huggingface import AccelerateTrainer
from ray.air.config import ScalingConfig
from ray.air.config import RunConfig
from ray.air.config import CheckpointConfig

# If using GPUs, set this to True.
use_gpu = False

# Define the NN layer architecture, epochs, and number of workers
input_size = 1
layer_size = 32
output_size = 1
num_epochs = 30
num_workers = 3

# Define your network structure
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))

# Define your train worker loop
def train_loop_per_worker():
    torch.manual_seed(42)

    # Initialize the Accelerator
    accelerator = Accelerator()

    # Fetch training set from the session
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()

    # Loss function, optimizer, prepare model for training.
    # This moves the data and prepares model for distributed
    # execution
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(
        model.parameters(), lr=0.01, weight_decay=0.01
    )
    model, optimizer = accelerator.prepare(model, optimizer)

    # Iterate over epochs and batches
    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            # Add batch or unsqueeze as an additional dimension [32, x]
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)

            # Make the output shape the same as the labels
            loss = loss_fn(output.squeeze(), labels)

            # Zero out grads, do backward, and update optimizer
            optimizer.zero_grad()
            accelerator.backward(loss)
            optimizer.step()

            # Print the running loss every 20 epochs
            if epoch % 20 == 0:
                print(f"epoch: {epoch}/{num_epochs}, loss: {loss:.3f}")

        # Report and record metrics, checkpoint model at end of each
        # epoch
        session.report(
            {"loss": loss.item(), "epoch": epoch},
            checkpoint=Checkpoint.from_dict(
                dict(
                    epoch=epoch,
                    model=accelerator.unwrap_model(model).state_dict(),
                )
            ),
        )

train_dataset = ray.data.from_items(
    [{"x": x, "y": 2 * x + 1} for x in range(2000)]
)

# Define scaling and run configs
scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu)
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=1))

trainer = AccelerateTrainer(
    train_loop_per_worker=train_loop_per_worker,
    # Instead of using a dict, you can run ``accelerate config``.
    # The default value of None will then load that configuration
    # file.
    accelerate_config={},
    scaling_config=scaling_config,
    run_config=run_config,
    datasets={"train": train_dataset},
)

result = trainer.fit()

best_checkpoint_loss = result.metrics["loss"]

# Assert the final reported loss is at most 0.09
assert best_checkpoint_loss <= 0.09

...

Parameters train_loop_per_worker – The training function to execute. This can either take in no arguments or a config dict. train_loop_config – Configurations to pass into train_loop_per_worker if it accepts an argument. accelerate_config – Accelerate configuration to be applied on every worker. This can be a path to a file generated with accelerate config, a configuration dict, or None, in which case it will load the configuration file from the default location as defined by Accelerate. torch_config – Configuration for setting up the PyTorch backend. If set to None, use the default configuration.
This replaces the backend_config arg of DataParallelTrainer. scaling_config – Configuration for how to scale data parallel training. dataset_config – Configuration for dataset ingest. run_config – Configuration for the execution of the training run. datasets – Any Datasets to use for training. Use the key “train” to denote which dataset is the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. Methods can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. get_dataset_config() Return a copy of this Trainer's final dataset configs. restore(path[, train_loop_per_worker, ...]) Restores a DataParallelTrainer from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.huggingface.AccelerateTrainer.can_restore classmethod AccelerateTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.huggingface.AccelerateTrainer.fit AccelerateTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.huggingface.AccelerateTrainer.get_dataset_config AccelerateTrainer.get_dataset_config() -> ray.train._internal.data_config.DataConfig Return a copy of this Trainer’s final dataset configs. Returns The merged default + user-supplied dataset config.ray.train.huggingface.AccelerateTrainer.restore classmethod AccelerateTrainer.restore(path: str, train_loop_per_worker: Optional[Union[Callable[[], None], Callable[[Dict], None]]] = None, train_loop_config: Optional[Dict] = None, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None) -> DataParallelTrainer Restores a DataParallelTrainer from a previously interrupted/failed run. Parameters train_loop_per_worker – Optionally re-specified train loop function. This should be used to re-specify a function that is not restorable in a new Ray cluster (e.g., it holds onto outdated object references). This should be the same training loop that was passed to the original trainer constructor. train_loop_config – Optionally re-specified train config. This should similarly be used if the original train_loop_config contained outdated object references, and it should not be modified from what was originally passed in. See BaseTrainer.restore() for descriptions of the other arguments. Returns A restored instance of the DataParallelTrainer subclass that is calling this method. 
Return type DataParallelTrainer
ray.train.huggingface.AccelerateTrainer.setup AccelerateTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop.
Scikit-Learn SklearnTrainer(*args, **kwargs) A Trainer for scikit-learn estimator training. SklearnCheckpoint([local_path, data_dict, uri]) A Checkpoint with sklearn-specific functionality.
ray.train.sklearn.SklearnTrainer class ray.train.sklearn.SklearnTrainer(*args, **kwargs)[source] Bases: ray.train.base_trainer.BaseTrainer A Trainer for scikit-learn estimator training. This Trainer runs the fit method of the given estimator in a non-distributed manner on a single Ray Actor. By default, the n_jobs (or thread_count) estimator parameters will be set to match the number of CPUs assigned to the Ray Actor. This behavior can be disabled by setting set_estimator_cpus=False. If you wish to use GPU-enabled estimators (e.g., cuML), make sure to set "GPU": 1 in scaling_config.trainer_resources. The results are reported all at once and not in an iterative fashion. No checkpointing is done during training. This may be changed in the future. Example: import ray from ray.train.sklearn import SklearnTrainer from sklearn.ensemble import RandomForestRegressor train_dataset = ray.data.from_items( [{"x": x, "y": x + 1} for x in range(32)]) trainer = SklearnTrainer( estimator=RandomForestRegressor(), label_column="y", scaling_config=ray.air.config.ScalingConfig( trainer_resources={"CPU": 4} ), datasets={"train": train_dataset} ) result = trainer.fit() ... Parameters estimator – A scikit-learn compatible estimator to use. datasets – Datasets to use for training and validation. Must include a “train” key denoting the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. All non-training datasets will be used as separate validation sets, each reporting separate metrics. label_column – Name of the label column. A column with this name must be present in the training dataset. If None, no validation will be performed. params – Optional dict of params to be set on the estimator before fitting. Useful for hyperparameter tuning. scoring – Strategy to evaluate the performance of the model on the validation sets and for cross-validation. Same as in sklearn.model_selection.cross_validation. If scoring represents a single score, one can use: a single string; a callable that returns a single value. If scoring represents multiple scores, one can use: a list or tuple of unique strings; a callable returning a dictionary where the keys are the metric names and the values are the metric scores; a dictionary with metric names as keys and callables as values. cv – Determines the cross-validation splitting strategy. If specified, cross-validation will be run on the train dataset, in addition to computing metrics for validation datasets. Same as in sklearn.model_selection.cross_validation, with the exception of None. Possible inputs for cv are: None, to skip cross-validation.
int, to specify the number of folds in a (Stratified)KFold, CV splitter, An iterable yielding (train, test) splits as arrays of indices.For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.If you provide a “cv_groups” column in the train dataset, it will be used as group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold). This corresponds to the groups argument in sklearn.model_selection.cross_validation. return_train_score_cv – Whether to also return train scores during cross-validation. Ignored if cv is None. parallelize_cv – If set to True, will parallelize cross-validation instead of the estimator. If set to None, will detect if the estimator has any parallelism-related params (n_jobs or thread_count) and parallelize cross-validation if there are none. If False, will not parallelize cross-validation. Cannot be set to True if there are any GPUs assigned to the trainer. Ignored if cv is None. set_estimator_cpus – If set to True, will automatically set the values of all n_jobs and thread_count parameters in the estimator (including in nested objects) to match the number of available CPUs. scaling_config – Configuration for how to scale training. Only the trainer_resources key can be provided, as the training is not distributed. run_config – Configuration for the execution of the training run. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. **fit_params – Additional kwargs passed to estimator.fit() method. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. preprocess_datasets() Called during fit() to preprocess dataset attributes with preprocessor. restore(path[, datasets, preprocessor, ...]) Restores a Train experiment from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.sklearn.SklearnTrainer.as_trainable SklearnTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.ray.train.sklearn.SklearnTrainer.can_restore classmethod SklearnTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from Return type boolray.train.sklearn.SklearnTrainer.fit SklearnTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures during the execution of self.as_trainable()`, or during the Tune execution loop – PublicAPI (beta): This API is in beta and may change before becoming stable.ray.train.sklearn.SklearnTrainer.preprocess_datasets SklearnTrainer.preprocess_datasets() -> None Called during fit() to preprocess dataset attributes with preprocessor. This method is run on a remote process. This method is called prior to entering the training_loop. 
If the Trainer has both a datasets dict and a preprocessor, the datasets dict contains a training dataset (denoted by the “train” key), and the preprocessor has not yet been fit, then it will be fit on the train dataset. Then, all Trainer’s datasets will be transformed by the preprocessor. The transformed datasets will be set back in the self.datasets attribute of the Trainer to be used when overriding training_loop.ray.train.sklearn.SklearnTrainer.restore classmethod SklearnTrainer.restore(path: str, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None, **kwargs) -> BaseTrainer Restores a Train experiment from a previously interrupted/failed run. Restore should be used for experiment-level fault tolerance in the event that the head node crashes (e.g., OOM or some other runtime error) or the entire cluster goes down (e.g., network error affecting all nodes). The following example can be paired with implementing job retry using Ray Jobs to produce a Train experiment that will attempt to resume on both experiment-level and trial-level failures: import os import ray from ray import air from ray.data.preprocessors import BatchMapper from ray.train.trainer import BaseTrainer experiment_name = "unique_experiment_name" local_dir = "~/ray_results" experiment_dir = os.path.join(local_dir, experiment_name) # Define some dummy inputs for demonstration purposes datasets = {"train": ray.data.from_items([{"a": i} for i in range(10)])} preprocessor = BatchMapper(lambda x: x, batch_format="numpy") class CustomTrainer(BaseTrainer): def training_loop(self): pass if CustomTrainer.can_restore(experiment_dir): trainer = CustomTrainer.restore( experiment_dir, datasets=datasets, ) else: trainer = CustomTrainer( datasets=datasets, preprocessor=preprocessor, run_config=air.RunConfig( name=experiment_name, local_dir=local_dir, # Tip: You can also enable retries on failure for # worker-level fault tolerance failure_config=air.FailureConfig(max_failures=3), ), ) result = trainer.fit() ... Parameters path – The path to the experiment directory of the training run to restore. This can be a local path or a remote URI if the experiment was uploaded to the cloud. datasets – Re-specified datasets used in the original training run. This must include all the datasets that were passed in the original trainer constructor. preprocessor – Optionally re-specified preprocessor that was passed in the original trainer constructor. This should be used to re-supply the preprocessor if it is not restorable in a new Ray cluster. This preprocessor will be fit at the start before resuming training. If no preprocessor is passed in restore, then the old preprocessor will be loaded from the latest checkpoint and will not be re-fit. scaling_config – Optionally re-specified scaling config. This can be modified to be different from the original spec. **kwargs – Other optionally re-specified arguments, passed in by subclasses. Raises ValueError – If all datasets were not re-supplied on restore. Returns A restored instance of the class that is calling this method. Return type BaseTrainerray.train.sklearn.SklearnTrainer.setup SklearnTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. 
This method is called prior to preprocess_datasets and training_loop.ray.train.sklearn.SklearnCheckpoint class ray.train.sklearn.SklearnCheckpoint(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None)[source] Bases: ray.air.checkpoint.Checkpoint A Checkpoint with sklearn-specific functionality. Create this from a generic Checkpoint by calling SklearnCheckpoint.from_checkpoint(ckpt) PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods __init__([local_path, data_dict, uri]) DeveloperAPI: This API may change across minor Ray releases. as_directory() Return checkpoint directory path in a context. from_bytes(data) Create a checkpoint from the given byte string. from_checkpoint(other) Create a checkpoint from a generic Checkpoint. from_dict(data) Create checkpoint object from dictionary. from_directory(path) Create checkpoint object from directory. from_estimator(estimator, *, path[, ...]) Create a Checkpoint that stores an sklearn Estimator. from_uri(uri) Create checkpoint object from location URI (e.g. get_estimator() Retrieve the Estimator stored in this checkpoint. get_internal_representation() Return tuple of (type, data) for the internal representation. get_preprocessor() Return the saved preprocessor, if one exists. set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. to_bytes() Return Checkpoint serialized as bytes object. to_dict() Return checkpoint data as dictionary. to_directory([path]) Write checkpoint data to directory. to_uri(uri) Write checkpoint data to location URI (e.g. ray.train.sklearn.SklearnCheckpoint.__init__ SklearnCheckpoint.__init__(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None) DeveloperAPI: This API may change across minor Ray releases.ray.train.sklearn.SklearnCheckpoint.as_directory SklearnCheckpoint.as_directory() -> Iterator[str] Return checkpoint directory path in a context. This function makes checkpoint data available as a directory while avoiding unnecessary copies and left-over temporary data. If the checkpoint is already a directory checkpoint, it will return the existing path. If it is not, it will create a temporary directory, which will be deleted after the context is exited. Users should treat the returned checkpoint directory as read-only and avoid changing any data within it, as it might get deleted when exiting the context. Example: with checkpoint.as_directory() as checkpoint_dir: # Do some read-only processing of files within checkpoint_dir pass # At this point, if a temporary directory was created, it will have # been deleted.ray.train.sklearn.SklearnCheckpoint.from_bytes classmethod SklearnCheckpoint.from_bytes(data: bytes) -> ray.air.checkpoint.Checkpoint Create a checkpoint from the given byte string. Parameters data – Data object containing pickled checkpoint data. Returns checkpoint object. Return type Checkpointray.train.sklearn.SklearnCheckpoint.from_checkpoint classmethod SklearnCheckpoint.from_checkpoint(other: ray.air.checkpoint.Checkpoint) -> ray.air.checkpoint.Checkpoint Create a checkpoint from a generic Checkpoint. This method can be used to create a framework-specific checkpoint from a generic Checkpoint object. Examples >>> result = TorchTrainer.fit(...) 
>>> checkpoint = TorchCheckpoint.from_checkpoint(result.checkpoint) >>> model = checkpoint.get_model() Linear(in_features=1, out_features=1, bias=True) DeveloperAPI: This API may change across minor Ray releases.ray.train.sklearn.SklearnCheckpoint.from_dict classmethod SklearnCheckpoint.from_dict(data: dict) -> ray.air.checkpoint.Checkpoint Create checkpoint object from dictionary. Parameters data – Dictionary containing checkpoint data. Returns checkpoint object. Return type Checkpointray.train.sklearn.SklearnCheckpoint.from_directory classmethod SklearnCheckpoint.from_directory(path: Union[str, os.PathLike]) -> ray.air.checkpoint.Checkpoint Create checkpoint object from directory. Parameters path – Directory containing checkpoint data. The caller promises to not delete the directory (gifts ownership of the directory to this Checkpoint). Returns checkpoint object. Return type Checkpointray.train.sklearn.SklearnCheckpoint.from_estimator classmethod SklearnCheckpoint.from_estimator(estimator: sklearn.base.BaseEstimator, *, path: os.PathLike, preprocessor: Optional[Preprocessor] = None) -> SklearnCheckpoint[source] Create a Checkpoint that stores an sklearn Estimator. Parameters estimator – The Estimator to store in the checkpoint. path – The directory where the checkpoint will be stored. preprocessor – A fitted preprocessor to be applied before inference. Returns An SklearnCheckpoint containing the specified Estimator. Examples >>> from ray.train.sklearn import SklearnCheckpoint >>> from sklearn.ensemble import RandomForestClassifier >>> >>> estimator = RandomForestClassifier() >>> checkpoint = SklearnCheckpoint.from_estimator(estimator, path=".") You can use a SklearnCheckpoint to create an SklearnPredictor and preform inference. >>> from ray.train.sklearn import SklearnPredictor >>> >>> predictor = SklearnPredictor.from_checkpoint(checkpoint)ray.train.sklearn.SklearnCheckpoint.from_uri classmethod SklearnCheckpoint.from_uri(uri: str) -> ray.air.checkpoint.Checkpoint Create checkpoint object from location URI (e.g. cloud storage). Valid locations currently include AWS S3 (s3://), Google cloud storage (gs://), HDFS (hdfs://), and local files (file://). Parameters uri – Source location URI to read data from. Returns checkpoint object. Return type Checkpointray.train.sklearn.SklearnCheckpoint.get_estimator SklearnCheckpoint.get_estimator() -> sklearn.base.BaseEstimator[source] Retrieve the Estimator stored in this checkpoint.ray.train.sklearn.SklearnCheckpoint.get_internal_representation SklearnCheckpoint.get_internal_representation() -> Tuple[str, Union[dict, str, ray.ObjectRef]] Return tuple of (type, data) for the internal representation. The internal representation can be used e.g. to compare checkpoint objects for equality or to access the underlying data storage. The returned type is a string and one of ["local_path", "data_dict", "uri"]. The data is the respective data value. Note that paths converted from file://... will be returned as local_path (without the file:// prefix) and not as uri. Returns Tuple of type and data. 
DeveloperAPI: This API may change across minor Ray releases.ray.train.sklearn.SklearnCheckpoint.get_preprocessor SklearnCheckpoint.get_preprocessor() -> Optional[Preprocessor] Return the saved preprocessor, if one exists.ray.train.sklearn.SklearnCheckpoint.set_preprocessor SklearnCheckpoint.set_preprocessor(preprocessor: Optional[Preprocessor]) Saves the provided preprocessor to this Checkpoint.ray.train.sklearn.SklearnCheckpoint.to_bytes SklearnCheckpoint.to_bytes() -> bytes Return Checkpoint serialized as bytes object. Returns Bytes object containing checkpoint data. Return type bytesray.train.sklearn.SklearnCheckpoint.to_dict SklearnCheckpoint.to_dict() -> dict Return checkpoint data as dictionary. Returns Dictionary containing checkpoint data. Return type dictray.train.sklearn.SklearnCheckpoint.to_directory SklearnCheckpoint.to_directory(path: Optional[str] = None) -> str Write checkpoint data to directory. Parameters path – Target directory to restore data in. If not specified, will create a temporary directory. Returns Directory containing checkpoint data. Return type strray.train.sklearn.SklearnCheckpoint.to_uri SklearnCheckpoint.to_uri(uri: str) -> str Write checkpoint data to location URI (e.g. cloud storage). Parameters uri – Target location URI to write data to. Returns Cloud location containing checkpoint data. Return type str Attributes path Return path to checkpoint, if available. uri Return checkpoint URI, if available. ray.train.sklearn.SklearnCheckpoint.path property SklearnCheckpoint.path: Optional[str] Return path to checkpoint, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local path if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.path == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.path == None Returns Checkpoint path if this checkpoint is reachable from the current node (e.g. cloud storage or locally available directory).ray.train.sklearn.SklearnCheckpoint.uri property SklearnCheckpoint.uri: Optional[str] Return checkpoint URI, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local file:// URI if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Users can then choose to persist to cloud with Checkpoint.to_uri(). Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.uri == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.uri == None Returns Checkpoint URI if this URI is reachable from the current node (e.g. cloud storage or locally available file URI). Mosaic MosaicTrainer(*args, **kwargs) A Trainer for data parallel Mosaic Composers on PyTorch training. ray.train.mosaic.MosaicTrainer class ray.train.mosaic.MosaicTrainer(*args, **kwargs)[source] Bases: ray.train.torch.torch_trainer.TorchTrainer A Trainer for data parallel Mosaic Composers on PyTorch training. This Trainer runs the composer.trainer.Trainer.fit() method on multiple Ray Actors. The training is carried out in a distributed fashion through PyTorch DDP. 
These actors already have the necessary torch process group already configured for distributed PyTorch training. The training function ran on every Actor will first run the specified trainer_init_per_worker function to obtain an instantiated composer.Trainer object. The trainer_init_per_worker function will have access to preprocessed train and evaluation datasets. Example >>> import torch.utils.data >>> import torchvision >>> from torchvision import transforms, datasets >>> >>> from composer.models.tasks import ComposerClassifier >>> import composer.optim >>> from composer.algorithms import LabelSmoothing >>> >>> import ray >>> from ray.air.config import ScalingConfig >>> import ray.train as train >>> from ray.air import session >>> from ray.train.mosaic import MosaicTrainer >>> >>> def trainer_init_per_worker(config): ... # prepare the model for distributed training and wrap with ... # ComposerClassifier for Composer Trainer compatibility ... model = torchvision.models.resnet18(num_classes=10) ... model = ComposerClassifier(ray.train.torch.prepare_model(model)) ... ... # prepare train/test dataset ... mean = (0.507, 0.487, 0.441) ... std = (0.267, 0.256, 0.276) ... cifar10_transforms = transforms.Compose( ... [transforms.ToTensor(), transforms.Normalize(mean, std)] ... ) ... data_directory = "~/data" ... train_dataset = datasets.CIFAR10( ... data_directory, ... train=True, ... download=True, ... transform=cifar10_transforms ... ) ... ... # prepare train dataloader ... batch_size_per_worker = BATCH_SIZE // session.get_world_size() ... train_dataloader = torch.utils.data.DataLoader( ... train_dataset, ... batch_size=batch_size_per_worker ... ) ... train_dataloader = ray.train.torch.prepare_data_loader(train_dataloader) ... ... # prepare optimizer ... optimizer = composer.optim.DecoupledSGDW( ... model.parameters(), ... lr=0.05, ... momentum=0.9, ... weight_decay=2.0e-3, ... ) ... ... return composer.trainer.Trainer( ... model=model, ... train_dataloader=train_dataloader, ... optimizers=optimizer, ... **config ... ) ... >>> scaling_config = ScalingConfig(num_workers=2, use_gpu=True) >>> trainer_init_config = { ... "max_duration": "1ba", ... "algorithms": [LabelSmoothing()], ... } ... >>> trainer = MosaicTrainer( ... trainer_init_per_worker=trainer_init_per_worker, ... trainer_init_config=trainer_init_config, ... scaling_config=scaling_config, ... ) ... >>> trainer.fit() Parameters trainer_init_per_worker – The function that returns an instantiated composer.Trainer object and takes in configuration dictionary (config) as an argument. This dictionary is based on trainer_init_config and is modified for Ray - Composer integration. datasets – Any Datasets to use for training. At the moment, we do not support passing datasets to the trainer and using the dataset shards in the trainer loop. Instead, configure and load the datasets inside trainer_init_per_worker function trainer_init_config – Configurations to pass into trainer_init_per_worker as kwargs. Although the kwargs can be hard-coded in the trainer_init_per_worker, using the config allows the flexibility of reusing the same worker init function while changing the trainer arguments. For example, when hyperparameter tuning you can reuse the same trainer_init_per_worker function with different hyperparameter values rather than having multiple trainer_init_per_worker functions with different hard-coded hyperparameter values. torch_config – Configuration for setting up the PyTorch backend. If set to None, use the default configuration. 
This replaces the backend_config arg of DataParallelTrainer. Same as in TorchTrainer. scaling_config – Configuration for how to scale data parallel training. dataset_config – Configuration for dataset ingest. run_config – Configuration for the execution of the training run. preprocessor – A ray.data.Preprocessor to preprocess the provided datasets. resume_from_checkpoint – A MosaicCheckpoint to resume training from. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods as_trainable() Convert self to a tune.Trainable class. can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. get_dataset_config() Return a copy of this Trainer's final dataset configs. setup() Called during fit() to perform initial setup on the Trainer. ray.train.mosaic.MosaicTrainer.as_trainable MosaicTrainer.as_trainable() -> Type[Trainable] Convert self to a tune.Trainable class.
ray.train.mosaic.MosaicTrainer.can_restore classmethod MosaicTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from. Return type bool
ray.train.mosaic.MosaicTrainer.fit MosaicTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures occur during the execution of self.as_trainable(), or during the Tune execution loop. PublicAPI (beta): This API is in beta and may change before becoming stable.
ray.train.mosaic.MosaicTrainer.get_dataset_config MosaicTrainer.get_dataset_config() -> ray.train._internal.data_config.DataConfig Return a copy of this Trainer’s final dataset configs. Returns The merged default + user-supplied dataset config.
ray.train.mosaic.MosaicTrainer.setup MosaicTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop. Reinforcement Learning (RLlib) RLTrainer(*args, **kwargs) Reinforcement learning trainer. RLCheckpoint([local_path, data_dict, uri]) A Checkpoint with RLlib-specific functionality. ray.train.rl.RLTrainer class ray.train.rl.RLTrainer(*args, **kwargs)[source] Bases: ray.train.base_trainer.BaseTrainer Reinforcement learning trainer. This trainer provides an interface to RLlib trainables. If datasets and preprocessors are used, they can be utilized for offline training, e.g. using behavior cloning. Otherwise, this trainer will use online training. Parameters algorithm – Algorithm to train on. Can be a string reference (e.g. "PPO") or an RLlib trainer class. scaling_config – Configuration for how to scale training. run_config – Configuration for the execution of the training run. datasets – Any Datasets to use for training. Use the key “train” to denote which dataset is the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. If specified, datasets will be used for offline training.
Will be configured as an RLlib input config item. preprocessor – A preprocessor to preprocess the provided datasets. resume_from_checkpoint – A checkpoint to resume training from. Example Online training: from ray.air.config import RunConfig, ScalingConfig from ray.train.rl import RLTrainer trainer = RLTrainer( run_config=RunConfig(stop={"training_iteration": 5}), scaling_config=ScalingConfig(num_workers=2, use_gpu=False), algorithm="PPO", config={ "env": "CartPole-v0", "framework": "tf", "evaluation_num_workers": 1, "evaluation_interval": 1, "evaluation_config": {"input": "sampler"}, }, ) result = trainer.fit() ... Example Offline training (assumes data is stored in /tmp/data-dir): import ray from ray.air.config import RunConfig, ScalingConfig from ray.train.rl import RLTrainer from ray.rllib.algorithms.bc.bc import BC dataset = ray.data.read_json( "/tmp/data-dir", parallelism=2, ray_remote_args={"num_cpus": 1} ) trainer = RLTrainer( run_config=RunConfig(stop={"training_iteration": 5}), scaling_config=ScalingConfig( num_workers=2, use_gpu=False, ), datasets={"train": dataset}, algorithm=BC, config={ "env": "CartPole-v0", "framework": "tf", "evaluation_num_workers": 1, "evaluation_interval": 1, "evaluation_config": {"input": "sampler"}, }, ) result = trainer.fit() PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods can_restore(path) Checks whether a given directory contains a restorable Train experiment. fit() Runs training. preprocess_datasets() Called during fit() to preprocess dataset attributes with preprocessor. restore(path[, datasets, preprocessor, ...]) Restores a Train experiment from a previously interrupted/failed run. setup() Called during fit() to perform initial setup on the Trainer. ray.train.rl.RLTrainer.can_restore classmethod RLTrainer.can_restore(path: Union[str, pathlib.Path]) -> bool Checks whether a given directory contains a restorable Train experiment. Parameters path – The path to the experiment directory of the Train experiment. This can be either a local directory (e.g., ~/ray_results/exp_name) or a remote URI (e.g., s3://bucket/exp_name). Returns Whether this path exists and contains the trainer state to resume from. Return type bool
ray.train.rl.RLTrainer.fit RLTrainer.fit() -> ray.air.result.Result Runs training. Returns A Result object containing the training result. Raises TrainingFailedError – If any failures occur during the execution of self.as_trainable(), or during the Tune execution loop. PublicAPI (beta): This API is in beta and may change before becoming stable.
ray.train.rl.RLTrainer.preprocess_datasets RLTrainer.preprocess_datasets() -> None Called during fit() to preprocess dataset attributes with preprocessor. This method is run on a remote process. This method is called prior to entering the training_loop. If the Trainer has both a datasets dict and a preprocessor, the datasets dict contains a training dataset (denoted by the “train” key), and the preprocessor has not yet been fit, then it will be fit on the train dataset. Then, all Trainer’s datasets will be transformed by the preprocessor.
The transformed datasets will be set back in the self.datasets attribute of the Trainer to be used when overriding training_loop.ray.train.rl.RLTrainer.restore classmethod RLTrainer.restore(path: str, datasets: Optional[Dict[str, Union[Dataset, Callable[[], Dataset]]]] = None, preprocessor: Optional[Preprocessor] = None, scaling_config: Optional[ray.air.config.ScalingConfig] = None, **kwargs) -> BaseTrainer Restores a Train experiment from a previously interrupted/failed run. Restore should be used for experiment-level fault tolerance in the event that the head node crashes (e.g., OOM or some other runtime error) or the entire cluster goes down (e.g., network error affecting all nodes). The following example can be paired with implementing job retry using Ray Jobs to produce a Train experiment that will attempt to resume on both experiment-level and trial-level failures: import os import ray from ray import air from ray.data.preprocessors import BatchMapper from ray.train.trainer import BaseTrainer experiment_name = "unique_experiment_name" local_dir = "~/ray_results" experiment_dir = os.path.join(local_dir, experiment_name) # Define some dummy inputs for demonstration purposes datasets = {"train": ray.data.from_items([{"a": i} for i in range(10)])} preprocessor = BatchMapper(lambda x: x, batch_format="numpy") class CustomTrainer(BaseTrainer): def training_loop(self): pass if CustomTrainer.can_restore(experiment_dir): trainer = CustomTrainer.restore( experiment_dir, datasets=datasets, ) else: trainer = CustomTrainer( datasets=datasets, preprocessor=preprocessor, run_config=air.RunConfig( name=experiment_name, local_dir=local_dir, # Tip: You can also enable retries on failure for # worker-level fault tolerance failure_config=air.FailureConfig(max_failures=3), ), ) result = trainer.fit() ... Parameters path – The path to the experiment directory of the training run to restore. This can be a local path or a remote URI if the experiment was uploaded to the cloud. datasets – Re-specified datasets used in the original training run. This must include all the datasets that were passed in the original trainer constructor. preprocessor – Optionally re-specified preprocessor that was passed in the original trainer constructor. This should be used to re-supply the preprocessor if it is not restorable in a new Ray cluster. This preprocessor will be fit at the start before resuming training. If no preprocessor is passed in restore, then the old preprocessor will be loaded from the latest checkpoint and will not be re-fit. scaling_config – Optionally re-specified scaling config. This can be modified to be different from the original spec. **kwargs – Other optionally re-specified arguments, passed in by subclasses. Raises ValueError – If all datasets were not re-supplied on restore. Returns A restored instance of the class that is calling this method. Return type BaseTrainerray.train.rl.RLTrainer.setup RLTrainer.setup() -> None Called during fit() to perform initial setup on the Trainer. This method is run on a remote process. This method will not be called on the driver, so any expensive setup operations should be placed here and not in __init__. This method is called prior to preprocess_datasets and training_loop.ray.train.rl.RLCheckpoint class ray.train.rl.RLCheckpoint(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None)[source] Bases: ray.air.checkpoint.Checkpoint A Checkpoint with RLlib-specific functionality. 
Create this from a generic Checkpoint by calling RLCheckpoint.from_checkpoint(ckpt). PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods __init__([local_path, data_dict, uri]) DeveloperAPI: This API may change across minor Ray releases. as_directory() Return checkpoint directory path in a context. from_bytes(data) Create a checkpoint from the given byte string. from_checkpoint(other) Create a checkpoint from a generic Checkpoint. from_dict(data) Create checkpoint object from dictionary. from_directory(path) Create checkpoint object from directory. from_uri(uri) Create checkpoint object from location URI (e.g. get_internal_representation() Return tuple of (type, data) for the internal representation. get_policy([env]) Retrieve the policy stored in this checkpoint. get_preprocessor() Return the saved preprocessor, if one exists. set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. to_bytes() Return Checkpoint serialized as bytes object. to_dict() Return checkpoint data as dictionary. to_directory([path]) Write checkpoint data to directory. to_uri(uri) Write checkpoint data to location URI (e.g. ray.train.rl.RLCheckpoint.__init__ RLCheckpoint.__init__(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None) DeveloperAPI: This API may change across minor Ray releases.ray.train.rl.RLCheckpoint.as_directory RLCheckpoint.as_directory() -> Iterator[str] Return checkpoint directory path in a context. This function makes checkpoint data available as a directory while avoiding unnecessary copies and left-over temporary data. If the checkpoint is already a directory checkpoint, it will return the existing path. If it is not, it will create a temporary directory, which will be deleted after the context is exited. Users should treat the returned checkpoint directory as read-only and avoid changing any data within it, as it might get deleted when exiting the context. Example: with checkpoint.as_directory() as checkpoint_dir: # Do some read-only processing of files within checkpoint_dir pass # At this point, if a temporary directory was created, it will have # been deleted.ray.train.rl.RLCheckpoint.from_bytes classmethod RLCheckpoint.from_bytes(data: bytes) -> ray.air.checkpoint.Checkpoint Create a checkpoint from the given byte string. Parameters data – Data object containing pickled checkpoint data. Returns checkpoint object. Return type Checkpointray.train.rl.RLCheckpoint.from_checkpoint classmethod RLCheckpoint.from_checkpoint(other: ray.air.checkpoint.Checkpoint) -> ray.air.checkpoint.Checkpoint Create a checkpoint from a generic Checkpoint. This method can be used to create a framework-specific checkpoint from a generic Checkpoint object. Examples >>> result = TorchTrainer.fit(...) >>> checkpoint = TorchCheckpoint.from_checkpoint(result.checkpoint) >>> model = checkpoint.get_model() Linear(in_features=1, out_features=1, bias=True) DeveloperAPI: This API may change across minor Ray releases.ray.train.rl.RLCheckpoint.from_dict classmethod RLCheckpoint.from_dict(data: dict) -> ray.air.checkpoint.Checkpoint Create checkpoint object from dictionary. Parameters data – Dictionary containing checkpoint data. Returns checkpoint object. Return type Checkpointray.train.rl.RLCheckpoint.from_directory classmethod RLCheckpoint.from_directory(path: Union[str, os.PathLike]) -> ray.air.checkpoint.Checkpoint Create checkpoint object from directory. 
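For example, a hedged sketch of the directory-based workflow combined with get_policy (documented below); the checkpoint directory and the CartPole-sized dummy observation are illustrative assumptions:

import numpy as np
from ray.train.rl import RLCheckpoint

# Load an RLlib checkpoint previously written by RLTrainer
# (the directory path is illustrative).
checkpoint = RLCheckpoint.from_directory("/tmp/rl_checkpoint")

# Retrieve the stored policy and compute an action for a dummy observation.
policy = checkpoint.get_policy()
obs = np.zeros(4, dtype=np.float32)  # assumes a CartPole-style observation space
action, _, _ = policy.compute_single_action(obs)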
Parameters path – Directory containing checkpoint data. The caller promises to not delete the directory (gifts ownership of the directory to this Checkpoint). Returns checkpoint object. Return type Checkpointray.train.rl.RLCheckpoint.from_uri classmethod RLCheckpoint.from_uri(uri: str) -> ray.air.checkpoint.Checkpoint Create checkpoint object from location URI (e.g. cloud storage). Valid locations currently include AWS S3 (s3://), Google cloud storage (gs://), HDFS (hdfs://), and local files (file://). Parameters uri – Source location URI to read data from. Returns checkpoint object. Return type Checkpointray.train.rl.RLCheckpoint.get_internal_representation RLCheckpoint.get_internal_representation() -> Tuple[str, Union[dict, str, ray.ObjectRef]] Return tuple of (type, data) for the internal representation. The internal representation can be used e.g. to compare checkpoint objects for equality or to access the underlying data storage. The returned type is a string and one of ["local_path", "data_dict", "uri"]. The data is the respective data value. Note that paths converted from file://... will be returned as local_path (without the file:// prefix) and not as uri. Returns Tuple of type and data. DeveloperAPI: This API may change across minor Ray releases.ray.train.rl.RLCheckpoint.get_policy RLCheckpoint.get_policy(env: Optional[Any] = None) -> ray.rllib.policy.policy.Policy[source] Retrieve the policy stored in this checkpoint. Parameters env – Optional environment to instantiate the trainer with. If not given, it is parsed from the saved trainer configuration. Returns The policy stored in this checkpoint.ray.train.rl.RLCheckpoint.get_preprocessor RLCheckpoint.get_preprocessor() -> Optional[Preprocessor] Return the saved preprocessor, if one exists.ray.train.rl.RLCheckpoint.set_preprocessor RLCheckpoint.set_preprocessor(preprocessor: Optional[Preprocessor]) Saves the provided preprocessor to this Checkpoint.ray.train.rl.RLCheckpoint.to_bytes RLCheckpoint.to_bytes() -> bytes Return Checkpoint serialized as bytes object. Returns Bytes object containing checkpoint data. Return type bytesray.train.rl.RLCheckpoint.to_dict RLCheckpoint.to_dict() -> dict Return checkpoint data as dictionary. Returns Dictionary containing checkpoint data. Return type dictray.train.rl.RLCheckpoint.to_directory RLCheckpoint.to_directory(path: Optional[str] = None) -> str Write checkpoint data to directory. Parameters path – Target directory to restore data in. If not specified, will create a temporary directory. Returns Directory containing checkpoint data. Return type strray.train.rl.RLCheckpoint.to_uri RLCheckpoint.to_uri(uri: str) -> str Write checkpoint data to location URI (e.g. cloud storage). Parameters uri – Target location URI to write data to. Returns Cloud location containing checkpoint data. Return type str Attributes path Return path to checkpoint, if available. uri Return checkpoint URI, if available. ray.train.rl.RLCheckpoint.path property RLCheckpoint.path: Optional[str] Return path to checkpoint, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local path if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. 
Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.path == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.path == None Returns Checkpoint path if this checkpoint is reachable from the current node (e.g. cloud storage or locally available directory).
ray.train.rl.RLCheckpoint.uri property RLCheckpoint.uri: Optional[str] Return checkpoint URI, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local file:// URI if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Users can then choose to persist to cloud with Checkpoint.to_uri(). Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.uri == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.uri == None Returns Checkpoint URI if this URI is reachable from the current node (e.g. cloud storage or locally available file URI). Ray Train Experiment Restoration train.trainer.BaseTrainer.restore(path[, ...]) Restores a Train experiment from a previously interrupted/failed run. All trainer classes have a restore method that takes in a path pointing to the directory of the experiment to be restored. restore also exposes a subset of constructor arguments that can be re-specified. See Restoration API for Built-in Trainers below for details on restore arguments for different AIR trainer integrations. Restoration API for Built-in Trainers train.data_parallel_trainer.DataParallelTrainer.restore(path) Restores a DataParallelTrainer from a previously interrupted/failed run. train.huggingface.TransformersTrainer.restore(path) Restores a TransformersTrainer from a previously interrupted/failed run. TorchTrainer.restore, TensorflowTrainer.restore, and HorovodTrainer.restore can take in the same parameters as their parent class’s DataParallelTrainer.restore (see the sketch below for a typical usage pattern). Unless otherwise specified, other trainers will accept the same parameters as BaseTrainer.restore. See Restore a Ray Train Experiment for more details on when and how trainer restore should be used. Tune Execution (tune.Tuner) Tuner Tuner([trainable, param_space, tune_config, ...]) Tuner is the recommended way of launching hyperparameter tuning jobs with Ray Tune. ray.tune.Tuner class ray.tune.Tuner(trainable: Optional[Union[str, Callable, Type[ray.tune.trainable.trainable.Trainable], BaseTrainer]] = None, *, param_space: Optional[Dict[str, Any]] = None, tune_config: Optional[ray.tune.tune_config.TuneConfig] = None, run_config: Optional[ray.air.config.RunConfig] = None, _tuner_kwargs: Optional[Dict] = None, _tuner_internal: Optional[ray.tune.impl.tuner_internal.TunerInternal] = None, _entrypoint: ray.air._internal.usage.AirEntrypoint = AirEntrypoint.TUNER)[source] Bases: object Tuner is the recommended way of launching hyperparameter tuning jobs with Ray Tune. Parameters trainable – The trainable to be tuned. param_space – Search space of the tuning job. One thing to note is that both preprocessor and dataset can be tuned here. tune_config – Tuning algorithm specific configs. Refer to ray.tune.tune_config.TuneConfig for more info. run_config – Runtime configuration that is specific to individual trials. If passed, this will overwrite the run config passed to the Trainer, if applicable.
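To make the trainer-level restoration API described above concrete, here is a hedged sketch using TorchTrainer.can_restore and TorchTrainer.restore; the experiment path is illustrative and the constructor arguments of the fresh trainer are elided:

from ray.train.torch import TorchTrainer

experiment_path = "~/ray_results/my_torch_experiment"  # illustrative path

if TorchTrainer.can_restore(experiment_path):
    # Resume the interrupted run. Datasets and other non-restorable objects
    # may need to be re-specified here, as with BaseTrainer.restore.
    trainer = TorchTrainer.restore(experiment_path)
else:
    # Construct a fresh trainer as usual (arguments elided).
    trainer = TorchTrainer(...)

result = trainer.fit()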
Refer to ray.air.config.RunConfig for more info. Usage pattern: from sklearn.datasets import load_breast_cancer from ray import tune from ray.data import from_pandas from ray.air.config import RunConfig, ScalingConfig from ray.train.xgboost import XGBoostTrainer from ray.tune.tuner import Tuner def get_dataset(): data_raw = load_breast_cancer(as_frame=True) dataset_df = data_raw["data"] dataset_df["target"] = data_raw["target"] dataset = from_pandas(dataset_df) return dataset trainer = XGBoostTrainer( label_column="target", params={}, datasets={"train": get_dataset()}, ) param_space = { "scaling_config": ScalingConfig( num_workers=tune.grid_search([2, 4]), resources_per_worker={ "CPU": tune.grid_search([1, 2]), }, ), # You can even grid search various datasets in Tune. # "datasets": { # "train": tune.grid_search( # [ds1, ds2] # ), # }, "params": { "objective": "binary:logistic", "tree_method": "approx", "eval_metric": ["logloss", "error"], "eta": tune.loguniform(1e-4, 1e-1), "subsample": tune.uniform(0.5, 1.0), "max_depth": tune.randint(1, 9), }, } tuner = Tuner(trainable=trainer, param_space=param_space, run_config=RunConfig(name="my_tune_run")) results = tuner.fit() To retry a failed tune run, you can then do tuner = Tuner.restore(results.experiment_path, trainable=trainer) tuner.fit() results.experiment_path can be retrieved from the ResultGrid object. It can also be easily seen in the log output from your first run. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods __init__([trainable, param_space, ...]) Configure and construct a tune run. can_restore(path) Checks whether a given directory contains a restorable Tune experiment. fit() Executes hyperparameter tuning job as configured and returns result. get_results() Get results of a hyperparameter tuning run. restore(path, trainable[, ...]) Restores Tuner after a previously failed run. ray.tune.Tuner.__init__ Tuner.__init__(trainable: Optional[Union[str, Callable, Type[ray.tune.trainable.trainable.Trainable], BaseTrainer]] = None, *, param_space: Optional[Dict[str, Any]] = None, tune_config: Optional[ray.tune.tune_config.TuneConfig] = None, run_config: Optional[ray.air.config.RunConfig] = None, _tuner_kwargs: Optional[Dict] = None, _tuner_internal: Optional[ray.tune.impl.tuner_internal.TunerInternal] = None, _entrypoint: ray.air._internal.usage.AirEntrypoint = AirEntrypoint.TUNER)[source] Configure and construct a tune run.ray.tune.Tuner.can_restore classmethod Tuner.can_restore(path: Union[str, pathlib.Path]) -> bool[source] Checks whether a given directory contains a restorable Tune experiment. Usage Pattern: Use this utility to switch between starting a new Tune experiment and restoring when possible. This is useful for experiment fault-tolerance when re-running a failed tuning script. import os from ray.tune import Tuner from ray.air import RunConfig def train_fn(config): # Make sure to implement checkpointing so that progress gets # saved on restore. pass name = "exp_name" local_dir = "~/ray_results" exp_dir = os.path.join(local_dir, name) if Tuner.can_restore(exp_dir): tuner = Tuner.restore(exp_dir, trainable=train_fn, resume_errored=True) else: tuner = Tuner( train_fn, run_config=RunConfig(name=name, local_dir=local_dir), ) tuner.fit() Parameters path – The path to the experiment directory of the Tune experiment. This can be either a local directory (e.g. ~/ray_results/exp_name) or a remote URI (e.g. s3://bucket/exp_name). 
Returns True if this path exists and contains the Tuner state to resume from. Return type bool
ray.tune.Tuner.fit Tuner.fit() -> ray.tune.result_grid.ResultGrid[source] Executes hyperparameter tuning job as configured and returns result. Failure handling: Exceptions that happen during the execution of a trial can be inspected, together with their stack traces, through the returned result grid. See ResultGrid for reference. Each trial may fail up to a certain number of times; this is configured by RunConfig.FailureConfig.max_failures. Exceptions that happen outside of trials are raised by this method as well. In such cases, an instruction like the following is printed at the end of the console output to explain how to resume. Please use Tuner.restore to resume. tuner = Tuner.restore("~/ray_results/tuner_resume", trainable=trainable) tuner.fit() Raises RayTaskError – If the user-provided trainable raises an exception. TuneError – General Ray Tune error.
ray.tune.Tuner.get_results Tuner.get_results() -> ray.tune.result_grid.ResultGrid[source] Get results of a hyperparameter tuning run. This method returns the same results as fit() and can be used to retrieve the results after restoring a tuner without calling fit() again. If the tuner has not been fit before, an error will be raised. from ray.tune import Tuner # `trainable` is what was passed in to the original `Tuner` tuner = Tuner.restore("/path/to/experiment", trainable=trainable) results = tuner.get_results() Returns Result grid of a previously fitted tuning run.
ray.tune.Tuner.restore classmethod Tuner.restore(path: str, trainable: Union[str, Callable, Type[ray.tune.trainable.trainable.Trainable], BaseTrainer], resume_unfinished: bool = True, resume_errored: bool = False, restart_errored: bool = False, param_space: Optional[Dict[str, Any]] = None) -> Tuner[source] Restores Tuner after a previously failed run. All trials from the existing run will be added to the result table. The argument flags control how existing but unfinished or errored trials are resumed. Finished trials are always added to the overview table. They will not be resumed. Unfinished trials can be controlled with the resume_unfinished flag. If True (default), they will be continued. If False, they will be added as terminated trials (even if they were only created and never trained). Errored trials can be controlled with the resume_errored and restart_errored flags. The former will resume errored trials from their latest checkpoints. The latter will restart errored trials from scratch and prevent loading their last checkpoints. Parameters path – The path where the previous failed run is checkpointed. This information can easily be located near the end of the console output of the previous run. Note: depending on whether ray client mode is used or not, this path may or may not exist on your local machine. trainable – The trainable to use upon resuming the experiment. This should be the same trainable that was used to initialize the original Tuner. param_space – The same param_space that was passed to the original Tuner. This can be optionally re-specified due to the param_space potentially containing Ray object references (tuning over Datasets or tuning over several ray.put object references). Tune expects the `param_space` to be unmodified, and the only parts that are used during restore are the updated object references. Changing the hyperparameter search space then resuming is NOT supported by this API.
resume_unfinished – If True, will continue to run unfinished trials. resume_errored – If True, will re-schedule errored trials and try to restore from their latest checkpoints. restart_errored – If True, will re-schedule errored trials but force restarting them from scratch (no checkpoint will be loaded). Tuner.fit() Executes hyperparameter tuning job as configured and returns result. Tuner.get_results() Get results of a hyperparameter tuning run. Tuner Configuration TuneConfig([mode, metric, search_alg, ...]) Tune specific configs. ray.tune.TuneConfig class ray.tune.TuneConfig(mode: Optional[str] = None, metric: Optional[str] = None, search_alg: Optional[Union[ray.tune.search.searcher.Searcher, ray.tune.search.search_algorithm.SearchAlgorithm]] = None, scheduler: Optional[ray.tune.schedulers.trial_scheduler.TrialScheduler] = None, num_samples: int = 1, max_concurrent_trials: Optional[int] = None, time_budget_s: Optional[Union[int, float, datetime.timedelta]] = None, reuse_actors: Optional[bool] = None, trial_name_creator: Optional[Callable[[ray.tune.experiment.trial.Trial], str]] = None, trial_dirname_creator: Optional[Callable[[ray.tune.experiment.trial.Trial], str]] = None, chdir_to_trial_dir: bool = True)[source] Bases: object Tune specific configs. Parameters metric – Metric to optimize. This metric should be reported with tune.report(). If set, will be passed to the search algorithm and scheduler. mode – Must be one of [min, max]. Determines whether objective is minimizing or maximizing the metric attribute. If set, will be passed to the search algorithm and scheduler. search_alg – Search algorithm for optimization. Default to random search. scheduler – Scheduler for executing the experiment. Choose among FIFO (default), MedianStopping, AsyncHyperBand, HyperBand and PopulationBasedTraining. Refer to ray.tune.schedulers for more options. num_samples – Number of times to sample from the hyperparameter space. Defaults to 1. If grid_search is provided as an argument, the grid will be repeated num_samples of times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met. max_concurrent_trials – Maximum number of trials to run concurrently. Must be non-negative. If None or 0, no limit will be applied. This is achieved by wrapping the search_alg in a ConcurrencyLimiter, and thus setting this argument will raise an exception if the search_alg is already a ConcurrencyLimiter. Defaults to None. time_budget_s – Global time budget in seconds after which all trials are stopped. Can also be a datetime.timedelta object. reuse_actors – Whether to reuse actors between different trials when possible. This can drastically speed up experiments that start and stop actors often (e.g., PBT in time-multiplexing mode). This requires trials to have the same resource requirements. Defaults to True for function trainables (including most Ray AIR trainers) and False for class and registered trainables (e.g. RLlib). trial_name_creator – Optional function that takes in a Trial and returns its name (i.e. its string representation). Be sure to include some unique identifier (such as Trial.trial_id) in each trial’s name. NOTE: This API is in alpha and subject to change. trial_dirname_creator – Optional function that takes in a trial and generates its trial directory name as a string. Be sure to include some unique identifier (such as Trial.trial_id) is used in each trial’s directory name. Otherwise, trials could overwrite artifacts and checkpoints of other trials. 
The return value cannot be a path. NOTE: This API is in alpha and subject to change. chdir_to_trial_dir – Whether to change the working directory of each worker to its corresponding trial directory. Defaults to True to prevent contention between workers saving trial-level outputs. If set to False, files are accessible with paths relative to the original working directory. However, all workers on the same node now share the same working directory, so be sure to use session.get_trial_dir() as the path to save any outputs. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods Attributes chdir_to_trial_dir max_concurrent_trials metric mode num_samples reuse_actors scheduler search_alg time_budget_s trial_dirname_creator trial_name_creator ray.tune.TuneConfig.chdir_to_trial_dir TuneConfig.chdir_to_trial_dir: bool = True ray.tune.TuneConfig.max_concurrent_trials TuneConfig.max_concurrent_trials: Optional[int] = None ray.tune.TuneConfig.metric TuneConfig.metric: Optional[str] = None ray.tune.TuneConfig.mode TuneConfig.mode: Optional[str] = None ray.tune.TuneConfig.num_samples TuneConfig.num_samples: int = 1 ray.tune.TuneConfig.reuse_actors TuneConfig.reuse_actors: Optional[bool] = None ray.tune.TuneConfig.scheduler TuneConfig.scheduler: Optional[ray.tune.schedulers.trial_scheduler.TrialScheduler] = None ray.tune.TuneConfig.search_alg TuneConfig.search_alg: Optional[Union[ray.tune.search.searcher.Searcher, ray.tune.search.search_algorithm.SearchAlgorithm]] = None ray.tune.TuneConfig.time_budget_s TuneConfig.time_budget_s: Optional[Union[int, float, datetime.timedelta]] = None ray.tune.TuneConfig.trial_dirname_creator TuneConfig.trial_dirname_creator: Optional[Callable[[ray.tune.experiment.trial.Trial], str]] = None ray.tune.TuneConfig.trial_name_creator TuneConfig.trial_name_creator: Optional[Callable[[ray.tune.experiment.trial.Trial], str]] = None The Tuner constructor also takes in a air.RunConfig. Restoring a Tuner Tuner.restore(path, trainable[, ...]) Restores Tuner after a previously failed run. Tuner.can_restore(path) Checks whether a given directory contains a restorable Tune experiment. tune.run_experiments run_experiments(experiments[, scheduler, ...]) Runs and blocks until all trials finish. Experiment(name, run, *[, stop, ...]) Tracks experiment specifications. ray.tune.run_experiments ray.tune.run_experiments(experiments: Union[ray.tune.experiment.experiment.Experiment, Mapping, Sequence[Union[ray.tune.experiment.experiment.Experiment, Mapping]]], scheduler: Optional[ray.tune.schedulers.trial_scheduler.TrialScheduler] = None, server_port: Optional[int] = None, verbose: Optional[Union[int, ray.tune.experimental.output.AirVerbosity, ray.tune.utils.log.Verbosity]] = None, progress_reporter: Optional[ray.tune.progress_reporter.ProgressReporter] = None, resume: Union[bool, str] = False, reuse_actors: Optional[bool] = None, trial_executor: Optional[ray.tune.execution.ray_trial_executor.RayTrialExecutor] = None, raise_on_failed_trial: bool = True, concurrent: bool = True, callbacks: Optional[Sequence[ray.tune.callback.Callback]] = None, _remote: Optional[bool] = None)[source] Runs and blocks until all trials finish. 
Example >>> from ray.tune.experiment import Experiment >>> from ray.tune.tune import run_experiments >>> def my_func(config): return {"score": 0} >>> experiment_spec = Experiment("experiment", my_func) >>> run_experiments(experiments=experiment_spec) >>> experiment_spec = {"experiment": {"run": my_func}} >>> run_experiments(experiments=experiment_spec) Returns List of Trial objects, holding data for each executed trial. PublicAPI: This API is stable across Ray releases.ray.tune.Experiment class ray.tune.Experiment(name: str, run: Union[str, Callable, Type], *, stop: Optional[Union[Mapping, ray.tune.stopper.stopper.Stopper, Callable[[str, Mapping], bool]]] = None, time_budget_s: Optional[Union[int, float, datetime.timedelta]] = None, config: Optional[Dict[str, Any]] = None, resources_per_trial: Union[None, Mapping[str, Union[float, int, Mapping]], PlacementGroupFactory] = None, num_samples: int = 1, storage_path: Optional[str] = None, _experiment_checkpoint_dir: Optional[str] = None, sync_config: Optional[Union[ray.tune.syncer.SyncConfig, dict]] = None, checkpoint_config: Optional[Union[ray.air.config.CheckpointConfig, dict]] = None, trial_name_creator: Optional[Callable[[Trial], str]] = None, trial_dirname_creator: Optional[Callable[[Trial], str]] = None, log_to_file: bool = False, export_formats: Optional[Sequence] = None, max_failures: int = 0, restore: Optional[str] = None, local_dir: Optional[str] = None)[source] Bases: object Tracks experiment specifications. Implicitly registers the Trainable if needed. The args here take the same meaning as the arguments defined tune.py:run. experiment_spec = Experiment( "my_experiment_name", my_func, stop={"mean_accuracy": 100}, config={ "alpha": tune.grid_search([0.2, 0.4, 0.6]), "beta": tune.grid_search([1, 2]), }, resources_per_trial={ "cpu": 1, "gpu": 0 }, num_samples=10, local_dir="~/ray_results", checkpoint_freq=10, max_failures=2) Parameters TODO (xwjiang) – Add the whole list. _experiment_checkpoint_dir – Internal use only. If present, use this as the root directory for experiment checkpoint. If not present, the directory path will be deduced from trainable name instead. DeveloperAPI: This API may change across minor Ray releases. Methods from_json(name, spec) Generates an Experiment object from JSON. get_experiment_checkpoint_dir(run_obj[, ...]) Get experiment checkpoint dir without setting up an experiment. get_trainable_name(run_object) Get Trainable name. register_if_needed(run_object) Registers Trainable or Function at runtime. ray.tune.Experiment.from_json classmethod Experiment.from_json(name: str, spec: dict)[source] Generates an Experiment object from JSON. Parameters name – Name of Experiment. spec – JSON configuration of experiment.ray.tune.Experiment.get_experiment_checkpoint_dir classmethod Experiment.get_experiment_checkpoint_dir(run_obj: Union[str, Callable, Type], storage_path: Optional[str] = None, name: Optional[str] = None)[source] Get experiment checkpoint dir without setting up an experiment. This is only used internally for better support of Tuner API. Parameters run_obj – Trainable to run. storage_path – The path to Ray AIR’s result storage. name – The name of the experiment specified by user. Returns Checkpoint directory for experiment.ray.tune.Experiment.get_trainable_name classmethod Experiment.get_trainable_name(run_object: Union[str, Callable, Type])[source] Get Trainable name. Parameters run_object – Trainable to run. If string, assumes it is an ID and does not modify it. 
Otherwise, returns a string corresponding to the run_object name. Returns A string representing the trainable identifier. Raises TuneError – if run_object passed in is invalid.ray.tune.Experiment.register_if_needed classmethod Experiment.register_if_needed(run_object: Union[str, Callable, Type])[source] Registers Trainable or Function at runtime. Assumes already registered if run_object is a string. Also, does not inspect interface of given run_object. Parameters run_object – Trainable to run. If string, assumes it is an ID and does not modify it. Otherwise, returns a string corresponding to the run_object name. Returns A string representing the trainable identifier. Attributes PUBLIC_KEYS checkpoint_config checkpoint_dir warning. local_dir warning. local_path path public_spec Returns the spec dict with only the public-facing keys. remote_checkpoint_dir warning. remote_path run_identifier Returns a string representing the trainable identifier. stopper ray.tune.Experiment.PUBLIC_KEYS Experiment.PUBLIC_KEYS = {'num_samples', 'stop', 'time_budget_s'} ray.tune.Experiment.checkpoint_config property Experiment.checkpoint_config ray.tune.Experiment.checkpoint_dir property Experiment.checkpoint_dir DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.tune.Experiment.local_dir property Experiment.local_dir DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.tune.Experiment.local_path property Experiment.local_path: Optional[str] ray.tune.Experiment.path property Experiment.path: Optional[str] ray.tune.Experiment.public_spec property Experiment.public_spec: Dict[str, Any] Returns the spec dict with only the public-facing keys. Intended to be used for passing information to callbacks, Searchers and Schedulers.ray.tune.Experiment.remote_checkpoint_dir property Experiment.remote_checkpoint_dir: Optional[str] DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.tune.Experiment.remote_path property Experiment.remote_path: Optional[str] ray.tune.Experiment.run_identifier property Experiment.run_identifier Returns a string representing the trainable identifier.ray.tune.Experiment.stopper property Experiment.stopper Ray AIR Integrations with ML Libraries PyTorch There are 2 recommended ways to train PyTorch models on a Ray cluster. If you’re training PyTorch models with PyTorch Lightning, see below for the available PyTorch Lightning Ray AIR integrations. See the options 1️⃣ 2️⃣ below, along with the usage scenarios and API references for each: 1️⃣ Vanilla PyTorch with Ray Tune Usage Scenario: Non-distributed training, where the dataset is relatively small and there are many trials (e.g., many hyperparameter configurations). Use vanilla PyTorch with Ray Tune to parallelize model training. See an example here. 2️⃣ TorchTrainer Usage Scenario: Data-parallel training, such as multi-GPU or multi-node training. TorchTrainer(*args, **kwargs) A Trainer for data parallel PyTorch training. See here for an example. PyTorch Lightning There are 2 recommended ways to train with PyTorch Lightning on a Ray cluster. See the options 1️⃣ 2️⃣ below, along with the usage scenarios and API references for each: 1️⃣ Vanilla PyTorch Lightning with a Ray Callback Usage Scenario: Non-distributed training, where the dataset is relatively small and there are many trials (e.g., many hyperparameter configurations). Use vanilla PyTorch Lightning with Ray Tune to parallelize model training. 
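For example, a minimal sketch of this pattern; MyLightningModule and the val_loss metric name are placeholders, and the search space is illustrative. The callbacks used here are summarized next:

from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback
import pytorch_lightning as pl

def train_fn(config):
    # `MyLightningModule` is a placeholder for your own LightningModule,
    # which is assumed to log "val_loss" during validation.
    model = MyLightningModule(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=5,
        # Report the logged validation loss back to Ray Tune as "loss".
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model)

tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=8, metric="loss", mode="min"),
)
results = tuner.fit()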
TuneReportCallback([metrics, on]) PyTorch Lightning to Ray Tune reporting callback TuneReportCheckpointCallback([metrics, ...]) PyTorch Lightning report and checkpoint callback See an example here. 2️⃣ LightningTrainer Usage Scenario: Distributed training, such as multi-GPU or multi-node data-parallel training. LightningTrainer(*args, **kwargs) A Trainer for data parallel PyTorch Lightning training. See the full API reference for the Ray Train Lightning integration. See an example here. Tensorflow/Keras There are 2 recommended ways to train Tensorflow/Keras models with Ray. See the options 1️⃣ 2️⃣ below, along with the usage scenarios and API references for each: 1️⃣ Vanilla Keras with a Ray Callback Usage Scenario: Non-distributed training, where the dataset is relatively small and there are many trials (e.g., many hyperparameter configurations). Use vanilla Tensorflow/Keras with Ray Tune to parallelize model training. ReportCheckpointCallback([checkpoint_on, ...]) Keras callback for Ray AIR reporting and checkpointing. ray.air.integrations.keras.ReportCheckpointCallback class ray.air.integrations.keras.ReportCheckpointCallback(checkpoint_on: Union[str, List[str]] = 'epoch_end', report_metrics_on: Union[str, List[str]] = 'epoch_end', metrics: Optional[Union[str, List[str], Dict[str, str]]] = None)[source] Bases: ray.air.integrations.keras._Callback Keras callback for Ray AIR reporting and checkpointing. Metrics are always reported with checkpoints, even if the event isn’t specified in report_metrics_on. Example code-block: python ############# Using it in TrainSession ############### from ray.air.integrations.keras import ReportCheckpointCallback def train_loop_per_worker(): strategy = tf.distribute.MultiWorkerMirroredStrategy() with strategy.scope(): model = build_model() model.fit(dataset_shard, callbacks=[ReportCheckpointCallback()]) Parameters metrics – Metrics to report. If this is a list, each item describes the metric key reported to Keras, and it’s reported under the same name. If this is a dict, each key is the name reported and the respective value is the metric key reported to Keras. If this is None, all Keras logs are reported. report_metrics_on – When to report metrics. Must be one of the Keras event hooks (less the on_), e.g. “train_start” or “predict_end”. Defaults to “epoch_end”. checkpoint_on – When to save checkpoints. Must be one of the Keras event hooks (less the on_), e.g. “train_start” or “predict_end”. Defaults to “epoch_end”. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods on_batch_begin(batch[, logs]) A backwards compatibility alias for on_train_batch_begin. on_batch_end(batch[, logs]) A backwards compatibility alias for on_train_batch_end. ray.air.integrations.keras.ReportCheckpointCallback.on_batch_begin ReportCheckpointCallback.on_batch_begin(batch, logs=None) A backwards compatibility alias for on_train_batch_begin.ray.air.integrations.keras.ReportCheckpointCallback.on_batch_end ReportCheckpointCallback.on_batch_end(batch, logs=None) A backwards compatibility alias for on_train_batch_end. See an example here. 2️⃣ TensorflowTrainer Usage Scenario: Data-parallel training, such as multi-GPU or multi-node training. TensorflowTrainer(*args, **kwargs) A Trainer for data parallel Tensorflow training. ReportCheckpointCallback([checkpoint_on, ...]) Keras callback for Ray AIR reporting and checkpointing. See here for an example. XGBoost There are 3 recommended ways to train XGBoost models with Ray. 
See the options 1️⃣ 2️⃣ 3️⃣ below, along with the usage scenarios and API references for each: 1️⃣ Vanilla XGBoost with a Ray Callback Usage Scenario: Non-distributed training, where the dataset is relatively small and there are many trials (e.g., many hyperparameter configurations). Use vanilla XGBoost with these Ray Tune callbacks to parallelize model training. TuneReportCallback([metrics, ...]) XGBoost to Ray Tune reporting callback TuneReportCheckpointCallback([metrics, ...]) XGBoost report and checkpoint callback See an example here. 2️⃣ XGBoostTrainer Usage Scenario: Data-parallel training, such as multi-GPU or multi-node training. XGBoostTrainer(*args, **kwargs) A Trainer for data parallel XGBoost training. See an example here. 3️⃣ xgboost_ray Usage Scenario: Use as a (nearly) drop-in replacement for the regular xgboost API, with added support for distributed training on a Ray cluster. See the xgboost_ray documentation. LightGBM There are 3 recommended ways to train LightGBM models with Ray. See the options 1️⃣ 2️⃣ 3️⃣ below, along with the usage scenarios and API references for each: 1️⃣ Vanilla LightGBM with a Ray Callback Usage Scenario: Non-distributed training, where the dataset is relatively small and there are many trials (e.g., many hyperparameter configurations). Use vanilla LightGBM with these Ray Tune callbacks to parallelize model training. TuneReportCallback([metrics, ...]) Create a callback that reports metrics to Ray Tune. TuneReportCheckpointCallback([metrics, ...]) Creates a callback that reports metrics and checkpoints model. See an example here. 2️⃣ LightGBMTrainer Usage Scenario: Data-parallel training, such as multi-GPU or multi-node training. LightGBMTrainer(*args, **kwargs) A Trainer for data parallel LightGBM training. See an example here. 3️⃣ lightgbm_ray Usage Scenario: Use as a (nearly) drop-in replacement for the regular lightgbm API, with added support for distributed training on a Ray cluster. See the lightgbm_ray documentation. Experiment Tracking Integrations Comet (air.integrations.comet) CometLoggerCallback([online, tags, ...]) CometLoggerCallback for logging Tune results to Comet. ray.air.integrations.comet.CometLoggerCallback class ray.air.integrations.comet.CometLoggerCallback(online: bool = True, tags: Optional[List[str]] = None, save_checkpoints: bool = False, **experiment_kwargs)[source] Bases: ray.tune.logger.logger.LoggerCallback CometLoggerCallback for logging Tune results to Comet. Comet (https://comet.ml/site/) is a tool to manage and optimize the entire ML lifecycle, from experiment tracking, model optimization and dataset versioning to model production monitoring. This Ray Tune LoggerCallback sends metrics and parameters to Comet for tracking. In order to use the CometLoggerCallback you must first install Comet via pip install comet_ml Then set the following environment variables export COMET_API_KEY= Alternatively, you can also pass in your API Key as an argument to the CometLoggerCallback constructor. CometLoggerCallback(api_key=) Parameters online – Whether to make use of an Online or Offline Experiment. Defaults to True. tags – Tags to add to the logged Experiment. Defaults to None. save_checkpoints – If True, model checkpoints will be saved to Comet ML as artifacts. Defaults to False. **experiment_kwargs – Other keyword arguments will be passed to the constructor for comet_ml.Experiment (or OfflineExperiment if online=False). 
Please consult the Comet ML documentation for more information on the Experiment and OfflineExperiment classes: https://comet.ml/site/ Example: from ray.air.integrations.comet import CometLoggerCallback tune.run( train, config=config, callbacks=[CometLoggerCallback( True, ['tag1', 'tag2'], workspace='my_workspace', project_name='my_project_name' )] ) Methods get_state() Get the state of the callback. log_trial_restore(trial) Handle logging when a trial restores. log_trial_result(iteration, trial, result) Log the current result of a Trial upon each iteration. log_trial_start(trial) Initialize an Experiment (or OfflineExperiment if self.online=False) and start logging to Comet. on_checkpoint(iteration, trials, trial, ...) Called after a trial saved a checkpoint with Tune. on_experiment_end(trials, **info) Called after experiment is over and all trials have concluded. on_step_begin(iteration, trials, **info) Called at the start of each tuning loop step. on_step_end(iteration, trials, **info) Called at the end of each tuning loop step. set_state(state) Set the state of the callback. setup([stop, num_samples, total_num_samples]) Called once at the very beginning of training. ray.air.integrations.comet.CometLoggerCallback.get_state CometLoggerCallback.get_state() -> Optional[Dict] Get the state of the callback. This method should be implemented by subclasses to return a dictionary representation of the object’s current state. This is called automatically by Tune to periodically checkpoint callback state. Upon Tune experiment restoration, callback state will be restored via set_state(). from typing import Dict, List, Optional from ray.tune import Callback from ray.tune.experiment import Trial class MyCallback(Callback): def __init__(self): self._trial_ids = set() def on_trial_start( self, iteration: int, trials: List["Trial"], trial: "Trial", **info ): self._trial_ids.add(trial.trial_id) def get_state(self) -> Optional[Dict]: return {"trial_ids": self._trial_ids.copy()} def set_state(self, state: Dict) -> Optional[Dict]: self._trial_ids = state["trial_ids"] Returns State of the callback. Should be None if the callback does not have any state to save (this is the default). Return type dict
ray.air.integrations.comet.CometLoggerCallback.log_trial_restore CometLoggerCallback.log_trial_restore(trial: Trial) Handle logging when a trial restores. Parameters trial – Trial object.
ray.air.integrations.comet.CometLoggerCallback.log_trial_result CometLoggerCallback.log_trial_result(iteration: int, trial: ray.tune.experiment.trial.Trial, result: Dict)[source] Log the current result of a Trial upon each iteration.
ray.air.integrations.comet.CometLoggerCallback.log_trial_start CometLoggerCallback.log_trial_start(trial: ray.tune.experiment.trial.Trial)[source] Initialize an Experiment (or OfflineExperiment if self.online=False) and start logging to Comet. Parameters trial – Trial object.
ray.air.integrations.comet.CometLoggerCallback.on_checkpoint CometLoggerCallback.on_checkpoint(iteration: int, trials: List[Trial], trial: Trial, checkpoint: _TrackedCheckpoint, **info) Called after a trial saved a checkpoint with Tune. Parameters iteration – Number of iterations of the tuning loop. trials – List of trials. trial – Trial that just saved a checkpoint. checkpoint – Checkpoint object that has been saved by the trial.
**info – Kwargs dict for forward compatibility.ray.air.integrations.comet.CometLoggerCallback.on_experiment_end CometLoggerCallback.on_experiment_end(trials: List[Trial], **info) Called after experiment is over and all trials have concluded. Parameters trials – List of trials. **info – Kwargs dict for forward compatibility.ray.air.integrations.comet.CometLoggerCallback.on_step_begin CometLoggerCallback.on_step_begin(iteration: int, trials: List[Trial], **info) Called at the start of each tuning loop step. Parameters iteration – Number of iterations of the tuning loop. trials – List of trials. **info – Kwargs dict for forward compatibility.ray.air.integrations.comet.CometLoggerCallback.on_step_end CometLoggerCallback.on_step_end(iteration: int, trials: List[Trial], **info) Called at the end of each tuning loop step. The iteration counter is increased before this hook is called. Parameters iteration – Number of iterations of the tuning loop. trials – List of trials. **info – Kwargs dict for forward compatibility.ray.air.integrations.comet.CometLoggerCallback.set_state CometLoggerCallback.set_state(state: Dict) Set the state of the callback. This method should be implemented by subclasses to restore the callback’s state based on the given dict state. This is used automatically by Tune to restore checkpoint callback state on Tune experiment restoration. See get_state() for an example implementation. Parameters state – State of the callback.ray.air.integrations.comet.CometLoggerCallback.setup CometLoggerCallback.setup(stop: Optional[Stopper] = None, num_samples: Optional[int] = None, total_num_samples: Optional[int] = None, **info) Called once at the very beginning of training. Any Callback setup should be added here (setting environment variables, etc.) Parameters stop – Stopping criteria. If time_budget_s was passed to air.RunConfig, a TimeoutStopper will be passed here, either by itself or as a part of a CombinedStopper. num_samples – Number of times to sample from the hyperparameter space. Defaults to 1. If grid_search is provided as an argument, the grid will be repeated num_samples of times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met. total_num_samples – Total number of samples factoring in grid search samplers. **info – Kwargs dict for forward compatibility. See here for an example. MLflow (air.integrations.mlflow) MLflowLoggerCallback([tracking_uri, ...]) MLflow Logger to automatically log Tune results and config to MLflow. setup_mlflow([config, tracking_uri, ...]) Set up a MLflow session. ray.air.integrations.mlflow.MLflowLoggerCallback class ray.air.integrations.mlflow.MLflowLoggerCallback(tracking_uri: Optional[str] = None, *, registry_uri: Optional[str] = None, experiment_name: Optional[str] = None, tags: Optional[Dict] = None, tracking_token: Optional[str] = None, save_artifact: bool = False)[source] Bases: ray.tune.logger.logger.LoggerCallback MLflow Logger to automatically log Tune results and config to MLflow. MLflow (https://mlflow.org) Tracking is an open source library for recording and querying experiments. This Ray Tune LoggerCallback sends information (config parameters, training results & metrics, and artifacts) to MLflow for automatic experiment tracking. Parameters tracking_uri – The tracking URI for where to manage experiments and runs. This can either be a local file path or a remote server. This arg gets passed directly to mlflow initialization. 
When using Tune in a multi-node setting, make sure to set this to a remote server and not a local file path. registry_uri – The registry URI that gets passed directly to mlflow initialization. experiment_name – The experiment name to use for this Tune run. If the experiment with the name already exists with MLflow, it will be reused. If not, a new experiment will be created with that name. tags – An optional dictionary of string keys and values to set as tags on the run tracking_token – Tracking token used to authenticate with MLflow. save_artifact – If set to True, automatically save the entire contents of the Tune local_dir as an artifact to the corresponding run in MlFlow. Example: from ray.air.integrations.mlflow import MLflowLoggerCallback tags = { "user_name" : "John", "git_commit_hash" : "abc123"} tune.run( train_fn, config={ # define search space here "parameter_1": tune.choice([1, 2, 3]), "parameter_2": tune.choice([4, 5, 6]), }, callbacks=[MLflowLoggerCallback( experiment_name="experiment1", tags=tags, save_artifact=True)]) Methods get_state() Get the state of the callback. log_trial_restore(trial) Handle logging when a trial restores. log_trial_save(trial) Handle logging when a trial saves a checkpoint. on_checkpoint(iteration, trials, trial, ...) Called after a trial saved a checkpoint with Tune. on_experiment_end(trials, **info) Called after experiment is over and all trials have concluded. on_step_begin(iteration, trials, **info) Called at the start of each tuning loop step. on_step_end(iteration, trials, **info) Called at the end of each tuning loop step. set_state(state) Set the state of the callback. ray.air.integrations.mlflow.MLflowLoggerCallback.get_state MLflowLoggerCallback.get_state() -> Optional[Dict] Get the state of the callback. This method should be implemented by subclasses to return a dictionary representation of the object’s current state. This is called automatically by Tune to periodically checkpoint callback state. Upon Tune experiment restoration, callback state will be restored via set_state(). from typing import Dict, List, Optional from ray.tune import Callback from ray.tune.experiment import Trial class MyCallback(Callback): def __init__(self): self._trial_ids = set() def on_trial_start( self, iteration: int, trials: List["Trial"], trial: "Trial", **info ): self._trial_ids.add(trial.trial_id) def get_state(self) -> Optional[Dict]: return {"trial_ids": self._trial_ids.copy()} def set_state(self, state: Dict) -> Optional[Dict]: self._trial_ids = state["trial_ids"] Returns State of the callback. Should be None if the callback does not have any state to save (this is the default). Return type dictray.air.integrations.mlflow.MLflowLoggerCallback.log_trial_restore MLflowLoggerCallback.log_trial_restore(trial: Trial) Handle logging when a trial restores. Parameters trial – Trial object.ray.air.integrations.mlflow.MLflowLoggerCallback.log_trial_save MLflowLoggerCallback.log_trial_save(trial: Trial) Handle logging when a trial saves a checkpoint. Parameters trial – Trial object.ray.air.integrations.mlflow.MLflowLoggerCallback.on_checkpoint MLflowLoggerCallback.on_checkpoint(iteration: int, trials: List[Trial], trial: Trial, checkpoint: _TrackedCheckpoint, **info) Called after a trial saved a checkpoint with Tune. Parameters iteration – Number of iterations of the tuning loop. trials – List of trials. trial – Trial that just has errored. checkpoint – Checkpoint object that has been saved by the trial. 
**info – Kwargs dict for forward compatibility.ray.air.integrations.mlflow.MLflowLoggerCallback.on_experiment_end MLflowLoggerCallback.on_experiment_end(trials: List[Trial], **info) Called after experiment is over and all trials have concluded. Parameters trials – List of trials. **info – Kwargs dict for forward compatibility.ray.air.integrations.mlflow.MLflowLoggerCallback.on_step_begin MLflowLoggerCallback.on_step_begin(iteration: int, trials: List[Trial], **info) Called at the start of each tuning loop step. Parameters iteration – Number of iterations of the tuning loop. trials – List of trials. **info – Kwargs dict for forward compatibility.ray.air.integrations.mlflow.MLflowLoggerCallback.on_step_end MLflowLoggerCallback.on_step_end(iteration: int, trials: List[Trial], **info) Called at the end of each tuning loop step. The iteration counter is increased before this hook is called. Parameters iteration – Number of iterations of the tuning loop. trials – List of trials. **info – Kwargs dict for forward compatibility.ray.air.integrations.mlflow.MLflowLoggerCallback.set_state MLflowLoggerCallback.set_state(state: Dict) Set the state of the callback. This method should be implemented by subclasses to restore the callback’s state based on the given dict state. This is used automatically by Tune to restore checkpoint callback state on Tune experiment restoration. See get_state() for an example implementation. Parameters state – State of the callback.ray.air.integrations.mlflow.setup_mlflow ray.air.integrations.mlflow.setup_mlflow(config: Optional[Dict] = None, tracking_uri: Optional[str] = None, registry_uri: Optional[str] = None, experiment_id: Optional[str] = None, experiment_name: Optional[str] = None, tracking_token: Optional[str] = None, artifact_location: Optional[str] = None, run_name: Optional[str] = None, create_experiment_if_not_exists: bool = False, tags: Optional[Dict] = None, rank_zero_only: bool = True) -> Union[module, ray.air.integrations.mlflow._NoopModule][source] Set up a MLflow session. This function can be used to initialize an MLflow session in a (distributed) training or tuning run. By default, the MLflow experiment ID is the Ray trial ID and the MLlflow experiment name is the Ray trial name. These settings can be overwritten by passing the respective keyword arguments. The config dict is automatically logged as the run parameters (excluding the mlflow settings). In distributed training with Ray Train, only the zero-rank worker will initialize mlflow. All other workers will return a noop client, so that logging is not duplicated in a distributed run. This can be disabled by passing rank_zero_only=False, which will then initialize mlflow in every training worker. This function will return the mlflow module or a noop module for non-rank zero workers if rank_zero_only=True. By using mlflow = setup_mlflow(config) you can ensure that only the rank zero worker calls the mlflow API. Parameters config – Configuration dict to be logged to mlflow as parameters. tracking_uri – The tracking URI for MLflow tracking. If using Tune in a multi-node setting, make sure to use a remote server for tracking. registry_uri – The registry URI for the MLflow model registry. experiment_id – The id of an already created MLflow experiment. All logs from all trials in tune.Tuner() will be reported to this experiment. If this is not provided or the experiment with this id does not exist, you must provide an``experiment_name``. This parameter takes precedence over experiment_name. 
experiment_name – The name of an already existing MLflow experiment. All logs from all trials in tune.Tuner() will be reported to this experiment. If this is not provided, you must provide a valid experiment_id. tracking_token – A token to use for HTTP authentication when logging to a remote tracking server. This is useful when you want to log to a Databricks server, for example. This value will be used to set the MLFLOW_TRACKING_TOKEN environment variable on all the remote training processes. artifact_location – The location to store run artifacts. If not provided, MLFlow picks an appropriate default. Ignored if experiment already exists. run_name – Name of the new MLflow run that will be created. If not set, will default to the experiment_name. create_experiment_if_not_exists – Whether to create an experiment with the provided name if it does not already exist. Defaults to False. tags – Tags to set for the new run. rank_zero_only – If True, will return an initialized session only for the rank 0 worker in distributed training. If False, will initialize a session for all workers. Defaults to True. Example Per default, you can just call setup_mlflow and continue to use MLflow like you would normally do: from ray.air.integrations.mlflow import setup_mlflow def training_loop(config): mlflow = setup_mlflow(config) # ... mlflow.log_metric(key="loss", val=0.123, step=0) In distributed data parallel training, you can utilize the return value of setup_mlflow. This will make sure it is only invoked on the first worker in distributed training runs. from ray.air.integrations.mlflow import setup_mlflow def training_loop(config): mlflow = setup_mlflow(config) # ... mlflow.log_metric(key="loss", val=0.123, step=0) You can also use MlFlow’s autologging feature if using a training framework like Pytorch Lightning, XGBoost, etc. More information can be found here (https://mlflow.org/docs/latest/tracking.html#automatic-logging). from ray.air.integrations.mlflow import setup_mlflow def train_fn(config): mlflow = setup_mlflow(config) mlflow.autolog() xgboost_results = xgb.train(config, ...) PublicAPI (alpha): This API is in alpha and may change before becoming stable. See here for an example. Weights and Biases (air.integrations.wandb) WandbLoggerCallback([project, group, ...]) Weights and biases (https://www.wandb.ai/) is a tool for experiment tracking, model optimization, and dataset versioning. setup_wandb([config, api_key, api_key_file, ...]) Set up a Weights & Biases session. ray.air.integrations.wandb.WandbLoggerCallback class ray.air.integrations.wandb.WandbLoggerCallback(project: Optional[str] = None, group: Optional[str] = None, api_key_file: Optional[str] = None, api_key: Optional[str] = None, excludes: Optional[List[str]] = None, log_config: bool = False, upload_checkpoints: bool = False, save_checkpoints: bool = False, upload_timeout: int = 1800, **kwargs)[source] Bases: ray.tune.logger.logger.LoggerCallback Weights and biases (https://www.wandb.ai/) is a tool for experiment tracking, model optimization, and dataset versioning. This Ray Tune LoggerCallback sends metrics to Wandb for automatic tracking and visualization. 
Example import random from ray import tune from ray.air import session, RunConfig from ray.air.integrations.wandb import WandbLoggerCallback def train_func(config): offset = random.random() / 5 for epoch in range(2, config["epochs"]): acc = 1 - (2 + config["lr"]) ** -epoch - random.random() / epoch - offset loss = (2 + config["lr"]) ** -epoch + random.random() / epoch + offset session.report({"acc": acc, "loss": loss}) tuner = tune.Tuner( train_func, param_space={ "lr": tune.grid_search([0.001, 0.01, 0.1, 1.0]), "epochs": 10, }, run_config=RunConfig( callbacks=[WandbLoggerCallback(project="Optimization_Project")] ), ) results = tuner.fit() ... Parameters project – Name of the Wandb project. Mandatory. group – Name of the Wandb group. Defaults to the trainable name. api_key_file – Path to file containing the Wandb API KEY. This file only needs to be present on the node running the Tune script if using the WandbLogger. api_key – Wandb API Key. Alternative to setting api_key_file. excludes – List of metrics and config that should be excluded from the log. log_config – Boolean indicating if the config parameter of the results dict should be logged. This makes sense if parameters will change during training, e.g. with PopulationBasedTraining. Defaults to False. upload_checkpoints – If True, model checkpoints will be uploaded to Wandb as artifacts. Defaults to False. **kwargs – The keyword arguments will be pased to wandb.init(). Wandb’s group, run_id and run_name are automatically selected by Tune, but can be overwritten by filling out the respective configuration values. Please see here for all other valid configuration settings: https://docs.wandb.ai/library/init Methods get_state() Get the state of the callback. log_trial_restore(trial) Handle logging when a trial restores. on_checkpoint(iteration, trials, trial, ...) Called after a trial saved a checkpoint with Tune. on_experiment_end(trials, **info) Wait for the actors to finish their call to wandb.finish. on_step_begin(iteration, trials, **info) Called at the start of each tuning loop step. on_step_end(iteration, trials, **info) Called at the end of each tuning loop step. set_state(state) Set the state of the callback. ray.air.integrations.wandb.WandbLoggerCallback.get_state WandbLoggerCallback.get_state() -> Optional[Dict] Get the state of the callback. This method should be implemented by subclasses to return a dictionary representation of the object’s current state. This is called automatically by Tune to periodically checkpoint callback state. Upon Tune experiment restoration, callback state will be restored via set_state(). from typing import Dict, List, Optional from ray.tune import Callback from ray.tune.experiment import Trial class MyCallback(Callback): def __init__(self): self._trial_ids = set() def on_trial_start( self, iteration: int, trials: List["Trial"], trial: "Trial", **info ): self._trial_ids.add(trial.trial_id) def get_state(self) -> Optional[Dict]: return {"trial_ids": self._trial_ids.copy()} def set_state(self, state: Dict) -> Optional[Dict]: self._trial_ids = state["trial_ids"] Returns State of the callback. Should be None if the callback does not have any state to save (this is the default). Return type dictray.air.integrations.wandb.WandbLoggerCallback.log_trial_restore WandbLoggerCallback.log_trial_restore(trial: Trial) Handle logging when a trial restores. 
Parameters trial – Trial object.ray.air.integrations.wandb.WandbLoggerCallback.on_checkpoint WandbLoggerCallback.on_checkpoint(iteration: int, trials: List[Trial], trial: Trial, checkpoint: _TrackedCheckpoint, **info) Called after a trial saved a checkpoint with Tune. Parameters iteration – Number of iterations of the tuning loop. trials – List of trials. trial – Trial that just has errored. checkpoint – Checkpoint object that has been saved by the trial. **info – Kwargs dict for forward compatibility.ray.air.integrations.wandb.WandbLoggerCallback.on_experiment_end WandbLoggerCallback.on_experiment_end(trials: List[ray.tune.experiment.trial.Trial], **info)[source] Wait for the actors to finish their call to wandb.finish. This includes uploading all logs + artifacts to wandb.ray.air.integrations.wandb.WandbLoggerCallback.on_step_begin WandbLoggerCallback.on_step_begin(iteration: int, trials: List[Trial], **info) Called at the start of each tuning loop step. Parameters iteration – Number of iterations of the tuning loop. trials – List of trials. **info – Kwargs dict for forward compatibility.ray.air.integrations.wandb.WandbLoggerCallback.on_step_end WandbLoggerCallback.on_step_end(iteration: int, trials: List[Trial], **info) Called at the end of each tuning loop step. The iteration counter is increased before this hook is called. Parameters iteration – Number of iterations of the tuning loop. trials – List of trials. **info – Kwargs dict for forward compatibility.ray.air.integrations.wandb.WandbLoggerCallback.set_state WandbLoggerCallback.set_state(state: Dict) Set the state of the callback. This method should be implemented by subclasses to restore the callback’s state based on the given dict state. This is used automatically by Tune to restore checkpoint callback state on Tune experiment restoration. See get_state() for an example implementation. Parameters state – State of the callback. Attributes AUTO_CONFIG_KEYS Results that are saved with wandb.config instead of wandb.log. ray.air.integrations.wandb.WandbLoggerCallback.AUTO_CONFIG_KEYS WandbLoggerCallback.AUTO_CONFIG_KEYS = ['trial_id', 'experiment_tag', 'node_ip', 'experiment_id', 'hostname', 'pid', 'date'] Results that are saved with wandb.config instead of wandb.log.ray.air.integrations.wandb.setup_wandb ray.air.integrations.wandb.setup_wandb(config: Optional[Dict] = None, api_key: Optional[str] = None, api_key_file: Optional[str] = None, rank_zero_only: bool = True, **kwargs) -> None[source] Set up a Weights & Biases session. This function can be used to initialize a Weights & Biases session in a (distributed) training or tuning run. By default, the run ID is the trial ID, the run name is the trial name, and the run group is the experiment name. These settings can be overwritten by passing the respective arguments as kwargs, which will be passed to wandb.init(). In distributed training with Ray Train, only the zero-rank worker will initialize wandb. All other workers will return a disabled run object, so that logging is not duplicated in a distributed run. This can be disabled by passing rank_zero_only=False, which will then initialize wandb in every training worker. The config argument will be passed to Weights and Biases and will be logged as the run configuration. If no API key or key file are passed, wandb will try to authenticate using locally stored credentials, created for instance by running wandb login. 
Keyword arguments passed to setup_wandb() will be passed to wandb.init() and take precedence over any potential default settings.
Parameters
config – Configuration dict to be logged to Weights and Biases. Can contain arguments for wandb.init() as well as authentication information.
api_key – API key to use for authentication with Weights and Biases.
api_key_file – Path to a file containing the API key for authentication with Weights and Biases.
rank_zero_only – If True, will return an initialized session only for the rank 0 worker in distributed training. If False, will initialize a session for all workers.
kwargs – Passed to wandb.init().

Example:

from ray.air.integrations.wandb import setup_wandb

def training_loop(config):
    wandb = setup_wandb(config)
    # ...
    wandb.log({"loss": 0.123})

PublicAPI (alpha): This API is in alpha and may change before becoming stable.
See here for an example.

Ray AIR Session
See this Ray Train user guide and this Ray Tune user guide for usage examples of ray.air.session in the respective libraries.

Report Metrics and Save Checkpoints
session.report(metrics, *[, checkpoint]) Report metrics and optionally save a checkpoint.

ray.air.session.report
ray.air.session.report(metrics: Dict, *, checkpoint: Optional[ray.air.checkpoint.Checkpoint] = None) -> None[source]
Report metrics and optionally save a checkpoint.
Each invocation of this method will automatically increment the underlying iteration number. The physical meaning of this “iteration” is defined by the user (or, more specifically, by the way they call report). It does not necessarily map to one epoch.
This API is the canonical way to report metrics from Tune and Train, and replaces the legacy tune.report, with tune.checkpoint_dir (context manager), train.report, and train.save_checkpoint calls.
Note on directory checkpoints: AIR will take ownership of checkpoints passed to report() by moving them to a new path. The original directory will no longer be accessible to the caller after the report call.

Example:

from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

######## Using it in the *per worker* train loop (TrainSession) #######
def train_func():
    model = build_model()  # build_model: user-defined function returning a Keras model
    model.save("my_model", overwrite=True)
    session.report(
        metrics={"foo": "bar"},
        checkpoint=Checkpoint.from_directory("my_model")
    )
    # AIR guarantees that by this point, you can safely write new data
    # to the "my_model" directory.

scaling_config = ScalingConfig(num_workers=2)
trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config
)
result = trainer.fit()
# If you navigate to result.checkpoint's path, you will find the content
# of ``model.save()`` under it.
# If you have `SyncConfig` configured, the content should also
# show up in the corresponding cloud storage path.

Parameters
metrics – The metrics you want to report.
checkpoint – The optional checkpoint you want to report.
PublicAPI (beta): This API is in beta and may change before becoming stable.
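The example above shows the per-worker Train usage. As a complementary sketch, the same call can report metrics from a Tune function trainable; the objective function, its search space, and the "score" metric below are illustrative, not part of the API:

from ray import tune
from ray.air import session

def objective(config):
    # Each call to session.report() counts as one training iteration.
    for step in range(3):
        session.report({"score": config["x"] * step})

tuner = tune.Tuner(objective, param_space={"x": tune.grid_search([1, 2])})
results = tuner.fit()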
Retrieve Checkpoints and Datasets
session.get_checkpoint() Access the session's last checkpoint to resume from if applicable.
session.get_dataset_shard([dataset_name]) Returns the ray.data.DataIterator shard for this worker.

ray.air.session.get_checkpoint
ray.air.session.get_checkpoint() -> Optional[ray.air.checkpoint.Checkpoint][source]
Access the session’s last checkpoint to resume from if applicable.
Returns Checkpoint object if the session is currently being resumed. Otherwise, return None.

######## Using it in the *per worker* train loop (TrainSession) ######

from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_func():
    ckpt = session.get_checkpoint()
    if ckpt:
        with ckpt.as_directory() as loaded_checkpoint_dir:
            import tensorflow as tf

            model = tf.keras.models.load_model(loaded_checkpoint_dir)
    else:
        model = build_model()  # build_model: user-defined function returning a Keras model

    model.save("my_model", overwrite=True)
    session.report(
        metrics={"iter": 1},
        checkpoint=Checkpoint.from_directory("my_model")
    )

scaling_config = ScalingConfig(num_workers=2)
trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config
)
result = trainer.fit()

# trainer2 will pick up from the checkpoint saved by the first trainer.
trainer2 = TensorflowTrainer(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config,
    # this is ultimately what is accessed through
    # ``Session.get_checkpoint()``
    resume_from_checkpoint=result.checkpoint,
)
result2 = trainer2.fit()

PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_dataset_shard
ray.air.session.get_dataset_shard(dataset_name: Optional[str] = None) -> Optional[DataIterator][source]
Returns the ray.data.DataIterator shard for this worker.
Call iter_torch_batches() or to_tf() on this shard to convert it to the appropriate framework-specific data type.

import ray
from ray import train
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    model = Net()  # Net: user-defined torch.nn.Module (not shown)
    for iter in range(100):
        # Trainer will automatically handle sharding.
        data_shard = session.get_dataset_shard("train")
        for batch in data_shard.iter_torch_batches():
            ...  # per-batch training logic
    return model

train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])
trainer = TorchTrainer(train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_dataset})
trainer.fit()

Parameters dataset_name – If a dictionary of Datasets was passed to the Trainer, specifies which dataset shard to return.
Returns The DataIterator shard to use for this worker. If no dataset is passed into the Trainer, then return None.
PublicAPI (beta): This API is in beta and may change before becoming stable.

AIR Session Metadata
session.get_experiment_name() Experiment name for the corresponding trial.
session.get_trial_name() Trial name for the corresponding trial.
session.get_trial_id() Trial id for the corresponding trial.
session.get_trial_resources() Trial resources for the corresponding trial.
session.get_trial_dir() Log directory corresponding to the trial directory for a Tune session.
session.get_world_size() Get the current world size (i.e. total number of workers) for this run.
session.get_world_rank() Get the world rank of this worker.
session.get_local_world_size() Get the local world size of this worker (the number of workers on the same node).
session.get_local_rank() Get the local rank of this worker (rank of the worker on its node).
session.get_node_rank() Get the rank of the node that this worker runs on.

ray.air.session.get_experiment_name
ray.air.session.get_experiment_name() -> str[source]
Experiment name for the corresponding trial.
PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_trial_name
ray.air.session.get_trial_name() -> str[source]
Trial name for the corresponding trial.
PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_trial_id
ray.air.session.get_trial_id() -> str[source]
Trial id for the corresponding trial.
PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_trial_resources
ray.air.session.get_trial_resources() -> PlacementGroupFactory[source]
Trial resources for the corresponding trial.
PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_trial_dir
ray.air.session.get_trial_dir() -> str[source]
Log directory corresponding to the trial directory for a Tune session. If calling from a Train session, this will give the trial directory of its parent Tune session.

from ray import tune
from ray.air import session

def train_func():
    # Example:
    # >>> session.get_trial_dir()
    # ~/ray_results//
    ...

tuner = tune.Tuner(train_func)
tuner.fit()

PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_world_size
ray.air.session.get_world_size() -> int[source]
Get the current world size (i.e. total number of workers) for this run.

import time

import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_loop_per_worker(config):
    assert session.get_world_size() == 4

train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])
trainer = TensorflowTrainer(train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4),
    datasets={"train": train_dataset})
trainer.fit()

PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_world_rank
ray.air.session.get_world_rank() -> int[source]
Get the world rank of this worker.

import time

import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_loop_per_worker():
    for iter in range(100):
        time.sleep(1)
        if session.get_world_rank() == 0:
            print("Worker 0")

train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])
trainer = TensorflowTrainer(train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=1),
    datasets={"train": train_dataset})
trainer.fit()

PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_local_world_size
ray.air.session.get_local_world_size() -> int[source]
Get the local world size of this worker (the number of workers on the same node as this worker).

Example
>>> import ray
>>> from ray.air import session
>>> from ray.air.config import ScalingConfig
>>> from ray.train.torch import TorchTrainer
>>>
>>> def train_loop_per_worker():
...     return session.get_local_world_size()
>>>
>>> train_dataset = ray.data.from_items(
...     [{"x": x, "y": x + 1} for x in range(32)])
>>> trainer = TorchTrainer(train_loop_per_worker,
...     scaling_config=ScalingConfig(num_workers=1),
...     datasets={"train": train_dataset})
>>> trainer.fit()

PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_local_rank
ray.air.session.get_local_rank() -> int[source]
Get the local rank of this worker (rank of the worker on its node).

import time

import ray
import torch
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_loop_per_worker():
    if torch.cuda.is_available():
        torch.cuda.set_device(session.get_local_rank())
    ...

train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])
trainer = TensorflowTrainer(train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=1),
    datasets={"train": train_dataset})
trainer.fit()

PublicAPI (beta): This API is in beta and may change before becoming stable.

ray.air.session.get_node_rank
ray.air.session.get_node_rank() -> int[source]
Get the rank of the node that this worker is running on.
Example >>> import ray >>> from ray.air import session >>> from ray.air.config import ScalingConfig >>> from ray.train.torch import TorchTrainer >>> >>> def train_loop_per_worker(): ... return session.get_node_rank() >>> >>> train_dataset = ray.data.from_items( ... [{"x": x, "y": x + 1} for x in range(32)]) >>> trainer = TorchTrainer(train_loop_per_worker, ... scaling_config=ScalingConfig(num_workers=1), ... datasets={"train": train_dataset}) >>> trainer.fit() PublicAPI (beta): This API is in beta and may change before becoming stable. Tune Experiment Results (tune.ResultGrid) ResultGrid (tune.ResultGrid) ResultGrid(experiment_analysis) A set of Result objects for interacting with Ray Tune results. ray.tune.ResultGrid class ray.tune.ResultGrid(experiment_analysis: ray.tune.analysis.experiment_analysis.ExperimentAnalysis)[source] Bases: object A set of Result objects for interacting with Ray Tune results. You can use it to inspect the trials and obtain the best result. The constructor is a private API. This object can only be created as a result of Tuner.fit(). Example: .. testcode: import random from ray import air, tune def random_error_trainable(config): if random.random() < 0.5: return {"loss": 0.0} else: raise ValueError("This is an error") tuner = tune.Tuner( random_error_trainable, run_config=air.RunConfig(name="example-experiment"), tune_config=tune.TuneConfig(num_samples=10), ) try: result_grid = tuner.fit() except ValueError: pass for i in range(len(result_grid)): result = result_grid[i] if not result.error: print(f"Trial finishes successfully with metrics" f"{result.metrics}.") else: print(f"Trial failed with error {result.error}.") ... You can also use result_grid for more advanced analysis. >>> # Get the best result based on a particular metric. >>> best_result = result_grid.get_best_result( ... metric="loss", mode="min") >>> # Get the best checkpoint corresponding to the best result. >>> best_checkpoint = best_result.checkpoint >>> # Get a dataframe for the last reported results of all of the trials >>> df = result_grid.get_dataframe() >>> # Get a dataframe for the minimum loss seen for each trial >>> df = result_grid.get_dataframe(metric="loss", mode="min") Note that trials of all statuses are included in the final result grid. If a trial is not in terminated state, its latest result and checkpoint as seen by Tune will be provided. See Analyzing Tune Experiment Results for more usage examples. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods get_best_result([metric, mode, scope, ...]) Get the best result from all the trials run. get_dataframe([filter_metric, filter_mode]) Return dataframe of all trials with their configs and reported results. ray.tune.ResultGrid.get_best_result ResultGrid.get_best_result(metric: Optional[str] = None, mode: Optional[str] = None, scope: str = 'last', filter_nan_and_inf: bool = True) -> ray.air.result.Result[source] Get the best result from all the trials run. Parameters metric – Key for trial info to order on. Defaults to the metric specified in your Tuner’s TuneConfig. mode – One of [min, max]. Defaults to the mode specified in your Tuner’s TuneConfig. scope – One of [all, last, avg, last-5-avg, last-10-avg]. If scope=last, only look at each trial’s final step for metric, and compare across trials based on mode=[min,max]. If scope=avg, consider the simple average over all steps for metric and compare across trials based on mode=[min,max]. 
If scope=last-5-avg or scope=last-10-avg, consider the simple average over the last 5 or 10 steps for metric and compare across trials based on mode=[min,max]. If scope=all, find each trial’s min/max score for metric based on mode, and compare trials based on mode=[min,max]. filter_nan_and_inf – If True (default), NaN or infinite values are disregarded and these trials are never selected as the best trial.ray.tune.ResultGrid.get_dataframe ResultGrid.get_dataframe(filter_metric: Optional[str] = None, filter_mode: Optional[str] = None) -> pandas.core.frame.DataFrame[source] Return dataframe of all trials with their configs and reported results. Per default, this returns the last reported results for each trial. If filter_metric and filter_mode are set, the results from each trial are filtered for this metric and mode. For example, if filter_metric="some_metric" and filter_mode="max", for each trial, every received result is checked, and the one where some_metric is maximal is returned. Example from ray.air import session from ray.air.config import RunConfig from ray.tune import Tuner def training_loop_per_worker(config): session.report({"accuracy": 0.8}) result_grid = Tuner( trainable=training_loop_per_worker, run_config=RunConfig(name="my_tune_run") ).fit() # Get last reported results per trial df = result_grid.get_dataframe() # Get best ever reported accuracy per trial df = result_grid.get_dataframe( filter_metric="accuracy", filter_mode="max" ) ... Parameters filter_metric – Metric to filter best result for. filter_mode – If filter_metric is given, one of ["min", "max"] to specify if we should find the minimum or maximum result. Returns Pandas DataFrame with each trial as a row and their results as columns. Attributes errors Returns the exceptions of errored trials. experiment_path Path pointing to the experiment directory on persistent storage. num_errors Returns the number of errored trials. num_terminated Returns the number of terminated (but not errored) trials. ray.tune.ResultGrid.errors property ResultGrid.errors Returns the exceptions of errored trials.ray.tune.ResultGrid.experiment_path property ResultGrid.experiment_path: str Path pointing to the experiment directory on persistent storage. This can point to a remote storage location (e.g. S3) or to a local location (path on the head node). For instance, if your remote storage path is s3://bucket/location, this will point to s3://bucket/location/experiment_name.ray.tune.ResultGrid.num_errors property ResultGrid.num_errors Returns the number of errored trials.ray.tune.ResultGrid.num_terminated property ResultGrid.num_terminated Returns the number of terminated (but not errored) trials. get_best_result([metric, mode, scope, ...]) Get the best result from all the trials run. get_dataframe([filter_metric, filter_mode]) Return dataframe of all trials with their configs and reported results. Result (air.Result) Result(metrics, checkpoint, error[, ...]) The final result of a ML training run or a Tune trial. ray.air.Result class ray.air.Result(metrics: Optional[Dict[str, Any]], checkpoint: Optional[ray.air.checkpoint.Checkpoint], error: Optional[Exception], metrics_dataframe: Optional[pandas.core.frame.DataFrame] = None, best_checkpoints: Optional[List[Tuple[ray.air.checkpoint.Checkpoint, Dict[str, Any]]]] = None, _local_path: Optional[str] = None, _remote_path: Optional[str] = None, log_dir: Optional[pathlib.Path] = None)[source] Bases: object The final result of a ML training run or a Tune trial. 
This is the class produced by Trainer.fit(). It contains a checkpoint, which can be used for resuming training and for creating a Predictor object. It also contains a metrics object describing training metrics. error is included so that unsuccessful runs and trials can be represented as well. The constructor is a private API. metrics The final metrics as reported by a Trainable. Type Optional[Dict[str, Any]] checkpoint The final checkpoint of the Trainable. Type Optional[ray.air.checkpoint.Checkpoint] error The execution error of the Trainable run, if the trial finishes in error. Type Optional[Exception] metrics_dataframe The full result dataframe of the Trainable. The dataframe is indexed by iterations and contains reported metrics. Type Optional[pandas.core.frame.DataFrame] best_checkpoints A list of tuples of the best checkpoints saved by the Trainable and their associated metrics. The number of saved checkpoints is determined by the checkpoint_config argument of run_config (by default, all checkpoints will be saved). Type Optional[List[Tuple[ray.air.checkpoint.Checkpoint, Dict[str, Any]]]] PublicAPI (beta): This API is in beta and may change before becoming stable. property config: Optional[Dict[str, Any]] The config associated with the result. property path: str Path pointing to the result directory on persistent storage. This can point to a remote storage location (e.g. S3) or to a local location (path on the head node). For instance, if your remote storage path is s3://bucket/location, this will point to s3://bucket/location/experiment_name/trial_name. classmethod from_path(path: str) -> ray.air.result.Result[source] Restore a Result object from local trial directory. Parameters path – the path to a local trial directory. Returns A Result object of that trial. get_best_checkpoint(metric: str, mode: str) -> Optional[ray.air.checkpoint.Checkpoint][source] Get the best checkpoint from this trial based on a specific metric. Any checkpoints without an associated metric value will be filtered out. Parameters metric – The key for checkpoints to order on. mode – One of [“min”, “max”]. Returns Checkpoint object, or None if there is no valid checkpoint associated with the metric. PublicAPI (alpha): This API is in alpha and may change before becoming stable. ExperimentAnalysis (tune.ExperimentAnalysis) An ExperimentAnalysis is the output of the tune.run API. It’s now recommended to use Tuner.fit, which outputs a ResultGrid object. ExperimentAnalysis(experiment_checkpoint_path) Analyze results from a Tune experiment. ray.tune.ExperimentAnalysis class ray.tune.ExperimentAnalysis(experiment_checkpoint_path: str, trials: Optional[List[ray.tune.experiment.trial.Trial]] = None, default_metric: Optional[str] = None, default_mode: Optional[str] = None, remote_storage_path: Optional[str] = None, sync_config: Optional[ray.tune.syncer.SyncConfig] = None)[source] Bases: object Analyze results from a Tune experiment. To use this class, the experiment must be executed with the JsonLogger. Parameters experiment_checkpoint_path – Path to a json file or directory representing an experiment state, or a directory containing multiple experiment states (a run’s local_dir). Corresponds to Experiment.local_dir/Experiment.name/ experiment_state.json trials – List of trials that can be accessed via analysis.trials. default_metric – Default metric for comparing results. Can be overwritten with the metric parameter in the respective functions. default_mode – Default mode for comparing results. 
Has to be one of [min, max]. Can be overwritten with the mode parameter in the respective functions. Example >>> from ray import tune >>> tune.run( ... my_trainable, name="my_exp", local_dir="~/tune_results") >>> analysis = ExperimentAnalysis( ... experiment_checkpoint_path="~/tune_results/my_exp/state.json") PublicAPI (beta): This API is in beta and may change before becoming stable. Methods dataframe([metric, mode]) Returns a pandas.DataFrame object constructed from the trials. fetch_trial_dataframes() Fetches trial dataframes from files. get_all_configs([prefix]) Returns a list of all configurations. get_best_checkpoint(trial[, metric, mode, ...]) Gets best persistent checkpoint path of provided trial. get_best_config([metric, mode, scope]) Retrieve the best config corresponding to the trial. get_best_logdir([metric, mode, scope]) Retrieve the logdir corresponding to the best trial. get_best_trial([metric, mode, scope, ...]) Retrieve the best trial object. get_last_checkpoint([trial, metric, mode]) Gets the last persistent checkpoint path of the provided trial, i.e., with the highest "training_iteration". get_trial_checkpoints_paths(trial[, metric]) Gets paths and metrics of all persistent checkpoints of a trial. runner_data() Returns a dictionary of the TrialRunner data. set_filetype([file_type]) Overrides the existing file type. stats() Returns a dictionary of the statistics of the experiment. ray.tune.ExperimentAnalysis.dataframe ExperimentAnalysis.dataframe(metric: Optional[str] = None, mode: Optional[str] = None) -> pandas.core.frame.DataFrame[source] Returns a pandas.DataFrame object constructed from the trials. This function will look through all observed results of each trial and return the one corresponding to the passed metric and mode: If mode=min, it returns the result with the lowest ever observed metric for this trial (this is not necessarily the last)! For mode=max, it’s the highest, respectively. If metric=None or mode=None, the last result will be returned. Parameters metric – Key for trial info to order on. If None, uses last result. mode – One of [None, “min”, “max”]. Returns Constructed from a result dict of each trial. Return type pd.DataFrameray.tune.ExperimentAnalysis.fetch_trial_dataframes ExperimentAnalysis.fetch_trial_dataframes() -> Dict[str, pandas.core.frame.DataFrame][source] Fetches trial dataframes from files. Returns A dictionary containing “trial dir” to Dataframe.ray.tune.ExperimentAnalysis.get_all_configs ExperimentAnalysis.get_all_configs(prefix: bool = False) -> Dict[str, Dict][source] Returns a list of all configurations. Parameters prefix – If True, flattens the config dict and prepends config/. Returns Dict of all configurations of trials, indexed by their trial dir. Return type Dict[str, Dict]ray.tune.ExperimentAnalysis.get_best_checkpoint ExperimentAnalysis.get_best_checkpoint(trial: ray.tune.experiment.trial.Trial, metric: Optional[str] = None, mode: Optional[str] = None, return_path: bool = False) -> Optional[Union[ray.air.checkpoint.Checkpoint, str]][source] Gets best persistent checkpoint path of provided trial. Any checkpoints with an associated metric value of nan will be filtered out. Parameters trial – The log directory of a trial, or a trial instance. metric – key of trial info to return, e.g. “mean_accuracy”. “training_iteration” is used by default if no value was passed to self.default_metric. mode – One of [min, max]. Defaults to self.default_mode. return_path – If True, only returns the path (and not the Checkpoint object). 
If using Ray client, it is not guaranteed that this path is available on the local (client) node. Can also contain a cloud URI. Returns Checkpoint object or string if return_path=True.ray.tune.ExperimentAnalysis.get_best_config ExperimentAnalysis.get_best_config(metric: Optional[str] = None, mode: Optional[str] = None, scope: str = 'last') -> Optional[Dict][source] Retrieve the best config corresponding to the trial. Compares all trials’ scores on metric. If metric is not specified, self.default_metric will be used. If mode is not specified, self.default_mode will be used. These values are usually initialized by passing the metric and mode parameters to tune.run(). Parameters metric – Key for trial info to order on. Defaults to self.default_metric. mode – One of [min, max]. Defaults to self.default_mode. scope – One of [all, last, avg, last-5-avg, last-10-avg]. If scope=last, only look at each trial’s final step for metric, and compare across trials based on mode=[min,max]. If scope=avg, consider the simple average over all steps for metric and compare across trials based on mode=[min,max]. If scope=last-5-avg or scope=last-10-avg, consider the simple average over the last 5 or 10 steps for metric and compare across trials based on mode=[min,max]. If scope=all, find each trial’s min/max score for metric based on mode, and compare trials based on mode=[min,max].ray.tune.ExperimentAnalysis.get_best_logdir ExperimentAnalysis.get_best_logdir(metric: Optional[str] = None, mode: Optional[str] = None, scope: str = 'last') -> Optional[str][source] Retrieve the logdir corresponding to the best trial. Compares all trials’ scores on metric. If metric is not specified, self.default_metric will be used. If mode is not specified, self.default_mode will be used. These values are usually initialized by passing the metric and mode parameters to tune.run(). Parameters metric – Key for trial info to order on. Defaults to self.default_metric. mode – One of [min, max]. Defaults to self.default_mode. scope – One of [all, last, avg, last-5-avg, last-10-avg]. If scope=last, only look at each trial’s final step for metric, and compare across trials based on mode=[min,max]. If scope=avg, consider the simple average over all steps for metric and compare across trials based on mode=[min,max]. If scope=last-5-avg or scope=last-10-avg, consider the simple average over the last 5 or 10 steps for metric and compare across trials based on mode=[min,max]. If scope=all, find each trial’s min/max score for metric based on mode, and compare trials based on mode=[min,max].ray.tune.ExperimentAnalysis.get_best_trial ExperimentAnalysis.get_best_trial(metric: Optional[str] = None, mode: Optional[str] = None, scope: str = 'last', filter_nan_and_inf: bool = True) -> Optional[ray.tune.experiment.trial.Trial][source] Retrieve the best trial object. Compares all trials’ scores on metric. If metric is not specified, self.default_metric will be used. If mode is not specified, self.default_mode will be used. These values are usually initialized by passing the metric and mode parameters to tune.run(). Parameters metric – Key for trial info to order on. Defaults to self.default_metric. mode – One of [min, max]. Defaults to self.default_mode. scope – One of [all, last, avg, last-5-avg, last-10-avg]. If scope=last, only look at each trial’s final step for metric, and compare across trials based on mode=[min,max]. If scope=avg, consider the simple average over all steps for metric and compare across trials based on mode=[min,max]. 
If scope=last-5-avg or scope=last-10-avg, consider the simple average over the last 5 or 10 steps for metric and compare across trials based on mode=[min,max]. If scope=all, find each trial’s min/max score for metric based on mode, and compare trials based on mode=[min,max]. filter_nan_and_inf – If True (default), NaN or infinite values are disregarded and these trials are never selected as the best trial. Returns The best trial for the provided metric. If no trials contain the provided metric, or if the value for the metric is NaN for all trials, then returns None.ray.tune.ExperimentAnalysis.get_last_checkpoint ExperimentAnalysis.get_last_checkpoint(trial=None, metric='training_iteration', mode='max')[source] Gets the last persistent checkpoint path of the provided trial, i.e., with the highest “training_iteration”. If no trial is specified, it loads the best trial according to the provided metric and mode (defaults to max. training iteration). Parameters trial – The log directory or an instance of a trial. If None, load the latest trial automatically. metric – If no trial is specified, use this metric to identify the best trial and load the last checkpoint from this trial. mode – If no trial is specified, use the metric and this mode to identify the best trial and load the last checkpoint from it. Returns Path for last checkpoint of trialray.tune.ExperimentAnalysis.get_trial_checkpoints_paths ExperimentAnalysis.get_trial_checkpoints_paths(trial: ray.tune.experiment.trial.Trial, metric: Optional[str] = None) -> List[Tuple[str, numbers.Number]][source] Gets paths and metrics of all persistent checkpoints of a trial. Parameters trial – The log directory of a trial, or a trial instance. metric – key for trial info to return, e.g. “mean_accuracy”. “training_iteration” is used by default if no value was passed to self.default_metric. Returns List of [path, metric] for all persistent checkpoints of the trial.ray.tune.ExperimentAnalysis.runner_data ExperimentAnalysis.runner_data() -> Dict[source] Returns a dictionary of the TrialRunner data. If experiment_checkpoint_path pointed to a directory of experiments, the dict will be in the format of {experiment_session_id: TrialRunner_data}.ray.tune.ExperimentAnalysis.set_filetype ExperimentAnalysis.set_filetype(file_type: Optional[str] = None)[source] Overrides the existing file type. Parameters file_type – Read results from json or csv files. Has to be one of [None, json, csv]. Defaults to csv.ray.tune.ExperimentAnalysis.stats ExperimentAnalysis.stats() -> Dict[source] Returns a dictionary of the statistics of the experiment. If experiment_checkpoint_path pointed to a directory of experiments, the dict will be in the format of {experiment_session_id: stats}. Attributes best_checkpoint Get the checkpoint path of the best trial of the experiment best_config Get the config of the best trial of the experiment best_dataframe Get the full result dataframe of the best trial of the experiment best_logdir Get the logdir of the best trial of the experiment best_result Get the last result of the best trial of the experiment best_result_df Get the best result of the experiment as a pandas dataframe. best_trial Get the best trial of the experiment experiment_path Path pointing to the experiment directory on persistent storage. results Get the last result of the all trials of the experiment results_df Get all the last results as a pandas dataframe. trial_dataframes List of all dataframes of the trials. 
ray.tune.ExperimentAnalysis.best_checkpoint property ExperimentAnalysis.best_checkpoint: ray.air.checkpoint.Checkpoint Get the checkpoint path of the best trial of the experiment The best trial is determined by comparing the last trial results using the metric and mode parameters passed to tune.run(). If you didn’t pass these parameters, use get_best_checkpoint(trial, metric, mode) instead. Returns Checkpoint object.ray.tune.ExperimentAnalysis.best_config property ExperimentAnalysis.best_config: Dict Get the config of the best trial of the experiment The best trial is determined by comparing the last trial results using the metric and mode parameters passed to tune.run(). If you didn’t pass these parameters, use get_best_config(metric, mode, scope) instead.ray.tune.ExperimentAnalysis.best_dataframe property ExperimentAnalysis.best_dataframe: pandas.core.frame.DataFrame Get the full result dataframe of the best trial of the experiment The best trial is determined by comparing the last trial results using the metric and mode parameters passed to tune.run(). If you didn’t pass these parameters, use get_best_logdir(metric, mode) and use it to look for the dataframe in the self.trial_dataframes dict.ray.tune.ExperimentAnalysis.best_logdir property ExperimentAnalysis.best_logdir: str Get the logdir of the best trial of the experiment The best trial is determined by comparing the last trial results using the metric and mode parameters passed to tune.run(). If you didn’t pass these parameters, use get_best_logdir(metric, mode) instead.ray.tune.ExperimentAnalysis.best_result property ExperimentAnalysis.best_result: Dict Get the last result of the best trial of the experiment The best trial is determined by comparing the last trial results using the metric and mode parameters passed to tune.run(). If you didn’t pass these parameters, use get_best_trial(metric, mode, scope).last_result instead.ray.tune.ExperimentAnalysis.best_result_df property ExperimentAnalysis.best_result_df: pandas.core.frame.DataFrame Get the best result of the experiment as a pandas dataframe. The best trial is determined by comparing the last trial results using the metric and mode parameters passed to tune.run(). If you didn’t pass these parameters, use get_best_trial(metric, mode, scope).last_result instead.ray.tune.ExperimentAnalysis.best_trial property ExperimentAnalysis.best_trial: ray.tune.experiment.trial.Trial Get the best trial of the experiment The best trial is determined by comparing the last trial results using the metric and mode parameters passed to tune.run(). If you didn’t pass these parameters, use get_best_trial(metric, mode, scope) instead.ray.tune.ExperimentAnalysis.experiment_path property ExperimentAnalysis.experiment_path: str Path pointing to the experiment directory on persistent storage. This can point to a remote storage location (e.g. S3) or to a local location (path on the head node). For instance, if your remote storage path is s3://bucket/location, this will point to s3://bucket/location/experiment_name.ray.tune.ExperimentAnalysis.results property ExperimentAnalysis.results: Dict[str, Dict] Get the last result of the all trials of the experimentray.tune.ExperimentAnalysis.results_df property ExperimentAnalysis.results_df: pandas.core.frame.DataFrame Get all the last results as a pandas dataframe.ray.tune.ExperimentAnalysis.trial_dataframes property ExperimentAnalysis.trial_dataframes: Dict[str, pandas.core.frame.DataFrame] List of all dataframes of the trials. 
Each dataframe is indexed by iterations and contains reported metrics.

Ray AIR Checkpoint
See this API reference section for framework-specific checkpoints used with AIR’s library integrations.

Constructor Options
Checkpoint([local_path, data_dict, uri]) Ray AIR Checkpoint.

ray.air.checkpoint.Checkpoint
class ray.air.checkpoint.Checkpoint(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None)[source]
Bases: object
Ray AIR Checkpoint.
An AIR Checkpoint is a common interface for accessing models across different AIR components and libraries. A Checkpoint can have its data represented in one of three ways:
as a directory on local (on-disk) storage
as a directory on external storage (e.g., cloud storage)
as an in-memory dictionary
The Checkpoint object also has methods to translate between different checkpoint storage locations. These storage representations provide flexibility in distributed environments, where you may want to recreate an instance of the same model on multiple nodes or across different Ray clusters.

Example:

from ray.air.checkpoint import Checkpoint

# Create checkpoint data dict
checkpoint_data = {"data": 123}

# Create checkpoint object from data
checkpoint = Checkpoint.from_dict(checkpoint_data)

# Save checkpoint to a directory on the file system.
path = checkpoint.to_directory()

# This path can then be passed around,
# e.g. to a different function or a different script.
# You can also use `checkpoint.to_uri/from_uri` to
# read from/write to cloud storage.

# In another function or script, recover Checkpoint object from path
checkpoint = Checkpoint.from_directory(path)

# Convert into dictionary again
recovered_data = checkpoint.to_dict()

# It is guaranteed that the original data has been recovered
assert recovered_data == checkpoint_data

Checkpoints can be used to instantiate a Predictor, BatchPredictor, or PredictorDeployment class.
The constructor is a private API; instead, the from_ methods should be used to create checkpoint objects (e.g. Checkpoint.from_directory()).
Other implementation notes:
When converting between different checkpoint formats, it is guaranteed that a full round trip of conversions (e.g. directory –> dict –> directory) will recover the original checkpoint data. There are no guarantees made about compatibility of intermediate representations.
New data can be added to a Checkpoint during conversion. Consider the following conversion: directory –> dict (adding dict[“foo”] = “bar”) –> directory –> dict (expect to see dict[“foo”] = “bar”). Note that the second directory will contain pickle files with the serialized additional field data in them.
Similarly with a dict as a source: dict –> directory (add file “foo.txt”) –> dict –> directory (will have “foo.txt” in it again). Note that the second dict representation will contain an extra field with the serialized additional files in it.
Checkpoints can be pickled and sent to remote processes. Please note that checkpoints pointing to local directories will be pickled as data representations, so the full checkpoint data will be contained in the checkpoint object. If you want to avoid this, consider passing only the checkpoint directory to the remote task and reconstructing your checkpoint object in that function. Note that this will only work if the “remote” task is scheduled on the same node or a node that also has access to the local data path (e.g. on a shared file system like NFS).
If you need persistence across clusters, use the to_uri() or to_directory() methods to persist your checkpoints to disk. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods __init__([local_path, data_dict, uri]) DeveloperAPI: This API may change across minor Ray releases. as_directory() Return checkpoint directory path in a context. from_bytes(data) Create a checkpoint from the given byte string. from_checkpoint(other) Create a checkpoint from a generic Checkpoint. from_dict(data) Create checkpoint object from dictionary. from_directory(path) Create checkpoint object from directory. from_uri(uri) Create checkpoint object from location URI (e.g. get_internal_representation() Return tuple of (type, data) for the internal representation. get_preprocessor() Return the saved preprocessor, if one exists. set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. to_bytes() Return Checkpoint serialized as bytes object. to_dict() Return checkpoint data as dictionary. to_directory([path]) Write checkpoint data to directory. to_uri(uri) Write checkpoint data to location URI (e.g. ray.air.checkpoint.Checkpoint.__init__ Checkpoint.__init__(local_path: Optional[Union[str, os.PathLike]] = None, data_dict: Optional[dict] = None, uri: Optional[str] = None)[source] DeveloperAPI: This API may change across minor Ray releases.ray.air.checkpoint.Checkpoint.as_directory Checkpoint.as_directory() -> Iterator[str][source] Return checkpoint directory path in a context. This function makes checkpoint data available as a directory while avoiding unnecessary copies and left-over temporary data. If the checkpoint is already a directory checkpoint, it will return the existing path. If it is not, it will create a temporary directory, which will be deleted after the context is exited. Users should treat the returned checkpoint directory as read-only and avoid changing any data within it, as it might get deleted when exiting the context. Example: with checkpoint.as_directory() as checkpoint_dir: # Do some read-only processing of files within checkpoint_dir pass # At this point, if a temporary directory was created, it will have # been deleted.ray.air.checkpoint.Checkpoint.from_bytes classmethod Checkpoint.from_bytes(data: bytes) -> ray.air.checkpoint.Checkpoint[source] Create a checkpoint from the given byte string. Parameters data – Data object containing pickled checkpoint data. Returns checkpoint object. Return type Checkpointray.air.checkpoint.Checkpoint.from_checkpoint classmethod Checkpoint.from_checkpoint(other: ray.air.checkpoint.Checkpoint) -> ray.air.checkpoint.Checkpoint[source] Create a checkpoint from a generic Checkpoint. This method can be used to create a framework-specific checkpoint from a generic Checkpoint object. Examples >>> result = TorchTrainer.fit(...) >>> checkpoint = TorchCheckpoint.from_checkpoint(result.checkpoint) >>> model = checkpoint.get_model() Linear(in_features=1, out_features=1, bias=True) DeveloperAPI: This API may change across minor Ray releases.ray.air.checkpoint.Checkpoint.from_dict classmethod Checkpoint.from_dict(data: dict) -> ray.air.checkpoint.Checkpoint[source] Create checkpoint object from dictionary. Parameters data – Dictionary containing checkpoint data. Returns checkpoint object. Return type Checkpointray.air.checkpoint.Checkpoint.from_directory classmethod Checkpoint.from_directory(path: Union[str, os.PathLike]) -> ray.air.checkpoint.Checkpoint[source] Create checkpoint object from directory. 
Parameters path – Directory containing checkpoint data. The caller promises to not delete the directory (gifts ownership of the directory to this Checkpoint). Returns checkpoint object. Return type Checkpointray.air.checkpoint.Checkpoint.from_uri classmethod Checkpoint.from_uri(uri: str) -> ray.air.checkpoint.Checkpoint[source] Create checkpoint object from location URI (e.g. cloud storage). Valid locations currently include AWS S3 (s3://), Google cloud storage (gs://), HDFS (hdfs://), and local files (file://). Parameters uri – Source location URI to read data from. Returns checkpoint object. Return type Checkpointray.air.checkpoint.Checkpoint.get_internal_representation Checkpoint.get_internal_representation() -> Tuple[str, Union[dict, str, ray.ObjectRef]][source] Return tuple of (type, data) for the internal representation. The internal representation can be used e.g. to compare checkpoint objects for equality or to access the underlying data storage. The returned type is a string and one of ["local_path", "data_dict", "uri"]. The data is the respective data value. Note that paths converted from file://... will be returned as local_path (without the file:// prefix) and not as uri. Returns Tuple of type and data. DeveloperAPI: This API may change across minor Ray releases.ray.air.checkpoint.Checkpoint.get_preprocessor Checkpoint.get_preprocessor() -> Optional[Preprocessor][source] Return the saved preprocessor, if one exists.ray.air.checkpoint.Checkpoint.set_preprocessor Checkpoint.set_preprocessor(preprocessor: Optional[Preprocessor])[source] Saves the provided preprocessor to this Checkpoint.ray.air.checkpoint.Checkpoint.to_bytes Checkpoint.to_bytes() -> bytes[source] Return Checkpoint serialized as bytes object. Returns Bytes object containing checkpoint data. Return type bytesray.air.checkpoint.Checkpoint.to_dict Checkpoint.to_dict() -> dict[source] Return checkpoint data as dictionary. Returns Dictionary containing checkpoint data. Return type dictray.air.checkpoint.Checkpoint.to_directory Checkpoint.to_directory(path: Optional[str] = None) -> str[source] Write checkpoint data to directory. Parameters path – Target directory to restore data in. If not specified, will create a temporary directory. Returns Directory containing checkpoint data. Return type strray.air.checkpoint.Checkpoint.to_uri Checkpoint.to_uri(uri: str) -> str[source] Write checkpoint data to location URI (e.g. cloud storage). Parameters uri – Target location URI to write data to. Returns Cloud location containing checkpoint data. Return type str Attributes path Return path to checkpoint, if available. uri Return checkpoint URI, if available. ray.air.checkpoint.Checkpoint.path property Checkpoint.path: Optional[str] Return path to checkpoint, if available. This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local path if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.path == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.path == None Returns Checkpoint path if this checkpoint is reachable from the current node (e.g. cloud storage or locally available directory).ray.air.checkpoint.Checkpoint.uri property Checkpoint.uri: Optional[str] Return checkpoint URI, if available. 
This will return a URI to cloud storage if this checkpoint is persisted on cloud, or a local file:// URI if this checkpoint is persisted on local disk and available on the current node. In all other cases, this will return None. Users can then choose to persist to cloud with Checkpoint.to_uri(). Example >>> from ray.air import Checkpoint >>> checkpoint = Checkpoint.from_uri("s3://some-bucket/some-location") >>> assert checkpoint.uri == "s3://some-bucket/some-location" >>> checkpoint = Checkpoint.from_dict({"data": 1}) >>> assert checkpoint.uri == None Returns Checkpoint URI if this URI is reachable from the current node (e.g. cloud storage or locally available file URI). Checkpoint.from_dict(data) Create checkpoint object from dictionary. Checkpoint.from_bytes(data) Create a checkpoint from the given byte string. Checkpoint.from_directory(path) Create checkpoint object from directory. Checkpoint.from_uri(uri) Create checkpoint object from location URI (e.g. Checkpoint.from_checkpoint(other) Create a checkpoint from a generic Checkpoint. Checkpoint Properties Checkpoint.uri Return checkpoint URI, if available. Checkpoint.get_internal_representation() Return tuple of (type, data) for the internal representation. Checkpoint.get_preprocessor() Return the saved preprocessor, if one exists. Checkpoint.set_preprocessor(preprocessor) Saves the provided preprocessor to this Checkpoint. Checkpoint Format Conversions Checkpoint.to_dict() Return checkpoint data as dictionary. Checkpoint.to_bytes() Return Checkpoint serialized as bytes object. Checkpoint.to_directory([path]) Write checkpoint data to directory. Checkpoint.as_directory() Return checkpoint directory path in a context. Checkpoint.to_uri(uri) Write checkpoint data to location URI (e.g. Predictor See this user guide on performing model inference in AIR for usage examples. Predictor Interface Constructor Options predictor.Predictor([preprocessor]) Predictors load models from checkpoints to perform inference. ray.train.predictor.Predictor class ray.train.predictor.Predictor(preprocessor: Optional[ray.data.preprocessor.Preprocessor] = None)[source] Bases: abc.ABC Predictors load models from checkpoints to perform inference. The base Predictor class cannot be instantiated directly. Only one of its subclasses can be used. How does a Predictor work? Predictors expose a predict method that accepts an input batch of type DataBatchType and outputs predictions of the same type as the input batch. When the predict method is called the following occurs: The input batch is converted into a pandas DataFrame. Tensor input (like a np.ndarray) will be converted into a single column Pandas Dataframe. If there is a Preprocessor saved in the provided Checkpoint, the preprocessor will be used to transform the DataFrame. The transformed DataFrame will be passed to the model for inference (via the predictor._predict_pandas method). The predictions will be outputted by predict in the same type as the original input. How do I create a new Predictor? To implement a new Predictor for your particular framework, you should subclass the base Predictor and implement the following two methods: _predict_pandas: Given a pandas.DataFrame input, return a pandas.DataFrame containing predictions. from_checkpoint: Logic for creating a Predictor from an AIR Checkpoint. Optionally _predict_numpy for better performance when working with tensor data to avoid extra copies from Pandas conversions. PublicAPI (beta): This API is in beta and may change before becoming stable. 
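For example, a minimal sketch of such a subclass (the ScalePredictor name, the {"scale": ...} checkpoint layout, and the "value" column are illustrative, not part of the API):

import pandas as pd

from ray.air.checkpoint import Checkpoint
from ray.train.predictor import Predictor

# Toy predictor that multiplies a column by a scale stored in the checkpoint.
class ScalePredictor(Predictor):
    def __init__(self, scale: float):
        super().__init__()
        self.scale = scale

    @classmethod
    def from_checkpoint(cls, checkpoint: Checkpoint, **kwargs) -> "ScalePredictor":
        # Recover whatever state the trainer stored in the checkpoint.
        return cls(scale=checkpoint.to_dict()["scale"])

    def _predict_pandas(self, data: pd.DataFrame, **kwargs) -> pd.DataFrame:
        # Return a DataFrame of predictions, as the interface requires.
        return pd.DataFrame({"predictions": data["value"] * self.scale})

predictor = ScalePredictor.from_checkpoint(Checkpoint.from_dict({"scale": 2.0}))
print(predictor.predict(pd.DataFrame({"value": [1.0, 2.0, 3.0]})))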
Methods __init__([preprocessor]) Subclasses must call Predictor.__init__() to set a preprocessor. from_checkpoint(checkpoint, **kwargs) Create a specific predictor from a checkpoint. from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data, **kwargs) Perform inference on a batch of data. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.predictor.Predictor.__init__ Predictor.__init__(preprocessor: Optional[ray.data.preprocessor.Preprocessor] = None)[source] Subclasses must call Predictor.__init__() to set a preprocessor.ray.train.predictor.Predictor.from_checkpoint abstract classmethod Predictor.from_checkpoint(checkpoint: ray.air.checkpoint.Checkpoint, **kwargs) -> ray.train.predictor.Predictor[source] Create a specific predictor from a checkpoint. Parameters checkpoint – Checkpoint to load predictor data from. kwargs – Arguments specific to predictor implementations. Returns Predictor object. Return type Predictorray.train.predictor.Predictor.from_pandas_udf classmethod Predictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.predictor.Predictor[source] Create a Predictor from a Pandas UDF. Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.predictor.Predictor.get_preprocessor Predictor.get_preprocessor() -> Optional[ray.data.preprocessor.Preprocessor][source] Get the preprocessor to use prior to executing predictions.ray.train.predictor.Predictor.predict Predictor.predict(data: Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]], **kwargs) -> Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]][source] Perform inference on a batch of data. Parameters data – A batch of input data of type DataBatchType. kwargs – Arguments specific to predictor implementations. These are passed directly to _predict_numpy or _predict_pandas. Returns Prediction result. The return type will be the same as the input type. Return type DataBatchTyperay.train.predictor.Predictor.preferred_batch_format classmethod Predictor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat[source] Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _predict_pandas and _predict_numpy are implemented. Defaults to Pandas. Can be overridden by predictor classes depending on the framework type, e.g. TorchPredictor prefers Numpy and XGBoostPredictor prefers Pandas as native batch format. DeveloperAPI: This API may change across minor Ray releases.ray.train.predictor.Predictor.set_preprocessor Predictor.set_preprocessor(preprocessor: Optional[ray.data.preprocessor.Preprocessor]) -> None[source] Set the preprocessor to use prior to executing predictions. predictor.Predictor.from_checkpoint(...) Create a specific predictor from a checkpoint. predictor.Predictor.from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. Predictor Properties predictor.Predictor.get_preprocessor() Get the preprocessor to use prior to executing predictions. predictor.Predictor.set_preprocessor(...) Set the preprocessor to use prior to executing predictions.
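As a quick illustration of from_pandas_udf, the following sketch wraps a plain pandas function as a predictor (the add_one function and the "value"/"predictions" column names are made up for the example, and it assumes the returned predictor can be used directly on in-memory batches):

import pandas as pd

from ray.train.predictor import Predictor

# Hypothetical UDF: takes a pandas DataFrame and returns a DataFrame of predictions.
def add_one(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"predictions": df["value"] + 1})

predictor = Predictor.from_pandas_udf(add_one)
print(predictor.predict(pd.DataFrame({"value": [1, 2, 3]})))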
Prediction API predictor.Predictor.predict(data, **kwargs) Perform inference on a batch of data. Supported Data Formats predictor.Predictor.preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. DataBatchType The central part of internal API. ray.train.predictor.DataBatchType ray.train.predictor.DataBatchType The central part of internal API. This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict. alias of Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]] Batch Predictor Constructor Options batch_predictor.BatchPredictor(checkpoint, ...) Batch predictor class. ray.train.batch_predictor.BatchPredictor class ray.train.batch_predictor.BatchPredictor(checkpoint: ray.air.checkpoint.Checkpoint, predictor_cls: Type[ray.train.predictor.Predictor], **predictor_kwargs)[source] Bases: object Batch predictor class. Takes a predictor class and a checkpoint and provides an interface to run batch scoring on Datasets. This batch predictor wraps around a predictor class and executes it in a distributed way when calling predict(). DEPRECATED: This API is deprecated and may be removed in future Ray releases. BatchPredictor is deprecated from Ray 2.6. Use Dataset.map_batches instead for offline batch inference. For a migration guide, see https://github.com/ray-project/ray/issues/37489. To learn more about batch inference with Ray Data, see http://batchinference.io. Methods from_checkpoint(checkpoint, predictor_cls, ...) Create a BatchPredictor from a Checkpoint. from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data, *[, feature_columns, ...]) Run batch scoring on a Dataset. predict_pipelined(data, *[, ...]) Setup a prediction pipeline for batch scoring. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.batch_predictor.BatchPredictor.from_checkpoint classmethod BatchPredictor.from_checkpoint(checkpoint: ray.air.checkpoint.Checkpoint, predictor_cls: Type[ray.train.predictor.Predictor], **kwargs) -> ray.train.batch_predictor.BatchPredictor[source] Create a BatchPredictor from a Checkpoint. Example from torchvision import models from ray.train.batch_predictor import BatchPredictor from ray.train.torch import TorchCheckpoint, TorchPredictor model = models.resnet50(pretrained=True) checkpoint = TorchCheckpoint.from_model(model) predictor = BatchPredictor.from_checkpoint(checkpoint, TorchPredictor) Parameters checkpoint – A Checkpoint containing model state and optionally a preprocessor. predictor_cls – The type of predictor to use. **kwargs – Optional arguments to pass the predictor_cls constructor.ray.train.batch_predictor.BatchPredictor.from_pandas_udf classmethod BatchPredictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.batch_predictor.BatchPredictor[source] Create a Predictor from a Pandas UDF. 
Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.batch_predictor.BatchPredictor.get_preprocessor BatchPredictor.get_preprocessor() -> ray.data.preprocessor.Preprocessor[source] Get the preprocessor to use prior to executing predictions.ray.train.batch_predictor.BatchPredictor.predict BatchPredictor.predict(data: Union[ray.data.dataset.Dataset, ray.data.dataset_pipeline.DatasetPipeline], *, feature_columns: Optional[List[str]] = None, keep_columns: Optional[List[str]] = None, batch_size: int = 4096, min_scoring_workers: int = 1, max_scoring_workers: Optional[int] = None, num_cpus_per_worker: Optional[int] = None, num_gpus_per_worker: Optional[int] = None, separate_gpu_stage: bool = True, ray_remote_args: Optional[Dict[str, Any]] = None, **predict_kwargs) -> Union[ray.data.dataset.Dataset, ray.data.dataset_pipeline.DatasetPipeline][source] Run batch scoring on a Dataset. In Ray 2.4, BatchPredictor is lazy by default. Use one of the Dataset consumption APIs, such as iterating through the output, to trigger the execution of prediction. Parameters data – Dataset or pipeline to run batch prediction on. feature_columns – List of columns in the preprocessed dataset to use for prediction. Columns not specified will be dropped from data before being passed to the predictor. If None, use all columns in the preprocessed dataset. keep_columns – List of columns in the preprocessed dataset to include in the prediction result. This is useful for calculating final accuracies/metrics on the result dataset. If None, the columns in the output dataset will contain just the prediction results. batch_size – Split dataset into batches of this size for prediction. min_scoring_workers – Minimum number of scoring actors. max_scoring_workers – If set, specify the maximum number of scoring actors. num_cpus_per_worker – Number of CPUs to allocate per scoring worker. Set to 1 by default. num_gpus_per_worker – Number of GPUs to allocate per scoring worker. Set to 0 by default. If you want to use GPUs for inference, please specify this parameter. separate_gpu_stage – If using GPUs, specifies whether to execute GPU processing in a separate stage (enabled by default). This avoids running expensive preprocessing steps on GPU workers. ray_remote_args – Additional resource requirements to request from ray. predict_kwargs – Keyword arguments passed to the predictor’s predict() method. Returns Dataset containing scoring results. Examples import pandas as pd import ray from ray.train.batch_predictor import BatchPredictor def calculate_accuracy(df): return pd.DataFrame({"correct": df["preds"] == df["label"]}) # Create a batch predictor that returns identity as the predictions. batch_pred = BatchPredictor.from_pandas_udf( lambda data: pd.DataFrame({"preds": data["feature_1"]})) # Create a dummy dataset. ds = ray.data.from_pandas(pd.DataFrame({ "feature_1": [1, 2, 3], "label": [1, 2, 3]})) # Execute batch prediction using this predictor. 
predictions = batch_pred.predict(ds, feature_columns=["feature_1"], keep_columns=["label"]) # print predictions and calculate final accuracy print(predictions) correct = predictions.map_batches(calculate_accuracy) print(f"Final accuracy: {correct.sum(on='correct') / correct.count()}") MapBatches(ScoringWrapper) +- Dataset(num_blocks=1, num_rows=3, schema={feature_1: int64, label: int64}) Final accuracy: 1.0ray.train.batch_predictor.BatchPredictor.predict_pipelined BatchPredictor.predict_pipelined(data: ray.data.dataset.Dataset, *, blocks_per_window: Optional[int] = None, bytes_per_window: Optional[int] = None, feature_columns: Optional[List[str]] = None, keep_columns: Optional[List[str]] = None, batch_size: int = 4096, min_scoring_workers: int = 1, max_scoring_workers: Optional[int] = None, num_cpus_per_worker: Optional[int] = None, num_gpus_per_worker: Optional[int] = None, separate_gpu_stage: bool = True, ray_remote_args: Optional[Dict[str, Any]] = None, **predict_kwargs) -> ray.data.dataset_pipeline.DatasetPipeline[source] Setup a prediction pipeline for batch scoring. Unlike predict(), this generates a DatasetPipeline object and does not perform execution. Execution can be triggered by pulling from the pipeline. This is a convenience wrapper around calling window() on the Dataset prior to passing it BatchPredictor.predict(). Parameters data – Dataset to run batch prediction on. blocks_per_window – The window size (parallelism) in blocks. Increasing window size increases pipeline throughput, but also increases the latency to initial output, since it decreases the length of the pipeline. Setting this to infinity effectively disables pipelining. bytes_per_window – Specify the window size in bytes instead of blocks. This will be treated as an upper bound for the window size, but each window will still include at least one block. This is mutually exclusive with blocks_per_window. feature_columns – List of columns in data to use for prediction. Columns not specified will be dropped from data before being passed to the predictor. If None, use all columns. keep_columns – List of columns in data to include in the prediction result. This is useful for calculating final accuracies/metrics on the result dataset. If None, the columns in the output dataset will contain just the prediction results. batch_size – Split dataset into batches of this size for prediction. min_scoring_workers – Minimum number of scoring actors. max_scoring_workers – If set, specify the maximum number of scoring actors. num_cpus_per_worker – Number of CPUs to allocate per scoring worker. num_gpus_per_worker – Number of GPUs to allocate per scoring worker. separate_gpu_stage – If using GPUs, specifies whether to execute GPU processing in a separate stage (enabled by default). This avoids running expensive preprocessing steps on GPU workers. ray_remote_args – Additional resource requirements to request from ray. predict_kwargs – Keyword arguments passed to the predictor’s predict() method. Returns DatasetPipeline that generates scoring results. Examples import pandas as pd import ray from ray.train.batch_predictor import BatchPredictor # Create a batch predictor that always returns `42` for each input. batch_pred = BatchPredictor.from_pandas_udf( lambda data: pd.DataFrame({"a": [42] * len(data)})) # Create a dummy dataset. ds = ray.data.range_tensor(1000, parallelism=4) # Setup a prediction pipeline. 
print(batch_pred.predict_pipelined(ds, blocks_per_window=1)) DatasetPipeline(num_windows=4, num_stages=3)ray.train.batch_predictor.BatchPredictor.set_preprocessor BatchPredictor.set_preprocessor(preprocessor: ray.data.preprocessor.Preprocessor) -> None[source] Set the preprocessor to use prior to executing predictions. batch_predictor.BatchPredictor.from_checkpoint(...) Create a BatchPredictor from a Checkpoint. batch_predictor.BatchPredictor.from_pandas_udf(...) Create a Predictor from a Pandas UDF. Batch Prediction API batch_predictor.BatchPredictor.predict(data, *) Run batch scoring on a Dataset. batch_predictor.BatchPredictor.predict_pipelined(data, *) Setup a prediction pipeline for batch scoring. Built-in Predictors for Library Integrations XGBoostPredictor(model[, preprocessor]) A predictor for XGBoost models. LightGBMPredictor(model[, preprocessor]) A predictor for LightGBM models. TensorflowPredictor(*[, model, ...]) A predictor for TensorFlow models. TorchPredictor(model[, preprocessor, use_gpu]) A predictor for PyTorch models. TransformersPredictor([pipeline, ...]) A predictor for HuggingFace Transformers PyTorch models. SklearnPredictor(estimator[, preprocessor]) A predictor for scikit-learn compatible estimators. RLPredictor(policy[, preprocessor]) A predictor for RLlib policies. ray.train.xgboost.XGBoostPredictor class ray.train.xgboost.XGBoostPredictor(model: xgboost.core.Booster, preprocessor: Optional[Preprocessor] = None)[source] Bases: ray.train.predictor.Predictor A predictor for XGBoost models. Parameters model – The XGBoost booster to use for predictions. preprocessor – A preprocessor used to transform data batches prior to prediction. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods from_checkpoint(checkpoint) Instantiate the predictor from a Checkpoint. from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data[, feature_columns, dmatrix_kwargs]) Run inference on data batch. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.xgboost.XGBoostPredictor.from_checkpoint classmethod XGBoostPredictor.from_checkpoint(checkpoint: ray.air.checkpoint.Checkpoint) -> ray.train.xgboost.xgboost_predictor.XGBoostPredictor[source] Instantiate the predictor from a Checkpoint. The checkpoint is expected to be a result of XGBoostTrainer. Parameters checkpoint – The checkpoint to load the model and preprocessor from. It is expected to be from the result of a XGBoostTrainer run.ray.train.xgboost.XGBoostPredictor.from_pandas_udf classmethod XGBoostPredictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.predictor.Predictor Create a Predictor from a Pandas UDF. 
Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.xgboost.XGBoostPredictor.get_preprocessor XGBoostPredictor.get_preprocessor() -> Optional[ray.data.preprocessor.Preprocessor] Get the preprocessor to use prior to executing predictions.ray.train.xgboost.XGBoostPredictor.predict XGBoostPredictor.predict(data: Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]], feature_columns: Optional[Union[List[str], List[int]]] = None, dmatrix_kwargs: Optional[Dict[str, Any]] = None, **predict_kwargs) -> Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]][source] Run inference on data batch. The data is converted into an XGBoost DMatrix before being inputted to the model. Parameters data – A batch of input data. feature_columns – The names or indices of the columns in the data to use as features to predict on. If None, then use all columns in data. dmatrix_kwargs – Dict of keyword arguments passed to xgboost.DMatrix. **predict_kwargs – Keyword arguments passed to xgboost.Booster.predict. Examples: import numpy as np import xgboost as xgb from ray.train.xgboost import XGBoostPredictor train_X = np.array([[1, 2], [3, 4]]) train_y = np.array([0, 1]) model = xgb.XGBClassifier().fit(train_X, train_y) predictor = XGBoostPredictor(model=model.get_booster()) data = np.array([[1, 2], [3, 4]]) predictions = predictor.predict(data) # Only use first and second column as the feature data = np.array([[1, 2, 8], [3, 4, 9]]) predictions = predictor.predict(data, feature_columns=[0, 1]) import pandas as pd import xgboost as xgb from ray.train.xgboost import XGBoostPredictor train_X = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"]) train_y = pd.Series([0, 1]) model = xgb.XGBClassifier().fit(train_X, train_y) predictor = XGBoostPredictor(model=model.get_booster()) # Pandas dataframe. data = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"]) predictions = predictor.predict(data) # Only use first and second column as the feature data = pd.DataFrame([[1, 2, 8], [3, 4, 9]], columns=["A", "B", "C"]) predictions = predictor.predict(data, feature_columns=["A", "B"]) Returns Prediction result.ray.train.xgboost.XGBoostPredictor.preferred_batch_format classmethod XGBoostPredictor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _predict_pandas and _predict_numpy are implemented. Defaults to Pandas. Can be overriden by predictor classes depending on the framework type, e.g. TorchPredictor prefers Numpy and XGBoostPredictor prefers Pandas as native batch format. DeveloperAPI: This API may change across minor Ray releases.ray.train.xgboost.XGBoostPredictor.set_preprocessor XGBoostPredictor.set_preprocessor(preprocessor: Optional[ray.data.preprocessor.Preprocessor]) -> None Set the preprocessor to use prior to executing predictions.ray.train.lightgbm.LightGBMPredictor class ray.train.lightgbm.LightGBMPredictor(model: lightgbm.basic.Booster, preprocessor: Optional[Preprocessor] = None)[source] Bases: ray.train.predictor.Predictor A predictor for LightGBM models. Parameters model – The LightGBM booster to use for predictions. preprocessor – A preprocessor used to transform data batches prior to prediction. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods from_checkpoint(checkpoint) Instantiate the predictor from a Checkpoint. 
from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data[, feature_columns]) Run inference on data batch. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.lightgbm.LightGBMPredictor.from_checkpoint classmethod LightGBMPredictor.from_checkpoint(checkpoint: ray.air.checkpoint.Checkpoint) -> ray.train.lightgbm.lightgbm_predictor.LightGBMPredictor[source] Instantiate the predictor from a Checkpoint. The checkpoint is expected to be a result of LightGBMTrainer. Parameters checkpoint – The checkpoint to load the model and preprocessor from. It is expected to be from the result of a LightGBMTrainer run.ray.train.lightgbm.LightGBMPredictor.from_pandas_udf classmethod LightGBMPredictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.predictor.Predictor Create a Predictor from a Pandas UDF. Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.lightgbm.LightGBMPredictor.get_preprocessor LightGBMPredictor.get_preprocessor() -> Optional[ray.data.preprocessor.Preprocessor] Get the preprocessor to use prior to executing predictions.ray.train.lightgbm.LightGBMPredictor.predict LightGBMPredictor.predict(data: Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]], feature_columns: Optional[Union[List[str], List[int]]] = None, **predict_kwargs) -> Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]][source] Run inference on data batch. Parameters data – A batch of input data. feature_columns – The names or indices of the columns in the data to use as features to predict on. If None, then use all columns in data. **predict_kwargs – Keyword arguments passed to lightgbm.Booster.predict. Examples >>> import numpy as np >>> import lightgbm as lgbm >>> from ray.train.lightgbm import LightGBMPredictor >>> >>> train_X = np.array([[1, 2], [3, 4]]) >>> train_y = np.array([0, 1]) >>> >>> model = lgbm.LGBMClassifier().fit(train_X, train_y) >>> predictor = LightGBMPredictor(model=model.booster_) >>> >>> data = np.array([[1, 2], [3, 4]]) >>> predictions = predictor.predict(data) >>> >>> # Only use first and second column as the feature >>> data = np.array([[1, 2, 8], [3, 4, 9]]) >>> predictions = predictor.predict(data, feature_columns=[0, 1]) >>> import pandas as pd >>> import lightgbm as lgbm >>> from ray.train.lightgbm import LightGBMPredictor >>> >>> train_X = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"]) >>> train_y = pd.Series([0, 1]) >>> >>> model = lgbm.LGBMClassifier().fit(train_X, train_y) >>> predictor = LightGBMPredictor(model=model.booster_) >>> >>> # Pandas dataframe. >>> data = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"]) >>> predictions = predictor.predict(data) >>> >>> # Only use first and second column as the feature >>> data = pd.DataFrame([[1, 2, 8], [3, 4, 9]], columns=["A", "B", "C"]) >>> predictions = predictor.predict(data, feature_columns=["A", "B"]) Returns Prediction result.ray.train.lightgbm.LightGBMPredictor.preferred_batch_format classmethod LightGBMPredictor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. 
The preferred batch format to use if both _predict_pandas and _predict_numpy are implemented. Defaults to Pandas. Can be overridden by predictor classes depending on the framework type, e.g. TorchPredictor prefers Numpy and XGBoostPredictor prefers Pandas as native batch format. DeveloperAPI: This API may change across minor Ray releases.ray.train.lightgbm.LightGBMPredictor.set_preprocessor LightGBMPredictor.set_preprocessor(preprocessor: Optional[ray.data.preprocessor.Preprocessor]) -> None Set the preprocessor to use prior to executing predictions.ray.train.tensorflow.TensorflowPredictor class ray.train.tensorflow.TensorflowPredictor(*, model: Optional[keras.engine.training.Model] = None, preprocessor: Optional[Preprocessor] = None, use_gpu: bool = False)[source] Bases: ray.train._internal.dl_predictor.DLPredictor A predictor for TensorFlow models. Parameters model – A Tensorflow Keras model to use for predictions. preprocessor – A preprocessor used to transform data batches prior to prediction. model_weights – List of weights to use for the model. use_gpu – If set, the model will be moved to GPU on instantiation and prediction happens on GPU. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods call_model(inputs) Runs inference on a single batch of tensor data. from_checkpoint(checkpoint[, ...]) Instantiate the predictor from a Checkpoint. from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data[, dtype]) Run inference on data batch. preferred_batch_format() DeveloperAPI: This API may change across minor Ray releases. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.tensorflow.TensorflowPredictor.call_model TensorflowPredictor.call_model(inputs: Union[tensorflow.python.framework.ops.Tensor, Dict[str, tensorflow.python.framework.ops.Tensor]]) -> Union[tensorflow.python.framework.ops.Tensor, Dict[str, tensorflow.python.framework.ops.Tensor]][source] Runs inference on a single batch of tensor data. This method is called by TensorflowPredictor.predict after converting the original data batch to TensorFlow tensors. Override this method to add custom logic for processing the model input or output. Example

# List outputs are not supported by default TensorflowPredictor.
def build_model() -> tf.keras.Model:
    input = tf.keras.layers.Input(shape=1)
    model = tf.keras.models.Model(inputs=input, outputs=[input, input])
    return model

# Use a custom predictor to format model output as a dict.
class CustomPredictor(TensorflowPredictor):
    def call_model(self, inputs):
        model_output = super().call_model(inputs)
        return {
            str(i): model_output[i] for i in range(len(model_output))
        }

import numpy as np

data_batch = np.array([[0.5], [0.6], [0.7]], dtype=np.float32)
predictor = CustomPredictor(model=build_model())
predictions = predictor.predict(data_batch)

Parameters inputs – A batch of data to predict on, represented as either a single TensorFlow tensor or, for multi-input models, a dictionary of tensors. Returns The model outputs, either as a single tensor or a dictionary of tensors.
DeveloperAPI: This API may change across minor Ray releases.ray.train.tensorflow.TensorflowPredictor.from_checkpoint classmethod TensorflowPredictor.from_checkpoint(checkpoint: ray.air.checkpoint.Checkpoint, model_definition: Optional[Union[Callable[[], keras.engine.training.Model], Type[keras.engine.training.Model]]] = None, use_gpu: Optional[bool] = False) -> ray.train.tensorflow.tensorflow_predictor.TensorflowPredictor[source] Instantiate the predictor from a Checkpoint. The checkpoint is expected to be a result of TensorflowTrainer. Parameters checkpoint – The checkpoint to load the model and preprocessor from. It is expected to be from the result of a TensorflowTrainer run. model_definition – A callable that returns a TensorFlow Keras model to use. Model weights will be loaded from the checkpoint. This is only needed if the checkpoint was created from TensorflowCheckpoint.from_model. use_gpu – Whether GPU should be used during prediction.ray.train.tensorflow.TensorflowPredictor.from_pandas_udf classmethod TensorflowPredictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.predictor.Predictor Create a Predictor from a Pandas UDF. Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.tensorflow.TensorflowPredictor.get_preprocessor TensorflowPredictor.get_preprocessor() -> Optional[ray.data.preprocessor.Preprocessor] Get the preprocessor to use prior to executing predictions.ray.train.tensorflow.TensorflowPredictor.predict TensorflowPredictor.predict(data: Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]], dtype: Optional[Union[tensorflow.python.framework.dtypes.DType, Dict[str, tensorflow.python.framework.dtypes.DType]]] = None) -> Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]][source] Run inference on data batch. If the provided data is a single array or a dataframe/table with a single column, it will be converted into a single Tensorflow tensor before being inputted to the model. If the provided data is a multi-column table or a dict of numpy arrays, it will be converted into a dict of tensors before being inputted to the model. This is useful for multi-modal inputs (for example your model accepts both image and text). Parameters data – A batch of input data. Either a pandas DataFrame or numpy array. dtype – The dtypes to use for the tensors. Either a single dtype for all tensors or a mapping from column name to dtype. Examples >>> import numpy as np >>> import tensorflow as tf >>> from ray.train.tensorflow import TensorflowPredictor >>> >>> def build_model(): ... return tf.keras.Sequential( ... [ ... tf.keras.layers.InputLayer(input_shape=()), ... tf.keras.layers.Flatten(), ... tf.keras.layers.Dense(1), ... ] ... ) >>> >>> weights = [np.array([[2.0]]), np.array([0.0])] >>> predictor = TensorflowPredictor(model=build_model()) >>> >>> data = np.asarray([1, 2, 3]) >>> predictions = predictor.predict(data) >>> import pandas as pd >>> import tensorflow as tf >>> from ray.train.tensorflow import TensorflowPredictor >>> >>> def build_model(): ... input1 = tf.keras.layers.Input(shape=(1,), name="A") ... input2 = tf.keras.layers.Input(shape=(1,), name="B") ... merged = tf.keras.layers.Concatenate(axis=1)([input1, input2]) ... output = tf.keras.layers.Dense(2, input_dim=2)(merged) ... return tf.keras.models.Model( ... 
inputs=[input1, input2], outputs=output) >>> >>> predictor = TensorflowPredictor(model=build_model()) >>> >>> # Pandas dataframe. >>> data = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"]) >>> >>> predictions = predictor.predict(data) Returns Prediction result. The return type will be the same as the input type. Return type DataBatchTyperay.train.tensorflow.TensorflowPredictor.preferred_batch_format classmethod TensorflowPredictor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat DeveloperAPI: This API may change across minor Ray releases.ray.train.tensorflow.TensorflowPredictor.set_preprocessor TensorflowPredictor.set_preprocessor(preprocessor: Optional[ray.data.preprocessor.Preprocessor]) -> None Set the preprocessor to use prior to executing predictions.ray.train.torch.TorchPredictor class ray.train.torch.TorchPredictor(model: torch.nn.modules.module.Module, preprocessor: Optional[Preprocessor] = None, use_gpu: bool = False)[source] Bases: ray.train._internal.dl_predictor.DLPredictor A predictor for PyTorch models. Parameters model – The torch module to use for predictions. preprocessor – A preprocessor used to transform data batches prior to prediction. use_gpu – If set, the model will be moved to GPU on instantiation and prediction happens on GPU. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods call_model(inputs) Runs inference on a single batch of tensor data. from_checkpoint(checkpoint[, model, use_gpu]) Instantiate the predictor from a Checkpoint. from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data[, dtype]) Run inference on data batch. preferred_batch_format() DeveloperAPI: This API may change across minor Ray releases. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.torch.TorchPredictor.call_model TorchPredictor.call_model(inputs: Union[torch.Tensor, Dict[str, torch.Tensor]]) -> Union[torch.Tensor, Dict[str, torch.Tensor]][source] Runs inference on a single batch of tensor data. This method is called by TorchPredictor.predict after converting the original data batch to torch tensors. Override this method to add custom logic for processing the model input or output. Parameters inputs – A batch of data to predict on, represented as either a single PyTorch tensor or for multi-input models, a dictionary of tensors. Returns The model outputs, either as a single tensor or a dictionary of tensors. Example import numpy as np import torch from ray.train.torch import TorchPredictor # List outputs are not supported by default TorchPredictor. # So let's define a custom TorchPredictor and override call_model class MyModel(torch.nn.Module): def forward(self, input_tensor): return [input_tensor, input_tensor] # Use a custom predictor to format model output as a dict. 
class CustomPredictor(TorchPredictor): def call_model(self, inputs): model_output = super().call_model(inputs) return { str(i): model_output[i] for i in range(len(model_output)) } # create our data batch data_batch = np.array([1, 2]) # create custom predictor and predict predictor = CustomPredictor(model=MyModel()) predictions = predictor.predict(data_batch) print(f"Predictions: {predictions.get('0')}, {predictions.get('1')}") Predictions: [1 2], [1 2] DeveloperAPI: This API may change across minor Ray releases.ray.train.torch.TorchPredictor.from_checkpoint classmethod TorchPredictor.from_checkpoint(checkpoint: ray.air.checkpoint.Checkpoint, model: Optional[torch.nn.modules.module.Module] = None, use_gpu: bool = False) -> ray.train.torch.torch_predictor.TorchPredictor[source] Instantiate the predictor from a Checkpoint. The checkpoint is expected to be a result of TorchTrainer. Parameters checkpoint – The checkpoint to load the model and preprocessor from. It is expected to be from the result of a TorchTrainer run. model – If the checkpoint contains a model state dict, and not the model itself, then the state dict will be loaded to this model. If the checkpoint already contains the model itself, this model argument will be discarded. use_gpu – If set, the model will be moved to GPU on instantiation and prediction happens on GPU.ray.train.torch.TorchPredictor.from_pandas_udf classmethod TorchPredictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.predictor.Predictor Create a Predictor from a Pandas UDF. Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.torch.TorchPredictor.get_preprocessor TorchPredictor.get_preprocessor() -> Optional[ray.data.preprocessor.Preprocessor] Get the preprocessor to use prior to executing predictions.ray.train.torch.TorchPredictor.predict TorchPredictor.predict(data: Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]], dtype: Optional[Union[torch.dtype, Dict[str, torch.dtype]]] = None) -> Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]][source] Run inference on data batch. If the provided data is a single array or a dataframe/table with a single column, it will be converted into a single PyTorch tensor before being inputted to the model. If the provided data is a multi-column table or a dict of numpy arrays, it will be converted into a dict of tensors before being inputted to the model. This is useful for multi-modal inputs (for example your model accepts both image and text). Parameters data – A batch of input data of DataBatchType. dtype – The dtypes to use for the tensors. Either a single dtype for all tensors or a mapping from column name to dtype. Returns Prediction result. The return type will be the same as the input type. 
Return type DataBatchType Example

import numpy as np
import pandas as pd
import torch

import ray
from ray.train.torch import TorchPredictor

# Define a custom PyTorch module
class CustomModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(1, 1)
        self.linear2 = torch.nn.Linear(1, 1)

    def forward(self, input_dict: dict):
        out1 = self.linear1(input_dict["A"].unsqueeze(1))
        out2 = self.linear2(input_dict["B"].unsqueeze(1))
        return out1 + out2

# Set manual seed so we get consistent output
torch.manual_seed(42)

# Use a standard PyTorch model
model = torch.nn.Linear(2, 1)
predictor = TorchPredictor(model=model)

# Define our data
data = np.array([[1, 2], [3, 4]])
predictions = predictor.predict(data, dtype=torch.float)
print(f"Standard model predictions: {predictions}")
print("---")

# Use a custom PyTorch model with TorchPredictor
predictor = TorchPredictor(model=CustomModule())

# Define our data and predict with the custom model
data = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
predictions = predictor.predict(data, dtype=torch.float)
print(f"Custom model predictions: {predictions}")

Standard model predictions: {'predictions': array([[1.5487633], [3.8037925]], dtype=float32)}
---
Custom model predictions: predictions
0 [0.61623406]
1 [2.857038]

ray.train.torch.TorchPredictor.preferred_batch_format classmethod TorchPredictor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat DeveloperAPI: This API may change across minor Ray releases.ray.train.torch.TorchPredictor.set_preprocessor TorchPredictor.set_preprocessor(preprocessor: Optional[ray.data.preprocessor.Preprocessor]) -> None Set the preprocessor to use prior to executing predictions.ray.train.huggingface.TransformersPredictor class ray.train.huggingface.TransformersPredictor(pipeline: Optional[Pipeline] = None, preprocessor: Optional[Preprocessor] = None, use_gpu: bool = False)[source] Bases: ray.train.predictor.Predictor A predictor for HuggingFace Transformers PyTorch models. This predictor uses Transformers Pipelines for inference. Parameters pipeline – The Transformers pipeline to use for inference. preprocessor – A preprocessor used to transform data batches prior to prediction. use_gpu – If set, the model will be moved to GPU on instantiation and prediction happens on GPU. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods from_checkpoint(checkpoint, *[, ...]) Instantiate the predictor from a Checkpoint. from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data[, feature_columns]) Run inference on data batch. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.huggingface.TransformersPredictor.from_checkpoint classmethod TransformersPredictor.from_checkpoint(checkpoint: ray.air.checkpoint.Checkpoint, *, pipeline_cls: Optional[Type[Pipeline]] = None, model_cls: Optional[Union[str, Type[PreTrainedModel], Type[TFPreTrainedModel]]] = None, pretrained_model_kwargs: Optional[dict] = None, use_gpu: bool = False, **pipeline_kwargs) -> TransformersPredictor[source] Instantiate the predictor from a Checkpoint. The checkpoint is expected to be a result of TransformersTrainer. Note that the Transformers pipeline used internally expects to receive raw text.
If you have any Preprocessors in Checkpoint that tokenize the data, remove them by calling Checkpoint.set_preprocessor(None) beforehand. Parameters checkpoint – The checkpoint to load the model, tokenizer and preprocessor from. It is expected to be from the result of a TransformersTrainer run. pipeline_cls – A transformers.pipelines.Pipeline class to use. If not specified, will use the pipeline abstraction wrapper. model_cls – A transformers.PreTrainedModel class to create from the checkpoint. pretrained_model_kwargs – If set and a model_cls is provided, will be passed to TransformersCheckpoint.get_model(). use_gpu – If set, the model will be moved to GPU on instantiation and prediction happens on GPU. **pipeline_kwargs – Any kwargs to pass to the pipeline initialization. If pipeline_cls is None, this must contain the ‘task’ argument. Can be used to override the tokenizer with ‘tokenizer’. If use_gpu is True, ‘device’ will be set to 0 by default, unless ‘device_map’ is passed.ray.train.huggingface.TransformersPredictor.from_pandas_udf classmethod TransformersPredictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.predictor.Predictor Create a Predictor from a Pandas UDF. Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.huggingface.TransformersPredictor.get_preprocessor TransformersPredictor.get_preprocessor() -> Optional[ray.data.preprocessor.Preprocessor] Get the preprocessor to use prior to executing predictions.ray.train.huggingface.TransformersPredictor.predict TransformersPredictor.predict(data: Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]], feature_columns: Optional[Union[List[str], List[int]]] = None, **predict_kwargs) -> Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]][source] Run inference on data batch. The data is converted into a list (unless pipeline is a TableQuestionAnsweringPipeline) and passed to the pipeline object. Parameters data – A batch of input data. Either a pandas DataFrame or numpy array. feature_columns – The names or indices of the columns in the data to use as features to predict on. If None, use all columns. **pipeline_call_kwargs – additional kwargs to pass to the pipeline object. Examples >>> import pandas as pd >>> from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer >>> from transformers.pipelines import pipeline >>> from ray.train.huggingface import TransformersPredictor >>> >>> model_checkpoint = "gpt2" >>> tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer" >>> tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint) >>> >>> model_config = AutoConfig.from_pretrained(model_checkpoint) >>> model = AutoModelForCausalLM.from_config(model_config) >>> predictor = TransformersPredictor( ... pipeline=pipeline( ... task="text-generation", model=model, tokenizer=tokenizer ... ) ... ) >>> >>> prompts = pd.DataFrame( ... ["Complete me", "And me", "Please complete"], columns=["sentences"] ... ) >>> predictions = predictor.predict(prompts) Returns Prediction result.ray.train.huggingface.TransformersPredictor.preferred_batch_format classmethod TransformersPredictor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _predict_pandas and _predict_numpy are implemented. Defaults to Pandas. 
Can be overriden by predictor classes depending on the framework type, e.g. TorchPredictor prefers Numpy and XGBoostPredictor prefers Pandas as native batch format. DeveloperAPI: This API may change across minor Ray releases.ray.train.huggingface.TransformersPredictor.set_preprocessor TransformersPredictor.set_preprocessor(preprocessor: Optional[ray.data.preprocessor.Preprocessor]) -> None Set the preprocessor to use prior to executing predictions.ray.train.sklearn.SklearnPredictor class ray.train.sklearn.SklearnPredictor(estimator: sklearn.base.BaseEstimator, preprocessor: Optional[Preprocessor] = None)[source] Bases: ray.train.predictor.Predictor A predictor for scikit-learn compatible estimators. Parameters estimator – The fitted scikit-learn compatible estimator to use for predictions. preprocessor – A preprocessor used to transform data batches prior to prediction. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods from_checkpoint(checkpoint) Instantiate the predictor from a Checkpoint. from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data[, feature_columns, ...]) Run inference on data batch. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.sklearn.SklearnPredictor.from_checkpoint classmethod SklearnPredictor.from_checkpoint(checkpoint: ray.air.checkpoint.Checkpoint) -> ray.train.sklearn.sklearn_predictor.SklearnPredictor[source] Instantiate the predictor from a Checkpoint. The checkpoint is expected to be a result of SklearnTrainer. Parameters checkpoint – The checkpoint to load the model and preprocessor from. It is expected to be from the result of a SklearnTrainer run.ray.train.sklearn.SklearnPredictor.from_pandas_udf classmethod SklearnPredictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.predictor.Predictor Create a Predictor from a Pandas UDF. Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.sklearn.SklearnPredictor.get_preprocessor SklearnPredictor.get_preprocessor() -> Optional[ray.data.preprocessor.Preprocessor] Get the preprocessor to use prior to executing predictions.ray.train.sklearn.SklearnPredictor.predict SklearnPredictor.predict(data: Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]], feature_columns: Optional[Union[List[str], List[int]]] = None, num_estimator_cpus: Optional[int] = None, **predict_kwargs) -> Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]][source] Run inference on data batch. Parameters data – A batch of input data. Either a pandas DataFrame or numpy array. feature_columns – The names or indices of the columns in the data to use as features to predict on. If None, then use all columns in data. num_estimator_cpus – If set to a value other than None, will set the values of all n_jobs and thread_count parameters in the estimator (including in nested objects) to the given value. **predict_kwargs – Keyword arguments passed to estimator.predict. 
Examples >>> import numpy as np >>> from sklearn.ensemble import RandomForestClassifier >>> from ray.train.sklearn import SklearnPredictor >>> >>> train_X = np.array([[1, 2], [3, 4]]) >>> train_y = np.array([0, 1]) >>> >>> model = RandomForestClassifier().fit(train_X, train_y) >>> predictor = SklearnPredictor(estimator=model) >>> >>> data = np.array([[1, 2], [3, 4]]) >>> predictions = predictor.predict(data) >>> >>> # Only use first and second column as the feature >>> data = np.array([[1, 2, 8], [3, 4, 9]]) >>> predictions = predictor.predict(data, feature_columns=[0, 1]) >>> import pandas as pd >>> from sklearn.ensemble import RandomForestClassifier >>> from ray.train.sklearn import SklearnPredictor >>> >>> train_X = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"]) >>> train_y = pd.Series([0, 1]) >>> >>> model = RandomForestClassifier().fit(train_X, train_y) >>> predictor = SklearnPredictor(estimator=model) >>> >>> # Pandas dataframe. >>> data = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"]) >>> predictions = predictor.predict(data) >>> >>> # Only use first and second column as the feature >>> data = pd.DataFrame([[1, 2, 8], [3, 4, 9]], columns=["A", "B", "C"]) >>> predictions = predictor.predict(data, feature_columns=["A", "B"]) Returns Prediction result.ray.train.sklearn.SklearnPredictor.preferred_batch_format classmethod SklearnPredictor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _predict_pandas and _predict_numpy are implemented. Defaults to Pandas. Can be overriden by predictor classes depending on the framework type, e.g. TorchPredictor prefers Numpy and XGBoostPredictor prefers Pandas as native batch format. DeveloperAPI: This API may change across minor Ray releases.ray.train.sklearn.SklearnPredictor.set_preprocessor SklearnPredictor.set_preprocessor(preprocessor: Optional[ray.data.preprocessor.Preprocessor]) -> None Set the preprocessor to use prior to executing predictions.ray.train.rl.RLPredictor class ray.train.rl.RLPredictor(policy: ray.rllib.policy.policy.Policy, preprocessor: Optional[Preprocessor] = None)[source] Bases: ray.train.predictor.Predictor A predictor for RLlib policies. Parameters policy – The RLlib policy on which to perform inference on. preprocessor – A preprocessor used to transform data batches prior to prediction. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods from_checkpoint(checkpoint[, env]) Create RLPredictor from checkpoint. from_pandas_udf(pandas_udf) Create a Predictor from a Pandas UDF. get_preprocessor() Get the preprocessor to use prior to executing predictions. predict(data, **kwargs) Perform inference on a batch of data. preferred_batch_format() Batch format hint for upstream producers to try yielding best block format. set_preprocessor(preprocessor) Set the preprocessor to use prior to executing predictions. ray.train.rl.RLPredictor.from_checkpoint classmethod RLPredictor.from_checkpoint(checkpoint: ray.air.checkpoint.Checkpoint, env: Optional[Any] = None, **kwargs) -> ray.train.predictor.Predictor[source] Create RLPredictor from checkpoint. This method requires that the checkpoint was created with the Ray AIR RLTrainer. Parameters checkpoint – The checkpoint to load the model and preprocessor from. env – Optional environment to instantiate the trainer with. 
If not given, it is parsed from the saved trainer configuration instead.ray.train.rl.RLPredictor.from_pandas_udf classmethod RLPredictor.from_pandas_udf(pandas_udf: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]) -> ray.train.predictor.Predictor Create a Predictor from a Pandas UDF. Parameters pandas_udf – A function that takes a pandas.DataFrame and other optional kwargs and returns a pandas.DataFrame.ray.train.rl.RLPredictor.get_preprocessor RLPredictor.get_preprocessor() -> Optional[ray.data.preprocessor.Preprocessor] Get the preprocessor to use prior to executing predictions.ray.train.rl.RLPredictor.predict RLPredictor.predict(data: Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]], **kwargs) -> Union[numpy.ndarray, pandas.DataFrame, Dict[str, numpy.ndarray]] Perform inference on a batch of data. Parameters data – A batch of input data of type DataBatchType. kwargs – Arguments specific to predictor implementations. These are passed directly to _predict_numpy or _predict_pandas. Returns Prediction result. The return type will be the same as the input type. Return type DataBatchTyperay.train.rl.RLPredictor.preferred_batch_format classmethod RLPredictor.preferred_batch_format() -> ray.air.util.data_batch_conversion.BatchFormat Batch format hint for upstream producers to try yielding best block format. The preferred batch format to use if both _predict_pandas and _predict_numpy are implemented. Defaults to Pandas. Can be overridden by predictor classes depending on the framework type, e.g. TorchPredictor prefers Numpy and XGBoostPredictor prefers Pandas as native batch format. DeveloperAPI: This API may change across minor Ray releases.ray.train.rl.RLPredictor.set_preprocessor RLPredictor.set_preprocessor(preprocessor: Optional[ray.data.preprocessor.Preprocessor]) -> None Set the preprocessor to use prior to executing predictions. Model Serving in AIR See this model serving guide to see how Ray Serve can be used within the Ray AIR ecosystem. PredictorWrapper(predictor_cls, checkpoint) Serve any Ray AIR predictor from an AIR checkpoint. ray.serve.air_integrations.PredictorWrapper class ray.serve.air_integrations.PredictorWrapper(predictor_cls: Union[str, Type[Predictor]], checkpoint: Union[Checkpoint, str], http_adapter: Union[str, Callable[[Any], Any]] = 'ray.serve.http_adapters.json_to_ndarray', batching_params: Optional[Union[Dict[str, int], bool]] = None, predict_kwargs: Optional[Dict[str, Any]] = None, **predictor_from_checkpoint_kwargs)[source] Bases: ray.serve.air_integrations.SimpleSchemaIngress Serve any Ray AIR predictor from an AIR checkpoint. Parameters predictor_cls – The class or path for predictor class. The type must be a subclass of ray.train.predictor.Predictor. checkpoint – The checkpoint object or a URI to load the checkpoint from. The checkpoint object must be an instance of ray.air.checkpoint.Checkpoint. The URI string will be used to construct a checkpoint object using Checkpoint.from_uri("uri_to_load_from"). http_adapter – The FastAPI input conversion function. By default, Serve will use the NdArray schema and convert to numpy array. You can pass in any FastAPI dependency resolver that returns an array. When you pass in a string, Serve will import it. Please refer to Serve HTTP adapters documentation to learn more. batching_params – override the default parameters to ray.serve.batch(). Pass False to disable batching. predict_kwargs – optional keyword arguments passed to the Predictor.predict method upon each call.
**predictor_from_checkpoint_kwargs – Additional keyword arguments passed to the Predictor.from_checkpoint() call. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods predict(inp) Perform inference directly without HTTP. reconfigure(config) Reconfigure the model from a config checkpoint.

ray.serve.air_integrations.PredictorWrapper.predict async PredictorWrapper.predict(inp)[source] Perform inference directly without HTTP.

ray.serve.air_integrations.PredictorWrapper.reconfigure PredictorWrapper.reconfigure(config)[source] Reconfigure the model from a config checkpoint.

ray.serve.air_integrations.PredictorDeployment alias of Deployment(name=PredictorDeployment,version=None,route_prefix=/PredictorDeployment)

Benchmarks Below we document key performance benchmarks for common AIR tasks and workflows.

Bulk Ingest This task uses the DummyTrainer module to ingest 200 GiB of synthetic data. We test out the performance across different cluster sizes. Bulk Ingest Script Bulk Ingest Cluster Configuration For this benchmark, we configured the nodes to have reasonable disk size and throughput to account for object spilling.

aws:
  BlockDeviceMappings:
    - DeviceName: /dev/sda1
      Ebs:
        Iops: 5000
        Throughput: 1000
        VolumeSize: 1000
        VolumeType: gp3

Cluster Setup | Performance | Disk Spill | Command
1 m5.4xlarge node (1 actor) | 390 s (0.51 GiB/s) | 205 GiB | python data_benchmark.py --dataset-size-gb=200 --num-workers=1
5 m5.4xlarge nodes (5 actors) | 70 s (2.85 GiB/s) | 206 GiB | python data_benchmark.py --dataset-size-gb=200 --num-workers=5
20 m5.4xlarge nodes (20 actors) | 3.8 s (52.6 GiB/s) | 0 GiB | python data_benchmark.py --dataset-size-gb=200 --num-workers=20

XGBoost Batch Prediction This task uses the BatchPredictor module to process different amounts of data using an XGBoost model. We test out the performance across different cluster sizes and data sizes. XGBoost Prediction Script XGBoost Cluster Configuration TODO: Add script for generating data and running the benchmark.

Cluster Setup | Data Size | Performance | Command
1 m5.4xlarge node (1 actor) | 10 GB (26M rows) | 275 s (94.5k rows/s) | python xgboost_benchmark.py --size 10GB
10 m5.4xlarge nodes (10 actors) | 100 GB (260M rows) | 331 s (786k rows/s) | python xgboost_benchmark.py --size 100GB

XGBoost training This task uses the XGBoostTrainer module to train on different sizes of data with different amounts of parallelism. XGBoost parameters were kept as defaults for xgboost==1.6.1 for this task. XGBoost Training Script XGBoost Cluster Configuration

Cluster Setup | Data Size | Performance | Command
1 m5.4xlarge node (1 actor) | 10 GB (26M rows) | 692 s | python xgboost_benchmark.py --size 10GB
10 m5.4xlarge nodes (10 actors) | 100 GB (260M rows) | 693 s | python xgboost_benchmark.py --size 100GB

GPU image batch prediction This task uses the BatchPredictor module to process different amounts of data using a PyTorch pre-trained ResNet model. We test out the performance across different cluster sizes and data sizes.
GPU image batch prediction script GPU prediction small cluster configuration GPU prediction large cluster configuration

Cluster Setup | Data Size | Performance | Command
1 g4dn.8xlarge node | 1 GB (1623 images) | 46.12 s (35.19 images/sec) | python gpu_batch_prediction.py --data-size-gb=1
1 g4dn.8xlarge node | 20 GB (32460 images) | 285.2 s (113.81 images/sec) | python gpu_batch_prediction.py --data-size-gb=20
4 g4dn.12xlarge nodes | 100 GB (162300 images) | 304.01 s (533.86 images/sec) | python gpu_batch_prediction.py --data-size-gb=100

GPU image training This task uses the TorchTrainer module to train on different amounts of data using a PyTorch ResNet model. We test out the performance across different cluster sizes and data sizes. GPU image training script GPU training small cluster configuration GPU training large cluster configuration For multi-host distributed training on AWS, you need to ensure that the EC2 instances are in the same VPC and that all ports are open in the security group.

Cluster Setup | Data Size | Performance | Command
1 g3.8xlarge node (1 worker) | 1 GB (1623 images) | 79.76 s (2 epochs, 40.7 images/sec) | python pytorch_training_e2e.py --data-size-gb=1
1 g3.8xlarge node (1 worker) | 20 GB (32460 images) | 1388.33 s (2 epochs, 46.76 images/sec) | python pytorch_training_e2e.py --data-size-gb=20
4 g3.16xlarge nodes (16 workers) | 100 GB (162300 images) | 434.95 s (2 epochs, 746.29 images/sec) | python pytorch_training_e2e.py --data-size-gb=100 --num-workers=16

PyTorch Training Parity This task checks the performance parity between native PyTorch Distributed and Ray Train’s distributed TorchTrainer. We demonstrate that the performance is similar (within 2.5%) between the two frameworks. Performance may vary greatly across different model, hardware, and cluster configurations. The reported times are for the raw training times. There is an unreported constant setup overhead of a few seconds for both methods that is negligible for longer training runs. PyTorch comparison training script PyTorch comparison CPU cluster configuration PyTorch comparison GPU cluster configuration

Cluster Setup | Dataset | Performance | Command
4 m5.2xlarge nodes (4 workers) | FashionMNIST | 196.64 s (vs 194.90 s PyTorch) | python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 4 --cpus-per-worker 8
4 m5.2xlarge nodes (16 workers) | FashionMNIST | 430.88 s (vs 475.97 s PyTorch) | python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 2
4 g4dn.12xlarge nodes (16 workers) | FashionMNIST | 149.80 s (vs 146.46 s PyTorch) | python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 4 --use-gpu

TensorFlow Training Parity This task checks the performance parity between native TensorFlow Distributed and Ray Train’s distributed TensorflowTrainer. We demonstrate that the performance is similar (within 1%) between the two frameworks. Performance may vary greatly across different model, hardware, and cluster configurations. The reported times are for the raw training times. There is an unreported constant setup overhead of a few seconds for both methods that is negligible for longer training runs. The batch size and number of epochs are different for the GPU benchmark, resulting in a longer runtime.
TensorFlow comparison training script TensorFlow comparison CPU cluster configuration TensorFlow comparison GPU cluster configuration

Cluster Setup | Dataset | Performance | Command
4 m5.2xlarge nodes (4 workers) | FashionMNIST | 78.81 s (vs 79.67 s TensorFlow) | python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 4 --cpus-per-worker 8
4 m5.2xlarge nodes (16 workers) | FashionMNIST | 64.57 s (vs 67.45 s TensorFlow) | python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 2
4 g4dn.12xlarge nodes (16 workers) | FashionMNIST | 465.16 s (vs 461.74 s TensorFlow) | python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 200 --num-workers 16 --cpus-per-worker 4 --batch-size 64 --use-gpu

Ray Data: Scalable Datasets for ML Ray Data is a scalable data processing library for ML workloads. It provides flexible and performant APIs for scaling Offline batch inference and Data preprocessing and ingest for ML training. Ray Data uses streaming execution to efficiently process large datasets.

Install Ray Data To install Ray Data, run: $ pip install -U 'ray[data]' To learn more about installing Ray and its libraries, see Installing Ray.

Learn more Ray Data Overview Get an overview of Ray Data, the workloads that it supports, and how it compares to alternatives. Ray Data Overview Key Concepts Understand the key concepts behind Ray Data. Learn what Datasets are and how they’re used. Learn Key Concepts User Guides Learn how to use Ray Data, from basic usage to end-to-end guides. Learn how to use Ray Data Examples Find both simple and scaling-out examples of using Ray Data. Ray Data Examples API Get more in-depth information about the Ray Data API. Read the API Reference Ray blogs Get the latest on engineering updates from the Ray team and how companies are using Ray Data. Read the Ray blogs

Ray Data Overview Ray Data is a scalable data processing library for ML workloads, particularly suited for the following workloads:
Offline batch inference
Data preprocessing and ingest for ML training
It provides flexible and performant APIs for distributed data processing:
Simple transformations such as maps (map_batches())
Global and grouped aggregations (groupby())
Shuffling operations (random_shuffle(), sort(), repartition())
Ray Data is built on top of Ray, so it scales effectively to large clusters and offers scheduling support for both CPU and GPU resources. Ray Data uses streaming execution to efficiently process large datasets. Ray Data doesn’t have a SQL interface and isn’t meant as a replacement for generic ETL pipelines like Spark.

Why choose Ray Data? Faster and cheaper for modern deep learning applications Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming Dataset primitive, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently. By using Ray Data, your GPUs are no longer idle during CPU computation, reducing the overall cost of the batch inference job. Cloud, framework, and data format agnostic Ray Data has no restrictions on cloud provider, ML framework, or data format. Through the Ray cluster launcher, you can start a Ray cluster on AWS, GCP, or Azure clouds.
You can use any ML framework of your choice, including PyTorch, HuggingFace, or TensorFlow. Ray Data also does not require a particular file format, and supports a wide variety of formats including CSV, Parquet, and raw images. Out of the box scaling Ray Data is built on Ray, so it easily scales to many machines. Code that works on one machine also runs on a large cluster without any changes. Python first With Ray Data, you can express your inference job directly in Python instead of YAML or other formats, allowing for faster iterations, easier debugging, and a native developer experience.

Offline Batch Inference Get in touch to get help using Ray Data, the industry’s fastest and cheapest solution for offline batch inference. Offline batch inference is a process for generating model predictions on a fixed set of input data. Ray Data offers an efficient and scalable solution for batch inference, providing faster execution and cost-effectiveness for deep learning applications. For more details on how to use Ray Data for offline batch inference, see the batch inference user guide.

How does Ray Data compare to X for offline inference?

Batch Services: AWS Batch, GCP Batch Cloud providers such as AWS, GCP, and Azure provide batch services to manage compute infrastructure for you. Each service uses the same process: you provide the code, and the service runs your code on each node in a cluster. However, while infrastructure management is necessary, it is often not enough. These services have limitations, such as a lack of software libraries to address optimized parallelization, efficient data transfer, and easy debugging. These solutions are suitable only for experienced users who can write their own optimized batch inference code. Ray Data abstracts away not only the infrastructure management, but also sharding your dataset, parallelizing the inference over these shards, and transferring data from storage to CPU to GPU.

Online inference solutions: Bento ML, SageMaker Batch Transform Solutions like Bento ML, SageMaker Batch Transform, or Ray Serve provide APIs to make it easy to write performant inference code and can abstract away infrastructure complexities. But they are designed for online inference rather than offline batch inference, which are two different problems with different sets of requirements. These solutions introduce additional complexity like HTTP, and cannot effectively handle large datasets, leading inference service providers like Bento ML to integrate with Apache Spark for offline inference. Ray Data is built for offline batch jobs, without all the extra complexities of starting servers or sending HTTP requests. For a more detailed performance comparison between Ray Data and SageMaker Batch Transform, see Offline Batch Inference: Comparing Ray, Apache Spark, and SageMaker.

Distributed Data Processing Frameworks: Apache Spark Ray Data handles many of the same batch processing workloads as Apache Spark, but with a streaming paradigm that is better suited for GPU workloads for deep learning inference. For a more detailed performance comparison between Ray Data and Apache Spark, see Offline Batch Inference: Comparing Ray, Apache Spark, and SageMaker.
Batch inference case studies
Sewer AI speeds up object detection on videos 3x using Ray Data
Spotify’s new ML platform built on Ray, using Ray Data for batch inference

Preprocessing and ingest for ML training Use Ray Data to load and preprocess data for distributed ML training pipelines in a streaming fashion. Ray Data serves as a last-mile bridge from storage or ETL pipeline outputs to distributed applications and libraries in Ray. Don’t use it as a replacement for more general data processing systems.

How does Ray Data compare to X for ML training ingest?

PyTorch Dataset and DataLoader
Framework-agnostic: Datasets is framework-agnostic and portable between different distributed training frameworks, while Torch datasets are specific to Torch.
No built-in IO layer: Torch datasets do not have an I/O layer for common file formats or in-memory exchange with other frameworks; users need to bring in other libraries and roll this integration themselves.
Generic distributed data processing: Datasets is more general: it can handle generic distributed operations, including global per-epoch shuffling, which would otherwise have to be implemented by stitching together two separate systems. Torch datasets would require such stitching for anything more involved than batch-based preprocessing, and do not natively support shuffling across worker shards. See our blog post on why this shared infrastructure is important for 3rd generation ML architectures.
Lower overhead: Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines of Torch datasets.

TensorFlow Dataset
Framework-agnostic: Datasets is framework-agnostic and portable between different distributed training frameworks, while TensorFlow datasets are specific to TensorFlow.
Unified single-node and distributed: Datasets unifies single and multi-node training under the same abstraction. TensorFlow datasets present separate concepts for distributed data loading and prevent code from being seamlessly scaled to larger clusters.
Generic distributed data processing: Datasets is more general: it can handle generic distributed operations, including global per-epoch shuffling, which would otherwise have to be implemented by stitching together two separate systems. TensorFlow datasets would require such stitching for anything more involved than basic preprocessing, and do not natively support full shuffling across worker shards; only file interleaving is supported. See our blog post on why this shared infrastructure is important for 3rd generation ML architectures.
Lower overhead: Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines of TensorFlow datasets.

Petastorm
Supported data types: Petastorm only supports Parquet data, while Ray Data supports many file formats.
Lower overhead: Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines used by Petastorm.
No data processing: Petastorm does not expose any data processing APIs.

NVTabular
Supported data types: NVTabular only supports tabular (Parquet, CSV, Avro) data, while Ray Data supports many other file formats.
Lower overhead: Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines used by NVTabular.
Heterogeneous compute: NVTabular doesn’t support mixing heterogeneous resources in dataset transforms (e.g. both CPU and GPU transformations), while Ray Data supports this. ML ingest case studies Predibase speeds up image augmentation for model training using Ray Data Spotify’s new ML platform built on Ray, using Ray Data for feature preprocessing Key Concepts Learn about Dataset and the functionality it provides. This guide provides a lightweight introduction to: Loading data Transforming data Consuming data Saving data Datasets Ray Data’s main abstraction is a Dataset, which is a distributed data collection. Datasets are designed for machine learning, and they can represent data collections that exceed a single machine’s memory. Loading data Create datasets from on-disk files, Python objects, and cloud storage services like S3. Ray Data can read from any filesystem supported by Arrow. import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") ds.show(limit=1) {'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0} To learn more about creating datasets, read Loading data. Transforming data Apply user-defined functions (UDFs) to transform datasets. Ray executes transformations in parallel for performance. from typing import Dict import numpy as np # Compute a "petal area" attribute. def transform_batch(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: vec_a = batch["petal length (cm)"] vec_b = batch["petal width (cm)"] batch["petal area (cm^2)"] = vec_a * vec_b return batch transformed_ds = ds.map_batches(transform_batch) print(transformed_ds.materialize()) MaterializedDataset( num_blocks=..., num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64, petal area (cm^2): double } ) To learn more about transforming datasets, read Transforming data. Consuming data Pass datasets to Ray Tasks or Actors, and access records with methods like take_batch() and iter_batches(). Local print(transformed_ds.take_batch(batch_size=3)) {'sepal length (cm)': array([5.1, 4.9, 4.7]), 'sepal width (cm)': array([3.5, 3. , 3.2]), 'petal length (cm)': array([1.4, 1.4, 1.3]), 'petal width (cm)': array([0.2, 0.2, 0.2]), 'target': array([0, 0, 0]), 'petal area (cm^2)': array([0.28, 0.28, 0.26])} Tasks @ray.remote def consume(ds: ray.data.Dataset) -> int: num_batches = 0 for batch in ds.iter_batches(batch_size=8): num_batches += 1 return num_batches ray.get(consume.remote(transformed_ds)) Actors @ray.remote class Worker: def train(self, data_iterator): for batch in data_iterator.iter_batches(batch_size=8): pass workers = [Worker.remote() for _ in range(4)] shards = transformed_ds.streaming_split(n=4, equal=True) ray.get([w.train.remote(s) for w, s in zip(workers, shards)]) To learn more about consuming datasets, see Iterating over Data and Saving Data. Saving data Call methods like write_parquet() to save dataset contents to local or remote filesystems. # The number of blocks can be non-determinstic. Repartition the dataset beforehand # so that the number of written files is consistent. transformed_ds = transformed_ds.repartition(2) import os transformed_ds.write_parquet("/tmp/iris") print(sorted(os.listdir("/tmp/iris"))) ['..._000000.parquet', '..._000001.parquet'] To learn more about saving dataset contents, see Saving data. User Guides If you’re new to Ray Data, we recommend starting with the Ray Data Key Concepts. 
This user guide will help you navigate the Ray Data project and show you how achieve several tasks. Loading Data Ray Data loads data from various sources. This guide shows you how to: Read files like images Load in-memory data like pandas DataFrames Read databases like MySQL Reading files Ray Data reads files from local disk or cloud storage in a variety of file formats. To view the full list of supported file formats, see the Input/Output reference. Parquet To read Parquet files, call read_parquet(). import ray ds = ray.data.read_parquet("local:///tmp/iris.parquet") print(ds.schema()) Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string Images To read raw images, call read_images(). Ray Data represents images as NumPy ndarrays. import ray ds = ray.data.read_images("local:///tmp/batoidea/JPEGImages/") print(ds.schema()) Column Type ------ ---- image numpy.ndarray(shape=(32, 32, 3), dtype=uint8) Text To read lines of text, call read_text(). import ray ds = ray.data.read_text("local:///tmp/this.txt") print(ds.schema()) Column Type ------ ---- text string CSV To read CSV files, call read_csv(). import ray ds = ray.data.read_csv("local:///tmp/iris.csv") print(ds.schema()) Column Type ------ ---- sepal length (cm) double sepal width (cm) double petal length (cm) double petal width (cm) double target int64 Binary To read raw binary files, call read_binary_files(). import ray ds = ray.data.read_binary_files("local:///tmp/file.dat") print(ds.schema()) Column Type ------ ---- bytes binary TFRecords To read TFRecords files, call read_tfrecords(). import ray ds = ray.data.read_tfrecords("local:///tmp/iris.tfrecords") print(ds.schema()) Column Type ------ ---- sepal length (cm) double sepal width (cm) double petal length (cm) double petal width (cm) double target int64 Reading files from local disk To read files from local disk, call a function like read_parquet() and specify paths with the local:// schema. Paths can point to files or directories. To read formats other than Parquet, see the Input/Output reference. If your files are accessible on every node, exclude local:// to parallelize the read tasks across the cluster. import ray ds = ray.data.read_parquet("local:///tmp/iris.parquet") print(ds.schema()) Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string Reading files from cloud storage To read files in cloud storage, authenticate all nodes with your cloud service provider. Then, call a method like read_parquet() and specify URIs with the appropriate schema. URIs can point to buckets, folders, or objects. To read formats other than Parquet, see the Input/Output reference. S3 To read files from Amazon S3, specify URIs with the s3:// scheme. import ray ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet") print(ds.schema()) Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string GCS To read files from Google Cloud Storage, install the Filesystem interface to Google Cloud Storage pip install gcsfs Then, create a GCSFileSystem and specify URIs with the gcs:// scheme. 
import gcsfs
import ray

filesystem = gcsfs.GCSFileSystem(project="my-google-project")
# Replace the bucket and path below with the location of your Parquet data in GCS.
ds = ray.data.read_parquet("gcs://my-bucket/iris.parquet", filesystem=filesystem)
print(ds.schema())

Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string

ABS To read files from Azure Blob Storage, install the Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage pip install adlfs Then, create an AzureBlobFileSystem and specify URIs with the az:// scheme. import adlfs import ray ds = ray.data.read_parquet( "az://ray-example-data/iris.parquet", filesystem=adlfs.AzureBlobFileSystem(account_name="azureopendatastorage") ) print(ds.schema()) Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string

Reading files from NFS To read files from NFS filesystems, call a function like read_parquet() and specify files on the mounted filesystem. Paths can point to files or directories. To read formats other than Parquet, see the Input/Output reference. import ray ds = ray.data.read_parquet("/mnt/cluster_storage/iris.parquet") print(ds.schema()) Column Type ------ ---- sepal.length double sepal.width double petal.length double petal.width double variety string

Handling compressed files To read a compressed file, specify compression in arrow_open_stream_args. You can use any Codec supported by Arrow. import ray ds = ray.data.read_csv( "s3://anonymous@ray-example-data/iris.csv.gz", arrow_open_stream_args={"compression": "gzip"}, )

Loading data from other libraries Loading data from single-node data libraries Ray Data interoperates with libraries like pandas, NumPy, and Arrow. Python objects To create a Dataset from Python objects, call from_items() and pass in a list of Dict. Ray Data treats each Dict as a row. import ray ds = ray.data.from_items([ {"food": "spam", "price": 9.34}, {"food": "ham", "price": 5.37}, {"food": "eggs", "price": 0.94} ]) print(ds) MaterializedDataset( num_blocks=3, num_rows=3, schema={food: string, price: double} ) You can also create a Dataset from a list of regular Python objects. import ray ds = ray.data.from_items([1, 2, 3, 4, 5]) print(ds) MaterializedDataset(num_blocks=5, num_rows=5, schema={item: int64}) NumPy To create a Dataset from a NumPy array, call from_numpy(). Ray Data treats the outer axis as the row dimension. import numpy as np import ray array = np.ones((3, 2, 2)) ds = ray.data.from_numpy(array) print(ds) MaterializedDataset( num_blocks=1, num_rows=3, schema={data: numpy.ndarray(shape=(2, 2), dtype=double)} ) pandas To create a Dataset from a pandas DataFrame, call from_pandas(). import pandas as pd import ray df = pd.DataFrame({ "food": ["spam", "ham", "eggs"], "price": [9.34, 5.37, 0.94] }) ds = ray.data.from_pandas(df) print(ds) MaterializedDataset( num_blocks=1, num_rows=3, schema={food: object, price: float64} ) PyArrow To create a Dataset from an Arrow table, call from_arrow(). import pyarrow as pa table = pa.table({ "food": ["spam", "ham", "eggs"], "price": [9.34, 5.37, 0.94] }) ds = ray.data.from_arrow(table) print(ds) MaterializedDataset( num_blocks=1, num_rows=3, schema={food: string, price: double} ) Loading data from distributed DataFrame libraries Ray Data interoperates with distributed data processing frameworks like Dask, Spark, Modin, and Mars. Dask To create a Dataset from a Dask DataFrame, call from_dask(). This function constructs a Dataset backed by the distributed Pandas DataFrame partitions that underlie the Dask DataFrame.
import dask.dataframe as dd import pandas as pd import ray df = pd.DataFrame({"col1": list(range(10000)), "col2": list(map(str, range(10000)))}) ddf = dd.from_pandas(df, npartitions=4) # Create a Dataset from a Dask DataFrame. ds = ray.data.from_dask(ddf) ds.show(3) {'col1': 0, 'col2': '0'} {'col1': 1, 'col2': '1'} {'col1': 2, 'col2': '2'} Spark To create a Dataset from a Spark DataFrame, call from_spark(). This function creates a Dataset backed by the distributed Spark DataFrame partitions that underlie the Spark DataFrame. import ray import raydp spark = raydp.init_spark(app_name="Spark -> Datasets Example", num_executors=2, executor_cores=2, executor_memory="500MB") df = spark.createDataFrame([(i, str(i)) for i in range(10000)], ["col1", "col2"]) ds = ray.data.from_spark(df) ds.show(3) {'col1': 0, 'col2': '0'} {'col1': 1, 'col2': '1'} {'col1': 2, 'col2': '2'} Modin To create a Dataset from a Modin DataFrame, call from_modin(). This function constructs a Dataset backed by the distributed Pandas DataFrame partitions that underlie the Modin DataFrame. import modin.pandas as md import pandas as pd import ray df = pd.DataFrame({"col1": list(range(10000)), "col2": list(map(str, range(10000)))}) mdf = md.DataFrame(df) # Create a Dataset from a Modin DataFrame. ds = ray.data.from_modin(mdf) ds.show(3) {'col1': 0, 'col2': '0'} {'col1': 1, 'col2': '1'} {'col1': 2, 'col2': '2'} Mars To create a Dataset from a Mars DataFrame, call from_mars(). This function constructs a Dataset backed by the distributed Pandas DataFrame partitions that underlie the Mars DataFrame. import mars import mars.dataframe as md import pandas as pd import ray cluster = mars.new_cluster_in_ray(worker_num=2, worker_cpu=1) df = pd.DataFrame({"col1": list(range(10000)), "col2": list(map(str, range(10000)))}) mdf = md.DataFrame(df, num_partitions=8) # Create a tabular Dataset from a Mars DataFrame. ds = ray.data.from_mars(mdf) ds.show(3) {'col1': 0, 'col2': '0'} {'col1': 1, 'col2': '1'} {'col1': 2, 'col2': '2'} Loading data from ML libraries Ray Data interoperates with HuggingFace and TensorFlow datasets. HuggingFace To convert a 🤗 Dataset to a Ray Dataset, call from_huggingface(). This function accesses the underlying Arrow table and converts it to a Dataset directly. from_huggingface doesn’t support parallel reads. This isn’t an issue with in-memory 🤗 Datasets, but may fail with large memory-mapped 🤗 Datasets. Also, 🤗 IterableDataset objects aren’t supported. import ray.data from datasets import load_dataset hf_ds = load_dataset("wikitext", "wikitext-2-raw-v1") ray_ds = ray.data.from_huggingface(hf_ds) ray_ds["train"].take(2) [{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}] TensorFlow To convert a TensorFlow dataset to a Ray Dataset, call from_tf(). from_tf doesn’t support parallel reads. Only use this function with small datasets like MNIST or CIFAR. import ray import tensorflow_datasets as tfds tf_ds, _ = tfds.load("cifar10", split=["train", "test"]) ds = ray.data.from_tf(tf_ds) print(ds) MaterializedDataset( num_blocks=..., num_rows=50000, schema={ id: binary, image: numpy.ndarray(shape=(32, 32, 3), dtype=uint8), label: int64 } ) Reading databases Ray Data reads from databases like MySQL, Postgres, and MongoDB. Reading SQL databases Call read_sql() to read data from a database that provides a Python DB API2-compliant connector.
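Any connector that follows the DB API2 specification can be plugged into this pattern. As a minimal, self-contained sketch, Python's built-in sqlite3 module works as the connector; the example.db file and the movie table here are hypothetical placeholders, not something Ray ships:

import sqlite3

import ray

def create_connection():
    # sqlite3 connections are DB API2-compliant, so read_sql can use them directly.
    return sqlite3.connect("example.db")

# Read the hypothetical movie table into a Dataset.
dataset = ray.data.read_sql("SELECT * FROM movie", create_connection)

The hosted databases below follow the same create_connection pattern.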
MySQL To read from MySQL, install MySQL Connector/Python. It’s the first-party MySQL database connector. pip install mysql-connector-python Then, define your connection logic and query the database. import mysql.connector import ray def create_connection(): return mysql.connector.connect( user="admin", password=..., host="example-mysql-database.c2c2k1yfll7o.us-west-2.rds.amazonaws.com", connection_timeout=30, database="example", ) # Get all movies dataset = ray.data.read_sql("SELECT * FROM movie", create_connection) # Get movies after the year 1980 dataset = ray.data.read_sql( "SELECT title, score FROM movie WHERE year >= 1980", create_connection ) # Get the number of movies per year dataset = ray.data.read_sql( "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection ) PostgreSQL To read from PostgreSQL, install Psycopg 2. It’s the most popular PostgreSQL database connector. pip install psycopg2-binary Then, define your connection logic and query the database. import psycopg2 import ray def create_connection(): return psycopg2.connect( user="postgres", password=..., host="example-postgres-database.c2c2k1yfll7o.us-west-2.rds.amazonaws.com", dbname="example", ) # Get all movies dataset = ray.data.read_sql("SELECT * FROM movie", create_connection) # Get movies after the year 1980 dataset = ray.data.read_sql( "SELECT title, score FROM movie WHERE year >= 1980", create_connection ) # Get the number of movies per year dataset = ray.data.read_sql( "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection ) Snowflake To read from Snowflake, install the Snowflake Connector for Python. pip install snowflake-connector-python Then, define your connection logic and query the database. import snowflake.connector import ray def create_connection(): return snowflake.connector.connect( user=..., password=..., account="ZZKXUVH-IPB52023", database="example", ) # Get all movies dataset = ray.data.read_sql("SELECT * FROM movie", create_connection) # Get movies after the year 1980 dataset = ray.data.read_sql( "SELECT title, score FROM movie WHERE year >= 1980", create_connection ) # Get the number of movies per year dataset = ray.data.read_sql( "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection ) Databricks To read from Databricks, install the Databricks SQL Connector for Python. pip install databricks-sql-connector Then, define your connection logic and read from the Databricks SQL warehouse. from databricks import sql import ray def create_connection(): return sql.connect( server_hostname="dbc-1016e3a4-d292.cloud.databricks.com", http_path="/sql/1.0/warehouses/a918da1fc0b7fed0", access_token=..., ) # Get all movies dataset = ray.data.read_sql("SELECT * FROM movie", create_connection) # Get movies after the year 1980 dataset = ray.data.read_sql( "SELECT title, score FROM movie WHERE year >= 1980", create_connection ) # Get the number of movies per year dataset = ray.data.read_sql( "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection ) BigQuery To read from BigQuery, install the Python Client for Google BigQuery. This package includes a DB API2-compliant database connector. pip install google-cloud-bigquery Then, define your connection logic and query the dataset. from google.cloud import bigquery from google.cloud.bigquery import dbapi import ray def create_connection(): client = bigquery.Client(...)
return dbapi.Connection(client) # Get all movies dataset = ray.data.read_sql("SELECT * FROM movie", create_connection) # Get movies after the year 1980 dataset = ray.data.read_sql( "SELECT title, score FROM movie WHERE year >= 1980", create_connection ) # Get the number of movies per year dataset = ray.data.read_sql( "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection ) Reading MongoDB To read data from MongoDB, call read_mongo() and specify the source URI, database, and collection. You also need to specify a pipeline to run against the collection. import ray # Read a local MongoDB. ds = ray.data.read_mongo( uri="mongodb://localhost:27017", database="my_db", collection="my_collection", pipeline=[{"$match": {"col": {"$gte": 0, "$lt": 10}}}, {"$sort": "sort_col"}], ) # Reading a remote MongoDB is the same. ds = ray.data.read_mongo( uri="mongodb://username:password@mongodb0.example.com:27017/?authSource=admin", database="my_db", collection="my_collection", pipeline=[{"$match": {"col": {"$gte": 0, "$lt": 10}}}, {"$sort": "sort_col"}], ) # Write back to MongoDB. ds.write_mongo( uri="mongodb://username:password@mongodb0.example.com:27017/?authSource=admin", database="my_db", collection="my_collection", ) Creating synthetic data Synthetic datasets can be useful for testing and benchmarking. Int Range To create a synthetic Dataset from a range of integers, call range(). Ray Data stores the integer range in a single column. import ray ds = ray.data.range(10000) print(ds.schema()) Column Type ------ ---- id int64 Tensor Range To create a synthetic Dataset containing arrays, call range_tensor(). Ray Data packs an integer range into ndarrays of the provided shape. import ray ds = ray.data.range_tensor(10, shape=(64, 64)) print(ds.schema()) Column Type ------ ---- data numpy.ndarray(shape=(64, 64), dtype=int64) Loading other data sources If Ray Data can’t load your data, subclass Datasource. Then, construct an instance of your custom datasource and pass it to read_datasource(). # Read from a custom datasource. ds = ray.data.read_datasource(YourCustomDatasource(), **read_args) # Write to a custom datasource. ds.write_datasource(YourCustomDatasource(), **write_args) For an example, see Implementing a Custom Datasource. Performance considerations The dataset parallelism determines the number of blocks the base data will be split into for parallel reads. Ray Data will decide internally how many read tasks to run concurrently to best utilize the cluster, ranging from 1...parallelism tasks. In other words, the higher the parallelism, the smaller the data blocks in the Dataset and hence the more opportunity for parallel execution. This default parallelism can be overridden via the parallelism argument; see the performance guide for more information on how to tune this read parallelism. Transforming Data Transformations let you process and modify your dataset. You can compose transformations to express a chain of computations. Transformations are lazy by default. They aren’t executed until you trigger consumption of the data by iterating over the Dataset, saving the Dataset, or inspecting properties of the Dataset. This guide shows you how to: Transform rows Transform batches Groupby and transform groups Shuffle rows Repartition data Transforming rows To transform rows, call map() or flat_map(). Transforming rows with map If your transformation returns exactly one row for each input row, call map().
import os from typing import Any, Dict import ray def parse_filename(row: Dict[str, Any]) -> Dict[str, Any]: row["filename"] = os.path.basename(row["path"]) return row ds = ( ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple", include_paths=True) .map(parse_filename) ) If your transformation is vectorized, call map_batches() for better performance. To learn more, see Transforming batches. Transforming rows with flat map If your transformation returns multiple rows for each input row, call flat_map(). from typing import Any, Dict, List import ray def duplicate_row(row: Dict[str, Any]) -> List[Dict[str, Any]]: return [row] * 2 print( ray.data.range(3) .flat_map(duplicate_row) .take_all() ) [{'id': 0}, {'id': 0}, {'id': 1}, {'id': 1}, {'id': 2}, {'id': 2}] Transforming batches If your transformation is vectorized like most NumPy or pandas operations, transforming batches is more performant than transforming rows. Choosing between tasks and actors Ray Data transforms batches with either tasks or actors. Actors perform setup exactly once. In contrast, tasks require setup every batch. So, if your transformation involves expensive setup like downloading model weights, use actors. Otherwise, use tasks. To learn more about tasks and actors, read the Ray Core Key Concepts. Transforming batches with tasks To transform batches with tasks, call map_batches(). Ray Data uses tasks by default. from typing import Dict import numpy as np import ray def increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: batch["image"] = np.clip(batch["image"] + 4, 0, 255) return batch ds = ( ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") .map_batches(increase_brightness) ) Transforming batches with actors To transform batches with actors, complete these steps: Implement a class. Perform setup in __init__ and transform data in __call__. Create an ActorPoolStrategy and configure the number of concurrent workers. Each worker transforms a partition of data. Call map_batches() and pass your ActorPoolStrategy to compute. CPU from typing import Dict import numpy as np import torch import ray class TorchPredictor: def __init__(self): self.model = torch.nn.Identity() self.model.eval() def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: inputs = torch.as_tensor(batch["data"], dtype=torch.float32) with torch.inference_mode(): batch["output"] = self.model(inputs).detach().numpy() return batch ds = ( ray.data.from_numpy(np.ones((32, 100))) .map_batches(TorchPredictor, compute=ray.data.ActorPoolStrategy(size=2)) ) ds.materialize() GPU from typing import Dict import numpy as np import torch import ray class TorchPredictor: def __init__(self): self.model = torch.nn.Identity().cuda() self.model.eval() def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: inputs = torch.as_tensor(batch["data"], dtype=torch.float32).cuda() with torch.inference_mode(): batch["output"] = self.model(inputs).detach().cpu().numpy() return batch ds = ( ray.data.from_numpy(np.ones((32, 100))) .map_batches( TorchPredictor, # Two workers with one GPU each compute=ray.data.ActorPoolStrategy(size=2), # Batch size is required if you're using GPUs. batch_size=4, num_gpus=1 ) ) ds.materialize() Configuring batch format Ray Data represents batches as dicts of NumPy ndarrays or pandas DataFrames. By default, Ray Data represents batches as dicts of NumPy ndarrays. To configure the batch type, specify batch_format in map_batches(). 
You can return either format from your function. NumPy from typing import Dict import numpy as np import ray def increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: batch["image"] = np.clip(batch["image"] + 4, 0, 255) return batch ds = ( ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") .map_batches(increase_brightness, batch_format="numpy") ) pandas import pandas as pd import ray def drop_nas(batch: pd.DataFrame) -> pd.DataFrame: return batch.dropna() ds = ( ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") .map_batches(drop_nas, batch_format="pandas") ) Configuring batch size Increasing batch_size improves the performance of vectorized transformations like NumPy functions and model inference. However, if your batch size is too large, your program might run out of memory. If you encounter an out-of-memory error, decrease your batch_size. The default batch size depends on your resource type. If you’re using CPUs, the default batch size is 4096. If you’re using GPUs, you must specify an explicit batch size. Groupby and transforming groups To transform groups, call groupby() to group rows. Then, call map_groups() to transform the groups. NumPy from typing import Dict import numpy as np import ray items = [ {"image": np.zeros((32, 32, 3)), "label": label} for _ in range(10) for label in range(100) ] def normalize_images(group: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: group["image"] = (group["image"] - group["image"].mean()) / group["image"].std() return group ds = ( ray.data.from_items(items) .groupby("label") .map_groups(normalize_images) ) pandas import pandas as pd import ray def normalize_features(group: pd.DataFrame) -> pd.DataFrame: # Set the target column aside, normalize the feature columns, then reattach it. target = group["target"] group = group.drop("target", axis=1) group = (group - group.min()) / group.std() group["target"] = target return group ds = ( ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") .groupby("target") .map_groups(normalize_features) ) Shuffling rows To randomly shuffle all rows, call random_shuffle(). import ray ds = ( ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") .random_shuffle() ) random_shuffle() is slow. For better performance, try Iterating over batches with shuffling. Repartitioning data A Dataset operates on a sequence of distributed data blocks. If you want to achieve more fine-grained parallelization, increase the number of blocks by setting a higher parallelism at read time. To change the number of blocks for an existing Dataset, call Dataset.repartition(). import ray ds = ray.data.range(10000, parallelism=1000) # Repartition the data into 100 blocks. Since shuffle=False, Ray Data will minimize # data movement during this operation by merging adjacent blocks. ds = ds.repartition(100, shuffle=False).materialize() # Repartition the data into 200 blocks, and force a full data shuffle. # This operation will be more expensive. ds = ds.repartition(200, shuffle=True).materialize() Inspecting Data Inspect Datasets to better understand your data. This guide shows you how to: Describe datasets Inspect rows Inspect batches Inspect execution statistics Describing datasets Datasets are tabular. To view a Dataset’s column names and types, call Dataset.schema(). import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") print(ds.schema()) Column Type ------ ---- sepal length (cm) double sepal width (cm) double petal length (cm) double petal width (cm) double target int64 For more information like the number of rows, print the Dataset.
import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") print(ds) Dataset( num_blocks=..., num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64 } ) Inspecting rows To get a list of rows, call Dataset.take() or Dataset.take_all(). Ray Data represents each row as a dictionary. import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") rows = ds.take(1) print(rows) [{'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0}] For more information on working with rows, see Transforming rows and Iterating over rows. Inspecting batches A batch contains data from multiple rows. To inspect batches, call Dataset.take_batch(). By default, Ray Data represents batches as dicts of NumPy ndarrays. To change the type of the returned batch, set batch_format. NumPy import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") batch = ds.take_batch(batch_size=2, batch_format="numpy") print("Batch:", batch) print("Image shape:", batch["image"].shape) Batch: {'image': array([[[[...]]]], dtype=uint8)} Image shape: (2, 32, 32, 3) pandas import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") batch = ds.take_batch(batch_size=2, batch_format="pandas") print(batch) sepal length (cm) sepal width (cm) ... petal width (cm) target 0 5.1 3.5 ... 0.2 0 1 4.9 3.0 ... 0.2 0 [2 rows x 5 columns] For more information on working with batches, see Transforming batches and Iterating over batches. Inspecting execution statistics Ray Data calculates statistics during execution like the wall clock time and memory usage for the different stages. To view stats about your Datasets, call Dataset.stats() on an executed dataset. The stats are also persisted under /tmp/ray/session_*/logs/ray-data.log. import time import ray def pause(x): time.sleep(.0001) return x ds = ( ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") .map(lambda x: x) .map(pause) ) for batch in ds.iter_batches(): pass print(ds.stats()) Stage 1 ReadCSV->Map()->Map(pause): 1/1 blocks executed in 0.23s * Remote wall time: 222.1ms min, 222.1ms max, 222.1ms mean, 222.1ms total * Remote cpu time: 15.6ms min, 15.6ms max, 15.6ms mean, 15.6ms total * Peak heap memory usage (MiB): 157953.12 min, 157953.12 max, 157953 mean * Output num rows: 150 min, 150 max, 150 mean, 150 total * Output size bytes: 6000 min, 6000 max, 6000 mean, 6000 total * Tasks per node: 1 min, 1 max, 1 mean; 1 nodes used * Extra metrics: {'obj_store_mem_alloc': 6000, 'obj_store_mem_freed': 5761, 'obj_store_mem_peak': 6000} Dataset iterator time breakdown: * Total time user code is blocked: 5.68ms * Total time in user code: 0.96us * Total time overall: 238.93ms * Num blocks local: 0 * Num blocks remote: 0 * Num blocks unknown location: 1 * Batch iteration time breakdown (summed across prefetch threads): * In ray.get(): 2.16ms min, 2.16ms max, 2.16ms avg, 2.16ms total * In batch creation: 897.67us min, 897.67us max, 897.67us avg, 897.67us total * In batch formatting: 836.87us min, 836.87us max, 836.87us avg, 836.87us total Iterating over Data Ray Data lets you iterate over rows or batches of data. This guide shows you how to: Iterate over rows Iterate over batches Iterate over batches with shuffling Split datasets for distributed parallel training Iterating over rows To iterate over the rows of your dataset, call Dataset.iter_rows(). Ray Data represents each row as a dictionary.
import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") for row in ds.iter_rows(): print(row) {'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0} {'sepal length (cm)': 4.9, 'sepal width (cm)': 3.0, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0} ... {'sepal length (cm)': 5.9, 'sepal width (cm)': 3.0, 'petal length (cm)': 5.1, 'petal width (cm)': 1.8, 'target': 2} For more information on working with rows, see Transforming rows and Inspecting rows. Iterating over batches A batch contains data from multiple rows. Iterate over batches of dataset in different formats by calling one of the following methods: Dataset.iter_batches() Dataset.iter_torch_batches() Dataset.to_tf() NumPy import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_batches(batch_size=2, batch_format="numpy"): print(batch) {'image': array([[[[...]]]], dtype=uint8)} ... {'image': array([[[[...]]]], dtype=uint8)} pandas import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") for batch in ds.iter_batches(batch_size=2, batch_format="pandas"): print(batch) sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target 0 5.1 3.5 1.4 0.2 0 1 4.9 3.0 1.4 0.2 0 ... sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target 0 6.2 3.4 5.4 2.3 2 1 5.9 3.0 5.1 1.8 2 Torch import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_torch_batches(batch_size=2): print(batch) {'image': tensor([[[[...]]]], dtype=torch.uint8)} ... {'image': tensor([[[[...]]]], dtype=torch.uint8)} TensorFlow import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") tf_dataset = ds.to_tf( feature_columns="sepal length (cm)", label_columns="target", batch_size=2 ) for features, labels in tf_dataset: print(features, labels) tf.Tensor([5.1 4.9], shape=(2,), dtype=float64) tf.Tensor([0 0], shape=(2,), dtype=int64) ... tf.Tensor([6.2 5.9], shape=(2,), dtype=float64) tf.Tensor([2 2], shape=(2,), dtype=int64) For more information on working with batches, see Transforming batches and Inspecting batches. Iterating over batches with shuffling Dataset.random_shuffle is slow because it shuffles all rows. If a full global shuffle isn’t required, you can shuffle a subset of rows up to a provided buffer size during iteration by specifying local_shuffle_buffer_size. While this isn’t a true global shuffle like random_shuffle, it’s more performant because it doesn’t require excessive data movement. To configure local_shuffle_buffer_size, choose the smallest value that achieves sufficient randomness. Higher values result in more randomness at the cost of slower iteration. NumPy import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_batches( batch_size=2, batch_format="numpy", local_shuffle_buffer_size=250, ): print(batch) {'image': array([[[[...]]]], dtype=uint8)} ... {'image': array([[[[...]]]], dtype=uint8)} pandas import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") for batch in ds.iter_batches( batch_size=2, batch_format="pandas", local_shuffle_buffer_size=250, ): print(batch) sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target 0 6.3 2.9 5.6 1.8 2 1 5.7 4.4 1.5 0.4 0 ... 
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target 0 5.6 2.7 4.2 1.3 1 1 4.8 3.0 1.4 0.1 0 Torch import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_torch_batches( batch_size=2, local_shuffle_buffer_size=250, ): print(batch) {'image': tensor([[[[...]]]], dtype=torch.uint8)} ... {'image': tensor([[[[...]]]], dtype=torch.uint8)} TensorFlow import ray ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") tf_dataset = ds.to_tf( feature_columns="sepal length (cm)", label_columns="target", batch_size=2, local_shuffle_buffer_size=250, ) for features, labels in tf_dataset: print(features, labels) tf.Tensor([5.2 6.3], shape=(2,), dtype=float64) tf.Tensor([1 2], shape=(2,), dtype=int64) ... tf.Tensor([5. 5.8], shape=(2,), dtype=float64) tf.Tensor([0 0], shape=(2,), dtype=int64) Splitting datasets for distributed parallel training If you’re performing distributed data parallel training, call Dataset.streaming_split to split your dataset into disjoint shards. If you’re using Ray Train, you don’t need to split the dataset. Ray Train automatically splits your dataset for you. To learn more, see Configuring training datasets. import ray @ray.remote class Worker: def train(self, data_iterator): for batch in data_iterator.iter_batches(batch_size=8): pass ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") workers = [Worker.remote() for _ in range(4)] shards = ds.streaming_split(n=4, equal=True) ray.get([w.train.remote(s) for w, s in zip(workers, shards)]) Saving Data Ray Data lets you save data in files or other Python objects. This guide shows you how to: Write data to files Convert Datasets to other Python libraries Writing data to files Ray Data writes to local disk and cloud storage. Writing data to local disk To save your Dataset to local disk, call a method like Dataset.write_parquet and specify a local directory with the local:// scheme. If your cluster contains multiple nodes and you don’t use local://, Ray Data writes different partitions of data to different nodes. import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") ds.write_parquet("local:///tmp/iris/") To write data to formats other than Parquet, read the Input/Output reference. Writing data to cloud storage To save your Dataset to cloud storage, authenticate all nodes with your cloud service provider. Then, call a method like Dataset.write_parquet and specify a URI with the appropriate scheme. URIs can point to buckets or folders. S3 To save data to Amazon S3, specify a URI with the s3:// scheme. import ray ds = ray.data.read_csv("local:///tmp/iris.csv") ds.write_parquet("s3://my-bucket/my-folder") GCS To save data to Google Cloud Storage, install the Filesystem interface to Google Cloud Storage pip install gcsfs Then, create a GCSFileSystem and specify a URI with the gcs:// scheme. import gcsfs import ray ds = ray.data.read_csv("local:///tmp/iris.csv") filesystem = gcsfs.GCSFileSystem(project="my-google-project") ds.write_parquet("gcs://my-bucket/my-folder", filesystem=filesystem) ABS To save data to Azure Blob Storage, install the Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage pip install adlfs Then, create an AzureBlobFileSystem and specify a URI with the az:// scheme.
import adlfs import ray ds = ray.data.read_csv("local:///tmp/iris.csv") filesystem = adlfs.AzureBlobFileSystem(account_name="azureopendatastorage") ds.write_parquet("az://my-bucket/my-folder", filesystem=filesystem) To write data to formats other than Parquet, read the Input/Output reference. Writing data to NFS To save your Dataset to NFS file systems, call a method like Dataset.write_parquet and specify a mounted directory. import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") ds.write_parquet("/mnt/cluster_storage/iris") To write data to formats other than Parquet, read the Input/Output reference. Changing the number of output files When you call a write method, Ray Data writes your data to one file per block. To change the number of blocks, call repartition(). import os import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") ds.repartition(2).write_csv("/tmp/two_files/") print(os.listdir("/tmp/two_files/")) ['26b07dba90824a03bb67f90a1360e104_000003.csv', '26b07dba90824a03bb67f90a1360e104_000002.csv'] Converting Datasets to other Python libraries Converting Datasets to pandas To convert a Dataset to a pandas DataFrame, call Dataset.to_pandas(). Your data must fit in memory on the head node. import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") df = ds.to_pandas() print(df) sepal length (cm) sepal width (cm) ... petal width (cm) target 0 5.1 3.5 ... 0.2 0 1 4.9 3.0 ... 0.2 0 2 4.7 3.2 ... 0.2 0 3 4.6 3.1 ... 0.2 0 4 5.0 3.6 ... 0.2 0 .. ... ... ... ... ... 145 6.7 3.0 ... 2.3 2 146 6.3 2.5 ... 1.9 2 147 6.5 3.0 ... 2.0 2 148 6.2 3.4 ... 2.3 2 149 5.9 3.0 ... 1.8 2 [150 rows x 5 columns] Converting Datasets to distributed DataFrames Ray Data interoperates with distributed data processing frameworks like Dask, Spark, Modin, and Mars. Dask To convert a Dataset to a Dask DataFrame, call Dataset.to_dask(). import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") df = ds.to_dask() Spark To convert a Dataset to a Spark DataFrame, call Dataset.to_spark(). import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") df = ds.to_spark() Modin To convert a Dataset to a Modin DataFrame, call Dataset.to_modin(). import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") mdf = ds.to_modin() Mars To convert a Dataset to a Mars DataFrame, call Dataset.to_mars(). import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") mdf = ds.to_mars() Working with Images With Ray Data, you can easily read and transform large image datasets. This guide shows you how to: Read images Transform images Perform inference on images Save images Reading images Ray Data can read images from a variety of formats. To view the full list of supported file formats, see the Input/Output reference. Raw images To load raw images like JPEG files, call read_images(). read_images() uses PIL. For a list of supported file formats, see Image file formats. import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/batoidea/JPEGImages") print(ds.schema()) Column Type ------ ---- image numpy.ndarray(shape=(32, 32, 3), dtype=uint8) NumPy To load images stored in NumPy format, call read_numpy().
import ray ds = ray.data.read_numpy("s3://anonymous@air-example-data/cifar-10/images.npy") print(ds.schema()) Column Type ------ ---- data numpy.ndarray(shape=(32, 32, 3), dtype=uint8) TFRecords Image datasets often contain tf.train.Example messages that look like this: features { feature { key: "image" value { bytes_list { value: ... # Raw image bytes } } } feature { key: "label" value { int64_list { value: 3 } } } } To load examples stored in this format, call read_tfrecords(). Then, call map() to decode the raw image bytes. import io from typing import Any, Dict import numpy as np from PIL import Image import ray def decode_bytes(row: Dict[str, Any]) -> Dict[str, Any]: data = row["image"] image = Image.open(io.BytesIO(data)) row["image"] = np.array(image) return row ds = ( ray.data.read_tfrecords( "s3://anonymous@air-example-data/cifar-10/tfrecords" ) .map(decode_bytes) ) print(ds.schema()) The following `testoutput` is mocked because the order of column names can be non-deterministic. For an example, see https://buildkite.com/ray-project/oss-ci-build-branch/builds/4849#01892c8b-0cd0-4432-bc9f-9f86fcd38edd. Column Type ------ ---- image numpy.ndarray(shape=(32, 32, 3), dtype=uint8) label int64 Parquet To load image data stored in Parquet files, call ray.data.read_parquet(). import ray ds = ray.data.read_parquet("s3://anonymous@air-example-data/cifar-10/parquet") print(ds.schema()) Column Type ------ ---- image numpy.ndarray(shape=(32, 32, 3), dtype=uint8) label int64 For more information on creating datasets, see Loading Data. Transforming images To transform images, call map() or map_batches(). from typing import Any, Dict import numpy as np import ray def increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: batch["image"] = np.clip(batch["image"] + 4, 0, 255) return batch ds = ( ray.data.read_images("s3://anonymous@ray-example-data/batoidea/JPEGImages") .map_batches(increase_brightness) ) For more information on transforming data, see Transforming data. Performing inference on images To perform inference with a pre-trained model, first load and transform your data. from typing import Any, Dict from torchvision import transforms import ray def transform_image(row: Dict[str, Any]) -> Dict[str, Any]: transform = transforms.Compose([ transforms.ToTensor(), transforms.Resize((32, 32)) ]) row["image"] = transform(row["image"]) return row ds = ( ray.data.read_images("s3://anonymous@ray-example-data/batoidea/JPEGImages") .map(transform_image) ) Next, implement a callable class that sets up and invokes your model. import torch from torchvision import models class ImageClassifier: def __init__(self): weights = models.ResNet18_Weights.DEFAULT self.model = models.resnet18(weights=weights) self.model.eval() def __call__(self, batch): inputs = torch.from_numpy(batch["image"]) with torch.inference_mode(): outputs = self.model(inputs) return {"class": outputs.argmax(dim=1)} Finally, call Dataset.map_batches(). predictions = ds.map_batches( ImageClassifier, compute=ray.data.ActorPoolStrategy(size=2), batch_size=4 ) predictions.show(3) {'class': 118} {'class': 153} {'class': 296} For more information on performing inference, see End-to-end: Offline Batch Inference and Transforming batches with actors. Saving images Save images with formats like Parquet, NumPy, and JSON. To view all supported formats, see the Input/Output reference. Parquet To save images in Parquet files, call write_parquet(). 
import ray

ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")
ds.write_parquet("/tmp/simple")

NumPy

To save images in a NumPy file, call write_numpy().

import ray

ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")
ds.write_numpy("/tmp/simple", column="image")

JSON

To save images in a JSON file, call write_json().

import ray

ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")
ds.write_json("/tmp/simple")

For more information on saving data, see Saving data.

Working with Text

With Ray Data, you can easily read and transform large amounts of text data.

This guide shows you how to:

Read text files
Transform text data
Perform inference on text data
Save text data

Reading text files

Ray Data can read lines of text and JSONL. Alternatively, you can read raw binary files and manually decode data.

Text lines

To read lines of text, call read_text(). Ray Data creates a row for each line of text.

import ray

ds = ray.data.read_text("s3://anonymous@ray-example-data/this.txt")

ds.show(3)

{'text': 'The Zen of Python, by Tim Peters'}
{'text': 'Beautiful is better than ugly.'}
{'text': 'Explicit is better than implicit.'}

JSON Lines

JSON Lines is a text format for structured data. It’s typically used to process data one record at a time.

To read JSON Lines files, call read_json(). Ray Data creates a row for each JSON object.

import ray

ds = ray.data.read_json("s3://anonymous@ray-example-data/logs.json")

ds.show(3)

{'timestamp': datetime.datetime(2022, 2, 8, 15, 43, 41), 'size': 48261360}
{'timestamp': datetime.datetime(2011, 12, 29, 0, 19, 10), 'size': 519523}
{'timestamp': datetime.datetime(2028, 9, 9, 5, 6, 7), 'size': 2163626}

Other formats

To read other text formats, call read_binary_files(). Then, call map() to decode your data.

from typing import Any, Dict

from bs4 import BeautifulSoup
import ray

def parse_html(row: Dict[str, Any]) -> Dict[str, Any]:
    html = row["bytes"].decode("utf-8")
    soup = BeautifulSoup(html, features="html.parser")
    return {"text": soup.get_text().strip()}

ds = (
    ray.data.read_binary_files("s3://anonymous@ray-example-data/index.html")
    .map(parse_html)
)

ds.show()

{'text': 'Batoidea\nBatoidea is a superorder of cartilaginous fishes...'}

For more information on reading files, see Loading data.

Transforming text

To transform text, implement your transformation in a function or callable class. Then, call Dataset.map() or Dataset.map_batches(). Ray Data transforms your text in parallel.

from typing import Any, Dict

import ray

def to_lower(row: Dict[str, Any]) -> Dict[str, Any]:
    row["text"] = row["text"].lower()
    return row

ds = (
    ray.data.read_text("s3://anonymous@ray-example-data/this.txt")
    .map(to_lower)
)

ds.show(3)

{'text': 'the zen of python, by tim peters'}
{'text': 'beautiful is better than ugly.'}
{'text': 'explicit is better than implicit.'}

For more information on transforming data, see Transforming data.

Performing inference on text

To perform inference with a pre-trained model on text data, implement a callable class that sets up and invokes a model. Then, call Dataset.map_batches().
from typing import Dict import numpy as np from transformers import pipeline import ray class TextClassifier: def __init__(self): self.model = pipeline("text-classification") def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: predictions = self.model(list(batch["text"])) batch["label"] = [prediction["label"] for prediction in predictions] return batch ds = ( ray.data.read_text("s3://anonymous@ray-example-data/this.txt") .map_batches(TextClassifier, compute=ray.data.ActorPoolStrategy(size=2)) ) ds.show(3) {'text': 'The Zen of Python, by Tim Peters', 'label': 'POSITIVE'} {'text': 'Beautiful is better than ugly.', 'label': 'POSITIVE'} {'text': 'Explicit is better than implicit.', 'label': 'POSITIVE'} For more information on performing inference, see End-to-end: Offline Batch Inference and Transforming batches with actors. Saving text To save text, call a method like write_parquet(). Ray Data can save text in many formats. To view the full list of supported file formats, see the Input/Output reference. import ray ds = ray.data.read_text("s3://anonymous@ray-example-data/this.txt") ds.write_parquet("local:///tmp/results") For more information on saving data, see Saving data. Working with Tensors N-dimensional arrays (i.e., tensors) are ubiquitous in ML workloads. This guide describes the limitations and best practices of working with such data. Tensor data representation Ray Data represents tensors as NumPy ndarrays. import ray ds = ray.data.read_images("s3://anonymous@air-example-data/digits") print(ds) Dataset( num_blocks=..., num_rows=100, schema={image: numpy.ndarray(shape=(28, 28), dtype=uint8)} ) Batches of fixed-shape tensors If your tensors have a fixed shape, Ray Data represents batches as regular ndarrays. >>> import ray >>> ds = ray.data.read_images("s3://anonymous@air-example-data/digits") >>> batch = ds.take_batch(batch_size=32) >>> batch["image"].shape (32, 28, 28) >>> batch["image"].dtype dtype('uint8') Batches of variable-shape tensors If your tensors vary in shape, Ray Data represents batches as arrays of object dtype. >>> import ray >>> ds = ray.data.read_images("s3://anonymous@air-example-data/AnimalDetection") >>> batch = ds.take_batch(batch_size=32) >>> batch["image"].shape (32,) >>> batch["image"].dtype dtype('O') The individual elements of these object arrays are regular ndarrays. >>> batch["image"][0].dtype dtype('uint8') >>> batch["image"][0].shape (375, 500, 3) >>> batch["image"][3].shape (333, 465, 3) Transforming tensor data Call map() or map_batches() to transform tensor data. from typing import Any, Dict import ray import numpy as np ds = ray.data.read_images("s3://anonymous@air-example-data/AnimalDetection") def increase_brightness(row: Dict[str, Any]) -> Dict[str, Any]: row["image"] = np.clip(row["image"] + 4, 0, 255) return row # Increase the brightness, record at a time. ds.map(increase_brightness) def batch_increase_brightness(batch: Dict[str, np.ndarray]) -> Dict: batch["image"] = np.clip(batch["image"] + 4, 0, 255) return batch # Increase the brightness, batch at a time. ds.map_batches(batch_increase_brightness) In this example, we return np.ndarray directly as the output. Ray Data will also treat returned lists of np.ndarray and objects implementing __array__ (e.g., torch.Tensor) as tensor data. For more information on transforming data, read Transforming data. Saving tensor data Save tensor data with formats like Parquet, NumPy, and JSON. To view all supported formats, see the Input/Output reference. 
Parquet Call write_parquet() to save data in Parquet files. import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") ds.write_parquet("/tmp/simple") NumPy Call write_numpy() to save an ndarray column in NumPy files. import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") ds.write_numpy("/tmp/simple", column="image") JSON To save images in a JSON file, call write_json(). import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") ds.write_json("/tmp/simple") For more information on saving data, read Saving data. Working with PyTorch Ray Data integrates with the PyTorch ecosystem. This guide describes how to: Iterate over your dataset as torch tensors for model training Write transformations that deal with torch tensors Perform batch inference with torch models Save Datasets containing torch tensors Migrate from PyTorch Datasets to Ray Data Iterating over torch tensors for training To iterate over batches of data in torch format, call Dataset.iter_torch_batches(). Each batch is represented as Dict[str, torch.Tensor], with one tensor per column in the dataset. This is useful for training torch models with batches from your dataset. For configuration details such as providing a collate_fn for customizing the conversion, see the API reference. import ray import torch ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") for batch in ds.iter_torch_batches(batch_size=2): print(batch) {'image': tensor([[[[...]]]], dtype=torch.uint8)} ... {'image': tensor([[[[...]]]], dtype=torch.uint8)} Integration with Ray Train Ray Data integrates with Ray Train for easy data ingest for data parallel training, with support for PyTorch, PyTorch Lightning, or Huggingface training. import torch from torch import nn import ray from ray.air import session, ScalingConfig from ray.train.torch import TorchTrainer def train_func(config): model = nn.Sequential(nn.Linear(30, 1), nn.Sigmoid()) loss_fn = torch.nn.BCELoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.001) # Datasets can be accessed in your train_func via ``get_dataset_shard``. train_data_shard = session.get_dataset_shard("train") for epoch_idx in range(2): for batch in train_data_shard.iter_torch_batches(batch_size=128, dtypes=torch.float32): features = torch.stack([batch[col_name] for col_name in batch.keys() if col_name != "target"], axis=1) predictions = model(features) train_loss = loss_fn(predictions, batch["target"].unsqueeze(1)) train_loss.backward() optimizer.step() train_dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") trainer = TorchTrainer( train_func, datasets={"train": train_dataset}, scaling_config=ScalingConfig(num_workers=2) ) trainer.fit() ... For more details, see the Ray Train user guide. Transformations with torch tensors Transformations applied with map or map_batches can return torch tensors. Under the hood, Ray Data automatically converts torch tensors to numpy arrays. Subsequent transformations accept numpy arrays as input, not torch tensors. 
map from typing import Dict import numpy as np import torch import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") def convert_to_torch(row: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]: return {"tensor": torch.as_tensor(row["image"])} # The tensor gets converted into a Numpy array under the hood transformed_ds = ds.map(convert_to_torch) print(transformed_ds.schema()) # Subsequent transformations take in Numpy array as input. def check_numpy(row: Dict[str, np.ndarray]): assert isinstance(row["tensor"], np.ndarray) return row transformed_ds.map(check_numpy).take_all() Column Type ------ ---- tensor numpy.ndarray(shape=(32, 32, 3), dtype=uint8) map_batches from typing import Dict import numpy as np import torch import ray ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") def convert_to_torch(batch: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]: return {"tensor": torch.as_tensor(batch["image"])} # The tensor gets converted into a Numpy array under the hood transformed_ds = ds.map_batches(convert_to_torch, batch_size=2) print(transformed_ds.schema()) # Subsequent transformations take in Numpy array as input. def check_numpy(batch: Dict[str, np.ndarray]): assert isinstance(batch["tensor"], np.ndarray) return batch transformed_ds.map_batches(check_numpy, batch_size=2).take_all() Column Type ------ ---- tensor numpy.ndarray(shape=(32, 32, 3), dtype=uint8) For more information on transforming data, see Transforming data. Built-in PyTorch transforms You can use built-in torch transforms from torchvision, torchtext, and torchaudio Ray Data transformations. torchvision from typing import Dict import numpy as np import torch from torchvision import transforms import ray # Create the Dataset. ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple") # Define the torchvision transform. transform = transforms.Compose( [ transforms.ToTensor(), transforms.CenterCrop(10) ] ) # Define the map function def transform_image(row: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]: row["transformed_image"] = transform(row["image"]) return row # Apply the transform over the dataset. transformed_ds = ds.map(transform_image) print(transformed_ds.schema()) Column Type ------ ---- image numpy.ndarray(shape=(32, 32, 3), dtype=uint8) transformed_image numpy.ndarray(shape=(3, 10, 10), dtype=float) torchtext from typing import Dict, List import numpy as np from torchtext import transforms import ray # Create the Dataset. ds = ray.data.read_text("s3://anonymous@ray-example-data/simple.txt") # Define the torchtext transform. VOCAB_FILE = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt" transform = transforms.BERTTokenizer(vocab_path=VOCAB_FILE, do_lower_case=True, return_tokens=True) # Define the map_batches function. def tokenize_text(batch: Dict[str, np.ndarray]) -> Dict[str, List[str]]: batch["tokenized_text"] = transform(list(batch["text"])) return batch # Apply the transform over the dataset. transformed_ds = ds.map_batches(tokenize_text, batch_size=2) print(transformed_ds.schema()) Column Type ------ ---- text tokenized_text Batch inference with PyTorch With Ray Datasets, you can do scalable offline batch inference with torch models by mapping a pre-trained model over your data. from typing import Dict import numpy as np import torch import torch.nn as nn import ray # Step 1: Create a Ray Dataset from in-memory Numpy arrays. 
# You can also create a Ray Dataset from many other sources and file
# formats.
ds = ray.data.from_numpy(np.ones((1, 100)))

# Step 2: Define a Predictor class for inference.
# Use a class to initialize the model just once in `__init__`
# and re-use it for inference across multiple batches.
class TorchPredictor:
    def __init__(self):
        # Load a dummy neural network.
        # Set `self.model` to your pre-trained PyTorch model.
        self.model = nn.Sequential(
            nn.Linear(in_features=100, out_features=1),
            nn.Sigmoid(),
        )
        self.model.eval()

    # Logic for inference on 1 batch of data.
    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        tensor = torch.as_tensor(batch["data"], dtype=torch.float32)
        with torch.inference_mode():
            # Get the predictions from the input batch.
            return {"output": self.model(tensor).numpy()}

# Use 2 parallel actors for inference. Each actor predicts on a
# different partition of data.
scale = ray.data.ActorPoolStrategy(size=2)
# Step 3: Map the Predictor over the Dataset to get predictions.
predictions = ds.map_batches(TorchPredictor, compute=scale)
# Step 4: Show one prediction output.
predictions.show(limit=1)

{'output': array([0.5590901], dtype=float32)}

For more details, see the Batch inference user guide.

Saving Datasets containing torch tensors

Datasets containing torch tensors can be saved to files, like parquet or numpy. For more information on saving data, read Saving data.

Torch tensors that are on GPU devices can’t be serialized and written to disk. Convert the tensors to CPU (tensor.to("cpu")) before saving the data.

Parquet

import torch
import ray

tensor = torch.Tensor(1)
ds = ray.data.from_items([{"tensor": tensor}])

ds.write_parquet("local:///tmp/tensor")

Numpy

import torch
import ray

tensor = torch.Tensor(1)
ds = ray.data.from_items([{"tensor": tensor}])

ds.write_numpy("local:///tmp/tensor", column="tensor")

Migrating from PyTorch Datasets and DataLoaders

If you’re currently using PyTorch Datasets and DataLoaders, you can migrate to Ray Data for working with distributed datasets. PyTorch Datasets are replaced by the Dataset abstraction, and the PyTorch DataLoader is replaced by Dataset.iter_torch_batches().

Built-in PyTorch Datasets

If you are using built-in PyTorch datasets, for example from torchvision, these can be converted to a Ray Dataset using the from_torch() API.

from_torch() requires the PyTorch Dataset to fit in memory. Use this only for small, built-in datasets for prototyping or testing.

import torchvision
import ray

mnist = torchvision.datasets.MNIST(root="/tmp/", download=True)
ds = ray.data.from_torch(mnist)

# The data for each item of the torch dataset is under the "item" key.
print(ds.schema())

Column  Type
------  ----
item

Custom PyTorch Datasets

If you have a custom PyTorch Dataset, you can migrate to Ray Data by converting the logic in __getitem__ to Ray Data read and transform operations.

Any logic for reading data from cloud storage and disk can be replaced by one of the Ray Data read_* APIs, and any transformation logic can be applied as a map call on the Dataset.

The following example shows a custom PyTorch Dataset, and what the analogous code would look like with Ray Data.

Unlike PyTorch Map-style datasets, Ray Datasets are not indexable.
PyTorch Dataset

import tempfile

import boto3
from botocore import UNSIGNED
from botocore.config import Config
from torchvision import transforms
from torch.utils.data import Dataset
from PIL import Image

class ImageDataset(Dataset):
    def __init__(self, bucket_name: str, dir_path: str):
        self.s3 = boto3.resource("s3", config=Config(signature_version=UNSIGNED))
        self.bucket = self.s3.Bucket(bucket_name)
        self.files = [obj.key for obj in self.bucket.objects.filter(Prefix=dir_path)]

        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Resize((128, 128)),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img_name = self.files[idx]

        # Infer the label from the file name.
        last_slash_idx = img_name.rfind("/")
        dot_idx = img_name.rfind(".")
        label = int(img_name[last_slash_idx+1:dot_idx])

        # Download the S3 file locally.
        obj = self.bucket.Object(img_name)
        tmp = tempfile.NamedTemporaryFile()
        tmp_name = "{}.jpg".format(tmp.name)

        with open(tmp_name, "wb") as f:
            obj.download_fileobj(f)
            f.flush()
            f.close()
            image = Image.open(tmp_name)

        # Preprocess the image.
        image = self.transform(image)

        return image, label

dataset = ImageDataset(bucket_name="ray-example-data", dir_path="batoidea/JPEGImages/")

Ray Data

from torchvision import transforms

import ray

ds = ray.data.read_images("s3://anonymous@ray-example-data/batoidea/JPEGImages", include_paths=True)

# Extract the label from the file path.
def extract_label(row: dict):
    filepath = row["path"]
    last_slash_idx = filepath.rfind("/")
    dot_idx = filepath.rfind(".")
    label = int(filepath[last_slash_idx+1:dot_idx])
    row["label"] = label
    return row

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((128, 128)),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Preprocess the images.
def transform_image(row: dict):
    row["transformed_image"] = transform(row["image"])
    return row

# Map the transformations over the dataset.
ds = ds.map(extract_label).map(transform_image)

PyTorch DataLoader

The PyTorch DataLoader can be replaced by calling Dataset.iter_torch_batches() to iterate over batches of the dataset. The following table describes how the arguments for PyTorch DataLoader map to Ray Data. Note that the behavior may not necessarily be identical. For exact semantics and usage, see the API reference.

PyTorch DataLoader arguments    Ray Data API
----------------------------    ------------
batch_size                      batch_size arg to ds.iter_torch_batches()
shuffle                         local_shuffle_buffer_size arg to ds.iter_torch_batches()
collate_fn                      collate_fn arg to ds.iter_torch_batches()
sampler                         Not supported. Can be manually implemented after iterating through the dataset with ds.iter_torch_batches().
batch_sampler                   Not supported. Can be manually implemented after iterating through the dataset with ds.iter_torch_batches().
drop_last                       drop_last arg to ds.iter_torch_batches()
num_workers                     Use prefetch_batches arg to ds.iter_torch_batches() to indicate how many batches to prefetch. The number of prefetching threads will automatically be configured according to prefetch_batches.
prefetch_factor                 Use prefetch_batches arg to ds.iter_torch_batches() to indicate how many batches to prefetch. The number of prefetching threads will automatically be configured according to prefetch_batches.
pin_memory                      Pass in device to ds.iter_torch_batches() to get tensors that have already been moved to the correct device.

End-to-end: Offline Batch Inference

Get in touch to get help using Ray Data, the industry’s fastest and cheapest solution for offline batch inference.
Offline batch inference is a process for generating model predictions on a fixed set of input data. Ray Data offers an efficient and scalable solution for batch inference, providing faster execution and cost-effectiveness for deep learning applications. For an overview on why you should use Ray Data for offline batch inference, and how it compares to alternatives, see the Ray Data Overview. Quickstart To start, install Ray Data: pip install -U "ray[data]" Using Ray Data for offline inference involves four basic steps: Step 1: Load your data into a Ray Dataset. Ray Data supports many different data sources and formats. For more details, see Loading Data. Step 2: Define a Python class to load the pre-trained model. Step 3: Transform your dataset using the pre-trained model by calling ds.map_batches(). For more details, see Transforming Data. Step 4: Get the final predictions by either iterating through the output or saving the results. For more details, see the Iterating over data and Saving data user guides. For more in-depth examples for your use case, see our batch inference examples. For how to configure batch inference, see the configuration guide. HuggingFace PyTorch TensorFlow from typing import Dict import numpy as np import ray # Step 1: Create a Ray Dataset from in-memory Numpy arrays. # You can also create a Ray Dataset from many other sources and file # formats. ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"])) # Step 2: Define a Predictor class for inference. # Use a class to initialize the model just once in `__init__` # and re-use it for inference across multiple batches. class HuggingFacePredictor: def __init__(self): from transformers import pipeline # Initialize a pre-trained GPT2 Huggingface pipeline. self.model = pipeline("text-generation", model="gpt2") # Logic for inference on 1 batch of data. def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: # Get the predictions from the input batch. predictions = self.model(list(batch["data"]), max_length=20, num_return_sequences=1) # `predictions` is a list of length-one lists. For example: # [[{'generated_text': 'output_1'}], ..., [{'generated_text': 'output_2'}]] # Modify the output to get it into the following format instead: # ['output_1', 'output_2'] batch["output"] = [sequences[0]["generated_text"] for sequences in predictions] return batch # Use 2 parallel actors for inference. Each actor predicts on a # different partition of data. scale = ray.data.ActorPoolStrategy(size=2) # Step 3: Map the Predictor over the Dataset to get predictions. predictions = ds.map_batches(HuggingFacePredictor, compute=scale) # Step 4: Show one prediction output. predictions.show(limit=1) {'data': 'Complete this', 'output': 'Complete this information or purchase any item from this site.\n\nAll purchases are final and non-'} from typing import Dict import numpy as np import torch import torch.nn as nn import ray # Step 1: Create a Ray Dataset from in-memory Numpy arrays. # You can also create a Ray Dataset from many other sources and file # formats. ds = ray.data.from_numpy(np.ones((1, 100))) # Step 2: Define a Predictor class for inference. # Use a class to initialize the model just once in `__init__` # and re-use it for inference across multiple batches. class TorchPredictor: def __init__(self): # Load a dummy neural network. # Set `self.model` to your pre-trained PyTorch model. 
self.model = nn.Sequential( nn.Linear(in_features=100, out_features=1), nn.Sigmoid(), ) self.model.eval() # Logic for inference on 1 batch of data. def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: tensor = torch.as_tensor(batch["data"], dtype=torch.float32) with torch.inference_mode(): # Get the predictions from the input batch. return {"output": self.model(tensor).numpy()} # Use 2 parallel actors for inference. Each actor predicts on a # different partition of data. scale = ray.data.ActorPoolStrategy(size=2) # Step 3: Map the Predictor over the Dataset to get predictions. predictions = ds.map_batches(TorchPredictor, compute=scale) # Step 4: Show one prediction output. predictions.show(limit=1) {'output': array([0.5590901], dtype=float32)} from typing import Dict import numpy as np import ray # Step 1: Create a Ray Dataset from in-memory Numpy arrays. # You can also create a Ray Dataset from many other sources and file # formats. ds = ray.data.from_numpy(np.ones((1, 100))) # Step 2: Define a Predictor class for inference. # Use a class to initialize the model just once in `__init__` # and re-use it for inference across multiple batches. class TFPredictor: def __init__(self): from tensorflow import keras # Load a dummy neural network. # Set `self.model` to your pre-trained Keras model. input_layer = keras.Input(shape=(100,)) output_layer = keras.layers.Dense(1, activation="sigmoid") self.model = keras.Sequential([input_layer, output_layer]) # Logic for inference on 1 batch of data. def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: # Get the predictions from the input batch. return {"output": self.model(batch["data"]).numpy()} # Use 2 parallel actors for inference. Each actor predicts on a # different partition of data. scale = ray.data.ActorPoolStrategy(size=2) # Step 3: Map the Predictor over the Dataset to get predictions. predictions = ds.map_batches(TFPredictor, compute=scale) # Step 4: Show one prediction output. predictions.show(limit=1) {'output': array([0.625576], dtype=float32)} More examples Image Classification Batch Inference with PyTorch ResNet18 Object Detection Batch Inference with PyTorch FasterRCNN_ResNet50 Image Classification Batch Inference with Huggingface Vision Transformer Configuration and troubleshooting Using GPUs for inference To use GPUs for inference, make the following changes to your code: Update the class implementation to move the model and data to and from GPU. Specify num_gpus=1 in the ds.map_batches() call to indicate that each actor should use 1 GPU. Specify a batch_size for inference. For more details on how to configure the batch size, see batch_inference_batch_size. The remaining is the same as the Quickstart. HuggingFace PyTorch TensorFlow from typing import Dict import numpy as np import ray ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"])) class HuggingFacePredictor: def __init__(self): from transformers import pipeline # Set "cuda:0" as the device so the Huggingface pipeline uses GPU. self.model = pipeline("text-generation", model="gpt2", device="cuda:0") def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: predictions = self.model(list(batch["data"]), max_length=20, num_return_sequences=1) batch["output"] = [sequences[0]["generated_text"] for sequences in predictions] return batch # Use 2 actors, each actor using 1 GPU. 2 GPUs total. predictions = ds.map_batches( HuggingFacePredictor, num_gpus=1, # Specify the batch size for inference. 
    # Increase this for larger datasets.
    batch_size=1,
    # Set the ActorPool size to the number of GPUs in your cluster.
    compute=ray.data.ActorPoolStrategy(size=2),
)
predictions.show(limit=1)

{'data': 'Complete this', 'output': 'Complete this poll. Which one do you think holds the most promise for you?\n\nThank you'}

from typing import Dict

import numpy as np
import torch
import torch.nn as nn

import ray

ds = ray.data.from_numpy(np.ones((1, 100)))

class TorchPredictor:
    def __init__(self):
        # Move the neural network to GPU device by specifying "cuda".
        self.model = nn.Sequential(
            nn.Linear(in_features=100, out_features=1),
            nn.Sigmoid(),
        ).cuda()
        self.model.eval()

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        # Move the input batch to GPU device by specifying "cuda".
        tensor = torch.as_tensor(batch["data"], dtype=torch.float32, device="cuda")
        with torch.inference_mode():
            # Move the prediction output back to CPU before returning.
            return {"output": self.model(tensor).cpu().numpy()}

# Use 2 actors, each actor using 1 GPU. 2 GPUs total.
predictions = ds.map_batches(
    TorchPredictor,
    num_gpus=1,
    # Specify the batch size for inference.
    # Increase this for larger datasets.
    batch_size=1,
    # Set the ActorPool size to the number of GPUs in your cluster.
    compute=ray.data.ActorPoolStrategy(size=2)
)
predictions.show(limit=1)

{'output': array([0.5590901], dtype=float32)}

from typing import Dict

import numpy as np
import tensorflow as tf
from tensorflow import keras

import ray

ds = ray.data.from_numpy(np.ones((1, 100)))

class TFPredictor:
    def __init__(self):
        # Move the neural network to GPU by specifying the GPU device.
        with tf.device("GPU:0"):
            input_layer = keras.Input(shape=(100,))
            output_layer = keras.layers.Dense(1, activation="sigmoid")
            self.model = keras.Sequential([input_layer, output_layer])

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        # Move the input batch to GPU by specifying GPU device.
        with tf.device("GPU:0"):
            return {"output": self.model(batch["data"]).numpy()}

# Use 2 actors, each actor using 1 GPU. 2 GPUs total.
predictions = ds.map_batches(
    TFPredictor,
    num_gpus=1,
    # Specify the batch size for inference.
    # Increase this for larger datasets.
    batch_size=1,
    # Set the ActorPool size to the number of GPUs in your cluster.
    compute=ray.data.ActorPoolStrategy(size=2)
)
predictions.show(limit=1)

{'output': array([0.625576], dtype=float32)}

Configuring Batch Size

Configure the size of the input batch that is passed to __call__ by setting the batch_size argument for ds.map_batches().

Increasing batch size results in faster execution because inference is a vectorized operation. For GPU inference, increasing batch size increases GPU utilization. Set the batch size to as large as possible without running out of memory. If you encounter OOMs, decreasing batch_size may help.

from typing import Dict

import numpy as np
import ray

ds = ray.data.from_numpy(np.ones((10, 100)))

def assert_batch(batch: Dict[str, np.ndarray]):
    # Each batch is a dict with a single "data" column holding two rows.
    assert len(batch["data"]) == 2
    return batch

# Specify that each input batch should be of size 2.
ds.map_batches(assert_batch, batch_size=2)

The default batch_size of 4096 may be too large for datasets with large rows (e.g., tables with many columns or a collection of large images).

Handling GPU out-of-memory failures

If you run into CUDA out-of-memory issues, your batch size is likely too large. Decrease the batch size as described above.

If your batch size is already set to 1, then use either a smaller model or GPU devices with more memory.
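When decreasing the batch size does resolve the error, only the batch_size argument needs to change. As a rough sketch, reusing the ds and TorchPredictor from the PyTorch tab above (the value 32 is an illustrative placeholder, not a recommendation):

predictions = ds.map_batches(
    TorchPredictor,
    num_gpus=1,
    # Reduced from a larger value that triggered CUDA out-of-memory errors.
    batch_size=32,
    compute=ray.data.ActorPoolStrategy(size=2),
)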
For advanced users working with large models, you can use model parallelism to shard the model across multiple GPUs. Optimizing expensive CPU preprocessing If your workload involves expensive CPU preprocessing in addition to model inference, you can optimize throughput by separating the preprocessing and inference logic into separate stages. This separation allows inference on batch N to execute concurrently with preprocessing on batch N+1. For an example where preprocessing is done in a separate map call, see Image Classification Batch Inference with PyTorch ResNet18. Handling CPU out-of-memory failures If you run out of CPU RAM, you likely that you have too many model replicas that are running concurrently on the same node. For example, if a model uses 5GB of RAM when created / run, and a machine has 16GB of RAM total, then no more than three of these models can be run at the same time. The default resource assignments of one CPU per task/actor will likely lead to OutOfMemoryError from Ray in this situation. Suppose your cluster has 4 nodes, each with 16 CPUs. To limit to at most 3 of these actors per node, you can override the CPU or memory: from typing import Dict import numpy as np import ray ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"])) class HuggingFacePredictor: def __init__(self): from transformers import pipeline self.model = pipeline("text-generation", model="gpt2") def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: predictions = self.model(list(batch["data"]), max_length=20, num_return_sequences=1) batch["output"] = [sequences[0]["generated_text"] for sequences in predictions] return batch predictions = ds.map_batches( HuggingFacePredictor, # Require 5 CPUs per actor (so at most 3 can fit per 16 CPU node). num_cpus=5, # 3 actors per node, with 4 nodes in the cluster means ActorPool size of 12. compute=ray.data.ActorPoolStrategy(size=12) ) predictions.show(limit=1) Using models from Ray Train Models that have been trained with Ray Train can then be used for batch inference with Ray Data via the Checkpoint that is returned by Ray Train. Step 1: Train a model with Ray Train. import ray from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) trainer = XGBoostTrainer( scaling_config=ScalingConfig( num_workers=2, use_gpu=False, ), label_column="target", num_boost_round=20, params={ "objective": "binary:logistic", "eval_metric": ["logloss", "error"], }, datasets={"train": train_dataset, "valid": valid_dataset}, ) result = trainer.fit() ... Step 2: Extract the Checkpoint from the training Result. checkpoint = result.checkpoint Step 3: Use Ray Data for batch inference. To load in the model from the Checkpoint inside the Python class, use one of the framework-specific Checkpoint classes. In this case, we use the XGBoostCheckpoint to load the model. The rest of the logic looks the same as in the Quickstart. 
from typing import Dict import pandas as pd import numpy as np import xgboost from ray.air import Checkpoint from ray.train.xgboost import XGBoostCheckpoint test_dataset = valid_dataset.drop_columns(["target"]) class XGBoostPredictor: def __init__(self, checkpoint: Checkpoint): xgboost_checkpoint = XGBoostCheckpoint.from_checkpoint(checkpoint) self.model = xgboost_checkpoint.get_model() def __call__(self, data: pd.DataFrame) -> Dict[str, np.ndarray]: dmatrix = xgboost.DMatrix(data) return {"predictions": self.model.predict(dmatrix)} # Use 2 parallel actors for inference. Each actor predicts on a # different partition of data. scale = ray.data.ActorPoolStrategy(size=2) # Map the Predictor over the Dataset to get predictions. predictions = test_dataset.map_batches( XGBoostPredictor, compute=scale, batch_format="pandas", # Pass in the Checkpoint to the XGBoostPredictor constructor. fn_constructor_kwargs={"checkpoint": checkpoint} ) predictions.show(limit=1) {'predictions': 0.9969483017921448} Ray Data Internals This guide describes the implementation of Ray Data. The intended audience is advanced users and Ray Data developers. For a gentler introduction to Ray Data, see Key concepts. Datasets and blocks A Dataset operates over a sequence of Ray object references to blocks. Each block contains a disjoint subset of rows, and Ray Data loads and transforms these blocks in parallel. The following figure visualizes a dataset with three blocks, each holding 1000 rows. https://docs.google.com/drawings/d/1PmbDvHRfVthme9XD7EYM-LIHPXtHdOfjCbc1SCsM64k/edit Operations Reading files Ray Data uses Ray tasks to read files in parallel. Each read task reads one or more files and produces an output block: https://docs.google.com/drawings/d/15B4TB8b5xN15Q9S8-s0MjW6iIvo_PrH7JtV1fL123pU/edit To handle transient errors from remote datasources, Ray Data retries application-level exceptions. For more information on loading data, see Loading data. Transforming data Ray Data uses either Ray tasks or Ray actors to transform blocks. By default, it uses tasks. https://docs.google.com/drawings/d/12STHGV0meGWfdWyBlJMUgw7a-JcFPu9BwSOn5BjRw9k/edit For more information on transforming data, see Transforming data. Shuffling data When you call random_shuffle(), sort(), or groupby(), Ray Data shuffles blocks in a map-reduce style: map tasks partition blocks by value and then reduce tasks merge co-partitioned blocks. Shuffles materialize Datasets in memory. In other words, shuffle execution isn’t streamed through memory. For an in-depth guide on shuffle performance, see Performance Tips and Tuning. Scheduling Ray Data uses Ray Core for execution, and is subject to the same scheduling considerations as normal Ray Tasks and Actors. Ray Data uses the following custom scheduling settings by default for improved performance: The SPREAD scheduling strategy ensures that data blocks and map tasks are evenly balanced across the cluster. Dataset tasks ignore placement groups by default, see Ray Data and Placement Groups. Ray Data and placement groups By default, Ray Data configures its tasks and actors to use the cluster-default scheduling strategy (“DEFAULT”). You can inspect this configuration variable here: ray.data.DataContext.get_current().scheduling_strategy. This scheduling strategy schedules these Tasks and Actors outside any present placement group. 
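For instance, a minimal sketch of checking the current value before deciding whether to override it (the printed value assumes the default configuration has not been changed):

import ray

ctx = ray.data.DataContext.get_current()
# With the default configuration this is "DEFAULT", which schedules
# Ray Data tasks and actors outside any present placement group.
print(ctx.scheduling_strategy)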
To force Ray Data to schedule tasks within the current placement group (i.e., to use current placement group resources specifically for Ray Data), set ray.data.DataContext.get_current().scheduling_strategy = None. Consider this override only for advanced use cases to improve performance predictability. The general recommendation is to let Ray Data run outside placement groups. Ray Data and Tune When using Ray Data in conjunction with Ray Tune, it is important to ensure there are enough free CPUs for Ray Data to run on. By default, Tune will try to fully utilize cluster CPUs. This can prevent Ray Data from scheduling tasks, reducing performance or causing workloads to hang. To ensure CPU resources are always available for Ray Data execution, limit the number of concurrent Tune trials. This can be done using the max_concurrent_trials Tune option. import ray from ray import tune # This workload will use spare cluster resources for execution. def objective(*args): ray.data.range(10).show() # Create a cluster with 4 CPU slots available. ray.init(num_cpus=4) # By setting `max_concurrent_trials=3`, this ensures the cluster will always # have a sparse CPU for Dataset. Try setting `max_concurrent_trials=4` here, # and notice that the experiment will appear to hang. tuner = tune.Tuner( tune.with_resources(objective, {"cpu": 1}), tune_config=tune.TuneConfig( num_samples=1, max_concurrent_trials=3 ) ) tuner.fit() Execution Ray Data execution by default is: Lazy: This means that transformations on Dataset are not executed until a consumption operation (e.g. ds.iter_batches()) or Dataset.materialize() is called. This creates opportunities for optimizing the execution plan (e.g. stage fusion). Streaming: This means that Dataset transformations will be executed in a streaming way, incrementally on the base data, instead of on all of the data at once, and overlapping the execution of operations. This can be used for streaming data loading into ML training to overlap the data preprocessing and model training, or to execute batch transformations on large datasets without needing to load the entire dataset into cluster memory. Lazy Execution Lazy execution offers opportunities for improved performance and memory stability due to stage fusion optimizations and aggressive garbage collection of intermediate results. Dataset creation and transformation APIs are lazy, with execution only triggered via “sink” APIs, such as consuming (ds.iter_batches()), writing (ds.write_parquet()), or manually triggering via ds.materialize(). There are a few exceptions to this rule, where transformations such as ds.union() and ds.limit() trigger execution; we plan to make these operations lazy in the future. Check the API docs for Ray Data methods to see if they trigger execution. Those that do trigger execution will have a Note indicating as much. Streaming Execution The following code is a hello world example which invokes the execution with ds.iter_batches() consumption. We will also enable verbose progress reporting, which shows per-operator progress in addition to overall progress. import ray import time # Enable verbose reporting. This can also be toggled on by setting # the environment variable RAY_DATA_VERBOSE_PROGRESS=1. 
ctx = ray.data.DataContext.get_current() ctx.execution_options.verbose_progress = True def sleep(x): time.sleep(0.1) return x for _ in ( ray.data.range_tensor(5000, shape=(80, 80, 3), parallelism=200) .map_batches(sleep, num_cpus=2) .map_batches(sleep, compute=ray.data.ActorPoolStrategy(2, 4)) .map_batches(sleep, num_cpus=1) .iter_batches() ): pass This launches a simple 4-stage pipeline. We use different compute args for each stage, which forces them to be run as separate operators instead of getting fused together. You should see a log message indicating streaming execution is being used: 2023-03-30 16:40:10,076 INFO streaming_executor.py:83 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange] -> TaskPoolMapOperator[MapBatches(sleep)] -> ActorPoolMapOperator[MapBatches(sleep)] -> TaskPoolMapOperator[MapBatches(sleep)] The next few lines will show execution progress. Here is how to interpret the output: Running: 7.0/16.0 CPU, 0.0/0.0 GPU, 76.91 MiB/2.25 GiB object_store_memory 65%|██▊ | 130/200 [00:08<00:02, 22.52it/s] This line tells you how many resources are currently being used by the streaming executor out of the limits, as well as the number of completed output blocks. The streaming executor will attempt to keep resource usage under the printed limits by throttling task executions. ReadRange: 2 active, 37 queued, 7.32 MiB objects 1: 80%|████████▊ | 161/200 [00:08<00:02, 17.81it/s] MapBatches(sleep): 5 active, 5 queued, 18.31 MiB objects 2: 76%|██▎| 151/200 [00:08<00:02, 19.93it/s] MapBatches(sleep): 7 active, 2 queued, 25.64 MiB objects, 2 actors [all objects local] 3: 71%|▋| 142/ MapBatches(sleep): 2 active, 0 queued, 7.32 MiB objects 4: 70%|██▊ | 139/200 [00:08<00:02, 23.16it/s] These lines are only shown when verbose progress reporting is enabled. The active count indicates the number of running tasks for the operator. The queued count is the number of input blocks for the operator that are computed but are not yet submitted for execution. For operators that use actor-pool execution, the number of running actors is shown as actors. Avoid returning large outputs from the final operation of a pipeline you are iterating over, since the consumer process will be a serial bottleneck. Fault tolerance Ray Data performs lineage reconstruction to recover data. If an application error or system failure occurs, Ray Data recreates blocks by re-executing tasks. Fault tolerance isn’t supported if the process that created the Dataset dies. Stage Fusion Optimization In order to reduce memory usage and task overheads, Ray Data will automatically fuse together lazy operations that are compatible: Same compute pattern: embarrassingly parallel map vs. all-to-all shuffle Same compute strategy: Ray tasks vs Ray actors Same resource specification, e.g. num_cpus or num_gpus requests Read stages and subsequent map-like transformations will usually be fused together. All-to-all transformations such as ds.random_shuffle() can be fused with earlier map-like stages, but not later stages. You can tell if stage fusion is enabled by checking the Dataset stats and looking for fused stages (e.g., read->map_batches). Stage N read->map_batches->shuffle_map: N/N blocks executed in T * Remote wall time: T min, T max, T mean, T total * Remote cpu time: T min, T max, T mean, T total * Output num rows: N min, N max, N mean, N total Memory Management This section describes how Ray Data manages execution and object store memory. 
Execution Memory

During execution, a task can read multiple input blocks, and write multiple output blocks. Input and output blocks consume both worker heap memory and shared memory via Ray’s object store.

Ray Data attempts to bound its heap memory usage to num_execution_slots * max_block_size. The number of execution slots is by default equal to the number of CPUs, unless custom resources are specified. The maximum block size is set by the configuration parameter ray.data.DataContext.target_max_block_size and is set to 512MiB by default. When a task’s output is larger than this value, the worker automatically splits the output into multiple smaller blocks to avoid running out of heap memory.

Large block sizes can lead to potential out-of-memory situations. To avoid these issues, make sure no single item in your dataset is too large, and always call ds.map_batches() with a batch size small enough that the output batch can comfortably fit into memory.

Object Store Memory

Ray Data uses the Ray object store to store data blocks, which means it inherits the memory management features of the Ray object store. This section discusses the relevant features:

Object Spilling: Since Ray Data uses the Ray object store to store data blocks, any blocks that can’t fit into object store memory are automatically spilled to disk. The objects are automatically reloaded when needed by downstream compute tasks.

Locality Scheduling: Ray will preferentially schedule compute tasks on nodes that already have a local copy of the object, reducing the need to transfer objects between nodes in the cluster.

Reference Counting: Dataset blocks are kept alive by object store reference counting as long as there is any Dataset that references them. To free memory, delete any Python references to the Dataset object.

Advanced: Performance Tips and Tuning

Optimizing transforms

Batching transforms

If your transformation is vectorized like most NumPy or pandas operations, use map_batches() rather than map(). It’s faster. If your transformation isn’t vectorized, there’s no performance benefit.

Optimizing reads

Tuning read parallelism

By default, Ray Data automatically selects the read parallelism according to the following procedure:

The number of available CPUs is estimated. If in a placement group, the number of CPUs in the cluster is scaled by the size of the placement group compared to the cluster size. If not in a placement group, this is the number of CPUs in the cluster.
The parallelism is set to the estimated number of CPUs multiplied by 2.
If the parallelism is less than 8, it is set to 8.
The in-memory data size is estimated. If the parallelism would create in-memory blocks that are larger on average than the target block size (512MiB), the parallelism is increased until the blocks are < 512MiB in size.

Occasionally, it is advantageous to manually tune the parallelism to optimize the application. This can be done when loading data via the parallelism parameter. For example, use ray.data.read_parquet(path, parallelism=1000) to force up to 1000 read tasks to be created.

Tuning read resources

By default, Ray requests 1 CPU per read task, which means one read task per CPU can execute concurrently. For datasources that can benefit from higher degrees of IO parallelism, you can specify a lower num_cpus value for the read function with the ray_remote_args parameter. For example, use ray.data.read_parquet(path, ray_remote_args={"num_cpus": 0.25}) to allow up to four read tasks per CPU.
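Combining these two knobs, here is a minimal sketch of manually tuning a read; the parallelism value and the num_cpus fraction are illustrative settings, not recommendations, and the right values depend on your cluster and datasource.

import ray

ds = ray.data.read_parquet(
    "s3://anonymous@ray-example-data/iris.parquet",
    # Force up to 1000 read tasks to be created.
    parallelism=1000,
    # Request a fractional CPU per read task so that up to four read
    # tasks can run concurrently per CPU.
    ray_remote_args={"num_cpus": 0.25},
)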
Parquet column pruning

Currently, Ray Data reads all Parquet columns into memory. If you only need a subset of the columns, make sure to specify the list of columns explicitly when calling ray.data.read_parquet() to avoid loading unnecessary data (projection pushdown). For example, use ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet", columns=["sepal.length", "variety"]) to read just two of the five columns of the Iris dataset.

Parquet row pruning

Similarly, you can pass in a filter to ray.data.read_parquet() (filter pushdown), which is applied at the file scan so only rows that match the filter predicate are returned. For example, use ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet", filter=pyarrow.dataset.field("sepal.length") > 5.0) (where pyarrow has to be imported) to read only rows with sepal.length greater than 5.0. You can combine this with column pruning when appropriate to get the benefits of both.

Optimizing shuffles

When should I use global per-epoch shuffling?

Use global per-epoch shuffling only if your model is sensitive to the randomness of the training data. In theory, all gradient-descent-based model trainers benefit from improved (global) shuffle quality; in practice, the benefit is particularly pronounced for tabular data and models. However, the more global the shuffle is, the more expensive the shuffling operation. The cost compounds with distributed data-parallel training on a multi-node cluster due to data transfer costs, and can be prohibitive when using very large datasets.

The best way to determine the tradeoff between preprocessing time and cost on the one hand and per-epoch shuffle quality on the other is to measure the precision gain per training step for your particular model under different shuffling policies: no shuffling, local (per-shard) limited-memory shuffle buffer, local (per-shard) shuffling, windowed (pseudo-global) shuffling, and fully global shuffling.

From the perspective of keeping preprocessing time in check, as long as your data loading and shuffling throughput is higher than your training throughput, your GPU should be saturated. If you have shuffle-sensitive models, push the shuffle quality higher until this threshold is hit.

Enabling push-based shuffle

Some Dataset operations require a shuffle operation, meaning that data is shuffled from all of the input partitions to all of the output partitions. These operations include Dataset.random_shuffle, Dataset.sort and Dataset.groupby. Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size cannot fit into memory.

Datasets provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance. We recommend trying this out if your dataset has more than 1000 blocks or is larger than 1 TB in size.

To try this out locally or on a cluster, you can start with the nightly release test that Ray runs for Dataset.random_shuffle and Dataset.sort. To get an idea of the performance you can expect, here are some run time results for Dataset.random_shuffle on 1-10TB of data on 20 machines (m5.4xlarge instances on AWS EC2, each with 16 vCPUs, 64GB RAM).
To try out push-based shuffle, set the environment variable RAY_DATA_PUSH_BASED_SHUFFLE=1 when running your application:

$ wget https://raw.githubusercontent.com/ray-project/ray/master/release/nightly_tests/dataset/sort.py
$ RAY_DATA_PUSH_BASED_SHUFFLE=1 python sort.py --num-partitions=10 --partition-size=1e7

# Dataset size: 10 partitions, 0.01GB partition size, 0.1GB total
# [dataset]: Run `pip install tqdm` to enable progress reporting.
# 2022-05-04 17:30:28,806 INFO push_based_shuffle.py:118 -- Using experimental push-based shuffle.
# Finished in 9.571171760559082
# ...

You can also specify the shuffle implementation during program execution by setting the DataContext.use_push_based_shuffle flag:

import ray

ctx = ray.data.DataContext.get_current()
ctx.use_push_based_shuffle = True

ds = (
    ray.data.range(1000)
    .random_shuffle()
)

Configuring execution

Configuring resources and locality

By default, the CPU and GPU limits are set to the cluster size, and the object store memory limit is set conservatively to 1/4 of the total object store size to avoid the possibility of disk spilling.

You may want to customize these limits in the following scenarios:
- If running multiple concurrent jobs on the cluster, setting lower limits can avoid resource contention between the jobs.
- If you want to fine-tune the memory limit to maximize performance.
- For data loading into training jobs, you may want to set the object store memory to a low value (e.g., 2GB) to limit resource usage.

You can configure execution options with the global DataContext. The options are applied for future jobs launched in the process:

ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits.cpu = 10
ctx.execution_options.resource_limits.gpu = 5
ctx.execution_options.resource_limits.object_store_memory = 10e9

Locality with output (ML ingest use case)

ctx.execution_options.locality_with_output = True

Setting this parameter to True tells Ray Data to prefer placing operator tasks onto the consumer node in the cluster, rather than spreading them evenly across the cluster. This setting can be useful if you know you are consuming the output data directly on the consumer node (i.e., for ML training ingest). However, other use cases may incur a performance penalty with this setting.

Reproducibility

Deterministic execution

# By default, this is set to False.
ctx.execution_options.preserve_order = True

To enable deterministic execution, set the above to True. This setting may decrease performance, but ensures block ordering is preserved through execution. This flag defaults to False.

Monitoring your application

View the Ray Dashboard to monitor your application and troubleshoot issues. To learn more about the Ray dashboard, see Ray Dashboard.

Ray Data Examples

Image Classification Batch Inference with Huggingface Vision Transformer

In this example, we will introduce how to use Ray Data for large-scale image classification batch inference with multiple GPU workers.

In particular, we will:

Load the Imagenette dataset from an S3 bucket and create a Ray Dataset.
Load a pretrained Vision Transformer from Huggingface that’s been trained on ImageNet.
Use Ray Data to preprocess the dataset and do model inference, parallelizing across multiple GPUs.
Evaluate the predictions and save results to S3/local disk.

This example will still work even if you do not have GPUs available, but overall performance will be slower.
To run this example, you will need to install the following:
!pip install -q -U "ray[data]" transformers
Step 1: Reading the Dataset from S3
Imagenette is a subset of Imagenet with 10 classes. We have this dataset hosted publicly in an S3 bucket. Since we are only doing inference here, we load in just the validation split.
Here, we use ray.data.read_images to load the validation set from S3. Ray Data also supports reading from a variety of other datasources and formats.
import ray

s3_uri = "s3://anonymous@air-example-data-2/imagenette2/val/"

ds = ray.data.read_images(
    s3_uri, mode="RGB"
)
ds
[2023-05-24 11:25:47] INFO ray._private.worker::Connecting to existing Ray cluster at address: 10.0.33.149:6379...
[2023-05-24 11:25:47] INFO ray._private.worker::Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_6h5a4kl2xhfgtdy4w41he6iwyw/services?redirect_to=dashboard
[2023-05-24 11:25:47] INFO ray._private.runtime_env.packaging::Pushing file package 'gcs://_ray_pkg_2429254893b10da6df2b65ceaf858894.zip' (8.71MiB) to Ray cluster...
[2023-05-24 11:25:47] INFO ray._private.runtime_env.packaging::Successfully pushed file package 'gcs://_ray_pkg_2429254893b10da6df2b65ceaf858894.zip'.
[2023-05-24 11:25:50] [Ray Data] WARNING ray.data.dataset::Important: Ray Data requires schemas for all datasets in Ray 2.5. This means that standalone Python objects are no longer supported. In addition, the default batch format is fixed to NumPy. To revert to legacy behavior temporarily, set the environment variable RAY_DATA_STRICT_MODE=0 on all cluster processes. Learn more here: https://docs.ray.io/en/master/data/faq.html#migrating-to-strict-mode
Inspecting the schema, we can see that there is 1 column in the dataset containing the images stored as Numpy arrays.
ds.schema()
Column  Type
------  ----
image   numpy.ndarray(ndim=3, dtype=uint8)
Step 2: Inference on a single batch
Next, we can do inference on a single batch of data, using a pre-trained Vision Transformer from Huggingface, following this Huggingface example.
Let’s get a batch of 10 from our dataset. Each image in the batch is represented as a Numpy array.
single_batch = ds.take_batch(10)
We can visualize 1 image from this batch.
from PIL import Image

img = Image.fromarray(single_batch["image"][0])
img
Now, let’s create a Huggingface Image Classification pipeline from a pre-trained Vision Transformer model. We specify the following configurations: we set the device to “cuda:0” to use the GPU for inference, and we set the batch size to 10 so that we can maximize GPU utilization and do inference on the entire batch at once. We also convert the image Numpy arrays into PIL Images since that’s what Huggingface expects.
From the results, we see that all of the images in the batch are correctly classified as “tench”, which is a type of fish.
from transformers import pipeline
from PIL import Image

# If doing CPU inference, set device="cpu" instead.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224", device="cuda:0")
outputs = classifier([Image.fromarray(image_array) for image_array in single_batch["image"]], top_k=1, batch_size=10)
del classifier  # Delete the classifier to free up GPU memory.
outputs
[[{'score': 0.9997267127037048, 'label': 'tench, Tinca tinca'}],
 [{'score': 0.9993537068367004, 'label': 'tench, Tinca tinca'}],
 [{'score': 0.9997393488883972, 'label': 'tench, Tinca tinca'}],
 [{'score': 0.99950110912323, 'label': 'tench, Tinca tinca'}],
 [{'score': 0.9986729621887207, 'label': 'tench, Tinca tinca'}],
 [{'score': 0.999290943145752, 'label': 'tench, Tinca tinca'}],
 [{'score': 0.9997896552085876, 'label': 'tench, Tinca tinca'}],
 [{'score': 0.9997585415840149, 'label': 'tench, Tinca tinca'}],
 [{'score': 0.9985774755477905, 'label': 'tench, Tinca tinca'}],
 [{'score': 0.9996065497398376, 'label': 'tench, Tinca tinca'}]]
Step 3: Scaling up to the full Dataset with Ray Data
By using Ray Data, we can apply the same logic in the previous section to scale up to the entire dataset, leveraging all the GPUs in our cluster.
There are a couple of unique considerations for the inference step: (1) model initialization is usually pretty expensive, and (2) we want to do inference in batches to maximize GPU utilization.
To address 1, we package the inference code in an ImageClassifier class. Using a class allows us to put the expensive pipeline loading and initialization code in the __init__ constructor, which will run only once. The actual model inference logic is in the __call__ method, which will be called for each batch.
To address 2, we do our inference in batches, specifying a batch_size to the Huggingface Pipeline. The __call__ method takes a batch of data items, instead of a single one. In this case, the batch is a dict that has one key named “image”, and the value is a Numpy array of images represented in np.ndarray format. This is the same format as in step 2, so we can reuse the same inferencing logic from step 2.
from typing import Dict
import numpy as np
from transformers import pipeline
from PIL import Image

# Pick the largest batch size that can fit on our GPUs
BATCH_SIZE = 1024

class ImageClassifier:
    def __init__(self):
        # If doing CPU inference, set `device="cpu"` instead.
        self.classifier = pipeline("image-classification", model="google/vit-base-patch16-224", device="cuda:0")

    def __call__(self, batch: Dict[str, np.ndarray]):
        # Convert the numpy array of images into a list of PIL images which is the format the HF pipeline expects.
        outputs = self.classifier(
            [Image.fromarray(image_array) for image_array in batch["image"]],
            top_k=1,
            batch_size=BATCH_SIZE)
        # `outputs` is a list of length-one lists. For example:
        # [[{'score': '...', 'label': '...'}], ..., [{'score': '...', 'label': '...'}]]
        batch["score"] = [output[0]["score"] for output in outputs]
        batch["label"] = [output[0]["label"] for output in outputs]
        return batch
Then we use the map_batches API to apply the model to the whole dataset. The first parameter of map_batches is the user-defined function (UDF), which can either be a function or a class. Since we are using a class in this case, the UDF will run as long-running Ray actors. For class-based UDFs, we use the compute argument to specify ActorPoolStrategy with the number of parallel actors. The batch_size argument indicates the number of images in each batch. The num_gpus argument specifies the number of GPUs needed for each ImageClassifier instance. In this case, we want 1 GPU for each model replica.
predictions = ds.map_batches( ImageClassifier, compute=ray.data.ActorPoolStrategy(size=4), # Use 4 GPUs. Change this number based on the number of GPUs in your cluster. num_gpus=1, # Specify 1 GPU per model replica.
batch_size=BATCH_SIZE # Use the largest batch size that can fit on our GPUs ) Verify and Save Results Let’s take a small batch and verify the results. prediction_batch = predictions.take_batch(5) [2023-05-24 12:08:44] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadImage] -> ActorPoolMapOperator[MapBatches(ImageClassifier)] [2023-05-24 12:08:44] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [2023-05-24 12:08:44] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [2023-05-24 12:08:44] [Ray Data] INFO ray.data._internal.execution.operators.actor_pool_map_operator.logfile::MapBatches(ImageClassifier): Waiting for 4 pool actors to start... (_MapWorker pid=137172) 2023-05-24 12:08:49.035713: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 2x across cluster] (_MapWorker pid=137172) 2023-05-24 12:08:49.035721: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. (_MapWorker pid=131332) /home/ray/anaconda3/lib/python3.10/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. [repeated 2x across cluster] (_MapWorker pid=131332) from pandas import MultiIndex, Int64Index [repeated 2x across cluster] (_MapWorker pid=137169) 2023-05-24 12:08:48.988387: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. (_MapWorker pid=137170) 2023-05-24 12:08:49.136309: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 6x across cluster] (_MapWorker pid=137169) from pandas import MultiIndex, Int64Index (_MapWorker pid=137169) from pandas import MultiIndex, Int64Index (_MapWorker pid=137170) 2023-05-24 12:08:49.136316: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [repeated 2x across cluster] (_MapWorker pid=137171) /home/ray/anaconda3/lib/python3.10/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. 
(_MapWorker pid=137171) from pandas import MultiIndex, Int64Index
[2023-05-24 12:09:22] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Shutting down .
[2023-05-24 12:09:22] [Ray Data] WARNING ray.data._internal.execution.operators.actor_pool_map_operator.logfile::To ensure full parallelization across an actor pool of size 4, the specified batch size should be at most 255. Your configured batch size for this operator was 1024.
We see that all the images are correctly classified as “tench”, which is a type of fish.
from PIL import Image

for image, prediction in zip(prediction_batch["image"], prediction_batch["label"]):
    img = Image.fromarray(image)
    display(img)
    print("Label: ", prediction)
Label: tench, Tinca tinca
Label: tench, Tinca tinca
Label: tench, Tinca tinca
Label: tench, Tinca tinca
Label: tench, Tinca tinca
If the samples look good, we can proceed with saving the results to external storage, e.g., S3 or local disks. See Ray Data Input/Output for all supported storages and file formats.
ds.write_parquet("local://tmp/inference_results")
Image Classification Batch Inference with PyTorch
In this example, we will introduce how to use Ray Data for large-scale batch inference with multiple GPU workers. In particular, we will:
Load the Imagenette dataset from an S3 bucket and create a Ray Dataset.
Load a pretrained ResNet model.
Use Ray Data to preprocess the dataset and do model inference, parallelizing across multiple GPUs.
Evaluate the predictions and save results to S3/local disk.
This example will still work even if you do not have GPUs available, but overall performance will be slower.
See this guide on batch inference for tips and troubleshooting when adapting this example to use your own model and dataset!
To run this example, you will need the following packages:
!pip install -q "ray[data]" torch torchvision
Step 1: Reading the Dataset from S3
Imagenette is a subset of Imagenet with 10 classes. We have this dataset hosted publicly in an S3 bucket. Since we are only doing inference here, we load in just the validation split.
Here, we use ray.data.read_images to load the validation set from S3. Ray Data also supports reading from a variety of other datasources and formats.
import ray

s3_uri = "s3://anonymous@air-example-data-2/imagenette2/train/"
ds = ray.data.read_images(s3_uri, mode="RGB")
ds
2023-06-27 23:23:57,184 INFO worker.py:1452 -- Connecting to existing Ray cluster at address: 10.0.5.141:6379...
2023-06-27 23:23:57,228 INFO worker.py:1627 -- Connected to Ray cluster. View the dashboard at https://session-kncgqf3p7w2j7qcsnz2safl4tj.i.anyscaleuserdata-staging.com
2023-06-27 23:23:57,243 INFO packaging.py:347 -- Pushing file package 'gcs://_ray_pkg_32ef287a3a39e82021e70d2413880a69.zip' (4.49MiB) to Ray cluster...
2023-06-27 23:23:57,257 INFO packaging.py:360 -- Successfully pushed file package 'gcs://_ray_pkg_32ef287a3a39e82021e70d2413880a69.zip'.
2023-06-27 23:23:59,629 WARNING dataset.py:253 -- Important: Ray Data requires schemas for all datasets in Ray 2.5. This means that standalone Python objects are no longer supported. In addition, the default batch format is fixed to NumPy. To revert to legacy behavior temporarily, set the environment variable RAY_DATA_STRICT_MODE=0 on all cluster processes. Learn more here: https://docs.ray.io/en/master/data/faq.html#migrating-to-strict-mode
Inspecting the schema, we can see that there is 1 column in the dataset containing the images stored as Numpy arrays.
ds.schema()
Column  Type
------  ----
image   numpy.ndarray(ndim=3, dtype=uint8)
Step 2: Inference on a single batch
Next, we can do inference on a single batch of data, using a pre-trained ResNet152 model and following this PyTorch example.
Let’s get a batch of 10 from our dataset. Each image in the batch is represented as a Numpy array.
single_batch = ds.take_batch(10)
We can visualize 1 image from this batch.
from PIL import Image

img = Image.fromarray(single_batch["image"][0])
img
Now, let’s download a pre-trained PyTorch Resnet model and get the required preprocessing transforms to preprocess the images prior to prediction.
import torch
from torchvision.models import ResNet152_Weights
from torchvision import transforms
from torchvision import models

weights = ResNet152_Weights.IMAGENET1K_V1

# Load the pretrained resnet model and move to GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet152(weights=weights).to(device)
model.eval()

imagenet_transforms = weights.transforms
transform = transforms.Compose([transforms.ToTensor(), imagenet_transforms()])
Then, we apply the transforms to our batch of images, and pass the batch to the model for inference, making sure to use the GPU device for inference.
We can see that most of the images in the batch have been correctly classified as “tench”, which is a type of fish.
transformed_batch = [transform(image) for image in single_batch["image"]]

with torch.inference_mode():
    prediction_results = model(torch.stack(transformed_batch).to(device))
    classes = prediction_results.argmax(dim=1).cpu()

del model  # Free up GPU memory

labels = [weights.meta["categories"][i] for i in classes]
labels
['tench', 'tench', 'tench', 'tench', 'tench', 'tench', 'tench', 'tench', 'bittern', 'tench']
Step 3: Scaling up to the full Dataset with Ray Data
By using Ray Data, we can apply the same logic in the previous section to scale up to the entire dataset, leveraging all the GPUs in our cluster.
Preprocessing
First let’s convert the preprocessing code to Ray Data. We’ll package the preprocessing code within a preprocess_image function. This function should take only one argument, which is a dict that contains a single image in the dataset, represented as a numpy array. We use the same transform function that was defined above and store the transformed image in a new transformed_image field.
import numpy as np
from typing import Any, Dict

def preprocess_image(row: Dict[str, np.ndarray]):
    return {
        "original_image": row["image"],
        "transformed_image": transform(row["image"]),
    }
Then we use the map() API to apply the function to the whole dataset row by row. We use this instead of map_batches() because the torchvision transforms must be applied one image at a time due to the dataset containing images of different sizes.
By using Ray Data’s map, we can scale out the preprocessing to utilize all the resources in our Ray cluster.
Note: the map method is lazy; it won’t perform execution until we consume the results.
transformed_ds = ds.map(preprocess_image)
2023-06-27 23:25:59,387 WARNING dataset.py:4384 -- The `map`, `flat_map`, and `filter` operations are unvectorized and can be very slow. If you're using a vectorized transformation, consider using `.map_batches()` instead.
Model Inference
Next, let’s convert the model inference part. Compared with preprocessing, model inference has 2 differences:
Model loading and initialization is usually expensive.
Model inference can be optimized with hardware acceleration if we process data in batches. Using larger batches improves GPU utilization and the overall runtime of the inference job. Thus, we convert the model inference code to the following ResnetModel class. In this class, we put the expensive model loading and initialization code in the __init__ constructor, which will run only once. And we put the model inference code in the __call__ method, which will be called for each batch. The __call__ method takes a batch of data items, instead of a single one. In this case, the batch is also a dict that has the "transformed_image" key populated by our preprocessing step, and the value is a Numpy array of images represented in np.ndarray format. We reuse the same inferencing logic from step 2. from typing import Dict import numpy as np import torch class ResnetModel: def __init__(self): self.weights = ResNet152_Weights.IMAGENET1K_V1 self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") self.model = models.resnet152(weights=self.weights).to(self.device) self.model.eval() def __call__(self, batch: Dict[str, np.ndarray]): # Convert the numpy array of images into a PyTorch tensor. # Move the tensor batch to GPU if available. torch_batch = torch.from_numpy(batch["transformed_image"]).to(self.device) with torch.inference_mode(): prediction = self.model(torch_batch) predicted_classes = prediction.argmax(dim=1).detach().cpu() predicted_labels = [ self.weights.meta["categories"][i] for i in predicted_classes ] return { "predicted_label": predicted_labels, "original_image": batch["original_image"], } Then we use the map_batches() API to apply the model to the whole dataset. The first parameter of map_batches is the user-defined function (UDF), which can either be a function or a class. Since we are using a class in this case, the UDF will run as long-running Ray actors. For class-based UDFs, we use the compute argument to specify ActorPoolStrategy with the number of parallel actors. The batch_size argument indicates the number of images in each batch. See the Ray dashboard for GPU memory usage to experiment with the batch_size when using your own model and dataset. You should aim to max out the batch size without running out of GPU memory. The num_gpus argument specifies the number of GPUs needed for each ResnetModel instance. In this case, we want 1 GPU for each model replica. If you are doing CPU inference, you can remove the num_gpus=1. predictions = transformed_ds.map_batches( ResnetModel, compute=ray.data.ActorPoolStrategy( size=4 ), # Use 4 GPUs. Change this number based on the number of GPUs in your cluster. num_gpus=1, # Specify 1 GPU per model replica. batch_size=720, # Use the largest batch size that can fit on our GPUs ) Verify and Save Results Let’s take a small batch of predictions and verify the results. 
prediction_batch = predictions.take_batch(5) 2023-06-27 23:26:04,893 INFO streaming_executor.py:91 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadImage->Map] -> ActorPoolMapOperator[MapBatches(ResnetModel)] 2023-06-27 23:26:04,894 INFO streaming_executor.py:92 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) 2023-06-27 23:26:04,895 INFO streaming_executor.py:94 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` 2023-06-27 23:26:04,950 INFO actor_pool_map_operator.py:114 -- MapBatches(ResnetModel): Waiting for 4 pool actors to start... 2023-06-27 23:26:29,120 INFO streaming_executor.py:149 -- Shutting down . 2023-06-27 23:26:29,335 WARNING actor_pool_map_operator.py:264 -- To ensure full parallelization across an actor pool of size 4, the specified batch size should be at most 360. Your configured batch size for this operator was 720. We see that all the images are correctly classified as “tench”, which is a type of fish. from PIL import Image for image, prediction in zip( prediction_batch["original_image"], prediction_batch["predicted_label"] ): img = Image.fromarray(image) display(img) print("Label: ", prediction) Label: tench Label: tench Label: tench Label: tench Label: tench If the samples look good, we can proceed with saving the results to an external storage, e.g., S3 or local disks. See the guide on saving data for all supported storage and file formats. import tempfile temp_dir = tempfile.mkdtemp() # First, drop the original images to avoid them being saved as part of the predictions. # Then, write the predictions in parquet format to a path with the `local://` prefix # to make sure all results get written on the head node. predictions.drop_columns(["original_image"]).write_parquet(f"local://{temp_dir}") print(f"Predictions saved to `{temp_dir}`!") 2023-06-27 23:26:38,105 INFO streaming_executor.py:91 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadImage->Map] -> ActorPoolMapOperator[MapBatches(ResnetModel)] -> TaskPoolMapOperator[MapBatches()] -> TaskPoolMapOperator[Write] 2023-06-27 23:26:38,106 INFO streaming_executor.py:92 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) 2023-06-27 23:26:38,106 INFO streaming_executor.py:94 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` 2023-06-27 23:26:38,141 INFO actor_pool_map_operator.py:114 -- MapBatches(ResnetModel): Waiting for 4 pool actors to start... 2023-06-27 23:27:27,855 INFO streaming_executor.py:149 -- Shutting down . Predictions saved to `/tmp/tmp0y52g_f5`! Object Detection Batch Inference with PyTorch This example demonstrates how to do object detection batch inference at scale with a pre-trained PyTorch model and Ray Data. Here is what you’ll do: Perform object detection on a single image with a pre-trained PyTorch model. Scale the PyTorch model with Ray Data, and perform object detection batch inference on a large set of images. Verify the inference results and save them to an external storage. Learn how to use Ray Data with GPUs. 
Before You Begin
Install the following dependencies if you haven’t already.
!pip install "ray[data]" torchvision
Object Detection on a single Image with PyTorch
Before diving into Ray Data, let’s take a look at this object detection example from PyTorch’s official documentation. The example uses a pre-trained model (FasterRCNN_ResNet50) to do object detection inference on a single image.
First, download an image from the Internet.
import requests
from PIL import Image

url = "https://s3-us-west-2.amazonaws.com/air-example-data/AnimalDetection/JPEGImages/2007_000063.jpg"
img = Image.open(requests.get(url, stream=True).raw)
display(img)
Second, load and initialize a pre-trained PyTorch model.
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2, FasterRCNN_ResNet50_FPN_V2_Weights

weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.9)
model.eval()
FasterRCNN( (transform): GeneralizedRCNNTransform( Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) Resize(min_size=(800,), max_size=1333, mode='bilinear') ) (backbone): BackboneWithFPN( (body): IntermediateLayerGetter( (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (layer1): Sequential( (0): Bottleneck( (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer2): Sequential( (0): Bottleneck( (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128,
kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer3): Sequential( (0): Bottleneck( (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) 
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (3): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (4): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (5): Bottleneck( (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) (layer4): Sequential( (0): Bottleneck( (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (downsample): Sequential( (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) (2): Bottleneck( (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, 
affine=True, track_running_stats=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) ) ) ) (fpn): FeaturePyramidNetwork( (inner_blocks): ModuleList( (0): Conv2dNormActivation( (0): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) (1): Conv2dNormActivation( (0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) (2): Conv2dNormActivation( (0): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) (3): Conv2dNormActivation( (0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (layer_blocks): ModuleList( (0-3): 4 x Conv2dNormActivation( (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (extra_blocks): LastLevelMaxPool() ) ) (rpn): RegionProposalNetwork( (anchor_generator): AnchorGenerator() (head): RPNHead( (conv): Sequential( (0): Conv2dNormActivation( (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): ReLU(inplace=True) ) (1): Conv2dNormActivation( (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): ReLU(inplace=True) ) ) (cls_logits): Conv2d(256, 3, kernel_size=(1, 1), stride=(1, 1)) (bbox_pred): Conv2d(256, 12, kernel_size=(1, 1), stride=(1, 1)) ) ) (roi_heads): RoIHeads( (box_roi_pool): MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'], output_size=(7, 7), sampling_ratio=2) (box_head): FastRCNNConvFCHead( (0): Conv2dNormActivation( (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (1): Conv2dNormActivation( (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (2): Conv2dNormActivation( (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (3): Conv2dNormActivation( (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (4): Flatten(start_dim=1, end_dim=-1) (5): Linear(in_features=12544, out_features=1024, bias=True) (6): ReLU(inplace=True) ) (box_predictor): FastRCNNPredictor( (cls_score): Linear(in_features=1024, out_features=91, bias=True) (bbox_pred): Linear(in_features=1024, out_features=364, bias=True) ) ) ) Then apply the preprocessing transforms. img = transforms.Compose([transforms.PILToTensor()])(img) preprocess = weights.transforms() batch = [preprocess(img)] Then use the model for inference. 
prediction = model(batch)[0]
Lastly, visualize the result.
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import to_pil_image

labels = [weights.meta["categories"][i] for i in prediction["labels"]]
box = draw_bounding_boxes(img, boxes=prediction["boxes"], labels=labels, colors="red", width=4)
im = to_pil_image(box.detach())
display(im)
Scaling with Ray Data
Then let’s see how to scale the previous example to a large set of images. We will use Ray Data to do batch inference in a distributed fashion, leveraging all the CPU and GPU resources in our cluster.
Loading the Image Dataset
The dataset that we will be using is a subset of Pascal VOC that contains cats and dogs (the full dataset has 20 classes). There are 2434 images in this dataset.
First, we use the ray.data.read_images API to load a prepared image dataset from S3. We can use the schema API to check the schema of the dataset. As we can see, it has one column named “image”, and the value is the image data represented in np.ndarray format.
import ray

ds = ray.data.read_images("s3://anonymous@air-example-data/AnimalDetection/JPEGImages")
display(ds.schema())
[2023-05-19 18:10:29] INFO ray._private.worker::Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[2023-05-19 18:10:35] [Ray Data] WARNING ray.data.dataset::Important: Ray Data requires schemas for all datasets in Ray 2.5. This means that standalone Python objects are no longer supported. In addition, the default batch format is fixed to NumPy. To revert to legacy behavior temporarily, set the environment variable RAY_DATA_STRICT_MODE=0 on all cluster processes. Learn more here: https://docs.ray.io/en/master/data/faq.html#migrating-to-strict-mode
Column  Type
------  ----
image   numpy.ndarray(ndim=3, dtype=uint8)
Batch inference with Ray Data
As we can see from the PyTorch example, the workflow consists of 2 steps: preprocessing the image and model inference.
Preprocessing
First let’s convert the preprocessing code to Ray Data. We’ll package the preprocessing code within a preprocess_image function. This function should take only one argument, which is a dict that contains a single image in the dataset, represented as a numpy array.
import numpy as np
import torch
from torchvision import transforms
from torchvision.models.detection import (FasterRCNN_ResNet50_FPN_V2_Weights, fasterrcnn_resnet50_fpn_v2)
from typing import Dict

def preprocess_image(data: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
    preprocessor = transforms.Compose(
        [transforms.ToTensor(), weights.transforms()]
    )
    return {
        "image": data["image"],
        "transformed": preprocessor(data["image"]),
    }
Then we use the map API to apply the function to the whole dataset. By using Ray Data’s map, we can scale out the preprocessing to all the resources in our Ray cluster.
Note that the map method is lazy; it won’t perform execution until we start to consume the results.
ds = ds.map(preprocess_image)
[2023-05-19 18:10:37] [Ray Data] WARNING ray.data.dataset::The `map`, `flat_map`, and `filter` operations are unvectorized and can be very slow. If you're using a vectorized transformation, consider using `.map_batches()` instead.
Model inference
Next, let’s convert the model inference part. Compared with preprocessing, model inference has 2 differences:
Model loading and initialization is usually expensive.
Model inference can be optimized with hardware acceleration if we process data in batches.
Using larger batches improves GPU utilization and the overall runtime of the inference job. Thus, we convert the model inference code to the following ObjectDetectionModel class. In this class, we put the expensive model loading and initialization code in the __init__ constructor, which will run only once. And we put the model inference code in the __call__ method, which will be called for each batch. The __call__ method takes a batch of data items, instead of a single one. In this case, the batch is also a dict that has one key named “image”, and the value is an array of images represented in np.ndarray format. We can also use the take_batch API to fetch a single batch, and inspect its internal data structure. single_batch = ds.take_batch(batch_size=3) display(single_batch) [2023-05-19 18:10:38] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadImage->Map] [2023-05-19 18:10:38] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [2023-05-19 18:10:38] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [2023-05-19 18:10:40] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Shutting down . {'image': array([array([[[173, 153, 142], [255, 246, 242], [255, 245, 245], ..., [255, 255, 244], [237, 235, 223], [214, 212, 200]], [[124, 105, 90], [255, 249, 238], [251, 244, 236], ..., [255, 252, 245], [255, 254, 247], [247, 244, 237]], [[ 56, 37, 20], [255, 253, 239], [248, 248, 236], ..., [248, 247, 243], [248, 247, 243], [254, 253, 249]], ..., [[ 64, 78, 87], [ 63, 74, 80], [105, 113, 115], ..., [ 94, 105, 109], [ 90, 99, 104], [ 84, 91, 97]], [[ 68, 86, 96], [ 69, 82, 88], [ 55, 63, 66], ..., [ 82, 98, 98], [ 54, 70, 70], [ 82, 96, 97]], [[ 67, 87, 96], [ 43, 60, 67], [ 80, 96, 96], ..., [ 63, 75, 75], [ 89, 101, 101], [ 54, 65, 67]]], dtype=uint8), array([[[31, 32, 26], [31, 32, 26], [30, 31, 25], ..., [82, 83, 78], [82, 83, 78], [82, 83, 78]], [[32, 33, 27], [29, 30, 24], [26, 27, 21], ..., [82, 83, 78], [82, 83, 78], [82, 83, 78]], [[27, 28, 22], [23, 24, 18], [21, 22, 16], ..., [84, 85, 80], [84, 85, 80], [84, 85, 80]], ..., [[43, 18, 21], [36, 14, 16], [39, 19, 20], ..., [19, 24, 18], [19, 24, 18], [13, 18, 12]], [[47, 21, 24], [39, 14, 17], [36, 16, 17], ..., [21, 26, 20], [24, 29, 23], [22, 27, 21]], [[47, 16, 22], [40, 13, 18], [36, 16, 18], ..., [ 9, 14, 8], [ 7, 12, 6], [ 1, 6, 0]]], dtype=uint8), array([[[ 17, 3, 2], [ 17, 3, 2], [ 19, 3, 3], ..., [ 55, 68, 84], [ 56, 69, 85], [ 56, 69, 85]], [[ 18, 4, 3], [ 18, 4, 3], [ 19, 3, 3], ..., [ 56, 69, 85], [ 56, 69, 85], [ 57, 70, 86]], [[ 18, 4, 3], [ 18, 4, 3], [ 19, 3, 3], ..., [ 56, 69, 85], [ 56, 69, 85], [ 57, 70, 86]], ..., [[ 9, 0, 1], [ 9, 0, 1], [ 9, 0, 1], ..., [123, 124, 116], [121, 122, 114], [116, 117, 109]], [[ 9, 0, 1], [ 9, 0, 1], [ 9, 0, 1], ..., [121, 122, 114], [119, 120, 112], [115, 116, 108]], [[ 9, 0, 1], [ 9, 0, 1], [ 9, 0, 1], ..., [121, 122, 114], [119, 120, 112], [116, 117, 109]]], dtype=uint8)], dtype=object), 'transformed': array([array([[[0.6784314 , 1. , 1. , ..., 1. , 0.92941177, 0.8392157 ], [0.4862745 , 1. 
, 0.9843137 , ..., 1. , 1. , 0.96862745], [0.21960784, 1. , 0.972549 , ..., 0.972549 , 0.972549 , 0.99607843], ..., [0.2509804 , 0.24705882, 0.4117647 , ..., 0.36862746, 0.3529412 , 0.32941177], [0.26666668, 0.27058825, 0.21568628, ..., 0.32156864, 0.21176471, 0.32156864], [0.2627451 , 0.16862746, 0.3137255 , ..., 0.24705882, 0.34901962, 0.21176471]], [[0.6 , 0.9647059 , 0.9607843 , ..., 1. , 0.92156863, 0.83137256], [0.4117647 , 0.9764706 , 0.95686275, ..., 0.9882353 , 0.99607843, 0.95686275], [0.14509805, 0.99215686, 0.972549 , ..., 0.96862745, 0.96862745, 0.99215686], ..., [0.30588236, 0.2901961 , 0.44313726, ..., 0.4117647 , 0.3882353 , 0.35686275], [0.3372549 , 0.32156864, 0.24705882, ..., 0.38431373, 0.27450982, 0.3764706 ], [0.34117648, 0.23529412, 0.3764706 , ..., 0.29411766, 0.39607844, 0.25490198]], [[0.5568628 , 0.9490196 , 0.9607843 , ..., 0.95686275, 0.8745098 , 0.78431374], [0.3529412 , 0.93333334, 0.9254902 , ..., 0.9607843 , 0.96862745, 0.92941177], [0.07843138, 0.9372549 , 0.9254902 , ..., 0.9529412 , 0.9529412 , 0.9764706 ], ..., [0.34117648, 0.3137255 , 0.4509804 , ..., 0.42745098, 0.40784314, 0.38039216], [0.3764706 , 0.34509805, 0.25882354, ..., 0.38431373, 0.27450982, 0.38039216], [0.3764706 , 0.2627451 , 0.3764706 , ..., 0.29411766, 0.39607844, 0.2627451 ]]], dtype=float32) , array([[[0.12156863, 0.12156863, 0.11764706, ..., 0.32156864, 0.32156864, 0.32156864], [0.1254902 , 0.11372549, 0.10196079, ..., 0.32156864, 0.32156864, 0.32156864], [0.10588235, 0.09019608, 0.08235294, ..., 0.32941177, 0.32941177, 0.32941177], ..., [0.16862746, 0.14117648, 0.15294118, ..., 0.07450981, 0.07450981, 0.05098039], [0.18431373, 0.15294118, 0.14117648, ..., 0.08235294, 0.09411765, 0.08627451], [0.18431373, 0.15686275, 0.14117648, ..., 0.03529412, 0.02745098, 0.00392157]], [[0.1254902 , 0.1254902 , 0.12156863, ..., 0.3254902 , 0.3254902 , 0.3254902 ], [0.12941177, 0.11764706, 0.10588235, ..., 0.3254902 , 0.3254902 , 0.3254902 ], [0.10980392, 0.09411765, 0.08627451, ..., 0.33333334, 0.33333334, 0.33333334], ..., [0.07058824, 0.05490196, 0.07450981, ..., 0.09411765, 0.09411765, 0.07058824], [0.08235294, 0.05490196, 0.0627451 , ..., 0.10196079, 0.11372549, 0.10588235], [0.0627451 , 0.05098039, 0.0627451 , ..., 0.05490196, 0.04705882, 0.02352941]], [[0.10196079, 0.10196079, 0.09803922, ..., 0.30588236, 0.30588236, 0.30588236], [0.10588235, 0.09411765, 0.08235294, ..., 0.30588236, 0.30588236, 0.30588236], [0.08627451, 0.07058824, 0.0627451 , ..., 0.3137255 , 0.3137255 , 0.3137255 ], ..., [0.08235294, 0.0627451 , 0.07843138, ..., 0.07058824, 0.07058824, 0.04705882], [0.09411765, 0.06666667, 0.06666667, ..., 0.07843138, 0.09019608, 0.08235294], [0.08627451, 0.07058824, 0.07058824, ..., 0.03137255, 0.02352941, 0. ]]], dtype=float32) , array([[[0.06666667, 0.06666667, 0.07450981, ..., 0.21568628, 0.21960784, 0.21960784], [0.07058824, 0.07058824, 0.07450981, ..., 0.21960784, 0.21960784, 0.22352941], [0.07058824, 0.07058824, 0.07450981, ..., 0.21960784, 0.21960784, 0.22352941], ..., [0.03529412, 0.03529412, 0.03529412, ..., 0.48235294, 0.4745098 , 0.45490196], [0.03529412, 0.03529412, 0.03529412, ..., 0.4745098 , 0.46666667, 0.4509804 ], [0.03529412, 0.03529412, 0.03529412, ..., 0.4745098 , 0.46666667, 0.45490196]], [[0.01176471, 0.01176471, 0.01176471, ..., 0.26666668, 0.27058825, 0.27058825], [0.01568628, 0.01568628, 0.01176471, ..., 0.27058825, 0.27058825, 0.27450982], [0.01568628, 0.01568628, 0.01176471, ..., 0.27058825, 0.27058825, 0.27450982], ..., [0. , 0. , 0. 
, ..., 0.4862745 , 0.47843137, 0.45882353], [0. , 0. , 0. , ..., 0.47843137, 0.47058824, 0.45490196], [0. , 0. , 0. , ..., 0.47843137, 0.47058824, 0.45882353]], [[0.00784314, 0.00784314, 0.01176471, ..., 0.32941177, 0.33333334, 0.33333334], [0.01176471, 0.01176471, 0.01176471, ..., 0.33333334, 0.33333334, 0.3372549 ], [0.01176471, 0.01176471, 0.01176471, ..., 0.33333334, 0.33333334, 0.3372549 ], ..., [0.00392157, 0.00392157, 0.00392157, ..., 0.45490196, 0.44705883, 0.42745098], [0.00392157, 0.00392157, 0.00392157, ..., 0.44705883, 0.4392157 , 0.42352942], [0.00392157, 0.00392157, 0.00392157, ..., 0.44705883, 0.4392157 , 0.42745098]]], dtype=float32) ], dtype=object)}
class ObjectDetectionModel:
    def __init__(self):
        # Define the model loading and initialization code in `__init__`.
        self.weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
        self.model = fasterrcnn_resnet50_fpn_v2(
            weights=self.weights,
            box_score_thresh=0.9,
        )
        if torch.cuda.is_available():
            # Move the model to GPU if it's available.
            self.model = self.model.cuda()
        self.model.eval()

    def __call__(self, input_batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        # Define the per-batch inference code in `__call__`.
        batch = [torch.from_numpy(image) for image in input_batch["transformed"]]
        if torch.cuda.is_available():
            # Move the data to GPU if it's available.
            batch = [image.cuda() for image in batch]
        predictions = self.model(batch)
        return {
            "image": input_batch["image"],
            "labels": [pred["labels"].detach().cpu().numpy() for pred in predictions],
            "boxes": [pred["boxes"].detach().cpu().numpy() for pred in predictions],
        }
Then we use the map_batches API to apply the model to the whole dataset. The first parameter of map and map_batches is the user-defined function (UDF), which can either be a function or a class. Function-based UDFs will run as short-running Ray tasks, and class-based UDFs will run as long-running Ray actors. For class-based UDFs, we use the compute argument to specify ActorPoolStrategy with the number of parallel actors. The batch_size argument indicates the number of images in each batch.
The num_gpus argument specifies the number of GPUs needed for each ObjectDetectionModel instance. The Ray scheduler can handle heterogeneous resource requirements in order to maximize resource utilization. In this case, the ObjectDetectionModel instances will run on GPU and preprocess_image instances will run on CPU.
ds = ds.map_batches(
    ObjectDetectionModel,
    compute=ray.data.ActorPoolStrategy(size=4),  # Use 4 GPUs. Change this number based on the number of GPUs in your cluster.
    batch_size=4,  # Use the largest batch size that can fit in GPU memory.
    num_gpus=1,  # Specify 1 GPU per model replica. Remove this if you are doing CPU inference.
)
Verify and Save Results
Then let’s take a small batch and verify the inference results with visualization.
from torchvision.transforms.functional import convert_image_dtype, to_tensor

batch = ds.take_batch(batch_size=2)
for image, labels, boxes in zip(batch["image"], batch["labels"], batch["boxes"]):
    image = convert_image_dtype(to_tensor(image), torch.uint8)
    labels = [weights.meta["categories"][i] for i in labels]
    boxes = torch.from_numpy(boxes)
    img = to_pil_image(draw_bounding_boxes(
        image,
        boxes,
        labels=labels,
        colors="red",
        width=4,
    ))
    display(img)
[2023-05-19 18:10:40] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[ReadImage->Map->MapBatches(ObjectDetectionModel)]
[2023-05-19 18:10:40] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
[2023-05-19 18:10:40] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
[2023-05-19 18:10:40] [Ray Data] INFO ray.data._internal.execution.operators.actor_pool_map_operator.logfile::ReadImage->Map->MapBatches(ObjectDetectionModel): Waiting for 4 pool actors to start...
[2023-05-19 18:11:50] [Ray Data] INFO ray.data._internal.execution.streaming_executor.logfile::Shutting down .
[2023-05-19 18:11:50] [Ray Data] WARNING ray.data._internal.execution.operators.actor_pool_map_operator.logfile::To ensure full parallelization across an actor pool of size 4, the specified batch size should be at most 3. Your configured batch size for this operator was 4.
If the samples look good, we can proceed with saving the results to external storage, e.g., S3 or local disks. See Ray Data Input/Output for all supported storages and file formats.
ds.write_parquet("local://tmp/inference_results")
Processing NYC taxi data using Ray Data
The NYC Taxi dataset is a popular tabular dataset. In this example, we demonstrate some basic data processing on this dataset using Ray Data.
Overview
This tutorial will cover:
Reading Parquet data
Inspecting the metadata and first few rows of a large Ray Dataset
Calculating some common global and grouped statistics on the dataset
Dropping columns and rows
Adding a derived column
Shuffling the data
Sharding the data and feeding it to parallel consumers (trainers)
Applying batch (offline) inference to the data
Walkthrough
Let’s start by importing Ray and initializing a local Ray cluster.
# Import ray and initialize a local Ray cluster.
import ray
ray.init()
Reading and Inspecting the Data
Next, we read a few of the files from the dataset. This read is lazy: reading and all future transformations are delayed until a downstream operation triggers execution (e.g., consuming the data with ds.take()).
# Read two Parquet files in parallel.
ds = ray.data.read_parquet([
    "s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_01_data.parquet",
    "s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_02_data.parquet"
])
We can easily inspect the schema of this dataset. For Parquet files, we don’t even have to read the actual data to get the schema; we can read it from the lightweight Parquet metadata!
# Fetch the schema from the underlying Parquet metadata.
ds.schema()
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2524
Parquet even stores the number of rows per file in the Parquet metadata, so we can get the number of rows in ds without triggering a full data read.
ds.count()
2749936
We can get a nice, cheap summary of the Dataset by leveraging its informative repr:
# Display some metadata about the dataset.
ds
Dataset(num_blocks=2, num_rows=2749936, schema={vendor_id: string, pickup_at: timestamp[us], dropoff_at: timestamp[us], passenger_count: int8, trip_distance: float, pickup_longitude: float, pickup_latitude: float, rate_code_id: null, store_and_fwd_flag: string, dropoff_longitude: float, dropoff_latitude: float, payment_type: string, fare_amount: float, extra: float, mta_tax: float, tip_amount: float, tolls_amount: float, total_amount: float})
We can also poke at the actual data, taking a peek at a single row. Since this is only returning a row from the first file, reading of the second file is not triggered yet.
ds.take(1)
[ArrowRow({'vendor_id': 'VTS', 'pickup_at': datetime.datetime(2009, 1, 21, 14, 58), 'dropoff_at': datetime.datetime(2009, 1, 21, 15, 3), 'passenger_count': 1, 'trip_distance': 0.5299999713897705, 'pickup_longitude': -73.99270629882812, 'pickup_latitude': 40.7529411315918, 'rate_code_id': None, 'store_and_fwd_flag': None, 'dropoff_longitude': -73.98814392089844, 'dropoff_latitude': 40.75956344604492, 'payment_type': 'CASH', 'fare_amount': 4.5, 'extra': 0.0, 'mta_tax': None, 'tip_amount': 0.0, 'tolls_amount': 0.0, 'total_amount': 4.5})]
To get a better sense of the data size, we can calculate the size in bytes of the full dataset. Note that for Parquet files, this size-in-bytes will be pulled from the Parquet metadata (not triggering a data read), and therefore might be significantly different from the in-memory size!
ds.size_bytes()
427503965
In order to get the in-memory size, we can trigger full reading of the dataset and inspect the size in bytes.
ds.materialize().size_bytes()
Read progress: 100%|██████████| 2/2 [00:00<00:00, 2.50it/s]
226524489
Advanced Aside - Reading Partitioned Parquet Datasets
In addition to being able to read lists of individual files, ray.data.read_parquet() (as well as other ray.data.read_*() APIs) can read directories containing multiple Parquet files. For Parquet in particular, reading Parquet datasets partitioned by a particular column is supported, allowing for path-based (zero-read) partition filtering and (optionally) including the partition column value specified in the file paths directly in the read table data.
For the NYC taxi dataset, instead of reading individual per-month Parquet files, we can read the entire 2009 directory. This could be a lot of data (downsampled with 0.01 ratio leads to ~50.2 MB on disk, ~147 MB in memory), so be careful triggering full reads on a limited-memory machine!
This is one place where Dataset’s lazy reading comes in handy: Dataset will not execute any read tasks eagerly and will execute the minimum number of file reads to satisfy downstream operations, which allows us to inspect a subset of the data without having to read the entire dataset.
# Read all Parquet data for the year 2009.
year_ds = ray.data.read_parquet("s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet")
The metadata that Dataset prints in its repr is guaranteed to not trigger reads of all files; data such as the row count and the schema is pulled directly from the Parquet metadata.
year_ds.count()
1710629
That’s a lot of rows! Since we’re not going to use this full-year data, let’s now delete this dataset to free up some memory in our Ray cluster.
del year_ds
Data Exploration and Cleaning
Let’s calculate some stats to get a better picture of our data.
# What's the longest trip distance, largest tip amount, and largest number of passengers?
ds.max(["trip_distance", "tip_amount", "passenger_count"])
Shuffle Map: 100%|██████████| 2/2 [00:00<00:00, 50.69it/s]
Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 114.04it/s]
ArrowRow({'max(trip_distance)': 50.0, 'max(tip_amount)': 100.0, 'max(passenger_count)': 6})
We don’t have any use for the store_and_fwd_flag or mta_tax columns, so let’s drop those.
# Drop some columns.
ds = ds.drop_columns(["store_and_fwd_flag", "mta_tax"])
Map_Batches: 100%|██████████| 2/2 [00:03<00:00, 1.59s/it]
Let’s say we want to know how many trips there are for each passenger count.
ds.groupby("passenger_count").count().take()
Sort Sample: 100%|██████████| 2/2 [00:00<00:00, 5.01it/s]
Shuffle Map: 100%|██████████| 2/2 [03:21<00:00, 100.61s/it]
Shuffle Reduce: 0%| | 0/2 [00:00<?, ?it/s]
Let’s also drop rows with non-positive passenger counts.
ds = ds.map_batches(lambda df: df[df["passenger_count"] > 0])
Map_Batches: 100%|██████████| 2/2 [00:03<00:00, 1.60s/it]
Does the passenger count influence the typical trip distance?
# Mean trip distance grouped by passenger count.
ds.groupby("passenger_count").mean("trip_distance").take()
Sort Sample: 100%|██████████| 2/2 [00:00<00:00, 4.57it/s]
Shuffle Map: 100%|██████████| 2/2 [03:23<00:00, 101.59s/it]
Shuffle Reduce: 100%|██████████| 2/2 [00:00<00:00, 178.79it/s]
[PandasRow({'passenger_count': 1, 'mean(trip_distance)': 2.543288084787955}), PandasRow({'passenger_count': 2, 'mean(trip_distance)': 2.7043459216040686}), PandasRow({'passenger_count': 3, 'mean(trip_distance)': 2.6233412684454716}), PandasRow({'passenger_count': 4, 'mean(trip_distance)': 2.642096445352584}), PandasRow({'passenger_count': 5, 'mean(trip_distance)': 2.6286944833939314}), PandasRow({'passenger_count': 6, 'mean(trip_distance)': 2.5848625579855855})]
See Transforming Data for more information on how we can process our data with Ray Data.
Advanced Aside - Projection and Filter Pushdown
Note that Ray Data’s Parquet reader supports projection (column selection) and row filter pushdown, where we can push the above column selection and the row-based filter to the Parquet read. If we specify column selection at Parquet read time, the unselected columns won’t even be read from disk!
The row-based filter is specified via Arrow’s dataset field expressions. See the Parquet row pruning tips for more information.
# Only read the passenger_count and trip_distance columns.
import pyarrow as pa filter_expr = ( (pa.dataset.field("passenger_count") <= 10) & (pa.dataset.field("passenger_count") > 0) ) pushdown_ds = ray.data.read_parquet( [ "s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_01_data.parquet", "s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_02_data.parquet", ], columns=["passenger_count", "trip_distance"], filter=filter_expr, ) # Force full execution of both of the file reads. pushdown_ds = pushdown_ds.materialize() pushdown_ds ⚠️ The number of blocks in this dataset (2) limits its parallelism to 2 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks. Read progress: 100%|██████████| 2/2 [00:00<00:00, 9.19it/s] Dataset(num_blocks=2, num_rows=2749842, schema={passenger_count: int8, trip_distance: float}) # Delete the pushdown dataset. Deleting the Dataset object # will release the underlying memory in the cluster. del pushdown_ds Ingesting into Model Trainers Now that we’ve learned more about our data and we have cleaned up our dataset a bit, we now look at how we can feed this dataset into some dummy model trainers. First, let’s do a full global random shuffle of the dataset to decorrelate these samples. ds = ds.random_shuffle() Shuffle Map: 100%|██████████| 2/2 [00:01<00:00, 1.34it/s] Shuffle Reduce: 100%|██████████| 2/2 [00:01<00:00, 1.09it/s] We define a dummy Trainer actor, where each trainer will consume a dataset shard in batches and simulate model training. In a real training workflow, we would feed ds to Ray Train, which would do this sharding and creation of training actors for us, under the hood. @ray.remote class Trainer: def __init__(self, rank: int): pass def train(self, shard: ray.data.Dataset) -> int: for batch in shard.iter_batches(batch_size=256): pass return shard.count() trainers = [Trainer.remote(i) for i in range(4)] trainers [Actor(Trainer, 9326d43345699213608f324003000000), Actor(Trainer, f0ce2ce44528fbf748c9c1a103000000), Actor(Trainer, 7ba39c8f82ebd78c68e92ec903000000), Actor(Trainer, b95fe3494b7bc2d8f42abbba03000000)] Next, we split the dataset into len(trainers) shards, ensuring that the shards are of equal size. 
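As an aside, Dataset.split() can also take locality_hints, a list of actor handles that Ray Data uses to try to co-locate each shard's blocks with the actor that will consume it. A minimal sketch of that variant, not used in the rest of this walkthrough:
# Hypothetical variant: pass the trainer handles as locality hints so the i-th
# shard is preferentially placed on the node running the i-th trainer actor.
colocated_shards = ds.split(n=len(trainers), equal=True, locality_hints=trainers)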
shards = ds.split(n=len(trainers), equal=True) shards [Dataset(num_blocks=1, num_rows=687460, schema={vendor_id: object, pickup_at: datetime64[ns], dropoff_at: datetime64[ns], passenger_count: int8, trip_distance: float32, pickup_longitude: float32, pickup_latitude: float32, rate_code_id: object, dropoff_longitude: float32, dropoff_latitude: float32, payment_type: object, fare_amount: float32, extra: float32, tip_amount: float32, tolls_amount: float32, total_amount: float32}), Dataset(num_blocks=1, num_rows=687460, schema={vendor_id: object, pickup_at: datetime64[ns], dropoff_at: datetime64[ns], passenger_count: int8, trip_distance: float32, pickup_longitude: float32, pickup_latitude: float32, rate_code_id: object, dropoff_longitude: float32, dropoff_latitude: float32, payment_type: object, fare_amount: float32, extra: float32, tip_amount: float32, tolls_amount: float32, total_amount: float32}), Dataset(num_blocks=2, num_rows=687460, schema={vendor_id: object, pickup_at: datetime64[ns], dropoff_at: datetime64[ns], passenger_count: int8, trip_distance: float32, pickup_longitude: float32, pickup_latitude: float32, rate_code_id: object, dropoff_longitude: float32, dropoff_latitude: float32, payment_type: object, fare_amount: float32, extra: float32, tip_amount: float32, tolls_amount: float32, total_amount: float32}), Dataset(num_blocks=1, num_rows=687460, schema={vendor_id: object, pickup_at: datetime64[ns], dropoff_at: datetime64[ns], passenger_count: int8, trip_distance: float32, pickup_longitude: float32, pickup_latitude: float32, rate_code_id: object, dropoff_longitude: float32, dropoff_latitude: float32, payment_type: object, fare_amount: float32, extra: float32, tip_amount: float32, tolls_amount: float32, total_amount: float32})] Finally, we simulate training, passing each shard to the corresponding trainer. The number of rows per shard is returned. ray.get([w.train.remote(s) for w, s in zip(trainers, shards)]) [687460, 687460, 687460, 687460] # Delete trainer actor handle references, which should terminate the actors. del trainers Parallel Batch Inference Refer to the blog on Model Batch Inference in Ray for an overview of batch inference strategies in Ray and additional examples. After we’ve trained a model, we may want to perform batch (offline) inference on such a tabular dataset. With Ray Data, this is as easy as a ds.map_batches() call! First, we define a callable class that will cache the loading of the model in its constructor. import pandas as pd def load_model(): # A dummy model. def model(batch: pd.DataFrame) -> pd.DataFrame: return pd.DataFrame({"score": batch["passenger_count"] % 2 == 0}) return model class BatchInferModel: def __init__(self): self.model = load_model() def __call__(self, batch: pd.DataFrame) -> pd.DataFrame: return self.model(batch) BatchInferModel’s constructor will only be called once per actor worker when using the actor pool compute strategy in ds.map_batches(). 
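For comparison, a plain function UDF runs as short-lived Ray tasks and would re-create the model on every call, so nothing is cached between batches. A minimal sketch of that alternative (reusing the dummy load_model above; this is not what we run below):
# Hypothetical task-based alternative: the model is rebuilt inside every call.
def infer_batch(batch: pd.DataFrame) -> pd.DataFrame:
    model = load_model()  # re-created per batch; the actor-based class avoids this
    return model(batch)

ds.map_batches(infer_batch, batch_size=2048).take()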
ds.map_batches(BatchInferModel, batch_size=2048, compute=ray.data.ActorPoolStrategy()).take()
Map Progress (2 actors 1 pending): 100%|██████████| 2/2 [00:05<00:00, 2.57s/it]
[PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False})]
If you want to perform batch inference on GPUs, specify the number of GPUs you wish to provision for each batch inference worker. This will only run successfully if your cluster has nodes with GPUs!
ds.map_batches(
    BatchInferModel,
    batch_size=256,
    #num_gpus=1,  # Uncomment this to run this on GPUs!
    compute=ray.data.ActorPoolStrategy(),
).take()
Map Progress (15 actors 4 pending): 100%|██████████| 2/2 [00:21<00:00, 10.67s/it]
[PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False})]
We can also configure the autoscaling actor pool that this inference stage uses, setting upper and lower bounds on the actor pool size, and even tweak the batch prefetching vs. inference task queueing tradeoff.
from ray.data import ActorPoolStrategy

# The actor pool will have at least 2 workers and at most 8 workers.
strategy = ActorPoolStrategy(min_size=2, max_size=8)

ds.map_batches(
    BatchInferModel,
    batch_size=256,
    #num_gpus=1,  # Uncomment this to run this on GPUs!
    compute=strategy,
).take()
Map Progress (8 actors 0 pending): 100%|██████████| 2/2 [00:21<00:00, 10.71s/it]
[PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': False}), PandasRow({'score': True}), PandasRow({'score': False})]
Batch Training with Ray Data
Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting of simple models on data batches corresponding to different locations, products, etc. Batch training can process all of the data in less time, but only if those batches can run in parallel!
This notebook showcases how to conduct batch training of regression algorithms from XGBoost and Scikit-learn with Ray Data.
XGBoost is a popular open-source library used for regression and classification.
Scikit-learn is a popular open-source library with a vast assortment of well-known ML algorithms.
The workload showcased in this notebook can be expressed using different Ray components, such as Ray Data, Ray Tune and Ray Core. For more information, including best practices, see Many Model Training.
Batch training diagram
For the data, we will use the NYC Taxi dataset. This popular tabular dataset contains historical taxi pickups by timestamp and location in NYC.
For the training, we will train separate regression models to predict trip_duration, with a different model for each dropoff location in NYC. Specifically, we will conduct an experiment for each dropoff_location_id, to find the best model, either XGBoost or Scikit-learn, per location.
Contents
In this tutorial, you will learn about:
Creating a Dataset
Filtering a Dataset on Read
Inspecting a Dataset
Transforming a Dataset in parallel
Batch training with Ray Data in parallel
Load a saved model and perform batch prediction
Walkthrough
Let us start by importing a few required libraries, including open-source Ray itself!
import os
num_cpu = os.cpu_count()
print(f"Number of CPUs in this system: {num_cpu}")
from typing import Tuple, List, Union, Optional, Callable
import time
import pandas as pd
import numpy as np
print(f"numpy: {np.__version__}")
import pyarrow
import pyarrow.parquet as pq
import pyarrow.dataset as pds
print(f"pyarrow: {pyarrow.__version__}")
from ray.data import Dataset
Number of CPUs in this system: 8
numpy: 1.23.3
pyarrow: 6.0.1
import ray

if ray.is_initialized():
    ray.shutdown()
ray.init()
2022-12-08 17:04:06,689 INFO worker.py:1223 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-12-08 17:04:06,691 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.31.174.62:9031...
2022-12-08 17:04:06,700 INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_gyl6mbksa8xt7b149ib6abld/services?redirect_to=dashboard
print(ray.cluster_resources())
{'CPU': 8.0, 'object_store_memory': 9093674188.0, 'memory': 18187348379.0, 'node:172.31.174.62': 1.0}
# For benchmarking purposes, we can print the times of various operations.
# In order to reduce clutter in the output, this is set to False by default.
PRINT_TIMES = False

def print_time(msg: str):
    if PRINT_TIMES:
        print(msg)

# To speed things up, we’ll only use a small subset of the full dataset consisting of the last two months of 2019.
# You can choose to use the full dataset for 2018-2019 by setting the SMOKE_TEST variable to False.
SMOKE_TEST = True
Creating a Dataset
Ray Data uses PyArrow dataset and table for reading or writing large parquet files. Its native multithreaded C++ adapter is faster than pandas read_parquet, even when using engine='pyarrow'. For more details see Ray Data User Guide.
Ray Data is the standard way to load and exchange data in Ray libraries and applications. We will use the Ray Data APIs to read the data and quickly inspect it.
First, we will define some global variables we will use throughout the notebook, such as the list of S3 links to the files making up the dataset and the possible location IDs.
# Define some global variables.
TARGET = "trip_duration" s3_partitions = pds.dataset( "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/", partitioning=["year", "month"], ) s3_files = [f"s3://anonymous@{file}" for file in s3_partitions.files] # Obtain all location IDs location_ids = ( pq.read_table(s3_files[0], columns=["dropoff_location_id"])["dropoff_location_id"] .unique() .to_pylist() ) # Use smoke testing or not. starting_idx = -1 if SMOKE_TEST else 0 # drop location 199 to test error-handling before final git checkin sample_locations = [141, 229, 173] if SMOKE_TEST else location_ids # Display what data will be used. s3_files = s3_files[starting_idx:] print(f"NYC Taxi using {len(s3_files)} file(s)!") print(f"s3_files: {s3_files}") print(f"Locations: {sample_locations}") NYC Taxi using 1 file(s)! s3_files: ['s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/06/data.parquet/ab5b9d2b8cc94be19346e260b543ec35_000000.parquet'] Locations: [141, 229, 173] The easiest way to create a ray dataset is to use ray.data.read_parquet to read parquet files in parallel onto the Ray cluster. Uncomment the cell below if you want to try it out. # # This cell is commented out because it can take a long time! # # In the next section "Filtering Read" we make it faster. # # Read everything in the files list into a ray dataset. # start = time.time() # ds = ray.data.read_parquet(s3_files) # print(f"Data loading time: {data_loading_time:.2f} seconds") # ds Filtering a Dataset on Read Normally there is some last-mile data processing required before training. Let’s just assume we know the data processing steps are: Drop negative trip distances, 0 fares, 0 passengers. Drop 2 unknown zones: ['264', '265']. Calculate trip duration and add it as a new column. Drop trip durations smaller than 1 minute and greater than 24 hours. Instead of blindly reading all the data, it would be better if we only read the data we needed. This is similar concept to SQL SELECT only rows, columns you need vs SELECT *. Best practice is to filter as much as you can directly in the Dataset read_parquet(). Note that Ray Data’ Parquet reader supports projection (column selection) and row filter pushdown, where we can push the above column selection and the row-based filter to the Parquet read. If we specify column selection at Parquet read time, the unselected columns won’t even be read from disk. This can save a lot of memory, especially with big datasets, and allow us to avoid OOM issues. The row-based filter is specified via Arrow’s dataset field expressions. def pushdown_read_data(files_list: list, sample_ids: list) -> Dataset: start = time.time() filter_expr = ( (pds.field("passenger_count") > 0) & (pds.field("trip_distance") > 0) & (pds.field("fare_amount") > 0) & (~pds.field("pickup_location_id").isin([264, 265])) & (~pds.field("dropoff_location_id").isin([264, 265])) & (pds.field("dropoff_location_id").isin(sample_ids)) ) dataset = ray.data.read_parquet( files_list, columns=[ "pickup_at", "dropoff_at", "pickup_location_id", "dropoff_location_id", "passenger_count", "trip_distance", "fare_amount", ], filter=filter_expr, ) data_loading_time = time.time() - start print_time(f"Data loading time: {data_loading_time:.2f} seconds") return dataset # Test the pushdown_read_data function ds_raw = pushdown_read_data(s3_files, sample_locations) 2022-12-08 17:04:09,202 WARNING read_api.py:291 -- ⚠️ The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. 
This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
Inspecting a Dataset
Let’s get some basic statistics about our newly created Dataset. As our Dataset is backed by Parquet, we can obtain the number of rows from the metadata without triggering a full data read.
print(f"Number of rows: {ds_raw.count()}")
Number of rows: 6941024
Similarly, we can obtain the Dataset size (in bytes) from the metadata.
print(f"Size bytes (from parquet metadata): {ds_raw.size_bytes()}")
Size bytes (from parquet metadata): 925892280
Let’s fetch and inspect the schema of the underlying Parquet files.
print("\nSchema data types:")
data_types = list(zip(ds_raw.schema().names, ds_raw.schema().types))
for s in data_types:
    print(f"{s[0]}: {s[1]}")
Schema data types:
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
pickup_location_id: int32
dropoff_location_id: int32
passenger_count: int8
trip_distance: float
fare_amount: float
Transforming a Dataset in parallel using custom functions
Ray Data allows you to specify custom data transform functions. These user-defined functions (UDFs) can be called using Dataset.map_batches(my_function). The transformation will be conducted in parallel for each data batch. You may need to call Dataset.repartition(n) first to split the Dataset into more blocks internally. By default, each block corresponds to one file. The upper bound of parallelism is the number of blocks.
You can specify the data format you are using in the batch_format parameter. The dataset will be divided into batches and those batches converted into the specified format. Available data formats you can specify in the batch_format parameter include "pandas", "pyarrow", "numpy". Tabular data will be passed into your UDF by default as a pandas DataFrame. Tensor data will be passed into your UDF as a numpy array. Here, we will use batch_format="pandas" explicitly for clarity.
# A pandas DataFrame UDF for transforming the Dataset in parallel.
def transform_df(input_df: pd.DataFrame) -> pd.DataFrame:
    df = input_df.copy()

    # calculate trip_duration
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds

    # filter trip_durations > 1 minute and less than 24 hours
    df = df[df["trip_duration"] > 60]
    df = df[df["trip_duration"] < 24 * 60 * 60]

    # keep only necessary columns
    df.drop(
        ["dropoff_at", "pickup_at", "pickup_location_id", "fare_amount"],
        axis=1,
        inplace=True,
    )
    df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1)
    return df

%%time
# Test the transform UDF.
print(f"Number of rows before transformation: {ds_raw.count()}")

# Repartition the dataset to allow for higher parallelism.
# Best practice: repartition to all available cpu except a few, with a cap
num_partitions = min(num_cpu - 2, 32)
ds = ds_raw.repartition(num_partitions)

# .map_batches applies a UDF to each partition of the data in parallel.
ds = ds.map_batches(transform_df, batch_format="pandas")

# Verify row count.
print(f"Number of rows after transformation: {ds.count()}") Number of rows before transformation: 6941024 Read: 100%|██████████| 1/1 [00:01<00:00, 1.97s/it] Repartition: 100%|██████████| 6/6 [00:02<00:00, 2.87it/s] Map_Batches: 100%|██████████| 6/6 [00:02<00:00, 2.90it/s] Number of rows after transformation: 285323 CPU times: user 320 ms, sys: 114 ms, total: 434 ms Wall time: 6.19 s Batch training with Ray Data Now that we have learned more about our data and written a pandas UDF to transform our data, we are ready to train a model on batches of this data in parallel. We will use the dropoff_location_id column in the dataset to group the dataset into data batches. Then we will fit a separate model for each batch to predict trip_duration. # import standard sklearn libraries import sklearn from sklearn.base import BaseEstimator from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.tree import DecisionTreeRegressor from sklearn.metrics import mean_absolute_error print(f"sklearn: {sklearn.__version__}") import xgboost as xgb print(f"xgboost: {xgb.__version__}") # set global random seed for sklearn models np.random.seed(415) sklearn: 1.1.2 xgboost: 1.3.3 /home/ray/anaconda3/lib/python3.8/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. from pandas import MultiIndex, Int64Index Define search space for training In this notebook, we will run parallel training jobs per data batch, drop-off location. The training jobs will be defined using a search space and simple grid search. Depending on your need, fancier search spaces and search algorithms are possible with Ray Tune. Below, we define our search space consists of: Different algorithms, either: Linear Regression or XGBoost Tree Regression. We want to train using every algorithm in the search space. What this means is every algorithm will be applied to every NYC Taxi drop-off location. ALGORITHMS = [ LinearRegression(fit_intercept=True), xgb.XGBRegressor(max_depth=4), ] Define training functions We want to fit a linear regression model to the trip duration for each drop-off location. For scoring, we will calculate mean absolute error on the validation set, and report that as model error per drop-off location. The fit_and_score_sklearn function contains the logic necessary to fit a scikit-learn model and evaluate it using mean absolute error. def fit_and_score_sklearn( train_df: pd.DataFrame, test_df: pd.DataFrame, model: BaseEstimator ) -> pd.DataFrame: # Assemble train/test pandas dfs train_X = train_df[["passenger_count", "trip_distance"]] train_y = train_df[TARGET] test_X = test_df[["passenger_count", "trip_distance"]] test_y = test_df[TARGET] # Start training. model = model.fit(train_X, train_y) pred_y = model.predict(test_X) # Evaluate. error = sklearn.metrics.mean_absolute_error(test_y, pred_y) if error is None: error = 10000.0 # Assemble return as a pandas dataframe. return_df = pd.DataFrame({"model": [model], "error": [error]}) # return model, error return return_df The train_and_evaluate function contains the logic for train-test splitting and fitting of a model using the fit_and_score_sklearn function. As an input, this function takes in a pandas DataFrame. When we call Dataset.map_batches or Dataset.groupby().map_groups(), the Dataset will be batched into multiple pandas DataFrames and this function will run for each batch in parallel. 
We will return the model and its error. Those results will be collected back into a Dataset.
def train_and_evaluate(
    df: pd.DataFrame, models: List[BaseEstimator], location_id: int
) -> pd.DataFrame:
    # We need at least 4 rows to create a train / test split.
    if len(df) < 4:
        print_time(
            f"Data batch for LocID {location_id} is empty or smaller than 4 rows"
        )
        return None

    start = time.time()

    # Train / test split
    # Randomly split the data into 80/20 train/test.
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True)

    # Launch a fit and score task for each model.
    # results is a list of pandas dataframes, one per model
    results = [fit_and_score_sklearn(train_df, test_df, model) for model in models]

    # Assemble location_id, name of model, and metrics in a pandas DataFrame
    results_df = pd.concat(results, axis=0, join="inner", ignore_index=True)
    results_df.insert(0, column="location_id", value=location_id)

    training_time = time.time() - start
    print_time(f"Training time for LocID {location_id}: {training_time:.2f} seconds")

    return results_df
Recall how we wrote the data transform UDF transform_df? It was called with the pattern:
Dataset.map_batches(transform_df, batch_format="pandas")
Similarly, we can write a custom group-by aggregate function agg_func which will run for each Dataset group-by group in parallel. The usage pattern is:
Dataset.groupby(column).map_groups(agg_func, batch_format="pandas")
In the cell below, we define our custom agg_func.
# A Pandas DataFrame aggregation function for processing
# grouped batches of Dataset data.
def agg_func(df: pd.DataFrame) -> pd.DataFrame:
    location_id = df["dropoff_location_id"][0]

    # Handle errors in data groups
    try:
        # Transform the input pandas AND fit_and_evaluate the transformed pandas
        results_df = train_and_evaluate(df, ALGORITHMS, location_id)
        assert results_df is not None
    except Exception:
        # assemble a null entry
        print(f"Failed on LocID {location_id}!")
        results_df = pd.DataFrame(
            [[location_id, None, 10000.0]],
            columns=["location_id", "model", "error"],
        )
    return results_df
Run batch training using map_groups
The main “driver code” reads each Parquet file (where each file corresponds to one month of NYC taxi data) into a Dataset ds. Then we use Dataset group-by to map each group into a batch of data and run agg_func on each grouping in parallel by calling ds.groupby("dropoff_location_id").map_groups(agg_func, batch_format="pandas").
# Driver code to run this.
start = time.time()

# Read data into Dataset
# ds = pushdown_read_data(s3_files, sample_locations)\
#     .repartition(14)\
#     .map_batches(transform_df, batch_format="pandas")

# Use Dataset groupby.map_groups() to process each group in parallel and return a Dataset.
results = ds.groupby("dropoff_location_id").map_groups(agg_func, batch_format="pandas")

total_time_taken = time.time() - start
print(f"Total number of models: {results.count()}")
print(f"TOTAL TIME TAKEN: {total_time_taken:.2f} seconds")
Sort Sample: 100%|██████████| 6/6 [00:01<00:00, 4.17it/s]
Shuffle Map: 100%|██████████| 6/6 [00:01<00:00, 3.67it/s]
Shuffle Reduce: 100%|██████████| 6/6 [00:01<00:00, 3.61it/s]
Map_Batches: 100%|██████████| 6/6 [01:43<00:00, 17.31s/it]
Total number of models: 6
TOTAL TIME TAKEN: 108.69 seconds
Finally, we can inspect the models we have trained and their errors.
results Dataset(num_blocks=6, num_rows=6, schema={location_id: int32, model: object, error: float64}) # sort values by location id results_df = results.to_pandas() results_df.sort_values(by=["location_id"], ascending=True, inplace=True) results_df
location_id model error
0 141 LinearRegression() 535.858862
1 141 XGBRegressor(base_score=0.5, booster='gbtree',... 527.156189
2 173 LinearRegression() 1279.122424
3 173 XGBRegressor(base_score=0.5, booster='gbtree',... 1377.166627
4 229 LinearRegression() 556.860355
5 229 XGBRegressor(base_score=0.5, booster='gbtree',... 559.876944
results_df.dtypes location_id int32 model object error float64 dtype: object # Keep only 1 model per location_id with minimum error final_df = results_df.copy() final_df = final_df.loc[(final_df.error > 0), :] final_df = final_df.loc[final_df.groupby("location_id")["error"].idxmin()] final_df.sort_values(by=["error"], inplace=True) final_df.set_index("location_id", inplace=True, drop=True) print(final_df.dtypes) final_df model object error float64 dtype: object
model error
location_id
141 XGBRegressor(base_score=0.5, booster='gbtree',... 527.156189
229 LinearRegression() 556.860355
173 LinearRegression() 1279.122424
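The models in final_df are ordinary scikit-learn / XGBoost estimators, so they can be persisted like any other Python object if you want to reload them in a later session. A minimal sketch using pickle (hypothetical; this walkthrough simply keeps the models in memory):
import pickle

# Hypothetical persistence step: save the best model per location_id to disk...
with open("/tmp/best_models.pkl", "wb") as f:
    pickle.dump(final_df["model"].to_dict(), f)

# ...and load them back later, keyed by location_id.
with open("/tmp/best_models.pkl", "rb") as f:
    best_models = pickle.load(f)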
final_df[["model"]].astype("str").value_counts(normalize=True) model LinearRegression() 0.666667 XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,\n importance_type='gain', interaction_constraints='',\n learning_rate=0.300000012, max_delta_step=0, max_depth=4,\n min_child_weight=1, missing=nan, monotone_constraints='()',\n n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,\n reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n tree_method='exact', validate_parameters=1, verbosity=None) 0.333333 dtype: float64 Re-load a model and perform batch prediction We will restore a regression model and demonstrate it can be used for prediction. # Choose a dropoff location sample_location_id = final_df.index[0] sample_location_id 141 # Get the algorithm used sample_algorithm = final_df.loc[[sample_location_id]].model.values[0] print(f"algorithm type:: {type(sample_algorithm)}") # Get the saved model directly from the pandas dataframe of results sample_model = final_df.model[sample_location_id] print(f"sample_model type:: {type(sample_model)}") algorithm type:: sample_model type:: # Create some test data df = ds.to_pandas(limit=ds.count()) df = df.loc[(df.dropoff_location_id == sample_location_id), :] _, test_df = train_test_split(df, test_size=0.2, shuffle=True) test_X = test_df[["passenger_count", "trip_distance"]] test_y = np.array(test_df[TARGET]) # actual values # Perform batch prediction using restored model pred_y = sample_model.predict(test_X) # Zip together predictions and actuals to evaluate pd.DataFrame(zip(pred_y, test_y), columns=["pred_y", "trip_duration"])[0:10] /home/ray/anaconda3/lib/python3.8/site-packages/xgboost/data.py:192: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. from pandas import MultiIndex, Int64Index
pred_y trip_duration
0 1175.119019 1174
1 381.193146 299
2 1099.755737 1206
3 260.620178 566
4 684.046021 630
5 1038.442139 852
6 1581.762817 1596
7 533.471680 801
8 1618.910889 1363
9 695.661072 715
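The same restored model can also be scored against the Dataset itself with map_batches, rather than collecting everything into a single pandas DataFrame first. A minimal sketch (hypothetical, reusing sample_model from above; in practice you would look up the model matching each row's dropoff_location_id):
# Hypothetical: apply the restored model to each batch of the Dataset in parallel.
def predict_batch(batch: pd.DataFrame) -> pd.DataFrame:
    batch = batch.copy()
    batch["pred_y"] = sample_model.predict(batch[["passenger_count", "trip_distance"]])
    return batch

ds.map_batches(predict_batch, batch_format="pandas").take(5)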
Compare validation and test error. During model training we reported error on “validation” data (random sample). Below, we will report error on a pretend “test” data set (a different random sample). Do a quick validation that both errors are reasonably close together.
# Evaluate restored model on test data.
error = sklearn.metrics.mean_absolute_error(test_y, pred_y)
print(f"Test error: {error}")
Test error: 930.7620476282492
# Compare test error with training validation error
print(f"Validation error: {final_df.error[sample_location_id]}")
# Validation and test errors should be reasonably close together.
Validation error: 527.1561889430844
Scaling OCR using Ray Data
In this example, we will show you how to run optical character recognition (OCR) on a set of documents and analyze the resulting text with the natural language processing library spaCy. Running OCR on a large dataset is very computationally expensive, so using Ray for distributed processing can really speed up the analysis. Ray Data makes it easy to compose the different steps of the pipeline, namely the OCR and the natural language processing. Ray Data’s actor support also allows us to be more efficient by sharing the spaCy NLP context between several datapoints.
To make it more interesting, we will run the analysis on the LightShot dataset. It is a large publicly available OCR dataset with a wide variety of different documents, all of them screenshots of various forms. It is easy to replace that dataset with your own data and adapt the example to your own use cases!
Overview
This tutorial will cover:
Creating a Dataset that represents the images in the dataset
Running the computationally expensive OCR process on each image in the dataset in parallel
Filtering the dataset by keeping only images that contain text
Performing various NLP operations on the text
Walkthrough
Let’s start by preparing the dependencies and downloading the dataset. First we install the OCR software tesseract and its Python client:
macOS
brew install tesseract
pip install pytesseract
linux
sudo apt-get install tesseract-ocr
pip install pytesseract
By default, the following example will run on a tiny dataset we provide. If you want to run it on the full dataset, we recommend running it on a cluster since processing all the images with tesseract takes a lot of time.
If you want to run the example on the full LightShot dataset, you need to download the dataset and extract it. You can extract the dataset by first running unzip archive.zip and then unrar x LightShot13k.rar ., and then you can upload the dataset to S3 with aws s3 cp LightShot13k/ s3:/// --recursive.
Let’s now import Ray and initialize a local Ray cluster. If you want to run OCR at a very large scale, you should run this workload on a multi-node cluster.
# Import ray and initialize a local Ray cluster.
import ray
ray.init()
Running the OCR software on the data
We can now use the ray.data.read_binary_files function to read all the images from S3. We set the include_paths=True option to create a dataset of the S3 paths and image contents. We then run the ds.map function on this dataset to execute the actual OCR process on each file and convert the screenshots into text. This creates a tabular dataset with columns path and text.
If you want to load the data from a private bucket, you have to run:
import pyarrow.fs

ds = ray.data.read_binary_files("s3:///",
    include_paths=True,
    filesystem=pyarrow.fs.S3FileSystem(
        access_key="...",
        secret_key="...",
        session_token="..."))
from io import BytesIO
from PIL import Image
import pytesseract

def perform_ocr(data):
    path, img = data
    return {
        "path": path,
        "text": pytesseract.image_to_string(Image.open(BytesIO(img)))
    }

ds = ray.data.read_binary_files(
    "s3://anonymous@air-example-data/ocr_tiny_dataset",
    include_paths=True)

results = ds.map(perform_ocr)
Let us have a look at some of the data points with the take function.
results.take(10)
Saving and loading the result of the OCR run
Saving the dataset is optional; you can also continue with the in-memory data without persisting it to storage.
We can save the result of running tesseract on the dataset on disk so we can read it out later if we want to re-run the NLP analysis without needing to re-run the OCR (which is very expensive on the whole dataset). This can be done with the write_parquet function:
import os
results.write_parquet(os.path.expanduser("~/LightShot13k_results"))
You can later reload the data with the read_parquet function:
results = ray.data.read_parquet(os.path.expanduser("~/LightShot13k_results"))
Process the extracted text data with spaCy
This is the part where the fun begins. Depending on your task, there will be different needs for post-processing, for example:
If you are scanning books or articles you might want to separate the text out into sections and paragraphs.
If you are scanning forms, receipts or checks, you might want to extract the different items listed, as well as extra information for those items like the price, or the total amount listed on the receipt or check.
If you are scanning legal documents, you might want to extract information like the type of document, who is mentioned in the document and more semantic information about what the document claims.
If you are scanning medical records, you might want to extract the patient name and the treatment history.
In our specific example, let’s try to determine all the documents in the LightShot dataset that are chat protocols and extract named entities in those documents. We will extract this data with spaCy. Let’s first make sure the libraries are installed:
!pip install "spacy>=3"
!python -m spacy download en_core_web_sm
!pip install spacy_langdetect
This is some code to determine the language of a piece of text:
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

nlp = spacy.load('en_core_web_sm')

@Language.factory("language_detector")
def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp.add_pipe('language_detector', last=True)
nlp("This is an English sentence. Ray rocks!")._.language
It gives both the language and a confidence score for that language.
In order to run the code on the dataset, we should use Ray Data’s built-in support for actors since the nlp object is not serializable and we want to avoid having to recreate it for each individual sentence.
We also batch the computation with the map_batches function to ensure spaCy can use more efficient vectorized operations where available:
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

class SpacyBatchInference:
    def __init__(self):
        self.nlp = spacy.load('en_core_web_sm')

        @Language.factory("language_detector")
        def get_lang_detector(nlp, name):
            return LanguageDetector()

        self.nlp.add_pipe('language_detector', last=True)

    def __call__(self, df):
        docs = list(self.nlp.pipe(list(df["text"])))
        df["language"] = [doc._.language["language"] for doc in docs]
        df["score"] = [doc._.language["score"] for doc in docs]
        return df

results.limit(10).map_batches(SpacyBatchInference, compute=ray.data.ActorPoolStrategy())
We can now get language statistics over the whole dataset:
languages = results.map_batches(SpacyBatchInference, compute=ray.data.ActorPoolStrategy())
languages.groupby("language").count().show()
On the full LightShot dataset, you would get the following:
{'language': 'UNKNOWN', 'count()': 2815}
{'language': 'af', 'count()': 109}
{'language': 'ca', 'count()': 268}
{'language': 'cs', 'count()': 13}
{'language': 'cy', 'count()': 80}
{'language': 'da', 'count()': 33}
{'language': 'de', 'count()': 281}
{'language': 'en', 'count()': 5640}
{'language': 'es', 'count()': 453}
{'language': 'et', 'count()': 82}
{'language': 'fi', 'count()': 32}
{'language': 'fr', 'count()': 168}
{'language': 'hr', 'count()': 143}
{'language': 'hu', 'count()': 57}
{'language': 'id', 'count()': 128}
{'language': 'it', 'count()': 139}
{'language': 'lt', 'count()': 17}
{'language': 'lv', 'count()': 12}
{'language': 'nl', 'count()': 982}
{'language': 'no', 'count()': 56}
We can now filter to include only the English documents and also sort them according to their score.
languages.filter(lambda row: row["language"] == "en").sort("score", descending=True).take(1000)
If you are interested in this example and want to extend it, you can do the following for the full dataset: go through these results in order, create labels on whether the text is a chat conversation, and then train a model like Huggingface Transformers on the data. Contributions that extend the example in this direction with a PR are welcome!
Random Data Access (Experimental)
Any Arrow-format dataset can be enabled for random access by calling ds.to_random_access_dataset(key="col_name"). This partitions the data across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.
import ray

# Generate a dummy embedding table as an example.
ds = ray.data.range(100)
ds = ds.add_column("embedding", lambda b: b["id"] ** 2)
# -> schema={id: int64, embedding: int64}

# Enable random access on the dataset. This launches a number of actors
# spread across the cluster that serve random access queries to the data.
rmap = ds.to_random_access_dataset(key="id", num_workers=4)

# Example of a point query by key.
ray.get(rmap.get_async(2))
# -> {"id": 2, "embedding": 4}

# Queries to missing keys return None.
ray.get(rmap.get_async(-1))
# -> None

# Example of a multiget query.
rmap.multiget([4, 2])
# -> [{"id": 4, "embedding": 16}, {"id": 2, "embedding": 4}]
Similar to Dataset, a RandomAccessDataset can be passed to and used from any Ray actor or task.
Architecture
RandomAccessDataset spreads its workers evenly across the cluster.
Each worker fetches and pins in shared memory all blocks of the sorted source data found on its node. In addition, it is ensured that each block is assigned to at least one worker. A central index of block to key-range assignments is computed, which is used to serve lookups. Lookups occur as follows: First, the id of the block that contains the given key is located via binary search on the central index. Second, an actor that has the block pinned is selected (this is done randomly). A method call is sent to the actor, which then performs binary search to locate the record for the key. This means that each random lookup costs ~1 network RTT as well as a small amount of computation on both the client and server side. Performance Since actor communication goes directly from worker to worker in Ray, the throughput of a RandomAccessDataset scales linearly with the number of workers available. As a rough measure, a single worker can provide ~2k individual gets/s and serve ~10k records/s for multigets, and this scales linearly as you increase the number of clients and workers for a single RandomAccessDataset. Large workloads may require hundreds of workers for sufficient throughput. You will also generally want more workers than clients, since the client does less computation than worker actors do. To debug performance problems, use random_access_ds.stats(). This will return a string showing the actor-side measured latencies as well as the distribution of data blocks and queries across the actors. Load imbalances can cause bottlenecks as certain actors receive more requests than others. Ensure that load is evenly distributed across the key space to avoid this. It is important to note that the client (Ray worker process) can also be a bottleneck. To scale past the throughput of a single client, use multiple tasks to gather the data, for example: import numpy as np import ray @ray.remote def fetch(rmap, keys): return rmap.multiget(keys) # Generate a dummy embedding table as an example. rmap = ( ray.data.range(1000) .add_column("embedding", lambda row: row["id"] ** 2) .to_random_access_dataset(key="id", num_workers=4) ) # Split the list of keys we want to fetch into 10 pieces. requested_keys = list(range(0, 1000, 2)) pieces = np.array_split(requested_keys, 10) # Fetch from the RandomAccessDataset in parallel using 10 remote tasks. print(ray.get([fetch.remote(rmap, p) for p in pieces])) Fault Tolerance Currently, RandomAccessDataset is not fault-tolerant. Losing any of the worker actors invalidates the dataset, and it must be re-created from the source data. Implementing a Custom Datasource This MongoDatasource guide below is for education only. For production use of MongoDB in Ray Data, see Creating Dataset from MongoDB. Ray Data supports multiple ways to create a dataset, allowing you to easily ingest data of common formats from popular sources. However, if the datasource you want to read from is not in the built-in list, don’t worry, you can implement a custom one for your use case. This guide walks through building a custom datasource, using MongoDB as an example. By the end of the guide, you will have a MongoDatasource that you can use to create dataset as follows: # Read from custom MongoDB datasource to create a dataset. ds = ray.data.read_datasource( MongoDatasource(), uri=MY_URI, database=MY_DATABASE, collection=MY_COLLECTION, pipelines=MY_PIPELINES ) # Write the dataset to custom MongoDB datasource. 
ds.write_datasource( MongoDatasource(), uri=MY_URI, database=MY_DATABASE, collection=MY_COLLECTION ) There are a few MongoDB concepts involved here. The URI points to a MongoDB instance, which hosts Databases and Collections. A collection is analogous to a table in SQL databases. MongoDB also has a pipeline concept, which expresses document processing in a series of stages (e.g. match documents with a predicate, sort results, and then select a few fields). The execution results of the pipelines are used to create dataset. A custom datasource is an implementation of Datasource. In this example, it’s called MongoDatasource. At a high level, it has two core parts to build out: Read support with create_reader() Write support with do_write(). Here are the key design choices we will make in this guide: MongoDB connector: We use PyMongo to connect to MongoDB. MongoDB to Arrow conversion: We use PyMongoArrow to convert MongoDB execution results into Arrow tables, which Datasets supports as a data format. Parallel execution: We ask the user to provide a list of MongoDB pipelines, with each corresponding to a partition of the MongoDB collection, which will be executed in parallel with ReadTask. For example, suppose you have a MongoDB collection with 4 documents, which have a partition_field with values 0, 1, 2, 3. You can compose two MongoDB pipelines (each handled by a ReadTask) as follows to read the collection in parallel: # A list of pipelines. Each pipeline is a series of stages, typed as List[Dict]. my_pipelines = [ # The first pipeline: match documents in partition range [0, 2) [ { "$match": { "partition_field": { "$gte": 0, "$lt": 2 } } } ], # The second pipeline: match documents in partition range [2, 4) [ { "$match": { "partition_field": { "$gte": 2, "$lt": 4 } } } ], ] Read support To support reading, we implement create_reader(), returning a Reader implementation for MongoDB. This Reader creates a list of ReadTask for the given list of MongoDB pipelines. Each ReadTask returns a list of blocks when called, and each ReadTask is executed in remote workers to parallelize the execution. You can find documentation about Ray Data block concept here and block APIs here. First, let’s handle a single MongoDB pipeline, which is the unit of execution in ReadTask. We need to connect to MongoDB, execute the pipeline against it, and then convert results into Arrow format. We use PyMongo and PyMongoArrow to achieve this. from ray.data.block import Block # This connects to MongoDB, executes the pipeline against it, converts the result # into Arrow format and returns the result as a Block. def _read_single_partition( uri, database, collection, pipeline, schema, kwargs ) -> Block: import pymongo from pymongoarrow.api import aggregate_arrow_all client = pymongo.MongoClient(uri) # Read more about this API here: # https://mongo-arrow.readthedocs.io/en/stable/api/api.html#pymongoarrow.api.aggregate_arrow_all return aggregate_arrow_all( client[database][collection], pipeline, schema=schema, **kwargs ) Once we have this building block, we apply it for each of the provided MongoDB pipelines. In particular, below, we construct a _MongoDatasourceReader by subclassing Reader, and implement the __init__ and get_read_tasks. In __init__, we pass in a couple arguments that will be eventually used in constructing the MongoDB pipeline in _read_single_partition. In get_read_tasks, we construct a ReadTask object for each pipeline object. This will need to provide a BlockMetadata and a no-arg read function as arguments. 
The BlockMetadata contains metadata like number of rows, size in bytes and schema that we know about the block prior to actually executing the read task; the no-arg read function is just a wrapper of _read_single_partition. A list of ReadTask objects are returned by get_read_tasks, and these tasks are executed on remote workers. You can find more details about Dataset read execution here. from typing import Any, Dict, List, Optional from ray.data.datasource.datasource import Datasource, Reader, ReadTask from ray.data.block import BlockMetadata class _MongoDatasourceReader(Reader): # This is constructed by the MongoDatasource, which will supply these args # about MongoDB. def __init__(self, uri, database, collection, pipelines, schema, kwargs): self._uri = uri self._database = database self._collection = collection self._pipelines = pipelines self._schema = schema self._kwargs = kwargs # Create a list of ``ReadTask``, one for each pipeline (i.e. a partition of # the MongoDB collection). Those tasks will be executed in parallel. # Note: The ``parallelism`` which is supposed to indicate how many ``ReadTask`` to # return will have no effect here, since we map each query into a ``ReadTask``. def get_read_tasks(self, parallelism: int) -> List[ReadTask]: read_tasks: List[ReadTask] = [] for pipeline in self._pipelines: # The metadata about the block that we know prior to actually executing # the read task. metadata = BlockMetadata( num_rows=None, size_bytes=None, schema=self._schema, input_files=None, exec_stats=None, ) # Supply a no-arg read function (which returns a block) and pre-read # block metadata. read_task = ReadTask( lambda uri=self._uri, database=self._database, collection=self._collection, pipeline=pipeline, schema=self._schema, kwargs=self._kwargs: [ _read_single_partition( uri, database, collection, pipeline, schema, kwargs ) ], metadata, ) read_tasks.append(read_task) return read_tasks Now, we have finished implementing support for reading from a custom datasource! Let’s move on to implementing support for writing back to the custom datasource. Write support Similar to read support, we start with handling a single block. Again the PyMongo and PyMongoArrow are used for MongoDB interactions. # This connects to MongoDB and writes a block into it. # Note this is an insertion, i.e. each record in the block are treated as # new document to the MongoDB (so no mutation of existing documents). def _write_single_block(uri, database, collection, block: Block): import pymongo from pymongoarrow.api import write client = pymongo.MongoClient(uri) # Read more about this API here: # https://mongo-arrow.readthedocs.io/en/stable/api/api.html#pymongoarrow.api.write write(client[database][collection], block) Unlike read support, we do not need to implement a custom interface. Below, we implement a helper function to parallelize writing, which is expected to return a list of Ray ObjectRefs. This helper function will later be used in the implementation of do_write(). In short, the below function spawns multiple Ray remote tasks and returns their futures (object refs). from ray.data._internal.remote_fn import cached_remote_fn from ray.types import ObjectRef from ray.data.datasource.datasource import WriteResult # This writes a list of blocks into MongoDB. Each block is handled by a task and # tasks are executed in parallel. 
def _write_multiple_blocks(
    blocks: List[ObjectRef[Block]],
    metadata: List[BlockMetadata],
    ray_remote_args: Optional[Dict[str, Any]],
    uri,
    database,
    collection,
) -> List[ObjectRef[WriteResult]]:
    # The ``cached_remote_fn`` turns the ``_write_single_block`` into a Ray
    # remote function.
    write_block = cached_remote_fn(_write_single_block).options(**ray_remote_args)
    write_tasks = []
    for block in blocks:
        # Create a Ray remote function for each block.
        write_task = write_block.remote(uri, database, collection, block)
        write_tasks.append(write_task)
    return write_tasks
Putting it all together
With _MongoDatasourceReader and _write_multiple_blocks above, we are ready to implement create_reader() and do_write(), and put together a MongoDatasource.
# MongoDB datasource, for reading from and writing to MongoDB.
class MongoDatasource(Datasource):
    def create_reader(
        self, uri, database, collection, pipelines, schema, kwargs
    ) -> Reader:
        return _MongoDatasourceReader(
            uri, database, collection, pipelines, schema, kwargs
        )

    def do_write(
        self,
        blocks: List[ObjectRef[Block]],
        metadata: List[BlockMetadata],
        ray_remote_args: Optional[Dict[str, Any]],
        uri,
        database,
        collection,
    ) -> List[ObjectRef[WriteResult]]:
        return _write_multiple_blocks(
            blocks, metadata, ray_remote_args, uri, database, collection
        )
Now you can create a Dataset from and write back to MongoDB, just like any other datasource.
# Read from MongoDB datasource and create a dataset.
# The args are passed to MongoDatasource.create_reader().
ds = ray.data.read_datasource(
    MongoDatasource(),
    uri="mongodb://username:password@mongodb0.example.com:27017/?authSource=admin",
    database="my_db",
    collection="my_collection",
    pipelines=my_pipelines,  # See the example definition of ``my_pipelines`` above
)

# Data preprocessing with Dataset APIs here
# ...

# Write the dataset back to MongoDB datasource.
# The args are passed to MongoDatasource.do_write().
ds.write_datasource(
    MongoDatasource(),
    uri="mongodb://username:password@mongodb0.example.com:27017/?authSource=admin",
    database="my_db",
    collection="my_collection"
)
Check out the Datasets User Guide to learn more about Dataset features in-depth.
Ray Data is a data processing engine that supports multiple data modalities and types. Here you will find a few end-to-end examples of some basic data processing with Ray Data on tabular data, text (coming soon), and images.
Computer Vision
Image Classification Batch Inference with Huggingface Vision Transformer
Image Classification Batch Inference with PyTorch ResNet152
Object Detection Batch Inference with PyTorch FasterRCNN_ResNet50
Simple Data Processing
Processing the NYC taxi dataset
Batch Training with Ray Data
Scaling OCR with Ray Data
Other Examples
Random Data Access (Experimental)
Implementing a Custom Datasource
Ray Data API
Input/Output
Synthetic Data
range(n, *[, parallelism]) Create a dataset from a range of integers [0..n).
range_tensor(n, *[, shape, parallelism]) Create a Tensor stream from a range of integers [0..n).
ray.data.range
ray.data.range(n: int, *, parallelism: int = -1) -> ray.data.dataset.Dataset[source]
Create a dataset from a range of integers [0..n).
Examples
>>> import ray
>>> ds = ray.data.range(10000)
>>> ds
Dataset(num_blocks=..., num_rows=10000, schema={id: int64})
>>> ds.map(lambda x: {"id": x["id"] * 2}).take(4)
[{"id": 0}, {"id": 2}, {"id": 4}, {"id": 6}]
Parameters
n – The upper bound of the range of integers.
parallelism – The amount of parallelism to use for the dataset. Parallelism may be limited by the number of items.
Returns Dataset producing the integers. PublicAPI: This API is stable across Ray releases.ray.data.range_tensor ray.data.range_tensor(n: int, *, shape: Tuple = (1,), parallelism: int = - 1) -> ray.data.dataset.Dataset[source] Create a Tensor stream from a range of integers [0..n). Examples >>> import ray >>> ds = ray.data.range_tensor(1000, shape=(2, 2)) >>> ds Dataset( num_blocks=..., num_rows=1000, schema={data: numpy.ndarray(shape=(2, 2), dtype=int64)} ) >>> ds.map_batches(lambda arr: arr * 2).take(2) [array([[0, 0], [0, 0]]), array([[2, 2], [2, 2]])] This is similar to range_table(), but uses the ArrowTensorArray extension type. The dataset elements take the form {“data”: array(N, shape=shape)}. Parameters n – The upper bound of the range of integer records. shape – The shape of each record. parallelism – The amount of parallelism to use for the dataset. Parallelism may be limited by the number of items. Returns Dataset producing the integers as Arrow tensor records. PublicAPI: This API is stable across Ray releases. Python Objects from_items(items, *[, parallelism, ...]) Create a Dataset from a list of local Python objects. ray.data.from_items ray.data.from_items(items: List[Any], *, parallelism: int = - 1, output_arrow_format: bool = True) -> ray.data.dataset.MaterializedDataset[source] Create a Dataset from a list of local Python objects. Use this method to create small datasets for testing and exploration. Examples import ray ds = ray.data.from_items([1, 2, 3, 4, 5]) print(ds.schema()) Column Type ------ ---- item int64 Parameters items – List of local Python objects. parallelism – The amount of parallelism to use for the dataset. Parallelism might be limited by the number of items. Returns A Dataset holding the items. PublicAPI: This API is stable across Ray releases. Parquet read_parquet(paths, *[, filesystem, ...]) Create an Arrow dataset from parquet files. read_parquet_bulk(paths, *[, filesystem, ...]) Create an Arrow dataset from a large number (such as >1K) of parquet files quickly. Dataset.write_parquet(path, *[, filesystem, ...]) Write the dataset to parquet. ray.data.read_parquet ray.data.read_parquet(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, columns: Optional[List[str]] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, tensor_column_schema: Optional[Dict[str, Tuple[numpy.dtype, Tuple[int, ...]]]] = None, meta_provider: ray.data.datasource.file_meta_provider.ParquetMetadataProvider = , **arrow_parquet_args) -> ray.data.dataset.Dataset[source] Create an Arrow dataset from parquet files. Examples >>> import ray >>> # Read a directory of files in remote storage. >>> ray.data.read_parquet("s3://bucket/path") >>> # Read multiple local files. >>> ray.data.read_parquet(["/path/to/file1", "/path/to/file2"]) >>> # Specify a schema for the parquet file. >>> import pyarrow as pa >>> fields = [("sepal.length", pa.float64()), ... ("sepal.width", pa.float64()), ... ("petal.length", pa.float64()), ... ("petal.width", pa.float64()), ... ("variety", pa.string())] >>> ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet", ... schema=pa.schema(fields)) The Parquet reader also supports projection and filter pushdown, allowing column selection and row filtering to be pushed down to the file scan. import pyarrow as pa # Create a Dataset by reading a Parquet file, pushing column selection and # row filtering down to the file scan. 
ds = ray.data.read_parquet( "s3://anonymous@ray-example-data/iris.parquet", columns=["sepal.length", "variety"], filter=pa.dataset.field("sepal.length") > 5.0, ) ds.show(2) {'sepal.length': 5.1, 'variety': 'Setosa'} {'sepal.length': 5.4, 'variety': 'Setosa'} For further arguments you can pass to pyarrow as a keyword argument, see https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment Parameters paths – A single file path or directory, or a list of file paths. Multiple directories are not supported. filesystem – The filesystem implementation to read from. These are specified in https://arrow.apache.org/docs/python/api/filesystems.html#filesystem-implementations. columns – A list of column names to read. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset. ray_remote_args – kwargs passed to ray.remote in the read tasks. tensor_column_schema – A dict of column name –> tensor dtype and shape mappings for converting a Parquet column containing serialized tensors (ndarrays) as their elements to our tensor column extension type. This assumes that the tensors were serialized in the raw NumPy array format in C-contiguous order (e.g. via arr.tobytes()). meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. arrow_parquet_args – Other parquet read options to pass to pyarrow, see https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment Returns Dataset producing Arrow records read from the specified paths. PublicAPI: This API is stable across Ray releases.ray.data.read_parquet_bulk ray.data.read_parquet_bulk(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, columns: Optional[List[str]] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, arrow_open_file_args: Optional[Dict[str, Any]] = None, tensor_column_schema: Optional[Dict[str, Tuple[numpy.dtype, Tuple[int, ...]]]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = , partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.parquet'], allow_if_no_extensions=False), **arrow_parquet_args) -> ray.data.dataset.Dataset[source] Create an Arrow dataset from a large number (such as >1K) of parquet files quickly. By default, ONLY file paths should be provided as input (i.e. no directory paths), and an OSError will be raised if one or more paths point to directories. If your use-case requires directory paths, then the metadata provider should be changed to one that supports directory expansion (e.g. DefaultFileMetadataProvider). Offers improved performance vs. read_parquet() due to not using PyArrow’s ParquetDataset abstraction, whose latency scales linearly with the number of input files due to collecting all file metadata on a single node. Also supports a wider variety of input Parquet file types than read_parquet() due to not trying to merge and resolve a unified schema for all files. However, unlike read_parquet(), this does not offer file metadata resolution by default, so a custom metadata provider should be provided if your use-case requires a unified schema, block sizes, row counts, etc. Examples >>> # Read multiple local files. You should always provide only input file >>> # paths (i.e. no directory paths) when known to minimize read latency. 
>>> ray.data.read_parquet_bulk( ... ["/path/to/file1", "/path/to/file2"]) >>> # Read a directory of files in remote storage. Caution should be taken >>> # when providing directory paths, since the time to both check each path >>> # type and expand its contents may result in greatly increased latency >>> # and/or request rate throttling from cloud storage service providers. >>> ray.data.read_parquet_bulk( ... "s3://bucket/path", ... meta_provider=DefaultFileMetadataProvider()) Parameters paths – A single file path or a list of file paths. If one or more directories are provided, then meta_provider should also be set to an implementation that supports directory expansion (e.g. DefaultFileMetadataProvider). filesystem – The filesystem implementation to read from. columns – A list of column names to read. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset. ray_remote_args – kwargs passed to ray.remote in the read tasks. arrow_open_file_args – kwargs passed to pyarrow.fs.FileSystem.open_input_file. tensor_column_schema – A dict of column name –> tensor dtype and shape mappings for converting a Parquet column containing serialized tensors (ndarrays) as their elements to our tensor column extension type. This assumes that the tensors were serialized in the raw NumPy array format in C-contiguous order (e.g. via arr.tobytes()). meta_provider – File metadata provider. Defaults to a fast file metadata provider that skips file size collection and requires all input paths to be files. Change to DefaultFileMetadataProvider or a custom metadata provider if directory expansion and/or file metadata resolution is required. partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match “.parquet”. arrow_parquet_args – Other parquet read options to pass to pyarrow. Returns Dataset producing Arrow records read from the specified paths. PublicAPI: This API is stable across Ray releases.ray.data.Dataset.write_parquet Dataset.write_parquet(path: str, *, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.file_based_datasource.BlockWritePathProvider = , arrow_parquet_args_fn: Callable[[], Dict[str, Any]] = >, ray_remote_args: Dict[str, Any] = None, **arrow_parquet_args) -> None[source] Write the dataset to parquet. This is only supported for datasets convertible to Arrow records. To control the number of files, use Dataset.repartition(). Unless a custom block path provider is given, the format of the output files will be {uuid}_{block_idx}.parquet, where uuid is an unique id for the dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples import ray ds = ray.data.range(100) ds.write_parquet("s3://bucket/folder/") Time complexity: O(dataset size / parallelism) Parameters path – The path to the destination root directory, where Parquet files will be written to. filesystem – The filesystem implementation to write to. try_create_dir – Try to create all directories in destination path if True. Does nothing if all directories already exist. 
arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream block_path_provider – BlockWritePathProvider implementation to write each dataset block to a custom output path. arrow_parquet_args_fn – Callable that returns a dictionary of write arguments to use when writing each block to a file. Overrides any duplicate keys from arrow_parquet_args. This should be used instead of arrow_parquet_args if any of your write arguments cannot be pickled, or if you’d like to lazily resolve the write arguments for each dataset block. ray_remote_args – Kwargs passed to ray.remote in the write tasks. arrow_parquet_args – Options to pass to pyarrow.parquet.write_table(), which is used to write out each block to a file. CSV read_csv(paths, *[, filesystem, ...]) Create an Arrow dataset from csv files. Dataset.write_csv(path, *[, filesystem, ...]) Write the dataset to csv. ray.data.read_csv ray.data.read_csv(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = , partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = None, partitioning: ray.data.datasource.partitioning.Partitioning = Partitioning(style='hive', base_dir='', field_names=None, filesystem=None), ignore_missing_paths: bool = False, **arrow_csv_args) -> ray.data.dataset.Dataset[source] Create an Arrow dataset from csv files. Examples >>> import ray >>> # Read a directory of files in remote storage. >>> ray.data.read_csv("s3://bucket/path") >>> # Read multiple local files. >>> ray.data.read_csv(["/path/to/file1", "/path/to/file2"]) >>> # Read multiple directories. >>> ray.data.read_csv( ... ["s3://bucket/path1", "s3://bucket/path2"]) >>> # Read files that use a different delimiter. For more uses of ParseOptions see >>> # https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html # noqa: #501 >>> from pyarrow import csv >>> parse_options = csv.ParseOptions(delimiter="\\t") >>> ds = ray.data.read_csv( ... "s3://anonymous@ray-example-data/iris.tsv", ... parse_options=parse_options) >>> # Convert a date column with a custom format from a CSV file. >>> # For more uses of ConvertOptions see >>> # https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html # noqa: #501 >>> from pyarrow import csv >>> convert_options = csv.ConvertOptions( ... timestamp_parsers=["%m/%d/%Y"]) >>> ds = ray.data.read_csv( ... "s3://anonymous@ray-example-data/dow_jones.csv", ... convert_options=convert_options) By default, read_csv parses Hive-style partitions from file paths. If your data adheres to a different partitioning scheme, set the partitioning parameter. >>> ds = ray.data.read_csv("s3://anonymous@ray-example-data/year=2022/month=09/sales.csv") >>> ds.take(1) [{'order_number': 10107, 'quantity': 30, 'year': '2022', 'month': '09'}] By default, read_csv reads all files from file paths. If you want to filter files by file extensions, set the partition_filter parameter. >>> # Read only *.csv files from multiple directories. >>> from ray.data.datasource import FileExtensionFilter >>> ray.data.read_csv("s3://anonymous@ray-example-data/different-extensions/", ... partition_filter=FileExtensionFilter("csv")) Parameters paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories. 
filesystem – The filesystem implementation to read from. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset. ray_remote_args – kwargs passed to ray.remote in the read tasks. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_stream meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this does not filter out any files. If wishing to filter out all file paths except those whose file extension matches e.g. “.csv”, a FileExtensionFilter("csv") can be provided. partitioning – A Partitioning object that describes how paths are organized. By default, this function parses Hive-style partitions. arrow_csv_args – Other csv read options to pass to pyarrow. ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False. Returns Dataset producing Arrow records read from the specified paths. PublicAPI: This API is stable across Ray releases.ray.data.Dataset.write_csv Dataset.write_csv(path: str, *, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.file_based_datasource.BlockWritePathProvider = , arrow_csv_args_fn: Callable[[], Dict[str, Any]] = >, ray_remote_args: Dict[str, Any] = None, **arrow_csv_args) -> None[source] Write the dataset to csv. This is only supported for datasets convertible to Arrow records. To control the number of files, use Dataset.repartition(). Unless a custom block path provider is given, the format of the output files will be {uuid}_{block_idx}.csv, where uuid is an unique id for the dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples import ray ds = ray.data.range(100) ds.write_csv("s3://bucket/folder/") Time complexity: O(dataset size / parallelism) Parameters path – The path to the destination root directory, where csv files will be written to. filesystem – The filesystem implementation to write to. try_create_dir – Try to create all directories in destination path if True. Does nothing if all directories already exist. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream block_path_provider – BlockWritePathProvider implementation to write each dataset block to a custom output path. arrow_csv_args_fn – Callable that returns a dictionary of write arguments to use when writing each block to a file. Overrides any duplicate keys from arrow_csv_args. This should be used instead of arrow_csv_args if any of your write arguments cannot be pickled, or if you’d like to lazily resolve the write arguments for each dataset block. ray_remote_args – Kwargs passed to ray.remote in the write tasks. arrow_csv_args – Other CSV write options to pass to pyarrow. JSON read_json(paths, *[, filesystem, ...]) Create an Arrow dataset from json files. Dataset.write_json(path, *[, filesystem, ...]) Write the dataset to json. 
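As a quick orientation before the detailed read_json and write_json entries that follow, here is a minimal sketch of a JSON round trip. It assumes a local temporary directory and a small in-memory dataset; beyond that it uses only the two calls summarized above.

import tempfile

import ray

# Write a small dataset out as JSON files, then read it back.
# The temporary directory is only an assumption for this sketch.
tmp_dir = tempfile.mkdtemp()

ds = ray.data.from_items([{"order_number": i, "quantity": i * 10} for i in range(3)])
ds.write_json(tmp_dir)

# read_json accepts a directory and reads every JSON file under it.
ds_back = ray.data.read_json(tmp_dir)
print(ds_back.take(3))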
ray.data.read_json ray.data.read_json(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = , partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.json'], allow_if_no_extensions=False), partitioning: ray.data.datasource.partitioning.Partitioning = Partitioning(style='hive', base_dir='', field_names=None, filesystem=None), ignore_missing_paths: bool = False, **arrow_json_args) -> ray.data.dataset.Dataset[source] Create an Arrow dataset from json files. Examples >>> import ray >>> # Read a directory of files in remote storage. >>> ray.data.read_json("s3://bucket/path") >>> # Read multiple local files. >>> ray.data.read_json(["/path/to/file1", "/path/to/file2"]) >>> # Read multiple directories. >>> ray.data.read_json( ... ["s3://bucket/path1", "s3://bucket/path2"]) By default, read_json parses Hive-style partitions from file paths. If your data adheres to a different partitioning scheme, set the partitioning parameter. >>> ds = ray.data.read_json("s3://anonymous@ray-example-data/year=2022/month=09/sales.json") >>> ds.take(1) [{'order_number': 10107, 'quantity': 30, 'year': '2022', 'month': '09'}] Parameters paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories. filesystem – The filesystem implementation to read from. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset. ray_remote_args – kwargs passed to ray.remote in the read tasks. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_stream meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match “.json”. arrow_json_args – Other json read options to pass to pyarrow. partitioning – A Partitioning object that describes how paths are organized. By default, this function parses Hive-style partitions. ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False. Returns Dataset producing records read from the specified paths. PublicAPI: This API is stable across Ray releases.ray.data.Dataset.write_json Dataset.write_json(path: str, *, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.file_based_datasource.BlockWritePathProvider = , pandas_json_args_fn: Callable[[], Dict[str, Any]] = >, ray_remote_args: Dict[str, Any] = None, **pandas_json_args) -> None[source] Write the dataset to json. This is only supported for datasets convertible to Arrow records. To control the number of files, use Dataset.repartition(). Unless a custom block path provider is given, the format of the output files will be {self._uuid}_{block_idx}.json, where uuid is an unique id for the dataset. This operation will trigger execution of the lazy transformations performed on this dataset. 
Examples

import ray

ds = ray.data.range(100)
ds.write_json("s3://bucket/folder/")

Time complexity: O(dataset size / parallelism) Parameters path – The path to the destination root directory, where json files will be written to. filesystem – The filesystem implementation to write to. try_create_dir – Try to create all directories in destination path if True. Does nothing if all directories already exist. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream block_path_provider – BlockWritePathProvider implementation to write each dataset block to a custom output path. pandas_json_args_fn – Callable that returns a dictionary of write arguments to use when writing each block to a file. Overrides any duplicate keys from pandas_json_args. This should be used instead of pandas_json_args if any of your write arguments cannot be pickled, or if you'd like to lazily resolve the write arguments for each dataset block. ray_remote_args – Kwargs passed to ray.remote in the write tasks. pandas_json_args – These args will be passed to pandas.DataFrame.to_json(), which we use under the hood to write out each Dataset block. These are dict(orient="records", lines=True) by default. Text read_text(paths, *[, encoding, errors, ...]) Create a dataset from lines stored in text files. ray.data.read_text ray.data.read_text(paths: Union[str, List[str]], *, encoding: str = 'utf-8', errors: str = 'ignore', drop_empty_lines: bool = True, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, ray_remote_args: Optional[Dict[str, Any]] = None, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = , partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = None, partitioning: ray.data.datasource.partitioning.Partitioning = None, ignore_missing_paths: bool = False) -> ray.data.dataset.Dataset[source] Create a dataset from lines stored in text files. Examples >>> import ray >>> # Read a directory of files in remote storage. >>> ray.data.read_text("s3://bucket/path") >>> # Read multiple local files. >>> ray.data.read_text(["/path/to/file1", "/path/to/file2"]) Parameters paths – A single file path or a list of file paths (or directories). encoding – The encoding of the files (e.g., "utf-8" or "ascii"). errors – What to do with errors on decoding. Specify either "strict", "ignore", or "replace". Defaults to "ignore". filesystem – The filesystem implementation to read from. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the stream. ray_remote_args – Kwargs passed to ray.remote in the read tasks and in the subsequent text decoding map task. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_stream meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a stream. By default, this does not filter out any files. To filter out all file paths except those whose file extension matches e.g. ".txt", a FileExtensionFilter("txt") can be provided. partitioning – A Partitioning object that describes how paths are organized. Defaults to None. ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False.
Returns Dataset producing lines of text read from the specified paths. PublicAPI: This API is stable across Ray releases. Images read_images(paths, *[, filesystem, ...]) Read images from the specified paths. ray.data.read_images ray.data.read_images(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = , ray_remote_args: Dict[str, Any] = None, arrow_open_file_args: Optional[Dict[str, Any]] = None, partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif'], allow_if_no_extensions=False), partitioning: ray.data.datasource.partitioning.Partitioning = None, size: Optional[Tuple[int, int]] = None, mode: Optional[str] = None, include_paths: bool = False, ignore_missing_paths: bool = False) -> ray.data.dataset.Dataset[source] Read images from the specified paths. Examples >>> import ray >>> path = "s3://anonymous@air-example-data-2/movie-image-small-filesize-1GB" >>> ds = ray.data.read_images(path) >>> ds Dataset(num_blocks=..., num_rows=41979, schema={image: numpy.ndarray(ndim=3, dtype=uint8)}) If you need image file paths, set include_paths=True. >>> ds = ray.data.read_images(path, include_paths=True) >>> ds Dataset(num_blocks=..., num_rows=41979, schema={image: numpy.ndarray(ndim=3, dtype=uint8), path: string}) >>> ds.take(1)[0]["path"] 'air-example-data-2/movie-image-small-filesize-1GB/0.jpg' If your images are arranged like: root/dog/xxx.png root/dog/xxy.png root/cat/123.png root/cat/nsdf3.png Then you can include the labels by specifying a Partitioning. >>> import ray >>> from ray.data.datasource.partitioning import Partitioning >>> root = "s3://anonymous@ray-example-data/image-datasets/dir-partitioned" >>> partitioning = Partitioning("dir", field_names=["class"], base_dir=root) >>> ds = ray.data.read_images(root, size=(224, 224), partitioning=partitioning) >>> ds Dataset(num_blocks=..., num_rows=94946, schema={image: TensorDtype(shape=(224, 224, 3), dtype=uint8), class: object}) Parameters paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories. filesystem – The filesystem implementation to read from. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset. meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. ray_remote_args – kwargs passed to ray.remote in the read tasks. arrow_open_file_args – kwargs passed to pyarrow.fs.FileSystem.open_input_file. partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match *.png, *.jpg, *.jpeg, *.tiff, *.bmp, or *.gif. partitioning – A Partitioning object that describes how paths are organized. Defaults to None. size – The desired height and width of loaded images. If unspecified, images retain their original shape. mode – A Pillow mode describing the desired type and depth of pixels. If unspecified, image modes are inferred by Pillow. include_paths – If True, include the path to each image. File paths are stored in the 'path' column. ignore_missing_paths – If True, ignores any file/directory paths in paths that are not found. Defaults to False. 
Returns A Dataset producing tensors that represent the images at the specified paths. For information on working with tensors, read the tensor data guide. Raises ValueError – if size contains non-positive numbers. ValueError – if mode is unsupported. PublicAPI (beta): This API is in beta and may change before becoming stable. Binary read_binary_files(paths, *[, include_paths, ...]) Create a dataset from binary files of arbitrary contents. ray.data.read_binary_files ray.data.read_binary_files(paths: Union[str, List[str]], *, include_paths: bool = False, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = , partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = None, partitioning: ray.data.datasource.partitioning.Partitioning = None, ignore_missing_paths: bool = False, output_arrow_format: bool = False) -> ray.data.dataset.Dataset[source] Create a dataset from binary files of arbitrary contents. Examples >>> import ray >>> # Read a directory of files in remote storage. >>> ray.data.read_binary_files("s3://bucket/path") >>> # Read multiple local files. >>> ray.data.read_binary_files( ... ["/path/to/file1", "/path/to/file2"]) Parameters paths – A single file path or a list of file paths (or directories). include_paths – Whether to include the full path of the file in the dataset records. When specified, the stream records will be a tuple of the file path and the file contents. filesystem – The filesystem implementation to read from. ray_remote_args – kwargs passed to ray.remote in the read tasks. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the stream. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_stream meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this does not filter out any files. partitioning – A Partitioning object that describes how paths are organized. Defaults to None. ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False. output_arrow_format – If True, returns data in Arrow format, instead of Python list format. Defaults to False. Returns Dataset producing records read from the specified paths. PublicAPI: This API is stable across Ray releases. TFRecords read_tfrecords(paths, *[, filesystem, ...]) Create a dataset from TFRecord files that contain tf.train.Example messages. Dataset.write_tfrecords(path, *[, ...]) Write the dataset to TFRecord files. ray.data.read_tfrecords ray.data.read_tfrecords(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = , partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = None, ignore_missing_paths: bool = False, tf_schema: Optional[schema_pb2.Schema] = None) -> ray.data.dataset.Dataset[source] Create a dataset from TFRecord files that contain tf.train.Example messages. This function exclusively supports tf.train.Example messages. 
If a file contains a message that isn’t of type tf.train.Example, then this function errors. Examples >>> import os >>> import tempfile >>> import tensorflow as tf >>> features = tf.train.Features( ... feature={ ... "length": tf.train.Feature(float_list=tf.train.FloatList(value=[5.1])), ... "width": tf.train.Feature(float_list=tf.train.FloatList(value=[3.5])), ... "species": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"setosa"])), ... } ... ) >>> example = tf.train.Example(features=features) >>> path = os.path.join(tempfile.gettempdir(), "data.tfrecords") >>> with tf.io.TFRecordWriter(path=path) as writer: ... writer.write(example.SerializeToString()) This function reads tf.train.Example messages into a tabular Dataset. >>> import ray >>> ray.data.read_tfrecords("s3://anonymous@ray-example-data/iris.tfrecords") Dataset( num_blocks=..., num_rows=150, schema={...} ) We can also read compressed TFRecord files which uses one of the compression type supported by Arrow: >>> ds = ray.data.read_tfrecords( ... "s3://anonymous@ray-example-data/iris.tfrecords.gz", ... arrow_open_stream_args={"compression": "gzip"}, ... ) >>> ds.to_pandas() length width species 0 5.1 3.5 b'setosa' Parameters paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories. filesystem – The filesystem implementation to read from. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files in the dataset. arrow_open_stream_args – Key-word arguments passed to pyarrow.fs.FileSystem.open_input_stream. To read a compressed TFRecord file, pass the corresponding compression type (e.g. for GZIP or ZLIB, use arrow_open_stream_args={'compression_type': 'gzip'}). meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match "*.tfrecords*". ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False. tf_schema – Optional TensorFlow Schema which is used to explicitly set the schema of the underlying Dataset. Returns A Dataset that contains the example features. Raises ValueError – If a file contains a message that isn’t a tf.train.Example. PublicAPI (alpha): This API is in alpha and may change before becoming stable.ray.data.Dataset.write_tfrecords Dataset.write_tfrecords(path: str, *, tf_schema: Optional[schema_pb2.Schema] = None, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.file_based_datasource.BlockWritePathProvider = , ray_remote_args: Dict[str, Any] = None) -> None[source] Write the dataset to TFRecord files. The TFRecord files will contain tf.train.Example # noqa: E501 records, with one Example record for each row in the dataset. tf.train.Feature only natively stores ints, floats, and bytes, so this function only supports datasets with these data types, and will error if the dataset contains unsupported types. This is only supported for datasets convertible to Arrow records. To control the number of files, use Dataset.repartition(). 
Unless a custom block path provider is given, the format of the output files will be {uuid}_{block_idx}.tfrecords, where uuid is an unique id for the dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples import ray ds = ray.data.range(100) ds.write_tfrecords("s3://bucket/folder/") Time complexity: O(dataset size / parallelism) Parameters path – The path to the destination root directory, where tfrecords files will be written to. filesystem – The filesystem implementation to write to. try_create_dir – Try to create all directories in destination path if True. Does nothing if all directories already exist. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream block_path_provider – BlockWritePathProvider implementation to write each dataset block to a custom output path. ray_remote_args – Kwargs passed to ray.remote in the write tasks. Pandas from_pandas(dfs) Create a dataset from a list of Pandas dataframes. from_pandas_refs(dfs) Create a dataset from a list of Ray object references to Pandas dataframes. Dataset.to_pandas([limit]) Convert this dataset into a single Pandas DataFrame. Dataset.to_pandas_refs() Convert this dataset into a distributed set of Pandas dataframes. ray.data.from_pandas ray.data.from_pandas(dfs: Union[pandas.DataFrame, List[pandas.DataFrame]]) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a list of Pandas dataframes. Parameters dfs – A Pandas dataframe or a list of Pandas dataframes. Returns MaterializedDataset holding Arrow records read from the dataframes. PublicAPI: This API is stable across Ray releases.ray.data.from_pandas_refs ray.data.from_pandas_refs(dfs: Union[ray.types.ObjectRef[pandas.DataFrame], List[ray.types.ObjectRef[pandas.DataFrame]]]) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a list of Ray object references to Pandas dataframes. Parameters dfs – A Ray object references to pandas dataframe, or a list of Ray object references to pandas dataframes. Returns MaterializedDataset holding Arrow records read from the dataframes. DeveloperAPI: This API may change across minor Ray releases.ray.data.Dataset.to_pandas Dataset.to_pandas(limit: int = 100000) -> pandas.DataFrame[source] Convert this dataset into a single Pandas DataFrame. This is only supported for datasets convertible to Arrow or Pandas records. An error is raised if the number of records exceeds the provided limit. Note that you can use limit() on the dataset beforehand to truncate the dataset manually. Examples >>> import ray >>> ds = ray.data.from_items([{"a": i} for i in range(3)]) >>> ds.to_pandas() a 0 0 1 1 2 2 This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(dataset size) Parameters limit – The maximum number of records to return. An error will be raised if the limit is exceeded. Returns A Pandas DataFrame created from this dataset, containing a limited number of records.ray.data.Dataset.to_pandas_refs Dataset.to_pandas_refs() -> List[ray.types.ObjectRef[pandas.DataFrame]][source] Convert this dataset into a distributed set of Pandas dataframes. This is only supported for datasets convertible to Arrow records. This function induces a copy of the data. For zero-copy access to the underlying data, consider using Dataset.to_arrow() or Dataset.get_internal_block_refs(). This operation will trigger execution of the lazy transformations performed on this dataset. 
Time complexity: O(dataset size / parallelism) Returns A list of remote Pandas dataframes created from this dataset. DeveloperAPI: This API may change across minor Ray releases. NumPy read_numpy(paths, *[, filesystem, ...]) Create an Arrow dataset from numpy files. from_numpy(ndarrays) Create a dataset from a list of NumPy ndarrays. from_numpy_refs(ndarrays) Create a dataset from a list of NumPy ndarray futures. Dataset.write_numpy(path, *[, column, ...]) Write a tensor column of the dataset to npy files. Dataset.to_numpy_refs(*[, column]) Convert this dataset into a distributed set of NumPy ndarrays. ray.data.read_numpy ray.data.read_numpy(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = , partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.npy'], allow_if_no_extensions=False), partitioning: ray.data.datasource.partitioning.Partitioning = None, ignore_missing_paths: bool = False, **numpy_load_args) -> ray.data.dataset.Dataset[source] Create an Arrow dataset from numpy files. Examples >>> import ray >>> # Read a directory of files in remote storage. >>> ray.data.read_numpy("s3://bucket/path") >>> # Read multiple local files. >>> ray.data.read_numpy(["/path/to/file1", "/path/to/file2"]) >>> # Read multiple directories. >>> ray.data.read_numpy( ... ["s3://bucket/path1", "s3://bucket/path2"]) Parameters paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories. filesystem – The filesystem implementation to read from. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_stream numpy_load_args – Other options to pass to np.load. meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match “.npy”. partitioning – A Partitioning object that describes how paths are organized. Defaults to None. ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False. Returns Dataset holding Tensor records read from the specified paths. PublicAPI: This API is stable across Ray releases.ray.data.from_numpy ray.data.from_numpy(ndarrays: Union[numpy.ndarray, List[numpy.ndarray]]) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a list of NumPy ndarrays. Parameters ndarrays – A NumPy ndarray or a list of NumPy ndarrays. Returns MaterializedDataset holding the given ndarrays. PublicAPI: This API is stable across Ray releases.ray.data.from_numpy_refs ray.data.from_numpy_refs(ndarrays: Union[ray.types.ObjectRef[numpy.ndarray], List[ray.types.ObjectRef[numpy.ndarray]]]) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a list of NumPy ndarray futures. Parameters ndarrays – A Ray object reference to a NumPy ndarray or a list of Ray object references to NumPy ndarrays. Returns MaterializedDataset holding the given ndarrays. 
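The NumPy creation APIs above follow the same pattern as the Pandas ones. A minimal sketch, assuming only a small in-memory array (the single-array form typically lands in a "data" column):

import numpy as np

import ray

# Create a dataset from an in-memory ndarray.
arr = np.arange(12).reshape(3, 4)
ds = ray.data.from_numpy(arr)
print(ds.schema())

# from_numpy_refs does the same for arrays already in the object store.
ref = ray.put(arr)
ds_from_ref = ray.data.from_numpy_refs(ref)
print(ds_from_ref.take(1))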
DeveloperAPI: This API may change across minor Ray releases.ray.data.Dataset.write_numpy Dataset.write_numpy(path: str, *, column: Optional[str] = None, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.file_based_datasource.BlockWritePathProvider = , ray_remote_args: Dict[str, Any] = None) -> None[source] Write a tensor column of the dataset to npy files. This is only supported for datasets convertible to Arrow records that contain a TensorArray column. To control the number of files, use Dataset.repartition(). Unless a custom block path provider is given, the format of the output files will be {self._uuid}_{block_idx}.npy, where uuid is a unique id for the dataset. This operation will trigger execution of the lazy transformations performed on this dataset.

Examples

import ray

ds = ray.data.range(100)
ds.write_numpy("s3://bucket/folder/", column="id")

Time complexity: O(dataset size / parallelism) Parameters path – The path to the destination root directory, where npy files will be written to. column – The name of the table column that contains the tensor to be written. filesystem – The filesystem implementation to write to. try_create_dir – Try to create all directories in destination path if True. Does nothing if all directories already exist. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream block_path_provider – BlockWritePathProvider implementation to write each dataset block to a custom output path. ray_remote_args – Kwargs passed to ray.remote in the write tasks.ray.data.Dataset.to_numpy_refs Dataset.to_numpy_refs(*, column: Optional[str] = None) -> List[ray.types.ObjectRef[numpy.ndarray]][source] Convert this dataset into a distributed set of NumPy ndarrays. This is only supported for datasets convertible to NumPy ndarrays. This function induces a copy of the data. For zero-copy access to the underlying data, consider using Dataset.to_arrow() or Dataset.get_internal_block_refs(). Time complexity: O(dataset size / parallelism) Parameters column – The name of the column to convert to numpy, or None to specify the entire row. If not specified for Arrow or Pandas blocks, each returned future will represent a dict of column ndarrays. Returns A list of remote NumPy ndarrays created from this dataset. DeveloperAPI: This API may change across minor Ray releases. Arrow from_arrow(tables) Create a dataset from a list of Arrow tables. from_arrow_refs(tables) Create a dataset from a set of Arrow tables. Dataset.to_arrow_refs() Convert this dataset into a distributed set of Arrow tables. ray.data.from_arrow ray.data.from_arrow(tables: Union[pyarrow.Table, bytes, List[Union[pyarrow.Table, bytes]]]) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a list of Arrow tables. Parameters tables – An Arrow table, or a list of Arrow tables, or its streaming format in bytes. Returns MaterializedDataset holding Arrow records from the tables. PublicAPI: This API is stable across Ray releases.ray.data.from_arrow_refs ray.data.from_arrow_refs(tables: Union[ray.types.ObjectRef[Union[pyarrow.Table, bytes]], List[ray.types.ObjectRef[Union[pyarrow.Table, bytes]]]]) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a set of Arrow tables. Parameters tables – A Ray object reference to an Arrow table, or a list of Ray object references to Arrow tables, or its streaming format in bytes.
Returns MaterializedDataset holding Arrow records from the tables. DeveloperAPI: This API may change across minor Ray releases.ray.data.Dataset.to_arrow_refs Dataset.to_arrow_refs() -> List[ray.types.ObjectRef[pyarrow.Table]][source] Convert this dataset into a distributed set of Arrow tables. This is only supported for datasets convertible to Arrow records. This function is zero-copy if the existing data is already in Arrow format. Otherwise, the data will be converted to Arrow format. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(1) unless conversion is required. Returns A list of remote Arrow tables created from this dataset. DeveloperAPI: This API may change across minor Ray releases. MongoDB read_mongo(uri, database, collection, *[, ...]) Create an Arrow dataset from MongoDB. Dataset.write_mongo(uri, database, collection) Write the dataset to a MongoDB datasource. ray.data.read_mongo ray.data.read_mongo(uri: str, database: str, collection: str, *, pipeline: Optional[List[Dict]] = None, schema: Optional[pymongoarrow.api.Schema] = None, parallelism: int = - 1, ray_remote_args: Dict[str, Any] = None, **mongo_args) -> ray.data.dataset.Dataset[source] Create an Arrow dataset from MongoDB. The data to read from is specified via the uri, database and collection of the MongoDB. The dataset is created from the results of executing pipeline against the collection. If pipeline is None, the entire collection will be read. You can check out more details here about these MongoDB concepts: - URI: https://www.mongodb.com/docs/manual/reference/connection-string/ - Database and Collection: https://www.mongodb.com/docs/manual/core/databases-and-collections/ - Pipeline: https://www.mongodb.com/docs/manual/core/aggregation-pipeline/ To read the MongoDB in parallel, the execution of the pipeline is run on partitions of the collection, with a Ray read task to handle a partition. Partitions are created in an attempt to evenly distribute the documents into the specified number of partitions. The number of partitions is determined by parallelism which can be requested from this interface or automatically chosen if unspecified (see the parallelism arg below). Examples >>> import ray >>> from pymongoarrow.api import Schema >>> ds = ray.data.read_mongo( ... uri="mongodb://username:password@mongodb0.example.com:27017/?authSource=admin", # noqa: E501 ... database="my_db", ... collection="my_collection", ... pipeline=[{"$match": {"col2": {"$gte": 0, "$lt": 100}}}, {"$sort": "sort_field"}], # noqa: E501 ... schema=Schema({"col1": pa.string(), "col2": pa.int64()}), ... parallelism=10, ... ) Parameters uri – The URI of the source MongoDB where the dataset will be read from. For the URI format, see details in https://www.mongodb.com/docs/manual/reference/connection-string/. database – The name of the database hosted in the MongoDB. This database must exist otherwise ValueError will be raised. collection – The name of the collection in the database. This collection must exist otherwise ValueError will be raised. pipeline – A MongoDB pipeline, which will be executed on the given collection with results used to create Dataset. If None, the entire collection will be read. schema – The schema used to read the collection. If None, it’ll be inferred from the results of pipeline. parallelism – The requested parallelism of the read. If -1, it will be automatically chosen based on the available cluster resources and estimated in-memory data size. 
ray_remote_args – kwargs passed to ray.remote in the read tasks. mongo_args – kwargs passed to aggregate_arrow_all() in pymongoarrow in producing Arrow-formatted results. Returns Dataset producing Arrow records from the results of executing the pipeline on the specified MongoDB collection. PublicAPI (alpha): This API is in alpha and may change before becoming stable.ray.data.Dataset.write_mongo Dataset.write_mongo(uri: str, database: str, collection: str, ray_remote_args: Optional[Dict[str, Any]] = None) -> None[source] Write the dataset to a MongoDB datasource. This is only supported for datasets convertible to Arrow records. To control the number of parallel write tasks, use Dataset.repartition`() before calling this method. Currently, this supports only a subset of the pyarrow’s types, due to the limitation of pymongoarrow which is used underneath. Writing unsupported types will fail on type checking. See all the supported types at: https://mongo-arrow.readthedocs.io/en/latest/data_types.html. The records will be inserted into MongoDB as new documents. If a record has the _id field, this _id must be non-existent in MongoDB, otherwise the write will be rejected and fail (hence preexisting documents are protected from being mutated). It’s fine to not have _id field in record and MongoDB will auto generate one at insertion. This operation will trigger execution of the lazy transformations performed on this dataset. Examples import ray ds = ray.data.range(100) ds.write_mongo( uri="mongodb://username:password@mongodb0.example.com:27017/?authSource=admin", database="my_db", collection="my_collection" ) Parameters uri – The URI to the destination MongoDB where the dataset will be written to. For the URI format, see details in https://www.mongodb.com/docs/manual/reference/connection-string/. database – The name of the database. This database must exist otherwise ValueError will be raised. collection – The name of the collection in the database. This collection must exist otherwise ValueError will be raised. ray_remote_args – Kwargs passed to ray.remote in the write tasks. SQL Databases read_sql(sql, connection_factory, *[, ...]) Read from a database that provides a Python DB API2-compliant connector. ray.data.read_sql ray.data.read_sql(sql: str, connection_factory: Callable[[], Any], *, parallelism: int = - 1, ray_remote_args: Optional[Dict[str, Any]] = None) -> ray.data.dataset.Dataset[source] Read from a database that provides a Python DB API2-compliant connector. By default, read_sql launches multiple read tasks, and each task executes a LIMIT and OFFSET to fetch a subset of the rows. However, for many databases, OFFSET is slow. As a workaround, set parallelism=1 to directly fetch all rows in a single task. Note that this approach requires all result rows to fit in the memory of single task. If the rows don’t fit, your program may raise an out of memory error. Examples For examples of reading from larger databases like MySQL and PostgreSQL, see Reading from SQL Databases. 
import sqlite3

import ray

# Create a simple database
connection = sqlite3.connect("example.db")
connection.execute("CREATE TABLE movie(title, year, score)")
connection.execute(
    """
    INSERT INTO movie VALUES
        ('Monty Python and the Holy Grail', 1975, 8.2),
        ("Monty Python Live at the Hollywood Bowl", 1982, 7.9),
        ("Monty Python's Life of Brian", 1979, 8.0),
        ("Rocky II", 1979, 7.3)
    """
)
connection.commit()
connection.close()

def create_connection():
    return sqlite3.connect("example.db")

# Get all movies
ds = ray.data.read_sql("SELECT * FROM movie", create_connection)

# Get movies after the year 1980
ds = ray.data.read_sql(
    "SELECT title, score FROM movie WHERE year >= 1980", create_connection
)

# Get the number of movies per year
ds = ray.data.read_sql(
    "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection
)

Parameters sql – The SQL query to execute. connection_factory – A function that takes no arguments and returns a Python DB API2 Connection object. parallelism – The requested parallelism of the read. ray_remote_args – Keyword arguments passed to ray.remote() in read tasks. Returns A Dataset containing the queried data. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Dask from_dask(df) Create a dataset from a Dask DataFrame. Dataset.to_dask([meta]) Convert this dataset into a Dask DataFrame. ray.data.from_dask ray.data.from_dask(df: dask.DataFrame) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a Dask DataFrame. Parameters df – A Dask DataFrame. Returns MaterializedDataset holding Arrow records read from the DataFrame. PublicAPI: This API is stable across Ray releases.ray.data.Dataset.to_dask Dataset.to_dask(meta: Optional[Union[pandas.DataFrame, pandas.Series, Dict[str, Any], Iterable[Any], Tuple[Any]]] = None) -> dask.DataFrame[source] Convert this dataset into a Dask DataFrame. This is only supported for datasets convertible to Arrow records. Note that this function will set the Dask scheduler to Dask-on-Ray globally, via the config. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(dataset size / parallelism) Parameters meta – An empty pandas DataFrame or Series that matches the dtypes and column names of the stream. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. By default, this will be inferred from the underlying Dataset schema, with this argument supplying an optional override. Returns A Dask DataFrame created from this dataset. Spark from_spark(df, *[, parallelism]) Create a dataset from a Spark dataframe. Dataset.to_spark(spark) Convert this dataset into a Spark dataframe. ray.data.from_spark ray.data.from_spark(df: pyspark.sql.DataFrame, *, parallelism: Optional[int] = None) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a Spark dataframe. Parameters df – A Spark dataframe, which must be created by RayDP (Spark-on-Ray). parallelism – The amount of parallelism to use for the dataset. If not provided, it will be equal to the number of partitions of the original Spark dataframe. Returns MaterializedDataset holding Arrow records read from the dataframe.
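The dataframe interchange APIs in this section all convert to and from a Ray Dataset in a single call. As one hedged example, a Dask round trip might look like the following; it assumes dask is installed and Ray has been initialized, and uses only from_dask and to_dask as documented above.

import dask.dataframe as dd
import pandas as pd

import ray

# A small Dask DataFrame built from an in-memory pandas frame (toy data).
ddf = dd.from_pandas(pd.DataFrame({"a": range(8), "b": range(8)}), npartitions=2)

# Dask -> Ray Dataset.
ds = ray.data.from_dask(ddf)
print(ds.count())

# Ray Dataset -> Dask DataFrame. Note: per the to_dask() entry above, this sets
# the Dask scheduler to Dask-on-Ray globally.
ddf_back = ds.to_dask()
print(ddf_back.compute().head())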
PublicAPI: This API is stable across Ray releases.ray.data.Dataset.to_spark Dataset.to_spark(spark: pyspark.sql.SparkSession) -> pyspark.sql.DataFrame[source] Convert this dataset into a Spark dataframe. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(dataset size / parallelism) Returns A Spark dataframe created from this dataset. Modin from_modin(df) Create a dataset from a Modin dataframe. Dataset.to_modin() Convert this dataset into a Modin dataframe. ray.data.from_modin ray.data.from_modin(df: modin.DataFrame) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a Modin dataframe. Parameters df – A Modin dataframe, which must be using the Ray backend. Returns MaterializedDataset holding Arrow records read from the dataframe. PublicAPI: This API is stable across Ray releases.ray.data.Dataset.to_modin Dataset.to_modin() -> modin.DataFrame[source] Convert this dataset into a Modin dataframe. This works by first converting this dataset into a distributed set of Pandas dataframes (using Dataset.to_pandas_refs()). Please see caveats there. Then the individual dataframes are used to create the modin DataFrame using modin.distributed.dataframe.pandas.partitions.from_partitions(). This is only supported for datasets convertible to Arrow records. This function induces a copy of the data. For zero-copy access to the underlying data, consider using Dataset.to_arrow() or get_internal_block_refs(). This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(dataset size / parallelism) Returns A Modin dataframe created from this dataset. Mars from_mars(df) Create a dataset from a MARS dataframe. Dataset.to_mars() Convert this dataset into a MARS dataframe. ray.data.from_mars ray.data.from_mars(df: mars.DataFrame) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a MARS dataframe. Parameters df – A MARS dataframe, which must be executed by MARS-on-Ray. Returns MaterializedDataset holding Arrow records read from the dataframe. PublicAPI: This API is stable across Ray releases.ray.data.Dataset.to_mars Dataset.to_mars() -> mars.DataFrame[source] Convert this dataset into a MARS dataframe. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(dataset size / parallelism) Returns A MARS dataframe created from this dataset. Torch from_torch(dataset) Create a dataset from a Torch dataset. ray.data.from_torch ray.data.from_torch(dataset: torch.utils.data.Dataset) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a Torch dataset. This function is inefficient. Use it to read small datasets or prototype. If your dataset is large, this function may execute slowly or raise an out-of-memory error. To avoid issues, read the underlying data with a function like read_images(). This function isn't parallelized. It loads the entire dataset into the head node's memory before moving the data to the distributed object store. Examples >>> import ray >>> from torchvision import datasets >>> dataset = datasets.MNIST("data", download=True) >>> ds = ray.data.from_torch(dataset) >>> ds Dataset(num_blocks=..., num_rows=60000, schema={item: object}) >>> ds.take(1) {"item": (, 5)} Parameters dataset – A Torch dataset. Returns A MaterializedDataset containing the Torch dataset samples. PublicAPI: This API is stable across Ray releases.
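If you want to try from_torch without downloading MNIST, a hedged alternative is to wrap a small in-memory TensorDataset. This is only a sketch; the toy tensors are not part of the documented example.

import torch
from torch.utils.data import TensorDataset

import ray

# A tiny in-memory Torch dataset (toy tensors), so no download is needed.
features = torch.arange(8, dtype=torch.float32).reshape(4, 2)
labels = torch.tensor([0, 1, 0, 1])
dataset = TensorDataset(features, labels)

# Each Torch sample becomes one row under the "item" column.
ds = ray.data.from_torch(dataset)
print(ds.take(2))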
Hugging Face from_huggingface(dataset) Create a dataset from a Hugging Face Datasets Dataset. ray.data.from_huggingface ray.data.from_huggingface(dataset: Union[datasets.Dataset, datasets.DatasetDict]) -> Union[ray.data.dataset.MaterializedDataset, Dict[str, ray.data.dataset.MaterializedDataset]][source] Create a dataset from a Hugging Face Datasets Dataset. This function is not parallelized, and is intended to be used with Hugging Face Datasets that are loaded into memory (as opposed to memory-mapped). Example: >>> import ray >>> import datasets >>> hf_dataset = datasets.load_dataset("tweet_eval", "emotion") Downloading ... >>> ray_ds = ray.data.from_huggingface(hf_dataset) >>> ray_ds {'train': MaterializedDataset( num_blocks=..., num_rows=3257, schema={text: string, label: int64} ), 'test': MaterializedDataset( num_blocks=..., num_rows=1421, schema={text: string, label: int64} ), 'validation': MaterializedDataset( num_blocks=..., num_rows=374, schema={text: string, label: int64} )} >>> ray_ds = ray.data.from_huggingface(hf_dataset["train"]) >>> ray_ds MaterializedDataset( num_blocks=..., num_rows=3257, schema={text: string, label: int64} ) Parameters dataset – A Hugging Face Dataset, or DatasetDict. IterableDataset is not supported. Returns Dataset holding Arrow records from the Hugging Face Dataset, or a dict of datasets in case dataset is a DatasetDict. PublicAPI: This API is stable across Ray releases. TensorFlow from_tf(dataset) Create a dataset from a TensorFlow dataset. ray.data.from_tf ray.data.from_tf(dataset: tf.data.Dataset) -> ray.data.dataset.MaterializedDataset[source] Create a dataset from a TensorFlow dataset. This function is inefficient. Use it to read small datasets or prototype. If your dataset is large, this function may execute slowly or raise an out-of-memory error. To avoid issues, read the underlying data with a function like read_images(). This function isn't parallelized. It loads the entire dataset into the local node's memory before moving the data to the distributed object store. Examples >>> import ray >>> import tensorflow_datasets as tfds >>> dataset, _ = tfds.load('cifar10', split=["train", "test"]) >>> ds = ray.data.from_tf(dataset) >>> ds Dataset(num_blocks=..., num_rows=50000, schema={id: binary, image: numpy.ndarray(shape=(32, 32, 3), dtype=uint8), label: int64}) >>> ds.take(1) [{'id': b'train_16399', 'image': array([[[143, 96, 70], [141, 96, 72], [135, 93, 72], ..., [ 96, 37, 19], [105, 42, 18], [104, 38, 20]], …, [[195, 161, 126], [187, 153, 123], [186, 151, 128], …, [212, 177, 147], [219, 185, 155], [221, 187, 157]]], dtype=uint8), 'label': 7}] Parameters dataset – A TensorFlow dataset. Returns A MaterializedDataset that contains the samples stored in the TensorFlow dataset. PublicAPI: This API is stable across Ray releases. WebDataset read_webdataset(paths, *[, filesystem, ...]) Create a dataset from WebDataset files.
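The detailed read_webdataset entry below has no Examples section, so here is a minimal, hedged call sketch. The bucket path is a hypothetical placeholder for a directory of WebDataset .tar shards; only the read_webdataset call itself comes from the API below.

import ray

# Hypothetical location of .tar shards.
ds = ray.data.read_webdataset("s3://my-bucket/webdataset-shards/")

# decoder, fileselect, filerename, and suffixes (described below) customize how
# files inside each shard are decoded and grouped into samples.
print(ds.schema())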
ray.data.read_webdataset ray.data.read_webdataset(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = , partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = None, decoder: Optional[Union[bool, str, callable, list]] = True, fileselect: Optional[Union[list, callable]] = None, filerename: Optional[Union[list, callable]] = None, suffixes: Optional[Union[list, callable]] = None, verbose_open: bool = False) -> ray.data.dataset.Dataset[source] Create a dataset from WebDataset files. Parameters paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories. filesystem – The filesystem implementation to read from. parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files in the dataset. arrow_open_stream_args – Key-word arguments passed to pyarrow.fs.FileSystem.open_input_stream. To read a compressed TFRecord file, pass the corresponding compression type (e.g. for GZIP or ZLIB, use arrow_open_stream_args={'compression_type': 'gzip'}). meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. decoder – A function or list of functions to decode the data. fileselect – A callable or list of glob patterns to select files. filerename – A function or list of tuples to rename files prior to grouping. suffixes – A function or list of suffixes to select for creating samples. verbose_open – Whether to print the file names as they are opened. Returns A Dataset that contains the example features. Raises ValueError – If a file contains a message that isn’t a tf.train.Example. PublicAPI (alpha): This API is in alpha and may change before becoming stable. Datasource API read_datasource(datasource, *[, ...]) Read a stream from a custom data source. Dataset.write_datasource(datasource, *[, ...]) Write the dataset to a custom datasource. Datasource() Interface for defining a custom ray.data.Dataset datasource. ReadTask(read_fn, metadata) A function used to read blocks from the dataset. datasource.Reader() A bound read operation for a datasource. ray.data.read_datasource ray.data.read_datasource(datasource: ray.data.datasource.datasource.Datasource, *, parallelism: int = - 1, ray_remote_args: Dict[str, Any] = None, **read_args) -> ray.data.dataset.Dataset[source] Read a stream from a custom data source. Parameters datasource – The datasource to read data from. parallelism – The requested parallelism of the read. Parallelism may be limited by the available partitioning of the datasource. If set to -1, parallelism will be automatically chosen based on the available cluster resources and estimated in-memory data size. read_args – Additional kwargs to pass to the datasource impl. ray_remote_args – kwargs passed to ray.remote in the read tasks. Returns Dataset that reads data from the datasource. PublicAPI: This API is stable across Ray releases.ray.data.Dataset.write_datasource Dataset.write_datasource(datasource: ray.data.datasource.datasource.Datasource, *, ray_remote_args: Optional[Dict[str, Any]] = None, **write_args) -> None[source] Write the dataset to a custom datasource. 
For an example of how to use this method, see Implementing a Custom Datasource. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(dataset size / parallelism) Parameters datasource – The datasource to write to. ray_remote_args – Kwargs passed to ray.remote in the write tasks. write_args – Additional write args to pass to the datasource.ray.data.Datasource class ray.data.Datasource[source] Bases: object Interface for defining a custom ray.data.Dataset datasource. To read a datasource into a dataset, use ray.data.read_datasource(). To write to a writable datasource, use Dataset.write_datasource(). See RangeDatasource and DummyOutputDatasource for examples of how to implement readable and writable datasources. Datasource instances must be serializable, since create_reader() and write() are called in remote tasks. For an example of subclassing Datasource, read Implementing a Custom Datasource. PublicAPI: This API is stable across Ray releases. Methods __init__() create_reader(**read_args) Return a Reader for the given read arguments. do_write(blocks, metadata, ray_remote_args, ...) Launch Ray tasks for writing blocks out to the datasource. get_name() Return a human-readable name for this datasource. on_write_complete(write_results, **kwargs) Callback for when a write job completes. on_write_failed(write_results, error, **kwargs) Callback for when a write job fails. prepare_read(parallelism, **read_args) Deprecated: Please implement create_reader() instead. write(blocks, **write_args) Write blocks out to the datasource. ray.data.Datasource.__init__ Datasource.__init__() ray.data.Datasource.create_reader Datasource.create_reader(**read_args) -> ray.data.datasource.datasource.Reader[source] Return a Reader for the given read arguments. The reader object will be responsible for querying the read metadata, and generating the actual read tasks to retrieve the data blocks upon request. Parameters read_args – Additional kwargs to pass to the datasource impl.ray.data.Datasource.do_write Datasource.do_write(blocks: List[ray.types.ObjectRef[Union[pyarrow.Table, pandas.DataFrame]]], metadata: List[ray.data.block.BlockMetadata], ray_remote_args: Dict[str, Any], **write_args) -> List[ray.types.ObjectRef[Any]][source] Launch Ray tasks for writing blocks out to the datasource. Parameters blocks – List of data block references. It is recommended that one write task be generated per block. metadata – List of block metadata. ray_remote_args – Kwargs passed to ray.remote in the write tasks. write_args – Additional kwargs to pass to the datasource impl. Returns A list of the output of the write tasks. DEPRECATED: This API is deprecated and may be removed in future Ray releases. do_write() is deprecated in Ray 2.4. Use write() insteadray.data.Datasource.get_name Datasource.get_name() -> str[source] Return a human-readable name for this datasource. This will be used as the names of the read tasks.ray.data.Datasource.on_write_complete Datasource.on_write_complete(write_results: List[Any], **kwargs) -> None[source] Callback for when a write job completes. This can be used to “commit” a write output. This method must succeed prior to write_datasource() returning to the user. If this method fails, then on_write_failed() will be called. Parameters write_results – The list of the write task results. 
kwargs – Forward-compatibility placeholder.ray.data.Datasource.on_write_failed Datasource.on_write_failed(write_results: List[ray.types.ObjectRef[Any]], error: Exception, **kwargs) -> None[source] Callback for when a write job fails. This is called on a best-effort basis on write failures. Parameters write_results – The list of the write task result futures. error – The first error encountered. kwargs – Forward-compatibility placeholder.ray.data.Datasource.prepare_read Datasource.prepare_read(parallelism: int, **read_args) -> List[ray.data.datasource.datasource.ReadTask][source] Deprecated: Please implement create_reader() instead. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.Datasource.write Datasource.write(blocks: Iterable[Union[pyarrow.Table, pandas.DataFrame]], **write_args) -> Any[source] Write blocks out to the datasource. This is used by a single write task. Parameters blocks – List of data blocks. write_args – Additional kwargs to pass to the datasource impl. Returns The output of the write task.ray.data.ReadTask class ray.data.ReadTask(read_fn: Callable[[], Iterable[Union[pyarrow.Table, pandas.DataFrame]]], metadata: ray.data.block.BlockMetadata)[source] Bases: Callable[[], Iterable[Union[pyarrow.Table, pandas.DataFrame]]] A function used to read blocks from the dataset. Read tasks are generated by reader.get_read_tasks(), and return a list of ray.data.Block when called. Initial metadata about the read operation can be retrieved via get_metadata() prior to executing the read. Final metadata is returned after the read along with the blocks. Ray will execute read tasks in remote functions to parallelize execution. Note that the number of blocks returned can vary at runtime. For example, if a task is reading a single large file it can return multiple blocks to avoid running out of memory during the read. The initial metadata should reflect all the blocks returned by the read, e.g., if the metadata says num_rows=1000, the read can return a single block of 1000 rows, or multiple blocks with 1000 rows altogether. The final metadata (returned with the actual block) reflects the exact contents of the block itself. DeveloperAPI: This API may change across minor Ray releases. Methods ray.data.datasource.Reader class ray.data.datasource.Reader[source] Bases: object A bound read operation for a datasource. This is a stateful class so that reads can be prepared in multiple stages. For example, it is useful for Datasets to know the in-memory size of the read prior to executing it. PublicAPI: This API is stable across Ray releases. Methods __init__() estimate_inmemory_data_size() Return an estimate of the in-memory data size, or None if unknown. get_read_tasks(parallelism) Execute the read and return read tasks. ray.data.datasource.Reader.__init__ Reader.__init__() ray.data.datasource.Reader.estimate_inmemory_data_size Reader.estimate_inmemory_data_size() -> Optional[int][source] Return an estimate of the in-memory data size, or None if unknown. Note that the in-memory data size may be larger than the on-disk data size.ray.data.datasource.Reader.get_read_tasks Reader.get_read_tasks(parallelism: int) -> List[ray.data.datasource.datasource.ReadTask][source] Execute the read and return read tasks. Parameters parallelism – The requested read parallelism. The number of read tasks should equal to this value if possible. read_args – Additional kwargs to pass to the datasource impl. 
Returns A list of read tasks that can be executed to read blocks from the datasource in parallel. Partitioning API datasource.Partitioning(style[, base_dir, ...]) Partition scheme used to describe path-based partitions. datasource.PartitionStyle(value) Supported dataset partition styles. datasource.PathPartitionEncoder(partitioning) Callable that generates directory path strings for path-based partition formats. datasource.PathPartitionParser(partitioning) Partition parser for path-based partition formats. datasource.PathPartitionFilter(...) Partition filter for path-based partition formats. ray.data.datasource.Partitioning class ray.data.datasource.Partitioning(style: ray.data.datasource.partitioning.PartitionStyle, base_dir: Optional[str] = None, field_names: Optional[List[str]] = None, filesystem: Optional[pyarrow.fs.FileSystem] = None)[source] Bases: object Partition scheme used to describe path-based partitions. Path-based partition formats embed all partition keys and values directly in their dataset file paths. For example, to read a dataset with Hive-style partitions: >>> import ray >>> from ray.data.datasource.partitioning import Partitioning >>> ds = ray.data.read_csv( ... "s3://anonymous@ray-example-data/iris.csv", ... partitioning=Partitioning("hive"), ... ) Instead, if your files are arranged in a directory structure such as: root/dog/dog_0.jpeg root/dog/dog_1.jpeg ... root/cat/cat_0.jpeg root/cat/cat_1.jpeg ... Then you can use directory-based partitioning: >>> import ray >>> from ray.data.datasource.partitioning import Partitioning >>> root = "s3://anonymous@air-example-data/cifar-10/images" >>> partitioning = Partitioning("dir", field_names=["class"], base_dir=root) >>> ds = ray.data.read_images(root, partitioning=partitioning) DeveloperAPI: This API may change across minor Ray releases. Methods Attributes base_dir "/"-delimited base directory that all partitioned paths should exist under (exclusive). field_names The partition key field names (i.e. filesystem Filesystem that will be used for partition path file I/O. normalized_base_dir Returns the base directory normalized for compatibility with a filesystem. resolved_filesystem Returns the filesystem resolved for compatibility with a base directory. style The partition style - may be either HIVE or DIRECTORY. ray.data.datasource.Partitioning.base_dir Partitioning.base_dir: Optional[str] = None “/”-delimited base directory that all partitioned paths should exist under (exclusive). File paths either outside of, or at the first level of, this directory will be considered unpartitioned. Specify None or an empty string to search for partitions in all file path directories.ray.data.datasource.Partitioning.field_names Partitioning.field_names: Optional[List[str]] = None The partition key field names (i.e. column names for tabular datasets). When non-empty, the order and length of partition key field names must match the order and length of partition values. 
Required when parsing DIRECTORY partitioned paths or generating HIVE partitioned paths.ray.data.datasource.Partitioning.filesystem Partitioning.filesystem: Optional[pyarrow.fs.FileSystem] = None Filesystem that will be used for partition path file I/O.ray.data.datasource.Partitioning.normalized_base_dir property Partitioning.normalized_base_dir: str Returns the base directory normalized for compatibility with a filesystem.ray.data.datasource.Partitioning.resolved_filesystem property Partitioning.resolved_filesystem: pyarrow.fs.FileSystem Returns the filesystem resolved for compatibility with a base directory.ray.data.datasource.Partitioning.style Partitioning.style: ray.data.datasource.partitioning.PartitionStyle The partition style - may be either HIVE or DIRECTORY.ray.data.datasource.PartitionStyle class ray.data.datasource.PartitionStyle(value)[source] Bases: str, enum.Enum Supported dataset partition styles. Inherits from str to simplify plain text serialization/deserialization. Examples >>> # Serialize to JSON text. >>> json.dumps(PartitionStyle.HIVE) '"hive"' >>> # Deserialize from JSON text. >>> PartitionStyle(json.loads('"hive"')) DeveloperAPI: This API may change across minor Ray releases. Attributes HIVE DIRECTORY ray.data.datasource.PartitionStyle.HIVE PartitionStyle.HIVE = 'hive' ray.data.datasource.PartitionStyle.DIRECTORY PartitionStyle.DIRECTORY = 'dir' ray.data.datasource.PathPartitionEncoder class ray.data.datasource.PathPartitionEncoder(partitioning: ray.data.datasource.partitioning.Partitioning)[source] Bases: object Callable that generates directory path strings for path-based partition formats. Path-based partition formats embed all partition keys and values directly in their dataset file paths. Two path partition formats are currently supported - HIVE and DIRECTORY. For HIVE Partitioning, all partition directories will be generated using a “{key1}={value1}/{key2}={value2}” naming convention under the base directory. An accompanying ordered list of partition key field names must also be provided, where the order and length of all partition values must match the order and length of field names For DIRECTORY Partitioning, all directories will be generated from partition values using a “{value1}/{value2}” naming convention under the base directory. DeveloperAPI: This API may change across minor Ray releases. Methods __init__(partitioning) Creates a new partition path encoder. of([style, base_dir, field_names, filesystem]) Creates a new partition path encoder. ray.data.datasource.PathPartitionEncoder.__init__ PathPartitionEncoder.__init__(partitioning: ray.data.datasource.partitioning.Partitioning)[source] Creates a new partition path encoder. Parameters partitioning – The path-based partition scheme. All partition paths will be generated under this scheme’s base directory. Field names are required for HIVE partition paths, optional for DIRECTORY partition paths. When non-empty, the order and length of partition key field names must match the order and length of partition values.ray.data.datasource.PathPartitionEncoder.of static PathPartitionEncoder.of(style: ray.data.datasource.partitioning.PartitionStyle = PartitionStyle.HIVE, base_dir: Optional[str] = None, field_names: Optional[List[str]] = None, filesystem: Optional[pyarrow.fs.FileSystem] = None) -> PathPartitionEncoder[source] Creates a new partition path encoder. Parameters style – The partition style - may be either HIVE or DIRECTORY. 
base_dir – “/”-delimited base directory that all partition paths will be generated under (exclusive). field_names – The partition key field names (i.e. column names for tabular datasets). Required for HIVE partition paths, optional for DIRECTORY partition paths. When non-empty, the order and length of partition key field names must match the order and length of partition values. filesystem – Filesystem that will be used for partition path file I/O. Returns The new partition path encoder. Attributes scheme Returns the partitioning for this encoder. ray.data.datasource.PathPartitionEncoder.scheme property PathPartitionEncoder.scheme: ray.data.datasource.partitioning.Partitioning Returns the partitioning for this encoder.ray.data.datasource.PathPartitionParser class ray.data.datasource.PathPartitionParser(partitioning: ray.data.datasource.partitioning.Partitioning)[source] Bases: object Partition parser for path-based partition formats. Path-based partition formats embed all partition keys and values directly in their dataset file paths. Two path partition formats are currently supported - HIVE and DIRECTORY. For HIVE Partitioning, all partition directories under the base directory will be discovered based on “{key1}={value1}/{key2}={value2}” naming conventions. Key/value pairs do not need to be presented in the same order across all paths. Directory names nested under the base directory that don’t follow this naming condition will be considered unpartitioned. If a partition filter is defined, then it will be called with an empty input dictionary for each unpartitioned file. For DIRECTORY Partitioning, all directories under the base directory will be interpreted as partition values of the form “{value1}/{value2}”. An accompanying ordered list of partition field names must also be provided, where the order and length of all partition values must match the order and length of field names. Files stored directly in the base directory will be considered unpartitioned. If a partition filter is defined, then it will be called with an empty input dictionary for each unpartitioned file. For example, if the base directory is “foo” then “foo.csv” and “foo/bar.csv” would be considered unpartitioned files but “foo/bar/baz.csv” would be associated with partition “bar”. If the base directory is undefined, then “foo.csv” would be unpartitioned, “foo/bar.csv” would be associated with partition “foo”, and “foo/bar/baz.csv” would be associated with partition (“foo”, “bar”). DeveloperAPI: This API may change across minor Ray releases. Methods __init__(partitioning) Creates a path-based partition parser. of([style, base_dir, field_names, filesystem]) Creates a path-based partition parser using a flattened argument list. ray.data.datasource.PathPartitionParser.__init__ PathPartitionParser.__init__(partitioning: ray.data.datasource.partitioning.Partitioning)[source] Creates a path-based partition parser. Parameters partitioning – The path-based partition scheme. The parser starts searching for partitions from this scheme’s base directory. File paths outside the base directory will be considered unpartitioned. If the base directory is None or an empty string then this will search for partitions in all file path directories. Field names are required for DIRECTORY partitioning, and optional for HIVE partitioning. 
When non-empty, the order and length of partition key field names must match the order and length of partition directories discovered. ray.data.datasource.PathPartitionParser.of static PathPartitionParser.of(style: ray.data.datasource.partitioning.PartitionStyle = PartitionStyle.HIVE, base_dir: Optional[str] = None, field_names: Optional[List[str]] = None, filesystem: Optional[pyarrow.fs.FileSystem] = None) -> PathPartitionParser[source] Creates a path-based partition parser using a flattened argument list. Parameters style – The partition style - may be either HIVE or DIRECTORY. base_dir – "/"-delimited base directory to start searching for partitions (exclusive). File paths outside of this directory will be considered unpartitioned. Specify None or an empty string to search for partitions in all file path directories. field_names – The partition key names. Required for DIRECTORY partitioning. Optional for HIVE partitioning. When non-empty, the order and length of partition key field names must match the order and length of partition directories discovered. Partition key field names are not required to exist in the dataset schema. filesystem – Filesystem that will be used for partition path file I/O. Returns The new path-based partition parser. Attributes scheme Returns the partitioning for this parser. ray.data.datasource.PathPartitionParser.scheme property PathPartitionParser.scheme: ray.data.datasource.partitioning.Partitioning Returns the partitioning for this parser. ray.data.datasource.PathPartitionFilter class ray.data.datasource.PathPartitionFilter(path_partition_parser: ray.data.datasource.partitioning.PathPartitionParser, filter_fn: Callable[[Dict[str, str]], bool])[source] Bases: object Partition filter for path-based partition formats. Used to explicitly keep or reject files based on a custom filter function that takes partition keys and values parsed from the file's path as input. PublicAPI (beta): This API is in beta and may change before becoming stable. Methods __init__(path_partition_parser, filter_fn) Creates a new path-based partition filter based on a parser. of(filter_fn[, style, base_dir, ...]) Creates a path-based partition filter using a flattened argument list. ray.data.datasource.PathPartitionFilter.__init__ PathPartitionFilter.__init__(path_partition_parser: ray.data.datasource.partitioning.PathPartitionParser, filter_fn: Callable[[Dict[str, str]], bool])[source] Creates a new path-based partition filter based on a parser. Parameters path_partition_parser – The path-based partition parser. filter_fn – Callback used to filter partitions. Takes a dictionary mapping partition keys to values as input. Unpartitioned files are denoted with an empty input dictionary. Returns True to read a file for that partition or False to skip it. Partition keys and values are always strings read from the filesystem path. For example, this removes all unpartitioned files: lambda d: True if d else False This raises an assertion error for any unpartitioned file found: def do_assert(val, msg): assert val, msg lambda d: do_assert(d, "Expected all files to be partitioned!")
And this only reads files from January, 2022 partitions: lambda d: d["month"] == "January" and d["year"] == "2022" ray.data.datasource.PathPartitionFilter.of static PathPartitionFilter.of(filter_fn: Callable[[Dict[str, str]], bool], style: ray.data.datasource.partitioning.PartitionStyle = PartitionStyle.HIVE, base_dir: Optional[str] = None, field_names: Optional[List[str]] = None, filesystem: Optional[pyarrow.fs.FileSystem] = None) -> PathPartitionFilter[source] Creates a path-based partition filter using a flattened argument list. Parameters filter_fn – Callback used to filter partitions. Takes a dictionary mapping partition keys to values as input. Unpartitioned files are denoted with an empty input dictionary. Returns True to read a file for that partition or False to skip it. Partition keys and values are always strings read from the filesystem path. For example, this removes all unpartitioned files: lambda d: True if d else False This raises an assertion error for any unpartitioned file found: def do_assert(val, msg): assert val, msg lambda d: do_assert(d, "Expected all files to be partitioned!") And this only reads files from January, 2022 partitions: lambda d: d["month"] == "January" and d["year"] == "2022" style – The partition style - may be either HIVE or DIRECTORY. base_dir – "/"-delimited base directory to start searching for partitions (exclusive). File paths outside of this directory will be considered unpartitioned. Specify None or an empty string to search for partitions in all file path directories. field_names – The partition key names. Required for DIRECTORY partitioning. Optional for HIVE partitioning. When non-empty, the order and length of partition key field names must match the order and length of partition directories discovered. Partition key field names are not required to exist in the dataset schema. filesystem – Filesystem that will be used for partition path file I/O. Returns The new path-based partition filter. Attributes parser Returns the path partition parser for this filter. ray.data.datasource.PathPartitionFilter.parser property PathPartitionFilter.parser: ray.data.datasource.partitioning.PathPartitionParser Returns the path partition parser for this filter. MetadataProvider API datasource.FileMetadataProvider() Abstract callable that provides metadata for the files of a single dataset block. datasource.BaseFileMetadataProvider() Abstract callable that provides metadata for FileBasedDatasource datasource.ParquetMetadataProvider() Abstract callable that provides block metadata for Arrow Parquet file fragments. datasource.DefaultFileMetadataProvider() Default metadata provider for FileBasedDatasource implementations that reuse the base prepare_read method. datasource.DefaultParquetMetadataProvider() The default file metadata provider for ParquetDatasource. datasource.FastFileMetadataProvider() Fast Metadata provider for FileBasedDatasource implementations. ray.data.datasource.FileMetadataProvider class ray.data.datasource.FileMetadataProvider[source] Bases: object Abstract callable that provides metadata for the files of a single dataset block. Current subclasses: BaseFileMetadataProvider ParquetMetadataProvider DeveloperAPI: This API may change across minor Ray releases.
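The file-based read APIs shown earlier accept a meta_provider argument of these types. As a hedged sketch of skipping file metadata resolution with FastFileMetadataProvider (documented further below; the bucket paths are hypothetical and are assumed to point at existing files):

import ray
from ray.data.datasource import FastFileMetadataProvider

# Skip directory expansion and file size collection; every path is assumed
# to be an existing file, which is the precondition FastFileMetadataProvider
# requires.
ds = ray.data.read_csv(
    ["s3://my-bucket/data/part-000.csv", "s3://my-bucket/data/part-001.csv"],
    meta_provider=FastFileMetadataProvider(),
)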
Methods __init__() ray.data.datasource.FileMetadataProvider.__init__ FileMetadataProvider.__init__() ray.data.datasource.BaseFileMetadataProvider class ray.data.datasource.BaseFileMetadataProvider[source] Bases: ray.data.datasource.file_meta_provider.FileMetadataProvider Abstract callable that provides metadata for FileBasedDatasource implementations that reuse the base prepare_read method. Also supports file and file size discovery in input directory paths. Current subclasses: DefaultFileMetadataProvider DeveloperAPI: This API may change across minor Ray releases. Methods __init__() expand_paths(paths, filesystem[, ...]) Expands all paths into concrete file paths by walking directories. ray.data.datasource.BaseFileMetadataProvider.__init__ BaseFileMetadataProvider.__init__() ray.data.datasource.BaseFileMetadataProvider.expand_paths BaseFileMetadataProvider.expand_paths(paths: List[str], filesystem: Optional[pyarrow.fs.FileSystem], partitioning: Optional[ray.data.datasource.partitioning.Partitioning] = None, ignore_missing_paths: bool = False) -> Iterator[Tuple[str, int]][source] Expands all paths into concrete file paths by walking directories. Also returns a sidecar of file sizes. The input paths must be normalized for compatibility with the input filesystem prior to invocation. Args: paths: A list of file and/or directory paths compatible with the given filesystem. filesystem: The filesystem implementation that should be used for expanding all paths and reading their files. ignore_missing_paths: If True, ignores any file paths in paths that are not found. Defaults to False. Returns: An iterator of (file_path, file_size) pairs. None may be returned for the file size if it is either unknown or will be fetched later by _get_block_metadata(), but the length of both lists must be equal.ray.data.datasource.ParquetMetadataProvider class ray.data.datasource.ParquetMetadataProvider[source] Bases: ray.data.datasource.file_meta_provider.FileMetadataProvider Abstract callable that provides block metadata for Arrow Parquet file fragments. All file fragments should belong to a single dataset block. Supports optional pre-fetching of ordered metadata for all file fragments in a single batch to help optimize metadata resolution. Current subclasses: DefaultParquetMetadataProvider DeveloperAPI: This API may change across minor Ray releases. Methods __init__() prefetch_file_metadata(pieces, **ray_remote_args) Pre-fetches file metadata for all Parquet file fragments in a single batch. ray.data.datasource.ParquetMetadataProvider.__init__ ParquetMetadataProvider.__init__() ray.data.datasource.ParquetMetadataProvider.prefetch_file_metadata ParquetMetadataProvider.prefetch_file_metadata(pieces: List[pyarrow.dataset.ParquetFileFragment], **ray_remote_args) -> Optional[List[Any]][source] Pre-fetches file metadata for all Parquet file fragments in a single batch. Subsets of the metadata returned will be provided as input to subsequent calls to _get_block_metadata() together with their corresponding Parquet file fragments. Implementations that don’t support pre-fetching file metadata shouldn’t override this method. Parameters pieces – The Parquet file fragments to fetch metadata for. Returns Metadata resolved for each input file fragment, or None. 
Metadata must be returned in the same order as all input file fragments, such that metadata[i] always contains the metadata for pieces[i].ray.data.datasource.DefaultFileMetadataProvider class ray.data.datasource.DefaultFileMetadataProvider[source] Bases: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider Default metadata provider for FileBasedDatasource implementations that reuse the base prepare_read method. Calculates block size in bytes as the sum of its constituent file sizes, and assumes a fixed number of rows per file. DeveloperAPI: This API may change across minor Ray releases. Methods __init__() ray.data.datasource.DefaultFileMetadataProvider.__init__ DefaultFileMetadataProvider.__init__() ray.data.datasource.DefaultParquetMetadataProvider class ray.data.datasource.DefaultParquetMetadataProvider[source] Bases: ray.data.datasource.file_meta_provider.ParquetMetadataProvider The default file metadata provider for ParquetDatasource. Aggregates total block bytes and number of rows using the Parquet file metadata associated with a list of Arrow Parquet dataset file fragments. DeveloperAPI: This API may change across minor Ray releases. Methods __init__() ray.data.datasource.DefaultParquetMetadataProvider.__init__ DefaultParquetMetadataProvider.__init__() ray.data.datasource.FastFileMetadataProvider class ray.data.datasource.FastFileMetadataProvider[source] Bases: ray.data.datasource.file_meta_provider.DefaultFileMetadataProvider Fast Metadata provider for FileBasedDatasource implementations. Offers improved performance vs. DefaultFileMetadataProvider by skipping directory path expansion and file size collection. While this performance improvement may be negligible for local filesystems, it can be substantial for cloud storage service providers. This should only be used when all input paths exist and are known to be files. DeveloperAPI: This API may change across minor Ray releases. Methods __init__() ray.data.datasource.FastFileMetadataProvider.__init__ FastFileMetadataProvider.__init__() Dataset API Constructor Dataset(plan, epoch[, lazy, logical_plan]) A Dataset is a distributed data collection for data loading and processing. ray.data.Dataset class ray.data.Dataset(plan: ray.data._internal.plan.ExecutionPlan, epoch: int, lazy: bool = True, logical_plan: Optional[ray.data._internal.logical.interfaces.LogicalPlan] = None)[source] Bases: object A Dataset is a distributed data collection for data loading and processing. Datasets are distributed pipelines that produce ObjectRef[Block] outputs, where each block holds data in Arrow format, representing a shard of the overall data collection. The block also determines the unit of parallelism. Datasets can be created in multiple ways: from synthetic data via range_*() APIs, from existing memory data via from_*() APIs (this creates a subclass of Dataset called MaterializedDataset), or from external storage systems such as local disk, S3, HDFS etc. via the read_*() APIs. The (potentially processed) Dataset can be saved back to external storage systems via the write_*() APIs. Examples >>> import ray >>> # Create dataset from synthetic data. >>> ds = ray.data.range(1000) >>> # Create dataset from in-memory data. >>> ds = ray.data.from_items( ... [{"col1": i, "col2": i * 2} for i in range(1000)]) >>> # Create dataset from external storage system. >>> ds = ray.data.read_parquet("s3://bucket/path") >>> # Save dataset back to external storage system. 
>>> ds.write_csv("s3://bucket/output") Dataset has two kinds of operations: transformation, which takes in a Dataset and outputs a new Dataset (e.g. map_batches()); and consumption, which produces values (not a Dataset) as output (e.g. iter_batches()). Dataset transformations are lazy, with execution of the transformations being triggered by downstream consumption. Dataset supports parallel processing at scale: transformations such as map_batches(), aggregations such as min()/max()/mean(), grouping via groupby(), shuffling operations such as sort(), random_shuffle(), and repartition(). Examples >>> import ray >>> ds = ray.data.range(1000) >>> # Transform batches (Dict[str, np.ndarray]) with map_batches(). >>> ds.map_batches(lambda batch: {"id": batch["id"] * 2}) MapBatches() +- Dataset(num_blocks=..., num_rows=1000, schema={id: int64}) >>> # Compute the maximum. >>> ds.max("id") 999 >>> # Shuffle this dataset randomly. >>> ds.random_shuffle() RandomShuffle +- Dataset(num_blocks=..., num_rows=1000, schema={id: int64}) >>> # Sort it back in order. >>> ds.sort("id") Sort +- Dataset(num_blocks=..., num_rows=1000, schema={id: int64}) Both unexecuted and materialized Datasets can be passed between Ray tasks and actors without incurring a copy. Dataset supports conversion to/from several more featureful dataframe libraries (e.g., Spark, Dask, Modin, MARS), and is also compatible with distributed TensorFlow / PyTorch. PublicAPI: This API is stable across Ray releases. Methods __init__(plan, epoch[, lazy, logical_plan]) Construct a Dataset (internal API). add_column(col, fn, *[, compute]) Add the given column to the dataset. aggregate(*aggs) Aggregate the entire dataset as one group. columns([fetch_if_missing]) Returns the columns of this Dataset. count() Count the number of records in the dataset. dataset_format() Deprecated. default_batch_format() Deprecated. deserialize_lineage(serialized_ds) Deserialize the provided lineage-serialized Dataset. drop_columns(cols, *[, compute]) Drop one or more columns from the dataset. filter(fn, *[, compute]) Filter out records that do not satisfy the given predicate. flat_map(fn, *[, compute, num_cpus, num_gpus]) Apply the given function to each record and then flatten results. fully_executed() Deprecated. get_internal_block_refs() Get a list of references to the underlying blocks of this dataset. groupby(key) Group the dataset by the key function or column name. has_serializable_lineage() Whether this dataset's lineage is able to be serialized for storage and later deserialized, possibly on a different cluster. input_files() Return the list of input files for the dataset. is_fully_executed() Deprecated. iter_batches(*[, prefetch_batches, ...]) Return a local batched iterator over the dataset. iter_rows(*[, prefetch_blocks]) Return a local row iterator over the dataset. iter_tf_batches(*[, prefetch_batches, ...]) Return a local batched iterator of TensorFlow Tensors over the dataset. iter_torch_batches(*[, prefetch_batches, ...]) Return a local batched iterator of Torch Tensors over the dataset. iterator() Return a DataIterator that can be used to repeatedly iterate over the dataset. lazy() Enable lazy evaluation. limit(limit) Materialize and truncate the dataset to the first limit records. map(fn, *[, compute, num_cpus, num_gpus]) Apply the given function to each record of this dataset. map_batches(fn, *[, batch_size, compute, ...]) Apply the given function to batches of data. materialize() Execute and materialize this dataset into object store memory.
max([on, ignore_nulls]) Compute maximum over entire dataset. mean([on, ignore_nulls]) Compute mean over entire dataset. min([on, ignore_nulls]) Compute minimum over entire dataset. num_blocks() Return the number of blocks of this dataset. random_sample(fraction, *[, seed]) Randomly samples a fraction of the elements of this dataset. random_shuffle(*[, seed, num_blocks]) Randomly shuffle the elements of this dataset. randomize_block_order(*[, seed]) Randomly shuffle the blocks of this dataset. repartition(num_blocks, *[, shuffle]) Repartition the dataset into exactly this number of blocks. repeat([times]) Convert this into a DatasetPipeline by looping over this dataset. schema([fetch_if_missing]) Return the schema of the dataset. select_columns(cols, *[, compute]) Select one or more columns from the dataset. serialize_lineage() Serialize this dataset's lineage, not the actual data or the existing data futures, to bytes that can be stored and later deserialized, possibly on a different cluster. show([limit]) Print up to the given number of records from the dataset. size_bytes() Return the in-memory size of the dataset. sort([key, descending]) Sort the dataset by the specified key column or key function. split(n, *[, equal, locality_hints]) Materialize and split the dataset into n disjoint pieces. split_at_indices(indices) Materialize and split the dataset at the given indices (like np.split). split_proportionately(proportions) Materialize and split the dataset using proportions. stats() Returns a string containing execution timing information. std([on, ddof, ignore_nulls]) Compute standard deviation over entire dataset. streaming_split(n, *[, equal, locality_hints]) Returns n DataIterators that can be used to read disjoint subsets of the dataset in parallel. sum([on, ignore_nulls]) Compute sum over entire dataset. take([limit]) Return up to limit records from the dataset. take_all([limit]) Return all of the records in the dataset. take_batch([batch_size, batch_format]) Return up to batch_size records from the dataset in a batch. to_arrow_refs() Convert this dataset into a distributed set of Arrow tables. to_dask([meta]) Convert this dataset into a Dask DataFrame. to_mars() Convert this dataset into a MARS dataframe. to_modin() Convert this dataset into a Modin dataframe. to_numpy_refs(*[, column]) Convert this dataset into a distributed set of NumPy ndarrays. to_pandas([limit]) Convert this dataset into a single Pandas DataFrame. to_pandas_refs() Convert this dataset into a distributed set of Pandas dataframes. to_random_access_dataset(key[, num_workers]) Convert this dataset into a distributed RandomAccessDataset (EXPERIMENTAL). to_spark(spark) Convert this dataset into a Spark dataframe. to_tf(feature_columns, label_columns, *[, ...]) Return a TF Dataset over this dataset. to_torch(*[, label_column, feature_columns, ...]) Return a Torch IterableDataset over this dataset. train_test_split(test_size, *[, shuffle, seed]) Materialize and split the dataset into train and test subsets. union(*other) Materialize and combine this dataset with others of the same type. unique(column) List of unique elements in the given column. window(*[, blocks_per_window, bytes_per_window]) Convert this into a DatasetPipeline by windowing over data blocks. write_csv(path, *[, filesystem, ...]) Write the dataset to csv. write_datasource(datasource, *[, ...]) Write the dataset to a custom datasource. write_json(path, *[, filesystem, ...]) Write the dataset to json. 
write_mongo(uri, database, collection[, ...]) Write the dataset to a MongoDB datasource. write_numpy(path, *[, column, filesystem, ...]) Write a tensor column of the dataset to npy files. write_parquet(path, *[, filesystem, ...]) Write the dataset to parquet. write_tfrecords(path, *[, tf_schema, ...]) Write the dataset to TFRecord files. write_webdataset(path, *[, filesystem, ...]) Write the dataset to WebDataset files. zip(other) Materialize and zip this dataset with the elements of another. ray.data.Dataset.__init__ Dataset.__init__(plan: ray.data._internal.plan.ExecutionPlan, epoch: int, lazy: bool = True, logical_plan: Optional[ray.data._internal.logical.interfaces.LogicalPlan] = None)[source] Construct a Dataset (internal API). The constructor is not part of the Dataset API. Use the ray.data.* read methods to construct a dataset.ray.data.Dataset.add_column Dataset.add_column(col: str, fn: Callable[[pandas.DataFrame], pandas.Series], *, compute: Optional[str] = None, **ray_remote_args) -> Dataset[source] Add the given column to the dataset. This is only supported for datasets convertible to pandas format. A function generating the new column values given the batch in pandas format must be specified. Examples >>> import ray >>> ds = ray.data.range(100) >>> # Add a new column equal to value * 2. >>> ds = ds.add_column("new_col", lambda df: df["id"] * 2) >>> # Overwrite the existing "value" with zeros. >>> ds = ds.add_column("id", lambda df: 0) Time complexity: O(dataset size / parallelism) Parameters col – Name of the column to add. If the name already exists, the column will be overwritten. fn – Map function generating the column values given a batch of records in pandas format. compute – The compute strategy, either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool. ray_remote_args – Additional resource requirements to request from ray (e.g., num_gpus=1 to request GPUs for the map tasks).ray.data.Dataset.aggregate Dataset.aggregate(*aggs: ray.data.aggregate._aggregate.AggregateFn) -> Union[Any, Dict[str, Any]][source] Aggregate the entire dataset as one group. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> from ray.data.aggregate import Max, Mean >>> ray.data.range(100).aggregate(Max("id"), Mean("id")) {'max(id)': 99, 'mean(id)': 49.5} Time complexity: O(dataset size / parallelism) Parameters aggs – Aggregations to do. Returns If the input dataset is a simple dataset then the output is a tuple of (agg1, agg2, ...) where each tuple element is the corresponding aggregation result. If the input dataset is an Arrow dataset then the output is an dict where each column is the corresponding aggregation result. If the dataset is empty, return None.ray.data.Dataset.columns Dataset.columns(fetch_if_missing: bool = True) -> Optional[List[str]][source] Returns the columns of this Dataset. If this dataset consists of more than a read, or if the schema can’t be determined from the metadata provided by the datasource, or if fetch_if_missing=True (the default), then this operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(1) Example >>> import ray >>> # Create dataset from synthetic data. 
>>> ds = ray.data.range(1000) >>> ds.columns() ['id'] Parameters fetch_if_missing – If True, synchronously fetch the column names from the schema if it’s not known. If False, None is returned if the schema is not known. Default is True. Returns A list of the column names for this Dataset or None if schema is not known and fetch_if_missing is False.ray.data.Dataset.count Dataset.count() -> int[source] Count the number of records in the dataset. If this dataset consists of more than a read, or if the row count can’t be determined from the metadata provided by the datasource, then this operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(dataset size / parallelism), O(1) for parquet Examples >>> import ray >>> ds = ray.data.range(10) >>> ds.count() 10 Returns The number of records in the dataset.ray.data.Dataset.dataset_format Dataset.dataset_format() -> ray.air.util.data_batch_conversion.BlockFormat[source] DEPRECATED: This API is deprecated and may be removed in future Ray releases. The dataset format is no longer exposed as a public API.ray.data.Dataset.default_batch_format Dataset.default_batch_format() -> Type[source] DEPRECATED: This API is deprecated and may be removed in future Ray releases. The batch format is no longer exposed as a public API.ray.data.Dataset.deserialize_lineage static Dataset.deserialize_lineage(serialized_ds: bytes) -> ray.data.dataset.Dataset[source] Deserialize the provided lineage-serialized Dataset. This assumes that the provided serialized bytes were serialized using Dataset.serialize_lineage(). Examples import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") serialized_ds = ds.serialize_lineage() ds = ray.data.Dataset.deserialize_lineage(serialized_ds) print(ds) Dataset( num_blocks=1, num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64 } ) Parameters serialized_ds – The serialized Dataset that we wish to deserialize. Returns A deserialized Dataset instance. DeveloperAPI: This API may change across minor Ray releases.ray.data.Dataset.drop_columns Dataset.drop_columns(cols: List[str], *, compute: Optional[str] = None, **ray_remote_args) -> ray.data.dataset.Dataset[source] Drop one or more columns from the dataset. Examples >>> import ray >>> ds = ray.data.range(100) >>> # Add a new column equal to value * 2. >>> ds = ds.add_column("new_col", lambda df: df["id"] * 2) >>> # Drop the existing "value" column. >>> ds = ds.drop_columns(["id"]) Time complexity: O(dataset size / parallelism) Parameters cols – Names of the columns to drop. If any name does not exist, an exception will be raised. compute – The compute strategy, either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool. ray_remote_args – Additional resource requirements to request from ray (e.g., num_gpus=1 to request GPUs for the map tasks).ray.data.Dataset.filter Dataset.filter(fn: Union[Callable[[Dict[str, Any]], bool], Callable[[Dict[str, Any]], Iterator[bool]], _CallableClassProtocol], *, compute: Union[str, ray.data._internal.compute.ComputeStrategy] = None, **ray_remote_args) -> Dataset[source] Filter out records that do not satisfy the given predicate. Consider using .map_batches() for better performance (you can implement filter by dropping records). 
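As the note above suggests, the same predicate can be expressed with map_batches() by dropping non-matching rows from each batch; a minimal sketch assuming the default batch format (Dict[str, np.ndarray]):

import ray

ds = ray.data.range(100)

def keep_even(batch):
    # Keep only rows whose "id" is even; equivalent to the filter() call in
    # the Examples below, but applied one batch at a time.
    mask = batch["id"] % 2 == 0
    return {"id": batch["id"][mask]}

even = ds.map_batches(keep_even)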
Examples >>> import ray >>> ds = ray.data.range(100) >>> ds.filter(lambda x: x["id"] % 2 == 0) Filter +- Dataset(num_blocks=..., num_rows=100, schema={id: int64}) Time complexity: O(dataset size / parallelism) Parameters fn – The predicate to apply to each record, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy. compute – The compute strategy, either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool. ray_remote_args – Additional resource requirements to request from ray (e.g., num_gpus=1 to request GPUs for the map tasks).ray.data.Dataset.flat_map Dataset.flat_map(fn: Union[Callable[[Dict[str, Any]], List[Dict[str, Any]]], Callable[[Dict[str, Any]], Iterator[List[Dict[str, Any]]]], _CallableClassProtocol], *, compute: Optional[ray.data._internal.compute.ComputeStrategy] = None, num_cpus: Optional[float] = None, num_gpus: Optional[float] = None, **ray_remote_args) -> Dataset[source] Apply the given function to each record and then flatten results. Consider using .map_batches() for better performance (the batch size can be altered in map_batches). Examples >>> import ray >>> ds = ray.data.range(1000) >>> ds.flat_map(lambda x: [{"id": 1}, {"id": 2}, {"id": 4}]) FlatMap +- Dataset(num_blocks=..., num_rows=1000, schema={id: int64}) Time complexity: O(dataset size / parallelism) Parameters fn – The function or generator to apply to each record, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy. compute – The compute strategy, either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool. num_cpus – The number of CPUs to reserve for each parallel map worker. num_gpus – The number of GPUs to reserve for each parallel map worker. For example, specify num_gpus=1 to request 1 GPU for each parallel map worker. ray_remote_args – Additional resource requirements to request from ray for each map worker. map_batches() Call this method to transform batches of data. It’s faster and more flexible than map() and flat_map(). map() Call this method to transform one record at time. This method isn’t recommended because it’s slow; call map_batches() instead.ray.data.Dataset.fully_executed Dataset.fully_executed() -> ray.data.dataset.MaterializedDataset[source] DEPRECATED: This API is deprecated and may be removed in future Ray releases. Use Dataset.materialize() instead.ray.data.Dataset.get_internal_block_refs Dataset.get_internal_block_refs() -> List[ray.types.ObjectRef[Union[pyarrow.Table, pandas.DataFrame]]][source] Get a list of references to the underlying blocks of this dataset. This function can be used for zero-copy access to the data. It blocks until the underlying blocks are computed. Examples >>> import ray >>> ds = ray.data.range(1) >>> ds.get_internal_block_refs() [ObjectRef(...)] This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(1) Returns A list of references to this dataset’s blocks. DeveloperAPI: This API may change across minor Ray releases.ray.data.Dataset.groupby Dataset.groupby(key: Optional[str]) -> GroupedData[source] Group the dataset by the key function or column name. 
Examples >>> import ray >>> # Group by a table column and aggregate. >>> ray.data.from_items([ ... {"A": x % 3, "B": x} for x in range(100)]).groupby( ... "A").count() Aggregate +- Dataset(num_blocks=..., num_rows=100, schema={A: int64, B: int64}) Time complexity: O(dataset size * log(dataset size / parallelism)) Parameters key – A column name. If this is None, the grouping is global. Returns A lazy GroupedData that can be aggregated later.ray.data.Dataset.has_serializable_lineage Dataset.has_serializable_lineage() -> bool[source] Whether this dataset’s lineage is able to be serialized for storage and later deserialized, possibly on a different cluster. Only datasets that are created from data that we know will still exist at deserialization time, e.g. data external to this Ray cluster such as persistent cloud object stores, support lineage-based serialization. All of the ray.data.read_*() APIs support lineage-based serialization. Examples >>> import ray >>> ray.data.from_items(list(range(10))).has_serializable_lineage() False >>> ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv").has_serializable_lineage() Trueray.data.Dataset.input_files Dataset.input_files() -> List[str][source] Return the list of input files for the dataset. Examples >>> import ray >>> ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") >>> ds.input_files() ['ray-example-data/iris.csv'] If this dataset consists of more than a read, then this operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(num input files) Returns The list of input files used to create the dataset, or an empty list if the input files is not known.ray.data.Dataset.is_fully_executed Dataset.is_fully_executed() -> bool[source] DEPRECATED: This API is deprecated and may be removed in future Ray releases. Check isinstance(Dataset, MaterializedDataset) instead.ray.data.Dataset.iter_batches Dataset.iter_batches(*, prefetch_batches: int = 1, batch_size: Optional[int] = 256, batch_format: Optional[str] = 'default', drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None, _collate_fn: Optional[Callable[[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Any]] = None, prefetch_blocks: int = 0) -> Iterator[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]][source] Return a local batched iterator over the dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> for batch in ray.data.range(1000000).iter_batches(): ... print(batch) Time complexity: O(1) Parameters prefetch_batches – The number of batches to fetch ahead of the current batch to fetch. If set to greater than 0, a separate threadpool will be used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1. You can revert back to the old prefetching behavior that uses prefetch_blocks by setting use_legacy_iter_batches to True in the datasetContext. batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different number of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256. 
batch_format – Specify "default" to use the default block format (NumPy), "pandas" to select pandas.DataFrame, "pyarrow" to select pyarrow.Table, "numpy" to select Dict[str, numpy.ndarray], or None to return the underlying block exactly as is with no additional formatting. drop_last – Whether to drop the last batch if it's incomplete. local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained. local_shuffle_seed – The seed to use for the local random shuffle. Returns An iterator over record batches. ray.data.Dataset.iter_rows Dataset.iter_rows(*, prefetch_blocks: int = 0) -> Iterator[Dict[str, Any]][source] Return a local row iterator over the dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> for i in ray.data.range(1000000).iter_rows(): ... print(i) Time complexity: O(1) Parameters prefetch_blocks – The number of blocks to prefetch ahead of the current block during the scan. Returns A local iterator over the entire dataset. ray.data.Dataset.iter_tf_batches Dataset.iter_tf_batches(*, prefetch_batches: int = 1, batch_size: Optional[int] = 256, dtypes: Optional[Union[tf.dtypes.DType, Dict[str, tf.dtypes.DType]]] = None, drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None, prefetch_blocks: int = 0) -> Iterator[Union[tf.Tensor, Dict[str, tf.Tensor]]][source] Return a local batched iterator of TensorFlow Tensors over the dataset. This iterator will yield single-tensor batches if the underlying dataset consists of a single column; otherwise, it will yield a dictionary of column-tensors. If you don't need the additional flexibility provided by this method, consider using to_tf() instead. It's easier to use. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> for batch in ray.data.range( ... 12, ... ).iter_tf_batches(batch_size=4): ... print(batch.shape) (4, 1) (4, 1) (4, 1) Time complexity: O(1) Parameters prefetch_batches – The number of batches to fetch ahead of the current batch. If set to greater than 0, a separate threadpool will be used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1. You can revert back to the old prefetching behavior that uses prefetch_blocks by setting use_legacy_iter_batches to True in the datasetContext. batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different number of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256. dtypes – The TensorFlow dtype(s) for the created tensor(s); if None, the dtype will be inferred from the tensor data. drop_last – Whether to drop the last batch if it's incomplete. local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained.
This buffer size must be greater than or equal to batch_size, and therefore batch_size must also be specified when using local shuffling. local_shuffle_seed – The seed to use for the local random shuffle. Returns An iterator over TensorFlow Tensor batches.ray.data.Dataset.iter_torch_batches Dataset.iter_torch_batches(*, prefetch_batches: int = 1, batch_size: Optional[int] = 256, dtypes: Optional[Union[torch.dtype, Dict[str, torch.dtype]]] = None, device: Optional[str] = None, collate_fn: Optional[Callable[[Dict[str, numpy.ndarray]], Any]] = None, drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None, prefetch_blocks: int = 0) -> Iterator[TorchTensorBatchType][source] Return a local batched iterator of Torch Tensors over the dataset. This iterator will yield single-tensor batches if the underlying dataset consists of a single column; otherwise, it will yield a dictionary of column-tensors. If looking for more flexibility in the tensor conversion (e.g. casting dtypes) or the batch format, try use iter_batches directly, which is a lower-level API. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> for batch in ray.data.range( ... 12, ... ).iter_torch_batches(batch_size=4): ... print(batch.shape) torch.Size([4, 1]) torch.Size([4, 1]) torch.Size([4, 1]) Time complexity: O(1) Parameters prefetch_batches – The number of batches to fetch ahead of the current batch to fetch. If set to greater than 0, a separate threadpool will be used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1. You can revert back to the old prefetching behavior that uses prefetch_blocks by setting use_legacy_iter_batches to True in the datasetContext. batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different number of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256. dtypes – The Torch dtype(s) for the created tensor(s); if None, the dtype will be inferred from the tensor data. device – The device on which the tensor should be placed; if None, the Torch tensor will be constructed on the CPU. collate_fn – A function to convert a Numpy batch to a PyTorch tensor batch. Potential use cases include collating along a dimension other than the first, padding sequences of various lengths, or generally handling batches of different length tensors. If not provided, the default collate function is used which simply converts the batch of numpy arrays to a batch of PyTorch tensors. This API is still experimental and is subject to change. drop_last – Whether to drop the last batch if it’s incomplete. local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained. This buffer size must be greater than or equal to batch_size, and therefore batch_size must also be specified when using local shuffling. local_shuffle_seed – The seed to use for the local random shuffle. Returns An iterator over Torch Tensor batches.ray.data.Dataset.iterator Dataset.iterator() -> ray.data.iterator.DataIterator[source] Return a DataIterator that can be used to repeatedly iterate over the dataset. 
Calling any of the consumption methods on the returned DataIterator will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> for batch in ray.data.range( ... 1000000 ... ).iterator().iter_batches(): ... print(batch) It is recommended to use DataIterator methods over directly calling methods such as iter_batches().ray.data.Dataset.lazy Dataset.lazy() -> ray.data.dataset.Dataset[source] Enable lazy evaluation. Dataset is lazy by default, so this is only useful for datasets created from ray.data.from_items(), which is eager. The returned dataset is a lazy dataset, where all subsequent operations on the stream won’t be executed until the dataset is consumed (e.g. .take(), .iter_batches(), .to_torch(), .to_tf(), etc.) or execution is manually triggered via .materialize(). DEPRECATED: This API is deprecated and may be removed in future Ray releases. Dataset is lazy by default, so this conversion call is no longer needed and this API will be removed in a future releaseray.data.Dataset.limit Dataset.limit(limit: int) -> ray.data.dataset.Dataset[source] Materialize and truncate the dataset to the first limit records. Contrary to :meth`.take`, this will not move any data to the caller’s machine. Instead, it will return a new Dataset pointing to the truncated distributed data. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ds = ray.data.range(1000) >>> ds.limit(5).take_batch() {'id': array([0, 1, 2, 3, 4])} Time complexity: O(limit specified) Parameters limit – The size of the dataset to truncate to. Returns The truncated dataset.ray.data.Dataset.map Dataset.map(fn: Union[Callable[[Dict[str, Any]], Dict[str, Any]], Callable[[Dict[str, Any]], Iterator[Dict[str, Any]]], _CallableClassProtocol], *, compute: Optional[ray.data._internal.compute.ComputeStrategy] = None, num_cpus: Optional[float] = None, num_gpus: Optional[float] = None, **ray_remote_args) -> Dataset[source] Apply the given function to each record of this dataset. Note that mapping individual records can be quite slow. Consider using map_batches() for performance. Examples >>> import ray >>> # Transform python objects. >>> ds = ray.data.range(1000) >>> # The function goes from record (Dict[str, Any]) to record. >>> ds.map(lambda record: {"id": record["id"] * 2}) Map +- Dataset(num_blocks=..., num_rows=1000, schema={id: int64}) >>> # Transform Arrow records. >>> ds = ray.data.from_items( ... [{"value": i} for i in range(1000)]) >>> ds.map(lambda record: {"v2": record["value"] * 2}) Map +- Dataset(num_blocks=200, num_rows=1000, schema={value: int64}) >>> # Define a callable class that persists state across >>> # function invocations for efficiency. >>> init_model = ... >>> class CachedModel: ... def __init__(self): ... self.model = init_model() ... def __call__(self, batch): ... return self.model(batch) >>> # Apply the transform in parallel on GPUs. Since >>> # compute=ActorPoolStrategy(size=8) the transform will be applied on a >>> # pool of 8 Ray actors, each allocated 1 GPU by Ray. >>> ds.map(CachedModel, ... compute=ray.data.ActorPoolStrategy(size=8), ... num_gpus=1) Time complexity: O(dataset size / parallelism) Parameters fn – The function to apply to each record, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy. 
compute – The compute strategy, either None (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool. num_cpus – The number of CPUs to reserve for each parallel map worker. num_gpus – The number of GPUs to reserve for each parallel map worker. For example, specify num_gpus=1 to request 1 GPU for each parallel map worker. ray_remote_args – Additional resource requirements to request from ray for each map worker. flat_map(): Call this method to create new records from existing ones. Unlike map(), a function passed to flat_map() can return multiple records. flat_map() isn’t recommended because it’s slow; call map_batches() instead. map_batches() Call this method to transform batches of data. It’s faster and more flexible than map() and flat_map().ray.data.Dataset.map_batches Dataset.map_batches(fn: Union[Callable[[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Callable[[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Iterator[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]]], _CallableClassProtocol], *, batch_size: Union[int, None, typing_extensions.Literal[default]] = 'default', compute: Optional[ray.data._internal.compute.ComputeStrategy] = None, batch_format: Optional[str] = 'default', zero_copy_batch: bool = False, fn_args: Optional[Iterable[Any]] = None, fn_kwargs: Optional[Dict[str, Any]] = None, fn_constructor_args: Optional[Iterable[Any]] = None, fn_constructor_kwargs: Optional[Dict[str, Any]] = None, num_cpus: Optional[float] = None, num_gpus: Optional[float] = None, **ray_remote_args) -> Dataset[source] Apply the given function to batches of data. This applies the fn in parallel with map tasks, with each task handling a batch of data (typically Dict[str, np.ndarray] or pd.DataFrame). To learn more, see the Transforming batches user guide. If fn does not mutate its input, set zero_copy_batch=True to elide a batch copy, which can improve performance and decrease memory utilization. fn will then receive zero-copy read-only batches. If fn mutates its input, you will need to ensure that the batch provided to fn is writable by setting zero_copy_batch=False (default). This will create an extra, mutable copy of each batch before handing it to fn. The size of the batches provided to fn may be smaller than the provided batch_size if batch_size doesn’t evenly divide the block(s) sent to a given map task. When batch_size is specified, each map task will be sent a single block if the block is equal to or larger than batch_size, and will be sent a bundle of blocks up to (but not exceeding) batch_size if blocks are smaller than batch_size. Examples >>> import numpy as np >>> import ray >>> ds = ray.data.from_items([ ... {"name": "Luna", "age": 4}, ... {"name": "Rory", "age": 14}, ... {"name": "Scout", "age": 9}, ... ]) >>> ds MaterializedDataset( num_blocks=3, num_rows=3, schema={name: string, age: int64} ) Here fn returns the same batch type as the input, but your fn can also return a different batch type (e.g., pd.DataFrame). Read more about Transforming batches. >>> from typing import Dict >>> def map_fn(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: ... batch["age_in_dog_years"] = 7 * batch["age"] ... 
return batch >>> ds = ds.map_batches(map_fn) >>> ds MapBatches(map_fn) +- Dataset(num_blocks=3, num_rows=3, schema={name: string, age: int64}) Actors can improve the performance of some workloads. For example, you can use actors to load a model once per worker instead of once per inference. To transform batches with actors, pass a callable type to fn and specify an ActorPoolStrategy. In the example below, CachedModel is called on an autoscaling pool of two to eight actors, each allocated one GPU by Ray. >>> init_large_model = ... >>> class CachedModel: ... def __init__(self): ... self.model = init_large_model() ... def __call__(self, item): ... return self.model(item) >>> ds.map_batches( ... CachedModel, ... batch_size=256, ... compute=ray.data.ActorPoolStrategy(size=8), ... num_gpus=1, ... ) fn can also be a generator, yielding multiple batches in a single invocation. This is useful when returning large objects. Instead of returning a very large output batch, fn can instead yield the output batch in chunks. >>> def map_fn_with_large_output(batch): ... for i in range(3): ... yield {"large_output": np.ones((100, 1000))} >>> ds = ray.data.from_items([1]) >>> ds = ds.map_batches(map_fn_with_large_output) >>> ds MapBatches(map_fn_with_large_output) +- Dataset(num_blocks=..., num_rows=1, schema={item: int64}) Parameters fn – The function or generator to apply to each record batch, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy. Note fn must be pickle-able. batch_size – The desired number of rows in each batch, or None to use entire blocks as batches (blocks may contain different number of rows). The actual size of the batch provided to fn may be smaller than batch_size if batch_size doesn’t evenly divide the block(s) sent to a given map task. Default batch_size is 4096 with “default”. compute – The compute strategy, either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool. batch_format – Specify "default" to use the default block format (NumPy), "pandas" to select pandas.DataFrame, “pyarrow” to select pyarrow.Table, or "numpy" to select Dict[str, numpy.ndarray], or None to return the underlying block exactly as is with no additional formatting. zero_copy_batch – Whether fn should be provided zero-copy, read-only batches. If this is True and no copy is required for the batch_format conversion, the batch will be a zero-copy, read-only view on data in Ray’s object store, which can decrease memory utilization and improve performance. If this is False, the batch will be writable, which will require an extra copy to guarantee. If fn mutates its input, this will need to be False in order to avoid “assignment destination is read-only” or “buffer source array is read-only” errors. Default is False. fn_args – Positional arguments to pass to fn after the first argument. These arguments are top-level arguments to the underlying Ray task. fn_kwargs – Keyword arguments to pass to fn. These arguments are top-level arguments to the underlying Ray task. fn_constructor_args – Positional arguments to pass to fn’s constructor. You can only provide this if fn is a callable class. These arguments are top-level arguments in the underlying Ray actor construction task. fn_constructor_kwargs – Keyword arguments to pass to fn’s constructor. This can only be provided if fn is a callable class. 
These arguments are top-level arguments in the underlying Ray actor construction task. num_cpus – The number of CPUs to reserve for each parallel map worker. num_gpus – The number of GPUs to reserve for each parallel map worker. For example, specify num_gpus=1 to request 1 GPU for each parallel map worker. ray_remote_args – Additional resource requirements to request from ray for each map worker. iter_batches() Call this function to iterate over batches of data. flat_map(): Call this method to create new records from existing ones. Unlike map(), a function passed to flat_map() can return multiple records. flat_map() isn’t recommended because it’s slow; call map_batches() instead. map() Call this method to transform one record at time. This method isn’t recommended because it’s slow; call map_batches() instead.ray.data.Dataset.materialize Dataset.materialize() -> ray.data.dataset.MaterializedDataset[source] Execute and materialize this dataset into object store memory. This operation will trigger execution of the lazy transformations performed on this dataset. This can be used to read all blocks into memory. By default, Dataset doesn’t read blocks from the datasource until the first transform. Note that this does not mutate the original Dataset. Only the blocks of the returned MaterializedDataset class are pinned in memory. Examples >>> import ray >>> ds = ray.data.range(10) >>> materialized_ds = ds.materialize() >>> materialized_ds MaterializedDataset(num_blocks=..., num_rows=10, schema={id: int64}) Returns A MaterializedDataset holding the materialized data blocks.ray.data.Dataset.max Dataset.max(on: Optional[Union[str, List[str]]] = None, ignore_nulls: bool = True) -> Union[Any, Dict[str, Any]][source] Compute maximum over entire dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ray.data.range(100).max("id") 99 >>> ray.data.from_items([ ... {"A": i, "B": i**2} ... for i in range(100)]).max(["A", "B"]) {'max(A)': 99, 'max(B)': 9801} Parameters on – a column name or a list of column names to aggregate. ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the max; if False, if a null value is encountered, the output will be None. We consider np.nan, None, and pd.NaT to be null values. Default is True. Returns The max result.For different values of on, the return varies:on=None: an dict containing the column-wise max of all columns, on="col": a scalar representing the max of all items in column "col", on=["col_1", ..., "col_n"]: an n-column dict containing the column-wise max of the provided columns.If the dataset is empty, all values are null, or any value is null AND ignore_nulls is False, then the output will be None.ray.data.Dataset.mean Dataset.mean(on: Optional[Union[str, List[str]]] = None, ignore_nulls: bool = True) -> Union[Any, Dict[str, Any]][source] Compute mean over entire dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ray.data.range(100).mean("id") 49.5 >>> ray.data.from_items([ ... {"A": i, "B": i**2} ... for i in range(100)]).mean(["A", "B"]) {'mean(A)': 49.5, 'mean(B)': 3283.5} Parameters on – a column name or a list of column names to aggregate. ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the mean; if False, if a null value is encountered, the output will be None. 
We consider np.nan, None, and pd.NaT to be null values. Default is True. Returns The mean result.For different values of on, the return varies:on=None: an dict containing the column-wise mean of all columns, on="col": a scalar representing the mean of all items in column "col", on=["col_1", ..., "col_n"]: an n-column dict containing the column-wise mean of the provided columns.If the dataset is empty, all values are null, or any value is null AND ignore_nulls is False, then the output will be None.ray.data.Dataset.min Dataset.min(on: Optional[Union[str, List[str]]] = None, ignore_nulls: bool = True) -> Union[Any, Dict[str, Any]][source] Compute minimum over entire dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ray.data.range(100).min("id") 0 >>> ray.data.from_items([ ... {"A": i, "B": i**2} ... for i in range(100)]).min(["A", "B"]) {'min(A)': 0, 'min(B)': 0} Parameters on – a column name or a list of column names to aggregate. ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the min; if False, if a null value is encountered, the output will be None. We consider np.nan, None, and pd.NaT to be null values. Default is True. Returns The min result.For different values of on, the return varies:on=None: an dict containing the column-wise min of all columns, on="col": a scalar representing the min of all items in column "col", on=["col_1", ..., "col_n"]: an n-column dict containing the column-wise min of the provided columns.If the dataset is empty, all values are null, or any value is null AND ignore_nulls is False, then the output will be None.ray.data.Dataset.num_blocks Dataset.num_blocks() -> int[source] Return the number of blocks of this dataset. Note that during read and transform operations, the number of blocks may be dynamically adjusted to respect memory limits, increasing the number of blocks at runtime. Examples >>> import ray >>> ds = ray.data.range(100).repartition(10) >>> ds.num_blocks() 10 Time complexity: O(1) Returns The number of blocks of this dataset.ray.data.Dataset.random_sample Dataset.random_sample(fraction: float, *, seed: Optional[int] = None) -> ray.data.dataset.Dataset[source] Randomly samples a fraction of the elements of this dataset. Note that the exact number of elements returned is not guaranteed, and that the number of elements being returned is roughly fraction * total_rows. Examples >>> import ray >>> ds = ray.data.range(100) >>> ds.random_sample(0.1) >>> ds.random_sample(0.2, seed=12345) Parameters fraction – The fraction of elements to sample. seed – Seeds the python random pRNG generator. Returns Returns a Dataset containing the sampled elements.ray.data.Dataset.random_shuffle Dataset.random_shuffle(*, seed: Optional[int] = None, num_blocks: Optional[int] = None, **ray_remote_args) -> ray.data.dataset.Dataset[source] Randomly shuffle the elements of this dataset. random_shuffle can be slow. For better performance, try Iterating over batches with shuffling. Examples >>> import ray >>> ds = ray.data.range(100) >>> # Shuffle this dataset randomly. >>> ds.random_shuffle() RandomShuffle +- Dataset(num_blocks=..., num_rows=100, schema={id: int64}) >>> # Shuffle this dataset with a fixed random seed. 
>>> ds.random_shuffle(seed=12345) RandomShuffle +- Dataset(num_blocks=..., num_rows=100, schema={id: int64}) Time complexity: O(dataset size / parallelism) Parameters seed – Fix the random seed to use, otherwise one will be chosen based on system randomness. num_blocks – The number of output blocks after the shuffle, or None to retain the number of blocks. Returns The shuffled dataset.ray.data.Dataset.randomize_block_order Dataset.randomize_block_order(*, seed: Optional[int] = None) -> ray.data.dataset.Dataset[source] Randomly shuffle the blocks of this dataset. Examples >>> import ray >>> ds = ray.data.range(100) >>> # Randomize the block order. >>> ds.randomize_block_order() >>> # Randomize the block order with a fixed random seed. >>> ds.randomize_block_order(seed=12345) Parameters seed – Fix the random seed to use, otherwise one will be chosen based on system randomness. Returns The block-shuffled dataset.ray.data.Dataset.repartition Dataset.repartition(num_blocks: int, *, shuffle: bool = False) -> ray.data.dataset.Dataset[source] Repartition the dataset into exactly this number of blocks. After repartitioning, all blocks in the returned dataset will have approximately the same number of rows. Repartition has two modes: shuffle=False - performs the minimal data movement needed to equalize block sizes shuffle=True - performs a full distributed shuffle https://docs.google.com/drawings/d/132jhE3KXZsf29ho1yUdPrCHB9uheHBWHJhDQMXqIVPA/edit Examples >>> import ray >>> ds = ray.data.range(100) >>> # Set the number of output partitions to write to disk. >>> ds.repartition(10).write_parquet("/tmp/test") Time complexity: O(dataset size / parallelism) Parameters num_blocks – The number of blocks. shuffle – Whether to perform a distributed shuffle during the repartition. When shuffle is enabled, each output block contains a subset of data rows from each input block, which requires all-to-all data movement. When shuffle is disabled, output blocks are created from adjacent input blocks, minimizing data movement. Returns The repartitioned dataset.ray.data.Dataset.repeat Dataset.repeat(times: Optional[int] = None) -> DatasetPipeline[source] Convert this into a DatasetPipeline by looping over this dataset. Transformations prior to the call to repeat() are evaluated once. Transformations done on the returned pipeline are evaluated on each loop of the pipeline over the base dataset. Note that every repeat of the dataset is considered an “epoch” for the purposes of DatasetPipeline.iter_epochs(). This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ds = ray.data.range(5, parallelism=1) >>> # Infinite pipeline of numbers [0, 5) >>> ds.repeat().take_batch() {'id': array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4, ...])} >>> # Can shuffle each epoch (dataset) in the pipeline. >>> ds.repeat().random_shuffle().take_batch() {'id': array([2, 3, 0, 4, 1, 4, 0, 2, 1, 3, ...])} Parameters times – The number of times to loop over this dataset, or None to repeat indefinitely. DEPRECATED: This API is deprecated and may be removed in future Ray releases.ray.data.Dataset.schema Dataset.schema(fetch_if_missing: bool = True) -> Optional[ray.data.dataset.Schema][source] Return the schema of the dataset. 
Examples >>> import ray >>> ds = ray.data.range(10) >>> ds.schema() Column Type ------ ---- id int64 If this dataset consists of more than a read, or if the schema can’t be determined from the metadata provided by the datasource, or if fetch_if_missing=True (the default), then this operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(1) Parameters fetch_if_missing – If True, synchronously fetch the schema if it’s not known. If False, None is returned if the schema is not known. Default is True. Returns The ray.data.Schema class of the records, or None if the schema is not known and fetch_if_missing is False.ray.data.Dataset.select_columns Dataset.select_columns(cols: List[str], *, compute: Optional[Union[str, ray.data._internal.compute.ComputeStrategy]] = None, **ray_remote_args) -> ray.data.dataset.Dataset[source] Select one or more columns from the dataset. All input columns used to select need to be in the schema of the dataset. Examples >>> import ray >>> # Create a dataset with 3 columns >>> ds = ray.data.from_items([{"col1": i, "col2": i+1, "col3": i+2} ... for i in range(10)]) >>> # Select only "col1" and "col2" columns. >>> ds = ds.select_columns(cols=["col1", "col2"]) >>> ds MapBatches() +- Dataset( num_blocks=..., num_rows=10, schema={col1: int64, col2: int64, col3: int64} ) Time complexity: O(dataset size / parallelism) Parameters cols – Names of the columns to select. If any name is not included in the dataset schema, an exception will be raised. compute – The compute strategy, either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool. ray_remote_args – Additional resource requirements to request from ray (e.g., num_gpus=1 to request GPUs for the map tasks).ray.data.Dataset.serialize_lineage Dataset.serialize_lineage() -> bytes[source] Serialize this dataset’s lineage, not the actual data or the existing data futures, to bytes that can be stored and later deserialized, possibly on a different cluster. Note that this will drop all computed data, and that everything will be recomputed from scratch after deserialization. Use Dataset.deserialize_lineage() to deserialize the serialized bytes returned from this method into a Dataset. Unioned and zipped datasets, produced by :py:meth`Dataset.union` and Dataset.zip(), are not lineage-serializable. Examples import ray ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") serialized_ds = ds.serialize_lineage() ds = ray.data.Dataset.deserialize_lineage(serialized_ds) print(ds) Dataset( num_blocks=1, num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64 } ) Returns Serialized bytes containing the lineage of this dataset. DeveloperAPI: This API may change across minor Ray releases.ray.data.Dataset.show Dataset.show(limit: int = 20) -> None[source] Print up to the given number of records from the dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(limit specified) Parameters limit – The max number of records to print.ray.data.Dataset.size_bytes Dataset.size_bytes() -> int[source] Return the in-memory size of the dataset. 
Examples >>> import ray >>> ds = ray.data.range(10) >>> ds.size_bytes() 80 If this dataset consists of more than a read, then this operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(1) Returns The in-memory size of the dataset in bytes, or None if the in-memory size is not known.ray.data.Dataset.sort Dataset.sort(key: Optional[str] = None, descending: bool = False) -> ray.data.dataset.Dataset[source] Sort the dataset by the specified key column or key function. Examples >>> import ray >>> # Sort by a single column in descending order. >>> ds = ray.data.from_items( ... [{"value": i} for i in range(1000)]) >>> ds.sort("value", descending=True) Sort +- Dataset(num_blocks=200, num_rows=1000, schema={value: int64}) Time complexity: O(dataset size * log(dataset size / parallelism)) Parameters key – The column to sort by. To sort by multiple columns, use a map function to generate the sort column beforehand. descending – Whether to sort in descending order. Returns A new, sorted dataset.ray.data.Dataset.split Dataset.split(n: int, *, equal: bool = False, locality_hints: Optional[List[Any]] = None) -> List[ray.data.dataset.MaterializedDataset][source] Materialize and split the dataset into n disjoint pieces. This returns a list of MaterializedDatasets that can be passed to Ray tasks and actors and used to read the dataset records in parallel. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ds = ray.data.range(100) >>> workers = ... >>> # Split up a dataset to process over `n` worker actors. >>> shards = ds.split(len(workers), locality_hints=workers) >>> for shard, worker in zip(shards, workers): ... worker.consume.remote(shard) Time complexity: O(1) See also: Dataset.split_at_indices, Dataset.split_proportionately, and Dataset.streaming_split. Parameters n – Number of child datasets to return. equal – Whether to guarantee each split has an equal number of records. This may drop records if they cannot be divided equally among the splits. locality_hints – [Experimental] A list of Ray actor handles of size n. The system will try to co-locate the blocks of the i-th dataset with the i-th actor to maximize data locality. Returns A list of n disjoint dataset splits.ray.data.Dataset.split_at_indices Dataset.split_at_indices(indices: List[int]) -> List[ray.data.dataset.MaterializedDataset][source] Materialize and split the dataset at the given indices (like np.split). This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ds = ray.data.range(10) >>> d1, d2, d3 = ds.split_at_indices([2, 5]) >>> d1.take_batch() {'id': array([0, 1])} >>> d2.take_batch() {'id': array([2, 3, 4])} >>> d3.take_batch() {'id': array([5, 6, 7, 8, 9])} Time complexity: O(num splits) See also: Dataset.split_at_indices, Dataset.split_proportionately, and Dataset.streaming_split. Parameters indices – List of sorted integers which indicate where the dataset will be split. If an index exceeds the length of the dataset, an empty dataset will be returned. Returns The dataset splits.ray.data.Dataset.split_proportionately Dataset.split_proportionately(proportions: List[float]) -> List[ray.data.dataset.MaterializedDataset][source] Materialize and split the dataset using proportions. A common use case for this would be splitting the dataset into train and test sets (equivalent to eg. scikit-learn’s train_test_split). 
See also Dataset.train_test_split for a higher level abstraction. The indices to split at will be calculated in such a way that all splits always contain at least one element. If that is not possible, an exception will be raised. This is equivalent to calculating the indices manually and calling Dataset.split_at_indices. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ds = ray.data.range(10) >>> d1, d2, d3 = ds.split_proportionately([0.2, 0.5]) >>> d1.take_batch() {'id': array([0, 1])} >>> d2.take_batch() {'id': array([2, 3, 4, 5, 6])} >>> d3.take_batch() {'id': array([7, 8, 9])} Time complexity: O(num splits) See also: Dataset.split, Dataset.split_at_indices, Dataset.train_test_split Parameters proportions – List of proportions to split the dataset according to. The proportions must sum to less than 1, and each proportion must be greater than 0. Returns The dataset splits.
ray.data.Dataset.stats Dataset.stats() -> str[source] Returns a string containing execution timing information. Note that this does not trigger execution, so if the dataset has not yet executed, an empty string will be returned. Examples:
import ray
ds = ray.data.range(10)
assert ds.stats() == ""
ds = ds.materialize()
print(ds.stats())
Stage 0 Read: .../... blocks executed in ...
* Remote wall time: ... min, ... max, ... mean, ... total
* Remote cpu time: ... min, ... max, ... mean, ... total
* Peak heap memory usage (MiB): ... min, ... max, ... mean
* Output num rows: ... min, ... max, ... mean, ... total
* Output size bytes: ... min, ... max, ... mean, ... total
* Tasks per node: ... min, ... max, ... mean; ... nodes used
ray.data.Dataset.std Dataset.std(on: Optional[Union[str, List[str]]] = None, ddof: int = 1, ignore_nulls: bool = True) -> Union[Any, Dict[str, Any]][source] Compute standard deviation over entire dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> round(ray.data.range(100).std("id", ddof=0), 5) 28.86607 >>> ray.data.from_items([ ... {"A": i, "B": i**2} ... for i in range(100)]).std(["A", "B"]) {'std(A)': 29.011491975882016, 'std(B)': 2968.1748039269296} This uses Welford’s online method for an accumulator-style computation of the standard deviation. This method was chosen due to its numerical stability and because it can be computed in a single pass. This may give different (but more accurate) results than NumPy, Pandas, and sklearn, which use a less numerically stable two-pass algorithm. See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford’s_online_algorithm (a minimal sketch of the update rule follows this entry’s parameter list). Parameters on – a column name or a list of column names to aggregate. ddof – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the std; if False, if a null value is encountered, the output will be None. We consider np.nan, None, and pd.NaT to be null values. Default is True. 
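The single-pass accumulator referenced above can be sketched in a few lines of plain Python. This is only an illustration of Welford's update rule, not Ray's internal implementation; the helper name welford_std is made up for this example.
import math

def welford_std(values, ddof=1):
    # Welford's online algorithm: keep a running count, mean, and M2
    # (sum of squared deviations from the current mean) in one pass.
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean
    return math.sqrt(m2 / (count - ddof))

# Matches the doctest above: the population std (ddof=0) of 0..99 is ~28.86607.
print(round(welford_std(range(100), ddof=0), 5))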
Returns The standard deviation result.For different values of on, the return varies:on=None: an dict containing the column-wise std of all columns, on="col": a scalar representing the std of all items in column "col", on=["col_1", ..., "col_n"]: an n-column dict containing the column-wise std of the provided columns.If the dataset is empty, all values are null, or any value is null AND ignore_nulls is False, then the output will be None.ray.data.Dataset.streaming_split Dataset.streaming_split(n: int, *, equal: bool = False, locality_hints: Optional[List[NodeIdStr]] = None) -> List[ray.data.iterator.DataIterator][source] Returns n DataIterators that can be used to read disjoint subsets of the dataset in parallel. This method is the recommended way to consume Datasets from multiple processes (e.g., for distributed training), and requires streaming execution mode. Streaming split works by delegating the execution of this Dataset to a coordinator actor. The coordinator pulls block references from the executed stream, and divides those blocks among n output iterators. Iterators pull blocks from the coordinator actor to return to their caller on next. The returned iterators are also repeatable; each iteration will trigger a new execution of the Dataset. There is an implicit barrier at the start of each iteration, which means that next must be called on all iterators before the iteration starts. Warning: because iterators are pulling blocks from the same Dataset execution, if one iterator falls behind other iterators may be stalled. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ds = ray.data.range(1000000) >>> it1, it2 = ds.streaming_split(2, equal=True) >>> # Can consume from both iterators in parallel. >>> @ray.remote ... def consume(it): ... for batch in it.iter_batches(): ... print(batch) >>> ray.get([consume.remote(it1), consume.remote(it2)]) >>> # Can loop over the iterators multiple times (multiple epochs). >>> @ray.remote ... def train(it): ... NUM_EPOCHS = 100 ... for _ in range(NUM_EPOCHS): ... for batch in it.iter_batches(): ... print(batch) >>> ray.get([train.remote(it1), train.remote(it2)]) >>> # ERROR: this will block waiting for a read on `it2` to start. >>> ray.get(train.remote(it1)) Parameters n – Number of output iterators to return. equal – If True, each output iterator will see an exactly equal number of rows, dropping data if necessary. If False, some iterators may see slightly more or less rows than other, but no data will be dropped. locality_hints – Specify the node ids corresponding to each iterator location. Dataset will try to minimize data movement based on the iterator output locations. This list must have length n. You can get the current node id of a task or actor by calling ray.get_runtime_context().get_node_id(). Returns The output iterator splits. These iterators are Ray-serializable and can be freely passed to any Ray task or actor.ray.data.Dataset.sum Dataset.sum(on: Optional[Union[str, List[str]]] = None, ignore_nulls: bool = True) -> Union[Any, Dict[str, Any]][source] Compute sum over entire dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ray.data.range(100).sum("id") 4950 >>> ray.data.from_items([ ... {"A": i, "B": i**2} ... for i in range(100)]).sum(["A", "B"]) {'sum(A)': 4950, 'sum(B)': 328350} Parameters on – a column name or a list of column names to aggregate. 
ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the sum; if False, if a null value is encountered, the output will be None. We consider np.nan, None, and pd.NaT to be null values. Default is True. Returns The sum result.For different values of on, the return varies:on=None: a dict containing the column-wise sum of all columns, on="col": a scalar representing the sum of all items in column "col", on=["col_1", ..., "col_n"]: an n-column dict containing the column-wise sum of the provided columns.If the dataset is empty, all values are null, or any value is null AND ignore_nulls is False, then the output will be None.ray.data.Dataset.take Dataset.take(limit: int = 20) -> List[Dict[str, Any]][source] Return up to limit records from the dataset. This will move up to limit records to the caller’s machine; if limit is very large, this can result in an OutOfMemory crash on the caller. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(limit specified) Parameters limit – The max number of records to return. Returns A list of up to limit records from the dataset.ray.data.Dataset.take_all Dataset.take_all(limit: Optional[int] = None) -> List[Dict[str, Any]][source] Return all of the records in the dataset. This will move the entire dataset to the caller’s machine; if the dataset is very large, this can result in an OutOfMemory crash on the caller. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(dataset size) Parameters limit – Raise an error if the size exceeds the specified limit. Returns A list of all the records in the dataset.ray.data.Dataset.take_batch Dataset.take_batch(batch_size: int = 20, *, batch_format: Optional[str] = 'default') -> Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]][source] Return up to batch_size records from the dataset in a batch. Unlike take(), the records are returned in the same format as used for iter_batches and map_batches. This will move up to batch_size records to the caller’s machine; if batch_size is very large, this can result in an OutOfMemory crash on the caller. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(batch_size specified) Parameters batch_size – The max number of records to return. batch_format – Specify "default" to use the default block format (NumPy), "pandas" to select pandas.DataFrame, “pyarrow” to select pyarrow.Table, or "numpy" to select Dict[str, numpy.ndarray], or None to return the underlying block exactly as is with no additional formatting. Returns A batch of up to batch_size records from the dataset. Raises ValueError if the dataset is empty. – ray.data.Dataset.to_random_access_dataset Dataset.to_random_access_dataset(key: str, num_workers: Optional[int] = None) -> ray.data.random_access_dataset.RandomAccessDataset[source] Convert this dataset into a distributed RandomAccessDataset (EXPERIMENTAL). RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the dataset. Note that the key must be unique in the dataset. If there are duplicate keys, an arbitrary value is returned. This is only supported for Arrow-format datasets. 
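As a minimal sketch of how such a view might be built and queried, assuming a dataset whose "id" column is unique (get_async() and multiget() are the query methods mentioned in the num_workers note below):
import ray

ds = ray.data.range(1000)                    # "id" values are unique
rad = ds.to_random_access_dataset(key="id")

# Point lookup: get_async() returns an object ref to the matching record.
print(ray.get(rad.get_async(42)))
# Batched lookup over several keys at once.
print(rad.multiget([1, 2, 3]))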
This operation will trigger execution of the lazy transformations performed on this dataset. Parameters key – The key column over which records can be queried. num_workers – The number of actors to use to serve random access queries. By default, this is determined by multiplying the number of Ray nodes in the cluster by four. As a rule of thumb, you can expect each worker to provide ~3000 records / second via get_async(), and ~10000 records / second via multiget().ray.data.Dataset.to_tf Dataset.to_tf(feature_columns: Union[str, List[str]], label_columns: Union[str, List[str]], *, prefetch_batches: int = 1, batch_size: int = 1, drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None, prefetch_blocks: int = 0) -> tf.data.Dataset[source] Return a TF Dataset over this dataset. If your dataset contains ragged tensors, this method errors. To prevent errors, resize your tensors. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") >>> ds Dataset( num_blocks=..., num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64 } ) If your model accepts a single tensor as input, specify a single feature column. >>> ds.to_tf(feature_columns="sepal length (cm)", label_columns="target") <_OptionsDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.float64, name='sepal length (cm)'), TensorSpec(shape=(None,), dtype=tf.int64, name='target'))> If your model accepts a dictionary as input, specify a list of feature columns. >>> ds.to_tf(["sepal length (cm)", "sepal width (cm)"], "target") <_OptionsDataset element_spec=({'sepal length (cm)': TensorSpec(shape=(None,), dtype=tf.float64, name='sepal length (cm)'), 'sepal width (cm)': TensorSpec(shape=(None,), dtype=tf.float64, name='sepal width (cm)')}, TensorSpec(shape=(None,), dtype=tf.int64, name='target'))> If your dataset contains multiple features but your model accepts a single tensor as input, combine features with Concatenator. >>> from ray.data.preprocessors import Concatenator >>> preprocessor = Concatenator(output_column_name="features", exclude="target") >>> ds = preprocessor.transform(ds) >>> ds Concatenator +- Dataset( num_blocks=..., num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64 } ) >>> ds.to_tf("features", "target") <_OptionsDataset element_spec=(TensorSpec(shape=(None, 4), dtype=tf.float64, name='features'), TensorSpec(shape=(None,), dtype=tf.int64, name='target'))> Parameters feature_columns – Columns that correspond to model inputs. If this is a string, the input data is a tensor. If this is a list, the input data is a dict that maps column names to their tensor representation. label_column – Columns that correspond to model targets. If this is a string, the target data is a tensor. If this is a list, the target data is a dict that maps column names to their tensor representation. prefetch_batches – The number of batches to fetch ahead of the current batch to fetch. If set to greater than 0, a separate threadpool will be used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1. You can revert back to the old prefetching behavior that uses prefetch_blocks by setting use_legacy_iter_batches to True in the datasetContext. 
batch_size – Record batch size. Defaults to 1. drop_last – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the stream is not divisible by the batch size, then the last batch will be smaller. Defaults to False. local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained. This buffer size must be greater than or equal to batch_size, and therefore batch_size must also be specified when using local shuffling. local_shuffle_seed – The seed to use for the local random shuffle. Returns A tf.data.Dataset that yields inputs and targets. iter_tf_batches() Call this method if you need more flexibility.ray.data.Dataset.to_torch Dataset.to_torch(*, label_column: Optional[str] = None, feature_columns: Optional[Union[List[str], List[List[str]], Dict[str, List[str]]]] = None, label_column_dtype: Optional[torch.dtype] = None, feature_column_dtypes: Optional[Union[torch.dtype, List[torch.dtype], Dict[str, torch.dtype]]] = None, batch_size: int = 1, prefetch_batches: int = 1, drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None, unsqueeze_label_tensor: bool = True, unsqueeze_feature_tensors: bool = True, prefetch_blocks: int = 0) -> torch.utils.data.IterableDataset[source] Return a Torch IterableDataset over this dataset. This is only supported for datasets convertible to Arrow records. It is recommended to use the returned IterableDataset directly instead of passing it into a torch DataLoader. Each element in IterableDataset will be a tuple consisting of 2 elements. The first item contains the feature tensor(s), and the second item is the label tensor. Those can take on different forms, depending on the specified arguments. For the features tensor (N is the batch_size and n, m, k are the number of features per tensor): If feature_columns is a List[str], the features will be a tensor of shape (N, n), with columns corresponding to feature_columns If feature_columns is a List[List[str]], the features will be a list of tensors of shape [(N, m),…,(N, k)], with columns of each tensor corresponding to the elements of feature_columns If feature_columns is a Dict[str, List[str]], the features will be a dict of key-tensor pairs of shape {key1: (N, m),…, keyN: (N, k)}, with columns of each tensor corresponding to the value of feature_columns under the key. If unsqueeze_label_tensor=True (default), the label tensor will be of shape (N, 1). Otherwise, it will be of shape (N,). If label_column is specified as None, then no column from the Dataset will be treated as the label, and the output label tensor will be None. Note that you probably want to call Dataset.split() on this dataset if there are to be multiple Torch workers consuming the data. This operation will trigger execution of the lazy transformations performed on this dataset. Time complexity: O(1) Parameters label_column – The name of the column used as the label (second element of the output list). Can be None for prediction, in which case the second element of returned tuple will also be None. feature_columns – The names of the columns to use as the features. Can be a list of lists or a dict of string-list pairs for multi-tensor output. 
If None, then use all columns except the label column as the features. label_column_dtype – The torch dtype to use for the label column. If None, then automatically infer the dtype. feature_column_dtypes – The dtypes to use for the feature tensors. This should match the format of feature_columns, or be a single dtype, in which case it will be applied to all tensors. If None, then automatically infer the dtype. batch_size – How many samples per batch to yield at a time. Defaults to 1. prefetch_batches – The number of batches to prefetch ahead of the current batch. If set to greater than 0, a separate threadpool will be used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1. You can revert to the old prefetching behavior that uses prefetch_blocks by setting use_legacy_iter_batches to True in the DataContext. drop_last – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the stream is not divisible by the batch size, then the last batch will be smaller. Defaults to False. local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained. This buffer size must be greater than or equal to batch_size, and therefore batch_size must also be specified when using local shuffling. local_shuffle_seed – The seed to use for the local random shuffle. unsqueeze_label_tensor – If set to True, the label tensor will be unsqueezed (reshaped to (N, 1)). Otherwise, it will be left as is, that is (N, ). In general, regression loss functions expect an unsqueezed tensor, while classification loss functions expect a squeezed one. Defaults to True. unsqueeze_feature_tensors – If set to True, the feature tensors will be unsqueezed (reshaped to (N, 1)) before being concatenated into the final features tensor. Otherwise, they will be left as is, that is (N, ). Defaults to True. Returns A torch IterableDataset.
ray.data.Dataset.train_test_split Dataset.train_test_split(test_size: Union[int, float], *, shuffle: bool = False, seed: Optional[int] = None) -> Tuple[ray.data.dataset.MaterializedDataset, ray.data.dataset.MaterializedDataset][source] Materialize and split the dataset into train and test subsets. This operation will trigger execution of the lazy transformations performed on this dataset. Examples >>> import ray >>> ds = ray.data.range(8) >>> train, test = ds.train_test_split(test_size=0.25) >>> train.take_batch() {'id': array([0, 1, 2, 3, 4, 5])} >>> test.take_batch() {'id': array([6, 7])} Parameters test_size – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. The train split will always be the complement of the test split. shuffle – Whether or not to globally shuffle the dataset before splitting. Defaults to False. This may be a very expensive operation with a large dataset. seed – Fix the random seed to use for shuffle, otherwise one will be chosen based on system randomness. Ignored if shuffle=False. 
Returns Train and test subsets as two MaterializedDatasets.ray.data.Dataset.union Dataset.union(*other: List[ray.data.dataset.Dataset]) -> ray.data.dataset.Dataset[source] Materialize and combine this dataset with others of the same type. The order of the blocks in the datasets is preserved, as is the relative ordering between the datasets passed in the argument list. Unioned datasets are not lineage-serializable, i.e. they can not be used as a tunable hyperparameter in Ray Tune. This operation will trigger execution of the lazy transformations performed on this dataset. Parameters other – List of datasets to combine with this one. The datasets must have the same schema as this dataset, otherwise the behavior is undefined. Returns A new dataset holding the union of their data.ray.data.Dataset.unique Dataset.unique(column: str) -> List[Any][source] List of unique elements in the given column. Examples >>> import ray >>> ds = ray.data.from_items([1, 2, 3, 2, 3]) >>> ds.unique("item") [1, 2, 3] This function is very useful for computing labels in a machine learning dataset: >>> import ray >>> ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv") >>> ds.unique("target") [0, 1, 2] One common use case is to convert the class labels into integers for training and inference: >>> classes = {0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'} >>> def preprocessor(df, classes): ... df["variety"] = df["target"].map(classes) ... return df >>> train_ds = ds.map_batches( ... preprocessor, fn_kwargs={"classes": classes}, batch_format="pandas") >>> train_ds.sort("sepal length (cm)").take(1) # Sort to make it deterministic [{'sepal length (cm)': 4.3, ..., 'variety': 'Setosa'}] Time complexity: O(dataset size * log(dataset size / parallelism)) Parameters column – The column to collect unique elements over. Returns A list with unique elements in the given column.ray.data.Dataset.window Dataset.window(*, blocks_per_window: Optional[int] = None, bytes_per_window: Optional[int] = None) -> DatasetPipeline[source] Convert this into a DatasetPipeline by windowing over data blocks. Transformations prior to the call to window() are evaluated in bulk on the entire dataset. Transformations done on the returned pipeline are evaluated incrementally per window of blocks as data is read from the output of the pipeline. Windowing execution allows for output to be read sooner without waiting for all transformations to fully execute, and can also improve efficiency if transforms use different resources (e.g., GPUs). Without windowing: [preprocessing......] [inference.......] [write........] Time -----------------------------------------------------------> With windowing: [prep1] [prep2] [prep3] [infer1] [infer2] [infer3] [write1] [write2] [write3] Time -----------------------------------------------------------> Examples >>> import ray >>> # Create an inference pipeline. >>> ds = ray.data.read_binary_files(dir) >>> infer = ... >>> pipe = ds.window(blocks_per_window=10).map(infer) DatasetPipeline(num_windows=40, num_stages=2) >>> # The higher the stage parallelism, the shorter the pipeline. >>> pipe = ds.window(blocks_per_window=20).map(infer) DatasetPipeline(num_windows=20, num_stages=2) >>> # Outputs can be incrementally read from the pipeline. >>> for item in pipe.iter_rows(): ... print(item) Parameters blocks_per_window – The window size (parallelism) in blocks. Increasing window size increases pipeline throughput, but also increases the latency to initial output, since it decreases the length of the pipeline. 
Setting this to infinity effectively disables pipelining. bytes_per_window – Specify the window size in bytes instead of blocks. This will be treated as an upper bound for the window size, but each window will still include at least one block. This is mutually exclusive with blocks_per_window. DEPRECATED: This API is deprecated and may be removed in future Ray releases.
ray.data.Dataset.write_webdataset Dataset.write_webdataset(path: str, *, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.file_based_datasource.BlockWritePathProvider = DefaultBlockWritePathProvider(), ray_remote_args: Dict[str, Any] = None, encoder: Optional[Union[bool, str, callable, list]] = True) -> None[source] Write the dataset to WebDataset files. Each row of the dataset is written as one WebDataset sample, packed into tar-based archives as defined by the WebDataset format. This is only supported for datasets convertible to Arrow records. To control the number of files, use Dataset.repartition(). Unless a custom block path provider is given, the output files are named {uuid}_{block_idx} with the WebDataset file extension, where uuid is a unique id for the dataset. This operation will trigger execution of the lazy transformations performed on this dataset. Examples
import ray
ds = ray.data.range(100)
ds.write_webdataset("s3://bucket/folder/")
Time complexity: O(dataset size / parallelism) Parameters path – The path to the destination root directory, where the output files will be written to. filesystem – The filesystem implementation to write to. try_create_dir – Try to create all directories in destination path if True. Does nothing if all directories already exist. arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream block_path_provider – BlockWritePathProvider implementation to write each dataset block to a custom output path. ray_remote_args – Kwargs passed to ray.remote in the write tasks. PublicAPI (alpha): This API is in alpha and may change before becoming stable.
ray.data.Dataset.zip Dataset.zip(other: ray.data.dataset.Dataset) -> ray.data.dataset.Dataset[source] Materialize and zip this dataset with the elements of another. The datasets must have the same number of rows. Their column sets will be merged, and any duplicate column names disambiguated with _1, _2, etc. suffixes. The smaller of the two datasets will be repartitioned to align the number of rows per block with the larger dataset. Zipped datasets are not lineage-serializable, i.e. they cannot be used as a tunable hyperparameter in Ray Tune. Examples >>> import ray >>> ds1 = ray.data.range(5) >>> ds2 = ray.data.range(5) >>> ds1.zip(ds2).take_batch() {'id': array([0, 1, 2, 3, 4]), 'id_1': array([0, 1, 2, 3, 4])} Time complexity: O(dataset size / parallelism) Parameters other – The dataset to zip with on the right hand side. Returns A Dataset containing the columns of the second dataset concatenated horizontally with the columns of the first dataset, with duplicate column names disambiguated with _1, _2, etc. suffixes. Attributes context Return the DataContext used to create this Dataset. ray.data.Dataset.context property Dataset.context: ray.data.context.DataContext Return the DataContext used to create this Dataset. 
DeveloperAPI: This API may change across minor Ray releases. Basic Transformations Dataset.map(fn, *[, compute, num_cpus, num_gpus]) Apply the given function to each record of this dataset. Dataset.map_batches(fn, *[, batch_size, ...]) Apply the given function to batches of data. Dataset.flat_map(fn, *[, compute, num_cpus, ...]) Apply the given function to each record and then flatten results. Dataset.filter(fn, *[, compute]) Filter out records that do not satisfy the given predicate. Dataset.add_column(col, fn, *[, compute]) Add the given column to the dataset. Dataset.drop_columns(cols, *[, compute]) Drop one or more columns from the dataset. Dataset.select_columns(cols, *[, compute]) Select one or more columns from the dataset. Dataset.random_sample(fraction, *[, seed]) Randomly samples a fraction of the elements of this dataset. Dataset.limit(limit) Materialize and truncate the dataset to the first limit records. Sorting, Shuffling, Repartitioning Dataset.sort([key, descending]) Sort the dataset by the specified key column or key function. Dataset.random_shuffle(*[, seed, num_blocks]) Randomly shuffle the elements of this dataset. Dataset.randomize_block_order(*[, seed]) Randomly shuffle the blocks of this dataset. Dataset.repartition(num_blocks, *[, shuffle]) Repartition the dataset into exactly this number of blocks. Splitting and Merging Datasets Dataset.split(n, *[, equal, locality_hints]) Materialize and split the dataset into n disjoint pieces. Dataset.split_at_indices(indices) Materialize and split the dataset at the given indices (like np.split). Dataset.split_proportionately(proportions) Materialize and split the dataset using proportions. Dataset.streaming_split(n, *[, equal, ...]) Returns n DataIterators that can be used to read disjoint subsets of the dataset in parallel. Dataset.train_test_split(test_size, *[, ...]) Materialize and split the dataset into train and test subsets. Dataset.union(*other) Materialize and combine this dataset with others of the same type. Dataset.zip(other) Materialize and zip this dataset with the elements of another. Grouped and Global Aggregations Dataset.groupby(key) Group the dataset by the key function or column name. Dataset.unique(column) List of unique elements in the given column. Dataset.aggregate(*aggs) Aggregate the entire dataset as one group. Dataset.sum([on, ignore_nulls]) Compute sum over entire dataset. Dataset.min([on, ignore_nulls]) Compute minimum over entire dataset. Dataset.max([on, ignore_nulls]) Compute maximum over entire dataset. Dataset.mean([on, ignore_nulls]) Compute mean over entire dataset. Dataset.std([on, ddof, ignore_nulls]) Compute standard deviation over entire dataset. Consuming Data Dataset.show([limit]) Print up to the given number of records from the dataset. Dataset.take([limit]) Return up to limit records from the dataset. Dataset.take_batch([batch_size, batch_format]) Return up to batch_size records from the dataset in a batch. Dataset.take_all([limit]) Return all of the records in the dataset. Dataset.iterator() Return a DataIterator that can be used to repeatedly iterate over the dataset. Dataset.iter_rows(*[, prefetch_blocks]) Return a local row iterator over the dataset. Dataset.iter_batches(*[, prefetch_batches, ...]) Return a local batched iterator over the dataset. Dataset.iter_torch_batches(*[, ...]) Return a local batched iterator of Torch Tensors over the dataset. Dataset.iter_tf_batches(*[, ...]) Return a local batched iterator of TensorFlow Tensors over the dataset. 
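The transformation, sorting, splitting, and consumption methods summarized above compose into short pipelines. The following is a minimal illustrative sketch on a toy dataset of 100 integer rows (not taken from the reference entries above), chaining a few of them:

import ray

# Toy pipeline chaining several of the methods summarized above.
ds = ray.data.range(100)                                    # rows: {"id": 0..99}
ds = ds.map_batches(lambda b: {"id": b["id"], "square": b["id"] ** 2})
ds = ds.filter(lambda row: row["id"] % 2 == 0)              # keep only even ids
ds = ds.sort("square", descending=True)                     # global sort by column
train, test = ds.train_test_split(test_size=0.25)

print(train.count(), test.count())
train.show(limit=3)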
I/O and Conversion Dataset.write_parquet(path, *[, filesystem, ...]) Write the dataset to parquet. Dataset.write_json(path, *[, filesystem, ...]) Write the dataset to json. Dataset.write_csv(path, *[, filesystem, ...]) Write the dataset to csv. Dataset.write_numpy(path, *[, column, ...]) Write a tensor column of the dataset to npy files. Dataset.write_tfrecords(path, *[, ...]) Write the dataset to TFRecord files. Dataset.write_webdataset(path, *[, ...]) Write the dataset to WebDataset files. Dataset.write_mongo(uri, database, collection) Write the dataset to a MongoDB datasource. Dataset.write_datasource(datasource, *[, ...]) Write the dataset to a custom datasource. Dataset.to_torch(*[, label_column, ...]) Return a Torch IterableDataset over this dataset. Dataset.to_tf(feature_columns, label_columns, *) Return a TF Dataset over this dataset. Dataset.to_dask([meta]) Convert this dataset into a Dask DataFrame. Dataset.to_mars() Convert this dataset into a MARS dataframe. Dataset.to_modin() Convert this dataset into a Modin dataframe. Dataset.to_spark(spark) Convert this dataset into a Spark dataframe. Dataset.to_pandas([limit]) Convert this dataset into a single Pandas DataFrame. Dataset.to_pandas_refs() Convert this dataset into a distributed set of Pandas dataframes. Dataset.to_numpy_refs(*[, column]) Convert this dataset into a distributed set of NumPy ndarrays. Dataset.to_arrow_refs() Convert this dataset into a distributed set of Arrow tables. Dataset.to_random_access_dataset(key[, ...]) Convert this dataset into a distributed RandomAccessDataset (EXPERIMENTAL). Inspecting Metadata Dataset.count() Count the number of records in the dataset. Dataset.columns([fetch_if_missing]) Returns the columns of this Dataset. Dataset.schema([fetch_if_missing]) Return the schema of the dataset. Dataset.num_blocks() Return the number of blocks of this dataset. Dataset.size_bytes() Return the in-memory size of the dataset. Dataset.input_files() Return the list of input files for the dataset. Dataset.stats() Returns a string containing execution timing information. Dataset.get_internal_block_refs() Get a list of references to the underlying blocks of this dataset. Execution Dataset.materialize() Execute and materialize this dataset into object store memory. ActorPoolStrategy([legacy_min_size, ...]) Specify the compute strategy for a Dataset transform. ray.data.ActorPoolStrategy class ray.data.ActorPoolStrategy(legacy_min_size: Optional[int] = None, legacy_max_size: Optional[int] = None, *, size: Optional[int] = None, min_size: Optional[int] = None, max_size: Optional[int] = None, max_tasks_in_flight_per_actor: Optional[int] = None)[source] Bases: ray.data._internal.compute.ComputeStrategy Specify the compute strategy for a Dataset transform. ActorPoolStrategy specifies that an autoscaling pool of actors should be used for a given Dataset transform. This is useful for stateful setup of callable classes. For a fixed-sized pool of size n, specify compute=ActorPoolStrategy(size=n). To autoscale from m to n actors, specify ActorPoolStrategy(min_size=m, max_size=n). To increase opportunities for pipelining task dependency prefetching with computation and avoiding actor startup delays, set max_tasks_in_flight_per_actor to 2 or greater; to try to decrease the delay due to queueing of tasks on the worker actors, set max_tasks_in_flight_per_actor to 1. PublicAPI: This API is stable across Ray releases. Methods __init__([legacy_min_size, legacy_max_size, ...]) Construct ActorPoolStrategy for a Dataset transform. 
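To make the two ActorPoolStrategy configurations described above concrete, here is a hedged sketch that maps a made-up callable class (AddOne) over a dataset with a fixed-size pool and with an autoscaling pool:

import ray

# Hypothetical stateful callable class mapped over the dataset.
class AddOne:
    def __call__(self, batch):
        batch["id"] = batch["id"] + 1
        return batch

ds = ray.data.range(1000)

# Fixed-size pool of 2 actors.
out_fixed = ds.map_batches(AddOne, compute=ray.data.ActorPoolStrategy(size=2))

# Autoscaling pool of 1 to 4 actors with up to 4 in-flight tasks per actor.
out_auto = ds.map_batches(
    AddOne,
    compute=ray.data.ActorPoolStrategy(
        min_size=1, max_size=4, max_tasks_in_flight_per_actor=4
    ),
)
print(out_fixed.take(2))
print(out_auto.take(2))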
ray.data.ActorPoolStrategy.__init__ ActorPoolStrategy.__init__(legacy_min_size: Optional[int] = None, legacy_max_size: Optional[int] = None, *, size: Optional[int] = None, min_size: Optional[int] = None, max_size: Optional[int] = None, max_tasks_in_flight_per_actor: Optional[int] = None)[source] Construct ActorPoolStrategy for a Dataset transform. Parameters size – Specify a fixed-size actor pool of this size. It is an error to specify both size and min_size or max_size. min_size – The minimum size of the actor pool. max_size – The maximum size of the actor pool. max_tasks_in_flight_per_actor – The maximum number of tasks to concurrently send to a single actor worker. Increasing this will increase opportunities for pipelining task dependency prefetching with computation and avoiding actor startup delays, but will also increase queueing delay.
Serialization Dataset.has_serializable_lineage() Whether this dataset's lineage is able to be serialized for storage and later deserialized, possibly on a different cluster. Dataset.serialize_lineage() Serialize this dataset's lineage, not the actual data or the existing data futures, to bytes that can be stored and later deserialized, possibly on a different cluster. Dataset.deserialize_lineage(serialized_ds) Deserialize the provided lineage-serialized Dataset.
Internals block.Block The central part of the internal API. block.BlockExecStats() Execution stats for this block. block.BlockMetadata(num_rows, size_bytes, ...) Metadata about the block. block.BlockAccessor() Provides accessor methods for a specific block.
ray.data.block.Block ray.data.block.Block The central part of the internal API. alias of Union[pyarrow.Table, pandas.DataFrame]
ray.data.block.BlockExecStats class ray.data.block.BlockExecStats[source] Bases: object Execution stats for this block. wall_time_s The wall-clock time it took to compute this block. cpu_time_s The CPU time it took to compute this block. node_id A unique id for the node that computed this block. DeveloperAPI: This API may change across minor Ray releases. Methods
ray.data.block.BlockMetadata class ray.data.block.BlockMetadata(num_rows: Optional[int], size_bytes: Optional[int], schema: Optional[Union[type, pyarrow.lib.Schema]], input_files: Optional[List[str]], exec_stats: Optional[ray.data.block.BlockExecStats])[source] Bases: object Metadata about the block. DeveloperAPI: This API may change across minor Ray releases. Methods Attributes num_rows The number of rows contained in this block, or None. size_bytes The approximate size in bytes of this block, or None. schema The pyarrow schema or types of the block elements, or None. input_files The list of file paths used to generate this block, or the empty list if indeterminate. exec_stats Execution stats for this block.
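As a brief illustration of the BlockMetadata fields summarized above, the developer-level sketch below builds metadata for a standalone pyarrow.Table block through its accessor (the table contents are made up; this API may change across minor releases):

import pyarrow as pa
from ray.data.block import BlockAccessor

# Build BlockMetadata for a standalone block via its accessor.
block = pa.table({"x": [1, 2, 3]})
meta = BlockAccessor.for_block(block).get_metadata(input_files=[], exec_stats=None)

print(meta.num_rows)     # 3
print(meta.size_bytes)   # approximate size in bytes
print(meta.schema)       # pyarrow schema of the block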
ray.data.block.BlockMetadata.num_rows BlockMetadata.num_rows: Optional[int] The number of rows contained in this block, or None.ray.data.block.BlockMetadata.size_bytes BlockMetadata.size_bytes: Optional[int] The approximate size in bytes of this block, or None.ray.data.block.BlockMetadata.schema BlockMetadata.schema: Optional[Union[type, pyarrow.lib.Schema]] The pyarrow schema or types of the block elements, or None.ray.data.block.BlockMetadata.input_files BlockMetadata.input_files: Optional[List[str]] The list of file paths used to generate this block, or the empty list if indeterminate.ray.data.block.BlockMetadata.exec_stats BlockMetadata.exec_stats: Optional[ray.data.block.BlockExecStats] Execution stats for this block.ray.data.block.BlockAccessor class ray.data.block.BlockAccessor[source] Bases: object Provides accessor methods for a specific block. Ideally, we wouldn’t need a separate accessor classes for blocks. However, this is needed if we want to support storing pyarrow.Table directly as a top-level Ray object, without a wrapping class (issue #17186). DeveloperAPI: This API may change across minor Ray releases. Methods __init__() aggregate_combined_blocks(blocks, key, agg) Aggregate partially combined and sorted blocks. batch_to_block(batch) Create a block from user-facing data formats. builder() Create a builder for this block type. combine(key, agg) Combine rows with the same key into an accumulator. for_block(block) Create a block accessor for the given block. get_metadata(input_files, exec_stats) Create a metadata object from this block. iter_rows(public_row_format) Iterate over the rows of this block. merge_sorted_blocks(blocks, key, descending) Return a sorted block by merging a list of sorted blocks. num_rows() Return the number of rows contained in this block. random_shuffle(random_seed) Randomly shuffle this block. sample(n_samples, key) Return a random sample of items from this block. schema() Return the Python type or pyarrow schema of this block. select(columns) Return a new block containing the provided columns. size_bytes() Return the approximate size in bytes of this block. slice(start, end, copy) Return a slice of this block. sort_and_partition(boundaries, key, descending) Return a list of sorted partitions of this block. take(indices) Return a new block containing the provided row indices. to_arrow() Convert this block into an Arrow table. to_batch_format(batch_format) Convert this block into the provided batch format. to_block() Return the base block that this accessor wraps. to_default() Return the default data format for this accessor. to_numpy([columns]) Convert this block (or columns of block) into a NumPy ndarray. to_pandas() Convert this block into a Pandas dataframe. zip(other) Zip this block with another block of the same type and size. 
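The following is an illustrative sketch of a few of the BlockAccessor methods listed above, applied to a standalone pyarrow.Table block (toy data; developer-level API that may change):

import pyarrow as pa
from ray.data.block import BlockAccessor

# Wrap a standalone pyarrow.Table block and use a few accessor methods.
acc = BlockAccessor.for_block(pa.table({"x": [0, 1, 2, 3, 4]}))

part = acc.slice(1, 4, copy=True)                    # rows 1..3 as a new block
print(BlockAccessor.for_block(part).to_pandas())     # convert to pandas.DataFrame
print(BlockAccessor.for_block(part).to_numpy("x"))   # convert one column to NumPy
print(acc.num_rows(), acc.size_bytes())              # row count and approximate size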
ray.data.block.BlockAccessor.__init__ BlockAccessor.__init__() ray.data.block.BlockAccessor.aggregate_combined_blocks static BlockAccessor.aggregate_combined_blocks(blocks: List[Union[pyarrow.Table, pandas.DataFrame]], key: Optional[str], agg: AggregateFn) -> Tuple[Union[pyarrow.Table, pandas.DataFrame], ray.data.block.BlockMetadata][source] Aggregate partially combined and sorted blocks.ray.data.block.BlockAccessor.batch_to_block static BlockAccessor.batch_to_block(batch: Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]) -> Union[pyarrow.Table, pandas.DataFrame][source] Create a block from user-facing data formats.ray.data.block.BlockAccessor.builder static BlockAccessor.builder() -> BlockBuilder[source] Create a builder for this block type.ray.data.block.BlockAccessor.combine BlockAccessor.combine(key: Optional[str], agg: AggregateFn) -> Union[pyarrow.Table, pandas.DataFrame][source] Combine rows with the same key into an accumulator.ray.data.block.BlockAccessor.for_block static BlockAccessor.for_block(block: Union[pyarrow.Table, pandas.DataFrame]) -> BlockAccessor[T][source] Create a block accessor for the given block.ray.data.block.BlockAccessor.get_metadata BlockAccessor.get_metadata(input_files: List[str], exec_stats: Optional[ray.data.block.BlockExecStats]) -> ray.data.block.BlockMetadata[source] Create a metadata object from this block.ray.data.block.BlockAccessor.iter_rows BlockAccessor.iter_rows(public_row_format: bool) -> Iterator[ray.data.block.T][source] Iterate over the rows of this block. Parameters public_row_format – Whether to cast rows into the public Dict row format (this incurs extra copy conversions).ray.data.block.BlockAccessor.merge_sorted_blocks static BlockAccessor.merge_sorted_blocks(blocks: List[Block], key: Any, descending: bool) -> Tuple[Union[pyarrow.Table, pandas.DataFrame], ray.data.block.BlockMetadata][source] Return a sorted block by merging a list of sorted blocks.ray.data.block.BlockAccessor.num_rows BlockAccessor.num_rows() -> int[source] Return the number of rows contained in this block.ray.data.block.BlockAccessor.random_shuffle BlockAccessor.random_shuffle(random_seed: Optional[int]) -> Union[pyarrow.Table, pandas.DataFrame][source] Randomly shuffle this block.ray.data.block.BlockAccessor.sample BlockAccessor.sample(n_samples: int, key: Any) -> Union[pyarrow.Table, pandas.DataFrame][source] Return a random sample of items from this block.ray.data.block.BlockAccessor.schema BlockAccessor.schema() -> Union[type, pyarrow.lib.Schema][source] Return the Python type or pyarrow schema of this block.ray.data.block.BlockAccessor.select BlockAccessor.select(columns: List[Optional[str]]) -> Union[pyarrow.Table, pandas.DataFrame][source] Return a new block containing the provided columns.ray.data.block.BlockAccessor.size_bytes BlockAccessor.size_bytes() -> int[source] Return the approximate size in bytes of this block.ray.data.block.BlockAccessor.slice BlockAccessor.slice(start: int, end: int, copy: bool) -> Union[pyarrow.Table, pandas.DataFrame][source] Return a slice of this block. Parameters start – The starting index of the slice. end – The ending index of the slice. copy – Whether to perform a data copy for the slice. 
Returns The sliced block result.ray.data.block.BlockAccessor.sort_and_partition BlockAccessor.sort_and_partition(boundaries: List[ray.data.block.T], key: Any, descending: bool) -> List[Union[pyarrow.Table, pandas.DataFrame]][source] Return a list of sorted partitions of this block.ray.data.block.BlockAccessor.take BlockAccessor.take(indices: List[int]) -> Union[pyarrow.Table, pandas.DataFrame][source] Return a new block containing the provided row indices. Parameters indices – The row indices to return. Returns A new block containing the provided row indices.ray.data.block.BlockAccessor.to_arrow BlockAccessor.to_arrow() -> pyarrow.Table[source] Convert this block into an Arrow table.ray.data.block.BlockAccessor.to_batch_format BlockAccessor.to_batch_format(batch_format: Optional[str]) -> Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]][source] Convert this block into the provided batch format. Parameters batch_format – The batch format to convert this block to. Returns This block formatted as the provided batch format.ray.data.block.BlockAccessor.to_block BlockAccessor.to_block() -> Union[pyarrow.Table, pandas.DataFrame][source] Return the base block that this accessor wraps.ray.data.block.BlockAccessor.to_default BlockAccessor.to_default() -> Union[pyarrow.Table, pandas.DataFrame][source] Return the default data format for this accessor.ray.data.block.BlockAccessor.to_numpy BlockAccessor.to_numpy(columns: Optional[Union[str, List[str]]] = None) -> Union[numpy.ndarray, Dict[str, numpy.ndarray]][source] Convert this block (or columns of block) into a NumPy ndarray. Parameters columns – Name of columns to convert, or None if converting all columns.ray.data.block.BlockAccessor.to_pandas BlockAccessor.to_pandas() -> pandas.DataFrame[source] Convert this block into a Pandas dataframe.ray.data.block.BlockAccessor.zip BlockAccessor.zip(other: Union[pyarrow.Table, pandas.DataFrame]) -> Union[pyarrow.Table, pandas.DataFrame][source] Zip this block with another block of the same type and size. DataIterator API class ray.data.DataIterator[source] An iterator for reading records from a Dataset or DatasetPipeline. For Datasets, each iteration call represents a complete read of all items in the Dataset. For DatasetPipelines, each iteration call represents one pass (epoch) over the base Dataset. Note that for DatasetPipelines, each pass iterates over the original Dataset, instead of a window (if .window() was used). If using Ray AIR, each trainer actor should get its own iterator by calling session.get_dataset_shard("train"). Examples >>> import ray >>> ds = ray.data.range(5) >>> ds Dataset(num_blocks=..., num_rows=5, schema={id: int64}) >>> ds.iterator() DataIterator(Dataset(num_blocks=..., num_rows=5, schema={id: int64})) For debugging purposes, use make_local_dataset_iterator() to create a local DataIterator from a Dataset, a Preprocessor, and a DatasetConfig. PublicAPI (beta): This API is in beta and may change before becoming stable. DataIterator.iter_batches(*[, ...]) Return a local batched iterator over the dataset. DataIterator.iter_torch_batches(*[, ...]) Return a local batched iterator of Torch Tensors over the dataset. DataIterator.to_tf(feature_columns, ...[, ...]) Return a TF Dataset over this dataset. DataIterator.stats() Returns a string containing execution timing information. 
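A minimal sketch of the DataIterator workflow described above, on a toy dataset: obtain an iterator from a Dataset and consume it in batches, then inspect its timing stats.

import ray

ds = ray.data.range(5)
it = ds.iterator()  # DataIterator over the dataset

# Each full pass over the iterator reads all rows of the dataset.
for batch in it.iter_batches(batch_size=2):
    print(batch)    # e.g. {'id': array([0, 1])} in the default batch format

print(it.stats())   # execution timing information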
ray.data.DataIterator.iter_batches DataIterator.iter_batches(*, prefetch_batches: int = 1, batch_size: int = 256, batch_format: Optional[str] = 'default', drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None, _collate_fn: Optional[Callable[[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Any]] = None, _finalize_fn: Optional[Callable[[Any], Any]] = None, prefetch_blocks: int = 0) -> Iterator[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]][source] Return a local batched iterator over the dataset. Examples >>> import ray >>> for batch in ray.data.range( ... 1000000 ... ).iterator().iter_batches(): ... print(batch) Time complexity: O(1) Parameters prefetch_batches – The number of batches to fetch ahead of the current batch to fetch. If set to greater than 0, a separate threadpool will be used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1. You can revert back to the old prefetching behavior that uses prefetch_blocks by setting use_legacy_iter_batches to True in the DataContext. batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different number of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256. batch_format – Specify "default" to use the default block format (NumPy), "pandas" to select pandas.DataFrame, "pyarrow" to select pyarrow.Table, or "numpy" to select Dict[str, numpy.ndarray], or None to return the underlying block exactly as is with no additional formatting. drop_last – Whether to drop the last batch if it’s incomplete. local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained. local_shuffle_seed – The seed to use for the local random shuffle. Returns An iterator over record batches.
ray.data.DataIterator.iter_torch_batches DataIterator.iter_torch_batches(*, prefetch_batches: int = 1, batch_size: Optional[int] = 256, dtypes: Optional[Union[torch.dtype, Dict[str, torch.dtype]]] = None, device: Optional[str] = None, collate_fn: Optional[Callable[[Union[numpy.ndarray, Dict[str, numpy.ndarray]]], Any]] = None, drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None, prefetch_blocks: int = 0) -> Iterator[TorchTensorBatchType][source] Return a local batched iterator of Torch Tensors over the dataset. This iterator will yield single-tensor batches if the underlying dataset consists of a single column; otherwise, it will yield a dictionary of column-tensors. If looking for more flexibility in the tensor conversion (e.g. casting dtypes) or the batch format, try using iter_batches directly. Examples >>> import ray >>> for batch in ray.data.range( ... 1000000 ... ).iterator().iter_torch_batches(): ... print(batch) Time complexity: O(1) Parameters prefetch_batches – The number of batches to fetch ahead of the current batch to fetch. If set to greater than 0, a separate threadpool will be used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1.
You can revert back to the old prefetching behavior that uses prefetch_blocks by setting use_legacy_iter_batches to True in the DataContext. batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different number of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256. dtypes – The Torch dtype(s) for the created tensor(s); if None, the dtype will be inferred from the tensor data. device – The device on which the tensor should be placed; if None, the Torch tensor will be constructed on the CPU. collate_fn – A function to apply to each data batch before returning it. When this parameter is specified, the user should manually handle the host to device data transfer outside of collate_fn. Potential use cases include collating along a dimension other than the first, padding sequences of various lengths, or generally handling batches of different length tensors. This API is still experimental and is subject to change. This parameter cannot be used in conjunction with dtypes or device. drop_last – Whether to drop the last batch if it’s incomplete. local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained. This buffer size must be greater than or equal to batch_size, and therefore batch_size must also be specified when using local shuffling. local_shuffle_seed – The seed to use for the local random shuffle. Returns An iterator over Torch Tensor batches.ray.data.DataIterator.to_tf DataIterator.to_tf(feature_columns: Union[str, List[str]], label_columns: Union[str, List[str]], *, prefetch_batches: int = 1, batch_size: int = 1, drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None, prefetch_blocks: int = 0) -> tf.data.Dataset[source] Return a TF Dataset over this dataset. If your dataset contains ragged tensors, this method errors. To prevent errors, resize your tensors. Examples >>> import ray >>> ds = ray.data.read_csv( ... "s3://anonymous@air-example-data/iris.csv" ... ) >>> it = ds.iterator(); it DataIterator(Dataset( num_blocks=..., num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64 } )) If your model accepts a single tensor as input, specify a single feature column. >>> it.to_tf(feature_columns="sepal length (cm)", label_columns="target") <_OptionsDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.float64, name='sepal length (cm)'), TensorSpec(shape=(None,), dtype=tf.int64, name='target'))> If your model accepts a dictionary as input, specify a list of feature columns. >>> it.to_tf(["sepal length (cm)", "sepal width (cm)"], "target") <_OptionsDataset element_spec=({'sepal length (cm)': TensorSpec(shape=(None,), dtype=tf.float64, name='sepal length (cm)'), 'sepal width (cm)': TensorSpec(shape=(None,), dtype=tf.float64, name='sepal width (cm)')}, TensorSpec(shape=(None,), dtype=tf.int64, name='target'))> If your dataset contains multiple features but your model accepts a single tensor as input, combine features with Concatenator. 
>>> from ray.data.preprocessors import Concatenator >>> preprocessor = Concatenator(output_column_name="features", exclude="target") >>> it = preprocessor.transform(ds).iterator() >>> it DataIterator(Concatenator +- Dataset( num_blocks=..., num_rows=150, schema={ sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64 } )) >>> it.to_tf("features", "target") <_OptionsDataset element_spec=(TensorSpec(shape=(None, 4), dtype=tf.float64, name='features'), TensorSpec(shape=(None,), dtype=tf.int64, name='target'))> Parameters feature_columns – Columns that correspond to model inputs. If this is a string, the input data is a tensor. If this is a list, the input data is a dict that maps column names to their tensor representation. label_column – Columns that correspond to model targets. If this is a string, the target data is a tensor. If this is a list, the target data is a dict that maps column names to their tensor representation. prefetch_batches – The number of batches to fetch ahead of the current batch to fetch. If set to greater than 0, a separate threadpool will be used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1. You can revert back to the old prefetching behavior that uses prefetch_blocks by setting use_legacy_iter_batches to True in the DataContext. batch_size – Record batch size. Defaults to 1. drop_last – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. Defaults to False. local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained. This buffer size must be greater than or equal to batch_size, and therefore batch_size must also be specified when using local shuffling. local_shuffle_seed – The seed to use for the local random shuffle. Returns A tf.data.Dataset that yields inputs and targets.ray.data.DataIterator.stats abstract DataIterator.stats() -> str[source] Returns a string containing execution timing information. ExecutionOptions API Constructor ExecutionOptions(resource_limits, ...) Common options for execution. ray.data.ExecutionOptions class ray.data.ExecutionOptions(resource_limits: ray.data._internal.execution.interfaces.ExecutionResources = , locality_with_output: Union[bool, List[str]] = False, preserve_order: bool = False, actor_locality_enabled: bool = True, verbose_progress: bool = False)[source] Bases: object Common options for execution. Some options may not be supported on all executors (e.g., resource limits). resource_limits Set a soft limit on the resource usage during execution. This is not supported in bulk execution mode. Autodetected by default. Type ray.data._internal.execution.interfaces.ExecutionResources locality_with_output Set this to prefer running tasks on the same node as the output node (node driving the execution). It can also be set to a list of node ids to spread the outputs across those nodes. Off by default. Type Union[bool, List[str]] preserve_order Set this to preserve the ordering between blocks processed by operators under the streaming executor. 
The bulk executor always preserves order. Off by default. Type bool actor_locality_enabled Whether to enable locality-aware task dispatch to actors (on by default). This applies to both ActorPoolStrategy map and streaming_split operations. Type bool verbose_progress Whether to report progress individually per operator. By default, only AllToAll operators and global progress is reported. This option is useful for performance debugging. Off by default. Type bool DeveloperAPI: This API may change across minor Ray releases. Resource Options ExecutionResources([cpu, gpu, ...]) Specifies resources usage or resource limits for execution. ray.data.ExecutionResources class ray.data.ExecutionResources(cpu: Optional[float] = None, gpu: Optional[float] = None, object_store_memory: Optional[int] = None)[source] Bases: object Specifies resources usage or resource limits for execution. The value None represents unknown resource usage or an unspecified limit. object_store_memory_str() -> str[source] Returns a human-readable string for the object store memory field. add(other: ray.data._internal.execution.interfaces.ExecutionResources) -> ray.data._internal.execution.interfaces.ExecutionResources[source] Adds execution resources. Returns A new ExecutionResource object with summed resources. satisfies_limit(limit: ray.data._internal.execution.interfaces.ExecutionResources) -> bool[source] Return if this resource struct meets the specified limits. Note that None for a field means no limit. scale(f: float) -> ray.data._internal.execution.interfaces.ExecutionResources[source] Return copy with all set values scaled by f. GroupedData API GroupedData objects are returned by groupby call: Dataset.groupby(). Constructor grouped_data.GroupedData(dataset, key) Represents a grouped dataset created by calling Dataset.groupby(). ray.data.grouped_data.GroupedData class ray.data.grouped_data.GroupedData(dataset: ray.data.dataset.Dataset, key: str)[source] Bases: object Represents a grouped dataset created by calling Dataset.groupby(). The actual groupby is deferred until an aggregation is applied. PublicAPI: This API is stable across Ray releases. Methods __init__(dataset, key) Construct a dataset grouped by key (internal API). aggregate(*aggs) Implements an accumulator-based aggregation. count() Compute count aggregation. map_groups(fn, *[, compute, batch_format]) Apply the given function to each group of records of this dataset. max([on, ignore_nulls]) Compute grouped max aggregation. mean([on, ignore_nulls]) Compute grouped mean aggregation. min([on, ignore_nulls]) Compute grouped min aggregation. std([on, ddof, ignore_nulls]) Compute grouped standard deviation aggregation. sum([on, ignore_nulls]) Compute grouped sum aggregation. ray.data.grouped_data.GroupedData.__init__ GroupedData.__init__(dataset: ray.data.dataset.Dataset, key: str)[source] Construct a dataset grouped by key (internal API). The constructor is not part of the GroupedData API. Use the Dataset.groupby() method to construct one.ray.data.grouped_data.GroupedData.aggregate GroupedData.aggregate(*aggs: ray.data.aggregate._aggregate.AggregateFn) -> ray.data.dataset.Dataset[source] Implements an accumulator-based aggregation. Parameters aggs – Aggregations to do. Returns The output is an dataset of n + 1 columns where the first column is the groupby key and the second through n + 1 columns are the results of the aggregations. 
If groupby key is None then the key part of return is omitted.ray.data.grouped_data.GroupedData.count GroupedData.count() -> ray.data.dataset.Dataset[source] Compute count aggregation. Examples >>> import ray >>> ray.data.from_items([ ... {"A": x % 3, "B": x} for x in range(100)]).groupby( ... "A").count() Returns A dataset of [k, v] columns where k is the groupby key and v is the number of rows with that key. If groupby key is None then the key part of return is omitted.ray.data.grouped_data.GroupedData.map_groups GroupedData.map_groups(fn: Union[Callable[[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Callable[[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Iterator[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]]], _CallableClassProtocol], *, compute: Union[str, ray.data._internal.compute.ComputeStrategy] = None, batch_format: Optional[str] = 'default', **ray_remote_args) -> Dataset[source] Apply the given function to each group of records of this dataset. While map_groups() is very flexible, note that it comes with downsides: It may be slower than using more specific methods such as min(), max(). It requires that each group fits in memory on a single node. In general, prefer to use aggregate() instead of map_groups(). Examples >>> # Return a single record per group (list of multiple records in, >>> # list of a single record out). >>> import ray >>> import pandas as pd >>> import numpy as np >>> # Get first value per group. >>> ds = ray.data.from_items([ ... {"group": 1, "value": 1}, ... {"group": 1, "value": 2}, ... {"group": 2, "value": 3}, ... {"group": 2, "value": 4}]) >>> ds.groupby("group").map_groups( ... lambda g: {"result": np.array([g["value"][0]])}) >>> # Return multiple records per group (dataframe in, dataframe out). >>> df = pd.DataFrame( ... {"A": ["a", "a", "b"], "B": [1, 1, 3], "C": [4, 6, 5]} ... ) >>> ds = ray.data.from_pandas(df) >>> grouped = ds.groupby("A") >>> grouped.map_groups( ... lambda g: g.apply( ... lambda c: c / g[c.name].sum() if c.name in ["B", "C"] else c ... ) ... ) Parameters fn – The function to apply to each group of records, or a class type that can be instantiated to create such a callable. It takes as input a batch of all records from a single group, and returns a batch of zero or more records, similar to map_batches(). compute – The compute strategy, either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool. batch_format – Specify "default" to use the default block format (NumPy), "pandas" to select pandas.DataFrame, “pyarrow” to select pyarrow.Table, or "numpy" to select Dict[str, numpy.ndarray], or None to return the underlying block exactly as is with no additional formatting. ray_remote_args – Additional resource requirements to request from ray (e.g., num_gpus=1 to request GPUs for the map tasks). Returns The return type is determined by the return type of fn, and the return value is combined from results of all groups.ray.data.grouped_data.GroupedData.max GroupedData.max(on: Optional[Union[str, List[str]]] = None, ignore_nulls: bool = True) -> ray.data.dataset.Dataset[source] Compute grouped max aggregation. Examples >>> import ray >>> ray.data.le(100).groupby("value").max() >>> ray.data.from_items([ ... {"A": i % 3, "B": i, "C": i**2} ... for i in range(100)]) \ ... 
.groupby("A") \ ... .max(["B", "C"]) Parameters on – a column name or a list of column names to aggregate. ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the max; if False, if a null value is encountered, the output will be null. We consider np.nan, None, and pd.NaT to be null values. Default is True. Returns The max result.For different values of on, the return varies:on=None: a dataset containing a groupby key column, "k", and a column-wise max column for each original column in the dataset. on=["col_1", ..., "col_n"]: a dataset of n + 1 columns where the first column is the groupby key and the second through n + 1 columns are the results of the aggregations.If groupby key is None then the key part of return is omitted.ray.data.grouped_data.GroupedData.mean GroupedData.mean(on: Optional[Union[str, List[str]]] = None, ignore_nulls: bool = True) -> ray.data.dataset.Dataset[source] Compute grouped mean aggregation. Examples >>> import ray >>> ray.data.le(100).groupby("value").mean() >>> ray.data.from_items([ ... {"A": i % 3, "B": i, "C": i**2} ... for i in range(100)]) \ ... .groupby("A") \ ... .mean(["B", "C"]) Parameters on – a column name or a list of column names to aggregate. ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the mean; if False, if a null value is encountered, the output will be null. We consider np.nan, None, and pd.NaT to be null values. Default is True. Returns The mean result.For different values of on, the return varies:on=None: a dataset containing a groupby key column, "k", and a column-wise mean column for each original column in the dataset. on=["col_1", ..., "col_n"]: a dataset of n + 1 columns where the first column is the groupby key and the second through n + 1 columns are the results of the aggregations.If groupby key is None then the key part of return is omitted.ray.data.grouped_data.GroupedData.min GroupedData.min(on: Optional[Union[str, List[str]]] = None, ignore_nulls: bool = True) -> ray.data.dataset.Dataset[source] Compute grouped min aggregation. Examples >>> import ray >>> ray.data.le(100).groupby("value").min() >>> ray.data.from_items([ ... {"A": i % 3, "B": i, "C": i**2} ... for i in range(100)]) \ ... .groupby("A") \ ... .min(["B", "C"]) Parameters on – a column name or a list of column names to aggregate. ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the min; if False, if a null value is encountered, the output will be null. We consider np.nan, None, and pd.NaT to be null values. Default is True. Returns The min result.For different values of on, the return varies:on=None: a dataset containing a groupby key column, "k", and a column-wise min column for each original column in the dataset. on=["col_1", ..., "col_n"]: a dataset of n + 1 columns where the first column is the groupby key and the second through n + 1 columns are the results of the aggregations.If groupby key is None then the key part of return is omitted.ray.data.grouped_data.GroupedData.std GroupedData.std(on: Optional[Union[str, List[str]]] = None, ddof: int = 1, ignore_nulls: bool = True) -> ray.data.dataset.Dataset[source] Compute grouped standard deviation aggregation. Examples >>> import ray >>> ray.data.range(100).groupby("id").std(ddof=0) >>> ray.data.from_items([ ... {"A": i % 3, "B": i, "C": i**2} ... for i in range(100)]) \ ... .groupby("A") \ ... 
.std(["B", "C"]) NOTE: This uses Welford’s online method for an accumulator-style computation of the standard deviation. This method was chosen due to it’s numerical stability, and it being computable in a single pass. This may give different (but more accurate) results than NumPy, Pandas, and sklearn, which use a less numerically stable two-pass algorithm. See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford’s_online_algorithm Parameters on – a column name or a list of column names to aggregate. ddof – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the std; if False, if a null value is encountered, the output will be null. We consider np.nan, None, and pd.NaT to be null values. Default is True. Returns The standard deviation result.For different values of on, the return varies:on=None: a dataset containing a groupby key column, "k", and a column-wise std column for each original column in the dataset. on=["col_1", ..., "col_n"]: a dataset of n + 1 columns where the first column is the groupby key and the second through n + 1 columns are the results of the aggregations.If groupby key is None then the key part of return is omitted.ray.data.grouped_data.GroupedData.sum GroupedData.sum(on: Optional[Union[str, List[str]]] = None, ignore_nulls: bool = True) -> ray.data.dataset.Dataset[source] Compute grouped sum aggregation. Examples >>> import ray >>> ray.data.from_items([ ... (i % 3, i, i**2) ... for i in range(100)]) \ ... .groupby(lambda x: x[0] % 3) \ ... .sum(lambda x: x[2]) >>> ray.data.range(100).groupby("id").sum() >>> ray.data.from_items([ ... {"A": i % 3, "B": i, "C": i**2} ... for i in range(100)]) \ ... .groupby("A") \ ... .sum(["B", "C"]) Parameters on – a column name or a list of column names to aggregate. ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the sum; if False, if a null value is encountered, the output will be null. We consider np.nan, None, and pd.NaT to be null values. Default is True. Returns The sum result.For different values of on, the return varies:on=None: a dataset containing a groupby key column, "k", and a column-wise sum column for each original column in the dataset. on=["col_1", ..., "col_n"]: a dataset of n + 1 columns where the first column is the groupby key and the second through n + 1 columns are the results of the aggregations.If groupby key is None then the key part of return is omitted. Computations / Descriptive Stats grouped_data.GroupedData.count() Compute count aggregation. grouped_data.GroupedData.sum([on, ignore_nulls]) Compute grouped sum aggregation. grouped_data.GroupedData.min([on, ignore_nulls]) Compute grouped min aggregation. grouped_data.GroupedData.max([on, ignore_nulls]) Compute grouped max aggregation. grouped_data.GroupedData.mean([on, ignore_nulls]) Compute grouped mean aggregation. grouped_data.GroupedData.std([on, ddof, ...]) Compute grouped standard deviation aggregation. Function Application grouped_data.GroupedData.aggregate(*aggs) Implements an accumulator-based aggregation. grouped_data.GroupedData.map_groups(fn, *[, ...]) Apply the given function to each group of records of this dataset. Aggregate Function aggregate.AggregateFn(init, ...) PublicAPI: This API is stable across Ray releases. aggregate.Count() Defines count aggregation. 
aggregate.Sum([on, ignore_nulls, alias_name]) Defines sum aggregation. aggregate.Max([on, ignore_nulls, alias_name]) Defines max aggregation. aggregate.Mean([on, ignore_nulls, alias_name]) Defines mean aggregation. aggregate.Std([on, ddof, ignore_nulls, ...]) Defines standard deviation aggregation. aggregate.AbsMax([on, ignore_nulls, alias_name]) Defines absolute max aggregation. ray.data.aggregate.AggregateFn class ray.data.aggregate.AggregateFn(init: Callable[[ray.data.block.KeyType], ray.data.block.AggType], merge: Callable[[ray.data.block.AggType, ray.data.block.AggType], ray.data.block.AggType], accumulate_row: Callable[[ray.data.block.AggType, ray.data.block.T], ray.data.block.AggType] = None, accumulate_block: Callable[[ray.data.block.AggType, Union[pyarrow.Table, pandas.DataFrame]], ray.data.block.AggType] = None, finalize: Callable[[ray.data.block.AggType], ray.data.block.U] = >, name: Optional[str] = None)[source] Bases: object PublicAPI: This API is stable across Ray releases. Methods __init__(init, merge[, accumulate_row, ...]) Defines an aggregate function in the accumulator style. ray.data.aggregate.AggregateFn.__init__ AggregateFn.__init__(init: Callable[[ray.data.block.KeyType], ray.data.block.AggType], merge: Callable[[ray.data.block.AggType, ray.data.block.AggType], ray.data.block.AggType], accumulate_row: Callable[[ray.data.block.AggType, ray.data.block.T], ray.data.block.AggType] = None, accumulate_block: Callable[[ray.data.block.AggType, Union[pyarrow.Table, pandas.DataFrame]], ray.data.block.AggType] = None, finalize: Callable[[ray.data.block.AggType], ray.data.block.U] = >, name: Optional[str] = None)[source] Defines an aggregate function in the accumulator style. Aggregates a collection of inputs of type T into a single output value of type U. See https://www.sigops.org/s/conferences/sosp/2009/papers/yu-sosp09.pdf for more details about accumulator-based aggregation. Parameters init – This is called once for each group to return the empty accumulator. For example, an empty accumulator for a sum would be 0. merge – This may be called multiple times, each time to merge two accumulators into one. accumulate_row – This is called once per row of the same group. This combines the accumulator and the row, returns the updated accumulator. Exactly one of accumulate_row and accumulate_block must be provided. accumulate_block – This is used to calculate the aggregation for a single block, and is vectorized alternative to accumulate_row. This will be given a base accumulator and the entire block, allowing for vectorized accumulation of the block. Exactly one of accumulate_row and accumulate_block must be provided. finalize – This is called once to compute the final aggregation result from the fully merged accumulator. name – The name of the aggregation. This will be used as the output column name in the case of Arrow dataset.ray.data.aggregate.Count class ray.data.aggregate.Count[source] Bases: ray.data.aggregate._aggregate.AggregateFn Defines count aggregation. PublicAPI: This API is stable across Ray releases. Methods ray.data.aggregate.Sum class ray.data.aggregate.Sum(on: Optional[str] = None, ignore_nulls: bool = True, alias_name: Optional[str] = None)[source] Bases: ray.data.aggregate._aggregate._AggregateOnKeyBase Defines sum aggregation. PublicAPI: This API is stable across Ray releases. 
Methods ray.data.aggregate.Max class ray.data.aggregate.Max(on: Optional[str] = None, ignore_nulls: bool = True, alias_name: Optional[str] = None)[source] Bases: ray.data.aggregate._aggregate._AggregateOnKeyBase Defines max aggregation. PublicAPI: This API is stable across Ray releases. Methods ray.data.aggregate.Mean class ray.data.aggregate.Mean(on: Optional[str] = None, ignore_nulls: bool = True, alias_name: Optional[str] = None)[source] Bases: ray.data.aggregate._aggregate._AggregateOnKeyBase Defines mean aggregation. PublicAPI: This API is stable across Ray releases. Methods ray.data.aggregate.Std class ray.data.aggregate.Std(on: Optional[str] = None, ddof: int = 1, ignore_nulls: bool = True, alias_name: Optional[str] = None)[source] Bases: ray.data.aggregate._aggregate._AggregateOnKeyBase Defines standard deviation aggregation. Uses Welford’s online method for an accumulator-style computation of the standard deviation. This method was chosen due to its numerical stability, and it being computable in a single pass. This may give different (but more accurate) results than NumPy, Pandas, and sklearn, which use a less numerically stable two-pass algorithm. See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford’s_online_algorithm PublicAPI: This API is stable across Ray releases. Methods ray.data.aggregate.AbsMax class ray.data.aggregate.AbsMax(on: Optional[str] = None, ignore_nulls: bool = True, alias_name: Optional[str] = None)[source] Bases: ray.data.aggregate._aggregate._AggregateOnKeyBase Defines absolute max aggregation. PublicAPI: This API is stable across Ray releases. Methods DataContext API Constructor DataContext(block_splitting_enabled, ...) Singleton for shared Dataset resources and configurations. ray.data.DataContext class ray.data.DataContext(block_splitting_enabled: bool, target_max_block_size: int, target_min_block_size: int, streaming_read_buffer_size: int, enable_pandas_block: bool, optimize_fuse_stages: bool, optimize_fuse_read_stages: bool, optimize_fuse_shuffle_stages: bool, optimize_reorder_stages: bool, actor_prefetcher_enabled: bool, use_push_based_shuffle: bool, pipeline_push_based_shuffle_reduce_tasks: bool, scheduling_strategy: Union[None, str, ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy, ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy], scheduling_strategy_large_args: Union[None, str, ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy, ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy], large_args_threshold: int, use_polars: bool, new_execution_backend: bool, use_streaming_executor: bool, eager_free: bool, decoding_size_estimation: bool, min_parallelism: bool, enable_tensor_extension_casting: bool, enable_auto_log_stats: bool, trace_allocations: bool, optimizer_enabled: bool, execution_options: ExecutionOptions, use_ray_tqdm: bool, use_legacy_iter_batches: bool, enable_progress_bars: bool)[source] Bases: object Singleton for shared Dataset resources and configurations. This object is automatically propagated to workers and can be retrieved from the driver and remote workers via DataContext.get_current(). DeveloperAPI: This API may change across minor Ray releases. Methods __init__(block_splitting_enabled, ...) Private constructor (use get_current() instead). get_current() Get or create a singleton context. 
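As noted above, DataContext is a singleton retrieved with get_current(). The sketch below tweaks two of its fields purely as an example; the field names are taken from the constructor signature above, and this is a developer-level API that may change:

import ray
from ray.data import DataContext

# Get the current context; it is created with defaults if it does not exist yet.
ctx = DataContext.get_current()

# Example tweaks; both fields appear in the constructor signature above.
ctx.use_push_based_shuffle = True
ctx.enable_progress_bars = False

# Subsequent Dataset operations in this process pick up the modified context.
ds = ray.data.range(1000).random_shuffle()
print(ds.materialize().count())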
ray.data.DataContext.__init__ DataContext.__init__(block_splitting_enabled: bool, target_max_block_size: int, target_min_block_size: int, streaming_read_buffer_size: int, enable_pandas_block: bool, optimize_fuse_stages: bool, optimize_fuse_read_stages: bool, optimize_fuse_shuffle_stages: bool, optimize_reorder_stages: bool, actor_prefetcher_enabled: bool, use_push_based_shuffle: bool, pipeline_push_based_shuffle_reduce_tasks: bool, scheduling_strategy: Union[None, str, ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy, ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy], scheduling_strategy_large_args: Union[None, str, ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy, ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy], large_args_threshold: int, use_polars: bool, new_execution_backend: bool, use_streaming_executor: bool, eager_free: bool, decoding_size_estimation: bool, min_parallelism: bool, enable_tensor_extension_casting: bool, enable_auto_log_stats: bool, trace_allocations: bool, optimizer_enabled: bool, execution_options: ExecutionOptions, use_ray_tqdm: bool, use_legacy_iter_batches: bool, enable_progress_bars: bool)[source] Private constructor (use get_current() instead).ray.data.DataContext.get_current static DataContext.get_current() -> ray.data.context.DataContext[source] Get or create a singleton context. If the context has not yet been created in this process, it will be initialized with default settings. Get DataContext DataContext.get_current() Get or create a singleton context. RandomAccessDataset (experimental) RandomAccessDataset objects are returned by call: Dataset.to_random_access_dataset(). Constructor random_access_dataset.RandomAccessDataset(ds, ...) A class that provides distributed, random access to a Dataset. ray.data.random_access_dataset.RandomAccessDataset class ray.data.random_access_dataset.RandomAccessDataset(ds: Dataset, key: str, num_workers: int)[source] Bases: object A class that provides distributed, random access to a Dataset. See: Dataset.to_random_access_dataset(). PublicAPI (alpha): This API is in alpha and may change before becoming stable. Methods __init__(ds, key, num_workers) Construct a RandomAccessDataset (internal API). get_async(key) Asynchronously finds the record for a single key. multiget(keys) Synchronously find the records for a list of keys. stats() Returns a string containing access timing information. ray.data.random_access_dataset.RandomAccessDataset.__init__ RandomAccessDataset.__init__(ds: Dataset, key: str, num_workers: int)[source] Construct a RandomAccessDataset (internal API). The constructor is a private API. Use ds.to_random_access_dataset() to construct a RandomAccessDataset.ray.data.random_access_dataset.RandomAccessDataset.get_async RandomAccessDataset.get_async(key: Any) -> ray.types.ObjectRef[Any][source] Asynchronously finds the record for a single key. Parameters key – The key of the record to find. Returns ObjectRef containing the record (in pydict form), or None if not found.ray.data.random_access_dataset.RandomAccessDataset.multiget RandomAccessDataset.multiget(keys: List[Any]) -> List[Optional[Any]][source] Synchronously find the records for a list of keys. Parameters keys – List of keys to find the records for. Returns List of found records (in pydict form), or None for missing records.ray.data.random_access_dataset.RandomAccessDataset.stats RandomAccessDataset.stats() -> str[source] Returns a string containing access timing information. 
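A short, hedged sketch of the RandomAccessDataset flow described above (experimental alpha API; the toy dataset is keyed on its "id" column):

import ray

ds = ray.data.range(100)  # rows: {"id": 0..99}

# Build a distributed, random-access view keyed on the "id" column.
rad = ds.to_random_access_dataset(key="id")

print(ray.get(rad.get_async(42)))   # single-key lookup; returns the record or None
print(rad.multiget([1, 2, 3]))      # batched lookup for several keys
print(rad.stats())                  # access timing information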
Functions random_access_dataset.RandomAccessDataset.get_async(key) Asynchronously finds the record for a single key. random_access_dataset.RandomAccessDataset.multiget(keys) Synchronously find the records for a list of keys. random_access_dataset.RandomAccessDataset.stats() Returns a string containing access timing information.
Utility set_progress_bars(enabled) Set whether progress bars are enabled.
ray.data.set_progress_bars ray.data.set_progress_bars(enabled: bool) -> bool[source] Set whether progress bars are enabled. The default behavior is controlled by the RAY_DATA_DISABLE_PROGRESS_BARS environment variable. By default, it is set to "0". Setting it to "1" will disable progress bars, unless they are reenabled by this method. Returns Whether progress bars were previously enabled. PublicAPI: This API is stable across Ray releases.
API Guide for Users from Other Data Libraries
Ray Data is a data loading and preprocessing library for ML. It shares certain similarities with other ETL data processing libraries, but also has its own focus. This API guide provides API mappings for users who come from those data libraries, so you can quickly map what you may already know to Ray Data APIs. It maps APIs that perform comparable but not necessarily identical operations; check the API reference for exact semantics and usage. The list may not be exhaustive: Ray Data is not a traditional ETL data processing library, so not every data processing API maps to Datasets. In addition, the guide focuses on common APIs and on APIs whose Ray Data counterpart is less obvious.
For Pandas Users
Pandas DataFrame vs. Ray Data APIs

Pandas DataFrame API        Ray Data API
df.head()                   ds.show(), ds.take(), or ds.take_batch()
df.dtypes                   ds.schema()
len(df) or df.shape[0]      ds.count()
df.truncate()               ds.limit()
df.iterrows()               ds.iter_rows()
df.drop()                   ds.drop_columns()
df.transform()              ds.map_batches() or ds.map()
df.groupby()                ds.groupby()
df.groupby().apply()        ds.groupby().map_groups()
df.sample()                 ds.random_sample()
df.sort_values()            ds.sort()
df.append()                 ds.union()
df.aggregate()              ds.aggregate()
df.min()                    ds.min()
df.max()                    ds.max()
df.sum()                    ds.sum()
df.mean()                   ds.mean()
df.std()                    ds.std()

For PyArrow Users
PyArrow Table vs. Ray Data APIs

PyArrow Table API           Ray Data API
pa.Table.schema             ds.schema()
pa.Table.num_rows           ds.count()
pa.Table.filter()           ds.filter()
pa.Table.drop()             ds.drop_columns()
pa.Table.add_column()       ds.add_column()
pa.Table.groupby()          ds.groupby()
pa.Table.sort_by()          ds.sort()

For PyTorch Dataset & DataLoader Users
For more details, see the Migrating from PyTorch to Ray Data guide.
Ray Train: Scalable Model Training
Train is currently in beta. Fill out this short form to get involved with Train development!
Ray Train scales model training for popular ML frameworks such as Torch, XGBoost, TensorFlow, and more. It seamlessly integrates with other Ray libraries such as Tune and Predictors (architecture diagram: https://docs.google.com/drawings/d/1FezcdrXJuxLZzo6Rjz1CHyJzseH8nPFZp6IUepdn3N4/edit).
Intro to Ray Train
Framework support: Train abstracts away the complexity of scaling up training for common machine learning frameworks such as XGBoost, Pytorch, and Tensorflow.
There are three broad categories of Trainers that Train offers: Deep Learning Trainers (Pytorch, Tensorflow, Horovod) Tree-based Trainers (XGboost, LightGBM) Other ML frameworks (HuggingFace, Scikit-Learn, RLlib) Built for ML practitioners: Train supports standard ML tools and features that practitioners love: Callbacks for early stopping Checkpointing Integration with TensorBoard, Weights/Biases, and MLflow Jupyter notebooks Batteries included: Train is part of Ray AIR and seamlessly operates in the Ray ecosystem. Use Ray Data with Train to load and process datasets both small and large. Use Ray Tune with Train to sweep parameter grids and leverage cutting edge hyperparameter search algorithms. Leverage the Ray cluster launcher to launch autoscaling or spot instance clusters on any cloud. Quick Start to Distributed Training with Ray Train XGBoost import ray from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig # Load data. dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") # Split data into train and validation. train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) trainer = XGBoostTrainer( scaling_config=ScalingConfig( # Number of workers to use for data parallelism. num_workers=2, # Whether to use GPU acceleration. use_gpu=False, ), label_column="target", num_boost_round=20, params={ # XGBoost specific params "objective": "binary:logistic", # "tree_method": "gpu_hist", # uncomment this to use GPU for training "eval_metric": ["logloss", "error"], }, datasets={"train": train_dataset, "valid": valid_dataset}, ) result = trainer.fit() print(result.metrics) LightGBM import ray from ray.train.lightgbm import LightGBMTrainer from ray.air.config import ScalingConfig # Load data. dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") # Split data into train and validation. train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) trainer = LightGBMTrainer( scaling_config=ScalingConfig( # Number of workers to use for data parallelism. num_workers=2, # Whether to use GPU acceleration. use_gpu=False, ), label_column="target", num_boost_round=20, params={ # LightGBM specific params "objective": "binary", "metric": ["binary_logloss", "binary_error"], }, datasets={"train": train_dataset, "valid": valid_dataset}, ) result = trainer.fit() print(result.metrics) Pytorch import torch import torch.nn as nn import ray from ray import train from ray.air import session, Checkpoint from ray.train.torch import TorchTrainer from ray.air.config import ScalingConfig # If using GPUs, set this to True. 
use_gpu = False input_size = 1 layer_size = 15 output_size = 1 num_epochs = 3 class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.layer1 = nn.Linear(input_size, layer_size) self.relu = nn.ReLU() self.layer2 = nn.Linear(layer_size, output_size) def forward(self, input): return self.layer2(self.relu(self.layer1(input))) def train_loop_per_worker(): dataset_shard = session.get_dataset_shard("train") model = NeuralNetwork() loss_fn = nn.MSELoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.1) model = train.torch.prepare_model(model) for epoch in range(num_epochs): for batches in dataset_shard.iter_torch_batches( batch_size=32, dtypes=torch.float ): inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"] output = model(inputs) loss = loss_fn(output, labels) optimizer.zero_grad() loss.backward() optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") session.report( {}, checkpoint=Checkpoint.from_dict( dict(epoch=epoch, model=model.state_dict()) ), ) train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)]) scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu) trainer = TorchTrainer( train_loop_per_worker=train_loop_per_worker, scaling_config=scaling_config, datasets={"train": train_dataset}, ) result = trainer.fit() Tensorflow import ray import tensorflow as tf from ray.air import session from ray.air.integrations.keras import ReportCheckpointCallback from ray.train.tensorflow import TensorflowTrainer from ray.air.config import ScalingConfig # If using GPUs, set this to True. use_gpu = False a = 5 b = 10 size = 100 def build_model() -> tf.keras.Model: model = tf.keras.Sequential( [ tf.keras.layers.InputLayer(input_shape=()), # Add feature dimension, expanding (batch_size,) to (batch_size, 1). tf.keras.layers.Flatten(), tf.keras.layers.Dense(10), tf.keras.layers.Dense(1), ] ) return model def train_func(config: dict): batch_size = config.get("batch_size", 64) epochs = config.get("epochs", 3) strategy = tf.distribute.MultiWorkerMirroredStrategy() with strategy.scope(): # Model building/compiling need to be within `strategy.scope()`. multi_worker_model = build_model() multi_worker_model.compile( optimizer=tf.keras.optimizers.SGD(learning_rate=config.get("lr", 1e-3)), loss=tf.keras.losses.mean_squared_error, metrics=[tf.keras.metrics.mean_squared_error], ) dataset = session.get_dataset_shard("train") results = [] for _ in range(epochs): tf_dataset = dataset.to_tf( feature_columns="x", label_columns="y", batch_size=batch_size ) history = multi_worker_model.fit( tf_dataset, callbacks=[ReportCheckpointCallback()] ) results.append(history.history) return results config = {"lr": 1e-3, "batch_size": 32, "epochs": 4} train_dataset = ray.data.from_items( [{"x": x / 200, "y": 2 * x / 200} for x in range(200)] ) scaling_config = ScalingConfig(num_workers=2, use_gpu=use_gpu) trainer = TensorflowTrainer( train_loop_per_worker=train_func, train_loop_config=config, scaling_config=scaling_config, datasets={"train": train_dataset}, ) result = trainer.fit() print(result.metrics) Horovod import ray import ray.train as train import ray.train.torch # Need this to use `train.torch.get_device()` import horovod.torch as hvd import torch import torch.nn as nn from ray.air import session, Checkpoint from ray.train.horovod import HorovodTrainer from ray.air.config import ScalingConfig # If using GPUs, set this to True. 
use_gpu = False input_size = 1 layer_size = 15 output_size = 1 num_epochs = 3 class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.layer1 = nn.Linear(input_size, layer_size) self.relu = nn.ReLU() self.layer2 = nn.Linear(layer_size, output_size) def forward(self, input): return self.layer2(self.relu(self.layer1(input))) def train_loop_per_worker(): hvd.init() dataset_shard = session.get_dataset_shard("train") model = NeuralNetwork() device = train.torch.get_device() model.to(device) loss_fn = nn.MSELoss() lr_scaler = 1 optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * lr_scaler) # Horovod: wrap optimizer with DistributedOptimizer. optimizer = hvd.DistributedOptimizer( optimizer, named_parameters=model.named_parameters(), op=hvd.Average, ) for epoch in range(num_epochs): model.train() for batch in dataset_shard.iter_torch_batches( batch_size=32, dtypes=torch.float ): inputs, labels = torch.unsqueeze(batch["x"], 1), batch["y"] outputs = model(inputs) loss = loss_fn(outputs, labels) optimizer.zero_grad() loss.backward() optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") session.report( {}, checkpoint=Checkpoint.from_dict(dict(model=model.state_dict())), ) train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)]) scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu) trainer = HorovodTrainer( train_loop_per_worker=train_loop_per_worker, scaling_config=scaling_config, datasets={"train": train_dataset}, ) result = trainer.fit() Training Framework Catalog Here is a catalog of the framework-specific Trainer, Checkpoint, and Predictor classes that ship out of the box with Train: Trainer Class Checkpoint Class Predictor Class TorchTrainer TorchCheckpoint TorchPredictor TensorflowTrainer TensorflowCheckpoint TensorflowPredictor HorovodTrainer (Torch/TF Checkpoint) (Torch/TF Predictor) XGBoostTrainer XGBoostCheckpoint XGBoostPredictor LightGBMTrainer LightGBMCheckpoint LightGBMPredictor SklearnTrainer SklearnCheckpoint SklearnPredictor TransformersTrainer TransformersCheckpoint TransformersPredictor LightningTrainer LightningCheckpoint LightningPredictor RLTrainer RLCheckpoint RLPredictor Next steps Getting Started Key Concepts for Ray Train User Guide for Deep Learning Trainers User Guide for Tree-Based Trainers Getting Started with Distributed Model Training in Ray Train Ray Train offers multiple Trainers which implement scalable model training for different machine learning frameworks. Here are examples for some of the commonly used trainers: XGBoost In this example we will train a model using distributed XGBoost. First, we load the dataset from S3 using Ray Data and split it into a train and validation dataset. import ray # Load data. dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") # Split data into train and validation. train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) In the ScalingConfig, we configure the number of workers to use: from ray.air.config import ScalingConfig scaling_config = ScalingConfig( # Number of workers to use for data parallelism. num_workers=2, # Whether to use GPU acceleration. use_gpu=False, ) We then instantiate our XGBoostTrainer by passing in: The aforementioned ScalingConfig. 
The label_column refers to the column name containing the labels in the Dataset The params are XGBoost training parameters from ray.train.xgboost import XGBoostTrainer trainer = XGBoostTrainer( scaling_config=scaling_config, label_column="target", num_boost_round=20, params={ # XGBoost specific params "objective": "binary:logistic", # "tree_method": "gpu_hist", # uncomment this to use GPU for training "eval_metric": ["logloss", "error"], }, datasets={"train": train_dataset, "valid": valid_dataset}, ) Lastly, we call trainer.fit() to kick off training and obtain the results. result = trainer.fit() print(result.metrics) LightGBM In this example we will train a model using distributed LightGBM. First, we load the dataset from S3 using Ray Data and split it into a train and validation dataset. import ray # Load data. dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") # Split data into train and validation. train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) In the ScalingConfig, we configure the number of workers to use: from ray.air.config import ScalingConfig scaling_config = ScalingConfig( # Number of workers to use for data parallelism. num_workers=2, # Whether to use GPU acceleration. use_gpu=False, ) We then instantiate our LightGBMTrainer by passing in: The aforementioned ScalingConfig The label_column refers to the column name containing the labels in the Dataset The params are core LightGBM training parameters from ray.train.lightgbm import LightGBMTrainer trainer = LightGBMTrainer( scaling_config=scaling_config, label_column="target", num_boost_round=20, params={ # LightGBM specific params "objective": "binary", "metric": ["binary_logloss", "binary_error"], }, datasets={"train": train_dataset, "valid": valid_dataset}, ) And lastly we call trainer.fit() to kick off training and obtain the results. result = trainer.fit() print(result.metrics) PyTorch This example shows how you can use Ray Train with PyTorch. First, set up your dataset and model. import torch import torch.nn as nn from torch.utils.data import DataLoader from torchvision import datasets from torchvision.transforms import ToTensor def get_dataset(): return datasets.FashionMNIST( root="/tmp/data", train=True, download=True, transform=ToTensor(), ) class NeuralNetwork(nn.Module): def __init__(self): super().__init__() self.flatten = nn.Flatten() self.linear_relu_stack = nn.Sequential( nn.Linear(28 * 28, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10), ) def forward(self, inputs): inputs = self.flatten(inputs) logits = self.linear_relu_stack(inputs) return logits Now define your single-worker PyTorch training function. def train_func(): num_epochs = 3 batch_size = 64 dataset = get_dataset() dataloader = DataLoader(dataset, batch_size=batch_size) model = NeuralNetwork() criterion = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.01) for epoch in range(num_epochs): for inputs, labels in dataloader: optimizer.zero_grad() pred = model(inputs) loss = criterion(pred, labels) loss.backward() optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") This training function can be executed with: train_func() Now let’s convert this to a distributed multi-worker training function! All you have to do is use the ray.train.torch.prepare_model and ray.train.torch.prepare_data_loader utility functions to easily setup your model & data for distributed training. 
This will automatically wrap your model with DistributedDataParallel and place it on the right device, and add DistributedSampler to your DataLoaders. from ray import train def train_func_distributed(): num_epochs = 3 batch_size = 64 dataset = get_dataset() dataloader = DataLoader(dataset, batch_size=batch_size) dataloader = train.torch.prepare_data_loader(dataloader) model = NeuralNetwork() model = train.torch.prepare_model(model) criterion = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=0.01) for epoch in range(num_epochs): for inputs, labels in dataloader: optimizer.zero_grad() pred = model(inputs) loss = criterion(pred, labels) loss.backward() optimizer.step() print(f"epoch: {epoch}, loss: {loss.item()}") Then, instantiate a TorchTrainer with 4 workers, and use it to run the new training function! from ray.train.torch import TorchTrainer from ray.air.config import ScalingConfig # For GPU Training, set `use_gpu` to True. use_gpu = False trainer = TorchTrainer( train_func_distributed, scaling_config=ScalingConfig(num_workers=4, use_gpu=use_gpu) ) results = trainer.fit() See Porting code from PyTorch, TensorFlow, or Horovod to Ray Train for a more comprehensive example. TensorFlow This example shows how you can use Ray Train to set up Multi-worker training with Keras. First, set up your dataset and model. import numpy as np import tensorflow as tf def mnist_dataset(batch_size): (x_train, y_train), _ = tf.keras.datasets.mnist.load_data() # The `x` arrays are in uint8 and have values in the [0, 255] range. # You need to convert them to float32 with values in the [0, 1] range. x_train = x_train / np.float32(255) y_train = y_train.astype(np.int64) train_dataset = tf.data.Dataset.from_tensor_slices( (x_train, y_train)).shuffle(60000).repeat().batch(batch_size) return train_dataset def build_and_compile_cnn_model(): model = tf.keras.Sequential([ tf.keras.layers.InputLayer(input_shape=(28, 28)), tf.keras.layers.Reshape(target_shape=(28, 28, 1)), tf.keras.layers.Conv2D(32, 3, activation='relu'), tf.keras.layers.Flatten(), tf.keras.layers.Dense(128, activation='relu'), tf.keras.layers.Dense(10) ]) model.compile( loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), optimizer=tf.keras.optimizers.SGD(learning_rate=0.001), metrics=['accuracy']) return model Now define your single-worker TensorFlow training function. def train_func(): batch_size = 64 single_worker_dataset = mnist_dataset(batch_size) single_worker_model = build_and_compile_cnn_model() single_worker_model.fit(single_worker_dataset, epochs=3, steps_per_epoch=70) This training function can be executed with: train_func() Now let’s convert this to a distributed multi-worker training function! All you need to do is: Set the per-worker batch size - each worker will process the same size batch as in the single-worker code. Choose your TensorFlow distributed training strategy. In this example we use the MultiWorkerMirroredStrategy. import json import os def train_func_distributed(): per_worker_batch_size = 64 # This environment variable will be set by Ray Train. tf_config = json.loads(os.environ['TF_CONFIG']) num_workers = len(tf_config['cluster']['worker']) strategy = tf.distribute.MultiWorkerMirroredStrategy() global_batch_size = per_worker_batch_size * num_workers multi_worker_dataset = mnist_dataset(global_batch_size) with strategy.scope(): # Model building/compiling need to be within `strategy.scope()`. 
multi_worker_model = build_and_compile_cnn_model() multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70) Then, instantiate a TensorflowTrainer with 4 workers, and use it to run the new training function! from ray.train.tensorflow import TensorflowTrainer from ray.air.config import ScalingConfig # For GPU Training, set `use_gpu` to True. use_gpu = False trainer = TensorflowTrainer(train_func_distributed, scaling_config=ScalingConfig(num_workers=4, use_gpu=use_gpu)) trainer.fit() See Porting code from PyTorch, TensorFlow, or Horovod to Ray Train for a more comprehensive example. Next Steps To check how your application is doing, you can use the Ray dashboard. Key Concepts of Ray Train There are four main concepts in the Ray Train library. Trainers execute distributed training. Configuration objects are used to configure training. Checkpoints are returned as the result of training. Predictors can be used for inference and batch prediction. https://docs.google.com/drawings/d/1FezcdrXJuxLZzo6Rjz1CHyJzseH8nPFZp6IUepdn3N4/edit Trainers Trainers are responsible for executing (distributed) training runs. The output of a Trainer run is a Result that contains metrics from the training run and the latest saved Checkpoint. Trainers can also be configured with Datasets and Preprocessors for scalable data ingest and preprocessing. Deep Learning, Tree-Based, and other Trainers There are three categories of built-in Trainers: Deep Learning Trainers Ray Train supports the following deep learning trainers: TorchTrainer TensorflowTrainer HorovodTrainer LightningTrainer For these trainers, you usually define your own training function that loads the model and executes single-worker training steps. Refer to the following guides for more details: Deep learning user guide Quick overview of deep-learning trainers in the Ray AIR documentation Tree-Based Trainers Tree-based trainers utilize gradient-boosted decision trees for training. The most popular libraries for this are XGBoost and LightGBM. XGBoostTrainer LightGBMTrainer For these trainers, you just pass a dataset and parameters. The training loop is configured automatically. XGBoost/LightGBM user guide Quick overview of tree-based trainers in the Ray AIR documentation Other Trainers Some trainers don’t fit into the other two categories, such as: TransformersTrainer for NLP RLTrainer for reinforcement learning SklearnTrainer for (non-distributed) training of sklearn models. Other trainers in the Ray AIR documentation Train Configuration Trainers are configured with configuration objects. There are two main configuration classes, the ScalingConfig and the RunConfig. The latter contains subconfigurations, such as the FailureConfig, SyncConfig and CheckpointConfig. Check out the Configurations User Guide for an in-depth guide on using these configurations. Train Checkpoints Calling Trainer.fit() returns a Result object, which includes information about the run such as the reported metrics and the saved checkpoints. Checkpoints have the following purposes: They can be passed to a Trainer to resume training from the given model state. They can be used to create a Predictor / BatchPredictor for scalable batch prediction. They can be deployed with Ray Serve. Train Predictors Predictors are the counterpart to Trainers. A Trainer trains a model on a dataset, and a Predictor uses the resulting model to perform inference. Each Trainer has a respective Predictor implementation that is compatible with its generated checkpoints.
Example: XGBoostPredictor import numpy as np import ray from ray.train.xgboost import XGBoostTrainer, XGBoostPredictor from ray.air.config import ScalingConfig train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)]) trainer = XGBoostTrainer( label_column="y", params={"objective": "reg:squarederror"}, scaling_config=ScalingConfig(num_workers=3), datasets={"train": train_dataset}, ) result = trainer.fit() predictor = XGBoostPredictor.from_checkpoint(result.checkpoint) predictions = predictor.predict(np.expand_dims(np.arange(32, 64), 1)) A predictor can be passed into a BatchPredictor, which is used to scale up prediction over a Ray cluster. It takes a Dataset as input. Example: Batch prediction with XGBoostPredictor import pandas as pd from ray.train.batch_predictor import BatchPredictor batch_predictor = BatchPredictor.from_checkpoint(result.checkpoint, XGBoostPredictor) predict_dataset = ray.data.from_pandas(pd.DataFrame({"x": np.arange(32)})) predictions = batch_predictor.predict( data=predict_dataset, batch_size=8, min_scoring_workers=2, ) predictions.show() See the Predictors user guide for more information and examples. Ray Train User Guides Configurations User Guide Deep Learning User Guide XGBoost / LightGBM User Guide Ray Train Architecture Ray Train Configuration User Guide The following sections give an overview of how to configure scale-out, run options, and fault tolerance for Train. For more details on how to configure data ingest, also refer to Configuring Training Datasets. Scaling Configurations in Train (ScalingConfig) The scaling configuration specifies distributed training properties like the number of workers or the resources per worker. The properties of the scaling configuration are tunable. from ray.air import ScalingConfig scaling_config = ScalingConfig( # Number of distributed workers. num_workers=2, # Turn on/off GPU. use_gpu=True, # Specify resources used for trainer. trainer_resources={"CPU": 1}, # Try to schedule workers on different nodes. placement_strategy="SPREAD", ) See the ScalingConfig API reference. Run Configuration in Train (RunConfig) RunConfig is a configuration object used in Ray Train to define the experiment spec that corresponds to a call to trainer.fit(). It includes settings such as the experiment name, storage path for results, stopping conditions, custom callbacks, checkpoint configuration, verbosity level, and logging options. Many of these settings are configured through other config objects and passed through the RunConfig. The following sub-sections contain descriptions of these configs. The properties of the run configuration are not tunable. from ray.air import RunConfig from ray.air.integrations.wandb import WandbLoggerCallback run_config = RunConfig( # Name of the training run (directory name). name="my_train_run", # The experiment results will be saved to: storage_path/name storage_path="~/ray_results", # storage_path="s3://my_bucket/tune_results", # Low training verbosity. verbose=1, # Custom and built-in callbacks callbacks=[WandbLoggerCallback()], # Stopping criteria stop={"training_iteration": 10}, ) See the RunConfig API reference. See How to Configure Persistent Storage in Ray Tune for storage configuration examples (related to storage_path). Failure configurations in Train (FailureConfig) The failure configuration specifies how training failures should be dealt with. As part of the RunConfig, the properties of the failure configuration are not tunable.
from ray.air import RunConfig, FailureConfig run_config = RunConfig( failure_config=FailureConfig( # Tries to recover a run up to this many times. max_failures=2 ) ) See the FailureConfig API reference. Checkpoint configurations in Train (CheckpointConfig) The checkpoint configuration specifies how often to checkpoint training state and how many checkpoints to keep. As part of the RunConfig, the properties of the checkpoint configuration are not tunable. from ray.air import RunConfig, CheckpointConfig run_config = RunConfig( checkpoint_config=CheckpointConfig( # Only keep the 2 *best* checkpoints and delete the others. num_to_keep=2, # *Best* checkpoints are determined by these params: checkpoint_score_attribute="mean_accuracy", checkpoint_score_order="max", ), # This will store checkpoints on S3. storage_path="s3://remote-bucket/location", ) Trainers of certain frameworks including XGBoostTrainer, LightGBMTrainer, and TransformersTrainer implement checkpointing out of the box. For these trainers, checkpointing can be enabled by setting the checkpoint frequency within the CheckpointConfig. from ray.air import RunConfig, CheckpointConfig run_config = RunConfig( checkpoint_config=CheckpointConfig( # Checkpoint every iteration. checkpoint_frequency=1, # Only keep the latest checkpoint and delete the others. num_to_keep=1, ) ) # from ray.train.xgboost import XGBoostTrainer # trainer = XGBoostTrainer(..., run_config=run_config) checkpoint_frequency and other parameters do not work for trainers that accept a custom training loop such as TorchTrainer, since checkpointing is fully user-controlled. See the CheckpointConfig API reference. [Experimental] Distributed Checkpoints: For model parallel workloads where the models do not fit in a single GPU worker, it will be important to save and upload the model that is partitioned across different workers. You can enable this by setting _checkpoint_keep_all_ranks=True to retain the model checkpoints across workers, and _checkpoint_upload_from_workers=True to upload their checkpoints to the cloud directly; both options are set in CheckpointConfig. This functionality works for any trainer that inherits from DataParallelTrainer. Synchronization configurations in Train (tune.SyncConfig) The tune.SyncConfig specifies how synchronization of results and checkpoints should happen in a distributed Ray cluster. As part of the RunConfig, the properties of the sync configuration are not tunable. This configuration is mostly relevant to running multiple Train runs with Ray Tune. See How to Configure Persistent Storage in Ray Tune for a guide on using the SyncConfig. See the SyncConfig API reference. Distributed Deep Learning with Ray Train User Guide This guide explains how to use Train to scale PyTorch, TensorFlow and Horovod. In this guide, we cover examples for the following use cases: How do I port my code to use Ray Train? How do I use Ray Train to train with a large dataset? How do I monitor my training? How do I run my training on pre-emptible instances (fault tolerance)? How do I tune my Ray Train model? Using Deep Learning Frameworks as Backends Ray Train provides a thin API around different backend frameworks for distributed deep learning. At the moment, Ray Train allows you to perform training with: PyTorch: Ray Train initializes your distributed process group, allowing you to run your DistributedDataParallel training script. See PyTorch Distributed Overview for more information.
TensorFlow: Ray Train configures TF_CONFIG for you, allowing you to run your MultiWorkerMirroredStrategy training script. See Distributed training with TensorFlow for more information. Horovod: Ray Train configures the Horovod environment and Rendezvous server for you, allowing you to run your DistributedOptimizer training script. See Horovod documentation for more information. Porting code from PyTorch, TensorFlow, or Horovod to Ray Train The following instructions assume you have a training function that can already be run on a single worker for one of the supported backend frameworks. Updating your training function First, you’ll want to update your training function to support distributed training. PyTorch Ray Train will set up your distributed process group for you and also provides utility methods to automatically prepare your model and data for distributed training. Ray Train will still work even if you don’t use the ray.train.torch.prepare_model() and ray.train.torch.prepare_data_loader() utilities below, and instead handle the logic directly inside your training function. First, use the prepare_model() function to automatically move your model to the right device and wrap it in DistributedDataParallel: import torch from torch.nn.parallel import DistributedDataParallel +from ray.air import session +from ray import train +import ray.train.torch def train_func(): - device = torch.device(f"cuda:{session.get_local_rank()}" if - torch.cuda.is_available() else "cpu") - torch.cuda.set_device(device) # Create model. model = NeuralNetwork() - model = model.to(device) - model = DistributedDataParallel(model, - device_ids=[session.get_local_rank()] if torch.cuda.is_available() else None) + model = train.torch.prepare_model(model) ... Then, use the prepare_data_loader function to automatically add a DistributedSampler to your DataLoader and move the batches to the right device. This step is not necessary if you are passing in Ray Data to your Trainer (see Distributed Data Ingest with Ray Data and Ray Train): import torch from torch.utils.data import DataLoader, DistributedSampler +from ray.air import session +from ray import train +import ray.train.torch def train_func(): - device = torch.device(f"cuda:{session.get_local_rank()}" if - torch.cuda.is_available() else "cpu") - torch.cuda.set_device(device) ... - data_loader = DataLoader(my_dataset, batch_size=worker_batch_size, sampler=DistributedSampler(dataset)) + data_loader = DataLoader(my_dataset, batch_size=worker_batch_size) + data_loader = train.torch.prepare_data_loader(data_loader) for X, y in data_loader: - X = X.to_device(device) - y = y.to_device(device) Keep in mind that DataLoader takes in a batch_size which is the batch size for each worker. The global batch size can be calculated from the worker batch size (and vice-versa) with the following equation: global_batch_size = worker_batch_size * session.get_world_size() TensorFlow The current TensorFlow implementation supports MultiWorkerMirroredStrategy (and MirroredStrategy). If there are other strategies you wish to see supported by Ray Train, please let us know by submitting a feature request on GitHub. These instructions closely follow TensorFlow’s Multi-worker training with Keras tutorial. One key difference is that Ray Train will handle the environment variable set up for you. Step 1: Wrap your model in MultiWorkerMirroredStrategy. The MultiWorkerMirroredStrategy enables synchronous distributed training. The Model must be built and compiled within the scope of the strategy. 
with tf.distribute.MultiWorkerMirroredStrategy().scope(): model = ... # build model model.compile() Step 2: Update your Dataset batch size to the global batch size. The batch will be split evenly across worker processes, so batch_size should be set appropriately. -batch_size = worker_batch_size +batch_size = worker_batch_size * session.get_world_size() Horovod If you have a training function that already runs with the Horovod Ray Executor, you should not need to make any additional changes! To onboard onto Horovod, please visit the Horovod guide. Creating a Ray Train Trainer Trainers are the primary Ray Train classes that are used to manage state and execute training. You can create a simple Trainer for the backend of choice with one of the following: PyTorch from ray.air import ScalingConfig from ray.train.torch import TorchTrainer # For GPU Training, set `use_gpu` to True. use_gpu = False trainer = TorchTrainer( train_func, scaling_config=ScalingConfig(use_gpu=use_gpu, num_workers=2) ) TensorFlow Ray will not automatically set any environment variables or configuration related to local parallelism / threading aside from “OMP_NUM_THREADS”. If you desire greater control over TensorFlow threading, use the tf.config.threading module (e.g., tf.config.threading.set_inter_op_parallelism_threads(num_cpus)) at the beginning of your train_loop_per_worker function. from ray.air import ScalingConfig from ray.train.tensorflow import TensorflowTrainer # For GPU Training, set `use_gpu` to True. use_gpu = False trainer = TensorflowTrainer( train_func, scaling_config=ScalingConfig(use_gpu=use_gpu, num_workers=2) ) Horovod from ray.air import ScalingConfig from ray.train.horovod import HorovodTrainer # For GPU Training, set `use_gpu` to True. use_gpu = False trainer = HorovodTrainer( train_func, scaling_config=ScalingConfig(use_gpu=use_gpu, num_workers=2) ) To customize the backend setup, you can use the framework-specific config objects. PyTorch from ray.air import ScalingConfig from ray.train.torch import TorchTrainer, TorchConfig trainer = TorchTrainer( train_func, torch_config=TorchConfig(...), scaling_config=ScalingConfig(num_workers=2), ) TensorFlow from ray.air import ScalingConfig from ray.train.tensorflow import TensorflowTrainer, TensorflowConfig trainer = TensorflowTrainer( train_func, tensorflow_config=TensorflowConfig(...), scaling_config=ScalingConfig(num_workers=2), ) Horovod from ray.air import ScalingConfig from ray.train.horovod import HorovodTrainer, HorovodConfig trainer = HorovodTrainer( train_func, horovod_config=HorovodConfig(...), scaling_config=ScalingConfig(num_workers=2), ) For more configurability, please reference the DataParallelTrainer API. Running your training function With a distributed training function and a Ray Train Trainer, you are now ready to start training! trainer.fit() Configuring Training With Ray Train, you can execute a training function (train_func) in a distributed manner by calling Trainer.fit. To pass arguments into the training function, you can expose a single config dictionary parameter: -def train_func(): +def train_func(config): Then, you can pass in the config dictionary as an argument to Trainer: +config = {} # This should be populated. trainer = TorchTrainer( train_func, + train_loop_config=config, scaling_config=ScalingConfig(num_workers=2) ) Putting this all together, you can run your training function with different configurations.
As an example: from ray.air import session, ScalingConfig from ray.train.torch import TorchTrainer def train_func(config): for i in range(config["num_epochs"]): session.report({"epoch": i}) trainer = TorchTrainer( train_func, train_loop_config={"num_epochs": 2}, scaling_config=ScalingConfig(num_workers=2) ) result = trainer.fit() print(result.metrics["epoch"]) # 1 A primary use-case for config is to try different hyperparameters. To perform hyperparameter tuning with Ray Train, please refer to the Ray Tune integration. Accessing Training Results The return value of Trainer.fit() is a Result object, containing information about the training run. You can access it to obtain saved checkpoints, metrics and other relevant data. For example, you can: Print the metrics for the last training iteration: from pprint import pprint pprint(result.metrics) # {'_time_this_iter_s': 0.001016855239868164, # '_timestamp': 1657829125, # '_training_iteration': 2, # 'config': {}, # 'date': '2022-07-14_20-05-25', # 'done': True, # 'episodes_total': None, # 'epoch': 1, # 'experiment_id': '5a3f8b9bf875437881a8ddc7e4dd3340', # 'experiment_tag': '0', # 'hostname': 'ip-172-31-43-110', # 'iterations_since_restore': 2, # 'node_ip': '172.31.43.110', # 'pid': 654068, # 'time_since_restore': 3.4353830814361572, # 'time_this_iter_s': 0.00809168815612793, # 'time_total_s': 3.4353830814361572, # 'timestamp': 1657829125, # 'timesteps_since_restore': 0, # 'timesteps_total': None, # 'training_iteration': 2, # 'trial_id': '4913f_00000', # 'warmup_time': 0.003167867660522461} View the dataframe containing the metrics from all iterations: print(result.metrics_dataframe) Obtain the Checkpoint, used for resuming training, prediction and serving. result.checkpoint # last saved checkpoint result.best_checkpoints # N best saved checkpoints, as configured in run_config Log Directory Structure Each Trainer will have a local directory created for logs and checkpoints. You can obtain the path to the directory by accessing the log_dir attribute of the Result object returned by Trainer.fit(). print(result.log_dir) # '/home/ubuntu/ray_results/TorchTrainer_2022-06-13_20-31-06/checkpoint_000003' Distributed Data Ingest with Ray Data and Ray Train Ray Data is the recommended way to work with large datasets in Ray Train. Ray Data provides automatic loading, sharding, and streamed ingest of data across multiple Train workers. To get started, pass in one or more datasets under the datasets keyword argument for Trainer (e.g., Trainer(datasets={...})). Here’s a simple code overview of the Ray Data integration: from ray.air import session # Datasets can be accessed in your train_func via ``get_dataset_shard``. def train_func(config): train_data_shard = session.get_dataset_shard("train") validation_data_shard = session.get_dataset_shard("validation") ... # Randomly split the dataset into 80% training data and 20% validation data. dataset = ray.data.read_csv("...") train_dataset, validation_dataset = dataset.train_test_split( test_size=0.2, shuffle=True, ) trainer = TorchTrainer( train_func, datasets={"train": train_dataset, "validation": validation_dataset}, scaling_config=ScalingConfig(num_workers=8), ) trainer.fit() For more details on how to configure data ingest for Train, please refer to Configuring Training Datasets.
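As a follow-up to the ingest overview above, the sketch below shows one way a worker might consume its shard inside train_func. The batch size and the "x"/"y" column names are illustrative assumptions, not requirements of the API.

import torch
from ray.air import session

def train_func(config):
    # Each worker receives only its own shard of the "train" dataset.
    train_data_shard = session.get_dataset_shard("train")

    for epoch in range(config.get("num_epochs", 1)):
        # Stream the shard as Torch tensors; "x" and "y" are assumed column names.
        for batch in train_data_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = batch["x"], batch["y"]
            ...  # forward/backward pass and session.report() go here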
Logging, Checkpointing and Callbacks in Ray Train Ray Train has mechanisms to easily collect intermediate results from the training workers during the training run and also has a Callback interface to perform actions on these intermediate results (such as logging, aggregations, etc.). You can use either the built-in callbacks that Ray AIR provides, or implement a custom callback for your use case. The callback API is shared with Ray Tune. Ray Train also provides a way to save Checkpoints during the training process. This is useful for: Integration with Ray Tune to use certain Ray Tune schedulers. Running a long-running training job on a cluster of pre-emptible machines/pods. Persisting trained model state to later use for serving/inference. In general, storing any model artifacts. Reporting intermediate results and handling checkpoints Ray AIR provides a Session API for reporting intermediate results and checkpoints from the training function (run on distributed workers) up to the Trainer (where your python script is executed) by calling session.report(metrics). The results will be collected from the distributed workers and passed to the driver to be logged and displayed. Only the results from rank 0 worker will be used. However, in order to ensure consistency, session.report() has to be called on each worker. If you want to aggregate results from multiple workers, see How to obtain and aggregate results from different workers?. The primary use-case for reporting is for metrics (accuracy, loss, etc.) at the end of each training epoch. from ray.air import session def train_func(): ... for i in range(num_epochs): result = model.train(...) session.report({"result": result}) The session concept exists on several levels: The execution layer (called Tune Session) and the Data Parallel training layer (called Train Session). The following figure shows how these two sessions look like in a Data Parallel training scenario. https://docs.google.com/drawings/d/1g0pv8gqgG29aPEPTcd4BC0LaRNbW1sAkv3H6W1TCp0c/edit Saving checkpoints Checkpoints can be saved by calling session.report(metrics, checkpoint=Checkpoint(...)) in the training function. This will cause the checkpoint state from the distributed workers to be saved on the Trainer (where your python script is executed). The latest saved checkpoint can be accessed through the checkpoint attribute of the Result, and the best saved checkpoints can be accessed by the best_checkpoints attribute. Concrete examples are provided to demonstrate how checkpoints (model weights but not models) are saved appropriately in distributed training. 
PyTorch import ray.train.torch from ray.air import session, Checkpoint, ScalingConfig from ray.train.torch import TorchTrainer import torch import torch.nn as nn from torch.optim import Adam import numpy as np def train_func(config): n = 100 # create a toy dataset # data : X - dim = (n, 4) # target : Y - dim = (n, 1) X = torch.Tensor(np.random.normal(0, 1, size=(n, 4))) Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1))) # toy neural network : 1-layer # wrap the model in DDP model = ray.train.torch.prepare_model(nn.Linear(4, 1)) criterion = nn.MSELoss() optimizer = Adam(model.parameters(), lr=3e-4) for epoch in range(config["num_epochs"]): y = model.forward(X) # compute loss loss = criterion(y, Y) # back-propagate loss optimizer.zero_grad() loss.backward() optimizer.step() state_dict = model.state_dict() checkpoint = Checkpoint.from_dict( dict(epoch=epoch, model_weights=state_dict) ) session.report({}, checkpoint=checkpoint) trainer = TorchTrainer( train_func, train_loop_config={"num_epochs": 5}, scaling_config=ScalingConfig(num_workers=2), ) result = trainer.fit() print(result.checkpoint.to_dict()) # {'epoch': 4, 'model_weights': OrderedDict([('bias', tensor([-0.1215])), ('weight', tensor([[0.3253, 0.1979, 0.4525, 0.2850]]))]), '_timestamp': 1656107095, '_preprocessor': None, '_current_checkpoint_id': 4} TensorFlow from ray.air import session, Checkpoint, ScalingConfig from ray.train.tensorflow import TensorflowTrainer import numpy as np def train_func(config): import tensorflow as tf n = 100 # create a toy dataset # data : X - dim = (n, 4) # target : Y - dim = (n, 1) X = np.random.normal(0, 1, size=(n, 4)) Y = np.random.uniform(0, 1, size=(n, 1)) strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() with strategy.scope(): # toy neural network : 1-layer model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="linear", input_shape=(4,))]) model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["mse"]) for epoch in range(config["num_epochs"]): model.fit(X, Y, batch_size=20) checkpoint = Checkpoint.from_dict( dict(epoch=epoch, model_weights=model.get_weights()) ) session.report({}, checkpoint=checkpoint) trainer = TensorflowTrainer( train_func, train_loop_config={"num_epochs": 5}, scaling_config=ScalingConfig(num_workers=2), ) result = trainer.fit() print(result.checkpoint.to_dict()) # {'epoch': 4, 'model_weights': [array([[-0.31858477], # [ 0.03747174], # [ 0.28266194], # [ 0.8626015 ]], dtype=float32), array([0.02230084], dtype=float32)], '_timestamp': 1656107383, '_preprocessor': None, '_current_checkpoint_id': 4} By default, checkpoints will be persisted to local disk in the log directory of each run. print(result.checkpoint.get_internal_representation()) # ('local_path', '/home/ubuntu/ray_results/TorchTrainer_2022-06-24_21-34-49/TorchTrainer_7988b_00000_0_2022-06-24_21-34-49/checkpoint_000003') Configuring checkpoints For more configurability of checkpointing behavior (specifically saving checkpoints to disk), a CheckpointConfig can be passed into Trainer. 
As an example, to completely disable writing checkpoints to disk: from ray.air import session, RunConfig, CheckpointConfig, ScalingConfig from ray.train.torch import TorchTrainer def train_func(): for epoch in range(3): checkpoint = Checkpoint.from_dict(dict(epoch=epoch)) session.report({}, checkpoint=checkpoint) checkpoint_config = CheckpointConfig(num_to_keep=0) trainer = TorchTrainer( train_func, scaling_config=ScalingConfig(num_workers=2), run_config=RunConfig(checkpoint_config=checkpoint_config) ) trainer.fit() You may also config CheckpointConfig to keep the “N best” checkpoints persisted to disk. The following example shows how you could keep the 2 checkpoints with the lowest “loss” value: from ray.air import session, Checkpoint, RunConfig, CheckpointConfig, ScalingConfig from ray.train.torch import TorchTrainer def train_func(): # first checkpoint session.report(dict(loss=2), checkpoint=Checkpoint.from_dict(dict(loss=2))) # second checkpoint session.report(dict(loss=2), checkpoint=Checkpoint.from_dict(dict(loss=4))) # third checkpoint session.report(dict(loss=2), checkpoint=Checkpoint.from_dict(dict(loss=1))) # fourth checkpoint session.report(dict(loss=2), checkpoint=Checkpoint.from_dict(dict(loss=3))) # Keep the 2 checkpoints with the smallest "loss" value. checkpoint_config = CheckpointConfig( num_to_keep=2, checkpoint_score_attribute="loss", checkpoint_score_order="min" ) trainer = TorchTrainer( train_func, scaling_config=ScalingConfig(num_workers=2), run_config=RunConfig(checkpoint_config=checkpoint_config), ) result = trainer.fit() print(result.best_checkpoints[0][0].get_internal_representation()) # ('local_path', '/home/ubuntu/ray_results/TorchTrainer_2022-06-24_21-34-49/TorchTrainer_7988b_00000_0_2022-06-24_21-34-49/checkpoint_000000') print(result.best_checkpoints[1][0].get_internal_representation()) # ('local_path', '/home/ubuntu/ray_results/TorchTrainer_2022-06-24_21-34-49/TorchTrainer_7988b_00000_0_2022-06-24_21-34-49/checkpoint_000002') Loading checkpoints Checkpoints can be loaded into the training function in 2 steps: From the training function, ray.air.session.get_checkpoint() can be used to access the most recently saved Checkpoint. This is useful to continue training even if there’s a worker failure. The checkpoint to start training with can be bootstrapped by passing in a Checkpoint to Trainer as the resume_from_checkpoint argument. 
PyTorch import ray.train.torch from ray.air import session, Checkpoint, ScalingConfig from ray.train.torch import TorchTrainer import torch import torch.nn as nn from torch.optim import Adam import numpy as np def train_func(config): n = 100 # create a toy dataset # data : X - dim = (n, 4) # target : Y - dim = (n, 1) X = torch.Tensor(np.random.normal(0, 1, size=(n, 4))) Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1))) # toy neural network : 1-layer model = nn.Linear(4, 1) criterion = nn.MSELoss() optimizer = Adam(model.parameters(), lr=3e-4) start_epoch = 0 checkpoint = session.get_checkpoint() if checkpoint: # assume that we have run the session.report() example # and successfully save some model weights checkpoint_dict = checkpoint.to_dict() model.load_state_dict(checkpoint_dict.get("model_weights")) start_epoch = checkpoint_dict.get("epoch", -1) + 1 # wrap the model in DDP model = ray.train.torch.prepare_model(model) for epoch in range(start_epoch, config["num_epochs"]): y = model.forward(X) # compute loss loss = criterion(y, Y) # back-propagate loss optimizer.zero_grad() loss.backward() optimizer.step() state_dict = model.state_dict() checkpoint = Checkpoint.from_dict( dict(epoch=epoch, model_weights=state_dict) ) session.report({}, checkpoint=checkpoint) trainer = TorchTrainer( train_func, train_loop_config={"num_epochs": 2}, scaling_config=ScalingConfig(num_workers=2), ) # save a checkpoint result = trainer.fit() # load checkpoint trainer = TorchTrainer( train_func, train_loop_config={"num_epochs": 4}, scaling_config=ScalingConfig(num_workers=2), resume_from_checkpoint=result.checkpoint, ) result = trainer.fit() print(result.checkpoint.to_dict()) # {'epoch': 3, 'model_weights': OrderedDict([('bias', tensor([0.0902])), ('weight', tensor([[-0.1549, -0.0861, 0.4353, -0.4116]]))]), '_timestamp': 1656108265, '_preprocessor': None, '_current_checkpoint_id': 2} TensorFlow from ray.air import session, Checkpoint, ScalingConfig from ray.train.tensorflow import TensorflowTrainer import numpy as np def train_func(config): import tensorflow as tf n = 100 # create a toy dataset # data : X - dim = (n, 4) # target : Y - dim = (n, 1) X = np.random.normal(0, 1, size=(n, 4)) Y = np.random.uniform(0, 1, size=(n, 1)) start_epoch = 0 strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() with strategy.scope(): # toy neural network : 1-layer model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="linear", input_shape=(4,))]) checkpoint = session.get_checkpoint() if checkpoint: # assume that we have run the session.report() example # and successfully save some model weights checkpoint_dict = checkpoint.to_dict() model.set_weights(checkpoint_dict.get("model_weights")) start_epoch = checkpoint_dict.get("epoch", -1) + 1 model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["mse"]) for epoch in range(start_epoch, config["num_epochs"]): model.fit(X, Y, batch_size=20) checkpoint = Checkpoint.from_dict( dict(epoch=epoch, model_weights=model.get_weights()) ) session.report({}, checkpoint=checkpoint) trainer = TensorflowTrainer( train_func, train_loop_config={"num_epochs": 2}, scaling_config=ScalingConfig(num_workers=2), ) # save a checkpoint result = trainer.fit() # load a checkpoint trainer = TensorflowTrainer( train_func, train_loop_config={"num_epochs": 5}, scaling_config=ScalingConfig(num_workers=2), resume_from_checkpoint=result.checkpoint, ) result = trainer.fit() print(result.checkpoint.to_dict()) # {'epoch': 4, 'model_weights': [array([[-0.70056134], # [-0.8839263 
], # [-1.0043601 ], # [-0.61634773]], dtype=float32), array([0.01889327], dtype=float32)], '_timestamp': 1656108446, '_preprocessor': None, '_current_checkpoint_id': 3} Callbacks You may want to plug in your training code with your favorite experiment management framework. Ray AIR provides an interface to fetch intermediate results and callbacks to process/log your intermediate results (the values passed into ray.air.session.report()). Ray AIR contains built-in callbacks for popular tracking frameworks, or you can implement your own callback via the Callback interface. Example: Logging to MLflow and TensorBoard Step 1: Install the necessary packages $ pip install mlflow $ pip install tensorboardX Step 2: Run the following training script from ray.air import ScalingConfig, RunConfig, session from ray.train.torch import TorchTrainer from ray.air.integrations.mlflow import MLflowLoggerCallback from ray.tune.logger import TBXLoggerCallback def train_func(): for i in range(3): session.report(dict(epoch=i)) trainer = TorchTrainer( train_func, scaling_config=ScalingConfig(num_workers=2), run_config=RunConfig( callbacks=[ MLflowLoggerCallback(experiment_name="train_experiment"), TBXLoggerCallback(), ], ), ) # Run the training function, logging all the intermediate results # to MLflow and Tensorboard. result = trainer.fit() # For MLFLow logs: # MLFlow logs will by default be saved in an `mlflow` directory # in the current working directory. # $ cd mlflow # # View the MLflow UI. # $ mlflow ui # You can change the directory by setting the `tracking_uri` argument # in `MLflowLoggerCallback`. # For TensorBoard logs: # Print the latest run directory and keep note of it. # For example: /home/ubuntu/ray_results/TorchTrainer_2022-06-13_20-31-06 print("Run directory:", result.log_dir.parent) # TensorBoard is saved in parent dir # How to visualize the logs # Navigate to the run directory of the trainer. # For example `cd /home/ubuntu/ray_results/TorchTrainer_2022-06-13_20-31-06` # $ cd # # # View the tensorboard UI. # $ tensorboard --logdir . Custom Callbacks If the provided callbacks do not cover your desired integrations or use-cases, you may always implement a custom callback by subclassing LoggerCallback. If the callback is general enough, please feel welcome to add it to the ray repository. A simple example for creating a callback that will print out results: from typing import List, Dict from ray.air import session, RunConfig, ScalingConfig from ray.train.torch import TorchTrainer from ray.tune.logger import LoggerCallback # LoggerCallback is a higher level API of Callback. 
class LoggingCallback(LoggerCallback): def __init__(self) -> None: self.results = [] def log_trial_result(self, iteration: int, trial: "Trial", result: Dict): self.results.append(trial.last_result) def train_func(): for i in range(3): session.report({"epoch": i}) callback = LoggingCallback() trainer = TorchTrainer( train_func, run_config=RunConfig(callbacks=[callback]), scaling_config=ScalingConfig(num_workers=2), ) trainer.fit() print("\n".join([str(x) for x in callback.results])) # {'trial_id': '0f1d0_00000', 'experiment_id': '494a1d050b4a4d11aeabd87ba475fcd3', 'date': '2022-06-27_17-03-28', 'timestamp': 1656349408, 'pid': 23018, 'hostname': 'ip-172-31-43-110', 'node_ip': '172.31.43.110', 'config': {}} # {'epoch': 0, '_timestamp': 1656349412, '_time_this_iter_s': 0.0026497840881347656, '_training_iteration': 1, 'time_this_iter_s': 3.433483362197876, 'done': False, 'timesteps_total': None, 'episodes_total': None, 'training_iteration': 1, 'trial_id': '0f1d0_00000', 'experiment_id': '494a1d050b4a4d11aeabd87ba475fcd3', 'date': '2022-06-27_17-03-32', 'timestamp': 1656349412, 'time_total_s': 3.433483362197876, 'pid': 23018, 'hostname': 'ip-172-31-43-110', 'node_ip': '172.31.43.110', 'config': {}, 'time_since_restore': 3.433483362197876, 'timesteps_since_restore': 0, 'iterations_since_restore': 1, 'warmup_time': 0.003779172897338867, 'experiment_tag': '0'} # {'epoch': 1, '_timestamp': 1656349412, '_time_this_iter_s': 0.0013833045959472656, '_training_iteration': 2, 'time_this_iter_s': 0.016670703887939453, 'done': False, 'timesteps_total': None, 'episodes_total': None, 'training_iteration': 2, 'trial_id': '0f1d0_00000', 'experiment_id': '494a1d050b4a4d11aeabd87ba475fcd3', 'date': '2022-06-27_17-03-32', 'timestamp': 1656349412, 'time_total_s': 3.4501540660858154, 'pid': 23018, 'hostname': 'ip-172-31-43-110', 'node_ip': '172.31.43.110', 'config': {}, 'time_since_restore': 3.4501540660858154, 'timesteps_since_restore': 0, 'iterations_since_restore': 2, 'warmup_time': 0.003779172897338867, 'experiment_tag': '0'} How to obtain and aggregate results from different workers? In real applications, you may want to calculate optimization metrics besides accuracy and loss: recall, precision, Fbeta, etc. You may also want to collect metrics from multiple workers. While Ray Train currently only reports metrics from the rank 0 worker, you can use third-party libraries or distributed primitives of your machine learning framework to report metrics from multiple workers. PyTorch Ray Train natively supports TorchMetrics, which provides a collection of machine learning metrics for distributed, scalable PyTorch models. Here is an example of reporting both the aggregated mean absolute percentage error (MAPE) and the mean validation loss from all workers.
# First, pip install torchmetrics # This code is tested with torchmetrics==0.7.3 and torch==1.12.1 import ray.train.torch from ray.air import session, ScalingConfig from ray.train.torch import TorchTrainer import torch import torch.nn as nn import torchmetrics from torch.optim import Adam import numpy as np def train_func(config): n = 100 # create a toy dataset X = torch.Tensor(np.random.normal(0, 1, size=(n, 4))) X_valid = torch.Tensor(np.random.normal(0, 1, size=(n, 4))) Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1))) Y_valid = torch.Tensor(np.random.uniform(0, 1, size=(n, 1))) # toy neural network : 1-layer # wrap the model in DDP model = ray.train.torch.prepare_model(nn.Linear(4, 1)) criterion = nn.MSELoss() mape = torchmetrics.MeanAbsolutePercentageError() # for averaging loss mean_valid_loss = torchmetrics.MeanMetric() optimizer = Adam(model.parameters(), lr=3e-4) for epoch in range(config["num_epochs"]): model.train() y = model.forward(X) # compute loss loss = criterion(y, Y) # back-propagate loss optimizer.zero_grad() loss.backward() optimizer.step() # evaluate model.eval() with torch.no_grad(): pred = model(X_valid) valid_loss = criterion(pred, Y_valid) # save loss in aggregator mean_valid_loss(valid_loss) mape(pred, Y_valid) # collect all metrics # use .item() to obtain a value that can be reported valid_loss = valid_loss.item() mape_collected = mape.compute().item() mean_valid_loss_collected = mean_valid_loss.compute().item() session.report( { "mape_collected": mape_collected, "valid_loss": valid_loss, "mean_valid_loss_collected": mean_valid_loss_collected, } ) # reset for next epoch mape.reset() mean_valid_loss.reset() trainer = TorchTrainer( train_func, train_loop_config={"num_epochs": 5}, scaling_config=ScalingConfig(num_workers=2), ) result = trainer.fit() print(result.metrics["valid_loss"], result.metrics["mean_valid_loss_collected"]) # 0.5109779238700867 0.5512474775314331 TensorFlow TensorFlow Keras automatically aggregates metrics from all workers. If you wish to have more control over that, consider implementing a custom training loop. Fault Tolerance Automatically Recover from Train Worker Failures Ray Train has built-in fault tolerance to recover from worker failures (i.e., RayActorErrors). When a failure is detected, the workers will be shut down and new workers will be added in. Elastic Training is not yet supported. The training function will be restarted, but progress from the previous execution can be resumed through checkpointing. In order to retain progress when recovering, your training function must implement logic for both saving and loading checkpoints. Each instance of recovery from a worker failure is considered a retry. The number of retries is configurable through the max_failures attribute of the FailureConfig argument set in the RunConfig passed to the Trainer: from ray.air import RunConfig, FailureConfig run_config = RunConfig( failure_config=FailureConfig( # Tries to recover a run up to this many times. max_failures=2 ) ) Restore a Ray Train Experiment At the experiment level, Trainer restoration allows you to resume a previously interrupted experiment from where it left off. A Train experiment may be interrupted due to one of the following reasons: The experiment was manually interrupted (e.g., Ctrl+C, or pre-empted head node instance).
The head node crashed (e.g., OOM or some other runtime error).
The entire cluster went down (e.g., a network error affecting all nodes).

Trainer restoration is possible for all of Ray Train's built-in trainers, but we use TorchTrainer in the examples for demonstration. We also use Trainer to refer to methods that are shared across all built-in trainers.

Let's say your initial Train experiment is configured as follows. The actual training loop is just for demonstration purposes: the important detail is that saving and loading checkpoints has been implemented.

from typing import Dict, Optional

import ray
from ray import air
from ray.air import session
from ray.train.torch import TorchCheckpoint, TorchTrainer


def get_datasets() -> Dict[str, ray.data.Dataset]:
    return {"train": ray.data.from_items([{"x": i, "y": 2 * i} for i in range(10)])}


def train_loop_per_worker(config: dict):
    from torchvision.models import resnet18

    # Checkpoint loading
    checkpoint: Optional[TorchCheckpoint] = session.get_checkpoint()
    model = checkpoint.get_model() if checkpoint else resnet18()
    ray.train.torch.prepare_model(model)

    train_ds = session.get_dataset_shard("train")

    for epoch in range(5):
        # Do some training...

        # Checkpoint saving
        session.report(
            {"epoch": epoch},
            checkpoint=TorchCheckpoint.from_model(model),
        )


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    datasets=get_datasets(),
    scaling_config=air.ScalingConfig(num_workers=2),
    run_config=air.RunConfig(
        storage_path="~/ray_results",
        name="dl_trainer_restore",
    ),
)
result = trainer.fit()

The results and checkpoints of the experiment are saved to the path configured by RunConfig. If the experiment has been interrupted due to one of the reasons listed above, use this path to resume:

from ray.train.torch import TorchTrainer

restored_trainer = TorchTrainer.restore(
    path="~/ray_results/dl_trainer_restore",
    datasets=get_datasets(),
)

You can also restore from a remote path (e.g., from an experiment directory stored in an S3 bucket).

original_trainer = TorchTrainer(
    # ...
    run_config=air.RunConfig(
        # Configure cloud storage
        storage_path="s3://results-bucket",
        name="dl_trainer_restore",
    ),
)
result = original_trainer.fit()

restored_trainer = TorchTrainer.restore(
    "s3://results-bucket/dl_trainer_restore",
    datasets=get_datasets(),
)

Different trainers may allow more parameters to be optionally re-specified on restore. Only datasets are required to be re-specified on restore, if they were supplied originally. See Restoration API for Built-in Trainers for more details.

Auto-resume

Adding the branching logic below will allow you to run the same script after the interrupt, picking up training from where you left off on the previous run. Notice that we use the Trainer.can_restore utility method to determine the existence and validity of the given experiment directory.

if TorchTrainer.can_restore("~/ray_results/dl_restore_autoresume"):
    trainer = TorchTrainer.restore(
        "~/ray_results/dl_restore_autoresume",
        datasets=get_datasets(),
    )
    result = trainer.fit()
else:
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        datasets=get_datasets(),
        scaling_config=air.ScalingConfig(num_workers=2),
        run_config=air.RunConfig(
            storage_path="~/ray_results", name="dl_restore_autoresume"
        ),
    )
    result = trainer.fit()

See the BaseTrainer.restore docstring for a full example.

Trainer.restore is different from Trainer(..., resume_from_checkpoint=...).
resume_from_checkpoint is meant to be used to start a new Train experiment, which writes results to a new directory and starts over from iteration 0. Trainer.restore is used to continue an existing experiment, where new results will continue to be appended to the existing logs.

Running on pre-emptible machines
--------------------------------

You may want to TODO.

We do not have a profiling callback in AIR, as the execution engine has changed to Tune. The behavior of the callback can be replicated with checkpoints (run a trace, save it to a checkpoint, and the checkpoint gets downloaded to the driver every iteration).

.. _train-profiling:

Profiling
---------

Ray Train comes with an integration with `PyTorch Profiler `_. Specifically, it comes with a :ref:`TorchWorkerProfiler ` utility class and :ref:`train-api-torch-tensorboard-profiler-callback` callback that allow you to use the PyTorch Profiler as you would in a non-distributed PyTorch script, and synchronize the generated TensorBoard traces onto the disk from which your script was executed.

**Step 1: Update training function with** ``TorchWorkerProfiler``

.. code-block:: python

    from ray.train.torch import TorchWorkerProfiler

    def train_func():
        twp = TorchWorkerProfiler()
        with profile(..., on_trace_ready=twp.trace_handler) as p:
            ...
            profile_results = twp.get_and_clear_profile_traces()
            train.report(..., **profile_results)
        ...

**Step 2: Run training function with** ``TorchTensorboardProfilerCallback``

.. code-block:: python

    from ray.train import Trainer
    from ray.train.callbacks import TorchTensorboardProfilerCallback

    trainer = Trainer(backend="torch", num_workers=2)
    trainer.start()
    trainer.run(train_func, callbacks=[TorchTensorboardProfilerCallback()])
    trainer.shutdown()

**Step 3: Visualize the logs**

.. code-block:: bash

    # Navigate to the run directory of the trainer.
    # For example `cd /home/ray_results/train_2021-09-01_12-00-00/run_001/pytorch_profiler`
    $ cd /pytorch_profiler

    # Install the PyTorch Profiler TensorBoard Plugin.
    $ pip install torch_tb_profiler

    # Start the TensorBoard UI.
    $ tensorboard --logdir .

    # View the PyTorch Profiler traces.
    $ open http://localhost:6006/#pytorch_profiler

Hyperparameter tuning (Ray Tune)

Hyperparameter tuning with Ray Tune is natively supported with Ray Train. Specifically, you can take an existing Trainer and simply pass it into a Tuner.

from ray import tune
from ray.air import session, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.tuner import Tuner, TuneConfig


def train_func(config):
    # In this example, nothing is expected to change over epochs,
    # and the output metric is equivalent to the input value.
    for _ in range(config["num_epochs"]):
        session.report(dict(output=config["input"]))


trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=2))

tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "num_epochs": 2,
            "input": tune.grid_search([1, 2, 3]),
        }
    },
    tune_config=TuneConfig(num_samples=5, metric="output", mode="max"),
)
result_grid = tuner.fit()
print(result_grid.get_best_result().metrics["output"])
# 3

Automatic Mixed Precision

Automatic mixed precision (AMP) lets you train your models faster by using a lower-precision datatype for operations like linear layers and convolutions.

PyTorch

You can train your Torch model with AMP by:

Adding ray.train.torch.accelerate() with amp=True to the top of your training function.
Wrapping your optimizer with ray.train.torch.prepare_optimizer().
Replacing your backward call with ray.train.torch.backward().
def train_func(): + train.torch.accelerate(amp=True) model = NeuralNetwork() model = train.torch.prepare_model(model) data_loader = DataLoader(my_dataset, batch_size=worker_batch_size) data_loader = train.torch.prepare_data_loader(data_loader) optimizer = torch.optim.SGD(model.parameters(), lr=0.001) + optimizer = train.torch.prepare_optimizer(optimizer) model.train() for epoch in range(90): for images, targets in dataloader: optimizer.zero_grad() outputs = model(images) loss = torch.nn.functional.cross_entropy(outputs, targets) - loss.backward() + train.torch.backward(loss) optimizer.step() ... The performance of AMP varies based on GPU architecture, model type, and data shape. For certain workflows, AMP may perform worse than full-precision training. Reproducibility PyTorch To limit sources of nondeterministic behavior, add ray.train.torch.enable_reproducibility() to the top of your training function. def train_func(): + train.torch.enable_reproducibility() model = NeuralNetwork() model = train.torch.prepare_model(model) ... ray.train.torch.enable_reproducibility() can’t guarantee completely reproducible results across executions. To learn more, read the PyTorch notes on randomness. import ray from ray import tune def training_func(config): dataloader = ray.train.get_dataset()\ .get_shard(torch.rank())\ .iter_torch_batches(batch_size=config["batch_size"]) for i in config["epochs"]: ray.train.report(...) # use same intermediate reporting API # Declare the specification for training. trainer = Trainer(backend="torch", num_workers=12, use_gpu=True) dataset = ray.dataset.window() # Convert this to a trainable. trainable = trainer.to_tune_trainable(training_func, dataset=dataset) tuner = tune.Tuner(trainable, param_space={"lr": tune.uniform(), "batch_size": tune.randint(1, 2, 3)}, tune_config=tune.TuneConfig(num_samples=12)) results = tuner.fit() Advanced APIs ------------- TODO Training Run Iterator API ~~~~~~~~~~~~~~~~~~~~~~~~~ TODO Stateful Class API ~~~~~~~~~~~~~~~~~~ TODO XGBoost & LightGBM User Guide for Ray Train Ray Train has built-in support for XGBoost and LightGBM. Basic Training with Tree-Based Models in Train Just as in the original xgboost.train() and lightgbm.train() functions, the training parameters are passed as the params dictionary. XGBoost Run pip install -U xgboost_ray. import ray from ray.train.xgboost import XGBoostTrainer from ray.air.config import ScalingConfig # Load data. dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") # Split data into train and validation. train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) trainer = XGBoostTrainer( scaling_config=ScalingConfig( # Number of workers to use for data parallelism. num_workers=2, # Whether to use GPU acceleration. use_gpu=False, ), label_column="target", num_boost_round=20, params={ # XGBoost specific params "objective": "binary:logistic", # "tree_method": "gpu_hist", # uncomment this to use GPU for training "eval_metric": ["logloss", "error"], }, datasets={"train": train_dataset, "valid": valid_dataset}, ) result = trainer.fit() print(result.metrics) LightGBM Run pip install -U lightgbm_ray. import ray from ray.train.lightgbm import LightGBMTrainer from ray.air.config import ScalingConfig # Load data. dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv") # Split data into train and validation. 
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3) trainer = LightGBMTrainer( scaling_config=ScalingConfig( # Number of workers to use for data parallelism. num_workers=2, # Whether to use GPU acceleration. use_gpu=False, ), label_column="target", num_boost_round=20, params={ # LightGBM specific params "objective": "binary", "metric": ["binary_logloss", "binary_error"], }, datasets={"train": train_dataset, "valid": valid_dataset}, ) result = trainer.fit() print(result.metrics) Ray-specific params are passed in through the trainer constructors. Saving and Loading XGBoost and LightGBM Checkpoints When a new tree is trained on every boosting round, it’s possible to save a checkpoint to snapshot the training progress so far. XGBoostTrainer and LightGBMTrainer both implement checkpointing out of the box. The only required change is to configure CheckpointConfig to set the checkpointing frequency. For example, the following configuration will save a checkpoint on every boosting round and will only keep the latest checkpoint: from ray.air import RunConfig, CheckpointConfig run_config = RunConfig( checkpoint_config=CheckpointConfig( # Checkpoint every iteration. checkpoint_frequency=1, # Only keep the latest checkpoint and delete the others. num_to_keep=1, ) ) # from ray.train.xgboost import XGBoostTrainer # trainer = XGBoostTrainer(..., run_config=run_config) Once checkpointing is enabled, you can follow this guide to enable fault tolerance. See the Trainer restore API reference for more details. How to scale out training? The benefit of using Ray AIR is that you can seamlessly scale up your training by adjusting the ScalingConfig. Ray Train does not modify or otherwise alter the working of the underlying XGBoost / LightGBM distributed training algorithms. Ray only provides orchestration, data ingest and fault tolerance. For more information on GBDT distributed training, refer to XGBoost documentation and LightGBM documentation. Here are some examples for common use-cases: Multi-node CPU Setup: 4 nodes with 8 CPUs each. Use-case: To utilize all resources in multi-node training. scaling_config = ScalingConfig( num_workers=4, trainer_resources={"CPU": 0}, resources_per_worker={"CPU": 8}, ) Note that we pass 0 CPUs for the trainer resources, so that all resources can be allocated to the actual distributed training workers. Single-node multi-GPU Setup: 1 node with 8 CPUs and 4 GPUs. Use-case: If you have a single node with multiple GPUs, you need to use distributed training to leverage all GPUs. scaling_config = ScalingConfig( num_workers=4, use_gpu=True, ) Multi-node multi-GPU Setup: 4 node with 8 CPUs and 4 GPUs each. Use-case: If you have a multiple nodes with multiple GPUs, you need to schedule one worker per GPU. scaling_config = ScalingConfig( num_workers=16, use_gpu=True, ) Note that you just have to adjust the number of workers - everything else will be handled by Ray automatically. How many remote actors should I use? This depends on your workload and your cluster setup. Generally there is no inherent benefit of running more than one remote actor per node for CPU-only training. This is because XGBoost can already leverage multiple CPUs via threading. However, there are some cases when you should consider starting more than one actor per node: For multi GPU training, each GPU should have a separate remote actor. 
Thus, if your machine has 24 CPUs and 4 GPUs, you will want to start 4 remote actors with 6 CPUs and 1 GPU each In a heterogeneous cluster , you might want to find the greatest common divisor for the number of CPUs. E.g. for a cluster with three nodes of 4, 8, and 12 CPUs, respectively, you should set the number of actors to 6 and the CPUs per actor to 4. How to use GPUs for training? Ray AIR enables multi GPU training for XGBoost and LightGBM. The core backends will automatically leverage NCCL2 for cross-device communication. All you have to do is to start one actor per GPU and set GPU-compatible parameters, e.g. XGBoost’s tree_method to gpu_hist (see XGBoost documentation for more details.) For instance, if you have 2 machines with 4 GPUs each, you will want to start 8 workers, and set use_gpu=True. There is usually no benefit in allocating less (e.g. 0.5) or more than one GPU per actor. You should divide the CPUs evenly across actors per machine, so if your machines have 16 CPUs in addition to the 4 GPUs, each actor should have 4 CPUs to use. trainer = XGBoostTrainer( scaling_config=ScalingConfig( # Number of workers to use for data parallelism. num_workers=2, # Whether to use GPU acceleration. use_gpu=True, ), params={ # XGBoost specific params "tree_method": "gpu_hist", "eval_metric": ["logloss", "error"], }, label_column="target", num_boost_round=20, datasets={"train": train_dataset, "valid": valid_dataset}, ) How to optimize XGBoost memory usage? XGBoost uses a compute-optimized datastructure, the DMatrix, to hold training data. When converting a dataset to a DMatrix, XGBoost creates intermediate copies and ends up holding a complete copy of the full data. The data will be converted into the local dataformat (on a 64 bit system these are 64 bit floats.) Depending on the system and original dataset dtype, this matrix can thus occupy more memory than the original dataset. The peak memory usage for CPU-based training is at least 3x the dataset size (assuming dtype float32 on a 64bit system) plus about 400,000 KiB for other resources, like operating system requirements and storing of intermediate results. Example Machine type: AWS m5.xlarge (4 vCPUs, 16 GiB RAM) Usable RAM: ~15,350,000 KiB Dataset: 1,250,000 rows with 1024 features, dtype float32. Total size: 5,000,000 KiB XGBoost DMatrix size: ~10,000,000 KiB This dataset will fit exactly on this node for training. Note that the DMatrix size might be lower on a 32 bit system. GPUs Generally, the same memory requirements exist for GPU-based training. Additionally, the GPU must have enough memory to hold the dataset. In the example above, the GPU must have at least 10,000,000 KiB (about 9.6 GiB) memory. However, empirically we found that using a DeviceQuantileDMatrix seems to show more peak GPU memory usage, possibly for intermediate storage when loading data (about 10%). Best practices In order to reduce peak memory usage, consider the following suggestions: Store data as float32 or less. More precision is often not needed, and keeping data in a smaller format will help reduce peak memory usage for initial data loading. Pass the dtype when loading data from CSV. Otherwise, floating point values will be loaded as np.float64 per default, increasing peak memory usage by 33%. TODO: the diagram and some of the components (in the given context) are outdated. Make sure to fix this. Ray Train Architecture The process of training models with Ray Train consists of several components. 
First, depending on the training framework you want to work with, you will have to provide a so-called Trainer that manages the training process. For instance, to use a PyTorch model, you use a TorchTrainer. The actual training load is distributed among workers on a cluster that belong to a WorkerGroup. Each framework has its specific communication protocols and exchange formats, which is why Ray Train provides Backend implementations (e.g. TorchBackend) that can be used to run the training process using a BackendExecutor. Here’s a visual overview of the architecture components of Ray Train: Below we discuss each component in a bit more detail. Trainer Trainers are your main entry point to the Ray Train API. Train provides a BaseTrainer, and many framework-specific Trainers inherit from the derived DataParallelTrainer (like TensorFlow or Torch) and GBDTTrainer (like XGBoost or LightGBM). Defining an actual Trainer, such as TorchTrainer works as follows: You pass in a function to the Trainer which defines the training logic. The Trainer will create an Executor to run the distributed training. The Trainer will handle callbacks based on the results from the executor. Backend Backends are used to initialize and manage framework-specific communication protocols. Each training library (Torch, Horovod, TensorFlow, etc.) has a separate backend and takes specific configuration values defined in a BackendConfig. Each backend comes with a BackendExecutor that is used to run the training process. Executor The executor is an interface (BackendExecutor) that executes distributed training. It handles the creation of a group of workers (using Ray Actors) and is initialized with a backend. The executor passes all required resources, the number of workers, and information about worker placement to the WorkerGroup. WorkerGroup The WorkerGroup is a generic utility class for managing a group of Ray Actors. This is similar in concept to Fiber’s Ring. Ray Train Examples Example .rst files should be organized in the same manner as the .py files in ray/python/ray/train/examples. Below are examples for using Ray Train with a variety of models, frameworks, and use cases. You can filter these examples by the following categories: Distributed Training Examples using Ray Train PyTorch Fashion MNIST Training Example Transformers with PyTorch Training Example TensorFlow MNIST Training Example End-to-end Horovod Training Example End-to-end PyTorch Lightning Training Example Use LightningTrainer with Ray Data and Batch Predictor Fine-tune LLM with AIR LightningTrainer and FSDP Ray Train Examples Using Loggers & Callbacks Logging Training Runs with MLflow Using Experiment Tracking Tools in LightningTrainer Ray Train & Tune Integration Examples End-to-end Example for Tuning a TensorFlow Model End-to-end Example for Tuning a PyTorch Model with PBT TODO implement these examples! Features -------- * Example for using a custom callback * End-to-end example for running on an elastic cluster (elastic training) Models ------ * Example training on Vision model. 
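Before moving on to the benchmarks and full examples below, here is a minimal sketch that ties back to the architecture components described above: only the Trainer is constructed directly in user code, while the Backend, BackendExecutor, and WorkerGroup are created and managed internally. The training-loop body and the reported metric names here are placeholders for illustration, not a definitive implementation.

from ray.air import session, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Runs on every worker in the WorkerGroup; the Torch backend has
    # already set up the distributed process group at this point.
    world_size = session.get_world_size()
    for epoch in range(2):
        # ... framework-specific training logic goes here ...
        session.report({"epoch": epoch, "world_size": world_size})


# The Trainer is the only architecture component you instantiate directly.
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()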
Ray Train Benchmarks Benchmark example for the PyTorch data transfer auto pipeline Running Distributed Training of a PyTorch Model on Fashion MNIST with Ray Train import argparse from typing import Dict from ray.air import session import torch from torch import nn from torch.utils.data import DataLoader from torchvision import datasets from torchvision.transforms import ToTensor import ray.train as train from ray.train.torch import TorchTrainer from ray.air.config import ScalingConfig # Download training data from open datasets. training_data = datasets.FashionMNIST( root="~/data", train=True, download=True, transform=ToTensor(), ) # Download test data from open datasets. test_data = datasets.FashionMNIST( root="~/data", train=False, download=True, transform=ToTensor(), ) # Define model class NeuralNetwork(nn.Module): def __init__(self): super(NeuralNetwork, self).__init__() self.flatten = nn.Flatten() self.linear_relu_stack = nn.Sequential( nn.Linear(28 * 28, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10), nn.ReLU(), ) def forward(self, x): x = self.flatten(x) logits = self.linear_relu_stack(x) return logits def train_epoch(dataloader, model, loss_fn, optimizer): size = len(dataloader.dataset) // session.get_world_size() model.train() for batch, (X, y) in enumerate(dataloader): # Compute prediction error pred = model(X) loss = loss_fn(pred, y) # Backpropagation optimizer.zero_grad() loss.backward() optimizer.step() if batch % 100 == 0: loss, current = loss.item(), batch * len(X) print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]") def validate_epoch(dataloader, model, loss_fn): size = len(dataloader.dataset) // session.get_world_size() num_batches = len(dataloader) model.eval() test_loss, correct = 0, 0 with torch.no_grad(): for X, y in dataloader: pred = model(X) test_loss += loss_fn(pred, y).item() correct += (pred.argmax(1) == y).type(torch.float).sum().item() test_loss /= num_batches correct /= size print( f"Test Error: \n " f"Accuracy: {(100 * correct):>0.1f}%, " f"Avg loss: {test_loss:>8f} \n" ) return test_loss def train_func(config: Dict): batch_size = config["batch_size"] lr = config["lr"] epochs = config["epochs"] worker_batch_size = batch_size // session.get_world_size() # Create data loaders. train_dataloader = DataLoader(training_data, batch_size=worker_batch_size) test_dataloader = DataLoader(test_data, batch_size=worker_batch_size) train_dataloader = train.torch.prepare_data_loader(train_dataloader) test_dataloader = train.torch.prepare_data_loader(test_dataloader) # Create model. 
model = NeuralNetwork() model = train.torch.prepare_model(model) loss_fn = nn.CrossEntropyLoss() optimizer = torch.optim.SGD(model.parameters(), lr=lr) for _ in range(epochs): train_epoch(train_dataloader, model, loss_fn, optimizer) loss = validate_epoch(test_dataloader, model, loss_fn) session.report(dict(loss=loss)) def train_fashion_mnist(num_workers=2, use_gpu=False): trainer = TorchTrainer( train_loop_per_worker=train_func, train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4}, scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu), ) result = trainer.fit() print(f"Last result: {result.metrics}") if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument( "--address", required=False, type=str, help="the address to use for Ray" ) parser.add_argument( "--num-workers", "-n", type=int, default=2, help="Sets number of workers for training.", ) parser.add_argument( "--use-gpu", action="store_true", default=False, help="Enables GPU training" ) parser.add_argument( "--smoke-test", action="store_true", default=False, help="Finish quickly for testing.", ) args, _ = parser.parse_known_args() import ray if args.smoke_test: # 2 workers + 1 for trainer. ray.init(num_cpus=3) train_fashion_mnist() else: ray.init(address=args.address) train_fashion_mnist(num_workers=args.num_workers, use_gpu=args.use_gpu) Train a Pytorch Lightning Image Classifier This example introduces how to train a Pytorch Lightning Module using AIR LightningTrainer. We will demonstrate how to train a basic neural network on the MNIST dataset with distributed data parallelism. !pip install "torchmetrics>=0.9" "pytorch_lightning>=1.6" import os import numpy as np import random import torch import torch.nn as nn import torch.nn.functional as F from filelock import FileLock from torch.utils.data import DataLoader, random_split, Subset from torchmetrics import Accuracy from torchvision.datasets import MNIST from torchvision import transforms import pytorch_lightning as pl from pytorch_lightning import trainer from pytorch_lightning.loggers.csv_logs import CSVLogger Prepare Dataset and Module The Pytorch Lightning Trainer takes either torch.utils.data.DataLoader or pl.LightningDataModule as data inputs. You can keep using them without any changes for the Ray AIR LightningTrainer. class MNISTDataModule(pl.LightningDataModule): def __init__(self, batch_size=100): super().__init__() self.data_dir = os.getcwd() self.batch_size = batch_size self.transform = transforms.Compose( [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] ) def setup(self, stage=None): with FileLock(f"{self.data_dir}.lock"): mnist = MNIST( self.data_dir, train=True, download=True, transform=self.transform ) # split data into train and val sets self.mnist_train, self.mnist_val = random_split(mnist, [55000, 5000]) def train_dataloader(self): return DataLoader(self.mnist_train, batch_size=self.batch_size, num_workers=4) def val_dataloader(self): return DataLoader(self.mnist_val, batch_size=self.batch_size, num_workers=4) def test_dataloader(self): with FileLock(f"{self.data_dir}.lock"): self.mnist_test = MNIST( self.data_dir, train=False, download=True, transform=self.transform ) return DataLoader(self.mnist_test, batch_size=self.batch_size, num_workers=4) datamodule = MNISTDataModule(batch_size=128) Next, define a simple multi-layer perception as the subclass of pl.LightningModule. 
class MNISTClassifier(pl.LightningModule): def __init__(self, lr=1e-3, feature_dim=128): torch.manual_seed(421) super(MNISTClassifier, self).__init__() self.linear_relu_stack = nn.Sequential( nn.Linear(28 * 28, feature_dim), nn.ReLU(), nn.Linear(feature_dim, 10), nn.ReLU(), ) self.lr = lr self.accuracy = Accuracy(task="multiclass", num_classes=10) self.eval_loss = [] self.eval_accuracy = [] self.test_accuracy = [] pl.seed_everything(888) def forward(self, x): x = x.view(-1, 28 * 28) x = self.linear_relu_stack(x) return x def training_step(self, batch, batch_idx): x, y = batch y_hat = self(x) loss = torch.nn.functional.cross_entropy(y_hat, y) self.log("train_loss", loss) return loss def validation_step(self, val_batch, batch_idx): loss, acc = self._shared_eval(val_batch) self.log("val_accuracy", acc) self.eval_loss.append(loss) self.eval_accuracy.append(acc) return {"val_loss": loss, "val_accuracy": acc} def test_step(self, test_batch, batch_idx): loss, acc = self._shared_eval(test_batch) self.test_accuracy.append(acc) self.log("test_accuracy", acc, sync_dist=True, on_epoch=True) return {"test_loss": loss, "test_accuracy": acc} def _shared_eval(self, batch): x, y = batch logits = self.forward(x) loss = F.nll_loss(logits, y) acc = self.accuracy(logits, y) return loss, acc def on_validation_epoch_end(self): avg_loss = torch.stack(self.eval_loss).mean() avg_acc = torch.stack(self.eval_accuracy).mean() self.log("val_loss", avg_loss, sync_dist=True) self.log("val_accuracy", avg_acc, sync_dist=True) self.eval_loss.clear() self.eval_accuracy.clear() def configure_optimizers(self): optimizer = torch.optim.Adam(self.parameters(), lr=self.lr) return optimizer You don’t need to make any change to the definition of PyTorch Lightning model and datamodule. Define the Cofigurations for AIR LightningTrainer The LightningConfigBuilder class stores all the parameters involved in training a PyTorch Lightning module. It takes the same parameter lists as those in PyTorch Lightning. The .module() method takes a subclass of pl.LightningModule and its initialization parameters. LightningTrainer will instantiate a model instance internally in the workers’ training loop. The .trainer() method takes the initialization parameters of pl.Trainer. You can specify training configurations, loggers, and callbacks here. The .fit_params() method stores all the parameters that will be passed into pl.Trainer.fit(), including train/val dataloaders, datamodules, and checkpoint paths. The .checkpointing() method saves the configurations for a RayModelCheckpoint callback. This callback reports the latest metrics to the AIR session along with a newly saved checkpoint. The .build() method generates a dictionary that contains all the configurations in the builder. This dictionary will be passed to LightningTrainer later. Next, let’s go step-by-step to see how to convert your existing PyTorch Lightning training script to a LightningTrainer. from pytorch_lightning.callbacks import ModelCheckpoint from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig from ray.train.lightning import ( LightningTrainer, LightningConfigBuilder, LightningCheckpoint, ) def build_lightning_config_from_existing_code(use_gpu): # Create a config builder to encapsulate all required parameters. # Note that model instantiation and fitting will occur later in the LightingTrainer, # rather than in the config builder. config_builder = LightningConfigBuilder() # 1. 
define your model # model = MNISTClassifier(lr=1e-3, feature_dim=128) config_builder.module(cls=MNISTClassifier, lr=1e-3, feature_dim=128) # 2. define a ModelCheckpoint callback # checkpoint_callback = ModelCheckpoint( # monitor="val_accuracy", mode="max", save_top_k=3 # ) config_builder.checkpointing(monitor="val_accuracy", mode="max", save_top_k=3) # 3. Define a Lightning trainer # trainer = pl.Trainer( # max_epochs=10, # accelerator="cpu", # strategy="ddp", # log_every_n_steps=100, # logger=CSVLogger("logs"), # callbacks=[checkpoint_callback], # ) config_builder.trainer( max_epochs=10, accelerator="gpu" if use_gpu else "cpu", log_every_n_steps=100, logger=CSVLogger("logs"), ) # You do not need to provide the checkpoint callback and strategy here, # since LightningTrainer configures them automatically. # You can also add any other callbacks into LightningConfigBuilder.trainer(). # 4. Parameters for model fitting # trainer.fit(model, datamodule=datamodule) config_builder.fit_params(datamodule=datamodule) # Finally, compile all the configs into a dictionary for LightningTrainer lightning_config = config_builder.build() return lightning_config Now put everything together: use_gpu = True # Set it to False if you want to run without GPUs num_workers = 4 lightning_config = build_lightning_config_from_existing_code(use_gpu=use_gpu) scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu) run_config = RunConfig( name="ptl-mnist-example", storage_path="/tmp/ray_results", checkpoint_config=CheckpointConfig( num_to_keep=3, checkpoint_score_attribute="val_accuracy", checkpoint_score_order="max", ), ) trainer = LightningTrainer( lightning_config=lightning_config, scaling_config=scaling_config, run_config=run_config, ) Now fit your trainer: result = trainer.fit() print("Validation Accuracy: ", result.metrics["val_accuracy"]) result 2023-06-13 16:05:12,869 INFO worker.py:1452 -- Connecting to existing Ray cluster at address: 10.0.28.253:6379... 2023-06-13 16:05:12,877 INFO worker.py:1627 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_15dlj65vax84ljl7ayeplubryd/services?redirect_to=dashboard  2023-06-13 16:05:13,036 INFO packaging.py:347 -- Pushing file package 'gcs://_ray_pkg_488e346d50f332edaa288fdaa22b2bdc.zip' (52.65MiB) to Ray cluster... 2023-06-13 16:05:13,221 INFO packaging.py:360 -- Successfully pushed file package 'gcs://_ray_pkg_488e346d50f332edaa288fdaa22b2bdc.zip'. 2023-06-13 16:05:13,314 INFO tune.py:226 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.

Tune Status

Current time: 2023-06-13 16:05:52
Running for:  00:00:39.29
Memory:       5.5/30.9 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 1.0/32 CPUs, 4.0/4 GPUs

Trial Status

Trial name                    status      loc                 iter   total time (s)   train_loss   val_accuracy   val_loss
LightningTrainer_c0d28_00000  TERMINATED  10.0.28.253:16995     10          28.5133    0.0315991       0.970002   -12.3467
(LightningTrainer pid=16995) 2023-06-13 16:05:24,007 INFO backend_executor.py:137 -- Starting distributed worker processes: ['17232 (10.0.28.253)', '6371 (10.0.1.80)', '7319 (10.0.58.90)', '6493 (10.0.26.229)']
(RayTrainWorker pid=17232) 2023-06-13 16:05:24,966 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=4]
(RayTrainWorker pid=17232) Global seed set to 888
(RayTrainWorker pid=17232) GPU available: True, used: True
(RayTrainWorker pid=17232) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=17232) IPU available: False, using: 0 IPUs
(RayTrainWorker pid=17232) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=6371, ip=10.0.1.80) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(RayTrainWorker pid=17232)   | Name              | Type       | Params
(RayTrainWorker pid=17232) -------------------------------------------------
(RayTrainWorker pid=17232) 0 | linear_relu_stack | Sequential | 101 K
(RayTrainWorker pid=17232) 1 | accuracy          | Accuracy   | 0
(RayTrainWorker pid=17232) -------------------------------------------------
(RayTrainWorker pid=17232) 101 K     Trainable params
(RayTrainWorker pid=17232) 0         Non-trainable params
(RayTrainWorker pid=17232) 101 K     Total params
(RayTrainWorker pid=17232) 0.407     Total estimated model params size (MB)

Trial Progress

Trial name:                LightningTrainer_c0d28_00000
_report_on:                train_epoch_end
date:                      2023-06-13_16-05-50
done:                      True
epoch:                     9
experiment_tag:            0
hostname:                  ip-10-0-28-253
iterations_since_restore:  10
node_ip:                   10.0.28.253
pid:                       16995
should_checkpoint:         True
step:                      1080
time_since_restore:        28.5133
time_this_iter_s:          1.73311
time_total_s:              28.5133
timestamp:                 1686697550
train_loss:                0.0315991
training_iteration:        10
trial_id:                  c0d28_00000
val_accuracy:              0.970002
val_loss:                  -12.3467
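Once training has finished, you typically want to load the best checkpoint back into a LightningModule for evaluation or inference. The following is a minimal sketch, assuming the MNISTClassifier and datamodule defined above, that result.checkpoint is the LightningCheckpoint produced by LightningTrainer, and that LightningCheckpoint.get_model accepts the LightningModule class as shown; adapt it to your Ray version if the signature differs.

# Load the checkpointed model back into a LightningModule instance.
best_checkpoint: LightningCheckpoint = result.checkpoint
model: MNISTClassifier = best_checkpoint.get_model(MNISTClassifier)
model.eval()

# Run the restored model on one validation batch as a sanity check.
datamodule.setup()
x, y = next(iter(datamodule.val_dataloader()))
with torch.no_grad():
    preds = model(x).argmax(dim=1)
print("Sample predictions:", preds[:10].tolist())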