Cassandra Dataset Manager¶
Cassandra Dataset Manager (cdm) is a tool that makes it simple to start learning Apache Cassandra or DataStax Enterprise (DSE). It provides a framework for building and installing datasets, which can then be explored via cqlsh, DevCenter, and the Jupyter notebooks that are included with datasets. In short, the tool focuses on the following:
- Development of Datasets
- The CDM framework will provide a consistent experience for people interested in sharing public datasets.
- Installation of Datasets
- It should take 15 minutes or less for a user to go from “I want to learn” to “I’m looking at data”. The experience of loading sample data should be as simple as possible, with helpful error messages when things go wrong or a requirement is not installed.
- Visual Learning via Examples
- Several tools, such as Jupyter and Zeppelin notebooks, provide an elegant means of teaching database and data model concepts by interweaving explanatory text and executable code. By providing stable datasets through CDM, people can create tutorials, blog posts, videos, and code examples without having to create a new data model every time.
Things this is not¶
- A bulk loader
- A way for you to manage your schema in your projects
Want to get up and running? See the QuickStart
User Documentation¶
QuickStart¶
Make sure you have either open source Cassandra (2.1 or later) or DataStax Enterprise (4.8 or greater) installed.
Getting up and running with CDM is simple. In a virtualenv, run the following:
pip install cassandra-dataset-manager
You’ll have a command line utility, cdm, installed in your virtualenv’s bin directory. Update your local dataset list, then install the movielens-small dataset:
cdm update
cdm install movielens-small
You now have the movielens-small dataset installed in your local Cassandra cluster.
Next, type cqlsh to start working with the Cassandra shell.
Once cqlsh starts, type use movielens_small then desc tables to see all the tables in the schema. Type the following to read some data:
Usage¶
Installing the cdm package will set up a cdm executable.
Installing Datasets¶
cdm list will provide a list of installable datasets and their descriptions.
cdm install <dataset> will install a specific dataset. Future versions will include support for DSE Search and DSE Graph.
Updating the local database¶
cdm update will update the local database of available datasets.
Frequently Asked Questions¶
Developer Documentation¶
Guide: Creating Datasets¶
This information is relevant only to developers wishing to create their own datasets for distribution.
What is a Dataset?¶
Think of a Dataset as similar to a package managed by yum or apt. Instead of binaries and configuration files, installing a Dataset gives you a Cassandra schema, sample data, and a Jupyter notebook with tutorials on how to use that data.
Create a new project from the skeleton¶
Make sure CDM is installed. For now, a dataset can only use the Python modules that CDM itself provides; you cannot yet ship additional modules with it.
Create a new dataset with the cdm new command. It will generate a project skeleton for you. For example:
cdm new example-name
Installers are created by having a file called install.py in the top level of your dataset. The installer must subclass cdm.installer.Installer. The cdm utility will discover the Installer automatically, so the class name is somewhat arbitrary; by convention, however, it should reflect the dataset’s name.
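How cdm performs this discovery is not described here. As a rough sketch only, a tool could locate the installer by scanning a module for subclasses of the base class. The `Installer` class and the `find_installer` helper below are stand-ins written for illustration, not CDM's real code:

```python
import inspect
import types


class Installer:
    """Stand-in for cdm.installer.Installer, used only for this sketch."""


def find_installer(module):
    """Return the first Installer subclass found in `module`, or None.

    Hypothetical helper: CDM's real discovery logic may differ.
    """
    for _, obj in inspect.getmembers(module, inspect.isclass):
        if issubclass(obj, Installer) and obj is not Installer:
            return obj
    return None


class ExampleNameInstaller(Installer):
    pass


# Build a throwaway module that plays the role of a dataset's install.py.
install_module = types.ModuleType("install")
install_module.Installer = Installer  # what importing the base class leaves behind
install_module.ExampleNameInstaller = ExampleNameInstaller

found = find_installer(install_module)
print(found.__name__)
```

The base class itself is skipped, so the class name you choose has no effect on whether the installer is found.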
Download resources and setup¶
Set up your post_init() hook. This is where you should download any data and load it into memory so that it is available to all of the later import steps:
class MovieLensInstaller(Installer):
    def post_init(self):
        context = self.context
If you need to download any data (like a zip file of CSVs), you can use context.download(url), which will download the file at the URL, cache it, and return a file pointer. Caching is provided automatically.
If you download a zip file, the easiest way to access the data is with the ZipFile class from Python’s built-in zipfile module:
fp = context.download("http://files.grouplens.org/datasets/movielens/ml-100k.zip")
zf = ZipFile(file=fp)
fp = zf.open("ml-100k/u.item")
You can use the file pointers returned from ZipFile.open(name) as normal file objects. If you’re working with CSV data, it’s recommended to use the Pandas library (provided by CDM):
movies = read_csv(fp, sep="|", header=None, index_col=0, names=["id", "name", "genre"]).fillna(0)
If you’d like to include your data with your dataset (a good idea if the dataset is small), you
You can see how it’s pretty easy to use the Context to download and cache external files, then process and prepare them using Pandas.
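To try the zip-handling part without network access or CDM itself, here is a minimal standard-library sketch. It swaps Pandas for the csv module and uses an in-memory zip in place of the file pointer that context.download(url) would return. The two sample rows are made up for illustration and are not taken from the real MovieLens files:

```python
import csv
import io
from zipfile import ZipFile

# Stand-in for the file pointer returned by context.download(url):
# an in-memory zip holding one pipe-delimited member (made-up rows).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr(
        "ml-100k/u.item",
        "1|Toy Story (1995)|Animation\n2|GoldenEye (1995)|Action\n",
    )
buffer.seek(0)

with ZipFile(buffer) as zf:
    # ZipFile.open() yields bytes, so wrap it to decode text for csv.reader.
    with zf.open("ml-100k/u.item") as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        reader = csv.reader(text, delimiter="|")
        movies = [
            {"id": int(row[0]), "name": row[1], "genre": row[2]}
            for row in reader
        ]

print(movies[0])
```

The same pattern applies when the file pointer comes from a real download: open the member by name, then hand it to whichever parser you prefer.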
Set up Cassandra Schema¶
Next you’ll want to set up a schema for Cassandra. There are a few options, varying in complexity. Read up on the different options for configuring your Cassandra Schema.
Load Cassandra Data¶
Assuming you’ve loaded some data into memory in post_init(), you can now load data into your schema.
To load data, you’ll want to use the session provided by the Context:
class MyInstaller(Installer):
    def install_cassandra(self):
        context = self.context
        session = context.session
        prepared = session.prepare("INSERT INTO data (key, value) VALUES (?, ?)")
        for row in self.data:
            # Bind values are passed together as a single sequence
            session.execute(prepared, (row.key, row.value))
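If you want to exercise a loading loop like this without a running cluster, one option is to pass in a stand-in session that only records calls. The sketch below assumes that approach. `RecordingSession`, `Row`, and `load_rows` are hypothetical names invented for illustration and are not part of CDM or the Cassandra driver:

```python
from collections import namedtuple

# Made-up row type standing in for whatever post_init() loaded.
Row = namedtuple("Row", ["key", "value"])


class RecordingSession:
    """Hypothetical stand-in for context.session.

    It records calls instead of talking to Cassandra.
    """

    def __init__(self):
        self.executed = []

    def prepare(self, query):
        # A real driver returns a prepared statement object here.
        return query

    def execute(self, statement, parameters=None):
        self.executed.append((statement, tuple(parameters or ())))


def load_rows(session, rows):
    """Insert each row, passing its bind values as one tuple."""
    prepared = session.prepare("INSERT INTO data (key, value) VALUES (?, ?)")
    for row in rows:
        session.execute(prepared, (row.key, row.value))
    return len(rows)


session = RecordingSession()
count = load_rows(session, [Row("a", 1), Row("b", 2)])
print(count, session.executed[0][1])
```

Keeping the loop in a function that takes the session as an argument makes it easy to swap the stand-in for the real session from the Context.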
Provided Libraries¶
- Cassandra Driver
- The project would be useless without a driver, so it’s included. We will stay reasonably up to date with current packages. It is always made available via the Context as the session variable.
- Pandas
- Pandas is an excellent library for reading various raw formats such as CSV. It also provides facilities for data manipulation, which may be required to transform data.
- Faker
- Faker makes for easy generation of fake data. This is especially useful when you’re dealing with an incomplete data model or one that has been anonymized.
Testing¶
Testing datasets is important. This project leverages features of py.test that make it easy to test datasets.
CDM will include a tool for testing a project. This runs all of the project’s unit tests as well as tests that verify project structure and conventions:
cdm test
All tests must pass cdm test for inclusion in the official Dataset repository.
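The exact checks are not listed here. As an illustration only, a py.test-style structure check might look like the sketch below. The `missing_files` helper is hypothetical, and the only requirement it encodes is the top-level install.py described in the guide above:

```python
import tempfile
from pathlib import Path


def missing_files(dataset_dir, required=("install.py",)):
    """Return the required file names that are absent from dataset_dir.

    Hypothetical helper; the real structure checks may differ.
    """
    root = Path(dataset_dir)
    return [name for name in required if not (root / name).is_file()]


def test_dataset_has_installer():
    with tempfile.TemporaryDirectory() as tmp:
        # An empty directory is missing the installer.
        assert missing_files(tmp) == ["install.py"]
        # Once install.py exists, nothing is missing.
        (Path(tmp) / "install.py").write_text("# installer goes here\n")
        assert missing_files(tmp) == []


test_dataset_has_installer()
```

Because the function name starts with test_, py.test would also collect it automatically if the file were placed alongside your other tests.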
Cassandra Schema¶
Working with a Cassandra schema is very flexible using CDM. There are several options available.
Using a schema file¶
This is useful if you have a schema somewhere already that you want to write to disk through cqlsh, and you don’t wish to use CQLEngine models.
To easily use a schema file, make sure your installer subclasses SimpleCQLSchema first:
class MyInstaller(SimpleCQLSchema, Installer):
    pass
Put your schema in schema.cql, and it will automatically be picked up and loaded, splitting the statements on ; (semicolons).
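To make the splitting behaviour concrete, here is a minimal sketch of a naive semicolon split. The `split_statements` function is written for illustration and is not CDM's actual loader. It also shows the limitation of this approach: a function body contains its own semicolon, so a UDF definition gets cut in two.

```python
def split_statements(schema_text):
    """Split CQL text on ';' and drop empty fragments (illustrative only)."""
    return [part.strip() for part in schema_text.split(";") if part.strip()]


schema = """
CREATE TABLE movies (id uuid PRIMARY KEY, name text);
CREATE INDEX ON movies (name);
"""
statements = split_statements(schema)
print(len(statements))

# The ';' inside the function body splits this single statement in two.
udf = """
CREATE FUNCTION double_it (x int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'return x * 2;';
"""
print(len(split_statements(udf)))
```

For plain tables and indexes the naive split is enough; for anything with embedded semicolons you need one of the other schema options.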
CQLEngine Models¶
This is convenient, as you’ll frequently want to leverage CQLEngine models for validating and inserting data. We’ll use the cassandra_schema() hook to return the classes we want synced to the database.
For example, in movielens-small, we define our Movie model similar to this:
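# Imports from the cqlengine package bundled with the Cassandra driver
from cassandra.cqlengine.models import Model
from cassandra.cqlengine.columns import Date, Float, Integer, Set, Text

class Movie(Model):
    __table_name__ = 'movies'
    id = Integer(primary_key=True)
    name = Text()
    release_date = Date()
    video_release_date = Date()
    url = Text()
    avg_rating = Float()
    genres = Set(Text)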
In our installer, we return a list of table models:
class MovieLensInstaller(Installer):
    def cassandra_schema(self):
        return [Movie]
Specifying a Schema Inline¶
This will be necessary for UDAs/UDFs, as their definitions contain semicolons and so can’t simply be split on ;. A future version of CDM may include a parser to properly support this, but it’s unlikely anytime soon. Until that day comes, it’s possible to specify the schema as a list of multi-line (triple-quoted) strings:
class MovieLensInstaller(Installer):
    def cassandra_schema(self):
        statements = ["""CREATE TABLE movies
                         (id uuid primary key,
                          name text)""",
                      """CREATE CUSTOM INDEX on movies(name)
                         USING 'org.apache.cassandra.index.sasi.SASIIndex'"""]
        return statements
Mixed Mode¶
There are cases which are not handled with CQLEngine yet. Materialized views, SASI indexes, UDFs, UDAs are all difficult to express. Python allows us a lot of flexibility by allowing lists to contain objects of mixed types. We can leverage our CQLEngine models for our tables and provide fat strings for the rest of the schema:
class MovieLensInstaller(Installer):
    def cassandra_schema(self):
        statements = [Movie,
                      """CREATE CUSTOM INDEX on movies(name)
                         USING 'org.apache.cassandra.index.sasi.SASIIndex'"""]
        return statements
This is cool because we can leverage CQLEngine for our database models but still get the flexibility of using any CQL that it doesn’t support yet.
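How CDM processes a mixed list internally is not shown here. One plausible approach, sketched below under that assumption, is to dispatch on each entry's type: strings are executed as raw CQL, while model classes are synced (in CQLEngine this would be done with sync_table). The `Model` and `Movie` classes are stand-ins and `plan_schema` is a hypothetical helper:

```python
class Model:
    """Stand-in for the CQLEngine Model base class."""


class Movie(Model):
    __table_name__ = "movies"


def plan_schema(items):
    """Label each schema entry as a model to sync or raw CQL to execute.

    Hypothetical helper: it only builds a plan and runs nothing.
    """
    plan = []
    for item in items:
        if isinstance(item, str):
            plan.append(("execute", item))
        elif isinstance(item, type) and issubclass(item, Model):
            plan.append(("sync_table", item.__table_name__))
        else:
            raise TypeError("unsupported schema entry: %r" % (item,))
    return plan


schema = [
    Movie,
    "CREATE CUSTOM INDEX ON movies(name) "
    "USING 'org.apache.cassandra.index.sasi.SASIIndex'",
]
plan = plan_schema(schema)
print(plan[0], plan[1][0])
```

Order matters in a mixed list: the table model has to come before any index or view statement that refers to its table.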