Welcome to bioCADDIE DATS’s documentation

NOTE: this documentation has been replaced by the documentation at the Data Tag Suite GitHub organization

Introduction:

DATS (DAta Tag Suite) is a data description model designed to describe datasets ingested into DataMed, a data discovery prototype developed as part of the NIH Big Data to Knowledge (BD2K) bioCADDIE project.

For more information about DATS, see the DATS pre-print available on bioRxiv. For more information about DataMed, see the DataMed pre-print available on bioRxiv. For more information about the objectives of the bioCADDIE project, see the bioCADDIE White Paper.

This documentation describes the DATS model and how to use it. More details about how DATS was designed and how it relates to other models can be found in the aforementioned documents as well as in the documents accompanying each of the releases.

Table of Contents:

First Steps with DATS

This document offers an overview of the DATS model from a practical perspective, detailing how DATS may be used to document a specific dataset.

The DATS model is centered around the Dataset entity, which supports most of the relevant information about the data being observed.

The main building blocks of the DATS model are defined as “entities”, which, for convenience, may be compared to the different “sections” of information in a flat document. Each entity has a number of properties that are instantiated either as other entities or as direct entries. For the latter, information may be structured (e.g., integer, date, URI) or unstructured (string, or free text entries).

First and foremost, the Dataset entity aims to capture essential provenance information: who, when, what, why, where, and how. By answering these questions, each dataset source defines its own view of what a dataset is. The Dataset entity is also designed to declare which variables were measured and what type of data was collected.

What is the dataset about?

The nature of the information available in a dataset can be recorded via the DATS Dimension entity. It is the object to use for reporting variables measured and for which data have been collected.

The DATS Dimension object can be qualified using the DATS DataType entity.

The DATS DataType covers four aspects of a variable’s nature: type of information (what the data is about), method (how the data was generated), platform (the instrumentation, software and reagents used to generate the data), and instrument (the specific device used to generate the data).

Importantly, a Dataset may be a constitutive part of another Dataset. Each of these dataset parts can be used to describe a particular aspect of a dataset in greater detail. For instance, a dataset describing a multi-omics experiment may contain several datasets, one focusing on transcriptomics, one focusing on metabolomics, and so on.
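As a rough illustration of this nesting, the sketch below builds a multi-omics dataset whose parts are themselves datasets. The property names (title, types, creators, hasPart) come from the specification table further down; the titles, the placeholder creator, and the exact layout of each DataType (only its "information" aspect is filled in) are illustrative assumptions, not content from a real DATS record.

```python
import json


def make_dataset(title, information, parts=None):
    """Build a minimal Dataset-like dict (hypothetical helper, not DATS tooling)."""
    dataset = {
        "title": title,
        # DataType: only the 'information' aspect is filled in here.
        "types": [{"information": {"value": information}}],
        "creators": [{"fullName": "Jane Doe"}],  # placeholder person
    }
    if parts:
        # Nested datasets are attached through the 'hasPart' property.
        dataset["hasPart"] = parts
    return dataset


transcriptomics = make_dataset("Transcriptomics assay", "transcriptomics")
metabolomics = make_dataset("Metabolomics assay", "metabolomics")
multi_omics = make_dataset(
    "Multi-omics experiment", "multi-omics",
    parts=[transcriptomics, metabolomics],
)

print(json.dumps(multi_omics, indent=2))
```

Keeping each part as a complete Dataset means the same description can be reused on its own or as a member of a larger collection.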

Why was the data produced?

As a Dataset property, the “description” is a textual narrative that typically indicates the dataset’s purpose and why it was produced.

In addition, the extended DATS model can describe the Study that produced one or several related datasets, including the purpose, objective, or hypothesis that gave rise to the dataset(s) belonging to that study.

Related studies may also be grouped to constitute a series.

Tracking dataset spatial and temporal properties

Where was the dataset collected and where was it produced?

The DATS Dataset property spatialCoverage includes a description of the geography covered by the dataset and/or measured by the dataset’s dimensions or variables.

spatialCoverage is instantiated within a Place entity, which maps to the entity bearing the same name in schema.org (http://schema.org/Place), to “geoLocation” in the DataCite schema (http://schema.datacite.org/meta/kernel-4.0/) and to “Feature” in GeoJSON (https://tools.ietf.org/html/rfc7946).

When was the dataset produced?

The DATS model provides a Date object to record the key date(s) associated with a Dataset.

For each Date, users must identify its type in relation to a specific event (e.g. creation, update, validation, verification, deprecation).

This generic mechanism for providing dates and temporal information offers flexibility and extensibility. Dates may be repeated and differentiated by type, which allows new types of dates to be added when specific scenarios require them. The actual definition of the types is delegated to existing ontologies.
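The typed-date mechanism can be sketched as follows. The "date" property and its ISO 8601 format come from the specification table; the "type" key used to hold the date type is an assumed name (the table's notes refer to a dateType field), and the dates themselves are invented.

```python
from datetime import date


def make_date(day, type_label):
    """Return a Date-like dict; the 'type' key is an assumed property name."""
    return {
        "date": day.isoformat(),        # ISO 8601 string
        "type": {"value": type_label},  # Annotation; an ontology IRI could be added
    }


dates = [
    make_date(date(2016, 5, 1), "creation"),
    make_date(date(2017, 1, 15), "update"),
    make_date(date(2017, 6, 30), "update"),
]


def dates_of_type(entries, type_label):
    """Select the ISO date strings whose type matches the given label."""
    return [d["date"] for d in entries if d["type"]["value"] == type_label]


print(dates_of_type(dates, "update"))
```

Because the type is data rather than a fixed property name, a new kind of date needs no schema change, only a new type label.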

Who produced the dataset?

Using the Dataset’s “creators” property, DATS records the Person and/or Organization associated with the dataset, and supports documenting their roles (e.g., creator, curator, developer, funder, principal investigator).
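A minimal sketch of a "creators" list mixing Person and Organization entries is given below. The property names (fullName, name, roles) follow the Person and Organization rows of the specification table; the people, the organization and the helper function are hypothetical.

```python
creators = [
    # Person entries carry 'fullName'; roles are Annotation-like dicts.
    {"fullName": "Jane Doe", "roles": [{"value": "principal investigator"}]},
    {"fullName": "John Smith", "roles": [{"value": "curator"}]},
    # Organization entries carry 'name'.
    {"name": "Example Funding Agency", "roles": [{"value": "funder"}]},
]


def creators_with_role(entries, role):
    """Return the display names of creators holding the given role."""
    names = []
    for entry in entries:
        if any(r["value"] == role for r in entry.get("roles", [])):
            names.append(entry.get("fullName") or entry.get("name"))
    return names


print(creators_with_role(creators, "funder"))
```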

Where and How can the dataset be accessed?

DATS provides a comprehensive description of the ways to access a Dataset. This information is reported in the Access entity, which is part of DatasetDistribution as well as of the description of a DataRepository. It covers information such as the dataset landing page and/or access URL if available, a description of the type of access (such as download, remote access, remote service, enclave or not available), and any authorization or authentication needed to access the dataset.
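The Access entity described above might be assembled as in the sketch below. Property names (landingPage, accessURL, types, authorizations, authentications) and the list of access types come from the specification table; the helper function, its defaults and the URLs are hypothetical.

```python
# Access types listed as examples in the specification table.
ACCESS_TYPES = {
    "download", "remote access", "remote service", "enclave", "not available",
}


def make_access(landing_page, access_type, access_url=None,
                authorization="none", authentication="none"):
    """Build an Access-like dict (hypothetical helper, not DATS tooling)."""
    if access_type not in ACCESS_TYPES:
        raise ValueError(f"unknown access type: {access_type!r}")
    access = {
        "landingPage": landing_page,  # MUST in the table
        "types": [{"value": access_type}],
        "authorizations": [{"value": authorization}],
        "authentications": [{"value": authentication}],
    }
    if access_url is not None:
        access["accessURL"] = access_url  # optional direct link
    return access


access = make_access(
    "https://repository.example.org/datasets/123",  # placeholder URL
    "download",
    access_url="https://repository.example.org/datasets/123/file",
    authorization="registration",
    authentication="simple login",
)
print(access)
```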

DATS Model

DATS specifications
Entity Property Definition Value(s) Cardinality Requirement Level Relevant Competency Question(s) Notes or Example(s)
dataset identifier Primary identifiers for the dataset. IdentifiersInformation 0..n SHOULD BGUC5  
  relatedIdentifiers Related identifiers for the dataset. IdentifiersInformation 0..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the dataset. AlternateIdentifiersInformation 0..n MAY    
  title The name of the dataset, usually one sentence or short description of the dataset. string 1 MUST BGUC5 DataCite[/resource/titles];DataCite[/resource/titles/title];Schema.org[https://schema.org/headline];HCLS[(dct:title,rdf:langString)]
  types A term, ideally from a controlled terminology, identifying the dataset type or nature of the data, placing it in a typology. DataType 1..n MUST BGUC1-1;BGUC1-2;BGUC3-2;BGUC3-3;BGUC5;BGUC5-1;WPUC1;WPUC2;WPUC3;WPUC9-p7;UC1 For example: microscopy imaging, gene expression profile, genomic sequence, fMRI, pathway simulation.
  creators The person(s) or organization(s) which contributed to the creation of the dataset. Person or Organization 1..n MUST UC2  
  dates Relevant dates for the dataset, a date must be added, e.g. creation date or last modification date should be added. Date 0..n MAY    
  distributions The distribution(s) by which datasets are made available (for example: mySQL dump). DataSet Distribution 0..n SHOULD    
  dimensions The different dimensions (granular components) making up a dataset. Dimension 0..n MAY BGUC2;BGUC5-4  
  isCitedBy The relevant publication(s) describing how the dataset was produced or used. Publication 0..n MAY BGUC5-2  
  producedBy A study process which generated a given dataset, if any. Study 0..1 SHOULD    
  isAbout Different entities (biological entity, taxonomic information, disease, molecular entity, anatomical part, treatment) associated with this dataset. BiologicalEntity or TaxonomicInformation or Disease or MolecularEntity or AnatomicalPart or Treatment 0..n SHOULD    
  hasPart A Dataset that is a subset of this Dataset; Datasets declaring the ‘hasPart’ relationship are considered a collection of Datasets, the aggregation criteria could be included in the ‘description’ field. Dataset 0..n MAY    
  keywords Tags associated with the dataset, which will help in its discovery. Annotation 0..n MAY    
  acknowledges The grant(s) which funded and supported the work reported by the dataset. Grant 0..n MAY    
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
DatasetDistribution   “A specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed. (From DCAT) “       BGUC5  
  identifiers Primary identifiers for the dataset distribution. IdentifiersInformation 1..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the dataset distribution. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the dataset distribution. RelatedIdentifiersInformation 0..n MAY    
  title “The name of the dataset distribution, usually one sentence or short description of the dataset.” string 0..1 MAY    
  description A textual narrative comprised of one or more statements describing the dataset distribution. string 0..1 SHOULD    
  dates “Relevant dates for the datasets, a date must be added, e.g. creation date or last modification date should be added.” Date 1..n MUST    
  storedIn The data repository(ies) hosting the dataset. DataRepository 0..n MAY BGUC1-1;UC2 “While from the DDI perspective, every dataset may be coming from a data repository, we put a less strict requirement allowing for datasets available online and not in a repository.”
  version A release point for the dataset when applicable. string 0..1 SHOULD WPUC5-p7  
  accessModalities The information about access modality for the dataset. Access 1..n MUST    
  licenses The terms of use of the data standard. License 0..n SHOULD BGUC5-4  
  curationStatus The level of curation of the dataset distribution. Annotation 0..n MAY   “E.g. manual, automatic or both; other values such as those at https://wiki.nci.nih.gov/display/CTRPdoc/Curation+Status+Definitions+-+Include+v4.3.1”
  conformsTo A data standard whose requirements and constraints are met by the dataset. DataStandard 0..n MAY BGUC5-7;WPUC9-p7  
  format The technical format of the dataset distribution. Use the file extension or MIME type when possible. (Definition adapted from DataCite) string 0..n MAY   “e.g. PDF, XML, MPG or application/pdf, text/xml, video/mpeg”
  qualifiers “One or more characteristics of the dataset distribution (e.g. how it relates to other distributions, if the data is raw or processed, compressed or encrypted). “ Annotation or CategoryValuesPair 0..n MAY   “e.g. indicate if the distribution is isomorphic (corresponds completely with the dataset), a derivative from the dataset, or is a partial distribution of the dataset. These qualifiers can also indicate if the distribution refers to raw, processed or summarised data. It could also refer to the data being encrypted or compressed.”
  size The size of the dataset. number 0..1 MAY BGUC5-1  
  unit “The unit of measurement used to estimate the size of the dataset (e.g, petabyte). Ideally, the unit should be coming from a reference controlled terminology.” Annotation “1, if size is reported” (MUST)    
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
DataStandard   “A format, reporting guideline, terminology. It is used to indicate whether the dataset conforms to a particular community norm or specification.”       BGUC5-7;UC15;WPUC9-p7  
  identifiers Primary identifiers for the standard. IdentifiersInformation 0..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the standard. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the standard. RelatedIdentifiersInformation 0..n MAY    
  name “The name of the standard (e.g. FASTQ, CDISC STDM, ISO8601)” string 1 MUST    
  type “The nature of the information resource, ideally specified with a controlled vocabulary or ontology (e.g. model or format, vocabulary, reporting guideline).” Annotation 1 MUST WPUC9-p7  
  description A textual narrative comprised of one or more statements describing the data standard. string 0..1 SHOULD    
  licenses The terms of use of the data standard. License 0..n SHOULD BGUC5-4  
  version A release point for the repository when applicable. string 0..1 SHOULD    
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
DataRepository   A repository or catalog of datasets. It could be a primary repository or a repository that aggregates data existing in other repositories.       BGUC1-1;UC2;UC15  
  identifiers Primary identifiers for the data repository. IdentifiersInformation 0..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the data repository. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the data repository. RelatedIdentifiersInformation 0..n MAY    
  name The name of the data repository. string 1 MUST BGUC1-1;UC2  
  description A textual narrative comprised of one or more statements describing the data repository. string 0..1 SHOULD    
  dates Relevant dates for the data repository. Date 0..n MAY    
  scopes “Information about the nature of the datasets in the repository, ideally from a controlled vocabulary or ontology (e.g. transcription profile, sequence reads, molecular structure, image, DNA sequence, NMR spectra).” Annotation 0..n 1..n SPUC1;SPUC7-2  
  types “A descriptor (ideally from a controlled vocabulary) providing information about the type of repository, such as primary resource or aggregator.” Annotation 0..n SHOULD    
  licenses The terms of use of the data repository. License 0..n SHOULD BGUC5-4  
  version “A release point for the repository, when applicable.” string 0..1 SHOULD    
  publishers The person(s) or organization(s) responsible for the repository and its availability. Person or Organization 0..n SHOULD    
  aggregatorOf The DataRepositories aggregated by this repository. This property will be empty for primary repositories. DataRepository 0..n MAY    
  accessModalities The information about access modality for the data repository. Access 1..n MAY    
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
Software   “A digital entity containing sets of instructions and operation, which allows computation and operation of and by computer.”       SPUC11;SPUC10  
  identifiers Primary identifiers for the software. IdentifiersInformation 0..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the software. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the software. RelatedIdentifiersInformation 0..n MAY    
  name The name of the software. string 1 MUST    
  licenses The terms of use of the software. License 0..n SHOULD    
  isUsedBy The data acquisition activity that makes use of this software. DataAcquisition or DataAnalysis 0..n MAY    
  manufacturer The person or organisation that produced the software. Person or Organization 0..1 MAY   e.g. Adobe
  version A release point for the software. string 0..1 SHOULD    
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
Publication   A (digital) document made available by a publisher.       BGUC5-2;WPUC5-p7;WPUC10-p7;UC2  
  identifiers Primary identifiers for the publication. IdentifiersInformation 1..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the publication. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the publication. RelatedIdentifiersInformation 0..n MAY    
  title “The name of the publication, usually one sentence or short description of the publication.” string 1 SHOULD    
  dates “Relevant dates; the date of the publication must be provided.” Date 1..n SHOULD    
  type “Publication type, ideally delegated to an external vocabulary/resource.” Annotation 0..1 SHOULD   “e.g. book, article, weblog, chapter, review, correspondence”
  publicationVenue The name of the publication venue where the document is published if applicable. string 0..1 MAY    
  authorsList The list of authors made available as a string (does not allow disambiguation). string 0..1 SHOULD    
  authors The person(s) and/or organisation(s) responsible for the publication. Person or Organization 1..n SHOULD BGUC5-6  
  acknowledges The grant(s) which funded and supported the work reported by the publication. Grant 0..n SHOULD    
  licenses The terms of use of the publication. License 0..n SHOULD    
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
IdentifiersInformation   Information about the primary identifier.       BGUC5  
  identifier A code uniquely identifying an entity locally to a system or globally. string or IRI 0..n SHOULD BGUC5  
  identifierSource The identifier source represents information about the organisation/namespace responsible for minting the identifiers. It must be provided if the identifier is provided. string “1, if identifier is available” (MUST)    
AlternateIdentifiersInformation   Information about an alternate identifier (other than the primary).       BGUC5  
  alternateIdentifier An identifier or identifiers other than the primary Identifier applied to the resource being registered. (definition from DataCite) string or IRI 0..n MAY    
  alternateIdentifierSource The identifier source represents information about the organisation/namespace responsible for minting the identifiers. It must be provided if the identifier is provided. string 0..n MAY    
RelatedIdentifiersInformation   Information about a related identifier.       BGUC5  
  relatedIdentifier An identifier of a related resource. string or IRI   MUST    
  relatedIdentifierSource The identifier source represents information about the organisation/namespace responsible for minting the identifiers. It must be provided if the identifier is provided. string   (MUST)    
  relationType The type of the relationship corresponding to this identifier. string or IRI   SHOULD    
Annotation   “A pair of value (string or numeric) with a corresponding ontology term (IRI), if applicable.”       BGUC5  
  value A label or value (string or numeric) that might be associated with an ontology term. string or number 1 MUST    
  ontologyTermIRI /suggested renaming = ValueIRI The IRI of an ontology term that corresponds to value. IRI 0..1 MAY    
Date   “Information about a calendar date or timestamp indicating day, month, year and time of an event.”       BGUC5  
  date A date following the ISO8601 standard. date 1 MUST   “The type of date is specified in the dateType field, following the DataCite practice. (change cardinality from 1..n to 1)”
Access   Information about resources that provide the means to obtain an asset (a dataset or other research object).   Description of the access conditions for the object   BGUC5  
  identifiers Primary identifiers for the access information. IdentifiersInformation 1..n SHOULD    
  alternateIdentifiers Alternate identifiers for the access information. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the access information. RelatedIdentifiersInformation 0..n MAY    
  landingPage A web page that contains information about the associated dataset or other research object and a direct link to the object itself. IRI 1 MUST    
  accessURL “A URL from which the resource (dataset or other research object) can be retrieved, i.e. a direct link to the object itself.” IRI 0..1 SHOULD    
  types “Method to obtain the resource, ideally specified from a controlled vocabulary or ontology.” Annotation (see worksheet ‘Access Types’ for CV defined by WG7) 0..n SHOULD   “download, remote access, remote service, enclave, not available”
  authorizations Types of verification that accessing the resource is allowed. Authorization occurs before successful authentication and refers to the process of obtaining approval to use a data set. Ideally specified from a controlled vocabulary or ontology. Annotation (see worksheet ‘Access Types’ for CV defined by WG7) 0..n SHOULD   “none, click license, registration, dual individual, dual institution”
  authentications “Types of verification of the credentials for accessing the resource, it is the identification process at the time of access. ideally specified from a controlled vocabulary or ontology.” Annotation (see worksheet ‘Access Types’ for CV defined by WG7) 0..n SHOULD   “none, simple login, multiple login”
  licenses Terms of usage as specified on a license or data use agreement. License 0..n MAY BGUC5-1;BGUC5-4;BGUC5-8  
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
Grant   An allocated sum of funds given by a government or other organization for a particular purpose       BGUC5-6  
  identifiers Primary identifiers for the grant. IdentifiersInformation 1..n SHOULD BGUC5 (change to MUST?)
  alternateIdentifiers Alternate identifiers for the grant. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the grant. RelatedIdentifiersInformation 0..n MAY    
  name The name of the grant and its funding program. string 1 MUST    
  funds The study or dataset supported by the grant. Study or Dataset 0..n SHOULD    
  funders The person(s) or organization(s) which has awarded the funds supporting the project. (Person or Organization) and role funder 1..n MUST BGUC5-6;WPUC7-p7;WPUC8-p7;WPUC10-p7;UC1  
  awardees The person(s) or organization(s) which received the funds supporting the project. Person or Organization 0..n SHOULD    
  extraProperties Extra properties that do not fit in the previous specified attributes. ExtraProperty 0..n MAY    
License   “A legal document giving official permission to do something with a Resource. It is assumed that an external vocabulary will describe with sufficient granularity the permission for redistribution, modification, derivation, reuse, etc. and conditions for citation/acknowledgment.”       “BGUC5-4,BGUC5-8”  
  identifiers Primary identifiers for the license. IdentifiersInformation 1..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the license. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the license. RelatedIdentifiersInformation 0..n MAY    
  name The name of the license. string 1 MUST    
  version The version of the license. string 0..1 SHOULD    
  creators The person(s) or organization(s) responsible for writing the license. Person or Organization 0..n SHOULD    
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
Dimension   “A feature of an entity, i.e. an individual measurable property (both quantitative or qualitative) of the entity being observed”       BGUC2;BGUC4;BGUC5-1;BGUC5-4;PB1 “e.g. demographic characteristics, quality indicator, access statistics”
  identifiers Primary identifiers for the dimension. IdentifiersInformation 1..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the dimension. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the dimension. RelatedIdentifiersInformation 0..n MAY    
  name “The name of the dimension measured or observed during the data acquisition process, ideally from a controlled terminology.” Annotation 1 MUST “BGUC5-10,WPUC3, SPUC6,SPUC1” “e.g. signal intensity, standard deviation”
  types “A term, ideally from a controlled terminology, identifying the nature of the dimension, placing it in a typology.” Annotation 1..n MUST   “e.g. continuous, discrete, scalar, ordinal “
  partOf The dataset(s) this dimension belongs to. Dataset 1..n MUST    
  description A textual narrative comprised of one or more statements describing the dimension. string 0..1 SHOULD    
  values The actual collections of values collected for that dimension. array 0..n SHOULD BGUC2  
  unit “A reference measurement unit associated with scalar dimensions, ideally from a reference controlled terminology.” Annotation 0..1 MAY    
  isAbout “A material or a dataset, which is the object of this dimension (this dimension is about the material - e.g. the heights of the patients - or the dataset - e.g. the standard deviation or the set of outliers or a quality indicator of a dataset).” Dataset or Material 0..n MAY BGUC5-4;WPUC9-p7;PB1  
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
  information The measurements or facts that the data is about. Annotation 0..1 MAY   “e.g. gene expression, protein structure, proteomics, phenotyping.”
  method The procedure or technology used to generate the information. Annotation 0..1 MAY   “e.g. imaging, microarray, clinical trial.”
  platform “The set of instruments, software and reagents that are needed to generate the data.” Annotation 0..1 MAY   “e.g. Affymetrix, NGS, mass spectrometer type”
  instrument The specific device used to generate the data. Annotation 0..1 MAY    
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
Material   “A physical entity, part of collection or used in a study (e.g. patient)”       BGUC3-3;BGUC3-5;BGUC5;BGUC5-1;BGUC5-9;BGUC5-11;PB1;SPUC13;WPUC6-p7  
  identifiers Primary identifiers for the material. IdentifiersInformation 1..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the material. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the material. RelatedIdentifiersInformation 0..n MAY    
  name The name of the material. string 1 MUST    
  derivesFrom A material from which this material originated. Material or AnatomicalPart 0..n MAY BGUC2  
  bearerOfDisease The pathology affecting the material used in the study or referred to in the dataset (ideally from a controlled vocabulary/ontology). Disease 0..n MAY “BGUC1-1;BGUC1-2;BGUC1-3;BGUC5,BGUC5-4,BGUC5-6,BGUC5-8,BGUC-5-9,SPUC7-3,WPUC1”  
  taxonomicInformation The taxonomic information for this material (ideally specified from a controlled vocabulary/ontology). TaxonomicInformation 0..n MAY BGUC2  
  involvedInBiologicalEntity A biological process (ideally specified from a controlled vocabulary/ontology) in which the material is involved. BiologicalEntity 0..n MAY BGUC2;BGUC3-1;BGUC3-2;BGUC4;SPUC18  
  characteristics The characteristic information or attributes denoting the material. Dimension or Material 0..n MAY BGUC2  
  roles The roles played by a material. Annotation 0..n SHOULD    
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
Person   A human being.       UC2  
  identifiers Primary identifiers for the person. IdentifiersInformation 1..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the person. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the person. RelatedIdentifiersInformation 0..n MAY    
  fullName “The first name, any middle names, and surname of a person.” string 1 SHOULD    
  firstName The given name of the person. string 1 MAY    
  middleInitial The first letter of the person’s middle name. string 0..n MAY    
  lastName The person’s family name. string 1 SHOULD    
  email An electronic mail address for the person. string (format=email) 0..1 SHOULD    
  affiliations The organizations with which the person is associated. Organization 0..n SHOULD    
  roles “The roles assumed by a person, ideally from a controlled vocabulary/ontology.” Annotation 0..n MAY “(has_role author) BGUC5-6, UC2” “e.g. author, creator, contributor, awardee, submitter, researcher, patient”
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
  identifiers Primary identifiers for the organization. IdentifiersInformation 1..n SHOULD BGUC5  
  alternateIdentifiers Alternate identifiers for the organization. AlternateIdentifiersInformation 0..n MAY    
  relatedIdentifiers Related identifiers for the organization. RelatedIdentifiersInformation 0..n MAY    
  name The name of the organization. string 1 MUST    
  abbreviation “The shortname, abbreviation associated to the organization.” string 0..1 MAY    
  postalAddress “The postal, street address associated to the organization.” string 0..1 MAY    
  roles “The roles of the organization, ideally from a controlled vocabulary/ontology.” Annotation 0..n MAY UC1; SPUC5 “e.g. author, creator, contributor, awardee, submitter, researcher, patient”
  extraProperties Extra properties that do not fit in the previous specified attributes. CategoryValuesPair 0..n MAY    
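
The requirement levels in the table lend themselves to a simple completeness check. The sketch below tests a Dataset-like dictionary against the three Dataset properties marked MUST in the table (title, types, creators). The checker and the sample records are hypothetical and do not replace validation against the official DATS schemas.

```python
# Properties marked MUST for the Dataset entity in the specification table.
DATASET_MUST = ("title", "types", "creators")


def missing_must_properties(dataset):
    """List MUST properties that are absent or empty in a Dataset-like dict."""
    return [prop for prop in DATASET_MUST if not dataset.get(prop)]


complete = {
    "title": "Example gene expression profile",
    "types": [{"information": {"value": "gene expression"}}],
    "creators": [{"fullName": "Jane Doe"}],
}
incomplete = {"title": "Untitled", "types": []}

print(missing_must_properties(complete))
print(missing_must_properties(incomplete))
```

An empty list means every MUST property is present; SHOULD and MAY properties could be reported separately as warnings.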

DATS Counting things:

A recurring query case is the ability to assemble synthetic cohorts by interrogating a collection of resources or datasets based on certain characteristics. It is therefore important to be able to accurately represent or summarize such information, as well as to track relations between entities. This section illustrates how the DATS model provides the relevant mechanisms to do so.

Tracking patient and specimen relationships

Relationships between materials matter. The model must therefore be able to represent information about sample or specimen origin and patient identity, for instance in longitudinal studies and repeated-measures designs, where samples are collected or variables measured several times over the course of a study. The figure below shows the main properties of the DATS Material object, with associations to key biologically relevant entities such as diseases, taxonomic information, biological entities and anatomical parts.

_images/DATS-v2.3-Material.png

The model is designed with resources such as DO, GO and UBERON in mind, which eases integration and compatibility with biomedical ontologies.

Groups and sizes in the context of studies

For all datasets characterising “signal”, the ability to identify, list and characterise study populations matters, as does the ability to capture descriptors for ‘treatment’ or ‘perturbations’.

_images/DATS-v2.3-Study-and-Groups.png

As shown in the figure above, the DATS Study object allows groups of related materials (DATS Study Groups) to be declared and identified, and all their members to be listed. These objects can be qualified with group size properties, allowing direct querying.

Note: While the DATS model has been designed to enable granular representation, it does not necessarily follow that such granularity should always be used. Also, primary resources often cannot provide information to the extent required to perform the query case introduced at the top of the section.
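
To make the idea of querying group sizes concrete, here is a small sketch. The Study and Study Group objects are not detailed in the specification table above, so every key name used here ("studyGroups", "size", "members") is a hypothetical stand-in, and the numbers are invented.

```python
study = {
    "name": "Example longitudinal study",
    # Hypothetical layout: each group declares a size and, optionally, its members.
    "studyGroups": [
        {"name": "control", "size": 40, "members": []},
        {"name": "treated", "size": 38, "members": []},
        {"name": "pilot", "size": 5, "members": []},
    ],
}


def total_size(study_record):
    """Sum the declared sizes of all groups in a study."""
    return sum(group["size"] for group in study_record["studyGroups"])


def groups_with_at_least(study_record, minimum):
    """Name the groups whose declared size reaches the given minimum."""
    return [
        group["name"]
        for group in study_record["studyGroups"]
        if group["size"] >= minimum
    ]


print(total_size(study))
print(groups_with_at_least(study, 30))
```

Recording the size directly on the group allows such counts even when a resource cannot list individual members.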

DATS Measuring things:

This section describes the DATS objects that support the description of variables, dimensions and their relation to datasets.

_images/DATS-v2.3_Dimension-Data_Type.png

The nature of the information available in a dataset can be recorded via the DATS Dimension entity. It is the object to use for reporting variables measured and for which data have been collected.

The DATS Dimension object can be qualified using the DATS DataType entity.

The DATS DataType covers four aspects of a variable’s nature: type of information (what the data is about), method (how the data was generated), platform (the instrumentation, software and reagents used to generate the data), and instrument (the specific device used to generate the data).

Importantly, a Dataset may be a constitutive part of another Dataset. Each of these dataset parts can be used to describe a particular aspect of a dataset in greater detail. For instance, a dataset describing a multi-omics experiment may contain several datasets, one focusing on transcriptomics, one focusing on metabolomics, and so on.

DATS.Dimension: meant to be used to report what the data points in a dataset are about, their nature and their units.

DATS.Dimension should be typed (categorical, continuous).

DATS.Dimension is used from the following DATS objects:

DATS.Material.characteristics.Dimension

DATS.DataAcquisition.measures.Dimension
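The relationship between a Dimension and its DataType can be sketched with two small Python dataclasses. This is an illustration under stated assumptions, not the DATS schema itself: the four DataType facets follow the names used in the text, while the `kind` and `unit` fields and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataType:
    """The four facets named in the text; all optional in this sketch."""
    information: Optional[str] = None  # what the data is about
    method: Optional[str] = None       # how the data was generated
    platform: Optional[str] = None     # instrumentation, software, reagents
    instrument: Optional[str] = None   # the specific device used


@dataclass
class Dimension:
    """A measured variable; 'kind' is a hypothetical field for its typing."""
    name: str
    kind: str                          # "categorical" or "continuous"
    unit: Optional[str] = None
    data_type: Optional[DataType] = None

    def __post_init__(self):
        # Reject anything other than the two typings mentioned in the text.
        if self.kind not in ("categorical", "continuous"):
            raise ValueError(f"unknown dimension kind: {self.kind}")


expression = Dimension(
    name="gene expression level",
    kind="continuous",
    data_type=DataType(
        information="gene expression data",
        method="nucleic acid sequencing",
        platform="Illumina",
        instrument="HiSeq 2000",
    ),
)
print(expression.data_type.method)
```

Keeping every facet optional reflects the incremental nature of the description: a source may know only the type of information, or only the platform, and still produce a valid record.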

Dataset Distribution

Where and How (can the dataset be accessed):

  • Document DataSet Distribution options. This encompasses specifying:

    • data availability (boolean choice: available, unavailable)
    • data formats or mime-types ([terminology needs to be specified]; resource: https://github.com/lukaszsliwa/friendly_mime/blob/master/mimes.csv)
    • data access conditions
    • data compression (boolean choice: compressed, uncompressed)
    • data encryption (boolean choice: encrypted, non-encrypted)
    • data privacy protection (fully identifiable, pseudo-anonymized, fully anonymized… [terminology needs to be specified])

The image below provides a graphical overview of how to use bioCADDIE DATS objects to encode information about the availability of a dataset in the same file format from 3 distinct data repositories, each with its own access modalities.

The three INSDC sequence databases (DDBJ, SRA and ENA) exchange their data and provide the same datasets at all three sites. Let’s consider an example dataset.

The same Dataset identified by accession number DRP000443 can be accessed through the following 3 access URI pages:

While the distributions use the same Format, their accessURL and Repository differ; all three distributions nevertheless describe the same dataset.

A conceptual map detailing bioCADDIE DATS distribution for a nucleic acid sequencing dataset as mirrored by 3 INSDC repositories: NCBI SRA, EBI ENA and DDBJ.

The block below shows a snippet of a bioCADDIE DATS JSON document holding key information about dataset distribution. Note the link to access information and data file format information.
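An illustrative sketch of such a document, built in Python and serialised to JSON. It is a stand-in written under assumptions rather than an official instance: the key names (`distributions`, `formats`, `storedIn`, `access`, `accessURL`) are modelled on the terms used in the text and should be checked against the released JSON schemas, the `fastq` format is assumed for the example, and the `example.org` URLs are placeholders rather than the real access pages.

```python
import json

# Illustrative only: key names are modelled on the terms used in the text
# (Format, accessURL, Repository) and the URLs are placeholders.
REPOSITORIES = {
    "NCBI SRA": "https://example.org/sra/",
    "EBI ENA": "https://example.org/ena/",
    "DDBJ": "https://example.org/ddbj/",
}


def build_distributions(accession, data_format):
    """Build one distribution record per repository for the same dataset."""
    return [
        {
            "formats": [data_format],
            "storedIn": {"name": repository},
            "access": {"accessURL": base_url + accession},
        }
        for repository, base_url in REPOSITORIES.items()
    ]


dataset = {
    "identifier": {"identifier": "DRP000443"},
    "distributions": build_distributions("DRP000443", "fastq"),
}

print(json.dumps(dataset, indent=2))
```

The three distribution records share a format and differ only in repository and access URL, while the dataset identifier is stated once at the top level.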

Dataset Creator(s)

Who (produced the dataset):

  • Document the Person(s) or Organization(s) which contributed to the creation of the Dataset.
  • Document their roles (creator, curator, developer, funder, principal investigator… [terminology needs to be specified])
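A minimal sketch of creators and their roles, assuming plain Python dictionaries. The names, key names (`type`, `fullName`, `name`, `roles`) and role labels are hypothetical, since the role terminology is still to be specified.

```python
# Hypothetical creator records; names, key names and role labels are
# illustrative, since the role terminology is still to be specified.
creators = [
    {"type": "Person", "fullName": "Jane Doe",
     "roles": ["principal investigator"]},
    {"type": "Person", "fullName": "John Roe",
     "roles": ["curator", "developer"]},
    {"type": "Organization", "name": "Example Funding Agency",
     "roles": ["funder"]},
]


def creators_by_role(creators):
    """Map each role to the display names of the creators holding it."""
    index = {}
    for creator in creators:
        # Persons carry 'fullName', organizations carry 'name' in this sketch.
        label = creator.get("fullName") or creator.get("name")
        for role in creator.get("roles", []):
            index.setdefault(role, []).append(label)
    return index


roles = creators_by_role(creators)
print(roles["curator"])
```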

Dataset About

Describing what the dataset is about (i.e. its scope, objective and materials) and providing information about the type of data associated with the given dataset:

  • Document the nature of information available in a dataset through the Biocaddie ‘data type’ object.
A conceptual map detailing bioCADDIE DATS data-type qualifiers and data distribution descriptors.

In this context, the ‘data type’ required to annotate a Dataset should be viewed as a content type ([terminology needs to be specified]). This encompasses the nature of the signal recorded in a dataset or the information content of interest, for instance gene expression data, phenotypic data or electronic health records. Mime-type categories may also be used:

  • chemical
  • sequence
  • spectrum
  • audio
  • image
  • video
  • …

Other descriptors may be used as well, such as Biosharing, Scicrunch or re3data category/data domain descriptors.

  • Data aggregation type:

    In the context of DataMed indexing, the information obtained from repositories may correspond to datasets served individually or to collections of records. As these 2 situations represent very different metadata contexts, the bioCADDIE DATS model makes it possible to distinguish between the two cases.

  • collection (as in ‘collection of instances’)

  • singleton (as in ‘individual instance’)

  • Data refinement type:

This describes the level of data processing associated with the data available from the dataset and its distributions ([terminology needs to be specified]):

  • raw data
  • preprocessed data
  • analyzed data
  • summarized data
  • curated data
  • reannotated data
  • data privacy protection type: (applicable only to human/clinical data)

    • fully identifiable data (no privacy protection)
    • pseudo-anonymized data
    • fully anonymized data
    • no information available
  • Document the Material, object, scope and Biological Entities the dataset is about and their characteristics or properties.

  • Document the nature of intervention and Treatment applied to the Material, if any or if applicable.

  • Data Types and specific Platform

Currently, in DataMed, datasets can be searched by Data Type (e.g. Proteomics data) and/or by Platform (e.g. Illumina). DATS provides a mechanism, via the DataType object, to qualify the nature of the data collected in a Dataset. Its 4 facets/attributes make it possible to specify incrementally the type of information contained in the data and how it was produced.

  • data acquisition / method type:
    This attribute indicates the technique or technology, sometimes also known as the data modality, used to acquire the signal. For instance:
    • ‘crystallography’,
    • ‘mass spectrometry’
    • ‘nucleic acid sequencing’,
    • ‘computational simulation’
    • ‘questionnaire-based survey’
    • ‘nuclear magnetic resonance spectroscopy’
    • ‘nuclear magnetic resonance imaging’
    • ‘questionnaire’
  • platform/instrument type
    • Agilent, Bruker, Affymetrix, Illumina, SeaHorse
    • HumanHap550v3.0
    • HumanExome-12 v1.1 BeadChip
    • Sentrix Human-6 Expression BeadChip
    • SureSelect Human All Exon v2 - 44Mb
    • HiSeq 2000

Dataset Provenance

In order to proceed with indexing a data source under bioCADDIE DataMed, it is essential to provide information about the actual source of information. This means unambiguously identifying the repository and the actual material from that resource used as input to the transformation that enables processing by DataMed software agents.

This falls under the provenance information section of the DATS for DataMed.

  • identify the repository
  • document the url or filename and address of the source information
  • document the date of last access to the resource as input to the data transformation
  • document the data transformation pipeline in the DataMed infrastructure, ideally by pointing to the bioCADDIE GitHub repository.
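The four provenance items listed above can be gathered in a single record, as in the following sketch. The function, the key names and the `example.org` URLs are hypothetical and only illustrate the checklist; they are not part of the DATS schemas or of the DataMed pipeline.

```python
from datetime import date


def make_provenance(repository, source_url, last_accessed, pipeline_url):
    """Assemble the four provenance items listed above, rejecting blanks."""
    record = {
        "repository": repository,
        "source": source_url,
        "lastAccessed": last_accessed.isoformat(),
        "transformationPipeline": pipeline_url,
    }
    # Every item is considered essential in this sketch.
    missing = [key for key, value in record.items() if not value]
    if missing:
        raise ValueError("missing provenance fields: " + ", ".join(missing))
    return record


# Placeholder values for illustration only.
provenance = make_provenance(
    repository="Example Repository",
    source_url="https://example.org/export/records.xml",
    last_accessed=date(2017, 1, 31),
    pipeline_url="https://example.org/pipeline",
)
print(provenance["lastAccessed"])
```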

Frequently Asked Questions

Why are some properties (e.g. “title” and “description”) included in both Dataset and DatasetDistribution?

When designing DATS we chose to be flexible and to accept some redundancy by including properties in both Dataset and DatasetDistribution, even though in some cases it might be expected that a Dataset property should be inherited by its DatasetDistributions. We followed this approach to cover cases where repositories may have different information. For example, each DatasetDistribution could have more information in its “description” on how the distribution was produced, adding more details to the general information in the corresponding Dataset.
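One way a consumer might handle this redundancy is sketched below: use the distribution's own value when present and fall back to the dataset's value otherwise. This fallback is a convention assumed for the example; DATS does not prescribe it, and the records shown are hypothetical.

```python
def effective_value(dataset, distribution, key):
    """Prefer the distribution's own value, else fall back to the dataset's."""
    value = distribution.get(key)
    return value if value else dataset.get(key)


# Hypothetical records for illustration.
dataset = {"title": "Example dataset", "description": "General description."}
distributions = [
    {"title": "Example dataset (mirror)",
     "description": "Produced by nightly export."},
    {"title": "Example dataset (archive)"},  # no description of its own
]

descriptions = [
    effective_value(dataset, d, "description") for d in distributions
]
print(descriptions)
```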

License:

BioCADDIE DATS is licensed under Creative Commons Attribution Share-Alike 4.0.

Contributing:

If you wish to contribute to DATS and/or this documentation, please report issues in our tracker or contact us directly (agbeltran and proccaserra).

The different releases of DATS are available in the bioCADDIE Working Group 3 Github Repository, including documents and appendixes, JSON schemas, JSON-LD context files and JSON-LD instance files. Each release is preserved in the Zenodo repository and has its own persistent Digital Object Identifier (DOI). All releases in Zenodo can be accessed through the Zenodo DATS Community.
