DataLad — data management and publication multitool

Welcome to DataLad’s technical documentation. The information here targets software developers and focuses on the Python API and CLI, as well as software design, employed technologies, and key features. Comprehensive user documentation with information on installation, basic operation, support, and (advanced) use case descriptions is available in the DataLad handbook.

Content

Change log

0.19.6 (2024-02-02)

Enhancements and New Features
  • Add the “http_token” authentication mechanism which provides ‘Authentication: Token {TOKEN}’ header. PR #7551 (by @yarikoptic)

Internal

0.19.5 (2023-12-28)

Tests
  • Fix test to account for a recent change in git-annex dropping sub-second clock precision. As a result, we might not report a push of the git-annex branch since there would be none. PR #7544 (by @yarikoptic)

0.19.4 (2023-12-13)

Bug Fixes
  • Update-target detection for adjusted-mode datasets has been improved. Fixes #7507 via PR #7522 (by @mih)

  • Fix typos found by new codespell 2.2.6 and also add checking/fixing “hidden files”. PR #7530 (by @yarikoptic)

Documentation
Internal
Tests
  • Cache value of the has_symlink_capability to spare some cycles. PR #7471 (by @yarikoptic)

  • RF(TST): use setup_method and teardown_method in TestAddArchiveOptions. PR #7488 (by @yarikoptic)

  • Announce test_clone_datasets_root xfail on github osx. PR #7489 (by @yarikoptic)

  • Inform asv that there should be no warmup runs for time_remove benchmark. PR #7505 (by @yarikoptic)

  • BF(TST): Relax matching of git-annex error message about unsafe drop, which was changed in 10.20231129-18-gfd0b510573. PR #7541 (by @yarikoptic)

0.19.3 (2023-08-10)

Bug Fixes
  • Type-annotate get_status_dict and note that we can pass an Exception or a CapturedException, which is not a subclass of Exception. PR #7403 (by @yarikoptic)

  • BF: create-sibling-gitlab used to raise a TypeError when attempting a recursive operation in a dataset with uninstalled subdatasets. It now yields an impossible result instead. PR #7430 (by @adswa)

  • Pass the branch option into the recursive call within Install, for cases when install is invoked with URL(s). Fixes #7461 via PR #7463 (by @yarikoptic)

  • Allow for a reckless=ephemeral clone using a relative path for the original location. Fixes #7469 via PR #7472 (by @yarikoptic)

Documentation
  • Fix a property name and default costs described in “getting subdatasets” section of get documentation. Fixes #7458 via PR #7460 (by @mslw)

Internal
Tests
  • Disable VCR taping for some S3 tests where they fail due to known issues. PR #7467 (by @yarikoptic)

0.19.2 (2023-07-03)

Bug Fixes
  • Remove surrounding quotes in output filenames even for newer versions of git-annex. Fixes #7440 via PR #7443 (by @yarikoptic)

Documentation
  • DOC: clarify description of the “install” interface to reflect its convoluted behavior. PR #7445 (by @yarikoptic)

0.19.1 (2023-06-26)

Internal
  • Make compatible with upcoming release of git-annex (next after 10.20230407) and pass explicit core.quotepath=false to all git calls. Also added tools/find-hanged-tests helper. PR #7372 (by @yarikoptic)

Tests
  • Adjust tests for upcoming release of git-annex (next after 10.20230407) and ignore DeprecationWarning for pkg_resources for now. PR #7372 (by @yarikoptic)

0.19.0 (2023-06-14)

Enhancements and New Features
  • Address gitlab API special character restrictions. PR #7407 (by @jsheunis)

  • BF: The default layout of create-sibling-gitlab is now collection. The previous default, hierarchy, has been removed as it failed in --recursive mode in various edge cases. For single-level datasets, the outcome of collection and hierarchy is identical. PR #7410 (by @jsheunis and @adswa)

Bug Fixes
  • WTF: bring back and extend information on metadata extractors etc., and allow sections to have subsections and be selected at both levels. PR #7309 (by @yarikoptic)

  • BF: Run an actual git invocation with interactive commit config. PR #7398 (by @adswa)

Dependencies
  • Raise minimal version of tqdm (progress bars) to v.4.32.0 PR #7330 (by @mslw)

Documentation
Tests
  • Remove nose-based testing utils and possibility to test extensions using nose. PR #7261 (by @yarikoptic)

0.18.5 (2023-06-13)

Bug Fixes
  • More correct summary reporting for relaxed (no size) --annex. PR #7050 (by @yarikoptic)

  • ENH: minor tune up of addurls to be more tolerant and “informative”. PR #7388 (by @yarikoptic)

  • Ensure that data generated by timeout handlers in the asynchronous runner are accessible via the result generator, even if no other events occur. PR #7390 (by @christian-monch)

  • Do not map (leave as is) trailing / or \ in GitHub URLs. PR #7418 (by @yarikoptic)

Documentation
Internal
Tests

0.18.4 (2023-05-16)

Bug Fixes
  • Provider config files were ignored when CWD changed between different datasets during runtime. Fixes #7347 via PR #7357 (by @bpoldrack)

Documentation
  • Added a workaround for an issue with documentation theme (search function not working on Read the Docs). Fixes #7374 via PR #7385 (by @mslw)

Internal
Tests
  • Fix failing tests on CI. PR #7379 (by @yarikoptic)

    • use a sample S3 URL from the DANDI archive,

    • use our copy of an old .deb from datasets.datalad.org instead of snapshots.d.o,

    • use a specific miniconda installer for Python 3.7.

0.18.3 (2023-03-25)

Bug Fixes
  • Fixed that the get command would fail when subdataset source-candidate-templates were using the path property from .gitmodules. Also enhanced the respective documentation for the get command. Fixes #7274 via PR #7280 (by @bpoldrack)

  • Improve up-to-dateness of config reports across manager instances. Fixes #7299 via PR #7301 (by @mih)

  • BF: GitRepo.merge: do not allow merging unrelated histories unconditionally. PR #7312 (by @yarikoptic)

  • Do not render (empty) WTF report on other records. PR #7322 (by @yarikoptic)

  • Fixed a bug where changing DataLad’s log level could lead to failing git-annex calls. Fixes #7328 via PR #7329 (by @bpoldrack)

  • Fix an issue with uninformative error reporting by the datalad special remote. Fixes #7332 via PR #7333 (by @bpoldrack)

  • Fix save to not force committing into git if reference dataset is pure git (not git-annex). Fixes #7351 via PR #7355 (by @yarikoptic)

Documentation
  • Include a few previously missing commands in html API docs. Fixes #7288 via PR #7289 (by @mslw)

Internal
Tests

0.18.2 (2023-02-27)

Bug Fixes
Dependencies
Internal
  • Codespell more (CHANGELOGs etc) and remove custom CLI options from tox.ini. PR #7271 (by @yarikoptic)

Tests

0.18.1 (2023-01-16)

Bug Fixes
  • Fixes crashes on Windows where DataLad mistook git-annex 10.20221212 for a not-yet-released git-annex version and tried to use a new feature. Fixes #7248 via PR #7249 (by @bpoldrack)

Documentation
Performance
  • Integrate buffer size optimization from datalad-next, leading to significant performance improvement for status and diff. Fixes #7190 via PR #7250 (by @bpoldrack)

0.18.0 (2022-12-31)

Breaking Changes
  • Move all old-style metadata commands aggregate_metadata, search, metadata and extract-metadata, as well as the cfg_metadatatypes procedure and the old metadata extractors into the datalad-deprecated extension. The now-recommended way of handling metadata is to install the datalad-metalad extension instead. Fixes #7012 via PR #7014

  • Automatic reconfiguration of the ORA special remote when cloning from RIA stores now only applies locally rather than being committed. PR #7235 (by @bpoldrack)

Enhancements and New Features
  • A repository description can be specified with a new --description option when creating siblings using create-sibling-[gin|gitea|github|gogs]. Fixes #6816 via PR #7109 (by @mslw)

  • Make validation failure of alternative constraints more informative. Fixes #7092 via PR #7132 (by @bpoldrack)

  • Saving removed dataset content was sped-up, and reporting of types of removed content now accurately states dataset for added and removed subdatasets, instead of file. Moreover, saving previously staged deletions is now also reported. PR #6784 (by @mih)

  • The foreach-dataset command got a new possible value relpath for the --output-streams|--o-s option, to capture output and pass it through prefixed with the path to the subdataset. Very handy for, e.g., running a git grep command across subdatasets (see the sketch after this list). PR #7071 (by @yarikoptic)

  • A new config datalad.create-sibling-ghlike.extra-remote-settings.NETLOC.KEY=VALUE allows adding and/or overwriting local configuration for the created sibling by the create-sibling-<gin|gitea|github|gitlab|gogs> commands (a sketch follows this list). PR #7213 (by @matrss)

  • The siblings command does not concern the user with messages about inconsequential failure to annex-enable a remote anymore. PR #7217 (by @bpoldrack)

  • ORA special remote now allows to override its configuration locally. PR #7235 (by @bpoldrack)

  • Added a ‘ria’ special remote to provide backwards compatibility with datasets that were set up with the deprecated ria-remote. PR #7235 (by @bpoldrack)
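
The following is a small sketch of the new relpath mode described above; the option spelling and the git grep invocation are taken from the entry, but the concrete pattern and repository layout are made up for illustration:

    # run 'git grep' in every subdataset; with the 'relpath' mode each output
    # line is prefixed with the relative path of the subdataset it came from
    datalad foreach-dataset --output-streams relpath git grep -l TODO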
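
A minimal sketch of the new extra-remote-settings configuration, assuming NETLOC is the host the sibling is created on and KEY/VALUE map onto a remote.<sibling>.KEY entry for the created sibling; the concrete key and value here are hypothetical:

    # have future create-sibling-github calls targeting github.com also set
    # 'remote.<sibling>.annex-ignore=true' on the freshly created sibling
    git config --global datalad.create-sibling-ghlike.extra-remote-settings.github.com.annex-ignore true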

Bug Fixes
  • When create-sibling-ria was invoked with a sibling name of a pre-existing sibling, a duplicate key in the result record caused a crash. Fixes #6950 via PR #6952 (by @adswa)

Documentation
  • create-sibling-ria’s docstring now defines the schema of RIA URLs and clarifies internal layout of a RIA store. PR #6861 (by @adswa)

  • Move maintenance team info from issue to CONTRIBUTING. PR #6904 (by @adswa)

  • Describe specifications for a DataLad GitHub Action. PR #6931 (by @thewtex)

  • Fix capitalization of some service names. PR #6936 (by @aqw)

  • Command categories in help text are more consistently named. PR #7027 (by @aqw)

  • DOC: Add design document on Tests and CI. PR #7195 (by @adswa)

  • CONTRIBUTING.md was extended with up-to-date information on CI logging, changelog and release procedures. PR #7204 (by @yarikoptic)

Internal
  • Allow EnsureDataset constraint to handle Path instances. Fixes #7069 via PR #7133 (by @bpoldrack)

  • Use looseversion.LooseVersion as drop-in replacement for distutils.version.LooseVersion Fixes #6307 via PR #6839 (by @effigies)

  • Use –pathspec-from-file where possible instead of passing long lists of paths to git/git-annex calls. Fixes #6922 via PR #6932 (by @yarikoptic)

  • Make clone_dataset() more easily patchable by extensions and less monolithic. PR #7017 (by @mih)

  • Remove simplejson in favor of using json. Fixes #7034 via PR #7035 (by @christian-monch)

  • Fix an error in the command group names-test. PR #7044 (by @christian-monch)

  • Move eval_results() into interface.base to simplify imports for command implementations. Deprecate use from interface.utils accordingly. Fixes #6694 via PR #7170 (by @adswa)

Performance
  • Use regular dicts instead of OrderedDicts for speedier operations. Fixes #6566 via PR #7174 (by @adswa)

  • Reimplement get_submodules_() without get_content_info() for substantial performance boosts, especially for large datasets with few subdatasets. Originally proposed in PR #6942 by @mih, fixing #6940. PR #7189 (by @adswa). Complemented with PR #7220 (by @yarikoptic) to avoid O(N^2) (instead of O(N*log(N))) performance in some cases.

  • Use --include=* or --anything instead of --copies 0 to speed up get_content_annexinfo. PR #7230 (by @yarikoptic)

Tests
  • Re-enable two now-passing core tests on Windows CI. PR #7152 (by @adswa)

  • Remove the with_testrepos decorator and associated tests for it. Fixes #6752 via PR #7176 (by @adswa)

0.17.10 (2022-12-14)

Enhancements and New Features
  • Enhance concurrent invocation behavior of ThreadedRunner.run(). If possible, invocations are serialized instead of raising re-entrance runtime errors. Deadlock situations are detected and runtime errors are raised instead of deadlocking. Fixes #7138 via PR #7201 (by @christian-monch)

  • Exceptions bubbling up through the CLI are now reported including their chain of causes. Fixes #7163 via PR #7210 (by @bpoldrack)

Bug Fixes
  • BF: read RIA config from stdin instead of temporary file. Fixes #6514 via PR #7147 (by @adswa)

  • Prevent doomed annex calls on files we already know are untracked. Fixes #7032 via PR #7166 (by @adswa)

  • Comply to Posix-like clone URL formats on Windows. Fixes #7180 via PR #7181 (by @adswa)

  • Ensure that paths used in the datalad-url field of .gitmodules are posix. Fixes #7182 via PR #7183 (by @adswa)

  • Bandaids for export-to-figshare to restore functionality. PR #7188 (by @adswa)

  • Fixes hanging threads when close() or del were called in BatchedCommand instances. That could lead to hanging tests if the tests used the @serve_path_via_http() decorator. Fixes #6804 via PR #7201 (by @christian-monch)

  • Interpret file-URL path components according to the local operating system as described in RFC 8089. With this fix, datalad.network.RI('file:...').localpath returns a correct local path on Windows if the RI is constructed with a file-URL. Fixes #7186 via PR #7206 (by @christian-monch)

  • Fix a bug when retrieving several files from a RIA store via SSH, when the annex key does not contain size information. Fixes #7214 via PR #7215 (by @mslw)

  • Interface-specific (python vs CLI) doc generation for commands and their parameters was broken when brackets were used within the interface markups. Fixes #7225 via PR #7226 (by @bpoldrack)

Documentation
  • Fix documentation of Runner.run() to not accept strings. Instead, encoding must be ensured by the caller. Fixes #7145 via PR #7155 (by @bpoldrack)

Internal
Tests

0.17.9 (2022-11-07)

Bug Fixes
  • Various small fixups applied after post-release review and attempts to build the Debian package. PR #7112 (by @yarikoptic)

  • BF: Fix add-archive-contents try-finally statement by defining variable earlier. PR #7117 (by @adswa)

  • Fix RIA file URL reporting in exception handling. PR #7123 (by @adswa)

  • HTTP download treated ‘429 - too many requests’ as an authentication issue and was consequently trying to obtain credentials. Fixes #7129 via PR #7129 (by @bpoldrack)

Dependencies
  • Unrestrict pytest and pytest-cov versions. PR #7125 (by @jwodder)

  • Remove remaining references to nose and the implied requirement for building the documentation. Fixes #7100 via PR #7136 (by @bpoldrack)

Internal
  • Use datalad/release-action. Fixes #7110. PR #7111 (by @jwodder)

  • Fix all logging to use %-interpolation and not .format, sort imports in touched files, add pylint-ing for % formatting in log messages to tox -e lint. PR #7118 (by @yarikoptic)

Tests
  • Increase the upper time limit after which we assume that a process is stalling. That should reduce false positives from datalad.support.tests.test_parallel.py::test_stalling, without impacting the runtime of passing tests. PR #7119 (by @christian-monch)

  • XFAIL a check on length of results in test_gracefull_death. PR #7126 (by @yarikoptic)

  • Configure Git to allow for “file” protocol in tests. PR #7130 (by @yarikoptic)

0.17.8 (2022-10-24)

Bug Fixes
  • Prevent adding duplicate entries to .gitmodules. PR #7088 (by @yarikoptic)

  • [BF] Prevent double yielding of impossible get result Fixes #5537. PR #7093 (by @jsheunis)

  • Stop rendering the output of the internal subdatasets() call in the results of run_procedure(). Fixes #7091 via PR #7094 (by @mslw & @mih)

  • Improve handling of --existing reconfigure in create-sibling-ria: previously, the command would not make the underlying git init call for existing local repositories, leading to some configuration updates not being applied. Partially addresses https://github.com/datalad/datalad/issues/6967 via https://github.com/datalad/datalad/pull/7095 (by @mslw)

  • Ensure subprocess environments have a valid path in os.environ['PWD'], even if a Path-like object was given to the runner on subprocess creation or invocation. Fixes #7040 via PR #7107 (by @christian-monch)

  • Improved reporting when using dry-run with github-like create-sibling* commands (-gin, -gitea, -github, -gogs). The result messages will now display names of the repositories which would be created (useful for recursive operations). PR #7103 (by @mslw)

0.17.7 (2022-10-14)

Bug Fixes
Internal
  • Do not use gen4-metadata methods in datalad metadata-command. PR #7001 (by @christian-monch)

  • Revert “Remove chardet version upper limit” (introduced in 0.17.6~11^2) to bring back the upper limit <= 5.0.0 on chardet. Otherwise we can get deprecation warnings from requests. PR #7057 (by @yarikoptic)

  • Ensure that BatchedCommandError is raised if the subprocess of BatchedCommand fails or raises a CommandError. PR #7068 (by @christian-monch)

  • RF: remove unused code str-ing PurePath. PR #7073 (by @yarikoptic)

  • Update GitHub Actions action versions. PR #7082 (by @jwodder)

Tests
  • Fix broken test helpers for result record testing that would falsely pass. PR #7002 (by @bpoldrack)

0.17.6 (2022-09-21)

Bug Fixes
  • UX: push - provide specific error with details if push failed due to permission issue. PR #7011 (by @yarikoptic)

  • Fix datalad --help to not have Global options empty with Python 3.10 and to list options in the “options:” section. PR #7028 (by @yarikoptic)

  • Let create touch the dataset root, if not saving in parent dataset. PR #7036 (by @mih)

  • Let get_status_dict() use exception message if none is passed. PR #7037 (by @mih)

  • Make choices for status|diff --annex and status|diff --untracked visible. PR #7039 (by @mih)

  • push: Assume 0 bytes pushed if git-annex does not provide bytesize. PR #7049 (by @yarikoptic)

Internal
Tests
  • Allow for any 2 from first 3 to be consumed in test_gracefull_death. PR #7041 (by @yarikoptic)


0.17.5 (Fri Sep 02 2022)

Bug Fix
Authors: 3

0.17.4 (Tue Aug 30 2022)

Bug Fix
  • BF: make logic more consistent for files=[] argument (which is False but not None) #6976 (@yarikoptic)

  • Run pytests in parallel (-n 2) on appveyor #6987 (@yarikoptic)

  • Add workflow for autogenerating changelog snippets #6981 (@jwodder)

  • Provide /dev/null (b:\nul on Windows) instead of empty string as a git-repo to avoid reading local repo configuration #6986 (@yarikoptic)

  • RF: call_from_parser - move code into “else” to simplify reading etc #6982 (@yarikoptic)

  • BF: if early attempt to parse resulted in error, setup subparsers #6980 (@yarikoptic)

  • Run pytests in parallel (-n 2) on Travis #6915 (@yarikoptic)

  • Send one character (no newline) to stdout in protocol test to guarantee a single “message” and thus a single custom value #6978 (@christian-monch)

Tests
Authors: 3

0.17.3 (Tue Aug 23 2022)

Bug Fix
  • BF: git_ignore_check do not overload possible value of stdout/err if present #6937 (@yarikoptic)

  • DOCfix: fix docstring of GeneratorStdOutErrCapture to say that it treats both stdout and stderr identically #6930 (@yarikoptic)

  • Explain purpose of create-sibling-ria’s --post-update-hook #6958 (@mih)

  • ENH+BF: get_parent_paths - make / into sep option and consistently use “/” as path separator #6963 (@yarikoptic)

  • BF(TEMP): use git-annex from neurodebian -devel to gain fix for bug detected with datalad-crawler #6965 (@yarikoptic)

  • BF(TST): make tests use path helper for Windows “friendliness” of the tests #6955 (@yarikoptic)

  • BF(TST): prevent auto-upgrade of “remote” test sibling, do not use local path for URL #6957 (@yarikoptic)

  • Forbid drop operation from symlink’ed annex (e.g. due to being cloned with --reckless=ephemeral) to prevent data-loss #6959 (@mih)

  • Acknowledge git-config comment chars #6944 (@mih @yarikoptic)

  • Minor tuneups to please updated codespell #6956 (@yarikoptic)

  • TST: Add a testcase for #6950 #6957 (@adswa)

  • BF+ENH(TST): fix typo in code of wtf filesystems reports #6920 (@yarikoptic)

  • DOC: Datalad -> DataLad #6937 (@aqw)

  • BF: fix typo which silently prevented showing details of filesystems #6930 (@yarikoptic)

  • BF(TST): allow for an annex repo version to upgrade if running in adjusted branches #6927 (@yarikoptic)

  • RF extensions github action to centralize configuration for extensions etc, use pytest for crawler #6914 (@yarikoptic)

  • BF: travis - mark our directory as safe to interact with as root #6919 (@yarikoptic)

  • BF: do not pretend we know what repo version git-annex would upgrade to #6902 (@yarikoptic)

  • BF(TST): do not expect log message for guessing Path to be possibly a URL on windows #6911 (@yarikoptic)

  • ENH(TST): Disable coverage reporting on travis while running pytest #6898 (@yarikoptic)

  • RF: just rename internal variable from unclear “op” to “io” #6907 (@yarikoptic)

  • DX: Demote loglevel of message on url parameters to DEBUG while guessing RI #6891 (@adswa @yarikoptic)

  • Fix and expand datalad.runner type annotations #6893 (@christian-monch @yarikoptic)

  • Use pytest to test datalad-metalad in test_extensions-workflow #6892 (@christian-monch)

  • Let push honor multiple publication dependencies declared via siblings #6869 (@mih @yarikoptic)

  • ENH: upgrade versioneer from versioneer-0.20.dev0 to versioneer-0.23.dev0 #6888 (@yarikoptic)

  • ENH: introduce typing checking and GitHub workflow #6885 (@yarikoptic)

  • RF,ENH(TST): future proof testing of git annex version upgrade + test annex init on all supported versions #6880 (@yarikoptic)

  • ENH(TST): test against supported git annex repo version 10 + make it a full sweep over tests #6881 (@yarikoptic)

  • BF: RF f-string uses in logger to %-interpolations #6886 (@yarikoptic)

  • Merge branch ‘bf-sphinx-5.1.0’ into maint #6883 (@yarikoptic)

  • BF(DOC): workaround for #10701 of sphinx in 5.1.0 #6883 (@yarikoptic)

  • Clarify confusing INFO log message from get() on dataset installation #6871 (@mih)

  • Protect against failing to load a command interface from an extension #6879 (@mih)

  • Support unsetting config via datalad -c :<name> (see the sketch after this list) #6864 (@mih)

  • Fix DOC string typo in the path within AnnexRepo.annexstatus, and replace with proper sphinx reference #6858 (@christian-monch)

  • Improved support for saving typechanges #6793 (@mih)
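
A small sketch of the unsetting syntax mentioned above; the configuration item used is only an example:

    # run 'datalad wtf' with 'datalad.log.timestamp' unset for this invocation only
    datalad -c :datalad.log.timestamp wtf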

Pushed to maint
  • BF: Remove duplicate ds key from result record (@adswa)

  • DOC: fix capitalization of service names (@aqw)

Tests
  • BF(TST,workaround): just xfail failing archives test on NFS #6912 (@yarikoptic)

Authors: 5

0.17.2 (Sat Jul 16 2022)

Bug Fix
  • BF(TST): do proceed to proper test for error being caught for recent git-annex on windows with symlinks #6850 (@yarikoptic)

  • Addressing problem testing against python 3.10 on Travis (skip more annex versions) #6842 (@yarikoptic)

  • XFAIL test_runner_parametrized_protocol on python3.8 when getting duplicate output #6837 (@yarikoptic)

  • BF: Make create’s check for procedures work with several again #6841 (@adswa)

  • Support older pytests #6836 (@jwodder)

Authors: 3

0.17.1 (Mon Jul 11 2022)

Bug Fix
  • DOC: minor fix - consistent DataLad (not Datalad) in docs and CHANGELOG #6830 (@yarikoptic)

  • DOC: fixup/harmonize Changelog for 0.17.0 a little #6828 (@yarikoptic)

  • BF: use --python-match minor option in new datalad-installer release to match outside version of Python #6827 (@christian-monch @yarikoptic)

  • Do not quote paths for ssh >= 9 #6826 (@christian-monch @yarikoptic)

  • Suppress DeprecationWarning to allow for distutils to be used #6819 (@yarikoptic)

  • RM(TST): remove testing of datalad.test which was removed from 0.17.0 #6822 (@yarikoptic)

  • Avoid import of nose-based tests.utils, make skip_if_no_module() and skip_if_no_network() allowed at module level #6817 (@jwodder)

  • BF(TST): use higher level asyncio.run instead of asyncio.get_event_loop in test_inside_async #6808 (@yarikoptic)

Authors: 3

0.17.0 (Thu Jul 7 2022) – pytest migration

Enhancements and new features
  • “log” progress bar now reports about starting a specific action as well. #6756 (by @yarikoptic)

  • Documentation and behavior of traceback reporting for log messages via DATALAD_LOG_TRACEBACK was improved to yield a more compact report. The documentation for this feature has been clarified. #6746 (by @mih)

  • datalad unlock gained a progress bar. #6704 (by @adswa)

  • When create-sibling-gitlab is called on non-existing subdatasets or paths it now returns an impossible result instead of no feedback at all. #6701 (by @adswa)

  • datalad wtf includes a report on file system types of commonly used paths. #6664 (by @adswa)

  • Use next generation metadata code in search, if it is available. #6518 (by @christian-monch)

Deprecations and removals
  • Remove unused and untested log helpers NoProgressLog and OnlyProgressLog. #6747 (by @mih)

  • Remove unused sorted_files() helper. #6722 (by @adswa)

  • Discontinued the value stdout for use with the config variable datalad.log.target as its use would inevitably break special remote implementations. #6675 (by @bpoldrack)

  • AnnexRepo.add_urls() is deprecated in favor of AnnexRepo.add_url_to_file() or a direct call to AnnexRepo.call_annex(). #6667 (by @mih)

  • datalad test command and supporting functionality (e.g., datalad.test) were removed. #6273 (by @jwodder)

Bug Fixes
  • export-archive does not rely on normalize_path() methods anymore and became more robust when called from subdirectories. #6745 (by @adswa)

  • Sanitize keys before checking content availability to ensure that the content availability of files with URL- or custom backend keys is correctly determined and marked. #6663 (by @adswa)

  • Ensure saving a new subdataset to a superdataset yields a valid .gitmodules record regardless of whether and how a path constraint is given to the save() call. Fixes #6547 #6790 (by @mih)

  • save now repairs annex symlinks broken by a git-mv operation prior to recording a new dataset state. Fixes #4967 #6795 (by @mih)

Documentation
  • API documentation for log helpers, like log_progress() is now included in the renderer documentation. #6746 (by @mih)

  • New design document on progress reporting. #6734 (by @mih)

  • Explain downstream consequences of using --fast option in addurls. #6684 (by @jdkent)

Internal
  • Inline code of create-sibling-ria has been refactored to an internal helper to check for siblings with particular names across dataset hierarchies in datalad-next, and is reintroduced into core to modularize the code base further. #6706 (by @adswa)

  • get_initialized_logger now lets a given logtarget take precedence over datalad.log.target. #6675 (by @bpoldrack)

  • Many uses of deprecated call options were replaced with the recommended ones. #6273 (by @jwodder)

  • Get rid of the asyncio import by defining a few no-op methods from asyncio.protocols.SubprocessProtocol directly in WitlessProtocol. #6648 (by @yarikoptic)

  • Consolidate GitRepo.remove() and AnnexRepo.remove() into a single implementation. #6783 (by @mih)

Tests
  • Discontinue use of with_testrepos decorator other than for the deprecation cycle for nose. #6690 (by @mih @bpoldrack) See #6144 for full list of changes.

  • Remove usage of deprecated AnnexRepo.add_urls in tests. #6683 (by @bpoldrack)

  • Minimalistic (adapters, no assert changes, etc) migration from nose to pytest. Support functionality possibly used by extensions and relying on nose helpers is left in place to avoid affecting their run time and defer migration of their test setups. #6273 (by @jwodder)

Authors: 7
  • Yaroslav Halchenko (@yarikoptic)

  • Michael Hanke (@mih)

  • Benjamin Poldrack (@bpoldrack)

  • Adina Wagner (@adswa)

  • John T. Wodder (@jwodder)

  • Christian Mönch (@christian-monch)

  • James Kent (@jdkent)

0.16.7 (Wed Jul 06 2022)

Bug Fix
Pushed to maint
  • Make sure a subdataset is saved with a complete .gitmodules record (@mih)

Authors: 5

0.16.6 (Tue Jun 14 2022)

Bug Fix
Authors: 2

0.16.5 (Wed Jun 08 2022)

Bug Fix
  • BF: push to github - remove datalad-push-default-first config only in non-dry run to ensure we push default branch separately in next step #6750 (@yarikoptic)

  • In addition to default (system) ssh version, report configured ssh; fix ssh version parsing on Windows #6729 (@yarikoptic)

Authors: 1

0.16.4 (Thu Jun 02 2022)

Bug Fix
  • BF(TST): RO operations - add test directory into git safe.directory #6726 (@yarikoptic)

  • DOC: fixup of docstring for skip_ssh #6727 (@yarikoptic)

  • DOC: Set language in Sphinx config to en #6727 (@adswa)

  • BF: Catch KeyErrors from unavailable WTF infos #6712 (@adswa)

  • Add annex.private to ephemeral clones. That would make git-annex not assign shared (in git-annex branch) annex uuid. #6702 (@bpoldrack @adswa)

  • BF: require argcomplete version at least 1.12.3 to test/operate correctly #6693 (@yarikoptic)

  • Replace Zenodo DOI with JOSS for due credit #6725 (@adswa)

Authors: 3

0.16.3 (Thu May 12 2022)

Bug Fix
  • No change for a PR to trigger release #6692 (@yarikoptic)

  • Sanitize keys before checking content availability to ensure correct value for keys with URL or custom backend #6665 (@adswa @yarikoptic)

  • Change a key-value pair in drop result record #6625 (@mslw)

  • Link docs of datalad-next #6677 (@mih)

  • Fix GitRepo.get_branch_commits_() to handle branch names conflicts with paths #6661 (@mih)

  • OPT: AnnexJsonProtocol - avoid dragging possibly long data around #6660 (@yarikoptic)

  • Remove two too-prominent create() INFO log messages that duplicate DEBUG log and harmonize some other log messages #6638 (@mih @yarikoptic)

  • Remove unsupported parameter create_sibling_ria(existing=None) #6637 (@mih)

  • Add released plugin to .autorc to annotate PRs on when released #6639 (@yarikoptic)

Authors: 4

0.16.2 (Thu Apr 21 2022)

Bug Fix
  • Demote (to level 1 from DEBUG) and speed-up API doc logging (parseParameters) #6635 (@mih)

  • Factor out actual data transfer in push #6618 (@christian-monch)

  • ENH: include version of datalad in tests teardown Versions: report #6628 (@yarikoptic)

  • MNT: Require importlib-metadata >=3.6 for Python < 3.10 for entry_points taking kwargs #6631 (@effigies)

  • Factor out credential handling of create-sibling-ghlike #6627 (@mih)

  • BF: Fix wrong key name of annex’ JSON records #6624 (@bpoldrack)

Pushed to maint
Authors: 5

0.16.1 (Fri Apr 8 2022) – April Fools’ Release

  • Fixes forgotten changelog in docs

0.16.0 (Fri Apr 8 2022) – Spring cleaning!

Enhancements and new features
  • A new set of create-sibling-* commands reimplements the GitHub-platform support of create-sibling-github and adds support to interface three new platforms in a unified fashion: GIN (create-sibling-gin), GOGS (create-sibling-gogs), and Gitea (create-sibling-gitea). All commands rely on personal access tokens only for authentication, allow for specifying one of several stored credentials via a uniform --credential parameter, and support a uniform --dry-run mode for testing without network. #5949 (by @mih)

  • create-sibling-github now supports direct specification of organization repositories via a [<org>/]repo syntax #5949 (by @mih)

  • create-sibling-gitlab gained a --dry-run parameter to match the corresponding parameters in create-sibling-{github,gin,gogs,gitea} #6013 (by @adswa)

  • The --new-store-ok parameter of create-sibling-ria only creates new RIA stores when explicitly provided #6045 (by @adswa)

  • The default performance of the status() and diff() commands is improved by up to 700% by removing file-type evaluation as a default operation and simplifying the type reporting rule #6097 (by @mih)

  • drop() and remove() were reimplemented in full, conceptualized as the antagonist commands to get() and clone(). A new, harmonized set of parameters (--what ['filecontent', 'allkeys', 'datasets', 'all'], --reckless ['modification', 'availability', 'undead', 'kill']) simplifies their API. Both commands include additional safeguards. uninstall is replaced with a thin shim command around drop() (see the sketch after this list of enhancements) #6111 (by @mih)

  • add_archive_content() was refactored into a dataset method and gained progress bars #6105 (by @adswa)

  • The datalad and datalad-archives special remotes have been reimplemented based on AnnexRemote #6165 (by @mih)

  • The result_renderer() semantics were decomplexified and harmonized. The previous default result renderer was renamed to generic. #6174 (by @mih)

  • get_status_dict learned to include exit codes in the case of CommandErrors #5642 (by @yarikoptic)

  • datalad clone can now pass options to git-clone, adding support for cloning specific tags or branches, giving siblings names other than origin, and exposing git clone’s optimization arguments #6218 (by @kyleam and @mih)

  • Inactive BatchedCommands are cleaned up #6206 (by @jwodder)

  • export-archive-ora learned to filter files exported to 7z archives #6234 (by @mih and @bpinsard)

  • datalad run learned to glob recursively #6262 (by @AKSoo)

  • The ORA remote learned to recover from interrupted uploads #6267 (by @mih)

  • A new threaded runner with support for timeouts and generator-based subprocess communication is introduced and used in BatchedCommand and AnnexRepo #6244 (by @christian-monch)

  • A new switch allows enabling librarymode and querying for the effective API in use #6213 (by @mih)

  • run and rerun now support parallel jobs via --jobs #6279 (by @AKSoo)

  • A new foreach-dataset plumbing command allows running commands on each (sub)dataset, similar to git submodule foreach #5517 (by @yarikoptic)

  • The dataset parameter is not restricted to only locally resolvable file-URLs anymore #6276 (by @christian-monch)

  • DataLad’s credential system is now able to query git-credential by specifying credential type git in the respective provider configuration #5796 (by @bpoldrack)

  • DataLad now comes with a git credential helper git-credential-datalad allowing Git to query DataLad’s credential system #5796 (by @bpoldrack and @mih)

  • The new runner now allows for multiple threads #6371 (by @christian-monch)

  • A new configuration command provides an interface to manipulate and query the DataLad configuration (see the sketch after this list of enhancements). #6306 (by @mih)

    • Unlike the global Python-only datalad.cfg or dataset-specific Dataset.config configuration managers, this command offers a uniform API across the Python and the command line interfaces.

    • This command was previously available in the mihextras extension as x-configuration, and has been merged into the core package in an improved version. #5489 (by @mih)

    • In its default dump mode, the command provides an annotated list of the effective configuration after considering all configuration sources, including hints on additional configuration settings and their supported values.

  • The command line interface help-reporting has been sped up by ~20% #6370 #6378 (by @mih)

  • ConfigManager now supports reading committed dataset configuration in bare repositories. Analogous to reading .datalad/config from a worktree, blob:HEAD:.datalad/config is read (e.g., the config committed in the default branch). The support includes reload() change detection using the gitsha of this file. The behavior for non-bare repositories is unchanged. #6332 (by @mih)

  • The CLI help generation has been sped up, and now also supports the completion of parameter values for a fixed set of choices #6415 (by @mih)

  • Individual command implementations can now declare a specific “on-failure” behavior by defining Interface.on_failure to be one of the supported modes (stop, continue, ignore). Previously, such a modification was only possible on a per-call basis. #6430 (by @mih)

  • The run command changed its default “on-failure” behavior from continue to stop. This change prevents the execution of a command in case a declared input can not be obtained. Previously, only an error result was yielded (and run eventually yielded a non-zero exit code or an IncompleteResultsException), but the execution proceeded and potentially saved a dataset modification despite incomplete inputs, in case the command succeeded. This previous default behavior can still be achieved by calling run with the equivalent of --on-failure continue #6430 (by @mih)

  • The run command now provides readily executable, API-specific instructions on how to save the results of a command execution that failed expectedly #6434 (by @mih)

  • create-sibling --since=^ mode will now be as fast as push --since=^ to figure out for which subdatasets to create siblings #6436 (by @yarikoptic)

  • When file names contain illegal characters or reserved file names that are incompatible with Windows systems, a configurable check for save (datalad.save.windows-compat-warning) will either do nothing (none), emit an incompatibility warning (warning, default), or cause save to error (error); see the configuration sketch after this list #6291 (by @adswa)

  • Improve responsiveness of datalad drop in datasets with a large annex. #6580 (by @christian-monch)

  • save code might operate faster on heavy file trees #6581 (by @yarikoptic)

  • Removed a per-file overhead cost for ORA when downloading over HTTP #6609 (by @bpoldrack)

  • A new module datalad.support.extensions offers the utility functions register_config() and has_config() that allow extension developers to announce additional configuration items to the central configuration management. #6601 (by @mih)

  • When operating in a dirty dataset, export-to-figshare now yields an impossible result instead of raising a RuntimeError #6543 (by @adswa)

  • Loading DataLad extension packages has been sped-up leading to between 2x and 4x faster run times for loading individual extensions and reporting help output across all installed extensions. #6591 (by @mih)

  • Introduces the configuration key datalad.ssh.executable. This key allows specifying an ssh-client executable that should be used by datalad to establish ssh-connections. The default value is ssh unless on a Windows system where $WINDIR\System32\OpenSSH\ssh.exe exists. In this case, the value defaults to $WINDIR\System32\OpenSSH\ssh.exe. #6553 (by @christian-monch)

  • create-sibling should perform much faster in case of a --since specification, since it would consider only submodules related to the changes since that point. #6528 (by @yarikoptic)

  • A new configuration setting datalad.ssh.try-use-annex-bundled-git=yes|no can be used to influence the default remote git-annex bundle sensing for SSH connections. This was previously done unconditionally for any call to datalad sshrun (which is also used for any SSH-related Git or git-annex functionality triggered by DataLad-internal processing) and could incur a substantial per-call runtime cost. The new default is to not perform this sensing, because for, e.g., use as GIT_SSH_COMMAND there is no expectation to have a remote git-annex installation, and even with an existing git-annex/Git bundle on the remote, it is not certain that the bundled Git version is to be preferred over any other Git installation in a user’s PATH. #6533 (by @mih)

  • run now yields a result record immediately after executing a command. This allows callers to use the standard --on-failure switch to control whether dataset modifications will be saved for a command that exited with an error. #6447 (by @mih)
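
A sketch of the harmonized drop parameters mentioned in the list above; the path and the chosen modes are illustrative only:

    # drop all annexed content and uninstall the dataset itself, skipping the
    # availability safeguard (i.e. not insisting on another copy existing elsewhere)
    datalad drop --what all --reckless availability path/to/dataset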
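
A brief sketch of the new configuration command; dump is the documented default mode, while the get/set calls and the item name are assumptions for illustration:

    # annotated dump of the effective configuration from all sources
    datalad configuration dump
    # query and modify a single item
    datalad configuration get datalad.log.level
    datalad configuration set datalad.log.level=debug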
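
Illustrative settings for some of the configuration items introduced in this release; the keys and accepted values come from the entries above, and the chosen values are examples rather than recommendations:

    git config --global datalad.save.windows-compat-warning error
    git config --global datalad.ssh.executable /usr/bin/ssh
    git config --global datalad.ssh.try-use-annex-bundled-git yes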

Deprecations and removals
  • The --pbs-runner commandline option (deprecated in 0.15.0) was removed #5981 (by @mih)

  • The dependency to PyGithub was dropped #5949 (by @mih)

  • create-sibling-github’s credential handling was trimmed down to only allow personal access tokens, because GitHub discontinued user/password based authentication #5949 (by @mih)

  • create-sibling-gitlab’s --dryrun parameter is deprecated in favor of --dry-run #6013 (by @adswa)

  • Internal obsolete GitRepo.*_submodule methods were moved to datalad-deprecated #6010 (by @mih)

  • datalad/support/versions.py is unused in DataLad core and removed #6115 (by @yarikoptic)

  • Support for the undocumented datalad.api.result-renderer config setting has been dropped #6174 (by @mih)

  • Undocumented use of result_renderer=None is replaced with result_renderer='disabled' #6174 (by @mih)

  • remove’s --recursive argument has been deprecated #6257 (by @mih)

  • The use of the internal helper get_repo_instance() is discontinued and deprecated #6268 (by @mih)

  • Support for Python 3.6 has been dropped (#6286 (by @christian-monch) and #6364 (by @yarikoptic))

  • All but one Singularity recipe flavor have been removed due to their limited value with the end of life of Singularity Hub #6303 (by @mih)

  • All code in module datalad.cmdline was (re)moved, only datalad.cmdline.helpers.get_repo_instance is kept for a deprecation period (by @mih)

  • datalad.interface.common_opts.eval_default has been deprecated. All (command-specific) defaults for common interface parameters can be read from Interface class attributes #6391 (by @mih)

  • Remove unused and untested datalad.interface.utils helpers cls2cmdlinename and path_is_under #6392 (by @mih)

  • An unused code path for result rendering was removed from the CLI main() #6394 (by @mih)

  • create-sibling will now require "^" instead of an empty string for the since option #6436 (by @yarikoptic)

  • run no longer raises a CommandError exception for failed commands, but yields an error result that includes a superset of the information provided by the exception. This change impacts command line usage insofar as the exit code of the underlying command is no longer relayed as the exit code of the run command call – although run continues to exit with a non-zero exit code in case of an error. For Python API users, the nature of the raised exception changes from CommandError to IncompleteResultsError, and the exception handling is now configurable using the standard on_failure command argument. The original CommandError exception remains available via the exception property of the newly introduced result record for the command execution, and this result record is available via IncompleteResultsError.failed, if such an exception is raised (see the sketch after this list). #6447 (by @mih)

  • Custom cast helpers were removed from datalad core and migrated to a standalone repository https://github.com/datalad/screencaster #6516 (by @adswa)

  • The bundled parameter of get_connection_hash() is now ignored and will be removed with a future release. #6532 (by @mih)

  • BaseDownloader.fetch() is logging download attempts on DEBUG (previously INFO) level to avoid polluting output of higher-level commands. #6564 (by @mih)
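
A sketch of restoring the previous run behavior for a single call; the input path and the executed command are made up:

    # proceed and save results even if a declared input cannot be obtained
    datalad run --on-failure continue --input data/raw.csv "python analyze.py"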

Bug Fixes
  • create-sibling-gitlab erroneously overwrote existing sibling configurations. A safeguard will now prevent overwriting and exit with an error result #6015 (by @adswa)

  • create-sibling-gogs now relays HTTP500 errors, such as “no space left on device” #6019 (by @mih)

  • annotate_paths() is removed from the last parts of code base that still contained it #6128 (by @mih)

  • add_archive_content() doesn’t crash with --key and --use-current-dir anymore #6105 (by @adswa)

  • run-procedure now returns an error result when a non-existent procedure name is specified #6143 (by @mslw)

  • A fix for a silent failure of download-url --archive when extracting the archive #6172 (by @adswa)

  • Uninitialized AnnexRepos can now be dropped #6183 (by @mih)

  • Instead of raising an error, the formatters tests are skipped when the formatters module is not found #6212 (by @adswa)

  • create-sibling-gin does not disable git-annex availability on Gin remotes anymore #6230 (by @mih)

  • The ORA special remote messaging is fixed to not break the special remote protocol anymore and to better relay messages from exceptions to communicate underlying causes #6242 (by @mih)

  • A keyring.delete() call was fixed to not call an uninitialized private attribute anymore #6253 (by @bpoldrack)

  • An erroneous placement of result keyword arguments into a format() method instead of get_status_dict() of create-sibling-ria has been fixed #6256 (by @adswa)

  • status, run-procedure, and metadata are no longer swallowing result-related messages in renderers #6280 (by @mih)

  • uninstall now recommends the new --reckless parameter instead of the deprecated --nocheck parameter when reporting hints #6277 (by @adswa)

  • download-url learned to handle Path objects #6317 (by @adswa)

  • Restore default result rendering behavior broken by Key interface documentation #6394 (by @mih)

  • Fix a broken check for file presence in the ConfigManager that could have caused a crash in rare cases when a config file is removed during the process runtime #6332 (by @mih)

  • ConfigManager.get_from_source() now accesses the correct information when using the documented source='local', avoiding a crash #6332 (by @mih)

  • run no longer lets the internal call to save render its results unconditionally; instead, the parameterization of run determines the effective rendering format. #6421 (by @mih)

  • Remove an unnecessary and misleading warning from the runner #6425 (by @christian-monch)

  • A number of commands stopped double-reporting results #6446 (by @adswa)

  • create-sibling-ria no longer creates an annex/objects directory in-store, when called with --no-storage-sibling. #6495 (by @bpoldrack )

  • Improve error message when an invalid URL is given to clone. #6500 (by @mih)

  • DataLad declares a minimum version dependency to keyring >= 20.0 to ensure that token-based authentication can be used. #6515 (by @adswa)

  • ORA special remote tries to obtain permissions when dropping a key from a RIA store rather than just failing. Thus, having the same permissions in the store’s object trees as a tree directly managed by git-annex works just fine now. #6493 (by @bpoldrack)

  • require_dataset() now uniformly raises NoDatasetFound when no dataset was found. Implementations that catch the previously documented InsufficientArgumentsError or the actually raised ValueError will continue to work, because NoDatasetFound is derived from both types. #6521 (by @mih)

  • Keyboard-interactive authentication is now possible with non-multiplexed SSH connections (i.e., when no connection sharing is possible, due to lack of socket support, for example on Windows). Previously, it was disabled forcefully by DataLad for no valid reason. #6537 (by @mih)

  • Remove duplicate exception type in reporting of top-level CLI exception handler. #6563 (by @mih)

  • Fixes DataLad’s parsing of git-annex’ reporting on unknown paths depending on its version and the value of the annex.skipunknown config. #6550 (by @bpoldrack)

  • Fix ORA special remote not properly reporting on HTTP failures. #6535 (by @bpoldrack)

  • ORA special remote didn’t show per-file progress bars when downloading over HTTP #6609 (by @bpoldrack)

  • save can now commit the change where a file becomes a directory containing a file staged for commit. #6581 (by @yarikoptic)

  • create-sibling will no longer create siblings for not yet saved new subdatasets, and will now create siblings for sub-datasets nested in subdatasets which did not yet have those siblings. #6603 (by @yarikoptic)

Documentation
  • A new design document sheds light on result records #6167 (by @mih)

  • The disabled result renderer mode is documented #6174 (by @mih)

  • A new design document sheds light on the datalad and datalad-archives special remotes #6181 (by @mih)

  • A new design document sheds light on BatchedCommand and BatchedAnnex #6203 (by @christian-monch)

  • A new design document sheds light on standard parameters #6214 (by @adswa)

  • The DataLad project adopted the Contributor Covenant COC v2.1 #6236 (by @adswa)

  • Docstrings learned to include Sphinx’ “version added” and “deprecated” directives #6249 (by @mih)

  • A design document sheds light on basic docstring handling and formatting #6249 (by @mih)

  • A new design document sheds light on position versus keyword parameter usage #6261 (by @yarikoptic)

  • create-sibling-gin’s examples have been improved to suggest push as an additional step to ensure proper configuration #6289 (by @mslw)

  • A new document describes the credential system from a user’s perspective #5796 (by @bpoldrack)

  • Enhance the design document on DataLad’s credential system #5796 (by @bpoldrack)

  • The documentation of the configuration command now details all locations DataLad is reading configuration items from, and their respective rules of precedence #6306 (by @mih)

  • API docs for datalad.interface.base are now included in the documentation #6378 (by @mih)

  • A new design document is provided that describes the basics of the command line interface implementation #6382 (by @mih)

  • The datalad.interface.base.Interface class, the basis of all DataLad command implementations, has been extensively documented to provide an overview of basic principles and customization possibilities #6391 (by @mih)

  • --since=^ mode of operation of create-sibling is documented now #6436 (by @yarikoptic)

Internal
  • The internal status() helper was equipped with docstrings and promotes “breadth-first” reporting with a new parameter reporting_order #6006 (by @mih)

  • AnnexRepo.get_file_annexinfo() is introduced for more convenient queries for single files and replaces a now deprecated AnnexRepo.get_file_key() to receive information with fewer calls to Git #6104 (by @mih)

  • A new get_paths_by_ds() helper exposes status’ path normalization and sorting #6110 (by @mih)

  • status is optimized with a cache for dataset roots #6137 (by @yarikoptic)

  • The internal get_func_args_doc() helper with Python 2 is removed from DataLad core #6175 (by @yarikoptic)

  • Further restructuring of the source tree to better reflect the internal dependency structure of the code: AddArchiveContent is moved from datalad/interface to datalad/local (#6188 (by @mih)), Clean is moved from datalad/interface to datalad/local (#6191 (by @mih)), Unlock is moved from datalad/interface to datalad/local (#6192 (by @mih)), DownloadURL is moved from datalad/interface to datalad/local (#6217 (by @mih)), Rerun is moved from datalad/interface to datalad/local (#6220 (by @mih)), RunProcedure is moved from datalad/interface to datalad/local (#6222 (by @mih)). The interface command list is restructured and resorted #6223 (by @mih)

  • wrapt is replaced with functools’ wraps #6190 (by @yarikoptic)

  • The unmaintained appdirs library has been replaced with platformdirs #6198 (by @adswa)

  • Modelines mismatching the code style in source files were fixed #6263 (by @AKSoo)

  • datalad/__init__.py has been cleaned up #6271 (by @mih)

  • GitRepo.call_git_items is implemented with a generator-based runner #6278 (by @christian-monch)

  • Separate positional from keyword arguments in the Python API to match CLI with * #6176 (by @yarikoptic), #6304 (by @christian-monch)

  • GitRepo.bare does not require the ConfigManager anymore #6323 (by @mih)

  • _get_dot_git() was reimplemented to be more efficient and consistent, by testing for common scenarios first and introducing a consistently applied resolved flag for result path reporting #6325 (by @mih)

  • All data files under datalad are now included when installing DataLad #6336 (by @jwodder)

  • Add internal method for non-interactive provider/credential storing #5796 (by @bpoldrack)

  • Allow credential classes to have a context set, consisting of a URL they are to be used with and a dataset DataLad is operating on, allowing to consider “local” and “dataset” config locations #5796 (by @bpoldrack)

  • The Interface method get_refds_path() was deprecated #6387 (by @adswa)

  • datalad.interface.base.Interface is now an abstract class #6391 (by @mih)

  • Simplified the decision making for result rendering, and reduced code complexity #6394 (by @mih)

  • Reduce code duplication in datalad.support.json_py #6398 (by @mih)

  • Use public ArgumentParser.parse_known_args instead of protected _parse_known_args #6414 (by @yarikoptic)

  • add-archive-content does not rely on the deprecated tempfile.mktemp anymore, but uses the more secure tempfile.mkdtemp #6428 (by @adswa)

  • AnnexRepo’s internal annexstatus is deprecated. In its place, a new test helper assists the few tests that rely on it #6413 (by @adswa)

  • config has been refactored from where[="dataset"] to scope[="branch"] #5969 (by @yarikoptic)

  • Common command arguments are now uniformly and exhaustively passed to result renderers and filters for decision making. Previously, the presence of a particular argument depended on the respective API and circumstances of a command call. #6440 (by @mih)

  • Entrypoint processing for extensions and metadata extractors has been consolidated on a uniform helper that is about twice as fast as the previous implementations. #6591 (by @mih)

Tests
  • A range of Windows tests pass and were enabled #6136 (by @adswa)

  • Invalid escape sequences in some tests were fixed #6147 (by @mih)

  • A cross-platform compatible HTTP-serving test environment is introduced #6153 (by @mih)

  • A new helper exposes serve_path_via_http to the command line to deploy an ad-hoc instance of the HTTP server used for internal testing, with SSL and auth, if desired. #6169 (by @mih)

  • Windows tests were redistributed across worker runs to harmonize runtime #6200 (by @adswa)

  • BatchedCommand gained a basic test #6203 (by @christian-monch)

  • The use of with_testrepo is discontinued in all core tests #6224 (by @mih)

  • The new git-annex.filter.annex.process configuration is enabled by default on Windows to speed up the test suite #6245 (by @mih)

  • If the available Git version supports it, the test suite now uses GIT_CONFIG_GLOBAL to configure a fake home directory instead of overwriting HOME on OSX (#6251 (by @bpoldrack)) and HOME and USERPROFILE on Windows #6260 (by @adswa)

  • Windows test timeouts of runners were addressed #6311 (by @christian-monch)

  • A handful of Windows tests were fixed (#6352 (by @yarikoptic)) or disabled (#6353 (by @yarikoptic))

  • download-url’s test under http_proxy are skipped when a session can’t be established #6361 (by @yarikoptic)

  • A test for datalad clean was fixed to be invoked within a dataset #6359 (by @yarikoptic)

  • The new datalad.cli.tests have an improved module coverage of 80% #6378 (by @mih)

  • The test_source_candidate_subdataset has been marked as @slow #6429 (by @yarikoptic)

  • Dedicated CLI benchmarks exist now #6381 (by @mih)

  • Enable code coverage report for subprocesses #6546 (by @adswa)

  • Skip a test on annex>=10.20220127 due to a bug in annex. See https://git-annex.branchable.com/bugs/Change_to_annex.largefiles_leaves_repo_modified/

Infra
  • A new issue template using GitHub forms prestructures bug reports #6048 (by @Remi-Gau)

  • DataLad and its dependency stack were packaged for Gentoo Linux #6088 (by @TheChymera)

  • The readthedocs configuration is modernized to version 2 #6207 (by @adswa)

  • The Windows CI setup now runs on Appveyor’s Visual Studio 2022 configuration #6228 (by @adswa)

  • The readthedocs-theme and Sphinx versions were pinned to re-enable rendering of bullet points in the documentation #6346 (by @adswa)

  • The PR template was updated with a CHANGELOG template. Future PRs should use it to include a summary for the CHANGELOG #6396 (by @mih)

Authors: 11
  • Michael Hanke (@mih)

  • Yaroslav Halchenko (@yarikoptic)

  • Adina Wagner (@adswa)

  • Remi Gau (@Remi-Gau)

  • Horea Christian (@TheChymera)

  • Michał Szczepanik (@mslw)

  • Christian Mönch (@christian-monch)

  • John T. Wodder (@jwodder)

  • Benjamin Poldrack (@bpoldrack)

  • Sin Kim (@AKSoo)

  • Basile Pinsard (@bpinsard)


0.15.6 (Sun Feb 27 2022)

Bug Fix
  • BF: do not use BaseDownloader instance wide InterProcessLock - resolves stalling or errors during parallel installs #6507 (@yarikoptic)

  • release workflow: add -vv to auto invocation (@yarikoptic)

  • Fix version incorrectly incremented by release process in CHANGELOGs #6459 (@yarikoptic)

  • BF(TST): add another condition to skip under http_proxy set #6459 (@yarikoptic)

Authors: 1

0.15.5 (Wed Feb 09 2022)

Enhancement
  • BF: When download-url gets a Path object as path, convert it to a string #6364 (@adswa)

Bug Fix
Authors: 5

0.15.4 (Thu Dec 16 2021)

Bug Fix
Tests
  • RF+BF: use skip_if_no_module helper instead of try/except for libxmp and boto #6148 (@yarikoptic)

  • git://github.com -> https://github.com #6134 (@mih)

Authors: 6

0.15.3 (Sat Oct 30 2021)

Bug Fix
  • BF: Don’t make create-sibling recursive by default #6116 (@adswa)

  • BF: Add dashes to ‘force’ option in non-empty directory error message #6078 (@DisasterMo)

  • DOC: Add supported URL types to download-url’s docstring #6098 (@adswa)

  • BF: Retain git-annex error messages & don’t show them if operation successful #6070 (@DisasterMo)

  • Remove uses of __full_version__ and datalad.version #6073 (@jwodder)

  • BF: ORA shouldn’t crash while handling a failure #6063 (@bpoldrack)

  • DOC: Refine --reckless docstring on usage and wording #6043 (@adswa)

  • BF: archives upon strip - use rmtree which retries etc instead of rmdir #6064 (@yarikoptic)

  • BF: do not leave test in a tmp dir destined for removal #6059 (@yarikoptic)

  • Next wave of exc_str() removals #6022 (@mih)

Pushed to maint
  • CI: Enable new codecov uploader in Appveyor CI (@adswa)

Internal
  • UX: Log clone-candidate number and URLs #6092 (@adswa)

  • UX/ENH: Disable reporting, and don’t do superfluous internal subdatasets calls #6094 (@adswa)

  • Update codecov action to v2 #6072 (@jwodder)

Documentation
  • Design document on URL substitution feature #6065 (@mih)

Tests
Authors: 7

0.15.2 (Wed Oct 06 2021)

Bug Fix
  • BF: Don’t suppress datalad subdatasets output #6035 (@DisasterMo @mih)

  • Honor datalad.runtime.use-patool if set regardless of OS (was Windows only) #6033 (@mih)

  • Discontinue usage of deprecated (public) helper #6032 (@mih)

  • BF: ProgressHandler - close the other handler if one was specified #6020 (@yarikoptic)

  • UX: Report GitLab weburl of freshly created projects in the result #6017 (@adswa)

  • Ensure there’s a blank line between the class __doc__ and “Parameters” in build_doc docstrings #6004 (@jwodder)

  • Large code-reorganization of everything runner-related #6008 (@mih)

  • Discontinue exc_str() in all modern parts of the code base #6007 (@mih)

Tests
  • TST: Add test to ensure functionality with subdatasets starting with a hyphen (-) #6042 (@DisasterMo)

  • BF(TST): filter away warning from coverage from analysis of stderr of --help #6028 (@yarikoptic)

  • BF: disable outdated SSL root certificate breaking chain on older/buggy clients #6027 (@yarikoptic)

  • BF: start global test_http_server only if not running already #6023 (@yarikoptic)

Authors: 5

0.15.1 (Fri Sep 24 2021)

Bug Fix
  • BF: downloader - fail to download even on non-crippled FS if symlink exists #5991 (@yarikoptic)

  • ENH: import datalad.api to bind extensions methods for discovery of dataset methods #5999 (@yarikoptic)

  • Restructure cmdline API presentation #5988 (@mih)

  • Close file descriptors after process exit #5983 (@mih)

Pushed to maint
  • Discontinue testing of hirni extension (@mih)

Internal
Documentation
  • Coarse description of the credential subsystem’s functionality #5998 (@mih)

Tests
  • BF(TST): use sys.executable, mark test_ria_basics.test_url_keys as requiring network #5986 (@yarikoptic)

Authors: 3

0.15.0 (Tue Sep 14 2021) – We miss you Kyle!

Enhancements and new features
  • Command execution is now performed by a new Runner implementation that is no longer based on the asyncio framework, which was found to exhibit fragile performance in interaction with other asyncio-using code, such as Jupyter notebooks. The new implementation is based on threads. It also supports the specification of “protocols” that were introduced with the switch to the asyncio implementation in 0.14.0. (#5667)

  • clone now supports arbitrary URL transformations based on regular expressions. One or more transformation steps can be defined via datalad.clone.url-substitute.<label> configuration settings. The feature can be (and is now) used to support convenience mappings, such as https://osf.io/q8xnk/ (displayed in a browser window) to osf://q8xnk (clonable via the datalad-osf extension). (#5749)

  • Homogenize SSH use and configurability between DataLad and git-annex, by instructing git-annex to use DataLad’s sshrun for SSH calls (instead of SSH directly). (#5389)

  • The ORA special remote has received several new features:

    • It now supports a push-url setting as an alternative to url for write access. An analog parameter was also added to create-sibling-ria. (#5420, #5428)

    • Access of RIA stores now performs homogeneous availability checks, regardless of access protocol. Before, broken HTTP-based access due to misspecified URLs could have gone unnoticed. (#5459, #5672)

    • Error reporting was introduced to inform about undesirable conditions in remote RIA stores. (#5683)

  • create-sibling-ria now supports --alias for the specification of a convenience dataset alias name in a RIA store. (#5592)

  • Analog to git commit, save now features an --amend mode to support incremental updates of a dataset state. (#5430)

  • run now supports a dry-run mode that can be used to inspect the result of parameter expansion on the effective command to ease the composition of more complicated command lines. (#5539)

  • run now supports a --assume-ready switch to avoid the (possibly expensive) preparation of inputs and outputs with large datasets that have already been readied through other means. (#5431)

  • update now features --how and --how-subds parameters to configure how an update shall be performed. Supported modes are fetch (unchanged default), and merge (previously also possible via --merge), but also new strategies like reset or checkout. (#5534)

  • update has a new --follow=parentds-lazy mode that only performs a fetch operation in subdatasets when the desired commit is not yet present. During recursive updates involving many subdatasets this can substantially speed up performance. (#5474)

  • DataLad’s command line API can now report the version for individual commands via datalad <cmd> --version. The output has been homogenized to <providing package> <version>. (#5543)

  • create-sibling now logs information on an auto-generated sibling name, in the case that no --name/-s was provided. (#5550)

  • create-sibling-github has been updated to emit result records like any standard DataLad command. Previously it was implemented as a “plugin”, which did not support all standard API parameters. (#5551)

  • copy-file now also works with content-less files in datasets on crippled filesystems (adjusted mode), when a recent enough git-annex (8.20210428 or later) is available. (#5630)

  • addurls can now be instructed how to behave in the event of file name collision via a new parameter --on-collision. (#5675)

  • addurls reporting now informs which particular subdatasets were created. (#5689)

  • Credentials can now be provided or overwritten via all means supported by ConfigManager. Importantly, datalad.credential.<name>.<field> configuration settings and analog specification via environment variables are now supported (rather than custom environment variables only). Previous specification methods are still supported too. (#5680)

  • A new datalad.credentials.force-ask configuration flag can now be used to force re-entry of already known credentials. This simplifies credential updates without having to use an approach native to individual credential stores. (#5777)

  • Suppression of rendering repeated similar results is now configurable via the configuration switches datalad.ui.suppress-similar-results (bool), and datalad.ui.suppress-similar-results-threshold (int). (#5681)

  • The performance of status and similar functionality when determining local file availability has been improved. (#5692)

  • push now renders a result summary on completion. (#5696)

  • A dedicated info log message indicates when dataset repositories are subjected to an annex version upgrade. (#5698)

  • Error reporting improvements:

    • The NoDatasetFound exception now provides information for which purpose a dataset is required. (#5708)

    • Wording of the MissingExternalDependency error was rephrased to account for cases of non-functional installations. (#5803)

    • push reports when a --to parameter specification was (likely) forgotten. (#5726)

    • Detailed information is now given when DataLad fails to obtain a lock for credential entry in a timely fashion. Previously only a generic debug log message was emitted. (#5884)

    • Clarified error message when create-sibling-gitlab was called without --project. (#5907)

  • add-readme now provides a README template with more information on the nature and use of DataLad datasets. A README file is no longer annex’ed by default, but can be annexed using the new --annex switch. (#5723, #5725)

  • clean now supports a --dry-run mode to inform about cleanable content. (#5738)

  • A new configuration setting datalad.locations.locks can be used to control the placement of lock files. (#5740)

  • wtf now also reports branch names and states. (#5804)

  • AnnexRepo.whereis() now supports batch mode. (#5533)
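
As an illustration of the configuration-based credential provisioning described in the enhancements above (#5680, #5777): a minimal sketch in which the credential name ("mycred") and field name ("token") are hypothetical placeholders, and the environment-variable spelling follows DataLad's general mapping of configuration names to DATALAD_* variables.

    import os
    import subprocess

    # Hypothetical credential "mycred" with a hypothetical "token" field,
    # provided through the environment rather than a credential store
    # (datalad.credential.mycred.token <-> DATALAD_CREDENTIAL_MYCRED_TOKEN).
    env = dict(os.environ, DATALAD_CREDENTIAL_MYCRED_TOKEN="s3cr3t")

    # Any DataLad command started with this environment can pick the value up
    # via ConfigManager; "wtf" is only an arbitrary example command here.
    subprocess.run(["datalad", "wtf"], env=env, check=True)

    # Forcing re-entry of already known credentials is a separate, boolean
    # configuration flag: datalad.credentials.force-ask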

Deprecations and removals
  • The minimum supported git-annex version is now 8.20200309. (#5512)

  • ORA special remote configuration items ssh-host and base-path are deprecated. They are completely replaced by ria+<protocol>:// URL specifications. (#5425)

  • The deprecated no_annex parameter of create() was removed from the Python API. (#5441)

  • The unused GitRepo.pull() method has been removed. (#5558)

  • Residual support for “plugins” (a mechanism used before DataLad supported extensions) was removed. This includes the configuration switches datalad.locations.{system,user}-plugins. (#5554, #5564)

  • Several features and commands have been moved to the datalad-deprecated package. This package must now be installed to keep using this functionality.

    • The publish command. Use push instead. (#5837)

    • The ls command. (#5569)

    • The web UI that is deployable via datalad create-sibling --ui. (#5555)

    • The “automagic IO” feature. (#5577)

  • AnnexRepo.copy_to() has been deprecated. The push command should be used instead. (#5560)

  • AnnexRepo.sync() has been deprecated. AnnexRepo.call_annex(['sync', ...]) should be used instead. (#5461)

  • All GitRepo.*_submodule() methods have been deprecated and will be removed in a future release. (#5559)

  • create-sibling-github’s --dryrun switch was deprecated, use --dry-run instead. (#5551)

  • The datalad --pbs-runner option has been deprecated, use condor_run (or similar) instead. (#5956)
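
For the AnnexRepo.sync() deprecation above (#5461), a minimal migration sketch; the repository path is a placeholder, and any additional git-annex sync arguments would be appended to the list.

    from datalad.support.annexrepo import AnnexRepo

    repo = AnnexRepo("/path/to/dataset")  # placeholder path

    # Deprecated:
    #   repo.sync()
    # Preferred replacement: run the annex command directly.
    repo.call_annex(["sync"])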

Fixes
  • Prevent invalid declaration of a publication dependency for ‘origin’ on any auto-detected ORA special remotes, when cloning from a RIA store. An ORA remote is now checked whether it actually points to the RIA store the clone was made from. (#5415)

  • The ORA special remote implementation has received several fixes:

    • It can now handle HTTP redirects. (#5792)

    • Prevents failure when URL-type annex keys contain the ‘/’ character. (#5823)

    • Properly support the specification of usernames, passwords and ports in ria+<protocol>:// URLs. (#5902)

  • It is now possible to specifically select the default (or generic) result renderer via datalad -f default and with that override a tailored result renderer that may be preconfigured for a particular command. (#5476)

  • Starting with 0.14.0, original URLs given to clone were recorded in a subdataset record. This was initially done in a second commit, leading to inflation of commits and slowdown in superdatasets with many subdatasets. Such subdataset record annotation is now collapsed into a single commit. (#5480)

  • run no longer removes leading empty directories as part of the output preparation. This was surprising behavior for commands that do not ensure on their own that output directories exist. (#5492)

  • A potentially existing message property is no longer removed when using the json or json_pp result renderer to avoid undesired withholding of relevant information. (#5536)

  • subdatasets now reports state=present, rather than state=clean, for installed subdatasets to complement state=absent reports for uninstalled datasets. (#5655)

  • create-sibling-ria now executes commands with a consistent environment setup that matches all other command execution in other DataLad commands. (#5682)

  • save no longer saves unspecified subdatasets when called with an explicit path (list). The fix required a behavior change of GitRepo.get_content_info() in its interpretation of None vs. [] path argument values that now aligns the behavior of GitRepo.diff|status() with their respective documentation. (#5693)

  • get now prefers the location of a subdataset that is recorded in a superdataset’s .gitmodules record. Previously, DataLad tried to obtain a subdataset from an assumed checkout of the superdataset’s origin. This new default order is (re-)configurable via the datalad.get.subdataset-source-candidate-<priority-label> configuration mechanism. (#5760)

  • create-sibling-gitlab no longer skips the root dataset when . is given as a path. (#5789)

  • siblings now rejects a value given to --as-common-datasrc that clashes with the respective Git remote. (#5805)

  • The usage synopsis reported by siblings now lists all supported actions. (#5913)

  • siblings now renders non-ok results to avoid silent failure. (#5915)

  • .gitattributes file manipulations no longer leave the file without a trailing newline. (#5847)

  • Prevent crash when trying to delete a non-existing keyring credential field. (#5892)

  • git-annex is no longer called with an unconditional annex.retry=3 configuration. Instead, this parameterization is now limited to annex get and annex copy calls. (#5904)
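
The result-renderer override described above (#5476) can be requested on any invocation; a minimal sketch, with status as an arbitrary example command:

    import subprocess

    # Force the generic/default result renderer even for a command that is
    # preconfigured with a tailored renderer.
    subprocess.run(["datalad", "-f", "default", "status"], check=True)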

Tests
  • file:// URLs are no longer the predominant test case for AnnexRepo functionality. A built-in HTTP server is now used in most cases. (#5332)


0.14.8 (Sun Sep 12 2021)

Bug Fix
  • BF: add-archive-content on .xz and other non-.gz stream compressed files #5930 (@yarikoptic)

  • BF(UX): do not keep logging ERROR possibly present in progress records #5936 (@yarikoptic)

  • Annotate datalad_core as not needing actual data – just uses annex whereis #5971 (@yarikoptic)

  • BF: limit CMD_MAX_ARG if obnoxious value is encountered. #5945 (@yarikoptic)

  • Download session/credentials locking – inform user if locking is “failing” to be obtained, fail upon ~5min timeout #5884 (@yarikoptic)

  • Render siblings()’s non-ok results with the default renderer #5915 (@mih)

  • BF: do not crash, just skip whenever trying to delete non existing field in the underlying keyring #5892 (@yarikoptic)

  • Fix argument-spec for siblings and improve usage synopsis #5913 (@mih)

  • Clarify error message re unspecified gitlab project #5907 (@mih)

  • Support username, password and port specification in RIA URLs #5902 (@mih)

  • BF: take path from SSHRI, test URLs not only on Windows #5881 (@yarikoptic)

  • ENH(UX): warn user if keyring returned a “null” keyring #5875 (@yarikoptic)

  • ENH(UX): state original purpose in NoDatasetFound exception + detail it for get #5708 (@yarikoptic)

Pushed to maint
  • Merge branch ‘bf-http-headers-agent’ into maint (@yarikoptic)

  • RF(BF?)+DOC: provide User-Agent to entire session headers + use those if provided (@yarikoptic)

Internal
  • Pass --no-changelog to auto shipit if changelog already has entry #5952 (@jwodder)

  • Add isort config to match current convention + run isort via pre-commit (if configured) #5923 (@jwodder)

  • .travis.yml: use python -m {nose,coverage} invocations, and always show combined report #5888 (@yarikoptic)

  • Add project URLs into the package metadata for convenience links on Pypi #5866 (@adswa @yarikoptic)

Tests
  • BF: do use OBSCURE_FILENAME instead of hardcoded unicode #5944 (@yarikoptic)

  • BF(TST): Skip testing for having PID listed if no psutil #5920 (@yarikoptic)

  • BF(TST): Boost version of git-annex to 8.20201129 to test an error message #5894 (@yarikoptic)

Authors: 4

0.14.7 (Tue Aug 03 2021)

Bug Fix
  • UX: When two or more clone URL templates are found, error out more gracefully #5839 (@adswa)

  • BF: http_auth - follow redirect (just 1) to re-authenticate after initial attempt #5852 (@yarikoptic)

  • addurls Formatter - provide value repr in exception #5850 (@yarikoptic)

  • ENH: allow for “patch” level semver for “master” branch #5839 (@yarikoptic)

  • BF: Report info from annex JSON error message in CommandError #5809 (@mih)

  • RF(TST): do not test for no EASY and pkg_resources in shims #5817 (@yarikoptic)

  • http downloaders: Provide custom informative User-Agent, do not claim to be “Authenticated access” #5802 (@yarikoptic)

  • ENH(UX,DX): inform user with a warning if version is 0+unknown #5787 (@yarikoptic)

  • shell-completion: add argcomplete to ‘misc’ extra_depends, log an ERROR if argcomplete fails to import #5781 (@yarikoptic)

  • ENH (UX): add python-gitlab dependency #5776 (s.heunis@fz-juelich.de)

Internal
  • BF: Fix reported paths in ORA remote #5821 (@adswa)

  • BF: import importlib.metadata not importlib_metadata whenever available #5818 (@yarikoptic)

Tests
  • TST: set --allow-unrelated-histories in the mk_push_target setup for Windows #5855 (@adswa)

  • Tests: Allow for version to contain + as a separator and provide more information for version related comparisons #5786 (@yarikoptic)

Authors: 4

0.14.6 (Sun Jun 27 2021)

Internal
Authors: 2

0.14.5 (Mon Jun 21 2021)

Bug Fix
  • BF(TST): parallel - take longer for producer to produce #5747 (@yarikoptic)

  • add --on-failure default value and document it #5690 (@christian-monch @yarikoptic)

  • ENH: harmonize “purpose” statements to imperative form #5733 (@yarikoptic)

  • ENH(TST): populate heavy tree with 100 unique keys (not just 1) among 10,000 #5734 (@yarikoptic)

  • BF: do not use .acquired - just get state from acquire() #5718 (@yarikoptic)

  • BF: account for annex now “scanning for annexed” instead of “unlocked” files #5705 (@yarikoptic)

  • interface: Don’t repeat custom summary for non-generator results #5688 (@kyleam)

  • RF: just pip install datalad-installer #5676 (@yarikoptic)

  • DOC: addurls.extract: Drop mention of removed ‘stream’ parameter #5690 (@kyleam)

  • Merge pull request #5674 from kyleam/test-addurls-copy-fix #5674 (@kyleam)

  • Merge pull request #5663 from kyleam/status-ds-equal-path #5663 (@kyleam)

  • Merge pull request #5671 from kyleam/update-fetch-fail #5671 (@kyleam)

  • BF: update: Honor --on-failure if fetch fails #5671 (@kyleam)

  • RF: update: Avoid fetch’s deprecated kwargs #5671 (@kyleam)

  • CLN: update: Drop an unused import #5671 (@kyleam)

  • Merge pull request #5664 from kyleam/addurls-better-url-parts-error #5664 (@kyleam)

  • Merge pull request #5661 from kyleam/sphinx-fix-plugin-refs #5661 (@kyleam)

  • BF: status: Provide special treatment of “this dataset” path #5663 (@kyleam)

  • BF: addurls: Provide better placeholder error for special keys #5664 (@kyleam)

  • RF: addurls: Simplify construction of placeholder exception message #5664 (@kyleam)

  • RF: addurls._get_placeholder_exception: Rename a parameter #5664 (@kyleam)

  • RF: status: Avoid repeated Dataset.path access #5663 (@kyleam)

  • DOC: Reference plugins via datalad.api #5661 (@kyleam)

  • download-url: Set up datalad special remote if needed #5648 (@kyleam @yarikoptic)

Pushed to maint
  • MNT: Post-release dance (@kyleam)

Internal
Tests
  • BF(TST): skip testing for showing “Scanning for …” since not shown if too quick #5727 (@yarikoptic)

  • Revert “TST: test_partial_unlocked: Document and avoid recent git-annex failure” #5651 (@kyleam)

Authors: 4

0.14.4 (May 10, 2021) – .

Fixes
  • Following an internal call to git-clone, clone assumed that the remote name was “origin”, but this may not be the case if clone.defaultRemoteName is configured (available as of Git 2.30). (#5572)

  • Several test fixes, including updates for changes in git-annex. (#5612) (#5632) (#5639)

0.14.3 (April 28, 2021) – .

Fixes
  • For outputs that include a glob, run didn’t re-glob after executing the command, which is necessary to catch changes if --explicit or --expand={outputs,both} is specified. (#5594)

  • run now gives an error result rather than a warning when an input glob doesn’t match. (#5594)

  • The procedure for creating a RIA store checks for an existing ria-layout-version file and makes sure its version matches the desired version. This check wasn’t done correctly for SSH hosts. (#5607)

  • A helper for transforming git-annex JSON records into DataLad results didn’t account for the unusual case where the git-annex record doesn’t have a “file” key. (#5580)

  • The test suite required updates for recent changes in PyGithub and git-annex. (#5603) (#5609)

Enhancements and new features
  • The DataLad source repository has long had a tools/cmdline-completion helper. This functionality is now exposed as a command, datalad shell-completion. (#5544)

0.14.2 (April 14, 2021) – .

Fixes
  • push now works bottom-up, pushing submodules first so that hooks on the remote can aggregate updated subdataset information. (#5416)

  • run-procedure didn’t ensure that the configuration of subdatasets was reloaded. (#5552)

0.14.1 (April 01, 2021) – .

Fixes
  • The recent default branch changes on GitHub’s side can lead to “git-annex” being selected over “master” as the default branch on GitHub when setting up a sibling with create-sibling-github. To work around this, the current branch is now pushed first. (#5010)

  • The logic for reading in a JSON line from git-annex failed if the response exceeded the buffer size (256 KB on *nix systems).

  • Calling unlock with a path of “.” from within an untracked subdataset incorrectly aborted, complaining that the “dataset containing given paths is not underneath the reference dataset”. (#5458)

  • clone didn’t account for the possibility of multiple accessible ORA remotes or the fact that none of them may be associated with the RIA store being cloned. (#5488)

  • create-sibling-ria didn’t call git update-server-info after setting up the remote repository and, as a result, the repository couldn’t be fetched until something else (e.g., a push) triggered a call to git update-server-info. (#5531)

  • The parser for git-config output didn’t properly handle multi-line values and got thrown off by unexpected and unrelated lines. (#5509)

  • The 0.14 release introduced regressions in the handling of progress bars for git-annex actions, including collapsing progress bars for concurrent operations. (#5421) (#5438)

  • save failed if the user configured Git’s diff.ignoreSubmodules to a non-default value. (#5453)

  • An interprocess lock is now used to prevent a race between checking for an SSH socket’s existence and creating it. (#5466)

  • If a Python procedure script is executable, run-procedure invokes it directly rather than passing it to sys.executable. The non-executable Python procedures that ship with DataLad now include shebangs so that invoking them has a chance of working on file systems that present all files as executable. (#5436)

  • DataLad’s wrapper around argparse failed if an underscore was used in a positional argument. (#5525)

Enhancements and new features
  • DataLad’s method for mapping environment variables to configuration options (e.g., DATALAD_FOO_X__Y to datalad.foo.x-y) doesn’t work if the subsection name (“FOO”) has an underscore. This limitation can be sidestepped with the new DATALAD_CONFIG_OVERRIDES_JSON environment variable, which can be set to a JSON record of configuration values. (#5505)
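
A minimal sketch of the new override mechanism; the configuration key below is a made-up example with an underscore in the subsection name, and the flat key-to-value JSON layout is an assumption based on the description above.

    import json
    import os
    import subprocess

    # A key whose subsection contains an underscore cannot be expressed with
    # the DATALAD_FOO_X__Y-style variables, but can be passed as JSON.
    overrides = {"datalad.my_section.option": "value"}  # hypothetical key
    env = dict(os.environ, DATALAD_CONFIG_OVERRIDES_JSON=json.dumps(overrides))

    # "wtf" is only an arbitrary example command run with the overrides.
    subprocess.run(["datalad", "wtf"], env=env, check=True)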

0.14.0 (February 02, 2021) – .

Major refactoring and deprecations
  • Git versions below v2.19.1 are no longer supported. (#4650)

  • The minimum git-annex version is still 7.20190503, but, if you’re on Windows (or use adjusted branches in general), please upgrade to at least 8.20200330 but ideally 8.20210127 to get subdataset-related fixes. (#4292) (#5290)

  • The minimum supported version of Python is now 3.6. (#4879)

  • publish is now deprecated in favor of push. It will be removed in the 0.15.0 release at the earliest.

  • A new command runner was added in v0.13. Functionality related to the old runner has now been removed: Runner, GitRunner, and run_gitcommand_on_file_list_chunks from the datalad.cmd module along with the datalad.tests.protocolremote, datalad.cmd.protocol, and datalad.cmd.protocol.prefix configuration options. (#5229)

  • The --no-storage-sibling switch of create-sibling-ria is deprecated in favor of --storage-sibling=off and will be removed in a later release. (#5090)

  • The get_git_dir static method of GitRepo is deprecated and will be removed in a later release. Use the dot_git attribute of an instance instead. (#4597)

  • The ProcessAnnexProgressIndicators helper from datalad.support.annexrepo has been removed. (#5259)

  • The save argument of install, a noop since v0.6.0, has been dropped. (#5278)

  • The get_URLS method of AnnexCustomRemote is deprecated and will be removed in a later release. (#4955)

  • ConfigManager.get now returns a single value rather than a tuple when there are multiple values for the same key, as very few callers correctly accounted for the possibility of a tuple return value. Callers can restore the old behavior by passing get_all=True. (#4924)

  • In 0.12.0, all of the assure_* functions in datalad.utils were renamed as ensure_*, keeping the old names around as compatibility aliases. The assure_* variants are now marked as deprecated and will be removed in a later release. (#4908)

  • The datalad.interface.run module, which was deprecated in 0.12.0 and kept as a compatibility shim for datalad.core.local.run, has been removed. (#4583)

  • The saver argument of datalad.core.local.run.run_command, marked as obsolete in 0.12.0, has been removed. (#4583)

  • The dataset_only argument of the ConfigManager class was deprecated in 0.12 and has now been removed. (#4828)

  • The linux_distribution_name, linux_distribution_release, and on_debian_wheezy attributes in datalad.utils are no longer set at import time and will be removed in a later release. Use datalad.utils.get_linux_distribution instead. (#4696)

  • datalad.distribution.clone, which was marked as obsolete in v0.12 in favor of datalad.core.distributed.clone, has been removed. (#4904)

  • datalad.support.annexrepo.N_AUTO_JOBS, announced as deprecated in v0.12.6, has been removed. (#4904)

  • The compat parameter of GitRepo.get_submodules, added in v0.12 as a temporary compatibility layer, has been removed. (#4904)

  • The long-deprecated (and non-functional) url parameter of GitRepo.__init__ has been removed. (#5342)
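
A minimal sketch of the changed ConfigManager.get behavior (#4924); datalad.cfg is the top-level configuration instance, and the key is only an example of a setting that may carry multiple values.

    import datalad

    # As of 0.14.0 a single value is returned even if the key occurs
    # multiple times in the configuration ...
    value = datalad.cfg.get("datalad.log.level")

    # ... while the pre-0.14 behavior of returning every value can be
    # requested explicitly.
    all_values = datalad.cfg.get("datalad.log.level", get_all=True)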

Fixes
  • Cloning onto a system that enters adjusted branches by default (as Windows does) did not properly record the clone URL. (#5128)

  • The RIA-specific handling after calling clone was correctly triggered by ria+http URLs but not ria+https URLs. (#4977)

  • If the registered commit wasn’t found when cloning a subdataset, the failed attempt was left around. (#5391)

  • The remote calls to cp and chmod in create-sibling were not portable and failed on macOS. (#5108)

  • A more reliable check is now done to decide if configuration files need to be reloaded. (#5276)

  • The internal command runner’s handling of the event loop has been improved to play nicer with outside applications and scripts that use asyncio. (#5350) (#5367)

Enhancements and new features
  • The subdataset handling for adjusted branches, which is particularly important on Windows where git-annex enters an adjusted branch by default, has been improved. A core piece of the new approach is registering the commit of the primary branch, not its checked out adjusted branch, in the superdataset. Note: This means that git status will always consider a subdataset on an adjusted branch as dirty while datalad status will look more closely and see if the tip of the primary branch matches the registered commit. (#5241)

  • The performance of the subdatasets command has been improved, with substantial speedups for recursive processing of many subdatasets. (#4868) (#5076)

  • Adding new subdatasets via save has been sped up. (#4793)

  • get, save, and addurls gained support for parallel operations that can be enabled via the --jobs command-line option or the new datalad.runtime.max-jobs configuration option. (#5022)

  • addurls

    • learned how to read data from standard input. (#4669)

    • now supports tab-separated input. (#4845)

    • now lets Python callers pass in a list of records rather than a file name. (#5285)

    • gained a --drop-after switch that signals to drop a file’s content after downloading and adding it to the annex. (#5081)

    • is now able to construct a tree of files from known checksums without downloading content via its new --key option. (#5184)

    • records the URL file in the commit message as provided by the caller rather than using the resolved absolute path. (#5091)

    • is now speedier. (#4867) (#5022)

  • create-sibling-github learned how to create private repositories (thanks to Nolan Nichols). (#4769)

  • create-sibling-ria gained a --storage-sibling option. When --storage-sibling=only is specified, the storage sibling is created without an accompanying Git sibling. This enables using hosts without Git installed for storage. (#5090)

  • The download machinery (and thus the datalad special remote) gained support for a new scheme, shub://, which follows the same format used by singularity run and friends. In contrast to the short-lived URLs obtained by querying Singularity Hub directly, shub:// URLs are suitable for registering with git-annex. (#4816)

  • A provider is now included for https://registry-1.docker.io URLs. This is useful for storing an image’s blobs in a dataset and registering the URLs with git-annex. (#5129)

  • The add-readme command now links to the DataLad handbook rather than http://docs.datalad.org. (#4991)

  • New option datalad.locations.extra-procedures specifies an additional location that should be searched for procedures. (#5156)

  • The class for handling configuration values, ConfigManager, now takes a lock before writes to allow for multiple processes to modify the configuration of a dataset. (#4829)

  • clone now records the original, unresolved URL for a subdataset under submodule.<name>.datalad-url in the parent’s .gitmodules, enabling later get calls to use the original URL. This is particularly useful for ria+ URLs. (#5346)

  • Installing a subdataset now uses custom handling rather than calling git submodule update --init. This avoids some locking issues when running get in parallel and enables more accurate source URLs to be recorded. (#4853)

  • GitRepo.get_content_info, a helper that gets triggered by many commands, got faster by tweaking its git ls-files call. (#5067)

  • wtf now includes credentials-related information (e.g. active backends) in its output. (#4982)

  • The call_git* methods of GitRepo now have a read_only parameter. Callers can set this to True to promise that the provided command does not write to the repository, bypassing the cost of some checks and locking. (#5070)

  • New call_annex* methods in the AnnexRepo class provide an interface for running git-annex commands similar to that of the GitRepo.call_git* methods. (#5163)

  • It’s now possible to register a custom metadata indexer that is discovered by search and used to generate an index. (#4963)

  • The ConfigManager methods get, getbool, getfloat, and getint now return a single value (with same precedence as git config --get) when there are multiple values for the same key (in the non-committed git configuration, if the key is present there, or in the dataset configuration). For get, the old behavior can be restored by specifying get_all=True. (#4924)

  • Command-line scripts are now defined via the entry_points argument of setuptools.setup instead of the scripts argument. (#4695)

  • Interactive use of --help on the command-line now invokes a pager on more systems and installation setups. (#5344)

  • The datalad special remote now tries to eliminate some unnecessary interactions with git-annex by being smarter about how it queries for URLs associated with a key. (#4955)

  • The GitRepo class now does a better job of handling bare repositories, a step towards bare repositories support in DataLad. (#4911)

  • More internal work to move the code base over to the new command runner. (#4699) (#4855) (#4900) (#4996) (#5002) (#5141) (#5142) (#5229)
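
A minimal sketch of the parallel-operation knobs introduced above (#5022); the job count is arbitrary, the global configuration scope is only one possibility, and the get call assumes it is run from within an installed dataset.

    import subprocess

    # Cap what --jobs=auto may use (configuration option named in this release).
    subprocess.run(
        ["git", "config", "--global", "datalad.runtime.max-jobs", "4"],
        check=True)

    # Request parallel operation explicitly for a single call.
    subprocess.run(["datalad", "get", "--jobs", "4", "."], check=True)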

0.13.7 (January 04, 2021) – .

Fixes
  • Cloning from a RIA store on the local file system initialized annex in the Git sibling of the RIA source, which is problematic because all annex-related functionality should go through the storage sibling. clone now sets remote.origin.annex-ignore to true after cloning from RIA stores to prevent this. (#5255)

  • create-sibling invoked cp in a way that was not compatible with macOS. (#5269)

  • Due to a bug in older Git versions (before 2.25), calling status with a file under .git/ (e.g., datalad status .git/config) incorrectly reported the file as untracked. A workaround has been added. (#5258)

  • Update tests for compatibility with latest git-annex. (#5254)

Enhancements and new features
  • copy-file now aborts if .git/ is in the target directory, adding to its existing .git/ safety checks. (#5258)

0.13.6 (December 14, 2020) – .

Fixes
  • An assortment of fixes for Windows compatibility. (#5113) (#5119) (#5125) (#5127) (#5136) (#5201) (#5200) (#5214)

  • Adding a subdataset on a system that defaults to using an adjusted branch (i.e. doesn’t support symlinks) didn’t properly set up the submodule URL if the source dataset was not in an adjusted state. (#5127)

  • push failed to push to a remote that did not have an annex-uuid value in the local .git/config. (#5148)

  • The default renderer has been improved to avoid a spurious leading space, which led to the displayed path being incorrect in some cases. (#5121)

  • siblings showed an uninformative error message when asked to configure an unknown remote. (#5146)

  • drop confusingly relayed a suggestion from git annex drop to use --force, an option that does not exist in datalad drop. (#5194)

  • create-sibling-github no longer offers user/password authentication because it is no longer supported by GitHub. (#5218)

  • The internal command runner’s handling of the event loop has been tweaked to hopefully fix issues with running DataLad from IPython. (#5106)

  • SSH cleanup wasn’t reliably triggered by the ORA special remote on failure, leading to a stall with a particular version of git-annex, 8.20201103. (This is also resolved on git-annex’s end as of 8.20201127.) (#5151)

Enhancements and new features
  • The credential helper no longer asks the user to repeat tokens or AWS keys. (#5219)

  • The new option datalad.locations.sockets controls where DataLad stores SSH sockets, allowing users to more easily work around file system and path length restrictions. (#5238)
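
A minimal sketch of relocating the SSH socket directory via the new option (#5238); the path is a placeholder and the global scope is just one possible place to set it.

    import subprocess

    # Store DataLad's SSH control sockets under a short, writable path.
    subprocess.run(
        ["git", "config", "--global", "datalad.locations.sockets",
         "/tmp/dl-sockets"],  # placeholder path
        check=True)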

0.13.5 (October 30, 2020) – .

Fixes
  • SSH connection handling has been reworked to fix cloning on Windows. A new configuration option, datalad.ssh.multiplex-connections, defaults to false on Windows. (#5042)

  • The ORA special remote and post-clone RIA configuration now provide authentication via DataLad’s credential mechanism and better handling of HTTP status codes. (#5025) (#5026)

  • By default, if a git executable is present in the same location as git-annex, DataLad modifies PATH when running git and git-annex so that the bundled git is used. This logic has been tightened to avoid unnecessarily adjusting the path, reducing the cases where the adjustment interferes with the local environment, such as special remotes in a virtual environment being masked by the system-wide variants. (#5035)

  • git-annex is now consistently invoked as “git annex” rather than “git-annex” to work around failures on Windows. (#5001)

  • push called git annex sync ... on plain git repositories. (#5051)

  • save in general doesn’t support registering multiple levels of untracked subdatasets, but it can now properly register nested subdatasets when all of the subdataset paths are passed explicitly (e.g., datalad save -d. sub-a sub-a/sub-b). (#5049)

  • When called with --sidecar and --explicit, run didn’t save the sidecar. (#5017)

  • A couple of spots didn’t properly quote format fields when combining substrings into a format string. (#4957)

  • The default credentials configured for indi-s3 prevented anonymous access. (#5045)

Enhancements and new features
  • Messages about suppressed similar results are now rate limited to improve performance when there are many similar results coming through quickly. (#5060)

  • create-sibling-github can now be told to replace an existing sibling by passing --existing=replace. (#5008)

  • Progress bars now react to changes in the terminal’s width (requires tqdm 2.1 or later). (#5057)

0.13.4 (October 6, 2020) – .

Fixes
  • Ephemeral clones mishandled bare repositories. (#4899)

  • The post-clone logic for configuring RIA stores didn’t consider https:// URLs. (#4977)

  • DataLad custom remotes didn’t escape newlines in messages sent to git-annex. (#4926)

  • The datalad-archives special remote incorrectly treated file names as percent-encoded. (#4953)

  • The result handler didn’t properly escape “%” when constructing its message template. (#4953)

  • In v0.13.0, the tailored rendering for specific subtypes of external command failures (e.g., “out of space” or “remote not available”) was unintentionally switched to the default rendering. (#4966)

  • Various fixes and updates for the NDA authenticator. (#4824)

  • The helper for getting a versioned S3 URL did not support anonymous access or buckets with “.” in their name. (#4985)

  • Several issues with the handling of S3 credentials and token expiration have been addressed. (#4927) (#4931) (#4952)

Enhancements and new features
  • A warning is now given if the detected Git is below v2.13.0 to let users that run into problems know that their Git version is likely the culprit. (#4866)

  • A fix to push in v0.13.2 introduced a regression that surfaces when push.default is configured to “matching” and prevents the git-annex branch from being pushed. Note that, as part of the fix, the current branch is now always pushed even when it wouldn’t be based on the configured refspec or push.default value. (#4896)

  • publish

    • now allows spelling the empty string value of --since= as ^ for consistency with push. (#4683)

    • compares a revision given to --since= with HEAD rather than the working tree to speed up the operation. (#4448)

  • rerun

    • emits more INFO-level log messages. (#4764)

    • provides better handling of adjusted branches and aborts with a clear error for cases that are not supported. (#5328)

  • The archives are handled with p7zip, if available, since DataLad v0.12.0. This implementation now supports .tgz and .tbz2 archives. (#4877)

0.13.3 (August 28, 2020) – .

Fixes
  • Work around a Python bug that led to our asyncio-based command runner intermittently failing to capture the output of commands that exit very quickly. (#4835)

  • push displayed an overestimate of the transfer size when multiple files pointed to the same key. (#4821)

  • When download-url calls git annex addurl, it catches and reports any failures rather than crashing. A change in v0.12.0 broke this handling in a particular case. (#4817)

Enhancements and new features
  • The wrapper functions returned by decorators are now given more meaningful names to hopefully make tracebacks easier to digest. (#4834)

0.13.2 (August 10, 2020) – .

Deprecations
  • The allow_quick parameter of AnnexRepo.file_has_content and AnnexRepo.is_under_annex is now ignored and will be removed in a later release. This parameter was only relevant for git-annex versions before 7.20190912. (#4736)

Fixes
  • Updates for compatibility with recent git and git-annex releases. (#4746) (#4760) (#4684)

  • push didn’t sync the git-annex branch when --data=nothing was specified. (#4786)

  • The datalad.clone.reckless configuration wasn’t stored in non-annex datasets, preventing the values from being inherited by annex subdatasets. (#4749)

  • Running the post-update hook installed by create-sibling --ui could overwrite web log files from previous runs in the unlikely event that the hook was executed multiple times in the same second. (#4745)

  • clone inspected git’s standard error in a way that could cause an attribute error. (#4775)

  • When cloning a repository whose HEAD points to a branch without commits, clone tries to find a more useful branch to check out. It unwisely considered adjusted branches. (#4792)

  • Since v0.12.0, SSHManager.close hasn’t closed connections when the ctrl_path argument was explicitly given. (#4757)

  • When working in a dataset in which git annex init had not yet been called, the file_has_content and is_under_annex methods of AnnexRepo incorrectly took the “allow quick” code path on file systems that did not support it (#4736)

Enhancements
  • create now assigns version 4 (random) UUIDs instead of version 1 UUIDs that encode the time and hardware address. (#4790)

  • The documentation for create now does a better job of describing the interaction between --dataset and PATH. (#4763)

  • The format_commit and get_hexsha methods of GitRepo have been sped up. (#4807) (#4806)

  • A better error message is now shown when the ^ or ^. shortcuts for --dataset do not resolve to a dataset. (#4759)

  • A more helpful error message is now shown if a caller tries to download an ftp:// link but does not have request_ftp installed. (#4788)

  • clone now tries harder to get up-to-date availability information after auto-enabling type=git special remotes. (#2897)

0.13.1 (July 17, 2020) – .

Fixes
  • Cloning a subdataset should inherit the parent’s datalad.clone.reckless value, but that did not happen when cloning via datalad get rather than datalad install or datalad clone. (#4657)

  • The default result renderer crashed when the result did not have a path key. (#4666) (#4673)

  • datalad push didn’t show information about git push errors when the output was not in the format that it expected. (#4674)

  • datalad push silently accepted an empty string for --since even though it is an invalid value. (#4682)

  • Our JavaScript testing setup on Travis grew stale and has now been updated. (Thanks to Xiao Gui.) (#4687)

  • The new class for running Git commands (added in v0.13.0) ignored any changes to the process environment that occurred after instantiation. (#4703)

Enhancements and new features
  • datalad push now avoids unnecessary git push dry runs and pushes all refspecs with a single git push call rather than invoking git push for each one. (#4692) (#4675)

  • The readability of SSH error messages has been improved. (#4729)

  • datalad.support.annexrepo avoids calling datalad.utils.get_linux_distribution at import time and caches the result once it is called because, as of Python 3.8, the function uses distro underneath, adding noticeable overhead. (#4696)

    Third-party code should be updated to use get_linux_distribution directly in the unlikely event that the code relied on the import-time call to get_linux_distribution setting the linux_distribution_name, linux_distribution_release, or on_debian_wheezy attributes in datalad.utils.
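
A minimal sketch of the direct call recommended above; the exact shape of the returned value is not asserted here.

    # Query the distribution information on demand instead of relying on the
    # removed import-time attributes.
    from datalad.utils import get_linux_distribution

    print(get_linux_distribution())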

0.13.0 (June 23, 2020) – .

A handful of new commands, including copy-file, push, and create-sibling-ria, along with various fixes and enhancements

Major refactoring and deprecations
  • The no_annex parameter of create, which is exposed in the Python API but not the command line, is deprecated and will be removed in a later release. Use the new annex argument instead, flipping the value. Command-line callers that use --no-annex are unaffected. (#4321)

  • datalad add, which was deprecated in 0.12.0, has been removed. (#4158) (#4319)

  • The following GitRepo and AnnexRepo methods have been removed: get_changed_files, get_missing_files, and get_deleted_files. (#4169) (#4158)

  • The get_branch_commits method of GitRepo and AnnexRepo has been renamed to get_branch_commits_. (#3834)

  • The custom commit method of AnnexRepo has been removed, and AnnexRepo.commit now resolves to the parent method, GitRepo.commit. (#4168)

  • GitPython’s git.repo.base.Repo class is no longer available via the .repo attribute of GitRepo and AnnexRepo. (#4172)

  • AnnexRepo.get_corresponding_branch now returns None rather than the current branch name when a managed branch is not checked out. (#4274)

  • The special UUID for git-annex web remotes is now available as datalad.consts.WEB_SPECIAL_REMOTE_UUID. It remains accessible as AnnexRepo.WEB_UUID for compatibility, but new code should use consts.WEB_SPECIAL_REMOTE_UUID (#4460).

Fixes
  • Widespread improvements in functionality and test coverage on Windows and crippled file systems in general. (#4057) (#4245) (#4268) (#4276) (#4291) (#4296) (#4301) (#4303) (#4304) (#4305) (#4306)

  • AnnexRepo.get_size_from_key incorrectly handled file chunks. (#4081)

  • create-sibling would too readily clobber existing paths when called with --existing=replace. It now gets confirmation from the user before doing so if running interactively and unconditionally aborts when running non-interactively. (#4147)

  • update (#4159)

    • queried the incorrect branch configuration when updating non-annex repositories.

    • didn’t account for the fact that the local repository can be configured as the upstream “remote” for a branch.

  • When the caller included --bare as a git init option, create crashed creating the bare repository, which is currently unsupported, rather than aborting with an informative error message. (#4065)

  • The logic for automatically propagating the ‘origin’ remote when cloning a local source could unintentionally trigger a fetch of a non-local remote. (#4196)

  • All remaining get_submodules() call sites that relied on the temporary compatibility layer added in v0.12.0 have been updated. (#4348)

  • The custom result summary renderer for get, which was visible with --output-format=tailored, displayed incorrect and confusing information in some cases. The custom renderer has been removed entirely. (#4471)

  • The documentation for the Python interface of a command listed an incorrect default when the command overrode the value of command parameters such as result_renderer. (#4480)

Enhancements and new features
  • The default result renderer learned to elide a chain of results after seeing ten consecutive results that it considers similar, which improves the display of actions that have many results (e.g., saving hundreds of files). (#4337)

  • The default result renderer, in addition to “tailored” result renderer, now triggers the custom summary renderer, if any. (#4338)

  • The new command create-sibling-ria provides support for creating a sibling in a RIA store. (#4124)

  • DataLad ships with a new special remote, git-annex-remote-ora, for interacting with RIA stores and a new command export-archive-ora for exporting an archive from a local annex object store. (#4260) (#4203)

  • The new command push provides an alternative interface to publish for pushing a dataset hierarchy to a sibling. (#4206) (#4581) (#4617) (#4620)

  • The new command copy-file copies files and associated availability information from one dataset to another. (#4430)

  • The command examples have been expanded and improved. (#4091) (#4314) (#4464)

  • The tooling for linking to the DataLad Handbook from DataLad’s documentation has been improved. (#4046)

  • The --reckless parameter of clone and install learned two new modes:

    • “ephemeral”, where the .git/annex/ of the cloned repository is symlinked to the local source repository’s. (#4099)

    • “shared-{group|all|…}” that can be used to set up datasets for collaborative write access. (#4324)

  • clone

    • learned to handle dataset aliases in RIA stores when given a URL of the form ria+<protocol>://<storelocation>#~<aliasname>. (#4459)

    • now checks datalad.get.subdataset-source-candidate-NAME to see if NAME starts with three digits, which is taken as a “cost”. Sources with lower costs will be tried first. (#4619)

  • update (#4167)

    • learned to disallow non-fast-forward updates when ff-only is given to the --merge option.

    • gained a --follow option that controls how --merge behaves, adding support for merging in the revision that is registered in the parent dataset rather than merging in the configured branch from the sibling.

    • now provides a result record for merge events.

  • create-sibling now supports local paths as targets in addition to SSH URLs. (#4187)

  • siblings now

    • shows a warning if the caller requests to delete a sibling that does not exist. (#4257)

    • phrases its warning about non-annex repositories in a less alarming way. (#4323)

  • The rendering of command errors has been improved. (#4157)

  • save now

    • displays a message to signal that the working tree is clean, making it more obvious that no results being rendered corresponds to a clean state. (#4106)

    • provides a stronger warning against using --to-git. (#4290)

  • diff and save learned about scenarios where they could avoid unnecessary and expensive work. (#4526) (#4544) (#4549)

  • Calling diff without --recursive but with a path constraint within a subdataset ("<subdataset>/<path>") now traverses into the subdataset, as "<subdataset>/" would, restricting its report to "<subdataset>/<path>". (#4235)

  • New option datalad.annex.retry controls how many times git-annex will retry on a failed transfer. It defaults to 3 and can be set to 0 to restore the previous behavior. (#4382)

  • wtf now warns when the specified dataset does not exist. (#4331)

  • The repr and str output of the dataset and repo classes got a facelift. (#4420) (#4435) (#4439)

  • The DataLad Singularity container now comes with p7zip-full.

  • DataLad emits a log message when the current working directory is resolved to a different location due to a symlink. This is now logged at the DEBUG rather than WARNING level, as it typically does not indicate a problem. (#4426)

  • DataLad now lets the caller know that git annex init is scanning for unlocked files, as this operation can be slow in some repositories. (#4316)

  • The log_progress helper learned how to set the starting point to a non-zero value and how to update the total of an existing progress bar, two features needed for planned improvements to how some commands display their progress. (#4438)

  • The ExternalVersions object, which is used to check versions of Python modules and external tools (e.g., git-annex), gained an add method that enables DataLad extensions and other third-party code to include other programs of interest. (#4441)

  • All of the remaining spots that use GitPython have been rewritten without it. Most notably, this includes rewrites of the clone, fetch, and push methods of GitRepo. (#4080) (#4087) (#4170) (#4171) (#4175) (#4172)

  • When GitRepo.commit splits its operation across multiple calls to avoid exceeding the maximum command line length, it now amends the initial commit rather than creating multiple commits. (#4156)

  • GitRepo gained a get_corresponding_branch method (which always returns None), allowing a caller to invoke the method without needing to check if the underlying repo class is GitRepo or AnnexRepo. (#4274)

  • A new helper function datalad.core.local.repo.repo_from_path returns a repo class for a specified path. (#4273)

  • New AnnexRepo method localsync performs a git annex sync that disables external interaction and is particularly useful for propagating changes on an adjusted branch back to the main branch. (#4243)
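
A minimal sketch of cloning by dataset alias from a RIA store, as described for clone above (#4459); store location, alias name, and target directory are placeholders.

    import subprocess

    # Use the ria+<protocol>://<storelocation>#~<aliasname> URL form.
    subprocess.run(
        ["datalad", "clone",
         "ria+ssh://example.com/path/to/store#~myalias",  # placeholder
         "myclone"],
        check=True)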

0.12.7 (May 22, 2020) – .

Fixes
  • Requesting tailored output (--output=tailored) from a command with a custom result summary renderer produced repeated output. (#4463)

  • A longstanding regression in argcomplete-based command-line completion for Bash has been fixed. You can enable completion by configuring a Bash startup file to run eval "$(register-python-argcomplete datalad)" or source DataLad’s tools/cmdline-completion. The latter should work for Zsh as well. (#4477)

  • publish didn’t prevent git-fetch from recursing into submodules, leading to a failure when the registered submodule was not present locally and the submodule did not have a remote named ‘origin’. (#4560)

  • addurls botched path handling when the file name format started with “./” and the call was made from a subdirectory of the dataset. (#4504)

  • Double dash options in manpages were unintentionally escaped. (#4332)

  • The check for HTTP authentication failures crashed in situations where content came in as bytes rather than unicode. (#4543)

  • A check in AnnexRepo.whereis could lead to a type error. (#4552)

  • When installing a dataset to obtain a subdataset, get confusingly displayed a message that described the containing dataset as “underneath” the subdataset. (#4456)

  • A couple of Makefile rules didn’t properly quote paths. (#4481)

  • With DueCredit support enabled (DUECREDIT_ENABLE=1), the query for metadata information could flood the output with warnings if datasets didn’t have aggregated metadata. The warnings are now silenced, with the overall failure of a metadata call logged at the debug level. (#4568)

Enhancements and new features
  • The resource identifier helper learned to recognize URLs with embedded Git transport information, such as gcrypt::https://example.com. (#4529)

  • When running non-interactively, a more informative error is now signaled when the UI backend, which cannot display a question, is asked to do so. (#4553)

0.12.6 (April 23, 2020) – .

Major refactoring and deprecations
  • The value of datalad.support.annexrepo.N_AUTO_JOBS is no longer considered. The variable will be removed in a later release. (#4409)

Fixes
  • Starting with v0.12.0, datalad save recorded the current branch of a parent dataset as the branch value in the .gitmodules entry for a subdataset. This behavior is problematic for a few reasons and has been reverted. (#4375)

  • The default for the --jobs option, “auto”, instructed DataLad to pass a value to git-annex’s --jobs equal to min(8, max(3, <number of CPUs>)), which could lead to issues due to the large number of child processes spawned and file descriptors opened. To avoid this behavior, --jobs=auto now results in git-annex being called with --jobs=1 by default. Configure the new option datalad.runtime.max-annex-jobs to control the maximum value that will be considered when --jobs='auto'. (#4409)

  • Various commands have been adjusted to better handle the case where a remote’s HEAD ref points to an unborn branch. (#4370)

  • search

    • learned to use the query as a regular expression that restricts the keys that are shown for --show-keys short. (#4354)

    • gives a more helpful message when query is an invalid regular expression. (#4398)

  • The code for parsing Git configuration did not follow Git’s behavior of accepting a key with no value as shorthand for key=true. (#4421)

  • AnnexRepo.info needed a compatibility update for a change in how git-annex reports file names. (#4431)

  • create-sibling-github did not gracefully handle a token that did not have the necessary permissions. (#4400)

Enhancements and new features
  • search learned to use the query as a regular expression that restricts the keys that are shown for --show-keys short. (#4354)

  • datalad <subcommand> learned to point to the datalad-container extension when a subcommand from that extension is given but the extension is not installed. (#4400) (#4174)

0.12.5 (Apr 02, 2020) – a small step for datalad …

Fix some bugs and make the world an even better place.

Fixes
  • Our log_progress helper mishandled the initial display and step of the progress bar. (#4326)

  • AnnexRepo.get_content_annexinfo is designed to accept init=None, but passing that led to an error. (#4330)

  • Update a regular expression to handle an output change in Git v2.26.0. (#4328)

  • We now set LC_MESSAGES to ‘C’ while running git to avoid failures when parsing output that is marked for translation. (#4342)

  • The helper for decoding JSON streams loaded the last line of input without decoding it if the line didn’t end with a new line, a regression introduced in the 0.12.0 release. (#4361)

  • The clone command failed to git-annex-init a fresh clone whenever it considered to add the origin of the origin as a remote. (#4367)

0.12.4 (Mar 19, 2020) – Windows?!

The main purpose of this release is to have one on PyPi that has no associated wheel to enable a working installation on Windows (#4315).

Fixes
  • The description of the log.outputs config switch did not keep up with code changes and incorrectly stated that the output would be logged at the DEBUG level; logging actually happens at a lower level. (#4317)

0.12.3 (March 16, 2020) – .

Updates for compatibility with the latest git-annex, along with a few miscellaneous fixes

Major refactoring and deprecations
  • All spots that raised a NoDatasetArgumentFound exception now raise a NoDatasetFound exception to better reflect the situation: it is the dataset rather than the argument that is not found. For compatibility, the latter inherits from the former, but new code should prefer the latter. (#4285)

Fixes
  • Updates for compatibility with git-annex version 8.20200226. (#4214)

  • datalad export-to-figshare failed to export if the generated title was fewer than three characters. It now queries the caller for the title and guards against titles that are too short. (#4140)

  • Authentication was requested multiple times when git-annex launched parallel downloads from the datalad special remote. (#4308)

  • At verbose logging levels, DataLad requests that git-annex display debugging information too. Work around a bug in git-annex that prevented that from happening. (#4212)

  • The internal command runner looked in the wrong place for some configuration variables, including datalad.log.outputs, resulting in the default value always being used. (#4194)

  • publish failed when trying to publish to a git-lfs special remote for the first time. (#4200)

  • AnnexRepo.set_remote_url is supposed to establish shared SSH connections but failed to do so. (#4262)

Enhancements and new features
  • The message provided when a command cannot determine what dataset to operate on has been improved. (#4285)

  • The “aws-s3” authentication type now allows specifying the host through “aws-s3_host”, which was needed to work around an authorization error due to a longstanding upstream bug. (#4239)

  • The xmp metadata extractor now recognizes “.wav” files.

0.12.2 (Jan 28, 2020) – Smoothen the ride

Mostly a bugfix release with various robustifications, but also makes the first step towards versioned dataset installation requests.

Major refactoring and deprecations
  • The minimum required version for GitPython is now 2.1.12. (#4070)

Fixes
  • The class for handling configuration values, ConfigManager, inappropriately considered the current working directory’s dataset, if any, for both reading and writing when instantiated with dataset=None. This misbehavior is fairly inaccessible through typical use of DataLad. It affects datalad.cfg, the top-level configuration instance that should not consider repository-specific values. It also affects Python users that call Dataset with a path that does not yet exist; the misbehavior persists until that dataset is created. (#4078)

  • update saved the dataset when called with --merge, which is unnecessary and risks committing unrelated changes. (#3996)

  • Confusing and irrelevant information about Python defaults has been dropped from the command-line help. (#4002)

  • The logic for automatically propagating the ‘origin’ remote when cloning a local source didn’t properly account for relative paths. (#4045)

  • Various fixes to file name handling and quoting on Windows. (#4049) (#4050)

  • When cloning failed, error lines were not bubbled up to the user in some scenarios. (#4060)

Enhancements and new features
  • clone (and thus install)

    • now propagates the reckless mode from the superdataset when cloning a dataset into it. (#4037)

    • gained support for ria+<protocol>:// URLs that point to RIA stores. (#4022)

    • learned to read “@version” from ria+ URLs and install that version of a dataset (#4036) and to apply URL rewrites configured through Git’s url.*.insteadOf mechanism (#4064); see the sketch after this list for an example URL.

    • now copies datalad.get.subdataset-source-candidate-<name> options configured within the superdataset into the subdataset. This is particularly useful for RIA data stores. (#4073)

  • Archives are now (optionally) handled with 7-Zip instead of patool. 7-Zip will be used by default, but patool will be used on non-Windows systems if the datalad.runtime.use-patool option is set or the 7z executable is not found. (#4041)
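
A minimal sketch of the ria+ URL support described in the clone entries above; the store URL, dataset ID, and version tag are hypothetical:

    # clone a dataset from a RIA store, pinned to a specific version
    datalad clone 'ria+https://store.example.org#76b6ca66-36b1-11ea-a2e6-f0d5bf7b5561@v1.0' myds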

0.12.1 (Jan 15, 2020) – Small bump after big bang

Fix some fallout after major release.

Fixes
  • Revert incorrect relative path adjustment to URLs in clone. (#3538)

  • Various small fixes to internal helpers and tests to run on Windows (#2566) (#2534)

0.12.0 (Jan 11, 2020) – Krakatoa

This release is the result of more than a year of development that includes fixes for a large number of issues, yielding more robust behavior across a wider range of use cases, and introduces major changes in API and behavior. It is the first release for which extensive user documentation is available in a dedicated DataLad Handbook. Python 3 (3.5 and later) is now the only supported Python flavor.

Major changes 0.12 vs 0.11
  • save fully replaces add (which is obsolete now, and will be removed in a future release).

  • A new Git-annex aware status command enables detailed inspection of dataset hierarchies. The previously available diff command has been adjusted to match status in argument semantics and behavior.

  • The ability to configure dataset procedures prior and after the execution of particular commands has been replaced by a flexible “hook” mechanism that is able to run arbitrary DataLad commands whenever command results are detected that match a specification.

  • Support of the Windows platform has been improved substantially. While performance and feature coverage on Windows still fall behind Unix-like systems, typical data consumer use cases and standard dataset operations, such as create and save, are now working. Basic support for data provenance capture via run is also functional.

  • Support for Git-annex direct mode repositories has been removed, following the end of support in Git-annex itself.

  • The semantics of relative paths in command line arguments have changed. Previously, a call datalad save --dataset /tmp/myds some/relpath would have been interpreted as saving a file at /tmp/myds/some/relpath into dataset /tmp/myds. This has changed to saving $PWD/some/relpath into dataset /tmp/myds. More generally, relative paths are now always treated as relative to the current working directory, except for path arguments of Dataset class instance methods of the Python API. The resulting partial duplication of path specifications between path and dataset arguments is mitigated by the introduction of two special symbols that can be given as dataset argument: ^ and ^., which identify the topmost superdataset and the closest dataset that contains the working directory, respectively (see the command-line sketch after this list).

  • The concept of a “core API” has been introduced. Commands situated in the module datalad.core (such as create, save, run, status, diff) receive additional scrutiny regarding API and implementation, and are meant to provide longer-term stability. Application developers are encouraged to preferentially build on these commands.
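
A minimal command-line sketch of the new relative-path semantics described above; the paths are hypothetical:

    cd /tmp/myds/some
    # saves /tmp/myds/some/relpath (relative to $PWD) into the dataset at /tmp/myds
    datalad save --dataset /tmp/myds relpath
    # the same, using the new shorthand for "the dataset containing the working directory"
    datalad save --dataset '^.' relpath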

Major refactoring and deprecations since 0.12.0rc6
  • clone has been incorporated into the growing core API. The public --alternative-source parameter has been removed, and a clone_dataset function with multi-source capabilities is provided instead. The --reckless parameter can now take literal mode labels instead of just being a binary flag, but backwards compatibility is maintained.

  • The get_file_content method of GitRepo was no longer used internally or in any known DataLad extensions and has been removed. (#3812)

  • The function rev_get_dataset_root has been renamed to get_dataset_root. rev_get_dataset_root remains as a compatibility alias and will be removed in a later release. (#3815)

  • The add_sibling module, marked obsolete in v0.6.0, has been removed. (#3871)

  • mock is no longer declared as an external dependency because we can rely on it being in the standard library now that our minimum required Python version is 3.5. (#3860)

  • download-url now requires that directories be indicated with a trailing slash rather than interpreting a path as directory when it doesn’t exist. This avoids confusion that can result from typos and makes it possible to support directory targets that do not exist. (#3854)

  • The dataset_only argument of the ConfigManager class is deprecated. Use source="dataset" instead. (#3907)

  • The --proc-pre and --proc-post options have been removed, and configuration values for datalad.COMMAND.proc-pre and datalad.COMMAND.proc-post are no longer honored. The new result hook mechanism provides an alternative for proc-post procedures. (#3963)

Fixes since 0.12.0rc6
  • publish crashed when called with a detached HEAD. It now aborts with an informative message. (#3804)

  • Since 0.12.0rc6 the call to update in siblings resulted in a spurious warning. (#3877)

  • siblings crashed if it encountered an annex repository that was marked as dead. (#3892)

  • The update of rerun in v0.12.0rc3 for the rewritten diff command didn’t account for a change in the output of diff, leading to rerun --report unintentionally including unchanged files in its diff values. (#3873)

  • In 0.12.0rc5 download-url was updated to follow the new path handling logic, but its calls to AnnexRepo weren’t properly adjusted, resulting in incorrect path handling when called from a dataset subdirectory. (#3850)

  • download-url called git annex addurl in a way that failed to register a URL when its header didn’t report the content size. (#3911)

  • With Git v2.24.0, saving new subdatasets failed due to a bug in that Git release. (#3904)

  • With DataLad configured to stop on failure (e.g., specifying --on-failure=stop from the command line), a failing result record was not rendered. (#3863)

  • Installing a subdataset yielded an “ok” status in cases where the repository was not yet in its final state, making it ineffective for a caller to operate on the repository in response to the result. (#3906)

  • The internal helper for converting git-annex’s JSON output did not relay information from the “error-messages” field. (#3931)

  • run-procedure reported relative paths that were confusingly not relative to the current directory in some cases. It now always reports absolute paths. (#3959)

  • diff inappropriately reported files as deleted in some cases when to was a value other than None. (#3999)

  • An assortment of fixes for Windows compatibility. (#3971) (#3974) (#3975) (#3976) (#3979)

  • Subdatasets installed from a source given by relative path will now have this relative path used as ‘url’ in their .gitmodules record, instead of an absolute path generated by Git. (#3538)

  • clone will now correctly interpret ‘~/…’ paths as absolute path specifications. (#3958)

  • run-procedure mistakenly reported a directory as a procedure. (#3793)

  • The cleanup for batched git-annex processes has been improved. (#3794) (#3851)

  • The function for adding a version ID to an AWS S3 URL doesn’t support URLs with an “s3://” scheme and raises a NotImplementedError exception when it encounters one. It has now learned to return such a URL untouched if it already comes with a version ID. (#3842)

  • A few spots needed to be adjusted for compatibility with git-annex’s new --sameas feature, which allows special remotes to share a data store. (#3856)

  • The swallow_logs utility failed to capture some log messages due to an incompatibility with Python 3.7. (#3935)

  • siblings

    • crashed if --inherit was passed but the parent dataset did not have a remote with a matching name. (#3954)

    • configured the wrong pushurl and annexurl values in some cases. (#3955)

Enhancements and new features since 0.12.0rc6
  • By default, datasets cloned from local source paths will now get a configured remote for any recursively discoverable ‘origin’ sibling that is also available from a local path in order to maximize automatic file availability across local annexes. (#3926)

  • The new result hooks mechanism allows callers to specify, via local Git configuration values, DataLad command calls that will be triggered in response to matching result records (i.e., what you see when you call a command with -f json_pp). (#3903)

  • The command interface classes learned to use a new _examples_ attribute to render documentation examples for both the Python and command-line API. (#3821)

  • Candidate URLs for cloning a submodule can now be generated based on configured templates that have access to various properties of the submodule, including its dataset ID. (#3828)

  • DataLad’s check that the user’s Git identity is configured has been sped up and now considers the appropriate environment variables as well. (#3807)

  • The tag method of GitRepo can now tag revisions other than HEAD and accepts a list of arbitrary git tag options. (#3787)

  • When get clones a subdataset and the subdataset’s HEAD differs from the commit that is registered in the parent, the active branch of the subdataset is moved to the registered commit if the registered commit is an ancestor of the subdataset’s HEAD commit. This handling has been moved to a more central location within GitRepo, and now applies to any update_submodule(..., init=True) call. (#3831)

  • The output of datalad -h has been reformatted to improve readability. (#3862)

  • unlock has been sped up. (#3880)

  • run-procedure learned to provide and render more information about discovered procedures, including whether the procedure is overridden by another procedure with the same base name. (#3960)

  • save now (#3817)

    • records the active branch in the superdataset when registering a new subdataset.

    • calls git annex sync when saving a dataset on an adjusted branch so that the changes are brought into the mainline branch.

  • subdatasets now aborts when its dataset argument points to a non-existent dataset. (#3940)

  • wtf now

    • reports the dataset ID if the current working directory is visiting a dataset. (#3888)

    • outputs entries deterministically. (#3927)

  • The ConfigManager class

    • learned to exclude .datalad/config as a source of configuration values, restricting the sources to standard Git configuration files, when called with source="local". (#3907)

    • accepts a value of “override” for its where argument to allow Python callers to more conveniently override configuration. (#3970)

  • Commands now accept a dataset value of “^.” as shorthand for “the dataset to which the current directory belongs”. (#3242)

0.12.0rc6 (Oct 19, 2019) – some releases are better than the others

Bet we will fix some bugs and make the world an even better place.

Major refactoring and deprecations
  • DataLad no longer supports Python 2. The minimum supported version of Python is now 3.5. (#3629)

  • Much of the user-focused content at http://docs.datalad.org has been removed in favor of more up to date and complete material available in the DataLad Handbook. Going forward, the plan is to restrict http://docs.datalad.org to technical documentation geared at developers. (#3678)

  • update used to allow the caller to specify which dataset(s) to update as a PATH argument or via the --dataset option; now only the latter is supported. Path arguments only serve to restrict which subdatasets are updated when operating recursively. (#3700)

  • Result records from a get call no longer have a “state” key. (#3746)

  • update and get no longer support operating on independent hierarchies of datasets. (#3700) (#3746)

  • The update of run in 0.12.0rc4 for the new path resolution logic broke the handling of inputs and outputs for calls from a subdirectory. (#3747)

  • The is_submodule_modified method of GitRepo as well as two helper functions in gitrepo.py, kwargs_to_options and split_remote_branch, were no longer used internally or in any known DataLad extensions and have been removed. (#3702) (#3704)

  • The only_remote option of GitRepo.is_with_annex was not used internally or in any known extensions and has been dropped. (#3768)

  • The get_tags method of GitRepo used to sort tags by committer date. It now sorts them by the tagger date for annotated tags and the committer date for lightweight tags. (#3715)

  • The rev_resolve_path helper has replaced resolve_path. (#3797)

Fixes
  • Correctly handle relative paths in publish. (#3799) (#3102)

  • Do not erroneously discover directory as a procedure. (#3793)

  • Correctly extract version from manpage to trigger use of manpages for --help. (#3798)

  • The cfg_yoda procedure saved all modifications in the repository rather than saving only the files it modified. (#3680)

  • Some spots in the documentation that were supposed to appear as two hyphens were incorrectly rendered as en-dashes in the HTML output. (#3692)

  • create, install, and clone treated paths as relative to the dataset even when the string form was given, violating the new path handling rules. (#3749) (#3777) (#3780)

  • Providing the “^” shortcut to --dataset didn’t work properly when called from a subdirectory of a subdataset. (#3772)

  • We failed to propagate some errors from git-annex when working with its JSON output. (#3751)

  • With the Python API, callers are allowed to pass a string or list of strings as the cfg_proc argument to create, but the string form was mishandled. (#3761)

  • Fixed incorrect command quoting for SSH calls on Windows that rendered basic SSH-related functionality (e.g., sshrun) unusable. (#3688)

  • Annex JSON result handling assumed platform-specific paths on Windows instead of the POSIX-style paths that are used across all platforms. (#3719)

  • path_is_under() was incapable of comparing Windows paths with different drive letters. (#3728)

Enhancements and new features
  • Provide a collection of “public” call_git* helpers within GitRepo and replace use of “private” and less specific _git_custom_command calls. (#3791)

  • status gained a --report-filetype option. Setting it to “raw” can give a performance boost for the price of no longer distinguishing symlinks that point to annexed content from other symlinks. (#3701)

  • save disables file type reporting by status to improve performance. (#3712)

  • subdatasets (#3743)

    • now extends its result records with a contains field that lists which contains arguments matched a given subdataset.

    • yields an ‘impossible’ result record when a contains argument wasn’t matched to any of the reported subdatasets.

  • install now shows more readable output when cloning fails. (#3775)

  • SSHConnection now displays a more informative error message when it cannot start the ControlMaster process. (#3776)

  • If the new configuration option datalad.log.result-level is set to a single level, all result records will be logged at that level. If you’ve been bothered by DataLad’s double reporting of failures, consider setting this to “debug” (see the configuration sketch after this list). (#3754)

  • Configuration values from datalad -c OPTION=VALUE ... are now validated to provide better errors. (#3695)

  • rerun learned how to handle history with merges. As was already the case when cherry picking non-run commits, re-creating merges may result in conflicts, and rerun does not yet provide an interface to let the user handle these. (#2754)

  • The fsck method of AnnexRepo has been enhanced to expose more features of the underlying git fsck command. (#3693)

  • GitRepo now has a for_each_ref_ method that wraps git for-each-ref, which is used in various spots that used to rely on GitPython functionality. (#3705)

  • Do not pretend to be able to work in optimized (python -O) mode; instead, crash early with an informative message. (#3803)
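
A minimal sketch of the datalad.log.result-level option mentioned above; the per-call -c override and the dataset URL are assumptions about usage, not part of this entry:

    # log all result records at the debug level for this invocation only
    datalad -c datalad.log.result-level=debug install https://example.org/some/dataset
    # or persist the setting
    git config --global datalad.log.result-level debug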

0.12.0rc5 (September 04, 2019) – .

Various fixes and enhancements that bring the 0.12.0 release closer.

Major refactoring and deprecations
  • The two modules below have a new home. The old locations still exist as compatibility shims and will be removed in a future release.

    • datalad.distribution.subdatasets has been moved to datalad.local.subdatasets (#3429)

    • datalad.interface.run has been moved to datalad.core.local.run (#3444)

  • The lock method of AnnexRepo and the options parameter of AnnexRepo.unlock were unused internally and have been removed. (#3459)

  • The get_submodules method of GitRepo has been rewritten without GitPython. When the new compat flag is true (the current default), the method returns a value that is compatible with the old return value. This backwards-compatible return value and the compat flag will be removed in a future release. (#3508)

  • The logic for resolving relative paths given to a command has changed (#3435). The new rule is that relative paths are taken as relative to the dataset only if a dataset instance is passed by the caller. In all other scenarios they’re considered relative to the current directory.

    The main user-visible difference from the command line is that using the --dataset argument does not result in relative paths being taken as relative to the specified dataset. (The undocumented distinction between “rel/path” and “./rel/path” no longer exists.)

    All commands under datalad.core and datalad.local, as well as unlock and addurls, follow the new logic. The goal is for all commands to eventually do so.

Fixes
  • The function for loading JSON streams wasn’t clever enough to handle content that included a Unicode line separator like U2028. (#3524)

  • When unlock was called without an explicit target (i.e., a directory or no paths at all), the call failed if any of the files did not have content present. (#3459)

  • AnnexRepo.get_content_info failed in the rare case of a key without size information. (#3534)

  • save ignored --on-failure in its underlying call to status. (#3470)

  • Calling remove with a subdirectory displayed spurious warnings about the subdirectory files not existing. (#3586)

  • Our processing of git-annex --json output mishandled info messages from special remotes. (#3546)

  • create

    • didn’t bypass the “existing subdataset” check when called with --force as of 0.12.0rc3 (#3552)

    • failed to register the up-to-date revision of a subdataset when --cfg-proc was used with --dataset (#3591)

  • The base downloader had some error handling that wasn’t compatible with Python 3. (#3622)

  • Fixed a number of Unicode py2-compatibility issues. (#3602)

  • AnnexRepo.get_content_annexinfo did not properly chunk file arguments to avoid exceeding the command-line character limit. (#3587)

Enhancements and new features
  • New command create-sibling-gitlab provides an interface for creating a publication target on a GitLab instance. (#3447)

  • subdatasets (#3429)

    • now supports path-constrained queries in the same manner as commands like save and status

    • gained a --contains=PATH option that can be used to restrict the output to datasets that include a specific path (see the sketch after this list).

    • now narrows the listed subdatasets to those underneath the current directory when called with no arguments

  • status learned to accept a plain --annex (no value) as shorthand for --annex basic. (#3534)

  • The .dirty property of GitRepo and AnnexRepo has been sped up. (#3460)

  • The get_content_info method of GitRepo, used by status and commands that depend on status, now restricts its git calls to a subset of files, if possible, for a performance gain in repositories with many files. (#3508)

  • Extensions that do not provide a command, such as those that provide only metadata extractors, are now supported. (#3531)

  • When calling git-annex with --json, we log standard error at the debug level rather than the warning level if a non-zero exit is expected behavior. (#3518)

  • create no longer refuses to create a new dataset in the odd scenario of an empty .git/ directory upstairs. (#3475)

  • As of v2.22.0 Git treats a sub-repository on an unborn branch as a repository rather than as a directory. Our documentation and tests have been updated appropriately. (#3476)

  • addurls learned to accept a --cfg-proc value and pass it to its create calls. (#3562)
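
A minimal sketch of the subdatasets query and the plain --annex shorthand described above; the path is hypothetical:

    # report only subdatasets that contain the given path
    datalad subdatasets --contains data/raw/file.dat
    # plain --annex is now shorthand for --annex basic
    datalad status --annex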

0.12.0rc4 (May 15, 2019) – the revolution is over

With the replacement of the save command implementation with rev-save the revolution effort is now over, and the set of key commands for local dataset operations (create, run, save, status, diff) is now complete. This new core API is available from datalad.core.local (and also via datalad.api, as any other command).

Major refactoring and deprecations
  • The add command is now deprecated. It will be removed in a future release.

Fixes
  • Remove hard-coded dependencies on POSIX path conventions in SSH support code (#3400)

  • Emit an add result when adding a new subdataset during save (#3398)

  • SSH file transfer now actually opens a shared connection, if none exists yet (#3403)

Enhancements and new features
  • SSHConnection now offers methods for file upload and download (get(), put()). The previous copy() method only supported upload and was discontinued (#3401)

0.12.0rc3 (May 07, 2019) – the revolution continues

Continues API consolidation and replaces the create and diff command with more performant implementations.

Major refactoring and deprecations
  • The previous diff command has been replaced by the diff variant from the datalad-revolution extension. (#3366)

  • rev-create has been renamed to create, and the previous create has been removed. (#3383)

  • The procedure setup_yoda_dataset has been renamed to cfg_yoda (#3353).

  • The --nosave option of addurls now affects only added content, not newly created subdatasets (#3259).

  • Dataset.get_subdatasets (deprecated since v0.9.0) has been removed. (#3336)

  • The .is_dirty method of GitRepo and AnnexRepo has been replaced by .status or, for a subset of cases, the .dirty property. (#3330)

  • AnnexRepo.get_status has been replaced by AnnexRepo.status. (#3330)

Fixes
  • status

    • reported on directories that contained only ignored files (#3238)

    • gave a confusing failure when called from a subdataset with an explicitly specified dataset argument and “.” as a path (#3325)

    • misleadingly claimed that the locally present content size was zero when --annex basic was specified (#3378)

  • An informative error wasn’t given when a download provider was invalid. (#3258)

  • Calling rev-save PATH saved unspecified untracked subdatasets. (#3288)

  • The available choices for command-line options that take values are now displayed more consistently in the help output. (#3326)

  • The new pathlib-based code had various encoding issues on Python 2. (#3332)

Enhancements and new features
  • wtf now includes information about the Python version. (#3255)

  • When operating in an annex repository, checking whether git-annex is available is now delayed until a call to git-annex is actually needed, allowing systems without git-annex to operate on annex repositories in a restricted fashion. (#3274)

  • The load_stream helper now supports auto-detection of compressed files. (#3289)

  • create (formerly rev-create)

    • learned to be speedier by passing a path to status (#3294)

    • gained a --cfg-proc (or -c) convenience option for running configuration procedures (or more accurately any procedure that begins with “cfg_”) in the newly created dataset (#3353); see the sketch after this list

  • AnnexRepo.set_metadata now returns a list while AnnexRepo.set_metadata_ returns a generator, a behavior which is consistent with the add and add_ method pair. (#3298)

  • AnnexRepo.get_metadata now supports batch querying of known annex files. Note, however, that callers should carefully validate the input paths because the batch call will silently hang if given non-annex files. (#3364)

  • status

    • now reports a “bytesize” field for files tracked by Git (#3299)

    • gained a new option eval_subdataset_state that controls how the subdataset state is evaluated. Depending on the information you need, you can select a less expensive mode to make status faster. (#3324)

    • colors deleted files “red” (#3334)

  • Querying repository content is faster due to batching of git cat-file calls. (#3301)

  • The dataset ID of a subdataset is now recorded in the superdataset. (#3304)

  • GitRepo.diffstatus

    • now avoids subdataset recursion when the comparison is not with the working tree, which substantially improves performance when diffing large dataset hierarchies (#3314)

    • got smarter and faster about labeling a subdataset as “modified” (#3343)

  • GitRepo.get_content_info now supports disabling the file type evaluation, which gives a performance boost in cases where this information isn’t needed. (#3362)

  • The XMP metadata extractor now filters based on file name to improve its performance. (#3329)
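
A minimal sketch of the --cfg-proc convenience option on create mentioned above; the procedure name text2git (i.e. cfg_text2git) and the dataset names are illustrative:

    # run the cfg_text2git procedure in the newly created dataset
    datalad create --cfg-proc text2git my_new_dataset
    # the short form of the option
    datalad create -c text2git another_dataset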

0.12.0rc2 (Mar 18, 2019) – revolution!

Fixes
  • GitRepo.dirty does not report on nested empty directories (#3196).

  • GitRepo.save() reports results on deleted files.

Enhancements and new features
  • Absorb a new set of core commands from the datalad-revolution extension:

    • rev-status: like git status, but simpler and working with dataset hierarchies

    • rev-save: a 2-in-1 replacement for save and add

    • rev-create: a ~30% faster create

  • JSON support tools can now read and write compressed files.

0.12.0rc1 (Mar 03, 2019) – to boldly go …

Major refactoring and deprecations
  • Discontinued support for git-annex direct-mode (also no longer supported upstream).

Enhancements and new features
  • Dataset and Repo object instances are now hashable, and can be created based on pathlib Path object instances

  • Imported various additional methods for the Repo classes to query information and save changes.

0.11.8 (Oct 11, 2019) – annex-we-are-catching-up

Fixes
  • Our internal command runner failed to capture output in some cases. (#3656)

  • Added a workaround in the tests for cPython >= 3.7.5, where a ‘;’ in a filename confuses mimetypes (#3769) (#3770)

Enhancements and new features
  • Prepared for upstream changes in git-annex, including support for the latest git-annex

    • 7.20190912 auto-upgrades v5 repositories to v7. (#3648) (#3682)

    • 7.20191009 fixed treatment of (larger/smaller)than in .gitattributes (#3765)

  • The cfg_text2git procedure, as well as the --text-no-annex option of create, now configure .gitattributes so that empty files are stored in git rather than annex. (#3667)

0.11.7 (Sep 06, 2019) – python2-we-still-love-you-but-…

Primarily bugfixes with some optimizations and refactorings.

Fixes
  • addurls

    • now provides better handling when the URL file isn’t in the expected format. (#3579)

    • always treated a relative path given for the URL file argument as relative to the current working directory, which goes against the convention used by other commands of taking relative paths as relative to the dataset argument. (#3582)

  • run-procedure

    • hard coded “python” when formatting the command for non-executable procedures ending with “.py”. sys.executable is now used. (#3624)

    • failed if arguments needed more complicated quoting than simply surrounding the value with double quotes. This has been resolved for systems that support shlex.quote, but note that on Windows values are left unquoted. (#3626)

  • siblings now displays an informative error message if a local path is given to --url but --name isn’t specified. (#3555)

  • sshrun, the command DataLad uses for GIT_SSH_COMMAND, didn’t support all the parameters that Git expects it to. (#3616)

  • Fixed a number of Unicode py2-compatibility issues. (#3597)

  • download-url now will create leading directories of the output path if they do not exist (#3646)

Enhancements and new features
  • The annotate-paths helper now caches subdatasets it has seen to avoid unnecessary calls. (#3570)

  • A repeated configuration query has been dropped from the handling of --proc-pre and --proc-post. (#3576)

  • Calls to git annex find now use --in=. instead of the alias --in=here to take advantage of an optimization that git-annex (as of the current release, 7.20190730) applies only to the former. (#3574)

  • addurls now suggests close matches when the URL or file format contains an unknown field. (#3594)

  • Shared logic used in the setup.py files of DataLad and its extensions has been moved to modules in the _datalad_build_support/ directory. (#3600)

  • Get ready for upcoming git-annex dropping support for direct mode (#3631)

0.11.6 (Jul 30, 2019) – am I the last of 0.11.x?

Primarily bug fixes to achieve more robust performance

Fixes
  • Our tests needed various adjustments to keep up with upstream changes in Travis and Git. (#3479) (#3492) (#3493)

  • AnnexRepo.is_special_annex_remote was too selective in what it considered to be a special remote. (#3499)

  • We now provide information about unexpected output when git-annex is called with --json. (#3516)

  • Exception logging in the __del__ method of GitRepo and AnnexRepo no longer fails if the names it needs are no longer bound. (#3527)

  • addurls botched the construction of subdataset paths that were more than two levels deep and failed to create datasets in a reliable, breadth-first order. (#3561)

  • Cloning a type=git special remote showed a spurious warning about the remote not being enabled. (#3547)

Enhancements and new features
  • For calls to git and git-annex, we disable automatic garbage collection due to past issues with GitPython’s state becoming stale, but doing so results in a larger .git/objects/ directory that isn’t cleaned up until garbage collection is triggered outside of DataLad. Tests with the latest GitPython didn’t reveal any state issues, so we’ve re-enabled automatic garbage collection. (#3458)

  • rerun learned an --explicit flag, which it relays to its calls to run. This makes it possible to call rerun in a dirty working tree (#3498).

  • The metadata command aborts earlier if a metadata extractor is unavailable. (#3525)

0.11.5 (May 23, 2019) – stability is not overrated

Should be faster and less buggy, with a few enhancements.

Fixes
  • create-sibling (#3318)

    • Siblings are no longer configured with a post-update hook unless a web interface is requested with --ui.

    • git submodule update --init is no longer called from the post-update hook.

    • If --inherit is given for a dataset without a superdataset, a warning is now given instead of raising an error.

  • The internal command runner failed on Python 2 when its env argument had unicode values. (#3332)

  • The safeguard that prevents creating a dataset in a subdirectory that already contains tracked files for another repository failed on Git versions before 2.14. For older Git versions, we now warn the caller that the safeguard is not active. (#3347)

  • A regression introduced in v0.11.1 prevented save from committing changes under a subdirectory when the subdirectory was specified as a path argument. (#3106)

  • A workaround introduced in v0.11.1 made it possible for save to do a partial commit with an annex file that has gone below the annex.largefiles threshold. The logic of this workaround was faulty, leading to files being displayed as typechanged in the index following the commit. (#3365)

  • The resolve_path() helper confused paths that had a semicolon for SSH RIs. (#3425)

  • The detection of SSH RIs has been improved. (#3425)

Enhancements and new features
  • The internal command runner was too aggressive in its decision to sleep. (#3322)

  • The “INFO” label in log messages now retains the default text color for the terminal rather than using white, which only worked well for terminals with dark backgrounds. (#3334)

  • A short flag -R is now available for the --recursion-limit flag, a flag shared by several subcommands. (#3340)

  • The authentication logic for create-sibling-github has been revamped and now supports 2FA. (#3180)

  • New configuration option datalad.ui.progressbar can be used to configure the default backend for progress reporting (“none”, for example, results in no progress bars being shown); see the configuration sketch after this list. (#3396)

  • A new progress backend, available by setting datalad.ui.progressbar to “log”, replaces progress bars with a log message upon completion of an action. (#3396)

  • DataLad learned to consult the NO_COLOR environment variable and the new datalad.ui.color configuration option when deciding to color output. The default value, “auto”, retains the current behavior of coloring output if attached to a TTY (#3407).

  • clean now removes annex transfer directories, which is useful for cleaning up failed downloads. (#3374)

  • clone no longer refuses to clone into a local path that looks like a URL, making its behavior consistent with git clone. (#3425)

  • wtf

    • Learned to fall back to the dist package if platform.dist, which has been removed in the yet-to-be-released Python 3.8, does not exist. (#3439)

    • Gained a --section option for limiting the output to specific sections and a --decor option, which currently knows how to format the output as GitHub’s <details> section. (#3440)
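
A minimal sketch of the reporting-related configuration described above; persisting the values in the global Git scope is an assumption about usage:

    # replace progress bars with a log message upon completion of an action
    git config --global datalad.ui.progressbar log
    # the default for output coloring remains "auto" (color only when attached to a TTY)
    git config --global datalad.ui.color auto
    # the NO_COLOR environment variable is also consulted
    NO_COLOR=1 datalad status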

0.11.4 (Mar 18, 2019) – get-ready

Largely a bug fix release with a few enhancements

Important
  • The 0.11.x series will be the last one with support for direct mode of git-annex, which is used on crippled (no symlinks and no locking) filesystems. v7 repositories should be used instead.

Fixes
  • Extraction of .gz files is broken without p7zip installed. We now abort with an informative error in this situation. (#3176)

  • Committing failed in some cases because we didn’t ensure that the path passed to git read-tree --index-output=... resided on the same filesystem as the repository. (#3181)

  • Some pointless warnings during metadata aggregation have been eliminated. (#3186)

  • With Python 3 the LORIS token authenticator did not properly decode a response (#3205).

  • With Python 3 downloaders unnecessarily decoded the response when getting the status, leading to an encoding error. (#3210)

  • In some cases, our internal command Runner did not adjust the environment’s PWD to match the current working directory specified with the cwd parameter. (#3215)

  • The specification of the pyliblzma dependency was broken. (#3220)

  • search displayed an uninformative blank log message in some cases. (#3222)

  • The logic for finding the location of the aggregate metadata DB anchored the search path incorrectly, leading to a spurious warning. (#3241)

  • Some progress bars were still displayed when stdout and stderr were not attached to a tty. (#3281)

  • We now check that stdin/stdout/stderr are not closed before checking .isatty. (#3268)

Enhancements and new features
  • Creating a new repository now aborts if any of the files in the directory are tracked by a repository in a parent directory. (#3211)

  • run learned to replace the {tmpdir} placeholder in commands with a temporary directory (see the sketch after this list). (#3223)

  • duecredit support has been added for citing DataLad itself as well as datasets that an analysis uses. (#3184)

  • The eval_results interface helper unintentionally modified one of its arguments. (#3249)

  • A few DataLad constants have been added, changed, or renamed (#3250):

    • HANDLE_META_DIR is now DATALAD_DOTDIR. The old name should be considered deprecated.

    • METADATA_DIR now refers to DATALAD_DOTDIR/metadata rather than DATALAD_DOTDIR/meta (which is still available as OLDMETADATA_DIR).

    • The new DATASET_METADATA_FILE refers to METADATA_DIR/dataset.json.

    • The new DATASET_CONFIG_FILE refers to DATALAD_DOTDIR/config.

    • METADATA_FILENAME has been renamed to OLDMETADATA_FILENAME.
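
A minimal sketch of the {tmpdir} placeholder mentioned in the run entry above; the wrapped command is hypothetical:

    # {tmpdir} is replaced with a temporary directory before the command runs
    datalad run 'convert-data --workdir {tmpdir} --out results/'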

0.11.3 (Feb 19, 2019) – read-me-gently

Just a few of important fixes and minor enhancements.

Fixes
  • The logic for setting the maximum command line length now works around Python 3.4 returning an unreasonably high value for SC_ARG_MAX on Debian systems. (#3165)

  • DataLad commands that are conceptually “read-only”, such as datalad ls -L, can fail when the caller lacks write permissions because git-annex tries merging remote git-annex branches to update information about availability. DataLad now disables annex.merge-annex-branches in some common “read-only” scenarios to avoid these failures. (#3164)

Enhancements and new features
  • Accessing an “unbound” dataset method now automatically imports the necessary module rather than requiring an explicit import from the Python caller. For example, calling Dataset.add no longer needs to be preceded by from datalad.distribution.add import Add or an import of datalad.api. (#3156)

  • Configuring the new variable datalad.ssh.identityfile instructs DataLad to pass a value to the -i option of ssh (see the sketch after this list). (#3149) (#3168)
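
A minimal sketch of the datalad.ssh.identityfile option described above; the key path is hypothetical:

    # instruct DataLad to pass this key to ssh via its -i option
    git config --global datalad.ssh.identityfile ~/.ssh/id_ed25519_datalad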

0.11.2 (Feb 07, 2019) – live-long-and-prosper

A variety of bugfixes and enhancements

Major refactoring and deprecations
  • All extracted metadata is now placed under git-annex by default. Previously files smaller than 20 kb were stored in git. (#3109)

  • The function datalad.cmd.get_runner has been removed. (#3104)

Fixes
  • Improved handling of long commands:

    • The code that inspected SC_ARG_MAX didn’t check that the reported value was a sensible, positive number. (#3025)

    • More commands that invoke git and git-annex with file arguments learned to split up the command calls when it is likely that the command would fail due to exceeding the maximum supported length. (#3138)

  • The setup_yoda_dataset procedure created a malformed .gitattributes line. (#3057)

  • download-url unnecessarily tried to infer the dataset when --no-save was given. (#3029)

  • rerun aborted too late and with a confusing message when a ref specified via --onto didn’t exist. (#3019)

  • run:

    • run didn’t preserve the current directory prefix (“./”) on inputs and outputs, which is problematic if the caller relies on this representation when formatting the command. (#3037)

    • Fixed a number of unicode py2-compatibility issues. (#3035) (#3046)

    • To proceed with a failed command, the user was confusingly instructed to use save instead of add even though run uses add underneath. (#3080)

  • Fixed a case where the helper class for checking external modules incorrectly reported a module as unknown. (#3051)

  • add-archive-content mishandled the archive path when the leading path contained a symlink. (#3058)

  • Following denied access, the credential code failed to consider a scenario, leading to a type error rather than an appropriate error message. (#3091)

  • Some tests failed when executed from a git worktree checkout of the source repository. (#3129)

  • During metadata extraction, batched annex processes weren’t properly terminated, leading to issues on Windows. (#3137)

  • add incorrectly handled an “invalid repository” exception when trying to add a submodule. (#3141)

  • Pass GIT_SSH_VARIANT=ssh to git processes to be able to specify alternative ports in SSH urls

Enhancements and new features
  • search learned to suggest closely matching keys if there are no hits. (#3089)

  • create-sibling

  • Interface classes can now override the default renderer for summarizing results. (#3061)

  • run:

    • --input and --output can now be shortened to -i and -o. (#3066)

    • Placeholders such as “{inputs}” are now expanded in the command that is shown in the commit message subject (see the sketch after this list). (#3065)

    • interface.run.run_command gained an extra_inputs argument so that wrappers like datalad-container can specify additional inputs that aren’t considered when formatting the command string. (#3038)

    • “–” can now be used to separate options for run and those for the command in ambiguous cases. (#3119)

  • The utilities create_tree and ok_file_has_content now support “.gz” files. (#3049)

  • The Singularity container for 0.11.1 now uses nd_freeze to make its builds reproducible.

  • A publications page has been added to the documentation. (#3099)

  • GitRepo.set_gitattributes now accepts a mode argument that controls whether the .gitattributes file is appended to (default) or overwritten. (#3115)

  • datalad --help now avoids using man so that the list of subcommands is shown. (#3124)
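
A minimal sketch of the run conveniences described above; the script and file names are hypothetical:

    # -i/-o are the new short forms of --input/--output; {inputs}/{outputs} are expanded
    datalad run -i raw/data.csv -o results/out.csv 'python analyze.py {inputs} {outputs}'
    # "--" separates run's own options from the command in ambiguous cases
    datalad run -i raw/data.csv -- python analyze.py raw/data.csv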

0.11.1 (Nov 26, 2018) – v7-better-than-v6

Rushed out bugfix release to stay fully compatible with recent git-annex which introduced v7 to replace v6.

Fixes
  • install: be able to install recursively into a dataset (#2982)

  • save: be able to commit/save changes whenever files potentially could have swapped their storage between git and annex (#1651) (#2752) (#3009)

  • aggregate-metadata:

    • the dataset itself is now not “aggregated” if specific paths are provided for aggregation (#3002). That resolves the issue of a -r invocation aggregating all subdatasets of the specified dataset as well

    • also compare/verify the actual content checksum of aggregated metadata while considering subdataset metadata for re-aggregation (#3007)

  • annex commands are now chunked assuming a 50% “safety margin” on the maximal command line length. Should resolve crashes while operating on too many files at once (#3001)

  • run sidecar config processing (#2991)

  • no double trailing period in docs (#2984)

  • correct identification of the repository with symlinks in the paths in the tests (#2972)

  • re-evaluation of dataset properties in case of dataset changes (#2946)

  • text2git procedure to use ds.repo.set_gitattributes (#2974) (#2954)

  • Switch to use plain os.getcwd() if inconsistency with env var $PWD is detected (#2914)

  • Make sure that credential defined in env var takes precedence (#2960) (#2950)

Enhancements and new features
  • shub://datalad/datalad:git-annex-dev provides a Debian buster Singularity image with build environment for git-annex. tools/bisect-git-annex provides a helper for running git bisect on git-annex using that Singularity container (#2995)

  • Added .zenodo.json for better integration with Zenodo for citation

  • run-procedure now provides names and help messages with a custom renderer (#2993)

  • Documentation: point to datalad-revolution extension (prototype of the greater DataLad future)

  • run

    • support injecting a detached command (#2937)

  • annex metadata extractor now extracts the annex.key metadata record. Should now allow identifying uses of specific files, etc (#2952)

  • Test that we can install from http://datasets.datalad.org

  • Proper rendering of CommandError (e.g. in case of “out of space” error) (#2958)

0.11.0 (Oct 23, 2018) – Soon-to-be-perfect

git-annex 6.20180913 (or later) is now required - provides a number of fixes for v6 mode operations etc.

Major refactoring and deprecations
  • datalad.consts.LOCAL_CENTRAL_PATH constant was deprecated in favor of datalad.locations.default-dataset configuration variable (#2835)

Minor refactoring
  • "notneeded" messages are no longer reported by default results renderer

  • run no longer shows commit instructions upon command failure when explicit is true and no outputs are specified (#2922)

  • get_git_dir moved into GitRepo (#2886)

  • _gitpy_custom_call removed from GitRepo (#2894)

  • GitRepo.get_merge_base argument is now called commitishes instead of treeishes (#2903)

Fixes
  • update should not leave the dataset in non-clean state (#2858) and some other enhancements (#2859)

  • Fixed chunking of the long command lines to account for decorators and other arguments (#2864)

  • Progress bar should not crash the process on some missing progress information (#2891)

  • Default value for jobs set to be "auto" (not None) to take advantage of possible parallel get if in -g mode (#2861)

  • wtf must not crash if git-annex is not installed etc (#2865), (#2918), (#2917)

  • Fixed paths (with spaces etc) handling while reporting annex error output (#2892), (#2893)

  • __del__ should not access .repo but ._repo to avoid attempts for reinstantiation etc (#2901)

  • Fix up submodule .git right in GitRepo.add_submodule to avoid added submodules being non git-annex friendly (#2909), (#2904)

  • run-procedure (#2905)

    • now will provide the dataset to the procedure if called within a dataset

    • will not crash if procedure is an executable without .py or .sh suffixes

  • Use centralized .gitattributes handling while setting annex backend (#2912)

  • GlobbedPaths.expand(..., full=True) incorrectly returned relative paths when called more than once (#2921)

Enhancements and new features
  • Report progress on clone when installing from “smart” git servers (#2876)

  • Stale/unused sth_like_file_has_content was removed (#2860)

  • Enhancements to search to operate on “improved” metadata layouts (#2878)

  • Output of git annex init operation is now logged (#2881)

  • New

    • GitRepo.cherry_pick (#2900)

    • GitRepo.format_commit (#2902)

  • run-procedure (#2905)

    • procedures can now recursively be discovered in subdatasets as well. The uppermost has highest priority

    • Procedures in user and system locations now take precedence over those in datasets.

0.10.3.1 (Sep 13, 2018) – Nothing-is-perfect

Emergency bugfix to address forgotten boost of version in datalad/version.py.

0.10.3 (Sep 13, 2018) – Almost-perfect

This is largely a bugfix release which addressed many (but not yet all) issues of working with git-annex direct and version 6 modes, and operation on Windows in general. Among enhancements you will see the support of public S3 buckets (even with periods in their names), ability to configure new providers interactively, and improved egrep search backend.

Although we do not require it with this release, it is recommended to make sure that you are using a recent git-annex since it also had a variety of fixes and enhancements in the past months.

Fixes
  • Parsing of combined short options has been broken since DataLad v0.10.0. (#2710)

  • The datalad save instructions shown by datalad run for a command with a non-zero exit were incorrectly formatted. (#2692)

  • Decompression of zip files (e.g., through datalad add-archive-content) failed on Python 3. (#2702)

  • Windows:

    • colored log output was not being processed by colorama. (#2707)

    • more codepaths now try multiple times when removing a file to deal with latency and locking issues on Windows. (#2795)

  • Internal git fetch calls have been updated to work around a GitPython BadName issue. (#2712), (#2794)

  • The progress bar for annex file transferring was unable to handle an empty file. (#2717)

  • datalad add-readme halted when no aggregated metadata was found rather than displaying a warning. (#2731)

  • datalad rerun failed if --onto was specified and the history contained no run commits. (#2761)

  • Processing of a command’s results failed on a result record with a missing value (e.g., absent field or subfield in metadata). Now the missing value is rendered as “N/A”. (#2725).

  • A couple of documentation links in the “Delineation from related solutions” were misformatted. (#2773)

  • With the latest git-annex, several known V6 failures are no longer an issue. (#2777)

  • In direct mode, committing changes would often commit annexed content as regular Git files. A new approach fixes this and resolves a good number of known failures. (#2770)

  • The reporting of command results failed if the current working directory was removed (e.g., after an unsuccessful install). (#2788)

  • When installing into an existing empty directory, datalad install removed the directory after a failed clone. (#2788)

  • datalad run incorrectly handled inputs and outputs for paths with spaces and other characters that require shell escaping. (#2798)

  • Globbing inputs and outputs for datalad run didn’t work correctly if a subdataset wasn’t installed. (#2796)

  • Minor (in)compatibility with git 2.19 - (no) trailing period in an error message now. (#2815)

Enhancements and new features
  • Anonymous access is now supported for S3 and other downloaders. (#2708)

  • A new interface is available to ease setting up new providers. (#2708)

  • Metadata: changes to egrep mode search (#2735)

    • Queries in egrep mode are now case-sensitive when the query contains any uppercase letters and are case-insensitive otherwise. The new mode egrepcs can be used to perform a case-sensitive query with all lower-case letters.

    • Search can now be limited to a specific key.

    • Multiple queries (list of expressions) are evaluated using AND to determine whether something is a hit.

    • A single multi-field query (e.g., pa*:findme) is a hit, when any matching field matches the query.

    • All matching key/value combinations across all (multi-field) queries are reported in the query_matched result field.

    • egrep mode now shows all hits rather than limiting the results to the top 20 hits.

  • The documentation on how to format commands for datalad run has been improved. (#2703)

  • The method for determining the current working directory on Windows has been improved. (#2707)

  • datalad --version now simply shows the version without the license. (#2733)

  • datalad export-archive learned to export under an existing directory via its --filename option. (#2723)

  • datalad export-to-figshare now generates the zip archive in the root of the dataset unless --filename is specified. (#2723)

  • After importing datalad.api, help(datalad.api) (or datalad.api? in IPython) now shows a summary of the available DataLad commands. (#2728)

  • Support for using datalad from IPython has been improved. (#2722)

  • datalad wtf now returns structured data and reports the version of each extension. (#2741)

  • The internal handling of gitattributes information has been improved. A user-visible consequence is that datalad create --force no longer duplicates existing attributes. (#2744)

  • The “annex” metadata extractor can now be used even when no content is present. (#2724)

  • The add_url_to_file method (called by commands like datalad download-url and datalad add-archive-content) learned how to display a progress bar. (#2738)

0.10.2 (Jul 09, 2018) – Thesecuriestever

Primarily a bugfix release to accommodate recent git-annex release forbidding file:// and http://localhost/ URLs which might lead to revealing private files if annex is publicly shared.

Fixes
  • fixed testing to be compatible with recent git-annex (6.20180626)

  • download-url will now download to current directory instead of the top of the dataset

Enhancements and new features
  • do not quote ~ in URLs to be consistent with quote implementation in Python 3.7 which now follows RFC 3986

  • run support for user-configured placeholder values

  • documentation on native git-annex metadata support

  • handle 401 errors from LORIS tokens

  • yoda procedure will instantiate README.md

  • --discover option added to run-procedure to list available procedures

0.10.1 (Jun 17, 2018) – OHBM polish

This is a minor bugfix release.

Fixes
  • Be able to use backports.lzma as a drop-in replacement for pyliblzma.

  • Give help when not specifying a procedure name in run-procedure.

  • Abort early when a downloader received no filename.

  • Avoid rerun error when trying to unlock non-available files.

0.10.0 (Jun 09, 2018) – The Release

This release is a major leap forward in metadata support.

Major refactoring and deprecations
  • Metadata

    • Prior metadata provided by datasets under .datalad/meta is no longer used or supported. Metadata must be reaggregated using 0.10 version

    • Metadata extractor types are no longer auto-guessed and must be explicitly specified in the datalad.metadata.nativetype config (could contain multiple values); see the configuration sketch after this list

    • Metadata aggregation of a dataset hierarchy no longer updates all datasets in the tree with new metadata. Instead, only the target dataset is updated. This behavior can be changed via the --update-mode switch. The new default prevents needless modification of (3rd-party) subdatasets.

    • Neuroimaging metadata support has been moved into a dedicated extension: https://github.com/datalad/datalad-neuroimaging

  • Crawler

  • export_tarball plugin has been generalized to export_archive and can now also generate ZIP archives.

  • By default a dataset X is now only considered to be a super-dataset of another dataset Y, if Y is also a registered subdataset of X.
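
A minimal sketch of the datalad.metadata.nativetype option described above; the extractor names and the choice of the dataset-level configuration file (.datalad/config) are illustrative assumptions:

    # enable one or more metadata extractors for this dataset
    git config --file .datalad/config --add datalad.metadata.nativetype xmp
    git config --file .datalad/config --add datalad.metadata.nativetype annex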

Fixes

A number of fixes did not make it into the 0.9.x series:

  • Dynamic configuration overrides via the -c option were not in effect.

  • save is now more robust with respect to invocation in subdirectories of a dataset.

  • unlock now reports correct paths when running in a dataset subdirectory.

  • get is more robust to paths that contain symbolic links.

  • symlinks to subdatasets of a dataset are now correctly treated as a symlink, and not as a subdataset

  • add now correctly saves staged subdataset additions.

  • Running datalad save in a dataset no longer adds untracked content to the dataset. In order to add content a path has to be given, e.g. datalad save .

  • wtf now works reliably with a DataLad that wasn’t installed from Git (but, e.g., via pip)

  • More robust URL handling in simple_with_archives crawler pipeline.

Enhancements and new features
  • Support for DataLad extensions that can contribute API components from 3rd-party sources, incl. commands, metadata extractors, and test case implementations. See https://github.com/datalad/datalad-extension-template for a demo extension.

  • Metadata (everything has changed!)

    • Metadata extraction and aggregation is now supported for datasets and individual files.

    • Metadata query via search can now discover individual files.

    • Extracted metadata can now be stored in XZ compressed files, is optionally annexed (when exceeding a configurable size threshold), and obtained on demand (new configuration option datalad.metadata.create-aggregate-annex-limit).

    • Status and availability of aggregated metadata can now be reported via metadata --get-aggregates

    • New configuration option datalad.metadata.maxfieldsize to exclude too large metadata fields from aggregation.

    • The type of metadata is no longer guessed during metadata extraction. A new configuration option datalad.metadata.nativetype was introduced to enable one or more particular metadata extractors for a dataset.

    • New configuration option datalad.metadata.store-aggregate-content to enable the storage of aggregated metadata for dataset content (i.e. file-based metadata) in contrast to just metadata describing a dataset as a whole.

  • search was completely reimplemented. It offers three different modes now:

    • ‘egrep’ (default): expression matching in a plain string version of metadata

    • ‘textblob’: search a text version of all metadata using a fully featured query language (fast indexing, good for keyword search)

    • ‘autofield’: search an auto-generated index that preserves individual fields of metadata that can be represented in a tabular structure (substantial indexing cost, enables the most detailed queries of all modes)

  • New extensions:

    • addurls, an extension for creating a dataset (and possibly subdatasets) from a list of URLs.

    • export_to_figshare

    • extract_metadata

  • add_readme makes use of available metadata

  • By default the wtf extension now hides sensitive information, which can be included in the output by passing --senstive=some or --senstive=all.

  • Reduced startup latency by only importing commands necessary for a particular command line call.

  • create:

    • -d <parent> --nosave now registers subdatasets, when possible.

    • --fake-dates configures dataset to use fake-dates

  • run now provides a way for the caller to save the result when a command has a non-zero exit status.

  • datalad rerun now has a --script option that can be used to extract previous commands into a file.

  • A DataLad Singularity container is now available on Singularity Hub.

  • More casts have been embedded in the use case section of the documentation.

  • datalad --report-status has a new value ‘all’ that can be used to temporarily re-enable reporting that was disabled by configuration settings.

0.9.3 (Mar 16, 2018) – pi+0.02 release

Some important bug fixes which should improve usability

Fixes
  • datalad-archives special remote now will lock on acquiring or extracting an archive - this allows for it to be used with -J flag for parallel operation

  • relax the demand introduced in 0.9.2 that git be configured for datalad operation - now we will just issue a warning

  • datalad ls should now list “authored date” and work also for datasets in detached HEAD mode

  • datalad save will now save original file as well, if file was “git mv”ed, so you can now datalad run git mv old new and have changes recorded

Enhancements and new features
  • --jobs argument could now take the value auto, which would decide on the number of jobs depending on the number of available CPUs. git-annex > 6.20180314 is recommended to avoid regression with -J.

  • memoize calls to RI meta-constructor – should speed up operation a bit

  • DATALAD_SEED environment variable could be used to seed Python RNG and provide reproducible UUIDs etc (useful for testing and demos)

0.9.2 (Mar 04, 2018) – it is (again) better than ever

Largely a bugfix release with a few enhancements.

Fixes
  • Execution of external commands (git) should not get stuck when there is lots of both stdout and stderr output, and should not lose remaining output in some cases

  • Config overrides provided in the command line (-c) should now be handled correctly

  • Consider more remotes (not just tracking one, which might be none) while installing subdatasets

  • Compatibility with git 2.16 with some changed behaviors/annotations for submodules

  • Fail remove if annex drop failed

  • Do not fail operating on files which start with dash (-)

  • URL unquote paths within S3, URLs and DataLad RIs (///)

  • In non-interactive mode fail if authentication/access fails

  • Web UI:

    • refactored a little to fix incorrect listing of submodules in subdirectories

    • now auto-focuses on search edit box upon entering the page

  • Assure that directories extracted from tarballs have the executable bit set

Enhancements and new features
  • A log message and progress bar will now inform if a tarball is to be downloaded while getting specific files (requires git-annex > 6.20180206)

  • A dedicated datalad rerun command capable of rerunning entire sequences of previously run commands. Reproducibility through VCS. Use run even if not interested in rerun

  • Alert the user if git is not yet configured but git operations are requested

  • Delay collection of previous ssh connections until it is actually needed. Also do not require ‘:’ while specifying ssh host

  • AutomagicIO: Added proxying of isfile, lzma.LZMAFile and io.open

  • Testing:

    • added DATALAD_DATASETS_TOPURL=http://datasets-tests.datalad.org to run tests against another website to not obscure access stats

    • tests run against temporary HOME to avoid side-effects

    • better unit-testing of interactions with special remotes

  • CONTRIBUTING.md describes how to setup and use git-hub tool to “attach” commits to an issue making it into a PR

  • DATALAD_USE_DEFAULT_GIT env variable could be used to cause DataLad to use default (not the one possibly bundled with git-annex) git

  • Be more robust while handling not supported requests by annex in special remotes

  • Use of swallow_logs in the code was refactored away – less mysteries now, just increase logging level

  • wtf plugin will report more information about environment, externals and the system

0.9.1 (Oct 01, 2017) – “DATALAD!”(JBTM)

Minor bugfix release

Fixes
  • Should work correctly with subdatasets named as numbers or bool values (requires also GitPython >= 2.1.6)

  • Custom special remotes should work without crashing with git-annex >= 6.20170924

0.9.0 (Sep 19, 2017) – isn’t it a lucky day even though not a Friday?

Major refactoring and deprecations
  • the files argument of save has been renamed to path to be uniform with any other command

  • all major commands now implement more uniform API semantics and result reporting. Functionality for modification detection of dataset content has been completely replaced with a more efficient implementation

  • publish now features a --transfer-data switch that allows for an unambiguous specification of whether to publish data – independent of the selection of which datasets to publish (which is done via their paths). Moreover, publish now transfers data before repository content is pushed.

Fixes
  • drop no longer errors when some subdatasets are not installed

  • install will no longer report nothing when a Dataset instance was given as a source argument, but rather perform as expected

  • remove doesn’t remove when some files of a dataset could not be dropped

  • publish

    • no longer hides error during a repository push

    • publish behaves “correctly” for --since= in considering only the differences since the last “pushed” state

    • data transfer handling while publishing with dependencies, to github

  • improved robustness with broken Git configuration

  • search should search for unicode strings correctly and not crash

  • robustify git-annex special remotes protocol handling to allow for spaces in the last argument

  • UI credentials interface should now allow to Ctrl-C the entry

  • should not fail while operating on submodules named with numerics only or by bool (true/false) names

  • crawl templates should no longer override settings for largefiles if specified in .gitattributes

Enhancements and new features
  • Exciting new feature: the run command to record execution of an external command and rerun the computation if desired. See screencast

  • save now uses Git for detecting which subdatasets need to be inspected for potential changes, instead of performing a complete traversal of a dataset tree

  • add looks for changes relative to the last committed state of a dataset to discover files to add more efficiently

  • diff can now report untracked files in addition to modified files

  • uninstall will check itself whether a subdataset is properly registered in a superdataset, even when no superdataset is given in a call

  • subdatasets can now configure subdatasets for exclusion from recursive installation (datalad-recursiveinstall submodule configuration property)

  • precrafted pipelines of crawl will now not override the annex.largefiles setting if any was set within .gitattributes (e.g. by datalad create --text-no-annex)

  • framework for screencasts: tools/cast* tools and sample cast scripts under doc/casts which are published at datalad.org/features.html

  • new project YouTube channel

  • tests failing in direct and/or v6 modes marked explicitly

0.8.1 (Aug 13, 2017) – the best birthday gift

Bugfixes

Fixes
  • Do not attempt to update a not installed sub-dataset

  • In case of too many files to be specified for get or copy_to, we will make multiple invocations of underlying git-annex command to not overfill command line

  • More robust handling of unicode output in terminals which might not support it

Enhancements and new features
  • Ship a copy of numpy.testing to facilitate test without requiring numpy as dependency. Also allow passing to the command which test(s) to run

  • In get and copy_to provide actual original requested paths, not the ones we deduced need to be transferred, solely for knowing the total

0.8.0 (Jul 31, 2017) – it is better than ever

A variety of fixes and enhancements

Fixes
  • publish would now push merged git-annex branch even if no other changes were done

  • publish should be able to publish using relative path within SSH URI (git hook would use relative paths)

  • publish should better tolerate publishing to pure git and git-annex special remotes

Enhancements and new features
  • plugin mechanism came to replace export. See export_tarball for the replacement of export. Now it should be easy to extend datalad’s interface with custom functionality to be invoked along with other commands.

  • Minimalistic coloring of the results rendering

  • publish/copy_to got progress bar report now and support of --jobs

  • minor fixes and enhancements to crawler (e.g. support of recursive removes)

0.7.0 (Jun 25, 2017) – when it works - it is quite awesome!

New features, refactorings, and bug fixes.

Major refactoring and deprecations
Enhancements and new features
  • siblings can now be used to query and configure a local repository by using the sibling name here

  • siblings can now query and set annex preferred content configuration. This includes wanted (as previously supported in other commands), and now also required

  • New metadata command to interface with datasets/files meta-data

  • Documentation for all commands is now built in a uniform fashion

  • Significant parts of the documentation have been updated

  • Instantiate GitPython’s Repo instances lazily

Fixes
  • API documentation is now rendered properly as HTML, and is easier to browse by having more compact pages

  • Closed files left open on various occasions (Popen PIPEs, etc)

  • Restored basic (consumer mode of operation) compatibility with Windows OS

0.6.0 (Jun 14, 2017) – German perfectionism

This release includes a huge refactoring to make code base and functionality more robust and flexible

  • outputs from API commands could now be highly customized. See --output-format, --report-status, and --report-type options for datalad command.

  • effort was made to refactor code base so that underlying functions behave as generators where possible

  • input paths/arguments analysis was redone for majority of the commands to provide unified behavior

Major refactoring and deprecations
  • add-sibling and rewrite-urls were refactored in favor of new siblings command which should be used for siblings manipulations

  • ‘datalad.api.alwaysrender’ config setting/support is removed in favor of new outputs processing

Fixes
  • Do not flush manually git index in pre-commit to avoid “Death by the Lock” issue

  • Deployed by publish post-update hook script now should be more robust (tolerate directory names with spaces, etc.)

  • A variety of fixes, see list of pull requests and issues closed for more information

Enhancements and new features
  • new annotate-paths plumbing command to inspect and annotate provided paths. Use --modified to summarize changes between different points in the history

  • new clone plumbing command to provide a subset of install functionality (installing a single dataset from a URL)

  • new diff plumbing command

  • new siblings command to list or manipulate siblings

  • new subdatasets command to list subdatasets and their properties

  • drop and remove commands were refactored

  • benchmarks/ collection of Airspeed velocity benchmarks initiated. See reports at http://datalad.github.io/datalad/

  • crawler would try to download a new url multiple times with increasing delay between attempts. Helps to resolve problems with extended crawls of Amazon S3

  • CRCNS crawler pipeline now also fetches and aggregates meta-data for the datasets from datacite

  • overall optimisations to benefit from the aforementioned refactoring and improve user-experience

  • a few stub and not (yet) implemented commands (e.g. move) were removed from the interface

  • Web frontend got proper coloring for the breadcrumbs and some additional caching to speed up interactions. See http://datasets.datalad.org

  • Small improvements to the online documentation. See e.g. summary of differences between git/git-annex/datalad

0.5.1 (Mar 25, 2017) – cannot stop the progress

A bugfix release

Fixes
  • add was forcing addition of files to annex regardless of settings in .gitattributes. Now that decision is left to annex by default

  • tools/testing/run_doc_examples used to run doc examples as tests, fixed up to provide status per each example and not fail at once

  • doc/examples

  • progress bars

    • should no longer crash datalad and report correct sizes and speeds

    • should provide progress reports while using Python 3.x

Enhancements and new features
  • doc/examples

    • nipype_workshop_dataset.sh new example to demonstrate how new super- and sub- datasets were established as a part of our datasets collection

0.5.0 (Mar 20, 2017) – it’s huge

This release includes an avalanche of bug fixes, enhancements, and additions which at large should stay consistent with previous behavior but provide better functioning. Lots of code was refactored to provide more consistent code-base, and some API breakage has happened. Further work is ongoing to standardize output and results reporting (#1350)

Most notable changes
  • requires git-annex >= 6.20161210 (a more recent version is recommended for improved functionality)

  • commands should now operate on paths specified (if any), without causing side-effects on other dirty/staged files

  • save

    • -a is deprecated in favor of -u or --all-updates so only changes to known components get saved, and no new files automagically added

    • -S does no longer store the originating dataset in its commit message

  • add

    • can specify commit/save message with -m

  • add-sibling and create-sibling

    • now take the name of the sibling (remote) as a -s (--name) option, not a positional argument

    • --publish-depends to set up publishing data and code to multiple repositories (e.g. github + webserver) should now be functional; see this comment

    • got --publish-by-default to specify what refs should be published by default

    • got --annex-wanted, --annex-groupwanted and --annex-group settings which would be used to instruct annex about preferred content. publish then will publish data using those settings if wanted is set.

    • got --inherit option to automagically figure out url/wanted and other git/annex settings for new remote sub-dataset to be constructed

  • publish

    • got --skip-failing refactored into --missing option which could use new feature of create-sibling --inherit

Fixes
  • More consistent interaction through ssh - all ssh connections go through sshrun shim for a “single point of authentication”, etc.

  • More robust ls operation outside of the datasets

  • A number of fixes for direct and v6 mode of annex

Enhancements and new features
  • New drop and remove commands

  • clean

    • got --what to specify explicitly what cleaning steps to perform and now could be invoked with -r

  • datalad and git-annex-remote* scripts now do not use setuptools entry points mechanism and rely on simple import to shorten start up time

  • Dataset is also now using Flyweight pattern, so the same instance is reused for the same dataset

  • progressbars should not add more empty lines

Internal refactoring
  • Majority of the commands now go through _prep for arguments validation and pre-processing to avoid recursive invocations

0.4.1 (Nov 10, 2016) – CA release

Requires now GitPython >= 2.1.0

Fixes
  • save

    • to not save staged files if explicit paths were provided

  • improved (but not yet complete) support for direct mode

  • update to not crash if some sub-datasets are not installed

  • do not log calls to git config to avoid leakage of possibly sensitive settings to the logs

Enhancements and new features
  • New rfc822-compliant metadata format

  • save

    • -S to save the change also within all super-datasets

  • add now has progress-bar reporting

  • create-sibling-github to create a sibling of a dataset on github

  • OpenfMRI crawler and datasets were enriched with URLs to separate files where also available from openfmri s3 bucket (if upgrading your datalad datasets, you might need to run git annex enableremote datalad to make them available)

  • various enhancements to log messages

  • web interface

    • populates “install” box first thus making UX better over slower connections

0.4 (Oct 22, 2016) – Paris is waiting

Primarily it is a bugfix release but because of significant refactoring of the install and get implementation, it gets a new minor release.

Fixes
  • be able to get or install while providing paths from outside of a dataset

  • remote annex datasets get properly initialized

  • robust detection of outdated git-annex

Enhancements and new features
  • interface changes

    • get --recursion-limit=existing to not recurse into not-installed subdatasets

    • get -n to possibly install sub-datasets without getting any data

    • install --jobs|-J to specify the number of parallel jobs the annex get call could use (ATM would not work when data comes from archives)

  • more (unit-)testing

  • documentation: see http://docs.datalad.org/en/latest/basics.html for basic principles and useful shortcuts in referring to datasets

  • various webface improvements: breadcrumb paths, instructions how to install dataset, show version from the tags, etc.

0.3.1 (Oct 1, 2016) – what a wonderful week

Primarily bugfixes but also a number of enhancements and core refactorings

Fixes
  • do not build manpages and examples during installation to avoid problems with possibly previously outdated dependencies

  • install can be called on already installed dataset (with -r or -g)

Enhancements and new features
  • complete overhaul of datalad configuration settings handling (see Configuration documentation), so the majority of the environment variables we have used were renamed to match configuration names. Configuration now uses git format and stores persistent settings under .datalad/config and local ones within .git/config

  • create-sibling does not now by default upload web front-end

  • export command with a plug-in interface and tarball plugin to export datasets

  • in Python, .api functions with rendering of results in command line got a _-suffixed sibling, which would render results in Python as well (e.g., using search_ instead of search would also render results, not only return them as Python objects)

  • get

    • --jobs option (passed to annex get) for parallel downloads

    • total and per-download (with git-annex >= 6.20160923) progress bars (note that if content is to be obtained from an archive, no progress will be reported yet)

  • install --reckless mode option

  • search

    • highlights locations and fieldmaps for better readability

    • supports -d^ or -d/// to point to top-most or centrally installed meta-datasets

    • “complete” paths to the datasets are reported now

    • -s option to specify which fields (only) to search

  • various enhancements and small fixes to meta-data handling, ls, custom remotes, code-base formatting, downloaders, etc

  • completely switched to tqdm library (progressbar is no longer used/supported)

0.3 (Sep 23, 2016) – winter is coming

Lots of everything, including but not limited to

0.2.3 (Jun 28, 2016) – busy OHBM

New features and bugfix release

0.2.2 (Jun 20, 2016) – OHBM we are coming!

New feature and bugfix release

  • greatly improved documentation

  • publish command API RFing allows for custom options to annex, and uses --to REMOTE for consistency with annex invocation

  • variety of fixes and enhancements throughout

0.2.1 (Jun 10, 2016)
  • variety of fixes and enhancements throughout

0.2 (May 20, 2016)

Major RFing to switch from relying on rdf to git native submodules etc

0.1 (Oct 14, 2015)

Release primarily focusing on interface functionality including initial publishing

Acknowledgments

DataLad development is being performed as part of a US-German collaboration in computational neuroscience (CRCNS) project “DataGit: converging catalogues, warehouses, and deployment logistics into a federated ‘data distribution’” (Halchenko/Hanke), co-funded by the US National Science Foundation (NSF 1429999) and the German Federal Ministry of Education and Research (BMBF 01GQ1411). Additional support is provided by the German federal state of Saxony-Anhalt and the European Regional Development Fund (ERDF), Project: Center for Behavioral Brain Sciences, Imaging Platform

DataLad is built atop the git-annex software that is being developed and maintained by Joey Hess.

Publications

Further conceptual and technical information on DataLad, and applications built on DataLad, is available from the publications listed below.

The best of both worlds: Using semantic web with JSON-LD. An example with NIDM Results & DataLad [poster]
  • Camille Maumet, Satrajit Ghosh, Yaroslav O. Halchenko, Dorota Jarecka, Nolan Nichols, Jean-Baptiste Poline, Michael Hanke

One thing to bind them all: A complete raw data structure for auto-generation of BIDS datasets [poster]
  • Benjamin Poldrack, Kyle Meyer, Yaroslav O. Halchenko, Michael Hanke

Fantastic containers and how to tame them [poster]
  • Yaroslav O. Halchenko, Kyle Meyer, Matt Travers, Dorota Jarecka, Satrajit Ghosh, Jakub Kaczmarzyk, Michael Hanke

YODA: YODA’s Organigram on Data Analysis [poster]
  • An outline of a simple approach to structuring and conducting data analyses that aims to tightly connect all their essential ingredients: data, code, and computational environments in a transparent, modular, accountable, and practical way.

  • Michael Hanke, Kyle A. Meyer, Matteo Visconti di Oleggio Castello, Benjamin Poldrack, Yaroslav O. Halchenko

  • F1000Research 2018, 7:1965 (https://doi.org/10.7490/f1000research.1116363.1)

Go FAIR with DataLad [talk]
  • On DataLad’s capabilities to create and maintain Findable, Accessible, Interoperable, and Reusable (FAIR) resources.

  • Michael Hanke, Yaroslav O. Halchenko

  • Bernstein Conference 2018 workshop: Practical approaches to research data management and reproducibility (slides)

  • OpenNeuro kick-off meeting, 2018, Stanford (slide sources)

Concepts and technologies

Background and motivation

Vision

Data is at the core of science, and unobstructed access promotes scientific discovery through collaboration between data producers and consumers. The last years have seen dramatic improvements in availability of data resources for collaborative research, and new data providers are becoming available all the time.

However, despite the increased availability of data, their accessibility is far from being optimal. Potential consumers of these public datasets have to manually browse various disconnected warehouses with heterogeneous interfaces. Once obtained, data is disconnected from its origin and data versioning is often ad-hoc or completely absent. If data consumers can be reliably informed about data updates at all, review of changes is difficult, and re-deployment is tedious and error-prone. This leads to wasteful friction caused by outdated or faulty data.

The vision for this project is to transform the state of data-sharing and collaborative work by providing uniform access to available datasets – independent of hosting solutions or authentication schemes – with reliable versioning and versatile deployment logistics. This is achieved by means of a dataset handle, a lightweight representation of a dataset that is capable of tracking the identity and location of a dataset’s content as well as carrying meta-data. Together with associated software tools, scientists are able to obtain, use, extend, and share datasets (or parts thereof) in a way that is traceable back to the original data producer and is therefore capable of establishing a strong connection between data consumers and the evolution of a dataset by future extension or error correction.

Moreover, DataLad aims to provide all tools necessary to create and publish data distributions — an analog to software distributions or app-stores that provide logistics middleware for software deployment. Scientific communities can use these tools to gather, curate, and make publicly available specialized collections of datasets for specific research topics or data modalities. All of this is possible by leveraging existing data sharing platforms and institutional resources without the need for funding extra infrastructure or duplicate storage. Specifically, this project aims to provide a comprehensive, extensible data distribution for neuroscientific datasets that is kept up-to-date by an automated service.

Technological foundation: git-annex

The outlined task is not unique to the problem of data-sharing in science. Logistical challenges such as delivering data, long-term storage and archiving, identity tracking, and synchronization between multiple sites are rather common. Consequently, solutions have been developed in other contexts that can be adapted to benefit scientific data-sharing.

The closest match is the software tool git-annex. It combines the features of the distributed version control system (dVCS) Git — a technology that has revolutionized collaborative software development – with versatile data access and delivery logistics. Git-annex was originally developed to address use cases such as managing a collection of family pictures at home. With git-annex, any family member can obtain an individual copy of such a picture library — the annex. The annex in this example is essentially an image repository that presents individual pictures to users as files in a single directory structure, even though the actual image file contents may be distributed across multiple locations, including a home-server, cloud-storage, or even off-line media such as external hard-drives.

Git-annex provides functionality to obtain file contents upon request and can prompt users to make particular storage devices available when needed (e.g. a backup hard-drive kept in a fire-proof compartment). Git-annex can also remove files from a local copy of that image repository, for example to free up space on a laptop, while ensuring a configurable level of data redundancy across all known storage locations. Lastly, git-annex is able to synchronize the content of multiple distributed copies of this image repository, for example in order to incorporate images added with the git-annex on the laptop of another family member. It is important to note that git-annex is agnostic of the actual file types and is not limited to images.

We believe that the approach to data logistics taken by git-annex and the functionality it is currently providing are an ideal middleware for scientific data-sharing. Its data repository model, the annex, readily provides the majority of principal features needed for a dataset handle such as history recording, identity tracking, and item-based resource locators. Consequently, instead of a from-scratch development, required features, such as dedicated support for existing data-sharing portals and dataset meta-information, can be added to a working solution that has already been in production for several years. As a result, DataLad focuses on the expansion of git-annex’s functionality and the development of tools that build atop Git and git-annex and enable the creation, management, use, and publication of dataset handles and collections thereof.

Objective

Building atop git-annex, DataLad aims to provide a single, uniform interface to access data from various data-sharing initiatives and data providers, and functionality to create, deliver, update, and share datasets for individuals and portal maintainers. As a command-line tool, it provides an abstraction layer for the underlying Git-based middleware implementing the actual data logistics, and serves as a foundation for other future user front-ends, such as a web-interface.

Basic principles

DataLad is designed to be used both as a command-line tool, and as a Python module. The sections Command line reference and Python module reference provide a detailed description of the commands and functions of the two interfaces. This section presents common concepts. Although examples will frequently be presented using command line interface commands, all functionality is available through the Python API as well, with identically named functions and options.
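
For instance, creating a dataset named demo can be done as datalad create demo on the command line, or equivalently through the Python API. The following is a minimal sketch (the dataset name is arbitrary):

import datalad.api as dl

# equivalent of the command line call: datalad create demo
ds = dl.create('demo')

# the returned Dataset object exposes the same commands as bound methods,
# e.g. the equivalent of: datalad save -d demo
# ds.save()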

Datasets

A DataLad dataset is a Git repository that may or may not have a data annex that is used to manage data referenced in a dataset. In practice, most DataLad datasets will come with an annex.

Types of IDs used in datasets

Four types of unique identifiers are used by DataLad to enable identification of different aspects of datasets and their components.

Dataset ID

A UUID that identifies a dataset as a whole across its entire history and flavors. This ID is stored in a dataset’s own configuration file (<dataset root>/.datalad/config) under the configuration key datalad.dataset.id. As this configuration is stored in a file that is part of the Git history of a dataset, this ID is identical for all “clones” of a dataset and across all its versions. If the purpose or scope of a dataset changes enough to warrant a new dataset ID, it can be changed by altering the dataset configuration setting.

Annex ID

A UUID assigned to an annex of each individual clone of a dataset repository. Git-annex uses this UUID to track file content availability information. The UUID is available under the configuration key annex.uuid and is stored in the configuration file of a local clone (<dataset root>/.git/config). A single dataset instance (i.e. clone) can only have a single annex UUID, but a dataset with multiple clones will have multiple annex UUIDs.

Commit ID

A Git hexsha or tag that identifies a version of a dataset. This ID uniquely identifies the content and history of a dataset up to its present state. As the dataset history also includes the dataset ID, a commit ID of a DataLad dataset is unique to a particular dataset.

Content ID

Git-annex key (typically a checksum) assigned to the content of a file in a dataset’s annex. The checksum reflects the content of a file, not its name. Hence the content of multiple identical files in a single (or across) dataset(s) will have the same checksum. Content IDs are managed by Git-annex in a dedicated annex branch of the dataset’s Git repository.
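
As a rough illustration, these identifiers can also be inspected programmatically through the Python API. In the sketch below the dataset path and file name are hypothetical, and the exact accessor names may differ between DataLad versions:

from datalad.api import Dataset

ds = Dataset('/path/to/dataset')
print(ds.id)                   # dataset ID (datalad.dataset.id)
print(ds.repo.uuid)            # annex ID of this particular clone (annex.uuid)
print(ds.repo.get_hexsha())    # commit ID of the current dataset version
print(ds.repo.get_file_key('data.bin'))  # content ID (git-annex key) of a file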

Dataset nesting

Datasets can contain other datasets (subdatasets), which can in turn contain subdatasets, and so on. There is no limit to the depth of nesting datasets. Each dataset in such a hierarchy has its own annex and its own history. The parent or superdataset only tracks the specific state of a subdataset, and information on where it can be obtained. This is a powerful yet lightweight mechanism for combining multiple individual datasets for a specific purpose, such as the combination of source code repositories with other resources for a tailored application. In many cases DataLad can work with a hierarchy of datasets just as if it were a single dataset. Here is a demo:

~ % datalad create demo
[INFO   ] Creating a new annex repo at /demo/demo
create(ok): /demo/demo (dataset)
~ % cd demo

A DataLad dataset is just a Git repo with some initial configuration

~/demo % git log --oneline
472e34b (HEAD -> master) [DATALAD] new dataset
f968257 [DATALAD] Set default backend for all files to be MD5E

We can generate nested datasets, by telling DataLad to register a new dataset in a parent dataset

~/demo % datalad create -d . sub1
[INFO   ] Creating a new annex repo at /demo/demo/sub1
add(ok): sub1 (dataset) [added new subdataset]
add(notneeded): sub1 (dataset) [nothing to add from /demo/demo/sub1]
add(notneeded): .gitmodules (file) [already included in the dataset]
save(ok): /demo/demo (dataset)
create(ok): sub1 (dataset)
action summary:
  add (notneeded: 2, ok: 1)
  create (ok: 1)
  save (ok: 1)

A subdataset is nothing more than a regular Git submodule

~/demo % git submodule
 5f0cddf2026e3fb4864139f27e7415fd72c7d4d0 sub1 (heads/master)

Of course subdatasets can be nested

~/demo % datalad create -d . sub1/justadir/sub2
[INFO   ] Creating a new annex repo at /demo/demo/sub1/justadir/sub2
add(ok): sub1/justadir/sub2 (dataset) [added new subdataset]
add(notneeded): sub1/justadir/sub2 (dataset) [nothing to add from /demo/demo/sub1/justadir/sub2]
add(notneeded): sub1/.gitmodules (file) [already included in the dataset]
add(notneeded): sub1 (dataset) [already known subdataset]
save(ok): /demo/demo/sub1 (dataset)
save(ok): /demo/demo (dataset)
create(ok): sub1/justadir/sub2 (dataset)
action summary:
  add (notneeded: 3, ok: 1)
  create (ok: 1)
  save (ok: 2)

Unlike Git, DataLad automatically takes care of committing all changes associated with the added subdataset up to the given parent dataset

~/demo % git status
On branch master
nothing to commit, working tree clean

Let’s create some content in the deepest subdataset

~/demo % mkdir sub1/justadir/sub2/anotherdir
~/demo % touch sub1/justadir/sub2/anotherdir/afile

Git can only tell us that something underneath the top-most subdataset was modified

~/demo % git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)

     modified:   sub1 (untracked content)

no changes added to commit (use "git add" and/or "git commit -a")

DataLad saves us from further investigation

~/demo % datalad diff -r
   modified(dataset): sub1
   modified(dataset): sub1/justadir/sub2
untracked(directory): sub1/justadir/sub2/anotherdir

Like Git, it can report individual untracked files, but also across repository boundaries

~/demo % datalad diff -r --report-untracked all
   modified(dataset): sub1
   modified(dataset): sub1/justadir/sub2
     untracked(file): sub1/justadir/sub2/anotherdir/afile

Adding this new content with Git or git-annex would be an exercise

~/demo % git add sub1/justadir/sub2/anotherdir/afile
fatal: Pathspec 'sub1/justadir/sub2/anotherdir/afile' is in submodule 'sub1'

DataLad does not require users to determine the correct repository in the tree

~/demo % datalad add -d . sub1/justadir/sub2/anotherdir/afile
add(ok): sub1/justadir/sub2/anotherdir/afile (file)
save(ok): /demo/demo/sub1/justadir/sub2 (dataset)
save(ok): /demo/demo/sub1 (dataset)
save(ok): /demo/demo (dataset)
action summary:
  add (ok: 1)
  save (ok: 3)

Again, all associated changes in the entire dataset tree, up to the given parent dataset, were committed

~/demo % git status
On branch master
nothing to commit, working tree clean

DataLad’s ‘diff’ is able to report the changes from these related commits throughout the repository tree

~/demo % datalad diff --revision @~1 -r
   modified(dataset): sub1
   modified(dataset): sub1/justadir/sub2
         added(file): sub1/justadir/sub2/anotherdir/afile
Dataset collections

A superdataset can also be seen as a curated collection of datasets, for example, for a certain data modality, a field of science, a certain author, or from one project (maybe the resource for a movie production). This lightweight coupling between super and subdatasets enables scenarios where individual datasets are maintained by a disjoint set of people, and the dataset collection itself can be curated by a completely independent entity. Any individual dataset can be part of any number of such collections.

Benefiting from Git’s support for workflows based on decentralized “clones” of a repository, DataLad’s datasets can be (re-)published to a new location without losing the connection between the “original” and the new “copy”. This is extremely useful for collaborative work, but also in more mundane scenarios such as data backup, or temporary deployment of a dataset on a compute cluster, or in the cloud. Using git-annex, data can also get synchronized across different locations of a dataset (siblings in DataLad terminology). Using metadata tags, it is even possible to configure different levels of desired data redundancy across the network of datasets, or to prevent publication of sensitive data to publicly accessible repositories. Individual datasets in a hierarchy of (sub)datasets need not be stored at the same location. Continuing with an earlier example, it is possible to post a curated collection of datasets, as a superdataset, on GitHub, while the actual datasets live on different servers all around the world.

Basic command line usage

All of DataLad’s functionality is available through a single command: datalad

Running the datalad command without any arguments, gives a summary of basic options, and a list of available sub-commands.

~ % datalad
usage: datalad [-h] [-l LEVEL] [-C PATH] [--version]
               [--dbg] [--idbg] [-c KEY=VALUE]
               [-f {default,json,json_pp,tailored,'<template>'}]
               [--report-status {success,failure,ok,notneeded,impossible,error}]
               [--report-type {dataset,file}]
               [--on-failure {ignore,continue,stop}] [--cmd]
               {create,install,get,publish,uninstall,drop,remove,update,create-sibling,create-sibling-github,unlock,save,search,metadata,aggregate-metadata,test,ls,clean,add-archive-content,download-url,run,rerun,addurls,export-archive,extract-metadata,export-to-figshare,no-annex,wtf,add-readme,annotate-paths,clone,create-test-dataset,diff,siblings,sshrun,subdatasets}
               ...
[ERROR  ] Please specify the command
~ % #

More comprehensive information is available via the --help long-option (we will truncate the output here)

~ % datalad --help       | head -n20
Usage: datalad [global-opts] command [command-opts]

DataLad provides a unified data distribution with the convenience of git-annex
repositories as a backend.  DataLad command line tools allow to manipulate
(obtain, create, update, publish, etc.) datasets and their collections.

*Commands for dataset operations*

  create
      Create a new dataset from scratch
  install
      Install a dataset from a (remote) source
  get
      Get any dataset content (files/directories/subdatasets)
  publish
      Publish a dataset to a known sibling
  uninstall
      Uninstall subdatasets

Getting information on any of the available sub commands works in the same way – just pass --help AFTER the sub-command (output again truncated)

~ % datalad create --help       | head -n20
Usage: datalad create [-h] [-f] [-D DESCRIPTION] [-d PATH] [--no-annex]
                      [--nosave] [--annex-version ANNEX_VERSION]
                      [--annex-backend ANNEX_BACKEND]
                      [--native-metadata-type LABEL] [--shared-access MODE]
                      [--git-opts STRING] [--annex-opts STRING]
                      [--annex-init-opts STRING] [--text-no-annex]
                      [PATH]

Create a new dataset from scratch.

This command initializes a new dataset at a given location, or the
current directory. The new dataset can optionally be registered in an
existing superdataset (the new dataset's path needs to be located
within the superdataset for that, and the superdataset needs to be given
explicitly). It is recommended to provide a brief description to label
the dataset's nature *and* location, e.g. "Michael's music on black
laptop". This helps humans to identify data locations in distributed
scenarios.  By default an identifier comprised of user and machine name,
plus path will be generated.
API principles

You can use DataLad’s install command to download datasets. The command accepts URLs of different protocols (http, ssh) as an argument. Nevertheless, the easiest way to obtain a first dataset is downloading the default superdataset from https://datasets.datalad.org/ using a shortcut.

Downloading DataLad’s default superdataset

https://datasets.datalad.org provides a super-dataset consisting of datasets from various portals and sites. Many of them were crawled, and periodically updated, using datalad-crawler extension. The argument /// can be used as a shortcut that points to the superdataset located at https://datasets.datalad.org/. Here are three common examples in command line notation:

datalad install ///

installs this superdataset (metadata without subdatasets) in a datasets.datalad.org/ subdirectory under the current directory

datalad install -r ///openfmri

installs the openfmri superdataset into an openfmri/ subdirectory. Additionally, the -r flag recursively downloads all metadata of datasets available from http://openfmri.org as subdatasets into the openfmri/ subdirectory

datalad install -g -J3 -r ///labs/haxby

installs the superdataset of datasets released by the lab of Dr. James V. Haxby and all subdatasets’ metadata. The -g flag indicates getting the actual data, too. It does so by using 3 parallel download processes (-J3 flag).

Downloading datasets via http

In most places where DataLad accepts URLs as arguments these URLs can be regular http or https protocol URLs. For example:

datalad install https://github.com/psychoinformatics-de/studyforrest-data-phase2.git

Downloading datasets via ssh

DataLad also supports SSH URLs, such as ssh://me@localhost/path.

datalad install ssh://me@localhost/path

Finally, DataLad supports SSH login style resource identifiers, such as me@localhost:/path.

datalad install me@localhost:/path

Commands install vs get

The install and get commands might seem confusingly similar at first. Both of them could be used to install any number of subdatasets and fetch content of the data files. Both install and get take local paths as their arguments, but differences lie primarily in their default behavior and output, and thus their intended use:

  • install primarily operates and reports at the level of datasets, and returns as a result dataset(s) which either were just installed, or were installed previously already under specified locations. So result should be the same if the same install command ran twice on the same datasets. It does not fetch data files by default

  • get primarily operates at the level of paths (datasets, directories, and/or files). As a result it returns only what was installed (datasets) or fetched (files). So result of rerunning the same get command should report that nothing new was installed or fetched. It fetches data files by default.

In terms of how both commands operate on provided paths, it could be said that install == get -n, and install -g == get. But install also has the ability to install new datasets from remote locations given their URLs (e.g., https://datasets.datalad.org/ for our super-dataset) and SSH targets (e.g., [login@]host:path), provided as the argument to its call or explicitly via the --source option. If datalad install --source URL PATH (command line example) is used, then the dataset from URL gets installed under PATH. In case of a datalad install URL invocation, PATH is taken from the last name within URL, similar to how git clone does it. While the former specification allows only a single URL and a PATH at a time, the latter can take multiple remote locations from which datasets could be installed.

So, as a rule of thumb – if you want to install from an external URL or fetch a sub-dataset without downloading data files stored under annex – use install. In the Python API, install is also to be used when you want to receive the corresponding Dataset object as output to operate on, and be able to use it even if you rerun the script. In all other cases, use get.
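
The corresponding Python API usage can be sketched roughly as follows (the path in the commented-out get call is hypothetical):

from datalad.api import install, get

# install operates on datasets: it returns the Dataset object(s) and does not
# fetch data file content by default
ds = install(source='///')   # installs the superdataset under ./datasets.datalad.org

# get operates on paths within installed datasets and fetches file content by default
# get('some/file/in/the/dataset', dataset=ds)   # hypothetical path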

Credentials

Integration with Git

Git and DataLad can use each other’s credential system. Both directions are independent of each other and neither is necessarily required. Either direction can be configured based on URL matching patterns. In addition, Git can be configured to always query DataLad for credentials without any URL matching.

Let Git query DataLad

In order to allow Git to query credentials from DataLad, Git needs to be configured to use the git credential helper delivered with DataLad (an executable called git-credential-datalad). That is, a section like this needs to be part of one’s git config file:

[credential "https://*.data.example.com"]
  helper = "datalad"

Note:

  • This most likely only makes sense at the user or system level (options --global|--system with git config), since cloning of a repository needs the credentials before there is a local repository.

  • The name of that section is a URL matching expression - see man gitcredentials.

  • The URL matching does NOT include the scheme! Hence, if you need to match http as well as https, you need two such entries.

  • Multiple git credential helpers can be configured - Git will ask them one after another until it gets a username and a password for the URL in question. For example, on macOS, Git comes with a helper to use the system’s keychain and Git is configured system-wide to query git-credential-osxkeychain. This does not conflict with setting up DataLad’s credential helper.

  • The example configuration requires git-credential-datalad to be in the path in order for Git to find it. Alternatively, the value of the helper entry needs to be the absolute path of git-credential-datalad.

  • In order to make Git always consider DataLad as a credential source, one can simply not specify any URL pattern (so it’s [credential] instead of [credential “SOME-PATTERN”])

Let DataLad query Git

The other way around, DataLad can ask Git for credentials (which it will acquire via other git credential helpers). To do so, a DataLad provider config needs to be set up:

[provider:data_example_provider]
  url_re = https://.*data\.example\.com
  authentication_type = http_basic_auth
  credential = data_example_cred
[credential:data_example_cred]
  type = git

Note:

  • Such a config lives in a dedicated file named after the provider name (e.g. all of the above example would be the content of data_example_provider.cfg, matching [provider:data_example_provider]).

  • Valid locations for these files are listed in Credential management.

  • In contrast to Git’s approach, url_re is a regular expression that matches the entire URL including the scheme.

  • The above is particularly important in case of redirects, as DataLad currently matches the URL it was given instead of the one it ultimately uses the credentials with.

  • The name of the credential section must match the credential entry in the provider section (e.g. [credential:data_example_cred] and credential = data_example_cred in the above example).

DataLad will prompt the user to create a provider configuration and respective credentials when it first encounters a URL that requires authentication but no matching credentials are found. This behavior extends to the credential helper and may therefore be triggered by a git clone if Git is configured to use git-credential-datalad. However, interactivity of git-credential-datalad can be turned off (see git-credential-datalad -h)

It is possible to end up in a situation where Git would query DataLad and vice versa for the same URL, especially if Git is configured to query DataLad unconditionally. git-credential-datalad will discover this circular setup and stop it by simply ignoring DataLad’s provider configuration that points back to Git.

Customization and extension of functionality

DataLad provides numerous commands that cover many use cases. However, there will always be a demand for further customization or extensions of built-in functionality at a particular site, or for an individual user. DataLad addresses this need with a mechanism for extending particular DataLad functionality, such as metadata extractors, or for providing entire command suites for a specialized purpose.

As the name suggests, a DataLad extension package is a proper Python package. Consequently, there is a significant amount of boilerplate code involved in the creation of a new DataLad extension. However, this overhead enables a number of useful features for extension developers:

  • extensions can provide any number of additional commands that can be grouped into labeled command suites, and are automatically exposed via the standard DataLad commandline and Python API

  • extensions can define entry_points for any number of additional metadata extractors that become automatically available to DataLad

  • extensions can define entry_points for their test suites, such that the standard datalad test command will automatically run these tests in addition to the tests shipped with DataLad core

  • extensions can ship additional dataset procedures by installing them into a directory resources/procedures underneath the extension module directory

Using an extension

A DataLad extension is a standard Python package. Beyond installation of the package there is no additional setup required.

Writing your own extensions

A good starting point for implementing a new extension is the “helloworld” demo extension available at https://github.com/datalad/datalad-extension-template. This repository can be cloned and adjusted to suit one’s needs. It includes:

  • a basic Python package setup

  • simple demo command implementation

  • Travis test setup

A more complex extension setup can be seen in the DataLad Neuroimaging extension: https://github.com/datalad/datalad-neuroimaging, including additional metadata extractors, test suite registration, and a sphinx-based documentation setup for a DataLad extension.

As a DataLad extension is a standard Python package, an extension should declare dependencies on an appropriate DataLad version, and possibly other extensions via the standard mechanisms.
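
As a rough sketch of how such packaging metadata can declare its dependency and entry points (not a verbatim copy of the template; all package and module names below are placeholders, and the version bound is only an example):

from setuptools import setup

setup(
    name='datalad_helloworld',
    packages=['datalad_helloworld'],
    # declare a dependency on an appropriate DataLad version
    install_requires=['datalad>=0.12'],
    entry_points={
        # expose the extension's command suite to the DataLad CLI and Python API
        'datalad.extensions': [
            'helloworld=datalad_helloworld:command_suite',
        ],
        # expose an additional metadata extractor (hypothetical module path)
        'datalad.metadata.extractors': [
            'helloworld=datalad_helloworld.extractors.helloworld:HelloWorldExtractor',
        ],
    },
)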

Design

This chapter describes command API principles and the design of particular subsystems in DataLad.

Command line interface

The command line interface (CLI) implementation is located at datalad.cli. It provides a console entry point that automatically constructs an argparse-based command line parser, which is used to make adequately parameterized calls to the targeted command implementations. It also performs error handling. The CLI automatically supports all commands, regardless of whether they are provided by the core package, or by extensions. It only requires them to be discoverable via the respective extension entry points, and to implement the standard datalad.interface.base.Interface.

Basic workflow of a command line based command execution

The functionality of the main command line entrypoint described here is implemented in datalad.cli.main.

  1. Construct an argparse parser.

    • this is happening with inspection of the actual command line arguments in order to avoid needless processing

    • when insufficient arguments or other errors are detected, the CLI will fail informatively already at this stage

  2. Detect argument completion events, and utilize the parser in an optimized fashion for this purpose.

  3. Determine the to-be-executed command from the given command line arguments.

  4. Read any configuration overrides from the command line arguments.

  5. Change the process working directory, if requested.

  6. Execute the target command in one of two modes:

    1. With a basic exception handler

    2. With an exception hook setup that enables dropping into a debugger for any exception that reaches the command line main() routine.

  7. Unless a debugger is utilized, five error categories are distinguished (in the order given below):

    1. Insufficient arguments (exit code 2)

      A command was called with inadequate or incomplete parameters.

    2. Incomplete results (exit code 1)

      An error occurred during processing.

    3. A specific internal shell command execution failed (exit code relayed from underlying command)

      The error is reported as if the command had been executed directly on the command line. Its output is written to the stdout and stderr streams, and the exit code of the DataLad process matches the exit code of the underlying command.

    4. Keyboard interrupt (exit code 3)

      The process was interrupted by the equivalent of a user hitting Ctrl+C.

    5. Any other error/exception.

Command parser construction by Interface inspection

The parser setup described here is implemented in datalad.cli.parser.

A dedicated sub-parser for any relevant DataLad command is constructed. For normal execution use cases, only a single subparser for the target command will be constructed for speed reasons. However, when the command line help system is requested (--help) subparsers for all commands (including extensions) are constructed. This can take a considerable amount of time that grows with the number of installed extensions.

The information necessary to configure a subparser for a DataLad command is determined by inspecting the respective Interface class for that command, and reusing individual components for the parser. This includes:

  • the class docstring

  • a _params_ member with a dict of parameter definitions

  • a _examples_ member, with a list of example definitions

All docstrings used for the parser setup will be processed by applying a set of rules to make them more suitable for the command line environment. This includes the processing of CMD markup macros, and stripping their PYTHON counterparts. Parameter constraint definition descriptions are also altered to exclude Python-specific idioms that have no relevance on the command line (e.g., the specification of None as a default).
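
For orientation, a minimal Interface implementation could look roughly like the following sketch (command and parameter names are hypothetical, decorators and result handling details are omitted, and exact import locations may vary between DataLad versions; see the extension template for an authoritative example):

from datalad.interface.base import Interface
from datalad.support.param import Parameter


class HelloWorld(Interface):
    """Greet an entity (this docstring becomes the command description)"""

    # parameter definitions are inspected to populate the argparse subparser
    _params_ = dict(
        name=Parameter(
            args=('-n', '--name'),
            doc='name of the entity to greet'),
    )

    @staticmethod
    def __call__(name='world'):
        # commands yield result records that the CLI renders and filters
        yield dict(action='helloworld', path=name, status='ok')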

CLI-based execution of Interface command

The execution handler described here is implemented in datalad.cli.exec.

Once the main command line entry point determines that a command shall be executed, it triggers a handler function that was assigned and parameterized with the underlying command Interface during parser construction. At the time of execution, this handler is given the result of argparse-based command line argument parsing (i.e., a Namespace instance).

From this parser result, the handler constructs positional and keyword arguments for the respective Interface.__call__() execution. It does not only process command-specific arguments, but also generic arguments, such as those for result filtering and rendering, which influence the central processing of result records yielded by a command.

If an underlying command returns a Python generator it is unwound to trigger the respective underlying processing. The handler performs no error handling. This is left to the main command line entry point.

Provenance capture

The ability to capture process provenance—the information what activity initiated by which entity yielded which outputs, given a set of parameters, a computational environment, and potential input data—is a core feature of DataLad.

Provenance capture is supported for any computational process that can be expressed as a command line call. The simplest form of provenance tracking can be implemented by prefixing any such command line call with datalad run .... When executed in the context of a dataset (with the current working directory typically being in the root of a dataset), DataLad will then:

  1. check the dataset for any unsaved modifications

  2. execute the given command, when no modifications were found

  3. save any changes to the dataset that exist after the command has exited without error

The saved changes are annotated with a structured record that, at minimum, contains the executed command.

This kind of usage is sufficient for building up an annotated history of a dataset, where all relevant modifications are clearly associated with the commands that caused them. By providing more, optional, information to the run command, such as a declaration of inputs and outputs, provenance records can be further enriched. This enables additional functionality, such as the automated re-execution of captured processes.
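
The same basic workflow is available via the Python API. A minimal sketch, with a hypothetical command and commit message:

import datalad.api as dl

# execute a (hypothetical) command in the dataset containing the current
# directory, and save any resulting modification with a provenance record
dl.run("make figures", message="Regenerate figures")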

The provenance record

A DataLad provenance record is a key-value mapping comprising the following main items:

  • cmd: executed command, which may contain placeholders

  • dsid: DataLad ID of dataset in whose context the command execution took place

  • exit: numeric exit code of the command

  • inputs: a list of (relative) file paths for all declared inputs

  • outputs: a list of (relative) file paths for all declared outputs

  • pwd: relative path of the working directory for the command execution

A provenance record is stored in a JSON-serialized form in one of two locations:

  1. In the body of the commit message created when saving the dataset modifications caused by the command

  2. In a sidecar file underneath .datalad/runinfo in the root dataset

Sidecar files have a filename (record_id) that is based on a checksum of the provenance record content, and are stored as LZMA-compressed binary files. When a sidecar file is used, its record_id is added to the commit message, instead of the complete record.
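
For illustration, a provenance record with entirely hypothetical values might look like this (shown as the Python mapping that gets JSON-serialized):

record = {
    "cmd": "python code/analysis.py {inputs} {outputs}",
    "dsid": "8a7c3e1a-0000-0000-0000-000000000000",
    "exit": 0,
    "inputs": ["data/raw.csv"],
    "outputs": ["results/summary.csv"],
    "pwd": ".",
}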

Declaration of inputs and outputs

While not strictly required, it is possible and recommended to declare all paths for process inputs and outputs of a command execution via the respective options of run.

For all declared inputs, run will ensure that their file content is present locally at the required version before executing the command.

For all declared outputs, run will ensure that the respective locations are writeable.

It is recommended to declare inputs and outputs both exhaustively and precisely, in order to enable the provenance-based automated re-execution of a command. In case of a future re-execution the dataset content may have changed substantially, and a needlessly broad specification of inputs/outputs may lead to undesirable data transfers.

Placeholders in commands and IO specifications

Both command and input/output specification can employ placeholders that will be expanded before command execution. Placeholders use the syntax of the Python format() specification. A number of standard placeholders are supported (see the run documentation for a complete list):

  • {pwd} will be replaced with the full path of the current working directory

  • {dspath} will be replaced with the full path of the dataset that run is invoked on

  • {inputs} and {outputs} expand to a space-separated list of the declared input and output paths

Additionally, custom placeholders can be defined as configuration variables under the prefix datalad.run.substitutions.. For example, a configuration setting datalad.run.substitutions.myfile=data.txt will cause the placeholder {myfile} to expand to data.txt.

Selection of individual items for placeholders that expand to multiple values is possible via the standard Python format() syntax, for example {inputs[0]}.
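
A sketch combining declared inputs/outputs with placeholders (all file names are hypothetical):

import datalad.api as dl

dl.run(
    "wc -l {inputs} > {outputs}",
    inputs=["data/raw.csv"],            # content is obtained before execution
    outputs=["results/linecount.txt"],  # location is made writeable beforehand
    message="Count lines in raw data",
)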

Result records emitted by run

When performing a command execution run will emit results for:

  1. Input preparation (i.e. downloads)

  2. Output preparation (i.e. unlocks and removals)

  3. Command execution

  4. Dataset modification saving (i.e. additions, deletions, modifications)

By default, run will stop on the first error. This means that, for example, any failure to download content will prevent command execution. A failing command will prevent saving a potential dataset modification. This behavior can be altered using the standard on_failure switch of the run command.

The emitted result for the command execution contains the provenance record under the run_info key.

Implementation details

Most of the described functionality is implemented by the function datalad.core.local.run.run_command(). It is interfaced by the run command, but also rerun, a utility for automated re-execution based on provenance records, and containers-run (provided by the container extension package) for command execution in DataLad-tracked containerized environments. This function has a more complex interface, and supports a wider range of use cases than described here.

Application-type vs. library-type usage

Historically, DataLad was implemented with the assumption of application-type usage, i.e., a person using DataLad through any of its APIs. Consequently, (error) messaging was primarily targeting humans, and usage advice focused on interactive use. With the increasing utilization of DataLad as an infrastructural component it was necessary to address use cases of library-type or internal usage more explicitly.

DataLad continues to behave like a stand-alone application by default.

For internal use, Python and command-line APIs provide dedicated mode switches.

Library mode can be enabled by setting the boolean configuration setting datalad.runtime.librarymode before the start of the DataLad process. From the command line, this can be done with the option -c datalad.runtime.librarymode=yes, or any other means for setting configuration. In an already running Python process, library mode can be enabled by calling datalad.enable_librarymode(). This should be done immediately after importing the datalad package for maximum impact.

>>> import datalad
>>> datalad.enable_librarymode()

In a Python session, library mode cannot be enabled reliably by just setting the configuration flag after the datalad package was already imported. The enable_librarymode() function must be used.

Moreover, with datalad.in_librarymode() a query utility is provided that can be used throughout the code base for adjusting behavior according to the usage scenario.
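
For example, a minimal sketch of such a conditional:

import datalad

if datalad.in_librarymode():
    # e.g., skip behavior that only makes sense for interactive human users
    ...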

Switching back and forth between modes during the runtime of a process is not supported.

A library mode setting is exported into the environment of the Python process. By default, it will be inherited by all child-processes, such as dataset procedure executions.

Library-mode implications
No Python API docs

Generation of comprehensive doc-strings for all API commands is skipped. This speeds up import datalad.api by about 30%.

File URL handling

DataLad datasets can record URLs for file content access as metadata. This is a feature provided by git-annex and is available for any annexed file. DataLad improves upon the git-annex functionality in two ways:

  1. Support for a variety of (additional) protocols and authentication methods.

  2. Support for special URLs pointing to individual files located in registered (annexed) archives, such as tarballs and ZIP files.

These additional features are available to all functionality that is processing URLs, such as get, addurls, or download-url.

Extensible protocol and authentication support

DataLad ships with a dedicated implementation of an external git-annex special remote named git-annex-remote-datalad. This is a somewhat atypical special remote, because it cannot receive files and store them, but only supports read operations.

Specifically, it uses the CLAIMURL feature of the external special remote protocol to take over processing of URLs with supported protocols in all datasets that have this special remote configured and enabled.

This special remote is automatically configured and enabled in a DataLad dataset as a datalad remote, by commands that utilize its features, such as download-url. Once enabled, DataLad (but also git-annex) is able to act on additional protocols, such as s3://, and the respective URLs can be given directly to commands like git annex addurl, or datalad download-url.
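
For example, once the special remote is enabled (or a command that enables it automatically is used), an S3 URL can be handed directly to the Python API (bucket and key names are hypothetical):

import datalad.api as dl

# downloads the file into the dataset, saves it, and registers the URL
# with git-annex for future re-retrieval
dl.download_url("s3://mybucket/path/to/file.dat", dataset=".")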

Beyond additional protocol support, the datalad special remote also interfaces with DataLad’s Credential management. It can identify a particular credential required for a given URL (based on something called a “provider” configuration), ask for the credential or retrieve it from a credential store, and supply it to the respective service in an appropriate form. Importantly, this feature requires the necessary credential or provider configuration neither to be encoded in a URL (where it would become part of the git-annex metadata), nor to be committed to a dataset. Hence all information that may depend on which entity is performing a URL request, and in what environment, is completely separated from the location information about a particular file’s content. This minimizes the required dataset maintenance effort (when credentials change), and offers a clean separation of identity and availability tracking vs. authentication management.

Indexing and access of archive content

Another git-annex special remote, named git-annex-remote-datalad-archives, is used to enable file content retrieval from annexed archive files, such as tarballs and ZIP files. Its implementation concept is closely related to the git-annex-remote-datalad, described above. Its main difference is that it claims responsibility for a particular type of “URL” (starting with dl+archive:). These URLs encode the identity of an archive file, in terms of its git-annex key name, and a relative path inside this archive pointing to a particular file.

Like git-annex-remote-datalad, only read operations are supported. When a request to a dl+archive: “URL” is made, the special remote identifies the archive file, if necessary obtains it at the precise version needed, and extracts the respective file content from the correct location inside the archive.

This special remote is automatically configured and enabled as datalad-archives by the add-archive-content command. This command indexes annexed archives, and extracts and registers their content in a dataset. File content availability information is recorded in terms of the dl+archive: “URLs”, which are put into the git-annex metadata on a file’s content.

Result records

Result records are the standard return value format for all DataLad commands. Each command invocation yields one or more result records. Result records are routinely inspected throughout the code base, and are used to inform generic error handling, as well as particular calling commands on how to proceed with a specific operation.

The technical implementation of a result record is a Python dictionary. This dictionary must contain a number of mandatory fields/keys (see below). However, an arbitrary number of additional fields may be added to a result record.

The get_status_dict() function simplifies the creation of result records.
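
A minimal sketch of composing a result record with this helper (all field values are hypothetical):

from datalad.interface.results import get_status_dict

res = get_status_dict(
    action="get",
    path="/tmp/ds/data/file.dat",  # absolute, platform-specific path
    type="file",
    status="ok",
    message="already present",
)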

Note

Developers must compose result records with care! DataLad supports custom user-provided hook configurations that use result record fields to decide when to trigger a custom post-result operation. Such custom hooks rely on a persistent naming and composition of result record fields. Changes to result records, including field name changes, field value changes, but also the timing/order of record emission, can potentially break user setups!

Mandatory fields

The following keys must be present in any result record. If any of these keys is missing, DataLad’s behavior is undefined.

action

A string label identifying which type of operation a result is associated with. Labels must not contain white space. They should be compact, lower-case, and use _ (underscore) to separate words in compound labels.

A result without an action label will not be processed and is discarded.

path

A string with an absolute path describing the local entity a result is associated with. Paths must be platform-specific (e.g., Windows paths on Windows, and POSIX paths on other operating systems). When a result is about an entity that has no meaningful relation to the local file system (e.g., a URL to be downloaded), the path value should be determined with respect to the potential impact of the result on any local entity (e.g., a URL downloaded to a local file path, a local dataset modified based on remote information).

status

This field indicates the nature of a result in terms of four categories, identified by a string label.

  • ok: a standard, to-be-expected result

  • notneeded: an operation that was requested, but found to be unnecessary in order to achieve a desired goal

  • impossible: a requested operation cannot be performed, possibly because its preconditions are not met

  • error: an error occurred while performing an operation

Based on the status field, a result is categorized into success (ok, notneeded) and failure (impossible, error). Depending on the on_failure parameterization of a command call, any failure result emitted by a command can lead to an IncompleteResultsError being raised on command exit, or a non-zero exit code on the command line. With on_failure='stop', an operation is halted on the first failure and the command errors out immediately. With on_failure='continue', an operation will continue despite intermediate failures, and the command only errors out at the very end. With on_failure='ignore', the command will not error even when failures occurred. The latter mode can be used in cases where the initial status characterization needs to be corrected for the particular context of an operation (e.g., to relabel expected and recoverable errors).
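
For example, a caller that wants to post-process failures itself could run a command with on_failure='ignore' and inspect the emitted records afterwards (a sketch; the path is hypothetical):

import datalad.api as dl

results = dl.get("data/file.dat", on_failure="ignore")
failures = [r for r in results if r["status"] in ("impossible", "error")]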

Common optional fields

The following fields are not required, but can be used to enrich a result record with additional information that improves its interpretability, or triggers particular optional functionality in generic result processing.

type

This field indicates the type of entity a result is associated with. This may or may not be the type of the local entity identified by the path value. The following values are common, and should be used in matching cases, but arbitrary other values are supported too:

  • dataset: a DataLad dataset

  • file: a regular file

  • directory: a directory

  • symlink: a symbolic link

  • key: a git-annex key

  • sibling: a Dataset sibling or Git remote

message

A message providing additional human-readable information on the nature or provenance of a result. Any non-ok results should have a message providing information on the rationale of their status characterization.

A message can be a string or a tuple. In case of a tuple, the second item can contain values for %-expansion of the message string. Expansion is performed only immediately prior to actually outputting the message, hence string formatting runtime costs can be avoided this way, if a message is not actually shown.

logger

If a result record has a message field, then a given Logger instance (typically from logging.getLogger()) will be used to automatically log this message. The log channel/level is determined based on the datalad.log.result-level configuration setting. By default, this is the debug level. When set to match-status the log level is determined based on the status field of a result record:

  • debug for 'ok', and 'notneeded' results

  • warning for 'impossible' results

  • error for 'error' results

This feature should be used with care. Unconditional logging can lead to confusing double-reporting when results are rendered and also visibly logged.

refds

This field can identify a path (using the same semantics and requirements as the path field) to a reference dataset that represents the larger context of an operation. For example, when recursively processing multiple files across a number of subdatasets, a refds value may point to the common superdataset. This value may influence, for example, how paths are rendered in user-output.

parentds

This field can identify a path (using the same semantics and requirements as the path field) to a dataset containing an entity.

state

A string label categorizing the state of an entity. Common values are:

  • clean

  • untracked

  • modified

  • deleted

  • absent

  • present

error_message

An error message that was captured or produced while achieving a result.

An error message can be a string or a tuple. In the case of a tuple, the second item can contain values for %-expansion of the message string.

exception

An exception that occurred while achieving the reported result.

exception_traceback

A string with a traceback for the exception reported in exception.

Additional fields observed “in the wild”

Given that arbitrary fields are supported in result records, it is impossible to compose a comprehensive list of field names (keys). However, in order to counteract needless proliferation, the following list describes fields that have been observed in implementations. Developers are encouraged to preferably use compatible names from this list, or extend the list for additional items.

In alphabetical order:

bytesize

The size of an entity in bytes (integer).

gitshasum

SHA1 of an entity (string)

prev_gitshasum

SHA1 of a previous state of an entity (string)

key

The git-annex key associated with a type-file entity.

dataset argument

All commands which operate on datasets have a dataset argument (-d or --dataset for the CLI) to identify a single dataset as the context of an operation. If the --dataset argument is not provided, the context of an operation is command-specific. For example, the clone command will consider the dataset which is being cloned to be the context. Typically, however, the dataset that the current working directory belongs to is the context of an operation. In the latter case, if an operation (e.g., get) does not find a dataset in the current working directory, it fails with a NoDatasetFound error.

Impact on relative path resolution

With one exception, the nature of a provided dataset argument does not impact the interpretation of relative paths. Relative paths are always considered to be relative to the process working directory.

The one exception to this rule is passing a Dataset object instance as dataset argument value in the Python API. In this, and only this, case, a relative path is interpreted as relative to the root of the respective dataset.

Special values

There are some pre-defined “shortcut” values for dataset arguments:

^

Represents the topmost superdataset that contains the dataset the current directory is part of.

^.

Represents the root directory of the dataset the current directory is part of.

///

Represents the “default” dataset located under $HOME/datalad/.

Use cases
Save modification in superdataset hierarchy

Sometimes it is convenient to work only in the context of a subdataset. Executing a datalad save <subdataset content> will record changes to the subdataset, but will leave existing superdatasets dirty, as the subdataset state change will not be saved there. Using the dataset argument it is possible to redefine the scope of the save operation. For example:

datalad save -d^ <subdataset content>

will perform the exact same save operation in the subdataset, but additionally save all subdataset state changes in all superdatasets up to the root of the dataset hierarchy. Except for the specification of the dataset scope there is no need to adjust path arguments or change the working directory.

Log levels

Log messages are emitted by a wide range of operations within DataLad. They are categorized into distinct levels. While some levels have self-explanatory descriptions (e.g. warning, error), others are less specific (e.g. info, debug).

Common principles
Parenthetical log messages use the same level

When log messages are used to indicate the start and end of an operation, both start and end message use the same log-level.

Use cases
Command execution

For the WitlessRunner and its protocols the following log levels are used:

  • High-level execution -> debug

  • Process start/finish -> 8

  • Threading and IO -> 5

Drop dataset components

§1 The drop command is the antagonist of get. Whatever drop can do should be undoable by a subsequent get (given unchanged remote availability).

§2 Like get, drop primarily operates on a mandatory path specification (to discover relevant files and subdatasets to operate on).

§3 drop has a --what parameter that serves as an extensible “mode-switch” to cover all relevant scenarios, like ‘drop all file content in the work-tree’ (e.g. --what files, default, #5858), ‘drop all keys from any branch’ (i.e. --what allkeys, #2328), but also ‘“drop” AKA uninstall entire subdataset hierarchies’ (e.g. --what all), or drop preferred content (--what preferred-content, #3122).

§4 drop prevents data loss by default (#4750). Like get, it features a --reckless “mode-switch” to disable some or all potentially slow safety mechanisms, i.e. ‘key available in sufficient number of other remotes’, ‘main or all branches pushed to remote(s)’ (#1142), ‘only check availability of keys associated with the worktree, but not other branches’. “Reckless operation” can be automatic, when following a reckless get (#4744).

§5 drop properly manages annex lifetime information, e.g. by announcing an annex as dead on removal of a repository (#3887).

§6 Like get, drop supports parallelization (#1953).

§7 datalad drop is not intended to be a comprehensive frontend to git annex drop (e.g., limited support for #1482 outside standard use cases like #2328).

Note

It is understood that the current uninstall command is largely or completely made obsolete by this drop concept.

§8 Given the development in #5842 towards the complete obsolescence of remove it becomes necessary to import one of its proposed features:

§9 drop should be able to recognize a botched attempt to delete a dataset with a plain rm -rf, and act on it in a meaningful way, even if it is just hinting at chmod + rm -rf.

Use cases

The following use cases operate in the dataset hierarchy depicted below:

super
├── dir
│   ├── fileD1
│   └── fileD2
├── fileS1
├── fileS2
├── subA
│   ├── fileA
│   ├── subsubC
│   │   ├── fileC
│   └── subsubD
└── subB
    └── fileB

Unless explicitly stated, all commands are assumed to be executed in the root of super.

  • U1: datalad drop fileS1

    Drops the file content of fileS1 (as currently done by drop)

  • U2: datalad drop dir

    Drop all file content in the directory (fileD{1,2}), as currently done by drop

  • U3: datalad drop subB

    Drop all file content from the entire subB (fileB)

  • U4: datalad drop subB --what all

    Same as above (default --what files), because it is not operating in the context of a superdataset (no automatic upward lookups). Possibly hint at the next usage pattern.

  • U5: datalad drop -d . subB --what all

    Drop all from the superdataset under this path. I.e. drop all from the subdataset and drop the subdataset itself (AKA uninstall)

  • U6: datalad drop subA --what all

    Error: “subA contains subdatasets, forgot --recursive?”

  • U7: datalad drop -d . subA -r --what all

    Drop all content from the subdataset (fileA) and its subdatasets (fileC), uninstall the subdataset (subA) and its subdatasets (subsubC, subsubD)

  • U8: datalad drop subA -r --what all

    Same as above, but keep subA installed

  • U9: datalad drop subA -r

    Drop all content from the subdataset and its subdatasets (fileA, fileC)

  • U10: datalad drop . -r --what all

    Drops all file content and subdatasets, but leaves the superdataset repository behind

  • U11: datalad drop -d . subB

    Does nothing and hints at alternative usage, see https://github.com/datalad/datalad/issues/5832#issuecomment-889656335

  • U12: cd .. && datalad drop super/dir

    Like get, errors because the execution is not associated with a dataset. This avoids complexities when the given paths point to multiple (disjoint) datasets. It is understood that it could be done, but it is intentionally not done. datalad -C super drop dir or datalad drop -d super super/dir would work.

Python import statements

The following rules apply to any import statement in the code base:

  • All imports must be absolute, unless they import individual pieces of an integrated code component that is only split across several source code files for technical or organizational reasons.

  • Imports must be placed at the top of a source file, unless there is a specific reason not to do so (e.g., delayed import due to performance concerns, circular dependencies). If such a reason exists, it must be documented by a comment at the import statement.

  • There must be no more than one import per line.

  • Multiple individual imports from a single module must follow the pattern:

    from <module> import (
        symbol1,
        symbol2,
    )
    

    Individual imported symbols should be sorted alphabetically. The last symbol line should end with a comma.

  • Imports from packages and modules should be grouped in categories like

    • Standard library packages

    • 3rd-party packages

    • DataLad core (absolute imports)

    • DataLad extensions

    • DataLad core (“local” relative imports)

    Sorting imports can be aided by https://github.com/PyCQA/isort (e.g. python -m isort -m3 --fgw 2 --tc <filename>).

Examples
   from collections import OrderedDict
   import logging
   import os

   from datalad.utils import (
       bytes2human,
       ensure_list,
       ensure_unicode,
       get_dataset_root as gdr,
   )

In the `datalad/submodule/tests/test_mod.py` test file, demonstrating an "exception" to the absolute imports
rule, where test files accompany corresponding files of the underlying module::

   import os

   from datalad.utils import ensure_list

   from ..mod import func1

   from datalad.tests.utils_pytest import assert_true
Miscellaneous patterns

DataLad is the result of a distributed and collaborative development effort over many years. During this time the scope of the project has changed multiple times. As a consequence, the API and employed technologies have been adjusted repeatedly. Depending on the age of a piece of code, a clear software design is not always immediately visible. This section documents a few design patterns that the project strives to adopt at present. Changes to existing code and new contributions should follow these guidelines.

Generator methods in Repo classes

Substantial parts of DataLad are implemented to behave like Python generators in order to be maximally responsive when processing long-running tasks. This includes methods of the core API classes GitRepo and AnnexRepo. By convention, such methods carry a trailing _ in their name. In some cases, sibling methods with the same name, but without the trailing underscore, are provided. These behave like their generator equivalents, but return an iterable only once processing is fully completed.

Calls to Git commands

DataLad is built on Git, so calls to Git commands are a key element of the code base. All such calls should be made through methods of the GitRepo class. This is necessary because only there is it ensured that Git operates under the desired conditions (environment configuration, etc.).

For some functionality, for example querying and manipulating gitattributes, dedicated methods are provided. However, in many cases simple one-off calls to get specific information from Git, or to trigger certain operations, are needed. For these purposes the GitRepo class provides a set of convenience methods aiming to cover use cases requiring particular return values:

  • test success of a command: call_git_success()

  • obtain stdout of a command: call_git()

  • obtain a single output line: call_git_oneline()

  • obtain items from output split by a separator: call_git_items_()

All these methods take care of raising appropriate exceptions when expected conditions are not met. Whenever desired functionality can be achieved using simple custom calls to Git via these methods, their use is preferred over the implementation of additional, dedicated wrapper methods.
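
A few sketched usages of these helpers (the repository path is hypothetical):

from datalad.support.gitrepo import GitRepo

repo = GitRepo("/tmp/some/repo")

# test success of a command
is_clean = repo.call_git_success(["diff", "--quiet"])

# obtain a single output line
head = repo.call_git_oneline(["rev-parse", "HEAD"])

# obtain items from output split by a separator (by default: lines)
tracked = list(repo.call_git_items_(["ls-files"]))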

Command examples

Examples of Python and commandline invocations of DataLad’s user-oriented commands are defined in the class of the respective command as dictionaries within _examples_:

_examples_ = [
 dict(text="""Create a dataset 'mydataset' in the current directory""",
      code_py="create(path='mydataset')",
      code_cmd="datalad create mydataset"),
 dict(text="""Apply the text2git procedure upon creation of a dataset""",
      code_py="create(path='mydataset', cfg_proc='text2git')",
      code_cmd="datalad create -c text2git mydataset"),
]

The formatting of code lines is preserved. Changes to existing examples and new contributions should provide examples for Python and commandline API, as well as a concise description.

Exception handling
Catching exceptions

Whenever we catch an exception in an except clause, the following rules apply:

  • unless we (re-)raise, the first line instantiates a CapturedException:

    except Exception as e:
        ce = CapturedException(e)
    

    First, this ensures a low-level (8) log entry including the traceback of that exception. The depth of the included traceback can be limited by setting the datalad.exc.str.tb_limit config accordingly.

    Second, it deletes the frame stack references of the exception and keeps textual information only, in order to avoid circular references, where an object (whose method raised the exception) isn’t going to be picked up by garbage collection. This can be particularly troublesome if that object holds a reference to a subprocess, for example. However, it’s not easy to see in what situation this would really be needed, and we never need anything other than the textual information about what happened. Making the reference cleaning a general rule is easiest to write, maintain and review.

  • if we raise, neither a log entry nor such a CapturedException instance is to be created. Eventually, there will be a spot where that (re-)raised exception is caught. This then is the right place to log it. That log entry will have the traceback, there’s no need to leave a trace by means of log messages!

  • if we raise, but do not simply reraise that exact same exception, in order to change the exception class and/or its message, raise from must be used!:

    except SomeError as e:
        raise NewError("new message") from e
    

    This ensures that the original exception is properly registered as the cause for the exception via its __cause__ attribute. Hence, the original exception’s traceback will be part of the later on logged traceback of the new exception.

Messaging about an exception

In addition to the auto-generated low-level log entry there might be a need to create a higher-level log, a user message or a (result) dictionary that includes information from that exception. While such messaging may use anything the (captured) exception provides, please consider that “technical” details about an exception are already auto-logged and generally not incredibly meaningful for users.

For message creation CapturedException comes with a couple of format_* helper methods, its __str__ provides a short representation of the form ExceptionClass(message) and its __repr__ the log form with a traceback that is used for the auto-generated log.

For result dictionaries CapturedException can be assigned to the field exception. Currently, get_status_dict will consider this field and create an additional field with a traceback string. Hence, whether putting a captured exception into that field actually has an effect depends on whether get_status_dict is subsequently used with that dictionary. In the future such functionality may move into result renderers instead, leaving the decision of what to do with the passed CapturedException to them. Therefore, even if of no immediate effect, enhancing the result dicts accordingly already makes sense, since it may already be useful when using datalad via its Python interface, and it will provide instant benefits whenever the result rendering gets such an upgrade.
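
A combined sketch of the rules above (the operation, path, and message are hypothetical):

from datalad.interface.results import get_status_dict
from datalad.support.exceptions import CapturedException

def do_something(path):
    # hypothetical operation that may fail
    raise RuntimeError(f"cannot process {path}")

def perform(path):
    try:
        do_something(path)
    except Exception as e:
        ce = CapturedException(e)     # auto-logs the traceback at a low level
        yield get_status_dict(
            action="do_something",
            path=path,
            status="error",
            message="operation failed",
            exception=ce,             # get_status_dict adds a traceback string
        )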

Credential management

Various components of DataLad need to be passed credentials to interact with services that require authentication. This includes downloading files, but also things like REST API usage or authenticated cloning. Key components of DataLad’s credential management are credential types, providers, authenticators and downloaders.

Credentials

Supported credential types include basic user/password combinations, access tokens, and a range of tailored solutions for particular services. All credential type implementations are derived from a common Credential base class. A mapping from string labels to credential classes is defined in datalad.downloaders.CREDENTIAL_TYPES.

Importantly, credentials must be identified by a name. This name is a label that is often hard-coded in the program code of DataLad, any of its extensions, or specified in a dataset or in provider configurations (see below).

Given a credential name, one or more credential component(s) (e.g., token, username, or password) can be looked up by DataLad in at least two different locations. These locations are tried in the following order, and the first successful lookup yields the final value.

  1. A configuration item datalad.credential.<name>.<component>. Such configuration items can be defined in any location supported by DataLad’s configuration system. As with any other specification of configuration items, environment variables can be used to set or override credentials. Variable names take the form DATALAD_CREDENTIAL_<NAME>_<COMPONENT>, and the standard rules for translating environment variable names into configuration variable names apply.

  2. DataLad uses the keyring package https://pypi.org/project/keyring to connect to any of its supported back-ends for setting or getting credentials, via a wrapper in keyring_. This provides support for credential storage on all major platforms, but also extensibility, allowing 3rd-parties to implement and use specialized solutions.

When a credential is required for operation, but could not be obtained via any of the above approaches, DataLad can prompt for credentials in interactive terminal sessions. Interactively entered credentials will be stored in the active credential store available via the keyring package. Note, however, that the keyring approach is somewhat abused by datalad. The wrapper only uses get_/set_password of keyring with the credential’s FIELDS as the name to query (essentially turning the keyring into a plain key-value store) and “datalad-<CREDENTIAL-LABEL>” as the “service name”. With this approach it’s not possible to use credentials in a system’s keyring that were defined by other, datalad unaware software (or users).

When a credential value is known but invalid, the invalid value must be removed or replaced in the active credential store. By setting the configuration flag datalad.credentials.force-ask, DataLad can be instructed to force interactive credential re-entry to effectively override any stored credential with a new value.
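
As a sketch, a token component for a hypothetical credential named mycred could hence be supplied via the corresponding environment variable before DataLad code runs:

import os

# equivalent to setting the configuration item datalad.credential.mycred.token
# (credential name and value are hypothetical)
os.environ["DATALAD_CREDENTIAL_MYCRED_TOKEN"] = "c0ffee"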

Providers

Providers associate credentials with a context in which they are to be used, and are defined by configuration files. A single provider is represented by a Provider object, and the list of available providers is represented by the Providers class. A provider is identified by a label and stored in a dedicated config file per provider named LABEL.cfg. Such a file can reside in a dataset (under .datalad/providers/), at the user level (under {user_config_dir}/providers), at the system level (under {site_config_dir}/providers) or come packaged with the datalad distribution (in directory configs next to providers.py). Such a provider specifies a regular expression to match URLs against, and assigns an authenticator and credentials to be used for a match. Credentials are referenced by their label, which in turn is the name of another section in such a file specifying the type of the credential. References to credential and authenticator types are strings that are mapped to classes by the following dict definitions:

  • datalad.downloaders.AUTHENTICATION_TYPES

  • datalad.downloaders.CREDENTIAL_TYPES

Available providers can be loaded by Providers.from_config_files and Providers.get_provider(url) will match a given URL against them and return the appropriate Provider instance. A Provider object will determine a downloader to use (derived from BaseDownloader), based on the URL’s protocol.

Note that the provider config files do not currently follow datalad’s general config approach. Instead they are special config files, read by configparser.ConfigParser, that are not compatible with git-config and hence the ConfigManager.

There are currently two ways of storing a provider and thus creating its config file: Providers.enter_new and Providers._store_new. The former will only work interactively and provides the user with options to choose from, while the latter is non-interactive and can therefore only be used when all properties of the provider config are known and passed to it. There’s no way at the moment to store an existing Provider object directly.

Integration with Git

In addition, there’s a special case for interfacing git-credential: A dedicated GitCredential class is used to talk to Git’s git-credential command instead of the keyring wrapper. This class has identical fields to the UserPassword class and thus can be used by the same authenticators. Since Git’s way to deal with credentials doesn’t involve labels but only matching URLs, it is - in some sense - the equivalent of datalad’s provider layer. However, providers don’t talk to a backend, credentials do. Hence, a more seamless integration requires some changes in the design of datalad’s credential system as a whole.

In the opposite direction - making Git aware of datalad’s credentials - there’s no special casing, though. DataLad comes with a git-credential-datalad executable. Whenever Git is configured to use it by setting credential.helper=datalad, it will be able to query datalad’s credential system for a provider matching the URL in question and retrieve the credentials referenced by this provider. This helper can also store a new provider+credentials when asked to do so by Git. It can do this interactively, asking a user to confirm/change that config or - if credential.helper='datalad --non-interactive' - try to non-interactively store with its defaults.

Authenticators

Authenticators are used by downloaders to issue authenticated requests. They are not easily available for direct use with requests made outside of the downloaders.

URL substitution

URL substitution is a transformation of a given URL using a set of specifications. Such specification can be provided as configuration settings (via all supported configuration sources). These configuration items must follow the naming scheme datalad.clone.url-substitute.<label>, where <label> is an arbitrary identifier.

A substitution specification is a string with a match and substitution expression, each following Python’s regular expression syntax. Both expressions are concatenated into a single string with an arbitrary delimiter character. The delimiter is defined by prefixing the string with the delimiter. Prefix and delimiter are stripped from the expressions before processing. Example:

,^http://(.*)$,https://\\1

A particular configuration item can be defined multiple times (see examples below) to form a substitution series. Substitutions in the same series will be applied incrementally, in order of their definition. If the first substitution expression does not match, the entire series will be ignored. However, following a first positive match all further substitutions in a series are processed, regardless whether intermediate expressions match or not.

Any number of substitution series can be configured. They will be considered in no particular order. Consequently, it is advisable to make the first match specification of any series as specific as possible, in order to prevent undesired transformations.

Examples

Change the protocol component of a given URL in order to hand over further processing to a dedicated Git remote helper. Specifically, the following example converts Open Science Framework project URLs like https://osf.io/f5j3e/ into osf://f5j3e, a URL that can be handled by git-remote-osf, the Git remote helper provided by the datalad-osf extension package:

datalad.clone.url-substitute.osf = ,^https://osf.io/([^/]+)[/]*$,osf://\1

Here is a more complex example with a series of substitutions. The first expression ensures that only GitHub URLs are being processed. The associated substitution disassembles the URL into its only two relevant components, the organisation/user name, and the project name:

datalad.clone.url-substitute.github = ,https?://github.com/([^/]+)/(.*)$,\1###\2

All other expressions in this series that are described below will only be considered if the above expression matched.

The next two expressions in the series normalize URL components that may be auto-generated by some DataLad functionality, e.g. subdataset location candidate generation from directory names:

# replace (back)slashes with a single dash
datalad.clone.url-substitute.github = ,[/\\]+,-

# replace whitespace (URL-quoted or not) with a single underscore
datalad.clone.url-substitute.github = ,\s+|(%2520)+|(%20)+,_

The final expression in the series is recombining the organization/user name and project name components back into a complete URL:

datalad.clone.url-substitute.github = ,([^#]+)###(.*),https://github.com/\1/\2
Threaded runner
Threads

DataLad often requires the execution of subprocesses. While subprocesses are executed, datalad, i.e. its main thread, should be able to read data from stdout and stderr of the subprocess as well as write data to stdin of the subprocess. This requires a way to efficiently multiplex reading from stdout and stderr of the subprocess as well as writing to stdin of the subprocess.

Since non-blocking IO and waiting on multiple sources (poll or select) differs vastly in terms of capabilities and API on different OSs, we decided to use blocking IO and threads to multiplex reading from different sources.

Generally we have a number of threads that might be created and executed, depending on the need for writing to stdin or reading from stdout or stderr. Each thread can read from either a single queue or a file descriptor. Reading is done blocking. Each thread can put data into multiple queues. This is used to transport data that was read as well as for signaling conditions like closed file descriptors.

Conceptually, there are the main thread and two different types of threads:

  • type 1: transport threads (1 thread per process I/O descriptor)

  • type 2: process waiting thread (1 thread)

Transport Threads

Besides the main thread, there might be up to three additional threads to handle data transfer to stdin, and from stdout and stderr. Each of those threads copies data between queues and file descriptors in a tight loop. The stdin-thread reads from an input-queue, the stdout- and stderr-threads write to an output queue. Each thread signals its exit to a set of signal queues, which might be identical to the output queues.

The stdin-thread reads data from a queue and writes it to the stdin-file descriptor of the sub-process. If it reads None from the queue, it will exit. The thread will also exit, if an exit is requested by calling thread.request_exit(), or if an error occurs during writing. In all cases it will enqueue a None to all its signal-queues.

The stdout- and stderr-threads read from the respective file descriptor and enqueue data into their output queue, unless the data has zero length (which indicates a closed descriptor). On a zero-length read they exit and enqueue None into their signal queues.

All queues are infinite. Nevertheless, signaling is performed with a timeout of 100 milliseconds in order to ensure that threads can exit.

Process Waiting Thread

The process waiting thread waits for a given process to exit and enqueues an exit notification into its signal queues.

Main Thread

There is a single queue, the output_queue, on which the main thread waits, after all transport threads and the process waiting thread are started. The output_queue is the signaling queue and the output queue of the stderr-thread and the stdout-thread. It is also the signaling queue of the stdin-thread, and it is the signaling queue for the process waiting thread.

The main thread waits on the output_queue for data or signals and handles them accordingly, i.e. calls data callbacks of the protocol if data arrives, and calls connection-related callbacks of the protocol if other signals arrive. If no messages arrive on the output_queue, the main thread blocks for 100ms. If it is unblocked, either by getting a message or due to elapsing of the 100ms, it will process timeouts. If the timeout-parameter to the constructor was not None, it will check the last time any of the monitored files (stdout and/or stderr) yielded data. If the time is larger than the specified timeout, it will call the timeout method of the protocol instance. Due to this implementation, the resolution for timeouts is 100ms. The main thread handles the closing of stdin-, stdout-, and stderr-file descriptors if all other threads have terminated and if output_queue is empty. These tasks are either performed in the method ThreadedRunner.run() or in a result generator that is returned by ThreadedRunner.run() whenever send() is called on it.

Protocols

Due to its history datalad uses the protocol defined in asyncio.protocols.SubprocessProtocol and in asyncio.protocols.BaseProtocol. To keep compatibility with the code base, the threaded-runner implementation uses the same interface. Please note, although we use the same interface and although the interface is defined in the asyncio libraries, the threaded-runner implementation does not make any use of asyncio. The description of the interface nevertheless applies in the context of the threaded-runner. The following methods of the SubprocessProtocol are supported.

  • SubprocessProtocol.pipe_data_received(fd, data)

  • SubprocessProtocol.pipe_connection_lost(fd, exc)

  • SubprocessProtocol.process_exited()

In addition the following methods of BaseProtocol are supported:

  • BaseProtocol.connection_made(transport)

  • BaseProtocol.connection_lost(exc)

The datalad-provided protocol datalad.runner.protocol.WitlessProtocol provides an additional callback:

  • WitlessProtocol.timeout(fd)

The method timeout() will be called when the parameter timeout in WitlessRunner.run, ThreadedRunner.run, or run_command is set to a number specifying the desired timeout in seconds. If no data is received from stdout or stderr (if those are supposed to be captured), the method WitlessProtocol.timeout(fd) is called with fd set to the respective file number, e.g. 1, or 2. If WitlessProtocol.timeout(fd) returns True, only the corresponding file descriptor will be closed and the associated threads will exit.

The method WitlessProtocol.timeout(fd) is also called if stdout, stderr and stdin are closed and the process does not exit within the given interval. In this case fd is set to None. If WitlessProtocol.timeout(fd) returns True the process is terminated.

Object and Generator Results

If the protocol that is provided to run() does not inherit datalad.runner.protocol.GeneratorMixIn, the final result that will be returned to the caller is determined by calling WitlessProtocol._prepare_result(). Whatever object this method returns will be returned to the caller.

If the protocol that is provided to run() does inherit datalad.runner.protocol.GeneratorMixIn, run() will return a Generator. This generator will yield the elements that were sent to it in the protocol-implementation by calling GeneratorMixIn.send_result() in the order in which the method GeneratorMixIn.send_result() is called. For example, if GeneratorMixIn.send_result(43) is called, the generator will yield 43, and if GeneratorMixIn.send_result({"a": 123, "b": "some data"}) is called, the generator will yield {"a": 123, "b": "some data"}.

Internally the generator is implemented by keeping track of the process state and by waiting on the output_queue when send() (or __next__()) is called on it.
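
A minimal sketch of a generator-style protocol (the import locations follow the class names given above; the constructor signature of WitlessProtocol and the explicit __init__ chaining are assumptions that mirror how DataLad's own combined protocols are written):

from datalad.runner.protocol import (
    GeneratorMixIn,
    WitlessProtocol,
)

class LineYieldingProtocol(GeneratorMixIn, WitlessProtocol):
    # request capturing of the subprocess' stdout
    proc_out = True

    def __init__(self, done_future=None, encoding=None):
        GeneratorMixIn.__init__(self)
        WitlessProtocol.__init__(self, done_future, encoding)

    def pipe_data_received(self, fd, data):
        if fd == 1:  # stdout
            for line in data.decode().splitlines():
                # every sent item is yielded by the generator returned by run()
                self.send_result(line)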

BatchedCommand and BatchedAnnex
Batched Command

The class BatchedCommand (in datalad.cmd), holds an instance of a running subprocess, allows to send requests to the subprocess over its stdin, and to receive responses from the subprocess over its stdout.

Requests can be provided to an instance of BatchedCommand by passing a single request or a list of requests to BatchedCommand.__call__(), i.e. by applying the function call-operator to an instance of BatchedCommand. A request is either a string or a tuple of strings. In the latter case, the elements of the tuple will be joined by " ". More than one request can be given by providing a list of requests, i.e. a list of strings or tuples. In this case, the return value will be a list with one response for every request.

BatchedCommand sends each request to the subprocess as a single line, terminated by "\n". After the request is sent, BatchedCommand calls an output-handler with the stdout-ish of the subprocess (an object that provides a readline()-function operating on the stdout of the subprocess) as argument. The output-handler can be provided to the constructor. If no output-handler is provided, a default output-handler is used. The default output-handler reads a single output line on stdout, using io.IOBase.readline(), and returns the rstrip()-ed line.

The subprocess must at least emit one line of output per line of input in order to prevent the calling thread from blocking. In addition, the size of the output, i.e. the number of lines that the result consists of, must be discernible by the output-handler. That means, the subprocess must either return a fixed number of lines per input line, or it must indicate the end of a result in some other way, e.g. with an empty line.
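
A trivial sketch using a subprocess that emits exactly one output line per input line, so the default output-handler applies:

from datalad.cmd import BatchedCommand

bc = BatchedCommand(["cat"])        # echoes each request line back
print(bc("hello"))                  # -> 'hello'
print(bc([("a", "b"), "c"]))        # -> ['a b', 'c']
bc.close()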

Remark: In principle any output processing could be performed. But, if the output-handler blocks on stdout, the calling thread will be blocked. Due to the limited capabilities of the stdout-ish that is passed to the output-handler, the output-handler must rely on readline() to process the output of the subprocess. Together with the line-based request sending, BatchedCommand is geared towards supporting the batch processing modes of git and git-annex. This has to be taken into account when providing a custom output handler.

When BatchedCommand.close() is called, stdin, stdout, and stderr of the subprocess are closed. This indicates the end of processing to the subprocess. Generally the subprocess is expected to exit shortly after that. BatchedCommand.close() will wait for the subprocess to end, if the configuration datalad.runtime.stalled-external is set to "wait". If the configuration datalad.runtime.stalled-external is set to "abandon", BatchedCommand.close() will return after “timeout” seconds if timeout was provided to BatchedCommand.__init__(), otherwise it will return after 11 seconds. If a timeout occurred, the attribute wait_timed_out of the BatchedCommand instance will be set to True. If exception_on_timeout=True is provided to BatchedCommand.__init__(), a subprocess.TimeoutExpired exception will be raised on a timeout while waiting for the process. It is not safe to reuse a BatchedCommand instance after such an exception was raised.

Stderr of the subprocess is gathered in a byte-string. Its content will be returned by BatchedCommand.close() if the parameter return_stderr is True.

Implementation details

BatchedCommand uses WitlessRunner with a protocol that has datalad.runner.protocol.GeneratorMixIn as a super-class. The protocol uses an output-handler to process data, if an output-handler was specified during construction of BatchedCommand.

BatchedCommand.close() queries the configuration key datalad.runtime.stalled-external to determine how to handle non-exiting processes (there is no killing, processes or process zombies might just linger around until the next reboot).

The current implementation of BatchedCommand can process a list of multiple requests at once, but it will collect all answers before returning a result. That means, if you send 1000 requests, BatchedCommand will return after having received 1000 responses.

BatchedAnnex

BatchedAnnex is a subclass of BatchedCommand (which it actually doesn’t have to be, it just adds git-annex specific parameters to the command and sets a specific output handler).

BatchedAnnex provides a new output-handler if the constructor-argument json is True. In this case, an output handler is used that reads a single line from stdout, strips the line and converts it into a json object, which is returned. If the stripped line is empty, an empty dictionary is returned.

Standard parameters

Several “standard parameters” are used in various DataLad commands. Those standard parameters have an identical meaning across the commands they are used in. Commands should ensure that they use those “standard parameters” where applicable and do not deviate from the common names nor the common meaning.

Currently used standard parameters are listed below, as well as suggestions on how to harmonize currently deviating standard parameters. Deviations from the agreed upon list should be harmonized. The parameters are listed in their command-line form, but similar names and descriptions apply to their Python form.

-d/--dataset

A pointer to the dataset that a given command should operate on

--dry-run

Display details about the command execution without actually running the command.

-f/--force

Enforce the execution of a command, even when certain security checks would normally prevent this

-J/--jobs

Number of parallel jobs to use.

-m/--message

A commit message to attach to the saved change of a command execution.

-r/--recursive

Perform an operation recursively across subdatasets

-R/--recursion-limit

Limit recursion to a given amount of subdataset levels

-s/--sibling-name [SUGGESTION]

The identifier for a dataset sibling (remote)

Certain standard parameters will have their own design document. Please refer to those documents for more in-depth information.

Positional vs Keyword parameters
Motivation

Python allows for keyword arguments (arguments with default values) to be specified positionally. That complicates the addition or removal of keyword arguments, since such changes must account for their possible positional use. Moreover, in the case of our Interfaces, it contributes to inhomogeneity: when used in the CLI, all keyword arguments must be specified via non-positional --<option>s, whereas the Python interface allows them to be used positionally.

Python 3 added the possibility to use a * separator in the function definition to mandate that all keyword arguments after it must be used only via keyword (<option>=<value>) specification. It is encouraged to use * to explicitly separate out positional from keyword arguments in the majority of cases, and below we outline two major types of constructs.

Interfaces

Subclasses of the Interface provide specification and implementation for both CLI and Python API interfaces. All new interfaces must separate all CLI --options from positional arguments using * in their __call__ signature.

Note that some positional arguments could still be optional (e.g., the destination path for clone), and thus should be listed before *, despite being defined as keyword arguments in the __call__ signature.

A unit-test will be provided to guarantee such consistency between CLI and Python interfaces. Overall, only some old(er) interfaces should remain exceptions to this rule.
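
A sketch of such a __call__ signature, with one optional-but-positional argument before the * separator (the command and its parameters are hypothetical):

from datalad.interface.base import Interface

class ExampleCommand(Interface):
    @staticmethod
    def __call__(path=None,      # optional, but may still be given positionally
                 *,              # all following parameters are keyword-only
                 dataset=None,
                 recursive=False,
                 jobs=None):
        ...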

Regular functions and methods

Use of * is encouraged for any function (or method) with keyword arguments. Generally, * should come before the first keyword argument, but similarly to the Interfaces above, it is left to the discretion of the developer to possibly allocate some (just a few) arguments which could be used positionally if specified.

Docstrings

Docstrings in DataLad source code are used and consumed in many ways. Besides serving as documentation directly in the sources, they are also transformed and rendered in various ways.

  • Command line --help output

  • Python’s help() or IPython’s ?

  • Manpages

  • Sphinx-rendered documentation for the Python API and the command line API

A common source docstring is transformed, amended and tuned specifically for each consumption scenario.

Formatting overview and guidelines

In general, the docstring format follows the NumPy standard. In addition, we follow the reStructuredText guidelines, with the additional features and treatments provided by Sphinx, and some custom formatting outlined below.

Version information

Additions, changes, or deprecation should be recorded in a docstring using the standard Sphinx directives versionadded, versionchanged, deprecated:

.. deprecated:: 0.16
   The ``dryrun||--dryrun`` option will be removed in a future release, use
   the renamed ``dry_run||--dry-run`` option instead.
API-conditional docs

The CMD and PY macros can be used to selectively include documentation for specific APIs only:

options to pass to :command:`git init`. [PY: Options can be given as a list
of command line arguments or as a GitPython-style option dictionary PY][CMD:
Any argument specified after the destination path of the repository will be
passed to git-init as-is CMD].

For API-alternative command and argument specifications the following format can be used:

``<python-api>||<cmdline-api>``

where the double backticks are mandatory and <python-api> and <cmdline-api> represent the respective argument specification for each API. In these specifications only valid argument/command names are allowed, plus a comma character to list multiples, and the dot character to include an ellipsis:

``github_organization||-g,--github-organization``

``create_sibling_...||create-sibling-...``
Reflow text

When automatic transformations negatively affect the presentation of a docstring due to excessive removal of content, leaving “holes”, the REFLOW macro can be used to enclose such segments, in order to reformat them as the final processing step. Example:

|| REFLOW >>
The API has been aligned with the
``create_sibling_...||create-sibling-...`` commands of other GitHub-like
services, such as GOGS, GIN, Gitea.<< REFLOW ||

The start macro must appear on a dedicated line.

Progress reporting

Progress reporting is implemented via the logging system. A dedicated function datalad.log.log_progress() represents the main API for progress reporting. For some standard use cases, the utilities datalad.log.with_progress() and datalad.log.with_result_progress() can simplify result reporting further.

Design and implementation

The basic idea is to use an instance of DataLad’s loggers to emit log messages with particular attributes that are picked up by datalad.log.ProgressHandler (derived from logging.Handler), and are acted on differently, depending on configuration and conditions of a session (e.g., interactive terminal sessions vs. non-interactive usage in scripts). This variable behavior is implemented via the standard library’s logging filters and handlers. Roughly speaking, datalad.log.ProgressHandler will only be used for interactive sessions. In non-interactive cases, progress log messages are inspected by datalad.log.filter_noninteractive_progress(), and are either discarded or treated like any other log message (see datalad.log.LoggerHelper.get_initialized_logger() for details on the handler and filter setup).

datalad.log.ProgressHandler inspects incoming log records for attributes with names starting with dlm_progress. It only processes such records, and passes all others on to the underlying original log handler.

datalad.log.ProgressHandler takes care of creating, updating and destroying any number of simultaneously running progress bars. Progress reports must identify the respective process via an arbitrary string ID. It is the caller’s responsibility to ensure that this ID is unique to the target process/activity.

Reporting progress with log_progress()

Typical progress reporting via datalad.log.log_progress() involves three types of calls.

1. Start reporting progress about a process

A typical call to start progress reporting looks like this:

log_progress(
    # the callable used to emit log messages
    lgr.info,
    # a unique identifier of the activity progress is reported for
    identifier,
    # main message
    'Unlocking files',
    # optional unit string for a progress bar
    unit=' Files',
    # optional label to be displayed in a progress bar
    label='Unlocking',
    # maximum value for a progress bar
    total=nfiles,
)

A new progress bar will be created automatically for any report with a previously unseen activity identifier. It can be configured via the specification of a number of arguments, most notably a target total for the progress bar. See datalad.log.log_progress() for a complete overview.

Starting a progress report must be done with a dedicated call. It cannot be combined with a progress update.

2. Update progress information about a process

Any subsequent call to datalad.log.log_progress() with an activity identifier that has already been seen either updates or finishes the progress reporting for an activity. Updates must contain an update key, which either specifies a new value (if increment=False, the default) or an increment to the previously known value (if increment=True):

log_progress(
    lgr.info,
    # must match the identifier used to start the progress reporting
    identifier,
    # arbitrary message content, string expansion supported just like
    # regular log messages
    "Files to unlock %i", nfiles,
    # critical key for report updates
    update=1,
    # ``update`` could be an absolute value or an increment
    increment=True
)

Updating a progress report can only be done after a progress reporting was initialized (see above).

3. Report completion of a process

A progress bar will remain active until it is explicitly taken down, even if an initially declared total value has been reached. Finishing a progress report requires a final log message with the corresponding identifier which, like the first initializing message, does NOT contain an update key.

log_progress(
    lgr.info,
    identifier,
    # closing log message
    "Completed unlocking files",
)
Progress reporting in non-interactive sessions

datalad.log.log_progress() takes a noninteractive_level argument that can be used to specify a log level at which progress is logged when no progress bars can be used, but actual log messages are produced.

import logging

log_progress(
    lgr.info,
    identifier,
    "Completed unlocking files",
    noninteractive_level=logging.INFO
)

Each call to log_progress() can be given a different log level, in order to control the verbosity of the reporting in such a scenario. For example, it is possible to log the start or end of an activity at a higher level than intermediate updates. It is also possible to single out particular intermediate events, and report them at a higher level.

If no noninteractive_level is specified, the progress update is unconditionally logged at the level implied by the given logger callable.

Reporting progress with with_(result_)progress()

For cases where a list of items needs to be processed sequentially, and progress shall be communicated, two additional helpers can be used: the decorators datalad.log.with_progress() and datalad.log.with_result_progress(). They require a callable that takes a list (or more generally a sequence) of items to be processed as the first positional argument. They both set up and perform all necessary calls to log_progress().

The difference between these helpers is that datalad.log.with_result_progress() expects the callable to produce DataLad result records, and supports custom filters to decide which particular result records to consider for progress reporting (e.g., only records for a particular action and type).
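
A minimal sketch of the latter, assuming with_result_progress() can be applied directly as a decorator to a (hypothetical) function whose first positional argument is the sequence of items and which yields DataLad result records:

from datalad.log import with_result_progress

@with_result_progress
def process_paths(paths):
    # one result record per processed item drives the progress bar
    for path in paths:
        # ... perform the actual per-item work here ...
        yield dict(action='process', path=path, type='file', status='ok')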

Output non-progress information without interfering with progress bars

log_progress() can also be useful when not reporting progress, but ensuring that no other output is interfering with progress bars, and vice versa. The argument maint can be used in this case, with no particular activity identifier (it always impacts all active progress bars):

log_progress(
    lgr.info,
    None,
    'Clear progress bars',
    maint='clear',
)

This call will trigger a temporary discontinuation of any progress bar display. Progress bars can either be re-enabled all at once, by an analog message with maint='refresh', or will re-show themselves automatically when the next update is received. A no_progress() context manager helper can be used to surround your context with those two calls to prevent progress bars from interfering.

GitHub Action

The purpose of the DataLad GitHub Action is to support CI testing with DataLad datasets by making it easy to install datalad and get data from the datasets.

Example Usage

The dataset is installed at ${GITHUB_WORKSPACE}/studyforrest-data-phase2 and all of its data is retrieved:

- uses: datalad/datalad-action@master
  with:
    datasets:
      - source: https://github.com/psychoinformatics-de/studyforrest-data-phase2
        install_get_data: true

Specify advanced options:

- name: Download testing data
  uses: datalad/datalad-action@master
  with:
    datalad_version: ^0.15.5
    add_datalad_to_path: false
    datasets:
      - source: https://github.com/psychoinformatics-de/studyforrest-data-phase2
        branch: develop
        install_path: test_data
        install_jobs: 2
        install_get_data: false
        recursive: true
        recursion_limit: 2
        get_jobs: 2
        get_paths:
          - sub-01
          - sub-02
          - stimuli
Options
datalad_version

datalad version to install. Defaults to the latest release.

add_datalad_to_path

Add datalad to the PATH for manual invocation in subsequent steps.

Defaults to true.

source

URL for the dataset (mandatory).

branch

Git branch to install (optional).

install_path

Path to install the dataset relative to GITHUB_WORKSPACE.

Defaults to the repository name.

install_jobs

Jobs to use for datalad install.

Defaults to auto.

install_get_data

Get all the data in the dataset by passing --get-data to datalad install.

Defaults to false.

recursive

Boolean defining whether to clone subdatasets.

Defaults to true.

recursion_limit

Integer defining limits to recursion.

If not defined, there is no limit.

get_jobs

Jobs to use for datalad get.

Defaults to auto.

get_paths

A list of paths in the dataset to download with datalad get.

Defaults to everything.

Continuous integration and testing

DataLad is tested using a pytest-based testsuite that is run locally and via continuous integrations setups. Code development should ensure that old and new functionality is appropriately tested. The project aims for good unittest coverage (at least 80%).

Running tests

Starting at the top level with datalad/tests, every module in the package comes with a subdirectory tests/, containing the tests for that portion of the codebase. This structure is meant to simplify (re-)running the tests for a particular module. The test suite is run using

pip install -e .[tests]
python -m pytest -c tox.ini datalad
# or, with coverage reports
python -m pytest  -c tox.ini --cov=datalad datalad

Individual tests can be run using a path to the test file, followed by two colons and the test name:

python -m pytest datalad/core/local/tests/test_save.py::test_save_message_file

The set of to-be-run tests can be further sub-selected with environment variable based configurations that enable tests based on their Test annotations, or pytest-specific parameters. Invoking a test run using DATALAD_TESTS_KNOWNFAILURES_PROBE=True pytest datalad, for example, will run tests marked as known failures whether or not they still fail. See section Configuration for all available configurations. Invoking a test run using DATALAD_TESTS_SSH=1 pytest -m xfail -c tox.ini datalad will run only those tests marked as xfail.

Local setup

Local test execution usually requires a local installation with all development requirements. It is recommended to either use a virtualenv, or tox via a tox.ini file in the code base.

CI setup

At the moment, Travis-CI, Appveyor, and GitHub Workflows exercise the test battery for every PR and on the default branch, covering different operating systems, Python versions, and file systems. Tests should be run on the oldest, latest, and current stable Python release. The project uses https://codecov.io for an overview of code coverage.

Writing tests

Additional functionality is tested by extending existing similar tests with new test cases, or by adding new tests to the respective test script of the module. Generally, every file example.py with DataLad code comes with a corresponding tests/test_example.py. Test helper functions assisting various general and DataLad-specific assertions, as well as the construction of test directories and files, can be found in datalad/tests/utils_pytest.py.

Test annotations

datalad/tests/utils_pytest.py also defines test decorators. Some of those are used to annotate tests for various aspects to allow for easy sub-selection via environment variables.

Speed: Please annotate tests that take a while to complete with the following decorators:

  • @slow if the test runs over 10 seconds

  • @turtle if the test runs over 120 seconds (those would typically not be run on CIs)

Purpose: Please further annotate tests that serve a special purpose. As those tests usually also tend to be slower, use these in conjunction with @slow or @turtle when slow.

  • @integration - tests verifying correct operation with external tools/services beyond git/git-annex

  • @usecase - represents some (user) use-case, and not necessarily a “unit-test” of functionality

Dysfunction: If tests are not meant to be run on certain platforms or under certain conditions, @known_failure or @skip annotations can be used. Examples include:

  • @skip, @skip_if_on_windows, @skip_ssh, @skip_wo_symlink_capability, @skip_if_adjusted_branch, @skip_if_no_network, @skip_if_root

  • @known_failure, @known_failure_windows, @known_failure_githubci_win, or @known_failure_githubci_osx (see the combined example below)
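
For illustration, the speed and dysfunction annotations above can be combined on a single (hypothetical) test:

from datalad.tests.utils_pytest import skip_if_on_windows, slow

@slow  # runtime exceeds ~10 seconds, so excluded from quick test runs
@skip_if_on_windows
def test_expensive_roundtrip():
    ...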

Migrating tests from nose to pytest

DataLad’s test suite has been migrated from nose to pytest in the 0.17.0 release. This might be relevant for DataLad extensions that still use nose.

For the time being, datalad.tests.utils keeps providing nose-based utils, and datalad.__init__ keeps providing nose-based fixtures to not break extensions that still use nose for testing. A migration to pytest is recommended, though. To perform a typical migration of a DataLad extension to use pytest instead of nose, go through the following list:

  • keep all the assert_* and ok_ helpers, but import them from datalad.tests.utils_pytest instead

  • for @with_* and other decorators populating positional arguments, convert corresponding posarg to kwarg by adding =None

  • convert all generator-based parametric tests into direct invocations or, preferably, @pytest.mark.parametrize tests (see the sketch below)

  • address DeprecationWarnings in the code. Only where testing a deprecation is desired, add the @pytest.mark.filterwarnings("ignore: BEGINNING OF WARNING") decorator to the test.

For an example, see a “migrate to pytest” PR against datalad-deprecated: datalad/datalad-deprecated#51 .
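
As an illustration of the parametrization step, a (hypothetical) nose-style generator test and its pytest counterpart:

import pytest

# nose style (before):
#
#   def test_addition():
#       for a, b in [(1, 2), (3, 4)]:
#           yield check_addition, a, b

# pytest style (after):
@pytest.mark.parametrize("a,b", [(1, 2), (3, 4)])
def test_addition(a, b):
    assert a + b == b + a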

User messaging: result records vs exceptions vs logging
Motivation

This specification delineates the applicable contexts for using result records, exceptions, progress reporting, specific log levels, or other types of user messaging processes.

Specification
Result records

Result records are the only return value format for all DataLad interfaces.

Contrasting with classic Python interfaces that return specific non-annotated values, DataLad interfaces (i.e. subclasses of datalad.interface.base.Interface) implement message passing by yielding result records that are associated with individual operations. Result records are routinely inspected throughout the code base and their annotations are used to inform general program flow and error handling.

DataLad interface calls can include an on_failure parameterization to specify how to proceed with a particular operation if a returned result record is classified as a failure result. DataLad interface calls can also include a result_renderer parameterization to explicitly enable or disable the rendering of result records.

Developers should be aware that external callers will use DataLad interface call parameterizations that can selectively ignore or act on result records, and that the process should therefore yield meaningful result records. If, in turn, the process itself receives a set of result records from a sub-process, these should be inspected individually in order to identify result values that could require re-annotation or status re-classification.

For user messaging purposes, result records can also be enriched with additional human-readable information on the nature of the result, via the message key, and human-readable hints to the user, via the hints key. Both of these are rendered via the UI Module.
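
A minimal sketch of such an enriched result record, assuming the get_status_dict() helper from datalad.interface.results (the action, path, and texts are hypothetical):

from datalad.interface.results import get_status_dict

def _unlock_one(path):
    # one result record per operation; ``message`` and ``hints`` carry
    # human-readable information for the result renderer
    yield get_status_dict(
        action='unlock',
        path=path,
        type='file',
        status='impossible',
        message='no content present locally',
        hints='obtain the file content first, e.g. via the get command',
    )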

Exception handling

In general, exceptions should be raised when there is no way to ignore or recover from the offending action.

More specifically, raise an exception when:

  1. A DataLad interface’s parameter specifications are violated

  2. An additional requirement (beyond parameters) for the meaningful continuation of a DataLad interface, function, or process is not met

It must be made clear to the user/caller what the exact cause of the exception is, given the context within which the user/caller triggered the action. This is achieved directly via a (re)raised exception, as opposed to logging messages or results records which could be ignored or unseen by the user.

Note

In the case of a complex set of dependent actions it could be expensive to confirm parameter violations. In such cases, initial sub-routines might already generate result records that have to be inspected by the caller, and it could be practically better to yield a result record (with status=[error|impossible]) to communicate the failure. It would then be up to the upstream caller to decide whether to specify on_failure='ignore' or whether to inspect individual result records and turn them into exceptions or not.

Logging

Logging provides developers with additional means to describe steps in a process, so as to allow insight into the program flow during debugging or analysis of e.g. usage patterns. Logging can be turned off externally, filtered, and redirected. Apart from the log-level and message, it is not inspectable and cannot be used to control the logic or flow of a program.

Importantly, logging should not be the primary user messaging method for command outcomes. Therefore:

  1. No interface should rely solely on logging for user communication

  2. Use logging for in-progress user communication via the mechanism for progress reporting

  3. Use logging to inform debugging processes
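
A minimal illustration of point 3, using a standard library logger in the 'datalad.' namespace (the module name and message are hypothetical):

import logging

# loggers below the 'datalad' namespace are picked up by DataLad's logging setup
lgr = logging.getLogger('datalad.mymodule')

lgr.debug("Considering 3 candidate URLs for cloning")  # debugging aid, not user messaging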

UI Module

The ui module provides the means to communicate information to the user in a user-interface-specific manner, e.g. via a console, dialog, or an IPython interface. Internally, all DataLad results processed by the result renderer are passed through the UI module.

Therefore: unless the criteria for logging apply, and unless the message to be delivered to the user is specified via the message key of a result record, developers should let explicit user communication happen through the UI module as it provides the flexibility to adjust to the present UI. Specifically, datalad.ui.message() allows passing a simple message via the UI module.
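
A minimal sketch, assuming the module-level ui switcher object exposed by datalad.ui (the message text is arbitrary):

from datalad.ui import ui

# the active UI backend decides how this message is presented to the user
ui.message('All requested content has been obtained')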

Examples

The following links point to actual code implementations of the respective user messaging methods:

Glossary

DataLad purposefully uses a terminology that is different from the one used by its technological foundations Git and git-annex. This glossary provides definitions for terms used in the datalad documentation and API, and relates them to the corresponding Git/git-annex concepts.

annex

Extension to a Git repository, provided and managed by git-annex as means to track and distribute large (and small) files without having to inject them directly into a Git repository (which would slow Git operations significantly and impair handling of such repositories in general).

CLI

A Command Line Interface. It can be used interactively by executing commands in a shell, or as a programmable API in shell scripts.

DataLad extension

A Python package, developed outside of the core DataLad codebase, which (when installed) typically provides additional top-level datalad commands and/or additional metadata extractors. Visit Handbook, Ch.2. DataLad’s extensions for a representative list of extensions and instructions on how to install them.

dataset

A regular Git repository with an (optional) annex.

sibling

A dataset (location) that is related to a particular dataset, by sharing content and history. In Git terminology, this is a clone of a dataset that is configured as a remote.

subdataset

A dataset that is part of another dataset, by means of being tracked as a Git submodule. As such, a subdataset is also a complete dataset and not different from a standalone dataset.

superdataset

A dataset that contains at least one subdataset.

Commands and API

Command line reference

Main command
datalad
Synopsis
datalad [-c (:name|name=value)] [-C PATH] [--cmd] [-l LEVEL]
    [--on-failure {ignore,continue,stop}]
    [--report-status {success,failure,ok,notneeded,impossible,error}]
    [--report-type {dataset,file}]
    [-f {generic,json,json_pp,tailored,disabled,'<template>'}]
    [--dbg] [--idbg] [--version]
    {create-sibling-github,create-sibling-gitlab,create-sibling-gogs,
     create-sibling-gin,create-sibling-gitea,create-sibling-ria,
     create-sibling,siblings,update,subdatasets,drop,remove,addurls,
     copy-file,download-url,foreach-dataset,install,rerun,run-procedure,
     create,save,status,clone,get,push,run,diff,configuration,wtf,clean,
     add-archive-content,add-readme,export-archive,export-archive-ora,
     export-to-figshare,no-annex,check-dates,unlock,uninstall,
     create-test-dataset,sshrun,shell-completion} ...
Description

Comprehensive data management solution

DataLad provides a unified data distribution system built on Git and git-annex. DataLad command line tools allow one to manipulate (obtain, create, update, publish, etc.) datasets and provide a comprehensive toolbox for joint management of data and code. Compared to Git/annex, it primarily extends their functionality to transparently and simultaneously work with multiple inter-related repositories.

Options
{create-sibling-github,create-sibling-gitlab,create-sibling-gogs,create-sibling-gin,create-sibling-gitea,create-sibling-ria,create-sibling,siblings,update,subdatasets,drop,remove,addurls,copy-file,download-url,foreach-dataset,install,rerun,run-procedure,create,save,status,clone,get,push,run,diff,configuration,wtf,clean,add-archive-content,add-readme,export-archive,export-archive-ora,export-to-figshare,no-annex,check-dates,unlock,uninstall,create-test-dataset,sshrun,shell-completion}
-c (:name|name=value)

specify configuration setting overrides. They override any configuration read from a file. A configuration can also be unset temporarily by prefixing its name with a colon (‘:’), e.g. ‘:user.name’. Overrides specified here may be overridden themselves by configuration settings declared as environment variables.

-C PATH

run as if datalad was started in <path> instead of the current working directory. When multiple -C options are given, each subsequent non-absolute -C <path> is interpreted relative to the preceding -C <path>. This option affects the interpretations of the path names in that they are made relative to the working directory caused by the -C option

--cmd

syntactical helper that can be used to end the list of global command line options before the subcommand label. Options taking an arbitrary number of arguments may require to be followed by a single –cmd in order to enable identification of the subcommand.

-l LEVEL, --log-level LEVEL

set logging verbosity level. Choose among critical, error, warning, info, debug. Also you can specify an integer <10 to provide even more debugging information

--on-failure {ignore,continue,stop}

when an operation fails: ‘ignore’ and continue with remaining operations, the error is logged but does not lead to a non-zero exit code of the command; ‘continue’ works like ‘ignore’, but an error causes a non-zero exit code; ‘stop’ halts on first failure and yields non-zero exit code. A failure is any result with status ‘impossible’ or ‘error’. [Default: ‘continue’, but individual commands may define an alternative default]

--report-status {success,failure,ok,notneeded,impossible,error}

constrain command result report to records matching the given status. ‘success’ is a synonym for ‘ok’ OR ‘notneeded’, ‘failure’ stands for ‘impossible’ OR ‘error’.

--report-type {dataset,file}

constrain command result report to records matching the given type. Can be given more than once to match multiple types.

-f {generic,json,json_pp,tailored,disabled,’<template>’}, --output-format {generic,json,json_pp,tailored,disabled,’<template>’}

select rendering mode for command results. ‘tailored’ enables a command-specific rendering style that is typically tailored to human consumption, if there is one for a specific command, or otherwise falls back on the ‘generic’ result renderer; ‘generic’ renders each result in one line with key info like action, status, path, and an optional message; ‘json’ a complete JSON line serialization of the full result record; ‘json_pp’ like ‘json’, but pretty-printed spanning multiple lines; ‘disabled’ turns off result rendering entirely; ‘<template>’ reports any value(s) of any result properties in any format indicated by the template (e.g. ‘{path}’, compare with JSON output for all key-value choices). The template syntax follows the Python “format() language”. It is possible to report individual dictionary values, e.g. ‘{metadata[name]}’. If a 2nd-level key contains a colon, e.g. ‘music:Genre’, ‘:’ must be substituted by ‘#’ in the template, like so: ‘{metadata[music#Genre]}’. [Default: ‘tailored’]

--dbg

enter Python debugger for an uncaught exception

--idbg

enter IPython debugger for an uncaught exception

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Core commands

A minimal set of commands that cover essential functionality. Core commands receive special scrutiny with regard to API composition and (breaking) changes.

Local operation
datalad create
Synopsis
datalad create [-h] [-f] [-D DESCRIPTION] [-d DATASET] [--no-annex] [--fake-dates]
    [-c PROC] [--version] [PATH] ...
Description

Create a new dataset from scratch.

This command initializes a new dataset at a given location, or the current directory. The new dataset can optionally be registered in an existing superdataset (the new dataset’s path needs to be located within the superdataset for that, and the superdataset needs to be given explicitly via –dataset). It is recommended to provide a brief description to label the dataset’s nature and location, e.g. “Michael’s music on black laptop”. This helps humans to identify data locations in distributed scenarios. By default an identifier comprised of user and machine name, plus path will be generated.

This command only creates a new dataset, it does not add existing content to it, even if the target directory already contains additional files or directories.

Plain Git repositories can be created via –no-annex. However, the result will not be a full dataset, and, consequently, not all features are supported (e.g. a description).

To create a local version of a remote dataset use the install command instead.

NOTE

Power-user info: This command uses git init and git annex init to prepare the new dataset. Registering to a superdataset is performed via a git submodule add operation in the discovered superdataset.

Examples

Create a dataset ‘mydataset’ in the current directory:

% datalad create mydataset

Apply the text2git procedure upon creation of a dataset:

% datalad create -c text2git mydataset

Create a subdataset in the root of an existing dataset:

% datalad create -d . mysubdataset

Create a dataset in an existing, non-empty directory:

% datalad create --force

Create a plain Git repository:

% datalad create --no-annex mydataset
Options
PATH

path where the dataset shall be created, directories will be created as necessary. If no location is provided, a dataset will be created in the location specified by –dataset (if given) or the current working directory. Either way the command will error if the target directory is not empty. Use –force to create a dataset in a non-empty directory. Constraints: value must be a string or Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

INIT OPTIONS

options to pass to git init. Any argument specified after the destination path of the repository will be passed to git-init as-is. Note that not all options will lead to viable results. For example ‘–bare’ will not yield a repository where DataLad can adjust files in its working tree.

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-f, --force

enforce creation of a dataset in a non-empty directory.

-D DESCRIPTION, --description DESCRIPTION

short description to use for a dataset location. Its primary purpose is to help humans to identify a dataset copy (e.g., “mike’s dataset on lab server”). Note that when a dataset is published, this information becomes available on the remote side. Constraints: value must be a string or value must be NONE

-d DATASET, --dataset DATASET

specify the dataset to perform the create operation on. If a dataset is given along with PATH, a new subdataset will be created in it at the path provided to the create command. If a dataset is given but PATH is unspecified, a new dataset will be created at the location specified by this option. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--no-annex

if set, a plain Git repository will be created without any annex.

--fake-dates

Configure the repository to use fake dates. The date for a new commit will be set to one second later than the latest commit in the repository. This can be used to anonymize dates.

-c PROC, --cfg-proc PROC

Run cfg_PROC procedure(s) (can be specified multiple times) on the created dataset. Use run-procedure –discover to get a list of available procedures, such as cfg_text2git.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad save
Synopsis
datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS] [-u] [-F
    MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend] [--version] [PATH
    ...]
Description

Save the current state of a dataset

Saving the state of a dataset records changes that have been made to it. This change record is annotated with a user-provided description. Optionally, an additional tag, such as a version, can be assigned to the saved state. Such tag enables straightforward retrieval of past versions at a later point in time.

NOTE

Before Git v2.22, any Git repository without an initial commit located inside a Dataset is ignored, and content underneath it will be saved to the respective superdataset. DataLad datasets always have an initial commit, hence are not affected by this behavior.

Examples

Save any content underneath the current directory, without altering any potential subdataset:

% datalad save .

Save specific content in the dataset:

% datalad save myfile.txt

Attach a commit message to save:

% datalad save -m 'add file' myfile.txt

Save any content underneath the current directory, and recurse into any potential subdatasets:

% datalad save . -r

Save any modification of known dataset content in the current directory, but leave untracked files (e.g. temporary files) untouched:

% datalad save -u .

Tag the most recent saved state of a dataset:

% datalad save --version-tag 'bestyet'

Save a specific change but integrate into last commit keeping the already recorded commit message:

% datalad save myfile.txt --amend
Options
PATH

path/name of the dataset component to save. If given, only changes made to those components are recorded in the new state. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-m MESSAGE, --message MESSAGE

a description of the state or the changes made to a dataset. Constraints: value must be a string or value must be NONE

-d DATASET, --dataset DATASET

specify the dataset to save. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-t ID, --version-tag ID

an additional marker for that state. Every dataset that is touched will receive the tag. Constraints: value must be a string or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

-u, --updated

if given, only saves previously tracked paths.

-F MESSAGE_FILE, --message-file MESSAGE_FILE

take the commit message from this file. This flag is mutually exclusive with -m. Constraints: value must be a string or value must be NONE

--to-git

flag whether to add data directly to Git, instead of tracking data identity only. Use with caution, there is no guarantee that a file put directly into Git like this will not be annexed in a subsequent save operation. If not specified, it will be up to git-annex to decide how a file is tracked, based on a dataset’s configuration to track particular paths, file types, or file sizes with either Git or git-annex. (see https://git-annex.branchable.com/tips/largefiles).

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,)

--amend

if set, changes are not recorded in a new, separate commit, but are integrated with the changeset of the previous commit, and both together are recorded by replacing that previous commit. This is mutually exclusive with recursive operation.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad run
Synopsis
datalad run [-h] [-d DATASET] [-i PATH] [-o PATH] [--expand {inputs|outputs|both}]
    [--assume-ready {inputs|outputs|both}] [--explicit] [-m MESSAGE]
    [--sidecar {yes|no}] [--dry-run {basic|command}] [-J NJOBS]
    [--version] ...
Description

Run an arbitrary shell command and record its impact on a dataset.

It is recommended to craft the command such that it can run in the root directory of the dataset that the command will be recorded in. However, as long as the command is executed somewhere underneath the dataset root, the exact location will be recorded relative to the dataset root.

If the executed command did not alter the dataset in any way, no record of the command execution is made.

If the given command errors, a COMMANDERROR exception with the same exit code will be raised, and no modifications will be saved. A command execution will not be attempted, by default, when an error occurred during input or output preparation. This default stop behavior can be overridden via –on-failure ….

In the presence of subdatasets, the full dataset hierarchy will be checked for unsaved changes prior command execution, and changes in any dataset will be saved after execution. Any modification of subdatasets is also saved in their respective superdatasets to capture a comprehensive record of the entire dataset hierarchy state. The associated provenance record is duplicated in each modified (sub)dataset, although only being fully interpretable and re-executable in the actual top-level superdataset. For this reason the provenance record contains the dataset ID of that superdataset.

Command format

A few placeholders are supported in the command via Python format specification. “{pwd}” will be replaced with the full path of the current working directory. “{dspath}” will be replaced with the full path of the dataset that run is invoked on. “{tmpdir}” will be replaced with the full path of a temporary directory. “{inputs}” and “{outputs}” represent the values specified by –input and –output. If multiple values are specified, the values will be joined by a space. The order of the values will match that order from the command line, with any globs expanded in alphabetical order (like bash). Individual values can be accessed with an integer index (e.g., “{inputs[0]}”).

Note that the representation of the inputs or outputs in the formatted command string depends on whether the command is given as a list of arguments or as a string (quotes surrounding the command). The concatenated list of inputs or outputs will be surrounded by quotes when the command is given as a list but not when it is given as a string. This means that the string form is required if you need to pass each input as a separate argument to a preceding script (i.e., write the command as “./script {inputs}”, quotes included). The string form should also be used if the input or output paths contain spaces or other characters that need to be escaped.

To escape a brace character, double it (i.e., “{{” or “}}”).

Custom placeholders can be added as configuration variables under “datalad.run.substitutions”. As an example:

Add a placeholder “name” with the value “joe”:

% datalad configuration --scope branch set datalad.run.substitutions.name=joe
% datalad save -m "Configure name placeholder" .datalad/config

Access the new placeholder in a command:

% datalad run "echo my name is {name} >me"

Examples

Run an executable script and record the impact on a dataset:

% datalad run -m 'run my script' 'code/script.sh'

Run a command and specify a directory as a dependency for the run. The contents of the dependency will be retrieved prior to running the script:

% datalad run -m 'run my script' -i 'data/*' 'code/script.sh'

Run an executable script and specify output files of the script to be unlocked prior to running the script:

% datalad run -m 'run my script' -i 'data/*' \
  -o 'output_dir/*' 'code/script.sh'

Specify multiple inputs and outputs:

% datalad run -m 'run my script' -i 'data/*' \
  -i 'datafile.txt' -o 'output_dir/*' -o \
  'outfile.txt' 'code/script.sh'

Use ** to match any file at any directory depth recursively. Single * does not check files within matched directories:

% datalad run -m 'run my script' -i 'data/**/*.dat' \
  -o 'output_dir/**' 'code/script.sh'
Options
COMMAND

command for execution. A leading ‘–’ can be used to disambiguate this command from the preceding options to DataLad.

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to record the command results in. An attempt is made to identify the dataset based on the current working directory. If a dataset is given, the command will be executed in the root directory of this dataset. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-i PATH, --input PATH

A dependency for the run. Before running the command, the content for this relative path will be retrieved. A value of “.” means “run datalad get .”. The value can also be a glob. This option can be given more than once.

-o PATH, --output PATH

Prepare this relative path to be an output file of the command. A value of “.” means “run datalad unlock .” (and will fail if some content isn’t present). For any other value, if the content of this file is present, unlock the file. Otherwise, remove it. The value can also be a glob. This option can be given more than once.

--expand {inputs|outputs|both}

Expand globs when storing inputs and/or outputs in the commit message. Constraints: value must be one of (‘inputs’, ‘outputs’, ‘both’)

--assume-ready {inputs|outputs|both}

Assume that inputs do not need to be retrieved and/or outputs do not need to be unlocked or removed before running the command. This option allows you to avoid the expense of these preparation steps if you know that they are unnecessary. Constraints: value must be one of (‘inputs’, ‘outputs’, ‘both’)

--explicit

Consider the specification of inputs and outputs to be explicit. Don’t warn if the repository is dirty, and only save modifications to the listed outputs.

-m MESSAGE, --message MESSAGE

a description of the state or the changes made to a dataset. Constraints: value must be a string or value must be NONE

--sidecar {yes|no}

By default, the configuration variable ‘datalad.run.record-sidecar’ determines whether a record with information on a command’s execution is placed into a separate record file instead of the commit message (default: off). This option can be used to override the configured behavior on a case-by-case basis. Sidecar files are placed into the dataset’s ‘.datalad/runinfo’ directory (customizable via the ‘datalad.run.record-directory’ configuration variable). Constraints: value must be NONE or value must be convertible to type bool

--dry-run {basic|command}

Do not run the command; just display details about the command execution. A value of “basic” reports a few important details about the execution, including the expanded command and expanded inputs and outputs. “command” displays the expanded command only. Note that input and output globs underneath an uninstalled dataset will be left unexpanded because no subdatasets will be installed for a dry run. Constraints: value must be one of (‘basic’, ‘command’)

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,)

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad status
Synopsis
datalad status [-h] [-d DATASET] [--annex [{basic|availability|all}]] [--untracked
    {no|normal|all}] [-r] [-R LEVELS] [-e {no|commit|full}] [-t
    {raw|eval}] [--version] [PATH ...]
Description

Report on the state of dataset content.

This is an analog to git status that is simultaneously crippled and more powerful. It is crippled because it only supports a fraction of the functionality of its counterpart, and only distinguishes a subset of the states that Git knows about. But it is also more powerful, as it can handle status reports for a whole hierarchy of datasets, with the ability to report on a subset of the content (selection of paths) across any number of datasets in the hierarchy.

Path conventions

All reports are guaranteed to use absolute paths that are underneath the given or detected reference dataset, regardless of whether query paths are given as absolute or relative paths (with respect to the working directory, or to the reference dataset, when such a dataset is given explicitly). Moreover, so-called “explicit relative paths” (i.e. paths that start with ‘.’ or ‘..’) are also supported, and are interpreted as relative paths with respect to the current working directory regardless of whether a reference dataset was specified.

When it is necessary to address a subdataset record in a superdataset without causing a status query for the state _within_ the subdataset itself, this can be achieved by explicitly providing a reference dataset and the path to the root of the subdataset like so:

datalad status --dataset . subdspath

In contrast, when the state of the subdataset within the superdataset is not relevant, a status query for the content of the subdataset can be obtained by adding a trailing path separator to the query path (rsync-like syntax):

datalad status --dataset . subdspath/

When both aspects are relevant (the state of the subdataset content and the state of the subdataset within the superdataset), both queries can be combined:

datalad status --dataset . subdspath subdspath/

When performing a recursive status query, both status aspects of a subdataset are always included in the report.

Content types

The following content types are distinguished:

  • ‘dataset’ – any top-level dataset, or any subdataset that is properly registered in a superdataset

  • ‘directory’ – any directory that does not qualify for type ‘dataset’

  • ‘file’ – any file, or any symlink that is a placeholder for an annexed file when annex-status reporting is enabled

  • ‘symlink’ – any symlink that is not used as a placeholder for an annexed file

Content states

The following content states are distinguished:

  • ‘clean’

  • ‘added’

  • ‘modified’

  • ‘deleted’

  • ‘untracked’

Examples

Report on the state of a dataset:

% datalad status

Report on the state of a dataset and all subdatasets:

% datalad status -r

Address a subdataset record in a superdataset without causing a status query for the state _within_ the subdataset itself:

% datalad status -d . mysubdataset

Get a status query for the state within the subdataset without causing a status query for the superdataset (using a trailing path separator in the query path):

% datalad status -d . mysubdataset/

Report on the state of a subdataset in a superdataset and on the state within the subdataset:

% datalad status -d . mysubdataset mysubdataset/

Report the file size of annexed content in a dataset:

% datalad status --annex
Options
PATH

path to be evaluated. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to query. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--annex [{basic|availability|all}]

Switch whether to include information on the annex content of individual files in the status report, such as recorded file size. By default no annex information is reported (faster). Three report modes are available: basic information like file size and key name (‘basic’); additionally test whether file content is present in the local annex (‘availability’; requires one or two additional file system stat calls, but does not call git-annex), this will add the result properties ‘has_content’ (boolean flag) and ‘objloc’ (absolute path to an existing annex object file); or ‘all’ which will report all available information (presently identical to ‘availability’). The ‘basic’ mode will be assumed when this option is given, but no mode is specified. Constraints: value must be one of (‘basic’, ‘availability’, ‘all’)

--untracked {no|normal|all}

If and how untracked content is reported when comparing a revision to the state of the working tree. ‘no’: no untracked content is reported; ‘normal’: untracked files and entire untracked directories are reported as such; ‘all’: report individual files even in fully untracked directories. Constraints: value must be one of (‘no’, ‘normal’, ‘all’) [Default: ‘normal’]

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

-e {no|commit|full}, --eval-subdataset-state {no|commit|full}

Evaluation of subdataset state (clean vs. modified) can be expensive for deep dataset hierarchies as subdataset have to be tested recursively for uncommitted modifications. Setting this option to ‘no’ or ‘commit’ can substantially boost performance by limiting what is being tested. With ‘no’ no state is evaluated and subdataset result records typically do not contain a ‘state’ property. With ‘commit’ only a discrepancy of the HEAD commit shasum of a subdataset and the shasum recorded in the superdataset’s record is evaluated, and the ‘state’ result property only reflects this aspect. With ‘full’ any other modification is considered too (see the ‘untracked’ option for further tailoring modification testing). Constraints: value must be one of (‘no’, ‘commit’, ‘full’) [Default: ‘full’]

-t {raw|eval}, --report-filetype {raw|eval}

THIS OPTION IS IGNORED. It will be removed in a future release. Dataset component types are always reported as-is (previous ‘raw’ mode), unless annex-reporting is enabled with the –annex option, in which case symlinks that represent annexed files will be reported as type=’file’. Constraints: value must be one of (‘raw’, ‘eval’)

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad diff
Synopsis
datalad diff [-h] [-f REVISION] [-t REVISION] [-d DATASET] [--annex
    [{basic|availability|all}]] [--untracked {no|normal|all}] [-r]
    [-R LEVELS] [--version] [PATH ...]
Description

Report differences between two states of a dataset (hierarchy)

The two to-be-compared states are given via the –from and –to options. These state identifiers are evaluated in the context of the (specified or detected) dataset. In the case of a recursive report on a dataset hierarchy, corresponding state pairs for any subdataset are determined from the subdataset record in the respective superdataset. Only changes recorded in a subdataset between these two states are reported, and so on.

Any paths given as additional arguments will be used to constrain the difference report. As with Git’s diff, it will not result in an error when a path is specified that does not exist on the filesystem.

Reports are very similar to those of the STATUS command, with the distinguished content types and states being identical.

Examples

Show unsaved changes in a dataset:

% datalad diff

Compare a previous dataset state identified by shasum against current worktree:

% datalad diff --from <SHASUM>

Compare two branches against each other:

% datalad diff --from branch1 --to branch2

Show unsaved changes in the dataset and potential subdatasets:

% datalad diff -r

Show unsaved changes made to a particular file:

% datalad diff <path/to/file>
Options
PATH

path to constrain the report to. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-f REVISION, --from REVISION

original state to compare to, as given by any identifier that Git understands. Constraints: value must be a string [Default: ‘HEAD’]

-t REVISION, --to REVISION

state to compare against the original state, as given by any identifier that Git understands. If none is specified, the state of the working tree will be compared. Constraints: value must be a string or value must be NONE

-d DATASET, --dataset DATASET

specify the dataset to query. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--annex [{basic|availability|all}]

Switch whether to include information on the annex content of individual files in the status report, such as recorded file size. By default no annex information is reported (faster). Three report modes are available: basic information like file size and key name (‘basic’); additionally test whether file content is present in the local annex (‘availability’; requires one or two additional file system stat calls, but does not call git-annex), this will add the result properties ‘has_content’ (boolean flag) and ‘objloc’ (absolute path to an existing annex object file); or ‘all’ which will report all available information (presently identical to ‘availability’). The ‘basic’ mode will be assumed when this option is given, but no mode is specified. Constraints: value must be one of (‘basic’, ‘availability’, ‘all’)

--untracked {no|normal|all}

If and how untracked content is reported when comparing a revision to the state of the working tree. ‘no’: no untracked content is reported; ‘normal’: untracked files and entire untracked directories are reported as such; ‘all’: report individual files even in fully untracked directories. Constraints: value must be one of (‘no’, ‘normal’, ‘all’) [Default: ‘normal’]

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Distributed operation
datalad clone
Synopsis
datalad clone [-h] [-d DATASET] [-D DESCRIPTION] [--reckless
    [auto|ephemeral|shared-...]] [--version] SOURCE [PATH] ...
Description

Obtain a dataset (copy) from a URL or local directory

The purpose of this command is to obtain a new clone (copy) of a dataset and place it into a not-yet-existing or empty directory. As such CLONE provides a strict subset of the functionality offered by install. Only a single dataset can be obtained, and immediate recursive installation of subdatasets is not supported. However, once a (super)dataset is installed via CLONE, any content, including subdatasets can be obtained by a subsequent get command.

Primary differences over a direct git clone call are 1) the automatic initialization of a dataset annex (pure Git repositories are equally supported); 2) automatic registration of the newly obtained dataset as a subdataset (submodule), if a parent dataset is specified; 3) support for additional resource identifiers (DataLad resource identifiers as used on datasets.datalad.org, and RIA store URLs as used for store.datalad.org - optionally in specific versions as identified by a branch or a tag; see examples); and 4) automatic configurable generation of alternative access URLs for common cases (such as appending ‘.git’ to the URL in case accessing the base URL failed).

In case the clone is registered as a subdataset, the original URL passed to CLONE is recorded in .gitmodules of the parent dataset, in addition to the resolved URL used internally for git-clone. This allows preserving DataLad-specific URLs like ria+ssh://… for subsequent calls to GET if the subdataset was locally removed later on.

URL mapping configuration

‘clone’ supports the transformation of URLs via (multi-part) substitution specifications. A substitution specification is defined as a configuration setting ‘datalad.clone.url-substitute.<seriesID>’ with a string containing a match and substitution expression, each following Python’s regular expression syntax. Both expressions are concatenated to a single string with an arbitrary delimiter character. The delimiter is defined by prefixing the string with the delimiter. Prefix and delimiter are stripped from the expressions (Example: “,^http://(.*)$,https://\1”). This setting can be defined multiple times, using the same ‘<seriesID>’. Substitutions in a series will be applied incrementally, in order of their definition. The first substitution in such a series must match, otherwise no further substitutions in a series will be considered. However, following the first match all further substitutions in a series are processed, regardless of whether intermediate expressions match or not. Substitution series themselves have no particular order; each matching series will result in a candidate clone URL. Consequently, the initial match specification in a series should be as precise as possible to prevent inflation of candidate URLs.

SEEALSO

handbook:3-001 (http://handbook.datalad.org/symbols)

More information on Remote Indexed Archive (RIA) stores

Examples

Install a dataset from GitHub into the current directory:

% datalad clone https://github.com/datalad-datasets/longnow-podcasts.git

Install a dataset into a specific directory:

% datalad clone https://github.com/datalad-datasets/longnow-podcasts.git \
  myfavpodcasts

Install a dataset as a subdataset into the current dataset:

% datalad clone -d . https://github.com/datalad-datasets/longnow-podcasts.git

Install the main superdataset from datasets.datalad.org:

% datalad clone ///

Install a dataset identified by a literal alias from store.datalad.org:

% datalad clone ria+http://store.datalad.org#~hcp-openaccess

Install a dataset in a specific version as identified by a branch or tag name from store.datalad.org:

% datalad clone ria+http://store.datalad.org#76b6ca66-36b1-11ea-a2e6-f0d5bf7b5561@myidentifier

Install a dataset with group-write access permissions:

% datalad clone http://example.com/dataset --reckless shared-group
Options
SOURCE

URL, DataLad resource identifier, local path or instance of dataset to be cloned. Constraints: value must be a string

PATH

path to clone into. If no PATH is provided a destination path will be derived from a source URL similar to git clone.

GIT CLONE OPTIONS

Options to pass to git clone. Any argument specified after SOURCE and the optional PATH will be passed to git-clone. Note that not all options will lead to viable results. For example ‘--single-branch’ will not result in a functional annex repository because both a regular branch and the git-annex branch are required. Note that a version in a RIA URL takes precedence over ‘--branch’.

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

(parent) dataset to clone into. If given, the newly cloned dataset is registered as a subdataset of the parent. Also, if given, relative paths are interpreted as being relative to the parent dataset, and not relative to the working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-D DESCRIPTION, --description DESCRIPTION

short description to use for a dataset location. Its primary purpose is to help humans to identify a dataset copy (e.g., “mike’s dataset on lab server”). Note that when a dataset is published, this information becomes available on the remote side. Constraints: value must be a string or value must be NONE

--reckless [auto|ephemeral|shared-…]

Obtain a dataset or subdataset and set it up in a potentially unsafe way for performance, or access reasons. Use with care, any dataset is marked as ‘untrusted’. The reckless mode is stored in a dataset’s local configuration under ‘datalad.clone.reckless’, and will be inherited to any of its subdatasets. Supported modes are: [‘auto’]: hard-link files between local clones. In-place modification in any clone will alter original annex content. [‘ephemeral’]: symlink annex to origin’s annex and discard local availability info via git-annex-dead ‘here’ and declare this annex private. Shares an annex between origin and clone w/o git-annex being aware of it. In case of a change in origin you need to update the clone before you’re able to save new content on your end. Alternative to ‘auto’ when hardlinks are not an option, or the number of consumed inodes needs to be minimized. Note that this mode can only be used with clones from non-bare repositories or a RIA store! Otherwise two different annex object tree structures (dirhashmixed vs dirhashlower) will be used simultaneously, and annex keys using the respective other structure will be inaccessible. [‘shared-<mode>’]: set up repository and annex permissions to enable multi-user access. This disables the standard write protection of annex’ed files. <mode> can be any value supported by ‘git init --shared=’, such as ‘group’, or ‘all’. Constraints: value must be one of (True, False, ‘auto’, ‘ephemeral’) or value must start with ‘shared-’

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad push
Synopsis
datalad push [-h] [-d DATASET] [--to SIBLING] [--since SINCE] [--data
    {anything|nothing|auto|auto-if-wanted}] [-f
    {all|gitpush|checkdatapresent}] [-r] [-R LEVELS] [-J NJOBS]
    [--version] [PATH ...]
Description

Push a dataset to a known sibling.

This makes a saved state of a dataset available to a sibling or special remote data store of a dataset. Any target sibling must already exist and be known to the dataset.

By default, all files tracked in the last saved state (of the current branch) will be copied to the target location. Optionally, it is possible to limit a push to changes relative to a particular point in the version history of a dataset (e.g. a release tag) using the –since option in conjunction with the specification of a reference dataset. In recursive mode subdatasets will also be evaluated, and only those subdatasets are pushed where a change was recorded that is reflected in the current state of the top-level reference dataset.
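
For example, pushing everything changed since a release tag ‘v1.0’ to a sibling named ‘origin’, including subdatasets, could look like this (tag and sibling name are hypothetical):

% datalad push --to origin --since v1.0 -r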

NOTE

Power-user info: This command uses git push, and git annex copy to push a dataset. Publication targets are either configured remote Git repositories, or git-annex special remotes (if they support data upload).

Options
PATH

path to constrain a push to. If given, only data or changes for those paths are considered for a push. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to push. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--to SIBLING

name of the target sibling. If no name is given an attempt is made to identify the target based on the dataset’s configuration (i.e. a configured tracking branch, or a single sibling that is configured for push). Constraints: value must be a string or value must be NONE

--since SINCE

specifies commit-ish (tag, shasum, etc.) from which to look for changes to decide whether pushing is necessary. If ‘^’ is given, the last state of the current branch at the sibling is taken as a starting point. Constraints: value must be a string or value must be NONE

--data {anything|nothing|auto|auto-if-wanted}

what to do with (annex’ed) data. ‘anything’ would cause transfer of all annexed content, ‘nothing’ would avoid a call to git annex copy altogether. ‘auto’ would use ‘git annex copy’ with ‘--auto’ thus transferring only data which would satisfy “wanted” or “numcopies” settings for the remote (thus “nothing” otherwise). ‘auto-if-wanted’ would enable ‘--auto’ mode only if there is a “wanted” setting for the remote, and transfer ‘anything’ otherwise. Constraints: value must be one of (‘anything’, ‘nothing’, ‘auto’, ‘auto-if-wanted’) [Default: ‘auto-if-wanted’]

-f {all|gitpush|checkdatapresent}, --force {all|gitpush|checkdatapresent}

force particular operations, possibly overruling safety protections or optimizations: use --force with git-push (‘gitpush’); do not use --fast with git-annex copy (‘checkdatapresent’); combine all force modes (‘all’). Constraints: value must be one of (‘all’, ‘gitpush’, ‘checkdatapresent’)

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,)

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Extended set of functionality
Dataset operations
datalad add-readme
Synopsis
datalad add-readme [-h] [-d DATASET] [--existing {skip|append|replace}] [--version]
    [PATH]
Description

Add basic information about DataLad datasets to a README file

The README file is added to the dataset and the addition is saved in the dataset. Note: Make sure that no unsaved modifications to your dataset’s .gitattributes file exist.
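
For example, to add a README.md to the current dataset, replacing any pre-existing file of that name:

% datalad add-readme --existing replace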

Options
PATH

Path of the README file within the dataset. Constraints: value must be a string [Default: ‘README.md’]

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

Dataset to add information to. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--existing {skip|append|replace}

How to react if a file with the target name already exists: ‘skip’: do nothing; ‘append’: append information to the existing file; ‘replace’: replace the existing file with new content. Constraints: value must be one of (‘skip’, ‘append’, ‘replace’) [Default: ‘skip’]

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad addurls
Synopsis
datalad addurls [-h] [-d DATASET] [-t TYPE] [-x REGEXP] [-m FORMAT] [--key FORMAT]
    [--message MESSAGE] [-n] [--fast] [--ifexists {overwrite|skip}]
    [--missing-value VALUE] [--nosave] [--version-urls] [-c PROC]
    [-J NJOBS] [--drop-after] [--on-collision
    {error|error-if-different|take-first|take-last}] [--version]
    URL-FILE URL-FORMAT FILENAME-FORMAT
Description

Create and update a dataset from a list of URLs.

Format specification

Several arguments take format strings. These are similar to normal Python format strings where the names from URL-FILE (column names for a comma- or tab-separated file or properties for JSON) are available as placeholders. If URL-FILE is a CSV or TSV file, a positional index can also be used (i.e., “{0}” for the first column). Note that a placeholder cannot contain a ‘:’ or ‘!’.

In addition, the FILENAME-FORMAT argument has a few special placeholders.

  • _repindex

    The constructed file names must be unique across all rows. To avoid collisions, the special placeholder “_repindex” can be added to the formatter. Its value will start at 0 and increment every time a file name repeats.

  • _url_hostname, _urlN, _url_basename*

    Various parts of the formatted URL are available. Take “http://datalad.org/asciicast/seamless_nested_repos.sh” as an example.

    “datalad.org” is stored as “_url_hostname”. Components of the URL’s path can be referenced as “_urlN”. “_url0” and “_url1” would map to “asciicast” and “seamless_nested_repos.sh”, respectively. The final part of the path is also available as “_url_basename”.

    This name is broken down further. “_url_basename_root” and “_url_basename_ext” provide access to the root name and extension. These values are similar to the result of os.path.splitext, but, in the case of multiple periods, the extension is identified using the same length heuristic that git-annex uses. As a result, the extension of “file.tar.gz” would be “.tar.gz”, not “.gz”. In addition, the fields “_url_basename_root_py” and “_url_basename_ext_py” provide access to the result of os.path.splitext.

  • _url_filename*

    These are similar to _url_basename* fields, but they are obtained with a server request. This is useful if the file name is set in the Content-Disposition header.

Examples

Consider a file “avatars.csv” that contains:

who,ext,link
neurodebian,png,https://avatars3.githubusercontent.com/u/260793
datalad,png,https://avatars1.githubusercontent.com/u/8927200

To download each link into a file name composed of the ‘who’ and ‘ext’ fields, we could run:

$ datalad addurls -d avatar_ds avatars.csv '{link}' '{who}.{ext}'

The -d avatar_ds is used to create a new dataset in “$PWD/avatar_ds”.

If we were already in a dataset and wanted to create a new subdataset in an “avatars” subdirectory, we could use “//” in the FILENAME-FORMAT argument:

$ datalad addurls avatars.csv '{link}' 'avatars//{who}.{ext}'

If the information is represented as JSON lines instead of comma separated values or a JSON array, you can use a utility like jq to transform the JSON lines into an array that addurls accepts:

$ ... | jq --slurp . | datalad addurls - '{link}' '{who}.{ext}'
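
If the table also provided ‘md5sum’ and ‘size’ columns (which “avatars.csv” above does not; this is a hypothetical sketch, with ‘urls.csv’ standing in for such a table), annex keys could be registered without downloading any content by using --key:

$ datalad addurls --key 'et:MD5-s{size}--{md5sum}' urls.csv '{link}' '{who}.{ext}'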

NOTE

For users familiar with ‘git annex addurl’: A large part of this plugin’s functionality can be viewed as transforming data from URL-FILE into a “url filename” format that is fed to ‘git annex addurl --batch --with-files’.

Options
URL-FILE

A file that contains URLs or information that can be used to construct URLs. Depending on the value of --input-type, this should be a comma- or tab-separated file (with a header as the first row) or a JSON file (structured as a list of objects with string values). If ‘-’, read from standard input, taking the content as JSON when --input-type is at its default value of ‘ext’.

URL-FORMAT

A format string that specifies the URL for each entry. See the ‘Format Specification’ section above.

FILENAME-FORMAT

Like URL-FORMAT, but this format string specifies the file to which the URL’s content will be downloaded. The name should be a relative path and will be taken as relative to the top-level dataset, regardless of whether it is specified via –dataset or inferred. The file name may contain directories. The separator “//” can be used to indicate that the left-side directory should be created as a new subdataset. See the ‘Format Specification’ section above.

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

Add the URLs to this dataset (or possibly subdatasets of this dataset). An empty or non-existent directory is passed to create a new dataset. New subdatasets can be specified with FILENAME-FORMAT. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-t TYPE, --input-type TYPE

Whether URL-FILE should be considered a CSV file, TSV file, or JSON file. The default value, “ext”, means to consider URL-FILE as a JSON file if it ends with “.json” or a TSV file if it ends with “.tsv”. Otherwise, treat it as a CSV file. Constraints: value must be one of (‘ext’, ‘csv’, ‘tsv’, ‘json’) [Default: ‘ext’]

-x REGEXP, --exclude-autometa REGEXP

By default, metadata field=value pairs are constructed with each column in URL-FILE, excluding any single column that is specified via URL-FORMAT. This argument can be used to exclude columns that match a regular expression. If set to ‘*’ or an empty string, automatic metadata extraction is disabled completely. This argument does not affect metadata set explicitly with --meta.

-m FORMAT, --meta FORMAT

A format string that specifies metadata. It should be structured as “<field>=<value>”. As an example, “location={3}” would mean that the value for the “location” metadata field should be set to the value of the fourth column. This option can be given multiple times.

--key FORMAT

A format string that specifies an annex key for the file content. In this case, the file is not downloaded; instead the key is used to create the file without content. The value should be structured as “[et:]<input backend>[-s<bytes>]--<hash>”. The optional “et:” prefix, which requires git-annex 8.20201116 or later, signals to toggle extension state of the input backend (i.e., MD5 vs MD5E). As an example, “et:MD5-s{size}--{md5sum}” would use the ‘md5sum’ and ‘size’ columns to construct the key, migrating the key from MD5 to MD5E, with an extension based on the file name. Note: If the input backend itself is an annex extension backend (i.e., a backend with a trailing “E”), the key’s extension will not be updated to match the extension of the corresponding file name. Thus, unless the input keys and file names are generated from git-annex, it is recommended to avoid using extension backends as input. If an extension is desired, use the plain variant as input and prepend “et:” so that git-annex will migrate from the plain backend to the extension variant.

--message MESSAGE

Use this message when committing the URL additions. Constraints: value must be NONE or value must be a string

-n, --dry-run

Report which URLs would be downloaded to which files and then exit.

--fast

If True, add the URLs, but don’t download their content. WARNING: ONLY USE THIS OPTION IF YOU UNDERSTAND THE CONSEQUENCES. If the content of the URLs is not downloaded, then datalad will refuse to retrieve the contents with datalad get <file> by default because the content of the URLs is not verified. Add annex.security.allow-unverified-downloads = ACKTHPPT to your git config to bypass the safety check. Underneath, this passes the --fast flag to git annex addurl.

--ifexists {overwrite|skip}

What to do if a constructed file name already exists. The default behavior is to proceed with the git annex addurl, which will fail if the file size has changed. If set to ‘overwrite’, remove the old file before adding the new one. If set to ‘skip’, do not add the new file. Constraints: value must be one of (‘overwrite’, ‘skip’)

--missing-value VALUE

When an empty string is encountered, use this value instead. Constraints: value must be NONE or value must be a string

--nosave

by default all modifications to a dataset are immediately saved. Giving this option will disable this behavior.

--version-urls

Try to add a version ID to the URL. This currently only has an effect on HTTP URLs for AWS S3 buckets. s3:// URL versioning is not yet supported, but any URL that already contains a “versionId=” parameter will be used as is.

-c PROC, --cfg-proc PROC

Pass this --cfg_proc value when calling CREATE to make datasets.

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,)

--drop-after

drop files after adding to annex.

--on-collision {error|error-if-different|take-first|take-last}

What to do when more than one row produces the same file name. By default an error is triggered. “error-if-different” suppresses that error if rows for a given file name collision have the same URL and metadata. “take-first” or “take-last” indicate to instead take the first row or last row from each set of colliding rows. Constraints: value must be one of (‘error’, ‘error-if-different’, ‘take-first’, ‘take-last’) [Default: ‘error’]

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad copy-file
Synopsis
datalad copy-file [-h] [-d DATASET] [--recursive] [--target-dir DIRECTORY] [--specs-from
    SOURCE] [-m MESSAGE] [--version] [PATH ...]
Description

Copy files and their availability metadata from one dataset to another.

The difference to a system copy command is that here additional content availability information, such as registered URLs, is also copied to the target dataset. Moreover, potentially required git-annex special remote configurations are detected in a source dataset and are applied to a target dataset in an analogous fashion. It is possible to copy a file for which no content is available locally, by just copying the required metadata on content identity and availability.

NOTE

At the moment, only URLs for the special remotes ‘web’ (git-annex built-in) and ‘datalad’ are recognized and transferred.

The interface is modeled after the POSIX ‘cp’ command, but with one additional way to specify what to copy where: --specs-from allows the caller to flexibly input source-destination path pairs.

This command can copy files out of and into a hierarchy of nested datasets. Unlike with other DataLad commands, the --recursive switch does not enable recursion into subdatasets, but is analogous to the POSIX ‘cp’ command switch and enables subdirectory recursion, regardless of dataset boundaries. It is not necessary to enable recursion in order to save changes made to nested target subdatasets.

Examples

Copy a file into a dataset ‘myds’ using a path and a target directory specification, and save its addition to ‘myds’:

% datalad copy-file path/to/myfile -d path/to/myds

Copy a file to a dataset ‘myds’ and save it under a new name by providing two paths:

% datalad copy-file path/to/myfile path/to/myds/new -d path/to/myds

Copy a file into a dataset without saving it:

% datalad copy-file path/to/myfile -t path/to/myds

Copy a directory and its subdirectories into a dataset ‘myds’ and save the addition in ‘myds’:

% datalad copy-file path/to/dir -r -d path/to/myds

Copy files using a path and optionally target specification from a file:

% datalad copy-file -d path/to/myds --specs-from specfile

Read a specification from stdin and pipe the output of a find command into the copy-file command:

% find <expr> | datalad copy-file -d path/to/myds --specs-from -
Options
PATH

paths to copy (and possibly a target path to copy to). Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

root dataset to save after copy operations are completed. All destination paths must be within this dataset, or its subdatasets. If no dataset is given, dataset modifications will be left unsaved. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--recursive, -r

copy directories recursively.

--target-dir DIRECTORY, -t DIRECTORY

copy all source files into this DIRECTORY. This value is overridden by any explicit destination path provided via --specs-from. When not given, this defaults to the path of the dataset specified via --dataset. Constraints: value must be a string or value must be NONE

--specs-from SOURCE

read list of source (and destination) path names from a given file, or stdin (with ‘-’). Each line defines either a source path, or a source/destination path pair (separated by a null byte character).

-m MESSAGE, --message MESSAGE

a description of the state or the changes made to a dataset. Constraints: value must be a string or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad drop
Synopsis
datalad drop [-h] [--what {filecontent|allkeys|datasets|all}] [--reckless
    {modification|availability|undead|kill}] [-d DATASET] [-r] [-R
    LEVELS] [-J NJOBS] [--nocheck] [--if-dirty IF_DIRTY] [--version]
    [PATH ...]
Description

Drop content of individual files or entire (sub)datasets

This command is the antagonist of ‘get’. It can undo the retrieval of file content, and the installation of subdatasets.

Dropping is a safe-by-default operation. Before dropping any information, the command confirms the continued availability of file-content (see e.g., configuration ‘annex.numcopies’), and the state of all dataset branches from at least one known dataset sibling. Moreover, prior to removal of an entire dataset annex, it is confirmed that the annex is no longer marked as existing in the network of dataset siblings.

Importantly, all checks regarding version history availability and local annex availability are performed using the current state of remote siblings as known to the local dataset. This is done for performance reasons and for resilience in case of absent network connectivity. To ensure decision making based on up-to-date information, it is advised to execute a dataset update before dropping dataset components.

Examples

Drop single file content:

% datalad drop <path/to/file>

Drop all file content in the current dataset:

% datalad drop

Drop all file content in a dataset and all its subdatasets:

% datalad drop -d <path/to/dataset> -r

Disable check to ensure the configured minimum number of remote sources for dropped data:

% datalad drop <path/to/content> --reckless availability

Drop (uninstall) an entire dataset (will fail with subdatasets present):

% datalad drop --what all

Kill a dataset recklessly with any existing subdatasets too (this will be fast, but will disable any and all safety checks):

% datalad drop --what all --reckless kill --recursive
Options
PATH

path of a dataset or dataset component to be dropped. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--what {filecontent|allkeys|datasets|all}

select what type of items shall be dropped. With ‘filecontent’, only the file content (git-annex keys) of files in a dataset’s worktree will be dropped. With ‘allkeys’, content of any version of any file in any branch (including, but not limited to the worktree) will be dropped. This effectively empties the annex of a local dataset. With ‘datasets’, only complete datasets will be dropped (implies ‘allkeys’ mode for each such dataset), but no filecontent will be dropped for any files in datasets that are not dropped entirely. With ‘all’, content for any matching file or dataset will be dropped entirely. Constraints: value must be one of (‘filecontent’, ‘allkeys’, ‘datasets’, ‘all’) [Default: ‘filecontent’]

--reckless {modification|availability|undead|kill}

disable individual or all data safety measures that would normally prevent potentially irreversible data-loss. With ‘modification’, unsaved modifications in a dataset will not be detected. This improves performance at the cost of permitting potential loss of unsaved or untracked dataset components. With ‘availability’, detection of dataset/branch-states that are only available in the local dataset, and detection of an insufficient number of file-content copies will be disabled. Especially the latter is a potentially expensive check which might involve numerous network transactions. With ‘undead’, detection of whether a to-be-removed local annex is still known to exist in the network of dataset-clones is disabled. This could cause zombie-records of invalid file availability. With ‘kill’, all safety-checks are disabled. Constraints: value must be one of (‘modification’, ‘availability’, ‘undead’, ‘kill’)

-d DATASET, --dataset DATASET

specify the dataset to perform drop from. If no dataset is given, the current working directory is used as operation context. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,)

--nocheck

DEPRECATED: use ‘--reckless availability’.

--if-dirty IF_DIRTY

DEPRECATED and IGNORED: use --reckless instead.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad get
Synopsis
datalad get [-h] [-s LABEL] [-d PATH] [-r] [-R LEVELS] [-n] [-D DESCRIPTION]
    [--reckless [auto|ephemeral|shared-...]] [-J NJOBS] [--version]
    [PATH ...]
Description

Get any dataset content (files/directories/subdatasets).

This command only operates on dataset content. To obtain a new independent dataset from some source use the CLONE command.

By default this command operates recursively within a dataset, but not across potential subdatasets, i.e. if a directory is provided, all files in the directory are obtained. Recursion into subdatasets is supported too. If enabled, relevant subdatasets are detected and installed in order to fulfill a request.

Known data locations for each requested file are evaluated and data are obtained from some available location (according to git-annex configuration and possibly assigned remote priorities), unless a specific source is specified.

Getting subdatasets

Just as DataLad supports getting file content from more than one location, the same is supported for subdatasets, including a ranking of individual sources for prioritization.

The following location candidates are considered. For each candidate a cost is given in parentheses, higher values indicate higher cost, and thus lower priority:

  • A datalad URL recorded in .gitmodules (cost 590). This allows for datalad URLs that require additional handling/resolution by datalad, like ria-schemes (ria+http, ria+ssh, etc.)

  • A URL or absolute path recorded for git in .gitmodules (cost 600).

  • URL of any configured superdataset remote that is known to have the desired submodule commit, with the submodule path appended to it. There can be more than one candidate (cost 650).

  • In case .gitmodules contains a relative path instead of a URL, the URL of any configured superdataset remote that is known to have the desired submodule commit, with this relative path appended to it. There can be more than one candidate (cost 650).

  • In case .gitmodules contains a relative path as a URL, the absolute path of the superdataset, appended with this relative path (cost 900).

Additional candidate URLs can be generated based on templates specified as configuration variables with the pattern

datalad.get.subdataset-source-candidate-<name>

where NAME is an arbitrary identifier. If name starts with three digits (e.g. ‘400myserver’) these will be interpreted as a cost, and the respective candidate will be sorted into the generated candidate list according to this cost. If no cost is given, a default of 700 is used.

A template string assigned to such a variable can utilize the Python format mini language and may reference a number of properties that are inferred from the parent dataset’s knowledge about the target subdataset. Properties include any submodule property specified in the respective .gitmodules record. For convenience, an existing datalad-id record is made available under the shortened name ID.

Additionally, the URL of any configured remote that contains the respective submodule commit is available as remoteurl-<name> property, where NAME is the configured remote name.

Hence, such a template could be http://example.org/datasets/{id} or http://example.org/datasets/{path}, where {id} and {path} would be replaced by the datalad-id or PATH entry in the .gitmodules record.

If this config is committed in .datalad/config, a clone of a dataset can look up any subdataset’s URL according to such scheme(s) irrespective of what URL is recorded in .gitmodules.
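
As a sketch, a candidate template with a cost of 400 (the template URL is the hypothetical one used above) could be recorded and committed like this:

% git config -f .datalad/config datalad.get.subdataset-source-candidate-400myserver \
  'http://example.org/datasets/{id}'
% datalad save -m "Add subdataset source candidate template" .datalad/config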

Lastly, all candidates are sorted according to their cost (lower values first), and duplicate URLs are stripped, while preserving the first item in the candidate list.

NOTE

Power-user info: This command uses git annex get to fulfill file handles.

Examples

Get a single file:

% datalad get <path/to/file>

Get contents of a directory:

% datalad get <path/to/dir/>

Get all contents of the current dataset and its subdatasets:

% datalad get . -r

Get (clone) a registered subdataset, but don’t retrieve data:

% datalad get -n <path/to/subds>
Options
PATH

path/name of the requested dataset component. The component must already be known to a dataset. To add new components to a dataset use the ADD command. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-s LABEL, --source LABEL

label of the data source to be used to fulfill requests. This can be the name of a dataset sibling or another known source. Constraints: value must be a string or value must be NONE

-d PATH, --dataset PATH

specify the dataset to perform the get operation on, in which case PATH arguments are interpreted as being relative to this dataset. If no dataset is given, an attempt is made to identify a dataset for each input path. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Alternatively, ‘existing’ will limit recursion to subdatasets that already existed on the filesystem at the start of processing, and prevent new subdatasets from being obtained recursively. Constraints: value must be convertible to type ‘int’ or value must be one of (‘existing’,) or value must be NONE

-n, --no-data

whether to obtain data for all file handles. If disabled, GET operations are limited to dataset handles. This option prevents data for file handles from being obtained.

-D DESCRIPTION, --description DESCRIPTION

short description to use for a dataset location. Its primary purpose is to help humans to identify a dataset copy (e.g., “mike’s dataset on lab server”). Note that when a dataset is published, this information becomes available on the remote side. Constraints: value must be a string or value must be NONE

--reckless [auto|ephemeral|shared-…]

Obtain a dataset or subdataset and set it up in a potentially unsafe way for performance, or access reasons. Use with care, any dataset is marked as ‘untrusted’. The reckless mode is stored in a dataset’s local configuration under ‘datalad.clone.reckless’, and will be inherited to any of its subdatasets. Supported modes are: [‘auto’]: hard-link files between local clones. In-place modification in any clone will alter original annex content. [‘ephemeral’]: symlink annex to origin’s annex and discard local availability info via git-annex-dead ‘here’ and declare this annex private. Shares an annex between origin and clone w/o git-annex being aware of it. In case of a change in origin you need to update the clone before you’re able to save new content on your end. Alternative to ‘auto’ when hardlinks are not an option, or the number of consumed inodes needs to be minimized. Note that this mode can only be used with clones from non-bare repositories or a RIA store! Otherwise two different annex object tree structures (dirhashmixed vs dirhashlower) will be used simultaneously, and annex keys using the respective other structure will be inaccessible. [‘shared-<mode>’]: set up repository and annex permissions to enable multi-user access. This disables the standard write protection of annex’ed files. <mode> can be any value supported by ‘git init --shared=’, such as ‘group’, or ‘all’. Constraints: value must be one of (True, False, ‘auto’, ‘ephemeral’) or value must start with ‘shared-’

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,) [Default: ‘auto’]

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad install
Synopsis
datalad install [-h] [-s URL-OR-PATH] [-d DATASET] [-g] [-D DESCRIPTION] [-r] [-R
    LEVELS] [--reckless [auto|ephemeral|shared-...]] [-J NJOBS]
    [--branch BRANCH] [--version] [URL-OR-PATH ...]
Description

Install one or many datasets from remote URL(s) or local PATH source(s).

This command creates local sibling(s) of existing dataset(s) from (remote) locations specified as URL(s) or path(s). Optional recursion into potential subdatasets, and download of all referenced data is supported. The new dataset(s) can be optionally registered in an existing superdataset by identifying it via the DATASET argument (the new dataset’s path needs to be located within the superdataset for that).

If no explicit -s|–source option is specified, then all positional URL-OR-PATH arguments are considered to be “sources” if they are URLs or target locations if they are paths. If a target location path corresponds to a submodule, the source location for it is figured out from its record in the .gitmodules. If -s|–source is specified, then a single optional positional PATH would be taken as the destination path for that dataset.

It is possible to provide a brief description to label the dataset’s nature and location, e.g. “Michael’s music on black laptop”. This helps humans to identify data locations in distributed scenarios. By default an identifier comprised of user and machine name, plus path will be generated.

When only partial dataset content shall be obtained, it is recommended to use this command without the get-data flag, followed by a get operation to obtain the desired data.

NOTE

Power-user info: This command uses git clone, and git annex init to prepare the dataset. Registering to a superdataset is performed via a git submodule add operation in the discovered superdataset.

Examples

Install a dataset from GitHub into the current directory:

% datalad install https://github.com/datalad-datasets/longnow-podcasts.git

Install a dataset as a subdataset into the current dataset:

% datalad install -d . \
  --source='https://github.com/datalad-datasets/longnow-podcasts.git'

Install a dataset into ‘podcasts’ (not ‘longnow-podcasts’) directory, and get all content right away:

% datalad install --get-data \
  -s https://github.com/datalad-datasets/longnow-podcasts.git podcasts

Install a dataset with all its subdatasets:

% datalad install -r \
  https://github.com/datalad-datasets/longnow-podcasts.git
Options
URL-OR-PATH

path/name of the installation target. If no PATH is provided a destination path will be derived from a source URL similar to git clone.

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-s URL-OR-PATH, --source URL-OR-PATH

URL or local path of the installation source. Constraints: value must be a string or value must be NONE

-d DATASET, --dataset DATASET

specify the dataset to perform the install operation on. If no dataset is given, an attempt is made to identify the dataset in a parent directory of the current working directory and/or the PATH given. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-g, --get-data

if given, obtain all data content too.

-D DESCRIPTION, --description DESCRIPTION

short description to use for a dataset location. Its primary purpose is to help humans to identify a dataset copy (e.g., “mike’s dataset on lab server”). Note that when a dataset is published, this information becomes available on the remote side. Constraints: value must be a string or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--reckless [auto|ephemeral|shared-…]

Obtain a dataset or subdataset and set it up in a potentially unsafe way for performance, or access reasons. Use with care, any dataset is marked as ‘untrusted’. The reckless mode is stored in a dataset’s local configuration under ‘datalad.clone.reckless’, and will be inherited to any of its subdatasets. Supported modes are: [‘auto’]: hard-link files between local clones. In-place modification in any clone will alter original annex content. [‘ephemeral’]: symlink annex to origin’s annex and discard local availability info via git-annex-dead ‘here’ and declare this annex private. Shares an annex between origin and clone w/o git-annex being aware of it. In case of a change in origin you need to update the clone before you’re able to save new content on your end. Alternative to ‘auto’ when hardlinks are not an option, or the number of consumed inodes needs to be minimized. Note that this mode can only be used with clones from non-bare repositories or a RIA store! Otherwise two different annex object tree structures (dirhashmixed vs dirhashlower) will be used simultaneously, and annex keys using the respective other structure will be inaccessible. [‘shared-<mode>’]: set up repository and annex permissions to enable multi-user access. This disables the standard write protection of annex’ed files. <mode> can be any value supported by ‘git init --shared=’, such as ‘group’, or ‘all’. Constraints: value must be one of (True, False, ‘auto’, ‘ephemeral’) or value must start with ‘shared-’

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,) [Default: ‘auto’]

--branch BRANCH

Clone source at this branch or tag. This option applies only to the top-level dataset not any subdatasets that may be cloned when installing recursively. Note that if the source is a RIA URL with a version, it takes precedence over this option. Constraints: value must be a string or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad no-annex
Synopsis
datalad no-annex [-h] [-d DATASET] [--pattern PATTERN [PATTERN ...]] [--ref-dir
    REF_DIR] [--makedirs] [--version]
Description

Configure a dataset to never put some content into the dataset’s annex

This can be useful in mixed datasets that also contain textual data, such as source code, which can be efficiently and more conveniently managed directly in Git.

Patterns generally look like this:

code/*

which would match all files in the code directory. In order to match all files under code/, including all its subdirectories, use a pattern such as:

code/**

Note that this command works incrementally, hence any existing configuration (e.g. from a previous plugin run) is amended, not replaced.
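
For example, to keep everything under code/ (including subdirectories) directly in Git rather than the annex:

% datalad no-annex --pattern 'code/**'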

Options
-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to configure. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--pattern PATTERN [PATTERN …]

list of path patterns. Any content whose path matches any pattern will not be annexed when added to a dataset, but instead will be tracked directly in Git. Path patterns have to be relative to the directory given by the REF_DIR option. By default, patterns should be relative to the root of the dataset.

--ref-dir REF_DIR

Relative path (within the dataset) to the directory that is to be configured. All patterns are interpreted relative to this path, and configuration is written to a .gitattributes file in this directory. [Default: ‘.’]

--makedirs

If set, any missing directories will be created in order to be able to place a file into --ref-dir.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad remove
Synopsis
datalad remove [-h] [-d DATASET] [--drop {datasets|all}] [--reckless
    {modification|availability|undead|kill}] [-m MESSAGE] [-J NJOBS]
    [--recursive] [--nocheck] [--nosave] [--if-dirty IF_DIRTY]
    [--version] [PATH ...]
Description

Remove components from datasets

Removing “unlinks” a dataset component, such as a file or subdataset, from a dataset. Such a removal advances the state of a dataset, just like adding new content. A remove operation can be undone, by restoring a previous dataset state, but might require re-obtaining file content and subdatasets from remote locations.

This command relies on the ‘drop’ command for safe operation. By default, only file content from datasets which will be uninstalled as part of a removal will be dropped. Otherwise file content is retained, such that restoring a previous version also immediately restores file content access, just as it is the case for files directly committed to Git. This default behavior can be changed to always drop content prior to removal, for cases where a minimal storage footprint for local dataset installations is desirable.

Removing a dataset component is always a recursive operation. Removing a directory, removes all content underneath the directory too. If subdatasets are located under a to-be-removed path, they will be uninstalled entirely, and all their content dropped. If any subdataset can not be uninstalled safely, the remove operation will fail and halt.

Changed in version 0.16

More in-depth and comprehensive safety-checks are now performed by default. The --if-dirty argument is ignored, will be removed in a future release, and can be removed for a safe-by-default behavior. For other cases consider the --reckless argument. The --save argument is ignored and will be removed in a future release, a dataset modification is now always saved. Consider save’s --amend argument for post-remove fix-ups. The --recursive argument is ignored, and will be removed in a future release. Removal operations are always recursive, and the parameter can be stripped from calls for a safe-by-default behavior.

Deprecated in version 0.16

The --check argument will be removed in a future release. It needs to be replaced with --reckless.

Examples

Permanently remove a subdataset (and all further subdatasets contained in it) from a dataset:

% datalad remove -d <path/to/dataset> <path/to/subds>

Permanently remove a superdataset (with all subdatasets) from the filesystem:

% datalad remove -d <path/to/dataset>

DANGER-ZONE: Fast wipe-out of a dataset and all its subdatasets, bypassing all safety checks:

% datalad remove -d <path/to/dataset> --reckless kill
Options
PATH

path of a dataset or dataset component to be removed. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to perform remove from. If no dataset is given, the current working directory is used as operation context. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--drop {datasets|all}

which dataset components to drop prior removal. This parameter is passed on to the underlying drop operation as its ‘what’ argument. Constraints: value must be one of (‘datasets’, ‘all’) [Default: ‘datasets’]

--reckless {modification|availability|undead|kill}

disable individual or all data safety measures that would normally prevent potentially irreversible data-loss. With ‘modification’, unsaved modifications in a dataset will not be detected. This improves performance at the cost of permitting potential loss of unsaved or untracked dataset components. With ‘availability’, detection of dataset/branch-states that are only available in the local dataset, and detection of an insufficient number of file-content copies will be disabled. Especially the latter is a potentially expensive check which might involve numerous network transactions. With ‘undead’, detection of whether a to-be-removed local annex is still known to exist in the network of dataset-clones is disabled. This could cause zombie-records of invalid file availability. With ‘kill’, all safety-checks are disabled. Constraints: value must be one of (‘modification’, ‘availability’, ‘undead’, ‘kill’)

-m MESSAGE, --message MESSAGE

a description of the state or the changes made to a dataset. Constraints: value must be a string or value must be NONE

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,)

--recursive, -r

DEPRECATED and IGNORED: removal is always a recursive operation.

--nocheck

DEPRECATED: use ‘--reckless availability’.

--nosave

DEPRECATED and IGNORED; use save --amend instead.

--if-dirty IF_DIRTY

DEPRECATED and IGNORED: use --reckless instead.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad subdatasets
Synopsis
datalad subdatasets [-h] [-d DATASET] [--state {present|absent|any}] [--fulfilled
    FULFILLED] [-r] [-R LEVELS] [--contains PATH] [--bottomup]
    [--set-property NAME VALUE] [--delete-property NAME] [--version]
    [PATH ...]
Description

Report subdatasets and their properties.

The following properties are reported (if possible) for each matching subdataset record.

“name”

Name of the subdataset in the parent (often identical with the relative path in the parent dataset)

“path”

Absolute path to the subdataset

“parentds”

Absolute path to the parent dataset

“gitshasum”

SHA1 of the subdataset commit recorded in the parent dataset

“state”

Condition of the subdataset: ‘absent’, ‘present’

“gitmodule_url”

URL of the subdataset recorded in the parent

“gitmodule_name”

Name of the subdataset recorded in the parent

“gitmodule_<label>”

Any additional configuration property on record.

Performance note: Property modification, requesting BOTTOMUP reporting order, or a particular numerical recursion_limit implies an internal switch to an alternative query implementation for recursive query that is more flexible, but also notably slower (performs one call to Git per dataset versus a single call for all combined).

The following properties for subdatasets are recognized by DataLad (without the ‘gitmodule_’ prefix that is used in the query results):

“datalad-recursiveinstall”

If set to ‘skip’, the respective subdataset is skipped when DataLad is recursively installing its superdataset. However, the subdataset remains installable when explicitly requested, and no other features are impaired.

“datalad-url”

If a subdataset was originally established by cloning, ‘datalad-url’ records the URL that was used to do so. This might be different from ‘url’ if the URL contains datalad specific pieces like any URL of the form “ria+<some protocol>…”.
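
For example, a recursive report of all locally present subdatasets, and marking a subdataset (path placeholder is illustrative) to be skipped during recursive installation, could look like this:

% datalad subdatasets -r --state present
% datalad subdatasets <path/to/subds> --set-property datalad-recursiveinstall skip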

Options
PATH

path/name to query for subdatasets. Defaults to the current directory. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to query. If no dataset is given, an attempt is made to identify the dataset based on the input and/or the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--state {present|absent|any}

indicate which (sub)datasets to consider: either only locally present, absent, or any of those two kinds. Constraints: value must be one of (‘present’, ‘absent’, ‘any’) [Default: ‘any’]

--fulfilled FULFILLED

DEPRECATED: use --state instead. If given, must be a boolean flag indicating whether to consider either only locally present or absent datasets. By default all subdatasets are considered regardless of their status. Constraints: value must be convertible to type bool or value must be NONE [Default: None(DEPRECATED)]

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--contains PATH

limit to the subdatasets containing the given path. If a root path of a subdataset is given, the last considered dataset will be the subdataset itself. This option can be given multiple times, in which case datasets that contain any of the given paths will be considered. Constraints: value must be a string or value must be NONE

--bottomup

whether to report subdatasets in bottom-up order along each branch in the dataset tree, and not top-down.

--set-property NAME VALUE

Name and value of one or more subdataset properties to be set in the parent dataset’s .gitmodules file. The property name is case-insensitive, must start with a letter, and consist only of alphanumeric characters. The value can be a Python format() template string wrapped in ‘<>’ (e.g. ‘<{gitmodule_name}>’). Supported keywords are any item reported in the result properties of this command, plus ‘refds_relpath’ and ‘refds_relname’: the relative path of a subdataset with respect to the base dataset of the command call, and, in the latter case, the same string with all directory separators replaced by dashes. This option can be given multiple times. Constraints: value must be a string or value must be NONE

--delete-property NAME

Name of one or more subdataset properties to be removed from the parent dataset’s .gitmodules file. This option can be given multiple times. Constraints: value must be a string or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad unlock
Synopsis
datalad unlock [-h] [-d DATASET] [-r] [-R LEVELS] [--version] [path ...]
Description

Unlock file(s) of a dataset

Unlock files of a dataset in order to be able to edit the actual content

Examples

Unlock a single file:

% datalad unlock <path/to/file>

Unlock all contents in the dataset:

% datalad unlock .
Options
path

file(s) to unlock. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to unlock files in. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Dataset siblings and 3rd-party platform support
datalad siblings
Synopsis
datalad siblings [-h] [-d DATASET] [-s NAME] [--url [URL]] [--pushurl PUSHURL] [-D
    DESCRIPTION] [--fetch] [--as-common-datasrc NAME]
    [--publish-depends SIBLINGNAME] [--publish-by-default REFSPEC]
    [--annex-wanted EXPR] [--annex-required EXPR] [--annex-group
    EXPR] [--annex-groupwanted EXPR] [--inherit] [--no-annex-info]
    [-r] [-R LEVELS] [--version]
    [{query|add|remove|configure|enable}]
Description

Manage sibling configuration

This command offers five different actions: ‘query’, ‘add’, ‘remove’, ‘configure’, ‘enable’. ‘query’ is the default action and can be used to obtain information about (all) known siblings. ‘add’ and ‘configure’ are highly similar actions, the only difference being that adding a sibling with a name that is already registered will fail, whereas re-configuring a (different) sibling under a known name will not be considered an error. ‘enable’ can be used to complete access configuration for non-Git siblings (aka git-annex special remotes). Lastly, the ‘remove’ action allows for the removal (or de-configuration) of a registered sibling.

For each sibling (added, configured, or queried) all known sibling properties are reported. This includes:

“name”

Name of the sibling

“path”

Absolute path of the dataset

“url”

For regular siblings at minimum a “fetch” URL, possibly also a “pushurl”

Additionally, any further configuration will also be reported using a key that matches that in the Git configuration.

By default, sibling information is rendered as one line per sibling following this scheme:

<dataset_path>: <sibling_name>(<+|->) [<access_specification>]

where the + and - labels indicate the presence or absence of a remote data annex at a particular remote, and <access_specification> contains a URL and/or a type label for the sibling.
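
For illustration, a dataset with a GitHub sibling that carries no annex might be rendered roughly like this (all values are hypothetical):

/home/me/myds: github(-) [https://github.com/me/myds.git (git)]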

Options
{query|add|remove|configure|enable}

command action selection (see general documentation). Constraints: value must be one of (‘query’, ‘add’, ‘remove’, ‘configure’, ‘enable’) [Default: ‘query’]

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to configure. If no dataset is given, an attempt is made to identify the dataset based on the input and/or the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-s NAME, --name NAME

name of the sibling. For addition with path “URLs” and sibling removal, this option is mandatory; otherwise the hostname part of a given URL is used as a default. This option can be used to limit ‘query’ to a specific sibling. Constraints: value must be a string or value must be NONE

--url [URL]

the URL of or path to the dataset sibling named by NAME. For recursive operation it is required that a template string for building subdataset sibling URLs is given. List of currently available placeholders: %NAME the name of the dataset, where slashes are replaced by dashes. Constraints: value must be a string or value must be NONE

--pushurl PUSHURL

in case the URL cannot be used to publish to the dataset sibling, this option specifies a URL to be used instead. If no URL is given, the PUSHURL serves as the fetch URL as well. Constraints: value must be a string or value must be NONE

-D DESCRIPTION, --description DESCRIPTION

short description to use for a dataset location. Its primary purpose is to help humans to identify a dataset copy (e.g., “mike’s dataset on lab server”). Note that when a dataset is published, this information becomes available on the remote side. Constraints: value must be a string or value must be NONE

--fetch

fetch the sibling after configuration.

--as-common-datasrc NAME

configure a sibling as a common data source of the dataset that can be automatically used by all consumers of the dataset. The sibling must be a regular Git remote with a configured HTTP(S) URL.

--publish-depends SIBLINGNAME

add a dependency such that the given existing sibling is always published prior to the new sibling. This equals setting a configuration item ‘remote.SIBLINGNAME.datalad-publish-depends’. This option can be given more than once to configure multiple dependencies. Constraints: value must be a string or value must be NONE

--publish-by-default REFSPEC

add a refspec to be published to this sibling by default if nothing specified. Constraints: value must be a string or value must be NONE

--annex-wanted EXPR

expression to specify ‘wanted’ content for the repository/sibling. See https://git-annex.branchable.com/git-annex-wanted/ for more information. Constraints: value must be a string or value must be NONE

--annex-required EXPR

expression to specify ‘required’ content for the repository/sibling. See https://git-annex.branchable.com/git-annex-required/ for more information. Constraints: value must be a string or value must be NONE

--annex-group EXPR

expression to specify a group for the repository. See https://git-annex.branchable.com/git-annex-group/ for more information. Constraints: value must be a string or value must be NONE

--annex-groupwanted EXPR

expression for the groupwanted. Makes sense only if –annex-wanted=”groupwanted” and annex-group is given too. See https://git-annex.branchable.com/git-annex-groupwanted/ for more information. Constraints: value must be a string or value must be NONE

--inherit

if sibling is missing, inherit settings (git config, git annex wanted/group/groupwanted) from its super-dataset.

--no-annex-info

if set, do not query detailed information about the annex configurations of siblings. Can be used if speed is a concern.

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad create-sibling
Synopsis
datalad create-sibling [-h] [-s [NAME]] [--target-dir PATH] [--target-url URL]
    [--target-pushurl URL] [--dataset DATASET] [-r] [-R LEVELS]
    [--existing MODE] [--shared
    {false|true|umask|group|all|world|everybody|0xxx}] [--group
    GROUP] [--ui {false|true|html_filename}] [--as-common-datasrc
    NAME] [--publish-by-default REFSPEC] [--publish-depends
    SIBLINGNAME] [--annex-wanted EXPR] [--annex-group EXPR]
    [--annex-groupwanted EXPR] [--inherit] [--since SINCE]
    [--version] [SSHURL]
Description

Create a dataset sibling on a UNIX-like Shell (local or SSH)-accessible machine

Given a local dataset, and a path or SSH login information, this command creates a remote dataset repository and configures it as a dataset sibling to be used as a publication target (see the PUBLISH command).

Various properties of the remote sibling can be configured (e.g. name, location on the server, read and write access URLs, and access permissions).

Optionally, a basic web-viewer for DataLad datasets can be installed at the remote location.

This command supports recursive processing of dataset hierarchies, creating a remote sibling for each dataset in the hierarchy. By default, remote siblings are created in hierarchical structure that reflects the organization on the local file system. However, a simple templating mechanism is provided to produce a flat list of datasets (see –target-dir).
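
Examples

As a hedged sketch (hostname, target path, and sibling name are placeholders), a sibling could be created on a web server via SSH and then published to:

% datalad create-sibling -s server --ui true me@server.example.org:/var/www/myds

# publish the dataset to the freshly created sibling
% datalad push --to server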

Options
SSHURL

Login information for the target server. This can be given as a URL (ssh://host/path), SSH-style (user@host:path) or just a local path. Unless overridden, this also serves as the future dataset’s access URL and path on the server. Constraints: value must be a string

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-s [NAME], --name [NAME]

sibling name to create for this publication target. If RECURSIVE is set, the same name will be used to label all the subdatasets’ siblings. When creating a target dataset fails, no sibling is added. Constraints: value must be a string or value must be NONE

--target-dir PATH

path to the directory on the server where the dataset shall be created. By default this is set to the URL (or local path) specified via SSHURL. If a relative path is provided here, it is interpreted as being relative to the user’s home directory on the server (or relative to SSHURL, when that is a local path). Additional features are relevant for recursive processing of datasets with subdatasets. By default, the local dataset structure is replicated on the server. However, it is possible to provide a template for generating different target directory names for all (sub)datasets. Templates can contain certain placeholders that are substituted for each (sub)dataset. For example: “/mydirectory/dataset%RELNAME”. Supported placeholders: %RELNAME - the name of the dataset, with any slashes replaced by dashes. Constraints: value must be a string or value must be NONE

--target-url URL

“public” access URL of the to-be-created target dataset(s) (default: SSHURL). Accessibility of this URL determines the access permissions of potential consumers of the dataset. As with target_dir, templates (same set of placeholders) are supported. Also, if specified, it is provided as the annex description. Constraints: value must be a string or value must be NONE

--target-pushurl URL

In case the TARGET_URL cannot be used to publish to the dataset, this option specifies an alternative URL for this purpose. As with target_url, templates (same set of placeholders) are supported. Constraints: value must be a string or value must be NONE

--dataset DATASET, -d DATASET

specify the dataset to create the publication target for. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--existing MODE

action to perform if a sibling is already configured under the given name and/or a target (non-empty) directory already exists. In this case, a dataset can be skipped (‘skip’), the sibling configuration can be updated (‘reconfigure’), or the process can be interrupted with an error (‘error’). DANGER ZONE: If ‘replace’ is used, an existing target directory will be forcefully removed, re-initialized, and the sibling (re-)configured (thus implies ‘reconfigure’). REPLACE could lead to data loss, so use with care. To minimize the possibility of data loss, DataLad will ask for confirmation in interactive mode, but will raise an exception in non-interactive mode. Constraints: value must be one of (‘skip’, ‘error’, ‘reconfigure’, ‘replace’) [Default: ‘error’]

--shared {false|true|umask|group|all|world|everybody|0xxx}

if given, configures the access permissions on the server for multiple users (this could include access by a webserver!). Possible values for this option are identical to those of git init –shared and are described in its documentation. Constraints: value must be a string or value must be convertible to type bool or value must be NONE

--group GROUP

Filesystem group for the repository. Specifying the group is particularly important when –shared=group. Constraints: value must be a string or value must be NONE

--ui {false|true|html_filename}

publish a web interface for the dataset, with an optional user-specified name for the HTML file at the publication target. Defaults to index.html at the dataset root. Constraints: value must be convertible to type bool or value must be a string [Default: False]

--as-common-datasrc NAME

configure the created sibling as a common data source of the dataset that can be automatically used by all consumers of the dataset (technical: git-annex auto- enabled special remote).

--publish-by-default REFSPEC

add a refspec to be published to this sibling by default if nothing specified. Constraints: value must be a string or value must be NONE

--publish-depends SIBLINGNAME

add a dependency such that the given existing sibling is always published prior to the new sibling. This equals setting a configuration item ‘remote.SIBLINGNAME.datalad-publish-depends’. This option can be given more than once to configure multiple dependencies. Constraints: value must be a string or value must be NONE

--annex-wanted EXPR

expression to specify ‘wanted’ content for the repository/sibling. See https://git-annex.branchable.com/git-annex-wanted/ for more information. Constraints: value must be a string or value must be NONE

--annex-group EXPR

expression to specify a group for the repository. See https://git-annex.branchable.com/git-annex-group/ for more information. Constraints: value must be a string or value must be NONE

--annex-groupwanted EXPR

expression for the groupwanted. Makes sense only if –annex-wanted=”groupwanted” and annex-group is given too. See https://git-annex.branchable.com/git-annex-groupwanted/ for more information. Constraints: value must be a string or value must be NONE

--inherit

if sibling is missing, inherit settings (git config, git annex wanted/group/groupwanted) from its super-dataset.

--since SINCE

limit processing to subdatasets that have been changed since a given state (by tag, branch, commit, etc). This can be used to create siblings for recently added subdatasets. If ‘^’ is given, the last state of the current branch at the sibling is taken as a starting point. Constraints: value must be a string or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad create-sibling-github
Synopsis
datalad create-sibling-github [-h] [--dataset DATASET] [-r] [-R LEVELS] [-s NAME] [--existing
    {skip|error|reconfigure|replace}] [--github-login TOKEN]
    [--credential NAME] [--github-organization NAME]
    [--access-protocol {https|ssh|https-ssh}] [--publish-depends
    SIBLINGNAME] [--private] [--description DESCRIPTION] [--dryrun]
    [--dry-run] [--api URL] [--version] [<org-name>/]<repo-basename>
Description

Create dataset sibling on GitHub.org (or an enterprise deployment).

GitHub is a popular commercial solution for code hosting and collaborative development. GitHub cannot host dataset content (but see LFS, http://handbook.datalad.org/r.html?LFS). However, in combination with other data sources and siblings, publishing a dataset to GitHub can facilitate distribution and exchange, while still allowing any dataset consumer to obtain actual data content from alternative sources.

In order to be able to use this command, a personal access token has to be generated on the platform (Account->Settings->Developer Settings->Personal access tokens->Generate new token).

This command can be configured with “datalad.create-sibling-ghlike.extra-remote-settings.NETLOC.KEY=VALUE” in order to add any local KEY = VALUE configuration to the created sibling in the local .git/config file. NETLOC is the domain of the GitHub instance to apply the configuration for. This leads to a behavior that is equivalent to calling datalad’s siblings configure command (siblings('configure', ...) in the Python API) with the respective KEY-VALUE pair after creating the sibling. The configuration, like any other, could be set at user- or system level, so users do not need to add this configuration to every sibling created with the service at NETLOC themselves.
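
For instance, such a setting could be placed in the user’s global Git configuration; the NETLOC (‘github.com’) and the KEY/VALUE pair (‘annex-ignore’ set to ‘true’) below are merely illustrative assumptions:

# hypothetical KEY/VALUE; any local sibling configuration item could be used
% git config --global datalad.create-sibling-ghlike.extra-remote-settings.github.com.annex-ignore true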

Changed in version 0.16

The API has been aligned with the create-sibling-... commands of other GitHub-like services, such as GOGS, GIN, and Gitea.

Deprecated in version 0.16

The --dryrun option will be removed in a future release; use the renamed --dry-run option instead. The --github-login option will be removed in a future release; use the --credential option instead. The --github-organization option will be removed in a future release; prefix the repository name with <org>/ instead.

Examples

Use a new sibling on GIN as a common data source that is auto-available when cloning from GitHub:

% datalad create-sibling-gin myrepo -s gin

# the sibling on GitHub will be used for collaborative work
% datalad create-sibling-github myrepo -s github

# register the storage of the public GIN repo as a data source
% datalad siblings configure -s gin --as-common-datasrc gin-storage

# announce its availability on github
% datalad push --to github
Options
[<org-name>/]<repo-(base)name>

repository name, optionally including an ‘<organization>/’ prefix if the repository shall not reside under a user’s namespace. When operating recursively, a suffix will be appended to this name for each subdataset. Constraints: value must be a string

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--dataset DATASET, -d DATASET

dataset to create the publication target for. If not given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

-s NAME, --name NAME

name of the sibling in the local dataset installation (remote name). Constraints: value must be a string or value must be NONE [Default: ‘github’]

--existing {skip|error|reconfigure|replace}

behavior when already existing or configured siblings are discovered: skip the dataset (‘skip’), update the configuration (‘reconfigure’), or fail (‘error’). DEPRECATED DANGER ZONE: With ‘replace’, an existing repository will be irreversibly removed, re-initialized, and the sibling (re-)configured (thus implies ‘reconfigure’). REPLACE could lead to data loss! In interactive sessions a confirmation prompt is shown, an exception is raised in non-interactive sessions. The ‘replace’ mode will be removed in a future release. Constraints: value must be one of (‘skip’, ‘error’, ‘reconfigure’, ‘replace’) [Default: ‘error’]

--github-login TOKEN

Deprecated; use the credential parameter instead. If given, it must be a personal access token. Constraints: value must be a string or value must be NONE

--credential NAME

name of the credential providing a personal access token to be used for authorization. The token can be supplied via configuration setting ‘datalad.credential.<name>.token’, or environment variable DATALAD_CREDENTIAL_<NAME>_TOKEN, or will be queried from the active credential store using the provided name. If none is provided, the host-part of the API URL is used as a name (e.g. ‘https://api.github.com’ -> ‘api.github.com’). Constraints: value must be a string or value must be NONE

--github-organization NAME

Deprecated, prepend a repo name with an ‘<orgname>/’ prefix instead. Constraints: value must be a string or value must be NONE

--access-protocol {https|ssh|https-ssh}

access protocol/URL to configure for the sibling. With ‘https-ssh’ SSH will be used for write access, whereas HTTPS is used for read access. Constraints: value must be one of (‘https’, ‘ssh’, ‘https-ssh’) [Default: ‘https’]

--publish-depends SIBLINGNAME

add a dependency such that the given existing sibling is always published prior to the new sibling. This equals setting a configuration item ‘remote.SIBLINGNAME.datalad-publish-depends’. This option can be given more than once to configure multiple dependencies. Constraints: value must be a string or value must be NONE

--private

if set, create a private repository.

--description DESCRIPTION

Brief description, displayed on the project’s page. Constraints: value must be a string or value must be NONE

--dryrun

Deprecated. Use the renamed --dry-run parameter.

--dry-run

if set, no repository will be created, only tests for sibling name collisions will be performed, and would-be repository names are reported for all relevant datasets.

--api URL

URL of the GitHub instance API. Constraints: value must be a string or value must be NONE [Default: ‘https://api.github.com’]

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad create-sibling-gitlab
Synopsis
datalad create-sibling-gitlab [-h] [--site SITENAME] [--project NAME/LOCATION] [--layout
    {collection|flat}] [--dataset DATASET] [-r] [-R LEVELS] [-s
    NAME] [--existing {skip|error|reconfigure}] [--access
    {http|ssh|ssh+http}] [--publish-depends SIBLINGNAME]
    [--description DESCRIPTION] [--dryrun] [--dry-run] [--version]
    [PATH ...]
Description

Create dataset sibling at a GitLab site

An existing GitLab project, or a project created via the GitLab web interface can be configured as a sibling with the siblings command. Alternatively, this command can create a GitLab project at any location/path a given user has appropriate permissions for. This is particularly helpful for recursive sibling creation for subdatasets. API access and authentication are implemented via python-gitlab, and all its features are supported. A particular GitLab site must be configured in a named section of a python-gitlab.cfg file (see https://python-gitlab.readthedocs.io/en/stable/cli-usage.html#configuration-file-format for details), such as:

[mygit]
url = https://git.example.com
api_version = 4
private_token = abcdefghijklmnopqrst

Subsequently, this site is identified by its name (‘mygit’ in the example above).
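
With such a configuration in place, a sibling for the current dataset could be created roughly like this (‘mygroup/myproject’ is a hypothetical project path whose root-level group must already exist on the site):

% datalad create-sibling-gitlab --site mygit --project mygroup/myproject

# the sibling name defaults to the site name
% datalad push --to mygit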

(Recursive) sibling creation for all, or a selected subset of subdatasets is supported with two different project layouts (see –layout):

“flat”

All datasets are placed as GitLab projects in the same group. The project name of the top-level dataset follows the configured datalad.gitlab-SITENAME-project configuration. The project names of contained subdatasets extend the configured name with the subdatasets’ relative paths within the root dataset, with all path separator characters replaced by ‘-’. This path separator is configurable (see Configuration).

“collection”

A new group is created for the dataset hierarchy, following the datalad.gitlab-SITENAME-project configuration. The root dataset is placed in a “project” project inside this group, and all nested subdatasets are represented inside the group using a “flat” layout. The root dataset’s project name is configurable (see Configuration).

GitLab cannot host dataset content. However, in combination with other data sources (and siblings), publishing a dataset to GitLab can facilitate distribution and exchange, while still allowing any dataset consumer to obtain actual data content from alternative sources.

Configuration

Many configuration switches and options for GitLab sibling creation can be provided as arguments to the command. However, it is also possible to specify a particular setup in a dataset’s configuration. This is particularly important when managing large collections of datasets. Configuration options are:

“datalad.gitlab-default-site”

Name of the default GitLab site (see –site)

“datalad.gitlab-SITENAME-siblingname”

Name of the sibling configured for the local dataset that points to the GitLab instance SITENAME (see –name)

“datalad.gitlab-SITENAME-layout”

Project layout used at the GitLab instance SITENAME (see –layout)

“datalad.gitlab-SITENAME-access”

Access method used for the GitLab instance SITENAME (see –access)

“datalad.gitlab-SITENAME-project”

Project “location/path” used for a dataset at GitLab instance SITENAME (see –project). Configuring this is useful for deriving project paths for subdatasets, relative to the superdataset. The root-level group (“location”) needs to be created beforehand via GitLab’s web interface.

“datalad.gitlab-default-projectname”

The collection layout publishes (sub)datasets as projects with a custom name. The default name “project” can be overridden with this configuration.

“datalad.gitlab-default-pathseparator”

The flat and collection layout represent subdatasets with project names that correspond to their path within the superdataset, with the regular path separator replaced with a “-”: superdataset-subdataset. This configuration can be used to override this default separator.
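
As a sketch, such options could be recorded in a dataset’s local Git configuration (committing them to .datalad/config instead would make them available to all clones); ‘mygit’ is a placeholder site name:

% git config --local datalad.gitlab-default-site mygit
% git config --local datalad.gitlab-mygit-layout collection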

This command can be configured with “datalad.create-sibling-ghlike.extra-remote-settings.NETLOC.KEY=VALUE” in order to add any local KEY = VALUE configuration to the created sibling in the local .git/config file. NETLOC is the domain of the GitLab instance to apply the configuration for. This leads to a behavior that is equivalent to calling datalad’s siblings configure command (siblings('configure', ...) in the Python API) with the respective KEY-VALUE pair after creating the sibling. The configuration, like any other, could be set at user- or system level, so users do not need to add this configuration to every sibling created with the service at NETLOC themselves.

Options
PATH

selectively create siblings for any datasets underneath a given path. By default only the root dataset is considered.

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--site SITENAME

name of the GitLab site to create a sibling at. Must match an existing python-gitlab configuration section with location and authentication settings (see https://python-gitlab.readthedocs.io/en/stable/cli-usage.html#configuration). By default the dataset configuration is consulted. Constraints: value must be NONE or value must be a string

--project NAME/LOCATION

project name/location at the GitLab site. If a subdataset of the reference dataset is processed, its project path is automatically determined by the LAYOUT configuration, by default. Users need to create the root-level GitLab group (NAME) via the web interface before running the command. Constraints: value must be NONE or value must be a string

--layout {collection|flat}

layout of projects at the GitLab site, if a collection, or a hierarchy of datasets and subdatasets is to be created. By default the dataset configuration is consulted. Constraints: value must be one of (‘collection’, ‘flat’)

--dataset DATASET, -d DATASET

reference or root dataset. If no path constraints are given, a sibling for this dataset will be created. In this and all other cases, the reference dataset is also consulted for the GitLab configuration, and desired project layout. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

-s NAME, --name NAME

name to represent the GitLab sibling remote in the local dataset installation. If not specified a name is looked up in the dataset configuration, or defaults to the SITE name. Constraints: value must be a string or value must be NONE

--existing {skip|error|reconfigure}

desired behavior when already existing or configured siblings are discovered. ‘skip’: ignore; ‘error’: fail, if access URLs differ; ‘reconfigure’: use the existing repository and reconfigure the local dataset to use it as a sibling. Constraints: value must be one of (‘skip’, ‘error’, ‘reconfigure’) [Default: ‘error’]

--access {http|ssh|ssh+http}

access method used for data transfer to and from the sibling. ‘ssh’: read and write access use the SSH protocol; ‘http’: read and write access use HTTP requests; ‘ssh+http’: read access is done via HTTP and write access is performed with SSH. The dataset configuration is consulted for a default; ‘http’ is used otherwise. Constraints: value must be one of (‘http’, ‘ssh’, ‘ssh+http’)

--publish-depends SIBLINGNAME

add a dependency such that the given existing sibling is always published prior to the new sibling. This equals setting a configuration item ‘remote.SIBLINGNAME.datalad-publish-depends’. This option can be given more than once to configure multiple dependencies. Constraints: value must be a string or value must be NONE

--description DESCRIPTION

brief description for the GitLab project (displayed on the site). Constraints: value must be a string or value must be NONE

--dryrun

Deprecated. Use the renamed --dry-run parameter.

--dry-run

if set, no repository will be created, only tests for name collisions will be performed, and would-be repository names are reported for all relevant datasets.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad create-sibling-gogs
Synopsis
datalad create-sibling-gogs [-h] [--api URL] [--dataset DATASET] [-r] [-R LEVELS] [-s NAME]
    [--existing {skip|error|reconfigure|replace}] [--credential
    NAME] [--access-protocol {https|ssh|https-ssh}]
    [--publish-depends SIBLINGNAME] [--private] [--description
    DESCRIPTION] [--dry-run] [--version]
    [<org-name>/]<repo-basename>
Description

Create a dataset sibling on a GOGS site

GOGS is a self-hosted, free and open source code hosting solution with low resource demands that enable running it on inexpensive devices like a Raspberry Pi, or even directly on a NAS device.

In order to be able to use this command, a personal access token has to be generated on the platform (Account->Your Settings->Applications->Generate New Token).

This command can be configured with “datalad.create-sibling-ghlike.extra-remote-settings.NETLOC.KEY=VALUE” in order to add any local KEY = VALUE configuration to the created sibling in the local .git/config file. NETLOC is the domain of the GOGS instance to apply the configuration for. This leads to a behavior that is equivalent to calling datalad’s siblings configure command (siblings('configure', ...) in the Python API) with the respective KEY-VALUE pair after creating the sibling. The configuration, like any other, could be set at user- or system level, so users do not need to add this configuration to every sibling created with the service at NETLOC themselves.

New in version 0.16
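
Examples

As a hedged sketch (the instance URL and repository name are placeholders), a sibling on a self-hosted GOGS instance could be created and published to:

% datalad create-sibling-gogs myrepo -s gogs --api https://gogs.example.org
% datalad push --to gogs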

Options
[<org-name>/]<repo-(base)name>

repository name, optionally including an ‘<organization>/’ prefix if the repository shall not reside under a user’s namespace. When operating recursively, a suffix will be appended to this name for each subdataset. Constraints: value must be a string

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--api URL

URL of the GOGS instance without an ‘api/<version>’ suffix. Constraints: value must be a string or value must be NONE

--dataset DATASET, -d DATASET

dataset to create the publication target for. If not given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

-s NAME, --name NAME

name of the sibling in the local dataset installation (remote name). Constraints: value must be a string or value must be NONE

--existing {skip|error|reconfigure|replace}

behavior when already existing or configured siblings are discovered: skip the dataset (‘skip’), update the configuration (‘reconfigure’), or fail (‘error’). DEPRECATED DANGER ZONE: With ‘replace’, an existing repository will be irreversibly removed, re-initialized, and the sibling (re-)configured (thus implies ‘reconfigure’). REPLACE could lead to data loss! In interactive sessions a confirmation prompt is shown, an exception is raised in non-interactive sessions. The ‘replace’ mode will be removed in a future release. Constraints: value must be one of (‘skip’, ‘error’, ‘reconfigure’, ‘replace’) [Default: ‘error’]

--credential NAME

name of the credential providing a personal access token to be used for authorization. The token can be supplied via configuration setting ‘datalad.credential.<name>.token’, or environment variable DATALAD_CREDENTIAL_<NAME>_TOKEN, or will be queried from the active credential store using the provided name. If none is provided, the host-part of the API URL is used as a name (e.g. ‘https://api.github.com’ -> ‘api.github.com’). Constraints: value must be a string or value must be NONE

--access-protocol {https|ssh|https-ssh}

access protocol/URL to configure for the sibling. With ‘https-ssh’ SSH will be used for write access, whereas HTTPS is used for read access. Constraints: value must be one of (‘https’, ‘ssh’, ‘https-ssh’) [Default: ‘https’]

--publish-depends SIBLINGNAME

add a dependency such that the given existing sibling is always published prior to the new sibling. This equals setting a configuration item ‘remote.SIBLINGNAME.datalad-publish-depends’. This option can be given more than once to configure multiple dependencies. Constraints: value must be a string or value must be NONE

--private

if set, create a private repository.

--description DESCRIPTION

Brief description, displayed on the project’s page. Constraints: value must be a string or value must be NONE

--dry-run

if set, no repository will be created, only tests for sibling name collisions will be performed, and would-be repository names are reported for all relevant datasets.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad create-sibling-gitea
Synopsis
datalad create-sibling-gitea [-h] [--dataset DATASET] [-r] [-R LEVELS] [-s NAME] [--existing
    {skip|error|reconfigure|replace}] [--api URL] [--credential
    NAME] [--access-protocol {https|ssh|https-ssh}]
    [--publish-depends SIBLINGNAME] [--private] [--description
    DESCRIPTION] [--dry-run] [--version]
    [<org-name>/]<repo-basename>
Description

Create a dataset sibling on a Gitea site

Gitea is a lightweight, free and open source code hosting solution with low resource demands that enable running it on inexpensive devices like a Raspberry Pi.

This command uses the main Gitea instance at https://gitea.com as the default target, but other deployments can be used via the ‘api’ parameter.

In order to be able to use this command, a personal access token has to be generated on the platform (Account->Settings->Applications->Generate Token).

This command can be configured with “datalad.create-sibling-ghlike.extra-remote-settings.NETLOC.KEY=VALUE” in order to add any local KEY = VALUE configuration to the created sibling in the local .git/config file. NETLOC is the domain of the Gitea instance to apply the configuration for. This leads to a behavior that is equivalent to calling datalad’s siblings configure command (siblings('configure', ...) in the Python API) with the respective KEY-VALUE pair after creating the sibling. The configuration, like any other, could be set at user- or system level, so users do not need to add this configuration to every sibling created with the service at NETLOC themselves.

New in version 0.16
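
Examples

As a hedged sketch (‘myrepo’ is a placeholder repository name), a sibling on the default Gitea instance could be created and published to:

% datalad create-sibling-gitea myrepo -s gitea
% datalad push --to gitea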

Options
[<org-name>/]<repo-(base)name>

repository name, optionally including an ‘<organization>/’ prefix if the repository shall not reside under a user’s namespace. When operating recursively, a suffix will be appended to this name for each subdataset. Constraints: value must be a string

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--dataset DATASET, -d DATASET

dataset to create the publication target for. If not given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

-s NAME, --name NAME

name of the sibling in the local dataset installation (remote name). Constraints: value must be a string or value must be NONE [Default: ‘gitea’]

--existing {skip|error|reconfigure|replace}

behavior when already existing or configured siblings are discovered: skip the dataset (‘skip’), update the configuration (‘reconfigure’), or fail (‘error’). DEPRECATED DANGER ZONE: With ‘replace’, an existing repository will be irreversibly removed, re-initialized, and the sibling (re-)configured (thus implies ‘reconfigure’). REPLACE could lead to data loss! In interactive sessions a confirmation prompt is shown, an exception is raised in non-interactive sessions. The ‘replace’ mode will be removed in a future release. Constraints: value must be one of (‘skip’, ‘error’, ‘reconfigure’, ‘replace’) [Default: ‘error’]

--api URL

URL of the Gitea instance without an ‘api/<version>’ suffix. Constraints: value must be a string or value must be NONE [Default: ‘https://gitea.com’]

--credential NAME

name of the credential providing a personal access token to be used for authorization. The token can be supplied via configuration setting ‘datalad.credential.<name>.token’, or environment variable DATALAD_CREDENTIAL_<NAME>_TOKEN, or will be queried from the active credential store using the provided name. If none is provided, the host-part of the API URL is used as a name (e.g. ‘https://api.github.com’ -> ‘api.github.com’). Constraints: value must be a string or value must be NONE

--access-protocol {https|ssh|https-ssh}

access protocol/URL to configure for the sibling. With ‘https-ssh’ SSH will be used for write access, whereas HTTPS is used for read access. Constraints: value must be one of (‘https’, ‘ssh’, ‘https-ssh’) [Default: ‘https’]

--publish-depends SIBLINGNAME

add a dependency such that the given existing sibling is always published prior to the new sibling. This equals setting a configuration item ‘remote.SIBLINGNAME.datalad-publish-depends’. This option can be given more than once to configure multiple dependencies. Constraints: value must be a string or value must be NONE

--private

if set, create a private repository.

--description DESCRIPTION

Brief description, displayed on the project’s page. Constraints: value must be a string or value must be NONE

--dry-run

if set, no repository will be created, only tests for sibling name collisions will be performed, and would-be repository names are reported for all relevant datasets.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad create-sibling-gin
Synopsis
datalad create-sibling-gin [-h] [--dataset DATASET] [-r] [-R LEVELS] [-s NAME] [--existing
    {skip|error|reconfigure|replace}] [--api URL] [--credential
    NAME] [--access-protocol {https|ssh|https-ssh}]
    [--publish-depends SIBLINGNAME] [--private] [--description
    DESCRIPTION] [--dry-run] [--version]
    [<org-name>/]<repo-basename>
Description

Create a dataset sibling on a GIN site (with content hosting)

GIN (G-Node infrastructure) is a free data management system. It is a GitHub-like, web-based repository store and provides fine-grained access control to shared data. GIN is built on Git and git-annex, and can natively host DataLad datasets, including their data content!

This command uses the main GIN instance at https://gin.g-node.org as the default target, but other deployments can be used via the ‘api’ parameter.

An SSH key, properly registered at the GIN instance, is required for data upload via DataLad. Data download from public projects is also possible via anonymous HTTP.

In order to be able to use this command, a personal access token has to be generated on the platform (Account->Your Settings->Applications->Generate New Token).

This command can be configured with “datalad.create-sibling-ghlike.extra-remote-settings.NETLOC.KEY=VALUE” in order to add any local KEY = VALUE configuration to the created sibling in the local .git/config file. NETLOC is the domain of the GIN instance to apply the configuration for. This leads to a behavior that is equivalent to calling datalad’s siblings configure command (siblings('configure', ...) in the Python API) with the respective KEY-VALUE pair after creating the sibling. The configuration, like any other, could be set at user- or system level, so users do not need to add this configuration to every sibling created with the service at NETLOC themselves.

New in version 0.16

Examples

Create a repo ‘myrepo’ on GIN and register it as sibling ‘mygin’:

% datalad create-sibling-gin myrepo -s mygin

Create private repos with name(-prefix) ‘myrepo’ on GIN for a dataset and all its present subdatasets:

% datalad create-sibling-gin myrepo -r --private

Create a sibling repo on GIN, and register it as a common data source in the dataset that is available regardless of whether the dataset was directly cloned from GIN:

% datalad create-sibling-gin myrepo -s gin
# first push creates git-annex branch remotely and obtains annex UUID
% datalad push --to gin
% datalad siblings configure -s gin --as-common-datasrc gin-storage
# announce availability (redo for other siblings)
% datalad push --to gin
Options
[<org-name>/]<repo-(base)name>

repository name, optionally including an ‘<organization>/’ prefix if the repository shall not reside under a user’s namespace. When operating recursively, a suffix will be appended to this name for each subdataset. Constraints: value must be a string

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--dataset DATASET, -d DATASET

dataset to create the publication target for. If not given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

-s NAME, --name NAME

name of the sibling in the local dataset installation (remote name). Constraints: value must be a string or value must be NONE [Default: ‘gin’]

--existing {skip|error|reconfigure|replace}

behavior when already existing or configured siblings are discovered: skip the dataset (‘skip’), update the configuration (‘reconfigure’), or fail (‘error’). DEPRECATED DANGER ZONE: With ‘replace’, an existing repository will be irreversibly removed, re-initialized, and the sibling (re-)configured (thus implies ‘reconfigure’). REPLACE could lead to data loss! In interactive sessions a confirmation prompt is shown, an exception is raised in non-interactive sessions. The ‘replace’ mode will be removed in a future release. Constraints: value must be one of (‘skip’, ‘error’, ‘reconfigure’, ‘replace’) [Default: ‘error’]

--api URL

URL of the GIN instance without an ‘api/<version>’ suffix. Constraints: value must be a string or value must be NONE [Default: ‘https://gin.g-node.org’]

--credential NAME

name of the credential providing a personal access token to be used for authorization. The token can be supplied via configuration setting ‘datalad.credential.<name>.token’, or environment variable DATALAD_CREDENTIAL_<NAME>_TOKEN, or will be queried from the active credential store using the provided name. If none is provided, the host-part of the API URL is used as a name (e.g. ‘https://api.github.com’ -> ‘api.github.com’). Constraints: value must be a string or value must be NONE

--access-protocol {https|ssh|https-ssh}

access protocol/URL to configure for the sibling. With ‘https-ssh’ SSH will be used for write access, whereas HTTPS is used for read access. Constraints: value must be one of (‘https’, ‘ssh’, ‘https-ssh’) [Default: ‘https-ssh’]

--publish-depends SIBLINGNAME

add a dependency such that the given existing sibling is always published prior to the new sibling. This equals setting a configuration item ‘remote.SIBLINGNAME.datalad-publish-depends’. This option can be given more than once to configure multiple dependencies. Constraints: value must be a string or value must be NONE

--private

if set, create a private repository.

--description DESCRIPTION

Brief description, displayed on the project’s page. Constraints: value must be a string or value must be NONE

--dry-run

if set, no repository will be created, only tests for sibling name collisions will be performed, and would-be repository names are reported for all relevant datasets.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad create-sibling-ria
Synopsis
datalad create-sibling-ria [-h] -s NAME [-d DATASET] [--storage-name NAME] [--alias ALIAS]
    [--post-update-hook] [--shared
    {false|true|umask|group|all|world|everybody|0xxx}] [--group
    GROUP] [--storage-sibling MODE] [--existing MODE]
    [--new-store-ok] [--trust-level TRUST-LEVEL] [-r] [-R LEVELS]
    [--no-storage-sibling] [--push-url
    ria+<ssh|file>://<host>[/path]] [--version]
    ria+<ssh|file|https>://<host>[/path]
Description

Creates a sibling to a dataset in a RIA store

Communication with a dataset in a RIA store is implemented via two siblings: a regular Git remote (repository sibling) and a git-annex special remote for data transfer (storage sibling), with the former having a publication dependency on the latter. By default, the name of the storage sibling is derived from the repository sibling’s name by appending “-storage”.

The store’s base path is expected to either not exist, be an empty directory, or be a valid RIA store.
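
For orientation, a typical invocation might look like this (store URL and sibling name are placeholders; --new-store-ok permits creating the store if it does not yet exist):

% datalad create-sibling-ria -s ria --new-store-ok ria+ssh://server.example.org/data/store

# push Git history and annexed data (the publication dependency routes data to the storage sibling)
% datalad push --to ria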

Notes

RIA URL format

Interactions with new or existing RIA stores require RIA URLs to identify the store or specific datasets inside of it.

The general structure of a RIA URL pointing to a store takes the form ria+[scheme]://<storelocation> (e.g., ria+ssh://[user@]hostname:/absolute/path/to/ria-store, or ria+file:///absolute/path/to/ria-store)

The general structure of a RIA URL pointing to a dataset in a store (for example for cloning) takes a similar form, but appends either the dataset’s UUID or a “~” symbol followed by the dataset’s alias name: ria+[scheme]://<storelocation>#<dataset-UUID> or ria+[scheme]://<storelocation>#~<aliasname>. In addition, specific version identifiers can be appended to the URL with an additional “@” symbol: ria+[scheme]://<storelocation>#<dataset-UUID>@<dataset-version>, where dataset-version refers to a branch or tag.
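
For example, a dataset could be cloned from a store via its alias and a specific tag (all values are hypothetical):

% datalad clone 'ria+ssh://server.example.org/data/store#~mydata@v1.0.0' mydata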

RIA store layout

A RIA store is a directory tree with a dedicated subdirectory for each dataset in the store. The subdirectory name is constructed from the DataLad dataset ID, e.g. 124/68afe-59ec-11ea-93d7-f0d5bf7b5561, where the first three characters of the ID are used for an intermediate subdirectory in order to mitigate file system limitations for stores containing a large number of datasets.

By default, a dataset in a RIA store consists of two components: A Git repository (for all dataset contents stored in Git) and a storage sibling (for dataset content stored in git-annex).

It is possible to selectively disable either component using storage-sibling 'off' or storage-sibling 'only', respectively. If neither component is disabled, a dataset’s subdirectory layout in a RIA store contains a standard bare Git repository and an annex/ subdirectory inside of it. The latter holds a Git-annex object store and comprises the storage sibling. Disabling the standard git-remote (storage-sibling='only') will result in not having the bare git repository; disabling the storage sibling (storage-sibling='off') will result in not having the annex/ subdirectory.

Optionally, there can be a further subdirectory archives with (compressed) 7z archives of annex objects. The storage remote is able to pull annex objects from these archives, if it cannot find them in the regular annex object store. This feature can be useful for storing large collections of rarely changing data on systems that limit the number of files that can be stored.

Each dataset directory also contains a ria-layout-version file that identifies the data organization (as, for example, described above).

Lastly, there is a global ria-layout-version file at the store’s base path that identifies where dataset subdirectories themselves are located. At present, this file must contain a single line stating the version (currently “1”). This line MUST end with a newline character.

It is possible to define an alias for an individual dataset in a store by placing a symlink to the dataset location into an alias/ directory in the root of the store. This enables dataset access via URLs of format: ria+<protocol>://<storelocation>#~<aliasname>.
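
A sketch of creating such an alias manually, reusing the dataset subdirectory from the layout example above (the store path /data/store and the alias ‘mydata’ are placeholders):

% mkdir -p /data/store/alias
% ln -s ../124/68afe-59ec-11ea-93d7-f0d5bf7b5561 /data/store/alias/mydata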

Compared to standard git-annex object stores, the annex/ subdirectories used as storage siblings follow a different layout naming scheme (‘dirhashmixed’ instead of ‘dirhashlower’). This is mostly noted as a technical detail, but also serves to remind git-annex power users to refrain from running git-annex commands directly in-store, as it can cause severe damage due to the layout difference. Interactions should be handled via the ORA special remote instead.

Error logging

To enable error logging at the remote end, append a pipe symbol and an “l” to the version number in ria-layout-version (like so: 1|l\n).

Error logging will create files in an “error_log” directory whenever the git-annex special remote (storage sibling) raises an exception, storing the Python traceback of it. The logfiles are named according to the scheme <dataset id>.<annex uuid of the remote>.log showing “who” ran into this issue with which dataset. Because logging can potentially leak personal data (like local file paths for example), it can be disabled client-side by setting the configuration variable annex.ora-remote.<storage-sibling-name>.ignore-remote-config.
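
A hedged sketch of disabling this logging client-side for a storage sibling named ‘ria-storage’ (both the sibling name and the value ‘true’ are assumptions):

% git config annex.ora-remote.ria-storage.ignore-remote-config true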

Options
ria+<ssh|file|http(s)>://<host>[/path]

URL identifying the target RIA store and access protocol. If --push-url is given in addition, this is used for read access only. Otherwise it will be used for write access too and to create the repository sibling in the RIA store. Note that HTTP(S) is currently valid for consumption only, thus requiring --push-url to be provided. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-s NAME, --name NAME

Name of the sibling. With RECURSIVE, the same name will be used to label all the subdatasets’ siblings. Constraints: value must be a string or value must be NONE

-d DATASET, --dataset DATASET

specify the dataset to process. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--storage-name NAME

Name of the storage sibling (git-annex special remote). Must not be identical to the sibling name. If not specified, defaults to the sibling name plus ‘-storage’ suffix. If only a storage sibling is created, this setting is ignored, and the primary sibling name is used. Constraints: value must be a string or value must be NONE

--alias ALIAS

Alias for the dataset in the RIA store. Add the necessary symlink so that this dataset can be cloned from the RIA store using the given ALIAS instead of its ID. With recursive=True, only the top dataset will be aliased. Constraints: value must be a string or value must be NONE

--post-update-hook

Enable Git’s default post-update-hook for the created sibling. This is useful when the sibling is made accessible via a “dumb server” that requires running ‘git update-server-info’ to let Git interact properly with it.

--shared {false|true|umask|group|all|world|everybody|0xxx}

If given, configures the permissions in the RIA store for multi-user access. Possible values for this option are identical to those of git init –shared and are described in its documentation. Constraints: value must be a string or value must be convertible to type bool or value must be NONE

--group GROUP

Filesystem group for the repository. Specifying the group is crucial when –shared=group. Constraints: value must be a string or value must be NONE

--storage-sibling MODE

By default, an ORA storage sibling and a Git repository sibling are created (on). Alternatively, creation of the storage sibling can be disabled (off), or a storage sibling created only and no Git sibling (only). In the latter mode, no Git installation is required on the target host. Constraints: value must be one of (‘only’,) or value must be convertible to type bool or value must be NONE [Default: True]

--existing MODE

Action to perform, if a (storage) sibling is already configured under the given name and/or a target already exists. In this case, a dataset can be skipped (‘skip’), an existing target repository be forcefully re-initialized, and the sibling (re-)configured (‘reconfigure’), or the command be instructed to fail (‘error’). Constraints: value must be one of (‘skip’, ‘error’, ‘reconfigure’) [Default: ‘error’]

--new-store-ok

When set, a new store will be created, if necessary. Otherwise, a sibling will only be created if the url points to an existing RIA store.

--trust-level TRUST-LEVEL

specify a trust level for the storage sibling. If not specified, the default git-annex trust level is used. ‘trust’ should be used with care (see the git- annex-trust man page). Constraints: value must be one of (‘trust’, ‘semitrust’, ‘untrust’)

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--no-storage-sibling

This option is deprecated. Use ‘–storage-sibling off’ instead.

--push-url ria+<ssh|file>://<host>[/path]

URL identifying the target RIA store and access protocol for write access to the storage sibling. If given this will also be used for creation of the repository sibling in the RIA store. Constraints: value must be a string or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad export-archive
Synopsis
datalad export-archive [-h] [-d DATASET] [-t {tar|zip}] [-c {gz|bz2|}] [--missing-content
    {error|continue|ignore}] [--version] [PATH]
Description

Export the content of a dataset as a TAR/ZIP archive.
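
For example (output paths are placeholders), the current dataset could be exported as a gzipped TAR archive, or as an uncompressed ZIP archive:

% datalad export-archive -d . /tmp/myds.tar.gz

% datalad export-archive -d . -t zip -c '' /tmp/myds.zip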

Options
PATH

File name of the generated TAR archive. If no file name is given the archive will be generated in the current directory and will be named: datalad_<dataset_uuid>.(tar.*|zip). To generate that file in a different directory, provide an existing directory as the file name. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to export. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-t {tar|zip}, --archivetype {tar|zip}

Type of archive to generate. Constraints: value must be one of (‘tar’, ‘zip’) [Default: ‘tar’]

-c {gz|bz2|}, --compression {gz|bz2|}

Compression method to use. ‘bz2’ is not supported for ZIP archives. No compression is used when an empty string is given. Constraints: value must be one of (‘gz’, ‘bz2’, ‘’) [Default: ‘gz’]

--missing-content {error|continue|ignore}

By default, any discovered file with missing content will result in an error and the export is aborted. Setting this to ‘continue’ will issue warnings instead of failing on error. The value ‘ignore’ will only inform about the problem at the ‘debug’ log level. The latter two can be helpful when generating a TAR archive from a dataset where some file content is not available locally. Constraints: value must be one of (‘error’, ‘continue’, ‘ignore’) [Default: ‘error’]

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad export-archive-ora
Synopsis
datalad export-archive-ora [-h] [-d DATASET] [--for LABEL] [--annex-wanted FILTERS] [--from FROM
    [FROM ...]] [--missing-content {error|continue|ignore}]
    [--version] TARGET ...
Description

Export an archive of a local annex object store for the ORA remote.

Keys in the local annex object store are reorganized in a temporary directory (using links to avoid storage duplication) to use the ‘hashdirlower’ setup used by git-annex for bare repositories and the directory-type special remote. This alternative object store is then moved into a 7zip archive that is suitable for use in an ORA remote dataset store. Placing such an archive into:

<dataset location>/archives/archive.7z

enables the ORA special remote to locate and retrieve all keys contained in the archive.
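
From Python, the operation is also exposed as api.export_archive_ora (listed in the module reference below); target and opts appear in its documented signature, and dataset mirrors the -d option. A minimal sketch with a purely illustrative target path:

import datalad.api as dl

# place an 'archive.7z' suitable for an ORA store into an (existing) target directory
dl.export_archive_ora(target='/tmp/ora-archives', dataset='.')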

Options
TARGET

if an existing directory, an ‘archive.7z’ is placed into it, otherwise this is the path to the target archive. Constraints: value must be a string or value must be NONE

OPTS

list of options passed to 7z, replacing the default ‘-mx0’ (which generates an uncompressed archive).

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to process. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--for LABEL

name of the target sibling, wanted/preferred settings will be used to filter the files added to the archives. Constraints: value must be a string or value must be NONE

--annex-wanted FILTERS

git-annex-preferred-content expression for git-annex find to filter files. Should start with ‘or’ or ‘and’ when used in combination with –for.

--from FROM [FROM …]

one or multiple tree-ish from which to select files.

--missing-content {error|continue|ignore}

By default, any discovered file with missing content will result in an error and the export is aborted. Setting this to ‘continue’ will issue warnings instead of failing on error. The value ‘ignore’ will only inform about the problem at the ‘debug’ log level. The latter two can be helpful when generating an archive from a dataset where some file content is not available locally. Constraints: value must be one of (‘error’, ‘continue’, ‘ignore’) [Default: ‘error’]

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad export-to-figshare
Synopsis
datalad export-to-figshare [-h] [-d DATASET] [--missing-content {error|continue|ignore}]
    [--no-annex] [--article-id ID] [--version] [PATH]
Description

Export the content of a dataset as a ZIP archive to figshare

Very quick and dirty approach. Ideally figshare should be supported as a proper git annex special remote. Unfortunately, figshare does not support having directories, and can store only a flat list of files. That makes any sensible publishing of complete datasets impossible.

The only workaround is to publish the dataset as a zip-ball, where the entire content is wrapped into a .zip archive for which figshare would provide a navigator.

Options
PATH

File name of the generated ZIP archive. If no file name is given the archive will be generated in the top directory of the dataset and will be named: datalad_<dataset_uuid>.zip. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to export. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--missing-content {error|continue|ignore}

By default, any discovered file with missing content will result in an error and the export is aborted. Setting this to ‘continue’ will issue warnings instead of failing on error. The value ‘ignore’ will only inform about the problem at the ‘debug’ log level. The latter two can be helpful when generating an archive from a dataset where some file content is not available locally. Constraints: value must be one of (‘error’, ‘continue’, ‘ignore’) [Default: ‘error’]

--no-annex

By default the generated .zip file is added to the annex, and all files get registered in git-annex as being available from that archive. Upon upload, the archive is also registered in the annex as a possible source for its content. Setting this flag disables this behavior.

--article-id ID

Which article to publish to. Constraints: value must be convertible to type ‘int’ or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad update
Synopsis
datalad update [-h] [-s SIBLING] [--merge [ALLOWED]] [--how
    [{fetch|merge|ff-only|reset|checkout}]] [--how-subds
    [{fetch|merge|ff-only|reset|checkout}]] [--follow
    {sibling|parentds|parentds-lazy}] [-d DATASET] [-r] [-R LEVELS]
    [--fetch-all] [--reobtain-data] [--version] [PATH ...]
Description

Update a dataset from a sibling.

Examples

Update from a particular sibling:

% datalad update -s <siblingname>

Update from a particular sibling and merge the changes from a configured or matching branch from the sibling (see –follow for details):

% datalad update --how=merge -s <siblingname>

Update from the sibling ‘origin’, traversing into subdatasets. For subdatasets, merge the revision registered in the parent dataset into the current branch:

% datalad update -s origin --how=merge --follow=parentds -r

Fetch and merge the remote tracking branch into the current dataset. Then update each subdataset by resetting its current branch to the revision registered in the parent dataset, fetching only if the revision isn’t already present:

% datalad update --how=merge --how-subds=reset --follow=parentds-lazy -r
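
The same update can be requested via the Python API as api.update (listed in the module reference below). A minimal sketch of the last example; how appears in the documented signature, while how_subds, follow, and recursive are assumptions mirroring the long options:

import datalad.api as dl

# fetch and merge in the current dataset; reset subdatasets to the revisions
# registered in their parent, fetching only when needed
dl.update(
    dataset='.',
    how='merge',
    how_subds='reset',
    follow='parentds-lazy',
    recursive=True,
)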
Options
PATH

constrain to-be-updated subdatasets to the given path for recursive operation. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-s SIBLING, --sibling SIBLING

name of the sibling to update from. When unspecified, updates from all siblings are fetched. If there is more than one sibling and changes will be brought into the working tree (as requested via –merge, –how, or –how-subds), a sibling will be chosen based on the configured remote for the current branch. Constraints: value must be a string or value must be NONE

--merge [ALLOWED]

merge obtained changes from the sibling. This is a subset of the functionality that can be achieved via the newer –how. –merge or –merge=any is equivalent to –how=merge. –merge=ff-only is equivalent to –how=ff-only. Constraints: value must be convertible to type bool or value must be one of (‘any’, ‘ff-only’) [Default: False]

--how [{fetch|merge|ff-only|reset|checkout}]

how to update the dataset. The default (“fetch”) simply fetches the changes from the sibling but doesn’t incorporate them into the working tree. A value of “merge” or “ff-only” merges in changes, with the latter restricting the allowed merges to fast-forwards. “reset” incorporates the changes with ‘git reset –hard <target>’, staying on the current branch but discarding any changes that aren’t shared with the target. “checkout”, on the other hand, runs ‘git checkout <target>’, switching from the current branch to a detached state. When –recursive is specified, this action will also apply to subdatasets unless overridden by –how-subds. Constraints: value must be one of (‘fetch’, ‘merge’, ‘ff-only’, ‘reset’, ‘checkout’)

--how-subds [{fetch|merge|ff-only|reset|checkout}]

Override the behavior of –how in subdatasets. Constraints: value must be one of (‘fetch’, ‘merge’, ‘ff-only’, ‘reset’, ‘checkout’)

--follow {sibling|parentds|parentds-lazy}

source of updates for subdatasets. For ‘sibling’, the update will be done by merging in a branch from the (specified or inferred) sibling. The branch brought in will either be the current branch’s configured branch, if it points to a branch that belongs to the sibling, or a sibling branch with a name that matches the current branch. For ‘parentds’, the revision registered in the parent dataset of the subdataset is merged in. ‘parentds-lazy’ is like ‘parentds’, but prevents fetching from a subdataset’s sibling if the registered revision is present in the subdataset. Note that the current dataset is always updated according to ‘sibling’. This option has no effect unless a merge is requested and –recursive is specified. Constraints: value must be one of (‘sibling’, ‘parentds’, ‘parentds-lazy’) [Default: ‘sibling’]

-d DATASET, --dataset DATASET

specify the dataset to update. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--fetch-all

this option has no effect and will be removed in a future version. When no siblings are given, an all-sibling update will be performed.

--reobtain-data

if enabled, file content that was present before an update will be re-obtained in case a file was changed by the update.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Reproducible execution

Extending the functionality of the core run command.

datalad rerun
Synopsis
datalad rerun [-h] [--since SINCE] [-d DATASET] [-b NAME] [-m MESSAGE] [--onto base]
    [--script FILE] [--report] [--assume-ready
    {inputs|outputs|both}] [--explicit] [-J NJOBS] [--version]
    [REVISION]
Description

Re-execute previous datalad run commands.

This will unlock any dataset content that is on record to have been modified by the command in the specified revision. It will then re-execute the command in the recorded path (if it was inside the dataset). Afterwards, all modifications will be saved.

Report mode

When called with –report, this command reports information about what would be re-executed as a series of records. There will be a record for each revision in the specified revision range. Each of these will have one of the following “rerun_action” values:

  • run: the revision has a recorded command that would be re-executed

  • skip-or-pick: the revision does not have a recorded command and would be either skipped or cherry picked

  • merge: the revision is a merge commit and a corresponding merge would be made

The decision to skip rather than cherry pick a revision is based on whether the revision would be reachable from HEAD at the time of execution.

In addition, when a starting point other than HEAD is specified, there is a rerun_action value “checkout”, in which case the record includes information about the revision that would be checked out before rerunning any commands.

NOTE

Currently the “onto” feature only sets the working tree of the current dataset to a previous state. The working trees of any subdatasets remain unchanged.

Examples

Re-execute the command from the previous commit:

% datalad rerun

Re-execute any commands in the last five commits:

% datalad rerun --since=HEAD~5

Do the same as above, but re-execute the commands on top of HEAD~5 in a detached state:

% datalad rerun --onto= --since=HEAD~5

Re-execute all previous commands and compare the old and new results:

% # on master branch
% datalad rerun --branch=verify --since=
% # now on verify branch
% datalad diff --revision=master..
% git log --oneline --left-right --cherry-pick master...
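
The re-execution shown above can also be driven from Python via api.rerun (listed in the module reference below); revision, since, and dataset appear in its documented signature, while the branch keyword is an assumption mirroring -b/--branch:

import datalad.api as dl

# re-execute all previously recorded commands on a new 'verify' branch
dl.rerun(since='', branch='verify', dataset='.')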
Options
REVISION

rerun command(s) in REVISION. By default, the command from this commit will be executed, but –since can be used to construct a revision range. The default value is like “HEAD” but resolves to the main branch when on an adjusted branch. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--since SINCE

If SINCE is a commit-ish, the commands from all commits that are reachable from REVISION but not SINCE will be re-executed (in other words, the commands in git log SINCE..REVISION). If SINCE is an empty string, it is set to the parent of the first commit that contains a recorded command (i.e., all commands in git log REVISION will be re-executed). Constraints: value must be a string or value must be NONE

-d DATASET, --dataset DATASET

specify the dataset from which to rerun a recorded command. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. If a dataset is given, the command will be executed in the root directory of this dataset. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-b NAME, --branch NAME

create and checkout this branch before rerunning the commands. Constraints: value must be a string or value must be NONE

-m MESSAGE, --message MESSAGE

use MESSAGE for the rerun commit rather than the recorded commit message. In the case of a multi-commit rerun, all the rerun commits will have this message. Constraints: value must be a string or value must be NONE

--onto base

start point for rerunning the commands. If not specified, commands are executed at HEAD. This option can be used to specify an alternative start point, which will be checked out with the branch name specified by –branch or in a detached state otherwise. As a special case, an empty value for this option means the parent of the first run commit in the specified revision list. Constraints: value must be a string or value must be NONE

--script FILE

extract the commands into FILE rather than rerunning. Use - to write to stdout instead. This option implies –report. Constraints: value must be a string or value must be NONE

--report

Don’t actually re-execute anything, just display what would be done. Note: If you give this option, you most likely want to set –output-format to ‘json’ or ‘json_pp’.

--assume-ready {inputs|outputs|both}

Assume that inputs do not need to be retrieved and/or outputs do not need to be unlocked or removed before running the command. This option allows you to avoid the expense of these preparation steps if you know that they are unnecessary. Note that this option also affects any additional outputs that are automatically inferred based on inspecting changed files in the run commit. Constraints: value must be one of (‘inputs’, ‘outputs’, ‘both’)

--explicit

Consider the specification of inputs and outputs in the run record to be explicit. Don’t warn if the repository is dirty, and only save modifications to the outputs from the original record. Note that when several run commits are specified, this applies to every one. Care should also be taken when using –onto because checking out a new HEAD can easily fail when the working tree has modifications.

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,)

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad run-procedure
Synopsis
datalad run-procedure [-h] [-d PATH] [--discover] [--help-proc] [--version] ...
Description

Run prepared procedures (DataLad scripts) on a dataset

Concept

A “procedure” is an algorithm with the purpose of processing a dataset in a particular way. Procedures can be useful in a wide range of scenarios, like adjusting dataset configuration in a uniform fashion, populating a dataset with particular content, or automating other routine tasks, such as synchronizing dataset content with certain siblings.

Implementations of some procedures are shipped together with DataLad, but additional procedures can be provided by 1) any DataLad extension, 2) any (sub-)dataset, 3) a local user, or 4) a local system administrator. DataLad will look for procedures in the following locations and order:

Directories identified by the configuration settings

  • ‘datalad.locations.user-procedures’ (determined by platformdirs.user_config_dir; defaults to ‘$HOME/.config/datalad/procedures’ on GNU/Linux systems)

  • ‘datalad.locations.system-procedures’ (determined by platformdirs.site_config_dir; defaults to ‘/etc/xdg/datalad/procedures’ on GNU/Linux systems)

  • ‘datalad.locations.dataset-procedures’

and subsequently in the ‘resources/procedures/’ directories of any installed extension, and, lastly, of the DataLad installation itself.

Please note that a dataset that defines ‘datalad.locations.dataset-procedures’ provides its procedures to any dataset it is a subdataset of. That way you can have a collection of such procedures in a dedicated dataset and install it as a subdataset into any dataset you want to use those procedures with. In case of a naming conflict with such a dataset hierarchy, the dataset you’re calling run-procedure on will take precedence over its subdatasets and so on.

Each configuration setting can occur multiple times to indicate multiple directories to be searched. If a procedure matching a given name is found (filename without a possible extension), the search is aborted and this implementation will be executed. This makes it possible for individual datasets, users, or machines to override externally provided procedures (enabling the implementation of customizable processing “hooks”).

Procedure implementation

A procedure can be any executable. Executables must have the appropriate permissions and, in the case of a script, must contain an appropriate “shebang” line. If a procedure is not executable, but its filename ends with ‘.py’, it is automatically executed by the ‘python’ interpreter (whichever version is available in the present environment). Likewise, procedure implementations ending in ‘.sh’ are executed via ‘bash’.

Procedures can implement any argument handling, but must be capable of taking at least one positional argument (the absolute path to the dataset they shall operate on).
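
As an illustration, a hypothetical user procedure could be a small Python script stored as, e.g., ~/.config/datalad/procedures/cfg_myproject.py. The name and actions below are made up; only the calling convention follows the description above:

import sys
import datalad.api as dl

# first positional argument: absolute path to the dataset to operate on
ds_path = sys.argv[1]
# any further arguments passed after NAME would arrive here
extra_args = sys.argv[2:]

# example actions: add a README and save the result
dl.add_readme(dataset=ds_path)
dl.save(dataset=ds_path, message='Apply cfg_myproject procedure')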

For further customization there are two configuration settings per procedure available:

  • ‘datalad.procedures.<NAME>.call-format’ fully customizable format string to determine how to execute procedure NAME (see also datalad-run). It currently requires the following placeholders to be included:

    • ‘{script}’: will be replaced by the path to the procedure

    • ‘{ds}’: will be replaced by the absolute path to the dataset the procedure shall operate on

    • ‘{args}’: (not actually required) will be replaced by all additional arguments passed into run-procedure after NAME

      As an example the default format string for a call to a python script is: “python {script} {ds} {args}”

  • ‘datalad.procedures.<NAME>.help’ will be shown on datalad run-procedure –help-proc NAME to provide a description and/or usage info for procedure NAME

Examples

Find out which procedures are available on the current system:

% datalad run-procedure --discover

Run the ‘yoda’ procedure in the current dataset:

% datalad run-procedure cfg_yoda
Options
NAME [ARGS]

Name and possibly additional arguments of the to-be-executed procedure. In the Python API, this can also be a dictionary coming from run_procedure(discover=True). Note that all options to run-procedure need to be put before NAME, since all ARGS get assigned to NAME.

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d PATH, --dataset PATH

specify the dataset to run the procedure on. An attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--discover

if given, all configured paths are searched for procedures and one result record per discovered procedure is yielded, but no procedure is executed.

--help-proc

if given, get a help message for procedure NAME from config setting datalad.procedures.NAME.help.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Helpers and support utilities
datalad add-archive-content
Synopsis
datalad add-archive-content [-h] [-d DATASET] [--annex ANNEX] [--add-archive-leading-dir]
    [--strip-leading-dirs] [--leading-dirs-depth LEADING_DIRS_DEPTH]
    [--leading-dirs-consider LEADING_DIRS_CONSIDER]
    [--use-current-dir] [-D] [--key] [-e EXCLUDE] [-r RENAME]
    [--existing {fail,overwrite,archive-suffix,numeric-suffix}] [-o
    ANNEX_OPTIONS] [--copy] [--no-commit] [--allow-dirty] [--stats
    STATS] [--drop-after] [--delete-after] [--version] archive
Description

Add content of an archive under git annex control.

Given an already annex’ed archive, extract and add its files to the dataset, and reference the original archive as a custom special remote.

Examples

Add files from the archive ‘big_tarball.tar.gz’, but keep big_tarball.tar.gz in the index:

% datalad add-archive-content big_tarball.tar.gz

Add files from the archive ‘big_tarball.tar.gz’, and remove big_tarball.tar.gz from the index:

% datalad add-archive-content big_tarball.tar.gz --delete

Add files from the archive ‘s3.zip’ but remove the leading directory:

% datalad add-archive-content s3.zip --strip-leading-dirs
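
The Python API counterpart is api.add_archive_content (listed in the module reference below); the archive argument appears in its documented signature, while strip_leading_dirs is an assumption mirroring --strip-leading-dirs:

import datalad.api as dl

# extract 's3.zip' into the dataset, removing the leading directory
dl.add_archive_content('s3.zip', dataset='.', strip_leading_dirs=True)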
Options
archive

archive file or a key (if –key specified). Constraints: value must be a string

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to save. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--annex ANNEX

DEPRECATED. Use the ‘dataset’ parameter instead.

--add-archive-leading-dir

place extracted content under a directory which would correspond to the archive name with all suffixes stripped. E.g. the content of archive.tar.gz will be extracted under archive/.

--strip-leading-dirs

remove one or more leading directories from the archive layout on extraction.

--leading-dirs-depth LEADING_DIRS_DEPTH

maximum depth of leading directories to strip. If not specified (None), no limit.

--leading-dirs-consider LEADING_DIRS_CONSIDER

regular expression(s) for directories to consider to strip away. Constraints: value must be a string or value must be NONE

--use-current-dir

extract the archive under the current directory, not the directory where the archive is located. This parameter is applied automatically if –key was used.

-D, --delete

delete the original archive from the filesystem/Git in the current tree. Note that it has no effect if –key is given.

--key

signal if provided archive is not actually a filename on its own but an annex key. The archive will be extracted in the current directory.

-e EXCLUDE, --exclude EXCLUDE

regular expressions for filenames which to exclude from being added to annex. Applied after –rename if that one is specified. For exact matching, use anchoring. Constraints: value must be a string or value must be NONE

-r RENAME, --rename RENAME

regular expressions to rename files before adding them to Git. The first character defines how to split the provided string into two parts: a Python regular expression (with groups), and a replacement string. Constraints: value must be a string or value must be NONE

--existing {fail,overwrite,archive-suffix,numeric-suffix}

what operation to perform if a file from an archive tries to overwrite an existing file with the same name. ‘fail’ (default) leads to an error result, ‘overwrite’ silently replaces the existing file, ‘archive-suffix’ instructs to add a suffix (prefixed with a ‘-’) matching the name of the archive from which the file gets extracted, and if that name is present as well, ‘numeric-suffix’ takes effect in addition, adding an incremental numeric suffix (prefixed with a ‘.’) until no name collision is detected any longer. [Default: ‘fail’]

-o ANNEX_OPTIONS, --annex-options ANNEX_OPTIONS

additional options to pass to git-annex. Constraints: value must be a string or value must be NONE

--copy

copy the content of the archive instead of moving.

--no-commit

don’t commit upon completion.

--allow-dirty

flag that operating on a dirty repository (uncommitted or untracked content) is ok.

--stats STATS

ActivityStats instance for global tracking.

--drop-after

drop extracted files after adding to annex.

--delete-after

extract under a temporary directory, git-annex add, and delete afterwards. To be used to “index” files within annex without actually creating corresponding files under git. Note that annex dropunused would later remove that load.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad clean
Synopsis
datalad clean [-h] [-d DATASET] [--what [WHAT ...]] [--dry-run] [-r] [-R LEVELS]
    [--version]
Description

Clean up after DataLad (possible temporary files etc.)

Removes temporary files and directories left behind by DataLad and git-annex in a dataset.

Examples

Clean all known temporary locations of a dataset:

% datalad clean

Report on all existing temporary locations of a dataset:

% datalad clean --dry-run

Clean all known temporary locations of a dataset and all its subdatasets:

% datalad clean -r

Clean only the archive extraction caches of a dataset and all its subdatasets:

% datalad clean --what cached-archives -r

Report on existing annex transfer files of a dataset and all its subdatasets:

% datalad clean --what annex-transfer -r --dry-run
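
The Python API counterpart is api.clean (listed in the module reference below); dataset, what, and dry_run appear in its documented signature, and recursive is assumed to mirror -r:

import datalad.api as dl

# report (but do not remove) annex transfer leftovers across the dataset hierarchy
dl.clean(dataset='.', what=['annex-transfer'], dry_run=True, recursive=True)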
Options
-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to perform the clean operation on. If no dataset is given, an attempt is made to identify the dataset in the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--what [WHAT …]

What to clean. If none specified – all known targets are considered. Constraints: value must be one of (‘cached-archives’, ‘annex-tmp’, ‘annex-transfer’, ‘search-index’) or value must be NONE

--dry-run

Report on cleanable locations - not actually cleaning up anything.

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad check-dates
Synopsis
datalad check-dates [-h] [-D DATE] [--rev REVISION] [--annex {all|tree|none}] [--no-tags]
    [--older] [--version] [PATH ...]
Description

Find repository dates that are more recent than a reference date.

The main purpose of this tool is to find “leaked” real dates in repositories that are configured to use fake dates. It checks dates from three sources: (1) commit timestamps (author and committer dates), (2) timestamps within files of the “git-annex” branch, and (3) the timestamps of annotated tags.
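
A rough Python API equivalent is api.check_dates (listed in the module reference below); paths and reference_date appear in its documented signature:

import datalad.api as dl

# search repositories under the current directory for dates newer than the default reference
dl.check_dates(['.'], reference_date='@1514764800')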

Options
PATH

Root directory in which to search for Git repositories. The current working directory will be used by default. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-D DATE, --reference-date DATE

Compare dates to this date. If dateutil is installed, this value can be any format that its parser recognizes. Otherwise, it should be a unix timestamp that starts with a “@”. The default value corresponds to 01 Jan, 2018 00:00:00 -0000. Constraints: value must be a string [Default: ‘@1514764800’]

--rev REVISION

Search timestamps from commits that are reachable from REVISION. Any revision specification supported by git log, including flags like –all and –tags, can be used. This option can be given multiple times.

--annex {all|tree|none}

Mode for “git-annex” branch search. If ‘all’, all blobs within the branch are searched. ‘tree’ limits the search to blobs that are referenced by the tree at the tip of the branch. ‘none’ disables search of “git-annex” blobs. Constraints: value must be one of (‘all’, ‘tree’, ‘none’) [Default: ‘all’]

--no-tags

Don’t check the dates of annotated tags.

--older

Find dates which are older than the reference date rather than newer.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad configuration
Synopsis
datalad configuration [-h] [--scope {global|local|branch}] [-d DATASET] [-r] [-R LEVELS]
    [--version] [{dump|get|set|unset}] [name[=value] ...]
Description

Get and set dataset, dataset-clone-local, or global configuration

This command works similarly to git-config, but some features are not supported (e.g., modifying system configuration), while other features are not available in git-config (e.g., multi-configuration queries).

Query and modification of three distinct configuration scopes is supported:

  • ‘branch’: the persistent configuration in .datalad/config of a dataset branch

  • ‘local’: a dataset clone’s Git repository configuration in .git/config

  • ‘global’: non-dataset-specific configuration (usually in $USER/.gitconfig)

Modifications of the persistent ‘branch’ configuration will not be saved by this command, but have to be committed with a subsequent SAVE call.

Rules of precedence regarding different configuration scopes are the same as in Git, with two exceptions: 1) environment variables can be used to override any datalad configuration, and have precedence over any other configuration scope (see below). 2) the ‘branch’ scope is considered in addition to the standard git configuration scopes. Its content has lower precedence than Git configuration scopes, but it is committed to a branch, hence can be used to ship (default and branch-specific) configuration with a dataset.

Besides storing configuration settings statically via this command or git config, DataLad also reads any DATALAD_* environment on process startup or import, and maps it to a configuration item. Their values take precedence over any other specification. In variable names _ encodes a . in the configuration name, and __ encodes a -, such that DATALAD_SOME__VAR is mapped to datalad.some-var. Additionally, a DATALAD_CONFIG_OVERRIDES_JSON environment variable is queried, which may contain configuration key-value mappings as a JSON-formatted string of a JSON-object:

DATALAD_CONFIG_OVERRIDES_JSON='{"datalad.credential.example_com.user": "jane", ...}'

This is useful when characters are part of the configuration key that cannot be encoded into an environment variable name. If both individual configuration variables and JSON-overrides are used, the former take precedence over the latter, overriding the respective individual settings from configurations declared in the JSON-overrides.

This command supports recursive operation for querying and modifying configuration across a hierarchy of datasets.

Examples

Dump the effective configuration, including an annotation for common items:

% datalad configuration

Query two configuration items:

% datalad configuration get user.name user.email

Recursively set configuration in all (sub)dataset repositories:

% datalad configuration -r set my.config=value

Modify the persistent branch configuration (changes are not committed):

% datalad configuration --scope branch set my.config=value
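
Via the Python API the same operations are available through api.configuration (listed in the module reference below); action, spec, and scope appear in its documented signature. A sketch, assuming the spec is given as ‘name’ or ‘name=value’ strings as in the CLI:

import datalad.api as dl

# query two configuration items
dl.configuration('get', ['user.name', 'user.email'], dataset='.')

# set a value in the committed branch configuration (remember the subsequent save)
dl.configuration('set', 'my.config=value', scope='branch', dataset='.')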
Options
{dump|get|set|unset}

which action to perform. Constraints: value must be one of (‘dump’, ‘get’, ‘set’, ‘unset’) [Default: ‘dump’]

name[=value]

configuration name (for actions ‘get’ and ‘unset’), or name/value pair (for action ‘set’).

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--scope {global|local|branch}

scope for getting or setting configuration. If no scope is declared for a query, all configuration sources (including overrides via environment variables) are considered according to the normal rules of precedence. For action ‘get’ only ‘branch’ and ‘local’ (which include ‘global’ here) are supported. For action ‘dump’, a scope selection is ignored and all available scopes are considered. Constraints: value must be one of (‘global’, ‘local’, ‘branch’)

-d DATASET, --dataset DATASET

specify the dataset to query or to configure. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad create-test-dataset
Synopsis
datalad create-test-dataset [-h] [--spec SPEC] [--seed SEED] [--version] path
Description

Create test (meta-)dataset.

Options
path

path/name where to create (if specified, must not exist). Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--spec SPEC

spec for hierarchy, defined as a min-max (min could be omitted to assume 0) defining how many (a random number from min to max) sub-datasets to generate at any given level of the hierarchy. Levels are separated from each other with /. Example: 1-3/-2 would generate from 1 to 3 subdatasets at the top level, and up to two within those at the 2nd level. Constraints: value must be a string or value must be NONE

--seed SEED

seed for rng. Constraints: value must be convertible to type ‘int’ or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad download-url
Synopsis
datalad download-url [-h] [-d PATH] [-O PATH] [-o] [--archive] [--nosave] [-m MESSAGE]
    [--version] url [url ...]
Description

Download content

It allows for a uniform download interface to various supported URL schemes (see command help for details), re-using or asking for authentication details maintained by datalad.

Examples

Download files from an http and S3 URL:

% datalad download-url http://example.com/file.dat s3://bucket/file2.dat

Download a file to a path and provide a commit message:

% datalad download-url -m 'added a file' -O myfile.dat \
  s3://bucket/file2.dat

Append a trailing slash to the target path to download into a specified directory:

% datalad download-url --path=data/ http://example.com/file.dat

Leave off the trailing slash to download into a regular file:

% datalad download-url --path=data http://example.com/file.dat
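
From Python, the same functionality is available as api.download_url (listed in the module reference below); urls, dataset, and path appear in its documented signature, while message is assumed to mirror -m:

import datalad.api as dl

# download into the 'data/' directory of the current dataset and save with a message
dl.download_url(
    ['http://example.com/file.dat'],
    dataset='.',
    path='data/',
    message='Add example file',
)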
Options
url

URL(s) to be downloaded. Supported protocols: ‘ftp’, ‘http’, ‘https’, ‘s3’, ‘shub’. Constraints: value must be a string

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d PATH, --dataset PATH

specify the dataset to add files to. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Use –nosave to prevent adding files to the dataset. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-O PATH, --path PATH

target for download. If the path has a trailing separator, it is treated as a directory, and each specified URL is downloaded under that directory to a base name taken from the URL. Without a trailing separator, the value specifies the name of the downloaded file (file name extensions inferred from the URL may be added to it, if they are not yet present) and only a single URL should be given. In both cases, leading directories will be created if needed. This argument defaults to the current directory. Constraints: value must be a string or value must be NONE

-o, --overwrite

flag to overwrite the target file if it already exists.

--archive

pass the downloaded files to datalad add-archive-content –delete.

--nosave

by default all modifications to a dataset are immediately saved. Giving this option will disable this behavior.

-m MESSAGE, --message MESSAGE

a description of the state or the changes made to a dataset. Constraints: value must be a string or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad foreach-dataset
Synopsis
datalad foreach-dataset [-h] [--cmd-type {auto|external|exec|eval}] [-d DATASET] [--state
    {present|absent|any}] [-r] [-R LEVELS] [--contains PATH]
    [--bottomup] [-s] [--output-streams
    {capture|pass-through|relpath}] [--chpwd {ds|pwd}]
    [--safe-to-consume {auto|all-subds-done|superds-done|always}]
    [-J NJOBS] [--version] ...
Description

Run a command or Python code on the dataset and/or each of its sub-datasets.

This command provides a convenience for cases where no dedicated DataLad command is provided to operate across the hierarchy of datasets. It is very similar to the git submodule foreach command, with the following major differences:

  • by default (unless –subdatasets-only) it would include operation on the original dataset as well,

  • subdatasets could be traversed in bottom-up order,

  • can execute commands in parallel (see the JOBS option), but accounts for the order, e.g. in bottom-up order the command is executed in a super-dataset only after it has been executed in all of its subdatasets.

Additional notes:

  • for execution of “external” commands we use the environment used to execute external git and git-annex commands.

Command format

–cmd-type external: A few placeholders are supported in the command via Python format specification:

  • “{pwd}” will be replaced with the full path of the current working directory.

  • “{ds}” and “{refds}” will provide instances of the dataset currently operated on and the reference “context” dataset which was provided via dataset argument.

  • “{tmpdir}” will be replaced with the full path of a temporary directory.

Examples

Aggressively git clean all datasets, running 5 parallel jobs:

% datalad foreach-dataset -r -J 5 git clean -dfx
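
The Python API counterpart is api.foreach_dataset (listed in the module reference below); cmd and cmd_type appear in its documented signature, while recursive and jobs are assumptions mirroring -r and -J:

import datalad.api as dl

# run 'git clean -dfx' in the dataset and all subdatasets, five jobs in parallel
dl.foreach_dataset('git clean -dfx', dataset='.', recursive=True, jobs=5)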
Options
COMMAND

command for execution. A leading ‘–’ can be used to disambiguate this command from the preceding options to DataLad. For –cmd-type exec or eval only a single command argument (Python code) is supported.

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--cmd-type {auto|external|exec|eval}

type of the command. ‘external’: to be run in a child process using the dataset’s runner; ‘exec’: Python source code to execute using ‘exec()’, no value returned; ‘eval’: Python source code to evaluate using ‘eval()’, the return value is placed into the ‘result’ field. ‘auto’: if used via the Python API and cmd is a Python function, ‘eval’ is used, otherwise ‘external’ is assumed. Constraints: value must be one of (‘auto’, ‘external’, ‘exec’, ‘eval’) [Default: ‘auto’]

-d DATASET, --dataset DATASET

specify the dataset to operate on. If no dataset is given, an attempt is made to identify the dataset based on the input and/or the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--state {present|absent|any}

indicate which (sub)datasets to consider: either only locally present, absent, or any of those two kinds. Constraints: value must be one of (‘present’, ‘absent’, ‘any’) [Default: ‘present’]

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type ‘int’ or value must be NONE

--contains PATH

limit to the subdatasets containing the given path. If a root path of a subdataset is given, the last considered dataset will be the subdataset itself. This option can be given multiple times, in which case datasets that contain any of the given paths will be considered. Constraints: value must be a string or value must be NONE

--bottomup

whether to report subdatasets in bottom-up order along each branch in the dataset tree, and not top-down.

-s, --subdatasets-only

whether to exclude top level dataset. It is implied if a non-empty CONTAINS is used.

--output-streams {capture|pass-through|relpath}, --o-s {capture|pass-through|relpath}

ways to handle outputs. ‘capture’ captures and returns outputs from ‘cmd’ in the record (‘stdout’, ‘stderr’); ‘pass-through’ passes them to the screen (and they are thus absent from the returned record); ‘relpath’ prefixes captured output with a relative path (similar to what grep does) and writes it to stdout and stderr. In ‘relpath’, the path is relative to the top of the dataset if DATASET is specified, and otherwise relative to the current directory. Constraints: value must be one of (‘capture’, ‘pass-through’, ‘relpath’) [Default: ‘pass-through’]

--chpwd {ds|pwd}

‘ds’ will change working directory to the top of the corresponding dataset. With ‘pwd’ no change of working directory will happen. Note that for Python commands, due to use of threads, we do not allow chdir=ds to be used with jobs > 1. Hint: use ‘ds’ and ‘refds’ objects’ methods to execute commands in the context of those datasets. Constraints: value must be one of (‘ds’, ‘pwd’) [Default: ‘ds’]

--safe-to-consume {auto|all-subds-done|superds-done|always}

Important only in the case of parallel (jobs greater than 1) execution. ‘all-subds-done’ instructs to not consider a superdataset until the command finished execution in all of its subdatasets (this is the value in case of ‘auto’ if traversal is bottom-up). ‘superds-done’ instructs to not process subdatasets until the command finished in the super-dataset (this is the value in case of ‘auto’ if traversal is not bottom-up, which is the default). With ‘always’ there is no constraint on execution order between sub- and super-datasets. Constraints: value must be one of (‘auto’, ‘all-subds-done’, ‘superds-done’, ‘always’) [Default: ‘auto’]

-J NJOBS, --jobs NJOBS

how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. Constraints: value must be convertible to type ‘int’ or value must be NONE or value must be one of (‘auto’,)

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad sshrun
Synopsis
datalad sshrun [-h] [-p PORT] [-4] [-6] [-o OPTION] [-n] [--version] login cmd
Description

Run command on remote machines via SSH.

This is a replacement for a small part of the functionality of SSH. In addition to SSH alone, this command can make use of datalad’s SSH connection management. Its primary use case is to be used with Git as ‘core.sshCommand’ or via “GIT_SSH_COMMAND”.

Configure datalad.ssh.identityfile to pass a file to the ssh’s -i option.

Options
login

[user@]hostname.

cmd

command for remote execution.

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-p PORT, --port PORT

port to connect to on the remote host.

-4

use IPv4 addresses only.

-6

use IPv6 addresses only.

-o OPTION

configuration option passed to SSH.

-n

Do not connect stdin to the process.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad shell-completion
Synopsis
datalad shell-completion [-h] [--version]
Description

Display shell script for enabling shell completion for DataLad.

Output of this command should be “sourced” by bash or zsh to enable shell completions provided by argcomplete.

Example:

$ source <(datalad shell-completion)
$ datalad --<PRESS TAB to display available option>

Options
-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad wtf
Synopsis
datalad wtf [-h] [-d DATASET] [-s {some|all}] [-S SECTION] [--flavor {full|short}]
    [-D {html_details}] [-c] [--version]
Description

Generate a report about the DataLad installation and configuration

IMPORTANT: Sharing this report with untrusted parties (e.g. on the web) should be done with care, as it may include identifying information, and/or credentials or access tokens.
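
From Python the report can be generated with api.wtf (listed in the module reference below); the sections keyword appears in its documented signature:

import datalad.api as dl

# restrict the report to the datalad and git-annex sections
dl.wtf(sections=['datalad', 'git-annex'])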

Options
-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to report on. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-s {some|all}, --sensitive {some|all}

if set to ‘some’ or ‘all’, it will display sections such as config and metadata which could potentially contain sensitive information (credentials, names, etc.). If ‘some’, the fields which are known to be sensitive will still be masked out. Constraints: value must be one of (‘some’, ‘all’)

-S SECTION, --section SECTION

section to include. If not set - depends on flavor. ‘*’ could be used to force all sections. If there are subsections like section.subsection available, then specifying just ‘section’ would select all subsections for that section. This option can be given multiple times. Constraints: value must be one of (‘configuration’, ‘credentials’, ‘datalad’, ‘dataset’, ‘dependencies’, ‘environment’, ‘extensions’, ‘git-annex’, ‘location’, ‘metadata’, ‘metadata.extractors’, ‘metadata.filters’, ‘metadata.indexers’, ‘python’, ‘system’, ‘*’)

--flavor {full|short}

Flavor of WTF. ‘full’ would produce markdown with exhaustive list of sections. ‘short’ will provide a condensed summary only of datalad and dependencies by default. Use –section to list other sections. Constraints: value must be one of (‘full’, ‘short’) [Default: ‘full’]

-D {html_details}, --decor {html_details}

decoration around the rendering to facilitate embedding into issues etc, e.g. use ‘html_details’ for posting collapsible entry to GitHub issues. Constraints: value must be one of (‘html_details’,)

-c, --clipboard

if set, do not print but copy to clipboard (requires pyperclip module).

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Deprecated commands
datalad uninstall
Synopsis
datalad uninstall [-h] [-d DATASET] [-r] [--nocheck] [--if-dirty
    {fail,save-before,ignore}] [--version] [PATH ...]
Description

DEPRECATED: use the DROP command

Options
PATH

path/name of the component to be uninstalled. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. –help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to perform the operation on. If no dataset is given, an attempt is made to identify a dataset based on the PATH given. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-r, --recursive

if set, recurse into potential subdatasets.

--nocheck

whether to perform checks to assure the configured minimum number of (remote) sources for data. Give this option to skip checks.

--if-dirty {fail,save-before,ignore}

desired behavior if a dataset with unsaved changes is discovered: ‘fail’ will trigger an error and further processing is aborted; ‘save-before’ will save all changes prior to any further action; ‘ignore’ lets datalad proceed as if the dataset had no unsaved changes. [Default: ‘save-before’]

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Python module reference

This module reference extends the manual with a comprehensive overview of the available functionality built into datalad. Each module in the package is documented by a general summary of its purpose and the list of classes and functions it provides.

High-level user interface
Dataset operations

api.Dataset(*args, **kwargs)

Representation of a DataLad dataset/repository

api.create([path, initopts, force, ...])

Create a new dataset from scratch.

api.create_sibling(sshurl, *[, name, ...])

Create a dataset sibling on a UNIX-like Shell (local or SSH)-accessible machine

api.create_sibling_github(reponame, *[, ...])

Create dataset sibling on GitHub.org (or an enterprise deployment).

api.create_sibling_gitlab([path, site, ...])

Create dataset sibling at a GitLab site

api.create_sibling_gogs(reponame, *[, api, ...])

Create a dataset sibling on a GOGS site

api.create_sibling_gitea(reponame, *[, ...])

Create a dataset sibling on a Gitea site

api.create_sibling_gin(reponame, *[, ...])

Create a dataset sibling on a GIN site (with content hosting)

api.create_sibling_ria(url, name, *[, ...])

Creates a sibling to a dataset in a RIA store

api.drop([path, what, reckless, dataset, ...])

Drop content of individual files or entire (sub)datasets

api.get([path, source, dataset, recursive, ...])

Get any dataset content (files/directories/subdatasets).

api.install([path, source, dataset, ...])

Install one or many datasets from remote URL(s) or local PATH source(s).

api.push([path, dataset, to, since, data, ...])

Push a dataset to a known sibling.

api.remove([path, dataset, drop, reckless, ...])

Remove components from datasets

api.save([path, message, dataset, ...])

Save the current state of a dataset

api.status([path, dataset, annex, ...])

Report on the state of dataset content.

api.update([path, sibling, merge, how, ...])

Update a dataset from a sibling.

api.unlock([path, dataset, recursive, ...])

Unlock file(s) of a dataset

Reproducible execution

api.run([cmd, dataset, inputs, outputs, ...])

Run an arbitrary shell command and record its impact on a dataset.

api.rerun([revision, since, dataset, ...])

Re-execute previous datalad run commands.

api.run_procedure([spec, dataset, discover, ...])

Run prepared procedures (DataLad scripts) on a dataset

Plumbing commands

api.clean(*[, dataset, what, dry_run, ...])

Clean up after DataLad (possible temporary files etc.)

api.clone(source[, path, git_clone_opts, ...])

Obtain a dataset (copy) from a URL or local directory

api.copy_file([path, dataset, recursive, ...])

Copy files and their availability metadata from one dataset to another.

api.create_test_dataset([path, spec, seed])

Create test (meta-)dataset.

api.diff([path, fr, to, dataset, annex, ...])

Report differences between two states of a dataset (hierarchy)

api.download_url(urls, *[, dataset, path, ...])

Download content

api.foreach_dataset(cmd, *[, cmd_type, ...])

Run a command or Python code on the dataset and/or each of its sub-datasets.

api.siblings([action, dataset, name, url, ...])

Manage sibling configuration

api.sshrun(login, cmd, *[, port, ipv4, ...])

Run command on remote machines via SSH.

api.subdatasets([path, dataset, state, ...])

Report subdatasets and their properties.

Miscellaneous commands

api.add_archive_content(archive, *[, ...])

Add content of an archive under git annex control.

api.add_readme([filename, dataset, existing])

Add basic information about DataLad datasets to a README file

api.addurls(urlfile, urlformat, ...[, ...])

Create and update a dataset from a list of URLs.

api.check_dates(paths, *[, reference_date, ...])

Find repository dates that are more recent than a reference date.

api.configuration([action, spec, scope, ...])

Get and set dataset, dataset-clone-local, or global configuration

api.export_archive([filename, dataset, ...])

Export the content of a dataset as a TAR/ZIP archive.

api.export_archive_ora(target[, opts, ...])

Export an archive of a local annex object store for the ORA remote.

api.export_to_figshare([filename, dataset, ...])

Export the content of a dataset as a ZIP archive to figshare

api.no_annex(dataset, pattern[, ref_dir, ...])

Configure a dataset to never put some content into the dataset's annex

api.shell_completion()

Display shell script for enabling shell completion for DataLad.

api.wtf(*[, dataset, sensitive, sections, ...])

Generate a report about the DataLad installation and configuration

Support functionality

cmd

Class that starts a subprocess and keeps it around to communicate with it via stdin.

consts

constants for datalad

log

Logging setup and utilities, including progress reporting

utils

version

support.gitrepo

Internal low-level interface to Git repositories

support.annexrepo

Interface to git-annex by Joey Hess.

support.archives

Various handlers/functionality for different types of files (e.g. for archives).

support.extensions

Support functionality for extension development

customremotes.base

Base classes to custom git-annex remotes (e.g. extraction from archives).

customremotes.archives

Custom remote to get the load from archives present under annex

runner.nonasyncrunner

Thread based subprocess execution with stdout and stderr passed to protocol objects

runner.protocol

Base class of a protocol to be used with the DataLad runner

Configuration management

config

Test infrastructure

tests.utils_pytest

Miscellaneous utilities to assist with testing

tests.utils_testrepos

tests.heavyoutput

Helper to provide heavy load on stdout and stderr

Command interface

interface.base

High-level interface generation

Command line interface infrastructure

cli.exec

Call a command interface

cli.main

This is the main() CLI entrypoint

cli.parser

Components to build the parser instance for the CLI

cli.renderer

Render results in a terminal

Configuration

DataLad uses the same configuration mechanism and syntax as Git itself. Consequently, datalad can be configured using the git config command. Both a global user configuration (typically at ~/.gitconfig), and a local repository-specific configuration (.git/config) are inspected.

In addition, datalad supports a persistent dataset-specific configuration. This configuration is stored at .datalad/config in any dataset. As it is part of a dataset, settings stored there will also be in effect for any consumer of such a dataset. Both global and local settings on a particular machine always override configuration shipped with a dataset.

All datalad-specific configuration variables are prefixed with datalad..

It is possible to override or amend the configuration using environment variables. Any variable with a name that starts with DATALAD_ will be available as the corresponding datalad. configuration variable, replacing any __ (two underscores) with a hyphen, then any _ (single underscore) with a dot, and finally converting all letters to lower case. Values from environment variables take precedence over configuration file settings.
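
The name mapping can be illustrated with a small standalone sketch (this helper is not part of DataLad; it merely mimics the rule described above):

def env_to_config_name(env_name: str) -> str:
    # replace '__' with '-', then '_' with '.', then lower-case everything
    return env_name.replace('__', '-').replace('_', '.').lower()

# e.g. DATALAD_RUNTIME_MAX__ANNEX__JOBS maps to datalad.runtime.max-annex-jobs
assert env_to_config_name('DATALAD_RUNTIME_MAX__ANNEX__JOBS') == 'datalad.runtime.max-annex-jobs'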

In addition, the DATALAD_CONFIG_OVERRIDES_JSON environment variable can be set to a JSON record with configuration values. This is particularly useful for options that aren’t accessible through the naming scheme described above (e.g., an option name that includes an underscore).

The following sections provide a (non-exhaustive) list of settings honored by datalad. They are categorized according to the scope they are typically associated with.

Global user configuration
datalad.clone.url-substitute.github

GitHub URL substitution rule: Mangling for GitHub-related URLs. A substitution specification is a string with a match and substitution expression, each following Python’s regular expression syntax. Both expressions are concatenated to a single string with an arbitrary delimiter character. The delimiter is defined by prefixing the string with the delimiter. Prefix and delimiter are stripped from the expressions (Example: “,^http://(.*)$,https://\1”). This setting can be defined multiple times. Substitutions will be applied incrementally, in order of their definition. The first substitution in such a series must match, otherwise no further substitutions in a series will be considered. However, following the first match all further substitutions in a series are processed, regardless whether intermediate expressions match or not. Default: (‘,https?://github.com/([^/]+)/(.*)$,\1###\2’, ‘,[/\\]+(?!$),-’, ‘,\s+|(%2520)+|(%20)+,_’, ‘,([^#]+)###(.*),https://github.com/\1/\2’)

datalad.clone.url-substitute.osf

Open Science Framework URL substitution rule: Mangling for OSF-related URLs. A substitution specification is a string with a match and substitution expression, each following Python’s regular expression syntax. Both expressions are concatenated into a single string with an arbitrary delimiter character. The delimiter is defined by prefixing the string with the delimiter. Prefix and delimiter are stripped from the expressions (Example: “,^http://(.*)$,https://\1”). This setting can be defined multiple times. Substitutions will be applied incrementally, in order of their definition. The first substitution in such a series must match, otherwise no further substitutions in the series will be considered. However, following the first match all further substitutions in the series are processed, regardless of whether intermediate expressions match or not. Default: (‘,^https://osf.io/([^/]+)[/]*$,osf://\1’,)
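The following sketch illustrates, in simplified form (not DataLad’s actual implementation), how a series of substitution specifications in the format described above can be parsed and applied:

    import re

    def apply_substitutions(url, specs):
        """Apply a series of substitution specifications to a URL.

        Each spec is '<delim><match><delim><substitution>'. The first rule
        must match for the series to apply; later rules are applied
        regardless of whether they match.
        """
        rules = []
        for spec in specs:
            delim = spec[0]
            match, subst = spec[1:].split(delim, 1)
            rules.append((match, subst))
        if not re.search(rules[0][0], url):
            return url  # first rule did not match: leave the URL untouched
        for match, subst in rules:
            url = re.sub(match, subst, url)
        return url

    # The default OSF rule turns 'https://osf.io/q8xnk/' into 'osf://q8xnk'
    print(apply_substitutions("https://osf.io/q8xnk/",
                              (",^https://osf.io/([^/]+)[/]*$,osf://\\1",)))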

datalad.extensions.load

DataLad extension packages to load: Indicate which extension packages should be loaded unconditionally on CLI startup or on importing ‘datalad.[core]api’. This enables the respective extensions to customize DataLad with functionality and configurability outside the scope of extension commands. For merely running extension commands it is not necessary to load them specifically. Default: None
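For instance, unconditional loading of an extension package could be configured like this. The extension name ‘next’ is only an example, and the use of git config --add to accumulate multiple values is an assumption here:

    import subprocess

    # Persistently request loading of an extension package on every startup;
    # 'next' (the datalad-next extension) is used purely as an example name.
    subprocess.run(
        ["git", "config", "--global", "--add",
         "datalad.extensions.load", "next"],
        check=True,
    )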

datalad.externals.nda.dbserver

NDA database server: Hostname of the database server Default: https://nda.nih.gov/DataManager/dataManager

datalad.locations.cache

Cache directory: Where should datalad cache files? Default: ~/.cache/datalad

datalad.locations.default-dataset

Default dataset path: Where should datalad look for (or install) a default dataset? Default: ~/datalad

datalad.locations.extra-procedures

Extra procedure directory: Where should datalad search for some additional procedures?

datalad.locations.locks

Lockfile directory: Where should datalad store lock files? Default: ~/.cache/datalad/locks

datalad.locations.sockets

Socket directory: Where should datalad store socket files? Default: ~/.cache/datalad/sockets

datalad.locations.system-procedures

System procedure directory: Where should datalad search for system procedures? Default: /etc/xdg/datalad/procedures

datalad.locations.user-procedures

User procedure directory: Where should datalad search for user procedures? Default: ~/.config/datalad/procedures

datalad.ssh.executable

Name of ssh executable for ‘datalad sshrun’: Specifies the name of the ssh-client executable that datalad will use. This might be an absolute path. On Windows systems it is currently by default set to point to the ssh executable of OpenSSH for Windows, if OpenSSH for Windows is installed. On other systems it defaults to ‘ssh’. Default: ssh

[value must be a string]

datalad.ssh.identityfile

If set, pass this file as ssh’s -i option.: Default: None

datalad.ssh.multiplex-connections

Whether to use a single shared connection for multiple SSH processes aiming at the same target.: Default: True

[value must be convertible to type bool]

datalad.ssh.try-use-annex-bundled-git

Whether to attempt adjusting the PATH in a remote shell to include Git binaries located in a detected git-annex bundle: If enabled, this will be a ‘best-effort’ attempt that only supports remote hosts with a Bourne shell and the which command available. The remote PATH must already contain a git-annex installation. If git-annex is not found, or the detected git-annex does not have a bundled Git installation, detection failure will not result in an error, but will only slow remote execution by a one-time sensing overhead per opened connection. Default: False

[value must be convertible to type bool]

datalad.tests.cache

Cache directory for tests: Where should datalad cache test files? Default: ~/.cache/datalad/tests

datalad.tests.credentials

Credentials to use during tests: Which credentials should be available while running tests? If “plaintext” (default), a new plaintext keyring is created in the tests’ temporary HOME. If “system”, no custom configuration is passed to keyring, and credentials known to the system may be used. Default: plaintext

[value must be one of (‘plaintext’, ‘system’)]

Local repository configuration
datalad.crawl.cache

Crawler download caching: Should the crawler cache downloaded files?

[value must be convertible to type bool]

datalad.fake-dates

Fake (anonymize) dates: Should the dates in the logs be faked? Default: False

[value must be convertible to type bool]

Sticky dataset configuration
datalad.locations.dataset-procedures

Dataset procedure directory: Where should datalad search for dataset procedures (relative to a dataset root)? Default: .datalad/procedures

Miscellaneous configuration
datalad.annex.retry

Value for annex.retry to use for git-annex calls: On transfer failure, annex.retry (sans “datalad.”) controls the number of times that git-annex retries. DataLad will call git-annex with annex.retry set to the value here unless annex.retry is explicitly configured. Default: 3

[value must be convertible to type ‘int’]

datalad.credentials.force-ask

Force (re-)entry of credentials: Should DataLad prompt for credential (re-)entry? This can be used to update previously stored credentials. Default: False

[value must be convertible to type bool]

datalad.credentials.githelper.noninteractive

Non-interactive mode for git-credential helper: Should git-credential-datalad operate in non-interactive mode? In that mode it will not ask for user confirmation when storing new credentials/provider configurations. Default: False

[value must be convertible to type bool]

datalad.exc.str.tblimit

This flag is used by datalad to cap the number of traceback steps included in exception logging and result reporting to this many pre-processed entries from the traceback.:

datalad.fake-dates-start

Initial fake date: When faking dates and there are no commits in any local branches, generate the date by adding one second to this value (Unix epoch time). The value must be positive. Default: 1112911993

[value must be convertible to type ‘int’]

datalad.github.token-note

GitHub token note: Description for a Personal access token to generate. Default: DataLad

datalad.install.inherit-local-origin

Inherit local origin of dataset source: If enabled, a local ‘origin’ remote of a local dataset clone source is configured as an ‘origin-2’ remote to make its annex automatically available. The process is repeated recursively for any further qualifying ‘origin’ dataset thereof. Note that if clone.defaultRemoteName is configured to use a name other than ‘origin’, that name will be used instead. Default: True

[value must be convertible to type bool]

datalad.log.level

Used to control the verbosity of logs printed to stdout while running datalad commands/debugging:

datalad.log.name

Include name of the log target in the log line:

datalad.log.names

Which names (,-separated) to print log lines for:

datalad.log.namesre

Regular expression for which names to print log lines for:

datalad.log.outputs

Whether to log stdout and stderr for executed commands: When enabled, setting the log level to 5 should catch all execution output, though some output may be logged at higher levels. Default: False

[value must be convertible to type bool]

datalad.log.result-level

Log level for command result messages: If ‘match-status’, it will log ‘impossible’ results as a warning, ‘error’ results as errors, and everything else as ‘debug’. Otherwise the indicated log level will be used for all such messages. Default: debug

[value must be one of (‘debug’, ‘info’, ‘warning’, ‘error’, ‘match-status’)]

datalad.log.timestamp

Used to add a timestamp to datalad logs: Default: False

[value must be convertible to type bool]

datalad.log.traceback

Includes a compact traceback in a log message, with generic components removed. This setting is only in effect when given as an environment variable DATALAD_LOG_TRACEBACK. An integer value specifies the maximum traceback depth to be considered. If set to “collide”, a common traceback prefix between a current traceback and a previously logged traceback is replaced with “…” (maximum depth 100).:

datalad.repo.backend

git-annex backend: Backend to use when creating git-annex repositories Default: MD5E

datalad.repo.direct

Direct Mode for git-annex repositories: Set this flag to create annex repositories in direct mode by default Default: False

[value must be convertible to type bool]

datalad.repo.version

git-annex repository version: Specifies the repository version for git-annex to be used by default Default: 8

[value must be convertible to type ‘int’]

datalad.runtime.max-annex-jobs

Maximum number of git-annex jobs to request when the “jobs” option is set to “auto” (default): Set this value to enable parallel annex jobs that may speed up certain operations (e.g. get file content). The effective number of jobs will not exceed the number of available CPU cores (or 3 if there are fewer than 3 cores). Default: 1

[value must be convertible to type ‘int’]

datalad.runtime.max-batched

Maximum number of batched commands to run in parallel: Automatic cleanup of batched commands will try to keep at most this many commands running. Default: 20

[value must be convertible to type ‘int’]

datalad.runtime.max-inactive-age

Maximum time (in seconds) a batched command can be inactive before it is eligible for cleanup: Automatic cleanup of batched commands will consider an inactive command eligible for cleanup if more than this many seconds have transpired since the command’s last activity. Default: 60

[value must be convertible to type ‘int’]

datalad.runtime.max-jobs

Maximum number of jobs DataLad can run in “parallel”: Set this value to enable parallel multi-threaded DataLad jobs that may speed up certain operations, in particular operations across multiple datasets (e.g., installing multiple subdatasets). Default: 1

[value must be convertible to type ‘int’]

datalad.runtime.pathspec-from-file

Provide the list of files to git commands via --pathspec-from-file: Instructs when DataLad will provide a list of paths to ‘git’ commands that support the --pathspec-from-file option via a temporary file. If set to ‘multi-chunk’, this is done only if multiple invocations of the command on chunks of the file list would otherwise be needed. If set to ‘always’, DataLad will always use --pathspec-from-file. Default: multi-chunk

[value must be one of (‘multi-chunk’, ‘always’)]

datalad.runtime.raiseonerror

Error behavior: Set this flag to cause DataLad to raise an exception on errors that would otherwise just be logged. Default: False

[value must be convertible to type bool]

datalad.runtime.report-status

Command line result reporting behavior: If set (to other than ‘all’), constrains the command result report to records matching the given status. ‘success’ is a synonym for ‘ok’ OR ‘notneeded’; ‘failure’ stands for ‘impossible’ OR ‘error’. Default: None

[value must be one of (‘all’, ‘success’, ‘failure’, ‘ok’, ‘notneeded’, ‘impossible’, ‘error’)]

datalad.runtime.stalled-external

Behavior for handling external processes: What to do with external processes if they do not finish within a reasonably short time. If “abandon”, datalad will proceed without waiting for the external process to exit. At the moment this applies only to batched git-annex processes. Should be changed with caution. Default: wait

[value must be one of (‘wait’, ‘abandon’)]

datalad.save.no-message

Commit message handling: When no commit message was provided: attempt to obtain one interactively (interactive); or use a generic commit message (generic). NOTE: The interactive option is experimental. The behavior may change in backwards-incompatible ways. Default: generic

[value must be one of (‘interactive’, ‘generic’)]

datalad.save.windows-compat-warning

Action when Windows-incompatible file names are saved: Certain characters or names can make file names incompatible with Windows. If such files are saved ‘warning’ will alert users with a log message, ‘error’ will yield an ‘impossible’ result, and ‘none’ will ignore the incompatibility. Default: warning

[value must be one of (‘warning’, ‘error’, ‘none’)]

datalad.source.epoch

Datetime epoch to use for dates in built materials: Datetime to use for reproducible builds. Originally introduced for Debian packages to interface SOURCE_DATE_EPOCH, described at https://reproducible-builds.org/docs/source-date-epoch/. By default, the current time is used. Default: 1709821116.739514

[value must be convertible to type ‘float’]

datalad.tests.dataladremote

Binary flag to specify whether each annex test repository should get the datalad special remote:

[value must be convertible to type bool]

datalad.tests.knownfailures.probe

Probes tests that are known to fail to check whether they are actually still failing: Default: False

[value must be convertible to type bool]

datalad.tests.knownfailures.skip

Skips tests that are known to currently fail: Default: True

[value must be convertible to type bool]

datalad.tests.nonetwork

Skips network tests completely if this flag is set. Examples include tests for S3, git_repositories, OpenfMRI, etc.:

[value must be convertible to type bool]

datalad.tests.nonlo

Specifies network interfaces to bring down/up for testing. Currently used by Travis CI.:

datalad.tests.noteardown

Skips execution of teardown_package, which cleans up temp files and directories created by tests, if this flag is set:

[value must be convertible to type bool]

datalad.tests.runcmdline

Binary flag to specify whether shell testing using shunit2 is to be carried out:

[value must be convertible to type bool]

datalad.tests.setup.testrepos

Pre-creates repositories for @with_testrepos within setup_package: Default: False

[value must be convertible to type bool]

datalad.tests.ssh

Skips SSH tests if this flag is not set:

[value must be convertible to type bool]

datalad.tests.temp.dir

Create a temporary directory at the location specified by this flag. It is used by tests to create a temporary git directory while testing git-annex archives, etc: Default: None

[value must be a string]

datalad.tests.temp.fs

Specify the temporary file system to use as loop device for testing DATALAD_TESTS_TEMP_DIR creation:

datalad.tests.temp.fssize

Specify the size of temporary file system to use as loop device for testing DATALAD_TESTS_TEMP_DIR creation:

datalad.tests.temp.keep

Function rmtemp will not remove temporary file/directory created for testing if this flag is set:

[value must be convertible to type bool]

datalad.tests.ui.backend

Tests UI backend: Which UI backend to use Default: tests-noninteractive

datalad.tests.usecassette

Specifies the location of the file used by the VCR module to record network transactions. Currently used when testing custom special remotes:

datalad.ui.color

Colored terminal output: Enable or disable ANSI color codes in outputs; “on” overrides the NO_COLOR environment variable. Default: auto

[value must be one of (‘on’, ‘off’, ‘auto’)]

datalad.ui.progressbar

UI progress bars: Default backend for progress reporting Default: None

[value must be one of (‘tqdm’, ‘tqdm-ipython’, ‘log’, ‘none’)]

datalad.ui.suppress-similar-results

Suppress rendering of similar repetitive results: If enabled, after a certain number of subsequent results that are identical regarding key properties, such as ‘status’, ‘action’, and ‘type’, additional similar results are not rendered by the common result renderer anymore. Instead, a count of suppressed results is displayed. If disabled, or when not running in an interactive terminal, all results are rendered. Default: True

[value must be convertible to type bool]

datalad.ui.suppress-similar-results-threshold

Threshold for suppressing similar repetitive results: Minimum number of similar results to occur before suppression is considered. See ‘datalad.ui.suppress-similar-results’ for more information. Default: 10

[value must be convertible to type ‘int’]

Extension packages

DataLad can be customized and additional functionality can be integrated via extensions. Each extension provides its own documentation:
