
v0.3.0 - Ruth Wodak

@skasberger skasberger released this 27 Jan 01:47
· 52 commits to master since this release

This release changes many parts of the package. It adds new APIs, refactored models and lots of new documentation.

Overview of the most important changes:

  • Re-factored data models: setters, getters, data validation and JSON export and import
  • Export and import of metadata to/from pre-formatted CSV templates
  • Add User Guides, Use-Cases, Contributor Guide and much more to the documentation
  • Add SWORD, Search, Metrics and Data Access API
  • Collect the complete data tree of a Dataverse with get_children()
  • Use JSON schemas for metadata validation (jsonschema required)
  • Updated Python requirements: Python>=3.6 (no Python 2 support anymore)
  • Curl required, only for update_datafile()
  • Transfer pyDataverse to GDCC - the Global Dataverse Community Consortium (#52)

Version 0.3.0 is named in honor of Ruth Wodak (Wikipedia), an Austrian linguist. Her work is mainly located in discourse studies, more specifically in critical discourse analysis, which looks at discourse as a form of social practice. She was awarded the Wittgenstein-Preis, the highest Austrian science award.

For help or general questions, please have a look at our Docs or email stefan.kasberger@univie.ac.at.

Use-Cases

The new functionalities were developed with some specific use-cases in mind:

See our Documentation for more details.

Retrieve data structure and metadata from Dataverse instance (DevOps)

Collect all Dataverses, Datasets and Datafiles of a Dataverse instance, or just a part of it. The results then can be stored in JSON files, which can be used for testing purposes, like checking the completeness of data after a Dataverse upgrade or migration.
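The idea of collecting a data tree and storing it as JSON can be sketched in plain Python. This is only an illustration, not pyDataverse's implementation: the tree shape (nested `children` lists with a `type` key), the sample values and the helper name `walk_tree` are all assumptions made for this example.

```python
import json

# Hypothetical tree with nested "children" lists. The exact keys returned
# by get_children() are an assumption here.
tree = [
    {
        "type": "dataverse",
        "alias": "demo",
        "children": [
            {
                "type": "dataset",
                "pid": "doi:10.0000/EXAMPLE",
                "children": [
                    {"type": "datafile", "filename": "data.csv"},
                ],
            },
        ],
    },
]


def walk_tree(nodes):
    """Flatten a nested tree into three lists, grouped by node type."""
    found = {"dataverse": [], "dataset": [], "datafile": []}
    stack = list(nodes)
    while stack:
        node = stack.pop()
        # Copy the node without its children so each entry stays flat.
        entry = {k: v for k, v in node.items() if k != "children"}
        found[node["type"]].append(entry)
        stack.extend(node.get("children", []))
    return found["dataverse"], found["dataset"], found["datafile"]


dataverses, datasets, datafiles = walk_tree(tree)

# A JSON snapshot like this could be compared before and after an upgrade.
snapshot = json.dumps(
    {"dataverses": dataverses, "datasets": datasets, "datafiles": datafiles},
    indent=2,
    sort_keys=True,
)
```

Sorting the keys keeps the snapshot stable between runs, which makes a plain text diff of two snapshots meaningful.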

Upload and removal of test data (DevOps)

For testing, you often have to upload a collection of data and metadata, which should be removed after the test is finished. For this, we offer easy-to-use functions.

Import data from CSV templates (Data Scientist)

Importing lots of data from data sources outside Dataverse can be done with the CSV templates as a bridge. Fill the CSV templates with your data, by machine or by hand, and import them into pyDataverse for an easy mass upload via the Dataverse API.
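A rough sketch of what reading such a template involves, using only the standard library. It mirrors the conversions listed for read_csv_to_dict() under Utils (replace the dv. prefix, load JSON cells, convert boolean strings), but it is not that function: the column names, the sample rows and the helpers `convert_cell` and `read_template` are hypothetical.

```python
import csv
import io
import json

# Hypothetical template content; the column names are illustrative only.
TEMPLATE = '''dv.title,dv.restricted,dv.keywords
"Survey 2020",TRUE,"[""politics"", ""discourse""]"
"Interviews",FALSE,"[]"
'''


def convert_cell(value):
    """Turn boolean strings and JSON cells into Python values."""
    if value.upper() in ("TRUE", "FALSE"):
        return value.upper() == "TRUE"
    if value[:1] in ("[", "{"):
        return json.loads(value)
    return value


def read_template(handle, prefix="dv."):
    """Read CSV rows into dicts, dropping the column prefix."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            (key[len(prefix):] if key.startswith(prefix) else key): convert_cell(val)
            for key, val in row.items()
        })
    return rows


records = read_template(io.StringIO(TEMPLATE))
```

Each resulting dict could then be handed to a metadata model before upload. With real files, `open(path, newline="")` would take the place of `io.StringIO`.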

Bugs

  • Missing JSON schemas (#56)
  • Datafile metadata title (#50)
  • Error long_description_content_type (#4)

Features & Enhancements

API

Summary: Add other APIs next to the Native API and update the Native API.

  • add Data Access API:
    • get datafile(s) (get_datafile(), get_datafiles(), get_datafile_bundle())
    • request datafile access (request_access(), allow_access_request(), grant_file_access(), list_file_access_requests())
  • add Metrics API:
    • total(), past_days(), get_dataverses_by_subject(), get_dataverses_by_category(), get_datasets_by_subject(), get_datasets_by_data_location()
  • add SWORD API:
    • get_service_document()
  • add Search API:
    • search()
  • Native API:
    • Get all children data-types of a Dataverse or a Dataset in a tree structure (get_children())
    • Convert a Dataverse ID to its alias (dataverse_id2alias())
    • Get contents of a Dataverse (Datasets, Dataverses) (get_dataverse_contents())
    • Get Dataverse assignments (get_dataverse_assignments())
    • Get Dataverse facets (get_dataverse_facets())
    • Edit Dataset metadata (edit_dataset_metadata()) (#19)
    • Destroy Dataset (destroy_dataset())
    • Dataset private URL functionalities (create_dataset_private_url(), get_dataset_private_url(), delete_dataset_private_url())
    • Get Dataset version(s) (get_dataset_versions(), get_dataset_version())
    • Get Dataset assignments (get_dataset_assignments())
    • Check if Dataset is locked (get_dataset_lock())
    • Get Datafiles metadata (get_datafiles_metadata())
    • Update datafile metadata (update_datafile_metadata())
    • Redetect Datafile file type (redetect_file_type())
    • Restrict Datafile (restrict_datafile())
    • Reingest and uningest Datafiles (reingest_datafile(), uningest_datafile())
    • Datafile upload in native Python (no CURL dependency anymore) (upload_datafile())
    • Replace existing Datafile (replace_datafile())
    • Roles functionalities (get_dataverse_roles(), create_role(), show_role(), delete_role())
    • Add API token functionalities (get_user_api_token_expiration_date(), recreate_user_api_token(), delete_user_api_token())
    • Get current user data (get_user()) (#59)
    • Get API ToU (get_info_api_terms_of_use())
    • Add import of existing Dataset in create_dataset() (#3)
  • Api
    • Set User-Agent for requests to pydataverse
    • Change authentication in the request functions (get, post, delete, put): if an API token is passed, use it; if not, don't set one. The auth parameter is no longer used.
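The token rule above can be illustrated with a small helper. This is a sketch under assumptions, not pyDataverse's code: the function name `build_request_kwargs` is hypothetical, and these notes do not say whether the token travels as a header or as a query parameter. The example uses the `X-Dataverse-key` header, which the Dataverse API accepts for API tokens.

```python
def build_request_kwargs(api_token=None, params=None):
    """Assemble keyword arguments for an HTTP request.

    The token is attached only when one was passed; there is no
    separate "auth" switch.
    """
    # User-Agent value taken from the release notes.
    headers = {"User-Agent": "pydataverse"}
    if api_token:
        # Assumption: send the token via the X-Dataverse-key header.
        headers["X-Dataverse-key"] = api_token
    return {"headers": headers, "params": dict(params or {})}


anonymous = build_request_kwargs()
authenticated = build_request_kwargs(api_token="hypothetical-token")
```

The returned dict is shaped so it could be unpacked into an HTTP client call that takes `headers` and `params` keyword arguments.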

Models

Summary: Re-factoring of all models (Dataverse, Dataset, Datafile).

New methods:

  • from_json() imports JSON (like Dataverse's own JSON format) into a pyDataverse model object
  • get() returns a dict of the pyDataverse model object
  • json() returns a JSON string (like Dataverse's own JSON format) of the pyDataverse model object. Mostly used for API uploads.
  • validate_data() validates a pyDataverse object against a JSON schema
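To show how these four methods relate to each other, here is a toy class with the same method names. It is not pyDataverse's model code: the class `ToyDataverse`, its `REQUIRED` tuple and the sample attributes are hypothetical, the real methods may take additional parameters, and the required-keys check is only a simplified stand-in for JSON schema validation.

```python
import json


class ToyDataverse:
    """Toy stand-in for a metadata model; not pyDataverse's own class."""

    REQUIRED = ("alias", "name")  # hypothetical required attributes

    def __init__(self):
        self._data = {}

    def from_json(self, json_str):
        """Load attributes from a JSON string."""
        self._data.update(json.loads(json_str))

    def get(self):
        """Return the attributes as a plain dict (a copy)."""
        return dict(self._data)

    def json(self):
        """Serialize the attributes to a JSON string."""
        return json.dumps(self._data, sort_keys=True)

    def validate_data(self):
        """Return True when every required attribute is present."""
        return all(key in self._data for key in self.REQUIRED)


dv = ToyDataverse()
dv.from_json('{"alias": "demo", "name": "Demo Dataverse"}')
round_trip = json.loads(dv.json())
```

The round trip from JSON to object and back to JSON is the property that matters for uploads: what goes in through from_json() should come out of json() unchanged.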

Utils

  • Save list of metadata (Dataverses, Datasets or Datafiles) to a CSV file (write_dicts_as_csv()) (#11)
  • Walk through the data tree from get_children() and extract Dataverses, Datasets and Datafiles (dataverse_tree_walker())
  • Store the results from dataverse_tree_walker() in separate JSON files (save_tree_data())
  • Validate any data model dictionary (Dataverse, Dataset, Datafile) against a JSON schema (validate_data())
  • Clean strings (trim whitespace) (clean_string())
  • Create URLs from identifier (create_dataverse_url(), create_dataset_url(), create_datafile_url())
  • Update read_csv_to_dict(): replace dv. prefix, load JSON cells and convert boolean cell strings

Docs

Many new pages and tutorials:

Tests

  • Add tests for new functions
  • Re-factor existing tests
  • Create fixtures
  • Create test data

Miscellaneous

  • Add Python 3.8 support; remove Python 2.7, 3.4 and 3.5 (Python>=3.6 required now)
  • Add jsonschema as requirement
  • Add JSON schemas for Dataverse upload, Dataset upload, Datafile upload and DSpace to package
  • Add CSV templates for Dataverses, Datasets and Datafiles from pyDataverse_templates
  • Transfer pyDataverse to GDCC - the Global Dataverse Community Consortium (#52)
  • Improve code formatting: black, isort, pylint, mypy, pre-commit
  • Add pylint linter
  • Add mypy type checker
  • Add pre-commit for managing pre-commit hooks.
  • Add radon code metrics
  • Add GitHub templates (PR, issues, commit) (#57)
  • Re-structure requirements
  • Get DOI:10.5281/zenodo.4470151 for GitHub repository

Other

Thanks to Daniel Melichar (@dmelichar), Vyacheslav Tykhonov (Slava), GDCC, @ecowan, @BPeuch, @j-n-c and @ambhudia for their support for this release. Special thanks to the Pandas project for their great blueprint for the Contributor Guide.

PyDataverse is supported by funding as part of the Horizon2020 project SSHOC.