Checking a Data Package’s metadata

The purpose of check-datapackage is to make sure a Data Package’s metadata—stored in its datapackage.json file—complies with the Data Package standard. This standard defines the available properties at each level of the datapackage.json, which ones are required, and what values are allowed.

This guide shows you how to use the main function, check(), to run these checks. Each section covers a different part of check(): basic usage with the properties argument, the default checks, configuring which checks run with the config argument, and handling failed checks with the error argument.

Tip

For the full reference of the check() function, see the reference documentation.

Getting started with check() (properties)

check() checks a Data Package’s metadata against the Data Package standard and reports any issues it finds.

It requires one main input, properties, which is a Python dict representing the properties in the datapackage.json file. You can load these properties from the JSON file using the helper function read_json() included in check-datapackage.
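As a rough sketch of what loading the properties amounts to, the snippet below reads a datapackage.json file into a Python dict using the standard library's json module. It is a stand-in for illustration only: the signature of read_json() is not shown in this guide, so the helper load_properties() and the temporary file are assumptions, not part of check-datapackage.

```python
import json
import tempfile
from pathlib import Path


def load_properties(path: Path) -> dict:
    """Hypothetical stand-in for a JSON loader; not the check-datapackage API."""
    with path.open(encoding="utf-8") as file:
        return json.load(file)


# Write a small datapackage.json to a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    descriptor_path = Path(folder) / "datapackage.json"
    descriptor_path.write_text(
        json.dumps({"name": "woolly-dormice", "resources": []}),
        encoding="utf-8",
    )
    loaded_properties = load_properties(descriptor_path)

# The result is a plain dict, which is the shape check() expects for `properties`.
print(type(loaded_properties).__name__, loaded_properties["name"])
```

In practice you would likely prefer the library's own read_json() helper; the point here is only that properties is an ordinary dict.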

By default, if any issues are found, check() returns a list of Issue objects, each one describing a failed check on a property. If no issues are found, check() returns an empty list. If you’d rather have the function raise an error when it finds any issues, see the Stop program on failed checks (error=True) section below.

Let’s look at an example. The code below defines a package_properties dictionary that includes all the required properties in a correct format. When we call check() on these properties, it returns an empty list:

import check_datapackage as cdp

package_properties = {
    "name": "woolly-dormice",
    "title": "Hibernation Physiology of the Woolly Dormouse: A Scoping Review.",
    "description": """
        This scoping review explores the hibernation physiology of the
        woolly dormouse, drawing on data collected over a 10-year period
        along the Taurus Mountain range in Turkey.
    """,
    "id": "123-abc-123",
    "created": "2014-05-14T05:00:01+00:00",
    "version": "1.0.0",
    "licenses": [{"name": "odc-pddl"}],
    "resources": [
        {
            "name": "woolly-dormice-2015",
            "title": "Body fat percentage in the hibernating woolly dormouse",
            "path": "resources/woolly-dormice-2015/data.parquet",
        }
    ],
}

cdp.check(properties=package_properties)
[]

Now let’s edit package_properties to have a name of the wrong type (a number instead of a string) and run check() again:

package_properties["name"] = 123

cdp.check(properties=package_properties)
[Issue(jsonpath='$.name', type='type', message="123 is not of type 'string'")]

The output now lists one issue: the name property has the wrong type.
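Because each Issue in the output above carries a jsonpath, a type, and a message, the returned list is straightforward to post-process. The sketch below shows one way to turn such a list into a short report. It uses a stand-in dataclass with those three fields rather than the library's own Issue class, and format_issues() is a hypothetical helper, not part of check-datapackage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Stand-in mirroring the three fields shown in the output above."""

    jsonpath: str
    type: str
    message: str


def format_issues(issues: list[Issue]) -> str:
    """Hypothetical helper: one bullet per issue, or a short all-clear note."""
    if not issues:
        return "No issues found."
    return "\n".join(
        f"- {issue.jsonpath} ({issue.type}): {issue.message}" for issue in issues
    )


# Same values as the issue reported for the incorrect `name` property.
issues = [Issue(jsonpath="$.name", type="type", message="123 is not of type 'string'")]
print(format_issues(issues))
```

The same pattern should carry over to the list returned by check(), assuming its Issue objects expose these fields as attributes.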

Default checks and configuration (config)

By default, check() runs the standard checks defined as MUSTs in the Data Package standard. These include checking that all required properties are present and that their values have the correct types and formats. This happens through a default Config object passed to the config argument of check().

If you want to configure which checks are performed, you can pass your own Config object to check(). With this object you can exclude certain checks, include additional SHOULD recommendations from the Data Package standard, or add your own custom checks.

Tip

For more details on configuring checks, see the Configuring checks guide or the Config reference documentation.

Stop program on failed checks (error=True)

To make failed checks raise an error and stop program execution, set the error argument of check() to True. Using the same incorrect package_properties example as before, we can call check() with error=True like this:

cdp.check(properties=package_properties, error=True)
---------------------------------------------------------------------------
DataPackageError                          Traceback (most recent call last)
Cell In[3], line 1
----> 1 cdp.check(properties=package_properties, error=True)

File ~/work/check-datapackage/check-datapackage/src/check_datapackage/check.py:94, in check(properties, config, error)
     91 issues = exclude(issues, config.exclusions, properties)
     93 if error and issues:
---> 94     raise DataPackageError(issues)
     96 return sorted(set(issues))

DataPackageError: There were some issues found in your `datapackage.json`:

- Property `$.name`: 123 is not of type 'string'

Since some checks failed, the function now raises an error rather than returning the list of issues.
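If you would rather handle the failure than let the program stop, you can wrap the call in try/except. The sketch below illustrates the pattern with toy stand-ins: this DataPackageError class and this check() function are simplified assumptions that only mimic the raise-on-issues behaviour seen in the traceback, and is_valid() is a hypothetical helper. None of them are the library's implementation.

```python
class DataPackageError(Exception):
    """Stand-in for the exception named in the traceback above."""


def check(properties: dict, error: bool = False) -> list[str]:
    """Toy stand-in for check(): only tests that `name` is a string."""
    issues = []
    if not isinstance(properties.get("name"), str):
        issues.append(f"{properties.get('name')!r} is not of type 'string'")
    # Mirrors the behaviour described above: raise only if asked to and issues exist.
    if error and issues:
        raise DataPackageError("; ".join(issues))
    return issues


def is_valid(properties: dict) -> bool:
    """Hypothetical helper: True if no check fails, False if the error is raised."""
    try:
        check(properties, error=True)
    except DataPackageError as failure:
        print(f"Metadata check failed: {failure}")
        return False
    return True


print(is_valid({"name": "woolly-dormice"}))
print(is_valid({"name": 123}))
```

With the real library, the same try/except shape should apply, assuming DataPackageError can be imported from check_datapackage.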