Architecture

This documentation contains the architectural design for check-datapackage. For design details of the Seedcase Project as a whole, see the Seedcase Design documentation.

This document outlines the architecture of check-datapackage mostly to ensure the team shares a common understanding before implementation, but also to communicate the design to anyone else interested in the internal workings of the package.

User types

This section describes the different users we expect and design for:

  • Owner: Creates and owns the Data Package. Wants to ensure that the Data Package is compliant with the Data Package standard on a general level.
  • Manager: Manages and edits the properties within the Data Package. Wants to make sure that whenever changes are made to the properties (e.g., the description field is updated), the Data Package remains compliant with the standard.
  • Developer: Contributes to building the Data Package, including the data itself and/or the infrastructure around it. Wants to ensure that changes don’t impact the compliance of the Data Package. Might add extensions (additional checks) or exclude certain Data Package checks to fit the specific needs of the project.

Naming

This section contains a naming scheme for check-datapackage that is inspired by the Data Package standard.

Overall, we follow the Data Package terminology where possible to keep things consistent. However, we also introduce some new terms and concepts specific to check-datapackage. The main objects and actions used throughout the package can be found in the tables below.

Objects

Objects used throughout check-datapackage.
| Object | Description |
|---|---|
| package | A Data Package that contains a collection of related data resources and descriptor(s). |
| descriptor | A standalone and complete metadata structure contained in a JSON file, for example, in datapackage.json. |
| properties | Metadata fields (name-value pairs) of a descriptor loaded as a Python dictionary. This can be a subset of the original descriptor or the entire structure. |
| schema | The JSON schema defining the Data Package standard. |
| config | An object containing settings for modifying the behaviour and output of the check mechanism. |
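
To illustrate the difference between a descriptor and its properties, the following minimal sketch writes a small descriptor to a temporary datapackage.json and loads it back as a Python dictionary. It uses only the standard library. The descriptor content is made up for illustration, and this is not necessarily how check-datapackage itself reads files.

```python
import json
import tempfile
from pathlib import Path

# A made-up descriptor, used only for illustration.
descriptor_text = """
{
  "name": "example-package",
  "resources": [
    {"name": "measurements", "path": "measurements.csv"}
  ]
}
"""

with tempfile.TemporaryDirectory() as folder:
    # The descriptor is the JSON file itself.
    descriptor_path = Path(folder) / "datapackage.json"
    descriptor_path.write_text(descriptor_text, encoding="utf-8")

    # The properties are the descriptor's contents loaded as a dictionary.
    properties = json.loads(descriptor_path.read_text(encoding="utf-8"))

# A subset of the descriptor (here, the first resource) also counts as properties.
resource_properties = properties["resources"][0]
```

Both `properties` and `resource_properties` are plain dictionaries, which is why the same term covers the entire structure as well as a part of it.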

Actions

Actions that check-datapackage can perform.
| Action | Description |
|---|---|
| check | Check that a descriptor conforms to the Data Package standard. |
| explain | Explain issues flagged by the check action in more detail using non-technical language. |
| read | Read various files, such as a Data Package descriptor or a configuration file. |

C4 Models

This section contains the C4 Models for check-datapackage. The C4 Model is an established visualisation approach to describe the architecture of a software system. It breaks the system down into four levels of architectural abstraction: system context, containers, components, and code.

System context

The system context diagram shows the users and any external systems that interact with check-datapackage. This includes the user types and the Data Package standard.

check-datapackage receives the definition of the Data Package descriptor’s structure (including which properties must or should be present and their formats) from the Data Package standard (version 2). The standard provides this information in two forms: versioned JSON Schema profiles that define the required properties, and textual descriptions that outline the compliance requirements.
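
As a rough illustration of how a schema profile can drive a check, the sketch below uses a toy, heavily trimmed stand-in for a JSON Schema profile and a hand-written helper. The `toy_profile` dictionary and the `find_issues` function are assumptions made for this example. The real profiles are much larger and would normally be evaluated by a full JSON Schema validator rather than by code like this.

```python
# A toy stand-in for a JSON Schema profile (assumption: illustrative only).
toy_profile = {
    "required": ["resources"],
    "properties": {
        "name": {"type": "string"},
        "resources": {"type": "array"},
    },
}

# Map the JSON Schema type names used above to Python types.
PYTHON_TYPES = {"string": str, "array": list}


def find_issues(properties: dict, profile: dict) -> list[str]:
    """Return messages for missing required or wrongly typed properties."""
    issues = []
    for name in profile["required"]:
        if name not in properties:
            issues.append(f"'{name}' is a required property")
    for name, rules in profile["properties"].items():
        expected = PYTHON_TYPES[rules["type"]]
        if name in properties and not isinstance(properties[name], expected):
            issues.append(f"'{name}' must be of type {rules['type']}")
    return issues


# 'resources' is missing and 'name' is a number instead of a string.
issues = find_issues({"name": 123}, toy_profile)
```

Here `issues` holds two messages, one for the missing `resources` property and one for the wrongly typed `name`.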

Note

In the initial version of check-datapackage, we only support the second edition of the Data Package standard (v2.0). However, we plan to extend this to support future editions as they are released, as well as the first edition to ensure backward compatibility.

The users, described in the User types section, provide check-datapackage with their Data Package’s descriptor to check its compliance with the standard.

flowchart LR

    subgraph "Users"
        user_owner("Owner<br>[person]")
        user_manager("Manager<br>[person]")
        user_developer("Developer<br>[person]")
    end

    dp_standard("Data Package<br>[standard]")
    check("check-datapackage<br>[Python package]")


    dp_standard --"Definition of the standard"--> check
    Users --"Check Data Package<br>descriptor"--> check
    %% Styling
    style Users fill:#FFFFFF, color:#000000
Figure 1: C4 system context diagram showing the anticipated users and the external system (the Data Package standard) check-datapackage interacts with.

Container

In C4, a container diagram zooms in on the system boundary to show the containers within it, such as web applications or databases. This diagram displays the main containers of check-datapackage, their responsibilities, and how they interact, including the technologies used for each.

Currently, check-datapackage consists of a single container, the core Python package, but we’ve designed it so that a command line interface (CLI) can be added in the future. A CLI would make it easier to run the checks in, for example, continuous integration pipelines.
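
Because the CLI does not exist yet, the following is only a hypothetical sketch of how a thin command line wrapper could delegate to the core package. The program name, the `descriptor` argument, and the placeholder `check` function are all assumptions made for illustration, not the planned interface.

```python
import argparse
import json
from pathlib import Path


def check(properties: dict) -> list[str]:
    """Placeholder for the core package's check; only looks for 'resources'."""
    if "resources" in properties:
        return []
    return ["'resources' is a required property"]


def main(argv: list[str]) -> int:
    """Parse arguments, run the check, and return an exit code."""
    parser = argparse.ArgumentParser(prog="check-datapackage")
    parser.add_argument("descriptor", type=Path, help="Path to datapackage.json")
    args = parser.parse_args(argv)

    properties = json.loads(args.descriptor.read_text(encoding="utf-8"))
    issues = check(properties)
    for issue in issues:
        print(issue)
    # A non-zero exit code makes a continuous integration job fail.
    return 1 if issues else 0
```

Returning an exit code from `main` is what would let a continuous integration pipeline fail automatically when the descriptor has issues.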

flowchart LR

    users("Users<br>[person]")
    dp_standard("Data Package<br>[standard]")

    subgraph "check-datapackage"
        python("Core Python Package<br>[Python, JSON schema]")
        cli("CLI<br>[Python]")

        python -. "Provides<br>functionality" .-> cli
    end

    dp_standard --"Definition of the standard"--> python
    users --"Check Data Package<br>descriptor programmatically"--> python
    users -. "Check Data Package<br>descriptor via the CLI" .-> cli

    %% Styling
    style check-datapackage fill:#FFFFFF, color:#000000
    style cli fill:#FFFFFF, stroke-dasharray: 5 5
Figure 2: C4 container diagram showing the core Python package in check-datapackage and the future command line interface (displayed dashed).

Component/code

In the diagram below, we zoom in on the core Python package container to show its internal components. In C4, a component is “a grouping of related functionality encapsulated behind a well-defined interface”, like a class or a module, while code refers to the basic building blocks, such as classes and functions.

Because the core Python package is relatively small and simple, and because both component and code diagrams include classes, we combine the component and code levels of the C4 model into a single diagram as shown below. This diagram shows the main classes and functions within the core Python package. Because the CLI is only a planned future extension, we do not include a component/code diagram for it at this time.

flowchart LR

    subgraph python_package["Core Python Package"]

        subgraph config_file["Configuration file"]
            config("Config<br>[class]")
            exclusion("Exclusion<br>[class]")
            extension("Extension<br>[class]")
        end

        read_config["read_config()<br>[function]"]
        read_json["read_json()<br>[function]"]
        check("check()<br>[function]")
        explain("explain()<br>[function]")

        exclusion --"Defines checks to exclude"--> config
        extension --"Defines additional checks"--> config
        config_file -. "Reads configuration<br>from file" .-> read_config

        read_json --"Provides properties as dict"--> check
        read_config -. "Adds check<br>configurations" .-> check
        config --"Adds check<br>configurations"--> check

        check --"Passes found issues<br>to get non-technical<br>explanation"--> explain
    end

    dp_standard("Data Package<br>[standard]")
    user("User<br>[person]")

    dp_standard --"Defines the Data<br>Package standard"--> check
    user --"Provides datapackage.json<br>to check"--> read_json
    user --"Provides configuration file<br>(optional)"--> config_file

    %% Styling
    style python_package fill:#FFFFFF, color:#000000
    style config_file fill:#FFFFFF, color:#000000, stroke-dasharray: 5 5
Figure 3: C4 component/code diagram showing the classes and functions of the core Python package and their connections.
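
The flow in the diagram can be sketched in code as follows. This is a simplified illustration, not the package's actual interface: the `Issue` class, all field names, and the function signatures are assumptions, and the built-in check is reduced to a single stand-in rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Issue:
    """Hypothetical record of one problem found by check()."""
    location: str
    message: str


@dataclass
class Exclusion:
    """Hypothetical: ignore issues reported at this location."""
    location: str


@dataclass
class Extension:
    """Hypothetical: an additional check applied to one property."""
    location: str
    message: str
    is_valid: Callable[[object], bool]


@dataclass
class Config:
    """Collects exclusions and extensions, as in the diagram."""
    exclusions: list[Exclusion] = field(default_factory=list)
    extensions: list[Extension] = field(default_factory=list)


def check(properties: dict, config: Optional[Config] = None) -> list[Issue]:
    """Run a stand-in built-in check plus extensions, minus exclusions."""
    config = config or Config()
    issues = []
    # Stand-in for the checks derived from the Data Package standard.
    if "resources" not in properties:
        issues.append(Issue("resources", "'resources' is a required property"))
    # Extensions add project-specific checks.
    for extension in config.extensions:
        if not extension.is_valid(properties.get(extension.location)):
            issues.append(Issue(extension.location, extension.message))
    # Exclusions drop issues the project has chosen to ignore.
    excluded = {exclusion.location for exclusion in config.exclusions}
    return [issue for issue in issues if issue.location not in excluded]


def explain(issues: list[Issue]) -> str:
    """Turn issues into a short, plain-language summary."""
    if not issues:
        return "No issues found."
    lines = [f"- At '{issue.location}': {issue.message}" for issue in issues]
    return "\n".join([f"{len(issues)} issue(s) found:"] + lines)


config = Config(
    exclusions=[Exclusion(location="resources")],
    extensions=[
        Extension(
            location="name",
            message="'name' should be lowercase",
            is_valid=lambda value: isinstance(value, str) and value.islower(),
        )
    ],
)
report = explain(check({"name": "My-Package"}, config))
```

In this sketch, the missing `resources` property is excluded by the configuration, so `report` only mentions the issue raised by the extension on `name`.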

For more details on the individual classes and functions, see the interface documentation and the reference documentation.