This documentation contains the architectural design for check-datapackage. For design details of the Seedcase Project as a whole, see the Seedcase Design documentation.
This document outlines the architecture of check-datapackage, primarily to ensure the team shares a common understanding before implementation, but also to communicate the design to anyone interested in the internal workings of the package.
User types
This section describes the different users we expect and design for:
Owner: Creates and owns the Data Package. Wants to ensure that the Data Package is compliant with the Data Package standard on a general level.
Manager: Manages and edits the properties within the Data Package. Wants to make sure that whenever changes are made to the properties (e.g., the description field is updated), the Data Package remains compliant with the standard.
Developer: Contributes to building up the Data Package, including the data itself and/or the infrastructure around it. Wants to ensure that changes don’t affect the compliance of the Data Package. Might add extensions (additional checks) or exclude certain Data Package checks to fit the specific needs of the project.
Naming
This section contains a naming scheme for check-datapackage that is inspired by the Data Package standard.
Overall, we follow the Data Package terminology where possible to keep things consistent. However, we also introduce some new terms and concepts specific to check-datapackage. The main objects and actions used throughout the package can be found in the tables below.
Objects
Objects used throughout check-datapackage.
| Object | Description |
|---|---|
| package | A Data Package that contains a collection of related data resources and descriptor(s). |
| descriptor | A standalone and complete metadata structure contained in a JSON file, for example, in datapackage.json. |
| properties | Metadata fields (name-value pairs) of a descriptor loaded as a Python dictionary. This can be a subset of the original descriptor or the entire structure. |
| schema | The JSON schema defining the Data Package standard. |
| config | An object containing settings for modifying the behaviour and output of the check mechanism. |
Actions
Actions that check-datapackage can perform.
| Action | Description |
|---|---|
| check | Check that a descriptor conforms to the Data Package standard. |
| explain | Explain issues flagged by the check action in more detail using non-technical language. |
| read | Read various files, such as a Data Package descriptor or a configuration file. |
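To make the relationship between a descriptor (the JSON file) and its properties (the loaded dictionary) concrete, here is a minimal sketch of the read action. The function name `read_json()` comes from the component diagram later in this document, but its signature and error handling here are assumptions for illustration, not the package's actual implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def read_json(path: Path) -> dict[str, Any]:
    """Read a descriptor file and return its properties as a dictionary.

    Assumed signature: the real function may accept other argument types.
    """
    properties = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(properties, dict):
        raise TypeError(f"Expected a JSON object in {path}.")
    return properties


# Write a small example descriptor to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as folder:
    descriptor_path = Path(folder) / "datapackage.json"
    example = {"name": "example", "resources": [{"name": "data", "path": "data.csv"}]}
    descriptor_path.write_text(json.dumps(example), encoding="utf-8")
    properties = read_json(descriptor_path)

# Properties can be the entire structure or a subset of it,
# for example the properties of a single resource.
resource_properties = properties["resources"][0]
```

In this sketch, `properties` holds the entire descriptor, while `resource_properties` is a subset of it, matching the definition of properties in the Objects table.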
C4 Models
This section contains the C4 Models for check-datapackage. The C4 Model is an established visualisation approach for describing the architecture of a software system. It breaks the system down into four levels of architectural abstraction: system context, containers, components, and code.
System context
The system context diagram shows the users and any external systems that interact with check-datapackage. This includes the user types and the Data Package standard.
check-datapackage receives the definitions of the Data Package descriptor’s structure, including the properties that must or should be included and their formats, from the Data Package standard (version 2). The standard provides this information in two forms: versioned JSON Schema profiles that define the required properties, and textual descriptions that outline what compliance entails.
Note
In the initial version of check-datapackage, we only support the second edition of the Data Package standard (v2.0). However, we plan to extend this to support future editions as they are released, as well as the first edition to ensure backward compatibility.
The users, described in the User types section, provide check-datapackage with their Data Package’s descriptor to check its compliance with the standard.
flowchart LR
subgraph "Users"
user_owner("Owner<br>[person]")
user_manager("Manager<br>[person]")
user_developer("Developer<br>[person]")
end
dp_standard("Data Package<br>[standard]")
check("check-datapackage<br>[Python package]")
dp_standard --"Definition of the standard"--> check
Users --"Check Data Package<br>descriptor"--> check
%% Styling
style Users fill:#FFFFFF, color:#000000
Figure 1: C4 system context diagram showing the anticipated users and the external system (the Data Package standard) check-datapackage interacts with.
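As an illustration of how a schema from the standard can drive the check, the sketch below validates properties against a deliberately tiny, hand-written schema. `TOY_SCHEMA`, `JSON_TYPES`, and `check_against_schema()` are hypothetical names introduced for this example. The real JSON Schema profiles are much more detailed, and this is not how check-datapackage necessarily implements its checks.

```python
from typing import Any

# Toy stand-in for a JSON Schema profile (an assumption for illustration).
# The real profiles are published with the Data Package standard.
TOY_SCHEMA: dict[str, Any] = {
    "required": ["resources"],
    "properties": {
        "name": {"type": "string"},
        "resources": {"type": "array"},
    },
}

# Map the JSON Schema type names used above to Python types.
JSON_TYPES: dict[str, type] = {"string": str, "array": list, "object": dict}


def check_against_schema(
    properties: dict[str, Any], schema: dict[str, Any]
) -> list[str]:
    """Return a list of issue messages; an empty list means no issues."""
    issues = []
    # Properties that must be included.
    for name in schema.get("required", []):
        if name not in properties:
            issues.append(f"'{name}' is a required property")
    # Formats of the properties that are present.
    for name, rule in schema.get("properties", {}).items():
        expected = JSON_TYPES[rule["type"]]
        if name in properties and not isinstance(properties[name], expected):
            issues.append(f"'{name}' must be of type {rule['type']}")
    return issues


issues = check_against_schema({"name": 123}, TOY_SCHEMA)
# issues == ["'resources' is a required property", "'name' must be of type string"]
```

Keeping the rules in a schema object rather than in code is what would let the same check logic work with other editions of the standard by swapping in a different schema.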
Container
In C4, a container diagram zooms in on the system boundary to show the containers within it, such as web applications or databases. This diagram displays the main containers of check-datapackage, their responsibilities, and how they interact, including the technologies used for each.
Currently, we build check-datapackage with a single container, the core Python package, but we’ve designed it so that it can be extended with a command-line interface (CLI) in the future. With a CLI, we want to make it easier to run the checks in, for example, continuous integration pipelines.
flowchart LR
users("Users<br>[person]")
dp_standard("Data Package<br>[standard]")
subgraph "check-datapackage"
python("Core Python Package<br>[Python, JSON schema]")
cli("CLI<br>[Python]")
python -. "Provides<br>functionality" .-> cli
end
dp_standard --"Definition of the standard"--> python
users --"Check Data Package<br>descriptor programmatically"--> python
users -. "Check Data Package<br>descriptor via the CLI" .-> cli
%% Styling
style check-datapackage fill:#FFFFFF, color:#000000
style cli fill:#FFFFFF, stroke-dasharray: 5 5
Figure 2: C4 container diagram showing the core Python package in check-datapackage and the future command line interface (displayed dashed).
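Since the CLI is only planned, the following is a purely hypothetical sketch of how a thin CLI layer could wrap the core package using the standard library's `argparse`. The `main()` function, the `descriptor` argument, and the placeholder `check()` are all assumptions made for this example and do not describe an existing interface.

```python
from __future__ import annotations

import argparse
import json
from pathlib import Path
from typing import Any, Sequence


def check(properties: dict[str, Any]) -> list[str]:
    """Placeholder for the core package's check (flags one toy issue only)."""
    if "resources" in properties:
        return []
    return ["'resources' is a required property"]


def main(argv: Sequence[str] | None = None) -> int:
    """Hypothetical CLI entry point: check a descriptor, return an exit code."""
    parser = argparse.ArgumentParser(prog="check-datapackage")
    parser.add_argument("descriptor", type=Path, help="Path to datapackage.json")
    args = parser.parse_args(argv)

    properties = json.loads(args.descriptor.read_text(encoding="utf-8"))
    issues = check(properties)
    for issue in issues:
        print(issue)
    # A non-zero exit code lets a continuous integration job fail on issues.
    return 1 if issues else 0
```

The CLI layer in this sketch contains no checking logic of its own, which mirrors the dashed "Provides functionality" arrow from the core Python package to the CLI in the diagram.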
Component/code
In the diagram below, we zoom in on the core Python package container to show its internal components. In C4, a component is “a grouping of related functionality encapsulated behind a well-defined interface”, like a class or a module, while code refers to the basic building blocks, such as classes and functions.
Because the core Python package is relatively small and simple, and because both component and code diagrams include classes, we combine the component and code levels of the C4 model into a single diagram that shows the main classes and functions within the core Python package. Because the CLI is only a planned future extension, we do not include a component/code diagram for it at this time.
flowchart LR
subgraph python_package["Core Python Package"]
subgraph config_file["Configuration file"]
config("Config<br>[class]")
exclusion("Exclusion<br>[class]")
extension("Extension<br>[class]")
end
read_config["read_config()<br>[function]"]
read_json["read_json()<br>[function]"]
check("check()<br>[function]")
explain("explain()<br>[function]")
exclusion --"Defines checks to exclude"--> config
extension --"Defines additional checks"--> config
config_file -. "Reads configuration<br>from file" .-> read_config
read_json --"Provides properties as dict"--> check
read_config -. "Adds check<br>configurations" .-> check
config --"Adds check<br>configurations"--> check
check --"Passes found issues<br>to get non-technical<br>explanation"--> explain
end
dp_standard("Data Package<br>[standard]")
user("User<br>[person]")
dp_standard --"Defines the Data<br>Package standard"--> check
user --"Provides datapackage.json<br>to check"--> read_json
user --"Provides configuration file<br>(optional)"--> config_file
%% Styling
style python_package fill:#FFFFFF, color:#000000
style config_file fill:#FFFFFF, color:#000000, stroke-dasharray: 5 5
Figure 3: C4 component/code diagram showing the classes and functions of the core Python package and their connections.
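The connections in the diagram can be sketched in code as follows. The names `Config`, `Exclusion`, `Extension`, `check()`, and `explain()` come from the diagram, but every field, parameter, and the `Issue` class are assumptions introduced for illustration. The built-in check is a single toy rule standing in for the checks derived from the standard.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Issue:
    """Hypothetical record of one problem found by check()."""
    property_name: str
    message: str


@dataclass(frozen=True)
class Exclusion:
    """Defines checks to exclude (assumed: matched by property name)."""
    property_name: str


@dataclass(frozen=True)
class Extension:
    """Defines an additional check (assumed: a predicate on one property)."""
    property_name: str
    message: str
    is_valid: Callable[[Any], bool]


@dataclass
class Config:
    """Settings that modify the behaviour of check()."""
    exclusions: list[Exclusion] = field(default_factory=list)
    extensions: list[Extension] = field(default_factory=list)


def check(properties: dict[str, Any], config: Config | None = None) -> list[Issue]:
    config = config or Config()
    issues = []
    # Toy stand-in for the checks that come from the standard.
    if "resources" not in properties:
        issues.append(Issue("resources", "'resources' is a required property"))
    # Additional checks added through the configuration.
    for extension in config.extensions:
        name = extension.property_name
        if name in properties and not extension.is_valid(properties[name]):
            issues.append(Issue(name, extension.message))
    # Drop issues for properties the configuration excludes.
    excluded = {exclusion.property_name for exclusion in config.exclusions}
    return [issue for issue in issues if issue.property_name not in excluded]


def explain(issues: list[Issue]) -> str:
    """Turn issues into a short, readable report."""
    if not issues:
        return "No issues found."
    return "\n".join(f"- {i.property_name}: {i.message}" for i in issues)


config = Config(
    exclusions=[Exclusion("resources")],
    extensions=[
        Extension(
            "name",
            "'name' must be lowercase",
            lambda value: isinstance(value, str) and value.islower(),
        )
    ],
)
issues = check({"name": "My Package"}, config)
report = explain(issues)
```

Here the missing `resources` property is excluded by the configuration, while the extension flags the `name` property, so `report` contains a single line about `name`.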