Using the STAC Specification#
Purpose of the STAC Item descriptors#
The STAC Item specification is used to encode the metadata of the resources that may be created, shared and published in the AIOPEN Service. Concretely, this currently applies to trained models and to training data.
Complete examples of STAC Item descriptors for both resource types are provided in the Sharing and Publishing section of the Developer Manual.
A web-based STAC Validator tool has been integrated in the Development Services to facilitate the creation and validation of the STAC Items. See: Using the STAC Validator.
When STAC Item descriptors containing resource metadata are pushed to a GitHub repository monitored by the Service, they are automatically registered either in the user workspace Local Catalogue or in the Service Global Catalogue. The destination Catalogue depends on the git branch to which the file is pushed:
STAC Item descriptors pushed (or merged) in the develop branch are registered in the user workspace Local Catalogue.
STAC Item descriptors pushed (or merged) in the main branch are registered in the Service Global Catalogue.
Registering a resource in the Global Catalogue allows publishing it on the Marketplace where it may be discovered by all the visitors (anonymous and authenticated). It is thus crucial to include in the STAC Item descriptors accurate and sufficient information about the resources.
The following sections describe the information that must, or should, be included in the STAC Item descriptors, and explain how to provide it so that it is properly handled in AIOPEN.
Trained Models Information#
The STAC Items must be valid and must include all the information marked as REQUIRED in the core STAC specification and in the STAC extensions in use. Information indicated as Recommended is not used by the AIOPEN service but is displayed in the Details pages of the Marketplace to help users determine if a given resource meets their needs. It also informs on how data (e.g. satellite imagery) must be pre-processed before being used as inference input to obtain predictions.
Specific content is also required to ensure the resources can be shared or published in the AIOPEN platform. Required information is different if the resource is a trained model or a training dataset.
Required Information#
The following table describes the information that is either required or recommended to be included in the STAC Items representing trained models:
| Element / Field | Required | Comment |
|---|---|---|
| STAC extension | Required | Trained models must use version 1.3.0 of the |
| Asset with role | Required | The |
|  | Required | These properties are defined as required in the |
|  | Required | This field provides the characteristics of the model input (e.g. bands, shape, datatype) and describes the transformation (pre-processing) between the EO data and the input value. |
|  | Required | This field describes model outputs and how to interpret them (e.g. classes). |
|  | Optional however … | This property is required to publish or unpublish a model. If not specified the STAC Item is ignored (see Publish & Unpublish status). |
|  | Optional | If the |
Recommended Information#
Using the mlm extension, it is mandatory to include the list of model inputs and outputs together with their shape and datatype. This information is used by the service to make sure the provided input data complies with the model signature.
The mlm extension allows including more detailed information about the inputs, and in particular indications for pre-processing the input data before submitting it to the model.
Note
Even though it is not mandatory to provide the information described below, it is strongly recommended to do so, as it may be crucial for the future users of your models to prepare their input data appropriately. A section of the Exploitation Manual is dedicated to input data preparation, which relies on the information provided in the STAC descriptors. See: Inference Pipeline: Providing valid input data.
Accelerator
The mlm:accelerator property may be provided at model-level to indicate that a certain type of hardware is required to run inferences.
Not providing a value means that the model does not require any specific accelerator. Using amd64 means that the model may be executed on AMD or Intel CPUs.
Using cuda means the model is compatible with NVIDIA GPUs. Other values are allowed, as indicated in the extension specification.
The property mlm:accelerator_count may be used to indicate the minimum number of accelerator instances required to run the model (e.g. the number of GPUs).
If the indicated accelerator is mandatory for running the model, the property mlm:accelerator_constrained must be set to true. Otherwise it is considered optional.
Input Image Bands
When a model input is a multi-band image, it is recommended to indicate in the STAC Item descriptor the list of bands accepted by the model. Several STAC extensions allow expressing band information, such as eo, raster and STAC Commons.
Only the bands used as input to the model should be included in the bands field.
Virtual bands may be included as well. These are bands resulting from the execution of an expression on other band values. The format and expression fields in the model band objects may be used for that purpose.
Common band names have also been defined in the eo extension, allowing well-known names to be used in the descriptors.
Example model input definition with a name, four bands (one of which is the result of applying an expression), a shape, a list of dimension names, and a data type (source):
"mlm:input": [
{
"name": "RBG+NDVI Bands Sentinel-2 Batch",
"bands": [
{
"name": "B04"
},
{
"name": "B03"
},
{
"name": "B02"
},
{
"name": "NDVI",
"format": "rio-calc",
"expression": "(B08 - B04) / (B08 + B04)"
}
],
"input": {
"shape": [
-1,
4,
64,
64
],
"dim_order": [
"batch",
"channel",
"height",
"width"
],
"data_type": "float32"
}
}
]
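As an illustration of how a virtual band such as the NDVI band above may be derived, the following minimal Python sketch applies the expression (B08 - B04) / (B08 + B04) pixel by pixel. The function name and the plain-list representation of the bands are assumptions made for this example only; how the platform actually evaluates a rio-calc expression is not described here.

```python
def ndvi(b08, b04):
    """Compute (B08 - B04) / (B08 + B04) for each pair of pixel values."""
    result = []
    for nir, red in zip(b08, b04):
        total = nir + red
        # Assumption: pixels where the denominator is zero are mapped to 0.0
        # to avoid a division by zero.
        result.append((nir - red) / total if total != 0 else 0.0)
    return result


# Three pixels: vegetation-like, neutral, and an empty (all-zero) pixel.
print(ndvi([0.5, 0.3, 0.0], [0.1, 0.3, 0.0]))
```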
Depending on the STAC extension used to specify the bands information, the corresponding schema must be added in the STAC Item, for example:
"stac_extensions": [
"https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
"https://stac-extensions.github.io/eo/v1.1.0/schema.json",
"https://stac-extensions.github.io/raster/v1.1.0/schema.json"
]
More information about the model inputs definition may be found in the “mlm” extension specification.
Input Image Normalisation Method
Should the input image data need to be normalised before being submitted to the model, the input field norm_type may be specified to indicate the normalisation method to be applied.
The mlm STAC extension proposes a pre-defined list of normalisation methods.
Depending on the value given to the norm_type field, it may be required to provide additional information by means of a statistics object as specified in STAC Commons. For example:
If the normalisation method is min-max, the statistical values minimum and maximum must be provided.
If the normalisation method is z-score, the statistical values mean and stddev must be provided.
Example model input definition (fragment):
"norm_by_channel": false,
"norm_type": "min-max",
"norm_clip": null,
"statistics": {
"minimum": 0,
"maximum": 1
}
To normalise each channel (band) with channel-wise statistics, the norm_by_channel field must be set to true and one set of statistical values must be provided per channel.
For example (fragment):
"norm_by_channel": true,
"norm_type": "z-score",
"resize_type": null,
"statistics": [
{
"mean": 1354.40546513,
"stddev": 245.71762908
},
{
"mean": 1118.24399958,
"stddev": 333.00778264
}
]
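The following Python sketch shows one possible way a consumer of the descriptor could apply the two normalisation methods discussed above, either with a single statistics object or with one object per channel. It is an illustration under assumptions: the function name and the representation of an image as a list of channels are hypothetical, and the formulas are the usual min-max and z-score definitions.

```python
def normalise(channels, norm_type, statistics, norm_by_channel=False):
    """Normalise a list of channels, each channel being a list of numbers."""
    normalised = []
    for index, values in enumerate(channels):
        # One statistics object per channel, or a single shared object.
        stats = statistics[index] if norm_by_channel else statistics
        if norm_type == "min-max":
            span = stats["maximum"] - stats["minimum"]
            normalised.append([(v - stats["minimum"]) / span for v in values])
        elif norm_type == "z-score":
            normalised.append([(v - stats["mean"]) / stats["stddev"] for v in values])
        else:
            raise ValueError(f"Unsupported norm_type: {norm_type}")
    return normalised


# Shared statistics applied to every channel.
shared = normalise([[0, 5, 10]], "min-max", {"minimum": 0, "maximum": 10})

# Channel-wise statistics: one entry per channel.
per_channel = normalise(
    [[0, 2, 4], [10, 20, 30]],
    "z-score",
    [{"mean": 2, "stddev": 2}, {"mean": 20, "stddev": 10}],
    norm_by_channel=True,
)
```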
Input Image Resize Method
Should the input image data need to be resized before being submitted to the model, the input field resize_type may be specified to indicate the method to be applied.
The mlm STAC extension proposes a pre-defined list of resize methods.
Input Image Scaling
The value_scaling input property may be used to indicate how the values of each channel (band) of an input image must be scaled to fit into the range expected by a model.
The property may contain a single entry, in which case the same operation is applied to all the input channels, or an array containing exactly one entry per channel.
In the latter case, each entry (operation) is applied to the corresponding channel.
The mlm specification defines the following scaling types with their associated parameters:
min-max(minimum, maximum)
Operation: (data - minimum) / (maximum - minimum)
z-score(mean, stddev)
Operation: (data - mean) / stddev
clip(minimum, maximum)
Operation: min(max(data, minimum), maximum)
clip-min(minimum)
Operation: max(data, minimum)
clip-max(maximum)
Operation: min(data, maximum)
offset(value)
Operation: data - value
scale(value)
Operation: data / value
processing(Processing Expression)
Operation: according to the processing:expression
For example, the following fragment indicates that the value 5 must be subtracted from the values in the first channel, that all values in the second channel lower than 0 or higher than 10 must be set to these limits, and that the values in the third channel must be divided by 255. As there must be one entry per channel, this example is only applicable when the input data contains exactly 3 channels.
{
"value_scaling": [
{
"type": "offset",
"value": 5
}, {
"type": "clip",
"minimum": 0,
"maximum": 10
}, {
"type": "scale",
"value": 255
}
]
}
Read more about the value_scaling property in the STAC extension.
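To make the scaling operations concrete, the sketch below implements the operations listed above (except processing, which depends on an external expression) and applies a value_scaling definition to an image represented as a list of channels. The function names and the list-based data layout are assumptions made for this illustration and are not part of the mlm extension or of the AIOPEN service.

```python
def scale_values(values, operation):
    """Apply one value_scaling operation to the values of a single channel."""
    kind = operation["type"]
    if kind == "min-max":
        low, high = operation["minimum"], operation["maximum"]
        return [(v - low) / (high - low) for v in values]
    if kind == "z-score":
        return [(v - operation["mean"]) / operation["stddev"] for v in values]
    if kind == "clip":
        return [min(max(v, operation["minimum"]), operation["maximum"]) for v in values]
    if kind == "clip-min":
        return [max(v, operation["minimum"]) for v in values]
    if kind == "clip-max":
        return [min(v, operation["maximum"]) for v in values]
    if kind == "offset":
        return [v - operation["value"] for v in values]
    if kind == "scale":
        return [v / operation["value"] for v in values]
    raise ValueError(f"Unsupported scaling type: {kind}")


def apply_value_scaling(channels, value_scaling):
    """Apply a single shared operation, or exactly one operation per channel."""
    if len(value_scaling) == 1:
        operations = value_scaling * len(channels)
    elif len(value_scaling) == len(channels):
        operations = value_scaling
    else:
        raise ValueError("Expected one entry, or exactly one entry per channel")
    return [scale_values(values, op) for values, op in zip(channels, operations)]


# Same three operations as in the fragment above, on a 3-channel input.
scaled = apply_value_scaling(
    [[10, 5], [-3, 12], [255, 0]],
    [
        {"type": "offset", "value": 5},
        {"type": "clip", "minimum": 0, "maximum": 10},
        {"type": "scale", "value": 255},
    ],
)
```

Raising an error when the number of entries matches neither case mirrors the rule that the array must contain exactly one entry per channel.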
Input Data Pre-Processing Function
The input field pre_processing_function in the mlm STAC extension allows referring to functions that may be used to pre-process the input image data. The specification proposes three types of functions:
python for referring to a Python module and function,
docker for referring to a Docker image (and tag),
uri for referring to a Python script available through HTTP/HTTPS.
Example pre-processing function specification (source):
"pre_processing_function": {
"format": "python",
"expression": "torchgeo.datamodules.eurosat.EuroSATDataModule.collate_fn"
}
Read more about the processing expression field in the mlm extension specification.
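As a rough illustration of how a python expression given as a dotted path (as in the example above) could be turned into a callable, the sketch below imports the longest importable module prefix and then walks the remaining attribute names. The helper name resolve_callable is hypothetical, and this is only one possible approach: it is not necessarily how the mlm extension tooling or AIOPEN resolves such expressions. The usage line relies on a standard-library function so that it runs without torchgeo installed.

```python
import importlib


def resolve_callable(expression):
    """Resolve a dotted path such as 'package.module.Class.method' to a callable."""
    parts = expression.split(".")
    # Try the longest module path first, then progressively shorter ones.
    for split_at in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split_at])
        try:
            target = importlib.import_module(module_name)
        except ImportError:
            continue
        # The remaining parts are attributes (class, function, method).
        for attribute in parts[split_at:]:
            target = getattr(target, attribute)
        if not callable(target):
            raise TypeError(f"{expression} does not refer to a callable")
        return target
    raise ImportError(f"No importable module found in {expression}")


join = resolve_callable("os.path.join")
```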
Output Data Classification
In addition to the output shape and datatype, the mlm STAC extension allows specifying how the output values must be interpreted semantically. For example, a STAC Item may describe the class associated with each value produced by a classification model.
To do so, the classification:classes field must be used and given a structure that complies with the “classification” STAC extension.
Example class definitions in the output of an urbanisation detection model that distinguishes between city and non-city pixels:
"classification:classes": [
{
"value": 0,
"name": "BACKGROUND",
"description": "Background non-city.",
"color_hint": "000000"
},
{
"value": 1,
"name": "CITY",
"description": "A city is detected.",
"color_hint": "0000FF"
}
]
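A consumer of the descriptor may use these class definitions to translate raw model outputs into readable names. The short Python sketch below illustrates the idea with the two classes shown above; the function name and the flat list of predicted values are assumptions made for the example.

```python
CLASSES = [
    {"value": 0, "name": "BACKGROUND"},
    {"value": 1, "name": "CITY"},
]


def class_names(predictions, classes):
    """Map each predicted value to the name declared in classification:classes."""
    names_by_value = {entry["value"]: entry["name"] for entry in classes}
    return [names_by_value[value] for value in predictions]


print(class_names([0, 1, 1], CLASSES))
```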
When used, the classification schema must be added in the STAC Item descriptor alongside the other extensions in use:
"stac_extensions": [
"https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
"https://stac-extensions.github.io/classification/v2.0.0/schema.json"
]
More information about the model outputs definition may be found in the “mlm” extension specification.
Output Data Post-Processing Function
The output field post_processing_function in the mlm STAC extension allows referring to functions that may be used to post-process the output data. The format is the same as for the pre_processing_function input field.
Training Data Information#
Required Information#
The following table describes the information that is either required or recommended to be included in the STAC Items representing training data:
| Element / Field | Required | Comment |
|---|---|---|
| STAC extension | Required | Training data must specify this STAC extension (see Training Data assets: label or feature). |
| Asset with field | Required | The |
|  | Optional however … | This property is required to publish or unpublish training data. If not specified the STAC Item is ignored (see Publish & Unpublish status). |
|  | Optional | If the |
Publish & Unpublish status#
As explained in the Sharing and Publishing section of the Developer Manual, resources may be published to, and also unpublished from, the catalogues. In order to publish or unpublish a resource, the resource status in the corresponding STAC Item descriptor must be updated and the file must be pushed again to GitHub.
The target status must be specified in properties/status as follows:
"status": "publish" (or "published") to register the resource in the catalogue (and thus publish it to the Marketplace).
"status": "unpublish" (or "unpublished") to unregister the resource from the catalogue (and thus remove it from the Marketplace).
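The status rules above can be summarised with the following Python sketch, which maps a STAC Item to the action expected from the catalogue. The function name and the returned labels are hypothetical. Treating an unrecognised status value like a missing one (the item is ignored) is an assumption, since only the behaviour for a missing status is documented.

```python
def catalogue_action(item):
    """Return 'register', 'unregister' or 'ignore' for a STAC Item dictionary."""
    status = item.get("properties", {}).get("status")
    if status in ("publish", "published"):
        return "register"
    if status in ("unpublish", "unpublished"):
        return "unregister"
    # Missing status: the STAC Item is ignored.
    # Assumption: unknown values are handled the same way.
    return "ignore"


print(catalogue_action({"properties": {"status": "published"}}))
```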
Example to publish a new resource or modify a resource already published (with the same id):
{
"type": "Feature",
"stac_version": "1.0.0",
"id": "model-deforestation",
"properties": {
"title": "Deforestation tracking using U-Net",
"description": "Deforestation-tracking model using Sentinel-2 data",
"status": "published"
}
}
Example to unpublish a resource:
{
"type": "Feature",
"stac_version": "1.0.0",
"id": "model-deforestation",
"properties": {
"title": "Deforestation tracking using U-Net",
"description": "Deforestation-tracking model using Sentinel-2 data",
"status": "unpublished"
}
}
Target catalogue collection#
Resource developers and providers may choose in which catalogue collection they want to register their resources. It is typically the name or organisation of the user publishing the resources but this is not mandatory.
The collection identifier must be provided in the collection field.
For example:
{
"type": "Feature",
"stac_version": "1.0.0",
"id": "model-deforestation",
"collection": "kplabs",
"properties": {
"title": "Deforestation tracking using U-Net",
"description": "Deforestation-tracking model using Sentinel-2 data",
"status": "published"
}
}
Note
The actual identifier of a catalogue collection in which resources are published is <collection-id>:published. This allows the Marketplace to filter and only display the resources located in *:published collections.
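The naming convention described in this note can be sketched in Python as follows. Both helper names are hypothetical, and the wildcard match only illustrates the *:published filter; the Marketplace implementation itself is not documented here.

```python
from fnmatch import fnmatchcase


def published_collection_id(collection_id):
    """Build the identifier of the collection holding published resources."""
    return f"{collection_id}:published"


def is_published_collection(collection_id):
    """Tell whether a collection identifier matches the *:published pattern."""
    return fnmatchcase(collection_id, "*:published")


print(published_collection_id("kplabs"))
```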
Reference to the resource assets#
STAC Item descriptors represent either a trained model or a training dataset, and each descriptor must contain the reference to the actual resource files (assets) stored in one of the user workspace buckets.
Trained Model assets#
Initially, AIOPEN used the ml-model STAC extension to include the reference to the model assets. This extension was deprecated in 2024 and version 1.3.0 of the “mlm” STAC extension must be used instead.
It is thus mandatory to declare the extension URL in the STAC Item descriptor. Optionally, the “file” STAC extension may be used to indicate the size of the model assets.
"stac_extensions": [
"https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
"https://stac-extensions.github.io/file/v2.1.0/schema.json"
]
In both cases, the asset roles are checked: they must contain either “ml-model:inference-runtime” or “mlm:model”.
The link (href) must refer to the “MLmodel” files generated by MLflow. The service will automatically take into account all the files (objects) having the same prefix (thus in the same “folder” and in the “sub-folders”).
Example using the mlm STAC extension and:
experiment ID = 2
run ID = 69f168eaebc04b99af345720d34e6264
model name = model (default value in MLflow)
"assets": {
"inferencing-compose": {
"href": "s3://developer-modelrepo/2/69f168eaebc04b99af345720d34e6264/artifacts/model/MLmodel",
"type": "application/yaml; application=mlflow",
"title": "Model inference runtime definition",
"file:size": 12345,
"roles": [
"mlm:model"
]
}
}
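To illustrate the role check and the prefix rule described above, the following Python sketch selects the assets carrying one of the two accepted roles and derives the common prefix ("folder") from the href of the MLmodel file. The function name and the returned structure are assumptions; the service performs this lookup internally and its implementation is not shown in this manual.

```python
MODEL_ROLES = {"mlm:model", "ml-model:inference-runtime"}


def model_asset_prefixes(item):
    """Return the storage prefix of every asset flagged as a model."""
    prefixes = []
    for asset in item.get("assets", {}).values():
        if MODEL_ROLES & set(asset.get("roles", [])):
            # Drop the file name (MLmodel) and keep the enclosing prefix.
            prefixes.append(asset["href"].rsplit("/", 1)[0] + "/")
    return prefixes


item = {
    "assets": {
        "inferencing-compose": {
            "href": "s3://developer-modelrepo/2/69f168eaebc04b99af345720d34e6264/artifacts/model/MLmodel",
            "roles": ["mlm:model"],
        }
    }
}
print(model_asset_prefixes(item))
```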
Training Data assets: label or feature#
STAC Item descriptors representing training data must use the ml-aoi STAC extension.
It is thus mandatory to declare the extension URL in the STAC Item descriptor. Optionally, the “file” STAC extension may be used to indicate the size of the training data files.
"stac_extensions": [
"https://stac-extensions.github.io/ml-aoi/v0.2.0/schema.json",
"https://stac-extensions.github.io/file/v2.1.0/schema.json"
]
The data files must be referred to using asset entries.
Instead of defining asset roles (to be included in the roles array), the ml-aoi STAC extension defines fields to be included directly in the asset definition. The roles field is then optional.
The field to be used to indicate that an asset contains labels or features is ml-aoi:role, with the value label or feature, respectively.
Multiple assets of type label or feature may coexist in the same STAC Item.
For example:
"assets": {
"data-files": {
"ml-aoi:role": "feature",
"href": "s3://developer-data/path/to/my/dataset",
"type": "image/tiff; application=geotiff",
"title": "Training data files",
"file:size": 1324543
}
}
Resource versioning#
Although not mandatory, it is recommended to version shared and published resources. When specified, the resource version is displayed both in the Marketplace main page (displaying resource cards) and in the resource details pages.
Note that the Marketplace does not allow searching or filtering on the resource version. Also, when multiple versions of the same resource exist, it is up to the user to identify the one to use (most frequently the most recent one).
The version information displayed by the Marketplace must be located in the version field in the properties section of the STAC Items. This field is defined in the “version” STAC extension.
This extension also defines two boolean fields, experimental and deprecated, and a number of relation types, which are not used by the Marketplace but may be used by users who discover the resources through the catalogue API.
STAC Items that include version information should thus indicate that they comply with the related schema:
"stac_extensions": [
"...",
"https://stac-extensions.github.io/version/v1.2.0/schema.json"
]
Version information is included in the STAC Item properties:
"properties": {
"version": "1.2.0",
"...": "..."
}
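Since the Marketplace leaves it to the user to pick a version, a client of the catalogue API could select the most recent item itself, for instance as in the Python sketch below. It assumes that versions are dotted sequences of integers (such as 1.2.0); the function names are hypothetical and other version schemes would need a different comparison.

```python
def version_key(version):
    """Turn '1.10.0' into (1, 10, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_item(items):
    """Return the STAC Item with the highest properties.version."""
    return max(items, key=lambda item: version_key(item["properties"]["version"]))


items = [
    {"id": "a", "properties": {"version": "1.2.0"}},
    {"id": "b", "properties": {"version": "1.10.0"}},
    {"id": "c", "properties": {"version": "1.9.3"}},
]
print(latest_item(items)["id"])
```

Comparing tuples of integers rather than raw strings avoids ranking 1.9.3 above 1.10.0.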
Terms and Conditions (license)#
A user who wants to use (order or execute) a resource that has a license property must accept the license before being allowed to proceed.
The resource license may be specified using a STAC Item property or a link:
Example using the license property field:
{
"type": "Feature",
"stac_version": "1.0.0",
"id": "EuroSAT-subset-train-sample-59-class-SeaLake",
"properties": {
"license": "SPDX-License-Identifier: MIT",
"<other-properties>": "..."
}
}
Example using a license link:
"links": [
{
"rel": "license",
"href": "https://www.gnu.org/licenses/gpl-3.0.html",
"type": "text/html",
"title": "GPL-3.0"
}
]
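A client that needs to show the applicable terms could look up the license in both places, as in the following Python sketch. The function name is hypothetical, and giving precedence to the property over the link is an assumption made for this example, as the order used by the service is not specified.

```python
def find_license(item):
    """Return the license property if present, else the href of a license link."""
    license_id = item.get("properties", {}).get("license")
    if license_id:
        return license_id
    for link in item.get("links", []):
        if link.get("rel") == "license":
            return link["href"]
    return None


item = {
    "properties": {},
    "links": [
        {"rel": "license", "href": "https://www.gnu.org/licenses/gpl-3.0.html"}
    ],
}
print(find_license(item))
```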
Custom thumbnail or logo#
The thumbnail or logo is displayed in the Marketplace. The AIOPEN Platform logo is displayed by default.
Using a custom image is thus a means to attract the attention of users and to visually express the origin of a resource.
The thumbnail or the logo must be an image that recent web browsers can display natively, such as a PNG, JPEG or SVG file.
This image may be provided using either a link or an asset in the STAC Item:
The rel property of the link must be either logo or thumbnail.
The asset must include either logo or thumbnail in its roles.
Example link:
"links": [
{
"rel": "thumbnail",
"href": "https://raw.githubusercontent.com/ai-extensions/stac-data-loader/0.5.0/data/EuroSAT/data/subset/ds/images/remote_sensing/otherDatasets/sentinel_2/png/SeaLake/SeaLake_984.png",
"type": "image/png",
"title": "Preview of SeaLake_984."
}
]
Example assets (the value of the key is not relevant):
"assets": {
"thumbnail": {
"href": "https://raw.githubusercontent.com/ai-extensions/stac-data-loader/0.5.0/data/EuroSAT/data/subset/ds/images/remote_sensing/otherDatasets/sentinel_2/png/SeaLake/SeaLake_984.png",
"type": "image/png",
"title": "Preview of SeaLake_984.",
"roles": [
"thumbnail",
"overview"
]
}
}
"assets": {
"kp-labs-logo-square": {
"href": "https://pbs.twimg.com/profile_images/1097809914813124609/GG3XKCHl_200x200.png",
"type": "image/png",
"title": "KP Labs square logo",
"roles": [
"logo"
]
}
}
"assets": {
"logo": {
"href": "https://aiopen-platform.com/wp-content/uploads/2023/05/IT4I-EN.png",
"type": "image/png",
"title": "Provider logo",
"roles": [
"logo"
]
}
}
Contact persons#
The “contacts” STAC extension is used to specify contact information, such as the name and contact details of the resource developers and providers.
The extension must be declared in the STAC Item descriptor, next to the mlm or the ml-aoi extension:
"stac_extensions": [
"https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
"https://stac-extensions.github.io/contacts/v0.1.1/schema.json"
]
The contact information is included in the STAC Item descriptor under the contacts property. The value is a list (array) and thus allows specifying multiple contacts.
See the full specification of the extension for more details. For example:
"properties": {
"contacts": [
{
"name": "KP Labs",
"organization": "KP Labs",
"phones": [
{
"value": "+12345678933",
"roles": [
"work"
]
}
],
"emails": [
{
"value": "aiopen@example.com",
"roles": [
"work"
]
}
]
}
]
}
Themes#
Assigning themes to resources helps end users choose the model or dataset that best suits their needs.
The AIOPEN Marketplace does not yet allow searching or filtering on theme values; however, this information is provided in the resource details pages.
The “themes” STAC extension is used to specify the themes associated with a resource, expressed as concepts taken from a given scheme.
The extension must be declared in the STAC Item descriptor, next to the mlm or the ml-aoi extension:
"stac_extensions": [
"https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
"https://stac-extensions.github.io/themes/v1.0.0/schema.json"
]
Example themes property in a STAC Item:
"properties": {
"themes": [
{
"concepts": [
{
"id": "Deforestation",
"name": "Deforestation"
}
],
"scheme": "https://en.wikipedia.org/wiki"
},
{
"concepts": [
{
"id": "Category:Deforestation",
"name": "Deforestation"
}
],
"scheme": "https://dbpedia.org/page"
}
]
}
Publication DOIs and Citations#
Including external references to related publications gives users additional insight into the published models and training data and helps them determine whether a given resource is of interest to them.
The “scientific” STAC extension allows providing this information and also allows indicating how the resource must be cited in publications.
The property fields specified in this extension use the sci: prefix.
Although the Marketplace does not allow searching or filtering on DOIs or citations, this information is displayed in the item details pages.
When used, the scientific extension must be declared in the STAC Item descriptor next to the mlm or the ml-aoi extension:
"stac_extensions": [
"https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
"https://stac-extensions.github.io/scientific/v1.0.0/schema.json"
]
Related publications are listed in the STAC Item property sci:publications. Each publication entry must contain the publication Digital Object Identifier (in doi) and a citation string (free text).
If the current resource itself has a DOI, it may be specified either in the property sci:doi, or as a hyperlink in an item link with the relation type cite-as.
Example use of scientific fields and links in a STAC Item:
"properties": {
"id": "unique-item-id",
"sci:doi": "10.5061/dryad.s2v81.2/27.2",
"sci:publications": [
{
"doi": "10.5061/dryad.s2v81.2",
"citation": "Vega GC, Pertierra LR, Olalla-Tárraga MÁ (2017) Data from: MERRAclim, a high-resolution global dataset of remotely sensed bioclimatic variables for ecological modelling. Dryad Digital Repository."
},
{
"doi": "10.1038/sdata.2017.78",
"citation": "Vega GC, Pertierra LR, Olalla-Tárraga MÁ (2017) MERRAclim, a high-resolution global dataset of remotely sensed bioclimatic variables for ecological modelling. Scientific Data 4: 170078."
}
]
},
"links": [
{
"rel": "cite-as",
"href": "https://doi.org/10.5061/dryad.s2v81.2"
}
]