Using the STAC Specification#

Purpose of the STAC Item descriptors#

The STAC Item specification is used to encode the metadata of the resources that may be created, shared and published in the AIOPEN Service. Concretely, this currently applies to trained models and to training data.

Complete examples of STAC Item descriptors for both resource types are provided in the Sharing and Publishing section of the Developer Manual.

A web-based STAC Validator tool has been integrated in the Development Services to facilitate the creation and validation of the STAC Items. See: Using the STAC Validator.

When STAC Item descriptors containing resources metadata are pushed in a GitHub repository monitored by the Service, these are automatically registered either in the user workspace Local Catalogue, or in the Service Global Catalogue. The destination Catalogue depends on the git branch in which the file is pushed:

STAC Item descriptors pushed (or merged) in the develop branch are registered in the user workspace Local Catalogue.
STAC Item descriptors pushed (or merged) in the main branch are registered in the Service Global Catalogue.

Registering a resource in the Global Catalogue allows publishing it on the Marketplace where it may be discovered by all the visitors (anonymous and authenticated). It is thus crucial to include in the STAC Item descriptors accurate and sufficient information about the resources.

The following sections describe the different pieces of information that must, or should, be included in the STAC Item descriptors, and explain how this must be done for the descriptors to be properly managed in AIOPEN.

Trained Models Information#

The STAC Items must be valid and must include all the information marked as REQUIRED in the core STAC specification and in the STAC extensions in use. Information indicated as Recommended is not used by the AIOPEN service but is displayed in the Details pages of the Marketplace to help users determine if a given resource meets their needs. It also indicates how data (e.g. satellite imagery) must be pre-processed before being used as inference input to obtain predictions.

Specific content is also required to ensure the resources can be shared or published in the AIOPEN platform. The required information differs depending on whether the resource is a trained model or a training dataset.

Required Information#

The following table describes the information that is either required or recommended to be included in the STAC Items representing trained models:

Element / Field	Required	Comment
STAC extension `mlm`	Required	Trained models must use version 1.3.0 of the `mlm` STAC extension (see Trained Model assets).
Asset with role `mlm:model`	Required	The `href` of this asset must refer to the `MLmodel` file generated by MLflow and stored in a workspace S3 bucket (see Trained Model assets).
`properties/mlm:name` `properties/mlm:architecture` `properties/mlm:tasks`	Required	These properties are defined as required in the `mlm` extension specification.
`properties/mlm:input`	Required	This field provides the characteristics of the model input (e.g. bands, shape, datatype) and describes the transformation (pre-processing) between the EO data and the input value.
`properties/mlm:output`	Required	This field describes model outputs and how to interpret them (e.g. classes).
`properties/status`	Optional however …	This property is required to publish or unpublish a model. If not specified the STAC Item is ignored (see Publish & Unpublish status).
`properties/type`	Optional	If the `type` property is also provided, this must comply with the STAC extension. The `type` must thus have the value `model`.
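
The table above can be turned into a quick pre-flight check before pushing a descriptor. The following is a minimal sketch, not the STAC Validator integrated in the Development Services: the function name `missing_model_information` and the wording of the reported problems are hypothetical, and the check only looks at the presence of the listed elements, not at their content.

```python
# Hypothetical helper: checks presence of the elements listed in the table above.
MLM_SCHEMA = "https://stac-extensions.github.io/mlm/v1.3.0/schema.json"
REQUIRED_PROPERTIES = (
    "mlm:name",
    "mlm:architecture",
    "mlm:tasks",
    "mlm:input",
    "mlm:output",
)


def missing_model_information(item: dict) -> list:
    """Return a list of required elements that are absent from the STAC Item."""
    problems = []
    if MLM_SCHEMA not in item.get("stac_extensions", []):
        problems.append("stac_extensions: mlm v1.3.0 schema not declared")
    properties = item.get("properties", {})
    for name in REQUIRED_PROPERTIES:
        if name not in properties:
            problems.append(f"properties/{name}")
    assets = item.get("assets", {}).values()
    if not any("mlm:model" in asset.get("roles", []) for asset in assets):
        problems.append("asset with role mlm:model")
    # Optional in STAC terms, but without it the item is ignored by the Service.
    if "status" not in properties:
        problems.append("properties/status (item would be ignored)")
    return problems
```

An empty list means that every element of the table is present; it does not replace a full schema validation.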

Recommended Information#

Using the mlm extension, it is mandatory to include the list of model inputs and outputs together with their shape and datatype. This information is used by the service to make sure the provided input data complies with the model signature.

The mlm extension allows including more detailed information about the inputs, and in particular indications for pre-processing the input data before submitting it to the model.

Note

Even though it is not mandatory to provide the information described below, it is strongly recommended to do so, as it may be crucial to allow the future users of your models to prepare their input data appropriately. A section is dedicated to the input data preparation in the Exploitation Manual. This preparation relies on the information provided in the STAC descriptors. See: Inference Pipeline: Providing valid input data.

Accelerator

The mlm:accelerator property may be provided at model level to indicate that a certain type of hardware is required to run inferences. Not providing a value means that the model does not require any specific accelerator. Using amd64 means that the model may be executed on AMD or Intel CPUs. Using cuda means the model is compatible with NVIDIA GPUs. Other values are allowed, as indicated in the extension specification.

The property mlm:accelerator_count may be used to indicate the minimum number of accelerator instances required to run the model (e.g. the number of GPUs). If the indicated accelerator is mandatory for running the model, the property mlm:accelerator_constrained must be set to true. Otherwise it is considered optional.

Input Image Bands

When a model input is a multi-band image, it is recommended to indicate in the STAC Item descriptor the list of bands accepted by the model. Several STAC extensions allow expressing band information, such as eo, raster and STAC Commons.

Only the bands used as input to the model should be included in the bands field.

Virtual bands may be included as well. These are bands resulting from the execution of an expression on other band values. The format and expression fields in the model band objects may be used for that purpose.

Common band names are also defined in the eo extension, so that well-known names can be used in the descriptors.

Example model input definition with a name, four bands (one of which is the result of applying an expression), a shape, a list of dimension names, and a data type (source):

"mlm:input": [
  {
    "name": "RBG+NDVI Bands Sentinel-2 Batch",
    "bands": [
      {
        "name": "B04"
      },
      {
        "name": "B03"
      },
      {
        "name": "B02"
      },
      {
        "name": "NDVI",
        "format": "rio-calc",
        "expression": "(B08 - B04) / (B08 + B04)"
      }
    ],
    "input": {
      "shape": [
        -1,
        4,
        64,
        64
      ],
      "dim_order": [
        "batch",
        "channel",
        "height",
        "width"
      ],
      "data_type": "float32"
    }
  }
]
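
To make the notion of virtual band concrete, the sketch below evaluates the NDVI expression of the example for a single pixel and stacks the result with the three other bands. It is only an illustration in plain Python: the function names are hypothetical, the input is assumed to be a dictionary of reflectance values keyed by band name, and returning 0 when both bands are 0 is an assumption rather than something prescribed by the extension.

```python
def ndvi(b08: float, b04: float) -> float:
    """Evaluate the expression (B08 - B04) / (B08 + B04) for one pixel."""
    denominator = b08 + b04
    if denominator == 0:
        return 0.0  # assumption: pixels without signal are mapped to 0
    return (b08 - b04) / denominator


def stack_pixel(bands: dict) -> list:
    """Build the model input channels [B04, B03, B02, NDVI] for one pixel."""
    return [
        bands["B04"],
        bands["B03"],
        bands["B02"],
        ndvi(bands["B08"], bands["B04"]),
    ]


pixel = stack_pixel({"B02": 0.1, "B03": 0.2, "B04": 0.25, "B08": 0.75})
```

Note that B08 is needed to compute the virtual band even though it is not itself passed to the model.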

Depending on the STAC extension used to specify the bands information, the corresponding schema must be added in the STAC Item, for example:

"stac_extensions": [
   "https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
   "https://stac-extensions.github.io/eo/v1.1.0/schema.json",
   "https://stac-extensions.github.io/raster/v1.1.0/schema.json"
]

More information about the model inputs definition may be found in the “mlm” extension specification.

Input Image Normalisation Method

Should the input image data need to be normalised before being submitted to the model, the input field norm_type may be specified to indicate the normalisation method to be applied.

The mlm STAC extension proposes a pre-defined list of normalisation methods.

Depending on the value given to the norm_type field, it may be required to provide additional information by means of a statistics object as specified in STAC Commons. For example:

If the normalisation method is min-max, the statistical values minimum and maximum must be provided.
If the normalisation method is z-score, the statistical values mean and stddev must be provided.

Example model input definition (fragment):

"norm_by_channel": false,
"norm_type": "min-max",
"norm_clip": null,
"statistics": {
  "minimum": 0,
  "maximum": 1
}

To normalise each channel (band) with channel-wise statistics, the norm_by_channel field must be set to true and one set of statistical values must be provided per channel.

For example (fragment):

"norm_by_channel": true,
"norm_type": "z-score",
"resize_type": null,
"statistics": [
  {
    "mean": 1354.40546513,
    "stddev": 245.71762908
  },
  {
    "mean": 1118.24399958,
    "stddev": 333.00778264
  }
]
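
The two fragments above can be read as follows on the consumer side. This is a minimal sketch that only covers the min-max and z-score methods and works on a single pixel given as a list of channel values; the function names are hypothetical and the other normalisation methods of the extension are deliberately left out.

```python
def normalise(value: float, norm_type: str, stats: dict) -> float:
    """Normalise one value using the statistics required by the method."""
    if norm_type == "min-max":
        return (value - stats["minimum"]) / (stats["maximum"] - stats["minimum"])
    if norm_type == "z-score":
        return (value - stats["mean"]) / stats["stddev"]
    raise ValueError(f"normalisation method not covered by this sketch: {norm_type}")


def normalise_pixel(pixel: list, norm_type: str, statistics, norm_by_channel: bool) -> list:
    """Apply either one shared statistics object or one object per channel."""
    if norm_by_channel:
        if len(statistics) != len(pixel):
            raise ValueError("one set of statistics is required per channel")
        return [normalise(v, norm_type, s) for v, s in zip(pixel, statistics)]
    return [normalise(v, norm_type, statistics) for v in pixel]


shared = normalise_pixel([0.0, 0.5, 1.0], "min-max", {"minimum": 0, "maximum": 1}, False)
per_channel = normalise_pixel(
    [12.0, 30.0],
    "z-score",
    [{"mean": 10, "stddev": 2}, {"mean": 20, "stddev": 5}],
    True,
)
```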

Input Image Resize Method

Should the input image data need to be resized before being submitted to the model, the input field resize_type may be specified to indicate the method to be applied.

The mlm STAC extension proposes a pre-defined list of resize methods.

Input Image Scaling

The value_scaling input property may be used to indicate how the values of each channel (band) of an input image must be scaled to fit into the range expected by a model. The property may contain a single entry, in which case the same operation is applied to all the input channels, or an array containing exactly one entry per channel. In the latter case, each entry (operation) is applied to the corresponding channel.

The mlm specification defines the following scaling types with their associated parameters:

min-max(minimum, maximum) Operation: (data - minimum) / (maximum - minimum)
z-score(mean, stddev) Operation: (data - mean) / stddev
clip(minimum, maximum) Operation: min(max(data, minimum), maximum)
clip-min(minimum) Operation: max(data, minimum)
clip-max(maximum) Operation: min(data, maximum)
offset(value) Operation: data - value
scale(value) Operation: data / value
processing(Processing Expression) Operation: according to the processing:expression

For example, the following fragment indicates that the value 5 must be subtracted from the values in the first channel, that the values in the second channel lower than 0 or higher than 10 must be set to these limits, and that the values in the third channel must be divided by 255. As there must be one entry per channel, this example is only applicable when the input data contains exactly 3 channels.

{
  "value_scaling": [
    {
      "type": "offset",
      "value": 5
    }, {
      "type": "clip",
      "minimum": 0,
      "maximum": 10
    }, {
      "type": "scale",
      "value": 255
    }
  ]
}
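
The operations listed above translate almost directly into code. The following sketch applies a value_scaling array to one pixel (a list of channel values); the function names are hypothetical, a one-element array is treated as "the same operation for all channels" (an assumption about how a single entry is encoded), and the processing type is not handled.

```python
def apply_scaling(data: float, op: dict) -> float:
    """Apply one scaling operation, following the formulas listed above."""
    kind = op["type"]
    if kind == "min-max":
        return (data - op["minimum"]) / (op["maximum"] - op["minimum"])
    if kind == "z-score":
        return (data - op["mean"]) / op["stddev"]
    if kind == "clip":
        return min(max(data, op["minimum"]), op["maximum"])
    if kind == "clip-min":
        return max(data, op["minimum"])
    if kind == "clip-max":
        return min(data, op["maximum"])
    if kind == "offset":
        return data - op["value"]
    if kind == "scale":
        return data / op["value"]
    raise ValueError(f"scaling type not covered by this sketch: {kind}")


def scale_pixel(pixel: list, value_scaling: list) -> list:
    """Apply a single shared operation, or exactly one operation per channel."""
    if len(value_scaling) == 1:
        operations = value_scaling * len(pixel)
    elif len(value_scaling) == len(pixel):
        operations = value_scaling
    else:
        raise ValueError("value_scaling needs one entry, or one entry per channel")
    return [apply_scaling(v, op) for v, op in zip(pixel, operations)]


scaled = scale_pixel(
    [7.0, 12.0, 51.0],
    [
        {"type": "offset", "value": 5},
        {"type": "clip", "minimum": 0, "maximum": 10},
        {"type": "scale", "value": 255},
    ],
)
```

With the three operations of the fragment above, the first channel is shifted by 5, the second is clipped to the range 0 to 10, and the third is divided by 255.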

Read more about the value_scaling property in the STAC extension.

Input Data Pre-Processing Function

The input field pre_processing_function in the mlm STAC extension allows referring to functions that may be used to pre-process the input image data. The specification proposes three types of functions:

python for referring to a Python module and function,
docker for referring to a Docker image (and tag),
uri for referring to a Python script available through HTTP/HTTPS.

Example pre-processing function specification (source):

"pre_processing_function": {
  "format": "python",
  "expression": "torchgeo.datamodules.eurosat.EuroSATDataModule.collate_fn"
}
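
For the python format, a consumer has to turn the dotted expression into a callable. The sketch below shows one possible way to do this with the standard importlib module; the function name is hypothetical, and the strategy of trying the longest importable module prefix first is an assumption made for this illustration, not behaviour defined by the extension or by AIOPEN. Since this imports and later runs code named in a descriptor, it should only be used with descriptors from trusted sources.

```python
import importlib


def resolve_python_expression(expression: str):
    """Resolve a dotted path such as 'package.module.Class.function'."""
    parts = expression.split(".")
    # Try the longest module prefix first, then walk the remaining attributes.
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            target = importlib.import_module(module_name)
        except ImportError:
            continue
        for attribute in parts[split:]:
            target = getattr(target, attribute)
        return target
    raise ImportError(f"cannot resolve {expression!r}")


# Standard-library stand-in for a real pre-processing function.
to_json = resolve_python_expression("json.dumps")
```

Resolving the expression of the example above works the same way, provided the torchgeo package is installed in the environment.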

Read more about the processing expression field in the mlm extension specification.

Output Data Classification

In addition to the output shape and datatype, the mlm STAC extension allows specifying how the output values must be interpreted semantically. For example a STAC Item may describe the class associated to each value produced by a classification model.

To do so, the classification:classes field must be used and given a structure that complies with the “classification” STAC extension.

Example class definitions in the output of an urbanisation detection model that distinguishes between city and non-city pixels:

"classification:classes": [
  {
    "value": 0,
    "name": "BACKGROUND",
    "description": "Background non-city.",
    "color_hint": "000000"
  },
  {
    "value": 1,
    "name": "CITY",
    "description": "A city is detected.",
    "color_hint": "0000FF"
  }
]
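
Such class definitions let a consumer translate raw model outputs into readable labels. The following is a minimal sketch under the assumption that the predictions are available as a flat list of integer class values; the function name and the "UNKNOWN" fallback label are hypothetical.

```python
CLASSES = [
    {"value": 0, "name": "BACKGROUND", "description": "Background non-city.", "color_hint": "000000"},
    {"value": 1, "name": "CITY", "description": "A city is detected.", "color_hint": "0000FF"},
]


def class_names(predictions: list, classes: list) -> list:
    """Map each predicted value to the name declared in classification:classes."""
    lookup = {entry["value"]: entry["name"] for entry in classes}
    # Values without a declared class get a placeholder label (assumption).
    return [lookup.get(value, "UNKNOWN") for value in predictions]


labels = class_names([0, 1, 1, 2], CLASSES)
```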

When used, the classification schema must be added in the STAC Item descriptor alongside the other extensions in use:

"stac_extensions": [
  "https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
  "https://stac-extensions.github.io/classification/v2.0.0/schema.json"
]

More information about the model outputs definition may be found in the “mlm” extension specification.

Output Data Post-Processing Function

The output field post_processing_function in the mlm STAC extension allows referring to functions that may be used to post-process the output data. The format is the same as for the pre_processing_function input field.

Training Data Information#

Required Information#

The following table describes the information that is either required or recommended to be included in the STAC Items representing training data:

Element / Field	Required	Comment
STAC extension `ml-aoi`	Required	Training data must specify this STAC extension (see Training Data assets: label or feature).
Asset with field `ml-aoi:role` and value `feature` or `label`	Required	The `href` of this asset must refer to the actual data file or folder in a workspace S3 bucket (see Training Data assets: label or feature).
`properties/status`	Optional however …	This property is required to publish or unpublish training data. If not specified the STAC Item is ignored (see Publish & Unpublish status).
`properties/type`	Optional	If the `type` property is also provided, this must comply with the STAC extension. The `type` must thus have the value `TrainingData`.

Publish & Unpublish status#

As explained in the Sharing and Publishing section of the Developer Manual, resources may be published to, but also unpublished from, the catalogues. In order to publish or unpublish a resource, the resource status in the corresponding STAC Item descriptor must be updated and the file must be pushed again to GitHub.

The target status must be specified in properties/status as follows:

"status": "publish" (or "published") to register the resource in the catalogue (and thus publish to the Marketplace).
"status": "unpublish" (or "unpublished") to unregister the resource from the catalogue (and thus remove from the Marketplace).

Example to publish a new resource or modify a resource already published (with the same id):

{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "model-deforestation",
  "properties": {
    "title": "Deforestation tracking using U-Net",
    "description": "Deforestation-tracking model using Sentinel-2 data",
    "status": "published"
  }
}

Example to unpublish a resource:

{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "model-deforestation",
  "properties": {
    "title": "Deforestation tracking using U-Net",
    "description": "Deforestation-tracking model using Sentinel-2 data",
    "status": "unpublished"
  }
}
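
The accepted status values can be summarised as a small lookup. This sketch is only an illustration of the rules above: the function name and the returned action labels are hypothetical, and treating an unrecognised value like a missing one is an assumption.

```python
PUBLISH_VALUES = {"publish", "published"}
UNPUBLISH_VALUES = {"unpublish", "unpublished"}


def requested_action(item: dict):
    """Return 'register', 'unregister', or None when the item would be ignored."""
    status = item.get("properties", {}).get("status")
    if status in PUBLISH_VALUES:
        return "register"
    if status in UNPUBLISH_VALUES:
        return "unregister"
    # No status: the item is ignored. Unknown values are handled the same
    # way here, which is an assumption of this sketch.
    return None
```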

Target catalogue collection#

Resource developers and providers may choose in which catalogue collection they want to register their resources. It is typically the name or organisation of the user publishing the resources but this is not mandatory.

The collection identifier must be provided in the collection field.

For example:

{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "model-deforestation",
  "collection": "kplabs",
  "properties": {
    "title": "Deforestation tracking using U-Net",
    "description": "Deforestation-tracking model using Sentinel-2 data",
    "status": "published"
  }
}

Note

The identifier of the catalogue collections in which resources are published is in reality <collection-id>:published. This allows the Marketplace to filter and only display the resources located in *:published collections.

Reference to the resource assets#

STAC Item descriptors represent either a trained model or a training dataset, and each descriptor must contain the reference to the actual resource files (assets) stored in one of the user workspace buckets.

Trained Model assets#

Initially, AIOPEN used the ml-model STAC extension to include the reference to the model assets. This extension was deprecated in 2024, and version 1.3.0 of the “mlm” STAC extension must be used instead.

It is thus mandatory to declare the extension URL in the STAC Item descriptor. Optionally, the “file” STAC extension may be used to indicate the size of the model assets.

"stac_extensions": [
  "https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
  "https://stac-extensions.github.io/file/v2.1.0/schema.json"
]

In both cases, the asset “roles” are checked: they must contain either “ml-model:inference-runtime” or “mlm:model”.

The link (href) must refer to the “MLmodel” files generated by MLflow. The service will automatically take into account all the files (objects) having the same prefix (thus in the same “folder” and in the “sub-folders”).

Example using the mlm STAC extension and:

experiment ID = 2
run ID = 69f168eaebc04b99af345720d34e6264
model name = model (default value in MLflow)

"assets": {
  "inferencing-compose": {
    "href": "s3://developer-modelrepo/2/69f168eaebc04b99af345720d34e6264/artifacts/model/MLmodel",
    "type": "application/yaml; application=mlflow",
    "title": "Model inference runtime definition",
    "file:size": 12345,
    "roles": [
      "mlm:model"
    ]
  }
}
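
The href of the example can be assembled from the three values listed above. The sketch below assumes that the path layout `<experiment ID>/<run ID>/artifacts/<model name>/MLmodel` seen in the example applies to the workspace bucket in use; the function names are hypothetical, and the bucket name is taken from the example.

```python
def mlmodel_href(bucket: str, experiment_id: str, run_id: str, model_name: str = "model") -> str:
    """Build the S3 URL of the MLmodel file, following the layout of the example."""
    return f"s3://{bucket}/{experiment_id}/{run_id}/artifacts/{model_name}/MLmodel"


def model_prefix(href: str) -> str:
    """Return the prefix ("folder") shared by all the files of the model."""
    return href.rsplit("/", 1)[0] + "/"


href = mlmodel_href("developer-modelrepo", "2", "69f168eaebc04b99af345720d34e6264")
prefix = model_prefix(href)
```

All the objects whose key starts with the returned prefix are the ones the service takes into account together with the MLmodel file.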

Training Data assets: label or feature#

STAC Item descriptors representing training data must use the ml-aoi STAC extension.

It is thus mandatory to declare the extension URL in the STAC Item descriptor. Optionally, the “file” STAC extension may be used to indicate the size of the training data files.

"stac_extensions": [
  "https://stac-extensions.github.io/ml-aoi/v0.2.0/schema.json",
  "https://stac-extensions.github.io/file/v2.1.0/schema.json"
]

The data files must be referred to using asset entries.

Instead of defining asset roles (to be included in the roles array), the ml-aoi STAC extension defines fields to be included directly in the asset definition. The “roles” field is then optional.

The field to be used to indicate that an asset contains labels or features is ml-aoi:role, with the value label or feature, respectively.

Multiple assets of type label or feature may coexist in the same STAC Item.

For example:

"assets": {
  "data-files": {
    "ml-aoi:role": "feature",
    "href": "s3://developer-data/path/to/my/dataset",
    "type": "image/tiff; application=geotiff",
    "title": "Training data files",
    "file:size": 1324543
  }
}
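
Because the role is carried by the ml-aoi:role field rather than by the roles array, code that inspects training data descriptors has to look at that field. The following sketch groups the asset keys of an item by role; the function name is hypothetical and assets without a recognised ml-aoi:role are simply skipped.

```python
def training_data_assets(item: dict) -> dict:
    """Group asset keys by their ml-aoi:role value ('feature' or 'label')."""
    grouped = {"feature": [], "label": []}
    for key, asset in item.get("assets", {}).items():
        role = asset.get("ml-aoi:role")
        if role in grouped:
            grouped[role].append(key)
    return grouped


example_item = {
    "assets": {
        "data-files": {"ml-aoi:role": "feature", "href": "s3://developer-data/path/to/my/dataset"},
        "thumbnail": {"href": "https://example.com/preview.png", "roles": ["thumbnail"]},
    }
}
grouped_assets = training_data_assets(example_item)
```

An item for which both lists are empty does not satisfy the requirement described above.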

Resource versioning#

Although not mandatory, it is recommended to version shared and published resources. When specified, the resource version is displayed in both the Marketplace main page (displaying resource cards) and in the resource details pages.

Note that the Marketplace does not allow searching or filtering on the resource version. Also, when multiple versions of the same resource exist, it is up to the user to identify the one to use (most frequently the most recent one).

The version information displayed by the Marketplace must be located in the version field in the properties section of the STAC Items. This field is defined in the “version” STAC extension.

This extension also defines two boolean fields experimental and deprecated and a number of relation types, which are not used by the Marketplace, but may be used by the users who are discovering the resources using the catalogue API.

STAC Items that include version information should thus indicate that they comply with the related schema:

"stac_extensions": [
  "...",
  "https://stac-extensions.github.io/version/v1.2.0/schema.json"
]

Version information is included in the STAC Item properties:

"properties": {
  "version": "1.2.0",
  "...": "..."
}
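
Since the Marketplace does not sort or filter on versions, users of the catalogue API may want to pick the most recent item themselves. The sketch below assumes that versions are dot-separated integers such as "1.2.0", which the “version” extension does not enforce; the function names are hypothetical.

```python
def version_key(version: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def most_recent(items: list) -> dict:
    """Return the item with the highest properties/version value."""
    return max(items, key=lambda item: version_key(item["properties"]["version"]))


candidates = [
    {"id": "a", "properties": {"version": "1.2.0"}},
    {"id": "b", "properties": {"version": "1.10.0"}},
    {"id": "c", "properties": {"version": "1.9.3"}},
]
latest = most_recent(candidates)
```

Comparing the version strings directly would rank "1.9.3" above "1.10.0", which is why the values are converted to tuples of integers first.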

Terms and Conditions (license)#

A user who wants to use (order or execute) a resource that has a license property must accept the license before being allowed to proceed.

The resource license may be specified using a STAC Item property or a link:

Example using the license property field:

{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "EuroSAT-subset-train-sample-59-class-SeaLake",
  "properties": {
    "license": "SPDX-License-Identifier: MIT",
    "<other-properties>": "..."
  }
}

Example using a license link:

"links": [
  {
    "rel": "license",
    "href": "https://www.gnu.org/licenses/gpl-3.0.html",
    "type": "text/html",
    "title": "GPL-3.0"
  }
]
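
A client that needs to know whether a resource carries a license can look in both places. This is a minimal sketch: the function name is hypothetical, and checking the property before the links is an arbitrary choice of this example, not a precedence rule stated by AIOPEN.

```python
def find_license(item: dict):
    """Return ('property', value), ('link', href), or None if no license is given."""
    value = item.get("properties", {}).get("license")
    if value:
        return ("property", value)
    for link in item.get("links", []):
        if link.get("rel") == "license":
            return ("link", link.get("href"))
    return None


from_link = find_license(
    {"links": [{"rel": "license", "href": "https://www.gnu.org/licenses/gpl-3.0.html"}]}
)
```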

Custom thumbnail or logo#

The thumbnail or logo is displayed in the Marketplace. The AIOPEN Platform logo is displayed by default.

Using a custom image is thus a means to attract the attention of users and to visually express the origin of a resource.

The thumbnail or the logo must be an image that may be natively displayed by recent web browsers, such as PNGs, JPGs, SVGs, etc.

This image may be provided using either a link or an asset in the STAC Item:

The rel property of the link must be either logo or thumbnail.
The asset must include either logo or thumbnail in its roles.

Example link:

"links": [
  {
    "rel": "thumbnail",
    "href": "https://raw.githubusercontent.com/ai-extensions/stac-data-loader/0.5.0/data/EuroSAT/data/subset/ds/images/remote_sensing/otherDatasets/sentinel_2/png/SeaLake/SeaLake_984.png",
    "type": "image/png",
    "title": "Preview of SeaLake_984."
  }
]

Example assets (the value of the key is not relevant):

"assets": {
  "thumbnail": {
    "href": "https://raw.githubusercontent.com/ai-extensions/stac-data-loader/0.5.0/data/EuroSAT/data/subset/ds/images/remote_sensing/otherDatasets/sentinel_2/png/SeaLake/SeaLake_984.png",
    "type": "image/png",
    "title": "Preview of SeaLake_984.",
    "roles": [
      "thumbnail",
      "overview"
    ]
  }
}

"assets": {
  "kp-labs-logo-square": {
    "href": "https://pbs.twimg.com/profile_images/1097809914813124609/GG3XKCHl_200x200.png",
    "type": "image/png",
    "title": "KP Labs square logo",
    "roles": [
      "logo"
    ]
  }
}

"assets": {
  "logo": {
    "href": "https://aiopen-platform.com/wp-content/uploads/2023/05/IT4I-EN.png",
    "type": "image/png",
    "title": "Provider logo",
    "roles": [
      "logo"
    ]
  }
}
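
The two ways of providing an image can be resolved with a short lookup. In this sketch the function name is hypothetical and links are inspected before assets, which is an assumption; the text above does not say which one the Marketplace prefers when both are present.

```python
IMAGE_KINDS = ("logo", "thumbnail")


def find_image_href(item: dict):
    """Return the href of the first logo or thumbnail found, or None."""
    for link in item.get("links", []):
        if link.get("rel") in IMAGE_KINDS:
            return link.get("href")
    for asset in item.get("assets", {}).values():
        if any(role in IMAGE_KINDS for role in asset.get("roles", [])):
            return asset.get("href")
    # None: the Marketplace would display the default AIOPEN Platform logo.
    return None


logo_href = find_image_href(
    {
        "assets": {
            "kp-labs-logo-square": {
                "href": "https://pbs.twimg.com/profile_images/1097809914813124609/GG3XKCHl_200x200.png",
                "roles": ["logo"],
            }
        }
    }
)
```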

Contact persons#

The “contacts” STAC extension is used to specify contact information such as the name and contact details of the resource developers and providers.

The extension must be declared in the STAC Item descriptor, next to the mlm or the ml-aoi extension:

"stac_extensions": [
  "https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
  "https://stac-extensions.github.io/contacts/v0.1.1/schema.json"
]

The contact information is included in the STAC Item descriptor under the contacts property. The value is a list (array) and thus allows specifying multiple contacts.

See the full specification of the extension for more details. For example:

"properties": {
  "contacts": [
    {
      "name": "KP Labs",
      "organization": "KP Labs",
      "phones": [
        {
          "value": "+12345678933",
          "roles": [
            "work"
          ]
        }
      ],
      "emails": [
        {
          "value": "aiopen@example.com",
          "roles": [
            "work"
          ]
        }
      ]
    }
  ]
}

Themes#

Assigning themes to resources helps the end users in choosing the model or dataset that best suits their needs.

The AIOPEN Marketplace does not yet allow searching or filtering on theme values; however, this information is provided in the resource details pages.

The “themes” STAC extension is used to associate a resource with one or more themes, each expressed as a list of concepts taken from a given scheme.

The extension must be declared in the STAC Item descriptor, next to the mlm or the ml-aoi extension:

"stac_extensions": [
  "https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
  "https://stac-extensions.github.io/themes/v1.0.0/schema.json"
]

Example themes property in a STAC Item:

"properties": {
  "themes": [
    {
      "concepts": [
        {
          "id": "Deforestation",
          "name": "Deforestation"
        }
      ],
      "scheme": "https://en.wikipedia.org/wiki"
    },
    {
      "concepts": [
        {
          "id": "Category:Deforestation",
          "name": "Deforestation"
        }
      ],
      "scheme": "https://dbpedia.org/page"
    }
  ]
}

Publication DOIs and Citations#

Including external references to related publications gives users additional insight into the published models and training data and helps them determine whether a given resource is of interest to them.

The “scientific” STAC extension allows providing this information and also allows indicating how the resource must be cited in publications.

The properties fields specified in this extension use the sci: prefix.

Although the Marketplace does not allow searching or filtering on DOIs or citations, this information is displayed in the item details pages.

When used, the scientific extension must be declared in the STAC Item descriptor next to the mlm or the ml-aoi extension:

"stac_extensions": [
  "https://stac-extensions.github.io/mlm/v1.3.0/schema.json",
  "https://stac-extensions.github.io/scientific/v1.0.0/schema.json"
]

Related publications are listed in the STAC Item property sci:publications. Each publication entry must contain the publication Digital Object Identifier (in doi) and a citation string (free text).

If the current resource itself has a DOI, this may be specified either in the property sci:doi, or as a hyperlink in an item link with the relation type (rel) cite-as.

Example use of scientific fields and links in a STAC Item:

"properties": {
  "id": "unique-item-id",
  "sci:doi": "10.5061/dryad.s2v81.2/27.2",
  "sci:publications": [
    {
      "doi": "10.5061/dryad.s2v81.2",
      "citation": "Vega GC, Pertierra LR, Olalla-Tárraga MÁ (2017) Data from: MERRAclim, a high-resolution global dataset of remotely sensed bioclimatic variables for ecological modelling. Dryad Digital Repository."
    },
    {
      "doi": "10.1038/sdata.2017.78",
      "citation": "Vega GC, Pertierra LR, Olalla-Tárraga MÁ (2017) MERRAclim, a high-resolution global dataset of remotely sensed bioclimatic variables for ecological modelling. Scientific Data 4: 170078."
    }
  ]
},
"links": [
  {
    "rel": "cite-as",
    "href": "https://doi.org/10.5061/dryad.s2v81.2"
  }
]
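
DOIs are stored without the resolver prefix, so a client that wants clickable references has to build the URLs itself. The following sketch does so with the https://doi.org/ resolver used in the cite-as link above; the function names are hypothetical and entries without a doi field are skipped.

```python
def doi_url(doi: str) -> str:
    """Build a resolvable URL from a bare DOI."""
    return f"https://doi.org/{doi}"


def publication_urls(item: dict) -> list:
    """Return one URL per entry of sci:publications that has a doi."""
    publications = item.get("properties", {}).get("sci:publications", [])
    return [doi_url(entry["doi"]) for entry in publications if "doi" in entry]


urls = publication_urls(
    {
        "properties": {
            "sci:publications": [
                {"doi": "10.5061/dryad.s2v81.2", "citation": "..."},
                {"doi": "10.1038/sdata.2017.78", "citation": "..."},
            ]
        }
    }
)
```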