Harvesting

This section outlines the options available for harvesting spatial and non-spatial metadata from various endpoints.

To set up a new harvester in GeoNetwork 4.2.x, log in as an Administrator and go to Admin console > Harvesting > Catalog harvesters.

To add a new harvester, click the Harvest from dropdown. The available sources are:

  • ArcSDE

  • Directory- this will harvest from a directory located on the same server as GeoNetwork

  • GeoNetwork (2.0)

  • GeoNetwork (from 2.1 to 3.x)

  • GeoPortal REST

  • OAI/PMH

  • OGC CSW 2.0.2

  • OGC Web Services

  • OGC WFS GetFeature

  • Simple URL

  • Thredds catalog

  • WebDAV/WAF

Dropdown menu showing the available harvesting options in GeoNetwork 4.x

Directory harvesting

This option allows users to harvest records from a directory on the same server that runs GeoNetwork.

Important

If you are running GeoNetwork in a dockerised setup, you will need to map the local directory to a volume, and the path given to the harvester becomes the mapped volume path inside the container. For example, if the docker mapping is ./harvester-test:/var/lib/jetty/webapps/geonetwork/harvester-test, then the harvester needs to be pointed at /var/lib/jetty/webapps/geonetwork/harvester-test.

GeoNetwork showing an example configuration for Directory harvesting

An example setup for harvesting in a dockerised setup
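
As a minimal sketch of the mapping rule described above, the following Python snippet extracts the container-side path from a volume mapping string. The helper name container_path is hypothetical, and the sketch assumes a POSIX-style host path (a Windows host path containing a drive letter and colon would need different handling).

```python
from pathlib import PurePosixPath


def container_path(volume_mapping: str) -> str:
    """Return the container-side path of a 'host:container[:mode]' mapping.

    Hypothetical helper for illustration; assumes a POSIX-style host path.
    """
    parts = volume_mapping.split(":")
    if len(parts) < 2:
        raise ValueError(f"not a host:container mapping: {volume_mapping!r}")
    # parts[0] is the host path, parts[1] is the path inside the container;
    # an optional third part (for example 'ro') is the mount mode.
    return str(PurePosixPath(parts[1]))


mapping = "./harvester-test:/var/lib/jetty/webapps/geonetwork/harvester-test"
# This is the value to enter in the harvester's Directory field.
print(container_path(mapping))
```

The printed value is the path on the right-hand side of the mapping, which is the only path GeoNetwork can see from inside the container.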

The main configuration options for a directory harvest are:

  • Node name and logo- this is the name of the harvester and a logo

    • Note: a logo must be pre-loaded into the catalog before it can be associated with the harvester. This can be done in Admin console > Settings > Logo

  • Group- the group which owns the harvested records

  • User- a user can be picked from the list and will be the owner of the records

  • Schedule- this feature can be enabled or disabled. If enabled, the user can set a recurring harvest

  • Directory- this is the path to the Directory that holds the records

  • Also search in subfolders- if ticked, the harvester also searches any subfolders of the directory

  • Action on UUID collision- this dictates what action will be taken if a UUID already exists in the catalog. This can be set to:

    • Skip record (default)

    • Overwrite record

    • Create new UUID

  • Update catalog record only if file was updated (tickbox)

  • Keep catalog record even if deleted at source (tickbox)

  • Validate records before import- the default option is to accept all metadata without validation

  • XSL transformation to apply

  • Batch edits

  • Category- the category to be allocated to the harvested records

  • Group privileges for the harvested records
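
The three "Action on UUID collision" choices listed above can be pictured with a toy in-memory catalog. This is only an illustrative sketch of the behaviour as described in this section, not GeoNetwork's implementation; the names OnCollision and import_record are hypothetical.

```python
import uuid
from enum import Enum
from typing import Dict, Optional


class OnCollision(Enum):
    SKIP = "skip"            # "Skip record (default)"
    OVERWRITE = "overwrite"  # "Overwrite record"
    NEW_UUID = "new_uuid"    # "Create new UUID"


def import_record(catalog: Dict[str, str], record_uuid: str, xml: str,
                  policy: OnCollision = OnCollision.SKIP) -> Optional[str]:
    """Store a record in a toy catalog (a dict of UUID -> XML).

    Returns the UUID the record was stored under, or None if it was skipped.
    """
    if record_uuid not in catalog:
        # No collision: store the record under its own UUID.
        catalog[record_uuid] = xml
        return record_uuid
    if policy is OnCollision.SKIP:
        # Keep the existing catalog record untouched.
        return None
    if policy is OnCollision.OVERWRITE:
        # Replace the existing catalog record.
        catalog[record_uuid] = xml
        return record_uuid
    # NEW_UUID: keep the existing record and store the new one separately.
    new_id = str(uuid.uuid4())
    catalog[new_id] = xml
    return new_id


catalog = {"abc": "<old/>"}
print(import_record(catalog, "abc", "<new/>"))                         # skipped
print(import_record(catalog, "abc", "<new/>", OnCollision.OVERWRITE))  # replaced
```

With the default policy the existing record wins, which is why a re-harvest of unchanged UUIDs leaves the catalog as it was.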

Warning

This method has been tested in GeoNetwork 4.2.x and it successfully harvested .ZIP and .XML records. However, only .XML records are shown and counted in the Metadata records tab on the harvester page.

GeoNetwork showing a discrepancy between the actual number of records harvested and the Metadata records tab
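
Because of this discrepancy, it can help to count the source files independently and compare the totals with what the catalog reports. The following is a rough sketch using only the Python standard library; the function name count_harvestable is hypothetical, and the demo builds a throwaway directory rather than reading a real harvest folder.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_harvestable(directory: str, recurse: bool = False) -> Counter:
    """Count .xml and .zip files in a directory, keyed by lowercase suffix.

    recurse=True mirrors the "Also search in subfolders" option.
    """
    pattern = "**/*" if recurse else "*"
    counts = Counter()
    for path in Path(directory).glob(pattern):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in {".xml", ".zip"}:
            counts[suffix] += 1
    return counts


# Demo on a temporary directory with two XML files and one ZIP file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.xml").write_text("<record/>")
    (root / "b.ZIP").write_bytes(b"")
    (root / "sub").mkdir()
    (root / "sub" / "c.xml").write_text("<record/>")
    top_only = count_harvestable(tmp)
    with_subfolders = count_harvestable(tmp, recurse=True)

print(dict(top_only))
print(dict(with_subfolders))
```

If the .xml count matches the Metadata records tab while the .zip count is non-zero, the ZIP records are the likely source of the difference.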

Simple URL harvesting

This option allows users to harvest records from various endpoints, such as DCAT/RDF or JSON (ESRI).

Important

You’ll need to adapt the configuration to match the exact feed that you’re trying to harvest, so inspect the feed manually to identify the overarching dataset element and the identifier element before continuing.
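
One quick way to do this inspection for an XML feed is to count which elements repeat, since the repeating element is usually the dataset to loop on. The sketch below uses Python's standard xml.etree.ElementTree on a hypothetical, heavily trimmed DCAT/RDF sample; the sample content and the function name repeated_elements are illustrative assumptions, not part of GeoNetwork.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List, Tuple

# Hypothetical, heavily trimmed DCAT/RDF feed used only for illustration.
SAMPLE_FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#"
    xmlns:dct="http://purl.org/dc/terms/">
  <dcat:CatalogRecord><dct:identifier>id-1</dct:identifier></dcat:CatalogRecord>
  <dcat:CatalogRecord><dct:identifier>id-2</dct:identifier></dcat:CatalogRecord>
  <dcat:CatalogRecord><dct:identifier>id-3</dct:identifier></dcat:CatalogRecord>
</rdf:RDF>"""


def repeated_elements(xml_text: str) -> List[Tuple[str, int]]:
    """Return (local element name, count) for elements occurring more than once."""
    root = ET.fromstring(xml_text)
    # ElementTree reports tags as '{namespace-uri}local'; keep the local part.
    counts = Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())
    return [(name, n) for name, n in counts.most_common() if n > 1]


for name, n in repeated_elements(SAMPLE_FEED):
    print(f"{name}: {n}")
```

In this sample, CatalogRecord and identifier each occur once per dataset, which suggests them as candidates for the loop element and the UUID element respectively.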

The main configuration options to set are:

  • Node name and logo- this is the name of the harvester and a logo

    • Note: a logo must be pre-loaded into the catalog before it can be associated with the harvester. This can be done in Admin console > Settings > Logo

  • Group- the group which owns the harvested records

  • User- a user can be picked from the list and will be the owner of the records

  • Schedule- this feature can be enabled or disabled. If enabled, the user can set a recurring harvest

  • URL- the URL of the endpoint for the whole catalog (e.g. https://apps.titellus.net/geonetwork/api/collections/velo/items?f=dcat)

  • Element to loop on- the XPath for the element that represents a dataset (e.g. dcat:CatalogRecord)

  • Element for the UUID of each record- the element inside the dataset loop that should be used as the unique identifier (e.g. ./dct:identifier)

  • XSL transformation to apply- transformations are now defined on a per-schema basis, so find the correct file and reference it as follows: schema:{schemaname}:convert/{optional folder inside the schema's convert folder}/{filename without the xsl suffix} (e.g. schema:iso19115-3.2018:convert/DCAT/sparql-to-iso19115-3)

  • Batch edits

  • Category- the category to be allocated to the harvested records

  • Group privileges for the harvested records
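
To illustrate how the "Element to loop on" and "Element for the UUID of each record" settings relate to each other, the sketch below applies the two example values from the list above (dcat:CatalogRecord and ./dct:identifier) to a small feed. It uses Python's xml.etree.ElementTree, which supports only a subset of XPath, so GeoNetwork's own evaluation may behave differently. The sample feed and the function name extract_uuids are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Prefix-to-namespace map for the standard RDF, DCAT and Dublin Core terms.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}

# Hypothetical feed used only for illustration.
SAMPLE_FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#"
    xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Catalog>
    <dcat:record>
      <dcat:CatalogRecord>
        <dct:identifier>record-001</dct:identifier>
        <dct:title>First example record</dct:title>
      </dcat:CatalogRecord>
    </dcat:record>
    <dcat:record>
      <dcat:CatalogRecord>
        <dct:identifier>record-002</dct:identifier>
        <dct:title>Second example record</dct:title>
      </dcat:CatalogRecord>
    </dcat:record>
  </dcat:Catalog>
</rdf:RDF>"""


def extract_uuids(xml_text: str,
                  loop_element: str = "dcat:CatalogRecord",
                  uuid_element: str = "./dct:identifier") -> List[str]:
    """Return the identifier found inside each looped element."""
    root = ET.fromstring(xml_text)
    uuids = []
    # Outer step: find every dataset element anywhere in the feed.
    for record in root.iterfind(f".//{loop_element}", NS):
        # Inner step: the UUID path is evaluated relative to that element.
        node = record.find(uuid_element, NS)
        if node is not None and node.text:
            uuids.append(node.text.strip())
    return uuids


print(extract_uuids(SAMPLE_FEED))
```

The key point is that the UUID path is relative to each looped element, which is why the example value starts with "./".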

The top section of the configuration for an example Simple URL harvester

The middle section of the configuration for an example Simple URL harvester

The bottom section of the configuration for an example Simple URL harvester