How to mirror SILO datasets

You can maintain your own local copy of SILO datasets using the methods described below.

If you wish to mirror SILO datasets, please read the usage information about data mirroring on our Frequently asked questions page.

Station datasets

  1. Download the entire dataset for each station of interest

    Notes:

    • you can choose the data format and/or variables which suit your application
    • the list of available stations can be downloaded using our API. The list can be obtained via URL:
      https://siloapi.longpaddock.qld.gov.au/stations

    Individual datasets can be downloaded using the methods described in our API tutorial.
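
    For example, a minimal sketch using curl (station 1005 and the p51 format follow the update example later in this section):

    curl 'https://siloapi.longpaddock.qld.gov.au/stations' > station_list
    curl 'https://siloapi.longpaddock.qld.gov.au/pointdata?station=1005&apikey=<my_key>&start=18890101&finish=YYYYMMDD&format=p51' > 1005.p51
    where you substitute your API key for <my_key> and yesterday's date for YYYYMMDD.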

    Note: this step only needs to be done once.

  2. Each time you wish to update your local copy, you can either:

    • Download updates for selected stations:

      1. Download the list of stations which have changed.

        After each nightly update SILO publishes a list describing which stations have been updated. The list can be obtained via URL:

        https://s3-ap-southeast-2.amazonaws.com/silo-open-data/mirror_information/mirror_stations.YYYYMMDD-YYYYMMDD
        where YYYYMMDD-YYYYMMDD is the period over which changes were made to the data (i.e. the date(s) when SILO modified the dataset, not the observation date(s) of records which were modified).

        The update lists available for download can be obtained via URL:

        https://s3-ap-southeast-2.amazonaws.com/silo-open-data/mirror_information/index.html

        Notes:

        • the update list contains a row for each station, detailing the period over which data have changed. For example, if the list contains the row:
                  1005,20181003,20181109
          at least one variable was updated between 3 October 2018 and 9 November 2018 for station 1005.
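
          Such a row can be split into its fields in bash, for example:
                  IFS=, read -r station start finish <<< "1005,20181003,20181109"
          leaving the station number in $station and the update period in $start and $finish.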

      2. Download a new copy of the dataset for each station appearing in the update list (or the subset of stations that you are interested in).

        For example, to download the update for station 1005 shown in the example above, you could use curl:

        curl 'https://siloapi.longpaddock.qld.gov.au/pointdata?station=1005&apikey=<my_key>&start=20181003&finish=20181109&format=p51'
        where you substitute your API key for <my_key>.

        Notes:

        • the update lists (mirror_stations.YYYYMMDD-YYYYMMDD files) are periodically removed
        • an update list shows the stations with datasets that changed in a given nightly update. If you don't update your local copy every day you will need to account for changes in all update lists since your previous update.
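
        For example, a sketch for catching up after several missed days (assuming, as in the example script below, that each nightly list carries a single-day YYYYMMDD-YYYYMMDD suffix, and that GNU date is available):

        day=20181101                              # hypothetical date of your previous update
        yesterday=`date -d "1 day ago" '+%Y%m%d'`
        while [ "$day" -lt "$yesterday" ]
        do
           day=`date -d "$day + 1 day" '+%Y%m%d'`
           wget -O "update_list.$day" \
           "https://s3-ap-southeast-2.amazonaws.com/silo-open-data/mirror_information/mirror_stations.$day-$day"
        done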

    • or

    • Download updates for all stations:

      After each nightly update SILO publishes a file containing data updates for all stations. The dataset can be obtained via URL:

      https://s3-ap-southeast-2.amazonaws.com/silo-open-data/mirror_information/mirror_stations.YYYYMMDD-YYYYMMDD.data.zip
      where YYYYMMDD-YYYYMMDD is the period over which changes were made to the data.

      Notes:

      • the file contains data for all stations
      • each row contains the station number followed by data in alldata format
      • data rows for a given station are ordered by date, but the stations are not necessarily in numerical order.
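
      As a minimal sketch, the combined file could be split into one update file per station using awk. This assumes whitespace-delimited columns with the station number first, and that rows for each station are contiguous; substitute the actual file name for the placeholder:

      unzip -p mirror_stations.YYYYMMDD-YYYYMMDD.data.zip | \
      awk '$1 != prev { close(prev ".update"); prev = $1 }   # start a new output file when the station changes
           { print > ($1 ".update") }'                       # write the row to that station's update file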

The following example code shows how selected point datasets can be mirrored using UNIX bash scripts.
mirror_example.bash (main script)

#!/bin/bash

# Script demonstrates how to mirror SILO's station datasets

# Copyright, Queensland Government Department of Environment and Science, 2018

###########################################################
# Settings                                                #
###########################################################
my_key="substitute your API key here"
api_url="https://siloapi.longpaddock.qld.gov.au"
open_data_s3_path="s3://silo-open-data"
open_data_url="https://s3-ap-southeast-2.amazonaws.com/silo-open-data"
my_reference="my tag"

# The dataset format
format="standard"

# Set this flag to true to initialise the system. It will
# 1. select the stations to mirror
# 2. download the entire dataset for all selected stations
initialise=true

# Include the function used for merging the updates with the original files
source mirror_example_functions.bash

###########################################################
# Initialise the system                                   #
###########################################################
if [ $initialise = true ]
then
   ########################################################
   # Select stations to mirror                            #
   ########################################################
   # In this example we will mirror all stations within 20 kilometres of 
   # Gatton station (40444) which have rainfall records that are 80%
   # complete throughout the 1980s and 1990s
   target_station=40444
   radius=20

   # Step 1. Get the list of stations
   wget -O station_list.json "$api_url/stations?near=$target_station&radius=$radius"

   # Step 2. Get statistics for all stations
   # - extract lines containing station numbers, and then remove the unwanted text
   grep number station_list.json | sed 's/.*://;s/,.*//' | \
   while read station_number
   do
      wget -O $station_number.statistics.csv "$api_url/stations/$station_number?statistics=decadal&format=csv"
   done

   # Stations are being selected based on data completeness in the 1980s and 1990s.
   # - this information is in columns 20 and 21 (see the header row in the CSV file)
   column_for_1980s=20
   column_for_1990s=21

   # Desired completeness is 80%
   desired_completeness="0.8"

   # Variable being tested for completeness
   variable=daily_rain

   # Set the start date (this can be determined by the user)
   start=18890101

   # To maintain a mirror the finish date should always be the day before the current day
   finish=`date -d "1 day ago" '+%Y%m%d'`

   # Approximate number of days in 1980s and 1990s is given by:
   # 20 years x 365.25 days/year
   # Note: division by 1 is to convert the floating point value to an integer
   total_no_days=`echo "(20 * 365.25) / 1" | bc`

   # Create an empty file containing the selected stations
   > stations_to_mirror

   # Step 3. Select stations meeting the completeness criteria
   grep number station_list.json | sed 's/.*://;s/,.*//' | \
   while read station_number
   do
      # Check there are statistics data for the variable of interest
      grep $variable $station_number.statistics.csv > /dev/null
      if [ $? -ne 0 ]
      then
         # The variable does not appear in the statistics file, so skip this station
         continue
      fi

      # Extract the row containing statistics information for variable of interest (daily rainfall),
      # then extract the desired columns (decades of interest are the 1980s and 1990s)
      # and then remove the unwanted double quotes
      no_observations_in_1980s=`grep $variable $station_number.statistics.csv | cut -d, -f$column_for_1980s | sed 's/"//g'`     
      no_observations_in_1990s=`grep $variable $station_number.statistics.csv | cut -d, -f$column_for_1990s | sed 's/"//g'`     

      # Compute the completeness fraction
      completeness=`echo "scale=2;($no_observations_in_1980s + $no_observations_in_1990s) / $total_no_days" | bc`

      # If the station has the desired proportion of observed data, add it to 
      # the list of stations to mirror
      if [ `echo "$completeness >= $desired_completeness" | bc` -eq 1 ]
      then
         echo $station_number >> stations_to_mirror
      fi 
   done

   # Delete the statistics files
   grep number station_list.json | sed 's/.*://;s/,.*//' | \
   while read station_number
   do
      rm $station_number.statistics.csv
   done
   rm station_list.json

   ########################################################
   # Get the entire dataset for all selected stations     #
   ########################################################
   cat stations_to_mirror | while read station_number
   do
      wget -O $station_number.$format \
      "$api_url/pointdata?station=$station_number&start=$start&finish=$finish&apikey=$my_key&format=$format&user_ref=$my_reference"
   done
fi

###########################################################
# Update the datasets for all selected stations           #
###########################################################

# Note: this step is run every time you wish to update your local copy of the datasets

# Step 1. Get a list of the data changes

   # Get the update list that was compiled yesterday (i.e. the most recent)
   yesterday=`date -d "1 day ago" '+%Y%m%d'`
   filename="mirror_information/mirror_stations.$yesterday-$yesterday"

   wget -O update_list "$open_data_url/$filename"

   # Note: the AWS cli could also be used to retrieve the update list:
   # aws s3 cp $open_data_s3_path/$filename update_list

# Step 2. Get the data updates
   cat stations_to_mirror | while read station_number
   do
      # Check there is update information for the current station
      grep "^${station_number}," update_list > /dev/null
      if [ $? -ne 0 ]
      then
         # Update information is not available, so skip this station
         continue
      fi

      # Determine the update period for the current station
      start=`grep "^${station_number}," update_list | cut -d, -f2`
      finish=`grep "^${station_number}," update_list | cut -d, -f3`

      # Get the update for the current station
      wget -O $station_number.$format.update \
      "$api_url/pointdata?station=$station_number&start=$start&finish=$finish&apikey=$my_key&format=$format&user_ref=$my_reference"
   done

# Step 3. Update the local copies with the new and/or modified data
   # Note: this example uses a simple method for merging the new and/or modified data with
   #       the existing data. This restricts the data format to any format which:
   #         1. stores the data for a single day on a single row.
   #         2. contains the date (YYYYMMDD format) in the first column

   cat stations_to_mirror | while read station_number
   do
      # Check if an update is available
      if [ ! -f $station_number.$format.update ]
      then
         # Update information is not available, so skip this station
         continue
      fi

      # To update the local copy using the new and/or modified data just downloaded, we need to:
      # 1. append any new rows
      # 2. overwrite existing rows
      update_file $station_number.$format $station_number.$format.update

      # Delete the update file
      rm $station_number.$format.update
   done
                
mirror_example_functions.bash (tools)

###########################################################
# Function merges the new and/or modified data in the     #
# "update" file with the data in the "original" file      #
# - both files must have the same format, which is:       #
#                                                         #
# [optional header rows]                                  #
# yyyymmdd ..... columns of data ....                     #
# yyyymmdd ..... columns of data ....                     #
#    ...                                                  #
#                                                         #
###########################################################

function update_file {
   # The approach is:
   # 1. Remove the header information from both the update and original files
   # 2. Merge the changes
   # 3. Sort the new file, using the YYYYMMDD date in the first column
   # 4. Insert the header information
   #
   # Step 2 is the only step required if: (i) the files do not contain a header; (ii) the
   # rows in both files are sorted by date; and (iii) the original file does not contain any
   # missing rows. In this situation the updated rows can simply be overwritten and new
   # rows can be appended.
   
   original_file="$1"
   update_file="$2"
   
   # Step 1. Remove any header information
   original_tmp=$(mktemp /tmp/original.XXXXXX)
   update_tmp=$(mktemp /tmp/update.XXXXXX)
   merged_tmp=$(mktemp /tmp/merged.XXXXXX)
   original_header_tmp=$(mktemp /tmp/original.XXXXXX)
   
   # Extract the data rows from the original and update files
   # Note: data rows commence with the YYYYMMDD date, and a space can appear before the date
   grep '^[ ]\{0,1\}\(18\|19\|20\)[0-9][0-9]' $original_file > $original_tmp
   grep '^[ ]\{0,1\}\(18\|19\|20\)[0-9][0-9]' $update_file > $update_tmp
   
   # Extract the header from the original (all rows except the data rows)
   grep -v '^[ ]\{0,1\}\(18\|19\|20\)[0-9][0-9]' $original_file > $original_header_tmp
   
   # Step 2. Merge the original and updated files (with the header information removed)
   awk 'FNR==NR{array[$1]=$0; next}          # Store all the updated rows i.e. those rows in the update file.
   {
      print ($1 in array) ? array[$1] : $0;  # If this row was updated, output the new version; otherwise output the original row
      if ($1 in array) delete array[$1]      # Remove the updated row from the array, leaving only the new rows
   }END{
      for (row in array)                     # Output any remaining updated rows - these are "new" rows i.e. rows that are in 
         print array[row]                    # the update file but which were not in the original file
   }' $update_tmp $original_tmp \
   \
   | # Step 3. Sort by date
   sort -n -k1 > $merged_tmp

   # Step 4. Re-insert the header information
   cat $original_header_tmp $merged_tmp > $original_file
   
   # Clean up
   rm $original_tmp $update_tmp $original_header_tmp $merged_tmp
}
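
For example, if 1005.standard holds the original dataset and 1005.standard.update holds the rows downloaded for the update period, then:

update_file 1005.standard 1005.standard.update

overwrites any rows whose dates appear in both files, appends rows with new dates, and leaves 1005.standard sorted by date.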
                

Note: the example demonstrates the first update method described above (downloading updates for selected stations).

Gridded datasets

To mirror SILO's gridded datasets you can either:

  1. Use the Amazon Web Services Command Line Interface (CLI):

    1. Install the AWS CLI
    2. Use the CLI sync command to mirror the data.
      For example, to mirror the monthly rainfall rasters into a local folder named target:
      aws s3 sync s3://silo-open-data/annual/monthly_rain target --exact-timestamps

    Notes:

    • the first time you run the sync command it will download the entire dataset
    • you need to re-run the sync command every time you wish to update your local copy (sync will only download files that have changed)
    • the --exact-timestamps option is required; otherwise sync will not download files which have been updated but still have the same file size.
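
    For example, a sketch that keeps several variables in sync. This assumes each variable of interest has a matching folder under annual/ (monthly_rain does; treat other folder names, such as daily_rain, as assumptions to verify):

    for variable in daily_rain monthly_rain
    do
       aws s3 sync "s3://silo-open-data/annual/$variable" "target/$variable" --exact-timestamps
    done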

  or

  2. Manually download new and/or updated rasters:

    1. Download the entire set of rasters for the variable(s) that you wish to mirror.

      A list of files available for download can be obtained via URL:

      https://s3-ap-southeast-2.amazonaws.com/silo-open-data/annual/index.html
      Individual files can be downloaded using the methods described on our gridded data page. For example, the monthly rainfall rasters for 1989 can be downloaded using curl as follows:
      curl 'https://s3-ap-southeast-2.amazonaws.com/silo-open-data/annual/monthly_rain/1989.monthly_rain.nc'
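      To seed a full mirror, a download loop might be used; a sketch assuming annual files named YYYY.monthly_rain.nc (as above) and a period of record starting in 1889:
      for year in $(seq 1889 $(date '+%Y'))
      do
         curl -f -O "https://s3-ap-southeast-2.amazonaws.com/silo-open-data/annual/monthly_rain/$year.monthly_rain.nc"
      done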
      Note: this step only needs to be done once.

    2. Each time you wish to update your local copy:

      1. Download the list of rasters which have changed.

        After each nightly update SILO publishes a list describing which rasters have been updated. The list can be obtained via URL:

        https://s3-ap-southeast-2.amazonaws.com/silo-open-data/mirror_information/mirror_rasters.YYYYMMDD-YYYYMMDD
        where YYYYMMDD-YYYYMMDD is the period over which changes were made to the data (i.e. the date(s) when SILO modified the dataset, not the observation date(s) of records which were modified).

        The update lists available for download can be obtained via URL:

        https://s3-ap-southeast-2.amazonaws.com/silo-open-data/mirror_information/index.html

      2. Download a new copy of each file appearing in the update list (or the subset of files that you are interested in).

        Notes:

        • the update lists (mirror_rasters.YYYYMMDD-YYYYMMDD files) are periodically removed
        • an update list shows the files which changed in a given nightly update. If you don't update your local copy every day you will need to account for changes in all update lists since your previous update.
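
        For example, a sketch of applying yesterday's raster update list. The format of the list is an assumption here: the sketch treats each row as the path of a changed raster relative to the bucket root, so inspect an actual list to confirm before relying on it:

        yesterday=`date -d "1 day ago" '+%Y%m%d'`
        wget -O raster_update_list \
        "https://s3-ap-southeast-2.amazonaws.com/silo-open-data/mirror_information/mirror_rasters.$yesterday-$yesterday"

        while read -r path
        do
           # Save each raster into the current folder; adjust paths to suit your local layout (assumed list format)
           wget -O "$(basename "$path")" "https://s3-ap-southeast-2.amazonaws.com/silo-open-data/$path"
        done < raster_update_list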

Please note that SILO data are constantly evolving, so you will need to determine how often to update your local copy of the data. SILO data typically change due to:

  • Nightly updates: each night SILO ingests new data which have been collected recently. This typically only impacts the most recent datasets (rainfall datasets for the preceding 12 months and other variables for the preceding 3-6 months)
  • Bulk updates: SILO periodically regenerates the entire dataset to incorporate new features or to take advantage of data improvements. This typically impacts the entire time period spanned by the affected variable(s).

You may also wish to consider your network bandwidth and transfer costs when determining how often you update your local copy. The rasters are packed into annual files, each being around 410 MB in size for daily variables and around 14 MB for monthly rainfall.
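
For example, if a daily variable spans 1889 to the present (as the point datasets do), a full mirror of that variable comprises roughly 130 annual files, or about 53 GB, whereas the corresponding monthly rainfall collection is only around 2 GB.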