Data preparation and software selection

Gloria Otieno, Jeske van de Gevel, Maarten van Zonneveld

R.Vernooy /Bioversity

In Module 1, you learned the key components of a situational analysis or overview of a community’s vulnerabilities. Such an analysis is needed to plan and prioritize key interventions with the various institutions that work within the community, e.g., introducing new varieties from abroad to increase local diversity, working with the national gene bank to restore lost varieties, setting up a community seed bank to conserve threatened varieties, or a combination of these interventions.

In this module, you will learn about the identification, collection, and preparation of existing open-source climate and crop datasets, how to clean and prepare your own data, and how to import georeferenced data into mapping and modeling software. You will become familiar with key features and the use of software packages and online resources, such as DIVA-GIS, MaxEnt, Climate Analogues, and Google Earth, to determine how patterns of plant genetic resource interdependence may change in the future.

At the end of this module you will be able to start designing one or more strategies to use and safeguard plant genetic resources that are better adapted to climate change.

Learning objectives

At the end of this module, you will be able to:

  • Download climate datasets and data on crop presence
  • Prepare and import your dataset for use in DIVA-GIS, MaxEnt, and Google Earth
  • Create online maps with the Climate Analogues tool 

What do you already know?

  • Do you work with biodiversity data? Have you ever downloaded occurrence data from portals such as Genesys PGR? For what purpose? What was your experience in using these data portals?
  • What experience do you have in using climate data in your research? Are you familiar with WorldClim datasets? Which datasets have you used and for what purpose?
  • Do you have experience with mapping and modeling software, such as MaxEnt, DIVA-GIS, and Google Earth? Have you used this software to combine biodiversity data with climate data to create multilayered maps or models?
  • Have you heard or read about the concept of climate analogues? Have you considered applying this concept in your own research? In what way could it be useful?

The first step in identifying appropriate germplasm for adaptation measures is to gain an overview of which geographic areas are suitable for a specific crop or variety. By combining biodiversity data (e.g., crop presence data for accessions held in national or international institutes) with climate data (e.g., the 19 bioclimatic variables in the WorldClim package), we will be able to map these data for climate change analysis and germplasm identification.

Sources of climate data

Climate data can be obtained from various sources, including local field observations, national meteorological stations, and global agencies and databases.

Regional meteorological data can be acquired by contacting individual regional and national weather stations. These data are usually observations of parameters, such as temperature and precipitation, for a specific day and time. For this information to be useful in detecting trends, it should span 30 years or more. However, in developing countries, weather stations are often few and do not cover large areas. Data may lack precision, and meteorological organizations might charge a fee for use of their data. The quality of the datasets and the format in which they are distributed depend on the organization.

Environmental sensors can be used to collect your own local weather information. For example, iButtons are small, affordable devices that measure temperature and humidity at predetermined intervals. They can be placed in less accessible areas where local weather data are not available at a high resolution. Bioversity scientists have written a manual that describes in detail how to use iButtons for weather observation (Mittra et al. 2013). iButtons are bought in combination with devices to connect them to your computer and specialized software to process and store the readings in comma-separated values (CSV) format.

Global agronomic weather data can be obtained via the aWhere platform, which combines observational weather data, forecast models, and historical norms from meteorological stations around the world and interpolates these into 9-km grid cells. Data include precipitation, minimum and maximum temperature, minimum and maximum relative humidity, solar radiation, maximum morning wind speed, and calculations such as cumulative growing degree-days (where you set base and cap temperatures).

aWhere’s data can be accessed freely by users in East and Southern Africa and parts of West Africa and India; however, you must register to be able to view or download data via the aWhere platform (apps.awhere.com). aWhere also offers the possibility of uploading your own trial and weather data.
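The cumulative growing degree-days calculation mentioned above can be illustrated with a short sketch. It uses one common formulation (daily minimum and maximum temperatures are clamped to the base and cap before averaging); this module does not specify aWhere's exact formula, so the function names and the clamping rule are assumptions for illustration only.

```python
def growing_degree_days(t_min, t_max, base, cap):
    """Daily growing degree-days with a base and a cap temperature.

    Assumed formulation: clamp both daily extremes into [base, cap],
    average them, and subtract the base temperature.
    """
    t_max_adj = min(max(t_max, base), cap)
    t_min_adj = min(max(t_min, base), cap)
    return (t_max_adj + t_min_adj) / 2.0 - base


def cumulative_gdd(daily_temps, base, cap):
    """Running total of daily GDD for a list of (t_min, t_max) pairs."""
    totals = []
    running = 0.0
    for t_min, t_max in daily_temps:
        running += growing_degree_days(t_min, t_max, base, cap)
        totals.append(running)
    return totals


if __name__ == "__main__":
    # Three hypothetical days, base 10 °C and cap 30 °C
    print(cumulative_gdd([(12, 28), (5, 35), (2, 8)], base=10, cap=30))
```

Other formulations exist (for example, capping only the maximum temperature), so check which one your data provider uses before comparing values.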

WorldClim is a set of global climate layers that can be downloaded for free via http://www.worldclim.org. These layers cover all global land areas except Antarctica and consist of 19 bioclimatic variables derived from monthly temperature and precipitation values. Bioclimatic variables represent annual trends, seasonality, and extremes. They are in the latitude/longitude coordinate reference system (not projected), and the datum is WGS84. Table 1 summarizes various climate datasets.

Table 1: WorldClim and National Oceanic and Atmospheric Administration climate data

Dataset | Period covered | Source data | Available spatial resolutions | Download
Present conditions | 1950–2000 | Interpolations of observations | 30 seconds, 2.5 minutes, 5 minutes, 10 minutes | http://www.worldclim.org/current
Projected future conditions | To about 2050 | Climate projections from the Fifth Assessment Report of the Intergovernmental Panel on Climate Change | 30 seconds, 2.5 minutes, 5 minutes, 10 minutes | http://www.worldclim.org/CMIP5
Past conditions | Last glacial period (22 000 years ago) to mid-Holocene (6000 years ago) | Downscaled paleoclimate data | 30 seconds, 2.5 minutes, 5 minutes, 10 minutes | http://www.worldclim.org/paleo-climate
Surface marine data | 1800 to present; 1960 to present (more detailed) | Gridded monthly summaries using observations from many systems | 2° × 2° latitude × longitude boxes (from 1800) and 1° × 1° boxes (from 1960) | http://icoads.noaa.gov/
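To make the idea of bioclimatic variables derived from monthly values more concrete, the sketch below computes a handful of them from twelve monthly minimum temperatures, maximum temperatures, and precipitation totals. The variable codes follow the usual WorldClim naming (BIO1, BIO5, and so on), but the function name and input layout are assumptions, and only a subset of the 19 variables is shown.

```python
def bioclim_summary(tmin, tmax, prec):
    """Derive a few bioclimatic variables from 12 monthly values each.

    tmin, tmax: monthly minimum and maximum temperature (°C)
    prec: monthly precipitation (mm)
    """
    if not (len(tmin) == len(tmax) == len(prec) == 12):
        raise ValueError("expected 12 monthly values for each variable")

    # Monthly mean temperature approximated as the mid-point of min and max
    tmean = [(lo + hi) / 2.0 for lo, hi in zip(tmin, tmax)]
    bio5 = max(tmax)  # maximum temperature of the warmest month
    bio6 = min(tmin)  # minimum temperature of the coldest month

    return {
        "BIO1": sum(tmean) / 12.0,  # annual mean temperature
        "BIO5": bio5,
        "BIO6": bio6,
        "BIO7": bio5 - bio6,        # temperature annual range
        "BIO12": sum(prec),         # annual precipitation
        "BIO13": max(prec),         # precipitation of the wettest month
        "BIO14": min(prec),         # precipitation of the driest month
    }
```

In practice you would download the ready-made WorldClim layers rather than compute these yourself; the sketch only shows how annual trends and extremes are summarized from monthly data.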

The datasets are available at four spatial resolutions: 30 seconds (0.93 km × 0.93 km = 0.86 km² at the equator), 2.5 minutes, 5 minutes, and 10 minutes (18.6 km × 18.6 km = 346.0 km² at the equator). The highest resolution (30 seconds) is available per tile (region) in georeferenced TIFF format (GeoTIFF file) and can be used in any GIS application. WorldClim data at lower resolutions can be downloaded in ZIP-files in a generic grid (raster) format and in Environmental Systems Research Institute (ESRI) grids (for use with ESRI products). Table 2 summarizes the approximate grid resolutions.

Table 2: Approximate grid resolutions

Resolution | Approximate cell side length at the equator
30 arc seconds | 1 km
2.5 arc minutes | 5 km
5 arc minutes | 9 km
10 arc minutes | 18 km
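The conversion from an angular resolution to an approximate ground distance at the equator can be checked with a few lines of arithmetic. The sketch below assumes that one degree of longitude at the equator spans about 111.32 km; that constant and the function names are assumptions, and cells become narrower in the east-west direction away from the equator.

```python
KM_PER_DEGREE_AT_EQUATOR = 111.32  # assumed approximate value


def cell_side_km(arc_seconds):
    """Approximate side length (km) of a grid cell at the equator."""
    return arc_seconds / 3600.0 * KM_PER_DEGREE_AT_EQUATOR


def cell_area_km2(arc_seconds):
    """Approximate area (km²) of a square grid cell at the equator."""
    return cell_side_km(arc_seconds) ** 2


if __name__ == "__main__":
    resolutions = [
        ("30 arc seconds", 30),
        ("2.5 arc minutes", 150),
        ("5 arc minutes", 300),
        ("10 arc minutes", 600),
    ]
    for label, seconds in resolutions:
        print(f"{label}: {cell_side_km(seconds):.2f} km side, "
              f"{cell_area_km2(seconds):.1f} km² area")
```

The results agree with the figures quoted in the text: roughly 0.93 km per side for 30 arc seconds and 18.6 km per side for 10 arc minutes.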

To import the generic grid (raster) format into DIVA-GIS, first download and unzip the ZIP-files. Use: Data\Import to gridfile\Multiple Files (BIL/BIP/BSQ). Some of these files have the extension .GRD. In this case, if you rename the .BIL files to .GRI, they can be opened in DIVA-GIS without the import procedure.

High resolution physical climate data can also be obtained from the Intergovernmental Panel on Climate Change (IPCC) through its data distribution centre (DDC). Climate model data from IPCC AR4 (2007) and IPCC TAR (2001) can be viewed through the DDC file navigator and downloaded in CSV format. Data are available in 10-year or 30-year intervals ranging from 1900 to 2000 for the following variables: temperature, wet days, precipitation, daily maximum temperature, daily minimum temperature, ground frost frequency, water vapour, diurnal temperature range, and cloud cover.

Sources of biodiversity data

Biodiversity and environmental data can also be obtained from various sources, including (national) gene banks, herbariums, field observations, and global agencies and databases. A few of these sources are mentioned here.

Global or national gene banks serve as repositories of plant genetic materials. Passport and evaluation data for accessions can be obtained by contacting individual banks. Passport data include the source and origin of an accession as well as taxonomic identification. Gene bank collections can also contain characterization and evaluation data on phenotypic traits or provide more specific trait data, including those of underutilized crops that are not covered in global databases. Examples of gene banks with clear trait data or the ability to request germplasm online are listed in Table 3.

Table 3: Examples of gene banks that offer easily accessible and clear trait data

Name | Source | Website
National Plant Germplasm System (also GRIN Global) | United States Department of Agriculture | http://www.ars-grin.gov/npgs/orders.html
Centre for Genetic Resources (Dutch national gene bank) | Wageningen University and Research Centre, the Netherlands | http://www.wageningenur.nl/en/Expertise-Services/Statutory-research-tasks/Centre-for-Genetic-Resources-the-Netherlands-1.htm
European catalogue of ex situ collections of plant genetic resources (EURISCO) | Secretariat of the European Cooperative Programme for Plant Genetic Resources | http://eurisco.ipk-gatersleben.de/apex/f?p=103:1

Not all gene banks make their collection data available online. Quality and access to the data depend on the regulations and resources of individual institutes.

Open access biodiversity data on all types of life on earth (plants, animals, and microbes) are available through the Global Biodiversity Information Facility (GBIF) web portal. This is the largest biodiversity database on the Internet, as it accesses data from hundreds of institutions. The data from GBIF are georeferenced, allowing you to download maps with specific occurrences of species or varieties within a species. Data can be further refined by geographic area, elevation, and climate. To download the data you must register. After you select the relevant data, a direct download link is sent to your email address, which will allow you to download a ZIP-file. The ZIP-file consists of text and XML files. The downloaded files can be imported into GIS software to create maps, which can be overlaid with climate maps to visualize the effect of climate change on diversity. More on this can be found in Module 3. Website: http://www.gbif.org/
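As an illustration of what can be done with such a download before it goes into GIS software, the sketch below reads a tab-delimited occurrence table straight from a ZIP archive and keeps only the records with usable coordinates. The member name `occurrence.txt` and the column names `decimalLatitude` and `decimalLongitude` are assumptions based on the Darwin Core terms commonly found in GBIF downloads; check the actual contents of your archive and adjust the parameters accordingly.

```python
import csv
import io
import zipfile


def read_occurrences(zip_source, member="occurrence.txt",
                     lat_field="decimalLatitude",
                     lon_field="decimalLongitude"):
    """Return records with numeric coordinates from a table inside a ZIP.

    zip_source may be a file path or a binary file-like object.
    The member and field names are assumptions; adapt them to your download.
    """
    records = []
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text, delimiter="\t"):
                try:
                    lat = float(row[lat_field])
                    lon = float(row[lon_field])
                except (KeyError, TypeError, ValueError):
                    continue  # skip records without usable coordinates
                record = dict(row)
                record[lat_field] = lat
                record[lon_field] = lon
                records.append(record)
    return records
```

A call such as `read_occurrences("my_download.zip")` (a hypothetical file name) would return a list of dictionaries that can then be written out in whatever format your mapping software expects.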

GENESYS is an online database that provides access to millions of accessions held in gene banks around the world. Information includes crop name, genus, species, accession number, scientific name, country of origin, biological status of accession, holding institute, and the latitude, longitude, and elevation where it was collected. The dataset provides information in addition to standard passport data, such as characterization and evaluation data (e.g., plant height, growing periods at given locations, seed colour, response to specific pests or diseases, response to various climatic conditions, and possible end uses), and environmental data based on the longitude and latitude at collection sites. This allows users to identify specific accessions with desirable traits for climate change adaptation, such as tolerance to drought and waterlogging. Users can get additional information on whether the accession has been made available through the multilateral system of the International Treaty on Plant Genetic Resources for Food and Agriculture, whether it has been safely duplicated in the global seed vault located in Svalbard, Norway, and whether it is available for distribution. Data can be filtered online through the accession browser, viewed as maps or downloaded as CSV files, and imported into GIS applications. Website: https://www.genesys-pgr.org/welcome

Recommended readings

Mittra, S., van Etten, J., Franco, T. 2013. Collecting weather data in the field with high spatial and temporal resolution using iButton. Bioversity International, Rome, Italy.

This manual describes in detail how to use iButtons for weather observations.

Dias, S., MacKay, M. 2011. Promoting conservation and use of Plant Genetic Resources for Food and Agriculture – Information services for users worldwide. Wageningen, The Netherlands.

This PowerPoint presentation gives an overview of the key features of EURISCO and GENESYS and how to use these data portals.

Saarenmaa, H. n.d. Sharing and accessing biodiversity data globally through GBIF. Global Biodiversity Information Facility, Copenhagen, Denmark.

This short paper explains how one can become a GBIF data provider, and how users can access the data using web services and the GIS functions on the GBIF data portal.

More on the subject

Gaiji, S., Chavan, V., Arino, A., Otegui, J., Hobern, D., Sood, R., Robles, E. 2013. Content assessment of the primary biodiversity data published through GBIF network: status, challenges and potential. Biodiversity Informatics 8: 94–172.

This paper is the first comprehensive assessment of the content mobilized so far through GBIF, as well as a reflection on possible strategies to improve its “fitness for use.”

The software you use for data analysis will depend on the type and format of the data. Georeferenced climate and accessions data can be used to create multilayered maps in a wide range of GIS software programs. However, it is important to choose software that can do both mapping and analysis from points to grids to landscape. A number of free software programs exist for this type of analysis.

Species distribution models and crop models are used to predict climate change impact on crop suitability and yield. Application of process-based models may require specialized knowledge and a detailed set of parameters for specific areas. Therefore, niche models that predict climate change impact on crop suitability may be a good option to provide recommendations in the context of the uncertainties involved in climate change projections and the often limited amount and quality of data.

Ecocrop

Ecocrop is a simple niche-based empirical model that uses environmental ranges to define the suitable area of a crop. It draws on the Food and Agriculture Organization’s Ecocrop plant database, which includes optimal environmental ranges of more than 2000 species. The model allows adjustment of minimum and maximum temperature based on local research findings. The Ecocrop database is available at www.ecocrop.fao.org and is also included in DIVA-GIS. Selecting a crop and setting parameters based on your own research findings allows you to create maps that show the suitability of a certain crop now and in the future.
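The idea of using environmental ranges to define suitability can be sketched in a few lines. The example below is a simplified illustration, not the implementation used in DIVA-GIS: it assumes each factor has an absolute range (outside which the crop is unsuitable) and an optimal range (inside which it is fully suitable), with a linear transition in between, and it takes the most limiting factor as the overall score. All function names and threshold values are hypothetical.

```python
def range_suitability(value, abs_min, opt_min, opt_max, abs_max):
    """Score from 0 to 1 for one environmental factor.

    1 inside the optimal range, 0 outside the absolute range,
    and a linear transition between the two.
    """
    if value < abs_min or value > abs_max:
        return 0.0
    if value < opt_min:
        return (value - abs_min) / (opt_min - abs_min)
    if value > opt_max:
        return (abs_max - value) / (abs_max - opt_max)
    return 1.0


def crop_suitability(temp, rain, temp_range, rain_range):
    """Overall score: the most limiting of temperature and rainfall."""
    return min(
        range_suitability(temp, *temp_range),
        range_suitability(rain, *rain_range),
    )


if __name__ == "__main__":
    # Hypothetical crop thresholds: (abs_min, opt_min, opt_max, abs_max)
    temperature = (10, 20, 30, 40)     # °C
    rainfall = (200, 400, 800, 1200)   # mm per growing season
    print(crop_suitability(25, 300, temperature, rainfall))
```

Running the same function with current and projected climate values for each grid cell is one way to compare suitability now and in the future.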

Other resources for species distribution modeling:

  • ModEco — integrated software for species distribution analysis and modeling
  • DesktopGarp — software package for biodiversity and ecological research
  • openModeller — a generic approach to species' potential distribution modelling

Recommended readings

Hijmans, R.J., Guarino, L., Mathur, P. 2012. DIVA-GIS version 7.5 manual.

Use of the Ecocrop model is explained on pages 54–56.

Mapping software

In this module, we focus on four programs: DIVA-GIS to map the distribution of biological diversity and query climate data; MaxEnt, species distribution modeling software, to model the range in which a species can occur; Google Earth to create maps with georeferenced occurrence data or accessions data against high resolution satellite imagery; and Climate Analogues to project future climate conditions for a particular location and match this to sites that currently have similar rainfall and climatic conditions.

DIVA-GIS

DIVA-GIS is an open-source software program used to create maps and carry out geographic data analysis. It can create a wide range of maps, from a map of the world, to a map of a very small area, such as a district or even a village, showing, for example, state boundaries, rivers, a satellite image, and sites where an animal species was observed. This program can be downloaded at http://www.diva-gis.org

DIVA-GIS comes with the option to download free spatial data, such as administrative boundaries, roads, etc., and species occurrence data from GBIF, Genesys, LandSat, etc. As noted above, the Ecocrop model is built in.

Free spatial data for the whole world can also be downloaded at http://www.diva-gis.org/Data and then used in DIVA-GIS or other GIS programs, such as ArcGIS. DIVA-GIS is particularly useful for mapping and analyzing biodiversity data, such as the distribution of a species or other “point distributions.”

Climate data from WorldClim can be downloaded directly into DIVA-GIS at http://www.diva-gis.org/climate. This makes it possible to overlay climate information with species occurrence or other georeferenced data to provide an overview of the way the distribution of a species changes due to climate over certain periods. 

Recommended readings

Hijmans, R.J., Guarino, L., Mathur, P. 2012. DIVA-GIS version 7.5 manual.

This manual includes a step-by-step guide to downloading and installing DIVA-GIS, a summary of its uses, and a list of its features. It also gives an overview of data analysis and how to generate maps and shapefiles.

MaxEnt

Maximum entropy modeling uses layers of environmental variables (elevation, precipitation, etc.) as well as a set of georeferenced occurrence locations to produce a model of the range of a given species. One of the main applications of MaxEnt is prediction of species occurrence. From current species locations and environmental predictors (e.g., precipitation, temperature) across a user-defined landscape divided into grid cells, MaxEnt extracts a sample of background locations that it contrasts with the presence locations to predict species occurrence.

Available from: http://www.cs.princeton.edu/~schapire/MaxEnt/

Recommended readings

Merow, C., Smith, M.J., Silander Jr, J.A. 2013. A practical guide to MaxEnt for modeling species’ distributions: what it does, and why inputs and settings matter. Ecography 36: 1058–1069. 

This article gives a detailed explanation on how MaxEnt works, and different types of analysis that can be performed by it. It also provides insights on data requirements, formats, and conversions that might be necessary when performing analysis in MaxEnt.

 

 

More on the subject

Phillips, S.J., Dudik, M., Schapire, R.E. 2004. A maximum entropy approach to species distribution modeling. Proceedings of the Twenty-First International Conference on Machine Learning, pp. 655-662.

Phillips, S.J., Anderson, R.P., Schapire, R.E. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190: 231–259. 
These two articles provide details on maximum entropy modeling.

Google Earth

Google Earth is a geobrowser that provides satellite and aerial imagery, ocean bathymetry, and other geographic data over the Internet to display the Earth as a three-dimensional globe. It has many features, including an increasing set of layers of mappable data, the ability to display third-party data, tools for creating new data, and the ability to import GPS data.

Georeferenced occurrence data or accessions data can be imported and mapped in Google Earth. The resolution is as high as 1 m and, therefore, specific names and locations can easily be obtained. The free version of Google Earth is downloadable at http://www.google.com/earth/download/ge/agree.html

Google Earth is searchable and allows the user to pan, zoom, rotate, and tilt the view of the Earth. Its layers of data, such as volcanoes and terrain, reside on Google’s servers, and can be displayed. Its elevation data, primarily from NASA’s Shuttle Radar Topography Mission, provide a terrain layer that adds depth to the landscape. Google Earth can also be used to acquire the coordinates of collections or occurrences with location names but no GPS information. 
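One possible route for getting georeferenced records into Google Earth is to write them to a KML file, the placemark format that Google Earth opens directly. The sketch below is an illustration of that route and is not the procedure described in the training manual cited under the recommended readings; the function name and the sample record are hypothetical.

```python
from xml.sax.saxutils import escape


def points_to_kml(points):
    """Build a KML document string from (name, latitude, longitude) tuples."""
    placemarks = []
    for name, lat, lon in points:
        # KML lists longitude first, then latitude
        placemarks.append(
            "  <Placemark>\n"
            f"    <name>{escape(str(name))}</name>\n"
            f"    <Point><coordinates>{lon},{lat}</coordinates></Point>\n"
            "  </Placemark>\n"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Document>\n"
        + "".join(placemarks)
        + "</Document>\n"
        "</kml>\n"
    )


if __name__ == "__main__":
    # Hypothetical accession with decimal-degree coordinates
    kml_text = points_to_kml([("Accession 001", -1.3, 36.8)])
    with open("accessions.kml", "w", encoding="utf-8") as handle:
        handle.write(kml_text)
```

Coordinates must already be in decimal degrees, and a common mistake is swapping latitude and longitude, which places points in the wrong hemisphere.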

Recommended readings

Scheldeman, X., van Zonneveld, M. 2010. Training manual on spatial analysis of plant diversity and distribution. Bioversity International, Rome, Italy.

A description of the step-by-step procedure for importing data into Google Earth is available on pages 74–78.

Climate Analogues

Climate Analogues is an open-access tool developed by the CGIAR Research Program on Climate Change, Agriculture and Food Security in conjunction with the International Center for Tropical Agriculture and the Walker Institute. Used to support adaptation to climate change in the agricultural sector, its main applications are in agricultural policy and planning. The tool can be used to identify future climate conditions at a particular location, sites that currently resemble these conditions, and locations that have or will have similar climate conditions.

The tool can facilitate knowledge-sharing among communities, providing the opportunity to transfer practices and technologies to improve adaptive capacities. It can also provide insights into whether successful adaptation options in one location can be transferred to a future climatic analogue site. The temporal analogues are time specific and make use of past climates to create a representative time series for future climates. This allows the identification of historic events that might provide insight into the possible future consequences of climate change.

Based on careful analyses using the tool and supported by data from actual conditions in farmers’ fields, scientists can formulate possible intervention strategies, including identification of appropriate plant genetic resources, or develop new varieties for specific locations of interest.

The Climate Analogues online platform can be accessed at http://www.ccafs-analogues.org/tool/. The procedure for using this tool and interpreting analogue maps is described in a tutorial: http://www.ccafs-analogues.org/tutorial/.
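The underlying idea of matching a target climate to similar sites can be illustrated with a deliberately simplified sketch. It is not the measure used by the Climate Analogues tool (see the working paper listed under the recommended readings for that); it simply assumes that each site is described by twelve monthly temperature and precipitation values and ranks candidate sites by a weighted Euclidean distance. All names, weights, and values are hypothetical.

```python
import math


def climate_dissimilarity(site_a, site_b, temp_weight=1.0, prec_weight=1.0):
    """Weighted Euclidean distance between two sites' monthly climates.

    Each site is a dict with 12-value lists under "temp" and "prec".
    """
    total = 0.0
    for month in range(12):
        dt = site_a["temp"][month] - site_b["temp"][month]
        dp = site_a["prec"][month] - site_b["prec"][month]
        total += temp_weight * dt ** 2 + prec_weight * dp ** 2
    return math.sqrt(total)


def rank_analogues(target, candidates):
    """Return candidate site names ordered from most to least similar."""
    return sorted(
        candidates,
        key=lambda name: climate_dissimilarity(target, candidates[name]),
    )


if __name__ == "__main__":
    # Projected future climate at the site of interest (hypothetical values)
    future = {"temp": [22] * 12, "prec": [90] * 12}
    # Present-day climates at candidate sites (hypothetical values)
    current_sites = {
        "site_1": {"temp": [18] * 12, "prec": [120] * 12},
        "site_2": {"temp": [22] * 12, "prec": [95] * 12},
    }
    print(rank_analogues(future, current_sites))
```

Because temperature and precipitation are in different units, the weights matter; the sketch leaves them at 1.0 only for simplicity.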

 

Recommended readings

Ramírez-Villegas, J., Lau, C., Köhler, A.K., Signer, J., Jarvis, A., Arnell, N., Osborne, T., Hooker, J. 2011. Climate Analogues: finding tomorrow’s agriculture today. Working paper 12. CGIAR Research Program on Climate Change, Agriculture and Food Security, Cali, Colombia.

This paper provides a general explanation of Climate Analogues, including the conceptual framework used and the models applied in building the analogues. It also explains how to interpret the results and notes the limitations of the method.

A common feature of the data referred to in this module is that they include georeferenced information and/or basic passport data. The data must be organized in a format that can be recognized by software such as DIVA-GIS and MaxEnt. These software applications have the capability of processing vector and raster data. Vector data come in the form of shapefiles with extensions, such as .SHP, .SHX, and .DBF, which store spatial features. Environmental data from specific geographic areas may be organized in rasters, which consist of a matrix of cells (or pixels) organized into rows and columns (or a grid) where each cell contains a value representing information, such as temperature. Rasters can be digital aerial photographs, imagery from satellites, digital pictures, or even scanned maps.
The level of detail in a raster is referred to as resolution. Raster sizes range from 1° (111 km) to 30 seconds (approximately 1 km at the equator).
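The row-and-column structure of a raster means that any point can be assigned to a cell with simple arithmetic. The sketch below assumes a north-up grid in latitude/longitude defined by its western edge, northern edge, and cell size in degrees; the function and parameter names are hypothetical.

```python
import math


def cell_index(lat, lon, x_min, y_max, cell_size, n_rows, n_cols):
    """Return (row, col) of the raster cell containing a point.

    Rows are counted from the northern edge (y_max) downwards and
    columns from the western edge (x_min) eastwards. Returns None
    when the point falls outside the grid.
    """
    col = math.floor((lon - x_min) / cell_size)
    row = math.floor((y_max - lat) / cell_size)
    if 0 <= row < n_rows and 0 <= col < n_cols:
        return row, col
    return None


if __name__ == "__main__":
    # Global grid with 1° cells: 180 rows by 360 columns
    print(cell_index(-1.3, 36.8, x_min=-180, y_max=90,
                     cell_size=1.0, n_rows=180, n_cols=360))
```

Looking up the value stored at that row and column is how software extracts, for example, the temperature at each presence point.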

Preparing data for use in DIVA-GIS and MaxEnt

1. Converting data into appropriate formats

Presence points, which consist of passport data, can be entered or downloaded into an Excel file and then converted into appropriate formats for spatial analysis using GIS applications, such as DIVA-GIS or MaxEnt. Your data should include an identification code, a scientific or taxonomic name, and coordinates (latitude and longitude). Other relevant information can also be added. Records downloaded from GBIF or Genesys will contain this information, but always check to ensure that the data are complete (e.g., no blank fields) and that the coordinates are in decimal degrees.

Table 4: Converting geographic coordinates from degrees, minutes, and seconds (or degrees and decimal minutes) to decimal degrees

Decimal degrees = (Degrees (°) + Minutes (') / 60 + Seconds ('') / 3600) × H

H = 1 when the coordinate is in the Eastern (E) or Northern (N) Hemisphere
H = -1 when the coordinate is in the Western (W) or Southern (S) Hemisphere

Coordinate | Degrees, minutes, and seconds | Decimal degrees
Longitude, Eastern Hemisphere | 60°20'15''E | +60.3375
Longitude, Western Hemisphere | 60°20'15''W | -60.3375
Latitude, Northern Hemisphere | 24°00'45''N | +24.0125
Latitude, Southern Hemisphere | 24°00'45''S | -24.0125

Source: Scheldeman and van Zonneveld (2010: 24–26); see section on Google Earth above for full reference.
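The conversion in Table 4 is easy to automate when many records need converting. The following sketch applies the same formula; the function name is hypothetical, and degrees with decimal minutes are handled by leaving the seconds at zero.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees, minutes, and seconds to decimal degrees.

    hemisphere is one of "N", "S", "E", or "W"; southern and western
    coordinates are returned as negative values.
    """
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be N, S, E, or W")
    sign = -1 if hemisphere in ("S", "W") else 1
    return sign * (abs(degrees) + minutes / 60.0 + seconds / 3600.0)


if __name__ == "__main__":
    print(dms_to_decimal(60, 20, 15, "E"))   # longitude, Eastern Hemisphere
    print(dms_to_decimal(24, 0, 45, "S"))    # latitude, Southern Hemisphere
    print(dms_to_decimal(24, 0.75, 0, "N"))  # degrees and decimal minutes
```

The first two calls reproduce the worked examples in Table 4 (60.3375 and -24.0125), up to floating-point rounding.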

If the dataset you are working with is missing coordinates, you can add them manually using a gazetteer database. A gazetteer is an alphabetical database of administrative units combined with geographic coordinates. Gazetteer databases can be downloaded from the DIVA-GIS website (http://www.diva-gis.org/gdata).

Find a description of the location of your data point in your dataset; this could be the name of a village or town or a larger administrative unit, such as a municipality or a region. Use this information to search the gazetteer database for matches. Caution must be exercised to avoid assigning coordinates of a different village with the same name in another administrative unit. Alternatively, use Google Earth to locate the missing coordinates by typing in the description of the location and pinpointing the data point.
The placemark will contain the coordinates, which you can copy to your dataset.
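The caution about villages that share a name can be built into a lookup routine. The sketch below assumes the gazetteer has been loaded as a list of records with a place name, an administrative unit, and coordinates; the field names and sample entries are hypothetical and do not reflect the column layout of any particular gazetteer file.

```python
def lookup_coordinates(place, admin_unit, gazetteer):
    """Return (lat, lon) only when name and admin unit match one entry.

    Returns None when there is no match or more than one match, so that
    ambiguous records are left for manual checking.
    """
    place_key = place.strip().lower()
    admin_key = admin_unit.strip().lower()
    matches = [
        entry for entry in gazetteer
        if entry["name"].strip().lower() == place_key
        and entry["admin_unit"].strip().lower() == admin_key
    ]
    if len(matches) == 1:
        return matches[0]["lat"], matches[0]["lon"]
    return None


if __name__ == "__main__":
    # Hypothetical gazetteer: the same village name in two districts
    gazetteer = [
        {"name": "Village A", "admin_unit": "District 1", "lat": -1.25, "lon": 36.75},
        {"name": "Village A", "admin_unit": "District 2", "lat": 0.50, "lon": 34.60},
    ]
    print(lookup_coordinates("village a", "District 2", gazetteer))
```

Matching on the name alone would return two candidates here, which is exactly the situation the text warns about.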

In summary:

  • Your dataset contains the required ID, label (taxonomic name), and latitude, longitude columns.
  • Coordinates are in decimal degrees.
  • Presence points with missing coordinates are removed completely from the dataset or missing coordinates are assigned using a gazetteer database.
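The checks in this summary can be run automatically before importing the data. The sketch below assumes the dataset has been saved as CSV text with columns named `id`, `species`, `latitude`, and `longitude`; these column names are hypothetical, so rename them to match your own file.

```python
import csv
import io

REQUIRED = ("id", "species", "latitude", "longitude")  # hypothetical names


def clean_presence_points(csv_text):
    """Split records into those ready for import and those needing attention.

    A record is kept when it has an ID, a taxonomic name, and coordinates
    that parse as decimal degrees within the valid latitude/longitude range.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))

    kept, dropped = [], []
    for row in reader:
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (TypeError, ValueError):
            dropped.append(row)  # blank or not in decimal degrees
            continue
        has_labels = (row["id"] or "").strip() and (row["species"] or "").strip()
        in_range = -90 <= lat <= 90 and -180 <= lon <= 180
        if has_labels and in_range:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped
```

Records in the second list are candidates for the gazetteer or Google Earth lookup described above rather than for automatic deletion.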

2. Importing data into DIVA-GIS

Before continuing, make sure you have DIVA-GIS version 7.5 installed. Go to the DIVA-GIS website and click on the download page (http://www.diva-gis.org/download). For full functionality, you should also download the WorldClim climate data from the DIVA-GIS website (http://www.diva-gis.org/climate) or from the downscaled GCM data portal (http://gisweb.ciat.cgiar.org/GCMPage).

Once you have DIVA-GIS up and running, take some time to familiarize yourself with the program. A good starting point is the DIVA-GIS 7.5 manual by Hijmans et al. (2012) which can be found at http://www.diva-gis.org/docs/DIVA-GIS_manual_7.pdf

Importing occurrence data: From the Data menu select “Import Points to Shapefile” and choose which type of file you want to import. DIVA-GIS allows direct imports from text (.TXT), Access database (.MDB), Excel (.XLS), or dBase (.DBF) files. You are asked to specify your dataset (input file) and the columns that contain latitude and longitude. Click “Save to Shapefile” to generate a vector file (.SHP). This shapefile contains all the presence points in your dataset.

Importing climate data (WorldClim): Previously, you downloaded climate layers from the DIVA-GIS website. The files in the ZIP-file are CLM files, which you should extract and store in a folder on your hard drive. To load the climate data into DIVA-GIS, go to Tools/Options/Climate and select the folder in which you stored the WorldClim files. Check to ensure that DIVA-GIS selected the right columns for each parameter and press “Apply.” You have now loaded the climate data into DIVA-GIS. This is not visible on your screen, but you have also set the WorldClim database as your default. You will use the climate data in Module 3.

3. Importing data into MaxEnt

After installing and starting MaxEnt, you will be asked to load a file with occurrence data (“samples”) and another file containing environmental variables (“environmental layers”). The occurrence data should be a .CSV file and the environmental layers should be ASCII raster grids (each environmental variable represents one layer). For example, the 19 bioclimatic values representing the WorldClim datasets would appear as separate layers. After specifying an output file, you are ready to proceed to the next step of running the MaxEnt model. More on this topic in Module 3.

A full description of the procedure is available in Scheldeman and van Zonneveld (2010: 28–33).
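Writing the occurrence file for MaxEnt can be scripted once the presence points are clean. The sketch below assumes the three-column layout used in the MaxEnt tutorial (species name, then longitude, then latitude); confirm this against the documentation for your MaxEnt version. The function name and record keys are hypothetical.

```python
import csv
import io


def write_maxent_samples(records, handle):
    """Write presence records as a samples CSV.

    Assumed column order: species, longitude, latitude.
    records is an iterable of dicts with those three keys.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["species", "longitude", "latitude"])
    for rec in records:
        writer.writerow([rec["species"], rec["longitude"], rec["latitude"]])


if __name__ == "__main__":
    points = [{"species": "Zea mays", "longitude": 36.8, "latitude": -1.3}]
    buffer = io.StringIO()
    write_maxent_samples(points, buffer)
    print(buffer.getvalue())
```

To save to disk instead, pass a file opened with `open("samples.csv", "w", newline="", encoding="utf-8")` (a hypothetical file name) as the handle.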

Recommended readings

Hijmans, R.J., Guarino, L., Mathur, P. 2012. DIVA-GIS version 7.5 manual.

Use of the Ecocrop model is explained on pages 54–56.

Scheldeman, X., van Zonneveld, M. 2010. Training manual on spatial analysis of plant diversity and distribution (pages 1–40). Bioversity International, Rome, Italy.

Here is a quiz that will help you test your newly acquired knowledge. Once you have covered the content sections and completed the assigned readings, please answer the Data preparation and software selection quiz.


Applying your new knowledge

Now that you are able to select, prepare, and clean data and use it to create maps, it is time to apply your new knowledge. Please document this step of the research process by:

  1. Listing two data sources each for biodiversity data and climate data.
  2. Explaining the use of DIVA-GIS, MaxEnt, Google Earth, and the Climate Analogues tool in climate change analysis.
  3. Describing the steps for loading data into DIVA-GIS and MaxEnt.

The next module in our research process is Climate change analysis. Let us begin!
