In last practical session we look how to find and access data for your projects.
The availability of open data is growing rapidly. Data is shared by governments, organisations and individual persons.
At the same time the idea of reproducibility is spreading in science. One assumption for that is machine-readable data. Database API are being developed for this purpose.
Such APIs allow direct access to databases from R workspace. Many different packages have been developed for this purpose:
In current tutorial your task is to get familiar with R package ‘wbstats’. This package provides you with tools for searching and downloading data and statistics from the World Bank Data API.
Read the introductive tutorial and test the examples.
Your task in this session is to find one interesting geospatial dataset from wbstats and create the map based on that data!
First, look at the tutorial: wbstats**.
As we read from the tutorial:
For performance and ease of use, a cached version of useful information is provided with the wbstats R-package. This data is called wb_cachelist and provides a snapshot of available countries, indicators, and other relevant information.
wb_cachelistis by default the the source from whichwbsearch()andwb()uses to find matching information. The structure ofwb_cachelistis as follows.
Start libraries (install first, if needed!):
library(tidyverse) # always useful!
library(wbstats)
Get the list of indicators:
wbstats_indicators <- wb_cachelist$indicators
head(wbstats_indicators)
## indicatorID
## 1 ZINC
## 2 XGDP.56.FSGOV.FDINSTADM.FFD
## 3 XGDP.23.FSGOV.FDINSTADM.FFD
## 4 WP15187.1
## 5 WP15186.1
## 6 WP15185.1
## indicator
## 1 Zinc, cents/kg, current$
## 2 Government expenditure in tertiary institutions as % of GDP (%)
## 3 Government expenditure in secondary institutions education as % of GDP (%)
## 4 Received payments for agricultural products: through a mobile phone (% recipients, age 15+) [w2]
## 5 Received payments for agricultural products: into an account at a financial institution (% recipients, age 15+) [w2]
## 6 Received payments for agricultural products: in cash (% recipients, age 15+) [w2]
## unit
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## indicatorDesc
## 1 Zinc (LME), high grade, minimum 99.95% purity, settlement price beginning April 1990; previously special high grade, minimum 99.995%, cash prices
## 2 Total general (local, regional and central) government expenditure in educational institutions (current and capital) at a given level of education, expressed as a percentage of GDP. It excludes transfers to private entities such as subsidies to households and students, but includes expenditure funded by transfers from international sources to government. Divide total expenditure in public institutions of a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/
## 3 Total general (local, regional and central) government expenditure in educational institutions (current and capital) at a given level of education, expressed as a percentage of GDP. It excludes transfers to private entities such as subsidies to households and students, but includes expenditure funded by transfers from international sources to government. Divide total expenditure in public institutions of a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/
## 4 Denotes, among respondents reporting personally receiving money from any source for the sale of agricultural products, crops, produce, or livestock (self- or family-owned) in the past 12 months, the percentage who received this money through a mobile phone (% recipients, age 15+). [w2: data are available for wave 2].
## 5 Denotes, among respondents reporting personally receiving money from any source for the sale of agricultural products, crops, produce, or livestock (self- or family-owned) in the past 12 months, the percentage who received this money directly into an account at a bank or another type of financial institution (% recipients, age 15+). [w2: data are available for wave 2].
## 6 Denotes, among respondents reporting personally receiving money from any source for the sale of agricultural products, crops, produce, or livestock (self- or family-owned) in the past 12 months, the percentage who received this money directly in cash (% recipients, age 15+). [w2: data are available for wave 2].
## sourceOrg
## 1 Platts Metals Week, Engineering and Mining Journal; Thomson Reuters Datastream; World Bank.
## 2 UNESCO Institute for Statistics
## 3 UNESCO Institute for Statistics
## 4 Demirguc-Kunt et al., 2015
## 5 Demirguc-Kunt et al., 2015
## 6 Demirguc-Kunt et al., 2015
## sourceID source
## 1 21 Global Economic Monitor Commodities
## 2 12 Education Statistics
## 3 12 Education Statistics
## 4 28 Global Financial Inclusion
## 5 28 Global Financial Inclusion
## 6 28 Global Financial Inclusion
RStudio is not the best environment to study the list of available indicators. Therefore I saved it as csv and opened it in spreadsheet (e.g. MS Excel). The list of available indicators is very long 16978 rows. Try to find geospatial dataset with global coverage. I choosed “Refugee population by country or territory of origin” (SM.POP.REFG) and “Population, total” (SP.POP.TOTL).
My plan is to create the global map where the ratio of refugee population of origin / total population is presented.
For that I run download query with multiple indicators:
pop_gdp_wide <- wb(indicator = c("SP.POP.TOTL", "SM.POP.REFG"),
mrv = 1, return_wide = TRUE)
Try to understand the function of every parameter in previous script!
Check the result:
glimpse(pop_gdp_wide)
## Observations: 263
## Variables: 6
## $ iso3c <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARB", "ARE", "ARG",...
## $ date <chr> "2018", "2018", "2018", "2018", "2018", "2018", "2018",...
## $ iso2c <chr> "AW", "AF", "AO", "AL", "AD", "1A", "AE", "AR", "AM", "...
## $ country <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra",...
## $ SM.POP.REFG <dbl> NA, 72231, 39865, 131, NA, 9326512, 1164, 3468, 17970, ...
## $ SP.POP.TOTL <dbl> 105845, 37172386, 30809762, 2866376, 77006, 419790588, ...
Everything looks ok. I have the data! Now I need spatial layer of countries. For this we can use package rnaturalearth:
library(rnaturalearth)
Download country borders:
countries50 <- ne_download(scale = 50, type = 'countries', category = 'cultural', returnclass = "sf")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\Administrator\AppData\Local\Temp\Rtmpkfurs9", layer: "ne_50m_admin_0_countries"
## with 241 features
## It has 94 fields
## Integer64 fields read as strings: POP_EST NE_ID
glimpse(countries50)
## Observations: 241
## Variables: 95
## $ featurecla <chr> "Admin-0 country", "Admin-0 country", "Admin-0 country",...
## $ scalerank <int> 1, 1, 1, 3, 5, 6, 1, 1, 1, 3, 5, 3, 3, 3, 3, 1, 5, 3, 3,...
## $ LABELRANK <int> 3, 3, 3, 2, 3, 6, 4, 3, 4, 6, 6, 6, 6, 6, 4, 5, 2, 4, 5,...
## $ SOVEREIGNT <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ SOV_A3 <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ ADM0_DIF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ LEVEL <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ TYPE <chr> "Sovereign country", "Sovereign country", "Sovereign cou...
## $ ADMIN <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ ADM0_A3 <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ GEOU_DIF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ GEOUNIT <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ GU_A3 <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ SU_DIF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ SUBUNIT <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ SU_A3 <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ BRK_DIFF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ NAME <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ NAME_LONG <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ BRK_A3 <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ BRK_NAME <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ BRK_GROUP <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ABBREV <chr> "Zimb.", "Zambia", "Yem.", "Viet.", "Ven.", "Vat.", "Van...
## $ POSTAL <chr> "ZW", "ZM", "YE", "VN", "VE", "V", "VU", "UZ", "UY", "FS...
## $ FORMAL_EN <chr> "Republic of Zimbabwe", "Republic of Zambia", "Republic ...
## $ FORMAL_FR <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NAME_CIAWF <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ NOTE_ADM0 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Commonwealt...
## $ NOTE_BRK <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NAME_SORT <chr> "Zimbabwe", "Zambia", "Yemen, Rep.", "Vietnam", "Venezue...
## $ NAME_ALT <chr> NA, NA, NA, NA, NA, "Holy See", NA, NA, NA, NA, NA, NA, ...
## $ MAPCOLOR7 <int> 1, 5, 5, 5, 1, 1, 6, 2, 1, 5, 2, 4, 4, 4, 4, 4, 4, 6, 6,...
## $ MAPCOLOR8 <int> 5, 8, 3, 6, 3, 3, 3, 3, 2, 2, 5, 5, 5, 5, 5, 5, 5, 6, 6,...
## $ MAPCOLOR9 <int> 3, 5, 3, 5, 1, 4, 7, 5, 2, 4, 5, 1, 1, 1, 1, 1, 1, 6, 6,...
## $ MAPCOLOR13 <int> 9, 13, 11, 4, 4, 2, 3, 4, 10, 13, 3, 1, 1, 1, 1, 1, 1, 3...
## $ POP_EST <chr> "13805084", "15972000", "28036829", "96160163", "3130401...
## $ POP_RANK <int> 14, 14, 15, 16, 15, 3, 10, 15, 12, 9, 8, 8, 9, 9, 8, 12,...
## $ GDP_MD_EST <dbl> 2.833e+04, 6.517e+04, 7.345e+04, 5.949e+05, 4.686e+05, 0...
## $ POP_YEAR <int> 2017, 2017, 2017, 2017, 2017, 2015, 2017, 2017, 2017, 20...
## $ LASTCENSUS <int> 2002, 2010, 2004, 2009, 2001, NA, 2009, 1989, 2004, 2000...
## $ GDP_YEAR <int> 2016, 2016, 2016, 2016, 2016, 0, 2016, 2016, 2016, 2016,...
## $ ECONOMY <chr> "5. Emerging region: G20", "7. Least developed region", ...
## $ INCOME_GRP <chr> "5. Low income", "4. Lower middle income", "4. Lower mid...
## $ WIKIPEDIA <int> NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ FIPS_10_ <chr> "ZI", "ZA", "YM", "VM", "VE", "VT", "NH", "UZ", "UY", "F...
## $ ISO_A2 <chr> "ZW", "ZM", "YE", "VN", "VE", "VA", "VU", "UZ", "UY", "F...
## $ ISO_A3 <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ ISO_A3_EH <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ ISO_N3 <chr> "716", "894", "887", "704", "862", "336", "548", "860", ...
## $ UN_A3 <chr> "716", "894", "887", "704", "862", "336", "548", "860", ...
## $ WB_A2 <chr> "ZW", "ZM", "RY", "VN", "VE", NA, "VU", "UZ", "UY", "FM"...
## $ WB_A3 <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", NA, "VUT", "UZB", "UR...
## $ WOE_ID <int> 23425004, 23425003, 23425002, 23424984, 23424982, 234249...
## $ WOE_ID_EH <int> 23425004, 23425003, 23425002, 23424984, 23424982, 234249...
## $ WOE_NOTE <chr> "Exact WOE match as country", "Exact WOE match as countr...
## $ ADM0_A3_IS <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ ADM0_A3_US <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ ADM0_A3_UN <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ADM0_A3_WB <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ CONTINENT <chr> "Africa", "Africa", "Asia", "Asia", "South America", "Eu...
## $ REGION_UN <chr> "Africa", "Africa", "Asia", "Asia", "Americas", "Europe"...
## $ SUBREGION <chr> "Eastern Africa", "Eastern Africa", "Western Asia", "Sou...
## $ REGION_WB <chr> "Sub-Saharan Africa", "Sub-Saharan Africa", "Middle East...
## $ NAME_LEN <int> 8, 6, 5, 7, 9, 7, 7, 10, 7, 10, 12, 14, 15, 4, 14, 11, 2...
## $ LONG_LEN <int> 8, 6, 5, 7, 9, 7, 7, 10, 7, 30, 16, 24, 28, 4, 14, 11, 1...
## $ ABBREV_LEN <int> 5, 6, 4, 5, 4, 4, 4, 4, 4, 6, 6, 6, 11, 4, 9, 4, 6, 10, ...
## $ TINY <int> NA, NA, NA, 2, NA, 4, 2, 5, NA, NA, 2, 3, 3, 2, 3, NA, N...
## $ HOMEPART <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, 1, ...
## $ MIN_ZOOM <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MIN_LABEL <dbl> 3.0, 3.0, 3.0, 2.0, 3.0, 5.0, 4.0, 3.0, 3.0, 5.0, 5.0, 5...
## $ MAX_LABEL <dbl> 8.0, 8.0, 8.0, 7.0, 7.5, 10.0, 9.0, 8.0, 8.0, 10.0, 10.0...
## $ NE_ID <chr> "1159321441", "1159321439", "1159321425", "1159321417", ...
## $ WIKIDATAID <chr> "Q954", "Q953", "Q805", "Q881", "Q717", "Q237", "Q686", ...
## $ NAME_AR <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NAME_BN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NAME_DE <chr> "Simbabwe", "Sambia", "Jemen", "Vietnam", "Venezuela", "...
## $ NAME_EN <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ NAME_ES <chr> "Zimbabue", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ NAME_FR <chr> "Zimbabwe", "Zambie", "Yémen", NA, "Venezuela", "Vatican...
## $ NAME_EL <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NAME_HI <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NAME_HU <chr> "Zimbabwe", "Zambia", "Jemen", NA, "Venezuela", NA, "Van...
## $ NAME_ID <chr> "Zimbabwe", "Zambia", "Yaman", "Vietnam", "Venezuela", "...
## $ NAME_IT <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", N...
## $ NAME_JA <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NAME_KO <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NAME_NL <chr> "Zimbabwe", "Zambia", "Jemen", "Vietnam", "Venezuela", "...
## $ NAME_PL <chr> "Zimbabwe", "Zambia", "Jemen", "Wietnam", "Wenezuela", "...
## $ NAME_PT <chr> NA, NA, "Iémen", "Vietname", "Venezuela", "Vaticano", "V...
## $ NAME_RU <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NAME_SV <chr> "Zimbabwe", "Zambia", "Jemen", "Vietnam", "Venezuela", "...
## $ NAME_TR <chr> "Zimbabve", "Zambiya", "Yemen", "Vietnam", "Venezuela", ...
## $ NAME_VI <chr> "Zimbabwe", "Zambia", "Yemen", NA, "Venezuela", NA, "Van...
## $ NAME_ZH <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ geometry <MULTIPOLYGON [arc_degree]> MULTIPOLYGON (((31.28789 -2..., MU...
List of columns in downloaded dataset is very long. To keep workscpace cleaner I select only the relevant columns:
countries50_ii <- countries50 %>%
select(ADM0_A3, NAME, GDP_MD_EST, POP_EST)
glimpse(countries50_ii)
## Observations: 241
## Variables: 5
## $ ADM0_A3 <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", ...
## $ NAME <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "...
## $ GDP_MD_EST <dbl> 2.833e+04, 6.517e+04, 7.345e+04, 5.949e+05, 4.686e+05, 0...
## $ POP_EST <chr> "13805084", "15972000", "28036829", "96160163", "3130401...
## $ geometry <MULTIPOLYGON [arc_degree]> MULTIPOLYGON (((31.28789 -2..., MU...
In next step I have to join my data table (attributes) with geospatial layer (geometry):
data_joined <- left_join(countries50_ii, pop_gdp_wide, by = c("ADM0_A3" = "iso3c"))
glimpse(data_joined)
## Observations: 241
## Variables: 10
## $ ADM0_A3 <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB",...
## $ NAME <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", ...
## $ GDP_MD_EST <dbl> 2.833e+04, 6.517e+04, 7.345e+04, 5.949e+05, 4.686e+05, ...
## $ POP_EST <chr> "13805084", "15972000", "28036829", "96160163", "313040...
## $ date <chr> "2018", "2018", "2018", "2018", "2018", NA, "2018", "20...
## $ iso2c <chr> "ZW", "ZM", "YE", "VN", "VE", NA, "VU", "UZ", "UY", "FM...
## $ country <chr> "Zimbabwe", "Zambia", "Yemen, Rep.", "Vietnam", "Venezu...
## $ SM.POP.REFG <dbl> 7797, 49879, 264369, NA, 67289, NA, NA, 14, 391, NA, NA...
## $ SP.POP.TOTL <dbl> 14439018, 17351822, 28498687, 95540395, 28870195, NA, 2...
## $ geometry <MULTIPOLYGON [arc_degree]> MULTIPOLYGON (((31.28789 -2..., M...
Now I calculate the ratio (percentage):
data_joined <- data_joined %>%
mutate(refug_pop_ratio = SM.POP.REFG / SP.POP.TOTL * 100)
In previous tutorials we have mainly used for visualizations the ggplot2 packages. But here I will use tmap to create the map:
library(tmap)
tm_shape(data_joined)+
tm_polygons(col = "refug_pop_ratio",
style = "quantile",
palette = "-cividis",
border.col = "grey40",
lwd = 0.2,
title = "Percentage",
colorNA = "dodgerblue")+
tm_style("classic")+
tm_layout(main.title = "Share of refugees in population",
legend.bg.color = "white",
legend.bg.alpha = 0.7)+
tm_credits("data: World Bank; map: A. Aasa",
bg.color = "white",
bg.alpha = 0.7)
The result look ok for me!
And now is your turn!
Please create your map. Bind all the home works into one doc/pdf/html. Add script-files separately (in case of R-markdowm script is already in html file). Please add your name to each file name! E-mail results to: anto.aasa@ut.ee. Final deadline for all submissions is May 15th, 2020.
Author: Anto Aasa
in ISET, Tbilisi
Last update: 2020-05-12 10:50:54