First step is always starting the packages we need:
library(readr)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(treemapify)
library(knitr)
library(tidyr)
library(worldmet)
library(ggmap)
library(lubridate)
library(sf)
library(rsdmx)
library(gganimate)
library(plotly)
library(mapview)
According to Wikipedia: Visualization is any technique for creating images, diagrams, or animations to communicate a message. Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity. Examples from history include cave paintings, Egyptian hieroglyphs, Greek geometry, and Leonardo da Vinci’s revolutionary methods of technical drawing for engineering and scientific purposes. Visualization today has ever-expanding applications in science, education, engineering (e.g., product visualization), interactive multimedia, medicine, etc. Typical of a visualization application is the field of computer graphics. The invention of computer graphics may be the most important development in visualization since the invention of central perspective in the Renaissance period. The development of animation also helped advance visualization.
Several remarkably good books about scientific visualization have been published:
The last book on the list has been influencial also for building R library ggplot2: Wickham H. (2010) “A Layered Grammar of Graphics”“
The author of current short tutorial is not trying to compete with any of the previously mentioned books. The aim of the tutorial is to go through some of the important aspects of scientific visualization and demonstrate the process of creating visualizations.
More specifically we are focusing on R library ggplot2.
Before whatever complicated analysis we should have as good as possible overview of data we are using. For that there are two main (parallel) ways:
Today we focus on visualizations. Modern hard- and software allow us to create very sophisticated and colorful graphs. However, we should remember, that very often less is more. So let’s forget 3-D effects and try to be modest.
Despite the notorious saying about three kind of lies - “lies, damned lies, and statistics” - we should avoid this assumption in our research. It is important to acknowledge that the way you present your results on graphs can have a crucial impact on how viewers perceive your message.
Let us look more closely one example:
Same data, almost the same plot, but completely different impression!
ggplot2 works on principle of layers: all the plot elements are placed on different layers and it’s based on grammar of graphics. Hadley Wickham, the original author of the ggplot2 package, writes in his book ggplot2 “…the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system”.
Next picture tries to visualize a graph’s components:
Geometry is one of the most important elements of a graph: it’s the content. But without scales, coordinate system and annotations the graph is not complete, it isn’t informative and can be even misleading.
Geometry of ggplot2 is defined by geom. For every new data layer the geom type must be defined. In addition to the geom type aesthetics (aes()) must be described also. In this section the colour, shape and location of the geometry is described. Aesthetic can be specified uniformly for the whole graph or separately for every data layer.
There are many different geom’s available. The most common of them are listed below:
It is little use of just talking about graphs. Let’s start plotting! First of all, every lecturer wants to know who the participants are. This helps to decide what should be the level of teaching.
Download the list of participants of the visualization workshop:
participants <- read_delim("http://aasa.ut.ee/digitalMethods/data/aasa_visual_participants.csv", ";", escape_double = FALSE, trim_ws = TRUE)
## Warning: Missing column names filled in: 'X6' [6], 'X7' [7]
## Parsed with column specification:
## cols(
## name = col_character(),
## institution = col_character(),
## gender = col_character(),
## background = col_character(),
## level = col_character(),
## X6 = col_character(),
## X7 = col_character()
## )
# actually, there are many possibilities to read data into R (fread, read_csv, read_delim etc)
glimpse(participants) # check table structure
## Observations: 39
## Variables: 7
## $ name <chr> "Giovanni", "Krista", "Agnese", "Maris", "Marge", ...
## $ institution <chr> "Estonian Academy of Music and Theatre", "Universi...
## $ gender <chr> "m", "f", "f", "f", "f", "f", "f", "m", "f", "f", ...
## $ background <chr> "Composition (Music)", "Information management", "...
## $ level <chr> "Doctoral studies", "MA or PhD supervisor", "MA or...
## $ X6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ X7 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#View(participants) #view full table
#fix(participants) #possiblity to add or edit values
Clean the table! Last row of the dataset is empty. Also last two columns are empty.
In the current tutorial we are using the package called dplyr for data filtering, arranging, selecting, mutating, grouping and summarising. And for better (human readable) code we are using pipes.
participants <- participants %>% filter(!is.na(gender)) # is.na() detects NA values (NA - Not Available)
participants <- participants[,1:5] # select columns 1 - 5
Count participants by category (currently the institution):
ggplot()+
geom_bar(data = participants, aes(x=institution))
The labels of X-axis are piled on top of each other. We can rotate them:
ggplot()+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust = 1.0))+
geom_bar(data = participants, aes(x=institution))
Still not the best. Axis labels are taking to musch space and it’s hard to read the labels on X-axis. But we can rotate the whole graph:
ggplot()+
geom_bar(data = participants, aes(x=institution))+
coord_flip()
OK!
Now same procedure for visualizing the gender balance and participants position:
ggplot()+
geom_bar(data = participants, aes(x=gender))
ggplot()+
geom_bar(data = participants, aes(x=level))
In the era of social network the popularity is very important! We will study the popularity of participants with the help of Google search. There is a pre-prepared dataset where after every first name is the number of results from Google search (of course the list shows nothing and is misleading, but it’s useful for practicing ggplot2).
Import data:
participants_popularity <- read_delim("http://aasa.ut.ee/digitalMethods/data/participant_popularity.csv", ";", escape_double = FALSE, trim_ws = TRUE)
## Parsed with column specification:
## cols(
## name = col_character(),
## n = col_double()
## )
glimpse(participants_popularity)
## Observations: 38
## Variables: 2
## $ name <chr> "Anu", "Mai", "Agnese", "Aija", "Airi-Alina", "Anastasia"...
## $ n <dbl> 8.01e+07, 1.68e+09, 1.03e+07, 3.48e+06, 8.07e+03, 1.74e+0...
Draw the plot:
ggplot()+
geom_point(data = participants_popularity, aes(x=name, y=n))
There are several problems: names are piled on top of each other, Y-axis values are in scientific notation and names should be sorted accordng to their popularity. Maybe arrange() will help?
participants_popularity <- participants_popularity %>% arrange(desc(n)) # desc() sort values starting from biggest
head(participants_popularity)
## # A tibble: 6 x 2
## name n
## <chr> <dbl>
## 1 Esta 2520000000
## 2 Mai 1680000000
## 3 Anna 1160000000
## 4 Anne 950000000
## 5 Diana 502000000
## 6 Giovanni 250000000
Looks ok! The most popular name is first! Let’s plot the table again:
ggplot()+
geom_point(data = participants_popularity, aes(x=name, y=n))
Still the same. It means we have to do the sorting in ggplot:
ggplot()+
geom_point(data = participants_popularity, aes(x=reorder(name, n), y=n))
# reorder() makes the thing! It reorers names by n
But to increase the readability we should switch the axis and maybe some lines help as well? Additionally we define the colour of geom, add a title and disable scientific notation of X-axis:
ggplot()+
theme_minimal()+
geom_segment(data = participants_popularity, aes(x=0, xend = n, y = reorder(name, n), yend=reorder(name, n)), colour="red")+
geom_point(data = participants_popularity, aes(x=n, y=reorder(name, n)), colour = "red")+
scale_x_continuous(labels = scales::comma)+
labs(title = "Popularity of names according to the Google search")
Congratulations! We manged to create the so called Cleveland dot plot. Probably you noticed, that we defined also the color of points. The colors list in R is very long: link.
There is no use of beautiful visualizations if we can’t export them to our research papers and presentations. Saving the plot as raster or vector is easy:
# to png (jpeg had some problems with Mac OS)
png("popularity.png", units = "in", res = 300, height = 6, width = 6)
ggplot()+
theme_minimal()+
geom_segment(data = participants_popularity, aes(x=0, xend = n, y = reorder(name, n), yend=reorder(name, n)), colour="red")+
geom_point(data = participants_popularity, aes(x=n, y=reorder(name, n)), colour = "red")+
scale_x_continuous(labels = scales::comma)+
labs(title = "Popularity of names according to the Google search")
dev.off()
## png
## 2
# to pdf
# try help: ?pdf
One nice possibility to visualize volumes is treemap. Let’s try it! Visualize the “popularity”:
ggplot(data = participants_popularity, aes(area = n, fill = n, label = name)) +
geom_treemap()+
geom_treemap_text(fontface = "italic", colour = "white", place = "centre", grow = TRUE)+ # defines how text is displayed
labs(title = "popularity of names according to the Google search", caption = "Author: A. Aasa")+ # add titles
guides(fill = F) #ban the display of legend
In the same way we can visualize the background of participants. Firstly, we should count the participants by their background. This can be nicely done with dplyr & piping:
glimpse(participants) # check the column names
## Observations: 38
## Variables: 5
## $ name <chr> "Giovanni", "Krista", "Agnese", "Maris", "Marge", ...
## $ institution <chr> "Estonian Academy of Music and Theatre", "Universi...
## $ gender <chr> "m", "f", "f", "f", "f", "f", "f", "m", "f", "f", ...
## $ background <chr> "Composition (Music)", "Information management", "...
## $ level <chr> "Doctoral studies", "MA or PhD supervisor", "MA or...
participants_level <- participants %>% group_by(level) %>% summarise(n = n()) %>% ungroup()
# translation for previous line: create object 'participants_level', select table 'participants', group by 'level', count rows, ungroup the result
And now the plot:
ggplot(data = participants_level, aes(area = n, fill = n, label = paste0(level, "\nn = ", n))) +
geom_treemap()+
geom_treemap_text(fontface = "bold", colour = "grey", place = "topleft", grow = TRUE)+
labs(title = "Background of participants", caption = "Author: A. Aasa")+
guides(fill = F)+
scale_fill_gradientn(colours=c("grey40", "grey10"))
Also we used some updates. For example we added the group size (n =) to the plot. For that we modified label=: we pasted level and n next to each other, added between them some text (“n =”) and defined a line break (‘’).
Probably we know the correct answer already… But as researchers we should verify our gut feeling!
Our workflow step by step is the following:
We can obtain the data via worldmet package. For that we ask information about meteorological stations in Tartu:
getMeta(site = "tartu", returnMap = T)
# In RStudio Viewer section you should see the map and also the table with meta information should be displayed
# The station code for Tartu is "262420-99999"
You can download meteodata year by year:
data_meteo_tartu_2017 <- importNOAA(code = "262420-99999", parallel = F, year = 2017)
## Warning: Grouping rowwise data frame strips rowwise nature
data_meteo_tartu_2018 <- importNOAA(code = "262420-99999", parallel = F, year = 2018)
## Warning: Grouping rowwise data frame strips rowwise nature
# bind 2 tables together: rbind() binds tables by rows
data_meteo_tartu <- rbind(data_meteo_tartu_2017, data_meteo_tartu_2018)
# Day before workshop meteorological data for 2018 was lost...
# download the same dataset from here:
data_meteo_tartu <- read.csv2("http://aasa.ut.ee/digitalMethods/data/data_meteo_tartu.csv")
glimpse(data_meteo_tartu)
## Observations: 14,323
## Variables: 4
## $ date <fct> 2017-01-01 00:00:00, 2017-01-01 01:00:00, 2017-01-01 ...
## $ air_temp <dbl> 6.00, 5.95, 5.85, 5.80, 5.75, 5.70, 5.15, 5.10, 5.10,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,...
When was the coldest and warmest day in Tartu during the last 2 years?
# summarise() gives just min and max values, but not the date:
data_meteo_tartu %>% filter(!is.na(air_temp)) %>% ungroup() %>% summarise(min = min(air_temp), max = max(air_temp))
## min max
## 1 -22.75 30.9
# filtering helps to get more information:
data_meteo_tartu %>% filter(!is.na(air_temp)) %>% ungroup() %>% filter(air_temp == min(air_temp)) %>% select(date, air_temp)
## date air_temp
## 1 2018-02-28 05:00:00 -22.75
data_meteo_tartu %>% filter(!is.na(air_temp)) %>% ungroup() %>% filter(air_temp == max(air_temp)) %>% select(date, air_temp)
## date air_temp
## 1 2018-08-09 13:00:00 30.9
## 2 2018-08-09 14:00:00 30.9
## 3 2018-08-10 12:00:00 30.9
To decide which year’s summer was warmer we calculate monthly mean air temperatures. For that from the initial table we select only the date and air temperature:
glimpse(data_meteo_tartu)
## Observations: 14,323
## Variables: 4
## $ date <fct> 2017-01-01 00:00:00, 2017-01-01 01:00:00, 2017-01-01 ...
## $ air_temp <dbl> 6.00, 5.95, 5.85, 5.80, 5.75, 5.70, 5.15, 5.10, 5.10,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,...
data_meteo_tartu <- data_meteo_tartu %>% select(date, air_temp)
For monthly mean we have to calculate month and year from date-time (we use them as grouping variables):
data_meteo_tartu <- data_meteo_tartu %>% mutate(month = month(date), year=year(date))
Calculate monthly average for both years:
data_meteo_tartu_monthMean <- data_meteo_tartu %>%
filter(!is.na(air_temp)) %>% # remove NA-values
group_by(year, month) %>%
summarise(mean_temp = mean(air_temp)) %>%
ungroup()
Plot the result! Barplot should be okay for that:
ggplot()+
geom_col(data = data_meteo_tartu_monthMean, aes(x=month, y=mean_temp, fill=year))
Not very nice… We have to “dodge” the bars, year must be converted to factor and labels on the X-axis should be improved:
ggplot()+
geom_col(data = data_meteo_tartu_monthMean, aes(x=month, y=mean_temp, fill=as.factor(year)), position = "dodge2")+
scale_fill_manual(values = c("dodgerblue", "orange"))+
scale_colour_manual(values = c("blue", "orange4"))+
scale_x_continuous(breaks = seq(1, 12, 1))+
labs(title = "monthly mean air temperature in Tartu", y = "mean, C", fill="year", colour = "year")
Almost ok! But we can also add the values on plot:
ggplot()+
geom_col(data = data_meteo_tartu_monthMean, aes(x=month, y=mean_temp, fill=as.factor(year)), position = "dodge2")+
geom_text(data = data_meteo_tartu_monthMean, aes(x=month, y=mean_temp + 1, group=year, label = round(mean_temp, 1), colour = as.factor(year)), position ="dodge")+
scale_fill_manual(values = c("dodgerblue", "orange"))+
scale_colour_manual(values = c("blue", "orange4"))+
scale_x_continuous(breaks = seq(1, 12, 1))+
labs(title = "monthly mean air temperature in Tartu", y = "mean, C", fill="year", colour = "year")
## Warning: Width not defined. Set with `position_dodge(width = ?)`
Yes! We had a correct feeling that the summer of 2018 was much warmer!
When we try to describe some spatial phenomena just tables or ordinary plots are not enough. We have to use maps!
Let’s look more closely how air temperature in Estonia was distributed on Aug 20, 2108 00:45. Data is obtained from Estonian Weather Service. Dataset contain meteorological information (air temperature) and information about observation sites (name, WMO-code and geographical coordinates).
First we import the data (because Estonian language is rich from letters like õäöü we define also the text encoding (UTF-8):
airtemp <- read.csv2("http://aasa.ut.ee/digitalMethods/data/et_airtemp_0045_20082018.csv", encoding = "utf-8")
glimpse(airtemp)
## Observations: 91
## Variables: 5
## $ name <fct> Kuressaare linn, Tallinn-Harku, Pakri, Kunda, J...
## $ wmocode <int> NA, 26038, 26029, 26045, 26046, 26058, 26141, 2...
## $ longitude <dbl> 22.43961, 24.60289, 24.04008, 26.54140, 27.3982...
## $ latitude <dbl> 58.28085, 59.39812, 59.38950, 59.52141, 59.3290...
## $ airtemperature <dbl> 15.3, 13.1, 18.2, 13.5, 13.1, 14.0, 11.7, 11.0,...
This dataset is presented as a plain text document (not a spatial object) with spatial information (locations of weather stations in Estonia). As usual, the easiest way to check the data is to visualize it:
We can plot the air temperature directly from data table:
ggplot()+
geom_point(data = airtemp, aes(x=longitude, y=latitude, colour = airtemperature))+
scale_colour_gradientn(colours = topo.colors(20))
Without spatial context it’s hard to interpret the plot (currently it’s not even a map). For spatial context we download a spatial layer with borders of counties from the Estonian Land Board. Originally the data is stored in SHP format, but more and more a [sf (simple feature) format is used [(https://cran.r-project.org/web/packages/sf/vignettes/sf1.html). Here we import data with sf package, which converts the shp to sf automatically:
# download & unzip the file:
download.file("https://geoportaal.maaamet.ee/docs/haldus_asustus/maakond_shp.zip", destfile="maakond_shp.zip")
#maakond means county!
unzip("maakond_shp.zip")
# Yesterday Mac OS users had problems with installing the sf-library.
# For Mac OS users it's resommended to skip next lines and continue again from '# MAC USERS!!!'
# read shp to R:
county <- st_read("maakond_20180801.shp")
## Reading layer `maakond_20180801' from data source `C:\Users\Anto\Documents\digitalmethods_bu\maakond_20180801.shp' using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 369032.1 ymin: 6377141 xmax: 739152.8 ymax: 6634019
## epsg (SRID): 3301
## proj4string: +proj=lcc +lat_1=58 +lat_2=59.33333333333334 +lat_0=57.51755393055556 +lon_0=24 +x_0=500000 +y_0=6375000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
glimpse(county)
## Observations: 15
## Variables: 3
## $ MNIMI <fct> Viljandi maakond, Hiiu maakond, Harju maakond, Lääne ...
## $ MKOOD <fct> 0084, 0039, 0037, 0056, 0071, 0045, 0081, 0079, 0052,...
## $ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((621049.2 64..., MULTIPOL...
# plot:
ggplot()+
geom_sf(data = county, fill="grey20", colour="white")
# MAC USERS!!!
# users have to install additional libraries:
#install.packages("maptools")
#install.packages("rgeos")
#install.packages("rgdal")
library(maptools)
library(sp)
library(rgdal)
county_sp <- readShapePoly("maakond_20180801.shp")
## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
library(rgeos)
## rgeos version: 0.3-28, (SVN revision 572)
## GEOS runtime version: 3.6.1-CAPI-1.10.1 r0
## Linking to sp version: 1.3-1
## Polygon checking: TRUE
county_sp <- gSimplify(county_sp, 100)
county_df <- fortify(county_sp)
ggplot()+
geom_polygon(data = county_df, aes(x=long, y=lat, group=group), fill="grey20", colour = "white")
What is the coordinate reference sytem (CRS) of the spatial layer of counties?
# Windows
st_crs(county)
## Coordinate Reference System:
## EPSG: 3301
## proj4string: "+proj=lcc +lat_1=58 +lat_2=59.33333333333334 +lat_0=57.51755393055556 +lon_0=24 +x_0=500000 +y_0=6375000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
#MAC
proj4string(county_sp) # value is NA, which means Coordinate Reference System is not set.
## [1] NA
#MAC set coordinate reference system:
proj4string(county_sp) <- CRS("+proj=lcc +lat_1=59.33333333333334 +lat_2=58 +lat_0=57.51755393055556 +lon_0=24 +x_0=500000 +y_0=6375000 +ellps=GRS80 +units=m +no_defs")
Pay attention to EPSG and proj4string more information. First of them is the CRS code and second is the definition. CRS of the downloaded spatial layer is the official coordinate system used in Estonia. Look more closer the spatial layer:
glimpse(county)
## Observations: 15
## Variables: 3
## $ MNIMI <fct> Viljandi maakond, Hiiu maakond, Harju maakond, Lääne ...
## $ MKOOD <fct> 0084, 0039, 0037, 0056, 0071, 0045, 0081, 0079, 0052,...
## $ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((621049.2 64..., MULTIPOL...
All the spatial information about how to draw the polygons is stored in column geometry. If we look closer, we see that the values there are quite big (much bigger than geographical coordinates). This means we cannot plot geographical coordinates directly on the current base map. For that we have to transform layers to have the same CRS.
So we know, that counties layer is in the official CRS of Estibua and locations of weather observation sites are in geographical coordinates. Let’s observe what happens if we try to put those layers together on a map:
#Windows:
ggplot()+
geom_sf(data = county)+
geom_point(data = airtemp, aes(x=longitude, y=latitude, colour = airtemperature))+
scale_colour_gradientn(colours = topo.colors(20))
#MAC
ggplot()+
geom_polygon(data = county_df, aes(x=long, y=lat, group=group))+
geom_point(data = airtemp, aes(x=longitude, y=latitude, colour = airtemperature))+
scale_colour_gradientn(colours = topo.colors(20))
we see that the layers do not fit on top of each other. Solution is to convert both layers to the same CRS. For example we could transform county borders (EPSG = 3301) to the geographical coordinates (EPSG = 4326):
# Windows:
county_4326 <- st_transform(county, 4326)
ggplot()+
theme_dark()+
geom_sf(data = county_4326, colour="grey10", fill="black")+
geom_point(data = airtemp, aes(x=longitude, y=latitude, colour = airtemperature), size=2)+
scale_colour_gradientn(colours = topo.colors(20))
#MAC:
county_sp_4326 <- spTransform(county_sp, CRS("+proj=longlat +ellps=WGS84"))
county_df_4326 <- fortify(county_sp_4326)
ggplot()+
theme_dark()+
geom_polygon(data = county_df_4326, aes(x=long, y=lat, group=group), colour ="grey10", fill = "black")+
geom_point(data = airtemp, aes(x=longitude, y=latitude, colour = airtemperature), size=2)+
scale_colour_gradientn(colours = topo.colors(20))
Or we can use, for example, Google Maps as a base map:
# get the map for Estonia
map_estonia <- get_map("estonia", maptype = "hybrid", color ="bw", zoom =7)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=estonia&zoom=7&size=640x640&scale=2&maptype=hybrid&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=estonia&sensor=false
ggmap(map_estonia)
Instead of plotting plain data table on top of a map, we convert the temperature dataset into a spatial object:
# Windows: create sf object
airtemp_sf <- st_as_sf(airtemp, coords = c("longitude", "latitude"), crs = 4326)
# MAC: next section
Plot the result on top of Google Map:
# Windows:
ggmap(map_estonia)+
geom_sf(data = airtemp_sf, aes(colour = airtemperature), size=3, inherit.aes = F)+
scale_colour_gradientn(colours = topo.colors(20))
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
#MAC:
ggmap(map_estonia)+
geom_point(data = airtemp, aes(x = longitude, y = latitude, colour = airtemperature), size=3)+
scale_colour_gradientn(colours = topo.colors(20))
With a help of this kind of maps it is much easier to explain possible spatial patterns and interprete the reasons what may cause this distribution.
Analysing temporal processes can be quite a challenge. Espescially if there are more than one factor. Let’s look at a simple case. Namely, we try to analyse how population pyramid of Estonia has changed in time. For that we have to get the data first.
Statistics Estonia offers a possibility to download data directly from R (or other software) using the SDMX API.
To get the data we have to go to the database and copy export parameters to R (I did it already, so you don’t have to) and then we can download and tidy the table:
# URL of data table
# URL of data table
#URL <- "http://andmebaas.stat.ee/restsdmx/sdmx.ashx/GetData/RV021/1+2+3.1+2+3+4+5+6+7+8+9+10+11+12+13+14+15+16+17+18+19+31+32+33+34+20+21/all?startTime=1919&endTime=2018"
#import data
#data_popPyramid <- readSDMX(URL, dsd = T)
# check the class
#class(data_popPyramid)
# convert data to plain table
#data_popPyramid %>% as.data.frame(labels = T) %>% head()
# get the structure for the data table:
#data_popPyramid_structure <- readSDMX("http://andmebaas.stat.ee/restsdmx/sdmx.ashx/GetDataStructure/RV021")
#class(data_popPyramid_structure)
# final format of table:
#data_popPyramid <- setDSD(data_popPyramid, data_popPyramid_structure)
#data_popPyramid_df <- data_popPyramid %>% as.data.frame(labels = T)
#glimpse(data_popPyramid_df)
# In case database of Statistics Estonia is down,
#you can download data from here:
data_popPyramid_df <- read.csv2("http://aasa.ut.ee/digitalMethods/data/data_popPyramid_df.csv")
glimpse(data_popPyramid_df)
## Observations: 3,292
## Variables: 11
## $ DIM1 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ DIM1_label.en <fct> Males, Males, Males, Males, Males, Males,...
## $ DIM1_label.et <fct> Mehed, Mehed, Mehed, Mehed, Mehed, Mehed,...
## $ DIM3 <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
## $ DIM3_label.en <fct> 35-39, 35-39, 35-39, 35-39, 35-39, 35-39,...
## $ DIM3_label.et <fct> 35-39, 35-39, 35-39, 35-39, 35-39, 35-39,...
## $ TIME_FORMAT <fct> P1Y, P1Y, P1Y, P1Y, P1Y, P1Y, P1Y, P1Y, P...
## $ TIME_FORMAT_label.en <fct> Annual, Annual, Annual, Annual, Annual, A...
## $ obsTime <int> 1923, 1924, 1925, 1926, 1927, 1928, 1929,...
## $ obsValue <int> 34323, 35005, 34893, 34576, 34533, 34649,...
## $ obsValue2 <int> 34323, 35005, 34893, 34576, 34533, 34649,...
Looks like we have successfully managed to import the data!
Now we have to check the tabel more precisely: are the categories ok, are they in the right order etc:
# time is stored as character, convert it to integer:
data_popPyramid_df <- data_popPyramid_df %>% mutate(obsTime = as.integer(obsTime))
data_popPyramid_df %>% head() %>% kable()
| DIM1 | DIM1_label.en | DIM1_label.et | DIM3 | DIM3_label.en | DIM3_label.et | TIME_FORMAT | TIME_FORMAT_label.en | obsTime | obsValue | obsValue2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Males | Mehed | 10 | 35-39 | 35-39 | P1Y | Annual | 1923 | 34323 | 34323 |
| 2 | Males | Mehed | 10 | 35-39 | 35-39 | P1Y | Annual | 1924 | 35005 | 35005 |
| 2 | Males | Mehed | 10 | 35-39 | 35-39 | P1Y | Annual | 1925 | 34893 | 34893 |
| 2 | Males | Mehed | 10 | 35-39 | 35-39 | P1Y | Annual | 1926 | 34576 | 34576 |
| 2 | Males | Mehed | 10 | 35-39 | 35-39 | P1Y | Annual | 1927 | 34533 | 34533 |
| 2 | Males | Mehed | 10 | 35-39 | 35-39 | P1Y | Annual | 1928 | 34649 | 34649 |
# check categories list:
data_popPyramid_df %>% distinct(DIM1_label.en) %>% kable()
| DIM1_label.en |
|---|
| Males |
| Females |
data_popPyramid_df %>% distinct(DIM3_label.en) %>% kable()
| DIM3_label.en |
|---|
| 35-39 |
| 10-14 |
| 75-79 |
| 40-44 |
| 5-9 |
| 70-74 |
| 20-24 |
| 85 and older |
| 90-94 |
| 25-29 |
| 55-59 |
| 15-19 |
| 95-99 |
| 45-49 |
| 50-54 |
| 80-84 |
| 30-34 |
| 0 |
| 60-64 |
| 1-4 |
| 65-69 |
| 100 and older |
| 85-89 |
Exclude unneeded categories:
data_popPyramid_df <- data_popPyramid_df %>% filter(DIM1_label.en != "Males and females")
data_popPyramid_df <- data_popPyramid_df %>% filter(DIM3_label.en != "Age groups total")
data_popPyramid_df <- data_popPyramid_df %>% filter(DIM3_label.en != "Age unknown")
Is the animation really needed? Why not use traditional line graph? We can try:
ggplot()+
geom_line(data = data_popPyramid_df, aes(x=obsTime, y=obsValue, group=DIM3_label.en, colour=DIM3_label.en))+
facet_wrap(~DIM1_label.en)
The pattern is strangely beautiful, but it doesn’t give us any information. Let’s try to animate the change!
Now we can start constructing the plot. It is reasonable to start with one time frame (the last, for example). It will be used as a template for the final animation:
# filter first time frame to construct plot
# Check the time extent:
data_popPyramid_df %>% summarise(start = min(obsTime), end = max(obsTime))
## start end
## 1 1923 2018
# table with last observation year:
data_popPyramid_df_2018 <- data_popPyramid_df %>% filter(obsTime == max(obsTime))
# Look are the gender classes ok?
data_popPyramid_df_2018 %>% distinct(DIM1_label.en)
## DIM1_label.en
## 1 Males
## 2 Females
Plot:
ggplot()+
theme_minimal()+
geom_point(data = data_popPyramid_df_2018, aes(x=obsValue, y = DIM3_label.en, group = DIM1_label.en, colour =DIM1_label.en))
Age groups are in wrong order! We have to reorder the factor levels:
data_popPyramid_df_2018 %>% distinct(DIM3_label.en)
## DIM3_label.en
## 1 35-39
## 2 10-14
## 3 75-79
## 4 40-44
## 5 5-9
## 6 70-74
## 7 20-24
## 8 85 and older
## 9 90-94
## 10 25-29
## 11 55-59
## 12 15-19
## 13 95-99
## 14 45-49
## 15 50-54
## 16 80-84
## 17 30-34
## 18 0
## 19 60-64
## 20 1-4
## 21 65-69
## 22 100 and older
## 23 85-89
Some problems. Fix them:
# type of age group variable?
glimpse(data_popPyramid_df_2018$DIM3_label.en)
## Factor w/ 23 levels "0","1-4","10-14",..: 9 3 18 10 12 17 6 21 22 7 ...
# convert it to factor!
data_popPyramid_df_2018 <- data_popPyramid_df_2018 %>% mutate(DIM3_label.en = as.factor(DIM3_label.en))
glimpse(data_popPyramid_df_2018$DIM3_label.en)
## Factor w/ 23 levels "0","1-4","10-14",..: 9 3 18 10 12 17 6 21 22 7 ...
# reorder factor levels:
data_popPyramid_df_2018$DIM3_label.en <- factor(data_popPyramid_df_2018$DIM3_label.en, levels = c(
"0",
"1-4",
"5-9",
"10-14",
"15-19",
"20-24",
"25-29",
"30-34",
"35-39",
"40-44",
"45-49",
"50-54",
"55-59",
"60-64",
"65-69",
"70-74",
"75-79",
"80-84",
"85-89",
"85 and older",
"90-94",
"95-99",
"100 and older"))
Check the result:
ggplot()+
theme_minimal()+
geom_point(data = data_popPyramid_df_2018, aes(x=obsValue, y = DIM3_label.en, group = DIM1_label.en, colour =DIM1_label.en))
Much better!!
But classical population pyramid is different:
Gender category “female” is usually on the left, and “male”" on the right. For that we should multiply values of female gender by -1:
To “push”" female data to the left side of the plot, the easiest way is to multiply it by -1:
data_popPyramid_df_2018 <- data_popPyramid_df_2018 %>% mutate(obsValue2 = ifelse(DIM1_label.en == "Females", obsValue * -1, obsValue))
data_popPyramid_df_2018 %>% head() %>% glimpse()
## Observations: 6
## Variables: 11
## $ DIM1 <int> 2, 3, 2, 3, 2, 3
## $ DIM1_label.en <fct> Males, Females, Males, Females, Males, Fe...
## $ DIM1_label.et <fct> Mehed, Naised, Mehed, Naised, Mehed, Naised
## $ DIM3 <int> 10, 5, 18, 11, 4, 17
## $ DIM3_label.en <fct> 35-39, 10-14, 75-79, 40-44, 5-9, 70-74
## $ DIM3_label.et <fct> 35-39, 10-14, 75-79, 40-44, 5-9, 70-74
## $ TIME_FORMAT <fct> P1Y, P1Y, P1Y, P1Y, P1Y, P1Y
## $ TIME_FORMAT_label.en <fct> Annual, Annual, Annual, Annual, Annual, A...
## $ obsTime <int> 2018, 2018, 2018, 2018, 2018, 2018
## $ obsValue <int> 46742, 33691, 18058, 44248, 38921, 33746
## $ obsValue2 <dbl> 46742, -33691, 18058, -44248, 38921, -33746
Let’s try again:
ggplot()+
theme_minimal()+
geom_segment(data = data_popPyramid_df_2018, aes(x = obsValue2, xend = 0, y = DIM3_label.en, yend = DIM3_label.en, colour =DIM1_label.en), size=4)
Already looks like a population pyramid! A bit more polishing:
ggplot()+
theme_minimal()+
geom_segment(data = data_popPyramid_df_2018, aes(x = obsValue2, xend = 0, y = DIM3_label.en, yend = DIM3_label.en, colour =DIM1_label.en), size=4)+
scale_colour_manual(values = c("orange", "dodgerblue"))+
labs(x="age group", y = "n", title = "population pyramid of Estonia", caption = "data: Statistics Estonia \nvisualization: A. Aasa", fill = "gender", colour = "gender")+
scale_x_continuous(breaks = c(-50000, -25000, 0, 25000, 50000), labels = c("50000", "25000", "0", "25000", "50000"))
Congratulations! We managed to create the first frame for the animation! Now the animating process is easy! But before that we should examine how big is the biggest age group, and at the same time we reorder factors for full table and multiply values:
# axis extent (helps to define x-axis limits):
data_popPyramid_df %>% filter(obsValue == max(obsValue)) %>% select(obsValue)
## obsValue
## 1 61172
# reorder factors:
data_popPyramid_df_2018 <- data_popPyramid_df_2018 %>% mutate(DIM3_label.en = as.factor(DIM3_label.en))
glimpse(data_popPyramid_df_2018$DIM3_label.en)
## Factor w/ 23 levels "0","1-4","5-9",..: 9 4 17 10 3 16 6 20 21 7 ...
# reorder factor levels:
data_popPyramid_df$DIM3_label.en <- factor(data_popPyramid_df$DIM3_label.en, levels = c(
"0",
"1-4",
"5-9",
"10-14",
"15-19",
"20-24",
"25-29",
"30-34",
"35-39",
"40-44",
"45-49",
"50-54",
"55-59",
"60-64",
"65-69",
"70-74",
"75-79",
"80-84",
"85-89",
"85 and older",
"90-94",
"95-99",
"100 and older"))
# multiply:
data_popPyramid_df <- data_popPyramid_df %>% mutate(obsValue2 = ifelse(DIM1_label.en == "Females", obsValue * -1, obsValue))
Animate all:
# old version of gganimate
data_popPyramid_ani <- ggplot(data = data_popPyramid_df, aes(x = obsValue2, xend = 0, y = DIM3_label.en, yend = DIM3_label.en, colour =DIM1_label.en, frame = obsTime, cumulative = FALSE))+
theme_minimal()+
geom_segment(size=4)+
scale_colour_manual(values = c("orange", "dodgerblue"))+
labs(x="age group", y = "n", title = "population pyramid of Estonia", caption = "data: Statistics Estonia \nvisualization: A. Aasa", fill = "gender", colour = "gender")+
scale_x_continuous(breaks = c(-50000, -25000, 0, 25000, 50000), labels = c("50000", "25000", "0", "25000", "50000"), limits = c(-62000, 62000))
gganimate(data_popPyramid_ani)
#new version of gganimate:
ggplot(data = data_popPyramid_df, aes(x = obsValue2, xend = 0, y = DIM3_label.en, yend = DIM3_label.en, colour =DIM1_label.en))+
theme_minimal()+
geom_segment(size=4)+
scale_colour_manual(values = c("orange", "dodgerblue"))+
#labs(x="age group", y = "n", title = "population pyramid of Estonia", caption = "data: Statistics Estonia \nvisualization: A. Aasa", fill = "gender", colour = "gender")+
scale_x_continuous(breaks = c(-50000, -25000, 0, 25000, 50000), labels = c("50000", "25000", "0", "25000", "50000"), limits = c(-62000, 62000))+
labs(title = 'year: {frame_time}', x = 'age group', y = 'n', colour="gender") +
transition_time(obsTime) +
ease_aes('linear')
#animate(data_popPyramid_ani, 'gif.gif')
#anim_save("pop_pyramid.gif", animation = last_animation())
Wikipedia: Across the many fields concerned with interactivity, including information science, computer science, human-computer interaction, communication, and industrial design, there is little agreement over the meaning of the term interactivity, although all are related to interaction with computers and other machines with a user interface.
As a final topic let us test some examples of ineractive plots. Lets look a bit back - the Estonian population pyramid for 2018:
gg_popPyr2018 <- ggplot()+
theme_minimal()+
geom_segment(data = data_popPyramid_df_2018, aes(x = obsValue2, xend = 0, y = DIM3_label.en, yend = DIM3_label.en, colour =DIM1_label.en), size=4)+
scale_colour_manual(values = c("orange", "dodgerblue"))+
labs(x="age group", y = "n", title = "population pyramid of Estonia", caption = "data: Statistics Estonia \nvisualization: A. Aasa", fill = "gender", colour = "gender")+
scale_x_continuous(breaks = c(-50000, -25000, 0, 25000, 50000), labels = c("50000", "25000", "0", "25000", "50000"))
ggplotly(gg_popPyr2018)
Example of an interactive map:
# Windows:
mapview(airtemp_sf, zcol = "airtemperature")
# MAC:
glimpse(airtemp)
## Observations: 91
## Variables: 5
## $ name <fct> Kuressaare linn, Tallinn-Harku, Pakri, Kunda, J...
## $ wmocode <int> NA, 26038, 26029, 26045, 26046, 26058, 26141, 2...
## $ longitude <dbl> 22.43961, 24.60289, 24.04008, 26.54140, 27.3982...
## $ latitude <dbl> 58.28085, 59.39812, 59.38950, 59.52141, 59.3290...
## $ airtemperature <dbl> 15.3, 13.1, 18.2, 13.5, 13.1, 14.0, 11.7, 11.0,...
airtemp_crd <- airtemp %>% dplyr::select(longitude, latitude)
airtemp_sp <- SpatialPointsDataFrame(airtemp_crd, airtemp)
mapview(airtemp_sp, zcol="airtemperature")
Author: Anto Aasa
Digital Methods in Humanities and Social Sciences
Last update: 2018-08-24 15:07:32