1 Why R?

R is a freely available tool for data analysis. R is both a language and an environment for statistical computing and graphical representation, and a very powerful tool for data processing, analysis and visualization. By today R has developed into a general-purpose programming language, suitable for anything from correlation analysis to building web applications.

In addition to the base program, thousands of extensions are available, which add functionality to the core. Extensions are called packages (libraries). Some packages ship with the main software, but most of them have to be installed separately.

The biggest problem with R is its quite steep learning curve. But it pays back very fast!

  • R is not just a statistics software, it’s a language.
  • R is free.
  • R is very flexible and powerful.

2 Introduction to R

To go through the current introduction you have to install R, RStudio and RTools to your computer.

  • R
  • RStudio Desktop
  • RTools
    • The tutorial says that if everything is done by the book, then everything should work.

In some cases an update of the environment variables is still needed:

Here you can find a good description of how to set up the R environment on your computer: link

Start RStudio

2.1 Additional reading & examples

2.2 R & RStudio environment

Define working directory:

# set working directory:
setwd("YOUR FOLDER PATH")

# avoid displaying numbers in scientific notation:
options(scipen = 999)

2.3 Installation of packages

Base R offers a wide variety of options and functions for data processing, analysis and visualization. Additionally we can use extensions called packages.
In R, a package is a collection of R functions, data and compiled code. The location where the packages are stored is called the library. If there is a particular functionality that you require, you can download the package from the appropriate site and it will be stored in your library. To actually use the package use the command “library(package)” which makes that package available to you. Then just call the appropriate package functions etc. link

install.packages("tidyverse")
install.packages("sf")
install.packages("ggmap")
install.packages("tmap")

2.4 Start libraries

library(tidyverse) #  collection of R packages designed for data science
library(sf) # spatial data as simple features
library(maptools) # spatial data
library(spatstat) # spatial statistics
library(tmap) # thematic maps
library(ggthemes) # additional predefined themes for ggplot2 plots
library(RColorBrewer)

2.5 Create objects

In R it’s all about assignment! We assign some properties to an object, after which the created object may be a value, a table, a plot, a map etc.
For example:

a <- 1 # assign value 1 to a
a # print the content of object "a"
## [1] 1
x <- c(1, 1, 2, 2) # create vector
x
## [1] 1 1 2 2
y <- c(1, 2, 2, 1)
y
## [1] 1 2 2 1
data_table <-  data.frame(var1=x, var2=y) # bind vectors to data frame (table)

View the data:

# see the content in console:
data_table

# View in data viewer:
View(data_table)

# edit data:
fix(data_table)

# data structure (column names, types *etc*)
glimpse(data_table)

# remove (delete) the object from working space:
rm(data_table) # delete object

# recreate the table:
data_table <-  tibble(var1 = x, var2 = y) # restore previously deleted object

Before any deeper analysis it is reasonable to describe the data. Very often datasets are too big and scrolling through the raw table is not enough. To get a better understanding of the data we can use descriptive statistics and/or graphs.
> According to Wikipedia: A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features of a collection of information.

For example, the next commands also demonstrate the main logic of applying functions to variables (function(tableName$columnName)):

max(data_table$var1)
## [1] 2
min(data_table$var1)
## [1] 1
mean(data_table$var1)
## [1] 1.5
sd(data_table$var1) # standard deviation
## [1] 0.5773503
length(data_table$var1)
## [1] 4
nrow(data_table) # nr of rows in data frame
## [1] 4
nchar("Hello, world!") # number of characters in string
## [1] 13
hw <- "Hello, world!"
nchar(hw)
## [1] 13

3 Get data

3.1 Data import

We are not living in an ideal world, which means that the data we get is often in awful condition. One of the biggest problems is the good will of an incompetent data collector and/or processor: a lot of effort is put into formatting the data into a "beautiful" shape and form. In MS Excel, for example, merged cells, subtotals, colors, bold/italic, comments and mixed data types are the favorites. It usually takes a lot of time to get rid of all this useless formatting and decoration.

Here is an example of data where dates are stored as text (in one column not just as text, but as Roman numerals!):

But for now we can assume that the datasets are prepared by a specialist and we can import the data without major problems.
In our first example we plan to analyse how the population size of the counties changed between 1990 and 2016.
For that we have to import the data:

3.1.1 You can download the data directly from the Internet

# import:
data_countyPop_csv <- read.csv("http://aasa.ut.ee/Rspatial/data/county_population_RV022.csv", encoding = "UTF-8")

# structure of imported dataset:
glimpse(data_countyPop_csv)
## Rows: 30
## Columns: 11
## $ DIM2       <int> 37, 37, 39, 39, 44, 44, 49, 49, 51, 51, 57, 57, 59, 59, 65,…
## $ County     <chr> "Harju county", "Harju county", "Hiiu county", "Hiiu county…
## $ DIM3       <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "K",…
## $ Sex        <chr> "Males and females", "Males and females", "Males and female…
## $ DIM4       <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999,…
## $ Age.group  <chr> "Age groups total", "Age groups total", "Age groups total",…
## $ TIME       <int> 1990, 2016, 1990, 2016, 1990, 2016, 1990, 2016, 1990, 2016,…
## $ Year       <int> 1990, 2016, 1990, 2016, 1990, 2016, 1990, 2016, 1990, 2016,…
## $ Value      <int> 607158, 576265, 11332, 9348, 221807, 146506, 42607, 31298, …
## $ Flag.Codes <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Flags      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

How many different data types are in this table?
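One base-R way to answer is to tabulate the class of every column. Sketched here on a small mock table (the column names and values are just illustrative stand-ins for the imported data):

```r
mock <- data.frame(DIM2   = c(37L, 39L),
                   County = c("Harju county", "Hiiu county"),
                   Value  = c(607158L, 11332L),
                   Flags  = c(NA, NA),
                   stringsAsFactors = FALSE)

sapply(mock, class)          # class of each column
table(sapply(mock, class))   # how many columns of each class
```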

3.1.2 Import data directly from Statistics Estonia database

In case of repetitive operations, scripting means that many (or even all) steps can be completed automatically. This includes importing data into your project. Let us look at a case where we try to download population size by counties in Estonia. The database of Statistics Estonia uses the SDMX format, so we need to load a package that brings this functionality into R.

Then we have to

  • define the address of the dataset
  • import the structure of dataset
  • download and import the data
  • format it to data frame.
library(rsdmx)
# address of data
data_URL <- "http://andmebaas.stat.ee/restsdmx/sdmx.ashx/GetData/RV022/37+39+44+49+51+57+59+65+67+70+74+78+82+84+86.K.999/all?startTime=1990&endTime=2016"

# address of data structure
data_structure_URL <- "http://andmebaas.stat.ee/restsdmx/sdmx.ashx/GetDataStructure/RV022"

# import data
data_countyPop <- readSDMX(data_URL, dsd = T)

# get the structure for the data table:
data_countyPop_STR <- readSDMX(data_structure_URL)

# final format of table:
data_countyPop <- setDSD(data_countyPop, data_countyPop_STR)

data_countyPop <- data_countyPop %>% 
  as.data.frame(labels = T)

# It is always useful to check the result!!
glimpse(data_countyPop)

Relying on 3rd party services is always risky. Sometimes the database is down. In that case you can download the data from here:

http://aasa.ut.ee/Rspatial/data/data_countyPop.csv

Next import the data to R (Environment tab => Import Dataset => From Text (readr))

Success?
Looks like everything we need is there. But are we satisfied with the format?
You may face several problems:

  • Are all the variable types in relevant format (text, integer, etc)?
  • We can’t map it directly. It’s just a plain table.
  • We also have to figure out how to present/visualize the change: a linear trend? the difference between the first and last year? …

Let’s take the easy path: we want to show the difference between the first and last year.
For that we have to select the relevant variables (columns), convert the year from character to integer, and filter the relevant years.

# select relevant columns:
data_countyPop <- data_countyPop %>% 
  dplyr::select(DIM2, DIM2_label.en, obsTime, obsValue)

# convert character to integer if needed:
#data_countyPop <- data_countyPop %>% 
#  mutate(obsTime = as.integer(obsTime))

# query the first and last year in dataset:
data_countyPop <- data_countyPop %>% 
  filter(obsTime == min(obsTime) | 
           obsTime == max(obsTime))

library(knitr) # controlling the output

data_countyPop %>% 
  head() %>% 
  kable()
 DIM2  DIM2_label.en    obsTime  obsValue
   37  Harju county        1990    607158
   37  Harju county        2016    576265
   39  Hiiu county         1990     11332
   39  Hiiu county         2016      9348
   44  Ida-Viru county     1990    221807
   44  Ida-Viru county     2016    146506

3.1.3 Data wrangling (working with package dplyr)

Introduction to dplyr.

3.1.4 What is piping and how to use it with dplyr?

According to Sean C. Anderson: the dplyr R package is awesome. Pipes from the magrittr R package are awesome. Put the two together and you have one of the most exciting things to happen to R in a long time.
dplyr is Hadley Wickham’s re-imagined plyr package (with underlying C++ secret sauce co-written by Romain Francois). plyr 2.0, if you will. It does less than plyr, but what it does it does more elegantly and much more quickly.
dplyr is built around 5 verbs. These verbs make up the majority of the data manipulation you tend to do. You might need to:

  • Select certain columns of data.
  • Filter your data to select specific rows.
  • Arrange the rows of your data into an order.
  • Mutate your data frame to contain new columns.
  • Summarise chunks of your data in some way.

Please read & test the examples: Introduction to dplyr and pipes
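The five verbs and the pipe can be sketched together in one chain. A minimal, self-contained example; the table and its values are made up purely for illustration:

```r
library(dplyr)

# toy population table (hypothetical values):
pop <- tibble(county = c("Harju", "Hiiu", "Tartu", "Harju"),
              year   = c(1990, 1990, 1990, 2016),
              value  = c(607158, 11332, 153773, 576265))

pop %>%
  select(county, year, value) %>%             # pick columns
  filter(year == 1990) %>%                    # pick rows
  mutate(value_thousands = value / 1000) %>%  # add a new column
  arrange(desc(value)) %>%                    # order the rows
  summarise(total_1990 = sum(value))          # collapse to a summary
# -> total_1990 = 772263
```

Each verb takes a data frame as its first argument and returns a data frame, which is exactly what makes the `%>%` pipe so natural here.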

4 Visualizations

Undoubtedly, maps are just another type of visualization. As with maps, for every visualization we have to define the coordinates (XY axes) and attributes (plot type, colors, labels, shapes etc).

According to Wikipedia: Visualization is any technique for creating images, diagrams, or animations to communicate a message. Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity. Examples from history include cave paintings, Egyptian hieroglyphs, Greek geometry, and Leonardo da Vinci’s revolutionary methods of technical drawing for engineering and scientific purposes. Visualization today has ever-expanding applications in science, education, engineering (e.g., product visualization), interactive multimedia, medicine, etc. Typical of a visualization application is the field of computer graphics. The invention of computer graphics may be the most important development in visualization since the invention of central perspective in the Renaissance period. The development of animation also helped advance visualization.

Several remarkably good books about scientific visualization have been published:

The last book on the list has also been influential for building the R library ggplot2: Wickham H. (2010) “A Layered Grammar of Graphics”

The author of this short tutorial is not trying to compete with any of the previously mentioned books. The aim is to go through some of the important aspects of scientific visualization and demonstrate the process of creating them.
More specifically, we focus on the R libraries tidyverse, sf and tmap.

For many R users the capabilities of base R plotting are enough link.

plot(data_table$var1, data_table$var2, main = "Title", ylab="value y", type = "l", col="red")
points(data_table$var1, data_table$var2, col="blue")

In many cases the basic plotting engine is not enough or is too clumsy (far from intuitive). Some other plotting libraries are available as well: lattice, ggplot2. During this course we are going to use mainly the ggplot2 library. But today we also live in the era of the so-called sf revolution and have access to a (new) library for thematic mapping: tmap.

The ggplot2 website says: “ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.”

A ggplot object must have:

  • data: You can’t create a plot without data (must be stored as an R data frame);
  • coordinate system: describes 2-D space that data is projected onto;
  • geoms: describe type of geometric objects that represent data (points, lines, polygons);
  • aesthetics: describe visual characteristics that represent data (position, size, color, shape, transparency, fill);
  • scales: for each aesthetic, describe how visual characteristic is converted to display values (log scales, color scales, size scales, shape scales)
  • stats : describe statistical transformations that typically summarize data (counts, means, medians, regression lines);
  • facets: describe how data is split into subsets and displayed as multiple small graphs.
ggplot()+
  theme_excel()+
  geom_path(data = data_table, aes(x = var1, y = var2), colour = "blue", size = 0.25)+
  geom_point(data = data_table, aes(x=var1, y=var2), fill="pink", colour="red", shape=21, size=3)+
  labs(title="simple example", x="header, x", y="header, y")

ggplot colours are defined, as in HTML/CSS, using hexadecimal values (00 to FF). Many colours are also available by human-readable names link.
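Base R can translate between the two notations, which is handy for checking a colour. A small base-R sketch:

```r
# build a hex code from red/green/blue intensities (0..1):
rgb(1, 0, 0)              # "#FF0000", i.e. pure red

# decode a named colour back into its red/green/blue components (0..255):
col2rgb("red")

# list a few of the built-in human-readable colour names:
grep("blue", colors(), value = TRUE)[1:5]
```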

4.1 Know your data!

Let’s assume that you somehow received a dataset from a partner. According to them, it is some base map data. Unfortunately no additional metadata or documentation was available. Can you figure out what dataset it is and whether it is usable?

Download the data and check the structure:

unknownData <- read.csv2("http://aasa.ut.ee/Rspatial/data/unknownData.csv")
glimpse(unknownData)
## Rows: 12,748
## Columns: 7
## $ variable1 <dbl> 453124.8, 462438.8, 464891.5, 464248.4, 460592.8, 459549.8, …
## $ variable2 <dbl> 6505228, 6500818, 6492784, 6485559, 6490631, 6488152, 648984…
## $ variable3 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ variable4 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ variable5 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ variable6 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ variable7 <chr> "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.1…
unknownData %>% 
  head() %>% 
  kable() #kable() will format the table output into "nicer" form (useful for R Markdown output)
 variable1  variable2  variable3  variable4  variable5  variable6  variable7
  453124.8    6505228          1      FALSE          1          1        1.1
  462438.8    6500818          2      FALSE          1          1        1.1
  464891.5    6492784          3      FALSE          1          1        1.1
  464248.4    6485559          4      FALSE          1          1        1.1
  460592.8    6490631          5      FALSE          1          1        1.1
  459549.8    6488152          6      FALSE          1          1        1.1

Calculate descriptive statistics for variable 1:

unknownData_descr <- unknownData %>% 
  summarise(n = n(), 
            min = min(variable1), 
            mean = mean(variable1), 
            max = max(variable1), 
            sd = sd(variable1))

kable(unknownData_descr)
     n       min      mean       max        sd
 12748  369046.6  476038.1  739152.4  82765.54

Or you can calculate descriptive statistics for multiple columns:

unknownData %>% 
  summarise_all(mean)
## Warning in mean.default(variable7): argument is not numeric or logical:
## returning NA
##   variable1 variable2 variable3    variable4 variable5 variable6 variable7
## 1  476038.1   6498252  1486.837 0.0007844368  276.5258  5.334327        NA

You can also apply several functions at the same time:

unknownData %>% 
  summarise_all(funs(mean, min, max))

The warning says that these functions can’t be applied to non-numeric columns. We have to select the relevant columns:

unknownData %>% 
  dplyr::select(variable1, variable2) %>%  
  summarise_all(funs(mean, min, max))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
##   variable1_mean variable2_mean variable1_min variable2_min variable1_max
## 1       476038.1        6498252      369046.6       6377146      739152.4
##   variable2_max
## 1       6633918
# or:
unknownData %>% 
  dplyr::select(variable1 : variable6) %>%  
  summarise_all(funs(mean, min, max)) %>% 
  kable()
           mean        min        max
variable1  476038.1    369046.6   739152.4
variable2  6498252     6377146    6633918
variable3  1486.837    1          5111
variable4  0.0007844   0          1
variable5  276.5258    1          988
variable6  5.334327    1          15

Visualization is one of the easiest ways to get familiar with the data. Let’s look at different possibilities. First we draw a histogram:

ggplot()+
  geom_histogram(data = unknownData, aes(x= variable1), bins = 50)

The histogram helps to get an overview of the variability of the variable. Currently we don’t see too much…

Maybe it’s some time series? We can try line plot:

ggplot()+
  geom_line(data = unknownData, aes(x = variable1,  y = variable2), colour = "blue")

Ok! It looks like the contour of Estonia. But we definitely can’t use it in this form. We should look for other ways. Maybe a scatter plot helps?

ggplot()+
  geom_point(data = unknownData, aes(x = variable1,  y = variable2), colour = "blue", size = 0.5)

Even better! An educated eye probably recognizes the borders of Estonian counties. To plot borders we should use lines. For that, ggplot2 offers the possibility to plot a path:

ggplot()+
  geom_path(data = unknownData, aes(x = variable1,  y = variable2), colour = "blue", size = 0.5)

Still problems… We have to define “when the pencil should be lifted”, i.e. when a new county starts:

ggplot()+
  geom_path(data = unknownData, aes(x = variable1,  y = variable2, group  = variable7), colour = "blue", size=0.5)

Yes! But instead of an empty contour we can also fill every polygon (path group) with a different colour:

ggplot()+
  geom_polygon(data = unknownData, aes(x = variable1,  y = variable2, group  = variable7, fill = as.factor(variable6)), colour = "white", size=0.5)+
  scale_fill_viridis_d()+
  coord_equal()

This looks almost like a map, but needs improvement! Let’s put this on hold for now and look at the next step.

4.2 Visualize population in Estonian counties

ggplot()+
  theme_minimal()+
  geom_col(data = data_countyPop, aes(x = DIM2_label.en, y = obsValue))

Too big numbers?! The columns are drawn on top of each other.

ggplot()+
  theme_minimal()+
  geom_point(data = data_countyPop, aes(x = DIM2_label.en, y=obsValue, colour = obsTime))

We can see some differences, but the initial result is definitely ugly. We could rotate the axes, define better colours etc.

ggplot()+
  theme_minimal()+
  geom_point(data = data_countyPop, aes(y = DIM2_label.en, x=obsValue, colour = as.factor(obsTime)))

Repeating “county” in every label is redundant, so delete it. The next operation doesn’t delete “county” but replaces it with nothing (an empty string).

data_countyPop <- data_countyPop %>% 
  mutate(DIM2_label.en = gsub(" county", "", DIM2_label.en))

data_countyPop
## # A tibble: 30 × 4
##     DIM2 DIM2_label.en obsTime obsValue
##    <dbl> <chr>           <dbl>    <dbl>
##  1    37 Harju            1990   607158
##  2    37 Harju            2016   576265
##  3    39 Hiiu             1990    11332
##  4    39 Hiiu             2016     9348
##  5    44 Ida-Viru         1990   221807
##  6    44 Ida-Viru         2016   146506
##  7    49 Jõgeva           1990    42607
##  8    49 Jõgeva           2016    31298
##  9    51 Järva            1990    43715
## 10    51 Järva            2016    30709
## # … with 20 more rows

We should also sort the counties by population size rather than by name. Sorting inside the table doesn’t help here; the sorting has to happen in the plot:

ggplot()+
  theme_minimal()+
  geom_point(data = data_countyPop, aes(y = reorder(DIM2_label.en, obsValue), x=obsValue, colour = as.factor(obsTime)))+
  scale_colour_manual(values = c("orange", "dodgerblue"))+
  labs(colour = "year", 
       y="county", 
       x="population size", 
       title = "Population change in Estonian counties", 
       subtitle = "difference between 1990 and 2016", 
       caption = "data: Statistics Estonia \nvisual: A. Aasa")

Much better!! A bit more polishing:

ggplot()+
  theme_minimal()+
  geom_point(data = data_countyPop, aes(y = reorder(DIM2_label.en, obsValue), 
                                        x=obsValue, 
                                        colour = as.factor(obsTime)))+
  scale_colour_manual(values = c("orange", "dodgerblue"))+
  labs(colour = "year", 
       y="county", 
       x="population size", 
       title = "Population change in Estonian counties", 
       subtitle = "difference between 1990 and 2016", 
       caption = "data: Statistics Estonia \nvisual: A. Aasa")+
  scale_x_log10()

The result looks “almost” right, but it’s still hard to understand the volumes and changes. We should calculate the difference between the years. For that we have to put the different years next to each other (one column per year):

data_countyPop_ii <- data_countyPop %>% 
  spread(obsTime, obsValue)

data_countyPop_ii <- data_countyPop_ii %>% 
  mutate(diff_abs = `2016` - `1990`, 
         diff_pr  = ((`2016` / `1990`)-1)*100)

One possible plot to visualize the population change:

ggplot()+
  theme_minimal()+
  geom_segment(data = data_countyPop_ii, aes(x=`1990`, xend = `2016`, y=reorder(DIM2_label.en, `1990`), yend = reorder(DIM2_label.en,`1990`)), col = "red", arrow = arrow(length= unit(0.02, "npc")))+
  geom_text(data = data_countyPop_ii, aes(x=`1990` *1.2, y=reorder(DIM2_label.en, `1990`), label = paste0(round(diff_pr, 1), "%")), col ="red")+
  scale_x_log10()+
  labs(x="population size", y="counties", title = "title")

5 Spatial data

You can create spatial objects yourself, but you can also import them (e.g. .shp files). The most important requirement is the availability of spatial information (coordinates, an address). There are many ways to import a spatial object into R. We focus on the simple features class.

A simple feature (sf) is basically a data.frame where the geometry of each object is stored in a special column (geometry).
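The idea can be illustrated with plain base R. This is not a real sf object, just a sketch of the data.frame-with-a-geometry-column structure (the points are made up); real sf objects are created with st_read() or st_as_sf():

```r
# each row is one feature; attributes are ordinary columns,
# while the geometry lives in a list column:
features <- data.frame(name = c("A", "B"))
features$geometry <- list(c(x = 658888, y = 6473549),  # made-up point
                          c(x = 660000, y = 6470000))  # made-up point

features$name            # ordinary attribute column
features$geometry[[1]]   # geometry of the first feature
```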

5.1 Create first real map in R

Our aim is to download spatial layer of Estonian counties and draw general overview map. Work steps:

  1. download the file,
  2. unzip it,
  3. look for the files containing “shp” in their name
  4. get the name of file you need
  5. import shp-file to R
  6. plot spatial polygons
  7. polish the map (add titles, map scale, north arrow etc.),
  8. check the structure of imported spatial object
  9. check the coordinate system (CRS) of imported layer (Estonian official CRS = 3301; WGS84 = 4326)

First step: load the libraries that expand the basic functionality of R with extensions. Then download the zipped geolayer directly from the Land Board webpage.

download.file("https://geoportaal.maaamet.ee/docs/haldus_asustus/maakond_shp.zip", destfile="maakond_shp.zip")
#maakond means county!
unzip("maakond_shp.zip")

Second step. Import geolayer to R. Plot it.

# The geolayer is regularly updated and the name contains a date. We have to check the correct name. The next command lists all the files in the working directory whose names contain ".shp":
list.files(pattern = ".shp")
##  [1] "asustusyksus_20211001.shp" "asustusyksus_20211101.shp"
##  [3] "eestimaa_wgs84.shp"        "eestimaa_wgs84.shp.xml"   
##  [5] "gps_us.shp"                "gps_us_monterey.shp"      
##  [7] "maakond_20210901.shp"      "maakond_20211101.shp"     
##  [9] "maakond_20221001.shp"      "maakond_shp.zip"          
## [11] "omavalitsus_20211001.shp"  "omavalitsus_20221101.shp" 
## [13] "omavalitsus_shp.zip"       "population_2017.shp"      
## [15] "trt_cont.shp"
# import shp as simple feature:
county <- st_read("maakond_20210901.shp") # Date in file name is changing, because files are updated monthly! 
## Reading layer `maakond_20210901' from data source 
##   `C:\ANTO\loengud\geopythonR\rspatial_Git\rspatial\maakond_20210901.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 369032.1 ymin: 6377141 xmax: 739152.8 ymax: 6634019
## Projected CRS: Estonian Coordinate System of 1997
# plot the layer with ggplot2:
ggplot()+
  geom_sf(data = county)

# plot with tmap:
tm_shape(county)+
  tm_polygons()

Voila!
R is very flexible, which means every map element can be changed. By default all ggplot2 visualizations are designed with a grey background and black axes. But many other layouts are available (load the library ggthemes for additional layouts):

time <- Sys.time()
# ggplot:
ggplot()+
  theme_dark()+
  geom_sf(data = county, aes(fill=MNIMI))+
  geom_sf_text(data = county, aes(label = MNIMI))

# time to draw the map:
Sys.time() - time
## Time difference of 2.656362 secs
# 'maakond' (county) is redundant. Delete it:
county$MNIMI <- str_replace(county$MNIMI, " maakond", "") # we replace 'maakond' with nothing

time <- Sys.time()
# tmap:
tm_shape(county)+
  tm_polygons("orange")+
  tm_text("MNIMI", size = 0.5)

Sys.time() - time
## Time difference of 2.141902 secs

Compared with traditional GIS software it feels slow… That’s a problem of ggplot2 (tmap should be faster). The current county borders are quite detailed, which means that the geometry part of the geolayer is big as well. We can simplify the layer. In cartography this procedure is called generalization.

county_smpl <- st_simplify(county, preserveTopology = T, dTolerance = 200)

object.size(county)
## 17272488 bytes
object.size(county_smpl)
## 1076240 bytes

5.1.1 Coordinate reference system (CRS)

For different reasons geographers do not always use geographical coordinates. In fact, very many different coordinate reference systems (CRS) are available. Geographical coordinates are often stored in the World Geodetic System (WGS84; epsg:4326). The official CRS of Estonia is L-EST97, epsg:3301. In L-EST97 the coordinates of Vanemuise 46, Tartu are X = 6473548.92, Y = 658887.97; the geographical coordinates are latitude = 58.373474, longitude = 26.716185. This means we have to be very careful when dealing with spatial objects and pay attention to the CRS of the spatial data. All analyzed spatial datasets should be in the same CRS!
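The two coordinate pairs above should describe the same point, which we can check with a small sf sketch (assumes the sf package is installed; note that st_coordinates() reports the easting first, while the Estonian convention labels the northing as X):

```r
library(sf)

# Vanemuise 46, Tartu in WGS84 (longitude, latitude order!):
p_wgs <- st_sfc(st_point(c(26.716185, 58.373474)), crs = 4326)

# transform to the Estonian official CRS L-EST97 (EPSG:3301):
p_lest <- st_transform(p_wgs, crs = 3301)

st_coordinates(p_lest)  # easting ~ 658888, northing ~ 6473549
```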

5.1.1.1 Check current CRS or define it

st_crs(county)
## Coordinate Reference System:
##   User input: Estonian Coordinate System of 1997 
##   wkt:
## PROJCRS["Estonian Coordinate System of 1997",
##     BASEGEOGCRS["EST97",
##         DATUM["Estonia 1997",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["Degree",0.0174532925199433]]],
##     CONVERSION["unnamed",
##         METHOD["Lambert Conic Conformal (2SP)",
##             ID["EPSG",9802]],
##         PARAMETER["Latitude of false origin",57.5175539305556,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8821]],
##         PARAMETER["Longitude of false origin",24,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8822]],
##         PARAMETER["Latitude of 1st standard parallel",58,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8823]],
##         PARAMETER["Latitude of 2nd standard parallel",59.3333333333333,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8824]],
##         PARAMETER["Easting at false origin",500000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8826]],
##         PARAMETER["Northing at false origin",6375000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8827]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     ID["EPSG",3301]]

Changing the CRS is called transformation. For example, the CRS with “traditional” coordinates is WGS84, whose EPSG code is 4326. If we know the EPSG code, the change is very easy:

county_4326 <- st_transform(county_smpl, crs = 4326)

time <- Sys.time()

ggplot()+
  theme_map()+
  geom_sf(data = county_4326, fill="springgreen", colour = "grey30", size=0.5)

Sys.time() - time
## Time difference of 0.247957 secs

Combine the counties into one polygon (the contour of Estonia):

time <- Sys.time()
county_cmb <- st_union(county)
Sys.time()- time
## Time difference of 4.585593 secs
plot(county_cmb)

county_cmb_smpl <- st_simplify(county_cmb, dTolerance = 200)

Calculate a buffer around the Estonian contour:

county_cmb_buf_1000 <- st_buffer(county_cmb_smpl, dist = 1000) # distance is in meters (it depends on the unit of the coordinate reference system, currently EPSG:3301 (Estonian official CRS))
class(county_cmb_buf_1000) # sometimes the result is not in the same class.
## [1] "sfc_MULTIPOLYGON" "sfc"
county_cmb_buf_1000 <- st_sf(county_cmb_buf_1000) # convert it to Simple Feature object

county_cmb_buf_1000 <- county_cmb_buf_1000 %>% 
  mutate(dist = 1000) # add the buffer size also to the attributes table

county_cmb_buf_1000 <- county_cmb_buf_1000 %>% 
  rename(geom = county_cmb_buf_1000) # rename the column to *geom*. "geom" is usually the column where object geometry is stored

class(county_cmb_buf_1000) # check the object class again
## [1] "sf"         "data.frame"
# plot it:
ggplot()+
  geom_sf(data = county_cmb_buf_1000)+
  geom_sf(data = county_cmb_smpl)

Without any deeper introduction we now create our first for loop. We are going to calculate 10 different buffers around the Estonian contour. But instead of doing it manually we write a loop that does it automatically (read more: link):

county_cmb_buf_1000_ii <- county_cmb_buf_1000 # back up the original data 

for (i in 1:10) { #beginning of for-next loop where we fix how many times the loop will run (currently from 1 to 10)
  
  vahe_cnt <- st_buffer(county_cmb_smpl, dist = i * 2000) # calculate buffer
  
  vahe_cnt <- st_sf(vahe_cnt) # convert result to sf
  
  vahe_cnt <- vahe_cnt %>% 
    mutate(dist = i * 2000) # store the buffer size in attributes table
  
  vahe_cnt <- vahe_cnt %>% 
    rename(geom = vahe_cnt) # rename the geom column
  
  county_cmb_buf_1000_ii <- rbind(county_cmb_buf_1000_ii, vahe_cnt) # bind the result to master table
  
}

# check the result:  
glimpse(county_cmb_buf_1000_ii)
## Rows: 11
## Columns: 2
## $ dist <dbl> 1000, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 18000, ~
## $ geom <MULTIPOLYGON [m]> MULTIPOLYGON (((450545.2 65..., MULTIPOLYGON (((3670~

Plot the results of buffering:

county_cmb_buf_1000_ii <- county_cmb_buf_1000_ii %>% filter(dist > 1000) # remove the smallest buffer

ggplot()+
  theme(panel.background = element_rect(fill = "lightskyblue1"))+
  geom_sf(data = county_cmb_smpl, fill = "grey95", colour = "dodgerblue2", size = 0.25)+
  geom_sf(data = county_cmb_buf_1000_ii, aes(colour = dist, size = dist), fill =NA)+
  scale_colour_gradient(high ="lightskyblue2", low = "dodgerblue1")+
  scale_size(range = c(0.5, 0.01))+
  guides(size = "none", colour = "none")

Create 20 km buffer zone around Estonian coast. For that we calculate buffer and then calculate the difference between original layer (county_cmb_smpl) and buffer (buff_20):

buff_20 <- st_buffer(county_cmb_smpl, dist = 20000)

buff_20 <- st_difference(buff_20, county_cmb_smpl)

ggplot()+
  geom_sf(data = buff_20, fill="lightblue", colour ="dodgerblue1", size= 0.2)

Paint different counties with different colour:

# what colour palettes are available?
display.brewer.all()

# choose Spectral palette:
ggplot(data = county_smpl)+
  geom_sf(aes(fill = MNIMI))+
  geom_sf_text(aes(label = MNIMI), colour="grey50")+
  scale_fill_brewer(palette = "Spectral")+
  guides(fill = "none")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Spectral is 11
## Returning the palette you asked for with that many colors

The ‘Spectral’ palette contains only 11 colors. We have to extend it to the needed number (there are 15 counties in Estonia):

# interpolate a set of given colors to create new color palette:
coul <-  colorRampPalette(RColorBrewer::brewer.pal(6, "Spectral"))(15)

ggplot(data = county_smpl)+
  geom_sf(aes(fill = MNIMI))+
  geom_sf_text(aes(label = MNIMI), colour="grey50")+
  scale_fill_manual(values = coul)+
  guides(fill = "none")
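colorRampPalette() can be confusing at first: it does not return colours, it returns a *function*, which you then call with the number of colours you want. A minimal base-R sketch (no RColorBrewer needed):

```r
# colorRampPalette() returns a function; calling it yields the colours
ramp <- colorRampPalette(c("red", "yellow", "green"))
pal  <- ramp(15)

length(pal) # 15 interpolated hex colours
pal[1]      # "#FF0000" -- the ramp starts exactly at the first input colour
```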

Try to make an old-style map where the coastline is “beautifully” buffered:

ggplot(data = county_smpl)+
  theme(panel.background = element_rect(fill = "dodgerblue"),
        plot.background = element_rect(fill = "dodgerblue"))+
  geom_sf(data = county_cmb_buf_1000_ii, aes(colour = dist, size = dist), fill =NA)+
  geom_sf(aes(fill = MNIMI))+
  geom_sf_text(aes(label = MNIMI), colour="grey50")+
  scale_colour_gradient(high ="skyblue1", low = "navyblue")+
  scale_fill_manual(values = coul)+
  scale_size(range = c(0.5, 0.01))+
  guides(fill = "none", size = "none", colour = "none")

Add map scale and north arrow:

library(ggspatial)

ggplot(data = county_smpl)+
  theme_minimal()+
  theme(axis.title = element_blank())+
  geom_sf(data = county_cmb_buf_1000_ii, aes(colour = dist, size = dist), fill =NA)+
  geom_sf(aes(fill = MNIMI), alpha = 0.5)+
  geom_sf_text(aes(label = MNIMI), colour="grey50")+
  scale_colour_gradient(high ="skyblue1", low = "grey50")+
  scale_fill_manual(values = coul)+
  scale_size(range = c(0.5, 0.01))+
  guides(fill = "none", size = "none", colour = "none")+
  annotation_scale(location = "bl", style = "ticks") +
  annotation_north_arrow(width = unit(0.5, "cm"), location = "tl")+
  labs(title = "Estonian counties", caption = "Maps are fun!")

Colouring counties just like that is not very informative. This kind of thematic map should carry some message. For example, we can make a map that visualizes population size and/or density in the counties. For that we have previously prepared a dataset with population sizes.

glimpse(data_countyPop_ii)
## Rows: 15
## Columns: 6
## $ DIM2          <dbl> 37, 39, 44, 49, 51, 57, 59, 65, 67, 70, 74, 78, 82, 84, …
## $ DIM2_label.en <chr> "Harju", "Hiiu", "Ida-Viru", "Jõgeva", "Järva", "Lääne",…
## $ `1990`        <dbl> 607158, 11332, 221807, 42607, 43715, 33694, 79767, 36186…
## $ `2016`        <dbl> 576265, 9348, 146506, 31298, 30709, 24580, 59467, 28218,…
## $ diff_abs      <dbl> -30893, -1984, -75301, -11309, -13006, -9114, -20300, -7…
## $ diff_pr       <dbl> -5.088132, -17.507942, -33.948883, -26.542587, -29.75180…
glimpse(county_smpl)
## Rows: 15
## Columns: 3
## $ MNIMI    <chr> "Saare", "Viljandi", "Hiiu", "Harju", "Lääne", "Rapla", "Lään…
## $ MKOOD    <chr> "0074", "0084", "0039", "0037", "0056", "0071", "0060", "0045…
## $ geometry <GEOMETRY [m]> MULTIPOLYGON (((455191.3 64..., MULTIPOLYGON (((6210…

We have to join these two datasets. One dataset contains the geographical information (the shapes of the counties) and the other contains the population attributes. Both tables should contain a column where the IDs of the counties are stored; in the current case these are MKOOD and DIM2, respectively. Both columns must be in the same format! Currently the first is character and the second is numeric. We add two leading zeros to DIM2 and store the result as character:

data_countyPop_ii <- data_countyPop_ii %>% mutate(MKOOD = paste0("00", DIM2))
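As a side note: paste0("00", ...) works here because every DIM2 code happens to have two digits. A more general zero-padding approach is sprintf(), which pads to a fixed width regardless of how many digits the number has:

```r
# zero-pad numeric codes to a fixed width of four characters
sprintf("%04d", c(37, 39, 784))
# "0037" "0039" "0784"
```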

Join the tables (there are different join types: link; here we choose a left join):

county_smpl_pop <- left_join(county_smpl, data_countyPop_ii, by = "MKOOD")

Map it:

ggplot()+
  geom_sf(data = county_smpl_pop, aes(fill = diff_abs))

We failed to join the tables… The reason is that when county borders change, the county code changes as well. We can try to arrange (sort) both tables by the county names and then bind them column-wise:
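Before resorting to sorting and binding, it is often worth checking which keys actually failed to match. dplyr's anti_join() returns the rows of the first table that have no match in the second. A minimal sketch on toy tables (the values are illustrative, not the real county data):

```r
library(dplyr)

geo <- tibble(MKOOD = c("0037", "0039", "0045"),
              MNIMI = c("Harju", "Hiiu", "Ida-Viru"))
pop <- tibble(MKOOD = c("0037", "0039", "0044"),
              pop   = c(576265, 9348, 146506))

# rows of geo that find no partner in pop -- here the Ida-Viru code differs
anti_join(geo, pop, by = "MKOOD")
```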

data_countyPop_ii <- data_countyPop_ii %>% arrange(DIM2_label.en)

county_smpl <- county_smpl %>% arrange(MNIMI) 

county_smpl_pop <- bind_cols(county_smpl, data_countyPop_ii)
## New names:
## • `MKOOD` -> `MKOOD...2`
## • `MKOOD` -> `MKOOD...10`
county_smpl_pop <- st_sf(county_smpl_pop)
glimpse(county_smpl_pop)
## Rows: 15
## Columns: 10
## $ MNIMI         <chr> "Harju", "Hiiu", "Ida-Viru", "Jõgeva", "Järva", "Lääne",…
## $ MKOOD...2     <chr> "0037", "0039", "0045", "0050", "0052", "0056", "0060", …
## $ DIM2          <dbl> 37, 39, 44, 49, 51, 57, 59, 65, 67, 70, 74, 78, 82, 84, …
## $ DIM2_label.en <chr> "Harju", "Hiiu", "Ida-Viru", "Jõgeva", "Järva", "Lääne",…
## $ `1990`        <dbl> 607158, 11332, 221807, 42607, 43715, 33694, 79767, 36186…
## $ `2016`        <dbl> 576265, 9348, 146506, 31298, 30709, 24580, 59467, 28218,…
## $ diff_abs      <dbl> -30893, -1984, -75301, -11309, -13006, -9114, -20300, -7…
## $ diff_pr       <dbl> -5.088132, -17.507942, -33.948883, -26.542587, -29.75180…
## $ MKOOD...10    <chr> "0037", "0039", "0044", "0049", "0051", "0057", "0059", …
## $ geometry      <GEOMETRY [m]> MULTIPOLYGON (((505059.9 65..., MULTIPOLYGON ((…

Try to map it again:

ggplot(data = county_smpl_pop, aes(fill = diff_pr))+
  geom_sf()+
  geom_sf_text(aes(label = diff_pr))

Round the labels:

ggplot(data = county_smpl_pop, aes(fill = diff_pr))+
  geom_sf()+
  geom_sf_text(aes(label = round(diff_pr, 1)))

Change the colour palette:

ggplot(data = county_smpl_pop, aes(fill = diff_pr))+
  theme_minimal()+
  geom_sf()+
  geom_sf_text(aes(label = round(diff_pr, 1)))+
  scale_fill_gradientn(colours = c("red4", "red1", "orange", "lightyellow"))

Final polishing:

ggplot(data = county_smpl_pop, aes(fill = diff_pr))+
  theme_minimal()+
  geom_sf(col = "grey70", size= 0.5)+
  geom_sf_text(aes(label = paste0(round(diff_pr, 1), "%")))+
  scale_fill_gradientn(colours = c("red4", "red1", "orange", "lightyellow"))+
  labs(title = "population change in counties, 1990...2016", subtitle = "data: Statistics Estonia", caption = "author: A. Aasa", fill = "change, %")+
  ggspatial::annotation_scale(location = "bl", style = "ticks") +
  ggspatial::annotation_north_arrow(location = "tl",width = unit(0.5, "cm"))

For ggplot2 it has always been a challenge to reclassify legend colours and values. One option is to transform the variable (log10(), sqrt(), rank(), etc.). Another option is to use classIntervals() from the classInt library. But today we have the option to use the tmap library.
The syntax of tmap is quite similar to ggplot2, but it is more focused on thematic maps. For tmap we first specify the data layer (tm_shape()). On the next line we define how the layer is visualized. All tmap-related functions start with the prefix tm_.

Read more about tmap (link)!
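If you want to reclassify values yourself instead of relying on tmap, base R's cut() assigns each value to an interval defined by a vector of break points (a lightweight alternative to classInt::classIntervals(); the values below are illustrative):

```r
vals   <- c(-5.1, -17.5, -33.9, -26.5, -29.8, -12.0) # e.g. percentage changes
breaks <- c(-40, -30, -20, -10, 0)

# each value is assigned to an interval such as (-40,-30], (-30,-20], ...
classes <- cut(vals, breaks = breaks)
table(classes) # how many values fall in each class
```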

# Quantiles:
tm_shape(county_smpl_pop)+
  tm_polygons("diff_pr", title = "Quantile", palette = "Reds", style = "quantile")

# fisher:
tm_shape(county_smpl_pop)+
  tm_polygons("diff_pr", title = "Fisher", palette = c("darkgreen", "grey70", "orange"), style = "fisher")

# jenks:
tm_shape(county_smpl_pop)+
  tm_polygons("diff_pr", title = "Jenks", palette = "-cividis", style = "jenks")

# You can test other options: "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "sd", "fisher", and "jenks".

What is the best option?
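To see what, for example, the "quantile" style does under the hood, you can compute the break points yourself with base R's quantile() (illustrative values again):

```r
vals <- c(-5, -12, -17, -26, -29, -33)

# four classes need five break points: the 0%, 25%, 50%, 75% and 100% quantiles
quantile(vals, probs = seq(0, 1, length.out = 5))
```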

6 Interactive mode

Compared with ggplot2, tmap has one more advantage: we can use an interactive mode. For that we have to switch it on:

tmap_mode("view") # back to static mode: tmap_mode("plot")
## tmap mode set to interactive viewing

Then plot the map:

tm_shape(county_smpl_pop)+
  tm_polygons("diff_pr", title = "Jenks", palette = "-cividis", style = "jenks", alpha = 0.7)

Congratulations! You managed to finish the first session!
Now proceed to the page of independent tasks and follow the instructions!

Written by: Anto Aasa
Supervisors: Anto Aasa & Lika Zhvania
LTOM.02.041
Last update: 2022-11-06 10:06:00

