R is a freely available tool for data analysis. R is both a language and an environment for statistical computing and graphical representation, and a very powerful tool for data processing, analysis and visualization. By today R has developed into a general-purpose programming language, equally capable of running a correlation analysis and of producing web applications.
In addition to the base program, thousands of extensions are available that add functionality to the core. Extensions are called packages (libraries). Some packages are already bundled with the main software, but most of them have to be installed separately.
The biggest problem with R is its quite steep learning curve. But it pays back very fast!
To go through the current introduction you have to install R, RStudio and RTools on your computer.
In some cases you still need to update the environment variables:
A good description of how to set up the R environment on your computer can be found here: link
Start RStudio
Define working directory:
# set working directory:
setwd("YOUR FOLDER PATH")
# avoid displaying numbers in scientific notation:
options(scipen = 999)
Base R offers a wide variety of options and functions for data processing, analysis and visualization. In addition, we can use extensions called packages.
In R, a package is a collection of R functions, data and compiled code.
The location where the packages are stored is called the library. If
there is a particular functionality that you require, you can download
the package from the appropriate site and it will be stored in your
library. To actually use the package use the command “library(package)”
which makes that package available to you. Then just call the
appropriate package functions etc. link
install.packages("tidyverse")
install.packages("sf")
install.packages("ggmap")
install.packages("tmap")
library(tidyverse) # collection of R packages designed for data science
library(sf) # spatial data as simple features
library(maptools) # spatial data
library(spatstat) # spatial statistics
library(tmap) # thematic maps
library(ggthemes) # additional predefined themes for ggplot2 plots
library(RColorBrewer)
In R it’s all about assignment! We assign some properties to an object, after which the created object may be a value, a table, a plot, a map etc.
For example:
a <- 1 # assign value 1 to a
a # print the content of object "a"
## [1] 1
x <- c(1, 1, 2, 2) # create vector
x
## [1] 1 1 2 2
y <- c(1, 2, 2, 1)
y
## [1] 1 2 2 1
data_table <- data.frame(var1=x, var2=y) # bind vectors to data frame (table)
View the data:
# see the content in console:
data_table
# View in data viewer:
View(data_table)
# edit data:
fix(data_table)
# data structure (column names, types etc)
glimpse(data_table)
# remove (delete) the object from working space:
rm(data_table) # delete object
# recreate the table:
data_table <- tibble(var1 = x, var2 = y) # restore previously deleted object
Before any deeper analysis it is reasonable to describe the data. Very often datasets are too big and scrolling through the raw table is not enough. To get a better understanding of the data we can use descriptive statistics and/or graphs.
> According to Wikipedia: A descriptive statistic
(in the count noun sense) is a summary statistic that quantitatively
describes or summarizes features of a collection of information.
For example, the next commands also demonstrate the main logic of applying functions to variables (function(tableName$columnName)):
max(data_table$var1)
## [1] 2
min(data_table$var1)
## [1] 1
mean(data_table$var1)
## [1] 1.5
sd(data_table$var1) # standard deviation
## [1] 0.5773503
length(data_table$var1)
## [1] 4
nrow(data_table) # nr of rows in data frame
## [1] 4
nchar("Hello, world!") # number of characters in string
## [1] 13
hw <- "Hello, world!"
nchar(hw)
## [1] 13
We are not living in an ideal world, which means that the data we get is often in awful condition. One of the biggest problems is the good will of an incompetent data collector and/or processor: a lot of effort is put into formatting the data into a “beautiful” shape and form. In MS Excel, for example, merged cells, subtotals, colors, bold/italic, comments and mixed data types are the favorites. Usually it takes a lot of time to get rid of all this useless formatting and decoration.
Here is an example of data where dates are stored as text (in one column it’s not just text, but Roman numerals!):
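As a small sketch with made-up values (not from the dataset above), dates stored as text in day.month.year format can be converted into proper Date objects; a Roman-numeral month would first have to be replaced by its number:

```r
# hypothetical text dates in day.month.year format:
dates_txt <- c("12.03.2015", "03.12.2015")
# convert the text into Date class:
as.Date(dates_txt, format = "%d.%m.%Y")
# a Roman-numeral month ("12.III.2015") needs a replacement step first:
sub("III", "03", "12.III.2015")
```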
But for now we can assume that datasets are prepared by a specialist and we can import the data without major problems.
In our first example we plan to analyse how the population size has
changed in counties between 1990 and 2016.
For that we have to import the data:
# import:
data_countyPop_csv <- read.csv("http://aasa.ut.ee/Rspatial/data/county_population_RV022.csv", encoding = "UTF-8")
# structure of imported dataset:
glimpse(data_countyPop_csv)
## Rows: 30
## Columns: 11
## $ DIM2 <int> 37, 37, 39, 39, 44, 44, 49, 49, 51, 51, 57, 57, 59, 59, 65,…
## $ County <chr> "Harju county", "Harju county", "Hiiu county", "Hiiu county…
## $ DIM3 <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "K",…
## $ Sex <chr> "Males and females", "Males and females", "Males and female…
## $ DIM4 <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999,…
## $ Age.group <chr> "Age groups total", "Age groups total", "Age groups total",…
## $ TIME <int> 1990, 2016, 1990, 2016, 1990, 2016, 1990, 2016, 1990, 2016,…
## $ Year <int> 1990, 2016, 1990, 2016, 1990, 2016, 1990, 2016, 1990, 2016,…
## $ Value <int> 607158, 576265, 11332, 9348, 221807, 146506, 42607, 31298, …
## $ Flag.Codes <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Flags <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
How many different data types are in this table?
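One way to answer is to ask R for the class of every column. A quick sketch, using the table imported above:

```r
# class of each column:
sapply(data_countyPop_csv, class)
# count the occurrences of each distinct type:
table(sapply(data_countyPop_csv, class))
```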
In the case of repetitive operations, scripting means that many (or even all) steps can be completed automatically. This includes importing data into your project. Let us look at a case where we try to download population sizes by counties in Estonia. The database of Statistics Estonia uses the SDMX format, so we need a package that brings this functionality into R.
Then we have to load it:
library(rsdmx)
# address of data
data_URL <- "http://andmebaas.stat.ee/restsdmx/sdmx.ashx/GetData/RV022/37+39+44+49+51+57+59+65+67+70+74+78+82+84+86.K.999/all?startTime=1990&endTime=2016"
# address of data structure
data_structure_URL <- "http://andmebaas.stat.ee/restsdmx/sdmx.ashx/GetDataStructure/RV022"
# import data
data_countyPop <- readSDMX(data_URL, dsd = T)
# get the structure for the data table:
data_countyPop_STR <- readSDMX(data_structure_URL)
# final format of table:
data_countyPop <- setDSD(data_countyPop, data_countyPop_STR)
data_countyPop <- data_countyPop %>%
as.data.frame(labels = T)
# It is always useful to check the result!!
glimpse(data_countyPop)
Relying on 3rd party services is always risky. Sometimes the database
is down. In that case you can download the data from here:
http://aasa.ut.ee/Rspatial/data/data_countyPop.csv
Next import the data to R (Environment tab => Import Dataset => From Text (readr))
Success?
Looks like everything we need is there. But are we satisfied with the
format?
You may face several problems:
Let’s take the easy path: we want to show the difference between the first and the last year.
For that we have to select the relevant variables (columns), convert the year from character to integer and filter the relevant years:
# select relevant columns:
data_countyPop <- data_countyPop %>%
dplyr::select(DIM2, DIM2_label.en, obsTime, obsValue)
# convert character to integer if needed:
#data_countyPop <- data_countyPop %>%
# mutate(obsTime = as.integer(obsTime))
# query the first and last year in dataset:
data_countyPop <- data_countyPop %>%
filter(obsTime == min(obsTime) |
obsTime == max(obsTime))
library(knitr) # controlling the output
data_countyPop %>%
head() %>%
kable()
| DIM2 | DIM2_label.en | obsTime | obsValue |
|---|---|---|---|
| 37 | Harju county | 1990 | 607158 |
| 37 | Harju county | 2016 | 576265 |
| 39 | Hiiu county | 1990 | 11332 |
| 39 | Hiiu county | 2016 | 9348 |
| 44 | Ida-Viru county | 1990 | 221807 |
| 44 | Ida-Viru county | 2016 | 146506 |
Introduction to dplyr.
According to Sean C. Anderson: the dplyr R package is awesome. Pipes from the magrittr R package are awesome. Put the two together and you have one of the most exciting things to happen to R in a long time.
dplyr is Hadley Wickham’s re-imagined plyr package (with underlying C++ secret sauce co-written by Romain Francois): plyr 2.0, if you will. It does less than plyr, but what it does it does more elegantly and much more quickly.
dplyr is built around 5 verbs. These verbs make up the majority of the data manipulation you tend to do. You might need to:
- Select certain columns of data.
- Filter your data to select specific rows.
- Arrange the rows of your data into an order.
- Mutate your data frame to contain new columns.
- Summarise chunks of your data in some way.
Please read & test the examples: Introduction to dplyr and pipes
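As a minimal sketch (on a made-up toy table, not the county data), the five verbs chained together with the pipe look like this:

```r
library(dplyr)

toy <- tibble(group = c("a", "a", "b"), value = c(1, 2, 3))

toy %>%
  select(group, value) %>%          # Select certain columns
  filter(value > 1) %>%             # Filter specific rows
  arrange(desc(value)) %>%          # Arrange the rows into an order
  mutate(value10 = value * 10) %>%  # Mutate: add a new column
  group_by(group) %>%
  summarise(mean10 = mean(value10)) # Summarise chunks of the data
```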
Undoubtedly, maps are just another type of visualization. As for any visualization, for a map we have to define the coordinates (XY axes) and attributes (plot type, colors, labels, shapes etc).
According to Wikipedia: Visualization is any technique for creating images, diagrams, or animations to communicate a message. Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity. Examples from history include cave paintings, Egyptian hieroglyphs, Greek geometry, and Leonardo da Vinci’s revolutionary methods of technical drawing for engineering and scientific purposes. Visualization today has ever-expanding applications in science, education, engineering (e.g., product visualization), interactive multimedia, medicine, etc. Typical of a visualization application is the field of computer graphics. The invention of computer graphics may be the most important development in visualization since the invention of central perspective in the Renaissance period. The development of animation also helped advance visualization.
Several remarkably good books about scientific visualization have been published:
The last book on the list has also been influential in building the R library ggplot2: Wickham H. (2010) “A Layered Grammar of Graphics”
The author of this short tutorial is not trying to compete with any of the previously mentioned books. The aim is to go through some of the important aspects of scientific visualization and demonstrate the process of creating them.
More specifically, we focus on the R libraries tidyverse, sf and tmap.
For many R users the capabilities of base R plotting are enough link.
plot(data_table$var1, data_table$var2, main = "Title", ylab="value y", type = "l", col="red")
points(data_table$var1, data_table$var2, col="blue")
In many cases the basic plotting engine is not enough or is too clumsy (far from intuitive). Some plotting libraries are available as well: lattice, ggplot2. During the current course we are going to use mainly the ggplot2 library. But today we also live in the era of the so-called sf revolution and have access to a (new) library for thematic mapping: tmap.
ggplot2 website says: “ggplot2 is a system for
declaratively creating graphics, based on The Grammar of Graphics. You provide
the data, tell ggplot2 how to map variables to aesthetics, what
graphical primitives to use, and it takes care of the
details.”
A ggplot object must have:
ggplot()+
theme_excel()+
geom_path(data = data_table, aes(x = var1, y = var2), colour = "blue", size = 0.25)+
geom_point(data = data_table, aes(x=var1, y=var2), fill="pink", colour="red", shape=21, size=3)+
labs(title="simple example", x="header, x", y="header, y")
ggplot colours are defined, like in HTML/CSS, using hexadecimal values (00 to FF). Many colours also have human-readable names link.
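A small sketch of both notations, reusing the data_table object created earlier: the hex code "#FF0000" and the name "red" give the same colour.

```r
ggplot()+
  # larger points coloured by hex code:
  geom_point(data = data_table, aes(x = var1, y = var2), colour = "#FF0000", size = 4)+
  # smaller points coloured by name, drawn on top:
  geom_point(data = data_table, aes(x = var1, y = var2), colour = "red", size = 2)
```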
Let’s assume that you somehow received a dataset from your partner. According to him, it is some base map data. Unfortunately there was no additional metadata or documentation available. Can you figure out what dataset it is and whether it is usable?
Download the data and check the structure:
unknownData <- read.csv2("http://aasa.ut.ee/Rspatial/data/unknownData.csv")
glimpse(unknownData)
## Rows: 12,748
## Columns: 7
## $ variable1 <dbl> 453124.8, 462438.8, 464891.5, 464248.4, 460592.8, 459549.8, …
## $ variable2 <dbl> 6505228, 6500818, 6492784, 6485559, 6490631, 6488152, 648984…
## $ variable3 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ variable4 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ variable5 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ variable6 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ variable7 <chr> "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.1…
unknownData %>%
head() %>%
kable() #kable() will format the table output into "nicer" form (useful for R Markdown output)
| variable1 | variable2 | variable3 | variable4 | variable5 | variable6 | variable7 |
|---|---|---|---|---|---|---|
| 453124.8 | 6505228 | 1 | FALSE | 1 | 1 | 1.1 |
| 462438.8 | 6500818 | 2 | FALSE | 1 | 1 | 1.1 |
| 464891.5 | 6492784 | 3 | FALSE | 1 | 1 | 1.1 |
| 464248.4 | 6485559 | 4 | FALSE | 1 | 1 | 1.1 |
| 460592.8 | 6490631 | 5 | FALSE | 1 | 1 | 1.1 |
| 459549.8 | 6488152 | 6 | FALSE | 1 | 1 | 1.1 |
Calculate descriptive statistics for variable 1:
unknownData_descr <- unknownData %>%
summarise(n = n(),
min = min(variable1),
mean = mean(variable1),
max = max(variable1),
sd = sd(variable1))
kable(unknownData_descr)
| n | min | mean | max | sd |
|---|---|---|---|---|
| 12748 | 369046.6 | 476038.1 | 739152.4 | 82765.54 |
Or you can calculate descriptive statistics for multiple columns:
unknownData %>%
summarise_all(mean)
## Warning in mean.default(variable7): argument is not numeric or logical:
## returning NA
## variable1 variable2 variable3 variable4 variable5 variable6 variable7
## 1 476038.1 6498252 1486.837 0.0007844368 276.5258 5.334327 NA
You can also apply several functions at the same time:
unknownData %>%
summarise_all(funs(mean, min, max))
The warning shows that the function can’t be applied to non-numeric columns. We have to select the relevant columns:
unknownData %>%
dplyr::select(variable1, variable2) %>%
summarise_all(funs(mean, min, max))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## variable1_mean variable2_mean variable1_min variable2_min variable1_max
## 1 476038.1 6498252 369046.6 6377146 739152.4
## variable2_max
## 1 6633918
# or:
unknownData %>%
dplyr::select(variable1 : variable6) %>%
summarise_all(funs(mean, min, max)) %>%
kable()
| variable1_mean | variable2_mean | variable3_mean | variable4_mean | variable5_mean | variable6_mean | variable1_min | variable2_min | variable3_min | variable4_min | variable5_min | variable6_min | variable1_max | variable2_max | variable3_max | variable4_max | variable5_max | variable6_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 476038.1 | 6498252 | 1486.837 | 0.0007844 | 276.5258 | 5.334327 | 369046.6 | 6377146 | 1 | 0 | 1 | 1 | 739152.4 | 6633918 | 5111 | 1 | 988 | 15 |
Visualization is one of the easiest ways to get familiar with the data. Let’s look at different possibilities. First we draw a histogram:
ggplot()+
geom_histogram(data = unknownData, aes(x= variable1), bins = 50)
A histogram helps to get an overview of the variability of a variable. Currently we don’t see too much…
Maybe it’s some time series? We can try a line plot:
ggplot()+
geom_line(data = unknownData, aes(x = variable1, y = variable2), colour = "blue")
Ok! It looks like the contour of Estonia. But we definitely can’t use it in this form. We should search for other ways. Maybe a scatter plot helps?
ggplot()+
geom_point(data = unknownData, aes(x = variable1, y = variable2), colour = "blue", size = 0.5)
Even better! An educated eye probably recognizes the borders of Estonian counties. To plot borders we should use lines. For that, ggplot2 offers the possibility to plot a path:
ggplot()+
geom_path(data = unknownData, aes(x = variable1, y = variable2), colour = "blue", size = 0.5)
Still problems… We have to define “when the pencil should be lifted” when we start a new county:
ggplot()+
geom_path(data = unknownData, aes(x = variable1, y = variable2, group = variable7), colour = "blue", size=0.5)
Yes! But instead of an empty contour we can also fill every polygon (path group) with a different colour:
ggplot()+
geom_polygon(data = unknownData, aes(x = variable1, y = variable2, group = variable7, fill = as.factor(variable6)), colour = "white", size=0.5)+
scale_fill_viridis_d()+
coord_equal()
This looks almost like a map, but it needs improvement! Let’s put this on hold for now and look at the next step.
ggplot()+
theme_minimal()+
geom_col(data = data_countyPop, aes(x = DIM2_label.en, y = obsValue))
Too big numbers?! The columns are stacked on top of each other.
ggplot()+
theme_minimal()+
geom_point(data = data_countyPop, aes(x = DIM2_label.en, y=obsValue, colour = obsTime))
We can see some differences, but the initial result is definitely ugly. We could rotate the axes, define better colours etc.
ggplot()+
theme_minimal()+
geom_point(data = data_countyPop, aes(y = DIM2_label.en, x=obsValue, colour = as.factor(obsTime)))
Repeating “county” is redundant, so delete it. The next operation doesn’t actually delete “county” but replaces it with an empty string.
data_countyPop <- data_countyPop %>%
mutate(DIM2_label.en = gsub(" county", "", DIM2_label.en))
data_countyPop
## # A tibble: 30 × 4
## DIM2 DIM2_label.en obsTime obsValue
## <dbl> <chr> <dbl> <dbl>
## 1 37 Harju 1990 607158
## 2 37 Harju 2016 576265
## 3 39 Hiiu 1990 11332
## 4 39 Hiiu 2016 9348
## 5 44 Ida-Viru 1990 221807
## 6 44 Ida-Viru 2016 146506
## 7 49 Jõgeva 1990 42607
## 8 49 Jõgeva 2016 31298
## 9 51 Järva 1990 43715
## 10 51 Järva 2016 30709
## # … with 20 more rows
We should also sort the counties by population size, not by name. Sorting inside the table doesn’t help here; the sorting has to happen in the plot:
ggplot()+
theme_minimal()+
geom_point(data = data_countyPop, aes(y = reorder(DIM2_label.en, obsValue), x=obsValue, colour = as.factor(obsTime)))+
scale_colour_manual(values = c("orange", "dodgerblue"))+
labs(colour = "year",
y="county",
x="population size",
title = "Population change in Estonian counties",
subtitle = "difference between 1990 and 2016",
caption = "data: Statistics Estonia \nvisual: A. Aasa")
Much better!! A bit more polishing:
ggplot()+
theme_minimal()+
geom_point(data = data_countyPop, aes(y = reorder(DIM2_label.en, obsValue),
x=obsValue,
colour = as.factor(obsTime)))+
scale_colour_manual(values = c("orange", "dodgerblue"))+
labs(colour = "year",
y="county",
x="population size",
title = "Population change in Estonian counties",
subtitle = "difference between 1990 and 2016",
caption = "data: Statistics Estonia \nvisual: A. Aasa")+
scale_x_log10()
The result is “almost” there, but it’s still hard to understand the volumes and changes. We should calculate the difference between the years. For that we have to put the different years next to each other:
data_countyPop_ii <- data_countyPop %>%
spread(obsTime, obsValue)
data_countyPop_ii <- data_countyPop_ii %>%
mutate(diff_abs = `2016` - `1990`,
diff_pr = ((`2016` / `1990`)-1)*100)
One possible plot to visualize the population change:
ggplot()+
theme_minimal()+
geom_segment(data = data_countyPop_ii, aes(x=`1990`, xend = `2016`, y=reorder(DIM2_label.en, `1990`), yend = reorder(DIM2_label.en,`1990`)), col = "red", arrow = arrow(length= unit(0.02, "npc")))+
geom_text(data = data_countyPop_ii, aes(x=`1990` *1.2, y=reorder(DIM2_label.en, `1990`), label = paste0(round(diff_pr, 1), "%")), col ="red")+
scale_x_log10()+
labs(x="population size", y="counties", title = "title")
You can make spatial objects yourself, but you can also import them (e.g. .shp). The most important prerequisite is the availability of spatial information (coordinates, an address). There are many possibilities for importing a spatial object into R. We focus on the simple features class.
A simple feature (sf) is basically a data.frame where the geometry of each object is stored in a special column (geometry).
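A minimal sketch of building a simple feature by hand (the coordinates are made up, roughly in the Estonian L-EST97 system, EPSG:3301):

```r
library(sf)

# plain table with XY coordinates:
pts <- data.frame(name = c("A", "B"),
                  x = c(658888, 660000),
                  y = c(6473549, 6474000))
# convert the table into a simple feature; the XY pairs
# move into the special geometry column:
pts_sf <- st_as_sf(pts, coords = c("x", "y"), crs = 3301)
pts_sf
```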
Our aim is to download the spatial layer of Estonian counties and draw a general overview map. Work steps:
First step. Load the libraries that expand the basic functionality of R with extensions. Download the zipped geolayer directly from the Land Board webpage.
download.file("https://geoportaal.maaamet.ee/docs/haldus_asustus/maakond_shp.zip", destfile="maakond_shp.zip")
#maakond means county!
unzip("maakond_shp.zip")
Second step. Import geolayer to R. Plot it.
# The geolayer is updated regularly and the file name contains a date. We have to check the correct name. The next command lists all files in the working directory whose names contain ".shp"
list.files(pattern = ".shp")
## [1] "asustusyksus_20211001.shp" "asustusyksus_20211101.shp"
## [3] "eestimaa_wgs84.shp" "eestimaa_wgs84.shp.xml"
## [5] "gps_us.shp" "gps_us_monterey.shp"
## [7] "maakond_20210901.shp" "maakond_20211101.shp"
## [9] "maakond_20221001.shp" "maakond_shp.zip"
## [11] "omavalitsus_20211001.shp" "omavalitsus_20221101.shp"
## [13] "omavalitsus_shp.zip" "population_2017.shp"
## [15] "trt_cont.shp"
# import shp as simple feature:
county <- st_read("maakond_20210901.shp") # Date in file name is changing, because files are updated monthly!
## Reading layer `maakond_20210901' from data source
## `C:\ANTO\loengud\geopythonR\rspatial_Git\rspatial\maakond_20210901.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 369032.1 ymin: 6377141 xmax: 739152.8 ymax: 6634019
## Projected CRS: Estonian Coordinate System of 1997
# plot the layer with ggplot2:
ggplot()+
geom_sf(data = county)
# plot with tmap:
tm_shape(county)+
tm_polygons()
Voila!
R is very flexible, which means that every map element can be changed. By default all ggplot2 visualizations are designed with a grey background and black axes. But many other layouts are available (load the library ggthemes for additional layouts):
time <- Sys.time()
# ggplot:
ggplot()+
theme_dark()+
geom_sf(data = county, aes(fill=MNIMI))+
geom_sf_text(data = county, aes(label = MNIMI))
# time to draw the map:
Sys.time() - time
## Time difference of 2.656362 secs
# 'maakond' (county) is redundant. Delete it:
county$MNIMI <- str_replace(county$MNIMI, " maakond", "") # we replace 'maakond' with nothing
time <- Sys.time()
# tmap:
tm_shape(county)+
tm_polygons("orange")+
tm_text("MNIMI", size = 0.5)
Sys.time() - time
## Time difference of 2.141902 secs
Compared with traditional GIS software it feels slow… That’s a problem of ggplot2 (tmap should be faster). The current county borders are quite detailed, which means the geometry part of the geolayer is big as well. We can simplify the layer. In cartography this procedure is called generalization.
county_smpl <- st_simplify(county, preserveTopology = T, dTolerance = 200)
object.size(county)
## 17272488 bytes
object.size(county_smpl)
## 1076240 bytes
For different reasons, geographers do not always use geographical coordinates. There are actually very many different coordinate reference systems (CRS) available. Geographical coordinates are often stored in the World Geodetic System (WGS84; epsg:4326). The official CRS of Estonia is L-EST97, epsg:3301. In L-EST97 the coordinates of Vanemuise 46, Tartu are X = 6473548.92, Y = 658887.97; the geographical coordinates are latitude = 58.373474, longitude = 26.716185. This means we have to be very careful when dealing with spatial objects and pay attention to the CRS of the spatial data. All analyzed spatial datasets should be in the same CRS!
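As a sketch, the Vanemuise 46 example can be checked by transforming a single point between the two systems with sf:

```r
library(sf)

# the point in geographical coordinates (WGS84, EPSG:4326), lon/lat order:
p_wgs <- st_sfc(st_point(c(26.716185, 58.373474)), crs = 4326)
# transform it into the Estonian national CRS L-EST97 (EPSG:3301):
p_lest <- st_transform(p_wgs, crs = 3301)
# easting/northing, which should match the pair given above
# (note that the X/Y labelling convention differs):
st_coordinates(p_lest)
```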
st_crs(county)
## Coordinate Reference System:
## User input: Estonian Coordinate System of 1997
## wkt:
## PROJCRS["Estonian Coordinate System of 1997",
## BASEGEOGCRS["EST97",
## DATUM["Estonia 1997",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["Degree",0.0174532925199433]]],
## CONVERSION["unnamed",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",57.5175539305556,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",24,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",58,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",59.3333333333333,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",6375000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## ID["EPSG",3301]]
Changing the CRS is called transformation. For example, the CRS with “traditional” coordinates is WGS84, whose EPSG code is 4326. If we know the EPSG code, the transformation is very easy:
county_4326 <- st_transform(county_smpl, crs = 4326)
time <- Sys.time()
ggplot()+
theme_map()+
geom_sf(data = county_4326, fill="springgreen", colour = "grey30", size=0.5)
Sys.time() - time
## Time difference of 0.247957 secs
Combine counties to one polygon (contour of Estonia):
time <- Sys.time()
county_cmb <- st_union(county)
Sys.time()- time
## Time difference of 4.585593 secs
plot(county_cmb)
county_cmb_smpl <- st_simplify(county_cmb, dTolerance = 200)
Calculate buffer around Estonian contour:
county_cmb_buf_1000 <- st_buffer(county_cmb_smpl, dist = 1000) # distance is in meters (depends on the general unit of the coordinate reference system, currently EPSG:3301 (Estonian official CRS))
class(county_cmb_buf_1000) # sometimes the result is not in the same class.
## [1] "sfc_MULTIPOLYGON" "sfc"
county_cmb_buf_1000 <- st_sf(county_cmb_buf_1000) # convert it to Simple Feature object
county_cmb_buf_1000 <- county_cmb_buf_1000 %>%
mutate(dist = 1000) # add the buffer size also to the attributes table
county_cmb_buf_1000 <- county_cmb_buf_1000 %>%
rename(geom = county_cmb_buf_1000) # rename the column to *geom*. "geom" is usually the column where object geometry is stored
class(county_cmb_buf_1000) # check the object class again
## [1] "sf" "data.frame"
# plot it:
ggplot()+
geom_sf(data = county_cmb_buf_1000)+
geom_sf(data = county_cmb_smpl)
Without any deeper introduction, we now create our first for loop. We are going to calculate 10 different buffers around the Estonian contour. Instead of doing the work manually, we write a loop to do it automatically (read more: link):
county_cmb_buf_1000_ii <- county_cmb_buf_1000 # back up the original data
for (i in 1:10) { # beginning of the for loop; here we fix how many times the loop will run (currently from 1 to 10)
vahe_cnt <- st_buffer(county_cmb_smpl, dist = i * 2000) # calculate buffer
vahe_cnt <- st_sf(vahe_cnt) # convert result to sf
vahe_cnt <- vahe_cnt %>%
mutate(dist = i * 2000) # store the buffer size in attributes table
vahe_cnt <- vahe_cnt %>%
rename(geom = vahe_cnt) # rename the geom column
county_cmb_buf_1000_ii <- rbind(county_cmb_buf_1000_ii, vahe_cnt) # bind the result to master table
}
# check the result:
glimpse(county_cmb_buf_1000_ii)
## Rows: 11
## Columns: 2
## $ dist <dbl> 1000, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 18000, ~
## $ geom <MULTIPOLYGON [m]> MULTIPOLYGON (((450545.2 65..., MULTIPOLYGON (((3670~
Plot the results of buffering:
county_cmb_buf_1000_ii <- county_cmb_buf_1000_ii %>% filter(dist > 1000) # remove the smallest buffer
ggplot()+
theme(panel.background = element_rect(fill = "lightskyblue1"))+
geom_sf(data = county_cmb_smpl, fill = "grey95", colour = "dodgerblue2", size = 0.25)+
geom_sf(data = county_cmb_buf_1000_ii, aes(colour = dist, size = dist), fill =NA)+
scale_colour_gradient(high ="lightskyblue2", low = "dodgerblue1")+
scale_size(range = c(0.5, 0.01))+
guides(size = F, colour = F)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Create a 20 km buffer zone around the Estonian coast. For that we calculate a buffer and then the difference between the original layer (county_cmb_smpl) and the buffer (buff_20):
buff_20 <- st_buffer(county_cmb_smpl, dist = 20000)
buff_20 <- st_difference(buff_20, county_cmb_smpl)
ggplot()+
geom_sf(data = buff_20, fill="lightblue", colour ="dodgerblue1", size= 0.2)
Paint different counties with different colours:
# what colour palettes are available?
display.brewer.all()
# choose Spectral palette:
ggplot(data = county_smpl)+
geom_sf(aes(fill = MNIMI))+
geom_sf_text(aes(label = MNIMI), colour="grey50")+
scale_fill_brewer(palette = "Spectral")+
guides(fill = F)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Spectral is 11
## Returning the palette you asked for with that many colors
The ‘Spectral’ palette contains only 11 colors. We have to extend it to the needed number (there are 15 counties in Estonia):
# interpolate a set of given colors to create new color palette:
coul <- colorRampPalette(RColorBrewer::brewer.pal(6, "Spectral"))(15)
ggplot(data = county_smpl)+
geom_sf(aes(fill = MNIMI))+
geom_sf_text(aes(label = MNIMI), colour="grey50")+
scale_fill_manual(values = coul)+
guides(fill = F)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Try to make an old-style map where the coastline is “beautifully” buffered:
ggplot(data = county_smpl)+
theme(panel.background = element_rect(fill = "dodgerblue"),
plot.background = element_rect(fill = "dodgerblue"))+
geom_sf(data = county_cmb_buf_1000_ii, aes(colour = dist, size = dist), fill =NA)+
geom_sf(aes(fill = MNIMI))+
geom_sf_text(aes(label = MNIMI), colour="grey50")+
scale_colour_gradient(high ="skyblue1", low = "navyblue")+
scale_fill_manual(values = coul)+
scale_size(range = c(0.5, 0.01))+
guides(fill = F, size=F, colour = F)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Add map scale and north arrow:
library(ggspatial)
ggplot(data = county_smpl)+
theme_minimal()+
theme(axis.title = element_blank())+
geom_sf(data = county_cmb_buf_1000_ii, aes(colour = dist, size = dist), fill =NA)+
geom_sf(aes(fill = MNIMI), alpha = 0.5)+
geom_sf_text(aes(label = MNIMI), colour="grey50")+
scale_colour_gradient(high ="skyblue1", low = "grey50")+
scale_fill_manual(values = coul)+
scale_size(range = c(0.5, 0.01))+
guides(fill = F, size=F, colour = F)+
annotation_scale(location = "bl", style = "ticks") +
annotation_north_arrow(width = unit(0.5, "cm"), location = "tl")+
labs(title = "Estonian counties", caption = "Maps are fun!")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Colouring counties just like that is not very informative. This kind of thematic map should carry some message. For example, we can make a map to visualize population size and/or density in counties. For that we have previously prepared the dataset with population sizes.
glimpse(data_countyPop_ii)
## Rows: 15
## Columns: 6
## $ DIM2 <dbl> 37, 39, 44, 49, 51, 57, 59, 65, 67, 70, 74, 78, 82, 84, …
## $ DIM2_label.en <chr> "Harju", "Hiiu", "Ida-Viru", "Jõgeva", "Järva", "Lääne",…
## $ `1990` <dbl> 607158, 11332, 221807, 42607, 43715, 33694, 79767, 36186…
## $ `2016` <dbl> 576265, 9348, 146506, 31298, 30709, 24580, 59467, 28218,…
## $ diff_abs <dbl> -30893, -1984, -75301, -11309, -13006, -9114, -20300, -7…
## $ diff_pr <dbl> -5.088132, -17.507942, -33.948883, -26.542587, -29.75180…
glimpse(county_smpl)
## Rows: 15
## Columns: 3
## $ MNIMI <chr> "Saare", "Viljandi", "Hiiu", "Harju", "Lääne", "Rapla", "Lään…
## $ MKOOD <chr> "0074", "0084", "0039", "0037", "0056", "0071", "0060", "0045…
## $ geometry <GEOMETRY [m]> MULTIPOLYGON (((455191.3 64..., MULTIPOLYGON (((6210…
We have to join these two datasets. One dataset holds the geographical information (the shapes of the counties) and the other holds the population attributes. Both tables must contain a column storing the county IDs; in our case these are MKOOD and DIM2, respectively. Both columns must have the same format! Currently the first is character and the second is numeric. We add two leading zeros to DIM2 and convert it to character:
data_countyPop_ii <- data_countyPop_ii %>% mutate(MKOOD = paste0("00", DIM2))
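paste0("00", …) works here because every DIM2 code has exactly two digits; for codes of mixed length, zero-padding with sprintf() is more robust (a base-R sketch, not part of the tutorial data):

```r
# zero-pad numeric codes to a fixed width of four characters
codes <- c(37, 39, 5)    # note: one code has only a single digit
sprintf("%04d", codes)   # "0037" "0039" "0005"
```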
Join the tables (there are different join types: link; here we use a left join):
county_smpl_pop <- left_join(county_smpl, data_countyPop_ii, by = "MKOOD")
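Before mapping, it is worth checking whether all keys actually matched; dplyr’s anti_join() returns the rows that found no partner. A toy sketch with made-up codes (not the tutorial data):

```r
library(dplyr)

# hypothetical key tables: one code (0045 vs 0044) disagrees
geo <- tibble(MKOOD = c("0037", "0039", "0045"))
pop <- tibble(MKOOD = c("0037", "0039", "0044"))

# rows of geo with no match in pop -- these get NA attributes after left_join()
anti_join(geo, pop, by = "MKOOD")
```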
Map it:
ggplot()+
geom_sf(data = county_smpl_pop, aes(fill = diff_abs))
The join failed… The reason is that when county borders change, the county codes change as well. We can try to arrange (sort) both tables by county name and then bind them column-wise:
data_countyPop_ii <- data_countyPop_ii %>% arrange(DIM2_label.en)
county_smpl <- county_smpl %>% arrange(MNIMI)
county_smpl_pop <- bind_cols(county_smpl, data_countyPop_ii)
## New names:
## • `MKOOD` -> `MKOOD...2`
## • `MKOOD` -> `MKOOD...10`
county_smpl_pop <- st_sf(county_smpl_pop)
glimpse(county_smpl_pop)
## Rows: 15
## Columns: 10
## $ MNIMI <chr> "Harju", "Hiiu", "Ida-Viru", "Jõgeva", "Järva", "Lääne",…
## $ MKOOD...2 <chr> "0037", "0039", "0045", "0050", "0052", "0056", "0060", …
## $ DIM2 <dbl> 37, 39, 44, 49, 51, 57, 59, 65, 67, 70, 74, 78, 82, 84, …
## $ DIM2_label.en <chr> "Harju", "Hiiu", "Ida-Viru", "Jõgeva", "Järva", "Lääne",…
## $ `1990` <dbl> 607158, 11332, 221807, 42607, 43715, 33694, 79767, 36186…
## $ `2016` <dbl> 576265, 9348, 146506, 31298, 30709, 24580, 59467, 28218,…
## $ diff_abs <dbl> -30893, -1984, -75301, -11309, -13006, -9114, -20300, -7…
## $ diff_pr <dbl> -5.088132, -17.507942, -33.948883, -26.542587, -29.75180…
## $ MKOOD...10 <chr> "0037", "0039", "0044", "0049", "0051", "0057", "0059", …
## $ geometry <GEOMETRY [m]> MULTIPOLYGON (((505059.9 65..., MULTIPOLYGON ((…
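Note that bind_cols() matches rows purely by position, so it is prudent to verify afterwards that the name columns agree. A self-contained toy illustration (made-up rows, not the tutorial data):

```r
library(dplyr)

geo <- tibble(MNIMI = c("Harju", "Hiiu"))
pop <- tibble(DIM2_label.en = c("Harju", "Hiiu"), pop2016 = c(576265, 9348))

combined <- bind_cols(geo, pop)

# positional binding is only safe if both tables were sorted identically
stopifnot(identical(combined$MNIMI, combined$DIM2_label.en))
```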
Try to map it again:
ggplot(data = county_smpl_pop, aes(fill = diff_pr))+
geom_sf()+
geom_sf_text(aes(label = diff_pr))
Round the labels:
ggplot(data = county_smpl_pop, aes(fill = diff_pr))+
geom_sf()+
geom_sf_text(aes(label = round(diff_pr, 1)))
Change the colour palette:
ggplot(data = county_smpl_pop, aes(fill = diff_pr))+
theme_minimal()+
geom_sf()+
geom_sf_text(aes(label = round(diff_pr, 1)))+
scale_fill_gradientn(colours = c("red4", "red1", "orange", "lightyellow"))
Final polishing:
ggplot(data = county_smpl_pop, aes(fill = diff_pr))+
theme_minimal()+
geom_sf(col = "grey70", size= 0.5)+
geom_sf_text(aes(label = paste0(round(diff_pr, 1), "%")))+
scale_fill_gradientn(colours = c("red4", "red1", "orange", "lightyellow"))+
labs(title = "population change in counties, 1990...2016", subtitle = "data: Statistics Estonia", caption = "author: A. Aasa", fill = "change, %")+
ggspatial::annotation_scale(location = "bl", style = "ticks") +
ggspatial::annotation_north_arrow(location = "tl",width = unit(0.5, "cm"))
Re-classing legend colours and values has always been a challenge in ggplot2. One option is to transform the variable (log10(), sqrt(), rank(), etc.). Another option is to use classIntervals() from the classInt package. But today we can also use the tmap package.
The syntax of tmap is quite similar to ggplot2, but more focused on thematic maps. In tmap we first specify the data layer (tm_shape()); on the next line we define how that layer is visualized. All tmap-related functions start with the prefix tm_.
Read more about tmap (link)!
# Quantiles:
tm_shape(county_smpl_pop)+
tm_polygons("diff_pr", title = "Quantile", palette = "Reds", style = "quantile")
# fisher:
tm_shape(county_smpl_pop)+
tm_polygons("diff_pr", title = "Fisher", palette = c("darkgreen", "grey70", "orange"), style = "fisher")
# jenks:
tm_shape(county_smpl_pop)+
tm_polygons("diff_pr", title = "Jenks", palette = "-cividis", style = "jenks")
#You can test other options: "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "sd", "fisher", and "jenks".
What is the best option?
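The classIntervals() route mentioned earlier computes such class breaks directly; styles like "jenks" and "fisher" are available there too. A toy sketch (assumes the classInt package is installed; the values are made up, loosely resembling the county percentages):

```r
library(classInt)

# made-up percentage changes
x <- c(-33.9, -29.8, -26.5, -21.3, -17.5, -12.0, -9.4, -5.1)

# three Jenks classes; $brks holds the n + 1 break points
ci <- classIntervals(x, n = 3, style = "jenks")
ci$brks
```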
Compared with ggplot2, tmap has one more advantage: an interactive mode. We just have to switch it on:
tmap_mode("view") # back to static: tmap_mode("plot")
## tmap mode set to interactive viewing
Then plot the map:
tm_shape(county_smpl_pop)+
tm_polygons("diff_pr", title = "Jenks", palette = "-cividis", style = "jenks", alpha = 0.7)
Congratulations! You managed to finish the first session!
Now proceed to the page of independent tasks and follow the
instructions!
Written by: Anto Aasa
Supervisors: Anto Aasa & Lika Zhvania
LTOM.02.041
Last update: 2022-11-06 10:06:00