2 Why R?

R is a freely available tool for data analysis. R is both a language and an environment for statistical computing and graphical representation, and a very powerful tool for data processing, analysis and visualization. By today R has developed into a general-purpose programming language, suitable for anything from correlation analysis to producing web applications.

In addition to the base program, thousands of extensions are available which add functionality to the core. Extensions are called packages (libraries). Some packages ship with the main software, but most of them have to be installed separately.

The biggest problem with R is its quite steep learning curve. But the effort pays back very fast!

  • R is not just a statistics software, it’s a language.
  • R is free.
  • R is very flexible and powerful.

3 Introduction to R

To work through this introduction you have to install R and RStudio on your computer.

Start RStudio.

3.1 Additional reading & examples

3.2 R & RStudio environment

Define working directory:

# Of course, this should be the folder path on your computer!
setwd("C:/ANTO/loengud/tbilisi/2020/web/")
options(scipen = 999) # avoids displaying numbers in scientific notation

3.3 Installation of packages

Base R offers a wide variety of options and functions for data processing, analysis and visualization. Additionally, we can use extensions called packages.
In R, a package is a collection of R functions, data and compiled code. The location where the packages are stored is called the library. If there is a particular functionality that you require, you can download the package from the appropriate site and it will be stored in your library. To actually use the package, run the command library(package), which makes that package available to you. Then just call the appropriate package functions etc. link

install.packages("name") # replace "name" with the real name of the package!
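The difference between installing and loading can be sketched as follows: install.packages() is needed once per computer, library() once per session. A minimal self-contained sketch using the built-in stats package (so it runs without installing anything; the commented install line is just an illustration):

```r
# Installation is done once per computer (commented out so the sketch runs offline):
# install.packages("ggplot2")

# Loading is done in every new R session:
library(stats)                 # 'stats' ships with base R, so this always works
"package:stats" %in% search()  # TRUE once the package is attached
```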

3.4 Start libraries

library(tidyverse) # collection of R packages designed for data science
library(sf) # spatial data as simple features
library(maptools) # spatial data
library(ggthemes) # additional predefined themes for ggplot2 plots
library(spatstat) # spatial statistics
library(ggrepel) # geoms for ggplot2 to repel overlapping text labels
library(RColorBrewer) # color palettes
library(ggspatial) # map scale bars and north arrows for ggplot2 maps
library(classInt) # univariate class intervals for thematic maps
library(knitr) # report generation; kable() for tables

3.5 Create objects

In R it’s all about assignment! We assign some content to an object; the created object may be a value, a table, a plot, a map etc.
For example:

a <- 1 # assign value 1 to a
a # print the content of object "a"
## [1] 1
x <- c(1, 1, 2, 2) # create vector
x
## [1] 1 1 2 2
y <- c(1, 2, 2, 1)
y
## [1] 1 2 2 1
data_table <-  data.frame(var1=x, var2=y) # bind vectors to data frame (table)

View the data:

# see the content in console:
data_table

# View in data viewer:
View(data_table)

# edit data:
fix(data_table)
# You cannot continue working with the table before the fix() window is closed!


# data structure (column names, types etc.)
glimpse(data_table)

# remove (delete) the object from working space:
rm(data_table) # delete object

# recreate the table:
data_table <-  data.frame(var1 = x, var2 = y) # restore previously deleted object

Before any deeper analysis it is reasonable to describe the data. Very often datasets are too big and scrolling through the raw table is not enough. To get a better understanding of the data we can use descriptive statistics and/or graphs.
> According to Wikipedia: A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features of a collection of information.

For example, the next commands also demonstrate the main logic of applying functions to variables (function(tableName$columnName)):

max(data_table$var1)
## [1] 2
min(data_table$var1)
## [1] 1
mean(data_table$var1)
## [1] 1.5
sd(data_table$var1) # standard deviation
## [1] 0.5773503
length(data_table$var1)
## [1] 4
nrow(data_table) # nr of rows in data frame
## [1] 4
nchar("Hello, world!") # number of characters in string
## [1] 13
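Instead of calling each function separately, base R's summary() returns several of these descriptive statistics at once. A small self-contained sketch, recreating the same data_table as above:

```r
# Recreate the example table:
x <- c(1, 1, 2, 2)
y <- c(1, 2, 2, 1)
data_table <- data.frame(var1 = x, var2 = y)

# One call returns Min., 1st Qu., Median, Mean, 3rd Qu. and Max.:
summary(data_table$var1)
```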

4 Get data

4.1 Data import

We are not living in an ideal world. This means that the data we get is too often in awful condition: one of the biggest problems is the good will of an incompetent data collector and/or processor. They put a lot of effort into formatting the data into a “beautiful” layout in Excel (merged cells, subtotals, colors, bold/italic, comments, mixed data types etc.). Getting rid of all this useless formatting usually takes a lot of time.

Here is an example of data where dates are stored as text (and not just text, but as Roman numerals!):

But here and today we can assume that the datasets are prepared by a specialist and we can import the data without major problems.
In our first example we analyse how the number of tourists from EU countries has changed when comparing February 2020 to February 2019.
For that we have to import the data. The data originates from the webpage of the Georgian National Tourism Administration. Originally the data is in MS Excel format link. In principle, you can import XLSX directly into R (File -> Import Dataset -> From Excel), but in the current case the file is designed in a “beautiful” (not machine-readable) format and therefore it has been converted to CSV.

4.1.1 You can download the data directly from Internet:

GE_travel <- read.csv2("http://aasa.ut.ee/tbilisi/data/GE_travel_data.csv", encoding = "UTF-8", dec = ".")

# check the result:
glimpse(GE_travel)
## Observations: 28
## Variables: 6
## $ No                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ EU_member_countries <fct> Poland, Germany, United Kingdom, Lithuania, Fra...
## $ Feb_2019            <int> 2269, 1925, 1723, 1198, 920, 743, 992, 995, 392...
## $ Feb_2020            <int> 4194, 2656, 2262, 1595, 1402, 1346, 1089, 1065,...
## $ change              <int> 1925, 731, 539, 397, 482, 603, 97, 70, 633, 194...
## $ change_pr           <dbl> 0.84839136, 0.37974026, 0.31282646, 0.33138564,...

You can also import the data manually. Find Import Dataset in the Environment tab. There you can import the data from your local disk or from the Internet, and also define all the parameters of the dataset (variable types etc.).

The imported dataset is not very “beautiful”: instead of spaces between words "_" is used, words are abbreviated, etc.:

No EU_member_countries Feb_2019 Feb_2020 change change_pr
1 Poland 2269 4194 1925 0.8483914
2 Germany 1925 2656 731 0.3797403
3 United Kingdom 1723 2262 539 0.3128265
4 Lithuania 1198 1595 397 0.3313856
5 France 920 1402 482 0.5239130
6 Italy 743 1346 603 0.8115747

For correct data analysis it is always crucial to have a full description of the data we are going to analyse. This means you need metadata, i.e. data about data (a description of the data). In the current case it wasn’t added to the original file, but hopefully we can interpret it correctly.

column name column type original column name
No integer Row number
EU_member_countries factor EU member countries
Feb_2019 integer 2019: February
Feb_2020 integer 2020: February
change integer change
change_pr double change %

The table and variable types look fine! But sometimes the format and/or shape of the imported dataset does not meet the needs of your planned data analysis. Then you have to reshape the data. This process is called data wrangling. In R you can use for that a library called dplyr (it belongs to the collection of packages named tidyverse).

4.1.2 Data wrangling (working with package dplyr)

Introduction to dplyr.

4.1.3 What is piping and how to use it with dplyr?

According to Sean C. Anderson: the dplyr R package is awesome. Pipes from the magrittr R package are awesome. Put the two together and you have one of the most exciting things to happen to R in a long time.
The dplyr is Hadley Wickham’s re-imagined plyr package (with underlying C++ secret sauce co-written by Romain Francois); plyr 2.0, if you will. It does less than plyr, but what it does it does more elegantly and much more quickly.
dplyr is built around 5 verbs. These verbs make up the majority of the data manipulation you tend to do. You might need to:

  • select certain columns of data,
  • filter your data to select specific rows,
  • arrange the rows of your data into an order,
  • mutate your data frame to contain new columns,
  • summarise chunks of your data in some way.

Pipes are an extremely useful tool from the magrittr package1 that allow you to express a sequence of multiple operations. They can greatly simplify your code and make your operations more intuitive link.

Here you can read more about piping and test examples: Introduction to dplyr and pipes. Please do!
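The five verbs and the pipe can be sketched together as follows. This is a minimal example using a small made-up table (not the tutorial dataset), so the snippet is self-contained; it assumes dplyr is installed:

```r
library(dplyr)

# Made-up tourist counts, loosely mimicking the GE_travel structure:
toy <- data.frame(country  = c("Poland", "Germany", "France"),
                  feb_2019 = c(2269, 1925, 920),
                  feb_2020 = c(4194, 2656, 1402))

result <- toy %>%
  select(country, feb_2019, feb_2020) %>%   # select certain columns
  mutate(change = feb_2020 - feb_2019) %>%  # mutate: add a new column
  filter(change > 500) %>%                  # filter: keep specific rows
  arrange(desc(change))                     # arrange the rows into an order

result %>% summarise(mean_change = mean(change))  # summarise the remaining rows
```

Each %>% passes the result of the left-hand side as the first argument of the next verb, so the pipeline reads top to bottom like a recipe.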

5 Visualizations

Undoubtedly, maps are just another type of visualization. As with maps, we have to define the coordinates (XY axes) and attributes (plot type, colors, labels, shapes etc.) for every visualization.

According to Wikipedia: Visualization is any technique for creating images, diagrams, or animations to communicate a message. Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity. Examples from history include cave paintings, Egyptian hieroglyphs, Greek geometry, and Leonardo da Vinci’s revolutionary methods of technical drawing for engineering and scientific purposes. Visualization today has ever-expanding applications in science, education, engineering (e.g., product visualization), interactive multimedia, medicine, etc. Typical of a visualization application is the field of computer graphics. The invention of computer graphics may be the most important development in visualization since the invention of central perspective in the Renaissance period. The development of animation also helped advance visualization.

Several remarkably good books about scientific visualization have been published:

The last book on the list has also been influential for building the R library ggplot2: Wickham H. (2010) “A Layered Grammar of Graphics”.

The author of this short tutorial is not trying to compete with any of the previously mentioned books. The aim is to go through some of the important aspects of scientific visualization and demonstrate the process of creating visualizations.
More specifically we are focusing on R library ggplot2.

For many R users the capabilities of base R plotting are enough ( link ).

plot(data_table$var1, data_table$var2, main = "my plot", ylab="value y", type = "l", col="red")
points(data_table$var1, data_table$var2, col="blue")

In many cases the basic plotting engine is not enough, or is too clumsy and unintuitive. Some dedicated plotting libraries are available as well: lattice, ggplot2. During the current course we are going to use ggplot2.

The ggplot2 website says: “ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.”

A ggplot object must have:

  • data: You can’t create a plot without data (must be stored as an R data frame);
  • coordinate system: describes 2-D space that data is projected onto;
  • geoms: describe type of geometric objects that represent data (points, lines, polygons);
  • aesthetics: describe visual characteristics that represent data (position, size, color, shape, transparency, fill);
  • scales: for each aesthetic, describe how visual characteristic is converted to display values (log scales, color scales, size scales, shape scales)
  • stats : describe statistical transformations that typically summarize data (counts, means, medians, regression lines);
  • facets: describe how data is split into subsets and displayed as multiple small graphs.
ggplot()+
  geom_point(data = data_table, aes(x=var1, y=var2), fill="pink", colour="red", shape=21, size=3)+
  labs(title="simple example", x="header, x", y="header, y")

ggplot colours are defined, like in HTML/CSS, using hexadecimal values (00 to FF). Many colours also have human-readable names link.
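For example, the named colour "red" corresponds to the hex value "#FF0000". Base R's col2rgb() shows this mapping; a small self-contained check:

```r
# A colour name and its hex code refer to the same RGB values:
col2rgb("red")      # red = 255, green = 0, blue = 0
col2rgb("#FF0000")  # identical result
```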

5.1 Know your data!

Let’s assume that you somehow received a dataset from a partner. According to him, it is some base map data. Unfortunately there wasn’t any additional metadata or documentation available. Can you figure out what dataset it is and whether it is usable?

Download the data and check the structure:

unknownData <- read.csv2("http://aasa.ut.ee/tbilisi/data/unknownData.csv")
glimpse(unknownData)
## Observations: 7,219
## Variables: 7
## $ variable1 <dbl> 40.24884, 40.26221, 40.26694, 40.26645, 40.27279, 40.2838...
## $ variable2 <dbl> 43.58498, 43.58565, 43.58235, 43.58100, 43.57732, 43.5782...
## $ order     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ hole      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ piece     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ group     <fct> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1....
unknownData %>% head() %>% kable()
variable1 variable2 order hole piece id group
40.24884 43.58498 1 FALSE 1 1 1.1
40.26221 43.58565 2 FALSE 1 1 1.1
40.26694 43.58235 3 FALSE 1 1 1.1
40.26645 43.58100 4 FALSE 1 1 1.1
40.27279 43.57732 5 FALSE 1 1 1.1
40.28386 43.57823 6 FALSE 1 1 1.1

Calculate descriptive statistics for variable 1:

unknownData_descr <- unknownData %>% 
  summarise(n = n(), 
            min = min(variable1), 
            mean = mean(variable1), 
            max = max(variable1), 
            sd = sd(variable1))

kable(unknownData_descr)
n min mean max sd
7219 40.00901 43.6765 46.72657 1.456807

Or you can calculate descriptive statistics for multiple columns:

unknownData %>% summarise_all(mean)
## Warning in mean.default(group): argument is not numeric or logical: returning NA
##   variable1 variable2    order hole piece       id group
## 1   43.6765  42.11603 316.3905    0     1 6.803158    NA

You can also apply several functions at the same time:

unknownData %>% summarise_all(funs(mean, min, max))

What happened?
The error message warns that you can’t apply these functions to a factor. We have to select the relevant columns:

unknownData %>% 
  select(variable1, variable2) %>%  
  summarise_all(funs(mean, min, max))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
##   variable1_mean variable2_mean variable1_min variable2_min variable1_max
## 1        43.6765       42.11603      40.00901      41.05455      46.72657
##   variable2_max
## 1       43.5865
# or:
unknownData %>% 
  select(variable1 : variable2) %>%  
  summarise_all(funs(mean, min, max)) %>%
  kable()
variable1_mean variable2_mean variable1_min variable2_min variable1_max variable2_max
43.6765 42.11603 40.00901 41.05455 46.72657 43.5865
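The warning above notes that funs() is soft-deprecated; in newer dplyr versions (1.0 and later) the same multi-function summary is written with across(). A sketch with a small made-up table (assumes a recent dplyr):

```r
library(dplyr)

toy <- data.frame(variable1 = c(1, 2, 3),
                  variable2 = c(4, 5, 6))

# across() applies a named list of functions to a range of columns;
# result columns are named variable1_mean, variable1_min, ..., variable2_max:
res <- toy %>%
  summarise(across(variable1:variable2,
                   list(mean = mean, min = min, max = max)))
res
```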

Visualization is one of the easiest ways to get familiar with the data. Let’s look at different possibilities. First we draw a histogram:

ggplot()+
  geom_histogram(data = unknownData, aes(x= variable1), bins = 50)

The histogram helps to get an overview of the variability of the variable. Currently we don’t see too much…

Maybe it’s some time series? We can try line plot:

ggplot()+
  geom_line(data = unknownData, aes(x = variable1,  y = variable2), colour = "blue")

Ok! It looks like the contour of Georgia! But we definitely can’t use it in this form. We should search for other ways. Maybe a scatter plot helps?

ggplot()+
  geom_point(data = unknownData, aes(x = variable1,  y = variable2), colour = "blue", size=0.5)

Even better! An educated eye probably recognizes the borders of the Georgian regions. To plot borders we should use lines. For that ggplot2 offers the possibility to plot a path:

ggplot()+
  geom_path(data = unknownData, aes(x = variable1,  y = variable2), colour = "blue", size=0.5)

Still problems… We have to define when “the pencil should be lifted” as we start a new region. For that we can use group, which defines groups that are drawn separately:

ggplot()+
  geom_path(data = unknownData, aes(x = variable1,  y = variable2, group  = group), colour = "blue", size=0.5)

Yes! But instead of an empty contour we can also draw it as polygons and paint every polygon with a different colour:

ggplot()+
  geom_polygon(data = unknownData, aes(x = variable1,  y = variable2, group  = group, fill = as.factor(group)), colour = "white", size=0.5)+
  scale_fill_viridis_d()+
  coord_equal()

This looks almost like a map, but it needs improvement! Region names are missing, etc. Let’s put this on hold for now and look at the next step.

5.2 Visualize the change of number of tourists in Georgia

Let’s turn back to dataset GE_travel.

No EU_member_countries Feb_2019 Feb_2020 change change_pr
1 Poland 2269 4194 1925 0.8483914
2 Germany 1925 2656 731 0.3797403
3 United Kingdom 1723 2262 539 0.3128265
4 Lithuania 1198 1595 397 0.3313856
5 France 920 1402 482 0.5239130
6 Italy 743 1346 603 0.8115747

Our aim is to compare how the number of tourists has dropped because of the coronavirus. We have tourist numbers from February 2019 and February 2020. Let’s focus on EU countries.

Plot number of visitors in February 2019:

ggplot()+
  theme_minimal()+
  geom_col(data = GE_travel, aes(x = EU_member_countries, y = Feb_2019), fill  ="orange")

Now both years:

ggplot()+
  theme_minimal()+
  geom_col(data = GE_travel, aes(x = EU_member_countries, y = Feb_2019), fill = "orange")+
  geom_col(data = GE_travel, aes(x = EU_member_countries, y = Feb_2020), fill = "dodgerblue")

Columns are on top of each other. Country names are unreadable…
Use points:

ggplot()+
  theme_minimal()+
  geom_point(data = GE_travel, aes(x = EU_member_countries, y = Feb_2019), colour = "orange")+
  geom_point(data = GE_travel, aes(x = EU_member_countries, y = Feb_2020), colour = "dodgerblue")

We can see some differences but the initial result is definitely ugly. We could rotate the axes:

ggplot()+
  theme_minimal()+
  geom_point(data = GE_travel, aes(y = EU_member_countries, x = Feb_2019), colour = "orange")+
  geom_point(data = GE_travel, aes(y = EU_member_countries, x = Feb_2020), colour = "blue")

We should also sort the countries by the number of tourists, not by name. Sorting inside the table doesn’t help here; the sorting has to happen in the plot:

ggplot()+
  theme_minimal()+
  geom_point(data = GE_travel, aes(y = reorder(EU_member_countries, Feb_2020), x = Feb_2019, colour = "2019"))+
  geom_point(data = GE_travel, aes(y = reorder(EU_member_countries, Feb_2020), x = Feb_2020, colour = "2020"))+
  scale_colour_manual(values = c("orange", "dodgerblue"))+
  labs(colour = "year", 
       y="country", 
       x="Number of tourists", 
       title = "Change in number of tourists in Georgia (EU countries)", 
       subtitle = "difference between February 2019 and 2020", 
       caption = "data: Georgian National Tourism Administration \nvisual: A. Aasa")

The result looks “almost” right, but it’s still hard to understand the volumes and changes. Let’s add the percentage of change to the plot and visualise the change with an arrow:

ggplot()+
  theme_minimal()+
  geom_segment(data = GE_travel, aes(y = reorder(EU_member_countries, Feb_2020),
                                     yend = reorder(EU_member_countries, Feb_2020), 
                                     x = Feb_2019, xend = Feb_2020,
                                     colour = (change_pr * 100)),
               
               arrow = arrow(length= unit(0.02, "npc")))+
  scale_colour_gradientn(colours = c("darkgreen", "grey70", "orange", "red", "red4"))+
  labs(colour = "change, %", 
       y = "country", 
       x = "Number of tourists", 
       title = "Change in number of tourists in Georgia (EU countries)", 
       subtitle = "difference between February 2019 and 2020", 
       caption = "data: Georgian National Tourism Administration \nvisual: A. Aasa")

6 Spatial data

You can create spatial objects yourself, but you can import them (e.g. .shp) as well. The most important prerequisite is the availability of spatial information (coordinates, address, shape/geometry etc.). There are many possibilities to import spatial objects into R. We focus on simple features.

6.1 Create first real map in R

Our aim is to download the spatial layer of Georgian regions and draw a general overview map. Work steps:

  1. download the file,
  2. unzip it,
  3. look for the files containing “shp” in their name,
  4. get the name of the file you need,
  5. import the shp to R,
  6. plot the spatial polygons,
  7. polish the map (add titles, map scale, north arrow etc.),
  8. check the structure of the imported spatial object,
  9. check the coordinate system (CRS) of the imported layer (WGS84 = EPSG:4326).

For the first steps, load the libraries to expand the basic functionality, then download the zipped geolayer (in shp format) from the Internet. I found relevant population statistics data on a webpage called The Humanitarian Data Exchange.

Download the data:

# first define the URL:
ge_admin_url <- "https://data.humdata.org/dataset/3ee95199-2dfe-40fc-b9bd-b44cc8c91024/resource/ac2b2747-18a2-43c9-abbb-4ab2c5c539a3/download/geo_adm_geostat_20191018_shp.zip"

# download the zipped dataset:
download.file(ge_admin_url, destfile="ge_admin.zip")
# unzip it:
unzip("ge_admin.zip")

##############
# In case the original link is not working:
download.file("http://aasa.ut.ee/tbilisi/data/ge_admin.zip", destfile="ge_admin.zip")

The shapefile format is a geospatial vector data format for geographic information system (GIS) software. The shapefile format is a digital vector storage format for storing geometric location and associated attribute information. The format consists of a collection of files with a common filename prefix, stored in the same directory. (Wikipedia)

Mandatory files in a shp collection:

  • .shp — shape format; the feature geometry itself
  • .shx — shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly
  • .dbf — attribute format; columnar attributes for each shape, in dBase IV format

Strictly speaking, the .prj file (a projection description, using a well-known text representation of the coordinate reference system) is optional, but in practice it is essential: without it the CRS of the layer is unknown.

This means that for every shp layer we must have one .shp file. List the shp files in the working directory:

list.files(pattern = ".shp")

Now you should see these 3 files among others:

  • geo_admbnda_adm0_geostat_20191018.shp
  • geo_admbnda_adm1_geostat_20191018.shp
  • geo_admbnda_adm2_geostat_20191018.shp

Luckily the zip container also contains metadata: GEO COD-AB 2019_10_28.pdf. Read it! (It is in the working directory; to see the working directory, run getwd().)
Now you know the meaning of the different layers.
If files are there, you can read them to R:

ge_adm_0 <- st_read("geo_admbnda_adm0_geostat_20191018.shp")
## Reading layer `geo_admbnda_adm0_geostat_20191018' from data source `C:\ANTO\loengud\tbilisi\2020\web\geo_admbnda_adm0_geostat_20191018.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 13 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 40.00834 ymin: 41.05455 xmax: 46.72672 ymax: 43.5865
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
ge_adm_1 <- st_read("geo_admbnda_adm1_geostat_20191018.shp")
## Reading layer `geo_admbnda_adm1_geostat_20191018' from data source `C:\ANTO\loengud\tbilisi\2020\web\geo_admbnda_adm1_geostat_20191018.shp' using driver `ESRI Shapefile'
## Simple feature collection with 13 features and 16 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 40.00834 ymin: 41.05455 xmax: 46.72672 ymax: 43.5865
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
ge_adm_2 <- st_read("geo_admbnda_adm2_geostat_20191018.shp")
## Reading layer `geo_admbnda_adm2_geostat_20191018' from data source `C:\ANTO\loengud\tbilisi\2020\web\geo_admbnda_adm2_geostat_20191018.shp' using driver `ESRI Shapefile'
## Simple feature collection with 70 features and 19 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 41.54717 ymin: 41.05455 xmax: 46.72672 ymax: 43.25688
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
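The st_read() output above already reports the CRS (epsg 4326, i.e. WGS84); you can also query it directly with st_crs(). A self-contained sketch with an assumed single point near Tbilisi (requires the sf package):

```r
library(sf)

# A single point in WGS84 longitude/latitude (EPSG:4326); coordinates are
# an assumption for illustration only:
pt <- st_sfc(st_point(c(44.8, 41.7)), crs = 4326)

st_crs(pt)$epsg  # query the EPSG code of the layer's CRS
```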

From the metadata we see that we have problems. At administrative level 2 (district or self-governing city) the data has gaps: several regions are excluded:

  • Tbilisi,
  • the Autonomous Republic of Abkhazia,
  • area known as the Provisional Administration.

Plot the imported layers.
Country:

ggplot()+
  geom_sf(data = ge_adm_0, col = "wheat2", fill = "wheat3")+
  labs(title = "Country of Georgia")

Regions:

ggplot()+
  geom_sf(data = ge_adm_1, fill = "firebrick", col = "grey10")+
  labs(title = "Regions of Georgia")

Districts of Georgia:

ggplot()+
  geom_sf(data = ge_adm_2, fill = "orange", col = "white")+
  labs(title = "Districts of Georgia")

As I previously noted, level 2 (districts) is the most problematic. Let’s set it aside for now and come back to the regions. As you remember, our aim was to create a nice map of the Georgian regions.

ggplot()+
  geom_sf(data = ge_adm_1, fill = "firebrick", col = "grey10")+
  labs(title = "Regions of Georgia")

The map of regions looks almost OK, right? Actually, several important details are missing:

  • map scale
  • north arrow
  • author & data source
  • names of regions

And maybe we should also tune the colours… Let’s start with the map background:

In ggplot2 you can easily modify every element of the plot. For the beginner it is maybe easier to use some predefined themes, for example:

ggplot()+
  theme_grey()+
  geom_sf(data = ge_adm_1, fill = "firebrick", col = "grey10")+
  labs(title = "Regions of Georgia")

ggplot()+
  theme_light()+
  geom_sf(data = ge_adm_1, fill = "firebrick", col = "grey10")+
  labs(title = "Regions of Georgia")

Theme light looks nice.
In next step we use different colours for regions and also add region names to the map:

ggplot()+
  theme_light()+
  geom_sf(data = ge_adm_1, aes(fill = ADM1_EN))+
  labs(title = "Regions of Georgia",
       fill  = "Regions")

As you can see, it is hard to connect the region names in the legend with a specific region on the map. It is probably better to put the names on top of the regions. For that we can use geom_sf_label():

ggplot()+
  theme_light()+
  geom_sf(data = ge_adm_1, aes(fill = ADM1_EN))+
  geom_sf_label(data = ge_adm_1, aes(label = ADM1_EN),
                alpha = 0.7,
                size= 3)+
  labs(title = "Regions of Georgia",
       fill  = "Regions")+
  guides(fill = F)

Unfortunately the labels (region names) are in some cases overlapping. To avoid this we can use a package called ggrepel. But unfortunately we cannot apply it to sf objects. To overcome the problem we can:

  1. calculate centroids for every region,
  2. convert the sf layer to a data frame.

First we calculate centroids and select the column with region names:

ge_adm_1_centr <- ge_adm_1 %>% 
  st_centroid() %>% 
  select(ADM1_EN)
## Warning in st_centroid.sf(.): st_centroid assumes attributes are constant over
## geometries of x
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for longitude/
## latitude data
# check the result:
glimpse(ge_adm_1_centr)
## Observations: 13
## Variables: 2
## $ ADM1_EN  <fct> Autonomous Republic of Abkhazia, Autonomous Republic of Ad...
## $ geometry <POINT [arc_degree]> POINT (41.16486 43.11086), POINT (42.08668 ...

As you see, the geometry type is now POINT. In the next step we convert the geometry column to XY coordinates:

ge_adm_1_xy <- ge_adm_1_centr %>% 
  st_coordinates() %>% 
  as_tibble() # last operation converts result to table (initially the output is in form of matrix)

# see the result: 
head(ge_adm_1_xy)
## # A tibble: 6 x 2
##       X     Y
##   <dbl> <dbl>
## 1  41.2  43.1
## 2  42.1  41.6
## 3  42.2  42.0
## 4  42.9  42.2
## 5  45.8  41.8
## 6  44.5  41.5

Now add names to previous table:

# drop geometry from centroids layer:
ge_adm_1_names <- ge_adm_1_centr %>% 
  st_drop_geometry()

# bind tables by columns:
ge_adm_1_xy <- cbind(ge_adm_1_names, ge_adm_1_xy)

And now we can try to plot the map without overlapping labels:

ggplot()+
  theme_light()+
  geom_sf(data = ge_adm_1, aes(fill = ADM1_EN))+
  geom_label_repel(data = ge_adm_1_xy, aes(x = X, y = Y, label = ADM1_EN),
                alpha = 0.7,
                size= 3)+
  labs(title = "Regions of Georgia",
       fill  = "Regions")+
  guides(fill = F)

Are we happy? Actually I have to admit that I don’t like the default colors… No problem! We can choose some other palette. Plot all the colour palettes which are available:

display.brewer.all()

Let’s say we like “Spectral”. We have to define that we want to use the palette ‘Spectral’ with 13 colours:

palette_13 <-  colorRampPalette(brewer.pal(6, "Spectral"))(13)
ggplot()+
  theme_light()+
  geom_sf(data = ge_adm_1, aes(fill = ADM1_EN), colour = "grey", size= 0.2)+
  geom_label_repel(data = ge_adm_1_xy, aes(x = X, y = Y, label = ADM1_EN),
                alpha = 0.7,
                size= 3)+
  labs(title = "Regions of Georgia",
       fill  = "Regions")+
  guides(fill = F)+
  scale_fill_manual(values = palette_13)

Almost done! Add a map scale and north arrow, and remove the axis titles (‘X’ & ‘Y’):

ggplot()+
  theme_light()+
  theme(axis.title = element_blank(),
        axis.ticks = element_blank())+
  geom_sf(data = ge_adm_1, aes(fill = ADM1_EN), colour = "grey", size= 0.2)+
  geom_label_repel(data = ge_adm_1_xy, aes(x = X, y = Y, label = ADM1_EN),
                alpha = 0.7,
                size= 3)+
  labs(title = "Regions of Georgia",
       fill  = "Regions",
       caption = "data: The Humanitarian Data Exchange\nvisual: A. Aasa")+
  guides(fill = F)+
  scale_fill_manual(values = palette_13)+
   annotation_scale(location = "bl", style = "ticks") +
  annotation_north_arrow(location = "tr", width = unit(0.5, "cm"))

6.2 Population map of Georgia

Colouring regions just by name is not very informative. A thematic map like this should carry some message. For example, we can make a map to visualize population size and/or density in the regions. For that we have to download population sizes for the regions.

We can download the population data from the same address as the administrative data link:

It is always useful to check how the data import worked. If the number of columns is small, then head() and glimpse() are enough, but currently the number of columns is 72 and the table doesn’t fit very well into the Console window. In that case you can use View():

View(ge_pop)

Hopefully everything looks OK and we can continue. For a better overview it is recommended to select only the relevant columns from the initial data table. In the current case our purpose is to create a population map, therefore we select POP_Total and the codes (ADM2_PCODE) of the administrative units.

ge_pop_2 <- ge_pop %>% 
  select(ADM2_PCODE, POP_Total)

# check the result:
glimpse(ge_pop_2)
## Observations: 70
## Variables: 2
## $ ADM2_PCODE <fct> GE1511, GE1523, GE1525, GE1529, GE1532, GE1535, GE2311, ...
## $ POP_Total  <int> 152839, 16760, 74794, 15044, 51189, 23327, 14785, 31486,...

From spatial layer we select district name and ID:

ge_adm_2_ii <- ge_adm_2 %>% 
  select(ADM2_PCODE, ADM2_EN)

glimpse(ge_adm_2_ii)
## Observations: 70
## Variables: 3
## $ ADM2_PCODE <fct> GE3823, GE4123, GE4127, GE4129, GE4111, GE2924, GE3523, ...
## $ ADM2_EN    <fct> Abasha, Adigeni, Akhalkalaki, Akhaltsikhe, Akhaltsikhe C...
## $ geometry   <POLYGON [arc_degree]> POLYGON ((42.34764 42.30973..., POLYGON...

Now we have 2 relevant datasets. In one (ge_adm_2) the spatial information (polygons of districts) is stored, and in the second table (ge_pop_2) we have the population size of the districts. We have to bind those two tables into one. Just putting the columns next to each other is not a very good idea: there is always the possibility that the tables differ, with rows in a different order or not matching at all. Therefore we have to join the tables.
You can read more about joining here: link.

Both tables must contain a column where the IDs of the administrative units are stored; in the current case it’s ADM2_PCODE. Both columns must be in the same format! If one is, for example, integer and the other character, the join doesn’t work.
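If the key columns ever do end up with different types (for example factor in one table and character in the other), a minimal sketch of the fix, using the tables created above, is to convert both to character before joining:

```r
# Hypothetical type fix: make both ID columns character so the join keys match
ge_pop_2 <- ge_pop_2 %>% 
  mutate(ADM2_PCODE = as.character(ADM2_PCODE))

ge_adm_2_ii <- ge_adm_2_ii %>% 
  mutate(ADM2_PCODE = as.character(ADM2_PCODE))
```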

We use a left join (it returns all records from the left table (table1) and the matched records from the right table (table2); the result is NA on the right side if there is no match):

ge_adm_2_pop <- left_join(ge_adm_2_ii, ge_pop_2, by = "ADM2_PCODE")

# check the result
glimpse(ge_adm_2_pop)
## Observations: 70
## Variables: 4
## $ ADM2_PCODE <fct> GE3823, GE4123, GE4127, GE4129, GE4111, GE2924, GE3523, ...
## $ ADM2_EN    <fct> Abasha, Adigeni, Akhalkalaki, Akhaltsikhe, Akhaltsikhe C...
## $ POP_Total  <int> 22341, 16462, 45070, 20992, 17903, 31461, 9139, 2047, 10...
## $ geometry   <POLYGON [arc_degree]> POLYGON ((42.34764 42.30973..., POLYGON...

If everything looks OK, we can plot the first version of our population map:

ggplot()+
  geom_sf(data = ge_adm_2_pop, aes(fill = POP_Total), col = "grey", size= 0.2)+
  scale_fill_gradientn(colours = c("darkgreen", "yellow", "red"))

As you remember from before, we have gaps in the data: Tbilisi, the Autonomous Republic of Abkhazia and the area known as the Provisional Administration are missing. In the case of Tbilisi it is easy to select and add the polygon from the table ge_adm_1, and it’s similarly simple to find the population size of Tbilisi (according to Wikipedia it’s 1171100). For the two other territories we must accept that we have no data.

6.2.0.1 Add Tbilisi to the district layer

For that we have to cut and paste the Tbilisi region. We select the ID and name of the region, then rename the columns (if we bind two tables [Tbilisi & districts] the column names must be the same) and finally use filter() to keep only the polygon of Tbilisi:

ge_adm_1_TBI <- ge_adm_1 %>% 
  select(ADM1_PCODE, ADM1_EN) %>% 
  rename(ADM_PCODE = ADM1_PCODE,
         ADM_EN = ADM1_EN) %>% 
  filter(ADM_EN == "Tbilisi")

# check:
ggplot()+
  geom_sf(data = ge_adm_1_TBI, fill = "firebrick")

Nice!

We have to enter the population number. As usual in R, we have several options for that. One way is to use fix(). This command opens the table in a pop-up window and you can type the population number (1171100) into the right place.

fix(ge_adm_1_TBI)

Another option is to do it with scripting. Because the table now contains only the Tbilisi row, we can simply set the population size with mutate():

ge_adm_1_TBI <- ge_adm_1_TBI %>% 
  mutate(POP_Total  = 1171100)
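If the table contained several rows, a conditional version would be needed instead. A sketch, assuming the renamed columns from above (rows other than Tbilisi keep NA):

```r
# Set the population only where the unit name equals "Tbilisi"
ge_adm_1_TBI <- ge_adm_1_TBI %>% 
  mutate(POP_Total = if_else(ADM_EN == "Tbilisi", 1171100, NA_real_))
```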

Now we rename the columns also in ge_adm_2_pop and bind the two tables together:

ge_adm_2_pop <- ge_adm_2_pop %>% 
  rename(ADM_PCODE = ADM2_PCODE,
         ADM_EN = ADM2_EN)

ge_adm_2_pop <- rbind(ge_adm_2_pop, ge_adm_1_TBI)

You may get an error! That’s because the columns are in a different order. In that case we can fix it with select():

ge_adm_1_TBI <- ge_adm_1_TBI %>% 
  select(ADM_PCODE, ADM_EN, POP_Total, geometry)

ge_adm_2_pop <- rbind(ge_adm_2_pop, ge_adm_1_TBI)

How did it go?

ggplot()+
  geom_sf(data = ge_adm_2_pop, fill = "wheat3")

Tbilisi is now in the right place! Now we can plot the population map. In the interests of territorial integrity, we also add the Autonomous Republic of Abkhazia and the area known as the Provisional Administration to the map (as a background layer).

ggplot()+
  theme_minimal()+
  theme(axis.title = element_blank(),
        axis.ticks = element_blank())+
  geom_sf(data = ge_adm_0, fill = "grey30", colour = "grey", size = 0.2)+
  geom_sf(data = ge_adm_2_pop, aes(fill = POP_Total), colour = "grey40", size= 0.2)+
  scale_fill_gradientn(colours = c("darkgreen", "grey80", "orange"))+
  labs(title = "Population in Georgia",
       fill  = "N",
       caption = "data: The Humanitarian Data Exchange\nvisual: A. Aasa")+
  annotation_scale(location = "bl", style = "ticks") +
  annotation_north_arrow(location = "tr", width = unit(0.5, "cm"))

In principle it is a correct map, but it isn’t a good map. Our aim was to visualize the population distribution, yet on the map we see three main colours: areas with no data, one big city (Tbilisi) and the rest of Georgia. We need better class intervals! This is one of the weaknesses of ggplot2: we cannot directly apply anything other than equal intervals. One solution is to use a transformation. Currently the problem is that we have one big city while the population in the other districts is much smaller. In that case a log10 transformation may help:

ggplot()+
  theme_minimal()+
  theme(axis.title = element_blank(),
        axis.ticks = element_blank())+
  geom_sf(data = ge_adm_0, fill = "grey30", colour = "grey", size = 0.2)+
  geom_sf(data = ge_adm_2_pop, aes(fill = log10(POP_Total)), colour = "grey60", size= 0.2)+
  scale_fill_viridis_c(breaks = c(4, 5, 6),
                       labels = c(10000, 100000, 1000000))+
  labs(title = "Population in Georgia",
       fill  = "N",
       caption = "data: The Humanitarian Data Exchange\nvisual: A. Aasa")+
  annotation_scale(location = "bl", style = "ticks") +
  annotation_north_arrow(location = "tr", width = unit(0.5, "cm"))

The result is much better, but still needs some improvements.
Equal intervals is usually a bad classification method: it is sensitive to outliers and will often create empty classes. Natural breaks (the Fisher-Jenks algorithm), for example, is a more sophisticated method because it creates distinct groups of similar values by minimising the sum of variance within the classes. We’ll use the classIntervals() function from the classInt package, choosing 10 class intervals and the natural breaks (‘jenks’) method for grouping the values.

library(classInt) # needed for classIntervals()

classes <- classIntervals(ge_adm_2_pop$POP_Total, n = 10, style = "jenks", dig.lab = 20)

Check the result:

classes$brks
##  [1]    2047   10387   23570   33463   45070   62511   81876  104300  125103
## [10]  152839 1171100

Next we’ll create a new column in our sf object, using the base R cut() function to cut the population variable into distinct groups:

ge_adm_2_pop <- ge_adm_2_pop %>%
  mutate(percent_class = cut(POP_Total, classes$brks, include.lowest = T, dig.lab = 10))

# check the result:
glimpse(ge_adm_2_pop)
## Observations: 72
## Variables: 5
## $ ADM_PCODE     <fct> GE3823, GE4123, GE4127, GE4129, GE4111, GE2924, GE352...
## $ ADM_EN        <fct> Abasha, Adigeni, Akhalkalaki, Akhaltsikhe, Akhaltsikh...
## $ POP_Total     <dbl> 22341, 16462, 45070, 20992, 17903, 31461, 9139, 2047,...
## $ geometry      <POLYGON [arc_degree]> POLYGON ((42.34764 42.30973..., POLY...
## $ percent_class <fct> "(10387,23570]", "(10387,23570]", "(33463,45070]", "(...

Now we are ready to plot again:

ggplot()+
  theme_minimal()+
  theme(axis.title = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_rect(fill = "slategray4"),
        panel.grid = element_line(colour = "grey40", size= 0.2),
        plot.background = element_rect(fill = "slategray4"))+
  geom_sf(data = ge_adm_0, fill = "grey40", colour = "grey60", size = 0.2)+
  geom_sf(data = ge_adm_2_pop, aes(fill = percent_class), colour = "grey60", size= 0.2)+
  scale_fill_viridis_d()+
  labs(title = "Population in Georgia",
       fill  = "N",
       caption = "data: The Humanitarian Data Exchange\nvisual: A. Aasa")+
  annotation_scale(location = "bl", style = "ticks") +
  annotation_north_arrow(location = "tr", width = unit(0.5, "cm"))

6.3 Save your map

It is not enough to have a beautiful map on your computer screen; you should also save it to a file. With ggplot2 it’s easy:

# png:
ggsave("name_of_your_map.png", dpi = 300, height = 4, width = 7, units = "in")

# pdf:
ggsave("name_of_your_map.pdf", dpi = 300, height = 4, width = 7, units = "in")

# read more from help!
?ggsave

7 Homework number 1

Your task is to create a map that visualises the gender balance in Georgia. What is the share of the female/male population in Georgia? You can use the same data as in this tutorial.

  • processed layer of population in Georgian districts: ge_adm_2_pop
    • total population: POP_Total
  • table of population in districts: ge_pop
    • from this table you should select 3 columns: the ID codes of the administrative units (ADM2_PCODE), the total female population (F_Total) and the total male population (M_Total)
  • join the tables (left_join). Attention! The names of the columns containing the ID codes of the administrative units are different. You should use: by = c(“ADM_PCODE” = “ADM2_PCODE”)
  • enter the population values for the female and male population of Tbilisi. As you remember, for Tbilisi we don’t have data. According to the Internet, the female population of Tbilisi was 634743 and the male population 536357.
  • calculate the gender share, e.g. mutate(F_share = F_Total / POP_Total)
  • create the map and export it (using Export from the Plots tab is a bad idea, because the output will be saved at screen resolution)!
  • write a short description of the map (up to 100 words)
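The data-preparation steps above can be sketched as follows (the object ge_gender and the column F_share are assumed names introduced here for illustration; the Tbilisi values still need to be entered as shown earlier in the tutorial):

```r
# Select the gender columns from the population table
ge_gender <- ge_pop %>% 
  select(ADM2_PCODE, F_Total, M_Total)

# Join them to the processed spatial layer (note the differing key names)
# and calculate the share of the female population
ge_gender_map <- left_join(ge_adm_2_pop, ge_gender,
                           by = c("ADM_PCODE" = "ADM2_PCODE")) %>% 
  mutate(F_share = F_Total / POP_Total)
```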

Bind all the homework into one doc/pdf/html file. Add the script files separately (in the case of R Markdown the script is already in the html file). Please add your name to each file name! E-mail the results to: . The final deadline for all submissions is May 15th, 2020.

Possible result:


Author: Anto Aasa
in ISET, Tbilisi
Last update: 2020-04-19 15:37:45
