Today we look more precisely how to transform and mutate the data. Current session is based on Estonian data.
> According to Wikipedia: Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
Let’s suppose that we are agricultural scientists. We have received a task to analyse the distribution of livestock in Estonia. The data is coming from Estonian Agricultural Registers and Information Board (ARIB) and contains counts of different farmed animal species and breeds in Estonian farms.
Firstly we list all the libraries we need in current session (obviously in real life we don’t know in the beginning what tools we need and the list is formed during the process):
library(tidyverse)
library(knitr)
library(readxl)
library(curl) # library to download data from web
library(sf)
You can import data directly from Excel. It’i not so comfortable as from csv, but we can manage! First you have to download the xlsx-file into your computer and then import it to R.
# define xlsx file address
url <- "http://aasa.ut.ee/Rspatial/data/FarmedAnimalsByLocation_31102018.xlsx"
# define the name for downloaded file
destfile <- "FarmedAnimalsByLocation_31102018.xlsx"
# download the data according to previous parameters
curl_download(url, destfile)
# import data to R
agrAnimal <- read_excel(destfile)
Get familiar with the data:
glimpse(agrAnimal)
## Observations: 5,041
## Variables: 13
## $ `action place` <chr> "EE1000", "EE10009", "EE10011", "EE10015...
## $ `estonian holstein cattle` <dbl> 1, 56, 2, 26, 170, 33, 1, 0, 0, 4, 14, 3...
## $ `estonian red cattle` <dbl> 1, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0...
## $ `estonian native cattle` <dbl> 0, 0, 0, 0, 17, 0, 0, 1, 0, 0, 0, 0, 1, ...
## $ `beef cattle` <dbl> 0, 0, 32, 0, 2, 0, 27, 1, 3, 6, 21, 0, 0...
## $ sheeps <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0...
## $ goats <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ pigs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `X koordinaat` <chr> "6405512", "6565214", "6525085", "656351...
## $ `Y koordinaat` <chr> "661450", "603603", "582066", "593841", ...
## $ county <chr> "VÕRU MAAKOND", "LÄÄNE-VIRU MAAKOND", "J...
## $ municipality <chr> "ANTSLA VALD", "TAPA VALD", "TÜRI VALD",...
## $ `admin unit` <chr> "LUHAMETSA KÜLA", "LINNAPE KÜLA", "PALA ...
As you see, the file contains information about livestock in Estonia. Data is aggregated farm level. We have the counts of animals and breed. Locational information contains coordinates (looks like in Estonian official system, EPSG:3301). Additionally we have information about adminstrative units, municipalities and counties. Unfortunately we don’t have metadata and therefore we dont know which version of adminstrative hierarchy is used. This data would be crucial to create maps: we may join the tables (geolayer & attributes). But we have the exact coordinates of farms and we can delete the columns with descriptive locational information. There are available many ways to exclude columns from dataset:
#back up the data:
agrAnimal_bu <- agrAnimal
# delete the column
agrAnimal_bu$county <- NULL
# use reverse select
agrAnimal_bu <- agrAnimal_bu %>% select(-municipality)
glimpse(agrAnimal_bu)
## Observations: 5,041
## Variables: 11
## $ `action place` <chr> "EE1000", "EE10009", "EE10011", "EE10015...
## $ `estonian holstein cattle` <dbl> 1, 56, 2, 26, 170, 33, 1, 0, 0, 4, 14, 3...
## $ `estonian red cattle` <dbl> 1, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0...
## $ `estonian native cattle` <dbl> 0, 0, 0, 0, 17, 0, 0, 1, 0, 0, 0, 0, 1, ...
## $ `beef cattle` <dbl> 0, 0, 32, 0, 2, 0, 27, 1, 3, 6, 21, 0, 0...
## $ sheeps <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0...
## $ goats <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ pigs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `X koordinaat` <chr> "6405512", "6565214", "6525085", "656351...
## $ `Y koordinaat` <chr> "661450", "603603", "582066", "593841", ...
## $ `admin unit` <chr> "LUHAMETSA KÜLA", "LINNAPE KÜLA", "PALA ...
# select relevant variable
agrAnimal <- agrAnimal %>% select(`action place`, `estonian holstein cattle`, `estonian red cattle`, `estonian native cattle`, `beef cattle`, sheeps, goats, pigs, `X koordinaat`, `Y koordinaat`, municipality)
glimpse(agrAnimal)
## Observations: 5,041
## Variables: 11
## $ `action place` <chr> "EE1000", "EE10009", "EE10011", "EE10015...
## $ `estonian holstein cattle` <dbl> 1, 56, 2, 26, 170, 33, 1, 0, 0, 4, 14, 3...
## $ `estonian red cattle` <dbl> 1, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0...
## $ `estonian native cattle` <dbl> 0, 0, 0, 0, 17, 0, 0, 1, 0, 0, 0, 0, 1, ...
## $ `beef cattle` <dbl> 0, 0, 32, 0, 2, 0, 27, 1, 3, 6, 21, 0, 0...
## $ sheeps <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0...
## $ goats <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ pigs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `X koordinaat` <chr> "6405512", "6565214", "6525085", "656351...
## $ `Y koordinaat` <chr> "661450", "603603", "582066", "593841", ...
## $ municipality <chr> "ANTSLA VALD", "TAPA VALD", "TÜRI VALD",...
If column name contains space, we have to define them with help of apostrophe (name name). It is annoying… Better rename the columns:
agrAnimal <- agrAnimal %>% rename(action_place = `action place`,
estonian_holstein_cattle = `estonian holstein cattle`,
estonian_red_cattle = `estonian red cattle`,
estonian_native_cattle = `estonian native cattle`,
beef_cattle = `beef cattle`,
X_coord = `X koordinaat`,
Y_coord = `Y koordinaat`)
Check the result!!! Is it ok?
Check again! What is the data type for coordinates?
You can’t plot data if coordinates are stored as characters…
Convert them to numeric format! Converting from character to integer or numeric is easy. If you want to convert factor to numeric you have to convert factor at first to character and only then to integer, otherwise if you try to convert directly from factor to integer or numeric your result is a vector of the internal level representations of your factor and not the original values:
agrAnimal <- agrAnimal %>% mutate(X_coord = as.numeric(X_coord), Y_coord = as.numeric(Y_coord))
Check the result! Ok?
If you are satisfied, you can try to map the data. Currently its not a spatial object but flat data table which contains coordinates. In ggpolt it’s not a problem, you can plot spatail objects and tabular data on same plot. Only important thimng is that coordinates must be in the same reference system.
ggplot()+
geom_point(data = agrAnimal, aes(x= X_coord, y = Y_coord))
If you are familar with Estonian geography you probably recognize, that there is something wrong. What? Can you solve it?
The problem is, that traditionally X is used to label longitude and Y is used for latitude. In current case it’s vice versa. To avoid mistakes it’s better to rename those columns:
agrAnimal <- agrAnimal %>% rename(x = Y_coord, y = X_coord)
plot:
ggplot()+
geom_point(data = agrAnimal, aes(x= x, y = y))
Nice! Estonia is so densely covered with farms that we can already see the contour of Estonia! But to be safe, we download also the geolayer of Estonian municipalities and plot those two layers together.
Download and unzip the municipalities layer:
download.file("https://geoportaal.maaamet.ee/docs/haldus_asustus/omavalitsus_shp.zip", destfile="omavalitsus_shp.zip")
#omavalitsus means municipality!
unzip("omavalitsus_shp.zip")
Put results on plot!
First we have to import the downloaded shp-layer. SHP-layer is actually a bunch of several files (attribues, geometry, projection etc). With help of sf library the working with SHP-layer is today very simple!. First import the data:
municip <- st_read("omavalitsus_20190301.shp") # check the actual file name! They are updated regularly!
## Reading layer `omavalitsus_20190301' from data source `C:\ANTO\loengud\tbilisi\2020\web\omavalitsus_20190301.shp' using driver `ESRI Shapefile'
## Simple feature collection with 79 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 369032.1 ymin: 6377141 xmax: 739152.8 ymax: 6634019
## epsg (SRID): 3301
## proj4string: +proj=lcc +lat_1=58 +lat_2=59.33333333333334 +lat_0=57.51755393055556 +lon_0=24 +x_0=500000 +y_0=6375000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
If the import was successful you should see some information about imported layer. It contains 79 features (municipalities) and 5 fields (columns). Important is to notice the epsg (3301), which currently refers to the estonian official CRS. If the CRS is not defined you can define it by your self:
st_crs(municip) <- 3301
Plot municipalities and animal data together. A mentioned earlier ggplot2 is very flexible and you can plot plain data table and geolayer simultaneously at the same plot. It is important to put layers in correct order!
ggplot()+
geom_sf(data = municip)+
geom_point(data = agrAnimal, aes(x=x, y=y), # in aes() are defined all the parameters which are coming from the data (coordinates, colour, shape, transparency[alpha])
colour = "red", # but you can define previously listed parameters globally as well
alpha = 0.5, # you can "broke"" the lines to make code more readable!
size= 0.5,
shape = 2)
We can see, that all the data fits nicely on the map. To see the importance of correct order of layers, plot the map again put change the order of layers!
Previously plotted map shows only the general distribution of Estonian farms. This is certainly not enough for good analysis. WE should try to visualize distribution patterns of different species and breeds. Next we draw the map where we see the distribution of sheeps, goats and pigs:
ggplot()+
geom_sf(data = municip, colour = "grey40", fill = "grey80", size=0.5)+
geom_point(data = agrAnimal, aes(x=x, y=y, size=goats), colour = "blue")+
geom_point(data = agrAnimal, aes(x=x, y=y, size=pigs), colour = "grey30")+
geom_point(data = agrAnimal, aes(x=x, y=y, size=sheeps), colour = "red")
Not very nice? All the layers are overlapped and uppermost layer hides the other layers, proper legend is missing etc. We should create separate map for every species/breed? Currently our data is stored in so called wide format which means all the data groups are stored in different columns. To make analysis process more convenient and faster we should convert the data to long format link
Convert data from wide format to long format:
agrAnimal_2 <- gather(agrAnimal, "key", "value", 2:8)
agrAnimal_2 <- agrAnimal_2 %>% filter(value > 0)
Compare long and wide format:
Wide format:
agrAnimal %>% head() %>% kable()
| action_place | estonian_holstein_cattle | estonian_red_cattle | estonian_native_cattle | beef_cattle | sheeps | goats | pigs | y | x | municipality |
|---|---|---|---|---|---|---|---|---|---|---|
| EE1000 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 6405512 | 661450 | ANTSLA VALD |
| EE10009 | 56 | 3 | 0 | 0 | 0 | 0 | 0 | 6565214 | 603603 | TAPA VALD |
| EE10011 | 2 | 0 | 0 | 32 | 0 | 0 | 0 | 6525085 | 582066 | TÜRI VALD |
| EE10015 | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 6563514 | 593841 | JÄRVA VALD |
| EE10016 | 170 | 3 | 17 | 2 | 0 | 0 | 0 | 6528313 | 604696 | JÄRVA VALD |
| EE10018 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 6554026 | 598905 | JÄRVA VALD |
Long format:
agrAnimal_2 %>% head() %>% kable()
| action_place | y | x | municipality | key | value |
|---|---|---|---|---|---|
| EE1000 | 6405512 | 661450 | ANTSLA VALD | estonian_holstein_cattle | 1 |
| EE10009 | 6565214 | 603603 | TAPA VALD | estonian_holstein_cattle | 56 |
| EE10011 | 6525085 | 582066 | TÜRI VALD | estonian_holstein_cattle | 2 |
| EE10015 | 6563514 | 593841 | JÄRVA VALD | estonian_holstein_cattle | 26 |
| EE10016 | 6528313 | 604696 | JÄRVA VALD | estonian_holstein_cattle | 170 |
| EE10018 | 6554026 | 598905 | JÄRVA VALD | estonian_holstein_cattle | 33 |
Now we can plot different groups more easily and faster:
ggplot()+
geom_sf(data = municip, colour = "grey40", fill = "grey80", size=0.1)+
geom_point(data = agrAnimal_2, aes(x=x, y=y, size=value, colour = key, alpha = value))
Now we have proper legend but general expression is still the mess… Plot every species/breed on separete submap? For that we can wrap every submap in sequenced panel In ggplot2 this is called faceting
ggplot()+
geom_sf(data = municip, colour = "grey40", fill = "grey80", size=0.5)+
geom_point(data = agrAnimal_2, aes(x=x, y=y, size=value, colour = key, alpha = value))+
facet_wrap(~key)
Still not satisfied! Maybe the main problem is, that we have to much data and we should aggragate it from farm level to municipality level? Let’s try!
First we have to check the compatibility of two layers (are the municipality divisions from the same time?). To do that we have to join those two tables. Although we have already studied this topic, let’s look at it again.
Remind the structure of tables:
glimpse(agrAnimal_2)
## Observations: 7,152
## Variables: 6
## $ action_place <chr> "EE1000", "EE10009", "EE10011", "EE10015", "EE10016", ...
## $ y <dbl> 6405512, 6565214, 6525085, 6563514, 6528313, 6554026, ...
## $ x <dbl> 661450, 603603, 582066, 593841, 604696, 598905, 596034...
## $ municipality <chr> "ANTSLA VALD", "TAPA VALD", "TÜRI VALD", "JÄRVA VALD",...
## $ key <chr> "estonian_holstein_cattle", "estonian_holstein_cattle"...
## $ value <dbl> 1, 56, 2, 26, 170, 33, 1, 4, 14, 393, 13, 665, 272, 14...
glimpse(municip)
## Observations: 79
## Variables: 6
## $ ONIMI <fct> Ruhnu vald, Muhu vald, Keila linn, Maardu linn, Viimsi val...
## $ OKOOD <fct> 0689, 0478, 0296, 0446, 0890, 0304, 0424, 0653, 0651, 0718...
## $ MNIMI <fct> Saare maakond, Saare maakond, Harju maakond, Harju maakond...
## $ MKOOD <fct> 0074, 0074, 0037, 0037, 0037, 0037, 0037, 0037, 0037, 0037...
## $ TYYP <fct> 1, 1, 4, 4, 1, 1, 4, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 4, 1...
## $ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((456706.5 64..., MULTIPOLYGON ...
In one table the names of the municipalities are stored in column municipality in another ONIMI. Problem is that join function is case sensitive. This means that municipality names must be stored in both tables in exactly the same way. Currently in attribute file the names are in UPPERCASE. Convert both of them to lower case (more information about working with strings: link)
agrAnimal_2 <- agrAnimal_2 %>%
mutate(municip_key = str_to_lower(municipality))
municip <- municip %>%
mutate(municip_key = str_to_lower(ONIMI)) # create new layer to back up the data!
Check the results with glimpse!
Keep in new municipality layer only the names (municip_key) and codes (OKOOD) of the municipalities. Use the SELECT function (find from previous parts of tutorial!!!)
municip2 <- municip %>%
select(municip_key, OKOOD)
glimpse(municip2)
## Observations: 79
## Variables: 3
## $ municip_key <chr> "ruhnu vald", "muhu vald", "keila linn", "maardu linn",...
## $ OKOOD <fct> 0689, 0478, 0296, 0446, 0890, 0304, 0424, 0653, 0651, 0...
## $ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((456706.5 64..., MULTIPOLYG...
remove geometry column from new municipalities table:
st_geometry(municip2) <- NULL
Instead of joining the raw animal data we should aggregate it to municipality level! Let’s calculate the general count of all animals by municipalities. For that we have to define the grouping value and use summarise function:
agrAnimal_2_aggr <- agrAnimal_2 %>%
group_by(municip_key) %>%
summarise(n = sum(value)) %>%
ungroup()
And now we can try to join the tables. Most probably there will be some errors. To see how many municipality names are stored in both tables exactly the same way we use inner_join:
agrAnimal_JOINED <- inner_join(agrAnimal_2_aggr, municip2, by = "municip_key")
In municipalities layer we have 79 units but after join only 73.
Join the result with geolayer (now we use left_join to keep all the municipalities). After that we filter out all the municipalities with no data (NA-value):
agrAnimal_2_aggr_sf <- left_join(municip, agrAnimal_JOINED, by ="municip_key")
agrAnimal_2_aggr_sf <- agrAnimal_2_aggr_sf %>%
filter(!is.na(n))
plot the map
ggplot()+
geom_sf(data = agrAnimal_2_aggr_sf, aes(fill = n))+
scale_fill_gradientn(colours = topo.colors(20), na.value = "black")
We see some small white spots on the map (e.g. Viljandi, Võru).
Well. we have to search for another way to aggregate animal layer. On possibility is the spatial join! Read more,
If “ordinary” join is based on key-column of tables, the spatial join is based on geometry. This means both of datasets must be geolayers.
For that we have to convert animals data to SF-object more about sf
This kind of converting assumes, that we have in table locational information (usually coordinates).
Conversion to sf itself is easy:
# reminad the structure:
glimpse(agrAnimal_2)
## Observations: 7,152
## Variables: 7
## $ action_place <chr> "EE1000", "EE10009", "EE10011", "EE10015", "EE10016", ...
## $ y <dbl> 6405512, 6565214, 6525085, 6563514, 6528313, 6554026, ...
## $ x <dbl> 661450, 603603, 582066, 593841, 604696, 598905, 596034...
## $ municipality <chr> "ANTSLA VALD", "TAPA VALD", "TÜRI VALD", "JÄRVA VALD",...
## $ key <chr> "estonian_holstein_cattle", "estonian_holstein_cattle"...
## $ value <dbl> 1, 56, 2, 26, 170, 33, 1, 4, 14, 393, 13, 665, 272, 14...
## $ municip_key <chr> "antsla vald", "tapa vald", "türi vald", "järva vald",...
agrAnimal_2_sf <- st_as_sf(agrAnimal_2, coords = c("x", "y"), crs = 3301)
#agrAnimal_2_sf <- st_sf(agrAnimal_2_sf)
glimpse(agrAnimal_2_sf)
## Observations: 7,152
## Variables: 6
## $ action_place <chr> "EE1000", "EE10009", "EE10011", "EE10015", "EE10016", ...
## $ municipality <chr> "ANTSLA VALD", "TAPA VALD", "TÜRI VALD", "JÄRVA VALD",...
## $ key <chr> "estonian_holstein_cattle", "estonian_holstein_cattle"...
## $ value <dbl> 1, 56, 2, 26, 170, 33, 1, 4, 14, 393, 13, 665, 272, 14...
## $ municip_key <chr> "antsla vald", "tapa vald", "türi vald", "järva vald",...
## $ geometry <POINT [m]> POINT (661450 6405512), POINT (603603 6565214), ...
#plot the result (may take a bit more time...)
ggplot()+
geom_sf(data = agrAnimal_2_sf, aes(col=key, alpha =value))
It is useful to undestand what the function does:
?st_as_sf
ggplot()+
geom_sf(data = municip)+
geom_sf(data = agrAnimal_2_sf, aes(col=key, alpha =value))
Now we can perform the spatial join. The purpose is to add to animals sf-layer the municipality information from the municipalities layer:
agrAnimal_2_sf_municip <- st_join(agrAnimal_2_sf, municip, join = st_intersects)
In my computer previous command gave error message.
Despite the fact that both layers have the same CRS:
st_crs(agrAnimal_2_sf) == st_crs(municip)
## [1] FALSE
I managed to trick the software by saying that I want to transform both layers to new coordinate system:
agrAnimal_2_sf_municip <- st_join(st_transform(agrAnimal_2_sf, 3301), st_transform(municip, 3301), join = st_intersects)
st_geometry(agrAnimal_2_sf_municip) <- NULL
agrAnimal_2_sf_municip_aggr <- agrAnimal_2_sf_municip %>%
group_by(OKOOD, key) %>%
summarise(sum = sum(value)) %>%
ungroup()
agrAnimal_2_sf_municip_aggr <- agrAnimal_2_sf_municip_aggr %>%
spread(key, sum)
glimpse(agrAnimal_2_sf_municip_aggr)
## Observations: 73
## Variables: 8
## $ OKOOD <fct> 0130, 0141, 0142, 0171, 0184, 0191, 0198, ...
## $ beef_cattle <dbl> 595, 665, 2357, 519, 1546, 1508, 304, 4046...
## $ estonian_holstein_cattle <dbl> 53, 198, 1195, 3792, 43, 2203, 1178, 799, ...
## $ estonian_native_cattle <dbl> 1, 22, 33, 27, NA, 1, 2, 112, 85, 15, 15, ...
## $ estonian_red_cattle <dbl> 91, 26, 550, 1051, NA, 4, 4, 267, 3, 3, 26...
## $ goats <dbl> 34, 37, 13, 25, 130, 74, 15, 181, 26, 3, 8...
## $ pigs <dbl> 2, NA, 8560, 8086, NA, 3569, NA, 8, NA, NA...
## $ sheeps <dbl> 964, 277, 1422, 607, 1026, 568, 32, 3688, ...
agrAnimal_2_sf_municip_aggr <- left_join(municip, agrAnimal_2_sf_municip_aggr, by="OKOOD")
Plot the map:
ggplot()+
geom_sf(data = agrAnimal_2_sf_municip_aggr, aes(fill=beef_cattle), size=0.25, colour = "grey70")+
scale_fill_gradientn(colours = c("darkgreen", "grey80", "orange", "red", "brown"), na.value = "magenta")+
labs(fill = "N",
title = "Beef cattle in Estonian municipalities",
subtitle = "Agricultural Registers and Information Board",
caption = "author: A. Aasa")
This kind maps can be misleading.The area of municipalities is very different. Maybe the density is better measure? Let’s try.
# calculate area:
agrAnimal_2_sf_municip_aggr$area <- st_area(agrAnimal_2_sf_municip_aggr)
agrAnimal_2_sf_municip_aggr <- agrAnimal_2_sf_municip_aggr %>%
mutate(area = as.numeric(area) / 1000000) # m2 => km2
# plot:
ggplot()+
geom_sf(data = agrAnimal_2_sf_municip_aggr, aes(fill = beef_cattle / area), size=0.25, colour = "grey70")+
scale_fill_gradientn(colours = c("darkgreen", "grey80", "orange", "red", "brown"), na.value = "magenta")+
labs(fill = "N per km2",
title = "Beef cattle density in Estonian municipalities",
subtitle = "Agricultural Registers and Information Board",
caption = "author: A. Aasa")
Quite big difference compared with previous?
Your task is to create 2 thematic maps about pigs (or some other species/breed) distribution in Estonia:
For that you can use the same data used in this tutorial.
Bind all the homeworks into one doc/pdf/html. Add script-files separately (in case of R-markdowm script is already in html file). Please add your name to each file name! E-mail results to: anto.aasa@ut.ee. Final deadline for all submissions is May 15th, 2020.
Author: Anto Aasa
in ISET, Tbilisi
Last update: 2020-04-19 18:04:06
.