Today we look more precisely how to transform and mutate the
data.
> According to Wikipedia: Data wrangling, sometimes
referred to as data munging, is the process of transforming and mapping
data from one “raw” data form into another format with the intent of
making it more appropriate and valuable for a variety of downstream
purposes such as analytics.
Let’s suppose that we are animal geographers. We have received a task to analyse the distribution of farm animals in Estonia. The data is coming from Estonian Agricultural Registers and Information Board (ARIB) and contains counts of different farmed animal species and breeds in Estonian farms.
Firstly we list all the libraries we need in current session (obviously in real life we don’t know in the beginning what tools we need and the list is formed during the process):
library(tidyverse)
library(knitr)
library(stringr)
library(readxl)
library(curl) # library to download data from web
library(sf)
You can import data directly from Excel. It’s not so comfortable as from csv, but we can manage! First you have to download the xlsx-file and then import it to R.
# define xlsx file address
url <- "http://aasa.ut.ee/Rspatial/data/FarmedAnimalsByLocation_31102018.xlsx"
# define the name for downloaded file
destfile <- "FarmedAnimalsByLocation_31102018.xlsx"
# download the data according to previous parameters
curl_download(url, destfile)
# import data to R
agrAnimal <- read_excel(destfile)
Get familiar with the data:
glimpse(agrAnimal)
## Rows: 5,041
## Columns: 13
## $ `action place` <chr> "EE1000", "EE10009", "EE10011", "EE10015", …
## $ `estonian holstein cattle` <dbl> 1, 56, 2, 26, 170, 33, 1, 0, 0, 4, 14, 393,…
## $ `estonian red cattle` <dbl> 1, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0…
## $ `estonian native cattle` <dbl> 0, 0, 0, 0, 17, 0, 0, 1, 0, 0, 0, 0, 1, 3, …
## $ `beef cattle` <dbl> 0, 0, 32, 0, 2, 0, 27, 1, 3, 6, 21, 0, 0, 0…
## $ sheeps <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0…
## $ goats <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pigs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `X koordinaat` <chr> "6405512", "6565214", "6525085", "6563514",…
## $ `Y koordinaat` <chr> "661450", "603603", "582066", "593841", "60…
## $ county <chr> "VÕRU MAAKOND", "LÄÄNE-VIRU MAAKOND", "JÄRV…
## $ municipality <chr> "ANTSLA VALD", "TAPA VALD", "TÜRI VALD", "J…
## $ `admin unit` <chr> "LUHAMETSA KÜLA", "LINNAPE KÜLA", "PALA KÜL…
or:
head(agrAnimal)
## # A tibble: 6 × 13
## action pl…¹ eston…² eston…³ eston…⁴ beef …⁵ sheeps goats pigs X koo…⁶ Y koo…⁷
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 EE1000 1 1 0 0 0 0 0 6405512 661450
## 2 EE10009 56 3 0 0 0 0 0 6565214 603603
## 3 EE10011 2 0 0 32 0 0 0 6525085 582066
## 4 EE10015 26 0 0 0 0 0 0 6563514 593841
## 5 EE10016 170 3 17 2 0 0 0 6528313 604696
## 6 EE10018 33 0 0 0 0 0 0 6554026 598905
## # … with 3 more variables: county <chr>, municipality <chr>,
## # `admin unit` <chr>, and abbreviated variable names ¹`action place`,
## # ²`estonian holstein cattle`, ³`estonian red cattle`,
## # ⁴`estonian native cattle`, ⁵`beef cattle`, ⁶`X koordinaat`, ⁷`Y koordinaat`
#tail(agrAnimal)
As you see, the file contains information about farmed(?) animals in Estonia. Data is aggregated on farm level. We have counts of animal species and breeds. Locational information contains geographical coordinates (looks like in Estonian official system, EPSG:3301). Additionally we have information about adminstrative units, municipalities and counties. Unfortunately we don’t have metadata (municipality ID and time). Therefore we don’t know which version of adminstrative hierarchy is used. This data would be crucial to create maps: we may join the tables (geolayer & attributes). But we have the exact coordinates of farms and we can delete the columns with descriptive locational information. There are many ways to exclude columns from dataset:
#back up the data:
agrAnimal_bu <- agrAnimal
# delete the column
agrAnimal_bu$county <- NULL
# delete with 'select':
agrAnimal_bu <- agrAnimal_bu %>% select(-municipality)
glimpse(agrAnimal_bu)
## Rows: 5,041
## Columns: 11
## $ `action place` <chr> "EE1000", "EE10009", "EE10011", "EE10015", …
## $ `estonian holstein cattle` <dbl> 1, 56, 2, 26, 170, 33, 1, 0, 0, 4, 14, 393,…
## $ `estonian red cattle` <dbl> 1, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0…
## $ `estonian native cattle` <dbl> 0, 0, 0, 0, 17, 0, 0, 1, 0, 0, 0, 0, 1, 3, …
## $ `beef cattle` <dbl> 0, 0, 32, 0, 2, 0, 27, 1, 3, 6, 21, 0, 0, 0…
## $ sheeps <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0…
## $ goats <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pigs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `X koordinaat` <chr> "6405512", "6565214", "6525085", "6563514",…
## $ `Y koordinaat` <chr> "661450", "603603", "582066", "593841", "60…
## $ `admin unit` <chr> "LUHAMETSA KÜLA", "LINNAPE KÜLA", "PALA KÜL…
# select relevant variable
agrAnimal <- agrAnimal %>%
select(`action place`,
`estonian holstein cattle`,
`estonian red cattle`,
`estonian native cattle`,
`beef cattle`,
sheeps,
goats,
pigs,
`X koordinaat`,
`Y koordinaat`,
municipality)
glimpse(agrAnimal)
## Rows: 5,041
## Columns: 11
## $ `action place` <chr> "EE1000", "EE10009", "EE10011", "EE10015", …
## $ `estonian holstein cattle` <dbl> 1, 56, 2, 26, 170, 33, 1, 0, 0, 4, 14, 393,…
## $ `estonian red cattle` <dbl> 1, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0…
## $ `estonian native cattle` <dbl> 0, 0, 0, 0, 17, 0, 0, 1, 0, 0, 0, 0, 1, 3, …
## $ `beef cattle` <dbl> 0, 0, 32, 0, 2, 0, 27, 1, 3, 6, 21, 0, 0, 0…
## $ sheeps <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0…
## $ goats <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pigs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `X koordinaat` <chr> "6405512", "6565214", "6525085", "6563514",…
## $ `Y koordinaat` <chr> "661450", "603603", "582066", "593841", "60…
## $ municipality <chr> "ANTSLA VALD", "TAPA VALD", "TÜRI VALD", "J…
If column name contains space or some specific symbols, we have to define them with help of apostrophe (‘name name’). It is annoying… Better rename the columns:
agrAnimal <- agrAnimal %>% rename(action_place = `action place`,
estonian_holstein_cattle = `estonian holstein cattle`,
estonian_red_cattle = `estonian red cattle`,
estonian_native_cattle = `estonian native cattle`,
beef_cattle = `beef cattle`,
X_coord = `X koordinaat`,
Y_coord = `Y koordinaat`)
Check the result! Is it ok?
Check again! What is the data type for coordinates?
You can’t plot data if coordinates are stored as characters…
Convert them to numeric format! Converting from character to
integer or numeric is easy. If you want to convert
factor to numeric you have to convert factor
firstly to character and only then to integer,
otherwise if you try to convert directly from factor to integer
or numeric your result is a vector of the internal level
representations of your factor and not the original values:
agrAnimal <- agrAnimal %>%
mutate(X_coord = as.numeric(X_coord),
Y_coord = as.numeric(Y_coord))
Check the result! Ok?
If you are satisfied, you can try to map the data. Currently its not a
spatial object but flat data table which contains coordinates. In
ggplot
it’s not a problem, you can plot spatial objects and
tabular data on same plot. Only important thing is that coordinates must
be in the same reference system.
ggplot()+
geom_point(data = agrAnimal, aes(x= X_coord, y = Y_coord))
If you are familiar with Estonian geography you probably recognize,
that there is something wrong. What? Can you solve it?
The problem is, that traditionally X is used to label longitude and Y is
used for latitude. In current case it’s vice versa. To avoid mistakes
it’s better to rename those columns:
agrAnimal <- agrAnimal %>%
rename(x = Y_coord, y = X_coord)
plot:
ggplot()+
geom_point(data = agrAnimal, aes(x = x, y = y))
Nice! Estonia is so densely covered with farms that we can already
see the contour of Estonia! But to be safe, we download also the
geolayer of Estonian municipalities and plot those two layers
together.
Download and unzip the municipalities layer:
download.file("https://geoportaal.maaamet.ee/docs/haldus_asustus/omavalitsus_shp.zip", destfile="omavalitsus_shp.zip")
#omavalitsus means municipality!
unzip("omavalitsus_shp.zip")
Put results on plot!
First we have to import the downloaded shp-layer. SHP-layer is actually
a bunch of several files (attribues, geometry, projection etc).
With help of sf
library the working with SHP-layer is today
very simple! First import the data:
# Attention!!! Date ion shp-file name is changed after every data update!
# check the correct file name if needed:
list.files(pattern = "shp")
## [1] "asustusyksus_20211001.shp" "asustusyksus_20211101.shp"
## [3] "eestimaa_wgs84.shp" "eestimaa_wgs84.shp.xml"
## [5] "gps_us.shp" "gps_us_monterey.shp"
## [7] "maakond_20210901.shp" "maakond_20211101.shp"
## [9] "maakond_20221001.shp" "maakond_shp.zip"
## [11] "omavalitsus_20211001.shp" "omavalitsus_20221101.shp"
## [13] "omavalitsus_shp.zip" "population_2017.shp"
## [15] "trt_cont.shp"
# import shp:
municip <- st_read("omavalitsus_20211001.shp", quiet = T)
If the import was successful you should see some information about imported layer. It contains 79 features (municipalities) and 5 fields (columns). Important is to notice the epsg (3301), which currently refers to the estonian official CRS. If the CRS is not defined you can define it by your self:
st_crs(municip) <- 3301
Plot municipalities and animal data together. A mentioned earlier
ggplot2
is very flexible and you can plot plain data table
and geolayer simultaneously at the same plot. It is important to put
layers in correct order!
ggplot()+
geom_sf(data = municip)+
geom_point(data = agrAnimal, aes(x=x, y=y), # in aes() are defined all the parameters which are coming from the data (coordinates, colour, shape, transparency[alpha])
colour = "red", # but you can define previously listed parameters globally as well
alpha = 0.5, # you can "broke"" the lines to make code more readable!
size= 0.5,
shape = 2)