First things first! Please load the libraries:
library(dplyr)
library(readr)
library(knitr)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(scico)
library(ggmap)
library(data.table)
library(tidyr)
library(sf)
It is characteristic of the modern network and information society that we meet data everywhere, and very often the data is big. Big data can be a by-product of some other ICT (Information & Communication Technology) service: every use of ICT produces some amount of data. This data describes the usage of certain services, but can also describe the user (frequency, location, etc.). In this workshop, we will go through some examples of how different data sources can reveal the spatio-temporal behaviour of society and/or some smaller social group. Today, mobile data collection is a very useful toolbox for mobility analysis. Hereby we try to “make friends” with several different datasets (mobile positioning, GPS positioning, crimes, etc.). The main aim is to introduce the participants to the nature of mobile data, its applications and prospects, as well as its disadvantages and shortcomings.
Many research questions are related to spatial aspects. In addition to the question Where?, researchers often have to answer how one or another object/phenomenon is related to its spatial context: are there any spatial patterns?
If we assume that the spatial context of the data is essential, we also have to record spatial information during the data collection process. The easiest and most common way is to use coordinates. We all know what geographical coordinates are (geographical coordinates of Jakobi 2, Tartu: 26.71967E 58.38049N).
For various reasons, geographers do not always use geographical coordinates. In fact, there are very many different coordinate reference systems (CRS). For example, the official CRS of Estonia is the Estonian Coordinate System L-EST97 (EPSG:3301). In L-EST97 the coordinates of Jakobi 2, Tartu are X = 659059.9 & Y = 6474338. This means we have to be very careful when dealing with spatial objects: we must know the CRS of our spatial data, and for a given analysis all spatial datasets must be in the same CRS!
In general there are two types of spatial data: vector and raster. In the case of vector data we are dealing with discrete objects (points, lines, polygons); in the case of raster data we have a grid of cells (pixels) organized into rows and columns, where each cell has a value representing information. Today we will look at some simple vector data.
As usual in R, there are many ways to work with spatial data, but today we are focusing on Simple Features (sf). Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them. link
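As a quick illustration of how CRSs matter in practice, here is a minimal sketch (using the sf package loaded above) that converts the Jakobi 2 point from geographical coordinates (WGS84, EPSG:4326) to L-EST97 (EPSG:3301):

```r
library(sf)

# Jakobi 2, Tartu in geographical coordinates (WGS84, EPSG:4326)
jakobi <- st_sfc(st_point(c(26.71967, 58.38049)), crs = 4326)

# reproject to the Estonian national CRS (L-EST97, EPSG:3301)
st_transform(jakobi, crs = 3301)
```

The transformed coordinates should be close to the X = 659059.9, Y = 6474338 given above (small differences may appear depending on the transformation pipeline used).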
In the Mobility Lab, mobility data is, among other methods, also collected with the help of a special mobile phone application, “MobilityLog”. MobilityLog is an Android application designed for long-term mobility tracking and social-network-related experiments. The app was developed in joint cooperation between the Computer Laboratory of the University of Cambridge and the Mobility Lab of the University of Tartu. MobilityLog tracks location with the help of GPS, WiFi access points, the accelerometer, and mobile network cells. It also records other phone use such as call events, battery level, screen activation, or the alarm clock. It is a powerful tool for monitoring the activities carried out by the phone user.
Thus, in addition to location information (GPS), “MobilityLog” can record other behavioural information. The app sends the collected data daily to the servers of the Mobility Lab at the University of Tartu, where the quality of the received data is controlled. After the quality check, a set of algorithms starts to calculate predefined parameters of the respondent’s behaviour.
Thus, the “MobilityLog” application can be used to monitor/sense the activity of the phone user. Every respondent has an Android phone with the data collection app installed. In order to protect the respondents’ privacy, the data is pseudonymized in the initial stage of data processing.
The current dataset contains three weeks of GPS data recorded during a tourism trip in California. The data was recorded using MobilityLog. The dataset starts on 2018-04-21 and ends on 2018-05-10. Altogether it contains 197 375 GPS points.
In the real world, data preparation is very often one of the most annoying steps of the analysis process. Today we skip most of the obstacles and import the pre-prepared data directly from the Internet into R.
url <- "http://aasa.ut.ee/digitalMethods/digitalMethodsGPS.zip" # define the dataset address
download.file(url, "digitalMethodsGPS.zip") # download file
unzip("digitalMethodsGPS.zip", exdir = "data") # unzip files
file.remove("digitalMethodsGPS.zip") # tidy up by removing the zip file
## [1] TRUE
gpsData <- read_csv2("data/gps_us.csv") # import data to R
glimpse(gpsData) # structure of imported file
## Observations: 185,808
## Variables: 17
## $ user_id <int> 241, 241, 241, 241, 241, 241, 241, 241, 241, 24...
## $ id <int> 646380266, 646380267, 646380268, 646380269, 646...
## $ restart <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ counter <int> 66718, 66720, 66723, 66730, 66732, 66734, 66736...
## $ time_system <dbl> 1.524345e+12, 1.524345e+12, 1.524345e+12, 1.524...
## $ time_gps <dbl> 1.524345e+12, 1.524345e+12, 1.524345e+12, 1.524...
## $ time_system_ts <dttm> 2018-04-22 00:10:39, 2018-04-22 00:10:40, 2018...
## $ time_gps_ts <dttm> 2018-04-22 00:10:36, 2018-04-22 00:10:37, 2018...
## $ accuracy <int> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6,...
## $ altitude <dbl> -23.991000, -24.298600, -12.608500, -9.616940, ...
## $ bearing <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ speed <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.0...
## $ header_id <int> 226334354, 226334354, 226334354, 226334360, 226...
## $ series_id <int> 1042, 1042, 1042, 1042, 1042, 1042, 1042, 1042,...
## $ year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,...
## $ X <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -12...
## $ Y <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.6497...
In the current tutorial we are using the package dplyr for filtering, arranging, selecting, mutating, grouping and summarising data. And for better (human-readable) code we are using pipes.
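The pipe (%>%) passes the result of the left-hand side as the first argument of the next function, so nested calls can be read top to bottom. A tiny sketch of the equivalence (using the gpsData table imported above):

```r
library(dplyr)

# nested form, read inside out:
head(select(gpsData, X, Y))

# piped form, read left to right (equivalent):
gpsData %>%
  select(X, Y) %>%
  head()
```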
With glimpse() we see that the dataset contains 17 variables and 185 808 rows. There are integer, double and date-time variables. To reduce the dataset size we keep only the columns relevant for the current analysis (time, accuracy, altitude, bearing, speed, longitude [X], latitude [Y]):
gpsData <- gpsData %>% select(time_system_ts, accuracy, altitude, bearing, speed, X, Y)
gpsData %>% head() %>% kable() # print the formatted head of the table
| time_system_ts | accuracy | altitude | bearing | speed | X | Y |
|---|---|---|---|---|---|---|
| 2018-04-22 00:10:39 | 12 | -23.99100 | 165.957 | 0.104187 | -121.3842 | 38.64978 |
| 2018-04-22 00:10:40 | 16 | -24.29860 | NA | NA | -121.3842 | 38.64976 |
| 2018-04-22 00:10:41 | 16 | -12.60850 | NA | 0.000000 | -121.3842 | 38.64976 |
| 2018-04-22 00:10:42 | 6 | -9.61694 | NA | 0.000000 | -121.3842 | 38.64976 |
| 2018-04-22 00:10:43 | 4 | -8.17016 | NA | 0.000000 | -121.3843 | 38.64971 |
| 2018-04-22 00:10:44 | 6 | -5.67983 | NA | 0.000000 | -121.3843 | 38.64971 |
Before getting into trouble we have to convert the date-time field to the correct form. Originally the data recorder (the smartphone with the MobilityLog app) stored all locations in the Estonian time zone (UTC+3). The time zone of California (UTC−7) differs by 10 hours. This means we have to subtract 10 hours from the values in the date-time column:
gpsData <- gpsData %>% mutate(time_system_ts = time_system_ts - hours(10))
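A more robust alternative (a sketch using lubridate, which is already loaded) is to convert between named time zones with force_tz() and with_tz(); this also handles daylight saving time automatically instead of relying on a fixed offset:

```r
library(lubridate)

# a single timestamp recorded in Estonian time:
t_est <- force_tz(ymd_hms("2018-04-22 00:10:39"), tzone = "Europe/Tallinn")

# the same instant expressed in California time:
with_tz(t_est, tzone = "America/Los_Angeles")
```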
Now the data should be ready for the analysis. Let’s summarise the data: when does it start, when does it end and how many GPS points does it contain?
gpsData %>% summarise(start = min(time_system_ts), end = max(time_system_ts), 'GPS points' = n()) %>% kable()
| start | end | GPS points |
|---|---|---|
| 2018-04-21 14:10:39 | 2018-05-10 13:56:51 | 185808 |
The data collection application MobilityLog is designed to save the battery of the smartphone as much as possible. For that reason the GPS sensor is switched off every time the app realises that the phone is not moving. This means that the dataset contains a lot of gaps. Usually gaps in data indicate data collection problems, but in the current case they contain significant information: where and when did the respondent stop during the trip?
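One way to make the gaps explicit (a sketch, assuming gpsData is prepared as above) is to compute the time difference between consecutive GPS points with lag() and keep only the large ones, which mark possible stops; the 10-minute threshold is an illustrative choice:

```r
library(dplyr)

gpsData %>%
  arrange(time_system_ts) %>%
  mutate(gap_min = as.numeric(difftime(time_system_ts,
                                       lag(time_system_ts),
                                       units = "mins"))) %>%
  filter(gap_min > 10) %>%   # keep gaps longer than 10 minutes
  select(time_system_ts, gap_min, X, Y) %>%
  head()
```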
We could just plot the altitude (metres) against time. For visualizations we are using the package ggplot2 in the current workshop.
ggplot2 webpage: It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).
glimpse(gpsData) # check to remember the column names
## Observations: 185,808
## Variables: 7
## $ time_system_ts <dttm> 2018-04-21 14:10:39, 2018-04-21 14:10:40, 2018...
## $ accuracy <int> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6,...
## $ altitude <dbl> -23.991000, -24.298600, -12.608500, -9.616940, ...
## $ bearing <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ speed <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.0...
## $ X <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -12...
## $ Y <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.6497...
ggplot()+
geom_point(data = gpsData, aes(x=time_system_ts, y=altitude), size=0.25, col="red")
Looks like there are a few gaps in the data!? Suspicious… If we assume that the GPS records a location every second, then the dataset should contain about 1 640 772 rows, but it is ca 10 times smaller. This means we have to be more clever to bring out the gaps. Currently it seems that the horizontal axis is compressed and the gaps are hidden under the data points. One possibility is to create a tile plot, where the horizontal axis represents days and the vertical axis represents time. Additionally we can colour the data points according to the GPS speed.
To do that, we have to calculate date and time-of-day columns. To avoid potential problems with time we calculate it on a decimal scale, where 0 is midnight, 12 is midday and 24 is midnight again:
gpsData <- gpsData %>% dplyr::mutate(time_dec=hour(time_system_ts)+(minute(time_system_ts)/60), date=as.Date(time_system_ts))
And now the plot:
# first version:
# pay attention to subset(), which filters out some GPS points (speed = 0)
ggplot()+
geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))
# the first version wasn't too beautiful. Some polishing:
ggplot()+
theme(panel.background = element_rect(fill="black"),
panel.grid.major = element_line(colour = "grey10"),
panel.grid.minor = element_line(colour = "grey10"),
plot.background = element_rect(fill="black"),
legend.background = element_rect(fill="black"),
legend.text = element_text(colour = "grey"),
legend.title = element_text(colour = "grey"),
axis.text = element_text(colour="grey"),
axis.title = element_text(colour = "grey"),
plot.title = element_text(colour="grey"))+ #theme() allows to change all the design elements of plot
geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))+ # the plot itself
scale_fill_scico(palette = "lajolla", direction = -1)+ #colour palette
labs(title="Time in movement", x="date", y="time")+ #titles
scale_y_continuous(breaks = seq(0, 24, 3))+ # change the time axis to better format
guides(fill=F) # hide the legend
There is no use for a beautiful visualization if we can’t export it to our research papers and presentations. Saving the plot as a raster or vector image is easy:
# to png (jpeg had some problems with Mac OS)
png("name.png", units = "in", res = 300, height = 4, width = 6)
ggplot()+
theme(panel.background = element_rect(fill="black"),
panel.grid.major = element_line(colour = "grey10"),
panel.grid.minor = element_line(colour = "grey10"),
plot.background = element_rect(fill="black"),
legend.background = element_rect(fill="black"),
legend.text = element_text(colour = "grey"),
legend.title = element_text(colour = "grey"),
axis.text = element_text(colour="grey"),
axis.title = element_text(colour = "grey"),
plot.title = element_text(colour="grey"))+ #theme() allows to change all the design elements of plot
geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))+ # the plot itself
scale_fill_scico(palette = "lajolla", direction = -1)+ #colour palette
labs(title="Time in movement", x="date", y="time")+ #titles
scale_y_continuous(breaks = seq(0, 24, 3))+ # change the time axis to better format
guides(fill=F) # hide the legend
dev.off()
## png
## 2
# to pdf
# try help: ?pdf
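Alternatively, ggsave() from ggplot2 saves the last printed plot and infers the file format from the extension (the file names here are illustrative):

```r
library(ggplot2)

# save the last printed plot as a vector PDF:
ggsave("movement_tiles.pdf", height = 4, width = 6)

# or as a 300 dpi raster PNG:
ggsave("movement_tiles.png", dpi = 300, height = 4, width = 6)
```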
It’s spatial data. We should study spatial patterns. For this, we have to make a map!
ggplot()+
geom_point(data = gpsData, aes(y = Y, x=X))
Plotted spatial data is not yet a map! A map needs spatial context, so we need a base map. The easiest solution is the package ggmap, which uses Google Maps.
First we have to download the map:
Ca_map <- get_map(location = "California", maptype = "terrain", color = "bw", zoom = 7)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=California&zoom=7&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=California&sensor=false
(You can try “satellite” or “hybrid” instead of “terrain”.)
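Note: newer versions of ggmap (2.7 and later) require a Google Maps API key before get_map() will work. A sketch (the key string is a placeholder, so the call is left commented out):

```r
library(ggmap)

# register a Google Maps API key once per session:
# register_google(key = "YOUR_API_KEY")
```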
To draw the map the grammar is very similar to ggplot2; the only difference is that the first line defines that ggmap is used:
ggmap(Ca_map)+
theme_map()+
geom_point(data = gpsData, aes(y = Y, x=X), colour ="red", size=0.5, alpha=0.5)
Ok? Of course you can add a title and tune the colours…
Where were the highest and the lowest points of the trip? For that we can make a really simple query:
trip_extremes <- gpsData %>% filter(altitude == min(altitude) | altitude == max(altitude))
trip_extremes %>% kable()
| time_system_ts | accuracy | altitude | bearing | speed | X | Y | time_dec | date |
|---|---|---|---|---|---|---|---|---|
| 2018-04-27 10:58:30 | 4 | 3073.550 | NA | 0.000000 | -118.1789 | 37.38543 | 10.966667 | 2018-04-27 |
| 2018-05-08 07:07:35 | 96 | -192.916 | 348.241 | 0.216682 | -122.3295 | 37.20981 | 7.116667 | 2018-05-08 |
# altitude -192 suggests the point is below sea level; it's probably a GPS error (look at the accuracy!)
And now the map:
ggmap(Ca_map)+
theme_map()+
#geom_point(data = gpsData, aes(y = Y, x=X), colour ="red", size=0.5, alpha=0.5)
geom_point(data = trip_extremes, aes(y = Y, x=X, colour = altitude), size=2)+
scale_colour_gradient(low="blue", high = "red")+
guides(colour = F)
We should add more information to the map. For example:
- What was the altitude?
- When was the place visited?
ggmap(Ca_map)+
theme_map()+
geom_point(data = trip_extremes, aes(y = Y, x=X, colour = altitude), size=2)+
geom_label(data = trip_extremes, aes(y = Y + 0.2, x=X, colour = altitude, label= round(altitude, 1)), size=3)+
geom_label(data = trip_extremes, aes(y = Y - 0.2, x=X, colour = altitude, label= as.Date(time_system_ts)), size=2.5, alpha=0.5)+
scale_colour_gradient(low="blue", high = "red")+
guides(colour = F)
So much (or rather, so little) about GPS data. It was just a featherlight touch of the possibilities…
Passive mobile positioning means that we are using Call Detail Records (CDR) data. A CDR dataset contains log files collected by mobile network operators to monitor and keep track of billable calling services used by their clients. Such a dataset, which contains information about the phone call initiator (a unique and anonymous ID), the time of the calling activities, as well as the location of the mobile antenna where the call activity was made, is an excellent source for analysing the spatio-temporal behaviour of phone users.
The CDR data in the following example is stored in two separate files. One of them is the log file of calling activities (cdr_example) and the other contains spatial information about the mobile antennas (cdr_antennas).
First we import the log file:
cdr_example <- read.csv2("http://aasa.ut.ee/digitalMethods/data/cdr_example.csv", sep=";", dec = ".")
glimpse(cdr_example)
## Observations: 42,112
## Variables: 3
## $ time <fct> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-0...
## $ site_id <int> 937, 937, 1242, 753, 753, 507, 1742, 1746, 1746, 17...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
At first glimpse everything looks OK! The only problem is that time is stored not as a date-time but as a factor. Convert time to the proper format and calculate date, hour and minute columns:
cdr_example <- cdr_example %>% mutate(time = ymd_hms(as.character(time)), date = as.Date(time), hour = hour(time), minute = minute(time))
glimpse(cdr_example)
## Observations: 42,112
## Variables: 6
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-...
## $ site_id <int> 937, 937, 1242, 753, 753, 507, 1742, 1746, 1746, 17...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8...
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 4...
Now we import the table of antennas:
cdr_antennas <- read.csv2("http://aasa.ut.ee/digitalMethods/data/sites.csv", sep=";", dec = ",")
glimpse(cdr_antennas)
## Observations: 1,117
## Variables: 3
## $ site_id <int> 351, 243, 601, 974, 667, 424, 269, 252, 187, 749, 1096...
## $ lat <dbl> 57.74861, 59.08381, 59.40694, 59.38417, 58.38350, 58.0...
## $ lon <dbl> 26.35027, 24.63264, 24.81722, 24.82553, 24.48503, 26.4...
The easiest way to perform a quality check is to plot the data:
ggplot()+
geom_point(data = cdr_antennas, aes(x=lon, y=lat))
Now we have to put the two tables together. This process is called joining. In the current case we use left_join, which returns all records from the left table (cdr_example) and the matched records from the right table (cdr_antennas). For that we have to define the key column by which the tables are joined. If there is no match, the columns coming from the right side are filled with NA.
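A minimal sketch of left_join() behaviour with toy tables (the values are made up): keys present only in the left table get NA in the columns coming from the right table:

```r
library(dplyr)

calls    <- tibble(site_id = c(1, 2, 3))
antennas <- tibble(site_id = c(1, 3),
                   lat = c(59.44, 58.38))

left_join(calls, antennas, by = "site_id")
# site_id 2 has no match in antennas, so its lat becomes NA
```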
cdr_example <- left_join(cdr_example, cdr_antennas, by = "site_id")
glimpse(cdr_example)
## Observations: 42,112
## Variables: 8
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-...
## $ site_id <int> 937, 937, 1242, 753, 753, 507, 1742, 1746, 1746, 17...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8...
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 4...
## $ lat <dbl> NA, NA, NA, 59.41333, 59.41333, 58.09139, NA, NA, N...
## $ lon <dbl> NA, NA, NA, 24.75564, 24.75564, 26.10084, NA, NA, N...
We see that some coordinate fields are empty (NA). This means that location information is missing for those antennas. Filter them out!
cdr_example <- cdr_example %>% filter(!is.na(lat)) # is.na() finds all cells where value is NA (not available), !is.na() works vice versa
glimpse(cdr_example)
## Observations: 4,564
## Variables: 8
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
Next we count the calling activities per antenna location:
cdr_antenna_count <- cdr_example %>% group_by(lat, lon) %>% summarise(n=n()) %>% ungroup()
ggplot()+
geom_point(data = cdr_antenna_count, aes(x=lon, y=lat, size=n, alpha=n), colour="red")
For spatial context we download the base map of Estonia using ggmap:
ET_map <- get_map(location ="Estonia", maptype = "hybrid", color ="bw", zoom =7)
ggmap(ET_map)
Visualize frequently visited places:
ggmap(ET_map)+
theme_map()+
theme(legend.position = "right")+
geom_point(data = cdr_antenna_count, aes(x=lon, y=lat, size=n, alpha=n), colour="red")
In mobility studies the location of the respondent’s home or workplace is crucial. Can we detect it from mobile positioning data as well?
We may assume that home is the most important place for every person. This allows us to assume that we use our mobile phones most at home, and that the workplace is the second most important place (of course these assumptions are oversimplified, but we don’t have too much time to dig deeper).
In the following query we count (n()) all calling activities, grouping them by antenna. Afterwards we sort the result starting with the largest value (descending); finally the table’s head is saved to a new data frame:
cdr_example_aggrActivities <- cdr_example %>% group_by(site_id) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
cdr_example_aggrActivities %>% kable()
| site_id | n |
|---|---|
| 938 | 1136 |
| 752 | 385 |
| 502 | 212 |
| 753 | 196 |
| 55 | 162 |
| 504 | 136 |
According to the assumptions we defined earlier, we can see from the results that antenna (site) 938 is the home location and the antenna with id 752 is the workplace. Where are they located in space? To answer that question we have to use the coordinates as well:
cdr_example_aggrActivities <- cdr_example %>% group_by(site_id, lon, lat) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
cdr_example_aggrActivities %>% kable()
| site_id | lon | lat | n |
|---|---|---|---|
| 938 | 27.15611 | 59.43528 | 1136 |
| 752 | 24.83249 | 59.42916 | 385 |
| 502 | 24.75611 | 59.43777 | 212 |
| 753 | 24.75564 | 59.41333 | 196 |
| 55 | 24.81474 | 59.42226 | 162 |
| 504 | 24.79641 | 59.17908 | 136 |
And then we put possible home and work places on the map:
ggmap(ET_map)+
theme_map()+
theme(legend.position = "right")+
geom_point(data = cdr_example_aggrActivities[1,], aes(x=lon, y=lat, colour="home"), size=2)+ # [1,] selects first row in table
geom_point(data = cdr_example_aggrActivities[2,], aes(x=lon, y=lat, colour = "work"), size=2)+
scale_colour_manual(values = c("orange", "dodgerblue"))
CDR data is commonly used in mobility studies. The accuracy of the data doesn’t allow exact turn-by-turn analysis of movement, but it is a priceless source for analysing commuting, tourism, migration and other patterns.
Again let’s check the column names. And then plot paths on the map.
glimpse(cdr_example)
## Observations: 4,564
## Variables: 8
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
ggplot()+
geom_path(data = cdr_example, aes(x=lon, y=lat), size = 0.2, col="red")
The result looks like a plate of spaghetti; it’s very hard to understand which places were visited more often and which less often.
Maybe changing the transparency (alpha) value helps?
ggplot()+
geom_path(data = cdr_example, aes(x=lon, y=lat), size = 0.2, col="red", alpha=0.5)
Not much better… It’s because the path is drawn as a single line. But there is a solution to improve it!
We should draw a separate line between every two consecutive location points! For that geom_segment can be used. To define the end of each segment we use lead(), which takes the value from the next row:
ggplot()+
geom_segment(data = cdr_example, aes(x = lon, xend = lead(lon, 1), y=lat, yend = lead(lat, 1)), alpha=0.05, size=0.25)
## Warning: Removed 1 rows containing missing values (geom_segment).
You can test different alpha values (0…1) to see how the output changes!
Put the path on the base map:
ggmap(ET_map, darken=c(0.25, "dodgerblue"))+ # we can change the tonality of base map
geom_segment(data = cdr_example, aes(x = lon, xend = lead(lon, 1), y=lat, yend = lead(lat, 1)), alpha=0.05, size=0.25, colour="orange")
## Warning: Removed 1 rows containing missing values (geom_segment).
CDR data is not used only in mobility studies. It can give much more information about the phone carrier. If we assume that starting a phone call or sending an SMS is an act of free will, then we can start to analyse the temporal behaviour of the phone user. For that, let us also calculate the weekday names:
# R gives the weekdays in language defined in your computer settings
# If you want, you can switch to English:
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
cdr_example <- cdr_example %>% mutate(wday = lubridate::wday(time, abbr = T, label =T))
# change back to your language:
Sys.setlocale("LC_TIME","Estonian") # in the current case it's Estonian
## [1] "Estonian_Estonia.1257"
glimpse(cdr_example)
## Observations: 4,564
## Variables: 9
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
## $ wday <ord> Tue, Tue, Tue, Tue, Wed, Wed, Wed, Wed, Wed, Wed, T...
# list weekdays:
cdr_example %>% distinct(wday)
## wday
## 1 Tue
## 2 Wed
## 3 Thu
## 4 Fri
## 5 Sat
## 6 Mon
## 7 Sun
# reorder them:
cdr_example$wday <- factor(cdr_example$wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
ggplot()+
geom_point(data = cdr_example, aes(x=wday, y=hour + (minute / 60)), shape=3, alpha = 0.075, size=6)+
scale_y_continuous(breaks = seq(0, 24, 3))+
labs(y="time, hh:mm")
Visually it doesn’t look too bad. But we should add some values or numbers. Otherwise we can only say that the person is more active on working days (calls more and wakes earlier) and usually does not use the phone during the night.
Let’s calculate some calling-time percentiles and add them to the plot (we may assume that the active day starts when 10% of the calls are made, midday arrives when 50% of the calls are made, and the evening starts when 90% of the calls are made):
# calculate time in decimal format:
cdr_example <- cdr_example %>% mutate(time_dec = hour + (minute / 60))
glimpse(cdr_example)
## Observations: 4,564
## Variables: 10
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
## $ wday <ord> Tue, Tue, Tue, Tue, Wed, Wed, Wed, Wed, Wed, Wed, T...
## $ time_dec <dbl> 9.183333, 9.250000, 9.366667, 17.700000, 10.400000,...
cdr_quart <- cdr_example %>% group_by(wday) %>% summarise(q1 = quantile(time_dec, 0.10), q2 = quantile(time_dec, 0.5), q3 = quantile(time_dec, 0.90)) %>% ungroup()
ggplot()+
geom_point(data = cdr_example, aes(x=wday, y=hour + (minute / 60)), shape=3, alpha = 0.075, size=6)+
geom_segment(data = cdr_quart, aes(x=wday, xend =wday, y=q1, yend = q3), colour="red", size=1)+
geom_point(data = cdr_quart, aes(x= wday, y =q1), colour="red", size=4, shape = "-")+
geom_point(data = cdr_quart, aes(x= wday, y =q2), colour="red", size=3)+
geom_point(data = cdr_quart, aes(x= wday, y =q3), colour="red", size=4, shape="-")+
scale_y_continuous(breaks = seq(0, 24, 3))+
labs(y="time, hh:mm")
A more traditional way is to count calling activities by hour:
glimpse(cdr_example)
## Observations: 4,564
## Variables: 10
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
## $ wday <ord> Tue, Tue, Tue, Tue, Wed, Wed, Wed, Wed, Wed, Wed, T...
## $ time_dec <dbl> 9.183333, 9.250000, 9.366667, 17.700000, 10.400000,...
cdr_example_hourAggr <- cdr_example %>% group_by(wday, hour) %>% summarise(n=n()) %>% ungroup()
ggplot()+
geom_line(data = cdr_example_hourAggr, aes(x=hour, y=n, group=wday, colour=as.factor(wday)))+
facet_wrap(~wday)+
scale_colour_manual(values = c("dodgerblue", "steelblue", "blue", "cyan", "navyblue", "orange", "red"))+
scale_x_continuous(breaks = seq(0, 24, 3))+
labs(x="hour", y="number of calls", colour = "day")
#Home antenna:
cdr_example_aggrActivities[1,]
## # A tibble: 1 x 4
## # Groups: site_id, lon [1]
## site_id lon lat n
## <int> <dbl> <dbl> <int>
## 1 938 27.2 59.4 1136
#workplace antenna
cdr_example_aggrActivities[2,]
## # A tibble: 1 x 4
## # Groups: site_id, lon [1]
## site_id lon lat n
## <int> <dbl> <dbl> <int>
## 1 752 24.8 59.4 385
cdr_example_H <- cdr_example %>% filter(site_id == cdr_example_aggrActivities$site_id[1])
cdr_example_W <- cdr_example %>% filter(site_id == cdr_example_aggrActivities$site_id[2])
cdr_example_H_timeAggr <- cdr_example_H %>% group_by(wday, hour) %>% summarise(n = n()) %>% ungroup()
cdr_example_W_timeAggr <- cdr_example_W %>% group_by(wday, hour) %>% summarise(n = n()) %>% ungroup()
ggplot()+
geom_line(data = cdr_example_H_timeAggr, aes(x=hour, y=n, colour="home"))+
geom_line(data = cdr_example_W_timeAggr, aes(x=hour, y=n, colour="work"))+
scale_x_continuous(breaks = seq(0, 24, 3))+
facet_wrap(~wday)
Was the home and workplace determination successful?
Today we read and hear more and more about big data and open data. Big data we have already covered (in reality both datasets are very big; here we just looked at small fragments).
What about open data? In Estonia the state-produced linked open data is accessible through the Open Data Portal of Estonia. One important feature of this data is that it is often machine readable.
Let’s look at one of the datasets from the Estonian Police and Border Guard Board. It contains information about crimes in Estonia. The dataset is updated weekly and freely available.
Download it:
crime_data <- read_delim("https://opendata.smit.ee/ppa/csv/avalik_1.csv",
"\t", escape_double = FALSE, col_types = cols(ToimKell = col_character(),
ToimKpv = col_character()), trim_ws = TRUE)
glimpse(crime_data)
## Observations: 13,423
## Variables: 18
## $ JuhtumId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4"...
## $ ToimKpv <chr> "2018-08-08", "2018-08-07", "2018-08-0...
## $ ToimKell <chr> "00:37", "19:07", "18:03", "18:00", "1...
## $ ToimNadalapaev <chr> "Kolmapäev", "Teisipäev", "Teisipäev",...
## $ SyndmusLiik <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS...
## $ SyndmusTaiendavStatLiik <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_V...
## $ Seadus <chr> "Karistusseadustik", "Karistusseadusti...
## $ Paragrahv <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262."...
## $ ParagrahvTais <chr> "§ 262. Avaliku korra rikkumine", "§ 2...
## $ Loige <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1."...
## $ Kahjusumma <chr> NA, NA, "0-499", "0-499", "0-499", "0-...
## $ KohtLiik <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "...
## $ MaakondNimetus <chr> "Ida-Viru maakond", "Harju maakond", "...
## $ ValdLinnNimetus <chr> "Narva linn", "Tallinn", "Tallinn", "N...
## $ KohtNimetus <chr> "Narva linn", "Haabersti linnaosa", "L...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", ...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547...
## $ SyyteoLiik <chr> "VT", "VT", "VT", "VT", "VT", "VT", "V...
# sometimes there is an easier solution (in the current case the default download works without any additional parameters):
crime_data_2 <- fread("https://opendata.smit.ee/ppa/csv/avalik_1.csv", encoding = "UTF-8")
glimpse(crime_data_2)
## Observations: 13,423
## Variables: 18
## $ JuhtumId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4"...
## $ ToimKpv <chr> "2018-08-08", "2018-08-07", "2018-08-0...
## $ ToimKell <chr> "00:37", "19:07", "18:03", "18:00", "1...
## $ ToimNadalapaev <chr> "Kolmapäev", "Teisipäev", "Teisipäev",...
## $ SyndmusLiik <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS...
## $ SyndmusTaiendavStatLiik <chr> "", "", "", "", "", "", "", "", "MUU_V...
## $ Seadus <chr> "Karistusseadustik", "Karistusseadusti...
## $ Paragrahv <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262."...
## $ ParagrahvTais <chr> "§ 262. Avaliku korra rikkumine", "§ 2...
## $ Loige <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1."...
## $ Kahjusumma <chr> "", "", "0-499", "0-499", "0-499", "0-...
## $ KohtLiik <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "...
## $ MaakondNimetus <chr> "Ida-Viru maakond", "Harju maakond", "...
## $ ValdLinnNimetus <chr> "Narva linn", "Tallinn", "Tallinn", "N...
## $ KohtNimetus <chr> "Narva linn", "Haabersti linnaosa", "L...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", ...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547...
## $ SyyteoLiik <chr> "VT", "VT", "VT", "VT", "VT", "VT", "V...
crime_data %>% head() %>% kable()
| JuhtumId | ToimKpv | ToimKell | ToimNadalapaev | SyndmusLiik | SyndmusTaiendavStatLiik | Seadus | Paragrahv | ParagrahvTais | Loige | Kahjusumma | KohtLiik | MaakondNimetus | ValdLinnNimetus | KohtNimetus | Lest_X | Lest_Y | SyyteoLiik |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| df5d4c9c-fde4-18d5-a4bc-97886cf239c4 | 2018-08-08 | 00:37 | Kolmapäev | AVALIKU_KORRA_RIKKUMINE | NA | Karistusseadustik | § 262. | § 262. Avaliku korra rikkumine | lg. 1. | NA | RAVIASUTUS | Ida-Viru maakond | Narva linn | Narva linn | 6587500-6587999 | 738000-738499 | VT |
| df5d4c92-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 19:07 | Teisipäev | PISIVARGUS | NA | Karistusseadustik | § 218. | § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu | lg. 1. | NA | AVALIK_KOHT,KAUPLUS | Harju maakond | Tallinn | Haabersti linnaosa | 6587500-6587999 | 536500-536999 | VT |
| df5d4c7e-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 18:03 | Teisipäev | PISIVARGUS | NA | Karistusseadustik | § 218. | § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu | lg. 1. | 0-499 | AVALIK_KOHT,KAUPLUS | Harju maakond | Tallinn | Lasnamäe linnaosa | 6588000-6588499 | 547500-547999 | VT |
| df5d4c74-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 18:00 | Teisipäev | AVALIKU_KORRA_RIKKUMINE | NA | Karistusseadustik | § 262. | § 262. Avaliku korra rikkumine | lg. 1. | 0-499 | OUEALA_LAHTINE_HOOV | Ida-Viru maakond | Narva linn | Narva linn | 6590500-6590999 | 737500-737999 | VT |
| df5d4c60-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 16:06 | Teisipäev | KELMUS | NA | Karistusseadustik | § 218. | § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu | lg. 1. | 0-499 | AVALIK_KOHT,PARKLA,TANAV_VALJAK | Harju maakond | Tallinn | Põhja-Tallinna linnaosa | 6588000-6588499 | 539000-539499 | VT |
| df5d4b20-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 10:30 | Teisipäev | PISIVARGUS | NA | Karistusseadustik | § 218. | § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu | lg. 1. | 0-499 | AVALIK_KOHT,TANAV_VALJAK | Harju maakond | Tallinn | Kesklinna linnaosa | 6588500-6588999 | 541500-541999 | VT |
Unfortunately the table is in Estonian, but hopefully we can still analyse it! First, let’s translate the column names to English:
crime_names <- colnames(crime_data)
crime_names_en <-c("CaseId", "Date", "Time", "Weekday", "CaseType", "CaseTypeAdditional", "Law", "Paragraph", "ParagraphFull", "Section", "DamagesEuro", "PlaceType", "County", "Municipality", "Place", "Lest_X", "Lest_Y", "Type")
tibble("Columns, et" = crime_names, "Columns, en" = crime_names_en) %>% kable()
| Columns, et | Columns, en |
|---|---|
| JuhtumId | CaseId |
| ToimKpv | Date |
| ToimKell | Time |
| ToimNadalapaev | Weekday |
| SyndmusLiik | CaseType |
| SyndmusTaiendavStatLiik | CaseTypeAdditional |
| Seadus | Law |
| Paragrahv | Paragraph |
| ParagrahvTais | ParagraphFull |
| Loige | Section |
| Kahjusumma | DamagesEuro |
| KohtLiik | PlaceType |
| MaakondNimetus | County |
| ValdLinnNimetus | Municipality |
| KohtNimetus | Place |
| Lest_X | Lest_X |
| Lest_Y | Lest_Y |
| SyyteoLiik | Type |
Replace the column names with the English versions:
colnames(crime_data) <- crime_names_en
glimpse(crime_data)
## Observations: 13,423
## Variables: 18
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <chr> "2018-08-08", "2018-08-07", "2018-08-07", "...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547500-5...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
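One caveat about the renaming step above: assigning colnames positionally breaks silently if the source ever adds or reorders columns. A named lookup vector is more defensive; here is a self-contained sketch with just two of the columns:

```r
# Renaming via a named lookup instead of relying on column order
rename_map <- c(ToimKpv = "Date", ToimKell = "Time")
df <- data.frame(ToimKpv = "2018-08-08", ToimKell = "00:37",
                 stringsAsFactors = FALSE)
names(df) <- unname(rename_map[names(df)])
names(df)  # "Date" "Time"
```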
To check which categories a variable contains, we can use the command distinct:
LIST_caseType <- crime_data %>% distinct(CaseType)
LIST_caseType %>% head() %>% kable()
| CaseType |
|---|
| AVALIKU_KORRA_RIKKUMINE |
| PISIVARGUS |
| KELMUS |
| VARGUS |
| VANDALISM |
| PISIVARGUS,VARGUS |
Next we should check the frequency of different categories:
TOP_caseType <- crime_data %>% group_by(CaseType) %>% summarise(n=n()) %>% arrange(desc(n))
TOP_caseType %>% head() %>% kable()
| CaseType | n |
|---|---|
| VARGUS | 4453 |
| PISIVARGUS | 3912 |
| AVALIKU_KORRA_RIKKUMINE | 1579 |
| VANDALISM | 1089 |
| AVALIKU_KORRA_RIKKUMINE,KEHALINE_VAARKOHTLEMINE | 704 |
| MUU | 358 |
As we see once again, the working language of the Estonian police is Estonian. But we can easily translate the top 5 crime types to English:
| crime type, et | link to google translate |
|---|---|
| VARGUS | https://translate.google.com/#et/en/vargus |
| PISIVARGUS | https://translate.google.com/#et/en/pisivargus |
| AVALIKU_KORRA_RIKKUMINE | https://translate.google.com/#et/en/avaliku_korra_rikkumine |
| VANDALISM | https://translate.google.com/#et/en/vandalism |
| AVALIKU_KORRA_RIKKUMINE,KEHALINE_VAARKOHTLEMINE | https://translate.google.com/#et/en/avaliku_korra_rikkumine,kehaline_vaarkohtlemine |
Traditional glimpse at the data:
glimpse(crime_data)
## Observations: 13,423
## Variables: 18
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <chr> "2018-08-08", "2018-08-07", "2018-08-07", "...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547500-5...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
As it appears, all the variables were imported as character. We have to convert the columns to proper formats. First we convert the date, derive the weekday, and extract the hour:
# R returns weekday names in the language defined in your locale settings
# Let's switch to English:
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
crime_data <- crime_data %>% mutate(Date = ymd(Date), wday = lubridate::wday(Date, abbr = T, label =T), hour = as.integer(substr(Time, 1, 2)))
# change back to your own locale:
Sys.setlocale("LC_TIME","Estonian") # in the current case it's Estonian
## [1] "Estonian_Estonia.1257"
glimpse(crime_data)
## Observations: 13,423
## Variables: 20
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <date> 2018-08-08, 2018-08-07, 2018-08-07, 2018-0...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547500-5...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
## $ wday <ord> Wed, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue...
## $ hour <int> 0, 19, 18, 18, 16, 10, 9, 6, 4, 4, 2, 20, 2...
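A side note on the hour extraction: substr(Time, 1, 2) works here because the times are zero-padded “hh:mm” strings. lubridate can parse such strings explicitly, which fails loudly rather than silently if the format ever changes:

```r
library(lubridate)
# hm() parses "hh:mm" strings into period objects; hour() extracts the hour
t <- hm(c("00:37", "19:07"))
hour(t)  # 0 19
```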
The crimes dataset also contains spatial information (L-EST97; epsg = 3301). Currently the coordinates are stored in character fields and given as ranges (the coordinates are gridded). Our task is to convert each pixel/cell extent to its centroid:
crime_data <- crime_data %>% mutate(Lest_X_bu = Lest_X, Lest_Y_bu = Lest_Y) # back up coordinate field, because 'separate' will delete the original field
crime_data <- crime_data %>% separate(Lest_X, c("x_min", "x_max"), sep = "-") # split column
crime_data <- crime_data %>% separate(Lest_Y, c("y_min", "y_max"), sep = "-")
crime_data <- crime_data %>% mutate(x_min = as.integer(x_min),
x_max = as.integer(x_max),
y_min = as.integer(y_min),
y_max = as.integer(y_max))
# calculate centroid:
crime_data <- crime_data %>% mutate(x=(x_min + x_max) / 2, y=(y_min + y_max) / 2)
glimpse(crime_data)
## Observations: 13,423
## Variables: 26
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <date> 2018-08-08, 2018-08-07, 2018-08-07, 2018-0...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ x_min <int> 6587500, 6587500, 6588000, 6590500, 6588000...
## $ x_max <int> 6587999, 6587999, 6588499, 6590999, 6588499...
## $ y_min <int> 738000, 536500, 547500, 737500, 539000, 541...
## $ y_max <int> 738499, 536999, 547999, 737999, 539499, 541...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
## $ wday <ord> Wed, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue...
## $ hour <int> 0, 19, 18, 18, 16, 10, 9, 6, 4, 4, 2, 20, 2...
## $ Lest_X_bu <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y_bu <chr> "738000-738499", "536500-536999", "547500-5...
## $ x <dbl> 6587750, 6587750, 6588250, 6590750, 6588250...
## $ y <dbl> 738249.5, 536749.5, 547749.5, 737749.5, 539...
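A side note: the split-and-convert steps can be combined, since tidyr::separate has a convert = TRUE argument that coerces the new columns in the same call. A sketch on a single toy range value:

```r
library(tidyr)
# separate() with convert = TRUE splits the range and converts to integer
toy <- data.frame(Lest_X = "6587500-6587999", stringsAsFactors = FALSE)
toy <- separate(toy, Lest_X, c("x_min", "x_max"), sep = "-", convert = TRUE)
toy$x <- (toy$x_min + toy$x_max) / 2
toy$x  # 6587749.5
```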
Plot the distribution of crimes in Estonia:
ggplot()+
geom_point(data = crime_data, aes(x=x, y=y), size=2, alpha=0.2)
## Warning: Removed 156 rows containing missing values (geom_point).
Well… It should resemble the outline of Estonia, but it does not!
Let’s download the Estonian contour from the Estonian Land Board:
download.file("https://geoportaal.maaamet.ee/docs/haldus_asustus/maakond_shp.zip", destfile="maakond_shp.zip")
#maakond means county!
unzip("maakond_shp.zip")
# read shp to R:
county <- st_read("maakond_20180801.shp")
## Reading layer `maakond_20180801' from data source `C:\ANTO\loengud\digimeetoditeDoktorikool\mobileDateCollectionMethods\maakond_20180801.shp' using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 369032.1 ymin: 6377141 xmax: 739152.8 ymax: 6634019
## epsg (SRID): 3301
## proj4string: +proj=lcc +lat_1=58 +lat_2=59.33333333333334 +lat_0=57.51755393055556 +lon_0=24 +x_0=500000 +y_0=6375000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
ggplot()+
geom_sf(data = county)
Plot crimes data on top of county borders:
ggplot()+
geom_sf(data = county)+
geom_point(data = crime_data, aes(x=x, y=y), size=2, alpha=0.2, colour = "red")
## Warning: Removed 156 rows containing missing values (geom_point).
Well, it looks like the axes are switched: in L-EST97 the X coordinate is the northing and Y the easting, which is the opposite of what ggplot expects. Try swapping them:
ggplot()+
geom_sf(data = county, colour = "grey", fill="grey90", size=0.25)+
geom_point(data = crime_data, aes(x=y, y=x), size=0.5, alpha=0.25, shape=15, colour="red")
## Warning: Removed 156 rows containing missing values (geom_point).
It worked! The spatial pattern of crimes correlates nicely with population density. We can reduce the dataset by aggregating it to the grid:
glimpse(crime_data)
## Observations: 13,423
## Variables: 26
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <date> 2018-08-08, 2018-08-07, 2018-08-07, 2018-0...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ x_min <int> 6587500, 6587500, 6588000, 6590500, 6588000...
## $ x_max <int> 6587999, 6587999, 6588499, 6590999, 6588499...
## $ y_min <int> 738000, 536500, 547500, 737500, 539000, 541...
## $ y_max <int> 738499, 536999, 547999, 737999, 539499, 541...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
## $ wday <ord> Wed, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue...
## $ hour <int> 0, 19, 18, 18, 16, 10, 9, 6, 4, 4, 2, 20, 2...
## $ Lest_X_bu <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y_bu <chr> "738000-738499", "536500-536999", "547500-5...
## $ x <dbl> 6587750, 6587750, 6588250, 6590750, 6588250...
## $ y <dbl> 738249.5, 536749.5, 547749.5, 737749.5, 539...
crime_data_grd_aggr <- crime_data %>% group_by(x, y) %>% summarise(n = n()) %>% ungroup()
ggplot()+
geom_sf(data = county, colour = "grey", fill="grey90", size=0.25)+
geom_point(data = crime_data_grd_aggr, aes(x=y, y=x, colour = n, alpha=n))+
scale_colour_gradientn(colours = c("black", "red", "orange", "yellow"))
## Warning: Removed 1 rows containing missing values (geom_point).
The general picture remained the same, but the dataset is about 10× smaller (13,423 rows vs 1,154 rows). The only problem with the previous plot is that the cells with large values (crime hotspots) are buried under the other points.
Can we change the plotting order (bigger values on top)? Yes! We can sort the rows by value:
crime_data_grd_aggr <- crime_data_grd_aggr %>% arrange(n)
ggplot()+
geom_sf(data = county, colour = "grey", fill="grey90", size=0.25)+
geom_point(data = crime_data_grd_aggr, aes(x=y, y=x, colour = n))+
scale_colour_gradientn(colours = c("black", "red", "orange", "yellow"))
## Warning: Removed 1 rows containing missing values (geom_point).
Group data by weekdays and hours, count number of crimes by groups. Create a plot:
crime_data_aggr_hourWday <- crime_data %>% group_by(wday, hour) %>% summarise(n=n()) %>% ungroup()
ggplot()+
theme_minimal()+
geom_line(data = crime_data_aggr_hourWday, aes(x = hour, y=n))+
facet_wrap(~wday)+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 1 rows containing missing values (geom_path).
The result looks almost okay! But R treats Sunday as the first day of the week, so Sunday appears as the first facet. We can change that! Let’s define Monday as the first day of the week:
crime_data %>% distinct(wday)
## # A tibble: 7 x 1
## wday
## <ord>
## 1 Wed
## 2 Tue
## 3 Mon
## 4 Sun
## 5 Sat
## 6 Fri
## 7 Thu
crime_data$wday <- factor(crime_data$wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
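An alternative worth knowing (assuming a reasonably recent lubridate): wday() accepts a week_start argument, so the Monday-first ordering can be requested already when the weekday column is created, without re-levelling afterwards:

```r
library(lubridate)
d <- ymd("2018-08-08")                       # a Wednesday
wday(d, week_start = 1)                      # 3, counting from Monday
w <- wday(d, label = TRUE, abbr = TRUE, week_start = 1)
levels(w)[1]                                 # first factor level is Monday
```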
Let’s focus on public order violations (avaliku korra rikkumine):
crime_data_publicOrder <- crime_data %>% filter(CaseType == "AVALIKU_KORRA_RIKKUMINE")
crime_data_aggr_hourWday <- crime_data_publicOrder %>% group_by(wday, hour) %>% summarise(n=n()) %>% ungroup()
ggplot()+
theme_minimal()+
geom_line(data = crime_data_aggr_hourWday, aes(x = hour, y=n))+
facet_wrap(~wday)+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 1 rows containing missing values (geom_path).
A line graph is not the best choice here, because the weekdays are not actually connected with each other. Let’s make a bar plot:
ggplot()+
theme_minimal()+
geom_bar(data = crime_data_aggr_hourWday, aes(x = hour, y=n), stat ="identity", fill="red")+
facet_wrap(~wday)+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 7 rows containing missing values (position_stack).
Many other possibilities:
ggplot()+
theme_minimal()+
geom_segment(data = crime_data_aggr_hourWday, aes(x = hour, xend=hour, y=0, yend = n), colour="red")+
geom_point(data = crime_data_aggr_hourWday, aes(x = hour, y=n), colour="red")+
facet_wrap(~wday)+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 7 rows containing missing values (geom_segment).
## Warning: Removed 7 rows containing missing values (geom_point).
Hard to compare? Put all days on the same plot:
ggplot()+
theme_minimal()+
geom_line(data = crime_data_aggr_hourWday, aes(x = hour, y=n, group=wday, colour=wday))+
scale_x_continuous(breaks = seq(0, 24, 3))+
scale_colour_manual(values = c("dodgerblue", "green", "blue", "grey", "orange", "red", "magenta"))
## Warning: Removed 7 rows containing missing values (geom_path).
Still not very beautiful. We can instead calculate the mean for workdays and for the weekend:
crime_data_aggr_hourWeekend <- crime_data_aggr_hourWday %>% mutate(weekend = ifelse(wday == "Sat" | wday == "Sun", "weekend", "workday"))
crime_data_aggr_hourWeekend <- crime_data_aggr_hourWeekend %>% group_by(weekend, hour) %>% summarise(n=mean(n)) %>% ungroup()
ggplot()+
theme_minimal()+
geom_line(data = crime_data_aggr_hourWeekend, aes(x=hour, y=n, colour=weekend))+
geom_point(data = crime_data_aggr_hourWeekend, aes(x=hour, y=n, colour=weekend), shape=21)+
labs(title="Average dynamics of crimes during the day", caption ="author: A. Aasa")+
scale_colour_manual(values = c("orange", "dodgerblue"))+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).
We don’t have to stay at the national level; we can also “zoom in”. For example, we can download a base map of Tartu and put the crimes on that map.
tartu_map <- get_map(location ="Tartu, Estonia", maptype = "terrain", color ="bw", zoom =13)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Tartu,+Estonia&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Tartu,%20Estonia&sensor=false
ggmap(tartu_map)+
theme_map()+
theme(legend.position = "right")+
geom_point(data = crime_data_grd_aggr, aes(x=y, y=x, colour = n, alpha=n))
## Warning: Removed 1154 rows containing missing values (geom_point).
The maps do not fit? It is because of different coordinate systems: the Google base map is in geographical coordinates (WGS84, epsg = 4326), while the crimes dataset is in the Estonian official coordinate system (L-EST97; epsg = 3301). To solve this, we have to convert the crimes data frame to a spatial object.
crime_data_grd_aggr <- crime_data %>% group_by(x, y) %>% summarise(n = n()) %>% ungroup()
glimpse(crime_data_grd_aggr)
## Observations: 1,154
## Variables: 3
## $ x <dbl> 6389500, 6399500, 6402500, 6403500, 6403500, 6404750, 640525...
## $ y <dbl> 669499.5, 673499.5, 673499.5, 641499.5, 695499.5, 622249.5, ...
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 4, 1, 3, 1, 11, 8, 13, 8, 3, 1, 2, 2, 1...
crime_data_grd_aggr_sf <- crime_data_grd_aggr %>% filter(!is.na(x))
crime_data_grd_aggr_sf <- st_as_sf(crime_data_grd_aggr_sf, coords = c("y", "x"), crs = 3301)
Now we have created a spatial object with the coordinate reference system defined (L-EST97; epsg = 3301). Next we have to transform it to the geographical coordinate system (WGS84, epsg = 4326):
crime_data_grd_aggr_sf <- st_transform(crime_data_grd_aggr_sf, 4326)
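The transformation can be sanity-checked with the Jakobi 2 example from the introduction: its L-EST97 coordinates (easting 659059.9, northing 6474338) should come out near lon 26.71967, lat 58.38049 in WGS84:

```r
library(sf)
# Jakobi 2, Tartu as an sf point in L-EST97 (x = easting, y = northing)
pt <- st_sfc(st_point(c(659059.9, 6474338)), crs = 3301)
pt_wgs <- st_transform(pt, 4326)
st_coordinates(pt_wgs)  # approximately lon 26.71967, lat 58.38049
```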
And next we can check the results:
ggmap(tartu_map)+
#theme_map()+
theme(legend.position = "right")+
geom_sf(data = crime_data_grd_aggr_sf, aes(alpha=n), size=3, colour="red", inherit.aes = F)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
No difference?!?! Suspicious…
The problem is that the crimes outside of Tartu are still in the dataset: they fall outside the visible map, but they still drive the alpha scale, so the points in Tartu stay faint. We have to delete all the crime data outside of Tartu!
crime_data_grd_aggr_sf <- crime_data_grd_aggr %>% filter(!is.na(x))
crime_data_grd_aggr_sf <- crime_data_grd_aggr_sf %>% filter(y > 655918 & y < 663097 & x > 6469207 & x < 6476878)
crime_data_grd_aggr_sf <- st_as_sf(crime_data_grd_aggr_sf, coords = c("y", "x"), crs = 3301)
crime_data_grd_aggr_sf <- st_transform(crime_data_grd_aggr_sf, 4326)
ggmap(tartu_map)+
#theme_map()+
theme(legend.position = "right")+
geom_sf(data = crime_data_grd_aggr_sf, aes(alpha=n), size=6, shape=15, colour="red", inherit.aes = F)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
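As a side note, instead of the manual coordinate filter above, the points can be cropped on the sf object itself with st_crop. A self-contained sketch with invented toy coordinates:

```r
library(sf)
# Three toy points; only the first two fall inside the 0..10 box
pts <- st_as_sf(data.frame(x = c(1, 5, 20), y = c(1, 5, 20)),
                coords = c("x", "y"))
inside <- st_crop(pts, st_bbox(c(xmin = 0, ymin = 0, xmax = 10, ymax = 10)))
nrow(inside)  # 2
```

With real data you would take the bounding box from the base map (in the matching CRS) rather than invent one.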
Have a safe trip home!
Author: Anto Aasa
Digital Methods in Humanities and Social Sciences
Last update: 2018-08-23 14:30:10