First things first! Please load the libraries:

library(dplyr)
library(readr)
library(knitr)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(scico)
library(ggmap)
library(data.table)
library(tidyr)
library(sf)

1 Introduction

It is characteristic of the modern network and information society that we meet data everywhere, and very often that data is big. Big data can be a by-product of some other ICT (Information & Communication Technology) service: every use of ICT produces some amount of data. This data describes the usage of certain services, but can also describe the user (frequency, location, etc.). In this workshop we will go through some examples of how different data sources can reveal the spatio-temporal behaviour of society and/or some smaller social group. Today, mobile data collection is a very useful toolbox for mobility analysis. We will try to “make friends” with several different datasets (mobile positioning, GPS positioning, crimes, etc.). The main aim is to introduce participants to the nature of mobile data, its applications and prospects, as well as its disadvantages and shortcomings.

Slides of lecture

2 Spatial data

Many research questions are related to spatial aspects. In addition to the question Where?, researchers often have to answer how one or another object/phenomenon is related to its spatial context: are there any spatial patterns?
If we assume that the spatial context of the data is essential, we also have to record spatial information during the data collection process. The easiest and most common way is to use coordinates. We all know what geographical coordinates are (the geographical coordinates of Jakobi 2, Tartu are 26.71967E, 58.38049N).

2.1 Coordinate reference system (CRS)

For various reasons, geographers do not always use geographical coordinates. In fact, there are very many different coordinate reference systems (CRS). For example, the official CRS of Estonia is the Estonian Coordinate System L-EST97 (epsg:3301). In L-EST97 the coordinates of Jakobi 2, Tartu are X = 659059.9 and Y = 6474338. This means we have to be very careful when dealing with spatial objects: we must know the CRS of our spatial data, and for a given analysis all spatial datasets must be in the same CRS!
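We can check the Jakobi 2 example ourselves. A minimal sketch using the sf package (loaded at the start, and introduced in section 2.3 below): we build a point in geographical coordinates (WGS84, epsg:4326) and reproject it to L-EST97 (epsg:3301).

```r
library(sf)

# Jakobi 2, Tartu in geographical coordinates (WGS84, epsg:4326)
jakobi <- st_sfc(st_point(c(26.71967, 58.38049)), crs = 4326)

# reproject to the Estonian national CRS L-EST97 (epsg:3301)
st_transform(jakobi, 3301)
# the result should be close to the values given above: X = 659059.9, Y = 6474338
```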

2.2 Geometry of spatial data

In general there are two types of spatial data: vector and raster. In the case of vector data we are dealing with discrete objects (points, lines, polygons); in the case of raster data we have a grid of cells (pixels) organized into rows and columns, where each cell has a value representing information. Today we look at some simple vector data.

2.3 Simple Features (sf)

As usual in R, there are many ways to work with spatial data, but today we focus on Simple Features (sf). Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them. link
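The three basic vector geometry types mentioned above can be created directly with sf. A minimal sketch (the coordinates here are just illustrative values around Tartu):

```r
library(sf)

# point, line and polygon as sf geometries
pt  <- st_point(c(26.71967, 58.38049))                          # a point (Jakobi 2, Tartu)
ln  <- st_linestring(rbind(c(26.70, 58.37), c(26.72, 58.38)))   # a line from a coordinate matrix
pol <- st_polygon(list(rbind(c(26.70, 58.37), c(26.73, 58.37),
                             c(26.73, 58.39), c(26.70, 58.37)))) # a polygon: a closed ring
```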

3 GPS data collection with MobilityLog App

In Mobility Lab, mobility data is, among other methods, also collected with the help of a special mobile phone application, “MobilityLog”. MobilityLog is an Android application designed for long-term mobility tracking and social-network-related experiments. The app was developed in joint cooperation between the Computer Laboratory of the University of Cambridge and the Mobility Lab of the University of Tartu. MobilityLog tracks location with the help of GPS, WiFi access points, the accelerometer, and mobile network cells. It also records other phone uses such as call events, battery level, screen activation, or alarm clock. It’s a powerful tool to monitor activities carried out by the phone user.

Thus, in addition to location information (GPS), “MobilityLog” can record other behavioural information. The app sends the collected data daily to the servers of the Mobility Lab at the University of Tartu, where the quality of the received data is checked. After the quality check, a set of algorithms calculates predefined parameters of the respondent’s behaviour.

Thus, the “MobilityLog” application can be used to monitor/sense the activity of the phone user. Every respondent has an Android phone with the data collection app installed. In order to protect the respondents’ privacy, the data is pseudonymized in the initial stage of data processing.

3.1 GPS-data for current workshop

The current dataset contains three weeks of GPS data recorded during a tourism trip in California. The data was recorded using MobilityLog. The dataset starts on 2018-04-21 and ends on 2018-05-10. Altogether it contains 197 375 GPS points.

4 Analysis of the MobilityLog GPS data

4.1 Data import and preparation for the analysis

In the real world, data preparation is very often one of the most tedious steps of the analysis process. Today we skip most of the obstacles and import pre-prepared data directly from the Internet into R.

url <-  "http://aasa.ut.ee/digitalMethods/digitalMethodsGPS.zip" # define the dataset address
download.file(url, "digitalMethodsGPS.zip") # download file
unzip("digitalMethodsGPS.zip", exdir = "data") # unzip files
file.remove("digitalMethodsGPS.zip") # tidy up by removing the zip file
## [1] TRUE
gpsData <- read_csv2("data/gps_us.csv") # import data to R
glimpse(gpsData) # structure of imported file 
## Observations: 185,808
## Variables: 17
## $ user_id        <int> 241, 241, 241, 241, 241, 241, 241, 241, 241, 24...
## $ id             <int> 646380266, 646380267, 646380268, 646380269, 646...
## $ restart        <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ counter        <int> 66718, 66720, 66723, 66730, 66732, 66734, 66736...
## $ time_system    <dbl> 1.524345e+12, 1.524345e+12, 1.524345e+12, 1.524...
## $ time_gps       <dbl> 1.524345e+12, 1.524345e+12, 1.524345e+12, 1.524...
## $ time_system_ts <dttm> 2018-04-22 00:10:39, 2018-04-22 00:10:40, 2018...
## $ time_gps_ts    <dttm> 2018-04-22 00:10:36, 2018-04-22 00:10:37, 2018...
## $ accuracy       <int> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6,...
## $ altitude       <dbl> -23.991000, -24.298600, -12.608500, -9.616940, ...
## $ bearing        <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ speed          <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.0...
## $ header_id      <int> 226334354, 226334354, 226334354, 226334360, 226...
## $ series_id      <int> 1042, 1042, 1042, 1042, 1042, 1042, 1042, 1042,...
## $ year           <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,...
## $ X              <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -12...
## $ Y              <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.6497...

In this tutorial we use the package dplyr for filtering, arranging, selecting, mutating, grouping and summarising data. And for better (human-readable) code we use pipes.
With glimpse() we see that the dataset contains 17 variables and 185 808 rows. There are integer, double and date-time variables. To reduce the dataset size we keep only the columns relevant for the current analysis (time, accuracy, altitude, bearing, speed, longitude [X], latitude [Y]):

gpsData <- gpsData %>% select(time_system_ts, accuracy, altitude, bearing, speed, X, Y)
gpsData %>% head() %>% kable() # print the formatted head of the table
time_system_ts accuracy altitude bearing speed X Y
2018-04-22 00:10:39 12 -23.99100 165.957 0.104187 -121.3842 38.64978
2018-04-22 00:10:40 16 -24.29860 NA NA -121.3842 38.64976
2018-04-22 00:10:41 16 -12.60850 NA 0.000000 -121.3842 38.64976
2018-04-22 00:10:42 6 -9.61694 NA 0.000000 -121.3842 38.64976
2018-04-22 00:10:43 4 -8.17016 NA 0.000000 -121.3843 38.64971
2018-04-22 00:10:44 6 -5.67983 NA 0.000000 -121.3843 38.64971

Before getting into trouble we have to convert the date-time field to the correct form. The data recorder (the smartphone with the MobilityLog app) originally stored all locations in the Estonian time zone (UTC+3). The time zone of California (UTC−7) differs by 10 hours. This means we have to subtract 10 hours from the values in the date-time column:

gpsData <- gpsData %>% mutate(time_system_ts = time_system_ts - hours(10))
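Subtracting a fixed offset works for this particular trip. As an alternative sketch, lubridate can also handle time zones by name, which accounts for daylight saving automatically; this would replace the subtraction above (run one approach or the other, not both):

```r
# reinterpret the timestamps as Estonian local time, then convert to California time
# gpsData <- gpsData %>%
#   mutate(time_system_ts = with_tz(force_tz(time_system_ts, "Europe/Tallinn"),
#                                   "America/Los_Angeles"))
```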

Now the data should be ready for the analysis. Let’s summarise the data: when does it start, when does it end, and how many GPS points does it contain?

gpsData %>% summarise(start = min(time_system_ts), end = max(time_system_ts), 'GPS points' = n()) %>% kable()
start end GPS points
2018-04-21 14:10:39 2018-05-10 13:56:51 185808

The data collection application MobilityLog is designed to save the smartphone’s battery as much as possible. For that reason the GPS sensor is switched off every time the app detects that the phone is not moving. This means that the dataset contains a lot of gaps. Usually gaps in data indicate data collection problems, but in the current case they carry significant information: where and when did the respondent stop during the trip?
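The gaps themselves can be extracted directly. A minimal sketch: compute the time difference between consecutive fixes with lag() and keep the long ones (the 5-minute threshold is an arbitrary choice for illustration):

```r
# stops show up as long gaps between consecutive GPS fixes
gpsStops <- gpsData %>%
  arrange(time_system_ts) %>%
  mutate(gap_min = as.numeric(difftime(time_system_ts, lag(time_system_ts),
                                       units = "mins"))) %>%
  filter(gap_min > 5) # keep only gaps longer than 5 minutes
```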

4.2 How to visualize gaps in data?

We could just plot one of the recorded variables, for example altitude, against time. For visualizations in this workshop we use the package ggplot2.

ggplot2 webpage: It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).

glimpse(gpsData) # check to remember the column names
## Observations: 185,808
## Variables: 7
## $ time_system_ts <dttm> 2018-04-21 14:10:39, 2018-04-21 14:10:40, 2018...
## $ accuracy       <int> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6,...
## $ altitude       <dbl> -23.991000, -24.298600, -12.608500, -9.616940, ...
## $ bearing        <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ speed          <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.0...
## $ X              <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -12...
## $ Y              <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.6497...
ggplot()+
  geom_point(data = gpsData, aes(x=time_system_ts, y=altitude), size=0.25, col="red")

Looks like there are only a few gaps in the data!? Suspicious… If we assume that the GPS records a location every second, then the dataset should contain about 1 640 772 (≈1.64 × 10^6) rows, but it is ca 10 times smaller. This means we have to be more clever to bring out the gaps. It seems that the horizontal axis is compressed and the gaps are hidden under the data points. One possibility is to create a tile plot, where the horizontal axis represents days and the vertical axis represents time of day. Additionally, we can colour the data points according to the GPS speed.
To do that, we have to calculate date and time columns. To avoid potential problems with time we calculate it on a decimal scale, where 0 is midnight, 12 is midday and 24 is midnight again:

gpsData <- gpsData %>% dplyr::mutate(time_dec=hour(time_system_ts)+(minute(time_system_ts)/60), date=as.Date(time_system_ts))

And now the plot:

# first version:
# pay attention to subset(), which filters out GPS points where speed = 0
ggplot()+
  geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))

# the first version wasn't too beautiful. Some polishing:
ggplot()+
  theme(panel.background = element_rect(fill="black"),
        panel.grid.major = element_line(colour = "grey10"),
        panel.grid.minor = element_line(colour = "grey10"),
        plot.background = element_rect(fill="black"),
        legend.background = element_rect(fill="black"),
        legend.text = element_text(colour = "grey"), 
        legend.title = element_text(colour = "grey"),
        axis.text = element_text(colour="grey"),
        axis.title = element_text(colour = "grey"),
        plot.title = element_text(colour="grey"))+ #theme() allows to change all the design elements of plot
  geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))+ # the plot itself
  scale_fill_scico(palette = "lajolla", direction = -1)+ #colour palette
  labs(title="Time in movement", x="date", y="time")+ #titles
  scale_y_continuous(breaks = seq(0, 24, 3))+ # change the time axis to better format
  guides(fill=F) # hide the legend

4.3 Save the plot!

There is no use for a beautiful visualization if we can’t export it for our research papers and presentations. Saving the plot as a raster or vector image is easy:

  • define the file where the output will be written
  • write plot to the file
  • close the file
# to png (jpeg had some problems with Mac OS)
png("name.png", units = "in", res = 300, height = 4, width = 6)

ggplot()+
  theme(panel.background = element_rect(fill="black"),
        panel.grid.major = element_line(colour = "grey10"),
        panel.grid.minor = element_line(colour = "grey10"),
        plot.background = element_rect(fill="black"),
        legend.background = element_rect(fill="black"),
        legend.text = element_text(colour = "grey"), 
        legend.title = element_text(colour = "grey"),
        axis.text = element_text(colour="grey"),
        axis.title = element_text(colour = "grey"),
        plot.title = element_text(colour="grey"))+ #theme() allows to change all the design elements of plot
  geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))+ # the plot itself
  scale_fill_scico(palette = "lajolla", direction = -1)+ #colour palette
  labs(title="Time in movement", x="date", y="time")+ #titles
  scale_y_continuous(breaks = seq(0, 24, 3))+ # change the time axis to better format
  guides(fill=F) # hide the legend

dev.off()
## png 
##   2
# to pdf
# try help: ?pdf
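As an alternative sketch to the device-based workflow above, ggplot2’s ggsave() writes the last printed plot to a file in a single call, guessing the format from the file extension:

```r
# save the most recently printed plot; format is taken from the extension
# ggsave("name.pdf", width = 6, height = 4, units = "in")
# ggsave("name.png", width = 6, height = 4, units = "in", dpi = 300)
```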

4.4 Map the data!

It’s spatial data, so we should study spatial patterns. For this, we have to make a map!

ggplot()+
  geom_point(data = gpsData, aes(y = Y, x=X))

Plotted spatial data is not yet a map! A map needs spatial context, so we need a base map. The easiest solution is to use the package ggmap, which retrieves base maps from Google Maps.
First we have to download the map:

Ca_map <- get_map(location = "California", maptype = "terrain", color = "bw", zoom = 7)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=California&zoom=7&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=California&sensor=false

(You can try “satellite” or “hybrid” instead of “terrain”.)

To publish the map the grammar is very similar to ggplot2, just the first line defines that ggmap is used:

ggmap(Ca_map)+
  theme_map()+
  geom_point(data = gpsData, aes(y = Y, x=X), colour ="red", size=0.5, alpha=0.5)

Ok? Of course you can add a title and tune the colours…
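For instance, a title can be added with labs(), just as in plain ggplot2 (the title text here is only an example):

```r
ggmap(Ca_map)+
  theme_map()+
  geom_point(data = gpsData, aes(y = Y, x = X), colour = "red", size = 0.5, alpha = 0.5)+
  labs(title = "GPS points of the California trip") # illustrative title
```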

4.5 Where were the highest and lowest points visited during the trip?

For that we can make a really simple query:

trip_extremes <- gpsData %>% filter(altitude == min(altitude) | altitude == max(altitude)) 
trip_extremes %>% kable()
time_system_ts accuracy altitude bearing speed X Y time_dec date
2018-04-27 10:58:30 4 3073.550 NA 0.000000 -118.1789 37.38543 10.966667 2018-04-27
2018-05-08 07:07:35 96 -192.916 348.241 0.216682 -122.3295 37.20981 7.116667 2018-05-08
# altitude -192 suggests a point below sea level; it's probably a GPS error (look at the accuracy!)

And now the map:

ggmap(Ca_map)+
  theme_map()+
  #geom_point(data = gpsData, aes(y = Y, x=X), colour ="red", size=0.5, alpha=0.5)
  geom_point(data = trip_extremes, aes(y = Y, x=X, colour = altitude), size=2)+
  scale_colour_gradient(low="blue", high = "red")+
  guides(colour = F)

We should add more information to the map. For example:

  • What was the altitude?
  • When was the place visited?

ggmap(Ca_map)+
  theme_map()+
  geom_point(data = trip_extremes, aes(y = Y, x=X, colour = altitude), size=2)+
  geom_label(data = trip_extremes, aes(y = Y + 0.2, x=X, colour = altitude, label= round(altitude, 1)), size=3)+
  geom_label(data = trip_extremes, aes(y = Y - 0.2, x=X, colour = altitude, label= as.Date(time_system_ts)), size=2.5, alpha=0.5)+
  scale_colour_gradient(low="blue", high = "red")+
  guides(colour = F)

So much (or rather, so little) about GPS data. It was just a featherlight touch of the possibilities…

5 Passive Mobile Positioning

Passive mobile positioning means that we are using Call Detail Records (CDR). A CDR dataset contains log files collected by mobile network operators to monitor and keep track of billable calling services used by their clients. Such a dataset, containing information about the phone call initiator (a unique and anonymous ID), the time of calling activities, and the location of the mobile antenna through which the call was made, is an excellent source for analysing the spatio-temporal behaviour of phone users.
The CDR data in the following example is stored in two separate files. One of them is the log file of calling activities (cdr_example) and the other contains spatial information about the mobile antennas (cdr_antennas).
First we import the log file:

cdr_example <- read.csv2("http://aasa.ut.ee/digitalMethods/data/cdr_example.csv", sep=";", dec = ".")
glimpse(cdr_example)
## Observations: 42,112
## Variables: 3
## $ time       <fct> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-0...
## $ site_id    <int> 937, 937, 1242, 753, 753, 507, 1742, 1746, 1746, 17...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...

At first glimpse everything looks OK! The only problem is that time is stored not as date-time but as a factor. Convert time to the proper format and calculate date, hour and minute columns:

cdr_example <- cdr_example %>% mutate(time = ymd_hms(as.character(time)), date = as.Date(time), hour = hour(time), minute = minute(time))
glimpse(cdr_example)
## Observations: 42,112
## Variables: 6
## $ time       <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-...
## $ site_id    <int> 937, 937, 1242, 753, 753, 507, 1742, 1746, 1746, 17...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour       <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8...
## $ minute     <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 4...

Now we import the table of antennas:

cdr_antennas <- read.csv2("http://aasa.ut.ee/digitalMethods/data/sites.csv", sep=";", dec = ",")
glimpse(cdr_antennas)
## Observations: 1,117
## Variables: 3
## $ site_id <int> 351, 243, 601, 974, 667, 424, 269, 252, 187, 749, 1096...
## $ lat     <dbl> 57.74861, 59.08381, 59.40694, 59.38417, 58.38350, 58.0...
## $ lon     <dbl> 26.35027, 24.63264, 24.81722, 24.82553, 24.48503, 26.4...

The easiest way to perform a quality check is to plot the data:

ggplot()+
  geom_point(data = cdr_antennas, aes(x=lon, y=lat))

Now we have to put the two tables together. This process is called joining. In the current case we use left_join, which returns all records from the left table (cdr_example) and the matched records from the right table (cdr_antennas). For that we have to define the key column by which the tables are joined. Where there is no match, the columns from the right side are filled with NA.

cdr_example <- left_join(cdr_example, cdr_antennas, by = "site_id")
glimpse(cdr_example)
## Observations: 42,112
## Variables: 8
## $ time       <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-...
## $ site_id    <int> 937, 937, 1242, 753, 753, 507, 1742, 1746, 1746, 17...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour       <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8...
## $ minute     <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 4...
## $ lat        <dbl> NA, NA, NA, 59.41333, 59.41333, 58.09139, NA, NA, N...
## $ lon        <dbl> NA, NA, NA, 24.75564, 24.75564, 26.10084, NA, NA, N...

We see that some coordinate fields are empty (NA). This means that location information is missing for those antennas. Filter them out!
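Before dropping them, it can be worth checking how many calling activities are actually affected. A minimal sketch:

```r
# count activities on antennas without coordinates vs. the total
cdr_example %>%
  summarise(missing_coords = sum(is.na(lat)),
            total = n())
```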

cdr_example <- cdr_example %>% filter(!is.na(lat)) # is.na() finds all cells where value is NA (not available), !is.na() works vice versa

5.1 Frequently visited places

glimpse(cdr_example)
## Observations: 4,564
## Variables: 8
## $ time       <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id    <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour       <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute     <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat        <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon        <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
cdr_antenna_count <- cdr_example %>% group_by(lat, lon) %>% summarise(n=n()) %>% ungroup()

ggplot()+
  geom_point(data = cdr_antenna_count, aes(x=lon, y=lat, size=n, alpha=n), colour="red")

For spatial context we download a base map of Estonia using ggmap:

ET_map <- get_map(location ="Estonia", maptype = "hybrid", color ="bw", zoom =7)
ggmap(ET_map)

Visualize frequently visited places:

ggmap(ET_map)+
  theme_map()+
  theme(legend.position = "right")+
  geom_point(data = cdr_antenna_count, aes(x=lon, y=lat, size=n, alpha=n), colour="red")

5.2 Find home and work location!

In mobility studies, the location of a respondent’s home or workplace is crucial. Can we detect it from mobile positioning data as well?
We may assume that home is the most important place for every person. This allows us to assume that at home we use our mobile phones the most, and that the workplace is the second most important place (of course these assumptions are oversimplified, but we don’t have much time to dig deeper).
In the following query we count (n()) all calling activities, grouping them by antenna. Afterwards we sort the result in descending order, and finally the table’s head is saved to a new data frame:

cdr_example_aggrActivities <- cdr_example %>% group_by(site_id) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
cdr_example_aggrActivities %>% kable()
site_id n
938 1136
752 385
502 212
753 196
55 162
504 136

According to the assumptions we defined earlier, we can see from the results that antenna (site) 938 is the home location and antenna 752 is the workplace. Where are they located in space? To answer that question we have to include the coordinates as well:

cdr_example_aggrActivities <- cdr_example %>% group_by(site_id, lon, lat) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
cdr_example_aggrActivities %>% kable()
site_id lon lat n
938 27.15611 59.43528 1136
752 24.83249 59.42916 385
502 24.75611 59.43777 212
753 24.75564 59.41333 196
55 24.81474 59.42226 162
504 24.79641 59.17908 136

And then we put the possible home and workplace on the map:

ggmap(ET_map)+
  theme_map()+
  theme(legend.position = "right")+
  geom_point(data = cdr_example_aggrActivities[1,], aes(x=lon, y=lat, colour="home"), size=2)+ # [1,] selects first row in table
  geom_point(data = cdr_example_aggrActivities[2,], aes(x=lon, y=lat, colour = "work"), size=2)+
  scale_colour_manual(values = c("orange", "dodgerblue"))

5.3 Visualize movement path

CDR data is commonly used in mobility studies. The accuracy of the data does not allow exact turn-by-turn analysis of movement, but it is a priceless source for analysing commuting, tourism, migration, etc. patterns.
Again, let’s check the column names, and then plot the paths on the map.

glimpse(cdr_example)
## Observations: 4,564
## Variables: 8
## $ time       <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id    <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour       <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute     <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat        <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon        <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
ggplot()+
  geom_path(data = cdr_example, aes(x=lon, y=lat), size = 0.2, col="red")

The result looks like a plate of spaghetti; it’s very hard to tell which places were visited more often and which less often.
Maybe changing the transparency (alpha) value helps?

ggplot()+
  geom_path(data = cdr_example, aes(x=lon, y=lat), size = 0.2, col="red", alpha=0.5)

Nothing better… That’s because the path is drawn as a single line. But there is a way to improve it!
We should draw a separate line between every two consecutive location points. For that, geom_segment can be used. To define the end of each segment we use lead(), which takes the value from the next row:

ggplot()+
  geom_segment(data = cdr_example, aes(x = lon, xend = lead(lon, 1), y=lat, yend = lead(lat, 1)), alpha=0.05, size=0.25)
## Warning: Removed 1 rows containing missing values (geom_segment).

You can test different alpha values (0…1) to see how the output changes!
Let’s put the path on the base map:

ggmap(ET_map, darken=c(0.25, "dodgerblue"))+ # we can change the tonality of base map
  geom_segment(data = cdr_example, aes(x = lon, xend = lead(lon, 1), y=lat, yend = lead(lat, 1)), alpha=0.05, size=0.25, colour="orange")
## Warning: Removed 1 rows containing missing values (geom_segment).

5.4 Temporal activity

CDR data is not used only in mobility studies; it can give much more information about the phone carrier. If we assume that starting a phone call or sending an SMS is an act of free will, then we can analyse the temporal behaviour of the phone user. For that, let us also calculate the weekday names:

# R gives the weekdays in language defined in your computer settings
# If you want you can switch to english:
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
cdr_example <- cdr_example %>% mutate(wday = lubridate::wday(time, abbr = T, label =T))

# change back to your language:
Sys.setlocale("LC_TIME","Estonian") # in current case it's estonian
## [1] "Estonian_Estonia.1257"
glimpse(cdr_example)
## Observations: 4,564
## Variables: 9
## $ time       <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id    <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour       <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute     <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat        <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon        <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
## $ wday       <ord> Tue, Tue, Tue, Tue, Wed, Wed, Wed, Wed, Wed, Wed, T...
# list weekdays:
cdr_example %>% distinct(wday)
##   wday
## 1  Tue
## 2  Wed
## 3  Thu
## 4  Fri
## 5  Sat
## 6  Mon
## 7  Sun
# reorder them:
cdr_example$wday <- factor(cdr_example$wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
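As an alternative sketch, lubridate’s week_start argument yields the factor levels already in Mon–Sun order, so the locale switching and manual reordering above would not be needed; this would replace the steps above rather than follow them:

```r
# week_start = 1 makes Monday the first factor level directly
# cdr_example <- cdr_example %>%
#   mutate(wday = lubridate::wday(time, label = TRUE, abbr = TRUE, week_start = 1))
```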

5.4.1 Chronocode:

ggplot()+
  geom_point(data = cdr_example, aes(x=wday, y=hour + (minute / 60)), shape=3, alpha = 0.075, size=6)+
  scale_y_continuous(breaks = seq(0, 24, 3))+
  labs(y="time, hh:mm")

Visually it doesn’t look too bad, but we should add some values or numbers. Otherwise we can only say that the person is more active on working days (calls more and wakes earlier) and usually does not use the phone during the night.
Let’s calculate some calling-time percentiles and add them to the plot (we may assume that an active day starts once 10% of calls are made, midday arrives once 50% of calls are made, and evening starts once 90% of calls are made):

# calculate time in decimal format:
cdr_example <- cdr_example %>% mutate(time_dec = hour + (minute / 60))
glimpse(cdr_example)
## Observations: 4,564
## Variables: 10
## $ time       <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id    <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour       <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute     <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat        <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon        <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
## $ wday       <ord> Tue, Tue, Tue, Tue, Wed, Wed, Wed, Wed, Wed, Wed, T...
## $ time_dec   <dbl> 9.183333, 9.250000, 9.366667, 17.700000, 10.400000,...
cdr_quart <- cdr_example %>% group_by(wday) %>% summarise(q1 = quantile(time_dec, 0.10), q2 = quantile(time_dec, 0.5), q3 = quantile(time_dec, 0.90)) %>% ungroup()

ggplot()+
  geom_point(data = cdr_example, aes(x=wday, y=hour + (minute / 60)), shape=3, alpha = 0.075, size=6)+
  geom_segment(data = cdr_quart, aes(x=wday, xend =wday, y=q1, yend = q3), colour="red", size=1)+
  geom_point(data = cdr_quart, aes(x= wday, y =q1), colour="red", size=4, shape = "-")+
  geom_point(data = cdr_quart, aes(x= wday, y =q2), colour="red", size=3)+
  geom_point(data = cdr_quart, aes(x= wday, y =q3), colour="red", size=4, shape="-")+
  scale_y_continuous(breaks = seq(0, 24, 3))+
  labs(y="time, hh:mm")

5.4.2 Aggregate and count by hour

A more traditional way is to count calling activities by hour:

glimpse(cdr_example)
## Observations: 4,564
## Variables: 10
## $ time       <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id    <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour       <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute     <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat        <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon        <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
## $ wday       <ord> Tue, Tue, Tue, Tue, Wed, Wed, Wed, Wed, Wed, Wed, T...
## $ time_dec   <dbl> 9.183333, 9.250000, 9.366667, 17.700000, 10.400000,...
cdr_example_hourAggr <-  cdr_example %>% group_by(wday, hour) %>% summarise(n=n()) %>% ungroup()

ggplot()+
  geom_line(data = cdr_example_hourAggr, aes(x=hour, y=n, group=wday, colour=as.factor(wday)))+
  facet_wrap(~wday)+
  scale_colour_manual(values = c("dodgerblue", "steelblue", "blue", "cyan", "navyblue", "orange", "red"))+
  scale_x_continuous(breaks = seq(0, 24, 3))+ # here the hours are on the x-axis
  labs(x = "hour", y = "number of calls", colour = "day")

5.4.3 Temporal activity at home/work:

#Home antenna:
cdr_example_aggrActivities[1,]
## # A tibble: 1 x 4
## # Groups:   site_id, lon [1]
##   site_id   lon   lat     n
##     <int> <dbl> <dbl> <int>
## 1     938  27.2  59.4  1136
#workplace antenna
cdr_example_aggrActivities[2,]
## # A tibble: 1 x 4
## # Groups:   site_id, lon [1]
##   site_id   lon   lat     n
##     <int> <dbl> <dbl> <int>
## 1     752  24.8  59.4   385
cdr_example_H <- cdr_example %>% filter(site_id == cdr_example_aggrActivities$site_id[1])
cdr_example_W <- cdr_example %>% filter(site_id == cdr_example_aggrActivities$site_id[2])
cdr_example_H_timeAggr <- cdr_example_H %>% group_by(wday, hour) %>% summarise(n = n()) %>% ungroup()
cdr_example_W_timeAggr <- cdr_example_W %>% group_by(wday, hour) %>% summarise(n = n()) %>% ungroup()

ggplot()+
  geom_line(data = cdr_example_H_timeAggr, aes(x=hour, y=n, colour="home"))+
  geom_line(data = cdr_example_W_timeAggr, aes(x=hour, y=n, colour="work"))+
  scale_x_continuous(breaks = seq(0, 24, 3))+
  facet_wrap(~wday)

Was the home and workplace determination successful?
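One hedged way to answer this is to compare the share of night-time events at the two antennas: a correctly identified home antenna should show clearly more activity at night than the workplace antenna. Below is a minimal sketch; the helper night_share() is hypothetical (not part of the workflow above), and the cdr_example_H/cdr_example_W subsets created earlier are assumed to exist.

```r
# Share of events that fall into a night-time window (21:00-06:59).
# For a correctly identified home antenna this share should be clearly
# higher than for the workplace antenna.
night_share <- function(hours) {
  mean(hours >= 21 | hours < 7)
}

# With the subsets created above (assumed to exist):
# night_share(cdr_example_H$hour)   # expected: relatively high
# night_share(cdr_example_W$hour)   # expected: relatively low

# Small self-contained example:
night_share(c(23, 2, 5, 10, 14))  # 3 of 5 events are at night -> 0.6
```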

6 Crime in Estonia

Today we read and hear more and more about big data and open data. Big data we have already covered (in reality both datasets are very big; here we looked only at small fragments).
What about open data? In Estonia, state-produced linked open data is accessible through the Open Data Portal of Estonia. An important feature of this data is that it is often machine readable.

Let's look at one of the datasets from the Estonian Police and Border Guard Board. It contains information about crimes in Estonia. The dataset is updated weekly and freely available.

Download it:

crime_data <- read_delim("https://opendata.smit.ee/ppa/csv/avalik_1.csv", 
                    "\t", escape_double = FALSE, col_types = cols(ToimKell = col_character(),
                                                                  ToimKpv = col_character()), trim_ws = TRUE)
glimpse(crime_data)
## Observations: 13,423
## Variables: 18
## $ JuhtumId                <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4"...
## $ ToimKpv                 <chr> "2018-08-08", "2018-08-07", "2018-08-0...
## $ ToimKell                <chr> "00:37", "19:07", "18:03", "18:00", "1...
## $ ToimNadalapaev          <chr> "Kolmapäev", "Teisipäev", "Teisipäev",...
## $ SyndmusLiik             <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS...
## $ SyndmusTaiendavStatLiik <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_V...
## $ Seadus                  <chr> "Karistusseadustik", "Karistusseadusti...
## $ Paragrahv               <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262."...
## $ ParagrahvTais           <chr> "§ 262. Avaliku korra rikkumine", "§ 2...
## $ Loige                   <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1."...
## $ Kahjusumma              <chr> NA, NA, "0-499", "0-499", "0-499", "0-...
## $ KohtLiik                <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "...
## $ MaakondNimetus          <chr> "Ida-Viru maakond", "Harju maakond", "...
## $ ValdLinnNimetus         <chr> "Narva linn", "Tallinn", "Tallinn", "N...
## $ KohtNimetus             <chr> "Narva linn", "Haabersti linnaosa", "L...
## $ Lest_X                  <chr> "6587500-6587999", "6587500-6587999", ...
## $ Lest_Y                  <chr> "738000-738499", "536500-536999", "547...
## $ SyyteoLiik              <chr> "VT", "VT", "VT", "VT", "VT", "VT", "V...
# sometimes an easier solution exists (in this case the default download works without any additional parameters):
crime_data_2 <- fread("https://opendata.smit.ee/ppa/csv/avalik_1.csv", encoding = "UTF-8")
glimpse(crime_data_2)
## Observations: 13,423
## Variables: 18
## $ JuhtumId                <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4"...
## $ ToimKpv                 <chr> "2018-08-08", "2018-08-07", "2018-08-0...
## $ ToimKell                <chr> "00:37", "19:07", "18:03", "18:00", "1...
## $ ToimNadalapaev          <chr> "Kolmapäev", "Teisipäev", "Teisipäev",...
## $ SyndmusLiik             <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS...
## $ SyndmusTaiendavStatLiik <chr> "", "", "", "", "", "", "", "", "MUU_V...
## $ Seadus                  <chr> "Karistusseadustik", "Karistusseadusti...
## $ Paragrahv               <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262."...
## $ ParagrahvTais           <chr> "§ 262. Avaliku korra rikkumine", "§ 2...
## $ Loige                   <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1."...
## $ Kahjusumma              <chr> "", "", "0-499", "0-499", "0-499", "0-...
## $ KohtLiik                <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "...
## $ MaakondNimetus          <chr> "Ida-Viru maakond", "Harju maakond", "...
## $ ValdLinnNimetus         <chr> "Narva linn", "Tallinn", "Tallinn", "N...
## $ KohtNimetus             <chr> "Narva linn", "Haabersti linnaosa", "L...
## $ Lest_X                  <chr> "6587500-6587999", "6587500-6587999", ...
## $ Lest_Y                  <chr> "738000-738499", "536500-536999", "547...
## $ SyyteoLiik              <chr> "VT", "VT", "VT", "VT", "VT", "VT", "V...
crime_data %>% head() %>% kable()
JuhtumId ToimKpv ToimKell ToimNadalapaev SyndmusLiik SyndmusTaiendavStatLiik Seadus Paragrahv ParagrahvTais Loige Kahjusumma KohtLiik MaakondNimetus ValdLinnNimetus KohtNimetus Lest_X Lest_Y SyyteoLiik
df5d4c9c-fde4-18d5-a4bc-97886cf239c4 2018-08-08 00:37 Kolmapäev AVALIKU_KORRA_RIKKUMINE NA Karistusseadustik § 262. § 262. Avaliku korra rikkumine lg. 1. NA RAVIASUTUS Ida-Viru maakond Narva linn Narva linn 6587500-6587999 738000-738499 VT
df5d4c92-fde4-18d5-a4bc-97886cf239c4 2018-08-07 19:07 Teisipäev PISIVARGUS NA Karistusseadustik § 218. § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu lg. 1. NA AVALIK_KOHT,KAUPLUS Harju maakond Tallinn Haabersti linnaosa 6587500-6587999 536500-536999 VT
df5d4c7e-fde4-18d5-a4bc-97886cf239c4 2018-08-07 18:03 Teisipäev PISIVARGUS NA Karistusseadustik § 218. § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu lg. 1. 0-499 AVALIK_KOHT,KAUPLUS Harju maakond Tallinn Lasnamäe linnaosa 6588000-6588499 547500-547999 VT
df5d4c74-fde4-18d5-a4bc-97886cf239c4 2018-08-07 18:00 Teisipäev AVALIKU_KORRA_RIKKUMINE NA Karistusseadustik § 262. § 262. Avaliku korra rikkumine lg. 1. 0-499 OUEALA_LAHTINE_HOOV Ida-Viru maakond Narva linn Narva linn 6590500-6590999 737500-737999 VT
df5d4c60-fde4-18d5-a4bc-97886cf239c4 2018-08-07 16:06 Teisipäev KELMUS NA Karistusseadustik § 218. § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu lg. 1. 0-499 AVALIK_KOHT,PARKLA,TANAV_VALJAK Harju maakond Tallinn Põhja-Tallinna linnaosa 6588000-6588499 539000-539499 VT
df5d4b20-fde4-18d5-a4bc-97886cf239c4 2018-08-07 10:30 Teisipäev PISIVARGUS NA Karistusseadustik § 218. § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu lg. 1. 0-499 AVALIK_KOHT,TANAV_VALJAK Harju maakond Tallinn Kesklinna linnaosa 6588500-6588999 541500-541999 VT

Unfortunately the table is in Estonian, but hopefully we can still analyse it! First we translate the column names into English:

crime_names <- colnames(crime_data)
crime_names_en <-c("CaseId", "Date", "Time", "Weekday", "CaseType", "CaseTypeAdditional", "Law", "Paragraph", "ParagraphFull", "Section", "DamagesEuro", "PlaceType", "County", "Municipality", "Place", "Lest_X", "Lest_Y", "Type")

data_frame("Columns, et" = crime_names, "Columns, en" = crime_names_en) %>% kable()
Columns, et Columns, en
JuhtumId CaseId
ToimKpv Date
ToimKell Time
ToimNadalapaev Weekday
SyndmusLiik CaseType
SyndmusTaiendavStatLiik CaseTypeAdditional
Seadus Law
Paragrahv Paragraph
ParagrahvTais ParagraphFull
Loige Section
Kahjusumma DamagesEuro
KohtLiik PlaceType
MaakondNimetus County
ValdLinnNimetus Municipality
KohtNimetus Place
Lest_X Lest_X
Lest_Y Lest_Y
SyyteoLiik Type

Replace the column names with the English versions:

colnames(crime_data) <- crime_names_en
glimpse(crime_data)
## Observations: 13,423
## Variables: 18
## $ CaseId             <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date               <chr> "2018-08-08", "2018-08-07", "2018-08-07", "...
## $ Time               <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday            <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType           <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law                <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph          <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull      <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section            <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro        <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType          <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County             <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality       <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place              <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ Lest_X             <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y             <chr> "738000-738499", "536500-536999", "547500-5...
## $ Type               <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...

To check which categories a variable contains, we can use the command distinct:

LIST_caseType <- crime_data %>% distinct(CaseType)
LIST_caseType %>% head() %>% kable()
CaseType
AVALIKU_KORRA_RIKKUMINE
PISIVARGUS
KELMUS
VARGUS
VANDALISM
PISIVARGUS,VARGUS

Next we should check the frequency of different categories:

TOP_caseType <- crime_data %>% group_by(CaseType) %>% summarise(n=n()) %>% arrange(desc(n))
TOP_caseType %>% head() %>% kable()
CaseType n
VARGUS 4453
PISIVARGUS 3912
AVALIKU_KORRA_RIKKUMINE 1579
VANDALISM 1089
AVALIKU_KORRA_RIKKUMINE,KEHALINE_VAARKOHTLEMINE 704
MUU 358

As we see once again, the working language of the Estonian police is Estonian. But the top 5 crime types are easy to translate into English:

crime type, et link to google translate
VARGUS https://translate.google.com/#et/en/vargus
PISIVARGUS https://translate.google.com/#et/en/pisivargus
AVALIKU_KORRA_RIKKUMINE https://translate.google.com/#et/en/avaliku_korra_rikkumine
VANDALISM https://translate.google.com/#et/en/vandalism
AVALIKU_KORRA_RIKKUMINE,KEHALINE_VAARKOHTLEMINE https://translate.google.com/#et/en/avaliku_korra_rikkumine,kehaline_vaarkohtlemine
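For convenience we can also hard-code such translations into a lookup vector and use it to recode the codes. This is only a sketch: the English labels are my own everyday translations, and the lookup crime_type_en and helper translate_case_type() are hypothetical, not part of the original data.

```r
# Hypothetical lookup: Estonian crime-type codes -> English labels
crime_type_en <- c(
  "VARGUS"                  = "theft",
  "PISIVARGUS"              = "petty theft",
  "AVALIKU_KORRA_RIKKUMINE" = "breach of public order",
  "VANDALISM"               = "vandalism",
  "KELMUS"                  = "fraud"
)

# Translate a vector of case-type codes; unknown codes are kept as-is
translate_case_type <- function(x) {
  out <- unname(crime_type_en[x])
  ifelse(is.na(out), x, out)
}

translate_case_type(c("VARGUS", "MUU"))  # "theft" "MUU"
```

The same vector could be used with dplyr, e.g. mutate(CaseTypeEn = translate_case_type(CaseType)).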

6.1 General overview of crimes

Traditional glimpse at the data:

glimpse(crime_data)
## Observations: 13,423
## Variables: 18
## $ CaseId             <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date               <chr> "2018-08-08", "2018-08-07", "2018-08-07", "...
## $ Time               <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday            <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType           <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law                <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph          <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull      <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section            <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro        <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType          <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County             <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality       <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place              <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ Lest_X             <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y             <chr> "738000-738499", "536500-536999", "547500-5...
## $ Type               <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...

As it appears, all the variables were imported as character. We have to convert the columns to proper types. First we convert the date, calculate the weekdays, and extract the hour from the time:

# R gives the weekdays in language defined in your computer settings
# Let's switch to english:
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
crime_data <- crime_data %>% mutate(Date = ymd(Date), wday = lubridate::wday(Date, abbr = T, label =T), hour = as.integer(substr(Time, 1, 2)))
# change back to your own language:
Sys.setlocale("LC_TIME", "Estonian") # in this case Estonian
## [1] "Estonian_Estonia.1257"
glimpse(crime_data)
## Observations: 13,423
## Variables: 20
## $ CaseId             <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date               <date> 2018-08-08, 2018-08-07, 2018-08-07, 2018-0...
## $ Time               <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday            <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType           <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law                <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph          <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull      <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section            <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro        <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType          <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County             <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality       <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place              <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ Lest_X             <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y             <chr> "738000-738499", "536500-536999", "547500-5...
## $ Type               <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
## $ wday               <ord> Wed, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue...
## $ hour               <int> 0, 19, 18, 18, 16, 10, 9, 6, 4, 4, 2, 20, 2...

The crimes dataset also contains spatial information (L-EST97; epsg = 3301). Currently the coordinates are stored in character fields and given as ranges: the coordinates are gridded into roughly 500 m cells. Our task is to convert each cell extent to its centroid:

crime_data <- crime_data %>% mutate(Lest_X_bu = Lest_X, Lest_Y_bu = Lest_Y) # back up coordinate field, because 'separate' will delete the original field
crime_data <- crime_data %>% separate(Lest_X, c("x_min", "x_max"), sep = "-") # split column
crime_data <- crime_data %>% separate(Lest_Y, c("y_min", "y_max"), sep = "-")
crime_data <- crime_data %>% mutate(x_min = as.integer(x_min),
                                    x_max = as.integer(x_max),
                                    y_min = as.integer(y_min),
                                    y_max = as.integer(y_max))

# calculate centroid:
crime_data <- crime_data %>% mutate(x=(x_min + x_max) / 2, y=(y_min + y_max) / 2)
glimpse(crime_data)
## Observations: 13,423
## Variables: 26
## $ CaseId             <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date               <date> 2018-08-08, 2018-08-07, 2018-08-07, 2018-0...
## $ Time               <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday            <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType           <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law                <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph          <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull      <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section            <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro        <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType          <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County             <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality       <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place              <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ x_min              <int> 6587500, 6587500, 6588000, 6590500, 6588000...
## $ x_max              <int> 6587999, 6587999, 6588499, 6590999, 6588499...
## $ y_min              <int> 738000, 536500, 547500, 737500, 539000, 541...
## $ y_max              <int> 738499, 536999, 547999, 737999, 539499, 541...
## $ Type               <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
## $ wday               <ord> Wed, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue...
## $ hour               <int> 0, 19, 18, 18, 16, 10, 9, 6, 4, 4, 2, 20, 2...
## $ Lest_X_bu          <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y_bu          <chr> "738000-738499", "536500-536999", "547500-5...
## $ x                  <dbl> 6587750, 6587750, 6588250, 6590750, 6588250...
## $ y                  <dbl> 738249.5, 536749.5, 547749.5, 737749.5, 539...
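The same range-to-centroid logic can be sketched for a single coordinate string in base R, which makes the computation explicit (range_centroid() is a hypothetical helper, not part of the workflow above):

```r
# Midpoint of a gridded coordinate range such as "6587500-6587999"
range_centroid <- function(s) {
  parts <- as.numeric(strsplit(s, "-", fixed = TRUE)[[1]])
  mean(parts)  # centre of the cell extent
}

range_centroid("6587500-6587999")  # 6587749.5
```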

Plot the distribution of crimes in Estonia:

ggplot()+
  geom_point(data = crime_data, aes(x=x, y=y), size=2, alpha=0.2)
## Warning: Removed 156 rows containing missing values (geom_point).

Well… It should resemble the outline of Estonia, but it does not!

Let’s download the Estonian county borders from the Estonian Land Board:

download.file("https://geoportaal.maaamet.ee/docs/haldus_asustus/maakond_shp.zip", destfile="maakond_shp.zip")
#maakond means county!
unzip("maakond_shp.zip")

# read shp to R:
county <- st_read("maakond_20180801.shp")
## Reading layer `maakond_20180801' from data source `C:\ANTO\loengud\digimeetoditeDoktorikool\mobileDateCollectionMethods\maakond_20180801.shp' using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 369032.1 ymin: 6377141 xmax: 739152.8 ymax: 6634019
## epsg (SRID):    3301
## proj4string:    +proj=lcc +lat_1=58 +lat_2=59.33333333333334 +lat_0=57.51755393055556 +lon_0=24 +x_0=500000 +y_0=6375000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
ggplot()+
  geom_sf(data = county)

Plot crimes data on top of county borders:

ggplot()+
  geom_sf(data = county)+
  geom_point(data = crime_data, aes(x=x, y=y), size=2, alpha=0.2, colour = "red")
## Warning: Removed 156 rows containing missing values (geom_point).

Well, it looks like in the crimes dataset Lest_X actually holds the northing and Lest_Y the easting, so we have to swap them when plotting. Try:

ggplot()+
  geom_sf(data = county, colour = "grey", fill="grey90", size=0.25)+
  geom_point(data = crime_data, aes(x=y, y=x), size=0.5, alpha=0.25, shape=15, colour="red")
## Warning: Removed 156 rows containing missing values (geom_point).

It worked! The spatial pattern of crimes correlates nicely with population density. We can reduce the dataset by aggregating it to the grid:

glimpse(crime_data)
## Observations: 13,423
## Variables: 26
## $ CaseId             <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date               <date> 2018-08-08, 2018-08-07, 2018-08-07, 2018-0...
## $ Time               <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday            <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType           <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law                <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph          <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull      <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section            <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro        <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType          <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County             <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality       <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place              <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ x_min              <int> 6587500, 6587500, 6588000, 6590500, 6588000...
## $ x_max              <int> 6587999, 6587999, 6588499, 6590999, 6588499...
## $ y_min              <int> 738000, 536500, 547500, 737500, 539000, 541...
## $ y_max              <int> 738499, 536999, 547999, 737999, 539499, 541...
## $ Type               <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
## $ wday               <ord> Wed, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue...
## $ hour               <int> 0, 19, 18, 18, 16, 10, 9, 6, 4, 4, 2, 20, 2...
## $ Lest_X_bu          <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y_bu          <chr> "738000-738499", "536500-536999", "547500-5...
## $ x                  <dbl> 6587750, 6587750, 6588250, 6590750, 6588250...
## $ y                  <dbl> 738249.5, 536749.5, 547749.5, 737749.5, 539...
crime_data_grd_aggr <-  crime_data %>% group_by(x, y) %>% summarise(n = n()) %>% ungroup()

ggplot()+
  geom_sf(data = county, colour = "grey", fill="grey90", size=0.25)+
  geom_point(data = crime_data_grd_aggr, aes(x=y, y=x, colour = n, alpha=n))+
  scale_colour_gradientn(colours = c("black", "red", "orange", "yellow"))
## Warning: Removed 1 rows containing missing values (geom_point).

The general picture remained the same, but the dataset is about 10 times smaller (13,423 rows vs 1,154 rows). The only problem with the previous plot is that the cells with large values (crime hotspots) are buried under the other points.
Can we change the plotting order (bigger values on top)? Yes! We can sort the rows by value:

crime_data_grd_aggr <- crime_data_grd_aggr %>% arrange(n)

ggplot()+
  geom_sf(data = county, colour = "grey", fill="grey90", size=0.25)+
  geom_point(data = crime_data_grd_aggr, aes(x=y, y=x, colour = n))+
  scale_colour_gradientn(colours = c("black", "red", "orange", "yellow"))
## Warning: Removed 1 rows containing missing values (geom_point).

6.2 Temporal dynamics of crimes

Group data by weekdays and hours, count number of crimes by groups. Create a plot:

crime_data_aggr_hourWday <- crime_data %>% group_by(wday, hour) %>% summarise(n=n()) %>% ungroup()

ggplot()+
  theme_minimal()+
  geom_line(data = crime_data_aggr_hourWday, aes(x = hour, y=n))+
  facet_wrap(~wday)+
  scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 1 rows containing missing values (geom_path).

The result looks almost okay! By default, R follows the locale convention where the week starts on Sunday, so Sunday appears as the first day. But we can change it! Let's define Monday as the first day of the week:

crime_data %>% distinct(wday)
## # A tibble: 7 x 1
##   wday 
##   <ord>
## 1 Wed  
## 2 Tue  
## 3 Mon  
## 4 Sun  
## 5 Sat  
## 6 Fri  
## 7 Thu
crime_data$wday <- factor(crime_data$wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

Let’s focus on public order violation (avaliku korra rikkumine):

crime_data_publicOrder <- crime_data %>% filter(CaseType == "AVALIKU_KORRA_RIKKUMINE")

crime_data_aggr_hourWday <- crime_data_publicOrder %>% group_by(wday, hour) %>% summarise(n=n()) %>% ungroup()

ggplot()+
  theme_minimal()+
  geom_line(data = crime_data_aggr_hourWday, aes(x = hour, y=n))+
  facet_wrap(~wday)+
  scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 1 rows containing missing values (geom_path).

A line graph is not the best choice here, because the hours of different weekdays are not actually connected with each other. Make a bar plot:

ggplot()+
  theme_minimal()+
  geom_bar(data = crime_data_aggr_hourWday, aes(x = hour, y=n), stat ="identity", fill="red")+
  facet_wrap(~wday)+
  scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 7 rows containing missing values (position_stack).

Many other possibilities:

ggplot()+
  theme_minimal()+
  geom_segment(data = crime_data_aggr_hourWday, aes(x = hour, xend=hour, y=0, yend = n), colour="red")+
  geom_point(data = crime_data_aggr_hourWday, aes(x = hour, y=n), colour="red")+
  facet_wrap(~wday)+
  scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 7 rows containing missing values (geom_segment).
## Warning: Removed 7 rows containing missing values (geom_point).

Hard to compare? Put all days on the same plot:

ggplot()+
  theme_minimal()+
  geom_line(data = crime_data_aggr_hourWday, aes(x = hour, y=n, group=wday, colour=wday))+
  scale_x_continuous(breaks = seq(0, 24, 3))+
  scale_colour_manual(values = c("dodgerblue", "green", "blue", "grey", "orange", "red", "magenta"))
## Warning: Removed 7 rows containing missing values (geom_path).

Still not very beautiful. We can calculate the mean number of crimes for workdays and for the weekend:

crime_data_aggr_hourWeekend <- crime_data_aggr_hourWday %>% mutate(weekend = ifelse(wday == "Sat" | wday == "Sun", "weekend", "workday"))
crime_data_aggr_hourWeekend <- crime_data_aggr_hourWeekend %>% group_by(weekend, hour) %>% summarise(n=mean(n)) %>% ungroup()

ggplot()+
  theme_minimal()+
  geom_line(data = crime_data_aggr_hourWeekend, aes(x=hour, y=n, colour=weekend))+
  geom_point(data = crime_data_aggr_hourWeekend, aes(x=hour, y=n, colour=weekend), shape=21)+
  labs(title="Average dynamics of crimes during the day", caption ="author: A. Aasa")+
  scale_colour_manual(values = c("orange", "dodgerblue"))+
  scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).

6.3 Crimes in Tartu?

We don't have to stay at the national level; we can also "zoom in". For example, we can download a base map of Tartu and put the crimes on that map. (Note that recent versions of ggmap require registering a Google Maps API key with register_google() before get_map() will work.)

tartu_map <- get_map(location ="Tartu, Estonia", maptype = "terrain", color ="bw", zoom =13)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Tartu,+Estonia&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Tartu,%20Estonia&sensor=false
ggmap(tartu_map)+
  theme_map()+
  theme(legend.position = "right")+
  geom_point(data = crime_data_grd_aggr, aes(x=y, y=x, colour = n, alpha=n))
## Warning: Removed 1154 rows containing missing values (geom_point).

The maps do not fit? This is because of different coordinate systems: the Google base map is in geographical coordinates (WGS84, epsg = 4326), while the crimes dataset is in the Estonian official coordinate system (L-EST97; epsg = 3301). To solve this we have to convert the crimes data frame to a spatial object.

crime_data_grd_aggr <-  crime_data %>% group_by(x, y) %>% summarise(n = n()) %>% ungroup()
glimpse(crime_data_grd_aggr)
## Observations: 1,154
## Variables: 3
## $ x <dbl> 6389500, 6399500, 6402500, 6403500, 6403500, 6404750, 640525...
## $ y <dbl> 669499.5, 673499.5, 673499.5, 641499.5, 695499.5, 622249.5, ...
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 4, 1, 3, 1, 11, 8, 13, 8, 3, 1, 2, 2, 1...
crime_data_grd_aggr_sf <- crime_data_grd_aggr %>% filter(!is.na(x))
crime_data_grd_aggr_sf <- st_as_sf(crime_data_grd_aggr_sf, coords = c("y", "x"), crs = 3301)

Now we have created a spatial object with the coordinate system (L-EST97; epsg = 3301) defined. Next we have to transform it to the geographical coordinate system (WGS84, epsg = 4326):

crime_data_grd_aggr_sf <- st_transform(crime_data_grd_aggr_sf, 4326)

And next we can check the results:

ggmap(tartu_map)+
  #theme_map()+
  theme(legend.position = "right")+
  geom_sf(data = crime_data_grd_aggr_sf, aes(alpha=n), size=3, colour="red", inherit.aes = F)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

No difference?!?! Suspicious…
The problem is that we don't see the crimes outside of Tartu on this map, but they are still in the dataset! We have to remove all crime data outside of Tartu!

crime_data_grd_aggr_sf <- crime_data_grd_aggr %>% filter(!is.na(x))
crime_data_grd_aggr_sf <- crime_data_grd_aggr_sf %>% filter(y > 655918 & y < 663097 & x > 6469207 & x < 6476878)
crime_data_grd_aggr_sf <- st_as_sf(crime_data_grd_aggr_sf, coords = c("y", "x"), crs = 3301)

crime_data_grd_aggr_sf <- st_transform(crime_data_grd_aggr_sf, 4326)

ggmap(tartu_map)+
  #theme_map()+
  theme(legend.position = "right")+
  geom_sf(data = crime_data_grd_aggr_sf, aes(alpha=n), size=6, shape=15, colour="red", inherit.aes = F)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Have a safe trip home!


Author: Anto Aasa
Digital Methods in Humanities and Social Sciences

Mobility Lab


Last update: 2018-08-23 14:30:10