First things first! Please load the libraries:
library(dplyr)
library(readr)
library(knitr)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(scico)
library(ggmap)
library(data.table)
library(tidyr)
library(sf)
It is characteristic of the modern network and information society that we meet data everywhere, and very often the data is big. Big data can be a by-product of some other ICT (Information & Communication Technology) service: every use of ICT produces some amount of data. This data describes the usage of certain services, but can also describe the user (frequency, location, etc.). In this workshop, we will go through some examples of how different data sources can reveal the spatio-temporal behaviour of society and/or some smaller social group. Today, mobile data collection is a very useful toolbox for mobility analysis. Hereby we try to “make friends” with several different datasets (mobile positioning, GPS positioning, crimes, etc.). The main aim is to introduce the participants to the nature of mobile data, its applications and prospects, as well as its disadvantages and shortcomings.
Many research questions are related to spatial aspects. In addition to the question Where?, researchers often have to answer how one or another object/phenomenon is related to its spatial context: are there any spatial patterns?
If we assume that the spatial context of the data is essential, we also have to record spatial information during the data collection process. The easiest and most common way is to use coordinates. We all know what geographical coordinates are (geographical coordinates of Jakobi 2, Tartu: 26.71967E 58.38049N).
For various reasons, geographers do not always use geographical coordinates. In fact, there are very many different coordinate reference systems (CRS). For example, the official CRS of Estonia is the Estonian Coordinate System L-EST97 (EPSG:3301). In L-EST97 the coordinates of Jakobi 2, Tartu are X = 659059.9 & Y = 6474338. This means we have to be very careful when dealing with spatial objects: we must know the CRS of our spatial data, and for a given analysis all spatial datasets must be in the same CRS!
In general there are two types of spatial data: vector and raster. In the case of vector data we are dealing with discrete objects (points, lines, polygons); in the case of raster data we have a grid of cells (pixels) organized into rows and columns, where each cell has a value representing information. Today we will look at some simple vector data.
As usual in R, there are many ways to work with spatial data, but today we are focusing on Simple Features (sf). Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them. link
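As a quick illustration of how CRSs matter in practice, here is a minimal sketch (using the sf package loaded above) that converts the Jakobi 2 point from geographical coordinates (WGS84, EPSG:4326) to L-EST97 (EPSG:3301):

```r
library(sf)

# Jakobi 2, Tartu in geographical coordinates (WGS84, EPSG:4326)
jakobi <- st_sfc(st_point(c(26.71967, 58.38049)), crs = 4326)

# reproject to the Estonian national CRS (L-EST97, EPSG:3301)
st_transform(jakobi, crs = 3301)
```

The transformed coordinates should be close to the X = 659059.9, Y = 6474338 given above (small differences may appear depending on the transformation pipeline used).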
In the Mobility Lab, mobility data is, among other methods, also collected with the help of a special mobile phone application, “MobilityLog”. MobilityLog is an Android application designed for long-term mobility tracking and social-network-related experiments. The app was developed in joint cooperation between the Computer Laboratory of the University of Cambridge and the Mobility Lab of the University of Tartu. MobilityLog tracks location with the help of GPS, WiFi access points, the accelerometer, and mobile network cells. It also records other phone use such as call events, battery level, screen activation, or the alarm clock. It is a powerful tool for monitoring the activities carried out by the phone user.
Thus, in addition to location information (GPS), “MobilityLog” can record other behavioural information. The app sends the collected data daily to the servers of the Mobility Lab at the University of Tartu, where the quality of the received data is controlled. After the quality check, a set of algorithms starts to calculate predefined parameters of the respondent’s behaviour.
Thus, the “MobilityLog” application can be used to monitor/sense the activity of the phone user. Every respondent has an Android phone with the data collection app installed. In order to protect the respondents’ privacy, the data is pseudonymized in the initial stage of data processing.
The current dataset contains three weeks of GPS data recorded during a tourism trip in California. The data was recorded using MobilityLog. The dataset starts on 2018-04-21 and ends on 2018-05-10. Altogether it contains 197 375 GPS points.
In the real world, data preparation is very often one of the most annoying steps of the analysis process. Today we skip most of the obstacles and import the pre-prepared data directly from the Internet into R.
url <- "http://aasa.ut.ee/digitalMethods/digitalMethodsGPS.zip" # define the dataset address
download.file(url, "digitalMethodsGPS.zip") # download file
unzip("digitalMethodsGPS.zip", exdir = "data") # unzip files
file.remove("digitalMethodsGPS.zip") # tidy up by removing the zip file
## [1] TRUE
gpsData <- read_csv2("data/gps_us.csv") # import data to R
glimpse(gpsData) # structure of imported file
## Observations: 185,808
## Variables: 17
## $ user_id <int> 241, 241, 241, 241, 241, 241, 241, 241, 241, 24...
## $ id <int> 646380266, 646380267, 646380268, 646380269, 646...
## $ restart <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ counter <int> 66718, 66720, 66723, 66730, 66732, 66734, 66736...
## $ time_system <dbl> 1.524345e+12, 1.524345e+12, 1.524345e+12, 1.524...
## $ time_gps <dbl> 1.524345e+12, 1.524345e+12, 1.524345e+12, 1.524...
## $ time_system_ts <dttm> 2018-04-22 00:10:39, 2018-04-22 00:10:40, 2018...
## $ time_gps_ts <dttm> 2018-04-22 00:10:36, 2018-04-22 00:10:37, 2018...
## $ accuracy <int> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6,...
## $ altitude <dbl> -23.991000, -24.298600, -12.608500, -9.616940, ...
## $ bearing <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ speed <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.0...
## $ header_id <int> 226334354, 226334354, 226334354, 226334360, 226...
## $ series_id <int> 1042, 1042, 1042, 1042, 1042, 1042, 1042, 1042,...
## $ year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,...
## $ X <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -12...
## $ Y <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.6497...
In the current tutorial we are using the package dplyr for filtering, arranging, selecting, mutating, grouping and summarising data. And for better (human-readable) code we are using pipes.
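The pipe (%>%) passes the result of the left-hand side as the first argument of the next function, so nested calls can be read top to bottom. A tiny sketch of the equivalence (using the gpsData table imported above):

```r
library(dplyr)

# nested form, read inside out:
head(select(gpsData, X, Y))

# piped form, read left to right (equivalent):
gpsData %>%
  select(X, Y) %>%
  head()
```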
With glimpse() we see that the dataset contains 17 variables and 185 808 rows. There are integer, double and date-time variables. To reduce the dataset size we keep only the columns relevant for the current analysis (time, accuracy, altitude, bearing, speed, longitude [X], latitude [Y]):
gpsData <- gpsData %>% select(time_system_ts, accuracy, altitude, bearing, speed, X, Y)
gpsData %>% head() %>% kable() # print the formatted head of the table
| time_system_ts | accuracy | altitude | bearing | speed | X | Y |
|---|---|---|---|---|---|---|
| 2018-04-22 00:10:39 | 12 | -23.99100 | 165.957 | 0.104187 | -121.3842 | 38.64978 |
| 2018-04-22 00:10:40 | 16 | -24.29860 | NA | NA | -121.3842 | 38.64976 |
| 2018-04-22 00:10:41 | 16 | -12.60850 | NA | 0.000000 | -121.3842 | 38.64976 |
| 2018-04-22 00:10:42 | 6 | -9.61694 | NA | 0.000000 | -121.3842 | 38.64976 |
| 2018-04-22 00:10:43 | 4 | -8.17016 | NA | 0.000000 | -121.3843 | 38.64971 |
| 2018-04-22 00:10:44 | 6 | -5.67983 | NA | 0.000000 | -121.3843 | 38.64971 |
Before getting into trouble we have to convert the date-time field to the correct form. Originally the data recorder (the smartphone with the MobilityLog app) stored all locations in the Estonian time zone (UTC+3). The time zone of California (UTC−7) differs by 10 hours. This means we have to subtract 10 hours from the values in the date-time column:
gpsData <- gpsData %>% mutate(time_system_ts = time_system_ts - hours(10))
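A more robust alternative (a sketch using lubridate, which is already loaded) is to convert between named time zones with force_tz() and with_tz(); this also handles daylight saving time automatically instead of relying on a fixed offset:

```r
library(lubridate)

# a single timestamp recorded in Estonian time:
t_est <- force_tz(ymd_hms("2018-04-22 00:10:39"), tzone = "Europe/Tallinn")

# the same instant expressed in California time:
with_tz(t_est, tzone = "America/Los_Angeles")
```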
Now the data should be ready for the analysis. Let’s summarise the data: when does it start, when does it end and how many GPS points does it contain?
gpsData %>% summarise(start = min(time_system_ts), end = max(time_system_ts), 'GPS points' = n()) %>% kable()
| start | end | GPS points |
|---|---|---|
| 2018-04-21 14:10:39 | 2018-05-10 13:56:51 | 185808 |
The data collection application MobilityLog is designed to save the battery of the smartphone as much as possible. For that reason the GPS sensor is switched off every time the app realises that the phone is not moving. This means that the dataset contains a lot of gaps. Usually gaps in data indicate data collection problems, but in the current case they contain significant information: where and when did the respondent stop during the trip?
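One way to make the gaps explicit (a sketch, assuming gpsData is prepared as above) is to compute the time difference between consecutive GPS points with lag() and keep only the large ones, which mark possible stops; the 10-minute threshold is an illustrative choice:

```r
library(dplyr)

gpsData %>%
  arrange(time_system_ts) %>%
  mutate(gap_min = as.numeric(difftime(time_system_ts,
                                       lag(time_system_ts),
                                       units = "mins"))) %>%
  filter(gap_min > 10) %>%   # keep gaps longer than 10 minutes
  select(time_system_ts, gap_min, X, Y) %>%
  head()
```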
We could just plot the altitude (metres) against time. For visualizations we are using the package ggplot2 in the current workshop.
ggplot2 webpage: It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).
glimpse(gpsData) # check to remember the column names
## Observations: 185,808
## Variables: 7
## $ time_system_ts <dttm> 2018-04-21 14:10:39, 2018-04-21 14:10:40, 2018...
## $ accuracy <int> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6,...
## $ altitude <dbl> -23.991000, -24.298600, -12.608500, -9.616940, ...
## $ bearing <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ speed <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.0...
## $ X <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -12...
## $ Y <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.6497...
ggplot()+
geom_point(data = gpsData, aes(x=time_system_ts, y=altitude), size=0.25, col="red")
Looks like there are a few gaps in the data!? Suspicious… If we assume that the GPS records a location every second, then the dataset should contain about 1 640 772 rows, but it is ca 10 times smaller. This means we have to be more clever to bring out the gaps. Currently it seems that the horizontal axis is compressed and the gaps are hidden under the data points. One possibility is to create a tile plot, where the horizontal axis represents days and the vertical axis represents time. Additionally we can colour the data points according to the GPS speed.
To do that, we have to calculate date and time-of-day columns. To avoid potential problems with time we calculate it on a decimal scale, where 0 is midnight, 12 is midday and 24 is midnight again:
gpsData <- gpsData %>% dplyr::mutate(time_dec=hour(time_system_ts)+(minute(time_system_ts)/60), date=as.Date(time_system_ts))
And now the plot:
# first version:
# pay attention to subset(), which filters out some GPS points (speed = 0)
ggplot()+
geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))
# the first version wasn't too beautiful. Some polishing:
ggplot()+
theme(panel.background = element_rect(fill="black"),
panel.grid.major = element_line(colour = "grey10"),
panel.grid.minor = element_line(colour = "grey10"),
plot.background = element_rect(fill="black"),
legend.background = element_rect(fill="black"),
legend.text = element_text(colour = "grey"),
legend.title = element_text(colour = "grey"),
axis.text = element_text(colour="grey"),
axis.title = element_text(colour = "grey"),
plot.title = element_text(colour="grey"))+ #theme() allows to change all the design elements of plot
geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))+ # the plot itself
scale_fill_scico(palette = "lajolla", direction = -1)+ #colour palette
labs(title="Time in movement", x="date", y="time")+ #titles
scale_y_continuous(breaks = seq(0, 24, 3))+ # change the time axis to better format
guides(fill=F) # hide the legend
There is no use for a beautiful visualization if we can’t export it to our research papers and presentations. Saving the plot as a raster or vector image is easy:
# to png (jpeg had some problems with Mac OS)
png("name.png", units = "in", res = 300, height = 4, width = 6)
ggplot()+
theme(panel.background = element_rect(fill="black"),
panel.grid.major = element_line(colour = "grey10"),
panel.grid.minor = element_line(colour = "grey10"),
plot.background = element_rect(fill="black"),
legend.background = element_rect(fill="black"),
legend.text = element_text(colour = "grey"),
legend.title = element_text(colour = "grey"),
axis.text = element_text(colour="grey"),
axis.title = element_text(colour = "grey"),
plot.title = element_text(colour="grey"))+ #theme() allows to change all the design elements of plot
geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))+ # the plot itself
scale_fill_scico(palette = "lajolla", direction = -1)+ #colour palette
labs(title="Time in movement", x="date", y="time")+ #titles
scale_y_continuous(breaks = seq(0, 24, 3))+ # change the time axis to better format
guides(fill=F) # hide the legend
dev.off()
## png
## 2
# to pdf
# try help: ?pdf
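Alternatively, ggsave() from ggplot2 saves the last printed plot and infers the file format from the extension (the file names here are illustrative):

```r
library(ggplot2)

# save the last printed plot as a vector PDF:
ggsave("movement_tiles.pdf", height = 4, width = 6)

# or as a 300 dpi raster PNG:
ggsave("movement_tiles.png", dpi = 300, height = 4, width = 6)
```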
It’s spatial data. We should study spatial patterns. For this, we have to make a map!
ggplot()+
geom_point(data = gpsData, aes(y = Y, x=X))
Plotted spatial data is not yet a map! A map needs spatial context, so we need a base map. The easiest solution is the package ggmap, which uses Google Maps.
First we have to download the map:
Ca_map <- get_map(location = "California", maptype = "terrain", color = "bw", zoom = 7)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=California&zoom=7&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=California&sensor=false
(You can try “satellite” or “hybrid” instead of “terrain”.)
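Note: newer versions of ggmap (2.7 and later) require a Google Maps API key before get_map() will work. A sketch (the key string is a placeholder, so the call is left commented out):

```r
library(ggmap)

# register a Google Maps API key once per session:
# register_google(key = "YOUR_API_KEY")
```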
To draw the map the grammar is very similar to ggplot2; the only difference is that the first line defines that ggmap is used:
ggmap(Ca_map)+
theme_map()+
geom_point(data = gpsData, aes(y = Y, x=X), colour ="red", size=0.5, alpha=0.5)
Ok? Of course you can add a title and tune the colours…
Where were the highest and the lowest points of the trip? For that we can make a really simple query:
trip_extremes <- gpsData %>% filter(altitude == min(altitude) | altitude == max(altitude))
trip_extremes %>% kable()
| time_system_ts | accuracy | altitude | bearing | speed | X | Y | time_dec | date |
|---|---|---|---|---|---|---|---|---|
| 2018-04-27 10:58:30 | 4 | 3073.550 | NA | 0.000000 | -118.1789 | 37.38543 | 10.966667 | 2018-04-27 |
| 2018-05-08 07:07:35 | 96 | -192.916 | 348.241 | 0.216682 | -122.3295 | 37.20981 | 7.116667 | 2018-05-08 |
# altitude -192 suggests the point is below sea level; it's probably a GPS error (look at the accuracy!)
And now the map:
ggmap(Ca_map)+
theme_map()+
#geom_point(data = gpsData, aes(y = Y, x=X), colour ="red", size=0.5, alpha=0.5)
geom_point(data = trip_extremes, aes(y = Y, x=X, colour = altitude), size=2)+
scale_colour_gradient(low="blue", high = "red")+
guides(colour = F)
We should add more information to the map. For example:
- What was the altitude?
- When was the place visited?
ggmap(Ca_map)+
theme_map()+
geom_point(data = trip_extremes, aes(y = Y, x=X, colour = altitude), size=2)+
geom_label(data = trip_extremes, aes(y = Y + 0.2, x=X, colour = altitude, label= round(altitude, 1)), size=3)+
geom_label(data = trip_extremes, aes(y = Y - 0.2, x=X, colour = altitude, label= as.Date(time_system_ts)), size=2.5, alpha=0.5)+
scale_colour_gradient(low="blue", high = "red")+
guides(colour = F)
So much (or rather, so little) about GPS data. It was just a featherlight touch of the possibilities…
Passive mobile positioning means that we are using Call Detail Records (CDR) data. A CDR dataset contains log files collected by mobile network operators to monitor and keep track of billable calling services used by their clients. Such a dataset, which contains information about the phone call initiator (a unique and anonymous ID), the time of the calling activities, as well as the location of the mobile antenna where the call activity was made, is an excellent source for analysing the spatio-temporal behaviour of phone users.
The CDR data in the following example is stored in two separate files. One of them is the log file of calling activities (cdr_example) and the other contains spatial information about the mobile antennas (cdr_antennas).
First we import the log file:
cdr_example <- read.csv2("http://aasa.ut.ee/digitalMethods/data/cdr_example.csv", sep=";", dec = ".")
glimpse(cdr_example)
## Observations: 42,112
## Variables: 3
## $ time <fct> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-0...
## $ site_id <int> 937, 937, 1242, 753, 753, 507, 1742, 1746, 1746, 17...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
At first glimpse everything looks OK! The only problem is that time is stored not as a date-time but as a factor. Convert time to the proper format and calculate date, hour and minute columns:
cdr_example <- cdr_example %>% mutate(time = ymd_hms(as.character(time)), date = as.Date(time), hour = hour(time), minute = minute(time))
glimpse(cdr_example)
## Observations: 42,112
## Variables: 6
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-...
## $ site_id <int> 937, 937, 1242, 753, 753, 507, 1742, 1746, 1746, 17...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8...
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 4...
Now we import the table of antennas:
cdr_antennas <- read.csv2("http://aasa.ut.ee/digitalMethods/data/sites.csv", sep=";", dec = ",")
glimpse(cdr_antennas)
## Observations: 1,117
## Variables: 3
## $ site_id <int> 351, 243, 601, 974, 667, 424, 269, 252, 187, 749, 1096...
## $ lat <dbl> 57.74861, 59.08381, 59.40694, 59.38417, 58.38350, 58.0...
## $ lon <dbl> 26.35027, 24.63264, 24.81722, 24.82553, 24.48503, 26.4...
The easiest way to perform a quality check is to plot the data:
ggplot()+
geom_point(data = cdr_antennas, aes(x=lon, y=lat))
Now we have to put the two tables together. This process is called joining. In the current case we use left_join, which returns all records from the left table (cdr_example) and the matched records from the right table (cdr_antennas). For that we have to define the key column by which the tables are joined. If there is no match, the columns coming from the right side are filled with NA.
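A minimal sketch of left_join() behaviour with toy tables (the values are made up): keys present only in the left table get NA in the columns coming from the right table:

```r
library(dplyr)

calls    <- tibble(site_id = c(1, 2, 3))
antennas <- tibble(site_id = c(1, 3),
                   lat = c(59.44, 58.38))

left_join(calls, antennas, by = "site_id")
# site_id 2 has no match in antennas, so its lat becomes NA
```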
cdr_example <- left_join(cdr_example, cdr_antennas, by = "site_id")
glimpse(cdr_example)
## Observations: 42,112
## Variables: 8
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-...
## $ site_id <int> 937, 937, 1242, 753, 753, 507, 1742, 1746, 1746, 17...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8...
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 4...
## $ lat <dbl> NA, NA, NA, 59.41333, 59.41333, 58.09139, NA, NA, N...
## $ lon <dbl> NA, NA, NA, 24.75564, 24.75564, 26.10084, NA, NA, N...
We see that some coordinate fields are empty (NA). This means that location information is missing for those antennas. Filter them out!
cdr_example <- cdr_example %>% filter(!is.na(lat)) # is.na() finds all cells where value is NA (not available), !is.na() works vice versa
glimpse(cdr_example)
## Observations: 4,564
## Variables: 8
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
Next we count the calling activities per antenna location:
cdr_antenna_count <- cdr_example %>% group_by(lat, lon) %>% summarise(n=n()) %>% ungroup()
ggplot()+
geom_point(data = cdr_antenna_count, aes(x=lon, y=lat, size=n, alpha=n), colour="red")
For spatial context we download the base map of Estonia using ggmap:
ET_map <- get_map(location ="Estonia", maptype = "hybrid", color ="bw", zoom =7)
ggmap(ET_map)
Visualize frequently visited places:
ggmap(ET_map)+
theme_map()+
theme(legend.position = "right")+
geom_point(data = cdr_antenna_count, aes(x=lon, y=lat, size=n, alpha=n), colour="red")
In mobility studies the location of the respondent’s home or workplace is crucial. Can we detect it from mobile positioning data as well?
We may assume that home is the most important place for every person. This allows us to assume that we use our mobile phones most at home, and that the workplace is the second most important place (of course these assumptions are oversimplified, but we don’t have too much time to dig deeper).
In the following query we count (n()) all calling activities, grouping them by antenna. Afterwards we sort the result starting with the largest value (descending); finally the table’s head is saved to a new data frame:
cdr_example_aggrActivities <- cdr_example %>% group_by(site_id) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
cdr_example_aggrActivities %>% kable()
| site_id | n |
|---|---|
| 938 | 1136 |
| 752 | 385 |
| 502 | 212 |
| 753 | 196 |
| 55 | 162 |
| 504 | 136 |
According to the assumptions we defined earlier, we can see from the results that antenna (site) 938 is the home location and the antenna with id 752 is the workplace. Where are they located in space? To answer that question we have to use the coordinates as well:
cdr_example_aggrActivities <- cdr_example %>% group_by(site_id, lon, lat) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
cdr_example_aggrActivities %>% kable()
| site_id | lon | lat | n |
|---|---|---|---|
| 938 | 27.15611 | 59.43528 | 1136 |
| 752 | 24.83249 | 59.42916 | 385 |
| 502 | 24.75611 | 59.43777 | 212 |
| 753 | 24.75564 | 59.41333 | 196 |
| 55 | 24.81474 | 59.42226 | 162 |
| 504 | 24.79641 | 59.17908 | 136 |
And then we put possible home and work places on the map:
ggmap(ET_map)+
theme_map()+
theme(legend.position = "right")+
geom_point(data = cdr_example_aggrActivities[1,], aes(x=lon, y=lat, colour="home"), size=2)+ # [1,] selects first row in table
geom_point(data = cdr_example_aggrActivities[2,], aes(x=lon, y=lat, colour = "work"), size=2)+
scale_colour_manual(values = c("orange", "dodgerblue"))
CDR data is commonly used in mobility studies. The accuracy of the data doesn’t allow exact turn-by-turn analysis of movement, but it is a priceless source for analysing commuting, tourism, migration and other patterns.
Again let’s check the column names. And then plot paths on the map.
glimpse(cdr_example)
## Observations: 4,564
## Variables: 8
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
ggplot()+
geom_path(data = cdr_example, aes(x=lon, y=lat), size = 0.2, col="red")
The result looks like a plate of spaghetti; it’s very hard to understand which places were visited more often and which less often.
Maybe changing the transparency (alpha) value helps?
ggplot()+
geom_path(data = cdr_example, aes(x=lon, y=lat), size = 0.2, col="red", alpha=0.5)
Not much better… It’s because the path is drawn as a single line. But there is a solution to improve it!
We should draw a separate line between every two consecutive location points! For that geom_segment can be used. To define the end of each segment we use lead(), which takes the value from the next row:
ggplot()+
geom_segment(data = cdr_example, aes(x = lon, xend = lead(lon, 1), y=lat, yend = lead(lat, 1)), alpha=0.05, size=0.25)
## Warning: Removed 1 rows containing missing values (geom_segment).
You can test different alpha values (0…1) to see how the output changes!
Put the path on the base map:
ggmap(ET_map, darken=c(0.25, "dodgerblue"))+ # we can change the tonality of base map
geom_segment(data = cdr_example, aes(x = lon, xend = lead(lon, 1), y=lat, yend = lead(lat, 1)), alpha=0.05, size=0.25, colour="orange")
## Warning: Removed 1 rows containing missing values (geom_segment).
CDR data is not used only in mobility studies. It can give much more information about the phone carrier. If we assume that starting a phone call or sending an SMS is an act of free will, then we can start to analyse the temporal behaviour of the phone user. For that, let us also calculate the weekday names:
# R gives the weekdays in language defined in your computer settings
# If you want, you can switch to English:
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
cdr_example <- cdr_example %>% mutate(wday = lubridate::wday(time, abbr = T, label =T))
# change back to your language:
Sys.setlocale("LC_TIME","Estonian") # in the current case it's Estonian
## [1] "Estonian_Estonia.1257"
glimpse(cdr_example)
## Observations: 4,564
## Variables: 9
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
## $ wday <ord> Tue, Tue, Tue, Tue, Wed, Wed, Wed, Wed, Wed, Wed, T...
# list weekdays:
cdr_example %>% distinct(wday)
## wday
## 1 Tue
## 2 Wed
## 3 Thu
## 4 Fri
## 5 Sat
## 6 Mon
## 7 Sun
# reorder them:
cdr_example$wday <- factor(cdr_example$wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
ggplot()+
geom_point(data = cdr_example, aes(x=wday, y=hour + (minute / 60)), shape=3, alpha = 0.075, size=6)+
scale_y_continuous(breaks = seq(0, 24, 3))+
labs(y="time, hh:mm")
Visually it doesn’t look too bad. But we should add some values or numbers. Otherwise we can only say that the person is more active on working days (calls more and wakes earlier) and usually does not use the phone during the night.
Let’s calculate some calling-time percentiles and add them to the plot (we may assume that the active day starts when 10% of the calls are made, midday arrives when 50% of the calls are made, and the evening starts when 90% of the calls are made):
# calculate time in decimal format:
cdr_example <- cdr_example %>% mutate(time_dec = hour + (minute / 60))
glimpse(cdr_example)
## Observations: 4,564
## Variables: 10
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
## $ wday <ord> Tue, Tue, Tue, Tue, Wed, Wed, Wed, Wed, Wed, Wed, T...
## $ time_dec <dbl> 9.183333, 9.250000, 9.366667, 17.700000, 10.400000,...
cdr_quart <- cdr_example %>% group_by(wday) %>% summarise(q1 = quantile(time_dec, 0.10), q2 = quantile(time_dec, 0.5), q3 = quantile(time_dec, 0.90)) %>% ungroup()
ggplot()+
geom_point(data = cdr_example, aes(x=wday, y=hour + (minute / 60)), shape=3, alpha = 0.075, size=6)+
geom_segment(data = cdr_quart, aes(x=wday, xend =wday, y=q1, yend = q3), colour="red", size=1)+
geom_point(data = cdr_quart, aes(x= wday, y =q1), colour="red", size=4, shape = "-")+
geom_point(data = cdr_quart, aes(x= wday, y =q2), colour="red", size=3)+
geom_point(data = cdr_quart, aes(x= wday, y =q3), colour="red", size=4, shape="-")+
scale_y_continuous(breaks = seq(0, 24, 3))+
labs(y="time, hh:mm")
A more traditional way is to count calling activities by hour:
glimpse(cdr_example)
## Observations: 4,564
## Variables: 10
## $ time <dttm> 2008-04-01 09:11:19, 2008-04-01 09:15:50, 2008-04-...
## $ site_id <int> 753, 753, 507, 504, 938, 938, 938, 938, 938, 752, 7...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, Joh...
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 20...
## $ hour <int> 9, 9, 9, 17, 10, 10, 10, 11, 15, 16, 8, 9, 9, 11, 1...
## $ minute <int> 11, 15, 22, 42, 24, 37, 41, 3, 58, 17, 52, 1, 31, 2...
## $ lat <dbl> 59.41333, 59.41333, 58.09139, 59.17908, 59.43528, 5...
## $ lon <dbl> 24.75564, 24.75564, 26.10084, 24.79641, 27.15611, 2...
## $ wday <ord> Tue, Tue, Tue, Tue, Wed, Wed, Wed, Wed, Wed, Wed, T...
## $ time_dec <dbl> 9.183333, 9.250000, 9.366667, 17.700000, 10.400000,...
cdr_example_hourAggr <- cdr_example %>% group_by(wday, hour) %>% summarise(n=n()) %>% ungroup()
ggplot()+
geom_line(data = cdr_example_hourAggr, aes(x=hour, y=n, group=wday, colour=as.factor(wday)))+
facet_wrap(~wday)+
scale_colour_manual(values = c("dodgerblue", "steelblue", "blue", "cyan", "navyblue", "orange", "red"))+
scale_x_continuous(breaks = seq(0, 24, 3))+
labs(x="hour", y="number of calls", colour = "day")
#Home antenna:
cdr_example_aggrActivities[1,]
## # A tibble: 1 x 4
## # Groups: site_id, lon [1]
## site_id lon lat n
## <int> <dbl> <dbl> <int>
## 1 938 27.2 59.4 1136
#workplace antenna
cdr_example_aggrActivities[2,]
## # A tibble: 1 x 4
## # Groups: site_id, lon [1]
## site_id lon lat n
## <int> <dbl> <dbl> <int>
## 1 752 24.8 59.4 385
cdr_example_H <- cdr_example %>% filter(site_id == cdr_example_aggrActivities$site_id[1])
cdr_example_W <- cdr_example %>% filter(site_id == cdr_example_aggrActivities$site_id[2])
cdr_example_H_timeAggr <- cdr_example_H %>% group_by(wday, hour) %>% summarise(n = n()) %>% ungroup()
cdr_example_W_timeAggr <- cdr_example_W %>% group_by(wday, hour) %>% summarise(n = n()) %>% ungroup()
ggplot()+
geom_line(data = cdr_example_H_timeAggr, aes(x=hour, y=n, colour="home"))+
geom_line(data = cdr_example_W_timeAggr, aes(x=hour, y=n, colour="work"))+
scale_x_continuous(breaks = seq(0, 24, 3))+
facet_wrap(~wday)
Was the home and workplace determination successful?
Today we read and hear more and more about big data and open data. Big data we have already covered (in reality both datasets are very big; here we just looked at small fragments).
What about open data? In Estonia the state-produced linked open data is accessible through the Open Data Portal of Estonia. One important feature of this data is that it is often machine readable.
Let’s look at one of the datasets from the Estonian Police and Border Guard Board. It contains information about crimes in Estonia. The dataset is updated weekly and freely available.
Download it:
crime_data <- read_delim("https://opendata.smit.ee/ppa/csv/avalik_1.csv",
"\t", escape_double = FALSE, col_types = cols(ToimKell = col_character(),
ToimKpv = col_character()), trim_ws = TRUE)
glimpse(crime_data)
## Observations: 13,423
## Variables: 18
## $ JuhtumId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4"...
## $ ToimKpv <chr> "2018-08-08", "2018-08-07", "2018-08-0...
## $ ToimKell <chr> "00:37", "19:07", "18:03", "18:00", "1...
## $ ToimNadalapaev <chr> "Kolmapäev", "Teisipäev", "Teisipäev",...
## $ SyndmusLiik <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS...
## $ SyndmusTaiendavStatLiik <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_V...
## $ Seadus <chr> "Karistusseadustik", "Karistusseadusti...
## $ Paragrahv <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262."...
## $ ParagrahvTais <chr> "§ 262. Avaliku korra rikkumine", "§ 2...
## $ Loige <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1."...
## $ Kahjusumma <chr> NA, NA, "0-499", "0-499", "0-499", "0-...
## $ KohtLiik <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "...
## $ MaakondNimetus <chr> "Ida-Viru maakond", "Harju maakond", "...
## $ ValdLinnNimetus <chr> "Narva linn", "Tallinn", "Tallinn", "N...
## $ KohtNimetus <chr> "Narva linn", "Haabersti linnaosa", "L...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", ...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547...
## $ SyyteoLiik <chr> "VT", "VT", "VT", "VT", "VT", "VT", "V...
# sometimes there is an easier solution (in the current case the default download works without any additional parameters):
crime_data_2 <- fread("https://opendata.smit.ee/ppa/csv/avalik_1.csv", encoding = "UTF-8")
glimpse(crime_data_2)
## Observations: 13,423
## Variables: 18
## $ JuhtumId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4"...
## $ ToimKpv <chr> "2018-08-08", "2018-08-07", "2018-08-0...
## $ ToimKell <chr> "00:37", "19:07", "18:03", "18:00", "1...
## $ ToimNadalapaev <chr> "Kolmapäev", "Teisipäev", "Teisipäev",...
## $ SyndmusLiik <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS...
## $ SyndmusTaiendavStatLiik <chr> "", "", "", "", "", "", "", "", "MUU_V...
## $ Seadus <chr> "Karistusseadustik", "Karistusseadusti...
## $ Paragrahv <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262."...
## $ ParagrahvTais <chr> "§ 262. Avaliku korra rikkumine", "§ 2...
## $ Loige <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1."...
## $ Kahjusumma <chr> "", "", "0-499", "0-499", "0-499", "0-...
## $ KohtLiik <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "...
## $ MaakondNimetus <chr> "Ida-Viru maakond", "Harju maakond", "...
## $ ValdLinnNimetus <chr> "Narva linn", "Tallinn", "Tallinn", "N...
## $ KohtNimetus <chr> "Narva linn", "Haabersti linnaosa", "L...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", ...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547...
## $ SyyteoLiik <chr> "VT", "VT", "VT", "VT", "VT", "VT", "V...
crime_data %>% head() %>% kable()
| JuhtumId | ToimKpv | ToimKell | ToimNadalapaev | SyndmusLiik | SyndmusTaiendavStatLiik | Seadus | Paragrahv | ParagrahvTais | Loige | Kahjusumma | KohtLiik | MaakondNimetus | ValdLinnNimetus | KohtNimetus | Lest_X | Lest_Y | SyyteoLiik |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| df5d4c9c-fde4-18d5-a4bc-97886cf239c4 | 2018-08-08 | 00:37 | Kolmapäev | AVALIKU_KORRA_RIKKUMINE | NA | Karistusseadustik | § 262. | § 262. Avaliku korra rikkumine | lg. 1. | NA | RAVIASUTUS | Ida-Viru maakond | Narva linn | Narva linn | 6587500-6587999 | 738000-738499 | VT |
| df5d4c92-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 19:07 | Teisipäev | PISIVARGUS | NA | Karistusseadustik | § 218. | § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu | lg. 1. | NA | AVALIK_KOHT,KAUPLUS | Harju maakond | Tallinn | Haabersti linnaosa | 6587500-6587999 | 536500-536999 | VT |
| df5d4c7e-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 18:03 | Teisipäev | PISIVARGUS | NA | Karistusseadustik | § 218. | § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu | lg. 1. | 0-499 | AVALIK_KOHT,KAUPLUS | Harju maakond | Tallinn | Lasnamäe linnaosa | 6588000-6588499 | 547500-547999 | VT |
| df5d4c74-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 18:00 | Teisipäev | AVALIKU_KORRA_RIKKUMINE | NA | Karistusseadustik | § 262. | § 262. Avaliku korra rikkumine | lg. 1. | 0-499 | OUEALA_LAHTINE_HOOV | Ida-Viru maakond | Narva linn | Narva linn | 6590500-6590999 | 737500-737999 | VT |
| df5d4c60-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 16:06 | Teisipäev | KELMUS | NA | Karistusseadustik | § 218. | § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu | lg. 1. | 0-499 | AVALIK_KOHT,PARKLA,TANAV_VALJAK | Harju maakond | Tallinn | Põhja-Tallinna linnaosa | 6588000-6588499 | 539000-539499 | VT |
| df5d4b20-fde4-18d5-a4bc-97886cf239c4 | 2018-08-07 | 10:30 | Teisipäev | PISIVARGUS | NA | Karistusseadustik | § 218. | § 218. Varavastane süütegu väheväärtusliku asja ja varalise õiguse vastu | lg. 1. | 0-499 | AVALIK_KOHT,TANAV_VALJAK | Harju maakond | Tallinn | Kesklinna linnaosa | 6588500-6588999 | 541500-541999 | VT |
Unfortunately the table is in Estonian, but hopefully we can still analyse it! First, let’s translate the column names to English:
crime_names <- colnames(crime_data)
crime_names_en <-c("CaseId", "Date", "Time", "Weekday", "CaseType", "CaseTypeAdditional", "Law", "Paragraph", "ParagraphFull", "Section", "DamagesEuro", "PlaceType", "County", "Municipality", "Place", "Lest_X", "Lest_Y", "Type")
tibble("Columns, et" = crime_names, "Columns, en" = crime_names_en) %>% kable()
| Columns, et | Columns, en |
|---|---|
| JuhtumId | CaseId |
| ToimKpv | Date |
| ToimKell | Time |
| ToimNadalapaev | Weekday |
| SyndmusLiik | CaseType |
| SyndmusTaiendavStatLiik | CaseTypeAdditional |
| Seadus | Law |
| Paragrahv | Paragraph |
| ParagrahvTais | ParagraphFull |
| Loige | Section |
| Kahjusumma | DamagesEuro |
| KohtLiik | PlaceType |
| MaakondNimetus | County |
| ValdLinnNimetus | Municipality |
| KohtNimetus | Place |
| Lest_X | Lest_X |
| Lest_Y | Lest_Y |
| SyyteoLiik | Type |
Replace the column names with the English versions:
colnames(crime_data) <- crime_names_en
glimpse(crime_data)
## Observations: 13,423
## Variables: 18
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <chr> "2018-08-08", "2018-08-07", "2018-08-07", "...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547500-5...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
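One caveat about the renaming step above: assigning colnames positionally breaks silently if the source ever adds or reorders columns. A named lookup vector is more defensive; here is a self-contained sketch with just two of the columns:

```r
# Renaming via a named lookup instead of relying on column order
rename_map <- c(ToimKpv = "Date", ToimKell = "Time")
df <- data.frame(ToimKpv = "2018-08-08", ToimKell = "00:37",
                 stringsAsFactors = FALSE)
names(df) <- unname(rename_map[names(df)])
names(df)  # "Date" "Time"
```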
To check which categories a variable contains, we can use the command distinct:
LIST_caseType <- crime_data %>% distinct(CaseType)
LIST_caseType %>% head() %>% kable()
| CaseType |
|---|
| AVALIKU_KORRA_RIKKUMINE |
| PISIVARGUS |
| KELMUS |
| VARGUS |
| VANDALISM |
| PISIVARGUS,VARGUS |
Next we should check the frequency of different categories:
TOP_caseType <- crime_data %>% group_by(CaseType) %>% summarise(n=n()) %>% arrange(desc(n))
TOP_caseType %>% head() %>% kable()
| CaseType | n |
|---|---|
| VARGUS | 4453 |
| PISIVARGUS | 3912 |
| AVALIKU_KORRA_RIKKUMINE | 1579 |
| VANDALISM | 1089 |
| AVALIKU_KORRA_RIKKUMINE,KEHALINE_VAARKOHTLEMINE | 704 |
| MUU | 358 |
As we see once again, the working language of the Estonian police is Estonian. But we can easily translate the top 5 crime types to English:
| crime type, et | link to google translate |
|---|---|
| VARGUS | https://translate.google.com/#et/en/vargus |
| PISIVARGUS | https://translate.google.com/#et/en/pisivargus |
| AVALIKU_KORRA_RIKKUMINE | https://translate.google.com/#et/en/avaliku_korra_rikkumine |
| VANDALISM | https://translate.google.com/#et/en/vandalism |
| AVALIKU_KORRA_RIKKUMINE,KEHALINE_VAARKOHTLEMINE | https://translate.google.com/#et/en/avaliku_korra_rikkumine,kehaline_vaarkohtlemine |
Traditional glimpse at the data:
glimpse(crime_data)
## Observations: 13,423
## Variables: 18
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <chr> "2018-08-08", "2018-08-07", "2018-08-07", "...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547500-5...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
As it appears, all the variables were imported as character. We have to convert the columns to proper formats. First we convert the date, derive the weekday, and extract the hour:
# R returns weekday names in the language defined in your locale settings
# Let's switch to English:
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
crime_data <- crime_data %>% mutate(Date = ymd(Date), wday = lubridate::wday(Date, abbr = T, label =T), hour = as.integer(substr(Time, 1, 2)))
# change back to your own locale:
Sys.setlocale("LC_TIME","Estonian") # in the current case it's Estonian
## [1] "Estonian_Estonia.1257"
glimpse(crime_data)
## Observations: 13,423
## Variables: 20
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <date> 2018-08-08, 2018-08-07, 2018-08-07, 2018-0...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ Lest_X <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y <chr> "738000-738499", "536500-536999", "547500-5...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
## $ wday <ord> Wed, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue...
## $ hour <int> 0, 19, 18, 18, 16, 10, 9, 6, 4, 4, 2, 20, 2...
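A side note on the hour extraction: substr(Time, 1, 2) works here because the times are zero-padded “hh:mm” strings. lubridate can parse such strings explicitly, which fails loudly rather than silently if the format ever changes:

```r
library(lubridate)
# hm() parses "hh:mm" strings into period objects; hour() extracts the hour
t <- hm(c("00:37", "19:07"))
hour(t)  # 0 19
```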
The crimes dataset also contains spatial information (L-EST97; epsg = 3301). Currently the coordinates are stored in character fields and given as ranges (the coordinates are gridded). Our task is to convert each pixel/cell extent to its centroid:
crime_data <- crime_data %>% mutate(Lest_X_bu = Lest_X, Lest_Y_bu = Lest_Y) # back up coordinate field, because 'separate' will delete the original field
crime_data <- crime_data %>% separate(Lest_X, c("x_min", "x_max"), sep = "-") # split column
crime_data <- crime_data %>% separate(Lest_Y, c("y_min", "y_max"), sep = "-")
crime_data <- crime_data %>% mutate(x_min = as.integer(x_min),
x_max = as.integer(x_max),
y_min = as.integer(y_min),
y_max = as.integer(y_max))
# calculate centroid:
crime_data <- crime_data %>% mutate(x=(x_min + x_max) / 2, y=(y_min + y_max) / 2)
glimpse(crime_data)
## Observations: 13,423
## Variables: 26
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <date> 2018-08-08, 2018-08-07, 2018-08-07, 2018-0...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ x_min <int> 6587500, 6587500, 6588000, 6590500, 6588000...
## $ x_max <int> 6587999, 6587999, 6588499, 6590999, 6588499...
## $ y_min <int> 738000, 536500, 547500, 737500, 539000, 541...
## $ y_max <int> 738499, 536999, 547999, 737999, 539499, 541...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
## $ wday <ord> Wed, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue...
## $ hour <int> 0, 19, 18, 18, 16, 10, 9, 6, 4, 4, 2, 20, 2...
## $ Lest_X_bu <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y_bu <chr> "738000-738499", "536500-536999", "547500-5...
## $ x <dbl> 6587750, 6587750, 6588250, 6590750, 6588250...
## $ y <dbl> 738249.5, 536749.5, 547749.5, 737749.5, 539...
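A side note: the split-and-convert steps can be combined, since tidyr::separate has a convert = TRUE argument that coerces the new columns in the same call. A sketch on a single toy range value:

```r
library(tidyr)
# separate() with convert = TRUE splits the range and converts to integer
toy <- data.frame(Lest_X = "6587500-6587999", stringsAsFactors = FALSE)
toy <- separate(toy, Lest_X, c("x_min", "x_max"), sep = "-", convert = TRUE)
toy$x <- (toy$x_min + toy$x_max) / 2
toy$x  # 6587749.5
```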
Plot the distribution of crimes in Estonia:
ggplot()+
geom_point(data = crime_data, aes(x=x, y=y), size=2, alpha=0.2)
## Warning: Removed 156 rows containing missing values (geom_point).
Well… It should resemble the outline of Estonia, but it does not!
Let’s download the Estonian contour from the Estonian Land Board:
download.file("https://geoportaal.maaamet.ee/docs/haldus_asustus/maakond_shp.zip", destfile="maakond_shp.zip")
#maakond means county!
unzip("maakond_shp.zip")
# read shp to R:
county <- st_read("maakond_20180801.shp")
## Reading layer `maakond_20180801' from data source `C:\ANTO\loengud\digimeetoditeDoktorikool\mobileDateCollectionMethods\maakond_20180801.shp' using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 369032.1 ymin: 6377141 xmax: 739152.8 ymax: 6634019
## epsg (SRID): 3301
## proj4string: +proj=lcc +lat_1=58 +lat_2=59.33333333333334 +lat_0=57.51755393055556 +lon_0=24 +x_0=500000 +y_0=6375000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
ggplot()+
geom_sf(data = county)
Plot crimes data on top of county borders:
ggplot()+
geom_sf(data = county)+
geom_point(data = crime_data, aes(x=x, y=y), size=2, alpha=0.2, colour = "red")
## Warning: Removed 156 rows containing missing values (geom_point).
Well, it looks like the axes are switched: in L-EST97 the X coordinate is the northing and Y the easting, which is the opposite of what ggplot expects. Try swapping them:
ggplot()+
geom_sf(data = county, colour = "grey", fill="grey90", size=0.25)+
geom_point(data = crime_data, aes(x=y, y=x), size=0.5, alpha=0.25, shape=15, colour="red")
## Warning: Removed 156 rows containing missing values (geom_point).
It worked! The spatial pattern of crimes correlates nicely with population density. We can reduce the dataset by aggregating it to the grid:
glimpse(crime_data)
## Observations: 13,423
## Variables: 26
## $ CaseId <chr> "df5d4c9c-fde4-18d5-a4bc-97886cf239c4", "df...
## $ Date <date> 2018-08-08, 2018-08-07, 2018-08-07, 2018-0...
## $ Time <chr> "00:37", "19:07", "18:03", "18:00", "16:06"...
## $ Weekday <chr> "Kolmapäev", "Teisipäev", "Teisipäev", "Tei...
## $ CaseType <chr> "AVALIKU_KORRA_RIKKUMINE", "PISIVARGUS", "P...
## $ CaseTypeAdditional <chr> NA, NA, NA, NA, NA, NA, NA, NA, "MUU_VARGUS...
## $ Law <chr> "Karistusseadustik", "Karistusseadustik", "...
## $ Paragraph <chr> "§ 262.", "§ 218.", "§ 218.", "§ 262.", "§ ...
## $ ParagraphFull <chr> "§ 262. Avaliku korra rikkumine", "§ 218. V...
## $ Section <chr> "lg. 1.", "lg. 1.", "lg. 1.", "lg. 1.", "lg...
## $ DamagesEuro <chr> NA, NA, "0-499", "0-499", "0-499", "0-499",...
## $ PlaceType <chr> "RAVIASUTUS", "AVALIK_KOHT,KAUPLUS", "AVALI...
## $ County <chr> "Ida-Viru maakond", "Harju maakond", "Harju...
## $ Municipality <chr> "Narva linn", "Tallinn", "Tallinn", "Narva ...
## $ Place <chr> "Narva linn", "Haabersti linnaosa", "Lasnam...
## $ x_min <int> 6587500, 6587500, 6588000, 6590500, 6588000...
## $ x_max <int> 6587999, 6587999, 6588499, 6590999, 6588499...
## $ y_min <int> 738000, 536500, 547500, 737500, 539000, 541...
## $ y_max <int> 738499, 536999, 547999, 737999, 539499, 541...
## $ Type <chr> "VT", "VT", "VT", "VT", "VT", "VT", "VT", "...
## $ wday <ord> Wed, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue...
## $ hour <int> 0, 19, 18, 18, 16, 10, 9, 6, 4, 4, 2, 20, 2...
## $ Lest_X_bu <chr> "6587500-6587999", "6587500-6587999", "6588...
## $ Lest_Y_bu <chr> "738000-738499", "536500-536999", "547500-5...
## $ x <dbl> 6587750, 6587750, 6588250, 6590750, 6588250...
## $ y <dbl> 738249.5, 536749.5, 547749.5, 737749.5, 539...
crime_data_grd_aggr <- crime_data %>% group_by(x, y) %>% summarise(n = n()) %>% ungroup()
ggplot()+
geom_sf(data = county, colour = "grey", fill="grey90", size=0.25)+
geom_point(data = crime_data_grd_aggr, aes(x=y, y=x, colour = n, alpha=n))+
scale_colour_gradientn(colours = c("black", "red", "orange", "yellow"))
## Warning: Removed 1 rows containing missing values (geom_point).
The general picture remained the same, but the dataset is about 10× smaller (13,423 rows vs 1,154 rows). The only problem with the previous plot is that the cells with large values (crime hotspots) are buried under the other points.
Can we change the plotting order (bigger values on top)? Yes! We can sort the rows by value:
crime_data_grd_aggr <- crime_data_grd_aggr %>% arrange(n)
ggplot()+
geom_sf(data = county, colour = "grey", fill="grey90", size=0.25)+
geom_point(data = crime_data_grd_aggr, aes(x=y, y=x, colour = n))+
scale_colour_gradientn(colours = c("black", "red", "orange", "yellow"))
## Warning: Removed 1 rows containing missing values (geom_point).
Group data by weekdays and hours, count number of crimes by groups. Create a plot:
crime_data_aggr_hourWday <- crime_data %>% group_by(wday, hour) %>% summarise(n=n()) %>% ungroup()
ggplot()+
theme_minimal()+
geom_line(data = crime_data_aggr_hourWday, aes(x = hour, y=n))+
facet_wrap(~wday)+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 1 rows containing missing values (geom_path).
The result looks almost okay! But R treats Sunday as the first day of the week, so Sunday appears as the first facet. We can change that! Let’s define Monday as the first day of the week:
crime_data %>% distinct(wday)
## # A tibble: 7 x 1
## wday
## <ord>
## 1 Wed
## 2 Tue
## 3 Mon
## 4 Sun
## 5 Sat
## 6 Fri
## 7 Thu
crime_data$wday <- factor(crime_data$wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
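An alternative worth knowing (assuming a reasonably recent lubridate): wday() accepts a week_start argument, so the Monday-first ordering can be requested already when the weekday column is created, without re-levelling afterwards:

```r
library(lubridate)
d <- ymd("2018-08-08")                       # a Wednesday
wday(d, week_start = 1)                      # 3, counting from Monday
w <- wday(d, label = TRUE, abbr = TRUE, week_start = 1)
levels(w)[1]                                 # first factor level is Monday
```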
Let’s focus on public order violations (avaliku korra rikkumine):
crime_data_publicOrder <- crime_data %>% filter(CaseType == "AVALIKU_KORRA_RIKKUMINE")
crime_data_aggr_hourWday <- crime_data_publicOrder %>% group_by(wday, hour) %>% summarise(n=n()) %>% ungroup()
ggplot()+
theme_minimal()+
geom_line(data = crime_data_aggr_hourWday, aes(x = hour, y=n))+
facet_wrap(~wday)+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 1 rows containing missing values (geom_path).
A line graph is not the best choice here, because the weekdays are not actually connected with each other. Let’s make a bar plot:
ggplot()+
theme_minimal()+
geom_bar(data = crime_data_aggr_hourWday, aes(x = hour, y=n), stat ="identity", fill="red")+
facet_wrap(~wday)+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 7 rows containing missing values (position_stack).
Many other possibilities:
ggplot()+
theme_minimal()+
geom_segment(data = crime_data_aggr_hourWday, aes(x = hour, xend=hour, y=0, yend = n), colour="red")+
geom_point(data = crime_data_aggr_hourWday, aes(x = hour, y=n), colour="red")+
facet_wrap(~wday)+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 7 rows containing missing values (geom_segment).
## Warning: Removed 7 rows containing missing values (geom_point).
Hard to compare? Put all days on the same plot:
ggplot()+
theme_minimal()+
geom_line(data = crime_data_aggr_hourWday, aes(x = hour, y=n, group=wday, colour=wday))+
scale_x_continuous(breaks = seq(0, 24, 3))+
scale_colour_manual(values = c("dodgerblue", "green", "blue", "grey", "orange", "red", "magenta"))
## Warning: Removed 7 rows containing missing values (geom_path).
Still not very beautiful. We can instead calculate the mean for workdays and for the weekend:
crime_data_aggr_hourWeekend <- crime_data_aggr_hourWday %>% mutate(weekend = ifelse(wday == "Sat" | wday == "Sun", "weekend", "workday"))
crime_data_aggr_hourWeekend <- crime_data_aggr_hourWeekend %>% group_by(weekend, hour) %>% summarise(n=mean(n)) %>% ungroup()
ggplot()+
theme_minimal()+
geom_line(data = crime_data_aggr_hourWeekend, aes(x=hour, y=n, colour=weekend))+
geom_point(data = crime_data_aggr_hourWeekend, aes(x=hour, y=n, colour=weekend), shape=21)+
labs(title="Average dynamics of crimes during the day", caption ="author: A. Aasa")+
scale_colour_manual(values = c("orange", "dodgerblue"))+
scale_x_continuous(breaks = seq(0, 24, 3))
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).
We don’t have to stay at the national level; we can also “zoom in”. For example, we can download a base map of Tartu and put the crimes on that map.
tartu_map <- get_map(location ="Tartu, Estonia", maptype = "terrain", color ="bw", zoom =13)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Tartu,+Estonia&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Tartu,%20Estonia&sensor=false
ggmap(tartu_map)+
theme_map()+
theme(legend.position = "right")+
geom_point(data = crime_data_grd_aggr, aes(x=y, y=x, colour = n, alpha=n))
## Warning: Removed 1154 rows containing missing values (geom_point).
The maps do not fit? It is because of different coordinate systems: the Google base map is in geographical coordinates (WGS84, epsg = 4326), while the crimes dataset is in the Estonian official coordinate system (L-EST97; epsg = 3301). To solve this, we have to convert the crimes data frame to a spatial object.
crime_data_grd_aggr <- crime_data %>% group_by(x, y) %>% summarise(n = n()) %>% ungroup()
glimpse(crime_data_grd_aggr)
## Observations: 1,154
## Variables: 3
## $ x <dbl> 6389500, 6399500, 6402500, 6403500, 6403500, 6404750, 640525...
## $ y <dbl> 669499.5, 673499.5, 673499.5, 641499.5, 695499.5, 622249.5, ...
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 4, 1, 3, 1, 11, 8, 13, 8, 3, 1, 2, 2, 1...
crime_data_grd_aggr_sf <- crime_data_grd_aggr %>% filter(!is.na(x))
crime_data_grd_aggr_sf <- st_as_sf(crime_data_grd_aggr_sf, coords = c("y", "x"), crs = 3301)
Now we have created a spatial object with the coordinate reference system defined (L-EST97; epsg = 3301). Next we have to transform it to the geographical coordinate system (WGS84, epsg = 4326):
crime_data_grd_aggr_sf <- st_transform(crime_data_grd_aggr_sf, 4326)
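The transformation can be sanity-checked with the Jakobi 2 example from the introduction: its L-EST97 coordinates (easting 659059.9, northing 6474338) should come out near lon 26.71967, lat 58.38049 in WGS84:

```r
library(sf)
# Jakobi 2, Tartu as an sf point in L-EST97 (x = easting, y = northing)
pt <- st_sfc(st_point(c(659059.9, 6474338)), crs = 3301)
pt_wgs <- st_transform(pt, 4326)
st_coordinates(pt_wgs)  # approximately lon 26.71967, lat 58.38049
```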
And next we can check the results:
ggmap(tartu_map)+
#theme_map()+
theme(legend.position = "right")+
geom_sf(data = crime_data_grd_aggr_sf, aes(alpha=n), size=3, colour="red", inherit.aes = F)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
No difference?!?! Suspicious…
The problem is that the crimes outside of Tartu are still in the dataset: they fall outside the visible map, but they still drive the alpha scale, so the points in Tartu stay faint. We have to delete all the crime data outside of Tartu!
crime_data_grd_aggr_sf <- crime_data_grd_aggr %>% filter(!is.na(x))
crime_data_grd_aggr_sf <- crime_data_grd_aggr_sf %>% filter(y > 655918 & y < 663097 & x > 6469207 & x < 6476878)
crime_data_grd_aggr_sf <- st_as_sf(crime_data_grd_aggr_sf, coords = c("y", "x"), crs = 3301)
crime_data_grd_aggr_sf <- st_transform(crime_data_grd_aggr_sf, 4326)
ggmap(tartu_map)+
#theme_map()+
theme(legend.position = "right")+
geom_sf(data = crime_data_grd_aggr_sf, aes(alpha=n), size=6, shape=15, colour="red", inherit.aes = F)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
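As a side note, instead of the manual coordinate filter above, the points can be cropped on the sf object itself with st_crop. A self-contained sketch with invented toy coordinates:

```r
library(sf)
# Three toy points; only the first two fall inside the 0..10 box
pts <- st_as_sf(data.frame(x = c(1, 5, 20), y = c(1, 5, 20)),
                coords = c("x", "y"))
inside <- st_crop(pts, st_bbox(c(xmin = 0, ymin = 0, xmax = 10, ymax = 10)))
nrow(inside)  # 2
```

With real data you would take the bounding box from the base map (in the matching CRS) rather than invent one.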
Have a safe trip home!
Author: Anto Aasa
Digital Methods in Humanities and Social Sciences
Last update: 2018-08-23 14:30:10