1 Introduction

First things first! Please start the libraries:

library(tidyverse)
library(lubridate)
library(ggthemes)
library(scico)
library(ggmap)
library(data.table)
library(sf)
library(knitr)

2 Introduction

It is characteristic to the modern network and information society that we meet data everywhere and very often the data is big. Big data could be a co-product of some other ICT (Information & Communication Technology) service: every usage of ICT produces some amount of data. This data describes the usage of certain services, but can also descrive the user (frequency, location etc). In current workshop, we will go through some examples how different data sources can reveal spatio-temporal behaviour of society and/or some smaller social group. Today, the mobile data collection is a very useful toolbox for mobility analysis. Hereby we try to “make friends” with several different datasets (mobile positioning, GPS-positioning, crimes etc). The main aim is to introduce the participants the nature of mobile data, its applications and prospects as well as the disadvantages and shortcomings.

3 GPS data collection with MobilityLog App

In Mobility Lab the mobility data, among other methods, is also collected with the help of special mobile phone application “MobilityLog”. MobilityLog is an Android application which is designed for long-term mobility tracking and social-network-related experiments. The app was developed in joint cooperation of the Computer Laboratory of the University of Cambridge and the Mobility Lab of the University of Tartu. MobilityLog tracks location with the help of GPS, WiFi access points, accelerometer, and mobile network cells. It also records other phone uses such as call events, battery level, screen activation, or alarm clock. It’s a powerful tool to monitor activities carried on by the phone user.

Thus, in addition to location information (GPS), “MobilityLog” can record other behavioral information. App sends the collected data to the servers of Mobility Lab in University of Tartu on a daily basis, where the quality of the received data is controlled. After the quality check, a set of algorithms start to calculate predefined parameters of respondent’s behavior.

Thus, the “MobilityLog” application can be used to monitor/sense the activity of phone user. Every respondent has an android phone where the data collection app is installed. In order to protect the respondents’ privacy, the data will be pseudonymized in the initial stage of data processing.

3.1 GPS-data

Current dataset containS three weeks of GPS data recorded during a tourism trip in California. The data is recorded using MobilityLog. Dataset starts in 21-04-2018 and ends in 10-05-2018. All together it contains 197 375 GPS-points.
Our task is to put together initial overview about movement of one visitor in California.

4 Analysis of the MobilityLog GPS data

4.1 Data import and preparation for the analysis

In real world the preparation of data is very often one of the most annoying steps during the analysis process. Today we jump over most of the obstacles and use the cleaned and preprepared data.

Download the data:

url <-  "http://aasa.ut.ee/Rspatial/data/usa_GPS.zip" # define the dataset address
download.file(url, "usa_GPS.zip") # download file
unzip("usa_GPS.zip") # unzip files
file.remove("usa_GPS.zip") # tidy up by removing the zip file

Import it to R:

gpsData <- read.csv2("gps_us.csv") # import data to R
glimpse(gpsData) # structure of imported file

## Observations: 185,808
## Variables: 17
## $ user_id        <int> 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 24...
## $ id             <int> 646380266, 646380267, 646380268, 646380269, 64638027...
## $ restart        <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ counter        <dbl> 66718, 66720, 66723, 66730, 66732, 66734, 66736, 667...
## $ time_system    <dbl> 1.524345e+12, 1.524345e+12, 1.524345e+12, 1.524345e+...
## $ time_gps       <dbl> 1.524345e+12, 1.524345e+12, 1.524345e+12, 1.524345e+...
## $ time_system_ts <fct> 2018-04-22 00:10:39, 2018-04-22 00:10:40, 2018-04-22...
## $ time_gps_ts    <fct> 2018-04-22 00:10:36, 2018-04-22 00:10:37, 2018-04-22...
## $ accuracy       <dbl> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6, 6, 6...
## $ altitude       <dbl> -23.991000, -24.298600, -12.608500, -9.616940, -8.17...
## $ bearing        <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ speed          <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.000000...
## $ header_id      <int> 226334354, 226334354, 226334354, 226334360, 22633436...
## $ series_id      <int> 1042, 1042, 1042, 1042, 1042, 1042, 1042, 1042, 1042...
## $ year           <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ X              <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -121.384...
## $ Y              <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.64971, 38...

With glimpse() we see that dataset contains 17 variables and has 185808 rows. There are integer, double and date-time variables. To reduce the dataset size we keep only columns which are relevant for current analysis (time, accuracy, altitude, bearing, longitude[X], latitude[Y]):

gpsData <- gpsData %>% 
  select(time_system_ts, accuracy, altitude, bearing, speed, X, Y)

gpsData %>% 
  head()  # print the head of the table

##        time_system_ts accuracy  altitude bearing    speed         X        Y
## 1 2018-04-22 00:10:39       12 -23.99100 165.957 0.104187 -121.3842 38.64978
## 2 2018-04-22 00:10:40       16 -24.29860      NA       NA -121.3842 38.64976
## 3 2018-04-22 00:10:41       16 -12.60850      NA 0.000000 -121.3842 38.64976
## 4 2018-04-22 00:10:42        6  -9.61694      NA 0.000000 -121.3842 38.64976
## 5 2018-04-22 00:10:43        4  -8.17016      NA 0.000000 -121.3843 38.64971
## 6 2018-04-22 00:10:44        6  -5.67983      NA 0.000000 -121.3843 38.64971

Everything, except the time_system_ts-column look ok. Namely the time is stored as factor. To deal with dates, timestamps etc in R we can use library called lubridate. With help of this package we can convert timestamp column to date-time format:

gpsData <- gpsData %>% 
  mutate(time_system_ts = ymd_hms(time_system_ts))

gpsData %>% 
  head()

##        time_system_ts accuracy  altitude bearing    speed         X        Y
## 1 2018-04-22 00:10:39       12 -23.99100 165.957 0.104187 -121.3842 38.64978
## 2 2018-04-22 00:10:40       16 -24.29860      NA       NA -121.3842 38.64976
## 3 2018-04-22 00:10:41       16 -12.60850      NA 0.000000 -121.3842 38.64976
## 4 2018-04-22 00:10:42        6  -9.61694      NA 0.000000 -121.3842 38.64976
## 5 2018-04-22 00:10:43        4  -8.17016      NA 0.000000 -121.3843 38.64971
## 6 2018-04-22 00:10:44        6  -5.67983      NA 0.000000 -121.3843 38.64971

Now the date-time column is converted to ** ** format. Red more about dates and times in R link

Before entering into the trouble we have to make one more step with date-time column (time_system_ts). Originally the data recorder (the smartphone with MobilityLog app) stored all locations in Estonian time zone (UTC +3). The time zone of California (UTC -7) differs from Estonian time zone 10 hours. This means we have to subtract 10 hours from the values in date-time column:

gpsData <- gpsData %>% 
  mutate(time_system_ts = time_system_ts - hours(10))

Now the data should be ready for the analysis. Let’s try to summarise the data (when it starts, when it ends and how many GPS-points it contains)?

gpsData %>% 
  summarise(start = min(time_system_ts), end = max(time_system_ts), 'GPS points' = n()) %>% 
  kable()

start	end	GPS points
2018-04-21 14:10:39	2018-05-10 13:56:51	185808

The data collection application MobilityLog is developed to save the battery of the smart phone as much as possible. For that reason the GPS-sensor is switched off every time the app realises that the phone is not moving. This means that dataset contains lot of gaps. Usually gaps in data refer to data collection problems, but in current case they contain significant information: where and when the respondent stopped during the trip?

Put the gps locations on plot:

glimpse(gpsData) #check to remeber the column names

## Observations: 185,808
## Variables: 7
## $ time_system_ts <dttm> 2018-04-21 14:10:39, 2018-04-21 14:10:40, 2018-04-2...
## $ accuracy       <dbl> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6, 6, 6...
## $ altitude       <dbl> -23.991000, -24.298600, -12.608500, -9.616940, -8.17...
## $ bearing        <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ speed          <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.000000...
## $ X              <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -121.384...
## $ Y              <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.64971, 38...

ggplot()+
  geom_point(data = gpsData, aes(x=time_system_ts, y=altitude), size=0.25, col="red")

Looks like there are a few gaps in data!? Suspicious… If we assume, that GPS records location after every second, then the dataset sould contain 1.64077210^{6} rows, but it is ca 10 times smaller. This means we have to be more clever to bring out the gaps. Currently it seems, that the horizontal axis is pressed together and gaps are hidden under the data points. One possibility is to create tile-plot, where the horizontal axis represents days and the vertical axis represents time. Additionally we can colour the data points according to the GPS-speed.
To do that, we have to calculate Date and time column. To avoid potential problems with time we calculate it in decimal scale where 0 is midnight, 12 is midday and 24 is midnight again:

gpsData <- gpsData %>% 
  dplyr::mutate(time_dec = hour(time_system_ts) + (minute(time_system_ts)/60), 
                date = as.Date(time_system_ts))

And now the plot:

# first version:
# pay attention on subset() which filters out some gps-points (speed = 0)
ggplot()+
  geom_tile(data= subset(gpsData, speed > 0), aes(x = date, y = time_dec , fill = speed))

# first version wasn't to beautiful. Some polishing:
ggplot()+
  theme(panel.background = element_rect(fill="black"),
        panel.grid.major = element_line(colour = "grey10"),
        panel.grid.minor = element_line(colour = "grey10"),
        plot.background = element_rect(fill="black"),
        legend.background = element_rect(fill="black"),
        legend.text = element_text(colour = "grey"), 
        legend.title = element_text(colour = "grey"),
        axis.text = element_text(colour="grey"),
        axis.title = element_text(colour = "grey"),
        plot.title = element_text(colour="grey"))+ #theme() allows to change all the design elements of plot
  geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))+ # the plot itself
  scale_fill_scico(palette = "lajolla", direction = -1)+ #colour palette
  labs(title="Time in movement", x="date", y="time")+ #titles
  scale_y_continuous(breaks = seq(0, 24, 3))+ # change the time axis to better format
  guides(fill=F) # hide the legend

4.2 Save the plot!

There is no use of beautiful visualization if we can’t export it to our research papers and presentations. Saving the plot as raster or vector is easy:

define the file where the output will be written
write plot to the file
close the file

png("name.png", units = "in", res = 300, height = 4, width = 6)

ggplot()+
  theme(panel.background = element_rect(fill="black"),
        panel.grid.major = element_line(colour = "grey10"),
        panel.grid.minor = element_line(colour = "grey10"),
        plot.background = element_rect(fill="black"),
        legend.background = element_rect(fill="black"),
        legend.text = element_text(colour = "grey"), 
        legend.title = element_text(colour = "grey"),
        axis.text = element_text(colour="grey"),
        axis.title = element_text(colour = "grey"),
        plot.title = element_text(colour="grey"))+ #theme() allows to change all the design elements of plot
  geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))+ # the plot itself
  scale_fill_scico(palette = "lajolla", direction = -1)+ #colour palette
  labs(title="Time in movement", x="date", y="time")+ #titles
  scale_y_continuous(breaks = seq(0, 24, 3))+ # change the time axis to better format
  guides(fill=F) # hide the legend

dev.off()

# to pdf
# try help: ?pdf

4.3 Map the data!

It’s spatial data. We should study spatial patterns. For this, we have to make a map!

ggplot()+
  geom_point(data = gpsData, aes(y = Y, x = X))

Plotted spatial data is not a map! Map needs spatial context. We need to use base map. The easiest solution is to use package called ggmap which is using Google Maps.
First we have to download the map:

# to download the base map from Stamen maps we have to define the bounding box for the map:
bbox <- c(left = min(gpsData$X)-1, bottom = min(gpsData$Y)-1, right = max(gpsData$X)+1, top = max(gpsData$Y)+1)

# print it:
bbox

##       left     bottom      right        top 
## -123.51787   34.05451 -116.86662   39.90682

# download the map:
Ca_map <- get_stamenmap(bbox, maptype = "toner-lite", zoom = 7)

(You can try “satellite” or “hybrid” instead of “terrain”.)

To publish the map the grammar is very similar to ggplot2, just the first line defines that ggmap is used:

ggmap(Ca_map)+
  theme_map()+
  geom_point(data = gpsData, aes(y = Y, x=X), colour ="red", size=0.5, alpha=0.5)

Ok? Of course you can add a title and tune the colours…

4.4 Where were the highest and lowest points visited during the trip?

For that we can make really simple query:

trip_extremes <- gpsData %>% filter(altitude == min(altitude) | altitude == max(altitude)) 

trip_extremes %>% kable()

time_system_ts	accuracy	altitude	bearing	speed	X	Y	time_dec	date
2018-04-27 10:58:30	4	3073.550	NA	0.000000	-118.1789	37.38543	10.966667	2018-04-27
2018-05-08 07:07:35	96	-192.916	348.241	0.216682	-122.3295	37.20981	7.116667	2018-05-08

# altitude -192 refers, that it is under the sea level; it's probably GPS-error (look the accuracy!)

And now the map:

ggmap(Ca_map)+
  theme_map()+
  #geom_point(data = gpsData, aes(y = Y, x=X), colour ="red", size=0.5, alpha=0.5)
  geom_point(data = trip_extremes, aes(y = Y, x=X, colour = altitude), size=2)+
  scale_colour_gradient(low="blue", high = "red")+
  guides(colour = F)

We should add more information on the map. For example: - What was the altitude? - When was the place visited?

ggmap(Ca_map)+
  theme_map()+
  geom_point(data = trip_extremes, aes(y = Y, x=X, colour = altitude), size=2)+
  geom_label(data = trip_extremes, aes(y = Y + 0.2, x=X, colour = altitude, label= round(altitude, 1)), size=3)+
  geom_label(data = trip_extremes, aes(y = Y - 0.2, x=X, colour = altitude, label= as.Date(time_system_ts)), size=2.5, alpha=0.5)+
  scale_colour_gradient(low="blue", high = "red")+
  guides(colour = F)

So much (little) about GPS-data. It was just a featherlight touch of possibilities…

5 Passive Mobile Positioning

Passive mobile positiong means, that we are using Call Detail Records (CDR) data. CDR is a dataset, which contains log-files collected by mobile network operators to monitor and keep track of billable calling services used by their clients. Such dataset that contains information about the phone call initiator (the unique and anonymous ID), time of calling activities as well as the location of the mobile antenna where the call activity was made, is an excellent source for analysing spatio-temporal behaviour of phone users.
CDR-data in the following example is stored in two separete files. One of them is the log-file off calling activities (cdr_example) and another contains spatial information about the mobile antennas (cdr_antennas).
First we import the log-files:

cdr_example <- read.csv2("http://aasa.ut.ee/Rspatial/data/cdr_example.csv", sep=";", dec = ".")

cdr_example %>% head()

##                  time site_id.y pos_usr_id       date
## 1 2008-04-01 08:40:45       264       John 2008-04-01
## 2 2008-04-01 08:56:34       264       John 2008-04-01
## 3 2008-04-01 09:08:27       348       John 2008-04-01
## 4 2008-04-01 09:11:19       215       John 2008-04-01
## 5 2008-04-01 09:15:50       215       John 2008-04-01
## 6 2008-04-01 09:22:31       148       John 2008-04-01

At first glimpse everything looks ok! Only problem is that time is stored not as time but as character. Convert time to proper format and calculate date and time (hours) columns:

cdr_example <- cdr_example %>% 
  mutate(time = ymd_hms(as.character(time)),
         date = as.Date(time), 
         hour = hour(time), 
         minute = minute(time))

glimpse(cdr_example)

## Observations: 42,112
## Variables: 6
## $ time       <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09...
## $ site_id.y  <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 4...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, John, Jo...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04...
## $ hour       <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, ...
## $ minute     <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56...

Now we import the table of antennas:

cdr_antennas <- read.csv2("http://aasa.ut.ee/Rspatial/data/sites.csv", sep=";", dec = ",")

glimpse(cdr_antennas)

## Observations: 1,117
## Variables: 3
## $ site_id <int> 351, 243, 601, 974, 667, 424, 269, 252, 187, 749, 1096, 133...
## $ lat     <dbl> 57.74861, 59.08381, 59.40694, 59.38417, 58.38350, 58.05833,...
## $ lon     <dbl> 26.35027, 24.63264, 24.81722, 24.82553, 24.48503, 26.49500,...

Easiest way to perform a quality check is to plot the data:

ggplot()+
  geom_point(data = cdr_antennas, aes(x=lon, y=lat))

Now we have to put two tables together. This process is called joining. In the current case we use left_join which returns us all records from the left table (cdr_example), and the matched records from the right table (cdr_antennas). For that we have to define the key-column by which tables are joined. The result is NULL from the right side, if there is no match.

cdr_example2 <- left_join(cdr_example, cdr_antennas, by = c("site_id.y" = "site_id"))

cdr_example2 %>% 
  head()

##                  time site_id.y pos_usr_id       date hour minute      lat
## 1 2008-04-01 08:40:45       264       John 2008-04-01    8     40 58.78721
## 2 2008-04-01 08:56:34       264       John 2008-04-01    8     56 58.78721
## 3 2008-04-01 09:08:27       348       John 2008-04-01    9      8 58.66305
## 4 2008-04-01 09:11:19       215       John 2008-04-01    9     11 58.84897
## 5 2008-04-01 09:15:50       215       John 2008-04-01    9     15 58.84897
## 6 2008-04-01 09:22:31       148       John 2008-04-01    9     22 58.74500
##        lon
## 1 26.01778
## 2 26.01778
## 3 26.09499
## 4 26.29264
## 5 26.29264
## 6 26.39861

Check are there any rows without coordinates(NA)! This means that location information is missing for those antennas. Filter them out!

nrow(cdr_example2)

## [1] 42112

cdr_example2 <- cdr_example2 %>% 
  filter(!is.na(lat)) # is.na() finds all cells where value is NA (not available), !is.na() works vice versa

nrow(cdr_example2)

## [1] 42112

How many rows where excluded?

5.1 Frequently visited places

glimpse(cdr_example2)

## Observations: 42,112
## Variables: 8
## $ time       <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09...
## $ site_id.y  <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 4...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, John, Jo...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04...
## $ hour       <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, ...
## $ minute     <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56...
## $ lat        <dbl> 58.78721, 58.78721, 58.66305, 58.84897, 58.84897, 58.745...
## $ lon        <dbl> 26.01778, 26.01778, 26.09499, 26.29264, 26.29264, 26.398...

cdr_antenna_count <- cdr_example2 %>% 
  group_by(lat, lon) %>% 
  summarise(n=n()) %>% 
  ungroup()

ggplot()+
  geom_point(data = cdr_antenna_count, aes(x=lon, y=lat, size=n, alpha=n), colour="red")

For spatial context we use the contours of Estonian counties. (check from previous sessions).
Import polygons as county (county <- st_read(“layer.shp”))

download.file("https://geoportaal.maaamet.ee/docs/haldus_asustus/maakond_shp.zip", destfile="maakond_shp.zip")
#maakond means county!
unzip("maakond_shp.zip")

For the next st_read() please correct the file name! Date is different!

## Reading layer `maakond_20190301' from data source `C:\ANTO\loengud\tbilisi\2020\web\maakond_20190301.shp' using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 369032.1 ymin: 6377141 xmax: 739152.8 ymax: 6634019
## epsg (SRID):    3301
## proj4string:    +proj=lcc +lat_1=58 +lat_2=59.33333333333334 +lat_0=57.51755393055556 +lon_0=24 +x_0=500000 +y_0=6375000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs

Was the geolayer import successful?

# check coordinate reference system (CRS):
st_crs(county)

## Coordinate Reference System:
##   EPSG: 3301 
##   proj4string: "+proj=lcc +lat_1=58 +lat_2=59.33333333333334 +lat_0=57.51755393055556 +lon_0=24 +x_0=500000 +y_0=6375000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

ggplot()+
  geom_sf(data = county, colour = "white", fill= "grey80", size = 0.25)

Good!

Visualize frequently visited places (put together layer of Estonian counties and mobile positioning data):

ggplot()+
  theme_minimal()+
  geom_sf(data = county, colour = "white", fill= "grey80", size = 0.25)+
  geom_point(data = cdr_antenna_count, aes(x = lon, y = lat, size = n, alpha = n), colour = "red")

Problem! Can you figure out why?

Problem is that two datasets are in different coordinate reference system (CRS): borders of counties are in LEST-97 (epsg:3301) and mobile positioning data is in WGS84 (epsg:4326). Let’s use Estonian system. This means we have to convert cdr data to Estonian system. Let’s start from the beginning: change the coordinates already in antennas table:

glimpse(cdr_antennas)

## Observations: 1,117
## Variables: 3
## $ site_id <int> 351, 243, 601, 974, 667, 424, 269, 252, 187, 749, 1096, 133...
## $ lat     <dbl> 57.74861, 59.08381, 59.40694, 59.38417, 58.38350, 58.05833,...
## $ lon     <dbl> 26.35027, 24.63264, 24.81722, 24.82553, 24.48503, 26.49500,...

# convert plain table to spatial object:
cdr_antennas_wgs84_sf <- st_as_sf(cdr_antennas, coords = c("lon", "lat"), crs = 4326) 

# transform the antennas table to estonian CRS:
cdr_antennas_lest97_sf <- st_transform(cdr_antennas_wgs84_sf, 3301)

# convert the result back to data frame:
cdr_antennas_lest97_df <- as_data_frame(cdr_antennas_lest97_sf) %>% 
  select(site_id)

## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.

glimpse(cdr_antennas_lest97_df)

## Observations: 1,117
## Variables: 1
## $ site_id <int> 351, 243, 601, 974, 667, 424, 269, 252, 187, 749, 1096, 133...

# add LEST coordinates: 
cdr_antennas_lest97_df <- cbind(cdr_antennas_lest97_df, st_coordinates(cdr_antennas_lest97_sf))

glimpse(cdr_antennas_lest97_df)

## Observations: 1,117
## Variables: 3
## $ site_id <int> 351, 243, 601, 974, 667, 424, 269, 252, 187, 749, 1096, 133...
## $ X       <dbl> 639930.8, 536270.8, 546413.9, 546917.6, 528372.0, 647260.9,...
## $ Y       <dbl> 6403187, 6549622, 6585730, 6583199, 6471551, 6437971, 65355...

# Join tables again:
cdr_example2 <- left_join(cdr_example, cdr_antennas_lest97_df, by = c("site_id.y" = "site_id"))

cdr_example2 %>% 
  head()

##                  time site_id.y pos_usr_id       date hour minute        X
## 1 2008-04-01 08:40:45       264       John 2008-04-01    8     40 616661.9
## 2 2008-04-01 08:56:34       264       John 2008-04-01    8     56 616661.9
## 3 2008-04-01 09:08:27       348       John 2008-04-01    9      8 621556.7
## 4 2008-04-01 09:11:19       215       John 2008-04-01    9     11 632312.8
## 5 2008-04-01 09:15:50       215       John 2008-04-01    9     15 632312.8
## 6 2008-04-01 09:22:31       148       John 2008-04-01    9     22 638839.8
##         Y
## 1 6518169
## 2 6518169
## 3 6504483
## 4 6525554
## 5 6525554
## 6 6514195

# count calls by antennas:
cdr_antenna_count <- cdr_example2 %>% 
  group_by(site_id.y, X, Y) %>% 
  summarise(n = n()) %>% 
  ungroup()

glimpse(cdr_antenna_count)

## Observations: 651
## Variables: 4
## $ site_id.y <int> 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 19, 20, 21,...
## $ X         <dbl> 527933.1, 581634.9, 591412.8, 550018.7, 507100.9, 677858....
## $ Y         <dbl> 6567107, 6443624, 6573544, 6592009, 6484621, 6457624, 658...
## $ n         <int> 11, 3, 34, 1, 2, 6, 3, 3, 3, 1, 1, 24, 153, 8, 5, 3, 1, 1...

# plot the map:
ggplot()+
  theme_minimal()+
  geom_sf(data = county, colour = "white", fill= "grey80", size = 0.25)+
  geom_point(data = cdr_antenna_count, aes(x = X, y = Y, size=n, alpha=n), colour="red")

In general it’s ok, but plotting is quite slow. This is because counties layer is very accurate and large. We can make it faster. For that we can generalize (simplify the map):

county_smpl <-  st_simplify(county, preserveTopology = T, dTolerance = 200)

To get more information, how previous command works: ?st_simplify

Plot it again:

ggplot()+
  theme_minimal()+
  geom_sf(data = county_smpl, colour = "grey", fill= "snow", size = 0.25)+
  geom_point(data = cdr_antenna_count, aes(x = X, y = Y, size=n, alpha=n), colour="red")

cdr_example2 %>% 
  summarise(min_time = min(time), max_time = max(time))

##              min_time            max_time
## 1 2008-01-01 01:20:41 2014-12-31 20:12:53

5.2 Find home and work location!

In mobility studies the location of home or working place of the respondent is crucial. Can we detect it from mobile positioning data as well?
We may assume, that home is the most important place for every person. This allows us to assume, that at home we use our mobile phones the most, and the working place is the second most important place (of course those assumptions are over simplified, but we don’t have too much time to dig deeper).
In the following query we count (n()) all calling activities grouping them by antennas. Afterwards we sort the result starting with the largest value (descending), finally the tables head is saved to new data frame:

cdr_example_aggrActivities <- cdr_example2 %>% 
  group_by(site_id.y) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n)) %>% 
  head()

cdr_example_aggrActivities %>% 
  kable()

site_id.y	n
264	22444
498	2809
215	918
147	783
817	691
1161	537

According to the assumptions we defined earlier we can see from the results that antenna (site) 264 is the home location and antenna with id 498 is the working place. Where are they located in space? To answer that question we have to use the coordinates as well:

cdr_example_aggrActivities <- cdr_example2 %>% 
  group_by(site_id.y, X, Y) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n)) %>% 
  head()

cdr_example_aggrActivities %>% kable()

site_id.y	X	Y	n
264	616661.9	6518169	22444
498	615088.8	6502337	2809
215	632312.8	6525554	918
147	635306.9	6513513	783
817	559703.7	6581428	691
1161	627174.9	6512670	537

And then we put possible home and work places on the map:

ggplot()+
  theme_minimal()+
  geom_sf(data = county_smpl, colour = "grey", fill= "snow", size = 0.25)+
  geom_point(data = cdr_example_aggrActivities[1,], aes(x=X, y=Y, colour="home"), size=2)+ # [1,] selects first row in table
  geom_point(data = cdr_example_aggrActivities[2,], aes(x=X, y=Y, colour = "work"), size=2)+
  scale_colour_manual(values = c("orange", "dodgerblue"))

5.3 Visualize movement path

CDR-data is commonly used in mobility studies. The accuracy of the data doesn’t allow exactly the turn-by-turn analysis of movement, but it’s a priceless source to analyse commuting, tourism, migration etc patterns.
Again let’s check the column names. And then plot paths on the map.

glimpse(cdr_example2)

## Observations: 42,112
## Variables: 8
## $ time       <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09...
## $ site_id.y  <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 4...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, John, Jo...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04...
## $ hour       <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, ...
## $ minute     <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56...
## $ X          <dbl> 616661.9, 616661.9, 621556.7, 632312.8, 632312.8, 638839...
## $ Y          <dbl> 6518169, 6518169, 6504483, 6525554, 6525554, 6514195, 65...

ggplot()+
  geom_sf(data = county_smpl, colour = "grey", fill= "snow", size = 0.25)+
  geom_path(data = cdr_example2, aes(x=X, y=Y), size = 0.2, col="red")

Results look like plate of spaghetties, it’s very hard to understand which places were visited more often and which places less often.
May changing the the transparency (alpha) value helps?

ggplot()+
  geom_path(data = cdr_example2, aes(x=X, y=Y), size = 0.2, col="red", alpha=0.5)

Nothing better… It’s because the path is created as a single line. But there is a solution to improve it!
We should draw separate line between every two location points! For that geom_segment can be used. To define the end of segment we use lead(), where we say that segment ends with a value on a next row:

ggplot()+
  geom_sf(data = county_smpl, colour = "grey", fill= "snow", size = 0.25)+
  geom_segment(data = cdr_example2, aes(x = X, xend = lead(X, 1), y=Y, yend = lead(Y, 1)), alpha=0.05, size=0.25)

## Warning: Removed 1 rows containing missing values (geom_segment).

Waiting was a bit longer, but result is bettter as well!

5.4 Temporal activity

CDR-data is not used only in mobility studies. It can give much more information about the phone carrier. If we assume that starting a phone call or sending a SMS is act of free will, then we can start to analyse the temporal behaviour of the phone user. For that let us also calculate the weekday names:

# R gives the weekdays in language defined in your computer settings
# If you want, you can switch to english:
Sys.setlocale("LC_TIME", "English")

## [1] "English_United States.1252"

cdr_example2 <- cdr_example2 %>% mutate(wday = lubridate::wday(time, abbr = T, label =T))

# change back to your language:
Sys.setlocale("LC_TIME","Estonian") # in current case it's estonian

## [1] "Estonian_Estonia.1257"

glimpse(cdr_example2)

## Observations: 42,112
## Variables: 9
## $ time       <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09...
## $ site_id.y  <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 4...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, John, Jo...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04...
## $ hour       <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, ...
## $ minute     <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56...
## $ X          <dbl> 616661.9, 616661.9, 621556.7, 632312.8, 632312.8, 638839...
## $ Y          <dbl> 6518169, 6518169, 6504483, 6525554, 6525554, 6514195, 65...
## $ wday       <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, T...

# list weekdays:
cdr_example2 %>% 
  distinct(wday)

##   wday
## 1  Tue
## 2  Wed
## 3  Thu
## 4  Fri
## 5  Sat
## 6  Sun
## 7  Mon

# reorder them:
cdr_example2$wday <- factor(cdr_example2$wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

What does factor? Run: ?factor and find out!

5.4.1 Diurnal activity

We want to analyse the temporal activity of phone user. Make a graph to show the diurnal activity during the week. Let’s calculate some calling time percentiles and add them to plot (we may assume that an active day starts if 10% of calls are made; midday arrives if 50% calls are made and evening starts if 90% calls are made):

# calculate time in decimal format:
cdr_example2 <- cdr_example2 %>% 
  mutate(time_dec = hour + (minute / 60))

glimpse(cdr_example2)

## Observations: 42,112
## Variables: 10
## $ time       <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09...
## $ site_id.y  <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 4...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, John, Jo...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04...
## $ hour       <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, ...
## $ minute     <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56...
## $ X          <dbl> 616661.9, 616661.9, 621556.7, 632312.8, 632312.8, 638839...
## $ Y          <dbl> 6518169, 6518169, 6504483, 6525554, 6525554, 6514195, 65...
## $ wday       <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, T...
## $ time_dec   <dbl> 8.666667, 8.933333, 9.133333, 9.183333, 9.250000, 9.3666...

cdr_quart <- cdr_example2 %>% 
  group_by(wday) %>% 
  summarise(q1 = quantile(time_dec, 0.10), 
            q2 = quantile(time_dec, 0.5), 
            q3 = quantile(time_dec, 0.90)) %>%
  ungroup()

5.4.2 Aggregate and count by hour

A more traditional way is to count calling activities by hour:

glimpse(cdr_example2)

## Observations: 42,112
## Variables: 10
## $ time       <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09...
## $ site_id.y  <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 4...
## $ pos_usr_id <fct> John, John, John, John, John, John, John, John, John, Jo...
## $ date       <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04...
## $ hour       <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, ...
## $ minute     <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56...
## $ X          <dbl> 616661.9, 616661.9, 621556.7, 632312.8, 632312.8, 638839...
## $ Y          <dbl> 6518169, 6518169, 6504483, 6525554, 6525554, 6514195, 65...
## $ wday       <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, T...
## $ time_dec   <dbl> 8.666667, 8.933333, 9.133333, 9.183333, 9.250000, 9.3666...

cdr_example_hourAggr <-  cdr_example2 %>% 
  group_by(wday, hour) %>% 
  summarise(n=n()) %>% 
  ungroup()

ggplot()+
  geom_line(data = cdr_example_hourAggr, aes(x=hour, y=n, group=wday, colour=as.factor(wday)))+
  facet_wrap(~wday)+
  scale_colour_manual(values = c("dodgerblue", "steelblue", "blue", "dodgerblue1", "navyblue", "orange", "red"))+
  scale_x_continuous(breaks = seq(0, 24, 3))+
  labs(y="time, hh:mm", colour = "day")

5.4.3 Temporal activity at home/work:

#Home antenna:
cdr_example_aggrActivities[1,]

## # A tibble: 1 x 4
## # Groups:   site_id.y, X [1]
##   site_id.y       X        Y     n
##       <int>   <dbl>    <dbl> <int>
## 1       264 616662. 6518169. 22444

#workplace antenna
cdr_example_aggrActivities[2,]

## # A tibble: 1 x 4
## # Groups:   site_id.y, X [1]
##   site_id.y       X        Y     n
##       <int>   <dbl>    <dbl> <int>
## 1       498 615089. 6502337.  2809

# calling activities in home location:
cdr_example_H <- cdr_example2 %>% 
  filter(site_id.y == cdr_example_aggrActivities$site_id.y[1])

# calling activities in work location:
cdr_example_W <- cdr_example2 %>% 
  filter(site_id.y == cdr_example_aggrActivities$site_id.y[2])

# Diurnal rhythm:
cdr_example_H_timeAggr <- cdr_example_H %>% 
  group_by(wday, hour) %>% 
  summarise(n = n()) %>% 
  ungroup()

# Diurnal rhythm:
cdr_example_W_timeAggr <- cdr_example_W %>% 
  group_by(wday, hour) %>% 
  summarise(n = n()) %>% 
  ungroup()

# plot:
ggplot()+
  geom_line(data = cdr_example_H_timeAggr, aes(x=hour, y=n, colour="home"))+
  geom_line(data = cdr_example_W_timeAggr, aes(x=hour, y=n, colour="work"))+
  scale_x_continuous(breaks = seq(0, 24, 3))+
  facet_wrap(~wday)

What you think, was the definition of home and work locations successful?

6 Homework number 5

Try to make thematic map of previously used gps points on date 2018-04-27!

Bind all the home works into one doc/pdf/html. Add script-files separately (in case of R-markdowm script is already in html file). Please add your name to each file name! E-mail results to: anto.aasa@ut.ee. Final deadline for all submissions is May 15th, 2020.

Possible results:

Hints:
I assume, that many of you can make it without additional help. There are some new aspects which I haven’t touched during the practical sessions. Try to search the solution from Web search!

In case you didn’t manage:

You have to make subset from gpsData. Part of the filtering syntax: filter(as.Date(time_system_ts) == as.Date("2018-04-27")). Pay attention on as.Date()!
Convert the subset to sf: st_as_sf()
Calculate bounding box (st_bbox())
Download the bsae map (get_stamenmap())
Create the map (ggmap() & geom_sf() don’t match directly.
- For geom_sf(): geom_sf(data = gpsData_slct_sf, col = "red", size = 0.5, inherit.aes = F)+)

Author: Anto Aasa
in ISET, Tbilisi
Last update: 2020-04-22 17:08:04

logo

Mobile Data Collection Methods

Anto Aasa

Spring, 2020