Load the libraries!
(It is recommended to load all the libraries used in a session at its
beginning. This helps to avoid conflicts between
libraries later.)
library(tmap)
library(sf)
library(tidyverse)
#library(ggthemes)
library(lubridate)
library(knitr)
It is characteristic of the modern network and information society that we meet data everywhere, and very often the data is big. Big data is often a by-product of some other ICT (Information & Communication Technology) service: every use of ICT produces some amount of data. This data describes the usage of certain services, but can also describe the behavior of the user (activity, location etc.). Next we will go through some examples of how different data sources can reveal the spatio-temporal behavior of society and/or some smaller social group. Today, mobile data collection is a very useful toolbox for mobility analysis. Hereby we try to “make friends” with several different datasets (mobile positioning, GPS positioning, crimes etc.). The main aim is to introduce the nature of mobile data and how it can be used in mobility studies.
In the Mobility Lab, mobility data is, among other methods, also collected with the help of a special mobile phone application, “MobilityLog”. MobilityLog is an Android application designed for long-term mobility tracking and social-network-related experiments. The app was developed in cooperation between the Computer Laboratory of the University of Cambridge and the Mobility Lab of the University of Tartu. MobilityLog tracks location with the help of GPS, WiFi access points, the accelerometer, and mobile network cells. It also records other phone use such as call events, battery level, screen activation, or alarm clock. It is a powerful tool for monitoring the activities carried out by the phone user.
Thus, in addition to location information (GPS), MobilityLog can record other behavioral information. The app sends the collected data to the servers of the Mobility Lab at the University of Tartu on a daily basis, where the quality of the received data is controlled. After the quality check, a set of algorithms calculates predefined parameters of the respondent’s behavior.
The MobilityLog application can thus be used to monitor/sense the activity of the phone user. Every respondent has an Android phone with the data collection app installed. In order to protect the respondents’ privacy, the data is pseudonymised in the initial stage of data processing.
The current dataset contains three weeks of GPS data recorded during a
trip to California. The data was recorded using MobilityLog. The dataset
starts on 21-04-2018 and ends on 10-05-2018. Altogether it contains
197 375 GPS points.
Our task is to put together an initial overview of the movements of one
visitor in California.
In the real world, preparing the data is very often one of the most annoying steps of the analysis process. Today we jump over most of the obstacles and import the pre-prepared data directly from the Internet into R.
url <- "http://aasa.ut.ee/Rspatial/data/usa_GPS.zip" # define the dataset address
download.file(url, "usa_GPS.zip") # download file
unzip("usa_GPS.zip") # unzip files
#file.remove("usa_GPS.zip") # tidy up by removing the zip file
Import data:
gpsData <- read.csv2("gps_us.csv") # import data to R
glimpse(gpsData) # structure of imported file
## Rows: 185,808
## Columns: 17
## $ user_id <int> 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, …
## $ id <int> 646380266, 646380267, 646380268, 646380269, 646380270, …
## $ restart <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ counter <dbl> 66718, 66720, 66723, 66730, 66732, 66734, 66736, 66738,…
## $ time_system <dbl> 1524345039381, 1524345040339, 1524345041348, 1524345042…
## $ time_gps <dbl> 1524345036986, 1524345037986, 1524345038986, 1524345039…
## $ time_system_ts <chr> "2018-04-22 00:10:39", "2018-04-22 00:10:40", "2018-04-…
## $ time_gps_ts <chr> "2018-04-22 00:10:36", "2018-04-22 00:10:37", "2018-04-…
## $ accuracy <dbl> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6, 6, 6, 1…
## $ altitude <dbl> -23.991000, -24.298600, -12.608500, -9.616940, -8.17016…
## $ bearing <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ speed <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.000000, 0…
## $ header_id <int> 226334354, 226334354, 226334354, 226334360, 226334360, …
## $ series_id <int> 1042, 1042, 1042, 1042, 1042, 1042, 1042, 1042, 1042, 1…
## $ year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2…
## $ X <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -121.3843, …
## $ Y <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.64971, 38.64…
With glimpse() we see that the dataset contains 17 variables
and 185 808 rows. There are integer, double and date-time variables.
To reduce the size of the dataset we keep only the columns which are
relevant for the current analysis
(time, accuracy, altitude, bearing, speed, longitude [X], latitude [Y]):
gpsData <- gpsData %>%
select(time_system_ts,
accuracy,
altitude,
bearing,
speed,
X,
Y)
gpsData %>%
head() # print the head of the table
## time_system_ts accuracy altitude bearing speed X Y
## 1 2018-04-22 00:10:39 12 -23.99100 165.957 0.104187 -121.3842 38.64978
## 2 2018-04-22 00:10:40 16 -24.29860 NA NA -121.3842 38.64976
## 3 2018-04-22 00:10:41 16 -12.60850 NA 0.000000 -121.3842 38.64976
## 4 2018-04-22 00:10:42 6 -9.61694 NA 0.000000 -121.3842 38.64976
## 5 2018-04-22 00:10:43 4 -8.17016 NA 0.000000 -121.3843 38.64971
## 6 2018-04-22 00:10:44 6 -5.67983 NA 0.000000 -121.3843 38.64971
Everything except the column time_system_ts looks OK.
Namely, the time is stored as character. To deal with dates, timestamps
etc. in R we can use the library lubridate. With the
help of this package we can convert the timestamp column to date-time
format:
gpsData <- gpsData %>%
mutate(time_system_ts = ymd_hms(time_system_ts))
gpsData %>%
head()
## time_system_ts accuracy altitude bearing speed X Y
## 1 2018-04-22 00:10:39 12 -23.99100 165.957 0.104187 -121.3842 38.64978
## 2 2018-04-22 00:10:40 16 -24.29860 NA NA -121.3842 38.64976
## 3 2018-04-22 00:10:41 16 -12.60850 NA 0.000000 -121.3842 38.64976
## 4 2018-04-22 00:10:42 6 -9.61694 NA 0.000000 -121.3842 38.64976
## 5 2018-04-22 00:10:43 4 -8.17016 NA 0.000000 -121.3843 38.64971
## 6 2018-04-22 00:10:44 6 -5.67983 NA 0.000000 -121.3843 38.64971
Now the date-time column is converted to POSIXct format.
Read more about dates and times in R link
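A tiny self-contained illustration of what lubridate offers (the timestamp below is made up for the example):

```r
library(lubridate)

# parse a character timestamp into POSIXct (UTC by default):
ts <- ymd_hms("2018-04-22 00:10:39")

hour(ts)          # 0
minute(ts)        # 10
as.Date(ts)       # "2018-04-22"
ts + minutes(90)  # "2018-04-22 01:40:39 UTC"
```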
Before getting into trouble we have to take one more step with the
date-time column (time_system_ts). Originally the data
recorder (the smartphone with the MobilityLog app) stored all locations in
the Estonian time zone (UTC+3). The time zone of California (UTC-7)
differs from the Estonian time zone by 10 hours. This means we have to subtract
10 hours from the values in the date-time column:
gpsData <- gpsData %>%
mutate(time_system_ts = time_system_ts - hours(10))
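Subtracting a fixed 10 hours works for this trip, but note that lubridate can also handle time zones (and daylight saving time) explicitly with force_tz() and with_tz(); a sketch with a made-up timestamp:

```r
library(lubridate)

# the phone recorded Estonian wall-clock time, so first attach that zone,
# then express the same instant in California time:
t_est <- force_tz(ymd_hms("2018-04-22 00:10:39"), tzone = "Europe/Tallinn")
t_cal <- with_tz(t_est, tzone = "America/Los_Angeles")
t_cal  # "2018-04-21 14:10:39 PDT" -- exactly 10 hours earlier
```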
Now the data should be ready for the analysis. Let’s summarise the data: when does it start, when does it end and how many GPS points does it contain?
gpsData %>%
summarise(start = min(time_system_ts),
end = max(time_system_ts),
'GPS points' = n()) %>%
kable()
| start | end | GPS points |
|---|---|---|
| 2018-04-21 14:10:39 | 2018-05-10 13:56:51 | 185808 |
The data collection application MobilityLog is designed to save the battery of the smartphone as much as possible. For that reason the GPS sensor is switched off every time the app realises that the phone is not moving. This means that the dataset contains a lot of gaps. Usually gaps in data refer to data collection problems, but in the current case they contain significant information: where and when did the respondent stop during the trip?
To map the data gaps, we first look at the altitudes recorded during the journey:
glimpse(gpsData) # check column names
## Rows: 185,808
## Columns: 7
## $ time_system_ts <dttm> 2018-04-21 14:10:39, 2018-04-21 14:10:40, 2018-04-21 1…
## $ accuracy <dbl> 12, 16, 16, 6, 4, 6, 6, 6, 4, 4, 6, 6, 6, 6, 6, 6, 6, 1…
## $ altitude <dbl> -23.991000, -24.298600, -12.608500, -9.616940, -8.17016…
## $ bearing <dbl> 165.957, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ speed <dbl> 0.104187, NA, 0.000000, 0.000000, 0.000000, 0.000000, 0…
## $ X <dbl> -121.3842, -121.3842, -121.3842, -121.3842, -121.3843, …
## $ Y <dbl> 38.64978, 38.64976, 38.64976, 38.64976, 38.64971, 38.64…
ggplot()+
geom_point(data = gpsData, aes(x = time_system_ts, y = altitude), size = 0.25, col = "red")
From the plot we can see that the dataset covers the time period from the middle of April to the middle of May. At first glance it seems that the data has been collected properly for the entire period and there are no large data gaps.
No gaps in the data!? Suspicious… If we assume
that the GPS records a location every second, then the dataset should
contain 1 640 772 rows, but it is ca 10 times smaller. This means
we have to be more clever to bring out the gaps. Currently it seems
that the horizontal axis is squeezed together and the gaps are hidden under
the data points. One possibility is to create a tile plot,
where the horizontal axis represents days and the vertical axis
represents time. Additionally we can colour the data points according to
the GPS speed.
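The expected row count used above can be verified directly from the start and end timestamps of the dataset:

```r
library(lubridate)

start <- ymd_hms("2018-04-21 14:10:39")
end   <- ymd_hms("2018-05-10 13:56:51")

# one GPS fix per second would give one row per elapsed second:
as.numeric(difftime(end, start, units = "secs"))
## 1640772
```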
To do that, we have to calculate date and time columns.
To avoid potential problems with time we calculate the time on a decimal scale
where 0 is midnight, 12 is midday and 24 is midnight again:
gpsData <- gpsData %>%
mutate(time_dec = hour(time_system_ts) + (minute(time_system_ts)/60),
date = as.Date(time_system_ts))
And now the plot:
# first version:
# pay attention to subset(), which filters out some GPS points (speed = 0)
ggplot()+
geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))
The first version wasn’t too beautiful. After some polishing (with
theme() you can design the plot as you like!):
ggplot()+
theme(panel.background = element_rect(fill="black"),
panel.grid.major = element_line(colour = "grey10"),
panel.grid.minor = element_line(colour = "grey10"),
plot.background = element_rect(fill="black"),
legend.background = element_rect(fill="black"),
legend.text = element_text(colour = "grey"),
legend.title = element_text(colour = "grey"),
axis.text = element_text(colour="grey"),
axis.title = element_text(colour = "grey"),
plot.title = element_text(colour="grey"))+ #theme() allows to change all the design elements of plot
geom_tile(data= subset(gpsData, speed > 0), aes(x=date, y=time_dec , fill=speed))+ # the plot itself
scico::scale_fill_scico(palette = "lajolla", direction = -1)+ #colour palette
labs(title="Time in movement", x="date", y="time")+ #titles
scale_y_continuous(breaks = seq(0, 24, 3))+ # change the time axis to better format
guides(fill = "none") # hide the legend
There is no use in a beautiful visualization if we can’t export it to our research papers and presentations. Saving the plot as a raster or vector file is easy. We use .png because .jpg had problems on Mac OS:
ggsave("file_name.png", units = "in", dpi = 300, height = 4, width = 6)
How to export it as pdf?
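One possible answer to the question above: ggsave() picks the output device from the file extension, so a vector PDF only needs a different file name. A self-contained sketch (the plot here is just a stand-in):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# .pdf extension -> vector output, which scales without quality loss:
ggsave("file_name.pdf", plot = p, units = "in", height = 4, width = 6)
```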
GPS data obviously contains spatial information. We should study spatial patterns. Plot the coordinates:
ggplot()+
geom_point(data = gpsData, aes(y = Y, x = X))
Plotted spatial data is not yet a map! A map needs spatial context. We need
to use a base map. One of the easiest solutions is the package tmap.
First we have to convert the plain data table to a spatial object. We use
the library sf for this:
gpsData_sf <- gpsData %>%
st_as_sf(coords = c("X", "Y"), crs = 4326)
For the spatial context use tmap_mode("view") (back to
static: tmap_mode("plot")):
#tmap_mode("view")
# To reduce the size of html file "view" is not used.
tm_shape(gpsData_sf)+
tm_dots("magenta", size = .1)
Where were the highest and the lowest points visited during the trip? For that we can run a simple query:
trip_extremes <- gpsData_sf %>%
filter(altitude == min(altitude) | altitude == max(altitude)) # "|" stands for "OR"
trip_extremes %>%
kable()
| time_system_ts | accuracy | altitude | bearing | speed | time_dec | date | geometry |
|---|---|---|---|---|---|---|---|
| 2018-04-27 10:58:30 | 4 | 3073.550 | NA | 0.000000 | 10.966667 | 2018-04-27 | POINT (-118.1789 37.38543) |
| 2018-05-08 07:07:35 | 96 | -192.916 | 348.241 | 0.216682 | 7.116667 | 2018-05-08 | POINT (-122.3295 37.20981) |
Altitude -192 means the point is below sea level. It’s probably a GPS error? Or?
And now the map:
tm_shape(trip_extremes)+
  tm_dots(fill = "altitude",
          fill.scale = tm_scale(values = c("blue", "red")),
          size = 1)
We should add more information to the map. For example: - What was the altitude? - When was the place visited? - Why is this place worth visiting? :)
(tm_basemap() is available if “view” mode is
active.)
# round the elevation:
trip_extremes <- trip_extremes %>%
mutate(altitude = round(altitude,0))
#tm_basemap("Esri.WorldTopoMap")+
#tm_basemap("Esri.WorldTopoMap")+
tm_shape(trip_extremes)+
  tm_dots(fill = "altitude",
          fill.scale = tm_scale(values = c("blue", "red")),
          fill.legend = tm_legend_hide(),
          size = 1,
          fill_alpha = .7)+
  tm_text("altitude",
          bg.color = "white",
          bg.alpha = 1,
          fontface = "bold",
          xmod = 8,
          options = opt_tm_text(just = "bottom"))+
  tm_title("The highest and the lowest locations visited during the trip, m")
Turn off the interactive view:
tmap_mode("plot")
## ℹ tmap mode set to "plot".
So much (little) about GPS data. It was just a feather-light touch of the possibilities…
Passive mobile positioning means that we are using Call Detail
Records (CDR) data. CDR data consists of log files
collected by mobile network operators to monitor and keep track of
billable calling services used by their clients. Such a dataset, which
contains information about the phone call initiator (a unique,
anonymous ID), the time of the calling activities, and the location of the
mobile antenna where the call activity took place, is an excellent source
for analysing the spatio-temporal behaviour of phone users.
The CDR data in the following example is stored in two separate files. One
of them is the log file of calling activities
(cdr_example) and the other contains spatial information
about the mobile antennas (cdr_antennas).
First we import the log-files:
cdr_example <- read.csv2("https://aasa.ut.ee/Rspatial/data/cdr_example.csv", sep=";", dec = ".")
cdr_example %>%
head()
## time site_id.y pos_usr_id date
## 1 2008-04-01 08:40:45 264 John 2008-04-01
## 2 2008-04-01 08:56:34 264 John 2008-04-01
## 3 2008-04-01 09:08:27 348 John 2008-04-01
## 4 2008-04-01 09:11:19 215 John 2008-04-01
## 5 2008-04-01 09:15:50 215 John 2008-04-01
## 6 2008-04-01 09:22:31 148 John 2008-04-01
At first glance everything looks OK! The only problem is that the time is stored not as a time but as character. Convert the time to a proper format and calculate the date and time (hours, minutes) columns:
cdr_example <- cdr_example %>%
mutate(time = ymd_hms(as.character(time)),
date = as.Date(time),
hour = hour(time),
minute = minute(time))
glimpse(cdr_example)
## Rows: 42,112
## Columns: 6
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09:08…
## $ site_id.y <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 498,…
## $ pos_usr_id <chr> "John", "John", "John", "John", "John", "John", "John", "Jo…
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01…
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, 9, …
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56, 5…
Now we import the table of antennas:
#cdr_antennas <- read.csv2("C:/ANTO/loengud/geopythonR/www/anto_cells_19102015.csv", dec = ".")
cdr_antennas <- read.csv2("http://aasa.ut.ee/Rspatial/data/sites.csv", sep = ";", dec = ",")
glimpse(cdr_antennas)
## Rows: 1,117
## Columns: 3
## $ site_id <int> 351, 243, 601, 974, 667, 424, 269, 252, 187, 749, 1096, 133, 1…
## $ lat <dbl> 57.74861, 59.08381, 59.40694, 59.38417, 58.38350, 58.05833, 58…
## $ lon <dbl> 26.35027, 24.63264, 24.81722, 24.82553, 24.48503, 26.49500, 24…
The easiest way to perform a quality check is to plot the data:
ggplot()+
geom_point(data = cdr_antennas, aes(x = lon, y = lat))
Now we have to put the two tables together. This process is called joining. In the current case we use left_join, which returns all records from the left table (cdr_example) and the matched records from the right table (cdr_antennas). For that we have to define the key column by which the tables are joined. Where there is no match, the columns from the right side are filled with NA.
cdr_example2 <- left_join(cdr_example, cdr_antennas, by = c("site_id.y" = "site_id"))
cdr_example2 %>%
head()
## time site_id.y pos_usr_id date hour minute lat
## 1 2008-04-01 08:40:45 264 John 2008-04-01 8 40 58.78721
## 2 2008-04-01 08:56:34 264 John 2008-04-01 8 56 58.78721
## 3 2008-04-01 09:08:27 348 John 2008-04-01 9 8 58.66305
## 4 2008-04-01 09:11:19 215 John 2008-04-01 9 11 58.84897
## 5 2008-04-01 09:15:50 215 John 2008-04-01 9 15 58.84897
## 6 2008-04-01 09:22:31 148 John 2008-04-01 9 22 58.74500
## lon
## 1 26.01778
## 2 26.01778
## 3 26.09499
## 4 26.29264
## 5 26.29264
## 6 26.39861
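How left_join() treats unmatched keys is easy to see on a toy example (the two tables below are invented for illustration):

```r
library(dplyr)

calls    <- tibble(site_id = c(1, 2, 99), pos_usr_id = "John")
antennas <- tibble(site_id = c(1, 2), lat = c(58.4, 59.4))

left_join(calls, antennas, by = "site_id")
## site_id 99 has no match in `antennas`, so its lat is filled with NA
```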
Check whether there are any rows without coordinates (NA)! For those antennas the location information is missing. Filter them out!
nrow(cdr_example2)
## [1] 42112
cdr_example2 <- cdr_example2 %>%
filter(!is.na(lat)) # is.na() finds all cells where value is NA (not available), !is.na() works vice versa
nrow(cdr_example2)
## [1] 42112
How many rows were excluded?
glimpse(cdr_example2)
## Rows: 42,112
## Columns: 8
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09:08…
## $ site_id.y <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 498,…
## $ pos_usr_id <chr> "John", "John", "John", "John", "John", "John", "John", "Jo…
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01…
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, 9, …
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56, 5…
## $ lat <dbl> 58.78721, 58.78721, 58.66305, 58.84897, 58.84897, 58.74500,…
## $ lon <dbl> 26.01778, 26.01778, 26.09499, 26.29264, 26.29264, 26.39861,…
cdr_antenna_count <- cdr_example2 %>%
group_by(lat, lon) %>%
#summarise(n = n()) %>%
tally() %>% # difference between tally() and summarise(n = n())?
ungroup()
ggplot()+
geom_point(data = cdr_antenna_count, aes(x = lon, y = lat, size = n, alpha = n), colour = "red")
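The question in the comment above has a short answer: tally() is just a shorthand wrapper for summarise(n = n()); on a toy table both return the same counts:

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"))

df %>% group_by(g) %>% tally()             # a: 2, b: 1
df %>% group_by(g) %>% summarise(n = n())  # a: 2, b: 1
```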
For spatial context we use the contours of the Estonian counties (check
how to download the layer via WFS in the spatial interpolation
session).
Import the polygons as county:
(county <- st_read("layer.shp"))
Was the import successful? Use glimpse() and
st_crs() to check the content of the imported layer.
Plot:
ggplot()+
geom_sf(data = county, colour = "white", fill = "grey80", size = 0.25)
Nice!
Visualize frequently visited places (put together layer of Estonian counties and mobile positioning data):
ggplot()+
theme_minimal()+
geom_sf(data = county, colour = "white", fill= "grey80", size = 0.25)+
geom_point(data = cdr_antenna_count, aes(x = lon, y = lat, size = n, alpha = n), colour = "red")
Problem! Different CRS again…
The problem is that the two datasets are in different coordinate reference systems (CRS): the borders of the counties are in L-EST97 (EPSG:3301) and the mobile positioning data is in WGS84 (EPSG:4326). Let’s use the Estonian system. This means we have to convert the CDR data to the Estonian reference system:
glimpse(cdr_antennas)
## Rows: 1,117
## Columns: 3
## $ site_id <int> 351, 243, 601, 974, 667, 424, 269, 252, 187, 749, 1096, 133, 1…
## $ lat <dbl> 57.74861, 59.08381, 59.40694, 59.38417, 58.38350, 58.05833, 58…
## $ lon <dbl> 26.35027, 24.63264, 24.81722, 24.82553, 24.48503, 26.49500, 24…
# convert plain table to spatial object:
cdr_antennas_wgs84_sf <- st_as_sf(cdr_antennas, coords = c("lon", "lat"), crs = 4326, remove = F)
# transform the antennas layer to Estonian CRS:
cdr_antennas_lest97_sf <- st_transform(cdr_antennas_wgs84_sf, 3301)
#check the table:
head(cdr_antennas_lest97_sf)
## Simple feature collection with 6 features and 3 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 528372 ymin: 6403187 xmax: 647260.9 ymax: 6585730
## Projected CRS: Estonian Coordinate System of 1997
## site_id lat lon geometry
## 1 351 57.74861 26.35027 POINT (639930.8 6403187)
## 2 243 59.08381 24.63264 POINT (536270.8 6549622)
## 3 601 59.40694 24.81722 POINT (546413.9 6585730)
## 4 974 59.38417 24.82553 POINT (546917.6 6583199)
## 5 667 58.38350 24.48503 POINT (528372 6471551)
## 6 424 58.05833 26.49500 POINT (647260.9 6437971)
# lat/long columns are in WGS-84, geometry is in L-EST97
Add coordinates to the attributes table:
# add columns with LEST-97 coordinates as well!
cdr_antennas_lest97_sf <- cbind(cdr_antennas_lest97_sf, st_coordinates(cdr_antennas_lest97_sf))
# convert the result back to data frame:
cdr_antennas_lest97_df <- (cdr_antennas_lest97_sf) %>%
st_drop_geometry() %>%
as_tibble()
# Join tables again:
cdr_example2 <- left_join(cdr_example, cdr_antennas_lest97_df, by = c("site_id.y" = "site_id"))
cdr_example2 %>%
head()
## time site_id.y pos_usr_id date hour minute lat
## 1 2008-04-01 08:40:45 264 John 2008-04-01 8 40 58.78721
## 2 2008-04-01 08:56:34 264 John 2008-04-01 8 56 58.78721
## 3 2008-04-01 09:08:27 348 John 2008-04-01 9 8 58.66305
## 4 2008-04-01 09:11:19 215 John 2008-04-01 9 11 58.84897
## 5 2008-04-01 09:15:50 215 John 2008-04-01 9 15 58.84897
## 6 2008-04-01 09:22:31 148 John 2008-04-01 9 22 58.74500
## lon X Y
## 1 26.01778 616661.9 6518169
## 2 26.01778 616661.9 6518169
## 3 26.09499 621556.7 6504483
## 4 26.29264 632312.8 6525554
## 5 26.29264 632312.8 6525554
## 6 26.39861 638839.8 6514195
Count the calls / SMSs and plot it:
# count calls by antennas:
cdr_antenna_count <- cdr_example2 %>%
group_by(site_id.y, X, Y) %>%
tally() %>%
ungroup()
glimpse(cdr_antenna_count)
## Rows: 651
## Columns: 4
## $ site_id.y <int> 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 19, 20, 21, 22…
## $ X <dbl> 527933.1, 581634.9, 591412.8, 550018.7, 507100.9, 677858.0, …
## $ Y <dbl> 6567107, 6443624, 6573544, 6592009, 6484621, 6457624, 658023…
## $ n <int> 11, 3, 34, 1, 2, 6, 3, 3, 3, 1, 1, 24, 153, 8, 5, 3, 1, 140,…
# plot the map:
ggplot()+
#theme_minimal()+
theme_void()+
geom_sf(data = county, colour = "white", fill= "grey80", size = 0.25)+
geom_point(data = cdr_antenna_count, aes(x = X, y = Y, size=n, alpha=n), colour="red")
In general it’s OK, but plotting is quite slow. This is because the counties layer is very detailed and large. We can make it faster by generalizing (simplifying) the geometry:
county_smpl <- st_simplify(county, preserveTopology = T, dTolerance = 200)
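The effect of st_simplify() can be checked by counting coordinates before and after. A self-contained sketch on a dense circle, standing in for the detailed county borders (the parameters are invented for the example):

```r
library(sf)

# a dense "circle" polygon with ~400 vertices:
circle <- st_buffer(st_sfc(st_point(c(0, 0)), crs = 3301),
                    dist = 10000, nQuadSegs = 100)

circle_smpl <- st_simplify(circle, preserveTopology = TRUE, dTolerance = 200)

nrow(st_coordinates(circle))       # vertex count before
nrow(st_coordinates(circle_smpl))  # far fewer vertices after
```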
Plot it again:
ggplot()+
theme_minimal()+
geom_sf(data = county_smpl, colour = "grey", fill= "snow", size = 0.25)+
geom_point(data = cdr_antenna_count, aes(x = X, y = Y, size = n, alpha = n), colour = "red")
For what time period was the CDR data collected?
cdr_example2 %>%
summarise(min_time = min(time), max_time = max(time)) %>%
kable()
| min_time | max_time |
|---|---|
| 2008-01-01 01:20:41 | 2014-12-31 20:12:53 |
In mobility studies the location of the home or workplace of the
respondent is crucial. Can we detect it from mobile positioning data as
well?
We may assume that home is the most important place for every person.
This allows us to assume that at home we use our mobile phones the
most, and that the workplace is the second most important place (of
course these assumptions are oversimplified, but we don’t have much
time to dig deeper).
In the following query we count (n()) all calling activities, grouping
them by antenna. Afterwards we sort the result starting with the
largest value (descending); finally the head of the table is saved to a new data
frame:
cdr_example_aggrActivities <- cdr_example2 %>%
group_by(site_id.y) %>%
summarise(n = n()) %>%
arrange(-n) %>%
head()
cdr_example_aggrActivities %>%
kable()
| site_id.y | n |
|---|---|
| 264 | 22444 |
| 498 | 2809 |
| 215 | 918 |
| 147 | 783 |
| 817 | 691 |
| 1161 | 537 |
According to the assumptions we defined earlier, we can see from the results that antenna (site) 264 is the home location and the antenna with id 498 is the workplace. Where are they located in space? To answer that question we have to use the coordinates as well:
cdr_example_aggrActivities <- cdr_example2 %>%
group_by(site_id.y, X, Y) %>%
tally() %>%
arrange(desc(n)) %>%
head()
cdr_example_aggrActivities %>%
kable()
| site_id.y | X | Y | n |
|---|---|---|---|
| 264 | 616661.9 | 6518169 | 22444 |
| 498 | 615088.8 | 6502337 | 2809 |
| 215 | 632312.8 | 6525554 | 918 |
| 147 | 635306.9 | 6513513 | 783 |
| 817 | 559703.7 | 6581428 | 691 |
| 1161 | 627174.9 | 6512670 | 537 |
And then we put possible home and work places on the map:
ggplot()+
theme_minimal()+
geom_sf(data = county_smpl, colour = "grey", fill = "snow", size = 0.25)+
geom_point(data = cdr_example_aggrActivities[1,], aes(x = X, y = Y, colour = "home"), size = 2)+ # [1,] selects first row in table
geom_point(data = cdr_example_aggrActivities[2,], aes(x = X, y = Y, colour = "work"), size = 2)+
scale_colour_manual(values = c("orange", "dodgerblue"))
CDR data is commonly used in mobility studies. The accuracy of the
data doesn’t allow exact turn-by-turn analysis of movement, but
it’s a priceless source for analysing commuting, tourism, migration
etc. patterns.
Again, let’s check the column names. And then plot the paths on the map.
glimpse(cdr_example2)
## Rows: 42,112
## Columns: 10
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09:08…
## $ site_id.y <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 498,…
## $ pos_usr_id <chr> "John", "John", "John", "John", "John", "John", "John", "Jo…
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01…
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, 9, …
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56, 5…
## $ lat <dbl> 58.78721, 58.78721, 58.66305, 58.84897, 58.84897, 58.74500,…
## $ lon <dbl> 26.01778, 26.01778, 26.09499, 26.29264, 26.29264, 26.39861,…
## $ X <dbl> 616661.9, 616661.9, 621556.7, 632312.8, 632312.8, 638839.8,…
## $ Y <dbl> 6518169, 6518169, 6504483, 6525554, 6525554, 6514195, 65023…
ggplot()+
geom_sf(data = county_smpl, colour = "grey", fill = "snow", size = 0.25)+
geom_path(data = cdr_example2, aes(x = X, y = Y), linewidth = 0.2, col = "red")
The result looks like a plate full of spaghetti; it’s very hard to
understand which places were visited more often and which less
often.
Maybe changing the transparency (alpha) value helps?
ggplot()+
geom_path(data = cdr_example2, aes(x = X, y = Y), linewidth = 0.2, col = "red", alpha = 0.5)
Nothing better… That’s because the path is created as a single line.
But there is a solution to improve it!
We should draw a separate line between every two location points! For that
geom_segment
can be used. To define the end of a segment we use lead(),
which takes the value from the next row:
ggplot()+
geom_sf(data = county_smpl, colour = "grey", fill = "snow", size = 0.25)+
geom_segment(data = cdr_example2, aes(x = X, xend = lead(X, 1), y = Y, yend = lead(Y, 1)), alpha = 0.05, linewidth = 0.25)
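If lead() feels unfamiliar: it simply shifts a vector one step, so each row sees the value of the following row (the last row gets NA):

```r
library(dplyr)

x <- c(10, 20, 30)
lead(x, 1)
## 20 30 NA
```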
Or maybe curves?
ggplot()+
geom_sf(data = county_smpl, colour = "grey", fill = "snow", size = 0.25)+
geom_curve(data = cdr_example2, aes(x = X, xend = lead(X, 1), y = Y, yend = lead(Y, 1)), alpha = 0.05, linewidth = 0.25, curvature = 0.15)
Error! A curve cannot start and end in the same location!
# filter:
cdr_example3 <- cdr_example2 %>%
filter(X != lead(X,1))
ggplot()+
geom_sf(data = county_smpl, colour = "grey", fill = "snow", size = 0.25)+
geom_curve(data = cdr_example3, aes(x = X, xend = lead(X, 1), y = Y, yend = lead(Y, 1)), alpha = 0.025, linewidth = 0.25, curvature = 0.2, colour = "red4")
Still an error?
Let’s tidy the table:
# filter:
cdr_example3_A <- cdr_example2 %>%
mutate(Xend = lead(X, 1),
Yend = lead(Y, 1)) %>%
filter(!is.na(X)) %>%
filter(!is.na(Y)) %>%
filter(X != Xend)
Plot the results:
ggplot()+
geom_sf(data = county_smpl, colour = "grey", fill = "snow", size = 0.25)+
geom_curve(data = cdr_example3_A, aes(x = X, xend = Xend, y = Y, yend = Yend), alpha = 0.025, linewidth = 0.25, curvature = 0.2, colour = "red4")
Pretty good!
Sometimes we want to use dark background for the thematic layers. We
can define it manually in theme():
ggplot()+
theme(
plot.background = element_rect(fill = "black"),
panel.background = element_rect(fill = "black"),
panel.grid = element_line(color = "grey30"),
plot.title = element_text(color = "cyan"))+
geom_sf(data = county_smpl, colour = "grey20", fill = "grey10", size = 0.25)+
geom_curve(data = cdr_example3_A, aes(x = X, xend = Xend, y = Y, yend = Yend), alpha = 0.025, linewidth = 0.25, curvature = 0.2, colour = "cyan")+
labs(title = "Movement of John in Estonia")
The previous map is quite attractive and informative. But the lines and
arcs are not geometric objects, just a visualization in ggplot2.
For proper spatial analysis we need a real spatial object. This
means we have to convert the sequence of points to lines. Let’s set
ourselves the task of creating a separate path for every day.
We must remember that drawing a path or line requires at least two
locations (points). This means we have to filter out the days with
only one location record:
# create sf-object:
cdr_example3_sf <- cdr_example2 %>%
filter(!is.na(X)) %>%
st_as_sf(coords = c("X", "Y"), crs = 3301)
# select days with 1 location:
cdr_example3_sf_LINE_PTS <- cdr_example3_sf %>%
st_drop_geometry() %>%
group_by(date) %>%
tally() %>%
ungroup() %>%
filter(n == 1)
# remove days with 1 location:
cdr_example3_sf_LINE <- anti_join(cdr_example3_sf, cdr_example3_sf_LINE_PTS)
glimpse(cdr_example3_sf_LINE)
## Rows: 42,039
## Columns: 9
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09:08…
## $ site_id.y <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 498,…
## $ pos_usr_id <chr> "John", "John", "John", "John", "John", "John", "John", "Jo…
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01…
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, 9, …
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56, 5…
## $ lat <dbl> 58.78721, 58.78721, 58.66305, 58.84897, 58.84897, 58.74500,…
## $ lon <dbl> 26.01778, 26.01778, 26.09499, 26.29264, 26.29264, 26.39861,…
## $ geometry <POINT [m]> POINT (616661.9 6518169), POINT (616661.9 6518169), P…
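anti_join() keeps only the rows of the first table that have no match in the second; when by is omitted (as above), all shared column names are used as the key. A toy illustration, independent of the CDR data:

```r
library(dplyr)

x <- tibble(date = as.Date(c("2008-01-01", "2008-01-02", "2008-01-03")))
y <- tibble(date = as.Date("2008-01-02"))   # dates to drop

# returns the rows of x whose date does not occur in y:
anti_join(x, y, by = "date")
# keeps the rows for 2008-01-01 and 2008-01-03
```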
And now we can create the spatial paths (do_union = FALSE keeps the points in their original order, so st_cast() connects them sequentially):
cdr_example3_sf_LINE <- cdr_example3_sf_LINE %>%
  st_transform(4326) %>%
  group_by(pos_usr_id, date) %>%
  summarise(do_union = FALSE) %>%
  st_cast("LINESTRING")
cdr_example3_sf_LINE <- cdr_example3_sf_LINE %>%
  ungroup()
glimpse(cdr_example3_sf_LINE)
## Rows: 2,366
## Columns: 3
## $ pos_usr_id <chr> "John", "John", "John", "John", "John", "John", "John", "Jo…
## $ date <date> 2008-01-01, 2008-01-02, 2008-01-03, 2008-01-04, 2008-01-05…
## $ geometry <LINESTRING [°]> LINESTRING (25.42222 59.455..., LINESTRING (26.0…
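A quick sanity check (my addition, not part of the original workflow): every LINESTRING should be built from at least two vertices. st_coordinates() returns one row per vertex, with a feature id in column L1:

```r
library(sf)
library(dplyr)

# count the vertices of each daily path:
pts_per_line <- st_coordinates(cdr_example3_sf_LINE) %>%
  as_tibble() %>%
  count(L1)

min(pts_per_line$n)   # should be >= 2 after the anti_join filter
```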
Plot it!
#tmap:
tm_shape(county_smpl)+
  tm_borders("grey")+
tm_shape(cdr_example3_sf_LINE)+
  tm_lines("red")
With ggplot2:
ggplot()+
  geom_sf(data = county)+
  geom_sf(data = cdr_example3_sf_LINE)
Use the alpha parameter to set the transparency and find the optimal value:
ggplot()+
  geom_sf(data = county)+
  geom_sf(data = cdr_example3_sf_LINE,
          alpha = ???
  )
If the spatial context is missing, use
tmap_mode("view")!
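Switching tmap to interactive mode renders the same layers on a zoomable basemap, which supplies the missing spatial context (a sketch; remember to switch back afterwards):

```r
library(tmap)

tmap_mode("view")   # interactive, zoomable basemap

tm_shape(cdr_example3_sf_LINE)+
  tm_lines("red")

tmap_mode("plot")   # back to static maps
```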
On which day did the person move the most?
cdr_example3_sf_LINE$LN_day_mileage <- as.numeric(st_length(cdr_example3_sf_LINE)) / 1000
# Why did we use as.numeric()?
# Why divide by 1000?
# Find the max value:
cdr_example3_sf_LINE %>%
  st_drop_geometry() %>%
  filter(LN_day_mileage == max(LN_day_mileage))
## # A tibble: 1 × 3
## pos_usr_id date LN_day_mileage
## <chr> <date> <dbl>
## 1 John 2011-01-20 1051.
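As a hint for the questions in the comments above: st_length() returns a 'units' vector in metres, so dividing by 1000 converts to kilometres, and as.numeric() drops the unit class so the values behave as plain numbers. The conversion can also be made explicit with the units package (a sketch; units is installed together with sf):

```r
library(sf)
library(units)

len_m  <- st_length(cdr_example3_sf_LINE)   # units: [m]
len_km <- set_units(len_m, km)              # explicit conversion to [km]
as.numeric(head(len_km))                    # plain numbers, unit class dropped
```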
Write the query and design a thematic map:
How long was the average daily distance traveled?
Visualize the distribution with a histogram:
What can we see from this plot?
And now try to draw the density plot (geom_density()) and
add the mean/median to the plot (calculate the mean/median and use
geom_vline()):
In addition to daily distance, activity spaces are also used to describe daily movement patterns.
According to Wikipedia, an activity space designates the “set of places individuals encounter as a result of their routine activities in everyday life.”
There are several different approaches and geometries available to characterize the activity space (Smith L, 2019).
In the current example, we look at how to calculate the Minimum Convex Polygon (MCP):
cdr_example3_sf_MCP <- cdr_example3_sf %>%
  group_by(pos_usr_id, date) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_convex_hull() %>%
  ungroup()
Plot the output:
#tm_shape(cdr_example3_sf_MCP)+
# tm_polygons("red", alpha = .1)
ggplot()+
  geom_sf(data = county)+
  geom_sf(data = cdr_example3_sf_MCP,
          fill = "dodgerblue",
          color = "blue",
          linewidth = .3,
          alpha = .2)
Calculate the area and perimeter of the polygons:
cdr_example3_sf_MCP$MCP_area <- st_area(cdr_example3_sf_MCP)
cdr_example3_sf_MCP$MCP_perimeter <- st_perimeter(cdr_example3_sf_MCP)
## Warning in st_perimeter(cdr_example3_sf_MCP): 'st_perimeter' is deprecated.
## Use 'sf::st_perimeter or lwgeom::st_perimeter_lwgeom' instead.
## See help("Deprecated")
To calculate the perimeter, you have to install the
lwgeom library! (And don’t forget to load it! :) )
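Following the deprecation warning above, the same perimeter can be computed with lwgeom directly (a sketch, assuming the lwgeom package is installed and loaded):

```r
library(lwgeom)

# Same result as st_perimeter(), without the deprecation warning:
cdr_example3_sf_MCP$MCP_perimeter <- st_perimeter_lwgeom(cdr_example3_sf_MCP)
```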
What is the average activity space area in square kilometers? (Units!)
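A hint for the units: st_area() also returns a 'units' vector (in m²), and the units package converts it without any manual arithmetic (a sketch; the column name MCP_area_km2 is my own choice):

```r
library(units)

# Convert square metres to square kilometres explicitly:
cdr_example3_sf_MCP$MCP_area_km2 <- set_units(cdr_example3_sf_MCP$MCP_area, km^2)
```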
CDR data is not used only in mobility studies; it can reveal much more about the person carrying the phone. If we assume that starting a phone call or sending an SMS is an act of free will, then we can start to analyse the temporal behaviour of the phone user. For that, let us also calculate the weekday names:
# R returns weekday names in the language defined in your computer's locale settings
cdr_example2 <- cdr_example2 %>%
  mutate(wday = lubridate::wday(time, abbr = T, label = T))
glimpse(cdr_example2)
## Rows: 42,112
## Columns: 11
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09:08…
## $ site_id.y <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 498,…
## $ pos_usr_id <chr> "John", "John", "John", "John", "John", "John", "John", "Jo…
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01…
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, 9, …
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56, 5…
## $ lat <dbl> 58.78721, 58.78721, 58.66305, 58.84897, 58.84897, 58.74500,…
## $ lon <dbl> 26.01778, 26.01778, 26.09499, 26.29264, 26.29264, 26.39861,…
## $ X <dbl> 616661.9, 616661.9, 621556.7, 632312.8, 632312.8, 638839.8,…
## $ Y <dbl> 6518169, 6518169, 6504483, 6525554, 6525554, 6514195, 65023…
## $ wday <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue,…
# list weekdays:
cdr_example2 %>%
  distinct(wday)
## wday
## 1 Tue
## 2 Wed
## 3 Thu
## 4 Fri
## 5 Sat
## 6 Sun
## 7 Mon
# reorder them:
cdr_example2$wday <- factor(cdr_example2$wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
We want to analyse the temporal activity of the phone user. Make a graph showing the diurnal activity during the week. Let’s calculate some calling-time percentiles and add them to the plot (we may assume that the active day starts when 10% of the calls have been made, midday arrives when 50% have been made, and the evening starts when 90% have been made):
# calculate time in decimal format:
cdr_example2 <- cdr_example2 %>%
  mutate(time_dec = hour + (minute / 60))
glimpse(cdr_example2)
## Rows: 42,112
## Columns: 12
## $ time <dttm> 2008-04-01 08:40:45, 2008-04-01 08:56:34, 2008-04-01 09:08…
## $ site_id.y <int> 264, 264, 348, 215, 215, 148, 498, 498, 498, 498, 498, 498,…
## $ pos_usr_id <chr> "John", "John", "John", "John", "John", "John", "John", "Jo…
## $ date <date> 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01, 2008-04-01…
## $ hour <int> 8, 8, 9, 9, 9, 9, 10, 10, 16, 16, 16, 16, 17, 17, 8, 8, 9, …
## $ minute <int> 40, 56, 8, 11, 15, 22, 31, 49, 31, 39, 41, 54, 6, 42, 56, 5…
## $ lat <dbl> 58.78721, 58.78721, 58.66305, 58.84897, 58.84897, 58.74500,…
## $ lon <dbl> 26.01778, 26.01778, 26.09499, 26.29264, 26.29264, 26.39861,…
## $ X <dbl> 616661.9, 616661.9, 621556.7, 632312.8, 632312.8, 638839.8,…
## $ Y <dbl> 6518169, 6518169, 6504483, 6525554, 6525554, 6514195, 65023…
## $ wday <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue,…
## $ time_dec <dbl> 8.666667, 8.933333, 9.133333, 9.183333, 9.250000, 9.366667,…
cdr_quart <- cdr_example2 %>%
  group_by(wday) %>%
  summarise(q1 = quantile(time_dec, 0.10),
            q2 = quantile(time_dec, 0.5),
            q3 = quantile(time_dec, 0.90)) %>%
  ungroup()
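One way to put these percentiles on a plot (a sketch, not the only possible design): draw the diurnal density per weekday and add the 10%, 50% and 90% calling-time markers with geom_vline(); the wday column in cdr_quart matches each line to its facet:

```r
library(ggplot2)

ggplot()+
  geom_density(data = cdr_example2, aes(x = time_dec))+
  geom_vline(data = cdr_quart, aes(xintercept = q1), linetype = "dashed")+
  geom_vline(data = cdr_quart, aes(xintercept = q2))+
  geom_vline(data = cdr_quart, aes(xintercept = q3), linetype = "dashed")+
  facet_wrap(~wday)+
  scale_x_continuous(breaks = seq(0, 24, 6))+
  labs(x = "time of day (decimal hours)", y = "density")
```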
A more traditional way is to count calling activities by hour:
cdr_example_hourAggr <- cdr_example2 %>%
  group_by(wday, hour) %>%
  summarise(n = n()) %>%
  ungroup()
## `summarise()` has grouped output by 'wday'. You can override using the
## `.groups` argument.
ggplot()+
  geom_line(data = cdr_example_hourAggr, aes(x = hour, y = n, group = wday, colour = as.factor(wday)))+
  facet_wrap(~wday)+
  scale_colour_manual(values = c("dodgerblue", "steelblue", "blue", "navyblue", "purple", "orange", "red"))+
  scale_x_continuous(breaks = seq(0, 24, 3))+
  labs(y = "number of calls", colour = "day")
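The object cdr_example_aggrActivities used below was created in an earlier step. A sketch of how such a ranking could be built (an assumption on my part, not necessarily the original code): count the activities per antenna and sort the most used antenna first:

```r
library(dplyr)

# one row per antenna (site), ordered by number of recorded activities:
cdr_example_aggrActivities <- cdr_example2 %>%
  group_by(site_id.y, X, Y) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
```

With this ordering, row 1 (the most used antenna) is taken as the home antenna and row 2 as the workplace antenna.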
#Home antenna:
cdr_example_aggrActivities[1,]
## # A tibble: 1 × 4
## # Groups: site_id.y, X [1]
## site_id.y X Y n
## <int> <dbl> <dbl> <int>
## 1 264 616662. 6518169. 22444
#workplace antenna
cdr_example_aggrActivities[2,]
## # A tibble: 1 × 4
## # Groups: site_id.y, X [1]
## site_id.y X Y n
## <int> <dbl> <dbl> <int>
## 1 498 615089. 6502337. 2809
# Why did we use '[ ]'?
cdr_example_H <- cdr_example2 %>%
  filter(site_id.y == cdr_example_aggrActivities$site_id.y[1])
cdr_example_W <- cdr_example2 %>%
  filter(site_id.y == cdr_example_aggrActivities$site_id.y[2])
cdr_example_H_timeAggr <- cdr_example_H %>%
  group_by(wday, hour) %>%
  summarise(n = n()) %>%
  ungroup()
cdr_example_W_timeAggr <- cdr_example_W %>%
  group_by(wday, hour) %>%
  summarise(n = n()) %>%
  ungroup()
ggplot()+
  geom_line(data = cdr_example_H_timeAggr, aes(x = hour, y = n, colour = "home"))+
  geom_line(data = cdr_example_W_timeAggr, aes(x = hour, y = n, colour = "work"))+
  scale_x_continuous(breaks = seq(0, 24, 3))+
  facet_wrap(~wday)
What do you think: was the definition of home and work locations successful?
Author: Anto Aasa
LTOM.02.041
Last update: 2025-05-28 16:34:01.918698