Nube de datos: marzo 2020

2020-03-20

Plotting coronavirus cases in R

Introduction

We want to show the evolution of the coronavirus cases using R creating static and interactive plots.

Plots

Interactive (linear scale)

Interactive (log scale)

Solution

We use the data repository created by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). There are three time-series: confirmed, deaths and recovered cases. First we will prepare the data and then plot the time-series using ggplot2 for the static version and plotly to add interactivity. The data source includes cases across the world, but in our example we will subset the time-series for Germany, France, Italy, Spain, and the United Kingdom.

# Libraries
library(magrittr)
library(lubridate) 
library(tidyverse)
library(plotly)
library(scales)

# Importing data
confirmed <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
deaths <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv")
recovered <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv")

# Data preparation
AppendMe <- function(dfNames) {
  do.call(rbind, lapply(dfNames, function(x) {
    cbind(get(x), source = x)
  }))
}
df <- AppendMe(c("confirmed", "deaths", "recovered"))
data <- df %>%
  rename(province = `Province/State`, country = `Country/Region`) %>% 
  pivot_longer(
    -c(province, country, Lat, Long, source),
    names_to = "date",
    values_to = "count"
  ) %>% 
mutate(date = mdy(date)) 

# Plot linear scale
p <- data %>%
  filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
  group_by(country, date, source) %>%
  summarise(n = sum(count)) %>%
  ggplot(aes(date, n, colour = country)) +
  geom_line(linetype = 2) +
  geom_point(size = 1) +
  facet_wrap( ~  source  , scales = "free", nrow = 3) +
  theme_bw()+
  labs(title = "Cumulative Covid-19 cases (linear scale)")+
  ylab("")+
  scale_x_date(date_labels = "%b %d")+
  scale_y_continuous(labels = comma)
p # Static
ggplotly(p) # Interactive

# Plot log scale
p <- data %>%
  filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
  group_by(country, date, source) %>%
  summarise(n = sum(count)) %>%
  ggplot(aes(date, n, colour = country)) +
  geom_line(linetype = 2) +
  geom_point(size = 1) +
  facet_wrap( ~  source, scales = "free",  nrow = 3) +
  theme_bw()+
  labs(title = "Cumulative Covid-19 cases (log scale)")+
  ylab("")+
  scale_x_date(date_labels = "%b %d")+
  scale_y_log10(breaks = c(1, 10, 100, 10000))
  p 
ggplotly(p)

To highlight a series while hovering over it, we use the function highlightfrom the plotly package.

p <- data %>%
  filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
  group_by(country, date, source) %>%
  summarise(cases = sum(count)) %>%
  highlight_key(~ country ) %>% 
  ggplot(aes(date, cases, colour = country)) +
  geom_line(linetype = 2)+
  geom_point(size = 1) +
  facet_wrap(~  source  , scales = "free", nrow = 3)+
  theme_bw()+
  labs(title = "Cumulative Covid-19 cases (linear scale)")+
  ylab("")+
  scale_x_date(date_labels = "%b %d")+
  scale_y_continuous(labels = comma)
ggplotly(p, tooltip = c("country", "date", "cases")) %>% 
highlight(on = "plotly_hover")

Plot here. Screenshot below.

References

2020-03-16

Variaciones diarias de acciones o índices en R

Problema

Queremos crear un mapa de calor en forma de calendario con las variaciones diarias de acciones o índices en R.

Solución

Ejecutamos el código de Paul Bleicher para crear la función calendarHeat. Editamos el código fuente para modificar la paleta de colores si fuera necesario.
Extraemos los datos usando la función getSymbols.
Calculamos las variaciones diarias.
Creamos el gráfico de las series temporales.

Un par de ejemplos:

Dow Jones, emulando la paleta usada por Mike Bostock here. Los días del índice son verde cuando suben y rosas cuando bajan. Los varaciones están calculadas en porcentaje.

library(tidyverse)
library(tidyquant)
# Dow Jones
symb <- getSymbols(Symbols = "^DJI", QQQ = 'yahoo', auto.assign = FALSE)
n <- gsub("^.*\\.", "", names(symb))
symb <- as.data.frame(symb)
colnames(symb) <- n
symb$date <- rownames(symb)
rownames(symb) <- NULL
df <- symb %>%
  mutate(date = as.Date(date),
         pct_vol = round(100 * (Adjusted / lag(Adjusted) - 1), 2)) %>%
  filter(!is.na(pct_vol), date >= '2016-01-01', date <= '2020-12-31') # o filter(!is.na(pct_vol), date >= '2011-01-01', date <= '2015-12-31')
calendarHeat(
  df$date,
  df$pct_vol,
  varname = "Dow Jones Industrial Average",
  ncolors = 50,
  color = "g2p"
)

S&P500, usando otra paleta de azul a rojo. Los varaciones están calculadas en porcentaje.

symb <- getSymbols(Symbols = "^GSPC", QQQ = 'yahoo', auto.assign = FALSE)
n <- gsub("^.*\\.", "", names(symb))
symb <- as.data.frame(symb)
colnames(symb) <- n
symb$date <- rownames(symb)
rownames(symb) <- NULL
df <- symb %>%
  mutate(date = as.Date(date),
         pct_vol = round(100 * (Adjusted / lag(Adjusted) - 1), 2)) %>%
  filter(!is.na(pct_vol), date >= '2016-01-01', date <= '2020-12-31')
calendarHeat(
  df$date,
  df$pct_vol,
  varname = "S&P 500",
  ncolors = 50,
  color = "b2r"
)

Resultados

En los siguentes gráficos, las últimas variaciones diarias desencadenadas por la pandemia del coronavirus (COVID-19) hacen que el resto de días aparezcan muy pálidos en comparación.

Entradas relacionadas

Daily changes of stocks in R

Problem

We want to create a calendar heatmap with the daily changes of stocks or indexes in R.

Solution

We run the calendarHeat function created by Paul Bleicher to display calendar heatmaps. Editing the palettes in source code if needed.
We extract the stock or index data using getSymbols.
We calculate the daily changes.
We plot the time series.

Let's see a couple of examples:

Dow Jones, emulating the palette used by Mike Bostock here. Days the index went up are green, and down are pink. The changes are in percentages.

library(tidyverse)
library(tidyquant)
# Dow Jones
symb <- getSymbols(Symbols = "^DJI", QQQ = 'yahoo', auto.assign = FALSE)
n <- gsub("^.*\\.", "", names(symb))
symb <- as.data.frame(symb)
colnames(symb) <- n
symb$date <- rownames(symb)
rownames(symb) <- NULL
df <- symb %>%
  mutate(date = as.Date(date),
         pct_vol = round(100 * (Adjusted / lag(Adjusted) - 1), 2)) %>%
  filter(!is.na(pct_vol), date >= '2016-01-01', date <= '2020-12-31') # or filter(!is.na(pct_vol), date >= '2011-01-01', date <= '2015-12-31')
calendarHeat(
  df$date,
  df$pct_vol,
  varname = "Dow Jones Industrial Average",
  ncolors = 50,
  color = "g2p"
)

S&P500, using another palette from blue to red. The changes are in percentages.

symb <- getSymbols(Symbols = "^GSPC", QQQ = 'yahoo', auto.assign = FALSE)
n <- gsub("^.*\\.", "", names(symb))
symb <- as.data.frame(symb)
colnames(symb) <- n
symb$date <- rownames(symb)
rownames(symb) <- NULL
df <- symb %>%
  mutate(date = as.Date(date),
         pct_vol = round(100 * (Adjusted / lag(Adjusted) - 1), 2)) %>%
  filter(!is.na(pct_vol), date >= '2016-01-01', date <= '2020-12-31')
calendarHeat(
  df$date,
  df$pct_vol,
  varname = "S&P 500",
  ncolors = 50,
  color = "b2r"
)

Results

In the next two plots, because of the large daily variations in the last days due to the coronavirus pandemic (COVID-19), the rest of the days are very pale in comparison.

2020-03-13

How to create a calendar heatmap in R

Problem

We want to create a calendar heatmap for a time series in R.

Solution

First we run the calendarHeat function created by Paul Bleicher to display calendar heatmaps. Secondly we create some random time series data. Finally we plot the time series in two palettes.

library(tidyverse)
# Random data
set.seed(2015)
df <-
  data.frame(dates = sample(seq(
    as.Date('2018-01-01'),
    as.Date('2020-12-31'),
    by = "day"
  ), 1000))
df <- df %>% mutate(sessions = floor(runif(nrow(.), 1, 101)) )
# Two calendar with different palettes
calendarHeat(df$dates, df$sessions, varname = "Sessions")
calendarHeat(df$dates, df$sessions, varname = "Sessions", ncolors = 99, color = "r2b" )

Results

Calendar heatmap by hour and weekday in ggplot2

2020-03-08

Calendar heatmap by hour and weekday in ggplot2

Problem

We want to create a calendar heatmap using ggplot2.

Solution

In our example we generate random dates, and then group the results by hour of the day and day of the week.

library(tidyverse)
library(lubridate)
# Data
set.seed(2020)
df <-
  data.frame(dates = sample(seq(
    as.POSIXct('2019/01/01', tz = "CET"),
    as.POSIXct('2019/12/31', tz = "CET"),
    by = "sec"
  ), 1000))

# Data manipulation
df %>%
  mutate(days = factor(weekdays(dates, abbreviate = TRUE),
                       levels = rev(
                         c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
                       ))) %>%
  group_by(hours= hour(dates), days) %>%
  summarise(sessions = n()) %>%

# Plot
ggplot(aes(hours, days)) +
  geom_tile(aes(fill = sessions), colour = "white") +
  scale_fill_distiller(palette = "YlGnBu", direction = 1) +
  scale_x_continuous(breaks = 0:23) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.key.width = unit(2, "cm"),
    panel.grid = element_blank()
  ) +
  coord_equal()

Results

How to create a calendar heatmap in R

References

stackoverflow

2020-03-07

Show the first and last n lines of a data frame in R

Problem

We want to show the first and last n lines of a data frame. The equivalent of using the head and tail functions at the same time.

Solution

We can use the function headTail from the psych package.

Defatult options

By default headTail returns the top 4 and bottom 4 lines separated with dots (ellipsis).

library(psych)
headTail(ToothGrowth)

     len supp dose
1    4.2   VC  0.5
2   11.5   VC  0.5
3    7.3   VC  0.5
4    5.8   VC  0.5
...  ...   ...
57  26.4   OJ    2
58  27.3   OJ    2
59  29.4   OJ    2
60    23   OJ    2

Arguments

We can control tne number of lines to show at the top and the bottom, the ellipsis (how top and bottom are separared), number of columns to show and round the number of digits. Some examples:

# The first and last 2 lines without separation
headTail(ToothGrowth, top = 2, bottom = 2, ellipsis = FALSE)

    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
59 29.4   OJ  2.0
60 23.0   OJ  2.0

# The first and last 3, from column 4 to 5 with no decimals.
headTail(iris, top = 3, bottom = 3, digits = 0 , from = 4, to = 5)

   Petal.Width   Species
1             0    setosa
2             0    setosa
3             0    setosa
...         ...      
148           2 virginica
149           2 virginica
150           2 virginica

References

2020-03-05

Descriptive statistics by group in R

Title

Problem

We'd like to report descriptive statistics in R by a grouping variable and subsetting the output statistics.

Solution

We will use the data frame iris, columns Sepal.Length and Sepal.Width and grouping by Species. In our example, we want to return the mean, the standard deviation, the skewness and kurtosis.

Subset of descriptive statistics by group

library(psych)
# Variables by index
d <- describeBy(iris[1:2], group = iris$Species)
# Two options to subset the statistics:
lapply(d, "[", , c(3, 4, 11, 12))
lapply(d, subset, , c(3, 4, 11, 12))

# Variables by name
i <- match(c("Sepal.Length", "Petal.Length"), names(iris))
d <- describeBy(iris[i], group = iris$Species)
lapply(d, subset, , c("mean", "sd", "skew", "kurtosis"))

$setosa
             mean   sd skew kurtosis
Sepal.Length 5.01 0.35 0.11    -0.45
Sepal.Width  3.43 0.38 0.04     0.60

$versicolor
             mean   sd  skew kurtosis
Sepal.Length 5.94 0.52  0.10    -0.69
Sepal.Width  2.77 0.31 -0.34    -0.55

$virginica
             mean   sd skew kurtosis
Sepal.Length 6.59 0.64 0.11    -0.20
Sepal.Width  2.97 0.32 0.34     0.38

Subset of descriptive statistics without grouping

# Seleccionamos las columnas deseadas de la tabla
d <- describe(iris[1:2])
# Subsetting output statistics
d[, c(3, 4, 11, 12)]

             mean   sd skew kurtosis
Sepal.Length 5.84 0.83 0.31    -0.61
Sepal.Width  3.06 0.44 0.31     0.14

References

Descriptive statistics in R

2020-03-20

Introduction

Plots

Solution

References

2020-03-16

Problema

Solución

Resultados

Entradas relacionadas

Problem

Solution

Results

Related posts

2020-03-13

Problem

Solution

Results

Related posts

2020-03-08

Problem

Solution

Results

Related posts

References

2020-03-07

Problem

Solution

Related posts

References

2020-03-05

Problem

Solution

References