2020-03-20

Plotting coronavirus cases in R

Introduction

We want to show the evolution of the coronavirus cases using R creating static and interactive plots.

Plots

  • Interactive (linear scale)
  • Interactive (log scale)
  • Solution

    We use the data repository created by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). There are three time-series: confirmed, deaths and recovered cases. First we will prepare the data and then plot the time-series using ggplot2 for the static version and plotly to add interactivity. The data source includes cases across the world, but in our example we will subset the time-series for Germany, France, Italy, Spain, and the United Kingdom.

    # Libraries
    library(magrittr)
    library(lubridate) 
    library(tidyverse)
    library(plotly)
    library(scales)
    
    # Importing data
    confirmed <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
    deaths <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv")
    recovered <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv")
    
    # Data preparation
    AppendMe <- function(dfNames) {
      do.call(rbind, lapply(dfNames, function(x) {
        cbind(get(x), source = x)
      }))
    }
    df <- AppendMe(c("confirmed", "deaths", "recovered"))
    data <- df %>%
      rename(province = `Province/State`, country = `Country/Region`) %>% 
      pivot_longer(
        -c(province, country, Lat, Long, source),
        names_to = "date",
        values_to = "count"
      ) %>% 
    mutate(date = mdy(date)) 
    
    # Plot linear scale
    p <- data %>%
      filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
      group_by(country, date, source) %>%
      summarise(n = sum(count)) %>%
      ggplot(aes(date, n, colour = country)) +
      geom_line(linetype = 2) +
      geom_point(size = 1) +
      facet_wrap( ~  source  , scales = "free", nrow = 3) +
      theme_bw()+
      labs(title = "Cumulative Covid-19 cases (linear scale)")+
      ylab("")+
      scale_x_date(date_labels = "%b %d")+
      scale_y_continuous(labels = comma)
    p # Static
    ggplotly(p) # Interactive
    
    # Plot log scale
    p <- data %>%
      filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
      group_by(country, date, source) %>%
      summarise(n = sum(count)) %>%
      ggplot(aes(date, n, colour = country)) +
      geom_line(linetype = 2) +
      geom_point(size = 1) +
      facet_wrap( ~  source, scales = "free",  nrow = 3) +
      theme_bw()+
      labs(title = "Cumulative Covid-19 cases (log scale)")+
      ylab("")+
      scale_x_date(date_labels = "%b %d")+
      scale_y_log10(breaks = c(1, 10, 100, 10000))
      p 
    ggplotly(p) 
    
    To highlight a series while hovering over it, we use the function highlightfrom the plotly package.

    p <- data %>%
      filter(country %in% c("Germany", "France", "Italy",  "Spain", "United Kingdom")) %>% 
      group_by(country, date, source) %>%
      summarise(cases = sum(count)) %>%
      highlight_key(~ country ) %>% 
      ggplot(aes(date, cases, colour = country)) +
      geom_line(linetype = 2)+
      geom_point(size = 1) +
      facet_wrap(~  source  , scales = "free", nrow = 3)+
      theme_bw()+
      labs(title = "Cumulative Covid-19 cases (linear scale)")+
      ylab("")+
      scale_x_date(date_labels = "%b %d")+
      scale_y_continuous(labels = comma)
    ggplotly(p, tooltip = c("country", "date", "cases")) %>% 
    highlight(on = "plotly_hover")
    
    Plot here. Screenshot below.

    References

    2020-03-16

    Variaciones diarias de acciones o índices en R


    Problema

    Queremos crear un mapa de calor en forma de calendario con las variaciones diarias de acciones o índices en R.

    Solución

    1. Ejecutamos el código de Paul Bleicher para crear la función calendarHeat. Editamos el código fuente para modificar la paleta de colores si fuera necesario.
    2. Extraemos los datos usando la función getSymbols.
    3. Calculamos las variaciones diarias.
    4. Creamos el gráfico de las series temporales.
    Un par de ejemplos:

    1. Dow Jones, emulando la paleta usada por Mike Bostock here. Los días del índice son verde cuando suben y rosas cuando bajan. Los varaciones están calculadas en porcentaje.
    2. library(tidyverse)
      library(tidyquant)
      # Dow Jones
      symb <- getSymbols(Symbols = "^DJI", QQQ = 'yahoo', auto.assign = FALSE)
      n <- gsub("^.*\\.", "", names(symb))
      symb <- as.data.frame(symb)
      colnames(symb) <- n
      symb$date <- rownames(symb)
      rownames(symb) <- NULL
      df <- symb %>%
        mutate(date = as.Date(date),
               pct_vol = round(100 * (Adjusted / lag(Adjusted) - 1), 2)) %>%
        filter(!is.na(pct_vol), date >= '2016-01-01', date <= '2020-12-31') # o filter(!is.na(pct_vol), date >= '2011-01-01', date <= '2015-12-31')
      calendarHeat(
        df$date,
        df$pct_vol,
        varname = "Dow Jones Industrial Average",
        ncolors = 50,
        color = "g2p"
      )
      
    3. S&P500, usando otra paleta de azul a rojo. Los varaciones están calculadas en porcentaje.
    4. symb <- getSymbols(Symbols = "^GSPC", QQQ = 'yahoo', auto.assign = FALSE)
      n <- gsub("^.*\\.", "", names(symb))
      symb <- as.data.frame(symb)
      colnames(symb) <- n
      symb$date <- rownames(symb)
      rownames(symb) <- NULL
      df <- symb %>%
        mutate(date = as.Date(date),
               pct_vol = round(100 * (Adjusted / lag(Adjusted) - 1), 2)) %>%
        filter(!is.na(pct_vol), date >= '2016-01-01', date <= '2020-12-31')
      calendarHeat(
        df$date,
        df$pct_vol,
        varname = "S&P 500",
        ncolors = 50,
        color = "b2r"
      )
      

    Resultados

    En los siguentes gráficos, las últimas variaciones diarias desencadenadas por la pandemia del coronavirus (COVID-19) hacen que el resto de días aparezcan muy pálidos en comparación.

    Entradas relacionadas

    Daily changes of stocks in R


    Problem

    We want to create a calendar heatmap with the daily changes of stocks or indexes in R.

    Solution

    1. We run the calendarHeat function created by Paul Bleicher to display calendar heatmaps. Editing the palettes in source code if needed.
    2. We extract the stock or index data using getSymbols.
    3. We calculate the daily changes.
    4. We plot the time series.
    Let's see a couple of examples:

    1. Dow Jones, emulating the palette used by Mike Bostock here. Days the index went up are green, and down are pink. The changes are in percentages.
    2. library(tidyverse)
      library(tidyquant)
      # Dow Jones
      symb <- getSymbols(Symbols = "^DJI", QQQ = 'yahoo', auto.assign = FALSE)
      n <- gsub("^.*\\.", "", names(symb))
      symb <- as.data.frame(symb)
      colnames(symb) <- n
      symb$date <- rownames(symb)
      rownames(symb) <- NULL
      df <- symb %>%
        mutate(date = as.Date(date),
               pct_vol = round(100 * (Adjusted / lag(Adjusted) - 1), 2)) %>%
        filter(!is.na(pct_vol), date >= '2016-01-01', date <= '2020-12-31') # or filter(!is.na(pct_vol), date >= '2011-01-01', date <= '2015-12-31')
      calendarHeat(
        df$date,
        df$pct_vol,
        varname = "Dow Jones Industrial Average",
        ncolors = 50,
        color = "g2p"
      )
      
    3. S&P500, using another palette from blue to red. The changes are in percentages.
    4. symb <- getSymbols(Symbols = "^GSPC", QQQ = 'yahoo', auto.assign = FALSE)
      n <- gsub("^.*\\.", "", names(symb))
      symb <- as.data.frame(symb)
      colnames(symb) <- n
      symb$date <- rownames(symb)
      rownames(symb) <- NULL
      df <- symb %>%
        mutate(date = as.Date(date),
               pct_vol = round(100 * (Adjusted / lag(Adjusted) - 1), 2)) %>%
        filter(!is.na(pct_vol), date >= '2016-01-01', date <= '2020-12-31')
      calendarHeat(
        df$date,
        df$pct_vol,
        varname = "S&P 500",
        ncolors = 50,
        color = "b2r"
      )
      

    Results

    In the next two plots, because of the large daily variations in the last days due to the coronavirus pandemic (COVID-19), the rest of the days are very pale in comparison.

    Related posts

    2020-03-13

    How to create a calendar heatmap in R


    Problem

    We want to create a calendar heatmap for a time series in R.

    Solution

    First we run the calendarHeat function created by Paul Bleicher to display calendar heatmaps. Secondly we create some random time series data. Finally we plot the time series in two palettes.

    library(tidyverse)
    # Random data
    set.seed(2015)
    df <-
      data.frame(dates = sample(seq(
        as.Date('2018-01-01'),
        as.Date('2020-12-31'),
        by = "day"
      ), 1000))
    df <- df %>% mutate(sessions = floor(runif(nrow(.), 1, 101)) )
    # Two calendar with different palettes
    calendarHeat(df$dates, df$sessions, varname = "Sessions")
    calendarHeat(df$dates, df$sessions, varname = "Sessions", ncolors = 99, color = "r2b" )
    

    Results

    Related posts

    2020-03-08

    Calendar heatmap by hour and weekday in ggplot2

    Problem

    We want to create a calendar heatmap using ggplot2.

    Solution

    In our example we generate random dates, and then group the results by hour of the day and day of the week.

    library(tidyverse)
    library(lubridate)
    # Data
    set.seed(2020)
    df <-
      data.frame(dates = sample(seq(
        as.POSIXct('2019/01/01', tz = "CET"),
        as.POSIXct('2019/12/31', tz = "CET"),
        by = "sec"
      ), 1000))
    
    # Data manipulation
    df %>%
      mutate(days = factor(weekdays(dates, abbreviate = TRUE),
                           levels = rev(
                             c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
                           ))) %>%
      group_by(hours= hour(dates), days) %>%
      summarise(sessions = n()) %>%
    
    # Plot
    ggplot(aes(hours, days)) +
      geom_tile(aes(fill = sessions), colour = "white") +
      scale_fill_distiller(palette = "YlGnBu", direction = 1) +
      scale_x_continuous(breaks = 0:23) +
      theme_minimal() +
      theme(
        legend.position = "bottom",
        legend.key.width = unit(2, "cm"),
        panel.grid = element_blank()
      ) +
      coord_equal()
    

    Results

    Related posts

    References

    2020-03-07

    Show the first and last n lines of a data frame in R

    Problem

    We want to show the first and last n lines of a data frame. The equivalent of using the head and tail functions at the same time.

    Solution

    We can use the function headTail from the psych package.

    1. Defatult options
    2. By default headTail returns the top 4 and bottom 4 lines separated with dots (ellipsis).

      library(psych)
      headTail(ToothGrowth)
      
           len supp dose
      1    4.2   VC  0.5
      2   11.5   VC  0.5
      3    7.3   VC  0.5
      4    5.8   VC  0.5
      ...  ...   ...
      57  26.4   OJ    2
      58  27.3   OJ    2
      59  29.4   OJ    2
      60    23   OJ    2
      
    3. Arguments
    4. We can control tne number of lines to show at the top and the bottom, the ellipsis (how top and bottom are separared), number of columns to show and round the number of digits. Some examples:

      # The first and last 2 lines without separation
      headTail(ToothGrowth, top = 2, bottom = 2, ellipsis = FALSE)
      
          len supp dose
      1   4.2   VC  0.5
      2  11.5   VC  0.5
      59 29.4   OJ  2.0
      60 23.0   OJ  2.0
      
      # The first and last 3, from column 4 to 5 with no decimals.
      headTail(iris, top = 3, bottom = 3, digits = 0 , from = 4, to = 5)
      
         Petal.Width   Species
      1             0    setosa
      2             0    setosa
      3             0    setosa
      ...         ...      
      148           2 virginica
      149           2 virginica
      150           2 virginica
      

    Related posts

    References

    2020-03-05

    Descriptive statistics by group in R

    Title

    Problem

    We'd like to report descriptive statistics in R by a grouping variable and subsetting the output statistics.

    Solution

    We will use the data frame iris, columns Sepal.Length and Sepal.Width and grouping by Species. In our example, we want to return the mean, the standard deviation, the skewness and kurtosis.

  • Subset of descriptive statistics by group
  • library(psych)
    # Variables by index
    d <- describeBy(iris[1:2], group = iris$Species)
    # Two options to subset the statistics:
    lapply(d, "[", , c(3, 4, 11, 12))
    lapply(d, subset, , c(3, 4, 11, 12)) 
    
    # Variables by name
    i <- match(c("Sepal.Length", "Petal.Length"), names(iris))
    d <- describeBy(iris[i], group = iris$Species)
    lapply(d, subset, , c("mean", "sd", "skew", "kurtosis")) 
    
    $setosa
                 mean   sd skew kurtosis
    Sepal.Length 5.01 0.35 0.11    -0.45
    Sepal.Width  3.43 0.38 0.04     0.60
    
    $versicolor
                 mean   sd  skew kurtosis
    Sepal.Length 5.94 0.52  0.10    -0.69
    Sepal.Width  2.77 0.31 -0.34    -0.55
    
    $virginica
                 mean   sd skew kurtosis
    Sepal.Length 6.59 0.64 0.11    -0.20
    Sepal.Width  2.97 0.32 0.34     0.38
    
  • Subset of descriptive statistics without grouping
  • # Seleccionamos las columnas deseadas de la tabla
    d <- describe(iris[1:2])
    # Subsetting output statistics
    d[, c(3, 4, 11, 12)]
    
                 mean   sd skew kurtosis
    Sepal.Length 5.84 0.83 0.31    -0.61
    Sepal.Width  3.06 0.44 0.31     0.14
    

    References

    Nube de datos