2019-05-23

How to apply a function across rows in R

Problem

We'd like to apply a function across rows in R. In our example, we will add two columns calculating the minimum and the median for each row.

df <- structure(list(V1 = c(5L, 4L, 7L), V2 = c(8L, 9L, 3L), V3 = c(12L, 
5L, 9L)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, 
-3L))
  V1 V2 V3
1  5  8 12
2  4  9  5
3  7  3  9

Solution

  • dplyr
  • library(dplyr)
    # Using the pipe operator %>%
    df %>% 
      rowwise() %>% 
      mutate(min = min(V1, V2, V3), median = median(c(V1, V2, V3)))
    # Without the pipe operator %>%
    mutate(rowwise(df), min = min(V1, V2, V3), median = median(c(V1, V2, V3)))
    

    Source: local data frame [3 x 5]
    Groups: 
    
         V1    V2    V3   min median
      (int) (int) (int) (int)  (int)
    1     5     8    12     5      8
    2     4     9     5     4      5
    3     7     3     9     3      7
    
  • Base R
  • df$min <- apply(df[, 1:3], 1, min)
    df$median <- apply(df[, 1:3], 1, median)

      V1 V2 V3 min median
    1  5  8 12   5      8
    2  4  9  5   4      5
    3  7  3  9   3      7
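
As a complement to the two solutions above, here is a minimal base R sketch that computes both statistics in a single pass over the rows. The helper name row_stats is hypothetical and not part of the original post, and the sketch assumes that every column is numeric.

```r
# Same example data as above
df <- data.frame(V1 = c(5L, 4L, 7L), V2 = c(8L, 9L, 3L), V3 = c(12L, 5L, 9L))

# Hypothetical helper: returns a named vector with both statistics for one row
row_stats <- function(r) c(min = min(r), median = median(r))

# apply() yields one column per row here, so we transpose before binding
stats <- t(apply(df, 1, row_stats))
result <- cbind(df, stats)
print(result)
```

Returning a named vector from the helper is what gives the new columns their names, so adding a third statistic only means extending row_stats.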
    


    2019-05-11

    How to create boxplots grouping into intervals x-axis values using R?

    Problem

    We want to create boxplots in which the x-axis values are grouped into intervals, with one box per interval.

    Solution

    Data

    We generate random data and a sequence to create the intervals.

    set.seed(12)
    y <- rnorm(1000)
    x <- rnorm(1000)
    rng <- seq(-3, 3, 0.5)
    
  • Base
  • boxplot(y ~ cut(x, breaks = rng), las = 2)
    
    Values of x that fall outside the intervals are returned by cut as missing (NA). If we'd like to include them as their own group, we use the function addNA:

    boxplot(y ~ addNA(cut(x, breaks = rng)), las = 2)
    
  • ggplot2
  • First we create a data frame with the intervals.

    library(ggplot2)
    df <- data.frame(x = cut(x, breaks = rng), y = y)
    ggplot(data = df, aes(x = x, y = y)) + geom_boxplot(aes(fill = x))
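
To see what both plotting approaches are grouping by, it can help to inspect the factor that cut produces. The following is a small sketch using the same x and rng as above; the names bins and counts are only illustrative.

```r
set.seed(12)
x <- rnorm(1000)
rng <- seq(-3, 3, 0.5)

# cut() turns the numeric vector into a factor with one level per interval
bins <- cut(x, breaks = rng)
print(levels(bins))

# Intervals are open on the left and closed on the right by default;
# values that fall in no interval are counted under NA
counts <- table(bins, useNA = "always")
print(counts)
```

The NA entry of the table is the group that addNA turns into an extra box in the base R plot.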
    


    2019-05-09

    How to make barplot bars with the same bar width in R

    Problem

    We have three barplots with a different number of bars in each of them, and we want them to have the same bar width.

    par(mfrow=c(1,3));
    par(mar=c(9,6,4,2)+0.1);
    barcenter1<- barplot(c(1,2,3,4,5));
    mtext("Average Emergent", side=2, line=4);
    par(mar=c(9,2,4,2)+0.1);
    barcenter2<- barplot(c(1,2,3));
    par(mar=c(9,2,4,2)+0.1);
    barcenter3<- barplot(c(1,2,3,4,5,6,7));
    

    Solution

    • First try
    • Using the arguments xlim = c(0, 1), width = 0.1 the problem is partially corrected. However, you can notice that the bar width is not exactly the same.

      width - optional vector of bar widths. Re-cycled to length the number of bars drawn. Specifying a single value will have no visible effect unless xlim is specified.

      par(mfrow = c(1, 3))
      par(mar = c(9, 6, 4, 2) + 0.1)
      barcenter1 <- barplot(c(1, 2, 3, 4, 5), xlim = c(0, 1), width = 0.1)
      mtext("Average Emergent", side = 2, line = 4)
      par(mar = c(9, 2, 4, 2) + 0.1)
      barcenter2 <- barplot(c(1, 2, 3), xlim = c(0, 1), width = 0.1)
      par(mar = c(9, 2, 4, 2) + 0.1)
      barcenter3 <- barplot(c(1, 2, 3, 4, 5, 6, 7), xlim = c(0, 1), width = 0.1)
      
    • Second try

      We add zeros as placeholders.

      par(mfrow = c(1, 1)) # Reset the graphical parameters
      df <- data.frame(barcenter1 = c(1, 2, 3, 4, 5, 0, 0), 
                       barcenter2 = c(1, 2, 3, 0, 0, 0, 0), 
                       barcenter3 = c(1, 2, 3, 4, 5, 6, 7))
      barplot(as.matrix(df), beside = TRUE)
      
      With ggplot2:

      df <- data.frame(x = c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7), 
                       y = c(rep("bar1", 5), rep("bar2", 3), rep("bar3", 7)))                                                                                         
      library(ggplot2)
      ggplot(data = df, aes(x = x, y = x)) + 
        geom_bar(stat = "identity") + 
        facet_grid(~y) 
      
  • Alternative
  • If, instead of three separate barplots, we'd like a single plot with an empty gap between the groups, we draw one barplot and insert NAs as delimiters between the groups.

    x <- c(1, 2, 3, 4, 5, NA, 1, 2, 3, NA, 1, 2, 3, 4, 5, 6, 7)
    barplot(x)
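
Returning to the first try: one plausible reason the bars still differ slightly is that the first panel uses a wider left margin (mar) than the other two, so its plot region is narrower and the same xlim is squeezed into less space. The sketch below is an assumption on my part, not something the original solution states. It gives all three panels identical margins and moves the axis label to an outer margin (oma), so that width = 0.1 should map to the same physical width in every panel.

```r
# Identical inner margins for every panel; extra room on the left via the outer margin
old_par <- par(mfrow = c(1, 3), mar = c(9, 2, 4, 2) + 0.1, oma = c(0, 3, 0, 0))

barcenter1 <- barplot(c(1, 2, 3, 4, 5), xlim = c(0, 1), width = 0.1)
barcenter2 <- barplot(c(1, 2, 3), xlim = c(0, 1), width = 0.1)
barcenter3 <- barplot(c(1, 2, 3, 4, 5, 6, 7), xlim = c(0, 1), width = 0.1)

# Shared y-axis label written into the outer margin
mtext("Average Emergent", side = 2, line = 1, outer = TRUE)

par(old_par)  # restore the previous graphical parameters
```

barplot returns the bar midpoints, so comparing barcenter1, barcenter2 and barcenter3 is a quick way to confirm that the bars are spaced identically in user coordinates.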
    


    2019-05-07

    Plot a continuous series with ggplot2

    Problem

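    In our example the Winter season starts at WS (the winter solstice) in December and continues until March, so within a single calendar year it appears as two separate stretches. When we plot it as one series, ggplot2 connects the last winter data point in March to the first winter data point in December.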

    • Data
    library(ggplot2)
    
    getSeason <- function(DATES) {
      # found here https://stackoverflow.com/questions/9500114/find-which-season-a-particular-date-belongs-to
      WS <- as.Date("2012-12-15", format = "%Y-%m-%d") # Winter Solstice
      SE <- as.Date("2012-3-15",  format = "%Y-%m-%d") # Spring Equinox
      SS <- as.Date("2012-6-15",  format = "%Y-%m-%d") # Summer Solstice
      FE <- as.Date("2012-9-15",  format = "%Y-%m-%d") # Fall Equinox

      # Convert dates from any year to 2012 dates
      d <- as.Date(strftime(DATES, format = "2012-%m-%d"))

      ifelse(d >= WS | d < SE, "Winter",
        ifelse(d >= SE & d < SS, "Spring",
          ifelse(d >= SS & d < FE, "Summer", "Fall")))
    }
    
    zz <- sample(1:10000,365)/1000
    dag <- seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "day")
    seas <-  getSeason(dag)
    test <- data.frame(zz,dag,seas)
    
    ggplot(data=test, aes(x=dag,ymax=zz,ymin=0,fill=seas))+
    geom_ribbon()
    

    Solution

    We can solve it by splitting the data in two at WS (dates before 2014-12-15 and dates from 2014-12-15 onwards) and plotting each subset as its own geom_ribbon layer. This turns the winter series into two separate sections that are no longer joined.

    library(dplyr)
    ggplot() +
      geom_ribbon(data = filter(test, dag >= "2014-12-15") ,
                  aes(x = dag, ymax = zz, ymin = 0, fill = seas)) +
      geom_ribbon(data = filter(test, dag < "2014-12-15") ,
                  aes(x = dag, ymax = zz, ymin = 0, fill = seas))
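
An alternative sketch, not taken from the original post, keeps a single layer and instead numbers each uninterrupted run of the same season, then passes that number as the group aesthetic. The column name run and the simplified season labelling are assumptions made for illustration; the season boundaries mirror those in getSeason above.

```r
# Daily dates for one year with a simplified season label (boundaries on the 15th)
dag <- seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "day")
md <- format(dag, "%m-%d")
seas <- ifelse(md >= "12-15" | md < "03-15", "Winter",
          ifelse(md < "06-15", "Spring",
            ifelse(md < "09-15", "Summer", "Fall")))
set.seed(1)
test <- data.frame(zz = runif(length(dag), 0, 10), dag = dag, seas = seas)

# Number each uninterrupted run of the same season: 1, 2, 3, ...
runs <- rle(test$seas)
test$run <- rep(seq_along(runs$lengths), runs$lengths)

# Plot only if ggplot2 is installed; each run is drawn as its own ribbon
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(test, aes(x = dag, ymax = zz, ymin = 0, fill = seas, group = run)) +
    geom_ribbon()
  print(p)
}
```

Because the two winter stretches receive different run numbers, they are drawn as separate ribbons while still sharing the same fill colour.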
    


    2019-05-06

    Filter between two dates in R with dplyr

    Problem

    We'd like to filter between two dates in R using the package dplyr.

        Patch       Date Prod_DL
    1    BVG1   9/4/2015    3.43
    2   BVG11  9/11/2015    3.49
    3   BVG12  9/18/2015    3.45
    4   BVG13  12/6/2015    3.57
    5   BVG14 12/13/2015    3.43
    6   BVG15 12/20/2015    3.47
    
    • Data
    df <- read.table(
      text = "Patch,Date,Prod_DL
      BVG1,9/4/2015,3.43
      BVG11,9/11/2015,3.49
      BVG12,9/18/2015,3.45
      BVG13,12/6/2015,3.57
      BVG14,12/13/2015,3.43
      BVG15,12/20/2015,3.47",
      sep = ",",
      stringsAsFactors = FALSE,
      header = TRUE,
      row.names = NULL
    )
    

    Solution

  • Alternative 1
  • We properly format the column containing the dates, originally a character column, and filter between the two dates.

    library("dplyr")
    df$Date <- as.Date(df$Date, "%m/%d/%Y")
    df %>%
      select(Patch, Date, Prod_DL) %>%
      filter(Date > "2015-09-04" & Date < "2015-09-18")
    
        Patch       Date Prod_DL
    1   BVG11 2015-09-11    3.49
    
  • Alternative 2
  • We properly format the column containing the dates and use the function between: 'This is a shortcut for x >= left & x <= right, implemented efficiently in C++ for local values, and translated to the appropriate SQL for remote tables.' Because between includes both end points, we shift each bound by one day to reproduce the strict inequalities of Alternative 1, and we pass the bounds as Date objects with as.Date.

    df$Date <- as.Date(df$Date, "%m/%d/%Y")
    df %>% 
      select(Patch, Date, Prod_DL) %>%
      filter(between(Date, as.Date("2015-09-05"), as.Date("2015-09-17")))
    
        Patch       Date Prod_DL
    1   BVG11 2015-09-11    3.49
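
For comparison, the same filter can be sketched in base R without dplyr. The example below rebuilds the data so that it runs on its own; the names lower, upper, strict and inclusive are illustrative. It also shows the difference between the strict comparison used in Alternative 1 and the inclusive comparison that between applies.

```r
df <- read.csv(text = "Patch,Date,Prod_DL
BVG1,9/4/2015,3.43
BVG11,9/11/2015,3.49
BVG12,9/18/2015,3.45
BVG13,12/6/2015,3.57
BVG14,12/13/2015,3.43
BVG15,12/20/2015,3.47", stringsAsFactors = FALSE)
df$Date <- as.Date(df$Date, "%m/%d/%Y")

lower <- as.Date("2015-09-04")
upper <- as.Date("2015-09-18")

# Strict inequalities: both end dates are excluded
strict <- df[df$Date > lower & df$Date < upper, ]
print(strict)

# Inclusive comparison, the same logic that between() applies
inclusive <- df[df$Date >= lower & df$Date <= upper, ]
print(inclusive)
```

With the same bounds, the inclusive version also returns the rows dated exactly on the two end dates, which is why Alternative 2 moves each bound by one day.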
    


    2019-05-04

    Read data from URL in R

    Problem

    How can we read data from a URL in R? We have a URL pointing to a text file, and we'd like to read it directly in R without previously downloading it to our computer.

    http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv
    

    Solution

    We just need to pass the URL string to the appropriate function. Apart from read.csv from the base package, I include a couple of examples from two popular packages: data.table and readr.

    • Base
    • ad <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv")
      head(ad)
      
        X    TV radio newspaper sales
      1 1 230.1  37.8      69.2  22.1
      2 2  44.5  39.3      45.1  10.4
      3 3  17.2  45.9      69.3   9.3
      4 4 151.5  41.3      58.5  18.5
      5 5 180.8  10.8      58.4  12.9
      6 6   8.7  48.9      75.0   7.2
      
    • data.table
    • library(data.table)
      ad <- fread("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv")
      head(ad)
      
         V1    TV radio newspaper sales
      1:  1 230.1  37.8      69.2  22.1
      2:  2  44.5  39.3      45.1  10.4
      3:  3  17.2  45.9      69.3   9.3
      4:  4 151.5  41.3      58.5  18.5
      5:  5 180.8  10.8      58.4  12.9
      6:  6   8.7  48.9      75.0   7.2
      
    • readr
    • library(readr)
      ad <- read_csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv")
      head(ad)
      
      # A tibble: 6 x 5
           X1    TV radio newspaper sales
                
      1     1 230.1  37.8      69.2  22.1
      2     2  44.5  39.3      45.1  10.4
      3     3  17.2  45.9      69.3   9.3
      4     4 151.5  41.3      58.5  18.5
      5     5 180.8  10.8      58.4  12.9
      6     6   8.7  48.9      75.0   7.2
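
Remote files can move or disappear, so it may be worth wrapping the call. The sketch below is only a suggestion: read_csv_safely is a hypothetical helper, not part of base R or of the packages above, and the demonstration uses a temporary local file in place of the URL so that it runs offline.

```r
# Hypothetical helper: returns a data frame, or NULL if the source cannot be read
read_csv_safely <- function(src) {
  tryCatch(
    read.csv(src),
    error = function(e) {
      message("Could not read ", src, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Offline demonstration: a temporary local file stands in for the URL
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(TV = c(230.1, 44.5), sales = c(22.1, 10.4)), tmp, row.names = FALSE)

ok <- read_csv_safely(tmp)
print(ok)

# A source that does not exist yields NULL instead of stopping the script
missing_source <- read_csv_safely(file.path(tempdir(), "does-not-exist.csv"))
```

Since read.csv accepts a URL string in the same argument, the helper can be called with the address from the problem statement in exactly the same way.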
      


    2019-05-03

    Drop unused levels from a factor in R

    Problem

    If we filter a data frame containing a factor and then perform any operation, such as creating a contingency table, R will still show the unused levels. Subsetting does not in general drop unused levels.

    # name must be a factor; since R 4.0.0 this has to be requested explicitly
    df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9,
                     stringsAsFactors = TRUE)
    library(dplyr)
    aa <-  df %>%
      group_by(name) %>%
      filter(n() < 4)
    table(aa$name)
    
    In our example, the level c is still included in the results. We'd like to remove it and display only the used levels a and b.

    # Result
    a b c 
    3 2 0
    # Desired result
    a b 
    3 2
    

    Solution

    There are two alternatives: the functions droplevels and factor.

    table(droplevels(aa$name))
    table(factor(aa$name))
    
    If we are using dplyr and the pipe operator:

    aa <-  df %>%
      group_by(name) %>%
      filter(n() < 4) %>% 
      droplevels()
    table(aa$name)
    
    # Better still
    df %>%
      group_by(name) %>%
      filter(n() < 4) %>% 
      droplevels() %>% 
      {table(.$name)}
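
The behaviour can also be checked on a bare factor, without a data frame or dplyr. This is a minimal sketch; the names f and kept are illustrative.

```r
f <- factor(c("a", "a", "a", "b", "b", "c", "c", "c", "c"))

# Subsetting keeps the full set of levels, including the now unused "c"
kept <- f[f != "c"]
print(levels(kept))

# Three ways to discard the unused level
print(levels(droplevels(kept)))
print(levels(factor(kept)))
print(levels(f[f != "c", drop = TRUE]))
```

The drop = TRUE argument of the factor subsetting method removes the unused levels at the moment of subsetting, which avoids a separate call afterwards.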
    


    Nube de datos