Showing posts with the label Data frame.

2020-12-21

How to create a symmetric matrix in R

Problem

We want to create a symmetric matrix based on the following example.

Example

  V1 V2 V3 V4 V5
1  0  2  3  4  5
2  0  0  6  8 10
3  0  0  0 12 15
4  0  0  0  0 20
5  0  0  0  0  0
df <- structure(list(V1 = c(0L, 0L, 0L, 0L, 0L), V2 = c(2L, 0L, 0L, 
0L, 0L), V3 = c(3L, 6L, 0L, 0L, 0L), V4 = c(4L, 8L, 12L, 0L, 
0L), V5 = c(5L, 10L, 15L, 20L, 0L)), .Names = c("V1", "V2", "V3", 
"V4", "V5"), class = "data.frame", row.names = c("1", "2", "3", 
"4", "5"))

Solution

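  • Adding the transpose of the matrix. This works here because the diagonal and the lower triangle contain only zeros.
    df + t(df)
  • Function lower.tri
    We assign to the lower triangle of the matrix the matching elements of its transpose, that is, the mirrored upper triangle.

    df[lower.tri(df)] <- t(df)[lower.tri(df)]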
      V1 V2 V3 V4 V5
    1  0  2  3  4  5
    2  2  0  6  8 10
    3  3  6  0 12 15
    4  4  8 12  0 20
    5  5 10 15 20  0

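    Copying upper.tri directly would not create a symmetric matrix, because R extracts the elements column by column and that order does not match the order of the lower triangle.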

    df[lower.tri(df)] <- df[upper.tri(df)]
      V1 V2 V3 V4 V5
    1  0  2  3  4  5
    2  2  0  6  8 10
    3  3  8  0 12 15
    4  6 12 10  0 20
    5  4  5 15 20  0
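
The same idea can be checked on a plain matrix. The following is a minimal sketch, assuming only base R; the object names m, sym and wrong are illustrative and are not part of the original post. It rebuilds the example values, mirrors the upper triangle through the transpose, and uses isSymmetric() to compare that result with the direct upper.tri copy.

```r
# Rebuild the example as a numeric matrix (upper triangle is filled column by column)
m <- matrix(0, nrow = 5, ncol = 5)
m[upper.tri(m)] <- c(2, 3, 6, 4, 8, 12, 5, 10, 15, 20)

# Mirror the upper triangle into the lower triangle through the transpose
sym <- m
sym[lower.tri(sym)] <- t(sym)[lower.tri(sym)]

# Copying the upper.tri values directly follows column order and breaks the symmetry
wrong <- m
wrong[lower.tri(wrong)] <- wrong[upper.tri(wrong)]

isSymmetric(sym)    # TRUE
isSymmetric(wrong)  # FALSE
```

Working on a matrix rather than a data frame also makes isSymmetric() and other matrix functions directly available.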
    

2020-11-24

How to select all columns in dplyr

Problem

We want to select all variables in a data frame using dplyr.

Solution

library(dplyr)
select(iris, everything())
# Using the pipe operator %>%
iris %>% select(everything())

Notes

The function select() subsets columns based on their names and other features. We can use everything(), a selection helper, to select all variables.

Results

Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...
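
Beyond selecting every column, the everything() helper described in the notes is often combined with column names to move a column to the front. This is a short sketch, assuming the dplyr package is installed; the object names all_cols and reordered are illustrative.

```r
library(dplyr)

# Select every column (same result as the solution above)
all_cols <- iris %>% select(everything())

# Put Species first and keep all remaining columns in their original order
reordered <- iris %>% select(Species, everything())

names(reordered)
```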

How to aggregate multiple variables with different functions in R

Problem

We want to group the following data frame by Branch, using different functions to summarize the results.

  Branch Loan_Amount TAT
1      A         100 2.0
2      A         120 4.0
3      A         300 9.0
4      B         150 1.5
5      B         200 2.0

This is the expected outcome. We want to group by Branch, count the number of loans (Number_of_loans), and sum Loan_Amount and TAT.

  Branch Number_of_loans Loan_Amount  TAT
1      A               3         520 15.0
2      B               2         350  3.5

Data

df <- structure(list(Branch = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A", 
"B"), class = "factor"), Loan_Amount = c(100L, 120L, 300L, 150L, 
200L), TAT = c(2, 4, 9, 1.5, 2)), .Names = c("Branch", "Loan_Amount", 
"TAT"), class = "data.frame", row.names = c(NA, -5L))

Options

  • Base package
    # Sum every remaining column by Branch
    df1 <- aggregate(. ~ Branch, df, FUN = sum)
    # Count the rows per Branch and keep only the count column
    df2 <- setNames(aggregate(Loan_Amount ~ Branch, df, length)[2], "Number_of_loans")
    cbind(df1, df2)
    
      Branch Loan_Amount  TAT Number_of_loans
    1      A         520 15.0               3
    2      B         350  3.5               2
    
  • dplyr
    library(dplyr)
    df %>% 
      group_by(Branch) %>% 
      summarise(Number_of_loans = n(),
                Loan_Amount = sum(Loan_Amount),
                TAT = sum(TAT))
    
    Source: local data frame [2 x 4]
    
      Branch Number_of_loans Loan_Amount   TAT
      (fctr)           (int)       (int) (dbl)
    1      A               3         520  15.0
    2      B               2         350   3.5
    
  • sqldf
    library(sqldf)
    sqldf("SELECT Branch, 
                  COUNT(Loan_Amount) Number_of_loans, 
                  SUM(Loan_Amount) Loan_Amount, 
                  SUM(TAT) TAT 
          FROM df 
          GROUP BY Branch")
    
      Branch Number_of_loans Loan_Amount  TAT
    1      A               3         520 15.0
    2      B               2         350  3.5
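
As a further base R variant, the three summaries can be computed in one pass by splitting the data frame by Branch. This is only a sketch under the assumption that the data look like the example above; summarise_branch is a hypothetical helper name introduced for illustration.

```r
df <- data.frame(
  Branch = factor(c("A", "A", "A", "B", "B")),
  Loan_Amount = c(100L, 120L, 300L, 150L, 200L),
  TAT = c(2, 4, 9, 1.5, 2)
)

# Hypothetical helper: one summary row for the rows of a single branch
summarise_branch <- function(g) {
  data.frame(
    Branch = g$Branch[1],
    Number_of_loans = nrow(g),
    Loan_Amount = sum(g$Loan_Amount),
    TAT = sum(g$TAT)
  )
}

# Split by Branch, summarise each piece, and stack the resulting rows
out <- do.call(rbind, lapply(split(df, df$Branch), summarise_branch))
rownames(out) <- NULL
out
```

Each statistic sits on its own line inside the helper, so a different function per column (for example mean instead of sum) is a one-word change.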
    

    2020-04-04

    How to select a random sample in R

    Problem

    We want to extract a random sample from a data frame in R.

    Solution

    • Base package
    library(dplyr) # the starwars data set ships with dplyr
    set.seed(1)
    starwars[sample(nrow(starwars), 10), ] # 10 rows
    # Showing the first 5 columns
    set.seed(1)
    starwars[sample(nrow(starwars), 10), 1:5]
    
    # A tibble: 10 x 5
       name            height  mass hair_color skin_color
       <chr>            <int> <dbl> <chr>      <chr>     
     1 Dexter Jettster    198 102   none       brown     
     2 Sebulba            112  40   none       grey, red 
     3 Luke Skywalker     172  77   blond      fair      
     4 Jar Jar Binks      196  66   none       orange    
     5 Bib Fortuna        180  NA   none       pale      
     6 Han Solo           180  80   brown      fair      
     7 Cliegg Lars        183  NA   brown      fair      
     8 Eeth Koth          171  NA   black      brown     
     9 Boba Fett          183  78.2 black      fair      
    10 Yarael Poof        264  NA   none       white
    
    • dplyr
    library(tidyverse)
    set.seed(1)
    starwars %>%
      sample_n(10) %>%
      select(1:5)
    
    • data.table
    library(data.table)
    set.seed(1)
    data.table(starwars)[sample(.N, 10), 1:5]
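
The base approach works on any data frame, not only starwars. The sketch below uses the built-in iris data so that it runs without extra packages; the names idx, iris_sample, frac_idx and iris_frac are illustrative. It draws a fixed number of rows and, as a variation, a fixed fraction of rows.

```r
set.seed(1)  # make the draw reproducible

# Draw 10 row positions without replacement, then subset the data frame
idx <- sample(nrow(iris), 10)
iris_sample <- iris[idx, ]

# Variation: draw 10 % of the rows instead of a fixed count
frac_idx <- sample(nrow(iris), size = round(0.1 * nrow(iris)))
iris_frac <- iris[frac_idx, ]

nrow(iris_sample)
nrow(iris_frac)
```

Because sample() draws without replacement by default, no row appears twice in either sample.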
    

    2020-03-05

    Descriptive statistics by group in R

    Problem

    We'd like to report descriptive statistics in R by a grouping variable and keep only a subset of the output statistics.

    Solution

    We will use the data frame iris, columns Sepal.Length and Sepal.Width and grouping by Species. In our example, we want to return the mean, the standard deviation, the skewness and kurtosis.

  • Subset of descriptive statistics by group
    library(psych)
    # Variables by index
    d <- describeBy(iris[1:2], group = iris$Species)
    # Two options to subset the statistics
    # (columns 3, 4, 11 and 12 are mean, sd, skew and kurtosis):
    lapply(d, "[", , c(3, 4, 11, 12))
    lapply(d, subset, , c(3, 4, 11, 12))
    
    # Variables by name
    i <- match(c("Sepal.Length", "Sepal.Width"), names(iris))
    d <- describeBy(iris[i], group = iris$Species)
    lapply(d, subset, , c("mean", "sd", "skew", "kurtosis"))
    
    $setosa
                 mean   sd skew kurtosis
    Sepal.Length 5.01 0.35 0.11    -0.45
    Sepal.Width  3.43 0.38 0.04     0.60
    
    $versicolor
                 mean   sd  skew kurtosis
    Sepal.Length 5.94 0.52  0.10    -0.69
    Sepal.Width  2.77 0.31 -0.34    -0.55
    
    $virginica
                 mean   sd skew kurtosis
    Sepal.Length 6.59 0.64 0.11    -0.20
    Sepal.Width  2.97 0.32 0.34     0.38
    
  • Subset of descriptive statistics without grouping
    # Select the desired columns of the data frame
    d <- describe(iris[1:2])
    # Subsetting output statistics
    d[, c(3, 4, 11, 12)]
    
                 mean   sd skew kurtosis
    Sepal.Length 5.84 0.83 0.31    -0.61
    Sepal.Width  3.06 0.44 0.31     0.14
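
When psych is not available, part of the grouped table can be reproduced with base R. The following is a rough sketch that covers only the mean and the standard deviation (base R has no built-in skewness or kurtosis function); basic_stats and by_species are illustrative names, not part of the original post.

```r
# Hypothetical helper: the two statistics that base R provides directly
basic_stats <- function(x) c(mean = mean(x), sd = sd(x))

# Split the two sepal columns by Species and summarise each column per group
by_species <- lapply(
  split(iris[1:2], iris$Species),
  function(g) t(sapply(g, basic_stats))
)

# One matrix per species, with variables in rows and statistics in columns
round(by_species$setosa, 2)
```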
    

    2019-07-03

    How to calculate the percent of column total in R

    Problem

    We want to calculate the percent of column total in R. In our example, the percent of column freq: 7/397, 23/397, etc.

        x freq
    1 Jan    7
    2 Feb   23
    3 Mar   86
    4 Apr  281
    
    Data:

    df <- read.table(text = "x    freq
                            Jan   7
                            Feb   23
                            Mar   86
                            Apr   281", 
                            header = TRUE)
    

    Solution

    We create the percent of column total using the function prop.table.

    df$prob <- prop.table(df$freq)
    # Percentages with two decimal places
    df$prob <- round(prop.table(df$freq), 4)*100
    
        x freq  prob
    1 Jan    7  1.76
    2 Feb   23  5.79
    3 Mar   86 21.66
    4 Apr  281 70.78
    
    If we'd like to calculate the proportion for a specific row, February in our example:
    prop.table(df$freq)[df$x == "Feb"] 
    
     [1] 0.05793451
    

    Alternatives

    • Base package
      df$prob <- df$freq/sum(df$freq)
      
    • dplyr
      library(dplyr)
      df %>% mutate(prob = prop.table(freq))
      # Or
      df %>% mutate(prob = freq / sum(freq))
      
      A specific row:
      df %>% mutate(prob = prop.table(freq)) %>% filter(x == "Feb")
      
          x freq       prob
      1 Feb   23 0.05793451
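
For reporting, the proportions can also be turned into formatted percent labels. This is a small sketch using base R only; the share and label names are illustrative, and the data frame is rebuilt here so that the block runs on its own.

```r
df <- data.frame(x = c("Jan", "Feb", "Mar", "Apr"), freq = c(7, 23, 86, 281))

# Proportion of the column total; the proportions add up to 1
share <- prop.table(df$freq)

# Percent labels with two decimal places and a literal % sign
df$label <- sprintf("%.2f%%", 100 * share)
df
```

Keeping the numeric proportion and the text label in separate objects avoids doing arithmetic on rounded values later.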
      

    2018-04-15

    Recreating an object with the dput function in R

    Problem

    We want to recreate an object in R. That is, we want the code that creates the object, rather than importing it. In our example, the object is a data frame loaded in our workspace, of which we show the first and last rows.

    library(psych)
    headTail(df)
    
        time group value
    1      1     A     3
    2      2     A     3
    3      3     A     8
    4      4     A     5
    ...  ...     ...
    47     7     E     7
    48     8     E     8
    49     9     E    10
    50    10     E     3
    

    Solution

    The function dput prints the code that recreates the object.

    dput(df)
    
     structure(list(time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 
    6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), group = c("A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", 
    "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C", 
    "C", "C", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "E", 
    "E", "E", "E", "E", "E", "E", "E", "E", "E"), value = c(3L, 3L, 
    8L, 5L, 4L, 4L, 9L, 6L, 7L, 4L, 5L, 2L, 6L, 8L, 6L, 7L, 6L, 2L, 
    4L, 10L, 3L, 7L, 3L, 2L, 8L, 6L, 5L, 3L, 1L, 7L, 6L, 9L, 7L, 
    4L, 6L, 4L, 7L, 6L, 7L, 7L, 3L, 4L, 10L, 2L, 7L, 10L, 7L, 8L, 
    10L, 3L)), .Names = c("time", "group", "value"), row.names = c(NA, 
    -50L), class = "data.frame")
    
    If we wanted to create the object again:

    df <- structure(list(time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 
    6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), group = c("A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", 
    "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C", 
    "C", "C", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "E", 
    "E", "E", "E", "E", "E", "E", "E", "E", "E"), value = c(3L, 3L, 
    8L, 5L, 4L, 4L, 9L, 6L, 7L, 4L, 5L, 2L, 6L, 8L, 6L, 7L, 6L, 2L, 
    4L, 10L, 3L, 7L, 3L, 2L, 8L, 6L, 5L, 3L, 1L, 7L, 6L, 9L, 7L, 
    4L, 6L, 4L, 7L, 6L, 7L, 7L, 3L, 4L, 10L, 2L, 7L, 10L, 7L, 8L, 
    10L, 3L)), .Names = c("time", "group", "value"), row.names = c(NA, 
    -50L), class = "data.frame")
    
    If, instead of recreating the code, we want to save the object in the working directory so that we can access it later:

    # Save the object to a file in the working directory
    dput(df, "df")
    # Read it back into the workspace
    df_nuevo <- dget("df")
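
The round trip through a file can be checked on a small object. This is a minimal sketch, assuming base R only; df_small, path and df_restored are illustrative names, and a temporary file is used here instead of the working directory so that nothing is left behind.

```r
# A small data frame standing in for the one in the post
df_small <- data.frame(
  time = 1:3,
  group = c("A", "A", "B"),
  value = c(3L, 3L, 8L),
  stringsAsFactors = FALSE
)

# Write the R code that recreates the object to a temporary file
path <- tempfile(fileext = ".R")
dput(df_small, file = path)

# Read the code back and evaluate it into a new object
df_restored <- dget(path)

# Compare the original and the restored object
all.equal(df_small, df_restored)

unlink(path)  # remove the temporary file
```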
    

    References

    Nube de datos