Nube de datos: Data frame

2020-12-21

How to create a symmetric matrix in R

Problem

We want to create a symmetric matrix based on the following example.

Example

  V1 V2 V3 V4 V5
1  0  2  3  4  5
2  0  0  6  8 10
3  0  0  0 12 15
4  0  0  0  0 20
5  0  0  0  0  0

df <- structure(list(V1 = c(0L, 0L, 0L, 0L, 0L), V2 = c(2L, 0L, 0L, 
0L, 0L), V3 = c(3L, 6L, 0L, 0L, 0L), V4 = c(4L, 8L, 12L, 0L, 
0L), V5 = c(5L, 10L, 15L, 20L, 0L)), .Names = c("V1", "V2", "V3", 
"V4", "V5"), class = "data.frame", row.names = c("1", "2", "3", 
"4", "5"))

Solution

Add the matrix to its transpose. This works here because the diagonal is zero; a non-zero diagonal would be doubled.

df + t(df)

Function lower.tri

We assign to the lower triangle of the matrix the corresponding elements of its transpose, that is, the mirrored upper triangle.

df[lower.tri(df)] <- t(df)[lower.tri(df)]

  V1 V2 V3 V4 V5
1  0  2  3  4  5
2  2  0  6  8 10
3  3  6  0 12 15
4  4  8 12  0 20
5  5 10 15 20  0

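Assigning the result of upper.tri directly would not create a symmetric matrix, because the values are extracted and filled in column by column rather than mirrored across the diagonal.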

df[lower.tri(df)] <- df[upper.tri(df)]

  V1 V2 V3 V4 V5
1  0  2  3  4  5
2  2  0  6  8 10
3  3  8  0 12 15
4  6 12 10  0 20
5  4  5 15 20  0
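
The difference between the two assignments can be checked with isSymmetric. The following is a minimal sketch that rebuilds the example as a plain matrix; the object names m, sym and bad are only illustrative.

```r
# Rebuild the example as a plain matrix (values taken from the example above)
m <- matrix(0, nrow = 5, ncol = 5)
m[upper.tri(m)] <- c(2, 3, 6, 4, 8, 12, 5, 10, 15, 20)

# Mirror the upper triangle through the transpose
sym <- m
sym[lower.tri(sym)] <- t(sym)[lower.tri(sym)]

# Copy the upper-triangle values directly: they are read and written
# column by column, so they land in the wrong cells
bad <- m
bad[lower.tri(bad)] <- m[upper.tri(m)]

isSymmetric(sym)     # TRUE
isSymmetric(bad)     # FALSE
all(sym == m + t(m)) # TRUE, because the diagonal is zero
```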

References

stackoverflow

2020-11-24

How to select all columns in dplyr

Problem

We want to select all variables in a data frame using dplyr.

Solution

library(dplyr)
select(iris, everything())
# Using the pipe operator %>%
iris %>% select(everything())

Notes

The function select() subsets columns based on their names and other features. The selection helper everything() matches all variables.

Results

Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...
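
Selecting every column on its own returns the data frame unchanged, so everything() is usually combined with other selections. A minimal sketch, assuming dplyr is installed, that moves Species to the front and keeps the remaining columns in their original order (the names all_cols and reordered are illustrative):

```r
library(dplyr)

# everything() alone keeps all columns in their original order
all_cols <- select(iris, everything())
identical(names(all_cols), names(iris))

# Name one column first, then let everything() pick up the rest
reordered <- select(iris, Species, everything())
names(reordered)
```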

How to aggregate multiple variables with different functions in R

Problem

We want to group the following data frame by Branch, using different functions to summarize the results.

  Branch Loan_Amount TAT
1      A         100 2.0
2      A         120 4.0
3      A         300 9.0
4      B         150 1.5
5      B         200 2.0

This is the expected outcome. We want to group by Branch, count the Number_of_loans, and sum Loan_Amount and TAT.

  Branch Number_of_loans Loan_Amount  TAT
1      A               3         520 15.0
2      B               2         350  3.5

Data

df <- structure(list(Branch = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A", 
"B"), class = "factor"), Loan_Amount = c(100L, 120L, 300L, 150L, 
200L), TAT = c(2, 4, 9, 1.5, 2)), .Names = c("Branch", "Loan_Amount", 
"TAT"), class = "data.frame", row.names = c(NA, -5L))

Options

Base package

df1 <- aggregate(.~ Branch, df, FUN = "sum")
df2 <- setNames(aggregate(Loan_Amount~Branch, df, length)[2], c("Number_of_loans"))
cbind(df1, df2)

  Branch Loan_Amount  TAT Number_of_loans
1      A         520 15.0               3
2      B         350  3.5               2
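
The cbind call relies on both aggregate results listing the branches in the same row order. As a hedged alternative sketch, merge joins the two results on Branch explicitly and also returns the columns in the order of the expected outcome. The data frame is rebuilt here so the block runs on its own; sums, counts and res are illustrative names.

```r
# Same data as above, rebuilt with data.frame()
df <- data.frame(
  Branch = factor(c("A", "A", "A", "B", "B")),
  Loan_Amount = c(100L, 120L, 300L, 150L, 200L),
  TAT = c(2, 4, 9, 1.5, 2)
)

# Sum every numeric column by Branch
sums <- aggregate(. ~ Branch, df, FUN = sum)

# Count the rows per Branch and give the column its final name
counts <- aggregate(Loan_Amount ~ Branch, df, FUN = length)
names(counts)[2] <- "Number_of_loans"

# Join on the grouping key instead of relying on row order
res <- merge(counts, sums, by = "Branch")
res
```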

dplyr

library(dplyr)
df %>% 
  group_by(Branch) %>% 
  summarise(Number_of_loans = n(),
            Loan_Amount = sum(Loan_Amount),
            TAT = sum(TAT))

Source: local data frame [2 x 4]

  Branch Number_of_loans Loan_Amount   TAT
  (fctr)           (int)       (int) (dbl)
1      A               3         520  15.0
2      B               2         350   3.5

sqldf

library(sqldf)
sqldf("SELECT Branch, 
              COUNT(Loan_Amount) Number_of_loans, 
              SUM(Loan_Amount) Loan_Amount, 
              SUM(TAT) TAT 
      FROM df 
      GROUP BY Branch")

  Branch Number_of_loans Loan_Amount  TAT
1      A               3         520 15.0
2      B               2         350  3.5

References

stackoverflow

2020-04-04

How to select a random sample in R

Problem

We want to extract a random sample from a data frame in R.

Solution

Base package

library(dplyr) # provides the starwars data set
set.seed(1)
starwars[sample(nrow(starwars), 10), ] # 10 rows
# Showing the first 5 columns
set.seed(1)
starwars[sample(nrow(starwars), 10), 1:5]

# A tibble: 10 x 5
   name            height  mass hair_color skin_color
   <chr>            <int> <dbl> <chr>      <chr>     
 1 Dexter Jettster    198 102   none       brown     
 2 Sebulba            112  40   none       grey, red 
 3 Luke Skywalker     172  77   blond      fair      
 4 Jar Jar Binks      196  66   none       orange    
 5 Bib Fortuna        180  NA   none       pale      
 6 Han Solo           180  80   brown      fair      
 7 Cliegg Lars        183  NA   brown      fair      
 8 Eeth Koth          171  NA   black      brown     
 9 Boba Fett          183  78.2 black      fair      
10 Yarael Poof        264  NA   none       white

dplyr

library(tidyverse)
set.seed(1)
starwars %>%
  sample_n(10) %>%
  select(1:5)

data.table

library(data.table)
set.seed(1)
data.table(starwars)[sample(.N, 10), 1:5]
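
All three options call set.seed(1) first so that the sample is reproducible. A minimal base R sketch of that behaviour, using the built-in iris data so that no package is needed; the 10% fraction and the names idx1, idx2, s and frac are assumptions made for illustration:

```r
# The same seed gives the same row indices
set.seed(1)
idx1 <- sample(nrow(iris), 10)
set.seed(1)
idx2 <- sample(nrow(iris), 10)
identical(idx1, idx2)

# Sampling is without replacement by default, so no row is repeated
s <- iris[idx1, ]
nrow(s)

# Sample a fraction of the rows instead of a fixed number (here 10%)
frac <- iris[sample(nrow(iris), round(0.1 * nrow(iris))), ]
nrow(frac)
```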

2020-03-05

Descriptive statistics by group in R

Problem

We'd like to report descriptive statistics in R by a grouping variable and subset the output statistics.

Solution

We will use the iris data frame, with the columns Sepal.Length and Sepal.Width grouped by Species. In our example, we want to return the mean, the standard deviation, the skewness and the kurtosis.

Subset of descriptive statistics by group

library(psych)
# Variables by index
d <- describeBy(iris[1:2], group = iris$Species)
# Two options to subset the statistics:
lapply(d, "[", , c(3, 4, 11, 12))
lapply(d, subset, , c(3, 4, 11, 12))

# Variables by name
i <- match(c("Sepal.Length", "Sepal.Width"), names(iris))
d <- describeBy(iris[i], group = iris$Species)
lapply(d, subset, , c("mean", "sd", "skew", "kurtosis"))

$setosa
             mean   sd skew kurtosis
Sepal.Length 5.01 0.35 0.11    -0.45
Sepal.Width  3.43 0.38 0.04     0.60

$versicolor
             mean   sd  skew kurtosis
Sepal.Length 5.94 0.52  0.10    -0.69
Sepal.Width  2.77 0.31 -0.34    -0.55

$virginica
             mean   sd skew kurtosis
Sepal.Length 6.59 0.64 0.11    -0.20
Sepal.Width  2.97 0.32 0.34     0.38

Subset of descriptive statistics without grouping

# Select the desired columns of the table
d <- describe(iris[1:2])
# Subsetting output statistics
d[, c(3, 4, 11, 12)]

             mean   sd skew kurtosis
Sepal.Length 5.84 0.83 0.31    -0.61
Sepal.Width  3.06 0.44 0.31     0.14
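
The grouped means and standard deviations can be cross-checked without psych. This is a minimal base R sketch for Sepal.Length only; skewness and kurtosis are left out because base R has no built-in functions for them, and the name stats is illustrative.

```r
# Split one column by group and compute two statistics per group
stats <- sapply(
  split(iris$Sepal.Length, iris$Species),
  function(x) c(mean = mean(x), sd = sd(x))
)

# Rows are statistics, columns are species
round(stats, 2)
```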

References

Descriptive statistics in R

2019-07-03

How to calculate the percent of column total in R

Problem

We want to express each value of a column as a percentage of the column total in R. In our example, the column is freq and its total is 397, so the shares are 7/397, 23/397, etc.

    x freq
1 Jan    7
2 Feb   23
3 Mar   86
4 Apr  281

Data:

df <- read.table(text = "x    freq
                        Jan   7
                        Feb   23
                        Mar   86
                        Apr   281", 
                        header = TRUE)

Solution

We create the percent of column total using the function prop.table.

# Proportions between 0 and 1
df$prob <- prop.table(df$freq)
# Percentages with two decimal places
df$prob <- round(prop.table(df$freq), 4)*100

    x freq  prob
1 Jan    7  1.76
2 Feb   23  5.79
3 Mar   86 21.66
4 Apr  281 70.78

If we'd like to calculate the proportion for a specific row, February in our example:

prop.table(df$freq)[df$x == "Feb"]

 [1] 0.05793451

Alternatives

Base package

df$prob <- df$freq/sum(df$freq)

dplyr

library(dplyr)
df %>% mutate(prob = prop.table(freq))
# Or
df %>% mutate(prob = freq / sum(freq))

# Proportion for a specific row
df %>% 
  mutate(prob = freq / sum(freq)) %>% 
  filter(x == "Feb")

    x freq       prob
1 Feb   23 0.05793451
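
The two base approaches give the same proportions, and these sum to 1. A minimal sketch that checks this and formats the result as text with a percent sign; the column name pct and the sprintf format are illustrative choices, not part of the original solution.

```r
df <- data.frame(x = c("Jan", "Feb", "Mar", "Apr"),
                 freq = c(7, 23, 86, 281))

p1 <- prop.table(df$freq)
p2 <- df$freq / sum(df$freq)

all.equal(p1, p2) # both approaches agree
sum(p1)           # proportions add up to 1

# Format as text with two decimal places and a percent sign
df$pct <- sprintf("%.2f%%", 100 * p1)
df
```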

2018-04-15

Recreating an object with the dput function in R

Problem

We want to recreate an object in R. That is, we want the code that creates the object, rather than importing it. In our example, the object is a data frame loaded in our workspace, of which we show the first and last rows.

library(psych)
headTail(df)

    time group value
1      1     A     3
2      2     A     3
3      3     A     8
4      4     A     5
...  ...     ...
47     7     E     7
48     8     E     8
49     9     E    10
50    10     E     3

Solution

The dput function lets us recreate the object.

dput(df)

 structure(list(time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), group = c("A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C", 
"C", "C", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "E", 
"E", "E", "E", "E", "E", "E", "E", "E", "E"), value = c(3L, 3L, 
8L, 5L, 4L, 4L, 9L, 6L, 7L, 4L, 5L, 2L, 6L, 8L, 6L, 7L, 6L, 2L, 
4L, 10L, 3L, 7L, 3L, 2L, 8L, 6L, 5L, 3L, 1L, 7L, 6L, 9L, 7L, 
4L, 6L, 4L, 7L, 6L, 7L, 7L, 3L, 4L, 10L, 2L, 7L, 10L, 7L, 8L, 
10L, 3L)), .Names = c("time", "group", "value"), row.names = c(NA, 
-50L), class = "data.frame")

To create the object again, we assign that output to a name:

df <- structure(list(time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), group = c("A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C", 
"C", "C", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "E", 
"E", "E", "E", "E", "E", "E", "E", "E", "E"), value = c(3L, 3L, 
8L, 5L, 4L, 4L, 9L, 6L, 7L, 4L, 5L, 2L, 6L, 8L, 6L, 7L, 6L, 2L, 
4L, 10L, 3L, 7L, 3L, 2L, 8L, 6L, 5L, 3L, 1L, 7L, 6L, 9L, 7L, 
4L, 6L, 4L, 7L, 6L, 7L, 7L, 3L, 4L, 10L, 2L, 7L, 10L, 7L, 8L, 
10L, 3L)), .Names = c("time", "group", "value"), row.names = c(NA, 
-50L), class = "data.frame")

If, instead of recreating the code, we want to save the object in the working directory to access it later:

# Save the object to a file in the working directory
dput(df, "df")
# Read it back into the workspace
df_nuevo <- dget("df")
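
The round trip can be verified by comparing the original object with the one read back. A minimal sketch, assuming a small illustrative data frame and a temporary file (path and df_new are hypothetical names) so that nothing is left in the working directory:

```r
# Small illustrative data frame
df <- data.frame(time = 1:3,
                 group = c("A", "A", "B"),
                 value = c(3L, 8L, 5L))

# Write the object's code to a temporary file and read it back
path <- tempfile(fileext = ".R")
dput(df, file = path)
df_new <- dget(path)

# Compare the original with the recreated object
all.equal(df, df_new)

# Remove the temporary file
unlink(path)
```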

References

Write an Object to a File or Recreate it