2019-12-31

Calculate the difference between alternate rows in R

Problem

We want to calculate the difference between alternate rows in R

df <- 
structure(list(MemId = c(1, 2, 3, 4, 5,6), ET = structure(c(1506829256,
 1506829319, 1506843096,1506843226, 1506850144, 1506853708), class = 
c("POSIXct", "POSIXt"))), .Names = c("MemId", "ET"), row.names = c("1",
 "2", "14", "15", "37", "38"), class = "data.frame")
  MemId                  ET
1      1 2017-10-01 05:40:56
2      2 2017-10-01 05:41:59
14     3 2017-10-01 09:31:36
15     4 2017-10-01 09:33:46
37     5 2017-10-01 11:29:04
38     6 2017-10-01 12:28:28

Solution

First, we create a dummy column with 0 and 1. Then when id == 1 we calculate with difftime the difference between the value of the vector ET and the "previous" value lag(ET).

library(dplyr)
df %>%
  mutate(id = rep_len(0:1, nrow(df))) %>%
  mutate(dif = ifelse(id == 1, difftime(ET, lag(ET), units = "secs"), NA))
  MemId                  ET id  dif
1     1 2017-10-01 05:40:56  0   NA
2     2 2017-10-01 05:41:59  1   63
3     3 2017-10-01 09:31:36  0   NA
4     4 2017-10-01 09:33:46  1  130
5     5 2017-10-01 11:29:04  0   NA
6     6 2017-10-01 12:28:28  1 3564

2019-12-30

Generate random email addresses in R

Problem

We want to create a vector with random email addresses.

Solution

There are multiple options, I show two options using the packages generator y stringi.

  • generator
  • library(generator)
    # Example 1: 10 email addresses
    set.seed(1977)
    r_email_addresses(10)
    # Example 2: Added to a data frame
    library(dplyr)
    set.seed(1977)
    mtcars %>% mutate(email = r_email_addresses(n())) %>% head(10)
    
  • stringi
  • A more verbose option concatenating three random text strings separated by @. We gain in flexibility because we can choose the lenght of the strings.

    library(stringi)
    set.seed(1977)
    paste0(stringi::stri_rand_strings(10, 5),
           "@",
           stringi::stri_rand_strings(10, 5)
           ".",
           stringi::stri_rand_strings(n(), 3)
    )
    set.seed(1977)
    mtcars %>% mutate(email = paste0(
      stringi::stri_rand_strings(n(), 5),
      "@",
      stringi::stri_rand_strings(n(), 5),
      ".",
      stringi::stri_rand_strings(n(), 3)
    )) %>% head(10)
    

    Results

    [1] "rdkvi@kfjaul.myo"      "qpm@bxvqkrtl.fhk"      "aedcq@a.ruc"          
     [4] "xi@l.wkp"              "bhk@obcjyzdi.tda"      "da@qksrfahvup.yel"    
     [7] "ahjq@lg.gmo"           "agcmyp@i.itc"          "rutfv@ponbi.jtm"      
    [10] "yeaxlbsko@ntlvkfe.fsg"
    
        mpg cyl  disp  hp drat    wt  qsec vs am gear carb                     email
    1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4          rdkvi@kfjaul.myo
    2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4          qpm@bxvqkrtl.fhk
    3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1               aedcq@a.ruc
    4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1                  xi@l.wkp
    5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2          bhk@obcjyzdi.tda
    6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1         da@qksrfahvup.yel
    7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4               ahjq@lg.gmo
    8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2              agcmyp@i.itc
    9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2           rutfv@ponbi.jtm
    10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     yeaxlbsko@ntlvkfe.fsg
    

    2019-12-29

    Read compressed files in R using readr

    Problem

    We need to read compressed files in R.

    Solution

    We will use the package readr:

    Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed.

    In our example we use the file title.ratings.tsv.gz.

    library(readr)
    df_ratings <- read_tsv('title.ratings.tsv.gz', na = "\\N", quote = '')
    df_ratings %>% head()
    
    We can provide the URL and it will be automatically downloaded and decompressed

    df_ratings <- read_tsv('https://datasets.imdbws.com/title.ratings.tsv.gz', na = "\\N", quote = '')
    df_ratings %>% head()
    

    Results

    # A tibble: 6 x 3
      tconst    averageRating numVotes
                       
    1 tt0000001           5.8     1423
    2 tt0000002           6.4      168
    3 tt0000003           6.6     1016
    4 tt0000004           6.4      100
    5 tt0000005           6.2     1713
    6 tt0000006           5.5       88
    

    Hide zero values in Excel

    Problem

    We want to hide zero values in Excel.

    Solution

    Option 1 - A selected range

    1. Select the cells that contain the zero values you want to hide and press Ctrl + 1
    2. Select Number and then Custom, then type 0;; and press OK.
    Opction 2 - A sheet
    1. Click File > Options > Advanced.
    2. Under Display options for this worksheet, select a worksheet, uncheck the Show a zero in cells that have zero value check box.
    Option 3 - VBA: sheets or workbooks.
    1. Open the Visual Basic Editor: Alt+F11
    2. Copy the following code: una para ocultar y otra para mostrar los ceros
    3. Sub Hide_Zeros()
          ActiveWindow.DisplayZeros = False
      End Sub
      
      Sub Display_Zeros()
          ActiveWindow.DisplayZeros = True
      End Sub
      
      If we'd like to hide zero values from all sheets:

      Sub Hide_Zeros_All_Sheets()
          Worksheets.Select
          ActiveWindow.DisplayZeros = False
      End Sub
      
      Sub Display_Zeros_All_Sheets()
          Worksheets.Select
          ActiveWindow.DisplayZeros = True
      End Sub
      
      To show all zero values again using the previous code you need to remove the custom format applied in option 1.

    Results

    Related posts

    2019-12-28

    How to convert a continuous variable to discrete in R?

    Problem

    We want to convert continuous variable to discrete in R:

    'Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%'.

    library(ISLR)
    library(tidyverse)
    glimpse(College)
    
    Observations: 777
    Variables: 18
    $ Private      Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye...
    $ Apps         1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, 17...
    $ Accept       1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 142...
    $ Enroll       721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484...
    $ Top10perc    23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44, 2...
    $ Top25perc    52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73, 4...
    $ F.Undergrad  2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 183...
    $ P.Undergrad  537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, 63...
    $ Outstate     7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 155...
    $ Room.Board   3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 338...
    $ Books        450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, 40...
    $ Personal     2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800, 6...
    $ PhD          70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79, 3...
    $ Terminal     78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87, ...
    $ S.F.Ratio    18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11.5,...
    $ perc.alumni  12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, 26...
    $ Expend       7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 11644...
    $ Grad.Rate    60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 5...
    

    Solution

    1. Option 1: form ISLR's book.
    2. Elite = rep("No", nrow(College))
      Elite[College$Top10perc > 50] = "Yes"
      Elite <- as.factor(Elite)
      college <- data.frame(College,  Elite)
      summary(college[, c("Top10perc", "Elite")])
      
      There are 78 elite universities.

        Top10perc     Elite    
       Min.   : 1.00   No :699  
       1st Qu.:15.00   Yes: 78  
       Median :23.00            
       Mean   :27.56            
       3rd Qu.:35.00            
       Max.   :96.00    
      
    3. Option 2: ifelse from base package and dplyr
    4. # base 
      College$Elite <- factor(ifelse(College$Top10perc > 50, "Yes", "No"))
      # dplyr
      library(dplyr)
      College <-
        college %>%
        mutate(Elite = factor(ifelse(College$Top10perc > 50, "Yes", "No")))
      
    5. Option 3: creating a logical vector.
    6. There are multiple options. I show two examples.

      college$Elite <- transform(College, Elite = Top10perc > 50)
      College$Elite <- College$Top10perc > 50
      

    References

    From 'An Introduction to Statistical Learning' (ISLR), page 54.

    Related posts

    2019-12-27

    How to list the available data sets in R

    Problem

    We want to list the available data sets in R.

    Solution

    1. List all available data sets
    2. data()
      
      Data sets in package ‘datasets’:
      
      AirPassengers           Monthly Airline Passenger Numbers 1949-1960
      BJsales                 Sales Data with Leading Indicator
      BJsales.lead (BJsales)
                              Sales Data with Leading Indicator
      BOD                     Biochemical Oxygen Demand
      CO2                     Carbon Dioxide Uptake in Grass Plants
      ChickWeight             Weight versus age of chicks on different diets
      DNase                   Elisa assay of DNase
      EuStockMarkets          Daily Closing Prices of Major European Stock
                              Indices, 1991-1998
      Formaldehyde            Determination of Formaldehyde
      HairEyeColor            Hair and Eye Color of Statistics Students
      ...                    ...
      
    3. List all available data sets, including those packages not currently loaded.
    4. data(package = .packages(all.available = TRUE))
      
      Data sets in package ‘aqp’:
      
      amarillo                Amarillo Soils
      ca630                   Soil Data from the Central Sierra Nevada Region
                              of California
      munsell                 Munsell to sRGB Lookup Table for Common Soil
                              Colors
      rruff.sample            Sample XRD Patterns
      soil_minerals           Munsell Colors of Common Soil Minerals
      sp1                     Soil Profile Data Example 1
      sp2                     Honcut Creek Soil Profile Data
      sp3                     Soil Profile Data Example 3
      sp4                     Soil Chemical Data from Serpentinitic Soils of
                              California
      sp5                     Sample Soil Database #5
      sp6                     Soil Physical and Chemical Data from
                              Manganiferous Soils
      
      Data sets in package ‘beeswarm’:
      
      breast                  Lymph-node-negative primary breast tumors
      
    5. List all available data sets of a specific package
    6. data(package = "ISLR")
      
      Data sets in package ‘ISLR’:
      
      Auto                    Auto Data Set
      Caravan                 The Insurance Company (TIC) Benchmark
      Carseats                Sales of Child Car Seats
      College                 U.S. News and World Report's College Data
      Credit                  Credit Card Balance Data
      Default                 Credit Card Default Data
      Hitters                 Baseball Data
      Khan                    Khan Gene Data
      NCI60                   NCI 60 Data
      OJ                      Orange Juice Data
      Portfolio               Portfolio Data
      Smarket                 S&P Stock Market Data
      Wage                    Mid-Atlantic Wage Data
      Weekly                  Weekly S&P Stock Market Data
      

    References

    2019-12-22

    Increase the number of recent workbooks shown in Excel

    Problem

    By default Excel shows a list of the last 25 workbooks opened. How can we increase that number?

    Solution

    Click File, Options, Advanced, scroll down to the Display section, and increase the number of recent workbooks (the maximum is 50).

    Nube de datos