class: center, middle, inverse, title-slide # Programming with Data ## Session 2: R Programming (I) ### Dr. Wang Jiwei ### Master of Professional Accounting --- class: inverse, center, middle <!-- Load in primary data set --> # Introduction to R --- ## What is R? - R is free and open source - R is a “statistical programming language” - Focused on data handling, calculation, data analysis, and visualization - R is *not* a general programming language [(wikipedia)](https://en.wikipedia.org/wiki/General-purpose_programming_language) - We will use R for all work in this course .center[<a target="_blank" href="https://www.r-project.org/about.html"><img src="../../../Figures/R.svg" alt="R Logo" width="400px"></a>] --- ## The History of R - 1993: developed by Ross Ihaka and Robert Gentleman at University of Auckland - Why R? "R & R" - R is written in C and is developed from Bell Laboratory's S language - 2000.2.29: R 1.0.0 official release .center[<a target="_blank" href="https://www.r-bloggers.com/celebration2020-a-great-get-together-to-celebrate-20-years-of-r-1-0-0/"><img src="../../../Figures/r-first-cd.jpg" width = "400px"></a>] --- ## The Happiest R - based on programmers' pictures on <a target="_blank" href="https://github.com/">GitHub</a> .center[<a target="_blank" href="https://medium.com/swlh/what-programming-language-has-the-happiest-developers-f0636b08e898"><img src="../../../Figures/happyR.jpeg"></a>] --- ## R vs Python > Each has its own merits <table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> R </th> <th style="text-align:left;"> Python </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Statistical analysis with smaller dataset </td> <td style="text-align:left;"> Machine/Deep learning with large dataset </td> </tr> <tr> <td style="text-align:left;"> Data visualization </td> <td style="text-align:left;"> General purpose which is great for automation </td> </tr> </tbody> </table> .center[<a target="_blank" href="https://towardsdatascience.com/r-vs-python-vs-julia-90456a2bcbab"><img src="../../../Figures/c-python-julia-R.png" height = 300px></a>] --- ## Setup - For this class, I will assume you are using RStudio for R programming. You will need to first install R and then RStudio. - <a target="_blank" href="https://cloud.r-project.org/">R Installation</a> - <a target="_blank" href="https://www.rstudio.com/products/rstudio/download/#download">RStudio downloads</a> - You will need a laptop or desktop for this - For the most part, everything will work the same across all computer types - Everything in these slides was tested using R version 4.1.1 (2021-08-10) Kick Things on Windows 10 x64 build 18362 😀 > R and RStudio installation path should be in English. Any non-English path may result in installation failure. .center[] --- ## How to use RStudio .pull-left[ 1. R markdown file - integrate code into reports - more interactive reports with analytics - this slides written with R Markdown using the <a target="_blank" href="https://github.com/yihui/xaringan">**xaringan**</a> package 2. Console - Useful for testing code and exploring your data - Enter your code one line at a time 3. R Markdown console - Shows if there are any errors when preparing your report ] .pull-right[ <img src="../../../Figures/RStudio_1-anotated.png" height="400px"> ] --- ## How to use RStudio .pull-left[ <ol start = 4> <li> Environment - Shows all the values you have stored <li> Help - Can search documentation for instructions on how to use a function <li> Viewer - Shows any output you have at the moment. <li> Files - Shows files on your computer </ol> ] .pull-right[ <img src="../../../Figures/RStudio_2-anotated.png" height="400px"> ] --- class: inverse, center, middle # Basic R commands --- ## Arithmetic .pull-left[ - Anything in boxes like those on the right are R code - The slides themselves are made in R, so you could copy and paste any code in the slides right into R to use it yourself - Grey boxes: R Code - Lines starting with hash `#` are comments - They only explain what the code does - Boxes with ##: Output ] .pull-right[ .rcode[ ```r # Addition uses '+' 1 + 1 ``` ``` ## [1] 2 ``` ```r # Subtraction uses '-' 2 - 1 ``` ``` ## [1] 1 ``` ```r # Multiplication uses '*' 3 * 3 ``` ``` ## [1] 9 ``` ```r # Division uses '/' 4 / 2 ``` ``` ## [1] 2 ``` ]] --- ## Arithmetic .pull-left[ - Exponentiation `^` - Write `\(x^y\)` as `x ^ y` - Modulus `%%` - The remainder after division - Ex.: `\(46\text{ mod }6 = 4\)` 1. `\(6 \times 7 = 42\)` 2. `\(46 - 42 = 4\)` 3. `\(4 < 6\)`, so 4 is the remainder - Integer division `%/%` (not used often) - Like division, but it drops any decimal ] .pull-right[ .pythoncode[ ```r # Exponentiation uses '^' 5 ^ 5 ``` ``` ## [1] 3125 ``` ```r 25 ^ (1/2) ``` ``` ## [1] 5 ``` ```r # Modulus (remainder) uses '%%' 46 %% 6 ``` ``` ## [1] 4 ``` ```r # Integer division uses '%/%' 46 %/% 6 ``` ``` ## [1] 7 ``` ]] --- ## Variable assignment .pull-left[ - Variable assignment lets you give something a name - This lets you easily reuse it - In R, we can name almost anything that we create - Values - Data - Functions, etc... - We will name things using the `<-` or `=` command, with the first being preferred ] .pull-right[ .pythoncode[ ```r # Store 2 in 'x' and 'x1' x <- 2 x1 <- 2 # Check the value of x and x1 x; x1 ``` ``` ## [1] 2 ``` ``` ## [1] 2 ``` ```r # Store arithmetic in y y <- x * 2 # Check the value of y y ``` ``` ## [1] 4 ``` ]] --- ## Variable assignment .pull-left[ - Note that values are calculated at the time of assignment - We previously set `y <- 2 * x` - If we change the values of `x` and `y` remain unchanged! - Variables: combinations of alphanumeric characters along with periods (`.`) and underscores (`_`), cannot start with a number or an underscore though - Best practice: use actual names for variables instead of single letters. ] .pull-right[ ```r # Previous value of x and y x ``` ``` ## [1] 2 ``` ```r y ``` ``` ## [1] 4 ``` ```r # Change x, how about y? x <- 200 x ``` ``` ## [1] 200 ``` ```r y ``` ``` ## [1] 4 ``` ] --- ## Variable assignment .pull-left[ - To remove a variable, use function `rm()` - free up memory - Variable names are case sensitive ] .pull-right[ ```r # Assign value to x x <- 1 # remove variable x rm(x) # Check the value of x x ``` ```r # Store 2 in 'x' x <- 2 # Check the value of X X ``` ] --- ## Application: Singtel > Set a variable `growth` to the amount of Singtel's earnings growth percent in 2018 ```r # Data from Singtel's earnings reports, in Millions of SGD singtel_2017 <- 3831.0 singtel_2018 <- 5430.3 # Compute growth growth <- singtel_2018 / singtel_2017 - 1 # Check the value of growth growth ``` ``` ## [1] 0.4174628 ``` --- ## Recap - So far, we are using R as a glorified calculator - The key to using R is that we can scale this up with little effort - Calculating *all* public companies' earnings growth isn't much harder than calculating Singtel's! > Scaling this up will give use a lot more value - We can also leverage **functions** to automate more complex operations - There are many functions built in, and many more freely available - We'll also need ways to read **data files** and work with collections of numbers .center[<a target="_blank" href="https://blog.revolutionanalytics.com/2017/01/cran-10000.html"><img src="../../../Figures/r-packages.png" height="200px"></a>] --- class: inverse, center, middle # Working with data in R --- ## Data types in **R** - The four main types of data in R: - **Numeric:** Any number - Positive or negative - With or without decimals - **Boolean/Logical:** `TRUE` or `FALSE` - Capitalization matters! - Shorthand is `T` and `F` - **Character:** "text in quotes" - More difficult to work with - Either single or double quotes although double is recommended - **Factor:** Converts text into numeric data - Categorical data for statistical analysis - eg, convert Male/Female into numbers to be included in statistical analysis --- ## Data types in **R** ```r tech_firm <- TRUE # boolean data earnings <- 12662 # numeric data class(tech_firm) ``` ``` ## [1] "logical" ``` ```r is.logical(tech_firm) ``` ``` ## [1] TRUE ``` ```r is.numeric(earnings) ``` ``` ## [1] TRUE ``` --- ## Data types in **R** ```r company_name <- "Google" # character data company_name <- 'Google' # also character data company_name ``` ``` ## [1] "Google" ``` ```r class(company_name) ``` ``` ## [1] "character" ``` ```r is.character(company_name) ``` ``` ## [1] TRUE ``` ```r nchar(company_name) ``` ``` ## [1] 6 ``` --- ## Practice: Data types - This practice is to make sure you understand main data types - Do Exercise 1 on the following R practice file: - <a target="_blank" href="Session_2s_Exercise.html#Exercise_1:_Data_types">R Practice</a> --- ## Scaling up...... - We already have some data entered, but it's only a small amount - We need to scale this up... - **Vectors** using `c()`! - **Matrices** using `matrix()`! - **Lists** using `list()`! - **Data frames** using `data.frame()`! > Each of these is covered in the coming slides --- class: inverse, center, middle # Vectors --- ## Vectors: What are they? - Remember back to linear algebra... - Examples: $$ `\begin{matrix} \left(\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}\right) & \text{or} & \left(\begin{matrix} 1 & 2 & 3 & 4 \end{matrix}\right) \end{matrix}` $$ > Vector is a row (or column) of data --- ## Vector creation - Vectors are entered using the `c()` command - Any data type is fine, but all elements must be the *same type* ```r company <- c("Google", "Microsoft", "Goldman") company ``` ``` ## [1] "Google" "Microsoft" "Goldman" ``` ```r tech_firm <- c(TRUE, TRUE, FALSE) tech_firm ``` ``` ## [1] TRUE TRUE FALSE ``` ```r earnings <- c(12662, 21204, 4286) earnings ``` ``` ## [1] 12662 21204 4286 ``` --- ## Vector has no dimension > A vector in R can be seen as a "concatenation" (in fact *c* stands for concatenate) of elements of 1 or more of the *same* data type, indexed by their positions and so no dimensions (in a spatial sense), but just a continuous index that goes from 1 to the length of the object itself. - A vector is neither a row vector nor a column vector. - So R will interpret a vector in whichever way makes the *matrix* product sensible. --- ## Vector has no dimension ```r dim(earnings) = c(1, 3) # add dimmensions earnings ``` ``` ## [,1] [,2] [,3] ## [1,] 12662 21204 4286 ``` ```r dim(earnings) = c(3, 1) earnings ``` ``` ## [,1] ## [1,] 12662 ## [2,] 21204 ## [3,] 4286 ``` ```r class(earnings) ``` ``` ## [1] "matrix" "array" ``` ```r dim(earnings) = NULL # remove dimensions class(earnings) ``` ``` ## [1] "numeric" ``` --- ## Special cases for vectors .pull-left[ - Counting between integers using colon and seq() - `:`, e.g. `1:5` or `22:500` - `seq()`, e.g. `seq(from=0, to=100,` ` by=5)` ```r 1:5 ``` ``` ## [1] 1 2 3 4 5 ``` ```r seq(from=0, to=100, by=5) ``` ``` ## [1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 ## [20] 95 100 ``` `\(\uparrow\)` note that [20] means the 20th output ] .pull-right[ - Repeating something - `rep()`, e.g. `rep(1,times=10)` or `rep("hi",times=5)` ```r rep(1, times=10) ``` ``` ## [1] 1 1 1 1 1 1 1 1 1 1 ``` ```r rep("hi", times=5) ``` ``` ## [1] "hi" "hi" "hi" "hi" "hi" ``` ] --- ## Vector math > Works the same as scalars (real numbers), but applies *element-wise* - First element with first element, - Second element with second element, - ...... ```r earnings # previously defined ``` ``` ## [1] 12662 21204 4286 ``` ```r earnings + earnings # Add element-wise ``` ``` ## [1] 25324 42408 8572 ``` ```r earnings * earnings # multiply element-wise ``` ``` ## [1] 160326244 449609616 18369796 ``` --- ## Vector math > Can also use 1 vector and 1 scalar - Scalar is applied to all vector elements ```r earnings + 10000 # Adding a scalar to a vector ``` ``` ## [1] 22662 31204 14286 ``` ```r 10000 + earnings # Order doesn't matter ``` ``` ## [1] 22662 31204 14286 ``` ```r earnings / 1000 # Dividing a vector by a scalar ``` ``` ## [1] 12.662 21.204 4.286 ``` --- ## Vector math - From linear algebra, you might remember multiplication being a bit different, as a dot product. That can be done with `%*%` ```r # Dot product: sum of product of elements earnings %*% earnings # returns a matrix though... ``` ``` ## [,1] ## [1,] 628305656 ``` ```r drop(earnings %*% earnings) # drops excess dimensions ``` ``` ## [1] 628305656 ``` --- ## Vector math - Other useful functions, `length()` and `sum()`: ```r length(earnings) # returns the number of elements ``` ``` ## [1] 3 ``` ```r sum(earnings) # returns the sum of all elements ``` ``` ## [1] 38152 ``` --- ## Naming vectors .pull-left[ - Vectors allow us to include a lot of information in one object - It isn't easy to read though - We can make things more readable by assigning `names()` - Names provide a way to easily work with and understand the data ] .pull-right[ *Hard to read:* ```r earnings ``` ``` ## [1] 12662 21204 4286 ``` *Easy to read:* ```r names(earnings) <- c("Google", "Microsoft", "Goldman") earnings ``` ``` ## Google Microsoft Goldman ## 12662 21204 4286 ``` ] --- ## Selecting vectors .pull-left[ - Selecting can be done a few ways. - By index, such as `[1]` - By name, such as `["Google"]` ```r earnings[1] ``` ``` ## Google ## 12662 ``` ```r earnings["Google"] ``` ``` ## Google ## 12662 ``` ] .pull-right[ - Multiple selection: - `earnings[c(1,2)]` - `earnings[1:2]` - `earnings[c("Google",` `"Microsoft")]` ```r # Each of the above 3 is equivalent earnings[1:2] ``` ``` ## Google Microsoft ## 12662 21204 ``` ] --- ## Combining vectors - Combining is done using `c()` ```r c1 <- c(1, 2, 3) c2 <- c(4, 5, 6) c3 <- c(c1, c2) c3 ``` ``` ## [1] 1 2 3 4 5 6 ``` --- ## Factor vectors - *Factors* in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed. - convert character values into numerical values - categorical variables in statistical modeling - *Levels* of a factor are the unique values of that factor variable - R is giving each unique value of a factor a unique integer, tying it back to the character representation - Levels can be ordered --- ## Factor vectors ```r x <- factor(c("High School", "College", "Masters", "PhD")) x ``` ``` ## [1] High School College Masters PhD ## Levels: College High School Masters PhD ``` ```r x <- factor(c("College", "High School", "PhD", "PhD", "Masters"), levels = c("High School", "College", "Masters", "PhD"), ordered = TRUE) x ``` ``` ## [1] College High School PhD PhD Masters ## Levels: High School < College < Masters < PhD ``` ```r as.numeric(x) ``` ``` ## [1] 2 1 4 4 3 ``` --- ## Missing data .pull-left[ - Missing data is represented by *NA* in R. - an element of a vector - `is.na` tests each element of a vector for missingness - *NULL* is the absence of anyting, ie, nothingness - atomical and cannot exist within a vector ```r z <- c(1, NA, 8, 3, 5) z ``` ``` ## [1] 1 NA 8 3 5 ``` ```r is.na(z) ``` ``` ## [1] FALSE TRUE FALSE FALSE FALSE ``` ] .pull-right[ ```r mean(z) ``` ``` ## [1] NA ``` ```r mean(z, na.rm = TRUE) ``` ``` ## [1] 4.25 ``` ```r y <- c(1, NULL, 2) y ``` ``` ## [1] 1 2 ``` ```r is.null(y) ``` ``` ## [1] FALSE ``` ] --- ## Vector example ```r # Calculating profit margin for all public US tech firms # 715 tech firms with >1M sales in 2017 summary(earnings_2017) # Cleaned data from Compustat, in $M USD ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -4307.49 -15.98 1.84 296.84 91.36 48351.00 ``` ```r summary(revenue_2017) # Cleaned data from Compustat, in $M USD ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.06 102.62 397.57 3023.78 1531.59 229234.00 ``` ```r profit_margin <- earnings_2017 / revenue_2017 summary(profit_margin) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -13.97960 -0.10253 0.01353 -0.10967 0.09295 1.02655 ``` --- ## Vector example ```r # order() to sort and return the index for each element # head() to output the first few elements head(order(profit_margin)) ``` ``` ## [1] 424 477 612 305 317 625 ``` ```r # These are the worst and best profit margin firms in 2017. profit_margin[order(profit_margin)][c(1, length(profit_margin))] ``` ``` ## HELIOS AND MATHESON ANALYTIC CCUR HOLDINGS INC ## -13.979602 1.026549 ``` --- ## Practice: Vectors - This practice explores the ROA of Goldman Sachs, JPMorgan, and Citigroup in 2017 - Do Exercise 2 on the following R practice file: - <a target="_blank" href="Session_2s_Exercise.html#Exercise_2:_Vectors">R Practice</a> --- class: inverse, center, middle # Matrices --- ## Matrices: what are they? - Remember back to linear algebra... - Example: `$$\left(\begin{matrix}1 & 2 & 3 & 4\\5 & 6 & 7 & 8\\9 & 10 & 11 & 12\end{matrix}\right)$$` > Matrix is a rows *and* columns of data --- ## Matrix creation - Matrices are entered using the `matrix()` command - Any data type is fine, but all elements must be the *same type* ```r columns <- c("Google", "Microsoft", "Goldman") rows <- c("Earnings","Revenue") # same: matrix(data=c(12662, 21204, 4286, 110855, 89950, 42254),ncol=3) firm_data <- matrix(data=c(12662, 21204, 4286, 110855, 89950, 42254), nrow=2) firm_data ``` ``` ## [,1] [,2] [,3] ## [1,] 12662 4286 89950 ## [2,] 21204 110855 42254 ``` --- ## Math with matrices > Everything with matrices works just like vectors ```r firm_data + firm_data ``` ``` ## [,1] [,2] [,3] ## [1,] 25324 8572 179900 ## [2,] 42408 221710 84508 ``` ```r firm_data / 1000 ``` ``` ## [,1] [,2] [,3] ## [1,] 12.662 4.286 89.950 ## [2,] 21.204 110.855 42.254 ``` --- ## Math with matrices - Matrix transposing, `\(A^T\)`, uses `t()` ```r firm_data_T <- t(firm_data) firm_data_T ``` ``` ## [,1] [,2] ## [1,] 12662 21204 ## [2,] 4286 110855 ## [3,] 89950 42254 ``` - Matrix multiplication, `\(A~B\)`, uses `%*%` ```r firm_data %*% firm_data_T ``` ``` ## [,1] [,2] ## [1,] 8269698540 4544356878 ## [2,] 4544356878 14523841157 ``` > Matrix is the cornerstone of machine learning, although we don't use it much for this course --- ## Matrix naming - We can name matrix rows and columns, much like we named vector elements - Use `rownames()` for rows - Use `colnames()` for columns ```r rownames(firm_data) <- rows colnames(firm_data) <- columns firm_data ``` ``` ## Google Microsoft Goldman ## Earnings 12662 4286 89950 ## Revenue 21204 110855 42254 ``` --- ## Selecting from matrices - Select using 2 indexes instead of 1: - `matrix_name[rows, columns]` - To select all rows or columns, leave that index blanks ```r firm_data[2, 3] ``` ``` ## [1] 42254 ``` ```r firm_data[, c("Google","Microsoft")] ``` ``` ## Google Microsoft ## Earnings 12662 4286 ## Revenue 21204 110855 ``` ```r firm_data[1, ] ``` ``` ## Google Microsoft Goldman ## 12662 4286 89950 ``` --- ## Combining matrices - Matrices are combined top to bottom as rows with `rbind()` ```r # Preloaded: industry codes as indcode (vector) # - GICS codes: 40 = Financials, 45 = Information Technology # - https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard mat <- rbind(firm_data, indcode) # Add a row rownames(mat)[3] <- "Industry" # Name the new row mat ``` ``` ## Google Microsoft Goldman ## Earnings 12662 4286 89950 ## Revenue 21204 110855 42254 ## Industry 45 45 40 ``` --- ## Combining matrices - Matrices are combined side-by-side as columns with `cbind()` ```r # Preloaded: JPMorgan data as jpdata (vector) mat <- cbind(firm_data, jpdata) # Add a column colnames(mat)[4] <- "JPMorgan" # Name the new column mat ``` ``` ## Google Microsoft Goldman JPMorgan ## Earnings 12662 4286 89950 17370 ## Revenue 21204 110855 42254 115475 ``` --- class: inverse, center, middle # Lists --- ## Lists: what are they? - Like vectors, but with mixed types - Generally not something we will create, often returned by analysis functions in R - Such as the linear regression models `lm()` ```r model <- summary(lm(earnings ~ revenue, data=tech_df)) model ``` ``` ## ## Call: ## lm(formula = earnings ~ revenue, data = tech_df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -16045.0 20.0 141.6 177.1 12104.6 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.837e+02 4.491e+01 -4.091 4.79e-05 *** ## revenue 1.589e-01 3.564e-03 44.585 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1166 on 713 degrees of freedom ## Multiple R-squared: 0.736, Adjusted R-squared: 0.7356 ## F-statistic: 1988 on 1 and 713 DF, p-value: < 2.2e-16 ``` --- ## Structure of a list - `str()` will tell us what's in this list ```r str(model) ``` ``` ## List of 11 ## $ call : language lm(formula = earnings ~ revenue, data = tech_df) ## $ terms :Classes 'terms', 'formula' language earnings ~ revenue ## .. ..- attr(*, "variables")= language list(earnings, revenue) ## .. ..- attr(*, "factors")= int [1:2, 1] 0 1 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "earnings" "revenue" ## .. .. .. ..$ : chr "revenue" ## .. ..- attr(*, "term.labels")= chr "revenue" ## .. ..- attr(*, "order")= int 1 ## .. ..- attr(*, "intercept")= int 1 ## .. ..- attr(*, "response")= int 1 ## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> ## .. ..- attr(*, "predvars")= language list(earnings, revenue) ## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric" ## .. .. ..- attr(*, "names")= chr [1:2] "earnings" "revenue" ## $ residuals : Named num [1:715] -59.7 173.8 -620.2 586.7 613.6 ... ## ..- attr(*, "names")= chr [1:715] "40" "103" "127" "135" ... ## $ coefficients : num [1:2, 1:4] -1.84e+02 1.59e-01 4.49e+01 3.56e-03 -4.09 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:2] "(Intercept)" "revenue" ## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)" ## $ aliased : Named logi [1:2] FALSE FALSE ## ..- attr(*, "names")= chr [1:2] "(Intercept)" "revenue" ## $ sigma : num 1166 ## $ df : int [1:3] 2 713 2 ## $ r.squared : num 0.736 ## $ adj.r.squared: num 0.736 ## $ fstatistic : Named num [1:3] 1988 1 713 ## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf" ## $ cov.unscaled : num [1:2, 1:2] 1.48e-03 -2.83e-08 -2.83e-08 9.35e-12 ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:2] "(Intercept)" "revenue" ## .. ..$ : chr [1:2] "(Intercept)" "revenue" ## - attr(*, "class")= chr "summary.lm" ``` --- ## Looking into lists - Lists generally use double square brackets, `[[index]]` - Used for pulling individual elements out of a list - `[[c()]]` will drill through lists, as opposed to pulling multiple values - Single square brackets pull out elements as it is - Double square brackets extract just the element - For 1 level, we can also use `$` .pull-left[ ```r model["r.squared"] ``` ``` ## $r.squared ## [1] 0.7360059 ``` ```r model[["r.squared"]] ``` ``` ## [1] 0.7360059 ``` ```r model$r.squared ``` ``` ## [1] 0.7360059 ``` ] .pull-right[ ```r earnings["Google"] ``` ``` ## Google ## 12662 ``` ```r earnings[["Google"]] ``` ``` ## [1] 12662 ``` ```r #Can't use $ with vectors ``` ] --- ## Practice: Lists - In this practice, we will explore lists and how to parse them - Do Exercise 3 on the following R practice file: - <a target="_blank" href="Session_2s_Exercise.html#Exercise_3:_Lists">R Practice</a> --- class: inverse, center, middle # Data frames --- ## Data frames: what? - Data frames are like a hybrid between lists and matrices .pull-left[ Like a matrix: - 2 dimensional like matrices - Can access data with `[]` - All elements in a column must be the same data type ] .pull-right[ Like a list: - Can have different data types for different columns - Can access data with `$` ] > Think of columns as variables, rows as observations, and data frames as the Excel spreadsheet --- ## Example of a data frame ```r library(DT) # The library is for including larger collections of data in output datatable(tech_df[1:20, c("conm","tic","margin")], options = list(pageLength = 5), rownames=FALSE) ```
--- ## How to create a df? 1. On import of data, usually you will get a data frame 2. Using the `data.frame()` function ```r df <- data.frame(companyName = company, earnings = earnings, tech_firm = tech_firm) df ``` ``` ## companyName earnings tech_firm ## Google Google 12662 TRUE ## Microsoft Microsoft 21204 TRUE ## Goldman Goldman 4286 FALSE ``` --- ## Selecting from df - Access like a matrix ```r df[, 1] ``` ``` ## [1] "Google" "Microsoft" "Goldman" ``` - Access like a list ```r df$companyName ``` ``` ## [1] "Google" "Microsoft" "Goldman" ``` ```r df[[1]] ``` ``` ## [1] "Google" "Microsoft" "Goldman" ``` > All are relatively equivalent. Using `$` is generally most natural. Using `[,]` is good for complex references. --- ## Making new columns > Suggested method: use `$` ```r df$all_zero <- 0 df$revenue <- c(110855, 89950, 42254) df$margin <- df$earnings / df$revenue # html_df() is a custom function for small tables html_df(df) ``` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> companyName </th> <th style="text-align:center;"> earnings </th> <th style="text-align:center;"> tech_firm </th> <th style="text-align:center;"> all_zero </th> <th style="text-align:center;"> revenue </th> <th style="text-align:center;"> margin </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Google </td> <td style="text-align:left;"> Google </td> <td style="text-align:center;"> 12662 </td> <td style="text-align:center;"> TRUE </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 110855 </td> <td style="text-align:center;"> 0.1142213 </td> </tr> <tr> <td style="text-align:left;"> Microsoft </td> <td style="text-align:left;"> Microsoft </td> <td style="text-align:center;"> 21204 </td> <td style="text-align:center;"> TRUE </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 89950 </td> <td style="text-align:center;"> 0.2357310 </td> </tr> <tr> <td style="text-align:left;"> Goldman </td> <td style="text-align:left;"> Goldman </td> <td style="text-align:center;"> 4286 </td> <td style="text-align:center;"> FALSE </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 42254 </td> <td style="text-align:center;"> 0.1014342 </td> </tr> </tbody> </table> > Alternative method: use `cbind()` just like with matrices --- ## Sorting data frames - To sort a *vector*, we could use the `sort()` ```r sort(df$earnings) ``` ``` ## [1] 4286 12662 21204 ``` > THIS CAN'T SORT DATA FRAMES - A column of a data frame is fine, but it can't sort the whole thing! --- ## Sorting data frames - To sort a data frame, we use the `order()` function - It returns the order of each element in increasing value - 1 is the lowest value - Then we pass the new order like we are selecting elements ```r ordering <- order(df$earnings) ordering ``` ``` ## [1] 3 1 2 ``` ```r df <- df[ordering, ] df ``` ``` ## companyName earnings tech_firm all_zero revenue margin ## Goldman Goldman 4286 FALSE 0 42254 0.1014342 ## Google Google 12662 TRUE 0 110855 0.1142213 ## Microsoft Microsoft 21204 TRUE 0 89950 0.2357310 ``` --- ## Sorting data frames - Order can sort by multiple levels - `order(level1, level2, ...)`, where `level_` are vectors or df columns ```r example <- data.frame(firm=c("Google","Microsoft","Google","Microsoft"), year=c(2017, 2017, 2016, 2016)) example ``` ``` ## firm year ## 1 Google 2017 ## 2 Microsoft 2017 ## 3 Google 2016 ## 4 Microsoft 2016 ``` ```r ordering <- order(example$firm, example$year) example <- example[ordering, ] example ``` ``` ## firm year ## 3 Google 2016 ## 1 Google 2017 ## 4 Microsoft 2016 ## 2 Microsoft 2017 ``` --- ## Subsetting data frames 1. We can use the selecting methods from before 2. We can pass a vector of logical values telling R what to keep - This is pretty useful! 3. We can also use `subset()` function ```r df[df$tech_firm, ] # Remember the comma! ``` ``` ## companyName earnings tech_firm all_zero revenue margin ## Google Google 12662 TRUE 0 110855 0.1142213 ## Microsoft Microsoft 21204 TRUE 0 89950 0.2357310 ``` ```r subset(df, earnings < 20000) ``` ``` ## companyName earnings tech_firm all_zero revenue margin ## Goldman Goldman 4286 FALSE 0 42254 0.1014342 ## Google Google 12662 TRUE 0 110855 0.1142213 ``` --- ## Practice: Data frames - This exercise explores the nature of banks' deposits - We will see which of Goldman, JPMorgan, and Citigroup have (since 2010): - The least of their assets in deposits - The most of their assets in deposits - Do Exercise 4 on the following R practice file: - <a target="_blank" href="Session_2s_Exercise.html#Exercise_4:_Data_frames">R Practice</a> --- class: inverse, center, middle # Summary of Session 2 --- ## For next week - continue with your Datacamp and textbook - review today's code and pre-read next week's seminar notes - start the **Assignment 1** which is due in two weeks. > Tentatively, there will be the following progress assessment (30%): 1. Individual Assignment 1, on R Programming Basics 2. Individual Assignment 2, on Regressions 3. Two pop up quizzes - Individual assignments will be in [R Markdown (.rmd)](https://rmarkdown.rstudio.com/) file format > All sumbissions and feedback are on eLearn. Please pay attention to academic integrity. --- ## R Markdown: A quick guide - Headers and subheaders start with `#`, `##`, ..., `######` - Code blocks starts with <img src="../../../Figures/rchunkbeg.png"> and end with <img src="../../../Figures/rchunkend.png"> (backticks or grave accent) - By default, all code and figures will show up in the output - `echo=FALSE`: don't display code in output document - `results="hide"`: don't display results in output - Inline code goes in a block starting with <img src="../../../Figures/rchunkinlbeg.png"> and ending with <img src="../../../Figures/rchunkinlend.png"> - Italic font can be used by putting `*` or `_` around *text* - Bold font can be used by putting `**` around text - E.g.: `**bold text**` becomes **bold text** - To render the document, click <img src="../../../Figures/knit.png" style="margin:0px"> - Math can be placed between `$` to use [LaTeX](https://www.latex-project.org/) notation - E.g. `$\frac{revt}{at}$` becomes `\(\frac{revt}{at}\)` - Full equations (on their own line) can be placed between `$$` - A block quote is prefixed with `>` - For a complete guide, see R Studio's <a target="_blank" href="https://www.rstudio.com/resources/cheatsheets/">R Markdown::Cheat Sheet</a> - My slides are prepared using the [xaringan](https://github.com/yihui/xaringan) template - The assignment is prepared using the [tufte style](https://github.com/rstudio/tufte) --- ## R Coding Style Guide Style is subjective and arbitrary but it is important to follow a generally accepted style if you want to share code with others. I suggest the [The tidyverse style guide](https://style.tidyverse.org/) which is also adopted by [Google](https://google.github.io/styleguide/Rguide.html) with some modification - Highlights of **the tidyverse style guide**: - *File names*: end with .R - *Identifiers*: variable_name, function_name, try not to use "." as it is reserved by Base R's S3 objects - *Line length*: 80 characters - *Indentation*: two spaces, no tabs (RStudio by default converts tabs to spaces and you may change under global options) - *Spacing*: x = 0, not x=0, no space before a comma, but always place one after a comma - *Curly braces {}*: first on same line, last on own line - *Assignment*: use `<-`, not `=` nor `->` - *Semicolon(;)*: don't use, I used once for the interest of space - *return()*: Use explicit returns in functions: default function return is the last evaluated expression - *File paths*: use [relative file path](https://www.w3schools.com/html/html_filepaths.asp) "../../filename.csv" rather than absolute path "C:/mydata/filename.csv". Backslash needs `\\` --- ## R packages used in this slide This slide was prepared on 2021-09-24 from Session_2s.Rmd with R version 4.1.1 (2021-08-10) Kick Things on Windows 10 x64 build 18362 🙋. The attached packages used in this slide are: ``` ## DT kableExtra knitr ## "0.18" "1.3.4" "1.33" ```