R Programming I: Basic data types, structures & subsetting

liane-varnes
Uploaded On 2019-11-08

Presentation Transcript

R Programming I: Basic data types, structures & subsetting
EPID 799C, Fall 2018

Suggestion for Class Arrival
- Download the lecture.
- Open a lecture-specific scratchpad R script to "take notes".
- Open the homework script / assignment and half-listen for nuggets.

A clear honor code note!
Technically, the HW answers (among other things) are online (old site). We're asking you not to look in advance. That's it. That's how the honor code works.

Prototypical Epi Analysis
Note a little overlap with HW2. We'll occasionally learn some useful tools "out of order": slightly more advanced concepts that you'd often want to pull out right away. But generally we're working in order.

Overview
Functional focus: exploration & recoding
Building the toolset:
- Rules of R syntax
- Basic object types & data structures
- Basic operators

Elements of R Syntax
Objects: 3, pi, Mydata, somevariable
Operators: + - * / & | %in%
Functions: mean() sd() plot() glm()

Rules of R Grammar
R evaluates expressions. Expressions are objects linked using operators and functions:
- Operators link objects side-by-side:
    1+2
    weight/height^2
    data$variable
- Functions link objects in (optionally) named groups:
    sum(1,2,3,4)
    rnorm(n=10, mean=0, sd=1)

Everything else is vocabulary!
Recommended: try using all (or most) of these: http://adv-r.had.co.nz/Vocabulary.html

Organizing Syntax Elements
1. Objects: classes, data structures
2. Operators: assignment, infix operators
3. Functions (more next class)

1A. Classes
Inspect with typeof() or class().
Basic types:
- Logical
- Numeric (integer, real)
- Complex
- Character
Useful compound types: dates, factors, ... so many others; we'll start Wednesday.
Coercion, e.g.:
    as.numeric()
    str(c("a", 1))
    sum(T, T, F)
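The coercion calls on this slide can be tried on small made-up values. The sketch below is illustrative only; the object names `mixed`, `total`, and `not_a_number` are not from the lecture.

```r
# Mixing a character and a number in c() coerces everything to character.
mixed <- c("a", 1)
class(mixed)                      # "character"

# Logicals are coerced to 1 (TRUE) and 0 (FALSE) in arithmetic.
total <- sum(TRUE, TRUE, FALSE)
total                             # 2

# Explicit coercion with as.numeric(); text that is not a number becomes NA.
as.numeric("3.5") + 1             # 4.5
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)               # TRUE
```

Without `suppressWarnings()`, the last conversion also prints a warning that NAs were introduced by coercion, which is often a useful signal when recoding.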

1B. Data Structures
Inspect with str().

          Homogeneous      Heterogeneous
    1D    Atomic vector    List
    2D    Matrix           Data frame
    nD    Array
    ...plus compound types

Create with, conveniently: c() list() matrix() data.frame()* array()
Most relevant functions: str() class() typeof() length() nrow() ncol() dim() attributes() attr() names() rownames() colnames()

Sidenote: names()
Names will play a big role later, but in short: you can name the "cells" of a vector or list, or the rows or columns of a matrix/data.frame. They get stored in attributes.
    QE_scores = c("Student A" = 80, "Student B" = 90, "Student C" = 75)
    typeof(QE_scores)
    names(QE_scores)
    str(QE_scores)

Assignment: = or <-
To define an object, use <- or =
    students <- 20                 # [no output]
    students                       # 20
    births_top = head(births_sm)
    births_top
An expression without assignment prints the result but does not modify any objects. An expression with assignment defines an object but does not display the result.

Atoms: The Basic Building Block
One "unit" of data:
    my_year = 2012
    my_study = "Births: PNC & Preterm Birth"
[Diagram: each object pairs a symbol (my_year, my_study) with a value (2012, "Births: PNC & PT Birth").]

Arithmetic Operators
All of the basic operators (and order of operations) work like you [should] expect with atoms:
    1+1
    18-19
    100/3
    births_top$wksgest + 1
    %%     # remainder (modulo)
    %/%    # integer division (quotient)
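As a quick check of the two less familiar operators, here is a small sketch on made-up numbers. The vector `wks` is a hypothetical stand-in for a gestation column, not the course data.

```r
# Remainder and integer division are complementary:
remainder <- 17 %% 5     # 2
quotient  <- 17 %/% 5    # 3
quotient * 5 + remainder # 17, the original number

# Order of operations: exponent, then multiplication, then addition.
2 + 3 * 4^2              # 50

# Arithmetic on a vector applies to every element.
wks <- c(38, 40, 41)     # hypothetical gestation weeks
wks + 1                  # 39 41 42
```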

Logical Operators
If you ask R to evaluate an equation, inequality, or Boolean expression of atoms, it will return TRUE or FALSE:
    1 == 2+3              # FALSE
    3 < 4                 # TRUE
    12 >= 13-1            # TRUE
    TRUE & FALSE          # FALSE
    TRUE | FALSE          # TRUE
    (3<4) & !(FALSE)      # TRUE

Aside: Fancier Binary Operators
    %in%      # try 1 %in% 1:4 ...and 1:4 %in% 1
    %>%       # pipe (magrittr)
    %over%    # spatial "over" (sp)
... and define your own!
"Infix" operators (e.g. a FUNCTION b) are really just calling FUNCTION(a, b). More next class on functions. See: infix operators.
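To make the "infix operators are just functions" point concrete, here is a minimal base-R sketch. The operator `%paste%` is a made-up example, not something from the lecture or from any package.

```r
# %in% asks, for each element on the left, whether it appears on the right.
in_result <- 1 %in% 1:4        # TRUE
in_vector <- 1:4 %in% 1        # TRUE FALSE FALSE FALSE

# An infix operator can be called like any other function using backticks.
plus_call <- `+`(1, 2)         # 3

# Define your own: any function named %something% can be used infix.
`%paste%` <- function(a, b) paste(a, b)
greeting <- "hello" %paste% "world"   # "hello world"
```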

Vectors: Atoms in Sequence
Multiple "units" of data:
    locker_combo = c(12,24,7)
    foods = c("Pie", "Pizza", "Tofu")
    top_gestation_obs = births_top$wksgest
[Diagram: locker_combo holds 12 24 7; foods holds "Pie" "Pizza" "Tofu".]

Arithmetic with Vectors
Arithmetic operators can be used on vectors with other vectors or atoms:
    top_gestation_obs + 1:6
    top_gestation_obs + 1
    top_gestation_obs + 1:2    # recycling!
    top_gestation_obs + births_top$weeknum
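Recycling is easier to see on concrete numbers. The sketch below uses a made-up six-element vector `obs` in place of top_gestation_obs, since the real values depend on the course data.

```r
obs <- c(39, 40, 38, 41, 37, 40)   # hypothetical stand-in, length 6

plus_one  <- obs + 1     # the single 1 is recycled six times
plus_two  <- obs + 1:2   # 1 2 is recycled as 1 2 1 2 1 2
plus_six  <- obs + 1:6   # same length, so purely element-wise

plus_one   # 40 41 39 42 38 41
plus_two   # 40 42 39 43 38 42
plus_six   # 40 42 41 45 42 46
```

If the longer length is not a multiple of the shorter one (for example a length-6 vector plus a length-4 vector), R still recycles but issues a warning, which is usually a sign of a mistake.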

Vectorized Arithmetic
The heights and weights of five patients in a cohort study at baseline were 64, 72, 70, 67, 73 inches and 80, 85, 79, 72, and 90 kilograms.
- Create a separate height vector and a weight vector containing the data.
- Convert the height vector to centimeters (1 inch = 2.54 cm).
- Use vector arithmetic to calculate a patient bmi* vector (bmi = weight[kg] / height[m]^2, so divide the centimeter heights by 100 first).
- Now do the same thing inside a data.frame.
* Problematic as BMI is...
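One possible way to work this exercise is sketched below. The object names are made up, and the sketch uses the standard BMI definition (kilograms per square meter), so the centimeter heights are divided by 100 before squaring.

```r
# The data from the exercise
height_in <- c(64, 72, 70, 67, 73)
weight_kg <- c(80, 85, 79, 72, 90)

# Vector arithmetic: each step operates on all five patients at once
height_cm <- height_in * 2.54
bmi <- weight_kg / (height_cm / 100)^2
round(bmi, 1)

# The same calculation with the columns kept together in a data.frame
patients <- data.frame(height_in, weight_kg)
patients$height_cm <- patients$height_in * 2.54
patients$bmi <- patients$weight_kg / (patients$height_cm / 100)^2
patients
```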

Logic with Vectors
Logical operators can also be used on vectors with other vectors or atoms:
    a = 1:5
    a                    # [1] 1 2 3 4 5
    a>2                  # apply to all: [1] F F T T T
    b = c(3,2,1,3,5)
    b                    # [1] 3 2 1 3 5
    a==b                 # element-wise: [1] F T F F T
    a>=b                 # [1] F T T T T
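Because TRUE and FALSE behave as 1 and 0, a logical vector can be summarized directly. This sketch reuses a and b from the slide; the summary object names are made up for illustration.

```r
a <- 1:5
b <- c(3, 2, 1, 3, 5)

n_above_two    <- sum(a > 2)      # how many elements are > 2: 3
prop_above_two <- mean(a > 2)     # what proportion are > 2: 0.6
matching_pos   <- which(a == b)   # positions where a equals b: 2 5

any(a > 4)   # TRUE: at least one element is > 4
all(a > 0)   # TRUE: every element is > 0
```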

Pause to Reflect
- We have basic types of data: numbers, logicals, characters, etc. We'll see more later, but they'll follow similar rules.
- We've seen basic data structures, most notably for now vectors and data.frames. We'll see more later (especially lists), but again, similar.
- We're about to hit the first powerful R concepts: vectorization (operating on a whole vector at once) and vectorized subsetting, including with data.frames.

Subsetting
Indexing vectors, lists, matrices and data.frames

Slicing Vectors with Atoms
Slice a vector using the square brackets: []
    top_gestation_obs[3]
    births_top$weeknum[1]     # 44
    births_top$smoker_f[2]
Think of this as "indexing" or "referencing" part of a vector.

Slicing Vectors with Vectors
Slice a vector using square brackets: []
    births_top$weeknum[1:3]
    births_top$weeknum[c(T,T,T,F,F,F)]
    births_top$mage[c(T,F,T,T,F,T)]    # >20
    # Remember, nothing "happens" to our original
    # vector unless we are using an assignment!

Subsetting with expressions
Combine a slice with a logical test to query a vector (return all elements that match a condition):
    # Step by step...
    births_top$mage > 20
    births_gte_20 = births_top$mage > 20
    births_top$mage[births_gte_20]             # >20
    # But usually just...
    births_top$mage[births_top$mage > 20]      # >20
    # Or just as valid
    births_top$mage[births_top$raceeth_f == "Other"]

Combining subsetting & queries
We now have some powerful, one-line tools. Using births_sm (all records):
- What's the mean of the weeks gestation for everyone?
- What's the mean for smokers*? Non-smokers? Those missing the smoking variable? (Hint: use the is.na() function.)
- What's the mean wksgest for moms with mage* <20? >=20? >=30?
*Remember to deal with missing values. Forget how? Try tab autocomplete, ?mean, or F1 on mean() to get that syntax.
This is powerful, but not powerful enough. Later we'll have much more efficient ways to do this...
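The pattern behind these questions can be sketched on a tiny made-up data frame. `toy`, its values, and the "Y"/"N" smoker coding are all hypothetical; the real births_sm columns may be coded differently.

```r
# Hypothetical stand-in for births_sm, including some missing values
toy <- data.frame(
  wksgest = c(39, 40, NA, 36, 41, 38),
  mage    = c(19, 25, 31, NA, 22, 17),
  smoker  = c("Y", "N", "N", NA, "Y", "N"),
  stringsAsFactors = FALSE
)

# Mean for everyone: na.rm = TRUE drops missing wksgest values
mean_all <- mean(toy$wksgest, na.rm = TRUE)                           # 38.8

# Mean for smokers: an NA in the test yields an NA element, which na.rm removes
mean_smokers <- mean(toy$wksgest[toy$smoker == "Y"], na.rm = TRUE)    # 40

# Mean for those missing the smoking variable
mean_missing <- mean(toy$wksgest[is.na(toy$smoker)], na.rm = TRUE)    # 36

# Mean for moms under 20; which() keeps only positions where the test is TRUE
mean_young <- mean(toy$wksgest[which(toy$mage < 20)], na.rm = TRUE)   # 38.5
```

Note that a logical index containing NA returns an NA element rather than dropping it, which is why `na.rm = TRUE` (or `which()`) matters here.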

Lists: Mixed Vectors
A list is a vector that can have multiple modes (flavors). Lists work like vectors but can also be referenced slightly differently (double brackets: [[ ]]) to return not just the subset of the list, but what's in that subset. [[ ]] == $ !
    list(thing="A", 1, TRUE)
    list(thing="A", 1, TRUE)[1]
    list(thing="A", 1, TRUE)[[1]]
    list(thing="A", 1, TRUE)$thing
Lists are a useful object for complex operations and objects. We will cover them later, but this is a useful glimpse for data.frames...
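The difference between single and double brackets shows up in the class of what comes back. This sketch stores the slide's list in an object called `x`, a name chosen here for illustration.

```r
x <- list(thing = "A", 1, TRUE)

class(x[1])      # "list": single brackets return a smaller list
class(x[[1]])    # "character": double brackets return the contents

# $ and [[ ]] with a name retrieve the same element
identical(x[["thing"]], x$thing)   # TRUE

# Single brackets can take several positions at once
length(x[2:3])   # 2
```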

Sidenote: Matrices are organized Vectors
Vectors can be connected into a matrix with rbind() or cbind():
    a = c(1,2,3)
    b = c(4,5,6)
    rbind(a, b)    # a and b become rows
    cbind(a, b)    # a and b become columns

Slicing Matrices
Like vectors, matrices can be sliced using []. Give slice instructions for both rows and columns (leave one blank to specify "all"), separated by a comma:
    m = rbind(4:6, 7:9)    # stack rows
    # m is:
    # 4 5 6
    # 7 8 9
    m[1, ]                 # row 1, all columns: 4 5 6
    m[ ,2]                 # all rows, column 2: 5 8
    m[1:2, 2:3]            # rows 1 to 2, cols 2 to 3:
                           # 5 6
                           # 8 9

Slicing with Matrices
Matrices (rectangular data) can also be sliced by a logical matrix (or by extension, a logical test that returns a logical matrix):
    m = cbind(c(4,7), c(5,8), c(6,9))      # same m
    # m is:
    # 4 5 6
    # 7 8 9
    m[rbind(c(T,F,T), c(F,T,F))]           # 4 8 6
    m[m%%2==0]                             # even numbers: 4 8 6
    # Note that this approach returns a vector!
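The order of the returned vector (4 8 6 rather than 4 6 8) follows from how R stores a matrix: column by column. A small sketch, using the same m built with rbind() and a result name, `evens`, chosen for illustration:

```r
m <- rbind(4:6, 7:9)

# A matrix is stored column by column
as.vector(m)              # 4 7 5 8 6 9

# Logical subsetting walks the matrix in that same column-wise order
evens <- m[m %% 2 == 0]
evens                     # 4 8 6

# To recover where the matches sit, ask for row/column positions
which(m %% 2 == 0, arr.ind = TRUE)
```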

Double Slicing
Remember, output can always be input: you can also slice the result of a slice as an alternative specification:
    m = rbind(4:6, 7:9)    # better
    # m is:
    # 4 5 6
    # 7 8 9
    v = m[1, ]             # v is 4 5 6
    v[2]                   # 5
    # one step
    m[1, ][2]              # 5

Data Frames: Mixing and Naming
Data frames allow you to mix-and-match different modes (flavors) of vectors into a matrix you can reference by name. This is a data set. The benefit is treating related data together (vs. all free-floating vectors). We've been using this since day 1.
    id = c("A","B","C")
    bp = c(115, 120, 130)
    dx = c(0, 0, 1)
    data.frame(id, bp, dx)

Slicing & Assignment
Remember names? In addition to using the matrix methods, you can also make references by name using the [] or $ operators*:
    names(births_top)
    births_top["wksgest"]                             # return df
    births_top$wksgest
    births_top[, "wksgest"]
    births_top[1:3, 1:3]
    births_top[births_top$mage == 20, "wksgest"]
    subset(births_top, subset = births_top$mage == 20, select = "wksgest")    # Rarely do this
    births_top[, c("wksgest", "weeknum", "mage")]
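These forms differ mainly in what kind of object they return. The sketch below uses a small made-up data frame `dat` (with id, bp, and dx columns) rather than births_top, so the results do not depend on the course data.

```r
dat <- data.frame(
  id = c("A", "B", "C"),
  bp = c(115, 120, 130),
  dx = c(0, 0, 1),
  stringsAsFactors = FALSE
)

class(dat["bp"])      # "data.frame": single brackets with a name keep the frame
class(dat$bp)         # "numeric": $ returns the column itself
class(dat[, "bp"])    # "numeric": matrix-style, one column drops to a vector

# Rows by logical test, columns by name
dx_ids <- dat[dat$dx == 1, "id"]
dx_ids                # "C"

# Several columns by name always give a data frame
two_cols <- dat[, c("id", "bp")]
dim(two_cols)         # 3 2
```

Adding `drop = FALSE`, as in `dat[, "bp", drop = FALSE]`, keeps a one-column result as a data frame when that is what downstream code expects.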

Slicing & Assignment
    # Power and danger: We're allowed to
    # (and often going to) do this!
    births_top$mage[births_top$mage < 20] = NA
Careful! Don't overwrite your original births_sm data! (Meh, if you do, control-shift-F10 and start over.)
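One low-risk way to practice this kind of recoding is to assign into a copy, so the original stays available for comparison. A minimal sketch with made-up ages; `ages` and `ages_recoded` are hypothetical names.

```r
ages <- c(18, 25, 19, 32)      # hypothetical maternal ages

# Work on a copy so the original vector is untouched
ages_recoded <- ages
ages_recoded[ages_recoded < 20] <- NA

ages_recoded   # NA 25 NA 32
ages           # 18 25 19 32 (unchanged)
```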

Slicing Data Frames: Advanced Note (for later!)
    births_top["wksgest"]
    births_top[["wksgest"]]    # *technically... $ == [[ ]]
    births_top[, sapply(births_top, is.numeric)]
    # ^ looking weeks ahead, but can you guess?

Functions: Taking Action
Functions enable you to perform tasks. A function takes one or more arguments, separated by commas. We've been using them! Parameters can go in order, or directly by name:
    mean(dat$bp)                # one argument
    table(dat$id, dat$bp)       # two arguments
    rnorm(n=10, mean=1, sd=2)   # named arguments
...More layers Wednesday
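Positional and named arguments can be compared directly. In this sketch the vector `bp` and the objects `by_position` and `by_name` are made up, and set.seed() is used only so the two random draws are comparable.

```r
bp <- c(115, 120, 130)     # hypothetical blood pressures
mean(bp)                   # 365 / 3, about 121.67

# rnorm()'s first three parameters are n, mean, sd, in that order.
set.seed(1)
by_position <- rnorm(10, 1, 2)

# Named arguments can be given in any order.
set.seed(1)
by_name <- rnorm(sd = 2, n = 10, mean = 1)

identical(by_position, by_name)   # TRUE

# Optional arguments such as na.rm are usually passed by name.
mean(c(1, NA, 3), na.rm = TRUE)   # 2
```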

You're getting dangerous! 
- Data types & data structures to hold them
- Vectorization & vectorized subsetting for efficiency
- Basic operators and functions when you need them
Enough, already, to do a lot of exploration.

Activity: Tour (full/narrow/recoded) Dataset
Using births_sm from the rdata file, answer the questions below. Also note the "resources" folder!
Use these functions (and others): dim() summary() table() hist() plot()
- How many observations are there, and how many variables are in the (small) births dataset? (Hint: see HW1!)
- What is the average maternal age (mage)?
- Make a histogram of gestational age (WKSGEST).
- What is the minimum and maximum (non-99) gestational age?
- How many mothers smoked (smoker_f)?
- Make a scatterplot of maternal age versus gestational age.

Activity: Tour (full/narrow/unrecoded) Dataset
Now use read.csv() to read births2012_sm.csv.
- How many observations are there, and how many variables are in the (full/wide) births dataset?
- How do the types of variables compare to those in births_sm? (Hint: we've got some recoding to do!)
- What is the average maternal age (mage) now? How many mothers have the value 99?
- How many mothers smoked (smoker_CIGDUR)?
- Make a histogram of gestational age (WKSGEST).
- What is the minimum and maximum (non-99) gestational age?
Try the same questions with births2012.csv, the full/wide/unrecoded dataset. Note: much bigger!

Packages: Ready for next class
Packages are extensions to base R. They contain additional functions, documentation for using them, and sample data. Packages are available from the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/available_packages_by_name.html
The "tidyverse" is a set of packages for data manipulation, exploration, and visualization. They share a common design and work in harmony. We'll be using it extensively.
    # Install and load the package 'tidyverse'
    install.packages('tidyverse')    # only need to run once
    library(tidyverse)               # run once per R session to load it