Slide1
R Programming II
EPID 799C
Fall 2017
Slide2
Overview
Review of data types and operators
Functions (galore)
Reading & writing files
Exploring with real datasets
A little group work today…
Slide3
Data Structures
Review & deeper: 5 basic data structures, 4 (+) atomic types
Slide4
Data structures
      Homogenous      Heterogenous
1D    Atomic Vector   List*
2D    Matrix          Data frame
ND    Array
Atoms are: logical, integer, double (aka numeric), and character.
Two rare atomic types (complex and raw [2]).
Also other aggregate types. We’ll introduce factors and dates soon.
We make these things using functions c(), matrix(), array(), list() and data.frame().
Note: Lists are recursive!
[2] Do genetic work? An R package that works with DNA sequences uses raw format (byte code/math) to efficiently store and operate on those ATGCs. It’ll be largely invisible to you, but you can thank bit math for speedy comparison of sequences.
Slide5
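As a quick illustration of the four common atomic types and the constructor functions listed on the data structures slide, here is a minimal sketch. The variable names are made up for the example.

```r
# One atomic vector of each common type, checked with typeof()
lgl <- c(TRUE, FALSE, NA)
int <- c(1L, 2L, 3L)      # the L suffix makes integers
dbl <- c(1, 2.5, 3)
chr <- c("a", "b", "c")

sapply(list(lgl, int, dbl, chr), typeof)

# Containers built from atoms
m   <- matrix(1:6, nrow = 2)           # 2D, homogeneous
arr <- array(1:24, dim = c(2, 3, 4))   # ND, homogeneous
lst <- list(1, "a", list(TRUE))        # 1D, heterogeneous and recursive
df  <- data.frame(x = int, y = chr)    # 2D, one type per column
```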
We Try
Let’s make some!
Slide6
You try: data types
Create three atomic vectors (length 5) of each of these types: character, integer, logical. Name them whatever you want.
Use “:” shorthand to create a vector of numbers, again, of the same length.
Create a data.frame using those four atomic vectors, and take a look at it by printing it to console.
Create a 3x3 matrix of (any) numbers, then of logicals. (hint: rep() function may be useful for logicals)
Create a 3x3x3 array (27 elements) of the numbers 1 to 27. May need help on array()…
Create a list that includes another list in it.
Slide7
Answers
# Atomic vectors
a = c(1, 2, 3, 4)
b = 1:4
c = c("one", "two", "three", "four")
d = c(T, F, T, F)
my_df = data.frame(a, b, c, d) # stringsAsFactors = F keeps c from turning into a factor (needed before R 4.0, where the default changed)
str(my_df)
# Matrices and arrays
matrix(1:9, nrow = 3)
array(1:27, dim = c(3, 3, 3))
# A list that includes another list
my_list = list(1, "a", list("mike", "shoes"))
Slide8
Notes on Data Structures
Everything is a vector in R. Atoms are vectors of length = 1.
^As reviewed Monday: This is crazy important and useful.
Operators therefore expect vectors and know how to operate on entire vectors at once.
Lists are recursive and heterogenous. Can make up building blocks of more complex objects.
Data.frames are really…lists of atomic (homogenous) vectors*. We’ll verify this later.
* Fancy note: technically this means you could have a vector of lists (each element is a list) and it’s still a data.frame. Some recent spatial packages take advantage of this (e.g. simple features geometries).
Slide9
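The notes on data structures can be checked directly at the console. The sketch below uses the built-in iris data and a made-up vector x to show whole-vector operators and the "a data.frame is a list" claim.

```r
# Operators work on whole vectors at once
x <- c(1, 2, 3, 4)
x * 2          # every element doubled
x > 2          # a logical vector, one result per element

# Even a single number is a vector of length 1
length(5)

# A data.frame is a list of equal-length vectors, one per column
typeof(iris)   # "list"
is.list(iris)
length(iris)   # number of columns, i.e. list elements
```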
Type Coercion
Explicit
as.thing(this) will convert this into the other thing.
e.g. as.numeric(c("1", "2", "3"))
as.character(1:3) # as.... Other stuff. Or as(thing, "class")
Implicit
R tries to help when it can:
e.g. sum(T, F, T, F, T)
If you’re going to lose information (get new NAs), R will let you know. Generally arithmetic operators coerce to numbers, logical operators to logicals, etc.
Slide10
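A short sketch of the explicit and implicit coercion rules described for type coercion. The vectors mixed and lossy are illustrative names, not from the slides.

```r
# Explicit coercion
as.numeric(c("1", "2", "3"))
as.character(1:3)

# Implicit coercion: logicals become 0/1 inside arithmetic
sum(c(TRUE, FALSE, TRUE, FALSE, TRUE))   # counts the TRUEs

# Mixing types in c() coerces to the most flexible type
mixed <- c(1, TRUE, "a")
typeof(mixed)

# Lossy coercion yields NA and a warning ("NAs introduced by coercion")
lossy <- as.numeric(c("1", "two", "3"))
lossy
```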
Functions
data, str, class, summary, head, tail, setwd, getwd, View, plot, dim, nrow/ncol, sd, hist, boxplot, table, typeof, sum, read.csv and write.csv…
Slide11
Our first functions (vocabulary!)
str(), typeof(), length(), attributes(), names(), class()
Slide12
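To see what each of these first vocabulary functions reports, try them on one small object. The named vector x below is a made-up example.

```r
x <- c(first = 1.5, second = 2.5, third = 3.5)   # a named double vector

str(x)          # compact summary of structure
typeof(x)       # storage type: "double"
class(x)        # class used for dispatch: "numeric"
length(x)
names(x)
attributes(x)   # names are stored as an attribute
```

Running the same six calls on iris shows how the answers change for a data.frame.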
We Try
Let’s explore data types with our functions!
Slide13
Sidenote 1: Factors & Dates!
We’ve got the functions to make sense of these
Slide14
Aggregate Data Types: Factors
Now we’re ready: What is a factor?
Let’s find out! Create one:
roles = factor(c("student", "faculty", "staff"))
^ NOTE there was a typo during class. I forgot the c!
Find out: use str(), class(), levels(), attributes(), as.numeric(), typeof() on roles
How are factors different? Why are they here? Also see: ordered()
Stuck with factors? Check out the forcats:: package at http://forcats.tidyverse.org/ and this on factors: http://r4ds.had.co.nz/factors.html
Slide15
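One way to answer the "what is a factor?" question is to run the suggested functions on roles. The ordered() example at the end uses made-up size labels purely for illustration.

```r
roles <- factor(c("student", "faculty", "staff"))

class(roles)        # "factor"
typeof(roles)       # "integer": factors are integer codes underneath
levels(roles)       # labels, sorted alphabetically by default
as.integer(roles)   # position of each value within levels(roles)
attributes(roles)   # the levels and class live in attributes

# ordered() makes the order of the levels meaningful for comparisons
sizes <- ordered(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"))
sizes < "large"
```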
Aggregate Data Types: Dates
What is a date? Two things really…
today_date = as.Date("2017/08/30")
typeof(today_date)
today_date_lt = as.POSIXlt("2017/08/30")
typeof(today_date_lt)
(Try our other functions too!)
But hint: Futzing with dates can be a hassle. We’ll use the lubridate:: package to make that easier, later. <2m skim this: http://r4ds.had.co.nz/dates-and-times.html
Slide16
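The "two things" a date can be show up when the same day is stored both ways. This sketch adds tz = "UTC" as an assumption so the result does not depend on the local time zone.

```r
today_date <- as.Date("2017/08/30")
typeof(today_date)      # "double": a day count
unclass(today_date)     # days since 1970-01-01

today_date_lt <- as.POSIXlt("2017/08/30", tz = "UTC")
typeof(today_date_lt)   # "list": separate fields
today_date_lt$mday      # day of the month
today_date_lt$mon       # month, counted from 0
today_date_lt$year      # years since 1900

# Date arithmetic works because a Date is a number underneath
today_date + 7
```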
Sidenote 2: Operators?
C’mon, I thought we were doing functions!
Slide17
Reminder: Operators… are functions!
`+`(1, 2)
`%in%`(1:4, c(2,3))
…so are assignment, indexing and (technically) function calls themselves.
Meaning, hey, you can easily define your own binary operators. More often we’ll define our own functions, but important to know: EVERYTHING is a function / object, and can be passed around.
`%add_wrong%` <- function(a, b) {a + b + 1}
a %add_wrong% b
Slide18
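A few more cases of operators behaving as ordinary functions, including passing them as arguments. The operator name %plus_one_more% is hypothetical and only mirrors the %add_wrong% idea.

```r
# Operators called as ordinary functions
`+`(1, 2)
`%in%`(1:4, c(2, 3))

# Indexing is a function too
x <- c(10, 20, 30)
`[`(x, 2)

# Because operators are functions, they can be passed as arguments
Reduce(`+`, 1:4)          # ((1 + 2) + 3) + 4
sapply(1:3, `*`, 10)      # each element times 10

# A user-defined binary operator (hypothetical name)
`%plus_one_more%` <- function(a, b) a + b + 1
2 %plus_one_more% 3
```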
Sidenote 3: Write your own
Often *super* useful
Slide19
We Try: Best way to learn functions is to write our own
my_first_function = function(param, param2=4){
  # function body, using the parameters…
  my_return_val = param + param2 # e.g. add the two parameters
  return(my_return_val) # ^ note: will return last value if left out!
}
my_first_function(1, 2) # calling it like this, or
my_first_function(param2 = 2, param = 1) # this, or
my_first_function(1) # this
R functions are scoped (e.g. variables created inside don’t exist outside) and by default behave as pass-by-value with copy-on-modify (smart: no new copy of what’s passed in is made unless the function changes it).
Let’s write get_older and hello_world
Slide20
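The slides name get_older and hello_world but do not show their bodies, so the versions below are guesses at what such functions might look like. The last part is a small check that modifying an argument inside a function leaves the caller's object alone.

```r
# Hypothetical versions of the two functions named on the slide
hello_world <- function(name = "world") {
  paste0("Hello, ", name, "!")
}

get_older <- function(age, years = 1) {
  age + years          # last value is returned implicitly
}

hello_world()
get_older(30)
get_older(c(30, 41), years = 5)   # vectorized for free

# Changing an argument inside a function does not touch the original
add_one_inside <- function(v) {
  v[1] <- v[1] + 1
  v
}
original <- c(1, 2, 3)
changed  <- add_one_inside(original)
original
```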
We Try: Best way to learn functions is to write our own
write.csv and read.csv
What is iris? (see data() )
Save data.frame iris to iris_lower, and change all the variable names to lower case.
Slide21
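One possible way to do the iris_lower exercise and try write.csv() and read.csv() together. The temporary file path is an assumption so that nothing lands in the working directory.

```r
iris_lower <- iris
names(iris_lower) <- tolower(names(iris_lower))
names(iris_lower)

# Round trip through a CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(iris_lower, csv_path, row.names = FALSE)
iris_back <- read.csv(csv_path)

dim(iris_back)
str(iris_back)
```

Setting row.names = FALSE avoids an extra column of row numbers appearing when the file is read back.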
You try: Function Vocab Injection
Using the iris dataset and Advanced R: Function Vocabulary (http://adv-r.had.co.nz/Vocabulary.html)
Try out as many functions as you can in your group!
(Feel free to split them up and work in groups)
Suggestions: Some of these are actually pretty advanced - consider not diving into EVERY function. Some you might want to skip that are a bit of a rabbit hole…
<<- get assign rle
Slide23
Sidenote 4: classes and functions
You’ll never need to know this until you do.
Slide24
Classes and functions
How do functions like dim() or plot() know how to handle all these things?
Technically, they’re generics: they dispatch to (and effectively mask) methods like dim.data.frame or plot.factor, chosen based on the class() of the object.
Look up help for plot.factor and dim.data.frame
Slide25
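A minimal sketch of class-based dispatch, using a made-up generic called describe rather than any real base function. It follows the same naming pattern (generic.class) as plot.factor and dim.data.frame.

```r
# A hypothetical generic: UseMethod() picks a method from class(x)
describe <- function(x, ...) UseMethod("describe")

describe.default <- function(x, ...) {
  paste("an object of type", typeof(x))
}
describe.factor <- function(x, ...) {
  paste("a factor with", nlevels(x), "levels")
}
describe.data.frame <- function(x, ...) {
  paste("a data.frame with", nrow(x), "rows and", ncol(x), "columns")
}

describe(1:3)            # no integer method, so falls back to default
describe(iris$Species)   # dispatches to describe.factor
describe(iris)           # dispatches to describe.data.frame

# methods() lists the class-specific versions of a real generic
methods("plot")
```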
Super Duper Sub-setting
Review from last time: [], [[]], $ and their many flexibilities
Slide26
Super Duper Sub-setting: Vectors
[] is the atomic subset operator (by location)
(Given R is “vectorized” – like almost all of our data! - think matrix notation)
[[]] (“double brackets”) is the subset into operator (think subset, then look inside that thing). Most commonly used in a named list, like… a data.frame!
Slide27
We Try: Super Duper Sub-setting: Vectors
[] can subset a vector by:
Numeric vectors (negative to drop, repeats, etc.)
Logical vectors
Character vectors (IF you’ve named those elements)
[] can subset a 2+ dimensional object (matrix, array, data.frame) in similar ways…
…but then accepts a few other higher order versions of the above.
Technically, [] used on a vector always returns a vector, right?
Slide28
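Each of the ways [] can subset a vector, plus the two-dimensional form, in one sketch. The named vector v and matrix m are made up for the example.

```r
v <- c(a = 10, b = 20, c = 30, d = 40)

v[c(1, 3)]                        # positions to keep
v[-2]                             # negative: drop position 2
v[c(1, 1, 4)]                     # repeats are allowed
v[c(TRUE, FALSE, TRUE, FALSE)]    # logical mask
v[v > 15]                         # logical vector from a comparison
v[c("b", "d")]                    # by name

# Two dimensions: one index per dimension, [rows, columns]
m <- matrix(1:9, nrow = 3)
m[2, ]                            # second row
m[, c(1, 3)]                      # first and third columns
iris[1:3, c("Sepal.Length", "Species")]
```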
Super Duper Subsetting: Lists
[] returns a smaller list from a list.
But often we want to look inside that list element (e.g. in data frames). So for lists we use the [[]], e.g. iris[["Petal.Length"]].
But that’s a hassle, so x$y is a convenience wrapper for the same operator (equivalent to x[["y", exact=F]]) which we’ll use ALL THE TIME.
* Fancy note: see that exact=F? That means you can do some crazy stuff, like iris$Petal.Len . Really. Don’t do this!
Slide29
Ready for Reality!
Introducing the class dataset
Slide30
Putting it all together
We have basic data types to hold our vectorized, atomic data.
We have a wealth of functions to operate on them, usually on a whole vector (think “column”) at once. We can write our own if we need to.
We have powerful subsetting (see L2 for the full rundown) to select, rewrite, extract and perform other actions on slices of our data.
Slide31
NC Birth Data
The “small” dataset contains (N) columns of (M) rows of data.
Check the documentation for what these values really mean.
Mdif, visits, wksgest, mrace, cores, bfed.
The overall question: does prenatal care reduce preterm birth?
Slide32
You try: Functions
Read in the NC births (small) file, and rename the variables to all lower case.
Explore the dataset as a small group using as many relevant functions as you can from the Advanced R function vocabulary, and report out to the group
Try str(), length(), dim(), typeof(), attributes()
Try head(), tail(), subset()
Slide33
You try: Tour the Dataset
Download and unzip the Births Dataset, then use read.csv() (and maybe setwd()) to import the small version of the dataset: births2012_small.csv
Use these functions to answer the questions below
dim() summary() table() hist() plot()
Use an expression with assignment to make a working copy of the dataset with a simpler name
How many observations, and how many variables are in the (small) births dataset?
What is the average maternal age (mage)? How many mothers have the value 99?
Make a histogram of gestational age (WKSGEST). What is the minimum and maximum (non-99) gestational age?
How many mothers smoked (CIGDUR)?
Make a scatterplot of maternal age versus gestational age.
Slide34
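The dataset tour questions involve a value of 99 that has to be set aside before averaging. The sketch below shows that pattern on a tiny made-up data frame: the values are invented, the lower-case column names are assumed, and treating 99 as a placeholder code is also an assumption to confirm against the documentation.

```r
# Toy stand-in for the births data: made-up values, NOT the real file
births <- data.frame(
  mage    = c(24, 31, 99, 28, 35),
  wksgest = c(39, 40, 37, 99, 41)
)

dim(births)                      # observations, variables
table(births$mage == 99)         # how many rows carry the 99 code
mean(births$mage)                # pulled upward by the 99

# Working copy with 99 recoded to NA, then summarised
b <- births
b$mage[b$mage == 99] <- NA
b$wksgest[b$wksgest == 99] <- NA
mean(b$mage, na.rm = TRUE)
range(b$wksgest, na.rm = TRUE)

# With the real file, the import step would be along the lines of
# births <- read.csv("births2012_small.csv")
```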
You try
You can flexibly program with [] and [[]], but not as flexibly with $, even though almost always we’ll use $. Can you see why?
Using the births data…
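One way to see why [[]] is more flexible than $ is to put a column name in a variable. This sketch uses iris as a stand-in for the births data, and col_name is a made-up variable.

```r
col_name <- "Sepal.Length"     # column chosen at run time

head(iris[[col_name]], 3)      # works: the variable's value is used
head(iris[col_name], 3)        # works: a one-column data.frame
is.null(iris$col_name)         # TRUE: $ looks for a column literally named "col_name"

# So [[ ]] fits inside loops and functions
col_means <- sapply(c("Sepal.Length", "Petal.Length"),
                    function(nm) mean(iris[[nm]]))
col_means
```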