R Programming II EPID 799C


min-jolicoeur
Uploaded On 2019-03-12


Presentation Transcript

Slide1

R Programming II

EPID 799C

Fall 2017

Slide2

Overview

Review of data types and operators

Functions (galore)

Reading & writing files

Exploring with real datasets

A little group work today…

Slide3

Data Structures

Review & deeper: 5 basic data structures, 4 (+) atomic types

Slide4

Data structures

        Homogeneous      Heterogeneous
1D      Atomic vector    List*
2D      Matrix           Data frame
nD      Array

Atoms are: logical, integer, double (aka numeric), and character.

Two rare atomic types (complex and raw; see note 2). Also other aggregate types. We’ll introduce factors and dates soon.

We make these things using functions c(), matrix(), array(), list() and data.frame().

Note: Lists are recursive!

Note 2: Do genetic work? An R package that works with DNA sequences uses raw format (byte code/math) to efficiently store and operate on those ATGCs. It’ll be largely invisible to you, but you can thank bit math for speedy comparison of sequences.

Slide5

We Try

Let’s make some!

Slide6

You try: data types

Create three atomic vectors (length 5) of each of these types: character, integer, logical. Name them whatever you want.

Use “:” shorthand to create a vector of numbers, again, of the same length.

Create a data.frame using those four atomic vectors, and take a look at it by printing it to the console.

Create a 3x3 matrix of (any) numbers, then of logicals. (Hint: the rep() function may be useful for logicals.)

Create a 3x3x3 array (27 elements) of the numbers 1 to 27. May need help on array()…

Create a list that includes another list in it.

Slide7

Answers

# Atomic vectors
a = c(1, 2, 3, 4)
b = 1:4
c = c("one", "two", "three", "four")
d = c(T, F, T, F)

my_df = data.frame(a, b, c, d) # stringsAsFactors = F would keep c from turning into a factor (already the default since R 4.0.0)
str(my_df)

# Matrices and arrays
matrix(1:9, nrow = 3)
array(1:27, dim = c(3, 3, 3))
my_list = list(1, "a", list("mike", "shoes"))

Slide8

Notes on Data Structures

Everything is a vector in R. Atoms are vectors of length = 1.

^ As reviewed Monday: This is crazy important and useful.

Operators therefore expect vectors and know how to operate on entire vectors at once.

Lists are recursive and heterogeneous. They can make up the building blocks of more complex objects.

Data.frames are really… lists of atomic (homogeneous) vectors*. We’ll verify this later.

* Fancy note: technically this means a column could itself be a list (each element is a list) and it’s still a data.frame. Some recent spatial packages take advantage of this (e.g. simple features geometries).

Slide9
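A minimal sketch of the two claims in the notes on data structures, using only base R: operators act on whole vectors at once, and a data.frame is a list of equal-length columns. The object names (x, y, df) are illustrative and not from the slides.

```r
# Operators work element-wise on whole vectors
x <- c(1, 2, 3)
y <- c(10, 20, 30)
vec_sum <- x + y        # adds matching elements
doubled <- x * 2        # the single value 2 is recycled across x

# A single value is just a vector of length 1
single <- 5
is.vector(single)       # TRUE
length(single)          # 1

# A data.frame is a list whose elements are its columns
df <- data.frame(id = 1:3, label = c("a", "b", "c"))
is.list(df)             # TRUE
typeof(df)              # "list"
length(df)              # number of columns, here 2
```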

Type Coercion

Explicit

as.thing(this) will convert this into the other thing.

e.g. as.numeric(c("1", "2", "3"))

as.character(1:3) # as.... Other stuff. Or as(thing, "class")

Implicit

R tries to help when it can:

e.g. sum(T, F, T, F, T)

If you’re going to lose information (get new NAs), R will let you know. Generally arithmetic operators coerce to numbers, logical operators to logicals, etc.

Slide10
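To make the type coercion rules concrete, here is a short base R sketch covering explicit conversion, implicit conversion, and the lossy case that produces NAs. The variable names are illustrative; suppressWarnings() is used only so the example runs quietly (without it R warns "NAs introduced by coercion").

```r
# Explicit coercion with as.*()
nums  <- as.numeric(c("1", "2", "3"))    # character -> double
chars <- as.character(1:3)               # integer -> character

# Implicit coercion: logicals count as 1/0 when summed
n_true <- sum(c(TRUE, FALSE, TRUE, FALSE, TRUE))

# Mixing types in c() coerces everything to the most flexible type
mixed <- c(1, "a", TRUE)
typeof(mixed)                            # "character"

# Lossy coercion: "two" cannot become a number, so it turns into NA
lossy <- suppressWarnings(as.numeric(c("1", "two", "3")))
is.na(lossy)                             # FALSE TRUE FALSE
```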

Functions

data, str, class, summary, head, tail, setwd, getwd, View, plot, dim, nrow/ncol, sd, hist, boxplot, table, typeof, sum, read.csv and write.csv…

Slide11

Our first functions (vocabulary!)

str(), typeof(), length(), attributes(), names(), class()

Slide12

We Try

Let’s explore data types with our functions!

Slide13
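One possible way to run the "explore data types with our functions" exercise, sketched in base R. The small vectors are made up for illustration; iris is the built-in dataset used elsewhere in the deck.

```r
int_vec <- 1:5
dbl_vec <- c(1.5, 2, 3)
named   <- c(first = "a", second = "b")

typeof(int_vec)      # "integer"
typeof(dbl_vec)      # "double"
class(dbl_vec)       # "numeric"
length(named)        # 2
names(named)         # "first" "second"
attributes(named)    # a list holding the names attribute

# The same functions work on a data.frame
str(iris)            # compact overview of columns and types
typeof(iris)         # "list"
class(iris)          # "data.frame"
attr_names <- names(attributes(iris))   # includes names, class and row.names
```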

Sidenote 1: Factors & Dates!

We’ve got the functions to make sense of these

Slide14

Aggregate Data Types: Factors

Now we’re ready: What is a factor?

Let’s find out! Create one:

roles = factor(c("student", "faculty", "staff"))

^ NOTE there was a typo during class. I forgot the c!

Find out: use str(), class(), levels(), attributes(), as.numeric(), typeof() on roles

How are factors different? Why are they here? Also see: ordered()

Stuck with factors? Check out the forcats:: package at http://forcats.tidyverse.org/ and this on factors: http://r4ds.had.co.nz/factors.html

Slide15
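Running the suggested functions on roles shows what a factor is underneath: an integer vector plus a levels attribute. This is a base R sketch of that exploration; level_codes is an illustrative name.

```r
roles <- factor(c("student", "faculty", "staff"))

class(roles)        # "factor"
typeof(roles)       # "integer": the values are stored as integer codes
levels(roles)       # "faculty" "staff" "student" (sorted alphabetically)
attributes(roles)   # the levels and the class

# The codes index into levels(), so "student" is 3, "faculty" 1, "staff" 2
level_codes <- as.numeric(roles)

# as.character() recovers the labels rather than the codes
as.character(roles)
```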

Aggregate Data Types: Dates

What is a date? Two things really…

today_date = as.Date("2017/08/30")

typeof(today_date)

today_date_lt = as.POSIXlt("2017/08/30")

typeof(today_date_lt)

(Try our other functions too!)

But hint: Futzing with dates can be a hassle. We’ll use the lubridate:: package to make that easier, later. For a <2 minute skim, see: http://r4ds.had.co.nz/dates-and-times.html

Slide16
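The "two things" a date can be are visible with the same vocabulary functions: a Date is a number (days since 1970-01-01) with a class, while a POSIXlt is a list of calendar fields. A base R sketch, with next_week as an illustrative name:

```r
today_date <- as.Date("2017/08/30")
typeof(today_date)           # "double"
class(today_date)            # "Date"
unclass(today_date)          # the underlying day count since 1970-01-01
next_week <- today_date + 7  # arithmetic works on the day count

today_date_lt <- as.POSIXlt("2017/08/30")
typeof(today_date_lt)        # "list"
class(today_date_lt)         # "POSIXlt" "POSIXt"
today_date_lt$year + 1900    # years are stored relative to 1900
today_date_lt$mon + 1        # months are stored 0-11
today_date_lt$mday           # day of the month
```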

Sidenote 2: Operators?

C’mon, I thought we were doing functions!

Slide17

Reminder: Operators… are functions!

`+`(1, 2)

`%in%`(1:4, c(2,3))

…so are assignment, indexing and (technically) function calls themselves.

Meaning, hey, you can easily define your own binary operators. More often we’ll define our own functions, but important to know: EVERYTHING is a function / object, and can be passed around.

`%add_wrong%` <- function(a, b) {a + b + 1}

a %add_wrong% b

Slide18
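A runnable version of the operators-are-functions idea, in base R. The `%add_wrong%` operator is the one defined on the slide; wrong_sum and total are illustrative names, and Reduce() is used here only to show an operator being passed around like any other function.

```r
# Calling operators with function syntax
`+`(1, 2)                        # same as 1 + 2
`%in%`(1:4, c(2, 3))             # same as 1:4 %in% c(2, 3)
`[`(c(10, 20, 30), 2)            # indexing is a function too

# Defining a custom binary operator (deliberately off by one)
`%add_wrong%` <- function(a, b) {a + b + 1}
wrong_sum <- 1 %add_wrong% 2

# Passing an operator as an argument to another function
total <- Reduce(`+`, 1:4)
```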

Sidenote 3: Write your own

Often *super* useful

Slide19

We Try: Best way to learn functions is to write our own

my_first_function = function(param, param2 = 4){
  # function body, using the parameters…
  my_return_val = param + param2 # placeholder computation so the example runs
  return(my_return_val) # ^ note: will return last value if left out!
}

my_first_function(1, 2) # calling it like this, or
my_first_function(param2 = 2, param = 1) # this, or
my_first_function(1) # this

R functions are scoped (e.g. variables created inside don’t exist outside) and behave as pass-by-value, implemented as copy-on-modify (smart: R doesn’t create a new copy of what’s passed in unless that copy is changed).

Let’s write get_older and hello_world

Slide20
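The deck names get_older and hello_world but does not show their bodies, so the versions below are only one plausible sketch: the arguments, defaults, and return values are assumptions. The last part illustrates the scoping and copy-on-modify behaviour with a hypothetical helper, bump_first.

```r
# Hypothetical implementations; the slides give only the function names
hello_world <- function(name = "world") {
  greeting <- paste0("Hello, ", name, "!")  # local to the function
  greeting                                  # last value is returned
}

get_older <- function(age, years = 1) {
  age + years
}

hello_world()
hello_world("EPID 799C")
get_older(30)
get_older(years = 5, age = 30)   # named arguments can come in any order

# Modifying an argument inside a function does not touch the caller's copy
bump_first <- function(v) {
  v[1] <- v[1] + 100
  v
}
original <- c(1, 2, 3)
changed  <- bump_first(original)
original                         # still 1 2 3
```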

We Try: Best way to learn functions is to write our own

write.csv and read.csv

What is iris? (see data() )

Save data.frame iris to iris_lower, and change all the variable names to lower case.

Slide21
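One way the iris_lower exercise could go, sketched in base R. Writing to a temporary file via tempfile() is a choice made here so the example leaves nothing behind; in class you would more likely pass a file name in your working directory.

```r
# Copy iris and lower-case its variable names
iris_lower <- iris
names(iris_lower) <- tolower(names(iris_lower))
names(iris_lower)

# Round trip through a CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(iris_lower, csv_path, row.names = FALSE)  # skip the row-number column
iris_back <- read.csv(csv_path)

dim(iris_back)        # same shape as the original
head(iris_back)
unlink(csv_path)      # clean up the temporary file
```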

You try: Function Vocab Injection

Using the iris dataset and Advanced R: Function Vocabulary (http://adv-r.had.co.nz/Vocabulary.html)

Try out as many functions as you can in your group!

(Feel free to split them up and work in groups)

Suggestions: Some of these are actually pretty advanced - consider not diving into EVERY function. Some you might want to skip that are a bit of a rabbit hole…

<<-, get, assign, rle

Slide22

Answers

# Atomic vectors
a = c(1, 2, 3, 4)
b = 1:4
c = c("one", "two", "three", "four")
d = c(T, F, T, F)

my_df = data.frame(a, b, c, d) # stringsAsFactors = F would keep c from turning into a factor (already the default since R 4.0.0)
str(my_df)

# Matrices and arrays
matrix(1:9, nrow = 3)
array(1:27, dim = c(3, 3, 3))
my_list = list(1, "a", list("mike", "shoes"))

Slide23

Sidenote 4: classes and functions

You’ll never need to know this until you do.

Slide24

Classes and functions

How do functions like dim() or plot() know how to handle all these things?

Technically, they’re generics: they call (effectively masking) class-specific functions like dim.data.frame or plot.factor, chosen based on the class() of the object.

Look up help for plot.factor and dim.data.frame

Slide25
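The dispatch mechanism behind generics can be seen by writing a tiny one. The generic describe and its methods below are hypothetical, invented purely to illustrate S3 dispatch on class(); only dim() and dim.data.frame() come from the slide.

```r
# A hypothetical generic: UseMethod() picks a method from the object's class
describe <- function(x, ...) UseMethod("describe")

describe.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows")
}
describe.factor <- function(x, ...) {
  paste("factor with", nlevels(x), "levels")
}
describe.default <- function(x, ...) {
  "something without a special method"
}

describe(iris)            # dispatches to describe.data.frame
describe(iris$Species)    # dispatches to describe.factor
describe(1:3)             # falls back to describe.default

# Built-in generics work the same way: dim() on a data.frame
# ends up in dim.data.frame()
same_dims <- identical(dim(iris), dim.data.frame(iris))
```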

Super Duper Sub-setting

Review from last time: [], [[]], $ and their many flexibilities

Slide26

Super Duper Sub-setting: Vectors

[] is the atomic subset operator (by location)

(Given R is “vectorized” – like almost all of our data! - think matrix notation)

[[]] (“double brackets”) is the “subset into” operator (think subset, then look inside that thing). Most commonly used on a named list, like… a data.frame!

Slide27

We Try: Super Duper Sub-setting: Vectors

[] can subset a vector by:

Numeric vectors (negative to drop, repeats, etc.)

Logical vectors

Character vectors (IF you’ve named those elements)

[] can subset a 2+ dimensional object (matrix, array, data.frame) in similar ways…

…but then accepts a few other higher order versions of the above.

Technically, [] used on a vector always returns a vector, right?

Slide28
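Each of the [] subsetting styles listed for vectors, plus the two-dimensional case, in a short base R sketch. The vector v and matrix m are made up for illustration.

```r
v <- c(a = 10, b = 20, c = 30, d = 40)

v[c(1, 3)]        # numeric: positions 1 and 3
v[-1]             # negative: drop the first element
v[c(1, 1, 2)]     # repeats are allowed
v[v > 15]         # logical: keep elements where the test is TRUE
v[c("a", "d")]    # character: works because the elements are named

# Two dimensions: [rows, columns]
m <- matrix(1:9, nrow = 3)   # filled column by column
m[2, 3]                      # row 2, column 3
m[, 1]                       # all rows of column 1
m[m > 6]                     # a logical matrix also works

iris[1:3, c("Sepal.Length", "Species")]   # same idea on a data.frame
```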

Super Duper Subsetting: Lists

[] returns a smaller list from a list.

But often we want to look inside that list element (e.g. in data frames). So for lists we use [[]], e.g. iris[["Petal.Length"]].

But that’s a hassle, so x$y is a convenience wrapper for the same operator (equivalent to x[["y", exact=F]]) which we’ll use ALL THE TIME.*

* Fancy note: see that exact=F? That means you can do some crazy stuff, like iris$Petal.Len . Really. Don’t do this!

Slide29
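The difference between [], [[]] and $ on a list-like object shows up in what each returns. A base R sketch on iris; the result names are illustrative.

```r
one_col_df <- iris["Petal.Length"]      # [] keeps the container: a 1-column data.frame
col_vector <- iris[["Petal.Length"]]    # [[]] looks inside: a numeric vector

class(one_col_df)                       # "data.frame"
class(col_vector)                       # "numeric"

# $ reaches the same column as [[]]
same_column <- identical(col_vector, iris$Petal.Length)

# Partial matching is what exact = FALSE allows (and why it is risky)
partial <- iris[["Petal.Len", exact = FALSE]]   # finds Petal.Length
strict  <- iris[["Petal.Len"]]                  # exact match only: NULL
```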

Ready for Reality!

Introducing the class dataset

Slide30

Putting it all together

We have basic data types to hold our vectorized, atomic data.

We have a wealth of functions to operate on them, usually on a whole vector (think “column”) at once. We can write our own if we need to.

We have powerful subsetting (see L2 for the full rundown) to select, rewrite, extract and perform other actions on slices of our data.

Slide31

NC Birth Data

The “small” dataset contains (N) columns of (M) rows of data.

Check the documentation for what these values really mean.

Mdif, visits, wksgest, mrace, cores, bfed.

The overall question: does prenatal care reduce preterm birth?

Slide32

You try: Functions

Read in the NC births (small) file, and rename the variables to all lower case.

Explore the dataset as a small group using as many relevant functions as you can from the Advanced R function vocabulary, and report out to the group

Try str(), length(), dim(), typeof(), attributes()

Try head(), tail(), subset()

Slide33

You try: Tour the Dataset

Download and unzip the Births Dataset, then use read.csv() (and maybe setwd()) to import the small version of the dataset: births2012_small.csv

Use these functions to answer the questions below:

dim() summary() table() hist() plot()

Use an expression with assignment to make a working copy of the dataset with a simpler name.

How many observations, and how many variables, are in the (small) births dataset?

What is the average maternal age (mage)? How many mothers have the value 99?

Make a histogram of gestational age (WKSGEST). What are the minimum and maximum (non-99) gestational ages?

How many mothers smoked (CIGDUR)?

Make a scatterplot of maternal age versus gestational age.

Slide34
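The births file is not reproduced in this transcript, so the following is only a sketch of the pattern the tour questions call for, run on a tiny made-up stand-in. The values in toy_births are invented for illustration and are not from the NC data; the column names mirror the slide, and treating 99 as a missing-value code is an assumption to check against the documentation.

```r
# Invented stand-in rows; with the real file you would instead start from
# read.csv() on births2012_small.csv as the slide describes
toy_births <- data.frame(
  MAGE    = c(23, 31, 99, 28, 35),
  WKSGEST = c(39, 40, 37, 99, 41)
)

# Working copy with a simpler name and lower-case variable names
births <- toy_births
names(births) <- tolower(names(births))

dim(births)                         # observations, variables
summary(births)

# Count the 99s before recoding them as missing
n_age_99 <- sum(births$mage == 99)
births$mage[births$mage == 99] <- NA
births$wksgest[births$wksgest == 99] <- NA

mean_age   <- mean(births$mage, na.rm = TRUE)
gest_range <- range(births$wksgest, na.rm = TRUE)   # min and max, ignoring NA
```

On the real data, hist(births$wksgest) and plot(births$mage, births$wksgest) would then give the histogram and scatterplot the questions ask for.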

You try

You can flexibly program with [] and [[]], but not as flexibly with $, even though almost always we’ll use $. Can you see why?

Using the births data…
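On why $ is less flexible for programming: $ takes the name literally, while [[]] evaluates whatever expression you give it, so only [[]] works when the column name is held in a variable. A base R sketch on iris (used here because the births file is not part of this transcript); the variable names are illustrative.

```r
col_name <- "Sepal.Length"

by_brackets <- iris[[col_name]]   # uses the value stored in col_name
by_dollar   <- iris$col_name      # looks for a column literally named "col_name"
is.null(by_dollar)                # TRUE: there is no such column

# Because [[]] accepts a variable, it can be used inside loops and functions
numeric_cols <- names(iris)[sapply(iris, is.numeric)]
col_means <- sapply(numeric_cols, function(nm) mean(iris[[nm]]))
col_means
```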