2019-03-12



## Presentations text content in R Programming II EPID 799C

R Programming II

EPID 799C

Fall 2017

Slide 2: Overview

Review of data types and operators

Functions (galore)

Reading & writing files

Exploring with real datasets

A little group work today…

Slide 3: Data Structures

Review & deeper: 5 basic data structures, 4 (+) atomic types

Slide 4: Data structures

| | Homogeneous | Heterogeneous |
|---|---|---|
| 1D | Atomic vector | List* |
| 2D | Matrix | Data frame |
| ND | Array | |

Atoms are: logical, integer, double (aka numeric), and character.

Two rare atomic types (complex and raw [2]).

Also other aggregate types. We’ll introduce factors and dates soon.

We make these things using functions c(), matrix(), array(), list() and data.frame().

* Note: Lists are recursive!

[2] Do genetic work? An R package that works with DNA sequences uses raw format (byte code/math) to efficiently store and operate on those ATGCs. It’ll be largely invisible to you, but you can thank bit math for speedy comparison of sequences.
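As a quick check of the atomic types listed above, typeof() reports each one. This is only a minimal sketch; the variable names are made up for illustration.

```r
# One value of each atomic type (names are illustrative only)
lgl = TRUE
int = 1L           # the L suffix asks for an integer
dbl = 1            # plain numbers are doubles by default
chr = "a"
cpl = 1i           # complex (rare)
rw  = as.raw(255)  # raw bytes (rare)

atom_types = c(typeof(lgl), typeof(int), typeof(dbl),
               typeof(chr), typeof(cpl), typeof(rw))
print(atom_types)

# Lists are recursive: a list can hold other lists, atomic vectors cannot
nested = list(1, "a", list(TRUE, list("deep")))
print(is.recursive(nested))
print(is.atomic(dbl))
```

Note that class(dbl) reports "numeric" while typeof(dbl) reports "double"; both names refer to the same kind of atom.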

Slide 5: We Try

Let’s make some!

Slide 6: You try: data types

Create three atomic vectors (length 5), one of each of these types: character, integer, logical. Name them whatever you want.

Use “:” shorthand to create a vector of numbers, again, of the same length.

Create a data.frame using those four atomic vectors, and take a look at it by printing it to console.

Create a 3x3 matrix of (any) numbers, then of logicals. (hint: rep() function may be useful for logicals)

Create a 3x3x3 array (27 elements) of the numbers 1 to 27. May need help on array()…

Create a list that includes another list in it.

Slide 7: Answers

# Atomic vectors

a = c(1, 2, 3, 4)

b = 1:4

c = c("one", "two", "three", "four")

d = c(T, F, T, F)

my_df = data.frame(a, b, c, d) # stringsAsFactors = F keeps c from turning into a factor (already the default since R 4.0.0)

str(my_df)

# Matrices and arrays

matrix(1:9, nrow = 3)

array(1:27, dim = c(3, 3, 3))

my_list = list(1, "a", list("mike", "shoes"))

Slide 8: Notes on Data Structures

Everything is a vector in R. Atoms are vectors of length = 1.

^ As reviewed Monday: This is crazy important and useful.

Operators therefore expect vectors and know how to operate on entire vectors at once.

Lists are recursive and heterogeneous. They can make up the building blocks of more complex objects.

Data.frames are really… lists of atomic (homogeneous) vectors*. We’ll verify this later.

* Fancy note: technically this means a column could be a vector of lists (each element is a list) and it’s still a data.frame. Some recent spatial packages take advantage of this (e.g. simple features geometries).
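The notes above can be checked directly in the console. A minimal sketch using the built-in iris data; the other names (heights_cm, df, stuff) are invented for illustration.

```r
# A single number is just a vector of length 1
x = 5
print(length(x))
print(is.vector(x))

# Operators work on whole vectors at once
heights_cm = c(150, 160, 170)
heights_m = heights_cm / 100
print(heights_m)

# A data.frame is a list of equal-length columns
print(typeof(iris))
print(is.list(iris))
print(length(iris))   # the number of columns, not rows

# The "fancy note": a column can itself be a list
df = data.frame(id = 1:2)
df$stuff = list(1:3, "a")
print(is.data.frame(df))
```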

Slide 9: Type Coercion

Explicit

as.thing(this) will convert this into the other thing.

e.g. as.numeric(c("1", "2", "3"))

as.character(1:3) # as.... Other stuff. Or as(thing, "class")

Implicit

R tries to help when it can:

e.g. sum(T, F, T, F, T)

If you’re going to lose information (get new NAs), R will let you know. Generally, arithmetic operators coerce to numbers, logical operators to logicals, etc.
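A short sketch of both kinds of coercion described above. The variable names are illustrative, and suppressWarnings() is used only to keep the output quiet; without it R warns that NAs were introduced by coercion.

```r
# Explicit coercion
nums  = as.numeric(c("1", "2", "3"))
chars = as.character(1:3)

# Implicit coercion: logicals become 0/1 when summed
n_true = sum(TRUE, FALSE, TRUE, FALSE, TRUE)
print(n_true)

# Mixing types in c() coerces everything to the most flexible type
mixed = c(1, "a", TRUE)
print(typeof(mixed))

# Losing information produces NA (plus a warning, silenced here)
lossy = suppressWarnings(as.numeric(c("1", "two", "3")))
print(lossy)
```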

Slide 10: Functions

data, str, class, summary, head, tail, setwd, getwd, View, plot, dim, nrow/ncol, sd, hist, boxplot, table, typeof, sum, read.csv and write.csv…

Slide 11: Our first functions (vocabulary!)

str(), typeof(), length(), attributes(), names(), class()
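To see what each of these vocabulary functions reports, here they are applied to one small named vector. A minimal sketch; the vector scores and its values are made up.

```r
scores = c(ann = 90, bob = 85, cat = 77)  # made-up named vector

str(scores)                # compact summary of the structure
print(typeof(scores))      # storage type
print(class(scores))       # what most functions dispatch on
print(length(scores))
print(names(scores))
print(attributes(scores))  # names are stored as an attribute
```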

Slide 12: We Try

Let’s explore data types with our functions!

Slide 13: Sidenote 1: Factors & Dates!

We’ve got the functions to make sense of these

Slide 14: Aggregate Data Types: Factors

Now we’re ready: What is a factor?

Let’s find out! Create one:

roles = factor(c("student", "faculty", "staff"))

^ NOTE there was a typo during class. I forgot the c!

Find out: use str(), class(), levels(), attributes(), as.numeric(), typeof() on roles

How are factors different? Why are they here? Also see: ordered()

Stuck with factors? Check out the forcats package at http://forcats.tidyverse.org/ and this on factors: http://r4ds.had.co.nz/factors.html
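Running the suggested functions on roles shows what a factor is made of: integer codes plus a levels attribute. A minimal sketch; the level order chosen for ordered() is arbitrary and only for illustration.

```r
roles = factor(c("student", "faculty", "staff"))

print(class(roles))       # "factor"
print(typeof(roles))      # integer codes underneath
print(levels(roles))      # sorted alphabetically by default
print(as.numeric(roles))  # the codes, not the labels
print(attributes(roles))

# ordered() adds a ranking to the levels (this order is an arbitrary choice)
roles_ord = ordered(c("student", "faculty", "staff"),
                    levels = c("student", "staff", "faculty"))
print(roles_ord[1] < roles_ord[2])  # compares by level position
```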

Slide 15: Aggregate Data Types: Dates

What is a date? Two things really…

today_date = as.Date("2017/08/30")

typeof(today_date)

today_date_lt = as.POSIXlt("2017/08/30")

typeof(today_date_lt)

(Try our other functions too!)

But hint: Futzing with dates can be a hassle. We’ll use the lubridate package to make that easier, later. For a <2 minute skim, see: http://r4ds.had.co.nz/dates-and-times.html
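Trying "our other functions" on the two date objects shows the "two things": a Date is a number with a class, while a POSIXlt is a list of date parts. A minimal sketch using only base R.

```r
today_date = as.Date("2017/08/30")
print(typeof(today_date))    # stored as a double
print(class(today_date))
print(unclass(today_date))   # days since 1970-01-01

today_date_lt = as.POSIXlt("2017/08/30")
print(typeof(today_date_lt))       # stored as a list
print(today_date_lt$year + 1900)   # years are counted from 1900
print(today_date_lt$mon + 1)       # months are counted from 0

# Because a Date is a number underneath, arithmetic works
next_week = today_date + 7
print(next_week)
```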

Slide 16: Sidenote 2: Operators?

C’mon, I thought we were doing functions!

Slide 17: Reminder: Operators… are functions!

`+`(1, 2)

`%in%`(1:4, c(2,3))

…so are assignment, indexing and (technically) function calls themselves.

Meaning, hey, you can easily define your own binary operators. More often we’ll define our own functions, but it’s important to know: EVERYTHING is a function / object, and can be passed around.

`%add_wrong%` <- function(a, b) {a + b + 1}

1 %add_wrong% 2 # returns 4
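Because operators are ordinary functions, they can be called by name and handed to other functions. A minimal sketch; `%between%` is a hypothetical operator defined here for illustration, not part of base R.

```r
x = c(10, 20, 30)

second = `[`(x, 2)   # indexing as a function call, same as x[2]
`<-`(y, 5)           # assignment as a function call, same as y <- 5

# Passing an operator to another function
plus_ten = sapply(1:3, `+`, 10)
total    = Reduce(`+`, 1:4)

# A user-defined binary operator (hypothetical name)
`%between%` = function(x, rng) x >= rng[1] & x <= rng[2]
in_range = c(1, 5, 10) %between% c(2, 8)

print(second); print(y); print(plus_ten); print(total); print(in_range)
```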

Slide 18: Sidenote 3: Write your own

Often *super* useful

Slide 19: We Try: The best way to learn functions is to write our own

my_first_function = function(param, param2 = 4){

  # function body, using the parameters…

  my_return_val = param + param2 # for example

  return(my_return_val) # ^ note: will return last value if left out!

}

my_first_function(1, 2) # calling it like this, or

my_first_function(param2 = 2, param = 1) # this, or

my_first_function(1) # this

R functions are scoped (e.g. variables created inside don’t exist outside) and arguments behave as if passed by value. Under the hood R is smart and uses copy-on-modify: it doesn’t create a new copy of what’s passed inside unless the copy is changed.

Let’s write get_older and hello_world
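The slides name get_older and hello_world but don't show their bodies, so the versions below are only one guess at what they might look like. The parameter names, defaults, and the add_first helper are assumptions made for this sketch.

```r
# Hypothetical implementations; the slides give only the function names
hello_world = function(name = "world") {
  paste0("Hello, ", name, "!")
}

get_older = function(age, years = 1) {
  new_age = age + years  # new_age exists only inside the function
  new_age                # the last value is returned automatically
}

print(hello_world())
print(get_older(30))
print(get_older(30, years = 5))
print(exists("new_age"))  # the local variable did not leak out

# Copy-on-modify: changing an argument inside never alters the caller's copy
add_first = function(v) {
  v[1] = v[1] + 100
  v
}
original = c(1, 2, 3)
changed  = add_first(original)
print(original)
print(changed)
```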

Slide 20: We Try: The best way to learn functions is to write our own

write.csv and read.csv

What is iris? (see data() )

Save data.frame iris to iris_lower, and change all the variable names to lower case.
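One way the iris exercise might go, including a round trip through write.csv and read.csv. A minimal sketch; csv_path and iris_back are names chosen here, and a temporary file is used so nothing is written to the working directory.

```r
# Copy iris and lower-case its variable names
iris_lower = iris
names(iris_lower) = tolower(names(iris_lower))
print(names(iris_lower))

# Write to a temporary CSV, then read it back in
csv_path = tempfile(fileext = ".csv")
write.csv(iris_lower, csv_path, row.names = FALSE)
iris_back = read.csv(csv_path)

str(iris_back)
```

After the round trip, species usually comes back as a character column (in R 4.0 and later) rather than a factor, because a CSV file stores only text.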

Slide 21: You try: Function Vocab Injection

Using the iris dataset and Advanced R: Function Vocabulary (http://adv-r.had.co.nz/Vocabulary.html)

Try out as many functions as you can in your group!

(Feel free to split them up and work in groups)

Suggestions: Some of these are actually pretty advanced - consider not diving into EVERY function. Some you might want to skip that are a bit of a rabbit hole…

<<-, get, assign, rle


Slide 23: Sidenote 4: classes and functions

You’ll never need to know this until you do.

Slide 24: Classes and functions

How do functions like dim() or plot() know how to handle all these things?

Technically, they’re generics: they call (and effectively mask) class-specific methods like dim.data.frame or plot.factor, chosen based on the class() of the object.

Look up help for plot.factor and dim.data.frame
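The same dispatch mechanism can be built by hand. In this minimal sketch, describe is a made-up generic (not a base R function) with one method per class, mirroring how plot() finds plot.factor.

```r
# A hypothetical generic: UseMethod() picks a method from class(x)
describe = function(x, ...) UseMethod("describe")

describe.default = function(x, ...) {
  paste("something of type", typeof(x))
}
describe.factor = function(x, ...) {
  paste("a factor with", nlevels(x), "levels")
}
describe.data.frame = function(x, ...) {
  paste("a data.frame with", nrow(x), "rows and", ncol(x), "columns")
}

print(describe(iris))           # dispatches to describe.data.frame
print(describe(iris$Species))   # dispatches to describe.factor
print(describe(1:3))            # no matching class, falls back to default
```

Calling methods(plot) in the console lists the methods that the real plot() generic can dispatch to.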

Slide 25: Super Duper Sub-setting

Review from last time: [], [[]], $ and their many flexibilities

Slide 26: Super Duper Sub-setting: Vectors

[] is the atomic subset operator (by location)

(Given R is “vectorized” – like almost all of our data! - think matrix notation)

[[]] (“double brackets”) is the subset into operator (think subset, then look inside that thing). Most commonly used in a named list, like… a data.frame!

Slide 27: We Try: Super Duper Sub-setting: Vectors

[] can subset a vector by:

Numeric vectors (negative to drop, repeats, etc.)

Logical vectors

Character vectors (IF you’ve named those elements)

[] can subset a 2+ dimensional object (matrix, array, data.frame) in similar ways…

…but then accepts a few other higher order versions of the above.

Technically, [] used on a vector always returns a vector, right?
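Each of the subsetting styles listed above, tried on a small named vector and a matrix. A minimal sketch; the objects and values are invented for illustration.

```r
x = c(a = 10, b = 20, c = 30, d = 40)

by_pos   = x[c(1, 3)]      # numeric: keep positions 1 and 3
dropped  = x[-1]           # negative: drop position 1
repeated = x[c(2, 2)]      # repeats are allowed
by_lgl   = x[x > 15]       # logical: keep where TRUE
by_name  = x[c("a", "d")]  # character: needs named elements

# Two dimensions: one index per dimension, blank means "all"
m = matrix(1:9, nrow = 3)
row2     = m[2, ]               # simplifies to a plain vector
col2_mat = m[, 2, drop = FALSE] # drop = FALSE keeps it a matrix
big      = m[m > 5]             # a logical matrix also works

print(by_pos); print(dropped); print(repeated); print(by_lgl); print(by_name)
print(row2); print(col2_mat); print(big)
```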

Slide 28: Super Duper Subsetting: Lists

[] returns a smaller list from a list.

But often we want to look inside that list element (e.g. in data frames). So for lists we use the [[]], e.g. iris[["Petal.Length"]].

But that’s a hassle, so x$y is a convenience wrapper for the same operator (equivalent to x[["y", exact = F]]) which we’ll use ALL THE TIME.*

* Fancy note: see that exact=F? That means you can do some crazy stuff, like iris$Petal.Len . Really. Don’t do this!
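The difference between [], [[]] and $ is easiest to see by checking what each one returns on iris. A minimal sketch using base R only; the variable names are chosen here for illustration.

```r
as_list_piece = iris["Petal.Length"]    # [] keeps the container: a 1-column data.frame
as_contents   = iris[["Petal.Length"]]  # [[]] looks inside: a numeric vector
via_dollar    = iris$Petal.Length       # $ is shorthand for the same thing

print(class(as_list_piece))
print(class(as_contents))
print(identical(as_contents, via_dollar))

# Partial matching: $ accepts an unambiguous prefix, [[ ]] does not by default
partial_dollar  = iris$Petal.Len
partial_bracket = iris[["Petal.Len"]]
partial_opt_in  = iris[["Petal.Len", exact = FALSE]]

print(is.null(partial_bracket))
```

Partial matching is easy to get wrong when column names share a prefix, which is why the slide says not to rely on it.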

Slide 29: Ready for Reality!

Introducing the class dataset

Slide 30: Putting it all together

We have basic data types to hold our vectorized, atomic data.

We have a wealth of functions to operate on them, usually on a whole vector (think “column”) at once. We can write our own if we need to.

We have powerful subsetting (see L2 for the full rundown) to select, rewrite, extract and perform other actions on slices of our data.

Slide 31: NC Birth Data

The “small” dataset contains (N) columns of (M) rows of data.

Check the documentation for what these values really mean.

Mdif, visits, wksgest, mrace, cores, bfed.

The overall question: does prenatal care reduce preterm birth?

Slide 32: You try: Functions

Read in the NC births (small) file, and rename the variables to all lower case.

Explore the dataset as a small group using as many relevant functions as you can from the Advanced R function vocabulary, and report out to the group

Try str(), length(), dim(), typeof(), attributes()

Try head(), tail(), subset()

Slide 33: You try: Tour the Dataset

Download and unzip the Births Dataset, then use read.csv() (and maybe setwd()) to import the small version of the dataset: births2012_small.csv

Use these functions to answer the questions below:

dim() summary() table() hist() plot()

Use an expression with assignment to make a working copy of the dataset with a simpler name

How many observations, and how many variables are in the (small) births dataset?

What is the average maternal age (mage)? How many mothers have the value 99?

Make a histogram of gestational age (WKSGEST). What is the minimum and maximum (non-99) gestational age?

How many mothers smoked (CIGDUR)?

Make a scatterplot of maternal age versus gestational age.
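The real births2012_small.csv isn't reproduced here, so the sketch below builds a tiny stand-in file with invented values, just to show the shape of the workflow. The six rows, the "Y"/"N"/"U" coding of CIGDUR, and the treatment of 99 as a missing-value code are all assumptions; check the dataset documentation for the real meanings.

```r
# Toy stand-in for births2012_small.csv (all values are invented)
toy = data.frame(MAGE    = c(24, 31, 99, 28, 35, 19),
                 WKSGEST = c(39, 40, 38, 99, 36, 41),
                 CIGDUR  = c("N", "N", "Y", "N", "U", "Y"))
toy_path = tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

# Import, make a working copy with a simpler name, lower-case the names
births = read.csv(toy_path)
names(births) = tolower(names(births))

print(dim(births))               # observations, then variables
n_99 = sum(births$mage == 99)    # how many mothers have the value 99

# Assuming 99 is a missing-value code, recode it before averaging
births$mage[births$mage == 99] = NA
mean_mage = mean(births$mage, na.rm = TRUE)

gest_range = range(births$wksgest[births$wksgest != 99])
smoke_tab  = table(births$cigdur)

print(n_99); print(mean_mage); print(gest_range); print(smoke_tab)
```

On the real data the same pattern applies after pointing read.csv() at the downloaded file, and hist(births$wksgest) and plot(births$mage, births$wksgest) would give the two requested plots.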

Slide 34: You try

You can flexibly program with [] and [[]], but not as flexibly with $, even though almost always we’ll use $. Can you see why?

Using the births data…
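One possible answer to the question above, shown on iris since it ships with R: [[]] evaluates its argument, so a column name can live in a variable, whereas $ takes the name literally. A minimal sketch; col_name and the other variable names are chosen for illustration.

```r
col_name = "Sepal.Length"

via_brackets = iris[[col_name]]  # uses the value stored in col_name
via_dollar   = iris$col_name     # looks for a column literally called "col_name"

print(head(via_brackets))
print(is.null(via_dollar))

# This is what makes [[ ]] useful inside loops and functions
numeric_cols = names(iris)[1:4]
col_means = sapply(numeric_cols, function(nm) mean(iris[[nm]]))
print(col_means)
```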