/
HRP 223 – HRP 223 –

HRP 223 – - PowerPoint Presentation

test
test . @test
Follow
359 views
Uploaded On 2017-06-30

HRP 223 – - PPT Presentation

2008 Day 1 Then vs Now Copyright 19992008 Leland Stanford Junior University All rights reserved Warning This presentation is protected by copyright law and international treaties Unauthorized reproduction of this presentation or any portion of it may result in severe civ ID: 564723

excel data column click data excel click column numeric sas validation table work information character tables run button values

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "HRP 223 –" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

HRP 223 – 2008

Day 1 –Then vs. Now

Copyright ©

1999-2008

Leland Stanford Junior University. All rights reserved.

Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.Slide2

Making Things Easier

Lousy Usability

Not a Bad DesignSlide3

GUI

Play with the pushpins.Slide the edges of the windows around.

Use the View menu to turn Windows on and off.Right click on the button bar and choose Customize…Slide4

Add Run Selection

Click the Commands tab.Click on the Code group.Drag and drop the Run Selection button next to the play button.Slide5

Tweak Some OptionsSlide6

My Typical GUI

Push the project designer button if all you see is gray workspace.Slide7

SAS

It used to be that you needed to memorize lots of arcane syntax to accomplish even the easiest of tasks, like making a toy data set.Slide8

Wizards take

more time than typing but they are easy to use.

They can’t do some complicated tasks.Slide9

Doing Things Differently

Most people who learn SAS think data steps manipulate and subset datasets while procs do statistics.There is one very important procedure for data manipulation, proc

sql.SQL is the programming language for working with databases.Oracle, MS Access, MS SQLserver

, etc. Slide10

The EG

data manipulation tools use SQL.

For simple tasks you can use either language (or point and click) but for complicated tasks, each has advantages which I will eventually show you.

Data Step vs. Proc SQLSlide11

Point and Click

To subset using EG, click on the dataset and then chose Filter and Query… from the Data menu.Then give the query a name, drag the dataset icon to the select data tab, then click the filter tab.Slide12
Slide13
Slide14
Slide15

Why SQL Matters

Since the 1980’s, the dominant model for databases has been “relational.” Relational databases use multiple tables to store data. Each of these tables, which look just like spreadsheet pages, stores related information. The primary goal in laying out a relational database is to remove redundancy and extra variables. This process is called

normalization.Slide16

Normalization

In a normalized database there is always a variable (or set of variables) that can be used to link the tables, but other than that, the same information is not repeated in a table or across tables. This buys you lots of space and speed!Slide17

A Normalized Database

In the Has_CA library:

ID is a key joining the master and cancer tables.

ID and dxDate make a joint key.

ID is a key joining the master and dude tables.

hName is a key relating the treatment and hospital tables.Slide18

What is data?

All pieces of information that you collect and calculate as part of a study are data. Every person’s response to a questionnaire is called a data point.

There are two fundamentally different types of data: character and numeric. Numeric data is always … numeric. Information that you could want to do math on is numeric data.

Character data is alphanumeric. It includes the obvious things like names and addresses, but it also includes

numbers that you should not do math on

.

Some systems make finer distinctions and let you set data so they are forced to be factors.Slide19

What is data coding?

A question such as, “What is your current age in years?” is going to generate numeric data.A question such as, “At what age did you first contract a sexually transmitted disease?” is going to generate numeric data ….

But you are going to need to allow for the possibility that somebody has never contracted a sexually transmitted disease.

… and you always need to allow for people who never knew or do not remember information or who may be dishonest in their answers.Slide20

What is data coding? (2)

When you have a question that generates numeric data and your subject’s response is not a “real number” you can code a bogus value.

“Not applicable” can be coded as age –1000000.“Do not know” can be coded as –2000000.The better way to deal with this problem is to use the value “NULL.”SAS allows you to code 27 different types of NULL.Null values make your job easier when you try to do math on the values.Slide21

Missing Data

You already saw that SAS represents missing character data as a pair of quotes with nothing between them and missing numbers are stored as a decimal place.You can also use .A, .B, etc. to code for missing numbers but you can’t enter them directly into the table viewer.Slide22

What is data coding? (3)

Questions that generate alphanumeric data are always complex compared to numeric data.“Where were you born?” can be coded as a string of letters from a fill-in-the-blank question or coded as letters or numbers from multiple choice format.

Do not use null in fill-in-the-blanks.Slide23

What is a data dictionary?

A data dictionary should provide a list of the following things:The variables’ names

A conceptual description of each variableIf the data is stored in character, numeric or some other formPermitted values

Inner and outer ranges

Codes for missing or impossible values

The source of the information: question and questionnaire number

The dictionary may also include:

The position of the variables on the page

Logic checks that should be appliedSlide24

Paper to Computer

Always budget time and money for blind double entry of your data.The questionnaires themselves are called hard copy. The first electronic copies that your data entry people create are called the source files. Always keep the hard copy for as long as you can.

You will want to go back and check the source. Always write-protect and back up your source files.Slide25

Protecting your Source FilesSlide26

Structure Your Storage

Make a directory on your hard drive called Projects and put subdirectories off that directory that corresponds to all your projects.Having a uniform place to store all your projects makes backups easier.Ask your IT person how to back up your work.

IRT’s preferred solution is Iron Mountain - call 5-8000 for desktop support help.Slide27

A Useful Naming Convention

Store your incremental work using my naming convention. Give a meaningful name to your work and follow it with the date of the modification in the format

YearMonthDay. That way your files will automatically be saved in alphabetical and chronological order. loadData20071020.xlsloadData20071019.xls

loadData20071012.xlsSlide28

Getting Data into a Computer

Excel Tolerable for toy data sets if and only if you force it to use validation

Security and auditing tools are primitive at best. FileMaker Pro Advance or Microsoft AccessDifficult to implement correctly STRIDE and ONCORE (if cancer)Supported by IRTSlide29

If you can afford a database…

You can force hard error and range checking. Do you really want to do this?Print out results, ideally in a different format, immediately after the contact and review them with subjects.

Collecting with paper and then doing electronic entry is time consuming but offers benefits: validation, data availabilityBack up, back up, back up.Incremental, full and offsite!Slide30

Using Excel

If you don’t have the ability to get access to a database and/or database programmer, you can fake it (sort of) with Excel.Think of Excel as a tool of last resort, not a good idea…

Excel 2007 is (in my humble opinion) the first worthwhile update to Excel in 10 or 15 years.Slide31

New TablesSlide32

Structuring Data in Excel

Use column headings with no spaces or punctuation (especially leading or trailing spaces).Make only one record (row) for each observation on a subject.

Do not put extra text in a cell.Put only one type of information in each column.Put only one type of information in each cell.Use a standard unit in each column.Slide33

Excel Bad and Good

One record for each observation, remove extra textSlide34

Use Validation

The data menu has a validation option that lets you do things like date validation and enabling pick lists for Excel cells.

Split this into 2 columns, use only months and force only legal values.Slide35

First Make a Table

Enter the column headings, select them, then on the insert tab of the ribbon, click table. Click the Table button. Click on the checkbox to tell it to use the column headings.Slide36

Note the context-specific Table Tools tab is available when you are inside the table.

Give the sheet and table a meaningful name. Use the Quick Styles button to make the table banded to help data entry.You can use Table Style Options to turn on banded columns.Slide37

Turn on Validation

Click at the top of a column that needs validation, then push Data Validation on the Data tab of the ribbon (or use the Data menu, Validation… option in Excel 2003).Slide38
Slide39

A Validated Column

Add validation for every cell you possibly can. Months of age validation:Slide40

Handy Table Features

Click on the down arrows in the column headings to get access to useful features like subsetting:Slide41

Dates in Excel

Poorly validated dates are a nightmare to fix and cause endless problems in analysis programs.

If you inherit dates typed by someone who was not taught about validation:Check your dates by changing the format.Check your dates by concatenating leading and trailing characters (to find blanks).The formula I use is

= “x” &

click a cell

& “x”Slide42

Notice dates are actually the number of days since 1900 (in Windows).Slide43

Excel Horror Stories

Most statisticians that I know dislike Excel and many HATE it.Be aware that Excel will incorrectly plot values.The engine behind Excel systematically deletes values without telling you.Slide44

3D Bar

Charts Excel 2003

What value is this?Slide45

PowerPoint 2003

PowerPoint… the preferred analysis and visualization tool in the Microsoft Office suite.Slide46

Excel 2007 … awesomeSlide47

Excel and Analysis Software

Use extreme caution with every package that can read in Excel files, even other Microsoft programs.

There is a common, but not commonly known, bug in how Windows processes Excel files. The fundamental issue is how it figures out if a column in Excel has character or numeric data. If it thinks that the column is character data, no problem, but what happens when it thinks a column has numbers, and part way down the column, it has some letters in a cell? Sometimes the cells with the characters are unceremoniously blanked out!

You can end up

with missing data

.Slide48

SAS

RSlide49

The Registry

Deep inside of Windows is a repository of information on all the software on your computer. It is called the registry.

In the registry there is a key that tells applications which are talking to Excel how many rows to check, going down a column, to figure out if a column should be called character or numeric. It is set by default to only look in the first 8 rows!!!!! So if you have character data for the first time in a cell after the first 8 rows, it guesses incorrectly that you have only numeric data in the column and your character cells will be

erased without warning

on import. Slide50

You can fix this.

Make sure to follow these instructions carefully. If you tweak the wrong thing in the registry you can render your machine unable to reboot!

Click the Windows Start menu and choose RunIn the dialog type regedit and click okOpen up the tree to this path

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel

Double click

TypeGuessRows

Type 0, that is zero not the letter o, in the DWORD editor and click ok

Microsoft ACCESS will silently change this setting!

So watch this setting if you use ACCESS.Slide51

Importing into EG

It is easy…Slide52

2nd Try

Tell it the column is character….Slide53

A dozen tries later…

If you are thinking “there is no way I am going to memorize that syntax”, I got ya’ covered.Slide54

Download my editor macros:

www.stanford.edu/class/hrp223/2007/programming/macros2007.kmf

You only need to do this once on each machine you work with.

Push Enter when you see the

autocomplete

tooltip.Slide55

Where Datasets are Saved

By default, Enterprise Guide saves all files into a directory it calls SASUSER. This is an exceptionally bad idea. You are in violation of HIPAA because anyone can see your data.Your data will be an organized disaster.

SAS is too old to know about “folders”; instead it uses “libraries”.Slide56

SASUSER

On every machine you work on tell SAS to stop defaulting to using the SASUSER library.Go to the Tools Menu and choose Options…Slide57

Click on the Output Library “windowpane” then click on WORK then push the up button until WORK is at the top of the list.Slide58
Slide59

Work

Work is a hidden folder that is destroyed when you quit SAS.It is the default place to store data in SAS (but not EG).In code you will see references like:

These are totally identical to SAS and EG.Slide60

Telling EG About Other Folders

If you import into the work library, your SAS dataset will be destroyed when you quit EG. To save them in a permanent folder use the GUI to make a library.Slide61
Slide62

You can code it instead.

If you don’t want to point and click that much, you can just write a single line of code:libname

DAY1 'C:\Projects\classes\HRP223-2008\day1' ;Slide63

A Complete Program

Below is the complete import and store program.Notice the comment starting with * ending with ; .Slide64

Organize it.

Give the nodes rational names, then move the library icon to the top of the Process Flow (flowchart).

You can run the library node by right clicking it.Slide65

Run the entire mess.

You can run the flowchart by right clicking on the flowchart background and choosing Run Process Flow. The stuff is run from upper left to lower right. Keep the libraries at the top.

Related Contents


Next Show more