2008 Day 1 Then vs Now Copyright 19992008 Leland Stanford Junior University All rights reserved Warning This presentation is protected by copyright law and international treaties Unauthorized reproduction of this presentation or any portion of it may result in severe civ ID: 564723
Download Presentation The PPT/PDF document "HRP 223 –" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
HRP 223 – 2008
Day 1 –Then vs. Now
Copyright ©
1999-2008
Leland Stanford Junior University. All rights reserved.
Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.Slide2
Making Things Easier
Lousy Usability
Not a Bad DesignSlide3
GUI
Play with the pushpins.Slide the edges of the windows around.
Use the View menu to turn Windows on and off.Right click on the button bar and choose Customize…Slide4
Add Run Selection
Click the Commands tab.Click on the Code group.Drag and drop the Run Selection button next to the play button.Slide5
Tweak Some OptionsSlide6
My Typical GUI
Push the project designer button if all you see is gray workspace.Slide7
SAS
It used to be that you needed to memorize lots of arcane syntax to accomplish even the easiest of tasks, like making a toy data set.Slide8
Wizards take
more time than typing but they are easy to use.
They can’t do some complicated tasks.Slide9
Doing Things Differently
Most people who learn SAS think data steps manipulate and subset datasets while procs do statistics.There is one very important procedure for data manipulation, proc
sql.SQL is the programming language for working with databases.Oracle, MS Access, MS SQLserver
, etc. Slide10
The EG
data manipulation tools use SQL.
For simple tasks you can use either language (or point and click) but for complicated tasks, each has advantages which I will eventually show you.
Data Step vs. Proc SQLSlide11
Point and Click
To subset using EG, click on the dataset and then chose Filter and Query… from the Data menu.Then give the query a name, drag the dataset icon to the select data tab, then click the filter tab.Slide12Slide13Slide14Slide15
Why SQL Matters
Since the 1980’s, the dominant model for databases has been “relational.” Relational databases use multiple tables to store data. Each of these tables, which look just like spreadsheet pages, stores related information. The primary goal in laying out a relational database is to remove redundancy and extra variables. This process is called
normalization.Slide16
Normalization
In a normalized database there is always a variable (or set of variables) that can be used to link the tables, but other than that, the same information is not repeated in a table or across tables. This buys you lots of space and speed!Slide17
A Normalized Database
In the Has_CA library:
ID is a key joining the master and cancer tables.
ID and dxDate make a joint key.
ID is a key joining the master and dude tables.
hName is a key relating the treatment and hospital tables.Slide18
What is data?
All pieces of information that you collect and calculate as part of a study are data. Every person’s response to a questionnaire is called a data point.
There are two fundamentally different types of data: character and numeric. Numeric data is always … numeric. Information that you could want to do math on is numeric data.
Character data is alphanumeric. It includes the obvious things like names and addresses, but it also includes
numbers that you should not do math on
.
Some systems make finer distinctions and let you set data so they are forced to be factors.Slide19
What is data coding?
A question such as, “What is your current age in years?” is going to generate numeric data.A question such as, “At what age did you first contract a sexually transmitted disease?” is going to generate numeric data ….
But you are going to need to allow for the possibility that somebody has never contracted a sexually transmitted disease.
… and you always need to allow for people who never knew or do not remember information or who may be dishonest in their answers.Slide20
What is data coding? (2)
When you have a question that generates numeric data and your subject’s response is not a “real number” you can code a bogus value.
“Not applicable” can be coded as age –1000000.“Do not know” can be coded as –2000000.The better way to deal with this problem is to use the value “NULL.”SAS allows you to code 27 different types of NULL.Null values make your job easier when you try to do math on the values.Slide21
Missing Data
You already saw that SAS represents missing character data as a pair of quotes with nothing between them and missing numbers are stored as a decimal place.You can also use .A, .B, etc. to code for missing numbers but you can’t enter them directly into the table viewer.Slide22
What is data coding? (3)
Questions that generate alphanumeric data are always complex compared to numeric data.“Where were you born?” can be coded as a string of letters from a fill-in-the-blank question or coded as letters or numbers from multiple choice format.
Do not use null in fill-in-the-blanks.Slide23
What is a data dictionary?
A data dictionary should provide a list of the following things:The variables’ names
A conceptual description of each variableIf the data is stored in character, numeric or some other formPermitted values
Inner and outer ranges
Codes for missing or impossible values
The source of the information: question and questionnaire number
The dictionary may also include:
The position of the variables on the page
Logic checks that should be appliedSlide24
Paper to Computer
Always budget time and money for blind double entry of your data.The questionnaires themselves are called hard copy. The first electronic copies that your data entry people create are called the source files. Always keep the hard copy for as long as you can.
You will want to go back and check the source. Always write-protect and back up your source files.Slide25
Protecting your Source FilesSlide26
Structure Your Storage
Make a directory on your hard drive called Projects and put subdirectories off that directory that corresponds to all your projects.Having a uniform place to store all your projects makes backups easier.Ask your IT person how to back up your work.
IRT’s preferred solution is Iron Mountain - call 5-8000 for desktop support help.Slide27
A Useful Naming Convention
Store your incremental work using my naming convention. Give a meaningful name to your work and follow it with the date of the modification in the format
YearMonthDay. That way your files will automatically be saved in alphabetical and chronological order. loadData20071020.xlsloadData20071019.xls
loadData20071012.xlsSlide28
Getting Data into a Computer
Excel Tolerable for toy data sets if and only if you force it to use validation
Security and auditing tools are primitive at best. FileMaker Pro Advance or Microsoft AccessDifficult to implement correctly STRIDE and ONCORE (if cancer)Supported by IRTSlide29
If you can afford a database…
You can force hard error and range checking. Do you really want to do this?Print out results, ideally in a different format, immediately after the contact and review them with subjects.
Collecting with paper and then doing electronic entry is time consuming but offers benefits: validation, data availabilityBack up, back up, back up.Incremental, full and offsite!Slide30
Using Excel
If you don’t have the ability to get access to a database and/or database programmer, you can fake it (sort of) with Excel.Think of Excel as a tool of last resort, not a good idea…
Excel 2007 is (in my humble opinion) the first worthwhile update to Excel in 10 or 15 years.Slide31
New TablesSlide32
Structuring Data in Excel
Use column headings with no spaces or punctuation (especially leading or trailing spaces).Make only one record (row) for each observation on a subject.
Do not put extra text in a cell.Put only one type of information in each column.Put only one type of information in each cell.Use a standard unit in each column.Slide33
Excel Bad and Good
One record for each observation, remove extra textSlide34
Use Validation
The data menu has a validation option that lets you do things like date validation and enabling pick lists for Excel cells.
Split this into 2 columns, use only months and force only legal values.Slide35
First Make a Table
Enter the column headings, select them, then on the insert tab of the ribbon, click table. Click the Table button. Click on the checkbox to tell it to use the column headings.Slide36
Note the context-specific Table Tools tab is available when you are inside the table.
Give the sheet and table a meaningful name. Use the Quick Styles button to make the table banded to help data entry.You can use Table Style Options to turn on banded columns.Slide37
Turn on Validation
Click at the top of a column that needs validation, then push Data Validation on the Data tab of the ribbon (or use the Data menu, Validation… option in Excel 2003).Slide38Slide39
A Validated Column
Add validation for every cell you possibly can. Months of age validation:Slide40
Handy Table Features
Click on the down arrows in the column headings to get access to useful features like subsetting:Slide41
Dates in Excel
Poorly validated dates are a nightmare to fix and cause endless problems in analysis programs.
If you inherit dates typed by someone who was not taught about validation:Check your dates by changing the format.Check your dates by concatenating leading and trailing characters (to find blanks).The formula I use is
= “x” &
click a cell
& “x”Slide42
Notice dates are actually the number of days since 1900 (in Windows).Slide43
Excel Horror Stories
Most statisticians that I know dislike Excel and many HATE it.Be aware that Excel will incorrectly plot values.The engine behind Excel systematically deletes values without telling you.Slide44
3D Bar
Charts Excel 2003
What value is this?Slide45
PowerPoint 2003
PowerPoint… the preferred analysis and visualization tool in the Microsoft Office suite.Slide46
Excel 2007 … awesomeSlide47
Excel and Analysis Software
Use extreme caution with every package that can read in Excel files, even other Microsoft programs.
There is a common, but not commonly known, bug in how Windows processes Excel files. The fundamental issue is how it figures out if a column in Excel has character or numeric data. If it thinks that the column is character data, no problem, but what happens when it thinks a column has numbers, and part way down the column, it has some letters in a cell? Sometimes the cells with the characters are unceremoniously blanked out!
You can end up
with missing data
.Slide48
SAS
RSlide49
The Registry
Deep inside of Windows is a repository of information on all the software on your computer. It is called the registry.
In the registry there is a key that tells applications which are talking to Excel how many rows to check, going down a column, to figure out if a column should be called character or numeric. It is set by default to only look in the first 8 rows!!!!! So if you have character data for the first time in a cell after the first 8 rows, it guesses incorrectly that you have only numeric data in the column and your character cells will be
erased without warning
on import. Slide50
You can fix this.
Make sure to follow these instructions carefully. If you tweak the wrong thing in the registry you can render your machine unable to reboot!
Click the Windows Start menu and choose RunIn the dialog type regedit and click okOpen up the tree to this path
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel
Double click
TypeGuessRows
Type 0, that is zero not the letter o, in the DWORD editor and click ok
Microsoft ACCESS will silently change this setting!
So watch this setting if you use ACCESS.Slide51
Importing into EG
It is easy…Slide52
2nd Try
Tell it the column is character….Slide53
A dozen tries later…
If you are thinking “there is no way I am going to memorize that syntax”, I got ya’ covered.Slide54
Download my editor macros:
www.stanford.edu/class/hrp223/2007/programming/macros2007.kmf
You only need to do this once on each machine you work with.
Push Enter when you see the
autocomplete
tooltip.Slide55
Where Datasets are Saved
By default, Enterprise Guide saves all files into a directory it calls SASUSER. This is an exceptionally bad idea. You are in violation of HIPAA because anyone can see your data.Your data will be an organized disaster.
SAS is too old to know about “folders”; instead it uses “libraries”.Slide56
SASUSER
On every machine you work on tell SAS to stop defaulting to using the SASUSER library.Go to the Tools Menu and choose Options…Slide57
Click on the Output Library “windowpane” then click on WORK then push the up button until WORK is at the top of the list.Slide58Slide59
Work
Work is a hidden folder that is destroyed when you quit SAS.It is the default place to store data in SAS (but not EG).In code you will see references like:
These are totally identical to SAS and EG.Slide60
Telling EG About Other Folders
If you import into the work library, your SAS dataset will be destroyed when you quit EG. To save them in a permanent folder use the GUI to make a library.Slide61Slide62
You can code it instead.
If you don’t want to point and click that much, you can just write a single line of code:libname
DAY1 'C:\Projects\classes\HRP223-2008\day1' ;Slide63
A Complete Program
Below is the complete import and store program.Notice the comment starting with * ending with ; .Slide64
Organize it.
Give the nodes rational names, then move the library icon to the top of the Process Flow (flowchart).
You can run the library node by right clicking it.Slide65
Run the entire mess.
You can run the flowchart by right clicking on the flowchart background and choosing Run Process Flow. The stuff is run from upper left to lower right. Keep the libraries at the top.