Dealing With Dirty Data Presented by John R Woodard MS GISP EA Kelly Wright MS Justin Walters WHAT TO EXPECT Todays workshop is about dealing with bad datainformation Three presenters ID: 683755
Download Presentation The PPT/PDF document "The Nuts & Bolts of GIS:" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
The Nuts & Bolts of GIS:Dealing With “Dirty” Data!
Presented by:
John R. Woodard MS, GISP, EA
Kelly Wright MS
Justin WaltersSlide2
WHAT TO EXPECT!
Today’s workshop is about dealing with bad data/information.
Three presenters
Each presenter will cover one section of the workshop (30 mins each)
Workshop Introduction & bringing the dirt! – John Woodard
Working with MS Excel to clean out the dirt! – Kelly Wright
Using the Python to squeeze out the dirt! – Justin Walters
Q&A & Wrap up!
Each presenter will give a short introduction about themselves
There will be a 10 min break after topic 2
Goals:
Everyone will learn about data errors and how to avoid them.
Everyone will learn about one or two tools that can help you find, fix, or avoid errors.
Provide references that can help you.Slide3
Who is John R. Woodard MS, GISP, EA?
PROFESSIONAL:
Currently, GIS/IT Project Developer for Davey Resource Group, Kent Ohio
16 years experience of developing GIS systems and data problem solving.
10 years U.S. Navy Reserves Seabees (Construction Battalion)
GISP (Geographic Information System Professional)Certification
EA (Enterprise Architecture) Certification
EDUCATION:
Masters in Digital Science at Kent State University.
MS in Geography/GIS from the University of Akron.
PERSONAL:
Scuba Diver &
Beekeeper (Yes, I deal with over 400,000 bees and sell honey!)
ArtistSlide4
WHY ARE WE HERE?
DIRT? – On who or what?
TO TALK ABOUT DIRT!
DIRT in my data! HELP!Slide5
What is “dirt” in data/information?
Dirty Data
is simply a catch all term that can be used to describe errors in data or information
.
Can anyone give me an example of “dirt” in a dataset?Slide6
Are these dirt?
Typos
Misspellings
Incomplete data
Inaccurate data
Duplicate data
Incorrect data
Not following principles
Not following formats
Data gathered on Tues
Inconsistent Data
Data type
Data ageSlide7
The Best Defense, is a Good Offense! - The Art of War, Sun Tzu
Let’s be honest, ensuring the cleanliness of your dataset is a constant struggle – you are literally waging an unending war……
Being Proactive and taking the initiative with strategies to prevent data errors is your best offense.
The field of Enterprise Architecture and its’ sub field of Data Architecture can provide help.
Enterprise Architecture works to establish and organize all of a company’s resources into a strategic plan that supports the organization’s mission.
Data Architecture is about organizing data within the enterprise architecture.Slide8
What is Data Architecture?
A strategic data plan that can fit within the Enterprise Architecture or stand alone (small system) that is:
1. business driven
2. based on real world objects and events
3. implemented into the proper environments.
The data architecture is created with models, policies, standards that govern how data is collected, created, stored and used within an information system.Slide9
The Open Group Architecture Framework (TOGAF) Architecture Principle Template
Name
Should both represent the essence of the rule as well as be easy to remember. Specific technology platforms should not be mentioned in the name or statement of a principle. Avoid ambiguous words in the Name and in the Statement such as: "support", "open", "consider", and for lack of good measure the word "avoid", itself, be careful with "manage(ment)", and look for unnecessary adjectives and adverbs (fluff).
Statement
Should succinctly and unambiguously communicate the fundamental rule. For the most part, the principles statements for managing information are similar from one organization to the next. It is vital that the principles statement be unambiguous.
Rationale
Should highlight the business benefits of adhering to the principle, using business terminology. Point to the similarity of information and technology principles to the principles governing business operations. Also describe the relationship to other principles, and the intentions regarding a balanced interpretation. Describe situations where one principle would be given precedence or carry more weight than another for making a decision.
Implications
Should highlight the requirements, both for the business and IT, for carrying out the principle - in terms of resources, costs, and activities/tasks. It will often be apparent that current systems, standards, or practices would be incongruent with the principle upon adoption. The impact to the business and consequences of adopting a principle should be clearly stated. The reader should readily discern the answer to: "How does this affect me?" It is important not to oversimplify, trivialize, or judge the merit of the impact. Some of the implications will be identified as potential impacts only, and may be speculative rather than fully analyzed.-The Open Group Architecture Framework (TOGAF)http://pubs.opengroup.org/architecture/togaf8-doc/arch/chap29.html#tagtcjh_2Slide10
Example principle of data architecture
Name
Data is an Asset
Statement
Data is an asset that has value to the enterprise and is managed accordingly.
Rationale
Data is a valuable corporate resource; it has real, measurable value. In simple terms, the purpose of data is to aid decision-making. Accurate, timely data is critical to accurate, timely decisions. Most corporate assets are carefully managed, and data is no exception. Data is the foundation of our decision-making, so we must also carefully manage data to ensure that we know where it is, can rely upon its accuracy, and can obtain it when and where we need it.
Implications
This is one of three closely-related principles regarding data: data is an asset; data is shared; and data is easily accessible. The implication is that there is an education task to ensure that all organizations within the enterprise understand the relationship between value of data, sharing of data, and accessibility to data.Stewards must have the authority and means to manage the data for which they are accountable. We must make the cultural transition from "data ownership" thinking to "data stewardship" thinking. The role of data steward is critical because obsolete, incorrect, or inconsistent data could be passed to enterprise personnel and adversely affect decisions across the enterprise.Part of the role of data steward, who manages the data, is to ensure data quality. Procedures must be developed and used to prevent and correct errors in the information and to improve those processes that produce flawed information. Data quality will need to be measured and steps taken to improve data quality - it is probable that policy and procedures will need to be developed for this as well.A forum with comprehensive enterprise-wide representation should decide on process changes suggested by the steward.Since data is an asset of value to the entire enterprise, data stewards accountable for properly managing the data must be assigned at the enterprise level.
The Open Group Architecture Framework (TOGAF)
http://pubs.opengroup.org/architecture/togaf8-doc/arch/chap29.html#tagtcjh_2Slide11
Data Standards
These are documented agreements (MOA’s) between departments, organizations, people, etc. that cover how data will be represented, formatted, defined, structured, transmitted, used, or managed.
Example:
Standard For Data Format (Streets)
All department’s agree to use the following format for street types
Fo
rmat
All Caps
Based on USPS Suffix AbbreviationsTwo, three, or four LettersList ST = Street RD = Road FLD = Field FRD = Ford CRK = Creek CT = Court Slide12
Standards can be maintained (enforced) in Geo-database domains.Slide13
Geo-database domains
How many of you use the domains in a file geo-database?
Have you planned their relationships to data standards or formats?
Domains are very useful for enforcing data formatting standards!
They prevent “personal” preferences from creating formatting errors.
They keep data entries consistent.
They prevent typos…..
Can you list a few more items?Slide14Slide15
Early detection is also another method for preventing “dirty data” from corrupting you datasets. We often use MS Excel “flat files” for importing and exporting data/information between systems.
Ms. Kelly Wright will cover how to use MS Excel to detect and eliminate errors in your data.Slide16
Cleaning Dirty Data Part 2: Cleaning it up with ExcelSlide17
IntroductionWhy clean the data?Before you Clean, PlanWhy Excel?Pitfalls of Geodata
(Windows) Keyboard Shortcuts
Tools and FormulasSlide18
Who is Kelly Wright, MS?
PROFESSIONAL:
Currently, GIS Applications Specialist at LJB Inc. consulting of Miamisburg, Ohio
Previously, GIS Graduate Intern at ODOT, Data Analyst for Time Warner Cable
4 years experience in GIS, 7 years experience in Data Analysis
EDUCATION:
M.S. in Geographic Information Science and Technology from the University of Southern California (Go Trojans!)
B.A. in Geography from the University of Texas at Austin (Hook ‘
em Horns!)PERSONAL:No Matter What: Born a Buckeye, Always a BuckeyeSlide19
Why Clean the Data?Data Integrity Looks Professional
Costs U.S. businesses $600 billion per year
Legal Liability
Fraud
False Conclusions and Misdirected InvestmentsSlide20
Why Excel?UbiquitousReliableSupport and Resources
InteroperabilitySlide21
Watch Out For:Hyphenated wordsState/Province abbreviations
ZIP and ZIP+5
Case-sensitivity
Variations
“
Road” v. “Rd” v. “
rd
” v. “RD”
Extra SpacesNumbers as TextCharacters as Text Slide22
Before You Clean, PlanUnderstand the data before you change it
MAKE A BACKUP
Record all data transformations
in the order that you do them
Make sure all columns and rows are visible
Remove Filters
Remove Formatting
Manipulate Columns First
Always do the work in a new columnUse Keyboard ShortcutsSlide23
Keyboard Shortcuts 1
Ctrl+A
Select All
Ctrl+C
Copy
Ctrl+V
Paste
Ctrl+Alt+V
Paste Special
Ctrl+X
Cut
Ctrl+N
New Workbook
Ctrl+S
Save
Ctrl+R
Fill Right
Ctrl+D
Fill Down
Ctrl+Z
Undo
Ctrl+Y
Redo
Ctrl+F
Find
Ctrl+H
Replace
Ctrl+E
Flash FillSlide24
Keyboard Shortcuts 2
F2
Edit the active cell and move the cursor to the end of the text
F7
Check spelling in the active worksheet or selected range
Home
Moves to the beginning of a row in a worksheet
Ctrl+Home
Moves to the beginning of a worksheet (cell A1)
Ctrl+End
Moves to the end of a worksheet/Moves the cursor to the end of the text when in the formula bar
Page Down
Moves one screen down in a worksheet
Alt+PgDn
Moves one screen to the right in a worksheet
Alt+PgUp
Moves one screen to the left in a worksheet
Ctrl+1
Open the Format Cells window
Ctrl+F6
Switch between open workbooks or windowsSlide25
Keyboard Shortcuts 3
Ctrl+PgUp
Move between work sheets in the same document
Ctrl+PgDn
Move between work sheets in the same document
Ctrl+Tab
Move between Two or more open Excel files
Ctrl+Shift+=
Insert a new column or row
Ctrl+Space
Select an entire column in a worksheet
Shift+Space
Select an entire row in a worksheet
Ctrl+Shift+Space
Select all cells containing data (also Ctrl+A)
Ctrl+Shift+End
Select all text in the formula bar from the cursor position to the end
Ctrl+Shift+Arrow
Extend the selection of cells to the last nonblank cell in the same column or row as the active cell, or if the next cell is blank, to the next nonblank cell.Slide26
Keyboard Shortcuts 4
Shift+F3
Insert a function (puts the cursor in the formula bar)
Ctrl+Shift+(
Unhides any hidden rows within the selection
Ctrl+Shift+)
Unhides any hidden columns within the selection.
Ctrl+Shift+~
Applies the General number format.
Ctrl+`
Alternates between displaying cell values and displaying formulas.Slide27
Sort and FilterSortDoesn’t change the values at all, just puts them in order
Alphabetical/Numeric
Ascending/Descending
Can reveal problems caused by extra spaces, misspelling
Quick way to see how Excel “buckets” the data
Filter
See an immediate list of unique values
Determine what to Find and Replace
Identify situations like extra spaces or misspellingBEWARE: don’t copy and paste when filters are appliedData permutations may not apply to cells hidden when filtered
Home Tab > Sort & FilterSlide28
Find and Replace ToolMost important tool for cleaning dirty data!Remove spaces and nonprinting characters
Standardize spelling and case
Go To Special – F5
Blanks – Selects all blank cells in a highlighted dataset
Conditional Formatting – Selects all cells with a certain formatting, like color
Formulas – Selects cells filled by formula
Enter desired text in one cell and use
Ctrl+Enter
to fill all of the blank cells
Home Tab > Find & Select
Ctrl+F
Find
Ctrl+H
ReplaceSlide29
Find/Remove Duplicates 1Highlight Duplicate Data
Data Tab > Conditional Formatting > Highlight Cells Rules > Duplicate Values
Remove Duplicates
Data Tab > Data Tools > Remove Duplicates
Permanently removes duplicates – copy original range of cells to a new sheet or workbook
Determined by value (if date types do not match, they are not considered duplicate)
Excellent for generating a domainSlide30
Find/Remove Duplicates 2Filter for Unique Values
Data Tab > Sort & Filter > Advanced
Temporarily Hides Duplicates
Shortcut: Open Filter Tool to see all unique values
Conditionally Format Unique or Duplicate Values
Home Tab > Conditional Formatting
Very Powerful
Use resulting formatting to filterSlide31
“Fill Down” and “Flash Fill”Replicates values or formula to continue series downward in a column
Click and drag to show context menu
Double-click to apply to next adjacent blank
Paste values or continues a series
Flash Fill (2016 only)
Automatically fills data when a pattern is sensed
Home Tab > Fill
Data Tab > Flash Fill (2016)
Ctrl+D
Fill Down
Ctrl+E
Flash FillSlide32
FormulasType an equal sign “=“ to begin
Excel suggests formulas as you type
Formula Tab >
Show Formulas
Evaluate Formulas
Formula Library
Error CheckingSlide33
Case: UPPER, LOWER, and PROPERThese functions make data entries more uniform and ensures that Excel “buckets” these records together.
Columbus,
columbus
,
COlumbus
, and COLUMBUS
Excel (and ArcMap by extension) treats these as unique values
Causes problems with SQL queries
=LOWER – convert all characters to lower case=UPPER – convert all characters to upper case=PROPER – convert to camel case/sentence caseSlide34
Spaces and nonprinting charactersFind and Replace Tool=CODE
– returns the Unicode character set values for the first character in a text string
0 – 31, 32, 127, 129, 141, 143, 144, 157 and 160 (160 is non-breaking space
nbsp
) are nonprinting characters that may affect the behavior of filter, sort, and search
=CLEAN
– removes Unicode values 0 – 31 from the string
=TRIM
– removes Unicode value 32 (space) from the stringRemoves leading, trailing, and extra (duplicate) spaces from a text stringSlide35
Fixing Numbers Stored as TextHighlight range of cells and use Error Checking context menu
Apply “General” Formatting to entire dataset
Unhides issues with incorrect data types
Paste Special
and Multiply
Good for multiple columns at once
Put the number 1 into a blank cell that is not storing a number as text (i.e. a new column). “Copy” the cell (CTRL+C)
Select the group of cells that are incorrectly storing numbers as text
Home Tab > Paste > Paste SpecialClick Multiply > OKThis multiplies each cell by 1, returning a number as the resultSlide36
Fixing Dates and TimesSort column to identify common error types2-digit v. 4-digit years
DD/MM/YY v. MM/DD/YY
Date and time v. date only
24-hr clock v. 12-hr clock
A v. a.m., P v. p.m.
Highlight range of cells and use Error Checking context menu
Find & replace “.” or “-“ or “/” to standardize if the delimiting character is the only thing wrongSlide37
Parsing: Text to Columns ToolSplit Text Strings into more than one column
Use Delimiters or Fixed Width
Set the Data Format
General
Text
Date
Not dynamic
Split names, addresses, extract ZIP codes or cities, select first # of characters from a string
Data Tab > Text To Columns
Alt + A + ESlide38
Parsing: Functions to Split ColumnsThese functions use the position of each character within the string
=LEFT
– returns the first character/s
=LEFT([cell #], [# of chars])
=MID
– returns a specific number of characters from a string, starting at the position you specify, based on the number of characters you specify
=MID([cell #], [start position], [# of chars])
=RIGHT
– returns the last character/s=SEARCH – returns the position of the sub-string you are searching for=LEN – returns the number of charactersSlide39
Combining: CONCATENATE and “&”Combine the contents of two or more cells
=CONCATENATE
=CONCATENATE(
[Cell1]
,
[Cell2]
)
Add text and characters before, between, and/or after
=CONCATENATE([Cell1],”[chars or text]”,[Cell2])=CONCAT replaces this function in Excel 2016 and laterUsing the Ampersand & to combine text values=[Cell1]&[Cell2]=[Cell1]&””&[Cell2] to include a space (doesn’t need commas like CONCAT)Slide40
10 MIN BREAKSlide41
Justin Walters – Geospatial Specialist Davey Resource GroupProfessional:
Current: GIS project manager for Utility Vegetation Management since 2014
5 years experience Garmin International Inc., Cartography project lead, recreation team.
Education:
BA: Geographic Information Systems, Kent State University
BA: Environmental Geography, Kent State University
Minor: Mathematics, Physics, Climatology, Kent State UniversitySlide42
Cleaning Data Through ArcMap Automations
Method 1: Model Builder
Benefits: Easy, Intuitive
Limitations: Complete customization with complexities.
Method 2: Arcpy
Benefits: Highly customizable, higher functionality
Limitations: Understanding the languageSlide43
Model Builder AutomationsThe Problem
: The client wants a complete record of features in geodatabase format. Additionally a direct link to the data storage website is requested.
The Solution
: Create a tool that populates this field for a complete geodatabase. Slide44
ArcPy Automations - BeginnerThe Problem
: When utilizing tools to convert shapefiles into a geodatabase, inaccurate geometries often occur.
The Solution
: Repair the geometries prior to data analysis. Slide45
ArcPy Automations - IntermediateThe Problem
: Received parcels have the entire address in one field and they need separated into two
The Solution
: Splitting at the first space.Slide46
ArcPy Automations - Advanced
The Problem
: Customers that refuse work need to be highlighted and reported separate from fully compliant customers.
The Solution
: Automate the cross referencing
of work layers to parcels, applying
permissions from work areas to select
parcels.Slide47
Book: Data Strategy, by Sid Adelman, Larissa Moss, Majid Abai,Book: Enterprise Architecture As Strategy: Creating a Foundation for Business Execution, by Jeanne W. Ross and Peter Will
Web Article: Data Cleaning -
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1198040/
Web Article: Big Data’s dirty problem by Verne Kopytoff -
http://fortune.com/2014/06/30/big-data-dirty-problem/
Web Article: Data format -
https://en.wikipedia.org/wiki/Data_format
Website: 10 Data Principles (CDC) -
https://www.cdc.gov/nchhstp/programintegration/tenguidingprinciples.htmWebsite: Data Standards (USGS)- https://www2.usgs.gov/datamanagement/plan/datastandards.phpWebsite: -The Open Group Architecture Framework (TOGAF) http://pubs.opengroup.org/architecture/togaf8-doc/arch/chap29.html#tagtcjh_2REFERENCES:Slide48
REFERENCES:
Book: Data Strategy, by Sid Adelman, Larissa Moss, Majid Abai,
Book: Enterprise Architecture As Strategy: Creating a Foundation for Business Execution, by Jeanne W. Ross and Peter Will
Web Article: Data Cleaning -
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1198040/
Web Article: Big Data’s dirty problem by Verne Kopytoff -
http://fortune.com/2014/06/30/big-data-dirty-problem/
Web Article: Data format -
https://en.wikipedia.org/wiki/Data_formatWebsite: 10 Data Principles (CDC) - https://www.cdc.gov/nchhstp/programintegration/tenguidingprinciples.htmWebsite: Data Standards (USGS)- https://www2.usgs.gov/datamanagement/plan/datastandards.phpWebsite: -The Open Group Architecture Framework (TOGAF) http://pubs.opengroup.org/architecture/togaf8-doc/arch/chap29.html#tagtcjh_2Slide49
Web Article: Microsoft Support: Top Ten Ways to Clean Your Data - https://support.office.com/en-us/article/Top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19
Web Article: 10 Super Neat Ways to Clean Data in Excel Spreadsheets -
10 Super Neat Ways to Clean Data in Excel Spreadsheets
Related YouTube Video:
https://www.youtube.com/watch?v=e0TfIbZXPeA
Web Article: 5 Tips for Cleaning Your Dirty Data -
http://www.msn.com/?ocid=wispr&pc=u477https://esj.com/articles/2012/05/22/5-tips-cleaning-data.aspx
Web Article: How to Clean Data in Excel: 5 Simple Tricks - https://www.livestories.com/blog/excel-tutorial-5-easy-ways-you-can-clean-data-in-excel Web Article: Get Only the Numeric or the Text Part from a String Using Excel Formulas - https://trumpexcel.com/get-numeric-text-part-of-string/ Web Article: 7 Amazing Things Excel Text-t0-Columns Can Do For You - https://trumpexcel.com/excel-text-to-columns-examples/ REFERENCES: Slide50
Web Article and Video: Microsoft Support - Auto Fill and Flash Fill - https://support.office.com/en-us/article/Use-AutoFill-and-Flash-Fill-2e79a709-c814-4b27-8bc2-c4dc84d49464
Web Article: Dirty Data Got You Down? Clean It Up -
https://www.govtechworks.com/dirty-data-got-you-down-clean-it-up/#gs.23JDD7E
Wikipedia: Data Cleansing -
https://en.wikipedia.org/wiki/Data_cleansing
Web Article: Excel keyboard shortcuts and function keys for Windows -
https://support.office.com/en-us/article/Excel-keyboard-shortcuts-and-function-keys-for-Windows-1798d9d5-842a-42b8-9c99-9b7213f0040f Web Article: Excel keyboard shortcuts - http://www.asap-utilities.com/excel-tips-shortcuts.php Web Article: 222 Excel shortcuts for Windows and Mac - https://exceljet.net/keyboard-shortcuts REFERENCES:Slide51
Web Article: 253 Shortcuts for Microsoft Excel 2016 (Windows) - https://shortcutworld.com/Excel/win/Microsoft-Excel_2016_ShortcutsWeb Article and Video: Microsoft Support – Overview of Formulas in Excel -
https://support.office.com/en-us/article/Overview-of-formulas-in-Excel-ecfdc708-9162-49e8-b993-c311f47ca173
Web Article: Microsoft Support – Excel Functions (alphabetical) -
https://support.office.com/en-us/article/Excel-functions-alphabetical-b3944572-255d-4efb-bb96-c6d90033e188
Web Article: Microsoft Support – Excel Functions (by category) -
https://support.office.com/en-us/article/Excel-functions-by-category-5f91f4e9-7b42-46d2-9bd1-63f26a86c0eb
REFERENCES: