A Database Reverse Engineering Case Study Michael R Blaha DSc blahacomputerorg wwwsuperdataguycom DAMA Chicago Aug 17 2016 What is Database Reverse Engineering Reverse engineering is the inverse to normal development ID: 762462
Download Presentation The PPT/PDF document "A Database Reverse Engineering Case Stud..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
A Database Reverse Engineering Case Study Michael R. Blaha, DSc.blaha@computer.orgwww.superdataguy.com DAMA Chicago Aug 17, 2016
What is Database Reverse Engineering? Reverse engineering is the inverse to normal developmentStart with an application and work backwards to understand the software and infer its intentReverse engineering can apply to a variety of artifacts Hardware, programming code, databases, … Our focus here is on databases
Why Would Anyone Want to Do DBRE? To elicit requirementsDBRE is not intended to perpetuate past flawsDBRE is merely a source of tentative requirementsTo convert legacy dataTo integrate application stovepipes To assess software To assist maintenance To construct documentation
Case Study 1: Reverse Engineer WordPress
Rationale WordPress is an interesting DBRE case study because...WordPress is a well-known applicationThe case study has a populated database The data is real (not synthetic) The data is not proprietary Illustrate DBRE techniques for a small database
Processing Details Export MySQL db from www.superdataguy.com websiteThe exported localhost.sql is unreadable Import SQL code into a local MySQL db Export schema only from local MySQL db The exported file is readable Manually edit the SQL by deleting… `, unsigned, COLLATE, KEY, UNIQUE KEY, ENGINE Reverse engineer schema with ERwin
Initial ERwin Model
Record Counts From querying the MySQL database wp_fqir_commentmeta 68 wp_fqir_comments 23 wp_fqir_links 0 wp_fqir_options 359 wp_fqir_postmeta 656 wp_fqir_posts 406 wp_fqir_term_relationships 59 wp_fqir_term_taxonomy 15 wp_fqir_termmeta 0 wp_fqir_terms 15 wp_fqir_usermeta 47 wp_fqir_users 1
Manually Add FKs Look for name similarityVerify with data analysisSELECT * FROM wp_fqir_commentmetaWHERE comment_ID NOT IN ( SELECT comment_ID FROM wp_fqir_comments); -- 0 records SELECT * FROM wp_fqir_commentmeta WHERE comment_ID IS NULL ; -- 0 records
DBRE ERwin Model
Data Dictionary By inspecting the data of each table… wp_fqir_commentmeta Key-value pairs from Akismet plugin wp_fqir_comments Comments posted to the Website (mostly spam) wp_fqir_links wp_fqir_options Key-value pairs for WordPress and plugins wp_fqir_postmeta Key-value pairs for WordPress, embedded media, and plugins wp_fqir_posts Each record is a WordPress page or post Appears to keep a history wp_fqir_term_relationships wp_fqir_term_taxonomy wp_fqir_termmeta wp_fqir_terms Mostly user defined categories in WordPress wp_fqir_usermeta Key-value pairs for users wp_fqir_users Authorized users for the Website
Commentary WordPress has a very small databaseOnly 12 tablesI had expected more tablesNote the lack of connectivity among tablesTypically tables are more highly interconnected WordPress has no dangling references I don’t think WordPress uses RI ERwin only partially reverse engineers MySQL It chokes on some keywords
Case Study 2: Reverse Engineer Adventure Works 2012
Rationale Adventure Works 2012 is an interesting DBRE case study because...Adventure Works is a free database provided with MS SQL ServerThe case study has a populated database The data is not proprietary The database is of medium size (71 tables) The database defines referential integrity Only one FK is missing
Mechanical Approach Strip down the schema to get to a core modelThis is like skimming a bookWe are working towards an abridgement of a modelWe can quickly get a sense of a schema We will use ER/Studio
Record Counts Query the SQL Server databasedbo.AWBuildVersion 1 dbo.DatabaseLog 1597 dbo.ErrorLog 0 HumanResources.Department 16 HumanResources.Employee 290 HumanResources. EmployeeDepartmentHistory 296 HumanResources.EmployeePayHistory 316 HumanResources.JobCandidate 13 HumanResources.Shift 3 Person.Address 19614 Person.AddressType 6 Person.BusinessEntity 20777 Person.BusinessEntityAddress 19614 Person.BusinessEntityContact 909 Person.ContactType 20 Person.CountryRegion 238 Person.EmailAddress 19972 Person.Password 19972 Person.Person 19972 Person.PersonPhone 19972 Person.PhoneNumberType 3 Person.StateProvince 181
Record Counts Record counts partially indicate table purposeProduction.BillOfMaterials 2679 Production.Culture 8 Production.Document 13 Production.Illustration 5 Production.Location 14 Production.Product 504 Production.ProductCategory 4 Production.ProductCostHistory 395 Production.ProductDescription 762 Production.ProductDocument 32 Production.ProductInventory 1069 Production.ProductListPriceHistory 395 Production.ProductModel 128 Production.ProductModelIllustration 7 Production.ProductModel ProductDescriptionCulture 762 Production.ProductPhoto 101 Production.ProductProductPhoto 504 Production.ProductReview 4 Production.ProductSubcategory 37 Production.ScrapReason 16 Production.TransactionHistory 113443 Production.TransactionHistoryArchive 89253 Production.UnitMeasure 38 Production.WorkOrder 72591
Record Counts Note different counts for ‘types’ and ‘instances’ Purchasing.WorkOrderRouting 67131 Purchasing.ProductVendor 460 Purchasing.PurchaseOrderDetail 8845 Purchasing.PurchaseOrderHeader 4012 Purchasing.ShipMethod 5 Purchasing.Vendor 104 Sales.CountryRegionCurrency 109 Sales.CreditCard 19118 Sales.Currency 105 Sales.CurrencyRate 13532 Sales.Customer 19820 Sales.PersonCreditCard 19118 Sales.SalesOrderDetail 121317 Sales.SalesOrderHeader 31465 Sales.SalesOrderHeaderSalesReason 27647 Sales.SalesPerson 17 Sales.SalesPersonQuotaHistory 163 Sales.SalesReason 10 Sales.SalesTaxRate 29 Sales.SalesTerritory 10 Sales.SalesTerritoryHistory 17 Sales.ShoppingCartItem 3 Sales.SpecialOffer 16 Sales.SpecialOfferProduct 538 Sales.Store 701
Processing Details Import schema into ER/StudioFile / New / Reverse engineer AdventureWorks2012All owners, user tables, no inferencesAdd FK Production.WorkOrderRouting.ProductID -> Production.Product.ProductID Successively delete all entity types with 0,1 connections Delete entity types with few (<= 3) connections
Step 1: ER/Studio Model
Step 2: ER/Studio Model
Step 3: ER/Studio Model
Step 3: Final Tables Remaining tables in blackdbo.AWBuildVersion 1 dbo.DatabaseLog 1597 dbo.ErrorLog 0 HumanResources.Department 16 HumanResources.Employee 290 HumanResources. EmployeeDepartmentHistory 296 HumanResources.EmployeePayHistory 316 HumanResources.JobCandidate 13 HumanResources.Shift 3 Person.Address 19614 Person.AddressType 6 Person.BusinessEntity 20777 Person.BusinessEntityAddress 19614 Person.BusinessEntityContact 909 Person.ContactType 20 Person.CountryRegion 238 Person.EmailAddress 19972 Person.Password 19972 Person.Person 19972 Person.PersonPhone 19972 Person.PhoneNumberType 3 Person.StateProvince 181
Step 3: Final Tables Deleted tables in red Production.BillOfMaterials 2679 Production.Culture 8 Production.Document 13 Production.Illustration 5 Production.Location 14 Production.Product 504 Production.ProductCategory 4 Production.ProductCostHistory 395 Production.ProductDescription 762 Production.ProductDocument 32 Production.ProductInventory 1069 Production.ProductListPriceHistory 395 Production.ProductModel 128 Production.ProductModelIllustration 7 Production.ProductModel ProductDescriptionCulture 762 Production.ProductPhoto 101 Production.ProductProductPhoto 504 Production.ProductReview 4 Production.ProductSubcategory 37 Production.ScrapReason 16 Production.TransactionHistory 113443 Production.TransactionHistoryArchive 89253 Production.UnitMeasure 38 Production.WorkOrder 72591
Step 3: Final Tables Purchasing.WorkOrderRouting 67131 Purchasing.ProductVendor 460 Purchasing.PurchaseOrderDetail 8845 Purchasing.PurchaseOrderHeader 4012 Purchasing.ShipMethod 5 Purchasing.Vendor 104 Sales.CountryRegionCurrency 109 Sales.CreditCard 19118 Sales.Currency 105 Sales.CurrencyRate 13532 Sales.Customer 19820 Sales.PersonCreditCard 19118 Sales.SalesOrderDetail 121317 Sales.SalesOrderHeader 31465 Sales.SalesOrderHeaderSalesReason 27647 Sales.SalesPerson 17 Sales.SalesPersonQuotaHistory 163 Sales.SalesReason 10 Sales.SalesTaxRate 29 Sales.SalesTerritory 10 Sales.SalesTerritoryHistory 17 Sales.ShoppingCartItem 3 Sales.SpecialOffer 16 Sales.SpecialOfferProduct 538 Sales.Store 701
Commentary DBRE depends on having FKsThe 0,1 connection deletions lose little infoThe “few” connections deletions are speculative Supertype/subtypes are troublesome From a separate manual DBRE BusinessEntity -> Employee, Vendor, Person, Store Employee -> SalesPerson
Supertype / Subtype
Case Study 3: Core DBRE
Rationale A project building a very large data warehouse100 facts200 dimensionsThe primary operational feeder application has 8500 tables I was new to the project and there was a lot to learn I wanted to reverse engineer the feeder application so that I could understand it
Available Inputs We had the following inputs (paper printouts) for the feeder applicationA thorough data dictionaryPrimary key definitionsForeign key definitions
The DBRE Problem Reverse engineer a database with 8500 tablesWith smaller schema, we could type the database structure into a modeling tool and then analyze itHowever, 8500 tables would take too longWe decided to determine the tightly connected tables and hope that would yield a much smaller model We presume that the tightly connected tables are the most important ones
DBRE Approach Do a graph analysisCreate a meta-table with FK to PK referencesThe FK in the source table points to the PK in the target tableUsing SQL, successively delete tables with 0,1 FK connections The final result is the multiply connected tables
Example
Example Repeatedly subtract tables with one reference until there is no change
Finding Core Tables DELETE FROM TableReferences AS T3WHERE EXISTS ( SELECT T1.sourceTable FROM TableReferences AS T1 WHERE NOT EXISTS ( SELECT * FROM TableReferences AS T2 WHERE T1.sourceTable = T2.targetTable ) AND T3.sourceTable = T1.sourceTable GROUP BY T1.sourceTable HAVING COUNT(*)= 1 ); The middle query finds tables with one source reference The innermost query limits the one-source tables to those that are not the target of any other sources The outer query does the deletion
Results Initial: 8500 tablesSeveral thousand FK definitions854 tables have FK columns254 tables are referenced by FKs Final result : 553 core tables
Case Study 4:Enterprise Data Model
Rationale Construct an enterprise data modelMy client – a financial software vendor – was a fusion of five formerly separate companiesThe applications were greatly dissimilar because they were built by separate organizations The purpose of the EDM was to provide a basis for integrating the applications and help the new company strengthen their brand
DBRE Approach We wanted to seed the EDM with application contentWe tried full DBRE but it was not helpful because the resulting models were so differentWe tried core DBRE but the models were still confusing because they were so different Finally we decided to count the FK references to each table We included the tables with the highest counts This worked
Example AddressType 2Applicant 29ApplicantAddressHistory 7ApplicantType 2 LenderApplicantDetails 2 Country 2 OverseasCorrespondence 2 ExistingInsuranceCover 6 Provider 3 PaymentFrequency 2 Fee 11 FeeDueType 2 FeeType 2 ProductFee 2
Results
DBRE and Zachman FW The “What” columnPopulated database – Technician perspectiveDatabase schema – Engineer perspective Physical data model – Architect perspective Logical data model – Business mgmt. perspective Conceptual data model – Executive perspective
Thank you for attending… Any questions???