/
An Approach for Testing An Approach for Testing

An Approach for Testing - PowerPoint Presentation

alexa-scheidler
alexa-scheidler . @alexa-scheidler
Follow
343 views
Uploaded On 2019-12-11

An Approach for Testing - PPT Presentation

An Approach for Testing the ExtractTransformLoad Process in Data Warehouse Systems Masters Defense Hajar Homayouni Dec 1 2017 1 Outline 2 Survey of data warehouse testing research Problem description and goal ID: 770060

data table target source table data source target count attribute match warehouse analysis testing etl test record distinct year

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "An Approach for Testing" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

An Approach for Testing the Extract-Transform-Load Process in Data Warehouse Systems Master’s Defense Hajar HomayouniDec. 1, 2017 1

Outline2Survey of data warehouse testing research Problem description and goalMotivating exampleProposed approach for functional testing of ETL Demonstration and evaluationConclusionsFuture work

Data Warehouse Systems3 Help researchers and analyzers:Make accurate analysis and decisionsFind precise patterns and trends in data Front-end Applications Analysis Tools Decision Support Tools Analysis on d rugs and treatments Source AreaRelational/non-relational databaseXML fileCSV fileHospitals Health Data Warehouse Google BigQuery E T L

Extract-Transform-Load Process 4

Data Warehouse Testing Survey GoalsCreate a classification framework for data warehouse testing approachesSurvey existing techniques and their limitations Identify open problems 5

Classification Framework for Data Warehouse Testing Approaches 6 Data Warehouse Testing Testing Source Area Testing Target Data Warehouse Testing Extract-Transform-Load Testing Front-end Applications TestingUnderlying DataTestingData ModelTestingData Management Product FunctionalTestingSecurityTestingFunctionalTestingStructuralEvaluationUsabilityEvaluationMaintainabilityEvaluationPerformanceTestingStressTestingRecoveryTestingFunctionalTestingPerformanceTestingStressTesting Scalability TestingReliabilityTestingRegressionTesting Usability Testing FunctionalTestingUsabilityTestingPerformanceTestingStressTesting

Functional Testing of ETL ProcessData Quality Tests: Validate data in the target data warehouse in isolation to detect syntactic and semantic violations Example: Patient’s height and weight values stored in the target data warehouse must be positiveBalancing Tests: Analyze the data in the source databases and target data warehouse and report differencesExample: Date of birth must be consistent between target data warehouse and source hospital for the same patient 7

Limitations of Balancing Test ApproachesStare and compare: Manual approach that requires exhaustive comparison of data Query Surge: Automatic approach but only verifies data that is not modified during the ETL transformation8

Goal9Propose an automatic approach for validating ETL processes using balancing tests that: Ensure that the data obtained from the source databases is not lost or incorrectly modified by the ETL process Include validating data that has been reformatted and modified through the ETL process

Sources of Faults in Data10 Complexity of transformations make ETL implementations fault prone Faulty ETL implementations lead to incorrect data in data warehouseSystem failures or connection loss may result in data loss or duplication Erroneous setting of ETL parameters can result in incorrect data Malicious programs may remove or modify data in data warehouses

Motivating ExampleTable mappingsOne-to-one table mapping Many-to-one table mapping Does the Person table lose any current patients?Does the Location table lose any addresses of patients admitted after Year 2000? 11 Source Table Source Attribute Target TableTarget AttributeSelection ConditionAddressAddress_keyLocationLocation_idTransform all the new addresses (Year>2000) Source TableSource Attribute Target TableTarget AttributeSelection ConditionPatientAddressAddress_keyPersonLocation_idTransform all the current patients with their new addresses (Year>2000) LEFT JOIN

Motivating Example (cont.)Attribute mappingsOne-to-one attribute mapping Many-to-one attribute mapping Are any of the attribute values incorrect?Is Date_of_birth the correct concatenation of three attributes in the source? 12 Source Table Source Attribute Target Table Target Attribute Selection Condition AddressAddress_keyLocationLocation_idTransform all the new addresses (Year>2000) Source TableSource AttributeTarget TableTarget AttributeSelection ConditionPatient Day_of_birthMonth_of_birthYear_of_birthPersonDate_of_birthTransform all the current patients

Proposed ApproachDefine properties to be checked in balancing testsIdentify source-to-target mappings from transformation rules Generate balancing tests 13

Balancing Properties14CompletenessConsistency Syntactic validity

Completeness15Ensures that all the source data that must be transformed and loaded are present in the warehouseVerified by checking:Record count match Distinct record count match

Record Count Match16One-to-one table mapping COUNT(target_table ) = COUNT(source_table) Where conditionExample: COUNT(Location ) = COUNT( Address ) Where (Year>2000) Many-to-one table mapping N and M are the number of records in the left and right source tables Upper bound COUNT(target_table) in UNION=N+MUpper bound COUNT(target_table) in JOIN=N∗MLower bound COUNT(target_table) in JOIN= 0Lower bound COUNT(target_table) in LEFT JOIN=NLower bound COUNT(target_table) in RIGHT JOIN=MExample:COUNT(Person) >= COUNT(Patient) Where (IsCurrent=TRUE)

Distinct Record Count Match17One-to-one table mapping DISTINCT COUNT(target_table ) = COUNT(source_table) Where conditionExample: DISTINCT COUNT(Location ) = COUNT(Address ) Where (Year>2000) Many-to-one table mapping Upper bound DISTINCT COUNT( target_table) in UNION=N+MUpper bound DISTINCT COUNT(target_table) in JOIN=N∗MLower bound DISTINCT COUNT(target_table) in JOIN= 0Lower bound DISTINCT COUNT(target_table) in LEFT JOIN=NLower bound DISTINCT COUNT(target_table) in RIGHT JOIN=MWhere N and M are the number of records in the left and right source tablesExample:DISTINCT COUNT(Person) >= COUNT(Patient) Where (IsCurrent=TRUE)

Consistency18Ensures that attribute contents in target table conform to the corresponding ones in source table(s) based on transformation rulesVerified by checking:Attribute value matchAttribute constraint matchAttribute outlier matchAttribute average match

Consistency Example19 Attribute value match Source Table Source Attribute Target Table Target Attribute Selection Condition Address Address_key Location Location_idTransform all the new addresses (Year>2000) Source TableSource AttributeTarget TableTarget AttributeSelection ConditionPatientDay_of_birthMonth_of_birthYear_of_birthPersonDate_of_birthTransform all the current patients IN ?IN ?CONCAT

Consistency Example (Cont.)20 Attribute constraint match Source Table Source Attribute Target Table Target Attribute Selection Condition Address Address_key Location Location_idTransform all the new addresses (Year>2000) Source TableSource AttributeTarget TableTarget AttributeSelection ConditionPatientDay_of_birthMonth_of_birthYear_of_birthPersonDate_of_birthTransform all the current patients NOT NULLYear_of_birth < Year.Now()NOT NULL?Date_of_birth < Date.Now()?

Consistency Example (Cont.)21 Attribute outlier/average match Source Table Source Attribute Target Table Target Attribute Selection Condition Address Address_key Location Location_idTransform all the new addresses (Year>2000) Source TableSource AttributeTarget TableTarget AttributeSelection ConditionPatientWeightHeightPersonBMITransform all the current patients =?BMI=?Min/Max/AvgMin/Max/AvgMin/Max/AvgMin/Max/AvgBMI=Avg(Weight)/ Avg(Height)^2

Syntactic Validity22Ensures that syntax of attributes in target table conforms to syntax of the corresponding ones in source table(s) based on transformation rulesVerified by checking:Attribute data type matchAttribute length match

Syntactic Validity Example23 Attribute data type match Attribute length match Source Table Source Attribute Target Table Target Attribute Selection Condition Address Address_key LocationLocation_idTransform all the new addresses (Year>2000) BigIntINT64? Source TableSource AttributeTarget TableTarget AttributeSelection ConditionPatientDay_of_birthMonth_of_birthYear_of_birthPersonDate_of_birthTransform all the current patientsCONCAT=?Average LengthAverage Length

Identify Source-to-Target Mappings24 Table mappingTarget table: Insert into clause Source table(s): From clauseTable operation(s): Join/union clause(s) Selection condition: Where clauseAttribute mappingSource/target attribute(s ): As clause Attribute operation(s ): Aggregation function(s) ETL transformation rules Identify source-to-target mappings Mapping table

Generate Balancing Tests25 Mapping Table Test Assertions Generate Analysis Queries Properties Generate Test Assertions Source Analysis Values Target Analysis Values Translator for SQL Dialects

Generate Analysis Queries26For each of the mapping rows, generate analysis queries to check the balancing properties Record count matchSource analysis query: SELECT COUNT(Patient) Where IsCurrent=TRUETarget analysis query: SELECT  COUNT( Person) Attribute outlier match Source analysis query: SELECT MIN/MAX(Avg(Weight) / (Avg(Height)^2) Where IsCurrent=TRUETarget analysis query: SELECT MIN/MAX(BMI)Source TableTable OperationSource AttributeAttribute OperationTarget TableTarget AttributeSelection Condition[0] Patient[1] AddressLeft joinAddress_KeyPersonLocation_idIsCurrent=TRUE and Year>2000Patient[0] Weight[1] Height Avg(Weight)/ Avg(Height)^2Person BMIIsCurrent=TRUE

Generate Test Assertions27Compare the analysis values in the source with the ones in the target Store the mismatches in a test result table Assertion ID Description 100 Record count mismatch : Person VS Patient Source_analysis_value Target_analysis_value COUNT(Patient) Where (IsCurrent=TRUE)COUNT(Person) Test assertion for record count match:Test result table:If Source_analysis_value > Target_analysis_value then Assertion_id=100 and Description=“Record count mismatch: target_table VS source_table”

Empirical Evaluation Objectives28Validation of ETL scriptsEvaluation of fault finding ability of assertions

Validation of ETL Scripts29Goal: Demonstrate that the ETL in the health data warehouse is correct with respect to the assertionsQuestion: Do test assertions find faults that were previously undetected by the original developers? Metric: Number of faults detected by the assertions

Validation of ETL Scripts (Cont.)30 Experiment subjects: Sources: Hospital A (72,652,442 records) with Caboodle data model on MSSQL ServerHospital B (32,726,923 records) with Caboodle data model on MSSQL ServerTarget: Data warehouse (127,714,325 records) with OMOP data model on Google BigQuery Results : Generated 44 test assertions Revealed 11 previously undetected faults Six assertion violations for Hospital A: Three average mismatchesTwo record count mismatchesOne outlier mismatchFive assertion violations for Hospital B: Four average mismatchesOne outlier mismatch

Fault Finding Ability of Assertions31 Goal: Evaluate the ability of generated assertions to detect faults injected into mock data Question: Does at least one assertion fail as a result of each fault?Metric: Percentage of faults in mock data detected by test assertionsExperiment subjects: Source: Mock database (900 records) with Caboodle data model on MSSQL Server Target: Data warehouse (112 records) with OMOP data model on Google BigQueryResults: Detected 92% of 118 injected faultsAll undetected faults resulted from mutating attributes in the target table that did not map to any attribute in the source table(s)All the assertions that should have failed actually failed

Discussion32Properties CompletenessNecessary but not sufficientLimitations of our evaluation Single ETL program used No comparison with other approachesSingle test data generator

Conclusions33Proposed a classification framework based on what is tested and how it is testedIdentified gaps in the literature and proposed directions for future research Presented an approach to automatically create balancing testsAssertions could find previously undetected faults in the health data warehouseAssertions could detect faults injected into mock data in the target data warehouse

Future Work34Define a structural format to specify source-to-target mappings in a consistent manner  Implement an algorithm to generate effective test assertions from these mappings.Develop an approach that aids developers in effectively localizing faults in the ETL process  Develop an approach that selects test cases to make regression testing more efficient than using a test-all process

Publications35Hajar Homayouni, Sudipto Ghosh, and Indrakshi Ray. Data Warehouse Testing. Accepted to Publication in Advances in Computers, 2017. Hajar Homayouni, Sudipto Ghosh, and Indrakshi Ray. On Generating Balancing Tests for Validating the Extract-Transform-Load Process for Data Warehouse Systems. Submitted to 11th IEEE Conference on Software Testing, Validation and Verification, 2018.

36Thank You

Mutation Analysis37 Property violations for each operator AR: record count and distinct record count match violationDR : record count and distinct record count match violation MNF : attribute value match and average match violation MSF : attribute value match and attribute length match violation MNFmin : outlier match and average match violation MNFmax : outlier match and average match violation MSFlength: attribute length match violation MFnull: attribute constraint match violationOperatorDescriptionARAdd random recordDRDelete random recordMNFModify numeric fieldMSFModify string fieldMNFminModify min of numeric fieldMNFmaxModify max of numeric fieldMSFlengthModify string field lengthMfnullModify filed to null