Miscellaneous

Midterm project review
Due in two weeks
Instructions will be sent out by the weekend
Will be graded, unlike the proposal

Profiler

Summary of Paper
A “data cleaning” browsing and visualization tool
Builds on Wrangler / Potter’s Wheel ideas
Adds a few notions of its own:
Recommending visualizations that highlight anomalies
Linked visualizations to see how value dependencies manifest across visualizations

Assumptions
What are the assumptions made by the paper?

Assumptions
Single table:
Foreign key dependencies missed …
No data integration
Univariate outliers (typically)
Fits in main memory
No entire-row deduplication

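To make the "univariate outliers" assumption concrete, here is a minimal sketch of a single-column outlier check. It is not taken from the paper: the function name `mad_outliers`, the median/MAD rule, and the 3.5 threshold are illustrative assumptions. The point is only that such a check looks at one column at a time, so a row whose values are individually plausible but jointly odd would pass unnoticed.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds `threshold`.

    Hypothetical helper for illustration; uses the median and the
    median absolute deviation (MAD) of a single numeric column.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; this rule cannot rank them.
        return []
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


ages = [10, 11, 9, 10, 12, 10, 95]
print(mad_outliers(ages))
```

Running the check separately on each column is what "univariate" amounts to here; nothing in it compares one column against another.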
Lots more to do…
What are the future directions from this paper?

Lots more to do…
What are the future directions from this paper?
Lots of user options: how does the user make sense of them?
What does the user do after browsing?
When does the user stop?

Other Open Questions
Recommendation of what to clean first?
Notions of completeness?
Real-world statistics on which sorts of anomalies are more common than others?
Fixing errors?

Mutual Information-based Anomaly
The metric tries to identify a relationship between the COUNT(*) ... GROUP BY X distribution for the anomalous data vs. the same distribution for the other data
What else can you think of?
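The mutual-information idea above can be sketched as follows. This is a rough illustration under assumptions, not the paper's implementation: the function name `mutual_information`, the binary anomaly flag per row, and the use of base-2 logarithms are all choices made here. The counters play the role of COUNT(*) GROUP BY X, computed separately for anomalous and non-anomalous rows.

```python
import math
from collections import Counter


def mutual_information(values, is_anomalous):
    """Mutual information (in bits) between column X and an anomaly flag.

    `values` holds the X value of each row; `is_anomalous` holds a
    boolean per row. Both names are hypothetical.
    """
    n = len(values)
    if n == 0:
        return 0.0
    joint = Counter(zip(values, is_anomalous))  # COUNT(*) GROUP BY X, flag
    x_counts = Counter(values)                  # COUNT(*) GROUP BY X
    a_counts = Counter(is_anomalous)            # COUNT(*) GROUP BY flag
    mi = 0.0
    for (x, a), c in joint.items():
        # p(x, a) * log2( p(x, a) / (p(x) * p(a)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (x_counts[x] * a_counts[a]))
    return mi


# Anomalies concentrated in one group of X: the flag is fully explained by X.
print(mutual_information(["a", "a", "b", "b"], [True, True, False, False]))
# Anomalies spread evenly across groups: X says nothing about the flag.
print(mutual_information(["a", "b", "a", "b"], [True, True, False, False]))
```

A high score suggests that grouping by X separates anomalous rows from the rest, which would make X a candidate column for a visualization; a score near zero suggests the anomalies are spread across the groups in the same proportions as the other data.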