Slide 1

Chapter 3: MESSY
Mansoureh Rousta

Slide 2
USING ALL AVAILABLE DATA, BUT…
The cost is inexactitude: errors we never wanted, we must now consider unavoidable and learn to live with.
Going from small data to big data

Slide 3
In a world of small data:
Reducing errors and ensuring high quality was a natural and essential impulse.
We collected so little information that it had to be as accurate as possible.
Scientists strove to make their measurements ever more precise (of celestial bodies, or under the microscope).
Analyzing only a limited number of data points means errors may get amplified.

Slide 4
If one could measure a phenomenon, the implicit belief was, one could understand it.
Later, measurement was tied to the scientific method of observation and explanation: the ability to quantify, record, and present reproducible results.

Slide 5
By the nineteenth century, France had developed a system of precisely defined units of measurement to capture space, time, and more, and had begun to get other nations to adopt the same standards.
It was the apex of the age of measurement.
Just half a century later, in the 1920s, the discoveries of quantum mechanics shattered forever the dream of comprehensive and perfect measurement.

Slide 6
However, in many new situations that are cropping up today, allowing for imprecision (for messiness) may be a positive feature, not a shortcoming.

Slide 7
You can also increase messiness by combining different types of information from different sources, which don't always align perfectly.
Messiness can also refer to the inconsistency of formatting, for which the data needs to be "cleaned" before being processed.

Slide 8
Suppose we need to measure the temperature in a vineyard: one temperature sensor for the whole plot, or a sensor for every one of the hundreds of vines?
Now suppose we also increase the frequency of the sensor readings.
The information will be a bit less accurate, but its great volume makes it worthwhile to forgo strict exactitude.
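A minimal sketch of why this trade-off can pay off: simulate many cheap, noisy sensors and compare their combined estimate with a single well-calibrated reading. The sensor counts and noise levels below are made-up assumptions for illustration only.

```python
import random

random.seed(42)
TRUE_TEMP = 20.0  # hypothetical true average temperature of the vineyard

# One carefully calibrated sensor: small error, but a single reading.
single_reading = TRUE_TEMP + random.gauss(0, 0.1)

# Hundreds of cheap sensors: each individually sloppy (10x the error),
# but their average converges on the true value.
cheap_readings = [TRUE_TEMP + random.gauss(0, 1.0) for _ in range(500)]
cheap_average = sum(cheap_readings) / len(cheap_readings)

print(f"single precise sensor:        {single_reading:.3f}")
print(f"average of 500 messy sensors: {cheap_average:.3f}")
# Both land close to 20.0, and the messy swarm additionally reveals
# per-vine variation that a single sensor could never capture.
```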
Slide 9

In many cases it is more fruitful to tolerate error than it would be to work at preventing it.
For instance, we can accept some messiness in return for scale.
As Forrester, a technology consultancy, puts it, "Sometimes two plus two can equal 3.9, and that is good enough."
Of course the data can't be completely incorrect, but we're willing to sacrifice a bit of accuracy in return for knowing the general trend.

Slide 10
Everyone knows how much processing power has increased over the years, as predicted by Moore's Law, which states that the number of transistors on a chip doubles roughly every two years.
This continual improvement has made computers faster and memory more plentiful.
BUT…

Slide 11
Fewer of us know that the performance of the algorithms that drive many of our systems has also increased, in many areas by more than the improvement of processors under Moore's Law.
Many of the gains to society from big data, however, happen not so much because of faster chips or better algorithms, but because there is more data.

Slide 12
For example, chess algorithms have changed only slightly in the past few decades.
Or consider the way computers learn how to parse words as we use them in everyday speech.
Around 2000, Microsoft researchers Michele Banko and Eric Brill were looking for a method to improve the grammar checker that is part of the company's Word program.

Slide 13
They weren't sure whether it would be more useful to put their effort into improving existing algorithms, finding new techniques, or adding more sophisticated features.
Instead, they tried feeding much more data into the existing methods.
The results were astounding: as more data went in, the performance of all four types of algorithms improved dramatically.
The simplest algorithm went from 75 percent accuracy to above 95 percent; the algorithm that had been best with little data went from 86 percent to 94 percent.
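Banko and Brill's actual task was confusion-set disambiguation on corpora of up to a billion words. As a stand-in, here is a toy learning-curve experiment on synthetic data (the dataset sizes, feature counts, and scikit-learn model are all assumptions for illustration) showing the same qualitative effect: an unchanged algorithm improves simply by seeing more examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a language task: 100,000 labeled examples.
X, y = make_classification(n_samples=100_000, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The same simple algorithm, trained on ever larger slices of the data.
for n in (100, 1_000, 10_000, 80_000):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n:>6} training examples -> {acc:.1%} accuracy")
# Accuracy climbs with data volume even though the algorithm never changes.
```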
Slide 14

"These results suggest that we may want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development," Banko and Brill wrote in one of their research papers on the topic.

Slide 15
A few years after Banko and Brill, researchers at rival Google were thinking along similar lines, but at an even larger scale.
Instead of testing algorithms with a billion words, they used a trillion.
Google did this not to develop a grammar checker but to crack an even more complex nut: language translation.

Slide 16
So-called machine translation has been a vision of computer pioneers since the dawn of computing in the 1940s.
The idea took on a special urgency during the Cold War.
At first, computer scientists opted for a combination of grammatical rules and a bilingual dictionary.
An IBM computer translated sixty Russian phrases into English in 1954, using 250 word pairs in the computer's vocabulary and six rules of grammar. The results were very promising.

Slide 17
The director of the research program, Leon Dostert of Georgetown University, predicted that…
By 1966 a committee of machine-translation grandees had to admit failure.
The problem was harder than they had realized it would be.
In the late 1980s, researchers at IBM had a novel idea…

Slide 18
In the 1990s, IBM's Candide project used ten years' worth of Canadian parliamentary transcripts published in French and English, about three million sentence pairs.
Suddenly, computer translation got a lot better.
Eventually, IBM pulled the plug.
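The statistical idea behind Candide can be sketched in miniature: instead of hand-written grammar rules, count which words co-occur across aligned sentence pairs and let the counts suggest translations. The three-pair corpus below is invented, and the scoring rule is only a crude stand-in for the alignment probabilities IBM actually estimated.

```python
from collections import Counter, defaultdict

# A made-up miniature parallel corpus (French, English); the real Candide
# project used about three million aligned sentence pairs.
sentence_pairs = [
    ("la chambre est ouverte", "the house is open"),
    ("la séance est levée", "the sitting is adjourned"),
    ("la chambre reprend la séance", "the house resumes the sitting"),
]

cooccur = defaultdict(Counter)   # French word -> English word co-occurrence counts
en_total = Counter()             # how often each English word appears overall
for fr, en in sentence_pairs:
    en_words = en.split()
    en_total.update(en_words)
    for f in fr.split():
        for e in en_words:
            cooccur[f][e] += 1

def translate(word):
    # Prefer English words that rarely appear WITHOUT the French word,
    # breaking ties by raw co-occurrence count.
    return max(cooccur[word],
               key=lambda e: (cooccur[word][e] / en_total[e], cooccur[word][e]))

for w in ("chambre", "séance", "est"):
    print(w, "->", translate(w))
# chambre -> house, séance -> sitting, est -> is
```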
Slide 19

But less than a decade later, in 2006, Google got into translation, as part of its mission to "organize the world's information and make it universally accessible and useful."
Instead of nicely translated pages of text in two languages, Google availed itself of a larger but also much messier dataset: the entire global Internet and more.
Its trillion-word corpus amounted to 95 billion English sentences, albeit of dubious quality.

Slide 20
Despite the messiness of the input, Google's service works the best.
By mid-2012 its dataset covered more than 60 languages.
The reason Google's translation system works well is not that it has a smarter algorithm. It works well because its creators, like Banko and Brill at Microsoft, fed in more data, and not just data of high quality.
Google was able to use a dataset tens of thousands of times larger than IBM's Candide because it accepted messiness.

Slide 21
More trumps better
Messiness is difficult to accept for conventional sampling analysts, who all their lives have focused on preventing and eradicating messiness.
They work hard to reduce error rates when collecting samples, and to test the samples for potential biases before announcing their results.

Slide 22
They use multiple error-reducing strategies.
Such strategies are costly to implement, and they are hardly feasible for big data.
Moving into a world of big data will require us to change our thinking about the merits of exactitude: we no longer need to worry so much about individual data points biasing the overall analysis.

Slide 23
At BP's Cherry Point Refinery in Blaine, Washington, wireless sensors are installed throughout the plant, forming an invisible mesh that produces vast amounts of data in real time.
The environment of intense heat and electrical machinery might distort the readings, resulting in messy data.
But the huge quantity of information generated from both wired and wireless sensors makes up for those hiccups.

Slide 24
When the quantity of data is vastly larger and is of a new type, exactitude in some cases is no longer the goal.
It bears noting that messiness is not inherent to big data. Instead, it is a function of the imperfection of the tools we use to measure, record, and analyze information.
If the technology were to somehow become perfect, the problem of inexactitude would disappear.

Slide 25
Messiness in action
In many areas of technology and society, we are leaning in favor of more and messy over fewer and exact.
Hierarchical systems for organizing information have always been imperfect, as everyone familiar with a library card catalogue knows.

Slide 26
For example, in 2011 the photo-sharing site Flickr held more than six billion photos from more than 75 million users. Trying to label each photo according to preset categories would have been useless. Would there really have been one entitled "Cats that look like Hitler"?
Instead, clean taxonomies are being replaced by mechanisms that are messier but also eminently more flexible and adaptable to a world that evolves and changes.

Slide 27
When we upload photos to Flickr, we "tag" them.
Many of the Web's most popular sites flaunt their admiration for imprecision over the pretense of exactitude, for example "likes" on Facebook or Twitter.
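A free-form tag system is easy to picture in code: each photo carries whatever labels its uploader chose, and search is just set intersection over those labels. The photo names and tags here are invented; this is a sketch of the folksonomy idea, not Flickr's implementation.

```python
# Each photo carries whatever free-form tags its uploader chose;
# there is no preset taxonomy to fit into.
photos = {
    "IMG_001.jpg": {"cat", "funny", "hitler-lookalike"},
    "IMG_002.jpg": {"cat", "sleeping"},
    "IMG_003.jpg": {"vineyard", "sunset"},
}

def search(*tags):
    """Return photos whose tag sets contain every requested tag."""
    wanted = set(tags)
    return [name for name, t in photos.items() if wanted <= t]

print(search("cat"))           # ['IMG_001.jpg', 'IMG_002.jpg']
print(search("cat", "funny"))  # ['IMG_001.jpg']
```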
Slide 28

Traditional database engines required data to be highly structured and precise.
Data wasn't simply stored; it was broken up into "records" that contained fields. Each field held information of a particular type and length.
For example…

Slide 29
The most common language for accessing databases has long been SQL, or "structured query language." The very name evokes its rigidity.
But the big shift in recent years has been toward something called noSQL, which doesn't require a preset record structure to work. It accepts data of varying type and size and allows it to be searched successfully.
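The contrast is easy to demonstrate with Python's built-in sqlite3 module standing in for a traditional engine, and a plain list of dictionaries standing in for a schemaless, noSQL-style store. The table layout and sample records are invented for illustration.

```python
import sqlite3

# Traditional, rigid style: every record must fit the declared fields.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name VARCHAR(40), balance REAL)")
db.execute("INSERT INTO customers VALUES (?, ?, ?)", (1, "Ada", 102.50))

# A record with an extra field is simply rejected by the engine:
try:
    db.execute("INSERT INTO customers VALUES (?, ?, ?, ?)", (2, "Bob", 7.0, "gold"))
except sqlite3.OperationalError as err:
    print("relational engine says:", err)

# noSQL-style store: records of varying shape live side by side.
documents = [
    {"id": 1, "name": "Ada", "balance": 102.50},
    {"id": 2, "name": "Bob", "tags": ["gold"], "last_login": "2012-06-01"},
    {"id": 3, "likes": 42},  # no name at all; still searchable
]
print([d for d in documents if d.get("likes", 0) > 10])
```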
Slide 30

Pat Helland, one of the world's foremost authorities on database design, calls this "lossy": processing big data entails an inevitable loss of information.

Slide 31
Traditional database design promises to deliver consistent results across time. If you ask for your bank account balance, for example, you expect to receive the exact amount…
While traditional systems would have a delay until all updates are made, accepting messiness is instead a kind of solution.
Hadoop, an open-source rival to Google's MapReduce system, is a case in point.

Slide 32
Typical data analysis requires an operation called "extract, transfer, and load," or ETL.
Hadoop assumes the data is too voluminous to move; it must be analyzed where it is.
Hadoop's output isn't as precise as that of relational databases: it can't be trusted to launch a spaceship or to certify bank-account details. But for many less critical tasks, where an ultra-precise answer isn't needed, it does the trick far faster than the alternatives.
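A toy single-process sketch of the MapReduce pattern that Hadoop implements: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. Real Hadoop distributes these steps across the machines where the data already lives; this miniature word count only shows the shape of the computation.

```python
from collections import defaultdict

documents = [
    "big data is messy",
    "messy data is still data",
]

# Map: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group independently (which is why it parallelizes).
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 1, 'data': 3, 'is': 2, 'messy': 2, 'still': 1}
```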
Slide 33

In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools. For example, ZestFinance uses messy data for loan decisions.
According to some estimates, only 5 percent of all digital data is "structured", that is, in a form that fits neatly into a traditional database. Without accepting messiness, the remaining 95 percent of unstructured data, such as web pages and videos, remains dark.
Slide 34

Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy.
Slide 35

Big data may require us to change, to become more comfortable with disorder and uncertainty.
Slide 36

As the next chapter will explain, finding associations in data and acting on them may often be good enough.