The History of Histograms (abridged)

Yannis Ioannidis
Department of Informatics and Telecommunications, University of Athens
Panepistimioupolis, Informatics Buildings
157-84, Athens, Hellas (Greece)
yannis@di.uoa.gr

Abstract

The history of histograms is long and rich, full of detailed information in every step. It includes the course of histograms in different scientific fields, the successes and failures of histograms in approximating and compressing information, their adoption by industry, and solutions that have been given on a great variety of histogram-related problems. In this paper and in the same spirit of the histogram techniques themselves, we compress their entire history (including their "future history" as currently anticipated) in the given/fixed space budget, mostly recording details for the periods, events, and results with the highest (personally-biased) interest. In a limited set of experiments, the semantic distance between the compressed and the full form of the history was found relatively small!

1 Prehistory

The word 'histogram' is of Greek origin, as it is a composite of the words 'isto-s' (= 'mast', also means 'web'
but this is not relevant to this discussion) and 'gram-ma' (= 'something written'). Hence, it should be interpreted as a form of writing consisting of 'masts', i.e., long shapes vertically standing, or something similar. It is not, however, a word that was originally used in the Greek language (see Footnote 1). The term 'histogram' was coined by the famous statistician Karl Pearson (see Footnote 2) to refer to a "common form of graphical representation". In the Oxford English Dictionary quotes from "Philosophical Transactions of the Royal Society of London" Series A, Vol. CLXXXVI, (1895) p. 399, it is mentioned that "[The word 'histogram' was] introduced by the writer in his lectures on statistics as a term for a common form of graphical representation, i.e., by columns marking as areas the frequency corresponding to the range of their base.". Stigler identifies the lectures as the 1892 lectures on the geometry of statistics [69].

Footnote 1: On the contrary, the word 'history' is indeed part of the Greek language ('istoria') and in use since the ancient times. Despite its similarity to 'histogram', however, it appears to have a different etymology, one that is related to the original meaning of the word, which was 'knowledge'.

Footnote 2: His claim to fame includes, among others, the chi-square test for statistical significance and the term 'standard deviation'.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003

The above quote suggests that histograms were used long before they received their name, but their birth date is unclear. Bar charts (i.e., histograms with an individual 'base' element associated with each column) most likely predate histograms and this helps us put a lower bound on the timing of their first appearance. The oldest known bar chart appeared in a book by the Scottish political economist William Playfair (see Footnote 3) titled "The Commercial and Political Atlas (London 1786)" and shows the imports and exports of Scotland to and from seventeen countries in 1781 [74]. Although Playfair was skeptical of the usefulness of his invention, it was adopted by many in the following years, including, for example, Florence Nightingale, who used them in 1859 to compare mortality in the peacetime army to that of civilians and through those convinced the government to improve army hygiene.

Footnote 3: In addition to the bar chart, Playfair is probably the father of the pie chart and other extremely intuitive and useful visualizations that we use today.

From all the above, it is clear that histograms were first conceived as a visual aid to statistical approximations. Even today this point is still emphasized in the common conception of histograms: Webster's defines a histogram as "a bar graph of a frequency distribution in which the widths of the bars are proportional to the classes into which the variable has been divided and the heights of the bars are proportional to the class frequencies". Histograms, however, are extremely useful even when disassociated from their canonical visual representation and treated as purely mathematical objects capturing data distribution approximations. This is precisely how we approach them in this paper.

In the past few decades, histograms have been used in
several fields of informatics. Besides databases, histograms have played a very important role primarily in image processing and computer vision. Given an image (or a video) and a visual pixel parameter, a histogram captures for each possible value of the parameter (Webster's "classes") the number of pixels that have this value (Webster's "frequencies"). Such a histogram is a summary that is characteristic of the image and can be very useful in several tasks: identifying similar images, compressing the image, and others. Color histograms are the most common in the literature, e.g., in the QBIC system [21], but several other parameters have been proposed as well, e.g., edge density, texturedness, intensity gradient, etc. [61]. In general, histograms used in image processing and computer vision are accurate. For example, a color histogram contains a separate and precise count of pixels for each possible distinct color in the image. The only element of approximation might be in the number of bits used to represent different colors: fewer bits imply that several actual colors are represented by one, which will be associated with the number of pixels that have any of the colors that are grouped together. Even this kind of approximation is not common, however.

In databases, histograms are used as a mechanism for full-fledged compression and approximation of data distributions. They first appeared in the literature and in systems in the 1980's and have been studied extensively since then at a continuously increasing rate. In this paper, we concentrate on the database notion of histograms, discuss the most important developments on the topic so far, and outline several problems that we believe are interesting and whose solution may further expand their applicability and
usefulness.

2 Histogram Definitions

2.1 Data Distributions

Consider a relation $R$ with $n$ numeric attributes $X_i$ ($i = 1..n$). The value set $V_i$ of attribute $X_i$ is the set of values of $X_i$ that are present in $R$. Let $V_i = \{ v_i(k) : 1 \le k \le D_i \}$, where $v_i(k) < v_i(j)$ when $k < j$. The spread $s_i(k)$ of $v_i(k)$ is defined as $s_i(k) = v_i(k+1) - v_i(k)$, for $1 \le k < D_i$. (We take $s_i(D_i) = 1$.) The frequency $f_i(k)$ of $v_i(k)$ is the number of tuples in $R$ with $X_i = v_i(k)$. The area $a_i(k)$ of $v_i(k)$ is defined as $a_i(k) = f_i(k) \times s_i(k)$. The data distribution of $X_i$ is the set of pairs $T_i = \{ (v_i(1), f_i(1)), (v_i(2), f_i(2)), \ldots, (v_i(D_i), f_i(D_i)) \}$.

The joint frequency $f(k_1, .., k_n)$ of the value combination $\langle v_1(k_1), .., v_n(k_n) \rangle$ is the number of tuples in $R$ that contain $v_i(k_i)$ in attribute $X_i$, for all $i$. The joint data distribution $T_{1,..,n}$ of $X_1, .., X_n$ is the entire set of (value combination, joint frequency) pairs. In the sequel, for 1-dimensional cases, we use the above symbols without the subscript $i$.

2.2 Motivation for Histograms

Data distributions are very useful in database systems but are usually too large to be stored accurately, so histograms come into play as an approximation mechanism. The two most important applications of histogram techniques in databases have been selectivity estimation and approximate query answering within query optimization (for the former) or pre-execution user-level query feedback (for both). Our discussion below focuses exactly on these two, especially range-query selectivity estimation, as this is the most popular issue in the literature. It should not be forgotten, however, that histograms have proved to be useful in the context of several other database problems as well, e.g., load-balancing in parallel join query execution [65], partition-based temporal join execution [68], and others.

2.3 Histograms

A histogram on an attribute $X$ is constructed by partitioning the data distribution of $X$ into $\beta$ ($\ge 1$) mutually disjoint subsets called buckets and approximating the frequencies and values in each bucket in some common fashion. This definition leaves several degrees of freedom in designing specific histogram classes, as there are several possible choices for each of the following (mostly orthogonal) aspects of histograms [67]:

Partition Rule: This is further analyzed into the following characteristics:

Partition Class: This indicates if there are any restrictions on the buckets. Of great importance is the serial class, which requires that buckets are non-overlapping with respect to some parameter (the next characteristic), and its subclass end-biased, which requires at most one non-singleton bucket.

Sort Parameter: This is a parameter whose
value for each element in the data distribution is derived from the corresponding attribute value and frequencies. All serial histograms require that the sort parameter values in each bucket form a contiguous range. Attribute value (V), frequency (F), and area (A) are examples of sort parameters that have been discussed in the literature.

Source Parameter: This captures the property of the data distribution that is the most critical in an estimation problem and is used in conjunction with the next characteristic in identifying a unique partitioning. Spread (S), frequency (F), and area (A) are the most commonly used source parameters.

Partition Constraint: This is a mathematical constraint on the source parameter that uniquely identifies a single histogram within its partition class. Several partition constraints have been proposed so far, e.g., equi-sum, v-optimal, maxdiff, and compressed, which are defined further below as they are introduced. Many of the more successful ones try to avoid grouping vastly different source parameter values into a bucket.

Following [67], we use p(s,u) to denote a serial histogram class with partition constraint p, sort parameter s, and source parameter u.
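
As a rough illustration of the quantities of Section 2.1 and of the p(s,u) notation, the following Python sketch derives values, frequencies, spreads, and areas from a column and then forms buckets under an equi-sum constraint with attribute value as the sort parameter. It is only a sketch under stated assumptions: the function names `data_distribution` and `equi_sum` are hypothetical, equi-sum is read as "the sums of the source parameter in each bucket are roughly equal", and the greedy cut rule is one simple choice rather than a construction algorithm taken from the literature.

```python
from collections import Counter


def data_distribution(column):
    """Return (values, freqs, spreads, areas) for a list of numbers."""
    if not column:
        return [], [], [], []
    counts = Counter(column)
    values = sorted(counts)                  # v(1) < v(2) < ... < v(D)
    freqs = [counts[v] for v in values]      # f(k): tuples holding v(k)
    # s(k) = v(k+1) - v(k); the spread of the last value is taken to be 1.
    spreads = [b - a for a, b in zip(values, values[1:])] + [1]
    areas = [f * s for f, s in zip(freqs, spreads)]  # a(k) = f(k) * s(k)
    return values, freqs, spreads, areas


def equi_sum(values, source, num_buckets):
    """Greedy equi-sum(V, u) sketch (assumed cut rule, not from the paper).

    Values are visited in sort-parameter (value) order, and a bucket is
    closed whenever the running sum of the source parameter reaches the
    next multiple of total / num_buckets.
    """
    target = sum(source) / num_buckets
    buckets, current, running = [], [], 0
    for v, u in zip(values, source):
        current.append(v)
        running += u
        more_cuts_allowed = len(buckets) < num_buckets - 1
        if more_cuts_allowed and running >= (len(buckets) + 1) * target:
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    return buckets


column = [1, 1, 1, 2, 4, 4, 7, 8, 8, 8, 8, 10]
values, freqs, spreads, areas = data_distribution(column)
by_frequency = equi_sum(values, freqs, 3)   # equi-sum(V,F)
by_spread = equi_sum(values, spreads, 3)    # equi-sum(V,S)
```

With frequency as the source parameter the buckets hold roughly equal numbers of tuples, and with spread as the source parameter they cover value ranges of equal length; only the source parameter changes between the two calls, which is the kind of orthogonality the taxonomy is meant to expose.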

Construction Algorithm: Given a particular partition rule, this is the algorithm that constructs histograms that satisfy the rule. It is often the case that, for the same histogram class, there are several construction algorithms with different efficiency.

Value Approximation: This captures how attribute values are approximated within a bucket, which is independent of the partition rule of a histogram. The most common alternatives are the continuous value assumption and the uniform spread assumption; both assume values uniformly placed in the range covered by the bucket, with the former ignoring the number of these values and the latter recording that number inside the bucket.

Frequency Approximation: This captures how frequencies are approximated within a bucket. The dominant approach is making the uniform distribution assumption, where the frequencies of all elements in the bucket are assumed to be the same and equal to the average of the actual frequencies.

Error Guarantees: These are upper bounds on the errors of the estimates a histogram generates, which are provided based on information that the histogram maintains.

A multi-dimensional histogram on a set of attributes is constructed by
partitioning the joint data distribution of the attributes. They have the exact same characteristics as 1-dimensional histograms, except that the partition rule needs to be more intricate and cannot always be clearly analyzed into the four other characteristics as before, e.g., there is no real sort parameter in this case, as there can be no ordering in multiple dimensions [66].

3 The Past of Histograms

First Appearance

To the best of our knowledge, the first proposal to use histograms to approximate data distributions within a database system was in Kooi's PhD thesis [47]. His proposal was an immediate loan from statistics of the simplest form of histogram, with the value set being divided into ranges of equal length, i.e., the so called equi-width histograms. Hence, in terms of the taxonomy of Section 2.3, the entry point for histograms into the world of databases was the serial class of equi-sum(V,S), where the equi-sum partition constraint requires that the sums of the source-parameter values (spreads in this case) in each bucket are equal. Within each bucket, values and frequencies were approximated based on the continuous value assumption and the uniform distribution assumption, respectively.

Equi-width histograms represented a dramatic improvement over the uniform distribution assumption for the entire value set (i.e., essentially a single-bucket histogram), which was the state of the practice at the time. Hence, they were quickly adopted by the Ingres DBMS in its commercial version, and later on by other DBMSs as well.

First Alternative

A few years after Kooi's thesis, the first alternative histogram was proposed, changing only the source parameter [62]. Instead of having buckets of equal-size ranges, the new proposal called for buckets with (roughly) the same number of tuples
in each one, i.e., the so called equi-depth or equi-height histograms. In terms of the taxonomy, these are the equi-sum(V,F) histograms. There was ample evidence that equi-depth histograms were considerably more effective than equi-width histograms; hence, many commercial vendors switched to those in the years following their introduction. Equi-depth histograms were later presented in their multi-dimensional form as well [58].

Optimal Sort Parameter

After several years of inactivity on the topic of histograms, interest in it was renewed in the context of studying how initial errors in statistics maintained by the database propagate in estimates of the size of complex query results [36]. In particular, it was shown that, under some rather general conditions, in the worst case, errors propagate exponentially in the query size (i.e., in the number of joins), removing any hope for high-quality estimates for large multi-join queries.

The first results that led towards new types of histograms were derived in an effort to obtain statistics that would be optimal in minimizing/containing the propagation of errors in the size of join results [37]. The basic mathematical tools used were borrowed from majorization theory [55]. The focus was on a rather restricted class of equality join queries, i.e., single-join queries or multi-join queries with only one attribute participating in joins per relation (more generally with a 1-1 functional dependency between each pair of join attributes of each relation). For this query class, and
under the assumption that the value set is known accurately, it was formally proved that the optimal histogram was serial and had frequency as the sort parameter.

Ten Years Ago

The above result might have not had the impact it did if it had remained true only for the restricted query class it was first proved for. Soon afterwards, however, in VLDB'93, it was generalized for arbitrary equality join queries, giving a strong indication that the most effective histograms may be very different from those that were used until that point [34].

To the best of our knowledge, histograms with frequency as the sort parameter represented the first departure from value-based grouping of buckets, not only within the area of databases, but overall within mathematics and statistics as well. Furthermore, their introduction
essentially generalized some common practices that were already in use in commercial systems (e.g., in DB2), where the highest frequency values were maintained individually and accurately due to their significant contribution to selectivity estimates. Such a practice is an instance of a special case of histogram in the end-biased partition class, with frequency as the sort parameter: the highest sort-parameter values are maintained in singleton buckets. Although less accurate than general serial histograms, in several cases, end-biased histograms proved quite effective.

New Partition Constraints

The results on the optimality of frequency as the sort parameter left open two important questions. First, which partition constraints are the most effective, i.e., which ones among all possible frequency-based bucketizations? Second, which histograms are optimal when the value set is not accurately maintained but is approximated in some fashion?

The answer to the first question came in the form of the v-optimal histograms, which partition the data distribution so that (roughly) the variance of source-parameter values within each bucket is minimized [38]. Unfortunately,
the second question had no analytical answer, but extensive experimentation led to the formation of the space of histogram characteristics that we use as the basic framework for our discussion in this paper (Section 2.3) [67]. In addition to the equi-sum and v-optimal partition constraints, it introduced several possible new ones as well, which similarly to v-optimal had as a goal to avoid grouping together in the same bucket vastly different source-parameter values. Among them, we distinguish maxdiff, which places bucket boundaries between adjacent source-parameter values (in sort-parameter order) whose difference is among the largest, and compressed, which puts the highest source values in singleton buckets and partitions the rest in equi-sum fashion. Overall, the new partition constraints (i.e., v-optimal, maxdiff, compressed) were shown to be the most effective in curbing query-result-size estimation errors.

Footnote 4: Histograms with frequency as the sort parameter were called simply serial histograms at the time, but the term was later generalized to imply non-overlapping ranges of any sort parameter, not just frequency, which is how we use the term in this paper as well.

The same effort pointed towards several
possibilities for the sort and source parameters, i.e., value, spread, frequency, area, cumulative frequency, etc., with frequency and area being the best source parameters. Interestingly, the best sort parameter proved to be the value and not the frequency as the original optimality results would suggest, indicating that, if values are not known accurately, having buckets with overlapping value ranges does not pay off for range queries.

The most effective of these histograms have actually been adopted by industrial products (see Section 4). Furthermore, in addition to selectivity estimation for various relational and non-relational queries, these histograms have proved to be very effective in approximate query answering as well [39].

Since the specification of the above space of histograms, there have been several efforts that have studied one or more of its characteristics and have proposed alternative, improved approaches. For each characteristic, we outline some of the most notable pieces of work on it in a separate subsection below. Unless explicitly mentioning the opposite, the discussion is about 1-dimensional histograms.

Alternative Partition Constraints

In addition to the
partition constraints that were introduced as part of the original histogram framework [67], a few more have been proposed that attempt to approach the effectiveness of v-optimal, usually having a more efficient construction cost. Among them, we note one that uses a simplified form of the optimal knot placement problem [18] to identify the bucket boundaries, which are where the 'knots' are placed [46]. The simplification consists of using only linear splines that are also allowed to be discontinuous across bucket boundaries. This is combined with interesting alternatives on the value and frequency approximation within each bucket.

Multi-Dimensional Partition Rules

The first introduction of multi-dimensional histograms was by Muralikrishna and DeWitt [58], who essentially described 2-dimensional equi-depth histograms. Space was divided in the same way it is done in a Grid-file, i.e., recursively cutting the entire space into half-spaces by using a value of one of the dimensions as a boundary each time, the dimension and the value being chosen in a way prespecified at the beginning of the process [58]. Buckets were non-overlapping (the multi-dimensional version of the serial partition class) on
the space of the multi-dimensional values (the multi-dimensional version of value as the sort parameter), the boundaries chosen with equi-sum as the partition constraint and frequency as the source parameter.

It was not until several years later that any new partition rules were proposed [66], this time taking advantage of the generality of the histogram taxonomy [67]. The most effective family of such rules was MHIST-p, which starts from the entire joint data distribution placed in a single bucket and, at each step, splits the space captured by one of the buckets it has formed into p subspaces, until it has exhausted its budget of buckets. The split is made in the bucket and along the dimension that is characterized as most "critical", i.e., whose marginal distribution is the most in need of partitioning based on the (1-dimensional) partition constraint and source parameter used. In combination with the most effective partition constraints and source parameters (i.e., v-optimal or maxdiff with frequency or area), MHIST-2 represented a dramatic improvement over the original multi-dimensional equi-depth histograms.

Since MHIST, there have been several other interesting partition rules that have been proposed. One of them is GENHIST [31], which was originally proposed in the context of multi-dimensional real-valued data, but its applicability is broader. The main characteristic of GENHIST is that it allows buckets to overlap in the space of multi-dimensional values: the algorithm starts from a uniform grid partitioning of the space and then iteratively enlarges the buckets that contain high numbers of data elements. This has two effects: first, the density of data in each bucket decreases, thus making the overall density smoother;
second, the buckets end up overlapping, thus creating many more distinct areas than there are buckets per se. The data distribution approximation within each area is a combination of what all the overlapping buckets that form the area indicate. This results in a small number of buckets producing approximations with low errors.

Another alternative is the STHoles Histogram [11], which takes, in some sense, a dual approach to GENHIST: instead of the region covered by a bucket increasing in size and overlapping with other buckets, in STHoles, this region may decrease in size due to the removal of a piece of it (i.e., opening a hole) that forms a separate, child bucket. This creates buckets that are not solid rectangles, and is therefore capable of capturing quite irregular data distributions.

Identifying effective multi-dimensional partition rules is by no means a closed problem, with different approaches being proposed continuously [23].

Value Approximation Within Each Bucket

Given a specific amount of space for a histogram, one of the main tradeoffs is the number of buckets versus the amount of information kept in each bucket. A small amount of information within each bucket implies gross
local approximations but also more buckets. Finding the right balance in this tradeoff to optimize the overall approximation of the data distribution is a key question.

With respect to approximating the set of values that fall in a 1-dimensional bucket, there have been essentially two approaches. Under the traditional continuous value assumption, one maintains the least amount of information (just the min and max value), but nothing that would give some indication of how many values there are or where they might be. Under the more recent uniform spread assumption [67], one also maintains the number of values within each bucket and approximates the actual value set by the set that is formed by (virtually) placing the same number of values at equal distances between the min and max value. A different version of that has also been proposed that does not record the actual average spread within a bucket but one that reduces the overall approximation error in range queries by taking into account the popularity of particular ranges within each bucket [46]. There have been several studies that show each general technique superior to the other, an indication that there may be no universal winner.

The main approaches mentioned above have been extended for multi-dimensional buckets as well, maintaining the min and max value of each dimension in the bucket. Under the continuous value assumption nothing more is required, but under the uniform spread assumption, the problem arises of which distinct (multi-dimensional) values are assumed to exist in the bucket. If $d_i$ is the number of distinct values in attribute $X_i$ that are present in a bucket and $v'_i(k)$ is the $k$'th approximate value in dimension $i$ (obtained by applying the uniform spread assumption along that dimension), then a reasonable approach is to assume that all possible combinations $\langle v'_1(k_1), .., v'_n(k_n) \rangle$ exist in the bucket [66].

There has also been an interesting effort that introduces the use of kernel estimation into the 1-dimensional histogram world [10] to deal specifically with real-valued data. Roughly, it suggests choosing the points of considerable change in the probability density function as the bucket boundaries (in a spirit similar to the maxdiff partition constraint) and then applying the traditional kernel estimation method for approximating the values within each bucket. This has also been generalized for the multi-dimensional case [31].

Frequency Approximation Within Each Bucket

With respect to approximating the set of frequencies that fall in a bucket, almost all efforts deal with the traditional uniform distribution assumption. Among the few exceptions is one that is combined with the linear spline partition constraint mentioned above and
uses a linear spline-based approximation for frequencies as well [46]. It records one additional data item per bucket to capture linearly growing or shrinking frequencies at the expense of fewer buckets for a fixed space budget. Likewise, another exception uses equally small additional space within each bucket to store cumulative frequencies in a 4-level tree index [13]. Contrary to the previous effort, however, it is combined with some of the established partition constraints, i.e., v-optimal and maxdiff.

Efficient and Dynamic Constructions

Although estimation effectiveness is probably the most important property of histograms (or any other compression/estimation method for that matter), construction cost is also a concern. With respect to this aspect, histograms may be divided into two categories: static histograms and dynamic/adaptive histograms.

Static
histograms are those that are traditionally used in database systems: after they are constructed (from the stored data or a sample of it), they remain unchanged even if the original data gets updated. Depending on the details of the updates, a static histogram eventually drifts away from what it is supposed to approximate, and the estimations it produces may suffer from increasingly larger errors. When this happens, the administrators ask for a recalculation, at which point the old histogram is discarded and a new one is calculated afresh.

An important consideration for static histograms is the cost of each calculation itself, which is mostly affected by the partition constraint. Most such constraints (e.g., equi-sum, maxdiff, compressed) have straightforward calculations that are efficient. This is not the case, however, for what has been shown to be the most effective constraint, i.e., v-optimal, whose straightforward calculation is in general exponential in the number of source-parameter values. A key contribution in this direction has been the proposal of a dynamic-programming based algorithm that identifies the v-optimal histogram (for any sort
and source parameter) in time that is quadratic in the number of source-parameter values and linear in the number of buckets, thus making these histograms practical as well [42]. Subsequently, several (mostly theoretical) efforts have introduced algorithms that have reduced the required running time for calculating these optimal histograms, eventually bringing it down to linear overall and achieving similar improvements for the required space as well [30]. Dynamic-programming algorithms have also been proposed for constructing the optimal histograms for (hierarchical) range queries in OLAP data [44]. For the multi-dimensional case, optimal histogram identification is NP-hard, so several approximate techniques have been proposed [59].

Another interesting development has been the proposal of algorithms to identify optimal sets of histograms (as opposed to individual histograms), based on an expected workload [40]. This effort focuses on v-optimal histograms, but is equally applicable to other partition constraints as well.

Even with the existence of efficient calculation algorithms, however, static histograms suffer from increasing
errors between calculations. Moreover, in a data stream environment, static histograms are not an option at all, as there is no opportunity to store the incoming data or examine it more than once. Hence, several works have proposed various approaches to dynamic/adaptive/self-tuning histograms, which change as the data gets updated, while remaining competitive to their static counterparts. Among these, we note one for equi-depth and compressed histograms [26], one for v-optimal histograms [27], and one for (linear) spline-based histograms [46]. There is also an effort focusing on data streams, where a sketch on the (joint) data distribution of the stream is maintained, from which an effective multidimensional histogram may be constructed [72]; the STHoles histogram is used for experimentation with the method, but in principle, it could be applied to other histogram classes as well.

Another approach to dynamic construction that has been examined in the past consists of query feedback mechanisms that take into account actual sizes of query results to dynamically modify histograms so that their estimates are closer to reality. In essence, this is histogram adaptation at
query time instead of at up date time. The main represen tativ es in this cat- egory are the ST-histo gr ams [2] and their descendan STHoles histo gr ams [11 ], whic emplo sophisticated partition rule as ell. These tec hniques are indep en- den of the particular haracteristics of the initial his- tograms, whic ma constructed in an e.g., they could equi-depth histograms. In addition to their dynamic nature, ey adv an tage of these ap- proac hes is their lo cost. The LEO system [70 generalizes these eorts as it uses result sizes of uc more complicated queries to mo dify its statistics,

including join and aggregate queries, queries with user-defined functions, and others. Interestingly, LEO does not update the statistics in place, but puts all feedback information into separate catalogs, which are used in combination with the original histograms at estimation time.

Error Guarantees

Most work on histograms deals with identifying those that exhibit low errors in some estimation problem, but not with providing, together with the estimates, some information on what those errors might be. The first work to address the issue [42] suggests storing in each bucket the maximum difference between the actual and the approximate (typically the average) frequency of a value in the bucket and using that to
provide upper bounds on the error of any selectivity estimates produced by the histogram for equality and range selection queries. An interesting alternative focuses on optimizing top-N range queries and stores additional information on a per-histogram rather than on a per-bucket basis [20].

Other Data Types

As mentioned earlier, most work on histograms has focused on approximating numeric values, in one or multiple dimensions (attributes). Nevertheless, the need for approximation is much broader, and several efforts have examined the use of histograms for other data types as well.

With respect to spatial data, the canonical approaches to 2-dimensional histograms do not quite work out as these are for point data and do not extend to objects that are 2-dimensional themselves. Furthermore, frequency is usually not an issue in spatial data, as spatial objects are not repeated in a database. Several interesting techniques have been presented to address the additional challenges, which essentially are related to the partition rule, i.e., how the spatial objects are grouped into buckets. Some form buckets by generalizing conventional histogram partition constraints while others do it by following approaches used in spatial indices (e.g., R-trees). The MinSkew Histogram [5] is among the more sophisticated ones and divides the space using binary partitionings (recursively dividing the space along one of the dimensions each time) so that the overall spatial skew of all buckets is minimized. The latter captures the variance in the density of objects within each bucket, so it follows, in some sense, the spirit of
the v-optimal histograms. The SQ-Histogram [3] is an interesting alternative, dividing the space according to the Quad-tree rule (which is more restrictive than arbitrary binary partitionings) and, in addition to spatial proximity, taking into account proximity in the size as well as the complexity (number of vertices) of the polygons that are placed in the same bucket. Both approaches are quite effective, with SQ being probably the overall winner. Spatial histograms, i.e., MinSkew histograms, have also been extended to capture the velocity of object movement, thus becoming able to approximate spatio-temporal data as well [17].

The recent interest in XML could not, of course, leave untouched XML file approximation, XML query result size estimation, and other related problems. The semi-structured nature of XML files does not lend itself to histogram-based approximation, as there is no immediate multi-dimensional space that can be bucketized but one needs to be formed from some numeric XML-file characteristics. In the StatiX approach [22], information in an XML Schema is used to identify potential sources of structural skew and then 1-dimensional histograms are built for the most problematic places in the schema, approximating the distributions of parent ids for different elements. In the XPathLearner approach [49], (first-order) Markov Histograms are used [1], where the frequencies of the results of traversing all paths of length 2 are stored in two 2-dimensional histograms. The dimensions always represent the ‘from’ and ‘to’ nodes of the paths in the XML graph; in the first histogram, both nodes/dimensions are for XML tags, whereas in the second histogram, the ‘from’ node/dimension is an XML tag and the ‘to’ node/dimension is a value.
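As a rough illustration of the first (tag-to-tag) histogram, the following Python sketch counts the frequency of every parent/child tag pair in a small XML document and then estimates the result size of a longer path from those counts. The sample document, the function names, and the first-order Markov estimation formula f(a/b/c) ≈ f(a/b) · f(b/c) / f(b) are assumptions made for illustration; they are not taken from XPathLearner [49], and the counts here are kept exact rather than bucketized.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_tag_counts(xml_text):
    """Return (tag_freq, pair_freq) for an XML document.

    tag_freq[t]       = number of elements with tag t (paths of length 1)
    pair_freq[(a, b)] = number of b elements whose parent is an a element
                        (paths of length 2: the 'from' and 'to' dimensions)
    """
    root = ET.fromstring(xml_text)
    tag_freq, pair_freq = Counter(), Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        tag_freq[node.tag] += 1
        for child in node:
            pair_freq[(node.tag, child.tag)] += 1
            stack.append(child)
    return tag_freq, pair_freq


def estimate_path(tags, tag_freq, pair_freq):
    """Estimate how many elements the path tags[0]/tags[1]/... reaches.

    Assumption (hypothetical, first-order Markov): the next step depends
    only on the current tag, so each extra step multiplies the estimate
    by pair_freq[(b, c)] / tag_freq[b].
    """
    if len(tags) == 1:
        return float(tag_freq[tags[0]])
    estimate = float(pair_freq[(tags[0], tags[1])])
    for b, c in zip(tags[1:], tags[2:]):
        if tag_freq[b] == 0:
            return 0.0
        estimate *= pair_freq[(b, c)] / tag_freq[b]
    return estimate


# Hypothetical sample document, used only to exercise the sketch.
doc = (
    "<lib>"
    "<book><title/><author/><author/></book>"
    "<book><title/></book>"
    "<paper><author/></paper>"
    "</lib>"
)
tag_freq, pair_freq = build_tag_counts(doc)
print(pair_freq[("book", "author")])
print(estimate_path(["lib", "book", "author"], tag_freq, pair_freq))
```

With exact pair counts the estimate is only as good as the Markov assumption; a space-bounded version would replace the `pair_freq` table with a bucketized 2-dimensional histogram.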

Assuming enough memory, frequencies are maintained accurately for all tag-to-tag pairs (accurate histogram), as there are very few. On the contrary, tag-to-value pairs are placed in a histogram that is based on a 2-dimensional version of the compressed partition constraint, with frequency as the source parameter. Another approach for estimating XML-query result sizes builds position histograms on a 2-dimensional space as well, only here the dimensions are directly or indirectly related to the numbering of each node in a preorder traversal of the XML graph [79]. Finally, histograms have also been used in combination with or as parts of other data structures for XML approximations. The XSketch is quite an effective graph-based synopsis that tries to capture both the structural and the value characteristics of an XML file [63, 64]. Histograms enter the picture as they are used at various parts of an XSketch to capture statistical correlations of elements and values in particular neighborhoods of the XSketch graph.

In addition to XML graphs, histograms have also been proposed to capture the degrees of the nodes in general graphs as a way to compare graphs between them and grade their
similarity [60].

Unconventional Histograms

Throughout the years, there have been a few interesting pieces of work that do not quite follow the general histogram taxonomy or histogram problem definitions. One of them suggests the use of the Discrete Cosine Transform (DCT) to compress an entire multi-dimensional histogram and store its compressed form [48]. It employs a very simple multi-dimensional partition rule (a uniform grid over the entire space), divides the space into a large number of small buckets, and then compresses the bucket information using DCT. This appears to save on space but also estimation time, as it is possible to recover the necessary information through the integral of the inverse DCT function.

There is also a promising line of work that combines histograms with other techniques to produce higher-quality estimations than either technique could do alone. In addition to several such combinations with sampling, a particularly interesting technique tries to overcome the ‘curse of dimensionality’ by identifying the critical areas of dependence and independence
among dimensions in multi-dimensional data, capturing them with a statistical interaction model (e.g., log-linear model), which can then form the basis for lower-dimensional MHIST histograms to approximate the overall joint data distribution [19].

Finally, there is a very interesting departure from the convention that histograms are built on base relations and estimations of the data distributions of intermediate query results are obtained by appropriate manipulations of these base-relation histograms [12]. It discusses the possibility of maintaining histograms on complex query results, which proves to be quite effective in some cases. This work uses the main SQL Server histograms (essentially maxdiff; see Section 4) to demonstrate the proposed approach, but the overall effort is orthogonal to the particular histogram class. As the number of potential complex query histograms is much larger than that of base-relation histograms, the corresponding database design problem of choosing which histograms to construct is accordingly more difficult as well. Fortunately, a workload-based algorithm proves adequate for the task.

Industrial Presence of Histograms

Histograms have not only been the subject of much research activity but also the favorite approximation method
of all commercial DBMSs as well. Essentially all systems had equi-width histograms in the beginning and then eventually moved to equi-depth histograms. In this section, we briefly describe the currently adopted histogram class for three of the most popular DBMSs.

DB2 employs compressed histograms with value as the sort parameter and frequency as the source parameter [50]. Users may specify the numbers of singleton and non-singleton buckets desired for the most-frequent values and the equi-depth part of a compressed histogram, respectively, with the default being 10 and 20.
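To make this structure concrete, here is a minimal Python sketch of a compressed histogram: the most frequent values get singleton buckets with exact counts, and the remaining values are split into equi-depth buckets. The function name, the bucket representation, and the simple sort-based construction are assumptions for illustration only and do not reflect DB2's actual implementation; only the defaults of 10 and 20 come from the text.

```python
from collections import Counter


def compressed_histogram(values, num_singleton=10, num_equidepth=20):
    """Build a compressed histogram over a list of comparable values.

    Returns (singletons, buckets):
      singletons: dict mapping each of the most frequent values to its
                  exact frequency
      buckets:    list of (low, high, count) tuples over the remaining
                  values, each holding roughly the same number of rows
    """
    freq = Counter(values)
    singletons = dict(freq.most_common(num_singleton))

    # Everything not covered by a singleton bucket, in sort-parameter order.
    rest = sorted(v for v in values if v not in singletons)

    buckets = []
    if rest:
        n = min(num_equidepth, len(rest))
        for i in range(n):
            # Boundaries chosen so that bucket sizes differ by at most one.
            start = i * len(rest) // n
            end = (i + 1) * len(rest) // n
            chunk = rest[start:end]
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return singletons, buckets


# Hypothetical data: two heavy hitters plus a uniform tail.
data = [1] * 50 + [2] * 30 + list(range(3, 23))
singletons, buckets = compressed_histogram(data, num_singleton=2, num_equidepth=4)
print(singletons)
print(buckets)
```

Keeping exact counts for the heavy hitters means the equi-depth part only has to approximate the lower-frequency tail. Note that this simplified version may place copies of the same value in two adjacent buckets.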

A departure from our general descriptions above is that DB2 stores cumulative frequencies within non-singleton buckets. Histogram construction is based on a reservoir sample of the data. DB2 exploits multi-dimensional cardinality information from indices on composite attributes (whenever they are available) to obtain some approximate quantification of any dependence that may exist between the attributes, and uses this during selectivity estimation. Otherwise, it assumes attributes are independent. The learning capabilities of LEO [70] play a major role in how all available information is best exploited for high-quality estimation.

Oracle still employs equi-depth histograms [78]. Its basic approach to multi-dimensional selectivities is similar to that of DB2, based on exploiting any available information from composite indices. In addition to that, however, it offers dynamic sampling capabilities to obtain on-the-fly dependence information for rather complex predicates whenever needed (selections and single-table functions are already available, while joins will be in the next release). It also takes into account the dependencies that exist between the attributes of a cube’s dimension hierarchies during roll-up and provides estimates at the appropriate hierarchy level. Finally, the next release will employ learning techniques to remember selectivities of past predicates and use them in the future.

SQL Server employs maxdiff histograms with value as the sort parameter and essentially average frequency (within each bucket) as the source parameter [9]. It permits up to 199 buckets, storing within each bucket the frequency of the max value and (essentially) the cumulative frequency of all values less than that. Histogram construction is typically based on a sample
of the data. Composite indices are used in a similar fashion as in the other systems for obtaining multi-dimensional selectivity information.

Note that all commercial DBMSs have implemented strictly 1-dimensional histograms. Except for some incidental indirect information, they essentially still employ the attribute value independence assumption and have not ventured off to multi-dimensional histograms.

Competitors of Histograms

The main technique that has competed against histograms in the past decade is wavelets, which is very important for image compression and has been introduced into the database world in the late 90’s [7, 56]. Wavelets have been used extensively for approximate answering of different query types and/or in different environments: multidimensional aggregate queries (range-sum queries) in OLAP environments [75, 76], aggregate and non-aggregate relational queries with computations directly on the stored wavelet coefficients [14], and selection and aggregate queries over streams [28]. As with histograms, there have also been efforts to devise wavelet-based techniques whose approximate query answers are provided with error guarantees [24], as well as to construct and maintain the most important wavelet coefficients dynamically [57].

Sampling is not a direct competitor to histograms, as it is mostly a runtime technique, and furthermore, the literature on sampling is extremely large, so it is impossible to analyze the corresponding highlights in the limited space of this paper. However, we should emphasize that sampling is often a complementary technique to histograms, as static (and even several forms of dynamic) histograms are usually constructed based on a sample of the original data [15, 26, 62].

There are also
several specialized techniques that have been proposed and compete with histograms on specific estimation problems. These include techniques for selectivity estimation of select-join queries [71] or spatial queries [8], using query feedback to modify stored curve-fitting/parametric information for better selectivity estimation [16], selectivity estimation for alphanumeric/string data in 1-dimensional [43, 45] and multi-dimensional environments [41, 77], identification of quantiles [6, 53, 54] and their dynamic maintenance with a priori guarantees [29], approximate query answering for aggregate join queries [4], select-join queries [25], and within the general framework of on-line aggregation [33, 32, 51], computing frequencies of high-frequency items in a stream [52], and others. Despite their suboptimality compared to some of these techniques on the corresponding problems, histograms remain the method of choice, due to their overall effectiveness and wide applicability.

The Future of Histograms

Despite the success of histograms, there are several problems whose current solutions leave enough space for significant improvement and several
others that remain wide open, whose solution would make the applicability of histograms much wider and/or their effectiveness higher. We have recently discussed various problems of both types [35], some addressing specific histogram characteristics from the existing taxonomy while others being cast in a slightly more general context. In this section, we focus on three of the open problems, those that we believe are the most promising and sense as being the furthest from any past or current work that we are aware of.

Histogram Techniques and Clustering

Abstracting the details of the problem of histogram-based approximation, one would see some striking similarities with the traditional problem of clustering [73]: the joint data distribution is partitioned into buckets, where each bucket contains similar elements. Similarity is defined based on some distance function that takes into account the values of the data attributes and the value of the frequency if there is any variation on it (e.g., if it is not equal to 1 for all data elements). The buckets are essentially clusters in the traditional sense, and for each one, a very short approximation of the elements that fall in it is stored.

Despite the similarities, the techniques that have been developed for the two problems are in general very different, with no well-documented reasoning for many of these differences. Why can’t the histogram techniques that have been developed for selectivity estimation be used for clustering or vice versa? From another perspective, why can’t the frequency in selectivity estimation be considered as another dimension of the joint data distribution and have the problem considered as traditional clustering? What would the impact be of using stored approximations developed for one problem to
solve another? In general, given the great variety of techniques that exist for the two problems, it is crucial to obtain an understanding of the advantages and disadvantages of each one, its range of applicability, and in general, their relative characteristics when mutually compared. A comprehensive study needs to be conducted that will include several more techniques than those mentioned here. The "New Jersey Data Reduction Report" [7] has examined many techniques and has produced a preliminary comparison of their applicability to different types of data. It can serve as a good starting point for verification, extrapolation, and further exploration, not only with respect to applicability but also precise effectiveness trade-offs, efficiency of the algorithms, and other characteristics.

Bucket Recognition and Representation

The goal of any form of (partition-based) approximation, e.g., histogram-based and traditional clustering, is to identify groups of elements so that all those within a group are similar with respect to a small number of parameters that characterize them. By storing approximations of just these parameters, one is able to reconstruct an approximation of the entire group of elements with little error. Note that, in the terms of the histogram taxonomy, these parameters should be chosen as the source parameter(s), to satisfy the proximity-expressing partition constraint.

How do we know which parameters are similar for elements so that we can group them together and represent them in terms of them? This is a typical question for traditional pattern recognition [73], where before applying any clustering techniques, there is an earlier stage where the appropriate dimensions of the elements are chosen among a great number of possibilities.
There are several techniques that make such a choice with varying success depending on the case.

It is important, however, to emphasize that, in principle, these parameters may not necessarily be among the original dimensions of the data elements presented in the problem but may be derivatives of them. For example, in several histogram-based approximations as we have described them above, proximity is sought directly for frequencies but not for attribute values, as attention there is on their spreads. (Recall also the success of area as a source parameter, which is the product of frequency with spread.) The frequencies in a bucket are assumed constant and require a smaller amount of information to be stored for their approximation than the attribute values, which are assumed to follow a linear rule (equal spread). Hence, conventional histogram-based approximation, under the uniform distribution and uniform spread assumptions, implies clustering in the derived space of frequency and spread. In principle, however, not all data distributions are served best with such an approach.

To increase the accuracy of histogram approximations, there should be no fixed, predefined approximation approach to the value dimensions and the frequencies. It should not necessarily even be the same
for dieren buc ets. Histograms should exible enough to use the optimal appro ximation for eac di- mension in eac buc et, one that ould pro duce the est estimations for the least amoun of information. Iden tifying what that optimal appro ximation is, is hard problem and requires further in estigation. Histograms and ree Indices The fact that there is close relationship et een ap- pro ximate statistics ept in databases, esp ecially his- tograms, and indices has

een recognized in the past in sev eral orks [7 ]. If one considers the ro ot of B+ tree, the alues that app ear in it essen tially partition the attribute on whic it is built in to buc ets with the corresp onding orders. Eac buc et is then further sub divided in to smaller buc ets the no des of the subsequen lev el of the tree. One can imagine storing the appropriate information next to eac buc et sp ec- ied in no de, hence transforming the no de in to histogram, and the en tire index in to so called hi- er ar chic al histo gr am This ma adv ersely aect in- dex searc

erformance, of course, as it ould reduce the out-degree of the no de, ossibly making the tree deep er. Nev ertheless, although this idea orks against the main functionalit of an index, its enets are non- negligible as ell, so it has ev en een incorp orated in to some systems. eliev that hierarc hical histograms and, in gen- eral, the in teraction et een appro ximation structures and indices should in estigated further, as there are sev eral in teresting issues that remain unexplored as analyzed elo w. Consider again B+ tree whose no des are completely full. In that case, the ro ot of

the tree sp ecies buc etization of the attribute domain that corresp onds to an equi-depth histogram, i.e., eac buc et con tains roughly an equal um er of elemen ts under it. Similarly an no de in the tree sp ecies an equi-depth buc etization of the range of alues it leads to. The main issue with B+ trees eing turned in to hierarc hical equi-depth histograms is that the latter are far from optimal erall on selectivit estimation [67 ]. Histograms lik v-optimal and maxdi are uc more eectiv e. What kind of indices ould one get if eac no de represen ted buc etizations follo

wing one of these rules? Clearly the trees ould un balanced. This ould mak traditional searc less ecien on the erage. On the other hand, other forms of searc hes ould serv ed more eectiv ely In particular, in system that pro vides appro ximate answ ers to queries, the ro ot of suc tree ould pro vide higher-qualit answ er than the ro ot of the corresp onding B+ tree. urthermore, the system ma mo in progressiv fashion, tra ersing the tree as usual and pro viding series of answ ers that are con tin uously impro ving in qualit ev en tually reac hing the lea es and the nal,

accurate result. Returning to precise query answ ering, note that ypically indices are built assuming all alues or ranges of alues eing equally imp ortan t. Hence, ha ving balanced tree ecomes crucial. There are often cases, ho ev er, where dieren alues ha dieren imp or- tance and dieren frequency in the exp ected ork- loads [46 ]. If this query frequency or some other suc parameter is used in conjunction with adv anced his- togram buc etization rules, some ery in teresting trees ould generated whose erage searc erformance migh uc etter than that of the B+ tree. rom the

above, it is clear that the interaction between histograms and indices presents opportunities but also several technical challenges that need to be investigated. The trade-off between hierarchical histograms that are balanced trees with equi-depth bucketization and those that are unbalanced with more advanced bucketizations requires special attention. The possibility of some completely new structures that would strike even better trade-offs, combining the best of both worlds, cannot be ruled out either.

Conclusions

Histograms have been very successful within the database world. The reason is that, among several existing competing techniques, they probably represent the optimal point balancing the tradeoff between simplicity, efficiency, effectiveness, and applicability for data approximation/compression. Research-wise, most of the basic problems around histograms seem to have been solved, but we believe there are still better solutions to be found for some of them. Moreover, as outlined in the previous section, there are some untouched foundational problems whose solution may require significant changes in our overall perspective on histograms. As much as the past ten years have been enjoyable and productive in deepening our collective understanding of histograms and applying them in the real world, we believe the next ten will be even more exciting and really look forward to them!

Personal History

Our personal history with histograms has been strongly influenced by Stavros Christodoulakis. It all started during the "Query Optimization Workshop", which was organized in conjunction with SIGMOD’89 in Portland, when Stavros argued that optimizing very large join queries did not make any sense, as the errors in the selectivity estimates would be very large after a few joins. Wanting to
prove him wrong due to personal interest in large query optimization, we started collaborating with him on the error propagation problem, work that led to results that justified Stavros’ fears completely [36]. During this effort, we were initiated by Stavros into the wonderful world of majorization theory, Schur functions, and all the other
mathematical tools that much of our subsequent histogram work would be based on. Further collaboration with Stavros resulted in the identification of "serial histograms" and the first realization of their significance [37], which was the springboard for the VLDB’93 paper. For all these, we want to express our sincere gratitude to Stavros for revealing an exciting research path that was hiding many treasures along the way.

The second person who has marked significantly our involvement with histograms is Vishy Poosala. As a PhD student at Wisconsin, Vishy took the original "serial histogram" results, dived deep into them, and pushed them in many different directions, an effort that eventually led us to several interesting results that have played an important role in the success of histograms. For the long
and fruitful collaboration we have had, both before and after his PhD degree, many thanks are due to Vishy as well.

Acknowledgements: For this present paper, we would like to thank Minos Garofalakis, Neoklis Polyzotis, and again Vishy Poosala for several useful suggestions and for verifying that the error in the approximation of the history of histograms presented is small.

References

[1] Aboulnaga A., Alameldeen A., Naughton J.: Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. VLDB Conf. (2001) 591-600
[2] Aboulnaga A., Chaudhuri S.: Self-tuning Histograms: Building Histograms Without Looking at Data. SIGMOD Conf. (1999) 181-192
[3] Aboulnaga A., Naughton J.: Accurate Estimation of the Cost of Spatial Selections. ICDE (2000) 123-134
[4] Acharya S., Gibbons P., Poosala V., Ramaswamy S.: Join Synopses for Approximate Query Answering. SIGMOD Conf. (1998) 275-286
[5] Acharya S., Poosala V., Ramaswamy S.: Selectivity Estimation in Spatial Databases. SIGMOD Conf. (1999) 13-24
[6] Alsabti K., Ranka S., Singh V.: A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data. VLDB Conf. (1997) 346-355
[7] Barbará D., et al.: The New Jersey Data Reduction Report. Data Engineering Bulletin 20:4 (1997) 3-45
[8] Belussi A., Faloutsos C.: Estimating the Selectivity of Spatial Queries Using the ‘Correlation’ Fractal Dimension. VLDB Conf. (1995) 299-310
[9] Blakeley J., Kline N.: Personal communication. (2003)
[10] Blohsfeld B., Korus D., Seeger B.: A Comparison of Selectivity Estimators for Range Queries on Metric Attributes. SIGMOD Conf. (1999) 239-250
[11] Bruno N., Chaudhuri S., Gravano L.: STHoles: A Multidimensional Workload-Aware Histogram. SIGMOD Conf. (2001) 294-305
[12] Bruno N., Chaudhuri S., Gravano L.: Exploiting Statistics on Query Expressions for Optimization. SIGMOD Conf. (2002) 263-274
[13] Buccafurri F., Rosaci D., Saccà D.: Improving Range Query Estimation on Histograms. ICDE (2002) 628-638
[14] Chakrabarti K., Garofalakis M., Rastogi R., Shim K.: Approximate Query Processing Using Wavelets. VLDB Journal 10:2-3 (2001) 199-223
[15] Chaudhuri S., Motwani R., Narasayya V.: Random Sampling for Histogram Construction: How Much is Enough? SIGMOD Conf. (1998) 436-447
[16] Chen C., Roussopoulos N.: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conf. (1994) 161-172
[17] Choi Y.-J., Chung C.-W.: Selectivity Estimation for Spatio-Temporal Queries to Moving Objects. SIGMOD Conf. (2002) 440-451
[18] De Boor C.: A Practical Guide to Splines. Springer (1994)
[19] Deshpande A., Garofalakis M., Rastogi R.: Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD Conf. (2001) 199-210
[20] Donjerkovic D., Ramakrishnan R.: Probabilistic Optimization of Top N Queries. VLDB Conf. (1999) 411-422
[21] Flickner M., et al.: Query by Image and Video Content: The QBIC System. IEEE Computer 28:9 (1995) 23-32
[22] Freire J., Haritsa J., Ramanath M., Roy P., Siméon J.: StatiX: Making XML Count. SIGMOD Conf. (2002) 181-191
[23] Furtado P., Madeira H.: Summary Grids: Building Accurate Multidimensional Histograms. DASFAA Conf. (1999) 187-194
[24] Garofalakis M., Gibbons P.: Wavelet Synopses with Error Guarantees. SIGMOD Conf. (2002) 476-487
[25] Getoor L., Taskar B., Koller D.: Selectivity Estimation Using Probabilistic Models. SIGMOD Conf. (2001) 461-472
[26] Gibbons P., Matias Y., Poosala V.: Fast Incremental Maintenance of Approximate Histograms. VLDB Conf. (1997) 466-475
[27] Gilbert A., Guha S., Indyk P., Kotidis Y., Muthukrishnan S., Strauss M.: Fast, Small-Space Algorithms for Approximate Histogram Maintenance. ACM STOC (2002) 389-398
[28] Gilbert A., Kotidis Y., Muthukrishnan S., Strauss M.: Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. VLDB Conf. (2001) 79-88
[29] Gilbert A., Kotidis Y., Muthukrishnan S., Strauss M.: How to Summarize the Universe: Dynamic Maintenance of Quantiles. VLDB Conf. (2002) 454-465
[30] Guha S., Indyk P., Muthukrishnan S., Strauss M.: Histogramming Data Streams with Fast Per-Item Processing. ICALP Conf. (2002) 681-692
[31] Gunopulos D., Kollios G., Tsotras V., Domeniconi C.: Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes. SIGMOD Conf. (2000) 463-474
[32] Haas P., Hellerstein J.: Ripple Joins for Online Aggregation. SIGMOD Conf. (1999) 287-298
[33] Hellerstein J., Haas P., Wang H.: Online Aggregation. SIGMOD Conf. (1997) 171-182
[34] Ioannidis Y.: Universality of Serial Histograms. VLDB Conf. (1993) 256-267
[35] Ioannidis Y.: Approximations in Database Systems. ICDT (2003) 16-30
[36] Ioannidis Y., Christodoulakis S.: On the Propagation of Errors in the Size of Join Results. SIGMOD Conf. (1991) 268-277
[37] Ioannidis Y., Christodoulakis S.: Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results. ACM TODS 18:4 (1993) 709-748
[38] Ioannidis Y., Poosala V.: Balancing Histogram Optimality and Practicality for Query Result Size Estimation. SIGMOD Conf. (1995) 233-244
[39] Ioannidis Y., Poosala V.: Histogram-Based Approximation of Set-Valued Query-Answers. VLDB Conf. (1999) 174-185
[40] Jagadish H. V., Jin H., Ooi B. C., Tan K.-L.: Global Optimization of Histograms. SIGMOD Conf. (2001) 223-234
[41] Jagadish H. V., Kapitskaia O., Ng R., Srivastava D.: Multi-Dimensional Substring Selectivity Estimation. VLDB Conf. (1999) 387-398
[42] Jagadish H. V., Koudas N., Muthukrishnan S., Poosala V., Sevcik K., Suel T.: Optimal Histograms with Quality Guarantees. VLDB Conf. (1998) 275-286
[43] Jagadish H. V., Ng R., Srivastava D.: Substring Selectivity Estimation. PODS Symposium (1999) 249-260
[44] Koudas N., Muthukrishnan S., Srivastava D.: Optimal Histograms for Hierarchical Range Queries. PODS Symposium (2000) 196-204
[45] Krishnan P., Vitter J., Iyer B.: Estimating Alphanumeric Selectivity in the Presence of Wildcards. SIGMOD Conf. (1996) 282-293
[46] König A., Weikum G.: Combining Histograms and Parametric Curve Fitting for Feedback-Driven Query Result-Size Estimation. VLDB Conf. (1999) 423-434
[47] Kooi R.: The Optimization of Queries in Relational Databases. PhD Thesis, Case Western Reserve University (1980)
[48] Lee J.-H., Kim D.-H., Chung C.-W.: Multi-Dimensional Selectivity Estimation Using Compressed Histogram Information. SIGMOD Conf. (1999) 205-214
[49] Lim L., Wang M., Padmanabhan S., Vitter J., Parr R.: XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. VLDB Conf. (2002) 442-453
[50] Lohman G.: Personal communication. (2003)
[51] Luo G., Ellmann C., Haas P., Naughton J.: A Scalable Hash Ripple Join Algorithm. SIGMOD Conf. (2002) 252-262
[52] Manku G. S., Motwani R.: Approximate Frequency Counts over Data Streams. VLDB Conf. (2002) 346-357
[53] Manku G. S., Rajagopalan S., Lindsay B.: Approximate Medians and Other Quantiles in One Pass and with Limited Memory. SIGMOD Conf. (1998) 426-435
[54] Manku G. S., Rajagopalan S., Lindsay B.: Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. SIGMOD Conf. (1999) 251-262
[55] Marshall A., Olkin I.: Inequalities: Theory of Majorization and its Applications. Academic Press (1986)
[56] Matias Y., Vitter J., Wang M.: Wavelet-Based Histograms for Selectivity Estimation. SIGMOD Conf. (1998) 448-459
[57] Matias Y., Vitter J., Wang M.: Dynamic Maintenance of Wavelet-Based Histograms. VLDB Conf. (2000) 101-110
[58] Muralikrishna M., DeWitt D.: Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries. SIGMOD Conf. (1988) 28-36
[59] Muthukrishnan S., Poosala V., Suel T.: On Rectangular Partitionings in Two Dimensions: Algorithms, Complexity, and Applications. ICDT (1999) 236-256
[60] Papadopoulos A., Manolopoulos Y.: Structure-Based Similarity Search with Graph Histograms. DEXA Workshop (1999) 174-179
[61] Pass G., Zabih R.: Comparing Images Using Joint Histograms. Multimedia Systems (1999) 234-240
[62] Piatetsky-Shapiro G., Connell C.: Accurate Estimation of the Number of Tuples Satisfying a Condition. SIGMOD Conf. (1984) 256-276
[63] Polyzotis N., Garofalakis M.: Statistical Synopses for Graph-Structured XML Databases. SIGMOD Conf. (2002) 358-369
[64] Polyzotis N., Garofalakis M.: Structure and Value Synopses for XML Data Graphs. VLDB Conf. (2002) 466-477
[65] Poosala V., Ioannidis Y.: Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing. VLDB Conf. (1996) 448-459
[66] Poosala V., Ioannidis Y.: Selectivity Estimation Without the Attribute Value Independence Assumption. VLDB Conf. (1997) 486-495
[67] Poosala V., Ioannidis Y., Haas P., Shekita E.: Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conf. (1996) 294-305
[68] Sitzmann I., Stuckey P.: Improving Temporal Joins Using Histograms. DEXA Conf. (2000) 488-498
[69] Stigler S.: The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press (1986)
[70] Stillger M., Lohman G., Markl V., Kandil M.: LEO - DB2’s LEarning Optimizer. VLDB Conf. (2001) 19-28
[71] Sun W., Ling Y., Rishe N., Deng Y.: An Instant and Accurate Size Estimation Method for Joins and Selection in a Retrieval-Intensive Environment. SIGMOD Conf. (1993) 79-88
[72] Thaper N., Guha S., Indyk P., Koudas N.: Dynamic Multidimensional Histograms. SIGMOD Conf. (2002) 428-439
[73] Theodoridis S., Koutroumbas K.: Pattern Recognition. Academic Press, 2nd edition (2003)
[74] Tufte E.: The Visual Display of Quantitative Information. Graphics Press (1983)
[75] Vitter J., Wang M.: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets. SIGMOD Conf. (1999) 193-204
[76] Vitter J., Wang M., Iyer B.: Data Cube Approximation and Histograms via Wavelets. CIKM Conf. (1998) 96-104
[77] Wang M., Vitter J., Iyer B.: Selectivity Estimation in the Presence of Alphanumeric Correlations. ICDE (1997) 169-180
[78] Witkowski A.: Personal communication. (2003)
[79] Wu Y., Patel J., Jagadish H. V.: Using Histograms to Estimate Answer Sizes for XML Queries. Information Systems 28:1-2 (2003) 33-59