
The Many Facets of Apache - PowerPoint Presentation

Uploaded On 2016-04-11


The Many Facets of Apache Solr

Yonik Seeley, Lucid Imagination
yonik@lucidimagination.com, Oct 20 2011

What I Will Cover

What is Faceted Search
Solr’s Faceted Search
Tips & Tricks
Performance & Algorithms

My Background

Creator of Solr
Co-founder of Lucid Imagination
Expertise: Distributed Search systems and performance
Lucene/Solr committer, a member of the Lucene PMC, member of the Apache Software Foundation
Work: CNET Networks, BEA, Telcordia, among others
M.S. in Computer Science, Stanford

What is Faceted Search

Faceted Search Example

Manufacturer is a facet, a way of categorizing the results.
Canon, Sony, and Nikon are constraints, or facet values.
The breadcrumb trail shows what constraints have already been applied and allows for their removal.
The facet count or constraint count shows how many results match each value.

Key Elements of Faceted Search

No hierarchy of options is enforced
    Users can apply facet constraints in any order
    Users can remove facet constraints in any order
No surprises
    The user is only given facets and constraints that make sense in the context of the items they are looking at
    The user always knows what to expect before they apply a constraint
Also known as guided navigation, faceted navigation, faceted browsing, parametric search

Solr’s Faceted Search

Field Faceting

Specifies a Field to be used as a Facet
Uses each term indexed in that Field as a Constraint
Field must be indexed
Can be used multiple times for multiple fields

    q = iPhone
    fq = inStock:true
    facet = true
    facet.field = color
    facet.field = category

Field Faceting Response

http://localhost:8983/solr/select?q=iPhone&fq=inStock:true&facet=true&facet.field=color&facet.field=category

    <lst name="facet_counts">
      <lst name="facet_fields">
        <lst name="color">
          <int name="red">17</int>
          <int name="green">6</int>
          <int name="blue">2</int>
          <int name="yellow">2</int>
        </lst>
        <lst name="category">
          <int name="accessories">16</int>
          <int name="electronics">11</int>
        </lst>
      </lst>
    </lst>

Or if you prefer JSON…

http://localhost:8983/solr/select?q=iPhone&fq=inStock:true&facet=true&facet.field=color&facet.field=category&wt=json

    "facet_counts":{
      "facet_fields":{
        "color":["red",17,"green",6,"blue",2,"yellow",2],
        "category":["accessories",16,"electronics",11]}}

Applying Constraints

Assume the user clicked on “red”…

http://localhost:8983/solr/select?q=iPhone&fq=inStock:true&fq=color:red&facet=true&facet.field=color&facet.field=category

Simply add another filter query to apply that constraint.
The facet.field=color parameter is now redundant and can be removed (assuming a single-valued field).
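The flat [value, count, value, count, …] lists in the JSON response and the "add another fq" step can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names facet_pairs and apply_constraint are made up here, and the facet_fields literal is the excerpt from the slide, not live Solr output.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"

# Excerpt of facet_counts.facet_fields as shown on the JSON slide.
facet_fields = {
    "color": ["red", 17, "green", 6, "blue", 2, "yellow", 2],
    "category": ["accessories", 16, "electronics", 11],
}


def facet_pairs(flat):
    """Turn ["red", 17, "green", 6] into [("red", 17), ("green", 6)]."""
    return list(zip(flat[0::2], flat[1::2]))


def apply_constraint(params, field, value):
    """Add fq=field:value and drop the facet.field for that field.

    Dropping the facet.field assumes the field is single-valued, so
    faceting on it again after filtering would return only one value.
    """
    kept = [(k, v) for k, v in params if (k, v) != ("facet.field", field)]
    return kept + [("fq", f"{field}:{value}")]


params = [
    ("q", "iPhone"),
    ("fq", "inStock:true"),
    ("facet", "true"),
    ("facet.field", "color"),
    ("facet.field", "category"),
]

# Render the constraints a UI would show for the color facet.
for value, count in facet_pairs(facet_fields["color"]):
    print(f"{value} ({count})")

# The user clicks "red": build the follow-up request.
clicked = apply_constraint(params, "color", "red")
url = BASE_URL + "?" + urlencode(clicked)
print(url)
```

Using a list of (name, value) tuples rather than a dict keeps repeated parameters such as fq and facet.field intact when the URL is encoded.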

facet.field Options

facet.prefix - Restricts the possible constraints to only indexed values with a specified prefix.
facet.mincount=0 - Restricts the constraints returned to those containing a minimum number of documents in the result set.
facet.sort=count - The ordering of constraints: count or index
facet.offset=0 - Indicates how many constraints in the specified sort ordering should be skipped in the response.
facet.limit=100 - The number of constraints to return
facet.missing=false - Return the number of docs with no value in the field

facet.query

Specifies a query string to be used as a Facet Constraint
Typically used multiple times to get multiple (discrete) sets
Any type of query supported

    facet.query = rank:[* TO 20]
    facet.query = rank:[21 TO *]

facet.query Results

    <result numFound="27" ... />
    ...
    <lst name="facet_counts">
      <lst name="facet_queries">
        <int name="rank:[* TO 20]">2</int>
        <int name="rank:[21 TO *]">15</int>
      </lst>
      ...

Spatial faceting

    q=*:*&facet=true
    pt=45.15,-93.85
    sfield=store
    facet.query={!geofilt d=5}
    facet.query={!geofilt d=10}

pt: the lat,lon center point to search from
sfield: name of the field containing lat+lon data
geofilt: geospatial query type

    "facet_counts":{
      "facet_queries":{
        "{!geofilt d=5}":3,
        "{!geofilt d=10}":6},

Range Faceting

Simpler than a sequence of facet.query params

http://...&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50

    "facet_counts":{
      "facet_ranges":{
        "price":{
          "counts":[
            "0.0",5,
            "50.0",2,
            "100.0",0,
            "150.0",2,
            "200.0",0,
            "250.0",1,
            "300.0",2,
            "350.0",2,
            "400.0",0,
            "450.0",1],
          "gap":50.0,
          "start":0.0,
          "end":500.0}}}}

Date Faceting

facet.date is deprecated, use facet.range on a date field now
Creates Constraints based on evenly sized date ranges using the Gregorian Calendar
Ranges are specified using "Date Math" so they DWIM in spite of variable length months and leap years

    facet.range = pubdate
    facet.range.start = NOW/YEAR-1YEAR
    facet.range.end = NOW/MONTH+1MONTH
    facet.range.gap = +1MONTH
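To see why a +1MONTH gap cannot be a fixed number of seconds, the bucket boundaries can be sketched with calendar arithmetic. This is a plain-Python illustration, not Solr's Date Math implementation; the function name month_starts is made up, and the start and end values are what NOW/YEAR-1YEAR and NOW/MONTH+1MONTH would resolve to on the talk's date (Oct 20 2011).

```python
from datetime import datetime


def month_starts(start, end):
    """Yield the first instant of each month in [start, end).

    Assumes start falls on the first day of a month, which is what
    rounding with /YEAR or /MONTH would give.
    """
    current = start
    while current < end:
        yield current
        # Step one calendar month forward, rolling the year after December.
        year = current.year + (current.month // 12)
        month = current.month % 12 + 1
        current = current.replace(year=year, month=month)


start = datetime(2010, 1, 1)   # NOW/YEAR-1YEAR evaluated in Oct 2011
end = datetime(2011, 11, 1)    # NOW/MONTH+1MONTH evaluated in Oct 2011

labels = [d.strftime("%Y-%m-%dT%H:%M:%SZ") for d in month_starts(start, end)]
print(len(labels), labels[0], labels[-1])
```

Each label is the lower bound of one monthly bucket, so February gets a shorter bucket than January without any special casing.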

Date Faceting Results

    "facet_counts":{
      "facet_ranges":{
        "pubdate":{
          "counts":[
            "2010-01-01T00:00:00Z",4,
            "2010-02-01T00:00:00Z",6,
            "2010-03-01T00:00:00Z",0,
            "2010-04-01T00:00:00Z",13,
            […]
            "2011-09-01T00:00:00Z",5,
            "2011-10-01T00:00:00Z",2],
          "gap":"+1MONTH",
          "start":"2010-01-01T00:00:00Z",
          "end":"2011-11-01T00:00:00Z"}}}

Range Faceting Options

facet.range.hardend=false - Determines what effective end value is used when the specified "start" and "end" don't divide into even "gap" sized buckets; true means the last Constraint range is cut off at "end" and may be shorter than the others, while false (the default) keeps the last range a full "gap" wide, so it may extend past "end"
facet.range.other=none - Allows you to specify what other Constraints you are interested in besides the generated ranges: before, after, between, none, all
facet.range.include=lower - Specifies what bounds are inclusive vs exclusive: lower, upper, edge, outer, all
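The bucket generation behind facet.range, including the effect of hardend, can be sketched as follows. This is a simplified model and not Solr's code: the function name and the sample prices are hypothetical, buckets are treated as lower-inclusive and upper-exclusive (the include=lower default), and the hardend behaviour follows the Solr reference documentation, where true clips the last range at end and false leaves it a full gap wide.

```python
def range_facet(values, start, end, gap, hardend=False):
    """Count values into consecutive [lower, upper) buckets of width gap.

    With hardend=True the last bucket is clipped at end; with
    hardend=False it keeps the full gap width and may run past end.
    """
    counts = []
    lower = start
    while lower < end:
        upper = lower + gap
        if hardend and upper > end:
            upper = end
        matching = sum(1 for v in values if lower <= v < upper)
        counts.append(((lower, upper), matching))
        lower = upper
    return counts


prices = [10, 60, 110, 125, 140]  # hypothetical field values

# start=0, end=120, gap=50 does not divide evenly.
soft = range_facet(prices, 0, 120, 50)                 # last bucket is 100..150
hard = range_facet(prices, 0, 120, 50, hardend=True)   # last bucket is 100..120

for bounds, n in soft:
    print("hardend=false", bounds, n)
for bounds, n in hard:
    print("hardend=true ", bounds, n)
```

The two runs differ only in the last bucket: the unclipped one also counts 125 and 140, which lie beyond the requested end.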

Pivot Faceting (trunk)

Computes a Matrix of Constraint Counts across multiple Facet Fields
Syntax: facet.pivot=field1,field2,field3,…

    facet.pivot = cat,inStock

                       | #docs | #docs w/ inStock:true | #docs w/ inStock:false
    cat:electronics    | 14    | 10                    | 4
    cat:memory         | 3     | 3                     | 0
    cat:connector      | 2     | 0                     | 2
    cat:graphics card  | 2     | 0                     | 2
    cat:hard drive     | 2     | 2                     | 0

Pivot Faceting (continued)

http://...&facet=true&facet.pivot=cat,popularity

    "facet_counts":{
      "facet_pivot":{
        "cat,popularity":[{
            "field":"cat",
            "value":"electronics",
            "count":14,
            "pivot":[{
                "field":"popularity",
                "value":6,
                "count":5},
              { "field":"popularity",
                "value":7,
                "count":4},
              { "field":"popularity",
                "value":1,
                "count":2}]},
          { "field":"cat",
            "value":"memory",
            "count":3,
            "pivot":[]},
          […]

The outer "count":14 means 14 docs w/ cat==electronics.
The nested "count":5 means 5 docs w/ cat==electronics && popularity==6.
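The nested shape of a two-field pivot can be reproduced with a small counting loop. This is a rough sketch of the idea, not how Solr computes pivots: the function name pivot_counts and the sample documents are hypothetical, and every field is assumed to be single-valued.

```python
from collections import Counter, defaultdict


def pivot_counts(docs, outer, inner):
    """Build a facet.pivot-style list for two single-valued fields."""
    outer_counts = Counter()
    inner_counts = defaultdict(Counter)
    for doc in docs:
        if outer not in doc:
            continue
        outer_counts[doc[outer]] += 1
        if inner in doc:
            inner_counts[doc[outer]][doc[inner]] += 1
    return [
        {
            "field": outer,
            "value": value,
            "count": count,
            # One nested entry per inner value seen under this outer value.
            "pivot": [
                {"field": inner, "value": v, "count": c}
                for v, c in inner_counts[value].most_common()
            ],
        }
        for value, count in outer_counts.most_common()
    ]


docs = [  # hypothetical documents
    {"cat": "electronics", "popularity": 6},
    {"cat": "electronics", "popularity": 6},
    {"cat": "electronics", "popularity": 7},
    {"cat": "memory"},
]

pivot = pivot_counts(docs, "cat", "popularity")
for row in pivot:
    print(row)
```

A document that lacks the inner field still counts toward the outer value, which is why an outer count can exceed the sum of its nested counts.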

Tips & Tricks

term QParser

Default Query Parser does special things with whitespace and punctuation
Problematic when "filtering" on Facet Field Constraints that contain whitespace, punctuation, or other reserved characters.
Use the term parser to filter on an exact Term

    fq = {!term f=category}Books & Magazines

OR

    fq = {!term f=category v=$t}
    t = Books & Magazines
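When such a filter is sent over HTTP, the "&" inside "Books & Magazines" also has to be URL-encoded, or it would be read as a parameter separator. A minimal sketch using Python's standard library follows; the helper name term_filter_params is made up for illustration, and only the {!term f=… v=$t} syntax comes from the slide.

```python
from urllib.parse import urlencode, parse_qs


def term_filter_params(field, value, ref="t"):
    """Build the fq/t parameter pair for an exact-term filter.

    The raw value travels in its own parameter (named by ref) and is
    dereferenced from the local params via v=$ref.
    """
    return [("fq", f"{{!term f={field} v=${ref}}}"), (ref, value)]


params = [("q", "*:*")] + term_filter_params("category", "Books & Magazines")
query_string = urlencode(params)
print(query_string)

# Decoding the query string gives back the original, unescaped values.
decoded = parse_qs(query_string)
print(decoded["fq"], decoded["t"])
```

Passing the value through a separate parameter keeps the local-params block free of any quoting concerns, since the term text never appears inside the braces.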

Taxonomy Facets

What If Your Documents Are Organized in a Taxonomy?

Taxonomy Facets: Data

Flattened Data

    Doc#1: NonFic > Law
    Doc#2: NonFic > Sci
    Doc#3: NonFic > Hist, NonFic > Sci > Phys

Indexed Terms (prepend number of nodes in path segment)

    Doc#1: 1/NonFic, 2/NonFic/Law
    Doc#2: 1/NonFic, 2/NonFic/Sci
    Doc#3: 1/NonFic, 2/NonFic/Hist, 2/NonFic/Sci, 3/NonFic/Sci/Phys
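Generating the depth-prefixed terms at index time is mechanical, and the same prefix scheme tells you which prefix selects the children of a node. The sketch below is one way this could look in client-side Python; the function names are invented for illustration and nothing here is part of Solr itself.

```python
def taxonomy_terms(path):
    """["NonFic", "Sci", "Phys"] -> one term per ancestor, depth-prefixed."""
    return [
        "%d/%s" % (depth, "/".join(path[:depth]))
        for depth in range(1, len(path) + 1)
    ]


def index_terms(paths):
    """Union of the terms for all taxonomy paths of one document, in order."""
    terms = []
    for path in paths:
        for term in taxonomy_terms(path):
            if term not in terms:
                terms.append(term)
    return terms


def child_prefix(term):
    """Prefix matching the direct children of a term: depth goes up by one."""
    depth, _, path = term.partition("/")
    return "%d/%s" % (int(depth) + 1, path)


doc3_terms = index_terms([["NonFic", "Hist"], ["NonFic", "Sci", "Phys"]])
print(doc3_terms)
print(child_prefix("2/NonFic/Sci"))
```

Because the depth is part of the term, a prefix such as 2/NonFic matches only second-level nodes and never their descendants.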

Taxonomy Facets: Initial Query

    facet.field = category
    facet.prefix = 2/NonFic
    facet.mincount = 1

    <result numFound="164" ...
    <lst name="facet_fields">
      <lst name="category">
        <int name="2/NonFic/Sci">2</int>
        <int name="2/NonFic/Hist">1</int>
        <int name="2/NonFic/Law">1</int>
      </lst>
    </lst>

Taxonomy Facets: Drill Down

    fq = {!term f=category}2/NonFic/Sci
    facet.field = category
    facet.prefix = 3/NonFic/Sci
    facet.mincount = 1

    <result numFound="2" ...
    <lst name="facet_fields">
      <lst name="category">
        <int name="3/NonFic/Sci/Phys">1</int>
      </lst>
    </lst>

Multi-Select Faceting

Very generic support
Reuses localParams syntax {!name=val}
Ability to tag filters
Ability to exclude certain filters when faceting, by tag

http://search.lucidimagination.com

    q = index replication
    facet = true
    fq = {!tag=pr}project:(lucene OR solr)
    facet.field = {!ex=pr}project
    facet.field = {!ex=src}source

Same Facet, Different Exclusions

    q = Hot Rod
    fq = {!df=colors tag=cx}purple green
    facet.field = {!key=all_colors ex=cx}colors
    facet.field = {!key=overlap_colors}colors

    "facet_counts":{
      "facet_fields":{
        "all_colors":["red",19,"green",6,"blue",2],
        "overlap_colors":["red",7,"green",6,"blue",1]
      }
    }

A key can be specified for a facet to change the name used to identify it in the response.
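The effect of tagging a filter and excluding it for one facet can be modelled in memory. The following is a toy model, not Solr's implementation: the documents are hypothetical, the function name facet_counts is invented, and the "purple green" filter is assumed to match a document that has either color.

```python
from collections import Counter


def facet_counts(docs, field, filters, exclude=()):
    """Count field values over docs that pass every non-excluded filter.

    filters maps a tag name to a predicate; tags listed in exclude are
    ignored for this facet only, mirroring {!ex=tag}.
    """
    active = [pred for tag, pred in filters.items() if tag not in exclude]
    counts = Counter()
    for doc in docs:
        if all(pred(doc) for pred in active):
            counts.update(doc.get(field, []))
    return counts


docs = [  # hypothetical multi-valued "colors" field
    {"colors": ["red", "purple"]},
    {"colors": ["green"]},
    {"colors": ["red"]},
    {"colors": ["blue", "green"]},
]

# Tagged filter "cx": keep docs that are purple or green (assumed OR semantics).
filters = {"cx": lambda d: bool({"purple", "green"} & set(d["colors"]))}

all_colors = facet_counts(docs, "colors", filters, exclude={"cx"})
overlap_colors = facet_counts(docs, "colors", filters)

print("all_colors    ", dict(all_colors))
print("overlap_colors", dict(overlap_colors))
```

The excluded variant shows what the user could still select, while the non-excluded variant shows how the other colors overlap with the current selection.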

“Pretty” facet.field Terms

Field Faceting uses Indexed Terms
Leverage copyField and TokenFilters that will give you good looking Constraints

    <tokenizer class="solr.PatternTokenizerFactory" pattern="(,|;)\s*" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " />
    <filter class="solr.PatternReplaceFilterFactory" pattern=" and " replacement=" &amp; " />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.CapitalizationFilterFactory" onlyFirstWord="false" />

“Pretty” facet.field Results

    {"id" : "MyExampleDoc",
     "category" : "books and magazines; computers,"
    }

copyField in schema:

    <copyField source="category" dest="category_pretty"/>

    "facet_counts":{
      "facet_fields":{
        "category_pretty":[
          "Books & Magazines",1,
          "Computers",1]
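To make the analysis chain easier to follow, here is a rough Python approximation of what the tokenizer and filters do to the example value. It is only an emulation for illustration: the function name pretty_terms is made up, Python's str.capitalize stands in for CapitalizationFilterFactory, and dropping empty tokens is an assumption about how the trailing comma is handled.

```python
import re


def pretty_terms(raw):
    """Approximate the "pretty" analysis chain on one raw field value."""
    terms = []
    # PatternTokenizer: split on "," or ";" plus any following whitespace.
    for token in re.split(r"(?:,|;)\s*", raw):
        token = re.sub(r"\s+", " ", token)        # collapse whitespace runs
        token = token.replace(" and ", " & ")     # " and " -> " & "
        token = token.strip()                     # TrimFilter
        if not token:
            continue                              # assumed: empty tokens are dropped
        # Capitalize every word (onlyFirstWord="false").
        terms.append(" ".join(word.capitalize() for word in token.split(" ")))
    return terms


terms = pretty_terms("books and magazines; computers,")
print(terms)
```

The split pattern uses a non-capturing group because Python's re.split would otherwise return the separators as tokens.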

facet.field Labels

facet.query params are echoed verbatim when returning the constraint counts
Optionally, one can declare a facet.query in solrconfig.xml and include a "label" that the presentation layer can parse out for display.

    facet.query = {!label='Hot!'} +pop:[1 TO *] +pub_date:[NOW/DAY-1DAY TO *]

    "facet_queries" : {
      "{!label='Hot!' } pop:[1 TO *] ..." : 15
    }

“label” has no meaning to Solr

Performance

facet.method

name | facet.method | description | memory | CPU
enum | enum | Iterates over terms, calculating set intersections | filter-per-term in the filterCache | ~O(nTerms)
field cache | fc (single-valued field) | Iterates over documents, counting terms | Lucene FieldCache Entry… int[maxDoc]+terms | O(nDocs)
UnInvertedField | fc (multi-valued field) | Iterates over documents, counting terms | O(maxDoc * num terms per doc) | ~O(nDocs)
Per-segment field cache | fcs | Like field-cache, just better for NRT since FieldCacheEntry is at segment level. | Lucene FieldCache Entries… int[maxDoc]+terms | O(nDocs) + O(nTerms)

facet.method=fc (single-valued field)

[Figure: Lucene FieldCache Entry (StringIndex) for the hero field, used to answer q=Juggernaut&facet=true&facet.field=hero]

order: for each doc, an index into the lookup array
lookup: the string values ((null), batman, flash, spiderman, superman, wolverine)
For each of the documents matching the base query (Juggernaut), look up its order value and increment the matching accumulator slot; the highest counts are collected in a priority queue (Batman, 3; flash, 5)

Mem = int[maxDoc] + unique_values
CPU = O(nDocs in base set)
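The counting loop behind the single-valued fc method can be written down compactly. This is a simplified sketch rather than Solr's code: the order values and base document ids are made up for the example, the function name facet_fc is invented, and heapq.nlargest stands in for the priority queue.

```python
import heapq

# lookup: the distinct string values; slot 0 is reserved for "no value".
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
# order: for each doc id, an index into lookup (hypothetical values).
order = [5, 3, 5, 1, 4, 5, 2, 1]


def facet_fc(order, lookup, base_docs, limit):
    """Count terms for the docs in base_docs and return the top entries."""
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        # One array read and one increment per matching document.
        accumulator[order[doc]] += 1
    candidates = [
        (term, count)
        for term, count in zip(lookup, accumulator)
        if term is not None and count > 0
    ]
    return heapq.nlargest(limit, candidates, key=lambda pair: pair[1])


base_docs = [0, 2, 3, 5, 6, 7]  # hypothetical docs matching the base query
top = facet_fc(order, lookup, base_docs, limit=2)
print(top)
```

The work is proportional to the number of matching documents, which is why this method does well when the base set is small even if the field has many unique terms.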

facet.method=fcs (trunk) (per-segment single-valued)

[Figure: the base DocSet is spread over four segments (Segment1 to Segment4), each with its own FieldCacheEntry. Four threads (thread1 to thread4) look up the documents and increment one accumulator per segment (accumulator1 to accumulator4). A FieldCache + accumulator merger (priority queue) combines the per-segment results into the final priority queue (Batman, 3; flash, 5).]

facet.method=fcs

Controllable multi-threading

    facet.method=fcs
    facet.field={!threads=4}myfield

Disadvantages
    Larger memory use (FieldCaches + accumulators)
    Slower (extra FieldCache merge step needed) - O(nTerms)
Advantages
    Rebuilds FieldCache entries only for new segments (NRT friendly)
    Multi-threaded
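The per-segment approach can be sketched as independent counting per segment followed by a merge keyed on the term string, since each segment has its own lookup array. The code below is an illustrative model only: the segment data is hypothetical, the function names are invented, and a thread pool stands in for Solr's threads option.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, base_docs):
    """Accumulate counts for one segment using its own order/lookup arrays."""
    accumulator = [0] * len(segment["lookup"])
    for doc in base_docs:
        accumulator[segment["order"][doc]] += 1
    return accumulator


def facet_fcs(segments, base_docs_per_segment, limit, threads=4):
    """Count each segment independently, then merge by term string."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        accumulators = list(pool.map(count_segment, segments, base_docs_per_segment))
    merged = Counter()
    # Merge step: walk every term of every segment, hence the O(nTerms) cost.
    for segment, accumulator in zip(segments, accumulators):
        for term, count in zip(segment["lookup"], accumulator):
            if term is not None and count:
                merged[term] += count
    return merged.most_common(limit)


segments = [  # hypothetical segments; doc ids are local to each segment
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 1]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2, 0]},
]
base_docs_per_segment = [[0, 1, 2], [0, 1, 3]]

top = facet_fcs(segments, base_docs_per_segment, limit=2)
print(top)
```

Only the segments that changed need new order/lookup arrays after a commit, which is the near-real-time advantage; the price is the extra merge over all terms on every request.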

Per-segment faceting performance comparison

Test index: 10M documents, 18 segments, single valued field

A. Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

    Time for request*      | facet.method=fc | facet.method=fcs
    static index           | 3 ms            | 244 ms
    quickly changing index | 1388 ms         | 267 ms

B. Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

    Time for request*      | facet.method=fc | facet.method=fcs
    static index           | 26 ms           | 34 ms
    quickly changing index | 741 ms          | 94 ms

*complete request time, measured externally

facet.method=fc (multi-valued field)

UnInvertedField - like single-valued FieldCache algorithm, but with multi-valued FieldCache
Good for many unique terms, relatively few values per doc
    Best case: 50x faster, 5x smaller than “enum” (100K unique values, 1-5 per doc)
    O(n_docs), but optimization to count the inverse when n>maxDoc/2
Memory efficient
    Terms in a document are delta coded variable width ords (vints)
    Ord list for document packed in an int or in a shared byte[]
    Hybrid approach: “big terms” that match >5% of index use filterCache instead
    Only 1/128th of string values in memory

facet.method=fc fieldValueCache

Implicit cache with UnInvertedField entries
Not autowarmed - use static warming request
http://localhost:8983/solr/admin/stats.jsp (mem size, time to create, etc)

Faceting: fieldValueCache

Implicit cache with UnInvertedField entries
Not autowarmed - use static warming request
http://localhost:8983/solr/admin/stats.jsp (mem size, time to create, etc)

    item_cat:{field=cat,memSize=5376,tindexSize=52,time=2,phase1=2,nTerms=16,bigTerms=10,termInstances=6,uses=44}

Multi-valued faceting: facet.method=enum

facet.method=enum
For each term in field:
    Retrieve filter
    Calculate intersection size

[Figure: Lucene Inverted Index (on disk) for the hero field]

    batman:    1 3 5 8
    flash:     0 1 5
    superman:  2 4 7
    wolverine: 0 6 9
    spiderman: 1 2 7 8

Solr filterCache (in memory) holds one filter per term, for example hero:batman = 1 3 5 8 and hero:flash = 0 1 5. Each filter is intersected with the docs matching the base query, and the intersection count goes into a priority queue (batman=2).
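The enum loop described above maps directly onto set intersections. The sketch below uses the posting lists from the figure, but the base document set is an assumption chosen for illustration, the function name facet_enum is invented, and Python sets stand in for Solr's cached filters.

```python
# Posting lists per term, as read from the inverted-index figure.
postings = {
    "batman": {1, 3, 5, 8},
    "flash": {0, 1, 5},
    "superman": {2, 4, 7},
    "wolverine": {0, 6, 9},
    "spiderman": {1, 2, 7, 8},
}


def facet_enum(postings, base_docs, min_count=1):
    """For each term, intersect its doc set with the base set and count."""
    counts = {}
    for term, docs in postings.items():
        n = len(docs & base_docs)  # intersection size
        if n >= min_count:
            counts[term] = n
    # Highest count first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


base_docs = {0, 1, 5, 9}  # assumed docs matching the base query
result = facet_enum(postings, base_docs)
print(result)
```

The cost grows with the number of terms in the field rather than with the size of the base set, which matches the O(n_terms_in_field) behaviour of this method.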

facet.method=enum

O(n_terms_in_field)
Short circuits based on term.df
filterCache entries int[ndocs] or BitSet(maxDoc)
Size filterCache appropriately
Either autowarm filterCache, or use static warming queries (via newSearcher event) in solrconfig.xml
facet.enum.cache.minDf - prevent filterCache use for small terms
    Also useful for huge index w/ no time constraints

Q&A