Slide1
The Many Facets of Apache Solr
Yonik Seeley, Lucid Imagination
yonik@lucidimagination.com, Oct 20 2011
Slide2
What I Will Cover
What is Faceted Search
Solr’s Faceted Search
Tips & Tricks
Performance & Algorithms
Slide3
My Background
Creator of Solr
Co-founder of Lucid Imagination
Expertise: Distributed Search systems and performance
Lucene/Solr committer, a member of the Lucene PMC, member of the Apache Software Foundation
Work: CNET Networks, BEA, Telcordia, among others
M.S. in Computer Science, Stanford
Slide4
What is Faceted Search
Slide5
Faceted Search Example
Manufacturer is a facet, a way of categorizing the results
Canon, Sony, and Nikon are constraints, or facet values
The breadcrumb trail shows what constraints have already been applied and allows for their removal
The facet count or constraint count shows how many results match each value
Slide6
Key Elements of Faceted Search
No hierarchy of options is enforced
  Users can apply facet constraints in any order
  Users can remove facet constraints in any order
No surprises
  The user is only given facets and constraints that make sense in the context of the items they are looking at
  The user always knows what to expect before they apply a constraint
Also known as guided navigation, faceted navigation, faceted browsing, parametric search
Slide7
Solr’s Faceted Search
Slide8
Field Faceting
Specifies a Field to be used as a Facet
Uses each term indexed in that Field as a Constraint
Field must be indexed
Can be used multiple times for multiple fields

q = iPhone
fq = inStock:true
facet = true
facet.field = color
facet.field = category
Slide9
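A minimal sketch of how a client might assemble the parameters above into a request URL, using only Python's standard library. The helper name `build_facet_url` is hypothetical; the host and `/solr/select` path are the ones used in the example URLs in this deck.

```python
from urllib.parse import urlencode

def build_facet_url(base, q, filters, facet_fields):
    """Build a select URL with field faceting enabled (illustrative helper)."""
    params = [("q", q)]
    # fq and facet.field may each be repeated, so keep a list of pairs
    params += [("fq", fq) for fq in filters]
    params.append(("facet", "true"))
    params += [("facet.field", field) for field in facet_fields]
    return base + "?" + urlencode(params)

url = build_facet_url(
    "http://localhost:8983/solr/select",
    "iPhone",
    ["inStock:true"],
    ["color", "category"],
)
print(url)
```

A list of pairs is used instead of a dict because a dict cannot hold the repeated `facet.field` key. Note that `urlencode` percent-encodes the colon in `inStock:true`.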
Field Faceting Response
http://localhost:8983/solr/select?q=iPhone&fq=inStock:true&facet=true&facet.field=color&facet.field=category

<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="color">
      <int name="red">17</int>
      <int name="green">6</int>
      <int name="blue">2</int>
      <int name="yellow">2</int>
    </lst>
    <lst name="category">
      <int name="accessories">16</int>
      <int name="electronics">11</int>
    </lst>
  </lst>
</lst>
Slide10
Or if you prefer JSON…
http://localhost:8983/solr/select?q=iPhone&fq=inStock:true&facet=true&facet.field=color&facet.field=category&wt=json

"facet_counts":{
  "facet_fields":{
    "color":[
      "red",17,
      "green",6,
      "blue",2,
      "yellow",2],
    "category":[
      "accessories",16,
      "electronics",11]}}
Slide11
Applying Constraints
Assume the user clicked on “red”…
http://localhost:8983/solr/select?q=iPhone&fq=inStock:true&fq=color:red&facet=true&facet.field=color&facet.field=category
Simply add another filter query to apply that constraint
The facet.field=color parameter is now redundant and can be removed (assuming a single valued field)
Slide12
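One way a presentation layer might turn the JSON facet counts into "click to apply" links is sketched below. It pairs up the flat `[value, count, ...]` list and appends one extra `fq` per constraint. The function names are hypothetical. The sketch builds `field:value` filters directly, so it assumes values without whitespace or reserved characters; the term parser covered under Tips & Tricks handles the rest.

```python
from urllib.parse import urlencode

def facet_pairs(flat):
    """Turn a flat [value, count, value, count, ...] list into (value, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))

def constraint_links(base, params, field, flat_counts):
    """For each constraint, build the URL that applies it as an additional fq."""
    links = []
    for value, count in facet_pairs(flat_counts):
        extra = params + [("fq", "%s:%s" % (field, value))]
        links.append((value, count, base + "?" + urlencode(extra)))
    return links

params = [("q", "iPhone"), ("fq", "inStock:true"),
          ("facet", "true"), ("facet.field", "category")]
links = constraint_links("http://localhost:8983/solr/select",
                         params, "color", ["red", 17, "green", 6])
for value, count, url in links:
    print(value, count, url)
```

Removing a constraint is the reverse operation: drop the matching `fq` pair from the parameter list and rebuild the URL.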
facet.field Options
facet.prefix - Restricts the possible constraints to only indexed values with a specified prefix.
facet.mincount=0 - Restricts the constraints returned to those containing a minimum number of documents in the result set.
facet.sort=count - The ordering of constraints: count or index
facet.offset=0 - Indicates how many constraints in the specified sort ordering should be skipped in the response.
facet.limit=100 - The number of constraints to return
facet.missing=false - Return the number of docs with no value in the field
Slide13
facet.query
Specifies a query string to be used as a Facet Constraint
Typically used multiple times to get multiple (discrete) sets
Any type of query supported

facet.query = rank:[* TO 20]
facet.query = rank:[21 TO *]
Slide14
facet.query Results
<result numFound="27" ... />
...
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="rank:[* TO 20]">2</int>
    <int name="rank:[21 TO *]">15</int>
  </lst>
  ...
Slide15
Spatial faceting
q=*:*&facet=true
pt=45.15,-93.85
sfield=store
facet.query={!geofilt d=5}
facet.query={!geofilt d=10}

pt: the lat,lon center point to search from
sfield: name of the field containing lat+lon data
geofilt: geospatial query type

"facet_counts":{
  "facet_queries":{
    "{!geofilt d=5}":3,
    "{!geofilt d=10}":6},
Slide16
Range Faceting
Simpler than a sequence of facet.query params
http://...&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50

"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":[
        "0.0",5,
        "50.0",2,
        "100.0",0,
        "150.0",2,
        "200.0",0,
        "250.0",1,
        "300.0",2,
        "350.0",2,
        "400.0",0,
        "450.0",1],
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
Slide17
Date Faceting
facet.date is deprecated, use facet.range on a date field now
Creates Constraints based on evenly sized date ranges using the Gregorian Calendar
Ranges are specified using "Date Math" so they do what you mean (DWIM) in spite of variable length months and leap years

facet.range = pubdate
facet.range.start = NOW/YEAR-1YEAR
facet.range.end = NOW/MONTH+1MONTH
facet.range.gap = +1MONTH
Slide18
Date Faceting Results
"facet_counts":{
  "facet_ranges":{
    "pubdate":{
      "counts":[
        "2010-01-01T00:00:00Z",4,
        "2010-02-01T00:00:00Z",6,
        "2010-03-01T00:00:00Z",0,
        "2010-04-01T00:00:00Z",13,
        […]
        "2011-09-01T00:00:00Z",5,
        "2011-10-01T00:00:00Z",2],
      "gap":"+1MONTH",
      "start":"2010-01-01T00:00:00Z",
      "end":"2011-11-01T00:00:00Z"}}}
Slide19
Range Faceting Options
facet.range.hardend=false - Determines what effective end value is used when the specified "start" and "end" don't divide into even "gap" sized buckets; true means the last Constraint range may be shorter than the others, false (the default) extends the effective end so that every range is "gap" sized
facet.range.other=none - Allows you to specify what other Constraints you are interested in besides the generated ranges: before, after, between, none, all
facet.range.include=lower - Specifies what bounds are inclusive vs exclusive: lower, upper, edge, outer, all
Slide20
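The effect of facet.range.hardend is easiest to see with numbers. The sketch below is not Solr's code; it is a small illustration of the documented behaviour for numeric ranges, where `hardend=True` clips the last bucket at "end" and `hardend=False` keeps every bucket "gap" wide. The function name `range_buckets` is hypothetical.

```python
def range_buckets(start, end, gap, hardend=False):
    """Return the (lower, upper) bounds of each generated range (illustration only)."""
    buckets = []
    low = start
    while low < end:
        high = low + gap
        if hardend and high > end:
            high = end  # clip the last range at the requested end
        buckets.append((low, high))
        low = high
    return buckets

# start=0, end=120, gap=50 does not divide evenly
soft = range_buckets(0, 120, 50)                # last range runs past "end"
hard = range_buckets(0, 120, 50, hardend=True)  # last range is shorter
print(soft)
print(hard)
```

With the evenly dividing parameters from the price example (0, 500, 50), both settings produce the same ten ranges.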
Pivot Faceting (trunk)
Computes a Matrix of Constraint Counts across multiple Facet Fields
Syntax: facet.pivot=field1,field2,field3,…

facet.pivot = cat,inStock

                    #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics     14      10                      4
cat:memory          3       3                       0
cat:connector       2       0                       2
cat:graphics card   2       0                       2
cat:hard drive      2       2                       0
Slide21
Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity

"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":6,
            "count":5},
          { "field":"popularity",
            "value":7,
            "count":4},
          { "field":"popularity",
            "value":1,
            "count":2}]},
      { "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]

"count":14 on the cat entry: 14 docs w/ cat==electronics
"count":5 on the nested entry: 5 docs w/ cat==electronics && popularity==6
Slide22
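Because the pivot response is nested, a client usually has to walk it recursively. The sketch below flattens the structure shown above into (path, count) rows. The function name `flatten_pivot` is hypothetical, and the input is the slide's excerpt typed in as a Python literal.

```python
def flatten_pivot(nodes, path=()):
    """Walk a facet.pivot list and return ((field, value), ...) paths with counts."""
    rows = []
    for node in nodes:
        here = path + ((node["field"], node["value"]),)
        rows.append((here, node["count"]))
        # deeper levels, if any, live under the "pivot" key
        rows.extend(flatten_pivot(node.get("pivot", []), here))
    return rows

pivot = [
    {"field": "cat", "value": "electronics", "count": 14, "pivot": [
        {"field": "popularity", "value": 6, "count": 5},
        {"field": "popularity", "value": 7, "count": 4},
        {"field": "popularity", "value": 1, "count": 2}]},
    {"field": "cat", "value": "memory", "count": 3, "pivot": []},
]

rows = flatten_pivot(pivot)
for path, count in rows:
    print(" && ".join("%s==%s" % fv for fv in path), count)
```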
Tips & Tricks
Slide23
term QParser
Default Query Parser does special things with whitespace and punctuation
Problematic when "filtering" on Facet Field Constraints that contain whitespace, punctuation, or other reserved characters.
Use the term parser to filter on an exact Term

fq = {!term f=category}Books & Magazines
OR
fq = {!term f=category v=$t}
t = Books & Magazines
Slide24
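When such a filter is sent over HTTP, the "&" and the space in the constraint must also be URL-encoded, or the value is cut off at the ampersand. A minimal sketch with Python's standard library follows; `term_filter` is a hypothetical helper and `q=*:*` is only a placeholder query.

```python
from urllib.parse import urlencode, parse_qsl

def term_filter(field, value):
    """Build an fq that matches one exact indexed term via the term parser."""
    return "{!term f=%s}%s" % (field, value)

params = [("q", "*:*"), ("fq", term_filter("category", "Books & Magazines"))]
query_string = urlencode(params)

# Decoding shows what the server-side parameter parsing would see
decoded = dict(parse_qsl(query_string))
print(query_string)
print(decoded["fq"])
```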
Taxonomy Facets
What If Your Documents Are Organized in a Taxonomy?
Slide25
Taxonomy Facets: Data
Flattened Data
Doc#1: NonFic > Law
Doc#2: NonFic > Sci
Doc#3: NonFic > Hist, NonFic > Sci > Phys

Indexed Terms (prepend number of nodes in path segment)
Doc#1: 1/NonFic, 2/NonFic/Law
Doc#2: 1/NonFic, 2/NonFic/Sci
Doc#3: 1/NonFic, 2/NonFic/Hist, 2/NonFic/Sci, 3/NonFic/Sci/Phys
Slide26
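The flattening step can be done at index time by the client. The sketch below shows one possible way to produce the depth-prefixed terms, plus a helper that derives the parameters for the next level once a user picks a term. All function names are hypothetical, the " > " separator is taken from the slide, and the `category` field name is the one used in the taxonomy query slides.

```python
def taxonomy_terms(path, sep=" > "):
    """'NonFic > Sci > Phys' -> ['1/NonFic', '2/NonFic/Sci', '3/NonFic/Sci/Phys']"""
    nodes = path.split(sep)
    return ["%d/%s" % (depth, "/".join(nodes[:depth]))
            for depth in range(1, len(nodes) + 1)]

def index_terms(paths):
    """All terms for one document, without duplicates, in first-seen order."""
    terms = []
    for path in paths:
        for term in taxonomy_terms(path):
            if term not in terms:
                terms.append(term)
    return terms

def drill_down(term, field="category"):
    """Filter on the chosen term and facet on its children (one level deeper)."""
    depth, _, path = term.partition("/")
    return {
        "fq": "{!term f=%s}%s" % (field, term),
        "facet.field": field,
        "facet.prefix": "%d/%s" % (int(depth) + 1, path),
        "facet.mincount": 1,
    }

doc3 = index_terms(["NonFic > Hist", "NonFic > Sci > Phys"])
next_query = drill_down("2/NonFic/Sci")
print(doc3)
print(next_query)
```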
Taxonomy Facets: Initial Query
facet.field = category
facet.prefix = 2/NonFic
facet.mincount = 1

<result numFound="164" ...
<lst name="facet_fields">
  <lst name="category">
    <int name="2/NonFic/Sci">2</int>
    <int name="2/NonFic/Hist">1</int>
    <int name="2/NonFic/Law">1</int>
Slide27
Taxonomy Facets: Drill Down
fq = {!term f=category}2/NonFic/Sci
facet.field = category
facet.prefix = 3/NonFic/Sci
facet.mincount = 1

<result numFound="2" ...
<lst name="facet_fields">
  <lst name="category">
    <int name="3/NonFic/Sci/Phys">1</int>
Slide28
Multi-Select Faceting
Very generic support
Reuses localParams syntax {!name=val}
Ability to tag filters
Ability to exclude certain filters when faceting, by tag

http://search.lucidimagination.com
q = index replication
facet = true
fq = {!tag=pr}project:(lucene OR solr)
facet.field = {!ex=pr}project
facet.field = {!ex=src}source
Slide29
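The tag/exclude semantics can be illustrated with a tiny in-memory model. This is not how Solr implements it; it only shows which filters take part in each facet count. The documents, the `source` values, and the function name `facet` are made up for the example.

```python
from collections import Counter

# Hypothetical documents
docs = [
    {"project": "lucene", "source": "wiki"},
    {"project": "solr", "source": "wiki"},
    {"project": "solr", "source": "email"},
    {"project": "nutch", "source": "email"},
]

# Tagged filters, mirroring fq={!tag=pr}project:(lucene OR solr)
filters = {"pr": lambda d: d["project"] in ("lucene", "solr")}

def facet(field, exclude=()):
    """Count values of `field`, ignoring any filter whose tag is excluded."""
    active = [f for tag, f in filters.items() if tag not in exclude]
    return Counter(d[field] for d in docs if all(f(d) for f in active))

project_counts = facet("project", exclude=("pr",))  # like {!ex=pr}project
source_counts = facet("source", exclude=("src",))   # like {!ex=src}source
print(project_counts)
print(source_counts)
```

The project facet still shows `nutch` even though the user has selected lucene and solr, which is what lets a UI offer the unselected values as additional choices. The source facet is still narrowed by the project filter because only the `src` tag is excluded there.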
Same Facet, Different Exclusions
q = Hot Rod
fq = {!df=colors tag=cx}purple green
facet.field = {!key=all_colors ex=cx}colors
facet.field = {!key=overlap_colors}colors

A key can be specified for a facet to change the name used to identify it in the response.

"facet_counts":{
  "facet_fields":{
    "all_colors":[
      "red",19,
      "green",6,
      "blue",2],
    "overlap_colors":[
      "red",7,
      "green",6,
      "blue",1]
  }
}
Slide30
“Pretty” facet.field Terms
Field Faceting uses Indexed Terms
Leverage copyField and TokenFilters that will give you good looking Constraints

<tokenizer class="solr.PatternTokenizerFactory" pattern="(,|;)\s*" />
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " />
<filter class="solr.PatternReplaceFilterFactory" pattern=" and " replacement=" &amp; " />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.CapitalizationFilterFactory" onlyFirstWord="false" />
Slide31
“Pretty” facet.field Results
copyField in schema:
<copyField source="category" dest="category_pretty"/>

{"id" : "MyExampleDoc",
 "category" : " books and magazines; computers, "
}

"facet_counts":{
  "facet_fields":{
    "category_pretty":[
      "Books & Magazines",1,
      "Computers",1]
Slide32
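To see why the analysis chain yields those constraints, the steps can be approximated outside Solr. The sketch below imitates the tokenizer and filters with Python's `re` module. It is an approximation for illustration, not the Lucene implementation: it assumes empty tokens are dropped and that capitalization lower-cases the rest of each word. The function name `pretty_terms` is hypothetical.

```python
import re

def pretty_terms(raw):
    """Approximate the category_pretty analysis chain on one field value."""
    terms = []
    # PatternTokenizer: split on "," or ";" plus any following whitespace
    for token in re.split(r"(?:,|;)\s*", raw):
        token = re.sub(r"\s+", " ", token)        # collapse whitespace runs
        token = token.replace(" and ", " & ")     # " and " -> " & "
        token = token.strip()                     # TrimFilter
        if token:                                 # assumption: empty tokens are dropped
            # capitalize every word (onlyFirstWord="false")
            terms.append(" ".join(w.capitalize() for w in token.split(" ")))
    return terms

print(pretty_terms(" books   and magazines;  computers, "))
```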
facet.field Labels
facet.query params are echoed verbatim when returning the constraint counts
Optionally, one can declare a facet.query in solrconfig.xml and include a "label" that the presentation layer can parse out for display.

facet.query = {!label='Hot!'}+pop:[1 TO *] +pub_date:[NOW/DAY-1DAY TO *]

"facet_queries" : {
  "{!label='Hot!'}+pop:[1 TO *] ..." : 15
}

“label” has no meaning to Solr
Slide33
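Since the label is only a convention, the presentation layer has to extract it. A minimal sketch of that parsing step is below. It assumes the label is the only local param and is wrapped in straight single quotes; the function name `display_label` is hypothetical.

```python
import re

# Matches a leading {!label='...'} block; assumes label is the only local param
LABEL = re.compile(r"^\{!label='([^']*)'\s*\}")

def display_label(facet_query_key):
    """Return the embedded label, or the raw query string if there is none."""
    match = LABEL.match(facet_query_key)
    return match.group(1) if match else facet_query_key

key = "{!label='Hot!'}+pop:[1 TO *] +pub_date:[NOW/DAY-1DAY TO *]"
print(display_label(key))
print(display_label("rank:[* TO 20]"))
```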
Performance
Slide34
facet.method
enum (facet.method=enum)
  description: Iterates over terms, calculating set intersections
  memory: filter-per-term in the filterCache
  CPU: ~O(nTerms)
field cache (facet.method=fc, single-valued field)
  description: Iterates over documents, counting terms
  memory: Lucene FieldCache Entry… int[maxDoc]+terms
  CPU: O(nDocs)
UnInvertedField (facet.method=fc, multi-valued field)
  description: Iterates over documents, counting terms
  memory: O(maxDoc * num terms per doc)
  CPU: ~O(nDocs)
Per-segment field cache (facet.method=fcs)
  description: Like field-cache, just better for NRT since FieldCacheEntry is at segment level.
  memory: Lucene FieldCache Entries… int[maxDoc]+terms
  CPU: O(nDocs) + O(nTerms)
Slide35
facet.method=fc (single-valued field)
[Diagram: Lucene FieldCache Entry (StringIndex) for the “hero” field]
lookup: the string values: (null), batman, flash, spiderman, superman, wolverine
order: for each doc, an index into the lookup array: 5, 3, 5, 1, 4, 5, 2, 1
q=Juggernaut&facet=true&facet.field=hero
For each document matching the base query “Juggernaut”, its order value is looked up and the matching accumulator slot is incremented; the highest counts go into a priority queue (e.g. Batman, 3; flash, 5).
Mem = int[maxDoc] + unique_values
CPU = O(nDocs in base set)
Slide36
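The counting loop behind facet.method=fc on a single-valued field can be sketched in a few lines. This is an illustration of the idea, not Solr's code. The `lookup` and `order` arrays are one reading of the diagram, and the base document set is an assumption made up for the example.

```python
import heapq

# FieldCache-style entry for the "hero" field (values as read from the diagram)
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]  # doc id -> index into lookup

def facet_fc(base_docs, limit=2):
    """Count terms by iterating over the matching docs only: O(len(base_docs))."""
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1
    # slot 0 is the "no value" entry, so it is skipped
    counted = [(count, lookup[i]) for i, count in enumerate(accumulator)
               if i > 0 and count > 0]
    # a heap stands in for the priority queue that keeps the top constraints
    return [(term, count) for count, term in heapq.nlargest(limit, counted)]

top = facet_fc([0, 2, 3, 5, 7])  # assumed doc ids matching the base query
print(top)
```

The accumulator costs one int per unique value and the loop touches only matching documents, which is where the memory and CPU figures on the slide come from.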
facet.method=fcs (trunk) (per-segment single-valued)
[Diagram: the Base DocSet is spread across Segment1 to Segment4, each with its own FieldCacheEntry. thread1 to thread4 each look up values and increment a per-segment accumulator (accumulator1 to accumulator4). A FieldCache + accumulator merger (Priority queue) combines them into the final priority queue, e.g. Batman, 3; flash, 5.]
Slide37
facet.method=fcs
Controllable multi-threading
  facet.method=fcs
  facet.field={!threads=4}myfield
Disadvantages
  Larger memory use (FieldCaches + accumulators)
  Slower (extra FieldCache merge step needed) - O(nTerms)
Advantages
  Rebuilds FieldCache entries only for new segments (NRT friendly)
  Multi-threaded
Slide38
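The extra merge step exists because each segment numbers its terms independently, so per-segment counts have to be combined by term value. The sketch below illustrates this with a thread pool. It is a simplified model, not Solr's implementation, and the segment data and function names are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical segments: each has its own lookup table and doc -> ord array
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 0]},
    {"lookup": [None, "flash", "superman"], "order": [1, 2, 1]},
]

def count_segment(segment, base_docs):
    """Per-segment accumulator, indexed by that segment's own ords."""
    acc = [0] * len(segment["lookup"])
    for doc in base_docs:
        acc[segment["order"][doc]] += 1
    return acc

def facet_fcs(base_docs_per_segment, threads=2):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        accs = list(pool.map(count_segment, segments, base_docs_per_segment))
    # Merge step: ords differ per segment, so counts are combined by term string
    merged = Counter()
    for segment, acc in zip(segments, accs):
        for ord_, count in enumerate(acc):
            if ord_ > 0 and count > 0:
                merged[segment["lookup"][ord_]] += count
    return merged

counts = facet_fcs([[0, 1, 2], [0, 2]])
print(counts.most_common())
```

Note that "flash" has ord 2 in the first segment and ord 1 in the second, which is why the accumulators cannot simply be added position by position.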
Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
Time for request*        facet.method=fc   facet.method=fcs
static index             3 ms              244 ms
quickly changing index   1388 ms           267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
Time for request*        facet.method=fc   facet.method=fcs
static index             26 ms             34 ms
quickly changing index   741 ms            94 ms

*complete request time, measured externally
Slide39
facet.method=fc (multi-valued field)
UnInvertedField - like single-valued FieldCache algorithm, but with multi-valued FieldCache
Good for many unique terms, relatively few values per doc
  Best case: 50x faster, 5x smaller than “enum” (100K unique values, 1-5 per doc)
  O(n_docs), but optimization to count the inverse when n>maxDoc/2
Memory efficient
  Terms in a document are delta coded variable width ords (vints)
  Ord list for document packed in an int or in a shared byte[]
  Hybrid approach: “big terms” that match >5% of index use filterCache instead
  Only 1/128th of string values in memory
Slide40
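The memory savings from "delta coded variable width ords" come from storing small differences between sorted term ordinals in as few bytes as possible. The sketch below shows the general idea with a generic 7-bits-per-byte varint. The exact byte layout inside UnInvertedField may differ; the byte order here and the function names are assumptions.

```python
def encode_ords(ords):
    """Delta-encode sorted term ordinals as variable-width ints (7 bits per byte)."""
    out = bytearray()
    prev = 0
    for o in sorted(ords):
        delta = o - prev
        prev = o
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # high bit set: more bytes follow
            delta >>= 7
        out.append(delta)
    return bytes(out)

def decode_ords(data):
    """Inverse of encode_ords."""
    ords, prev, value, shift = [], 0, 0, 0
    for b in data:
        value |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += value
            ords.append(prev)
            value, shift = 0, 0
    return ords

packed = encode_ords([3, 300, 305])
print(len(packed), decode_ords(packed))
```

Three ordinals fit in four bytes here, so a short list like this can be packed into a single int, in line with the slide's note about packing in an int or a shared byte[].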
Faceting: fieldValueCache
Implicit cache with UnInvertedField entries
Not autowarmed - use static warming request
http://localhost:8983/solr/admin/stats.jsp (mem size, time to create, etc)

item_cat:{field=cat,memSize=5376,tindexSize=52,time=2,phase1=2,nTerms=16,bigTerms=10,termInstances=6,uses=44}
Slide42
Multi-valued faceting: facet.method=enum
facet.method=enum
For each term in field:
  Retrieve filter
  Calculate intersection size

[Diagram]
Lucene Inverted Index (on disk), “hero” field:
  batman:    1 3 5 8
  flash:     0 1 5
  superman:  2 4 7
  wolverine: 0 6 9
  spiderman: 1 2 7 8
Solr filterCache (in memory):
  hero:batman: 1 3 5 8
  hero:flash:  0 1 5
Each filter is intersected with the docs matching the base query; the intersection count goes into a priority queue (e.g. batman=2).
Slide43
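The enum method can be modelled with plain set intersections. The sketch below is an illustration rather than Solr's code. The postings are one reading of the diagram, the base document set {0, 1, 5} is an assumption, and skipping terms whose document frequency is below `mincount` is one plausible form of a df-based short circuit.

```python
import heapq

# Term -> set of doc ids (postings as read from the diagram)
postings = {
    "batman": {1, 3, 5, 8},
    "flash": {0, 1, 5},
    "superman": {2, 4, 7},
    "wolverine": {0, 6, 9},
    "spiderman": {1, 2, 7, 8},
}

def facet_enum(base_docs, limit=2, mincount=1):
    """Iterate over every term and intersect its doc set with the base set."""
    counts = []
    for term, docs in postings.items():
        if len(docs) < mincount:
            continue  # df too low to ever reach mincount, skip the intersection
        n = len(docs & base_docs)
        if n >= mincount:
            counts.append((n, term))
    return [(term, n) for n, term in heapq.nlargest(limit, counts)]

top = facet_enum({0, 1, 5})  # assumed docs matching the base query
print(top)
```

The cost grows with the number of terms in the field rather than with the number of matching documents, which is why this method suits fields with few unique values.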
facet.method=enum
O(n_terms_in_field)
Short circuits based on term.df
filterCache entries int[ndocs] or BitSet(maxDoc)
Size filterCache appropriately
  Either autowarm filterCache, or use static warming queries (via newSearcher event) in solrconfig.xml
facet.enum.cache.minDf - prevent filterCache use for small terms
  Also useful for huge index w/ no time constraints
Slide44
Q&A