Why Build Custom Categorizers Using Boolean Queries Instead of Machine Learning? Robert Wood Johnson Foundation Case Study
Joseph Busch and Vivian Bliss
Agenda
Pre-defined Boolean Queries
Case Study
Lessons Learned
Boolean queries
Basic operators
AND (conjunctive)
OR (disjunctive)
NOT (negation)
Venn diagrams
A OR B
A AND B
A NOT B
B NOT A
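The operator semantics above can be sketched as set operations over the sets of matching documents. A minimal illustration in Python, using hypothetical documents and a naive substring match:

```python
# Boolean operators as set operations over document IDs.
# The documents and terms are hypothetical examples.
docs = {
    1: "children and health insurance",
    2: "health insurance reform",
    3: "children in schools",
}

def matching(term):
    """IDs of documents whose text contains the term (naive substring match)."""
    return {doc_id for doc_id, text in docs.items() if term in text}

A = matching("children")          # {1, 3}
B = matching("health insurance")  # {1, 2}

print(sorted(A | B))  # A OR B  -> [1, 2, 3]
print(sorted(A & B))  # A AND B -> [1]
print(sorted(A - B))  # A NOT B -> [3]
print(sorted(B - A))  # B NOT A -> [2]
```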
Proximity operators
Proximity search (specified distance).
Hint: Proximity operators and syntax are not standardized.
Operators: NEAR, NOT NEAR, FOLLOWED BY, NOT FOLLOWED BY, SENTENCE, FAR
Scope: Document, Section, Paragraph, Sentence
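A NEAR/k-style operator can be approximated by comparing token positions. A minimal sketch in Python (function names and exact semantics are illustrative; as noted above, real engines differ):

```python
import re

def tokens(text):
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())

def near(text, term_a, term_b, k):
    """A NEAR/k B: the terms occur within k tokens of each other (sketch)."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == term_a]
    pos_b = [i for i, t in enumerate(toks) if t == term_b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

def followed_by(text, term_a, term_b, k):
    """A FOLLOWED BY B within k tokens: order matters (sketch)."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == term_a]
    pos_b = [i for i, t in enumerate(toks) if t == term_b]
    return any(0 < j - i <= k for i in pos_a for j in pos_b)

text = "Rising obesity rates among young children worry pediatricians"
print(near(text, "obesity", "children", 5))         # True (4 tokens apart)
print(followed_by(text, "children", "obesity", 5))  # False (wrong order)
```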
Query syntax
Bounded phrase
Usually quotation marks, e.g.
“health insurance”
Truncation (right, left, internal)
Usually an asterisk, e.g.
child*
“pre-existing condition*”
Nested statements
Parentheses (that must match up)
(“health insurance” AND (children* OR “pre-existing condition*”))
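Truncation and bounded phrases can be illustrated by translating a query term into a regular expression. A sketch in Python (`pattern_to_regex` is a hypothetical helper, not a standard API):

```python
import re

def pattern_to_regex(term):
    """Translate a query term into a regex: strip bounding quotation marks
    and expand each * into a word-character wildcard (sketch)."""
    if term.startswith('"') and term.endswith('"'):
        term = term[1:-1]  # bounded phrase
    return re.compile(r"\b" + re.escape(term).replace(r"\*", r"\w*") + r"\b",
                      re.IGNORECASE)

print(bool(pattern_to_regex("child*").search("Children at school")))  # True
print(bool(pattern_to_regex('"pre-existing condition*"')
           .search("a pre-existing condition applies")))              # True
```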
How to create a Boolean query (1)
Brainstorm a list of 10 relevant words and phrases.
Use that list to identify 10 relevant items (articles, videos, websites, etc.)
E.g., do a Google search, search Google Scholar, search the NYT (or any other newspaper that you subscribe to), search Library of Congress Chronicling America (1789-1963), etc.
Review 10 relevant items and write down the words and phrases that provide a context for the theme/topic/concept.
Titles, headings, summaries, introductions (at the beginning) and conclusions (at the end) are good areas to focus on without having to read the whole item.
Note any named entities (people, organizations, events, laws, etc.) that are closely associated with the theme/topic/concept.
E.g., for gun violence: “Gabrielle Giffords”, “Michael Bloomberg”, “Doctors Against Gun Violence”, “March for Our Lives”, etc.
How to create a Boolean query (2)
Consolidate the terms.
Identify duplicates, synonyms, as well as any concepts that you want to combine even if they are not synonyms.
Re-label the term as needed to reflect the concept/category.
Also consider and note any other relationships between terms.
Prioritize the terms. Rank from 1-N, most relevant to least relevant.
Hint: Rank each term by higher, medium, lower relevance, then sort and rank from 1-7.
Write a query for each term.
Note that regular plurals (-s, -es, -ies) are usually (but not always) included automatically, but you always need to specify irregular plurals, e.g., “mice”.
Qualify the scope for each term. Does the term require any qualification of scope, e.g., by population, setting, geography, etc.? Validate that the term is disjunctive and distinct, and requires no further qualification.
Combine the terms into a single nested query with an OR operator.
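The final combining step can be sketched as simple string assembly; the per-term queries below are hypothetical examples, not the project's actual query set:

```python
# Combine per-term queries into a single nested OR query
# (terms are hypothetical examples).
term_queries = [
    '("health insurance" AND child*)',
    '"pre-existing condition*"',
    '(obesity NEAR/5 prevent*)',
]
combined = "(" + " OR ".join(term_queries) + ")"
print(combined)
# (("health insurance" AND child*) OR "pre-existing condition*" OR (obesity NEAR/5 prevent*))
```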
Agenda
Pre-defined Boolean Queries
Case Study
Lessons Learned
https://www.rwjf.org/
Applying the RWJF taxonomy to grants
Grant awards
$3,000 to $23 million
Mostly $100,000-$300,000
Grants period of performance
1 month to 5 years
Grants description
Metadata including Program Area, Type of Support, Grantmaking Intervention, Demographics, Topics, and Tags
But this metadata is difficult to use to answer questions about grantmaking trends
New RWJF taxonomy objectives
Current Taxonomy → Better Taxonomy
Automated methods will be critical for updating descriptive metadata from the Current Taxonomy to the new metadata scheme and values (the Better Taxonomy).
2017 pilot project
Childhood Obesity
Disease Prevention and Health Promotion
Health Care Quality
Health Coverage
2018 project
Operational text classifier using less complex Boolean queries
Requirements for building test collections to refine recall and precision for auto-classification
Methodology for refining recall and precision in stages
Requirements for integrating text analytics and information retrieval software into RWJF staff workflow
Breaking down broad topics into simple queries
Broad topic Boolean query from 2017 pilot project
[
{
"name" : "Childhood Obesity",
"query" : "(((child* OR adolescent* OR youth OR girl* OR boy*) NEAR/5 obesity) OR ((obesity NEAR/5 (prevent* OR trend OR challenge OR solving OR solution OR prevalence)) NEAR/10 (child* OR youth* OR adolescent* OR girl* OR boy*)) OR (("healthy weight" OR overweight OR obese) NEAR/5 (child* OR adolescent* OR youth)) OR (("body mass index" OR BMI) NEAR/5 (child* OR adolescent* OR youth)) OR ((child* OR adolescent* OR youth) NEAR/5 ("healthy habits" OR "healthy behavior*" OR (health* NEAR/5 eat*))) OR ("dietary guidelines" NEAR/5 (child* OR youth* OR adolescent* OR girl* OR boy*)) OR ("nutritional standards" NEAR/5 (school NEAR/5 (meal* OR lunch* OR snack* OR breakfast*))) OR (("sweet* beverage*" OR (sugar* NEAR/5 drink*)) NEAR/5 school* NEAR/10 (kids OR child* OR adolescent* OR youth)) OR (obesity NEAR/5 prevent*) OR ((lower OR reduce) NEAR/5 obesity) OR ("healthy weight commitment" NEAR/5 (child* OR adolescent* OR youth)) OR ("active living research" NEAR/5 (child* OR adolescent* OR youth)) OR (("physical activity" OR "physical education" OR "physically active" OR "physical fitness") NEAR/10 (child* OR adolescent* OR youth* OR girl* OR boy* OR school*)) OR ((activity OR "activity pattern*") NEAR/5 (child* OR adolescent* OR youth* OR girl* OR boy*)))"
}
]
Broad topic Boolean query broken up into simple queries
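One way to represent the breakdown is a list of named sub-queries that are OR'd together at classification time. The groupings below are illustrative, not the project's actual breakdown:

```python
# The broad "Childhood Obesity" query re-expressed as simple named sub-queries
# that can be tested and refined independently (illustrative groupings).
simple_queries = [
    {"name": "Childhood Obesity - core",
     "query": "((child* OR adolescent* OR youth OR girl* OR boy*) NEAR/5 obesity)"},
    {"name": "Childhood Obesity - healthy weight",
     "query": '(("healthy weight" OR overweight OR obese) NEAR/5 (child* OR adolescent* OR youth))'},
    {"name": "Childhood Obesity - physical activity",
     "query": ('(("physical activity" OR "physical education") '
               'NEAR/10 (child* OR adolescent* OR youth* OR school*))')},
]
# A document gets the broad topic if ANY sub-query matches (an implicit OR),
# but recall/precision failures can now be traced to an individual sub-query.
```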
Content collections for query building and testing
Option 1: Build up a test collection using the “snowball” method.
Use relevant words & phrases to identify a core set of relevant articles from authoritative sources
Perform rhetorical analysis to build up lists of words, phrases and named entities
Iterate with editorial judgement
Option 2: If available, use an existing categorized collection.
Carefully assess existing categorized content to determine if it is relevant and consistently categorized, and representative of categorization application target content.
Potential problems:
Categorization is incorrect or over-tagged (more than 3 Topics)
E.g., RWJF.org topics are associated with grants based on a mapping from more detailed PIMS Topics, not directly from subject matter expert indexing.
Content is formulaic rather than distinct
E.g., RWJF leadership development grants often have a lot of boilerplate paragraphs in their executive summaries.
Content is not representative of the target collection
E.g., only the most recent content, not examples from across the whole collection.
Option 3: Re-index an existing representative collection.
Content collections for query building and testing
Option 1 (building) and Option 3 (re-indexing) could take less time than compensating for the problems with Option 2 (using an existing categorized collection).
Option 1: Build up a test collection using the “snowball” method.
Option 2: If available, use an existing categorized collection.
Option 3: Re-index an existing representative collection.
Overall 2017 pilot project results
Categorized to Pre-defined Category
Correct Pre-defined Category
Refinement is a tradeoff between recall and precision in categorization
2017 pilot project results for each Broad Topic
Childhood Obesity (n=64)
Disease Prevention and Health Promotion (n=65)
Health Coverage (n=72)
Health Care Quality (n=69)
Boolean query categorizer refinement process
In first iteration, optimize recall as much as possible
In second iteration, optimize for precision (and recall as necessary)
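Each refinement iteration is scored against a labeled test collection. A minimal sketch of the two measures in Python, with hypothetical counts:

```python
def recall_precision(retrieved, relevant):
    """Recall and precision of a categorizer over a labeled test collection."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    recall = true_positives / len(relevant) if relevant else 0.0
    precision = true_positives / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical round: 10 relevant docs; the query returns 12, of which 8 are relevant.
relevant = set(range(1, 11))   # docs 1-10
retrieved = set(range(3, 15))  # docs 3-14
r, p = recall_precision(retrieved, relevant)
print(round(r, 2), round(p, 2))  # 0.8 0.67
```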
Optimize recall and then refine precision (and recall as necessary)
[Process diagram]
Build queries for each Topic: Topic 1 Query, Topic 2 Query, Topic 3 Query, …
Round 1 (optimize recall): Test & revise each Topic query against its own test collection (Topic 1 Test Collection, Topic 2 Test Collection, …) until recall reaches 80% +/- 5%.
Combine the Topic queries.
Round 2 (optimize precision): Test & revise the combined queries against the Combined Topics Test Collection until precision reaches 80% +/- 5%.
Round 3: Same as Round 2, for each Topic against the Combined Topics Test Collection.
Integrating text analytics into staff workflows
[Workflow diagram: an RWJF Program Associate enters a new precis into PIMS and reviews & adds RWJF Topic(s); the RWJF Taxonomy Manager monitors, assists, and reports; consultant support (Taxonomy Strategies); Senior Advisory Group]
Agenda
Pre-defined Boolean Queries
Case Study
Lessons Learned
Lessons learned about building Boolean categorizers
Breaking down broad topics into simple constituent queries facilitates the process of refining recall and precision by making the queries more easily understood and editable.
Representative test collections are essential for building Boolean categorizers, but even when pre-categorized collections exist they may not be the best or most cost-effective option.
It is effective to refine Boolean categorizers by optimizing recall before refining precision (and recall as necessary).
Automated methods should not replace staff but be a means to engage subject matter experts (operational as well as senior level staff) with content and categorization.
References
RWJF. (2018). Frequently Asked Questions. Retrieved August 13, 2018. https://www.rwjf.org/en/how-we-work/grants-explorer/faqs.html
Lexalytics. (2018). Semantria. Retrieved August 13, 2018. https://www.lexalytics.com/semantria
Busch, Joseph. (2018). GoToWebinar - The Current State of Automated Content Tagging: Dangers and Opportunities. July 19, 2018. Retrieved August 13, 2018. Slides - http://www.taxonomystrategies.com/wp-content/uploads/2018/01/Current%20State%20of%20Automated%20Content%20Tagging-Webinar-20180719.pdf. Script - http://www.taxonomystrategies.com/wp-content/uploads/2018/01/Current%20State%20of%20Automated%20Content%20Tagging-20180719.pdf
Resources on recall and precision
Buckland, Michael and Fredric Gey. “The relationship between Recall and Precision.” Journal of the Association for Information Science and Technology 45(1):12-19 (1994).
Cleverdon, Cyril W. “On the Inverse Relationship of Recall and Precision.” Journal of Documentation 28(3):195-201 (1972).
Croft, W. Bruce. “Boolean queries and term dependencies in probabilistic retrieval models.” Journal of the Association for Information Science and Technology 37(2):71-77 (1986).
Questions
Joseph Busch,
jbusch@taxonomystrategies.com
joseph@semanticstaffing.com
(m) 415-377-7912
Vivian Bliss,
vbliss@taxonomystrategies.com
(m) 425-417-7628