Presentation Transcript

Slide1

Why Build Custom Categorizers Using Boolean Queries Instead of Machine Learning? Robert Wood Johnson Foundation Case Study

Joseph Busch and Vivian Bliss

Slide2

Agenda

Pre-defined Boolean Queries

Case Study

Lessons Learned

Slide3

Boolean queries

Basic operators

AND (conjunctive)

OR (disjunctive)

NOT (negation)

Venn diagrams

A OR B

A AND B

A NOT B

B NOT A

A

B

A

B

A

B

A

B

Slide4
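The set semantics shown in the Venn diagrams above can be sketched in Python, treating each term as the set of document IDs it matches. The IDs below are made-up examples, not from the case study.

```python
# Sketch: Boolean operators as set operations over matching document IDs.
# The document IDs are hypothetical examples.
docs_matching_a = {1, 2, 3, 4}   # documents containing term A
docs_matching_b = {3, 4, 5, 6}   # documents containing term B

a_or_b = docs_matching_a | docs_matching_b    # OR: union (disjunctive)
a_and_b = docs_matching_a & docs_matching_b   # AND: intersection (conjunctive)
a_not_b = docs_matching_a - docs_matching_b   # A NOT B: set difference (negation)
b_not_a = docs_matching_b - docs_matching_a   # B NOT A

print(sorted(a_or_b))   # [1, 2, 3, 4, 5, 6]
print(sorted(a_and_b))  # [3, 4]
print(sorted(a_not_b))  # [1, 2]
```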

Proximity operators

Proximity search (specified distance).

Hint: Proximity operators and syntax are not standardized.

NEAR

NOT NEAR

FOLLOWED BY

NOT FOLLOWED BY

SENTENCE

FAR

Document

Section

Paragraph

Sentence

Slide5
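A NEAR/n proximity check like the ones used later in the deck can be sketched over token positions. This is a minimal illustration of the concept, not any particular engine's implementation.

```python
def near(tokens, term1, term2, max_distance):
    """Return True if term1 and term2 occur within max_distance tokens
    of each other (a simple NEAR/n sketch; real engines vary)."""
    pos1 = [i for i, t in enumerate(tokens) if t == term1]
    pos2 = [i for i, t in enumerate(tokens) if t == term2]
    return any(abs(i - j) <= max_distance for i in pos1 for j in pos2)

text = "programs to prevent obesity among school age children".split()
print(near(text, "obesity", "children", 5))  # True  (4 tokens apart)
print(near(text, "obesity", "children", 2))  # False
```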

Query syntax

Bounded phrase

Usually quotation marks, e.g.

“health insurance”

Truncation (right, left, internal)

Usually an asterisk, e.g.

child*

“pre-existing condition*”

Nested statements

Parentheses (that must match up)

(“health insurance” AND (children* OR “pre-existing condition*”))

Slide6
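Right truncation like child* can be approximated with a regular expression. The conversion below is an assumption about how a wildcard maps to regex; actual query engines handle truncation internally.

```python
import re

def wildcard_to_regex(term):
    """Convert a right-truncated query term like child* to a regex.
    This mapping is an illustrative assumption; real syntaxes vary."""
    return re.compile(
        r"\b" + re.escape(term).replace(r"\*", r"\w*") + r"\b",
        re.IGNORECASE,
    )

pattern = wildcard_to_regex("child*")
print(bool(pattern.search("Children's health insurance")))  # True
print(bool(pattern.search("adolescent health")))             # False
```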

How to create a Boolean query (1)

Brainstorm a list of 10 relevant words and phrases.

Use that list to identify 10 relevant items (articles, videos, websites, etc.)

E.g., do a Google search, search Google Scholar, search the NYT (or any other newspaper that you subscribe to), search Library of Congress Chronicling America (1789-1963), etc.

Review 10 relevant items and write down the words and phrases that provide a context for the theme/topic/concept.

Titles, headings, summaries, introductions (at the beginning) and conclusions (at the end) are good areas to focus on without having to read the whole item.

Note any named entities (people, organizations, events, laws, etc.) that are closely associated with the theme/topic/concept.

E.g., for gun violence: “Gabrielle Giffords”, “Michael Bloomberg”, “Doctors Against Gun Violence”, “March for Our Lives”, etc.

Slide7

How to create a Boolean query (2)

Consolidate the terms.

Identify duplicates, synonyms, as well as any concepts that you want to combine even if they are not synonyms.

Re-label the term as needed to reflect the concept/category. Also consider and note any other relationships between terms.

Prioritize the terms. Rank from 1-N, most relevant to least relevant.

Hint: Rank each term by higher, medium, lower relevance, then sort and rank from 1-7.

Write a query for each term.

Note that regular plurals (-s, -es, -ies) are usually (but not always) included automatically, but you always need to specify irregular plurals, e.g., “mice”.

Qualify the scope for each term. Does the term require any qualification of the scope, e.g., by population, setting, geography, etc.?

Validate that the term is disjunctive, distinct, and requires no further qualification.

Combine the terms into a single nested query with an OR operator.

Slide8
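The final step above can be sketched as simple string assembly: wrap each per-term query in parentheses and join them with OR. The term queries below are illustrative, not the deck's actual list.

```python
# Sketch: combine prioritized per-term queries into one nested OR query.
# These term queries are made-up examples.
term_queries = [
    '(child* NEAR/5 obesity)',
    '("healthy weight" OR overweight) NEAR/5 (child* OR adolescent* OR youth)',
    '"body mass index" NEAR/5 (child* OR adolescent* OR youth)',
]

# Parenthesize each sub-query so the OR chain nests cleanly.
combined = "(" + " OR ".join(f"({q})" for q in term_queries) + ")"
print(combined)
```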

Agenda

Pre-defined Boolean Queries

Case Study

Lessons Learned

Slide9

https://www.rwjf.org/

Slide10

Applying the RWJF taxonomy to grants

Grant awards

$3,000 to $23 million

Mostly $100,000-$300,000

Grants period of performance

1 month to 5 years

Grants description

Metadata including Program Area, Type of Support, Grantmaking Intervention, Demographics, Topics, and Tags

But this metadata is difficult to use to understand grantmaking trends.

Slide11

New RWJF taxonomy objectives

Better

Taxonomy

Current Taxonomy

Automated methods will be critical for updating descriptive metadata from the Current Taxonomy to the new metadata scheme and values (the Better Taxonomy).

Slide12

2017 pilot project

Childhood Obesity

Disease Prevention and Health Promotion

Health Care Quality

Health Coverage

Slide13

2018 project

Operational text classifier using less complex Boolean queries

Requirements for building test collections to refine recall and precision for auto-classification

Methodology for refining recall and precision in stages

Requirements for integrating text analytics and information retrieval software into RWJF staff workflow

Slide14

Breaking down broad topics into simple queries

Broad topic Boolean query from 2017 pilot project

[
  {
    "name" : "Childhood Obesity",
    "query" : "(((child* OR adolescent* OR youth OR girl* OR boy*) NEAR/5 obesity) OR ((obesity NEAR/5 (prevent* OR trend OR challenge OR solving OR solution OR prevalence)) NEAR/10 (child* OR youth* OR adolescent* OR girl* OR boy*)) OR (("healthy weight" OR overweight OR obese) NEAR/5 (child* OR adolescent* OR youth)) OR (("body mass index" OR BMI) NEAR/5 (child* OR adolescent* OR youth)) OR ((child* OR adolescent* OR youth) NEAR/5 ("healthy habits" OR "healthy behavior*" OR (health* NEAR/5 eat*))) OR ("dietary guidelines" NEAR/5 (child* OR youth* OR adolescent* OR girl* OR boy*)) OR ("nutritional standards" NEAR/5 (school NEAR/5 (meal* OR lunch* OR snack* OR breakfast*))) OR (("sweet* beverage*" OR (sugar* NEAR/5 drink*)) NEAR/5 school* NEAR/10 (kids OR child* OR adolescent* OR youth)) OR (obesity NEAR/5 prevent*) OR ((lower OR reduce) NEAR/5 obesity) OR ("healthy weight commitment" NEAR/5 (child* OR adolescent* OR youth)) OR ("active living research" NEAR/5 (child* OR adolescent* OR youth)) OR (("physical activity" OR "physical education" OR "physically active" OR "physical fitness") NEAR/10 (child* OR adolescent* OR youth* OR girl* OR boy* OR school*)) OR ((activity OR "activity pattern*") NEAR/5 (child* OR adolescent* OR youth* OR girl* OR boy*)))"
  }
]

Broad topic Boolean query broken up into simple queries

Slide15
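The decomposition above can be sketched as a list of named simple queries in the same JSON shape the deck uses for the broad query. The names and sub-queries below are illustrative examples, not the project's actual breakdown.

```python
import json

# Sketch: a broad-topic query decomposed into named simple sub-queries,
# each small enough to test and edit on its own. Examples are hypothetical.
simple_queries = [
    {"name": "Childhood Obesity - core",
     "query": "(child* OR adolescent* OR youth) NEAR/5 obesity"},
    {"name": "Childhood Obesity - BMI",
     "query": '("body mass index" OR BMI) NEAR/5 (child* OR adolescent* OR youth)'},
    {"name": "Childhood Obesity - physical activity",
     "query": '"physical activity" NEAR/10 (child* OR adolescent* OR youth*)'},
]
print(json.dumps(simple_queries, indent=2))
```

A document matches the broad topic if it matches any sub-query, so recall and precision can be tuned per sub-query instead of inside one monolithic expression.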

Content collections for query building and testing

Option 1: Build-up a test collection using the “snowball” method.

Use relevant words & phrases to identify a core set of relevant articles from authoritative sources

Perform rhetorical analysis to build up lists of words, phrases and named entities

Iterate with editorial judgement

Option 2: If available, use an existing categorized collection.

Carefully assess existing categorized content to determine if it is relevant and consistently categorized, and representative of categorization application target content.

Potential problems: Categorization is incorrect or over-tagged (more than 3 Topics)

E.g., RWJF.org topics are associated with grants based on a mapping from more detailed PIMS Topics, not directly from subject matter expert indexing.

Content is formulaic rather than distinct

E.g., RWJF leadership development grants often have a lot of boilerplate paragraphs in their executive summary.

Content is not representative of target collection

E.g., Only the most recent content, not examples from across the whole collection.

Option 3: Re-index an existing representative collection.

Slide16

Content collections for query building and testing

Option 1: Building and Option 3: Re-indexing could take less time than compensating for problems with Option 2: Using an existing categorized collection.

Option 1

:

Build-up a test collection using the “snowball” method.

Option 2

:

If available, use an existing categorized collection.

Option 3: Re-index an existing representative collection.

Slide17

Overall 2017 pilot project results

Categorized to Pre-defined Category

Correct Pre-defined Category

Refinement is a tradeoff between recall and precision in categorization

Slide18

2017 pilot project results for each Broad Topic

Childhood Obesity

Disease Prevention and Health Promotion

Health Coverage

Health Care Quality

n=64

n=65

n=72

n=69

Slide19

Boolean query categorizer refinement process

In first iteration, optimize recall as much as possible

In second iteration, optimize for precision (and recall as necessary)

Slide20
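The two quantities being optimized can be computed directly from a test collection: recall is the share of relevant documents the query retrieves, precision is the share of retrieved documents that are relevant. The document IDs below are a hypothetical round-1 result.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision of a categorizer against a test collection.
    retrieved: doc IDs the Boolean query matched; relevant: true members."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = len(retrieved & relevant)
    recall = true_pos / len(relevant) if relevant else 0.0
    precision = true_pos / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical result: the query matches docs 1-10; docs 1-8, 11, 12 are relevant.
r, p = recall_precision({1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
                        {1, 2, 3, 4, 5, 6, 7, 8, 11, 12})
print(round(r, 2), round(p, 2))  # 0.8 0.8
```

Broadening the query raises recall at the cost of precision, and vice versa, which is why the rounds alternate targets.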

Optimize recall and then refine precision (and recall as necessary)

Topic 1 Test Collection

Topic 2 Test Collection

Topic 3 Test Collection

Round 1

Test & Revise

80% +/- 5% Recall

Combined Topics Test Collection

Test & Revise

Round 2

80% +/- 5% Precision

Build Queries

for each Topic

Topic 1 Query

Topic 2 Query

Topic 3 Query

for each Topic

for each Topic

Combine Queries

Round 3

Same as Round 2

Optimize recall

Optimize precision

Combined Topics Test Collection

for each Topic

Slide21

Integrating text analytics into staff workflows

RWJF Taxonomy Manager

Monitor, Assist and Report

PIMS

RWJF Program Associate

New precis

RWJF Topic(s)

Review & Add

Consultant support (Taxonomy Strategies)

Senior Advisory Group

Slide22

Agenda

Pre-defined Boolean Queries

Case Study

Lessons Learned

Slide23

Lessons learned about building Boolean categorizers

Breaking down broad topics into simple constituent queries facilitates the process of refining recall and precision by making the queries more easily understood and editable.

Representative test collections are essential for building Boolean categorizers, but even when pre-categorized collections exist they may not be the best or most cost-effective option.

It is effective to refine Boolean categorizers by optimizing recall before refining precision (and recall as necessary).

Automated methods should not replace staff but be a means to engage subject matter experts (operational as well as senior level staff) with content and categorization.

Slide24

References

RWJF. (2018). Frequently Asked Questions. Retrieved August 13, 2018. https://www.rwjf.org/en/how-we-work/grants-explorer/faqs.html

Lexalytics. (2018). Semantria. Retrieved August 13, 2018. https://www.lexalytics.com/semantria.

Busch, Joseph. (2018). GoToWebinar - The Current State of Automated Content Tagging: Dangers and Opportunities. July 19, 2018. Retrieved August 13, 2018. Slides - http://www.taxonomystrategies.com/wp-content/uploads/2018/01/Current%20State%20of%20Automated%20Content%20Tagging-Webinar-20180719.pdf. Script - http://www.taxonomystrategies.com/wp-content/uploads/2018/01/Current%20State%20of%20Automated%20Content%20Tagging-20180719.pdf.

Slide25

Resources on recall and precision

Buckland, Michael and Fredric Gey. “The relationship between Recall and Precision.” Journal of the Association for Information Science and Technology 45(1): 12-19 (1994)

Cleverdon, Cyril W. “On the Inverse Relationship of Recall and Precision.” Journal of Documentation 28(3): 195-201 (1972)

Croft, W. Bruce. “Boolean queries and term dependencies in probabilistic retrieval models.” Journal of the Association for Information Science and Technology 37(2): 71-77 (1986)

Slide26

Questions

Joseph Busch,

jbusch@taxonomystrategies.com

joseph@semanticstaffing.com

(m) 415-377-7912

Vivian Bliss,

vbliss@taxonomystrategies.com

(m) 425-417-7628