Using Machine Learning Recommendations to Improve Manual Survey Text Coding



Presentation Transcript

1. Using Machine Learning Recommendations to Improve Manual Survey Text Coding
4/12/23, FEDCASIC 2023
Presenters: Caroline Kery (ckery@rti.org) and Durk Steed (dsteed@rti.org)

2. Roadmap

3. Manual Survey Response Coding

4. Survey Coding: The issue

Free Response Text Entries:
- Spanish Literature
- Ag
- Enlgish
- Band
- I don't know

Label or Code List:
- 00 Uncodeable
- 01 Music, General
- 02 Agriculture Studies
- 03 English
- 04 Spanish Literature

Why is manual text coding challenging? Text often includes:
- Abbreviations
- Misspellings
- Specific terminology or subject matter knowledge that coders must be trained on

5. Survey Coding: The issue

Free Response Text Entries:
- Spanish Literature
- Ag
- Enlgish
- Band
- I don't know

Label or Code List:
- 00 Uncodeable
- 01 Music, General
- 02 Agriculture Studies
- 03 English
- 04 Spanish Literature

The result: on large surveys with many responses, manual coding is often:
- Tedious and expensive (the task is boring and time consuming)
- Requires significant onboarding
- Generally, just the worst

6. Survey Coding: The issue

If it's terrible, why do we do it so often? There are lots of reasons a survey may end up with text that needs labeling:
- Questions with "other" options where the respondent may provide a custom response.
- Open-ended free response questions where the survey givers defined categories after the fact.
- Questions where codes are meant to be selected, but need to be double-checked for accuracy.

7. Survey Coding: The issue

Common solutions:
- Avoid free text as much as possible! Sadly, not always an option.
- Use an automated coding system. Can work great, but you probably still need to review the results, and it can be tripped up by abbreviations and misspellings.
- Pass around an Excel file. Can work great for small projects with a few coders, but gets stressful the larger the task gets.
- Use labeling software. What we are talking about today!

8. Introducing: SMART!

Open-source labeling software developed by RTI, originally to build machine learning datasets.
In the works: releasing SMART version 3.0.0, which adds features for large survey coding.
SMART has seen growing use within RTI and continues to be updated and improved based on user feedback.
SMART docs: https://smart-app.readthedocs.io/en/latest/

9. SMART: continued

10. Labeling software exists, so why use our own?

11. Some Select Problems and Solutions

Problem: Surveys were wasting time coding the same text over and over.
Solution: SMART's deduplication system was expanded so unique entries only had to be coded once (a minimal sketch of this idea follows below).

Problem: Existing workflows made mistakes in coding complicated to fix.
Solution: SMART refined its label history system so coders and administrators could change past labels easily.

Problem: Surveys were constantly getting new data that needed coding.
Solution: SMART added scheduled database imports of unlabeled data and exports of labeled data.

Problem: Surveys can have thousands of labels to search through.
Solution: SMART ... did a lot of things; let's talk about it!
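The deduplication idea can be illustrated with a minimal sketch, assuming a pandas workflow: collapse trivially different entries so each unique response is coded once, then map the labels back to every row. The column names, normalization rule, and example codes are assumptions for illustration, not SMART's internal implementation.

```python
# Hypothetical sketch of deduplicating survey responses before coding.
import pandas as pd

responses = pd.DataFrame({"response_text": ["Ag", "ag ", "Spanish Literature", "Ag"]})

# Normalize case and whitespace so near-identical entries collapse together
responses["normalized"] = responses["response_text"].str.strip().str.lower()
unique_to_code = responses["normalized"].drop_duplicates()
print(len(responses), "rows ->", len(unique_to_code), "unique entries to code")

# Coders label only the unique entries (labels shown here are illustrative)
codes = {"ag": "02 Agriculture Studies", "spanish literature": "04 Spanish Literature"}
responses["label"] = responses["normalized"].map(codes)
```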

12. The premise

- SMART was designed for just a few labels, but we quickly learned that surveys can often have thousands of labels.
- Because of this, finding the correct label could be time consuming.
- A label suggestion feature that picks out the most likely labels could help coders select labels faster.
- We started out with simple text similarity; we knew we could do better than that.

13. What Are Text Embeddings?

- Embeddings are vector representations of pieces of text.
- They can be mathematically compared for similarity.
- More advanced deep learning methods are capable of capturing the underlying meaning of pieces of text: two text strings with different words, yet similar meaning, can be determined to have high similarity.
- sentence-transformers: maps sentences and paragraphs into a 384-dimensional dense vector space.
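As a minimal sketch of this idea with the sentence-transformers library: encode two differently worded strings and compare them with cosine similarity. The specific model name is an assumption; all-MiniLM-L6-v2 is a common choice that produces 384-dimensional embeddings like the slide describes.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, 384-dim embeddings

# Two strings with different words but related meaning
embeddings = model.encode(["Ag", "Agriculture Studies"], convert_to_tensor=True)

# Cosine similarity near 1.0 indicates the texts are semantically similar
print(float(util.cos_sim(embeddings[0], embeddings[1])))
```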

14. Usage With SMART

For projects with more than 5 labels, SMART automatically generates embeddings of the labels and their descriptions. When a user then goes to code items, SMART presents the top five labels whose embeddings are closest to the embedding of the response text.
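A hypothetical illustration of that top-five suggestion step is below; SMART's internal implementation may differ. Label embeddings would be computed once per project and reused for every incoming response. The model name and example response are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

labels = ["00 Uncodeable", "01 Music, General", "02 Agriculture Studies",
          "03 English", "04 Spanish Literature"]
label_embeddings = model.encode(labels, convert_to_tensor=True)  # computed once per project

response_embedding = model.encode("AP Spanish lit", convert_to_tensor=True)

# Rank labels by cosine similarity and keep the five closest as suggestions
hits = util.semantic_search(response_embedding, label_embeddings, top_k=5)[0]
for hit in hits:
    print(labels[hit["corpus_id"]], round(hit["score"], 3))
```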

15. The Result: Improved Accuracy

- Previous string-matching calculations were ~43% accurate (correct label within the top 5 suggested labels).
- Pre-trained SBERT.net model: ~56% accurate.
- Custom, trained model (base SBERT.net model + common text-abbreviations dataset): ~70% accurate.

16. Lessons learned (Embeddings)

- The "black box" issue:
  - Embeddings can be amazing, but they need vetting.
  - Leveraging giant models trained on a huge corpus is both good and bad: by default, the model cannot tell which associations are great and which are problematic.
- Embeddings may not pick up domain-specific acronyms or terms:
  - The same acronym can mean many things across domains.
  - Shortened or slang terms may depend on the topic (ex: "diff-eq" = differential equations).
  - We learned to update the embeddings with term pairs. These include terms that should be associated together (ex: STEAM = science, technology, engineering, art, and math) and terms that should not be associated together (ex: woman / home economics). A sketch of one way to do this follows below.
- Embedding models are large and can be slow, so we need to be careful about when and how we use them.
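One possible way to push such term pairs into the model is to fine-tune it with labeled similarity targets. This is an assumption about the approach, not SMART's actual training code; the base model, loss choice, and training pairs are all illustrative.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model

train_examples = [
    # Pairs that should be associated (target cosine similarity near 1.0)
    InputExample(texts=["STEAM", "science, technology, engineering, art, and math"], label=1.0),
    InputExample(texts=["diff-eq", "differential equations"], label=1.0),
    # Pairs that should NOT be associated (target cosine similarity near 0.0)
    InputExample(texts=["woman", "home economics"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# A single short pass is enough to illustrate the mechanics
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```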

17. Lessons learned (Development with multiple user groups)

18. Future work

- Working to improve the embeddings and make them more targeted for the SME of each project.
- Use past labeled data to help inform the embeddings.
- RTI is currently funding an internal research and development project to look into these features.

19. Thank you!

For future questions, email us at: ckery@rti.org, dsteed@rti.org