1. The Insider’s Guide to Accessing NLM Data
   EDirect for PubMed
   Part 3: Formatting Results and Unix Tools
   Kate Majewski
   National Library of Medicine
   National Institutes of Health
   U.S. Department of Health and Human Services
2. Remember our theme…
   Get exactly the data you need…
   and only the data you need…
   in the format you need.
3. EDirect for PubMed Agenda
   Part 1: Getting PubMed Data
   Part 2: Extracting Data from XML
   Part 3: Formatting Results and Unix Tools
   Part 4: xtract Conditional Arguments
   Part 5: Developing and Building Scripts
4. Today’s Agenda
   Quick Recap of Part Two
   Grouping elements with -block
   Customizing separators with -tab and -sep
   Saving to a file
   Reading from a file
5. Recap of Part Two
   xtract: pulls data from XML and arranges it in a table
   -pattern: defines rows for xtract
   -element: defines columns for xtract
6. Recap of Part Two (cont'd)
   Identify XML elements by name
      ArticleTitle
   Identify specific child elements with Parent/Child construction
      MedlineCitation/PMID
   Identify attributes with "@"
      MedlineCitation@Status
7. Questions from last class? Homework?
8. -tab and -sep
   -tab changes the separator after each column
   -sep changes the separator between multiple values in the same column
9. -tab "\t" -sep "\t"
   xtract Command:
      xtract -pattern PubmedArticle -tab "\t" -sep "\t" \
        -element MedlineCitation/PMID ISSN LastName
   Output:
      24102982    1742-4658    Wu    Doyle    Barry    Beauvais
      21171099    1097-4598    Wu    Gussoni
      17150207    0012-1606    Yoon    Molloy    Wu    Cowan    Gussoni
10. -tab "\t" -sep " "
   xtract Command:
      xtract -pattern PubmedArticle -tab "\t" -sep " " \
        -element MedlineCitation/PMID ISSN LastName
   Output:
      24102982    1742-4658    Wu Doyle Barry Beauvais
      21171099    1097-4598    Wu Gussoni
      17150207    0012-1606    Yoon Molloy Wu Cowan Gussoni
11. -tab "|" -sep " "
   xtract Command:
      xtract -pattern PubmedArticle -tab "|" -sep " " \
        -element MedlineCitation/PMID ISSN LastName
   Output:
      24102982|1742-4658|Wu Doyle Barry Beauvais
      21171099|1097-4598|Wu Gussoni
      17150207|0012-1606|Yoon Molloy Wu Cowan Gussoni
12. -tab "|" -sep ", "
   xtract Command:
      xtract -pattern PubmedArticle -tab "|" -sep ", " \
        -element MedlineCitation/PMID ISSN LastName
   Output:
      24102982|1742-4658|Wu, Doyle, Barry, Beauvais
      21171099|1097-4598|Wu, Gussoni
      17150207|0012-1606|Yoon, Molloy, Wu, Cowan, Gussoni
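Output delimited this way is easy to hand to standard Unix tools. The following is a minimal sketch, not part of the original slides: the heredoc stands in for real xtract output (the three rows are copied from slide 12), and the variable names `rows`, `issns`, and `counts` are illustrative only.

```shell
# Sample rows copied from the slide; in practice these would come from xtract.
rows=$(cat <<'EOF'
24102982|1742-4658|Wu, Doyle, Barry, Beauvais
21171099|1097-4598|Wu, Gussoni
17150207|0012-1606|Yoon, Molloy, Wu, Cowan, Gussoni
EOF
)

# -tab "|" made each column a "|"-delimited field, so cut can pick column 2.
issns=$(printf '%s\n' "$rows" | cut -d'|' -f2)
printf '%s\n' "$issns"

# -sep ", " kept every last name inside column 3, so awk can split that
# one field on ", " and report how many names each record has.
counts=$(printf '%s\n' "$rows" | awk -F'|' '{ n = split($3, names, ", "); print $1 "|" n }')
printf '%s\n' "$counts"
```

Choosing a -tab character that never occurs in the data (here "|") is what lets `cut` and `awk` find the column boundaries reliably.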
13. With -tab/-sep, order matters!
   xtract Command:
      xtract -pattern PubmedArticle \
        -element MedlineCitation/PMID -tab "|" -element ISSN \
        -tab ":" -element Volume Issue
   Output:
      24102982    1742-4658|280:23
      21171099    1097-4598|43:1
      17150207    0012-1606|301:1
   -tab/-sep only affect subsequent -elements
14. With -tab/-sep, order matters!
   xtract Command:
      xtract -pattern PubmedArticle \
        -element MedlineCitation/PMID -tab "|" -element ISSN \
        -tab ":" -element Volume Issue
   Output:
      24102982    1742-4658|280:23
      21171099    1097-4598|43:1
      17150207    0012-1606|301:1
   Later -tab/-sep overwrite earlier ones
15. Exercise 1
   Write an xtract command that:
      Has a new row for each PubMed record
      Has columns for PMID, Journal Title Abbreviation, and Author-supplied Keywords
      Each column should be separated by "|"
      Multiple keywords in the last column should be separated with commas
   Your output should look like this:
      26359634|Elife|Argonaute,RNA silencing,biochemistry[…]
16. Exercise 1 Solution
      xtract -pattern PubmedArticle -tab "|" -sep "," \
        -element MedlineCitation/PMID ISOAbbreviation Keyword
17. Getting Author Information
   We want a list of all of the authors for each citation.
      One row per PubMed record
      PMID
      All of the authors’ last names and initials
18. Authors: First Draft
   We want a list of all of the authors for each citation
   Try:
      xtract -pattern PubmedArticle \
        -element MedlineCitation/PMID LastName Initials
   Doesn't work the way we expect
      Shows all the last names, then all the initials
      We want to retain the relationship between last name and corresponding initials
19. xtract-ing authors
   xtract Command:
      xtract -pattern PubmedArticle \
        -element MedlineCitation/PMID LastName Initials
   XML input:
      <PubmedArticle>
        <MedlineCitation>
          <PMID>98765432</PMID>
          <Author>
            <LastName>Wu</LastName>
            <Initials>MP</Initials>
          </Author>
          <Author>
            <LastName>Billings</LastName>
            <Initials>JS</Initials>
          </Author>
          <Author>
            <LastName>Melendez</LastName>
            <Initials>BJ</Initials>
          </Author>
          <Author>
            <LastName>Collins</LastName>
            <Initials>FS</Initials>
          </Author>
      […]
   xtract output:
      98765432    Wu    Billings    Melendez    Collins    MP    JS    BJ    FS
20. -block
   Groups multiple child elements of the same parent element
      xtract -pattern PubmedArticle -element MedlineCitation/PMID \
        -block Author -element LastName Initials
21. How -block works
   xtract Command:
      xtract -pattern PubmedArticle -element MedlineCitation/PMID \
        -block Author -element LastName Initials
   XML input:
      <PubmedArticle>
        <MedlineCitation>
          <PMID>98765432</PMID>
          <Author>
            <LastName>Wu</LastName>
            <Initials>MP</Initials>
          </Author>
          <Author>
            <LastName>Billings</LastName>
            <Initials>JS</Initials>
          </Author>
          <Author>
            <LastName>Melendez</LastName>
            <Initials>BJ</Initials>
          </Author>
          <Author>
            <LastName>Collins</LastName>
            <Initials>FS</Initials>
          </Author>
      […]
   xtract output:
      98765432    Wu    MP    Billings    JS    Melendez    BJ    Collins    FS
22. This is good, but we can do better
   Everything is separated by tabs
   xtract Command:
      xtract -pattern PubmedArticle -element MedlineCitation/PMID \
        -block Author -element LastName Initials
   Output:
      24102982    Wu    MP    Doyle    JR    Barry    B    Beauvais    A
      21171099    Wu    MP    Gussoni    E
      17150207    Yoon    S    Molloy    MJ    Wu    MP    Cowan    DB
23. What we know so far…
   xtract Command:
      xtract -pattern PubmedArticle -tab "|" -sep ", " \
        -element MedlineCitation/PMID ISSN LastName
   Output:
      24102982|1742-4658|Wu, Doyle, Barry, Beauvais
      21171099|1097-4598|Wu, Gussoni
      17150207|0012-1606|Yoon, Molloy, Wu, Cowan, Gussoni
24. Two elements in the same column
   Use a comma to group multiple elements
   xtract Command:
      xtract -pattern PubmedArticle -element MedlineCitation/PMID \
        -block Author -sep " " -element LastName,Initials
   Output:
      24102982    Wu MP    Doyle JR    Barry B    Beauvais A
      21171099    Wu MP    Gussoni E
      17150207    Yoon S    Molloy MJ    Wu MP    Cowan DB    Gussoni E
25. How -block creates columns
   xtract Command:
      xtract -pattern PubmedArticle -element MedlineCitation/PMID \
        -block Author -sep " " -element LastName,Initials
   Output:
      24102982    Wu MP    Doyle JR    Barry B    Beauvais A
      21171099    Wu MP    Gussoni E
      17150207    Yoon S    Molloy MJ    Wu MP    Cowan DB    Gussoni E
26. "-block" resets -tab/-sep to default
   xtract Command:
      xtract -pattern PubmedArticle -tab "|" \
        -element MedlineCitation/PMID \
        -block Author -sep " " -element LastName,Initials
   Output:
      24102982|Wu MP    Doyle JR    Barry B    Beauvais A
      21171099|Wu MP    Gussoni E
      17150207|Yoon S    Molloy MJ    Wu MP    Cowan DB    Gussoni E
27. "-block" resets -tab/-sep to default
   xtract Command:
      xtract -pattern PubmedArticle -tab "|" \
        -element MedlineCitation/PMID \
        -block Author -tab "|" -sep " " -element LastName,Initials
   Output:
      24102982|Wu MP|Doyle JR|Barry B|Beauvais A
      21171099|Wu MP|Gussoni E
      17150207|Yoon S|Molloy MJ|Wu MP|Cowan DB|Gussoni E
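Once each author sits in its own "|"-delimited column, ordinary Unix tools can reshape the table. A minimal sketch, not from the original slides: the heredoc reuses the three output rows from slide 27 in place of live xtract output, and the variable names are illustrative assumptions.

```shell
# Rows copied from slide 27; a real run would pipe xtract output instead.
rows=$(cat <<'EOF'
24102982|Wu MP|Doyle JR|Barry B|Beauvais A
21171099|Wu MP|Gussoni E
17150207|Yoon S|Molloy MJ|Wu MP|Cowan DB|Gussoni E
EOF
)

# Drop the PMID column (field 1), then put every author on its own line.
authors=$(printf '%s\n' "$rows" | cut -d'|' -f2- | tr '|' '\n')

# Count how often each author appears and keep the most frequent one.
# awk rearranges "count Last Initials" into "Last Initials count".
top=$(printf '%s\n' "$authors" | sort | uniq -c | sort -rn | head -n 1 |
      awk '{ print $2 " " $3 " " $1 }')
printf '%s\n' "$top"
```

This works only because -sep " " kept each last name and its initials together in one field; with the first-draft output the pairing would already be lost.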
28. Exercise 2
   Write an xtract command that:
      Has a new row for each PubMed record
      Has a column for PMID
      Lists all of the MeSH headings, separated by "|"
      If a heading has subheadings attached, separate the heading and subheadings with "/"
   Your output should look like this:
      24102982|Cell Fusion|Myoblasts/cytology/metabolism|Muscle Development/physiology
29. Exercise 2 Solution
      xtract -pattern PubmedArticle -tab "|" \
        -element MedlineCitation/PMID -block MeshHeading \
        -tab "|" -sep "/" -element DescriptorName,QualifierName
30. Saving Results to a File
   Use ">"
   Save in the format of your choice
   Example:
      efetch -db pubmed -id 24102982,21171099,17150207 \
        -format xml > testfile.txt
   Check using
      ls
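The behavior of ">" can be tried without any network access. In this minimal sketch, `printf` stands in for efetch, the scratch directory and its contents are assumptions made for the demo, and the ">>" append operator is a standard shell feature that the slides themselves do not cover.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir"

# ">" creates the file, or silently overwrites it if it already exists.
printf '24102982\n21171099\n' > testfile.txt
printf '17150207\n' > testfile.txt      # the first two PMIDs are now gone

# ">>" appends to the end of the file instead of replacing it.
printf '24102982\n' >> testfile.txt

ls                 # confirms that testfile.txt exists
cat testfile.txt   # shows what survived the overwrite plus the appended line
```

Because ">" overwrites without warning, it can be worth running `ls` before redirecting into a file name that might already be in use.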
31. But where is my file!?
   Try
      pwd
   Cygwin users: try this:
      $ cygpath -w ~
   Mac users: look in your Users folder:
      Users/<your user name>/
32. Another way to find your files
   Find the "edirect" folder on your computer
   Save a file with a distinctive name, then search for it.
   Example:
      efetch -db pubmed -id 24102982,21171099,25359968,17150207 \
        -format uid > specialname.csv
33. Exercise 3: Retrieving XML
   How can I get the full XML of all articles about the relationship of Zika Virus to microcephaly in Brazil? Save your results to a file.
34. Exercise 3 Solution
      esearch -db pubmed \
        -query "zika virus microcephaly brazil" | \
        efetch -format xml > zika.xml
35. cat
   Short for concatenate
   Used to open files and display them on screen
   Can also combine/append files.
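The three uses of cat listed above can be sketched as follows. The file names (`batch1.txt`, `batch2.txt`, `all.txt`) and the scratch directory are hypothetical, chosen only for this demo; the PMIDs are the ones used elsewhere in these slides.

```shell
# Scratch directory and two small PMID files (hypothetical names).
workdir=$(mktemp -d)
printf '24102982\n21171099\n' > "$workdir/batch1.txt"
printf '17150207\n' > "$workdir/batch2.txt"

# Display: print one file on screen.
cat "$workdir/batch1.txt"

# Combine: write the contents of both files, in order, to a third file.
cat "$workdir/batch1.txt" "$workdir/batch2.txt" > "$workdir/all.txt"

# Append: add batch2.txt to the end of batch1.txt (">>" keeps existing lines).
cat "$workdir/batch2.txt" >> "$workdir/batch1.txt"

cat "$workdir/all.txt"
```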
36. Reading a search string from a file
      esearch -db pubmed -query "$(cat searchstring.txt)"
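The `$(cat ...)` construction is shell command substitution: the shell runs cat first and pastes the file's contents into the command line. A minimal sketch of what that looks like, assuming a searchstring.txt whose contents are borrowed from a query used later in these slides; `printf` only displays the resulting command line rather than contacting PubMed.

```shell
# Save a search string ahead of time (contents chosen for illustration).
workdir=$(mktemp -d)
printf '%s\n' 'asthenopia[mh] AND nursing[sh]' > "$workdir/searchstring.txt"

# $(cat file) is replaced by the file's contents before the command runs.
# The surrounding double quotes keep the whole string as ONE argument,
# even though it contains spaces and square brackets.
query="$(cat "$workdir/searchstring.txt")"

# Show the command line the shell would build for esearch.
cmdline=$(printf 'esearch -db pubmed -query "%s"' "$query")
printf '%s\n' "$cmdline"
```

Keeping long or frequently edited search strategies in a file this way means the command itself never has to change.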
37. Reading a list of PMIDs from a file
   Could use a similar technique
   Requires input to be specially formatted
   Is there another way?
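One plausible reading of "specially formatted" (an assumption, not stated on the slide) is that -id expects a comma-separated list, as in the efetch examples earlier, while a saved UID file holds one PMID per line. A minimal sketch of bridging that gap with the standard `paste` utility; the file name `pmids.txt` is hypothetical, and `printf` only displays the command line that would result.

```shell
# A file with one PMID per line, as "-format uid" output would look.
workdir=$(mktemp -d)
printf '24102982\n21171099\n17150207\n' > "$workdir/pmids.txt"

# paste -s joins all lines of the file into one line;
# -d, puts a comma between them instead of the default tab.
idlist=$(paste -s -d, "$workdir/pmids.txt")

# Show the efetch command line this would produce.
printf 'efetch -db pubmed -id %s -format abstract\n' "$idlist"
```

This is workable for short lists, but it is extra bookkeeping, which is the motivation for the pipeline approach on the following slides.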
38. Piping esearch to efetch
   Pipes the PMIDs retrieved with esearch, and uses them as the -id argument for efetch.
   Also pipes the -db
      esearch -db pubmed -query "asthenopia[mh] AND nursing[sh]" | \
        efetch -format uid
39. EDirect and the History server
   [Diagram labels: esearch; DB and PMIDs; efetch]
40. EDirect and the History server
41. EDirect and the History server
   [Diagram labels: esearch; WebEnv and Query Key; efetch; DB and PMIDs; History server; DB and PMIDs]
42. EDirect and the History server
   [Diagram labels: epost; WebEnv and Query Key; efetch; DB and PMIDs; History server; DB and PMIDs]
43. epost
   Uploads a list of PMIDs to the history server
   Example:
      epost -db pubmed -id 24102982,21171099
44. An epost-efetch pipeline
      cat specialname.csv | epost -db pubmed | efetch -format xml
45. Using the -input argument
      epost -db pubmed -input specialname.csv | \
        efetch -format abstract
46. Coming next time…
   Limiting output using Conditional arguments
47. In the meantime…
   Insider’s Guide online
      https://dataguide.nlm.nih.gov
   Sign up for "utilities-announce" mailing list!
   Questions?
      https://dataguide.nlm.nih.gov/contact
48. Questions?