RESEARCH Open Access

A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature

Shuo Xu 1, Xin An 2, Lijun Zhu 1, Yunliang Zhang 1*, Haodong Zhang 3

Abstract

Background: In order to improve information access on chemical compounds and drugs (chemical entities) described in text repositories, it is crucial to be able to identify chemical entity mentions (CEMs) automatically. The CHEMDNER challenge in BioCreative IV was designed to promote the implementation of corresponding systems that are able to detect mentions of chemical compounds and drugs. It has two subtasks: CDI (Chemical Document Indexing) and CEM.

Results: Our system processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), and post-processing (rule-based approach and format conversion). In our post-challenge system, the cost parameter in the CRF model was optimized by 10-fold cross validation with grid search, and a word representation feature induced by the Brown clustering method was introduced. For the CEM subtask, our official runs were ranked in top position, with the best run obtaining 88.79% precision, 69.08% recall and 77.70% balanced F-measure. These were improved further to 88.43% precision, 76.48% recall and 82.02% balanced F-measure in our post-challenge system.

Conclusions: In our system, instead of extracting a CEM as a whole, we regarded it as a sequence labeling problem. Though our current system has much room for improvement, it is valuable in showing that the performance in terms of balanced F-measure can be improved largely by utilizing large amounts of relatively inexpensive un-annotated PubMed abstracts and by optimizing the cost parameter in the CRF model. From our practice and lessons, if one directly utilizes open-source natural language processing (NLP) toolkits, such as OpenNLP or Stanford CoreNLP, the false positive (FP) rate may be very high. It is better to develop some additional rules to minimize the FP rate if one does not want to re-train the related models. Our CEM recognition system is available at: http://www.SciTeMiner.org/XuShuo/Demo/CEM.
1 Information Technology Supporting Center, Institute of Scientific and Technical Information of China, No. 15 Fuxing Rd., Haidian District, 100038 Beijing, PR China
Full list of author information is available at the end of the article

Xu et al. Journal of Cheminformatics 2015, 7(Suppl 1):S11
http://www.jcheminf.com/content/7/S1/S11

© 2015 Xu et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

There is an increasing interest in improving information access on chemical compounds and drugs (chemical entities) described in text repositories, including scientific articles, patents, health agency reports, or the Web [1]. In order to achieve this goal, it is crucial to be able to identify chemical entity mentions (CEMs) automatically. The recognition of chemical entities is also crucial for other subsequent text processing tasks, such as detection of drug-protein interactions [2], adverse effects of chemical compounds and their associations to toxicological endpoints, or the extraction of pathway and metabolic reaction relations. Though many methods and strategies to recognize chemicals in text have been proposed [3], only a very limited number of publicly accessible CEM recognition systems have been released [4].

The BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge is a community-wide effort to build an evaluation framework for assessing text mining systems in biological domains [5]. The chemical compound and drug named entity recognition (CHEMDNER) challenge in BioCreative IV was specially designed to promote the implementation of systems that are able to detect mentions of chemical compounds and drugs. It has two subtasks: the CDI (Chemical Document Indexing) subtask and the CEM (Chemical Entity Mention) subtask. The CDI subtask is to return a ranked list of the chemical entities described within a given document. The CEM subtask is to provide, for a given document, the start and end indices corresponding to all the chemical entities mentioned in the document.

Here, we present the method, the results and the recognition system from our participation in the CEM subtask of the CHEMDNER challenge [1,6], with some post-challenge system improvements. In our recognition system, instead of extracting a CEM such as "(+)-antiBP-7,8-diol-9,10-epoxide" as a whole, we regard it as a sequence labeling problem. Our main focus in this improved system was to explore the effectiveness of cost parameter optimization [7,8] and of the word representation [9-11] feature for our approach to the CEM subtask. The proposed method combines natural language processing (NLP) strategies with machine learning (ML) techniques to utilize word representation features from large amounts of relatively inexpensive un-annotated PubMed abstracts along with small amounts of annotated ones.

As shown in Figure 1, our system first detects sentence boundaries in the PubMed abstracts and then tokenizes each detected sentence as pre-processing. Next, our system extracts CEMs from text with a conditional random field (CRF) approach [12], followed by post-processing steps including a rule-based approach and a format conversion step. We describe each step in detail in the following sections. Although the current approach has much room for improvement, it produced the top-ranked performance among all submitted runs in the CEM subtask of the BioCreative IV CHEMDNER challenge.

The rest of the article is organized as follows. In the next section, we describe the results of our submission and post-challenge runs on the CEM subtask of the BioCreative IV CHEMDNER challenge. This is followed by discussion and conclusions drawn from our experience. Lastly, the methods employed are explained in detail.
Results and discussion

We analyzed the training, development and testing data sets and found that there are many nested CEMs in the development set, such as "polysorbate 80" (offset: 1138 to 1152) and "polysorbate" (offset: 1138 to 1149) in the abstract of PMID:23064325. See Table 1 for more examples of nested CEM pairs. Since the linear CRF model utilized in this article cannot identify nested CEMs, we simply omit the CEMs with the shorter span. In addition, there may be some annotation errors in the development set, such as the examples in Table 2. We manually corrected these errors before training our CRF model. Table 3 shows a brief overview of the corrected CHEMDNER corpus. Please see [13] for more details on how CEMs were annotated, classified and split into training, development and test data sets.

To evaluate the performance of submitted results, the BioCreative IV competition relied on three performance measures at entity level: recall, precision and F-measure. The recall is the proportion of true CEMs that are correctly predicted. The precision is the proportion of predicted CEMs that are actually true CEMs. The F-measure provides a more balanced evaluation by taking a (weighted) harmonic mean of precision and recall. The recall, precision and F-measure are defined formally as follows.
$$r = \frac{TP}{TP + FN} \tag{1}$$

$$p = \frac{TP}{TP + FP} \tag{2}$$

$$F_\beta = (1 + \beta^2)\,\frac{p \times r}{\beta^2 p + r} \tag{3}$$

where TP (true positive) is the number of correct positive predictions, FN (false negative) is the number of incorrect negative predictions (type II errors), and FP (false positive) is the number of incorrect positive predictions (type I errors). The balanced F-measure (β = 1), the main evaluation metric used for the CEM subtask of the BioCreative IV CHEMDNER competition, can be simplified to:

$$F_1 = 2\,\frac{p \times r}{p + r} \tag{4}$$

Figure 1 The system processing pipeline. The pipeline includes three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach) and post-processing (rule-based approach and format conversion).

In order to make the best of the annotated corpus, we pooled the training and development data sets. The participating teams were allowed 5 days to generate up to five different annotations ("runs") for the test set and to submit the annotations to the organizers. Thus, participating teams could utilize different settings, models or methods while the gold test annotation set was unknown. We submitted five runs for the CEM subtask, each using the same pipeline, but with different values for the cost parameter in the CRF model [12,14]. Due to time constraints, we just set the cost parameter to each element in {2,2,2}. Table 4 presents the official performance scores of our submitted runs. Run 5 performed the best in terms of recall and balanced F-measure. Run 1 performed the best in terms of precision.
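As a worked illustration of equations (1) to (4), the following minimal sketch computes the three measures from raw counts. The function name `precision_recall_f` is an assumption made for this example, not part of the original system. The counts are those listed for the official Run 5 in Table 4 (TP = 17,512, FP = 2,211, FN = 7,839).

```python
def precision_recall_f(tp: int, fp: int, fn: int, beta: float = 1.0):
    """Entity-level precision (2), recall (1) and F_beta (3) from raw counts.

    Hypothetical helper for illustration; zero denominators yield 0.0.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


# Counts listed for the official Run 5 in Table 4.
p, r, f1 = precision_recall_f(tp=17512, fp=2211, fn=7839)
print(f"precision={100 * p:.2f}% recall={100 * r:.2f}% F1={100 * f1:.2f}%")
# prints: precision=88.79% recall=69.08% F1=77.70%
```

The printed values agree with the Run 5 column of Table 4, which is a quick consistency check on the TP, FP and FN counts.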
In fact, the cost parameter trades off over-fitting against under-fitting [12,14]. With a larger cost parameter value, the CRF tends to over-fit the given training corpus. From Table 4, one can easily see that the predicted results were significantly influenced by this parameter. In our post-challenge improved system, 10-fold cross validation at document level is utilized to optimize the cost parameter with grid search [7,8]. Specifically, the pooled training and development data sets are randomly divided into 10 sub-corpora of nearly equal size. For each cost value in the search grid, a CRF model is induced 10 times, each time leaving out one of the sub-corpora, which is then used to calculate the balanced F-measure. An optimal value of the cost is selected from this grid search.

In our post-challenge improved system, we re-obtained five runs for the CEM subtask, each using the same pipeline as the official submissions, but with different feature sets (Table 5). As Table 3 shows, the CHEMDNER corpus includes large amounts of relatively inexpensive un-annotated PubMed abstracts. In order to reduce data sparsity and further improve the performance of our system, a word representation feature is used in our post-challenge system, since it is a simple and general method for semi-supervised learning [11]. Previous studies [11,15,16] show that the word representation feature is very important for improving the balanced F-measure when recognizing pre-defined categories of proper names and bio-entities. Here, the training, development, test and background data sets are pooled to induce word representations of each token by the Brown clustering method [10,17] with 500, 1000, 1500 and 2000 clusters, respectively.

Figure 2 shows the balanced F-measure for post-challenge runs with 10-fold cross validation by grid search [7,8]. Table 6 reports the performance results with the optimal value for the cost parameter. From Figure 2 and by comparing Table 4 and Table 6, it is not difficult to see that the word representation feature largely improved the performance of our system in terms of balanced F-measure and recall, but with a little performance degradation in terms of precision. Run 1, Run 4 and Run 3 performed the best in terms of precision, recall and balanced F-measure, respectively.

Though the annotated CEMs are classified into eight classes, C = {SYSTEMATIC, IDENTIFIER, FORMULA, TRIVIAL, ABBREVIATION, FAMILY, MULTIPLE, NO CLASS}, the annotations of the individual CEM classes are disregarded in our post-challenge system. In order to highlight the existing gaps in the CEM recognition system, performance results for each category in C are also given in Table 4 and Table 6 in terms of recall. As for the official performance scores in Table 4, our system worked best on recognizing FORMULA CEMs for Run 1, Run 2 and Run 3, and SYSTEMATIC CEMs for Run 4 and Run 5. From Table 6, one can see that our post-challenge improved system identified SYSTEMATIC CEMs best. What is more, it seems to be very difficult to recognize MULTIPLE CEMs in both systems. The main reason may be that the number of annotated CEMs is not sufficient for the MULTIPLE category (202, 187 and 199 for the training, development and test data sets, respectively, in Table 3).

Table 1 Nested CEM pairs in the development set of the CHEMDNER corpus
For each row, the CEM with offset in columns 6-7 is nested in the CEM with offset in columns 4-5. The CEMs with respective offsets in columns 6-7 are omitted directly when training our CRF models.

Table 2 Annotation errors in the development set of the CHEMDNER corpus

| ID | PMID | T/A | Start | End | Start (corrected) | End (corrected) |
|---|---|---|---|---|---|---|
| 1 | 23412114 | A | 977 | 984 | 977 | 985 |
| 2 | 23572392 | T | 42 | 55 | 42 | 56 |
| 3 | 23414800 | T | 69 | 89 | 68 | 89 |
| 4 | 23411224 | A | 278 | 288 | 277 | 288 |
| 5 | 23401298 | A | 438 | 502 | 438 | 501 |

The offsets in columns 4-5 are corrected to the ones in columns 6-7.

Table 3 The overview of the corrected CHEMDNER corpus in terms of the number of PubMed abstracts (#Articles), the number of CEMs (#CEMs), and the number of CEMs for each of the CEM classes in C = {SYSTEMATIC, IDENTIFIER, FORMULA, TRIVIAL, ABBREVIATION, FAMILY, MULTIPLE, NO CLASS}. × means the resulting figure is not available.

| | Training | Development | Test | Background |
|---|---|---|---|---|
| #Articles | 3,500 | 3,500 | 3,000 | 17,000 |
| #CEMs | 29,478 | 29,485 | 25,351 | × |
| ABBREVIATION | 4,538 | 4,517 | 4,059 | × |
| FAMILY | 4,090 | 4,212 | 3,622 | × |
| FORMULA | 4,448 | 4,117 | 3,443 | × |
| IDENTIFIER | 672 | 639 | 513 | × |
| MULTIPLE | 202 | 187 | 199 | × |
| SYSTEMATIC | 6,656 | 6,814 | 5,666 | × |
| TRIVIAL | 8,832 | 8,967 | 7,808 | × |
| NO CLASS | 40 | 32 | 41 | × |

Table 4 Official scores for the CEM subtask in the BioCreative IV CHEMDNER competition

| | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| cost | | | | | |
| TP | 15,821 | 16,531 | 16,991 | 17,328 | 17,512 |
| FP | 1,834 | 2,007 | 2,009 | 2,129 | 2,211 |
| FN | 9,530 | 8,820 | 8,360 | 8,023 | 7,839 |
| Precision (%) | 89.61 | 89.17 | 89.43 | 89.06 | 88.79 |
| Recall (%) | 62.41 | 65.21 | 67.02 | 68.35 | 69.08 |
| F1 score (%) | 73.58 | 75.33 | 76.62 | 77.34 | 77.70 |
| ABBREVIATION | 53.90 | 55.75 | 56.74 | 57.63 | 58.24 |
| FAMILY | 59.50 | 62.20 | 64.80 | 66.23 | 67.28 |
| FORMULA | 72.76 | 74.30 | 74.88 | 75.46 | 75.84 |
| IDENTIFIER | 68.62 | 70.18 | 69.98 | 69.79 | 69.59 |
| MULTIPLE | 27.64 | 32.16 | 28.64 | 31.66 | 31.66 |
| SYSTEMATIC | 69.57 | 72.61 | 74.32 | 75.61 | 76.35 |
| TRIVIAL | 58.95 | 62.73 | 65.48 | 67.41 | 68.26 |
| NO CLASS | 51.22 | 51.22 | 56.10 | 58.54 | 58.54 |

Conclusions

In this article, we present our post-challenge system and its performance for the CEM subtask of the BioCreative IV CHEMDNER challenge. Our system processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), and post-processing (rule-based approach and format conversion). Our main focus in this improved system was to explore the effectiveness of cost parameter optimization and of the word representation feature for the CEM subtask.

In our post-challenge improved system, instead of extracting a CEM as a whole, we regarded it as a sequence labeling problem. The well-known CRF model is utilized to solve the sequence labeling problem, and its cost parameter is optimized by 10-fold cross validation with grid search. Different feature types, including general linguistic, character, case pattern, contextual, and word representation features, were exploited for our runs. In order to reduce data sparsity in the annotated training and development data sets, word representations were induced from the pooled training, development, test and background data sets by the Brown clustering method.
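The cost optimization described above (10-fold cross validation at document level combined with grid search) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `kfold_documents`, `grid_search_cost` and the `train_and_score` callback are hypothetical names, the callback stands in for training a CRF on the training documents and returning the balanced F-measure on the held-out ones, and the candidate cost values in the usage lines are made up because the original grid is not legible in this copy.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def kfold_documents(doc_ids: Sequence[str], k: int = 10, seed: int = 0) -> List[List[str]]:
    """Randomly split document ids into k folds of nearly equal size."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def grid_search_cost(
    doc_ids: Sequence[str],
    costs: Sequence[float],
    train_and_score: Callable[[List[str], List[str], float], float],
    k: int = 10,
    seed: int = 0,
) -> Tuple[float, Dict[float, float]]:
    """Return the cost with the highest mean held-out F1, plus all mean scores."""
    folds = kfold_documents(doc_ids, k, seed)
    mean_f1: Dict[float, float] = {}
    for cost in costs:
        scores = []
        for i, held_out in enumerate(folds):
            # Train on the other k-1 folds, evaluate on the held-out fold.
            train = [d for j, fold in enumerate(folds) if j != i for d in fold]
            scores.append(train_and_score(train, held_out, cost))
        mean_f1[cost] = mean(scores)
    best = max(mean_f1, key=mean_f1.get)
    return best, mean_f1


# Usage with a dummy scorer that peaks at cost 1.0 (hypothetical grid and scorer).
docs = [f"doc{i}" for i in range(7000)]
dummy = lambda train, test, cost: 1.0 - abs(cost - 1.0) / 10.0
best_cost, table = grid_search_cost(docs, [0.25, 0.5, 1.0, 2.0, 4.0], dummy)
print(best_cost)
```

Splitting by document id rather than by sentence keeps all sentences of one abstract in the same fold, which is what "at document level" requires.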
Finkel & Manning [18] proposed a model specifically for recognizing nested named entities by using a discriminative constituency parser. The model explicitly represents the nested structure, allowing entities to be influenced not just by the labels of the tokens surrounding them, as in a CRF, but also by the entities contained in them, and in which they are contained. In ongoing work, this model will be introduced for recognizing nested CEMs.

Though our current system has much room for improvement, it is valuable in showing that the performance in terms of balanced F-measure can be improved largely by utilizing large amounts of relatively inexpensive un-annotated PubMed abstracts. From our practice and lessons, if we directly use open-source NLP toolkits, such as OpenNLP or Stanford CoreNLP, the false positive rate may be very high. It is better to develop some additional rules to minimize the false positive rate if one does not want to re-train the related models.

Table 5 Feature combinations used for post-challenge runs on the CEM subtask. Columns: General Linguistic, Character, Case Pattern, Contextual, and Word Representation (500, 1000, 1500, 2000 clusters); rows: Run 1 to Run 5.

Figure 2 The balanced F-measure for post-challenge runs with 10-fold cross validation by grid search.

Table 6 Performance results in our post-challenge improved system for the CEM subtask in the BioCreative IV CHEMDNER competition

| | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| cost | | | | | |
| TP | 18,025 | 19,259 | 19,389 | 19,495 | 19,355 |
| FP | 2,312 | 2,671 | 2,537 | 2,694 | 2,505 |
| FN | 7,326 | 6,092 | 5,962 | 5,856 | 5,996 |
| Precision (%) | 88.63 | 87.82 | 88.43 | 87.86 | 88.54 |
| Recall (%) | 71.10 | 75.97 | 76.48 | 76.90 | 76.35 |
| F1 score (%) | 78.91 | 81.47 | 82.02 | 82.02 | 81.99 |
| ABBREVIATION | 59.77 | 63.37 | 64.28 | 65.85 | 65.98 |
| FAMILY | 72.36 | 73.55 | 72.92 | 73.94 | 72.97 |
| FORMULA | 73.02 | 73.45 | 74.12 | 74.01 | 74.30 |
| IDENTIFIER | 64.72 | 63.35 | 66.67 | 64.33 | 66.86 |
| MULTIPLE | 23.62 | 32.16 | 35.18 | 33.17 | 30.65 |
| SYSTEMATIC | 79.19 | 82.86 | 83.25 | 83.22 | 82.56 |
| TRIVIAL | 71.41 | 81.76 | 82.38 | 82.70 | 81.56 |
| NO CLASS | 53.66 | 63.41 | 63.41 | 68.29 | 63.41 |

Methods

Pre-processing: sentence detection & tokenization

A sentence detector identifies whether a punctuation character marks the end of a sentence or not. Here, the sentence detector in OpenNLP [19] is utilized. However, sentence boundary identification is challenging because punctuation marks are often ambiguous [20]. In order to further improve the performance of sentence detection, we collected many abbreviations, such as syn., etc. from the training and development sets. Then we generated several rules, for example: if the current sentence ends with one of these abbreviations or with a comma, or if the next sentence starts with a lower-case letter, then the current and next sentences are merged into a new one.

A tokenizer divides each sentence obtained above into tokens, which usually correspond to words, punctuation, numbers, etc. However, to capture individual components within a CEM, similar to Wei et al. [21], we performed tokenization on a finer level. Specifically, the special characters in Table 7, numbers, and Greek symbols are split off as separate tokens. An example is shown in Table 8. Plural upper-case abbreviations are also separated into two tokens. As a matter of fact, before any pre-processing, we also merged some special characters with the same meaning.

Recognition: CRF-based approach

As mentioned in Background, we treat the CEM recognition problem as a sequence labeling one (see Table 8). As a type of discriminative undirected probabilistic model, CRFs [12,14] are often used for labeling or parsing sequential data, such as natural language text or biological sequences. CRFs [22-24] have been applied successfully to identify various bio-entities, such as genes and proteins, and have shown good performance. Given a token sequence $\vec{x} = (x_1, x_2, \cdots, x_N)$, a CRF defines the conditional probability distribution $\Pr(\vec{y} \mid \vec{x})$ of a label sequence $\vec{y} = (y_1, y_2, \cdots, y_N)$ as follows.

$$\Pr(\vec{y} \mid \vec{x}) \propto \exp\Big( \vec{w}^{T} \sum_{n=1}^{N} \vec{f}(\vec{y}, \vec{x}, n) \Big) \tag{5}$$

Here, $\vec{w} = (w_1, w_2, \cdots, w_M)^T$ is a global feature weight vector, $\vec{f}(\vec{y}, \vec{x}, n) = (f_1(\vec{y}, \vec{x}, n), f_2(\vec{y}, \vec{x}, n), \cdots, f_M(\vec{y}, \vec{x}, n))^T$ is a local feature vector function, and M is the number of feature functions. The weight vector $\vec{w}$ can be obtained from the training and development sets by the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [25] method.

The traditional BIEO label set is used in our post-challenge improved system. That is to say, each token is labeled as being the beginning of (B), the inside of (I), the end of (E) or entirely outside (O) of a span of interest. Here, CRF++ [26] is adopted for the actual implementation. In CRF++, there are 4 major parameters (-a, -c, -f and -p) to control the training condition. In our submitted predictions and post-challenge ones, the algorithm parameter -a was consistently set to CRF-L2, and the two remaining non-cost parameters to 2 and 4, respectively. The cost option -c is optimized with 10-fold cross validation, as introduced above.

Features for our CRF model

Our system exploits the following types of features:

General linguistic features
Our system includes the original uni-tokens and bi-tokens, as well as stemmed uni-tokens, bi-tokens and tri-tokens, as features, using the Porter stemmer [27] from Stanford CoreNLP [28].

Character features
Since many CEMs contain numbers, Greek letters, Roman numerals, amino acids, chemical elements, and special characters, our system calculates several statistics as features for each token, including its number of digits, number of upper- and lower-case letters, number of all characters, and the presence or absence of specific characters, Greek letters, Roman numerals, amino acids, or chemical elements.

Case pattern features
Similar to [21], any upper-case alphabetic character is replaced by 'A', any lower-case one is replaced by 'a', and any digit (0-9) is replaced by '0'. Moreover, our system also merges consecutive letters and digits and generates additional single-letter 'a' and single-digit '0' features.

Contextual features
For each token, our system includes a combination of the current output token and the previous output token (bigram).

Word representation features
Common approaches to inducing unsupervised word representations include clustering, perhaps hierarchical, such as the Brown clustering method [17], and embeddings, such as Collobert and Weston embeddings [29] and hierarchical log-bilinear model (HLBL) embeddings [30]. Here, the Brown clustering method is used. The implementation of the Brown clustering method by Liang [31] is adopted in our post-challenge system.

The result of running the Brown clustering method is a binary tree, where each token occupies a single leaf node, and where each leaf node contains a single token. The root node defines a cluster containing the entire token set. Interior nodes represent intermediate-size clusters containing all of the tokens that they dominate.
Thus, nodes lower in the binary tree correspond to smaller token clusters, while higher nodes correspond to larger token clusters. According to Huffman coding [32], a particular token can be assigned a binary string by following the traversal path from the root to its leaf, assigning a 0 for each left branch and a 1 for each right branch.

Intuitively, the Brown clustering method will merge the tokens with similar contexts into the same cluster. Thus, the longer the prefix shared by the binary strings of two tokens, the more similar the tokens. Table 9 shows some token examples and their binary string representations with 500 clusters. Take Table 9 as an example. According to the main idea of the Brown clustering method, the token "interpeak" (01100110110) is more similar than the token "aquaporine" (01101110011) to the token "florbetapir" (0110011010).

Post-processing: rule-based approach & format conversion

On closer examination, we find that the results of the CRF approach include some false positive CEMs, such as "25 (3),186-193", "1-D,2-D" and so on. So, we developed several additional regular expressions to remove them. In addition, our post-processing step also helps adjust the text spans of CEMs, for example by adding a missing closing parenthesis, turning "[4Fe-4S](2+" into "[4Fe-4S](2+)". All of the adjustment rules are listed in Table 10. Here, #(·, str) means the number of occurrences of the string str in the CEM of interest, right(·, n) and left(·, n) denote the substring of length n immediately to the right or left of the CEM, and offset(·, start) and offset(·, end) indicate the start or end offset of the CEM. Take the first row in Table 10 as an example. It means that if the number of occurrences of "(" in the CEM exceeds that of ")" by one, and if the substring of length 1 to the right of the CEM is ")", then the end offset of the CEM is moved one character further to the right.

Finally, we converted the recognized CEMs into the official format with the resulting confidence scores. In our system, the confidence score is simply set to the average conditional probability of the tokens composing the CEM of interest, formally defined as follows.
$$\mathrm{score}(CEM) = \frac{1}{|CEM|} \sum_{t \in CEM} \mathrm{CondProb}(t) \tag{6}$$

where |CEM| means the number of token components of a CEM. Take "[C(8)mim][PF(6)]" in Table 8 as an example. Its confidence score is calculated as follows.

$$\mathrm{score}(\text{[C(8)mim][PF(6)]}) = \frac{1}{13} \sum_{t \in \text{[C(8)mim][PF(6)]}} \mathrm{CondProb}(t) = 0.963655 \tag{7}$$

Table 7 Special characters included in our tokenizer
( ) [ ] { } , . / \ ‘ ™ @ · © ® “ : = + - ? _ | ¬ ± ~ * % ‰ # & ; ! £ € ¥ $ × ÷ ‡ † ... \b \t \n \f

Table 8 An example of CEM component labels in an excerpt "...[C(8)mim][PF(6)]..." in PMID:23265515

| Token | Label | Conditional prob. |
|---|---|---|
| ... | O | ... |
| [ | B | 0.994456 |
| C | I | 0.997241 |
| ( | I | 0.999912 |
| 8 | I | 0.999914 |
| ) | I | 0.999853 |
| mim | I | 0.997244 |
| ] | I | 0.996372 |
| [ | I | 0.996110 |
| PF | I | 0.995940 |
| ( | I | 0.996733 |
| 6 | I | 0.996693 |
| ) | I | 0.825782 |
| ] | E | 0.731261 |
| ... | O | ... |

Table 9 Sample tokens and their resulting binary string representations with 500 clusters

| ID | Token | Binary string |
|---|---|---|
| 1 | gracile | 010011 |
| 2 | quintile | 010010 |
| 3 | florbetapir | 0110011010 |
| 4 | interpeak | 01100110110 |
| 5 | aquaporine | 01101110011 |

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SX and YZ developed the CEM recognition system. SX and HZ conducted extensive experiments and drafted the manuscript. LZ and XA conceived of the supporting projects, participated in the resulting design and coordination, and helped draft the manuscript. All authors read and approved the final manuscript.
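For reference, the confidence score of equation (6) can be checked numerically with a few lines of code. This is only a sketch: `confidence_score` is a hypothetical helper name, and the probabilities are the thirteen per-token values listed for "[C(8)mim][PF(6)]" in Table 8.

```python
from typing import Sequence


def confidence_score(token_probs: Sequence[float]) -> float:
    """Average of the per-token conditional probabilities, as in equation (6)."""
    if not token_probs:
        raise ValueError("a CEM must contain at least one token")
    return sum(token_probs) / len(token_probs)


# Conditional probabilities of the 13 tokens of "[C(8)mim][PF(6)]" (Table 8).
probs = [
    0.994456, 0.997241, 0.999912, 0.999914, 0.999853, 0.997244, 0.996372,
    0.996110, 0.995940, 0.996733, 0.996693, 0.825782, 0.731261,
]
score = confidence_score(probs)
print(round(score, 6))  # prints: 0.963655
```

The result matches the value given in equation (7).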
Declarations

This work was supported partially by Fundamental Research Funds for the Central Universities: Research on Forest Property Circulation Mechanism in Collective Forest Area (JGTD2014-04), Beijing Forestry University Young Scientist Fund: Research on Econometric Methods of Auction with their Applications in the Circulation of Collective Forest Right (BLX2011028), the National Science Foundation of China: Research on Technology Opportunity Detection based on Paper and Patent Information Resources (71403255), Key Technologies R&D Program of Chinese 12th Five-Year Plan (2011-2015): STKOS Collaborative Construction System and Auxiliary Tool Development (2011BAH10B02), and Key Work Project of Institute of Scientific and Technical Information of China (ISTIC): Intelligent Analysis Service Platform and Application Demonstration for Multi-Source Science and Technology Literature in the Era of Big Data (ZD2014-7-1).

This article has been published as part of Journal of Cheminformatics Volume 7 Supplement 1, 2015: Text mining for chemistry and the CHEMDNER track. The full contents of the supplement are available online at http://www.jcheminf.com/supplements/7/S1.

Authors' details

1 Information Technology Supporting Center, Institute of Scientific and Technical Information of China, No. 15 Fuxing Rd., Haidian District, 100038 Beijing, PR China.
2 School of Economics and Management, Beijing Forestry University, No. 35 Qinghua East Rd., Haidian District, 100083 Beijing, PR China.
3 Network Center, Science and Technology Daily, No. 15 Fuxing Rd., Haidian District, 100038 Beijing, PR China.

Published: 19 January 2015

References

1. Krallinger M, Leitner F, Rabal O, Vazquez M, Miguel J, Valencia A: CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform 2015, 7(Suppl 1):S1.
2. Li J, Zhu X, Chen JY: Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Computational Biology 2009, 5(7):1000450, doi:10.1371/journal.pcbi.1000450.
3. Eltyeb S, Salim N: Chemical named entities recognition: A review on approaches and applications. Journal of Cheminformatics 2014, 6(17):1-12, doi:10.1186/1758-2946-6-17.
4. Vazquez M, Krallinger M, Leitner F, Valencia A: Text mining for drugs and chemical compound: Methods, tools and applications. Molecular Informatics 2011, 30(6-7):506-519, doi:10.1002/minf.201100005.
5. Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text-mining systems for biology: Overview of the second BioCreative community challenge. Genome Biology 2008, 9(Suppl 2):1, doi:10.1186/gb-2008-9-S2-S1.
6. Xu S, An X, Zhu L, Zhang Y, Zhang H: A CRF-based system for recognizing chemical entities in biomedical literature. In Proceedings of the 4th BioCreative Challenge Evaluation Workshop. Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A 2013, 2:152-157.
7. Xu S, Ma F, Tao L: Learn from the information contained in the false splice sites as well as in the true splice sites using SVM. Proceedings of the International Conference on Intelligent Systems and Knowledge Engineering. Atlantis Press, Amsterdam, Netherlands; 2007, 1360-1366, doi:10.2991/iske.2007.13.
8. Xu S: Selenoprotein genes prediction in silico based on machine learning approaches. PhD thesis. China Agricultural University; 2008.
9. Mikolov T, Chen K, Corrado G, Dean J: Efficient estimation of word representations in vector space. Proceedings of the International Conference on Learning Representations 2013.

Table 10 The adjustment rules of the text spans in the BioCreative IV CHEMDNER competition

| ID | IF condition | Action |
|---|---|---|
| 1 | #(·, "(") == #(·, ")") + 1 and right(·, 1) == ")" | offset(·, end) = offset(·, end) + 1 |
| 2 | #(·, "(") == #(·, ")") - 1 and left(·, 1) == "(" | offset(·, start) = offset(·, start) - 1 |
| 3 | #(·, "[") == #(·, "]") + 1 and right(·, 1) == "]" | offset(·, end) = offset(·, end) + 1 |
| 4 | #(·, "[") == #(·, "]") - 1 and left(·, 1) == "[" | offset(·, start) = offset(·, start) - 1 |
| 5 | #(·, "{") == #(·, "}") + 1 and right(·, 1) == "}" | offset(·, end) = offset(·, end) + 1 |
| 6 | #(·, "{") == #(·, "}") - 1 and left(·, 1) == "{" | offset(·, start) = offset(·, start) - 1 |
| 7 | #(·, "<sc>") == #(·, "</sc>") + 1 and right(·, 5) == "</sc>" | offset(·, end) = offset(·, end) + 5 |
| 8 | #(·, "<sc>") == #(·, "</sc>") - 1 and left(·, 4) == "<sc>" | offset(·, start) = offset(·, start) - 4 |
| 9 | #(·, "<i>") == #(·, "</i>") + 1 and right(·, 4) == "</i>" | offset(·, end) = offset(·, end) + 4 |
| 10 | #(·, "<i>") == #(·, "</i>") - 1 and left(·, 3) == "<i>" | offset(·, start) = offset(·, start) - 3 |
| 11 | #(·, "<sup>") == #(·, "</sup>") + 1 and right(·, 6) == "</sup>" | offset(·, end) = offset(·, end) + 6 |
| 12 | #(·, "<sup>") == #(·, "</sup>") - 1 and left(·, 5) == "<sup>" | offset(·, start) = offset(·, start) - 5 |
| 13 | #(·, "<sub>") == #(·, "</sub>") + 1 and right(·, 6) == "</sub>" | offset(·, end) = offset(·, end) + 6 |
| 14 | #(·, "<sub>") == #(·, "</sub>") - 1 and left(·, 5) == "<sub>" | offset(·, start) = offset(·, start) - 5 |

#(·, str) means the number of occurrences of the string str in the CEM of interest, right(·, n) and left(·, n) denote the substring of length n immediately to the right or left of the CEM, and offset(·, start) and offset(·, end) indicate the start or end offset of the CEM.
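The bracket rules of Table 10 (rules 1 to 6) can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: `adjust_span` and `DELIMITER_PAIRS` are hypothetical names, and the span is assumed to be given as a start offset and an exclusive end offset into the sentence text. The markup rules (7 to 14) follow the same pattern with longer delimiters, which is why the offsets move by the delimiter length.

```python
from typing import List, Tuple

# Opening/closing delimiters covered by rules 1-6 of Table 10.
DELIMITER_PAIRS: List[Tuple[str, str]] = [("(", ")"), ("[", "]"), ("{", "}")]


def adjust_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Extend text[start:end] by one delimiter when it is unbalanced by one.

    Assumes `end` is exclusive (a convention chosen for this sketch).
    """
    for open_s, close_s in DELIMITER_PAIRS:
        mention = text[start:end]
        n_open, n_close = mention.count(open_s), mention.count(close_s)
        if n_open == n_close + 1 and text.startswith(close_s, end):
            # One closer missing and it sits right after the span.
            end += len(close_s)
        elif n_open == n_close - 1 and text.endswith(open_s, 0, start):
            # One opener missing and it sits right before the span.
            start -= len(open_s)
    return start, end


sentence = "the [4Fe-4S](2+) cluster"
s, e = adjust_span(sentence, 4, 15)   # the CRF output "[4Fe-4S](2+"
print(sentence[s:e])                  # prints: [4Fe-4S](2+)
```

A span that is already balanced, or whose missing delimiter is not adjacent to it, is returned unchanged.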
10. Liang P: Semi-supervised learning for natural language. Master's thesis. Massachusetts Institute of Technology; 2005.
11. Turian J, Ratinov L, Bengio Y: Word representations: A simple and general method for semi-supervised learning. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA 2010, 384-394.
12. Lafferty J, McCallum A, Pereira F: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA; 2001, 282-289.
13. Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, Leaman R, Lu Y, Ji D, Lowe DM, Sayle RA, Batista-Navarro RT, Rak R, Huber T, Rocktaschel T, Matos S, Campos D, Tang B, Xu H, Munkhdalai T, Ryu KH, Ramanan SV, Nathan S, Zitnik S, Bajec M, Weber L, Irmer M, Akhondi SA, Kors JA, Xu S, An X, Sikdar UK, Ekbal A, Yoshioka M, Dieb TM, Choi M, Verspoor K, Khabsa M, Giles CL, Liu H, Ravikumar KE, Lamurias A, Couto FM, Dai H, Tsai RT, Ata C, Can T, Usie A, Alves R, Segura-Bedmar I, Martinez P, Oyarzabal J, Valencia A: The CHEMDNER corpus of chemicals and drugs and its annotation principles. J Cheminform 7(Suppl 1).
14. Sha F, Pereira F: Shallow parsing with conditional random fields. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA 2003, 213-220.
15. Miller S, Guinness J, Zamanian A: Name tagging with word clusters and discriminative training. Proceedings of Conference on Human Language Technology / North American Chapter of the Association for Computational Linguistics Annual Meeting. Association for Computational Linguistics, Boston, Massachusetts; 2004, 337-342.
16. Ganchev K, Crammer K, Pereira F, Mann G, Bellare K, McCallum A, Carroll S, Jin Y, White P: Penn/UMass/CHOP BioCreative II systems. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.
17. Brown PF, deSouza PV, Mercer RL, Pietra VJD, Lai JC: Class-based n-gram models of natural language. Computational Linguistics.
18. Finkel JR, Manning CD: Nested named entity recognition. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA 2009, 141-150.
19. The Apache OpenNLP Library.
20. Read J, Dridan R, Oepen S, Solberg LJ: Sentence boundary detection: A long solved problem? Proceedings of the 24th International Conference on Computational Linguistics. Indian Institute of Technology Bombay, Mumbai, Maharashtra, India; Kay M, Boitet C 2012:985-994.
21. Wei C-H, Harris BR, Kao H-Y, Lu Z: tmVar: A text mining approach for extracting sequence variants in biomedical literature.
22. McDonald R, Pereira F: Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 6(Suppl 1).
23. Huang H-S, Lin Y-S, Lin K-T, Kuo C-J, Chang Y-M, Yang B-H, Chung I-F, Hsu C-N: High-recall gene mention recognition by unification of multiple background parsing models. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.
24. Klinger R, Friedrich CM, Fluck J, Hofmann-Apitius M: Named entity recognition with combinations of conditional random fields. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop. Hirschman L, Krallinger M, Valencia A 2007, 89-92.
25. Liu DC, Nocedal J: On the limited memory BFGS method for large scale optimization. Mathematical Programming :503-528, doi:10.1007/
26. Kudo T: CRF++: Yet Another CRF Toolkit. trunk/doc/index.html].
27. Porter MF: An algorithm for suffix stripping.
28. Manning C, Bauer J: Stanford CoreNLP - A Suite of NLP Tools. stanford.edu/software/corenlp.shtml].
29. Collobert R, Weston J: A unified architecture for natural language processing: Deep neural networks with multitask learning. Proceedings of the 25th International Conference on Machine Learning.
30. Mnih A, Hinton G: A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA; Koller D, Schuurmans D, Bengio Y, Bottou L 2009:1081-1088.
31. Liang P: C++ Implementation of the Brown Word Clustering Algorithm.
32. Huffman DA: A method for the construction of minimum-redundancy codes. Proceedings of the I.R.E :1098-1101, doi:10.1109/

doi:10.1186/1758-2946-7-S1-S11
Cite this article as: Xu et al.: A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature. Journal of Cheminformatics 2015 7(Suppl 1):S11.
