RESEARCH  Open Access

A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature

Shuo Xu 1, Xin An 2, Lijun Zhu 1, Yunliang Zhang 1*, Haodong Zhang 3

Abstract

Background: In order to improve information access on chemical compounds and drugs (chemical entities) described in text repositories, it is crucial to be able to identify chemical entity mentions (CEMs) automatically. The CHEMDNER challenge in BioCreative IV was designed to promote the implementation of corresponding systems that are able to detect mentions of chemical compounds and drugs. It has two subtasks: CDI (Chemical Document Indexing) and CEM.

Results: Our system processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), and post-processing (rule-based approach and format conversion). In our post-challenge system, the cost parameter in the CRF model was optimized by 10-fold cross validation with grid search, and a word representations feature induced by the Brown clustering method was introduced. For the CEM subtask, our official runs were ranked in top position by obtaining at most 88.79% precision, 69.08% recall and 77.70% balanced F-measure, which were improved further to 88.43% precision, 76.48% recall and 82.02% balanced F-measure in our post-challenge system.

Conclusions: In our system, instead of extracting a CEM as a whole, we regarded it as a sequence labeling problem. Though our current system has much room for improvement, it is valuable in showing that the performance in terms of balanced F-measure can be improved substantially by utilizing large amounts of relatively inexpensive un-annotated PubMed abstracts and by optimizing the cost parameter in the CRF model. From our practice and lessons, if one directly utilizes some open-source natural language processing (NLP) toolkits, such as OpenNLP or Stanford CoreNLP, the false positive (FP) rate may be very high. It is better to develop some additional rules to minimize the FP rate if one does not want to re-train the related models. Our CEM recognition system is available at: http://www.SciTeMiner.org/XuShuo/Demo/CEM.
1 Information Technology Supporting Center, Institute of Scientific and Technical Information of China, No. 15 Fuxing Rd., Haidian District, 100038 Beijing, PR China
Full list of author information is available at the end of the article

Xu et al. Journal of Cheminformatics 2015, 7(Suppl 1):S11
http://www.jcheminf.com/content/7/S1/S11

© 2015 Xu et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

There is an increasing interest in improving information access on chemical compounds and drugs (chemical entities) described in text repositories, including scientific articles, patents, health agency reports, or the Web [1]. In order to achieve this goal, it is crucial to be able to identify chemical entity mentions (CEMs) automatically. The recognition of chemical entities is also crucial for other subsequent text processing tasks, such as detection of drug-protein interactions [2], adverse effects of chemical compounds and their associations to toxicological endpoints, or the extraction of pathway and metabolic reaction relations. Though many methods and strategies to recognize chemicals in text have been proposed [3], only a very limited number of publicly accessible CEM recognition systems have been released [4].

The BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge is a community-wide effort to build an evaluation framework for assessing text mining systems in biological domains [5]. The chemical compound and drug named entity recognition (CHEMDNER) challenge in BioCreative IV was specially designed to promote the implementation of systems that are able to detect mentions of chemical compounds and drugs. It has two subtasks: the CDI (Chemical Document Indexing) subtask and the CEM (Chemical Entity Mention) subtask. The CDI subtask is to return a ranked list of the chemical entities described within a given document. The CEM subtask is to provide, for a given document, the start and end indices corresponding to all the chemical entities mentioned in the document.

Here, we present the method, the results and the recognition system from our participation in the CEM subtask of the CHEMDNER challenge [1,6], with some post-challenge system improvements. In our recognition system, instead of extracting a CEM such as (+)-antiBP-7,8-diol-9,10-epoxide as a whole, we regard it as a sequence labeling problem. Our main focus in this improved system was to explore the effectiveness of cost parameter optimization [7,8] and of the word representations [9-11] feature for our approach to the CEM subtask. The proposed method combines natural language processing (NLP) strategies with machine learning (ML) techniques to utilize a word representations feature induced from large amounts of relatively inexpensive un-annotated PubMed abstracts along with small amounts of annotated ones.

As shown in Figure 1, our system first detects sentence boundaries in the PubMed abstracts, and then tokenizes each detected sentence as pre-processing. Next, our system extracts CEMs from text with a conditional random field (CRF) approach [12], followed by some post-processing steps including a rule-based approach and a format conversion step. We describe each step in detail in the following sections. Although the current approach has much room for improvement, it produced the top-ranked performance among all submitted runs in the CEM subtask of the BioCreative IV CHEMDNER challenge.

The rest of the article is organized as follows. In the next section, we describe the results of our submission and post-challenge runs on the CEM subtask of the BioCreative IV CHEMDNER challenge. This is followed by discussion and conclusions drawn from our experience. Lastly, the methods we employed are explained in detail.
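To make the sequence-labeling formulation concrete, the following minimal Python sketch assigns one label per token, assuming the BIEO label set described in the Methods (B = beginning, I = inside, E = end, O = outside). The function name `bieo_labels`, the inclusive token-index span convention, and the choice to label a single-token mention as B are illustrative assumptions, not part of the described system. The tokens are those of the excerpt shown in Table 8.

```python
from typing import List, Sequence, Tuple


def bieo_labels(n_tokens: int, spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Return one BIEO label per token.

    Each span is (first_token_index, last_token_index), both inclusive.
    Assumption: a single-token mention is labeled "B"; the article does not
    state how that case is handled.
    """
    labels = ["O"] * n_tokens
    for first, last in spans:
        if not 0 <= first <= last < n_tokens:
            raise ValueError(f"span {(first, last)} is out of range")
        labels[first] = "B"
        for i in range(first + 1, last):
            labels[i] = "I"
        if last > first:
            labels[last] = "E"
    return labels


# Finer-level tokens of the excerpt "...[C(8)mim][PF(6)]..." (see Table 8).
tokens = ["...", "[", "C", "(", "8", ")", "mim", "]",
          "[", "PF", "(", "6", ")", "]", "..."]
labels = bieo_labels(len(tokens), [(1, 13)])
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

With this view, the 13 tokens of the mention receive one B, eleven I and one E label, and the recognizer only has to predict a label for each token rather than a whole mention at once.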
Results and discussion

We analyzed the training, development and testing data sets and found that there are many nested CEMs in the development set, such as polysorbate 80 (offset: 1138 to 1152) and polysorbate (offset: 1138 to 1149) in the abstract of PMID: 23064325. See Table 1 for more examples of nested CEM pairs. Since the linear CRF model utilized in this article cannot identify nested CEMs, we simply omit the shorter CEM of each nested pair. In addition, there may be some annotation errors in the development set, such as the examples in Table 2. We manually corrected these errors before training our CRF model. Table 3 shows a brief overview of the corrected CHEMDNER corpus. Please see [13] for more details on how the CEMs were annotated, classified and split into training, development and test data sets.

To evaluate the performance of submitted results, the BioCreative IV competition relied on three performance measures at entity level: recall, precision and F-measure. Recall is the proportion of true CEMs that are correctly predicted. Precision is the proportion of predicted CEMs that are actually true CEMs. The F-measure provides a more balanced evaluation by combining precision and recall. The recall, precision and F-measure are defined formally as follows.
\[ r = \frac{TP}{TP + FN} \tag{1} \]

\[ p = \frac{TP}{TP + FP} \tag{2} \]

\[ F_{\beta} = (1 + \beta^{2}) \, \frac{p \times r}{\beta^{2} p + r} \tag{3} \]

where TP (true positive) is the number of correct positive predictions, FN (false negative) is the number of incorrect negative predictions (type II errors), and FP (false positive) is the number of incorrect positive predictions (type I errors). The balanced F-measure (\beta = 1), the main evaluation metric used for the CEM subtask of the BioCreative IV CHEMDNER competition, can be simplified to:

\[ F_{1} = 2 \, \frac{p \times r}{p + r} \tag{4} \]

Figure 1 The system processing pipeline. The system processing pipeline includes three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach) and post-processing (rule-based approach and format conversion).

In order to make the best of the annotated corpus, we pooled the training and development data sets. The participating teams were allowed 5 days to generate up to five different annotations ("runs") for the test set and to submit the annotations to the organizers. Thus, participating teams could utilize different settings, models or methods while the gold test annotation set was unknown. We submitted five runs for the CEM subtask, each using the same pipeline, but with different values for the cost parameter in the CRF model [12,14]. Due to time constraints, we only tried a few powers of 2 as values of the cost parameter. Table 4 presents the official performance scores of our submitted runs. Run 5 performed the best in terms of recall and balanced F-measure. Run 1 performed the best in terms of precision.
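As a check on Equations 1 to 4, the short Python sketch below recomputes precision, recall and the F-measure from raw counts. The function name `precision_recall_f` is an illustrative choice, and the counts are the TP, FP and FN values listed for Run 5 in Table 4. The sketch assumes that TP + FP and TP + FN are both non-zero.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int,
                       beta: float = 1.0) -> Tuple[float, float, float]:
    """Return (precision, recall, F_beta) following Equations 1 to 3.

    Assumes tp + fp > 0 and tp + fn > 0; no zero-division handling is done.
    """
    p = tp / (tp + fp)          # Equation 2
    r = tp / (tp + fn)          # Equation 1
    b2 = beta * beta
    f = (1.0 + b2) * p * r / (b2 * p + r)   # Equation 3; beta = 1 gives Equation 4
    return p, r, f


# Counts reported for Run 5 in Table 4.
p, r, f1 = precision_recall_f(tp=17512, fp=2211, fn=7839)
print(f"precision={100 * p:.2f}%  recall={100 * r:.2f}%  F1={100 * f1:.2f}%")
```

Rounded to two decimals, these values agree with the 88.79% precision, 69.08% recall and 77.70% balanced F-measure reported for Run 5.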
Table 1 Nested CEM pairs in the development set of the CHEMDNER corpus
For each row, the CEM with offset in columns 6-7 is nested in the CEM with offset in columns 4-5. The CEMs with respective offsets in columns 6-7 are omitted directly when training our CRF models.

Table 2 Annotation errors in the development set of the CHEMDNER corpus
ID   PMID       T/A   Start   End   Start   End
1    23412114   A     977     984   977     985
2    23572392   T     42      55    42      56
3    23414800   T     69      89    68      89
4    23411224   A     278     288   277     288
5    23401298   A     438     502   438     501
The offsets in columns 4-5 are corrected to the ones in columns 6-7.

Table 3 The overview of the corrected CHEMDNER corpus in terms of the number of PubMed abstracts (#Articles), the number of CEMs (#CEMs), and the number of CEMs for each of the CEM classes in C = {SYSTEMATIC, IDENTIFIER, FORMULA, TRIVIAL, ABBREVIATION, FAMILY, MULTIPLE, NO CLASS}
               Training   Development   Test     Background
#Articles      3,500      3,500         3,000    17,000
#CEMs          29,478     29,485        25,351   ×
ABBREVIATION   4,538      4,517         4,059    ×
FAMILY         4,090      4,212         3,622    ×
FORMULA        4,448      4,117         3,443    ×
IDENTIFIER     672        639           513      ×
MULTIPLE       202        187           199      ×
SYSTEMATIC     6,656      6,814         5,666    ×
TRIVIAL        8,832      8,967         7,808    ×
NO CLASS       40         32            41       ×
× means the resulting figure is not available.

Table 4 Official scores for the CEM subtask in the BioCreative IV CHEMDNER competition
                Run 1    Run 2    Run 3    Run 4    Run 5
TP              15,821   16,531   16,991   17,328   17,512
FP              1,834    2,007    2,009    2,129    2,211
FN              9,530    8,820    8,360    8,023    7,839
Precision (%)   89.61    89.17    89.43    89.06    88.79
Recall (%)      62.41    65.21    67.02    68.35    69.08
F-score (%)     73.58    75.33    76.62    77.34    77.70
ABBREVIATION    53.90    55.75    56.74    57.63    58.24
FAMILY          59.50    62.20    64.80    66.23    67.28
FORMULA         72.76    74.30    74.88    75.46    75.84
IDENTIFIER      68.62    70.18    69.98    69.79    69.59
MULTIPLE        27.64    32.16    28.64    31.66    31.66
SYSTEMATIC      69.57    72.61    74.32    75.61    76.35
TRIVIAL         58.95    62.73    65.48    67.41    68.26
NO CLASS        51.22    51.22    56.10    58.54    58.54
The runs differ in the value of the CRF cost parameter (a power of 2); the exponents are not legible in this copy.

In fact, the cost parameter controls the balance between over-fitting and under-fitting [12,14]. With a larger cost parameter value, the CRF tends to over-fit the given training corpus. From Table 4, one can easily see that the predicted results were significantly influenced by this parameter. In our post-challenge improved system, 10-fold cross validation at document level is utilized to optimize the cost parameter with grid search [7,8]. Specifically, the pooled training and development data sets are randomly divided into 10 sub-corpora of nearly equal size. For each candidate cost value (a power of 2), a CRF model is induced 10 times, each time leaving out one of the sub-corpora, which is then used to calculate the balanced F-measure. An optimal cost value is selected from this grid search.

In our post-challenge improved system, we again produced five runs for the CEM subtask, each using the same pipeline as the official submissions, but with different feature sets (Table 5). As Table 3 shows, the CHEMDNER corpus includes large amounts of relatively inexpensive un-annotated PubMed abstracts. In order to reduce data sparsity and further improve the performance of our system, a word representations feature is used in our post-challenge system, since it is a simple and general method for semi-supervised learning [11]. Previous studies [11,15,16] show that the word representations feature is very important for improving the balanced F-measure in recognizing pre-defined categories of proper names and bio-entities. Here, the training, development, test and background data sets are pooled to induce word representations of each token by the Brown clustering method [10,17] with 500, 1000, 1500 and 2000 clusters, respectively.

Figure 2 shows the balanced F-measure for the post-challenge runs with 10-fold cross validation by grid search [7,8]. Table 6 reports the performance results with the optimal value for the cost parameter. From Figure 2 and by comparing Table 4 and Table 6, it is not difficult to see that the word representations feature substantially improved the performance of our system in terms of balanced F-measure and recall, at the price of a slight degradation in precision. Run 1, Run 4 and Run 3 performed the best in terms of precision, recall and balanced F-measure, respectively.

Though the annotated CEMs are classified into eight classes C = {SYSTEMATIC, IDENTIFIER, FORMULA, TRIVIAL, ABBREVIATION, FAMILY, MULTIPLE, NO CLASS}, the annotations of the individual CEM classes are disregarded in our post-challenge system. In order to highlight the existing gaps in the CEM recognition system, performance results for each category in C are also given in Table 4 and Table 6 in terms of recall, that is, the percentage of gold-standard CEMs of each class that were recovered. As for the official performance scores in Table 4, our system worked best on recognizing the FORMULA CEMs for Run 1, Run 2 and Run 3, and SYSTEMATIC CEMs for Run 4 and Run 5. From Table 6, one can see that our post-challenge improved system identified SYSTEMATIC CEMs best. What is more, it seems to be very difficult to recognize MULTIPLE CEMs in both systems. The main reason may be that the number of annotated CEMs is not sufficient for the MULTIPLE category (202, 187 and 199 for the training, development and test data sets, respectively, in Table 3).

Conclusions

In this article, we present our post-challenge system and its performance for the CEM subtask of the BioCreative IV CHEMDNER challenge. Our system processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), and post-processing (rule-based approach and format conversion). Our main focus in this improved system was to explore the effectiveness of the cost parameter optimization and the word representations feature for the CEM subtask.

In our post-challenge improved system, instead of extracting a CEM as a whole, we regarded it as a sequence labeling problem. The well-known CRF model is utilized to solve the sequence labeling problem, and its cost parameter is optimized by 10-fold cross validation with grid search. Different feature types, including general linguistic, character, case pattern, contextual, and word representations features, were exploited for our runs. In order to reduce data sparsity in the annotated training and development data sets, word representations were induced from the pooled training, development, test and background data sets by the Brown clustering method.
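The cost-parameter selection summarized above (10-fold cross validation at document level with grid search) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `k_fold_splits` and `grid_search` are hypothetical, the per-fold balanced F-measures are averaged to rank the candidates (the article does not say how the ten values are combined), and `train_and_score` is a caller-supplied stand-in for training a CRF on the training documents and scoring it on the held-out ones.

```python
import random
from statistics import mean
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def k_fold_splits(doc_ids: Sequence[str], k: int = 10,
                  seed: int = 0) -> Iterator[Tuple[List[str], List[str]]]:
    """Randomly split documents into k folds of nearly equal size and
    yield (training_docs, held_out_docs) once per fold."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, held_out


def grid_search(doc_ids: Sequence[str], costs: Sequence[float],
                train_and_score: Callable[[float, List[str], List[str]], float],
                k: int = 10, seed: int = 0) -> Tuple[float, Dict[float, float]]:
    """Return the cost with the highest mean held-out score, plus all means.

    Assumption: averaging the k per-fold scores is our choice of aggregation.
    """
    splits = list(k_fold_splits(doc_ids, k, seed))  # same folds for every cost
    mean_scores = {
        cost: mean(train_and_score(cost, train, held) for train, held in splits)
        for cost in costs
    }
    best_cost = max(mean_scores, key=mean_scores.get)
    return best_cost, mean_scores


# Usage with a dummy scorer standing in for "train a CRF, compute balanced F".
doc_ids = [f"doc{i}" for i in range(25)]
candidate_costs = [1.0, 2.0, 4.0, 8.0]   # hypothetical grid, not the article's
best, scores = grid_search(
    doc_ids, candidate_costs,
    lambda cost, train, held: 1.0 - abs(cost - 4.0) / 10.0)
print(best, scores)
```

Splitting by document rather than by sentence keeps all sentences of one abstract in the same fold, which is what "at document level" implies.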
Finkel & Manning [18] proposed a model specifically for recognizing nested named entities by using a discriminative constituency parser. The model explicitly represents the nested structure, allowing entities to be influenced not just by the labels of the tokens surrounding them, as in a CRF, but also by the entities contained in them, and in which they are contained. In ongoing work, this model will be introduced for recognizing nested CEMs.

Though our current system has much room for improvement, it is valuable in showing that the performance in terms of balanced F-measure can be improved substantially by utilizing large amounts of relatively inexpensive un-annotated PubMed abstracts. From our practice and lessons, if one directly uses some open-source NLP toolkits, such as OpenNLP or Stanford CoreNLP, the false positive rate may be very high. It is better to develop some additional rules to minimize the false positive rate if one does not want to re-train the related models.

Table 5 Feature combinations used for post-challenge runs on the CEM subtask.
Columns: General Linguistic, Character, Case Pattern, Contextual, Word Representation (500, 1000, 1500, 2000 clusters). Rows: Run 1, Run 2, Run 3, Run 4, Run 5.

Figure 2 The balanced F-measure for post-challenge runs with 10-fold cross validation by grid search.

Table 6 Performance results in our post-challenge improved system for the CEM subtask in the BioCreative IV CHEMDNER competition
                Run 1    Run 2    Run 3    Run 4    Run 5
TP              18,025   19,259   19,389   19,495   19,355
FP              2,312    2,671    2,537    2,694    2,505
FN              7,326    6,092    5,962    5,856    5,996
Precision (%)   88.63    87.82    88.43    87.86    88.54
Recall (%)      71.10    75.97    76.48    76.90    76.35
F-score (%)     78.91    81.47    82.02    82.02    81.99
ABBREVIATION    59.77    63.37    64.28    65.85    65.98
FAMILY          72.36    73.55    72.92    73.94    72.97
FORMULA         73.02    73.45    74.12    74.01    74.30
IDENTIFIER      64.72    63.35    66.67    64.33    66.86
MULTIPLE        23.62    32.16    35.18    33.17    30.65
SYSTEMATIC      79.19    82.86    83.25    83.22    82.56
TRIVIAL         71.41    81.76    82.38    82.70    81.56
NO CLASS        53.66    63.41    63.41    68.29    63.41
Each run uses the optimal value of the CRF cost parameter (a power of 2); the exponents are not legible in this copy.

Methods

Pre-processing: sentence detection & tokenization

A sentence detector decides whether or not a punctuation character marks the end of a sentence. Here, the sentence detector in OpenNLP [19] is utilized. However, sentence boundary identification is challenging because punctuation marks are often ambiguous [20]. In order to further improve the performance of the sentence detection, we collected many abbreviations, such as syn., etc., from the training and development sets. Then we generated several rules, for example: the current sentence ends with one of these abbreviations or with a comma, or the next sentence starts with a lower-case letter. In such a case, the current and next sentences are merged into a new one.

A tokenizer divides each sentence obtained above into tokens, which usually correspond to words, punctuation, numbers, etc. However, to capture individual components within a CEM, similar to Wei et al. [21], we performed tokenization on a finer level. Specifically, the special characters in Table 7, numbers, and Greek symbols are split off as separate tokens. An example is shown in Table 8. Plural upper-case abbreviations are also separated into two tokens. As a matter of fact, before any pre-processing, we also merged some special characters with the same meaning.

Recognition: CRF-based approach

As mentioned in Background, we see the CEM recognition problem as a sequence labeling one (see Table 8). As a type of discriminative undirected probabilistic model, CRFs [12,14] are often used for labeling or parsing of sequential data, such as natural language text or biological sequences. CRFs [22-24] have been applied successfully to identify various bio-entities, such as genes and proteins, and have shown good performance.

Given a token sequence \mathbf{x} = (x_1, x_2, \cdots, x_N), a CRF defines the conditional probability distribution \Pr(\mathbf{y} \mid \mathbf{x}) of a label sequence \mathbf{y} = (y_1, y_2, \cdots, y_N), in its standard linear-chain form, as follows.

\[ \Pr(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\left( \mathbf{w}^{T} \sum_{n=1}^{N} \mathbf{f}(y_{n-1}, y_{n}, \mathbf{x}, n) \right)}{\sum_{\mathbf{y}'} \exp\left( \mathbf{w}^{T} \sum_{n=1}^{N} \mathbf{f}(y'_{n-1}, y'_{n}, \mathbf{x}, n) \right)} \tag{5} \]

Here, \mathbf{w} = (w_1, w_2, \cdots, w_M)^{T} is a global feature weight vector, \mathbf{f}(y_{n-1}, y_{n}, \mathbf{x}, n) = (f_1(y_{n-1}, y_{n}, \mathbf{x}, n), \cdots, f_M(y_{n-1}, y_{n}, \mathbf{x}, n))^{T} is a local feature vector function, and M is the number of feature functions. The weight vector \mathbf{w} can be obtained from the training and development sets by a limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [25] method.

The traditional BIEO label set is used in our post-challenge improved system. That is to say, each token is labeled as being the beginning of (B), the inside of (I), the end of (E) or entirely outside (O) of a span of interest.

Here, CRF++ [26] is adopted for the actual implementation. In CRF++, there are four major parameters that control the training condition. In our submitted predictions and post-challenge ones, three of them were consistently set to CRF-L2, 2 and 4, respectively. The remaining one, the cost parameter, is optimized with 10-fold cross validation, as introduced above.

Features for our CRF model

Our system exploits five different types of features:

General linguistic features
Our system includes the original uni-tokens and bi-tokens, as well as stemmed uni-tokens, bi-tokens and tri-tokens, as features, using the Porter stemmer [27] from Stanford CoreNLP [28].

Character features
Since many CEMs contain numbers, Greek letters, Roman numerals, amino acids, chemical elements, and special characters, our system calculates several statistics as features for each token, including its number of digits, number of upper- and lower-case letters, number of all characters, and the presence or absence of specific characters, Greek letters, Roman numerals, amino acids, or chemical elements.

Case pattern features
Similar to [21], any upper-case alphabetic character is replaced by "A", any lower-case one is replaced by "a", and any digit (0-9) is replaced by "0". Moreover, our system also merges consecutive letters and digits to generate additional features made of single "a" letters and single "0" digits.

Contextual features
For each token, our system includes a combination of the current output token and the previous output token (bigram).

Word representation features
One common approach to inducing unsupervised word representations is to use clustering, perhaps hierarchical. Examples of word representations include the Brown clustering method [17], Collobert and Weston embeddings [29], and hierarchical log-bilinear model (HLBL) embeddings [30]. Here, the Brown clustering method is used. The implementation of the Brown clustering method by Liang [31] is adopted in our post-challenge system.

The result of running the Brown clustering method is a binary tree, where each token occupies a single leaf node, and where each leaf node contains a single token. The root node defines a cluster containing the entire token set. Interior nodes represent intermediate-size clusters containing all of the tokens that they dominate.
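The tree structure just described can be illustrated with a small Python sketch. It is only a toy illustration under stated assumptions: the nested-tuple encoding of the tree, the function names `dominated_tokens` and `clusters_by_depth`, and the five tokens are all hypothetical, and no actual Brown clustering is performed here.

```python
from typing import FrozenSet, List, Tuple, Union

# A tree is either a token (leaf) or a pair of subtrees (interior node).
Tree = Union[str, Tuple["Tree", "Tree"]]


def dominated_tokens(node: Tree) -> FrozenSet[str]:
    """Return the cluster defined by a node: all tokens at or below it."""
    if isinstance(node, str):
        return frozenset([node])
    left, right = node
    return dominated_tokens(left) | dominated_tokens(right)


def clusters_by_depth(node: Tree, depth: int = 0) -> List[Tuple[int, FrozenSet[str]]]:
    """List (depth, cluster) for every node, visiting the tree top-down."""
    result = [(depth, dominated_tokens(node))]
    if not isinstance(node, str):
        for child in node:
            result.extend(clusters_by_depth(child, depth + 1))
    return result


# Hypothetical toy tree over five made-up tokens.
toy_tree: Tree = ((("benzene", "toluene"), "ethanol"), ("increased", "decreased"))
for depth, cluster in clusters_by_depth(toy_tree):
    print(depth, sorted(cluster))
```

In this toy tree, the root cluster holds all five tokens, while each step down the tree yields a cluster that is a subset of its parent's cluster.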
Thus, nodes lower in the binary tree correspond to smaller token clusters, while higher nodes correspond to larger token clusters. According to Huffman coding [32], a particular token can be assigned a binary string by following the traversal path from the root to its leaf, assigning a 0 for each left branch, and a 1 for each right branch.

Intuitively, the Brown clustering method merges tokens with similar contexts into the same cluster. Thus, the longer the prefix shared by the Huffman codes of two tokens, the more similar the tokens. Table 9 shows some token examples and their binary string representations with 500 clusters. Take Table 9 as an example. According to the main idea of the Brown clustering method, the token interpeak (01100110110) is more similar to the token florbetapir (0110011010) than the token aquaporine (01101110011) is: interpeak shares a 9-bit prefix with florbetapir, whereas aquaporine shares only a 4-bit prefix.

Post-processing: rule-based approach & format conversion

On closer examination, we find that the results of the CRF approach include some false positive CEMs, such as "25(3), 186-193", "1-D, 2-D" and so on. So, we developed several additional regular expressions to remove them. In addition, our post-processing step also helps adjust the text spans of CEMs, for example by adding a missing closing parenthesis, which turns [4Fe-4S](2+ into [4Fe-4S](2+). All of the adjustment rules are listed in Table 10. Here, #(·, str) means the number of occurrences of the string str in the CEM of interest, right(·, n) and left(·, n) denote the substring of length n immediately to the right or left of the CEM of interest, and offset(·, start) and offset(·, end) indicate the start or end offset of the CEM of interest. Take the first row in Table 10 as an example. It means that if the number of occurrences of "(" in the CEM of interest exceeds that of ")" by one, and if the substring of length 1 to the right of the CEM is ")", then the end offset of the CEM is moved one character further to the right.

Finally, we converted the recognized CEMs into the official format with the resulting confidence scores. In our system, the confidence score is simply set to the average of the conditional probabilities of the tokens that make up the CEM of interest, formally defined as follows.
\[ \mathrm{score}(CEM) = \frac{1}{|CEM|} \sum_{t \in CEM} \mathrm{CondProb}(t) \tag{6} \]

where |CEM| means the number of token components of a CEM. Take [C(8)mim][PF(6)] in Table 8 as an example. Its confidence score is calculated as follows.

\[ \mathrm{score}(\text{[C(8)mim][PF(6)]}) = \frac{1}{13} \sum_{t \in \text{[C(8)mim][PF(6)]}} \mathrm{CondProb}(t) = 0.963655 \tag{7} \]

Table 7 Special characters included in our tokenizer
( ) [ ] { } , . / \ @ · © ® : = + - ? _ | ¬ ± ~ * % # & ; ! £ ¥ $ × ÷ ... \b \t \n \f

Table 8 An example of CEM component labels in an excerpt "...[C(8)mim][PF(6)]..." in PMID: 23265515
Token   Label   Conditional prob.
...     O       ...
[       B       0.994456
C       I       0.997241
(       I       0.999912
8       I       0.999914
)       I       0.999853
mim     I       0.997244
]       I       0.996372
[       I       0.996110
PF      I       0.995940
(       I       0.996733
6       I       0.996693
)       I       0.825782
]       E       0.731261
...     O       ...

Table 9 Sample tokens and their resulting binary string representations with 500 clusters.
ID   Token         Binary String
1    gracile       010011
2    quintile      010010
3    florbetapir   0110011010
4    interpeak     01100110110
5    aquaporine    01101110011

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
SX and YZ developed the CEM recognition system. SX and HZ conducted extensive experiments and drafted the manuscript. LZ and XA conceived of the supporting projects, and participated in the resulting design and coordination and helped draft the manuscript. All authors read and approved the final manuscript.
Declarations
This work was supported partially by Fundamental Research Funds for the Central Universities: Research on Forest Property Circulation Mechanism in Collective Forest Area (JGTD2014-04), Beijing Forestry University Young Scientist Fund: Research on Econometric Methods of Auction with their Applications in the Circulation of Collective Forest Right (BLX2011028), the National Science Foundation of China: Research on Technology Opportunity Detection based on Paper and Patent Information Resources (71403255), Key Technologies R&D Program of Chinese 12th Five-Year Plan (2011-2015): STKOS Collaborative Construction System and Auxiliary Tool Development (2011BAH10B02), and Key Work Project of Institute of Scientific and Technical Information of China (ISTIC): Intelligent Analysis Service Platform and Application Demonstration for Multi-Source Science and Technology Literature in the Era of Big Data (ZD2014-7-1).
This article has been published as part of Journal of Cheminformatics Volume 7 Supplement 1, 2015: Text mining for chemistry and the CHEMDNER track. The full contents of the supplement are available online at http://www.jcheminf.com/supplements/7/S1.

Authors' details
1 Information Technology Supporting Center, Institute of Scientific and Technical Information of China, No. 15 Fuxing Rd., Haidian District, 100038 Beijing, PR China.
2 School of Economics and Management, Beijing Forestry University, No. 35 Qinghua East Rd., Haidian District, 100083 Beijing, PR China.
3 Network Center, Science and Technology Daily, No. 15 Fuxing Rd., Haidian District, 100038 Beijing, PR China.

Published: 19 January 2015

References
1. Krallinger M, Leitner F, Rabal O, Vazquez M, Miguel J, Valencia A: CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform 2015, 7(Suppl 1):S1.
2. Li J, Zhu X, Chen JY: Building disease-specific drug-protein connectivity maps from molecular interaction networks and pubmed abstracts. PLoS Computational Biology 2009, 5(7):1000450, doi:10.1371/journal.pcbi.1000450.
3. Eltyeb S, Salim N: Chemical named entities recognition: A review on approaches and applications. Journal of Cheminformatics 2014, 6(17):1-12, doi:10.1186/1758-2946-6-17.
4. Vazquez M, Krallinger M, Leitner F, Valencia A: Text mining for drugs and chemical compound: Methods, tools and applications. Molecular Informatics 2011, 30(6-7):506-519, doi:10.1002/minf.201100005.
5. Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text-mining systems for biology: Overview of the second BioCreative community challenge. Genome Biology 2008, 9(Suppl 2):1, doi:10.1186/gb-2008-9-S2-S1.
6. Xu S, An X, Zhu L, Zhang Y, Zhang H: A CRF-based system for recognizing chemical entities in biomedical literature. In Proceedings of the 4th BioCreative Challenge Evaluation Workshop. Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A 2013, 2:152-157.
7. Xu S, Ma F, Tao L: Learn from the information contained in the false splice sites as well as in the true splice sites using SVM. Proceedings of the International Conference on Intelligent Systems and Knowledge Engineering. Atlantis Press, Amsterdam, Netherlands; 2007, 1360-1366, doi:10.2991/iske.2007.13.
8. Xu S: Selenoprotein genes prediction in silico based on machine learning approaches. PhD thesis. China Agricultural University; 2008.
9. Mikolov T, Chen K, Corrado G, Dean J: Efficient estimation of word representations in vector space. Proceedings of the International Conference on Learning Representations 2013.
10. Liang P: Semi-supervised learning for natural language. Master's thesis. Massachusetts Institute of Technology; 2005.
11. Turian J, Ratinov L, Bengio Y: Word representations: A simple and general method for semi-supervised learning. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA 2010, 384-394.
12. Lafferty J, McCallum A, Pereira F: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA; 2001, 282-289.
13. Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, Leaman R, Lu Y, Ji D, Lowe DM, Sayle RA, Batista-Navarro RT, Rak R, Huber T, Rocktaschel T, Matos S, Campos D, Tang B, Xu H, Munkhdalai T, Ryu KH, Ramanan SV, Nathan S, Zitnik S, Bajec M, Weber L, Irmer M, Akhondi SA, Kors JA, Xu S, An X, Sikdar UK, Ekbal A, Yoshioka M, Dieb TM, Choi M, Verspoor K, Khabsa M, Giles CL, Liu H, Ravikumar KE, Lamurias A, Couto FM, Dai H, Tsai RT, Ata C, Can T, Usie A, Alves R, Segura-Bedmar I, Martinez P, Oyarzabal J, Valencia A: The CHEMDNER corpus of chemicals and drugs and its annotation principles. J Cheminform 7(Suppl 1).
14. Sha F, Pereira F: Shallow parsing with conditional random fields. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA 2003, 213-220.
15. Miller S, Guinness J, Zamanian A: Name tagging with word clusters and discriminative training. Proceedings of Conference on Human Language Technology / North American Chapter of the Association for Computational Linguistics Annual Meeting. Association for Computational Linguistics, Boston, Massachusetts; 2004, 337-342.
16. Ganchev K, Crammer K, Pereira F, Mann G, Bellare K, McCallum A, Carroll S, Jin Y, White P: Penn/Umass/CHOP BioCreative II systems. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.
17. Brown PF, deSouza PV, Mercer RL, Pietra VJD, Lai JC: Class-based n-gram models of natural language. Computational Linguistics.
18. Finkel JR, Manning CD: Nested named entity recognition. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA 2009, 141-150.
19. The Apache OpenNLP Library.
20. Read J, Dridan R, Oepen S, Solberg LJ: Sentence boundary detection: A long solved problem? Proceedings of the 24th International Conference on Computational Linguistics. Indian Institute of Technology Bombay, Mumbai, Maharashtra, India; Kay M, Boitet C 2012:985-994.
21. Wei C-H, Harris BR, Kao H-Y, Lu Z: tmVar: A text mining approach for extracting sequence variants in biomedical literature.
22. McDonald R, Pereira F: Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 6(Suppl 1).
23. Huang H-S, Lin Y-S, Lin K-T, Kuo C-J, Chang Y-M, Yang B-H, Chung I-F, Hsu C-N: High-recall gene mention recognition by unification of multiple background parsing models. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop.
24. Klinger R, Friedrich CM, Fluck J, Hofmann-Apitius M: Named entity recognition with combinations of conditional random fields. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop. Hirschmann L, Krallinger M, Valencia A 2007, 89-92.
25. Liu DC, Nocedal J: On the limited memory BFGS method for large scale ... Mathematical Programming: 503-528, doi:10.1007/...
26. Kudo T: CRF++: Yet Another CRF Toolkit. [...trunk/doc/index.html].
27. Porter MF: An algorithm for suffix stripping.
28. Manning C, Bauer J: Stanford CoreNLP - A Suite of NLP Tools. [...stanford.edu/software/corenlp.shtml].
29. Collobert R, Weston J: A unified architecture for natural language processing: Deep neural networks with multitask learning. Proceedings of the 25th International Conference on Machine Learning.
30. Mnih A, Hinton G: A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA; Koller D, Schuurmans D, Bengio Y, Bottou L 2009:1081-1088.
31. Liang P: C++ Implementation of the Brown Word Clustering Algorithm.
32. Huffman DA: A method for the construction of minimum-redundancy ... Proceedings of the I.R.E: 1098-1101, doi:10.1109/...

doi:10.1186/1758-2946-7-S1-S11
Cite this article as: Xu et al.: A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature. Journal of Cheminformatics 2015, 7(Suppl 1):S11.

Table 10 The adjustment rules of the text spans in the BioCreative IV CHEMDNER competition.
ID   IF                                   Condition                  Action
1    #(·, "(") == #(·, ")") + 1           right(·, 1) == ")"         offset(·, end) = offset(·, end) + 1
2    #(·, "(") == #(·, ")") - 1           left(·, 1) == "("          offset(·, start) = offset(·, start) - 1
3    #(·, "[") == #(·, "]") + 1           right(·, 1) == "]"         offset(·, end) = offset(·, end) + 1
4    #(·, "[") == #(·, "]") - 1           left(·, 1) == "["          offset(·, start) = offset(·, start) - 1
5    #(·, "{") == #(·, "}") + 1           right(·, 1) == "}"         offset(·, end) = offset(·, end) + 1
6    #(·, "{") == #(·, "}") - 1           left(·, 1) == "{"          offset(·, start) = offset(·, start) - 1
7    #(·, "<sc>") == #(·, "</sc>") + 1    right(·, 5) == "</sc>"     offset(·, end) = offset(·, end) + 5
8    #(·, "<sc>") == #(·, "</sc>") - 1    left(·, 4) == "<sc>"       offset(·, start) = offset(·, start) - 4
9    #(·, "<i>") == #(·, "</i>") + 1      right(·, 4) == "</i>"      offset(·, end) = offset(·, end) + 4
10   #(·, "<i>") == #(·, "</i>") - 1      left(·, 3) == "<i>"        offset(·, start) = offset(·, start) - 3
11   #(·, "<sup>") == #(·, "</sup>") + 1  right(·, 6) == "</sup>"    offset(·, end) = offset(·, end) + 6
12   #(·, "<sup>") == #(·, "</sup>") - 1  left(·, 5) == "<sup>"      offset(·, start) = offset(·, start) - 5
13   #(·, "<sub>") == #(·, "</sub>") + 1  right(·, 6) == "</sub>"    offset(·, end) = offset(·, end) + 6
14   #(·, "<sub>") == #(·, "</sub>") - 1  left(·, 5) == "<sub>"      offset(·, start) = offset(·, start) - 5
#(·, str) means the number of occurrences of the string str in the CEM of interest, right(·, n) and left(·, n) denote the substring of length n immediately to the right or left of the CEM of interest, and offset(·, start) and offset(·, end) indicate the start or end offset of the CEM of interest.
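The span adjustment rules of Table 10 and the confidence score of Equation 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function names `adjust_span` and `confidence_score` are hypothetical, offsets follow the start-inclusive, end-exclusive convention suggested by the polysorbate 80 example (1138 to 1152 for a 14-character mention), the tag strings `<sc>`, `<i>`, `<sup>` and `<sub>` are a reading of rows 7-14 of Table 10 inferred from the stated substring lengths, and the sample sentences are made up for illustration.

```python
from typing import Sequence, Tuple

# (opening, closing) delimiter pairs; the tag pairs are an assumed reading
# of rows 7-14 of Table 10.
DELIMITER_PAIRS = [("(", ")"), ("[", "]"), ("{", "}"),
                   ("<sc>", "</sc>"), ("<i>", "</i>"),
                   ("<sup>", "</sup>"), ("<sub>", "</sub>")]


def adjust_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Extend a mention span to absorb one missing delimiter per pair.

    `start` is inclusive and `end` is exclusive (assumed offset convention).
    """
    for opener, closer in DELIMITER_PAIRS:
        mention = text[start:end]
        n_open, n_close = mention.count(opener), mention.count(closer)
        if n_open == n_close + 1 and text.startswith(closer, end):
            end += len(closer)       # odd-numbered rules: absorb the closer
        elif n_open == n_close - 1 and text.endswith(opener, 0, start):
            start -= len(opener)     # even-numbered rules: absorb the opener
    return start, end


def confidence_score(token_probs: Sequence[float]) -> float:
    """Average conditional probability of a mention's tokens (Equation 6)."""
    if not token_probs:
        raise ValueError("a mention must contain at least one token")
    return sum(token_probs) / len(token_probs)


# Made-up sentence containing the mention from the missing-parenthesis example.
sentence = "the [4Fe-4S](2+) cluster"
new_start, new_end = adjust_span(sentence, 4, 15)
print(sentence[new_start:new_end])

# Conditional probabilities of the 13 tokens listed in Table 8.
table8_probs = [0.994456, 0.997241, 0.999912, 0.999914, 0.999853,
                0.997244, 0.996372, 0.996110, 0.995940, 0.996733,
                0.996693, 0.825782, 0.731261]
print(f"{confidence_score(table8_probs):.6f}")
```

For the Table 8 probabilities, the average agrees with the value 0.963655 given in Equation 7. Recomputing the mention inside the loop lets a later delimiter pair see the span as already adjusted by an earlier one.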