
Semi-Markov Conditional Random Fields for Information Extraction

Sunita Sarawagi
Indian Institute of Technology Bombay, India
sunita@iitb.ac.in

William W. Cohen
Center for Automated Learning & Discovery, Carnegie Mellon University
wcohen@cs.cmu.edu


Abstract

We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semi-CRF on an input sequence x outputs a "segmentation" of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements x_i of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time, often only a small constant factor slower than for conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs.

1 Introduction

Conditional random fields (CRFs) are a recently-introduced formalism [12] for representing a conditional model Pr(y|x), where both x and y have non-trivial structure (often sequential). Here we introduce a generalization of sequential CRFs called semi-Markov conditional random fields (or semi-CRFs). Recall that semi-Markov chain models extend hidden Markov models (HMMs) by allowing each state s_i to persist for a non-unit length of time d_i. After this time has elapsed, the system will transition to a new state s', which depends only on s_i; however, during the "segment" of time between i and i + d_i, the behavior of the system may be non-Markovian. Semi-Markov models are fairly common in certain applications of statistics [8, 9], and are also used in reinforcement learning to model hierarchical Markov decision processes [19]. Semi-CRFs are a conditionally trained version of semi-Markov chains.

In this paper, we present inference and learning methods for semi-CRFs. We also argue that segments often have a clear intuitive meaning, and hence semi-CRFs are more natural than conventional CRFs. We focus here on named entity recognition (NER), in which a segment corresponds to an extracted entity; however, similar arguments might be made for several other tasks, such as gene-finding [11] or NP-chunking [16]. In NER, a semi-Markov formulation allows one to easily construct entity-level features (such as "entity length" and "similarity to other known entities") which cannot be easily encoded in CRFs. Experiments on five different NER problems show that semi-CRFs often outperform conventional CRFs.

2 CRFs and Semi-CRFs

2.1 Definitions

A CRF models Pr(y|x) using a Markov random field, with nodes corresponding to elements of the structured object y, and potential functions that are conditional on (features of) x. Learning is performed by setting parameters to maximize the likelihood of a set of (x, y) pairs given as training data. One common use of CRFs is for sequential learning problems like NP chunking [16], POS tagging [12], and NER [15]. For these problems the Markov field is a chain, and y is a linear sequence of labels from a fixed set Y. For instance, in the NER application, x might be a sequence of words, and y might be a sequence in {I, O}^|x|, where y_i = I indicates "word x_i is inside a name" and y_i = O indicates the opposite.

Assume a vector f of local feature functions f = <f^1, ..., f^K>, each of which maps a pair (x, y) and an index i to a measurement f^k(i, x, y) ∈ R. Let f(i, x, y) be the vector of these measurements, and let F(x, y) = Σ_{i}^{|x|} f(i, x, y). For the case of NER, the components of f might include the measurement f^13(i, x, y) = [[x_i is capitalized]] · [[y_i = I]], where the indicator function [[c]] = 1 if c is true and zero otherwise; this implies that F^13(x, y) would be the number of capitalized words x_i paired with the label I. Following previous work [12, 16] we will define a conditional random field (CRF) to be an estimator of the form

    Pr(y|x, W) = (1/Z(x)) e^{W · F(x, y)}    (1)

where W is a weight vector over the components of F, and Z(x) = Σ_{y'} e^{W · F(x, y')}.

To extend this to the semi-Markov case, let s = <s_1, ..., s_p> denote a segmentation of x, where segment s_j = <t_j, u_j, y_j> consists of a start position t_j, an end position u_j, and a label y_j ∈ Y. Conceptually, a segment means that the tag y_j is given to all x_i's between i = t_j and i = u_j, inclusive. We assume segments have positive length, and adjacent segments touch; that is, t_j and u_j always satisfy 1 ≤ t_j ≤ u_j ≤ |x| and t_{j+1} = u_j + 1. For NER, a correct segmentation of the sentence "I went skiing with Fernando Pereira in British Columbia" might be s = <(1,1,O), (2,2,O), (3,3,O), (4,4,O), (5,6,I), (7,7,O), (8,9,I)>, corresponding to the label sequence y = <O, O, O, O, I, I, O, I, I>.

We assume a vector g of segment feature functions g = <g^1, ..., g^K>, each of which maps a triple (j, x, s) to a measurement g^k(j, x, s) ∈ R, and define G(x, s) = Σ_{j}^{|s|} g(j, x, s). We also make a restriction on the features, analogous to the usual Markovian assumption made in CRFs, and assume that every component g^k of g is a function only of x, s_j, and the label y_{j-1} associated with the preceding segment s_{j-1}. In other words, we assume that every g^k(j, x, s) can be rewritten as

    g^k(j, x, s) = g'^k(y_j, y_{j-1}, x, t_j, u_j)    (2)

for an appropriately defined g'^k. In the rest of the paper, we will drop the g' notation and use g for both versions of the segment-level feature functions. A semi-CRF is then an estimator of the form

    Pr(s|x, W) = (1/Z(x)) e^{W · G(x, s)}    (3)

where again W is a weight vector for G and Z(x) = Σ_{s'} e^{W · G(x, s')}.

2.2 An efficient inference algorithm

The inference problem for a semi-CRF is defined as follows: given W and x, find the best segmentation, argmax_s Pr(s|x, W), where Pr(s|x, W) is defined by Equation 3.
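As an illustration of these definitions, the unnormalized score W · G(x, s) of Equation 3 can be computed directly from a list of (t_j, u_j, y_j) triples. The sketch below is hypothetical (the feature functions and weights are not from the paper); it uses 0-indexed positions and the example sentence above:

```python
# Hypothetical sketch: scoring a segmentation under a semi-CRF (Eq. 3,
# unnormalized). A segmentation s is a list of (t, u, y) triples giving
# the start, the end (inclusive), and the label of each segment.

def seg_features(prev_y, y, x, t, u):
    """g(y_j, y_{j-1}, x, t_j, u_j): a few toy segment-level features."""
    length = u - t + 1
    return {
        ("len", y): float(length),                                 # segment length, per label
        ("cap", y): float(all(w[0].isupper() for w in x[t:u + 1])),  # all-capitalized segment
        ("trans", prev_y, y): 1.0,                                 # label-transition indicator
    }

def score(W, x, s):
    """W . G(x, s) = sum over segments j of W . g(j, x, s)."""
    total, prev_y = 0.0, None
    for (t, u, y) in s:
        for feat, val in seg_features(prev_y, y, x, t, u).items():
            total += W.get(feat, 0.0) * val
        prev_y = y
    return total

x = "I went skiing with Fernando Pereira in British Columbia".split()
s = [(0, 0, "O"), (1, 1, "O"), (2, 2, "O"), (3, 3, "O"),
     (4, 5, "I"), (6, 6, "O"), (7, 8, "I")]
W = {("len", "I"): 0.5, ("cap", "I"): 1.0, ("trans", "O", "I"): 0.3}
print(score(W, x, s))
```

Note that the length and capitalization features above depend on a whole segment at once, which is exactly what a conventional CRF's per-position features cannot express.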
An efficient inference algorithm is suggested by Equation 2, which implies that

    argmax_s Pr(s|x, W) = argmax_s W · G(x, s) = argmax_s Σ_j W · g(y_j, y_{j-1}, x, t_j, u_j)

Let L be an upper bound on segment length. Let s_{i:y} denote the set of all partial segmentations starting from 1 (the first index of the sequence) to i, such that the last segment has the label y and ending position i. Let V_{x,g,W}(i, y) denote the largest value of W · G(x, s') for any s' ∈ s_{i:y}. Omitting the subscripts, the following recursive calculation implements a semi-Markov analog of the usual Viterbi algorithm:

    V(i, y) = max_{y', d=1...L} V(i-d, y') + W · g(y, y', x, i-d, i)   if i > 0
            = 0                                                        if i = 0
            = -∞                                                       if i < 0    (4)

The best segmentation then corresponds to the path traced by max_y V(|x|, y).

2.3 Semi-Markov CRFs vs order-L CRFs

Since conventional CRFs need not maximize over possible segment lengths d, inference for semi-CRFs is more expensive. However, Equation 4 shows that the additional cost is only linear in L. For NER, a reasonable value of L might be four or five.¹ Since in the worst case L ≤ |x|, the algorithm is always polynomial, even when L is unbounded.

¹ Assuming that non-entity words are placed in unit-length segments, as we do below.

For fixed L, it can be shown that semi-CRFs are no more expressive than order-L CRFs. For order-L CRFs, however, the additional computational cost is exponential in L. The difference is that semi-CRFs only consider sequences in which the same label is assigned to all L positions, rather than all |Y|^L length-L sequences. This is a useful restriction, as it leads to faster inference. Semi-CRFs are also a natural restriction, as it is often convenient to express features in terms of segments. As an example, let d_j denote the length of a segment, and let μ be the average length of all segments with label I. Now consider the segment feature g^{k1}(j, x, s) = (d_j - μ)² · [[y_j = I]]. After training, the contribution of this feature toward Pr(s|x) associated with a length-d entity will be proportional to e^{w_{k1}(d-μ)²}, i.e., it allows the learner to model a Gaussian distribution of entity lengths.

In contrast, the feature g^{k2}(j, x, s) = d_j · [[y_j = I]] would model an exponential distribution of lengths. It turns out that g^{k2} is equivalent to the local feature function f(i, x, y) = [[y_i = I]], in the following sense: for every triple x, y, s, where y are the tags for s, Σ_j g^{k2}(j, x, s) = Σ_i f(i, x, y). Thus any semi-CRF model based on the single feature g^{k2} could also be represented by a conventional CRF. In general, a semi-CRF model can be factorized in terms of an equivalent order-1 CRF model iff the sum of the segment features can be rewritten as a sum of local features. Thus the degree to which semi-CRFs are non-Markovian depends on the feature set.

2.4 Learning algorithm

During training the goal is to maximize log-likelihood over a given training set T = {(x_ℓ, s_ℓ)}_{ℓ=1}^{N}. Following the notation of Sha and Pereira [16], we express the log-likelihood over the training sequences as

    L(W) = Σ_ℓ log Pr(s_ℓ|x_ℓ, W) = Σ_ℓ (W · G(x_ℓ, s_ℓ) - log Z_W(x_ℓ))    (5)

We wish to find a W that maximizes L(W). Equation 5 is convex, and can thus be maximized by gradient ascent, or one of many related methods. (In our implementation we use a limited-memory quasi-Newton method [13, 14].) The gradient of L(W) is the following:

    ∇L(W) = Σ_ℓ G(x_ℓ, s_ℓ) - (Σ_{s'} G(s', x_ℓ) e^{W · G(x_ℓ, s')}) / Z_W(x_ℓ)    (6)
          = Σ_ℓ G(x_ℓ, s_ℓ) - E_{Pr(s'|W)} G(x_ℓ, s')    (7)

The first set of terms is easy to compute. However, we must use the Markov property of G and a dynamic programming step to compute the normalizer, Z_W(x_ℓ), and the expected value of the features under the current weight vector, E_{Pr(s'|W)} G(x_ℓ, s'). We thus define α(i, y) as the value of Σ_{s' ∈ s_{i:y}} e^{W · G(s', x)}, where again s_{i:y} denotes all segmentations from 1 to i ending at i and labeled y. For i > 0, this can be expressed recursively as

    α(i, y) = Σ_{d=1}^{L} Σ_{y' ∈ Y} α(i-d, y') e^{W · g(y, y', x, i-d, i)}

with the base cases defined as α(0, y) = 1 and α(i, y) = 0 for i < 0. The value of Z_W(x) can then be written as Z_W(x) = Σ_y α(|x|, y).

A similar approach can be used to compute the expectation Σ_{s'} G(x_ℓ, s') e^{W · G(x_ℓ, s')}. For the k-th component of G, let η^k(i, y) be the value of the sum Σ_{s' ∈ s_{i:y}} G^k(s', x_ℓ) e^{W · G(x_ℓ, s')}, restricted to the part of the segmentation ending at position i. The following recursion² can then be used to compute η^k(i, y):

    η^k(i, y) = Σ_{d=1}^{L} Σ_{y' ∈ Y} (η^k(i-d, y') + α(i-d, y') g^k(y, y', x, i-d, i)) e^{W · g(y, y', x, i-d, i)}

Finally we let E_{Pr(s'|W)} G^k(s', x) = (1/Z_W(x)) Σ_y η^k(|x|, y).
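The recursion of Equation 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `score(y, y_prev, x, t, u)` is a hypothetical stand-in for the dot product W · g(y, y', x, t, u), positions are 0-indexed and inclusive, and y_prev is None for the first segment:

```python
# A minimal sketch (not the authors' code) of the semi-Markov Viterbi
# recursion of Equation 4.

def semi_markov_viterbi(x, labels, L, score):
    """Return (best value, best segmentation as (t, u, y) triples)."""
    n = len(x)
    NEG_INF = float("-inf")
    # V[i][y]: best score over segmentations of x[0:i] ending in label y.
    V = [{y: NEG_INF for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(L, i) + 1):   # try every segment length
                if i - d == 0:
                    cands = [(0.0, None)]       # segment starts the sequence
                else:
                    cands = [(V[i - d][yp], yp) for yp in labels]
                for prev_val, yp in cands:
                    s = prev_val + score(y, yp, x, i - d, i - 1)
                    if s > V[i][y]:
                        V[i][y] = s
                        back[i][y] = (d, yp)
    best_y = max(labels, key=lambda y: V[n][y])
    # Trace the winning path back through the stored (length, label) choices.
    segs, i, y = [], n, best_y
    while i > 0:
        d, yp = back[i][y]
        segs.append((i - d, i - 1, y))
        i, y = i - d, yp
    segs.reverse()
    return V[n][best_y], segs

# Toy scorer (hypothetical weights): capitalized multi-word spans look like
# entities (label I), while O is only plausible for unit-length segments.
def toy_score(y, y_prev, x, t, u):
    length = u - t + 1
    if y == "I":
        caps = all(w[0].isupper() for w in x[t:u + 1])
        return float(length * length) if caps else -5.0
    return 0.5 if length == 1 else -5.0

x = "met Fernando Pereira today".split()
best, segs = semi_markov_viterbi(x, ["I", "O"], 3, toy_score)
print(best, segs)
```

The triple loop over positions, labels, and lengths runs in O(|x| · L · |Y|²) time, consistent with the paper's observation that the extra cost over a conventional chain Viterbi is only linear in L.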
3 Experiments with NER data

3.1 Baseline algorithms and datasets

In our experiments, we trained semi-CRFs to mark entity segments with the label I, and put non-entity words into unit-length segments with label O. We compared this with two versions of CRFs. The first version, which we call CRF/1, labels words inside and outside entities with I and O, respectively. The second version, called CRF/4, replaces the I tag with four tags B, E, C, and U, which depend on where the word appears in an entity [2].

We compared the algorithms on five NER problems, associated with three different corpora. The Address corpus contains 4,226 words, and consists of 395 home addresses of students in a major university in India [1]. We considered extraction of city names and state names from this corpus. The Jobs corpus contains 73,330 words, and consists of 300 computer-related job postings [4]. We considered extraction of company names and job titles. The 18,121-word Email corpus contains 216 email messages taken from the CSPACE email corpus [10], which is mail associated with a 14-week, 277-person management game. Here we considered extraction of person names.

² As in the forward-backward algorithm for chain CRFs [16], space requirements here can be reduced from ML|Y| to M|Y|, where M is the length of the sequence, by pre-computing an appropriate set of α values.
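The α recursion of Section 2.4, whose space cost footnote 2 discusses, can be sketched in log space for numerical stability. This is a sketch under stated assumptions, not the authors' code: `score` is again a hypothetical stand-in for W · g, and the first segment is scored exactly once with no previous label (standing in for the α(0, y) = 1 base case):

```python
import math

# A minimal log-space sketch (not the authors' implementation) of the
# alpha recursion from Section 2.4:
#   alpha(i, y) = sum_{d=1..L} sum_{y'} alpha(i-d, y') * exp(W.g(y, y', x, i-d, i))
# with Z_W(x) = sum_y alpha(|x|, y). Spans are 0-indexed and inclusive.

def log_partition(x, labels, L, score):
    """Return log Z_W(x) via the semi-Markov forward recursion."""
    n = len(x)
    log_alpha = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            terms = []
            for d in range(1, min(L, i) + 1):
                if i - d == 0:
                    # First segment: no previous label, counted once.
                    terms.append(score(y, None, x, 0, i - 1))
                else:
                    for y_prev in labels:
                        terms.append(log_alpha[i - d][y_prev]
                                     + score(y, y_prev, x, i - d, i - 1))
            m = max(terms)  # log-sum-exp trick for stability
            log_alpha[i][y] = m + math.log(sum(math.exp(t - m) for t in terms))
    vals = [log_alpha[n][y] for y in labels]
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))
```

The η recursion for feature expectations has the same loop structure, carrying an extra accumulator per feature alongside α.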
3.2 Features

As features for CRFs, we used indicators for specific words at location i, or locations within three words of i. Following previous NER work [7], we also used indicators for capitalization/letter patterns (such as "Aa+" for a capitalized word, or "D" for a single-digit number).

As features for semi-CRFs, we used the same set of word-level features, as well as their logical extensions to segments. Specifically, we used indicators for the phrase inside a segment and the capitalization pattern inside a segment, as well as indicators for words and capitalization patterns in 3-word windows before and after the segment. We also used indicators for each segment length (d = 1, ..., L), and combined all word-level features with indicators for the beginning and end of a segment.

To exploit more of the power of semi-CRFs, we also implemented a number of dictionary-derived features, each of which was based on a different dictionary D and similarity function sim. Letting x_{s_j} denote the subsequence <x_{t_j} ... x_{u_j}>, a dictionary feature is defined as g^{D,sim}(j, x, s) = max_{u ∈ D} sim(x_{s_j}, u), i.e., the similarity of the word sequence x_{s_j} to the closest element in D.

For each of the extraction problems, we assembled one external dictionary of strings, which were similar (but not identical) to the entity names in the documents. For instance, for city names in the Address data, we used a web page listing cities in India. Due to variations in the way entity names are written, rote matching these dictionaries to the data gives relatively low F1 values, ranging from 22% (for the job-title extraction task) to 57% (for the person-name task). We used three different similarity metrics (Jaccard, TF-IDF, and JaroWinkler) which are known to work well for name-matching in data integration tasks [5]. All of the distance metrics are non-Markovian, i.e., the distance-based segment features cannot be decomposed into sums of local features. More detail on the distance metrics, feature sets, and datasets above can be found elsewhere [6].

We also extended the semi-CRF algorithm to construct, on the fly, an internal segment dictionary of segments labeled as entities in the training data. To make measurements on training data similar to those on test data, when finding the closest neighbor of x_{s_j} in the internal dictionary, we excluded all strings formed from x, thus excluding matches of x_{s_j} to itself (or subsequences of itself).
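Such a dictionary-derived segment feature can be sketched as follows (hypothetical code, not the authors' implementation; Jaccard similarity over token sets is one of the three metrics mentioned above):

```python
# Hypothetical sketch of a dictionary-derived segment feature:
#   g_{D,sim}(j, x, s) = max over u in D of sim(x_sj, u)
# here with Jaccard similarity over lowercased token sets.

def jaccard(a_tokens, b_tokens):
    """|A intersect B| / |A union B| over token sets."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def dictionary_feature(x, t, u, dictionary):
    """Similarity of the segment x[t..u] to its closest dictionary entry."""
    seg = [w.lower() for w in x[t:u + 1]]
    return max((jaccard(seg, entry.lower().split()) for entry in dictionary),
               default=0.0)

x = "ships to New Delhi India".split()
D = ["new delhi", "mumbai", "greater new delhi area"]
print(dictionary_feature(x, 2, 3, D))  # segment "New Delhi" -> 1.0
```

Because the value depends on the whole candidate segment, a feature like this cannot be decomposed into per-word local features, which is precisely the non-Markovian property noted above.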
This feature could be viewed as a sort of nearest-neighbor classifier; in this interpretation the semi-CRF is performing a sort of bi-level stacking [20].

For completeness in the experiments, we also evaluated local versions of the dictionary features. Specifically, we constructed dictionary features of the form f^{D,sim}(i, x, y) = max_{u ∈ D} sim(x_i, u), where D is either the external dictionary used above, or an internal word dictionary formed from all words contained in entities. As before, words in x were excluded in finding near neighbors to x_i.

3.3 Results and Discussion

We evaluated F1-measure performance³ of CRF/1, CRF/4, and semi-CRFs, with and without internal and external dictionaries. A detailed tabulation of the results is shown in Table 1, and Figure 1 shows F1 values plotted against training set size for a subset of three of the tasks, and four of the learning methods. In each experiment performance was averaged over seven runs, and evaluation was performed on a hold-out set of 30% of the documents. In the table the learners are trained with 10% of the available data; as the curves show, performance differences are often smaller with more training data. Gaussian priors were used for all algorithms, and for semi-CRFs, a fixed value of L was chosen for each dataset based on observed entity lengths. This ranged between 4 and 6 for the different datasets.

³ F1 is defined as 2 * precision * recall / (precision + recall).

[Figure 1 here: three panels (Address_State, Address_City, Email_Person) plotting F1 span accuracy against the fraction of available training data for CRF/4, semi-CRF, and CRF/1.]

Figure 1: F1 as a function of training set size. Algorithms marked with "+dict" include external dictionary features, and algorithms marked with "+int" include internal dictionary features. We do not use internal dictionary features for CRF/4 since they lead to reduced accuracy.
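The F1 measure of footnote 3 can be written out directly; a small illustrative helper (the zero-denominator convention is our assumption, not from the paper):

```python
# F1 as defined in footnote 3: the harmonic mean of precision and recall.
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0  # assumed convention for the degenerate case
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 1.0))  # harmonic mean of 0.5 and 1.0
```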
                    baseline  +internal dict    +external dict    +both dictionaries
                      F1        F1     Δbase      F1     Δbase      F1     Δbase   Δext
  CRF/1
    state             20.8      44.5   113.9      69.2   232.7      55.2   165.4   -67.3
    title             28.5       3.8   -86.7      38.6    35.4      19.9   -30.2   -65.6
    person            67.6      48.0   -29.0      81.4    20.4      64.7    -4.3   -24.7
    city              70.3      60.0   -14.7      80.4    14.4      69.8    -0.7   -15.1
    company           51.4      16.5   -67.9      55.3     7.6      15.6   -69.6   -77.2
  CRF/4
    state             15.0      25.4    69.3      46.8   212.0      43.1   187.3   -24.7
    title             23.7       7.9   -66.7      36.4    53.6      14.6   -38.4   -92.0
    person            70.9      64.5    -9.0      82.5    16.4      74.8     5.5   -10.9
    city              73.2      70.6    -3.6      80.8    10.4      76.3     4.2    -6.1
    company           54.8      20.6   -62.4      61.2    11.7      25.1   -54.2   -65.9
  semi-CRF
    state             25.6      35.5    38.7      62.7   144.9      65.2   154.7     9.8
    title             33.8      37.5    10.9      41.1    21.5      40.2    18.9    -2.5
    person            72.2      74.8     3.6      82.8    14.7      83.7    15.9     1.2
    city              75.9      75.3    -0.8      84.0    10.7      83.6    10.1    -0.5
    company           60.2      59.7    -0.8      60.9     1.2      60.9     1.2     0.0

Table 1: Comparing various methods on five IE tasks, with and without dictionary features. The column Δbase is the percentage change in F1 relative to the baseline; the column Δext is the change relative to using only external-dictionary features.

In the baseline configuration, in which no dictionary features are used, semi-CRFs perform best on all five of the tasks. When internal dictionary features are used, the performance of semi-CRFs is often improved, and never degraded by more than 2.5%. However, the less-natural local version of these features often leads to substantial performance losses for CRF/1 and CRF/4. Semi-CRFs perform best on nine of the ten task variants for which internal dictionaries were used. The external-dictionary features are helpful to all the algorithms. Semi-CRFs perform best on three of five tasks in which only external dictionaries were used.

Overall, semi-CRFs perform quite well. If we consider the tasks with and without external dictionary features as separate "conditions", then semi-CRFs using all available information⁴ outperform both CRF variants on eight of ten "conditions".

We also compared semi-CRFs to order-L CRFs, with various values of L.⁵

⁴ I.e., the both-dictionary version when external dictionaries are available, and the internal-dictionary-only version otherwise.
⁵ Order-L CRFs were implemented by replacing the label set Y with Y^L. We limited experiments to L ≤ 3 for computational reasons.
In Table 2 we show the results for L = 1, L = 2, and L = 3, compared to semi-CRFs. For these tasks, the performance of CRF/4 and CRF/1 does not seem to improve much by simply increasing the order.

                      CRF/1                 CRF/4                 semi-CRF
                      L=1    L=2    L=3     L=1    L=2    L=3
  Address State       20.8   20.1   19.2    15.0   16.4   16.4    25.6
  Address City        70.3   71.0   71.2    73.2   73.9   73.7    75.9
  Email Person        67.6   63.7   66.7    70.9   70.7   70.4    72.2

Table 2: F1 values for different-order CRFs.

4 Related work

Semi-CRFs are similar to nested HMMs [1], which can also be trained discriminatively [17]. The primary difference is that the "inner model" for semi-CRFs is of short, uniformly-labeled segments with non-Markovian properties, while nested HMMs allow longer, diversely-labeled, Markovian "segments". Dynamic CRFs [18] can, with an appropriate network architecture, be used to implement semi-CRFs. Another non-Markovian model recently used for NER is relational Markov networks (RMNs) [3]. However, in both dynamic CRFs and RMNs, inference is not tractable, so a number of approximations must be made in training and classification. An interesting question for future research is whether the tractable extension to CRF inference considered here can be used to improve inference methods for RMNs or dynamic CRFs.

In recent prior work [6], we investigated semi-Markov learning methods for NER based on a voted perceptron training algorithm [7]. The voted perceptron has some advantages in ease of implementation and efficiency.⁶ However, semi-CRFs perform somewhat better, on average, than our perceptron-based learning algorithm. Probabilistically-grounded approaches like CRFs are also preferable to margin-based approaches like the voted perceptron in certain settings, e.g., when it is necessary to estimate confidences in a classification.

⁶ In particular, the voted perceptron algorithm is more stable numerically, as it does not need to compute a partition function. Our CRF implementation does all computations in log space, which is relatively expensive.

5 Concluding Remarks

Semi-CRFs are a tractable extension of CRFs that offer much of the power of higher-order models without the associated computational cost. A major advantage of semi-CRFs is that they allow features which measure properties of segments, rather than individual elements. For applications like NER and gene-finding [11], these features can be quite natural.

Appendix

An implementation of semi-CRFs is available at http://crf.sourceforge.net, and a NER package that uses it is available at http://minorthird.sourceforge.net.

References

[1] V. R. Borkar, K. Deshmukh, and S. Sarawagi. Automatic text segmentation for extracting structured records. In Proc. ACM SIGMOD International Conf. on Management of Data, Santa Barbara, USA, 2001.
[2] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Sixth Workshop on Very Large Corpora, New Brunswick, New Jersey. Association for Computational Linguistics, 1998.
[3] R. Bunescu and R. J. Mooney. Relational Markov networks for collective information extraction. In Proceedings of the ICML-2004 Workshop on Statistical Relational Learning (SRL-2004), Banff, Canada, July 2004.
[4] M. E. Califf and R. J. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177-210, 2003.
[5] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), 2003.
[6] W. W. Cohen and S. Sarawagi. Exploiting dictionaries in named entity extraction: Combining semi-Markov extraction processes and data integration methods. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
[7] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Empirical Methods in Natural Language Processing (EMNLP), 2002.
[8] X. Ge. Segmental Semi-Markov Models and Applications to Sequence Analysis. PhD thesis, University of California, Irvine, December 2002.
[9] J. Janssen and N. Limnios. Semi-Markov Models and Applications. Kluwer Academic, 1999.
[10] R. E. Kraut, S. R. Fussell, F. J. Lerch, and J. A. Espinosa. Coordination in teams: evidence from a simulated management game. To appear in the Journal of Organizational Behavior, 2004.
[11] A. Krogh. Gene finding: putting the parts together. In M. J. Bishop, editor, Guide to Human Genome Computing, pages 261-274. Academic Press, 2nd edition, 1998.
[12] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML-2001), Williams, MA, 2001.
[13] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large-scale optimization. Mathematical Programming, 45:503-528, 1989.
[14] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49-55, 2002.
[15] A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), Edmonton, Canada, 2003.
[16] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, 2003.
[17] M. Skounakis, M. Craven, and S. Ray. Hierarchical hidden Markov models for information extraction. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico. Morgan Kaufmann, 2003.
[18] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In ICML, 2004.
[19] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.
[20] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.