Sorry, I Forgot the Attachment: Email Attachment Prediction
Mark Dredze

Uploaded On 2015-01-22

John Blitzer, Computer and Information Sciences Department, University of Pennsylvania, Philadelphia, PA 19104, blitzer@cis.upenn.edu
Fernando Pereira, Computer and Information Sciences Department, University of Pennsylvania, Philadelphia, PA 19104


Presentation Transcript

                                Precision  Recall  F
Balanced        Entire Corpus   0.85       0.56    0.67
                Transfer        0.78       0.42    0.54
High Recall     Entire Corpus   0.55       0.84    0.66
                Transfer        0.40       0.72    0.51
High Precision  Entire Corpus   0.90       0.49    0.63
                Transfer        0.87       0.27    0.41

Table 1: Results for high recall, high precision, and balanced classifiers evaluated on the entire corpus and simulated user transfer. Results are averaged over 10 runs.

3. CORPUS

We evaluated both our high precision and our high recall classifiers on the Enron email corpus. While the original Enron emails contained attachment information, this information had been excluded from the prepared corpus [5]. We annotated the corpus with information from the original data file (a database dump) to produce an annotated subset of the corpus. Each email received an "X-Header" indicating whether or not it contained an attachment. Emails that had one or more attachments received an additional header indicating the name and type of the attachment. Emails received this header only if they actually were sent with an attachment by the original user. A positive label corresponded to a positive value in the "X-Header", meaning that the email had an attachment; negative labels were applied to emails that did not contain an attachment. Our negatively-labeled instances may include emails that should include an attachment, which was forgotten. It would be difficult to correct these labels manually, and we believe that they are a relatively small fraction of the overall corpus. For training and testing purposes, we selected email from 24 users, which had mailboxes larger than 30 messages, yielding a total of 7656 messages of which 1017 had attachments, about 13%. Each mailbox varied in size and was a combination of multiple email folders, including folders such as "inbox" and "discussion".

We had to make several modifications to the corpus to prepare it for our task. First, many attachment emails actually contained a forwarded message as an attachment, the default behavior of some clients. We also excluded some user mailboxes that appeared to be composed mostly of machine generated email. Furthermore, there were several hundred attachment emails that were formatted reports relating to financial data. While this email is easier to classify, the large similarity and volume of these messages unduly positively influenced performance. We removed these emails from the corpus. Finally, if an email was sent to two users, it appeared twice in the corpus. We removed these duplicate emails.

We discovered a significant problem with the corpus data for our task. Since the emails in the corpus had actually contained attachments, residual artifacts were introduced by various email clients. For example, some attachment messages included artifacts such as "<<File: E&Y Memo.doc>>" or "- E&Y Memo.doc". Since a real email draft would never include these attributes, we needed to remove these artifacts. We hand checked 200 messages that had attachments, randomly sampled from each user's inbox to obtain a wide variety of email types. As each email was checked, we developed a list of artifacts to remove. After automatically processing the corpus to remove these artifacts, we rechecked the 200 emails to verify that they were clean of any of these features.

While each of these modifications predictably lowered our system's performance, we feel that the resulting corpus more accurately reflects real world email and that our performance numbers more closely model a real world system.

4. EVALUATION

Using our cleaned dataset of 7656 messages we conducted two evaluations. First, we randomly split the corpus along an 80/20 train-test split, training and testing classifiers on the split corpus. These results are presented as Entire Corpus. Next, we split the corpus by user, sorting the users randomly using an 80/20 train-test split. If a user was placed in the train group, all email from that user was used for training; the same was done for the test group. This ensured that if an email belonged to the train or test set, all other emails in that user's mailbox were placed into the same split. Therefore, a mailbox used in the testing of the classifier did not affect training. This represented a pseudo-transfer task between users and is presented as Transfer. While we would ideally like to test our classifier on a user's sent mail, this information was not available in the corpus. We plan to evaluate further transfer scenarios in our future work, such as transferring a classifier trained on Enron data to non-users.

We evaluated our high precision and high recall classifiers, as well as a balanced classifier, on both of these datasets 10 times. Our results show the average of the 10 runs.

5. DISCUSSION AND FUTURE WORK

Table 1 presents results for the three classifiers. For testing on the entire corpus, the balanced classifier produces promising results which favor precision. If an email discusses a document then it will have an attachment (precision), while many emails that contain attachments lack discussion about the document (recall). We noticed several interesting examples where an email did not directly refer to an attachment and we were unable to conclude if there was an attachment from the language of the email except by checking if one was included. This indicates that the problem may be complex even for a human reader. An interesting result is that when attempting to transfer a classifier between users, performance suffers substantially, 13 points in the balanced classifier case. Different users are likely to deal with different types of attachments so each user's lexicon varies substantially. A small precision drop compared to a substantial drop in recall lends evidence to this hypothesis.

This observation leads us to explore user transfer in future work. Any real world system would need to operate on a new user, so transfer between users is vital. The Enron corpus displays terms unknown in other environments, such as "rent rolls". While the presence of "rent roll" may be a good feature in the Enron domain, it is a poor indicator in most email. Other users may discuss documents such as "budgets" or "proposals" giving these features importance in attachment prediction. Effective transfer would allow for a mapping between these words in the Enron domain to another user's lexicon. Structural Correspondence Learning uses "pivot features" common in the source and target
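
The per-user split behind the Transfer results in Section 4 can be illustrated with a minimal sketch. This is not the authors' code: the function name split_by_user, the (user, text, label) record layout, and the seed parameter are all assumptions made for illustration. The only property taken from the text is that users, not individual messages, are divided 80/20, so every message from a given mailbox lands on the same side.

```python
import random
from collections import defaultdict


def split_by_user(messages, train_fraction=0.8, seed=0):
    """Split (user, text, label) records so each user's mail stays together.

    Hypothetical helper: the record layout and parameter names are
    illustrative assumptions, not taken from the paper.
    """
    # Group every message under the mailbox (user) it came from.
    by_user = defaultdict(list)
    for record in messages:
        by_user[record[0]].append(record)

    # Shuffle the users, not the messages, then cut the user list 80/20.
    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    n_train = int(round(train_fraction * len(users)))

    train = [r for u in users[:n_train] for r in by_user[u]]
    test = [r for u in users[n_train:] for r in by_user[u]]
    return train, test


if __name__ == "__main__":
    # Toy data: 10 users with 3 messages each; label 1 means "has attachment".
    toy = [(f"user{i}", f"message {j}", j % 2) for i in range(10) for j in range(3)]
    train, test = split_by_user(toy)
    print(len({r[0] for r in train}), "train users,", len({r[0] for r in test}), "test users")
```

Because the cut is made over users, no mailbox contributes to both sets, which is what makes the second evaluation a pseudo-transfer task rather than a second random split.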