upennedu John Blitzer Computer and Information Sciences Department University of Pennsylvania Philadelphia PA 19104 blitzercisupennedu Fernando Pereira Computer and Information Sciences Department University of Pennsylvania Philadelphia PA 19104 pere
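As a reading aid for the F column of Table 1, the following is a minimal sketch that recomputes F from the reported precision and recall. It assumes that F denotes the balanced F1 score (the harmonic mean of precision and recall), which the text does not state explicitly; the helper name `f1` and the `TABLE_1` list are illustrative and not part of the original system. Because the reported figures are averages over 10 runs, the value recomputed from averaged precision and recall need not match the reported F exactly.

```python
# Illustrative check only. Assumption: the F column in Table 1 is the
# balanced F1 score, i.e. the harmonic mean of precision and recall.


def f1(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (classifier, evaluation, precision, recall, reported F), copied from Table 1.
TABLE_1 = [
    ("Balanced", "Entire Corpus", 0.85, 0.56, 0.67),
    ("Balanced", "Transfer", 0.78, 0.42, 0.54),
    ("High Recall", "Entire Corpus", 0.55, 0.84, 0.66),
    ("High Recall", "Transfer", 0.40, 0.72, 0.51),
    ("High Precision", "Entire Corpus", 0.90, 0.49, 0.63),
    ("High Precision", "Transfer", 0.87, 0.27, 0.41),
]

for name, setting, p, r, reported in TABLE_1:
    # Print the recomputed value next to the value reported in the table.
    print(f"{name:15s} {setting:14s} "
          f"F1 from P,R = {f1(p, r):.3f}   reported F = {reported:.2f}")
```

The 13-point drop cited in the discussion of user transfer corresponds to the difference between the reported balanced F values, 0.67 and 0.54.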
Classifier      Evaluation     Precision  Recall  F
Balanced        Entire Corpus  0.85       0.56    0.67
Balanced        Transfer       0.78       0.42    0.54
High Recall     Entire Corpus  0.55       0.84    0.66
High Recall     Transfer       0.40       0.72    0.51
High Precision  Entire Corpus  0.90       0.49    0.63
High Precision  Transfer       0.87       0.27    0.41

Table 1: Results for high recall, high precision, and balanced classifiers evaluated on the entire corpus and simulated user transfer. Results are averaged over 10 runs.

3. CORPUS

We evaluated both our high precision and our high recall classifiers on the Enron email corpus. While the original Enron emails contained attachment information, this information had been excluded from the prepared corpus [5]. We annotated the corpus with information from the original data file (a database dump) to produce an annotated subset of the corpus. Each email received an "X-Header" indicating whether or not it contained an attachment. Emails that had one or more attachments received an additional header indicating the name and type of the attachment. Emails received this header only if they actually were sent with an attachment by the original user. A positive label corresponded to a positive value in the "X-Header", meaning that the email had an attachment; negative labels were applied to emails that did not contain an attachment. Our negatively-labeled instances may include emails that should include an attachment, which was forgotten. It would be difficult to correct these labels manually, and we believe that they are a relatively small fraction of the overall corpus. For training and testing purposes, we selected email from 24 users, which had mailboxes larger than 30 messages, yielding a total of 7656 messages of which 1017 had attachments, about 13%. Each mailbox varied in size and was a combination of multiple email folders, including folders such as "inbox" and "discussion".

We had to make several modifications to the corpus to prepare it for our task. First, many attachment emails actually contained a forwarded message as an attachment, the default behavior of some clients. We also excluded some user mailboxes that appeared to be composed mostly of machine generated email. Furthermore, there were several hundred attachment emails that were formatted reports relating to financial data. While this email is easier to classify, the large similarity and volume of these messages unduly positively influenced performance. We removed these emails from the corpus. Finally, if an email was sent to two users, it appeared twice in the corpus. We removed these duplicate emails.

We discovered a significant problem with the corpus data for our task. Since the emails in the corpus had actually contained attachments, residual artifacts were introduced by various email clients. For example, some attachment messages included artifacts such as "<<File: E&YMemo.doc>>" or "- E&YMemo.doc". Since a real email draft would never include these attributes, we needed to remove these artifacts. We hand checked 200 messages that had attachments, randomly sampled from each user's inbox to obtain a wide variety of email types. As each email was checked, we developed a list of artifacts to remove. After automatically processing the corpus to remove these artifacts, we rechecked the 200 emails to verify that they were clean of any of these features. While each of these modifications predictably lowered our system's performance, we feel that the resulting corpus more accurately reflects real world email and that our performance numbers more closely model a real world system.

4. EVALUATION

Using our cleaned dataset of 7656 messages we conducted two evaluations. First, we randomly split the corpus along an 80/20 train-test split, training and testing classifiers on the split corpus. These results are presented as Entire Corpus. Next, we split the corpus by user, sorting the users randomly using an 80/20 train-test split. If a user was placed in the train group, all email from that user was used for training; the same was done for the test group. This ensured that if an email belonged to the train or test set, all other emails in that user's mailbox were placed into the same split. Therefore, a mailbox used in the testing of the classifier did not affect training. This represented a pseudo-transfer task between users and is presented as Transfer. While we would ideally like to test our classifier on a user's sent mail, this information was not available in the corpus. We plan to evaluate further transfer scenarios in our future work, such as transferring a classifier trained on Enron data to non-users. We evaluated our high precision and high recall classifiers, as well as a balanced classifier, on both of these datasets 10 times. Our results show the average of the 10 runs.

5. DISCUSSION AND FUTURE WORK

Table 1 presents results for the three classifiers. For testing on the entire corpus, the balanced classifier produces promising results which favor precision. If an email discusses a document then it will have an attachment (precision), while many emails that contain attachments lack discussion about the document (recall). We noticed several interesting examples where an email did not directly refer to an attachment and we were unable to conclude if there was an attachment from the language of the email except by checking if one was included. This indicates that the problem may be complex even for a human reader. An interesting result is that when attempting to transfer a classifier between users, performance suffers substantially, 13 points in the balanced classifier case. Different users are likely to deal with different types of attachments so each user's lexicon varies substantially. A small precision drop compared to a substantial drop in recall lends evidence to this hypothesis.

This observation leads us to explore user transfer in future work. Any real world system would need to operate on a new user, so transfer between users is vital. The Enron corpus displays terms unknown in other environments, such as "rent rolls". While the presence of "rent roll" may be a good feature in the Enron domain, it is a poor indicator in most email. Other users may discuss documents such as "budgets" or "proposals" giving these features importance in attachment prediction. Effective transfer would allow for a mapping between these words in the Enron domain to another user's lexicon. Structural Correspondence Learning uses "pivot features" common in the source and target