
On the Feasibility of Parser-based Log Compression in Large-Scale Cloud Systems

Junyu Wei†, Guangyan Zhang†, Yang Wang‡, Zhiwei Liu¶, Zhanyang Zhu†, Junchao Chen†, Tingtao Sun§, Qi Zhou§
†Tsinghua University, ‡The Ohio State University, ¶China University of Geosciences, §Alibaba Cloud
Corresponding author: gyzh@tsinghua.edu.cn

Abstract

Given the tremendous scale of today's system logs, compression is widely used to save space. While parser-based log compressors have reported promising results, we observe less impressive performance when applying them to our production logs. Our detailed analysis shows that, first, some problems are caused by a combination of sub-optimal implementation and assumptions that do not hold on our large-scale logs. We address these issues with a more efficient implementation. Furthermore, our analysis reveals new opportunities for further improvement. In particular, numerical values account for a significant percentage of space, and classic compression algorithms, which try to identify duplicate bytes, do not work well on numerical values. We propose three techniques, namely delta timestamps, correlation identification, and elastic encoding, to further compress numerical values.

Based on these techniques, we have built LogReducer. Our evaluation on 18 types of production logs and 16 types of public logs shows that LogReducer achieves the highest compression ratio in almost all cases and that, on large logs, its speed is comparable to the general-purpose compression algorithm that targets a high compression ratio.

1 Introduction

Most systems log internal events for various reasons, such as diagnosing system errors [9, 36, 48], profiling user behaviors [10, 11, 28], modeling system performance [2, 15], and detecting potential security problems [12, 37]. In today's data centers, the size of such logs can grow large. In 2016, Feng et al. reported that their system can generate 100 GB of logs per day [13], and in 2019, this number increased to 2 TB per day [30]. AliCloud, a major cloud provider and our collaborator, can generate several PBs of logs per day.

These logs usually need to be stored for a long time for multiple reasons: sometimes an anomaly is detected much later than it was logged, so the developer needs to analyze the past logs [1, 9, 21, 24]; certain analysis may require statistics over a long period of time to generate a conclusion [10, 14, 45]; for auditing purposes, local laws require a cloud provider to store these logs for a certain amount of time. As a result, AliCloud has decided to store its logs for 180 days. Considering that it generates several PBs of logs per day, storing these logs is a considerable overhead even for a big company.

To reduce log size, a classic solution is to compress these logs. General-purpose lossless compression methods, such as LZMA [40], gzip [6], PPMd [4], and bzip [41], can compress a file by identifying and replacing duplicate bytes. A number of recent works observe that most system logs are generated by adding variables to a string template (e.g. printf("value=%d", v)), and thus, by separating the two, they only need to store the variables [23, 30, 31, 33, 46]. We call these approaches parser-based log compression in this paper.

While these works report promising results, we observe less impressive performance when applying this method to production logs from AliCloud: when applying Logzip [30], the latest one in this line of work, to our logs, we find it is seven times slower than LZMA, a general-purpose compression method targeting a high compression ratio, and Logzip's compression ratio is worse than LZMA's on 13 out of the 18 types of logs.

To understand whether such problems are fundamental or due to engineering issues, we perform a detailed analysis of Logzip. Our analysis shows that, first, some problems are indeed caused by a combination of sub-optimal implementation and undesirable limits: Logzip is implemented in Python and uses several notoriously slow libraries and data structures like Pandas DataFrame [5]; Logzip limits a log entry to no more than 5 variables, which is too small for our logs; increasing the limit will further slow down Logzip, which is already seven times slower than LZMA. We address these issues by re-implementing the whole algorithm in C/C++, which significantly improves the compression speed. It further allows us to remove the limit on the number of variables to improve the compression ratio as well.

USENIX Association 19th USENIX Conference on File and Storage Technologies 249

Second, our analysis reveals new opportunities for further improvement. At a high level, Logzip uses general-purpose compression methods to further compress variables: while this works well for string variables, it does not work well for numerical variables, since general-purpose compression methods target finding duplicate bytes, and there is not much duplication in numerical data in our experiments. We incorporate three techniques to further compress numerical data:

We observe that timestamps account for over 20% of the space in 8 out of 18 types (even 70% in one type) in the compressed files, mainly because AliCloud needs micro-second level timing information to accurately identify the order of events, for purposes like performance debugging and resolving conflicts. To compress timestamps, we use the classic differential method to compute and store the delta value of two consecutive timestamps.

We observe that numerical data are often correlated. A typical example is in an I/O log: when the user performs sequential I/Os, the offset of the next I/O is equal to the sum of the offset and length of the previous I/O. Such correlation provides an obvious opportunity to further compress numerical data. Following this idea, we have developed a novel algorithm to identify simple numerical correlation in log samples and apply the found rules during compression.

We observe that most numerical values are small and that using fixed-length coding (e.g. 32 bits for an integer) will generate many 0 bits at the beginning. We propose elastic coding, which represents a number with an elastic number of bytes, to trim leading zeroes. Compared to general-purpose compression algorithms, elastic coding is more efficient at trimming leading zeroes; compared to fixed-length coding, elastic coding can reduce the length when the value is small but may increase the length when the value is large, which is a beneficial trade-off given our observation.

By combining all the efforts mentioned above, we have built LogReducer. We have applied LogReducer to 18 types of AliCloud logs (1.76 TB in total) and 16 types of public logs (77 GB in total). Compared with LZMA, LogReducer can achieve 1.19× to 5.30× compression ratio on all cases and 0.56× to 3.16× compression speed on logs over 100 MB (LogReducer is comparably slower on smaller logs because of its initialization overhead). Compared with Logzip, LogReducer can achieve 1.03× to 4.01× compression ratio and 2.05× to 182.31× compression speed. Such results have confirmed that, with proper implementation and optimization, parser-based log compression is promising for compressing large-scale production logs.

The contribution of this paper is three-fold. First, we study why state-of-the-art parser-based compression methods do not perform well on our production logs. Second, based on the study, we build LogReducer by improving the implementation of existing methods, applying proper techniques based on the characteristics of the logs, and introducing new techniques. Finally, we demonstrate the efficacy of LogReducer on a variety of logs. LogReducer is open source [39].

2 Background

2.1 Structure of Cloud Logs

We collect a large set of logs generated in AliCloud. They are from different applications developed by different teams, which serve various purposes, e.g., warning and error reporting, infrastructure monitoring, user behavior tracing, and periodical summary. Table 1 shows examples of three types of logs. Samples of all 18 types can be found in [38].

The basic structure of these logs contains three parts: header, template and variable. A header includes the timestamp and the corresponding log level. The header is added by AliCloud's logging system automatically and its format is relatively static, which allows us to use a regular expression to separate the header from the remaining part. The rest of this section mainly discusses how to parse the remaining part into templates and variables.

Templates are the formalized output statements of logs. In Log F, “Write chunk %s Offset %d Length %d” and “Read chunk %s Offset %d” are two templates. Variables refer to the part which varies in each instance of the same template. In Log F, they include “3242513_B”, “339911”, “11”, etc.

User behavior tracing (Log F)
[2019-08-27 15:21:24.456234] [INFO] Write chunk: 3242513_B Offset: 339911 Length: 11
[2019-08-27 15:21:24.463321] [INFO] Read chunk: 3242514_C Offset: 272633
[2019-08-27 15:21:24.464322] [INFO] Write chunk: 3242512_F Offset: 318374 Length: 7
[2019-08-27 15:21:24.474433] [INFO] Write chunk: 3242513_B Offset: 339922 Length: 55

Infrastructure monitoring (Log D)
[2018-01-12 08:53:12.188370] [10593] project:393 logstore:XDoFiqnlmZd shard:78 inflow:3376 dataInflow:18869
[2018-01-12 08:53:12.188390] [10593] project:656 logstore:lOdMafL31Pg shard:37 inflow:7506 dataInflow:42712

Warning and error reporting (Log Q)
Aug 28 03:09:02 h10c10322.et15 su[57118]: (to nobody) root on none
Aug 28 03:09:02 h10c10322.et15 su[57118]: session opened for user nobody by (uid=0)

Table 1: Examples of logs in AliCloud.

2.2 Parser-based Log Compressor

Parser-based log compression first uses a log parser to identify the template of each log entry and extract the corresponding variables; it then replaces the template string with a template ID to save space; it finally applies general-purpose compression methods to variables to further reduce space.
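The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the header regular expression is inferred from the Log F examples in Table 1, the template table and the split characters (space and colon) are assumptions, and all helper names are hypothetical.

```python
import lzma
import re

# Assumed header layout, inferred from the Log F examples: "[timestamp] [LEVEL] body".
HEADER_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s*\[(?P<level>[^\]]+)\]\s*(?P<body>.*)$")

# Hypothetical template table; "<*>" marks a variable position.
TEMPLATES = {
    1: ["Write", "chunk", "<*>", "Offset", "<*>", "Length", "<*>"],
    2: ["Read", "chunk", "<*>", "Offset", "<*>"],
}

def tokenize(body):
    """Split a log body on spaces and colons (assumed split characters)."""
    return [tok for tok in re.split(r"[ :]+", body) if tok]

def match(tokens):
    """Return (template_id, variables), or (None, tokens) when nothing matches."""
    for template_id, template in TEMPLATES.items():
        if len(template) != len(tokens):
            continue
        if all(t == "<*>" or t == tok for t, tok in zip(template, tokens)):
            variables = [tok for t, tok in zip(template, tokens) if t == "<*>"]
            return template_id, variables
    return None, tokens

def parse(line):
    """Separate the header, then reduce the body to a template ID plus variables."""
    m = HEADER_RE.match(line)
    if m is None:
        raise ValueError("unrecognized header")
    template_id, variables = match(tokenize(m.group("body")))
    return m.group("ts"), m.group("level"), template_id, variables

lines = [
    "[2019-08-27 15:21:24.456234] [INFO] Write chunk: 3242513_B Offset: 339911 Length: 11",
    "[2019-08-27 15:21:24.463321] [INFO] Read chunk: 3242514_C Offset: 272633",
]
parsed = [parse(line) for line in lines]

# Only template IDs and variables remain; hand them to a general-purpose compressor.
payload = "\n".join(f"{tid} {' '.join(vs)}" for _, _, tid, vs in parsed).encode()
compressed = lzma.compress(payload)
```

On two entries the LZMA container overhead dominates, so the sketch only shows the data flow; any space saving would appear on realistic volumes.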
Figure 1: Parser tree architecture.

Log parsers can be implemented using longest common string [8], clustering [35], and parser tree [23]; among them, the parser tree shows better effectiveness [49]. Here we first present the concept of the parser tree and then show how to build the parser tree and use it to separate templates and variables.

Parser tree. Given a list of templates and log entries, a naive approach to match an entry to a template is to compare the entry to each template and find the template which is most similar to the entry. However, when there are many templates, such one-by-one comparison is inefficient.

To improve the efficiency of template matching, several works [23, 30] use a parser tree to facilitate the matching: as shown in Figure 1, each leaf of the parser tree is a group of templates sharing the same length (i.e., the number of tokens in a log message); the first layer of internal nodes uses the length of the log entry to categorize the entry; the following layers of internal nodes form multiple paths, each of which leads to a leaf node in the parser tree. Both the internal node and the template use “<*>” to represent a variable.

Assuming we have built a parser tree as shown in Figure 1, and we have a log entry “Read chunk 3242514_C Offset 272633”: since its length is 5, we will first go to the internal node “Length=5”; since its first token is “Read”, we will then go to the internal node “Read”; and then “chunk”; finally we will compare the log entry with each template in the leaf node and find that “Read chunk <*> Offset <*>” is closest to the log entry, so we will choose this template and identify “3242514_C” and “272633” as variables.

Building the parser tree. A parser-based log compressor first builds the parser tree by parsing a sample of the log entries. For each log entry, the log parser performs four steps, and we use the log entry “Read chunk 3242514_C Offset 272633” as an example to explain these steps. In the beginning, the parser tree just has one root node.

In the first step, the log parser uses predefined split characters, such as empty space or comma, to split a log entry into a list of strings called tokens. In our example, the raw log message will be divided into “Read”, “chunk”, “3242514_C”, “Offset”, “272633” accordingly.

In the second step, the log parser will check whether the internal node of the length exists (Length=5 in our case). If not, the log parser will create a new internal node. Finally, it moves to the corresponding internal node.

In the third step, the log parser traverses the tree according to the tokens in the log entry and moves to corresponding internal nodes (“Read” and “chunk” in our example). After reaching the limitation of tree depth, it reaches a leaf node, which contains a group of templates. If the corresponding internal node does not exist, the log parser will build the internal node and add the node to the prefix tree.

Definition 1. Similarity between log $L$ and template $T$ ($l_i$ is the $i$th token in $L$; $t_i$ is the $i$th token in $T$; $f(a, b) = 1$ if $a = b$, otherwise $f(a, b) = 0$; $|\cdot|$ is the number of tokens in a log):

$$\mathrm{Similarity}(L, T) = \frac{\sum_{i} f(l_i, t_i)}{|L|} \qquad (1)$$

In the fourth step, the log parser searches for the most similar template in this template group using the similarity function defined in Equation 1. If the largest similarity is smaller than a threshold ε, the log parser will create a new template, which is the same as the log entry. Note that at this moment the log parser cannot tell which tokens of the log entry are variables. If the largest similarity is larger than ε, the log parser will regard this log entry as an instance of the matching template and update the template accordingly to mark different parts as variables.

For example, suppose the log parser first parses “Read chunk 3242514_C Offset 272633”: since there is no template yet, the log parser will create a new template “Read chunk 3242514_C Offset 272633”. Then suppose the log parser processes “Read chunk 3242514_B Offset 268832”: its similarity to “Read chunk 3242514_C Offset 272633” is 0.6 (three of the five tokens match), so if ε is smaller than 0.6, the log parser will consider “Read chunk 3242514_B Offset 268832” as an instance of “Read chunk 3242514_C Offset 272633” and update the template into “Read chunk <*> Offset <*>”; if ε is larger than 0.6, the log parser will treat “Read chunk 3242514_B Offset 268832” as a new template.

Compressing logs. Then the compressor uses the parser tree to compress logs [30]. The procedure is similar to building a parser tree, except that in this phase, the compressor will not update the parser tree. It will first utilize the parser tree to try to match each log entry to a template. If a match is found, the log entry will be converted to the template ID and the variables; if no template is matched, the log entry will be regarded as a mismatch and will not be converted.

Afterward the compressor will group log entries according to their template IDs and store their variables in a column manner, i.e., it first stores the first variable of each log entry in the group, then stores the second variable, and so on. The column-based storage is based on the observation that variables at the same position of the same template are prone to have more redundancy, so that sliding-window-based algorithms such as LZ77 [40] will have more chances to trim redundancy. Finally the compressor concatenates everything and compresses it with a general-purpose compressor.

3 Restore the Promise of Parser-based Log Compression

We tested Logzip, the most recent parser-based log compression implementation, on 18 types of AliCloud's production logs.

First, we find that the value of the similarity threshold ε has a critical impact on the performance of Logzip. When using Logzip's default value 0.5, we find Logzip takes nearly 20 days to build the parser tree and can generate tens of thousands of templates, which impairs both the speed and the compression ratio. We tuned this value on our logs and found that a value of 0.1 works well for almost all of our logs. This is due to the following reason: Logzip was mainly tested on PC logs, which usually are short and only contain a small number of variables; AliCloud's logs usually have more variables (see Table 2), and thus logs within the same templates are quite different from each other, i.e., they share only a small number of common tokens. Therefore, to extract the correct templates, we need a smaller ε. Manual tuning is always undesirable in a production environment: while we find the value 0.1 works well, for environments with more versatile logs, an automatic tuning procedure might be beneficial.

We continued testing Logzip with ε = 0.1 and found the result is still not ideal in terms of both compression ratio and compression speed: compared with LZMA, the general-purpose compression algorithm that can achieve the highest compression ratio on our logs, Logzip is seven times slower, and on 13 out of the 18 types of logs, Logzip's compression ratio is lower (§6.1). Our detailed analysis revealed a correlated problem between compression ratio and speed:

The Logzip implementation assumes that a log entry usually has no more than five variables: for log entries with more than five variables, Logzip will regard content after the first variable as a large variable, and feed it to the general-purpose compressor. However, as shown in Table 2, this assumption does not hold on most of our logs (many have over 10 variables and one has 176 variables). As a result, Logzip loses its effectiveness on our logs, which can explain its poor compression ratio.

We tried to increase this limit and found it further exacerbates the speed problem of Logzip: while Logzip is already seven times slower than LZMA with the limit of five variables, increasing the limit to 256 will make Logzip unbearably slow, which might be the reason Logzip sets a small limit. We profiled Logzip to understand its bottleneck and found it has used several notoriously slow libraries or data structures including Pandas DataFrame, Python array append, etc.

To address this problem, we re-implement the whole algorithm in C/C++ and dramatically improve the speed. The increased speed allows us to remove the limit of five variables as well. We further improve the speed with the following techniques:

Cutting the Parser Tree. We have observed that the total number of templates in our production logs is usually small: as shown in Table 2, 15 types of logs have less than 50 templates. If we group them based on length, the number of templates in one group is even smaller. The reason behind this observation is that these cloud logs are generated by developers in the operation engineering group of AliCloud, and thus their patterns are relatively static compared to logs generated by cloud users. Based on this observation, we cut the parser tree into only one layer in the compression phase: we only take length into consideration, store templates with the same length together, and search them one by one. This optimization has improved compression speed and avoided the tuning of the depth of the parser tree.

Batch processing. If we need to compress a large number of small log files, and we start one compressor process to compress each log, we observe that the overhead to start and stop processes could slow down the whole compression significantly. Therefore, we allow our compressor to take a batch of log files as inputs and compress them together.

With all the efforts mentioned above, we have restored the promise of parser-based compression: as shown in §6.2, compared with LZMA, our implementation (i.e., LogReducer-B) can achieve 1.16×-3.73× compression ratio and 0.51×-2.01× compression speed.

4 Further Compressing Numerical Variables

We have done a detailed analysis on the compressed files generated by the previous step and found that in 10 of the 18 types of logs, numerical variables account for over 50% of the space after compression; for other types of logs the rate is at least 20%; in three cases, the rate is over 80% (Table 3). This is because 1) our logs have a large number of numerical variables and 2) general-purpose compression methods, which try to identify redundant bytes, do not work well with numerical variables.

4.1 Compressing Timestamps

Our analysis shows that timestamps are the first dominant numerical data in our logs. As shown in Table

3, in eight types of logs, timestamps account for more than 20% of space. In one case, this rate can reach close to 70%. This is because AliCloud needs precise timing information to order events, for purposes like debugging and auditing. In its environment, system logs can be generated at a high speed, up to one million entries per second, which motivates AliCloud to record timestamps at micro-second level. As a result, first, it takes more bits to store the microsecond level timestamps than millisecond or second level timestamps, and second, there is not much redundancy in the timestamps.

Log type              Log A  Log B  Log C  Log D  Log E  Log F  Log G  Log H  Log I
# of templates        4229367472023948
Avg. # of variables   14510132212757176

Log type              Log J  Log K  Log L  Log M  Log N  Log O  Log P  Log Q  Log R
# of templates        29104164912043130
Avg. # of variables   346794132225

Table 2: Template information on 18 types of logs in AliCloud.

Log type          Log A  Log B  Log C  Log D  Log E  Log F  Log G  Log H  Log I
Number Rate (%)   46.63  68.51  52.19  82.86  51.69  88.42  33.92  45.51  31.65
Time Rate (%)     36.75  38.97  15.28  15.49  10.42  10.07  22.88  31.23   4.22

Log type          Log J  Log K  Log L  Log M  Log N  Log O  Log P  Log Q  Log R
Number Rate (%)   39.27  69.85  24.89  53.54  53.40  78.47  27.30  29.36  84.96
Time Rate (%)     26.96   7.18   9.32  21.53  25.63  14.79  15.77  14.90  68.27

Table 3: Space consumption of numerical variables and timestamps in compressed file.

To compress timestamps, we use the classical differential method, which records the delta value between two consecutive timestamps. This method can significantly reduce the size of timestamps when the target system generates logs frequently, since the delta values will then be small. By using this method, we can reduce the space overhead and pass a much smaller number to the general-purpose compressor, which can improve both compression ratio and compression speed.

4.2 Correlation Identification and Utilization

We observe that numerical variables are sometimes correlated. For example, in an I/O trace, if the user performs sequential I/Os, the offset of the next I/O will be equal to the sum of the offset and length of the previous I/O.

Such correlation provides an obvious opportunity to compress numerical data. If most values of certain variables follow a certain kind of correlation, we only need to store how values deviate from the correlation in a residue vector; since most values of the residue vector will be zeroes, they will be effectively compressed by a general-purpose compressor.

For example, for logs of type Log F with the template “Write Chunk <*> Length <*> Offset <*> Version <*>”, we can extract four variables from the template, and three of them are numerical variables, namely $\vec{L}$ (Length), $\vec{O}$ (Offset), and $\vec{V}$ (Version). In our logs, we find three types of correlations in these variables. Note that the values of each variable form a vector since there are multiple log entries.

Inter-variable correlation: Version is often equal to the sum of Offset and Length, namely $\vec{V} = \vec{L} + \vec{O}$, and its residue vector is $\vec{V} - \vec{L} - \vec{O}$.

Figure 2: Numerical correlations observed on Log F.

Intra-variable correlation: Lengths of the same Chunk ID are often close. We can compute its residue vector as $\vec{L}[i] - \vec{L}[i-1] = \overrightarrow{\Delta L}$.

Mixed correlation: if the user is performing sequential I/Os to a chunk, then its lengths and offsets have the following correlation: $\vec{O}[i] = \vec{O}[i-1] + \vec{L}[i-1]$. Its residue vector is $\vec{O}[i] - \vec{O}[i-1] - \vec{L}[i-1] = \overrightarrow{\Delta O} - \vec{L}$.

Correlation identification. We propose a novel method to identify such correlation. The goal of the correlation identification process is to find the relationship across and within different variables so that we can represent some variables with residue vectors, which can be compressed more effectively.

To achieve this goal, we first enumerate different combinations of variables, the IDs to group different entries (e.g. Chunk ID in Figure 2), and the aforementioned correlation rules, and compute the corresponding residue vectors. Then we select vectors from the original vectors and the residue vectors with the goals of 1) maximizing compression ratio and 2) being able to recover all original vectors.

The whole identification process is illustrated in Algorithm 1, which maintains three sets: the target set $Y$; the recoverable set $R$, including all original vectors that can be recovered from the current $Y$; and the total candidate set $T$, including all candidate vectors. One of its key data structures is a map from the candidate vectors to original vectors: $\mathrm{map}(\vec{C})$ will return all original vectors that $\vec{C}$ is built from (e.g., $\mathrm{map}(\vec{A} - \vec{B})$ = $\vec{A}$ and $\vec{B}$).

Algorithm 1 Correlation identification algorithm
 1: Recoverable set $R = \emptyset$
 2: Final vector set $Y = \emptyset$
 3: Initialize candidate set $T$
 4: repeat
 5:     $C = \{\vec{C} \in T : |\mathrm{map}(\vec{C}) - R| = 1\}$
 6:     $\vec{C}_{min}$ = vector with the smallest entropy in $C$
 7:     $Y \leftarrow Y \cup \vec{C}_{min}$
 8:     $R \leftarrow R \cup \mathrm{map}(\vec{C}_{min})$
 9: until $R$ contains all original vectors
10: Output $Y$

The algorithm works in iterations: in each iteration, it first tries to find all candidate vectors $\vec{C}$ that can recover one more variable compared to the current recoverable set $R$ (line 5); then among them, it chooses the one with the highest compression ratio (line 6). Here we predict the compression ratio of a candidate using its Shannon entropy [26], defined in Definition 2; finally it updates $Y$ and $R$ accordingly (lines 7 and 8); it repeats this until $Y$ can recover all original vectors (line 9). The cost of enumeration is acceptable, since it is performed on samples of logs.

Definition 2. Entropy for a variable vector. $S_A$ denotes the set of all values appearing in $\vec{A}$ and $\#(s)$ denotes the number of times the value $s$ appears in $\vec{A}$.

$$E(\vec{A}) = -\sum_{s \in S_A} \frac{\#(s)}{|A|} \log \frac{\#(s)}{|A|} \qquad (2)$$

In Figure 2, $Y$ will finally contain three residue vectors, namely $\{\overrightarrow{\Delta L}, \overrightarrow{\Delta O} - \vec{L}, \vec{V} - \vec{L} - \vec{O}\}$. These three residue vectors are enough to recover all original variable vectors $\vec{L}$, $\vec{O}$, $\vec{V}$, and will have a higher compression ratio than the original variable vectors.

Correlation utilization. The output of the training phase for numerical correlations is the target vector set $Y$. In the compression phase, we calculate each residue vector in set $Y$ and discard original vectors that do not appear in $Y$. If we apply the three correlations in $Y$ to our example in Figure 2, the result is shown in Figure 3. As one can see, for variables that perfectly match certain rules, their residue vectors contain many zeroes; even for those that do not perfectly match the rules, their values are smaller, which facilitates the elastic encoder discussed in the following section.

Figure 3: Processing result of logs in Figure 2.

4.3 Elastic Encoder

The simplest way to represent numerical variables is to use a fixed number of bytes (e.g. 4 bytes to represent an integer, 8 bytes to represent a long value, etc). However, if most numbers are small, these bytes will contain many leading zeroes (for positive numbers) or ones (for negative numbers). General-purpose compression may be able to find such consecutive zeroes or ones, but since it needs to search for such zeroes/ones and store additional metadata to record the length of zeroes/ones, we design a dedicated encoding algorithm to trim leading zeroes.

To efficiently exploit such opportunity, we apply an elastic encoding method to store numbers according to their size. We cut the 32-bit integer into 7-bit segments and add one bit to each segment, indicating whether the segment is the last segment (1 means it is the last). Then we discard the prefix of segments containing only zeroes. Here we choose the number 7 because, after adding one bit, each segment takes a byte, which is easy to handle.

For a negative number represented by two's-complement encoding, it is not trivial to just change all ones to zeroes, since the leading ones include the first bit, which indicates that this number is a negative number. To overcome this problem, we adopt a shifting operator [43] to move the first bit to the last position and reverse all other bits if the original number is negative. By adopting this method, we will process negative numbers in the same way as processing positive ones.

By using elastic encoding, we will trim the leading zeros/ones at the cost of adding one-bit metadata for every remaining 7 bits. Therefore, the smaller the number is, the more redundancy we can trim. More precisely, for an integer in $[-2^{7n}, -2^{7(n-1)}) \cup (2^{7(n-1)} - 1, 2^{7n} - 1]$ ($0 < n < 6$), elastic encoding can save $(32 - 8n)$ bits compared with using fixed 32 bits. In our logs, we find this method can save 24 bits (i.e., $n = 1$) for more than 60% of the numbers.

Note that both delta timestamps and applying correlation contribute to the effectiveness of elastic encoding since these techniques tend to make numbers smaller.

5 Architecture and Implementation

Based on the observations and ideas mentioned previously, we have built LogReducer, a parser-based log compressor, with about 3,000 lines of C/C++ code. Figure 4 shows its architecture. LogReducer contains two phases, the training phase and the compression phase.

Figure 4: LogReducer architecture.

Training. The training phase is done over sampled data. It uses a parser to extract templates (§2) and a correlation miner to find possible correlations (§4.2). At the end of this phase, LogReducer may find a list of templates and correlation rules.

Just like any other methods relying on sampling, we expect such samples to capture the properties of real logs as much as possible. Traditional parser-based compression methods like Logzip only need to extract templates during the training phase, and since the template of a log entry only depends on the entry itself, these methods can use random sampling. LogReducer, however, tries to identify data correlation across adjacent log entries, and random sampling will lose such relationships. To address this challenge, we first randomly pick several starting points and then choose a contiguous sequence of logs from each starting point. This method shows good performance on both extracting templates and identifying correlations across adjacent log entries.

Compression. In the compression phase, for each log entry, LogReducer will first extract its header. LogReducer will further extract the timestamps from the headers and compute the delta values of consecutive timestamps (§4.1). Then LogReducer will try to match the log entry to templates using the parser (§3) and apply the found correlations to numerical variables (§4.2). Then LogReducer will encode all numerical data, including timestamps, numerical variables, and template IDs, using the elastic encoder (§4.3). Finally, LogReducer will pack all data using LZMA, since we find it can almost always achieve the highest compression ratio on our logs.

In order to illustrate the whole process of compression, we exhibit a complete compression case. Suppose we have the four input log entries of Log F shown in Table 1 (§2) and the templates and correlations found in the training phase. LogReducer first extracts their headers and matches their bodies to templates. The results are shown in Table 4, in which each log entry is divided into three parts, namely log header, template ID, and corresponding variables. Here the second log entry belongs to the template “Read chunk: <*> Offset: <*>”, whose template ID is 2, and the other three log entries belong to the template “Write chunk: <*> Offset: <*> Length: <*>”, whose template ID is 1. As a result, template 2 has two variables and template 1 has three variables.

Headers                               Template ID  V1         V2      V3
[2019-08-27 15:21:24.456234] [INFO]   1            3242513_B  339911  11
[2019-08-27 15:21:24.463321] [INFO]   2            3242514_C  272633  -
[2019-08-27 15:21:24.464322] [INFO]   1            3242512_F  318374  7
[2019-08-27 15:21:24.474433] [INFO]   1            3242513_B  339922  55

Table 4: Extracting headers and matching templates.

Then LogReducer will compute the difference of adjacent timestamps and utilize correlations over numerical variables. Table 5 shows the result of these steps: all timestamps become much smaller except the first one; since LogReducer identifies the sequential access pattern for 3242513_B, it does not need to store the offset of the second access, and we calculate the delta result for the write length of the same chunk (i.e., log entry 4).
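The numerical steps of this example can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than LogReducer's actual code: the function names are hypothetical, the two correlation rules are hard-coded instead of mined as in §4.2, and the elastic encoder follows the description in §4.3 (7-bit segments, a flag bit of 1 on the last segment, the sign moved to the lowest bit) with an assumed most-significant-segment-first byte order.

```python
from datetime import datetime, timedelta

def to_deltas(stamps):
    """Microsecond differences between consecutive timestamps (§4.1)."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in stamps]
    return [(b - a) // timedelta(microseconds=1) for a, b in zip(times, times[1:])]

def residues(writes):
    """Apply the mixed and intra-variable rules per chunk ID (§4.2).

    writes: list of (chunk_id, offset, length). The first entry of each chunk
    is kept as-is; later ones become (offset residue, length delta).
    """
    last = {}
    out = []
    for chunk, offset, length in writes:
        if chunk in last:
            prev_off, prev_len = last[chunk]
            out.append((chunk, offset - prev_off - prev_len, length - prev_len))
        else:
            out.append((chunk, offset, length))
        last[chunk] = (offset, length)
    return out

def elastic_encode(v):
    """Encode an integer into as few bytes as its magnitude needs (§4.3)."""
    # Move the sign to the lowest bit; invert the other bits for negatives.
    z = (v << 1) if v >= 0 else ((~v) << 1) | 1
    segments = [z & 0x7F]
    z >>= 7
    while z:
        segments.append(z & 0x7F)
        z >>= 7
    segments.reverse()        # most significant 7-bit segment first
    segments[-1] |= 0x80      # flag bit 1 marks the last segment
    return bytes(segments)

def elastic_decode(data):
    """Inverse of elastic_encode for the bytes of a single number."""
    z = 0
    for byte in data:
        z = (z << 7) | (byte & 0x7F)
    return (z >> 1) if z & 1 == 0 else ~(z >> 1)

stamps = [
    "2019-08-27 15:21:24.456234",
    "2019-08-27 15:21:24.463321",
    "2019-08-27 15:21:24.464322",
    "2019-08-27 15:21:24.474433",
]
writes = [("3242513_B", 339911, 11), ("3242512_F", 318374, 7), ("3242513_B", 339922, 55)]

deltas = to_deltas(stamps)        # timestamp deltas in microseconds
reduced = residues(writes)        # last entry becomes ("3242513_B", 0, 44)
encoded = [elastic_encode(d) for d in deltas]
```

With these inputs the second write to chunk 3242513_B reduces to an offset residue of 0 and a length delta of 44, each of which fits in a single byte after elastic encoding.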

Finally, LogReducer encodes all numerical results using the elastic encoder, organizes all variables in a column manner, and packs them with LZMA.

Time, Other Header            Template ID   V1 V2 V3
20190827152124456234[INFO]    1             3242513_B33991111
0000007087[INFO]              2             3242514_C272633-
0000001001[INFO]              1             3242512_F3183747
00000010111[INFO]             1             3242513_B044

Table 5: Computing the delta values of timestamps and applying correlation.

Others. In cloud environments, logs usually contain a large amount of information. To make such information easy for humans to understand, log entries often need to be broken into several lines. Such multi-line log entries do not exist in the PC logs on which Logzip was tested. Logzip simply treats them as mismatches, which reduces its effectiveness. We calculate the rate of multi-line logs in our production logs and find that in Log H and Log R, multi-line logs account for up to 5% of the whole size. To support multi-line log entries, we do not split log entries at newline symbols; instead, we split them at log headers, as discussed in §2.1. Doing so achieves both a higher compression ratio and a higher compression speed.

Besides, to improve the generality of LogReducer beyond AliCloud logs, we implement a head-format adaptor: based on the assumption that the number of tokens in the head is static for the same type of logs, this adaptor tries to treat the first n tokens as the head (it tries n = 1 to 10 in our experiments) to see which value of n achieves the best compression ratio.

USENIX Association 19th USENIX Conference on File and Storage Technologies 255

6 Evaluation

Our evaluation tries to answer three questions:
• What is the overall performance of LogReducer in terms of compression ratio and compression speed on AliCloud logs? (§6.1)
• What is the effect of each individual technique of LogReducer? (§6.2)
• How does LogReducer perform on logs beyond AliCloud logs? (§6.3)

To answer these questions, we measure the performance of LogReducer on 18 types of production logs from AliCloud with a total size of about 1.76 TB (Table 6). We measure both the compression ratio (i.e., Original size / Compressed size) and the compression speed (MB/s). For comparison, we also measure the performance of two general-purpose compression algorithms (gzip and LZMA) and two log-specific compression algorithms (LogArchive [3] and Logzip [30]).

gzip is a classical compression tool. It targets high compression speed instead of a high compression ratio. We use the "tar" [7] command to compress the log dataset with gzip.

LZMA is a well-studied general-purpose compression method based on the LZ77 [40] algorithm. It has a high compression ratio but a relatively low compression speed. We use 7z [50] to compress the log data with LZMA.

LogArchive is a bucket-based log compression method. We use its open-source code [17] to compress our log data.

Logzip is the latest implementation of a parser-based compressor. We use its open-source code [19]. Note that, as discussed in §3, we change the e value of Logzip from 0.5 to 0.1.

Testbed. We perform all experiments using 4 Linux servers, each with 2 Intel Xeon E5-2682 2.50 GHz CPUs (with 16 cores), 188 GB RAM, and Red Hat 4.8.5 with Linux kernel 3.10.0. For each method, we use 4 threads to compress the log data in parallel and sum their total time.

6.1 Overall Performance

Compression ratio. As shown in Figure 5(a), LogReducer has the highest compression ratio on all logs. Its compression ratio is 1.54× to 6.78× that of gzip, 1.19× to 4.80× that of LZMA, 1.11× to 3.60× that of LogArchive, and 1.45× to 4.01× that of Logzip. In our experiments, Logzip failed on Log I; LogArchive failed on Log I and Log J. Both of these logs have much longer log entries than the others, which causes buffer overflow in Logzip and LogArchive. We use LZMA to compress the failed logs, since it is the default setting of Logzip and LogArchive.

LogReducer can compress the whole 1.76 TB log dataset into 34.25 GB, which is only 1.90% of the original space. gzip, LZMA, LogArchive, and Logzip compress the same 1.76 TB into 152.03 GB, 107.22 GB, 91.54 GB, and
89.86 GB respectively. As a result, their space consumption is 4.44×, 3.13×, 2.67×, and 2.62× that of LogReducer respectively.

We further compute the improvement of LogReducer over the best of the other four algorithms on all 18 logs. LogReducer has the highest improvement on Log F and the lowest improvement on Log L, for the following reasons. Log F has several typical correlations that we discussed in §4.2; LogReducer can identify them and trim the redundancy effectively, while other works cannot utilize such correlations. Log L has a low percentage of numerical values (only 24.89%) and timestamps (only 9.32%), which means the new techniques introduced by LogReducer are not very effective.

Compression speed. As shown in Figure 5(b), LogReducer is 4.01× to 182.31× as fast as Logzip and 4.49× to 11.65× as fast as LogArchive. LogReducer is comparable to LZMA in compression speed (0.56× to 3.16×): it is slower than LZMA on 8 out of 18 logs; in some special cases (Log K, Log F, Log O), LogReducer is 2× to 3× as fast as LZMA. LogReducer is slower than gzip, as gzip is optimized for speed. We do not show the speed of gzip in Figure 5(b), since its high value would make the other bars hard to distinguish.

To compress all 1.76 TB of logs, LogReducer takes 58.19 hours; Logzip takes nearly 27 days; LogArchive takes nearly 23 days; LZMA takes 91.54 hours; gzip takes 25.35 hours. In other words, LogReducer is 11.22×, 9.43×, and 1.57× as fast as Logzip, LogArchive, and LZMA respectively; it is about 60% slower than gzip.

6.2 Effects of Individual Techniques

This section measures the effects of the individual techniques presented in §3 and §4.

We use an efficient re-implementation of the parser-based compressor as our baseline (LogReducer-NB), which includes the C/C++ implementation, removing the limit on the number of variables, and cutting the parser tree (§3). We add batch processing to LogReducer-NB to get LogReducer-B (§3), add delta timestamps to LogReducer-B to get LogReducer-D (§4.1), add the elastic encoding approach to LogReducer-D to get LogReducer-ED (§4.3), and finally add numerical correlation utilization (§4.2) to LogReducer-ED to get the full version of LogReducer. The result is shown in Table 7.

Log type            Log A   Log B   Log C   Log D   Log E   Log F    Log G   Log H   Log I
Total Size (GB)     18.67   16.05   45.82   65.74   34.98   443.30   148.88  0.19    14.20
Total Line (10^6)   74.74   72.60   231.43  406.98  77.56   1425.37  579.94  1.08    8.65
Time Span (H)       476758720681563328977

Log type            Log J   Log K   Log L   Log M   Log N   Log O    Log P   Log Q   Log R
Total Size (GB)     18.67   16.05   45.82   65.74   34.98   443.30   148.88  0.19    14.20
Total Line (10^6)   74.74   72.60   231.43  406.98  77.56   1425.37  579.94  1.08    8.65
Time Span (H)       85333523830174151262165722

Table 6: Log dataset description.

(a) Compression ratio (b) Compression speed
Figure 5: Performance on AliCloud logs.

As one can see, the efficient re-implementation (NB version) significantly improves the compression ratio and compression speed over Logzip in almost every type of logs. This has confirmed one of our key observations: an efficient implementation is critical to realize the full potential of parser-based log compression.

The B version (batch processing) is over 1.5× as fast as the NB version on 10 logs. In particular, it is 1.99× and 1.82× as fast as the NB version on Log D and Log C. These two logs have many files, and thus LogReducer can save much time by batch processing. Batch processing has no impact on the compression ratio, as it does not change the logic of the compression algorithm.

The compression ratio of the D version (delta timestamps) is over 1.1× as high as that of the B version on 3 logs and over 1.05× as high on 7 logs. In particular, its compression ratio is 1.24× as high as that of the B version on Log R. Delta timestamps can bring significant improvement to logs that have a large percentage of timestamp values: Log R, Log A, and Log H have the highest, third highest, and fourth highest timestamp percentage among all 18 logs (see Table 3) and thus benefit from delta timestamps. Log B, which has the second highest timestamp percentage, is relatively sparser and does not benefit much from delta timestamps. Delta timestamps improve compression speed as well, by feeding a smaller intermediate result to the general-purpose compressor: the D version is over 1.05× as fast as the B version on 6 logs.

The compression ratio of the ED version (elastic encoding) is over 1.05× as high as that of the D version on 12 logs. It mainly improves the compression ratio on logs with a large percentage of small numbers, such as Log D, Log R, and Log M. The ED version is over 1.5× as fast as the D version on 5 logs and 1.2× as fast on 11 logs, since elastic encoding provides a dedicated and thus more efficient way to trim leading zeroes or ones compared with general-purpose methods.

The compression ratio of the LR version (correlation identification and utilization) is over 1.05× as high as that of the ED version on 4 logs. In particular, it is 2.07× as high as that of the ED version on Log F and 1.13× as high on Log O, because correlations are common in these two logs. This technique incurs overhead: the speed of the LR version is 0.7× to 1.05× that of the ED version.

        Compression Ratio                               Compression Speed (MB/s)
Log     LZMA    Logzip  B&NB    D       ED      LR      LZMA    Logzip  NB      B       D       ED      LR
Log A   19.30   37.34   53.96   61.94   63.79   63.86   7.03    0.22    3.99    6.42    6.31    7.58    7.23
Log B   17.91   17.64   32.66   33.55   35.63   35.67   7.25    0.79    3.80    6.75    7.21    9.52    8.63
Log C   15.48   12.61   30.36   32.30   34.80   35.81   5.06    0.68    3.47    6.31    6.17    9.50    8.22
Log D   12.16   11.57   23.08   24.50   26.56   27.26   4.08    0.66    2.84    5.64    5.29    9.80    7.83
Log E   14.19   7.73    22.99   23.35   24.73   25.22   4.89    0.64    4.33    5.34    5.42    6.93    6.22
Log F   11.58   10.69   16.32   16.47   17.62   36.42   3.60    0.81    3.32    4.33    4.33    8.00    8.44
Log G   16.58   13.42   30.23   31.76   33.00   32.99   7.35    0.34    4.07    5.89    6.52    7.27    7.17
Log H   17.73   27.73   34.85   38.58   40.05   40.08   7.15    0.99    3.71    3.64    3.83    3.96    3.98
Log I   11.95   /       13.88   13.88   14.03   14.26   4.05    /       5.26    3.81    3.57    3.85    3.81
Log J   17.46   9.04    31.16   33.25   34.94   36.22   7.76    0.03    2.72    4.37    4.60    4.82    4.78
Log K   12.14   11.20   23.88   24.51   25.74   26.97   3.39    0.67    4.53    6.82    6.24    10.76   10.72
Log L   12.38   11.62   17.75   17.96   18.43   18.48   6.01    1.17    2.55    4.74    4.80    5.47    4.71
Log M   18.42   14.20   37.56   39.14   43.56   43.99   7.10    0.67    4.90    5.22    6.55    7.21    5.75
Log N   14.11   13.64   22.43   22.63   23.71   25.01   5.28    0.77    3.56    5.68    5.66    7.61    7.38
Log O   8.25    5.23    11.35   11.28   12.05   13.67   2.48    0.64    2.52    3.42    3.44    7.15    5.98
Log P   22.73   10.61   34.90   35.98   36.92   37.58   8.22    0.63    5.75    5.52    7.14    9.32    6.64
Log Q   20.55   31.27   76.72   79.05   83.09   84.25   6.78    0.68    2.41    3.67    3.72    3.76    3.77
Log R   22.82   55.63   80.73   100.44  109.21  109.51  7.67    1.07    4.94    8.23    7.85    10.87   9.95

Table 7: Effects of individual techniques on compression ratio and compression speed. X is short for LogReducer-X (X ∈ {B, NB, D, ED}). LR stands for the full version of LogReducer. "/": Logzip failed on Log I.

6.3 Performance on Public Logs

To examine the generality of LogReducer beyond AliCloud logs, we evaluate LogReducer on 16 types of public logs [18] from diverse sources [25, 49]. As shown in Figure 6, the compression ratio of LogReducer is 1.03× to 3.15× that of Logzip, 1.19× to 5.14× that of LogArchive, 1.23× to 5.30× that of LZMA, and 1.79× to 20.27× that of gzip.

We further investigate the logs on which LogReducer shows less improvement: some of them have too many templates (e.g., Android, Thunderbird), which causes all parser-based methods, including LogReducer, to have many mismatches; some of them have only a few variables and even fewer numerical variables (e.g., Thunderbird, Proxifier), which causes LogReducer's optimizations to be less effective; in addition, Logzip has a specific optimization for the HDFS log, which improves the compression ratio of Logzip.

In terms of compression speed, LogReducer is 2.05× to 101.12× as fast as Logzip and 1.79× to 9.95× as fast as LogArchive. LogReducer is slower than LZMA by up to 5.88× and than gzip by up to 36.16×, for two reasons. First, since over half of the logs are smaller than 100 MB, the initialization overhead of LogReducer (e.g., space allocation) becomes significant, taking over 40% of the time. Second, some cases have too many templates (e.g., Android, Thunderbird), which causes a low matching rate and a waste of time.

Such results have confirmed the assumptions of LogReducer: LogReducer is mainly designed for large-scale logs with a small number of templates and many variables. When such assumptions hold, LogReducer can perform significantly better than existing methods; when such assumptions do not hold, LogReducer is less effective but can still achieve the highest compression ratio.

(a) Compression ratio. Numbers above bars denote compression ratios exceeding 70. (b) Compression speed.
Figure 6: Performance on public dataset.

7 Related Work

Log parser. Log parsers focus on the extraction of log templates and can be divided into three types: cluster-based methods (LKE [14], LogSig [44], SHISO [34], LenMa [42], LogMine [20]), frequent-pattern-based methods (SLCT [46], LFA [35]), and heuristic-structure-based methods (IPLoM [32], AEL [27], Drain [23]). Cluster-based methods divide the logs into clusters and extract templates for each cluster. Frequent-pattern-based methods try to extract frequent patterns from log entries and regard them as constant templates. Heuristic methods extract log structure based on observations of log entries. Zhu et al. [49] compare these methods and find that Drain performs better than the others. As a result, both Logzip and our implementation are based on Drain.

Number encoding methods. LevelDB [16] uses varint (variable-length integer) encoding to represent numbers based on their size. Thrift [43] uses ZigZag encoding to obtain more leading zeros, which enables efficient data serialization when communicating between processes. Compared with them, LogReducer further uses elastic encoding to reduce the space overhead of storing numerical variables.

General-purpose compression approaches. These methods can be categorized into three kinds: statistic-based, prediction-based, and dictionary-based. Statistic-based compression methods (e.g., Huffman coding [40], arithmetic coding [47]) first collect statistical information about the input logs and then design a variable-length code for each token. Prediction-based compression methods (e.g., PPMd [4]) predict the next token based on the current context while reading the input stream, and assign a shorter encoding if the prediction is successful. Dictionary-based compression methods (e.g., LZMA, gzip) search for similar tokens in a sliding window and store them in a dictionary when processing the input stream.

Statistic-based methods need to read the input log file twice. As a result, when the input log file is large, they are not efficient. With prediction-based methods, the appearance of variables decreases the prediction accuracy. Dictionary-based methods may lose the chance to trim redundancy over a long distance, and they do not take the delta of timestamps and the correlation of variables into consideration, since these are not literal redundancy. Our methods utilize general-purpose compression approaches and improve their effectiveness on log data.

Log-specific compression approaches. These methods can be divided into two categories: parser-based and non-parser-based.

CLC [22], LogArchive [3], Cowic [29], and MLC [13] process variables and templates together. CLC tries to find the frequent patterns shown in log files and processes these patterns directly. LogArchive uses a similarity function and sliding windows to divide log entries into different buckets and compresses buckets together to improve the compression ratio. Cowic does not focus on the compression ratio. Instead, it tries to decrease the decompression latency by decompressing only the needed logs rather than the whole files. MLC uses block-level duplication methods to find redundancy concealed between log entries, divides them into groups according to their similarities, and compresses them using delta encoding.

Logzip [30] extracts templates and processes templates and variables separately. It uses a parser to get several templates from a small sample and extracts all templates in the original log files by iterative matching. Finally, it compresses template IDs and variables separately using general-purpose compression methods. However, Logzip does not perform well on our logs due to its sub-optimal implementation.

8 Conclusion

This work examines the latest parser-based log compression approach on production logs. It observes that, first, an efficient implementation is critical to realize the full potential of this approach; and second, there are more opportunities to further compress logs. Based on these ideas, we have built LogReducer, which shows promising compression ratio and compression speed.

Acknowledgment

We thank all reviewers for their insightful comments, and especially our shepherd, Dalit Naor, for her guidance during our camera-ready preparation. This work was supported by the National Key R&D Program of China under Grant 2018YFB0203902, and the National Natural Science Foundation of China under Grants 61672315 and 62025203.
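
Illustrative sketch. The number-encoding ideas mentioned in the evaluation and related work (delta timestamps, ZigZag, and variable-length integers) can be illustrated with a short Python example. This is not LogReducer's elastic encoder, whose details are not given in this part of the paper; it is a stand-in that combines delta encoding with the widely used ZigZag and base-128 varint schemes. All function names and the sample millisecond timestamps are hypothetical.

```python
from typing import Iterable, List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value; replace each later value by its difference
    from the previous one."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas: Iterable[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def zigzag_encode(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small:
    0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ..."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def varint_encode(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on every
    byte except the last."""
    if n < 0:
        raise ValueError("varint_encode expects a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def varint_decode_all(data: bytes) -> List[int]:
    """Decode a concatenation of varints back into integers."""
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values


def pack_timestamps(timestamps: List[int]) -> bytes:
    """Delta-encode, ZigZag-map, then varint-pack a timestamp column."""
    return b"".join(
        varint_encode(zigzag_encode(d)) for d in delta_encode(timestamps)
    )


def unpack_timestamps(blob: bytes) -> List[int]:
    """Invert pack_timestamps."""
    return delta_decode(zigzag_decode(z) for z in varint_decode_all(blob))


if __name__ == "__main__":
    # Hypothetical millisecond timestamps, a few milliseconds apart.
    ts = [1566919284456, 1566919284463, 1566919284464, 1566919284475]
    blob = pack_timestamps(ts)
    assert unpack_timestamps(blob) == ts
    print(len(blob), "bytes packed versus", 8 * len(ts), "bytes as 64-bit integers")
```

Because consecutive log timestamps are usually close together, every delta after the first fits in a single byte here, which is the same intuition behind feeding a smaller intermediate result to the general-purpose compressor.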
References

[1] Boyuan Chen and Zhen Ming Jiang. Characterizing and detecting anti-patterns in the logging code. In Proceedings of the 39th International Conference on Software Engineering, pages 71–81. IEEE, 2017.
[2] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F Wenisch. The mystery machine: End-to-end performance analysis of large-scale Internet services. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, pages 217–231. USENIX Association, 2014.
[3] Robert Christensen and Feifei Li. Adaptive log compression for massive log data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1283–1284. ACM, 2013.
[4] John Cleary and Ian Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, 1984.
[5] Python date engineer group. Python data analysis library Pandas. https://pandas.pydata.org/, 2015.
[6] Peter Deutsch. DEFLATE compressed data format specification version 1.3. https://tools.ietf.org/html/rfc1951, 1996.
[7] GNU developer group. Homepage and documentation of Tar. https://www.gnu.org/software/tar/, 2019.
[8] Min Du and Feifei Li. Spell: Streaming parsing of system event logs. In Proceedings of the 16th International Conference on Data Mining, pages 859–864. IEEE, 2016.
[9] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1285–1298. ACM, 2017.
[10] Susan Dumais, Robin Jeffries, Daniel M Russell, Diane Tang, and Jaime Teevan. Understanding user behavior through log data and analysis. In Ways of Knowing in HCI, pages 349–372. Springer, 2014.
[11] Yaochung Fan, Yuchi Chen, Kuanchieh Tung, Kuochen Wu, and Arbee LP Chen. A framework for enabling user preference profiling through Wi-Fi logs. IEEE Transactions on Knowledge and Data Engineering, 28(3):592–603, 2016.
[12] Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, and Luigi Pontieri. Online and offline classification of traces of event logs on the basis of security risks. Journal of Intelligent Information Systems, 50(1):195–230, 2018.
[13] Bo Feng, Chentao Wu, and Jie Li. MLC: an efficient multi-level log compression method for cloud backup systems. In Proceedings of 2016 IEEE Trustcom/BigDataSE/ISPA, pages 1358–1365. IEEE, 2016.
[14] Qiang Fu, Jian-Guang Lou, Yi Wang, and Jiang Li. Execution anomaly detection in distributed systems through unstructured log analysis. In Proceedings of the 9th IEEE International Conference on Data Mining, pages 149–158. IEEE, 2009.
[15] Mona Ghassemian, Philipp Hofmann, Christian Prehofer, Vasilis Friderikos, and Hamid Aghvami. Performance analysis of Internet gateway discovery protocols in ad hoc networks. In Proceedings of 2004 IEEE Wireless Communications and Networking Conference, volume 1, pages 120–125. IEEE, 2004.
[16] Sanjay Ghemawat and Jeff Dean. LevelDB. https://github.com/google/leveldb, 2011.
[17] LogArchive group. Open source code of LogArchive. https://github.com/robertchristensen/log_archive_v0, 2019.
[18] Loghub group. Download link of public log dataset. https://zenodo.org/record/1596245#.XMMZ1dv7S-Y, 2019.
[19] Logzip group. Open source code of Logzip. https://github.com/logpai/logzip, 2019.
[20] Hossein Hamooni, Biplob Debnath, Jianwu Xu, Hui Zhang, Guofei Jiang, and Abdullah Mueen. LogMine: Fast pattern recognition for log analytics. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 1573–1582. ACM, 2016.
[21] Mehran Hassani, Weiyi Shang, Emad Shihab, and Nikolaos Tsantalis. Studying and detecting log-related issues. Empirical Software Engineering, 23(6):3248–3280, 2018.
[22] Kimmo Hätönen, Jean François Boulicaut, Mika Klemettinen, Markus Miettinen, and Cyrille Masson. Comprehensive log compression with frequent patterns. In Proceedings of 2003 International Conference on Data Warehousing and Knowledge Discovery, pages 360–370. Springer, 2003.
[23] Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. Drain: An online log parsing approach with fixed depth tree. In Proceedings of 2017 IEEE International Conference on Web Services, pages 33–40. IEEE, 2017.
[24] Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. Experience report: System log analysis for anomaly detection. In Proceedings of the 27th International Symposium on Software Reliability Engineering, pages 207–218. IEEE, 2016.
[25] Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. Loghub: A large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448, 2020.
[26] Edwin T Jaynes. Probability theory: The logic of science. Cambridge University Press, 2003.
[27] Zhen Ming Jiang, Ahmed E Hassan, Parminder Flora, and Gilbert Hamann. Abstracting execution logs to execution events for enterprise applications (short paper). In Proceedings of the 8th International Conference on Quality Software, pages 181–186. IEEE, 2008.
[28] George Lee, Jimmy Lin, Chuang Liu, Andrew Lorek, and Dmitriy Ryaboy. The unified logging infrastructure for data analytics at Twitter. Proceedings of the VLDB Endowment, 5(12):1771–1780, 2012.
[29] Hao Lin, Jingyu Zhou, Bin Yao, Minyi Guo, and Jie Li. Cowic: A column-wise independent compression for log stream analysis. In Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 21–30. IEEE, 2015.
[30] Jinyang Liu, Jieming Zhu, Shilin He, Pinjia He, Zibin Zheng, and Michael R Lyu. Logzip: extracting hidden structures via iterative clustering for log compression. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering, pages 863–873. IEEE, 2019.
[31] Adetokunbo Makanju, A Nur Zincir-Heywood, and Evangelos E Milios. A lightweight algorithm for message type extraction in system application logs. IEEE Transactions on Knowledge and Data Engineering, 24(11):1921–1936, 2011.
[32] Adetokunbo AO Makanju, A Nur Zincir-Heywood, and Evangelos E Milios. Clustering event logs using iterative partitioning. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1255–1264, 2009.
[33] Salma Messaoudi, Annibale Panichella, Domenico Bianculli, Lionel Briand, and Raimondas Sasnauskas. A search-based approach for accurate identification of log message formats. In Proceedings of the 26th Conference on Program Comprehension, pages 167–177. ACM, 2018.
[34] Masayoshi Mizutani. Incremental mining of system log format. In Proceedings of 2013 IEEE International Conference on Services Computing, pages 595–602. IEEE, 2013.
[35] Meiyappan Nagappan and Mladen A Vouk. Abstracting log lines to log event types for mining software system logs. In Proceedings of the 7th Working Conference on Mining Software Repositories, pages 114–117. IEEE, 2010.
[36] Karthik Nagaraj, Charles Killian, and Jennifer Neville. Structured comparative analysis of systems logs to diagnose performance problems. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 26–26. USENIX Association, 2012.
[37] Alina Oprea, Zhou Li, Ting-Fang Yen, Sang H Chin, and Sumayah Alrwais. Detection of early-stage enterprise infection by mining large-scale log data. In Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 45–56. IEEE, 2015.
[38] LogReducer research group. Open sample of large-scale cloud logs. https://github.com/THUBear-wjy/openSample, 2020.
[39] LogReducer research group. Open source code of LogReducer. https://github.com/THUBear-wjy/LogReducer, 2020.
[40] Khalid Sayood. Introduction to data compression. Morgan Kaufmann, 2017.
[41] Julian Seward. The bzip2 homepage. http://www.bzip.org, 1997.
[42] Keiichi Shima. Length matters: Clustering system log messages using length of words. arXiv preprint arXiv:1611.03213, 2016.
[43] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. Thrift: Scalable cross-language services implementation. Facebook White Paper, 5(8), 2007.
[44] Liang Tang, Tao Li, and Chang-Shing Perng. LogSig: Generating system events from raw textual logs. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 785–794. ACM, 2011.
[45] Sarah K Tyler and Jaime Teevan. Large scale query log analysis of re-finding. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, pages 191–200, 2010.
[46] Risto Vaarandi. A data clustering algorithm for mining patterns from event logs. In Proceedings of the 3rd IEEE Workshop on IP Operations & Management, pages 119–126. IEEE, 2003.
[47] Ian H Witten, Radford M Neal, and John G Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987.
[48] Ding Yuan, Haohui Mai, Weiwei Xiong, Lin Tan, Yuanyuan Zhou, and Shankar Pasupathy. SherLog: error diagnosis by connecting clues from run-time logs. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 143–154. ACM, 2010.
[49] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, and Michael R Lyu. Tools and benchmarks for automated log parsing. In Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, pages 121–130. IEEE, 2019.
[50] 7zip developer group. 7-zip file archiver homepage. https://www.7-zip.org/, 2019.

On the Feasibility of Parser-based Log Compression in Large-Scale Cloud Systems
Junyu Wei and Guangyan Zhang, Tsinghua University; Yang Wang, The Ohio State University; Zhiwei Liu, China University of Geosciences; Zhanyang Zhu and Junchao Chen, Tsinghua University; Tingtao Sun and Qi Zhou, Alibaba Cloud
https://www.usenix.org/conference/fast21/presentation/wei
This paper is included in the Proceedings of the 19th USENIX Conference on File and Storage Technologies. February 23–25, 2021. 978-1-939133-20-5. Open access to the Proceedings of the 19th USENIX Conference on File and Storage Technologies is sponsored by USEN
