Contents

1 Getting started
    1.1 Counting all k-mers
        1.1.1 Counting k-mers in sequencing reads
        1.1.2 Counting k-mers in a genome
    1.2 Counting high-frequency k-mers
        1.2.1 One pass method
        1.2.2 Two pass method
2 FAQ
    2.1 How to read compressed files (or other format)?
    2.2 How to read multiple files at once?
    2.3 How to reduce the output size?
3 Subcommands
    3.1 histo
    3.2 dump
    3.3 query
    3.4 info
    3.5 merge
    3.6 cite

1.1.1 Counting k-mers in sequencing reads

In sequencing reads, it is unknown which strand of the DNA was sequenced. As a consequence, a k-mer and its reverse complement are essentially equivalent. The canonical representative of a k-mer m is by definition m or the reverse complement of m, whichever comes first lexicographically. The -C switch instructs jellyfish to save only canonical k-mers in the hash; the count is then the number of occurrences of both a k-mer and its reverse complement.

The size parameter (given with -s) is an indication of the number of k-mers that will be stored in the hash. For sequencing reads, this size should be the size of the genome plus the number of k-mers generated by sequencing errors. For example, if the error rate is e (e.g. Illumina reads, usually e is about 1%), with an estimated genome size of G and a coverage of c, the number of expected k-mers is G + G·c·e·k. This assume

NOTE: unlike in Jellyfish 1, this -s parameter is only an estimation. If the size given is too small to fit all the k-mers, the hash size will be increased automatically, or partial results will be written to disk and finally merged automatically. Running 'jellyfish merge' should never be necessary, as jellyfish now takes care of this task on its own.

If the low-frequency k-mers (k-mers occurring only once), which are mostly due to sequencing errors, are not of interest, one might consider counting only high-frequency k-mers (see section 1.2), which uses less memory and is potentially faster.

1.1.2 Counting k-mers in a genome

In an actual genome or finished sequence, a k-mer and its reverse complement are not equivalent, hence using the -C switch does not make sense. In addition, the size for the hash can be set directly to the size of the genome.

1.2 Counting high-frequency k-mers

Jellyfish offers two ways to count only high-frequency k-mers (meaning only k-mers with count > 1), which significantly reduces the memory usage. Both methods are based on Bloom filters. The first method is a one pass method, which provides an approximate count for some percentage of the k-mers. The second method is a two pass method, which provides an exact count. In both methods, most of the low-frequency k-mers are not reported.

1.2.1 One pass method

Adding the --bf-size switch makes jellyfish first insert all k-mers into a Bloom filter and only insert into the hash the k-mers which have already been seen at least once. The argument to --bf-size should be the total number of k-mers expected in the data set, while the --size argument should be the number of k-mers occurring more than once. For example:

    jellyfish count -m 25 -s 3G --bf-size 100G -t 16 homo_sapiens.fa

would be appropriate for counting 25-mers in human reads at 30x coverage. The approximate memory usage is 9 bits per k-mer in the Bloom filter.
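The definitions in sections 1.1.1 and 1.2.1 can be made concrete with a short Python sketch. It is not part of Jellyfish and does not show how Jellyfish implements anything: the helper names (canonical, expected_kmers, bloom_filter_bytes) are hypothetical, the size estimate simply evaluates G + G·c·e·k, and the memory figure uses the approximate 9 bits per k-mer quoted above. The sketch assumes upper-case k-mers over the alphabet A, C, G, T.

```python
# Illustrative sketch only; none of these helpers are part of Jellyfish.

# Translation table mapping each base to its complement (assumes ACGT only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an upper-case DNA k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the k-mer or its reverse complement, whichever sorts first."""
    return min(kmer, reverse_complement(kmer))


def expected_kmers(genome_size: float, coverage: float,
                   error_rate: float, k: int) -> float:
    """Evaluate G + G*c*e*k, the size estimate for sequencing reads."""
    return genome_size + genome_size * coverage * error_rate * k


def bloom_filter_bytes(total_kmers: float, bits_per_kmer: float = 9.0) -> float:
    """Approximate Bloom filter memory in bytes (about 9 bits per k-mer)."""
    return total_kmers * bits_per_kmer / 8
```

For instance, canonical("TTGCA") returns "TGCAA", because the reverse complement sorts before the k-mer itself. If the 100G passed to --bf-size in the example above is read as 100e9 k-mers (an assumption about the suffix), bloom_filter_bytes(100e9) gives 112.5e9 bytes.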
Chapter 2 FAQ

2.1 How to read compressed files (or other format)?

Jellyfish only reads FASTA or FASTQ formatted input files. By reading from pipes, jellyfish can read compressed files, like this:

    zcat *.fastq.gz | jellyfish count /dev/fd/0 ...

or by using the '<()' redirection provided by the shell (e.g. bash, zsh):

    jellyfish count <(zcat file1.fastq.gz) <(zcat file2.fasta.gz) ...

2.2 How to read multiple files at once?

Often, jellyfish can parse an input sequence file faster than gzip or fastq-dump (to parse SRA files) can output the sequence. This leads to many threads in jellyfish going partially unused. Jellyfish can be instructed to open multiple files at once. For example, to read two short read archive files simultaneously:

    jellyfish count -F 2 <(fastq-dump -Z file1.sra) <(fastq-dump -Z file2.sra) ...

Another way is to use "generators". First, create a file containing, one per line, commands that generate sequence. Then pass this file to jellyfish, along with the number of generators to run simultaneously. Jellyfish will spawn subprocesses running these commands and read their standard output for sequence. By default, the commands are run using the shell named in the SHELL environment variable; this can be changed with the -S switch. The number of generators run simultaneously is specified by the -G switch. For example:

    ls *.fasta.gz | xargs -n 1 echo gunzip -c > generators
    jellyfish count -g generators -G 4 ...

The first command writes the command list into the 'generators' file, each command unzipping one FASTA file in the current directory. The second command runs jellyfish with 4 concurrent generators.
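The generators file is plain text with one shell command per line, so it can also be produced by a script. The following Python sketch mirrors the 'ls ... | xargs' command above. It is only an illustration: the function names are hypothetical, and unlike the shell one-liner it quotes file names with shlex.quote so that names containing spaces remain a single argument.

```python
# Illustrative sketch only; these helpers are not part of Jellyfish.
import shlex
from pathlib import Path


def generator_commands(paths):
    """Build one 'gunzip -c <file>' command per input file, in sorted order."""
    return [f"gunzip -c {shlex.quote(str(p))}" for p in sorted(paths)]


def write_generators(directory, out_name="generators"):
    """Write a command for every *.fasta.gz file in `directory`, one per line."""
    directory = Path(directory)
    commands = generator_commands(directory.glob("*.fasta.gz"))
    out_path = directory / out_name
    out_path.write_text("".join(cmd + "\n" for cmd in commands))
    return out_path
```

Calling write_generators(".") in a directory of FASTA files writes a 'generators' file that can then be passed to jellyfish with the -g switch, as in the second command above.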
Chapter 3 Subcommands

3.1 histo

The histo subcommand outputs the histogram of k-mer frequencies. The last bin, with value one above the high setting set by the -h switch (10000 by default), is a catch-all: all k-mers with a count greater than the high setting are tallied in that one bin. If the low setting is set (-l switch), then the first bin, with value one below the low setting, is similarly a catch-all.

By default, the bins with a zero count are skipped. This can be changed with the -f switch.

3.2 dump

The dump subcommand outputs a list of all the k-mers in the file, together with their counts. By default, the output is in FASTA format, where the header line contains the count of the k-mer and the sequence part is the sequence of the k-mer. This format has the advantage that the output contains the sequence of the k-mers and can be fed directly into another program expecting the very common FASTA format. A more convenient column format (for human beings) is selected with the -c switch.

Low-frequency and high-frequency k-mers can be skipped with the -L and -U switches, respectively.

In the output of the dump subcommand, the k-mers are sorted according to the hash function used by Jellyfish. The output can be considered to be "fairly pseudo-random". By "fairly" we mean that NO guarantee is made about the actual randomness of this order; it is just good enough for the hash table to work properly. And by "pseudo-random" we mean that the order is actually deterministic: given the same hash function, the output will always be the same, and two different files generated with the same hash function can be merged easily.

3.3 query

The query subcommand outputs the k-mers and their counts for some subset of k-mers. It will output the counts of all the k-mers passed on the command line or of all the k-mers in the