Linear-Time Encoding/Decoding of Irreducible Words for Codes Correcting Tandem Duplications

Uploaded On 2024-02-16




Presentation Transcript

1. Linear-Time Encoding/Decoding of Irreducible Words for Codes Correcting Tandem Duplications
Tuan Thanh Nguyen, Nanyang Technological University (NTU), Singapore
Joint work with: Yeow Meng Chee, Han Mao Kiah, Johan Chrisnata

2. Our motivation
Applications that store data in living organisms.
Shipman et al. (2017): CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria.

3. Our motivation
Errors due to biological mutations: deletion, duplication, inversion, translocation, insertion, substitution.
Example of duplication: ACGATGCGAT → ACGATGATGCGATGAT
Duplication:
plays an important role in determining an individual's inherited traits;
is believed to be the cause of several disorders;
is one of the two common repeats found in the human genome (more than 50%).

4. Problem Classification
Tandem duplications are classified by the duplication length and by the number of errors.
1. Fixed-length duplications, e.g. ACGAGCAT → ACGACGAGCAGCAT
1.1 Bounded
1.2 Unbounded
2. Variable-length duplications, e.g. ACGAGCAT → AACGCGAGCAGCATT
2.1 Bounded
2.2 Unbounded
We focus on the worst-case scenario!
Given […]

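5. Notation
Given an alphabet […] and an integer […], the slide defines:
the […]-descendants of […];
the […]-descendant cone of […];
[…]-irreducible words.
[Figure: example descendants of the word 012, such as 0012; the remaining strings are not recoverable from the transcript.]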
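The notions named on this slide can be made concrete with a small sketch. It assumes the usual definitions from the tandem-duplication literature (Jain et al. 2017), since the slide's own formulas did not survive: a tandem duplication of length l copies a segment of length l and inserts the copy right after it, and a word is k-irreducible if it contains no square uu with 1 ≤ |u| ≤ k. The function names and the parameter k are illustrative, not taken from the talk.

```python
def tandem_duplicate(word, start, length):
    """Insert a copy of word[start:start+length] right after that segment."""
    if length < 1 or start < 0 or start + length > len(word):
        raise ValueError("segment out of range")
    end = start + length
    return word[:end] + word[start:end] + word[end:]


def one_step_descendants(word, k):
    """All words reachable by exactly one tandem duplication of length <= k."""
    return {
        tandem_duplicate(word, i, l)
        for l in range(1, k + 1)
        for i in range(len(word) - l + 1)
    }


def is_irreducible(word, k):
    """True if word contains no square uu with 1 <= len(u) <= k (assumed definition)."""
    for l in range(1, k + 1):
        for i in range(len(word) - 2 * l + 1):
            if word[i:i + l] == word[i + l:i + 2 * l]:
                return False
    return True


if __name__ == "__main__":
    print(sorted(one_step_descendants("012", 3)))
    print(is_irreducible("012", 3), is_irreducible("0012", 3))
```

The descendant cone of a word is then the set obtained by applying `one_step_descendants` repeatedly; it is infinite, so any enumeration has to be cut off at some length.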

6. Problem Formulation
Goal: Given […], construct a code […] such that "for all […]" […].
Previous works:
Optimal codes are found when […] (Jain et al. 2017).
A method to construct codes when […] is provided (Jain et al. 2017). Main idea: using "irreducible words".
There is no known result when […].

7. Previous Work
For […], different irreducible words generate different descendants!
The code […] is optimal.
[Figure: descendant cones labelled a, b, c, d.]

8. Previous Work
For […], we can choose more than one codeword in each cone!
Irreducible words form an "almost optimal" code!
[Figure: descendant cones labelled a, b, c, d, with additional codewords marked A, C, D.]

9. Our Main Results
Publication: IEEE International Symposium on Information Theory (ISIT) 2018.
Detailed analysis of the codes constructed from irreducible words when […]; such codes are denoted by […].
Provide an explicit formula to compute the size and asymptotic rate.
Provide an upper bound for the optimal code, and hence conclude that […] is almost optimal.
Linear-time encoder for […].
The extension of this encoder provides the first known encoder for previously constructed codes.

10. Encoder of […] for […] (sketched idea)
Pipeline: input x → encoder → irreducible word y → duplication channel → y' → decoder (error-decoder, then Irr-decoder) → output x.
For […], we define the neighbours of […].
To achieve encoding rates at least […] of the optimal rate […], we only require […].
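One way to picture the error-decoder stage of this pipeline is as a reduction of the received word y' back to an irreducible word by deleting duplicated copies. The sketch below is a naive version of that idea, not the linear-time method of the talk; it assumes the error-decoder recovers the root of y', and the names `remove_one_square`, `root`, and the parameter k are hypothetical.

```python
def remove_one_square(word, k):
    """Remove one copy of the first square uu with len(u) <= k; None if there is none."""
    for l in range(1, k + 1):
        for i in range(len(word) - 2 * l + 1):
            if word[i:i + l] == word[i + l:i + 2 * l]:
                return word[:i + l] + word[i + 2 * l:]
    return None


def root(word, k):
    """Deduplicate repeatedly until no square of length <= k remains."""
    while True:
        shorter = remove_one_square(word, k)
        if shorter is None:
            return word
        word = shorter


if __name__ == "__main__":
    # Strings taken from the duplication examples on slides 3 and 4.
    print(root("ACGATGATGCGATGAT", 3))
    print(root("ACGACGAGCAGCAT", 3))
    print(root("AACGCGAGCAGCATT", 3))
```

For k ≤ 3 the result does not depend on the order in which squares are removed (uniqueness of roots, Jain et al. 2017); for larger k the output should be read as one possible root only.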

11. Example
[Figure: worked example on ternary strings; the individual strings are not recoverable from the transcript.]

12. Special attention on […]
Recent work: The GC-content of a DNA string is the number of nucleotides that are G or C. DNA strings whose GC-content is too high or too low are more prone to both synthesis and sequencing errors. Many recent works use DNA strings whose GC-content is close to 50% or exactly 50%. This is referred to as the "GC-balanced constraint".
Our updated encoder: irreducible + GC-balanced.
Example strings on the slide: AAAA, ATACTA, ATGCTACG.
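The GC-balanced constraint is easy to check directly. The following minimal sketch assumes "balanced" means exactly 50% GC-content; the function names are illustrative and the three test strings are the ones shown on the slide.

```python
def gc_content(dna):
    """Number of nucleotides in dna that are G or C."""
    return sum(1 for base in dna if base in ("G", "C"))


def is_gc_balanced(dna):
    """True if exactly half of the nucleotides are G or C (assumed definition)."""
    return 2 * gc_content(dna) == len(dna)


if __name__ == "__main__":
    for s in ("AAAA", "ATACTA", "ATGCTACG"):
        print(s, gc_content(s), is_gc_balanced(s))
```

Of the three strings, only ATGCTACG has GC-content of exactly one half.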

13. Modified Knuth Method: Irreducible + GC-balanced
Knuth Balancing Method
Input: 0 1 0 0 0 1 0 0
Flip: 1 0 1 1 0 1 0 0
Output codeword: 1 0 1 1 0 1 0 0
Redundancy: […] (to encode the index t + a look-up table)
Modified Knuth Method
Input: A T C A T G A T
Flip: G C T G T G A T
Output codeword: G C T G T G A T
Redundancy: […] (linear-time encoding + no need for a look-up table)
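The binary example on this slide follows Knuth's classical balancing idea: flip a prefix of length t, where t is chosen so that the result has equally many 0s and 1s, and send t as redundancy. A minimal sketch of that classical step follows; it is not the modified method of the talk. The DNA helper uses the swap A↔G, T↔C, which is inferred from the slide's example (the G→A direction is an assumption), and all names are hypothetical.

```python
def knuth_balance(bits):
    """Return (t, codeword): flipping the first t bits makes the word balanced."""
    n = len(bits)
    if n % 2 != 0:
        raise ValueError("length must be even")
    ones = sum(bits)  # weight of the word with the first t bits flipped, t = 0
    for t in range(n + 1):
        if 2 * ones == n:
            return t, [1 - b for b in bits[:t]] + list(bits[t:])
        if t < n:
            # Flipping bit t changes the weight by +1 (if it was 0) or -1 (if it was 1).
            ones += 1 - 2 * bits[t]
    raise AssertionError("unreachable for even-length binary input")


def knuth_restore(t, codeword):
    """Undo the prefix flip, given the index t."""
    return [1 - b for b in codeword[:t]] + list(codeword[t:])


# Swap inferred from the slide's DNA example; it toggles whether a base counts as G/C.
GC_SWAP = {"A": "G", "G": "A", "T": "C", "C": "T"}


def flip_gc_prefix(dna, t):
    """Apply the swap to the first t nucleotides only."""
    return "".join(GC_SWAP[b] for b in dna[:t]) + dna[t:]


if __name__ == "__main__":
    print(knuth_balance([0, 1, 0, 0, 0, 1, 0, 0]))
    print(flip_gc_prefix("ATCATGAT", 4))
```

The weight changes by exactly one per step and ends at n minus the starting weight, so it must pass through n/2; that is why a balancing index always exists for even n. The sketch returns the smallest such index, while the slide's DNA example uses t = 4.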

14. Recent Work: Design codes when […]
The size of our code is at least […].

[…]    Upper bound    Lower bound
5      0.8280         0.4936
10     0.9542         0.8701
15     0.9745         0.9311
20     0.9828         0.9547

In terms of rate: […]

15. Summary
Goal: "Given […], construct the largest code where each codeword is of length […] over a […]-ary alphabet that can correct unbounded tandem duplications of length at most […]."
Previous works:
Optimal codes when […] (Jain et al. 2017).
A method to construct codes when […].
Our work:
Provide an upper bound and a lower bound for codes when […].
Linear-time encoder for known TD codes ([…]) (IEEE ISIT 2018).
Linear-time encoder for TD GC-balanced codes.
A method to construct codes when […].
Further work:
Find optimal codes when […].
Reduce the redundancy of the encoder for TD GC-balanced codes.
Design better codes when […].
