Digital information preservation


Uploaded by phoebe-click on 2016-12-02

Presentation Transcript

Slide1

Digital information preservation in DNA

Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes (Angewandte 2015)

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA (Nature 2013)

Next-Generation Digital Information Storage in DNA (Science 2012)

Mikk Puustusmaa 2015

Slide2

Introduction

Information, such as text printed on paper or images projected onto microfilm, can survive for over 500 years. However, storing digital information for time frames exceeding 50 years is challenging.

As digital information continues to accumulate, higher-density and longer-term storage solutions are necessary.

DNA has many potential advantages as a medium for immutable, high-latency information storage.

At the theoretical maximum, DNA can encode two bits per nucleotide (nt), or 455 exabytes (455 x 10^12 MB) per gram of single-stranded DNA.

Slide3

Advantages

Unlike most digital storage media, DNA storage is not restricted to a planar layer and is often readable despite degradation in non-ideal conditions over millennia.

Most recently, 300,000-year-old mitochondrial DNA from bears and humans has been sequenced.

DNA’s essential biological role provides access to natural reading and writing enzymes and ensures that DNA will remain a readable standard for the foreseeable future.

Slide4

Advantages

The number of bases of synthesized DNA needed to encode information grows linearly with the amount of information to be stored, but we must also consider the indexing information required to reconstruct full-length files from short fragments.

As indexing information grows only as the logarithm of the number of fragments to be indexed, the total amount of synthesized DNA required grows sub-linearly.

Slide5

Problems

Since synthesizing and sequencing very long DNA strands is technically difficult, data must be stored on many short DNA segments.

Approaches using living vectors are not as reliable, scalable or cost-efficient. Their disadvantages include constraints on the genomic elements and locations that can be manipulated without affecting viability, the fact that mutation will reduce the fidelity of stored and decoded information over time, and possibly the requirement for carefully regulated storage conditions.

Slide6

Science 2012

They converted an HTML-coded draft of a book that included 53,426 words, 11 JPG images, and one JavaScript program into a 5.27-megabit bitstream (674.56 kB).
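The bitstream size can be cross-checked against the block layout given on the Synthesis slide (54,898 data blocks of 96 bits each). The following is a minimal sketch of that arithmetic. It assumes one data block per oligonucleotide and reads the kB figure as 5.27 x 2^20 bits expressed in units of 1024 bytes, which is an interpretation rather than something the slides state. All variable names are illustrative.

```python
# Figures taken from the slides; variable names are illustrative.
n_blocks = 54_898        # oligonucleotides, assumed one data block each
block_bits = 96          # data bits per block

total_bits = n_blocks * block_bits          # payload bits in the whole library
megabits = total_bits / 1e6                 # decimal megabits
kib = 5.27 * 2**20 / 8 / 1024               # 5.27 binary megabits in units of 1024 bytes
min_address_bits = n_blocks.bit_length()    # bits needed to number every block

print(total_bits, round(megabits, 2), round(kib, 2), min_address_bits)
```

Under these assumptions the payload comes to 5,270,208 bits (5.27 Mbit), and 16 bits would be enough to number every block, so a 19-bit address leaves headroom.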

They converted individual bits to A or C for 0 and T or G for 1. Bases were chosen randomly while disallowing homopolymer runs greater than three. Addresses of the bitstream were 19 bits long and numbered consecutively, starting from 0000000000000000001.

Slide7

Science 2012

They encode one bit per base (A or C for zero, G or T for one) instead of two. This allows a message to be encoded in many ways, so that sequences that are difficult to read or write, such as those with extreme GC content, repeats, or secondary structure, can be avoided.
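The one-bit-per-base scheme can be sketched as follows. This is an illustration only, not the authors' code: the function names, the use of Python's `random` module, and the rule for breaking runs (exclude a base once it has appeared three times in a row) are assumptions. The sketch does not try to control GC content or secondary structure.

```python
import random

ZERO_BASES = "AC"  # either base encodes bit 0
ONE_BASES = "GT"   # either base encodes bit 1
MAX_RUN = 3        # longest homopolymer run allowed


def encode_bits(bits, rng=None):
    """Encode a string of '0'/'1' characters as DNA, one bit per base."""
    rng = rng or random.Random()
    out = []
    for bit in bits:
        options = list(ZERO_BASES if bit == "0" else ONE_BASES)
        # If the last MAX_RUN bases are identical, that base may not be used again.
        if len(out) >= MAX_RUN and len(set(out[-MAX_RUN:])) == 1:
            options = [b for b in options if b != out[-1]]
        out.append(rng.choice(options))
    return "".join(out)


def decode_bases(seq):
    """Invert encode_bits: A or C -> '0', G or T -> '1'."""
    return "".join("0" if base in ZERO_BASES else "1" for base in seq)


def longest_run(seq):
    """Length of the longest homopolymer run in seq."""
    best = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best
```

Because each bit has two possible bases, there is always an alternative available when a run has to be broken, and decoding never needs to know which alternative was picked.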

By splitting the bit stream into addressed data blocks, they eliminate the need for long DNA constructs that are difficult to assemble at this scale.

Slide8

Synthesis

They synthesized 54,898 oligonucleotides on Agilent’s Oligo Library Synthesis microarray platform.

To avoid cloning and sequence-verifying constructs, they synthesized, stored, and sequenced many copies of each individual oligo.

Each of the 54,898 oligonucleotides was 159 nt long: a 96-bit data block (96 nt), a 19-bit address specifying the location of the data block in the bit stream (19 nt), and flanking 22-nt common sequences for amplification and sequencing.

Slide9

Sequencing

They sequenced the amplified library by loading it on a single lane of a HiSeq 2000 using paired-end 100-nt reads. From the lane they obtained 346,151,426 paired reads, with 87.14% >= Q30 and a mean Q score of 34.16. Since they were sequencing a 115-bp construct with paired 100-bp reads, they used SeqPrep (9) to combine overlapping reads into a single contig.

They joined overlapping paired-end 100-nt reads to reduce the effect of sequencing error.
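Joining a read pair can be illustrated with a simplified sketch. It is not SeqPrep's algorithm: it assumes the construct length is known in advance (115 nt here), ignores base qualities, and only counts disagreements in the overlap instead of resolving them. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, construct_len=115):
    """Merge a forward read and a reverse read of a fragment of known length.

    Returns the merged sequence and the number of overlap positions
    at which the two reads disagree.
    """
    read2_fwd = reverse_complement(read2)  # put read 2 on the forward strand
    overlap = len(read1) + len(read2_fwd) - construct_len
    if not 0 < overlap <= min(len(read1), len(read2_fwd)):
        raise ValueError("reads do not overlap as expected")
    mismatches = sum(
        a != b for a, b in zip(read1[-overlap:], read2_fwd[:overlap])
    )
    # Keep read 1 in full and append the part only read 2 covers.
    merged = read1 + read2_fwd[overlap:]
    return merged, mismatches
```

With two 100-nt reads on a 115-nt construct the overlap is 85 nt, so only the first and last 15 positions are covered by a single read.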

Because errors in synthesis and sequencing are rarely coincident, each molecular copy corrects errors in the other copies.

Slide10

Results

Then, using only reads that gave the expected 115-nt length and perfect barcode sequences, they generated a consensus at each base of each data block at an average of ~3000-fold coverage.

All data blocks were recovered with a total of 10 bit errors out of 5.27 million, which were predominantly located within homopolymer runs at the end of the oligo, where there was only single sequence coverage.
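Consensus calling across many copies of one data block can be sketched as a per-position majority vote. This is a minimal illustration that assumes the reads have already been filtered to equal length. The function name and the toy reads are made up for the example.

```python
from collections import Counter


def consensus(reads):
    """Return the most common base at each position across equal-length reads."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must all have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


reads = [
    "ACGTTGCA",
    "ACGTTGCA",
    "ACCTTGCA",  # substitution at position 2
    "ACGTTGAA",  # substitution at position 6
    "ACGTTGCA",
]
print(consensus(reads))  # prints ACGTTGCA
```

The vote only helps where several independent reads cover a position. With a single read, any error in it passes straight through, which is consistent with the residual errors being reported where coverage was single.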

Future work could use compression, redundant encodings, parity checks, and error correction to improve density, error rate, and safety.

Slide11
Slide12

Nature 2013

They encoded computer files totalling 739 kilobytes of hard-disk storage, with an estimated Shannon information of 5.2 x 10^6 bits, into a DNA code, synthesized this DNA, sequenced it, and reconstructed the original files with 100% accuracy.

Slide13

Nature 2013

The five files comprised all 154 of Shakespeare’s sonnets (ASCII text), a classic scientific paper (PDF format), a medium-resolution colour photograph of the European Bioinformatics Institute (JPEG 2000 format), a 26-s excerpt from Martin Luther King’s 1963 ‘I have a dream’ speech (MP3 format) and a Huffman code used in this study to convert bytes to base-3 digits (ASCII text), giving a total of 757,051 bytes or a Shannon information of 5.2 x 10^6 bits.
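A base-3 representation is convenient because each base-3 digit (trit) can select one of the three nucleotides that differ from the previous one, so the same base never appears twice in a row. The sketch below illustrates the idea. The ordering of the three candidate bases and the starting "previous" base are assumptions made for this example and may not match the table used in the paper.

```python
BASES = "ACGT"


def trits_to_dna(trits, prev="A"):
    """Map base-3 digits to nucleotides so that no base repeats.

    Each trit picks one of the three bases that differ from the previous
    base. The candidate ordering here is an illustrative assumption.
    """
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # three candidates
        prev = choices[t]
        out.append(prev)
    return "".join(out)


def dna_to_trits(seq, prev="A"):
    """Invert trits_to_dna, given the same starting 'previous' base."""
    trits = []
    for base in seq:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits
```

Decoding needs only the previous base, so the mapping can be reversed one nucleotide at a time.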

The bytes comprising each file were represented as single DNA sequences with no homopolymers, which are associated with higher error rates in existing high-throughput sequencing technologies and led to errors in a recent DNA-storage experiment.

Slide14

Nature 2013

Slide15

Trit to nucleotide

Slide16

Nature 2013

Each DNA sequence was split into overlapping segments, generating fourfold redundancy, and alternate segments were converted to their reverse complement. These measures reduce the probability of systematic failure for any particular string, which could lead to uncorrectable errors and data loss.
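The overlapping layout can be sketched as a sliding window. The window length of 100 nt and the step of 25 nt are assumed values chosen so that interior positions are covered four times. The slides do not give them directly, although they are consistent with the 25-base gaps reported in the results. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def split_overlapping(seq, seg_len=100, step=25):
    """Cut seq into seg_len-nt windows starting every step nt.

    Windows with an odd index are reverse complemented. This sketch
    requires that the windows tile the sequence exactly.
    """
    if len(seq) < seg_len or (len(seq) - seg_len) % step:
        raise ValueError("sequence length does not fit the window layout")
    segments = []
    for i, start in enumerate(range(0, len(seq) - seg_len + 1, step)):
        window = seq[start:start + seg_len]
        segments.append(reverse_complement(window) if i % 2 else window)
    return segments


def coverage(seq_len, seg_len=100, step=25):
    """Count how many segments cover each position of the original sequence."""
    counts = [0] * seq_len
    for start in range(0, seq_len - seg_len + 1, step):
        for pos in range(start, start + seg_len):
            counts[pos] += 1
    return counts
```

In this simple version the first and last 75 positions are covered fewer than four times. How the sequence ends are handled in practice is not described in the slides.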

Each segment was then augmented with indexing information that permitted determination of the file from which it originated and its location within that file, and simple parity-check error-detection.

In all, the five files were represented by a total of 153,335 strings of DNA, each comprising 117 nucleotides (nt).

Slide17

Synthesis

They synthesized oligonucleotides (oligos) corresponding to the designed DNA strings using an updated version of Agilent Technologies’ OLS (Oligo Library Synthesis) process.

Errors occur only rarely (about 1 error per 500 bases) and independently in the different copies of each string.

The DNA was in lyophilized form, which is expected to have excellent long-term preservation characteristics.

Slide18

Sequencing

Sequencing was performed in paired-end mode on the Illumina HiSeq 2000.

Strings with uncertainties due to synthesis or sequencing errors were discarded and the remainder decoded using the reverse of the encoding procedure, with the error-detection bases and properties of the coding scheme allowing further strings containing errors to be discarded.

Although many discarded strings will have contained information that could have been recovered with more sophisticated decoding, the high level of redundancy and sequencing coverage rendered this unnecessary in their experiment.

Slide19

Results

Four of the five resulting DNA sequences could be fully decoded without intervention. The fifth, however, contained two gaps, each a run of 25 bases, for which no segment was detected corresponding to the original DNA. Each of these gaps was caused by the failure to sequence any oligo representing any of four consecutive overlapping segments.

Inspection of the neighbouring regions of the reconstructed sequence allowed the authors to hypothesize what the missing nucleotides should have been, and they manually inserted those 50 bases accordingly.

This sequence could then also be decoded. Inspection confirmed that the original computer files had been reconstructed with 100% accuracy.

Slide20

Results

The results also suggest that the mean sequencing coverage of 1,308 times was considerably in excess of that needed for reliable decoding. The data indicate that reducing the coverage by a factor of 10 (or even more) would have left the decoding characteristics unaltered, which further illustrates the robustness of the DNA-storage method.

Slide21

Angewandte (2015)

They translated 83 kB of information to 4991 DNA segments, each 158 nucleotides long, which were encapsulated in silica.

They employed error-correcting codes to correct storage-related errors.

Accelerated aging experiments were performed to measure DNA decay kinetics, which show that data can be archived on DNA for millennia under a wide range of conditions.

Slide22

Angewandte (2015)

Slide23
Slide24

Error correcting

In classical data-storage devices, error-correcting codes are implemented, which add redundancy and allow the correction of essentially all errors that occur during usage. To account for the specific requirements of storage on DNA, the existing data coding schemes had to be adapted: individual sequences are indexed and two independent error-correcting codes (specifically Reed–Solomon codes) are used in a concatenated fashion.

Slide25

Angewandte (2015)

To physically test the code, they stored the text of two old documents: the Swiss Federal Charter from 1291 and the English translation of the Method of Archimedes.

The (uncompressed) text totals 83 kilobytes. Encoding it resulted in 4991 sequences, each 117 nucleotides long, to which constant primers were added (giving a total length of 158 nt).
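These figures allow a rough estimate of the net information density after indexing and error-correction overhead. A minimal sketch, assuming "83 kilobytes" means 83 x 10^3 bytes (the slides do not say whether the unit is decimal or binary); variable names are illustrative:

```python
payload_bytes = 83_000   # "83 kilobytes", taken as 83 x 10^3 bytes (assumption)
n_sequences = 4991       # number of DNA sequences
payload_nt = 117         # nucleotides per sequence before primers
total_nt = 158           # nucleotides per sequence including primers

bits = payload_bytes * 8
primer_nt = total_nt - payload_nt                  # nucleotides added as primers
density_payload = bits / (n_sequences * payload_nt)
density_total = bits / (n_sequences * total_nt)

print(f"{primer_nt} nt of primers per sequence")
print(f"{density_payload:.2f} bits/nt excluding primers")
print(f"{density_total:.2f} bits/nt including primers")
```

Under this assumption the net density is about 1.14 bits per nucleotide before primers and about 0.84 with them, compared with the theoretical maximum of two bits per nucleotide.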

The sequences were synthesized using an electrochemical microarray technology (CustomArray), prepared for sequencing by a custom PCR (polymerase chain reaction) method, and read using the Illumina MiSeq platform.

Slide26

Results

When the sequences were read, the inner code had to correct an average of 0.7 nt errors per sequence, and the outer code had to account for a loss of 0.3% of total sequences and correct about 0.4% of the sequences, resulting in a complete and error-free recovery of the original information.

Slide27

Results

To test whether DNA stored in the solid state is more stable, they took the 4991-element oligo pool and tested the stability of three previously established dry storage procedures for DNA by accelerated aging tests.

Slide28
Slide29

Result

From the data shown in Figure 2, it is evident that DNA preservation is best in the inorganic storage format (DNA encapsulated in silica), which has the lowest local water concentration. Because an inorganic layer separates the DNA molecules from the environment, the degree of preservation is not affected by the humidity of the storage environment. This independence from humidity is very important for guaranteeing long-term stability, as a non-humid environment is hard to maintain.

In contrast, stability-increasing factors such as low temperature (e.g. permafrost) and absence of light can be maintained for extended periods of time without energy input.

Slide30

Results

The original information could be recovered error free, even after treating the DNA in silica at 70°C for one week. This is thermally equivalent to storing information on DNA in central Europe for 2000 years.

Slide31

Price

With negligible computational costs and optimized use of the technologies, the authors estimated current costs to be $12,400/MB for information storage in DNA and $220/MB for information decoding.

With current technology and their encoding scheme (Nature 2013), DNA-based storage may be cost-effective for archives of several megabytes with a 600–5,000-yr horizon.
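The per-megabyte estimates can be turned into a rough cost for a given archive. A minimal sketch, assuming decimal megabytes (10^6 bytes) and that costs scale linearly with size, neither of which the slides state; the function name and the `reads` parameter are hypothetical:

```python
WRITE_COST_PER_MB = 12_400  # USD per MB for storage (synthesis), from the estimate above
READ_COST_PER_MB = 220      # USD per MB for decoding (sequencing), from the estimate above


def archive_cost(size_bytes, reads=1, bytes_per_mb=1_000_000):
    """Rough cost in USD to write an archive once and read it `reads` times."""
    size_mb = size_bytes / bytes_per_mb
    return size_mb * (WRITE_COST_PER_MB + reads * READ_COST_PER_MB)


# The 757,051-byte data set from the Nature 2013 experiment, written and read once.
print(round(archive_cost(757_051)))
```

Writing dominates: under these assumptions each additional read adds under 2% of the one-off synthesis cost, which fits the focus on archives that are rarely accessed.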

One order of magnitude reduction in synthesis costs reduces this to about 50–500 yr; with two orders of magnitude reduction, as can be expected in less than a decade if current trends continue, DNA-based storage becomes cost-effective for sub-50-yr archives.

Slide32

Price

DNA-based storage might already be economically viable for long-horizon archives with a low expectation of extensive access, such as government and historical records.

An example in a scientific context is CERN’s CASTOR system, which stores a total of 80 PB of Large Hadron Collider data and grows at 15 PB per year. Only 10% is maintained on disk.

Archives of older data are needed for potential future verification of events, but access rates decrease considerably 2–3 years after collection.

Further examples are found in astronomy, medicine and interplanetary exploration.

Slide33
Slide34

Conclusion

Density, stability, and energy efficiency are all potential advantages of DNA storage, although costs and times for writing and reading are currently impractical for all but century-scale archives.

DNA-based storage remains feasible on scales many orders of magnitude greater than current global data volumes.

However, the costs of DNA synthesis and sequencing have been dropping at exponential rates of 5- and 12-fold per year, respectively, much faster than electronic media at 1.6-fold per year.
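To get a feel for these rates, the time needed for a 100-fold cost reduction can be computed directly. A minimal sketch, assuming each rate stays constant year after year, which is an idealization the slides do not guarantee; the function name is illustrative:

```python
import math


def years_to_reduce(factor, fold_per_year):
    """Years needed for a cost to fall by `factor` at a constant yearly fold-drop."""
    return math.log(factor) / math.log(fold_per_year)


synthesis_years = years_to_reduce(100, 5)      # DNA synthesis, 5-fold per year
sequencing_years = years_to_reduce(100, 12)    # DNA sequencing, 12-fold per year
electronic_years = years_to_reduce(100, 1.6)   # electronic media, 1.6-fold per year

print(f"synthesis:  {synthesis_years:.1f} years")
print(f"sequencing: {sequencing_years:.1f} years")
print(f"electronic: {electronic_years:.1f} years")
```

At the quoted rates, two orders of magnitude would take roughly three years for synthesis and about two for sequencing, against nearly ten for electronic media.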

DNA synthesis costs are dropping at a pace that should make storing data on DNA cost-effective for sub-50-year archiving within a decade.

Slide35

Thank you for listening!