Digital information preservation in DNA
Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes (Angewandte 2015)
Towards practical, high-capacity, low-maintenance information storage in synthesized DNA (Nature 2013)
Next-Generation Digital Information Storage in DNA (Science 2012)
Mikk Puustusmaa 2015
Introduction
Information, such as text printed on paper or images projected onto microfilm, can survive for over 500 years. However, the storage of digital information for time frames exceeding 50 years is challenging.
As digital information continues to accumulate, higher-density and longer-term storage solutions are necessary.
DNA has many potential advantages as a medium for immutable, high-latency information storage needs.
At the theoretical maximum, DNA can encode two bits per nucleotide (nt), or 455 exabytes (455 × 10^12 MB) per gram of single-stranded DNA.
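The 455-exabyte figure can be checked with a few lines of arithmetic. The sketch below assumes an average mass of about 330 g/mol per nucleotide of single-stranded DNA (a common approximation, not a value given on the slide) and decimal units (1 EB = 10^18 bytes).

```python
AVOGADRO = 6.022e23          # molecules per mole
NT_MASS_G_PER_MOL = 330.0    # ASSUMPTION: average mass of one ssDNA nucleotide
BITS_PER_NT = 2              # theoretical maximum: 4 bases encode 2 bits


def bytes_per_gram(nt_mass=NT_MASS_G_PER_MOL, bits_per_nt=BITS_PER_NT):
    """Theoretical storage capacity of one gram of ssDNA, in bytes."""
    nucleotides_per_gram = AVOGADRO / nt_mass
    return nucleotides_per_gram * bits_per_nt / 8


exabytes = bytes_per_gram() / 1e18
print(f"{exabytes:.0f} EB per gram")
```

With these inputs the result lands close to the quoted 455 EB per gram; the exact value depends on the nucleotide mass assumed.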
Advantages
Unlike most digital storage media, DNA storage is not restricted to a planar layer and is often readable despite degradation in non-ideal conditions over millennia.
Most recently, 300,000-year-old mitochondrial DNA from bears and humans has been sequenced.
DNA’s essential biological role provides access to natural reading and writing enzymes and ensures that DNA will remain a readable standard for the foreseeable future.
Advantages
The number of bases of synthesized DNA needed to encode information grows linearly with the amount of information to be stored, but we must also consider the indexing information required to reconstruct full-length files from short fragments.
As indexing information grows only as the logarithm of the number of fragments to be indexed, the total amount of synthesized DNA required grows sub-linearly.
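To illustrate the logarithmic growth of the index, the sketch below computes the minimum number of address bits needed to label each fragment uniquely. The function name is hypothetical; the fragment counts 54,898 and 153,335 are those of the Science 2012 and Nature 2013 experiments described on later slides, and the other values are arbitrary examples.

```python
def index_bits(num_fragments):
    """Minimum number of bits needed to give each fragment a unique address."""
    if num_fragments < 1:
        raise ValueError("need at least one fragment")
    # (n - 1).bit_length() is an exact integer form of ceil(log2(n)).
    return max(1, (num_fragments - 1).bit_length())


for n in (1_000, 54_898, 153_335, 1_000_000_000):
    print(f"{n:>13,} fragments -> {index_bits(n)} index bits")
```

Going from a thousand to a billion fragments only triples the index length (10 to 30 bits). The actual experiments used somewhat longer indexes than this minimum, for example a 19-bit address in the Science 2012 scheme.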
Problems
Since synthesis and sequencing of very long DNA strands is technically impeded, data must be stored on several short DNA segments.
Approaches using living vectors are not as reliable, scalable or cost-efficient, owing to disadvantages such as constraints on the genomic elements and locations that can be manipulated without affecting viability, the fact that mutation will cause the fidelity of stored and decoded information to decline over time, and possibly the requirement for storage conditions to be carefully regulated.
Science 2012
They converted an HTML-coded draft of a book that included 53,426 words, 11 JPG images, and one JavaScript program into a 5.27-megabit bitstream (about 659 kB).
They converted individual bits to A or C for 0 and T or G for 1. Bases were chosen randomly while disallowing homopolymer runs greater than three.
Addresses of the bitstream were 19 bits long and numbered consecutively, starting from 0000000000000000001.
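The one-bit-per-base scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the seeded random generator, and the way the run limit is enforced are assumptions.

```python
import random

ZERO_BASES = "AC"  # either base encodes bit 0
ONE_BASES = "TG"   # either base encodes bit 1


def encode_bits(bits, seed=0):
    """Map a string of '0'/'1' to bases, never allowing a run of four equal bases."""
    rng = random.Random(seed)
    out = []
    for bit in bits:
        choices = list(ONE_BASES if bit == "1" else ZERO_BASES)
        # If the last three bases are identical, forbid a fourth copy.
        if len(out) >= 3 and out[-1] == out[-2] == out[-3] and out[-1] in choices:
            choices.remove(out[-1])
        out.append(rng.choice(choices))
    return "".join(out)


def decode_bases(seq):
    """Recover the bit string: A or C -> 0, T or G -> 1."""
    return "".join("0" if base in ZERO_BASES else "1" for base in seq)


def make_address(block_number):
    """19-bit binary address of a data block, zero-padded on the left."""
    return format(block_number, "019b")


message = "0000000011111111" + make_address(1)
dna = encode_bits(message)
print(dna)
print(decode_bases(dna) == message)
```

Because each bit has two possible bases, the encoder always has an alternative when a homopolymer run reaches three, which is the freedom the one-bit-per-base design buys at the cost of density.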
Science 2012
They encode one bit per base (A or C for zero, G or T for one), instead of two. This allows them to encode messages in many ways in order to avoid sequences that are difficult to read or write, such as extreme GC content, repeats, or secondary structure.
By splitting the bit stream into addressed data blocks, they eliminate the need for long DNA constructs that are difficult to assemble at this scale.
Synthesis
They synthesized 54,898 oligonucleotides on Agilent’s Oligo Library Synthesis microarray platform.
To avoid cloning and sequence-verifying constructs, they synthesized, stored, and sequenced many copies of each individual oligo.
The library consists of 54,898 159-nt oligonucleotides, each containing:
a 96-bit data block (96 nt),
a 19-bit address specifying the location of the data block in the bit stream (19 nt),
flanking 22-nt common sequences for amplification and sequencing.
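The oligo layout above can be cross-checked with simple arithmetic. The variable names below are illustrative; all numbers come from the slides.

```python
DATA_NT = 96          # data block, one bit per nucleotide
ADDRESS_NT = 19       # address, one bit per nucleotide
FLANK_NT = 22         # common sequence on each side
NUM_OLIGOS = 54_898

oligo_length = DATA_NT + ADDRESS_NT + 2 * FLANK_NT
insert_length = DATA_NT + ADDRESS_NT      # part between the flanking sequences
payload_bits = NUM_OLIGOS * DATA_NT
address_space = 2 ** ADDRESS_NT

print(oligo_length)                 # 159
print(insert_length)                # 115
print(payload_bits)                 # 5270208
print(address_space >= NUM_OLIGOS)  # True
```

The 115-nt insert matches the construct length quoted on the sequencing slide, and the payload of 5,270,208 bits matches the 5.27-megabit bitstream.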
Sequencing
They sequenced the amplified library by loading it on a single lane of a HiSeq 2000 using paired-end 100-nt reads. The lane yielded 346,151,426 paired reads, with 87.14% at Q30 or above and a mean Q score of 34.16.
Since a 115-bp construct was sequenced with paired 100-bp reads, they used SeqPrep (9) to combine overlapping reads into a single contig. Joining the overlapping paired-end reads reduces the effect of sequencing error.
Errors in synthesis and sequencing are rarely coincident, so each molecular copy corrects errors in the other copies.
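The idea that each copy corrects errors in the others amounts to a per-position majority vote. The following is a minimal sketch assuming equal-length, already-aligned reads; the function name and the example reads are made up for illustration.

```python
from collections import Counter


def consensus(reads):
    """Majority vote at each position across equal-length reads."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must have equal length")
    # zip(*reads) yields one column of bases per sequence position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAA", "ACGTAC"]
print(consensus(reads))  # ACGTAC
```

Two of the five example reads carry an error, but at different positions, so the vote recovers the original sequence.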
Results
Using only reads that had the expected 115-nt length and perfect barcode sequences, they generated a consensus at each base of each data block at an average of ~3000-fold coverage.
All data blocks were recovered with a total of 10 bit errors out of 5.27 million, which were predominantly located within homopolymer runs at the end of the oligo, where there was only single sequence coverage.
Future work could use compression, redundant encodings, parity checks, and error correction to improve density, error rate, and safety.
Nature 2013
They encoded computer files totalling 739 kilobytes of hard-disk storage, with an estimated Shannon information of 5.2 × 10^6 bits, into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy.
Nature 2013
The five files comprised all 154 of Shakespeare’s sonnets (ASCII text), a classic scientific paper (PDF format), a medium-resolution colour photograph of the European Bioinformatics Institute (JPEG 2000 format), a 26-s excerpt from Martin Luther King’s 1963 ‘I have a dream’ speech (MP3 format) and a Huffman code used in this study to convert bytes to base-3 digits (ASCII text), giving a total of 757,051 bytes or a Shannon information of 5.2 × 10^6 bits.
The bytes comprising each file were represented as single DNA sequences with no homopolymers, which are associated with higher error rates in existing high-throughput sequencing technologies and led to errors in a recent DNA-storage experiment.
Nature 2013
Trit to nucleotide
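Mapping base-3 digits (trits) to nucleotides lets each trit select one of the three bases that differ from the previous base, so no homopolymers can occur. A minimal sketch of such a rotating code follows; the rotation order (alphabetical A, C, G, T), the starting base, and the function names are assumptions for illustration and may differ from the table used in the paper.

```python
BASES = "ACGT"


def trits_to_dna(trits, prev="A"):
    """Encode trits (0, 1, 2) so that no base equals the base before it."""
    out = []
    for trit in trits:
        if trit not in (0, 1, 2):
            raise ValueError("trits must be 0, 1 or 2")
        # Step 1, 2 or 3 places around the cycle A->C->G->T->A, never 0 places.
        nxt = BASES[(BASES.index(prev) + 1 + trit) % 4]
        out.append(nxt)
        prev = nxt
    return "".join(out)


def dna_to_trits(seq, prev="A"):
    """Invert trits_to_dna given the same starting base."""
    trits = []
    for base in seq:
        trits.append((BASES.index(base) - BASES.index(prev) - 1) % 4)
        prev = base
    return trits


trits = [0, 0, 2, 1, 2, 2, 0, 1]
dna = trits_to_dna(trits)
print(dna)
print(dna_to_trits(dna) == trits)
```

The price of this guarantee is density: each nucleotide carries log2(3), about 1.58 bits, instead of 2.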
Nature 2013
Each DNA sequence was split into overlapping segments, generating fourfold redundancy, and alternate segments were converted to their reverse complement. These measures reduce the probability of systematic failure for any particular string, which could lead to uncorrectable errors and data loss.
Each segment was then augmented with indexing information that permitted determination of the file from which it originated and its location within that file, and with simple parity-check error detection.
In all, the five files were represented by a total of 153,335 strings of DNA, each comprising 117 nucleotides (nt).
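The overlapping-segment scheme can be sketched as below. The 25-nt step is consistent with the 25-base gaps reported on the results slide; the 100-nt segment length is an assumption chosen so that four consecutive segments cover each position, and the indexing and parity bases that bring each string to 117 nt are omitted. Function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse the sequence and swap each base for its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def split_overlapping(seq, seg_len=100, step=25):
    """Cut seq into overlapping segments; every second one is reverse-complemented."""
    segments = []
    for k, start in enumerate(range(0, len(seq) - seg_len + 1, step)):
        segment = seq[start:start + seg_len]
        if k % 2 == 1:
            segment = reverse_complement(segment)
        segments.append(segment)
    return segments


sequence = "ACGT" * 50                      # 200-nt toy sequence
segments = split_overlapping(sequence)
print(len(segments))                        # 5
print(segments[1] == reverse_complement(sequence[25:125]))  # True
```

With a step of one quarter of the segment length, every interior base appears in four different segments, which is where the fourfold redundancy comes from.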
Synthesis
They synthesized oligonucleotides (oligos) corresponding to the designed DNA strings using an updated version of Agilent Technologies’ OLS (Oligo Library Synthesis) process.
Errors occur only rarely (1 error per 500 bases) and independently in the different copies of each string.
The synthesized DNA is in lyophilized form, which is expected to have excellent long-term preservation characteristics.
Sequencing
Sequencing was done in paired-end mode on the Illumina HiSeq 2000.
Strings with uncertainties due to synthesis or sequencing errors were discarded and the remainder decoded using the reverse of the encoding procedure, with the error-detection bases and properties of the coding scheme allowing further strings containing errors to be discarded.
Although many discarded strings will have contained information that could have been recovered with more sophisticated decoding, the high level of redundancy and sequencing coverage rendered this unnecessary in their experiment.
Results
Four of the five resulting DNA sequences could be fully decoded without intervention. The fifth, however, contained two gaps, each a run of 25 bases, for which no segment was detected corresponding to the original DNA. Each of these gaps was caused by the failure to sequence any oligo representing any of four consecutive overlapping segments.
Inspection of the neighbouring regions of the reconstructed sequence permitted the authors to hypothesize what the missing nucleotides should have been, and they manually inserted those 50 bases accordingly.
This sequence could then also be decoded. Inspection confirmed that the original computer files had been reconstructed with 100% accuracy.
Results
This also suggests that the mean sequencing coverage of 1,308 times was considerably in excess of that needed for reliable decoding. The data indicate that reducing the coverage by a factor of 10 (or even more) would have led to unaltered decoding characteristics, which further illustrates the robustness of the DNA-storage method.
Angewandte (2015)
They translated 83 kB of information to 4991 DNA segments, each 158 nucleotides long, which were encapsulated in silica.
They employed error-correcting codes to correct storage-related errors.
Accelerated aging experiments were performed to measure DNA decay kinetics, which show that data can be archived on DNA for millennia under a wide range of conditions.
Angewandte (2015)
Error correcting
In classical data-storage devices, error-correcting codes are implemented, which add redundancy and allow the correction of essentially all errors that occur during usage.
To account for the specific requirements of storage on DNA, the existing data coding schemes had to be adapted: individual sequences are indexed, and two independent error-correcting codes (specifically Reed–Solomon codes) are used in a concatenated fashion.
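The role of the outer code, recovering sequences that are lost entirely, can be illustrated with a toy example. Note that this sketch swaps in a single XOR parity block in place of Reed–Solomon: it can restore only one missing block and corrects no errors within a block, so it shows the principle of indexed blocks plus outer redundancy, not the scheme used in the paper. All names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def add_outer_parity(blocks):
    """Append one XOR parity block (toy outer code; handles one lost block)."""
    return list(blocks) + [reduce(xor_bytes, blocks)]


def recover_missing(received, total):
    """received maps block index -> block; exactly one index must be absent."""
    missing = [i for i in range(total) if i not in received]
    if len(missing) != 1:
        raise ValueError("this toy code recovers exactly one missing block")
    # XOR of all surviving blocks (data and parity) equals the lost block.
    return missing[0], reduce(xor_bytes, received.values())


blocks = [b"abcd", b"efgh", b"ijkl"]
coded = add_outer_parity(blocks)
received = {i: blk for i, blk in enumerate(coded) if i != 1}  # block 1 is lost
print(recover_missing(received, len(coded)))  # (1, b'efgh')
```

The index attached to each block is what tells the decoder which block is missing; Reed–Solomon codes generalize the same idea to many lost or corrupted symbols.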
Angewandte (2015)
To physically test the code, they stored the text of two old documents: the Swiss Federal Charter from 1291 and the English translation of the Method of Archimedes.
The (uncompressed) total text is 83 kilobytes. Encoding it resulted in 4991 sequences, each 117 nucleotides long, to which constant primers were added (giving a total length of 158 nt).
The sequences were synthesized using electrochemical microarray technology (CustomArray), prepared for sequencing by a custom PCR (polymerase chain reaction) method, and read using the Illumina MiSeq platform.
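As a rough cross-check of these numbers, the net information density can be estimated as below. The sketch assumes 83 kB means 83,000 bytes, which is an approximation; the variable names are illustrative.

```python
NUM_SEQUENCES = 4991
CODED_NT = 117            # per sequence, without primers
TOTAL_NT = 158            # per sequence, with constant primers
TEXT_BYTES = 83_000       # ASSUMPTION: 83 kB taken as 83,000 bytes

primer_nt = TOTAL_NT - CODED_NT
total_coded_nt = NUM_SEQUENCES * CODED_NT
net_bits_per_nt = TEXT_BYTES * 8 / total_coded_nt

print(primer_nt)                    # 41
print(total_coded_nt)               # 583947
print(round(net_bits_per_nt, 2))
```

The result is a little over one bit of source text per coded nucleotide, well below the two-bit theoretical maximum; the difference is consumed by the index and the redundancy of the two error-correcting codes.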
Results
When reading the sequences, the inner code had to correct an average of 0.7 nt errors per sequence, and the outer code had to account for a loss of 0.3% of total sequences and correct about 0.4% of the sequences, resulting in a complete and error-free recovery of the original information.
Results
To test whether DNA stored in the solid state is more stable, they took the 4991-element oligo pool and tested the stability of three previously established dry storage procedures for DNA by accelerated aging tests.
Results
From the data shown in Figure 2 it is evident that DNA preservation is best in the inorganic storage format (DNA encapsulated in silica), which has the lowest local water concentration. By separating the DNA molecules from the environment by an inorganic layer, the degree of preservation is not affected by the humidity of the storage environment.
This independence of humidity is very important for guaranteeing long-term stability, as a non-humid environment is hard to maintain.
In contrast, stability-increasing factors such as low temperature (e.g. permafrost) and absence of light can be maintained for extended periods of time without energy input.
Results
The original information could be recovered error-free, even after treating the DNA in silica at 70 °C for one week. This is thermally equivalent to storing information on DNA in central Europe for 2000 years.
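The equivalence between one week at 70 °C and 2000 years of storage implies an acceleration factor of roughly 100,000. Assuming the decay follows Arrhenius kinetics and taking 10 °C as a stand-in for the central European storage temperature (both are assumptions, not values from the slides), the implied activation energy can be estimated as follows.

```python
import math

R = 8.314  # gas constant, J/(mol*K)


def implied_activation_energy(accel_factor, t_storage_c, t_test_c):
    """Activation energy (J/mol) for which Arrhenius kinetics give accel_factor."""
    t_storage = t_storage_c + 273.15
    t_test = t_test_c + 273.15
    return R * math.log(accel_factor) / (1 / t_storage - 1 / t_test)


accel = 2000 * 365.25 / 7   # 2000 years expressed in weeks
# ASSUMPTION: 10 degrees C as a representative storage temperature.
ea = implied_activation_energy(accel, t_storage_c=10.0, t_test_c=70.0)

print(f"acceleration factor ~ {accel:.0f}")
print(f"implied activation energy ~ {ea / 1000:.0f} kJ/mol")
```

The estimate is sensitive to the assumed storage temperature, so it should be read as an order-of-magnitude consistency check rather than a measured value.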
Price
With negligible computational costs and optimized use of the technologies, the authors (Nature 2013) estimated current costs to be $12,400/MB for information storage in DNA and $220/MB for information decoding.
With current technology and their encoding scheme, DNA-based storage may be cost-effective for archives of several megabytes with a 600–5,000-yr horizon.
One order of magnitude reduction in synthesis costs reduces this to about 50–500 yr; with two orders of magnitude reduction, as can be expected in less than a decade if current trends continue, it should become cost-effective for sub-50-year archiving.
Price
DNA-based storage might already be economically viable for long-horizon archives with a low expectation of extensive access, such as government and historical records.
An example in a scientific context is CERN’s CASTOR system, which stores a total of 80 PB of Large Hadron Collider data and grows at 15 PB per year. Only 10% is maintained on disk.
Archives of older data are needed for potential future verification of events, but access rates decrease considerably 2–3 years after collection.
Further examples are found in astronomy, medicine and interplanetary exploration.
Conclusion
DNA-based storage remains feasible on scales many orders of magnitude greater than current global data volumes.
Density, stability, and energy efficiency are all potential advantages of DNA storage, although costs and times for writing and reading are currently impractical for all but century-scale archives.
However, the costs of DNA synthesis and sequencing have been dropping at exponential rates of 5- and 12-fold per year, respectively, much faster than electronic media at 1.6-fold per year.
DNA synthesis costs are dropping at a pace that should make storing data on DNA cost-effective for sub-50-year archiving within a decade.
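The quoted rates of cost decline can be turned into time scales with a one-line formula. The sketch below asks how long a 100-fold (two orders of magnitude) cost reduction would take if each quoted rate were sustained; sustained exponential decline is an assumption, and the function name is hypothetical.

```python
import math


def years_to_drop(factor, fold_per_year):
    """Years needed for costs to fall by `factor` at a constant yearly fold-decrease."""
    return math.log(factor) / math.log(fold_per_year)


for label, rate in (("DNA synthesis", 5), ("DNA sequencing", 12), ("electronic media", 1.6)):
    print(f"{label}: {years_to_drop(100, rate):.1f} years for a 100-fold drop")
```

At the quoted rates, both synthesis and sequencing would reach a 100-fold reduction in under three years, comfortably inside the decade mentioned above, while electronic media would need most of a decade.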
Thank you for listening!