Cassettes and DNA to cope with the explosion of our digital data

Research, industry and individuals are accumulating more and more digital data, so much so that hard drives and other storage media will soon be overwhelmed. To head off future shortages, an old technology keeps evolving, the magnetic cassette, while we wait for a cutting-edge technology based on DNA.

An Instagram photo, videos on a drive, emails… Each of us accumulates a considerable amount of digital data, and the volume keeps growing with the new technologies at our disposal (4K video, streaming on Netflix). All of it is stored not on a local hard disk but in the “cloud”, sometimes hundreds of kilometers away. But this data, familiar as it is, is not what weighs heaviest in “Big Data”.

Research is a much larger contributor. Scientific experiments are heavy, very heavy: the European Organization for Nuclear Research (CERN), near Geneva, has accumulated more than 100 petabytes (PB) of images, raw data and other information since its creation, all to be kept for future generations who will want to study them. 100 PB is the equivalent of approximately 102,400 consumer hard drives of 1 terabyte (TB) each.

The first image of the M87* black hole required an immense amount of data. Event Horizon Telescope (EHT)/National Science Foundation/Handout

The first photo of a black hole required nearly 5 PB, the equivalent of about 5,000 hard drives of 1 TB. Companies such as Twitter and EDF, or any business with a minimum of digitization, are other contributors to Big Data.

Physical limits

Between 2010 and 2020, the amount of information held as massive data was multiplied by roughly 30, going from 2 zettabytes (2 million PB) to 60 zettabytes. And the pace is quickening: by 2025, humanity is expected to produce 175 zettabytes of data.

François Képès, a cell biologist who led a foresight working group on digital data storage between 2018 and 2021, explains: “In 2018, one millionth of the Earth's land surface was occupied by data centers. At this exponential rate, by 2060 all of the land surface will be covered with data centers.”

Construction of a Facebook data center on October 5, 2021 in Eagle Mountain, Utah. Getty Images via AFP – GEORGES FREY

Yet for 70 years, researchers have kept shrinking storage systems, moving from floppy disks to hard disks, while increasing their capacity. But in its conclusions, the working group's report, published in 2020, recalls that Moore's law for semiconductors also applies to electronic and magnetic storage systems. “It is not possible to miniaturize and optimize indefinitely. For several decades, capacity doubled and price halved every two years, but this optimization is slowing down. We are reaching hard physical limits, and the optimization we can still expect is relatively small,” specifies François Képès.

The cassette, an emergency solution

While electronic storage systems are reaching their limits, the cassette continues to break records. Yes, we are talking about the cassette, the one you put in your old camcorder or cassette player, whose tape could spill out in all directions when rewinding went wrong. But the cassettes developed today have nothing to do with those of yesterday. The latest record, set by Fujifilm and IBM, stands at 580 TB, the equivalent of 76 million audio cassettes from the 1990s (60 Mb per cassette). The previous record, set in 2017, was 330 TB.

With tape twenty times thinner than a hair and over a kilometer long, the cassette fits in the palm of a hand, and it still has a few years ahead of it. Mark Lantz, a magnetic tape researcher at IBM, says: “This really demonstrates the possibility of continuing to scale tape technology, essentially at historic rates of doubling cartridge capacity every two years, for at least the next ten years.”

The next ten years… and after that? By stressing this time frame, Mark Lantz, like many engineers working in storage, shows that he is well aware of the limits of electronic and magnetic storage. Both consume enormous resources, in energy and in space.

Mark Lantz, a scientist at IBM, holds a tape of several hundred TB in his hand. © Photo courtesy of IBM Research

However, the magnetic cassette has the advantage of requiring less electronics: a single drive can read many cassettes, whereas each hard disk has its own read mechanism. In addition, a tape lasts for decades, unlike a hard drive, and is more energy efficient.

Nevertheless, a tape, however high its capacity, still takes up too much physical space and will not be able to hold the volume of massive data to come. We must therefore move up a gear, and that is what François Képès' working group set out to do. “We logically considered alternatives such as etching on glass or crystal, or storage on polymers such as DNA. It seemed likely to us that the only technology that could be developed in time, and that had sufficient room for improvement, was storage on polymers,” sums up the researcher.

Waiting for DNA

DNA? Do not panic: there is no question of storing information in living beings, or of modifying anyone's DNA directly. Admittedly, storage in bacteria or spores has been considered, but this is no longer the main avenue.

DNA is a long chain of molecules that carries the instructions for the reproduction and development of living things. Here, it is the word “instructions” that matters. DNA is built from four monomers, the “rungs” that connect the two strands of the helix: A, C, G and T. The sequence of these monomers (AAGTTCCGATAT, for example) carries the information, exactly like… the binary system of 1s and 0s that underlies every computer system.

A DNA sequence is made up of four different monomers: A, C, T, G. Getty Images – alanphillips

First, you have to determine which succession of monomers will encode the digital file. Let's imagine that A is 00, C is 01, G is 11, and T is 10, and take a deliberately made-up example. If we want to store a photo encoded as 01 11, the computer must “translate” 01 11 into CG. This is the encoding step: we encode the file. Then CG has to be “chemically” written into DNA, which is stored until it is needed.

At the time of reading, the software will translate the sequence of letters into binary code, thus reconstituting the photo on the screen. To summarize, there are therefore five stages: encoding, writing, storing, reading, decoding.
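The encoding and decoding stages can be sketched in a few lines of Python. This is only an illustration of the made-up mapping used above (A = 00, C = 01, G = 11, T = 10); the function names `encode` and `decode` are hypothetical, and the chemical writing, storage and reading stages are not modelled.

```python
# Illustrative 2-bit mapping taken from the article's made-up example.
BITS_TO_BASE = {"00": "A", "01": "C", "11": "G", "10": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(bits: str) -> str:
    """Translate a string of 0s and 1s into a sequence of DNA letters."""
    bits = bits.replace(" ", "")  # allow "01 11" as well as "0111"
    if len(bits) % 2 != 0:
        raise ValueError("the bit string must contain an even number of bits")
    pairs = (bits[i:i + 2] for i in range(0, len(bits), 2))
    return "".join(BITS_TO_BASE[pair] for pair in pairs)


def decode(bases: str) -> str:
    """Translate a sequence of DNA letters back into 0s and 1s."""
    return "".join(BASE_TO_BITS[base] for base in bases)


print(encode("01 11"))  # CG
print(decode("CG"))     # 0111
```

A real system would add far more around this core, such as error correction and addressing, which this sketch leaves out.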

But why store our information on DNA? For its information density (the amount of information that can be encoded in it), its low energy consumption and its durability. Unlike data centers, DNA needs no cooling: it can be stored at room temperature… for up to 52,000 years, if the encapsulation technique of the French company Imagene is used.

Each of its capsules can contain up to 0.8 g of DNA, or 1.4 exabytes of data. As a reminder, one exabyte represents one million 1 TB hard disks. 0.8 g of DNA would thus contain as much information as 150 tons of hard disks! To store the 175 zettabytes of Big Data expected in 2025, it would take only 175 kilos of DNA. The American agency DARPA estimates that DNA could cut the energy consumed by our data a thousandfold.
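These orders of magnitude can be checked with a short calculation. The sketch below uses only the capsule figures quoted above and assumes decimal units (1 ZB = 1,000 EB, 1 EB = 1,000,000 TB); the variable names are illustrative. With the capsule density it gives about 100 kg for 175 zettabytes, the same order of magnitude as the 175 kilos quoted, a figure that corresponds to assuming roughly 1 exabyte per gram.

```python
# Figures quoted in the article for one capsule.
CAPSULE_GRAMS = 0.8
CAPSULE_EXABYTES = 1.4

# Assumption: decimal units.
TB_PER_EB = 1_000_000
EB_PER_ZB = 1_000

# Number of 1 TB hard disks matched by a single capsule.
drives_per_capsule = CAPSULE_EXABYTES * TB_PER_EB

# Information density implied by the capsule figures, in EB per gram.
density_eb_per_gram = CAPSULE_EXABYTES / CAPSULE_GRAMS

# Mass of DNA needed for the 175 ZB expected in 2025, at that density.
data_2025_eb = 175 * EB_PER_ZB
mass_kg = data_2025_eb / density_eb_per_gram / 1000

print(f"1 TB drives per capsule: {drives_per_capsule:,.0f}")
print(f"Density: {density_eb_per_gram:.2f} EB per gram")
print(f"DNA needed for 175 ZB: about {mass_kg:.0f} kg")
```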

Development potential?

The main advantage of DNA is that we know it very well, recalls François Képès: “Biomedical research has led to the development of DNA technology, which is already very advanced. It means that all the methods needed for storing and archiving digital data have already been developed. Now, that does not mean the technology is commercially ready, not at all.”

However, the technology is advancing very quickly. “The cost of sequencing a human genome [i.e. reading, Ed.] has fallen extraordinarily. We were at 3 billion dollars in 2003; we are at 500 today,” enthuses the researcher. But there are still limits: at 500 dollars, DNA reading at 2022 speeds is still 1,000 times too expensive and 1,000 times too slow compared with a hard disk. Writing is even 100 million times too slow and too expensive.

“There are people who told us to come back and talk about it at the end of the century. No way! DNA-related technologies improve by a factor of two approximately every six months, four times faster than electronics between 1976 and 2011. At this rate, the factor of 1,000 for reading will be swallowed up within five years, around 2025. And the factor of 100 million for writing, around 2035!”
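The reasoning behind these horizons is a simple doubling calculation: a gap of a factor N takes log2(N) doublings to close. The sketch below assumes the six-month doubling period quoted by the researcher stays constant, which is a strong assumption; the function name `years_to_close` is hypothetical.

```python
import math


def years_to_close(gap_factor: float, doubling_months: float = 6.0) -> float:
    """Years needed to close a performance gap, assuming steady doubling."""
    doublings = math.log2(gap_factor)
    return doublings * doubling_months / 12


reading_years = years_to_close(1_000)        # reading is 1,000 times behind
writing_years = years_to_close(100_000_000)  # writing is 100 million times behind

print(f"Reading: about {reading_years:.1f} years")
print(f"Writing: about {writing_years:.1f} years")
```

Under that assumption the calculation gives about five years for reading and a little over thirteen for writing, in line with the horizons quoted.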

Some applications of DNA storage are already possible before 2035, because not all data needs to be read or written regularly. The INA, the French organization responsible for archiving audiovisual productions, accumulates an additional 20 PB of data each year. This data does not need to be retrieved quickly, hence the interest in encoding it in DNA. In the same way, the banking sector, which must keep its customers' records for decades in some cases, could use this new storage technology.

Proof that the stakes are enormous: the American agency DARPA has invested hundreds of millions of euros in DNA technologies. France, for its part, is just getting started, thanks in particular to François Képès' working group, with 20 million euros of government funding for DNA storage research.
