
A Simple Mathematical Approach To File "Compression" (and The Physics Behind It)

• A computer file of any type is merely a binary number. The size of the file determines the length of the binary number used to represent it.

When represented in a higher base, that binary number takes a shorter form. For example, the same number written in decimal will have far fewer digits than it does in binary.

Main Point #1: The number itself has not changed, just the way in which it is represented. Therefore, size is not necessarily determined by the amount of data involved, but by how we choose to symbolize it. No data is being added or removed.
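
To make this concrete, here is a small Python sketch (the file name "example.bin" is just a placeholder) that reads a file's bytes as one large number and counts how many digits that same number needs in a few different bases:

```python
def digit_count(n: int, base: int) -> int:
    """Number of digits needed to write n in the given base."""
    if n == 0:
        return 1
    count = 0
    while n > 0:
        n //= base
        count += 1
    return count

# Treat a file's raw bytes as one large binary number.
with open("example.bin", "rb") as f:   # placeholder file name
    number = int.from_bytes(f.read(), byteorder="big")

for base in (2, 10, 16, 62):
    print(f"base {base:>2}: {digit_count(number, base)} digits")
```

The number itself is identical in every case; only the number of symbols needed to write it down changes.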

The real question is: Is there a fast and reliable way to convert a number from one base into another (such as from binary to decimal)?

Indeed, there is! [see "Residue Number Systems"; a couple of interesting papers are "Residue Number System: An Important Application in Bioinformatics" by Hassan Kehinde Bello and Kazeem Alagbe Gbolagade, and "A New Moduli Set for Residue Number System in Ternary Valued Logic" by M. Hosseinzadeh and K. Navi]
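
As a rough sketch of the idea behind a residue number system (the moduli below are just an arbitrary pairwise-coprime example set), a number can be stored as its remainders and then recovered with the Chinese Remainder Theorem:

```python
from math import prod

MODULI = (3, 5, 7, 11, 13)   # an example set of pairwise-coprime moduli

def to_rns(n: int, moduli=MODULI) -> tuple:
    """Represent n by its remainders modulo each modulus."""
    return tuple(n % m for m in moduli)

def from_rns(residues, moduli=MODULI) -> int:
    """Reconstruct n (for 0 <= n < product of the moduli) via the CRT."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        M_i = M // m
        total += r * M_i * pow(M_i, -1, m)   # pow(..., -1, m) is the modular inverse
    return total % M

n = 12345                     # must be smaller than 3 * 5 * 7 * 11 * 13 = 15015
print(to_rns(n))              # (0, 0, 4, 3, 8)
print(from_rns(to_rns(n)))    # 12345
```

Each residue is a small number, and arithmetic like addition and multiplication can be carried out on the residues independently, which is part of what makes these systems fast.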

• A thing does not always have to be physically present if we are given the instructions and materials necessary to make it from scratch.

Main Point #2: If we can provide an algorithm that generates a given number, then we do not need to "store" the information itself. We can recreate it "on the fly" just by following the steps of the algorithm!

Further, if the algorithm is simple and recursive, then we only need to specify a small number of steps and how many times to repeat them.
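
As a toy illustration (the seed string and repeat count here are arbitrary choices), a tiny rule plus a repetition count can stand in for a much larger output:

```python
def generate(seed: str, steps: int) -> str:
    """Regenerate a long string from a tiny rule: repeatedly double the seed."""
    out = seed
    for _ in range(steps):
        out += out            # each step doubles the length
    return out

# The "stored" description is just ("ab", 20), yet it expands to about 2 MB of text.
data = generate("ab", 20)
print(len(data))              # 2 * 2**20 = 2097152 characters
```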

The next best question is: What kind of algorithm do we need to do this?

We can start with simple processes that we might already know!

For example, we can take a decimal number of any size and add together its digits until we get a single digit. This is equivalent to "compressing" that number. In other words, we can represent a very large number (with many digits) as a single number (with only one digit).

Then, to "decompress" it, we merely specify how many times we need to multiply that single number by 9 in order to give us our starting number again.

The adding and multiplying procedures described above can both be done very quickly, and in turn, can be used to scale the size of a number up and down very easily. Again, a handful of small numbers can represent an incredibly large number (i.e.: one with many digits).
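
Here is a sketch in Python of the repeated digit-sum step described above (this reduction is commonly called the "digital root"), along with a count of how many summing passes it takes:

```python
def digit_sum(n: int) -> int:
    """Sum of the decimal digits of n."""
    return sum(int(d) for d in str(n))

def digital_root(n: int) -> tuple:
    """Repeatedly sum the digits of n until one digit remains.

    Returns (single_digit, number_of_passes).
    """
    passes = 0
    while n >= 10:
        n = digit_sum(n)
        passes += 1
    return n, passes

print(digital_root(987654321987654321))   # (9, 2): the digits sum to 90, then to 9
```

For any positive integer, the resulting single digit equals 1 + (n - 1) % 9, which is why the number 9 plays the special role mentioned above.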

• We often distinguish between "analog" and "digital" signals (or to put it another way, between things which are "continuous" and "discrete"). An "analog" signal is one that can take on an "infinite" range of measured values (such as temperature, pressure, and so on). A "digital" signal is one that can be represented by a "finite" (or limited) number of states. Restricting a signal to such a finite set of states is referred to as "quantization".

Computers utilize two distinct states, "off" and "on", represented by the 0's and 1's of binary. Therefore, computers are considered "digital", even though the electrical waves that run through their circuitry are technically considered "analog".

We can convert one signal type into the other through a process called "sampling". Many devices do this all the time (with "Analog-to-Digital Converters", or "ADCs", and "Digital-to-Analog Converters", or "DACs"). This is done with varying amounts of fidelity or "resolution". However, for a band-limited signal there is a minimum sampling rate, and therefore a minimum amount of information, that must be captured in order to move between the two types of signals without loss [see "Nyquist-Shannon sampling theorem"].
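
As a small numerical illustration of sampling and reconstruction (the signal frequency, sampling rate, and duration below are arbitrary choices), a band-limited signal sampled above its Nyquist rate can be rebuilt from its samples with sinc (Whittaker-Shannon) interpolation:

```python
import numpy as np

f_signal = 5.0        # a 5 Hz sine wave (band-limited by construction)
f_sample = 40.0       # sampling rate, well above the Nyquist rate of 10 Hz
duration = 2.0        # seconds

t_samples = np.arange(0, duration, 1 / f_sample)
samples = np.sin(2 * np.pi * f_signal * t_samples)

# Whittaker-Shannon (sinc) interpolation back onto a fine, "analog-like" grid.
t_fine = np.linspace(0, duration, 2000)
reconstructed = np.array([
    np.sum(samples * np.sinc((t - t_samples) * f_sample)) for t in t_fine
])

# The residual error comes from truncating the (ideally infinite) sum of samples;
# it shrinks as the sampled window grows.
original = np.sin(2 * np.pi * f_signal * t_fine)
print(np.max(np.abs(original - reconstructed)))
```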

Therefore, the final "million dollar" question is: Is it possible to create a sampling process where there is a 1-to-1 correspondence between an analog signal and a digital one, where no information is "lost" and whatever is necessary can be retrieved?

As strange as it may sound, logic would dictate that this may actually be possible (and without introducing concepts outside of "standard" science). For example, we interpret all energy as both "quantized" and "conserved". In other words, it comes in distinct packets (like "photons" of light) and it never disappears completely, only changes form!

To put it more plainly, there is no such thing as an "analog" signal per se, only systems that seem so complex that we try to approximate them with numbers, rather than represent them directly as numbers. At a high enough resolution, all signals are "digital".

Therefore, are efficient compression algorithms already used within biological systems (such as DNA), and is their use by humans simply the result of rediscovering something inherent to Nature itself?