Data compression

Data compression is the process of modifying, encoding, or converting the bit structure of data in such a way that it consumes less space on disk.


Information is digitally encoded as a pattern of 0s and 1s, or bits (binary digits). A four-letter alphabet (a, e, r, t) would require two bits per character if all characters were equally probable. If the characters are not equally probable, the entropy of the data is less than two bits per character, and a shorter encoding is possible; in general, compression can succeed whenever the entropy is less than the average length of an encoding.

Because a is most frequent in this text, with t the second most common, assigning a variable-length binary code (for example, a: 0, t: 10, r: 110, e: 111) results in a shorter encoding on average. This encoding has the important property that no code is a prefix of any other, so no extra bits are required to separate letter codes. Data compression may be lossless (exact) or lossy (inexact).
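The prefix property can be demonstrated with a short sketch in Python. The codewords below (a: 0, t: 10, r: 110, e: 111) are one valid prefix code for this alphabet, chosen for illustration:

```python
# A variable-length prefix code for the four-letter alphabet.
# No codeword is a prefix of any other, so decoding is unambiguous.
CODES = {"a": "0", "t": "10", "r": "110", "e": "111"}

def encode(text):
    return "".join(CODES[ch] for ch in text)

def decode(bits):
    reverse = {code: ch for ch, code in CODES.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in reverse:          # a complete codeword has been read
            out.append(reverse[buf])
            buf = ""
    return "".join(out)

bits = encode("attar")
print(bits)            # 010100110 — 9 bits versus 10 for a fixed 2-bit code
print(decode(bits))    # attar
```

Note that the decoder never needs a separator: as soon as the accumulated bits match a codeword, that letter is emitted, which is exactly what the prefix property guarantees.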

Lossless compression can be reversed to yield the original data, while lossy compression loses detail or introduces small errors upon reversal. Lossless compression is necessary for text, where every character is important, while lossy compression may be acceptable for images or voice (the limitation of the frequency spectrum in telephony being an example of lossy compression).

The three most common compression programs for general data are Zip (on computers using the Windows operating system), StuffIt (on Apple computers), and gzip (on computers running UNIX); all use lossless compression.

A common format for compressing static images, especially for display over the Internet, is GIF (graphics interchange format), which is lossless except that its images are limited to 256 colours. A greater range of colours can be used with the JPEG (joint photographic experts group) formatting standard, which uses both lossless and lossy techniques, as do various standards of MPEG (moving picture experts group) for videos.

For compression programs to work, they must have a model of the data that describes the distribution of characters, words, or other elements, such as the frequency with which individual characters occur in English.

Fixed models such as the simple example of the four-character alphabet, above, may not characterize a single text very well, particularly if the text contains tabular data or uses a specialized vocabulary.

In these cases, adaptive models, derived from the text itself, may be superior. Adaptive models estimate the distribution of characters or words based on what they have processed so far. An important property of adaptive modeling is that if the compression and decompression programs use precisely the same rules for forming the model and the same table of codes that they assign to its elements, then the model itself need not be sent to the decompression program.

For example, if the compressing program gives the next available code to the word "the" when it is seen for the third time, the decompression program will follow the same rule and expect that code for "the" after its second occurrence. Coding may work with individual symbols or with words.
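A minimal sketch of this idea in Python, using a simplified rule (a word receives the next available code the first time it is seen, rather than the third). Because encoder and decoder apply exactly the same rule, no code table is ever transmitted:

```python
# Toy adaptive word model: both sides assign the next available code
# to a word on its first occurrence, so the model needn't be sent.

def adaptive_encode(words):
    table, out = {}, []
    for w in words:
        if w in table:
            out.append(table[w])        # known word: emit its code
        else:
            out.append(w)               # first occurrence: send literally
            table[w] = len(table)       # then assign it the next code
    return out

def adaptive_decode(stream):
    table, out = {}, []
    for item in stream:
        if isinstance(item, int):
            out.append(table[item])     # known code: look it up
        else:
            out.append(item)            # literal: record it under next code
            table[len(table)] = item
    return out

words = "the cat saw the dog and the cat".split()
print(adaptive_encode(words))
# ['the', 'cat', 'saw', 0, 'dog', 'and', 0, 1]
```

The decoder rebuilds the identical table as it reads the stream, which is the property the passage above describes.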

Huffman codes use a static model and construct codes like that illustrated earlier in the four-letter alphabet. Arithmetic coding encodes strings of symbols as ranges of real numbers and achieves more nearly optimal codes.
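Huffman's construction can be sketched compactly in Python using the standard greedy procedure: repeatedly merge the two lightest subtrees, lengthening the codes of the symbols inside them by one bit (the function name is illustrative):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # One weighted leaf per symbol; merging the two lightest subtrees
    # prepends one bit to every code inside them.
    freq = Counter(text)
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    if len(heap) == 1:                      # degenerate single-symbol case
        (_, _, codes), = heap
        return {sym: "0" for sym in codes}
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("aaaatteraa")
# The most frequent symbol gets the shortest code, and the result
# is a prefix code, like the four-letter example above.
```

Running it on a string where a dominates yields a one-bit code for a, a two-bit code for t, and three-bit codes for the rare letters e and r.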

It is slower than Huffman coding but is suitable for adaptive models. Run-length encoding (RLE) is good for repetitive data, replacing a run of repeated items with a count and one copy of the item. Adaptive dictionary methods build a table of strings and then replace occurrences of them with shorter codes.
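Run-length encoding is simple enough to sketch in a few lines of Python (function names are illustrative):

```python
def rle_encode(data):
    # Replace each run of a repeated item with a (count, item) pair.
    runs = []
    for item in data:
        if runs and runs[-1][1] == item:
            runs[-1][0] += 1            # extend the current run
        else:
            runs.append([1, item])      # start a new run
    return [(n, item) for n, item in runs]

def rle_decode(pairs):
    return [item for n, item in pairs for _ in range(n)]

print(rle_encode("WWWWBWWW"))   # [(4, 'W'), (1, 'B'), (3, 'W')]
```

As the example suggests, RLE helps only when the data actually contain long runs; on text with few repeats it can even expand the input.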

The Lempel-Ziv algorithm, invented by Israeli computer scientists Abraham Lempel and Jacob Ziv, uses the text itself as the dictionary, replacing later occurrences of a string with numbers indicating where it occurred before and its length.
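A simplified, unoptimized sketch of this idea in Python: later occurrences of a string become (distance, length) tokens pointing back into the already-seen text. Real implementations use hash tables, bounded match lengths, and careful token formats; this version shows only the principle:

```python
def lz77_encode(text, window=4096, min_len=3):
    # Greedy sketch: at each position, find the longest earlier match
    # within the window; emit a (distance, length) token or a literal.
    out, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_len:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(text[i])
            i += 1
    return out

def lz77_decode(tokens):
    out = []
    for t in tokens:
        if isinstance(t, str):
            out.append(t)
        else:
            dist, length = t
            for _ in range(length):     # copies may overlap themselves
                out.append(out[-dist])
    return "".join(out)

print(lz77_encode("abcabcabcx"))   # ['a', 'b', 'c', (3, 6), 'x']
```

Note that the match at position 3 is longer than the distance back to it (length 6, distance 3): the copy overlaps itself, which the decoder handles by copying one character at a time.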

Zip and gzip use variations of the Lempel-Ziv algorithm. Lossy compression extends these techniques by removing detail. In particular, digital images are composed of pixels that represent gray-scale or colour information.

While smoothing out a large section of an image would be glaringly evident, the change is far less noticeable when spread over small scattered sections.

The most common method uses the discrete cosine transform, a mathematical formula related to the Fourier transform, which breaks the image into separate parts of differing levels of importance for image quality. This technique, as well as fractal techniques, can achieve excellent compression ratios.
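The idea can be illustrated in one dimension with Python: compute the (unnormalized) discrete cosine transform of a row of pixel values, discard the small coefficients, and invert. The reconstruction stays close to the original even though information was thrown away; the threshold of 10 here is arbitrary, and real codecs instead quantize each coefficient according to its perceptual importance:

```python
import math

def dct(x):
    # DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k)
                for n in range(N))
            for k in range(N)]

def idct(X):
    # Matching inverse (scaled DCT-III): recovers x exactly.
    N = len(X)
    return [X[0] / N + 2.0 / N * sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                                     for k in range(1, N))
            for n in range(N)]

row = [52, 55, 61, 66, 70, 61, 64, 73]             # one row of pixel values
coeffs = dct(row)
kept = [c if abs(c) > 10 else 0.0 for c in coeffs]  # drop small coefficients
approx = idct(kept)
# approx differs from row by only a few gray levels per pixel,
# yet several coefficients no longer need to be stored.
```

Most of the signal's energy sits in the low-frequency coefficients, so zeroing the small high-frequency ones removes detail the eye barely notices, which is precisely the lossy step.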

While the performance of lossless compression is measured by its degree of compression, lossy compression is also evaluated on the basis of the error it introduces.

There are mathematical methods for calculating error, but the measure of error also depends on how the data are to be used. Video images may be compressed by storing only the slight differences between successive frames, a technique used in DVDs and some video-networking devices; video compression can achieve high compression ratios with minimal distortion.
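Frame differencing can be sketched in Python: store the first frame in full and every later frame as per-pixel deltas. For slowly changing video the deltas are mostly zeros, so they compress well with, say, run-length encoding (frames are modeled here as flat lists of pixel values for simplicity):

```python
def diff_frames(frames):
    # Keep the first frame whole; encode each later frame as deltas
    # against its predecessor.
    encoded = [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        encoded.append([c - p for p, c in zip(prev, cur)])
    return encoded

def rebuild(encoded):
    frames = [encoded[0]]
    for delta in encoded[1:]:
        frames.append([p + d for p, d in zip(frames[-1], delta)])
    return frames

frames = [[10, 10, 10, 10],
          [10, 10, 11, 10],
          [10, 10, 12, 10]]
print(diff_frames(frames)[1])   # [0, 0, 1, 0] — mostly zeros
```

This sketch is lossless; real video codecs combine such temporal prediction with lossy transform coding of the residual deltas.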

There is a trade-off between the time and memory that compression algorithms require and the compression that they achieve.


English text can generally be compressed to one-half or one-third of its original size. Images can often be compressed by factors of 10 to 20 or more. Despite the growth of computer storage capacity and network speeds, data compression remains an essential tool for storing and transmitting ever-larger collections of data.

See also information theory and telecommunication.


Compression predates digital technology; Morse code, for example, assigned its shortest symbols to the most frequently used letters. Practical dictionary-based compressors likewise work over a window far smaller in scope than the whole stream, generally remembering no more than the last megabyte or so of data.

Data compression can be viewed as a special case of data differencing. Data differencing consists of producing a difference given a source and a target, with patching producing a target given a source and a difference; data compression consists of producing a compressed file given only a target (in effect, a difference against an empty source), and decompression consists of producing a target given only a compressed file.
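This relationship can be sketched in Python using the standard library's difflib to build a crude diff-and-patch pair (the opcode format here is invented for illustration). With an empty source, the difference degenerates into the target itself, which is exactly the compression case:

```python
from difflib import SequenceMatcher

def make_diff(source, target):
    # A difference: copy instructions into source, plus literal insertions.
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, source, target).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse source[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # 'delete' needs no op: that span is simply never copied
    return ops

def patch(source, diff):
    out = []
    for op in diff:
        if op[0] == "copy":
            out.append(source[op[1]:op[2]])
        else:
            out.append(op[1])
    return "".join(out)

old = "the quick brown fox"
new = "the quick red fox"
assert patch(old, make_diff(old, new)) == new
# With an empty source, the "difference" is just the target:
assert patch("", make_diff("", new)) == new
```

A real differencing tool would then compress the instruction stream itself; the point of the sketch is only the structural parallel between diff/patch and compress/decompress.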

If a text is rewritten more concisely, the compression is lossy, since the text has changed, but the "meaning" and clarity of the message might be fully maintained, or even improved; Strunk and White might argue that good writing is the art of lossy text compression. Lossy data compression is the converse of lossless data compression.

As digital images became more common, standards for compressing them emerged, and lossy compression methods began to be widely used. In these schemes, some loss of information is accepted, since nonessential detail can be dropped to save storage or bandwidth.

As a practical example in databases, SQL Server and Azure SQL Database support row and page compression for rowstore tables and indexes, and columnstore and columnstore archival compression for columnstore tables and indexes.
