Straight-line grammar (with start symbol ß) for the second sentence of the United States Declaration of Independence. Each blue character denotes a nonterminal symbol; they were obtained from a gzip-compression of the sentence.

Grammar-based codes or Grammar-based compression are compression algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed. Examples include universal lossless data compression algorithms.^[1] To compress a data sequence ${\displaystyle x=x_{1}\cdots x_{n))$ , a grammar-based code transforms $x$ into a context-free grammar $G$ . The problem of finding a smallest grammar for an input sequence (smallest grammar problem) is known to be NP-hard,^[2] so many grammar-transform algorithms are proposed from theoretical and practical viewpoints. Generally, the produced grammar $G$ is further compressed by statistical encoders like arithmetic coding.

Examples and characteristics

The class of grammar-based codes is very broad. It includes block codes, the multilevel pattern matching (MPM) algorithm,^[3] variations of the incremental parsing Lempel-Ziv code,^[4] and many other new universal lossless compression algorithms. Grammar-based codes are universal in the sense that they can achieve asymptotically the entropy rate of any stationary, ergodic source with a finite alphabet.

Practical algorithms

The compression programs of the following are available from external links.

Sequitur^[5] is a classical grammar compression algorithm that sequentially translates an input text into a CFG, and then the produced CFG is encoded by an arithmetic coder.
Re-Pair^[6] is a greedy algorithm using the strategy of most-frequent-first substitution. The compressive performance is powerful, although the main memory space requirement is very large.
GLZA,^[7] which constructs a grammar that may be reducible, i.e., contain repeats, where the entropy-coding cost of "spelling out" the repeats is less than the cost creating and entropy-coding a rule to capture them. (In general, the compression-optimal SLG is not irreducible, and the Smallest Grammar Problem is different from the actual SLG compression problem.)

References

External links

Data compression methods

Lossless

Entropy type	Adaptive coding Arithmetic Asymmetric numeral systems Golomb Huffman Adaptive Canonical Modified Range Shannon Shannon–Fano Shannon–Fano–Elias Tunstall Unary Universal Exp-Golomb Fibonacci Gamma Levenshtein
Dictionary type	Byte pair encoding Lempel–Ziv 842 LZ4 LZJB LZO LZRW LZSS LZW LZWL Snappy
Other types	BWT CTW CM Delta Incremental DMC DPCM Grammar Re-Pair Sequitur LDCT MTF PAQ PPM RLE
Hybrid	LZ77 + Huffman Deflate LZX LZS LZ77 + ANS LZFSE LZ77 + Huffman + ANS Zstandard LZ77 + Huffman + context Brotli LZSS + Huffman LHA/LZH LZ77 + Range LZMA LZHAM bzip2 (RLE + BWT + MTF + Huffman)

Lossy

Transform type	Discrete cosine transform DCT MDCT DST FFT Wavelet Daubechies DWT SPIHT
Predictive type	DPCM ADPCM LPC ACELP CELP LAR LSP WLPC Motion Compensation Estimation Vector Psychoacoustic

Audio

Concepts	Bit rate ABR CBR VBR Companding Convolution Dynamic range Latency Nyquist–Shannon theorem Sampling Silence compression Sound quality Speech coding Sub-band coding
Codec parts	A-law μ-law DPCM ADPCM DM FT FFT LPC ACELP CELP LAR LSP WLPC MDCT Psychoacoustic model

Image

Concepts	Chroma subsampling Coding tree unit Color space Compression artifact Image resolution Macroblock Pixel PSNR Quantization Standard test image Texture compression
Methods	Chain code DCT Deflate Fractal KLT LP RLE Wavelet Daubechies DWT EZW SPIHT

Video

Concepts	Bit rate ABR CBR VBR Display resolution Frame Frame rate Frame types Interlace Video characteristics Video quality
Codec parts	DCT DPCM Deblocking filter Lapped transform Motion Compensation Estimation Vector Wavelet Daubechies DWT

Theory

Community

Hutter Prize

People

Mark Adler

Examples and characteristics

Practical algorithms

See also

References

External links