CRAM
Filename extension	.cram
Developed by	Markus Hsi-Yang Fritz et al.; Vadim Zalunin; James Bonfield
Type of format	Bioinformatics
Open format?	yes
Website	www.ga4gh.org/cram/, www.ebi.ac.uk/ena/software/cram-toolkit

Compressed Reference-oriented Alignment Map (CRAM) is a compressed columnar file format for storing biological sequences aligned to a reference sequence, initially devised by Markus Hsi-Yang Fritz et al.^[1]

CRAM was designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller than the equivalent BAM files, depending on the data held within them.
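
The idea of describing a read by its differences from the reference can be illustrated with a toy sketch. The function names and the (position, length, substitutions) layout below are hypothetical and far simpler than the encoding CRAM actually uses; the sketch handles substitutions only, not insertions, deletions or clipping.

```python
def encode_read(reference: str, pos: int, read: str):
    """Describe a read as (pos, length, substitutions) relative to the reference.

    Toy model: only substitutions are supported.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the offsets where the read disagrees with the reference.
    subs = [(i, base)
            for i, (ref_base, base) in enumerate(zip(window, read))
            if ref_base != base]
    return pos, len(read), subs


def decode_read(reference: str, encoded) -> str:
    """Rebuild the read from the reference and the stored differences."""
    pos, length, subs = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in subs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCTTAA"
read = "ACGTTAGCGGAT"  # aligned at 0-based position 4, one mismatch
encoded = encode_read(reference, 4, read)
restored = decode_read(reference, encoded)
```

In this example the twelve bases of the read are reduced to a position, a length and a single substitution, which is where the storage saving comes from when reads closely match the reference.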

Implementations of CRAM exist in htsjdk,^[2] htslib,^[3] JBrowse,^[4] and Scramble.^[5]

The file format specification is maintained by the Global Alliance for Genomics and Health (GA4GH),^[6] with the specification document available from the EBI CRAM toolkit page.^[7]

File format

The basic structure of a CRAM file is a series of containers, the first of which holds a compressed copy of the SAM header. Subsequent containers consist of a container Compression Header followed by a series of slices, which in turn hold the alignment records themselves, formatted as a series of blocks.

CRAM file:

Magic number	Container (SAM header)	Container (Data)	...	Container (Data)	Container (EOF)

Container:

Container Header	Compression Header	Slice	...	Slice

Slice:

Slice Header	Block	Block	...	Block
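
As a minimal sketch of reading the start of this layout, the following Python checks the magic number and extracts the format version. The byte layout assumed here (a 4-byte magic `CRAM`, one byte each for the major and minor version, and a 20-byte file identifier) is not stated in this article and should be checked against the CRAM specification; the names `FileDefinition` and `parse_file_definition` are hypothetical.

```python
import struct
from typing import NamedTuple


class FileDefinition(NamedTuple):
    major: int
    minor: int
    file_id: bytes


def parse_file_definition(data: bytes) -> FileDefinition:
    """Parse the fixed-size preamble assumed to start a CRAM file."""
    if len(data) < 26:
        raise ValueError("too short to be a CRAM file")
    # Assumed layout: 4-byte magic, 1-byte major, 1-byte minor, 20-byte file id.
    magic, major, minor, file_id = struct.unpack("<4sBB20s", data[:26])
    if magic != b"CRAM":
        raise ValueError("missing CRAM magic number")
    return FileDefinition(major, minor, file_id)


# Synthetic preamble for illustration; a real file would be read from disk.
header = b"CRAM" + bytes([3, 0]) + b"example".ljust(20, b"\0")
definition = parse_file_definition(header)
```

The containers that follow the preamble use variable-length integer encodings and are not covered by this sketch.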

CRAM constructs records from a set of data series, each describing one component of an alignment. The container Compression Header specifies which data series is encoded in which block, which codec is used, and any codec-specific metadata (for example, a table of Huffman symbol code lengths). While data series can be mixed together within the same block, keeping them separate usually improves compression and allows efficient selective decoding when only some data types are required.
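
The relationship between data series, blocks and codecs can be modelled with a toy example. Here a dictionary named `layout` stands in for the Compression Header, and the standard-library `zlib` and `lzma` modules stand in for block codecs. The function names, block ids and byte values are hypothetical, the series names other than BF are illustrative, and real CRAM blocks use a different binary layout.

```python
import lzma
import zlib

# Codec name -> (compress, decompress); both come from the standard library.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def build_slice(series, layout):
    """Compress each data series into its own block.

    `layout` plays the role of the compression header:
    series name -> (block id, codec name).
    """
    blocks = {}
    for name, (block_id, codec) in layout.items():
        compress, _ = CODECS[codec]
        blocks[block_id] = compress(series[name])
    return blocks


def decode_series(blocks, layout, wanted):
    """Decompress only the blocks that hold the requested data series."""
    out = {}
    for name in wanted:
        block_id, codec = layout[name]
        _, decompress = CODECS[codec]
        out[name] = decompress(blocks[block_id])
    return out


series = {
    "BF": bytes([99, 147, 99, 147]),      # flags, one byte per record
    "RL": bytes([100, 100, 100, 100]),    # read lengths
    "QS": b"FFFF:FFF,FFFFFF:" * 4,        # quality scores
}
layout = {"BF": (1, "zlib"), "RL": (2, "zlib"), "QS": (3, "lzma")}
blocks = build_slice(series, layout)
flags_only = decode_series(blocks, layout, ["BF"])
```

Because each series lives in its own block, asking for the flags alone touches block 1 and leaves the quality-score block compressed.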

Selective access to a CRAM file is provided by the index (with file-name suffix ".crai"). On data sorted by chromosome and position, the index indicates which region is covered by each slice. On unsorted data, the index may simply be used to fetch the Nth container. Selective decoding may also be achieved by using the Compression Header to skip specified data series if only partial records are required.
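
A region query against such an index might be sketched as follows. The sketch assumes the index is gzip-compressed text with six tab-separated integer columns per slice (reference id, alignment start, alignment span, container byte offset, slice offset within the container, slice size); this layout is an assumption to be verified against the CRAM specification, and the names `IndexEntry`, `parse_crai` and `slices_overlapping` are hypothetical.

```python
import gzip
from typing import NamedTuple


class IndexEntry(NamedTuple):
    ref_id: int
    start: int
    span: int
    container_offset: int
    slice_offset: int
    slice_size: int


def parse_crai(raw: bytes):
    """Parse gzip-compressed, tab-separated index text into entries."""
    text = gzip.decompress(raw).decode("ascii")
    return [IndexEntry(*map(int, line.split("\t")))
            for line in text.splitlines() if line]


def slices_overlapping(entries, ref_id, begin, end):
    """Return entries whose [start, start + span) intersects [begin, end)."""
    return [e for e in entries
            if e.ref_id == ref_id and e.start < end and begin < e.start + e.span]


# Synthetic three-slice index built in memory for illustration.
raw = gzip.compress(
    b"0\t1\t10000\t1000\t200\t5000\n"
    b"0\t10001\t10000\t6200\t200\t4800\n"
    b"1\t1\t8000\t11200\t200\t4500\n"
)
entries = parse_crai(raw)
hits = slices_overlapping(entries, ref_id=0, begin=9500, end=10500)
```

A reader would then seek to the container offset of each hit and decode only those slices, rather than scanning the whole file.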

History

Year	Version(s)	Notes
2010–11	pre-CRAM	Initial paper describing the reference-based format. It did not use the name CRAM, calling it mzip instead. The software was implemented in Python as a prototype and demonstration of the basic concepts.^[1]
2011–12	0.3–0.86	Vadim Zalunin of the European Bioinformatics Institute (EBI) produced the first implementation named CRAM as a package called CRAMtools,^[8] written in the Java programming language.
2012	1.0^[9]	Implemented in Java CRAMtools.^[10]
2013		C implementation added to the Scramble^[11]^[5] tool, by James Bonfield of the Wellcome Sanger Institute.
2013	2.0	Changes included support for more than one reference per slice (useful with highly fragmented assemblies), better encoding of SAM auxiliary tags, splitting soft-clip and inserted bases into their own data series, metadata to track the number of records and bases per slice, and corrections to the BF (BAM flag) data series.
2013		Added to htslib (0.2.0).
2014	2.1^[12]	Added EOF blocks, to help identify truncated files.
2014		Added to htsjdk (1.127).
2014	3.0^[13]	Inclusion of lzma and rANS codecs for block compression, along with multiple checksums for ensuring data integrity.
2018		JavaScript implementation as part of JBrowse^[4] (1.15.0), by Rob Buels.
2021		Rust implementation in Noodles.^[14]
2023	3.1^[15]	Officially adopted (draft from 2019).

CRAM version 4.0 exists as a prototype in Scramble,^[5] initially demonstrated in 2015, but has yet to be adopted as a standard.

References

1. Hsi-Yang Fritz, Markus; Leinonen, Rasko; Cochrane, Guy; Birney, Ewan (May 2011). "Efficient storage of high throughput DNA sequencing data using reference-based compression". Genome Research. 21 (5): 734–740. doi:10.1101/gr.114819.110. ISSN 1549-5469. PMC 3083090. PMID 21245279.
2. "Htsjdk by Broad Institute". samtools.github.io. Retrieved 2018-10-14.
3. "Samtools". www.htslib.org. Retrieved 2018-10-14.
4. "JBrowse · A fast, embeddable genome browser built with HTML5 and JavaScript". jbrowse.org. Retrieved 2018-10-14.
5. Bonfield, James K. (2014-06-14). "The Scramble conversion tool". Bioinformatics. 30 (19): 2818–2819. doi:10.1093/bioinformatics/btu390. ISSN 1460-2059. PMC 4173023. PMID 24930138.
6. "GA4GH". www.ga4gh.org. Retrieved 2018-10-14.
7. EMBL-EBI. "CRAM toolkit < Software < European Nucleotide Archive < EMBL-EBI". www.ebi.ac.uk. Retrieved 2018-10-14.
8. "vadimzalunin/crammer". GitHub. 2017-08-08. Retrieved 2018-10-14.
9. "CRAM 1.0 Specification" (PDF).
10. "enasequence/cramtools". GitHub. 2018-10-02. Retrieved 2018-10-14.
11. "jkbonfield/io_lib". GitHub. 2018-10-16. Retrieved 2018-10-14.
12. "CRAM 2.1 Specification" (PDF).
13. "CRAM 3.0 Specification" (PDF).
14. https://github.com/zaeleus/noodles/
15. "CRAM 3.1 Specification" (PDF).
