In bioinformatics, a sequence database is a type of biological database composed of a large collection of digitally stored nucleic acid sequences, protein sequences, or other polymer sequences. The UniProt database is an example of a protein sequence database; as of 2013 it contained over 40 million sequences and was growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.
Searching a sequence database involves comparing a query sequence against the sequences in the database in order to find the database sequence that "best" matches the query, based on criteria that vary with the search method. The number of matches (hits) is used to compute a score that quantifies the similarity between the query and each sequence in the database.
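The idea of scoring a query by its number of matches can be illustrated with a minimal sketch. It assumes the simplest possible criterion, the count of identical positions at the best ungapped offset, which is far cruder than the scoring used by real search tools. The function names and the toy database are hypothetical.

```python
def best_ungapped_matches(query: str, subject: str) -> int:
    """Return the highest count of identical positions over all ungapped offsets."""
    best = 0
    # Slide the query across the subject, allowing partial overlap at both ends.
    for offset in range(-len(query) + 1, len(subject)):
        matches = 0
        for i, residue in enumerate(query):
            j = offset + i
            if 0 <= j < len(subject) and subject[j] == residue:
                matches += 1
        best = max(best, matches)
    return best


def search(query: str, database: dict[str, str]) -> list[tuple[str, int]]:
    """Rank database entries by match count, best match first."""
    hits = [(name, best_ungapped_matches(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy database, for illustration only.
toy_database = {
    "seq_a": "ACGTACGTGA",
    "seq_b": "TTTTGGGGCC",
    "seq_c": "GGACGTTCAA",
}
ranking = search("ACGTT", toy_database)
```

Here `ranking` lists `seq_c` first, because it contains the query as an exact substring. Changing the scoring function changes which entry counts as the "best" match, which is the dependence on the search method described above.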
The scoring method determines the rules by which a set of sequences is considered similar or not. These are the two main methods to find the similarity between sequences:
Algorithms perform the searches. They aim to be effective by improving both the efficiency and the sensitivity of their results. Efficiency depends on the run time of the algorithm, while sensitivity depends on the algorithm's ability to find all true positive matches when comparing sequences. Different types of algorithms are used depending on the focus of the search. These are the following types:
One type of algorithm focuses on finding all possible solutions. These algorithms prioritize sensitivity and produce very accurate results; the downside is a long run time. The Smith-Waterman algorithm and approaches built on the Burrows-Wheeler transform are examples of these algorithms.
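As an illustration of why exhaustive methods are sensitive but slow, the following is a minimal sketch of Smith-Waterman local alignment scoring. It assumes a linear gap penalty and arbitrary illustrative score values (match +2, mismatch -1, gap -2); practical implementations typically use substitution matrices and affine gap penalties, and also trace back the alignment itself.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # A local alignment may restart anywhere, so scores never drop below zero.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


exact_score = smith_waterman("ACGT", "ACGT")
embedded_score = smith_waterman("GGACGTGG", "TTACGTTT")
```

Every cell of the matrix is filled, so the work for one pair of sequences grows with the product of their lengths. Repeating this for every entry in a large database is what makes the approach costly.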
Another type of algorithm favors faster run times over the quality of the results. These algorithms are used when the user needs a solution quickly and an acceptable, though not necessarily the most accurate, result is sufficient. FASTA and BLAST are examples of these algorithms.
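The speed of such methods comes largely from avoiding a full comparison against every database entry. The sketch below shows one common idea, indexing short words (k-mers) so that only entries sharing a word with the query are considered further. It is a simplified illustration of word-based seeding in general, not the actual procedure of FASTA or BLAST; the function names, the word length, and the toy data are assumptions.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every length-k word to the names of the sequences that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_hits(query: str, index: dict[str, set[str]], k: int) -> dict[str, int]:
    """Count, per database sequence, how many distinct query positions seed a shared word."""
    counts: dict[str, int] = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            counts[name] += 1
    return dict(counts)


# Hypothetical toy database, for illustration only.
kmer_database = {
    "seq_a": "ACGTACGTGA",
    "seq_b": "TTTTGGGGCC",
    "seq_c": "GGACGTTCAA",
}
kmer_index = build_kmer_index(kmer_database, k=3)
candidates = candidate_hits("ACGTT", kmer_index, k=3)
```

Sequences that share no word with the query, such as `seq_b` here, are never examined. This is also where sensitivity can be lost: a true match that happens to share no exact word of length `k` with the query is missed.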
Records in sequence databases are deposited from a wide range of sources, from individual researchers to large genome sequencing centers. As a result, the sequences themselves, and especially the biological annotations attached to these sequences, may vary in quality. There is much redundancy, as multiple labs may submit numerous sequences that are identical, or nearly identical, to others in the databases.
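The redundancy described above can be made concrete with a small sketch that groups records whose sequences are exactly identical. It assumes records are available as hypothetical (accession, sequence) pairs and ignores letter case; detecting nearly identical sequences would require similarity-based clustering, which this sketch does not attempt.

```python
def collapse_identical(records: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group accession identifiers by their (case-normalized) sequence."""
    groups: dict[str, list[str]] = {}
    for accession, sequence in records:
        key = sequence.upper()
        groups.setdefault(key, []).append(accession)
    return groups


# Hypothetical submissions, for illustration only.
submissions = [
    ("r1", "ACGT"),
    ("r2", "acgt"),   # same sequence as r1, different letter case
    ("r3", "ACGA"),
]
groups = collapse_identical(submissions)
redundant = {seq: ids for seq, ids in groups.items() if len(ids) > 1}
```

Any group with more than one accession represents the same sequence submitted more than once.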
Many annotations of the sequences are based not on laboratory experiments, but on the results of sequence similarity searches against previously annotated sequences. Once a sequence has been annotated based on similarity to others, and has itself been deposited in the database, it can also become the basis for future annotations. This can lead to a transitive annotation problem, because several such annotation transfers by sequence similarity may lie between a particular database record and actual wet-lab experimental information. Therefore, care must be taken when interpreting the annotation data from sequence databases.
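One way to reason about the transitive annotation problem is to count how many similarity-based transfers separate a record from experimental evidence. The sketch below assumes a hypothetical provenance table in which each record either points to the record its annotation was copied from or is marked as experimentally derived (`None`). Real databases do not necessarily store provenance in this form.

```python
from typing import Optional


def transfer_depth(record_id: str, inferred_from: dict[str, Optional[str]]) -> Optional[int]:
    """Count annotation transfers between a record and experimental evidence.

    Returns None when the chain is circular or refers to an unknown record,
    meaning no experimental source can be reached.
    """
    depth = 0
    seen: set[str] = set()
    current = record_id
    while True:
        if current in seen or current not in inferred_from:
            return None
        seen.add(current)
        source = inferred_from[current]
        if source is None:          # annotation backed by an experiment
            return depth
        depth += 1
        current = source


# Hypothetical provenance table, for illustration only.
provenance = {
    "P1": None,   # experimentally characterized
    "P2": "P1",   # annotated by similarity to P1
    "P3": "P2",   # annotated by similarity to P2
    "P4": "P5",   # P4 and P5 were annotated from each other
    "P5": "P4",
}
depth_p3 = transfer_depth("P3", provenance)
```

A larger depth means more opportunities for an error to have been propagated, and a missing chain (`None`) means the annotation cannot be traced back to an experiment at all.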