Contents
- News
- Citation Information
- Introduction
- Tutorial
- Command Reference
- Variable Reference
- Theory
- File Formats
- References
- Get the program
News
September 3, 2011: Version 1.0.1 binary created for i386 Mac OS X 10.6 Snow Leopard.
December 19, 2005: Version 0.9.9a binary created for Windows systems. See download section.
December 2005: Join the new sequoia mailing list! Keep informed about the latest sequoia news, and experiences from other users.
Citation Information
Studies which make use of SEQUOIA must cite the following reference:C.M. Bruns, I. Hubatsch, M. Ridderström, B. Mannervik, and J.A. Tainer (1999) Human Glutathione Transferase A4-4 Crystal Structures and Mutagenesis Reveal the Basis of High Catalytic Efficiency with Toxic Lipid Peroxidation Products. Journal of Molecular Biology 288(3): 427-439 [PubMed link]
Introduction
SEQUOIA is a computer program for the alignment of homologous protein sequences, and for the superpostion of homologous protein atomic coordinates. To get started aligning sequences, follow the examples in the tutorial section of this document.Philosophy
This program is intended to assist in the alignment of homologous protein sequences. It uses a conventional dynamic programming algorithm after the work of Needleman and Wunsch to find the globally optimal alignment given a particular residue comparison matrix and gapping model. I created this after being frustrated that I was unable to find a program which would align together two files of sequences, each of which contains pre-aligned sequences. Often in protein crystallography, we have structural information about how to align two protein sequences, which we wish to keep intact. Many of the available multiple alignment programs require a series of individual sequence files as input, and return a multiple sequence alignment as output, with very little opportunity for manual intervention along the way. In my experience, the results of these programs are almost always at odds with the structural alignments (i.e. wrong). I wrote this in an attempt to provide an alignment tool which the thoughtful modern protein biochemist might be able to use in conjunction with a variety of ancillary information. The sequence and matrix files can be modified easily with your favorite text editor.
Accolades, kudos, and other diplomatically worded bug reports should be sent to cmbruns@rotatingpenguin.com.
Registers
Understanding the organization of sequence information is crucial to the effective use of SEQUOIA. Most of the operations in this program involve aligning two collections of things together. The "things" in this context may be sequences, groups of aligned sequences, or three-dimensional protein structures. There are a total of six registers, or containers, for these items. The two primary ones, SEQ1 and SEQ2, contain sequences or alignments. A third register, ALIGN, usually contains the combined result of aligning SEQ1 with SEQ2. The remaining three, STRUCT1, STRUCT2, and OVERLAY, are analogous to the first three, except that they contain tertiary structures rather than simple sequences. To summarize:
- SEQ1
- The principal storage bin for sequences or alignments. This register can be filled by READing sequences from a file, or SETing them from another register. READing a structure into STRUCT1 also overwrites the SEQ1 register. The contents of SEQ1 can be output in sequence file format by WRITEing to a file or to the screen. The contents can be output in a more attractive presentation format with the PRINT command.
- SEQ2
- The other principal storage bin for sequences or alignments. READing a structure into STRUCT2 also overwrites the SEQ2 register. READ, SET, WRITE, and PRINT commands have the same uses with SEQ2 as they do with SEQ1.
- ALIGN
- This bin is used to store newly aligned sequences. Be aware that ALIGN is both a register and a command. This register is overwritten during the execution of ALIGN, SALIGN, and OVERLAY commands. Its capitalization pattern is modified by the EQUIVALENCE command. READ, SET, WRITE, and PRINT commands can be used with ALIGN, just as with SEQ1 and SEQ2.
- STRUCT1
- The principal storage bin for protein tertiary structures. Can be filled with the READ and SET commands. READing a file into STRUCT1 also overwrites SEQ1 with the sequence. WRITE and PRINT commands are synonymous, generating a PDB format atomic coordinate output.
- STRUCT2
- Like STRUCT1, the other principal storage bin for protein tertiary structures. READing a file into STRUCT2 also overwrites SEQ2 with the sequence. READ, SET, WRITE, and PRINT commands can be used the same way as for STRUCT1.
- OVERLAY
- The result of superposing STRUCT2 onto STRUCT1. This register is modified by the SUPERPOSE and OVERLAY commands. READ, SET, WRITE, and PRINT commands can be used the same way as for STRUCT1 and STRUCT2.
Acknowledgments
Christopher Putnam of the Scripps Research Institute has devoted some of his boundless energy and enthusiasm to improving this program. Without his suggestions this work would be far shabbier than it now is. John Tainer, Dave Hosfield, Terence Lo, Maria Thayer, and Cliff Mol also provided constructive advice.Tutorial
In the following examples, text following the "SEQUOIA>" prompt represent commands typed by the user. Other lines represent the output of the program. Commands may be typed in either upper or lower case letters.
Scenario 1: Aligning three sequences
In this example, the files test1.seq, test2.seq, and test3.seq contain sequence information in the SEQUOIA file format. A file called test.align is created in the same format, and the file test.pretty is created with a more aesthetically pleasing representation of the aligned sequences. What follows is the output of an actual alignment session.SEQUOIA multiple sequence alignment tool version 0.9.7 copyright (c) 1995, 1996, 1997, 1998, 1999, 2000 by Chris Bruns, Ph.D. The Scripps Research Institute, La Jolla, CA All rights reserved SEQUOIA> read SEQ1 test1.seq Opening file ``test1.seq'' Read in 1 sequence(s) SEQUOIA> read SEQ2 test2.seq Opening file ``test2.seq'' Read in 1 sequence(s) SEQUOIA> align Running TABULATE command... Mean residue alignment score = 1.82649e-07 with a standard deviation of 2.08357 Aligning sequences... 0 gaps added to make alignment Alignment score is 44.6169 SEQUOIA> print ALIGN 1) (w=1) test1 2) (w=1) test2 . . . . .50 1) MQFKHFKLATLAAALAFSANSFADIT- 26 2) -MKTSIRYALLAAALTAATPALADITV 26 * ***** **** SEQUOIA> set SEQ1 ALIGN copied ALIGN to SEQ1 SEQUOIA> read SEQ2 test3.seq Opening file ``test3.seq'' Read in 1 sequence(s) SEQUOIA> align Running TABULATE command... Mean residue alignment score = 9e-07 with a standard deviation of 3.5 Aligning sequences... 0 gaps added to make alignment Alignment score is 30 SEQUOIA> print ALIGN 1) (w=1) test1 2) (w=1) test2 3) (w=1) test3 . . . . .50 1) ---MQFKHFKLATLAAALAFSANSFADIT- 26 2) ----MKTSIRYALLAAALTAATPALADITV 26 3) MKLRISSLGPVALLASSMMLAFGAQAA--- 27 * ** * SEQUOIA> write ALIGN test.align Opening file ``test.align'' Wrote 3 sequence(s) SEQUOIA> print ALIGN test.pretty Opening file ``test.pretty'' SEQUOIA> print id ALIGN Percent sequence identities: 1 2 3 1) 100 40 16 2) 40 100 30 3) 16 30 100 SEQUOIA> quit Program terminated by user
The PRINT ID command displays the level of sequence identity among the aligned sequences. The PRINT commands are not required for the alignment process, but they show that nothing ridiculous is going on. Only the READ, ALIGN, SET, and WRITE commands in the above example are required to generate the alignment. Additional sequences can be added to the alignment by repeatedly SETing the previous SEQ1 to ALIGNment, READing the new sequence into SEQ2, and ALIGNing. The best strategy is to ALIGN the most similar sequences first. Don't forget to WRITE the final alignment when you are done.
The whole process could be done in batch mode by making a script file called test.inp, containing the lines:
read seq1 test1.seq read seq2 test2.seq align set SEQ1 ALIGN read seq2 test3.seq align write test.align quitThen you could invoke this file from within SEQUOIA by typing
@test.inpor, from a shell in UNIX or MSDOS:
sequoia < test.inp
Scenario 2: Aligning two sequences, assessing significance
Scenario 3: Overlaying two tertiary structures
step 1 - read in two coordinate files
SEQUOIA> read STRUCT1 test1.pdb SEQUOIA> read STRUCT2 test2.pdb
step 2 - create a preliminary alignment
SEQUOIA> align
step 3 - attempt an overlay using default parameters
SEQUOIA> overlay
step 4 - THINK - is this working?
If it is, go on to step 6. The OVERLAY step often fails for difficult problems, so this is one point where you may have to experiment to get things working. OVERLAY is a semi-automatic command which uses the commands SALIGN, STABULATE, SUPERPOSE, and EQUIVALENCE.
step 5 - Experiment until it starts working.
PROBLEM: not enough equivalences (usually falls to zero)
Try increasing DCUTOFF to, say, 10. If this works, try slowly lowering DCUTOFF toward a more reasonable value (4.0 is good).
SEQUOIA> align SEQUOIA> set dcutoff 10 SEQUOIA> overlay SEQUOIA> set dcutoff 8 SEQUOIA> overlay SEQUOIA> set dcutoff 6 SEQUOIA> overlay SEQUOIA> set dcutoff 4 SEQUOIA> overlay
If this sort of thing fails, try using a molecular graphics program to place the structures into approximate alignment. Then try something like:
SEQUOIA> read STRUCT1 test1.pdb SEQUOIA> read STRUCT2 test2.pdb SEQUOIA> set dcutoff 10 SEQUOIA> align SEQUOIA> salign SEQUOIA> equivalence SEQUOIA> overlayetc...
Some parameters you should consider adjusting to change the performance of the overlay procedure are GPEN, EPEN, USEANGLE, ACUTOFF, DCUTOFF, and RUNLENGTH. You may also wish to consider manually using the commands SALIGN, STABULATE, SUPERPOSE, and EQUIVALENCE; instead of using the more automatic OVERLAY command.
step 6 - repeat OVERLAY until the process converges.
SEQUOIA> overlay SEQUOIA> overlay SEQUOIA> overlayetc...
Eventually the structure based alignment will probably stabilize. The RMS deviations will remain about the same, and the number of aligned residues will remain the same. Examine the alignment to ensure that everything is the way you want it. If things have gone well you will have a reasonable least-squares superposition of the structures, and you will have a good starting point for a structural alignment.
DO NOT BLINDLY TRUST THIS ALIGNMENT. In my experience most features will be correct, but it is necessary for an intelligent and knowledgeable human to compare the structures and the alignment to make final adjustments.
Some parameters you should consider adjusting to change the performance of the overlay procedure are GPEN, EPEN, USEANGLE, ACUTOFF, DCUTOFF, and RUNLENGTH. You may also wish to consider manually using the commands SALIGN, STABULATE, SUPERPOSE, and EQUIVALENCE; instead of using the more automatic OVERLAY command.
step 7 - write out the transformed coordinates
SEQUOIA> write OVERLAY test2rot.pdb
step 8 - write out the structure-based sequence alignment
SEQUOIA> write ALIGN test1_2struct.align
Command Reference
Type "HELP" at the SEQUOIA prompt for on-line help
Commands:
- ALIGN
- ALIGN aligns the sequences in register SEQ1 with those in SEQ2 using the
current parameters. The result ends up in ALIGN. TABULATE will be automatically
run, if necessary.
- Usage: ALIGN
- COMMENT
- A line starting with COMMENT is ignored.
- SYNONYMS: COMMENT, #, [, {, /, REM
- CONSENSUS <sigma>
- CONSENSUS generates a weighted consensus sequence based on the contents of the ALIGN register, and places the result in the SEQ2 register. The consensus contains an X at positions where the best residue is less than sigma standard deviations above the mean for all residue types. The resulting consensus may be useful for identifying distant homologs in a BLAST search.
- EQUIVALENCE
- Capitalizes those residue codes in the ALIGN
register whose atomic coordinates are structurally equivalent.
Others are converted
to lower case. EQUIVALENCE is one step in the OVERLAY process.
The variables ACUTOFF, DCUTOFF, RUNLENGTH, and USEANGLE affect the behavior
of this command.
- Usage: EQUIVALENCE
- HELP
- HELP prints documentation about SEQUOIA commands.
- Usage: HELP
- EXAMPLE: HELP ALIGN
- SYNONYMS: HELP, ?, MAN
- OPTIMIZE
- Iteratively optimize the current alignment. One by one, each sequence is removed, degapped, and realigned. Should remove the most egregious flaws like "----K---" within a sequence. The sequences are realigned in random order. Multiple OPTIMIZE commands may be used to further improve the alignment. Try using the SPLIT command manually for more precise realignment.
- OVERLAY
- OVERLAY iteratively superposes and realigns two coordinate files to convergence. OVERLAY calls the EQUIVALENCE, SUPERPOSE, and SALIGN commands. It uses the ALIGN register to SUPERPOSE STRUCT2 onto STRUCT1. EQUIVALENCE residues are reassigned and the process is repeated to convergence. Finally, ALIGN is replaced with the new SALIGNment of STRUCT1 with STRUCT2.
- PRINT
- PRINT is used to display readable formatted data from SEQUOIA.
- PRINT <register> displays a formatted verion of a sequence alignment or structure. The formatted sequence alignment CANNOT be READ into SEQUOIA. Use the WRITE command to make a native sequence file.
- PRINT MATRIX displays the current comparison matrix
- PRINT ID <register> displays a table of sequence identities.
- SUBCOMMANDS: ID, MATRIX
- QUIT
- QUIT causes SEQUOIA to terminate.
- SYNONYMS: QUIT, BYE, END, EXIT, HALT, KILL, Q, STOP
- READ
- READ is used to load data into SEQUOIA from files.
- READ
loads structure data into a register. - READ MATRIX
loads a comparison matrix. (The matrix "BLOSUM62" is automatically loaded by default.) - RANDOM <iterations>
- Calculates alignment scores with SEQ2 randomly shuffled. This is used to provide one estimate of alignment significance Subsequent runs use the same randomizations, so that statistics may be accurately compared between runs. [For example, the "delta sigma" score reported can give an accurate comparison of the significance of two sequential runs between which the gap penalty has been changed.] The "iterations" argument specifies the number of randomizations to perform. This should definitely be run in batch mode for large values. The RANDOM_SEED variable can be modified to yield a different set of randomizations.
- RUN
- RUN <filename> executes SEQUIOA commands contained in a file.
- SYNONYMS: RUN, @
- This feature can be used recursively.
- SALIGN
- SALIGN aligns the sequences in registers SEQ1 with those in SEQ2 based upon the superposed atomic coordinates , and places the result in the ALIGN register.
- Usage: SALIGN
- SET
- SET
assigns a value to a SEQUOIA variable. - EXAMPLE: SET GPEN 5.0
- SET
copies the contents of into (this supersedes the old COPY command) - EXAMPLE: SET SEQ1 ALIGN
- SET with no arguments lists the values of all variables.
- SYNONYMS: SET, LET
- SPLIT <integer value>
- Extracts a single sequence (defaults to 1) from ALIGN, remove gaps, and place in SEQ1. The other sequences in ALIGN are placed in SEQ2. This process is used by the OPTIMIZE command.
- Usage: SPLIT
- STABULATE
- Fills the scoring table with values used for structural overlay with the SALIGN command. (This is ordinarily done automatically.)
- SUPERPOSE
- Replaces OVERLAY with STRUCT1, plus STRUCT2 rotated and translated so as to minimize the squared deviation of residues declared equivalent by capitalization in the ALIGN register.
- SYSTEM
- SYSTEM <command> executes a command in the local operating system.
- EXAMPLE: SYSTEM ls *.pdb
- SYNONYMS: SYSTEM, !
- TABULATE
- Creates the scoring matrix using SEQ1, SEQ2, and the comparison matrix for subsequent use by the ALIGN command. (This is ordinarily done automatically.)
- WEIGHT Applies a weighting scheme to all of the sequence registers to emphasize the more diverse sequences. Within each alignment, each sequence is weighted as follows:
weight(i) = 1 / (SUM(over j) Proximity(i,j))
For now the Proximity between sequences i and j is defined as the sequence identity. The weights are then normalized so that the smallest weight in each sequence is unity. This weighting will ordinarily improve the significance of alignments between large groups of distantly related sequences. If you ever find a case where it hurts, please let me know.- WRITE <register> [<filename>]
- Creates sequence or coordinate files from SEQUOIA.
- Usage: HELP
Variable Reference
Variables used in the current version of SEQUOIA are shown below. SEQUOIA variables are NOT case sensitive. Most changes to variables can be done with the SET command.Variables:
- ACUTOFF
- In order to be considered structurally equivalent, the orientations
of two overlaid residues must differ by less than this value.
- See also: ACUTOFF, DCUTOFF, RUNLENGTH
- Units: degrees
- Default value = 45
- See also: ACUTOFF, DCUTOFF, RUNLENGTH
- DCUTOFF
- In order to be considered structurally equivalent, the distance between
of two overlaid residues must be less than this value.
- See also: ACUTOFF, DCUTOFF, RUNLENGTH
- Units: Angstroms
- Default value = 4.5
- See also: ACUTOFF, DCUTOFF, RUNLENGTH
- ECHO
- When set, user commands are echoed to the screen. This is useful
when running scripts, if your commands do not appear in the log file.
- Units: boolean (0 or 1)
- Default value = 0
- Units: boolean (0 or 1)
- EPEN
- Gap extension penalty. In order to reduce spurious placement of large
gaps during alignment, this value is subtracted from the alignment score for
each gap character that is inserted during alignment.
Thus longer gaps incur higher penalties, unlike the
case with GPEN. Higher values result in smaller and fewer gaps.
A larger value may work in
conjunction with a smaller value of GPEN.
- Units: standard deviations of residue scores
- Default value = 0.01
- Units: standard deviations of residue scores
- GPEN
- Gap penalty. In order to reduce spurious placement of
gaps during alignment, this value is subtracted from the alignment score for
each gap (of any size) that is inserted during alignment.
Higher
values result in fewer gaps.
- Units: standard deviations of residue scores
- Default value = 10
- Units: standard deviations of residue scores
- PRETTY_LENGTH
- Controls the width of the output for the PRINT <register> command to
display formatted sequence alignments.
- Units: characters
- Default value = 50
- Units: characters
- RANDOM_SEED
- Changes the randomization seed for the RANDOM command.
The value of RANDOM_SEED determines the sequence of random
numbers used in the RANDOM and OPTIMIZE
commands. Leave this the same value in order to compare identical randomizations
with different gpen values in RANDOM, for example,
so that the same set of shufflings will be performed every time.
If you change the seed between randomization runs, the "delta sigma"
statistic will be much less meaningful.
- Units: positive integer
- Default value = 1
- Units: positive integer
- RUNLENGTH
- RUNLENGTH or more equivalent residues must occur in a row, in order to be considered
structurally equivalent.
- See also: ACUTOFF, DCUTOFF, RUNLENGTH
- Units: residues
- Default value = 4
- See also: ACUTOFF, DCUTOFF, RUNLENGTH
- SUBOPTIMAL
- During alignment, SEQUOIA will randomly choose among alignment paths differing
by less than this value. A simulated annealing strategy for alignment optimization
can be performed by gradually decreasing the value of this parameter.
- Units: standard deviations of residue scores
- Default value = 0.1
- Units: standard deviations of residue scores
- USEANGLE
- When set, residue orientations are used to determine structural equivalences.
This should be set to true, once an initial overlay has been generated.
- Units: boolean (0 or 1)
- Default value = 0
- Units: boolean (0 or 1)
Theory
Global sequence alignment
Most modern computational sequence alignment methods can be placed into one of two categories: local and global alignment methods. SEQUOIA uses a global alignment algorithm.
Both classes of techniques use some sort of comparison matrix to determine which amino acid residues are "similar". Local methods find continuous segments of similar residues between two protein sequences. The equivalent segments will pair up residues that have high comparison scores, so that the each segment has a high total sum.
Global alignment methods align two entire sequences together so as to maximize the total sum of the comparison scores of all aligned residues, not just those in the most similar continuous segments. To accomplish this, gaps are ordinarily inserted into the sequences to account for insertions and deletions in the sequences during the course of evolution.
Computational complexity
How do you calculate which is the very highest scoring sequence alignment between two sequences? One possibility would be to explicitly generate every possible alignment, calculate each score, and save the highest scoring one. Unfortunately, the total number of alignments is quite large. If n is the number of residues in the sequences, the total number of alignments is proportional to 2 raised to the n power, or n factorial, or something like that. Whichever it is, it turns out to be greater than the number of particles in the universe for reasonable sized cases. I don't care how fast your computer is, it cannot do this.
Fortunately the global alignment problem can be broken down into a series of subproblems, so that the algorithmic technique of dynamic programming can be used. This means that in this case the optimal solution can be found in time proportional to n squared, which is accessible even on modest computers (Perhaps n cubed for more complicated gapping schemes -- more complicated than anything currently in SEQUOIA). It is important to realize that both memory and time requirements are proportional to the square of the sequence length. That is, if you change to sequences that are a bit more than three times as long, the time and memory requirements will increase by a factor of ten. But it is still far superior to that "number of particles in the universe" monkey business.
Gapping and gap penalties
Proteins tend to accumulate insertions and deletions of stretches of amino acids during the course of evolution. This means that two homologous proteins which have diverged significantly will have regions of insertion and deletion relative to one another. This means that the "true" alignment between the two protein sequences will have to account for these insertion and deletion events. Each such event is usually represented by a gap in one of the aligned sequences.
File Formats
Sequence file format
I have attempted to keep the file format simple. What follows is a canonical sequence file in a format readable by SEQUOIA.
> A glycine rich sequence GGDFHINMVGRGEGLPWGTGQASGDGPF RGHTGLPGDV*The title of the sequence goes on its own line with a ">" character in the first position. The sequence information then follows on subsequent lines in free format until the "*" character, or the next ">" character, indicating a new sequence. In the output files, the entire actual sequence is shoved onto one line, for ease of editing the alignment. If the first character of the sequence is a '#', then it should be followed by digits, to indicate the number of the first residue. Thus a sequence which begins with residue number 17 would look like:
> partial sequence #17gvcsnflqwertyiop
Comparison matrix file format
References
- Dayhoff, M. O. (1978) Atlas of protein sequence and structure. (Natl. Biomed. Res. Found. Washington), Vol. 5, Suppl. 3.
- Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919
- Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453
Executable files
Preliminary test versions of SEQUOIA are available for the following platforms:- Windows[Download executable]
version 0.9.9a - Mac OS X
(you may need to type "chmod +x sequoia101osxi386" after downloading)
- [Download executable] Mac OS X 10.6 Snow Leopard (Intel i386)
sequoia version 1.0.1 - [Download executable] Mac OS X 10.4 Tiger (PowerPC).
sequoia version 0.9.9a
- [Download executable] Mac OS X 10.6 Snow Leopard (Intel i386)
- PC Linux ELF 2.4 [Download executable]
version 0.9.9a - SGI IRIX5.3 [Download executable]
version 0.9.7 - SGI IRIX6.5 [Download executable]
version 0.9.9 - SunOS 5.6[Download executable]
version 0.9.7 - alpha OSF1 V4.0[Download executable]
version 0.9.9 -- Jan 4, 2003
version 1.0.1 September 3, 2011 - created new binary for Mac OS X 10.6 (i386 Snow Leopard)
version 0.9.9a December 6, 2005 - removed time limit license
version 0.9.9a December 6, 2005 - created new binaries for Mac OS X Tiger (10.4/PPC) and Linux
version 0.9.9 December 13, 2002 - compiled new version for IRIX 6.5
version 0.9.9 December 12, 2002 - compiled new version for MSDOS/windows
version 0.9.9 June 28, 2002 - set alignment score to zero when one structure contains no protein
version 0.9.7 May 08, 2000 - replaced broken RANDOM command. Significance is now reported with respect to extreme-value distribution.
version 0.9.6 - removed output buffering. gpen is updated before every alignment
version 0.9.5 May 04, 2000 - restored EQUIVALENCE command to former functionality. SUN version works again.
version 0.9.4 December 1999 - updated expiration date. Partially implemeted equivalence command is broken.
version 0.9.3b. New features include command line history and editing, plus a bug fix for cases where a D- amino acid could cause structural alignment to fail ungracefully.