LAMA help

What does LAMA do?

LAMA (Local Alignment of Multiple Alignments) is a program for comparing protein multiple sequence alignments with each other. The program can search databases of such multiple alignments. The search is for sequence similarities between conserved regions of protein families. The method is sensitive, detecting weak sequence relationships between protein families. Sequence similarities beyond the range of conventional sequence database searches can be detected by the method.

What can LAMA do for me?

LAMA can identify protein families similar to your protein(s) of interest and protein motifs similar to conserved regions in your protein(s). The information known about these similar families and motifs can help you identify the function and structure of your protein and locate critical conserved regions in your protein(s). This can direct you in designing experiments to test your hypotheses.

LAMA compares multiple sequence alignments of proteins. If you have only a single protein sequence you first need to find other members of its family. The protein sequences also need to be multiply aligned. The Content of input section explains how to find related sequences and align them.

How does LAMA align blocks?

The multiple alignments are first transformed into position-specific scoring matrices (PSSMs). Each column in the PSSM corresponds to a position in the alignment and holds the amino acid distribution of that position. The transformation uses position-based sequence weights (Henikoff & Henikoff, 1994a) and odds ratios between the amino acid frequencies observed in the multiple alignments and the frequencies expected from protein databases (Henikoff & Henikoff, 1995). The sequence weighting corrects for possible overrepresentation of some sequences, and the odds ratios account for the background frequencies of the amino acids. The method was tested and calibrated with ungapped local multiple alignments (blocks) from the Blocks Database.
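
The transformation described above can be sketched roughly as follows. This is a minimal illustration, not the LAMA code: the function names are hypothetical, the background frequencies default to a uniform 1/20 (an assumption; the program uses frequencies expected from protein databases), the block is assumed to contain only the 20 standard amino acids, and any further refinements the program applies are omitted.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_based_weights(block):
    """Position-based sequence weights for an ungapped block.

    block: list of equal-length sequence segments (strings).
    In each column a residue contributes 1 / (k * n), where k is the number
    of different residues in the column and n is the number of sequences
    sharing that residue. Weights are normalized to sum to 1.
    """
    weights = [0.0] * len(block)
    for pos in range(len(block[0])):
        column = [seq[pos] for seq in block]
        counts = Counter(column)
        kinds = len(counts)
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (kinds * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

def odds_ratio_pssm(block, background=None):
    """Return one column of 20 odds ratios per block position."""
    if background is None:
        # Assumption for illustration only: uniform background frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    weights = position_based_weights(block)
    pssm = []
    for pos in range(len(block[0])):
        freqs = {aa: 0.0 for aa in AMINO_ACIDS}
        for seq, w in zip(block, weights):
            freqs[seq[pos]] += w  # weighted observed frequency
        pssm.append([freqs[aa] / background[aa] for aa in AMINO_ACIDS])
    return pssm
```

In a block such as `["ACD", "ACE", "ACE", "GCD"]` the last sequence receives the largest weight, because it is the only one with G in the first position.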

The matrices are treated as sequences of columns, enabling their alignment with one another. To use algorithms developed for aligning single sequences we need a measure for comparing pairs of matrix columns. This corresponds to the substitution matrices (PAM, BLOSUM etc.) used in single-sequence alignments. The measure used in our method to score the similarity between pairs of matrix columns is the Pearson correlation coefficient (r):

$$ r = \frac{\sum_i \left(A(i) - \bar{A}\right)\left(B(i) - \bar{B}\right)}{\sqrt{\sum_i \left(A(i) - \bar{A}\right)^2 \; \sum_i \left(B(i) - \bar{B}\right)^2}} $$

where A(i) and B(i) are the values of amino acid i in columns A and B, respectively, and $\bar{A}$ and $\bar{B}$ are the means of the values in columns A and B. The correlation score ranges from 1 for columns with identical amino acid distributions to -1 for columns with opposite distributions (for example, when only 10 amino acids occur in each column and those 10 amino acids are different in the two compared columns).

The score of a block-to-block alignment is the sum of the scores from comparing the corresponding columns in the two block matrices:

Figure: Local alignment of blocks. Positions 2 to 7 from block A are aligned with positions 4 to 9 from block B. A column comparison score, s(Xn*Ym), is calculated for each pair of positions (A2*B4 to A7*B9). The score of the alignment of the two segments, S, is the sum of the column comparison scores.

The alignment is done using the Smith-Waterman algorithm for optimal local alignments. No gaps are allowed since the aligned objects are short conserved sequence regions. All alignments above the cutoff score are reported for each pair of compared blocks. There may be cases where parts of one long block are similar to several blocks:
	AAAAAAAAAAAAAAAAAAA
	 BBB       CCCCCC
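
Because no gaps are allowed, an optimal local alignment always lies along a single diagonal of the column-score matrix, so it can be found by scanning each diagonal for its highest-scoring contiguous run. The sketch below takes that approach. The function name and the precomputed `scores` matrix are assumptions for illustration, positions are zero-based, and only the single best segment is returned, whereas LAMA reports every alignment above the cutoff.

```python
def best_ungapped_local_alignment(scores):
    """Find the highest-scoring ungapped local alignment of two blocks.

    scores[i][j] is the column comparison score between position i of
    block A and position j of block B.
    Returns (sum_of_column_scores, start_in_a, start_in_b, length).
    """
    rows, cols = len(scores), len(scores[0])
    best = (0.0, 0, 0, 0)
    # Each diagonal is identified by offset = j - i.
    for offset in range(-(rows - 1), cols):
        i = max(0, -offset)
        j = i + offset
        running, start = 0.0, i
        while i < rows and j < cols:
            if running <= 0.0:
                # A non-positive prefix never helps: restart the segment here.
                running, start = 0.0, i
            running += scores[i][j]
            if running > best[0]:
                best = (running, start, start + offset, i - start + 1)
            i += 1
            j += 1
    return best
```

The returned score is the raw sum of column scores over the aligned segment.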

Input for LAMA

Content of input

LAMA can compare any multiple alignment if it is in the correct format. However, the column comparison measure and the significance estimation of the scores are appropriate for protein sequence blocks, that is, ungapped conserved multiple alignments. The use of other types of multiple alignments, such as global multiple alignments that include many gaps, may give misleading results. For example, the resulting alignments may not be optimal, or their significance may differ from what the output suggests.

If you only have a single protein sequence or want to find more protein sequences related to yours you can search the sequence databases. One way to do this on the WWW is using the BLAST program to search the NCBI sequence databases. Links to other search methods can be found at the Baylor College of Medicine Human Genome Center Search Launcher site.

The BlockMaker WWW site can be used for finding blocks in your group of related protein sequences. There are various other methods for making protein multiple sequence alignments. Among these are the MEME system, Gibbs sampling programs, the MACAW interactive program, and the CLUSTAL-W progressive multiple alignment program. Several of these methods are available through the multiple sequence alignment page at the Baylor College of Medicine Human Genome Center.

Multiple alignments submitted to the program should be of conserved, relatively ungapped, protein sequence regions. A few gaps in the alignment are acceptable. The more sequences in the alignment the better. In general, avoid alignments with fewer than 4 sequences.

Format of input

LAMA only accepts input in the Block format. Other multiple alignments can be reformatted to the Block format. If you are not sure of your multiple alignment, or just have a group of related sequences, you can use the BlockMaker program to find blocks in the sequences. Note that blocks include sequence weights to avoid biassed sequence representation.

Output options

Some of the examples included in this document illustrate the use of the options.

Output from LAMA

Content and format of output:


LAMA version 1.00 October 96.
Minimal length of reported alignments 4
Score cutoff is 5.6 Z score units (in the top 7.7e-05 percentile of chance scores)

                                        alignment           Z-score  expected number for
block 1   from:to     block 2   from:to  length                      searching 5000 blocks
BL01063B  20 : 46 and BL00042B   3 : 29   (27)    score 39 ( 7.2     1.3e-02) [alignment Logos?]
BL01063B   5 : 39 and BL00324C   3 : 37   (35)    score 27 ( 6.1     1.5e-01) [alignment Logos?]
BL01063B  12 : 47 and BL00622    8 : 43   (36)    score 33 ( 8.2     0.0e+00) [alignment Logos?]
BL01063B  10 : 46 and BL00894A   1 : 37   (37)    score 26 ( 5.7     3.2e-01) [alignment Logos?]
BL01063B   4 : 42 and BL01043A   2 : 40   (39)    score 29 ( 8.1     0.0e+00) [alignment Logos?]

The program version and execution parameters head the search output. Only alignments of at least the minimal length are reported, because the significance of very short alignments (fewer than 4 positions) cannot be reliably estimated. Alignments with scores equal to or above the score cutoff are reported. The score cutoff is specified as a Z score, which is the number of standard deviations between the score and the mean score. The mean score and the standard deviation were calculated from the chance scores obtained by aligning a large number of shuffled unbiassed blocks (7 million block pairs; see first supplement). The Z score is related to the percentile of the score among the shuffled block scores. This dependence is not linear but sigmoidal (see second supplement).
For each reported alignment the program shows the names of the two aligned blocks, their position relative to one another, the alignment length, the score, and the expected number of such scores when searching a given number of blocks. The expected number is for chance (random) alignments of unbiassed blocks. It is calculated from the score percentiles of the alignments between the shuffled unbiassed blocks. In this example the expected number is for searching 5000 blocks. Blocks from the Blocks Database and from the Prints database are linked to the database entries. The "alignment" link shows the alignment of the two blocks. This can also be seen by following the "Logos" link, which shows the sequence logos of the aligned pair of blocks. Sequence logos are graphical representations of the blocks. For example, in the logos for the third alignment (PostScript viewer required) the logo of block BL00622 is shifted 4 positions relative to the logo of block BL01063B so that their similar segments (8-43 and 12-47) are aligned. Indeed, these segments both contain helix-turn-helix DNA binding motifs.

When both query and target blocks are provided by the user the output can also contain the column scores of each reported alignment and the PSSMs of every compared block.

Pay attention to any error or warning messages. Most will probably have to do with the format of the input.

Evaluating LAMA alignment scores

The alignment score is the average of the column scores in the alignment multiplied by 100. Since the column scores have a range of -1 to 1 the alignment score will range from -100 to 100. An alignment score of 46 means that on average the aligned positions had a correlation coefficient of 0.46.

The significance of the alignment score depends on the length of the compared blocks. Alignments between longer blocks will tend to be longer and have higher scores. The Z score and expected number let us estimate the significance of the scores and compare alignments of different lengths. The higher the Z score the less likely the alignment is due to chance. How unlikely depends on the number of blocks searched: the more blocks searched, the greater the probability of finding high scores by chance. For example, the output of the calibration with the shuffled blocks contained 7 million scores but no alignments with Z scores greater than 8.3. Hence an alignment with a Z score equal to or higher than that is unlikely to arise by chance in a comparable or smaller number of alignments.

The expected number shows this directly. The expected number is shown for searching 5000 blocks, since version 9.1 of the Blocks Database contains 3300 blocks. For example, searching this release of the Blocks Database and finding an alignment expected to appear 1.8e-01 times (0.18) suggests that this alignment is not due to chance. Alignments with expected occurrences of 7.5e-03 or even 0 are almost certainly genuine (or due to biassed blocks, see below).
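
As a small worked illustration of the reported alignment score, the sketch below averages the column scores of an alignment and scales the result by 100. The function name is hypothetical, and rounding to a whole number is an assumption based on the integer scores shown in the sample output.

```python
def alignment_score(column_scores):
    """Reported alignment score: mean column correlation times 100."""
    mean_correlation = sum(column_scores) / len(column_scores)
    # Assumption: the reported score is rounded to the nearest integer.
    return round(100 * mean_correlation)

# Three aligned positions with an average correlation of 0.46.
print(alignment_score([0.5, 0.4, 0.48]))
```
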
A relation between two families by a single pair of blocks with a high Z score is termed a single hit. However, protein families often have a number of blocks. A multiple hit is when two or more block pairs from the same two families are similar:
                                               multiple hit
     Family 1, blocks 1A, 1B, 1C, 1D.         1A=2B + 1D=2C
     Family 2, blocks 2A, 2B, 2C.
We expect the order of the blocks in the hit to be the same in both families (in this example 1A -> 1D and 2B -> 2C).
Individual block pairs whose Z scores would be likely by chance on their own can still indicate a genuine relation if they are part of a multiple hit. While the shuffled block scores contained no single hit with a Z score above 8.3, they also contained no multiple hits with Z scores above 5.6. Hence genuine relationships can also be indicated by several alignments whose Z scores are individually expected to occur by chance.

When comparing blocks against a database the Z score cutoff is set to 5.6, corresponding to an expected occurrence of 0.385 when searching 5000 blocks. When both query and target blocks are provided other cutoffs can be chosen.
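
The figures quoted in this document are consistent with the expected number being the chance fraction of a score multiplied by the number of blocks searched: the sample output header gives 7.7e-05 for the 5.6 cutoff, and 7.7e-05 x 5000 = 0.385. A minimal sketch assuming that relation (the function name is hypothetical):

```python
def expected_number(chance_fraction, blocks_searched):
    """Expected number of chance alignments scoring at least this high.

    chance_fraction: fraction of chance (shuffled-block) scores at or above
    the score in question, for example 7.7e-05 for a Z score of 5.6.
    blocks_searched: number of blocks compared against the query block.
    """
    return chance_fraction * blocks_searched

# The default cutoff, for a search of 5000 blocks.
print(expected_number(7.7e-05, 5000))
```

Under the same assumption, searching a smaller database lowers the expected number proportionally.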

False positive (high score but no relation) and false negative (low score but genuine relation) hits are still possible, so biological knowledge and common sense should be used. Compositionally biassed blocks (consisting of sequence segments rich in a few amino acids or short repeats) are a common cause of false positive hits. A separate tool is available for checking whether a block is biassed. False negative hits can be caused by misalignment in the blocks.

Each entry in the Blocks Database version 8.6 (3174 blocks from 858 protein families) was searched against the other entries in the database. All block pairs with Z scores larger than 5.6 were saved. Protein families related by more than one saved score were considered multiple hits, and alignments with Z scores above 8.3 were considered single hits. This resulted in 141 pairs of families. Eighty percent of these were identified as genuine relationships (true positives) according to the family descriptions, by sharing common sequences, or by detailed examination. Compositional bias was responsible for another eight percent of the high scores. The remaining twelve percent of the high scores could not be classified as either genuine or false based on available evidence.
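
The selection rule described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the family name is assumed to be the block name without its trailing letter (as in BL01063B), and the sketch does not check that the blocks of a multiple hit appear in the same order in both families.

```python
from collections import defaultdict

MULTIPLE_HIT_CUTOFF = 5.6  # Z score above which block pairs are saved
SINGLE_HIT_CUTOFF = 8.3    # Z score above which one pair is a single hit

def family_of(block_name):
    # Assumption: a block name is a family accession plus an optional
    # trailing block letter, e.g. "BL01063B" belongs to family "BL01063".
    return block_name.rstrip("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

def classify_hits(alignments):
    """Classify family pairs as 'multiple' or 'single' hits.

    alignments: iterable of (block1, block2, z_score) tuples.
    Returns a dict mapping (family1, family2) to the hit type. Family pairs
    with only one saved score at or below the single-hit cutoff are omitted.
    """
    saved = defaultdict(list)
    for block1, block2, z in alignments:
        if z > MULTIPLE_HIT_CUTOFF:
            pair = tuple(sorted((family_of(block1), family_of(block2))))
            saved[pair].append(z)
    hits = {}
    for pair, z_scores in saved.items():
        if len(z_scores) > 1:
            hits[pair] = "multiple"
        elif z_scores[0] > SINGLE_HIT_CUTOFF:
            hits[pair] = "single"
    return hits
```

A pair of families linked by two block pairs with Z scores of 6.0 and 5.9 is reported as a multiple hit, while a lone block pair with a Z score of 6.5 is not reported at all.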

Distribution of top scoring family pairs

Relation type              Genuine(1)   Biassed       Unknown   Total
                                        composition
Multiple block hits
  - independent(2)             24           -             1        25
  - repeats(3)                 11           6             9        26
  - inner repeats(4)           15           4             2        21
Single block hits              63           1             5        69
Total                         113          11            17       141
Fraction                      80%          8%           12%

(1) Genuine relations were identified by the families' PROSITE descriptions,
    by detailed analysis of the literature, or by sharing common sequences
    (22 of the single and independent-multiple hits).
(2) An independent multiple hit is two different protein families 
    related by two or more different block pairs.
(3) A repeat multiple hit is two different protein families where a 
    block from one family is similar to two or more blocks from the 
    other family.
(4) An inner-repeat multiple hit is a case where the similarities are 
    between blocks from the same family.

Examples

Supplements

To calibrate the LAMA scores the Blocks Database was purged of biassed blocks, the PSSMs of the remaining blocks were each shuffled, and the shuffled PSSMs were then compared against the blocks from the unshuffled database. The best score from each of the resulting 7 million comparisons was saved. These scores are due to chance and were used to estimate the significance of alignment scores between blocks. The mean and variance of chance alignments depend on the length of the compared blocks. Longer blocks will give longer alignments and higher scores by chance alone. Grouping the chance scores by the length of the shorter block in each comparison gave very similar score distributions. The mean and standard deviation of each group were used to transform each score into a Z score. The percentiles of all these Z scores were then calculated. These percentiles are used to estimate the expected number of times each score should appear by chance rather than because of a genuine relationship.
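
The grouping and Z score transformation described above can be sketched as follows. The function names and the input layout are assumptions for illustration, as are the use of one group per exact block length and of the population standard deviation; the document does not state which were used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def calibrate(chance_scores):
    """Summarize chance scores by the length of the shorter block.

    chance_scores: iterable of (shorter_block_length, raw_score) pairs
    taken from comparisons against shuffled blocks.
    Returns a dict mapping each length to (mean, standard deviation).
    """
    groups = defaultdict(list)
    for length, score in chance_scores:
        groups[length].append(score)
    return {length: (mean(vals), pstdev(vals)) for length, vals in groups.items()}

def z_score(raw_score, shorter_block_length, calibration):
    """Number of standard deviations between a score and the chance mean."""
    mu, sd = calibration[shorter_block_length]
    return (raw_score - mu) / sd
```

Pooling the Z scores from all length groups then gives a single distribution from which percentiles can be read.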

Following are links to tables with this data. Note that the scores in the tables are the raw scores of the alignments. The scores shown in the LAMA output are normalized by dividing the raw score by the alignment length.

Credits and citation

The multiple alignment comparison method and LAMA program were developed by Shmuel Pietrokovski in the lab of Steve Henikoff at the Fred Hutchinson Cancer Research Center, Seattle.

An article describing the method and its uses
"Searching Databases of Conserved Sequence Regions by Aligning Protein Multiple-Alignments"
appeared in Nucleic Acids Research 24(19) 3836-3845 (October 1996). This article should be cited in research using this method.


Page last modified January 1997 (thanks to Liz G.Wiz for useful comments)
Shmuel Pietrokovski