
                                 fprotdist 



Wiki

   The master copies of EMBOSS documentation are available at
   http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

   Please help by correcting and extending the Wiki pages.

Function

   Protein distance algorithm

Description

   Computes a distance measure for protein sequences, using maximum
   likelihood estimates based on the Dayhoff PAM matrix, the JTT matrix
   model, the PMB model, Kimura's 1983 approximation to these, or a model
   based on the genetic code plus a constraint on changing to a different
   category of amino acid. The distances can also be corrected for
   gamma-distributed and gamma-plus-invariant-sites-distributed rates of
   change in different sites. Rates of evolution can vary among sites in
   a prespecified way, and also according to a Hidden Markov model. The
   program can also make a table of percentage similarity among
   sequences. The distances can be used in the distance matrix programs.

Algorithm

   This program uses protein sequences to compute a distance matrix,
   under five different models of amino acid replacement. It can also
   compute a table of similarity between the amino acid sequences. The
   distance for each pair of species estimates the total branch length
   between the two species, and can be used in the distance matrix
   programs FITCH, KITSCH or NEIGHBOR. This is an alternative to use of
   the sequence data itself in the parsimony program PROTPARS.

   The program reads in protein sequences and writes an output file
   containing the distance matrix or similarity table. The five models of
   amino acid substitution are one which is based on the Jones, Taylor
   and Thornton (1992) model of amino acid change, the PMB model
   (Veerassamy, Smith and Tillier, 2004) which is derived from the Blocks
   database of conserved protein motifs, one based on the PAM matrixes of
   Margaret Dayhoff, one due to Kimura (1983) which approximates it based
   simply on the fraction of similar amino acids, and one based on a
   model in which the amino acids are divided up into groups, with change
   occurring based on the genetic code but with greater difficulty of
   changing between groups. The program correctly takes into account a
   variety of sequence ambiguities.

   The five methods are:

   (1) The Dayhoff PAM matrix. This uses Dayhoff's PAM 001 matrix from
   Dayhoff (1979), page 348. The PAM model is an empirical one that
   scales probabilities of change from one amino acid to another in terms
   of a unit which is an expected 1% change between two amino acid
   sequences. The PAM 001 matrix is used to make a transition probability
   matrix which allows prediction of the probability of changing from any
   one amino acid to any other, and also predicts equilibrium amino acid
   composition. The program assumes that these probabilities are correct
   and bases its computations of distance on them. The distance that is
   computed is scaled in units of expected fraction of amino acids
   changed. This is a unit such that 1.0 is 100 PAM's.

   (2) The Jones-Taylor-Thornton model. This is similar to the Dayhoff
   PAM model, except that it is based on a recounting of the number of
   observed changes in amino acids by Jones, Taylor, and Thornton (1992).
   They used a much larger sample of protein sequences than did Dayhoff.
   The distance is scaled in units of the expected fraction of amino
   acids changed (100 PAM's). Because its sample is so much larger this
   model is to be preferred over the original Dayhoff PAM model. It is
   the default model in this program.

   (3) The PMB (Probability Matrix from Blocks) model. This is derived
   using the Blocks database of conserved protein motifs. It is
   described in a paper by Veerassamy, Smith and Tillier (2004).
   Elisabeth Tillier kindly made the matrices available for this model.

   (4) Kimura's distance. This is a rough-and-ready distance formula for
   approximating PAM distance by simply measuring the fraction of amino
   acids, p, that differs between two sequences and computing the
   distance as (Kimura, 1983) D = -log_e(1 - p - 0.2 p^2). This is very
   quick to do but has some obvious limitations. It does not take into
   account which amino acids differ or to what amino acids they change,
   so some information is lost. The units of the distance measure are
   fraction of amino acids differing, as also in the case of the PAM
   distance. If the fraction of amino acids differing gets larger than
   0.8541 the distance becomes infinite.
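
   Kimura's formula can be sketched in a few lines of Python. This is an
   illustration only, not the code fprotdist runs; the function names are
   hypothetical. It also recovers the 0.8541 limit as the positive root of
   1 - p - 0.2 p^2 = 0.

```python
import math


def kimura_protein_distance(p):
    """Kimura (1983) distance from the fraction p of differing amino acids."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        # The argument of the logarithm is no longer positive,
        # so the distance is infinite.
        return math.inf
    return -math.log(inner)


def kimura_saturation_point():
    """Positive root of 1 - p - 0.2 p^2 = 0 (quadratic formula)."""
    return (-1.0 + math.sqrt(1.0 + 4.0 * 0.2)) / (2.0 * 0.2)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.8):
        print(p, kimura_protein_distance(p))
    print("diverges at p =", kimura_saturation_point())
```

   Note that the distance is always larger than p itself, because the
   formula corrects for multiple changes at the same position.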

   (5) The Categories distance. This is my own concoction. I imagined a
   nucleotide sequence changing according to Kimura's 2-parameter model,
   with the exception that some changes of amino acids are less likely
   than others. The amino acids are grouped into a series of categories.
   Any base change that does not change which category the amino acid is
   in is allowed, but if an amino acid changes category this is allowed
   only a certain fraction of the time. The fraction is called the "ease"
   and there is a parameter for it, which is 1.0 when all changes are
   allowed and near 0.0 when changes between categories are nearly
   impossible.

   In this option I have allowed the user to select the
   Transition/Transversion ratio, which of several genetic codes to use,
   and which categorization of amino acids to use. There are three of
   them, a somewhat random sample:
    1. The George-Hunt-Barker (1988) classification of amino acids,
    2. A classification provided by my colleague Ben Hall when I asked
       him for one,
    3. One I found in an old "baby biochemistry" book (Conn and Stumpf,
       1963), which contains most of the biochemistry I was ever taught,
       and all that I ever learned.

   Interestingly enough, all of them are consistent with the same linear
   ordering of amino acids, which they divide up in different ways. For
   the Categories model I have set as default the George/Hunt/Barker
   classification with the "ease" parameter set to 0.457 which is
   approximately the value implied by the empirical rates in the Dayhoff
   PAM matrix.

   The method uses, as I have noted, Kimura's (1980) 2-parameter model of
   DNA change. The Kimura "2-parameter" model allows for a difference
   between transition and transversion rates. Its transition probability
   matrix for a short interval of time is:


              To:     A        G        C        T
                   ---------------------------------
               A  | 1-a-2b     a         b       b
       From:   G  |   a      1-a-2b      b       b
               C  |   b        b       1-a-2b    a
               T  |   b        b         a     1-a-2b

   where a is u dt, the product of the rate u of transitions per unit
   time and the length dt of the time interval, and b is v dt, the
   product of half the rate of transversions (i.e., the rate v of a
   specific transversion) and the length dt of the time interval.
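
   As a small illustration, the matrix above can be built directly from
   u, v and dt. The function name is hypothetical, and the sketch is only
   meaningful for a short interval, where a + 2b is well below 1.

```python
def k2p_short_interval_matrix(u, v, dt):
    """Kimura 2-parameter transition probabilities over a short interval.

    Bases are ordered A, G, C, T as in the table in the text.
    u  : rate of transitions per unit time
    v  : rate of one specific transversion per unit time
    dt : length of the (short) time interval
    """
    a = u * dt
    b = v * dt
    stay = 1.0 - a - 2.0 * b
    return [
        [stay, a, b, b],  # from A
        [a, stay, b, b],  # from G
        [b, b, stay, a],  # from C
        [b, b, a, stay],  # from T
    ]


if __name__ == "__main__":
    for row in k2p_short_interval_matrix(u=2.0, v=0.5, dt=0.01):
        print(row)
```

   Each row sums to 1, and the matrix is symmetric, so the model predicts
   equal equilibrium frequencies for the four bases.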

   Each distance that is calculated is an estimate, from that particular
   pair of species, of the divergence time between those two species. The
   Kimura distance is straightforward to compute. The likelihood-based
   methods are considerably slower: they look at all positions and find
   the distance which makes the likelihood highest. This distance is in
   effect the length of the branch in a two-species tree that connects
   these two species. Its likelihood is just the product, under the
   model, of the probabilities of each position having the (one or) two
   amino acids that are actually found.

   The computation proceeds from an eigenanalysis (spectral
   decomposition) of the transition probability matrix. In the case of
   the PAM 001 matrix the eigenvalues and eigenvectors are precomputed
   and are hard-coded into the program in over 400 statements. In the
   case of the Categories model the program computes the eigenvalues and
   eigenvectors itself, which will add a delay. But the delay is
   independent of the number of species as the calculation is done only
   once, at the outset.

   The actual algorithm for estimating the distance is in each case a
   bisection algorithm which tries to find the point at which the
   derivative of the likelihood is zero. Some of the kinds of ambiguous
   amino acids like "glx" are correctly taken into account. Gaps are
   treated as if they are unknown amino acids, which means those
   positions get dropped from that particular comparison, although they
   are not dropped from the whole analysis. You need not eliminate
   regions containing gaps, as long as you are reasonably sure of the
   alignment there.
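
   The bisection idea can be sketched as follows. To keep the example
   short it does NOT use the PAM, JTT or PMB matrices: it substitutes a
   simple equal-rates model for 20 amino acids, for which the derivative
   of the log-likelihood has a closed form. The function names and the
   set of symbols treated as gaps or unknowns are assumptions made for
   this illustration.

```python
import math

# Assumed symbols for gaps and unknown residues in this sketch.
SKIP = set("-?X")


def count_sites(seq_a, seq_b):
    """Count identical and differing positions, dropping gapped ones."""
    same = diff = 0
    for x, y in zip(seq_a, seq_b):
        if x in SKIP or y in SKIP:
            continue  # dropped from this pairwise comparison only
        if x == y:
            same += 1
        else:
            diff += 1
    return same, diff


def dloglik(t, same, diff):
    """Derivative of the log-likelihood under an equal-rates 20-state model.

    t is the expected number of substitutions per site.
    """
    e = math.exp(-20.0 * t / 19.0)
    p_same = 0.05 + 0.95 * e   # probability a site is unchanged
    p_diff = 0.05 - 0.05 * e   # probability of one specific other residue
    return same * (-e) / p_same + diff * (e / 19.0) / p_diff


def ml_distance(seq_a, seq_b, lo=1e-9, hi=20.0, tol=1e-10):
    """Find the distance at which the derivative is zero, by bisection."""
    same, diff = count_sites(seq_a, seq_b)
    if diff == 0:
        return 0.0
    if dloglik(hi, same, diff) > 0.0:
        return math.inf  # likelihood still rising: no finite estimate
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dloglik(mid, same, diff) > 0.0:
            lo = mid  # still going uphill, the maximum lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(ml_distance("AAAAAAAAAA", "AAAAAAAACC"))
```

   For this simple model the answer can be checked against the closed
   form d = -(19/20) log_e(1 - 20p/19), where p is the fraction of
   differing positions. The empirical matrices have no such closed form,
   which is why an iterative search is needed there.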

   Note that there is an assumption that we are looking at all positions,
   including those that have not changed at all. It is important not to
   restrict attention to some positions based on whether or not they have
   changed; doing that would bias the distances by making them too large,
   and that in turn would cause the distances to misinterpret the meaning
   of those positions that had changed.

   The program can now correct distances for unequal rates of change at
   different amino acid positions. This correction, which was introduced
   for DNA sequences by Jin and Nei (1990), assumes that the distribution
   of rates of change among amino acid positions follows a Gamma
   distribution. The user is asked for the value of a parameter that
   determines the amount of variation of rates among amino acid
   positions. Instead of the more widely-known coefficient alpha,
   PROTDIST uses the coefficient of variation (ratio of the standard
   deviation to the mean) of rates among amino acid positions. So if
   there is 20% variation in rates, the CV is 0.20. The square of the
   CV is also the reciprocal of the better-known "shape parameter",
   alpha, of the Gamma distribution, so in this case the shape parameter
   alpha = 1/(0.20*0.20) = 25. If you want to achieve a particular value
   of alpha, such as 100, you will want to use a CV of 1/sqrt(100) = 1/10
   = 0.1.
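
   The conversion between the coefficient of variation and the shape
   parameter follows directly from the relation CV^2 = 1/alpha. A minimal
   sketch, with hypothetical function names:

```python
import math


def alpha_from_cv(cv):
    """Gamma shape parameter alpha for a given coefficient of variation."""
    return 1.0 / (cv * cv)


def cv_from_alpha(alpha):
    """Coefficient of variation to supply for a desired alpha."""
    return 1.0 / math.sqrt(alpha)


if __name__ == "__main__":
    print(alpha_from_cv(0.20))
    print(cv_from_alpha(25.0))
```

   The value returned by cv_from_alpha is the one to pass to the
   -gammacoefficient qualifier when a particular alpha is wanted.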

   In addition to the five distance calculations, the program can also
   compute a table of similarities between amino acid sequences. These
   values are the fractions of amino acid positions identical between the
   sequences. The diagonal values are 1.0000. No attempt is made to count
   similarity of nonidentical amino acids, so that no credit is given for
   having (for example) different hydrophobic amino acids at the
   corresponding positions in the two sequences. This option has been
   requested by many users, who need it for descriptive purposes. It is
   not intended that the table be used for inferring the tree.
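
   The similarity value for a pair is simply a count of identical
   positions divided by the alignment length. The sketch below shows the
   idea; the function names are hypothetical, and it counts every aligned
   position, which is an assumption, since the text does not say how
   gapped positions are handled in the table.

```python
def fraction_identical(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences are identical."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty and equal in length")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / len(seq_a)


def similarity_table(sequences):
    """Square table of pairwise identities for a dict of name -> sequence."""
    names = list(sequences)
    return {
        a: {b: fraction_identical(sequences[a], sequences[b]) for b in names}
        for a in names
    }


if __name__ == "__main__":
    table = similarity_table({
        "Alpha": "AACGTGGCCACAT",
        "Beta": "AAGGTCGCCACAC",
    })
    for name, row in table.items():
        print(name, " ".join(f"{value:.4f}" for value in row.values()))
```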

Usage

   Here is a sample session with fprotdist


% fprotdist 
Protein distance algorithm
Input (aligned) protein sequence set(s): protdist.dat
Phylip distance matrix output file [protdist.fprotdist]: 


Computing distances:
  Alpha
  Beta         .
  Gamma        ..
  Delta        ...
  Epsilon      ....

Output written to file "protdist.fprotdist"

Done.



Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqsetall  File containing one or more sequence
                                  alignments
  [-outfile]           outfile    [*.fprotdist] Phylip distance matrix output
                                  file

   Additional (Optional) qualifiers (* if not always prompted):
   -ncategories        integer    [1] Number of substitution rate categories
                                  (Integer from 1 to 9)
*  -rate               array      Rate for each category
*  -categories         properties File of substitution rate categories
   -weights            properties Weights file
   -method             menu       [j] Choose the method to use (Values: j
                                  (Jones-Taylor-Thornton matrix); h
                                  (Henikoff/Tiller PMB matrix); d (Dayhoff PAM
                                  matrix); k (Kimura formula); s (Similarity
                                  table); c (Categories model))
*  -gamma              menu       [c] Rate variation among sites (Values: g
                                  (Gamma distributed rates); i
                                  (Gamma+invariant sites); c (Constant rate))
*  -gammacoefficient   float      [1] Coefficient of variation of substitution
                                  rate among sites (Number 0.001 or more)
*  -invarcoefficient   float      [1] Coefficient of variation of substitution
                                  rate among sites (Number 0.001 or more)
*  -aacateg            menu       [G] Choose the category to use (Values: G
                                  (George/Hunt/Barker (Cys), (Met Val Leu
                                  Ileu), (Gly Ala Ser Thr Pro)); C (Chemical
                                  (Cys Met), (Val Leu Ileu Gly Ala Ser Thr),
                                  (Pro)); H (Hall (Cys), (Met Val Leu Ileu),
                                  (Gly Ala Ser Thr),(Pro)))
*  -whichcode          menu       [u] Which genetic code (Values: u
                                  (Universal); c (Ciliate); m (Universal
                                  mitochondrial); v (Vertebrate
                                  mitochondrial); f (Fly mitochondrial); y
                                  (Yeast mitochondrial))
*  -ease               float      [0.457] Prob change category (1.0=easy)
                                  (Number from 0.000 to 1.000)
*  -ttratio            float      [2.0] Transition/transversion ratio (Number
                                  0.000 or more)
*  -basefreq           array      [0.25 0.25 0.25 0.25] Base frequencies for A
                                  C G T/U (use blanks to separate)
   -printdata          boolean    [N] Print data at start of run
   -[no]progress       boolean    [Y] Print indications of progress of run

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   fprotdist reads any normal sequence USAs.

  Input files for usage example

  File: protdist.dat

   5   13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC

Output file format

   fprotdist output contains on its first line the number of species. The
   distance matrix is then printed in standard form, with each species
   starting on a new line with the species name, followed by the
   distances to the species in order. These continue onto a new line
   after every nine distances. The distance matrix is square with zero
   distances on the diagonal. In general the format of the distance
   matrix is such that it can serve as input to any of the distance
   matrix programs.

   If the similarity table is selected, the table that is produced is not
   in a format that can be used as input to the distance matrix programs.
   It has a heading, and the species names are also put at the tops of
   the columns of the table (or rather, the first 8 characters of each
   species name are there, the other two characters omitted to save
   space). There is not an option to put the table into a format that can
   be read by the distance matrix programs, nor is there one to make it
   into a table of fractions of difference by subtracting the similarity
   values from 1. This is done deliberately to make it more difficult for
   the user to use these values to construct trees. The similarity values
   are not corrected for multiple changes, and their use to construct
   trees (even after converting them to fractions of difference) would be
   wrong, as it would lead to severe conflict between the distant pairs
   of sequences and the close pairs of sequences.

   If the option to print out the data is selected, the output file will
   precede the data by more complete information on the input and the
   menu selections. The output file begins by giving the number of
   species and the number of characters, and the identity of the distance
   measure that is being used.

   In the Categories model of substitution, the distances printed out are
   scaled in terms of expected numbers of substitutions, counting both
   transitions and transversions but not replacements of a base by
   itself, and scaled so that the average rate of change is set to 1.0.
   For the Dayhoff PAM and Kimura models the distances are scaled in terms
   of the expected numbers of amino acid substitutions per site. Of
   course, when a branch is twice as long this does not mean that there
   will be twice as much net change expected along it, since some of the
   changes may occur in the same site and overlie or even reverse each
   other. The branch length estimates here are in terms of the expected
   underlying numbers of changes. That means that a branch of length 0.26
   is 26 times as long as one which would show a 1% difference between
   the protein (or nucleotide) sequences at the beginning and end of the
   branch. But we would not expect the sequences at the beginning and end
   of the branch to be 26% different, as there would be some overlaying
   of changes.
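
   The gap between branch length and observed difference can be
   illustrated with a simple equal-rates model, in which every state is
   equally likely to change to any other. This is an assumption made for
   the example, not one of the models the program offers, and the
   function name is hypothetical.

```python
import math


def expected_fraction_different(branch_length, n_states=20):
    """Expected observed difference after branch_length changes per site.

    Uses an equal-rates model with n_states states (20 for amino acids,
    4 for nucleotides). Overlapping and reversing changes make the
    observed difference smaller than the branch length.
    """
    k = n_states
    return (k - 1) / k * (1.0 - math.exp(-k * branch_length / (k - 1)))


if __name__ == "__main__":
    for length in (0.01, 0.26, 1.0, 3.0):
        print(length, expected_fraction_different(length))
```

   Under this model a branch of length 0.26 gives an expected difference
   of roughly 0.23 rather than 0.26, and the shortfall grows as the
   branch gets longer.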

   One problem that can arise is that two or more of the species can be
   so dissimilar that the distance between them would have to be
   infinite, as the likelihood rises indefinitely as the estimated
   divergence time increases. For example, with the Kimura model, if the
   two sequences differ in 85.41% or more of their positions then the
   estimate of divergence time would be infinite. Since there is no way
   to represent an infinite distance in the output file, the program
   regards this as an error, issues a warning message indicating which
   pair of species are causing the problem, and computes a distance of
   -1.0.

  Output files for usage example

  File: protdist.fprotdist

    5
Alpha       0.000000  0.331834  0.628142  1.036660  1.365098
Beta        0.331834  0.000000  0.377406  1.102689  0.682218
Gamma       0.628142  0.377406  0.000000  0.979550  0.866781
Delta       1.036660  1.102689  0.979550  0.000000  0.227515
Epsilon     1.365098  0.682218  0.866781  0.227515  0.000000
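
   A file in this format can be read back with a few lines of Python.
   The sketch below is not part of EMBOSS; the function names are
   hypothetical and it assumes that species names contain no white
   space. Because it reads the file as a stream of tokens, rows that
   continue onto further lines are handled as well. It also lists any
   pairs carrying the -1.0 marker that the program writes when a
   distance would be infinite.

```python
def parse_square_distance_matrix(text):
    """Parse a square distance matrix into (names, rows of floats)."""
    tokens = text.split()
    n = int(tokens[0])
    names, matrix = [], []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])
        matrix.append([float(value) for value in tokens[pos + 1:pos + 1 + n]])
        pos += n + 1
    return names, matrix


def failed_pairs(names, matrix):
    """Pairs whose distance is negative, i.e. flagged as not computable."""
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if matrix[i][j] < 0.0
    ]


SAMPLE = """    5
Alpha       0.000000  0.331834  0.628142  1.036660  1.365098
Beta        0.331834  0.000000  0.377406  1.102689  0.682218
Gamma       0.628142  0.377406  0.000000  0.979550  0.866781
Delta       1.036660  1.102689  0.979550  0.000000  0.227515
Epsilon     1.365098  0.682218  0.866781  0.227515  0.000000
"""

if __name__ == "__main__":
    names, matrix = parse_square_distance_matrix(SAMPLE)
    print(names)
    print(matrix[3][4])
    print(failed_pairs(names, matrix))
```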

Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

                    Program name                         Description
                    distmat      Create a distance matrix from a multiple sequence alignment
                    ednacomp     DNA compatibility algorithm
                    ednadist     Nucleic acid sequence Distance Matrix program
                    ednainvar    Nucleic acid sequence Invariants method
                    ednaml       Phylogenies from nucleic acid Maximum Likelihood
                    ednamlk      Phylogenies from nucleic acid Maximum Likelihood with clock
                    ednapars     DNA parsimony algorithm
                    ednapenny    Penny algorithm for DNA
                    eprotdist    Protein distance algorithm
                    eprotpars    Protein parsimony algorithm
                    erestml      Restriction site Maximum Likelihood method
                    eseqboot     Bootstrapped sequences algorithm
                    fdiscboot    Bootstrapped discrete sites algorithm
                    fdnacomp     DNA compatibility algorithm
                    fdnadist     Nucleic acid sequence Distance Matrix program
                    fdnainvar    Nucleic acid sequence Invariants method
                    fdnaml       Estimates nucleotide phylogeny by maximum likelihood
                    fdnamlk      Estimates nucleotide phylogeny by maximum likelihood
                    fdnamove     Interactive DNA parsimony
                    fdnapars     DNA parsimony algorithm
                    fdnapenny    Penny algorithm for DNA
                    fdolmove     Interactive Dollo or Polymorphism Parsimony
                    ffreqboot    Bootstrapped genetic frequencies algorithm
                    fproml       Protein phylogeny by maximum likelihood
                    fpromlk      Protein phylogeny by maximum likelihood
                    fprotpars    Protein parsimony algorithm
                    frestboot    Bootstrapped restriction sites algorithm
                    frestdist    Distance matrix from restriction sites or fragments
                    frestml      Restriction site maximum Likelihood method
                    fseqboot     Bootstrapped sequences algorithm
                    fseqbootall  Bootstrapped sequences algorithm

Author(s)

   This program is an EMBOSS conversion of a program written by Joe
   Felsenstein as part of his PHYLIP package.

   Although we take every care to ensure that the results of the EMBOSS
   version are identical to those from the original package, we recommend
   that you check your inputs give the same results in both versions
   before publication.

   Please report all bugs in the EMBOSS version to the EMBOSS bug team,
   not to the original author.

History

   Written (2004) - Joe Felsenstein, University of Washington.

   Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
