
                                  fproml 



Wiki

   The master copies of EMBOSS documentation are available at
   http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

   Please help by correcting and extending the Wiki pages.

Function

   Protein phylogeny by maximum likelihood

Description

   Estimates phylogenies from protein amino acid sequences by maximum
   likelihood. The PAM, JTT, or PMB models can be employed, and also use
   of a Hidden Markov model of rates, with the program inferring which
   sites have which rates. This also allows gamma-distribution and
   gamma-plus-invariant sites distributions of rates across sites. It
   also allows different rates of change at known sites.

Algorithm

   This program implements the maximum likelihood method for protein
   amino acid sequences. It uses the either the Jones-Taylor-Thornton or
   the Dayhoff probability model of change between amino acids. The
   assumptions of these present models are:
    1. Each position in the sequence evolves independently.
    2. Different lineages evolve independently.
    3. Each position undergoes substitution at an expected rate which is
       chosen from a series of rates (each with a probability of
       occurrence) which we specify.
    4. All relevant positions are included in the sequence, not just
       those that have changed or those that are "phylogenetically
       informative".
    5. The probabilities of change between amino acids are given by the
       model of Jones, Taylor, and Thornton (1992), the PMB model of
       Veerassamy, Smith and Tillier (2004), or the PAM model of Dayhoff
       (Dayhoff and Eck, 1968; Dayhoff et. al., 1979).

   Note the assumption that we are looking at all positions, including
   those that have not changed at all. It is important not to restrict
   attention to some positions based on whether or not they have changed;
   doing that would bias branch lengths by making them too long, and that
   in turn would cause the method to misinterpret the meaning of those
   positions that had changed.

   This program uses a Hidden Markov Model (HMM) method of inferring
   different rates of evolution at different amino acid positions. This
   was described in a paper by me and Gary Churchill (1996). It allows us
   to specify to the program that there will be a number of different
   possible evolutionary rates, what the prior probabilities of
   occurrence of each is, and what the average length of a patch of
   positions all having the same rate. The rates can also be chosen by
   the program to approximate a Gamma distribution of rates, or a Gamma
   distribution plus a class of invariant positions. The program computes
   the the likelihood by summing it over all possible assignments of
   rates to positions, weighting each by its prior probability of
   occurrence.

   For example, if we have used the C and A options (described below) to
   specify that there are three possible rates of evolution, 1.0, 2.4,
   and 0.0, that the prior probabilities of a position having these rates
   are 0.4, 0.3, and 0.3, and that the average patch length (number of
   consecutive positions with the same rate) is 2.0, the program will sum
   the likelihood over all possibilities, but giving less weight to those
   that (say) assign all positions to rate 2.4, or that fail to have
   consecutive positions that have the same rate.

   The Hidden Markov Model framework for rate variation among positions
   was independently developed by Yang (1993, 1994, 1995). We have
   implemented a general scheme for a Hidden Markov Model of rates; we
   allow the rates and their prior probabilities to be specified
   arbitrarily by the user, or by a discrete approximation to a Gamma
   distribution of rates (Yang, 1995), or by a mixture of a Gamma
   distribution and a class of invariant positions.

   This feature effectively removes the artificial assumption that all
   positions have the same rate, and also means that we need not know in
   advance the identities of the positions that have a particular rate of
   evolution.

   Another layer of rate variation also is available. The user can assign
   categories of rates to each positions (for example, we might want
   amino acid positions in the active site of a protein to change more
   slowly than other positions. This is done with the categories input
   file and the C option. We then specify (using the menu) the relative
   rates of evolution of amino acid positions in the different
   categories. For example, we might specify that positions in the active
   site evolve at relative rates of 0.2 compared to 1.0 at other
   positions. If we are assuming that a particular position maintains a
   cysteine bridge to another, we may want to put it in a category of
   positions (including perhaps the initial position of the protein
   sequence which maintains methionine) which changes at a rate of 0.0.

   If both user-assigned rate categories and Hidden Markov Model rates
   are allowed, the program assumes that the actual rate at a position is
   the product of the user-assigned category rate and the Hidden Markov
   Model regional rate. (This may not always make perfect biological
   sense: it would be more natural to assume some upper bound to the
   rate, as we have discussed in the Felsenstein and Churchill paper).
   Nevertheless you may want to use both types of rate variation.

Usage

   Here is a sample session with fproml


% fproml 
Protein phylogeny by maximum likelihood
Input (aligned) protein sequence set(s): proml.dat
Phylip tree file (optional): 
Phylip proml program output file [proml.fproml]: 


Adding species:
   1. Alpha
   2. Beta
   3. Gamma
   4. Delta
   5. Epsilon


Output written to file "proml.fproml"

Tree also written onto file "proml.treefile"

Done.


   Go to the input files for this example
   Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqsetall  File containing one or more sequence
                                  alignments
  [-intreefile]        tree       Phylip tree file (optional)
  [-outfile]           outfile    [*.fproml] Phylip proml program output file

   Additional (Optional) qualifiers (* if not always prompted):
   -ncategories        integer    [1] Number of substitution rate categories
                                  (Integer from 1 to 9)
*  -rate               array      Rate for each category
*  -categories         properties File of substitution rate categories
   -weights            properties Weights file
*  -lengths            boolean    [N] Use branch lengths from user trees
   -model              menu       [Jones-Taylor-Thornton] Probability model
                                  for amino acid change (Values: j
                                  (Jones-Taylor-Thornton); h (Henikoff/Tillier
                                  PMBs); d (Dayhoff PAM))
   -gamma              menu       [Constant rate] Rate variation among sites
                                  (Values: g (Gamma distributed rates); i
                                  (Gamma+invariant sites); h (User defined HMM
                                  of rates); n (Constant rate))
*  -gammacoefficient   float      [1] Coefficient of variation of substitution
                                  rate among sites (Number 0.001 or more)
*  -ngammacat          integer    [1] Number of categories (1-9) (Integer from
                                  1 to 9)
*  -invarcoefficient   float      [1] Coefficient of variation of substitution
                                  rate among sites (Number 0.001 or more)
*  -ninvarcat          integer    [1] Number of categories (1-9) including one
                                  for invariant sites (Integer from 1 to 9)
*  -invarfrac          float      [0.0] Fraction of invariant sites (Number
                                  from 0.000 to 1.000)
*  -nhmmcategories     integer    [1] Number of HMM rate categories (Integer
                                  from 1 to 9)
*  -hmmrates           array      [1.0] HMM category rates
*  -hmmprobabilities   array      [1.0] Probability for each HMM category
*  -adjsite            boolean    [N] Rates at adjacent sites correlated
*  -lambda             float      [1.0] Mean block length of sites having the
                                  same rate (Number 1.000 or more)
*  -njumble            integer    [0] Number of times to randomise, choose 0
                                  if you don't want to randomise (Integer 0 or
                                  more)
*  -seed               integer    [1] Random number seed between 1 and 32767
                                  (must be odd) (Integer from 1 to 32767)
*  -global             boolean    [N] Global rearrangements
   -outgrno            integer    [0] Species number to use as outgroup
                                  (Integer 0 or more)
   -[no]rough          boolean    [Y] Speedier but rougher analysis
   -[no]trout          toggle     [Y] Write out trees to tree file
*  -outtreefile        outfile    [*.fproml] Phylip tree output file
                                  (optional)
   -printdata          boolean    [N] Print data at start of run
   -[no]progress       boolean    [Y] Print indications of progress of run
   -[no]treeprint      boolean    [Y] Print out tree
   -hypstate           boolean    [N] Reconstruct hypothetical sequence

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   "-outtreefile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   fproml reads any normal sequence USAs.

  Input files for usage example

  File: proml.dat

   5   13
Alpha     AACGTGGCCAAAT
Beta      AAGGTCGCCAAAC
Gamma     CATTTCGTCACAA
Delta     GGTATTTCGGCCT
Epsilon   GGGATCTCGGCCC

Output file format

   fproml output starts by giving the number of species and the number of
   amino acid positions.

   If the R (HMM rates) option is used a table of the relative rates of
   expected substitution at each category of positions is printed, as
   well as the probabilities of each of those rates.

   There then follow the data sequences, if the user has selected the
   menu option to print them, with the sequences printed in groups of ten
   amino acids. The trees found are printed as an unrooted tree topology
   (possibly rooted by outgroup if so requested). The internal nodes are
   numbered arbitrarily for the sake of identification. The number of
   trees evaluated so far and the log likelihood of the tree are also
   given. Note that the trees printed out have a trifurcation at the
   base. The branch lengths in the diagram are roughly proportional to
   the estimated branch lengths, except that very short branches are
   printed out at least three characters in length so that the
   connections can be seen. The unit of branch length is the expected
   fraction of amino acids changed (so that 1.0 is 100 PAMs).

   A table is printed showing the length of each tree segment (in units
   of expected amino acid substitutions per position), as well as (very)
   rough confidence limits on their lengths. If a confidence limit is
   negative, this indicates that rearrangement of the tree in that region
   is not excluded, while if both limits are positive, rearrangement is
   still not necessarily excluded because the variance calculation on
   which the confidence limits are based results in an underestimate,
   which makes the confidence limits too narrow.

   In addition to the confidence limits, the program performs a crude
   Likelihood Ratio Test (LRT) for each branch of the tree. The program
   computes the ratio of likelihoods with and without this branch length
   forced to zero length. This done by comparing the likelihoods changing
   only that branch length. A truly correct LRT would force that branch
   length to zero and also allow the other branch lengths to adjust to
   that. The result would be a likelihood ratio closer to 1. Therefore
   the present LRT will err on the side of being too significant. YOU ARE
   WARNED AGAINST TAKING IT TOO SERIOUSLY. If you want to get a better
   likelihood curve for a branch length you can do multiple runs with
   different prespecified lengths for that branch, as discussed above in
   the discussion of the L option.

   One should also realize that if you are looking not at a
   previously-chosen branch but at all branches, that you are seeing the
   results of multiple tests. With 20 tests, one is expected to reach
   significance at the P = .05 level purely by chance. You should
   therefore use a much more conservative significance level, such as .05
   divided by the number of tests. The significance of these tests is
   shown by printing asterisks next to the confidence interval on each
   branch length. It is important to keep in mind that both the
   confidence limits and the tests are very rough and approximate, and
   probably indicate more significance than they should. Nevertheless,
   maximum likelihood is one of the few methods that can give you any
   indication of its own error; most other methods simply fail to warn
   the user that there is any error! (In fact, whole philosophical
   schools of taxonomists exist whose main point seems to be that there
   isn't any error, that the "most parsimonious" tree is the best tree by
   definition and that's that).

   The log likelihood printed out with the final tree can be used to
   perform various likelihood ratio tests. One can, for example, compare
   runs with different values of the relative rate of change in the
   active site and in the rest of the protein to determine which value is
   the maximum likelihood estimate, and what is the allowable range of
   values (using a likelihood ratio test, which you will find described
   in mathematical statistics books). One could also estimate the base
   frequencies in the same way. Both of these, particularly the latter,
   require multiple runs of the program to evaluate different possible
   values, and this might get expensive.

   If the U (User Tree) option is used and more than one tree is
   supplied, and the program is not told to assume autocorrelation
   between the rates at different amino acid positions, the program also
   performs a statistical test of each of these trees against the one
   with highest likelihood. If there are two user trees, the test done is
   one which is due to Kishino and Hasegawa (1989), a version of a test
   originally introduced by Templeton (1983). In this implementation it
   uses the mean and variance of log-likelihood differences between
   trees, taken across amino acid positions. If the two trees' means are
   more than 1.96 standard deviations different then the trees are
   declared significantly different. This use of the empirical variance
   of log-likelihood differences is more robust and nonparametric than
   the classical likelihood ratio test, and may to some extent compensate
   for the any lack of realism in the model underlying this program.

   If there are more than two trees, the test done is an extension of the
   KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that
   a correction for the number of trees was necessary, and they
   introduced a resampling method to make this correction. In the version
   used here the variances and covariances of the sum of log likelihoods
   across amino acid positions are computed for all pairs of trees. To
   test whether the difference between each tree and the best one is
   larger than could have been expected if they all had the same expected
   log-likelihood, log-likelihoods for all trees are sampled with these
   covariances and equal means (Shimodaira and Hasegawa's "least
   favorable hypothesis"), and a P value is computed from the fraction of
   times the difference between the tree's value and the highest
   log-likelihood exceeds that actually observed. Note that this sampling
   needs random numbers, and so the program will prompt the user for a
   random number seed if one has not already been supplied. With the
   two-tree KHT test no random numbers are used.

   In either the KHT or the SH test the program prints out a table of the
   log-likelihoods of each tree, the differences of each from the highest
   one, the variance of that quantity as determined by the log-likelihood
   differences at individual sites, and a conclusion as to whether that
   tree is or is not significantly worse than the best one. However the
   test is not available if we assume that there is autocorrelation of
   rates at neighboring positions (option A) and is not done in those
   cases.

   The branch lengths printed out are scaled in terms of 100 times the
   expected numbers of amino acid substitutions, scaled so that the
   average rate of change, averaged over all the positions analyzed, is
   set to 100.0, if there are multiple categories of positions. This
   means that whether or not there are multiple categories of positions,
   the expected percentage of change for very small branches is equal to
   the branch length. Of course, when a branch is twice as long this does
   not mean that there will be twice as much net change expected along
   it, since some of the changes occur in the same position and overlie
   or even reverse each other. underlying numbers of changes. That means
   that a branch of length 26 is 26 times as long as one which would show
   a 1% difference between the amino acid sequences at the beginning and
   end of the branch, but we would not expect the sequences at the
   beginning and end of the branch to be 26% different, as there would be
   some overlaying of changes.

   Confidence limits on the branch lengths are also given. Of course a
   negative value of the branch length is meaningless, and a confidence
   limit overlapping zero simply means that the branch length is not
   necessarily significantly different from zero. Because of limitations
   of the numerical algorithm, branch length estimates of zero will often
   print out as small numbers such as 0.00001. If you see a branch length
   that small, it is really estimated to be of zero length.

   Another possible source of confusion is the existence of negative
   values for the log likelihood. This is not really a problem; the log
   likelihood is not a probability but the logarithm of a probability.
   When it is negative it simply means that the corresponding probability
   is less than one (since we are seeing its logarithm). The log
   likelihood is maximized by being made more positive: -30.23 is worse
   than -29.14.

   At the end of the output, if the R option is in effect with multiple
   HMM rates, the program will print a list of what amino acid position
   categories contributed the most to the final likelihood. This
   combination of HMM rate categories need not have contributed a
   majority of the likelihood, just a plurality. Still, it will be
   helpful as a view of where the program infers that the higher and
   lower rates are. Note that the use in this calculations of the prior
   probabilities of different rates, and the average patch length, gives
   this inference a "smoothed" appearance: some other combination of
   rates might make a greater contribution to the likelihood, but be
   discounted because it conflicts with this prior information. See the
   example output below to see what this printout of rate categories
   looks like. A second list will also be printed out, showing for each
   position which rate accounted for the highest fraction of the
   likelihood. If the fraction of the likelihood accounted for is less
   than 95%, a dot is printed instead.

   Option 3 in the menu controls whether the tree is printed out into the
   output file. This is on by default, and usually you will want to leave
   it this way. However for runs with multiple data sets such as
   bootstrapping runs, you will primarily be interested in the trees
   which are written onto the output tree file, rather than the trees
   printed on the output file. To keep the output file from becoming too
   large, it may be wisest to use option 3 to prevent trees being printed
   onto the output file.

   Option 4 in the menu controls whether the tree estimated by the
   program is written onto a tree file. The default name of this output
   tree file is "outtree". If the U option is in effect, all the
   user-defined trees are written to the output tree file.

   Option 5 in the menu controls whether ancestral states are estimated
   at each node in the tree. If it is in effect, a table of ancestral
   sequences is printed out (including the sequences in the tip species
   which are the input sequences). The symbol printed out is for the
   amino acid which accounts for the largest fraction of the likelihood
   at that position. In that table, if a position has an amino acid which
   accounts for more than 95% of the likelihood, its symbol printed in
   capital letters (W rather than w). One limitation of the current
   version of the program is that when there are multiple HMM rates
   (option R) the reconstructed amino acids are based on only the single
   assignment of rates to positions which accounts for the largest amount
   of the likelihood. Thus the assessment of 95% of the likelihood, in
   tabulating the ancestral states, refers to 95% of the likelihood that
   is accounted for by that particular combination of rates.

  Output files for usage example

  File: proml.fproml


Amino acid sequence Maximum Likelihood method, version 3.68

Jones-Taylor-Thornton model of amino acid change


  +Beta
  |
  |                                     +Epsilon
  |      +------------------------------3
  1------2                              +------------Delta
  |      |
  |      +--------------------Gamma
  |
  +---------Alpha


remember: this is an unrooted tree!

Ln Likelihood =  -131.55052

 Between        And            Length      Approx. Confidence Limits
 -------        ---            ------      ------- ---------- ------

     1          Alpha             0.31006     (     zero,     0.66806) **
     1          Beta              0.00010     (     zero,    infinity)
     1             2              0.22206     (     zero,     0.62979) *
     2             3              1.00907     (  0.13965,     1.87849) **
     3          Epsilon           0.00010     (     zero,    infinity)
     3          Delta             0.41176     (     zero,     0.86685) **
     2          Gamma             0.68569     (  0.01628,     1.35510) **

     *  = significantly positive, P < 0.05
     ** = significantly positive, P < 0.01


  File: proml.treefile

(Beta:0.00010,((Epsilon:0.00010,Delta:0.41176):1.00907,
Gamma:0.68569):0.22206,Alpha:0.31006);

Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

                    Program name                         Description
                    distmat      Create a distance matrix from a multiple sequence alignment
                    ednacomp     DNA compatibility algorithm
                    ednadist     Nucleic acid sequence Distance Matrix program
                    ednainvar    Nucleic acid sequence Invariants method
                    ednaml       Phylogenies from nucleic acid Maximum Likelihood
                    ednamlk      Phylogenies from nucleic acid Maximum Likelihood with clock
                    ednapars     DNA parsimony algorithm
                    ednapenny    Penny algorithm for DNA
                    eprotdist    Protein distance algorithm
                    eprotpars    Protein parsimony algorithm
                    erestml      Restriction site Maximum Likelihood method
                    eseqboot     Bootstrapped sequences algorithm
                    fdiscboot    Bootstrapped discrete sites algorithm
                    fdnacomp     DNA compatibility algorithm
                    fdnadist     Nucleic acid sequence Distance Matrix program
                    fdnainvar    Nucleic acid sequence Invariants method
                    fdnaml       Estimates nucleotide phylogeny by maximum likelihood
                    fdnamlk      Estimates nucleotide phylogeny by maximum likelihood
                    fdnamove     Interactive DNA parsimony
                    fdnapars     DNA parsimony algorithm
                    fdnapenny    Penny algorithm for DNA
                    fdolmove     Interactive Dollo or Polymorphism Parsimony
                    ffreqboot    Bootstrapped genetic frequencies algorithm
                    fpromlk      Protein phylogeny by maximum likelihood
                    fprotdist    Protein distance algorithm
                    fprotpars    Protein parsimony algorithm
                    frestboot    Bootstrapped restriction sites algorithm
                    frestdist    Distance matrix from restriction sites or fragments
                    frestml      Restriction site maximum Likelihood method
                    fseqboot     Bootstrapped sequences algorithm
                    fseqbootall  Bootstrapped sequences algorithm

Author(s)

                    This program is an EMBOSS conversion of a program written by Joe
                    Felsenstein as part of his PHYLIP package.

                    Although we take every care to ensure that the results of the EMBOSS
                    version are identical to those from the original package, we recommend
                    that you check your inputs give the same results in both versions
                    before publication.

                    Please report all bugs in the EMBOSS version to the EMBOSS bug team,
                    not to the original author.

History

                    Written (2004) - Joe Felsenstein, University of Washington.

                    Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

                    This program is intended to be used by everyone and everything, from
                    naive users to embedded scripts.
