How to Use the MicrobesOnline Tree-Browser

Navigating to the tree-browser
What the tree-browser shows

The gene tree
Gene context
The species tree

The tree-browser's controls

Which tree to use
Which genes to show
Gene context options
Species tree options
Changing the tree's look

How the trees are built

Computing the gene trees
Computing the species tree

Downloading data & making figures

Navigating to the tree-browser

To use the tree browser, the first step is to choose an "anchor" gene whose evolutionary history you want to explore. To find a gene, visit microbesonline.org, and use "Search genes" or "Sequence search". Then click on the T link in the search results or visit the gene page and click on "Browse genomes by trees".

If you want to link to the tree browser from your own web site, use the URL http://microbesonline.org/cgi-bin/treeBrowse.cgi?locus=NNNN, where NNNN is the numeric VIMSS ID of the gene. We try to maintain the VIMSS IDs even when the genome sequence or gene annotations are updated.
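As a minimal sketch, such a link could be generated programmatically as shown here. The function name tree_browser_url and the example ID 12345 are hypothetical; only the URL pattern and the "locus" parameter come from the text above.

```python
from urllib.parse import urlencode

# URL pattern taken from the documentation above.
BASE_URL = "http://microbesonline.org/cgi-bin/treeBrowse.cgi"


def tree_browser_url(vimss_id: int) -> str:
    """Return the tree-browser URL for a numeric VIMSS ID (hypothetical helper)."""
    if isinstance(vimss_id, bool) or not isinstance(vimss_id, int) or vimss_id <= 0:
        raise ValueError("VIMSS ID must be a positive integer")
    return f"{BASE_URL}?{urlencode({'locus': vimss_id})}"


# 12345 is a placeholder, not a real gene of interest.
print(tree_browser_url(12345))
```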

What the tree-browser shows

Given a gene of interest, the tree browser selects a domain or gene family and displays relevant parts of a phylogenetic tree. The tree shows you which relatives are closest, and hence which are most likely to have the same function. The gene trees are computed beforehand and are rooted.

In the initial display, the tree-browser shows a gene tree together with the genomic context of those genes (see example). Conserved context implies conserved function, and often implies a similarity in function to the surrounding genes.

The tree browser can also show the gene tree together with the species tree so that you can compare them (see example). This highlights the presence or absence of a gene in related genomes, or, if close relatives in the gene tree are from distant genomes, suggests that horizontal gene transfer occurred.

The gene tree

At the top, the tree browser reports which gene it is showing information for and which gene family it used to build the tree. The gene family computations are not perfect, and sometimes it happens that a gene is assigned to a family but close homologs of the gene are not. In this case those homologs will be missing from the tree (see coverage).

The gene tree is at the bottom left. The gene of interest, or "anchor", is at the top of the tree, and the gene's closest homologs are beneath it. To allow more distant homologs to be shown, groups of closely related homologs are collapsed to a single cluster or clade. A single gene is shown for each cluster. Each clustered gene is shown with a "bush" at the left to show how deep the internal branch is and with a "+" on the name. When you hover on a clustered gene in the tree it will show "gene-name and N similar sequences". To highlight the phylogenetic position of the "anchor" gene, it is not collapsed into a cluster unless the other sequences in the cluster are more than 96% identical to it. If you wish, you can change the level of clustering or turn it off entirely.

If a gene is from one of the genomes you have selected, its name will be in magenta. Genes from selected genomes are always shown (they are never "hidden" inside clusters). If you have not selected any genomes, then the anchor gene and its paralogs (other genes from that genome) will be in magenta and will always be shown.

If a gene has been analyzed in a published paper, the gene name will be underlined in green.

You can click on a gene's name in the gene tree (or in the genome context view) to bring up a menu. The menu lets you view the gene (or cluster), recenter the tree-browser to focus on that gene, or add the gene to a cart for future analysis, such as building your own alignment or phylogenetic tree. If the gene represents a cluster, the menu also includes the option to partially "expand" the cluster.

Confident clades in the tree (those with support 0.95 or higher) are marked with a black circle, and less confident clades (support 0.8 or higher) are marked with a grey circle. You can see the support value at any node in the tree by hovering. However, if you find a phylogenetic grouping of interest by using the tree-browser, we strongly urge that you build your own tree. You can do this easily within MicrobesOnline by adding genes to a cart, building a multiple sequence alignment, and then building a tree. Building your own tree allows you to check alignment quality and to use a higher-quality (slower) tree-building method.
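The marker rule above can be written as a small function. This is only an illustrative sketch: the function name and the returned strings are hypothetical, and the thresholds are the ones stated in the text.

```python
def support_marker(support: float) -> str:
    """Map a local support value (0 to 1) to the circle drawn at that node.

    Thresholds follow the text above: 0.95 or higher is "black",
    0.8 or higher is "grey", anything lower gets no marker.
    """
    if not 0.0 <= support <= 1.0:
        raise ValueError("support must be between 0 and 1")
    if support >= 0.95:
        return "black"
    if support >= 0.8:
        return "grey"
    return "none"


for value in (0.99, 0.85, 0.5):
    print(value, support_marker(value))
```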

Gene context

In the default view, the tree-browser shows a tree at the left, and to the right of each gene it shows the region of the genome surrounding that gene. Within the genome context view, every gene is shown by a pointed box; the direction of the box shows what strand the gene is on. The gene box that corresponds to the gene tree entry will have its name in green and will (by default) be at the center of the display.

To help you determine which of the other genes shown are homologous to each other, related genes are shown with the same color. If a gene is in grey, it does not have an obvious ortholog in the current view, but you still might want to recenter on it to make sure. Non-protein-coding genes are in black. For an explanation of how protein-coding genes are colored, see the color option.

As in the gene tree, you can see more information about each gene by hovering your cursor on it, and clicking on the gene brings up a menu with more options.

The species tree

If you click on "show species tree", then the tree-browser will show you the species tree. The genome of the selected gene will be at the top of the species tree. Genomes that contain one or more genes from the shown portion of the gene tree will be shown in green. Related genomes that do not contain any of those shown genes will be shown in red. Please note that a genome will be shown in red even if it contains genes that are in the tree, as long as those genes are too distantly related to the anchor to be shown.

By default, closely related groups of genomes are collapsed to a single node so that the genome tree is more compact and comprehensible. These groups will be labelled with something like "Vibrionaceae (10 genes 8 genomes)" to show how many species are grouped together, and also how many of the genes in the shown portion of the gene tree are in those genomes. You can hover on this to see the names of some of those genomes, and you can click to see the full list of genomes or for more detailed information about the genome. The group will be in yellow if some genomes in the group contain genes from the shown section of the gene tree but other genomes in the group do not.
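The coloring rules for genomes and genome groups can be summarized in a short sketch. The function name and the input representation (one boolean per genome) are hypothetical. Treating a group in which every genome has a shown gene as green is an assumption, extrapolated from the rule for single genomes.

```python
from typing import Sequence


def genome_group_color(has_shown_gene: Sequence[bool]) -> str:
    """Pick a display color for a genome or a collapsed group of genomes.

    has_shown_gene holds one flag per genome: True if that genome contains
    a gene from the shown portion of the gene tree.
    """
    if not has_shown_gene:
        raise ValueError("a group must contain at least one genome")
    if all(has_shown_gene):
        return "green"   # assumption: every genome in the group has a shown gene
    if not any(has_shown_gene):
        return "red"     # no genome in the group has a shown gene
    return "yellow"      # mixed presence and absence within the group


print(genome_group_color([True, True]))          # all present
print(genome_group_color([False]))               # absent
print(genome_group_color([True, False, True]))   # mixed
```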

To indicate how the gene tree corresponds to the species tree, the genes or clusters in the gene tree are numbered in blue: 1, 2, 3, etc. This numbering is only shown when the species tree is shown. Each genome or group of genomes is labelled with the numbers of the genes that it contains, e.g. 1,11. Because closely related genes are shown as a single node in the gene tree, the same gene number can show up in several genomes. Conversely, even a single genome can contain several members of a gene family (that is, paralogs).
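To make the numbering scheme concrete, here is a sketch of how genome labels could be derived from the rows of the gene tree. The data layout (one list of genome names per gene or cluster, in display order), the function name, and the example genomes are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def genome_labels(gene_tree_rows: Sequence[Sequence[str]]) -> Dict[str, str]:
    """Label each genome with the numbers of the genes or clusters it contains.

    gene_tree_rows[i] lists the genomes represented by row i of the gene tree;
    rows are numbered from 1, as in the display.
    """
    numbers: Dict[str, List[int]] = defaultdict(list)
    for number, genomes in enumerate(gene_tree_rows, start=1):
        for genome in sorted(set(genomes)):
            numbers[genome].append(number)
    return {genome: ",".join(map(str, nums)) for genome, nums in numbers.items()}


# Row 2 is a cluster spanning two genomes; genome A also has a paralog in row 1.
rows = [["Genome A"], ["Genome A", "Genome B"], ["Genome C"]]
print(genome_labels(rows))
```

In this example the cluster in row 2 appears under two genomes, and "Genome A" carries two numbers because it holds two members of the family.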

The tree-browser's controls

The tree-browser has many controls that allow you to customize the display. After adjusting these settings, hit the "Update" button to see a new display.

Which tree to use

Domain used: This option lets you choose which gene family or which domain of the protein to show the tree for. MicrobesOnline includes pre-computed trees for every COG, Pfam, TIGRFam, SMART, PIRSF, SuperFamily, and Gene3D family. (Roughly speaking, COGs and TIGRFams are full-length gene families and Pfams, SMART, PIRSF, and Gene3D are domain families.) To ensure that virtually every gene has a tree, MicrobesOnline also includes trees for gene families that were identified by FastBLAST and for additional "ad-hoc" families. By default, the tree-browser chooses a tree that has the most aligned positions and the best coverage of tree-orthologs.

Coverage: Occasionally the tree-browser selects a family that does not include all of the close homologs of the gene of interest. To test for this problem, the tree-browser shows how many of the tree-orthologs of the anchor gene are present in the tree. You can check more thoroughly by clicking on "Check coverage of homologs." If the coverage is poor, try selecting a tree for a different family.

Which genes to show

Cluster: By default, the tree browser clusters together closely related clades. That is, given a clade in the gene tree whose members are all closely related to each other, it selects just one of them to show. You can turn this feature off by setting "Cluster" to none, or you can adjust the amount of clustering. Lower values allow more homologs to be grouped together so that you can see more distant homologs; the value corresponds roughly to the minimum %identity of the members of a cluster. The anchor gene is treated specially, and is only put in a cluster with genes that are >96% identical to it (unless you turn clustering off).

Genomes selected: The tree browser's clustering also depends on the "Genomes selected" at the top of the page. The tree browser always shows genes in selected genomes (they are never hidden inside clusters with other genes), and colors them magenta in the gene tree. You can change the list of selected genomes and then hit Update to make the tree browser show genes from specific genomes of interest.

Expand: You can "expand" a specific gene of interest by clicking on a leaf in the gene tree. Only collapsed nodes (those marked with a "+") can be expanded. Once a node is expanded, you can collapse it by clicking on the red minus sign.

Limit: You can also control how many genes or clusters to show in the gene tree. Beyond the "limit," more distant homologs are ignored. Showing fewer clusters creates a more compact display and greatly speeds up the browser when showing genome context. However, if there is an error in the tree, then the supposedly more distant homologs that were ignored may actually be important for understanding the function or history of the gene.

Gene context options

Overlapping genes on separate lines: If you are showing gene context, then by default the gene's context will be shown with a single line for each gene. This can make it hard to see overlapping genes. If you wish, you can place overlapping genes on separate lines instead.

Color: If you are showing gene context, then the genes are colored according to their homology group. By default, the tree-browser colors genes by COG, or, for genes that are in the anchor track and are not in COG, by their tree-orthologs. COGs are relatively broad homology groups. Alternatively, you can color by tree-ortholog (for genes in the anchor track) and by MOG (for other groups of genes). If a gene is in grey, that means it does not share a homology group with any other shown gene. However, because of paralogs, a grey gene in the anchor track could still have close homologs in the view. To learn more about a gene, click on it and select "recenter."

Species tree options

Cluster species: If you are showing the species tree, then you can control the extent to which similar genomes are grouped together. This is analogous to the "Cluster" control for the gene tree. However, the %identities for species are on a different scale because the species tree is built using the most highly conserved proteins. For example, between Escherichia coli and Salmonella typhimurium, the typical protein is about 20% different, but the species distance is only 1% or 0.01.

Simplify: It often happens that a gene is present in one bacterium but not in any of its relatives. By default, the tree-browser will group some of those relatives together, even though they do not form a clade, so that the species tree is more compact. If you want to see the details of gene presence/absence or if you want to check for horizontal gene transfer events, you should turn "Simplify" off. In particular, this will highlight cases where a genome contains the gene but multiple related groups of bacteria lack this gene. This suggests that the gene was acquired by horizontal gene transfer (although multiple independent losses of the gene could also have occurred). Before concluding that HGT occurred, you should check the tree's coverage.

Changing the tree's look

Rectangular style: By default, the tree-browser draws trees in a "straight" style, in which the vertical dimension is meaningless and the horizontal length of a branch indicates the amount of evolution on that branch. The tree-browser can use the traditional rectangular style instead.

Use branch lengths: If you wish, the tree-browser can ignore the branch lengths and show only the branching order.

How the trees are built

All of the trees shown in the tree-browser are pre-computed. Every time we update the MicrobesOnline database, we compute a new tree for every gene family and a new species tree.

Computing the gene trees

MicrobesOnline includes pre-computed trees for every COG, TIGRFam, Pfam, SMART, and PIRSF family, and for every Superfamily and Gene3D model. Because many genes have homologs but do not belong to any of these families, we also build trees for all of the additional families identified by FastBLAST. We do not build trees for PANTHER families because the alignments are highly gapped (many of the hits only align to a small fraction of the model). Instead, we build "ad hoc" trees for genes that were not included in any of the other trees. These ad-hoc trees include the seed gene and its homologs as identified by FastBLAST. Genes are assigned to COGs by reverse position-specific BLAST against the conserved domain database (CDD). Genes are included in COG trees for up to the two best hits for that part of the gene. This means that a gene can be in a COG tree even if it is not assigned to a COG, for example because it is a fragment or is a better hit to another COG. Genes are assigned to other families using HMMer 3 or FastHMM.

Once we have a list of homologous regions of proteins, we need to align them. We align HMM-based families with hmmalign from the HMMer package. We align COG families based on the individual profile alignments from PSI-BLAST. We align FastBLAST families based on their pairwise alignments to the seed from BLASTp. The alignments are trimmed slightly: positions that are gaps in ≥ 90% of the sequences are removed. This trimming is minimal, and makes the trees more sensitive to any errors in the alignments, but more aggressive trimming on large gene families results in very small numbers of aligned positions and poor phylogenetic signal, which would also lead to errors. Finally, we build phylogenetic trees with FastTree version 2, a fast and accurate approximately-maximum-likelihood method written by Morgan Price. The local support values are from the SH test. As many of the trees contain thousands of sequences, and some trees contain over 100,000 sequences, higher-quality tree-building methods, such as full maximum-likelihood or Bayesian methods, would be prohibitively slow. We perform midpoint rooting on the trees.
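The trimming rule is simple enough to sketch in a few lines. This is an illustration of the stated rule (drop positions that are gaps in 90% or more of the sequences), not the code MicrobesOnline runs; the function name, the gap characters, and the toy alignment are assumptions.

```python
from typing import List, Sequence


def trim_gappy_columns(
    alignment: Sequence[str],
    max_gap_fraction: float = 0.9,
    gap_chars: str = "-.",
) -> List[str]:
    """Remove alignment columns whose gap fraction is >= max_gap_fraction."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    # Keep a column only if fewer than max_gap_fraction of its entries are gaps.
    keep = [
        col
        for col in range(length)
        if sum(seq[col] in gap_chars for seq in alignment) / n_seqs < max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment of 10 sequences: column 2 is 90% gaps, column 3 is 80% gaps.
toy = ["ACG", "A-G"] + ["A--"] * 8
print(trim_gappy_columns(toy))
```

In the toy alignment only the middle column is dropped, which shows how mild the threshold is: a column that is 80% gaps survives.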

As mentioned previously, if you find a phylogenetic grouping of interest using the tree-browser, we strongly urge you to confirm it by building your own custom tree on MicrobesOnline. We have done this ourselves for dozens of genes. High-support nodes in the pre-built trees are usually correct, but on rare occasions, high-support nodes are strongly rejected by the custom tree. We suspect that this is because of alignment problems. A more common situation is that the custom tree is better resolved because more positions are aligned or because more positions are maintained after trimming. This is expected if the custom tree includes only close homologs. Also, sometimes the tree is for a gene family that does not include all of the close homologs of the gene of interest. The tree-browser can check for coverage.

If you want to download these alignments or these trees, please contact us.

Computing the species tree

MicrobesOnline includes a pre-computed species tree. This tree is updated regularly as new genomes are added. The tree includes almost all of the bacteria, archaea, and fungi in MicrobesOnline. (Genomes are not included if they are low-quality draft assemblies or if they are mixtures of two related species. The mixtures have also been renamed to end with "spp.") The prokaryotic portion of the species tree is based on approximately 78 COGs that are present as a single copy in most bacteria and archaea. Each COG was aligned with MUSCLE, using the -diags optimization. Positions were trimmed if they were gaps in more than 5% of genomes. We concatenated the alignments and used FastTree, a minimum evolution method, to infer a topology. The fungal portion of the species tree is based on a larger set of single-copy COGs. This tree is then spliced into the root of the prokaryotic tree with an arbitrary branch length.

Downloading data & making figures

You can download data from the tree-browser for further analysis or for making figures:

You can download the "gene context" view as an SVG diagram that can be edited with Inkscape (open-source) or Illustrator (from Adobe).
You can download the gene tree in the standard Newick format. Each node label is of the form locusId_version_domainBegin_domainEnd or locusId_domainBegin_domainEnd. We use MEGA3 (free to academics) for making figures.
You can download the species tree in the standard Newick format. Each node label is an NCBI taxonomy id (or an internal MicrobesOnline taxonomy id, if NCBI has not assigned one yet).
You can download information about the genes as a tab-delimited file.
You can download a fasta file with protein sequences. The description line for each entry contains the same fields as the gene information table, but separated by ":" instead of by tabs.
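When post-processing a downloaded gene tree, the leaf labels can be split into their fields. The following is a minimal sketch that assumes every field is a non-negative integer (the text above only states this for the locus ID); the class name, function name, and example labels are hypothetical.

```python
from typing import NamedTuple, Optional


class LeafLabel(NamedTuple):
    locus_id: int
    version: Optional[int]  # None when the label has no version field
    domain_begin: int
    domain_end: int


def parse_leaf_label(label: str) -> LeafLabel:
    """Parse locusId_version_domainBegin_domainEnd or locusId_domainBegin_domainEnd."""
    parts = label.split("_")
    # Assumption: all fields are plain decimal integers.
    if len(parts) not in (3, 4) or not all(p.isdecimal() for p in parts):
        raise ValueError(f"unrecognized leaf label: {label!r}")
    numbers = [int(p) for p in parts]
    if len(numbers) == 4:
        return LeafLabel(*numbers)
    return LeafLabel(numbers[0], None, numbers[1], numbers[2])


# Made-up labels for illustration only.
print(parse_leaf_label("12345_1_10_250"))
print(parse_leaf_label("12345_10_250"))
```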

For more information about the tree-browser, please contact us at gtlweb@vimss.lbl.gov.