Welcome to PHYLOViZ’s documentation!¶
PHYLOViZ is platform-independent Java software for the analysis of sequence-based typing methods that generate allelic profiles, together with their associated epidemiological data.
Download and install¶
PHYLOViZ core and several plugins are free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
Certain source files distributed by the PHYLOViZ Team are under the terms of the GNU General Public License Version 3 with the following clarification and special exception, but only where PHYLOViZ Team has expressly included it in the particular source file’s header.
Linking this library statically or dynamically with other modules is
making a combined work based on this library. Thus, the terms and
conditions of the GNU General Public License cover the whole combination.
As a special exception, the copyright holders of this library give you
permission to link this library with independent modules to produce an
executable, regardless of the license terms of these independent modules,
and to copy and distribute the resulting executable under terms of your
choice, provided that you also meet, for each linked independent module,
the terms and conditions of the license of that module. An independent
module is a module which is not derived from or based on this library.
If you modify this library, you may extend this exception to your version
of the library, but you are not obligated to do so. If you do not wish
to do so, delete this exception statement from your version.
Code licensed under this license may be reused in commercial products provided that changes made directly in the sources - bug fixes or enhancements - must be contributed back to PHYLOViZ, but new source files (as in new plugins) which you write that link to PHYLOViZ code do not need to be.
Choose the appropriate version for your operating system or the .jar file. The OS specific versions already contain some memory specific parameters to enhance the software performance when using large datasets.
See details about available plugins and the licenses under which they are covered.
Binaries¶
A cross-platform zip distribution package is available. Just unzip the package, enter the created directory and, in the sub-directory bin/, run phyloviz.exe or phyloviz64.exe (Windows) or phyloviz (Linux/MacOS), according to your operating system.
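NOTE: You may need to adjust some parameters in etc/phyloviz.conf with respect to memory usage. These settings (the -J-Xms, -J-Xmx and -J-Xss options, which set the JVM initial heap size, maximum heap size and thread stack size) have a strong impact on visualization features. For instance, on Windows, you may achieve better results with: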
default_options="--branding phyloviz -J-Xss8M -J-Xms32m -J-Xmx1024M --laf javax.swing.plaf.metal.MetalLookAndFeel"
IMPORTANT NOTICE: After installing, always go to the “Help” menu and choose “Check for updates” to install any new plugins or the latest updates to the PHYLOViZ software. The SNP analysis plugin is installed this way to demonstrate the plugin capability.
Source¶
All the source code is available in the new code repository on bitbucket.org. Check it out at https://bitbucket.org/phyloviz/phyloviz-main.
PHYLOViZ is built on top of the NetBeans Platform, thus we recommend NetBeans for the development of new plugins.
Loading data¶
File formats¶
To analyze and visualize your data, PHYLOViZ needs two separate files: one file contains the allelic profile data of the method you are using (Typing Data), while the other contains accessory data (Isolate Data). In the example image below they are sampleAPfile.txt and sampleADfile.txt, respectively.
The Typing data should be a tab-separated file containing the allelic profiles, formatted as follows: the first line should contain the column headers (usually locus identifiers, whether SNP, MLST or cg/wgMLST loci). The first column should be the allelic profile identifier (for MLST this would be the Sequence Type number; for any other method it could be a unique strain ID. However, if two strains have the same profile they should be given the same ID). The following columns are the loci used in the analysis.
If the Isolate data file is not used, the Typing data file should also represent the number of repeated profiles in a collection, that is to say that if a given profile appears in a collection n times it should be repeated in the Typing data file n times.
If an Isolate data file is used, the frequency of each type is given by the number of entries with a given Sequence Type in the Isolate file only; the frequency represented by repeated profiles in the Typing data file is not used.
You can find an example of MLST data correctly formatted here. Note that in this file several STs are represented by more than one isolate (e.g. ST3 was found in 6 isolates).
The Isolate data file can contain epidemiological and/or demographic data or any other data you want to visualize overlaid onto the results of the analysis algorithms. The link between the data in the two files is made by the Sequence Type identifier. You can find an example file correctly formatted here.
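The file layout and the frequency rule described above can be sketched as follows. This is only an illustration, not PHYLOViZ code: the column names (ST, locus1, isolate, country), the sample values and the helper functions are all invented for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical file contents; column names and values are invented.
TYPING_DATA = (
    "ST\tlocus1\tlocus2\tlocus3\n"
    "1\t1\t1\t1\n"
    "2\t1\t1\t2\n"
    "3\t2\t1\t2\n"
)
ISOLATE_DATA = (
    "isolate\tST\tcountry\n"
    "iso1\t1\tPT\n"
    "iso2\t3\tPT\n"
    "iso3\t3\tUS\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def st_frequencies(typing_rows, isolate_rows=None, key="ST"):
    """Count how often each profile identifier occurs.

    With isolate data, only the isolate entries are counted;
    otherwise repeated rows in the typing data are counted.
    """
    rows = isolate_rows if isolate_rows is not None else typing_rows
    return Counter(row[key] for row in rows)


typing_rows = read_tsv(TYPING_DATA)
isolate_rows = read_tsv(ISOLATE_DATA)

freq_with_isolates = st_frequencies(typing_rows, isolate_rows)
freq_without_isolates = st_frequencies(typing_rows)
print(freq_with_isolates)
print(freq_without_isolates)
```

In this toy data, ST 3 has a frequency of 2 when the isolate file is used, because two isolates carry that identifier, even though its profile appears only once in the typing data.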
Loading a Dataset¶
Go to File menu and choose Load Dataset.
If any errors are found while loading the data, they will be displayed in the session tab. In the following screenshot you can see an example where allelic profiles were repeated with different identifiers. In the example data, we created ST81 as a copy of the ST1 profile; PHYLOViZ detects it and eliminates it from the analysis.
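A check of this kind can be sketched in a few lines. The function below is a hypothetical illustration, not the PHYLOViZ implementation; in particular, keeping the first identifier seen for a profile is an assumption, and the profiles are invented.

```python
def drop_duplicate_profiles(profiles):
    """Keep the first identifier seen for each allelic profile.

    `profiles` maps an identifier to a tuple of allele numbers.
    Returns (kept, dropped), where `dropped` maps each removed
    identifier to the identifier whose profile it repeats.
    """
    seen = {}     # profile tuple -> first identifier that carried it
    kept = {}
    dropped = {}
    for ident, profile in profiles.items():
        if profile in seen:
            dropped[ident] = seen[profile]
        else:
            seen[profile] = ident
            kept[ident] = profile
    return kept, dropped


# Invented profiles: ST81 repeats the profile of ST1.
profiles = {
    "ST1": (1, 1, 1, 1, 1, 1, 1),
    "ST2": (1, 1, 1, 1, 1, 1, 2),
    "ST81": (1, 1, 1, 1, 1, 1, 1),
}
kept, dropped = drop_duplicate_profiles(profiles)
print(sorted(kept))
print(dropped)
```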
The dialog will now guide the user through loading the data. The first step is choosing a name for your Dataset, since PHYLOViZ now supports multiple datasets open simultaneously. You must also choose the Dataset Type from the drop-down menu.
The Dataset type can be MLST or MLVA, with any number of loci and without any missing data. Lines with missing data will be excluded on load. If you have installed the Single Nucleotide Polymorphism (SNP) plugin, you can also select it as the Dataset type. See the Sample Datasets page to access some test data for the sequence-based typing methods available.
The next step is loading the allelic profile data for the method you selected.
After loading the allelic profile data, you can choose a file with information on your isolates for which the allelic profile was loaded. The linking field, as explained before, should be the Sequence Identifier and should be selected in the Key dropdown menu.
Then the dataset is loaded and double clicking on the dataset name opens the available data.
Double clicking on Isolate Data and Typing Data in the tree menu under the dataset name opens the respective tabs.
The default view is the table view. Also available is the tree view, where it is easier to visualize what information is available in the different fields and to select combinations of fields with specific values.
Loading a remote Dataset¶
We can also load datasets from remote databases and services. PHYLOViZ already contains a list of available databases. We can choose Load Dataset from MLST DBs.
There are several datasets available from several providers. In the following example we select the Streptococcus pneumoniae dataset from PubMLST.org.
The next step is to download the dataset.
In the next window we can load ancillary data on isolates. In this example we choose to not load any data.
We can also load sequence data for each allele. They are downloaded individually and loaded as typing data.
At the end we have seven typing data items to explore and analyze.
Data analysis¶
In the current version of PHYLOViZ, you can analyze your data using the several algorithms described below. Press the Right Mouse Button on the Typing Data (now named with the method) and choose compute to access the available analysis algorithms.
goeBURST algorithm¶
Selecting goeBURST opens the dialog for the goeBURST algorithm. This algorithm has typically been used for MLST data analysis and was originally described in the article Global optimal eBURST analysis of multilocus typing data using a graphic matroid approach. The first step is choosing the Distance to be used. Currently eBURST Distance is the only one available, but others could be implemented. The eBURST distance follows the tiebreak rules discussed in the article.
The second step is choosing the level at which clonal complexes will be formed. The usual default for MLST analysis is SLV level. Choosing DLV or TLV level takes longer to compute, but can provide some insight into the relationships between clonal complexes formed at SLV and DLV level, respectively.
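To make the SLV/DLV/TLV levels concrete, the sketch below groups profiles that are connected by chains of links differing in at most a given number of loci. It is a simplified illustration with invented profiles: it only forms the groups, and it does not apply the goeBURST tiebreak rules or choose which links to draw.

```python
from itertools import combinations


def hamming(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def clonal_complexes(profiles, level=1):
    """Group identifiers linked by chains of at most `level` differing loci.

    level=1 corresponds to SLV, level=2 to DLV and level=3 to TLV.
    """
    parent = {ident: ident for ident in profiles}

    def find(x):
        # Follow parent pointers to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if hamming(profiles[a], profiles[b]) <= level:
            parent[find(a)] = find(b)

    groups = {}
    for ident in profiles:
        groups.setdefault(find(ident), []).append(ident)
    # Largest group first, mirroring the numbering from 0 for the largest CC.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Invented seven-locus profiles.
profiles = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (1, 1, 1, 1, 1, 1, 2),  # SLV of A
    "C": (1, 1, 1, 1, 1, 3, 3),  # DLV of A and of B
    "D": (5, 5, 5, 5, 1, 1, 1),  # at least 4 differences from the others
}
slv_groups = clonal_complexes(profiles, level=1)
dlv_groups = clonal_complexes(profiles, level=2)
print(slv_groups)
print(dlv_groups)
```

At SLV level, C is a singleton; at DLV level it joins the group of A and B, which is the kind of relationship between complexes that the higher levels expose.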
A goeBURST Output tab will appear and display the goeBURST algorithm results. It will contain information about the Clonal Complexes (CCs), namely the Sequence Types that compose them and what edges (the links between STs) were drawn in each CC.
In order to display the goeBURST tree view, it is necessary to expand the typing data on the Datasets tab, if it is not already expanded.
Double clicking on the goeBURST item that is now on the Dataset tree menu will show the display. The clonal complexes are arbitrarily numbered starting from 0 (for the CC with most STs), and each contains all the data relevant to the goeBURST analysis (STs in each group and the drawn SLV edges). The following screenshot summarizes the output for a single clonal complex with the test dataset used.
Multiple groups can be displayed simultaneously by selecting them using the CTRL/CMD and/or SHIFT keys.
goeBURST Full MST algorithm¶
Using an extension of the goeBURST rules up to \(n\)LV level (where \(n\) equals the number of loci in your dataset), a Minimum Spanning Tree-like structure can be drawn. This is typically used for SNP or cg/wgMLST datasets with dozens to thousands of loci.
Select goeBURST Full MST in the Compute options to draw it. Contrary to the standard goeBURST, the link statistics are not presented. After computation, double click on the goeBURST Full MST that appears under the dataset heading to visualize the result.
New options appear on the display: the Level selector and two new buttons, Get Groups and Save Groups. The Level represents the Locus Variant level and allows the removal of all links greater than the number shown. The user can use the up and down arrows or directly edit the number by clicking on it. The Get Groups button separates the display into the groups that are not connected at the chosen level, in order to simplify the analysis of larger datasets. This generates a display very similar to that of goeBURST, but at a higher link level. The Save Groups button creates an extra column in the isolate data with the title label goeBURST MST[\(x\)], with \(x\) being equal to the level used to create the groups.
Decreasing the Level selector allows the user to see how clonal complexes would relate to each other at a certain level. Levels 1, 2 and 3 are equivalent to calculating goeBURST at those levels (SLV, DLV and TLV, respectively). The following images show what happens to the dataset when you decrease the level. Level 4 is not displayed since no new groups are formed at that level.
At level 5 only two groups are formed in the sample dataset.
At level 3 (TLV level) some singletons appear. Level 4 is not shown since no changes were observed in the graph. This means that there are no two STs in the dataset that differ in 4 of the loci of their profiles.
At level 2, 6 groups appear with 4 or more STs each.
And finally at level 1, the equivalent of the most commonly used Clonal Complex definition by goeBURST, 17 groups with 2 or more STs are formed and there are 25 singletons in the dataset.
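The effect of the Level selector can be sketched with a plain minimum spanning tree. The example below uses Kruskal's algorithm on the number of differing loci, with ties broken by identifier order; this is a stand-in for, not a reproduction of, the goeBURST tiebreak rules, and the profiles are invented. Only the column label format goeBURST MST[x] is taken from the text above.

```python
from itertools import combinations


def hamming(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))


class DisjointSet:
    """Minimal union-find structure used to track connected groups."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the groups of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def minimum_spanning_tree(profiles):
    """Kruskal's algorithm; ties are broken by identifier order only."""
    edges = sorted((hamming(profiles[a], profiles[b]), a, b)
                   for a, b in combinations(sorted(profiles), 2))
    forest = DisjointSet(profiles)
    tree = []
    for diff, a, b in edges:
        if forest.union(a, b):
            tree.append((diff, a, b))
    return tree


def groups_at_level(profiles, tree, level):
    """Drop tree links above `level` and return the remaining groups."""
    forest = DisjointSet(profiles)
    for diff, a, b in tree:
        if diff <= level:
            forest.union(a, b)
    groups = {}
    for ident in sorted(profiles):
        groups.setdefault(forest.find(ident), []).append(ident)
    return sorted(groups.values(), key=lambda g: (-len(g), g))


# Invented seven-locus profiles.
profiles = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (1, 1, 1, 1, 1, 1, 2),
    "C": (1, 1, 1, 1, 1, 3, 3),
    "D": (5, 5, 5, 5, 1, 1, 1),
}
tree = minimum_spanning_tree(profiles)
for level in (7, 3, 1):
    label = f"goeBURST MST[{level}]"
    print(label, groups_at_level(profiles, tree, level))
```

Lowering the level from 7 to 3 detaches D, and lowering it to 1 also detaches C, leaving only the SLV link between A and B.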
Hierarchical Clustering¶
Selecting Hierarchical Clustering opens the dialog where you can select which method you want to apply. The first step is choosing the Distance to be used. Currently the Hamming distance is the only one available, but others could be implemented.
The second step is to select the Method. You can choose between complete-linkage, single-linkage, UPGMA (Unweighted Pair Group Method with Arithmetic mean) and WPGMA (Weighted Pair Group Method with Arithmetic mean). Selecting the method corresponds to selecting the criterion of minimal dissimilarity.
A Hierarchical Clustering Output tab will appear and display the results of applying the chosen method. A Leaf represents a Sequence Type and a Union represents a group that results from joining Leafs or Unions with Leafs. This joining process is displayed step by step in the Output tab. Finally, the number of ties that occurred is reported. Ties are broken by always choosing the first one found.
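The four linkage methods differ only in how the distance from an existing cluster to a newly joined one is computed. The sketch below is a naive agglomerative implementation written for illustration, with invented profiles and function names; it is not the PHYLOViZ code, although it follows the same "first one found" rule for ties.

```python
from itertools import combinations


def hamming(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def merged_distance(method, d_ik, d_jk, size_i, size_j):
    """Distance from cluster k to the union of clusters i and j."""
    if method == "single":
        return min(d_ik, d_jk)
    if method == "complete":
        return max(d_ik, d_jk)
    if method == "upgma":
        return (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
    if method == "wpgma":
        return (d_ik + d_jk) / 2
    raise ValueError(f"unknown method: {method}")


def hierarchical_clustering(profiles, method="upgma"):
    """Return the join steps as (members_i, members_j, height) tuples."""
    clusters = {i: (name,) for i, name in enumerate(sorted(profiles))}
    dist = {
        frozenset((i, j)): float(hamming(profiles[clusters[i][0]],
                                         profiles[clusters[j][0]]))
        for i, j in combinations(clusters, 2)
    }
    steps = []
    next_id = len(clusters)
    while len(clusters) > 1:
        # min() keeps the first minimal pair, i.e. the first tie found.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        height = dist[frozenset((i, j))]
        for k in clusters:
            if k not in (i, j):
                dist[frozenset((k, next_id))] = merged_distance(
                    method,
                    dist[frozenset((i, k))], dist[frozenset((j, k))],
                    len(clusters[i]), len(clusters[j]))
        steps.append((clusters[i], clusters[j], height))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return steps


# Invented seven-locus profiles.
profiles = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (1, 1, 1, 1, 1, 1, 2),
    "C": (1, 1, 1, 1, 1, 3, 3),
    "D": (5, 5, 5, 5, 1, 1, 1),
}
for method in ("single", "complete", "upgma", "wpgma"):
    heights = [h for _, _, h in hierarchical_clustering(profiles, method)]
    print(method, heights)
```

With these profiles all four methods join the same clusters in the same order; only the height of the last join changes, because D is at distance 4, 5 and 6 from A, B and C.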
In order to display the dendrogram view, it is necessary to expand the typing data on the Datasets tab, if it is not already expanded.
An icon corresponding to the hierarchical clustering computation should appear.
Double clicking on the Hierarchical Clustering item will show the display. This type of clustering is represented as a dendrogram. The following screenshot summarizes the output for the previous dataset. Sometimes it is necessary to fit the image to see the whole display at once. To do this, right-click over the display.
Some features were added to the visualization to improve and facilitate the analysis. These features are the following:
- Height scale
- Width scale
- Options Panel
- Search ST
- Filter by distance (cut off threshold)
- Export image
See section display and visualization for more information on these features.
Neighbor Joining¶
Selecting the Neighbor Joining algorithm opens the dialog where you can select which method you want to apply. The first step is choosing the Distance to be used.
The second step is to select the Criterion for the tree branch-length minimization. You can choose between the Saitou-Nei and Studier-Keppler criteria.
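As background, the sketch below shows the pair-selection step of neighbor joining in the Q-matrix form commonly given in textbooks, where Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and the pair with the lowest Q is joined first. How PHYLOViZ implements either of its two criteria is not shown here; the distance matrix and the function names are invented for the example.

```python
def q_matrix(dist):
    """Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)."""
    n = len(dist)
    totals = [sum(row) for row in dist]
    return [
        [0 if i == j else (n - 2) * dist[i][j] - totals[i] - totals[j]
         for j in range(n)]
        for i in range(n)
    ]


def closest_pair(dist):
    """Return the index pair (i, j), i < j, with the lowest Q value."""
    q = q_matrix(dist)
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # min() keeps the first minimal pair when several are tied.
    return min(pairs, key=lambda pair: q[pair[0]][pair[1]])


# Invented symmetric distance matrix for five taxa.
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
q = q_matrix(dist)
first_join = closest_pair(dist)
print(q[0][1], q[3][4], first_join)
```

Note that the pair joined first (taxa 0 and 1) is not the pair with the smallest raw distance (taxa 3 and 4): the criterion also accounts for how far each candidate is from all other taxa.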
A Neighbor Joining Output tab will appear and display the results of applying the chosen method. The information displayed is the same as in the Hierarchical Clustering Output tab.
In order to display the view, it is necessary to expand the typing data on the Dataset’s tab, if it is not already expanded.
Double clicking on the Neighbor Joining item will show the display. By default it is represented as a radial tree. The following screenshot summarizes the output for the previous dataset.
Some features were added to the visualization to improve and facilitate the analysis. These features are the following:
- Options Panel that includes changing the tree layout
- Search ST
- Filter by distance
- Export image
Display and visualization¶
Interface features¶
After running the selected algorithm, you will notice that the program then tries to optimize the display of the group with the largest number of elements in the data set. You can change the speed at which this occurs by moving the animation speed slider.
The Display tab offers the user the ability to search for an isolate, highlight the SLVs and DLVs, control the animation speed, and select different or multiple groups. You can fit any displayed graph to the window by right-clicking any open space (i.e. with no link or ST node) in the window.
goeBURST and goeBURST Full MST features¶
- Basic Interface
- SLV/DLV highlighting
- High Level Edges
- Saving Results
Hierarchical Clustering and Neighbor Joining features¶
- Height scale
- Width scale
- Filter by distance (cut off threshold)
- Re-scale edges
Color conventions¶
Link colors for goeBURST results:
- Black - Link drawn without recourse to tiebreak rules,
- Blue - Link drawn using tiebreak rule 1 (number of SLVs),
- Green - Link drawn using tiebreak rule 2 (number of DLVs),
- Red - Link drawn using tiebreak rule 3 (number of TLVs),
- Yellow - Link drawn using tiebreak rule 4 or 5 (frequency found in the data set and ST number, respectively),
- Gray - Links drawn at DLV (darker gray) or TLV (lighter gray) if the groups are constructed at DLV/TLV level.
Link colors for goeBURST Full MST results: The goeBURST Full MST algorithm links use a grayscale, with darker links having fewer differences between the profiles than lighter gray links. To find the number of differences that a link represents, click on the link in the Display window.
ST nodes colors:
- Light green - Group founder
- Dark green - Sub-group founder
- Light blue - Common node
- Red - Selected node
Querying and visualizing the data¶
The main goal of PHYLOViZ is to provide a data visualization tool for overlaying accessory data on the results of the data analysis algorithms. This allows testing the method’s adequacy to the data, or proposing novel hypotheses. This section explains the basics of how this can be achieved in our software. The user can query the data using regular expressions, by manually selecting the desired fields from the table, or simply by using the checkboxes in the tree view. Using your dataset and these instructions you should be able to create visualizations similar to the ones found on the PHYLOViZ website.
The isolate data tab¶
The Isolate Data tab is displayed by double clicking on Isolate Data in the Dataset tree. The following screenshot summarizes the basic functionality of the display in the table view.
The typing data tab¶
The Typing Data tab contains the allelic profiles loaded in the dataset. The name displayed on the tab, and on the Dataset tree, is the name of the method selected during the Load Dataset procedure. The user can also query, select and visualize the allelic profile data, similarly to the operations described for the Isolate Data tab.
Regular expression primer¶
The following basic regular expressions can be used in PHYLOViZ. For more complex expressions there are extensive tutorials on regular expressions online; just search for Regular Expression or regex.
- . (period) - Represents any character.
- [ ] (square brackets) - Matches any one of the characters inside the square brackets, for exactly one character position. Examples: [40] will match any field with 4 or 0; [7-9] will match any field with 7, 8 or 9 (- is the range separator).
- ^ (caret) - Starts with. Example: ^P will give you all the fields that start with a P. Inside square brackets it means negation. Example: [^a-c] means anything that is not a, b or c.
- $ (dollar sign) - Ends with. Example: 7$ will give you all fields that end in a 7.
- ? (question mark) - Matches the preceding character 0 or 1 times only. Example: colou?r will find color and colour.
- * (asterisk) - Matches the preceding character 0 or more times. Example: tre* would find tree, tread and trough.
- + (plus) - Matches the preceding character 1 or more times. Example: tre+ would find tree, tread but not trough.
- {n} (any integer between braces) - Matches the preceding character exactly n times. Example: AT[GC]{2} would match ATGC, ATCG, ATGG or ATCC but not ATGA.
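These operators can be tried outside PHYLOViZ with any regular expression engine. The sketch below uses Python's re module on invented field values; the basic operators listed above behave the same way there, but treating a query as "found anywhere in the field" is an assumption drawn from the examples in this primer.

```python
import re

# Invented field values, chosen to mirror the examples in the primer.
fields = ["tree", "tread", "trough", "color", "colour",
          "ATGC", "ATGA", "P17", "x7"]


def matching(pattern, values):
    """Return the values in which the pattern is found anywhere."""
    return [value for value in values if re.search(pattern, value)]


optional_u = matching(r"colou?r", fields)
one_or_more_e = matching(r"tre+", fields)
zero_or_more_e = matching(r"tre*", fields)
two_of_gc = matching(r"AT[GC]{2}", fields)
starts_with_p = matching(r"^P", fields)
ends_with_7 = matching(r"7$", fields)

print(optional_u)
print(one_or_more_e)
print(zero_or_more_e)
print(two_of_gc)
print(starts_with_p)
print(ends_with_7)
```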
All these operators can be combined to create complex search expressions. For example, ^st[GC].*6$ would find any field that starts with st followed by a C or a G, then has 0 or more characters, and ends with a 6. The following screenshot shows the result on the test dataset:
Queries using the table view¶
In the Table view of the Data tab you can manually select any field you want to represent by left clicking on it. That will automatically display all the entries with the selected value and not only the selected ones. To select multiple fields you can press the CTRL key (or CMD on Mac) while clicking on the desired fields. If you keep the SHIFT key pressed you can select ranges of cells.
You can also automatically select multiple columns by clicking with the right mouse button on the table headers and pressing the Select button.
Finally to plot the data on the Display tab, press the View button, after all the desired selections are performed.
Queries using the tree view¶
The Tree view offers a faster way to create simple queries. The user can also use the regex filter to search the dataset, but all the possibilities for each dataset column are automatically indexed in a tree-like manner. By pressing the Select button and switching to Table view the user can see the resulting selection. The user can alternate between both views (Table and Tree) at will when creating the selection.
Query examples¶
- Tree view with selections
- Queries on the results produced by the goeBURST and goeBURST Full MST algorithm
- Queries on the results produced by the Hierarchical Clustering algorithm
- Queries on the results produced by the Neighbor Joining algorithm
Exporting the results to an image file¶
To export the resulting graphs to an image file, click on the Options button and choose Export. Select the appropriate file format for the intended purpose. We recommend PNG images for presentation quality and EPS for publication quality.
Project management¶
A PHYLOViZ Project allows users to save their ongoing studies and update them as needed. It is a time-saving feature when working with large data sets and essential for efficiently sharing results, since the saved projects can then be shared. Each project includes the data under analysis, results of inference algorithms, visualization serializations and related customizations.
Saving¶
Right click on the dataset you would like to save and choose the option Save as Project. As you can see we’ll save a DataSet that has the isolate data integrated in the visualization.
Finally, you can choose where to save your project. A dialog appears if you are overwriting an existing project or creating a new one with a name that is already taken in the chosen directory.
Loading¶
Go to File menu and choose Open Project.
The next step is to find the project that you would like to load. After finding it click on Open Project.
This action will open the Projects Tab where you can see your project listed and many others that were previously opened.
To restore the study, right-click on the project and select Load DataSet. This will open the Datasets tab with your saved study.
The project is now loaded with all the work that was done before, as we can see in the following screenshot. You can check that the isolate data integrated on saving was restored completely.