Selection of organisms for the co-evolution-based study of protein interactions
The prediction and study of protein interactions and functional relationships based on the similarity of phylogenetic trees, exemplified by mirrortree and related methodologies, is widely used. The results of these methods are expected to depend strongly on the characteristics of the phylogenetic trees used, especially the set of organisms used to construct them (number, taxonomic distribution, ...). Nevertheless, this dependence, although suspected, had not been studied so far. Most previous works used as many organisms as possible (or those available in a given database or resource) to construct the trees and later evaluate their similarities.
The goal of this work was to study how the results of mirrortree and two of its more recent variants depend on the set of organisms used for constructing the trees. To that end, we took different subsets of organisms sampled according to different taxonomic criteria, generated phylogenetic trees based on them, and evaluated the performance of these methodologies on these trees. For this evaluation we used as gold standards sets of interactions of different nature (physical, functional, ...) (see Figure).
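As a rough illustration of the procedure described above, the sketch below scores the similarity of two protein families' trees on a chosen subset of organisms. It assumes the basic mirrortree idea of taking the linear (Pearson) correlation between the two families' inter-organism distance values; the function names, the dictionary-based distance representation, and the toy distances are hypothetical and are not taken from this work, and the more recent variants are not covered.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two organism pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("distances are constant; correlation is undefined")
    return cov / sqrt(var_x * var_y)


def mirrortree_score(dist_a, dist_b, organisms):
    """Correlate the distances of two families over the given organisms only.

    dist_a and dist_b map frozenset({org_i, org_j}) to the evolutionary
    distance between the orthologs of one protein family in those organisms.
    """
    pairs = [frozenset(p) for p in combinations(sorted(organisms), 2)]
    xs = [dist_a[p] for p in pairs]
    ys = [dist_b[p] for p in pairs]
    return pearson(xs, ys)


def make_matrix(organisms, values):
    """Build a toy distance lookup from a flat list of pairwise distances."""
    pairs = combinations(sorted(organisms), 2)
    return {frozenset(p): v for p, v in zip(pairs, values)}


# Toy, invented distances for two hypothetical protein families.
all_orgs = ["org1", "org2", "org3", "org4"]
dist_a = make_matrix(all_orgs, [0.1, 0.4, 0.5, 0.3, 0.6, 0.2])
dist_b = make_matrix(all_orgs, [0.2, 0.5, 0.4, 0.3, 0.7, 0.1])

# The same pair of families scored on the full set and on one subset.
full_score = mirrortree_score(dist_a, dist_b, all_orgs)
subset_score = mirrortree_score(dist_a, dist_b, ["org1", "org2", "org3"])
print(full_score, subset_score)
```

Because the score is computed only over the organism pairs passed in, repeating the call with different subsets (for example, sampled by taxonomic criteria) shows how the same pair of families can score differently depending on the organisms chosen.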
We found that the performance of these methodologies depends on the set of organisms used for building the trees, and that it is not always related to the number of organisms in a simple way. Certain subsets of organisms seem to be more suitable for predicting certain types of interactions. Moreover, the optimal set of organisms also depends on the method used. Overall, our results allow us to propose a number of general "recipes" for users on which set of organisms and which method to use, depending on the type of interactions they want to predict, the genomic information available and the computational resources.
The general conclusion is that, instead of using all available genomes, or the same set in every situation and for any type of interaction, it is advisable to use different sets of organisms depending on the available computational resources and data, as well as on the type of interactions of interest. Moreover, with the increasing number of fully sequenced genomes, a point will come at which it is impossible to use all available genomes.
© 2010, Computational Systems Biology Group. CNB-CSIC