                          X D E T   v 4.3

                     F. Pazos  &  A. Valencia

                               -o-


Information
===========

XDET implements two methods for detecting positions in multiple sequence
alignments with a "family-dependent" or "function-dependent" conservation
pattern. These positions have been shown to be related to functionality,
and they complement the fully conserved positions as predictors of
functionality from sequence information. They are usually related to
functional specificity.

The first method is the "mutational behaviour (MB) method". This method,
previously implemented in the "mtreedet" program, compares the mutational
behaviour of a position with the mutational behaviour of the whole
alignment, on the premise that positions showing a family-dependent
conservation pattern have a mutational behaviour similar to that of the
whole family. The mutational behaviour of a position is represented by a
matrix containing all the similarities between the amino acids of the
proteins at that position. The mutational behaviour of the whole alignment
is represented by an equivalent matrix containing the similarities between
the corresponding proteins. Both matrices are compared with a
non-parametric rank correlation criterion. Hence, for each position the
method produces a score, which is that correlation coefficient. Positions
with high scores are taken as the predicted functional sites.

The second method is a variation of the first one which can use an
arbitrary external functional classification instead of relying on the one
implicit in the alignment. This possibility is intended for cases where
some degree of phylogeny/function disagreement is suspected. The external
functional classification is incorporated in the form of a matrix of
"functional similarities" between proteins.

The original "MB-method" is described in:

  * Antonio del Sol, Florencio Pazos & Alfonso Valencia. (2003).
    Automatic Methods for Predicting Functionally Important Residues.
    Journal of Molecular Biology. 326(4):1289-1302.

The new method, which can incorporate an external functional
classification, is described in:

  * Florencio Pazos, Antonio Rausell & Alfonso Valencia. (2006).
    Phylogeny-independent detection of functional residues.
    Bioinformatics. 22(12):1440-1448.

NOTE: Contact Antonio Rausell (arausell@cnio.es) to obtain the other
program described in the second paper ("MCdet").

One possible measure of protein functional similarity is the similarity of
"interaction contexts". Fed with that information, Xdet would detect
positions responsible for interaction specificity:

  * Borja Pitarch, Juan AG Ranea, Florencio Pazos. (2021).
    Protein Residues Determining Interaction Specificity In Paralogous
    Families. Bioinformatics. 37(8):1076-1082.

If matrices with similarities of physico-chemical properties (instead of a
substitution matrix) are used to represent the mutational behaviour of the
alignment positions, it is possible to detect the physico-chemical
properties most related to the changes in specificity, which provides
clues on the molecular mechanism behind that specificity:

  * Florencio Pazos. (2021).
    Prediction of Protein Sites and Physicochemical Properties Related to
    Functional Specificity. Bioengineering 8(12):201.

Please cite these references when reporting any result obtained using
this program.

See http://csbg.cnb.csic.es/pazos/Xdet for updated versions of the
program, additional data and links to other resources.


Using the program
=================

The program is distributed as a standalone executable file (e.g. "xdet"
for UNIX-based OSs or "XDET.EXE" for MS-Windows(R)). Running the program
without command-line arguments prints brief usage information and a short
description of the output:

-----------------------------------------------------------------------------
Xdet-Propdet-Mtreedet v.
4.3
Florencio Pazos
pazos@cnb.csic.es

** Usage: xdet aln_file(HSSP|PIR/FASTA) matrix(Maxhom|raw) [options]

Options (Take a look at the user manual for a detailed description):
  -E       Skip entropy calculation.
  -S=n     Generate 'n' random alns in order to associate Z-scores and
           P-values. to the correlation scores.
           The default is to skip suffling and report a single
           correlation score.
           It can take a long time for the program to run for large 'n'
           and many sequences.
  -M=file  Read and use protein 'similarity' matrix from an external file
           instead of calculating it from the alignment.
           This option is used to 'impose' and external functional
           classification. instead of the one implicit in the alignment.
           Line format: prot_nr1 prot_nr2 functional_similarity
  -C=prop_file1 -C=prop_file2 ...
           Read files with aminoacid properties to calculate individual
           properties correlated with functions.
           Many files can be included using '-C' more than once.
           File format:
             prop_name   -no spaces or blanks; truncated to 10 chars.-
             aa1 value
             aa2 value
             ....the rest of 20 aa.
             ...
           Aminoacids ('aa1', 'aa2', ...) in 1-letter code.
           Any prop. for gaps is 0.0 unless explicit in file ('.' or '-')
  -D=[path][wildcard]
           Read ALL property files matching that expression.
           Examples: -D=\*.prop   -D=props/hdf_\*
           (You might need to escape '\' wildcards)
           '-D' and '-C' can be combined.
           The expression is treated as RECURSIVE in the Windows version.
  -c       Add as an additional aminoacid property the combination of all
           the '-C/-D' properties.
  -N       Use Pearson's normal correlation instead of the default
           Spearman's non-parametric rank correlation.

Output format:
|%4d %4c %4d %c %3d %6.3f %c %3d xx %9.4f [%9.4f %e %9.4f %e]|
  nr   aa pdbn chain acc entrop secstr var xx MTD_VAL [Zscr_pos Pval_pos Zscr_glob Pval_glob]
For properties (-C, -D options):
|%4d %4c %4d %c %3d %-11s %c %3d xP %9.4f [%9.4f %e %9.4f %e]|
  nr   aa pdbn chain prop_nr prop_name secstr var xP MTD_VAL [Zscr_pos Pval_pos Zscr_glob Pval_glob]
-----------------------------------------------------------------------------

The basic input for the program is a multiple sequence alignment. This can
be provided in HSSP, PIR or FASTA format. Programs for generating multiple
sequence alignments, such as ClustalW, can usually generate PIR and/or
FASTA files.

The results of the program TOTALLY depend on the multiple sequence
alignment used. For the MB-method (old Mtreedet program) the alignment
should contain a good representation of the sequence space around the
query sequence, without sequences too far from the query and without
redundant sequences. A set of parameters for generating multiple sequence
alignments that gives good results for this method is:

  - Retrieve homologs up to BLAST E-value 10e-2.
  - Re-align these homologs (clustalw, t_coffee, ...). Do not use the
    original BLAST alignment.
  - From the alignment, remove fragments (sequences aligning with less
    than 60% of the master).
  - Remove outliers (%SEQID<25% with master).
  - Remove redundancy (%SEQID>95%).
  - Do not use alignments with fewer than 14 sequences.

Nevertheless, the user should try many alignments with different
combinations of parameters.

For the method which uses an external set of functional similarities, it
does not matter how divergent the sequences in the alignment are
(structural alignments, etc.) as long as the alignment is correct.

An amino acid homology matrix is also required as input. The program can
read the Maxhom format (see the web page above to obtain matrices in this
format) and a raw format, which allows the user to supply any other
matrix.
The raw format is just:

aa1<1SPC>aa2<1SPC>similarity

where 'similarity' is a real number. For example:

-----------
A A 28.0
A C -28.0
A D -13.0
A E -8.0
A F -37.0
A G -5.0
A H -31.0
A I -19.0
.....etc....
-----------

An external functional classification can be imported with the "-M"
option. The parameter for this option is a file with "functional
similarities" for each pair of proteins in the alignment. The format of
this file is

prot_nr1 prot_nr2 functional_similarity

where 'functional_similarity' is a real number. Proteins are numbered
according to their order in the multiple sequence alignment. For example:

----------------------
1 2 100.0
1 3 80.0
1 4 0.0
....etc....
----------------------

In this example, "functional similarities" are given as percentages. The
1st protein in the alignment is functionally identical to the 2nd one,
while it does not have any functional similarity with the 4th one.
'functional_similarity' can be on any scale. See the references above for
different examples of "functional similarities" and different ways of
quantifying them. If this option is used, a 'functional_similarity' value
should be given for ALL pairs of proteins; otherwise the program
complains.

An arbitrary number of amino-acid properties can be read in order to
detect properties related to the functional specificities (see above). A
property file should contain the name of the property on the first line,
followed by the property value for each of the 20 amino acids in the
format:

aa value

For example:

----------------
hdf_Eisenb
A 0.620000
R -2.530000
N -0.780000
...etc...
----------------

These files are imported with the -C and -D options (see above). Note that
-C/-D can be combined with -M in order to run the method in supervised
mode while also scoring amino-acid properties.
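
As an illustration of the "-M" file format described above, the following
Python sketch writes one line per pair of proteins from a list of
functional class labels. The function name, the labels and the 100/0
scoring are hypothetical choices made for this example and are not part of
Xdet. It also assumes that listing each pair once (prot_nr1 < prot_nr2) is
sufficient, which should be checked against the behaviour of the program.

```python
from itertools import combinations


def write_similarity_file(path, classes, same=100.0, different=0.0):
    """Write 'prot_nr1 prot_nr2 functional_similarity' for every pair.

    `classes` holds one (hypothetical) functional label per protein, in
    the same order as the sequences in the alignment. Proteins are
    numbered from 1. Returns the number of pairs written.
    """
    n_pairs = 0
    with open(path, "w") as out:
        # combinations() yields every unordered pair exactly once (i < j).
        for (i, a), (j, b) in combinations(enumerate(classes, start=1), 2):
            value = same if a == b else different
            out.write(f"{i} {j} {value:.1f}\n")
            n_pairs += 1
    return n_pairs


if __name__ == "__main__":
    # Four proteins in alignment order; labels are made up for the example.
    labels = ["kinase", "kinase", "phosphatase", "kinase"]
    print(write_similarity_file("LIGAND_SIMILARITY.TXT", labels), "pairs written")
```

A binary same/different scale is only the simplest option; since
'functional_similarity' can be on any scale, graded values can be written
in the same way.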

To statistically assess the significance of the scores, the program can
shuffle the alignment a number of times and calculate z-scores and
p-values of the original scores with respect to the distribution of scores
of the shuffled alignments. Shuffling is done by changing the order of the
sequences in the multiple sequence alignment. To activate this shuffling,
simply add "-S=n" as an option, where 'n' is the number of shuffled
alignments. Two p-values and two z-scores are reported for the score of
each position. They are based on two background distributions of scores
obtained from the shuffled alignments: one with the scores for that
position only, and another one with the scores of all positions. We are
still working on defining a good null model and background distributions
for this problem, so this option is still experimental. If this option is
not used, p-values and z-scores are not calculated and only the raw score
is reported. Please note that large 'n' and big alignments can result in
very long running times.


Output description
==================

The output of the program consists of one line for each position in the
multiple sequence alignment. The main score of the program (correlation
between the amino acid similarities within a position and the overall
similarities -sequence or "functional" similarities- between proteins) is
in the 9th column. High values of this parameter are associated with
positions related to functional specificity. This value is not high for
fully conserved positions (see below). Nevertheless, fully conserved
positions are the main indicators of functionality. Conserved positions
can be detected by the entropy value (5th column) or by the HSSP VAR
parameter (7th column) (see below). Only positions with a percentage of
gaps lower than a hardcoded threshold (10%) are listed.

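
As an alternative to "sort", the output can be read with a short script.
The Python sketch below is only an illustration: the function names are
hypothetical, and it assumes that fields are separated by whitespace and
that the "xx" or "xP" label immediately precedes the score, as in the
format strings printed by the program. Locating the score through that
label (rather than by a fixed column number) keeps the sketch usable when
an extra PDB chain column is present.

```python
def parse_xdet_line(line):
    """Return a dict for one Xdet output line, or None for other lines."""
    fields = line.split()
    # Find the 'xx' (position score) or 'xP' (property score) label.
    marker = next(
        (k for k, f in enumerate(fields) if k >= 2 and f in ("xx", "xP")),
        None,
    )
    if marker is None or marker + 1 >= len(fields):
        return None
    try:
        position = int(fields[0])
        score = float(fields[marker + 1])
    except ValueError:
        return None
    return {
        "position": position,
        "residue": fields[1],
        "kind": "property" if fields[marker] == "xP" else "position",
        "score": score,
        # Scores below -1.0 flag positions where no calculation was done.
        "computed": score >= -1.0,
    }


def top_positions(lines, n=10):
    """Return the n computed rows with the highest scores."""
    rows = [r for r in map(parse_xdet_line, lines) if r and r["computed"]]
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:n]


if __name__ == "__main__":
    import sys

    # Usage (assumed): xdet myaln.fasta matrix.txt | python this_script.py
    for row in top_positions(sys.stdin):
        print(row["position"], row["residue"], row["kind"], row["score"])
```

Rows flagged with a score lower than -1.0 are dropped before ranking, so
fully conserved and gappy positions do not appear in the result.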
----------------------------------------------------
   7    N    7    26 -1.000 H  46 xx    0.2651
   8    C    8    14  0.000 H   0 xx   -2.0000
   9    I    9     0  2.286 C  30 xx    0.3138
  10    K   10    65  2.842 C  40 xx    0.1919
  11    C   11     0  0.979 C  16 xx   -0.0020
  12    K   12    18  2.541 C  39 xx    0.2425
  16    C   16    13  0.000 H   0 xx   -2.0000
  17    V   17     5  2.450 H  35 xx   -0.0509
  18    E   18   118  2.712 H  35 xx    0.1402
 .........
  ......
   ....
    .
----------------------------------------------------
   1    2    3 c   4      5 6   7  8         9

Column 1: Position number. Positions are numbered as in the multiple
          sequence alignment. GAPS are included in this numbering.
   "   2: Amino acid in the master sequence (1st sequence of the
          alignment).
   "   3: PDB numbering. Database HSSP alignments include the PDB
          numbering of the master sequence. If this information is not
          available, the position number (1st column) is also reported
          here.
   "   c: Database HSSP alignments may include a PDB chain identifier.
          That would be reported in this column.
   "   4: Solvent accessibility taken from the HSSP file. "-1" if not
          available.
   "   5: Sequence entropy of the position. A measure of conservation
          (0: fully conserved). "-1.000" indicates that entropy has not
          been calculated for that position because it contains a
          fraction of gaps higher than a hardcoded limit, or that entropy
          calculation has been disabled with the "-E" option.
   "   6: Secondary structure code taken from the HSSP file. "-" when not
          available.
   "   7: Variability (VAR) taken from the HSSP file. Another measure of
          conservation (0: fully conserved; 100: fully variable). "-1"
          when the input file is not HSSP.
   "   8: Reserved.
   "   9: Correlation value. Main score of the method. This value goes
          from -1.0 (position "anti-correlated" with the functional
          classification) to 1.0 (position perfectly correlated with the
          functional classification). Values lower than -1.0 are flags
          indicating that the calculation was not done (e.g. "gappy
          position"). Fully conserved positions (entropy=0.0 -5th
          column-) also have a value lower than -1.0 (e.g.
          positions 8 and 16 in the example above).

If amino-acid properties are imported (see above), the output contains, in
addition to the lines for the alignment positions in the format above,
extra lines with the scores for each property/position. These are labelled
with "xP" (instead of "xx") in column 8 and differ in some fields.
Nevertheless, the absolute positions of the common fields are the same and
the SCORE is always in column 9, so that a "sort" by that column will sort
both types of lines. For example:

-----------------------------------------------------
  43    K   43     1 MW          -  -1 xP    0.3273
  55    Y   55    14 hdf_Eisenb  -  -1 xP    0.3061
  12    S   12    17 rel_mut     -  -1 xP    0.2889
  55    Y   55    -1  0.911 -  -1 xx    0.2795
 .....
  ...
   .
-----------------------------------------------------

In this example all lines are scores for properties (molecular weight in
position 43, hydrophobicity in position 55 and mutability in position 12),
except the last one, which is the "normal" score for position 55.

If the "-S" option is used, 4 additional columns contain the z-scores and
p-values calculated with respect to the shuffled scores for that position
and the shuffled scores for all positions, respectively (see above). For
the positions where calculations could not be done (raw_score < -1.0) the
z-scores and p-values are reported as "[]" and 2.0E+00, respectively.

------------------------------------------------------------------------------------------------------
   9    V    9    -1  1.694 -  -1 xx    0.1142    3.7149 0.000000e+00    1.6702 5.958904e-02
  10    G   10    -1  0.000 -  -1 xx   -2.0000        [] 2.000000e+00        [] 2.000000e+00
  11    A   11    -1  2.174 -  -1 xx    0.1929    3.2576 0.000000e+00    2.8426 7.534247e-03
  12    G   12    -1  2.102 -  -1 xx    0.3471    9.7135 0.000000e+00    5.1424 0.000000e+00
 .............
------------------------------------------------------------------------------------------------------


Examples
========

1: xdet 5fd1.hssp Maxhom_McLachlan.metric > 5fd1.xdet

2: xdet myaln.fasta Maxhom_McLachlan.metric | sort -nr -k 9 > myaln.xdet
   (Output already sorted by score. Note: for HSSP files which include
   the PDB chain the sort should be "sort -nr -k 10", since an additional
   column is present.)

3: ~/bin/xdet myaln.fasta mymatrix.txt -E | sort -nr -k 9 | head -10 >myaln.treedet
   (Report only the 10 positions with highest scores. Do not calculate
   entropy.)

4: XDET.EXE TEST.PIR MATRIX.TXT -M=LIGAND_SIMILARITY.TXT
   (Use an external matrix of functional similarities between proteins
   instead of the one implicit in the sequence relationships of the
   alignment. -Supervised mode-)

5: ~/bin/xdet myaln.fasta Maxhom_McLachlan.metric -C=./hdf.prop -C=./MW.prop | sort -n -k 9
   (Besides scoring the positions, also report the scores for two
   amino-acid properties, and sort both types of scores together.)

6: xdet myaln.fasta Maxhom_McLachlan.metric -D=../../props/\*.prop | grep "xP"
   (Calculate the scores for the properties in all *.prop files of the
   specified path. Filter out lines with position scores and report only
   those with property scores.)


--

See http://csbg.cnb.csic.es/pazos/Xdet for updated versions of the
program, additional data and links to other resources.

Please cite the references above when reporting any data obtained using
this program.

Send any query/comment to the following address. Use this address also for
reporting bugs. We will be very happy to hear about any result (good or
bad ;-) you may obtain using this program.

Florencio Pazos Cabaleiro.
Protein Design Group.
Centro Nacional de Biotecnologia (CNB-CSIC)
Campus UAM
Cantoblanco. 28049 Madrid.
e-mail: pazos@cnb.csic.es
Tlf. +34.915854669.
Fax. +34.915854506