DNA2PRO: This program takes as input DNA sequences from isolated phage particles that were affinity selected against a protein or small-molecule ligand. DNA2PRO translates the sequences of inserts from the New England Biolabs Ph.D.-12™ or Ph.D.-7™ libraries, or from any other display library construct if provided with the appropriate start and end sequences of the vector. The FASTA-format list of peptide sequences can then be used as input for further analysis using RELIC or other web-based software.
Input 1: Please select the appropriate NEB library, or enter the start sequence and end sequence as found in the DNA sequence input files. Please also enter the number of bases between the start and end sequences that will be translated into the peptide insert.
Input 2: Please upload the DNA sequence text files with the inserts to be translated. Please click here for an example input file with the correct format.
An example of a DNA sequence input text file is given here, with the start and end sequences shown in red. The 36 bases in between are translated as the peptide insert.
CTAAAGTTTTGTCGTCTTTCCAGNNGTTAGTAAATGAATTTTCTGTATGGGATTTTGCTAAACAACTTTCAACAG
Output: Five output files will result:
1. Peptide sequences: A list of peptide sequences in FASTA format is provided for easy viewing and download. The 12mer peptide sequence translated from the above raw DNA sequence is: seantqapfsrp
2. Text file: File # 1 s37671.anl_10_cp1.seq
3. DNA sequences of insert
4. Sequence files
5. Invalid DNA sequences
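The extraction and translation step described for DNA2PRO can be sketched roughly as follows. This is a minimal illustration, not the program's actual implementation: the function names `extract_insert` and `translate` are hypothetical, the standard genetic code is assumed, and no attempt is made to handle reads from the opposite strand or library-specific codon conventions.

```python
# Illustrative sketch only; names and behaviour are assumptions, not DNA2PRO itself.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by enumerating codons in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def extract_insert(dna, start_seq, end_seq, insert_len):
    """Return the insert_len bases lying between start_seq and end_seq, or None."""
    dna = dna.upper()
    start_seq, end_seq = start_seq.upper(), end_seq.upper()
    pos = dna.find(start_seq)
    while pos != -1:
        begin = pos + len(start_seq)
        stop = begin + insert_len
        if dna[stop:stop + len(end_seq)] == end_seq:
            return dna[begin:stop]
        pos = dna.find(start_seq, pos + 1)
    return None  # a read like this would be reported as invalid


def translate(insert):
    """Translate codon by codon; codons with ambiguous bases (e.g. N) become 'X'."""
    usable = len(insert) - len(insert) % 3
    codons = (insert[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

For a Ph.D.-12 style read, `insert_len` would be 36, matching the 36 bases mentioned above; the start and end sequences must be supplied by the user.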
AAFREQ: Amino Acid Frequency
This program uses as input a peptide file and calculates the frequency of the amino acids within the population as a function of position within the recombinant insert, thereby indicating which amino acids are over- or under-represented. It can determine position-specific and residue-specific values, such as how many threonines occur in position 1 of the inserts.
Examination of the data in the table below indicates that there are no prolines at the amino-terminal position of the inserts (insert position #1), but a significant number at all the other positions. This frequency distribution pattern indicates significant bias against proline residues immediately adjacent to the signal peptide cleavage site. The value in the right-hand column gives the position-independent frequency of that particular amino acid in the input peptide population.
To assess changes in these patterns during biopanning, AAFREQ can be run on 50-100 phage clone sequences from the unselected original library, and the results compared with the frequency values calculated from the affinity-selected pool.
The output is a table with the amino acids totaled at each position within the peptide, as well as the overall frequency of each amino acid within the population. An example is given below for a 12mer library of 100 peptides that was selected for affinity to galactose.
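The counting that AAFREQ is described as performing can be sketched in a few lines. This is only an illustration under simple assumptions: all peptides have the same length, and the function names `position_counts` and `overall_frequencies` are hypothetical.

```python
from collections import Counter


def position_counts(peptides):
    """Count each amino acid at each insert position (peptides of equal length)."""
    length = len(peptides[0])
    counts = [Counter() for _ in range(length)]
    for pep in peptides:
        if len(pep) != length:
            raise ValueError("all peptides must have the same length")
        for pos, aa in enumerate(pep.upper()):
            counts[pos][aa] += 1
    return counts


def overall_frequencies(counts):
    """Position-independent frequency: total count of a residue / all residues."""
    totals = Counter()
    for column in counts:
        totals.update(column)
    n = sum(totals.values())
    return {aa: c / n for aa, c in totals.items()}
```

Here `counts[0]["P"]` would give the number of prolines at insert position 1, and `overall_frequencies` corresponds to the right-hand column of the output table.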
In the galactose-selected example, alanine occurs 7 times in position 1 and 78 times in total. The frequency of alanine in this library is 78/1092, or 0.0747.
POPDIV: Population Diversity
This program uses as input a peptide file and calculates the diversity within the population. The quality of a library frequently depends on its completeness, or diversity: the proportion of possible sequences actually present in the library. POPDIV implements an analytical expression for diversity and provides a method for estimating the diversity of a peptide library from the sequences of a limited number of its members (Makowski and Soares, 2003). The program is a useful tool for rapid assessment of the relative complexity of two or more peptide populations and is equally useful to researchers constructing or utilizing combinatorial peptide libraries. An example is given below for a 12mer library of 100 peptides that was selected for affinity to glucose.
diversity: 0.018245
For instance, if the calculated diversity is 0.01 for a 12mer library (which has a total of 20^12, or 4.096 × 10^15, possible sequences), then the population is statistically indistinguishable from a population containing approximately 4 × 10^13 (4 × 10^15 × 0.01) peptides.
AADIV: Amino Acid Frequency + Diversity
This program uses as input a peptide file and simultaneously calculates the amino acid frequency and diversity. The results include the amino acid frequency calculations in addition to the calculated diversity of the population. An example is given below for a 12mer library of 100 peptides that was selected for affinity to vegF.
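The interpretation of a POPDIV diversity value given above is simple arithmetic: the diversity multiplied by the number of possible sequences. A short sketch, with the hypothetical helper name `effective_population`, reproduces the numbers in the text.

```python
def effective_population(diversity, insert_length, alphabet_size=20):
    """Population size implied by a diversity value: diversity x possible sequences."""
    possible = alphabet_size ** insert_length
    return diversity * possible


possible_12mers = 20 ** 12  # 4096000000000000, i.e. 4.096 x 10^15
equivalent = effective_population(0.01, 12)
print(f"{possible_12mers:.3e} possible sequences, {equivalent:.3e} equivalent peptides")
```

The same helper applied to the glucose example (diversity 0.018245, 12mer inserts) gives the equivalent population size for that output.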
INFO: Information Associated with the Observation of a Peptide
This program uses as input a peptide file and calculates the information associated with each peptide in the population. This value characterizes the bias as a function of position in these libraries by assigning a statistical measure to the significance of the observed amino acid frequencies. The Information value is calculated for the amino acid frequency distributions at each position in the combinatorial library. The probability of the amino acid frequency at each position is calculated on the basis of the number of codons coding for that amino acid, assuming Poisson statistics. Higher information levels correspond to less probable distributions and are most likely due to active censorship or bias of the insert sequences at those positions. Analysis of the details of the censorship patterns provides clues as to the biological basis of the censorship.
To assess the probability that a peptide selected for affinity to a target is observed because of that affinity (and not due to some random event), the probability of random observation of that peptide is calculated from the observed frequencies of amino acids in the unselected library. Since the calculated probability of any specific peptide is very small, the associated information is calculated instead, where Information = -ln(probability). The smaller the probability of random occurrence (that is, the larger the associated information), the greater the chance that the peptide was observed due to specific binding to the target.
Two outputs result from this program. The example below shows the Information values of 500 peptides that were selected for affinity to ATP.
Output 1: A list of the peptides analyzed, with the corresponding values for Information. The peptide wmdmfrgewrkp, for instance, has the lowest probability of occurring by chance on the basis of amino acid frequencies in the library; it therefore has the highest associated information.
Output 2: The histogram and plot of information values, which provide a standard for identifying peptides with higher or lower than average associated information.
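The relation Information = -ln(probability) can be illustrated with a short sketch. It assumes, for simplicity, that the probability of a peptide is the product of position-independent residue frequencies taken from the unselected library; the text does not specify whether INFO uses position-specific frequencies, and the `floor` parameter is a hypothetical guard added here so that unseen residues do not produce log(0).

```python
import math


def information(peptide, frequencies, floor=1e-6):
    """Return -ln(probability), with probability the product of residue frequencies.

    frequencies maps one-letter codes to frequencies in the unselected library.
    floor is an assumed lower bound, not something described in the source.
    """
    log_p = 0.0
    for aa in peptide.upper():
        log_p += math.log(max(frequencies.get(aa, 0.0), floor))
    return -log_p
```

Summing logarithms rather than multiplying probabilities avoids numerical underflow, which matters because the probability of any specific 12mer is very small.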
DIVAA: Analysis of Amino Acid Diversity in Multiple Aligned Protein Sequences
Multiple sequence alignment of proteins is an effective way of identifying
highly conserved amino acids; patterns of conserved sequence; and generating
clues to functional and evolutionary relationships among proteins (e.g.
Baxevanis and Ouellette, 1998; Durbin et al., 1998). The power of this approach
has been the major impetus behind efforts to produce programs that generate
alignments and has led to the widespread use of tools such as BLAST (Altschul
et al., 1990). As more sequence data becomes available, extraction of
increasingly more subtle information from multiple-aligned sequences will
become possible. One of the most important deductions that can be obtained from
a set of multiple aligned sequences is the identification of specific residues
that are conserved among a set of related proteins, usually due to their
functional and structural importance to the protein. This type of information
can be mathematically extracted through the use of quantitative measures of
both conservation and variability. Quantitation of the abundances of amino
acids found at each position in a sequence motif can provide a basis for
understanding the structural and functional constraints at each point. One
measure that has been widely used is the information distribution across a
conserved site - a quantitative characterization of the degree of conservation
at each position in a sequence alignment (Schneider et al., 1986). Since the
abundances of different amino acids differ significantly, the calculation of
information as a function of position in multiple aligned proteins is dependent
on an a priori assignment of probabilities based, for instance, on codon usage
or amino acid abundance. Information about the degree of conservation in
adjoining sites can provide further information about the structural and
functional constraints on the sequence motif. DIVAA provides an intuitive
measure of how far the amino acid abundances of a particular position differ
from that of a uniform, random distribution. It is distinct from information
and other statistical measures that have been used previously by being
independent of a priori probabilities or expectations of abundances.
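The text above does not state the DIVAA formula. As a rough sketch of a measure with the stated properties (independent of a priori probabilities, and expressing how far a position is from a uniform distribution), the example below assumes an inverse-Simpson style diversity, 1 / (20 × Σ p²), where p is the observed frequency of each amino acid at a position. Both the formula and the function names are assumptions made for illustration.

```python
from collections import Counter


def column_diversity(column, alphabet_size=20):
    """Assumed diversity of one alignment column: 1 / (alphabet_size * sum of p^2)."""
    residues = [aa for aa in column.upper() if aa != "-"]  # skip gap characters
    if not residues:
        raise ValueError("column contains no residues")
    n = len(residues)
    sum_sq = sum((count / n) ** 2 for count in Counter(residues).values())
    return 1.0 / (alphabet_size * sum_sq)


def alignment_diversity(aligned_sequences):
    """Diversity at every position of a set of equal-length aligned sequences."""
    columns = zip(*aligned_sequences)
    return [column_diversity("".join(col)) for col in columns]
```

Under this assumed definition, a fully conserved position scores 0.05 and a position where all 20 amino acids are equally frequent scores 1.0.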
MOTIF1: This program uses as input a peptide file and searches for motifs within the population. MOTIF1 identifies families of short conserved amino acid motifs of user-specified length. Alignment of short stretches such as these may aid in identifying weaker consensus sequences on either side of this “anchor” sequence within the peptide family. An example is given for motifs found within a population of peptides that were affinity selected for ATP.
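A search for short contiguous motifs of user-specified length, as described for MOTIF1, can be approximated by collecting every k-residue substring and keeping those shared by several peptides. The sketch below is one simple way to do this and is not necessarily how MOTIF1 works; `shared_kmers` and `min_peptides` are hypothetical names.

```python
from collections import defaultdict


def shared_kmers(peptides, k, min_peptides=2):
    """Map each contiguous k-residue motif to the peptides that contain it."""
    sources = defaultdict(set)
    for pep in peptides:
        pep = pep.upper()
        for i in range(len(pep) - k + 1):
            sources[pep[i:i + k]].add(pep)
    return {
        motif: sorted(peps)
        for motif, peps in sources.items()
        if len(peps) >= min_peptides
    }
```

Each returned motif can then serve as the anchor around which the source peptides are aligned by eye.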
MOTIF2: This program uses as input a peptide file and searches for motifs within the population. MOTIF2 searches for patterns of 3 amino acids; it does not allow conservative amino acid substitutions but does allow identical gap lengths. It also outputs a list of source peptides for each motif, printed in cluster format. A conserved motif of this type is possible for peptides that are long enough to generate partial secondary structure, which enables non-continuous sequence conservation but contiguous spatial conservation. An example is given for motifs found within a population of peptides affinity selected for GTP.
MOTIF3: This program uses as input a peptide file and searches the population for motifs that contain four amino acids. It outputs the peptides that have identities in at least three of the four amino acids in a motif. The user specifies the minimum number of occurrences, i.e. the minimum number of times a motif must occur in the population of peptides to be listed in the output.
CLOSEcon: This program uses as input one or more PDB files (all with the same ligand) and provides a list of the amino acid residues that are in contact with a ligand on the basis of crystallographic coordinates. Contact is defined by a maximum interatomic distance chosen by the user.
Proteins may be downloaded from the PDB at http://www.rcsb.org/pdb/ and then uploaded onto the data entry page, either for single protein analysis or for analysis of multiple proteins containing the same ligand. Click here for proper format. Next, enter the ligand and the maximum distance from the ligand within which residues should be reported.
The output of this program is a list of contiguous amino acid residue sequences as well as single-residue, or punctate, contacts, which can then be used as input to other RELIC programs. For instance, the amino acid frequencies can be calculated with AAFREQ, or both contiguous and non-contiguous motifs within this population can be identified using the motif programs. This type of analysis allows comparison of crystallographically determined contact sequences with phage-display-derived sequences.
The example below uses 1A9C (GTP Cyclohydrolase I), extracts the residues within 10 Å of GTP, and results in 2 output files. Single amino acids indicate punctate contact, while the peptide strings indicate extended contact.
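The distance test at the heart of this kind of contact analysis can be sketched as follows. The example assumes standard fixed-column PDB `ATOM`/`HETATM` records and that the ligand is identified by its three-letter residue name; the function names are hypothetical, and grouping the contacts into contiguous versus punctate stretches is left out.

```python
import math


def parse_atoms(pdb_lines):
    """Yield (record, res_name, chain, res_seq, (x, y, z)) from fixed PDB columns."""
    for line in pdb_lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        yield record, line[17:20].strip(), line[21], int(line[22:26]), xyz


def contact_residues(pdb_lines, ligand_name, max_distance):
    """Return sorted (chain, res_seq, res_name) of protein residues near the ligand."""
    atoms = list(parse_atoms(pdb_lines))
    ligand_xyz = [a[4] for a in atoms if a[0] == "HETATM" and a[1] == ligand_name]
    contacts = set()
    for record, res_name, chain, res_seq, xyz in atoms:
        if record != "ATOM":
            continue
        # A residue is in contact if any of its atoms lies within the cutoff
        # of any ligand atom.
        if any(math.dist(xyz, lig) <= max_distance for lig in ligand_xyz):
            contacts.add((chain, res_seq, res_name))
    return sorted(contacts)
```

For the 1A9C example, a call such as `contact_residues(open("1a9c.pdb"), "GTP", 10.0)` would list the residues within 10 Å of GTP, assuming the file has been downloaded under that (hypothetical) name.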
HETEROalign: Using a PDB file, this program predicts where in the protein structure a small-molecule ligand is most likely to bind, using a population of peptides selected for binding to that small molecule. HETEROalign provides 3 visualizations of the similarity between a protein sequence and a population of peptides, where similarity is calculated from a modified BLOSUM62 similarity matrix using a 5 amino acid window as discussed in MATCH below (Makowski, in preparation).
The first visualization is a three-dimensional representation of the similarity by color. The program calculates the similarity via the MATCH algorithm (below) and generates a new .PDB file in which the ‘temperature factor’ has been replaced by the similarity score. The new PDB file can be downloaded by the user, and any standard three-dimensional visualization package such as RasMol (http://www.umass.edu/microbio/rasmol/getras.htm) can be used to show similarity when the colors of the image are coded to ‘temperature factor’. Below is an example of this output, in which the region of highest similarity between the ATP affinity selected peptides and the ATP-binding protein 1AYL is shown in red. (For help using RasMol to visualize HETEROalign PDB output, please click here.)
This .PDB file can next be used as input into the program
DistSim to plot the relationship between the similarity (i.e.
replaced temperature factor) and the distance from the ligand for each amino
acid residue in the protein.
The third output file indicates the similarity scores of each residue in the protein sequence.
Bioaffinity screening of combinatorial peptide libraries using a purified protein as a target will produce a population of peptides with affinity to that protein. HETEROalign can thus be similarly utilized to map segments of the binding partner with high similarity to the affinity selected peptide sequences.
DistSim: This program uses the .PDB-formatted output from HETEROalign to calculate the relationship between the similarity (i.e. replaced temperature factor) and the distance from the ligand for each amino acid residue in the protein. A single distance and a single temperature factor are associated with each amino acid. This makes possible the type of plot shown below, which demonstrates that the amino acid residues in the protein most similar to those of the ATP affinity selected peptides are also physically closest to the ATP hetero group of the protein (lower right quadrant).
MATCH: This program uses as input a single protein FASTA file together with a peptide file. The peptide file may be obtained from RELIC or it may be the user’s own data.
Input 1: Protein FASTA sequence. Please create a Notepad text file of an individual protein sequence in FASTA format and either upload the file or copy and paste the sequence into the data entry page for this program. Click here for an example of proper format.
Input 2: Affinity selected peptides. Click here for an example of proper format. OR Choose RELIC affinity selected peptides.
This program outputs the alignment of a set of specified peptides against a specified single protein sequence by calculating the similarity between each peptide sequence and each segment along the length of the protein sequence. Similarity between a segment of protein sequence and the selected peptides is calculated using a modification of the BLOSUM62 amino acid substitution matrix. The BLOSUM62 matrix was decomposed into physicochemical and genetic (mutability) factors, and the genetic factors were removed. The resulting matrix reflects differences in observed amino acid substitution rates but does not include the influence of mutability, a property that should not be relevant to peptide binding properties.
Similarity scores are calculated 5 amino acids at a time, with every pentapeptide in each affinity-selected peptide compared to each 5 amino acid segment of a potential protein target. For each pentapeptide segment of a protein, the similarity score is calculated as the sum of all similarity scores between it and the pentapeptides within the selected peptides. Low similarity scores add to the noise level of the resulting similarity plots. To minimize this noise, a threshold is specified, and comparisons resulting in a similarity score below this threshold are ignored in the calculation. The similarity threshold was selected both empirically and experimentally and corresponds to approximately three identities and one similarity for every 5 amino acids. Biases in the population of unselected peptides due to inherent differences in growth characteristics are removed by subtracting a similarity plot calculated between the protein and a set of sequences from the unselected population.
An example is given for the ATP-binding protein 1AYL (phosphoenolpyruvate carboxykinase) with the alignment of peptides affinity selected for ATP against the length of the protein. This segment corresponds to the maximum scoring similarity between the peptides and the protein. This region also roughly corresponds to the P-loop, a well-known consensus sequence for the binding of nucleotide triphosphates.
Output 1: The alignment
Output 2: The similarity scores
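The windowed, thresholded scoring described for MATCH can be sketched as below. The modified BLOSUM62 matrix is not reproduced in this document, so the example uses a placeholder `identity_score` (1 for identical residues, 0 otherwise); the function names, the placeholder scoring and the threshold values are all assumptions for illustration.

```python
def identity_score(a, b):
    """Placeholder residue similarity: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def window_similarity(protein, peptides, threshold, window=5, score=identity_score):
    """For each protein window, sum the above-threshold scores against all peptide windows."""
    protein = protein.upper()
    peptide_windows = [
        pep.upper()[i:i + window]
        for pep in peptides
        for i in range(len(pep) - window + 1)
    ]
    totals = []
    for start in range(len(protein) - window + 1):
        segment = protein[start:start + window]
        total = 0
        for pw in peptide_windows:
            s = sum(score(a, b) for a, b in zip(segment, pw))
            if s >= threshold:  # comparisons below the threshold are ignored
                total += s
        totals.append(total)
    return totals
```

A real substitution matrix could be passed in through the `score` argument, and the background correction mentioned above would amount to subtracting, position by position, the list computed from the unselected peptides.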
FASTAcon: This program uses as input a FASTA file of proteins and searches for short consensus sequences, either continuous or discontinuous. Please create a Notepad text file of multiple protein sequences in FASTA format (a genome, for example) and upload the file into the data entry page for this program. Click here for an example of proper format. Next, enter the consensus of interest for which the FASTA sequences should be searched, and then enter the number of amino acids in that motif that must match in order to be listed in the results. For example, using the E. coli K12 genome obtained from NCBI (http://www.ncbi.nlm.nih.gov/COG/), the proteins that contain the ATP-binding P-loop consensus sequence GxxxxGKT can be identified, as well as the protein sequence surrounding the consensus sequence.
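A consensus search of this kind, with `x` as a wildcard and a user-chosen minimum number of matching positions, can be sketched as follows. This is an illustrative reading of the description rather than FASTAcon's actual code; `parse_fasta`, `find_consensus` and the treatment of `x` as "any residue" are assumptions.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif line and header is not None:
            records[header] += line.upper()
    return records


def find_consensus(records, consensus, min_matches):
    """Return (header, offset, segment) wherever enough fixed positions match."""
    # Positions written as x (or X) are wildcards and are never counted.
    fixed = [(i, aa) for i, aa in enumerate(consensus.upper()) if aa != "X"]
    width = len(consensus)
    hits = []
    for header, seq in records.items():
        for offset in range(len(seq) - width + 1):
            matches = sum(seq[offset + i] == aa for i, aa in fixed)
            if matches >= min_matches:
                hits.append((header, offset, seq[offset:offset + width]))
    return hits
```

With the P-loop consensus `GxxxxGKT` there are four fixed positions, so `min_matches=4` requires an exact match and `min_matches=3` tolerates one mismatch.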
FASTAskan: This program uses as input a peptide file and compares it to a FASTA file such as a genome or multiple proteins.
Input 1: Affinity selected peptide sequences. Upload the Notepad file as input into the program on the data entry page. Click here for an example of proper format. OR Choose RELIC affinity selected peptides on the FASTAskan data entry page.
Input 2: Protein FASTA sequence. Please create a Notepad text file of multiple protein sequences in FASTA format, such as a genome, and upload the file into the data entry page for this program. Click here for an example of proper format. OR Choose from a list of genomes available in RELIC. OR Please refer to these sites to download your genome of interest:
http://www.ncbi.nlm.nih.gov/COG/
http://www.ebi.ac.uk/proteome/
The example below is the top 10 scoring E. coli K12 proteins ranked by similarity to our affinity selected ATP peptides.
To save a report:
1. Go to the "File" menu on your web browser and choose "Save As…"
2. Select the folder/directory where you would like to store your report and click the "Save" button.
3. Repeat for each report you wish to save.