Recently, an increasing number of linguistic methods with a probabilistic component have been investigated at the structural level, i. At the semantic level, a representation of meaning is assigned to the structure [11], and at the pragmatic level some context of the sequence e. In the field of protein sequence analysis, the size of the alphabet and the complexity of relationships between amino acids have largely limited the application of formal language theory to the production of grammars whose expressive power is no higher than that of stochastic regular grammars.
The first rules were designed to define short functional patterns consisting of adjacent and well-conserved amino acids. They are expressed by non-probabilistic regular grammars, e. Although their expressive power is fairly limited, they have proved extremely useful in protein annotation and detection of important protein regions e. Approaches based on Hidden Markov Models (HMMs) are regarded as the state-of-the-art methods in the field of protein sequence annotation.
However, an important drawback of HMM profiles is that they are not human-readable and, therefore, these descriptors cannot provide any biological insight by themselves. In addition, since the expressive power of an HMM is similar to that of a stochastic regular grammar [16-18], HMMs are limited in the types of patterns they are able to encode.
For example, they cannot capture higher-order dependencies such as nested and crossing relationships that are common in proteins, e. Similarly, bonds in binding sites often exceed the capability of regular grammars and HMMs [20]. These weakly context-sensitive grammars could predict not only which amino acids were involved in sheets, but also the locations of the hydrogen bonds.
However, the structure of the grammar had to be provided; their algorithm learned only the probability parameters. Context-Free Grammars (CFGs) have the potential to overcome some of the limitations of HMM-based schemes, since they occupy the next level of expressiveness in Chomsky's classification and produce human-readable descriptors.
Although they do not have the power of context-sensitive grammars and, therefore, cannot deal with crossing relationships, their reduced complexity makes them more practical and makes it possible to learn grammar structure from examples. Consequently, they could potentially be used to describe a variety of patterns, including nested relationships. Moreover, we believe that many ligand binding sites, where the main dependencies are essentially branched and nested-like, could be detected using CFGs.
These relationships are often not direct interactions between amino acids, but indirect ones mediated by a ligand. For example, the NAP (Nicotinamide-Adenine-Dinucleotide Phosphate) binding region of aldo-keto reductases (see Figure 1b) could be modelled as involving indirect nested dependencies between NAP binding residues (Figure 1c).
Moreover, CFGs can be used to model dependencies between different parts of a binding site, such as beta strands, helices and loops, by using branching rules. Thus, the development of grammars able to model branched and nested relationships should improve the modelling of this type of binding site.
CFGs have already been applied successfully in bioinformatics, particularly for RNA structure prediction [5, 10, 22-24] and compression [25]. A CFG is particularly well suited to this task because, unlike a regular grammar, it can express the nested dependencies due to Watson-Crick pairing, which is key to RNA structure. Due to a larger set of terminals (20 amino acids) and less straightforward relations between residues (there is no equivalent of Watson-Crick pairing), the use of Context-Free Grammars to analyse proteins has not, so far, been comparably successful.
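The nested dependency mentioned above can be illustrated with a minimal sketch. The toy grammar, the `MIN_LOOP` constant and the function names are assumptions made for illustration and are not taken from the cited RNA work; the point is only that matching the i-th base from the left with the i-th base from the right requires context-free rules of the form Stem -> x Stem y.

```python
# Hypothetical toy grammar (illustration only, not from the paper):
#   Stem -> x Stem y | x Loop y    where (x, y) is a Watson-Crick pair
#   Loop -> any run of at least MIN_LOOP unpaired bases
WATSON_CRICK = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases in the loop


def count_nested_pairs(seq: str) -> int:
    """Peel complementary bases off both ends while a loop of MIN_LOOP bases remains."""
    left, right = 0, len(seq) - 1
    pairs = 0
    while right - left - 1 >= MIN_LOOP and (seq[left], seq[right]) in WATSON_CRICK:
        pairs += 1
        left += 1
        right -= 1
    return pairs


def is_stem_loop(seq: str, min_pairs: int = 2) -> bool:
    """Accept sequences derivable from Stem with at least min_pairs nested pairs."""
    return count_nested_pairs(seq) >= min_pairs


if __name__ == "__main__":
    print(is_stem_loop("gcaaagc"))  # outer g-c and c-g pairs enclose the loop "aaa"
    print(is_stem_loop("gcaaacg"))  # the outermost bases g and g do not pair
```

A regular grammar cannot enforce this pairing for stems of unbounded length, because it would have to remember an arbitrarily long prefix in order to match it against the suffix.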
Since the design of an unbiased negative sample is particularly difficult in protein sequence analysis, the fact that CFGs cannot be inferred from positive data only is a serious drawback [26]. An alternative is to develop an approach based on stochastic grammars which, in principle, do not require a negative set for their inference [17, 27].
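To make concrete what a stochastic grammar computes, the following is a minimal sketch of the standard inside algorithm for a stochastic context-free grammar in Chomsky Normal Form. The toy grammar, the two-letter alphabet (`h` and `l`, standing for a high or low level of some residue property) and all names are assumptions for illustration; this is not the grammar or the implementation used in this paper.

```python
def inside_probability(seq, lexical, binary, start="S"):
    """Total probability that `start` derives `seq` under a CNF stochastic grammar.

    lexical: {(A, terminal): prob} for rules A -> terminal
    binary:  {(A, B, C): prob}     for rules A -> B C
    """
    n = len(seq)
    if n == 0:
        return 0.0
    inside = {}  # (i, j, A) -> probability that A derives seq[i..j] inclusive
    for i, symbol in enumerate(seq):
        for (lhs, terminal), prob in lexical.items():
            if terminal == symbol:
                inside[(i, i, lhs)] = inside.get((i, i, lhs), 0.0) + prob
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (lhs, left, right), prob in binary.items():
                total = 0.0
                for k in range(i, j):  # split point between the two children
                    total += inside.get((i, k, left), 0.0) * inside.get((k + 1, j, right), 0.0)
                if total:
                    inside[(i, j, lhs)] = inside.get((i, j, lhs), 0.0) + prob * total
    return inside.get((0, n - 1, start), 0.0)


# Hypothetical toy grammar: rule probabilities for S sum to 1.
LEXICAL = {("S", "h"): 0.5, ("S", "l"): 0.25}
BINARY = {("S", "S", "S"): 0.25}

if __name__ == "__main__":
    print(inside_probability("hl", LEXICAL, BINARY))   # one parse: 0.25 * 0.5 * 0.25
    print(inside_probability("hlh", LEXICAL, BINARY))  # sums over two parse trees
```

Because every sequence receives a probability, sequences can be ranked or thresholded by score without reference to an explicit negative set.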
In this paper, a framework based on Stochastic Context-Free Grammars for the analysis of protein sequences is presented and applied to the interpretation and detection of amino acids involved in binding sites. We start by demonstrating the value of our framework by showing the biological insight provided by the grammars it produces. Then, we assess its performance in sequence annotation and binding site detection and evaluate it against profile HMMs. In the Methods section, we present the general principles behind our framework and the key strategies it relies on.
A formal definition of Stochastic Context-Free Grammars (SCFGs), implementation aspects and a detailed description of the datasets are provided in the Appendix. The rationale for using Stochastic Context-Free Grammars to produce ligand binding site descriptors is that not only do they have the power to express branched and nested-like dependencies, but their rules can also be analysed to acquire biological knowledge about the binding sites of interest. In this section, we illustrate how the analysis of sequence-based SCFGs provides insight into the spatial configuration of binding sites.
We propose two ways of analysing probabilistic grammars to extract biologically meaningful features, focusing on either parse trees or grammar rules. We start by providing an in-depth study of the grammars produced to describe the extended PS pattern, which includes calcium and manganese binding sites. Throughout this analysis, we will use the 3D structure of a legume lectin protein, i. We will focus our attention on grammars based on residue accessibility, calcium propensity and manganese propensity, since these grammars have been shown to be the most informative for describing the PS pattern.
Figure 2a shows the 3D structure of the extended PS pattern.
The grammar generated for this pattern based on accessibility is composed of the following rules associated with their normalized probability. Since all amino acids which show high accessibility propensity, i. Therefore, this grammar imposes a constraint between the length of the loop and the first beta strand (see Figure 3a). Whereas the accessibility-based grammar describes in particular the beta sheet which is present in the pattern, the manganese propensity based grammar deals with manganese binding. The derivations of V on the strand side and U on the loop side impose the presence of W rules.
Finally, the calcium propensity based grammar defines not only the calcium binding part of the pattern, but more generally the pattern's structure. The second way of reading grammars, namely analysis of highly probable production chains and especially cycles, is demonstrated on the NAP binding pattern PS which is found in some aldo- and ketoreductases.
The protein structure of 1MRQ is used for illustration (see Figure 6 for a 3D stick model of the binding site). Analysis of the PDB model confirms that this pattern defines a helix. Finally, we analyse the SO4 binding site associated with the PS profile of M-phase inducer (MPI) phosphatase using both parse trees and grammar rules. Residues shown in red, grey and green colours have respectively high, average and low accessibility.
The parse tree of the accessibility-based grammar for the region is shown in Figure 8a. This tree shows a strong asymmetry, with a mainly hydrophilic left side and a hydrophobic right side. This suggests very different structural properties between these parts of the binding site. Figure 7a reveals a beta-sheet on the left and an alpha-helix plus one disturbed but clear turn on the right.
In addition to these features, which could have been obtained using standard secondary structure prediction methods, the parse tree also highlights the creation of a hydrophilic environment between the right-side hydrophilic amino acids close to the tree root, i.e. I and C, and the left-side amino acids. Red, black and green colours express respectively high, average and low levels of the property of interest. Using the parse tree of the SO4 propensity based grammar (Figure 8b), some insight can be provided regarding SO4 binding.
In contrast to the hydrophilic side of the site, which is composed of residues showing low SO4 propensity, the hydrophobic side appears to be a good candidate for SO4 binding. From cycle A, the following pattern is revealed: 'ySy p zy p z', where y represents either z or n. Since S can be substituted by the pattern itself, after one substitution the pattern becomes 'yySy p zy p zy p zy p z'. In addition, the derivation of the less likely cycle B produces the following pattern: 'nySxy p z', where x represents any SO4 propensity.
This result combined with the low SO4 propensity of the hydrophobic side of the binding site suggests that SO4 binding would involve the arginine-rich ridge of a helix. This analysis of grammars describing ligand binding sites has shown that probabilistic context-free grammars allow the production of binding site descriptors which are human-readable and, hence, provide some insight into biologically meaningful features.
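The self-substitution of cycle A described above can be reproduced with a few lines of string rewriting. This is only a sketch: the pattern string is copied verbatim from the text, while the function name and the `times` parameter are hypothetical.

```python
def expand_cycle(pattern: str, symbol: str = "S", times: int = 1) -> str:
    """Substitute the first occurrence of `symbol` by the pattern itself, `times` times."""
    result = pattern
    for _ in range(times):
        result = result.replace(symbol, pattern, 1)
    return result


CYCLE_A = "ySy p zy p z"  # pattern of cycle A as given in the text

if __name__ == "__main__":
    print(expand_cycle(CYCLE_A))  # prints: yySy p zy p zy p zy p z
```

Each further substitution adds one y to the left of S and two more 'y p z' units to its right, which is the kind of nested growth that a regular grammar cannot keep balanced.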
Moreover, each of these grammars relies on high probability rules which could not be expressed with regular grammars. This confirms that the description of many ligand binding sites benefits from the expressive power of context-free grammars. In order to demonstrate that SCFG-based descriptors are not only meaningful but also powerful at both annotating sequences and detecting binding sites, we first evaluate them on sites which can be expressed quite successfully by a PROSITE pattern.
In this section, all results are produced using grammars containing a full set of rules. Since each SCFG deals with one amino acid property at a time, scores obtained by several grammars need to be combined to obtain optimal results (see Methods for details). This short pattern (only 12 residues) is the anion exchanger pattern PS. The table reveals that charge and van der Waals volume are important features for the expression of this binding site.
Since negative-to-positive ratios in our datasets are quite high (between 6 and 13), ROC curves may present an optimistic assessment of the performance of our framework. Although accessibility and Ca and Mn propensities are key properties of the residues involved in this binding site, they need to be combined to produce good results. The results for combined grammars are also very good for PS Recall of 0. This slightly worse result can be explained by the fact that, unlike the two other patterns, the pattern covers only a part of the binding site of the relatively large NAP molecule.
Therefore, many key dependencies were not available to the grammars. Since correct annotation does not imply correct detection, both tests, annotation and detection, are necessary to prove the functionality of the approach. In order to evaluate detection capabilities, a number of tests were carried out. Table 5 shows the results for PS for the combined grammar that was most successful in the annotation task. This evaluation indicates that detection performance is good. Similar outcomes were obtained for the other patterns, where detection results were in line with annotation results.
To conclude, our system managed to achieve good accuracy in both annotation and detection.
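The earlier caveat that ROC curves may look optimistic when negatives outnumber positives by a factor of 6 to 13 can be checked with simple arithmetic. The sketch below assumes a hypothetical classifier with a fixed true positive rate and false positive rate; the numbers are illustrative and are not results from this paper.

```python
def precision(tpr: float, fpr: float, neg_to_pos_ratio: float) -> float:
    """Precision implied by a ROC operating point at a given negative-to-positive ratio."""
    true_pos = tpr                       # expected true positives per positive example
    false_pos = fpr * neg_to_pos_ratio   # expected false positives per positive example
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Same ROC point (TPR 0.9, FPR 0.1), evaluated at different class ratios.
    for ratio in (1, 6, 13):
        print(ratio, round(precision(0.9, 0.1, ratio), 3))
```

The ROC operating point is identical in all three cases, yet precision falls from 0.9 at a balanced ratio to 0.6 at a ratio of 6 and to about 0.41 at a ratio of 13, which is why precision and recall are worth reporting alongside ROC curves.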
The results confirmed the suitability of our approach of integrating amino acid properties into our grammars and combining the resulting grammars. They show that these strategies, together with an appropriate choice of the properties relevant to the pattern, provide satisfactory solutions to the requirement of alphabet size reduction.
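The idea of reducing the alphabet through amino acid properties can be sketched as a mapping from the 20 residues to three property levels. Everything specific here is an assumption for illustration: the Kyte-Doolittle hydropathy scale stands in for the properties actually used in this paper (such as accessibility or ligand propensity), and the thresholds and the letters `h`, `a`, `l` are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in property scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

HIGH_THRESHOLD = 1.0   # hypothetical cut-off for the "high" level
LOW_THRESHOLD = -1.0   # hypothetical cut-off for the "low" level


def property_level(residue: str) -> str:
    """Map one residue to 'h' (high), 'a' (average) or 'l' (low)."""
    value = HYDROPATHY[residue]
    if value > HIGH_THRESHOLD:
        return "h"
    if value < LOW_THRESHOLD:
        return "l"
    return "a"


def reduce_alphabet(sequence: str) -> str:
    """Rewrite a protein sequence over the three-letter property alphabet."""
    return "".join(property_level(residue) for residue in sequence.upper())


if __name__ == "__main__":
    print(reduce_alphabet("IRG"))  # prints: hla
```

With three terminals instead of twenty, the number of possible lexical rules per non-terminal drops accordingly, at the cost of describing only one property per grammar.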
The remaining part of this paper will only show annotation results, since this makes comparisons with profile HMM performance easier. In this section, we evaluate the approach consisting of constraining the initial grammar structure as described in the Methods section. It imposes a bias on the grammars so that they use context-free features, and it allows the number of non-terminals to be increased while keeping the total number of rules manageable (see Appendix D).
These results suggest that increasing the number of non-terminals improves performance by increasing expressive capabilities. Analysis of parse trees shows that more than 6 independent non-terminals (NTs) would be required to cover all important structural features. Moreover, examination of grammar structures produced with different parameters confirmed that constrained grammars were more consistent in their structure than standard SCFGs. Since we have already demonstrated that, unlike HMM profiles, SCFG rules are human-readable and can be used to gain some biological insight about binding sites, the comparison between the two techniques in this section is limited to annotation results.
Moreover, since MPI phosphatases are a subset of Rhodanese-like proteins, which can be expressed by a domain profile PS, it is expected that profile HMMs would perform well in annotating this family. Table 7 shows a comparison of the results of the two methods for our patterns of interest. These results validate our assumption that SCFGs gain efficiency from higher expressiveness and, despite operating on a reduced protein alphabet, can be at least as efficient as lower-level grammars, i.
These curves show that although our stochastic grammars are based here on a single feature, zinc propensity, they perform slightly better than profile HMMs. Although none of the other tested properties improved the SCFG results, we believe there is still room for improvement if suitable properties could be combined with zinc propensity.