Intrinsic protein disorder in histone lysine methylation

The complex pattern of posttranslational modifications (PTMs) of histone proteins result in an epigenetic regulatory code that controls entire gene expression programs within a cell [1]. One of the best characterized histone modifications is methylation, that can occur on lysine or arginine residues [2]. Lysine methylation is mediated by histone lysine methyltransferases (HKMTs), a protein family defined by the presence of the SET domain, named after the Drosophila proteins Suppressor of variegation 3–9 [Su(var)3–9], Enhancer of zeste [E(z)], and Trithorax (Trx) [2]. DOT1L is the only protein that is capable of lysine methylation, despite the absence of a SET domain [3]. Many HKMTs are involved in different types of cancer [4, 5], making them an intensively studied protein group. While most studies are directed to the catalytic domain(s), we aimed at the structural analysis of the regions of HKMTs outside their catalytic domains. After finding that a significant proportion of the studied sequences are predicted to be intrinsically disordered, we tried to identify possible functional sites within these regions.

Intrinsically disordered proteins and protein regions (IDPs/IDRs) lack stable 3D structure in their functional state that confers a multitude of functional advantages [6], utilized in the diverse roles of IDPs in important biological processes [7, 8].

Although proteins participating in chromatin remodelling are known to have high levels of disorder in general [9], HKMTs distinguish themselves from other histone modifying enzymes not only by a high level of disorder (Additional file 1: Figure S1A), but also by the length of their disordered regions. Whereas 60 % of HKMTs contain IDRs longer than 80 amino acids (Additional file 1: Figure S1B), less than 20 % of eukaryotic proteins contain IDRs of similar length [10].

An evolutionary comparative analysis shows that the average length and number of proteins responsible for histone modification increases with evolution, with the sharpest transitions occurring between prokaryotes and eukaryotes and lower eukaryotes and vertebrates. Although the statistical analysis is hindered by large standard deviation values and the limited sample size in certain cases (for example, the histone demethylase group of protostomes contains only one representative and there are only two known bacterial histone arginine methyltransferases), HKMTs are almost always significantly longer than other histone modifying enzymes (Additional file 2: Figure S2). This is not true for prokaryotes, where HKMTs constitute the shortest proteins of the studied groups. This observation also includes that the length of HKMTs rose more sharply in eukaryotes than in other histone modifying enzyme families. Since it was shown earlier that protein length and intrinsic protein disorder does not correlate closely with organism complexity [11], this elevated length is probably related to the specific function and regulation of the HKMT proteins rather than being a general evolutionary trait. Contrary to protein length, the number of HKMT proteins is generally not significantly higher than other histone modifying enzymes in the studied evolutionary groups, although in some specific cases we could detect significant differences. This finding shows that in case of HKMTs, more complex regulation with growing organism complexity was achieved through extending individual proteins, rather than producing more, specialized representatives of the family.

In order to determine the evolutionary variability of these IDRs, we performed disorder conservation analysis of the mostly disordered HKMTs, using the DisCons [12] webtool, which can differentiate between constrained and flexible disorder. Disorder is considered constrained when disorder tendency and sequence of a region are both conserved, while flexible disorder means that only the disorder tendency is retained through evolution. The analysis showed that the long IDRs are highly conserved in vertebrates, with constrained disorder conservation levels above 80 % for all examples (Additional file 3: Table S1). Since disordered proteins generally tolerate sequence changes better than globular proteins [13, 14], the fact that not only disorder, but also sequences are conserved, shows that these regions harbor important functional sites. CREB-binding protein (CBP), a histone acetyltransferase with experimentally confirmed functions in its disordered regions [15], has a similar disorder conservation level as the studied HKMTs.

Although many HKMTs contain single amino acid repeat regions (see Additional file 4: Table S2.) and other low complexity regions (LCRs), SEG analysis [16] shows that contrary to protein disorder, LCRs are not overrepresented in any of the histone modifying protein families compared to the average of human proteins, and the overlap between LCRs and IDRs is limited (Additional file 1: Figure S1C.). This suggests that although LCRs are thought to be involved in mediating flexible protein-protein interactions [17], it seems that in this particular case, low complexity is not a dominant feature.

Polyglutamine sequences are among the most studied LCRs due to their involvement in many diseases [18]. Of HKMTs, only MLL4 contains long stretches of polyglutamine (polyQ) repeats (14 regions with lengths between 5 and 13 amino acids). A long run of glutamines between aa 3898 and 3974 is also found in MLL4, where Q repeats are interrupted with a leucine residue at every five to ten residues. This region is predicted to form a coiled-coil structure [19] and may be involved in stabilizing protein-protein interactions as suggested for such regions by Schaefer et al. [20]. It is to be noted that LCRs also often have highly repetitive Q/N-rich regions, which may undergo regulated structural transitions from a disordered to a highly ordered amyloid-like state, conferring prion-like functions on the protein [21].

The main functional regions of IDPs/IDRs, however, are short recognition elements, most often termed eukaryotic linear motifs (ELMs). A search in the ELM database [22] for known sequence motifs in the disordered regions of HKMTs resulted in a limited number of annotated motifs, but we could identify more ELM hits with the database’s acceptable expectation value. One of the most frequent motifs found was LIG_WD40_WDR5_WIN_1 which is responsible for binding WDR5 and WD40 domains [23].

Other motifs with reliable e-values are involved in transcriptional activation/repression, cellular proliferation, ubiquitination, DNA repair, RNA binding and splicing. These are in good correlation with the functions generally assigned to HKMTs [2], but the physiological role of these predicted motifs remains to be experimentally validated. We found 18 different motifs altogether, and these occurred at 50 different sites in the mostly disordered HKMTs (Additional file 4: Table S2). This represents more than 2 predicted ELM motifs per 1000 residues, which is significantly higher than the number obtained for randomized sequences with the same amino acid composition (0.645 ELM motif per 1000 residues, p??0.0001). The average level of conservation of motifs predicted in MLL1 and MLL4 is significantly higher (p?=?0.001 and p?=?0.003, respectively) than that of the whole proteins, but ELM motifs in the other, highly disordered HKMTs (NSD1, SUV420H1, PRDM2 and DOT1L) do not show significantly higher conservation. Given that the average conservation of these proteins is already rather high, this does not necessarily question their functional importance. Our suggestion is that some, or many of the ELMs found in this study may participate in the interactions of HKMTs with other macromolecules, making them excellent candidates for further investigations. A statistical analysis shows that ELMs participating in protein-protein interactions occur at a significantly (p??0.0001) higher level in the studied HKMTs (1.7814 motifs/1000 aa) than in randomized sequences (0.995 motifs/1000 aa), underscores this proposition. It is also informative that research directed at the non-enzymatic regions of HKMTs has already unveiled a new motif that mediates the interaction of different proteins with LEDGF/p75 [24].

Using the ANCHOR [25] server that can predict disordered binding regions from sequence, we mapped the potential binding regions of the IDRs of HKMTs. The number of ANCHOR sites (30.106 regions/1000 aa) is significantly higher (p??0.0001) than that found in the randomized sequences (25.935 regions/1000 aa). In order to reduce the number of false positive hits, we only considered the ANCHOR sites that were conserved in vertebrates (Fig. 1). A comparison with different databases containing cancer-related mutations resulted in several hits localized to these putative binding regions. Since HKMTs work mainly as parts of large complexes [2], it is not unfounded to suggest that these may be the regions responsible for mediating functionally important interactions. A recently characterized DOT1L-AF9 interaction [26] overlaps with a predicted binding site (Fig. 1), pointing to the validity of our suggestions. AF9 is a fusion partner of MLL1 and is involved in the leukemias involving MLL fusions [27], which highlights the importance of DOT1L-AF9 interaction.

Fig. 1

IUPred profile of four representative HKMTs and CBP. Regions with corresponding PDB structures (red¹), SCOP domains (green), conserved predicted binding regions (yellow), known binding regions (orange horizontal lines) and cancer associated SNPs (black diamonds) are indicated. ¹List of PDB structures: MLL1: 4gq6_b, 3u88_m, 2mtn_a, 2msr_a, 2j2s_a, 2kyu_a, 3lqh_a, 2agh_c, 2w5y_a; MLL4: 3uvk_b, 3erq_d, 4z4p_a; NSD1: 3ooi_a; DOT1L: 3uvp_a, 2mv7_b; CBP: 1rdt_e, 1lik_a, 2lxt_a, 4n4f_a, 2kje_a

The disordered region of DOT1L is probably also involved in the H2BUb-H3K79 crosstalk, since a C-terminal truncated construct can methylate nucleosomes in the absence of the facilitating ubiquitin (Ub) mark [28]. Ubiquitin interaction appears to be mediated through lysine-rich regions in DOT1 proteins, as shown for yeast DOT1P [29]. The lysine-rich region of yeast DOT1P localizes in the disordered region of the protein and human DOT1Ll also contains a disordered lysine-rich region that might be involved in the H2BUb-H3K79 crosstalk. The lysine-rich region homologous to that of yeast DOT1P is localized between amino acids 387–416 in DOT1L, and overlaps with a conserved ANCHOR site (aa 408–416) according to our prediction (Fig. 1). The notion that it is a valid and important interaction site is corroborated by the three SNPs in this region that are found in cancer databases.

NSD1 also uses a disordered region for interacting with Nizp1 in mediating gene repression [30]. The interacting region of mouse NSD1 is a cysteine-rich region (aa 2117–2207) that corresponds to a conserved ANCHOR sequence in the human protein (Fig. 1), raising the possibility of a similar mechanism in human cells.

MLL proteins also contain several conserved ANCHOR regions, some of them in longer sequences that are known to participate in structurally not characterized partner binding (Fig. 1). The reliability of our predictions is supported by the finding that the ternary complex formed between the activation domain of MLL1, the KIX domain of CBP and the TAD of c-Myb [31] is mediated by a short sequence in MLL1 between residues 2844 and 2857 [32]. This interaction is essential for transcriptional activation by MLL [31] and overlaps with one of the conserved ANCHOR sites (aa 2841–2853).

The functional importance of regions of MLL proteins other than their SET domain is underlined by the fact that unlike mll
^?/? mice, animals with SET domain-deleted MLL are viable and fertile, although they show defects in DNA methylation [33]. The SET domain is also lost in MLL rearranged leukemias, where the N-terminal region of MLL proteins is fused to various protein partners, resulting in aberrant expression of MLL target genes [34]. The disordered nature of the MLL protein is important for the fusion proteins to be viable in the cells, as a link between protein disorder and fusion protein survival was shown in a previous work [35].

The extreme length of the IDRs found in HKMTs suggests that these regions have further roles than simply presenting interaction sites of a couple of amino acids in length. Involvement of long disordered regions in establishing long-range contacts between spatially distant binding partners was suggested for proteins participating in nonsense-mediated decay [36]. HKMTs might rely on similar strategies when recognizing other histone modifications, exemplified by the H2BUb-H3K79 crosstalk in the case of DOT1L. These long IDRs may also serve as tools for complex intramolecular regulation through the interplay of a variety of elements, domains, motifs and linkers in a phenomenon termed ‘multistery’ [37].

Although disordered regions do not fold into a well-defined structure on their own, they often gain structure upon binding to different partners through induced folding [38]. The ternary complex formed between MLL1, menin and LEDGF/p75, critical for the development of MLL leukemia [39], is a good example of a well-characterized interaction involving disordered regions. We demonstrate how a disordered segment can change the stability of a complex through the analysis of the published structures supplemented with molecular modeling.

The originally published crystal structure (PDB: 3U88) contained a region spanning amino acids 4 to 153 of MLL1 from which the disordered segments (aa 16–22 and 36–102) were removed [40]. We performed molecular dynamics simulations using the sequence of MLL1 between amino acids 1 and 200. Our simulations show that this region is highly dynamic in the unbound state, sampling a multitude of different conformations (Fig. 2a), with short regions of limited preference for secondary structural elements. The region between amino acids 120–140 has the highest tendency to fold into a continuous alpha helical state which is capable of facilitating binding (Fig. 2b). The ensuing conformational selection is a basic mechanism of disordered proteins binding to their binding partners [38].

Fig. 2

Molecular dynamics simulation of MLL1-menin-LEDGF/p75 complex. a Overlay of 20 structures between 900 k-1100 k steps of DMD of free MLL1 N-terminus. The region between amino acids 120–135 are highlighted in red or orange. b DSSP helix content of the free MLL1 N-terminus per frame versus the amino acid chain. Orange to red lines represent the number of replicas (one, two or three) in any given frame that contain an amino acid in helical conformation (1–2000 k steps). c Structure of the ternary complex as represented in PDB database (3U88). Salmon: MLL1, cyan: LEDGF/p75, green: menin. Side chains of F148 and F151 in MLL1 are red. Intramolecular contacts are shown as yellow-red dots. d Structure of the ternary complex as modeled with the disordered regions of MLL1 based on PDB structures 3U88 and 2MSR. Salmon: MLL1, cyan: LEDGF/p75, green: menin. Side chains of F148 and F151 in MLL1 are red. Intramolecular contacts are shown as yellow-red dots (at 200 k steps)

A short segment of MLL1 (aa 140–160) binds LEDGF/p75 independently of the formerly described helix [41, 42] through a region that does not fold upon binding and has no particular structural propensity in the unbound state (PDB: 2MSR, Fig. 2b). This is an example of the unique ability of IDPs to bind without folding [43], which is nevertheless very important for the stability of the ternary complex. Our molecular dynamics simulations demonstrate that the LEDGF/p75-menin complex is not stable, (Additional file 5: Movie 1) and while the MLL1 helix between menin and LEDGF/p75 stabilizes the ternary complex with hydrophobic and electrostatic interactions (Additional file 6: Movie 2), the extensive movements of LEDGF/p75 relative to menin might not be compatible with the biological function. Even though the two interacting amino acids of the disordered loop (F148 and F151) do not form stable bonds with the partners, the simulation containing the loop region showed a much more stable complex (Additional file 7: Movie 3). The MLL1 construct in the published crystal structure also contains the binding phenylalanines, but no coordinates could be assigned to them [40], revealing that they remain disordered even in the confines of a crystal lattice. This model illustrates nicely that two different IDR binding strategies (folding upon binding and binding without folding) can work together to modulate the stability of a binding interface. The fact that in the case of the MLL_6–153 there could be no interaction detected with LEDGF/p75 [40] but the region 1–160 interacts with LEDGF/p75 alone [41, 42], hints at the importance of amino acids distant to the actual binding site. This observation underlines that even though many IDP interactions are mediated by residual structural elements, lack of a tendency to fold does not necessarily mean a lack of interaction capability and function.

Apart from the known binding regions of MLL1, our simulations also included a large disordered loop of MLL1 between amino acids 36 and 102. The larger number of intramolecular contacts in the model compared with the crystallized complex (68 versus 45, respectively) suggests that disordered regions distant to the binding site may also contribute to the interaction (Fig. 2c and d). The loop region does not seem to make extensive contacts with either partners and might serve as a platform for other interaction partners.

In all, we have shown that intrinsic disorder is a prominent feature of HKMTs and the intricate regulation and complex activity of these important enzymes cannot be fully understood without dissecting the behavior of these regions. The rare instances where disordered regions of HKMTs were studied show that many important functions lie in these sequences. Given the extreme length of IDRs in some of the HKMTs, it is entirely possible that many other functions await discovery. For this reason it is important to direct structural and biochemical studies at the disordered segments of these proteins. Most promising candidates would be the conserved ANCHOR regions, especially those that contain cancer-related SNPs. Regions participating in detected, but uncharacterized partner binding also bear the possibility of notable discoveries. Recognizing the importance of protein disorder in the epigenetic regulation is important for a deeper understanding that may bring further development of this field.