DEG-77

Profiling RNA-Seq at multiple resolutions markedly increases the number of causal eQTLs in autoimmune disease

Abstract
Genome-wide association studies have identified hundreds of risk loci for autoimmune dis- ease, yet only a minority (~25%) share genetic effects with changes to gene expression (eQTLs) in immune cells. RNA-Seq based quantification at whole-gene resolution, where abundance is estimated by culminating expression of all transcripts or exons of the same gene, is likely to account for this observed lack of colocalisation as subtle isoform switches and expression variation in independent exons can be concealed. We performed integrative cis-eQTL analysis using association statistics from twenty autoimmune diseases (560 inde- pendent loci) and RNA-Seq data from 373 individuals of the Geuvadis cohort profiled at gene-, isoform-, exon-, junction-, and intron-level resolution in lymphoblastoid cell lines.After stringently testing for a shared causal variant using both the Joint Likelihood Mapping and Regulatory Trait Concordance frameworks, we found that gene-level quantification sig- nificantly underestimated the number of causal cis-eQTLs. Only 5.0–5.3% of loci were found to share a causal cis-eQTL at gene-level compared to 12.9–18.4% at exon-level and 9.6–10.5% at junction-level. More than a fifth of autoimmune loci shared an underlying causal variant in a single cell type by combining all five quantification types; a marked increase over current estimates of steady-state causal cis-eQTLs. Causal cis-eQTLs detected at different quantification types localised to discrete epigenetic annotations. We applied a linear mixed-effects model to distinguish cis-eQTLs modulating all expression ele- ments of a gene from those where the signal is only evident in a subset of elements. Exon- level analysis detected disease-associated cis-eQTLs that subtly altered transcription glob- ally across the target gene. We dissected in detail the genetic associations of systemic lupus erythematosus and functionally annotated the candidate genes. Many of the known and novel genes were concealed at gene-level (e.g. IKZF2, TYK2, LYST). Our findings are provided as a web resource.

Introduction
The autoimmune diseases are a family of heritable, often debilitating, complex disorders in which immune dysfunction leads to loss of tolerance to self-antigens and chronic inflamma- tion [1]. Genome-wide association studies (GWAS) have now detected hundreds of suscepti- bility loci contributing to risk of autoimmunity [2] yet their biological interpretation still remains challenging [3]. Mapping single nucleotide polymorphisms (SNPs) that influence gene expression (eQTLs) can provide meaningful insight into the potential candidate genes and etiological pathways connected to discrete disease phenotypes [4]. For example, such anal- yses have implicated dysregulation of autophagy in Crohn’s disease [5], the pathogenic role of CD4+ effector memory T-cells in rheumatoid arthritis [6], and an overrepresentation of tran- scription factors in systemic lupus erythematosus [7].Expression profiling in appropriate cell types and physiological conditions is necessary to capture the pathologically relevant regulatory changes driving disease risk [8]. Lack of such expression data is thought to explain the observed disparity of shared genetic architecture between disease association and gene expression at certain autoimmune loci [9]. A much over- looked cause of this disconnect however, is not only the use of microarrays to profile gene expression, but also the resolution to which expression is quantified using RNA-Sequencing (RNA-Seq) [10]. Expression estimates of whole-genes, individual isoforms and exons, splice- junctions, and introns are obtainable with RNA-Seq [11–18]. The SNPs that affect these dis- crete units of expression vary strikingly in their proximity to the target gene, localisation to specific epigenetic marks, and effect on translated isoforms [18]. For example, in over 57% of genes with both an eQTL influencing overall gene expression and a transcript ratio QTL (trQTL) affecting the ratio of each transcript to the gene total, the causal variants for each effect are independent and reside in distinct regulatory elements of the genome [18].

RNA-Seq based eQTL investigations that solely rely on whole-gene expression estimates are likely to mask the allelic effects on independent exons and alternatively-spliced isoforms [16–19]. This is in part due to subtle isoform switches and expression variation in exons that cannot be captured at gene-level [20]. A large proportion of trait associated variants are thought to act via direct effects on pre-mRNA splicing that do not change total mRNA levels [21]. Recent evidence also suggests that exon-level based strategies are more sensitive than conventional gene-level approaches, and allow for detection of moderate but systematic changes in gene expression that are not necessarily derived from alternative-splicing events [15,22]. Furthermore, gene-level summary counts can be biased in the direction of extreme exon outliers [22]. Use of isoform-, exon-, and junction-level quantification in eQTL analysis also support the potential to not only point to the candidate genes involved, but also the spe- cific transcripts or functional domains affected [10,18]. This of course facilitates the design of targeted functional studies and better illuminates the causative relationship between regulatory genetic variation and disease. Lastly, though intron-level quantification is not often used in conventional eQTL analysis, it can still provide valuable insight into the role of unannotated exons in reference gene annotations, retained introns, and even intronic enhancers [23,24].Low-resolution expression profiling with RNA-Seq will impede the subsequent identification of causal eQTLs when applying genetic and epigenetic fine-mapping approaches [25]. In this investigation, we aim to increase our knowledge of the regulatory mechanisms and candi- date genes of human autoimmune disease through integration of GWAS and RNA-Seq expres- sion data profiled at gene-, isoform-, exon-, junction-, and intron-level in lymphoblastoid cell lines (LCLs). This is firstly performed in detail using association data from a GWAS in sys- temic lupus erythematosus, and is then scaled up to a total of twenty autoimmune diseases.Our findings are provided as a web resource to interrogate the functional effects of autoim- mune associated SNPs (www.insidegen.com), and will serve as the basis for targeted follow-up investigations.

Results
Using densely imputed genetic association data from a large European GWAS in systemic lupus erythematosus (SLE) [7], we performed integrative cis-eQTL analysis with RNA-Seq expression data profiled at five resolutions: gene-, transcript-, exon-, junction-, and intron- level. Expression data were derived from 373 healthy European donors of the Geuvadis project profiled in lymphoblastoid cell lines (LCLs) [18]. See S1 Fig for a summary of how expression at the five resolutions was quantified. A total of 38 genome-wide significant SLE loci (S1 Table) were put forward for analysis. To test for evidence of a single shared causal variant between disease and gene expression at each locus, we employed the Joint Likelihood Mapping (JLIM) framework [9] using summary-level statistics for SLE association and full genotype- level data for gene expression. Using JLIM, cis-eQTLs were defined if a nominal association(P<0.01) with at least one SNP existed within 100kb of the SNP most associated with disease and the transcription start site of the gene was located within +/-500kb of that SNP (as definedby authors of JLIM). JLIM P-values were corrected for multiple testing by a false discovery rate (FDR) of 5% per RNA-Seq quantification type (i.e. at exon-level, JLIM P-values were adjusted for total number of exons tested in cis to the 38 SNPs). Causal associations of the integrative cis-eQTL SLE GWAS analysis across the five RNA-Seq quantification types are available in S2Table and the full output (including non-causal associations) are available in S3 Table. The dis- tribution of JLIM P-values across the five RNA-Seq quantification types are depicted in S2 Fig.We found the number of causal cis-eQTLs was markedly underrepresented when con- sidering conventional gene-level quantification (Table 1). Only two of the 38 SLE susceptibility loci (5.3%) were deemed to be causal cis-eQTLs at gene-level for three candidate genes.This is a similar proportion observed by Chun et al [9] who found that 16 of the 272 (5.9%)The lead SNPs from the Bentham and Morris et al 2015 GWAS in persons of European descent were functionally annotated by cis-eQTL analysis in the Geuvadis RNA-Seq cohort in lymphoblastoid cell lines using RNA-Seq quantification profiled at five resolutions (gene, transcript, exon, junction, and intron).Only SNPs reaching genome-wide significance, not conditional peaks, outside of the major histocompatibility complex loci, and with minor allele frequency > 5% were included leaving 38 SLE lead SNPs in total. All SLE loci were densely imputed to the 1000 Genomes Phase 3 Imputation Panel as described in methods.All 38 loci (+/-100kb of each lead SNP) comprised a nominally significant cis-eQTL (P<0.01) for at least one gene within +/-500kb of the lead SNP at each resolution of RNA-Seq. Evidence of a single shared causal variant at each locus was assessed using the Joint Likelihood Mapping (JLIM) algorithm as described in methods.aNumber of loci where the disease association is consistent with a single shared effect for at least one cis-eQTL (P<0.01 and JLIM FDR adjusted P<0.05). The total number of unique causal cis-eQTLs across all RNA-Seq quantification types.Expression targets corresponds to the quantification type in hand (i.e. number of exons at exon-level).https://doi.org/10.1371/journal.pgen.1007071.t001autoimmune susceptibility loci tested were cis-eQTLs driven by a shared causal variant in the Geuvadis RNA-Seq dataset using gene-level quantification (based upon the seven autoimmune diseases interrogated—not including SLE).Of note, transcript-level quantification did not increase the number of causal cis-eQTLs (Table 1). Transcript-level analysis did, however, yield a greater number of candidate genes (seven individual transcripts derived from a total of four genes). Both junction- and intron- level quantification increased the number of causal cis-eQTLs to four (10.5% of the 38 total SLE loci). Using exon-level quantification, we were able to detect seven significant cis-eQTLs driven by a single shared causal variant (18.4%). Exon-level analysis also produced the greatest number of candidate gene targets: nine unique genes derived from 24 individual SNP-exon pairs (Table 1). Therefore, even with the severe multiple testing burden, we firstly conclude that exon-, junction-, and intron-level analysis detects more causal cis-eQTLs than gene-level.By combining all five types of RNA-Seq quantification (gene, transcript, exon, junction, and intron) we classified nine of the 38 SLE susceptibility loci (24%) as being driven by the same causal variant as the cis-eQTL in LCLs (Table 1). This value, derived from interrogating only a single cell type, is almost equal to the total number of causal autoimmune cis-eQTLs detected by Chun et al [9] (~25%) across three different cell types (CD4+ T-cells–measured by microar- ray, CD14+ monocytes–microarray, and LCLs–RNA-Seq gene-level). We found that when considering the specificity of cis-eQTLs and target genes across the five RNA-Seq quantification types, both gene- and transcript-level quantification were redun- dant with respect to exon-level data; i.e. there were no causal cis-eQTLs or target genes detected at gene- or transcript-level that were not captured by exon-level analysis (S3 Fig).Both junction- and intron-level quantification captured a single causal cis-eQTL each that was not captured by exon-level. We conclude that profiling at all resolutions of RNA-Seq is required to capture the full set of potentially causal cis-eQTLs.Associated SNPs are most likely to colocalize with exon- and junction- level cis-eQTLsWe compared the detection of cis-eQTLs using a pairwise comparison between the five RNA- Seq quantification types for matched SNP-gene cis-eQTL pairs (Fig 1). We only considered matched SNP-gene cis-eQTL association pairs that had a nominal cis-eQTL association P- value < 0.01 in both quantification types, and to be conservative, when multiple transcripts, exons, junctions, and introns were annotated with the same gene symbol, we selected the asso- ciations that minimized the difference in JLIM P-value between matched SNP-gene cis-eQTLs across RNA-Seq quantification types. There were over 250 matched SNP-gene cis-eQTL pairs per comparison. We firstly observed that the correlation of both cis-eQTL association P-values from regression and JLIM P-values across RNA-Seq quantification types reflected the methods in which expression quantification was obtained (Fig 1A). Both cis-eQTL and JLIM P-values between matched SNP-gene pairs at gene- and transcript-level were highly correlated as gene-level estimates are obtained from the sum of all transcript-level estimates for the same gene. Exon-level and junction-level associations were also highly correlated due to split-reads being incorporated into the exon-level estimate. As expected, intron-level cis-eQTL and JLIM P-val- ues for matched SNP-gene pairs were only weakly correlated against other quantification types—as reads mapping to introns are not included in the other quantification models. Interest- ingly, although cis-eQTL association P-values for matched SNP-gene pairs between transcript- level and junction-level were found to be relatively high (r2 = 0.70), we found the JLIM P-val- ues for the matched pairs to be comparatively low (r2 = 0.29); suggesting that whilst the statisti- cal significance of matched cis-eQTLs maybe similar between these quantification types, the underlying causal variants driving the disease and cis-eQTL association are likely to be independent.By plotting the JLIM P-values for matched SNP-gene pairs between different quantification types, we found many instances of P-values distributed along the axes rather than on the diag- onal (Fig 1B). Our findings therefore suggest that often, one quantification type is more likely to explain the observed disease association than the other. When we compared conventional gene-level cis-eQTL analysis against exon-level results (Fig 1C), we found that of the 296matched SNP-gene cis-eQTL associations (P<0.01), eleven (4%) shared the same causal vari- ant at both gene- and exon-level using a nominal JLIM P-value threshold <0.01. Only three ofthe 296 matched SNP-gene cis-eQTL associations (1%) were captured by gene-level only—in contrast to the 26 (9% of total associations) captured uniquely at exon-level. As expected, the overwhelming majority of cis-eQTL associations (86%) did not possess a single shared causal variant at either gene- or exon-level. We performed this analysis for all possible combinations of quantification types (Table 2). In each instance, gene-level analysis detected only the minor- ity of nominally causal associations for matched SNP-gene association pairs (JLIM P<0.01).Exon-level and junction-level analysis consistently detected more causal cis-eQTL associationsthan gene-, transcript-, and intron-level. In fact, when combined, exon- and junction-level analysis explained the most nominally causal associations for all significant SNP-gene cis- eQTL association pairs (24%).We functionally dissected the 12 candidate genes taken from the nine SLE associated loci that showed strong evidence of a shared causal variant with a cis-eQTL in LCLs (Table 3). We sys- tematically annotated these genes using a combination of cell/tissue expression patterns, mouse models, known molecular phenotypes, molecular interactions, and associations with other autoimmune diseases (S4 Table). We found the majority of novel SLE candidate genes detected by RNA-Seq were predominately expressed in immune-related tissues such as whole blood as well as the spleen and thymus. Based on our annotation and what is already docu- mented at certain loci, we were sceptical on the pathogenic involvement of three candidate genes (PHTF1, ARHGAP30, and RABEP1). Although the cis-eQTL effect for these genes is evi- dently driven by the shared causal variant as the disease association, it is possible that these effects of expression modulation are merely passengers that are carried on the same functional haplotype as the true causal gene(s) and do not contribute themselves to the breakdown ofself-tolerance (detailed in S4 Table). We show the regional association plots and the candidate genes detected from cis-eQTL analysis in S4 Fig. The causal cis-eQTL rs2736340 for genes BLK and FAM167A was detected at all RNA-Seq profiling types. It is well established that the risk allele of this SNP reduces proximal promoter activity of BLK; a member of the Src family kinases that functions in intracellular signalling and the regulation of B-cell proliferation, differentiation, and tolerance [26]. The allelicNine of the 38 SLE loci (24%) were found to be driven by the same causal variant as the disease association across all five RNA-Seq quantification types in LCLs (cis-eQTL P<0.01 and joint likelihood of shared association FDR<0.05). Bold type indicates associations that show evidence of a shared causal variant for cis-eQTL and disease.aMinimum cis-eQTL P-value for any SNP within 100 kb of the lead SNP. Dashes (–) indicate genes that were either not detected or had minimum cis-eQTL P>0.01 in the RNA-Seq quantification type in hand. JLIM P-values <10−4 indicates the JLIM statistic is more extreme than permutation. JLIM: joint likelihood mapping. If multiple SNP-unit associations are deemed to be causal (i.e. one SNP shows a causal association to two exons of the same gene, the association with the smallest JLIM P-value is reported).consequence of FAM167A expression modulation is unknown. We found multiple instances of known SLE susceptibility genes that were concealed when using gene-level quantification. For example, we defined rs7444 as a causal cis-eQTL for UBE2L3 at transcript- and exon-level—but not at gene-level (Table 3). The risk allele of rs7444 has been associated with increased expression of UBE3L3 (Ubiquitin conjugating enzyme E2 L3) in ex vivo B-cells and monocytes and correlates with NF-κB activation along with increased circulating plasmablast and plasma cell numbers [27]. Similarly, the rs10028805 SNP is a known splicing cis-eQTL for BANK1 (B- cell scaffold protein with ankyrin repeats 1). We replicated at exon-, and junction-level this splicing effect which has been proposed to alter the B-cell activation threshold [28]. Again, this mechanism was not detected using gene-level quantification.IKZF2 (detected at the exon-level only) is a transcription factor thought to play a key role in T-reg stabilisation in the presence of inflammatory responses [29]. IKZF2 deficient mice acquire an auto-inflammatory phenotype in later life similar to rheumatoid arthritis, with increased numbers of activated CD4+ and CD8+ T-cells, T-follicular helper cells, and germinal centre B-cells, which culminates in autoantibody production [30]. Of note, other members of this gene family, IKZF1 and IKZF3, are also associated with SLE and can hetero-dimerize (S4 Table) [7]. We also believe LYST, ATG4D, and TYK2 to also be intriguing candidate genes.LYST encodes a lysosomal trafficking regulator [31] whilst ATG4D is a cysteine peptidase involved in autophagy and this locus is associated with multiple sclerosis, psoriasis, and rheu- matoid arthritis [32]. TYK2 is discussed in greater detail in the following section.RNA-Seq can resolve the potential causal regulatory mechanism(s)Interestingly, for the three causal SNP-gene pairs detected at gene-level (rs2736340 –BLK, rs2736340 –FAM167A, and rs7444 –CCDC116), we found that at exon-level, all expressed exons possessed causal cis-eQTLs. For example, rs2736340 is a causal cis-eQTL for all thirteen exons of BLK and for all three exons of FAM167A (S5 Table). These data suggest that gene- level analysis is capturing associations where all—or the majority of exons—are modulated by the cis-eQTL. We found that within the SLE associated loci that showed evidence of a shared causal vari- ant with a cis-eQTL (Table 3), there were many instances in which the proposed causal cis- eQTL modulated expression of only a single expression element. This enabled us to resolve the potential regulatory effect of the causal cis-eQTL to a particular transcript, exon, junction, or intron (S5 Table). We were able to resolve to a single expression element in nine of the twelve candidate SNP-gene pairs. For example, rs9782955 is a causal cis-eQTL for LYST at junction- level for only a single junction (chr1:235915471–235916344; cis-eQTL P = 1.3x10-03; JLIMP = 2.0x10-04). We provide depicted examples of this isolation analysis for candidate genesIKZF2 (S5 Fig), UBE2L3 (S6 Fig), and LYST (S7 Fig).We provide a worked example of resolving the causal mechanism(s) using RNA-Seq for the novel association rs2304256 with TYK2 (Fig 2). The top panel of Fig 2A shows the genetic asso- ciation to SLE at the 19p13.2 susceptibility locus tagged by lead SNP rs2304256 (P = 1.54x10- 12). Multiple tightly correlated SNPs span the gene body and the 30 region of TYK2 –which encodes Tyrosine Kinase 2—thought to be involved in the initiation of type I IFN signalling [33]. In the panel below, we plot the gene-level association of all SNPs in cis to TYK2 andshow no significant association of rs3204256 with TYK2 expression (P = 0.18). At exon-, and intron-level, we were able to classify rs2304256 as a causal cis-eQTL for a single exon (chr19: 10475527–10475724; cis-eQTL P = 2.58x10-09; JLIM P<10−04) and a single intron (chr19: 10473333–10475290; P = 2.20x10-08; JLIM P = 2x10-04) of TYK2 respectively as shown in the bottom two panels of Fig 2A. We show the exon and intron labelling of TYK2 in further detailin S8 Fig. We found strong correlation of association P-values of the SLE GWAS and the P-val- ues of TYK2 cis-eQTLs against at exon-level and intron-level, but not at gene-level (Fig 2B). The risk allele rs2304256 [C] was found to be associated with decreased expression of the TYK2 exon and increased expression of the TYK2 intron (Fig 2C). By plotting the cis-eQTL P- values alongside the JLIM P-values for all exons and introns of TYK2 against rs2304256 (Fig 2D), we clearly show that only a single exon and a single intron of TYK2 colocalize with the SLE association signal–marked by an asterisk (note that rs2304256 is a strong cis-eQTL for many introns of TYK2 but only shares a causal variant with one intron). We show the genomic location of the affected exon and intron of TYK2 in Fig 2E (exon 8 and the intron between exons 9 and 10). Intron 9–10 of TYK2 is clearly expressed in LCLs according to transcription levels assayed by RNA-Seq on LCLs (GM12878) from ENCODE (Fig 2E).Interestingly, rs2304256 (marked by an asterisk in Fig 2E) is a missense variant (V362F) within exon 8 of TYK2. The PolyPhen prediction of this substitution is predicted to be benign and, to the best of our knowledge, no investigation has isolated the functional effect of this par- ticular amino acid change. We do not believe the cis-eQTL at exon 8 to be a result of variation at rs3204256 and mapping biases, as the alignability of 75mers by GEM from ENCODE is pre- dicted to be robust around exon 8 (Fig 2E). In fact, rs3204256 [C] is the reference allele yet is associated with decreased expression of exon 8.In conclusion, we have found an interesting and novel mechanism that would have been concealed by gene-level analysis that involves the risk allele of a missense SNP associated with decreased expression of a single exon of TYK2 but increased expression of the neighbouring intron. Whether the cis-eQTL effect and missense variation act in a combinatorial manner and whether the intron is truly retained or if it is derived from an unannotated transcript of TYK2 is an interesting line of investigation.We re-performed our integrative cis-eQTL analysis with the Geuvadis RNA-Seq dataset in LCLs using association data from twenty autoimmune diseases. This was to firstly reiterate the importance of leveraging RNA-Seq in GWAS interpretation and to secondly demonstrate that our findings in SLE persisted across other immunological traits. As the raw genetic association data were not available for all twenty diseases, we were unable to implement the JLIM pipeline which requires densely typed or imputed GWAS summary-level statistics. We therefore opted to use the Regulatory Trait Concordance (RTC) method, which requires full genotype-level data for the expression trait, but only the marker identifier for the lead SNP of the disease asso- ciation trait (see methods for a description of the RTC method). We stringently controlled our integrative cis-eQTL analysis for multiple testing to limit potential false positive findings of overlapping association signals. To do this, we applied a Bonferroni correction to nominal cis- eQTL P-values separately per disease and per RNA-Seq quantification type. We rigorouslydefined causal cis-eQTLs, as associations with PBF < 0.05 and RTC 0.95. An overview of the analysis pipeline is depicted in S9 Fig and S10 Fig. Using an r2 cut-off of 0.8 and a 100kb limit,we pruned the 752 associated SNPs from the twenty human autoimmune diseases from the Immunobase resource (S6 Table) to obtain 560 independent susceptibility loci.Our findings confirmed our previous results from the SLE investigation, and again support the gene-level study using the JLIM package. As before, we found that only 5% (28 of the 560 loci) of autoimmune susceptibility loci were deemed to share causal variants with cis-eQTLs using either gene- or transcript-level analysis (Fig 3A). Exon-level analysis more than doubled the yield to 13% (72 of the 560 loci) with junction-, and intron-level analysis also outperform- ing gene-level (10% and 8% respectively). When combining all RNA-Seq quantification types, we could define 20% of autoimmune associated loci (110 of the 560 loci) as being candidate causal cis-eQTLs—which corroborates our previous estimate in SLE using JLIM (24%).By separating causal cis-eQTL associations out by quantification type, we found over half (65%) were detected at exon-level, and considerable overlap of cis-eQTL associations existed between both types (Fig 3B). Unlike in our SLE analysis, gene- and isoform-level analysis did capture a small fraction of causal cis-eQTLs that were not captured at exon-level. Our data therefore suggest that although exon- and junction-level, and to a lesser extent intron-level analysis, capture most candidate-causal cis-eQTLs. It is necessary to prolife gene-expression at all quantification types to avoid misinterpretation of the functional impact of disease associ- ated SNPs.We mapped the causal cis-eQTLs detected by all RNA-Seq quantification types back to the diseases to which they are associated (Fig 3C). Interestingly, we observed the diseases that fell below the 20% average comprised autoimmune disorders related to the gut: celiac disease (7%), inflammatory bowel disease (14%), Crohn’s disease (16%), and ulcerative colitis (18%). We attribute this observation as a result of the cellular expression specificity of associated genes in colonic tissue and in T-cells [34]. Correspondingly, we observed an above-average fre- quency of causal cis-eQTLs detected in SLE (22%) and primary biliary cirrhosis (37%); diseases in which the pathogenic role of B-lymphocytes and autoantibody production is well docu- mented [34]. Note that there are 60 SLE GWAS associations in this analysis as these originatefrom three independent GWA studies (S6 Table). We further broke down our results per dis- ease by RNA-Seq quantification type (Fig 3D) and in all cases, the greatest frequency of causal cis-eQTLs and candidate genes were captured by exon- and junction-level analyses.Web resource for functional interpretation of association studies of autoimmune diseaseWe provide the results from our analysis as a web resource (found at www.insidegen.com) for researchers to lookup causal cis-eQTLs and candidate genes from the twenty autoimmune dis- eases detected across the five RNA-Seq quantification types. The data are sub-settable and exportable by SNP ID, gene, RNA-Seq resolution, genomic position, and association to specific autoimmune diseases. See methods for a walkthrough of how to access results.Exon-level quantification detects systematic and heterogeneous effects on gene expressionBy implementing a mixed model test of heterogeneity that accounts for the dependency struc- ture arising from within-individual and within-gene expression correlations, we attempted to distinguish causal cis-eQTLs at transcript-, exon-, junction-, and intron-level that fitted either a systematic gene-model (characterized by a similar effect on expression across all elements within a gene) or a heterogeneous gene-model (where the cis-eQTL signal is only evident in a subset of expression elements). The full results of this analysis are found in S7 Table.We found that across each RNA-Seq profiling type, the majority of causal cis-eQTLs exhib- ited heterogeneous effects on gene expression; indicative of alternative isoform usage (Fig 4A). Junction-level causal cis-eQTLs had the greatest proportion of heterogeneous associations (49 of 65 causal cis-eQTLs were heterogeneous—75%). Both systematic and heterogeneous causal cis- eQTLs were then stratified by whether or not they were also causal at gene-level. As expected, we observed that causal cis-eQTLs that were also detected at gene-level (Fig 4B) showed a greater proportion of systematic effects on gene expression than associations not detected at gene-level (Fig 4C). In both cases however, the heterogeneous model was more apposite. Interestingly, we found that the greatest frequency of systematic associations, which were not captured at gene- level, were observed at exon-level (42 of 76: 55%). This implies that exon-level analysis captures a near equal proportion of both systematic and heterogeneous effects that are not detected by gene-level analysis. We show four examples of systematic and heterogeneous causal cis-eQTLs stratified by their detection at gene-level quantification in Fig 5.A previous investigation has suggested that causal variants of gene-level and transcript-level cis-eQTLs reside in discrete functional elements of the genome [18]. We therefore investigated whether this notion held true across the five RNA-Seq quantification types tested in this study. To accomplish this, we selected the causal cis-eQTLs from the twenty autoimmune diseases interrogated, and per quantification type, tested for enrichment of these SNPs across various chromatin regulatory elements taken from the Roadmap Epigenomics Project in LCLs (using both the Roadmap chromatin state model and the positions of histone modifications). We implemented the permutation-based GoShifter algorithm to test for enrichment of causal cis-eQTLs and tightly correlated variants (r2>0.8) in genomic functional annotations in LCLs (see methods) [25]. Results of this analysis are depicted in Fig 6. We found the 28 gene-level cis-eQTLs were enriched in two chromatin marks: strong enhancers (P = 0.036) and H3K27ac occupancy sites–a marker of active enhancers (P = 0.002). Transcript-level cis-eQTLs were alsoenriched in H3K27ac occupancy sites (P = 0.039) but were not enriched in any other marks. The 72 exon-level cis-eQTLs were additionally enriched in active promoters (P = 0.017). Inter- estingly, the 54 causal cis-eQTLs detected at junction-level were found to be enriched in weak enhancers only (P = 0.002); whilst the 43 intron-level cis-eQTLs were enriched in chromatinstates predicted to be involved in transcriptional elongation (P = 0.001; 83% of intron-level cis- eQTLs). Disease relevant cis-eQTLs detected at different expression phenotypes using RNA– Seq clearly localise to largely discrete functional elements of the genome.We quantified the number of causal cis-eQTLs and tightly correlated variants (r2>0.8) per quantification type that were predicted to be alter splice site consensus sequences of the targetgenes (assessed by Sequence Ontology for the hg19 GENCODE v12 reference annotation). We found only two of the 28 (7%) gene-level cis-eQTLs disrupted consensus splice-sites for their target genes compared to the 14% and 13% detected at exon- and junction-level respectively (Fig 6C).

Our data suggest that although exon- and junction- level analysis leads to the greatest frequency of causal cis-eQTLs, the majority at this resolution cannot be explained directly by variation in annotated splice site consensus sequences (splice region/donor/acceptor/ variants).Confirmation that autoimmune causal cis-eQTLs reach genome-wide level of significanceWe extended our investigation and performed genome-wide cis-eQTL analysis for all SNPs against gene-, transcript-, exon-, junction-, and intron-level quantifications. As with ourintegrative analysis of autoimmune risk loci, we found the greatest number of genome-wide significant cis-eQTLs and target genes (at a genome-wide FDR threshold of 5%) were detected at exon-level, followed by junction- and intron-level; with gene- and transcript-level being thoroughly outperformed (S8 Table and S11 Fig). We confirmed that all of the causal cis-eQTL associations detected in our integrative analysis with autoimmune risk loci reached genome- wide significance—owing to the stringent Bonferroni multiple testing correction adopted (S9 Table).

Discussion
Elucidation of the functional consequences of non-coding genetic variation in human disease is a major objective of medical genomics [35]. Integrative studies that map disease-associated eQTLs in relevant cell types and physiological conditions are proving essential in progression towards this goal through identification of causal SNPs, candidate-genes, and illumination of molecular mechanisms [36]. In autoimmune disease, where there is considerable overlap of immunopathology, integrative eQTL investigations have been able to connect discrete aetiolo- gical pathways, cell types, and epigenetic modifications, to particular clinical manifestations [2,34,36,37]. Emerging evidence however has suggested that only a minority (~25%) of auto- immune associated SNPs share casual variants with basal-level cis-eQTLs in primary immune cell-types [9].Genetic variation can influence expression at every stage of the gene regulatory cascade— from chromatin dynamics, to RNA folding, stability, and splicing, and protein translation [21]. It is now well documented that SNPs affecting these units of expression vary strikingly in their genomic positions and localisation to specific epigenetic marks [18]. The eQTLs that affect pre-transcriptional regulation—affecting all isoforms of a gene—differ in the proximity to the target gene and effect on translated isoforms than their co-transcriptional trQTL (tran- script ratio QTL) counterparts. Where the effect size of eQTLs generally increases in relation to transcription start site proximity, trQTLs are distributed across the transcript body and generally localise to intronic binding sites of splicing factors [18,21]. In over 57% of genes with both an eQTL influencing overall gene expression and an trQTL affecting the ratio of each transcript to the gene total, the causal variants for each effect are independent and reside in distinct regulatory elements of the genome [18]. In fact, three primary molecular mechanisms are thought to link common genetic variants to complex traits.

A large proportion of trait asso- ciated SNPs act via direct effects on pre-mRNA splicing that do not change total mRNA levels [21]. Common variants also act via alteration of pre-mRNA splicing indirectly through effects on chromatin dynamics and accessibility. Such chromatin accessibility QTLs are however more likely to alter total mRNA levels than splicing ratios. Lastly, it is thought that only a minority of trait associated variants have direct effects on total gene expression that cannot be explained by changes in chromatin. As RNA-Seq becomes the convention for genome-wide transcriptomics, it is essential to maximise its ability to resolve and quantify discrete transcrip- tomic features so to expose the genetic variants that contribute to changes in expression and isoform usage. The reasoning for our investigation therefore was to delineate the limits of microarray and RNA-Seq based eQTL cohorts in the functional annotation of autoimmune disease association signals.To map autoimmune disease associated cis-eQTLs, we interrogated RNA-Seq expression data profiled at gene-, isoform, exon-, junction-, and intron-level, and tested for a shared genetic effect at each significant association. As we had densely imputed summary statistics from our SLE GWAS, we opted to use the Joint Likelihood Mapping (JLIM) framework [9] to test for a shared causal variant between the disease and cis-eQTL signals. This framework has been rigorously benchmarked against other colocalisation procedures. Summary statistics were not available for the remaining autoimmune diseases and therefore we implemented the Regulatory Trait Concordance (RTC) method for these diseases and set a stringent multiple testing threshold to define causal cis-eQTLs. We found the estimates of causal cis-eQTLs were near identical between the two methods used (Table 1 and Fig 3A). Exon- and junction-level quantification led to the greatest frequency of causal cis-eQTLs and candidate genes (exon- level: 13–18%, junction-level: 10–11%).

We conclusively found that associated variants were in fact more likely to colocalize with exon- and junction-level cis-eQTLs when applying a nomi- nal JLIM P-value threshold of <0.01 (Fig 1B and Table 2). Gene-level analysis was thoroughly outperformed in all cases (5%). Our findings that gene-level analysis explain only 5% of causal cis-eQTLs corroborate the findings from Chun et al [9] who composed and used the JLIM framework to annotate variants associated with seven autoimmune diseases (multiple sclerosis, IBD, Crohn’s disease, ulcerative colitis, T1D, rheumatoid arthritis, and celiac disease). They found that only 16 of the 272 autoimmune associated loci (6%) shared causal variants with cis- eQTLs using gene-level RNA-Seq (with the same Geuvadis European cohort in LCLs as used herein). In our investigation, we argue that it is necessary to profile expression at all possible resolutions to diminish the likelihood of overlooking potentially causal cis-eQTLs. In fact, by combining our results across all resolutions, we found that 20–24% of autoimmune loci were candidate-causal cis-eQTLs for at least one target gene. Our study therefore increases the number of autoimmune loci with shared genetic effects with cis-eQTLs in a single cell type by over four-fold. Interestingly, using microarray data from CD4+ T-cells Chun et al classified 37 of the 272 autoimmune loci (14%) as causal cis-eQTLs [9]—strengthening the hypothesis that autoimmune loci (especially those associated with inflammatory diseases of the gut) are enriched in CD4+ T-cell subsets and the cells themselves are likely to be pathogenic [25,34].Microarray data are known to underestimate the number of true causal cis-eQTLs [10]. If we assume that by leveraging RNA-Seq we can increase the number of steady-state causal cis- eQTLs four-fold, we hypothesise that as many as ~54% of autoimmune loci may share causal cis-eQTLs with gene expression at multiple resolutions in CD4+ T-cell populations. A large RNA-Seq based eQTL cohort profiled across multiple CD4+ T-cell subsets will therefore be of great use when annotating autoimmune-related traits. Immune activation conditions further increase the number of causal cis-eQTLs detected in autoimmune disease [38]. We reason that although using relevant cell types and context-specific conditions will undoubtedly increase our understanding of how associated variants alter cell physiology and ultimately contribute to disease risk; it is clearly shown herein that we are only picking the low hanging fruit in current eQTL analyses. We argue it necessary to reanalyse existing RNA-Seq based eQTL cohorts at multiple resolutions and ensure new datasets are similarly dissected. Despite the severe multi- ple testing burden, we also argue that expression profiling at multiple resolutions using RNA- Seq may be advantageous even when looking for trans-eQTL effects. As trans-eQTLs are gen- erally more cell-type specific and have a weaker effect size, we decided not to perform such analyses using the Geuvadis LCL data. Large RNA-Seq based eQTL cohorts in whole-blood will be more suitable for such analysis [19]. As well as biological reasons for using multiple expression phenotypes for integrative eQTL analysis, there are also technical factors to consider. Gene-level expression estimates can gener- ally be obtained in two ways–union-exon based approaches [14,17] and transcript-based approaches [11,12]. In the former, all overlapping exons of the same gene are merged into union exons, and intersecting exon and junction reads (including split-reads) are counted to these pseudo-gene boundaries. Using this counting-based approach, it is also possible to quan- tify meta-exons and junctions easily and with high confidence by preparing the reference annotation appropriately [13,15,39]. Introns can be quantified in a similar manner by invert- ing the reference annotation between exons and introns [18]. Of note, we found intron-level quantification generated more candidate-causal cis-eQTLs than gene-level (Fig 3A). As the library was synthesised from poly-A selection, these associations are unlikely due to differences in pre-mRNA abundance. Rather, they are likely derived from either true retained introns in the mature RNA or from coding exons that are not documented in the reference annotation used. Transcript-based approaches make use of statistical models and expectation maximiza- tion algorithms to distribute reads among gene isoforms—resulting in isoform expression esti- mates [11,12]. These estimates can then be summed to obtain the entire expression estimate of the gene. Greater biological insight is gained from isoform-level analysis; however, disambigu- ation of specific transcripts is not trivial due to substantial sequence commonality of exons and junctions. In fact, we found only 5% of autoimmune loci shared a causal variant at tran- script-level. The different approaches used to estimate expression can also lead to significant differences in the reported counts. Union-based approaches, whilst computationally less expensive, can underestimate expression levels relative to transcript-based, and this difference becomes more pronounced when the number of isoforms of a gene increases, and when expression is primar- ily derived from shorter isoforms [20]. The Geuvadis study implemented a transcript-based approach to obtain whole-gene expression estimates. Clearly therefore, a gold standard of ref- erence annotation and eQTL mapping using RNA-Seq is essential for comparative analysis across datasets. Our findings support recent evidence that suggests exon-level based strategies are more sensitive and specific than conventional gene-level approaches [22]. Subtle isoform variation and expression of less abundant isoforms are likely to be masked by gene-level analy- sis. Exon-level allows for detection of moderate but systematic changes in gene expression that are not captured at gene-level, and also, gene-level summary counts can be shifted in the direc- tion of extreme exon outliers [22]. It is therefore important to note that a positive exon-level eQTL association does not necessarily mean a differential exon-usage or splicing mechanism is involved; rather a systematic expression effect across the whole gene may exist that is only captured by the increased sensitivity. By implementing a mixed model test of heterogeneity that accounts for the dependency structure arising from within-individual and within-gene expression correlations we found that causal cis-eQTLs captured by exon-level analysis that are not detected at gene-level, are derived from both systematic and heterogeneous effects on gene expression in almost equal proportions (Fig 4). Additionally, by combining exon-level with other RNA-Seq quantification types, inferences can be made on the particular isoforms and functional domains affected by the eQTL which can later aid biological interpretation and targeted follow-up investigations [10]. We clearly show this from our analysis of SLE candidate genes IKZF2 (S5 Fig), UBE2L3 (S6 Fig), LYST (S7 Fig) and TYK2 (Fig 2). For TYK2 we reveal a novel mechanism whereby the associated variant rs2304256 [C] leads to decreased expression of a single exon and increased expression of a neighbouring intron (Fig 2). By isolating partic- ular exons, junctions, and introns, one can design more refined follow-up investigations to study the functional impact of non-coding disease associated variants. We show how our find- ings can be leveraged to comprehensively examine GWAS results of autoimmune diseases. We found nine of the 38 SLE susceptibility loci were causal cis-eQTLs (Table 3) for 12 candidate genes which we later functionally annotated in detail (S4 Table).Taken together, we have provided a deeper mechanistic understanding of the genetic regulation of gene expression in autoimmune disease by profiling the transcriptome at multiple resolutions using RNA-Seq. Similar analyses leveraging RNA-Seq in new and existing datasets using relevant cell types and context-specific conditions (such as response eQTLs as shown in [38]) will undoubtedly increase our understanding of how associated variants alter cell physi- ology and ultimately contribute to disease risk. RNA-Sequencing (RNA-Seq) expression data from 373 lymphoblastoid cell lines (LCLs) derived from four European sub-populations (Utah Residents with Northern and Western European Ancestry, British in England and Scotland, Finnish in Finland, and Toscani in Italia) of the Geuvadis project [18] were obtained from the EBI ArrayExpress website under acces- sion: E-GEUV-1. The 89 individuals of the Geuvadis project from the Yoruba in Ibadan, Nige- ria were excluded from this analysis. All individuals were included as part of the 1000Genomes Project. Expression was profiled using RNA-Seq at five quantification types: gene-, transcript-, exon-, junction-, and intron-level (the files downloaded and used in this analysis have the suf- fix: ‘QuantCount.45N.50FN.samplename.resk10.txt.gz’). Full methods of expression quantifi- cation can be found in the original publication and on the Geuvadis wiki page: http:// geuvadiswiki.crg.es/)). We have also provided a breakdown of the quantification methods in S1 Fig. Expression data downloaded represent quantifications that are corrected for sequenc- ing depth and gene/exon etc length (RPKM). Only expression elements quantified in >50% of individuals were kept and Probabilistic Estimation of Expression Residuals (PEER) had been used to remove technical variation [40]. We transformed all expression data to a standard nor- mal distribution.

In summary, transcripts, splice-junctions, and introns were quantified using Flux Capacitor against the GENCODE v12 basic reference annotation [16]. Reads belonging to single tran- scripts were predicted by deconvolution per observations of paired-reads mapping across all exonic segments of a locus. Gene-level expression was calculated as the sum of all transcripts per gene. Annotated splice junctions were quantified using split read information, counting the number of reads supporting a given junction. Intronic regions that are not retained in any mature annotated transcript, and reported mapped reads in different bins across the intron to distinguish reads stemming from retained introns from those produced by not yet annotated exons. Meta-exons were quantified by merging all overlapping exonic portions of a gene into non-redundant units and counting reads within these bins. Reads were excluded when the read pairs map to two different genes.SNPs genetically associated to systemic lupus erythematosus (SLE) were taken from the Ben- tham and Morris et al 2015 GWAS in persons of European descent [7]. The study comprised a primary GWAS, with validation through meta-analysis and replication study in an external cohort (7,219 cases, 15,991 controls in total). Independently associated susceptibility loci taken forward for this investigation were those that passed either genome-wide significance (P<5x10-08) in the primary GWAS or meta-analysis and/or those that reached significance in the replication study (q<0.01). We defined the lead SNP at each locus as either being the SNP with the lowest P-value post meta-analysis or the SNP with the greatest evidence of a missense effect as defined by a Bayes Factor (see original publication). We omitted non-autosomal associations and those within the Major Histocompatibility Complex (MHC), and SNPs with a minor allele frequency (MAF) < 0.05. In total, 38 independently associated SLE associated GWAS SNPs were taken forward for investigation (S1 Table). Each susceptibility locus had previously been imputed to the level of 1000 Genomes Phase3 using a combination of DEG-77 pre- phasing by the SHAPEIT algorithm and imputation by IMPUTE (see original publication for full details) [7].