Interaction analysis using PLINK, CASSI, ORMDR and BEAM3

Overview

Purpose

In this exercise you will be performing association analysis and testing for interaction effects using case/control data.

Methodology

The methodology used includes chi-squared tests and logistic regression in PLINK, as well as variants of these approaches implemented in PLINK and CASSI. We shall also use an R package ORMDR that implements a form of multifactor dimensionality reduction (MDR), and a Bayesian Epistasis Association Mapping method implemented in the program BEAM3.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying if you are not familiar with using statistical packages such as R. Don't worry too much if you don't understand all the details! If you decide to regularly use a statistical package such as R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

Program documentation

PLINK documentation:

PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:

http://pngu.mgh.harvard.edu/~purcell/plink/

CASSI documentation:

CASSI documentation is available from:

http://www.staff.ncl.ac.uk/richard.howey/software.html

R documentation:

The R website is at http://www.r-project.org/

From within R, one can obtain help on any command xxxx by typing `help(xxxx)'

BEAM documentation:

The (rather sketchy!) documentation provided with the BEAM3 package can be found here BEAM3-README.txt



Exercise

Data overview

The data consists of simulated genotype data at 100 SNP loci, typed in 2000 cases and 2000 controls. The data has been simulated in such a way that the first five SNPs have some relationship with disease, whereas the remaining 95 SNPs have no effect on disease outcome.

The complication with these data is that SNPs 1 and 2 have been simulated in such a way that they show no marginal association with the disease: their association will only be visible when you look at both SNPs in combination. SNPs 3-5 have been simulated to only have an effect on disease when an individual is homozygous at all three of these loci. Although potentially this could lead to marginal effects at the loci, formally this corresponds to a model of pure interaction, with no main effects, at these 3 SNPs.

Appropriate data

Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls).

All the programs will deal with much larger numbers of loci than the 100 SNPs considered here. PLINK and CASSI, in particular, were specifically designed for the analysis of large numbers of loci e.g. generated as part of a genome-wide association study.

Special considerations/restrictions (for the programs used here)

All the commands required to run these analyses are given below. However, packages such as R are sophisticated statistical programming packages and have much greater functionality than will be described here. If you intend to use a statistical package to analyse your data, you are strongly encouraged to learn how to use that package appropriately through attendance on a training course or careful reading of an introductory textbook.

Instructions

Data format

The data for the 100 SNPs simcasecon.ped is in standard linkage pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Note that since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown.

PLINK requires an additional map file simcasecon.map describing the markers (in order) in the pedigree file. The PLINK-format map file contains exactly 4 columns:

     chromosome (1-22, X, Y or 0 if unplaced) 
     rs number or snp identifier 
     Genetic distance (morgans) (not often used - so can be set to 0)
     Base-pair position (bp units) 

Take a look at the data files, and check that you understand how the data is coded. Then save (or copy) the files as .txt files to an appropriate directory (folder) on your computer.

Step-by-step instructions

1. Analysis in PLINK

Move to the directory where you have saved the data files.

To carry out a basic association analysis in PLINK, type

./plink --noweb --ped simcasecon.ped --map simcasecon.map --assoc

The --noweb command tells PLINK to run without bothering to check via the web to see whether there are updated versions of PLINK. The --ped xxxx command tells PLINK that the name of the pedigree file is xxxx and the --map yyyy command tells PLINK that the name of the map file is yyyy. The --assoc command tells PLINK to perform a basic allele-based chisquared association test.

PLINK outputs some useful messages (you should always read these to make sure they match up with what you expect!) and outputs the results to a file plink.assoc .

Take a look at the file plink.assoc (e.g. by typing more plink.assoc ). For each SNP the following columns of results are reported:

     CHR     Chromosome
     SNP     SNP ID
     BP      Physical position (base-pair)
     A1      Minor allele name (based on whole sample)
     F_A     Frequency of this allele in cases
     F_U     Frequency of this allele in controls
     A2      Major allele name
     CHISQ   Basic allelic test chi-square (1df)
     P       Asymptotic p-value for this test
     OR      Estimated odds ratio (for A1, i.e. A2 is reference)

Does there appear to be evidence of association at any of the five "true" loci (e.g., the first five SNPs)? What about the 95 null loci?

To test for pairwise epistasis in PLINK, the fastest option is to use the --fast-epistasis command:

./plink --noweb --ped simcasecon.ped --map simcasecon.map --fast-epistasis

Results can be found in the file plink.epi.cc.

Here are the first few lines of plink.epi.cc:

CHR1    SNP1 CHR2    SNP2         STAT            P 
   0    snp1    0    snp2        390.1    8.517e-87 
   0    snp9    0   snp28        16.66    4.481e-05 
   0    snp9    0   snp32        24.24    8.532e-07 
   0   snp15    0   snp32        16.63    4.539e-05 
   0   snp15    0   snp34         15.2     9.66e-05 
 ...

Only pairwise interaction tests with p <= 0.0001 are reported (otherwise, for genome-wide studies, there would be too many results to report, given the large number of pairwise tests performed). You should find a very significant interaction between SNPs 1 and 2, and a number of less significant pairwise iteractions. Since this is simulated data, we know that all of these less significant results are false positives.

A more powerful test for SNPs that are not in LD with one another (i.e. that are not too close) is to additionally use the --case-only option:

./plink --noweb --ped simcasecon.ped --map simcasecon.map --fast-epistasis --case-only

Results can be found in the file plink.epi.co.

Here are the first few lines of plink.epi.co:

CHR1    SNP1 CHR2    SNP2         STAT            P 
   0    snp1    0    snp2        742.7   1.749e-163 
   0    snp3    0    snp4        15.88    6.743e-05 
   0    snp4    0    snp5        17.57    2.773e-05 
   0    snp8    0   snp60        16.65    4.509e-05 
   0    snp9    0   snp32        30.29    3.738e-08 
  ...

Again only pairwise interaction tests with p <= 0.0001 are reported. You should again find a very significant interaction between SNPs 1 and 2 (even more significant than previously, owing to the increased power with a case-only test), and a number of less significant positive pairwise iteractions. Most of these are false positives, but PLINK does appear to have detected the true positive interactions between SNPs 3 and 4, and between SNPs 4 and 5.

A problem with the --fast-epistasis test is that it can be affected by LD between the SNPs (although only the case-only test is seriously affected). The most accurate test is to use the slower --epistasis command, which implements the test via logistic regression:

./plink --noweb --ped simcasecon.ped --map simcasecon.map --epistasis

Do you find that the program runs more slowly? Results can again be found in the file plink.epi.cc (which will now have been overwritten). You can see that now the interaction between SNPs 1 and 2 remains highly significant, together with one other (false positive) interaction between SNPs 15 and 77.

CHR1    SNP1 CHR2    SNP2       OR_INT         STAT            P 
   0    snp1    0    snp2        2.955        283.7    1.155e-63 
   0   snp15    0   snp77       0.8073        17.47    2.916e-05 



Since the --epistasis option is slower, but most accurate, for genome-wide studies it is recommended to first to screen for interactions using the --fast-epistasis command, but then confirm any findings using the --epistasis command on the smaller set of detected SNPs.

Before we finish with PLINK, we will use PLINK to create a differently-fomatted data file that will be required later, when we use the R package ORMDR:

./plink --noweb --ped simcasecon.ped --map simcasecon.map --recodeA --out recoded

This creates a new file recoded.raw in which the genotype data for each SNP is coded as a single count of the minor allele (0, 1, 2) rather than as two alleles (1 1, 1 2, 2 2 etc). Take a look at the file recoded.raw and check you understand this new format.



2. Analysis in CASSI

We will compare our results from PLINK with those obtained using the CASSI program. The CASSI program implements a variety of tests including logistic regression (equivalent to PLINK's --epistasis test) and the JE (Joint Effects) test described in Ueki and Cordell (2012).

First we need to convert our data to PLINK binary format:

./plink --noweb --ped simcasecon.ped --map simcasecon.map --make-bed --out cassiformat

This should create PLINK binary format files cassiformat.bed, cassiformat.bim cassiformat.fam. Then we use the CASSI program (which implements the JE test by default, though it can also implement various other tests, see CASSI website for details) with the input file cassiformat.bed to perform pairwise interaction tests at all pairs of loci. (By default, only those pairs of SNPs showing interaction with a P value < 0.0001 are output, though this can be changed if desired):

./cassi -i cassiformat.bed

Take a look at the output file cassi.out. The most important columns are the first 8 columns (listing the SNP numbers/names/positions) and the last 6 columns listing the case/control and case-only interaction test chi-squareds and p values, together with an indication of whether or not the alternative (rather than default) JE statistic was used for the test.

It can be quite hard to work out which column is which, so we suggest you start up R by typing

R

and then read in the results by typing

je <-read.table("cassi.out", header=T)

and then take a look at them by typing

je

You should see:

  SNP1 CHR1   ID1      BP1 SNP2 CHR2    ID2       BP2 JE_CASE_LOG_OR JE_CASE_SE
1    1    0  snp1  1000000    2    0   snp2   2000000       2.929320  0.0827492
2    3    0  snp3  3000000    4    0   snp4   4000000       0.350718  0.0863819
3    4    0  snp4  4000000    5    0   snp5   5000000       0.371610  0.0851736
4   15    0 snp15 15000000   77    0  snp77  77000000      -0.202479  0.0739267
5   31    0 snp31 31000000  100    0 snp100 100000000       0.284926  0.0723418
  JE_CTRL_LOG_OR JE_CTRL_SE JE_CC_CHISQ      JE_CC_P JE_CC_ALT JE_CO_CHISQ
1     0.00670512  0.0878829   586.21800 1.66520e-129         N  1253.16000
2     0.05162640  0.0916923     5.63702  1.75851e-02         N    16.48430
3    -0.07437630  0.0922595    12.61570  3.82518e-04         N    19.03550
4     0.23078500  0.0710405    17.85770  2.38057e-05         N     7.50168
5    -0.09335550  0.0714374    13.84370  1.98666e-04         N    15.51260
       JE_CO_P JE_CO_ALT
1 1.70712e-274         N
2  4.90549e-05         N
3  1.28308e-05         N
4  6.16414e-03         N
5  8.19557e-05         N


You can see that SNPs 1 and 2 show a very strong pairwise interaction (case/control p-value JE_CC_P=1.66e-129; case-only p-value JE_CO_P=1.71e-274). Interestingly we also detect, at lower significance levels, the pairwise interactions between SNPs 3 and 4 (case/control p-value=0.018; case-only p-value=0.000049) and between SNPs 4 and 5 (case/control p-value=0.00038; case-only p-value=0.000013). We also detect two false positive interactions, between SNPs 15 and 77, and between SNPs 31 and 100.



3. ORMDR analysis in R

Stay within the R package. While we are in R, we will try performing a form of MDR analysis that has been made available in an R package "ORMDR". (Note that the original MDR software, which is available as a Java program, is probably a bit more user-friendly than ORMDR). Read in the required ORMDR library:

library(ORMDR)

First we will read in the data in a form that is compatible with the ORMDR package. We first read in the data in the form given in the file recoded.raw, and take a look at the top of it (the first 6 rows) to remind ourselves what it looks like:

recoded<-read.table("recoded.raw", header=T)
head(recoded)


We then have to select just the 100 columns of SNP data together with the phenotype column, which needs to be coded as 1 and 0 (rather than 2 and 1). For convenience we will place the phenotype column at the end of the SNP data. That can be achieved as follows:

newdata<-recoded[7:106]
ormdrdata<-cbind(newdata,recoded$PHENOTYPE-1)
names(ormdrdata)[101]<-"casestatus"


Take a look at the top of the new data frame you have created, to check you understand it, by typing:

head(ormdrdata)

The data consists of 100 SNPs, with the phenotype (case/control status, coded 1/0) in column 101. To analyse this, looking at individual SNPs (i.e. groups of SNPs of size 1), and saving the results in an object "mdr1", type:

mdr1<-mdr.c(ormdrdata, colresp=101, cs=1, combi=1, cv.fold = 10)

The syntax colresp=101 tells the program which column the "class" (phenotype) variable is in, and cs=1 tells the program which code is used for cases. The syntax combi=1 tells the program to consider single-locus combinations. The syntax cv.fold = 10 tells the program to perform 10-fold cross validation. This means that the program divides the data into 10 equal parts, uses 9/10 of the data to fit and choose the best combination and then tests how well this combination performs for predicting the remaining 1/10 of the data. The whole process is repeated 10 times, for each of the different 1/10 portions of the data that could be predicted.

To see which was the best SNP (most predictive of disease status) in each of the ten cross validation replicates, type:

mdr1$min.comb

You may see something like this (the exact results will vary as random numbers are used in this algorithm):

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   20    6   20   23    5    6    7    3    6     5

No SNP comes up consistently as best, although SNP 5 (which is a true associated SNP) may come up in some replicates. To repeat the process looking at all two-locus models (pairwise combinations of SNPs) type:

mdr2<-mdr.c(ormdrdata, colresp=101, cs=1, combi=2, cv.fold = 10)
mdr2$min.comb


Here we should see that SNPs 1 and 2 consistently come up as the best two-locus combination. To repeat the process looking at all three-locus combinations of SNPs, type:

mdr3<-mdr.c(ormdrdata, colresp=101, cs=1, combi=3, cv.fold = 10)
mdr3$min.comb


Again SNPs 1 and 2 consistently come up, together with some additional SNPs which are sometimes true positives (SNPS 3-5) and sometimes false positives.

To see how well the best SNP or SNPS in each cross validation sample actually do at predicting the relevant 1/10 of the data, take a look at the error rates:

mdr1$test.erate
mdr2$test.erate
mdr3$test.erate


The most appropriate number of loci to select for our final model is that number n which minimizes the average cross validation error. To work out the average cross validation error for each value of n, type:

mdr1mean<-mean(mdr1$test.erate)
mdr2mean<-mean(mdr2$test.erate)
mdr3mean<-mean(mdr3$test.erate)

mdr1mean
mdr2mean
mdr3mean


These results suggest that the minumum mean cross validation error occurs when n=2. To find which is the best 2-way combination, type:

mdr2$best.combi

This tells us that the best 3-way combination is SNPs 1 and 2. This is what would be expected from the cross-validation results:

mdr2$min.comb

which showed that SNPs 1 and 2 did best within each cross-validation sample. We therefore have a cross-validation consistency of 10/10. If the minimum cross-validation error had ocurred when n=3, we would need to look at:

mdr3$best.combi
mdr3$min.comb


in which case the best 3-way combination would be SNPs 1, 2 and 3, but this has a cross validation consistency of around 8/10 (since in several of the cross validation samples, SNPs 1, 2 and 3 were not the best combination).

Although ORMDR (and MDR) allow you to look at more complicated models than just pairwise combinations of SNPs, in practice these programs do not scale up very well to analysing large numbers of SNPs.


Just for fun, we will see what results we get when we analyse the true "known" interacting SNPs using standard logistic regression in R. First try analysing SNPs 1 and 2 together:

logreg12<-glm(casestatus ~ factor(snp1_2)*factor(snp2_1), family=binomial, data=ormdrdata)
summary(logreg12)
anova(logreg12)


You should find several significant terms. The formal 4 df test of interaction is given by the "deviance" of 701.68 on 4 df. To see the significance of this type:

pchisq(701.68,4,lower.tail=F)

You should find a highly significant interaction effect. The overall 8 df test of association (allowing for interaction) has a total deviance of 701.68+0.65+1.49=703.82 (or equivalently 5545.2-4841.3=703.9). This is also highly significant, as you can tell by typing:

pchisq(703.82,8,lower.tail=F)

Let us repeat this type of analysis using SNPs 3-5:

logreg345<-glm(casestatus ~ factor(snp3_2)*factor(snp4_2)*factor(snp5_2), family=binomial, data=ormdrdata)
summary(logreg345)


This model has so many terms it is quite difficult to interpret. You should see the largest and most significant effect (p=1.23e-05) corresponds to when an individual is homozygous at all three loci. This is shown in the last line marked factor(snp3_2)2:factor(snp4_2)2:factor(snp5_2)2 This result suggests that these three loci act largely via a complex 3-way interaction. To see the overall significance at these three loci, type:

anova(logreg345)

The significance of any 3-way interaction terms is given by a deviance of 45.6 on 8 df. This is quite significant (p=2.8e-07), as you can tell by typing:

pchisq(45.6,8,lower.tail=F)

To get out of R, type q() (and reply n if it asks you to save the workspace)



4. Analysis in BEAM3

Finally, we shall use the program BEAM3 to perform a Bayesian analysis to detect single-locus associations and multi-way interactions. For details of the methodology, please read the BEAM3 paper (Zhang Y (2011) Genet Epid 36: 36-47), or see the (rather sketchy!) BEAM3 documentation (link above).

BEAM3 requires a very different file format from the other progams. Essentially the role of SNPs and people is transposed, so that each line coresponds to a SNP and each column to the genotype (at that SNP) for a different person. The top line of the file gives the case/control status (coded 1/0). For more details see the BEAM3 documentation (link above).

We have prepared a file in the correct format beam3data.txt. To run BEAM with specified input file and default values for the MCMC sampling scheme, type

./BEAM3 beam3data.txt -o beam3results

This should produce an output file beam3results.prob which lists the SNPs (numbered from 0 to 99) together with their posterior probabilities of showing marginal association, interaction association, and total association (= the marginal + interaction association probabilities).

Take a look at this file. Unfortunately BEAM3 does not seem to have picked up any associations. This is because it is hard for BEAM3 to detect SNPs that don't show any marginal association. However, we can use the -T t option which runs the program at "temperature" t in the first few iterations to try and force the program to jump out of local modes. This can help in finding interactions with no marginal association, although of course it is not guaranteed.

Try using this option with the temperature specified to start at 10 and then gradually drop to 1 as the MCMC procedure proceeds:

./BEAM3 beam3data.txt -o beam3results -T 10

Take a look at the new output file beam3results.prob . You should find that the first two SNPs (numbered 0 and 1) show high posterior probabilities of being involved in interaction associations. Therefore, SNP1 and SNP2 are detected via their interaction effects. No other markers are detected, so it seems as if the 3-way interaction between SNPS 4, 5, 6 is too weak to be detected in comparison to the stronger interaction between SNPS 1 and 2.


Answers

Interpretation of output

Answers and interpretation of the output are described in the step-by-step instructions. Please ask if you need help in understanding the output for any specific test.


Comments

Advantages/disadvantages

PLINK, CASSI and BEAM are designed for genome-wide studies, allowing the inclusion of many thousands of markers.

Analysis in a standard statistical package does not generally allow so many markers, but has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed e.g. it easy to include additional predictor variables such as environmental factors, gene-environment interactions etc. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.

Study design issues

With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, detection of interactions generally requires much larger sample sizes.

Other packages

Logistic regression analysis for detection of interactions can be performed in other statistical packages such as Stata, SAS, SPSS, GLIM.

MDR analysis can be performed in the Java MDR package. An alternative Bayesian Epistasis mapping approach is available in the BIA software from Jonathan Marchini. Several packages are available for implementing different data-mining and machine-learning approaches for detecting interactions or detecting association allowing for interaction. See Cordell (2009) (reference below) for more details.


References

Cordell HJ (2009) Genome-wide association studies: Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009 May 12. [Epub ahead of print]

Y Chung and S Y Lee and R C Elston and T Park (2007) Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions. Bioinformatics 23:71-76.

L W Hahn and M D Ritchie and J H Moore (2003) Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions Bioinformatics 19:376--382.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81:559-575.

Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF and Moore JH (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69:138-147.

Ueki M, Cordell HJ (2012) Improved statistics for genome-wide interaction analysis. PLoS Genetics 8(4):e1002625.

Zhang Y, Liu JS (2007) Bayesian inference of epistatic interactions in case-control studies. Nat Genet 39:1167-1173.

Zhang Y (2011) A novel Bayesian graphical model for genome-wide multi-SNP association mapping. Genet Epidemiol 36: 36-47.


Exercises prepared by: Heather Cordell
Checked by: Daniel E. Weeks and Simon Heath
Programs used: PLINK, R, dgcGenetics, ORMDR, BEAM3, CASSI
Last updated: