Computer Practical Exercises on Case/Control Association and Gene-Gene interaction using the R package

Purpose

In this exercise you will be carrying out chi-squared tests and logistic regression analysis of case/control data, to test for genetic association and for gene-gene interaction.

Introduction

The methodology used is standard methodology for case/control analysis (chi-squared tests and logistic regression) as described in the lectures. The tests will performed in the statistical analysis package R. As well as using standard functions implemented in R, we will also make use of special functions designed for genetic analysis that can be downloaded as add-in routines to the R package.
Some of the commands that you will type in order to do the analysis may seem a bit mystifying if you are not familiar with using statistical packages such as R. Don't worry too much if you don't understand all the details! If you decide to regularly use a statistical package such as R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

Program documentation

R documentation:

The R website is at http://www.r-project.org/

David Clayton's add-in routines (the dgc.genetics package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software

From within R, one can obtain help on any command xxxx by typing `help(xxxx)'

Instructions

Data overview

The data consists of genotype data at 4 linked SNP loci and another unlinked locus, typed in 384 type 1 diabetes cases and 672 controls.

Appropriate data

Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls). Ideally a larger data set should be used than we consider here e.g. 1000 cases and 1000 controls might be considered a reasonable number.

Data format

The data for the 4 linked loci casecondata.txt is in a similar format to a standard linkage pedigree file, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. The pedigree file used for the analysis in R differs from a standard pedigree file in that it has a header line describing the different columns, and it uses R's own missing value code "NA". Note that since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown.

The file newcasecondata.txt is an R format file that is identical to casecondata.txt except that it includes genotypes for an additional 5th unlinked SNP locus.

Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to the appropriate directory (folder) on your computer.

Step-by-step instructions

1. Chi-squared and logistic regression analysis

To start up the R package, click on the R icon. Then, under the File menu, change directory so that R will look in the correct directory (the folder where you saved the files) for any input files you specify.

You are now working within the R package. To start with, you need to read in the necessary add-in libraries (which we have downloaded ready for you):

library(dgc.genetics)

To read in your data, type

casecon <- read.table("casecondata.txt", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe 'casecon' and you can look at it simply by typing

casecon

or by typing

fix(casecon)

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

casecon$pedigree

It can be inconvenient to have to type 'casecon$' before typing every variable, so you can tell R to automatically look at variables in the casecon dataframe by typing:

attach(casecon)

Now you can just look at the column of pedigree names by typing

pedigree

However, you should beware when using the 'attach' command as if you already have a different variable in R called 'pedigree', R will use this variable rather than the one in the attached dataframe - without telling you that there may be some ambiguity. For this reason it is often convenient to clear all variables and detach all dataframes before starting your session. This can be done (BUT DONT DO IT NOW!) by use of the clear() command.

To perform the analysis we will first generate a variable called 'case' that codes affected/unaffected individuals as 1/0 rather than 2/1 (as they are coded in the 'affected' variable):

case <- affected-1

To look at the new case variable, type

case

Now we will generate genotype variables from the two alleles at each locus:

g1 <- genotype(loc1_1, loc1_2) g2 <- genotype(loc2_1, loc2_2) g3 <- genotype(loc3_1, loc3_2) g4 <- genotype(loc4_1, loc4_2)

Take a look at one of these e.g. by typing

g1

To perform chi-squared tests on the genotype variables (to see if they are associated with disease status), type:

table(g1,case) chisq.test(g1,case) table(g2,case) chisq.test(g2,case) table(g3,case) chisq.test(g3,case) table(g4,case) chisq.test(g4,case)

All loci should show significant association, with C and D (locus 3 and 4) showing the highest significance.

To perfom a chromosome-based analysis, you can use the commands:

allele.table(g1,case) chisq.test(allele.table(g1,case)) allele.table(g2,case) chisq.test(allele.table(g2,case)) allele.table(g3,case) chisq.test(allele.table(g3,case)) allele.table(g4,case) chisq.test(allele.table(g4,case))

To perform the logistic regression analysis, we need to set the variables up for either a 2df (genotype-based) analysis or a 1df (chromosome-based) analysis. For the genotype-based analysis, we use what are called 'genotype' contrasts, and for the chromosome-based analysis we use 'additive' (on the log odds scale - equivalent to multiplicative on the odds scale) contrasts. So for the 2df test at the first locus we use the following commands:

gcontrasts(g1) <- "genotype" logit (case ~ g1) anova(logit (case ~ g1))

The 'logit' command provides parameter estimates (odds ratios) and p values for each regression coefficient, while the 'anova' command gives an overall test in the form of a 'deviance' and a 'df'. In this case we find a deviance of 18.49 on 2df. To find a p value for this, we compare it to a chi-squared distribution by typing:

1-pchisq(18.49,2)

For a 1df test at the first locus, we use the following commands:

gcontrasts(g1) <- "additive" logit (case ~ g1) anova(logit (case ~ g1)) 1-pchisq(18.47,1)

You should be able to work out how to do a similar analysis on either 1df or 2df for the other loci. You should find that the pattern of significance is the same as you found when doing the chi-squared tests on the genotype or allele tables.

You can also perform a stepwise analysis, where you add or discard terms from the model. Suppose we wish to see whether locus 3 is important once we have already included locus 1, using 2df tests. Then we would use the following commands:

gcontrasts(g1) <- "genotype" gcontrasts(g3) <- "genotype" logit(case ~ g1 + g3) anova(logit(case ~ g1 + g3))

The effect of adding g3 to a model that already includes g1 is seen to be 5.24 on 2df. The p value for this is found by

1-pchisq(5.24,2)

For a 1df test of g3, given 2df at g1, we would use:

gcontrasts(g1) <- "genotype" gcontrasts(g3) <- "additive" logit(case ~ g1 + g3) anova(logit(case ~ g1 + g3)) 1-pchisq(5.07,1)

2. Analysis of interaction/epistasis

We shall now use logistic regression in R to test for epistatic interactions between locus 3 and another unlinked locus (locus 5). An epistatic interaction means that the combined effect of locus 3 and 5 is greater than the product (on the odds scale) or the sum (on the log odds scale) of the locus 3 and locus 5 individual effects.

First get rid of the data in the memory and read in the new data. This data is the same as the original pedfile, but with an additional column giving genotype at (unlinked) locus 5:

detach(casecon) newcasecon <- read.table("newcasecondata.txt", header=T) attach(newcasecon)

You can look at the data by typing

fix(newcasecon)

Next create appropriate genotype and case variables:

case <- affected-1

g3 <- genotype(loc3_1, loc3_2) g5 <- genotype(loc5_1, loc5_2)

The individual effects at locus 3 and 5 are now coded by the variables g3 and g5. We can test for association at each locus separately:

gcontrasts(g3) <- "genotype" logit (case ~ g3) anova(logit (case ~ g3))
gcontrasts(g5) <- "genotype" logit (case ~ g5) anova(logit (case ~ g5))

In order to investigate epistasis, it is more convenient to create new variables that code numerically for the number of copies of allele 2 in each genotypes

count3<-allele.count(g3,2) count5<-allele.count(g5,2)

Check you understand how variables count3 and count5 relate to g3 and g5 by typing

count3 count5

We then create a variable that codes for the combined effect of locus 3 and 5 as follows:

combo<-10*count3+count5

Check you understand how the 'combo' variable relates to g3 and g5 by typing

combo

Now we need to code each of these variables as 'factors' which means we simply consider the numeric codes to act as labels for the different categories rather than having numeric meaning:

fact3<-factor(count3) fact5<-factor(count5) factcombo<-factor(combo)

Check that the analysis with the 'factors' gives the same results as you found previously with the genotype variables:

anova(logit (case ~ fact3)) anova(logit (case ~ fact5))
Now test whether there is significant epistasis by typing

anova(logit(case ~ fact3 + fact5 + factcombo)) 1-pchisq(9.59,4)

This first fits the individual locus factors, and then adds in the extra effect of looking at the model with epistasis included (i.e. a model with 9 estimated parameters corresponding to the 9 genotype combinations), and tests the difference between the models. You should get a chi-squared of 9.59 on 4 df with p value 0.048 i.e. there is marginal evidence of epistasis.

The above test is valid for testing for epistasis between linked or unlinked loci, although it does not allow for haplotype (phase) effects between linked loci. A more powerful test for epistasis between UNLINKED LOCI ONLY is to use 'case-only' analysis and test whether the genotypes at one locus predict those at the other, in the cases alone. This is only valid at unlinked loci, because at linked loci we expect genotypes at one locus to predict those at the other (even in controls) due to linkage disequilibrium.

To do this, we can use a chi squared test to look for correlation (association) between the loci within the case and control groups separately.

First we need to set up 2 new vectors of genotypes for loci 3 and 5, using only the cases. To do this, we can take advantage of the fact that the data has been ordered in such a way that cases are the first 384 observations. (Check this by typing case or fix(newcasecon) ). So we can create genotype vectors just for the cases using the following commands

caseg3<-g3[1:384] caseg5<-g5[1:384]

Take a look at the vectors you have created by typing

caseg3 caseg5

Now do a chi-squared test on the genotype variables to see if they are correlated with each other:

table(caseg3,caseg5) chisq.test(caseg3,caseg5)

You should find much more significant evidence of epistasis (p value 0.0018) than you did using logistic regression. This is not surprising as the case-only test of interaction is a more powerful test. However, the case-only test does rely on the assumption that the two genotype variables g3 and g5 are uncorrelated in the general population. Strictly speaking, we cannot test this assumption as we do not have a population-based control sample (our controls are all unaffected). However, if the disease is rare, our controls should be reasonably close to an unselected sample. So we can use them to see if the genotype variables g3 and g5 are uncorrelated in the control population:

contg3<-g3[385:1056] contg5<-g5[385:1056]

contg3 contg5

table(contg3,contg5) chisq.test(contg3,contg5)

You should find a non-significant p value (p=0.99). This suggests that the case-only analysis we did is valid, so there is indeed some reasonable (p=0.002) evidence for statistical interaction between these loci.

Comments

Advantages/disadvantages

Analysis in a standard statistical package has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed e.g. it easy to include additional predictor variables such as environmetal factors, gene-environment interactions etc. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.

The analyses described did not make use of any haplotype data. Possible programs for performing haplotype analysis are described below. However, the logistic regression approach does allow analysis of multiple linked loci (provided phase information is not considered to be important).

Study design issues

With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.

Other packages

Logistic regression analysis can be performed in other statistical packages such as Stata, SAS, SPSS, GLIM. Chi-squared table analysis can also be performed in other statistical packages and in some spreadsheets such as Excel.

For haplotype analysis one can use several other packages such as the HAPLO.STATS add-in for R by Dan Schaid, the UNPHASED program by Frank Dudbridge, the WHAP program by Shaun Purcell, the SIMHAP program from Lyle Palmer, or the HAPIPF add-in for Stata by Adrian Mander. Alternatively, one can use the SNPHAP program by David Clayton to estimate haplotypes for several multiply-imputed datasets, and then analyse these using add-in multiple imputation functions for Stata or R.

References

Breslow N and Day N (1980) Statistical methods in cancer research: The analysis of case-control studies. International Agency for Research on Cancer.

Breslow N and Day N (1993) Statistical methods in cancer research: The design and analysis of cohort studies. International Agency for Research on Cancer.

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70: 124-141.

Epstein M, Satten G (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics 73: 1316-1329.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 70:425-34.

Piegorsch WW, Weinberg CR, Taylor JA (1994) Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in Medicine 13:153-162.

Schaid DJ (2004) Evaluating associations of haplotypes with traits. Genetic Epidemiology 27:348-64.

Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC (2003) Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered. 55:179-90.

Weinberg CR and Umbach DM (2003) Choosing a retrospective design to assess joint genetic and environmental contributions to risk. American Journal of Epidemiology 152:197-203.