Computer Practical Exercises on Multilocus Association using R

Purpose

In these exercises you will be carrying out TDT tests for family data and chi-squared tests and logistic regression analysis of case/control data, to test for genetic association.

Introduction

The methodology used is standard methodology for case/control analysis (chi-squared tests and logistic regression) and family-based analysis (TDT) as described in the lectures. The tests will performed in the statistical analysis package R. As well as using standard functions implemented in R, we will also make use of special functions designed for genetic analysis that can be downloaded as add-in routines to the R package.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying if you are not familiar with using statistical packages such as R. Don't worry too much if you don't understand all the details! If you decide to regularly use a statistical package such as R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

Program documentation

R documentation:

The R website is at http://www.r-project.org/

David Clayton's add-in routines (the dgc.genetics package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software

From within R, one can obtain help on any command xxxx by typing `help(xxxx)'

Exercise 1: Family Data

Data overview

We will start by re-analysing the family data that you analysed in UNPHASED i.e. family data consisting of a number of trio families with an affected diabetic child plus parents (of unknown disease status) all of whom are typed at 5 polymorphisms in the HLA region.

Before we start, take a look at the data file that we shall use:

fivelociR.txt

This differs from the pedigree file we used in the UNPHASED analysis by the addition of a header line describing what each of the variables are, and by the changing of any "0"s to R's own missing value code "NA". Note that the names given to the first 6 variables in the header line are not optional: they must be called

pedigree id id.father id.mother sex affected

for the add-in utilities for the R package to work correctly.

To start up the R package, click on the R icon. Then, under the File menu, change directory so that R will look in the correct directory (the folder where you saved the files) for any input files you specify.

You are now working within the R package. To start with, you need to read in the necessary add-in libraries (which we have downloaded ready for you):

library(dgc.genetics)

To read your data into a dataframe called "family", type

family <- read.table("fivelociR.txt", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe "family" and you can look at it simply by typing

family

or by typing

fix(family)

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

family$pedigree

It can be inconvenient to have to type "family$" before typing every variable, so you can tell R to automatically look at variables in the casecon dataframe by typing:

attach(family)

Now you can just look at the column of pedigree names by typing

pedigree

To perform association analysis, we need to convert the variables corresponding to the two alleles at each locus into a genotype variable for each locus. This can be done e.g. for locus 1 by:

g1 <- genotype(loc1_1, loc1_2)

To look at the variable you have just created, type:

g1

To perform a TDT analysis on this locus, type

tdt(g1)

As with the tdt function in Stata, this gives a global multiallelic test (chi-squared test = 278.1187 on 22 df, P-value = <2e-16) as well as individual tests for each allele versus all others.

Repeat the analysis for loci 2-5:

g2 <- genotype(loc2_1, loc2_2) tdt(g2) g3 <- genotype(loc3_1, loc3_2) tdt(g3) g4 <- genotype(loc4_1, loc4_2) tdt(g4) g5 <- genotype(loc5_1, loc5_2) tdt(g5)

Do you see the same pattern of significance as you saw in the single-locus analyses using UNPHASED?

You can use R to perform similar analyses to the analyses you did in UNPHASED, adding loci to the model in a stepwise manner. Rather than doing this for the family data, we will instead move on to analysing a case/control data set. So for now, you are ready to finish with R.

Type

q() n

to get out.

Exercise 2: Case/Control Data

Data overview

The data consists of genotype data at 4 linked SNP loci and another unlinked locus, typed in 384 type 1 diabetes cases and 672 controls.

Appropriate data

Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls). Ideally a larger data set should be used than we consider here e.g. 1000 cases and 1000 controls might be considered a reasonable number.

Instructions

Data format

The data for the 4 linked loci casecondata.txt is in a similar format to a standard linkage pedigree file, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. The pedigree file used for the analysis in R differs from a standard pedigree file in that it has a header line describing the different columns, and it uses R's own missing value code "NA". Note that since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown.

The file newcasecondata.txt is an R format file that is identical to casecondata.txt except that it includes genotypes for an additional 5th unlinked SNP locus.

Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to the appropriate directory (folder) on your computer.

Step-by-step instructions

1. Chi-squared and logistic regression analysis

To start up the R package, click on the R icon. Then, under the File menu, change directory so that R will look in the correct directory (the folder where you saved the files) for any input files you specify.

You are now working within the R package. To start with, you need to read in the necessary add-in libraries (which we have downloaded ready for you):

library(dgc.genetics)

To read in your data, type

casecon <- read.table("casecondata.txt", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe 'casecon' and you can look at it simply by typing

casecon

or by typing

fix(casecon)

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

casecon$pedigree

It can be inconvenient to have to type 'casecon$' before typing every variable, so you can tell R to automatically look at variables in the casecon dataframe by typing:

attach(casecon)

Now you can just look at the column of pedigree names by typing

pedigree

However, you should beware when using the 'attach' command as if you already have a different variable in R called 'pedigree', R will use this variable rather than the one in the attached dataframe - without telling you that there may be some ambiguity. For this reason it is often convenient to clear all variables and detach all dataframes before starting your session. This can be done (although don't do it now) by use of the clear() command.

To perform the analysis we will first generate a variable called 'case' that codes affected/unaffected individuals as 1/0 rather than 2/1 (as they are coded in the 'affected' variable):

case <- affected-1

To look at the new case variable, type

case

Now we will generate genotype variables from the two alleles at each locus:

g1 <- genotype(loc1_1, loc1_2) g2 <- genotype(loc2_1, loc2_2) g3 <- genotype(loc3_1, loc3_2) g4 <- genotype(loc4_1, loc4_2)

Take a look at one of these e.g. by typing

g1

To perform chi-squared tests on the genotype variables (to see if they are associated with disease status), type:

table(g1,case) chisq.test(g1,case) table(g2,case) chisq.test(g2,case) table(g3,case) chisq.test(g3,case) table(g4,case) chisq.test(g4,case)

All loci should show significant association, with C and D (locus 3 and 4) showing the highest significance.

To perfom a chromosome-based analysis, you can use the commands:

allele.table(g1,case) chisq.test(allele.table(g1,case)) allele.table(g2,case) chisq.test(allele.table(g2,case)) allele.table(g3,case) chisq.test(allele.table(g3,case)) allele.table(g4,case) chisq.test(allele.table(g4,case))

To perform the logistic regression analysis, we need to set the variables up for either a 2df (genotype-based) analysis or a 1df (chromosome-based) analysis. For the genotype-based analysis, we use what are called 'genotype' contrasts, and for the chromosome-based analysis we use 'additive' (on the log odds scale - equivalent to multiplicative on the odds scale) contrasts. So for the 2df test at the first locus we use the following commands:

gcontrasts(g1) <- "genotype" logit (case ~ g1) anova(logit (case ~ g1))

The 'logit' command provides parameter estimates (odds ratios) and p values for each regression coefficient, while the 'anova' command gives an overall test in the form of a 'deviance' and a 'df'. In this case we find a deviance of 18.49 on 2df. To find a p value for this, we compare it to a chi-squared distribution by typing:

1-pchisq(18.49,2)

For a 1df test at the first locus, we use the following commands:

gcontrasts(g1) <- "additive" logit (case ~ g1) anova(logit (case ~ g1)) 1-pchisq(18.47,1)

You should be able to work out how to do a similar analysis on either 1df or 2df for the other loci. You should find that the pattern of significance is the same as you found when doing the chi-squared tests on the genotype or allele tables.

You can also perform a stepwise analysis, where you add or discard terms from the model. Suppose we wish to see whether locus 3 is important once we have already included locus 1, using 2df tests. Then we would use the following commands:

gcontrasts(g1) <- "genotype" gcontrasts(g3) <- "genotype" logit(case ~ g1 + g3) anova(logit(case ~ g1 + g3))

The effect of adding g3 to a model that already includes g1 is seen to be 5.24 on 2df. The p value for this is found by

1-pchisq(5.24,2)

For a 1df test of g3, given 2df at g1, we would use:

gcontrasts(g1) <- "genotype" gcontrasts(g3) <- "additive" logit(case ~ g1 + g3) anova(logit(case ~ g1 + g3)) 1-pchisq(5.07,1)

2. Analysis of interaction/epistasis

We shall now use logistic regression in R to test for epistatic interactions between locus 3 and another unlinked locus (locus 5). An epistatic interaction means that the combined effect of locus 3 and 5 is greater than the product (on the odds scale) or the sum (on the log odds scale) of the locus 3 and locus 5 individual effects.

First get rid of the data in the memory and read in the new data. This data is the same as the original pedfile, but with an additional column giving genotype at (unlinked) locus 5:

detach(casecon) newcasecon <- read.table("newcasecondata.txt", header=T) attach(newcasecon)

You can look at the data by typing

fix(newcasecon)

Next create appropriate genotype and case variables:

case <- affected-1

g3 <- genotype(loc3_1, loc3_2) g5 <- genotype(loc5_1, loc5_2)

The individual effects at locus 3 and 5 are now coded by the variables g3 and g5. We can test for association at each locus separately:

gcontrasts(g3) <- "genotype" logit (case ~ g3) anova(logit (case ~ g3))
gcontrasts(g5) <- "genotype" logit (case ~ g5) anova(logit (case ~ g5))

In order to investigate epistasis, it is more convenient to create new variables that code numerically for the number of copies of allele 2 in each genotypes

count3<-allele.count(g3,2) count5<-allele.count(g5,2)

Check you understand how variables count3 and count5 relate to g3 and g5 by typing

count3 count5

We then create a variable that codes for the combined effect of locus 3 and 5 as follows:

combo<-10*count3+count5

Check you understand how the 'combo' variable relates to g3 and g5 by typing

combo

Now we need to code each of these variables as 'factors' which means we simply consider the numeric codes to act as labels for the different categories rather than having numeric meaning:

fact3<-factor(count3) fact5<-factor(count5) factcombo<-factor(combo)

Check that the analysis with the 'factors' gives the same results as you found previously with the genotype variables:

anova(logit (case ~ fact3)) anova(logit (case ~ fact5))
Now test whether there is significant epistasis by typing

anova(logit(case ~ fact3 + fact5 +factcombo)) 1-pchisq(9.59,4)

This first fits the individual locus factors, and then adds in the extra effect of looking at the model with epistasis included (i.e. a model with 9 estimated parameters corresponding to the 9 genotype combinations), and tests the difference between the models. You should get a chi-squared of 9.59 on 4 df with p value 0.048 i.e. there is marginal evidence of epistasis.

The above test is valid for testing for epistasis between linked or unlinked loci, although it does not allow for haplotype (phase) effects between linked loci. A more powerful test for epistasis between unlinked loci only is to use 'case-only' analysis and test whether the genotypes at one locus predict those at the other, in the cases alone. This is only valid at unlinked loci, because at linked loci we expect genotypes at one locus to predict those at the other (even in controls) due to linkage disequilibrium.

To do this, we could use a chi squared test to look for correlation/association between the loci within the case and control groups separately. However, there is probably not time to do this today.

Comments

Advantages/disadvantages

Analysis in a standard statistical package has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed e.g. it easy to include additional predictor variables such as environmetal factors, gene-environment interactions etc. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.

The analyses described did not make use of any haplotype data. Possible programs for performing haplotype analysis are described below. However, the logistic regression approach does allow analysis of multiple linked loci (provided phase information is not considered to be important).

Study design issues

With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.

Other packages

TDT analysis can be performed in a variety of other packages. A form of TDT analysis of haplotypes is performed by the TRANSMIT program by David Clayton. The only other package that performs a form of case/pseudocontrol analysis is the UNPHASED program by Frank Dudbridge. For testing (but not estimation) of genotype or haplotype association effects in families, one can use the PDT or FBAT or PBAT programs.

Logistic regression analysis can be performed in other statistical packages such as Stata, SAS, SPSS, GLIM. Chi-squared table analysis can also be performed in other statistical packages and in some spreadsheets such as Excel.

For haplotype analysis one can use several other packages such as the HAPLO.STATS add-in for R by Dan Schaid, the UNPHASED program by Frank Dudbridge, the WHAP program by Shaun Purcell, the SIMHAP program from Lyle Palmer, or the HAPIPF add-in for Stata by Adrian Mander. Alternatively, one can use the SNPHAP program by David Clayton to estimate haplotypes for several multiply-imputed datasets, and then analyse these using add-in multiple imputation functions for Stata or R.

References

Breslow N and Day N (1980) Statistical methods in cancer research: The analysis of case-control studies. International Agency for Research on Cancer.

Breslow N and Day N (1993) Statistical methods in cancer research: The design and analysis of cohort studies. International Agency for Research on Cancer.

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70: 124-141.

Dudbridge F (2003) Pedigree disequilibrium tests for multilocus haplotypes. Genet Epidemiol 25:115-21.

Epstein M, Satten G (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics 73: 1316-1329.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 70:425-34.

Schaid DJ (2004) Evaluating associations of haplotypes with traits. Genetic Epidemiology 27:348-64.

Spielman RS, McGinnis RE, Ewens WJ. 1993. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM) Am J Hum Genet 52:455-466.

Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC (2003) Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered. 55:179-90.