Computer Practical Exercises on Multilocus Association
European School of Genetic Medicine
May 3rd-5th, 2006
Bertinoro di Romagna, Italy

Purpose

In these exercises you will be carrying out chi-squared tests, logistic regression and conditional logistic regression analysis of case/control and family data to test for genetic association.

Introduction

The methodology used is standard methodology for case/control analysis (chi-squared tests and logistic regression) and family-based analysis (TDT and conditional logistic regression) as described in today's lecture. The tests will performed in two statistical analysis packages, Stata and R. As well as using standard functions implemented in these packages, we will also make use of special functions designed for genetic analysis that can be downloaded as add-in routines to the Stata and R packages.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying if you are not familiar with using statistical packages such as Stata or R. Don't worry too much if you don't understand all the details! If you decide to regularly use a statistical package such as Stata or R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

Exercise 1: Case/Control Data

Data overview

The data consists of genotype data at 4 linked SNP loci and another unlinked locus, typed in 384 type 1 diabetes cases and 672 controls.

Appropriate data

Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls). Ideally a larger data set should be used than we consider here e.g. 1000 cases and 1000 controls might be considered a reasonable number.

Instructions

Data format

The data for the 4 linked loci casecondata.ped is in standard linkage pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Note that `0' indicates unknown data or genotypes: since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown. Each column must be separated by a tab in order to be read correctly into Stata.

The pedigree file casecondata.Rped used for the analysis in R differs from the pedigree file used for the analysis in Stata. It has a header line describing the different columns, and it uses R's own missing value code "NA".

The file newcasecondata.ped is a Stata format file that is identical to casecondata.ped except that it includes genotypes for an additional 5th unlinked SNP locus.

Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to an appropriate directory on your computer e.g. to C:DATA

Data files

casecondata.ped
casecondata.Rped
newcasecondata.ped

These should all be saved as .txt files to C:DATA.

Program documentation

STATA documentation:

The Stata website is at http://www.stata.com/

Useful guides to Stata can be found at http://www.ats.ucla.edu/stat/stata/ and http://www.whoishostingthis.com/resources/stata/

David Clayton's add-in routines (the genassoc package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software/stata/README.txt

From within Stata, one can obtain help on any command xxxx by typing `help xxxx'

R documentation:

The R website is at http://www.r-project.org/

David Clayton's add-in routines (the dgc.genetics package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software

From within R, one can obtain help on any command xxxx by typing `help(xxxx)'

Step-by-step instructions

1. Chi-squared table analysis in Stata

Make sure you have a copy of the pedigree files in the appropriate directory. Open up a Stata window by clicking on the Stata icon. For those not familiar with Stata, you should find that a large window with 4 separate sub-windows appears. You type commands in the bottom right hand window, and any results will be displayed in the top right hand window. The top left window provides a review of all the commands you typed, and the bottom left window displays all the variables that you have currently loaded. A convenient way to examine the variables is to click on the data browser button at the top.

To keep a log of your session in the file 'casecon.log', you can type in the bottom right hand (white) Stata window:

log using casecon, replace text

Read the data into Stata by typing:

ginsheet using casecondata.ped, preped zmiss

The ginsheet command reads in data from a text file which in this case has been prepared for you earlier in the file 'casecondata.ped'. If Windows has renamed your file (e.g. as 'casecondata_ped.txt') you will have to use the correct new name. The preped option tells Stata that the file is in a standard pre-makeped pedigree file format and the zmiss tells Stata that missing values are coded in the file as 0 (whereas Stata will recode them as '.')

You should find that the names of 14 variables appear in the bottom left window. These correspond to the usual variables in a standard pedigree file with alleles at locus 1 denoted L1_1 and L1_2, alleles at locus 2 denoted L2_1 and L2_2 etc. etc.

To examine the variables you just read in you can click on the data browser button (3rd from right at the top). You might notice that missing genotypes have been recoded from `0' to Stata's internal missing value code '.'

Alternatively, to look at any of the variables you just read in you can use the command list e.g. you could type:

list family affected L1_1 L1_2 L2_1 L2_2

This will show you the family ID, affection status and alleles at locus 1 and 2. Use the spacebar to continue listing, or type q to stop.

To see documentation for the ginsheet command, type:

help ginsheet

In order to perform the analysis, Stata will require you to have case/control status coded as 1 or 0 rather than 2 or 1. So we generate a new variable called 'case' that has this required coding by typing:

gen case=affected-1

Use the list command or the data browser button to check that your 'case' variable is correctly coded.

At the moment, the genotype data is in the form of two alleles for each locus. E.g. 'L1_1' and 'L1_2' are the two variables coding the two alleles at locus 1. Since we will be modelling genotype as a predictor of disease status, we need to create a genotype variable for each locus from the two alleles at each locus. We can do this in several ways. The first possibility is to type:

egen A=gtype(L1_1 L1_2)

This creates a genotype variable named 'A' which contains the genotype corresponding to the alleles contained in the variables L1_1 and L1_2. This is the most general way to create a genotype variable, and the required way for multiallelic loci.

Since we have diallelic loci, another possibility is to create variables that correspond to the number of times allele 1 and 2 appear in the genotype (the variables are automatically named A1 and A2) by typing:

gtab L1_1 L1_2, gen(A)

Having created these variables, check that they are correct (and that you understand how they correspond to genotype) by typing:

list family L1_1 L1_2 A A1 A2

To carry out a chi-squared test on 2df for genotype at locus 1, using the A variable, type:

tabulate A case, chi2

Check that you get the same using the A2 variable:

tabulate A2 case, chi2

And using the A1 variable:

tabulate A1 case, chi2

Suppose we wished to perform the chi-squared test on 1df, assuming a dominant mode of action of the '2' allele. We need to generate a new variable that indicates presence of either one or two copies of the '2' allele. Do this by typing:

gen A2ind = A2>0 replace A2ind=. if A2==.

and check you understand how the new variable 'A2ind' relates to the old variables by typing

list A2ind A2 A

To carry out a chi-squared test on the A2ind variable, type

tabulate A2ind case, chi2

and check that the counts match up with what you found for the genotype table:

tabulate A case, chi2

To carry out a chi-squared test on 1df on chromosomes rather than genotypes, we need to tell Stata to count twice when genotype is 2/2, and once when genotype is 1/2. It is easiest to do this by inputting the table manually:

tabi 715 496 \ 585 272, chi2

Check that you understand how these counts correspond to the previous genotype table.

You should have found that all the tests gave evidence of a significant association at the first locus. Repeat the genotype tests for each of the other loci by generating appropriate genotype variables and analysing them:

gtab L2_1 L2_2, gen(B) gtab L3_1 L3_2, gen(C) gtab L4_1 L4_2, gen(D) tabulate B2 case, chi2 tabulate C2 case, chi2 tabulate D2 case, chi2

All loci should show significant association, with C and D (locus 3 and 4) showing the highest significance.

2. Logistic regression analysis in Stata

To carry out logistic regression, we will use the 'case' variable as a response variable i.e. a binary 0/1 indicator of disease. We will use the genotype variables A2, B2, C2 and D2 as predictor variables.

To carry out the analysis for locus A, assuming a multiplicative model for the effect of alleles (i.e. assuming that two copies of allele 2 has twice the effect of one copy, on the log odds scale), type:

logit case A2

The significance test is given at the top right: you should obtain a chi-squared on 1 df of 18.47 with a p value of 0.0000 (i.e. <0.00005). This is the significance test for comparing the model where A is in the regression equation compared to a model where it is not, i.e. comparing models

ln(p/(1-p))=beta_1 * A2+beta_0

and

ln(p/(1-p))=beta_0

Check that we get exactly the same significance test when we type either

logit case A1

or

logit case A

The values of the regression coefficients beta_1 and beta_0 given in the table (denoted A and _cons in the Stata results window) will differ as the coding schemes for genotype corresponding to variables A, A1 and A2 differ. However the significance tests should be the same.

So far we have assumed a multiplicative model with no dominance i.e. that having two copies of allele 2 corresponds to twice the effect of having one copy. We can avoid this assumption by making Stata fit a model allowing for two genotype variables per genotype, which corresponds to comparing models

ln(p/(1-p))=beta_1 * x1+beta_2 * x2+beta_0

and

ln(p/(1-p))=beta_0

where x1 and x2 are related to the genotype variable A in an appropriate way. To do this type

xi:logit case i.A xi:logit case i.A1 xi:logit case i.A2

The xi: and i. part of these commands basically simply tells Stata to consider the predictor variable as a categorical variable and to automatically create temporary appropriate non-categorical variables x1 and x2, which will appear and be called something like _IA1_1 and _IA1_2 or _IA_2 and _IA_3 in the Stata variables window.

These commands should each give a significance test of a chi-squared on 2 df of 18.49 with a p value of 0.0001. Again you should find that using variables A, A1 and A2 are basically equivalent, except that the difference in coding schemes results in different estimated regression coefficients. Since the tests are equivalent, from now on we will do everything simply with respect to A2.

To compare the dominance and no dominance models, we need to use the following sequence of commands:

xi:logit case i.A2 est store full logit case A2 lrtest full

This fits the first model with dominance included, saves the results, then fits the second model with no dominance and tests the difference between the models. The test to compare the two models that you just fitted is found at the bottom right hand of the Stata results window. You should get a chi-squared of 0.02 on 1 df with p value 0.8767 i.e. there is no significant evidence of dominance: the multiplicative model is adequate.

It is often more convenient to output the coefficients in terms of odds ratios with respect to an arbitrarily chosen reference genotype. We do this using the odds ratio or 'or' option for the logit command:

logit case A2, or xi:logit case i.A2, or

This presents the results in terms of an odds ratio with respect to the '1' allele or the '1/1' genotype. We can do the same single-locus analysis for loci 2, 3 and 4 as we have just done for locus 1. To save time, for now we will just perform the 2df tests:

xi:logit case i.B2, or xi:logit case i.C2, or xi:logit case i.D2, or

You should find that the pattern of significance is the same as you found when doing the chi-squared tests on the genotype tables.

So far we have just analysed each locus individually. Now we will perform a stepwise analysis, where we enter or discard variables in turn from the regression equation

We start with a forwards stepwise procedure. Since locus 3 was the most significant predictor of disease status, it makes sense to enter this first into the regression equation, and then see if any other loci are significant given the effect at locus 3. This can be done using nested likelihood ratio tests as we did when testing for dominance, but it turns out to be more convenient to use a slightly different approach that simply discards terms from the regression equation. So to see if locus 1 is important given the effect of locus 3, we fit a model in which both locus 1 and locus 3 are included, and then discard any terms for locus 1:

xi:logit case i.C2 i.A2, or testparm *A2*

This gives us a 2df test for locus 1 given locus 3 (chisquared = 0.55, p value = 0.76). A 1df test (assuming no dominance at locus 1) can be performed by:

xi:logit case i.C2 A2, or testparm A2

We can perform similar tests for locus 2 and locus 4:

xi:logit case i.C2 i.B2, or testparm *B2* xi:logit case i.C2 i.D2, or testparm *D2* xi:logit case i.C2 B2, or testparm B2 xi:logit case i.C2 D2, or testparm D2

You should find that no loci are significant i.e. the first model (that includes genotypes at both loci) is never significantly better than the model that just includes locus 3. So locus 3 is sufficient to account for whether or not someone is diseased. We therefore stop the stepwise analysis.

If another locus, e.g. locus 1, had turned out to be significant, we would enter it into the regression equation along with locus 3 and see if any other loci are significant. E.g. to test if locus 2 is significant once one has accounted for loci 1 and 3, on either 2df or 1df, one would type:

xi:logit case i.C2 i.A2 i.B2, or testparm *B2* xi:logit case i.C2 i.A2 B2, or testparm B2

We can also use a backwards stepwise procedure, where we start by including all loci in the model, and then delete loci that are not significant. However, in the interests of time, we will not carry out this analysis today.

3. Analysis of interaction/epistasis in Stata

We shall now use logistic regression in Stata to test for epistatic interactions between locus 3 and another unlinked locus (locus 5). An epistatic interaction means that the combined effect of locus 3 and 5 is greater than the product (on the odds scale) or the sum (on the log odds scale) of the locus 3 and locus 5 individual effects.

First get rid of the data in memory and read in the new data. This data is the same as the original pedfile, but with an additional column giving genotype at (unlinked) locus 5:

clear
ginsheet using newcasecondata.ped, preped zmiss

Next create appropriate genotype and case variables:

gtab L3_1 L3_2, gen(C) gtab L5_1 L5_2, gen(E) gen case=affected-1

The individual effects at locus 3 and 5 are now coded by the variables C2 and E2. We can test for association at each locus separately:

xi: logit case i.C2 xi: logit case i.E2

We then create a variable that codes for the combined effect of locus 3 and 4 as follows:

gen combo=10*C2+E2

Check you understand how the 'combo' variable relates to C2 and E2 by typing

list combo C2 E2

Now test whether there is significant epistasis by typing

xi:logit case i.combo est store full xi: logit case i.C2 i.E2 lrtest full

This fits the first model with epistasis included (i.e. a model with 9 estimated parameters corresponding to the 9 genotype combinations), saves the results, then fits the second model with no epistasis and tests the difference between the models. The test to compare the two models that you just fitted is found at the bottom right hand of the Stata results window. You should get a chi-squared of 9.59 on 4 df with p value 0.048 i.e. there is marginal evidence of epistasis.

The above test is valid for testing for epistasis between linked or unlinked loci, although it does not allow for haplotype (phase) effects between linked loci. A more powerful test for epistasis between unlinked loci only is to use 'case-only' analysis and test whether the genotypes at one locus predict those at the other, in the cases alone. This is only valid at unlinked loci, because at linked loci we expect genotypes at one locus to predict those at the other (even in controls) due to linkage disequilibrium.

To do this, we can use a chi squared test to look for correlation/association between the loci within the case and control groups separately:

bysort case: tabulate C2 E2, chi2

You should find that there is no association between the loci in the control group, but there is significant association between the loci in the case group, indicating epistasis.

We have now finished the Stata analysis. Type

log close

to close the log file in which there will be a record of all the commands you typed and all the results.

Then type

clear exit

to get out of Stata and close the session.

4. Analysis in R

We shall now repeat some of the analyses that we performed in Stata in the statistical package R. The R package is somewhat less user-friendly than Stata (particularly with regards to the output), but has the advantage of being free and more widely available, partly because a lot of bioinformatics software utilities have been developed as add-in functions for the R package.

Before we start, take a look at the data file that we shall use:

casecondata.Rped

This differs from the pedigree file we used in the Stata analysis by the addition of a header line describing what each of the variables are, and by the changing of any '0's to R's own missing value code 'NA'.

To start up the R package, click on the R icon. Then, under the File menu, change directory (Cambia directory) so that R will look for any input files you specify in the C:DATA directory.

You are now working within the R package. To start with, you need to read in the necessary add-in libraries (which we have downloaded ready for you):

library(dgc.genetics)

To read in your data, type

casecon <- read.table("casecondata.Rped", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe 'casecon' and you can look at it simply by typing

casecon

or by typing

fix(casecon)

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

casecon$pedigree

It can be inconvenient to have to type 'casecon$' before typing every variable, so you can tell R to automatically look at variables in the casecon dataframe by typing:

attach(casecon)

Now you can just look at the column of pedigree names by typing

pedigree

However, you should beware when using the 'attach' command as if you already have a different variable in R called 'pedigree', R will use this variable rather than the one in the attached dataframe - without telling you that there may be some ambiguity. For this reason it is often convenient to clear all variables and detach all dataframes before starting your session. This can be done (although don't do it now) by use of the clear() command.

Just like in Stata, we will generate a variable called 'case' that codes affected/unaffected individuals as 1/0 rather than 2/1 (as they are coded in the 'affected' variable):

case <- affected-1

To look at the new case variable, type

case

Now we will generate genotype variables from the two alleles at each locus:

g1 <- genotype(loc1_1, loc1_2) g2 <- genotype(loc2_1, loc2_2) g3 <- genotype(loc3_1, loc3_2) g4 <- genotype(loc4_1, loc4_2)

Take a look at one of these e.g. by typing

g1

To perform chi-squared tests on the genotype variables (with disease status), type:

table(g1,case) chisq.test(g1,case) table(g2,case) chisq.test(g2,case) table(g3,case) chisq.test(g3,case) table(g4,case) chisq.test(g4,case)

These should give you results that correspond very closely to what you found in the Stata analysis of the same data.

To perfom a chromosome-based analysis, you can use the commands:

allele.table(g1,case) chisq.test(allele.table(g1,case)) allele.table(g2,case) chisq.test(allele.table(g2,case)) allele.table(g3,case) chisq.test(allele.table(g3,case)) allele.table(g4,case) chisq.test(allele.table(g4,case))

To perform the logistic regression analysis, we need to set the variables up for either a 2df (genotype-based) analysis or a 1df (chromosome-based) analysis. For the genotype-based analysis, we use what are called 'genotype' contrasts, and for the chromosome-based analysis we use 'additive' (on the log odds scale - equivalent to multiplicative on the odds scale) contrasts. So for the 2df test at the first locus we use the following commands:

gcontrasts(g1) <- "genotype" logit (case ~ g1) anova(logit (case ~ g1))

The 'logit' command provides parameter estimates (odds ratios) and p values for each regression coefficient, while the 'anova' command gives an overall test in the form of a 'deviance' and a 'df'. In this case we find a deviance of 18.49 on 2df. To find a p value for this, we compare it to a chi-squared distribution by typing:

1-pchisq(18.49,2)

For a 1df test at the first locus, we use the following commands:

gcontrasts(g1) <- "additive" logit (case ~ g1) anova(logit (case ~ g1)) 1-pchisq(18.47,1)

You should be able to work out how to do a similar analysis on either 1df or 2df for the other loci. There is probably not time to do this today, but if you do, you should again find you get very similar results to the logistic regression analysis in Stata.

You can also perform a stepwise analysis, where you add or discard terms from the model. Suppose we wish to see whether locus 3 is important once we have already included locus 1, using 2df tests. Then we would use the following commands:

gcontrasts(g1) <- "genotype" gcontrasts(g3) <- "genotype" logit(case ~ g1 + g3) anova(logit(case ~ g1 + g3))

The effect of adding g3 to a model that already includes g1 is seen to be 5.24 on 2df. The p value for this is found by

1-pchisq(5.24,2)

For a 1df test of g3, given 2df at g1, we would use:

gcontrasts(g1) <- "genotype" gcontrasts(g3) <- "additive" logit(case ~ g1 + g3) anova(logit(case ~ g1 + g3)) 1-pchisq(5.07,1)

All the other stepwise tests that we performed in Stata can also be performed in a similar manner in R. If you have time, try some of them out.

Once you are ready to finish with R, type

q() n

to get out.

Comments

Advantages/disadvantages

Analysis in a standard statistical package has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed e.g. it easy to include additional predictor variables such as environmetal factors, gene-environment interactions etc. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.

The analyses described did not make use of any haplotype data. Possible programs for performing haplotype analysis are described below and one of these (the Unphased program) will be used in tomorrow's practicals. However, the logistic regression approach does allow analysis of multiple linked loci (provided phase information is not considered to be important).

Study design issues

With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.

Other packages

Logistic regression analysis can be performed in other statistical packages such as SAS, SPSS, GLIM. Chi-squared table analysis can also be performed in other statistical packages and in some spreadsheets such as Excel.

For haplotype analysis one can use several other packages such as the HAPLO.STATS add-in for R by Dan Schaid, the UNPHASED program by Frank Dudbridge, the WHAP program by Shaun Purcell, the SIMHAP program from Lyle Palmer, or the HAPIPF add-in for Stata by Adrian Mander. Alternatively, one can use the SNPHAP program by David Clayton to estimate haplotypes for several multiply-imputed datasets, and then analyse these using add-in multiple imputation functions for Stata or R.

References

Breslow N and Day N (1980) Statistical methods in cancer research: The analysis of case-control studies. International Agency for Research on Cancer.

Breslow N and Day N (1993) Statistical methods in cancer research: The design and analysis of cohort studies. International Agency for Research on Cancer.

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70: 124-141.

Epstein M, Satten G (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics 73: 1316-1329.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 70:425-34.

Schaid DJ (2004) Evaluating associations of haplotypes with traits. Genetic Epidemiology 27:348-64.

Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC (2003) Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered. 55:179-90.

Computer Practical Exercises on Multilocus Association European School of Genetic Medicine May 3rd-5th, 2006 Bertinoro di Romagna, Italy