Computer Practical Exercises on Case/Control Association using Stata and the R package

Overview

Purpose

In this exercise you will be carrying out chi-squared tests and logistic regression analysis of case/control data to test for genetic association. The purpose is detect which (if any) of a set of four SNP loci are associated with disease.

Methodology

The methodology used is standard methodology for case/control analysis (chi-squared tests and logistic regression) as described in today's lecture. The tests will performed in two statistical analysis packages, Stata and R. As well as using standard functions implemented in these packages, we will also make use of special functions designed for genetic analysis that have been downloaded as add-in routines to the Stata and R packages.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying (!) if you are not familiar with statistical packages such as Stata or R. Don't worry too much if you don't understand all the details. If you decide to use a statistical package such as Stata or R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

Exercise

Data overview

The data consists of genotype data at 4 linked SNP loci typed in 384 type 1 diabetes cases and 672 controls.

Appropriate data

Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls). Ideally a larger data set should be used than we consider here e.g. 1000 cases and 1000 controls might be considered a reasonable number.

Special considerations/restrictions (for the programs used here)

All the Stata and R commands required to run these analyses are given below. However, Stata and R are sophisticated statistical programming packages and have much greater functionality than will be described here. If you intend to use a statistical package to analyse your data, you are strongly encouraged to learn how to use that package appropriately through attendance on a training course or careful reading of an introductory textbook.

Instructions

Data format

The data is in standard pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Note that `0' indicates unknown data or genotypes: since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown. Each column must be separated by a tab in order to be read correctly into Stata.

The pedigree file used for the analysis in R differs from the pedigree file used for the analysis in Stata. It has a header line describing the different columns, and it uses R's own missing value code "NA".

Data files

casecondata.ped
casecondata.Rped

Program documentation

STATA documentation:

The Stata website is at http://www.stata.com/

Useful guides to Stata can be found at http://www.ats.ucla.edu/stat/stata/ and http://www.whoishostingthis.com/resources/stata/

David Clayton's add-in routines (the genassoc package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software/stata/README.txt

From within Stata, one can obtain help on any command xxxx by typing `help xxxx'

R documentation:

The R website is at http://www.r-project.org/

David Clayton's add-in routines (the dgc.genetics package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software

From within R, one can obtain help on any command xxxx by typing `help(xxxx)'

Step-by-step instructions

1. Chi-squared table analysis in Stata

Make sure you have a copy of the pedigree files in your current unix directory. Open up a Stata window from that directory by typing:

xstata

For those not familiar with Stata, you should find that a large window with 4 separate sub-windows appears. You type commands in the bottom right hand window, and any results will be displayed in the top right hand window. The top left window provides a review of all the commands you typed, and the bottom left window displays all the variables that you have currently loaded. A convenient way to examine the variables is to click on the data browser button at the top.

To keep a log of your session in the file `casecon.log', you can type in the bottom right hand (white) Stata window:

log using casecon, replace text

Read the data into Stata by typing:

ginsheet using casecondata.ped, preped zmiss

The ginsheet command reads in data from a text file which in this case has been prepared for you earlier in the file `casecondata.ped'. The preped option tells Stata that the file is in a standard pre-makeped pedigree file format and the zmiss tells Stata that missing values are coded in the file as 0 (whereas Stata will recode them as `.')

You should find that the names of 14 variables appear in the bottom left window. These correspond to the usual variables in a standard pedigree file with alleles at locus 1 denoted L1_1 and L1_2, alleles at locus 2 denoted L2_1 and L2_2 etc. etc.

To see documentation for the ginsheet command, type:

help ginsheet

To look at any of the variables you just read in you can use the command list e.g. you could type:

list family affected L1_1 L1_2 L2_1 L2_2

This will show you the family ID, affection status and alleles at locus 1 and 2. Use the spacebar to continue listing, or type q to stop. Alternatively you can examine the data by clicking on the data browser button at the top. You might notice that missing genotypes have been recoded from `0' to Stata's internal missing value code `.'

In order to perform the analysis, Stata will require you to have case/control status coded as 1/0 rather than 2/1. So generate a new variable called `case' that has this required coding by typing:

gen case=affected-1

Use the list command or the data browser button to check that your `case' variable is correctly coded.

At the moment, the genotype data is in the form of two alleles for each locus. E.g. `L1_1' and `L1_2' are the two variables coding the two alleles at locus 1. Since we will be modelling genotype as a predictor of disease status, we need to create a genotype variable for each locus from the two alleles at each locus. We can do this in several ways. The first possibility is to type:

egen A=gtype(L1_1 L1_2)

This creates a genotype variable named `A' which contains the genotype corresponding to the alleles contained in the variables L1_1 and L1_2. This is the most general way to create a genotype variable, and the required way for multiallelic loci.

Since we have diallelic loci, another possibility is to create variables that correspond to the number of times allele 1 and 2 appear in the genotype (the variables are automatically named A1 and A2) by typing:

gtab L1_1 L1_2, gen(A)

Having created these variables, check that they are correct (and that you understand how they correspond to genotype) by typing:

list family L1_1 L1_2 A A1 A2

To carry out a chi-squared test on 2df for genotype at locus 1, using the A variable, type:

tabulate A case, chi2

Check that you get the same using the A2 variable:

tabulate A2 case, chi2

Suppose we wished to perform the chi-squared test on 1df, assuming a dominant mode of action of the `2' allele. We need to generate a new variable that indicates presence of either one or two copies of the `2' allele. Do this by typing:

gen A2ind=A2>0 replace A2ind=. if A2==.

and check you understand how the new variable `A2ind' relates to the old variables by typing

list A2ind A2 A

To carry out a chi-squared test on the A2ind variable, type

tabulate A2ind case, chi2

and check that the counts mach up with what you found for the genotype table:

tabulate A case, chi2

To carry out a chi-squared test on 1df on chromosomes rather than genotypes, we need to tell Stata to count twice when genotype is 2/2, and once when genotype is 1/2. It is easiest to do this by inputting the table manually:

tabi 715 496 \ 585 272, chi2

Check that you understand how these counts correspond to the previous genotype table.

You should have found that all the tests gave evidence of a significant association at the first locus. Repeat the genotype tests for the other loci by generating appropriate genotype variables and analysing:

gtab L2_1 L2_2, gen(B) gtab L3_1 L3_2, gen(C) gtab L4_1 L4_2, gen(D) tabulate B2 case, chi2 tabulate C2 case, chi2 tabulate D2 case, chi2

All loci should show significant association, with C and D (locus 3 and 4) showing the highest significance.

2. Logistic regression analysis in Stata

To carry out logistic regression, we will use the `case' variable as a response variable i.e. a binary 0/1 indicator of disease. We will use the genotype variables A2, B2, C2 and D2 as predictor variables.

To carry out the analysis for locus A, assuming a multiplicative model for the effect of alleles (i.e. assuming that two copies of allele 2 has twice the effect of one copy, on the log odds scale), type:

logit case A2

The significance test is given at the top right: you should obtain a chi-squared on 1 df of 18.47 with a p value of 0.0000 (i.e. <0.00005). This is the significance test for comparing the model where A is in the regression equation compared to a model where it is not, i.e. comparing models

ln(p/(1-p))=beta_1*A2+beta_0

and

ln(p/(1-p))=beta_0

Check that we get exactly the same significance test when we type either

logit case A1

or

logit case A

The values of the regression coefficients beta_1 and beta_0 given in the table (denoted A and _cons in the Stata results window) will differ as the coding schemes for genotype corresponding to variables A, A1 and A2 differ. However the significance tests should be the same. So far we have assumed a multiplicative model with no dominance i.e. that having two copies of allele 2 corresponds to twice the effect of having one copy. We can avoid this assumption by making Stata fit a model allowing for two genotype variables per genotype, which corresponds to comparing models

ln(p/(1-p))=beta_1*x1+beta_2*x2+beta_0

and

ln(p/(1-p))=beta_0

where x1 and x2 are related to the genotype variable A in an appropriate way. To do this type

xi:logit case i.A xi:logit case i.A1 xi:logit case i.A2

The xi: and i. part of these commands basically simply tells Stata to consider the predictor variable as a categorical variable and to automatically create temporary appropriate non-categorical variables x1 and x2, which may called something like _IA1_1 and _IA1_2 or _IA_2 and _IA_3 in the Stata variables window.

These commands should each give a significance test of a chi-squared on 2 df of 18.49 with a p value of 0.0001. Again you should find that using variables A, A1 and A2 are basically equivalent, except that the difference in coding schemes results in different estimated regression coefficients. Since the tests are equivalent, from now on we will do everything simply with respect to A2.

To compare the dominance and no dominance models, we need to use the following sequence of commands:

xi:logit case i.A2 est store full logit case A2 lrtest full

This fits the first model with dominance included, saves the results, then fits the second model with no dominance and tests the difference between the models. The test to compare the two models that you just fitted is found at the bottom right hand of the Stata results window. You should get a chi-squared of 0.02 on 1 df with p value 0.8767 i.e. there is no significant evidence of dominance: the multiplicative model is adequate.

It is often more convenient to output the coefficients in terms of odds ratios with respect to an arbitrarily chosen reference genotype. We do this using the odds ratio or `or' option for the logit command:

logit case A2, or xi:logit case i.A2, or

This presents the results in terms of an odds ratio with respect to the `1' allele or the `1/1' genotype. Now we do the same single-locus analysis for loci 2, 3 and 4 as we have just done for locus 1. We can do this using the following sequence of commands:

logit case B2, or logit case C2, or logit case D2, or xi:logit case i.B2, or xi:logit case i.C2, or xi:logit case i.D2, or xi:logit case i.B2 est store full logit case B2 lrtest full xi:logit case i.C2 est store full logit case C2 lrtest full xi:logit case i.D2 est store full logit case D2 lrtest full

You should find that none of the loci show significant dominance, all loci are significant predictors of disease status, and the pattern of significance is the same as you found when doing the chi-squared tests on the genotype tables.

So far we have just analysed each locus individually. Now we will perform a stepwise analysis, where we enter or discard variables in turn from the regression equation

We start with a forwards stepwise procedure. Since locus 3 was the most significant predictor of disease status, it makes sense to enter this first into the regression equation, and then see if any other loci are significant given the effect at locus 3. This can be done using nested likelihood ratio tests as we did when testing for dominance, but it turns out to be more convenient to use a slightly different approach that simply discards terms from the regression equation. So to see if locus 1 is important given the effect of locus 3, we fit a model in which both locus 1 and locus 3 are included, and then discard any terms for locus 1:

xi:logit case i.C2 i.A2, or testparm *A2*

This gives us a 2df test for locus 1 given locus 3. A 1df test (assuming no dominance at locus 1) can be performed by:

xi:logit case i.C2 A2, or testparm A2

We can perform similar tests for locus 2 and locus 4:

xi:logit case i.C2 i.B2, or testparm *B2* xi:logit case i.C2 i.D2, or testparm *D2* xi:logit case i.C2 B2, or testparm B2 xi:logit case i.C2 D2, or testparm D2

You should find that no loci are significant i.e. the first model (that includes genotypes at both loci) is never significantly better than the model that just includes locus 3. So locus 3 is sufficient to account for whether or not someone is diseased. However, since in the initial single-locus tests all loci were quite significant, it is of interest to see whether in fact any of the other loci are sufficient to account for disease (i.e. their effect might not be distinguishable from that of locus 3). We first look at including locus 1 and see whether any other loci are significant once locus 1 is in the regression equation:

xi:logit case i.A2 i.B2, or testparm *B2* xi:logit case i.A2 i.C2, or testparm *C2* xi:logit case i.A2 i.D2, or testparm *D2* xi:logit case i.A2 B2, or testparm B2 xi:logit case i.A2 C2, or testparm C2 xi:logit case i.A2 D2, or testparm D2

For the 1 df tests you should find that loci 3 (p value 0.029) and 4 (p value 0.045) are borderline significant. This suggests that locus 1 may not be sufficient to explain the data. We now look at including locus 2 and see whether any other loci are significant once locus 2 is in the regression equation:

xi:logit case i.B2 i.A2, or testparm *A2* xi:logit case i.B2 i.C2, or testparm *C2* xi:logit case i.B2 i.D2, or testparm *D2* xi:logit case i.B2 A2, or testparm A2 xi:logit case i.B2 C2, or testparm C2 xi:logit case i.B2 D2, or testparm D2

In this case, locus 1, 3 and 4 are significant when added to locus 2, regardless of whether dominance is included, suggesting that locus 2 certainly does not explain the data. Finally we look at including locus 4 and see whether any other loci are significant once locus 4 is in the regression equation:

xi:logit case i.D2 i.A2, or testparm *A2* xi:logit case i.D2 i.B2, or testparm *B2* xi:logit case i.D2 i.C2, or testparm *C2* xi:logit case i.D2 A2, or testparm A2 xi:logit case i.D2 B2, or testparm B2 xi:logit case i.D2 C2, or testparm C2

In this case none of the loci are significant, regardless of whether dominance is included. Locus 4 is therefore sufficient to explain the data. Once locus 4 is included, locus 3 is not required, but once locus 3 is included locus 4 is not required. This suggests that these two loci are in too strong LD to be distinguished from each other, although the greater significance of locus 3 in the single-locus tests makes it arguably a slightly better candidate for the disease variant than locus 4.

We can also use a backwards stepwise procedure, where we start by including all loci in the model, and then delete loci that are not significant. For simplicity we will just do this using the 2df tests:

xi:logit case i.A2 i.B2 i.C2 i.D2, or testparm *A* xi:logit case i.A2 i.B2 i.C2 i.D2, or testparm *B* xi:logit case i.A2 i.B2 i.C2 i.D2, or testparm *C* xi:logit case i.A2 i.B2 i.C2 i.D2, or testparm *D*

You should find that the locus giving the least significant decrease in fit is locus 2 (variable B2, chi-squared=0.02 on 2 df). We therefore remove this locus from the model and look at the effect of additionally removing each of the other loci:

xi:logit case i.A2 i.C2 i.D2, or testparm *A* xi:logit case i.A2 i.C2 i.D2, or testparm *C* xi:logit case i.A2 i.C2 i.D2, or testparm *D*

The least significant locus to remove is locus 4 (variable D2, chisquared=0.61 on 2 df). We therefore remove this locus from the model and look at the effect of additionally removing each of the other loci:

xi:logit case i.A2 i.C2, or testparm *A* xi:logit case i.A2 i.C2, or testparm *C*

The least significant locus to remove is locus 1 (variable A2, chi-squared=0.55 on 2 df). We therefore remove this locus from the model and look at the effect of additionally removing the only remaining locus, which is equivalent to looking at its significance in a single locus test:

xi:logit case i.C2, or testparm *C*

Since this is significant (chi-squared=22.82 on 2 df from the likelihood ratio test at the top, or chi-squared=22.15 on 2 df from the `testparm' command), we do not remove this locus from the model. We are therefore left with a final model in which disease is predicted by locus 3, the same as the final model from the forward stepwise procedure.

We have now finished the Stata analysis. Type

log close

to close the log file in which there will be a record of all the commands you typed and all the results.

Then type

clear exit

to get out of Stata and close the session.

3. Analysis in R

If you have time, you may like to repeat some of the analyses that you performed in the statistical package Stata in the statistical package R. The R package is somewhat less user-friendly than Stata (!) (particularly with regards to the output), but has the advantage of being free and more widely available, partly because a lot of bioinformatics software utilities have been developed as add-in functions for the R package.

Before we start, take a look at the data file that we shall use:

casecondata.Rped

This differs from the pedigree file we used in the Stata analysis by the addition of a header line describing what each of the variables are, and by the changing of any `0's to R's own missing value code `NA'.

To start up the R package, type (from the unix prompt):

R

You are now working within the R package. To start with, you need to read in the necessary add-in libraries (which we have downloaded ready for you):

library(dgc.genetics)

To read in your data, type

casecon <- read.table("casecondata.Rped", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe `casecon' and you can look at it simply by typing

casecon

or by typing

fix(casecon)

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

casecon$pedigree

It can be inconvenient to have to type `casecon$' before typing every variable, so you can tell R to automatically look at variables in the casecon dataframe by typing:

attach(casecon)

Now you can just look at the column of pedigree names by typing

pedigree

However, you should beware when using the `attach' command as if you already have a different variable in R called `pedigree', R will use this variable rather than the one in the attached dataframe - without telling you that there may be some ambiguity. For this reason it is often convenient to clear all variables and detach all dataframes before starting your session. This can be done (although don't do it now) by use of the clear() command.

Just like in Stata, we will generate a variable called `case' that codes affected/unaffected individuals as 1/0 rather than 2/1 (as they are coded in the `affected' variable):

case <- affected-1

To look at the new case variable, type

case

Now we will generate genotype variables from the two alleles at each locus:

g1 <- genotype(loc1_1, loc1_2) g2 <- genotype(loc2_1, loc2_2) g3 <- genotype(loc3_1, loc3_2) g4 <- genotype(loc4_1, loc4_2)

Take a look at one of these e.g. by typing

g1

To perform chi-squared tests on the genotype variables (with disease status), type:

table(g1,case) chisq.test(g1,case) table(g2,case) chisq.test(g2,case) table(g3,case) chisq.test(g3,case) table(g4,case) chisq.test(g4,case)

These should give you results that correspond very closely to what you found in the Stata analysis of the same data.

To perfom a chromosome-based analysis, you can use the commands:

allele.table(g1,case) chisq.test(allele.table(g1,case)) allele.table(g2,case) chisq.test(allele.table(g2,case)) allele.table(g3,case) chisq.test(allele.table(g3,case)) allele.table(g4,case) chisq.test(allele.table(g4,case))

To perform the logistic regression analysis, we need to set the variables up for either a 2df (genotype-based) analysis or a 1df (chromosome-based) analysis. For the genotype-based analysis, we use what are called `genotype' contrasts, and for the chromosome-based analysis we use `additive' (on the log odds scale - equivalent to multiplicative on the odds scale) contrasts. So for the 2df test at the first locus we use the following commands:

gcontrasts(g1) <- "genotype" logit (case ~ g1) anova(logit (case ~ g1))

The `logit' command provides parameter estimates (odds ratios) and p values for each regression coefficient, while the `anova' command gives an overall test in the form of a `deviance' and a `df'. In this case we find a deviance of 18.49 on 2df. To find a p value for this, we compare it to a chi-squared distribution by typing:

1-pchisq(18.49,2)

For a 1df test at the first locus, we use the following commands:

gcontrasts(g1) <- "additive" logit (case ~ g1) anova(logit (case ~ g1)) 1-pchisq(18.47,1)

You should be able to work out how to do a similar analysis on either 1df or 2df for the other loci. If you do this, you should again find you get very similar results to the logistic regression analysis in Stata.

You can also perform a stepwise analysis, where you add or discard terms from the model. Suppose we wish to see whether locus 3 is important once we have already included locus 1, using 2df tests. Then we would use the following commands:

gcontrasts(g1) <- "genotype" gcontrasts(g3) <- "genotype" logit(case ~ g1 + g3) anova(logit(case ~ g1 + g3))

The effect of adding g3 to a model that already includes g1 is seen to be 5.24 on 2df. The p value for this is found by

1-pchisq(5.24,2)

For a 1df test of g3, given 2df at g1, we would use:

gcontrasts(g1) <- "genotype" gcontrasts(g3) <- "additive" logit(case ~ g1 + g3) anova(logit(case ~ g1 + g3)) 1-pchisq(5.07,1)

All the other stepwise tests that we performed in Stata can also be performed in a similar manner in R. If you have time, try some of them out.

Once you are ready to finish with R, type

q() n

to get out.

Answers

How to interpret the output

Interpretation of the output in Stata and R is described in the step-by-step instructions. In general, the output will consist of a likelihood-ratio or chi-squared test for whatever you are test you are performing, and regression coefficients or odds ratio estimates for the predictor variables in the current model. Please ask if you need help in understanding the output for any specific test.

Tips/Tricks

Comments

Advantages/disadvantages

Analysis in a standard statistical package has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.

The analyses described did not make use of any haplotype data. Possible programs for performing haplotype analysis are described below. However, the logistic regression approach does allow analysis of multiple linked loci (provided phase information is not considered to be important).

Study design issues

With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.

Other packages

Logistic regression analysis can be performed in other statistical packages such as SAS, SPSS, GLIM. Chi-squared table analysis can also be performed in other statistical packages and in some spreadsheets such as Excel.

For haplotype analysis one can use several other packages such as the HAPLO.STATS add-in for R by Dan Schaid, the UNPHASED program by Frank Dudbridge, the WHAP program by Shaun Purcell, the SIMHAP program from Lyle Palmer, or the HAPIPF add-in for Stata by Adrian Mander. Alternatively, one can use the SNPHAP program by David Clayton to estimate haplotypes for several multiply-imputed datasets, and then analyse these using add-in multiple imputation functions for Stata or R.

References

Breslow N and Day N (1980) Statistical methods in cancer research: The analysis of case-control studies. International Agency for Research on Cancer.

Breslow N and Day N (1993) Statistical methods in cancer research: The design and analysis of cohort studies. International Agency for Research on Cancer.

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70: 124-141.

Epstein M, Satten G (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics 73: 1316-1329.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 70:425-34.

Schaid DJ (2004) Evaluating associations of haplotypes with traits. Genet Epidemiology 27:348-64.

Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC (2003) Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered. 55:179-90.

Exercises prepared by: Heather Cordell
Checked by:
Programs used: Stata, R, genassoc, dgc.genetics
Last updated: Tue, 19 Jul 2005 15:47:30 GMT