Computer Practical Exercises on Genetic Association Analysis using Stata

Introduction

In these exercises you will be carrying out chi-squared tests, and logistic regression analysis of case/control data, and TDT analysis of family data to test for genetic association. The tests will performed in the statistical analysis package Stata. As well as using standard functions implemented in these packages, we will also make use of special functions designed for genetic analysis that can be downloaded as add-in routines.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying if you are not familiar with using statistical packages such as Stata or R. Don't worry too much if you don't understand all the details! If you decide to regularly use a statistical package such as Stata or R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

Exercise 1: Case/Control Data

Data overview

The data consists of genotype data at 4 linked SNP loci and another unlinked locus, typed in 384 type 1 diabetes cases and 672 controls. The data for the 4 linked loci casecondata.ped is in standard linkage pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Note that `0' indicates unknown data or genotypes: since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown. Each column must be separated by a tab in order to be read correctly into Stata.

Step-by-step instructions

1. Chi-squared table analysis in Stata

Make sure you have a copy of the pedigree files in the appropriate directory. Open up a Stata window by clicking on the Stata icon. For those not familiar with Stata, you should find that a large window with 4 separate sub-windows appears. You type commands in the bottom right hand window, and any results will be displayed in the top right hand window. The top left window provides a review of all the commands you typed, and the bottom left window displays all the variables that you have currently loaded. A convenient way to examine the variables is to click on the data browser button at the top.

Within Stata, it should be possible to set the current working directory to the directory where you have saved the data file by clicking "File" and selecting "Change working directory".

To keep a log of your session in the file 'results.log', you can type in the bottom right hand (white) Stata window:

log using results, replace text

Download the required add-in utilities by typing:

net from http://www-gene.cimr.cam.ac.uk/clayton/software/stata net install genassoc net get genassoc

If this doesn't work you will need to download the following add-in libraries manually, into the same directory as the data file:

_ggtype.ado
_ghtype.ado
ginsheet.ado
gtab.ado
tdt.ado
ginsheet.hlp
gtab.hlp
tdt.hlp

Read the data into Stata by typing:

ginsheet using casecondata.ped, preped zmiss

The ginsheet command reads in data from a text file which in this case has been prepared for you earlier in the file 'casecondata.ped'. If Windows has renamed your file (e.g. as 'casecondata_ped.txt') you will have to use the correct new name. The preped option tells Stata that the file is in a standard pre-makeped pedigree file format and the zmiss tells Stata that missing values are coded in the file as 0 (whereas Stata will recode them as '.')

You should find that the names of 14 variables appear in the bottom left window. These correspond to the usual variables in a standard pedigree file with alleles at locus 1 denoted L1_1 and L1_2, alleles at locus 2 denoted L2_1 and L2_2 etc. etc.

To examine the variables you just read in you can click on the data browser button (3rd from right at the top). You might notice that missing genotypes have been recoded from `0' to Stata's internal missing value code '.'

Alternatively, to look at any of the variables you just read in you can use the command list e.g. you could type:

list family affected L1_1 L1_2 L2_1 L2_2

This will show you the family ID, affection status and alleles at locus 1 and 2. Use the spacebar to continue listing, or type q to stop.

To see documentation for the ginsheet command, type:

help ginsheet

In order to perform the analysis, Stata will require you to have case/control status coded as 1 or 0 rather than 2 or 1. So we generate a new variable called 'case' that has this required coding by typing:

gen case=affected-1

Use the list command or the data browser button to check that your 'case' variable is correctly coded.

At the moment, the genotype data is in the form of two alleles for each locus. E.g. 'L1_1' and 'L1_2' are the two variables coding the two alleles at locus 1. Since we will be modelling genotype as a predictor of disease status, we need to create a genotype variable for each locus from the two alleles at each locus. We can do this in several ways. The first possibility is to type:

egen A=gtype(L1_1 L1_2)

This creates a genotype variable named 'A' which contains the genotype corresponding to the alleles contained in the variables L1_1 and L1_2. This is the most general way to create a genotype variable, and the required way for multiallelic loci.

Since we have diallelic loci, another possibility is to create variables that correspond to the number of times allele 1 and 2 appear in the genotype (the variables are automatically named A1 and A2) by typing:

gtab L1_1 L1_2, gen(A)

Having created these variables, check that they are correct (and that you understand how they correspond to genotype) by typing:

list family L1_1 L1_2 A A1 A2

To carry out a chi-squared test on 2df for genotype at locus 1, using the A variable, type:

tabulate A case, chi2

Check that you get the same using the A2 variable:

tabulate A2 case, chi2

And using the A1 variable:

tabulate A1 case, chi2

Suppose we wished to perform the chi-squared test on 1df, assuming a dominant mode of action of the '2' allele. We need to generate a new variable that indicates presence of either one or two copies of the '2' allele. Do this by typing:

gen A2ind = A2>0 replace A2ind=. if A2==.

and check you understand how the new variable 'A2ind' relates to the old variables by typing

list A2ind A2 A

To carry out a chi-squared test on the A2ind variable, type

tabulate A2ind case, chi2

and check that the counts match up with what you found for the genotype table:

tabulate A case, chi2

To carry out a chi-squared test on 1df on chromosomes rather than genotypes, we need to tell Stata to count twice when genotype is 2/2, and once when genotype is 1/2. It is easiest to do this by inputting the table manually:

tabi 715 496 \ 585 272, chi2

Check that you understand how these counts correspond to the previous genotype table.

You should have found that all the tests gave evidence of a significant association at the first locus. Repeat the genotype tests for each of the other loci by generating appropriate genotype variables and analysing them:

gtab L2_1 L2_2, gen(B) gtab L3_1 L3_2, gen(C) gtab L4_1 L4_2, gen(D) tabulate B2 case, chi2 tabulate C2 case, chi2 tabulate D2 case, chi2

All loci should show significant association, with C and D (locus 3 and 4) showing the highest significance.

2. Logistic regression analysis in Stata

To carry out logistic regression, we will use the 'case' variable as a response variable i.e. a binary 0/1 indicator of disease. We will use the genotype variables A2, B2, C2 and D2 as predictor variables.

To carry out the analysis for locus A, assuming a multiplicative model for the effect of alleles (i.e. assuming that two copies of allele 2 has twice the effect of one copy, on the log odds scale), type:

logit case A2

The significance test is given at the top right: you should obtain a chi-squared on 1 df of 18.47 with a p value of 0.0000 (i.e. <0.00005). This is the significance test for comparing the model where A is in the regression equation compared to a model where it is not, i.e. comparing models

ln(p/(1-p))=beta_1 * A2+beta_0

and

ln(p/(1-p))=beta_0

Check that we get exactly the same significance test when we type either

logit case A1

or

logit case A

The values of the regression coefficients beta_1 and beta_0 given in the table (denoted A and _cons in the Stata results window) will differ as the coding schemes for genotype corresponding to variables A, A1 and A2 differ. However the significance tests should be the same.

So far we have assumed a multiplicative model with no dominance i.e. that having two copies of allele 2 corresponds to twice the effect of having one copy. We can avoid this assumption by making Stata fit a model allowing for two genotype variables per genotype, which corresponds to comparing models

ln(p/(1-p))=beta_1 * x1+beta_2 * x2+beta_0

and

ln(p/(1-p))=beta_0

where x1 and x2 are related to the genotype variable A in an appropriate way. To do this type

xi:logit case i.A xi:logit case i.A1 xi:logit case i.A2

The xi: and i. part of these commands basically simply tells Stata to consider the predictor variable as a categorical variable and to automatically create temporary appropriate non-categorical variables x1 and x2, which will appear and be called something like _IA1_1 and _IA1_2 or _IA_2 and _IA_3 in the Stata variables window.

These commands should each give a significance test of a chi-squared on 2 df of 18.49 with a p value of 0.0001. Again you should find that using variables A, A1 and A2 are basically equivalent, except that the difference in coding schemes results in different estimated regression coefficients. Since the tests are equivalent, from now on we will do everything simply with respect to A2.

To compare the dominance and no dominance models, we need to use the following sequence of commands:

xi:logit case i.A2 est store full logit case A2 lrtest full

This fits the first model with dominance included, saves the results, then fits the second model with no dominance and tests the difference between the models. The test to compare the two models that you just fitted is found at the bottom right hand of the Stata results window. You should get a chi-squared of 0.02 on 1 df with p value 0.8767 i.e. there is no significant evidence of dominance: the multiplicative model is adequate.

It is often more convenient to output the coefficients in terms of odds ratios with respect to an arbitrarily chosen reference genotype. We do this using the odds ratio or 'or' option for the logit command:

logit case A2, or xi:logit case i.A2, or

This presents the results in terms of an odds ratio with respect to the '1' allele or the '1/1' genotype. We can do the same single-locus analysis for loci 2, 3 and 4 as we have just done for locus 1. To save time, for now we will just perform the 2df tests:

xi:logit case i.B2, or xi:logit case i.C2, or xi:logit case i.D2, or

You should find that the pattern of significance is the same as you found when doing the chi-squared tests on the genotype tables.

So far we have just analysed each locus individually. Now we will perform a stepwise analysis, where we enter or discard variables in turn from the regression equation

We start with a forwards stepwise procedure. Since locus 3 was the most significant predictor of disease status, it makes sense to enter this first into the regression equation, and then see if any other loci are significant given the effect at locus 3. This can be done using nested likelihood ratio tests as we did when testing for dominance, but it turns out to be more convenient to use a slightly different approach that simply discards terms from the regression equation. So to see if locus 1 is important given the effect of locus 3, we fit a model in which both locus 1 and locus 3 are included, and then discard any terms for locus 1:

xi:logit case i.C2 i.A2, or testparm *A2*

This gives us a 2df test for locus 1 given locus 3 (chisquared = 0.55, p value = 0.76). A 1df test (assuming no dominance at locus 1) can be performed by:

xi:logit case i.C2 A2, or testparm A2

We can perform similar tests for locus 2 and locus 4:

xi:logit case i.C2 i.B2, or testparm *B2* xi:logit case i.C2 i.D2, or testparm *D2* xi:logit case i.C2 B2, or testparm B2 xi:logit case i.C2 D2, or testparm D2

You should find that no loci are significant i.e. the first model (that includes genotypes at both loci) is never significantly better than the model that just includes locus 3. So locus 3 is sufficient to account for whether or not someone is diseased. We therefore stop the stepwise analysis.

If another locus, e.g. locus 1, had turned out to be significant, we would enter it into the regression equation along with locus 3 and see if any other loci are significant. E.g. to test if locus 2 is significant once one has accounted for loci 1 and 3, on either 2df or 1df, one would type:

xi:logit case i.C2 i.A2 i.B2, or testparm *B2* xi:logit case i.C2 i.A2 B2, or testparm B2

We can also use a backwards stepwise procedure, where we start by including all loci in the model, and then delete loci that are not significant. However, in the interests of time, we will not carry out this analysis today.

Exercise 2: Family Data

Data overview

In this exercise you will be carrying out family-based association (TDT) analysis of five linked loci in the HLA region with type 1 diabetes, using a set of case-parent trios. The purpose is detect which (if any) of the loci are associated with disease. We will be using family data consisting of a number of trio families with an affected diabetic child plus parents (of unknown disease status) all of whom are typed at 5 polymorphisms in the HLA region. fiveloci.ped

Step-by-step instructions

Save the data file above into the appropriate directory.

Read the data into Stata (and get rid of the previous data set) by typing:

clear ginsheet using fiveloci.ped, preped zmiss

The ginsheet command reads in data from a text file which in this case has been prepared for you earlier in the file "fiveloci.ped". The preped option tells Stata that the file is in a standard pre-makeped pedigree file format and the zmiss tells Stata that missing values are coded in the file as 0 (whereas Stata will recode them as ".")

You should find that the names of 16 variables appear in the bottom left window. These correspond to the usual variables in a standard pedigree file with alleles at locus 1 denoted L1_1 and L1_2, alleles at locus 2 denoted L2_1 and L2_2 etc. etc.

Take a look at the data by clicking on the data browser (3rd icon from the right at the top). You should see that the data consists of genotypes for a series of TDT type trios i.e. father, mother and affected child. We can perform some preliminary single-locus association analysis at these loci using the following commands

tdt L1_1 L1_2 tdt L2_1 L2_2 tdt L3_1 L3_2 tdt L4_1 L4_2 tdt L5_1 L5_2

This performs TDT analysis at each locus in turn and also gives an indication of which loci are diallelic and which have more than 2 alleles. The p values for the TDT tests of each allele are given in the table, and the global (multiallelic) p value is given at the bottom right (in green). You should find highly significant associations at locus 1, 2 and 5 and moderately significant associations at locus 3 and 4 (p approx 0.01). For more information on the tdt command type

help tdt

We can also perform a stepwise analysis of family data, where we enter different loci into the model in a stepwise fashion. However, there won't be time to do this today. (If there is time, we will instead do this using a different program called UNPHASED). To get out of Stata, type

clear exit