In these exercises you will be carrying out chi-squared tests,
and logistic regression
analysis of case/control data, and TDT analysis of family data to test
for genetic association. The tests will performed in the statistical analysis package Stata. As well as using standard functions
implemented in these packages, we will
also make use of special functions designed for genetic
analysis that can be downloaded as add-in routines.
Some of the commands that you will type in order to
do the analysis may seem a bit mystifying if you are not
familiar with using statistical packages such as Stata or R.
Don't worry too much if you don't understand all the details!
If you decide to regularly use a statistical package such as Stata or R
to analyse your own data, it will be important to
learn how to use that particular package appropriately through attendance on
a training course or careful
reading of an introductory textbook.
The data consists of genotype data at 4 linked SNP loci and another unlinked locus, typed in 384 type 1 diabetes cases and 672 controls. The data for the 4 linked loci casecondata.ped is in standard linkage pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Note that `0' indicates unknown data or genotypes: since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown. Each column must be separated by a tab in order to be read correctly into Stata.
Make sure you have a copy of the pedigree files in the appropriate
directory.
Open up a Stata window by clicking on the Stata icon.
For those not familiar with Stata, you should find
that a large window with 4 separate sub-windows appears. You type commands in
the bottom right hand window, and any results will be displayed in the
top right hand window. The top left window provides a review of all
the commands you typed, and the bottom left window displays all the
variables that you have currently loaded. A convenient way to examine
the variables is to click on the data browser button at the top.
Within Stata, it should be possible to set the current working directory to the directory where you have saved the data file by clicking "File" and selecting "Change working directory".
To keep a log of your session in the file 'results.log', you can type
in the bottom
right hand (white) Stata window:
log using results, replace text
Download the required add-in utilities by typing:
net from http://www-gene.cimr.cam.ac.uk/clayton/software/stata
net install genassoc
net get genassoc
If this doesn't work you will need to download the following add-in
libraries manually, into the same directory as the data file:
_ggtype.ado
_ghtype.ado
ginsheet.ado
gtab.ado
tdt.ado
ginsheet.hlp
gtab.hlp
tdt.hlp
Read the data into Stata by typing:
ginsheet using casecondata.ped, preped zmiss
The ginsheet command reads in data from a text file
which in this case has been prepared for you earlier
in the file 'casecondata.ped'. If Windows has renamed your file (e.g. as 'casecondata_ped.txt') you will have to use the correct new name. The preped option tells Stata
that the file is in a standard pre-makeped pedigree file format
and the zmiss tells Stata that missing values are coded
in the file as 0
(whereas Stata will recode them as '.')
You should find that the names of 14 variables appear in the bottom left
window. These correspond to the usual variables in a
standard pedigree file with alleles at locus 1
denoted L1_1 and L1_2,
alleles at locus 2 denoted L2_1 and L2_2 etc. etc.
To examine the variables you just read in you can click on the
data browser button (3rd from right at the top).
You might notice that missing genotypes have been recoded from `0' to
Stata's internal missing value code '.'
Alternatively, to look at any of the variables you just read in you can use the command
list e.g. you could type:
list family affected L1_1 L1_2 L2_1 L2_2
This will show you the family ID, affection status and alleles at locus
1 and 2. Use the spacebar to continue listing, or type q to stop.
To see documentation for the ginsheet command, type:
help ginsheet
In order to perform the analysis, Stata will require you to have
case/control status coded as 1 or 0 rather than 2 or 1. So we generate a new
variable called 'case' that has this required coding by typing:
gen case=affected-1
Use the list command or the data browser button to check that your
'case' variable is correctly coded.
At the moment, the genotype data is in the form of two alleles
for each locus. E.g. 'L1_1' and 'L1_2' are the two variables coding the
two alleles at locus 1.
Since we will be modelling genotype as a predictor of disease status,
we need to create a genotype variable for each locus from the two
alleles at each locus. We can do this in several ways. The first possibility
is to type:
egen A=gtype(L1_1 L1_2)
This creates a genotype variable named 'A' which contains the
genotype corresponding to the alleles contained in the variables
L1_1 and L1_2. This is the most general way to create
a genotype variable, and the required way for multiallelic loci.
Since we have diallelic loci,
another possibility is to create variables that correspond
to the number of times allele 1 and 2 appear in the genotype
(the variables are automatically named A1 and A2) by typing:
gtab L1_1 L1_2, gen(A)
Having created these variables, check that they are correct
(and that you understand how they correspond to genotype)
by typing:
list family L1_1 L1_2 A A1 A2
To carry out a chi-squared test on 2df for
genotype at locus 1, using the A variable, type:
tabulate A case, chi2
Check that you get the same using the A2 variable:
tabulate A2 case, chi2
And using the A1 variable:
tabulate A1 case, chi2
Suppose we wished to perform the chi-squared test on 1df,
assuming a dominant mode of action of the '2' allele.
We need to generate a new variable that indicates presence
of either one or two copies of the '2' allele. Do this by typing:
gen A2ind = A2>0
replace A2ind=. if A2==.
and check you understand how the new variable 'A2ind' relates to
the old variables by typing
list A2ind A2 A
To carry out a chi-squared test on the A2ind variable, type
tabulate A2ind case, chi2
and check that the counts match up with what you found for the genotype table:
tabulate A case, chi2
To carry out a chi-squared test on 1df on chromosomes rather
than genotypes, we need to tell Stata to count twice
when genotype is 2/2, and once when genotype is 1/2.
It is easiest to do this by inputting the table manually:
tabi 715 496 \ 585 272, chi2
Check that you understand how these counts correspond to
the previous genotype table.
You should have found that all the tests gave evidence of a significant
association at the first locus. Repeat the genotype tests for each of the
other loci by generating appropriate genotype variables and analysing them:
gtab L2_1 L2_2, gen(B)
gtab L3_1 L3_2, gen(C)
gtab L4_1 L4_2, gen(D)
tabulate B2 case, chi2
tabulate C2 case, chi2
tabulate D2 case, chi2
All loci should show significant association, with C and D
(locus 3 and 4) showing the highest significance.
To carry out logistic regression, we will use the
'case' variable as a response variable i.e. a
binary 0/1 indicator of disease. We will use the
genotype variables A2, B2, C2 and D2 as predictor variables.
To carry out the analysis for locus A, assuming
a multiplicative model for the effect of alleles (i.e. assuming
that two copies of allele 2 has twice the effect of one copy,
on the log odds scale), type:
logit case A2
The significance test is given at the top right: you should
obtain a chi-squared on 1 df of 18.47 with a p value of 0.0000
(i.e. <0.00005). This is the significance test for comparing
the model where A is in the regression equation compared
to a model where it is not, i.e. comparing models
ln(p/(1-p))=beta_1 * A2+beta_0
and
ln(p/(1-p))=beta_0
Check that we get exactly the same significance test when we type either
logit case A1
or
logit case A
The values of the regression coefficients beta_1 and beta_0
given in the table (denoted A and _cons in the Stata results window)
will differ
as the coding schemes for genotype corresponding to variables
A, A1 and A2 differ. However the significance tests should
be the same.
So far we have assumed a multiplicative model with
no dominance i.e. that having two copies of
allele 2 corresponds to twice the effect of having one copy.
We can avoid this assumption by making Stata fit a model
allowing for two genotype variables per genotype,
which corresponds to comparing models
ln(p/(1-p))=beta_1 * x1+beta_2 * x2+beta_0
and
ln(p/(1-p))=beta_0
where x1 and x2 are related to the genotype
variable A in an appropriate way. To do this type
xi:logit case i.A
xi:logit case i.A1
xi:logit case i.A2
The xi: and i. part of these commands basically
simply tells Stata to consider the predictor variable as a
categorical variable and to automatically create temporary
appropriate non-categorical
variables x1 and x2, which will appear and be called something like
_IA1_1 and _IA1_2 or
_IA_2 and _IA_3 in the Stata variables window.
These commands should each give a significance test of a
chi-squared on 2 df of 18.49 with a p value of 0.0001.
Again you should find that using variables
A, A1 and A2 are basically equivalent, except
that the difference in coding schemes results in different
estimated regression coefficients. Since the tests are equivalent,
from now on we will do everything
simply with respect to A2.
To compare the dominance and no dominance models,
we need to use the following sequence of commands:
xi:logit case i.A2
est store full
logit case A2
lrtest full
This fits the first model with dominance included, saves
the results, then fits the second model with no
dominance and tests the difference between the models.
The test to compare the two models that you just fitted
is found at the bottom right hand of the Stata results window.
You should get a chi-squared of 0.02 on 1 df with p value
0.8767 i.e. there is no significant evidence of dominance: the
multiplicative model is adequate.
It is often more convenient to output the coefficients
in terms of odds ratios with respect to an arbitrarily
chosen reference genotype. We do this using the odds ratio or
'or' option
for the logit command:
logit case A2, or
xi:logit case i.A2, or
This presents the results in terms of an odds ratio
with respect to the '1' allele or the '1/1' genotype.
We can do the same single-locus analysis for loci 2, 3 and 4 as
we have just done for locus 1. To save time, for now we will
just perform the 2df tests:
xi:logit case i.B2, or
xi:logit case i.C2, or
xi:logit case i.D2, or
You should find that the pattern of significance is the same as you found when
doing the chi-squared tests on the genotype tables.
So far we have just analysed each locus individually.
Now we will perform a stepwise analysis, where we enter or
discard variables in turn from the regression equation
We start with a forwards stepwise procedure.
Since locus 3 was the most significant predictor of disease status,
it makes sense to enter this first into the regression equation,
and then see if any other loci are significant given the effect
at locus 3. This can be done using nested likelihood ratio tests
as we did when testing for dominance, but it turns out to
be more convenient to use a slightly different approach
that simply discards terms from the regression equation.
So to see if locus 1 is important given the effect of locus
3, we fit a model in which both locus 1 and locus 3 are included,
and then discard any terms for locus 1:
xi:logit case i.C2 i.A2, or
testparm *A2*
This gives us a 2df test for locus 1 given locus 3
(chisquared = 0.55, p value = 0.76). A 1df test
(assuming no dominance at locus 1) can be performed by:
xi:logit case i.C2 A2, or
testparm A2
We can perform similar tests for locus 2 and locus 4:
xi:logit case i.C2 i.B2, or
testparm *B2*
xi:logit case i.C2 i.D2, or
testparm *D2*
xi:logit case i.C2 B2, or
testparm B2
xi:logit case i.C2 D2, or
testparm D2
You should find that no loci are significant
i.e. the first model (that includes genotypes at both loci)
is never significantly better than the model
that just includes locus 3. So locus 3 is
sufficient to account for whether or not someone is diseased.
We therefore stop the stepwise analysis.
If another locus, e.g.
locus 1, had turned out to be significant,
we would enter it into the regression equation along with locus 3
and see if any other loci are significant. E.g. to test if locus 2 is significant once one has accounted for loci 1 and 3, on either 2df or 1df, one would type:
xi:logit case i.C2 i.A2 i.B2, or
testparm *B2*
xi:logit case i.C2 i.A2 B2, or
testparm B2
We can also use a backwards stepwise procedure, where we
start by including all loci in the model, and then delete loci
that are not significant. However, in the interests of time, we will
not carry out this analysis today.
In this exercise you will be carrying out family-based
association (TDT) analysis of five linked loci in the
HLA region with type 1 diabetes, using a set of case-parent trios.
The purpose is detect which (if any)
of the loci are associated with disease.
We will be using family data consisting of a number of trio
families with an affected diabetic child plus parents
(of unknown disease status)
all of whom are typed at 5 polymorphisms in the HLA region.
fiveloci.ped
Save the data file above into the appropriate directory.
Read the data into Stata (and get rid of the previous data set) by typing:
clear
ginsheet using fiveloci.ped, preped zmiss
The ginsheet command reads in data from a text file
which in this case has been prepared for you earlier
in the file "fiveloci.ped". The preped option tells Stata
that the file is in a standard pre-makeped pedigree file format
and the zmiss tells Stata that missing values are coded
in the file as 0
(whereas Stata will recode them as ".")
You should find that the names of 16 variables appear in the bottom left
window. These correspond to the usual variables in a
standard pedigree file with alleles at locus 1
denoted L1_1 and L1_2,
alleles at locus 2 denoted L2_1 and L2_2 etc. etc.
Take a look at the data by clicking on the data browser
(3rd icon from the right at the top).
You should see that the data consists of genotypes for a series
of TDT type trios i.e. father, mother and affected child.
We can perform some preliminary single-locus association
analysis at these loci using the following commands
tdt L1_1 L1_2
tdt L2_1 L2_2
tdt L3_1 L3_2
tdt L4_1 L4_2
tdt L5_1 L5_2
This performs TDT analysis at each locus in turn
and also gives an indication
of which loci are diallelic and which have more than 2 alleles.
The p values for the TDT tests of each allele are given in the table,
and the global (multiallelic) p value is given at the bottom right (in green).
You should find highly significant associations at
locus 1, 2 and 5
and moderately significant associations at
locus 3 and 4 (p approx 0.01). For more information
on the tdt command type
help tdt
We can also perform a stepwise analysis of family data, where we enter different loci into the model in a stepwise fashion. However, there won't be time to do this today. (If there is time, we will instead do this using a different program called UNPHASED).
To get out of Stata, type
clear
exit