In these exercises you will be carrying out chi-squared tests, logistic regression and conditional logistic regression analysis of case/control and family data to test for genetic association.
The methodology used is standard methodology for case/control
analysis (chi-squared tests and logistic regression) and
family-based analysis (TDT and conditional logistic regression) as
described in today's lecture.
The tests will performed in two statistical analysis packages,
Stata and R. As well as using standard functions
implemented in these packages, we will
also make use of special functions designed for genetic
analysis that can be downloaded as add-in routines to the
Stata and R packages.
Some of the commands that you will type in order to
do the analysis may seem a bit mystifying if you are not
familiar with using statistical packages such as Stata or R.
Don't worry too much if you don't understand all the details!
If you decide to regularly use a statistical package such as Stata or R
to analyse your own data, it will be important to
learn how to use that particular package appropriately through attendance on
a training course or careful
reading of an introductory textbook.
The data consists of genotype data at 4 linked SNP loci and another unlinked locus, typed in 384 type 1 diabetes cases and 672 controls.
Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls). Ideally a larger data set should be used than we consider here e.g. 1000 cases and 1000 controls might be considered a reasonable number.
The data for the 4 linked loci
casecondata.ped
is in standard linkage pedigree file format, with columns corresponding to
family id, subject id (within family), father's id, mother's id,
sex (1=m, 2=f), affection status (1=unaffected, 2=affected)
and one column for each allele for each locus genotype. Note that
`0' indicates unknown data or genotypes: since this is case/control
rather than family data,
there is only one individual per family and everyone's
parents are coded as unknown.
Each column must be separated by a tab in order to be read
correctly into Stata.
The pedigree file casecondata.Rped used for the analysis in R differs from the
pedigree file used for the analysis in Stata. It has a header line
describing the different columns, and it uses R's own missing value
code "NA".
The file
newcasecondata.ped
is a Stata format file that is identical to casecondata.ped except that it includes genotypes for an additional 5th unlinked SNP locus.
Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to an appropriate directory on your computer e.g. to C:DATA
casecondata.ped
casecondata.Rped
newcasecondata.ped
These should all be saved as .txt files to C:DATA.
The Stata website is at
http://www.stata.com/
Useful guides to Stata can be found at
http://www.ats.ucla.edu/stat/stata/ and
http://www.whoishostingthis.com/resources/stata/
David Clayton's add-in routines (the genassoc package) can be found at
http://www-gene.cimr.cam.ac.uk/clayton/software/stata/README.txt
From within
Stata, one can obtain help on any command xxxx by typing `help xxxx'
The R website is at
http://www.r-project.org/
David Clayton's add-in routines (the dgc.genetics package) can be found at
http://www-gene.cimr.cam.ac.uk/clayton/software
From within
R, one can obtain help on any command xxxx by typing `help(xxxx)'
Make sure you have a copy of the pedigree files in the appropriate
directory.
Open up a Stata window by clicking on the Stata icon.
For those not familiar with Stata, you should find
that a large window with 4 separate sub-windows appears. You type commands in
the bottom right hand window, and any results will be displayed in the
top right hand window. The top left window provides a review of all
the commands you typed, and the bottom left window displays all the
variables that you have currently loaded. A convenient way to examine
the variables is to click on the data browser button at the top.
To keep a log of your session in the file 'casecon.log', you can type
in the bottom
right hand (white) Stata window:
log using casecon, replace text
Read the data into Stata by typing:
ginsheet using casecondata.ped, preped zmiss
The ginsheet command reads in data from a text file
which in this case has been prepared for you earlier
in the file 'casecondata.ped'. If Windows has renamed your file (e.g. as 'casecondata_ped.txt') you will have to use the correct new name. The preped option tells Stata
that the file is in a standard pre-makeped pedigree file format
and the zmiss tells Stata that missing values are coded
in the file as 0
(whereas Stata will recode them as '.')
You should find that the names of 14 variables appear in the bottom left
window. These correspond to the usual variables in a
standard pedigree file with alleles at locus 1
denoted L1_1 and L1_2,
alleles at locus 2 denoted L2_1 and L2_2 etc. etc.
To examine the variables you just read in you can click on the
data browser button (3rd from right at the top).
You might notice that missing genotypes have been recoded from `0' to
Stata's internal missing value code '.'
Alternatively, to look at any of the variables you just read in you can use the command
list e.g. you could type:
list family affected L1_1 L1_2 L2_1 L2_2
This will show you the family ID, affection status and alleles at locus
1 and 2. Use the spacebar to continue listing, or type q to stop.
To see documentation for the ginsheet command, type:
help ginsheet
In order to perform the analysis, Stata will require you to have
case/control status coded as 1 or 0 rather than 2 or 1. So we generate a new
variable called 'case' that has this required coding by typing:
gen case=affected-1
Use the list command or the data browser button to check that your
'case' variable is correctly coded.
At the moment, the genotype data is in the form of two alleles
for each locus. E.g. 'L1_1' and 'L1_2' are the two variables coding the
two alleles at locus 1.
Since we will be modelling genotype as a predictor of disease status,
we need to create a genotype variable for each locus from the two
alleles at each locus. We can do this in several ways. The first possibility
is to type:
egen A=gtype(L1_1 L1_2)
This creates a genotype variable named 'A' which contains the
genotype corresponding to the alleles contained in the variables
L1_1 and L1_2. This is the most general way to create
a genotype variable, and the required way for multiallelic loci.
Since we have diallelic loci,
another possibility is to create variables that correspond
to the number of times allele 1 and 2 appear in the genotype
(the variables are automatically named A1 and A2) by typing:
gtab L1_1 L1_2, gen(A)
Having created these variables, check that they are correct
(and that you understand how they correspond to genotype)
by typing:
list family L1_1 L1_2 A A1 A2
To carry out a chi-squared test on 2df for
genotype at locus 1, using the A variable, type:
tabulate A case, chi2
Check that you get the same using the A2 variable:
tabulate A2 case, chi2
And using the A1 variable:
tabulate A1 case, chi2
Suppose we wished to perform the chi-squared test on 1df,
assuming a dominant mode of action of the '2' allele.
We need to generate a new variable that indicates presence
of either one or two copies of the '2' allele. Do this by typing:
gen A2ind = A2>0
replace A2ind=. if A2==.
and check you understand how the new variable 'A2ind' relates to
the old variables by typing
list A2ind A2 A
To carry out a chi-squared test on the A2ind variable, type
tabulate A2ind case, chi2
and check that the counts match up with what you found for the genotype table:
tabulate A case, chi2
To carry out a chi-squared test on 1df on chromosomes rather
than genotypes, we need to tell Stata to count twice
when genotype is 2/2, and once when genotype is 1/2.
It is easiest to do this by inputting the table manually:
tabi 715 496 \ 585 272, chi2
Check that you understand how these counts correspond to
the previous genotype table.
You should have found that all the tests gave evidence of a significant
association at the first locus. Repeat the genotype tests for each of the
other loci by generating appropriate genotype variables and analysing them:
gtab L2_1 L2_2, gen(B)
gtab L3_1 L3_2, gen(C)
gtab L4_1 L4_2, gen(D)
tabulate B2 case, chi2
tabulate C2 case, chi2
tabulate D2 case, chi2
All loci should show significant association, with C and D
(locus 3 and 4) showing the highest significance.
To carry out logistic regression, we will use the
'case' variable as a response variable i.e. a
binary 0/1 indicator of disease. We will use the
genotype variables A2, B2, C2 and D2 as predictor variables.
To carry out the analysis for locus A, assuming
a multiplicative model for the effect of alleles (i.e. assuming
that two copies of allele 2 has twice the effect of one copy,
on the log odds scale), type:
logit case A2
The significance test is given at the top right: you should
obtain a chi-squared on 1 df of 18.47 with a p value of 0.0000
(i.e. <0.00005). This is the significance test for comparing
the model where A is in the regression equation compared
to a model where it is not, i.e. comparing models
ln(p/(1-p))=beta_1 * A2+beta_0
and
ln(p/(1-p))=beta_0
Check that we get exactly the same significance test when we type either
logit case A1
or
logit case A
The values of the regression coefficients beta_1 and beta_0
given in the table (denoted A and _cons in the Stata results window)
will differ
as the coding schemes for genotype corresponding to variables
A, A1 and A2 differ. However the significance tests should
be the same.
So far we have assumed a multiplicative model with
no dominance i.e. that having two copies of
allele 2 corresponds to twice the effect of having one copy.
We can avoid this assumption by making Stata fit a model
allowing for two genotype variables per genotype,
which corresponds to comparing models
ln(p/(1-p))=beta_1 * x1+beta_2 * x2+beta_0
and
ln(p/(1-p))=beta_0
where x1 and x2 are related to the genotype
variable A in an appropriate way. To do this type
xi:logit case i.A
xi:logit case i.A1
xi:logit case i.A2
The xi: and i. part of these commands basically
simply tells Stata to consider the predictor variable as a
categorical variable and to automatically create temporary
appropriate non-categorical
variables x1 and x2, which will appear and be called something like
_IA1_1 and _IA1_2 or
_IA_2 and _IA_3 in the Stata variables window.
These commands should each give a significance test of a
chi-squared on 2 df of 18.49 with a p value of 0.0001.
Again you should find that using variables
A, A1 and A2 are basically equivalent, except
that the difference in coding schemes results in different
estimated regression coefficients. Since the tests are equivalent,
from now on we will do everything
simply with respect to A2.
To compare the dominance and no dominance models,
we need to use the following sequence of commands:
xi:logit case i.A2
est store full
logit case A2
lrtest full
This fits the first model with dominance included, saves
the results, then fits the second model with no
dominance and tests the difference between the models.
The test to compare the two models that you just fitted
is found at the bottom right hand of the Stata results window.
You should get a chi-squared of 0.02 on 1 df with p value
0.8767 i.e. there is no significant evidence of dominance: the
multiplicative model is adequate.
It is often more convenient to output the coefficients
in terms of odds ratios with respect to an arbitrarily
chosen reference genotype. We do this using the odds ratio or
'or' option
for the logit command:
logit case A2, or
xi:logit case i.A2, or
This presents the results in terms of an odds ratio
with respect to the '1' allele or the '1/1' genotype.
We can do the same single-locus analysis for loci 2, 3 and 4 as
we have just done for locus 1. To save time, for now we will
just perform the 2df tests:
xi:logit case i.B2, or
xi:logit case i.C2, or
xi:logit case i.D2, or
You should find that the pattern of significance is the same as you found when
doing the chi-squared tests on the genotype tables.
So far we have just analysed each locus individually.
Now we will perform a stepwise analysis, where we enter or
discard variables in turn from the regression equation
We start with a forwards stepwise procedure.
Since locus 3 was the most significant predictor of disease status,
it makes sense to enter this first into the regression equation,
and then see if any other loci are significant given the effect
at locus 3. This can be done using nested likelihood ratio tests
as we did when testing for dominance, but it turns out to
be more convenient to use a slightly different approach
that simply discards terms from the regression equation.
So to see if locus 1 is important given the effect of locus
3, we fit a model in which both locus 1 and locus 3 are included,
and then discard any terms for locus 1:
xi:logit case i.C2 i.A2, or
testparm *A2*
This gives us a 2df test for locus 1 given locus 3
(chisquared = 0.55, p value = 0.76). A 1df test
(assuming no dominance at locus 1) can be performed by:
xi:logit case i.C2 A2, or
testparm A2
We can perform similar tests for locus 2 and locus 4:
xi:logit case i.C2 i.B2, or
testparm *B2*
xi:logit case i.C2 i.D2, or
testparm *D2*
xi:logit case i.C2 B2, or
testparm B2
xi:logit case i.C2 D2, or
testparm D2
You should find that no loci are significant
i.e. the first model (that includes genotypes at both loci)
is never significantly better than the model
that just includes locus 3. So locus 3 is
sufficient to account for whether or not someone is diseased.
We therefore stop the stepwise analysis.
If another locus, e.g.
locus 1, had turned out to be significant,
we would enter it into the regression equation along with locus 3
and see if any other loci are significant. E.g. to test if locus 2 is significant once one has accounted for loci 1 and 3, on either 2df or 1df, one would type:
xi:logit case i.C2 i.A2 i.B2, or
testparm *B2*
xi:logit case i.C2 i.A2 B2, or
testparm B2
We can also use a backwards stepwise procedure, where we
start by including all loci in the model, and then delete loci
that are not significant. However, in the interests of time, we will
not carry out this analysis today.
We shall now use logistic regression in Stata to test for epistatic
interactions between locus 3 and another unlinked locus (locus 5).
An epistatic interaction
means that the combined effect of locus 3 and 5 is greater than the
product (on the odds scale) or the sum (on the log odds scale)
of the locus 3 and locus 5 individual effects.
First get rid of the data in memory and read in the
new data. This data is the same as the original pedfile,
but with an additional column giving genotype at (unlinked) locus 5:
clear
ginsheet using newcasecondata.ped, preped zmiss
Next create appropriate genotype and case variables:
gtab L3_1 L3_2, gen(C)
gtab L5_1 L5_2, gen(E)
gen case=affected-1
The individual effects at locus 3 and 5 are now coded by the variables C2 and E2. We can test for association at each locus separately:
xi: logit case i.C2
xi: logit case i.E2
We then create a variable that codes for the combined effect of locus 3 and 4
as follows:
gen combo=10*C2+E2
Check you understand how the 'combo' variable relates to C2 and E2
by typing
list combo C2 E2
Now test whether there is significant epistasis by typing
xi:logit case i.combo
est store full
xi: logit case i.C2 i.E2
lrtest full
This fits the first model with epistasis included
(i.e. a model with 9 estimated parameters corresponding
to the 9 genotype combinations), saves
the results, then fits the second model with no
epistasis and tests the difference between the models.
The test to compare the two models that you just fitted
is found at the bottom right hand of the Stata results window.
You should get a chi-squared of 9.59 on 4 df with p value
0.048 i.e. there is marginal evidence of epistasis.
The above test is valid for testing for epistasis between
linked or unlinked loci, although it does not allow for
haplotype (phase) effects between linked loci.
A more powerful test for epistasis between unlinked loci only
is to use 'case-only' analysis and test whether
the genotypes at one locus predict those at the other,
in the cases alone. This is only valid at unlinked loci,
because at linked loci we expect genotypes at one locus
to predict those at the other (even in controls) due to linkage disequilibrium.
To do this, we can use a chi squared test to look for correlation/association
between the loci within the case and control groups separately:
bysort case: tabulate C2 E2, chi2
You should find that there is no association between the loci
in the control group, but there is significant association between the loci in the case group, indicating epistasis.
We have now finished the Stata analysis. Type
log close
to close the log file in which there will be a record of all
the commands you typed and all the results.
Then type
clear
exit
to get out of Stata and close the session.
We shall now repeat some of the
analyses
that we performed in Stata
in the statistical package R. The R package is somewhat
less user-friendly than Stata (particularly with regards to the output),
but has the advantage of being free and more widely available,
partly because a lot of bioinformatics software utilities have
been developed as add-in functions for the R package.
Before we start, take a look at the data file that we shall use:
casecondata.Rped
This differs from the pedigree file we used in the Stata analysis
by the addition of a header line describing what each of the
variables are, and by the changing of any '0's to R's own missing value
code 'NA'.
To start up the R package, click on the R icon. Then, under the File menu,
change directory (Cambia directory) so that R will look for any input files you specify in the C:DATA directory.
You are now working within the R package. To start with, you need
to read in the necessary add-in libraries (which we have
downloaded ready for you):
library(dgc.genetics)
To read in your data, type
casecon <- read.table("casecondata.Rped", header=T)
The " <- " operator stands for "is defined as"
(or "is assigned as") and is used a lot in R to create new
variables, new dataframes or new R objects.
Here this command reads your data into what is called a dataframe, essentially
a large matrix with columns corresponding to the different variables.
You chose to name the dataframe 'casecon' and you can look
at it simply by typing
casecon
or by typing
fix(casecon)
Each variable can be accessed by using the name of the dataframe
followed by a $ sign, followed by the variable name. E.g. to look at the
column of pedigree names, you just type
casecon$pedigree
It can be inconvenient to have to type 'casecon$' before
typing every variable,
so you can tell R to automatically look at variables in the
casecon dataframe by typing:
attach(casecon)
Now you can just look at the column of pedigree names by typing
pedigree
However, you should beware when using the 'attach' command
as if you already have a different variable in R called 'pedigree',
R will use this variable rather than the one in the attached dataframe
- without telling you that there may be some ambiguity. For this reason
it is
often convenient to clear all variables and detach all dataframes
before starting your session. This can be done (although don't do it
now)
by use of the clear() command.
Just like in Stata, we will generate a variable called 'case'
that codes affected/unaffected individuals as 1/0 rather than
2/1 (as they are coded in the 'affected' variable):
case <- affected-1
To look at the new case variable, type
case
Now we will generate genotype variables from the two alleles at each locus:
g1 <- genotype(loc1_1, loc1_2)
g2 <- genotype(loc2_1, loc2_2)
g3 <- genotype(loc3_1, loc3_2)
g4 <- genotype(loc4_1, loc4_2)
Take a look at one of these e.g. by typing
g1
To perform chi-squared tests on the genotype variables
(with disease status), type:
table(g1,case)
chisq.test(g1,case)
table(g2,case)
chisq.test(g2,case)
table(g3,case)
chisq.test(g3,case)
table(g4,case)
chisq.test(g4,case)
These should give you results that correspond very closely to
what you found in the Stata analysis of the same data.
To perfom a chromosome-based analysis, you can use the commands:
allele.table(g1,case)
chisq.test(allele.table(g1,case))
allele.table(g2,case)
chisq.test(allele.table(g2,case))
allele.table(g3,case)
chisq.test(allele.table(g3,case))
allele.table(g4,case)
chisq.test(allele.table(g4,case))
To perform the logistic regression analysis, we need to set
the variables
up for either a 2df (genotype-based) analysis or a 1df (chromosome-based)
analysis. For the genotype-based analysis, we use what are called
'genotype'
contrasts, and for the chromosome-based analysis we use
'additive' (on the log odds scale - equivalent to
multiplicative on the odds scale)
contrasts. So for the 2df test
at the first locus we use the following commands:
gcontrasts(g1) <- "genotype"
logit (case ~ g1)
anova(logit (case ~ g1))
The 'logit' command provides parameter estimates (odds ratios) and
p values for each regression coefficient, while the 'anova' command
gives an overall test in the form of a 'deviance' and a 'df'.
In this case we find a deviance of 18.49 on 2df. To find a p value
for this, we compare it to a chi-squared distribution by typing:
1-pchisq(18.49,2)
For a 1df test at the first locus, we use the following commands:
gcontrasts(g1) <- "additive"
logit (case ~ g1)
anova(logit (case ~ g1))
1-pchisq(18.47,1)
You should be able to work out how to do a similar analysis
on either 1df or 2df for the other loci. There is probably not time
to do this today, but if you do,
you should again find you get very similar results to the
logistic regression analysis in Stata.
You can also perform a stepwise analysis, where you add
or discard terms from the model. Suppose we wish to see whether
locus 3 is important once we have already included locus 1,
using 2df tests.
Then we would use the following commands:
gcontrasts(g1) <- "genotype"
gcontrasts(g3) <- "genotype"
logit(case ~ g1 + g3)
anova(logit(case ~ g1 + g3))
The effect of adding g3 to a model that already includes g1 is
seen to be 5.24 on 2df. The p value for this is found by
1-pchisq(5.24,2)
For a 1df test of g3, given 2df at g1, we would use:
gcontrasts(g1) <- "genotype"
gcontrasts(g3) <- "additive"
logit(case ~ g1 + g3)
anova(logit(case ~ g1 + g3))
1-pchisq(5.07,1)
All the other stepwise tests that we performed in Stata can also be
performed in a similar manner in R. If you have time, try some of them
out.
Once you are ready to finish with R, type
q()
n
to get out.
Analysis in a standard statistical package has the advantage of
allowing a lot of extra flexibility with regards to the models
and analyses performed e.g. it easy to include additional predictor variables such as environmetal factors, gene-environment interactions etc. However, you are required to know or
learn how to use the package in order to gain that extra flexibility,
and to produce reliable results.
The analyses described did not make use of any haplotype data. Possible
programs for performing
haplotype analysis are described below and one of these (the Unphased program) will be used in tomorrow's practicals. However, the logistic regression
approach does allow analysis of multiple linked loci (provided phase
information is not considered to be important).
With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.
Logistic regression analysis can be performed in other statistical
packages such as SAS, SPSS, GLIM. Chi-squared table analysis can also
be performed in other statistical packages and in some spreadsheets
such as Excel.
For haplotype analysis one can use several other packages
such as the HAPLO.STATS add-in for R by Dan Schaid, the UNPHASED
program by Frank Dudbridge, the WHAP program by Shaun Purcell, the
SIMHAP program from Lyle Palmer, or the HAPIPF add-in for Stata by
Adrian Mander. Alternatively, one can use the SNPHAP program by David
Clayton to estimate haplotypes for several multiply-imputed datasets,
and then analyse these using add-in multiple imputation functions for
Stata or R.
Breslow N and Day N (1980) Statistical methods in cancer research: The
analysis of case-control studies. International Agency for Research on
Cancer.
Breslow N and Day N (1993) Statistical methods in cancer
research: The design and analysis of cohort studies. International
Agency for Research on Cancer.
Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of
polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American
Journal of Human Genetics 70: 124-141.
Epstein M, Satten G (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American
Journal of Human Genetics 73: 1316-1329.
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA
(2002) Score tests for association between traits and haplotypes when
linkage phase is ambiguous. Am J Hum Genet. 70:425-34.
Schaid DJ (2004) Evaluating associations of haplotypes with traits.
Genetic Epidemiology 27:348-64.
Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn
JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC (2003) Modeling
and E-M estimation of haplotype-specific relative risks from genotype
data for a case-control study of unrelated individuals.
Hum Hered. 55:179-90.