In this exercise you will be carrying out chi-squared tests and logistic regression analysis of case/control data, to test for genetic association and for gene-gene interaction.
The methodology used is standard methodology for case/control
analysis (chi-squared tests and logistic regression) as
described in the lectures.
The tests will performed in the statistical analysis package
R. As well as using standard functions
implemented in R, we will
also make use of special functions designed for genetic
analysis that can be downloaded as add-in routines to the
R package.
Some of the commands that you will type in order to
do the analysis may seem a bit mystifying if you are not
familiar with using statistical packages such as R.
Don't worry too much if you don't understand all the details!
If you decide to regularly use a statistical package such as R
to analyse your own data, it will be important to
learn how to use that particular package appropriately through attendance on
a training course or careful
reading of an introductory textbook.
The R website is at
http://www.r-project.org/
David Clayton's add-in routines (the dgc.genetics package) can be found at
http://www-gene.cimr.cam.ac.uk/clayton/software
From within
R, one can obtain help on any command xxxx by typing `help(xxxx)'
The data consists of genotype data at 4 linked SNP loci and another unlinked locus, typed in 384 type 1 diabetes cases and 672 controls.
Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls). Ideally a larger data set should be used than we consider here e.g. 1000 cases and 1000 controls might be considered a reasonable number.
The data for the 4 linked loci
casecondata.txt
is in a similar format to a standard linkage pedigree file, with columns corresponding to
family id, subject id (within family), father's id, mother's id,
sex (1=m, 2=f), affection status (1=unaffected, 2=affected)
and one column for each allele for each locus genotype.
The pedigree file used for the analysis in R differs from a standard
pedigree file in that it has a header line
describing the different columns, and it uses R's own missing value
code "NA". Note that
since this is case/control
rather than family data,
there is only one individual per family and everyone's
parents are coded as unknown.
The file
newcasecondata.txt
is an R format file that is identical to casecondata.txt except that it includes genotypes for an additional 5th unlinked SNP locus.
Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to the appropriate directory (folder) on your computer.
To start up the R package, click on the R icon. Then, under the File menu,
change directory so that R will look in the correct directory (the folder where you saved the files) for any input files you specify.
You are now working within the R package. To start with, you need
to read in the necessary add-in libraries (which we have
downloaded ready for you):
library(dgc.genetics)
To read in your data, type
casecon <- read.table("casecondata.txt", header=T)
The " <- " operator stands for "is defined as"
(or "is assigned as") and is used a lot in R to create new
variables, new dataframes or new R objects.
Here this command reads your data into what is called a dataframe, essentially
a large matrix with columns corresponding to the different variables.
You chose to name the dataframe 'casecon' and you can look
at it simply by typing
casecon
or by typing
fix(casecon)
Each variable can be accessed by using the name of the dataframe
followed by a $ sign, followed by the variable name. E.g. to look at the
column of pedigree names, you just type
casecon$pedigree
It can be inconvenient to have to type 'casecon$' before
typing every variable,
so you can tell R to automatically look at variables in the
casecon dataframe by typing:
attach(casecon)
Now you can just look at the column of pedigree names by typing
pedigree
However, you should beware when using the 'attach' command
as if you already have a different variable in R called 'pedigree',
R will use this variable rather than the one in the attached dataframe
- without telling you that there may be some ambiguity. For this reason
it is
often convenient to clear all variables and detach all dataframes
before starting your session. This can be done (BUT DONT DO IT
NOW!)
by use of the clear() command.
To perform the analysis we will first generate a variable called 'case'
that codes affected/unaffected individuals as 1/0 rather than
2/1 (as they are coded in the 'affected' variable):
case <- affected-1
To look at the new case variable, type
case
Now we will generate genotype variables from the two alleles at each locus:
g1 <- genotype(loc1_1, loc1_2)
g2 <- genotype(loc2_1, loc2_2)
g3 <- genotype(loc3_1, loc3_2)
g4 <- genotype(loc4_1, loc4_2)
Take a look at one of these e.g. by typing
g1
To perform chi-squared tests on the genotype variables
(to see if they are associated with disease status), type:
table(g1,case)
chisq.test(g1,case)
table(g2,case)
chisq.test(g2,case)
table(g3,case)
chisq.test(g3,case)
table(g4,case)
chisq.test(g4,case)
All loci should show significant association, with C and D (locus 3 and 4) showing the highest significance.
To perfom a chromosome-based analysis, you can use the commands:
allele.table(g1,case)
chisq.test(allele.table(g1,case))
allele.table(g2,case)
chisq.test(allele.table(g2,case))
allele.table(g3,case)
chisq.test(allele.table(g3,case))
allele.table(g4,case)
chisq.test(allele.table(g4,case))
To perform the logistic regression analysis, we need to set
the variables
up for either a 2df (genotype-based) analysis or a 1df (chromosome-based)
analysis. For the genotype-based analysis, we use what are called
'genotype'
contrasts, and for the chromosome-based analysis we use
'additive' (on the log odds scale - equivalent to
multiplicative on the odds scale)
contrasts. So for the 2df test
at the first locus we use the following commands:
gcontrasts(g1) <- "genotype"
logit (case ~ g1)
anova(logit (case ~ g1))
The 'logit' command provides parameter estimates (odds ratios) and
p values for each regression coefficient, while the 'anova' command
gives an overall test in the form of a 'deviance' and a 'df'.
In this case we find a deviance of 18.49 on 2df. To find a p value
for this, we compare it to a chi-squared distribution by typing:
1-pchisq(18.49,2)
For a 1df test at the first locus, we use the following commands:
gcontrasts(g1) <- "additive"
logit (case ~ g1)
anova(logit (case ~ g1))
1-pchisq(18.47,1)
You should be able to work out how to do a similar analysis
on either 1df or 2df for the other loci.
You should find that the pattern of significance is the same as you found when doing the chi-squared tests on the genotype or allele tables.
You can also perform a stepwise analysis, where you add
or discard terms from the model. Suppose we wish to see whether
locus 3 is important once we have already included locus 1,
using 2df tests.
Then we would use the following commands:
gcontrasts(g1) <- "genotype"
gcontrasts(g3) <- "genotype"
logit(case ~ g1 + g3)
anova(logit(case ~ g1 + g3))
The effect of adding g3 to a model that already includes g1 is
seen to be 5.24 on 2df. The p value for this is found by
1-pchisq(5.24,2)
For a 1df test of g3, given 2df at g1, we would use:
gcontrasts(g1) <- "genotype"
gcontrasts(g3) <- "additive"
logit(case ~ g1 + g3)
anova(logit(case ~ g1 + g3))
1-pchisq(5.07,1)
We shall now use logistic regression in R to test for epistatic
interactions between locus 3 and another unlinked locus (locus 5).
An epistatic interaction
means that the combined effect of locus 3 and 5 is greater than the
product (on the odds scale) or the sum (on the log odds scale)
of the locus 3 and locus 5 individual effects.
First get rid of the data in the memory and read in the
new data. This data is the same as the original pedfile,
but with an additional column giving genotype at (unlinked) locus 5:
detach(casecon)
newcasecon <- read.table("newcasecondata.txt", header=T)
attach(newcasecon)
You can look at the data by typing
fix(newcasecon)
Next create appropriate genotype and case variables:
case <- affected-1
g3 <- genotype(loc3_1, loc3_2)
g5 <- genotype(loc5_1, loc5_2)
The individual effects at locus 3 and 5 are now coded by the variables g3 and g5. We can test for association at each locus separately:
gcontrasts(g3) <- "genotype"
logit (case ~ g3)
anova(logit (case ~ g3))
gcontrasts(g5) <- "genotype"
logit (case ~ g5)
anova(logit (case ~ g5))
In order to investigate epistasis, it is more convenient to create new variables that code numerically for the number of copies of allele 2 in each genotypes
count3<-allele.count(g3,2)
count5<-allele.count(g5,2)
Check you understand how variables count3 and count5 relate to g3 and g5 by typing
count3
count5
We then create a variable that codes for the combined effect of locus 3 and 5
as follows:
combo<-10*count3+count5
Check you understand how the 'combo' variable relates to g3 and g5
by typing
combo
Now we need to code each of these variables as 'factors' which means we simply consider the numeric codes to act as labels for the different categories
rather than having numeric meaning:
fact3<-factor(count3)
fact5<-factor(count5)
factcombo<-factor(combo)
Check that the analysis with the 'factors' gives the same results as you found previously with the genotype variables:
anova(logit (case ~ fact3))
anova(logit (case ~ fact5))
Now test whether there is significant epistasis by typing
anova(logit(case ~ fact3 + fact5 + factcombo))
1-pchisq(9.59,4)
This first fits the individual locus factors, and then adds in the extra effect of looking at the model with epistasis included
(i.e. a model with 9 estimated parameters corresponding
to the 9 genotype combinations), and tests the difference between the models.
You should get a chi-squared of 9.59 on 4 df with p value
0.048 i.e. there is marginal evidence of epistasis.
The above test is valid for testing for epistasis between
linked or unlinked loci, although it does not allow for
haplotype (phase) effects between linked loci.
A more powerful test for epistasis between UNLINKED LOCI ONLY
is to use 'case-only' analysis and test whether
the genotypes at one locus predict those at the other,
in the cases alone. This is only valid at unlinked loci,
because at linked loci we expect genotypes at one locus
to predict those at the other (even in controls) due to linkage disequilibrium.
To do this, we can use a chi squared test to look for correlation (association)
between the loci within the case and control groups separately.
First we need to set up 2 new vectors of genotypes for loci 3 and 5, using only the cases.
To do this, we can take advantage of the fact that the data has been ordered in such a way that cases are the first 384 observations. (Check this by typing
case or fix(newcasecon) ). So we can create genotype vectors just for the cases using the following commands
caseg3<-g3[1:384]
caseg5<-g5[1:384]
Take a look at the vectors you have created by typing
caseg3
caseg5
Now do a chi-squared test on the genotype variables
to see if they are correlated with each other:
table(caseg3,caseg5)
chisq.test(caseg3,caseg5)
You should find much more significant evidence of epistasis (p value 0.0018) than you did using logistic regression. This is not surprising as the case-only test of interaction is a more powerful test. However, the case-only test does rely on the assumption that the two genotype variables g3 and g5 are uncorrelated in the general population. Strictly speaking, we cannot test this assumption as we do not have a population-based control sample (our controls are all unaffected). However, if the disease is rare, our controls should be reasonably close to an unselected sample. So we can use them to see if the genotype variables g3 and g5 are uncorrelated in the control population:
contg3<-g3[385:1056]
contg5<-g5[385:1056]
contg3
contg5
table(contg3,contg5)
chisq.test(contg3,contg5)
You should find a non-significant p value (p=0.99). This suggests that the case-only analysis we did is valid, so there is indeed some reasonable (p=0.002) evidence for statistical interaction between these loci.
Analysis in a standard statistical package has the advantage of
allowing a lot of extra flexibility with regards to the models
and analyses performed e.g. it easy to include additional predictor variables such as environmetal factors, gene-environment interactions etc. However, you are required to know or
learn how to use the package in order to gain that extra flexibility,
and to produce reliable results.
The analyses described did not make use of any haplotype data. Possible
programs for performing
haplotype analysis are described below. However, the logistic regression
approach does allow analysis of multiple linked loci (provided phase
information is not considered to be important).
With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.
Logistic regression analysis can be performed in other statistical
packages such as Stata, SAS, SPSS, GLIM. Chi-squared table analysis can also
be performed in other statistical packages and in some spreadsheets
such as Excel.
For haplotype analysis one can use several other packages
such as the HAPLO.STATS add-in for R by Dan Schaid, the UNPHASED
program by Frank Dudbridge, the WHAP program by Shaun Purcell, the
SIMHAP program from Lyle Palmer, or the HAPIPF add-in for Stata by
Adrian Mander. Alternatively, one can use the SNPHAP program by David
Clayton to estimate haplotypes for several multiply-imputed datasets,
and then analyse these using add-in multiple imputation functions for
Stata or R.
Breslow N and Day N (1980) Statistical methods in cancer research: The
analysis of case-control studies. International Agency for Research on
Cancer.
Breslow N and Day N (1993) Statistical methods in cancer
research: The design and analysis of cohort studies. International
Agency for Research on Cancer.
Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of
polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American
Journal of Human Genetics 70: 124-141.
Epstein M, Satten G (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American
Journal of Human Genetics 73: 1316-1329.
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA
(2002) Score tests for association between traits and haplotypes when
linkage phase is ambiguous. Am J Hum Genet. 70:425-34.
Piegorsch WW, Weinberg CR, Taylor JA (1994) Non-hierarchical logistic models
and case-only designs for assessing susceptibility
in population-based case-control
studies. Statistics in Medicine 13:153-162.
Schaid DJ (2004) Evaluating associations of haplotypes with traits.
Genetic Epidemiology 27:348-64.
Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn
JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC (2003) Modeling
and E-M estimation of haplotype-specific relative risks from genotype
data for a case-control study of unrelated individuals.
Hum Hered. 55:179-90.
Weinberg CR and Umbach DM (2003) Choosing a retrospective design to assess
joint genetic and environmental contributions to risk. American Journal of Epidemiology 152:197-203.