In this exercise you will be carrying out chi-squared tests and likelihood-based analysis of case/control data, to test for genetic association.
The methodology used is standard methodology for case/control
analysis (chi-squared tests and logistic regression) as
described in the lectures and implemented in various computer
packages.
Some of the commands that you will type in order to
do the analysis may seem a bit mystifying if you are not
familiar with using statistical packages such as R.
Don't worry too much if you don't understand all the details!
If you decide to regularly use a statistical package such as R
to analyse your own data, it will be important to
learn how to use that particular package appropriately through attendance on
a training course or careful
reading of an introductory textbook.
PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:
http://pngu.mgh.harvard.edu/~purcell/plink/
The R website is at
http://www.r-project.org/
From within
R, one can obtain help on any command xxxx by typing `help(xxxx)'
Full documentation can be found at
http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/
The program runs either in command-line mode, or through a Java-based
graphical interface. We will be using the command-line mode, since this
is easier to explain and allows greater control over the desired options.
The data consists of genotype data at 4 linked SNP loci, typed in 384 type 1 diabetes cases and 672 controls.
Appropriate data for this exercise is genotype data for
a set of linked or unlinked loci typed in a group of unrelated
affected individuals (cases) and in a group of unaffected
or randomly chosen individuals from the same population
(controls). Ideally a larger data
set should be used than we consider here e.g. 1000 cases and 1000
controls might be considered a reasonable number.
All the programs will deal with much larger numbers of loci than the 4 SNPs considered here. PLINK in particular was specifically designed for the analysis of large numbers of loci e.g. generated as part of a genome-wide association study.
All the commands required to run these analyses are given below. However, packages such as Stata, SAS, Splus and R are sophisticated statistical programming packages and have much greater functionality than will be described here. If you intend to use a statistical package to analyse your data, you are strongly encouraged to learn how to use that package appropriately through attendance on a training course or careful reading of an introductory textbook.
The data for the 4 linked loci
caseconped.txt
is in standard linkage pedigree file format, with columns corresponding to
family id, subject id (within family), father's id, mother's id,
sex (1=m, 2=f), affection status (1=unaffected, 2=affected)
and one column for each allele for each locus genotype.
Note that
since this is case/control
rather than family data,
there is only one individual per family and everyone's
parents are coded as unknown.
PLINK requires an additional map file caseconmap.txt describing the markers (in order) in the pedigree file. The PLINK-format map file contains exactly 4 columns:
chromosome (1-22, X, Y or 0 if unplaced) rs number or snp identifier Genetic distance (morgans) Base-pair position (bp units)
Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to the appropriate directory (folder) on your computer.
Move to the directory where you have saved the data files.
To carry out a basic association analysis in PLINK, type
plink --noweb --ped caseconped.txt --map caseconmap.txt --assoc
The --noweb command tells PLINK to run without bothering to check via the web to see whether there are updated versions of PLINK. The --ped xxxx command tells PLINK that the name of the pedigree file is xxxx and the --map yyyy command tells PLINK that the name of the map file is yyyy. The --assoc command tells PLINK to perform a basic allele-based chisquared association test.
PLINK outputs some useful messages (you should always read these to make sure they match up with what you expect!) and outputs the results to a file plink.assoc .
Take a look at the file plink.assoc (e.g. by typing more plink.assoc ). For each SNP the following columns of results are reported:
CHR Chromosome SNP SNP ID BP Physical position (base-pair) A1 Minor allele name (based on whole sample) F_A Frequency of this allele in cases F_U Frequency of this allele in controls A2 Major allele name CHISQ Basic allelic test chi-square (1df) P Asymptotic p-value for this test OR Estimated odds ratio (for A1, i.e. A2 is reference)
You should find some evidence of association at all 4 loci, with the strongest result at rs003.
You can also use PLINK to calculate Fishers exact test:
plink --noweb --ped caseconped.txt --map caseconmap.txt --fisher
This produces the following columns of results in the file plink.fisher :
CHR Chromosome SNP SNP ID BP Physical position (base-pair) A1 Minor allele name (based on whole sample) F_A Frequency of this allele in cases F_U Frequency of this allele in controls A2 Major allele name P Exact p-value for this test OR Estimated odds ratio (for A1)
The most sophisticated case/control association testing can be performed in plink using the model command:
plink --noweb --ped caseconped.txt --map caseconmap.txt --model
Results can be found in the file plink.model . For each SNP, this displays results from a 2df genotype test, a Cochran-Armitage trend test, a 1df allelic test, a dominant model (for the minor allele) test and a recessive model (for the minor allele) test. The columns in the file plink.model are:
CHR Chromosome number SNP SNP identifier TEST Type of test AFF Genotype/allele counts in cases UNAFF Genotype/allele counts in controls CHISQ Chi-squared statistic DF Degrees of freedom for test P Asymptotic p-value
Stay in the directory where you have saved the data files. To start up the R package, type
R
You are now working within the R package. To start with, you need
to read in the data:
casecon <-read.table("caseconped.txt", header=F, na.strings = "0")
The " <- " operator stands for "is defined as"
(or "is assigned as") and is used a lot in R to create new
variables, new dataframes or new R objects.
Here this command reads your data into what is called a dataframe, essentially
a large matrix with columns corresponding to the different variables.
You chose to name the dataframe 'casecon' and you can look
at it simply by typing
casecon
If the dataframe is too large you will only see the bottom of it. You can look at the top 6 lines by typing
head(casecon)
or by typing
fix(casecon)
Each variable can be accessed by using the name of the dataframe
followed by a $ sign, followed by the variable name. E.g. to look at the
column of pedigree names, you just type
casecon$V1
Check that you understand how the output relates to the column of pedigree names you saw when you used fix(casecon) (note that the column of pedigrees gets wrapped round when you look at it by typing casecon$V1).
It can be inconvenient to have to type 'casecon$' before
typing every variable,
so you can tell R to automatically look at variables in the
casecon dataframe by typing:
attach(casecon)
Now you can just look at the column of pedigree names by typing
V1
However, you should beware when using the 'attach' command
as if you already have a different variable in R called 'V1',
R will use this variable rather than the one in the attached dataframe
- without telling you that there may be some ambiguity.
To perform the analysis we will first generate a variable called 'case'
that codes affected/unaffected individuals as 1/0 rather than
2/1 (as they are coded in the 'affected' variable V6):
case <- V6-1
To look at the new case variable, type
case
To compare with the old 'affected' variable, type
V6
Now we will generate genotype variables from the two alleles at each locus:
g1 <- V7+V8-2
g2 <- V9+V10-2
g3 <- V11+V12-2
g4 <- V13+V14-2
This generates 4 diferent genotype variables which code for the genotypes at the 4 loci. Can you see how this has worked?
The genotype values 0, 1, 2 correspond to the number of copies of the '2' allele that a person has.
Take a look at the first of these genotype variables by typing
g1
To perform chi-squared tests on each of the genotype variables
(to see if they are associated with disease status), type:
table(g1,case)
chisq.test(g1,case)
table(g2,case)
chisq.test(g2,case)
table(g3,case)
chisq.test(g3,case)
table(g4,case)
chisq.test(g4,case)
All loci should show significant association, with C and D (locus 3 and 4) showing the highest significance.
The results should be similar to what you found using the genotype tests in PLINK.
To perform the logistic regression, it is helpful to use a special library of R routines.
Read in the necessary add-in library (which we have downloaded ready for you):
library(DGCgenetics)
To perform allele tests at each marker using logistic regression, we use the following commands:
logistic (case ~ g1)
logistic (case ~ g2)
logistic (case ~ g3)
logistic (case ~ g4)
The 'logistic' command provides parameter estimates (odds ratios and confidence limits), z tests and
p values for each test performed. The results should be very similar to the basic
association or trend tests performed in PLINK.
For genotype analysis at each marker, use the following commands:
logistic (case ~ factor(g1))
logistic (case ~ factor(g2))
logistic (case ~ factor(g3))
logistic (case ~ factor(g4))
Again the results should be similar to the genotype tests performed in PLINK. The advantage of logistic regression is that you could put in several different genotype variables into the same model. To see, for example, whether locus 3 adds significantly to a model that already includes locus 4, type:
logistic (case ~ factor(g4)+factor(g3))
anova(logistic (case ~ factor(g4)+factor(g3)))
Here the terms get added into the model in the order you specified, so to see the significance of adding in locus 3, you need work out the significance (p value) of the reported 'deviance' which is distrobuted as a chi-squared on 2 df. To do this type:
1-pchisq(2.03,2)
To see the significance of adding locus 4 to a model that already includes locus 3, you would need to type
logistic (case ~ factor(g3)+factor(g4))
anova(logistic (case ~ factor(g3)+factor(g4)))
1-pchisq(0.66,2)
To get out of R, type q()
Finally, we shall use the program UNPHASED to repeat the single-locus analyses we did above, and to additionally perform haplotype tests.
To run an allelic test on each locus in turn in the pedigree file caseconped.txt, simply type (from the unix prompt):
unphased caseconped.txt
The output gets reported to the screen. To
re-run the analysis and save the output to a file named unphased.out,
type:
unphased caseconped.txt > unphased.out
Once the program has finished running, you can check that a new output file
unphased.out has been created by typing ls
Take a look at the output file (e.g. using the more command). First comes some program-specific information and then come the results (i.e. tests of association and
estimates
of the allelic or
haplotypic odds ratios) for each marker. The most important information is the likelihood ratio chisq test and p-value that come under the heading TEST OF OVERALL ASSOCIATION .
The tests and odds ratios (in the column marked Odds-R) should be similar to what you found in your allele-based logistic regression analysis and in the basic association or trend tests in PLINK. To instead perform a genotype-based test at each locus, type
unphased caseconped.txt -genotype > unphasedgeno.out
Now the results should be similar to what you found in your genotype-based PLINK, chi-squared table and logistic regression analyses.
To perform a haplotype analysis of all four loci, averaging over uncertain haplotypes, type
unphased caseconped.txt -window 4 > unphasedhaplo.out
You can see from the results (the estimates of haplotype effects) that there are some rather rare haplotypes. This can make the results are a bit unreliable. In order to remove these rare haplotypes from the analysis, using a cut-off threshold of 1%, for example, type:
unphased caseconped.txt -window 4 -zero 0.01 > unphasedzero.out
To output the estimates of haplotype effects all with reference to the most common haplotype, 2-1-2-1, type
unphased caseconped.txt -window 4 -zero 0.01 -reference 2 1 2 1 > unphasedcommon.out
UNPHASED will also allow you to look at haplotypes from subsets of loci, or from sliding windows. For example, to look at haplotypes constructed from alleles at locus 3 and 4 only, imposing a 1% threshold for discarding rare haplotypes, type
unphased caseconped.txt -marker 3 4 -window 2 -zero 0.01 > unphased34.out
To see the other possible analysis options in UNPHASED, please consult the UNPHASED documentation.
The results from the basic association tests are:
CHR SNP BP A1 F_A F_U A2 CHISQ P OR 1 rs001 123446 2 0.3542 0.45 1 18.27 1.916e-05 0.6703 1 rs002 123452 2 0.4245 0.3577 1 9.117 0.002532 1.324 1 rs003 123457 2 0.3659 0.4746 1 23.23 1.436e-06 0.6387 1 rs004 123458 1 0.3885 0.4946 2 21.82 3e-06 0.649
The results from Fishers exact tests are:
CHR SNP BP A1 F_A F_U A2 P OR 1 rs001 123446 2 0.3542 0.45 1 2.103e-05 0.6703 1 rs002 123452 2 0.4245 0.3577 1 0.002728 1.324 1 rs003 123457 2 0.3659 0.4746 1 1.411e-06 0.6387 1 rs004 123458 1 0.3885 0.4946 2 2.942e-06 0.649
The results from the full set of association tests are:
CHR SNP A1 A2 TEST AFF UNAFF CHISQ DF P 1 rs001 2 1 GENO 47/178/159 131/323/196 18.24 2 0.0001094 1 rs001 2 1 TREND 272/496 585/715 18.24 1 1.947e-05 1 rs001 2 1 ALLELIC 272/496 585/715 18.27 1 1.916e-05 1 rs001 2 1 DOM 225/159 454/196 13.56 1 0.0002314 1 rs001 2 1 REC 47/337 131/519 10.61 1 0.001125 1 rs002 2 1 GENO 65/196/123 78/309/263 9.439 2 0.008918 1 rs002 2 1 TREND 326/442 465/835 9.437 1 0.002126 1 rs002 2 1 ALLELIC 326/442 465/835 9.117 1 0.002532 1 rs002 2 1 DOM 261/123 387/263 7.333 1 0.00677 1 rs002 2 1 REC 65/319 78/572 4.917 1 0.02659 1 rs003 2 1 GENO 53/175/156 150/317/183 22.55 2 1.271e-05 1 rs003 2 1 TREND 281/487 617/683 22.52 1 2.079e-06 1 rs003 2 1 ALLELIC 281/487 617/683 23.23 1 1.436e-06 1 rs003 2 1 DOM 228/156 467/183 17.04 1 3.666e-05 1 rs003 2 1 REC 53/331 150/500 13.16 1 0.0002859 1 rs004 1 2 GENO 57/182/142 163/316/170 21.36 2 2.305e-05 1 rs004 1 2 TREND 296/466 642/656 21.28 1 3.969e-06 1 rs004 1 2 ALLELIC 296/466 642/656 21.82 1 3e-06 1 rs004 1 2 DOM 239/142 479/170 13.95 1 0.0001879 1 rs004 1 2 REC 57/324 163/486 14.74 1 0.0001235
Interpretation of the output is described in the step-by-step instructions. In general, the output will consist of a likelihood-ratio or chi-squared test for whatever you are test you are performing, and regression coefficients or odds ratio estimates for the predictor variables in the current model. Please ask if you need help in understanding the output for any specific test.
Analysis in a standard statistical package has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed e.g. it easy to include additional predictor variables such as environmental factors, gene-environment interactions etc. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.
With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.
Logistic regression analysis can be performed in other statistical
packages such as Stata, SAS, SPSS, GLIM. Chi-squared table analysis can also
be performed in other statistical packages and in some spreadsheets
such as Excel.
For haplotype analysis one can use several other packages
such as PLINK, GeneBPM and the haplo.stats add-in library for R.
Breslow N and Day N (1980) Statistical methods in cancer research: The
analysis of case-control studies. International Agency for Research on
Cancer.
Breslow N and Day N (1993) Statistical methods in cancer
research: The design and analysis of cohort studies. International
Agency for Research on Cancer.
Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of
polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American
Journal of Human Genetics 70:124-141.
Dudbridge (2008) Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data.
Hum Hered 66:87-98.
Epstein M, Satten G (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American
Journal of Human Genetics 73:1316-1329.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR,
Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007)
PLINK: a toolset for whole-genome association and population-based
linkage analysis. American Journal of Human Genetics, 81:559-575.
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 70:425-34.
Schaid DJ (2004) Evaluating associations of haplotypes with traits.
Genetic Epidemiology 27:348-64.
Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn
JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC (2003) Modeling
and E-M estimation of haplotype-specific relative risks from genotype
data for a case-control study of unrelated individuals.
Hum Hered. 55:179-90.