In this exercise you will be carrying out chi-squared tests and logistic regression analysis of case/control data, to test for genetic association between a marker and a binary trait (such as disease status).
The methodology used is standard methodology for case/control analysis (chi-squared tests and logistic regression) as described in the lectures and implemented the computer package PLINK.
The main data for this exercise consists of genotype data at 4 linked SNP loci, typed in 384 type 1 diabetes cases and 672 controls. We will also take a brief look at a slightly larger data set (100 SNPs typed in 2000 cases and 2000 controls) in order to illustrate how the program scales up with increased numbers of SNPs and individuals.
Appropriate data for this exercise is genotype data for
a set of linked or unlinked loci typed in a group of unrelated
affected individuals (cases) and in a group of unaffected
or randomly chosen individuals from the same population
(controls).
All programs will deal with much larger numbers of loci than the 4 SNPs considered here. PLINK in particular was specifically designed for the analysis of large numbers of loci e.g. generated as part of a genome-wide association study.
First I suggest you make a new
folder on your H: drive, to keep today's programs and data files in.
Remember to use a folder name WITHOUT ANY SPACES.
If you have not already downloaded PLINK, you should save the following file to the
folder you just made:
plink.exe
[Right click, select Save Target As, and navigate through to
save it in the folder you made].
The data for the 4 linked loci
caseconped.txt
is in standard linkage pedigree file format, with columns corresponding to
family id, subject id (within family), father's id, mother's id,
sex (1=m, 2=f), affection status (1=unaffected, 2=affected)
and one column for each allele for each locus genotype.
Note that
since this is case/control
rather than family data,
there is only one individual per family and everyone's
parents are coded as unknown.
PLINK requires an additional map file caseconmap.txt describing the markers (in order) in the pedigree file. The PLINK-format map file contains exactly 4 columns:
chromosome (1-22, X, Y or 0 if unplaced) rs number or snp identifier Genetic distance (in Morgans or cM) Base-pair position (bp units)
Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to the appropriate directory (folder) on your computer.
To start with, you will need to open up an MSDOS window. To do
this, click on
Start (the round button on the bottom left), All Programs, Accessories, then click
on Command Prompt.
Once the window has opened, type dir to see all the files and directories
(folders) that are in your home space, and move into the directory where you saved the data files e.g. by typing
cd xxxxx
(where xxxxx is replaced by the name of the appropriate folder).
Type dir again to check the required files are available in the directory.
To carry out a basic association analysis in PLINK, type
plink --noweb --ped caseconped.txt --map caseconmap.txt --assoc
The --noweb command tells PLINK to run without bothering to check via the
web to see whether there are updated versions of PLINK. The --ped xxxx command tells PLINK that the name of the pedigree file is xxxx and the --map yyyy command tells PLINK that the name of the map file is yyyy. The --assoc command tells PLINK to perform a basic allele-based chi-squared association test.
PLINK outputs some useful messages (you should always read these to make sure they match up with what you expect!) and outputs the results to a file plink.assoc .
Take a look at the file plink.assoc e.g. by typing more plink.assoc, or (better)
by opening the file in Notepad, WordPad or Excel. To open in Excel, you will need to
start up Excel, click the top left button to find the "Open" icon, and navigate
through your filesystem to find the appropriate file. You may need to choose the
"All Files" option to make the file visible, click on "Open" and chose the
"Delimited" option ticking the "Space" cell, and click on "Next" and "Finish"
where
appropriate.
For each SNP the following
columns of
results
are reported:
CHR Chromosome SNP SNP ID BP Physical position (base-pair) A1 Minor allele name (based on whole sample) F_A Frequency of this allele in cases F_U Frequency of this allele in controls A2 Major allele name CHISQ Basic allelic test chi-square (1df) P Asymptotic p-value for this test OR Estimated odds ratio (for A1, i.e. A2 is reference)
You should find some evidence of association at all 4 loci, with the strongest result at rs003.
You can also use PLINK to calculate Fisher's exact test, which is more appropriate if you have
some cells (genotype categories) with very few observations:
plink --noweb --ped caseconped.txt --map caseconmap.txt --fisher
This produces the following columns of results in the file plink.assoc.fisher :
CHR Chromosome SNP SNP ID BP Physical position (base-pair) A1 Minor allele name (based on whole sample) F_A Frequency of this allele in cases F_U Frequency of this allele in controls A2 Major allele name P Exact p-value for this test OR Estimated odds ratio (for A1)
A more sophisticated case/control association test can be performed in plink using the --model
command:
plink --noweb --ped caseconped.txt --map caseconmap.txt --model
Results can be found in the file plink.model . For each SNP, this displays results from a 2df genotype test, a Cochran-Armitage trend test, a 1df allelic test, a dominant model (for the minor allele) test and a recessive model (for the minor allele) test. The columns in the file plink.model are:
CHR Chromosome number SNP SNP identifier TEST Type of test AFF Genotype/allele counts in cases UNAFF Genotype/allele counts in controls CHISQ Chi-squared statistic DF Degrees of freedom for test P Asymptotic p-value
In genome-wide association studies (GWAS), most people focus on reporting the
TREND test, unless
there is some strong
reason apriori to consider the other tests.
A logistic regression analysis can be performed in plink using the --logistic
command (with the additional --ci option if you want to output a confidence interval for the
estimated odds
ratio). Logistic regression has the flexibility of allowing the inclusion of additional covariates
as predictors of disease status (e.g. things like age, sex, smoking) although we won't include any of these
today.
To preform this analysis, type the following:
plink --noweb --ped caseconped.txt --map caseconmap.txt --logistic --ci 0.95
Results can be found in the file plink.assoc.logistic .
For each SNP, this displays:
CHR Chromosome number SNP SNP identifier BP Physical (base pair) position A1 Tested allele (minor allele by default) TEST Code for the type of test NMISS Number of non-missing individuals included on analysis OR Estimated odds ratio SE Standard error of estimated log odds ratio L95 Lower 95% conficence limit U95 Upper 95% conficence limit STAT Test statistic P Asymptotic p-value
Just to illustrate how PLINK can scale up to analysing larger numbers of SNPs and
individuals, we will repeat the basic association analysis
with the larger data set, for which you will need to download the following
files:
simcasecon.ped
simcasecon.map
To run the analysis and save the output in a new file (named
results100SNPs.assoc) type:
plink --noweb --ped simcasecon.ped --map simcasecon.map --assoc --out results100SNPs
You can take a look at the output file by typing: more
results100SNPs.assoc, but to see it properly I suggest you open it in Excel.
The results from the basic association tests are:
CHR SNP BP A1 F_A F_U A2 CHISQ P OR 1 rs001 123446 2 0.3542 0.45 1 18.27 1.916e-05 0.6703 1 rs002 123452 2 0.4245 0.3577 1 9.117 0.002532 1.324 1 rs003 123457 2 0.3659 0.4746 1 23.23 1.436e-06 0.6387 1 rs004 123458 1 0.3885 0.4946 2 21.82 3e-06 0.649
The results from Fishers exact tests are:
CHR SNP BP A1 F_A F_U A2 P OR 1 rs001 123446 2 0.3542 0.45 1 2.103e-05 0.6703 1 rs002 123452 2 0.4245 0.3577 1 0.002728 1.324 1 rs003 123457 2 0.3659 0.4746 1 1.411e-06 0.6387 1 rs004 123458 1 0.3885 0.4946 2 2.942e-06 0.649
The results from the full set of association tests are:
CHR SNP A1 A2 TEST AFF UNAFF CHISQ DF P 1 rs001 2 1 GENO 47/178/159 131/323/196 18.24 2 0.0001094 1 rs001 2 1 TREND 272/496 585/715 18.24 1 1.947e-05 1 rs001 2 1 ALLELIC 272/496 585/715 18.27 1 1.916e-05 1 rs001 2 1 DOM 225/159 454/196 13.56 1 0.0002314 1 rs001 2 1 REC 47/337 131/519 10.61 1 0.001125 1 rs002 2 1 GENO 65/196/123 78/309/263 9.439 2 0.008918 1 rs002 2 1 TREND 326/442 465/835 9.437 1 0.002126 1 rs002 2 1 ALLELIC 326/442 465/835 9.117 1 0.002532 1 rs002 2 1 DOM 261/123 387/263 7.333 1 0.00677 1 rs002 2 1 REC 65/319 78/572 4.917 1 0.02659 1 rs003 2 1 GENO 53/175/156 150/317/183 22.55 2 1.271e-05 1 rs003 2 1 TREND 281/487 617/683 22.52 1 2.079e-06 1 rs003 2 1 ALLELIC 281/487 617/683 23.23 1 1.436e-06 1 rs003 2 1 DOM 228/156 467/183 17.04 1 3.666e-05 1 rs003 2 1 REC 53/331 150/500 13.16 1 0.0002859 1 rs004 1 2 GENO 57/182/142 163/316/170 21.36 2 2.305e-05 1 rs004 1 2 TREND 296/466 642/656 21.28 1 3.969e-06 1 rs004 1 2 ALLELIC 296/466 642/656 21.82 1 3e-06 1 rs004 1 2 DOM 239/142 479/170 13.95 1 0.0001879 1 rs004 1 2 REC 57/324 163/486 14.74 1 0.0001235
The results from the logistic regression analysis are:
CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 1 rs001 123446 2 ADD 1034 0.6683 0.09498 0.5548 0.8051 -4.243 2.205e-005 1 rs002 123452 2 ADD 1034 1.34 0.09551 1.111 1.615 3.061 0.002206 1 rs003 123457 2 ADD 1034 0.6445 0.09329 0.5368 0.7738 -4.708 2.497e-006 1 rs004 123458 1 ADD 1030 0.6531 0.09303 0.5443 0.7838 -4.579 4.677e-006
With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.
Logistic regression analysis can be performed in other statistical packages such as R, Stata, SAS, SPSS, GLIM. Chi-squared table analysis can also be performed in other statistical packages and in some spreadsheets such as Excel. A likelihood-based genetic association analysis can be performed in the programs LAMP and UNPHASED.
Breslow N and Day N (1980) Statistical methods in cancer research: The
analysis of case-control studies. International Agency for Research on
Cancer.
Breslow N and Day N (1993) Statistical methods in cancer
research: The design and analysis of cohort studies. International
Agency for Research on Cancer.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR,
Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007)
PLINK: a toolset for whole-genome association and population-based
linkage analysis. American Journal of Human Genetics, 81:559-575.
PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:
http://zzz.bwh.harvard.edu/plink/