Case/Control Association Analysis using PLINK

Overview

Purpose

In this exercise you will be carrying out chi-squared tests and logistic regression analysis of case/control data, to test for genetic association between a marker and a binary trait (such as disease status).

Methodology

The methodology used is standard methodology for case/control analysis (chi-squared tests and logistic regression) as described in the lectures and implemented the computer package PLINK.

Data overview

The main data for this exercise consists of genotype data at 4 linked SNP loci, typed in 384 type 1 diabetes cases and 672 controls. We will also take a brief look at a slightly larger data set (100 SNPs typed in 2000 cases and 2000 controls) in order to illustrate how the program scales up with increased numbers of SNPs and individuals.

Appropriate data

Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls).

All programs will deal with much larger numbers of loci than the 4 SNPs considered here. PLINK in particular was specifically designed for the analysis of large numbers of loci e.g. generated as part of a genome-wide association study.


Instructions

Programs

First I suggest you make a new folder on your H: drive, to keep today's programs and data files in. Remember to use a folder name WITHOUT ANY SPACES.

If you have not already downloaded PLINK, you should save the following file to the folder you just made:

plink.exe

[Right click, select Save Target As, and navigate through to save it in the folder you made].

Data format

The data for the 4 linked loci caseconped.txt is in standard linkage pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Note that since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown.

PLINK requires an additional map file caseconmap.txt describing the markers (in order) in the pedigree file. The PLINK-format map file contains exactly 4 columns:

     chromosome (1-22, X, Y or 0 if unplaced) 
     rs number or snp identifier 
     Genetic distance (in Morgans or cM) 
     Base-pair position (bp units) 

Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to the appropriate directory (folder) on your computer.

Step-by-step instructions

Basic Association Analysis in PLINK

To start with, you will need to open up an MSDOS window. To do this, click on Start (the round button on the bottom left), All Programs, Accessories, then click on Command Prompt.

Once the window has opened, type dir to see all the files and directories (folders) that are in your home space, and move into the directory where you saved the data files e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.


To carry out a basic association analysis in PLINK, type

plink --noweb --ped caseconped.txt --map caseconmap.txt --assoc

The --noweb command tells PLINK to run without bothering to check via the web to see whether there are updated versions of PLINK. The --ped xxxx command tells PLINK that the name of the pedigree file is xxxx and the --map yyyy command tells PLINK that the name of the map file is yyyy. The --assoc command tells PLINK to perform a basic allele-based chi-squared association test.

PLINK outputs some useful messages (you should always read these to make sure they match up with what you expect!) and outputs the results to a file plink.assoc .

Take a look at the file plink.assoc e.g. by typing more plink.assoc, or (better) by opening the file in Notepad, WordPad or Excel. To open in Excel, you will need to start up Excel, click the top left button to find the "Open" icon, and navigate through your filesystem to find the appropriate file. You may need to choose the "All Files" option to make the file visible, click on "Open" and chose the "Delimited" option ticking the "Space" cell, and click on "Next" and "Finish" where appropriate.

For each SNP the following columns of results are reported:

     CHR     Chromosome
     SNP     SNP ID
     BP      Physical position (base-pair)
     A1      Minor allele name (based on whole sample)
     F_A     Frequency of this allele in cases
     F_U     Frequency of this allele in controls
     A2      Major allele name
     CHISQ   Basic allelic test chi-square (1df)
     P       Asymptotic p-value for this test
     OR      Estimated odds ratio (for A1, i.e. A2 is reference)

You should find some evidence of association at all 4 loci, with the strongest result at rs003.


Fisher's exact test

You can also use PLINK to calculate Fisher's exact test, which is more appropriate if you have some cells (genotype categories) with very few observations:

plink --noweb --ped caseconped.txt --map caseconmap.txt --fisher

This produces the following columns of results in the file plink.assoc.fisher :

     CHR     Chromosome
     SNP     SNP ID
     BP      Physical position (base-pair)
     A1      Minor allele name (based on whole sample)
     F_A     Frequency of this allele in cases
     F_U     Frequency of this allele in controls
     A2      Major allele name
     P       Exact p-value for this test
     OR      Estimated odds ratio (for A1)

More sophisticated models

A more sophisticated case/control association test can be performed in plink using the --model command:

plink --noweb --ped caseconped.txt --map caseconmap.txt --model

Results can be found in the file plink.model . For each SNP, this displays results from a 2df genotype test, a Cochran-Armitage trend test, a 1df allelic test, a dominant model (for the minor allele) test and a recessive model (for the minor allele) test. The columns in the file plink.model are:

     CHR           Chromosome number
     SNP           SNP identifier
     TEST          Type of test
     AFF           Genotype/allele counts in cases
     UNAFF         Genotype/allele counts in controls
     CHISQ         Chi-squared statistic
     DF            Degrees of freedom for test
     P             Asymptotic p-value

In genome-wide association studies (GWAS), most people focus on reporting the TREND test, unless there is some strong reason apriori to consider the other tests.


Logistic regression

A logistic regression analysis can be performed in plink using the --logistic command (with the additional --ci option if you want to output a confidence interval for the estimated odds ratio). Logistic regression has the flexibility of allowing the inclusion of additional covariates as predictors of disease status (e.g. things like age, sex, smoking) although we won't include any of these today.

To preform this analysis, type the following:

plink --noweb --ped caseconped.txt --map caseconmap.txt --logistic --ci 0.95

Results can be found in the file plink.assoc.logistic . For each SNP, this displays:

     CHR           Chromosome number
     SNP           SNP identifier
     BP            Physical (base pair) position
     A1            Tested allele (minor allele by default)
     TEST          Code for the type of test
     NMISS         Number of non-missing individuals included on analysis
     OR            Estimated odds ratio
     SE            Standard error of estimated log odds ratio
     L95           Lower 95% conficence limit
     U95           Upper 95% conficence limit   
     STAT          Test statistic 
     P             Asymptotic p-value

Larger data sets

Just to illustrate how PLINK can scale up to analysing larger numbers of SNPs and individuals, we will repeat the basic association analysis with the larger data set, for which you will need to download the following files:

simcasecon.ped
simcasecon.map

To run the analysis and save the output in a new file (named results100SNPs.assoc) type:

plink --noweb --ped simcasecon.ped --map simcasecon.map --assoc --out results100SNPs

You can take a look at the output file by typing: more results100SNPs.assoc, but to see it properly I suggest you open it in Excel.


Answers

PLINK output

The results from the basic association tests are:

 CHR     SNP         BP   A1      F_A      F_U   A2        CHISQ            P           OR
   1   rs001     123446    2   0.3542     0.45    1        18.27    1.916e-05       0.6703
   1   rs002     123452    2   0.4245   0.3577    1        9.117     0.002532        1.324
   1   rs003     123457    2   0.3659   0.4746    1        23.23    1.436e-06       0.6387
   1   rs004     123458    1   0.3885   0.4946    2        21.82        3e-06        0.649

The results from Fishers exact tests are:

 CHR     SNP         BP   A1      F_A      F_U   A2            P           OR
   1   rs001     123446    2   0.3542     0.45    1    2.103e-05       0.6703
   1   rs002     123452    2   0.4245   0.3577    1     0.002728        1.324
   1   rs003     123457    2   0.3659   0.4746    1    1.411e-06       0.6387
   1   rs004     123458    1   0.3885   0.4946    2    2.942e-06        0.649

The results from the full set of association tests are:

 CHR     SNP   A1   A2     TEST            AFF          UNAFF        CHISQ   DF            P
   1   rs001    2    1     GENO     47/178/159    131/323/196        18.24    2    0.0001094
   1   rs001    2    1    TREND        272/496        585/715        18.24    1    1.947e-05
   1   rs001    2    1  ALLELIC        272/496        585/715        18.27    1    1.916e-05
   1   rs001    2    1      DOM        225/159        454/196        13.56    1    0.0002314
   1   rs001    2    1      REC         47/337        131/519        10.61    1     0.001125
   1   rs002    2    1     GENO     65/196/123     78/309/263        9.439    2     0.008918
   1   rs002    2    1    TREND        326/442        465/835        9.437    1     0.002126
   1   rs002    2    1  ALLELIC        326/442        465/835        9.117    1     0.002532
   1   rs002    2    1      DOM        261/123        387/263        7.333    1      0.00677
   1   rs002    2    1      REC         65/319         78/572        4.917    1      0.02659
   1   rs003    2    1     GENO     53/175/156    150/317/183        22.55    2    1.271e-05
   1   rs003    2    1    TREND        281/487        617/683        22.52    1    2.079e-06
   1   rs003    2    1  ALLELIC        281/487        617/683        23.23    1    1.436e-06
   1   rs003    2    1      DOM        228/156        467/183        17.04    1    3.666e-05
   1   rs003    2    1      REC         53/331        150/500        13.16    1    0.0002859
   1   rs004    1    2     GENO     57/182/142    163/316/170        21.36    2    2.305e-05
   1   rs004    1    2    TREND        296/466        642/656        21.28    1    3.969e-06
   1   rs004    1    2  ALLELIC        296/466        642/656        21.82    1        3e-06
   1   rs004    1    2      DOM        239/142        479/170        13.95    1    0.0001879
   1   rs004    1    2      REC         57/324        163/486        14.74    1    0.0001235

The results from the logistic regression analysis are:

 CHR     SNP         BP   A1       TEST    NMISS         OR       SE      L95      U95         STAT            
P
   1   rs001     123446    2        ADD     1034     0.6683  0.09498   0.5548   0.8051       -4.243   
2.205e-005
   1   rs002     123452    2        ADD     1034       1.34  0.09551    1.111    1.615        3.061     
0.002206
   1   rs003     123457    2        ADD     1034     0.6445  0.09329   0.5368   0.7738       -4.708   
2.497e-006
   1   rs004     123458    1        ADD     1030     0.6531  0.09303   0.5443   0.7838       -4.579   
4.677e-006

Study design issues

With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.

Other packages

Logistic regression analysis can be performed in other statistical packages such as R, Stata, SAS, SPSS, GLIM. Chi-squared table analysis can also be performed in other statistical packages and in some spreadsheets such as Excel. A likelihood-based genetic association analysis can be performed in the programs LAMP and UNPHASED.


References

Breslow N and Day N (1980) Statistical methods in cancer research: The analysis of case-control studies. International Agency for Research on Cancer.

Breslow N and Day N (1993) Statistical methods in cancer research: The design and analysis of cohort studies. International Agency for Research on Cancer.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81:559-575.

Program documentation

PLINK documentation:

PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:

http://zzz.bwh.harvard.edu/plink/


Exercises prepared by: Heather Cordell
Checked by:
Programs used: PLINK
Last updated: