Case/Control Association using PLINK, R and UNPHASED

Overview

Purpose

In this exercise you will be carrying out chi-squared tests and likelihood-based analysis of case/control data, to test for genetic association.

Methodology

The methodology used is standard methodology for case/control analysis (chi-squared tests and logistic regression) as described in the lectures and implemented in various computer packages.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying if you are not familiar with using statistical packages such as R. Don't worry too much if you don't understand all the details! If you decide to regularly use a statistical package such as R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

Program documentation

PLINK documentation:

PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:

http://pngu.mgh.harvard.edu/~purcell/plink/

R documentation:

The R website is at http://www.r-project.org/

From within R, one can obtain help on any command xxxx by typing `help(xxxx)'

UNPHASED documentation:

Full documentation can be found at

http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/

The program runs either in command-line mode, or through a Java-based graphical interface. We will be using the command-line mode, since this is easier to explain and allows greater control over the desired options.



Exercise

Data overview

The data consists of genotype data at 4 linked SNP loci, typed in 384 type 1 diabetes cases and 672 controls.

Appropriate data

Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls). Ideally a larger data set should be used than we consider here e.g. 1000 cases and 1000 controls might be considered a reasonable number.

All the programs will deal with much larger numbers of loci than the 4 SNPs considered here. PLINK in particular was specifically designed for the analysis of large numbers of loci e.g. generated as part of a genome-wide association study.

Special considerations/restrictions (for the programs used here)

All the commands required to run these analyses are given below. However, packages such as Stata, SAS, Splus and R are sophisticated statistical programming packages and have much greater functionality than will be described here. If you intend to use a statistical package to analyse your data, you are strongly encouraged to learn how to use that package appropriately through attendance on a training course or careful reading of an introductory textbook.

Instructions

Data format

The data for the 4 linked loci caseconped.txt is in standard linkage pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Note that since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown.

PLINK requires an additional map file caseconmap.txt describing the markers (in order) in the pedigree file. The PLINK-format map file contains exactly 4 columns:

     chromosome (1-22, X, Y or 0 if unplaced) 
     rs number or snp identifier 
     Genetic distance (morgans) 
     Base-pair position (bp units) 

Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to the appropriate directory (folder) on your computer.

Step-by-step instructions

1. Analysis in PLINK

Move to the directory where you have saved the data files.

To carry out a basic association analysis in PLINK, type

plink --noweb --ped caseconped.txt --map caseconmap.txt --assoc

The --noweb command tells PLINK to run without bothering to check via the web to see whether there are updated versions of PLINK. The --ped xxxx command tells PLINK that the name of the pedigree file is xxxx and the --map yyyy command tells PLINK that the name of the map file is yyyy. The --assoc command tells PLINK to perform a basic allele-based chisquared association test.

PLINK outputs some useful messages (you should always read these to make sure they match up with what you expect!) and outputs the results to a file plink.assoc .

Take a look at the file plink.assoc (e.g. by typing more plink.assoc ). For each SNP the following columns of results are reported:

     CHR     Chromosome
     SNP     SNP ID
     BP      Physical position (base-pair)
     A1      Minor allele name (based on whole sample)
     F_A     Frequency of this allele in cases
     F_U     Frequency of this allele in controls
     A2      Major allele name
     CHISQ   Basic allelic test chi-square (1df)
     P       Asymptotic p-value for this test
     OR      Estimated odds ratio (for A1, i.e. A2 is reference)

You should find some evidence of association at all 4 loci, with the strongest result at rs003.

You can also use PLINK to calculate Fishers exact test:

plink --noweb --ped caseconped.txt --map caseconmap.txt --fisher

This produces the following columns of results in the file plink.fisher :

     CHR     Chromosome
     SNP     SNP ID
     BP      Physical position (base-pair)
     A1      Minor allele name (based on whole sample)
     F_A     Frequency of this allele in cases
     F_U     Frequency of this allele in controls
     A2      Major allele name
     P       Exact p-value for this test
     OR      Estimated odds ratio (for A1)

The most sophisticated case/control association testing can be performed in plink using the model command:

plink --noweb --ped caseconped.txt --map caseconmap.txt --model

Results can be found in the file plink.model . For each SNP, this displays results from a 2df genotype test, a Cochran-Armitage trend test, a 1df allelic test, a dominant model (for the minor allele) test and a recessive model (for the minor allele) test. The columns in the file plink.model are:

     CHR           Chromosome number
     SNP           SNP identifier
     TEST          Type of test
     AFF           Genotype/allele counts in cases
     UNAFF         Genotype/allele counts in controls
     CHISQ         Chi-squared statistic
     DF            Degrees of freedom for test
     P             Asymptotic p-value




2. Chi-squared and logistic regression analysis in R

Stay in the directory where you have saved the data files. To start up the R package, type

R

You are now working within the R package. To start with, you need to read in the data:

casecon <-read.table("caseconped.txt", header=F, na.strings = "0")

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe 'casecon' and you can look at it simply by typing

casecon

If the dataframe is too large you will only see the bottom of it. You can look at the top 6 lines by typing

head(casecon)

or by typing

fix(casecon)

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

casecon$V1

Check that you understand how the output relates to the column of pedigree names you saw when you used fix(casecon) (note that the column of pedigrees gets wrapped round when you look at it by typing casecon$V1).

It can be inconvenient to have to type 'casecon$' before typing every variable, so you can tell R to automatically look at variables in the casecon dataframe by typing:

attach(casecon)

Now you can just look at the column of pedigree names by typing

V1

However, you should beware when using the 'attach' command as if you already have a different variable in R called 'V1', R will use this variable rather than the one in the attached dataframe - without telling you that there may be some ambiguity.

To perform the analysis we will first generate a variable called 'case' that codes affected/unaffected individuals as 1/0 rather than 2/1 (as they are coded in the 'affected' variable V6):

case <- V6-1

To look at the new case variable, type

case

To compare with the old 'affected' variable, type

V6

Now we will generate genotype variables from the two alleles at each locus:

g1 <- V7+V8-2
g2 <- V9+V10-2
g3 <- V11+V12-2
g4 <- V13+V14-2


This generates 4 diferent genotype variables which code for the genotypes at the 4 loci. Can you see how this has worked? The genotype values 0, 1, 2 correspond to the number of copies of the '2' allele that a person has. Take a look at the first of these genotype variables by typing

g1

To perform chi-squared tests on each of the genotype variables (to see if they are associated with disease status), type:

table(g1,case)
chisq.test(g1,case)

table(g2,case)
chisq.test(g2,case)

table(g3,case)
chisq.test(g3,case)

table(g4,case)
chisq.test(g4,case)


All loci should show significant association, with C and D (locus 3 and 4) showing the highest significance. The results should be similar to what you found using the genotype tests in PLINK.

To perform the logistic regression, it is helpful to use a special library of R routines. Read in the necessary add-in library (which we have downloaded ready for you):

library(DGCgenetics)

To perform allele tests at each marker using logistic regression, we use the following commands:

logistic (case ~ g1)
logistic (case ~ g2)
logistic (case ~ g3)
logistic (case ~ g4)


The 'logistic' command provides parameter estimates (odds ratios and confidence limits), z tests and p values for each test performed. The results should be very similar to the basic association or trend tests performed in PLINK. For genotype analysis at each marker, use the following commands:

logistic (case ~ factor(g1))
logistic (case ~ factor(g2))
logistic (case ~ factor(g3))
logistic (case ~ factor(g4))


Again the results should be similar to the genotype tests performed in PLINK. The advantage of logistic regression is that you could put in several different genotype variables into the same model. To see, for example, whether locus 3 adds significantly to a model that already includes locus 4, type:

logistic (case ~ factor(g4)+factor(g3))
anova(logistic (case ~ factor(g4)+factor(g3)))


Here the terms get added into the model in the order you specified, so to see the significance of adding in locus 3, you need work out the significance (p value) of the reported 'deviance' which is distrobuted as a chi-squared on 2 df. To do this type:

1-pchisq(2.03,2)

To see the significance of adding locus 4 to a model that already includes locus 3, you would need to type

logistic (case ~ factor(g3)+factor(g4))
anova(logistic (case ~ factor(g3)+factor(g4)))
1-pchisq(0.66,2)


To get out of R, type q()



3. Analysis in UNPHASED

Finally, we shall use the program UNPHASED to repeat the single-locus analyses we did above, and to additionally perform haplotype tests.

To run an allelic test on each locus in turn in the pedigree file caseconped.txt, simply type (from the unix prompt):

unphased caseconped.txt

The output gets reported to the screen. To re-run the analysis and save the output to a file named unphased.out, type:

unphased caseconped.txt > unphased.out

Once the program has finished running, you can check that a new output file unphased.out has been created by typing ls

Take a look at the output file (e.g. using the more command). First comes some program-specific information and then come the results (i.e. tests of association and estimates of the allelic or haplotypic odds ratios) for each marker. The most important information is the likelihood ratio chisq test and p-value that come under the heading TEST OF OVERALL ASSOCIATION .

The tests and odds ratios (in the column marked Odds-R) should be similar to what you found in your allele-based logistic regression analysis and in the basic association or trend tests in PLINK. To instead perform a genotype-based test at each locus, type

unphased caseconped.txt -genotype > unphasedgeno.out

Now the results should be similar to what you found in your genotype-based PLINK, chi-squared table and logistic regression analyses.

To perform a haplotype analysis of all four loci, averaging over uncertain haplotypes, type

unphased caseconped.txt -window 4 > unphasedhaplo.out

You can see from the results (the estimates of haplotype effects) that there are some rather rare haplotypes. This can make the results are a bit unreliable. In order to remove these rare haplotypes from the analysis, using a cut-off threshold of 1%, for example, type:

unphased caseconped.txt -window 4 -zero 0.01 > unphasedzero.out

To output the estimates of haplotype effects all with reference to the most common haplotype, 2-1-2-1, type

unphased caseconped.txt -window 4 -zero 0.01 -reference 2 1 2 1 > unphasedcommon.out

UNPHASED will also allow you to look at haplotypes from subsets of loci, or from sliding windows. For example, to look at haplotypes constructed from alleles at locus 3 and 4 only, imposing a 1% threshold for discarding rare haplotypes, type

unphased caseconped.txt -marker 3 4 -window 2 -zero 0.01 > unphased34.out

To see the other possible analysis options in UNPHASED, please consult the UNPHASED documentation.


Answers

PLINK output

The results from the basic association tests are:

 CHR     SNP         BP   A1      F_A      F_U   A2        CHISQ            P           OR
   1   rs001     123446    2   0.3542     0.45    1        18.27    1.916e-05       0.6703
   1   rs002     123452    2   0.4245   0.3577    1        9.117     0.002532        1.324
   1   rs003     123457    2   0.3659   0.4746    1        23.23    1.436e-06       0.6387
   1   rs004     123458    1   0.3885   0.4946    2        21.82        3e-06        0.649

The results from Fishers exact tests are:

 CHR     SNP         BP   A1      F_A      F_U   A2            P           OR
   1   rs001     123446    2   0.3542     0.45    1    2.103e-05       0.6703
   1   rs002     123452    2   0.4245   0.3577    1     0.002728        1.324
   1   rs003     123457    2   0.3659   0.4746    1    1.411e-06       0.6387
   1   rs004     123458    1   0.3885   0.4946    2    2.942e-06        0.649

The results from the full set of association tests are:

 CHR     SNP   A1   A2     TEST            AFF          UNAFF        CHISQ   DF            P
   1   rs001    2    1     GENO     47/178/159    131/323/196        18.24    2    0.0001094
   1   rs001    2    1    TREND        272/496        585/715        18.24    1    1.947e-05
   1   rs001    2    1  ALLELIC        272/496        585/715        18.27    1    1.916e-05
   1   rs001    2    1      DOM        225/159        454/196        13.56    1    0.0002314
   1   rs001    2    1      REC         47/337        131/519        10.61    1     0.001125
   1   rs002    2    1     GENO     65/196/123     78/309/263        9.439    2     0.008918
   1   rs002    2    1    TREND        326/442        465/835        9.437    1     0.002126
   1   rs002    2    1  ALLELIC        326/442        465/835        9.117    1     0.002532
   1   rs002    2    1      DOM        261/123        387/263        7.333    1      0.00677
   1   rs002    2    1      REC         65/319         78/572        4.917    1      0.02659
   1   rs003    2    1     GENO     53/175/156    150/317/183        22.55    2    1.271e-05
   1   rs003    2    1    TREND        281/487        617/683        22.52    1    2.079e-06
   1   rs003    2    1  ALLELIC        281/487        617/683        23.23    1    1.436e-06
   1   rs003    2    1      DOM        228/156        467/183        17.04    1    3.666e-05
   1   rs003    2    1      REC         53/331        150/500        13.16    1    0.0002859
   1   rs004    1    2     GENO     57/182/142    163/316/170        21.36    2    2.305e-05
   1   rs004    1    2    TREND        296/466        642/656        21.28    1    3.969e-06
   1   rs004    1    2  ALLELIC        296/466        642/656        21.82    1        3e-06
   1   rs004    1    2      DOM        239/142        479/170        13.95    1    0.0001879
   1   rs004    1    2      REC         57/324        163/486        14.74    1    0.0001235

R and UNPHASED output

Interpretation of the output is described in the step-by-step instructions. In general, the output will consist of a likelihood-ratio or chi-squared test for whatever you are test you are performing, and regression coefficients or odds ratio estimates for the predictor variables in the current model. Please ask if you need help in understanding the output for any specific test.


Comments

Advantages/disadvantages

Analysis in a standard statistical package has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed e.g. it easy to include additional predictor variables such as environmental factors, gene-environment interactions etc. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.

Study design issues

With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.

Other packages

Logistic regression analysis can be performed in other statistical packages such as Stata, SAS, SPSS, GLIM. Chi-squared table analysis can also be performed in other statistical packages and in some spreadsheets such as Excel.

For haplotype analysis one can use several other packages such as PLINK, GeneBPM and the haplo.stats add-in library for R.


References

Breslow N and Day N (1980) Statistical methods in cancer research: The analysis of case-control studies. International Agency for Research on Cancer.

Breslow N and Day N (1993) Statistical methods in cancer research: The design and analysis of cohort studies. International Agency for Research on Cancer.

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70:124-141.

Dudbridge (2008) Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data. Hum Hered 66:87-98.

Epstein M, Satten G (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics 73:1316-1329.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81:559-575.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 70:425-34. Schaid DJ (2004) Evaluating associations of haplotypes with traits. Genetic Epidemiology 27:348-64.

Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC (2003) Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered. 55:179-90.


Exercises prepared by: Heather Cordell
Checked by:
Programs used: PLINK, R, DGCgenetics, UNPHASED
Last updated: