Computer Practical Exercises on Case/Control Association using the R package and UNPHASED

Overview

Purpose

In this exercise you will be carrying out chi-squared tests and likelihood-based analysis of case/control data, to test for genetic association.

Methodology

The methodology used is standard methodology for case/control analysis (chi-squared tests and logistic regression) as described in the lectures and implemented in the statistical analysis package R. As well as using standard functions implemented in R, we will also make use of special functions designed for genetic analysis that can be downloaded as add-in routines to the R package. In addition, will also use a likelihood-based approach implemented in the program UNPHASED.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying if you are not familiar with using statistical packages such as R. Don't worry too much if you don't understand all the details! If you decide to regularly use a statistical package such as R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

Program documentation

R documentation:

The R website is at http://www.r-project.org/

David Clayton's add-in routines (the DGCgenetics package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software

From within R, one can obtain help on any command xxxx by typing `help(xxxx)'

UNPHASED documentation:

Full documentation can be found at

http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/

The program runs either in command-line mode, or through a Java-based graphical interface. We will be using the command-line mode, since this is easier to explain and allows greater control over the desired options.



Exercise

Data overview

The data consists of genotype data at 4 linked SNP loci, typed in 384 type 1 diabetes cases and 672 controls.

Appropriate data

Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls). Ideally a larger data set should be used than we consider here e.g. 1000 cases and 1000 controls might be considered a reasonable number.

Special considerations/restrictions (for the programs used here)

All the commands required to run these analyses are given below. However, packages such as Stata, SAS, Splus and R are sophisticated statistical programming packages and have much greater functionality than will be described here. If you intend to use a statistical package to analyse your data, you are strongly encouraged to learn how to use that package appropriately through attendance on a training course or careful reading of an introductory textbook.

Instructions

Data format

The data for the 4 linked loci casecondata.txt is in a similar format to a standard linkage pedigree file, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. The pedigree file used for the analysis in R differs from a standard pedigree file in that it has a header line describing the different columns, and it uses R's own missing value code "NA". Note that since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown.

The same data in standard pedigree file format (as is required by the UNPHASED program) is in the file caseconped.txt

Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to the appropriate directory (folder) on your computer.

Step-by-step instructions

1. Chi-squared and logistic regression analysis in R

Move to the directory where you have saved the data files. To start up the R package, type

R

You are now working within the R package. To start with, you need to read in the necessary add-in libraries (which we have downloaded ready for you):

library(DGCgenetics)

To read in your data, type

casecon <- read.table("casecondata.txt", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe 'casecon' and you can look at it simply by typing

casecon

or by typing

fix(casecon)

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

casecon$pedigree

Check that you understand how the output relates to the column of pedigree names you saw when you used fix(casecon) (note that the column of pedigrees gets wrapped round when you look at it by typing casecon$pedigree).

It can be inconvenient to have to type 'casecon$' before typing every variable, so you can tell R to automatically look at variables in the casecon dataframe by typing:

attach(casecon)

Now you can just look at the column of pedigree names by typing

pedigree

However, you should beware when using the 'attach' command as if you already have a different variable in R called 'pedigree', R will use this variable rather than the one in the attached dataframe - without telling you that there may be some ambiguity. For this reason it is often convenient to clear all variables and detach all dataframes before starting your session. This can be done (BUT DONT DO IT NOW!) by use of the clear() command.

To perform the analysis we will first generate a variable called 'case' that codes affected/unaffected individuals as 1/0 rather than 2/1 (as they are coded in the 'affected' variable):

case <- affected-1

To look at the new case variable, type

case

To compare with the old 'affected' variable, type

affected

Now we will generate genotype variables from the two alleles at each locus:

g1 <- genotype(loc1_1, loc1_2)
g2 <- genotype(loc2_1, loc2_2)
g3 <- genotype(loc3_1, loc3_2)
g4 <- genotype(loc4_1, loc4_2)


This generates 4 diferent genotype variables which code for the genotypes at the 4 loci. Take a look at the first of these by typing

g1

You can see that in variable g1, the genotypes have been labelled in terms of their alleles in a convenient human-readable fashion (e.g. "1/2" means a heterozygote). When used later on for logistic regression, R will convert each of these genotype labels to a particular numeric coding (which we shall specify using the gcontrasts() command).

To perform chi-squared tests on each of the genotype variables (to see if they are associated with disease status), type:

table(g1,case)
chisq.test(g1,case)

table(g2,case)
chisq.test(g2,case)

table(g3,case)
chisq.test(g3,case)

table(g4,case)
chisq.test(g4,case)


All loci should show significant association, with C and D (locus 3 and 4) showing the highest significance.

To perfom a chromosome-based analysis, you can use the commands:

allele.table(g1,case)
chisq.test(allele.table(g1,case))

allele.table(g2,case)
chisq.test(allele.table(g2,case))

allele.table(g3,case)
chisq.test(allele.table(g3,case))

allele.table(g4,case)
chisq.test(allele.table(g4,case))


To perform the logistic regression analysis, we need to set up a numeric coding of the variables up for either a 2df (genotype-based) analysis or a 1df (allele or chromosome-based) analysis. For the genotype-based analysis, we use what are called 'genotype' contrasts, and for the chromosome-based analysis we use 'additive' (on the log odds scale - equivalent to multiplicative on the odds scale) contrasts. So for a 2df test at the first locus we use the following commands:

gcontrasts(g1) <- "genotype"
logistic (case ~ g1)


The 'logistic' command provides parameter estimates (odds ratios) and p values for each regression coefficient. The syntax logistic (y ~ x) means we regress an outcome variable y (here case/control status) on a predictor variable x (here genotype variable g1). The odds ratios are given for genotypes 1/1 and 2/2 relative to heterozygous genotype 1/2. (This is seen from the slightly confusing labels "g11/1" and "g12/2" which mean "g1 1/1" and "g1 2/2"). If one wanted instead to do the analysis relative to the genotype 1/1, for example, one would have to set up the genotype contrasts differently. We can do this by typing

gcontrasts(g1,base="1/1")<-"genotype"
logistic (case ~ g1)


To asess the significance of the association, use the following command:

anova(logistic (case ~ g1))

The 'anova' command gives an overall test in the form of a 'deviance' and a 'df'. In this case we find a deviance of 18.49 on 2df. To find a p value for this, we compare it to a chi-squared distribution by typing:

1-pchisq(18.49,2)

For a 1df test at the first locus, we use the following commands:

gcontrasts(g1) <- "additive"
logistic (case ~ g1)
anova(logistic (case ~ g1))
1-pchisq(18.47,1)


You should now be able to do a genotype analysis for the other loci:

gcontrasts(g2,base="1/1")<-"genotype"
logistic (case ~ g2)
anova(logistic (case ~ g2))


gcontrasts(g3,base="1/1")<-"genotype"
logistic (case ~ g3)
anova(logistic (case ~ g3))


gcontrasts(g4,base="1/1")<-"genotype"
logistic (case ~ g4)
anova(logistic (case ~ g4))


Similarly, you could do an allele-based analysis for the other loci:

gcontrasts(g2) <- "additive"
logistic (case ~ g2)
anova(logistic (case ~ g2))


gcontrasts(g3) <- "additive"
logistic (case ~ g3)
anova(logistic (case ~ g3))


gcontrasts(g4) <- "additive"
logistic (case ~ g4)
anova(logistic (case ~ g4))


The results from the logistic regression should be very similar to what you found using the chi-squared table analyses of either genotypes or chromosomes. The advantage of logistic regression is that you could put in additional predictor variables such as environmental effects or indicators of population ancestry. Suppose for example you wanted to include sex (gender) as a predictor variable in your genotype-based analysis of locus D. You could do this by typing

gcontrasts(g4,base="1/1")<-"genotype"
logistic (case ~ g4 + sex)
anova(logistic (case ~ g4 + sex))


Does sex appear to have any significant effect? (Note - what is the sex distribution in your data set?) You could also put several different genotype variables into the same model. To see, for example, whether locus C adds significantly to a model that already includes locus D, type:

gcontrasts(g3,base="1/1")<-"genotype"
gcontrasts(g4,base="1/1")<-"genotype"
logistic (case ~ g4 + g3)
anova(logistic (case ~ g4 + g3))


Here the terms get added into the model in the order you specified, so to see the significance of adding in locus C, you need to type:

1-pchisq(2.03,2)

To see the significance of adding locus D to a model that already includes locus C, you would need to type

gcontrasts(g3,base="1/1")<-"genotype"
gcontrasts(g4,base="1/1")<-"genotype"
logistic (case ~ g3 + g4)
anova(logistic (case ~ g3 + g4))
1-pchisq(0.66,2)


To get out of R, type q()



2. Analysis in UNPHASED

We shall now use the program UNPHASED to repeat the single-locus analyses we did above, and to additionally perform haplotype tests.

To run an allelic test on each locus in turn in the pedigree file caseconped.txt, simply type (from the unix prompt):

unphased caseconped.txt

The output gets reported to the screen. To re-run the analysis and save the output to a file named unphased.out, type:

unphased caseconped.txt > unphased.out

Once the program has finished running, you can check that a new output file unphased.out has been created by typing ls

Take a look at the output file (e.g. using the more command). First comes some program-specific information and then come the results (i.e. tests of association and estimates of the allelic or haplotypic odds ratios) for each marker. The most important information is the likelihood ratio chisq test and p-value that come under the heading TEST OF OVERALL ASSOCIATION .

The tests and odds ratios (in the column marked Odds-R) should be similar to what you found in your allele-based logistic regression analysis and chi-squared table analysis. To instead perform a genotype-based test at each locus, type

unphased caseconped.txt -genotype > unphasedgeno.out

Now the results should be similar to what you found in your genotype-based logistic regression analysis and chi-squared table analysis.

To perform a haplotype analysis of all four loci, averaging over uncertain haplotypes, type

unphased caseconped.txt -window 4 > unphasedhaplo.out

You can see from the results (the estimates of haplotype effects) that there are some rather rare haplotypes. This can make the results are a bit unreliable. In order to remove these rare haplotypes from the analysis, using a cut-off threshold of 1%, for example, type:

unphased caseconped.txt -window 4 -zero 0.01 > unphasedzero.out

To output the estimates of haplotype effects all with reference to the most common haplotype, 2-1-2-1, type

unphased caseconped.txt -window 4 -zero 0.01 -reference 2 1 2 1 > unphasedcommon.out

UNPHASED will also allow you to look at haplotypes from subsets of loci, or from sliding windows. For example, to look at haplotypes constructed from alleles at locus C and D only, imposing a 1% threshold for discarding rare haplotypes, type

unphased caseconped.txt -marker 3 4 -window 2 -zero 0.01 > unphasedCD.out

To see the other possible analysis options in UNPHASED, please consult the UNPHASED documentation.


Answers

How to interpret the output

Interpretation of the output is described in the step-by-step instructions. In general, the output will consist of a likelihood-ratio or chi-squared test for whatever you are test you are performing, and regression coefficients or odds ratio estimates for the predictor variables in the current model. Please ask if you need help in understanding the output for any specific test.


Comments

Advantages/disadvantages

Analysis in a standard statistical package has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed e.g. it easy to include additional predictor variables such as environmental factors, gene-environment interactions etc. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.

Study design issues

With case/control data it is relatively easy to obtain large enough sample sizes to detect small genetic effects. However, care must be taken that the case and control samples are well matched and that, if necessary, adjustments are made to correct for possible inflation of type 1 error due to population stratification or other confounders.

Other packages

Logistic regression analysis can be performed in other statistical packages such as Stata, SAS, SPSS, GLIM. Chi-squared table analysis can also be performed in other statistical packages and in some spreadsheets such as Excel.

For haplotype analysis one can use several other packages such as the HAPLO.STATS add-in for R by Dan Schaid, the WHAP program by Shaun Purcell, the SIMHAP program from Lyle Palmer, or the HAPIPF add-in for Stata by Adrian Mander. Alternatively, one can use the SNPHAP program by David Clayton to estimate haplotypes for several multiply-imputed datasets, and then analyse these using add-in multiple imputation functions for Stata or R.


References

Breslow N and Day N (1980) Statistical methods in cancer research: The analysis of case-control studies. International Agency for Research on Cancer.

Breslow N and Day N (1993) Statistical methods in cancer research: The design and analysis of cohort studies. International Agency for Research on Cancer.

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70: 124-141.

Dudbridge (2008) Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data. Hum Hered 66:87-98.

Epstein M, Satten G (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics 73: 1316-1329.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 70:425-34.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 70:425-34. Schaid DJ (2004) Evaluating associations of haplotypes with traits. Genetic Epidemiology 27:348-64.

Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC (2003) Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered. 55:179-90.


Exercises prepared by: Heather Cordell
Checked by:
Programs used: R, DGCgenetics, UNPHASED
Last updated: