Computer Practical Exercises on Family-Based Association in R

Overview

Purpose

In this exercise you will be carrying out family-based association analysis of five linked loci in the HLA region with type 1 diabetes, using a set of case-parent trios. The purpose is detect which (if any) of the loci are associated with disease.

Methodology

We will use the TDT and case/pseudocontrol approaches. The tests will performed in the statistical analysis package R. As well as using standard functions implemented in R, we will also make use of special functions designed for genetic analysis that have been downloaded as add-in R libraries.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying (!) if you are not familiar with R. Don't worry too much if you don't understand all the details. If you decide to use a statistical package such as R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

R documentation:

The R website is at http://www.r-project.org/

David Clayton's add-in routines (the DGCgenetics package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software

From within R, one can obtain help on any command xxxx by typing "help(xxxx)"

Data overview

We will be using family data consisting of a number of trio families with an affected diabetic child plus parents (of unknown disease status) all of whom are typed at 5 polymorphisms in the HLA region. (This is the same data that we will use in the UNPHASED exercise, although the input file formats are a bit different).

Appropriate data

Appropriate data for this exercise is genotype data at a set of linked loci typed in a number of case-parent trios. It is also possible to use nuclear families or larger families with more affected individuals, however they will automatically be broken into trios for the analysis, and non-independence between cases from the same family would need to be accounted for e.g. by use of the +cluster(pedigree) option in R.

Instructions

Data file

fiveloci.Rped

You should save this to an appropriate directory.

The data is in standard pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=male, 2=female), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. The pedigree file used for the analysis in R differs slightly from a standard pedigree file. It has a header line describing the different columns, and it uses R's own missing value code "NA" instead of zero.

Take a look at the pedigree file (e.g using the more command, or by opening it in a text editor) and check you understand how it is coded. Note that markers 1-3 are multiallelic markers, whereas markers 4 and 5 are SNPs.

Note that the names given to the first 6 variables in the header line are not optional: they must be called

pedigree id id.father id.mother sex affected

for the add-in utilities for the R package to work correctly.

TDT analysis: step by step instructions

Move into the directory where you saved the data files and start R by typing

R

You are now working within the R package. To start with, you need to read in the necessary add-in libraries (which we have downloaded ready for you):

library(DGCgenetics)

To read your data into a dataframe called "family", type

family <- read.table("fiveloci.Rped", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe "family" and you can look at it simply by typing

family

or (better) by typing

fix(family)

Take a look at the dataframe and check that you understand the variables in the rows and columns, and how they are coded.

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

family$pedigree

Note that the family$pedigree variable, which is really a long vector, will get displayed as a list of values that wrap round horizontally. Check that you understand how this list of values relates to the pedigree column (vector) that you can see in the "family" dataframe when using the fix(family) command.

It can be inconvenient to have to type "family$" before typing every variable name, so you can tell R to automatically look at variables in the "family" dataframe by typing:

attach(family)

Now you can just look at the column of pedigree names by typing

pedigree

Similarly, you can look at the first allele for locus 1 by typing

loc1_1

Or the second allele for locus 1 by typing

loc1_2

Check that you understand how these values relate to the pedigree column (vector) that you can see in the "family" dataframe when using the fix(family) command.

To perform association analysis, we need to convert the variables corresponding to the two alleles at each locus into a genotype variable for each locus. This can be done e.g. for locus 1 by:

g1 <- genotype(loc1_1, loc1_2)

To look at the variable you have just created, type:

g1

To perform a TDT analysis on this locus, type

tdt(g1)

Since this is a multiallelic marker, this gives a global multiallelic test (chi-squared test = 278.1187 on 22 df, P-value = <2e-16) as well as individual TDT tests for each allele versus all others.

Repeat the analysis for loci 2-5

g2 <- genotype(loc2_1, loc2_2) g3 <- genotype(loc3_1, loc3_2) g4 <- genotype(loc4_1, loc4_2) g5 <- genotype(loc5_1, loc5_2) tdt(g2) tdt(g3) tdt(g4) tdt(g5)

As you will find later (when you re-do the analysis in UNPHASED) you should find highly significant associations for loci 1, 2 and 5, and less significant associations at loci 3 and 4.

The TDT p value displayed is actually for an exact test (rather than for the originally proposed TDT chi-squared test). To find the TDT chi-squared statistic and its asymptotic p value e.g. for locus 4, type

tdt4result<-tdt(g4)
This saves the result of the TDT test into an R object called "tdt4result". To see the TDT chi-squared statistic, type:

tdt4result$statistic
This tells us that the TDT chi-squared statistic is 6.3226. To find out the p value, type:

1-pchisq(6.3226,1)

This tells us that the TDT chisquared statistic of 6.3226 has an asymptotic p value of 0.0119, quite similar to the exact p value (p=0.0149) we found.

Case/pseudocontrol analysis: step by step instructions

To create a case/pseudocontrol set for performing analysis at locus 5, for example, type:

psccloc5 <- pseudocc(g5, data=family)

This creates a new dataframe called "psccloc5" which contains cases each with 3 matched pseudocontrols. To look at the dataframe you just created, type:

fix(psccloc5)

Note that within the "psccloc5" dataframe, the 1/0 case/control indicator variable is called cc and the genotype variable is called g5 . The set variable indicates which cases and pseudocontrols are in the same matched set. To clear the old "family" dataframe and old genotype variables from the memory, type:

detach(family) rm(g1) rm(g2) rm(g3) rm(g4) rm(g5)

and then read in the "psccloc5" dataframe as the default:

attach(psccloc5)

To analyse using conditional logistic regression, assuming either a 2df (genotype) test or a 1df (allele) test, type:

gcontrasts(g5) <- "genotype" clogit(cc ~ g5 + strata(set)) gcontrasts(g5) <- "additive" clogit(cc ~ g5 + strata(set))

This will perform the analysis on the g5 variable that is currently in the default memory, i.e. the g5 variable in the "psccloc5" dataframe, also known as psccloc5$g5. The strata(set) option indicates that the set variable labels which cases and pseudocontrols are in the same matched set.

The results should be very similar to what you found in your TDT analysis of locus 5, a highly significant likelihood ratio test.

The relative risk parameters labelled exp(coef) correspond to the risks of disease relative to the 1/2 genotype. (This is seen from the slightly confusing labels "g51/1" and "g52/2" which mean "g5 1/1" and "g5 2/2"). We find values of 0.421 and 2.176 for the 1/1 and 2/2 genotype, relative to the 1/2 genotype. So, if we wanted to calculate risks relative to the 1/1 genotype, we would have values of 1.0/0.421 = 2.38 for genotype 1/2, and 2.176/0.421 = 5.17 for genotype 2/2, respectively. We could also work out the values relative to the 1/1 genotype by specifying the genotype constrasts appropriately:

gcontrasts(g5, base="1/1") <- "genotype" clogit(cc ~ g5 + strata(set))

For analysis at more than one locus (e.g. locus 4 and 5, say) we need to clear the memory and get back to our original "family" dataframe:

detach(psccloc5) attach(family)

Now we need to create the relevant genotype variables and case/pseudocontrol datasets. We will create two different case/pseudocontrol datasets, one in which we do not keep track of phase information between the two loci, and one in which we condition on phase being known (in order to fit models where the disease risk depends on phase)

g4 <- genotype(loc4_1, loc4_2) g5 <- genotype(loc5_1, loc5_2) psccphase <- pseudocc(g4, g5, phase=TRUE, data=family) psccnophase <- pseudocc(g4, g5, phase=FALSE, data=family)

These new dataframes can be looked at by typing

fix(psccphase) fix(psccnophase)

For the "psccphase" dataframe, you should find that some sets of matched cases and pseudocontrols consist of one case and 3 pseudocontrols, other sets have only one pseudocontrol, and some families have been discarded entirely. This is because the method has discarded pseudocontrols and families for which phase is not inferrable. See Cordell and Clayton (2002) or Cordell et al. (2004) for details.

In the "psccphase" dataframe, you should also find that, as well as the genotype data at the two loci individually, a new two-locus phased genotype variable called g4.g5 has been created.

For the "psccnophase" data, when phase information is not kept, all families are used but only 1 pseudocontrol is generated per case. This is not in fact the most powerful approach, as Cordell and Clayton (2002) show that it is sometimes possible to generate 3 pseudocontrols per case in this situation. However, the R functions for case/pseudocontrol analysis are not yet fully developed, and so the more powerful creation of 3 pseudocontrols per case in this situation is not yet implemented. (It is, however, implemented in David Clayton's Stata routines - as the "pseudocc" command in the "genassoc" package for Stata - so you might like to consider using these routines instead, if you have access to the statistical package Stata).

To analyse the "psccnophase" data, first clear the old "family" dataframe and associated genotype variables from the memory and read in the new dataframe as default:

detach(family) rm(g4) rm(g5) attach(psccnophase)

To analyse each locus individually with a 2df test, type

gcontrasts(g4) <- "genotype" clogit(cc ~ g4 + strata(set)) gcontrasts(g5) <- "genotype" clogit(cc ~ g5 + strata(set))

To see whether locus 4 is significant once locus 5 is in the regression equation, use the following sequence of commands:

gcontrasts(g4) <- "genotype" gcontrasts(g5) <- "genotype" fullmodel<-clogit(cc ~ g5 + g4 + strata(set)) restrictedmodel<-clogit(cc ~ g5 + strata(set)) anova(restrictedmodel,fullmodel)

You should find a difference between the models reported as a deviance of 17.78 on 2df. We can also find the same result by using the following commands:

newfullmodel<-clogit(cc ~ strata(set) + g5 + g4) anova(newfullmodel)

To find the significance of this, use:

1-pchisq(17.78,2)

which gives you a p value of around 0.00014. This suggests that locus 5 is not sufficient to account for all the association in the region: locus 4 still adds significantly to the regression model.

To see whether locus 5 is significant once locus 4 is in the regression equation, use the following sequence of commands:

gcontrasts(g4) <- "genotype" gcontrasts(g5) <- "genotype" fullmodel<-clogit(cc ~ g5 + g4 + strata(set)) restrictedmodel<-clogit(cc ~ g4 + strata(set)) anova(restrictedmodel,fullmodel)

You should find a difference between the models reported as a deviance of 48.33 on 2df. To find the significance of this, use:

1-pchisq(48.33,2)

which gives you a p value of around 3.2e-11. This suggests that locus 4 is certainly not sufficient to account for all the association in the region: locus 5 still adds very significantly to the regression model.

Theoretically, one can compare these sorts of models for each pair of the loci in turn to see which loci might be able to account for most of the association. However, it turns out that the large number of alleles for loci 1 and 2 make this analysis a bit complicated: ideally one would want to group together alleles based on their frequency or on some biological criteria, or else drop the alleles that are very rare.

To fit a model for phase-known haplotypes at loci 4 and 5, discard the "psccnophase" dataframe from the memory, and read in the file "psccphase":

detach(psccnophase) attach(psccphase)

The "psccphase" dataframe contains a two-locus phased genotype variable called g4.g5 . To fit a multiplicative model (equivalent to an additive model on the log odds scale) for the haplotypes type:

gcontrasts(g4.g5) <- "additive" clogit(cc ~ g4.g5 + strata(set))

This gives a highly significant global test of 67.3 on 3df (p=1.63e-14) for the effects of the 3 haplotypes (relative to the 1:2 haplotype). The individual haplotype relative risks are given under in column marked "exp(coeff)". It is seen that the 1:1 haplotype has the lowest risk and the 2:2 haplotype the highest. You will find a similar result in the next exercise, when you repeat the analysis of these data using the program UNPHASED.

Once you are ready to finish with R, type

q() n

to get out.

Answers

How to interpret the output

Interpretation of the output is described in the step-by-step instructions. In general, the output will consist of a likelihood-ratio or chi-squared test for whatever you are test you are performing, and regression coefficients or odds ratio estimates for the predictor variables in the current model. Please ask if you need help in understanding the output for any specific test.

Comments

Advantages/disadvantages

Analysis in a standard statistical package has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed e.g. it easy to include additional predictor variables such as environmental factors, gene-environment interactions etc. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.

Study design issues

Family data has the advantage of being generally more robust (than case/control data) to poulation stratification. It also allows investigation of more complex effects e.g. imprinting. But it may be harder to collect families than cases and controls.

Other packages

TDT analysis can be performed in a variety of other packages. TDTae (TDT allowing for error) is a program that performs a TDT test allowing for possible genotyping errors. A form of TDT analysis of haplotypes is performed by the TRANSMIT program by David Clayton. The only other package that performs a form of case/pseudocontrol analysis is the UNPHASED program by Frank Dudbridge. For testing (but not estimation) of genotype or haplotype association effects in families, one can use the PDT or FBAT or PBAT programs.

References

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70: 124-141.

Cordell HJ, Barratt BJ and Clayton DG (2004) Case/pseudocontrol analysis in genetic association studies: a unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions and parent-of-origin effects. Genetic Epidemiology 26:167-185.

Dudbridge F (2003) Pedigree disequilibrium tests for multilocus haplotypes. Genet Epidemiol 25:115-21.

Horvath S, Xu X and Laird N (2001) The family based association test method: strategies for studying general genotype-phenotype associations. Euro J Hum Gen 9: 301-306

Lake S, Blacker , and Laird N (2001) Family based tests in the presence of association. Amer J Hum Gen 67:1515-1525.

Martin ER, Monks SA, Warren LL, Kaplan NL (2000) A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am J Hum Genet 67:146-154

Schaid DJ. 1996. General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol 13:423-449.

Spielman RS, McGinnis RE, Ewens WJ. 1993. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM) Am J Hum Genet 52:455-466.