Computer Practical Exercises on family-based association using Stata and R

Overview

Purpose

In this exercise you will be carrying out family-based association analysis of five linked loci in the HLA region with type 1 diabetes, using a set of case-parent trios. The purpose is to detect which (if any) of the loci are associated with disease.

Methodology

We will use the TDT and case/pseudocontrol approaches. The tests will be performed in two statistical analysis packages, Stata and R. As well as using standard functions implemented in these packages, we will also make use of special functions designed for genetic analysis that have been downloaded as add-in functions for the Stata and R packages.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying (!) if you are not familiar with statistical packages such as Stata or R. Don't worry too much if you don't understand all the details. If you decide to use a statistical package such as Stata or R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

Exercise

Data overview

We will be using family data consisting of a number of trio families with an affected diabetic child plus parents (of unknown disease status) all of whom are typed at 5 polymorphisms in the HLA region.

Appropriate data

Appropriate data for this exercise is genotype data at a set of linked loci typed in a number of case-parent trios. It is also possible to use nuclear families or larger families with more affected individuals. However, they will automatically be broken into trios for the analysis, and non-independence between cases from the same family would need to be accounted for, e.g. by use of the robust cluster() option in Stata or the cluster() option in R.

Special considerations/restrictions (for the programs used here)

All the Stata and R commands required to run these analyses are given below. However, Stata and R are sophisticated statistical programming packages and have much greater functionality than will be described here. If you intend to use a statistical package to analyse your data, you are strongly encouraged to learn how to use that package appropriately through attendance on a training course or careful reading of an introductory textbook.

Instructions

Data format

The data is in standard pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Each column must be separated by a tab in order to be read correctly into Stata.

The pedigree file used for the analysis in R differs from the pedigree file used for the analysis in Stata. It has a header line describing the different columns, and it uses R's own missing value code "NA".
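As a rough illustration of the difference between the two layouts, the base R sketch below reads a small tab-separated pedigree fragment, adds the column names and recodes missing alleles from 0 to NA. The three rows are made up for illustration (they are not taken from fiveloci.ped), only one locus is shown, and the choice to recode only the allele columns is an assumption.

```r
# Hypothetical trio (father, mother, affected child) typed at one locus;
# columns are tab-separated and missing alleles are coded 0
ped_text <- paste(
  "1\t1\t0\t0\t1\t1\t1\t2",
  "1\t2\t0\t0\t2\t1\t0\t0",
  "1\t3\t1\t2\t1\t2\t1\t2",
  sep = "\n"
)

# Read the fragment and attach the column names expected in the R analysis
ped <- read.table(text = ped_text, sep = "\t", header = FALSE,
                  col.names = c("pedigree", "id", "id.father", "id.mother",
                                "sex", "affected", "loc1_1", "loc1_2"))

# Recode missing alleles (0) to R's missing value code NA
# (assumption: only the allele columns are recoded in this sketch)
allele_cols <- c("loc1_1", "loc1_2")
ped[allele_cols] <- lapply(ped[allele_cols],
                           function(x) replace(x, x == 0, NA))
print(ped)
```

After running this, the mother's two alleles are NA while the father and child keep their 1/2 genotypes.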

Data files

fiveloci.ped
fiveloci.Rped

Program documentation

STATA documentation:

The Stata website is at http://www.stata.com/

Useful guides to Stata can be found at http://www.ats.ucla.edu/stat/stata/ and http://www.whoishostingthis.com/resources/stata/

David Clayton's add-in routines (the genassoc package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software/stata/README.txt

From within Stata, one can obtain help on any command xxxx by typing "help xxxx"

R documentation:

The R website is at http://www.r-project.org/

David Clayton's add-in routines (the dgc.genetics package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software

From within R, one can obtain help on any command xxxx by typing "help(xxxx)"

Step-by-step instructions

1. Analysis in Stata

Make sure you have a copy of the pedigree files in your current unix directory. Open up a Stata window from that directory by typing:

xstata

For those not familiar with Stata, you should find that a large window with 4 separate sub-windows appears. You type commands in the bottom right hand window, and any results will be displayed in the top right hand window. The top left window provides a review of all the commands you typed, and the bottom left window displays all the variables that you have currently loaded. A convenient way to examine the variables is to click on the data browser button at the top.

To keep a log of your session in the file "family.log", you can type in the bottom right hand (white) Stata window:

log using family, replace text

Read the data into Stata by typing:

ginsheet using fiveloci.ped, preped zmiss

The ginsheet command reads in data from a text file which in this case has been prepared for you earlier in the file "fiveloci.ped". The preped option tells Stata that the file is in a standard pre-makeped pedigree file format, and the zmiss option tells Stata that missing values are coded in the file as 0 (whereas Stata will recode them as ".").

You should find that the names of 16 variables appear in the bottom left window. These correspond to the usual variables in a standard pedigree file with alleles at locus 1 denoted L1_1 and L1_2, alleles at locus 2 denoted L2_1 and L2_2 etc. etc.

Take a look at the data by clicking on the data browser (3rd icon from the right at the top). You should see that the data consists of genotypes for a series of TDT type trios i.e. father, mother and affected child. We can perform some preliminary single-locus association analysis at these loci using the following commands

tdt L1_1 L1_2
tdt L2_1 L2_2
tdt L3_1 L3_2
tdt L4_1 L4_2
tdt L5_1 L5_2

This performs TDT analysis at each locus in turn and also gives an indication of which loci are diallelic and which have more than 2 alleles. The p values for the TDT tests of each allele are given in the table, and the global (multiallelic) p value is given at the bottom right (in green). You should find highly significant associations at locus 1, 2 and 5 and moderately significant associations at locus 3 and 4 (p approx 0.01). For more information on the tdt command type

help tdt
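For a diallelic locus, the TDT statistic is the usual McNemar-type comparison of how often heterozygous parents transmit each allele to the affected child. The base R sketch below computes it by hand. The transmission counts are hypothetical (they are not the counts from fiveloci.ped), and the variable names are chosen only for this illustration.

```r
# Hypothetical counts of transmissions from heterozygous (1/2) parents
n_trans_allele1 <- 60   # parent transmitted allele 1 (allele 2 untransmitted)
n_trans_allele2 <- 35   # parent transmitted allele 2 (allele 1 untransmitted)

# TDT statistic: (b - c)^2 / (b + c), referred to chi-squared on 1 df
tdt_stat <- (n_trans_allele1 - n_trans_allele2)^2 /
            (n_trans_allele1 + n_trans_allele2)
p_value  <- pchisq(tdt_stat, df = 1, lower.tail = FALSE)

cat("TDT chi-squared:", round(tdt_stat, 3), " p value:", signif(p_value, 3), "\n")
```

Under the null hypothesis the two counts are expected to be equal, so a large imbalance in either direction gives a large statistic.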

We can perform a conditional logistic regression analysis at each locus (i.e. a genotype relative risk analysis) using the following commands:

gtrr L1_1 L1_2
gtrr L2_1 L2_2
gtrr L3_1 L3_2
gtrr L4_1 L4_2
gtrr L5_1 L5_2

For a single locus, the gtrr command automatically generates the required case and 3 pseudo-controls and analyses the data without explicitly writing out the new case/pseudo-control data set. A similar pattern of significance (given as a P value for the LR chi-squared test) is seen at the various loci with this procedure as with the TDT. The genotype relative risks are all estimated relative to one particular genotype, which can be specified if required. Note that for locus 1 and 2 there is insufficient data in some genotype classes to estimate all the genotype relative risks. We can get round this by grouping together some genotypes until there is a minimum of 10 transmissions of each genotype.

gtrr L1_1 L1_2, emin(10)
gtrr L2_1 L2_2, emin(10)

For more information on the gtrr command type

help gtrr
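The idea behind the case and 3 pseudo-controls at a single locus can be sketched in base R as follows: the case is the genotype the child actually received, and the pseudo-controls are the other three genotypes the parents could have transmitted. This is only an illustration of the principle, not the code used by gtrr; the function name make_pseudocontrols and the example genotypes are hypothetical.

```r
# Build one matched set (case + 3 pseudo-controls) for a single locus.
# father, mother, child: length-2 vectors of allele codes.
make_pseudocontrols <- function(father, mother, child) {
  # All four combinations of one paternal and one maternal allele
  combos <- expand.grid(pat = father, mat = mother)
  geno   <- apply(combos, 1, function(a) paste(sort(a), collapse = "/"))

  # The combination matching the child's genotype is the case
  child_geno <- paste(sort(child), collapse = "/")
  case_row   <- match(child_geno, geno)
  if (is.na(case_row)) stop("child genotype incompatible with parents")

  data.frame(genotype = geno,
             case = as.integer(seq_along(geno) == case_row),
             stringsAsFactors = FALSE)
}

# Hypothetical trio: father 1/2, mother 1/3, affected child 1/1
sets <- make_pseudocontrols(father = c(1, 2), mother = c(1, 3), child = c(1, 1))
print(sets)
```

Each matched set therefore contains exactly one row with case equal to 1, and conditional logistic regression compares that row with the other rows of the same set.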

In order to test the effect of a locus conditional on effects at other loci, we need to generate and explicitly write out appropriate cases and matched pseudo-controls from this family data. Since we will be interested in potentially including effects at all 5 loci, we generate cases and pseudo-controls with genotypes at all 5 loci. Note that an individual who has missing data at any one of the loci will be discarded during this process. This would not therefore be recommended for large numbers of loci or loci with large amounts of missing data.
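To get a feel for how quickly complete-data requirements can shrink a data set, the following base R sketch flags individuals with a missing allele at any locus. The data frame and its values are hypothetical, and the sketch only illustrates the screening idea; it is not how pseudocc itself is implemented.

```r
# Hypothetical individuals typed at two loci (NA = missing allele)
trio_members <- data.frame(
  id     = 1:4,
  loc1_1 = c(1, 2, NA, 1),
  loc1_2 = c(2, 2, 1, 1),
  loc2_1 = c(1, 1, 2, NA),
  loc2_2 = c(1, 3, 2, 2)
)

# An individual is usable only if every allele column is non-missing
allele_cols <- grep("^loc", names(trio_members), value = TRUE)
fully_typed <- complete.cases(trio_members[allele_cols])

cat("Individuals with missing data:", trio_members$id[!fully_typed], "\n")
cat("Proportion fully typed:", mean(fully_typed), "\n")
```

Here individuals 3 and 4 each lack a single allele, yet both would be lost, which is why the proportion of usable data falls as more loci are included.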

To generate the cases and pseudo controls, without necessarily ensuring that phase is conserved, type

pseudocc L1_1 L1_2 L2_1 L2_2 L3_1 L3_2 L4_1 L4_2 L5_1 L5_2, saving(casepseudocon)

This command uses genotype data on the child and parents at the five pairs of variables listed to construct cases and pseudo-controls, without necessarily being able to infer phase. A "shortcut" for this command would be

pseudocc L1_* L2_* L3_* L4_* L5_*, saving(casepseudocon)

If you try this now, Stata will attempt to overwrite the output file "casepseudocon.dta" that you already made, but will fail. To tell Stata to overwrite it, you need to use

pseudocc L1_* L2_* L3_* L4_* L5_*, saving(casepseudocon) replace

This saves a Stata data file "casepseudocon.dta" in the current directory. We can now get rid of our pedigree data set and read in the case/pseudo-control data set that we saved by typing

clear
use casepseudocon

Take a look at the new data set using the data browser. You should find that each affected child in each pedigree has now given rise to one case and between one and 3 pseudo-controls. (This information appears in the "case" variable). In addition we have a "set" variable which keeps track of which pseudo-controls are matched to which case. In this data set there is only one case per pedigree, so the information in "set" is equivalent to using the information in "pedigree", but if there were more than one case per pedigree we would want to keep track of which pseudo-controls are matched to which case within the pedigree.

We start by generating genotype variables from the alleles at the various loci:

egen g1=gtype(L1_1 L1_2)

This creates a genotype variable named g1 which contains the genotype corresponding to the alleles contained in the variables L1_1 and L1_2. This is the most general way to create a genotype variable, and the required way for multiallelic loci. Repeat this for the other loci:

egen g2=gtype(L2_1 L2_2)
egen g3=gtype(L3_1 L3_2)
egen g4=gtype(L4_1 L4_2)
egen g5=gtype(L5_1 L5_2)

You can take a look at the genotype variables you have created using the data browser. We find that locus 4 and 5, which are diallelic, have 3 possible genotypes. Locus 1, 2 and 3 have a larger number of genotypes, which would lead to rather too many variables in the regression equation, and so we group together genotypes that occur at low frequencies using

grprare g1, gen(newg1) min(0.025)
grprare g2, gen(newg2) min(0.04)
grprare g3, gen(newg3) min(0.01)

This generates new variables newg1, newg2, newg3 with fewer possible genotypes than the original variables g1, g2, g3. Note that we used different frequency cut-offs for the three loci (2.5%, 4% and 1% respectively). These were chosen after some trial and error to give groupings that did not lead to too many variables in the regression equations. In a real study we might wish to make decisions concerning the grouping of genotypes on biological grounds rather than purely statistical convenience.
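The general idea of pooling rare genotypes below a frequency cut-off can be sketched in base R as shown here. This is an illustration only and makes no claim to reproduce the exact grouping rules of grprare; the function name group_rare, the label "other" and the genotype counts are all hypothetical.

```r
# Pool all genotypes whose relative frequency is below min_freq
group_rare <- function(g, min_freq) {
  freq <- table(g) / sum(!is.na(g))        # relative frequency of each genotype
  rare <- names(freq)[freq < min_freq]     # genotypes below the cut-off
  out  <- as.character(g)
  out[out %in% rare] <- "other"            # pool them into one category
  factor(out)
}

# Hypothetical genotype vector of 100 individuals
g <- c(rep("1/1", 50), rep("1/2", 40), rep("2/3", 7), rep("3/4", 2), "4/4")

newg <- group_rare(g, min_freq = 0.05)
print(table(g))
print(table(newg))
```

With a 5% cut-off, the two rarest genotypes (2% and 1%) are pooled, so five genotype categories become four and the regression needs one fewer parameter.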

Check that you understand how the new grouped genotype variables correspond to the old variables by typing:

list g1 newg1

To analyse the effect of locus 1 using a general genotype test (i.e. without assuming multiplicative effects of alleles), use the following command (noting that we use the new genotype variable newg1 that we have just created):

xi:clogit case i.newg1, group(set) or

The clogit command is similar to the logit command for (unmatched) case/control data, but tells the program to perform conditional as opposed to unconditional logistic regression, with "case" status as the outcome and the genotype variable as the predictor. For conditional logistic regression we need to tell Stata which cases are matched to which controls using the group(set) option. The or option tells Stata to output the regression coefficients in terms of odds ratios relative to a baseline genotype. The xi: and i. syntax indicates that we are considering genotype as a categorical variable. This is because the coding of genotypes as 1, 2, 3 etc. is essentially arbitrary: we do not want to fit a regression model with a single coefficient that gets multiplied by the value of the genotype variable, as this would assume that the effect of genotype 10, say, is ten times the effect of genotype 1. Instead Stata will automatically create the appropriate number of temporary dummy variables (which is equal to the number of genotypes minus 1) and carry out the regression with these as predictors of disease risk. These temporary variables may be called something like

_Inewg1_2 _Inewg1_4

etc. in the Stata variables window.
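The same dummy-variable coding can be seen in base R, which may help clarify what the xi: prefix is doing. The sketch below uses a made-up genotype variable with three categories; the values are hypothetical and the R column names differ from the Stata names shown above.

```r
# Hypothetical grouped genotype variable with categories 1, 2 and 4
newg1 <- factor(c(1, 2, 4, 2, 1, 4))

# Treatment (indicator) coding: one column per non-baseline category
X <- model.matrix(~ newg1)

# Drop the intercept column to leave only the dummy variables
dummies <- X[, -1, drop = FALSE]
print(dummies)
```

With three genotype categories there are two dummy columns (number of genotypes minus 1), and the omitted first category acts as the baseline against which odds ratios are expressed.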

The significance of the test is found at the top right on the line "Prob > chi2" just above the table of estimated regression coefficients. You should obtain a chi-squared on 9 df of 203.06 with a p value of 0.0000 (i.e. <0.00005). This is the significance test for comparing the model where the newg1 variable is in the regression equation compared to a model where it is not, i.e. comparing a model where genotype at locus 1 is a predictor of disease to a model where it does not predict disease.

You should find that you get very similar results to the results you got from the gtrr analysis. The only differences should be because of the fact that individuals with missing data at any of the 5 loci have been removed (they were not written out to the "casepseudocon.dta" file) and also, the grouping of genotypes at locus 1 and 2 may not correspond exactly to the default grouping in the gtrr analysis.

Now do the analysis for the other loci:

xi:clogit case i.newg2, group(set) or
xi:clogit case i.newg3, group(set) or
xi:clogit case i.g4, group(set) or
xi:clogit case i.g5, group(set) or

Locus 1, 2 and 5 are sufficiently significant that it is hard to say which is most significant. We will therefore try adding in each of these loci in turn, and seeing what is the effect of subsequently adding additional loci. To look at the effect of locus 2 given the effect at locus 1 type:

xi:clogit case i.newg1 i.newg2, group(set) or
testparm *newg2*

This compares a model where disease is predicted by genotypes at locus 1 and 2 to a model where only genotypes at locus 1 are important. We obtain a chi-squared of 11.43 on 6 df, p=0.0760. To look at the effect of each other locus given the effect at locus 1 type:

xi:clogit case i.newg1 i.newg3, group(set) or
testparm *newg3*
xi:clogit case i.newg1 i.g4, group(set) or
testparm *g4*
xi:clogit case i.newg1 i.g5, group(set) or
testparm *g5*

You should find that, given the effect at locus 1, the significances of effects at loci 2-5 have p values 0.08, 0.32, 0.15, 0.32.
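The p value quoted earlier for locus 2 given locus 1 (chi-squared of 11.43 on 6 df) can be checked by hand from the chi-squared distribution. The short base R sketch below does this using the standard pchisq function; only the statistic and degrees of freedom reported in the text are used.

```r
# Test statistic and degrees of freedom reported for locus 2 given locus 1
chisq_stat <- 11.43
df         <- 6

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chisq_stat, df = df, lower.tail = FALSE)
cat("p value:", round(p_value, 4), "\n")
```

The degrees of freedom equal the number of dummy variables added for the locus being tested, which is why they differ between loci.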

You can probably see how you could use a forwards or backwards procedure to enter or delete loci from the regression equation. In this case, it looks as if nothing is very significant once you have accounted for effects at locus 1. To see if locus 1 is important once you have accounted for effects at other loci, use

xi:clogit case i.newg1 i.newg2, group(set) or
testparm *newg1*
xi:clogit case i.newg1 i.newg3, group(set) or
testparm *newg1*
xi:clogit case i.newg1 i.g4, group(set) or
testparm *newg1*
xi:clogit case i.newg1 i.g5, group(set) or
testparm *newg1*

You should find that locus 1 is highly significant even once each other locus is accounted for, suggesting that it is locus 1 that is driving the association in this region.

We can also use the case/pseudocontrol approach to fit models where the disease risks depend on phased haplotypes. In order to fit models that include haplotype effects and/or estimate these effects, we must only use data in which phase can be determined from the case-parents trio and at least one of the possible matched pseudocontrols. Note that our current software only deals with observed haplotypes i.e. no EM estimation of ambiguous haplotypes or missing data is performed.

First read in the original data set again by typing

clear
ginsheet using fiveloci.ped, preped zmiss

Suppose we are just interested in looking at haplotypes for loci 1, 2 and 5 (which were the most significant in the single locus analysis). To generate the appropriate cases and pseudocontrols, ensuring that phase is conserved, type:

pseudocc L1_* L2_* L5_*, saving(casepseudophased) phase

This saves a Stata data file "casepseudophased.dta" in the current directory. We can now get rid of our pedigree data set and read in the case/pseudocontrol data set that we saved by typing

clear
use casepseudophased

Click on the data browser button to see what the case/pseudocontrol file looks like. You should find that some sets consist of one case and 3 pseudocontrols, other sets have only one pseudocontrol, and some families have been discarded entirely. This is because the method has discarded pseudocontrols and families for which phase is not inferrable. See Cordell and Clayton (2002) for details.

Locus 1 and 2 are multiallelic. As it happens, prior evidence has suggested that there are particular alleles at locus 1 and 2 that are associated with disease. We may therefore focus our attention on these alleles by recoding all other alleles as a single value. We can do this as follows:

replace L1_1=2 if L1_1>2
replace L1_2=2 if L1_2>2
replace L2_1=2 if L2_1>2
replace L2_2=2 if L2_2>2

This allows us to look at the effects of allele 1 versus all other alleles, at each locus.
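For readers who find R easier to follow, the same kind of recoding (allele 1 kept, every other allele collapsed to the value 2) can be sketched in base R. The function name recode_allele and the example allele vector are hypothetical.

```r
# Collapse every allele coded above 2 into the single value 2
recode_allele <- function(x) ifelse(x > 2, 2, x)

# Hypothetical allele codes at a multiallelic locus (NA = missing)
L1_1 <- c(1, 3, 7, 2, NA)

recoded <- recode_allele(L1_1)
print(recoded)
```

In this R version a missing allele stays missing. In Stata, a missing value (".") compares as larger than any number, so the corresponding replace command would turn it into 2; this should be harmless here because incomplete individuals were discarded when the case/pseudocontrol file was created.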

We can generate "haplotype" variables for mother and father's transmitted haplotypes at the 3 loci by typing

egen hap_1=htype(L1_1 L2_1 L5_1)
egen hap_2=htype(L1_2 L2_2 L5_2), codeas(hap_1)

Take a look at the new variables hap_1 and hap_2 by clicking on the data browser (3rd icon from the right at the top). We can reduce the number of parameters to estimate by assuming that the haplotypes act multiplicatively to cause disease. First we generate new variables that code for the number of copies of the different possible haplotypes:

gtab hap_1 hap_2, gen(count)

This generates 8 new variables named count1 - count8 that count the number of copies of the 8 possible haplotypes present in an individual. For example, if an individual has haplotypes 111 and 112, i.e. one copy of haplotype number 1 and one copy of haplotype number 2, you should find that count1=1 and count2=1, while all other count variables equal 0. If instead an individual was homozygous for the 222 haplotype, you should find that count8=2, while all other count variables equal 0. To look at the effect of the haplotypes in a conditional logistic regression, assuming they act multiplicatively, and all relative to haplotype 1 (=111) type:

clogit case count2 count3 count4 count5 count6 count7 count8, group(set) or
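The way haplotype copies are counted for the count variables can be illustrated with a few lines of base R. This is a sketch of the counting idea only, not the gtab implementation; the three individuals and their haplotype codes are hypothetical.

```r
# Hypothetical haplotype codes (1 to 8) carried by three individuals
hap_1 <- c(1, 8, 2)   # first haplotype of each individual
hap_2 <- c(2, 8, 2)   # second haplotype of each individual
n_hap <- 8

# For each haplotype h, count how many copies (0, 1 or 2) each individual has
counts <- sapply(seq_len(n_hap), function(h) (hap_1 == h) + (hap_2 == h))
colnames(counts) <- paste0("count", seq_len(n_hap))
print(counts)
```

The first individual (haplotypes 1 and 2) has count1=1 and count2=1, and the second (homozygous for haplotype 8) has count8=2. Every row sums to 2, which is why one count variable must be left out of the regression to act as the reference.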

In this case, it turns out there is not enough data to estimate all 8 haplotype effects (since some haplotypes are very rare). This results in some estimated odds ratios that are very large (effectively=infinity) or very small (=0) with problems estimating the confidence intervals (indicated by a "."). There are various ways to get round this. For now we will drop haplotypes 3 and 4 (121 and 122) from the analysis (which is equivalent to assuming they have the same effect as the reference haplotype):

clogit case count2 count5 count6 count7 count8, group(set) or

This calculates the relative risks for the haplotypes relative to haplotype 1 (111). To calculate them relative to haplotype 6 (212) (which is the one with the lowest risk), type:

clogit case count1 count2 count5 count7 count8, group(set) or

From the odds ratios, it looks like haplotype 2 (112) has the strongest effect, and it is the only one that is individually significantly different from haplotype 6 (z=3.18, p=0.001). A global test of the effect of all 6 haplotypes is given by the chi-squared on the top right: 143.76 on 5 df. This is highly significant, as signified by "Prob > chi2 = 0.0000".

We have now finished the Stata analysis. Type

log close

to close the log file in which there will be a record of all the commands you typed and all the results.

Then type

clear
exit

to get out of Stata and close the session.

2. Analysis in R

If you have time, you may like to repeat some of the analyses that you performed in the statistical package Stata in the statistical package R. As mentioned previously, the R package is somewhat less user-friendly than Stata (particularly with regards to the output), but has the advantage of being free, and in addition a lot of bioinformatics software utilities have been developed as add-in functions for the R package.

Before we start, take a look at the data file that we shall use:

fiveloci.Rped

This differs from the pedigree file we used in the Stata analysis by the addition of a header line describing what each of the variables are, and by the changing of any "0"s to R's own missing value code "NA". Note that the names given to the first 6 variables in the header line are not optional: they must be called

pedigree id id.father id.mother sex affected

for the add-in utilities for the R package to work correctly.

To start up the R package, type (from the unix prompt):

R

You are now working within the R package. To start with, you need to read in the necessary add-in libraries (which we have downloaded ready for you):

library(dgc.genetics)

To read your data into a dataframe called "family", type

family <- read.table("fiveloci.Rped", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe "family" and you can look at it simply by typing

family

or by typing

fix(family)

As with the case/control exercise, each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

family$pedigree

It can be inconvenient to have to type "family$" before typing every variable, so you can tell R to automatically look at variables in the family dataframe by typing:

attach(family)

Now you can just look at the column of pedigree names by typing

pedigree

To perform association analysis, we need to convert the variables corresponding to the two alleles at each locus into a genotype variable for each locus. This can be done e.g. for locus 1 by:

g1 <- genotype(loc1_1, loc1_2)

To look at the variable you have just created, type:

g1

To perform a TDT analysis on this locus, type

tdt(g1)

As with the tdt function in Stata, this gives a global multiallelic test (chi-squared test = 278.1187 on 22 df, P-value = <2e-16) as well as individual tests for each allele versus all others.

Repeat the analysis for locus 5 (which is diallelic):

g5 <- genotype(loc5_1, loc5_2)
tdt(g5)

To create a case/pseudocontrol set for analysis at this locus, type:

psccloc5 <- pseudocc(g5, data=family)

This creates a new dataframe called "psccloc5" which contains cases each with 3 matched pseudocontrols. To look at the dataframe you just created, type:

fix(psccloc5)

Note that within the "psccloc5" dataframe, the case/control 1/0 indicator variable is called "cc" and the genotype variable is called "g5". To clear the old "family" dataframe and old genotype variables from the memory, type:

detach(family)
rm(g1)
rm(g5)

and then read in the "psccloc5" dataframe as the default:

attach(psccloc5)

To analyse using conditional logistic regression, assuming either a 2df (genotype) test or a 1df (allele) test, type:

gcontrasts(g5) <- "genotype"
clogit(cc ~ g5 + strata(set))
gcontrasts(g5) <- "additive"
clogit(cc ~ g5 + strata(set))

This will perform the analysis on the g5 variable that is currently in the default memory, i.e. the g5 variable in the "psccloc5" dataframe, also known as psccloc5$g5. The strata(set) option indicates that the set variable labels matched cases and pseudocontrols.

The results should be very similar to what you found in your Stata analysis of locus 5, a highly significant likelihood ratio test. The likelihood ratio test is slightly different because in the Stata analysis we created case/pseudocontrol sets considering all five loci simultaneously whereas just now we only considered locus 5.

The relative risk parameters labelled exp(coef) correspond to those labelled Odds Ratio in the Stata analysis. In Stata we found Odds Ratios of 2.53 and 5.13 for the 1/2 and 2/2 genotype relative to the 1/1 genotype, whereas in R we find values of 0.421 and 2.176 for the 1/1 and 2/2 genotype relative to the 1/2 genotype. So relative to the 1/1 genotype we have values of 1.0/0.421 = 2.38 and 2.176/0.421 = 5.17 respectively, very similar to what we found in Stata.
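The change of reference genotype described above is just a division of odds ratios, which can be written out in base R. The sketch below uses only the R estimates quoted in the text; the vector names are chosen for this illustration.

```r
# Odds ratios from R, expressed relative to the 1/2 genotype
or_vs_12 <- c("1/1" = 0.421, "1/2" = 1, "2/2" = 2.176)

# Re-express them relative to the 1/1 genotype by dividing through
or_vs_11 <- or_vs_12 / or_vs_12["1/1"]
print(round(or_vs_11, 2))
```

Dividing every odds ratio by the odds ratio of the new baseline leaves the ratios between genotypes unchanged, so the two parameterisations describe the same fitted model.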

For analysis at more than one locus (e.g. locus 4 and 5, say) we need to clear the memory and get back to our original "family" dataframe:

detach(psccloc5)
attach(family)

Now we need to create the relevant genotype variables and case/pseudocontrol datasets. We will create two different case/pseudocontrol datasets, one in which we do not keep track of phase information, and one in which we condition on phase being known (in order to fit models where the disease risk depends on phase)

g4 <- genotype(loc4_1, loc4_2)
g5 <- genotype(loc5_1, loc5_2)
psccphase <- pseudocc(g4, g5, phase=TRUE, data=family)
psccnophase <- pseudocc(g4, g5, phase=FALSE, data=family)

These new dataframes can be looked at by typing

fix(psccphase)
fix(psccnophase)

For the "psccnophase" data, when phase information is not kept, all families are used but only 2 pseudocontrols are generated per case. This is because the R functions for case/pseudocontrol analysis are not yet as well developed as the Stata functions and so the more efficient creation of 3 pseudocontrols per case in this situation is not yet implemented.

For the "psccphase" data, some sets consist of one case and 3 pseudocontrols, other sets have only one pseudocontrol, and some families have been discarded entirely. This is because the method has discarded pseudocontrols and families for which phase is not inferrable. See Cordell and Clayton (2002) for details.

To analyse the "psccnophase" data, first clear the old "family" dataframe and associated genotype variables from the memory and read in the new dataframe as default:

detach(family)
rm(g4)
rm(g5)
attach(psccnophase)

To analyse each locus individually with a 2df test, type

gcontrasts(g4) <- "genotype"
clogit(cc ~ g4 + strata(set))
gcontrasts(g5) <- "genotype"
clogit(cc ~ g5 + strata(set))

To see whether locus 4 is significant once locus 5 is in the regression equation, use the following sequence of commands:

gcontrasts(g4) <- "genotype"
gcontrasts(g5) <- "genotype"
fullmodel <- clogit(cc ~ g5 + g4 + strata(set))
restrictedmodel <- clogit(cc ~ g5 + strata(set))
anova(restrictedmodel, fullmodel)

You should find a difference between the models reported as a deviance of 17.78 on 2df. To find the significance of this, use:

1-pchisq(17.78,2)

which gives you a p value of around 0.00014.
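An equivalent way of obtaining this p value in base R is to ask pchisq for the upper tail directly, which avoids subtracting a number very close to 1. The sketch below compares the two forms, using only the deviance and degrees of freedom quoted above; for 2 df the upper tail also has the simple closed form exp(-x/2), which is included as a check.

```r
deviance_diff <- 17.78   # difference in deviance between the two models
df            <- 2       # number of extra parameters in the full model

p_subtract <- 1 - pchisq(deviance_diff, df)                     # form used above
p_upper    <- pchisq(deviance_diff, df, lower.tail = FALSE)     # direct upper tail
p_closed   <- exp(-deviance_diff / 2)                           # exact for 2 df only

print(c(subtract = p_subtract, upper = p_upper, closed = p_closed))
```

All three agree here. The lower.tail = FALSE form is the safer habit for very large test statistics, where 1 minus a probability close to 1 loses numerical precision.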

To fit a model for phase-known haplotypes at these loci, read in the file "psccphase":

detach(psccnophase)
attach(psccphase)

The "psccphase" dataframe contains a two-locus phased genotype variable called "g4.g5". To fit a multiplicative model (equivalent to an additive model on the log odds scale) for the haplotypes type:

gcontrasts(g4.g5) <- "additive"
clogit(cc ~ g4.g5 + strata(set))

This gives a highly significant global test of 67.3 on 3df (p=1.63e-14) for the effects of the 3 haplotypes (relative to the 1:2 haplotype). The individual haplotype odds ratios are given in the column marked "exp(coef)". It is seen that the 1:1 haplotype in particular has a significantly lower risk than the 1:2 haplotype.

Once you are ready to finish with R, type

q()
n

to get out.

Answers

How to interpret the output

Interpretation of the output in Stata and R is described in the step-by-step instructions. In general, the output will consist of a likelihood-ratio or chi-squared test for whatever test you are performing, and regression coefficients or odds ratio estimates for the predictor variables in the current model. Please ask if you need help in understanding the output for any specific test.

Tips/Tricks

Comments

Advantages/disadvantages

Analysis in a standard statistical package has the advantage of allowing a lot of extra flexibility with regards to the models and analyses performed. However, you are required to know or learn how to use the package in order to gain that extra flexibility, and to produce reliable results.

The case/pseudocontrol approach has the advantage that it is not affected by bias when analysing uncertain haplotypes. It has the disadvantage that missing data is not incorporated into the analysis: trios with any missing data at the loci under consideration are simply discarded.

Study design issues

Family-based designs should be robust to population stratification.

Other packages

TDT analysis can be performed in a variety of other packages. A form of TDT analysis of haplotypes is performed by the TRANSMIT program by David Clayton. The only other package that performs a form of case/pseudocontrol analysis is the UNPHASED program by Frank Dudbridge. For testing (but not estimation) of genotype or haplotype association effects in families, one can use the PDT or FBAT or PBAT programs.

References

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70: 124-141.

Cordell HJ, Barratt BJ and Clayton DG (2004) Case/pseudocontrol analysis in genetic association studies: a unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions and parent-of-origin effects. Genetic Epidemiology 26:167-185.

Dudbridge F (2003) Pedigree disequilibrium tests for multilocus haplotypes. Genet Epidemiol 25:115-21.

Horvath S, Xu X and Laird N (2001) The family based association test method: strategies for studying general genotype-phenotype associations. Euro J Hum Gen 9: 301-306

Lake S, Blacker D and Laird N (2000) Family-based tests of association in the presence of linkage. Amer J Hum Gen 67:1515-1525.

Martin ER, Monks SA, Warren LL, Kaplan NL (2000) A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am J Hum Genet 67:146-154

Schaid DJ. 1996. General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol 13:423-449.

Spielman RS, McGinnis RE, Ewens WJ. 1993. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM) Am J Hum Genet 52:455-466.

Exercises prepared by: Heather Cordell
Checked by:
Programs used: Stata, R, genassoc, dgc.genetics
Last updated: Tue, 19 Jul 2005 15:48:15 GMT