Computer Practical Exercise on Family-based Association using the UNPHASED program

Overview

Purpose

In this exercise you will be carrying out family-based association analysis of five linked loci in the HLA region with type 1 diabetes, using a set of case-parent trios. The purpose is to detect which (if any) of the loci are associated with disease, and to estimate their effects.

Methodology

We will use the UNPHASED program written by Frank Dudbridge:

http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/

This program implements a TDT or case/pseudocontrol type analysis for nuclear family data, although it can also be used to analyse case/control data, or quantitative trait data from families and/or unrelated individuals.
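For orientation, the classical TDT for a single diallelic marker can be sketched in a few lines of Python. This is only an illustration of the basic idea (counting transmissions from heterozygous parents and applying a McNemar-type statistic); UNPHASED itself uses a likelihood-based approach, so its output will not be computed this way. The function names and the toy trios are invented for this example.

```python
def transmitted_counts(trios, allele=2):
    """Count transmissions of `allele` from heterozygous parents.

    Each trio is (father, mother, child); each genotype is a pair of
    allele codes (1 or 2). Returns (b, c), where b is the number of
    heterozygous parents who transmitted `allele` and c is the number
    who transmitted the other allele.
    """
    b = c = 0
    for father, mother, child in trios:
        # Work out which child allele came from which parent.
        for from_f, from_m in (child, child[::-1]):
            if from_f in father and from_m in mother:
                break
        else:
            continue  # Mendelian inconsistency: skip this trio
        for parent, passed in ((father, from_f), (mother, from_m)):
            if parent[0] != parent[1]:  # only heterozygous parents count
                if passed == allele:
                    b += 1
                else:
                    c += 1
    return b, c


def tdt_statistic(b, c):
    """McNemar-type TDT statistic, chi-squared on 1 df (needs b + c > 0)."""
    return (b - c) ** 2 / (b + c)


# Toy trios, invented for illustration (not taken from fiveloci.txt).
toy_trios = [
    ((1, 2), (1, 1), (2, 1)),
    ((1, 2), (1, 2), (2, 2)),
    ((2, 2), (1, 2), (2, 2)),
    ((1, 2), (1, 1), (1, 1)),
]
b, c = transmitted_counts(toy_trios)
print(b, c, tdt_statistic(b, c))  # 4 1 1.8
```

With real data you would compare the statistic with a chi-squared distribution on 1 df. Multiallelic markers and missing parental genotypes need the more general methods that UNPHASED provides.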

UNPHASED documentation

Full documentation can be found at

http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/

The program runs either in command-line mode, or through a Java-based graphical interface. We will be using the command-line mode, since this is easier to explain and allows greater control over the desired options.

Data overview

We will be using family data consisting of a number of trio families with an affected diabetic child plus parents (of unknown disease status) all of whom are typed at 5 polymorphisms in the HLA region.

Appropriate data

Appropriate data for this exercise is genotype data at a set of linked loci, typed in a number of nuclear families and/or unrelated individuals. It is also possible to use larger families (extended pedigrees); however, they will automatically be broken into nuclear families, which are then treated as independent in the analysis. The offspring and unrelated individuals should be phenotyped for either a dichotomous trait or a quantitative trait of interest.


Instructions

Data files

You should save the following file to an appropriate directory (folder) on your machine:

fiveloci.txt

Data format

The data is in standard pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Missing data is coded with a zero.

If you are unfamiliar with the standard pedigree file format (which is a commonly-used format for many linkage analysis programs) and need more explanation, please ask an instructor.
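To make the column layout concrete, here is a minimal Python sketch that splits one pedigree-file line into its six leading fields and a list of per-locus genotypes. It assumes whitespace-separated columns; the Person type, the function name and the example line are invented for illustration and are not part of UNPHASED.

```python
from typing import NamedTuple


class Person(NamedTuple):
    family: str
    subject: str
    father: str
    mother: str
    sex: int        # 1 = male, 2 = female
    affected: int   # 1 = unaffected, 2 = affected
    genotypes: list  # one (allele1, allele2) pair per locus; "0" = missing


def parse_pedigree_line(line):
    """Parse one line of a pedigree file into a Person record."""
    fields = line.split()
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two allele columns per locus")
    # Pair up consecutive allele columns: one pair per locus.
    genotypes = [(alleles[i], alleles[i + 1])
                 for i in range(0, len(alleles), 2)]
    return Person(fields[0], fields[1], fields[2], fields[3],
                  int(fields[4]), int(fields[5]), genotypes)


# Invented example line (not from fiveloci.txt): an affected daughter
# typed at two loci, with the second locus missing.
child = parse_pedigree_line("1 3 1 2 2 2 4 7 0 0")
print(child.affected, child.genotypes)
```

For fiveloci.txt you would expect ten allele columns (five loci) after the first six fields on each line.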

Step-by-step instructions

Move into the directory where you saved the data files e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

We will start by analysing each locus in fiveloci.txt individually. To perform this analysis type:

unphased fiveloci.txt

It may be difficult to see all the results as they flash by on the screen. To re-run the analysis and save the output to a file indiv.txt, type:

unphased fiveloci.txt > indiv.txt

Once the program has finished running, you can check that a new output file indiv.txt has been created by listing the directory contents with the ls command.

Take a look at the output file (e.g. using the more command). First comes some program-specific information and then come the results (i.e. tests of association and estimates of the allelic or haplotypic odds ratios) for each marker. The most important information is the likelihood ratio chi-squared test and p-value that come under the heading TEST OF OVERALL ASSOCIATION. Note that markers 1-3 are multiallelic, and so a global test of association with any/all alleles at the marker is provided by default.

Marker locus 1 shows highly significant association (a likelihood ratio statistic (LRS) of 333.4 on 22 df, p = 1.969e-57), but from the case and control frequencies and estimated relative risks (RR) you can see that some alleles are so rare that their risks cannot be estimated. The same is true for marker 2. Marker 3 has similar problems with rare alleles and also shows a less significant result (p = 0.01226). Markers 4 and 5 are diallelic, so they do not have problems with rare alleles. The result for marker 4 is only modestly significant (p = 0.01156), but the result for marker 5 is highly significant (p = 1.82e-12).
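If you want to see how a p-value follows from a likelihood ratio statistic and its degrees of freedom, the chi-squared upper tail can be computed with the Python standard library alone. The sketch below uses the standard finite-series expressions for integer degrees of freedom; the function name is ours, and this is a check on the arithmetic rather than a description of how UNPHASED computes its p-values.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-squared with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: a normal-tail term plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


# Marker 1: LRS of 333.4 on 22 df, as reported in the output.
print(chi2_sf(333.4, 22))  # of the order of 1e-57
```

Because the LRS is quoted to only one decimal place, the value printed here will agree with the p-value in the output to about two significant figures rather than exactly.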

Let us repeat the analysis but dropping the rare alleles (those with frequency less than 2%):

unphased fiveloci.txt -zero 0.02 > indivdroprare.txt

Take a look at the resulting output, which is a bit more interpretable than the previous results. You should again see highly significant results at markers 1, 2 and 5, and less significant results at markers 3 and 4.
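To get a feel for which alleles fall under a 2% threshold, you could count alleles directly, as in the Python sketch below. This is naive counting over whatever genotypes you pass in; the frequencies UNPHASED uses come from its own analysis and may not match such counts exactly. The function name and the toy genotypes are invented for illustration.

```python
from collections import Counter


def rare_alleles(genotypes, threshold=0.02):
    """Return the allele codes whose sample frequency is below threshold.

    `genotypes` is a list of (allele1, allele2) pairs at one locus,
    with "0" marking a missing allele.
    """
    counts = Counter(a for pair in genotypes for a in pair if a != "0")
    total = sum(counts.values())
    return {a for a, n in counts.items() if n / total < threshold}


# Toy genotypes (invented): allele "3" appears once among 200 alleles.
toy = [("1", "2")] * 99 + [("3", "1")]
print(rare_alleles(toy))  # {'3'}
```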

Loci 1, 2 and 5 are so strongly associated that it is hard to say which is most significant. We will therefore try adding in each locus in turn, and then look at the effect of subsequently adding additional loci. To look at the effects of alleles at each of loci 2-5, given the effect of alleles at locus 1, type:

unphased fiveloci.txt -zero 0.02 -condition 1 -model allelemain > cond1allele.txt

This compares a model where disease is predicted by alleles at locus 1 and alleles at the test locus (one of loci 2-5) to a model where only alleles at locus 1 are important. The haplotype RR estimates show estimated RRs for various combinations of alleles at the two loci (locus 1 and the current test locus).

The results for locus 1 are meaningless, since we are already conditioning on marker 1, so it does not make sense to think about adding locus 1 to the model. The results for marker 2 show significance (p = 3.259e-5), indicating that marker 2 is still important even once marker 1 is included in the model. The results for marker 3 also show some significance (p = 0.0004789), indicating that marker 3 adds some improvement to the model (additional to the association accounted for by marker 1). Markers 4 and 5 also show significance when added to the model that includes locus 1. However, the significance for locus 5 is nowhere near as strong as the original significance we found in the single-locus analyses.

To repeat the conditional analysis conditioning on each of the other loci, type:

unphased fiveloci.txt -zero 0.02 -condition 2 -model allelemain > cond2allele.txt
unphased fiveloci.txt -zero 0.02 -condition 3 -model allelemain > cond3allele.txt
unphased fiveloci.txt -zero 0.02 -condition 4 -model allelemain > cond4allele.txt
unphased fiveloci.txt -zero 0.02 -condition 5 -model allelemain > cond5allele.txt


You should find that: given locus 2, locus 1 (and to a lesser extent loci 3 and 5) is still significant; given locus 3, loci 1, 2 and 5 are still significant; given locus 4, loci 1, 2 and 5 are still significant; and given locus 5, loci 1, 2 and (to a lesser extent) 4 are still significant.
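If you prefer not to type the four conditional commands by hand, a short Python wrapper can build and run them. The sketch below uses only the options shown in this exercise; the function names are ours, and it assumes that the unphased executable is on your path and that fiveloci.txt is in the current directory.

```python
import subprocess


def conditional_command(locus, data_file="fiveloci.txt"):
    """Build the command line for conditioning on a single locus."""
    return ["unphased", data_file, "-zero", "0.02",
            "-condition", str(locus), "-model", "allelemain"]


def run_all(loci=(2, 3, 4, 5)):
    """Run each conditional analysis, saving output to condNallele.txt."""
    for locus in loci:
        with open(f"cond{locus}allele.txt", "w") as out:
            subprocess.run(conditional_command(locus), stdout=out, check=True)


# Show the commands that run_all() would execute.
for locus in (2, 3, 4, 5):
    print(" ".join(conditional_command(locus)))
```

Calling run_all() from the directory containing the data file should produce the same four output files as the commands above.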

The most significant result in the original (single-locus) analysis was seen at locus 1, so perhaps it makes most sense to put this locus in the model first. Once locus 1 is in the model, the most significant result was at locus 4, so we should next add locus 4 to the model. Once we have included both locus 1 and locus 4 in the model, we can test whether any other loci are still significant using the following command:

unphased fiveloci.txt -zero 0.02 -condition 1 4 -model allelemain > cond14allele.txt

Again the results for testing at loci 1 and 4 no longer make sense, but we see that marker 5 is still significant (p = 0.0007662). We could add marker 5 to the model and continue testing other loci, but the analysis starts to get quite slow, so in the interests of time we will not do this today.
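The forward stepwise logic used here (add the most significant remaining locus, then re-test the others given the current model) can be written down as a small Python helper. This is a sketch of the selection rule only: the p-values must still come from separate UNPHASED runs, the 0.05 stopping threshold is an assumption of ours, and the loci "A", "B", "C" and their p-values are placeholders rather than results from fiveloci.txt.

```python
def next_locus(p_values, in_model, alpha=0.05):
    """Pick the most significant locus not yet in the model.

    `p_values` maps locus -> p-value from a test conditional on the loci
    in `in_model`. Returns None if no remaining locus reaches `alpha`.
    """
    candidates = {locus: p for locus, p in p_values.items()
                  if locus not in in_model}
    if not candidates:
        return None
    best = min(candidates, key=candidates.get)
    return best if candidates[best] < alpha else None


# Placeholder p-values (invented) from a run conditioning on locus "A".
# The entry for "A" itself is ignored, since it is already in the model.
placeholder = {"A": 1.0, "B": 0.2, "C": 0.001}
print(next_locus(placeholder, in_model={"A"}))  # C
```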


You can also look at the effects of alleles at a locus, given the effect of genotype (rather than just alleles) at another locus. For example, to test whether alleles at locus 4 are still significant once you have accounted for genotype at locus 5, type:

unphased fiveloci.txt -zero 0.02 -condition 5 -marker 4 -condgenotype -model allelemain > cond5geno.txt

To test whether alleles at locus 5 are still significant once you have accounted for genotype at locus 4, type:

unphased fiveloci.txt -zero 0.02 -condition 4 -marker 5 -condgenotype -model allelemain > cond4geno.txt

Theoretically you can do similar tests for the other loci, but in practice the program gets very slow owing to the large number of possible genotypes at these other loci.

UNPHASED also has the ability to automatically test haplotypes at groups of markers in a sliding window. For instance, to test 3 markers at a time, type

unphased fiveloci.txt -zero 0.02 -window 3 > window.txt

This analysis shows highly significant association for each group of markers tested. However, we found from our previous results that much of this highly significant association can probably be accounted for by marker 1 alone.
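To see which groups of markers a sliding window of width 3 covers, here is a small Python sketch. It assumes that windows are made up of consecutive markers in file order, numbered from 1; the function name is ours.

```python
def sliding_windows(n_markers, width):
    """List the marker groups covered by a sliding window of given width."""
    return [tuple(range(start, start + width))
            for start in range(1, n_markers - width + 2)]


print(sliding_windows(5, 3))  # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
```

Under this assumption, five markers and a window of 3 give three haplotype tests, and marker 1 appears only in the first of them.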

To estimate specific haplotype relative risks, e.g. for haplotypes at loci 4 and 5, type:

unphased fiveloci.txt -zero 0.02 -marker 4 5 -window 2 > window45.txt

You can see that haplotype 1-1 has the lowest risk of the four haplotypes, and haplotype 2-2 the highest. The haplotype risks are all measured relative to the (unknown) risk for haplotype 1-1. These results are similar to what you should have found when you analysed this same data in R.




Answers

How to interpret the output

Interpretation of the output is described in the step-by-step instructions. In general, the output will consist of a likelihood-ratio or chi-squared test for whatever test you are performing, and regression coefficients or odds ratio estimates for the predictor variables in the current model. Please ask if you need help in understanding the output for any specific test.


Comments

Advantages/disadvantages

UNPHASED is very flexible as it can be used to analyse population-based (e.g. case/control) and/or family data, as well as performing haplotype analysis and analysis of either dichotomous or quantitative traits.

Study design issues

Family data has the advantage of being generally more robust (than case/control data) to population stratification. It also allows investigation of more complex effects, e.g. imprinting. But it may be harder to collect families than cases and controls.

Other packages

TDT analysis can be performed in a variety of other packages, including statistical packages such as R and Stata. TDTae (TDT allowing for error) is a program that performs a TDT allowing for possible genotyping errors. A form of TDT analysis of haplotypes is performed by the TRANSMIT program by David Clayton. For testing (but not estimation) of genotype or haplotype association effects in families, one can use the PDT, FBAT or PBAT programs.


References

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70: 124-141.

Cordell HJ, Barratt BJ and Clayton DG (2004) Case/pseudocontrol analysis in genetic association studies: a unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions and parent-of-origin effects. Genetic Epidemiology 26:167-185.

Dudbridge F (2003) Pedigree disequilibrium tests for multilocus haplotypes. Genet Epidemiol 25:115-21.

Dudbridge F (2008) Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data. Hum Hered 66:87-98.

Schaid DJ (1996) General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol 13:423-449.

Spielman RS, McGinnis RE and Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52:455-466.


Exercises prepared by: Heather Cordell
Checked by:
Programs used: Unphased
Last updated: