Computer Practical Exercise on family-based association using the UNPHASED program

Purpose

In this exercise you will carry out family-based association analysis of five linked loci in the HLA region with type 1 diabetes, using a set of case-parent trios. The purpose is to detect which (if any) of the loci are associated with disease, and to estimate their effects.

Methodology

We will use the UNPHASED program written by Frank Dudbridge:

http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/

Full documentation can be found at the same address.

This program runs either in command-line mode, or through a Java-based graphical interface. We will be using the command-line mode, since this is easier to explain and allows greater control over the desired options.

Data overview

We will be using family data consisting of a number of case-parent trios: an affected (diabetic) child plus both parents (of unknown disease status), all of whom are typed at 5 polymorphisms in the HLA region.

Appropriate data

Appropriate data for this exercise is genotype data at a set of linked loci, typed in a number of nuclear families and/or unrelated individuals. It is also possible to use larger families (extended pedigrees); however, these will automatically be broken into nuclear families, which are then treated as independent in the analysis. The offspring and unrelated individuals should be phenotyped for either a dichotomous trait or a quantitative trait of interest.

Instructions

Data file

fiveloci.ped

This should be saved to the appropriate directory on your machine.

Data format

The data is in standard pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected), followed by two columns per locus (one for each allele of the genotype). With five loci this gives 16 columns in total. Missing data is coded with a zero.

If you are unfamiliar with the standard pedigree file format (which is used for many linkage analysis programs) and need more details, please ask an instructor.
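To make the column layout concrete, here is a minimal Python sketch that splits one pedigree-file line into its fields. It assumes whitespace-separated columns; the names PedRecord and parse_ped_line, and the example line itself, are made up for illustration and are not taken from fiveloci.ped or from UNPHASED.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    family_id: str
    subject_id: str
    father_id: str
    mother_id: str
    sex: int        # 1 = male, 2 = female
    affected: int   # 1 = unaffected, 2 = affected
    genotypes: List[Tuple[str, str]]  # one allele pair per locus, "0" = missing


def parse_ped_line(line: str) -> PedRecord:
    """Split one whitespace-separated pedigree line into named fields."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus two columns per locus")
    alleles = fields[6:]
    # Pair up consecutive allele columns: (locus 1), (locus 2), ...
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(fields[0], fields[1], fields[2], fields[3],
                     int(fields[4]), int(fields[5]), genotypes)


# Hypothetical affected daughter in family 1, typed at five loci
# (one allele missing at the third locus).
example = "1 3 1 2 2 2  3 4  1 1  2 0  1 2  2 2"
record = parse_ped_line(example)
print(record.affected, len(record.genotypes), record.genotypes[2])
```

Alleles are kept as strings so that the missing-data code 0 and multi-digit allele labels are handled the same way.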

Step-by-step instructions

We will start by analysing each locus in fiveloci.ped individually. To perform this analysis type:

unphased fiveloci.ped

The results will probably have flashed by on the screen too quickly to read. To rerun the analysis and save the output to a file indiv.out, type:

unphased fiveloci.ped > indiv.out

Take a look at the output file (e.g. by using the Linux command more indiv.out). Marker locus 1 shows highly significant association (a likelihood ratio statistic (LRS) of 333.4 on 22 df, p = 1.44163e-58), but from the transmission (T) and non-transmission (NT) frequencies and estimated relative risks (RR) you can see that many alleles are so rare that their risks cannot be estimated. The same is true for marker 2. Marker 3 shows a less significant result (p = 0.01226), and the program has problems estimating the RR for allele 4. Markers 4 and 5 are diallelic, so they do not have so many problems with rare alleles. The result for marker 4 is only modestly significant (p = 0.01156), but the result for marker 5 is highly significant (p = 1.82e-12).
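To see how transmission counts relate to a test statistic and a relative risk, the following Python sketch implements the classical transmission/disequilibrium test of Spielman et al. (1993) for one allele at a diallelic marker. This is not what UNPHASED computes (its output is likelihood-based); it is a simpler stand-in for intuition. The function name tdt and the counts in the example are invented for illustration.

```python
import math


def tdt(transmitted: int, not_transmitted: int):
    """Classical TDT for one allele at a diallelic marker.

    Both arguments are counts among heterozygous parents only:
    how often the allele was, or was not, passed to the affected child.
    Returns (chi-square statistic on 1 df, p-value, T/NT ratio).
    """
    b, c = transmitted, not_transmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # The T/NT ratio is a simple relative-risk estimate for the allele;
    # it is undefined (infinite here) if the allele is never untransmitted.
    ratio = b / c if c > 0 else float("inf")
    return chi2, p_value, ratio


# Made-up counts: allele transmitted 60 times, not transmitted 40 times.
chi2, p_value, ratio = tdt(60, 40)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, T/NT = {ratio:.2f}")
```

With very small counts the ratio becomes unstable or infinite, which gives some intuition for why risks for rare alleles are hard to estimate.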

Let us repeat the analysis, this time dropping the rare alleles (those with frequency less than 2%):

unphased fiveloci.ped -zero 0.02 > indivdroprare.out

Take a look at the resulting output, which is a bit more interpretable than the previous results. You should again see highly significant results at markers 1, 2 and 5, and less significant results at markers 3 and 4.
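If you want a rough idea of which alleles fall below a 2% threshold, the Python sketch below counts alleles at one locus directly from pedigree-file lines. It is only an approximation: it counts every typed individual (parents and children alike) and assumes whitespace-separated columns, whereas UNPHASED estimates frequencies within its own model, so the two need not agree exactly. The function name rare_alleles and the toy data are hypothetical.

```python
from collections import Counter


def rare_alleles(ped_lines, locus, threshold=0.02):
    """Return {allele: frequency} for alleles rarer than `threshold`.

    `locus` is 1-based; its two allele columns follow the six
    leading pedigree columns. Missing alleles ("0") are ignored.
    """
    first_col = 6 + 2 * (locus - 1)
    counts = Counter()
    for line in ped_lines:
        fields = line.split()
        if len(fields) < first_col + 2:
            continue  # skip blank or truncated lines
        for allele in fields[first_col:first_col + 2]:
            if allele != "0":
                counts[allele] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {a: n / total for a, n in counts.items() if n / total < threshold}


# Toy data: 50 individuals typed at one locus, with a single copy of allele 7.
toy = [f"{i} 1 0 0 1 1 1 1" for i in range(49)] + ["49 1 0 0 1 1 1 7"]
print(rare_alleles(toy, locus=1))
```

To try it on the exercise data, pass the lines of fiveloci.ped, for example rare_alleles(open("fiveloci.ped"), locus=1).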

Loci 1, 2 and 5 are so strongly associated that it is hard to say which is most significant. We will therefore try including each locus in the model in turn, and then look at the effect of subsequently adding further loci. To look at the effects of alleles at each of loci 2-5, given the effect of alleles at locus 1, type:

unphased fiveloci.ped -zero 0.02 -condition 1 -model allelemain > cond1allele.out

This compares a model where disease is predicted by alleles at locus 1 and alleles at the test locus (one of loci 2-5) to a model where only alleles at locus 1 are important. The haplotype RR estimates show estimated RRs for various combinations of alleles at the two loci (locus 1 and the current test locus).

The results for locus 1 are meaningless, since we are already conditioning on marker 1, so it does not make sense to think about adding locus 1 to the model. The results for marker 2 show no significance (p = 1), indicating that markers 1 and 2 are in such strong LD that once marker 1 is in the model, marker 2 is not needed. The results for marker 3 show slight significance (p = 0.03883), indicating that marker 3 adds some slight improvement to the model (additional to the association accounted for by marker 1). Marker 4 shows greater significance (p = 1.468e-05) and marker 5 also shows significance (p = 0.005889) when added to the model that includes locus 1.

To repeat the conditional analysis conditioning on each of the other loci, type:

unphased fiveloci.ped -zero 0.02 -condition 2 -model allelemain > cond2allele.out
unphased fiveloci.ped -zero 0.02 -condition 3 -model allelemain > cond3allele.out
unphased fiveloci.ped -zero 0.02 -condition 4 -model allelemain > cond4allele.out
unphased fiveloci.ped -zero 0.02 -condition 5 -model allelemain > cond5allele.out

You should find that, given locus 2, loci 1 and 5 are still significant; given locus 3, loci 1, 2 and 5 are still significant; given locus 4, loci 1, 2 and 5 are still significant; and given locus 5, loci 1, 2 and 4 are still significant.
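Since the conditional runs differ only in the locus number, the command lines can be generated rather than typed by hand. The Python sketch below only builds and prints the command strings; it does not run UNPHASED. It uses just the options shown in this exercise, and the function name conditional_commands is hypothetical.

```python
def conditional_commands(ped_file="fiveloci.ped", n_loci=5, zero=0.02):
    """Build one 'condition on locus k' command line per locus."""
    commands = []
    for locus in range(1, n_loci + 1):
        out_file = f"cond{locus}allele.out"
        commands.append(
            f"unphased {ped_file} -zero {zero} "
            f"-condition {locus} -model allelemain > {out_file}"
        )
    return commands


# Print the commands so they can be checked, then pasted into the shell.
for command in conditional_commands():
    print(command)
```

Printing the commands first, instead of executing them directly, makes it easy to check them against the ones given above before running anything.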

You can also look at the effects of alleles at a locus, given the effect of genotype (rather than just alleles) at another locus. For example, to test whether alleles at locus 4 are still significant once you have accounted for genotype at locus 5, type:

unphased fiveloci.ped -zero 0.02 -condition 5 -marker 4 -condgenotype -model allelemain > cond5geno.out

To test whether alleles at locus 5 are still significant once you have accounted for genotype at locus 4, type:

unphased fiveloci.ped -zero 0.02 -condition 4 -marker 5 -condgenotype -model allelemain > cond4geno.out

Theoretically you can do similar tests for the other loci, but in practice the program gets very slow owing to the large number of possible genotypes at these other loci.

UNPHASED can also automatically test haplotypes at groups of markers in a sliding window. For instance, to test 3 markers at a time, type:

unphased fiveloci.ped -zero 0.02 -window 3 > window.out

This analysis shows highly significant association for each group of markers tested. However, we found from our previous analysis that much of this association can probably be accounted for by marker 1 alone.
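As an illustration of what a sliding window covers, the Python sketch below lists the marker groups for a given window size. It assumes that windows consist of consecutive markers in file order and move along one marker at a time; check the UNPHASED output or documentation to confirm which groups were actually tested. The function name sliding_windows is hypothetical.

```python
def sliding_windows(n_markers, window):
    """List the groups of consecutive markers (1-based) in each window."""
    if not 1 <= window <= n_markers:
        raise ValueError("window size must be between 1 and n_markers")
    return [tuple(range(start, start + window))
            for start in range(1, n_markers - window + 2)]


# Five markers, three at a time.
print(sliding_windows(5, 3))
```

Under this assumption, five markers and a window of 3 give three groups, each of which includes at least one of the strongly associated markers 1, 2 and 5.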

To estimate specific haplotype relative risks, e.g. for haplotypes at loci 4 and 5, type:

unphased fiveloci.ped -zero 0.02 -marker 4 5 -window 2 > window45.out

You can see that haplotype 1-1 has the lowest risk of the four haplotypes, and haplotype 2-2 the highest.

References

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70:124-141.

Cordell HJ, Barratt BJ and Clayton DG (2004) Case/pseudocontrol analysis in genetic association studies: a unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions and parent-of-origin effects. Genetic Epidemiology 26:167-185.

Dudbridge F (2003) Pedigree disequilibrium tests for multilocus haplotypes. Genetic Epidemiology 25:115-121.

Schaid DJ (1996) General score tests for associations of genetic markers with disease using cases and their parents. Genetic Epidemiology 13:423-449.

Spielman RS, McGinnis RE and Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). American Journal of Human Genetics 52:455-466.