In this exercise you will carry out family-based association analysis of five linked loci in the HLA region with type 1 diabetes, using a set of case-parent trios. The purpose is to detect which (if any) of the loci are associated with disease, and to estimate their effects.
We will use the UNPHASED program written by Frank Dudbridge:
http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/
Full documentation can be found at
http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/
This program runs either in command-line mode, or through a Java-based
graphical interface. We will be using the command-line mode, since this
is easier to explain and allows greater control over the desired options.
We will be using family data consisting of a number of trio families with an affected diabetic child plus parents (of unknown disease status) all of whom are typed at 5 polymorphisms in the HLA region.
Appropriate data for this exercise is genotype data at a set of linked loci, typed in a number of nuclear families and/or unrelated individuals. It is also possible to use larger families (extended pedigrees), however they will automatically be broken into nuclear families which are then treated as independent in the analysis. The offspring and unrelated individuals should be phenotyped for either a dichotomous trait or a quantitative trait of interest.
The data for this exercise are in the file fiveloci.ped, which should be saved to the appropriate directory on your machine.
The data is in standard pedigree file format, with columns corresponding to
family id, subject id (within family), father's id, mother's id,
sex (1=m, 2=f), affection status (1=unaffected, 2=affected)
and one column for each allele for each locus genotype.
Missing data is coded with a zero.
If you are unfamiliar with the standard pedigree file format (which is used for many linkage analysis programs) and need more details, please ask an instructor.
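As a rough illustration of the column layout described above, the following Python sketch splits one pedigree-file line into its six leading columns and one pair of alleles per locus. The names PedRecord and parse_ped_line, and the example line itself, are made up for illustration; they are not part of UNPHASED or of fiveloci.ped.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    family_id: str
    subject_id: str
    father_id: str
    mother_id: str
    sex: int        # 1 = male, 2 = female
    affection: int  # 1 = unaffected, 2 = affected, 0 = missing
    genotypes: List[Tuple[str, str]]  # one (allele, allele) pair per locus


def parse_ped_line(line: str) -> PedRecord:
    """Split one whitespace-separated pedigree line into named fields."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus 2 allele columns per locus")
    alleles = fields[6:]
    # Pair up consecutive allele columns: columns 7-8 are locus 1, 9-10 locus 2, ...
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(fields[0], fields[1], fields[2], fields[3],
                     int(fields[4]), int(fields[5]), genotypes)


# Hypothetical line: an affected male child typed at five loci,
# with the genotype at the fifth locus missing (coded 0 0).
example = "1 3 1 2 1 2  3 7  1 4  2 2  1 2  0 0"
record = parse_ped_line(example)
print(record.affection, len(record.genotypes), record.genotypes[-1])
```

Running a check like this over every line of a file is a quick way to confirm that each subject has the same number of allele columns before starting the analysis.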
We will start by analysing each locus in fiveloci.ped individually. To perform this analysis
type:
unphased fiveloci.ped
It may be difficult to see all the results which should have
flashed by on the screen. To
rerun the analysis and save the output to a file indiv.out,
type:
unphased fiveloci.ped > indiv.out
Take a look at the output file (e.g. by using the Linux command more indiv.out). Marker locus 1 shows highly significant association (a likelihood ratio statistic (LRS) of 333.4 on 22 df, p = 1.44163e-58), but from the transmission (T) and non-transmission (NT) frequencies and estimated relative risks (RR), you can see that many alleles are so rare that their risks cannot be estimated. The same is true for marker 2. Marker 3 shows a less significant result (p = 0.01226) and has problems estimating the RR for allele 4. Markers 4 and 5 are diallelic, so they do not have as many problems with rare alleles. The result for marker 4 is not very significant (p = 0.01156), but the result for marker 5 is highly significant (p = 1.82e-12).
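To build some intuition for how transmission and non-transmission counts relate to a test statistic and a risk estimate, here is a minimal Python sketch of the classical transmission disequilibrium test for a single allele. This is not the likelihood-based calculation that UNPHASED performs, and the counts used are invented for illustration. It also shows why very rare alleles are troublesome: when an allele is never untransmitted, the ratio of transmissions to non-transmissions is undefined.

```python
import math


def tdt(transmitted: int, untransmitted: int):
    """McNemar-type TDT for one allele, counted over heterozygous parents.

    transmitted:   heterozygous parents who passed the allele to the affected child
    untransmitted: heterozygous parents who passed on their other allele
    Returns (chi-square on 1 df, p-value, transmission ratio).
    """
    n = transmitted + untransmitted
    if n == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / n
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # The ratio T/NT is a simple risk estimate; it is infinite if NT is zero.
    ratio = transmitted / untransmitted if untransmitted else float("inf")
    return chi2, p_value, ratio


# Invented counts, for illustration only.
chi2, p_value, ratio = tdt(60, 40)
print(chi2, round(p_value, 4), ratio)

# An allele seen only a handful of times gives an unusable ratio.
print(tdt(3, 0)[2])
```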
Let us repeat the analysis but dropping the rare alleles (those with frequency less than 2%):
unphased fiveloci.ped -zero 0.02 > indivdroprare.out
Take a look at the resulting output, which is a bit more interpretable than the previous results. You should again see highly significant results at markers 1, 2 and 5, and less significant results at markers 3 and 4.
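If you want to see in advance which alleles fall below the 2% threshold, a sketch along the following lines counts alleles at one locus and lists the rare ones. It assumes frequencies are simple pooled proportions over all non-missing alleles supplied; UNPHASED may well calculate frequencies differently, so treat the result as a guide only. The function name rare_alleles and the genotype counts are invented for illustration.

```python
from collections import Counter


def rare_alleles(genotypes, threshold=0.02, missing="0"):
    """Return alleles whose pooled frequency at one locus is below threshold.

    genotypes: iterable of (allele, allele) pairs for a single locus.
    Alleles equal to `missing` are ignored.
    """
    counts = Counter(a for pair in genotypes for a in pair if a != missing)
    total = sum(counts.values())
    return sorted(a for a, n in counts.items() if n / total < threshold)


# Invented genotypes: allele 9 appears once among 200 observed alleles.
locus = [("1", "2")] * 60 + [("2", "2")] * 39 + [("1", "9")] + [("0", "0")] * 5
rare = rare_alleles(locus)
print(rare)
```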
Loci 1, 2 and 5 are so strongly associated that it is hard to
say which is most significant. We will therefore try adding in each
locus in turn, and then look at the effect of
subsequently adding additional loci.
To look at the effects of alleles at each of loci 2-5, given the effect of alleles
at locus 1, type:
unphased fiveloci.ped -zero 0.02 -condition 1 -model allelemain > cond1allele.out
This compares a model where disease is predicted
by alleles at locus 1 and alleles
at the test locus (one of loci 2-5)
to a model where only alleles at locus 1 are important. The haplotype RR estimates show estimated RRs for various combinations of alleles at the two loci (locus 1 and the current test locus).
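The comparison of two nested models described here is a likelihood ratio test: the statistic is twice the difference in maximised log-likelihoods, referred to a chi-square distribution whose degrees of freedom equal the number of extra parameters in the larger model. A minimal Python sketch of that arithmetic follows. The log-likelihood values and parameter counts are hypothetical and are not taken from any UNPHASED output.

```python
def likelihood_ratio_statistic(loglik_null, loglik_alt, params_null, params_alt):
    """Return (LRS, df) for two nested models.

    loglik_null / params_null: smaller model (e.g. conditioning locus only)
    loglik_alt  / params_alt:  larger model (conditioning locus plus test locus)
    """
    if params_alt <= params_null:
        raise ValueError("the alternative model must have more parameters")
    lrs = 2.0 * (loglik_alt - loglik_null)
    df = params_alt - params_null
    return lrs, df


# Hypothetical values, for illustration only.
lrs, df = likelihood_ratio_statistic(-1210.5, -1200.0, params_null=4, params_alt=5)
print(lrs, df)
```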
The results for locus 1 are meaningless, since we are already
conditioning on marker 1, so it does not make sense to think about adding locus 1 to the model. The results for marker 2
show no significance (p = 1), indicating that markers 1 and 2 are in such strong LD that once marker 1 is in the model, marker 2 is not needed.
The results for marker 3
show slight significance (p = 0.03883), indicating that marker 3
adds some slight improvement to the model (additional
to the association accounted for by marker 1). Marker 4 shows
greater significance (p = 1.468e-05) and marker 5 also shows
significance (p = 0.005889) when added to the model that includes locus 1.
To repeat the conditional analysis conditioning on each of the other loci, type:
unphased fiveloci.ped -zero 0.02 -condition 2 -model allelemain > cond2allele.out
unphased fiveloci.ped -zero 0.02 -condition 3 -model allelemain > cond3allele.out
unphased fiveloci.ped -zero 0.02 -condition 4 -model allelemain > cond4allele.out
unphased fiveloci.ped -zero 0.02 -condition 5 -model allelemain > cond5allele.out
You should find that, given locus 2, loci 1 and 5 are still significant; given locus 3,
loci 1, 2 and 5 are still significant; given locus 4, loci 1, 2 and 5 are still significant;
and given locus 5, loci 1, 2 and 4 are still significant.
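Since the conditional analyses differ only in the conditioning locus and the output file name, the command lines can be generated rather than typed one by one. The Python sketch below only builds the command strings, using exactly the options shown above; the helper name conditional_commands is made up for illustration. The strings could then be pasted into the shell or written to a script file.

```python
def conditional_commands(ped_file="fiveloci.ped", n_loci=5, zero=0.02):
    """Build one UNPHASED command line per conditioning locus."""
    commands = []
    for locus in range(1, n_loci + 1):
        commands.append(
            f"unphased {ped_file} -zero {zero} -condition {locus} "
            f"-model allelemain > cond{locus}allele.out"
        )
    return commands


commands = conditional_commands()
for command in commands:
    print(command)
```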
You can also look at the effects of alleles at a locus, given the effect of genotype
(rather than just alleles) at another locus. For example, to test whether alleles
at locus 4 are still significant once you have accounted for genotype at locus 5, type:
unphased fiveloci.ped -zero 0.02 -condition 5 -marker 4 -condgenotype -model allelemain > cond5geno.out
To test whether alleles at locus 5 are still significant once you have accounted for genotype at locus 4, type:
unphased fiveloci.ped -zero 0.02 -condition 4 -marker 5 -condgenotype -model allelemain > cond4geno.out
Theoretically you can do similar tests for the other loci, but in practice the program gets very slow owing to the large number of possible genotypes at these other loci.
UNPHASED can also automatically test haplotypes at groups of markers in a sliding window. For instance, to test 3 markers at a time, type:
unphased fiveloci.ped -zero 0.02 -window 3 > window.out
This analysis shows highly significant association for each group of markers tested.
However, we found from our previous analysis that much of this association can probably be accounted for by marker 1 alone.
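To make clear which groups of markers a window of width 3 covers, the following Python sketch lists the windows for five markers. It assumes that each window consists of consecutive markers in the order they appear in the pedigree file; the function name sliding_windows is made up for illustration.

```python
def sliding_windows(n_markers, width):
    """List the marker numbers (1-based) in each consecutive window."""
    if not 1 <= width <= n_markers:
        raise ValueError("width must be between 1 and the number of markers")
    return [tuple(range(start, start + width))
            for start in range(1, n_markers - width + 2)]


windows = sliding_windows(5, 3)
print(windows)
```

Under that assumption, marker 1 appears only in the first window, so the second and third windows are the ones that show whether there is association not involving marker 1 directly.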
To estimate specific haplotype relative risks, e.g. for haplotypes at loci 4 and 5, type:
unphased fiveloci.ped -zero 0.02 -marker 4 5 -window 2 > window45.out
You can see that haplotype 1-1 has the lowest risk of the four haplotypes, and haplotype 2-2 the highest.
Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70:124-141.
Cordell HJ, Barratt BJ and Clayton DG (2004) Case/pseudocontrol analysis in genetic association studies: a unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions and parent-of-origin effects. Genetic Epidemiology 26:167-185.
Dudbridge F (2003) Pedigree disequilibrium tests for multilocus haplotypes. Genetic Epidemiology 25:115-121.
Schaid DJ (1996) General score tests for associations of genetic markers with disease using cases and their parents. Genetic Epidemiology 13:423-449.
Spielman RS, McGinnis RE and Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). American Journal of Human Genetics 52:455-466.