In this exercise you will carry out a family-based association analysis of five linked loci in the HLA region with type 1 diabetes, using a set of case-parent trios. The purpose is to detect which (if any) of the loci are associated with disease.
We will use the TDT and case/pseudocontrol
approaches.
The tests will be performed in two statistical analysis packages,
Stata and R. As well as using standard functions
implemented in these packages, we will
also make use of special functions designed for genetic
analysis that have been downloaded as add-in functions for the
Stata and R packages.
Some of the commands that you will type in order to
do the analysis may seem a bit mystifying (!) if you are not
familiar with statistical packages such as Stata or R.
Don't worry too much if you don't understand all the details.
If you decide to use a statistical package such as Stata or R
to analyse your own data, it will be important to
learn how to use that particular package appropriately through attendance on a training course or careful
reading of an introductory textbook.
We will be using family data consisting of a number of trio families with an affected diabetic child plus parents (of unknown disease status) all of whom are typed at 5 polymorphisms in the HLA region.
Appropriate data for this exercise is genotype data at a set of linked loci typed in a number of case-parent trios. It is also possible to use nuclear families or larger families with more affected individuals. However, these will automatically be broken into trios for the analysis, and non-independence between cases from the same family would need to be accounted for, e.g. by use of the robust cluster() option in Stata or the cluster() option in R.
All the Stata and R commands required to run these analyses are given below. However, Stata and R are sophisticated statistical programming packages and have much greater functionality than will be described here. If you intend to use a statistical package to analyse your data, you are strongly encouraged to learn how to use that package appropriately through attendance on a training course or careful reading of an introductory textbook.
The data is in standard pedigree file format, with columns corresponding to
family id, subject id (within family), father's id, mother's id,
sex (1=m, 2=f), affection status (1=unaffected, 2=affected)
and one column for each allele for each locus genotype.
Each column must be separated by a tab in order to be read
correctly into Stata.
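As a rough illustration of this layout, the following base R sketch reads one made-up trio typed at a single locus. The rows, the allele codes and the column names are invented for illustration and are not taken from fiveloci.ped.

```r
# Toy pedigree file for ONE locus; all values are invented for illustration.
ped_text <- paste(
  "1\t1\t0\t0\t1\t1\t1\t2",
  "1\t2\t0\t0\t2\t1\t2\t2",
  "1\t3\t1\t2\t1\t2\t1\t2",
  sep = "\n"
)

# Column names are hypothetical labels for the six standard columns
# plus the two alleles at the locus.
ped <- read.table(
  text = ped_text, sep = "\t", header = FALSE,
  col.names = c("pedigree", "id", "father", "mother",
                "sex", "affected", "L1_1", "L1_2")
)

# Parents (founders) have father and mother coded 0;
# the affected child has affection status 2.
founders <- ped[ped$father == 0 & ped$mother == 0, ]
child <- ped[ped$affected == 2, ]
print(ped)
```

With five loci the same file would simply carry ten allele columns instead of two.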
The pedigree file used for the analysis in R differs from the
pedigree file used for the analysis in Stata. It has a header line
describing the different columns, and it uses R's own missing value
code "NA".
fiveloci.ped
fiveloci.Rped
These should be saved as .txt files to the C:DATA directory.
The Stata website is at
http://www.stata.com/
Useful guides to Stata can be found at
http://www.ats.ucla.edu/stat/stata/ and
http://www.whoishostingthis.com/resources/stata/
David Clayton's add-in routines (the genassoc package) can be found at
http://www-gene.cimr.cam.ac.uk/clayton/software/stata/README.txt
From within
Stata, one can obtain help on any command xxxx by typing "help xxxx"
The R website is at
http://www.r-project.org/
David Clayton's add-in routines (the dgc.genetics package) can be found at
http://www-gene.cimr.cam.ac.uk/clayton/software
From within
R, one can obtain help on any command xxxx by typing "help(xxxx)"
Start up Stata by clicking on the icon.
For those not familiar with Stata, you should find
that a large window with 4 separate sub-windows appears. You type commands in
the bottom right hand window, and any results will be displayed in the
top right hand window. The top left window provides a review of all
the commands you typed, and the bottom left window displays all the
variables that you have currently loaded. A convenient way to examine
the variables is to click on the data browser button at the top.
To keep a log of your session in the file "family.log", you can type
in the bottom
right hand (white) Stata window:
log using family, replace text
Read the data into Stata by typing:
ginsheet using fiveloci.ped, preped zmiss
The ginsheet command reads in data from a text file
which in this case has been prepared for you earlier
in the file "fiveloci.ped". The preped option tells Stata
that the file is in a standard pre-makeped pedigree file format
and the zmiss tells Stata that missing values are coded
in the file as 0
(whereas Stata will recode them as ".")
You should find that the names of 16 variables appear in the bottom left
window. These correspond to the usual variables in a
standard pedigree file with alleles at locus 1
denoted L1_1 and L1_2,
alleles at locus 2 denoted L2_1 and L2_2 etc. etc.
Take a look at the data by clicking on the data browser
(3rd icon from the right at the top).
You should see that the data consists of genotypes for a series
of TDT type trios i.e. father, mother and affected child.
We can perform some preliminary single-locus association
analysis at these loci using the following commands
tdt L1_1 L1_2
tdt L2_1 L2_2
tdt L3_1 L3_2
tdt L4_1 L4_2
tdt L5_1 L5_2
This performs TDT analysis at each locus in turn
and also gives an indication
of which loci are diallelic and which have more than 2 alleles.
The p values for the TDT tests of each allele are given in the table,
and the global (multiallelic) p value is given at the bottom right (in green).
You should find highly significant associations at
loci 1, 2 and 5
and moderately significant associations at
loci 3 and 4 (p approx 0.01). For more information
on the tdt command type
help tdt
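To see what the TDT is doing for a diallelic locus, you can reproduce the statistic by hand. The sketch below, in base R, uses the standard form of the test: among heterozygous parents, compare how often each allele is transmitted to the affected child. The two counts are hypothetical and are not taken from the exercise data.

```r
# Hypothetical transmission counts from heterozygous (1/2) parents.
n_trans_1 <- 60  # allele 1 transmitted, allele 2 not transmitted
n_trans_2 <- 30  # allele 2 transmitted, allele 1 not transmitted

# TDT statistic: chi-squared on 1 df under the null of equal transmission.
tdt_chisq <- (n_trans_1 - n_trans_2)^2 / (n_trans_1 + n_trans_2)
tdt_p <- pchisq(tdt_chisq, df = 1, lower.tail = FALSE)

tdt_chisq
tdt_p
```

For a multiallelic locus the tdt command reports one such allele-versus-rest test per allele, together with the global test.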
We can perform a conditional logistic regression analysis at
each locus (i.e. a genotype relative risk
analysis) using the following
commands:
gtrr L1_1 L1_2
gtrr L2_1 L2_2
gtrr L3_1 L3_2
gtrr L4_1 L4_2
gtrr L5_1 L5_2
For a single locus, the gtrr command automatically generates the
required case and 3 pseudo-controls and analyses the data
without explicitly writing out the new case/pseudo-control
data set. A similar pattern of significance
(given as a P value for the LR chi-squared test) is seen at the various loci
with this procedure as with the TDT.
The genotype relative risks are all estimated relative to one
particular genotype which can be specified
if required. Note that for loci 1 and 2
there is insufficient data in some genotype classes to estimate
all the genotype relative risks. We can get round this by
grouping together some genotypes until there are
at least 10 transmissions of each genotype:
gtrr L1_1 L1_2, emin(10)
gtrr L2_1 L2_2, emin(10)
For more information
on the gtrr command type
help gtrr
In order to test the effect of a locus conditional
on effects at other loci,
we need to generate and explicitly write out appropriate cases and
matched pseudo-controls from this family data.
Since we will be interested in potentially
including effects at all 5 loci, we generate
cases and pseudo-controls with genotypes at all
5 loci. Note that an individual who has
missing data at any one of the loci will
be discarded during this process. This would
not therefore be recommended for large numbers of loci
or loci with large amounts of missing data.
To generate the cases and pseudo controls,
without necessarily
ensuring that phase is conserved, type
pseudocc L1_1 L1_2 L2_1 L2_2 L3_1 L3_2 L4_1 L4_2 L5_1 L5_2, saving(casepseudocon)
This command uses genotype data on the child and parents
at the five pairs of variables listed
to construct cases and pseudo-controls, without
necessarily being able to
infer phase.
A "shortcut" for this command would be
pseudocc L1_* L2_* L3_* L4_* L5_*, saving(casepseudocon)
If you try this now, Stata will attempt to overwrite the output file
"casepseudocon.dta" that you already made, but will fail. To tell
Stata to overwrite it, you need to use
pseudocc L1_* L2_* L3_* L4_* L5_*, saving(casepseudocon) replace
This saves a Stata data file "casepseudocon.dta" in the current directory.
We can now get rid of our pedigree data set and read in the
case/pseudo-control data set that we saved by typing
clear
use casepseudocon
Take a look at the new data set using the data browser.
You should find that each affected child in each pedigree has now
given rise to one case and between one and 3 pseudo-controls.
(This information appears in the "case" variable). In addition
we have a "set" variable which keeps track of which
pseudo-controls are matched to which case. In this data set
there is only one case per pedigree, so the information in
"set" is equivalent to using the information in "pedigree",
but if there were more than one case per pedigree we
would want to keep track of which
pseudo-controls are matched to which case within the pedigree.
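The idea behind a matched set can be sketched in base R for the simplest situation: a single locus at which it is known which allele each parent transmitted. The function name, its arguments and the trio are hypothetical; this is not how pseudocc is implemented, and it ignores homozygous parents, multiple loci and unknown phase, which are the situations in which fewer than 3 pseudo-controls are available.

```r
# Build the four genotypes a child could inherit from one pair of parents.
# father and mother are length-2 vectors of allele codes; ft and mt give the
# index (1 or 2) of the allele actually transmitted by each parent.
make_case_pseudo <- function(father, mother, ft, mt, set_id) {
  combos <- expand.grid(f = 1:2, m = 1:2)
  data.frame(
    set = set_id,
    case = as.integer(combos$f == ft & combos$m == mt),
    a1 = father[combos$f],
    a2 = mother[combos$m]
  )
}

# Hypothetical trio: father 1/2, mother 3/4, child inherited alleles 1 and 4.
trio <- make_case_pseudo(c(1, 2), c(3, 4), ft = 1, mt = 2, set_id = 1)
print(trio)
```

The row with case equal to 1 is the genotype the child actually received; the other three rows are the pseudo-controls sharing the same set value.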
We start by generating genotype variables from the alleles at the
various loci:
egen g1=gtype(L1_1 L1_2)
This creates a genotype variable named g1 which contains the
genotype corresponding to the alleles contained in the variables
L1_1 and L1_2. This is the most general way to create
a genotype variable, and the required way for multiallelic loci.
Repeat this for the other loci:
egen g2=gtype(L2_1 L2_2)
egen g3=gtype(L3_1 L3_2)
egen g4=gtype(L4_1 L4_2)
egen g5=gtype(L5_1 L5_2)
You can take a look at the genotype variables you have created
using the data browser.
We find that loci 4 and 5, which are diallelic, have
3 possible genotypes.
Loci 1, 2 and 3 have a larger number of genotypes,
which would lead to rather too many variables
in the regression equation, and so we group together genotypes
that occur at low frequencies using
grprare g1, gen(newg1) min(0.025)
grprare g2, gen(newg2) min(0.04)
grprare g3, gen(newg3) min(0.01)
This generates new variables newg1, newg2 and
newg3 with fewer possible genotypes
than the original variables
g1, g2 and
g3.
Note that we used different frequency cut-offs for the three loci
(2.5%, 4% and 1% respectively). These were chosen after some trial
and error to give groupings that did not lead to too many variables
in the regression equations. In a real study we might wish to make
decisions concerning the grouping of genotypes on biological grounds
rather than purely statistical convenience.
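The general idea of lumping rare genotypes can be sketched in base R as follows. The helper group_rare, the label "other" and the genotype vector are all hypothetical; exactly how grprare chooses and labels its groups is not described here and may differ.

```r
# Hypothetical genotype codes for 20 individuals (not from the exercise data).
g <- c(rep("1/1", 9), rep("1/2", 8), "1/3", "2/3", "3/3")

# Lump together every genotype whose relative frequency is below min_freq.
group_rare <- function(x, min_freq) {
  freq <- prop.table(table(x))
  rare <- names(freq)[freq < min_freq]
  ifelse(x %in% rare, "other", x)
}

new_g <- group_rare(g, min_freq = 0.10)
table(new_g)
```

Raising or lowering min_freq changes how many separate genotype categories, and therefore how many regression parameters, remain.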
Check that you understand how the new grouped genotype variables
correspond to the old variables by typing:
list g1 newg1
To analyse the effect of locus 1 with a general genotype test (i.e.
without assuming multiplicative effects of alleles)
use the following commands (noting that
we use the new genotype variable newg1
that we have just created):
xi:clogit case i.newg1, group(set) or
The clogit command is similar to the logit command for
(unmatched) case/control data, but tells the program to perform
conditional as opposed to unconditional
logistic regression, with "case" status as
the outcome and genotype variable as the predictor.
For conditional logistic regression
we need to tell Stata which cases are matched to which controls
using the group(set) option. The or option tells
Stata to output the regression coefficients in terms of odds ratios
relative to a baseline genotype.
The xi: and i.
syntax indicates that we are considering genotype as a categorical variable.
This is because the coding of genotypes as 1, 2, 3 etc. is
essentially arbitrary: we do not want to fit a regression
model with a single coefficient that gets multiplied
by the value of the genotype variable, as this would assume
that the effect of genotype 10, say, is ten times the
effect of genotype 1. Instead Stata will automatically
create the appropriate number of temporary dummy
variables (which
is equal to the number of genotypes minus 1) and carry out the
regression with these as predictors of disease risk. These
temporary variables may be called something like
_Inewg1_2
_Inewg1_4
etc. in the Stata variables window.
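The same dummy-variable expansion can be seen in base R, where a categorical predictor is declared as a factor. The genotype codes below are hypothetical, and the column names are R's defaults rather than the names Stata creates.

```r
# Hypothetical genotype codes 1, 2 and 4 for six individuals.
newg1 <- factor(c(1, 2, 4, 2, 1, 4))

# model.matrix() expands the factor into indicator columns; the first
# level (genotype 1) is the baseline and gets no column of its own.
dummies <- model.matrix(~ newg1)
print(dummies)
```

Three genotype categories give two indicator columns (plus the intercept), matching the "number of genotypes minus 1" rule described above.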
The significance of the test is found at the top right on the line
"Prob > chi2" just above the table of estimated regression coefficients.
You should
obtain a chi-squared on 9 df of 203.06 with a p value of 0.0000
(i.e. <0.00005). This is the significance test for comparing
the model where the newg1
variable is in the regression equation compared
to a model where it is not, i.e. comparing a model where genotype
at locus 1 is a predictor of disease to a model where it does not
predict disease.
You should find that you get very similar results to
the results you got from the gtrr analysis. The only differences
should be because of the fact that individuals with
missing data at any of the 5 loci have been removed
(they were not written out to the "casepseudocon.dta" file)
and also, the grouping of genotypes at locus 1 and 2 may not correspond
exactly to the default grouping in the gtrr analysis.
Now do the analysis for the other loci:
xi:clogit case i.newg2, group(set) or
xi:clogit case i.newg3, group(set) or
xi:clogit case i.g4, group(set) or
xi:clogit case i.g5, group(set) or
Loci 1, 2 and 5 are sufficiently significant that it is hard to
say which is most significant. We will therefore try adding in
each of these loci in turn, and seeing what is the effect of
subsequently adding additional loci.
To look at the effect of locus 2 given the effect at locus 1
type:
xi:clogit case i.newg1 i.newg2, group(set) or
testparm *newg2*
This compares a model where disease is predicted
by genotypes at locus 1 and 2
to a model where only genotypes at locus 1 are important.
We obtain a chi-squared of 11.43 on 6 df, p=0.0760.
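If you want to check a reported p value yourself, the upper tail of the chi-squared distribution gives it directly. A minimal base R sketch using the statistic and degrees of freedom quoted above:

```r
# Test statistic and degrees of freedom as reported by testparm.
chisq_stat <- 11.43
df <- 6

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(chisq_stat, df = df, lower.tail = FALSE)
round(p_value, 4)
```

The degrees of freedom equal the number of dummy variables for newg2 that are being tested.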
To look at the effect of each other locus given the effect at locus 1
type:
xi:clogit case i.newg1 i.newg3, group(set) or
testparm *newg3*
xi:clogit case i.newg1 i.g4, group(set) or
testparm *g4*
xi:clogit case i.newg1 i.g5, group(set) or
testparm *g5*
You should find that, given the effect at locus 1, the significances
of effects at loci 2-5 have p values 0.08, 0.32, 0.15, 0.32.
You can probably see how you could use a forwards or backwards
procedure to enter or delete loci from the regression equation.
In this case, it looks as if nothing is very significant
once you have accounted for effects at locus 1. To see if locus
1 is important once you have accounted for effects at other loci,
use
xi:clogit case i.newg1 i.newg2, group(set) or
testparm *newg1*
xi:clogit case i.newg1 i.newg3, group(set) or
testparm *newg1*
xi:clogit case i.newg1 i.g4, group(set) or
testparm *newg1*
xi:clogit case i.newg1 i.g5, group(set) or
testparm *newg1*
You should find that locus 1 is highly significant even once
each other locus is accounted for, suggesting that it is
locus 1 that is driving the association in this region.
We can also use the case/pseudocontrol approach to fit
models where the disease risks depend on phased haplotypes.
In order to fit models that include haplotype effects
and/or estimate these effects, we must
only use data in which phase can be determined from the
case-parents trio and at least one of the possible matched pseudocontrols.
Note that our current software
only deals with observed haplotypes i.e. no EM estimation
of ambiguous haplotypes or missing data is performed.
First read in the original data set again by typing
clear
ginsheet using fiveloci.ped, preped zmiss
Suppose we are just interested in looking at haplotypes
for loci 1, 2 and 5 (which were the most significant in the single
locus analysis).
To generate the appropriate cases and pseudocontrols,
ensuring that phase is conserved, type:
pseudocc L1_* L2_* L5_*, saving(casepseudophased) phase
This saves a Stata data file "casepseudophased.dta" in the current directory.
We can now get rid of our pedigree data set and read in the
case/pseudocontrol data set that we saved by typing
clear
use casepseudophased
Click on the data browser button to see what the
case/pseudocontrol file looks like. You should find that
some sets consist of one case and 3 pseudocontrols,
other sets have only one pseudocontrol, and some
families have been discarded entirely.
This is because the method has discarded
pseudocontrols and families for which phase cannot be inferred. See
Cordell and Clayton (2002) for details.
Loci 1 and 2 are multiallelic.
As it happens, prior evidence has suggested that
there are particular alleles at loci 1 and 2
that are associated with disease. We may therefore
focus our attention on these alleles by recoding all other
alleles as a single value. We can do this as follows:
replace L1_1=2 if L1_1>2
replace L1_2=2 if L1_2>2
replace L2_1=2 if L2_1>2
replace L2_2=2 if L2_2>2
This allows us to look at the effects of allele 1
versus all other alleles, at each locus.
We can generate "haplotype" variables for mother and father's
transmitted haplotypes at the 3 loci by typing
egen hap_1=htype(L1_1 L2_1 L5_1)
egen hap_2=htype(L1_2 L2_2 L5_2), codeas(hap_1)
Take a look at the new
variables hap_1 and hap_2 by clicking on the data browser
(3rd icon from the right at the top).
We can reduce the number of parameters to estimate by assuming that the haplotypes act
multiplicatively to cause disease. First we generate new variables that code for the number of copies of the different possible haplotypes:
gtab hap_1 hap_2, gen(count)
This generates 8 new variables named count1 -
count8
that count the number of copies of the 8 possible haplotypes
present in an individual. For example, if an individual has haplotypes
111 and 112, i.e. one copy of haplotype number 1 and one
copy of haplotype number 2,
you should find that count1=1 and count2=1, while all
other count variables equal 0. If instead an
individual was homozygous for the
222 haplotype, you should find that count8=2, while all
other count variables equal 0.
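The counting itself is simple, and a base R sketch may make the coding clearer. It assumes the haplotypes have already been given numeric codes 1 to 8 in the order 111, 112, 121, 122, 211, 212, 221, 222; the three individuals are hypothetical.

```r
# Hypothetical haplotype codes (1 to 8) for three individuals:
# 111/112 -> codes 1 and 2; 222/222 -> codes 8 and 8; 212/111 -> codes 6 and 1.
hap_1 <- c(1, 8, 6)
hap_2 <- c(2, 8, 1)

# One column per haplotype, counting the copies carried by each individual.
counts <- sapply(1:8, function(k) (hap_1 == k) + (hap_2 == k))
colnames(counts) <- paste0("count", 1:8)
print(counts)
```

Every row sums to 2, since each individual carries exactly two haplotypes.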
To look at the effect of the haplotypes
in a conditional logistic regression,
assuming they act multiplicatively,
and all relative to haplotype 1 (=111) type:
clogit case count2 count3 count4 count5 count6 count7 count8, group(set) or
In this case, it turns out there is not enough data
to estimate all 8 haplotype effects (since some haplotypes
are very rare). This results in some estimated odds ratios
that are very large (effectively=infinity) or very small
(=0) with problems estimating the confidence intervals
(indicated by a ".") There
are various ways to get round this. For now we will
drop haplotypes 3 and 4 (121 and 122) from the analysis
(which is equivalent to assuming they have the same effect as
the reference haplotype):
clogit case count2 count5 count6 count7 count8, group(set) or
This calculates the relative risks for the haplotypes
relative to haplotype 1 (111). To calculate them relative to
haplotype 6 (212) (which is the one with the lowest risk), type:
clogit case count1 count2 count5 count7 count8, group(set) or
From the odds ratios, it looks like haplotype 2 (112) has the
strongest effect, and the only one that is individually significantly
different from haplotype 6 (z=3.18, p=0.001).
A global test of the effect of all 6 haplotypes
is given by the chi-squared on the top right: 143.76
on 5 df. This is highly significant as signified
by "Prob > chi2 = 0.0000"
We have now finished the Stata analysis. Type
log close
to close the log file in which there will be a record of all
the commands you typed and all the results.
Then type
clear
exit
to get out of Stata and close the session.
If you have time, you may like to repeat some of the
analyses
that you performed in the statistical package Stata
in the statistical package R. As mentioned previously,
the R package is somewhat
less user-friendly than Stata (particularly with regards to the output),
but has the advantage of being free,
and in addition a lot of bioinformatics software utilities have
been developed as add-in functions for the R package.
Before we start, take a look at the data file that we shall use:
fiveloci.Rped
This differs from the pedigree file we used in the Stata analysis
by the addition of a header line describing what each of the
variables are, and by the changing of any "0"s to R's own missing value
code "NA". Note that the names given to the first 6 variables in the
header line are not optional: they must be called
pedigree id id.father id.mother sex affected
for the add-in utilities
for the R package to work correctly.
To start up the R package, click on the R icon. Under the File menu, change the working directory (the menu item reads "Cambia directory" if R is running in Italian) so that R will look for any input files you specify in the C:DATA directory.
You are now working within the R package. To start with, you need
to read in the necessary add-in libraries (which we have
downloaded ready for you):
library(dgc.genetics)
To read your data into a dataframe called "family", type
family <- read.table("fiveloci.Rped", header=TRUE)
The " <- " operator stands for "is defined as"
(or "is assigned as") and is used a lot in R to create new
variables, new dataframes or new R objects.
Here this command reads your data into what is called a dataframe, essentially
a large matrix with columns corresponding to the different variables.
You chose to name the dataframe "family" and you can look
at it simply by typing
family
or by typing
fix(family)
As with the case/control exercise,
each variable can be accessed by using the name of the dataframe
followed by a $ sign, followed by the variable name. E.g. to look at the
column of pedigree names, you just type
family$pedigree
It can be inconvenient to have to type "family$" before
typing every variable,
so you can tell R to automatically look at variables in the
family dataframe by typing:
attach(family)
Now you can just look at the column of pedigree names by typing
pedigree
To perform association analysis, we need to convert the
variables corresponding to the two alleles at each locus
into a genotype variable for each locus. This can be done
e.g. for locus 1 by:
g1 <- genotype(loc1_1, loc1_2)
To look at the variable you have just created, type:
g1
To perform a TDT analysis on this locus, type
tdt(g1)
As with the tdt function in Stata, this gives a global multiallelic
test (chi-squared test = 278.1187 on 22 df, P-value = <2e-16)
as well as individual tests for each allele versus all others.
Repeat the analysis for locus 5 (which is diallelic):
g5 <- genotype(loc5_1, loc5_2)
tdt(g5)
To create a case/pseudocontrol set for analysis at this locus, type:
psccloc5 <- pseudocc(g5, data=family)
This creates a new dataframe called "psccloc5" which contains
cases each with 3 matched pseudocontrols.
To look at the dataframe you just created, type:
fix(psccloc5)
Note that within
the "psccloc5" dataframe, the case/control 1/0 indicator variable is called "cc" and the
genotype variable is called "g5"
To clear the old "family" dataframe and old genotype variables
from the memory, type:
detach(family)
rm(g1)
rm(g5)
and then read in the "psccloc5" dataframe as the default:
attach(psccloc5)
To analyse using conditional logistic regression, assuming either
a 2df (genotype) test or a 1df (allele) test, type:
gcontrasts(g5) <- "genotype"
clogit(cc ~ g5 + strata(set))
gcontrasts(g5) <- "additive"
clogit(cc ~ g5 + strata(set))
This will perform the analysis on the g5 variable that is currently in the default memory, i.e. the g5 variable in the "psccloc5" dataframe, also known as psccloc5$g5.
The strata(set) option indicates that the set variable labels matched cases and pseudocontrols.
The results should be very similar to what you found in your Stata analysis of locus 5, a highly significant likelihood ratio test. The likelihood ratio test is slightly
different because in the Stata analysis we created case/pseudocontrol sets considering
all five loci simultaneously whereas just now we only considered locus 5.
The relative risk parameters labelled exp(coef) correspond to those
labelled Odds Ratio in the Stata analysis. In Stata we found Odds Ratios of
2.53 and 5.13 for the 1/2 and 2/2 genotype relative to the 1/1 genotype,
whereas in R we find values of 0.421 and 2.176 for the 1/1 and 2/2 genotype
relative to the 1/2 genotype. So relative to the 1/1 genotype we have values
of 1.0/0.421 = 2.38 and 2.176/0.421 = 5.17 respectively, very similar to what we found in Stata.
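The change of baseline is just a division, which you can reproduce in base R using the odds ratios quoted above:

```r
# Odds ratios reported by R, relative to the 1/2 genotype (baseline = 1).
or_vs_12 <- c("1/1" = 0.421, "1/2" = 1, "2/2" = 2.176)

# Re-express them relative to the 1/1 genotype by dividing by its odds ratio.
or_vs_11 <- or_vs_12 / or_vs_12["1/1"]
round(or_vs_11, 2)
```

The same division works for any choice of reference genotype, because odds ratios relative to different baselines differ only by a constant factor.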
For analysis at more than one locus (e.g. locus 4 and 5, say) we need to
clear the memory and get back to our original "family" dataframe:
detach(psccloc5)
attach(family)
Now we need to
create the relevant genotype variables and case/pseudocontrol datasets.
We will create two different case/pseudocontrol datasets, one in which
we do not keep track of phase information, and one in which we
condition on phase being known (in order to fit models where the
disease risk depends on phase)
g4 <- genotype(loc4_1, loc4_2)
g5 <- genotype(loc5_1, loc5_2)
psccphase <- pseudocc(g4, g5, phase=TRUE, data=family)
psccnophase <- pseudocc(g4, g5, phase=FALSE, data=family)
These new dataframes can be looked at by typing
fix(psccphase)
fix(psccnophase)
For the "psccnophase" data, when phase information is not kept,
all families are used but
only 2 pseudocontrols are generated per case. This is because
the R functions
for case/pseudocontrol analysis are not yet as well
developed as the Stata functions and so the more efficient
creation of 3 pseudocontrols per case in this situation
is not yet implemented.
For the "psccphase" data,
some sets consist of one case and 3 pseudocontrols,
other sets have only one pseudocontrol, and some
families have been discarded entirely.
This is because the method has discarded
pseudocontrols and families for which phase cannot be inferred. See
Cordell and Clayton (2002) for details.
To analyse the "psccnophase" data, first
clear the old "family" dataframe and associated genotype variables
from the memory and read in the new dataframe as default:
detach(family)
rm(g4)
rm(g5)
attach(psccnophase)
To analyse each locus individually with a 2df test, type
gcontrasts(g4) <- "genotype"
clogit(cc ~ g4 + strata(set))
gcontrasts(g5) <- "genotype"
clogit(cc ~ g5 + strata(set))
To see whether locus 4 is significant once locus 5 is in the regression equation, use
the following sequence of commands:
gcontrasts(g4) <- "genotype"
gcontrasts(g5) <- "genotype"
fullmodel<-clogit(cc ~ g5 + g4 + strata(set))
restrictedmodel<-clogit(cc ~ g5 + strata(set))
anova(restrictedmodel,fullmodel)
You should find a difference between the models reported
as a deviance of 17.78 on 2df. To find the significance of this, use:
1-pchisq(17.78,2)
which gives you a p value of around 0.00014.
To fit a model for phase-known haplotypes at these loci, read in the file
"psccphase":
detach(psccnophase)
attach(psccphase)
The "psccphase" dataframe contains a two-locus phased genotype variable
called "g4.g5". To fit a multiplicative model
(equivalent to an additive model on the log odds scale) for the haplotypes
type:
gcontrasts(g4.g5) <- "additive"
clogit(cc ~ g4.g5 + strata(set))
This gives a highly significant global test of
67.3 on 3df (p=1.63e-14) for the effects of the
3 haplotypes (relative to the 1:2 haplotype). The individual haplotype
odds ratios are given in the column marked "exp(coef)". It is seen
that the 1:1 haplotype in particular has a significantly lower risk
than the 1:2 haplotype.
Once you are ready to finish with R, type
q()
n
to get out.
Interpretation of the output in Stata and R is described in the step-by-step instructions. In general, the output will consist of a likelihood-ratio or chi-squared test for whatever test you are performing, and regression coefficients or odds ratio estimates for the predictor variables in the current model. Please ask if you need help in understanding the output for any specific test.
Analysis in a standard statistical package has the advantage of
allowing a lot of extra flexibility with regards to the models
and analyses performed. However, you are required to know or
learn how to use the package in order to gain that extra flexibility,
and to produce reliable results.
The case/pseudocontrol approach has the advantage that it is not
affected by bias when analysing uncertain haplotypes. It has the
disadvantage that missing data is not incorporated into the analysis:
trios with any missing data at the loci under consideration are
simply discarded.
Family-based designs should be robust to population stratification.
TDT analysis can be performed in a variety of other packages. A form of TDT analysis of haplotypes is performed by the TRANSMIT program by David Clayton. The only other package that performs a form of case/pseudocontrol analysis is the UNPHASED program by Frank Dudbridge. For testing (but not estimation) of genotype or haplotype association effects in families, one can use the PDT or FBAT or PBAT programs.
Cordell HJ and Clayton DG (2002) A unified
stepwise regression procedure for evaluating the relative effects of
polymorphisms within a gene using case/control or family data:
application to HLA in type 1 diabetes. American
Journal of Human Genetics 70: 124-141.
Cordell HJ, Barratt BJ and Clayton DG (2004) Case/pseudocontrol analysis
in genetic association studies: a unified framework for detection of
genotype and haplotype associations, gene-gene and gene-environment
interactions and parent-of-origin effects. Genetic Epidemiology 26:167-185.
Dudbridge F (2003) Pedigree disequilibrium tests
for multilocus haplotypes. Genet Epidemiol 25:115-21.
Horvath S, Xu X and Laird N (2001) The family based association test method: strategies for studying general genotype-phenotype associations. Euro J Hum Gen 9: 301-306
Lake SL, Blacker D and Laird NM (2000) Family-based tests of association in the presence of linkage. Amer J Hum Gen 67:1515-1525.
Martin ER, Monks SA, Warren LL, Kaplan NL (2000) A test for linkage and
association in general pedigrees: the pedigree disequilibrium test.
Am J Hum Genet 67:146-154
Schaid DJ. 1996. General score tests for associations of genetic markers
with disease using cases and their parents. Genet Epidemiol 13:423-449.
Spielman RS, McGinnis RE, Ewens WJ. 1993. Transmission test for linkage
disequilibrium: the insulin gene region and insulin-dependent diabetes
mellitus (IDDM) Am J Hum Genet 52:455-466.