Computer Practical Exercise on estimation of maternal, imprinting and interaction effects using the EMIM program
Overview
Purpose
In this exercise you will be carrying out an analysis of some simulated data in which there
may be maternal genotype, child genotype and imprinting effects operating.
Methodology
We will use the approach described in the manuscript:
Ainsworth HF, Unwin J, Jamison DL and Cordell HJ (2011) "Investigation of maternal effects, maternal-foetal interactions and parent-of-origin effects (imprinting), using mothers and their offspring", Genetic Epidemiology 35:19-45.
Program documentation
Documentation for the EMIM program can be found on the EMIM website:
http://www.staff.ncl.ac.uk/richard.howey/emim/index.html
Data overview
In the first exercise, we will be using family data consisting of a number of case/mother duos
and/or case/parent trios, genotyped at three SNP loci.
In addition, we will investigate whether our results can be improved by incorporating various kinds of control samples.
In the second exercise, we will be following the worked example on the EMIM website.
Appropriate data
Appropriate data for this exercise is SNP genotype data for
case/parent trios, case/mother duos or case/father duos.
Additional genotype data can also be incorporated into the analysis
from parents of cases,
mothers of cases or fathers of cases (e.g. if the case itself has
not been successfully genotyped) or from cases alone
(e.g. if the parents have not been genotyped).
Greater efficiency can also be achieved by incorporating one or more types of control sample into the analysis, provided we are not worried about population stratification. The types of control sample that can be included are the parents (mother and father) of controls, control/mother duos, control/father duos or just individual controls. Provided the disease is rare, these controls can be either genuinely unaffected controls or population-based controls (of unknown disease status). If the disease is common, then the controls should be population-based controls.
Instructions for Exercise 1
Data files
The data is contained in the files:
casemotherduos.dat
caseparenttrios.dat
conmotherduos.dat
conparents.dat
cons.dat
emimmarkers.dat
emimparams.dat
wtccccons.dat
These files should already be in the EMIM-EX1 subdirectory.
Data format
The format of the data files is described on the EMIM website.
Read through the appropriate section of the website:
http://www.staff.ncl.ac.uk/richard.howey/emim/emim.html
Then take a look at the data files (e.g. using the command more *.dat), and check that you understand how the data are coded.
In most cases, the data files contain 500 units of the appropriate type (e.g. 500 case/parent trios, 500 case/mother duos, 500 controls, etc.).
The file wtccccons.dat contains a larger set of controls that might represent common controls from a population-based resource, such as the 3000 controls that were used in the Wellcome Trust Case Control Consortium (WTCCC).
The file emimparams.dat specifies the datafiles that will be read by the EMIM program, the assumptions that will be made during the analysis, and the parameters that are to be estimated.
The initial settings that we have chosen in this file tell the program just to use the data file for case/mother duos, to assume Hardy-Weinberg Equilibrium and random mating, and to estimate a child's genotype effect only.
Take a look at emimparams.dat and check that you understand
how the lines in this file force the above settings to be implemented.
The data format required by EMIM is slightly inconvenient for those used to working with standard LINKAGE or PLINK format files. Luckily we have another program, PREMIM, that can be used to generate EMIM format input files from PLINK format data. We will see an example of this in Exercise 2.
Step-by-step instructions
To run EMIM under the initial settings as described above, from the directory where the data is kept type:
emim
The program should run briefly and produce two output files, emimsummary.out and emimresults.out. The file emimsummary.out is harder to read (although it can provide a useful summary if you are analysing a large number of SNPs). We will look at the file emimresults.out as this gives a more detailed overview of the results.
Take a look at emimresults.out. Results are given for each of the 3 SNPs in turn.
First we see the parameter estimates under the null hypothesis that all effects are 0
(i.e. all relative risk parameters=1, or all log relative risk parameters=0). Then we have the parameter estimates under the alternative hypothesis that you specified (i.e. that child's genotype effects are non-zero). As well as parameter estimates, the program outputs:
- a 95% CI for each of the estimated parameters of interest
- the maximised log likelihoods for the alternative and null models
- twice the difference between these. This can be used to compare the two models (i.e. to test the null hypothesis) by comparing it to a chi-squared distribution on the appropriate df (in this case 2 df).
Take a look at the results. For SNPs 1 and 3 you should not find anything very significant (relative risks close to 1, with confidence intervals that include 1, and chi-squared values of 0.92 and 0.39, which are not significant). For SNP 2, the confidence intervals again include 1 but the test looks slightly more interesting although still not significant (chi-squared on 2 df of 3.73, which has a p value of 0.15).
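The p value quoted for SNP 2 can be checked by hand. For a chi-squared distribution on 2 df the upper tail probability has the closed form exp(-x/2), so no statistics library is needed. The following is a minimal Python sketch (Python is not part of the exercise, the function name is our own, and the values 0.92 and 0.39 are assumed to belong to SNPs 1 and 3 respectively) applied to the three test statistics above:

```python
import math

def chisq_pvalue_2df(stat):
    """Upper-tail probability of a chi-squared variable on 2 df."""
    # For 2 df the survival function is exactly exp(-x/2).
    return math.exp(-stat / 2.0)

# Test statistics quoted above for SNPs 1, 2 and 3.
stats = {"SNP 1": 0.92, "SNP 2": 3.73, "SNP 3": 0.39}
pvalues = {snp: chisq_pvalue_2df(x) for snp, x in stats.items()}

for snp, p in pvalues.items():
    print(f"{snp}: chi-squared = {stats[snp]:.2f}, p = {p:.3f}")
```

None of the three p values falls below the conventional 0.05 threshold, in line with the description above.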
Let us see what happens if you assume you have 500 case/parent trios rather than 500 case/mother duos. Edit the file emimparams.dat to read in data from the file
caseparenttrios.dat and NOT from casemotherduos.dat. Run the program again
by typing
emim
and take a look at the new results (which will have overwritten the previous file emimresults.out).
You should find that SNPs 1 and 3 again do not show anything very significant, but SNP 2 shows much stronger child's genotype effects (chi-squared on 2 df of 65.94, which has a p value of 4.77e-15, highly significant). This illustrates how much more power you can get when you have case/parent trios rather than case/mother duos, i.e. when fathers have also been genotyped.
Go back and edit the file emimparams.dat to read in data from casemotherduos.dat again, rather than from caseparenttrios.dat, but this time also add in some control data i.e. ask the program to also read in data from the file cons.dat.
Re-run the program and take a look at the results. Again you should find that SNPs 1 and 3 do not show anything very significant, but SNP 2 shows much stronger child's genotype effects (chi-squared on 2 df of 81.19, highly significant). This illustrates the fact that adding in control data is another way to considerably improve the power.
So far, our analyses have all assumed Hardy-Weinberg Equilibrium (HWE) and random mating. If we want to relax these assumptions, we need to have either some data from case/parent trios, or data from control/mother duos, control/father duos or parents of controls.
Go back and edit the file emimparams.dat to read in data from conparents.dat rather than cons.dat (while still reading in data from casemotherduos.dat). Re-run the program and take a look at the results. Again you should find that SNPs 1 and 3 do not show anything very significant, but SNP 2 shows a chi-squared on 2 df of 95.45, highly significant.
Now go back and edit the file emimparams.dat so that you do NOT assume HWE and random mating.
Re-run the program and take a look at the results. Again you should find that SNPs 1 and 3 do not show anything very significant, but SNP 2 shows a chi-squared on 2 df of 89.95. This is still highly significant, but slightly less so than when you assumed HWE and random mating.
This is because you have more parameters to estimate (six mating type stratification parameters (mu's), as opposed to one allele frequency (A2)) when you do not assume HWE and random mating.
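To see where the count of six comes from: with three possible genotypes per parent (0, 1 or 2 copies of allele 2) there are six unordered parental mating types. Under HWE and random mating all six frequencies follow from a single allele frequency, whereas without those assumptions the model carries a separate parameter for each mating type. The Python sketch below illustrates this bookkeeping only. It is not EMIM's internal code, the function name is our own, and the allele frequency of 0.3 is an arbitrary illustrative value.

```python
from itertools import combinations_with_replacement

def mating_type_freqs(freq_allele2):
    """Frequencies of the six unordered parental mating types under
    HWE and random mating, keyed by (copies, copies) of allele 2."""
    q = freq_allele2
    p = 1.0 - q
    # Hardy-Weinberg genotype frequencies for 0, 1 and 2 copies.
    geno = {0: p * p, 1: 2.0 * p * q, 2: q * q}
    freqs = {}
    for g1, g2 in combinations_with_replacement((0, 1, 2), 2):
        # Unlike pairs can arise in two parental orders.
        orders = 1 if g1 == g2 else 2
        freqs[(g1, g2)] = orders * geno[g1] * geno[g2]
    return freqs

freqs = mating_type_freqs(0.3)  # 0.3 is a hypothetical allele frequency
for pair, f in freqs.items():
    print(pair, round(f, 4))
```

Changing the single input changes all six frequencies together, which is why the HWE and random mating model needs only one nuisance parameter.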
In practice you would normally try to use as much data as possible (i.e. as much data as you have available). Go back and edit the file emimparams.dat so that you read in all possible input data files: casemotherduos.dat, caseparenttrios.dat, conmotherduos.dat, conparents.dat and cons.dat.
Re-run the program and take a look at the results. Again you should find that SNPs 1 and 3 do not show anything very significant, while SNP 2 shows a chi-squared on 2 df of 256.18, highly significant. Keep a note of the maximized ln likelihood = -4530.466.
It has been suggested that this apparently significant child's genotype effect at SNP 2 may in fact be due to a maternal genotype effect. Go back and edit the file emimparams.dat so that you try to estimate maternal genotype effects (S1 and S2) rather than child genotype effects (R1 and R2).
Re-run the program and take a look at the results. Again you should find that SNPs 1 and 3 do not show anything very significant, while SNP 2 shows highly significant maternal genotype effects with a chi-squared on 2 df of 337.33. Keep a note of the maximized ln likelihood = -4489.889.
It looks as if maternal genotype effects may be a better explanation for the observed association than child genotype effects. To try to distinguish between these two possibilities, we need to include both types of effect in the model. Go back and edit the file emimparams.dat so that you try to estimate both maternal genotype effects (S1 and S2) and child genotype effects (R1 and R2).
Re-run the program and take a look at the results. Again you should find that SNPs 1 and 3 do not show anything very significant, while SNP 2 shows significant
child and maternal genotype effects. You can see this from the fact that the
confidence intervals do not include 1 for any of the parameters R1, R2, S1, S2.
The overall significance of the alternative versus the null models is a chi-squared of
441.36, but this is on 4 df as now 4 parameters have been estimated.
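Checking whether a confidence interval includes 1 amounts to a test of that single parameter. A common way to build such an interval, and the one assumed in this sketch, is to take the estimated log relative risk plus or minus 1.96 standard errors and exponentiate the endpoints. The estimate and standard error below are made-up illustrative numbers, not EMIM output, and the function name is our own.

```python
import math

def wald_ci(log_rr, se, z=1.96):
    """Approximate 95% CI for a relative risk from its log estimate and SE."""
    return math.exp(log_rr - z * se), math.exp(log_rr + z * se)

# Hypothetical values for illustration only (not taken from emimresults.out).
log_rr, se = math.log(1.6), 0.12
lower, upper = wald_ci(log_rr, se)
excludes_one = lower > 1.0 or upper < 1.0
print(f"RR = {math.exp(log_rr):.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), "
      f"excludes 1: {excludes_one}")
```

Working on the log scale keeps the interval positive and makes it symmetric around the estimate in multiplicative terms.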
To test whether maternal genotype effects are significant once we have accounted for child genotype effects, we need to take twice the difference in maximized ln likelihoods between the appropriate nested models.
The maximized ln likelihood for the model that includes all 4 parameters is -4437.878.
The maximized ln likelihood for the model that included R1 and R2 was -4530.466
and the maximized ln likelihood for the model that included S1 and S2 was -4489.889.
So a test of whether maternal genotype effects are significant once we have accounted for child genotype effects is a chi-squared on 2 df of twice (-4437.878-(-4530.466)) = 185.176, which is highly significant. Similarly, a test of whether child genotype effects are significant once we have accounted for maternal genotype effects is a chi-squared on 2 df of twice (-4437.878-(-4489.889)) = 104.022, also highly significant. It looks as if both child genotype effects and maternal genotype effects may be operating.
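The arithmetic for these two nested comparisons can be reproduced with a few lines of Python. This is only a sketch using the three ln likelihoods noted above; the variable and function names are our own, and the p values rely on the closed form exp(-x/2) for a chi-squared on 2 df.

```python
import math

# Maximized ln likelihoods noted from the three SNP 2 analyses above.
lnl_child_only = -4530.466      # R1, R2
lnl_maternal_only = -4489.889   # S1, S2
lnl_both = -4437.878            # R1, R2, S1, S2

def lr_test_2df(lnl_full, lnl_reduced):
    """Likelihood ratio statistic and p value for dropping 2 parameters."""
    stat = 2.0 * (lnl_full - lnl_reduced)
    # On 2 df the chi-squared upper tail is exactly exp(-stat / 2).
    return stat, math.exp(-stat / 2.0)

maternal_stat, maternal_p = lr_test_2df(lnl_both, lnl_child_only)
child_stat, child_p = lr_test_2df(lnl_both, lnl_maternal_only)
print(f"maternal given child: {maternal_stat:.3f}, p = {maternal_p:.3g}")
print(f"child given maternal: {child_stat:.3f}, p = {child_p:.3g}")
```

Note that each comparison is only valid because the smaller model is nested within the 4-parameter model and all three were fitted to the same input files.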
Finally, let us see what happens if we also try to include imprinting effects and maternal/child genotype interaction effects. Go back and edit the file emimparams.dat so that you try to estimate child genotype effects (R1 and R2), maternal genotype effects (S1 and S2), imprinting effects (Im and Ip) and four maternal/child genotype interactions
(gamma11, gamma12, gamma21, gamma22).
Re-run the program and take a look at the results. The program should have detected that you tried to estimate too many parameters, and so it has only estimated some of the parameters you asked for. However, even then, the parameter estimates and standard errors look a bit strange.
Go back and edit the file emimparams.dat so that you try to estimate child genotype effects (R1 and R2), maternal genotype effects (S1 and S2), one imprinting effect (Ip) and two maternal/child genotype interactions
(gamma11, gamma22). This should work better as it corresponds to the maximum set of parameters that are theoretically estimable.
Again you should find that SNPs 1 and 3 do not show anything very significant, while SNP 2 shows significant
child and maternal genotype effects.
The imprinting and interaction effects at SNP 2 do not appear to be significant
(as their confidence intervals include 1).
The overall test of all 7 parameters is highly significant (chi-squared on 7 df of 460.53).
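The nested-model logic used earlier can, in principle, also be applied to the three extra parameters (Ip, gamma11, gamma22) as a group. Both overall chi-squared statistics are measured against the same null model, so their difference equals twice the difference in maximized ln likelihoods, provided both runs used the same input files (assumed here). The Python sketch below takes the 4-parameter statistic of 441.36 reported earlier and the 7-parameter statistic of 460.53. The tail-probability helper is our own, written from the standard closed forms for integer df so that only the standard library is needed.

```python
import math

def chisq_sf(x, df):
    """Upper-tail probability of a chi-squared variable on integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term = math.exp(-half)
        total = term
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus a finite series of half-integer terms.
    total = math.erfc(math.sqrt(half))
    n_terms = (df - 1) // 2
    if n_terms >= 1:
        term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
        total += term
        for j in range(1, n_terms):
            term *= x / (2 * j + 1)
            total += term
    return total

chisq_4_params = 441.36   # R1, R2, S1, S2 versus the null
chisq_7_params = 460.53   # adds Ip, gamma11, gamma22
extra_stat = chisq_7_params - chisq_4_params
extra_df = 7 - 4
extra_p = chisq_sf(extra_stat, extra_df)
print(f"extra parameters: chi-squared = {extra_stat:.2f} on {extra_df} df, "
      f"p = {extra_p:.2g}")
```

A joint test of several parameters and the individual confidence intervals need not agree, so it can be worth looking at both.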
Let us see what happens if we add in a larger control sample. Save your original control data file under a different name by typing
cp cons.dat originalcons.dat
and then copy the larger WTCCC control data into the control file by typing
cp wtccccons.dat cons.dat
Re-run the program and take a look at the results. The results are quite similar to the previous results, although the significance at SNP 2 has improved (chi-squared on 7 df of 565.06).
Since these are simulated data, we know the "true" answers. In fact, these data were simulated from a model where there were child genotype effects (R1=1.5 and R2=2.25), maternal genotype effects (S1=2 and S2=4) and an imprinting effect (Im=1.8).
How well do the estimated relative risks reflect these true values?
Even with this large sample size, you should find that the confidence intervals for the estimated parameters are quite wide. This indicates that, even though we have detected significant maternal and child genotype effects, there is still a lot of imprecision in the estimated effect sizes (genotype relative risks).
Instructions for Exercise 2
Now go to the EMIM website and follow the tutorial:
http://www.staff.ncl.ac.uk/richard.howey/emim/example.html
You should find the data you need has already been downloaded for you, in the EMIM-EX2 subdirectory.
Comments
Advantages/disadvantages
This type of modelling is more complicated than basic association testing,
but it allows you to consider more complex models/mechanisms.
Study design issues
Family data has the advantage of being generally more robust to population stratification than case/control data. It also allows investigation of more complex effects, e.g. imprinting. However, it may be harder to collect families than cases and controls.
Other packages
Models similar to the ones described here can also be fitted using SAS code available from Clarice Weinberg, or by using the program LEM (van Den Oord and Vermunt, 2000).
References
Ainsworth HF, Unwin J, Jamison DL and Cordell HJ (2011) Investigation of maternal
effects, maternal-foetal interactions and parent-of-origin effects (imprinting), using mothers and their offspring. Genet Epidemiol 35:19-45.
Cordell HJ, Barratt BJ and Clayton DG (2004) Case/pseudocontrol analysis
in genetic association studies: a unified framework for detection of
genotype and haplotype associations, gene-gene and gene-environment
interactions and parent-of-origin effects. Genet Epidemiol 26:167-185.
Shi M, Umbach DM, Vermeulen SH, Weinberg CR (2008)
Making the most of case-mother/control-mother studies.
Am J Epidemiol 168:541-7.
van Den Oord EJ, Vermunt JK (2000) Testing for linkage disequilibrium, maternal effects, and imprinting with (In)complete case-parent triads, by use of the computer program LEM.
Am J Hum Genet 66:335-8.
Vermeulen SH, Shi M, Weinberg CR, Umbach DM (2009)
A hybrid design: case-parent triads supplemented by control-mother dyads.
Genet Epidemiol 33:136-44.
Weinberg CR, Wilcox AJ, Lie RT (1998) A log-linear approach to case-parent-triad data: assessing effects of disease genes that act either directly or through maternal effects and that may be subject to parental imprinting. Am J Hum Genet 62:969-78.
Weinberg CR (1999) Methods for detection of parent-of-origin effects in genetic studies of case-parents triads.
Am J Hum Genet 65:229-35.
Exercises prepared by: Heather Cordell
Checked by:
Programs used: EMIM, PREMIM, R
Last updated: