Computer Practical Exercise on Family-based Association using PLINK and R

Overview

Purpose

In this exercise you will be carrying out family-based association analysis of data from a genome-wide association study. The purpose is to detect which (if any) of the loci are associated with disease, and to estimate their effects.

Methodology

We will use TDT analysis as implemented in PLINK.

Program documentation

PLINK documentation:

PLINK has an extensive set of documentation, including a PDF manual, a web-based tutorial and web-based reference pages:

http://zzz.bwh.harvard.edu/plink/

R documentation:

The R website is at http://www.r-project.org/

From within R, one can obtain help on any command xxxx by typing help(xxxx)

Data overview

We will be using family data consisting of a number of families with two or more affected children plus parents (of varying disease status, mostly unknown). All individuals have been typed at a genome-wide set of around 262,000 SNP markers. In the interests of time, we will analyse the data from just three chromosomes, but in principle one could use the same approach to analyse all 22 autosomal chromosomes as well as chromosome X.

Appropriate data

Appropriate data for this exercise is genotype data at a set of loci, typed in a number of nuclear families and/or unrelated individuals. It is also possible to use larger families (extended pedigrees); however, these will automatically be broken into nuclear families, which are then treated as independent in the analysis. The offspring and unrelated individuals should be phenotyped for either a dichotomous trait or a quantitative trait of interest.


Instructions

Data files and software

First I suggest you make a new folder to keep this exercise's programs and data files in. Then download the following files into this folder:

plink.exe

threechrs.bed
threechrs.bim
threechrs.fam


The data is in the form of standard PLINK binary-format .bed/.bim/.fam files. Take a look at the data files threechrs.fam and threechrs.bim, and check that you understand how the data is coded. (The file threechrs.bed is a binary genotype file and will not be human-readable.)

The file threechrs.bim is a map file. You can take a look at this (e.g. by typing more threechrs.bim, or by opening it in WordPad). The file threechrs.fam gives the pedigree structure in a format that is compatible with the binary genotype file. You can take a look at this (e.g. by typing more threechrs.fam). Note this file is the same as the first six columns of the original pedigree file, except that unknown disease status has been coded as '-9'.
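For reference, each line of the .bim file has six columns (chromosome, SNP identifier, genetic map position in centiMorgans, base-pair position, allele 1 and allele 2), and each line of the .fam file also has six columns (family ID, individual ID, father ID, mother ID, sex, phenotype). The values below are made up purely to illustrate the layout; your files will look similar in format but not in content:

     threechrs.bim:   10   rs1234567   0   1234567   A   G
     threechrs.fam:   FAM001   3   1   2   2   2

(The made-up .fam line above would describe an affected daughter, individual 3 in family FAM001, with father 1 and mother 2.)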

For subsequent analysis of these data, the files threechrs.bed, threechrs.bim and threechrs.fam will always need to be read into PLINK together, as together they provide all the required information.

Step-by-step instructions

Open up an MS-DOS (command prompt) window and move into the directory where you saved the data files, e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Your data set consists of 1336 individuals and 24766 SNPs. For genome-wide studies, it is important to do some quality control (QC) checks and filter out individuals and SNPs that are likely to be of poor quality. By default, PLINK does not impose any filters on things like minor allele frequency or genotyping rate. If you want to impose such filters, you can either do it at the same time as doing the analysis, or you can first do some QC filtering and output a new 'filtered' version of the data set for subsequent analysis.

(For example, the following filters could be applied:

     --mind 0.1       exclude individuals with more than 10% missing genotypes (over all SNPs)
     --geno 0.1       exclude SNPs with more than 10% missing genotypes (over all individuals)
     --maf 0.05       exclude SNPs with minor allele frequency (in founders) less than 0.05
     --hwe 0.000001   exclude SNPs that fail a Hardy-Weinberg test (in founders) at significance threshold 0.000001
     --me 0.05 0.1    exclude families with more than 5% Mendelian errors (considering all SNPs), and SNPs with more than 10% Mendelian error rate (based on the number of trios))
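(Alternatively, you could first create a filtered copy of the data and then analyse that copy in a second step. A command along the following lines would do it, where --make-bed writes out a new set of .bed/.bim/.fam files and the output prefix threechrs_qc is just an illustrative name:

plink --noweb --bfile threechrs --mind 0.1 --geno 0.1 --maf 0.05 --hwe 0.000001 --me 0.05 0.1 --make-bed --out threechrs_qc

We will not do that here.)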

We will impose these filters at the same time as performing a family-based association (TDT) analysis, by typing the following command:

plink --noweb --bfile threechrs --mind 0.1 --geno 0.1 --maf 0.05 --hwe 0.000001 --me 0.05 0.1 --tdt --ci 0.95 --out results1

A copy of the screen output is saved in the file results1.log. You can see that, after the various QC checks, 68 of the original 1336 individuals are removed for low genotyping, and 21612 of the original 24766 SNPs are retained.

The TDT association results are output to a file results1.tdt. Take a look at this file. Each line corresponds to the results for a particular SNP. Each line contains the following columns:

     CHR         Chromosome number
     SNP         SNP identifier
     A1          Minor allele code
     A2          Major allele code
     T           Transmitted minor allele count
     U           Untransmitted allele count
     OR          TDT odds ratio
     L95         Lower 95% confidence interval for TDT odds ratio
     U95         Upper 95% confidence interval for TDT odds ratio
     CHISQ       TDT chi-square statistic
     P           TDT asymptotic p-value
     A:U_PAR     Parental discordance counts
     CHISQ_PAR   Parental discordance statistic
     P_PAR       Parental discordance asymptotic p-value
     CHISQ_COM   Combined test statistic
     P_COM       Combined test asymptotic p-value

The most useful columns are CHISQ (the TDT chi-squared statistic) and its p value (P). For details about the Parental discordance test, see the PLINK documentation.

To visualise these results properly we will use the R program. Start up R, and move to the directory (folder) where you performed this analysis by clicking "File" at the top left, and selecting "Change dir". Then move through your filesystem until you find the correct folder, and click "OK".
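Alternatively, if you know the path to the folder, you can change directory from within R using the setwd() command; the path below is just a placeholder for your own folder:

setwd("C:/path/to/your/folder")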

Now (within R) read in the data by typing:

res1=read.table("results1.tdt", header=T)

This reads the results in the file "results1.tdt" into an R dataframe named "res1". The option header=T tells R that your file has a header line at the top, giving the different column names.

To see the top few lines of the dataframe "res1", type:

head(res1)
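You can also check the dimensions of the dataframe (the number of rows and columns) by typing:

dim(res1)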

The data frame has 21612 lines, one for each SNP. It would be very laborious to go through and look at each line by eye. Instead we will plot the results for the three chromosomes, colouring each chromosome differently:

plot(res1$BP, -log10(res1$P), col=res1$CHR)

This plots the "BP" column of the "res1" dataframe on the x axis against minus log base 10 of the TDT p value (the "P" column of the "res1" dataframe) on the y axis; -log10(p) will be large when the p value is small. The option col=res1$CHR tells R to colour the points according to the value in the "CHR" column of the "res1" dataframe (i.e. the points will be coloured by chromosome). If you did not want to colour by chromosome, you could simply type:

plot(res1$BP, -log10(res1$P))

However, it is easier to see if you colour by chromosome:

plot(res1$BP, -log10(res1$P), col=res1$CHR)

It looks as if there may be some significant results (small p values) on the first chromosome, chromosome 10.

One way to assess the significance of the results, in light of the large number (21612) of tests performed, is to use a Q-Q plot. To draw a Q-Q plot of the TDT chi-squared values, type:

plot(qchisq(ppoints(res1$CHISQ),1), sort(res1$CHISQ))
abline(a=0,b=1)


The first command sorts the TDT chi-squared values and plots them against their expected values, given the number of tests performed. The second command adds to the plot a straight line with gradient 1, passing through the origin. What one would hope to see is most of the values lying along the straight line with gradient 1, indicating that most results are consistent with the null hypothesis of no association. However, one would also hope to see a few high TDT values at the top that depart from the straight line, which are hopefully true associations.

In our plot, we see that there are indeed some high TDT values at the top. However, the rest of the TDT results do not lie on the straight line with gradient 1; instead there seems to be a systematic departure away from this line. This is worrying, as it suggests the overall distribution of the TDT test statistics is not what would be expected under the null hypothesis. It is unlikely that these are all true positive associations (!) - instead it suggests there may be some kind of problem with our data.
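One way to quantify this over-inflation (a quick sketch, not something produced by PLINK here) is to compute the so-called genomic inflation factor: the ratio of the median observed chi-squared statistic to its expected value under the null hypothesis, qchisq(0.5,1) (approximately 0.456). Values much larger than 1 indicate inflation of the test statistics:

median(res1$CHISQ, na.rm=TRUE)/qchisq(0.5,1)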

For family studies it turns out that there is a simple explanation for this kind of systematic departure from the expected distribution (described in Gordon et al. (2001) Am J Hum Genet 69:371-380). If there are genotyping errors, these can induce an over-inflation of the TDT statistic and an increase in the type 1 error rate, because trios with apparent Mendelian inheritance errors are discarded (as they must be, since we cannot work out which alleles were or were not transmitted) while trios without apparent Mendelian inheritance errors are retained. (This phenomenon does not occur in case/control studies, where genotyping errors lead to a reduction in power but do not increase the type 1 error rate.)

So a possible explanation for our results is that our original QC checks were not sufficiently stringent to get rid of SNPs that are unreliable i.e. that may have large numbers of genotyping errors. Let us repeat the analysis using more stringent QC cut-offs. Go back to the window where you ran PLINK and re-run it as follows:

plink --noweb --bfile threechrs --mind 0.05 --geno 0.05 --maf 0.05 --hwe 0.000005 --me 0.05 0.05 --tdt --ci 0.95 --out results2

From the screen output, or from the log file produced (results2.log), can you tell whether these QC cut-offs result in a smaller number of SNPs being retained?

To visualise the results, go back to your R window and read the results into R:

res2=read.table("results2.tdt", header=T)

This reads the new results into a dataframe named "res2". To see the top few lines of this dataframe, type:

head(res2)

Now plot the results for the three chromosomes, colouring each chromosome differently:

plot(res2$BP, -log10(res2$P), col=res2$CHR)

The p values do not look quite as significant as before, now that we have removed more SNPs (and people). However, there still may be some results of interest on chromosome 10.

Use the following commands to re-create the Q-Q plot for the old results, and compare it with the new results:

par(mfrow=c(2,2))
plot(qchisq(ppoints(res1$CHISQ),1), sort(res1$CHISQ))
abline(a=0,b=1)
plot(qchisq(ppoints(res2$CHISQ),1), sort(res2$CHISQ))
abline(a=0,b=1)


The first command divides your graphics screen into 4 separate plots (2 by 2). (If you want to go back to plotting one plot at a time at any stage, you can type: par(mfrow=c(1,1)) ).

The bulk of the results seem closer to the straight line with gradient 1, suggesting that these more stringent QC cut-offs are probably appropriate. It looks like these have been successful in getting rid of SNPs with high genotype error rates.

There does not appear to be strong evidence of association, in light of the number of tests performed, but the top few TDT results might be of interest. To see which SNPs these come from, type

res2[res2$CHISQ>16,]

This command just writes out the lines in the "res2" dataframe that have a TDT chi-squared value greater than 16. We can see that the top 3 SNPs all come from the same region on chromosome 10. These results are not terribly significant by genome-wide standards (p=6.937e-06) but might be worthy of further investigation.
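(As a rough guide, you can compute in R the Bonferroni-corrected p value threshold for an overall 5% significance level across the 21612 tests performed:

0.05/21612

This gives roughly 2.3e-06, so the best result above falls just short of this threshold.)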

Another way to pick out the most significant results would be to write the lines in the "res2" dataframe that have a TDT p value less than a certain amount. E.g. to output all results with p value less than 0.00001 (so with minus log base 10 (p value) > 5), type:

res2[res2$P<0.00001,]

Can you see how often the A1 allele is transmitted and untransmitted from heterozygous parents to affected offspring? What are the resulting allelic odds ratios (ORs) conferred by the A1 allele?
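(If you want to check your answers, recall that the TDT odds ratio is simply the ratio of the transmitted to untransmitted counts, T/U. A minimal sketch that recomputes it for these SNPs, using "top2" as a made-up name for the subsetted dataframe:

top2 = res2[res2$P<0.00001,]    # the same SNPs as above
top2$T/top2$U                   # should reproduce the OR column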



For interest, let us see what happens if we run the analysis without any QC filters. Go back to the window where you ran PLINK and re-run it as follows:

plink --noweb --bfile threechrs --tdt --ci 0.95 --out results3

Now go back to your R window and read the results into R:

res3=read.table("results3.tdt", header=T)

Plot the results for the three chromosomes, colouring each chromosome differently:

plot(res3$BP, -log10(res3$P), col=res3$CHR)

There seem to be quite a few significant results. Have a look at the Q-Q plot:

plot(qchisq(ppoints(res3$CHISQ),1), sort(res3$CHISQ))
abline(a=0,b=1)


Now we see a very strong departure from the expected distribution, probably due to the fact that we are retaining many unreliable SNPs with high genotyping error rates.
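If you computed the genomic inflation factor sketched earlier, you could compare it across the analyses; you would expect it to be largest for these unfiltered results:

median(res3$CHISQ, na.rm=TRUE)/qchisq(0.5,1)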

Saving your R plots

If you want to save a copy of your R plots, you need to write them to a file, rather than output them to the screen. E.g. to save the "good" plot from your second analysis as a .png file, type:

png("PlotFromResults2.png")
plot(res2$BP, -log10(res2$P), col=res2$CHR)
dev.off()


You should find that a new file "PlotFromResults2.png" has been created in the folder where you have been doing the analysis. Similarly, to save the QQ plot, you could type

png("QQFromResults2.png")
plot(qchisq(ppoints(res2$CHISQ),1), sort(res2$CHISQ))
abline(a=0,b=1)
dev.off()
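If you prefer PDF output, the same approach works with the pdf() graphics device instead of png(); the file name below is again just an example:

pdf("QQFromResults2.pdf")
plot(qchisq(ppoints(res2$CHISQ),1), sort(res2$CHISQ))
abline(a=0,b=1)
dev.off()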





Study design issues

Family data has the advantage of being generally more robust (than case/control data) to population stratification. It also allows investigation of more complex effects, e.g. imprinting. However, it may be harder to collect families than cases and controls.

Other packages

TDT analysis can be performed in a variety of other packages including genetics packages such as UNPHASED and statistical packages such as R and Stata. TDTae (TDT allowing for error) is a program that performs a TDT test allowing for possible genotyping errors. A form of TDT analysis of haplotypes is performed by the TRANSMIT program by David Clayton. For testing (but not estimation) of genotype or haplotype association effects in families, one can use the FBAT, PBAT, PDT or GDT programs.


Exercises prepared by: Heather Cordell
Checked by:
Programs used: PLINK, R
Last updated: