Computer Practical Exercise on Family-based Association using PLINK, R and UNPHASED

Overview

Purpose

In this exercise you will be carrying out family-based association analysis of data from a genome-wide association study. The purpose is detect which (if any) of the loci are associated with disease, and to estimate their effects.

Methodology

We will use TDT analysis as implemented in PLINK, and likelihood-based analysis as implemented in UNPHASED. UNPHASED implements a TDT or case/pseudocontrol type analysis for nuclear family data, although it can also be used to analyse case/control data, or quantitative trait data from families and/or unrelated individuals.

Program documentation

PLINK documentation:

PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:

http://pngu.mgh.harvard.edu/~purcell/plink/

R documentation:

The R website is at http://www.r-project.org/

From within R, one can obtain help on any command xxxx by typing `help(xxxx)'

UNPHASED documentation:

Full documentation can be found at

http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/

The program runs either in command-line mode, or through a Java-based graphical interface. We will be using the command-line mode, since this is easier to explain and allows greater control over the desired options.

Data overview

We will be using family data consisting of a number of families with two or more affected children plus parents (of varying disease status, mostly unknown). All individuals have been typed at a genome-wide set of around 262,000 SNP markers. In the interests of time, we will analyse the data from just three chromosomes, but in principal one could use the same approach to analyse all 22 autosomal chromosomes as well as chromosome X.

Appropriate data

Appropriate data for this exercise is genotype data at a set of linked loci, typed in a number of nuclear families and/or unrelated individuals. It is also possible to use larger families (extended pedigrees), however they will automatically be broken into nuclear families which are then treated as independent in the analysis. The offspring and unrelated individuals should be phenotyped for either a dichotomous trait or a quantitative trait of interest.


Instructions

Data files

You should have previously (yesterday) saved the family data to an appropriate directory (folder) on your machine and unzipped it. The data should be in a directory 'familydata3chrs' containing a number of files including a file named 'files2merge.txt', a file named 'threesnps.txt' and a pedigree file and a map file for each of three chromosomes (10, 13, 14). The files C**.ped refer to the pedigree files and the files C**.recode.map are the corresponding map files (where ** denotes chromosome).

Data format

The data is in standard pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. Missing data is coded with a zero.

If you are unfamiliar with the standard pedigree file format (which is a commonly-used format for many linkage analysis programs) and need more explanation, please ask an instructor.

The map files are in PLINK format. These files describe the markers (in order) in the pedigree file. The PLINK-format map file contains exactly 4 columns:

     chromosome (1-22, X, Y or 0 if unplaced) 
     rs number or snp identifier 
     Genetic distance (morgans) 
     Base-pair position (bp units) 

Most analyses in PLINK do not require a genetic map distance to be specified. For basic association testing, therefore, the genetic distance column can be set at 0.

Take a look at the data files, and check that you understand how the data is coded.

Step-by-step instructions

1. Analysis in PLINK

Open up an MSDOS window and move into the directory where you saved the data files e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Yesterday we used PLINK to merge together the data from all 3 chromosomes. We output this merged data as a special binary format genotype file, which takes up less disc space and is quicker to read into PLINK when performing various subsequent analyses.

You should have 4 files containing the merged data in your directory: merged.bed, merged.bim, merged.fam and merged.log. The file merged.bed is the binary genotype file which will not be human readable. The file merged.bim is a map file for the merged data. You can take a look at this (e.g. by typing more merged.bim). The file merged.fam gives the pedigree structure in a format that is compatible with the binary genotype file. You can take a look at this (e.g. by typing more merged.fam). Note this file is the same as the first six columns of the original pedigree file, except that unknown disease status has been coded as '-9'.

For subsequent analysis of these data, the files merged.bed, merged.bim and merged.fam will always need to be read into PLINK together, as together they provide all the required information. This can be done using the command --bfile merged.

If you take a look at merged.log, you should see that your merged data set consists of 1336 individuals and 24766 SNPs. For genome-wide studies, it is important to do some quality control checks and filter out individuals and SNPs that are likely to be of poor-quality. By default, PLINK does not impose any filters on things like minor allele frequency or genotyping rate. If you want to impose such filters, you can either do it at the same time as doing the analysis, or you can first do some QC filtering and output a new 'filtered' version of the data set for subsequent analysis.

(For example, to exclude individuals with more than 10% missing genotypes (over all SNPs), we can use the command --mind 0.1. To then exclude SNPs with more than 10% missing genotypes (over all individuals), we can use the command --geno 0.1. To exclude SNPs with minor allele frequency (in founders) less than 0.05, we can use the command --maf 0.05. To exclude markers that fail a Hardy-Weinberg test (in founders) at significance threshold 0.000001, we can use the command --hwe 0.000001. To finally exclude individuals and/or markers on the basis of Mendelian inheritance error rates, we can use the option --me 0.05 0.1, where the first parameter determines that families with more than 5% Mendelian errors (considering all SNPs) will be discarded, and the second parameter indicates that SNPs with more than 10% Mendelian error rate (based on the number of trios) will be excluded).

We will impose these filters at the same time as performing a family-based association (TDT) analysis, by typing the following command:

plink --noweb --bfile merged --mind 0.1 --geno 0.1 --maf 0.05 --hwe 0.000001 --me 0.05 0.1 --tdt --ci 0.95 --out results1

A copy of the screen output is saved in the file results1.log. You can see that, after the various QC checks, 68 of the original 1336 individuals are removed for low genotyping, and 21612 SNPs of the original 24766 SNPs are retained.

The TDT association results are output to a file results1.tdt. Take a look at this file. Each line corresponds to the results for a particular SNP. Each line contains the following columns:

     CHR         Chromosome number
     SNP         SNP identifier
     A1          Minor allele code
     A2          Major allele code
     T           Transmitted minor allele count
     U           Untransmitted allele count
     OR          TDT odds ratio
     L95         Lower 95% confidence interval for TDT odds ratio
     U95         Upper 95% confidence interval for TDT odds ratio
     CHISQ       TDT chi-square statistic
     P           TDT asymptotic p-value
     A:U_PAR     Parental discordance counts
     CHISQ_PAR   Parental discordance statistic
     P_PAR       Parental discordance asymptotic p-value
     CHISQ_COM   Combined test statistic
     P_COM       Combined test asymptotic p-value

The most useful columns are CHISQ (the TDT chi-squared statistic) and its p value (P). For details about the Parental discordance test, see the PLINK documentation.

To visualise these results properly we will use the R program. Start up R, and move to the directory where you performed this analysis.

Now (within R) read in the data by typing:

res1<-read.table("results1.tdt", header=T)

This reads the results into a dataframe named "res1". To see the top few lines of this dataframe, type:

head(res1)

The data frame has 21612 lines, one for each SNP. It would be very laborious to go through and look at each line by eye. Instead we will plot the results for the three chromosomes, colouring each chromosome differently:

position<-res1$BP+150000000*res1$CHR
plot(position, -log10(res1$P), col=res1$CHR)


Since the base-pair position is only defined within each chromosome, this slightly clunky pair of commands allows one to plot the results in such a way that the chromosomes are separated from one another. Here we have plotted -log base 10 of the TDT p value, which will be large if the p value is small. It looks as if there may be some significant results (small p values) on the first chromosome, chromosome 10.

One way to assess the significance of the results, in light of the large number (38963) of tests performed, is to use a Q-Q plot. To plot a Q-Q plot for the TDT chisquared values, type:

plot(qchisq(ppoints(res1$CHISQ),1), sort(res1$CHISQ))
abline(a=0,b=1)


The first command orders the TDT chi-squared values and plots them against their expected values, given the number of tests performed. The second command adds in to the plot a straight line with gradient 1, passing through the origin. What one would hope to see is most of the values lying along the straight line with gradient 1, indicating that most results are consistent with the null hypothesis of no association. However, one would also hope to see a few high TDT values at the top that depart from the straight line, which are hopefully true associations.

In our plot, we see that there are indeed some high TDT values at the top - however the rest of the TDT results do not lie on the straight line with gradient 1 - instead there seems to be a systematic departure away from this line. This is worrying as it suggests the overall distribution of the TDT test statistics is not what would be expected under the null hypothesis. It is unlikely that these are all true positive associations (!) - instead it suggests there may be some kind of problem with our data.

For family studies it turns out that there is a simple explanation for this kind of systematic departure for the expected distribution (described in Gordon et al. (2001) Am J Hum Genet 69:371-380). If there are genotyping errors, then this can induce an over-inflation in the TDT statistic and an increase in type 1 error rate, when trios with Mendelian inheritance errors are discarded (as they must be - since we cannot work out which alleles were or were not transmitted) while those trios without apparent Mendelian inheritance errors are retained. (This phenomenon does not occur in case/control studies, where genotyping errors lead to a reduction in power, but do not increase the type 1 error rate).

So a possible explanation for our results is that our original QC checks were not sufficiently stringent to get rid of SNPs that are unreliable i.e. that may have large numbers of genotyping errors. Let us repeat the analysis using more stringent QC cut-offs. Go back to the window where you ran PLINK and re-run it as follows:

plink --noweb --bfile merged --mind 0.05 --geno 0.05 --maf 0.05 --hwe 0.000005 --me 0.05 0.05 --tdt --ci 0.95 --out results2

From the screen output, can you tell whether these QC cut-offs result in a smaller number of SNPs being retained?

To visualise the results, go back to your R window and read the results into R:

res2<-read.table("results2.tdt", header=T)

This reads the new results into a dataframe named "res2". To see the top few lines of this dataframe, type:

head(res2)

Now plot the results for the three chromosomes, colouring each chromosome differently:

position<-res2$BP+150000000*res2$CHR
plot(position, -log10(res2$P), col=res2$CHR)


The p values do not look quite as significant as before, now that we have removed more SNPs (and people). However, there still may be some results of interest on chromosome 10.

Use the following commands to re-create the Q-Q plot for the old results, and compare it with the new results:

par(mfrow=c(2,1))
plot(qchisq(ppoints(res1$CHISQ),1), sort(res1$CHISQ))
abline(a=0,b=1)
plot(qchisq(ppoints(res2$CHISQ),1), sort(res2$CHISQ))
abline(a=0,b=1)


The bulk of the results seem closer to the straight line with gradient 1, suggesting that these more stringent QC cut-offs are probably appropriate. It looks like these have been successful in getting rid of SNPs with high genotype error rates.

There does not appear to be strong evidence of association, in light of the number of tests performed, but the top few TDT results might be of interest. To see which SNPs these come from, type

res2[res2$CHISQ>16,]

This command just writes out the lines in the "res2" dataframe that have a TDT chi-squared value greater than 16. We can see that the top 3 SNPs all come from the same region on chromosome 10. These results are not terribly significant by genome-wide standards (p=6.937e-06) but might be worthy of further investigation.

For interest, let us see what happens if we were to run the analysis without using any QC filters. Go back to the window where you ran PLINK and re-run it as follows:

plink --noweb --bfile merged --tdt --ci 0.95 --out results3

Now go back to your R window and read the results into R:

res3<-read.table("results3.tdt", header=T)

Plot the results for the three chromosomes, colouring each chromosome differently:

position<-res3$BP+150000000*res3$CHR
plot(position, -log10(res3$P), col=res3$CHR)


There seem to be quite a few significant results. Have a look at the Q-Q plot:

plot(qchisq(ppoints(res3$CHISQ),1), sort(res3$CHISQ))
abline(a=0,b=1)


Now we see a very strong departure from the expected distribution, probably due to the fact that we are retaining many unreliable SNPs with high genotyping error rates.



2. Analysis in UNPHASED

We will finish by using the program UNPHASED to re-analyse the data for the three SNPs on chromosome 10 that showed the strongest association, at the most reliable QC threshold. Remind yourself which SNPs these were, and what their results were, by typing in your R window:

res2[res2$CHISQ>16,]

The SNPs in question are named SNP_A-2217976, SNP_A-1852546, SNP_A-1896203.

We can use PLINK to create a smaller pedigree file, just containing these 3 SNPs. Go back to the window in which you ran PLINK and take a look at the file 'threesnps.txt'. This file lists the 3 SNPs we wish to extract. We will use PLINK to create a new pedigree file 'threesnps.ped' with just these 3 SNPs. First we carry out QC to remove various individuals and SNPs:

plink --noweb --bfile merged --mind 0.05 --geno 0.05 --maf 0.05 --hwe 0.000005 --me 0.05 0.05 --make-bed --out qc

The --make-bed command makes a new set of binary files qc.bed, qc.bim, qc.fam that have had the relevant QC checks performed and in which various individuals/SNPs have been discarded.

Now extract the three SNPs of interest:

plink --noweb --bfile qc --extract threesnps.txt --recode --output-missing-phenotype 0 --out threesnps

Here the --extract threesnps.txt command tells PLINK just to extract the three SNPs listed in 'threesnps.txt', the --recode command tells PLINK to output the results to a new pedigree file and the --output-missing-phenotype 0 command tells PLINK to use 0 as the missing value code for the column signifying disease status. The --out threesnps command tells PLINK to use 'threesnps' as the stem for any output files, so the output pedigree file will be named 'threesnps.ped'. Take a look at 'threesnps.ped' and check it looks as you expect.

To use UNPHASED to perform a TDT-type likelihood-based analysis at each of the 3 loci in 'threesnps.ped' individually, type

unphased threesnps.ped

The results should ouput to the screen (you can instead redirect them to a file named "file.txt" if you prefer, using the additional command > file.txt ).

Take a look at the output. First comes some program-specific information and then come the results (i.e. tests of association and estimates of the allelic or haplotypic odds ratios) for each marker. The most important information is the likelihood ratio chi-squared test and p-value that come under the heading TEST OF OVERALL ASSOCIATION .

The p values for the three markers are 2.83104e-06, 5.13306e-07 and 1.14495e-06 respectively. These are quite similar (although not identical) to the PLINK results which were 1.158e-05, 6.937e-06 and 2.555e-05 respectively. UNPHASED uses a slightly different (likelihood-based) version of the TDT, so one would not expect the results to be identical.

It is more difficult to match up the odds ratios between PLINK and UNPHASED, as by default UNPHASED uses allele 1 as reference, whereas PLINK uses the major allele as reference. The major allele for the first locus is allele 1, but for the other two loci it is allele 2. Perform the UNPHASED analysis for each locus individually, specifying the major allele as reference, by using the following commands:

unphased threesnps.ped -marker 1 -reference 1
unphased threesnps.ped -marker 2 -reference 2
unphased threesnps.ped -marker 3 -reference 2


You should find the odds ratios (ORs) for the effect of the minor allele, relative to the major allele, are 0.4446, 0.284 and 0.3101 respectively, indicating protective effects. Again these are similar but not identical to the ORs from PLINK which were 0.5083, 0.4074, 0.4607 respectively. The counts of transmitted and untransmitted alleles from UNPHASED and PLINK will not match up as PLINK only reports transmissions from heterozygous parents, while UNPHASED reports all transmissions.

Either way, we have reasonable evidence of association at these markers. Let us see what happens if we try to perform a haplotype analysis of the three markers, using the major alleles as reference. We will only include haplotypes that have frequency greater than than 3%:

unphased threesnps.ped -marker 1 2 3 -reference 1 2 2 -window 3 -zero 0.03

We can see that there is LD between the markers: only three haplotypes have frequency greater than 3%. The 3 df haplotype association test is quite significant (p=3.54832e-08).

However, it is possible that most of the significance in the haplotype test could be accounted for by a single marker (rather than by invoking haplotype effects). To perform tests at markers 2 and 3, while accounting for (conditioning on) any effect at marker 1, type:

unphased threesnps.ped -marker 2 3 -condition 1 -zero 0.03

Marker 2 and marker 3 both seem to add significantly to marker 1 (p values 0.001 and 6.07e-5 respectively).

To perform tests at markers 1 and 3, while accounting for (conditioning on) any effect at marker 2, type:

unphased threesnps.ped -marker 1 3 -condition 2 -zero 0.03

Marker 1 adds slightly to marker 2 (p=0.056). Marker 3 does not add anything to marker 2 (p=1), possibly because these markers appear to be in complete LD (only two possible haplotypes, 1-1 and 2-2, exist).

To perform tests at markers 1 and 2, while accounting for (conditioning on) any effect at marker 3, type:

unphased threesnps.ped -marker 1 2 -condition 3 -zero 0.03

Marker 1 adds slightly to marker 3 (p=0.042). Marker 3 does not add anything to marker 2 (p=1), probably because these markers are in complete LD

Since marker 2 was most significant individually, and marker 1 adds slightly to marker 2, it might make sense to look at haplotypes of markers 1 and 2 only:

unphased threesnps.ped -marker 1 2 -reference 1 2 -window 2 -zero 0.03

The 3 df haplotype association test is still quite significant (p=5.31e-07). The odds ratios for haplotypes 1-1, 2-1, 2-2 (relative to 1-2) are shown in the table.




Answers

How to interpret the output

Interpretation of the output is described in the step-by-step instructions. In general, the output will consist of a likelihood-ratio or chi-squared test for whatever you are test you are performing, and regression coefficients or odds ratio estimates for the predictor variables in the current model. Please ask if you need help in understanding the output for any specific test.


Comments

Advantages/disadvantages

PLINK is especially useful for data management and analysis of genome-wide association data. Both PLINK and UNPHASED are very flexible as they can be used to analyse population-based (e.g. case/control) and/or family data, as well as performing haplotype analysis and analysis of either dichotomous or quantitative traits.

Study design issues

Family data has the advantage of being generally more robust (than case/control data) to poulation stratification. It also allows investigation of more complex effects e.g. imprinting. But it may be harder to collect families than cases and controls.

Other packages

TDT analysis can be performed in a variety of other packages including statistical packages such as R and Stata. TDTae (TDT allowing for error) is a program that performs a TDT test allowing for possible genotyping errors. A form of TDT analysis of haplotypes is performed by the TRANSMIT program by David Clayton. For testing (but not estimation) of genotype or haplotype association effects in families, one can use the FBAT, PBAT, PDT or GDT programs.


References

Cordell HJ and Clayton DG (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. American Journal of Human Genetics 70:124-141.

Cordell HJ, Barratt BJ and Clayton DG (2004) Case/pseudocontrol analysis in genetic association studies: a unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions and parent-of-origin effects. Genetic Epidemiology 26:167-185.

Dudbridge F (2008) Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data. Hum Hered 66:87-98.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81:559-575.

Schaid DJ. 1996. General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol 13:423-449.

Spielman RS, McGinnis RE, Ewens WJ. 1993. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM) Am J Hum Genet 52:455-466.


Exercises prepared by: Heather Cordell
Checked by:
Programs used: PLINK, R, Unphased
Last updated: