Computer Practical Exercise on QC and case/control association analysis in PLINK (and R)

Overview

Purpose

In this exercise you will perform quality control (including Multidimensional Scaling (MDS) for detecting population outliers and adjusting for population stratification), followed by case/control association analysis, using the program PLINK. The program R will be used for plotting and visualisation of the results.

Program documentation

PLINK documentation:

PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:

https://zzz.bwh.harvard.edu/plink/index.shtml

Data overview

We will analyse data derived from a set of 165 cases with a disease of interest, and 496 controls. All individuals have been typed at a genome-wide set of around 262,000 SNP markers. In the interests of time, we will analyse the data from just three chromosomes, but in principal one could use the same approach to analyse all 22 autosomal chromosomes as well as chromosome X.

Instructions

Data files

You should save the following files to an appropriate directory (folder) on your machine:

startingcasecon.bed
startingcasecon.bim
startingcasecon.fam

prunedHAPMAPdata.bed
prunedHAPMAPdata.bim
prunedHAPMAPdata.fam

people2drop.txt
allpopsKEY.out
MDSoutlier.txt

Step-by-step instructions

Open up an MSDOS window by typing "cmd" in the search box (bottom left). Once the window has opened, type dir to see all the files and directories (folders) that are in your home space, and move into the directory where you saved the data files e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.

1. Basic QC


To start with, we will use PLINK to do some quality control checks and filter out individuals and SNPs that are likely to be of poor-quality. By default, PLINK does not impose any filters on things like minor allele frequency or genotyping rate. If you want to impose such filters, you can either do it at the same time as doing the analysis, or you can first do some QC filtering and output a new 'filtered' version of the data set for subsequent analysis.

For example, to exclude individuals with more than 10% missing genotypes (over all SNPs), when we run PLINK we will use the command --mind 0.1. To then exclude SNPs with more than 10% missing genotypes (over all individuals), we will use the command --geno 0.1. To exclude SNPs with minor allele frequency less than 0.02, we will use the command --maf 0.02. To exclude markers that fail a Hardy-Weinberg test at significance threshold 0.000005, we will use the command --hwe 0.000005.

We will use PLINK to generate a new data set where some people and some SNPs have been removed according to these QC checks. We do this by typing:

plink --noweb --bfile startingcasecon --mind 0.1 --geno 0.1 --maf 0.05 --hwe 0.000005 --make-bed --out QCed

Here the --noweb command tells PLINK to run without checking the web for new versions, the --bfile startingcasecon command tells PLINK to read in the (binary) fileset contained in the files   startingcasecon.bed   startingcasecon.bim   startingcasecon.fam, and the other commands implement the various QC exclusions. The --make-bed --out commands should produce a new set of "QCed" (quality-controlled) data (21511 SNPs in 159 cases and 445 controls) contained in the 3 files   QCed.bed   QCed.bim   QCed.fam

Check the QCed.log file (or the screen output) to see how many people (cases and controls) and SNPs you started with, and how many were removed by the various QC checks.

Next we will check for people that are related to one another, or that might be duplicate samples. We do this by estimating the genome-wide identity-by-descent (IBD) sharing:

plink --noweb --bfile QCed --genome --min 0.15 --out QCedIBD

Note that we chose a threshold so only pairs of samples with high IBD sharing (greater than 15%) are output to the file QCedIBD.genome. Take a look at this file (by typing more QCedIBD.genome or double clicking in Windows Explorer). You can see that there are several pairs of people with high IBD sharing, including one person (region1a_0161 1) who appears in a lot of pairs.

Only one person from each "bad" pair needs to be removed. We have made a proposed list of people to remove in the file people2drop.txt. Take a look at this file and check whether you agree with it.

To remove these 7 people, type:

plink --noweb --bfile QCed --remove people2drop.txt --make-bed --out QCed2

This creates a new PLINK fileset contained in the 3 files   QCed2.bed   QCed2.bim   QCed2.fam. Re-run the IBD check on the new fileset:

plink --noweb --bfile QCed2 --genome --min 0.15 --out QCed2IBD

Take a look at QCed2IBD.genome. Are there any remaining "bad" pairs?


2. Analysis to detect population outliers

To perform principal components or MDS analysis, it is important that we use SNPs that are not too correlated. We can use PLINK to prune the SNPs down to get an approximately independent set of SNPs:

plink --noweb --bfile QCed2 --indep 50 5 2 --out prunedsnps

This command should produce a file prunedsnps.prune.in listing the SNPs that should be retained in the analysis. (Check the PLINK website to get more details about how this option works). To produce a new set of files for the same people, but containing only these SNPs, type:

plink --noweb --bfile QCed2 --extract prunedsnps.prune.in --make-bed --out prunedQCed2

How many SNPs are contained in the prunedQCed2 fileset?

Next we will merge in the HapMap data which has been pre-prepared for you in PLINK-format files  prunedHAPMAPdata.bed   prunedHAPMAPdata.bim   prunedHAPMAPdata.fam , using the pruned set of SNPs and the same allele-coding as used in our own data.

plink --noweb --bfile prunedHAPMAPdata --bmerge prunedQCed2.bed prunedQCed2.bim prunedQCed2.fam --make-bed --out allpops

Now we have our final set of files   allpops.bed   allpops.bim   allpops.fam   which contain data at the required set of SNPs, in all 5 populations (our own population, and the 4 HapMap populations from Europe (CEU), China (CHB), Japan (JPT) and Yoruba (YRI)).

To perform MDS analysis in PLINK, we first calculate a file allpopsIBD.genome of genome-wide estimates of IBD sharing:
(WARNING - THIS ANALYSIS CAN BE VERY SLOW!)

plink --noweb --bfile allpops --genome --out allpopsIBD

Then we use PLINK to calculate the MDS scores. To calculate the first two scores for everyone in the data set, type:

plink --noweb --bfile allpops --read-genome allpopsIBD.genome --cluster --mds-plot 2 --out allpopsmds

This produces a file allpopsmds.mds. Take a look at the file (by typing more allpopsmds.mds or double clicking in Windows Explorer).

To visualise these results properly we will use the R program. Start up R (or RStudio). This can be done in various ways, but hopefully the following will work:

Type "RStudio" into the search box on the bottom left. Alternatively you may find RStudio within the "Programming and Databases" section of the menu.

Within RStudio, move to the directory (folder) where you performed the analysis by clicking "Session" at the top, and selecting "Set Working Directory" then "Choose Directory". Then move through your filesystem until you find the correct folder on your H: drive, and click "Open".

Now (within the R Console or the left hand RStudio window) read in the data by typing:

mds <- read.table("allpopsmds.mds", header=T)

popnids <- read.table("allpopsKEY.out", header=T)


These commands read the MDS file you created into an R data frame, together with an additional file allpopsKEY.out that we have prepared for you, giving an indicator of which population each person in the data set belongs to. The option header=T tells R that your file has a header line at the top, giving the different column names.

Take a look at the top of the data frames you have created in R by typing:

head(mds)

head(popnids)


We can create a new variable in R that codes for the different populations. We will call this variable "colour" since we will use it to colour the different populations in our plot:

colour <- popnids$CEUind+2*popnids$CHBind+3*popnids$JPTind+4*popnids$YRIind+5*popnids$OURSind

To look at the variable you have just made, type

colour

Now display the MDS plot in colour:

plot(mds$C1, mds$C2, col=colour)

This plots the 1st and 2nd component for each person, with each point representing a different person. The colouring corresponds to the different populations. You should be able to see the three different HapMap populations in different positions on the plot, together with our data. The CEU samples are shown in black, the YRI in dark blue, the CHB in red and the JPT in green. Our samples (the largest data set) are shown in turquoise blue.

The HapMap CEU data points have been largely covered up by our data. You can see that our data appears to be predominently Caucasian (since it overlaps with the CEU HapMap data) but there is one outlier. You can check who this is by typing:

mds[mds$C1>0,]

This outputs a list of people who scored more than 0 on the first component. These are mostly the HapMap samples, but also include the outlier from our data, person 1 in family region1c_0021.


3. Analysis to detect population stratification

To see whether there might be more subtle population stratification in our own samples, we can repeat the MDS analysis but just using our own data (i.e. not merging in the HapMap data).

First we need to delete the population outlier from our own data. We have prepared a file MDSoutlier.txt listing the family number and ID of the person to remove. Take a look at the file and check you understand it.

To prepare new PLINK files with the outlier removed, type:

plink --noweb --bfile prunedQCed2 --remove MDSoutlier.txt --make-bed --out nooutliers

Then perform the genome-wide estimatation of IBD sharing:

plink --noweb --bfile nooutliers --genome --out nooutliersIBD

Then we use PLINK to calculate the MDS scores. To calculate the first two scores for everyone in the data set, type:

plink --noweb --bfile nooutliers --read-genome nooutliersIBD.genome --cluster --mds-plot 2 --out nooutliersmds

This produces a file nooutliersmds.mds. Take a look at the file.


To plot the scores you can read this file into R. If necessry, start up R and, within R, move to the directory where you saved the files. Then type:

mds2 <- read.table("nooutliersmds.mds", header=T)

Now display the MDS plot:

plot(mds2$C1, mds2$C2)

This plots the first two principal components from the MDS analysis, with each point representing a different person. It is unclear from the plot whether there is any underlying population structure/population stratification. One could always use the first few principal components from the MDS analysis as covariates in any case/control association analysis, in order to adjust for possible unmeasured population stratification.

In fact, our data comes from two different geographical locations. Take a look at the data by typing:

mds2

It turns out that the first 314 people come from one location (region 1), while the remainder (persons 315-596) come from another location (region 2). We can colour the MDS plot according to geographical location by using the following commands:

mds2$index=1:596
mds2$place=1+1*(mds2$index<=314)
plot(mds2$C1, mds2$C2, col=mds2$place)


You can see that the points do indeed cluster by geographical location. This population stratification could cause problems in case/control analysis if we did not adjust for it. If we did not know which geographical location each person came from, we could use the first few principal components from the MDS analysis as covariates in any case/control association analysis, in order to adjust for possible population stratification.


4. Case/control association analysis

First we need to remove the population outlier from the QCed2 fileset. To prepare new PLINK files with the outlier removed, type:

plink --noweb --bfile QCed2 --remove MDSoutlier.txt --make-bed --out QCed3

Then perform a "basic" case/control analysis using the following command:

plink --noweb --bfile QCed3 --assoc --out AssocResults

Take a look at the output file AssocResults.assoc. For each SNP the following columns of results are reported:

     CHR     Chromosome
     SNP     SNP ID
     BP      Physical position (base-pair)
     A1      Minor allele name (based on whole sample)
     F_A     Frequency of this allele in cases
     F_U     Frequency of this allele in controls
     A2      Major allele name
     CHISQ   Basic allelic test chi-square (1df)
     P       Asymptotic p-value for this test
     OR      Estimated odds ratio (for A1, i.e. A2 is reference)


A logistic regression analysis can be performed in plink using the --logistic command (with the additional --ci option if you want to output a confidence interval for the estimated odds ratio). Logistic regression has the flexibility of allowing the inclusion of additional covariates as predictors of disease status (e.g. things like age, sex, smoking etc.) - here we will use the MDS scores as covariates in order to correct for any population stratification.

To preform this analysis, type the following:

plink --noweb --bfile QCed3 --logistic --ci 0.95 --covar nooutliersmds.mds --covar-name C1,C2 --hide-covar --out LogRegResults

Take a look at the results which can be found in the file LogRegResults.assoc.logistic. For each SNP, this displays:

     CHR           Chromosome number
     SNP           SNP identifier
     BP            Physical (base pair) position
     A1            Tested allele (minor allele by default)
     TEST          Code for the type of test
     NMISS         Number of non-missing individuals included on analysis
     OR            Estimated odds ratio
     SE            Standard error of estimated log odds ratio
     L95           Lower 95% conficence limit
     U95           Upper 95% conficence limit   
     STAT          Test statistic 
     P             Asymptotic p-value

To visualise these results properly we will use the R program. Start up R (or RStudio) and, within R, move to the directory where you saved the files. Then read in the data by typing:

res1=read.table("AssocResults.assoc", header=T)

This reads the results in the file "AssocResults.assoc" into an R dataframe named "res1". The option header=T tells R that your file has a header line at the top, giving the different column names.

To see the top few lines of the dataframe "res1", type:

head(res1)

The data frame has 21511 lines, one for each SNP. It would be very laborious to go through and look at each line by eye. Instead we will plot the results for the three chromosomes, colouring each chromosome differently:

plot((res1$CHR*130000000)+res1$BP, -log10(res1$P), col=res1$CHR)

This plots the "BP" column of the "res1" dataframe (plus a factor of 130000000*CHR to separate the BP position by chromosome) on the x axis against the -log base 10 of the association test p value (the "P" column of the "res1" dataframe) on the y axis. We plot the -log base 10 of the p value, which will be large if the p value is small. The command col=res1$CHR tells R to colour the points plotted according to the value in the "CHR" column of the "res1" dataframe (i.e. the points will be coloured by chromosome).

It looks as if there may be some significant results (small p values) on the first chromosome, chromosome 12.

One way to assess the significance of the results, in light of the large number (21612) of tests performed, is to use a Q-Q plot. To plot a Q-Q plot for the association test chi-squared values (while adding in labels for the x and y axis), type:

plot(qchisq(ppoints(res1$CHISQ),1), sort(res1$CHISQ), ylab="Observed CHISQ", xlab="Expected CHISQ")
abline(a=0,b=1)


The first command orders (sorts) the chi-squared values and plots them against their expected values, given the number of tests performed. The second command adds in to the plot a straight line with gradient 1, passing through the origin. What one would hope to see is most of the values lying along the straight line with gradient 1, indicating that most results are consistent with the null hypothesis of no association. However, one would also hope to see a few high chi-squared values at the top that depart from the straight line, which are hopefully true associations.

To visualise the results from the logistic regression analysis, go back to your R window and read the results into R:

res2=read.table("LogRegResults.assoc.logistic", header=T)

This reads the logistic regression results into a dataframe named "res2". To see the top few lines of this dataframe, type:

head(res2)

Now plot the results for the three chromosomes, colouring each chromosome differently:

plot((res2$CHR*130000000)+res2$BP, -log10(res2$P), col=res2$CHR)

The signal identified by logistic regression seems to be very similar to that identified by the "basic" association test. Take a look at the QQ plot (note that first you need to create the chi-squared test statistic):

res2$CHISQ=res2$STAT^2

plot(qchisq(ppoints(res2$CHISQ),1), sort(res2$CHISQ), ylab="Observed CHISQ", xlab="Expected CHISQ")
abline(a=0,b=1)


To pick out the "significant" results, you can use the following commands:

res1[res1$P<0.000001,]

res2[res2$P<0.000001,]


Note that the logistic regression results are (very slightly) less significant - i.e. closer to the line of equality - than the basic association test results, perhaps because we included the MDS covariates to correct for population stratification.


References

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904-9.

Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145):661-78.


Exercises prepared by: Heather Cordell
Checked by:
Programs used: PLINK, R
Last updated: