In this exercise you will perform quality control (including Multidimensional Scaling (MDS) for detecting population outliers and adjusting for population stratification), followed by case/control association analysis, using the program PLINK. The program R will be used for plotting and visualisation of the results.
PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:
https://zzz.bwh.harvard.edu/plink/index.shtml
We will analyse data derived from a set of 165 cases with a disease of interest, and 496 controls.
All individuals have been typed at a genome-wide set of around 262,000 SNP markers.
In the interests of time, we will analyse the data from just three chromosomes,
but in principal one could use the same approach to analyse
all 22 autosomal chromosomes as well as chromosome X.
You should save the following files to an appropriate directory (folder) on your machine:
startingcasecon.bed
startingcasecon.bim
startingcasecon.fam
prunedHAPMAPdata.bed
prunedHAPMAPdata.bim
prunedHAPMAPdata.fam
people2drop.txt
allpopsKEY.out
MDSoutlier.txt
Open up an MSDOS window by typing "cmd" in the search box (bottom left).
Once the window has opened, type dir to see all the files and directories
(folders) that are in your home space, and move into the directory where you saved the data files e.g. by typing
cd xxxxx
(where xxxxx is replaced by the name of the appropriate folder).
Type dir again to check the required files are available in the directory.
To start with, we will use PLINK to do some quality control checks and filter out individuals and SNPs that are likely to be of poor-quality. By default, PLINK does not impose any filters on things like
minor allele frequency or genotyping rate.
If you want to impose such filters, you can either do it at the same time as doing the analysis, or you can first do some QC filtering and output a new 'filtered' version of the data set for subsequent analysis.
For example, to exclude individuals with more than 10% missing genotypes (over all SNPs), when we run PLINK we will
use the command --mind 0.1.
To then exclude SNPs with more than 10% missing genotypes (over all individuals), we will use the command --geno 0.1.
To exclude SNPs with minor allele frequency less than 0.02, we will use the command --maf 0.02.
To exclude markers that fail a Hardy-Weinberg test at significance threshold 0.000005,
we will use the command --hwe 0.000005.
We will use PLINK to generate a new data set where some people and some SNPs have been removed according to these
QC checks. We do this by typing:
plink --noweb --bfile startingcasecon --mind 0.1 --geno 0.1 --maf 0.05 --hwe 0.000005 --make-bed --out QCed
Here the --noweb command tells PLINK to run without checking the web for new versions, the
--bfile startingcasecon command tells PLINK to read in the (binary) fileset
contained in the files   startingcasecon.bed   startingcasecon.bim
  startingcasecon.fam, and the other commands implement the various QC exclusions.
The --make-bed --out commands should produce a new set of "QCed" (quality-controlled) data (21511 SNPs in 159 cases and 445 controls)
contained in the 3 files   QCed.bed   QCed.bim   QCed.fam
Check the QCed.log file (or the screen output) to see how many people (cases and controls) and
SNPs you started with, and how many were removed by the various QC checks.
Next we will check for people that are related to one another, or that might be duplicate samples. We do this by estimating the
genome-wide identity-by-descent (IBD) sharing:
plink --noweb --bfile QCed --genome --min 0.15 --out QCedIBD
Note that we chose a threshold so only pairs of samples with high IBD sharing (greater than 15%) are output to the file
QCedIBD.genome. Take a look at this file (by typing
more QCedIBD.genome
or double clicking in Windows Explorer). You can see that there are several pairs of people with high IBD
sharing, including one person (region1a_0161 1) who appears in a lot of pairs.
Only one person from each "bad" pair needs to be removed. We have made a proposed list of people to remove in the file
people2drop.txt. Take a look at this file and check whether you agree with it.
To remove these 7 people, type:
plink --noweb --bfile QCed --remove people2drop.txt --make-bed --out QCed2
This creates a new PLINK fileset contained in the 3 files   QCed2.bed   QCed2.bim  
QCed2.fam. Re-run the IBD check on the new fileset:
plink --noweb --bfile QCed2 --genome --min 0.15 --out QCed2IBD
Take a look at QCed2IBD.genome. Are there any remaining "bad" pairs?
To perform principal components or MDS analysis, it is important that we use SNPs that are not too correlated. We can use
PLINK to prune the SNPs down to get an approximately independent set of SNPs:
plink --noweb --bfile QCed2 --indep 50 5 2 --out prunedsnps
This command should produce a file prunedsnps.prune.in listing the SNPs that
should be retained in the analysis. (Check the PLINK website to get more details about
how this option works). To produce a new set of files for the same people,
but containing only these SNPs, type:
plink --noweb --bfile QCed2 --extract prunedsnps.prune.in --make-bed --out prunedQCed2
How many SNPs are contained in the prunedQCed2 fileset?
Next we will merge in the HapMap data which has been pre-prepared for you in PLINK-format files  prunedHAPMAPdata.bed   prunedHAPMAPdata.bim   prunedHAPMAPdata.fam , using the pruned set of SNPs and the same allele-coding as used in our own data.
plink --noweb --bfile prunedHAPMAPdata --bmerge prunedQCed2.bed prunedQCed2.bim prunedQCed2.fam --make-bed --out allpops
Now we have our final set of files   allpops.bed   allpops.bim   allpops.fam   which contain data at the required set of SNPs, in all 5 populations (our own population, and the 4 HapMap populations from Europe (CEU), China (CHB), Japan (JPT) and Yoruba (YRI)).
To perform MDS analysis in PLINK, we first calculate a file allpopsIBD.genome of genome-wide estimates of IBD sharing:
(WARNING - THIS ANALYSIS CAN BE VERY SLOW!)
plink --noweb --bfile allpops --genome --out allpopsIBD
Then we use PLINK to calculate the MDS scores. To calculate the first two scores for everyone in the data set, type:
plink --noweb --bfile allpops --read-genome allpopsIBD.genome --cluster --mds-plot 2 --out allpopsmds
This produces a file allpopsmds.mds. Take a look at the file (by typing
more allpopsmds.mds
or double clicking in Windows Explorer).
To visualise these results properly we will use the R program. Start up R (or RStudio). This can be done in various ways, but
hopefully the following will work:
Type "RStudio" into the search box on the bottom left.
Alternatively you may find RStudio within the "Programming and Databases" section of the menu.
Within RStudio, move to the
directory (folder) where you performed the analysis by clicking "Session" at the top,
and selecting "Set Working Directory" then "Choose Directory".
Then move through your filesystem until you find the correct folder on your H: drive, and click "Open".
Now (within the R Console or the left hand RStudio window) read in the data by typing:
mds <- read.table("allpopsmds.mds", header=T)
popnids <- read.table("allpopsKEY.out", header=T)
These commands read the MDS file you created into an R data frame, together with an additional file allpopsKEY.out that we have
prepared for you, giving an indicator of which population each person in the data set belongs to. The option header=T
tells R that your file has a
header line at the top, giving the different column names.
Take a look at the top of the data frames you have created in R by typing:
head(mds)
head(popnids)
We can create a new variable in R that codes for the different populations. We will call this variable "colour" since we will use it to colour the different populations in our plot:
colour <- popnids$CEUind+2*popnids$CHBind+3*popnids$JPTind+4*popnids$YRIind+5*popnids$OURSind
To look at the variable you have just made, type
colour
Now display the MDS plot in colour:
plot(mds$C1, mds$C2, col=colour)
This plots the 1st and 2nd component for each person, with each point representing a different person. The colouring corresponds to the different populations. You should be able to see the three different HapMap populations in different positions on the plot, together with our data. The CEU samples are shown in black, the YRI in dark blue, the CHB in red and the JPT in green. Our samples (the largest data set) are shown in turquoise blue.
The HapMap CEU data points have been largely covered up by our data.
You can see that our data appears to be predominently Caucasian (since it overlaps with the CEU HapMap data) but there is one outlier. You can check who this is by typing:
mds[mds$C1>0,]
This outputs a list of people who scored more than 0 on the first component. These are mostly the HapMap samples, but also include the outlier from our data, person 1 in family region1c_0021.
To see whether there might be more subtle population stratification in our own samples, we can repeat the MDS analysis but just using our own data (i.e. not merging in the HapMap data).
First we need to delete the population outlier from our own data. We have prepared a file MDSoutlier.txt listing the family number and ID of the person to remove. Take a look at the file and check you understand it.
To prepare new PLINK files with the outlier removed, type:
plink --noweb --bfile prunedQCed2 --remove MDSoutlier.txt --make-bed --out nooutliers
Then perform the genome-wide estimatation of IBD sharing:
plink --noweb --bfile nooutliers --genome --out nooutliersIBD
Then we use PLINK to calculate the MDS scores. To calculate the first two scores for everyone in the data set, type:
plink --noweb --bfile nooutliers --read-genome nooutliersIBD.genome --cluster --mds-plot 2 --out nooutliersmds
This produces a file nooutliersmds.mds. Take a look at the file.
To plot the scores you can read this file into R.
If necessry, start up R and, within R, move to the directory where you saved the files. Then type:
mds2 <- read.table("nooutliersmds.mds", header=T)
Now display the MDS plot:
plot(mds2$C1, mds2$C2)
This plots the first two principal components from the MDS analysis, with each point representing a different person. It is unclear from the plot whether there is any underlying population structure/population stratification. One could always use the first few principal components from the MDS analysis as covariates in any case/control association analysis, in order to adjust for possible unmeasured population stratification.
In fact, our data comes from two different geographical locations. Take a look at the data by typing:
mds2
It turns out that the first 314 people come from one location (region 1), while the remainder (persons 315-596)
come from another location (region 2). We can colour the MDS plot according to geographical location by using the following commands:
mds2$index=1:596
mds2$place=1+1*(mds2$index<=314)
plot(mds2$C1, mds2$C2, col=mds2$place)
You can see that the points do indeed cluster by geographical location.
This population stratification could cause problems in case/control analysis if we did not adjust for it. If we did not know which geographical location each person came from, we could use the first few principal components from the MDS analysis as covariates in any case/control association analysis, in order to adjust for possible population stratification.
First we need to remove the population outlier from the QCed2 fileset. To prepare new PLINK files with the outlier removed, type:
plink --noweb --bfile QCed2 --remove MDSoutlier.txt --make-bed --out QCed3
Then perform a "basic" case/control analysis using the following command:
plink --noweb --bfile QCed3 --assoc --out AssocResults
Take a look at the output file AssocResults.assoc. For each SNP the following
columns of
results
are reported:
CHR Chromosome SNP SNP ID BP Physical position (base-pair) A1 Minor allele name (based on whole sample) F_A Frequency of this allele in cases F_U Frequency of this allele in controls A2 Major allele name CHISQ Basic allelic test chi-square (1df) P Asymptotic p-value for this test OR Estimated odds ratio (for A1, i.e. A2 is reference)
A logistic regression analysis can be performed in plink using the --logistic
command (with the additional --ci option if you want to output a confidence interval for the
estimated odds
ratio). Logistic regression has the flexibility of allowing the inclusion of additional covariates
as predictors of disease status (e.g. things like age, sex, smoking etc.) - here we will use the MDS scores
as covariates in order to correct for any population stratification.
To preform this analysis, type the following:
plink --noweb --bfile QCed3 --logistic --ci 0.95 --covar nooutliersmds.mds --covar-name C1,C2 --hide-covar --out LogRegResults
Take a look at the results which can be found in the file LogRegResults.assoc.logistic.
For each SNP, this displays:
CHR Chromosome number SNP SNP identifier BP Physical (base pair) position A1 Tested allele (minor allele by default) TEST Code for the type of test NMISS Number of non-missing individuals included on analysis OR Estimated odds ratio SE Standard error of estimated log odds ratio L95 Lower 95% conficence limit U95 Upper 95% conficence limit STAT Test statistic P Asymptotic p-value
To visualise these results properly we will use the R program. Start up R (or RStudio) and, within R, move to the
directory where you saved the files. Then read in the data by typing:
res1=read.table("AssocResults.assoc", header=T)
This reads the results in the file "AssocResults.assoc" into an R
dataframe named "res1". The option header=T
tells R that your file has a
header line at the top, giving the different column names.
To see the top few lines of the dataframe "res1", type:
head(res1)
The data frame has 21511 lines, one for each SNP. It would be very laborious to go
through and look at each line by eye. Instead we will plot the results
for the three chromosomes, colouring each chromosome differently:
plot((res1$CHR*130000000)+res1$BP, -log10(res1$P), col=res1$CHR)
This plots
the "BP" column of the "res1" dataframe (plus a factor of 130000000*CHR to separate the BP position by chromosome) on the x axis against the
-log base 10 of the association test p value (the "P" column of the "res1" dataframe)
on the y axis. We plot the -log base 10 of the p value,
which will be large if the p value is small.
The command col=res1$CHR tells R to colour the
points plotted according to the value in the "CHR" column of the "res1" dataframe
(i.e. the points will be coloured by chromosome).
It looks as if there may be some significant results (small p values) on the first chromosome, chromosome 12.
One way to assess the significance of the results, in light of the large number (21612) of tests performed, is to use a Q-Q plot.
To plot a Q-Q plot for the association test chi-squared values (while adding in labels for the x and y axis), type:
plot(qchisq(ppoints(res1$CHISQ),1), sort(res1$CHISQ), ylab="Observed CHISQ", xlab="Expected CHISQ")
abline(a=0,b=1)
The first command orders (sorts) the
chi-squared values and plots them against their expected values, given
the
number of tests performed. The second command adds in to the plot a straight line with gradient 1, passing through the origin. What one would hope to see
is most of the values lying along the straight line with gradient 1, indicating that most results are consistent with the null hypothesis of no association.
However, one would also hope to see a few high chi-squared values at the top that depart from the straight line, which are hopefully true associations.
To visualise the results from the logistic regression analysis, go back to your R window and read the results into R:
res2=read.table("LogRegResults.assoc.logistic", header=T)
This reads the logistic regression results into a dataframe named "res2". To see the top few lines of this dataframe, type:
head(res2)
Now plot the results for the three chromosomes, colouring each chromosome differently:
plot((res2$CHR*130000000)+res2$BP, -log10(res2$P), col=res2$CHR)
The signal identified by logistic regression seems to be very similar to that identified by the "basic" association test. Take a look at the QQ plot
(note that first you need to create the chi-squared test statistic):
res2$CHISQ=res2$STAT^2
plot(qchisq(ppoints(res2$CHISQ),1), sort(res2$CHISQ), ylab="Observed CHISQ", xlab="Expected CHISQ")
abline(a=0,b=1)
To pick out the "significant" results, you can use the following commands:
res1[res1$P<0.000001,]
res2[res2$P<0.000001,]
Note that the logistic regression results are (very slightly) less significant - i.e. closer to the line of equality - than the basic association test results,
perhaps because we included the MDS covariates to correct for population stratification.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006)
Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904-9.
Wellcome Trust Case Control Consortium (2007)
Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.
Nature 447(7145):661-78.