In this exercise you will be analysing (a subset of) data generated from a dense SNP genotyping platform. Although these genotyping technologies were developed primarily for genome-wide association studies (GWAS), they can be used to perform linkage analysis, providing one selects the SNPs to be included in the analysis appropriately.
The method used is standard non-parametric linkage analysis, as in the previous exercise.
First, I suggest you make a new folder to keep this exercise's programs and data files in.
Then download the following files (if you do not already have them) into this folder:
merlin.exe
pedstats.exe
chr10-merlin-full-pedfile.txt
chr10-merlin-full-datfile.txt
chr10-merlin-fullmap.txt
chr10-merlin-prunedmap.txt
chr10-merlin-thinned1map.txt
chr10-merlin-thinned2map.txt
The data consist of genotypes at 1601 SNPs from a 30 Mb region on chromosome 10, genotyped in 320 families consisting mostly of affected sib pairs and their parents.
Take a look at the data files (e.g. using the `more' command from the MSDOS window, or by opening them in WordPad) and check you understand how they are coded. Note that the first 3 data files (the pedfile, the datfile and the full map) are standard MERLIN format files. The next 3 files are map files that contain fewer loci than the full map. If you use one of these map files (in conjunction with the full MERLIN pedfile and datfile) then MERLIN will ignore any loci (SNPs) that are not included in that map file.
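If you want to see which SNPs a reduced map file leaves out, the comparison can be sketched in Python. This is only an illustrative sketch: it assumes each map line holds three whitespace-separated columns (chromosome, marker name, position) with an optional header line, and the marker names and positions in the example are made up.

```python
def read_map_markers(lines):
    """Return the marker names found in MERLIN-style map lines.

    Assumption: columns are CHROMOSOME MARKER POSITION, separated by
    whitespace. Lines whose third column is not a number (such as a
    header line) are skipped.
    """
    markers = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            float(fields[2])
        except ValueError:
            continue  # header or malformed line
        markers.append(fields[1])
    return markers


def markers_dropped(full_lines, reduced_lines):
    """List markers present in the full map but absent from the reduced map."""
    reduced = set(read_map_markers(reduced_lines))
    return [m for m in read_map_markers(full_lines) if m not in reduced]


# Hypothetical example data (not taken from the real chr10 files).
full_map = [
    "CHROMOSOME MARKER POSITION",
    "10 rsA 0.10",
    "10 rsB 0.55",
    "10 rsC 1.20",
]
reduced_map = [
    "CHROMOSOME MARKER POSITION",
    "10 rsB 0.55",
]

print(len(read_map_markers(full_map)), "markers in full map")
print("ignored with reduced map:", markers_dropped(full_map, reduced_map))
```

To try this on the real files you would pass the lines of each file instead, e.g. `open("chr10-merlin-fullmap.txt").readlines()`.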
To start with, you will need to open up an MSDOS window. [To do this, click on Start (the round button on the bottom left), All Programs, Accessories, then click on Command Prompt.]
Once the window has opened, type dir to see all the files and directories (folders) that are in your home space, and move into the directory where you saved the data files, e.g. by typing
cd xxxxx
(where xxxxx is replaced by the name of the appropriate folder).
Type dir again to check the required files are available in the directory.
First use the pedstats program to check if there are any inconsistencies with your data:
pedstats -p chr10-merlin-full-pedfile.txt -d chr10-merlin-full-datfile.txt
You should see a lot of error messages about Mendelian inheritance errors. This is not unusual in a data set from a dense SNP genotyping platform, where many thousands of SNPs have been genotyped in an automated way. If you had genotyped just a small number of SNPs yourself, you might be able to go back and check/correct these errors, but that is not possible with such a large number of SNPs. In fact, these data have already been checked, and the SNPs/families that gave high error rates have already been removed. Therefore we will not worry about the (relatively small) proportion of Mendelian inconsistencies remaining: MERLIN will ignore these `bad' SNP/family combinations when it does its linkage analysis.
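To make the idea of a Mendelian inconsistency concrete, here is a minimal Python sketch of the check for a single child and both parents at one SNP. It is not how pedstats or MERLIN implement their checks; it assumes genotypes are given as pairs of allele codes with "0" meaning missing, and it simply skips any trio with a missing genotype.

```python
def is_mendelian_consistent(child, father, mother):
    """Check one child/father/mother trio at a single SNP.

    Each genotype is a pair of allele codes, e.g. ("1", "2").
    Assumption: "0" codes a missing allele; trios with any missing
    allele are treated as consistent because they cannot be checked here.
    """
    if "0" in child or "0" in father or "0" in mother:
        return True
    a, b = child
    # One allele must be obtainable from the father and the other from the mother.
    return (a in father and b in mother) or (b in father and a in mother)


# Father 1/1 and mother 2/2 can only have 1/2 children.
print(is_mendelian_consistent(("1", "2"), ("1", "1"), ("2", "2")))
# A 2/2 child is impossible when the father is 1/1.
print(is_mendelian_consistent(("2", "2"), ("1", "1"), ("1", "2")))
```

A real checker also has to handle larger pedigrees and families with ungenotyped parents, which is why it is best left to pedstats and MERLIN.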
To carry out a non-parametric linkage analysis using all the SNPs, type (all on one line)
merlin -p chr10-merlin-full-pedfile.txt -d chr10-merlin-full-datfile.txt -m chr10-merlin-fullmap.txt --pairs --exp --information --pdf
Take a look at the resulting PDF file. The top plot shows the `information content' - how much information for
linkage analysis is provided by these particular SNPs. The bottom plot shows the results from the multipoint Kong
and Cox Zlr (non-parametric linkage) test, performed at increments across the region.
Close the PDF file and rename it, e.g. to merlin-full.pdf.
In fact, it is not generally considered valid to use such a dense map of SNPs for linkage analysis, as they are
likely to be in linkage disequilibrium (LD) with one another, which can lead to false positive results.
Try rerunning the analysis using the map file chr10-merlin-prunedmap.txt instead. This file has been `pruned' so that SNPs with minor allele frequencies less than 0.4 and SNPs in strong LD with one another have been removed, leaving only 117 SNPs:
merlin -p chr10-merlin-full-pedfile.txt -d chr10-merlin-full-datfile.txt -m chr10-merlin-prunedmap.txt --pairs --exp --information --pdf
Do the results look very different? Have we lost much of the information content?
Close the PDF file and rename it, e.g. to merlin-pruned.pdf.
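The minor allele frequency part of the pruning used for chr10-merlin-prunedmap.txt can be illustrated with a short Python sketch. This is not the procedure that was actually used to build the pruned map (which also removed SNPs in strong LD); it assumes genotypes are pairs of allele codes with "0" for missing, counts alleles over all listed individuals rather than founders only, and uses hypothetical SNP names.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Estimate the minor allele frequency of one SNP from genotype pairs.

    Assumption: "0" codes a missing allele and is not counted.
    Returns 0.0 if the SNP is monomorphic or has no observed alleles.
    """
    counts = Counter(a for g in genotypes for a in g if a != "0")
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    return min(counts.values()) / total


def keep_by_maf(snp_genotypes, threshold=0.4):
    """Return the names of SNPs whose estimated MAF is at least `threshold`."""
    return [name for name, g in snp_genotypes.items()
            if minor_allele_frequency(g) >= threshold]


# Hypothetical genotypes for two SNPs in four individuals.
example = {
    "rsA": [("1", "2"), ("1", "2"), ("1", "1"), ("2", "2")],  # MAF 4/8
    "rsB": [("1", "1"), ("1", "1"), ("1", "2"), ("1", "1")],  # MAF 1/8
}
print(keep_by_maf(example))
```

With the 0.4 threshold from the text, only SNPs whose two alleles are close to equally common survive, which is what makes them informative for linkage.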
Try rerunning the analysis using each of the other two map files, chr10-merlin-thinned1map.txt and chr10-merlin-thinned2map.txt, instead. These have been `thinned' to contain only 1 or 2 SNPs per cM, respectively.
Do the results look very different? Have we lost much of the information content? You should find that,
for linkage analysis, 1 or 2 highly informative SNPs (i.e. those with high minor allele frequencies)
per cM are sufficient to capture most of the information for linkage testing.
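One simple way to picture thinning is to divide the map into 1 cM bins and keep the most informative SNPs in each bin. The Python sketch below does this greedily by minor allele frequency. It is an assumption made for illustration, not the algorithm used to create the thinned map files or by any particular program, and the SNP names, positions and frequencies are invented.

```python
import math
from collections import defaultdict


def thin_map(snps, per_cm=1):
    """Keep at most `per_cm` SNPs in each 1 cM bin, preferring high MAF.

    `snps` is a list of (name, position_in_cM, maf) tuples.
    Returns the kept SNP names in map order.
    """
    bins = defaultdict(list)
    for name, pos, maf in snps:
        bins[math.floor(pos)].append((name, pos, maf))

    kept = []
    for members in bins.values():
        # Highest minor allele frequency first within each bin.
        best = sorted(members, key=lambda s: s[2], reverse=True)[:per_cm]
        kept.extend(best)

    return [name for name, pos, maf in sorted(kept, key=lambda s: s[1])]


# Hypothetical SNPs: (name, position in cM, minor allele frequency).
snps = [
    ("rsA", 0.1, 0.45),
    ("rsB", 0.5, 0.48),
    ("rsC", 0.9, 0.41),
    ("rsD", 1.2, 0.30),
    ("rsE", 1.7, 0.49),
]
print(thin_map(snps, per_cm=1))
print(thin_map(snps, per_cm=2))
```

Binning by whole cM is the crudest choice; a scheme that enforces a minimum spacing between neighbouring SNPs would avoid keeping two SNPs that sit either side of a bin boundary.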
We made this analysis easy for you by preparing the required input files, but usually you would have to do this
yourself. Next week we will learn how to use the PLINK and MapThin programs for creating these files.
The MERLIN website is at:
http://www.sph.umich.edu/csg/abecasis/Merlin/index.html