Linkage analysis of dense SNP sets using MERLIN and PLINK

Overview

Purpose

In this exercise you will be analysing (a subset of) data generated from a dense SNP genotyping platform. Although these genotyping technologies were developed primarily for genome-wide association studies (GWAS), they can be used to perform linkage analysis, providing one selects the SNPs to be included in the analysis appropriately.

Methodology

The methodology used is standard methodology for non-parametric linkage analysis, as used in the previous exercise.

Programs and Data

First I suggest you make a new folder to keep this exercise's programs and data files in. Then download the following files into this folder:

merlin.exe
pedstats.exe
plink.exe
mapthin.exe

chr10-merlin-full-pedfile.txt
chr10-merlin-full-datfile.txt
chr10-merlin-fullmap.txt
chr10-merlin-prunedmap.txt
chr10-merlin-thinned1map.txt
chr10-merlin-thinned2map.txt
chr10-plinkmap.txt

Exercise

Data overview

The data consists of genotype data at 1601 SNPs from a 30 MB region on chr 10, genotyped in 320 families consisting mostly of affected sib pairs and their parents.

Instructions

Data format

Take a look at the data files (e.g. using the `more' command from the MSDOS window, or by opening them in WordPad) and check you understand how they are coded. Note that the first 3 files are standard MERLIN format files. The next 3 files consist of map files that contain smaller numbers of loci compared to the full map. If you use these map files (in conjunction with the full MERLIN pedfile and datafile) then MERLIN will ignore any loci (SNPs) that are not included in the current map file.

The final file is a PLINK-format map file. PLINK uses a slightly different format map file compared to MERLIN. The PLINK format consists of exactly 4 columns:

     chromosome (1-22, X, Y or 0 if unplaced) 
     rs number or snp identifier 
     Genetic distance (in Morgans or cM, can be set to 0 when performing association analysis) 
     Base-pair position (bp units)

Step-by-step instructions

Linkage analysis using all 1601 SNPs

To start with, you will need to open up an MSDOS window. [To do this, click on Start (the round button on the bottom left), All Programs, Accessories, then click on Command Prompt].

Once the window has opened, type dir to see all the files and directories (folders) that are in your home space, and move into the directory where you saved the data files e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.

First use the pedstats program to check if there are any inconsistencies with your data:

pedstats -p chr10-merlin-full-pedfile.txt -d chr10-merlin-full-datfile.txt

You should see a lot of error messages about Mendelian inheritance errors. This is not unusual in a data set from a dense SNP genotyping platform, when many thousands of SNPs have been genotyped in an automated way. If you had genotyped just a small number of SNPs yourself, you might be able to go back and check/correct these errors, but that is not possible with such a large number of SNPs.

In fact, this data has already been checked and the SNPs/families that gave high error rates have already been removed. Therefore we will not worry about the (relatively small) proportion of Mendelian inconsistencies remaining - MERLIN will ignore these `bad' SNP/family combinations when it does its linkage analysis.

To carry out a non-parametric linkage analysis using all the SNPs, type (all on one line)

merlin -p chr10-merlin-full-pedfile.txt -d chr10-merlin-full-datfile.txt -m chr10-merlin-fullmap.txt --pairs --exp --information --pdf

Take a look at the resulting PDF file. The top plot shows the `information content' - how much information for linkage analysis is provided by these particular SNPs. The bottom plot shows the results from the multipoint Kong and Cox Zlr (non-parametric linkage) test, performed at increments across the region.

Close the pdf file and rename it e.g. merlin-full.pdf.

Linkage analysis using 117 SNPs

In fact, it is not generally considered valid to use such a dense map of SNPs for linkage analysis, as they are likely to be in LD with one another, which can lead to false positive results. Try rerunning the analysis using the map file chr10-merlin-prunedmap.txt instead. This file has been `pruned' so SNPs with minor allele frequencies less than 0.4 and SNPs in strong LD with one another have been removed, resulting in only 117 SNPs remaining:

merlin -p chr10-merlin-full-pedfile.txt -d chr10-merlin-full-datfile.txt -m chr10-merlin-prunedmap.txt --pairs --exp --information --pdf

Do the results look very different? Have we lost much of the information content?

Close the pdf file and rename it e.g. merlin-pruned.pdf.

Linkage analysis using 52 and 97 SNPs

Try rerunning the analysis using each of the other two map files chr10-merlin-thinned1map.txt and chr10-merlin-thinned2map.txt instead. These have been `thinned' to only contain 1 or 2 SNPs per cM, respectively

Do the results look very different? Have we lost much of the information content? You should find that, for linkage analysis, 1 or 2 highly informative SNPs (i.e. those with high minor allele frequencies) per cM are sufficient to capture most of the information for linkage testing.

Generating pruned/thinned map files

We made the analysis above easy for you by preparing the required files, but usually you would have to do this yourself. Here is an overview of using the PLINK and MapThin programs for performing the required steps.

First you need to read your data into PLINK. PLINK takes standard MERLIN-format pedigree files, although it also takes pedigree files where the code '-9' (rather than 0) is used to signify unknown disease status. This is the code that PLINK uses when it writes out various data files.

Although PLINK can read in and write out standard pedigree files, it is usually more convenient to read in and write out files in PLINK's special binary format, which will take up less disk space and be quicker to read into PLINK when performing various subsequent analyses.

First we will read in the data to PLINK (which will require reading in a pedigree and a map file) and then we will write it out as a set of 3 binary format files, which capture the same information as is contained in the pedigree and map files:

plink --noweb --ped chr10-merlin-full-pedfile.txt --map chr10-plinkmap.txt --make-bed --out full

Here the --noweb command tells PLINK to run without bothering to check via the web to see whether there are updated versions of PLINK. The --ped xxxx command tells PLINK that the name of the first pedigree file is xxxx and the --map yyyy command tells PLINK that the name of the first map file is yyyy. The --make-bed command tells PLINK to output the resulting file in binary format and the --out full command tells PLINK to use the stem 'full' for the name of all its output files.

PLINK outputs some useful messages (you should always read these to make sure they match up with what you expect!) and also outputs a copy of these messages to the file full.log .

You should now find 4 new files in your directory: full.bed, full.bim, full.fam and full.log. The file full.bed is the binary genotype file which will not be human readable. The file full.bim is a map file with two extra columns of information giving the possible alleles at each locus. You can take a look at this (e.g. by typing more full.bim). The file full.fam gives the pedigree structure in a format that is compatible with the binary genotype file. You can take a look at this (e.g. by typing more full.fam). Note this file is the same as the first six columns of the original pedigree file, except that unknown disease status has been coded as '-9'.

For subsequent analysis of these data, the files full.bed, full.bim and full.fam will always need to be read into PLINK together, as together they provide all the required information. This can be done (when running PLINK) by using the command --bfile full.

Next we will read in the `full' files and write out a set of files that contain the same data but just for a subset of SNPs. We will ask PLINK to just include SNPs with minor allele frequency (MAF) > 0.4, SNPs that have been successfully genotyped in more than 95% of people (i.e. have less than 5% missing genotypes), and SNPs that do not fail a test of HWE at p value <0.00000001. The latter two of these are fairly standard QC checks. The minor allele frequency requirement is because we want to only include SNPs that will be highly informative for linkage analysis. (If we were performing association analysis we would probably use a MAF threshold of 0.01 or 0.05 instead of 0.4).

To do this step, we type:

plink --noweb --allow-no-sex --bfile full --maf 0.4 --geno 0.05 --hwe 0.00000001 --make-bed --out frequent

Take a look at the output files frequent.log, frequent.bim, frequent.fam. Can you understand what the information is encoded in these files, and whether/how the new files frequent.bed, frequent.bim, frequent.fam differ from the original files full.bed full.bim, full.fam?

Now we will `prune' to generate SNPs that are not in too much LD with one another. We have to do this in two steps. First we ask PLINK to generate the list of SNPs to keep:

plink --noweb --allow-no-sex --bfile frequent --indep 50 5 2 --out prunedsnplist

(The --indep 50 5 2 command actually tells PLINK how to decide which SNPs are in `too much' LD with one another. Don't worry too much about this option - if you are interested you can read more details on the PLINK web site).

Then we ask PLINK to just extract (keep) those SNPs from the SNPs that are in the `frequent' files:

plink --noweb --allow-no-sex --bfile frequent --extract prunedsnplist.prune.in --make-bed --out pruned

Take a look at the output files pruned.log, pruned.bim, pruned.fam. Can you understand what the information is encoded in these files?

In theory you could use the SNP map pruned.bim to perform the linkage analysis. However, you would first have to convert it to a MERLIN format map file. How would you do this? (E.g. using Excel?)

If we want to thin the map down even further, we can make use of the MapThin program. E.g. to thin the SNPs down to one SNP per cM, type

mapthin -t 1 pruned.bim thinned1.bim

This produces a PLINK-format `.bim' file containing approximately one SNP per cM. Again, to use it in MERLIN, you would first have to convert it to a MERLIN format map file.

If you have time, try to create MERLIN-format map files corresponding to pruned.bim and thinned1.bim, and re-run the MERLIN analysis with these files you created.

Program documentation

PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:

http://pngu.mgh.harvard.edu/~purcell/plink/

The MapThin program can be found at:

http://www.staff.ncl.ac.uk/richard.howey/mapthin/

The MERLIN website is at:

http://www.sph.umich.edu/csg/abecasis/Merlin/index.html

Exercises prepared by: Heather Cordell
Checked by:
Programs used: MERLIN, PLINK, MapThin
Last updated: