Linkage analysis of dense SNP sets using MERLIN

Overview

Purpose

In this exercise you will be analysing (a subset of) data generated from a dense SNP genotyping platform. Although these genotyping technologies were developed primarily for genome-wide association studies (GWAS), they can also be used to perform linkage analysis, provided that the SNPs included in the analysis are selected appropriately.

Methodology

The methodology is standard non-parametric linkage analysis, as used in the previous exercise.

Programs and Data

First, I suggest you make a new folder in which to keep the programs and data files for this exercise. Then download the following files (if you do not already have them) into this folder:

merlin.exe
pedstats.exe

chr10-merlin-full-pedfile.txt
chr10-merlin-full-datfile.txt
chr10-merlin-fullmap.txt
chr10-merlin-prunedmap.txt
chr10-merlin-thinned1map.txt
chr10-merlin-thinned2map.txt


Exercise

Data overview

The data set consists of genotypes at 1601 SNPs from a 30 Mb region on chromosome 10, genotyped in 320 families consisting mostly of affected sib pairs and their parents.


Instructions

Data format

Take a look at the data files (e.g. using the `more' command from the MSDOS window, or by opening them in WordPad) and check you understand how they are coded. Note that the first three data files listed above (the pedfile, the datfile and the full map file) are standard MERLIN format files. The remaining three files are map files that contain fewer loci than the full map. If you use one of these map files (in conjunction with the full MERLIN pedfile and datfile) then MERLIN will ignore any loci (SNPs) that are not included in that map file.
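To see which SNPs a reduced map file would cause MERLIN to ignore, the marker lists of two map files can be compared. The following is a minimal Python sketch, assuming each map file has three whitespace-separated columns (chromosome, marker name, position) and possibly a header line. That layout, the function names and the marker names in the example are assumptions made for illustration, so check them against the actual files.

```python
def read_map_markers(lines):
    """Return the marker names from the lines of a MERLIN-style map file.

    Assumes three whitespace-separated columns: chromosome, marker, position.
    A header line (whose third field is not a number) is skipped.
    """
    markers = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        try:
            float(fields[2])
        except ValueError:
            continue  # header line such as "CHROMOSOME MARKER POSITION"
        markers.append(fields[1])
    return markers


def ignored_markers(full_map_lines, reduced_map_lines):
    """Markers present in the full map but absent from the reduced map."""
    kept = set(read_map_markers(reduced_map_lines))
    return [m for m in read_map_markers(full_map_lines) if m not in kept]


# Hypothetical example content (not taken from the exercise files).
full_map = ["CHROMOSOME MARKER POSITION",
            "10 snpA 50.0", "10 snpB 50.2", "10 snpC 51.1"]
reduced_map = ["CHROMOSOME MARKER POSITION",
               "10 snpA 50.0", "10 snpC 51.1"]
print(ignored_markers(full_map, reduced_map))
```

With the real files, an open file can be passed instead of a list, e.g. read_map_markers(open("chr10-merlin-fullmap.txt")), which should return 1601 names if the assumed layout holds.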

Step-by-step instructions

Linkage analysis using all 1601 SNPs

To start with, you will need to open up an MSDOS window. [To do this, click on Start (the round button on the bottom left), All Programs, Accessories, then click on Command Prompt].

Once the window has opened, type dir to see all the files and directories (folders) that are in your home space, and move into the directory where you saved the data files e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.



First use the pedstats program to check if there are any inconsistencies with your data:

pedstats -p chr10-merlin-full-pedfile.txt -d chr10-merlin-full-datfile.txt

You should see a lot of error messages about Mendelian inheritance errors. This is not unusual in a data set from a dense SNP genotyping platform, when many thousands of SNPs have been genotyped in an automated way. If you had genotyped just a small number of SNPs yourself, you might be able to go back and check/correct these errors, but that is not possible with such a large number of SNPs.

In fact, this data has already been checked and the SNPs/families that gave high error rates have already been removed. Therefore we will not worry about the (relatively small) proportion of Mendelian inconsistencies remaining - MERLIN will ignore these `bad' SNP/family combinations when it does its linkage analysis.
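To make the idea of a Mendelian inconsistency concrete, here is a minimal Python sketch that checks a single SNP in a single father-mother-child trio. It is not how PEDSTATS or MERLIN perform their checks (they work on whole pedigrees); the function name and the use of 0 as a missing-allele code are assumptions made for illustration.

```python
def is_mendelian_consistent(father, mother, child):
    """Check one SNP in one father-mother-child trio.

    Each genotype is a pair of allele codes, e.g. (1, 2).
    An allele code of 0 is taken to mean 'missing' (an assumption).
    """
    if 0 in father or 0 in mother or 0 in child:
        # Trios with missing genotypes are not checked in this simple sketch.
        return True
    a, b = child
    # The child must receive one allele from each parent.
    return (a in father and b in mother) or (b in father and a in mother)


# Both parents are 1/1, so a 1/2 child is inconsistent.
print(is_mendelian_consistent((1, 1), (1, 1), (1, 2)))  # False
# A 2/2 child of 1/2 and 2/2 parents is consistent.
print(is_mendelian_consistent((1, 2), (2, 2), (2, 2)))  # True
```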

To carry out a non-parametric linkage analysis using all the SNPs, type (all on one line)

merlin -p chr10-merlin-full-pedfile.txt -d chr10-merlin-full-datfile.txt -m chr10-merlin-fullmap.txt --pairs --exp --information --pdf

Take a look at the resulting PDF file. The top plot shows the `information content' - how much information for linkage analysis is provided by these particular SNPs. The bottom plot shows the results from the multipoint Kong and Cox Zlr (non-parametric linkage) test, performed at increments across the region.

Close the PDF file and rename it, e.g. to merlin-full.pdf, so that it is not overwritten when you next run MERLIN.
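Since the same command is rerun later with different map files, the run-and-rename step could be scripted. The following is a rough Python sketch, assuming that merlin can be found from the current folder and that its PDF output is written to a file called merlin.pdf; both are assumptions to check on your own system, and the function names are hypothetical.

```python
import os
import subprocess

PEDFILE = "chr10-merlin-full-pedfile.txt"
DATFILE = "chr10-merlin-full-datfile.txt"


def merlin_command(map_file):
    """Build the MERLIN command used in this exercise for a given map file."""
    return ["merlin", "-p", PEDFILE, "-d", DATFILE, "-m", map_file,
            "--pairs", "--exp", "--information", "--pdf"]


def run_and_keep_pdf(map_file, pdf_name, default_pdf="merlin.pdf"):
    """Run MERLIN, then rename its PDF so the next run does not overwrite it.

    default_pdf is the assumed name of the PDF that MERLIN writes.
    """
    subprocess.run(merlin_command(map_file), check=True)
    os.replace(default_pdf, pdf_name)


# Show the command without running it.
print(" ".join(merlin_command("chr10-merlin-fullmap.txt")))
```

For example, run_and_keep_pdf("chr10-merlin-fullmap.txt", "merlin-full.pdf") would carry out the analysis above and rename the resulting plot in one step.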



Linkage analysis using 117 SNPs

In fact, it is not generally considered valid to use such a dense map of SNPs for linkage analysis, because neighbouring SNPs are likely to be in linkage disequilibrium (LD) with one another, which can lead to false positive results. Try rerunning the analysis using the map file chr10-merlin-prunedmap.txt instead. This file has been `pruned' so that SNPs with minor allele frequencies less than 0.4 and SNPs in strong LD with one another have been removed, leaving only 117 SNPs:

merlin -p chr10-merlin-full-pedfile.txt -d chr10-merlin-full-datfile.txt -m chr10-merlin-prunedmap.txt --pairs --exp --information --pdf

Do the results look very different? Have we lost much of the information content?

Close the PDF file and rename it, e.g. to merlin-pruned.pdf.
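The kind of pruning described in this section can be sketched in a few lines of Python. This is only an illustration and not the procedure that was used to create chr10-merlin-prunedmap.txt. It assumes genotypes are coded as allele counts (0, 1 or 2, with None for missing), it measures LD by the squared correlation between those counts, and the LD threshold of 0.2 is an arbitrary assumed value (the text above only says `strong LD'). All names and data are hypothetical.

```python
def minor_allele_frequency(dosages):
    """Dosages are copies (0, 1 or 2) of one allele per person; None = missing."""
    observed = [d for d in dosages if d is not None]
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


def r_squared(dos_a, dos_b):
    """Squared correlation between two SNPs' dosages, used here as an LD proxy."""
    pairs = [(a, b) for a, b in zip(dos_a, dos_b)
             if a is not None and b is not None]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    s_aa = sum((a - mean_a) ** 2 for a, _ in pairs)
    s_bb = sum((b - mean_b) ** 2 for _, b in pairs)
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    if s_aa == 0 or s_bb == 0:
        return 0.0  # a SNP with no variation carries no LD information
    return s_ab ** 2 / (s_aa * s_bb)


def prune(snps, min_maf=0.4, max_r2=0.2):
    """Greedy pruning of (name, dosages) pairs given in map order.

    min_maf=0.4 follows the text; max_r2=0.2 is an assumed threshold.
    """
    kept = []
    for name, dosages in snps:
        if minor_allele_frequency(dosages) < min_maf:
            continue  # too uninformative
        if any(r_squared(dosages, kept_dos) > max_r2 for _, kept_dos in kept):
            continue  # in LD with a SNP that has already been kept
        kept.append((name, dosages))
    return [name for name, _ in kept]


# Hypothetical genotype data for eight people at four SNPs.
snps = [
    ("snp1", [0, 1, 2, 1, 0, 2, 1, 1]),
    ("snp2", [0, 1, 2, 1, 0, 2, 1, 1]),  # identical to snp1, so in perfect LD
    ("snp3", [0, 0, 0, 1, 0, 0, 0, 0]),  # low minor allele frequency
    ("snp4", [1, 2, 1, 0, 1, 0, 2, 1]),
]
print(prune(snps))
```

Here snp2 is dropped because it is in perfect LD with snp1, and snp3 is dropped because its minor allele frequency is below 0.4.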



Linkage analysis using 52 and 97 SNPs

Try rerunning the analysis using each of the other two map files, chr10-merlin-thinned1map.txt and chr10-merlin-thinned2map.txt, in place of the pruned map file. These have been `thinned' to contain only 1 or 2 SNPs per cM, respectively.
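One simple way to thin a map to a fixed number of SNPs per cM is to divide the region into 1 cM intervals and keep the most informative SNPs in each. The Python sketch below illustrates this under stated assumptions: it is not necessarily how the thinned map files for this exercise were produced, it ranks SNPs by minor allele frequency, and the marker names, positions and frequencies are hypothetical.

```python
import math
from collections import defaultdict


def thin_map(markers, per_cm=1):
    """Keep at most per_cm SNPs in each 1 cM interval.

    markers is a list of (name, position_in_cM, minor_allele_frequency).
    Within each interval the SNPs with the highest frequency are kept.
    """
    bins = defaultdict(list)
    for name, pos, maf in markers:
        bins[math.floor(pos)].append((name, pos, maf))
    kept = []
    for interval in sorted(bins):
        best = sorted(bins[interval], key=lambda m: m[2], reverse=True)[:per_cm]
        kept.extend(sorted(best, key=lambda m: m[1]))  # restore map order
    return [name for name, _, _ in kept]


# Hypothetical markers: three in the interval 10-11 cM, one in 11-12 cM.
markers = [("rsA", 10.1, 0.45), ("rsB", 10.5, 0.30),
           ("rsC", 10.9, 0.48), ("rsD", 11.2, 0.41)]
print(thin_map(markers, per_cm=1))  # ['rsC', 'rsD']
print(thin_map(markers, per_cm=2))  # ['rsA', 'rsC', 'rsD']
```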

Do the results look very different? Have we lost much of the information content? You should find that, for linkage analysis, 1 or 2 highly informative SNPs (i.e. those with high minor allele frequencies) per cM are sufficient to capture most of the information for linkage testing.
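One way to see why SNPs with high minor allele frequencies are the most informative is to look at the expected proportion of heterozygous individuals, since only heterozygous parents allow the transmission of alleles to be traced. The short Python sketch below computes this proportion for a SNP with two alleles, assuming Hardy-Weinberg proportions; the function name is hypothetical.

```python
def expected_heterozygosity(maf):
    """Expected proportion of heterozygotes at a two-allele SNP,
    assuming Hardy-Weinberg proportions: 2 * p * (1 - p)."""
    return 2 * maf * (1 - maf)


for maf in (0.1, 0.25, 0.4, 0.5):
    print(maf, round(expected_heterozygosity(maf), 3))
```

The proportion rises from 0.18 at a minor allele frequency of 0.1 to its maximum of 0.5 at a frequency of 0.5, and is already 0.48 at the cut-off of 0.4 used for the pruned map.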



We made this analysis easy for you by preparing the required input files, but usually you would have to do this yourself. Next week we will learn how to use the PLINK and MapThin programs to create these files.

Program documentation

The MERLIN website is at:

http://www.sph.umich.edu/csg/abecasis/Merlin/index.html


Exercises prepared by: Heather Cordell
Checked by:
Programs used: MERLIN, PEDSTATS
Last updated: