Computer Practical Exercises on Non-parametric and Parametric Linkage Analysis using the Merlin program

Introduction

In this practical we will investigate methods for positioning a disease locus on a known map of marker loci, using information from all the linked markers simultaneously. In these methods we assume that we know the genetic distances (and hence recombination fractions) between the markers. For parametric linkage analysis, we will also assume we know the underlying disease model (e.g. recessive, dominant etc). We fix a hypothetical position for the disease locus and calculate the overall likelihood for the disease and marker data, assuming the disease locus position is correct. We then repeat the analysis with the disease locus positioned at different locations in relation to the known markers. In this way we construct a multipoint LOD score curve across the region: the position where the LOD score is maximum is the best estimate of the disease locus location.
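
The two ideas in the paragraph above (turning map distances into recombination fractions, and picking the peak of a LOD score curve) can be sketched in a few lines of Python. This is only an illustration: it assumes the Haldane map function (no interference), which may not be the one used to build your map file, and the curve values and the helper names haldane_theta, lod_score and best_position are made up for the example.

```python
import math


def haldane_theta(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane, assumed)."""
    d_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def lod_score(lik_at_position, lik_unlinked):
    """log10 likelihood ratio: disease locus at the test position vs unlinked."""
    return math.log10(lik_at_position / lik_unlinked)


def best_position(lod_curve):
    """Return the (position, LOD) pair with the largest LOD score."""
    return max(lod_curve, key=lambda item: item[1])


# Hypothetical multipoint curve: (position in cM, LOD score)
curve = [(0.0, -0.4), (5.0, 0.9), (10.0, 1.6), (15.0, 1.1)]

print(round(haldane_theta(10.0), 4))  # recombination fraction for 10 cM
print(best_position(curve))           # best estimate of the disease locus
```

In a real analysis the likelihoods come from the linkage program itself; the sketch only shows how the pieces fit together.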

For non-parametric analysis we use a similar idea, but we calculate test statistics based on the observed (or estimated) IBD sharing by pairs or groups of affected relatives at each test position.
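
To make the idea of an IBD-sharing test statistic concrete, here is a minimal Python sketch for the simplest case: one affected sib pair per family, with IBD sharing known exactly. Under no linkage a sib pair shares 0, 1 or 2 alleles IBD with probabilities 1/4, 1/2, 1/4. The equal weighting of families, the function names and the example IBD counts are assumptions for illustration, and this is not a description of how any particular program computes its statistic on general pedigrees.

```python
import math

# Null distribution of the number of alleles shared IBD by a full-sib pair
NULL_IBD_PROBS = {0: 0.25, 1: 0.5, 2: 0.25}


def null_mean_sd(probs=NULL_IBD_PROBS):
    """Mean and standard deviation of IBD sharing under no linkage."""
    mean = sum(k * p for k, p in probs.items())
    var = sum((k - mean) ** 2 * p for k, p in probs.items())
    return mean, math.sqrt(var)


def family_z(ibd_shared):
    """Standardised sharing score for one affected sib pair."""
    mean, sd = null_mean_sd()
    return (ibd_shared - mean) / sd


def overall_z(ibd_counts):
    """Combine family scores with equal weights (an assumed weighting)."""
    scores = [family_z(k) for k in ibd_counts]
    return sum(scores) / math.sqrt(len(scores))


# Hypothetical IBD counts for four affected sib pairs at one test position
print(overall_z([2, 2, 1, 2]))
```

A large positive value indicates more sharing among affected relatives than expected by chance.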

Data overview

We will begin by analysing 4 families which are believed to be segregating for a recessive disease locus. The families are typed at 3 linked marker loci which we shall call markers 2, 3 and 4. You should already have downloaded the required data and program files onto your H: drive.

Step-by-step instructions

1. First, open an MSDOS window. Once the window has opened, type dir to list the files and directories (folders) in your home space, then move into the directory where you saved the data files, e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.

A useful way to look at the files from within the MSDOS window (i.e. without having to open them through the Windows system) is to use the 'more' command. For example, type

more pedfile1.txt

to look at the pedigree file. (Hit the space bar to continue scrolling through a long file, or else type q to quit scrolling).

2. Use the Pedstats program to check your data by typing

pedstats -p pedfile2.txt -d datfile.txt

(This command tells the program to read in the pedigree file specified after -p and the locus datafile specified after -d.)

3. Check the output on the screen. Do you see any error messages? The program will output lots of information regarding things like which files you read in, which analysis options you chose, how many pedigrees (families) you read in and their sizes, how many markers you read in and their heterozygosities. Check you (roughly) understand this output, and that it matches what you are expecting.

4. Use the Merlin program to perform a non-parametric linkage analysis on the pedigree file pedfile2.txt by typing

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp

(This command tells Merlin which pedigree file, locus datafile, map file and allele frequency file to use via the -p, -d, -m and -f options. The options --pairs and --exp tell Merlin to use the S_pairs scoring function and to calculate the Kong and Cox "exponential model" Zlr statistic as well as the NPL statistic.)

Take a look at the output on the screen. Ignore the first 2 lines of results marked "min" and "max". Below this, at each of the 3 marker loci you should see an NPL statistic (called Zmean) and p value, then 3 columns of results from the Kong and Cox "linear model" and 3 columns of results from the Kong and Cox "exponential model". The most important things are the last two columns from the exponential model, marked LOD and p value.

Rather than outputting the normally distributed Zlr statistic, Merlin outputs an equivalent LOD score. It also outputs (in the column before the LOD score) an estimate of the sharing parameter delta. Large positive values of delta indicate excess IBD sharing by affected individuals in the same pedigree. (Negative values of delta indicate less IBD sharing than expected by chance.) If there is no excess sharing, i.e. under the null hypothesis, delta should be 0.
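
If you want to move between the LOD score and the Zlr statistic yourself, the usual Kong and Cox relationship is Zlr = sign(delta) * sqrt(2 * ln(10) * LOD). The Python sketch below applies it and converts the result to a one-sided normal p value. The function names are invented for this example, and taking the sign from delta and using a one-sided p value are assumptions that may not match exactly how Merlin reports its output.

```python
import math


def lod_to_zlr(lod, delta):
    """Zlr implied by a Kong and Cox LOD score; sign taken from delta (assumed)."""
    z = math.sqrt(2.0 * math.log(10.0) * abs(lod))
    return math.copysign(z, delta)


def zlr_to_lod(zlr):
    """Unsigned LOD score equivalent to a given Zlr statistic."""
    return zlr ** 2 / (2.0 * math.log(10.0))


def one_sided_p(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical values: LOD score of 3 with a positive sharing parameter
z = lod_to_zlr(3.0, delta=0.5)
print(z, one_sided_p(z))
```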

5. It looks as if there may be some evidence for linkage in this region. To do a multipoint analysis at increments between loci (as well as at the marker loci themselves), type

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp --steps 10

You should see results at 10 increments between loci, as well as at the loci themselves. To produce a nice plot of these results, type

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp --steps 10 --pdf

Take a look at the file merlin.pdf that should have been created. Do you think there is a disease locus in the region, and if so, where do you think it is located?

6. Close the pdf file. Now try analysing the data using the S_all scoring function rather than the S_pairs scoring function, by changing the --pairs option to --npl, i.e. by typing:

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --npl --exp --steps 10 --pdf

Take a look at the file merlin.pdf that should have been created. Do you still think there is a disease locus in the region, and if so, where do you think it is located?

8. We will now use Merlin to perform a parametric analysis, assuming a recessive model. To do this, we need to download another file, recessive-model.txt. This file has 4 columns giving the disease name (which has to match the name in the locus datafile), the disease (D) allele frequency, the penetrances for genotypes dd, dD and DD, and a model name. Take a look at the file and check you understand it.

9. To run the analysis using this model, type

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --model recessive-model.txt --steps 10 --pdf

Do you still think there is a disease locus in the region, and if so, where do you think it is located?

10. What would happen for the parametric analysis if we assumed a dominant instead of a recessive model? Download the file dominant-model.txt. Take a look at the file to check you understand how it is coded, and re-run the parametric linkage analysis using this file instead. What has happened to the evidence for linkage? You should see that assuming the wrong model can have a big impact on the power to detect linkage using parametric linkage analysis.
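
To check your understanding of how the recessive and dominant models are coded, the following Python sketch reads a four-column model line and works out the population prevalence it implies under Hardy-Weinberg proportions. The two example lines are hypothetical and are not the contents of recessive-model.txt or dominant-model.txt. The sketch also assumes the three penetrances are written as a single comma-separated column in the order dd, dD, DD, so compare this with the files you downloaded.

```python
def parse_model_line(line):
    """Split a model line: disease name, D allele frequency, penetrances, label."""
    name, freq, penetrances, label = line.split()
    f_dd, f_dD, f_DD = (float(x) for x in penetrances.split(","))
    return {
        "disease": name,
        "freq": float(freq),
        "penetrances": (f_dd, f_dD, f_DD),
        "label": label,
    }


def prevalence(model):
    """Probability a random individual is affected, assuming Hardy-Weinberg."""
    q = model["freq"]          # frequency of the disease allele D
    p = 1.0 - q                # frequency of the normal allele d
    genotype_probs = (p * p, 2.0 * p * q, q * q)  # dd, dD, DD
    return sum(g * f for g, f in zip(genotype_probs, model["penetrances"]))


# Hypothetical model lines, for illustration only
recessive = parse_model_line("DISEASE 0.01 0.0,0.0,1.0 Recessive_Model")
dominant = parse_model_line("DISEASE 0.01 0.0,1.0,1.0 Dominant_Model")

print(prevalence(recessive), prevalence(dominant))
```

The only difference between the two lines is the penetrance for the dD genotype, which is what changes the pattern of inheritance the analysis expects to see in the families.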

Merlin documentation

Merlin documentation is available here: http://www.sph.umich.edu/csg/abecasis/Merlin/index.html