Computer Practical Exercises on Non-Parametric and Parametric Linkage Analysis using the Merlin program

Introduction

In this practical we will investigate methods for positioning a disease locus on a known map of marker loci, using information from all the linked markers simultaneously.

In these methods we assume that we know the genetic distances (and hence recombination fractions) between the markers. For parametric linkage analysis, we will also assume we know the underlying disease model (e.g. recessive, dominant etc). We fix a position for the disease locus and calculate the overall likelihood for the disease and marker data, assuming the disease locus position is correct. We then repeat the analysis with the disease locus positioned at different locations in relation to the known markers. In this way we construct a multipoint LOD score curve across the region: the position where the LOD score is maximum is the best estimate of the disease locus location.
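The link between map distance and recombination fraction, and the idea of taking the peak of the LOD score curve as the position estimate, can be sketched in Python. This is only an illustration: it assumes Haldane's map function (no interference), and the positions and LOD values below are made-up numbers, not results from the data in this practical.

```python
import math

def haldane_theta(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane's map function)."""
    morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * morgans))

def best_position(positions_cm, lod_scores):
    """Return the (position, LOD) pair at which the LOD curve is highest."""
    return max(zip(positions_cm, lod_scores), key=lambda pair: pair[1])

# Hypothetical LOD curve, for illustration only (not Merlin output)
positions = [0.0, 5.0, 10.0, 15.0, 20.0]
lods = [0.4, 1.1, 1.9, 1.3, 0.2]

print(round(haldane_theta(10.0), 4))   # recombination fraction at 10 cM
print(best_position(positions, lods))  # position with the maximum LOD
```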

Data overview

We will begin by analysing 4 families which are believed to be segregating for a recessive disease locus. The families are typed at 3 linked marker loci which we shall call markers 2, 3 and 4. The pedigree data is in the file pedfile2.txt. You should have previously downloaded this file onto your computer.

Take a look at the pedigree file. Each line gives the data for a single person. Data is ordered in columns corresponding to family, id (within family), id of father, id of mother, sex (male=1, female=2), affection status (1=unaffected, 2=affected), and genetic data (3 loci, each with 2 alleles). A zero indicates missing or unknown data.
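As a rough check on your reading of the column layout, here is a minimal Python sketch that splits one pedigree line into the fields described above. The example line is invented for illustration and is not taken from pedfile2.txt; the names Person and parse_pedigree_line are hypothetical helpers, and the sketch assumes the two alleles of each marker are in separate whitespace-delimited columns.

```python
from typing import List, NamedTuple, Tuple

class Person(NamedTuple):
    family: str
    person_id: str
    father: str
    mother: str
    sex: int        # 1 = male, 2 = female
    affection: int  # 1 = unaffected, 2 = affected, 0 = unknown
    genotypes: List[Tuple[str, str]]

def parse_pedigree_line(line, n_markers=3):
    """Split one pedigree line into its 6 fixed fields plus 2 alleles per marker."""
    fields = line.split()
    expected = 6 + 2 * n_markers
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    alleles = fields[6:]
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return Person(fields[0], fields[1], fields[2], fields[3],
                  int(fields[4]), int(fields[5]), genotypes)

# Hypothetical line: family 1, person 3, father 1, mother 2, female, affected,
# with the second marker untyped (0 0)
example = "1 3 1 2 2 2 1 2 0 0 3 3"
person = parse_pedigree_line(example)
print(person)
```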

To perform the analysis in Merlin, we need an additional file sometimes called the "locus datafile": datfile.txt

This file gives information about the different loci in the pedigree file (the markers (M) and the assumed disease or affection locus (A)) and their order in the pedigree file. Take a look at this file and check you understand how this information is coded.
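One way to see how the locus datafile relates to the pedigree file is to count the columns it implies. The Python sketch below assumes each datafile line holds a type code (A or M) followed by a locus name; the locus names shown are invented and may differ from those in datfile.txt.

```python
# Pedigree columns contributed by each locus type:
# an affection status takes 1 column, a marker takes 2 (one per allele).
COLUMNS_PER_TYPE = {"A": 1, "M": 2}

def read_loci(text):
    """Return a list of (type_code, name) pairs from locus datafile text."""
    loci = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        loci.append((fields[0].upper(), fields[1]))
    return loci

def expected_columns(loci):
    """5 fixed columns (family, id, father, mother, sex) plus the locus columns."""
    return 5 + sum(COLUMNS_PER_TYPE[code] for code, _ in loci)

# Hypothetical datafile contents, for illustration only
example_datfile = """A disease
M marker2
M marker3
M marker4
"""

loci = read_loci(example_datfile)
print(loci)
print(expected_columns(loci))
```

With one affection locus and three markers this gives the same number of columns as the pedigree file layout described earlier.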

We also need a file that gives the genetic map positions (in cM) of the loci: mapfile.txt

and a file giving the allele frequencies at the different loci: freqfile.txt

Take a look at these files and check you understand how this information is coded.
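Two simple sanity checks on these files are that the map positions increase along the chromosome and that the allele frequencies at each marker sum to 1. The Python sketch below assumes a map file with chromosome, marker name and position on each line, and a frequency file in which an "M name" line is followed by an "F" line of frequencies. Both the layouts and the values shown are assumptions made for illustration, so compare them with the actual files before relying on this.

```python
def parse_map(text):
    """Return [(marker, position_cM)] from 'chromosome marker position' lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].upper().startswith("CHR"):
            continue  # skip blank lines and an optional header line
        entries.append((fields[1], float(fields[2])))
    return entries

def parse_freqs(text):
    """Return {marker: [allele frequencies]} from 'M name' / 'F ...' lines."""
    freqs = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].upper() == "M" and len(fields) >= 2:
            current = fields[1]
            freqs[current] = []
        elif fields[0].upper() == "F" and current is not None:
            freqs[current].extend(float(x) for x in fields[1:])
    return freqs

# Hypothetical file contents, for illustration only
map_text = "CHROMOSOME MARKER POSITION\n1 marker2 0.0\n1 marker3 10.0\n1 marker4 20.0\n"
freq_text = "M marker2\nF 0.5 0.5\nM marker3\nF 0.25 0.25 0.5\nM marker4\nF 0.1 0.9\n"

marker_map = parse_map(map_text)
freqs = parse_freqs(freq_text)

positions = [pos for _, pos in marker_map]
map_is_ordered = positions == sorted(positions)
freqs_sum_to_one = all(abs(sum(f) - 1.0) < 1e-6 for f in freqs.values())
print(map_is_ordered, freqs_sum_to_one)
```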

You should have already downloaded these files into an appropriate folder on your computer. If you have not, make a new directory (folder) and save the above files in it.

Copies of the Merlin and Pedstats programs should be saved in the same directory. If you have not already done this, you can get them here:
merlin.exe
pedstats.exe

Step-by-step instructions

1. First, open an MS-DOS (Command Prompt) window. To do this, click on Start, then Run, and type cmd. Vista users may need to click Start (the round button on the bottom left), All Programs, Accessories, then Command Prompt.

Once the window has opened, type dir to see all the files and directories (folders) in your home space, then move into the directory where you saved the data files, e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.

2. Use the Pedstats program to check your data by typing

pedstats -p pedfile2.txt -d datfile.txt

(This command tells the program to read in the pedigree file specified after -p and the locus datafile specified after -d.)

3. Check the output on the screen. Do you see any error messages? The program will output lots of information regarding things like which files you read in, which analysis options you chose, how many pedigrees (families) you read in and their sizes, how many markers you read in and their heterozygosities. Check you (roughly) understand this output.

4. Use the Merlin program to perform a non-parametric linkage analysis on the pedigree file pedfile2.txt by typing

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp

(This command tells Merlin which pedigree file, locus datafile, map file and allele frequency file to use via the -p, -d, -m and -f options. The options --pairs and --exp tell Merlin to use the S_pairs scoring function and to calculate the Kong and Cox "exponential model" Zlr statistic as well as the NPL statistic.)

Take a look at the output on the screen. Ignore the first 2 lines of results marked "min" and "max". Below this, at each of the 3 marker loci you should see an NPL statistic (called Zmean) and p value, then 3 columns of results from the Kong and Cox "linear model" and 3 columns of results from the Kong and Cox "exponential model". The most important things are the last two columns from the exponential model, marked LOD and p value.

Rather than outputting the normally distributed Zlr statistic, Merlin outputs an equivalent LOD score. It also outputs (in the column before the LOD score) an estimate of the sharing parameter delta. Large positive values of delta indicate excess IBD sharing by affected individuals in the same pedigree. (Negative values of delta indicate less IBD sharing than expected.) If there is no excess sharing, i.e. under the null hypothesis, delta should be 0.
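The equivalence between the Zlr statistic and the LOD score can be sketched as follows, using the standard Kong and Cox relation Zlr = sqrt(2 ln(10) LOD) for a non-negative LOD, with a one-sided p value from the standard normal distribution. This is an illustrative calculation rather than Merlin's own code, and the function names are hypothetical.

```python
import math

def zlr_from_lod(lod):
    """Zlr statistic equivalent to a non-negative Kong and Cox LOD score."""
    if lod < 0:
        raise ValueError("this sketch only handles non-negative LOD scores")
    return math.sqrt(2.0 * math.log(10.0) * lod)

def one_sided_p(zlr):
    """Upper-tail probability of a standard normal at zlr."""
    return 0.5 * math.erfc(zlr / math.sqrt(2.0))

z = zlr_from_lod(3.0)
print(round(z, 2), one_sided_p(z))
```

For example, a LOD score of 3 corresponds to a Zlr of about 3.7 and a pointwise p value of about 0.0001.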

5. It looks as if there may be some evidence for linkage in this region. To do a multipoint analysis at increments between loci (as well as at the marker loci themselves), type

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp --steps 10

You should see results at 10 increments between loci, as well as at the loci themselves. To produce a nice plot of these results, type

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp --steps 10 --pdf

Take a look at the file merlin.pdf that should have been created. Do you think there is a disease locus in the region, and if so, where do you think it is located?

6. Close the pdf file. Now try analysing the data using the S_all scoring rather than the S_pairs scoring, by changing the --pairs option to --npl i.e. by typing:

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --npl --exp --steps 10 --pdf

Take a look at the file merlin.pdf that should have been created. Do you still think there is a disease locus in the region, and if so, where do you think it is located?
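The Merlin commands in steps 4 to 6 differ only in the scoring option and in whether --steps and --pdf are added. If you prefer to script the runs rather than retype them, a minimal Python sketch is given below. It uses only the options shown in this practical; the helper names merlin_command and run_if_available are hypothetical, and it assumes the merlin program can be found on your PATH or in the current directory.

```python
import shutil
import subprocess

def merlin_command(scoring="--pairs", steps=None, pdf=False):
    """Build the Merlin argument list for the non-parametric analyses."""
    if scoring not in ("--pairs", "--npl"):
        raise ValueError("scoring must be '--pairs' or '--npl'")
    cmd = ["merlin", "-p", "pedfile2.txt", "-d", "datfile.txt",
           "-m", "mapfile.txt", "-f", "freqfile.txt", scoring, "--exp"]
    if steps is not None:
        cmd += ["--steps", str(steps)]
    if pdf:
        cmd.append("--pdf")
    return cmd

def run_if_available(cmd):
    """Run the command if the program can be found; otherwise just print it."""
    if shutil.which(cmd[0]) is None:
        print("Program not found, would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=False)

# The three variants used in steps 4 to 6
print(" ".join(merlin_command()))
print(" ".join(merlin_command(steps=10, pdf=True)))
print(" ".join(merlin_command(scoring="--npl", steps=10, pdf=True)))
```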

8. We will now use Merlin to perform a parametric analysis, assuming a recessive model. To do this, we need to download another file: recessive-model.txt. This file has 4 columns giving the disease name (which has to match up with the name in the locus datafile), the disease (D) allele frequency, the penetrances for genotypes dd, dD, DD and a model name (which you can choose). Take a look at the file and check you understand it.
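To check your understanding of the four columns, the following Python sketch parses one model line and reports the inheritance pattern and the disease prevalence implied under Hardy-Weinberg proportions. The example line is invented (it is not the content of recessive-model.txt), and the sketch assumes the three penetrances are written as a single comma-separated field in the order dd, dD, DD.

```python
def parse_model_line(line):
    """Split a model line into disease name, D allele frequency, penetrances, label."""
    name, freq, penetrances, label = line.split()
    f_dd, f_dD, f_DD = (float(x) for x in penetrances.split(","))
    return {"disease": name, "freq": float(freq),
            "penetrances": (f_dd, f_dD, f_DD), "label": label}

def inheritance_pattern(penetrances):
    """Classify a model from its penetrances for 0, 1 and 2 copies of D."""
    f0, f1, f2 = penetrances
    if f1 == f0 and f2 > f1:
        return "recessive"   # only DD individuals are at raised risk
    if f1 == f2 and f1 > f0:
        return "dominant"    # one copy of D is enough
    return "other"

def prevalence(freq, penetrances):
    """Population prevalence assuming Hardy-Weinberg genotype proportions."""
    q = freq
    p = 1.0 - q
    f0, f1, f2 = penetrances
    return p * p * f0 + 2.0 * p * q * f1 + q * q * f2

# Hypothetical model line, for illustration only
model = parse_model_line("disease 0.01 0.0,0.0,1.0 Recessive_model")
print(inheritance_pattern(model["penetrances"]))
print(prevalence(model["freq"], model["penetrances"]))
```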

9. To run the analysis using this model, type

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --model recessive-model.txt --steps 10 --pdf

Do you still think there is a disease locus in the region, and if so, where do you think it is located?

10. What would happen in the parametric analysis if we assumed a dominant instead of a recessive model? Download the file dominant-model.txt. Take a look at the file to check you understand how it is coded, and re-run the parametric linkage analysis using this file instead. What has happened to the evidence for linkage? You should see that assuming the wrong model can have a big impact on the power to detect linkage using parametric linkage analysis.

Merlin documentation:

Merlin documentation is available here: http://www.sph.umich.edu/csg/abecasis/Merlin/index.html