Computer Practical Exercises on Non-parametric and Parametric Linkage Analysis using the Merlin program

Introduction

In this practical we will investigate methods for positioning a disease locus on a known map of marker loci, using information from all the linked markers simultaneously. These methods assume that we know the genetic distances (and hence the recombination fractions) between the markers, and that we know the underlying disease model (e.g. recessive or dominant). We fix a position for the disease locus and calculate the overall likelihood of the disease and marker data, assuming that position is correct. We then repeat the analysis with the disease locus placed at different positions relative to the known markers. In this way we construct a multipoint LOD score curve across the region: the position at which the LOD score is highest is the best estimate of the disease locus location.
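
Genetic distance and recombination fraction are related through a map function. As a minimal sketch, assuming Haldane's map function (no crossover interference), which is an assumption of this example and not something stated in this practical, the conversion could be written in Python as follows. The function names are illustrative only.

```python
import math


def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM (Haldane, assumed)."""
    d_morgans = abs(distance_cm) / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def haldane_cm(theta: float) -> float:
    """Inverse conversion: map distance in cM for a recombination fraction."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in the interval [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * theta)


if __name__ == "__main__":
    # Short distances give theta close to d/100; long distances approach 0.5.
    for d in (1, 10, 50, 200):
        print(f"{d:>3} cM -> theta = {haldane_theta(d):.4f}")
```

For small distances the recombination fraction is close to the distance in Morgans, and it never exceeds 0.5 however far apart the loci are.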

Data overview

We will begin by analysing four families that are believed to be segregating for a recessive disease locus. The families are typed at three linked marker loci, which we shall call markers 2, 3 and 4. The pedigree data is in the file pedfile1.txt. You will need to download this onto your H: drive (I suggest you first make a new folder for all of today's files, using a folder name WITHOUT ANY SPACES).

Take a look at the pedigree file. Each line gives the data for a single person. Data is ordered in columns corresponding to family, id (within family), id of father, id of mother, sex (male=1, female=2), affection status (1=unaffected, 2=affected), and genetic data (3 loci, each with 2 alleles). A zero indicates missing or unknown data.
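
If it helps to see the column layout spelled out, the following Python sketch splits one pedigree line into the fields described above. The example line is invented for illustration and is not taken from pedfile1.txt; the class and function names are also hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Person:
    family: str
    pid: str
    father: str      # "0" means unknown (founder)
    mother: str      # "0" means unknown (founder)
    sex: int         # 1 = male, 2 = female
    affection: int   # 1 = unaffected, 2 = affected, 0 = unknown
    genotypes: List[Tuple[str, str]]  # one (allele, allele) pair per marker


def parse_ped_line(line: str, n_markers: int = 3) -> Person:
    """Split one whitespace-separated pedigree line into named fields."""
    fields = line.split()
    expected = 6 + 2 * n_markers
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    alleles = fields[6:]
    genotypes = [(alleles[2 * i], alleles[2 * i + 1]) for i in range(n_markers)]
    return Person(fields[0], fields[1], fields[2], fields[3],
                  int(fields[4]), int(fields[5]), genotypes)


if __name__ == "__main__":
    # Hypothetical line: family 1, person 3, father 1, mother 2, female,
    # affected, genotypes 1/1 and 2/3, third marker untyped.
    print(parse_ped_line("1 3 1 2 2 2 1 1 2 3 0 0"))
```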

Try to draw a pedigree diagram for the first family.

To perform the analysis in Merlin, we need an additional file, sometimes called the "locus datafile": datfile.txt

This file gives information about the different loci in the pedigree file (the markers (M) and the assumed disease or affection locus (A)) and their order in the pedigree file. Take a look at this file and check you understand how this information is coded.
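
One way to think about the locus datafile is as a key to the pedigree file columns: after the five fixed columns (family, id, father, mother, sex), an A entry accounts for one column and an M entry for two. The Python sketch below works out that layout. The locus names used in the example are hypothetical and may not match those in datfile.txt.

```python
from typing import Dict, Iterable, List

# Number of pedigree-file columns taken up by each locus type.
COLUMNS_PER_TYPE = {"A": 1, "M": 2}


def column_layout(dat_lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each locus name to the 0-based pedigree columns holding its data."""
    layout: Dict[str, List[int]] = {}
    col = 5  # columns 0-4 hold family, id, father, mother and sex
    for line in dat_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) < 2:
            raise ValueError(f"missing locus name in line: {line!r}")
        code, name = parts[0].upper(), parts[1]
        if code not in COLUMNS_PER_TYPE:
            raise ValueError(f"locus type not handled in this sketch: {code}")
        width = COLUMNS_PER_TYPE[code]
        layout[name] = list(range(col, col + width))
        col += width
    return layout


if __name__ == "__main__":
    # Hypothetical datafile contents: one affection locus, then three markers.
    example = ["A disease", "M marker2", "M marker3", "M marker4"]
    for locus, cols in column_layout(example).items():
        print(locus, cols)
```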

We also need a file that gives the genetic map positions (in cM) of the loci: mapfile.txt

and a file giving the frequencies of the alleles at the different loci: freqfile.txt

Take a look at these files and check you understand how this information is coded.

If you have not already done so, make a new directory (folder) in your home space and save the above files in it. You will also need to save a copy of the Merlin and Pedstats programs in the same directory:
merlin.exe
pedstats.exe

Step-by-step instructions

1. To start with, you will need to open an MS-DOS window (click on Start, then Run, then type cmd). Once the window has opened, type dir to see all the files and directories (folders) that are in your home space, and move into the directory where you saved the data files, e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.

2. Use the Pedstats program to check your data by typing

pedstats -p pedfile1.txt -d datfile.txt

(This command tells the program to read in the pedigree file specified after -p and the locus datafile specified after -d.)

Check the output on the screen. Do you see an error message? Check the input pedigree file and see if you can spot where the Mendelian inheritance error is.
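
When looking for the error by eye, the rule to apply at each marker is that a child must carry one allele from the father and one from the mother. The Python sketch below applies that rule to a single child and both parents. It is a simplified illustration, not a description of how Pedstats works internally; it skips any trio with missing data, and the genotypes shown are invented.

```python
from itertools import product
from typing import Tuple

Genotype = Tuple[str, str]
MISSING = "0"


def mendel_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if the child's genotype can be formed from one paternal and one
    maternal allele. Trios with any missing allele are not checked."""
    if MISSING in child or MISSING in father or MISSING in mother:
        return True
    target = sorted(child)
    # Try every combination of one paternal allele with one maternal allele.
    return any(sorted((pat, mat)) == target
               for pat, mat in product(father, mother))


if __name__ == "__main__":
    # Hypothetical genotypes at one marker.
    print(mendel_consistent(("1", "2"), ("1", "1"), ("2", "3")))  # consistent
    print(mendel_consistent(("3", "3"), ("1", "2"), ("3", "4")))  # inconsistent
```

In the second example the child has two copies of allele 3, but the father carries no allele 3 to pass on.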

3. A corrected version of the pedigree file is in the file pedfile2.txt. You will need to download this into the same folder on your H: drive. Re-run the Pedstats program using the new corrected pedigree file, and check you (roughly) understand the output. Has the error message disappeared?

4. Use the Merlin program to perform a non-parametric linkage analysis on the new pedigree file by typing

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp

(This command tells Merlin which pedigree file, locus datafile, map file and allele frequency file to use via the -p, -d, -m and -f options. The options --pairs and --exp tell Merlin to use the S_pairs scoring function and to report the Kong and Cox "exponential model" Zlr statistic as well as the NPL statistic.)

Take a look at the output on the screen. At each of the 3 marker loci you should see an NPL statistic (called Zmean) and p value, then 3 columns of results from the Kong and Cox "linear model" and 3 columns of results from the Kong and Cox "exponential model". The most important things are the last two columns from the exponential model, marked LOD and p value.
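
The Kong and Cox LOD score and the Zlr statistic carry the same information: LOD = Zlr^2 / (2 ln 10). The Python sketch below converts between the two and gives an approximate one-sided p value by treating Zlr as a standard normal deviate. This is an illustration of the relationship only; the function names are invented, only the magnitude of the LOD is returned, and the values may differ slightly from those Merlin prints.

```python
import math

TWO_LN_10 = 2.0 * math.log(10.0)


def lod_from_zlr(zlr: float) -> float:
    """LOD score (magnitude) corresponding to a Kong and Cox Zlr statistic."""
    return zlr ** 2 / TWO_LN_10


def zlr_from_lod(lod: float) -> float:
    """Positive Zlr value corresponding to a non-negative LOD score."""
    if lod < 0:
        raise ValueError("this sketch only handles non-negative LOD scores")
    return math.sqrt(TWO_LN_10 * lod)


def one_sided_p(z: float) -> float:
    """Upper-tail probability of a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    for lod in (1.0, 2.0, 3.0):
        z = zlr_from_lod(lod)
        print(f"LOD {lod:.1f}  Zlr {z:.3f}  approx p {one_sided_p(z):.2e}")
```

Under this approximation a LOD score of 3 corresponds to a p value of roughly 0.0001.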

5. It looks as if there may be some evidence for linkage in this region. To do a multipoint analysis at increments between loci (as well as at the marker loci themselves), type

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp --steps 10

You should see results at 10 increments between loci, as well as at the loci themselves. To produce a nice plot of these results, type

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp --steps 10 --pdf

Take a look at the file merlin.pdf that should have been created. Do you think there is a disease locus in the region, and if so, where do you think it is located?

6. Close the pdf file. Now try analysing the data using the S_all scoring rather than the S_pairs scoring, by changing the --pairs option to --npl i.e. by typing:

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --npl --exp --steps 10 --pdf

Take a look at the file merlin.pdf that should have been created. Do you still think there is a disease locus in the region, and if so, where do you think it is located?

8. We will now use Merlin to perform a parametric analysis, assuming a recessive model. To do this, we need to download another file, recessive-model.txt. This file has 4 columns giving the disease name (which has to match the name in the locus datafile), the disease (D) allele frequency, the penetrances for genotypes dd, dD and DD, and a model name. Take a look at the file and check you understand it.
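
To see what the numbers in a model file imply, the Python sketch below reads one model line and computes the population prevalence of the disease, assuming Hardy-Weinberg proportions for the genotypes dd, dD and DD. The example line is hypothetical (including the assumption that the three penetrances are separated by commas) and is not copied from recessive-model.txt.

```python
from typing import Tuple


def parse_model_line(line: str) -> Tuple[str, float, Tuple[float, ...], str]:
    """Split a 4-column model line into name, D allele frequency,
    penetrances (dd, dD, DD) and model label."""
    name, freq, penetrances, label = line.split()
    pens = tuple(float(x) for x in penetrances.split(","))
    if len(pens) != 3:
        raise ValueError("expected three penetrances (dd, dD, DD)")
    return name, float(freq), pens, label


def prevalence(p: float, pens: Tuple[float, ...]) -> float:
    """Population prevalence under Hardy-Weinberg proportions (assumed)."""
    q = 1.0 - p
    genotype_freqs = (q * q, 2.0 * p * q, p * p)  # dd, dD, DD
    return sum(g * f for g, f in zip(genotype_freqs, pens))


if __name__ == "__main__":
    # Hypothetical fully penetrant recessive model with D frequency 0.01.
    name, p, pens, label = parse_model_line(
        "disease 0.01 0.0,0.0,1.0 Recessive_Model")
    print(label, "prevalence:", prevalence(p, pens))
```

With these illustrative values only DD individuals are affected, so the prevalence is simply the square of the D allele frequency.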

9. To run the analysis using this model, type

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --model recessive-model.txt --steps 10 --pdf

Do you still think there is a disease locus in the region, and if so, where do you think it is located?

10. What would happen in the parametric analysis if we assumed a dominant instead of a recessive model? Download the file dominant-model.txt. Take a look at the file to check you understand how it is coded, and re-run the parametric linkage analysis using this file instead. What has happened to the evidence for linkage? You should see that assuming the wrong model can have a big impact on the power to detect linkage using parametric linkage analysis.
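
Part of the reason the model matters is that the penetrances determine which genotype configurations can explain the affection statuses seen in a family. The Python sketch below is a deliberately simplified illustration: it ignores the markers altogether and computes only the probability of the phenotypes in a father-mother-child trio under a single-locus model, assuming Hardy-Weinberg proportions in the parents. The allele frequency and penetrances are hypothetical and are not taken from the model files used in this practical.

```python
from itertools import product
from typing import Sequence, Tuple


def genotype_freqs(p: float) -> Tuple[float, float, float]:
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of D (dd, dD, DD)."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)


def child_genotype_probs(f: int, m: int) -> Tuple[float, float, float]:
    """P(child has 0, 1, 2 copies of D) given parental copy numbers f and m."""
    tf, tm = f / 2.0, m / 2.0  # chance each parent transmits D
    return ((1 - tf) * (1 - tm), tf * (1 - tm) + (1 - tf) * tm, tf * tm)


def trio_likelihood(p: float, pens: Sequence[float],
                    father_aff: bool, mother_aff: bool, child_aff: bool) -> float:
    """Probability of the three affection statuses, summed over genotypes."""
    def pheno(aff: bool, g: int) -> float:
        return pens[g] if aff else 1.0 - pens[g]

    freqs = genotype_freqs(p)
    total = 0.0
    for f, m in product(range(3), repeat=2):
        parents = freqs[f] * pheno(father_aff, f) * freqs[m] * pheno(mother_aff, m)
        child = sum(pc * pheno(child_aff, c)
                    for c, pc in enumerate(child_genotype_probs(f, m)))
        total += parents * child
    return total


if __name__ == "__main__":
    # Two unaffected parents with an affected child, D frequency 0.01.
    print("recessive:", trio_likelihood(0.01, (0.0, 0.0, 1.0), False, False, True))
    print("dominant: ", trio_likelihood(0.01, (0.0, 1.0, 1.0), False, False, True))
```

Under the fully penetrant dominant model used here, two unaffected parents cannot have an affected child, so the probability is zero, whereas the recessive model explains the same trio through two carrier parents.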

Merlin documentation

The Merlin documentation is available here: http://www.sph.umich.edu/csg/abecasis/Merlin/index.html