Computer Practical Exercises on File Formats using the Merlin and Pedstats programs

Introduction

This practical introduces some of the file formats used in genetic analysis. In later exercises we will analyse genetic data. For this first exercise, we will simply become familiar with the file formats and the mechanics of running the programs.

First log into your account and, on your H: drive, make a new folder to hold all of today's files. Use a folder name WITHOUT ANY SPACES.

Preliminaries and data overview

We will begin by considering 4 families. The families are typed at 3 linked marker loci which we shall call markers 2, 3 and 4. The pedigree data is in the file pedfile1.txt. You will need to download this into the folder you made on your H: drive. [Right click, select Save Target As, and navigate through to save it in the folder you made].

Take a look at the pedigree file, e.g. by clicking the link to the file above or by opening it in WordPad. Each line gives the data for a single person. The columns are, in order: family, id (within family), id of father, id of mother, sex (male=1, female=2), affection status (1=unaffected, 2=affected), and genetic data (3 loci, each with 2 alleles, giving 6 allele columns). A zero indicates missing or unknown data.
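If it helps to see the column layout spelled out, the following Python sketch splits one pedigree line into named fields. It assumes the columns are separated by whitespace and that alleles are coded as integers. The example line is made up for illustration and is not taken from pedfile1.txt; the names Person and parse_pedigree_line are likewise hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Person(NamedTuple):
    family: str
    person_id: str
    father: str                       # "0" means unknown (e.g. a founder)
    mother: str                       # "0" means unknown (e.g. a founder)
    sex: int                          # 1 = male, 2 = female
    affection: int                    # 1 = unaffected, 2 = affected, 0 = unknown
    genotypes: List[Tuple[int, int]]  # one (allele1, allele2) pair per marker


def parse_pedigree_line(line: str, n_markers: int = 3) -> Person:
    """Split one whitespace-separated pedigree line into named fields."""
    fields = line.split()
    expected = 6 + 2 * n_markers  # 6 fixed columns plus 2 alleles per marker
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    alleles = [int(a) for a in fields[6:]]
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return Person(fields[0], fields[1], fields[2], fields[3],
                  int(fields[4]), int(fields[5]), genotypes)


# Made-up line: family 1, person 3, father 1, mother 2, female, affected,
# typed at the first two markers and untyped (0 0) at the third.
example_line = "1 3 1 2 2 2 1 2 3 3 0 0"
person = parse_pedigree_line(example_line)
print(person)
```

Reading each line of the real file this way gives you the father and mother ids you need to draw the pedigree diagram by hand.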

Try to draw a pedigree diagram for the first family.

To perform the analysis in Merlin, we need an additional file sometimes called the "locus datafile": datfile.txt

This file gives information about the different loci in the pedigree file (the markers (M) and the assumed disease or affection locus (A)) and their order in the pedigree file. Take a look at this file and check you understand how this information is coded.
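One way to check your understanding of the locus datafile is to work out how many columns it implies for the pedigree file. The Python sketch below assumes each line of the datafile holds a one-letter code (A or M) followed by a locus name, that an A item occupies one pedigree column, and that an M item occupies two (one per allele). The locus names shown are hypothetical and may not match those in datfile.txt.

```python
def expected_pedigree_columns(dat_lines):
    """Count the pedigree-file columns implied by a list of datafile lines."""
    columns = 5  # family, id, father, mother, sex
    for line in dat_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        code = fields[0].upper()
        if code == "A":
            columns += 1  # affection status: one column
        elif code == "M":
            columns += 2  # marker: two allele columns
        else:
            raise ValueError(f"code not handled in this sketch: {code}")
    return columns


# Hypothetical datafile contents: one affection locus, then three markers.
hypothetical_dat = ["A disease", "M marker2", "M marker3", "M marker4"]
print(expected_pedigree_columns(hypothetical_dat))  # 5 + 1 + 3*2 = 12
```

Twelve columns matches the layout described for the pedigree file: five identifying columns, affection status, and six allele columns.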

We also need a file that gives the genetic map positions (in cM) of the loci: mapfile.txt

and a file giving the allele frequencies at the different loci: freqfile.txt

Take a look at each of these files and check you understand how this information is coded.

Save the above files (datfile.txt, mapfile.txt, freqfile.txt) in the folder you made on your H: drive. You will also need to save a copy of the Merlin and Pedstats programs in the same directory:
merlin.exe
pedstats.exe

Step-by-step instructions

1. To start with, you will need to open a Command Prompt (MS-DOS) window. [Click on Start (the round button on the bottom left), All Programs, Accessories, then click on Command Prompt].

Once the window has opened, type dir to see all the files and directories (folders) that are in your home space. Then move into the directory where you saved the data files, e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.
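If you prefer an automated check to reading the dir listing, a short Python sketch such as the one below reports which of the required files are missing from a folder. The function name missing_files is hypothetical; the file names are those listed earlier in this handout.

```python
from pathlib import Path

# Files that this handout asks you to save in your working folder.
REQUIRED = ["pedfile1.txt", "datfile.txt", "mapfile.txt", "freqfile.txt",
            "merlin.exe", "pedstats.exe"]


def missing_files(folder, required):
    """Return the names in `required` that are not files inside `folder`."""
    base = Path(folder)
    return [name for name in required if not (base / name).is_file()]


# "." means the current directory, i.e. wherever you ran cd to.
missing = missing_files(".", REQUIRED)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```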

2. Use the Pedstats program to check your data by typing

pedstats -p pedfile1.txt -d datfile.txt

(This command tells the program to read in the pedigree file specified after -p and the locus datafile specified after -d.)

Check the output on the screen. The program will output lots of information regarding things like which files you read in, which analysis options you chose, how many pedigrees (families) you read in and their sizes, how many markers you read in and their heterozygosities.
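To make the term heterozygosity concrete, the sketch below computes the observed heterozygosity at one marker as the proportion of typed individuals whose two alleles differ. This is an illustration only: the exact calculation Pedstats performs may differ, and the genotypes shown are made up.

```python
def observed_heterozygosity(genotypes):
    """Proportion of typed genotypes that are heterozygous.

    Genotypes containing a 0 allele are treated as missing and ignored.
    Returns None if nobody is typed at this marker.
    """
    typed = [(a, b) for a, b in genotypes if a != 0 and b != 0]
    if not typed:
        return None
    heterozygous = sum(1 for a, b in typed if a != b)
    return heterozygous / len(typed)


# Made-up genotypes at one marker: three typed people, one untyped.
sample = [(1, 2), (1, 1), (0, 0), (2, 3)]
print(observed_heterozygosity(sample))  # 2 of the 3 typed people are heterozygous
```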

Do you see an error message? Take another look at the input pedigree file and see if you can spot the Mendelian inheritance error that caused it.
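As a reminder of what a Mendelian inheritance error is, the sketch below checks one child against its two parents at a single marker: the child must carry one allele found in the father and one found in the mother. This is a simplified check written for illustration, not the algorithm Pedstats uses. It treats any genotype containing a 0 as missing and skips it. The function name and the genotypes are made up.

```python
def is_mendelian_consistent(child, father, mother):
    """Check one child/father/mother trio at a single marker.

    Each genotype is an (allele1, allele2) pair; 0 means missing.
    Trios with any missing allele are treated as uninformative.
    """
    if 0 in child or 0 in father or 0 in mother:
        return True  # cannot detect an error without complete data
    a, b = child
    # One allele must come from the father and the other from the mother.
    return (a in father and b in mother) or (b in father and a in mother)


# Made-up examples:
print(is_mendelian_consistent((1, 2), (1, 1), (2, 3)))  # True: 1 from father, 2 from mother
print(is_mendelian_consistent((3, 3), (1, 2), (3, 4)))  # False: father has no 3 allele
```

Applying the same reasoning by eye to each child in the pedigree file, marker by marker, should lead you to the offending genotype.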

3. A corrected version of the pedigree file is in the file pedfile2.txt. You will need to download this into the same folder on your H: drive. Re-run the Pedstats program using the new corrected pedigree file, and check you (roughly) understand the output. Has the error message disappeared?

4. Check that the Merlin program runs OK by typing

merlin -p pedfile2.txt -d datfile.txt -m mapfile.txt -f freqfile.txt --pairs --exp

(This command tells Merlin which pedigree file, locus datafile, map file and allele frequency file to use via the -p, -d, -m and -f options respectively. The options --pairs and --exp tell Merlin to use the S_pairs scoring function and to report the Kong and Cox "exponential model" Zlr statistic as well as the NPL statistic.)

Take a look at the output on the screen. At each of the 3 marker loci you should see an NPL statistic (called Zmean) and its p-value, then 3 columns of results from the Kong and Cox "linear model" and 3 columns of results from the Kong and Cox "exponential model". Don't worry for now about what this output means; we will discuss it in the next exercise. For now we just want to check that the program appears to run OK.

Pedstats and Merlin documentation:

Pedstats documentation is available here: http://www.sph.umich.edu/csg/abecasis/Pedstats/index.html

Merlin documentation is available here: http://www.sph.umich.edu/csg/abecasis/Merlin/index.html