Data management using PLINK (and MAPTHIN)

Overview

Purpose

In this exercise you will be exploring the use of PLINK (and MAPTHIN) for data management. An example of the use of this is to prepare files for subsequent analysis in MERLIN.

Program documentation

PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:

http://zzz.bwh.harvard.edu/plink/

The MapThin program can be found at:

http://www.staff.ncl.ac.uk/richard.howey/mapthin/

The MERLIN website is at:

http://www.sph.umich.edu/csg/abecasis/Merlin/index.html

Programs and Data

First I suggest you make a new folder to keep this exercise's programs and data files in. Then download the following files into this folder:

merlin.exe
pedstats.exe
plink.exe
mapthin.exe

chr10-merlin-full-pedfile.txt
chr10-merlin-full-datfile.txt
chr10-merlin-fullmap.txt
chr10-plinkmap.txt

fivesnps.txt

Exercise

Data overview

The data consists of genotype data at 1601 SNPs from a 30 MB region on chr 10, genotyped in 320 families consisting mostly of affected sib pairs and their parents.

Instructions

Data format

Take a look at the data files (e.g. using the `more' command from the MSDOS window, or by opening them in WordPad) and check you understand how they are coded. Note that the first 3 files are standard MERLIN format files. The fourth data file is a PLINK-format map file. PLINK uses a slightly different format map file compared to MERLIN. The PLINK format consists of exactly 4 columns:

     chromosome (1-22, X, Y or 0 if unplaced) 
     rs number or snp identifier 
     Genetic distance (in Morgans or cM, can be set to 0 when performing association analysis) 
     Base-pair position (bp units)

The final data file is just a list of the names of the first 5 SNPs in the map file. Check that these have been correctly listed.

Step-by-step instructions

To start with, you will need to open up an MSDOS window. [To do this, click on Start (the round button on the bottom left), All Programs, Accessories, then click on Command Prompt].

Once the window has opened, type dir to see all the files and directories (folders) that are in your home space, and move into the directory where you saved the data files e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.

Generating subsets of data

PLINK can be used to generate subsets of data. For example, suppose you wanted to create a smaller data set containing just the first 5 SNPs. You could do this by reading in the (PLINK-format) pedigree and map files (using the --ped and --map commands), extracting the SNPs of interest (using the --extract command), and writing out a new pedigree and map file using the --recode and --out commands. (The --out command allows you to choose the file name for the new files; without this command the new files are automatically called "plink.ped" and "plink.map").

To implement all this, type:

plink --noweb --ped chr10-merlin-full-pedfile.txt --map chr10-plinkmap.txt --extract fivesnps.txt --recode --out just5snps

It is worth reading the messages that PLINK outputs to the screen, to check what PLINK has done. Note that these output messages are also saved to a file just5snps.log

You should have created two new files: just5snps.ped and just5snps.map. Take a look at these (e.g. using the commands more just5snps.map and more just5snps.ped , hitting the space bar to scroll though) and check you understand how they are coded. Note that PLINK often recodes unknown disease status to "-9" rather than "0".

We can also generate subsets of people. Let's do this using the files just5snps.ped and just5snps.map as a starting point. Since these files both have the same stem ("just5snps") followed by the extensions ".ped" and ".map", we can read them in to PLINK together using the --file command.

To output just (unrelated) founders from the pedigrees, you can use the following commands:

plink --noweb --file just5snps --filter-founders --recode --out justfounders

Take a look at the files you have created (justfounders.ped and justfounders.map) and check you understand how they differ from just5snps.ped and just5snps.map.

To output just affected individuals (cases) from the pedigrees, you can use the following command:

plink --noweb --file just5snps --filter-cases --recode --out justcases

Take a look at the files you have created (justcases.ped and justcases.map) and check you understand how they differ from just5snps.ped and just5snps.map.

Although PLINK can read in and write out standard pedigree files, it is usually more convenient to read in and write out files in PLINK's special binary format, which will take up less disk space and be quicker to read into PLINK when performing various subsequent analyses. This can be done usinng the --make-bed command. For example, to save the "justcases" data in binary format, type:

plink --noweb --file justcases --make-bed --out binarycases

This should create 3 new files: binarycases.bed, binarycases.bim, binarycases.fam. You will not be able to read the file binarycases.bed as it is not human readable. The file binarycases.bim is a map file with two extra columns of information giving the possible alleles at each locus. You can take a look at this by typing more binarycases.bim. The file binarycases.fam gives the pedigree structure in a format that is compatible with the binary genotype file. You can take a look at this by typing more binarycases.fam. Note that this file is the same as the first six columns of the original pedigree file justcases.ped .

Generating pruned/thinned map files

Last week we did an analysis in MERLIN where we made use of thinned/pruned map files, where an original large number of SNPs was pruned and thinned to give a smaller number of SNPs, more suitable for linkage analysis. Here is an overview of using the PLINK and MapThin programs for performing the required steps to create these thinned/pruned map files, using the data in chr10-merlin-full-pedfile.txt and chr10-plinkmap.txt (which contain 1601 SNPs) as a starting point.

First you need to read your data into PLINK. PLINK takes standard MERLIN-format pedigree files, although it also takes pedigree files where the code '-9' (rather than 0) is used to signify unknown disease status. This is the code that PLINK uses when it writes out various data files.

Although PLINK can read in and write out standard pedigree files, it is usually more convenient to read in and write out files in PLINK's special binary format, which will take up less disk space and be quicker to read into PLINK when performing various subsequent analyses.

First we will read in the data to PLINK (which will require reading in a pedigree and a map file) and then we will write it out as a set of 3 binary format files, which capture the same information as is contained in the pedigree and map files:

plink --noweb --ped chr10-merlin-full-pedfile.txt --map chr10-plinkmap.txt --make-bed --out full

Here the --noweb command tells PLINK to run without bothering to check via the web to see whether there are updated versions of PLINK. The --ped xxxx command tells PLINK that the name of the first pedigree file is xxxx and the --map yyyy command tells PLINK that the name of the first map file is yyyy. The --make-bed command tells PLINK to output the resulting file in binary format and the --out full command tells PLINK to use the stem 'full' for the name of all its output files.

PLINK outputs some useful messages (you should always read these to make sure they match up with what you expect!) and also outputs a copy of these messages to the file full.log .

You should now find 4 new files in your directory: full.bed, full.bim, full.fam and full.log. The file full.bed is the binary genotype file which will not be human readable. The file full.bim is a map file with two extra columns of information giving the possible alleles at each locus. You can take a look at this (e.g. by typing more full.bim). The file full.fam gives the pedigree structure in a format that is compatible with the binary genotype file. You can take a look at this (e.g. by typing more full.fam). Note this file is the same as the first six columns of the original pedigree file, except that unknown disease status has been coded as '-9'.

For subsequent analysis of these data, the files full.bed, full.bim and full.fam will always need to be read into PLINK together, as together they provide all the required information. This can be done (when running PLINK) by using the command --bfile full.

Next we will read in the `full' files and write out a set of files that contain the same data but just for a subset of SNPs. We will ask PLINK to just include SNPs with minor allele frequency (MAF) > 0.35, SNPs that have been successfully genotyped in more than 95% of people (i.e. have less than 5% missing genotypes), and SNPs that do not fail a test of HWE at p value <0.00000001. The latter two of these are fairly standard QC checks. The minor allele frequency requirement is because we want to only include SNPs that will be highly informative for linkage analysis. (If we were performing association analysis we would probably use a MAF threshold of 0.01 or 0.05 instead of 0.35).

To do this step, we type:

plink --noweb --allow-no-sex --bfile full --maf 0.35 --geno 0.05 --hwe 0.00000001 --make-bed --out frequent

Take a look at the output files frequent.log, frequent.bim, frequent.fam. Can you understand what the information is encoded in these files, and whether/how the new files frequent.bed, frequent.bim, frequent.fam differ from the original files full.bed full.bim, full.fam?

Now we will `prune' to generate SNPs that are not in too much LD with one another. We have to do this in two steps. First we ask PLINK to generate the list of SNPs to keep:

plink --noweb --allow-no-sex --bfile frequent --indep 50 5 2 --out prunedsnplist

(The --indep 50 5 2 command actually tells PLINK how to decide which SNPs are in `too much' LD with one another. Don't worry too much about this option - if you are interested you can read more details on the PLINK web site).

Then we ask PLINK to just extract (keep) those SNPs from the SNPs that are in the `frequent' files:

plink --noweb --allow-no-sex --bfile frequent --extract prunedsnplist.prune.in --make-bed --out pruned

Take a look at the output files pruned.log, pruned.bim, pruned.fam. Can you understand what the information is encoded in these files?

In theory you could use the SNP map pruned.bim to perform the linkage analysis. However, you would first have to convert it to a MERLIN format map file. How would you do this? (E.g. using Excel?)

If we want to thin the map down even further, we can make use of the MapThin program. E.g. to thin the SNPs down to two SNPs per cM, type

mapthin -t 2 pruned.bim thinned.bim

This produces a PLINK-format `.bim' file containing approximately two SNPs per cM. Again, to use it in MERLIN, you would first have to convert it to a MERLIN format map file.

Use Excel to create MERLIN-format map files corresponding to pruned.bim and thinned.bim. (You will have to open up Excel, read in the file, delete columns that you don't need, insert a header row and give the columns the right header names for MERLIN. Then save the file with a new name of your choice, as a tab-delimited text file).

Now try to re-run the MERLIN analysis from last week with these new map files you have created.

Exercises prepared by: Heather Cordell
Checked by:
Programs used: MERLIN, PLINK, MapThin
Last updated: