Data management using PLINK

Overview

Purpose

In this exercise you will explore the use of PLINK for simple data management, for example to prepare files for subsequent association analysis.

Program documentation

PLINK has an extensive set of documentation, including a PDF manual, a web-based tutorial and web-based documentation:

http://zzz.bwh.harvard.edu/plink/

Programs and Data

First download the following files (along with the PLINK program) into an appropriate folder:

chr10-merlin-full-pedfile.txt
chr10-plinkmap.txt

fivesnps.txt


Exercise

Data overview

The data consist of genotypes at 1601 SNPs from a 30 Mb region on chromosome 10, genotyped in 320 families consisting mostly of affected sib pairs and their parents.


Instructions

Data format

Take a look at the data files (e.g. using the "more" command from the MSDOS window, or by opening them in WordPad) and check you understand how they are coded. Note that the first file is a standard PLINK-format pedigree file. The second data file is a PLINK-format map file. The PLINK map format consists of exactly 4 columns:

     Chromosome (1-22, X, Y, or 0 if unplaced)
     rs number or SNP identifier
     Genetic distance (in Morgans or cM; can be set to 0 when performing association analysis)
     Base-pair position (bp units)

The final data file is just a list of the names of the first 5 SNPs in the map file. Check that these have been correctly listed.
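
The check described above can also be scripted. The following is a minimal Python sketch, not part of PLINK; the function names and the SNP names in the sample lines are made up for illustration. It parses map-file lines into the four columns listed above and compares the first five SNP names with a supplied list.

```python
def read_map_lines(lines):
    """Parse PLINK map lines into (chromosome, snp_id, distance, position) tuples."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        chrom, snp_id, distance, position = fields
        records.append((chrom, snp_id, float(distance), int(position)))
    return records


def first_snps_match(map_records, snp_names, n=5):
    """Return True if snp_names is exactly the first n SNP identifiers in the map."""
    return [record[1] for record in map_records[:n]] == list(snp_names)


# Made-up example lines; the real input would be chr10-plinkmap.txt.
sample_map = [
    "10 snp_a 0 1000",
    "10 snp_b 0 2000",
    "10 snp_c 0 3000",
    "10 snp_d 0 4000",
    "10 snp_e 0 5000",
    "10 snp_f 0 6000",
]
# Made-up list; the real input would be the names in fivesnps.txt.
sample_list = ["snp_a", "snp_b", "snp_c", "snp_d", "snp_e"]

print(first_snps_match(read_map_lines(sample_map), sample_list))
```

With the real files, the same functions could be applied to open("chr10-plinkmap.txt") and to open("fivesnps.txt").read().split().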

Step-by-step instructions

To start with, you will need to open up an MSDOS window. [To do this, click on Start (the round button on the bottom left), All Programs, Accessories, then click on Command Prompt].

Once the window has opened, type dir to see all the files and directories (folders) that are in your home space, and move into the directory where you saved the data files e.g. by typing

cd xxxxx

(where xxxxx is replaced by the name of the appropriate folder).

Type dir again to check the required files are available in the directory.

Generating subsets of data

PLINK can be used to generate subsets of data. For example, suppose you wanted to create a smaller data set containing just the first 5 SNPs. You could do this by reading in the (PLINK-format) pedigree and map files (using the --ped and --map commands), extracting the SNPs of interest (using the --extract command), and writing out a new pedigree and map file using the --recode and --out commands. (The --out command allows you to choose the file name for the new files; without this command the new files are automatically called "plink.ped" and "plink.map").

To implement all this, type:

plink --noweb --ped chr10-merlin-full-pedfile.txt --map chr10-plinkmap.txt --extract fivesnps.txt --recode --out just5snps

It is worth reading the messages that PLINK outputs to the screen, to check what PLINK has done. Note that these output messages are also saved to the file just5snps.log.
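
To make clear what the --extract step does conceptually, here is a minimal Python sketch that mimics it on in-memory lines. This is an illustration only, not PLINK's implementation; extract_snps is a hypothetical helper and the sample data are made up. It assumes each SNP occupies two allele columns after the first six pedigree columns, in map-file order.

```python
def extract_snps(map_lines, ped_lines, keep_names):
    """Keep only the SNPs named in keep_names, in both map and pedigree lines."""
    keep = set(keep_names)
    map_rows = [line.split() for line in map_lines if line.strip()]
    # Positions (0-based) of the SNPs to keep, in map order.
    keep_idx = [i for i, row in enumerate(map_rows) if row[1] in keep]

    new_map = [" ".join(map_rows[i]) for i in keep_idx]

    new_ped = []
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue
        kept = fields[:6]  # the six pedigree columns are always retained
        for i in keep_idx:
            # SNP i occupies allele columns 6 + 2i and 7 + 2i.
            kept.extend(fields[6 + 2 * i: 8 + 2 * i])
        new_ped.append(" ".join(kept))
    return new_map, new_ped


# Made-up data: three SNPs, one individual.
sample_map = ["10 snp_a 0 1000", "10 snp_b 0 2000", "10 snp_c 0 3000"]
sample_ped = ["1 1 0 0 1 2 A A C C G T"]

new_map, new_ped = extract_snps(sample_map, sample_ped, ["snp_a", "snp_c"])
print(new_map)
print(new_ped)
```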

You should have created two new files: just5snps.ped and just5snps.map. Take a look at these (e.g. using the commands more just5snps.map and more just5snps.ped, hitting the space bar to scroll through) and check you understand how they are coded. Note that PLINK often recodes unknown disease status to "-9" rather than "0".
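
As an aid to reading the .ped file, the sketch below (Python, illustrative only) splits one line into the six leading pedigree columns and the allele pairs that follow. The function name and the sample line are made up; the column meanings follow the standard PLINK pedigree layout (family ID, individual ID, father ID, mother ID, sex, phenotype).

```python
def parse_ped_line(line):
    """Split one pedigree line into its six ID/status columns and genotype pairs."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a pedigree line needs at least 6 columns")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("alleles must come in pairs (two per SNP)")
    return {
        "family_id": fields[0],
        "individual_id": fields[1],
        "father_id": fields[2],
        "mother_id": fields[3],
        "sex": fields[4],
        "phenotype": fields[5],
        # One (allele1, allele2) tuple per SNP, in map-file order.
        "genotypes": [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)],
    }


# Made-up line with five SNPs; "0 0" stands for a missing genotype.
record = parse_ped_line("1 3 1 2 2 2 A A G G A C 0 0 T T")
print(record["phenotype"], len(record["genotypes"]))
```

For just5snps.ped, each line would be expected to yield five genotype pairs.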

We can also generate subsets of people. Let's do this using the files just5snps.ped and just5snps.map as a starting point. Since these files both have the same stem ("just5snps") followed by the extensions ".ped" and ".map", we can read them in to PLINK together using the --file command.

To output just (unrelated) founders from the pedigrees, you can use the following command:

plink --noweb --file just5snps --filter-founders --recode --out justfounders

Take a look at the files you have created (justfounders.ped and justfounders.map) and check you understand how they differ from just5snps.ped and just5snps.map.
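
To see which lines to expect in justfounders.ped, the founder filter can be mimicked in a few lines of Python. This is a sketch under the assumption that a founder is an individual whose father and mother IDs (columns 3 and 4) are both coded 0; the helper names and the sample family are made up, and this is not PLINK's own code.

```python
def is_founder(ped_fields):
    """Assumed rule: both parental IDs are coded "0"."""
    return ped_fields[2] == "0" and ped_fields[3] == "0"


def keep_founders(ped_lines):
    """Return only the pedigree lines that belong to founders."""
    kept = []
    for line in ped_lines:
        fields = line.split()
        if fields and is_founder(fields):
            kept.append(line)
    return kept


# Made-up family: two parents (founders) and two affected siblings.
sample_ped = [
    "1 1 0 0 1 1 A A",
    "1 2 0 0 2 1 A G",
    "1 3 1 2 1 2 A G",
    "1 4 1 2 2 2 A A",
]

for line in keep_founders(sample_ped):
    print(line)
```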

To output just affected individuals (cases) from the pedigrees, you can use the following command:

plink --noweb --file just5snps --filter-cases --recode --out justcases

Take a look at the files you have created (justcases.ped and justcases.map) and check you understand how they differ from just5snps.ped and just5snps.map.
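
Similarly, the case filter can be mimicked in Python to predict what justcases.ped should contain. The sketch assumes the standard PLINK phenotype coding in column 6 (2 = affected, 1 = unaffected, 0 or -9 = missing); the helper names and sample lines are made up for illustration.

```python
from collections import Counter

# Assumed phenotype coding (column 6 of the pedigree file).
PHENOTYPE_LABELS = {"2": "affected", "1": "unaffected", "0": "missing", "-9": "missing"}


def phenotype_counts(ped_lines):
    """Count individuals by phenotype label."""
    counts = Counter()
    for line in ped_lines:
        fields = line.split()
        if fields:
            counts[PHENOTYPE_LABELS.get(fields[5], "other")] += 1
    return counts


def keep_cases(ped_lines):
    """Return only the lines whose phenotype column is "2" (affected)."""
    return [line for line in ped_lines if line.split() and line.split()[5] == "2"]


# Made-up family: parents with unknown status, two affected siblings.
sample_ped = [
    "1 1 0 0 1 -9 A A",
    "1 2 0 0 2 -9 A G",
    "1 3 1 2 1 2 A G",
    "1 4 1 2 2 2 A A",
]

print(dict(phenotype_counts(sample_ped)))
print(keep_cases(sample_ped))
```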

Although PLINK can read in and write out standard pedigree files, it is usually more convenient to read in and write out files in PLINK's special binary format, which will take up less disk space and be quicker to read into PLINK when performing various subsequent analyses. This can be done using the --make-bed command. For example, to save the "justcases" data in binary format, type:

plink --noweb --file justcases --make-bed --out binarycases

This should create 3 new files: binarycases.bed, binarycases.bim and binarycases.fam. You will not be able to read the file binarycases.bed as it is not human readable. The file binarycases.bim is a map file with two extra columns of information giving the possible alleles at each locus. You can take a look at this by typing more binarycases.bim. The file binarycases.fam gives the pedigree structure in a format that is compatible with the binary genotype file. You can take a look at this by typing more binarycases.fam. Note that this file is the same as the first six columns of the original pedigree file justcases.ped.
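
The relationships described above between the text files can be checked with a short Python sketch. It is illustrative only: the function names and sample lines are made up, and it assumes whitespace-separated columns, with the two allele columns placed last in the .bim file.

```python
def fam_matches_ped(fam_lines, ped_lines):
    """True if each .fam row equals the first six columns of the matching .ped row."""
    fam_rows = [line.split() for line in fam_lines if line.strip()]
    ped_rows = [line.split()[:6] for line in ped_lines if line.strip()]
    return fam_rows == ped_rows


def parse_bim_line(line):
    """Split a .bim line into the four map columns plus the two allele columns."""
    chrom, snp_id, distance, position, allele1, allele2 = line.split()
    return {
        "chromosome": chrom,
        "snp_id": snp_id,
        "genetic_distance": float(distance),
        "position": int(position),
        "alleles": (allele1, allele2),
    }


# Made-up lines standing in for justcases.ped, binarycases.fam and binarycases.bim.
sample_ped = ["1 3 1 2 1 2 A G", "1 4 1 2 2 2 A A"]
sample_fam = ["1 3 1 2 1 2", "1 4 1 2 2 2"]
sample_bim = "10 snp_a 0 1000 G A"

print(fam_matches_ped(sample_fam, sample_ped))
print(parse_bim_line(sample_bim)["alleles"])
```

With the real files, the line lists could be obtained from open("binarycases.fam") and open("justcases.ped").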


Exercises prepared by: Heather Cordell
Checked by:
Programs used: MERLIN, PLINK, MapThin
Last updated: