In this exercise you will be exploring the use of PLINK (and MAPTHIN) for data management. An example of the use of this is to prepare files for subsequent analysis in MERLIN.
PLINK has an extensive set of docmentation including a pdf manual, a web-based tutorial and web-based documentation:
http://zzz.bwh.harvard.edu/plink/
The MapThin program can be found at:
http://www.staff.ncl.ac.uk/richard.howey/mapthin/
The MERLIN website is at:
http://www.sph.umich.edu/csg/abecasis/Merlin/index.html
First I suggest you make a new
folder to keep this exercise's programs and data files in.
Then download the following files into this folder:
merlin.exe
pedstats.exe
plink.exe
mapthin.exe
chr10-merlin-full-pedfile.txt
chr10-merlin-full-datfile.txt
chr10-merlin-fullmap.txt
chr10-plinkmap.txt
fivesnps.txt
The data consists of genotype data at 1601 SNPs from a 30 MB region on chr 10, genotyped in 320 families consisting mostly of affected sib pairs and their parents.
Take a look at the data files (e.g. using the `more' command from the MSDOS window, or by opening them in WordPad) and check you understand how they are coded. Note that the first 3 files are standard MERLIN format files. The fourth data file is a PLINK-format map file. PLINK uses a slightly different format map file compared to MERLIN. The PLINK format consists of exactly 4 columns:
chromosome (1-22, X, Y or 0 if unplaced) rs number or snp identifier Genetic distance (in Morgans or cM, can be set to 0 when performing association analysis) Base-pair position (bp units)
The final data file is just a list of the names of the first 5 SNPs in the map file. Check that these have been correctly listed.
To start with, you will need to open up an MSDOS window. [To do
this, click on Start
(the round button on the bottom left), All Programs, Accessories,
then click on Command Prompt].
Once the window has opened, type dir to see all the files and directories
(folders) that are in your home space, and move into the directory where you saved the data files e.g. by typing
cd xxxxx
(where xxxxx is replaced by the name of the appropriate folder).
Type dir again to check the required files are available in the directory.
PLINK can be used to generate subsets of data. For example, suppose you wanted to create a smaller data
set containing just the first 5 SNPs. You could do this by reading in the (PLINK-format) pedigree and
map files (using the --ped and --map commands), extracting the SNPs of interest
(using the --extract command), and writing out a new pedigree and map file using the
--recode and --out commands. (The --out command allows you to choose the
file name for the new files; without this command the new files are automatically called "plink.ped"
and "plink.map").
To implement all this, type:
plink --noweb --ped chr10-merlin-full-pedfile.txt --map chr10-plinkmap.txt --extract fivesnps.txt --recode --out just5snps
It is worth reading the messages that PLINK outputs to the screen, to check what PLINK has done. Note that these output
messages are also saved
to a file just5snps.log
You should have created two new files: just5snps.ped and just5snps.map. Take a look at these (e.g. using
the commands more just5snps.map and more just5snps.ped , hitting the space bar to scroll though) and
check you understand how they are coded.
Note that PLINK often recodes unknown disease status to "-9" rather than "0".
We can also generate subsets of people. Let's do this using the files just5snps.ped and just5snps.map as a
starting point. Since these files both have the same stem ("just5snps") followed by the extensions ".ped" and ".map", we can
read them in to PLINK together using the --file command.
To output just (unrelated) founders from the pedigrees, you can use the following commands:
plink --noweb --file just5snps --filter-founders --recode --out justfounders
Take a look at the files you have created (justfounders.ped and justfounders.map) and check you understand
how they differ from just5snps.ped and just5snps.map.
To output just affected individuals (cases) from the pedigrees, you can use the following command:
plink --noweb --file just5snps --filter-cases --recode --out justcases
Take a look at the files you have created (justcases.ped and justcases.map) and check you understand
how they differ from just5snps.ped and just5snps.map.
Although PLINK can read in and write out standard pedigree files, it is usually more convenient
to read in and write out files in PLINK's special binary format, which will take up less disk space
and be quicker to read into PLINK when performing various subsequent analyses. This can be done usinng the
--make-bed command. For example, to save
the "justcases" data in binary format, type:
plink --noweb --file justcases --make-bed --out binarycases
This should create 3 new files: binarycases.bed, binarycases.bim, binarycases.fam.
You will not be able to read the file binarycases.bed as it is not human readable.
The file binarycases.bim is a
map file with two extra columns of information giving the possible alleles at each
locus.
You can take a look at this by typing more binarycases.bim. The file
binarycases.fam gives the pedigree
structure in a format that is compatible with the binary genotype file.
You can take a look at this by typing more binarycases.fam. Note that this
file is the same as the first six columns
of the original pedigree file justcases.ped .
Last week we did an analysis in MERLIN where we made use of thinned/pruned map files, where an original large number of SNPs
was pruned and thinned to give a smaller number of SNPs, more suitable for linkage analysis. Here is an overview of using the
PLINK and MapThin programs for performing the required steps to create these thinned/pruned map files, using the data in
chr10-merlin-full-pedfile.txt and chr10-plinkmap.txt (which contain 1601 SNPs) as a starting point.
First you need to read your data into PLINK. PLINK takes standard MERLIN-format pedigree files,
although it also takes pedigree files where the code '-9' (rather than 0) is used
to
signify
unknown disease status. This is the code that PLINK uses when it writes out various data files.
Although PLINK can read in and write out standard pedigree files, it is usually more convenient
to read in and write out files in PLINK's special binary format, which will take up less disk space
and be quicker to read into PLINK when performing various subsequent analyses.
First we will read in the data to PLINK (which will require reading in a pedigree and a map file) and then
we will write it out as a set of 3 binary format files, which capture the same information as is contained
in the
pedigree and map files:
plink --noweb --ped chr10-merlin-full-pedfile.txt --map chr10-plinkmap.txt --make-bed --out full
Here the --noweb command tells PLINK to run without bothering to check via the web to see whether
there are updated versions of PLINK.
The --ped xxxx command tells PLINK that the name of the first
pedigree file is xxxx and the
--map yyyy command tells PLINK that the name of the first map file is yyyy.
The --make-bed command tells PLINK to output the resulting file in binary format and the
--out full command tells PLINK to use the stem 'full' for the name of all its output files.
PLINK outputs some useful messages (you should always read these to make sure they match up with what you
expect!) and also outputs a copy of these messages to the file full.log .
You should now find 4 new files in your directory: full.bed, full.bim, full.fam
and full.log. The file full.bed is the binary genotype file which will not be human readable.
The file full.bim is a
map file with two extra columns of information giving the possible alleles at each
locus.
You can take a look at this (e.g. by typing more full.bim). The file
full.fam gives the pedigree
structure in a format that is compatible with the binary genotype file.
You can take a look at this (e.g. by typing more full.fam). Note this
file is the same as the first six columns
of the original pedigree file, except that unknown disease status has been coded as '-9'.
For subsequent analysis of these data, the files full.bed, full.bim and full.fam
will always need to be read into PLINK
together, as together they provide all the required information. This can be done (when running PLINK) by using
the command --bfile full.
Next we will read in the `full' files and write out a set of files that contain the same data but just for a
subset of SNPs. We will ask PLINK to just include SNPs with minor allele frequency (MAF) > 0.35, SNPs
that have been successfully genotyped in more than 95% of people (i.e. have less than 5% missing genotypes),
and SNPs that do not fail a test of HWE at p value <0.00000001. The latter two of these are fairly standard QC
checks. The minor allele frequency requirement is because we want to only include SNPs that will be highly
informative for linkage analysis. (If we were performing association analysis we would probably use a MAF
threshold of 0.01 or 0.05 instead of 0.35).
To do this step, we type:
plink --noweb --allow-no-sex --bfile full --maf 0.35 --geno 0.05 --hwe 0.00000001
--make-bed --out frequent
Take a look at the output files frequent.log, frequent.bim, frequent.fam. Can you
understand what the information is encoded in these files, and whether/how the new files
frequent.bed, frequent.bim, frequent.fam
differ from the
original files
full.bed full.bim, full.fam?
Now we will `prune' to generate SNPs that are not in too much LD with one another. We have to do this in two
steps. First we ask PLINK to generate the list of SNPs to keep:
plink --noweb --allow-no-sex --bfile frequent --indep 50 5 2 --out prunedsnplist
(The --indep 50 5 2 command actually tells PLINK how to decide which SNPs are in `too much' LD with one another. Don't worry too much about this option - if you
are interested you can read more details on the PLINK web site).
Then we ask PLINK to just extract (keep) those SNPs from the SNPs that are in the `frequent' files:
plink --noweb --allow-no-sex --bfile frequent --extract prunedsnplist.prune.in
--make-bed --out pruned
Take a look at the output files pruned.log, pruned.bim, pruned.fam. Can you
understand what the information is encoded in these files?
In theory you could use the SNP map pruned.bim to perform the linkage analysis. However, you would
first have to convert it to a MERLIN format map file. How would you do this? (E.g. using Excel?)
If we want to thin the map down even further, we can make use of the MapThin program. E.g. to thin the SNPs down
to two SNPs per cM, type
mapthin -t 2 pruned.bim thinned.bim
This produces a PLINK-format `.bim' file containing approximately two SNPs per cM. Again, to use it in MERLIN, you
would first have to convert it to a MERLIN format map file.
Use Excel to create MERLIN-format map files corresponding to pruned.bim
and thinned.bim. (You will have to open up Excel, read in the file, delete
columns that you don't need, insert a header row and give the columns the right header names
for MERLIN. Then save the file
with a new name of your choice, as a tab-delimited text file).
Now try to re-run the MERLIN analysis from last week with these new map files you have created.