1 Introduction

Do you want to run a linkage analysis, but find that your map file is far bigger than you need? Do you wish to reduce false positive rates and speed up execution times? As a first step, you can use PLINK to keep only the common SNPs (in this example, those with a minor allele frequency of at least 0.4) by typing something like the following.

plink --file mydata --maf 0.4 --recode --out mydata-frequent

You can also, optionally, remove SNPs that are in strong linkage disequilibrium (LD) with one another by typing something like the following two commands in PLINK.

plink --file mydata-frequent --indep 50 5 2 --out pruned-snp-list

plink --file mydata-frequent --extract pruned-snp-list.prune.in --recode --out mydata-pruned

Unfortunately, this only gets you so far: the map file may still contain too many SNPs. What can you do? You can use MapThin to thin your map file!

./mapthin mydata-frequent.map thinned-data.map

Note that running MapThin on your map file using genetic distance in cM (the default) should remove any pairs of SNPs that are too highly correlated, which makes the PLINK step that removes SNPs in strong LD unnecessary.

MapThin thins a map file by taking the first SNP and then moving a marker along the list of SNPs in fixed steps. At each step it takes the SNP closest to the marker, or the second closest if the closest SNP has already been chosen. The length of the marker step is determined by the “SNPs per cM” option. It is also possible to choose an absolute number of SNPs to keep or a percentage of SNPs to keep. These options work by searching for a suitable length for the marker step. There is no guarantee that the thinned map file will contain exactly the requested number or percentage of SNPs, but it should be very close. Extreme thinning options, near 0 or 100 percent, may fail.
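The stepping procedure described above can be sketched in Python as follows. This is only an illustration of the idea, not MapThin's actual source code: the function name `thin_positions`, the assumption that the step is a plain distance derived from the “SNPs per cM” setting, and the way ties are broken are all hypothetical.

```python
from bisect import bisect_left


def thin_positions(positions, step):
    """Return the indices of the SNPs kept by a simple stepping scheme.

    positions: sorted map positions (e.g. in cM) for one chromosome.
    step: distance the pointer advances on each move (assumed here to be
          derived from the "SNPs per cM" setting, e.g. 1 / SNPs-per-cM).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if not positions:
        return []

    kept = {0}  # the first SNP is always taken
    k = 1
    while positions[0] + k * step <= positions[-1]:
        pointer = positions[0] + k * step
        i = bisect_left(positions, pointer)
        # The two SNPs nearest the pointer must lie among these neighbours.
        nearby = [j for j in range(i - 2, i + 2) if 0 <= j < len(positions)]
        nearby.sort(key=lambda j: abs(positions[j] - pointer))
        # Take the closest SNP, or the second closest if already chosen.
        for j in nearby[:2]:
            if j not in kept:
                kept.add(j)
                break
        k += 1
    return sorted(kept)
```

For example, with eleven SNPs at positions 0 to 10 and a step of 5, this sketch keeps the SNPs at indices 0, 5 and 10. Asking for an absolute number or percentage of SNPs could then be handled by calling such a function repeatedly with different step lengths until the count is close enough to the target.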

Where genetic distances are not available, it may be useful to thin the SNPs on the basis of base pair position instead. An option is therefore included to do this. It works in the same way as thinning by genetic distance, but uses the values in the base pair position column of the map file instead of the genetic distance column.
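The choice between the two columns can be illustrated with a short Python sketch. It assumes the standard four-column PLINK map layout (chromosome, SNP identifier, genetic distance, base pair position). The function name `read_positions`, the `use_base_pairs` flag and the example SNP names are hypothetical and are not part of MapThin.

```python
def read_positions(map_lines, use_base_pairs=False):
    """Yield (chromosome, snp_id, position) tuples from PLINK .map lines.

    The position is taken from the genetic distance column by default,
    or from the base pair position column when use_base_pairs is True.
    """
    column = 3 if use_base_pairs else 2
    for line in map_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        yield fields[0], fields[1], float(fields[column])


# Made-up example lines in the four-column .map layout.
EXAMPLE_LINES = [
    "1 rs0001 0.10 10500",
    "1 rs0002 0.35 48210",
]

by_distance = list(read_positions(EXAMPLE_LINES))
by_base_pair = list(read_positions(EXAMPLE_LINES, use_base_pairs=True))
```

Whichever column is chosen, the thinning itself proceeds in the same way on the resulting positions.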