8 Parallel processing

EMIM performs the analysis for each SNP one by one, so if there are many SNPs this may take a long time. It is therefore natural to want to speed up the process by performing these calculations in parallel. To facilitate this, PREMIM and EMIM include features that make this easy, either by splitting the input files for EMIM into many smaller files, or by using multiple files that have already been created by some other program such as PLINK (e.g. one file per chromosome).

The following instructions explain how to do this:

  1. Create input and output sub-directories for EMIM. In the directory where your pedigree file or pedigree files are, create sub-directories for the input and output of EMIM. For example, to create directories simply called input and output in Linux/UNIX type:
    mkdir input
    mkdir output
    
  2. Create input files for EMIM. The next step is to create the input files for EMIM using PREMIM with your pedigree file(s). There are two different ways to do this:
    1. If your data is currently contained in a single file (e.g. called data.bed), you can use PREMIM with the "-s n dir" option, which splits the output files (caseparenttrios.dat etc.) into many files, each containing n SNPs, written to directory dir. Since the data may be split into many files, it is wise to put them in the separate input directory you made, to make them easier to manage. For example, to split the output into files of 1000 SNPs each in the directory input, using the initial binary pedigree file data.bed and asking PREMIM to estimate allele frequencies, type:
      ./premim -a -s 1000 input/ data.bed 
      Note that the "/" is required after the directory name and a "\" may be required for systems other than Linux/UNIX. The files created will be named input/caseparenttrios1.dat, input/caseparenttrios2.dat... etc. Only one EMIM parameter file (emimparams.dat) is created (in the directory above the input directory).
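
      It can be useful to know how many sets of input files the split produced, for example when choosing the task range for an array job later on. The following is only a sketch, assuming the files were written to the input directory as above:

```shell
# Count the sets of EMIM input files created by the split (a sketch;
# assumes files input/caseparenttrios1.dat, input/caseparenttrios2.dat...
# were created as described above)
count_sets() {
    ls input/caseparenttrios*.dat 2>/dev/null | wc -l
}
```

      For example, count_sets would print 22 if the data were split into 22 sets, and 0 if no split files are present.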

    2. If your data is already split into many files (e.g. one per chromosome, named chr1.bed, chr2.bed, chr3.bed etc.), you need to move these files into the input directory and, within the input directory, run PREMIM on each file to create the input files for EMIM. This can be done manually, one file at a time, or you can write a loop in a Perl script or similar. Alternatively, if you are using a High Performance Computing (HPC) cluster running the open-source Sun Grid Engine (SGE) scheduler software, these jobs may be submitted as an array job using something similar to the following script:
      #!/bin/bash
      # execute in current working directory
      #$ -cwd
      # export local environment
      #$ -V
      # the number of PREMIM tasks 
      #$ -t 1-22
      # execute PREMIM for each task
      ./premim -a -n $SGE_TASK_ID chr$SGE_TASK_ID.bed 
      
      The key thing is to make sure that you use PREMIM's -n option to output sets of input files for EMIM with different names (e.g. one set of EMIM input files per chromosome). This relies on the fact that PREMIM can create files caseparenttriosXXX.dat, caseparentsXXX.dat... etc. by using the "-n name" option. For example, to create output files with 5 appended to the end using pedigree file data.ped, you can type "./premim -n 5 data.ped".

      If the above job script is called premimarray.sh then it is submitted to the job queue as follows:
      qsub premimarray.sh 
      This will submit a number of serial jobs (22 in the example above) to the HPC to execute in parallel. The exact details of how to perform parallel jobs may depend on your local computing services and you should consult your local computing support if in doubt.
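
      If no cluster is available, the per-chromosome PREMIM runs can instead be performed serially with a simple shell loop (in place of the Perl script mentioned above). This is only a sketch: the chromosome count and the chrN.bed file names are assumptions matching the example above.

```shell
# Run PREMIM serially on each per-chromosome file (a sketch; assumes the
# files are named chr1.bed, chr2.bed... as in the example above)
premim_all_chr() {
    local n="$1"   # number of chromosomes, e.g. 22
    local i
    for i in $(seq 1 "$n"); do
        ./premim -a -n "$i" "chr${i}.bed" || return 1
    done
}
```

      Calling premim_all_chr 22 runs PREMIM on chr1.bed through chr22.bed in turn, one after the other.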

      Once the array job, or Perl loop, or manual running of PREMIM has finished, you should have created a set of input files named input/caseparenttrios1.dat, input/caseparenttrios2.dat... etc. Only one EMIM parameter file (emimparams.dat) will have been created, corresponding to the last set of input files PREMIM created. This is not entirely satisfactory, as you may find that the last set of input files does not contain the largest number of SNPs. You therefore need to edit the number of SNPs on line 13 of emimparams.dat to a number at least as great as the number of lines in the largest input file emimmarkers*.dat. E.g. type
      wc -l emimmarkers*.dat
      
      to find out which file has the greatest number of lines, and edit the number of SNPs on line 13 of emimparams.dat to correspond to this number (or greater). THEN MOVE THIS FILE emimparams.dat ONE LEVEL BACK IN THE DIRECTORY HIERARCHY (i.e. it should be placed in the directory above the input directory).
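
      The largest line count can also be found programmatically rather than by eye. The following sketch prints the maximum, assuming the split marker files live in the input directory as above; the printed value (or anything greater) is what the number of SNPs in emimparams.dat should be set to.

```shell
# Print the largest line count among the split marker files (a sketch;
# assumes files input/emimmarkers1.dat, input/emimmarkers2.dat... exist)
max_marker_lines() {
    wc -l input/emimmarkers*.dat \
        | awk '$2 != "total" && $1 > m { m = $1 } END { print m + 0 }'
}
```

      The awk filter skips the "total" line that wc prints when given more than one file.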
  3. Create SNP marker files. Before running EMIM, ensure that the necessary SNP marker files are available for each set of input files. Normally the file emimmarkers.dat is used; in the parallel version the files emimmarkers1.dat, emimmarkers2.dat... etc. are used instead, stored in the input directory for EMIM along with the rest of the files. If you used PREMIM's -a option when creating the input files for EMIM, then you should already have a usable set of SNP marker files in the input directory. Otherwise, these files may be created from an existing SNP marker file by using PREMIM with the "-fm n markerfile.dat dir" option, which splits an existing marker file into many marker files, each with n SNPs, written to directory dir. For example, to split the file emimmarkers.dat into marker files with 1000 SNPs each in directory input, type:
    ./premim -fm 1000 emimmarkers.dat input/ 
    This creates the files input/emimmarkers1.dat, input/emimmarkers2.dat... etc.
  4. Run EMIM in parallel. For each set of input files, for example, input/caseparenttrios5.dat, input/caseparents5.dat... etc., it is possible to run EMIM by typing:
    ./emim 5 input/ output/ 
    This will create result files output/emimsummary5.out and output/emimresults5.out. This must be done for every set of files numbered 1 to N for some N. The number N may correspond to 22 (e.g. if your data files were set up one per chromosome, for 22 chromosomes) or to some other number, depending on how many sets of input files were created by PREMIM when it split up your original data file. If you are using a High Performance Computing (HPC) cluster using the open-source Sun Grid Engine (SGE) scheduler software, then these jobs may be submitted as an array job using something similar to the following script:
    #!/bin/bash
    # execute in current working directory
    #$ -cwd
    # export local environment
    #$ -V
    # the number of EMIM tasks 
    #$ -t 1-1435
    # execute EMIM for each task
    ./emim $SGE_TASK_ID input/ output/
    
    If the above script is saved in the directory above the input directory (i.e. the directory where your file emimparams.dat exists) and if the script is called emimarray.sh, then it is submitted to the job queue as follows:
    qsub emimarray.sh 
    This will submit a number of serial jobs (1435 in the example above) to the HPC to execute in parallel. The exact details of how to perform parallel jobs may depend on your local computing services and you should consult your local computing support if in doubt.
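
    Again, if no cluster is available, the N EMIM runs can be performed serially with a shell loop. This is a sketch under the same assumptions as above: the input and output directories created in step 1, and N numbered sets of input files.

```shell
# Run EMIM serially on each set of input files (a sketch; assumes the
# input/ and output/ directories of step 1 and N sets of input files)
emim_all() {
    local n="$1"   # number of sets of input files, e.g. 22
    local i
    for i in $(seq 1 "$n"); do
        ./emim "$i" input/ output/ || return 1
    done
}
```

    For example, emim_all 22 performs the 22 per-chromosome analyses one after the other; this will be slower than an array job but requires no scheduler.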
  5. Collate EMIM results. The results of the EMIM analysis are stored in the given output directory, for example output/emimsummaryJ.out and output/emimresultsJ.out for J from 1 to N for some N. This can be convenient if each file corresponds to a different chromosome, for example, but it is not very convenient if there are hundreds or thousands of result files. These files may therefore be collated into two result files emimsummary.out and emimresults.out by using the "-fr dir" option in PREMIM. For example, if the results are in directory output type:
    ./premim -fr output/ 
    Note that the SNP numbers in these combined files relate to the number within each output file output/emimsummaryJ.out and output/emimresultsJ.out, which in turn corresponds to the SNP number within the input files input/caseparenttriosJ.dat, input/casemotherduosJ.dat... etc. As a result, these SNP numbers in the combined files will not be unique. If you used PREMIM to split a single file (e.g. data.bed) into many files for parallel processing, the SNP ID (as opposed to the SNP number) in the combined files should correspond to the SNP number within the original file data.bed, and so will be unique. If you used PREMIM to process a set of files (e.g. one per chromosome, named chr1.bed, chr2.bed, chr3.bed etc.) then neither the SNP number nor the SNP ID within the combined files will be unique, as they will refer to the number within the original files chr1.bed, chr2.bed, chr3.bed etc. We recommend keeping a separate list of the SNP identifiers from your .map or .bim files (e.g. their rs IDs) and their SNP number (order) within these files, to make it easier to match the results in your combined output files with the correct SNPs. This can be done using the -rout option in PREMIM.
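
    As an alternative to PREMIM's -rout option, such a lookup list can be produced directly from a PLINK .bim file, in which the second column is the SNP identifier and the row order gives the SNP number. The following is a sketch; the file name data.bim is a placeholder for your own file.

```shell
# Print "SNP number <tab> SNP ID" for each row of a PLINK .bim file
# (a sketch; column 2 of a .bim file is the variant identifier, and
# the row number gives the SNP's order within the file)
snp_lookup() {
    awk '{ print NR "\t" $2 }' "$1"
}
```

    For example, snp_lookup chr1.bim > chr1snps.txt saves a numbered list of the SNP IDs for chromosome 1, which can then be matched against the SNP numbers in the combined output files.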