4 Data Processing

This section explains how PseudoCons processes the pedigree data to produce the case-control output data.

4.1 Trio Selection

By default, one case/parent trio is taken from each pedigree. The case is taken from this trio and one pseudocontrol is created from it. The trio chosen is that of the first case in the pedigree file who also has both parents in the pedigree file.
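The default rule above can be sketched in Python as follows. This is only an illustration of the rule as described, not the code used by PseudoCons; the `Subject` record and the `first_trio` function are hypothetical names introduced for this example.

```python
from __future__ import annotations
from typing import NamedTuple, Optional

class Subject(NamedTuple):
    # Hypothetical record for one line of a pedigree file.
    subject: str
    father: str    # "0" is assumed to mean "not in the pedigree file"
    mother: str
    affected: bool

def first_trio(pedigree: list[Subject]) -> Optional[tuple[str, str, str]]:
    """Return (case, father, mother) for the first case with both parents present."""
    ids = {s.subject for s in pedigree}
    for s in pedigree:  # subjects are visited in pedigree-file order
        if s.affected and s.father in ids and s.mother in ids:
            return (s.subject, s.father, s.mother)
    return None  # no usable trio in this pedigree

# Three-generation pedigree: grandparents 1 and 2, parents 3 and 4, child 5.
pedigree = [
    Subject("1", "0", "0", False),
    Subject("2", "0", "0", False),
    Subject("3", "1", "2", True),
    Subject("4", "0", "0", False),
    Subject("5", "3", "4", True),
]
chosen = first_trio(pedigree)
```

In this example subject 3 is listed before subject 5, so the trio containing the grandparents is chosen. This is the situation that the proband file described below is intended to address.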

Proband

There may be a choice of case/parent trios within a pedigree from which to take the case and create the pseudocontrol. For a pedigree file with many large pedigrees this could alter the results of any subsequent analysis. For example, if pedigrees are ascertained on the basis of a particular affected child, but case/parent trios containing the parents and grandparents are chosen instead, this could bias the analysis. With this in mind, it is possible to supply an optional proband file listing the affected subjects that are of interest. Each line of the file identifies one subject by pedigree name and subject name, corresponding to those in the pedigree file given to PseudoCons. For example, a proband file may look as follows:

1 4
2 5
3 2
5 12
7 3
9 3
10 2

The proband file is used in PseudoCons with the -pro option as follows:

./pseudocons -pro proband.dat -i mydata.bed -o mycasepscondata.bed

The name of the proband file should follow immediately after the -pro option. The following points should be noted about proband files:

  1. If a proband file is given it is not necessary to supply a subject for every pedigree. For example, for smaller pedigrees you may be happy to use the default setting.
  2. The proband subjects do not need to appear in any particular order in the file.
  3. If the proband subject is not affected a warning message will be displayed and the pedigree processed using the default settings.
  4. If a proband subject does not exist in the pedigree file a warning message will be displayed and the pedigree file will be processed as normal.
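The points above can be sketched in Python as follows. This is a minimal illustration under the assumption that the proband file is whitespace-separated with one subject per line; `read_probands` and `pick_case` are hypothetical names, and the warning text is illustrative rather than the message printed by PseudoCons.

```python
from __future__ import annotations

def read_probands(text: str) -> dict[str, str]:
    """Map pedigree name to proband subject name; line order does not matter."""
    probands = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            probands[fields[0]] = fields[1]
    return probands

def pick_case(pedigree: str, default_case: str,
              affected: set[str], probands: dict[str, str]) -> str:
    """Choose the case for one pedigree, falling back to the default choice."""
    wanted = probands.get(pedigree)
    if wanted is None:
        return default_case  # no proband listed for this pedigree
    if wanted not in affected:
        # Covers both an unaffected proband and one missing from the pedigree.
        print(f"Warning: proband {wanted} in pedigree {pedigree} not used")
        return default_case
    return wanted

probands = read_probands("1 4\n2 5\n3 2\n")
case_for_ped1 = pick_case("1", default_case="3", affected={"3", "4"}, probands=probands)
```

Here pedigree 1 uses subject 4 as the case instead of the default subject 3, while a pedigree that is not listed in the proband file keeps its default.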

Extra Trios

It is possible to use all possible case/parent trios from a pedigree, counting them as if they are independent, using the -xtrio option. The trios may overlap if a parent is also a case. Depending on the analysis you want to do, this independence assumption may be more or less valid.
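To illustrate how trios can overlap, the following Python sketch lists every case/parent trio in a pedigree. It is an illustration only, not the PseudoCons implementation; the tuple layout and the function name `all_trios` are assumptions made for this example.

```python
def all_trios(pedigree):
    """List (case, father, mother) for every case with both parents present.

    `pedigree` is a list of (subject, father, mother, affected) tuples,
    a hypothetical layout chosen for this sketch.
    """
    ids = {subject for subject, _, _, _ in pedigree}
    return [
        (subject, father, mother)
        for subject, father, mother, affected in pedigree
        if affected and father in ids and mother in ids
    ]

# Grandparents 1 and 2, parents 3 and 4, child 5; subjects 3 and 5 are cases.
pedigree = [
    ("1", "0", "0", False),
    ("2", "0", "0", False),
    ("3", "1", "2", True),
    ("4", "0", "0", False),
    ("5", "3", "4", True),
]
trios = all_trios(pedigree)
```

Subject 3 appears as the case in the first trio and as the father in the second, so the two trios are not independent observations.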

4.2 One Pseudocontrol

The pseudocontrols are created using the non-transmitted alleles. For example, if the alleles of the case are A/A and the alleles of the parents are A/G and A/G, then the created pseudocontrol will have alleles G/G.
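The construction can be sketched in Python as follows. This is a minimal illustration for a single SNP with complete genotypes, not the PseudoCons implementation; the function names are hypothetical.

```python
def remove_one(genotype, allele):
    """Return the allele left after removing one copy of `allele`."""
    remaining = list(genotype)
    remaining.remove(allele)
    return remaining[0]

def pseudocontrol(case, father, mother):
    """Return the (paternal, maternal) non-transmitted alleles, or None.

    Genotypes are 2-tuples of allele names. None is returned when the case
    genotype cannot be explained by the parental genotypes.
    """
    # Try both assignments of the case alleles to father and mother.
    for from_father, from_mother in (case, case[::-1]):
        if from_father in father and from_mother in mother:
            return (remove_one(father, from_father),
                    remove_one(mother, from_mother))
    return None

result = pseudocontrol(("A", "A"), ("A", "G"), ("A", "G"))
```

For the example in the text the result is G/G. When the case and both parents are all heterozygous the parental origin of the case alleles is ambiguous, but either assignment gives the same unordered pseudocontrol genotype.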

4.3 Three Pseudocontrols

The three pseudocontrols are created using every possible genotype from the parents that contains a non-transmitted allele. For example, if the alleles of the case are A/A and the alleles of the parents are A/G and A/G, then the three created pseudocontrols will have alleles G/G, A/G and G/A.
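As a sketch of this construction in Python (an illustration for one SNP with complete genotypes, not the PseudoCons implementation; the function name is hypothetical):

```python
def three_pseudocontrols(case, father, mother):
    """Return the three (paternal, maternal) allele pairs other than the case's.

    Genotypes are 2-tuples of allele names. None is returned when the case
    genotype cannot be explained by the parental genotypes.
    """
    # Try both assignments of the case alleles to father and mother.
    for from_father, from_mother in (case, case[::-1]):
        if from_father in father and from_mother in mother:
            rest_father = list(father)
            rest_father.remove(from_father)
            rest_mother = list(mother)
            rest_mother.remove(from_mother)
            other_father, other_mother = rest_father[0], rest_mother[0]
            return [
                (other_father, other_mother),  # neither allele transmitted
                (from_father, other_mother),   # only the paternal allele transmitted
                (other_father, from_mother),   # only the maternal allele transmitted
            ]
    return None

result = three_pseudocontrols(("A", "A"), ("A", "G"), ("A", "G"))
```

For the example in the text this gives G/G, A/G and G/A, where the first allele of each pair is taken from the father and the second from the mother.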

4.4 Fifteen Pseudocontrols

Given two SNPs, the 15 pseudocontrols are created using any possible genotype pair from the parents that contains a non-transmitted allele.
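The count of 15 can be illustrated with the following Python sketch, which assumes that each parent can pass on either allele at each SNP independently, giving 4 x 4 = 16 genotype pairs, of which one is the pair transmitted to the case. This is an illustration under that assumption, not the PseudoCons implementation; the function names are hypothetical.

```python
from itertools import product

def four_genotypes(case, father, mother):
    """Four (paternal, maternal) allele pairs at one SNP, transmitted pair first."""
    for from_father, from_mother in (case, case[::-1]):
        if from_father in father and from_mother in mother:
            rest_father = list(father)
            rest_father.remove(from_father)
            rest_mother = list(mother)
            rest_mother.remove(from_mother)
            return [
                (from_father, from_mother),
                (from_father, rest_mother[0]),
                (rest_father[0], from_mother),
                (rest_father[0], rest_mother[0]),
            ]
    raise ValueError("case genotype is not consistent with the parents")

def fifteen_pseudocontrols(trio_snp1, trio_snp2):
    """Return 15 genotype pairs; each trio is (case, father, mother) at one SNP."""
    pairs = list(product(four_genotypes(*trio_snp1), four_genotypes(*trio_snp2)))
    return pairs[1:]  # drop the first pair, which is transmitted at both SNPs

trio = (("A", "A"), ("A", "G"), ("A", "G"))
pseudocontrols = fifteen_pseudocontrols(trio, trio)
```

Every remaining pair contains a non-transmitted allele at one or both SNPs.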

4.5 CPG and CEPG

The standard procedure for PseudoCons is to assume Conditional on Parental Genotypes (CPG) rather than the Conditional on Exchangeable Parental Genotypes (CEPG) [Cordell (2004), Weinberg and Shi (2009)]. It is also possible to assume CEPG with the option -cepg which will create an additional set of pseudocontrols, resulting in 3, 7 and 31 pseudocontrols for options -pc1, -pc3 and -pc15 respectively.

Using options -pc1 -cepg will result in the following pseudocontrols: (i) the usual pseudocontrol with the non-transmitted genotype; and (ii) the parental genotypes swapped to give a pseudocontrol for the non-transmitted genotype and a pseudocontrol for the transmitted genotype.

Using options -pc3 -cepg will result in the following pseudocontrols: (i) the usual 3 pseudocontrols with a genotype containing a non-transmitted allele; and (ii) the parental genotypes swapped to give another 4 pseudocontrols given by the possible transmitted genotypes.

Using options -pc15 -cepg will result in the following pseudocontrols: (i) the usual 15 pseudocontrols given by any possible genotype pair from the parents that contains a non-transmitted allele; and (ii) the parental genotypes swapped to give another 16 pseudocontrols given by any possible genotype pair.
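The counts described above follow a simple pattern: the additional set created by -cepg always contains one more pseudocontrol than the usual set. A small Python sketch of this arithmetic (the function name and the dictionary are hypothetical and used only for illustration):

```python
# Usual (CPG) number of pseudocontrols per trio for each option.
CPG_COUNTS = {"-pc1": 1, "-pc3": 3, "-pc15": 15}

def pseudocontrols_per_trio(option, cepg=False):
    """Total pseudocontrols per trio, with or without the -cepg option."""
    base = CPG_COUNTS[option]
    extra = base + 1 if cepg else 0  # extra set from swapping the parental genotypes
    return base + extra

totals = [pseudocontrols_per_trio(opt, cepg=True) for opt in ("-pc1", "-pc3", "-pc15")]
```

This gives 3, 7 and 31 pseudocontrols for -pc1, -pc3 and -pc15 respectively when -cepg is used.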

When the -cepg option is used the log file will include the following lines:

...
Number of pseudocontrols per trio: 1 + 2 = 3
Using Conditional on Exchangeable Parental Genotypes (CEPG)
...

showing the amended number of pseudocontrols and that CEPG has been assumed. If an information file is output when using the -cepg option then the extra pseudocontrols created will have the maternal and paternal genotypes exchanged. So for options -pc1, -pc3 and -pc15, pseudocontrols 2-3, 4-7 and 16-31 respectively will have maternal and paternal genotypes exchanged.

Warning: the options -pc3 and -pc1 -cepg both result in 3 pseudocontrols but are different. Do not get the two mixed up!

4.6 Information File

It is possible to output an extra information file for follow-up analysis. If the option -info info.dat is used then the text file info.dat is output with the following columns:

  1. Family ID. The same family ID as in the .fam file.
  2. Individual ID. The same individual ID as in the .fam file.
  3. Case ID. The corresponding case ID (child ID) for the pseudocontrol (or case ID repeated).
  4. Set ID. An ordinal number assigned to each set of pseudocontrols derived from the same case/parent trio.

If the option -info-ma info.dat is used then the text file info.dat also includes the following genotype information columns:

  1. Mother Genotypes. Each of the SNPs will have 2 columns denoting the genotype given by the two allele names.
  2. Case Genotypes. Following the mother genotypes are the case (or pseudocontrol) genotypes.

If the option -info-fa info.dat is used then the text file info.dat also includes the following genotype information columns:

  1. Father Genotypes. Each of the SNPs will have 2 columns denoting the genotype given by the two allele names.
  2. Case Genotypes. Following the father genotypes are the case (or pseudocontrol) genotypes.

If the option -info-fama info.dat is used then the text file info.dat also includes the following genotype information columns:

  1. Father Genotypes. Each of the SNPs will have 2 columns denoting the genotype given by the two allele names.
  2. Mother Genotypes. Following the father genotypes are the mother genotypes.
  3. Case Genotypes. Following the mother genotypes are the case (or pseudocontrol) genotypes.

The resultant text file may be very large if many SNPs are included, so a maximum of 20 SNPs is imposed by default. This can be changed with the -info-maxsnps option. The log file includes information on the exact column numbers for the genotypes. For example, if the -info-ma info.dat option is used with 10 SNPs the log file will include the lines:

...
Info file: info.dat
    Column 1: family ID
    Column 2: individual ID
    Column 3: corresponding case ID
    Column 4: pseudocontrol set ID
    Columns 5-24: mother genotype info
    Columns 25-44: child/pseudocontrol genotype info
...
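The column numbers in the log follow from each SNP taking 2 columns after the 4 ID columns. A small Python sketch of this arithmetic (the function name `genotype_columns` is hypothetical and not part of PseudoCons):

```python
def genotype_columns(n_snps, blocks, first_column=5):
    """Return (label, first column, last column) for each genotype block.

    Each block has 2 columns per SNP and the blocks follow the 4 ID columns.
    """
    layout = []
    start = first_column
    for label in blocks:
        end = start + 2 * n_snps - 1
        layout.append((label, start, end))
        start = end + 1
    return layout

layout = genotype_columns(
    10, ["mother genotype info", "child/pseudocontrol genotype info"]
)
for label, first, last in layout:
    print(f"Columns {first}-{last}: {label}")
```

With 10 SNPs this reproduces the ranges 5-24 and 25-44 shown in the log. By the same arithmetic, three genotype blocks (as with -info-fama) would be expected to occupy columns 5-24, 25-44 and 45-64.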

The information file will then look something like the following (using -pc1 with allele names 1 and 2):

10001 3 3 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1
10001 3-pseudo-1 3 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1
10002 3 3 2 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1
10002 3-pseudo-1 3 2 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 2 1
10003 3 3 3 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
10003 3-pseudo-1 3 3 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1
...