Computer Practical Exercises on Association Analysis in R

Exercise 1: Family-based association

Purpose

In this exercise you will be carrying out family-based association analysis of several loci in the HLA region with type 1 diabetes, using a set of case-parent trios. The purpose is test for association between genotype and disease.

Methodology

We will use the transmission disequilibrium test (TDT). The test will performed in the statistical analysis package R. As well as using standard functions implemented in R, we will also make use of special functions designed for genetic analysis. These can normally be downloaded as add-in R libraries, but as we do not have user administration rights on these computers, we will have to read them in outselves manually.

Some of the commands that you will type in order to do the analysis may seem a bit mystifying (!) if you are not familiar with R. Don't worry too much if you don't understand all the details. If you decide to use a statistical package such as R to analyse your own data, it will be important to learn how to use that particular package appropriately through attendance on a training course or careful reading of an introductory textbook.

R documentation:

The R website is at http://www.r-project.org/

David Clayton's add-in routines (the DGCgenetics package) can be found at http://www-gene.cimr.cam.ac.uk/clayton/software

From within R, one can obtain help on any command xxxx by typing "help(xxxx)"

Data overview

We will be using family data consisting of a number of trio families with an affected diabetic child plus parents (of unknown disease status) all of whom are typed at 5 polymorphisms in the HLA region.

Appropriate data

Appropriate data for this exercise is genotype data at a set of linked loci typed in a number of case-parent trios. It is also possible to use nuclear families or larger families with more affected individuals, however they will automatically be broken into trios for the analysis, and non-independence between cases from the same family would need to be accounted for e.g. by use of the +cluster(pedigree) option in R.

Input files

Open up (click on) the input data file:

fivelociRped.txt

The data is in standard pedigree file format, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=male, 2=female), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. The pedigree file used for the analysis in R differs slightly from a standard pedigree file. It has a header line describing the different columns, and it uses R's own missing value code "NA" instead of zero.

Take a look at the data file, and check that you understand how the data is coded. Then save the file as a .txt file. You should save it to the folder (directory) that you created for today's practical.

Note that the names given to the first 6 variables in the header line are not optional: they must be called

pedigree id id.father id.mother sex affected

for the add-in genetics utilities for the R package to work correctly.

Now save the required library files to the same folder:

RcomsfromRlibs.txt

Step-by-step instructions

Start up the R program, and change to the directory where you saved the data files. (Ask an instructor if you need help with this).

You are now working within the R package. To start with, you need to read in the necessary add-in libraries. Normally you would do this by typing:

library(DGCgenetics)

However, since we do not have write-access on these machines, you need to instead read in the libraries manually:

source("RcomsfromRlibs.txt")

To read your data into a dataframe called "family", type

family <- read.table("fivelociRped.txt", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe "family" and you can look at it simply by typing

family

or (better) by typing

fix(family)

(close this by clicking in the top right corner)

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

family$pedigree

Note that the family$pedigree variable, which is really a long vector, will get displayed as a list of values that wrap round horizontally. Check that you understand how this list of values relates to the pedigree column (vector) that you can see in the "family" dataframe when you type

fix(family)

It can be inconvenient to have to type "family$" before typing every variable, so you can tell R to automatically look at variables in the "family" dataframe by typing:

attach(family)

Now you can just look at the column of pedigree names by typing

pedigree

Similarly, you can look at the first allele for locus 1 by typing

loc1_1

Or the second allele for locus 1 by typing

loc1_2

Again, check that you understand how these values relates to the columns (vectors) that you can see in the "family" dataframe when you type

fix(family)

To perform association analysis, we need to convert the variables corresponding to the two alleles at each locus into a genotype variable for each locus. This can be done e.g. for locus 4 by:

g4 <- genotype(loc4_1, loc4_2)

To look at the variable you have just created, type:

g4

To perform a TDT analysis on this locus, type

tdt(g4)

This gives a count of the number of times allele 2 was transmitted or not transmitted, from heterozygous parents to affected offspring. The p value for the TDT test is 0.0150.

Repeat the analysis for loci 5

g5 <- genotype(loc5_1, loc5_2)

tdt(g5)


You should find a highly significant association for locus 5, contrasting with the less significant association at locus 4.

The TDT p value displayed is actually for an exact test (rather than for the originally proposed TDT chi-squared test). To find the TDT chi-squared statistic and its asymptotic p value for locus 4, type

tdt4result<-tdt(g4)

This saves the result of the TDT test into an R object called "tdt4result". To see the TDT chi-squared statistic, type:

tdt4result$statistic

This tells us that the TDT chi-squared statistic is 6.3226. To find out the p value, type:

1-pchisq(6.3226,1)

This tells us that the TDT chisquared statistic of 6.3226 has an asymptotic p value of 0.0119, quite similar to the exact p value (p=0.0149) we found.

Exercise 2: Case/control association

Purpose

In this exercise you will be carrying out chi-squared tests and logistic regression analysis of case/control data, to test for genetic association.

Methodology

The methodology used is standard methodology for case/control analysis (chi-squared tests and logistic regression) as described in the lectures. The tests will performed in the statistical analysis package R. As well as using standard functions implemented in R, we will also make use of the special functions designed for genetic analysis that can be downloaded as add-in routines to the R package.

Data overview

The data consists of genotype data at 4 linked SNP loci, typed in 384 type 1 diabetes cases and 672 controls.

Appropriate data

Appropriate data for this exercise is genotype data for a set of linked or unlinked loci typed in a group of unrelated affected individuals (cases) and in a group of unaffected or randomly chosen individuals from the same population (controls). Ideally a larger data set should be used than we consider here e.g. 1000 cases and 1000 controls might be considered a reasonable number.

Data format

The data for the 4 linked loci casecondata.txt is in a similar format to a standard linkage pedigree file, with columns corresponding to family id, subject id (within family), father's id, mother's id, sex (1=m, 2=f), affection status (1=unaffected, 2=affected) and one column for each allele for each locus genotype. The pedigree file used for the analysis in R differs from a standard pedigree file in that it has a header line describing the different columns, and it uses R's own missing value code "NA". Note that since this is case/control rather than family data, there is only one individual per family and everyone's parents are coded as unknown.

Take a look at the data files, and check that you understand how the data is coded. Then save the files as .txt files to the appropriate directory (folder) on your computer.

Step-by-step instructions

1. Chi-squared and logistic regression analysis

First clear the TDT data out of the memory in R, by typing

clear()

and reinstall the library files:

source("RcomsfromRlibs.txt")

To read in your data into R, type

casecon <- read.table("casecondata.txt", header=T)

The " <- " operator stands for "is defined as" (or "is assigned as") and is used a lot in R to create new variables, new dataframes or new R objects.

Here this command reads your data into what is called a dataframe, essentially a large matrix with columns corresponding to the different variables. You chose to name the dataframe 'casecon' and you can look at it simply by typing

fix(casecon)

Each variable can be accessed by using the name of the dataframe followed by a $ sign, followed by the variable name. E.g. to look at the column of pedigree names, you just type

casecon$pedigree

Check that you understand how the output relates to the column of pedigree names you saw when you used fix(casecon) (note that the column of pedigrees gets wrapped round when you look at it by typing casecon$pedigree ).

It can be inconvenient to have to type 'casecon$' before typing every variable, so you can tell R to automatically look at variables in the casecon dataframe by typing:

attach(casecon)

Now you can just look at the column of pedigree names by typing

pedigree

You should beware when using the 'attach' command as if you already have a different variable in R called 'pedigree', R will use this variable rather than the one in the attached dataframe - without telling you that there may be some ambiguity. Luckily we cleared all variables and data frames before starting, so this should not be an issue for us.

To perform the analysis we will first generate a variable called 'case' that codes affected/unaffected individuals as 1/0 rather than 2/1 (as they are coded in the 'affected' variable):

case <- affected-1

To look at the new case variable, type

case

To compare with the old 'affected' variable, type

affected

Now we will generate genotype variables from the two alleles at each locus:

g1 <- genotype(loc1_1, loc1_2)
g2 <- genotype(loc2_1, loc2_2)
g3 <- genotype(loc3_1, loc3_2)
g4 <- genotype(loc4_1, loc4_2)


Take a look at one of these e.g. by typing

g1

To perform chi-squared tests on the genotype variables (to see if they are associated with disease status), type:

table(g1,case)
chisq.test(g1,case)

table(g2,case)
chisq.test(g2,case)

table(g3,case)
chisq.test(g3,case)

table(g4,case)
chisq.test(g4,case)


All loci should show significant association, with locus 3 and 4 showing the highest significance.

To perfom a chromosome or allele-based analysis, you can use the commands:

allele.table(g1,case)
chisq.test(allele.table(g1,case))

allele.table(g2,case)
chisq.test(allele.table(g2,case))

allele.table(g3,case)
chisq.test(allele.table(g3,case))

allele.table(g4,case)
chisq.test(allele.table(g4,case))


The pattern of significance should be roughly the same as for the genotype-based analysis.

To perform a logistic regression analysis, we need to set the genotype variables up for either a 2df (genotype-based) analysis or a 1df (allele-based) analysis. Essentially this tells R how to code the different genotype variables (e.g. 0, 1, 2 for genotypes 1/1, 1/2, 2/2 for an allele-based analysis).

For the genotype-based analysis, we use what are called 'genotype' contrasts, and for the allele-based analysis we use 'additive' (on the log odds scale - equivalent to multiplicative on the odds scale) contrasts. So for the 2df test at the first locus we use the following commands:

gcontrasts(g1) <- "genotype"
logistic(case ~ g1)
anova(logistic(case ~ g1))

The 'logistic' command provides parameter estimates (odds ratios) and p values for each regression coefficient. The syntax logistic (y ~ x) means we regress an outcome variable y (here case/control status) on a predictor variable x (here genotype variable g1). The odds ratios are given for genotypes 1/1 and 2/2 relative to heterozygous genotype 1/2. (This is seen from the slightly confusing labels "g11/1" and "g12/2" which mean "g1 1/1" and "g1 2/2"). If one wanted instead to do the analysis relative to the genotype 1/1, for example, one would have to set up the genotype contrasts differently using the syntax gcontrasts(g1,base=1/1)<-"genotype" .

For locus 1, you should find that the odds ratio for genotype 1/1 (relative to 1/2) is 1.47, and the odds ratio for genotype 2/2 (relative to 1/2) is 0.65. This means that an individual with genotype 1/1 has 1.47 times the risk of disease compared to an individual with genotype 1/2, and an individual with genotype 2/2 has 0.65 times the risk of disease compared to an individual with genotype 1/2. So it looks as if 1/1 is the highest risk genotype and 2/2 is the lowest risk. Does this match up with what you saw in the chi-squared table analysis of variable g1?

The 'anova' command gives an overall test in the form of a 'deviance' and a 'df'. In this case we find a deviance of 18.49 on 2df. To find a p value for this, we compare it to a chi-squared distribution by typing:

1-pchisq(18.49,2)

How does the significance compare to what you saw in the chi-squared table analysis of variable g1? For a 1df allele-based test at the first locus, we use the following commands:

gcontrasts(g1) <- "additive"
logistic(case ~ g1)
anova(logistic(case ~ g1))
1-pchisq(18.47,1)


You should be able to work out how to do a similar analysis on either 1df or 2df for each of the other loci. You should find that the pattern of significance is the same as you found when doing the chi-squared tests on the genotype or allele tables.

To get out of R, type

q()

(No need to save the workspace)