Variant Annotation

Reading in vcf files in R

The vcf files that can be downloaded from the 1000 genomes using tabix can be read directly into R using the VariantAnnotation library, as an alternative to using vcftools

This post deals with some of the common operations that we might want to deal with.

Assuming that tabix is available on your computer we get a vcf file through tabix

tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 
            12:49687909-49693481 > output.vcf

And we then read this into R using VariantAnniotation

library(VariantAnnotation)
v <- readVcf(file="output.vcf", genome="hg19")
v

Which gives the following output

class: CollapsedVCF 
dim: 154 2504 
rowRanges(vcf):
 GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
 DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, M...
info(header(vcf)):
 Number Type Description 
 CIEND 2 Integer Confidence interval around END for imprecise...
 CIPOS 2 Integer Confidence interval around POS for imprecise...
 CS 1 String Source call set. 
 END 1 Integer End coordinate of this variant 
 IMPRECISE 0 Flag Imprecise structural variation 
 MC . String Merged calls. 
 MEINFO 4 String Mobile element info of the form NAME,START,E...
 MEND 1 Integer Mitochondrial end coordinate of inserted seq...
 MLEN 1 Integer Estimated length of mitochondrial insert 
 MSTART 1 Integer Mitochondrial start coordinate of inserted s...
 SVLEN . Integer SV length. It is only calculated for structu...
 SVTYPE 1 String Type of structural variant 
 TSD 1 String Precise Target Site Duplication for bases, i...
 AC A Integer Total number of alternate alleles in called ...
 AF A Float Estimated allele frequency in the range (0,1) 
 NS 1 Integer Number of samples with data 
 AN 1 Integer Total number of alleles in called genotypes 
 EAS_AF A Float Allele frequency in the EAS populations calc...
 EUR_AF A Float Allele frequency in the EUR populations calc...
 AFR_AF A Float Allele frequency in the AFR populations calc...
 AMR_AF A Float Allele frequency in the AMR populations calc...
 SAS_AF A Float Allele frequency in the SAS populations calc...
 DP 1 Integer Total read depth; only low coverage data wer...
 AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelTy...
 VT . String indicates what type of variant the line repr...
 EX_TARGET 0 Flag indicates whether a variant is within the ex...
 MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic 
geno(vcf):
 SimpleList of length 1: GT
geno(header(vcf)):
 Number Type Description
 GT 1 String Genotype

We can look at the positions within this object by using rowRanges

head(rowRanges(v), 5)

GRanges object with 5 ranges and 5 metadata columns:
            seqnames             ranges     strand | paramRangeID
              <Rle>           <IRanges>      <Rle> | <factor>
 esv3629454      12 [49675433, 49675433]         * | <NA>
 esv3629455      12 [49675487, 49675487]         * | <NA>
 rs527400822     12 [49687979, 49687979]         * | <NA>
 rs182846230     12 [49688004, 49688004]         * | <NA>
 rs570556251     12 [49688007, 49688007]         * | <NA>
                     REF            ALT       QUAL      FILTER
          <DNAStringSet> <CharacterList> <numeric> <character>
 esv3629454            G           <CN2>       100        PASS
 esv3629455            A           <CN2>       100        PASS
 rs527400822           C               T       100        PASS
 rs182846230           G               A       100        PASS
 rs570556251           G               T       100        PASS
 -------
 seqinfo: 86 sequences from hg19 genome

And we can look at the genotypes using geno. This gives a list of the genotype objects.  For my data this is the phased genotype.

geno(v)[[1]][100:106, 1:8]

Gives output

            HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105
rs73112143    "0|1"   "0|0"   "0|1"   "0|0"   "0|0"   "0|0"   "0|0"   "1|0" 
rs535948895   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0" 
rs552699967   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0" 
rs200141011   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0" 
rs544907182   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0" 
rs112146976   "0|0"   "0|0"   "0|1"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0" 
rs201966388   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"

We can further look at summaries of the different sites by using info(v).  This gives  a matrix with 27 columns (in my instance, but this is dependent on the particular vcf file).

 

1000 genomes to Haploview/ped format

The 1000 genomes contain phased data.  How do we extract this data into a usable format?  Note that only the these are not guaranteed to remove all variants that are not bi-allelic SNPs so the output may need to be run through another script.

Tabix, vcftools and my own R script

The 1000_genomes_script on my own site will do this

On the 1000 genomes web site

http://browser.1000genomes.org/Homo_sapiens/UserData/Haploview

Tabix, vcftools & plink

If we want to get phased data for the

# use tabix to extract the correct section of the file.  Here 
# I consider the chromosome.  Change the file name for other chromosomes
tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 1:1000000-2000000 > region1.vcf
# vcftools to change to plink format
vcftools --vcf region1.vcf --plink-tped --out 1000G_region1
# plink converts this file into a form for haploview
plink --noweb --tfile 1000G_region1 --recodeHV --out 1000G_region1.HV

Based on snippet here.

Genetic Correlations, Instrumental Variable Regression and Mendelian Randomisation

Placeholder for Mendelian randomisation talk and supporting material, that I presented to my group.

This was a discussion of how population stratification of various sorts could influence the results of Mendelian randomisation.

 

Links
Joe Pickrell:  What is Genetic Correlation?  Blog post

Software

LDSC

Databases

Pheno Scanner:  A database of human genotype-phenotype associations
Broad LD hub.