Reading in vcf files in R
The vcf files that can be downloaded from the 1000 genomes using tabix
can be read directly into R using the VariantAnnotation library, as an alternative to using vcftools
This post deals with some of the common operations that we might want to deal with.
Assuming that tabix is available on your computer we get a vcf file through tabix
tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 12:49687909-49693481 > output.vcf
And we then read this into R using VariantAnniotation
library(VariantAnnotation) v <- readVcf(file="output.vcf", genome="hg19") v
Which gives the following output
class: CollapsedVCF dim: 154 2504 rowRanges(vcf): GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER info(vcf): DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, M... info(header(vcf)): Number Type Description CIEND 2 Integer Confidence interval around END for imprecise... CIPOS 2 Integer Confidence interval around POS for imprecise... CS 1 String Source call set. END 1 Integer End coordinate of this variant IMPRECISE 0 Flag Imprecise structural variation MC . String Merged calls. MEINFO 4 String Mobile element info of the form NAME,START,E... MEND 1 Integer Mitochondrial end coordinate of inserted seq... MLEN 1 Integer Estimated length of mitochondrial insert MSTART 1 Integer Mitochondrial start coordinate of inserted s... SVLEN . Integer SV length. It is only calculated for structu... SVTYPE 1 String Type of structural variant TSD 1 String Precise Target Site Duplication for bases, i... AC A Integer Total number of alternate alleles in called ... AF A Float Estimated allele frequency in the range (0,1) NS 1 Integer Number of samples with data AN 1 Integer Total number of alleles in called genotypes EAS_AF A Float Allele frequency in the EAS populations calc... EUR_AF A Float Allele frequency in the EUR populations calc... AFR_AF A Float Allele frequency in the AFR populations calc... AMR_AF A Float Allele frequency in the AMR populations calc... SAS_AF A Float Allele frequency in the SAS populations calc... DP 1 Integer Total read depth; only low coverage data wer... AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelTy... VT . String indicates what type of variant the line repr... EX_TARGET 0 Flag indicates whether a variant is within the ex... MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic geno(vcf): SimpleList of length 1: GT geno(header(vcf)): Number Type Description GT 1 String Genotype
We can look at the positions within this object by using rowRanges
head(rowRanges(v), 5)
GRanges object with 5 ranges and 5 metadata columns: seqnames ranges strand | paramRangeID <Rle> <IRanges> <Rle> | <factor> esv3629454 12 [49675433, 49675433] * | <NA> esv3629455 12 [49675487, 49675487] * | <NA> rs527400822 12 [49687979, 49687979] * | <NA> rs182846230 12 [49688004, 49688004] * | <NA> rs570556251 12 [49688007, 49688007] * | <NA> REF ALT QUAL FILTER <DNAStringSet> <CharacterList> <numeric> <character> esv3629454 G <CN2> 100 PASS esv3629455 A <CN2> 100 PASS rs527400822 C T 100 PASS rs182846230 G A 100 PASS rs570556251 G T 100 PASS ------- seqinfo: 86 sequences from hg19 genome
And we can look at the genotypes using geno.
This gives a list of the genotype objects. For my data this is the phased genotype.
geno(v)[[1]][100:106, 1:8]
Gives output
HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 rs73112143 "0|1" "0|0" "0|1" "0|0" "0|0" "0|0" "0|0" "1|0" rs535948895 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" rs552699967 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" rs200141011 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" rs544907182 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" rs112146976 "0|0" "0|0" "0|1" "0|0" "0|0" "0|0" "0|0" "0|0" rs201966388 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0"
We can further look at summaries of the different sites by using info(v)
. This gives a matrix with 27 columns (in my instance, but this is dependent on the particular vcf file).