Reading in vcf files in R
The vcf files that can be downloaded from the 1000 genomes using tabix
can be read directly into R using the VariantAnnotation library, as an alternative to using vcftools
This post deals with some of the common operations that we might want to deal with.
Assuming that tabix is available on your computer we get a vcf file through tabix
tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
12:49687909-49693481 > output.vcf
And we then read this into R using VariantAnniotation
library(VariantAnnotation)
v <- readVcf(file="output.vcf", genome="hg19")
v
Which gives the following output
class: CollapsedVCF
dim: 154 2504
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, M...
info(header(vcf)):
Number Type Description
CIEND 2 Integer Confidence interval around END for imprecise...
CIPOS 2 Integer Confidence interval around POS for imprecise...
CS 1 String Source call set.
END 1 Integer End coordinate of this variant
IMPRECISE 0 Flag Imprecise structural variation
MC . String Merged calls.
MEINFO 4 String Mobile element info of the form NAME,START,E...
MEND 1 Integer Mitochondrial end coordinate of inserted seq...
MLEN 1 Integer Estimated length of mitochondrial insert
MSTART 1 Integer Mitochondrial start coordinate of inserted s...
SVLEN . Integer SV length. It is only calculated for structu...
SVTYPE 1 String Type of structural variant
TSD 1 String Precise Target Site Duplication for bases, i...
AC A Integer Total number of alternate alleles in called ...
AF A Float Estimated allele frequency in the range (0,1)
NS 1 Integer Number of samples with data
AN 1 Integer Total number of alleles in called genotypes
EAS_AF A Float Allele frequency in the EAS populations calc...
EUR_AF A Float Allele frequency in the EUR populations calc...
AFR_AF A Float Allele frequency in the AFR populations calc...
AMR_AF A Float Allele frequency in the AMR populations calc...
SAS_AF A Float Allele frequency in the SAS populations calc...
DP 1 Integer Total read depth; only low coverage data wer...
AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelTy...
VT . String indicates what type of variant the line repr...
EX_TARGET 0 Flag indicates whether a variant is within the ex...
MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic
geno(vcf):
SimpleList of length 1: GT
geno(header(vcf)):
Number Type Description
GT 1 String Genotype
We can look at the positions within this object by using rowRanges
head(rowRanges(v), 5)
GRanges object with 5 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
esv3629454 12 [49675433, 49675433] * | <NA>
esv3629455 12 [49675487, 49675487] * | <NA>
rs527400822 12 [49687979, 49687979] * | <NA>
rs182846230 12 [49688004, 49688004] * | <NA>
rs570556251 12 [49688007, 49688007] * | <NA>
REF ALT QUAL FILTER
<DNAStringSet> <CharacterList> <numeric> <character>
esv3629454 G <CN2> 100 PASS
esv3629455 A <CN2> 100 PASS
rs527400822 C T 100 PASS
rs182846230 G A 100 PASS
rs570556251 G T 100 PASS
-------
seqinfo: 86 sequences from hg19 genome
And we can look at the genotypes using geno.
This gives a list of the genotype objects. For my data this is the phased genotype.
geno(v)[[1]][100:106, 1:8]
Gives output
HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105
rs73112143 "0|1" "0|0" "0|1" "0|0" "0|0" "0|0" "0|0" "1|0"
rs535948895 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0"
rs552699967 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0"
rs200141011 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0"
rs544907182 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0"
rs112146976 "0|0" "0|0" "0|1" "0|0" "0|0" "0|0" "0|0" "0|0"
rs201966388 "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0" "0|0"
We can further look at summaries of the different sites by using info(v)
. This gives a matrix with 27 columns (in my instance, but this is dependent on the particular vcf file).