{"id":30,"date":"2017-02-15T15:37:52","date_gmt":"2017-02-15T15:37:52","guid":{"rendered":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/?p=30"},"modified":"2017-02-15T15:37:52","modified_gmt":"2017-02-15T15:37:52","slug":"variant-annotation","status":"publish","type":"post","link":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/2017\/02\/15\/variant-annotation\/","title":{"rendered":"Variant Annotation"},"content":{"rendered":"<h2>Reading in vcf files in R<\/h2>\n<p>The vcf files that can be downloaded from the 1000 genomes using <code>tabix<\/code>\u00a0can be read directly into R using the <a href=\"https:\/\/bioconductor.org\/packages\/release\/bioc\/html\/VariantAnnotation.html\">VariantAnnotation<\/a>\u00a0library, as an alternative to using <a href=\"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/2017\/01\/26\/1000-genomes-to-haploviewped-format\/\">vcftools<\/a><\/p>\n<p>This post deals with some of the common operations that we might want to deal with.<\/p>\n<p>Assuming that <a href=\"http:\/\/www.htslib.org\/doc\/tabix.html\">tabix\u00a0<\/a>is available on your computer we get a vcf file through tabix<\/p>\n<pre>tabix -fh ftp:\/\/ftp.1000genomes.ebi.ac.uk\/vol1\/ftp\/release\/20130502\/ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \r\n            12:49687909-49693481\u00a0&gt; output.vcf<\/pre>\n<p>And we then read this into R using VariantAnniotation<\/p>\n<pre>library(VariantAnnotation)\r\nv &lt;- readVcf(file=\"output.vcf\", genome=\"hg19\")\r\nv<\/pre>\n<p>Which gives the following output<\/p>\n<pre>class: CollapsedVCF \r\ndim: 154 2504 \r\nrowRanges(vcf):\r\n GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER\r\ninfo(vcf):\r\n DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, M...\r\ninfo(header(vcf)):\r\n Number Type Description \r\n CIEND 2 Integer Confidence interval around END for imprecise...\r\n CIPOS 2 Integer Confidence interval around POS for imprecise...\r\n CS 1 String Source call set. \r\n END 1 Integer End coordinate of this variant \r\n IMPRECISE 0 Flag Imprecise structural variation \r\n MC . String Merged calls. \r\n MEINFO 4 String Mobile element info of the form NAME,START,E...\r\n MEND 1 Integer Mitochondrial end coordinate of inserted seq...\r\n MLEN 1 Integer Estimated length of mitochondrial insert \r\n MSTART 1 Integer Mitochondrial start coordinate of inserted s...\r\n SVLEN . Integer SV length. It is only calculated for structu...\r\n SVTYPE 1 String Type of structural variant \r\n TSD 1 String Precise Target Site Duplication for bases, i...\r\n AC A Integer Total number of alternate alleles in called ...\r\n AF A Float Estimated allele frequency in the range (0,1) \r\n NS 1 Integer Number of samples with data \r\n AN 1 Integer Total number of alleles in called genotypes \r\n EAS_AF A Float Allele frequency in the EAS populations calc...\r\n EUR_AF A Float Allele frequency in the EUR populations calc...\r\n AFR_AF A Float Allele frequency in the AFR populations calc...\r\n AMR_AF A Float Allele frequency in the AMR populations calc...\r\n SAS_AF A Float Allele frequency in the SAS populations calc...\r\n DP 1 Integer Total read depth; only low coverage data wer...\r\n AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelTy...\r\n VT . String indicates what type of variant the line repr...\r\n EX_TARGET 0 Flag indicates whether a variant is within the ex...\r\n MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic \r\ngeno(vcf):\r\n SimpleList of length 1: GT\r\ngeno(header(vcf)):\r\n Number Type Description\r\n GT 1 String Genotype<\/pre>\n<p>We can look at the positions within this object by using <code>rowRanges<\/code><\/p>\n<p>head(rowRanges(v), 5)<\/p>\n<pre>GRanges object with 5 ranges and 5 metadata columns:\r\n            seqnames             ranges     strand | paramRangeID\r\n              &lt;Rle&gt;           &lt;IRanges&gt;      &lt;Rle&gt; | &lt;factor&gt;\r\n esv3629454      12 [49675433, 49675433]         * | &lt;NA&gt;\r\n esv3629455      12 [49675487, 49675487]         * | &lt;NA&gt;\r\n rs527400822     12 [49687979, 49687979]         * | &lt;NA&gt;\r\n rs182846230     12 [49688004, 49688004]         * | &lt;NA&gt;\r\n rs570556251     12 [49688007, 49688007]         * | &lt;NA&gt;\r\n                     REF            ALT       QUAL      FILTER\r\n          &lt;DNAStringSet&gt; &lt;CharacterList&gt; &lt;numeric&gt; &lt;character&gt;\r\n esv3629454            G           &lt;CN2&gt;       100        PASS\r\n esv3629455            A           &lt;CN2&gt;       100        PASS\r\n rs527400822           C               T       100        PASS\r\n rs182846230           G               A       100        PASS\r\n rs570556251           G               T       100        PASS\r\n -------\r\n seqinfo: 86 sequences from hg19 genome\r\n\r\n<\/pre>\n<p>And we can look at the genotypes using <code>geno.<\/code> This gives a list of the genotype objects. \u00a0For my data this is the phased genotype.<\/p>\n<pre>geno(v)[[1]][100:106, 1:8]<\/pre>\n<p>Gives output<\/p>\n<pre>            HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105\r\nrs73112143    \"0|1\"   \"0|0\"   \"0|1\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"1|0\" \r\nrs535948895   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\" \r\nrs552699967   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\" \r\nrs200141011   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\" \r\nrs544907182   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\" \r\nrs112146976   \"0|0\"   \"0|0\"   \"0|1\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\" \r\nrs201966388   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"   \"0|0\"<\/pre>\n<p>We can further look at summaries of the different sites by using <code>info(v)<\/code>. \u00a0This gives \u00a0a matrix with 27 columns (in my instance, but this is dependent on the particular vcf file).<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Reading in vcf files in R The vcf files that can be downloaded from the 1000 genomes using tabix\u00a0can be read directly into R using the VariantAnnotation\u00a0library, as an alternative to using vcftools This post deals with some of the &hellip; <a href=\"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/2017\/02\/15\/variant-annotation\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1219,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[6,8],"class_list":["post-30","post","type-post","status-publish","format-standard","hentry","category-howto","tag-1000-genomes","tag-bioconductor"],"_links":{"self":[{"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/posts\/30","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/users\/1219"}],"replies":[{"embeddable":true,"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/comments?post=30"}],"version-history":[{"count":1,"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/posts\/30\/revisions"}],"predecessor-version":[{"id":31,"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/posts\/30\/revisions\/31"}],"wp:attachment":[{"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/media?parent=30"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/categories?post=30"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.staff.ncl.ac.uk\/ianwilson\/wp-json\/wp\/v2\/tags?post=30"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}