<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Publicaciones on Jorge Alfredo Suazo Victoria</title><link>https://3lconejo.github.io/</link><description>Recent content in Publicaciones on Jorge Alfredo Suazo Victoria</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Mon, 10 Feb 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://3lconejo.github.io/index.xml" rel="self" type="application/rss+xml"/><item><title>Day 2: VEP and more bcftools analyses</title><link>https://3lconejo.github.io/posts/day2/</link><pubDate>Mon, 10 Feb 2025 00:00:00 +0000</pubDate><guid>https://3lconejo.github.io/posts/day2/</guid><description>&lt;h2 id="1-thinking-about-our-experiment">1. Thinking about our experiment&lt;/h2>
&lt;div class="collapsable-code">
&lt;input id="239546817" type="checkbox" />
&lt;label for="239546817">
&lt;span class="collapsable-code__language">bash&lt;/span>
&lt;span class="collapsable-code__toggle" data-label-expand="△" data-label-collapse="▽">&lt;/span>
&lt;/label>
&lt;pre class="language-bash" >&lt;code>
bcftools isec -C SRR445716.vcf.gz SRR445715.vcf.gz &amp;gt; \
present_in_IMW004_absent_in_CEN.PK113-7D.txt
&lt;/code>&lt;/pre>
&lt;/div>
&lt;pre tabindex="0">&lt;code>chrI 244 C CT 10
chrI 675 A G 10
chrI 1152 T G 10
chrI 1397 A G 10
chrI 1428 T C 10
chrI 1757 G T 10
chrI 2002 G T 10
chrI 2029 T C 10
chrI 2406 A C 10
chrI 12227 C T 10
&lt;/code>&lt;/pre>&lt;pre tabindex="0">&lt;code>About: Create intersections, unions and complements of VCF files.
Usage: bcftools isec [options] &amp;lt;A.vcf.gz&amp;gt; &amp;lt;B.vcf.gz&amp;gt; [...]
Options:
-c, --collapse STRING Treat as identical records with &amp;lt;snps|indels|both|all|some|none&amp;gt;, see man page for details [none]
-C, --complement Output positions present only in the first file but missing in the others
-e, --exclude EXPR Exclude sites for which the expression is true
-f, --apply-filters LIST Require at least one of the listed FILTER strings (e.g. &amp;#34;PASS,.&amp;#34;)
-i, --include EXPR Include only sites for which the expression is true
--no-version Do not append version and command line to the header
-n, --nfiles [+-=~]INT Output positions present in this many (=), this many or more (+), this many or fewer (-), the exact (~) files
-o, --output FILE Write output to a file [standard output]
-O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
-p, --prefix DIR If given, subset each of the input files accordingly, see also -w
-r, --regions REGION Restrict to comma-separated list of regions
-R, --regions-file FILE Restrict to regions listed in a file
--regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
-t, --targets REGION Similar to -r but streams rather than index-jumps
-T, --targets-file FILE Similar to -R but streams rather than index-jumps
--targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
--threads INT Use multithreading with &amp;lt;int&amp;gt; worker threads [0]
-w, --write LIST List of files to write with -p given as 1-based indexes. By default, all files are written
Examples:
# Create intersection and complements of two sets saving the output in dir/*
bcftools isec A.vcf.gz B.vcf.gz -p dir
# Filter sites in A and B (but not in C) and create intersection
bcftools isec -e&amp;#39;MAF&amp;lt;0.01&amp;#39; -i&amp;#39;dbSNP=1&amp;#39; -e - A.vcf.gz B.vcf.gz C.vcf.gz -p dir
# Extract and write records from A shared by both A and B using exact allele match
bcftools isec A.vcf.gz B.vcf.gz -p dir -n =2 -w 1
# Extract and write records from C found in A and C but not in B
bcftools isec A.vcf.gz B.vcf.gz C.vcf.gz -p dir -n~101 -w 3
# Extract records private to A or B comparing by position only
bcftools isec A.vcf.gz B.vcf.gz -p dir -n -1 -c all
&lt;/code>&lt;/pre>&lt;h2 id="question-17-can-you-think-of-a-way-to-obtain-a-list-of-candidates-that-may-underlie-the-ability-of-these-strains-to-grow-on-lactate-hint-you-can-assume-that-variants-shared-by-both-imw004-and-imw005-are-likely-to-have-arisen-before-the-start-of-the-experiment-ie-from-the-unsequenced-initial-jen1-delta-strain-and-therefore-are-not-biologically-interesting-how-many-variants-unfiltered-are-in-imw004-that-are-not-shared-by-any-other-strain">Question 17: Can you think of a way to obtain a list of candidates that may underlie the ability of these strains to grow on lactate? Hint: You can assume that variants shared by both IMW004 and IMW005 are likely to have arisen before the start of the experiment (i.e., from the unsequenced initial jen1 delta strain), and therefore are not biologically interesting. How many variants (unfiltered) are in IMW004 that are not shared by any other strain?&lt;/h2>
&lt;h3 id="my-guess">My guess&lt;/h3>
&lt;div class="collapsable-code">
&lt;input id="821396457" type="checkbox" />
&lt;label for="821396457">
&lt;span class="collapsable-code__language">bash&lt;/span>
&lt;span class="collapsable-code__toggle" data-label-expand="△" data-label-collapse="▽">&lt;/span>
&lt;/label>
&lt;pre class="language-bash" >&lt;code>
bcftools isec -C SRR445716.vcf.gz SRR445715.vcf.gz SRR445717.vcf.gz --output-type v -o IMW004_unique.vcf -w 1
bcftools isec -C SRR445717.vcf.gz SRR445715.vcf.gz SRR445716.vcf.gz --output-type v -o IMW005_unique.vcf -w 1
bgzip IMW004_unique.vcf
bgzip IMW005_unique.vcf
bcftools index IMW004_unique.vcf.gz
bcftools index IMW005_unique.vcf.gz
bcftools merge IMW004_unique.vcf.gz IMW005_unique.vcf.gz -o Lac_Uniques
&lt;/code>&lt;/pre>
&lt;/div>
&lt;h2 id="question-18-how-many-variants-remain-in-imw004-after-filtering">Question 18: How many variants remain in IMW004 after filtering?&lt;/h2>
&lt;div class="collapsable-code">
&lt;input id="769813452" type="checkbox" />
&lt;label for="769813452">
&lt;span class="collapsable-code__language">bash&lt;/span>
&lt;span class="collapsable-code__toggle" data-label-expand="△" data-label-collapse="▽">&lt;/span>
&lt;/label>
&lt;pre class="language-bash" >&lt;code>
bcftools filter -i&amp;#39;QUAL&amp;gt;=30 &amp;amp;&amp;amp; AD[*:1]&amp;gt;=50 &amp;amp;&amp;amp; type=&amp;#34;snp&amp;#34;&amp;#39; IMW004_unique.vcf.gz -o IMW004.flt.vcf
bcftools view -H IMW004.flt.vcf | wc -l
&lt;/code>&lt;/pre>
&lt;/div>
&lt;p>&lt;code>25&lt;/code>&lt;/p></description><content>&lt;h2 id="1-thinking-about-our-experiment">1. Thinking about our experiment&lt;/h2>
&lt;div class="collapsable-code">
&lt;input id="239546817" type="checkbox" />
&lt;label for="239546817">
&lt;span class="collapsable-code__language">bash&lt;/span>
&lt;span class="collapsable-code__toggle" data-label-expand="△" data-label-collapse="▽">&lt;/span>
&lt;/label>
&lt;pre class="language-bash" >&lt;code>
bcftools isec -C SRR445716.vcf.gz SRR445715.vcf.gz &amp;gt; \
present_in_IMW004_absent_in_CEN.PK113-7D.txt
&lt;/code>&lt;/pre>
&lt;/div>
&lt;pre tabindex="0">&lt;code>chrI 244 C CT 10
chrI 675 A G 10
chrI 1152 T G 10
chrI 1397 A G 10
chrI 1428 T C 10
chrI 1757 G T 10
chrI 2002 G T 10
chrI 2029 T C 10
chrI 2406 A C 10
chrI 12227 C T 10
&lt;/code>&lt;/pre>&lt;pre tabindex="0">&lt;code>About: Create intersections, unions and complements of VCF files.
Usage: bcftools isec [options] &amp;lt;A.vcf.gz&amp;gt; &amp;lt;B.vcf.gz&amp;gt; [...]
Options:
-c, --collapse STRING Treat as identical records with &amp;lt;snps|indels|both|all|some|none&amp;gt;, see man page for details [none]
-C, --complement Output positions present only in the first file but missing in the others
-e, --exclude EXPR Exclude sites for which the expression is true
-f, --apply-filters LIST Require at least one of the listed FILTER strings (e.g. &amp;#34;PASS,.&amp;#34;)
-i, --include EXPR Include only sites for which the expression is true
--no-version Do not append version and command line to the header
-n, --nfiles [+-=~]INT Output positions present in this many (=), this many or more (+), this many or fewer (-), the exact (~) files
-o, --output FILE Write output to a file [standard output]
-O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
-p, --prefix DIR If given, subset each of the input files accordingly, see also -w
-r, --regions REGION Restrict to comma-separated list of regions
-R, --regions-file FILE Restrict to regions listed in a file
--regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
-t, --targets REGION Similar to -r but streams rather than index-jumps
-T, --targets-file FILE Similar to -R but streams rather than index-jumps
--targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
--threads INT Use multithreading with &amp;lt;int&amp;gt; worker threads [0]
-w, --write LIST List of files to write with -p given as 1-based indexes. By default, all files are written
Examples:
# Create intersection and complements of two sets saving the output in dir/*
bcftools isec A.vcf.gz B.vcf.gz -p dir
# Filter sites in A and B (but not in C) and create intersection
bcftools isec -e&amp;#39;MAF&amp;lt;0.01&amp;#39; -i&amp;#39;dbSNP=1&amp;#39; -e - A.vcf.gz B.vcf.gz C.vcf.gz -p dir
# Extract and write records from A shared by both A and B using exact allele match
bcftools isec A.vcf.gz B.vcf.gz -p dir -n =2 -w 1
# Extract and write records from C found in A and C but not in B
bcftools isec A.vcf.gz B.vcf.gz C.vcf.gz -p dir -n~101 -w 3
# Extract records private to A or B comparing by position only
bcftools isec A.vcf.gz B.vcf.gz -p dir -n -1 -c all
&lt;/code>&lt;/pre>&lt;h2 id="question-17-can-you-think-of-a-way-to-obtain-a-list-of-candidates-that-may-underlie-the-ability-of-these-strains-to-grow-on-lactate-hint-you-can-assume-that-variants-shared-by-both-imw004-and-imw005-are-likely-to-have-arisen-before-the-start-of-the-experiment-ie-from-the-unsequenced-initial-jen1-delta-strain-and-therefore-are-not-biologically-interesting-how-many-variants-unfiltered-are-in-imw004-that-are-not-shared-by-any-other-strain">Question 17: Can you think of a way to obtain a list of candidates that may underlie the ability of these strains to grow on lactate? Hint: You can assume that variants shared by both IMW004 and IMW005 are likely to have arisen before the start of the experiment (i.e., from the unsequenced initial jen1 delta strain), and therefore are not biologically interesting. How many variants (unfiltered) are in IMW004 that are not shared by any other strain?&lt;/h2>
&lt;h3 id="my-guess">My guess&lt;/h3>
&lt;div class="collapsable-code">
&lt;input id="821396457" type="checkbox" />
&lt;label for="821396457">
&lt;span class="collapsable-code__language">bash&lt;/span>
&lt;span class="collapsable-code__toggle" data-label-expand="△" data-label-collapse="▽">&lt;/span>
&lt;/label>
&lt;pre class="language-bash" >&lt;code>
bcftools isec -C SRR445716.vcf.gz SRR445715.vcf.gz SRR445717.vcf.gz --output-type v -o IMW004_unique.vcf -w 1
bcftools isec -C SRR445717.vcf.gz SRR445715.vcf.gz SRR445716.vcf.gz --output-type v -o IMW005_unique.vcf -w 1
bgzip IMW004_unique.vcf
bgzip IMW005_unique.vcf
bcftools index IMW004_unique.vcf.gz
bcftools index IMW005_unique.vcf.gz
bcftools merge IMW004_unique.vcf.gz IMW005_unique.vcf.gz -o Lac_Uniques
&lt;/code>&lt;/pre>
&lt;/div>
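&lt;p>The count the question asks for is the number of non-header lines in the resulting VCF. A minimal sketch on a toy VCF (the records are illustrative only, and tabs are replaced by spaces for readability); the variable names &lt;code>toy_vcf&lt;/code> and &lt;code>n_variants&lt;/code> are mine:&lt;/p>
&lt;pre tabindex="0">&lt;code># Toy VCF: two header lines followed by three records (illustrative values only)
toy_vcf=$(printf '%s\n' \
  '##fileformat=VCFv4.2' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO' \
  'chrI 244 . C CT 50 . .' \
  'chrI 675 . A G 50 . .' \
  'chrI 1152 . T G 50 . .')
# Count the records, i.e. every line that does not start with "#"
n_variants=$(printf '%s\n' "$toy_vcf" | grep -vc '^#')
echo "$n_variants"
&lt;/code>&lt;/pre>
&lt;p>On the real file, the equivalent would be &lt;code>bcftools view -H IMW004_unique.vcf.gz | wc -l&lt;/code>.&lt;/p>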
&lt;h2 id="question-18-how-many-variants-remain-in-imw004-after-filtering">Question 18: How many variants remain in IMW004 after filtering?&lt;/h2>
&lt;div class="collapsable-code">
&lt;input id="769813452" type="checkbox" />
&lt;label for="769813452">
&lt;span class="collapsable-code__language">bash&lt;/span>
&lt;span class="collapsable-code__toggle" data-label-expand="△" data-label-collapse="▽">&lt;/span>
&lt;/label>
&lt;pre class="language-bash" >&lt;code>
bcftools filter -i&amp;#39;QUAL&amp;gt;=30 &amp;amp;&amp;amp; AD[*:1]&amp;gt;=50 &amp;amp;&amp;amp; type=&amp;#34;snp&amp;#34;&amp;#39; IMW004_unique.vcf.gz -o IMW004.flt.vcf
bcftools view -H IMW004.flt.vcf | wc -l
&lt;/code>&lt;/pre>
&lt;/div>
&lt;p>&lt;code>25&lt;/code>&lt;/p>
&lt;h2 id="question-19-how-many-variants-remain-in-imw005-after-filtering">Question 19: How many variants remain in IMW005 after filtering?&lt;/h2>
&lt;div class="collapsable-code">
&lt;input id="452316789" type="checkbox" />
&lt;label for="452316789">
&lt;span class="collapsable-code__language">bash&lt;/span>
&lt;span class="collapsable-code__toggle" data-label-expand="△" data-label-collapse="▽">&lt;/span>
&lt;/label>
&lt;pre class="language-bash" >&lt;code>
bcftools filter -i&amp;#39;QUAL&amp;gt;=30 &amp;amp;&amp;amp; AD[*:1]&amp;gt;=50 &amp;amp;&amp;amp; type=&amp;#34;snp&amp;#34;&amp;#39; IMW005_unique.vcf.gz -o IMW005.flt.vcf
bcftools view -H IMW005.flt.vcf | wc -l
&lt;/code>&lt;/pre>
&lt;/div>
&lt;p>&lt;code>6&lt;/code>&lt;/p>
&lt;h2 id="question-20-what-do-all-the-options-that-we-added-to-the-command-mean-hint-look-at-the-full-options-in-httpwwwensemblorginfodocstoolsvepscriptvep_optionshtml">Question 20: What do all the options that we added to the command mean? Hint: Look at the full options in &lt;a href="http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html">http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html&lt;/a>.&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>&lt;code>--cache&lt;/code>&lt;/strong>:&lt;br>
This option tells VEP to use locally cached data for annotation. Caching speeds up the annotation process by avoiding repeated queries to Ensembl&amp;rsquo;s online database.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>--dir_cache /home/drobles/.vep/&lt;/code>&lt;/strong>:&lt;br>
Specifies the directory where the cached files for VEP are stored. In this case, it is pointing to &lt;code>/home/drobles/.vep/&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>-i SRR445716_unique.flt.vcf&lt;/code>&lt;/strong>:&lt;br>
Specifies the input file for VEP. Here, the input is a VCF file named &lt;code>SRR445716_unique.flt.vcf&lt;/code>, which contains the variants to be annotated.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>-o SRR445716_unique.flt.vep.vcf&lt;/code>&lt;/strong>:&lt;br>
Specifies the output file name. VEP will write the annotated variants to &lt;code>SRR445716_unique.flt.vep.vcf&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>--vcf&lt;/code>&lt;/strong>:&lt;br>
This option tells VEP to produce output in VCF format. The annotated variants will be written as an updated VCF file.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>--species &amp;quot;saccharomyces_cerevisiae&amp;quot;&lt;/code>&lt;/strong>:&lt;br>
Specifies the species to be used for annotation. In this case, the annotation will be done for &lt;em>Saccharomyces cerevisiae&lt;/em> (yeast).&lt;/p>
&lt;/li>
&lt;/ul>
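&lt;p>Put together, the options above would give a command along these lines. This is a reconstruction from the option list, not necessarily the exact command from the exercise; the program name &lt;code>vep&lt;/code> and the shell variable names are assumptions:&lt;/p>
&lt;pre tabindex="0">&lt;code># Values taken from the option list above; adjust the paths to your own setup
cache_dir=/home/drobles/.vep/
input=SRR445716_unique.flt.vcf
output=SRR445716_unique.flt.vep.vcf
species=saccharomyces_cerevisiae
# Dry run: assemble and print the command instead of executing it
vep_cmd="vep --cache --dir_cache $cache_dir -i $input -o $output --vcf --species $species"
echo "$vep_cmd"
&lt;/code>&lt;/pre>
&lt;p>Running the printed line requires a working VEP installation and a yeast cache in the given directory.&lt;/p>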
&lt;h2 id="question-21-look-at-the-output-vcf-what-happened-to-the-original-vcf-did-vep-add-an-annotation-which-one">Question 21: Look at the output VCF. What happened to the original VCF? Did VEP add an annotation? Which one?&lt;/h2>
&lt;p>Yes. VEP kept the original records and added a &lt;code>CSQ&lt;/code> annotation to the INFO column, which is described by the following header line:&lt;/p>
&lt;div class="collapsable-code">
&lt;input id="893251647" type="checkbox" />
&lt;label for="893251647">
&lt;span class="collapsable-code__language">bash&lt;/span>
&lt;span class="collapsable-code__toggle" data-label-expand="△" data-label-collapse="▽">&lt;/span>
&lt;/label>
&lt;pre class="language-bash" >&lt;code>
##INFO=&amp;lt;ID=CSQ,Number=.,Type=String,Description=&amp;#34;Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID&amp;#34;&amp;gt;
&lt;/code>&lt;/pre>
&lt;/div>
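&lt;p>Each CSQ entry is a pipe-separated string whose fields follow the Format list in that header, so single fields can be pulled out by position. A minimal sketch with &lt;code>awk&lt;/code>; the sample entry is hand-written for illustration (most fields are left empty) and is not a line copied from the real output:&lt;/p>
&lt;pre tabindex="0">&lt;code># Hand-written CSQ entry with 23 pipe-separated fields (illustrative values only)
csq='C|missense_variant|MODERATE|ADY2|||||||||||252||gCt/gGt||||||'
# Field 4 = SYMBOL, 2 = Consequence, 3 = IMPACT, 15 = Protein_position, 17 = Codons
summary=$(printf '%s\n' "$csq" | awk -F'|' '{ print $4, $2, $3, $15, $17 }')
echo "$summary"
&lt;/code>&lt;/pre>
&lt;p>A variant with several annotations carries several such entries separated by commas, so a real INFO value has to be split on commas first.&lt;/p>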
&lt;h1 id="3-making-sense-of-our-results">3. Making sense of our results&lt;/h1>
&lt;p>Run VEP on both the IMW004 and IMW005 filtered VCF files and study the output carefully. Now write a program in your favourite language that outputs:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Which genes are mutated in either or both of the files&lt;/p>
&lt;/li>
&lt;li>
&lt;p>What mutation is present in what strain&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Did you find the original mutations found by the authors in the ADY2 gene?&lt;/p>
&lt;p>Question 22: Filter the consequences to only keep those that are either missense, stop gained, frameshift, splice acceptor or splice donor. These are typically the mutations that are predicted to directly affect protein function. How many genes are mutated with any of these consequences in both strains?&lt;/p>
&lt;div class="collapsable-code">
&lt;input id="489652371" type="checkbox" />
&lt;label for="489652371">
&lt;span class="collapsable-code__language">r&lt;/span>
&lt;span class="collapsable-code__toggle" data-label-expand="△" data-label-collapse="▽">&lt;/span>
&lt;/label>
&lt;pre class="language-r" >&lt;code>
library(vcfR)
library(tidyverse)
# CSQ subfield names, in the order given by the Format string of the CSQ header line
csq_fields &amp;lt;- c(&amp;#34;Allele&amp;#34;, &amp;#34;Consequence&amp;#34;, &amp;#34;IMPACT&amp;#34;, &amp;#34;SYMBOL&amp;#34;, &amp;#34;Gene&amp;#34;, &amp;#34;Feature_type&amp;#34;,
  &amp;#34;Feature&amp;#34;, &amp;#34;BIOTYPE&amp;#34;, &amp;#34;EXON&amp;#34;, &amp;#34;INTRON&amp;#34;, &amp;#34;HGVSc&amp;#34;, &amp;#34;HGVSp&amp;#34;, &amp;#34;cDNA_position&amp;#34;,
  &amp;#34;CDS_position&amp;#34;, &amp;#34;Protein_position&amp;#34;, &amp;#34;Amino_acids&amp;#34;, &amp;#34;Codons&amp;#34;,
  &amp;#34;Existing_variation&amp;#34;, &amp;#34;DISTANCE&amp;#34;, &amp;#34;STRAND&amp;#34;, &amp;#34;FLAGS&amp;#34;, &amp;#34;SYMBOL_SOURCE&amp;#34;,
  &amp;#34;HGNC_ID&amp;#34;)
# Read a VEP-annotated VCF and return one row per CSQ annotation
read_csq &amp;lt;- function(vcf_file) {
  vcf &amp;lt;- read.vcfR(vcf_file)
  vcf_data &amp;lt;- as.data.frame(vcf@fix)[-8] # Basic variant information, without the raw INFO column
  info_data &amp;lt;- vcfR::extract_info_tidy(vcf) # INFO fields (including CSQ) as columns
  vcf_data %&amp;gt;%
    bind_cols(info_data) %&amp;gt;%
    # Split the INFO annotations, focusing on the CSQ field (VEP annotations)
    separate_rows(CSQ, sep = &amp;#34;,&amp;#34;) %&amp;gt;%
    separate(CSQ, into = csq_fields, sep = &amp;#34;\\|&amp;#34;, fill = &amp;#34;right&amp;#34;)
}
csq16_data &amp;lt;- read_csq(&amp;#34;SRR445716_unique.flt.vep.vcf&amp;#34;) # Replace with the path to your file [IMW004]
csq17_data &amp;lt;- read_csq(&amp;#34;SRR445717_unique.flt.vep.vcf&amp;#34;) # Replace with the path to your file [IMW005]
# Find the genes common to csq16_data and csq17_data, compared by gene symbol
common_genes &amp;lt;- intersect(csq16_data$SYMBOL, csq17_data$SYMBOL)
# Show the common genes
common_genes
# Keep the rows of csq16_data that belong to the common genes
csq16_common &amp;lt;- csq16_data %&amp;gt;% filter(SYMBOL %in% common_genes)
# Keep the rows of csq17_data that belong to the common genes
csq17_common &amp;lt;- csq17_data %&amp;gt;% filter(SYMBOL %in% common_genes)
# View the results
csq16_common
csq17_common
csq16_mutations &amp;lt;- csq16_common %&amp;gt;%
  select(POS, REF, ALT, SYMBOL, Allele, Consequence, IMPACT, Codons, Protein_position)
csq17_mutations &amp;lt;- csq17_common %&amp;gt;%
  select(POS, REF, ALT, SYMBOL, Allele, Consequence, IMPACT, Codons, Protein_position)
# Sequence Ontology terms as VEP writes them in the Consequence field
consequences_of_interest &amp;lt;- c(&amp;#34;missense_variant&amp;#34;, &amp;#34;stop_gained&amp;#34;, &amp;#34;frameshift_variant&amp;#34;, &amp;#34;splice_acceptor_variant&amp;#34;, &amp;#34;splice_donor_variant&amp;#34;)
csq16_filtered &amp;lt;- csq16_mutations %&amp;gt;%
  filter(Consequence %in% consequences_of_interest)
csq17_filtered &amp;lt;- csq17_mutations %&amp;gt;%
  filter(Consequence %in% consequences_of_interest)
View(csq16_filtered)
View(csq17_filtered)
&lt;/code>&lt;/pre>
&lt;/div>
&lt;h2 id="imw004-mutations">IMW004 mutations&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align: left">POS&lt;/th>
&lt;th style="text-align: left">REF&lt;/th>
&lt;th style="text-align: left">ALT&lt;/th>
&lt;th style="text-align: left">SYMBOL&lt;/th>
&lt;th style="text-align: left">Allele&lt;/th>
&lt;th style="text-align: left">Consequence&lt;/th>
&lt;th style="text-align: left">IMPACT&lt;/th>
&lt;th style="text-align: left">Codons&lt;/th>
&lt;th style="text-align: left">Protein_position&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align: left">132370&lt;/td>
&lt;td style="text-align: left">G&lt;/td>
&lt;td style="text-align: left">C&lt;/td>
&lt;td style="text-align: left">ADY2&lt;/td>
&lt;td style="text-align: left">C&lt;/td>
&lt;td style="text-align: left">missense_variant&lt;/td>
&lt;td style="text-align: left">MODERATE&lt;/td>
&lt;td style="text-align: left">gCt/gGt&lt;/td>
&lt;td style="text-align: left">252&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="imw005-mutations">IMW005 mutations&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align: left">POS&lt;/th>
&lt;th style="text-align: left">REF&lt;/th>
&lt;th style="text-align: left">ALT&lt;/th>
&lt;th style="text-align: left">SYMBOL&lt;/th>
&lt;th style="text-align: left">Allele&lt;/th>
&lt;th style="text-align: left">Consequence&lt;/th>
&lt;th style="text-align: left">IMPACT&lt;/th>
&lt;th style="text-align: left">Codons&lt;/th>
&lt;th style="text-align: left">Protein_position&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align: left">132470&lt;/td>
&lt;td style="text-align: left">G&lt;/td>
&lt;td style="text-align: left">C&lt;/td>
&lt;td style="text-align: left">ADY2&lt;/td>
&lt;td style="text-align: left">C&lt;/td>
&lt;td style="text-align: left">missense_variant&lt;/td>
&lt;td style="text-align: left">MODERATE&lt;/td>
&lt;td style="text-align: left">Cta/Gta&lt;/td>
&lt;td style="text-align: left">219&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align: left">540930&lt;/td>
&lt;td style="text-align: left">C&lt;/td>
&lt;td style="text-align: left">G&lt;/td>
&lt;td style="text-align: left">&lt;/td>
&lt;td style="text-align: left">G&lt;/td>
&lt;td style="text-align: left">missense_variant&lt;/td>
&lt;td style="text-align: left">MODERATE&lt;/td>
&lt;td style="text-align: left">aCg/aGg&lt;/td>
&lt;td style="text-align: left">44&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table></content></item><item><title>Jorge Alfredo Suazo-Victoria's CV</title><link>https://3lconejo.github.io/posts/cv/</link><pubDate>Tue, 04 Feb 2025 00:00:00 +0000</pubDate><guid>https://3lconejo.github.io/posts/cv/</guid><description>&lt;style type="text/css">
#GIANT-BUTTON {
background-color: unset; /* Green */
border: 2px red solid;
color: white;
padding: 15px 32px;
text-align: center;
text-decoration: none;
display: inline-block;
font-size: 16px;
margin: 4px 2px;
cursor: pointer;
}
&lt;/style>
&lt;p>&lt;a href = "/posts/cvhtml">&lt;button id="GIANT-BUTTON">STYLIZED&lt;/button>&lt;/a>&lt;/p>
&lt;style>
:root{
--decorator-outer-offset-left: -6.5px;
}
&lt;/style>
&lt;h1 id="aside">Aside&lt;/h1>
&lt;h2 id="contact">Contact&lt;/h2>
&lt;ul>
&lt;li>&lt;i class='fa fa-envelope'>&lt;/i> &lt;a href="mailto:jasvpj@gmail.com">jasvpj@gmail.com&lt;/a>&lt;/li>
&lt;/ul>
&lt;br>
&lt;br>
&lt;h2 id="skills">Programming Languages&lt;/h2>
&lt;p>&lt;b>Expertise:&lt;/b> R and RStudio, Bash, AWK &lt;br>&lt;b>Familiarity:&lt;/b> Git/GitHub, Python&lt;/p>
&lt;br>
&lt;h2 id="skills">Languages&lt;/h2>
&lt;p>Spanish - Native &lt;br>
English - B2 (TOEFL-IBT) &lt;br>&lt;/p>
&lt;h2 id="disclaimer">Disclaimer&lt;/h2>
&lt;p>Made with the R package &lt;a href="https://github.com/rstudio/pagedown">&lt;strong>pagedown&lt;/strong>&lt;/a>.&lt;/p>
&lt;p>Based on &lt;a href="https://github.com/EveliaCoss/CV">EveliaCoss/CV&lt;/a> and powered by &lt;a href="https://github.com/nstrayer/cv">nstrayer/cv&lt;/a>.&lt;/p></description><content>&lt;style type="text/css">
#GIANT-BUTTON {
background-color: unset; /* Green */
border: 2px red solid;
color: white;
padding: 15px 32px;
text-align: center;
text-decoration: none;
display: inline-block;
font-size: 16px;
margin: 4px 2px;
cursor: pointer;
}
&lt;/style>
&lt;p>&lt;a href = "/posts/cvhtml">&lt;button id="GIANT-BUTTON">STYLIZED&lt;/button>&lt;/a>&lt;/p>
&lt;style>
:root{
--decorator-outer-offset-left: -6.5px;
}
&lt;/style>
&lt;h1 id="aside">Aside&lt;/h1>
&lt;h2 id="contact">Contact&lt;/h2>
&lt;ul>
&lt;li>&lt;i class='fa fa-envelope'>&lt;/i> &lt;a href="mailto:jasvpj@gmail.com">jasvpj@gmail.com&lt;/a>&lt;/li>
&lt;/ul>
&lt;br>
&lt;br>
&lt;h2 id="skills">Programming Languages&lt;/h2>
&lt;p>&lt;b>Expertise:&lt;/b> R and RStudio, Bash, AWK &lt;br>&lt;b>Familiarity:&lt;/b> Git/GitHub, Python&lt;/p>
&lt;br>
&lt;h2 id="skills">Languages&lt;/h2>
&lt;p>Spanish - Native &lt;br>
English - B2 (TOEFL-IBT) &lt;br>&lt;/p>
&lt;h2 id="disclaimer">Disclaimer&lt;/h2>
&lt;p>Made with the R package &lt;a href="https://github.com/rstudio/pagedown">&lt;strong>pagedown&lt;/strong>&lt;/a>.&lt;/p>
&lt;p>Based on &lt;a href="https://github.com/EveliaCoss/CV">EveliaCoss/CV&lt;/a> and powered by &lt;a href="https://github.com/nstrayer/cv">nstrayer/cv&lt;/a>.&lt;/p>
&lt;p>Last updated on 2025-02-04.&lt;/p>
&lt;h1 id="main">Main&lt;/h1>
&lt;h2 id="title">Jorge Alfredo Suazo Victoria&lt;/h2>
&lt;p>I’m a Genomic Sciences student at ENES Juriquilla, currently in my fourth semester.&lt;/p>
&lt;h2 data-icon="graduation-cap" data-concise="true" id="education">Education&lt;/h2>
&lt;h3 id="genomic-sciences-student">Genomic Sciences Student&lt;/h3>
&lt;p>ENES Juriquilla&lt;/p>
&lt;p>Queretaro, Mexico&lt;/p>
&lt;p>~ - 2023&lt;/p>
&lt;ul>
&lt;li>Currently Studying Fourth Semester&lt;/li>
&lt;/ul>
&lt;!-- These breaks just force a new page so section doesnt get cut off -->
&lt;br>
&lt;br>
&lt;br>
&lt;h2 data-icon="chalkboard-teacher" id="teaching-experience">Teaching Experience&lt;/h2>
&lt;h3 id="r-data-manipulation">R Data Manipulation&lt;/h3>
&lt;p>&lt;a href="https://viernesbioinformatica.github.io/">https://viernesbioinformatica.github.io/&lt;/a>&lt;/p>
&lt;p>Virtual event&lt;/p>
&lt;p>2025 - 2025&lt;/p>
&lt;ul>
&lt;li>Instructor At VBIOINFO&lt;/li>
&lt;/ul>
&lt;h2 data-icon="graduation-cap" data-concise="true" id="courses">Courses&lt;/h2>
&lt;h3 id="viernes-de-bioinformatica-parte-1-2024">Viernes de Bioinformatica Parte 1 (2024)&lt;/h3>
&lt;p>Helper&lt;/p>
&lt;p>Virtual event&lt;/p>
&lt;p>2024 - 2024&lt;/p>
&lt;ul>
&lt;li>Dr. Evelia Lorena Coss Navarrete (&lt;a href="mailto:ecoss@liigh.unam.mx">ecoss@liigh.unam.mx&lt;/a>) - Postdoc, LIIGH - UNAM
Dr. Israel Aguilar - Head of Bioinformatics, INMEGEN
Sofia Salazar - undergraduate student in Genomic Sciences, ENES Juriquilla - UNAM
LCG. Alejandra Schafer - PhD student, LIIGH - UNAM
Dr. Obed Ramirez - Chief Data Scientist (CDS) &amp;amp; Principal Investigator (PI), Solena Ag.
Dr. Domingo Martinez - Postdoc, LIIGH - UNAM
Dr. Shaday Guerrero - Postdoc, Centro de Ciencias Matemáticas - UNAM&lt;/li>
&lt;li>&lt;a href="https://github.com/EveliaCoss/ViernesBioinfo2024">https://github.com/EveliaCoss/ViernesBioinfo2024&lt;/a>&lt;/li>
&lt;li>Promote training and growth in bioinformatics through free courses for the entire interested community.&lt;/li>
&lt;/ul>
&lt;h3 id="viernes-de-bioinformatica-parte-2-2024">Viernes de Bioinformatica Parte 2 (2024)&lt;/h3>
&lt;p>Helper&lt;/p>
&lt;p>Virtual event&lt;/p>
&lt;p>2024 - 2024&lt;/p>
&lt;ul>
&lt;li>Dr. Israel Aguilar - Postdoc and course coordinator
Dr. Evelia Lorena Coss Navarrete - Postdoc and course coordinator
Dr. Joselyn Chávez - Rladies Cuernavaca, postdoctoral researcher at the Icahn School of Medicine at Mount Sinai
Dr. Haydeé Peruyero - Rladies Morelia, postdoctoral researcher at the Centro de Ciencias Matematicas (CCM)
Sofia Zorrilla - Rladies Morelia, MSc student in PCBIOL at ENES Morelia
Maria Jose Palma Martinez - Rladies Morelia, PhD student
Dr. Aline Pingarroni - Rladies Morelia, PI&lt;/li>
&lt;li>&lt;a href="https://github.com/EveliaCoss/ViernesBioinfo2024_parte2">https://github.com/EveliaCoss/ViernesBioinfo2024_parte2&lt;/a>&lt;/li>
&lt;li>Promote training and growth in bioinformatics through free courses for the entire interested community.&lt;/li>
&lt;/ul></content></item><item><title>Day One in: Variant calling and Ensembl VEP exercises - LCGEJ</title><link>https://3lconejo.github.io/posts/daniela-ejercicio/</link><pubDate>Sat, 01 Feb 2025 00:00:00 +0000</pubDate><guid>https://3lconejo.github.io/posts/daniela-ejercicio/</guid><description>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd /home/suaria/Documents/variant_calling/data
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> sample in SRR445715 SRR445716 SRR445717; &lt;span style="color:#66d9ef">do&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> samtools stats -r /home/suaria/Documents/variant_calling/data/S288C_ref.fa /home/suaria/Documents/variant_calling/data/&lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.aligned.sorted.bam &amp;gt; &lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.stats
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> plot-bamstats -r /home/suaria/Documents/variant_calling/data/other_files/S288C_ref.fa.gc -p &lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.graphs/ &lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.stats
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">done&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="question-1-what-is-the-percentage-of-mapped-reads-in-all-three-files-check-the-insert-size-gc-content-per-base-sequence-content-and-quality-per-cycle-graphs-do-they-all-look-reasonable">Question 1: What is the percentage of mapped reads in all three files? Check the insert size, GC content, per-base sequence content and quality per cycle graphs. Do they all look reasonable?&lt;/h2>
&lt;p>The percentage of mapped reads in all three files is:&lt;/p>
&lt;h3 id="srr445717">&lt;strong>SRR445717&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Total Reads:&lt;/strong> 13,730,526&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mapped Reads:&lt;/strong> 13,230,229 (96.4%)&lt;/p></description><content>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd /home/suaria/Documents/variant_calling/data
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> sample in SRR445715 SRR445716 SRR445717; &lt;span style="color:#66d9ef">do&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> samtools stats -r /home/suaria/Documents/variant_calling/data/S288C_ref.fa /home/suaria/Documents/variant_calling/data/&lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.aligned.sorted.bam &amp;gt; &lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.stats
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> plot-bamstats -r /home/suaria/Documents/variant_calling/data/other_files/S288C_ref.fa.gc -p &lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.graphs/ &lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.stats
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">done&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="question-1-what-is-the-percentage-of-mapped-reads-in-all-three-files-check-the-insert-size-gc-content-per-base-sequence-content-and-quality-per-cycle-graphs-do-they-all-look-reasonable">Question 1: What is the percentage of mapped reads in all three files? Check the insert size, GC content, per-base sequence content and quality per cycle graphs. Do they all look reasonable?&lt;/h2>
&lt;p>The percentage of mapped reads in all three files is:&lt;/p>
&lt;h3 id="srr445717">&lt;strong>SRR445717&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Total Reads:&lt;/strong> 13,730,526&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mapped Reads:&lt;/strong> 13,230,229 (96.4%)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mapped Bases:&lt;/strong> 660,554,158 (96.2%)&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Conclusion:&lt;/strong>&lt;/p>
&lt;p>96.4% of the reads are mapped, which is a very good percentage. This indicates that the majority of the sequences aligned correctly.&lt;/p>
&lt;p>&lt;strong>Insert Size:&lt;/strong>&lt;/p>
&lt;p>The peak is around 250 bp, which is normal for many paired-end sequencing libraries.&lt;/p>
&lt;p>&lt;strong>GC Content:&lt;/strong>&lt;/p>
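&lt;p>The graph shows a roughly normal distribution with a single peak, which is what we expect from a library made from a single organism. For yeast the peak should sit near 38-40% GC rather than 50%, as it does in the other two samples.&lt;/p>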
&lt;p>&lt;strong>Per-base Sequence Content:&lt;/strong>&lt;/p>
&lt;p>There is no sign of deviation in the nucleotide composition, suggesting good sample preparation.&lt;/p>
&lt;p>&lt;strong>Quality per Cycle:&lt;/strong>&lt;/p>
&lt;p>The Phred score is high for the majority of cycles, with a slight drop at the end, which is expected in Illumina sequencing.&lt;/p>
&lt;h3 id="srr445716">&lt;strong>SRR445716&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Total Reads:&lt;/strong> 12,870,162&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mapped Reads:&lt;/strong> 12,528,002 (96.6%)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mapped Bases:&lt;/strong> 620,450,305 (96.4%)&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Conclusion:&lt;/strong>&lt;/p>
&lt;p>This looks like a good percentage of mapped reads. It indicates that the majority of the sequences aligned correctly.&lt;/p>
&lt;p>&lt;strong>Insert Size:&lt;/strong>&lt;/p>
&lt;p>The peak is around 250 bp, which is normal for many paired-end sequencing libraries.&lt;/p>
&lt;p>&lt;strong>GC Content:&lt;/strong>&lt;/p>
&lt;p>The graph shows a roughly normal distribution; the peak is at 38.7% GC.&lt;/p>
&lt;p>&lt;strong>Per-base Sequence Content:&lt;/strong>&lt;/p>
&lt;p>There is no sign of deviation in the nucleotide composition, suggesting good sample preparation.&lt;/p>
&lt;p>&lt;strong>Quality per Cycle:&lt;/strong>&lt;/p>
&lt;p>The Phred score is high for the majority of cycles, with a slight drop at the end, which is expected in Illumina sequencing.&lt;/p>
&lt;h3 id="srr445715">&lt;strong>SRR445715&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Total Reads:&lt;/strong> 17,964,244&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mapped Reads:&lt;/strong> 17,503,811 (97.4%)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mapped Bases:&lt;/strong> 888,144,619 (96.9%)&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Conclusion:&lt;/strong>&lt;/p>
&lt;p>97.4% of the reads are mapped, which is a very good percentage. This indicates that the majority of the sequences aligned correctly.&lt;/p>
&lt;p>&lt;strong>Insert Size:&lt;/strong>&lt;/p>
&lt;p>The curve seems to have a slight irregularity at its peak.&lt;/p>
&lt;p>&lt;strong>GC Content:&lt;/strong>&lt;/p>
&lt;p>The graph shows a roughly normal distribution; the peak is at 40% GC.&lt;/p>
&lt;p>&lt;strong>Per-base Sequence Content:&lt;/strong>&lt;/p>
&lt;p>There is no sign of deviation in the nucleotide composition, suggesting good sample preparation.&lt;/p>
&lt;p>&lt;strong>Quality per Cycle:&lt;/strong>&lt;/p>
&lt;p>There is a slight drop and then a dip in the middle of the graph, suggesting a problem with the quality of the reads in those cycles.&lt;/p>
&lt;p>The error rate in this sample is higher than in the other two samples.&lt;/p>
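&lt;p>The mapped percentages above can be recomputed from the summary (SN) lines of each stats file. This is only a sketch: &lt;code>example.stats&lt;/code> is a hypothetical two-line excerpt filled with the SRR445717 counts quoted above, and it assumes the tab-separated labels &lt;code>raw total sequences:&lt;/code> and &lt;code>reads mapped:&lt;/code> that samtools stats writes in its SN section.&lt;/p>

```shell
# Hypothetical excerpt of the SN section of a samtools stats file
# (tab-separated), using the SRR445717 counts reported above.
printf 'SN\traw total sequences:\t13730526\nSN\treads mapped:\t13230229\n' > example.stats

# Read both counters and print the mapped percentage with one decimal.
pct=$(awk -F'\t' '
  $2 == "raw total sequences:" { total = $3 }
  $2 == "reads mapped:"        { mapped = $3 }
  END { printf "%.1f", 100 * mapped / total }
' example.stats)

echo "mapped: ${pct}%"
```

&lt;p>Pointing the same awk call at a real &lt;code>${sample}.stats&lt;/code> file should work unchanged, since it ignores every line whose second field is not one of the two labels.&lt;/p>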
&lt;h1 id="generating-a-pileup">Generating a pileup&lt;/h1>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>samtools mpileup -f /home/suaria/Documents/variant_calling/data/S288C_ref.fa /home/suaria/Documents/variant_calling/data/SRR445715.aligned.sorted.bam | less -S
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="question-2-what-is-the-read-depth-at-position-chri29519-what-is-the-reference-base">Question 2: What is the read depth at position chrI:29519? What is the reference base?&lt;/h2>
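&lt;p>The read depth is 56 and the reference base is A.&lt;/p>
&lt;p>Are there any non-reference bases?&lt;/p>
&lt;p>No. The base string contains only &lt;code>.&lt;/code> and &lt;code>,&lt;/code>, which mark matches to the reference on the forward and reverse strand, plus one &lt;code>$&lt;/code> marking the end of a read.&lt;/p>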
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Chromosome&lt;/th>
&lt;th>Position&lt;/th>
&lt;th>Reference Base&lt;/th>
&lt;th>Depth&lt;/th>
&lt;th>Bases&lt;/th>
&lt;th>Quality&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>chrI&lt;/td>
&lt;td>29519&lt;/td>
&lt;td>A&lt;/td>
&lt;td>56&lt;/td>
&lt;td>,$,,.,.,..,,,,..,..,..........,.,,.,,........,..........,&lt;/td>
&lt;td>BCB=A4BB9&amp;gt;BB@?A&amp;gt;B&amp;gt;B?BBBA@A?CB7?C8AB=@BBBCB=B@@@BBCAC?B00&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="question-3-what-about-at-position-chri29522-what-is-the-reference-base-are-there-any-non-reference-bases">Question 3: What about at position chrI:29522? What is the reference base? Are there any non-reference bases?&lt;/h2>
&lt;p>The reference base is T and the depth is 46. Every read shows an A (uppercase on the forward strand, lowercase on the reverse strand), so all of the bases are non-reference.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Chromosome&lt;/th>
&lt;th>Position&lt;/th>
&lt;th>Reference Base&lt;/th>
&lt;th>Depth&lt;/th>
&lt;th>Bases&lt;/th>
&lt;th>Quality&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>chrI&lt;/td>
&lt;td>29522&lt;/td>
&lt;td>T&lt;/td>
&lt;td>46&lt;/td>
&lt;td>aaaaAaAAaAAAAAAAAAAaAaaAaaAAAAAAAAaAAAAAAAAAAa&lt;/td>
&lt;td>8;??&amp;gt;:4BB@BABB;A=BABBCBBB?ABA=CABBBAAABC5CAB00&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
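&lt;p>One way to check a pileup base string like the one in this table is to count its characters by class. The sketch below is a simplification that only works for strings without indel annotations or &lt;code>^&lt;/code> read-start markers (true for this row); the variable names are made up for the example.&lt;/p>

```shell
# Base string copied from the chrI:29522 row above.
bases='aaaaAaAAaAAAAAAAAAAaAaaAaaAAAAAAAAaAAAAAAAAAAa'

# Uppercase letters are non-reference bases on the forward strand,
# lowercase letters are non-reference bases on the reverse strand,
# and "." or "," would be matches to the reference.
fwd=$(printf '%s' "$bases" | tr -cd 'ACGTN' | wc -c | tr -d ' ')
rev=$(printf '%s' "$bases" | tr -cd 'acgtn' | wc -c | tr -d ' ')
ref=$(printf '%s' "$bases" | tr -cd '.,' | wc -c | tr -d ' ')

echo "non-reference forward=$fwd reverse=$rev, reference matches=$ref"
```

&lt;p>Here the count of reference matches is zero, which agrees with reading the row by eye: every read disagrees with the reference T.&lt;/p>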
&lt;h1 id="3-generating-genotype-likelihoods-and-variant-calling">3. Generating genotype likelihoods and variant calling&lt;/h1>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools mpileup -f /home/suaria/Documents/variant_calling/data/S288C_ref.fa /home/suaria/Documents/variant_calling/data/SRR445715.aligned.sorted.bam | less -S
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;blockquote>
&lt;p>This is an intermediate output that contains genotype likelihoods (if you don’t remember what this is, go back to your notes on the Bayesian exercises we did!) [&lt;code>OK I will&lt;/code>]&lt;/p>
&lt;/blockquote>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools mpileup -f /home/suaria/Documents/variant_calling/data/S288C_ref.fa /home/suaria/Documents/variant_calling/data/SRR445715.aligned.sorted.bam | bcftools call -m --ploidy &lt;span style="color:#ae81ff">1&lt;/span> | less -S
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="question-4-study-the-command-why-did-we-use-these-settings-if-you-were-performing-variant-calling-in-human-data-what-settings-would-you-use">Question 4: Study the command. Why did we use these settings? If you were performing variant calling in human data, what settings would you use?&lt;/h2>
&lt;p>The pipe passes the pileup to &lt;code>bcftools call&lt;/code>. The &lt;code>-m&lt;/code> option selects the multiallelic caller, described in the help as &amp;ldquo;Alternative model for multiallelic and rare-variant calling (conflicts with -c)&amp;rdquo;, and &lt;code>--ploidy 1&lt;/code> tells the caller that the organism is haploid. For human data we would call with ploidy 2 (diploid).&lt;/p>
&lt;h2 id="question-5-what-option-should-we-add-to-only-print-variant-sites">Question 5: What option should we add to only print variant sites?&lt;/h2>
&lt;p>The option that we should add is -v.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> sample in SRR445715 SRR445716 SRR445717; &lt;span style="color:#66d9ef">do&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bcftools mpileup -a AD -f /home/suaria/Documents/variant_calling/data/S288C_ref.fa /home/suaria/Documents/variant_calling/data/&lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.aligned.sorted.bam -Ou | bcftools call -mv --ploidy &lt;span style="color:#ae81ff">1&lt;/span> -o &lt;span style="color:#e6db74">${&lt;/span>sample&lt;span style="color:#e6db74">}&lt;/span>.vcf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">done&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="question-6-what-is-the-reference-and-variant-base-at-position-chriv122724">Question 6: What is the reference and variant base at position chrIV:122724?&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Reference&lt;/th>
&lt;th>Variant&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>G&lt;/td>
&lt;td>A&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="question-7-what-is-the-total-read-depth-at-position-chriv122724">Question 7: What is the total read depth at position chrIV:122724?&lt;/h2>
&lt;p>DP=58&lt;/p>
&lt;h2 id="question-8-what-is-the-number-of-high-quality-forward-reads-supporting-the-variant-call-at-position-chriv122724-how-many-reads-support-the-reference-allele">Question 8: What is the number of high-quality forward reads supporting the variant call at position chrIV:122724? How many reads support the reference allele?&lt;/h2>
&lt;p>0&lt;/p>
&lt;h2 id="question-9-what-sort-of-event-is-happening-at-position-chri29007">Question 9: What sort of event is happening at position chrI:29007?&lt;/h2>
&lt;p>The record is flagged as INDEL: the event is a small insertion of a G base at this position.&lt;/p>
&lt;h2 id="4--variant-filtering">4.- Variant filtering&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools query -f&lt;span style="color:#e6db74">&amp;#39;POS = %POS\n&amp;#39;&lt;/span> SRR445715.vcf | head
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code>POS = 83
POS = 136
POS = 137
POS = 139
POS = 262
POS = 286
POS = 305
POS = 457
POS = 476
POS = 485
&lt;/code>&lt;/pre>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools query -f&lt;span style="color:#e6db74">&amp;#39;%POS %REF,%ALT\n&amp;#39;&lt;/span> SRR445715.vcf | head
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code>83 AG,A
136 G,A
137 C,CT
139 TCC,TCCCC
262 A,G
286 A,T
305 C,G
457 CAAA,CAA
476 G,T
485 T,C
&lt;/code>&lt;/pre>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bgzip SRR445715.vcf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bgzip SRR445716.vcf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bgzip SRR445717.vcf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bcftools index SRR445715.vcf.gz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bcftools index SRR445716.vcf.gz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bcftools index SRR445717.vcf.gz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bcftools merge -0 -o combined.vcf SRR445715.vcf.gz SRR445716.vcf.gz SRR445717.vcf.gz
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools query -f&lt;span style="color:#e6db74">&amp;#39;%POS %QUAL [%GT %AD ] %REF %ALT\n&amp;#39;&lt;/span> combined.vcf | head
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code>83 142.328 1 0,11 1 4,19 1 0,13 AG A
136 148.417 1 0,30 1 1,71 1 0,38 G A
137 110.074 1 10,9 1 23,24 1 6,18 C CT
139 42.4134 1 11,3 1 35,24 1 20,9 T TCC
244 48.0595 0 . 1 9,12 1 4,15 C CT
262 5.85486 1 11,13 0 . 0 . A G
286 223.417 1 0,43 1 0,88 1 0,42 A T
305 201.416 1 0,50 1 0,69 1 0,37 C G
457 35.4232 1 17,16 1 41,58 0 . CA C
476 160.421 1 0,32 1 0,44 1 0,24 G T
&lt;/code>&lt;/pre>&lt;h2 id="question-10-can-you-print-rows-with-qual-bigger-than-30-and-with-at-least-50-alternate-reads-for-this-we-will-need-to-query-the-second-value-of-the-ad-field-note-that-the-indexes-are-zero-based-the-first-ad-value-is-represented-as-ad0-therefore-the-second-value-must-be-queried-as-ad150-however-you-will-also-need-to-indicate-which-sample-to-look-at-to-look-at-any-sample-you-can-use-the-asterisk-eg-the-instruction-would-look-like-ad150-hint-if-you-get-stuck-look-at-the-examples-that-petr-danecek-pd3-explained-here-httpsgithubcomsamtoolsbcftoolsissues757">Question 10: Can you print rows with QUAL bigger than 30 and with at least 50 alternate reads? For this we will need to query the second value of the AD field. Note that the indexes are zero-based; the first AD value is represented as “AD[0]”, therefore the second value must be queried as “AD[1]&amp;gt;=50”. However, you will also need to indicate which sample to look at, to look at any sample you can use the asterisk (e.g. the instruction would look like “AD[*:1]&amp;gt;=50”) Hint: If you get stuck, look at the examples that Petr Danecek (pd3) explained here: &lt;a href="https://github.com/samtools/bcftools/issues/757">https://github.com/samtools/bcftools/issues/757&lt;/a>&lt;/h2>
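&lt;p>To see what this filter should select, the two conditions can be emulated with awk on the ten rows printed above. This is a sketch of the logic only, not of how bcftools evaluates expressions: &lt;code>rows.txt&lt;/code> and the variable names are invented for the example, and the column positions assume exactly three samples.&lt;/p>

```shell
# First ten rows of the combined.vcf query shown above:
# POS QUAL, then GT and AD for each of the three samples, then REF ALT.
printf '%s\n' \
  '83 142.328 1 0,11 1 4,19 1 0,13 AG A' \
  '136 148.417 1 0,30 1 1,71 1 0,38 G A' \
  '137 110.074 1 10,9 1 23,24 1 6,18 C CT' \
  '139 42.4134 1 11,3 1 35,24 1 20,9 T TCC' \
  '244 48.0595 0 . 1 9,12 1 4,15 C CT' \
  '262 5.85486 1 11,13 0 . 0 . A G' \
  '286 223.417 1 0,43 1 0,88 1 0,42 A T' \
  '305 201.416 1 0,50 1 0,69 1 0,37 C G' \
  '457 35.4232 1 17,16 1 41,58 0 . CA C' \
  '476 160.421 1 0,32 1 0,44 1 0,24 G T' > rows.txt

passing=$(awk '{
  any = 0
  # Columns 4, 6 and 8 hold AD as "ref,alt"; the second value is the ALT read count.
  for (i = 4; i != 10; i += 2) {
    split($i, ad, ",")
    if (ad[2] + 0 >= 50) any = 1
  }
  # Keep the row only when QUAL is above 30 and some sample passed the AD test.
  if ($2 + 0 > 30) if (any) out = out " " $1
}
END { print substr(out, 2) }' rows.txt)

echo "$passing"
```

&lt;p>Of these ten rows only positions 136, 286, 305 and 457 satisfy both conditions, so those are the first positions to expect from the bcftools command.&lt;/p>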
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools query -f&lt;span style="color:#e6db74">&amp;#39;%POS %QUAL [%GT %AD ] %REF %ALT\n&amp;#39;&lt;/span> -i&lt;span style="color:#e6db74">&amp;#39;QUAL&amp;gt;30 &amp;amp;&amp;amp; AD[*:1]&amp;gt;=50&amp;#39;&lt;/span> combined.vcf | head
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code>36 148.417 1 1,71 G A
286 223.417 1 0,88 A T
305 201.416 1 0,50 1 0,69 C G
457 35.4232 1 41,58 CA C
610 225.417 1 0,58 1 0,106 1 0,77 G A
633 225.168 1 33,102 1 20,69 T C
681 225.417 1 0,69 1 0,51 G A
686 185.809 1 21,66 1 13,51 A G
778 228.323 1 0,68 1 12,63 1 4,53 A G
1008 225.417 1 0,86 1 0,66 A G
&lt;/code>&lt;/pre>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools stats SRR445715.vcf.gz | grep TSTV | cut -f5
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code># TSTV, transitions/transversions:
[5]ts/tv
2.90
&lt;/code>&lt;/pre>&lt;h2 id="question-11-how-does-the-tstv-change-if-you-apply-the-filters-above-use-the-bcftools-stats-command-with-the--i-option-to-include-calls-with-qual-at-least-30-and-the-number-of-alternate-reads-at-least-50">Question 11: How does the ts/tv change if you apply the filters above? Use the bcftools stats command with the -i option to include calls with QUAL at least 30 and the number of alternate reads at least 50.&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools stats -i&lt;span style="color:#e6db74">&amp;#39;QUAL&amp;gt;=30 &amp;amp;&amp;amp; AD[*:1]&amp;gt;=50&amp;#39;&lt;/span> SRR445715.vcf.gz | grep TSTV | cut -f5
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code># TSTV, transitions/transversions:
[5]ts/tv
3.25
&lt;/code>&lt;/pre>
&lt;h2 id="question-12-what-is-the-tstv-of-removed-sites">Question 12: What is the ts/tv of removed sites?&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools stats -e &lt;span style="color:#e6db74">&amp;#39;QUAL&amp;gt;=30 &amp;amp;&amp;amp; AD[*:1]&amp;gt;=50&amp;#39;&lt;/span> SRR445715.vcf.gz | grep TSTV | cut -f5
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code># TSTV, transitions/transversions:
[5]ts/tv
2.25
&lt;/code>&lt;/pre>
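&lt;p>As a reminder of what the ts/tv number counts: a transition swaps a purine for a purine or a pyrimidine for a pyrimidine (A/G or C/T), and every other substitution is a transversion. The sketch below classifies the six SNPs from the first ten records of SRR445715.vcf listed earlier; it only illustrates the definition and says nothing about the genome-wide values reported by bcftools stats.&lt;/p>

```shell
# REF,ALT pairs of the SNPs among the first ten records of SRR445715.vcf
# (the indels in that listing are left out).
ts=0
tv=0
for pair in G,A A,G A,T C,G G,T T,C; do
  case "$pair" in
    A,G|G,A|C,T|T,C) ts=$((ts + 1)) ;;  # purine to purine or pyrimidine to pyrimidine
    *)               tv=$((tv + 1)) ;;  # any other substitution
  esac
done

echo "transitions=$ts transversions=$tv"
```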
&lt;h3 id="confusion">Confusion&lt;/h3>
&lt;p>Can you apply the above filters to produce a final callset and apply the -g and -G options to soft filter variants around indels? [Is this a real question? I thought the command below was the answer.]&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools filter -sLowQual -m+ -i&lt;span style="color:#e6db74">&amp;#39;QUAL&amp;gt;=30 &amp;amp;&amp;amp; AD[*:1]&amp;gt;=50&amp;#39;&lt;/span> -g8 -G10 combined.vcf -o combined.flt.vcf
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="question-13-can-you-see-why-position-chri1101-passed-these-filters-and-why-did-position-chri1107-fail-the-filtering-step">Question 13: Can you see why position chrI:1101 passed these filters? And why did position chrI:1107 fail the filtering step?&lt;/h2>
&lt;pre tabindex="0">&lt;code>chrI 1101 . G A 152.416 PASS VDB=0.316037;SGB=-0.69&amp;gt;
chrI 1107 . T C 86.1177 LowQual VDB=0.0031237;SGB=-0.6&amp;gt;
&lt;/code>&lt;/pre>&lt;blockquote>
&lt;p>INFO=&amp;lt;ID=VDB,Number=1,Type=Float,Description=&amp;quot;Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)&amp;quot;,Version=&amp;quot;3&amp;quot;&amp;gt;&lt;/p>
&lt;/blockquote>
&lt;p>I honestly don&amp;rsquo;t completely understand what VDB means, so I asked my really close friend.&lt;/p>
&lt;p>&lt;a href="#5-multi-sample-variant-calling">&lt;button>Skip This&lt;/button>&lt;/a>&lt;/p>
&lt;h2 id="chat-gpt">CHAT GPT&lt;/h2>
&lt;p>This line is from the &lt;strong>VCF (Variant Call Format)&lt;/strong> file header, and it describes the &lt;strong>INFO&lt;/strong> field called &lt;code>VDB&lt;/code>. Here&amp;rsquo;s a breakdown of what each part means:&lt;/p>
&lt;hr>
&lt;h3 id="1-field-components">&lt;strong>1. Field Components:&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>&lt;code>ID=VDB&lt;/code>&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This is the &lt;strong>identifier&lt;/strong> of the INFO field. In this case, &lt;code>VDB&lt;/code> stands for &lt;strong>Variant Distance Bias&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>Number=1&lt;/code>&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This indicates that the field contains &lt;strong>one value&lt;/strong> per variant.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>Type=Float&lt;/code>&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The value for &lt;code>VDB&lt;/code> is a &lt;strong>floating-point number&lt;/strong> (i.e., a decimal number).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>Description=&amp;quot;Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)&amp;quot;&lt;/code>&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This describes the purpose of the &lt;code>VDB&lt;/code> field:
&lt;ul>
&lt;li>&lt;strong>Variant Distance Bias (VDB)&lt;/strong> measures how evenly the variant-supporting reads are distributed across the sequence.&lt;/li>
&lt;li>It’s particularly useful for &lt;strong>filtering out artefacts&lt;/strong> in &lt;strong>RNA-seq data&lt;/strong>, such as false variants near &lt;strong>splice sites&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>“Bigger is better”&lt;/strong> means that a &lt;strong>higher VDB&lt;/strong> value suggests the variant is &lt;strong>more likely to be real&lt;/strong> (less biased), while a lower VDB might indicate an artefact.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>Version=&amp;quot;3&amp;quot;&lt;/code>&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This indicates the &lt;strong>version&lt;/strong> of the software or algorithm that generated this &lt;code>VDB&lt;/code> value.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h3 id="2-what-is-variant-distance-bias-vdb">&lt;strong>2. What Is Variant Distance Bias (VDB)?&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Context:&lt;/strong>&lt;br>
In RNA-seq data, reads can align imperfectly around &lt;strong>splice sites&lt;/strong> (where exons are joined), leading to &lt;strong>artefacts&lt;/strong>—false variants caused by misalignments rather than true biological variation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>VDB Calculation:&lt;/strong>&lt;br>
VDB measures how &lt;strong>evenly distributed&lt;/strong> the positions of variant-supporting reads are along the alignment:&lt;/p>
&lt;ul>
&lt;li>If reads supporting the variant are &lt;strong>clustered&lt;/strong> together (e.g., near a splice site), the VDB will be &lt;strong>low&lt;/strong>, suggesting a potential artefact.&lt;/li>
&lt;li>If reads are &lt;strong>spread out&lt;/strong> across the region, the VDB will be &lt;strong>high&lt;/strong>, indicating the variant is likely genuine.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Filtering Use:&lt;/strong>&lt;br>
You can use VDB to &lt;strong>filter out false positives&lt;/strong> in variant calling:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Low VDB&lt;/strong>: May indicate artefacts—consider filtering these out.&lt;/li>
&lt;li>&lt;strong>High VDB&lt;/strong>: More reliable variants.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h3 id="3-practical-example">&lt;strong>3. Practical Example:&lt;/strong>&lt;/h3>
&lt;p>Let’s say you’re filtering your VCF file and want to exclude variants with potential splice-site artefacts. You might set a &lt;strong>VDB threshold&lt;/strong> to only keep variants with higher values:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools view -i &lt;span style="color:#e6db74">&amp;#39;INFO/VDB &amp;gt; 0.05&amp;#39;&lt;/span> your_file.vcf &amp;gt; filtered_file.vcf
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here, variants with &lt;code>VDB &amp;gt; 0.05&lt;/code> are retained, assuming values below that might be artefacts.&lt;/p>
&lt;hr>
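&lt;p>Note that VDB is not part of the filter expression, so it cannot be what decided PASS or LowQual. The filter was &lt;code>QUAL&amp;gt;=30 &amp;amp;&amp;amp; AD[*:1]&amp;gt;=50&lt;/code>: position chrI:1101 passed because its QUAL (152.416) is above 30 and at least one sample has 50 or more alternate reads. Position chrI:1107 also has a QUAL above 30 (86.1177), so its LowQual label most likely means that no sample reaches 50 alternate reads there. Its much lower VDB (0.0031237 versus 0.316037) is consistent with a weaker call, but it was not tested by the filter.&lt;/p>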
&lt;h1 id="5-multi-sample-variant-calling">5. Multi-sample variant calling&lt;/h1>
&lt;h2 id="question-14-there-are-three-bam-files-in-the-original-directory-mntatgc-d2bioinfoiidroblesvariant_callingdata-sure-they-are-can-you-modify-the-command-from-section-3-to-use-all-three-bam-files-and-only-write-out-variant-sites-in-chromosome-i-write-the-output-to-a-compressed-bcf-file-called-multibcf-and-index-the-file-afterwards">Question 14: There are three BAM files in the original directory /mnt/atgc-d2/bioinfoII/drobles/variant_calling/data/ [sure they are]. Can you modify the command from section 3 to use all three BAM files and only write out variant sites in chromosome I? Write the output to a compressed BCF file called multi.bcf and index the file afterwards.&lt;/h2>
&lt;p>I already needed this earlier, but here&amp;rsquo;s the manual:&lt;/p>
&lt;pre tabindex="0">&lt;code>Usage: bcftools mpileup [options] in1.bam [in2.bam [...]]
Input options:
-6, --illumina1.3+ Quality is in the Illumina-1.3+ encoding
-A, --count-orphans Do not discard anomalous read pairs
-b, --bam-list FILE List of input BAM filenames, one per line
-B, --no-BAQ Disable BAQ (per-Base Alignment Quality)
-C, --adjust-MQ INT Adjust mapping quality [0]
-D, --full-BAQ Apply BAQ everywhere, not just in problematic regions
-d, --max-depth INT Max raw per-file depth; avoids excessive memory usage [250]
-E, --redo-BAQ Recalculate BAQ on the fly, ignore existing BQs
-f, --fasta-ref FILE Faidx indexed reference sequence file
--no-reference Do not require fasta reference file
-G, --read-groups FILE Select or exclude read groups listed in the file
-q, --min-MQ INT Skip alignments with mapQ smaller than INT [0]
-Q, --min-BQ INT Skip bases with baseQ/BAQ smaller than INT [1]
--max-BQ INT Limit baseQ/BAQ to no more than INT [60]
--delta-BQ INT Use neighbour_qual + INT if less than qual [30]
-r, --regions REG[,...] Comma separated list of regions in which pileup is generated
-R, --regions-file FILE Restrict to regions listed in a file
--ignore-RG Ignore RG tags (one BAM = one sample)
--ls, --skip-all-set STR|INT Skip reads with all of the bits set []
--ns, --skip-any-set STR|INT Skip reads with any of the bits set [UNMAP,SECONDARY,QCFAIL,DUP]
--lu, --skip-all-unset STR|INT Skip reads with all of the bits unset []
--nu, --skip-any-unset STR|INT Skip reads with any of the bits unset []
-s, --samples LIST Comma separated list of samples to include
-S, --samples-file FILE File of samples to include
-t, --targets REG[,...] Similar to -r but streams rather than index-jumps
-T, --targets-file FILE Similar to -R but streams rather than index-jumps
-x, --ignore-overlaps Disable read-pair overlap detection
--seed INT Random number seed used for sampling deep regions [0]
Output options:
-a, --annotate LIST Optional tags to output; &amp;#39;?&amp;#39; to list available tags []
-g, --gvcf INT[,...] Group non-variant sites into gVCF blocks according
To minimum per-sample DP
--no-version Do not append version and command line to the header
-o, --output FILE Write output to FILE [standard output]
-O, --output-type TYPE &amp;#39;b&amp;#39; compressed BCF; &amp;#39;u&amp;#39; uncompressed BCF;
&amp;#39;z&amp;#39; compressed VCF; &amp;#39;v&amp;#39; uncompressed VCF; 0-9 compression level [v]
-U, --mwu-u Use older probability scale for Mann-Whitney U test
--threads INT Use multithreading with INT worker threads [0]
SNP/INDEL genotype likelihoods options:
-X, --config STR Specify platform specific profiles (see below)
-e, --ext-prob INT Phred-scaled gap extension seq error probability [20]
-F, --gap-frac FLOAT Minimum fraction of gapped reads [0.05]
-h, --tandem-qual INT Coefficient for homopolymer errors [500]
-I, --skip-indels Do not perform indel calling
-L, --max-idepth INT Maximum per-file depth for INDEL calling [250]
-m, --min-ireads INT Minimum number gapped reads for indel candidates [2]
-M, --max-read-len INT Maximum length of read to pass to BAQ algorithm [500]
-o, --open-prob INT Phred-scaled gap open seq error probability [40]
-p, --per-sample-mF Apply -m and -F per-sample for increased sensitivity
-P, --platforms STR Comma separated list of platforms for indels [all]
--ar, --ambig-reads STR What to do with ambiguous indel reads: drop,incAD,incAD0 [drop]
--indel-bias FLOAT Raise to favour recall over precision [1.00]
--indel-size INT Approximate maximum indel size considered [110]
Configuration profiles activated with -X, --config:
1.12: -Q13 -h100 -m1 -F0.002
illumina: [ default values ]
ont: -B -Q5 --max-BQ 30 -I [also try eg |bcftools call -P0.01]
pacbio-ccs: -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 --delta-BQ 10 -M99999
Notes: Assuming diploid individuals.
Example:
# See also http://samtools.github.io/bcftools/howtos/variant-calling.html
bcftools mpileup -Ou -f reference.fa alignments.bam | bcftools call -mv -Ob -o calls.bcf
&lt;/code>&lt;/pre>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools mpileup -a AD -f /home/suaria/Documents/variant_calling/data/S288C_ref.fa /home/suaria/Documents/variant_calling/data/SRR445715.aligned.sorted.bam /home/suaria/Documents/variant_calling/data/SRR445716.aligned.sorted.bam /home/suaria/Documents/variant_calling/data/SRR445717.aligned.sorted.bam -r chrI -Ou | bcftools call -mv --ploidy &lt;span style="color:#ae81ff">1&lt;/span> -Ob -o multi.bcf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bcftools index multi.bcf
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;a href="https://3lconejo.github.io/posts/untitled/">&lt;button>Download The Binary&lt;/button>&lt;/a>&lt;/p>
&lt;h2 id="question-15-can-you-apply-the-same-filters-as-before-how-many-sites-pass-the-filters-write-the-output-to-a-bcf-file-called-multifiltbcf-and-index-the-file">Question 15: Can you apply the same filters as before? How many sites pass the filters? Write the output to a BCF file called multi.filt.bcf and index the file.&lt;/h2>
&lt;p>&lt;a href="https://3lconejo.github.io/multi.filt.bcf">&lt;button>Download The Filtered Binary&lt;/button>&lt;/a>&lt;/p>
&lt;p>Manual for &lt;code>filter&lt;/code>&lt;/p>
&lt;pre tabindex="0">&lt;code>About: Apply fixed-threshold filters.
Usage: bcftools filter [options] &amp;lt;in.vcf.gz&amp;gt;
Options:
-e, --exclude EXPR Exclude sites for which the expression is true (see man page for details)
-g, --SnpGap INT[:TYPE] Filter SNPs within &amp;lt;int&amp;gt; base pairs of an indel (the default) or any combination of indel,mnp,bnd,other,overlap
-G, --IndelGap INT Filter clusters of indels separated by &amp;lt;int&amp;gt; or fewer base pairs allowing only one to pass
-i, --include EXPR Include only sites for which the expression is true (see man page for details
--mask [^]REGION Soft filter regions, &amp;#34;^&amp;#34; to negate
-M, --mask-file [^]FILE Soft filter regions listed in a file, &amp;#34;^&amp;#34; to negate
--mask-overlap 0|1|2 Mask if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
-m, --mode [+x] &amp;#34;+&amp;#34;: do not replace but add to existing FILTER; &amp;#34;x&amp;#34;: reset filters at sites which pass
--no-version Do not append version and command line to the header
-o, --output FILE Write output to a file [standard output]
-O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
-r, --regions REGION Restrict to comma-separated list of regions
-R, --regions-file FILE Restrict to regions listed in a file
--regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
-s, --soft-filter STRING Annotate FILTER column with &amp;lt;string&amp;gt; or unique filter name (&amp;#34;Filter%d&amp;#34;) made up by the program (&amp;#34;+&amp;#34;)
-S, --set-GTs .|0 Set genotypes of failed samples to missing (.) or ref (0)
-t, --targets REGION Similar to -r but streams rather than index-jumps
-T, --targets-file FILE Similar to -R but streams rather than index-jumps
--targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
--threads INT Use multithreading with &amp;lt;int&amp;gt; worker threads [0]
&lt;/code>&lt;/pre>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools filter -i&lt;span style="color:#e6db74">&amp;#39;QUAL&amp;gt;=30 &amp;amp;&amp;amp; AD[*:1]&amp;gt;=50 &amp;amp;&amp;amp; type=&amp;#34;snp&amp;#34;&amp;#39;&lt;/span> multi.bcf -Ob -o multi.filt.bcf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span># Index the filtered BCF, as the question asks
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bcftools index multi.filt.bcf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bcftools view -H multi.filt.bcf | wc -l
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>811&lt;/code> sites pass the filters. Ta-da!&lt;/p>
&lt;h2 id="question-16-what-is-the-tstv-of-the-raw-calls-and-of-the-filtered-set">Question 16: What is the ts/tv of the raw calls and of the filtered set?&lt;/h2>
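&lt;p>As a reminder of what the ratio measures: transitions are A/G and C/T substitutions, transversions are every other single-base change, and ts/tv is the first count divided by the second. Here is a minimal sketch of that arithmetic on a few made-up REF/ALT pairs. The pairs and the helper name &lt;code>tstv_demo&lt;/code> are assumptions for illustration only, not values from the yeast calls.&lt;/p>

```shell
# Hypothetical helper: count transitions and transversions in toy REF/ALT pairs.
tstv_demo() {
  # Toy REF/ALT pairs, made up for illustration (not taken from multi.filt.bcf).
  printf '%s\n' "A G" "C T" "T C" "G A" "A C" "G T" |
  awk '
    {
      pair = $1 $2
      # Transitions swap purine with purine (A,G) or pyrimidine with pyrimidine (C,T).
      if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
      else tv++
    }
    END { printf "ts=%d tv=%d ts/tv=%.2f\n", ts, tv, ts / tv }
  '
}

tstv_demo
```

&lt;p>With four transitions and two transversions, the toy set prints &lt;code>ts=4 tv=2 ts/tv=2.00&lt;/code>.&lt;/p>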
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>bcftools stats multi.filt.bcf | grep TSTV | cut -f5
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="tstv-transitionstransversions-2">TSTV, transitions/transversions:&lt;/h3>
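&lt;p>The ts/tv ratio (column &lt;code>[5]ts/tv&lt;/code>) of the filtered set is &lt;code>2.56&lt;/code>. Running the same command on &lt;code>multi.bcf&lt;/code> gives the value for the raw calls.&lt;/p></content></item></channel></rss>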