Day 2: VEP and more bcftools analyses

2025-02-10 Jorge Alfredo Suazo Victoria

1. Thinking about our experiment

bash



bcftools isec -C SRR445716.vcf.gz SRR445715.vcf.gz \
    > present_in_IMW004_absent_in_CEN.PK113-7D.txt

chrI    244 C   CT  10
chrI    675 A   G   10
chrI    1152    T   G   10
chrI    1397    A   G   10
chrI    1428    T   C   10
chrI    1757    G   T   10
chrI    2002    G   T   10
chrI    2029    T   C   10
chrI    2406    A   C   10
chrI    12227   C   T   10
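Each row of this list is one site: chromosome, position, REF, ALT, and a flag string with one digit per input file (here `10`, meaning present in the first file and absent from the second). As a rough sketch, the ten rows shown above can be tallied into single-base substitutions and everything else with awk. The temporary file and the names `sites` and `summary` are only illustrative; on the real data the input would be present_in_IMW004_absent_in_CEN.PK113-7D.txt.

```shell
# Rebuild the ten rows shown above as a tab-separated file
# (columns: CHROM, POS, REF, ALT, presence flags).
sites=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    chrI 244 C CT 10 \
    chrI 675 A G 10 \
    chrI 1152 T G 10 \
    chrI 1397 A G 10 \
    chrI 1428 T C 10 \
    chrI 1757 G T 10 \
    chrI 2002 G T 10 \
    chrI 2029 T C 10 \
    chrI 2406 A C 10 \
    chrI 12227 C T 10 > "$sites"

# A row counts as a SNP when REF and ALT are both one base long;
# indels and multi-allelic ALT strings (e.g. "A,C") fall into "other".
summary=$(awk -F'\t' '
    { if (length($3) == 1 && length($4) == 1) snp++; else other++ }
    END { printf "snps=%d other=%d total=%d\n", snp, other, NR }
' "$sites")
echo "$summary"
```

For these ten rows the only non-SNP is the C to CT insertion at position 244.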

About:   Create intersections, unions and complements of VCF files.
Usage:   bcftools isec [options] <A.vcf.gz> <B.vcf.gz> [...]

Options:
    -c, --collapse STRING          Treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
    -C, --complement               Output positions present only in the first file but missing in the others
    -e, --exclude EXPR             Exclude sites for which the expression is true
    -f, --apply-filters LIST       Require at least one of the listed FILTER strings (e.g. "PASS,.")
    -i, --include EXPR             Include only sites for which the expression is true
        --no-version               Do not append version and command line to the header
    -n, --nfiles [+-=~]INT         Output positions present in this many (=), this many or more (+), this many or fewer (-), the exact (~) files
    -o, --output FILE              Write output to a file [standard output]
    -O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
    -p, --prefix DIR               If given, subset each of the input files accordingly, see also -w
    -r, --regions REGION           Restrict to comma-separated list of regions
    -R, --regions-file FILE        Restrict to regions listed in a file
        --regions-overlap 0|1|2    Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
    -t, --targets REGION           Similar to -r but streams rather than index-jumps
    -T, --targets-file FILE        Similar to -R but streams rather than index-jumps
        --targets-overlap 0|1|2    Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
        --threads INT              Use multithreading with <int> worker threads [0]
    -w, --write LIST               List of files to write with -p given as 1-based indexes. By default, all files are written

Examples:
   # Create intersection and complements of two sets saving the output in dir/*
   bcftools isec A.vcf.gz B.vcf.gz -p dir

   # Filter sites in A and B (but not in C) and create intersection
   bcftools isec -e'MAF<0.01' -i'dbSNP=1' -e - A.vcf.gz B.vcf.gz C.vcf.gz -p dir

   # Extract and write records from A shared by both A and B using exact allele match
   bcftools isec A.vcf.gz B.vcf.gz -p dir -n =2 -w 1

   # Extract and write records from C found in A and C but not in B
   bcftools isec A.vcf.gz B.vcf.gz C.vcf.gz -p dir -n~101 -w 3

   # Extract records private to A or B comparing by position only
   bcftools isec A.vcf.gz B.vcf.gz -p dir -n -1 -c all

Question 17: Can you think of a way to obtain a list of candidates that may underlie the ability of these strains to grow on lactate? Hint: You can assume that variants shared by both IMW004 and IMW005 are likely to have arisen before the start of the experiment (i.e., from the unsequenced initial jen1 delta strain), and therefore are not biologically interesting. How many variants (unfiltered) are in IMW004 that are not shared by any other strain?

My guess

bash


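# Variants private to IMW004 (SRR445716): absent from CEN.PK113-7D (SRR445715) and IMW005 (SRR445717)
bcftools isec -C SRR445716.vcf.gz SRR445715.vcf.gz SRR445717.vcf.gz --output-type v -o IMW004_unique.vcf -w 1

# Variants private to IMW005 (SRR445717)
bcftools isec -C SRR445717.vcf.gz SRR445715.vcf.gz SRR445716.vcf.gz --output-type v -o IMW005_unique.vcf -w 1

# Compress and index both files so bcftools merge can read them
bgzip IMW004_unique.vcf
bgzip IMW005_unique.vcf
bcftools index IMW004_unique.vcf.gz
bcftools index IMW005_unique.vcf.gz

# Combine the two private sets into one candidate list (uncompressed VCF by default)
bcftools merge IMW004_unique.vcf.gz IMW005_unique.vcf.gz -o Lac_Uniques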
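Question 17 also asks for a count, which the commands above do not print. A minimal sketch of one way to get it is to count the non-header lines of the compressed VCF. With bcftools this would be `bcftools view -H IMW004_unique.vcf.gz | wc -l`; the version below uses plain gzip and grep instead (bgzip output is gzip-compatible) so it can be tried on a toy file. The function name `count_records` and the three-record toy VCF are hypothetical, built from the first rows of the site list shown earlier.

```shell
# Count data records (non-header lines) in a gzip- or bgzip-compressed VCF.
count_records() {
    gzip -dc "$1" | grep -vc '^#'
}

# Toy VCF for illustration only; the real input would be IMW004_unique.vcf.gz.
toy=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chrI\t244\t.\tC\tCT\t.\t.\t.\n'
    printf 'chrI\t675\t.\tA\tG\t.\t.\t.\n'
    printf 'chrI\t1152\t.\tT\tG\t.\t.\t.\n'
} | gzip -c > "$toy/toy.vcf.gz"

n_records=$(count_records "$toy/toy.vcf.gz")
echo "$n_records"
```

Running `count_records IMW004_unique.vcf.gz` in the working directory would give the unfiltered number of variants private to IMW004.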

Question 18: How many variants remain in IMW004 after filtering?

bash


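# Keep only SNPs with QUAL >= 30 and at least 50 reads supporting the first ALT allele (AD index 1) in any sample
bcftools filter -i'QUAL>=30 && AD[*:1]>=50 && type="snp"' IMW004_unique.vcf.gz -o IMW004.flt.vcf

# Count the records that pass (-H drops the header)
bcftools view -H IMW004.flt.vcf | wc -l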

25

Jorge Alfredo Suazo-Victoria’s CV

2025-02-04 Jorge Alfredo

Aside

Contact

jasvpj@gmail.com

Programming Languages

Expertise: R and RStudio, Bash, AWK
Familiarity: Git/GitHub, Python

Languages

Spanish - Native
English - B2 (TOEFL iBT)

Disclaimer

Made with the R package pagedown.

Based on EveliaCoss/CV and powered by nstrayer/cv.

Day One in: Variant calling and Ensembl VEP exercises - LCGEJ

2025-02-01 Jorge Alfredo Suazo Victoria


# Work from the directory that holds the reference and the BAM files
data=/home/suaria/Documents/variant_calling/data
cd "$data"

for sample in SRR445715 SRR445716 SRR445717; do
    # Alignment statistics; the reference is needed for GC-depth and mismatches-per-cycle
    samtools stats -r "$data/S288C_ref.fa" "$data/${sample}.aligned.sorted.bam" > "${sample}.stats"
    # Plots from the stats file; -r takes the GC-content file precomputed from the reference
    plot-bamstats -r "$data/other_files/S288C_ref.fa.gc" -p "${sample}.graphs/" "${sample}.stats"
done

Question 1: What is the percentage of mapped reads in all three files? Check the insert size, GC content, per-base sequence content and quality per cycle graphs. Do they all look reasonable?

The percentage of mapped reads in all three files is:

SRR445717

Total Reads: 13,730,526
Mapped Reads: 13,230,229 (96.4%)
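The percentage above can be recomputed from the summary-number (SN) lines of a samtools stats file. This is a minimal sketch, assuming "Total Reads" corresponds to the `raw total sequences` field and "Mapped Reads" to `reads mapped`. The two-line stats file is a stand-in built from the numbers above, and `mapped_pct` is a hypothetical helper name.

```shell
# Print the percentage of mapped reads from a samtools stats file.
mapped_pct() {
    awk -F'\t' '
        $1 == "SN" && $2 == "raw total sequences:" { total = $3 }
        $1 == "SN" && $2 == "reads mapped:"        { mapped = $3 }
        END { if (total > 0) printf "%.1f%%\n", 100 * mapped / total }
    ' "$1"
}

# Stand-in for SRR445717.stats containing only the two fields needed
stats=$(mktemp)
printf 'SN\traw total sequences:\t13730526\n' >  "$stats"
printf 'SN\treads mapped:\t13230229\n'        >> "$stats"

pct=$(mapped_pct "$stats")
echo "$pct"   # prints 96.4%
```

On the real files, calling `mapped_pct "${sample}.stats"` inside the same `for sample in ...` loop would report one percentage per sample.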
