Use this link for starting AWS instance: http://www.google.com/url?q=http%3A%2F%2Fbit.ly%2Fevomics2014&sa=D&sntz=1&usg=AFQjCNHw6fapggOZ33Pu5k5cSJeiduXRZg
Today’s lab topic: http://evomics.org/learning/quality-assessment-and-control-of-sequence-data/
Quality control is important: sequencing, library prep
Given sequence files from instruments, quality scores, Phred – Wiki is good: http://en.wikipedia.org/wiki/FASTQ_format
Quiz: What are phred scores (and probability of error) of first bases of sequence?
(Only latest Illumina data has stars at the end.)
Generate quality plots, quality vs. position in read (bp). Towards end of sequence, quality drops.
– trim to certain length
– trim bad bases
– discard bad quality reads
Stacks has program to clean data and remove low-quality reads.
Programs for data cleaning: list: FASTX-Toolkit, other programs (links on powerpoint)
Nucleotide composition, are they what you expect? Consider barcodes, restriction enzymes.
Problem Set, 3 Illumina datasets: “The goal of this exercise is to inspect the sequence data of an Illumina run.”
bartonella_illumina.fastq: a small subset of an Illumina run (v.1.5) belonging to a bovine isolate of the intracellular bacterium Bartonella
1. Launch FastQC by typing fastqc in the terminal window. (If you want to run fastqc in commandline, run:
Then, send commands to outfile to open and view results.
2. Load the bartonella_illumina.fastq file into FastQC (File->Open). You can view the results either within the FastQC application or the exported report.
3. Inspect the data contained in the sequence file, bartonella_illumina.fastq. Have a look at the numbers output on the “Basic Statistics” page. How many sequences do we have?
What is the sequence length?
And the GC content?
%GC = 37
Examine the “Per base sequence quality” and ”Per sequence quality scores” pages.
Roughly, how many incorrect base calls are expected at most positions?
Do you think this run gave good quality sequences?
Yes, towards the upper range (40)
Examine the “Per base sequence content”, “Per base GC content” and “Per sequence GC content” pages. FastQC points out a “potential problem” with an orange exclamation mark. Do you think we should worry about it in this particular case? Hint.
It depends on the dataset, if the results are expected.
Examine the “Overrepresented sequences” page. Why does FastQC give a warning message?
Possible adapter sequence, look this up to see if it makes sense to be in the dataset.
EMP_Misc_16v4EMP_NoIndex_L005_R1_001-sample.fastq.gz: a small subset of an Illumina run (v.1.9) belonging to one Earth Microbiome Project environmental sample sequenced for 16S amplicons (demultiplexed from a set of pooled samples).
1. Load the data into FASTQC (note that there is no need to unzip it first).
2. What can explain the nucleotide frequency pattern observed in the 8 first bases?
3.What can explain the nucleotide frequency pattern observed in the rest of the sequence?
Environmental sample with community of bacteria, conserved region of 16S
4. Would you be worried in this case about the sequence duplications levels, overrepresented sequences and kmer content? Why?
Frequencies of different bacteria in sample.
Remove the sequences that have more than 25% of the bases with Q < 30 using the appropiate fastx-toolkit program (once you find it, use “-h” option for usage). Note that you need to add “-Q 33″ to the command as the sequences have a different encoding (Phred + 33) than the default of fastx-toolkit (Phred + 64). How many reads are kept at the end of the process?
ubuntu@domU-12-31-39-02-88-C1:~/qc$ fastq_quality_filter -i SRR026762-sample.fastq -q 30 -Q 33 -v -p 75 -o output.fastq Quality cut-off: 30 Minimum percentage: 75 Input: 100000 reads. Output: 53832 reads. discarded 46168 (46%) low-quality reads.
Clip the adaptor sequence of the quality filtered file using the appropiate fastx-toolkit program (once you find it, use “-h” option for usage). Set the options so that you remove the sequences that are shorter than 8 nucleotides and that you see the number of kept and discarded sequences on screen. Note that you need to add “-Q 33″ to the command as the sequences have a different encoding (Phred + 33) than the default of fastx-toolkit (Phred + 64). How many reads are kept at the end of the process?
ubuntu@domU-12-31-39-02-88-C1:~/qc$ fastx_clipper -a ATCTCGTATGCCGTCTTCTGCTTG -l 8 -v -Q 33 -i output.fastq -o SRR026762-sample-qf-at.fastq Clipping Adapter: ATCTCGTATGCCGTCTTCTGCTTG Min. Length: 8 Input: 53832 reads. Output: 52701 reads. discarded 461 too-short reads. discarded 553 adapter-only reads. discarded 117 N reads.
Trimming the adapter sequences sometimes may worsen the overall quality scores. There are two types of overrepresented sequences, Library prep and others have no hit. Sequencing primers, primer dimers, are being added to library and being sequenced. This happens in microRNA sequencing often because they’re so small. Remove these by aligning to see if they all match, then trim. Use program designed to work with microRNA and take into account adapter sequences and primer dimers.