HOW TO INTERPRET ILLUMINA SEQUENCING REPORT FILES: WHAT DO THEY TELL ABOUT THE QUALITY OF THE RUN?
BY ASELA WIJERATNE
High-throughput data analysis is a multistep process. For Illumina sequencing technology, the primary step is to convert the TIFF images taken during sequencing into intensity files and then into sequences (base calls). During this analysis a summary report is generated, and most sequencing facilities pass this report on to their users. While there is certainly a lot of excitement and buzz surrounding high-throughput sequencing data analyses, most of us tend to overlook this summary file. The summary report is the first point of reference before looking for any biological significance: it allows us to determine whether a sequencing run performed properly. At MCIC we try to check these summary files carefully to evaluate the success of the run before sending out the data, but it is always a good idea to check them yourself before venturing into prolonged data analyses.
Here, I try to explain some of the summary results for Illumina paired-end runs. To assess whether a run was successful, these metrics should be considered as a whole rather than as individual values.
Lane Parameter Summary:
Control: Usually, lane 8 is the control. We use a control library generated from the PhiX virus, which provides several benefits:
1. Size: the genome is small, which is useful for quick alignment and error estimation.
2. Diverse: it contains approximately 45% GC and 55% AT, which provides a good balance of bases.
3. The PhiX genome is well defined.
PhiX provides a reference for quality control and can be compared with other samples for cluster generation, sequencing and alignment. It also serves as a calibration control for cross-talk matrix generation, phasing and prephasing.
Lane: The physical location on the flow cell where samples were hybridized. Each flow cell is made of 8 individual lanes. If your samples are bar-coded (indexing application), they may be sequenced on the same flow cell lane.
Sample ID: Optional column for tracking sample information.
Sample Target: Reference sequence against which reads from a lane/flow cell will be aligned. For example, if your sample is from an organism that has a sequenced genome, we can indicate that information so the reads will be aligned to the reference during analysis.
Sample Type: This is the analysis mode used to align the reads from a lane to a reference sequence using ELAND*. For paired-end reads, ELAND_PAIR is the recommended algorithm.
*ELAND: A fast alignment algorithm that gives up on an alignment if there are more than 2 base differences from the reference in the first 32 bases of a read. While this software can give a preliminary indication of read quality, it is often not used for in-depth analyses.
Length: Number of bases used per read to align to the reference.
Filter: To remove unreliable data, the raw clusters are filtered to remove any clusters that have intensities corresponding to bases other than the called base. The signal from each cluster is examined over the first 25 cycles and the purity of the signal (chastity) is calculated for each cycle:
Chastity = Highest_Intensity / (Highest_Intensity + Next_Highest_Intensity)
The default chastity threshold is 0.6: if the chastity value for a cluster is > 0.6 for all of the first 25 cycles, the cluster is kept for further analysis. The default values (> 0.6 and 25 cycles) are set to remove most of the low-quality data without throwing away too much of the good data. These values can be changed, but it is difficult to choose different values without compromising the quality and quantity of the data.
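The filter rule above can be sketched in a few lines of Python. This is only an illustration of the chastity formula and the default thresholds described here, not the pipeline's actual code; the function names and the example intensities are made up, and intensities are assumed to be non-negative.

```python
def chastity(intensities):
    """Chastity for one cycle: highest / (highest + next highest) intensity."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_filter(cluster_cycles, threshold=0.6, n_cycles=25):
    """True if chastity exceeds the threshold in every one of the first n_cycles."""
    return all(chastity(c) > threshold for c in cluster_cycles[:n_cycles])


# Hypothetical clusters: each cycle is an (A, C, G, T) intensity tuple.
clean = [(900.0, 50.0, 30.0, 20.0)] * 25
# Same cluster, but with two nearly equal channels in cycle 25.
mixed = [(900.0, 50.0, 30.0, 20.0)] * 24 + [(500.0, 450.0, 30.0, 20.0)]

print(passes_filter(clean))  # True
print(passes_filter(mixed))  # False
```

A single impure cycle among the first 25 is enough to discard the cluster, which is why mixed or overlapping clusters rarely pass.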
Number of tiles: Each lane is divided into 100 tiles (imaging areas). Tiles can be removed from the analysis if they underperformed in every cycle. This number indicates how many tiles were used for the analysis.
Tiles: A hyperlink for each lane to the location (within Summary.htm) of the statistics for individual tiles in that lane.
Lane Results Summary:
Lane Yield: Total number of nucleotides yielded for that lane. This can be calculated using the formula: number of tiles x read length x number of PF clusters (per tile).
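As a quick worked example of the yield formula, here is a small Python sketch. It assumes the PF cluster count is an average per tile (so that multiplying by the number of tiles makes sense); the function name and the input values are hypothetical.

```python
def lane_yield_bases(n_tiles, read_length, pf_clusters_per_tile):
    """Lane yield in bases: tiles x read length x PF clusters per tile."""
    return n_tiles * read_length * pf_clusters_per_tile


# Hypothetical lane: 100 tiles, 76-base reads, 150,000 PF clusters per tile.
yield_bases = lane_yield_bases(100, 76, 150_000)
print(yield_bases)        # 1140000000
print(yield_bases / 1e9)  # 1.14 (gigabases)
```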
Clusters (raw): This is the total number of clusters detected. The number of raw clusters is the first indication of how many sequence tags the lane will yield. A low cluster count can point to a poorly quantified sample, because both over- and under-clustering lead to fewer clusters being detected. In most cases, however, cluster density is strongly influenced by the nature and quality of the library.
Clusters (PF): This is the number of clusters passing the (chastity) filter. Over-clustering often leads to fewer passing-filter clusters.
1st cycle Int (PF): This is the average intensity of the
four bases at position one (cycle 1).
% Intensity after 20 cycles (PF): The corresponding intensity statistic at cycle 20 as a percentage of that at the first cycle. This indicates the decay of intensity over time and should typically be >50%.
% PF clusters: The percentage of clusters passing filter. We often achieve about 80% PF clusters. A lower % PF indicates a problem with cluster formation and detection and usually means wasted sequencing resources, but it does not lead to poor-quality sequences.
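Both percentages are simple ratios. The sketch below recomputes them from hypothetical lane values and compares the intensity figure against the rule of thumb quoted above; the variable names and numbers are invented for illustration.

```python
def percent_of(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Hypothetical values for one lane.
first_cycle_int = 320.0   # 1st cycle Int (PF)
cycle20_int = 256.0       # same statistic at cycle 20
raw_clusters = 200_000    # Clusters (raw)
pf_clusters = 160_000     # Clusters (PF)

pct_intensity = percent_of(cycle20_int, first_cycle_int)
pct_pf = percent_of(pf_clusters, raw_clusters)

print(pct_intensity)         # 80.0
print(pct_pf)                # 80.0
print(pct_intensity > 50.0)  # True: intensity decay is within the typical range
```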
% Align (PF): This is the percentage of filtered reads (clusters) that uniquely align to the reference. While this is an indication of the quality of the sequences, the number can be misleading unless the reference is complete and very similar to the sample being sequenced. However, the control sample (PhiX) should typically achieve about 95% using the ELAND algorithm.
Alignment score (PF): This is the average alignment score over all sequences that passed the filter. Reads with multiple alignments and reads without hits receive a score of 0.
% Error rate (PF): This is the percentage of called bases in aligned PF reads that do not match the reference, and is an indication of sequencing error. For PhiX, the error rate should be 1-2%.
With ELAND_PAIR analysis, two results summaries will be available, one for each read (Lane Results Summary - Read 1 and Lane Results Summary - Read 2).
Expanded Lane Summary:
Cluster Tile Mean (raw): The mean number of clusters per tile detected by the analysis software, counting all clusters before filtering. Low cluster numbers (<1,000) often have a negative effect on the phasing and matrix calculations, while high cluster numbers can lead to ambiguous cluster calling due to saturation of the image.
% Phasing: A cluster contains thousands of copies of the same DNA fragment, and during sequencing each of these copies gets sequenced. However, some copies fail to extend in a given cycle and fall behind, which leads to ambiguous base calling at a given position. % Phasing indicates how many copies in a cluster fall behind in a given sequencing cycle; this value should be <1%.
% Pre-Phasing: In the same way, some copies in a cluster may extend faster than the rest, causing pre-phasing. % Pre-Phasing indicates the percentage of copies jumping ahead; this value should be <0.5%.
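To see why even small phasing and pre-phasing rates matter, here is a toy Python model. It assumes that in every cycle each copy independently falls one base behind with probability equal to the phasing rate, jumps one base ahead with probability equal to the pre-phasing rate, and otherwise extends normally. This is a simplification for illustration, not the correction model used by the pipeline, and the function name is made up.

```python
def in_phase_fraction(n_cycles, phasing, prephasing):
    """Fraction of copies in a cluster still in step after n_cycles (toy model)."""
    # Map from offset (bases ahead of or behind the expected position) to fraction.
    dist = {0: 1.0}
    normal = 1.0 - phasing - prephasing
    for _ in range(n_cycles):
        nxt = {}
        for offset, frac in dist.items():
            for step, prob in ((-1, phasing), (1, prephasing), (0, normal)):
                nxt[offset + step] = nxt.get(offset + step, 0.0) + frac * prob
        dist = nxt
    return dist.get(0, 0.0)


# Rates at the upper limits quoted above (1% phasing, 0.5% pre-phasing).
at_limit = in_phase_fraction(36, 0.01, 0.005)
# A better-behaved run (hypothetical rates).
good_run = in_phase_fraction(36, 0.003, 0.002)

print(f"in phase after 36 cycles at the limits: {at_limit:.2f}")
print(f"in phase after 36 cycles, lower rates:  {good_run:.2f}")
```

Under this model the out-of-step fraction accumulates with every cycle, which is one reason base calls become more ambiguous toward the end of a read.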
% Error rate (raw): This is the percentage of all called bases in aligned reads from all detected clusters that don't match the reference.
Equiv Perfect Clusters (raw): The number of clusters in the ideal situation of read base perfectly
predicting reference base that would provide the same information
content (entropy of reference base given read base and a prior
assumption of equiprobable reference bases)
as calculated for all actual detected clusters.
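The definition above is dense, so here is one possible reading of it as a Python sketch. I assume the information content per base is 2 bits (the entropy of four equiprobable reference bases) minus the conditional entropy of the reference base given the read base, and that a perfect cluster delivers the full 2 bits. The function name, the layout of the count table and the scaling are my assumptions, not a documented pipeline formula.

```python
import math

BASES = "ACGT"


def equivalent_perfect_clusters(n_clusters, counts):
    """counts[ref][read]: number of bases with reference base ref called as read.

    Assumes every reference base occurs at least once in the counts.
    """
    # Joint P(ref, read) under a uniform prior P(ref) = 1/4.
    joint = {}
    for ref in BASES:
        row_total = sum(counts[ref].values())
        for read in BASES:
            joint[(ref, read)] = 0.25 * counts[ref][read] / row_total

    # Conditional entropy H(ref | read), in bits.
    cond_entropy = 0.0
    for read in BASES:
        p_read = sum(joint[(ref, read)] for ref in BASES)
        for ref in BASES:
            p = joint[(ref, read)]
            if p > 0:
                cond_entropy -= p * math.log2(p / p_read)

    information = 2.0 - cond_entropy  # bits per base; 2.0 means perfect calls
    return n_clusters * information / 2.0


# Hypothetical count tables: error-free calls versus completely random calls.
perfect = {r: {c: (100 if c == r else 0) for c in BASES} for r in BASES}
random_calls = {r: {c: 25 for c in BASES} for r in BASES}

print(equivalent_perfect_clusters(1000, perfect))       # 1000.0
print(equivalent_perfect_clusters(1000, random_calls))  # 0.0
```

Real data falls between these two extremes: the more often the read base disagrees with the reference, the fewer "perfect" clusters the lane is worth.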
% Retained: The percentage of clusters that passed filtering. Typically, this value should be >50%.
Cycle 2-4 Av Int (PF): The intensity averaged over cycles 2, 3, and 4 for clusters that passed filtering.
Cycle 2-10 Av % Loss (PF): The average percentage intensity drop per cycle over cycles 2-10 (derived from a best-fit straight line for log intensity versus cycle number).
Cycle 10-20 Av % Loss (PF): The average percentage intensity drop per cycle over cycles 10-20 (derived from a best-fit straight line for log intensity versus cycle number).
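The "best-fit straight line for log intensity versus cycle number" can be illustrated with an ordinary least-squares fit. In this sketch the slope of the fitted line is converted to a per-cycle percentage drop as 100 x (1 - exp(slope)); that conversion and the function name are my assumptions, and the intensities are synthetic.

```python
import math


def average_percent_loss(cycles, intensities):
    """Per-cycle % intensity drop from a least-squares line through log(intensity)."""
    logs = [math.log(i) for i in intensities]
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, logs))
    slope = sxy / sxx  # change in log intensity per cycle
    return 100.0 * (1.0 - math.exp(slope))


# Synthetic intensities that lose exactly 2% per cycle over cycles 2-10.
cycles = list(range(2, 11))
intensities = [1000.0 * 0.98 ** (c - 2) for c in cycles]

print(round(average_percent_loss(cycles, intensities), 2))  # 2.0
```

Fitting on the log scale means a constant percentage loss per cycle shows up as a straight line, so the slope summarizes the decay in a single number.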
% Align (PF): The percentage of filtered reads that were uniquely aligned to the reference.
% Error Rate (PF): The percentage of called bases in aligned filtered reads that do not match the reference.
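The error rate defined above is a per-base mismatch percentage over aligned reads. A minimal Python sketch, assuming each aligned read is paired with the reference segment it aligned to (same length, no gaps); the function name and the sequences are invented for illustration.

```python
def percent_error_rate(aligned_pairs):
    """Percentage of called bases that differ from the reference.

    aligned_pairs: iterable of (read, reference_segment) strings of equal length.
    """
    mismatches = 0
    total = 0
    for read, ref in aligned_pairs:
        for called, expected in zip(read, ref):
            total += 1
            if called != expected:
                mismatches += 1
    return 100.0 * mismatches / total if total else 0.0


# Two hypothetical 10-base reads; the second has one miscalled base.
pairs = [
    ("ACGTACGTAC", "ACGTACGTAC"),
    ("ACGTTCGTAC", "ACGTACGTAC"),
]
print(percent_error_rate(pairs))  # 5.0
```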
Equiv Perfect Clusters (PF): The number of clusters
in the ideal situation of read base perfectly predicting reference base
that would provide the same information content (entropy of reference
base given read base and a prior assumption of equiprobable reference bases) as calculated for the actual clusters that passed
filtering
IVC Plots:
These plots provide valuable information about the base composition of your reads (and ultimately of the sample). There are eight IVC plots, one for each of the eight flow cell lanes. The PhiX control (lane 8) provides a reference to compare with the other samples. The IVC plots display the intensity, averaged over all tiles in the lane, versus the cycle number. The plots displayed are All, Called, %Base_Calls, %All, and %Called.
All: This is the lane average of the data displayed in All.htm. It plots each
channel (A, C, G, T) separately as a different colored line. Means are calculated
over all clusters, regardless of base calling. If all clusters are T, then
channels A, C, and G will be zero. If all bases are present in the sample
at 25% of total and a well-balanced matrix is used for analysis, the
graph will display all channels with similar intensities. If
intensities are not similar, the results could indicate either poor
cross-talk correction or poor absolute intensity balance between each
channel.
Fig 1. All graph for PhiX control (A) and a sample with imbalanced bases (B)
Called: This plot is similar to All, except means are calculated for each channel
using clusters that the base caller has called in that channel. If all bases
are present in the sample at 25% with pure signal (zero intensity in the
non-called channels), the Called intensity will be four times that of All,
as the intensities will only be averaged over 25% of the clusters. For impure
clusters, the called intensity will be less than four times that of
All. The Called intensities are independent of base representation, so
a well-balanced matrix will display all channels with similar
intensities.
Fig 2. Called graph for PhiX control (A) and a sample with imbalanced bases (B)
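The "four times" relationship between Called and All can be checked with a small Python sketch. It assumes the called base is simply the channel with the highest intensity, which is a simplification of real base calling; the function name and the cluster data are hypothetical.

```python
BASES = "ACGT"


def all_and_called_means(clusters, channel):
    """Mean intensity of one channel over all clusters and over clusters called in it."""
    all_mean = sum(c[channel] for c in clusters) / len(clusters)
    # Simplifying assumption: the called base is the brightest channel.
    called = [c[channel] for c in clusters if max(c, key=c.get) == channel]
    called_mean = sum(called) / len(called) if called else 0.0
    return all_mean, called_mean


# 100 hypothetical clusters at one cycle: 25 per base, perfectly pure signal.
pure = [
    {b: (1000.0 if b == called else 0.0) for b in BASES}
    for called in BASES
    for _ in range(25)
]

all_mean, called_mean = all_and_called_means(pure, "A")
print(all_mean, called_mean)   # 250.0 1000.0
print(called_mean / all_mean)  # 4.0
```

With impure clusters the non-called channels contribute some signal to All, so the ratio drops below four.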
%Base_Calls: The percentage of each base called as a function of cycle. Ideally, this should be constant for a genomic sample, reflecting the base representation of the sample. In practice, later cycles often show some bases more than others. As the signal decays, some bases may start to fall into the noise while others still rise above it. Matrix adjustments may help to optimize the data.
Fig 3. %Base_Calls graph for PhiX control (A) and a sample with imbalanced bases (B)
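If you want to reproduce a %Base_Calls curve from your own reads, the calculation is a per-cycle tally. A minimal sketch, assuming the reads are plain strings of equal or greater length than the cycles you want to inspect; the function name and the reads are made up.

```python
from collections import Counter


def percent_base_calls(reads):
    """For each cycle, the percentage of reads calling A, C, G and T."""
    n_cycles = min(len(r) for r in reads)
    per_cycle = []
    for i in range(n_cycles):
        counts = Counter(r[i] for r in reads)
        per_cycle.append({b: 100.0 * counts[b] / len(reads) for b in "ACGT"})
    return per_cycle


# Four hypothetical 4-base reads.
reads = ["ACGT", "AGGT", "ACTT", "TCGA"]
profile = percent_base_calls(reads)

print(profile[0])  # {'A': 75.0, 'C': 0.0, 'G': 0.0, 'T': 25.0}
```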
%All and %Called: Exactly the same as All and Called, but expressed as a percentage of the total intensities. These plots make it easier to see changes in relative intensities between channels as a function of cycle by removing any intensity decay.
Fig 4. %All for PhiX control (A) and a sample with imbalanced bases (B)
Fig 5. %Called for PhiX control (A) and a sample with imbalanced bases (B)
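The effect of expressing intensities as percentages can be shown with two hypothetical cycles in which every channel has lost half of its signal. The function name and the numbers are invented for illustration.

```python
def to_percent(channel_means):
    """Express each channel's intensity as a percentage of the total."""
    total = sum(channel_means.values())
    return {base: 100.0 * value / total for base, value in channel_means.items()}


# Hypothetical lane averages: by cycle 30 every channel has decayed by half.
cycle_1 = {"A": 400.0, "C": 200.0, "G": 200.0, "T": 200.0}
cycle_30 = {"A": 200.0, "C": 100.0, "G": 100.0, "T": 100.0}

print(to_percent(cycle_1))   # {'A': 40.0, 'C': 20.0, 'G': 20.0, 'T': 20.0}
print(to_percent(cycle_30))  # {'A': 40.0, 'C': 20.0, 'G': 20.0, 'T': 20.0}
```

The absolute intensities differ by a factor of two, but the percentages are identical, so any change you do see in a %All or %Called plot reflects a shift in the balance between channels rather than overall decay.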