MCIC February 2012 news.
ISSUE 2
Tuesday, February 14
THIS ISSUE
PREVIOUS ISSUES

MCIC-Galaxy: bioinformatics tools for 'next'-generation sequence data analysis available at the MCIC. By Tea Meulia

Friday, March 9: WAMBA seminar on this topic (Selby 203, 4:00 PM):
Introduction to the MCIC Galaxy portal and its tools for high-throughput sequence data analysis. Presented by Asela Wijeratne and Saranga Wijeratne.

 

The bottleneck for large sequencing dataset analysis is usually access to high-performance computers and to software that is easy for biologists to use. The MCIC has been developing a centralized web-based portal, MCIC-Galaxy, to host a variety of commonly used bioinformatics software and to make execution of complex automated analysis pipelines, data sharing, and data access relatively easy and fast. The first version of MCIC-Galaxy is available now to users at: http://164.107.87.120:8080/galaxy/ (for now, via ethernet connection only). Currently the site includes most of the pipelines and tools that are included in the original Galaxy at the Pennsylvania State University (http://main.g2.bx.psu.edu/). In addition, we integrated custom-built tools and tools available online that allow automated workflows for RNA-Seq analysis and for genome and transcriptome assemblies. We focused on these two areas because most of our customers are currently performing differential gene expression analysis in non-model organisms for which completely annotated genomic sequences are limited or not available at all. Assemblies are also very memory intensive and cannot be performed at other web portals or in distributed environments such as the Ohio Supercomputer Center (www.osc.org), hence the need to set up such tools in house.

 

With our customized RNA-Seq workflow, a complete differential gene expression analysis can be performed automatically in as little as two hours, depending, of course, on the size of the dataset and the complexity of the analyses. Automated steps include read quality and adaptor trimming, alignment to reference sequences using different algorithms, creation of digital count tables, and final import into DESeq or another package for statistical analysis. Similarly, we set up a complete workflow for genome assemblies using the standalone Velvet assembler (Zerbino and Birney, 2008), and we are finishing the installation of software for transcriptome assemblies: the Rnnotator pipeline (Martin et al. 2010) with the AMOS package, Velvet Oases (Zerbino and Birney, 2008), and finally the Trinity package (Grabherr et al. 2011).
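To picture what a "digital count table" looks like, here is a minimal sketch in Python. It assumes each aligned read has already been assigned to a gene; the sample names, gene identifiers, and the build_count_table helper are hypothetical illustrations and are not taken from the MCIC-Galaxy workflow itself.

```python
from collections import Counter

def build_count_table(assignments):
    """Turn per-sample lists of gene hits into a gene-by-sample count table.

    assignments: dict mapping sample name -> list of gene IDs,
                 one entry per aligned read (hypothetical input format).
    Returns (samples, table), where table maps gene ID -> list of counts
    in the same order as samples.
    """
    samples = sorted(assignments)
    per_sample = {s: Counter(assignments[s]) for s in samples}
    genes = sorted({g for counts in per_sample.values() for g in counts})
    # A Counter returns 0 for genes that were never seen in a sample.
    table = {g: [per_sample[s][g] for s in samples] for g in genes}
    return samples, table

# Hypothetical toy data: two samples, three genes.
reads = {
    "control": ["geneA", "geneA", "geneB"],
    "treated": ["geneA", "geneC", "geneC", "geneC"],
}
samples, table = build_count_table(reads)

# Print as a tab-delimited table: genes in rows, samples in columns.
print("gene\t" + "\t".join(samples))
for gene, counts in table.items():
    print(gene + "\t" + "\t".join(str(c) for c in counts))
```

A table of this shape (genes in rows, samples in columns, raw integer counts) is the kind of input that count-based statistical packages such as DESeq work from.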

 

Users will find that in the first MCIC-Galaxy version, options for executing various algorithms are still limited. We are purposely starting with a simpler version, as we would like to make sure that the site runs efficiently for several users. Fewer options also allow us to fix any problems more readily. MCIC-Galaxy will be expanded to include more options and additional bioinformatics and statistical tools. We encourage users' comments and input, and we will be sending out a questionnaire with our next newsletter for this purpose.

 

From the OARDC-Wooster campus you can access MCIC-Galaxy at http://164.107.87.120:8080/galaxy/. If you would like to use its tools and become familiar with the site, please contact Saranga Wijeratne (wijeratne.3@osu.edu). He will create a user account and password. Initially, we would like to limit the number of users and projects that run at the same time, as we need to make sure that our computer can handle the analysis load. Users from the Columbus campus will need to register their IP address with our IT department prior to accessing the MCIC-Galaxy URL. Once everything runs smoothly, users will be able to register automatically and use the site.



BIOINFORMATICS DISCUSSION AND TRAINING SESSIONS. BY TEA MEULIA

We will be starting regular bioinformatics discussion and training sessions in March. We chose some topics to start with; however, we would like to see these sessions evolve to cover mainly the topics and areas that students need and are interested in. In addition, we would like to encourage students and post-docs with expertise in any area of bioinformatics to participate and share their expertise.

Saranga Wijeratne will focus mainly on the computational tools and software capabilities now available at the MCIC. Saranga's sessions will start on March 15 and will be held every Thursday afternoon. The first 4 weeks will introduce users to the Galaxy interface and customized pipelines. As these training sessions will be hands-on and require computer usage, we have to limit the number of participants to 4 or 5 and would like to include no more than one representative per laboratory. Depending on demand, the sessions will be repeated. Therefore, if you are interested, please contact Saranga (wijeratne.3@osu.edu) immediately and sign up.

 

      i.         First session: commencing March 15, 2:00 PM - 4:00 PM, then every Thursday for 4 weeks

a.      Introduction to Galaxy

b.     Introduction to MCIC Galaxy

c.      Introduction to Galaxy tools- general introduction to selected set of tools

d.      Introduction to MCIC workflows  

 

    ii.         Second Session

a.      RNA-Seq pipeline to find differentially expressed genes

b.     Transcriptome assembly using Trinity

c.      Transcriptome assembly using Rnnotator

d.      Demonstration of genome assembly using Galaxy

 

Asela Wijeratne will focus on data mining tools and some additional data analysis pipelines. The sessions will start on April 2 at 2:00 PM. Asela is planning to meet with users once a month. However, depending on the enthusiasm, this group can meet more frequently.

Topics scheduled are:

 

      i.         First session: Annotations using Blast2GO

    ii.         Second session: Standalone BLAST for large data sets

   iii.         Third session: Running standalone BLAST at the Ohio Supercomputer Center

    iv.         Fourth session: Small RNA-Seq data analysis

 

Please contact Asela (wijeratne.1@osu.edu) to register.



Bioinformatics research at the KH-TPS core laboratory. By Stephen Opiyo

This quarter, the e-newsletter introduces the services and work done at the Kottman-Hall Translational Plant Sciences Core Facility (KH-TPS). KH-TPS was established in Columbus in July 2010 and houses analytical and molecular biology equipment. For the list of equipment and services provided, see http://kh-tps.osu.edu/list.html. Dr. Stephen Opiyo, a Research Scientist, manages the KH-TPS core laboratory and conducts collaborative bioinformatics research with faculty. Support provided by the KH-TPS facility covers:

• Microarray data analysis using Significance Analysis of Microarrays software and R programming.
• Protein sequence analysis (Blast search, profile hidden Markov model, etc.).
• Multiple sequence alignment and phylogenetic tree reconstruction using MAFFT, RAxML, ClustalW, and FastTree software.
• Metabolomics data analysis and biomarker discovery using multivariate statistics and R software.
• Multivariate analysis tutorials using R software.
• Statistical analysis of variance (ANOVA), and non-parametric analysis using R software.
• Using Microsoft Excel spreadsheets, equations, and macros.
• Metabolomics software evaluations.
• Statistical software download.
• R statistical software tutorial.
• Blast tutorial.
• Data mining and machine learning using multivariate statistics and R software.

Protein family classification is our main area of expertise. We use multivariate statistics such as principal component analysis, partial least squares, discriminant analysis, and support vector machines, with physico-chemical properties of amino acids as descriptors, to identify proteins with low sequence similarity. Examples of the projects that we are working on include identification of type III effector candidates of plant pathogenic bacteria from public domain databases.

The KH-TPS core facility, in collaboration with Dr. Graham from the Department of Plant Pathology, is developing a software pipeline to analyze datasets collected by mass spectrometry (MS). The pipeline is being developed using open source software such as the R statistical package, MySQL, Glassfish, JavaServer Faces technology, and Bioconductor. The pipeline will allow OARDC plant researchers to generate a metabolome-wide report of metabolites under any given condition using MS instrumentation, which in turn will allow researchers to determine which metabolites are critical for desired traits and thus help lead the way for crop improvement. Researchers will download the software and install it on their own computers.
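As a rough illustration of what "physico-chemical properties of amino acids as descriptors" can mean, the Python sketch below turns a protein sequence into two numbers that could then be fed into a multivariate method. The choice of descriptors (mean Kyte-Doolittle hydropathy and the fraction of charged residues) and the descriptors helper are assumptions made for this example; they are not the descriptor set used at KH-TPS.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGED = set("DEKR")  # acidic and basic side chains

def descriptors(seq):
    """Return (mean hydropathy, charged fraction) for a protein sequence.

    Characters that are not one of the 20 standard amino acids are skipped.
    """
    residues = [aa for aa in seq.upper() if aa in KD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    mean_hydropathy = sum(KD[aa] for aa in residues) / len(residues)
    charged_fraction = sum(aa in CHARGED for aa in residues) / len(residues)
    return mean_hydropathy, charged_fraction

# Hypothetical toy sequences: one hydrophobic, one highly charged.
for name, seq in [("hydrophobic", "AILVFAILVF"), ("charged", "DEKRDEKRGS")]:
    hyd, chg = descriptors(seq)
    print(name, round(hyd, 2), round(chg, 2))
```

Two proteins with little sequence similarity can still land close together in such a descriptor space, which is the idea behind classifying them with multivariate statistics rather than by alignment alone.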

The KH-TPS core facility has been in operation for 19 months. I would like your feedback on how the facility has been operating. Your feedback will help me improve the services that are provided at the facility. Please fill in this short survey: http://www.surveymonkey.com/s/QGWLS87



RSTUDIO PROVIDES A USER-FRIENDLY INTERFACE FOR R STATISTICAL PACKAGES. BY SARANGA WIJERATNE

R has become a very popular programming language, used today by over 2 million analysts worldwide. Its functionality has expanded greatly in the past ten years. The R environment is used for statistical computing and graphics, and the open source R statistical packages provide powerful tools and procedures to explore and visualize biological data as well as to test hypotheses. Many custom packages are available on the web; they usually include a variety of statistical methods (linear and nonlinear modeling, classical statistical tests, time-series analysis, clustering, and more) together with graphical tools. One of R's strengths is the ease with which well-designed, publication-quality plots can be produced, including mathematical symbols and formulae where needed.

 

Statistical packages based on the R language do not offer the most user-friendly environment compared to their counterparts (Matlab, SAS, and Mathematica). However, R's contribution to bioinformatics is invaluable, and its use has been increasing exponentially, in particular for 'next'-generation sequence data analysis. For example, we use Bioconductor, which is built on the R statistical programming language and gathers tools specifically for the analysis, interpretation, and visualization of high-throughput genomic sequencing data. Using it, like other software written in R, requires computer programming skills.

 

The newly developed RStudio graphical user interface (GUI) now provides a user-friendly environment for biologists who do not possess programming skills. It combines an intuitive user interface with powerful coding tools, making R more approachable for bioinformatics work and allowing wider application of these powerful tools.

 

RStudio runs on any major platform and can even be run on a server, which provides remote access to RStudio through a web browser.

In addition, RStudio:

• Integrates all the R tools into a customizable environment.
• Provides powerful coding tools designed to enhance your productivity.
• Supports TeX and Sweave documents.
• Supports interactive graphics using the manipulate package.

 

The MCIC will be making RStudio available to its customers through the MCIC-Galaxy site. More information about RStudio can be found at http://rstudio.org/. The software installs easily on all major platforms (PC, Mac, Linux, etc.), and installation does not require special configuration or expertise. It is open source and is potentially one of the best GUIs for R.


HOW TO INTERPRET ILLUMINA SEQUENCING REPORT FILES: WHAT DO THEY TELL US ABOUT THE QUALITY OF THE RUN? BY ASELA WIJERATNE

High-throughput data analysis is a multistep process. For Illumina sequencing technology, the primary step is to convert the TIFF images taken during sequencing into intensity files and then into sequences (base calls). During this analysis a summary report is generated, and most sequencing facilities pass this information on to their users. While there certainly is a lot of excitement and much buzz surrounding high-throughput sequencing data analyses, most of us tend to overlook this summary file. The summary report is the first point of reference before looking for any biological significance, because it allows us to determine whether the sequencing run performed properly. At the MCIC we carefully check these summary files to evaluate the success of the run before sending out the data, but it is always a good idea to check them yourself before venturing into prolonged data analyses.

 

Here, I explain some of the summary metrics reported for Illumina paired-end runs. To assess whether a run was successful, these metrics should be looked at as a whole and not as individual values.

 

Lane Parameter Summary:


 

Control: Usually, lane 8 is the control. We use a control library generated from the PhiX virus, which provides several benefits:

1.      Small: the genome is small, which allows quick alignment and error estimation.

2.      Diverse: it contains approximately 45% GC and 55% AT and provides a good balance of bases.

3.      Well defined: the PhiX genome sequence is well characterized.

PhiX provides a reference for quality control and can be used for comparison with other samples for cluster generation, sequencing, and alignment, and as a calibration control for cross-talk matrix generation, phasing, and prephasing.

Lane: The physical location on the flow cell where samples were hybridized. Each flow cell is made of 8 individual lanes. If your samples are bar-coded (indexing application), they may be sequenced on the same flow cell lane.

Sample ID:  Optional column for tracking sample information.

Sample Target: Reference sequence against which reads from a lane/flow cell will be aligned. For example, if your sample is from an organism that has a sequenced genome, we can indicate that information so the reads will be aligned to the reference during analysis.

Sample Type:  This is the analysis mode used to align the reads from a lane to a reference sequence using ELAND*.  For Paired-End reads ELAND_PAIR is the recommended algorithm.

*ELAND: A fast alignment algorithm that will give up on an alignment if there are more than 2 mismatches to the reference in the first 32 bases of a read. While this software can give a preliminary indication of the quality of the reads, it is often not used for in-depth analyses.
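The mismatch rule described above can be sketched in a few lines of Python. This only illustrates the "no more than 2 mismatches in the first 32 bases" criterion at a single, already chosen reference position; it is not how ELAND itself is implemented, and the function name and arguments are hypothetical.

```python
def passes_seed_check(read, reference_window, seed_length=32, max_mismatches=2):
    """Apply the seed rule at one candidate position.

    read:             the read sequence
    reference_window: reference bases starting at the candidate position
    Returns True if the first seed_length bases of the read differ from
    the reference at no more than max_mismatches positions.
    """
    seed = read[:seed_length]
    ref = reference_window[:seed_length]
    if len(ref) < len(seed):
        return False  # not enough reference sequence to compare against
    mismatches = sum(1 for a, b in zip(seed, ref) if a != b)
    return mismatches <= max_mismatches

# Hypothetical 40-base reference and three reads derived from it.
reference = "ACGT" * 10
two_early = "TT" + reference[2:]                       # 2 mismatches in the seed
three_early = "TTT" + reference[3:]                    # 3 mismatches in the seed
late_only = reference[:35] + "AAA" + reference[38:]    # mismatches after base 32

print(passes_seed_check(two_early, reference))    # accepted
print(passes_seed_check(three_early, reference))  # rejected
print(passes_seed_check(late_only, reference))    # accepted: seed is clean
```

The third read shows why % Align can stay high even when the ends of reads are poor: under this rule, only the first 32 bases decide whether an alignment is attempted.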

Length: Number of bases used per read to align to the reference. 

Filter:  To remove unreliable data, the raw clusters are filtered to remove any clusters that have intensities corresponding to bases other than the called base. The signal from each cluster is examined over the first 25 cycles and the purity of the signal (Chastity) is calculated for each cycle.  

The default chastity threshold is 0.6: if the chastity value for a cluster is > 0.6 in all of the first 25 cycles, the cluster is kept for further analysis.

Chastity = [Highest_Intensity / (Highest_Intensity + Next_Highest_Intensity)].

The default values (>0.6 and 25 cycles) are set to remove most of the low-quality data without throwing away too much of the good data. These values can be changed, but it is difficult to determine the correct values without compromising the quality and quantity of the data.
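The filter can be sketched in Python directly from the chastity formula and the defaults given above. The data layout (one list of four channel intensities per cycle) and the function names are assumptions for illustration, and intensities are assumed to be non-negative.

```python
def chastity(intensities):
    """Chastity for one cycle: highest / (highest + next highest)."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0

def passes_filter(cluster_cycles, threshold=0.6, n_cycles=25):
    """Keep a cluster only if chastity exceeds the threshold in every
    one of the first n_cycles cycles.

    cluster_cycles: list of per-cycle intensity lists, e.g. [A, C, G, T].
    """
    return all(chastity(c) > threshold for c in cluster_cycles[:n_cycles])

# Hypothetical cluster with a clean signal in all 25 cycles.
clean_cycle = [900.0, 100.0, 50.0, 20.0]   # chastity = 900 / 1000 = 0.9
mixed_cycle = [500.0, 450.0, 10.0, 10.0]   # chastity = 500 / 950, about 0.53
good_cluster = [clean_cycle] * 25

# Same cluster, but with one ambiguous cycle among the first 25.
bad_cluster = good_cluster[:10] + [mixed_cycle] + good_cluster[11:]

print(passes_filter(good_cluster))  # kept
print(passes_filter(bad_cluster))   # removed
```

Note that a single ambiguous cycle among the first 25 is enough to remove the cluster, while an ambiguous cycle later in the read does not affect filtering.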

Number of tiles: Each lane is divided into 100 tiles (imaging areas). Tiles can be removed from the analysis if they underperformed in every cycle. This number indicates how many tiles were used for the analysis.

Tiles:  A hyperlink for each lane to the location (within Summary.htm) of the statistics for individual tiles in that lane.

 

Lane Results Summary:


 

Lane Yield: Total number of nucleotides yielded for that lane. This can be calculated with the formula: number of tiles × read length × number of PF clusters.

Clusters (raw): This is the total number of clusters detected. The number of raw clusters is the first indication of how many sequence tags the lane will yield. A low number of clusters suggests a poorly quantified sample, as both over- and under-clustering lead to fewer clusters being detected. In most cases, however, the cluster density is highly influenced by the nature and quality of the library.

Clusters (PF): This is the number of clusters passing the (chastity) filter. Over-clustering often leads to fewer passing-filter clusters.

1st cycle Int (PF):  This is the average intensity of the four bases at position one (cycle 1).  

% Intensity after 20 cycles (PF): The corresponding intensity statistic at cycle 20 as a percentage of that at the first cycle. This indicates decay of intensity over time and typically should be >50%.

% PF clusters: The percentage of clusters passing filter. We often achieve about 80% PF clusters. A lower % PF indicates a problem with cluster formation and detection and usually a waste of sequencing resources, but it does not lead to poor-quality sequences.

% Align (PF): This is the percentage of filtered reads (clusters) that align uniquely to the reference. While this is an indication of the quality of the sequences, the number can be misleading unless the reference is complete and very similar to the sample being sequenced. However, the control sample (PhiX) should typically achieve about 95% using the ELAND algorithm.

Alignment score (PF): This is the average alignment score over all sequences that have passed the filter. Reads with multiple alignments and reads without hits have a score of 0.

% Error rate (PF): This is the percentage of called bases in aligned PF reads that do not match the reference, and is an indication of sequencing error. For PhiX, the error rate should be 1-2%.

With ELAND_PAIR analysis, two results summaries will be available, one for each read (Lane Results Summary - Read 1 and Lane Results Summary - Read 2).
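Several of the Lane Results Summary values are simple arithmetic on others, and it can be useful to recompute them as a sanity check. The Python sketch below does this for lane yield, % PF clusters, and % intensity after 20 cycles. All input numbers are hypothetical, and the sketch assumes the PF cluster count entered into the yield formula is a per-tile figure.

```python
def lane_yield_bases(n_tiles, read_length, pf_clusters_per_tile):
    """Lane yield = number of tiles x read length x PF clusters.

    Assumption: pf_clusters_per_tile is the PF cluster count for one tile.
    """
    return n_tiles * read_length * pf_clusters_per_tile

def percent_pf(raw_clusters, pf_clusters):
    """Percentage of raw clusters that passed the chastity filter."""
    return 100.0 * pf_clusters / raw_clusters

def percent_intensity_after_20(first_cycle_intensity, cycle20_intensity):
    """Cycle-20 intensity as a percentage of the first-cycle intensity."""
    return 100.0 * cycle20_intensity / first_cycle_intensity

# Hypothetical lane: 100 tiles, 76-base reads, 100,000 raw and
# 80,000 PF clusters per tile, intensity falling from 300 to 210.
yield_bases = lane_yield_bases(100, 76, 80_000)
pf = percent_pf(100_000, 80_000)
intensity = percent_intensity_after_20(300.0, 210.0)

print(f"Lane yield: {yield_bases / 1e6:.0f} Mbases")
print(f"% PF clusters: {pf:.1f} (about 80% is typical)")
print(f"% intensity after 20 cycles: {intensity:.1f} (should be >50%)")
```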

 

Expanded Lane Summary:


 

Cluster Tile Mean (raw): The mean number of clusters per tile detected by the analysis software; it represents all clusters before filtering. Low cluster numbers (<1,000) often have a negative effect on the phasing and matrix calculations, and high cluster numbers can lead to ambiguous cluster calling due to saturation of the image.

% Phasing: A cluster contains thousands of copies of the same DNA fragment, and during sequencing each of these copies gets sequenced. However, some of them will not extend in step with the rest and fall behind. This leads to ambiguous base calling at a given position. % Phasing is an indication of how many fragments in a given cluster fall behind in a given sequencing cycle, and this value should be <1%.

% Pre-Phasing: In the same way as described above, some copies in a cluster may extend faster than the rest, causing pre-phasing. % Pre-Phasing indicates the percentage of copies jumping ahead in a given cycle, and this value should be <0.5%.
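To see why such small per-cycle percentages matter, consider a deliberately simplified model in which, at every cycle, a fixed fraction of the copies in a cluster falls behind and another fixed fraction jumps ahead, and no copy ever gets back in step. Under that assumption the in-phase fraction after n cycles is (1 - phasing - prephasing)^n. The Python sketch below evaluates this; the model and the function name are illustrative assumptions, not the calculation performed by the Illumina software.

```python
def in_phase_fraction(phasing, prephasing, n_cycles):
    """Fraction of copies in a cluster still in step after n_cycles.

    phasing, prephasing: per-cycle rates as fractions (0.01 means 1%).
    Simplification: copies that lose phase never regain it.
    """
    return (1.0 - phasing - prephasing) ** n_cycles

# Rates at the suggested upper limits (1% phasing, 0.5% pre-phasing)
# compared with rates five times lower.
for label, p, q in [("at the limits", 0.01, 0.005), ("well below", 0.002, 0.001)]:
    for cycles in (36, 76, 100):
        frac = in_phase_fraction(p, q, cycles)
        print(f"{label:14s} cycle {cycles:3d}: {100 * frac:5.1f}% in phase")
```

Even at the suggested limits, the in-phase signal erodes steadily with read length in this model, which is one reason base quality tends to drop toward the end of longer reads.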

% Error rate (raw): This is the percentage of all called bases in aligned reads from all detected clusters that don't match the reference.

Equiv Perfect Clusters (raw): The number of clusters in the ideal situation of read base perfectly predicting reference base that would provide the same information content (entropy of reference base given read base and a prior assumption of equiprobable reference bases) as calculated for all actual detected clusters.

% Retained: The percentage of clusters that passed filtering. Typically, this value should be >50%. 

Cycle 2-4 Av Int (PF): The intensity averaged over cycles 2, 3, and 4 for clusters that passed filtering.

Cycle 2-10 Av % Loss (PF): The average percentage intensity drop per cycle over cycles 2-10 (derived from a best fit straight line for log intensity versus cycle number).

Cycle 10-20 Av % Loss (PF): The average percentage intensity drop per cycle over cycles 10-20 (derived from a best fit straight line for log intensity versus cycle number).
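The "best fit straight line for log intensity versus cycle number" can be reproduced with an ordinary least-squares fit. The Python sketch below fits the natural log of intensity against cycle number and converts the slope into a percentage drop per cycle. The use of the natural log and the conversion 100 × (1 - e^slope) are assumptions about how the fitted line is turned into a percentage; the intensity values are hypothetical.

```python
import math

def average_percent_loss(cycles, intensities):
    """Average % intensity drop per cycle from a log-linear fit.

    Fits ln(intensity) = a + b * cycle by least squares and returns
    100 * (1 - exp(b)); a negative slope b gives a positive loss.
    """
    logs = [math.log(i) for i in intensities]
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, logs))
    slope = sxy / sxx
    return 100.0 * (1.0 - math.exp(slope))

# Hypothetical lane whose intensity falls by exactly 2% per cycle.
cycles = list(range(2, 11))                       # cycles 2 to 10
intensities = [1000.0 * 0.98 ** (c - 2) for c in cycles]

loss = average_percent_loss(cycles, intensities)
print(f"Cycle 2-10 Av % Loss: {loss:.2f}")
```

Fitting on the log scale means a constant percentage drop per cycle appears as a straight line, so the slope summarizes the decay in a single number.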

% Align (PF): The percentage of filtered reads that were uniquely aligned to the reference.

% Error Rate (PF): The percentage of called bases in aligned filtered reads that do not match the reference.

Equiv Perfect Clusters (PF): The number of clusters in the ideal situation of read base perfectly predicting reference base that would provide the same information content (entropy of reference base given read base and a prior assumption of equiprobable reference bases) as calculated for the actual clusters that passed filtering.

 

  IVC Plots:

These plots provide valuable information about the base composition of your reads (and, ultimately, of the sample). There are eight IVC plots, one for each of the eight flow cell lanes. The PhiX control (lane 8) provides a reference to compare with the other samples. The IVC plots display the lane-average intensity over all tiles in the lane versus the cycle number. The plots displayed are All, Called, %Base_Calls, %All, and %Called.

 

All: This is the lane average of the data displayed in All.htm. It plots each channel (A, C, G, T) separately as a different colored line. Means are calculated over all clusters, regardless of base calling. If all clusters are T, then channels A, C, and G will be zero. If all bases are present in the sample at 25% of total and a well-balanced matrix is used for analysis, the graph will display all channels with similar intensities. If intensities are not similar, the results could indicate either poor cross-talk correction or poor absolute intensity balance between each channel.

 


Fig 1. All graph for PhiX control (A) and a sample with imbalanced bases (B)

 

Called: This plot is similar to All, except means are calculated for each channel using clusters that the base caller has called in that channel. If all bases are present in the sample at 25% with pure signal (zero intensity in the non-called channels), the Called intensity will be four times that of All, as the intensities will only be averaged over 25% of the clusters. For impure clusters, the called intensity will be less than four times that of All. The Called intensities are independent of base representation, so a well-balanced matrix will display all channels with similar intensities.
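The factor of four between Called and All can be checked with a toy calculation. In the Python sketch below, each cluster is a dictionary of channel intensities for one cycle, and the called base is simply taken to be the brightest channel; both the data layout and that base-calling shortcut are simplifying assumptions made for this example.

```python
CHANNELS = "ACGT"

def all_and_called_means(clusters):
    """Return (All, Called) mean intensities per channel for one cycle.

    All:    mean over every cluster, regardless of the base called.
    Called: mean over only the clusters whose called base (here, the
            brightest channel) is that channel.
    """
    all_means = {ch: sum(c[ch] for c in clusters) / len(clusters)
                 for ch in CHANNELS}
    called_means = {}
    for ch in CHANNELS:
        called = [c[ch] for c in clusters if max(CHANNELS, key=c.get) == ch]
        called_means[ch] = sum(called) / len(called) if called else 0.0
    return all_means, called_means

def make_clusters(signal, background):
    """One cluster per base: signal in its own channel, background elsewhere."""
    return [{ch: (signal if ch == base else background) for ch in CHANNELS}
            for base in CHANNELS]

# Pure signal: bases at 25% each, zero intensity in non-called channels.
pure_all, pure_called = all_and_called_means(make_clusters(1000.0, 0.0))
print(pure_called["A"] / pure_all["A"])      # ratio Called / All

# Impure signal: some intensity leaks into the other channels.
imp_all, imp_called = all_and_called_means(make_clusters(1000.0, 100.0))
print(imp_called["A"] / imp_all["A"])        # ratio drops below 4
```

With pure signal the All mean is diluted by the three quarters of clusters that show nothing in that channel, so Called is four times All; any cross-talk raises All and pulls the ratio below four.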

 


Fig 2. Called graph for PhiX control (A) and a sample with imbalanced bases (B)

 

%Base_Calls: The percentage of each base called as a function of cycle. Ideally, this should be constant for a genomic sample, reflecting the base representation of the sample. In practice, later cycles often show some bases more than others. As the signal decays, some bases may start to fall into the noise while others still rise above it. Matrix adjustments may help to optimize the data.


Fig 3. %Base_Calls graph for PhiX control (A) and a sample with imbalanced bases (B)

 

%All and %Called: Exactly the same as All and Called, but expressed as a percentage of the total intensities. These plots make it easier to see changes in relative intensities between channels as a function of cycle by removing any intensity decay.

 


Fig 4. %All graph for PhiX control (A) and a sample with imbalanced bases (B)

 


Fig 5. %Called graph for PhiX control (A) and a sample with imbalanced bases (B)

 




The Ohio State University, Ohio Agricultural Research and Development Center, 1680 Madison Avenue, Wooster, Ohio 44691. Contact Tea Meulia, Director.
Design by Saranga Wijeratne