RepMiner

Author: James C. Estill

Description

The RepMiner package takes a graph theory approach to the classification and assembly of the repetitive fraction of genomic sequence data. Sequences analyzed by RepMiner can range from full-length transposable elements in well-characterized genomes to short sequence reads resulting from low-coverage sample sequence data. RepMiner uses transposable elements identified in model species to map the location of putative transposable elements onto homology-based networks derived from comparing the sequences of the query genome to itself. Individual clusters representing Pseudo Assembly Networks (PANs) may be selected and assembled using the TGICL/CAP3 program.
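
As an illustration of the graph idea described above, the following Python sketch groups sequences into clusters by treating each significant all-by-all BLAST hit as an edge and collecting the connected components. It is not RepMiner's own code. It assumes the standard 12-column BLAST tabular output, and the function names and the 1e-5 E-value cutoff are hypothetical choices made for the example.

```python
from collections import defaultdict


def parse_blast_tabular(lines, max_evalue=1e-5):
    """Yield (query, subject, bitscore) for each significant non-self hit.

    Assumes 12-column BLAST tabular output, where column 11 is the
    E-value and column 12 is the bit score.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query == subject or evalue > max_evalue:
            continue  # drop self hits and weak hits
        yield query, subject, bitscore


def connected_components(edges):
    """Return a list of sets, one per connected cluster of sequences."""
    adjacency = defaultdict(set)
    for a, b, _score in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components
```

Each returned set is a candidate cluster in the sense used above. Sequences with no significant hit to any other sequence do not appear in any component.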

This package is currently under heavy development.

Features

Fully implemented features:

Automated All-by-All BLAST given a FASTA file for the query sequences of interest
Automated BLAST against a set of databases of transposable elements (TEs) from model species
Parsing of TE BLAST results into a classified set of transposable elements
Production of networks of similarity based on BLAST results that are suitable for visualization in the program Cytoscape
Automated assembly of manually selected Pseudo Assembly Networks using TGICL/CAP3
Automated comparison of transposable element assemblies against a database of profile hidden Markov models using HMMER
Parsing of all HMMER results into an easy-to-read, tab-delimited text file
Creation of an Apollo file for all assemblies resulting from TGICL/CAP3
Production of an HTML page summarizing all analyses for each of the PANs submitted to the program
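
The network production step listed above can be sketched in Python as follows. This is a minimal example under stated assumptions, not the code RepMiner uses: it converts pairwise hits into lines of Cytoscape's Simple Interaction Format (SIF), where each line reads "source, interaction type, target". The function name hits_to_sif and the interaction label "bl" are hypothetical.

```python
def hits_to_sif(hits, interaction="bl"):
    """Convert (query, subject, bitscore) hits into SIF lines.

    Reciprocal hits (A vs. B and B vs. A) are collapsed into one
    undirected edge that keeps the higher bit score. Self hits are dropped.
    Returns (sif_lines, scores), where scores maps each (a, b) pair to
    its best bit score for use as an edge attribute.
    """
    scores = {}
    for query, subject, bitscore in hits:
        if query == subject:
            continue
        key = tuple(sorted((query, subject)))
        if key not in scores or bitscore > scores[key]:
            scores[key] = bitscore

    # One tab-delimited SIF line per edge, in a stable sorted order
    sif_lines = [f"{a}\t{interaction}\t{b}" for a, b in sorted(scores)]
    return sif_lines, scores


def write_sif(path, sif_lines):
    """Write SIF lines to a file that Cytoscape can import."""
    with open(path, "w") as handle:
        for line in sif_lines:
            handle.write(line + "\n")
```

The scores dictionary can be written to a separate table and loaded into Cytoscape as an edge attribute, for example to color edges by degree of homology.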

Future directions:

The derivation of a cluster based classification of elements
Automated detection of putative transposable element clusters in the set of graphs derived from the All-by-All BLAST
Precise delineation of transposable elements within the putative transposable element assembly
Creation of a database of the results that is available for browsing with Apollo

Screenshots

Example Set of Pseudo Assembly Networks (PANs)

The 'constellations' below represent networks of homology within the query genome. Each individual node in the networks below represents a short sequence read. The color of each individual node represents the source BAC. The shape of the node represents any homology to known transposable elements. The color of the perimeter of the node represents any homology to known transposable element proteins. The color of the lines connecting any two nodes indicates the degree of homology between the two sequences.

HTML Output Header

Each PAN is assembled with TGICL and BLASTed against databases of known transposable elements. The variables used for the assembly and the BLAST-based homology comparison are indicated in the header of the HTML output.

HTML Output Record

For each of the PANs, the assembled molecules are BLASTed against a set of known transposable elements and compared to a database of profile hidden Markov models for MITEs. The best BLAST hit is shown for each TE database that had a significant hit. The entire BLAST report is available by clicking on the database name under each contig name. The parsed hidden Markov model output can be accessed by clicking on the hmm_mite link under each contig name. The PAN shown below contains a Stowaway MITE.
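
Selecting the best hit per contig and database, as described above, might look like the following Python sketch. It is an assumption-laden illustration rather than RepMiner's implementation: the function name best_hits, the tuple layout of the input, the use of bit score to rank hits, and the 1e-5 significance cutoff are all hypothetical.

```python
def best_hits(hits, max_evalue=1e-5):
    """Return the best significant hit for each (contig, database) pair.

    hits is an iterable of (contig, database, subject, evalue, bitscore).
    Hits with an E-value above max_evalue are treated as not significant.
    Among the remaining hits, the one with the highest bit score wins.
    """
    best = {}
    for contig, database, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # not significant, so nothing is reported
        key = (contig, database)
        if key not in best or bitscore > best[key][2]:
            best[key] = (subject, evalue, bitscore)
    return best
```

A contig and database pair with no significant hit is simply absent from the result, which matches a report that lists only the databases with a significant hit.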

Author: James Estill
Last Updated: Thursday, 19 April 2007