EST sequences are downloaded from the dbEST division of GenBank (release 151206).
    For each EST dataset, dedicated software performs all-against-all comparisons to identify every EST sequence that is identical to, or "strictly contained" in, another sequence. "Strictly contained" means that the contained sequence must be a sub-string of the longer sequence (the container sequence).
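    The containment check can be illustrated with a minimal Python sketch. It is not the dedicated software itself: the function name find_contained and the dictionary input are hypothetical, and the sketch assumes plain, case-insensitive sub-string matching on the given strand only.

```python
def find_contained(seqs):
    """Map each identical or strictly contained sequence id to its container id.

    `seqs` is a dict {sequence_id: nucleotide_string} (hypothetical input format).
    """
    # Visit the longest sequences first, so that a container is always
    # examined before any sequence it may contain.
    order = sorted(seqs, key=lambda sid: (-len(seqs[sid]), sid))
    containers = []   # ids of sequences not contained in any other sequence
    contained = {}    # contained id -> container id
    for sid in order:
        query = seqs[sid].upper()
        for cid in containers:
            # Sub-string test; identical sequences also satisfy it.
            if query in seqs[cid].upper():
                contained[sid] = cid
                break
        else:
            containers.append(sid)
    return contained


if __name__ == "__main__":
    example = {"est1": "ACGTACGT", "est2": "GTAC", "est3": "ACGTACGT", "est4": "TTTT"}
    print(find_contained(example))
```

    Comparing each sequence only against the retained containers is sufficient, because containment is transitive: a sequence contained in a discarded sequence is also contained in that sequence's container.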
    The ParPEST pipeline (Figure 1) was implemented to process EST sequences.
    Sequence data are pre-processed in three steps:
    • RepeatMasker analysis is performed to detect and mask vector contamination, using NCBI's Vector database (update January 2006) as the filtering database.
    • Then, a Perl script developed in house removes the vector contamination from the EST sequences.
    • Finally, a second RepeatMasker analysis is performed to mask low-complexity sub-sequences and repeats, using RepBase (update January 2006) as the filtering database.
    After quality checking and vector trimming, ESTs sharing more than 85% identity over a region longer than 60 nt are grouped into clusters using PaCE.
    Sequences belonging to the same cluster are then assembled into contigs using CAP3. CAP3 parameters were set to -p 85 -o 60, i.e. an overlap percent identity cutoff of 85 and an overlap length cutoff of 60.
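    As an illustration of the assembly step, the following Python sketch builds and runs a CAP3 command line with the parameters given above. The helper names, the executable name "cap3" and the one-FASTA-file-per-cluster layout are assumptions made for this example, not a description of the ParPEST code.

```python
import subprocess


def cap3_command(cluster_fasta, identity=85, overlap=60, cap3_bin="cap3"):
    """Build the CAP3 argument list for one cluster (hypothetical helper).

    -p sets the overlap percent identity cutoff, -o the overlap length cutoff.
    """
    return [cap3_bin, cluster_fasta, "-p", str(identity), "-o", str(overlap)]


def assemble_cluster(cluster_fasta, log_path):
    """Run CAP3 on one cluster, saving its standard output to log_path.

    Assumes a `cap3` executable is on the PATH. CAP3 is expected to write
    contigs and singlets next to the input file (e.g. <input>.cap.contigs).
    """
    with open(log_path, "w") as log:
        subprocess.run(cap3_command(cluster_fasta), stdout=log, check=True)


if __name__ == "__main__":
    # Print the command only; running it requires CAP3 to be installed.
    print(" ".join(cap3_command("cluster_0001.fasta")))
```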
    Functional annotation is performed on both EST sequences and contigs, so that annotation consistency can be checked once the EST sequences are assembled.
    Functional annotation is based on the detection of similarities (E-value ≤ 0.001) with both proteins and non-protein-coding RNAs, by BLAST searches against the UniProt (Release 27012006) and Rfam (version 7.0, March 2006) databases, respectively.
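    The E-value threshold can be applied as in the following Python sketch. It assumes that the BLAST results are available in the standard 12-column tabular format; the function name significant_hits and the sample rows are hypothetical and serve only as an illustration.

```python
import csv
import io

# Column names of the standard 12-column BLAST tabular output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def significant_hits(tabular_text, max_evalue=1e-3):
    """Return the hits whose E-value is at most max_evalue, as dictionaries."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        if float(hit["evalue"]) <= max_evalue:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    # Invented rows, for illustration only.
    sample = (
        "est1\tP12345\t92.5\t120\t9\t0\t1\t120\t30\t149\t1e-20\t85.0\n"
        "est1\tQ99999\t40.0\t50\t30\t0\t5\t54\t10\t59\t0.5\t22.0\n"
    )
    for hit in significant_hits(sample):
        print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```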
    Protein and RNA identifiers are used to build cross-references to the corresponding external databases.
    When the UniProt identifier is recorded in myGO - a mirror of the Gene Ontology database (update January 2006) - gene ontologies are associated with the transcript, integrating the UniProt annotation with an international standard.
    If the Enzyme Commission (EC) number is present in the BLAST hit description lines, a cross-reference to the ENZYME database is provided in order to include information such as the enzyme name and its synonyms, reaction(s), substrate(s) and product(s). Proteins that are associated with EC number(s) are also hyperlinked to myKEGG - a local satellite database built from KEGG (update January 2006) XML formatted files and the related maps in GIF format - allowing the mapping of the expressed sequences onto known metabolic pathways. A BLASTn analysis (E-value ≤ 10⁻⁵) was carried out to establish correspondences both between the EST sequence dataset and the TOM1 cDNA microarray sequence data (kindly provided by Jim Giovannoni in March 2006) and between the EST sequence dataset and the Affymetrix Tomato Genome Array probe sets.
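    The detection of EC numbers in hit description lines can be sketched in Python as follows. The regular expression and the function name ec_numbers are assumptions: they expect the number to be written as "EC" followed by four dot-separated fields, with "-" allowed for unspecified fields, which may not cover every description format.

```python
import re

# "EC" followed by four dot-separated fields; "-" marks an unspecified field.
EC_PATTERN = re.compile(r"\bEC\s+(\d+(?:\.(?:\d+|-)){3})")


def ec_numbers(description):
    """Return all EC numbers found in a BLAST hit description line."""
    return EC_PATTERN.findall(description)


if __name__ == "__main__":
    # Invented description lines, for illustration only.
    print(ec_numbers("Alcohol dehydrogenase 1 (EC 1.1.1.1) (ADH)"))
    print(ec_numbers("Putative protease (EC 3.4.-.-)"))
```

    Each extracted number can then serve as the key for the cross-references to ENZYME and myKEGG described above.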


    Figure 1.