EST sequences are downloaded from dbEST and from the Nucleotide/mRNA division of GenBank (release 011008).
The ParPEST pipeline is used to process the EST reads.
Sequence data are pre-processed for I) the detection and trimming of vector contamination, using NCBI's Vector database (update October 2008) as the filtering database, and II) the masking of simple sequence repeats, low-complexity sub-sequences and other DNA repeats, using RepBase (update October 2008) as the filtering database.
After quality checking and vector trimming, ESTs sharing more than 85% identity over a region longer than 60 nt are grouped into clusters.
Sequences belonging to the same cluster are then assembled into TCs (Tentative Consensus sequences).
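The grouping criterion described above (more than 85% identity over a region longer than 60 nt) can be illustrated with a minimal sketch. It assumes that pairwise alignments have already been computed and that clusters are formed by single linkage, i.e. two ESTs end up in the same cluster whenever a chain of qualifying alignments connects them. Both the linkage rule and the names `cluster_ests` and `pairwise_hits` are assumptions made for illustration; they are not taken from ParPEST itself.

```python
from collections import defaultdict

MIN_IDENTITY = 85.0  # percent; pairs must share MORE than this identity
MIN_OVERLAP = 60     # nucleotides; aligned region must be LONGER than this


def cluster_ests(est_ids, pairwise_hits):
    """Group ESTs by single linkage over qualifying pairwise alignments.

    pairwise_hits is an iterable of tuples:
    (est_a, est_b, percent_identity, aligned_length).
    """
    parent = {est: est for est in est_ids}

    def find(est):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[est] != est:
            parent[est] = parent[parent[est]]
            est = parent[est]
        return est

    for est_a, est_b, identity, length in pairwise_hits:
        if identity > MIN_IDENTITY and length > MIN_OVERLAP:
            root_a, root_b = find(est_a), find(est_b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for est in est_ids:
        clusters[find(est)].append(est)
    return sorted(sorted(members) for members in clusters.values())


# Toy input: identifiers and alignment values are invented for the example.
hits = [
    ("EST1", "EST2", 97.0, 210),  # qualifies
    ("EST2", "EST3", 91.5, 75),   # qualifies, links EST3 to EST1 transitively
    ("EST3", "EST4", 99.0, 40),   # overlap too short
    ("EST4", "EST5", 80.0, 300),  # identity too low
]
print(cluster_ests(["EST1", "EST2", "EST3", "EST4", "EST5"], hits))
```

In this toy input, EST1, EST2 and EST3 form one cluster, while EST4 and EST5 remain singletons; only multi-member clusters would go on to be assembled into a TC.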
Functional annotation is performed on both the EST sequences and the TCs, so that the consistency of the annotation can be checked once the EST sequences are assembled.
Functional annotation is based on the detection of similarities (E-value ≤ 0.001) with proteins by BLAST searches against the UniProtKB (Release 14.3) database.
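As an illustration of the E-value threshold, the sketch below filters BLAST hits reported in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score). Two points are assumptions rather than statements from the text: that the hits are available in this tabular form, and that a single best-scoring hit is retained per query. The function name `best_hits` and all identifiers in the sample lines are hypothetical.

```python
E_VALUE_CUTOFF = 1e-3  # hits are kept only if E-value <= 0.001


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue, bitscore)} for significant hits.

    Only the highest bit-score hit per query is kept (an assumption of
    this sketch, not a documented feature of the pipeline).
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > cutoff:
            continue  # not significant under the chosen threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Invented example rows: two hits for a TC, one non-significant hit for an EST.
lines = [
    "TC1\tPROT_A\t78.5\t200\t43\t0\t1\t600\t10\t209\t2e-40\t160.0",
    "TC1\tPROT_B\t60.0\t150\t60\t0\t1\t450\t5\t154\t1e-10\t65.0",
    "EST9\tPROT_C\t35.0\t50\t32\t1\t1\t150\t3\t52\t0.5\t28.0",
]
for query, (subject, evalue, bitscore) in best_hits(lines).items():
    print(query, subject, evalue, bitscore)
```

Here `TC1` is annotated with `PROT_A`, while `EST9` receives no annotation because its only hit exceeds the threshold.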
Protein identifiers are used to build cross-references to the corresponding external databases.
In addition, if the UniProt identifier is recorded in myGO - a mirror of the Gene Ontology database (update October 2008) - the corresponding gene ontology terms are associated with the transcript.
In the same way, if the UniProt identifier is recorded in myENZYME - a local database that was built by parsing the ENZYME database data file "enzyme.dat" (release 14 October 2008) - a cross-reference to the ENZYME database is provided in order to include information such as the enzyme name and its synonyms, reaction(s), substrate(s) and product(s).
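A minimal sketch of how the "enzyme.dat" flat file could be parsed into a lookup keyed by UniProt accession is given below. It relies on the line codes of the ENZYME flat-file format (`ID` for the EC number, `DE` for the name, `AN` for alternative names, `CA` for the catalysed reaction, `DR` for UniProt cross-references, `//` as the entry terminator). The function name `parse_enzyme_dat`, the dictionary layout and the toy record (including its accessions) are hypothetical, and the sketch does not claim to reproduce how myENZYME was actually built. For brevity, wrapped `AN` lines are not rejoined.

```python
def parse_enzyme_dat(lines):
    """Map each UniProt accession to the ENZYME entries that reference it."""
    by_accession = {}
    entry = None
    for raw in lines:
        line = raw.rstrip("\n")
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            # A new entry starts with its EC number.
            entry = {"ec": value, "name": "", "synonyms": [],
                     "reaction": "", "accessions": []}
        elif entry is None:
            continue  # file header lines before the first entry
        elif code == "DE":
            entry["name"] = (entry["name"] + " " + value).strip()
        elif code == "AN":
            entry["synonyms"].append(value.rstrip("."))
        elif code == "CA":
            entry["reaction"] = (entry["reaction"] + " " + value).strip()
        elif code == "DR":
            # Each cross-reference has the form "ACCESSION, ENTRY_NAME;".
            for pair in value.split(";"):
                if pair.strip():
                    entry["accessions"].append(pair.split(",")[0].strip())
        elif code == "//":
            for accession in entry["accessions"]:
                by_accession.setdefault(accession, []).append(entry)
            entry = None
    return by_accession


# Hand-written toy record in the flat-file layout; accessions are placeholders.
sample = """\
CC   Toy header line, ignored by the parser.
ID   1.1.1.1
DE   Alcohol dehydrogenase.
AN   Aldehyde reductase.
CA   An alcohol + NAD(+) = an aldehyde or ketone + NADH.
DR   X00001, DEMO1_SPECA;  X00002, DEMO2_SPECB;
//
""".splitlines()

enzymes = parse_enzyme_dat(sample)
for hit in enzymes.get("X00001", []):
    print(hit["ec"], hit["name"], hit["synonyms"], hit["reaction"])
```

With such a lookup, a transcript whose best protein match is a given UniProt accession can be linked to the enzyme name, synonyms and reaction in a single dictionary access.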
Finally, proteins that are associated with EC number(s) are also hyperlinked to myKEGG - a local satellite database built from KEGG (update October 2008) XML-formatted files and the related maps in GIF format - allowing the mapping of the expressed sequences onto known metabolic pathways.