-
EST processing and contig assembly
Crocus sativus EST sequences were analysed using the pipeline ParPEST developed by the CAB group.
Base calling was performed using Phred with a quality cutoff of 0.05.
Vector contamination was identified using RepeatMasker with NCBI's UniVec as the filtering database.
RepeatMasker and RepBase (January 2006 update) were used to filter and mask low-complexity subsequences and interspersed repeats.
EST clustering was performed using PaCE with default parameters.
The ESTs in each cluster were assembled into contigs using CAP3 with an overlapping window of 60 nucleotides and a minimum score of 85.
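To put the Phred cutoff on the more familiar quality-score scale, the sketch below converts between error probability and Phred quality using the standard relation Q = -10 log10(P). It assumes that the cutoff of 0.05 is a per-base error probability, which the text does not state explicitly; the function names are illustrative and are not part of ParPEST.

```python
import math


def error_prob_to_phred(p):
    """Convert a per-base error probability to a Phred quality score."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q):
    """Convert a Phred quality score back to an error probability."""
    return 10.0 ** (-q / 10.0)


# Assumption: the 0.05 cutoff quoted in the text is an error probability.
cutoff_quality = error_prob_to_phred(0.05)
print(f"Error probability 0.05 corresponds to Phred Q = {cutoff_quality:.2f}")
```

Under this assumption, a cutoff of 0.05 corresponds to a quality score of roughly 13.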
-
Functional annotation
Raw ESTs and contigs are compared against the UniProtKB/Swiss-Prot database using BLASTX.
BLAST hits are retained only if their E-value is less than or equal to 0.001.
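The E-value filter and the selection of a best hit per transcript can be sketched as follows. This is a minimal illustration, not the ParPEST implementation: it assumes the BLAST results are available in the standard 12-column tabular format (query, subject, ..., E-value in column 11, bit score in column 12) and that "best hit" means the highest bit score among the hits that pass the filter. The function name and the example identifiers are hypothetical.

```python
import csv

E_VALUE_CUTOFF = 1e-3  # threshold stated in the text


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue, bitscore)} for the best hit per query.

    `tabular_lines` is any iterable of tab-separated lines in the
    12-column BLAST tabular layout (an assumption of this sketch).
    """
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # keep only hits with E-value <= cutoff
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical example records, for illustration only.
example = [
    "est1\tSUBJ_A\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-30\t120",
    "est1\tSUBJ_B\t60.0\t100\t40\t0\t1\t300\t1\t100\t1e-10\t70",
    "est2\tSUBJ_C\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t25",
]
print(best_hits(example))
```

In this example, `est1` keeps its higher-scoring hit and `est2` is dropped because its only hit exceeds the E-value cutoff.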
A transcript is associated with Gene Ontology (GO) terms when the accession number of its subject protein is reported in the myGO database.
All the GO terms related to each best BLAST hit are converted to plant GO slim terms using the map2slim.pl script, distributed as part of the go-perl package (v 0.04) (Fig. 1).
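The idea behind mapping a GO term to a slim can be sketched as below. This is a simplified illustration and not the algorithm implemented in map2slim.pl: it assumes the ontology is given as a child-to-parents dictionary and maps each term to its most specific ancestors that belong to the slim set. All identifiers in the toy ontology are hypothetical.

```python
def ancestors(term, parents):
    """Return the set of all ancestors of `term`, including `term` itself."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def map_to_slim(term, parents, slim_terms):
    """Return the most specific slim terms that are `term` or one of its ancestors."""
    candidates = ancestors(term, parents) & set(slim_terms)
    # A candidate is redundant if it is a strict ancestor of another candidate.
    redundant = set()
    for candidate in candidates:
        redundant |= ancestors(candidate, parents) - {candidate}
    return candidates - redundant


# Toy ontology with hypothetical identifiers (child -> list of parents).
toy_parents = {
    "toy:carotenoid_biosynthesis": ["toy:biosynthesis"],
    "toy:biosynthesis": ["toy:metabolism"],
    "toy:metabolism": ["toy:biological_process"],
}
toy_slim = {"toy:metabolism", "toy:biological_process"}

print(map_to_slim("toy:carotenoid_biosynthesis", toy_parents, toy_slim))
```

Here the specific term is reported under `toy:metabolism` only, because the root term is an ancestor of that slim term and would add no information.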
A transcript is associated with an Enzyme Commission (EC) number if that number is present in the description line of its best hit.
Transcripts associated with EC numbers are also linked to myKEGG and can be mapped onto the metabolic pathways.
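Extracting EC numbers from a best-hit description line can be sketched with a regular expression. This is an illustrative sketch rather than the code used by the pipeline: it assumes EC numbers appear in the description either as `EC 1.2.3.4` or as `EC=1.2.3.4`, with `-` allowed for unspecified fields. The pattern, function name, and example descriptions are hypothetical.

```python
import re

# "EC" followed by a space or "=", then four dot-separated fields,
# each either digits or "-" (partially specified EC numbers).
EC_PATTERN = re.compile(r"\bEC[ =]((?:\d+|-)(?:\.(?:\d+|-)){3})")


def extract_ec_numbers(description):
    """Return all EC numbers found in a hit description line, in order."""
    return EC_PATTERN.findall(description)


# Hypothetical description lines, for illustration only.
print(extract_ec_numbers("Example enzyme (EC 1.2.3.4)"))
print(extract_ec_numbers("Bifunctional example protein (EC 1.2.3.4) (EC 5.6.7.-)"))
print(extract_ec_numbers("Uncharacterized protein"))
```

A transcript whose best hit yields an empty list would simply have no EC association and therefore no link to myKEGG.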
-
Building the myGO and myKEGG databases
The main database is integrated with two local satellite databases:
myGO, a mirror of the Gene Ontology database built by running the SQL scripts from go_*-assocdb-tables.tar.gz, and
myKEGG, built from KEGG XML-formatted files and the related maps in GIF format.
Figure 1.