CAP3 Sequence Assembly Program


Introduction

We have made the following improvements to the CAP sequence assembly program.

1. Use of forward-reverse constraints to correct assembly errors and link contigs.

2. Use of base quality values in alignment of sequence reads.

3. Automatic clipping of 5' and 3' poor regions of reads.

4. Generation of assembly results in ace file format for Consed.

5. CAP3 can be used in GAP4 of the Staden package.

The improved program is named CAP3.

These improvements allow CAP3 to handle longer sequences with higher error rates and to produce more accurate consensus sequences.


Use of constraints in layout generation

A forward-reverse constraint is often produced by sequencing both ends of a subclone. A forward-reverse constraint specifies that the two reads should be on the opposite strands of the DNA molecule within a specified range of distance.

CAP3 makes use of a large number of forward-reverse constraints to locate and correct errors in layout of sequence reads. This capability allows CAP3 to address assembly errors due to repeats.

CAP3 also uses constraints to link contigs separated by a gap. This feature provides useful information to sequence finishers.

The algorithm used in CAP3 is designed to tolerate wrong constraints, which are due to errors in naming and lane tracking.

Use of quality values in alignment

CAP3 makes use of base quality values in constructing an alignment of sequence reads and generating a consensus sequence for each contig.

This allows the program to use both base quality values and the depth of coverage at a position to improve the accuracy in generating a consensus base at the position.

The alignment method in CAP3 is very tolerant of reads with high rates of sequencing errors.

Automatic clipping of 5' and 3' poor regions

CAP3 clips 5' and 3' poor regions of reads and uses only good regions of reads in assembly. Thus there is no need to perform clipping in advance. Note that vector sequences in reads should be masked before using CAP3.

Input to CAP3

Input sequences file format

CAP3 takes as input a file of sequence reads in FASTA format. If the names of reads contain a dot ('.'), CAP3 requires that the names of reads sequenced from the same subclone contain the same substring up to the first dot.
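
The naming rule above can be checked before running an assembly. The following is a minimal Python sketch, not part of CAP3: the function names and the example read names are hypothetical, and treating a name without a dot as its own subclone is an assumption, since the rule is only stated for names that contain a dot.

```python
from collections import defaultdict


def subclone_key(read_name):
    """Return the substring of a read name up to the first dot.

    Assumption: a name without a dot is used whole as its own key.
    """
    return read_name.split(".", 1)[0]


def group_by_subclone(read_names):
    """Group read names that share the same substring before the first dot."""
    groups = defaultdict(list)
    for name in read_names:
        groups[subclone_key(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical read names, for illustration only.
    for key, names in group_by_subclone(["cl01.f", "cl01.r", "cl02.f"]).items():
        print(key, names)
```

Reads that should come from one subclone but land in different groups point to a naming problem that is worth fixing before assembly.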

FASTA Format:

The first line begins with the symbol '>' followed by the name of the sequence. The sequence is on the remaining lines. The sequence must not contain blanks. The sequence could be in upper or lower case. Below is an example sequence in FASTA format:
>DNA sequence
GCCCCCGGCCCCGCCCCGGCCCCGCCCCCGGCCCCGCCCCGCAAGGGTC
ACAGGTCACGGGGCGGGGCCGAGGCGGAAGCGCCCGCAGCCCGGTACCG
CTCCTCCTGGGCTCCCTCTAGCGCCTTCCCCCCGGCCCGACTCCGCTGG
CAGCGCCAAGTGACTTACGCCCCCGACCTCTGAGCCCGGACCGCTAGGC
GGAGGATCAGATCTCGCTCGAGAATCTGAAGGTGCCCTGGTCCTGGAGG
AGTTCCGTCCCAGCCCGCGGTCTCCCGGTACTGTCGGGCCCCGGCCCTC

CAP3 takes two optional files: a file of quality values in FASTA format and a file of forward-reverse constraints.

Quality value file

The file of quality values must be named "xyz.qual", and the file of forward-reverse constraints must be named "xyz.con", where "xyz" is the name of the sequence file. CAP3 uses the same format of a quality file as Phrap. The sequence file and the corresponding quality file must be arranged in the same order in terms of reads, where for each read, the same name must be used in both files and the number of bases must be equal to the number of quality values.
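
The pairing requirements between the sequence file and the quality file can be verified with a short script. This is a minimal Python sketch under stated assumptions, not code from CAP3: the function names are hypothetical, the read name is taken to be the first whitespace-delimited token after '>', and quality values are assumed to be whitespace-separated integers on the lines after each header.

```python
def read_records(lines):
    """Parse FASTA-style lines into a list of (name, body) pairs.

    Body lines of one record are joined with single spaces.
    Assumption: the name is the first token after '>'.
    """
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, " ".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, " ".join(chunks)))
    return records


def check_pairing(seq_lines, qual_lines):
    """Return a list of problems; an empty list means the two files agree."""
    seqs = read_records(seq_lines)
    quals = read_records(qual_lines)
    problems = []
    if len(seqs) != len(quals):
        problems.append(f"{len(seqs)} reads but {len(quals)} quality records")
    for (sname, sbody), (qname, qbody) in zip(seqs, quals):
        if sname != qname:
            problems.append(f"order/name mismatch: {sname} vs {qname}")
            continue
        n_bases = len(sbody.replace(" ", ""))
        n_quals = len(qbody.split())
        if n_bases != n_quals:
            problems.append(f"{sname}: {n_bases} bases but {n_quals} quality values")
    return problems
```

With files named as described above, the check could be run as check_pairing(open("xyz"), open("xyz.qual")), where "xyz" stands for the name of the sequence file.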

Constraint file

Each line of the constraint file specifies one forward-reverse constraint of the form:
ReadA   ReadB    MinDistance    MaxDistance
where ReadA and ReadB are names of two reads, and MinDistance and MaxDistance are distances (integers) in base pairs. The constraint is satisfied if ReadA in forward orientation occurs in a contig before ReadB in reverse orientation, or ReadB in forward orientation occurs in a contig before ReadA in reverse orientation, and their distance is between MinDistance and MaxDistance. CAP3 works better when more constraints are supplied.
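
The satisfaction rule can be expressed as a small check. The following Python sketch is illustrative only and is not how CAP3 is implemented. The Placement record is hypothetical, and two points are assumptions because the text above does not define them: the distance is measured from the start of the forward read to the end of the reverse read, and "before" is judged by start positions.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    read_a: str
    read_b: str
    min_distance: int
    max_distance: int


@dataclass
class Placement:
    """Hypothetical description of where a read sits in a contig."""
    contig: str
    start: int      # leftmost contig position covered by the read
    end: int        # rightmost contig position covered by the read
    forward: bool   # True if the read is in forward orientation


def parse_constraint(line):
    """Parse one line: ReadA ReadB MinDistance MaxDistance."""
    read_a, read_b, lo, hi = line.split()
    return Constraint(read_a, read_b, int(lo), int(hi))


def is_satisfied(constraint, placements):
    """Check one constraint against a dict mapping read name to Placement."""
    pa = placements.get(constraint.read_a)
    pb = placements.get(constraint.read_b)
    if pa is None or pb is None or pa.contig != pb.contig:
        return False
    # Try both roles: (A forward, B reverse) and (B forward, A reverse).
    for first, second in ((pa, pb), (pb, pa)):
        if first.forward and not second.forward and first.start <= second.start:
            # Assumed distance definition: outer span of the pair.
            distance = second.end - first.start + 1
            if constraint.min_distance <= distance <= constraint.max_distance:
                return True
    return False
```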

Output from CAP3

Assembly results in CAP format go to the standard output and need to be redirected to a file. Note that clipped 5' and 3' sequences of reads are not shown in CAP format output.

CAP3 also produces assembly results in ace file format (".ace"). This allows CAP3 output to be viewed in Consed. Note that clipped 5' and 3' sequences of reads are shown in ace format output.

CAP3 saves consensus sequences in file ".contigs" and their quality values in file ".contigs.qual". The ".capout" file is the standard output from CAP3; it presents the detailed alignments of the assembled sequences in each contig. Reads that are not used in assembly are put in file ".singlets". Additional information about the assembly is given in file ".info".

Parameters of CAP3

Clipping of poor regions

CAP3 computes clipping positions of each read using both base quality values and similarity information. Clipping of a poor end region of a read f is controlled by three parameters: quality value cutoff qualcut, clipping range crange, and depth of good coverage gdepth. The value for qualcut can be specified with the "-c" option, the value for crange with the "-y" option, and the value for gdepth with the "-z" option.

If there are quality values, CAP3 computes two positions qualpos5 and qualpos3 of read f such that the region of read f from position qualpos5 to position qualpos3 consists mostly of quality values greater than qualcut. If there are no quality values, then qualpos5 is set to 1 and qualpos3 is set to the length of read f. The range for the left clipping position of read f is from 1 to qualpos5 + crange. The range for the right clipping position of read f is from qualpos3 - crange to the end of read f. The minimum depth of good coverage at the left and right clipping positions of read f is expected to be gdepth.
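
The two search ranges follow directly from qualpos5, qualpos3, and crange. Below is a minimal Python sketch of that arithmetic with 1-based positions. The function name is hypothetical, and clamping the ranges to the read boundaries is an assumption that the text above does not spell out.

```python
def clipping_ranges(read_length, qualpos5, qualpos3, crange):
    """Return ((left_lo, left_hi), (right_lo, right_hi)) in 1-based positions.

    The left clipping position is searched in 1 .. qualpos5 + crange and the
    right one in qualpos3 - crange .. read_length.
    Assumption: both ranges are clamped to stay inside the read.
    """
    left = (1, min(qualpos5 + crange, read_length))
    right = (max(qualpos3 - crange, 1), read_length)
    return left, right


if __name__ == "__main__":
    # Illustrative values only: a 500-base read with quality values.
    print(clipping_ranges(500, qualpos5=30, qualpos3=450, crange=100))
    # Without quality values, qualpos5 = 1 and qualpos3 = read length.
    print(clipping_ranges(500, qualpos5=1, qualpos3=500, crange=100))
```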

Let realdepth5 be the maximum real depth of coverage for the initial region of read f ending at position qualpos5 + crange. Let depth5 be the smaller of realdepth5 and gdepth. If depth5 is 0, then the left clipping position of read f is set to qualpos5 by CAP3. The given value for the parameter crange may be too small for read f. CAP3 reports at the start of a .info file that "No overlap is found in the given 5' clipping range for read f." If there are overlaps beyond the given 5' clipping range for read f, CAP3 reports a new clipping range for each overlap. One of the reported range values can be used as a new value for the parameter crange for read f.

If depth5 is greater than 0, the left clipping position of read f is the smallest position x such that x is less than qualpos5 + crange and the region of read f beginning at position x is similar to depth5 other reads. The right clipping position of read f is computed similarly by CAP3. Larger values for the parameters crange and gdepth result in more aggressive clipping of poor end regions. A larger value for crange allows CAP3 to search for the left clipping position in a larger area. A larger value for gdepth may cause CAP3 to clip more bases so that the resulting good portion of read f is similar to more reads.
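
The choice of the left clipping position can be illustrated with a simplified model. The Python sketch below is not CAP3's algorithm. It assumes that similarity has already been reduced to a hypothetical list depth_at, where depth_at[x - 1] is the number of other reads similar to the region of read f beginning at position x, and it treats the end of the clipping range inclusively for simplicity.

```python
def left_clip_position(depth_at, qualpos5, crange, gdepth):
    """Pick a left clipping position (1-based) under a simplified model.

    depth_at[x - 1] is assumed to hold the number of other reads similar to
    the region of the read beginning at position x.
    """
    limit = min(qualpos5 + crange, len(depth_at))  # end of the search range
    window = depth_at[:limit]
    realdepth5 = max(window, default=0)
    depth5 = min(realdepth5, gdepth)
    if depth5 == 0:
        # No overlap in the given range: fall back to the quality position.
        return qualpos5
    for x, depth in enumerate(window, start=1):
        if depth >= depth5:
            return x  # smallest position supported by depth5 other reads
    return qualpos5  # not reached, since depth5 <= max(window)


if __name__ == "__main__":
    depths = [0, 0, 1, 2, 3, 3, 3, 3]  # illustrative values only
    print(left_clip_position(depths, qualpos5=3, crange=3, gdepth=2))
    print(left_clip_position(depths, qualpos5=3, crange=3, gdepth=5))
```

In this model a larger gdepth moves the clipping position to the right when deeper coverage is only reached further into the read, which matches the more aggressive clipping described above.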

The user may provide specific values for the parameters crange and gdepth for individual reads in a file. Each line in the file has the following format:
ReadName     crange5     gdepth5      crange3     gdepth3
where ReadName is the name of a read, crange5 & gdepth5 are values for the 5' end, and crange3 & gdepth3 are for the 3' end.

The file is given to CAP3 with the "-w" option.
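
A file in this format is easy to generate or validate with a script. The following Python sketch parses such a file. The function and type names are hypothetical, and it assumes that the four values are integers separated by whitespace and that blank lines may be skipped.

```python
from typing import Dict, Iterable, NamedTuple


class ClipParams(NamedTuple):
    crange5: int
    gdepth5: int
    crange3: int
    gdepth3: int


def parse_clip_file(lines: Iterable[str]) -> Dict[str, ClipParams]:
    """Parse lines of the form: ReadName crange5 gdepth5 crange3 gdepth3."""
    params = {}
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 5:
            raise ValueError(f"line {lineno}: expected 5 fields, got {len(fields)}")
        name = fields[0]
        params[name] = ClipParams(*(int(value) for value in fields[1:]))
    return params
```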

Similarity score of an overlap

The third measure is based on overlap similarity score. The similarity score of an overlapping alignment is defined using base quality values. Let m be the match score factor, let n be the mismatch score factor, and let g be the gap penalty factor. Values for these parameters can be set with the "-m", "-n", and "-g" options. A match at bases of quality values q1 and q2 is given a score of m * min(q1,q2). A mismatch at bases of quality values q1 and q2 is given a score of n * min(q1,q2). A base of quality value q1 in a gap is given a score of -g * min(q1,q2), where q2 is the quality value of the base in the other sequence right before the gap. The score of a gap is the sum of scores of each base in the gap minus a gap open penalty. The similarity score of an overlapping alignment is the sum of scores of each match, each mismatch, and each gap. With m = 2, an overlap that consists of 25 matches at bases of quality value 10 has a score of 500. If the similarity score of an overlap is less than the overlap similarity score cutoff s, then the overlap is removed.
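
The scoring rules above can be written out as a short function. This is a minimal Python sketch, not CAP3 source code. The representation of an alignment as a list of tuples is hypothetical, the gap open penalty is passed in as a parameter because its value is not given here, and the values of n and g in the example are illustrative only (n is taken to be negative so that mismatches lower the score).

```python
def overlap_score(columns, m, n, g, gap_open):
    """Compute the quality-weighted similarity score of an overlap.

    columns is a list of alignment events (hypothetical representation):
      ("match", q1, q2)
      ("mismatch", q1, q2)
      ("gap", [(q1, q2), ...])  one pair per base in the gap, where q2 is the
                                quality of the base in the other sequence
                                right before the gap
    """
    score = 0
    for column in columns:
        kind = column[0]
        if kind == "match":
            score += m * min(column[1], column[2])
        elif kind == "mismatch":
            score += n * min(column[1], column[2])
        elif kind == "gap":
            # Sum of per-base gap scores minus the gap open penalty.
            score += sum(-g * min(q1, q2) for q1, q2 in column[1]) - gap_open
        else:
            raise ValueError(f"unknown column type: {kind!r}")
    return score


if __name__ == "__main__":
    # The worked example from the text: 25 matches at quality 10 with m = 2.
    matches = [("match", 10, 10)] * 25
    print(overlap_score(matches, m=2, n=-5, g=6, gap_open=0))  # 500
```

An overlap would then be kept only if the returned score is at least the cutoff s.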

Minimum length and percent identity of an overlap

The fourth requirement for an overlap is that the length in bp of the overlap is no less than the value of the minimum overlap length cutoff parameter. The value for this parameter can be changed with the "-o" option. The fifth requirement for an overlap is that the percent identity of the overlap is no less than the minimum percent identity cutoff. The value for this parameter can be changed with the "-p" option. A value of 75 for p means 0.75 or 75%.
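
These two requirements amount to a simple filter. The Python sketch below shows the idea. The function name is hypothetical, and percent identity is assumed to be the number of matching positions divided by the overlap length, which the text above does not define explicitly.

```python
def passes_length_and_identity(overlap_length, matches, min_overlap, min_identity):
    """Apply the minimum overlap length and minimum percent identity cutoffs.

    min_identity is given in percent, so 75 means 75%.
    Assumption: percent identity = 100 * matches / overlap_length.
    """
    if overlap_length <= 0:
        return False
    if overlap_length < min_overlap:
        return False
    percent_identity = 100.0 * matches / overlap_length
    return percent_identity >= min_identity


if __name__ == "__main__":
    # Illustrative cutoffs only: at least 30 bp and at least 75% identity.
    print(passes_length_and_identity(40, 36, min_overlap=30, min_identity=75))
    print(passes_length_and_identity(25, 25, min_overlap=30, min_identity=75))
```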

Availability

The CAP3 program is available upon request from Xiaoqiu Huang at xqhuang@cs.iastate.edu

A server for CAP3 is available at

http://bioinformatics.iastate.edu/aat/sas.html

Acknowledgments

I thank Jun Qian for producing output in ace format and other help, Kathryn Beal for incorporating CAP3 in GAP4, Tim Hunkapiller and Granger Sutton for discussion, and Bruce Roe and Granger Sutton for providing sequence data sets. This project was supported by NIH Grant R01HG01502-02 from NHGRI.