CAP3 Sequence Assembly Program
We have made the following improvements to the CAP sequence assembly program:
1. Use of forward-reverse constraints to correct assembly errors and link contigs.
2. Use of base quality values in alignment of sequence reads.
3. Automatic clipping of 5' and 3' poor regions of reads.
4. Generation of assembly results in ace file format for Consed.
5. CAP3 can be used in GAP4 of the Staden package.
The improved program is named CAP3.
These improvements allow CAP3 to take longer sequences
with higher error rates and produce more accurate consensus sequences.
Use of constraints in layout generation
A forward-reverse constraint is often produced by
sequencing both ends of a subclone.
A forward-reverse constraint specifies that
the two reads should be on the opposite strands of the DNA molecule within
a specified range of distance.
CAP3 makes use of a large number of forward-reverse constraints
to locate and correct errors in layout of sequence reads.
This capability allows CAP3 to address assembly errors due
CAP3 also uses constraints to link contigs separated by a gap.
This feature provides useful information to sequence finishers.
The algorithm used in CAP3 is designed to tolerate wrong constraints,
which are due to errors in naming and lane tracking.
Use of quality values in alignment
CAP3 makes use of base
quality values in constructing
an alignment of sequence reads and generating a consensus sequence
for each contig.
This allows the program to use both base quality values and the depth of coverage
at a position to improve the accuracy in generating a consensus
base at the position.
The alignment method in CAP3 is very tolerant of reads
with high sequencing error rates.
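As an illustration of how quality values and depth of coverage can be combined at one position, the sketch below picks a consensus base by a quality-weighted vote. It is a simplified stand-in and not CAP3's actual consensus rule; the function name consensus_base and the (base, quality) column representation are assumptions made for this example.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick a consensus base for one alignment column.

    `column` is a list of (base, quality) pairs, one per read covering
    the position. ASSUMPTION: each base is weighted by its quality value,
    so several low-quality reads can outvote one high-quality read only
    if their summed quality is larger.
    """
    if not column:
        raise ValueError("column must contain at least one base")
    weight = defaultdict(int)
    for base, quality in column:
        weight[base.upper()] += quality
    # Ties are broken alphabetically so the result is deterministic.
    return max(sorted(weight), key=weight.__getitem__)
```

With this rule, deeper coverage adds more votes and higher quality values add more weight per vote, which is the combination the paragraph above describes.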
Automatic clipping of 5' and 3' poor regions
CAP3 clips 5' and 3' poor regions of reads and uses
only good regions of reads in assembly. Thus there is no need
to perform clipping in advance. Note that
vector sequences in reads should be masked before using CAP3.
Input to CAP3
Input sequences file format
CAP3 takes as input a file of sequence reads in FASTA format.
If the names of reads contain a dot ('.'), CAP3 requires that
the names of reads sequenced from the same subclone contain
the same substring up to the first dot.
The first line begins with the symbol '>' followed by the name of the sequence. The sequence
is on the remaining lines. The sequence must not contain blanks. The sequence could be in
upper or lower case. Below is an example sequence in FASTA format:
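For illustration, the sketch below uses a made-up record (the read names and bases are invented, not taken from the CAP3 distribution) together with a small parser that applies the rules above. Treating the first token after '>' as the read name, and the helper names parse_fasta, subclone_key, and group_by_subclone, are assumptions of this example.

```python
from collections import defaultdict

# Hypothetical input: names and bases are invented for illustration.
EXAMPLE_FASTA = """\
>cloneA.f1
ACGTTGCAAGCTTGCATG
cctgcaggtcgactc
>cloneA.r1
GATTACAGATTACA
"""


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            if not header:
                raise ValueError("header line has no sequence name")
            name, chunks = header[0], []
        else:
            if name is None:
                raise ValueError("sequence data before the first '>' line")
            if any(ch.isspace() for ch in line):
                raise ValueError("sequence lines must not contain blanks")
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def subclone_key(read_name):
    """Substring up to the first dot, or the whole name if there is no dot."""
    return read_name.split(".", 1)[0]


def group_by_subclone(records):
    """Map each subclone key to the names of the reads that share it."""
    groups = defaultdict(list)
    for name, _ in records:
        groups[subclone_key(name)].append(name)
    return dict(groups)
```

Here both reads share the prefix "cloneA", so they would be treated as coming from the same subclone.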
CAP3 takes two optional files: a file of quality values
in FASTA format and a file of forward-reverse constraints.
Quality value file
The file of quality values must be named "xyz.qual", and
the file of forward-reverse constraints must be named "xyz.con",
where "xyz" is the name of the sequence file.
CAP3 uses the same format of a quality file as Phrap.
The sequence file and the corresponding quality file must be arranged
in the same order in terms of reads, where for each read, the same name must
be used in both files and the number of bases must be equal to the number of
quality values.
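The consistency requirement between the two files can be checked before running CAP3. The following is a minimal sketch, assuming the Phrap-style layout of a '>' header line followed by whitespace-separated integer quality values; the function names parse_qual and check_consistency are hypothetical.

```python
def parse_qual(text):
    """Parse quality text into a list of (name, [quality values]) pairs."""
    records = []
    name, values = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, values))
            header = line[1:].split()
            if not header:
                raise ValueError("header line has no read name")
            name, values = header[0], []
        else:
            if name is None:
                raise ValueError("quality values before the first '>' line")
            values.extend(int(tok) for tok in line.split())
    if name is not None:
        records.append((name, values))
    return records


def check_consistency(seq_records, qual_records):
    """Return a list of problems; an empty list means the files agree.

    `seq_records` is a list of (name, sequence) pairs in file order.
    """
    problems = []
    if len(seq_records) != len(qual_records):
        problems.append(
            f"{len(seq_records)} reads but {len(qual_records)} quality records"
        )
    for i, ((sname, seq), (qname, quals)) in enumerate(
        zip(seq_records, qual_records), start=1
    ):
        if sname != qname:
            problems.append(f"record {i}: name {sname!r} does not match {qname!r}")
        elif len(seq) != len(quals):
            problems.append(
                f"{sname}: {len(seq)} bases but {len(quals)} quality values"
            )
    return problems
```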
Each line of the constraint file specifies one forward-reverse constraint
of the form:
ReadA ReadB MinDistance MaxDistance
where ReadA and ReadB are names of two reads, and
MinDistance and MaxDistance are distances (integers) in base pairs.
The constraint is satisfied if ReadA in forward orientation occurs
in a contig before ReadB in reverse orientation, or
ReadB in forward orientation occurs in a contig before ReadA
in reverse orientation, and their distance is between MinDistance
and MaxDistance.
CAP3 works better when a large number of constraints are supplied.
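The constraint format and the satisfaction rule can be sketched as follows. The Placement record is hypothetical, and the way distance is measured here (from the start of the forward read to the end of the reverse read, inclusive) is an assumption; the text above does not define it precisely.

```python
from typing import NamedTuple


class Constraint(NamedTuple):
    read_a: str
    read_b: str
    min_distance: int
    max_distance: int


class Placement(NamedTuple):
    """Hypothetical description of where a read sits in an assembly."""
    contig: str
    start: int      # 1-based position of the leftmost base in the contig
    end: int        # 1-based position of the rightmost base in the contig
    forward: bool   # True if the read is in forward orientation


def parse_constraint(line):
    """Parse one 'ReadA ReadB MinDistance MaxDistance' line."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return Constraint(fields[0], fields[1], int(fields[2]), int(fields[3]))


def is_satisfied(constraint, placements):
    """Check a constraint against a dict of read name -> Placement."""
    a = placements.get(constraint.read_a)
    b = placements.get(constraint.read_b)
    if a is None or b is None or a.contig != b.contig:
        return False
    # Try both roles: (ReadA forward, ReadB reverse) and the converse.
    for fwd, rev in ((a, b), (b, a)):
        if fwd.forward and not rev.forward and fwd.start < rev.start:
            # ASSUMPTION: distance spans both reads, inclusive.
            distance = rev.end - fwd.start + 1
            if constraint.min_distance <= distance <= constraint.max_distance:
                return True
    return False
```

A constraint whose two reads land in different contigs is reported as unsatisfied by this sketch; linking contigs across a gap is a separate use of constraints that it does not model.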
Output from CAP3
Assembly results in CAP format go to the standard output and
need to be directed to a file. Note that
clipped 5' and 3' sequences of reads are not shown in CAP3 format output.
CAP3 also produces assembly results in ace file format (".ace").
This allows CAP3 output to be viewed in Consed.
Note that clipped 5' and 3' sequences of reads are shown in
ace format output.
CAP3 saves consensus sequences in file ".contigs" and
their quality values in file ".contigs.qual".
".capout" file is the standard output from CAP3. The
detailed alignments of assembled sequences in each contig are presented
in this file.
Reads that are not used in assembly are put in file ".singlets".
Additional information about assembly is given in file ".info".
Parameters of CAP3
Clipping of poor regions
CAP3 computes clipping positions of each read using both base quality
values and similarity information. Clipping of a poor end region of
a read f is controlled by three parameters: quality value cutoff
qualcut, clipping range crange, and depth of good coverage gdepth.
The value for qualcut can be specified with the "-c" option,
the value for crange with the "-y" option, and
the value for gdepth with the "-z" option.
If there are quality values, CAP3 computes two positions qualpos5 and
qualpos3 of read f such that the region of read f from position qualpos5
to position qualpos3 consists mostly of quality values greater than qualcut.
If there are no quality values, then qualpos5 is set to 1 and qualpos3 is set
to the length of read f.
The range for the left clipping position of read f is from 1 to qualpos5 + crange.
The range for the right clipping position of read f is from qualpos3 - crange
to the end of read f. The minimum depth of good coverage at
the left and right clipping positions of read f is expected to be gdepth.
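The two search ranges follow directly from qualpos5, qualpos3, and crange. A minimal sketch, using 1-based positions as in the text; the function name clipping_ranges and the clamping of each range to the bounds of the read are assumptions of this example.

```python
def clipping_ranges(read_length, crange, qualpos5=None, qualpos3=None):
    """Return ((left_lo, left_hi), (right_lo, right_hi)) for one read.

    If no quality-based positions are given, qualpos5 defaults to 1 and
    qualpos3 to the read length, as described for reads without quality
    values.
    """
    if qualpos5 is None:
        qualpos5 = 1
    if qualpos3 is None:
        qualpos3 = read_length
    # ASSUMPTION: ranges are clamped so they never leave the read.
    left = (1, min(read_length, qualpos5 + crange))
    right = (max(1, qualpos3 - crange), read_length)
    return left, right
```

For a 600-base read with qualpos5 = 30, qualpos3 = 550, and crange = 100, the left clipping position is searched in positions 1 to 130 and the right clipping position in positions 450 to 600.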
Let realdepth5 be the maximum real depth of coverage for the initial region of
read f ending at position qualpos5 + crange. Let depth5 be
the smaller of realdepth5 and gdepth. If depth5 is 0, then the
left clipping position of read f is set to qualpos5 by CAP3.
In this case, the given value for the parameter crange may be too small for read f,
and CAP3 reports at the start of the .info file that
"No overlap is found in the given 5' clipping range for read f."
If there are overlaps beyond the given 5' clipping range for read f,
CAP3 reports a new clipping range for each overlap. One of the reported
range values can be used as a new value for the parameter crange for read f.
If depth5 is greater than 0, the left clipping position of read f is the smallest position
x such that x is less than qualpos5 + crange and the region of read f beginning at position x
is similar to depth5 other reads. The right clipping position of read f
is computed similarly by CAP3. Larger values for the parameters
crange and gdepth result in more aggressive clipping of poor end regions.
A larger value for crange allows CAP3 to search for the left clipping position
in a larger area. A larger value for gdepth may cause CAP3 to clip more bases
so that the resulting good portion of read f is similar to more reads.
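The rule for the left clipping position can be sketched as follows. The input depth is a hypothetical per-position list giving the number of other reads that are similar to read f at that position, and "similar to depth5 other reads" is modeled simply as depth >= depth5. Both are assumptions of this example, not a description of how CAP3 measures similarity.

```python
def left_clip_position(depth, qualpos5, crange, gdepth):
    """Return the 1-based left clipping position of a read.

    depth[i] is the number of other reads similar to the read at
    position i + 1 (hypothetical input).
    """
    limit = min(len(depth), qualpos5 + crange)
    # Maximum real depth over the initial region ending at qualpos5 + crange.
    realdepth5 = max(depth[:limit], default=0)
    depth5 = min(realdepth5, gdepth)
    if depth5 == 0:
        # No overlap in the clipping range: fall back to qualpos5.
        return qualpos5
    # Smallest x with x < qualpos5 + crange that reaches depth5.
    last = min(len(depth), qualpos5 + crange - 1)
    for x in range(1, last + 1):
        if depth[x - 1] >= depth5:
            return x
    # ASSUMPTION: if depth5 is reached only at the range boundary itself,
    # keep the quality-based position.
    return qualpos5
```

In this sketch a larger gdepth raises depth5, which pushes the returned position to the right, while a larger crange widens the region that is searched.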
The user may provide specific values for the parameters crange and gdepth for individual reads in a file. Each line in the file has the following format:
ReadName crange5 gdepth5 crange3 gdepth3
where ReadName is the name of a read, crange5 & gdepth5 are values for the 5' end,
and crange3 & gdepth3 are for the 3' end.
The file is given to CAP3 with the "-w" option.
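A file in this format can be read or validated with a few lines of code. This is a minimal sketch; the names ReadClipParams and parse_clip_params are hypothetical, and treating the four values as integers is an assumption.

```python
from typing import NamedTuple


class ReadClipParams(NamedTuple):
    crange5: int
    gdepth5: int
    crange3: int
    gdepth3: int


def parse_clip_params(text):
    """Map each read name to its per-read clipping parameters."""
    params = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(
                f"line {lineno}: expected 5 fields, got {len(fields)}"
            )
        name, *values = fields
        params[name] = ReadClipParams(*(int(v) for v in values))
    return params
```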
Similarity score of an overlap
The third measure is based on overlap similarity score.
The similarity score of an overlapping alignment is defined using
base quality values. Let m be the match score factor, let n be
the mismatch score factor, and let g be the gap penalty factor.
Values for these parameters can be set with the "-m", "-n", and "-g" options.
A match at bases of quality values q1 and q2 is given a score of m * min(q1,q2).
A mismatch at bases of quality values q1 and q2 is given a score of n * min(q1,q2).
A base of quality value q1 in a gap is given a score of -g * min(q1,q2),
where q2 is the quality value of the base in the other sequence right before
the gap. The score of a gap is the sum of scores of each base in the gap
minus a gap open penalty. The similarity score of an overlapping alignment
is the sum of scores of each match, each mismatch, and each gap.
With m = 2, an overlap that consists of 25 matches at bases of quality value
10 has a score of 500. If the similarity score of an overlap is less
than the overlap similarity score cutoff s, then the overlap is removed.
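The scoring rules above can be written out directly. In this sketch the alignment is given as a list of columns in a made-up representation, and the gap open penalty is a parameter named gap_open because its value is not stated here; both are assumptions of the example.

```python
def overlap_score(columns, m, n, g, gap_open=0):
    """Similarity score of an overlapping alignment.

    Each column is one of:
      ("match", q1, q2)
      ("mismatch", q1, q2)
      ("gap", [(q1, q2), ...])   one (q1, q2) pair per base in the gap,
                                 where q2 is the quality of the base in the
                                 other sequence right before the gap
    """
    score = 0
    for column in columns:
        kind = column[0]
        if kind == "match":
            _, q1, q2 = column
            score += m * min(q1, q2)
        elif kind == "mismatch":
            _, q1, q2 = column
            score += n * min(q1, q2)
        elif kind == "gap":
            _, bases = column
            score += sum(-g * min(q1, q2) for q1, q2 in bases) - gap_open
        else:
            raise ValueError(f"unknown column type: {kind!r}")
    return score


def keep_overlap(score, s):
    """An overlap is removed if its score is less than the cutoff s."""
    return score >= s
```

Calling overlap_score([("match", 10, 10)] * 25, m=2, n=-5, g=6) reproduces the score of 500 from the example above; the values used for n and g in that call are illustrative only.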
Minimum length and percent identity of an overlap
The fourth requirement for an overlap is that the length in bp of the overlap
is no less than the value of the minimum overlap length cutoff parameter.
The value for this parameter can be changed with the "-o" option.
The fifth requirement for an overlap is that the percent identity of
the overlap is no less than the minimum percent identity cutoff.
The value for this parameter can be changed with the "-p" option.
A value of 75 for p means 0.75 or 75%.
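The two cutoffs can be combined into one check. A minimal sketch, assuming percent identity is computed as the number of matching columns divided by the overlap length; the function and parameter names are hypothetical.

```python
def passes_length_and_identity(overlap_length, matches, min_length, min_percent):
    """Apply the minimum overlap length and minimum percent identity cutoffs.

    ASSUMPTION: percent identity = 100 * matches / overlap_length.
    """
    if overlap_length <= 0:
        return False
    percent_identity = 100.0 * matches / overlap_length
    # "No less than" means values equal to a cutoff pass.
    return overlap_length >= min_length and percent_identity >= min_percent
```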
The CAP3 program is available upon request from Xiaoqiu Huang at email@example.com
Server for is available at
I thank Jun Qian for producing output in ace format and other help,
Kathryn Beal for incorporating CAP3 in GAP4,
Tim Hunkapiller and Granger Sutton for discussion, and
Bruce Roe and Granger Sutton for providing sequence data sets.
This project was supported by NIH Grant R01HG01502-02 from NHGRI.