As described in Holt et al. (2002), plasmid and BAC DNA libraries were constructed with stringently size-selected PEST strain DNA. Two BAC libraries were constructed, one (ND-TAM) using DNA from whole adult male and female mosquitoes and the other (ND-1) using DNA from ovaries of PEST females collected about 24 hours after the blood meal. Plasmid libraries containing inserts of 2.5, 10 and 50 kb were constructed with DNA derived from either 330 male or 430 female mosquitoes. For each sex, several libraries of each insert size class were made, and these were sequenced such that there was approximately equal coverage from male and female mosquitoes in the final data set. Celera Genomics, Genoscope and TIGR contributed sequence data that collectively provided 10.2-fold coverage, assuming a genome size of 278 Mb. The whole-genome data set was assembled with the Celera assembler (MOZ1 assembly), which constituted the basis of the primary genome publication (Holt et al. 2002).

The first update to this assembly (MOZ2) was the result of a concerted effort to correct ambiguities in scaffold map locations and orientations, both by manual analysis of the archived BAC chromosome hybridization photographs and by hybridization of a small number of new BAC clones selected to resolve questions of scaffold orientation. The new AGP file, an early draft of which was first displayed on the An. gambiae genome poster published in the 4 October 2002 issue of Science, formed the basis of a new annotation and gene build displayed on 1 October 2003 (MOZ2) (Mongin et al. 2004). This assembly is also 278 Mb.

In 2006, the major scaffolds were re-ordered into a new golden path file by use of additional physically mapped BAC clones combined with scaffold-to-scaffold sequence comparisons that identified some sequence overlaps. The AgamP3 assembly has a total of 80 scaffolds assigned to and ordered on the chromosome arms X, 2R, 2L, 3R and 3L, 28 of which are newly mapped or oriented. The most significant improvement in this new assembly is the placement of 24 scaffolds (8.64 Mbp) in pericentromeric regions. However, this does not complete the centromeric region of any of the chromosomes. The new GenBank entries, CM000356-CM000360, reflect the revised 2L, 2R, 3L, 3R and X chromosome assemblies. At ~264 Mb of non-redundant sequence, this new assembly (AgamP3) is probably still an overestimate of the true genome size (Sharakhova et al. 2007).

This assembly differs from the previous version, AgamP3, by the addition of the mitochondrial genome (L20934, 16,655 bp), which includes 13 protein-coding genes and 24 ncRNA genes (22 tRNA and 2 rRNA).

Additional notes compiled on known assembly issues from the initial VectorBase project can be found here.

New in situ Scaffold Mappings

  • Using cDNA for in situ hybridization, 15 previously unmapped scaffolds totaling 5.34 Mbp have been mapped to the pericentromeric regions of the chromosomes.
  • 23 scaffolds, previously mapped to chromosomes but with ambiguous direction, have been oriented.
  • Further analysis of scaffolds using in situ hybridization of BAC clones allowed us to identify 1.96 Mbp (5 scaffolds) that spanned physical gaps between scaffolds on euchromatic parts of the chromosomes.
  • Analysis of BAC end sequences has found 23 BAC clones that span the synthetic inter-scaffold gaps.
  • Unmapped scaffolds have been aligned to the chromosome assemblies in silico, identifying 144 scaffolds totaling 8.18 Mbp that are already represented in the current Golden Path.
  • A list of the newly mapped scaffolds can be found here or on the right.
  • The list of 144 overlapping previously unmapped scaffolds can be found here or on the right.

Identification of Overlapping Scaffold Ends

Using Exonerate and Dotter, adjacent scaffolds whose ends contain overlapping regions have been identified through visual inspection. For a stretch of sequence covered by two scaffolds, we have taken one of the overlapping regions and selected it for use in our updated An. gambiae Golden Path. The overlapping sequence from the other scaffold will still be associated with the same region on the chromosome, except that it will be listed as a haplotype region instead of as part of the Golden Path.

Using these techniques, 18 major overlaps have been identified between scaffolds mapped to chromosome arms. Based on these overlaps, approximately 3.5 Mb of overlapping sequence has been removed from the current Golden Path and reclassified as haplotype regions.

  • The AGP file describing the new An. gambiae assembly can be found here.
  • The list of overlapping scaffolds and the haplotype regions can be found here and on the right.
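
As a rough illustration of how an AGP file describes a golden path, the sketch below sums placed sequence and gap lengths per object (for example, a chromosome arm). It assumes the standard tab-separated AGP layout, in which column 5 holds the component type and the values N and U mark gaps. The function name `summarize_agp` and the example records are hypothetical and are not taken from the actual An. gambiae AGP file.

```python
from collections import defaultdict


def summarize_agp(lines):
    """Sum placed-sequence and gap lengths per object in AGP-formatted lines."""
    totals = defaultdict(lambda: {"sequence": 0, "gap": 0})
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split("\t")
        obj, start, end, comp_type = cols[0], int(cols[1]), int(cols[2]), cols[4]
        span = end - start + 1  # AGP coordinates are 1-based and inclusive
        kind = "gap" if comp_type in ("N", "U") else "sequence"
        totals[obj][kind] += span
    return {name: dict(counts) for name, counts in totals.items()}


# Hypothetical records for illustration only.
example = [
    "# made-up AGP records, not from the real assembly",
    "2L\t1\t1000\t1\tW\tscaffold_A\t1\t1000\t+",
    "2L\t1001\t1100\t2\tN\t100\tcontig\tno\tna",
    "2L\t1101\t1600\t3\tW\tscaffold_B\t1\t500\t-",
]
print(summarize_agp(example))
```

A summary of this kind gives a quick check of how much sequence each arm carries in the golden path versus how much is gap.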

Y Chromosome Scaffold Identification

The scaffolds containing Y chromosome-specific satellite DNA families were identified by in silico searches. Initially, each male-only scaffold was screened for the possible presence of satellite DNA using the Tandem Repeats Finder software. Subsequently, the consensus sequence of each identified tandem repeat family was used as a query in BLASTN searches against two databases: one made of scaffold sequences derived exclusively from male libraries, and one containing all scaffolds constituting the An. gambiae genome. Satellite DNA queries that returned the same number of hits in both databases were regarded as potentially Y-linked, and in each case their Y-linkage was confirmed experimentally using PCR and Southern blot techniques. All scaffolds harboring the Y-specific satellite sequences were treated as originating from the Y chromosome.
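
The hit-count comparison described above can be sketched as follows. This is a minimal illustration, not the original pipeline: it assumes the BLASTN results are available as tab-separated lines whose first column is the query identifier, and the function names and example identifiers are hypothetical.

```python
from collections import Counter


def count_hits(tabular_lines):
    """Count hits per query ID (first column of tab-separated BLAST output)."""
    counts = Counter()
    for line in tabular_lines:
        line = line.strip()
        if line and not line.startswith("#"):
            counts[line.split("\t")[0]] += 1
    return counts


def y_linked_candidates(male_only_hits, all_scaffold_hits):
    """Return queries with the same nonzero hit count in both databases."""
    male = count_hits(male_only_hits)
    everything = count_hits(all_scaffold_hits)
    return sorted(q for q, n in male.items() if n > 0 and n == everything.get(q, 0))


# Hypothetical hits: satA is found only on male-derived scaffolds,
# while satB has an extra hit outside the male-only database.
male_hits = ["satA\tscaf1", "satA\tscaf2", "satB\tscaf3"]
all_hits = ["satA\tscaf1", "satA\tscaf2", "satB\tscaf3", "satB\tscaf9"]
print(y_linked_candidates(male_hits, all_hits))
```

Queries passing this filter are only candidates; as the text notes, Y-linkage still had to be confirmed experimentally.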

One scaffold (AAAB01008227), containing more complex sequences, was serendipitously discovered in a TBLASTN search of the unmapped scaffold set using as queries GenBank-derived sequences of sex-determining or male fertility-related proteins. Although sequence similarity of the scaffold to the query (GenBank accession no. B21124) was limited to a low-complexity (microsatellite) region, implying a lack of homology between the query and the subject, PCR experiments confirmed the Y-linkage of that scaffold.

  • Download the list of Y chromosome scaffolds here or on the right.

Bacterial Scaffold Identification

In the current An. gambiae assembly, 678 unmapped scaffolds have been identified as bacterial contaminants. All unmapped scaffolds were used as queries in a BLAST search against NCBI's nr protein database. Based on the results, a scaffold was classified as a bacterial contaminant if it met the following criteria:

  • The scaffold had no match or overlap to any other An. gambiae scaffold.
  • Top hits against the scaffold were from bacterial proteins, with E-values at least five orders of magnitude more significant (that is, lower) than those of hits to proteins from other organisms.

The classification criteria were verified by randomly selecting, from currently mapped scaffolds, an amount of sequence equal to the total length of all the newly identified bacterial scaffolds, dividing that sequence into 678 smaller chunks, and then running BLAST against NCBI's nr database with those chunks. Hits were examined using the same criteria. Two of the 678 chunks met the criteria that would classify them as bacterial, but individual inspection showed that this was due to similarity in low-complexity regions.

  • A full list of scaffolds reclassified as bacterial contaminants can be found here or on the right.
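
The classification rule can be sketched as a small predicate. This is an illustration under stated assumptions rather than the procedure actually used: it reads the E-value criterion as the best bacterial hit being at least five orders of magnitude more significant (a lower E-value) than the best hit from any other organism, and it assumes that a scaffold with bacterial hits but no other hits at all passes. The function and parameter names are hypothetical.

```python
import math


def is_bacterial_contaminant(best_bacterial_evalue, best_other_evalue,
                             overlaps_other_scaffold, min_orders=5.0):
    """Apply the two criteria to one scaffold's summarized BLAST results.

    E-value arguments are the best (lowest) E-values, or None if no hit.
    """
    if overlaps_other_scaffold or best_bacterial_evalue is None:
        return False
    if best_other_evalue is None:
        return True  # assumption: only bacterial hits were found
    floor = 1e-300  # BLAST may report 0.0; avoid taking log10(0)
    gap = (math.log10(max(best_other_evalue, floor))
           - math.log10(max(best_bacterial_evalue, floor)))
    return gap >= min_orders


# Hypothetical scaffolds for illustration.
print(is_bacterial_contaminant(1e-50, 1e-10, False))  # large gap, no overlap
print(is_bacterial_contaminant(1e-12, 1e-10, False))  # gap of only two orders
print(is_bacterial_contaminant(1e-50, 1e-10, True))   # overlaps a mosquito scaffold
```

Working on log10 of the E-values makes the "orders of magnitude" threshold explicit, and a predicate like this cannot detect matches driven by low-complexity similarity, which is why manual inspection was still needed.
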
Genome Size (bp):
Scaffold N50 (bp): 49 364 325
Scaffold count:
Release date: Wednesday, April 30, 2014


The Anopheles gambiae PEST strain was chosen for genome sequencing because it had both a fixed, standard chromosomal arrangement and a sex-linked pink eye mutation that could readily be used as an indicator of cross-colony contamination. The pink eye mutation originated in a colony called A. gambiae LPE established in 1951 at the London School of Hygiene and Tropical Medicine from mosquitoes collected in Lagos, Nigeria.

Gene sets

24 Apr 2019
20 Feb 2019
22 Oct 2018
21 Feb 2018
24 Oct 2017

Assembly Specific Downloads

 Downloads for this assembly