Gap Filling for a Human MHC Haplotype Sequence
Yuanwei Zhang1, 2, Tao Zhang2, Zuhong Lu1
1School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
2BGI-Shenzhen, Shenzhen, China
To cite this article:
Yuanwei Zhang, Tao Zhang, Zuhong Lu. Gap Filling for a Human MHC Haplotype Sequence. American Journal of Life Sciences. Vol. 4, No. 6, 2016, pp. 146-151. doi: 10.11648/j.ajls.20160406.12
Received: November 15, 2016; Accepted: November 28, 2016; Published: December 1, 2016
Abstract: The major histocompatibility complex (MHC) is recognized as the most variable region in the human genome and has susceptibility to > 100 diseases. We constructed a complete MHC haplotype sequence of MCF cell line by gap filling based on whole genome sequencing (WGS) data. Gaps spanning ~ 1 Mb were filled and 31 genes were annotated in these gaps. This sequence could be used as reference to identify disease associations within this haplotype or similar haplotypes. The method for gap filling can be applied to other MHC haplotypes or other genomic region.
Keywords: Gap Filling, MHC, Haplotype, WGS
The MHC in human is a gene-dense region on Chromosome 6p21.31. The MHC region spans ~4 Mb and covers >120 expressed genes . Forty percent of the expressed loci encode proteins with functions related to immune defense . These include the highly polymorphic human leukocyte antigen (HLA) membrane glycoproteins that present peptides for recognition by T lymphocytes. Many diseases especially autoimmune diseases were found strongly associated with MHC .
The high density, polymorphism, linkage disequilibrium (LD) and frequent non-Mendelian inheritance of gene loci in MHC region have made it challenging to identify variations that cause or contribute to disease phenotypes. To partly resolve these problems, the MHC haplotype project has constructed eight common and disease associated MHC haplotype sequences in European populations . These sequences and their variants have helped to identify a second MHC susceptibility locus for multiple sclerosis . Many diseases are associated with some particular MHC haplotypes [4,6]. The complete sequences of these haplotypes could be used as references to identify variants responsible for the disease associations. However, six of the eight sequenced MHC haplotype still obtain gaps . For example, the MCF haplotype, which has been reported to be associated with rheumatoid arthritis (RA) , has 22 gaps covering about 1Mb.
There have been a few tools available for gap filling of genome assembly like GapBlaster , FGAP , G4ALL , GapFiller  and GapCloser . They use two approaches to fill gaps: paired end reads and assembled contigs from different software. GapBlaster aligns contigs obtained in the assembly of the genome to a draft of the genome, using BLAST or Mummer, and all identified alignments of contigs that extend through the gaps in the draft sequence are presented to the user for further evaluation via the GapBlaster graphical interface . FGAP also aligns multiple contigs against a draft genome assembly to find sequences that overlap gaps . G4ALL is a multiuser software that allows the visualization and curation of a group of contigs that are aligned locally to a reference genome, and can also be used to fill gaps . GapFiller method seeks to find read pairs of which one member matches within a sequence region and the second member falls (partially) within the gap, and the latter reads are then used to close the gap through sequence (k-mer) overlap, and the process is iteratively repeated until no further gaps can be closed . GapCloser also uses the information from paired end reads to extend the sequences of contigs between gaps . GapBlaster and G4ALL are fit for bacterial genomes. FGAP, GapFiller and GapCloser can be used in human genomes, but the complexity of MHC makes it challenging to enclose all gaps by applying these methods. Also, the resource of eight assembled MHC haplotypes is extremely useful for filling gaps. Thus, a particular gap-filling method fit for human MHC is needed.
Here we constructed a complete MHC haplotype sequence of MCF cell line by gap filling based on WGS data from the same cell line. Firstly, sequences from other haplotypes were filled into the gaps as patches. Then the first intermediate sequence with patches was aligned by WGS reads. Using various signals (reads depth, ratio of single end mapping to pair end mapping, frequency of soft clipping and SNP & Indel calling), errors of the sequence were identified and revised. The final complete MCF haplotype sequence was obtained by repeating the reads alignment and mistake revision until no errors were found.
2.1. Filling Gaps by Patches from Other Haplotypes
Sequences as patches were extracted from the other seven sequenced haplotypes. Positions of patches were determined by alignment of adjacent sequences of gaps. For each gap, the type of nearest HLA gene was regarded as symbol of sequence similarity, and the most similar haplotype was selected for patch extraction (Table 1). All gaps represented by N in raw MCF haplotype sequence were replaced by these patches. Blat  was used for alignment. The HLA type is from IMGT/HLA database  using cell query tool.
|Gap id||Gap position||Gap length||Selected haps||Patch position||Patch length|
2.2. Revise Errors by WGS Data
By adding patches, the first intermediate sequence (MCF1) was obtained. WGS data was mapped to hg19 whose MHC region (chr6: 28477797-33448354) was replaced by MCF1. BWA  was used for reads alignment with parameters: aln -o 1 -e 63 -i 15 -L -k 2 -l 31 -t 4 -q 10 -I. Reads depth, ratio of single end mapping reads to paired end mapping reads and frequency of soft clipping were used to identify big errors (Figure 1). Low reads depth implied insertion errors because artificial sequence could not be covered by reads, and it also implied deletion errors because reads spanning the breakpoint could hardly be mapped properly. High ratio of single end mapping reads to paired end mapping reads implied deletion errors since loss of reference sequence would make one read of a pair not aligned. High frequency of soft clipping indicated both insertion and deletion errors since the edge of errors would clip a read to a mapped part and an unmapped part. Big insertion errors were revised by simply deleting it. Big deletion errors were revised by adding patches from other haplotypes. Break points of big insertion and deletion errors were determined by the clipping positions of soft clipped reads. Complex errors which caused by small but dense differences were revised by replacement of consensus sequences. The consensus sequences were obtained by Samtools pileup . Small errors were identified by variation calling pipeline of Samtools  and revised by replacement of called variations. Repeating the reads alignment and error revision until no more error was found.
2.3. Gene Annotation
Gene information of the eight MHC haplotypes from refGene annotations (downloaded from UCSC Genome Browser Data ) were used to get sequences of transcripts from all eight MHC haplotypes. These sequences were mapped to our final complete MCF haplotype sequence (MCF6) by blast  to determine positons. For those transcripts whose sequences could not be mapped perfectly as a whole, exons were picked out for alignment to help positioning.
2.4. Whole Genome Sequencing of MCF Cell Line
High molecular weight genomic DNA (gDNA) was extracted form MCF cell line. The gDNA was sheared into ~500bp fragments. Paired-end shotgun libraries were generated followed by the manufacturer’s instruments (Illumina). DNA libraries were sequenced with Hiseq2000 (Illumina) platform to generate 2 x 91 bp paired-end reads with ~30X genome coverage on average.
3.1. Revised Errors
All changes during gap filling were listed in Figure 2. Firstly, 24 patches from other haplotypes were filled into gaps. Then one big deletion error was revised and small errors were revised to reduce noises. Then one big insertion error, two complex errors and four Indels were revised. After two run of small-error revision, final sequence was obtained. There are two big errors: one big deletion error (~2.4Kb) and one big insertion error (311bp). For the big deletion error, the reads depth was low, and there were many single end mapping reads and soft clipped reads around error position as expected (Figure 3). Alignment of this error region to other haplotypes by UCSC Blat indicated a deleted sequence with ~2.4 Kb in length when comparing to haplotype SSTO (Table 2). For the big insertion error, depth of reads covering the inserted sequence was extremely low, and there were two high-frequency soft clipping positions which marked the edge of the insertion (Figure 4).
The final complete MCF haplotype sequence did not have any significant signals of error (Figure 5). Type of HLA-DRB1 was DRB1*04:01 in the MCF haplotype (recorded in IMGT/HLA database). Region containing HLA-DRB1 in the raw MCF haplotype sequence was a gap. After gap filling, coding sequence of DRB1 from the final complete MCF sequence 100% matched that of DRB1*04:01.
3.3. Gene Annotation
A total of 220 genes and 416 transcripts were annotated. There are 31 genes and 62 transcripts in gap region while 189 genes and 354 transcripts in the original raw MCF (Figure 5). Several important immune genes are in gap region, for example, MICA and HLA-DRB1.
The eight MHC haplotypes were used for extraction of patch sequences and revision of a big deletion error. It is suitable for MCF cell line since they are all European. But for haplotypes from other ethnic groups, use of the existing eight MHC haplotypes would not be appropriate, since European haplotypes could miss structural variations. This problem could be solved by new assembly of haplotypes from different ethnic groups or using other tools like GapFiller  or GapCloser  for local assembly. In this method, short reads from Illumina was used, however, haplotypes with huge structural variations or repeat elements may not be solved by short reads. Shotgun sequencing from the third generation sequencing platform like single molecule real time (SMRT) sequencing by pacbio could be applied to resolve this limitation. For example, SMRT sequencing was successfully used for filling a gap composed of macrosatellite repeats in the facioscapulohumeral muscular dystrophy locus in human chromosome 4 . FGAP can use SMRT sequencing reads for automated gap filling . For now, our method needs some manual operations like using blat in the UCSC genome browser and identification of sequence errors from signals. Also, the whole process is divided into many small steps which need manual linking between two steps. It is necessary to develop algorithm for sequence-error classification and combine the steps as an automated or half-automated pipeline.
We have filled the gaps spanning 1Mb of a MHC haplotype named MCF which is associated with RA. This complete MCF haplotype sequence can be used as reference in disease studies for more precise reads mapping, more accurate variation detection and thus new discoveries of risk loci. The method uses WGS data and existing sequence of MHC haplotypes to fill gaps. It can be applied to any other draft MHC haplotype assemblies.