Sing-Hoi Sze's Research Projects

Research Projects

Our work focuses on the application of computational techniques to solve problems in biology. We also collaborate directly with biologists to help them solve real-life biology problems.

Multiple Sequence Alignment

The focus is on investigating alternative formulations of multiple sequence alignment and on utilizing additional sequences from biological databases to improve alignment performance. Dissatisfied with the NP-hardness of all previous formulations, we have proposed a polynomial time solvable formulation of multiple alignment that achieves comparable performance with the best heuristics. This result challenges the common conception that the multiple alignment problem is hard and provides significant theoretical advances. A second project is based on the observation that alignment of divergent sequences can be improved by searching for and incorporating information in additional sequences from database search. Instead of using hits directly from database search, we have investigated the use of intermediate sequences to remove excessive noise from the hits and achieved significant improvements in performance over current state-of-the-art approaches. The most recent project makes better use of horizontal information by taking into account alignment of neighboring residues when aligning two residues.

Papers

Lu Y. and Sze S.-H. (2009) Improving accuracy of multiple sequence alignment algorithms based on alignment of neighboring residues. Nucleic Acids Research, 37, 463-472.
Lu Y. and Sze S.-H. (2008) Multiple sequence alignment based on profile alignment of intermediate sequences. Journal of Computational Biology, 15, 767-777. (Also appear in Proceedings of the 11th Annual International Conference on Research in Computational Molecular Biology (RECOMB'2007), Lecture Notes in Bioinformatics, 4453, 283-295.)
Sze S.-H., Lu Y. and Yang Q. (2006) A polynomial time solvable formulation of multiple sequence alignment. Journal of Computational Biology, 13, 309-319. (Also appear in Proceedings of the 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB'2005), Lecture Notes in Bioinformatics, 3500, 204-216.)

Software

ISPAlign - Intermediate sequence profile alignment
NRAlign - Neighboring residue alignment
PSAlign - Preserving multiple sequence alignment

Motif Finding with Applications to Predicting Transcription Factor Binding Sites

The focus is on developing new models and formulations to improve motif finding performance. We have developed an algorithm that finds DNA motifs by skipping nonconserved positions, an algorithm that integrates sample-driven and pattern-driven approaches to significantly improve computation time, and an improved pattern-driven algorithm that achieves very good biological performance.

Papers

Zhao X. and Sze S.-H. (2011) Motif finding in DNA sequences based on skipping nonconserved positions in background Markov chains. Journal of Computational Biology, 18, 759-770.
Sze S.-H. and Zhao X. (2006) Improved pattern-driven algorithms for motif finding in DNA sequences. Proceedings of the 2005 Joint RECOMB Satellite Workshops on Systems Biology and Regulatory Genomics, Lecture Notes in Bioinformatics, 4023, 198-211.
Sze S.-H., Lu S. and Chen J. (2004) Integrating sample-driven and pattern-driven approaches in motif finding. Proceedings of the 4th Workshop on Algorithms in Bioinformatics (WABI'2004), Lecture Notes in Bioinformatics, 3240, 438-449.
Sze S.-H., Gelfand M.S. and Pevzner P.A. (2002) Finding weak motifs in DNA sequences. Pacific Symposium on Biocomputing (PSB'2002), 235-246.
Pevzner P.A. and Sze S.-H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB'2000), 269-278.

Software

PosMotif - Finding DNA motifs based on skipping positions in background Markov chains
MotifEnumerator - Improved pattern-driven algorithms for motif finding in DNA sequences

Biological Network Analysis

The focus is on developing efficient algorithms to extract functional substructures from biological networks and on identifying related subnetworks from a given network. We have considered the problem of identifying complexes from a protein interaction network by considering different types of complexes separately. We have also considered the problem of identifying linear biological pathways within a network and developed a new divide-and-conquer technique to obtain improved probabilistic and deterministic algorithms with a much lower time complexity than previous approaches. In another direction, we have investigated the problem of finding a path within a given graph that is most similar to another given path, with applications to identifying conserved pathways in different organisms.

Papers

Fan J.-H., Chen J. and Sze S.-H. (2012) Identifying complexes from protein interaction networks according to different types of neighborhood density. Journal of Computational Biology, 19, 1284-1294.
Chen J., Kneis J., Lu S., Mölle D., Richter S., Rossmanith P., Sze S.-H. and Zhang F. (2009) Randomized divide-and-conquer: improved path, matching, and packing algorithms. SIAM Journal on Computing, 38, 2526-2547.
Qian X., Sze S.-H. and Yoon B.-J. (2009) Querying pathways in protein interaction networks based on hidden Markov models. Journal of Computational Biology, 16, 145-157. (Also appear in Proceedings of the 4th Annual RECOMB Satellite Workshop on Systems Biology.)
Lu S., Zhang F., Chen J. and Sze S.-H. (2007) Finding pathway structures in protein interaction networks. Algorithmica, 48, 363-374.
Yang Q. and Sze S.-H. (2007) Path matching and graph matching in biological networks. Journal of Computational Biology, 14, 56-67.

Software

NDComplex - Identifying complexes from a protein interaction network by considering different types of complexes separately
PathMatch and GraphMatch - Path matching and graph matching in biological networks

Identification of Gene Clusters

The focus is on investigating improved formulations to better model gene clusters. We have developed a supervised protein family classification algorithm that overcomes the problems of existing supervised and unsupervised algorithms, an algorithm that finds conserved gene clusters on multiple genomes, an algorithm that allows large-scale analysis of gene clustering in bacteria, and an algorithm that extracts statistically significant gene clusters from a genome with the help of gene ontology annotations.

Papers

Yi G., Thon M.R. and Sze S.-H. (2012) Supervised protein family classification and new family construction. Journal of Computational Biology, 19, 957-967.
Yang Q., Yi G., Zhang F., Thon M.R. and Sze S.-H. (2010) Identifying gene clusters within localized regions in multiple genomes. Journal of Computational Biology, 17, 657-668.
Yang Q. and Sze S.-H. (2008) Large-scale analysis of gene clustering in bacteria. Genome Research, 18, 949-956.
Yi G., Sze S.-H. and Thon M.R. (2007) Identifying clusters of functionally related genes in genomes. Bioinformatics, 23, 1053-1060.

Software

SClassify - Supervised protein family classification
GCFinder - Finding conserved gene clusters on multiple genomes
GCQuery - Finding related gene clusters on a given genome from a given query cluster

Collaborations with Biologists

Papers

Sze S.-H., Dunham J.P., Carey B., Chang P.L., Li F., Edman R.M., Fjeldsted C., Scott M.J., Nuzhdin S.V. and Tarone A.M. (2012) A de novo transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted alternative splices, single nucleotide polymorphisms, and transcript expression estimates. Insect Molecular Biology, 21, 205-221.
Madina B.R., Kuppan G., Vashisht A.A., Liang Y.-H., Downey K.M., Wohlschlegel J.A., Ji X., Sze S.-H., Sacchettini J.C., Read L.K. and Cruz-Reyes J. (2011) Guide RNA biogenesis involves a novel RNase III family endoribonuclease in Trypanosoma brucei. RNA, 17, 1821-1830.
Zhu H., Hu F., Wang R., Zhou X., Sze S.-H., Liou L.W., Barefoot A., Dickman M. and Zhang X. (2011) Arabidopsis Argonaute10 specifically sequesters miR166/165 to regulate shoot apical meristem development. Cell, 145, 242-256.
Lee J.J., Hassan O.S.S., Gao W., Wei N.E., Kohel R.J., Chen X.-Y., Payton P., Sze S.-H., Stelly D.M. and Chen Z.J. (2006) Developmental and gene expression analyses of a cotton naked seed mutant. Planta, 223, 418-432.
Wang N., Lu S.-E., Yang Q., Sze S.-H. and Gross D.C. (2006) Identification of the syr-syp box in the promoter regions of genes dedicated to syringomycin and syringopeptin production by Pseudomonas syringae pv. syringae B301D. Journal of Bacteriology, 188, 160-168.
Yang S.S., Cheung F., Lee J.J., Ha M., Wei N.E., Sze S.-H., Stelly D.M., Thaxton P., Triplett B., Town C.D. and Chen Z.J. (2006) Accumulation of genome-specific transcripts, transcription factors and phytohormonal regulators during early stages of fiber cell development in allotetraploid cotton. The Plant Journal, 47, 761-775.
Wang J., Adelson D.L., Yilmaz, A., Sze S.-H., Jin Y. and Zhu J.J. (2005) Genomic organization, annotation, and ligand-receptor inferences of chicken chemokines and chemokine receptor genes based on comparative genomics. BMC Genomics, 6(45).
Lunyak V.V., Burgess R., Prefontaine G.G., Nelson C., Sze S.-H., Chenoweth J., Schwartz P., Pevzner P.A., Glass C., Mandel G. and Rosenfeld M.G. (2002) Corepressor-dependent silencing of chromosomal regions encoding neuronal genes. Science, 298, 1747-1752.