Date of Award
Master of Science (MS)
Dr. Hesham H. Ali
Finding homologous proteins (or clusters of homologous proteins) is very important, since this information is required for nearly any further analysis of proteins. Two sequences are said to be homologous if they derive from a common ancestor. Generally, sequence similarity provides the basis for inferring homology. There are several approaches to clustering proteins based on their homology, but very few involve graph algorithms. In this project, five graph-based algorithms are implemented to cluster proteins: finding strongly connected components (SCC), finding partially strongly connected components (PSCC), graph coloring, merging, and random merging. Two classified protein data sets, SCOP and CATH, are used. Two alignment algorithms are used to generate raw scores for the two data sets: BLAST, which is heuristic, and Sqealn, which is optimal. The first is used to process the SCOP data set and the second is used to process the CATH data set. The experimental results show that the SCC and PSCC algorithms generate much better results than the other three algorithms, and that the PSCC algorithm gets better results than the SCC algorithm. In particular, the algorithm for finding weakly connected components can yield a 10% improvement over pair-wise comparisons in terms of detecting remote homologues. The experimental results also show that the choice of data set, the algorithm used to generate raw scores, and the local alignment length have an obvious effect on the final results.
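As an illustration of the SCC idea named in the abstract, the following is a minimal sketch and not the thesis implementation. It assumes that raw scores are given as a dictionary mapping (query, hit) pairs to a number, that a directed edge is drawn from query to hit when the score reaches a cutoff, and that each strongly connected component is reported as one cluster. The function names, the score format, and the cutoff value are all hypothetical; the component search uses Kosaraju's two-pass method.

```python
from collections import defaultdict


def build_graph(scores, threshold):
    """Build a directed graph from raw pairwise scores.

    Assumption: `scores` maps (query, hit) -> raw score, and an edge
    query -> hit is kept only when the score is at least `threshold`.
    """
    graph = defaultdict(set)
    for (query, hit), score in scores.items():
        graph[query]  # make sure both proteins appear as nodes
        graph[hit]
        if query != hit and score >= threshold:
            graph[query].add(hit)
    return dict(graph)


def strongly_connected_components(graph):
    """Return the SCCs of `graph` as a list of sets (Kosaraju's method)."""
    # Pass 1: iterative depth-first search to record finishing order.
    visited, order = set(), []
    for start in graph:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Reverse every edge.
    reverse = {node: set() for node in graph}
    for node, targets in graph.items():
        for target in targets:
            reverse[target].add(node)

    # Pass 2: search the reversed graph in decreasing finishing order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


def cluster_proteins(scores, threshold):
    """Cluster proteins as the SCCs of the thresholded score graph."""
    return strongly_connected_components(build_graph(scores, threshold))
```

With a heuristic scorer the score of A against B need not equal the score of B against A, which is why the sketch keeps the graph directed: two proteins fall into the same cluster only if each can reach the other through above-cutoff hits.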
Liu, Zhu, "Graph-Theoretic Approaches to Cluster Protein Sequences." (2003). Student Work. 3541.