Bioinformatics & Computational Biology

BCB and IGERT Graduates
Iowa State University

Name & Email | Degree/Major | Major/Co-Major Professors | Dissertation/Thesis Title | Position, Employer | Semester/Year of Grad

Prasith Baccam

Ph.D. in

Applied Math and Immunobiology

James Cornette
Susan Carpenter

Genetic Variation and evolution of equine infectious anemia virus rev quasispecies during long term persistent infection
Innovative Emergency Mgmt. Inc.
Lead Scientist
Bel Air, MD

Past:

Postdoctoral Research Associate
Iowa State University

http://www.t10.lanl.gov/profiles/baccam.html

http://www.t10.lanl.gov/pbaccam/

Spring, 2000

Lisa Borsuk

M.S. in BCB

Dr. Patrick Schnable
Dr. Hui-Hsien Chou

To be Determined

Summer, 2007

 

Kara Butterworth

M.S. in

Botany

Jonathan Wendel
Dean Adams

Initiation and early development of fibers in wild
and cultivated cotton
Middle school science teacher
Apache Junction, AZ

 

Fall, 2003

Feng Cui

Ph.D. - Co-Majors in

BCB and
Physical Chemistry

Dr. Zhijun Wu
Dr. Robert Jernigan
Distance-based NMR Structure determination
and refinement

Visiting Fellow
National Cancer Institute
Center for Cancer Research Nanobiology Program (CCRNP)
Frederick, MD

http://ccr.cancer.gov/Labs/staff.asp?labid=91

 

Summer, 2005

Garrett Dancik

Ph.D. in BCB

Dr. Karin Dorman and Dr. Doug Jones

Exploring host-pathogen relationships through computer simulations of intracellular infection

Assistant Professor
Northwestern State
Departments of Biology and Math Sciences
Louisiana

Will begin bioinformatics concentration there.

Summer, 2008

Amy Determan

MGET Student -

PhD in CBE

      Fall, 2005

Lixia Diao

M.S. in

BCB

David Fernandez-Baca
Xun Gu

Consensus properties of supertree construction methods

Ph.D. Student in Statistics
Iowa State University
Ames, IA

http://perl.hs.iastate.edu/lixia.htm

Summer, 2002

Jing Ding

Ph.D. Co-major in

BCB and ComE

Dan Berleant
Eve Wurtele

BOW-Based vs. Concept-Based Text Clustering for Functional Analysis of Genes

Staff Specialist
Ohio State University
Columbus, OH
Spring, 2006

Pan Du
Ph.D. Co-major
in BCB & EE
Julie Dickerson
Eve Wurtele

Multi-scale Genetic Network Inference based on Time Series Gene Expression Profiles
 

Research Associate / Senior Bioinformatics Analyst position
Robert H. Lurie Comprehensive Cancer Center
Northwestern University
Chicago, IL

Fall, 2005

Tyra Dunn

M.S. in
BCB

PhD in BCB

Xun Gu
Dan Voytas

Greenlee; Honavar

Genomic differences between humans and primates

Characterizing and Influencing Differentiation Of Retinal Progenitor Cells

To be determined

Fall, 2004 - MS

Summer, 2007 - PhD

Scott Emrich

Ph.D. in
BCB

Srinivas Aluru
Patrick Schnable

Assembly and Analysis of Complex Plant Genomes

Assistant Professor
University of Notre Dame
Notre Dame, IN

Summer, 2007

Joset Etzel

Ph.D. in

BCB

Julie Dickerson
Ralph Adolphs

Algorithms and Procedures to Analyze Physiological Signals in Psychophysiological Research

Postdoctoral Fellow
University of Groningen
Netherlands
Spring, 2006

Fang Fang

Ph.D. in

BCB

Karin Dorman
Drena Dobbs

Virus Recombination: Modeling and Data Analysis

Postdoctoral Fellow
Dr. Arlene Auerbach
Lab of Human Genetics & Hematology
The Rockefeller University
New York City

Spring, 2006

Jianmin Feng

M.S. in

BCB

Volker Brendel
Zhijun Wu

A new approach for discovering protein motifs

Research Scientist
Dr. Ed Yeung
Iowa State University
Ames, IA
Fall, 2002

Xiang Gao

PhD in

MCDB and BCB

Dan Voytas
Leslie Miller
Studying the replication mechanism of the yeast retrotransposon Ty5 by molecular and computational approaches

Postdoctoral Fellow
With Dr. Michael Lynch
Biology Department
Indiana University
Bloomington, IN
Fall, 2001

Zhong Gao

M.S. in

BCB

Vasant Honavar and
Kai-Ming Ho
Genome wide recognition of Tumor Necrosis
Factor (TNF) related ligands in human and
Arabidopsis genomes: A structural genomics
approach

Postdoctoral Fellow
The Center for Cardiovascular Bioinformatics and Modeling
Johns Hopkins University
Baltimore MD

Summer, 2003

Aspen Garry

MS in

EEB

Dean Adams
Gavin Naylor
Geometric Morphometric analysis of shark teeth of the genus Rhizoprionodon: The modern, the ancient, and the hypothetical. Modern tooth shape analysis and test of ancestry prediction methods by comparison to fossil shapes

Fall, 2003

Jianying Gu

PhD in

BCB

Xun Gu
Dan Nettleton

Functional divergence and genome evolution of vertebrate protein kinases

Assistant Professor
City University of New York
Summer, 2003

Ling Guo

Ph.D. in BCB

Patrick Schnable
Daniel Ashlock
Adaption of Multiclustering to the Analysis of Microarray Data

To Be Determined

Summer, 2007

Ericka Havecker

Ph.D. in IG

(IGERT Fellow)

Dan Voytas
Mei Hong
Characterization of the Sireviruses:  A unique group of Ty1/copia LTR retrotransposons in plants

Postdoctoral Research Associate
David Baulcombe
Sainsbury Lab
Norwich, England

Spring, 2005

Julie Hoy

Ph.D. in IG

(IGERT Fellow)

Dan Voytas
Mei Hong
Structural Characterization of Ligand Binding in Hexacoordinate Hemoglobins

Postdoctoral Research Associate
Mark Hargrove Laboratory
Iowa State University

Summer, 2006

LaRon Hughes

M.S. in BCB and a Ph.D. in

BCB

M.S.--Karin Dorman and Susan Carpenter; PhD--Jim Reecy
Vasant Honavar

M.S. - EIAV DB: A comprehensive equine infectious anemia virus (EIAV) database

Ph.D. - Hypothesis building using the Animal Trait Ontology

GenomeQuest
Field Application Scientist
Westborough, MA

Summer 2004;

Summer, 2007

Junli Ji

M.S. in

Genetics and BCB

Madan Bhattacharyya
Adam Bogdanove
  Pioneer Hi-Bred
Des Moines, IA
Fall, 2004

Cizhong Jiang

PhD in IG with BCB minor

Tom Peterson
Xun Gu

Computational and molecular analysis of Myb gene family

Postdoctoral Research Associate
VCU (Virginia Commonwealth University)
Richmond, VA

Project: SNPs in mammals

Summer, 2004

Brent Kronmiller

PhD in BCB

Dr. Roger Wise and Dr. Xun Gu

Assembly And Annotation Tools For Analysis Of Large Contiguous Regions Of The Maize Genome

 

Summer, 2008

Alain Laederach

PhD - Co-Major in

BCB and Chemical Engineering

Peter Reilly
Amy Andreotti

Protein-carbohydrate and protein-protein interactions: Using models to better understand and predict specific molecular recognition

Postdoctoral Fellow
Dr. Russ Altman, MD, PhD
Helix Bioinformatics Group
Department of Genetics
Stanford School of Medicine
CA

http://helix-web.stanford.edu/people/alain/

Summer, 2003

Michael Lawrence

PhD in BCB

Dianne Cook
Eve Wurtele
Interactive graphics, graphical user interfaces and software interfaces for the analysis of biological experimental data and networks

Postdoctoral Fellow
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024
Seattle, WA 98109
http://www.fhcrc.org/

Fred Hutchinson Cancer Research Center is a world leader in research to understand, treat and prevent cancer, HIV/AIDS and other life-threatening diseases. Founding members of the center are credited with pioneering bone-marrow transplantation as a successful treatment for leukemia and other blood diseases.

Spring, 2008

Nicole Leahy

PhD in

BCB

Daniel Ashlock
John Mayfield
  Postdoctoral Fellow
Genetics Department
University of GA
Athens, GA
Spring, 2004

Jae-Hyung Lee

PhD in BCB

Drena Dobbs
Kai-Ming Ho

Analysis of protein-RNA and protein-peptide interactions in Equine Infectious Anemia Virus (EIAV) infection

Postdoctoral Fellow
Drena Dobbs Lab
Iowa State University
Fall, 2007

Darrin Lemmer

M.S. in BCB

Gloria Culver
Drena Dobbs

CAVEMol: an immersive 3D molecule viewer

IBM
Rochester, MN
Spring, 2006

Yuan Lin

M.S. in

BCB

Xun Gu
Xiaoqiu Huang
The relationship of sequence similarity and expression pattern similarity between yeast genes within gene families

Staff
J. Craig Venter Institute
9704 Medical Center Drive
Rockville, MD 20850

phone: 240-268-2767
email: press@venterinstitute.org

J. Craig Venter Institute is a not-for-profit research institute dedicated to the advancement of the science of genomics; the understanding of its implications for society; and the communication of those results to the scientific community, the public, and policymakers. http://www.venterinstitute.org/

Fall, 2001

Haining Lin

M.S. in

BCB

Xiaoqiu Huang
Daniel Voytas
BACAP: An assembly program for hierarchical shotgun sequencing

The Institute for Genomic Research (TIGR)
Rockville, MD

And PhD Student in BCB
Iowa State University

Fall, 2004

Patricia Lonosky

M.S. in

Genetics

Steve Rodermel
Vasant Honavar
Proteomics of the developing chloroplast in maize

Scientist
Nanosphere, Inc.
Northbrook, IL

Fall, 2002

Wiesia Mentzen

Ph.D. in

BCB

Eve Wurtele
Xun Gu

From Pathway to Regulon in Arabidopsis

Senior Scientist with
Alberto de la Fuente at
CRS4 Bioinformatica
Pula, Italy

Summer, 2006

Erin M. Myers

PhD in EEB

Major Professors: Fred Janzen & Dean Adams

Post-orbital color pattern variation and the evolution of a radiation of turtles (Graptemys)

Brooke Peterson-Burch

PhD in

Genetics

Daniel Voytas
Vasant Honavar
Characterization of plant LTR retrotransposon diversity and host genome survival strategies

Bioinformatics Scientist
Pioneer Hi-Bred
Des Moines, IA
Spring, 2003

Myron Peto

Ph.D. in
BCB

Robert Jernigan
Drena Dobbs

Studies of Protein Designability using Reduced Models

Postdoctoral Associate
Crop Genome Informatics Laboratory
USDA Agricultural Research Service
On the Campus of Iowa State University

Summer, 2007

Bradley Powers

M.S. in

BCB

Daniel Ashlock
Kirk Moloney

The Effect of Tags on Non-Local Adaptation

Bioinformatics Scientist
NewLink Genetics
Ames, IA
Spring, 2004

Justin Recknor

Ph.D. in BCB and Co-Major in Statistics

Dan Nettleton
Jim Reecy

Identification of Differentially Expressed Functional Categories in Microarray Studies Using Nonparametric Multivariate Analyses

Eli Lilly
Associate Statistician
Toxicology Department
Working with Microarray Analysis
Indianapolis, IN
Fall, 2006

Kyoungmin Roh

M.S. in PhD

Steve Proulx

Evolutionary variance of gene network via simulated annealing algorithm

Ph.D. Student
University of California

Summer, 2008

Ph.D. in

BCB

Volker Brendel
Randy Shoemaker
Plant genome informatics: evaluation and analysis of genomic DNA features involved in the transcriptional processing of protein coding genes

Assistant Professor
Department of Computer and Information Technology
Purdue University
West Lafayette, IN
Fall, 2006

Justin Schonfeld

Ph.D. in

BCB

Dan Ashlock
Dan Voytas

A modular data analysis pipeline for the discovery of novel RNA motifs

Postdoctoral Fellow
Cognitive Information Processing group
Computer Science and Engineering Department
University of Nevada
Reno, NV
Spring, 2006

Sachet Shukla

M.S. in

BCB

Srinivas Aluru
Charles Link

Identification of regional motifs in the 5' UTR and their implication in translational control mechanisms

Bioinformatics Scientist
NewLink Genetics
Ames, IA
Summer, 2003

Michael Sparks

Ph.D.
in BCB

MGET Fellow

Volker Brendel
Jonathan Wendel

Computational annotation of eukaryotic gene structures: algorithms
development and software systems

Postdoctoral Fellow
Volker Brendel's Lab
Iowa State University
Fall, 2007

Robert Thompson

M.S. in

Genetics

Susan Carpenter
Daniel Ashlock

Application of computational tools to analyze evolution of equine infectious anemia virus

Spring, 2001

Peter Vedell
PhD in BCB
Co-Major in Math
Zhijun Wu
Robert Jernigan
Boundary Value Approaches To Molecular Dynamics Simulation

Postdoctoral Fellow
The Jackson Laboratory
Bar Harbor, Maine

The Jackson Laboratory is designated by the National Cancer Institute as a "Cancer Center" to conduct basic cancer research. At the time of the Laboratory's initial designation in 1983, NCI noted, "The Jackson Laboratory is not only important to the national cancer effort but critical to its success."

http://www.jax.org/about/jax_facts.html

Spring, 2007

Kent Vander Velden

M.S. in

BCB

Gavin Naylor
Vasant Honavar
Spatial Clustering of differences in measured homoplasy with respect to protein structure

Current PhD student in BCB
Research Scientist,
Pioneer Hi-Bred
Des Moines, IA

Spring, 2002

Thomas Vigdal

M.S. in

BCB

Daniel Voytas
Volker Brendel
Insertion site similarities in the Tc1/mariner element family

Law Student
UC, Davis

Recently received an MS at Stanford

Summer, 2001

Jianmin Wang

Ph.D. in BCB

Xiaoqiu Huang
Xun Gu

Computational studies of ESTs: assembly, SNP detection, and applications in alternative splicing

Staff
Roswell Park Cancer Institute
Buffalo, NY

Roswell Park Cancer Institute (RPCI), founded in 1898 by Dr. Roswell Park, is America's first cancer center. RPCI holds the National Cancer Center designation of "comprehensive cancer center" and serves as a member of the prestigious National Comprehensive Cancer Network.

Over its long history, Roswell Park Cancer Institute has made fundamental contributions to reducing the cancer burden and has successfully maintained an exemplary leadership role in setting the national standards for cancer care, research and education.

The campus spans 25 acres in downtown Buffalo and consists of 15 buildings with about one million square feet of space. A new hospital building, completed in 1998, houses a comprehensive diagnostic and treatment center. In addition, the Institute built a new medical research complex and renovated existing education and research space to support its future growth and expansion. http://www.roswellpark.org/

For more information about Roswell Park and cancer in general, please contact the Cancer Call Center at 1-877-ASK-RPCI (1-877-275-7724).

Summer, 2006

Xiangyun Wang

M.S. in

BCB

Vasant Honavar
Drena Dobbs
Data-driven discovery of rules for protein function classification based on sequence motifs

Postdoctoral Research Associate
AstraZeneca Pharmaceutical
Wilmington, DE
Spring, 2002

Yingchun Wang

PhD in

Genetics and
BCB

Parag Chitnis
Suresh Kothari

Identification and functional analysis of thylakoid membrane proteome

Research Associate
Klemke Laboratory
Scripps Research Institute
La Jolla, CA

http://www.scripps.edu/imm/klemke/barry.htm

The role of the SDF-1/CXCR-4 receptor system in breast cancer metastasis.

In May 2005, received a three year Fellowship from Susan Komen Breast Cancer Foundation to continue his research in proteomics and cancer metastasis.

Fall, 2003

Yufeng Wang

Ph.D. in

BCB

Xun Gu
Daniel Ashlock
Functional divergence and age distribution of vertebrate gene families

Assistant Professor
Bioinformatics and Computational Biology
Department of Biology
University of Texas
San Antonio, TX
(210) 458-6492

http://www.bio.utsa.edu/faculty/wang.html

Research in my laboratory focuses on the comparative genomics, molecular evolution, and population genetics of gene families. 

Summer, 2001

Matthew Wilkerson

PhD in BCB

Volker Brendel and
Thomas Peterson

Genesis of gene structures and computational analysis of U12-type introns
Postdoctoral Research Associate
D. Neil Hayes Laboratory
Lineberger Comprehensive Cancer Center
The University of North Carolina at Chapel Hill
Chapel Hill, North Carolina
Fall, 2007

Di Wu

PhD Co-major in

BCB and Math

Zhijun Wu and
Robert Jernigan
Distance-based Protein Structure Modeling

Assistant Professor
Department of Mathematics
Western Kentucky University
Bowling Green, KY

 

Summer, 2006

Shiquan Wu

PhD in

BCB

Xun Gu
Zhijun Wu
Comparative genomics: Multiple genome rearrangement and efficient algorithm development

Postdoctoral Research Associate
Virtual Reality Application Center
with Dr. Zhijun Wu
Iowa State University
Ames, IA

 

Fall, 2004

Wu Xu

M.S. in

BCB

Parag Chitnis
Suresh Kothari
DNA sequence-specific recognition by transcriptional factors

Postdoctoral Fellow
Biochemistry Department
St. Jude Hospital
Memphis, TN
Summer, 2003

Aimin Yan

Ph.D. in BCB

Dr. Robert Jernigan; Dr. Zhijun Wu

Analysis on protein structures using statistical and computational methods

Postdoctoral Associate
Dr. Jack Dekkers
Department of Animal Science
Iowa State University
Summer, 2008

Changhui Yan

Ph.D. Co-Major in BCB and Computer Science

Vasant Honavar
Drena Dobbs

Identification of interface residues involved in
protein-protein and protein-DNA interactions from sequence using machine learning approaches

Assistant Professor
Computer Science Department
Utah State University
Logan, UT
Fall, 2005

Lei Yang

Ph.D. in BCB

Robert Jernigan and Zhijun Wu

Understanding protein motions by computational modeling and statistical approaches

 

Summer, 2008

Liang Ye

Ph.D. in BCB

Xiaoqiu Huang and Gavin Naylor

Sequence comparison methods, statistics, and applications

Senior Scientist
Genome Sequencing Center
School of Medicine
Washington University
St. Louis, MO

Summer, 2006

Hailong Zhang

M.S. in

BCB

Eve Wurtele
Julie Dickerson
MetNet DB: A comprehensive metabolic and regulatory network database

Bioinformatics Research Scientist/PhD Student
Chemistry Department
University of New Hampshire
Durham, NH
Summer, 2002

Wuyan Zhang

Co-Major PhD
Stat and BCB
Alicia Carriquiry
Jack Dekkers
The design and analysis of microarray experiments using pooled samples for the study of quantitative traits

Research Statistician
Abbott Laboratory
Chicago, IL
Spring, 2007

Xiaosi Zhang

M.S. in

BCB

Vasant Honavar
Xun Gu

Gene expression pattern analysis

 

 

 

System Engineer
Meredith Corporation
Des Moines, IA
Fall, 2002

Zhongqi Zhang

PhD - Co-Majors:

Statistics and
BCB

Ken Koehler
Xun Gu
Statistical analysis of gene expression profiles

Assistant Professor
Tsinghua University
Tsinghua, PR China

Summer, 2004

Hua Zhou

M.S. in

BCB

Karin Dorman
Susan Carpenter
Branching process models for HIV-1 drug resistant mutants

Ph.D. Student
Statistics department
Stanford University
CA
Fall, 2003

Huaijun Zhou

M.S. in

BCB

Xun Gu
Susan Lamont
Statistical Analysis of Functional Divergence in Gene Families

Assistant Professor
Department of Poultry Science
Texas A&M University
College Station, TX

Fall, 2003

Wei Zhu

PhD in

BCB

Volker Brendel
Srinivas Aluru
Spliced alignment and its application in Arabidopsis thaliana

MedImmune
Gaithersburg Headquarters
One MedImmune Way
Gaithersburg, MD 20878
(301) 398-0000

Spring, 2003


Prasith Baccam

Home Departments: Math and Immunobiology

Major Professor: Dr. Cornette
Co-Major Professor: Dr. Susan Carpenter

Title: Genetic Variation and evolution of equine infectious anemia virus rev quasispecies during long term persistent infection

Abstract: Genetic variation has been observed in many viruses. Viruses that carry their genetic information in the form of RNA exhibit high mutation rates because the viral polymerase lacks proof-reading mechanisms commonly found in DNA polymerase complexes. The combination of high mutation rates, small genome size, and high replication rates results in a population of closely related viral genotypes, which are commonly referred to as a quasispecies. A consequence of the genetic variation in viruses is possible variation in viral phenotype of the quasispecies population. Furthermore, changes in viral phenotype may be a biologically important factor in progression of disease. Here, we undertook a longitudinal study to describe the quasispecies nature and genetic variation in a lentivirus regulatory protein, Rev, during the course of disease in a pony experimentally infected with equine infectious anemia virus (EIAV). This study examined rev variants that comprised the quasispecies population in sequential sera samples. Over the course of disease, there was continual appearance of novel rev variants, with some variants growing in frequency to predominate at certain time points. Phylogenetic and cluster analyses suggested that the Rev quasispecies was comprised of two distinct populations that co-existed during infection. These two quasispecies populations differed in their pattern of evolution, with one population accumulating changes in a linear, time-dependent manner, while the other population evolved radially from a common variant. Changes in the population size of the two Rev quasispecies coincided with changes in the clinical stages of disease. Rev variants from each population were biologically tested, and significant differences in Rev activity were detected between the two populations. Together, these results suggested that the distinct Rev populations differed in selective advantage.
A statistical correlation was found between Rev quasispecies activity and the clinical stage of disease: activity differed significantly between stages. This study suggests that distinct quasispecies populations, which differed in pattern of evolution and niche advantage, co-existed during long term persistent infection by EIAV. A multi-population quasispecies model challenges our current thinking of viral populations and may have significant biological implications.


Kara Butterworth

Home Department: Botany

Major Professor: Dr. Jonathan Wendel
Co-Major Professor: Dr. Dean Adams

Title: Initiation and early development of fibers in wild and cultivated cotton

Abstract: Gossypium (Malvaceae) is a diverse genus best known for cultivated cotton. It includes about 50 species, 45 diploid and 5 allopolyploid, which occur in arid and semi-arid regions throughout the world (Vollesen, 1987; Fryxell, 1992). The diploids are divided into eight genome groups based on chromosome pairing and size, and fertility between species (Endrizzi, Turcotte, and Kohel, 1985). These groups comprise natural lineages within the genus and correspond to geographic locations: A, B, E, F - Africa and Arabia; C, G, K - Australia; and D - New World. Allopolyploid members are found in the New World and contain the A and D genomes (Wendel, 1995; Wendel et al., 1998; Brubaker, Bourland, and Wendel, 1999; Percival, Wendel, and Stewart, 1999; Cronn et al., 2002). This understanding of the evolutionary history of the genus allows many aspects of evolutionary differences in development and morphology to be studied in a phylogenetic context.


Feng Cui

Home Department: Mathematics

Major Professor: Dr. Zhijun Wu
Co-Major Professor: Dr. Robert Jernigan

Title: Distance-based NMR Structure determination and refinement

Abstract: X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy are two widely used experimental techniques for protein structure determination. In the Protein Data Bank (PDB), about 85% of deposited protein structures are determined by X-ray crystallography. The rest of the structures are determined by NMR spectroscopy. The main difference between these two approaches lies in the state of protein samples to which they are applied: for X-ray crystallography, a protein has to be in the crystalline state while in NMR, it may be in the solution state. Both approaches have their own pros and cons. For example, X-ray crystallography is a mature technique capable of providing more objective interpretation of data. This approach has various quality indicators such as resolution and R-factor to assess the structures. It can be applied to large molecules, e.g., virus particles, and produce a single model that is easy to visualize and interpret. Raw data processing is highly automatic. In contrast, NMR is a relatively new technique and provides more subjective interpretation of the data. It lacks established quality indicators of data and models. In addition, it is limited to determination of relatively small proteins (<20 kDa) and produces an ensemble of possible structures rather than one model. Data sometimes have to be manually processed. On the other hand, a protein has to form stable crystals for X-ray analysis, which could be time-consuming and often impossible. The crystalline state is not a natural and physiological environment for the protein either. In addition, X-ray crystallography is less useful for large flexible modular proteins. In contrast, the solution state of a protein is closer to biological conditions and relatively easy to prepare. NMR can provide information on dynamics and identify individual side-chain motion, often used to monitor conformational change on ligand binding.
Despite these trade-offs, both approaches have undergone dramatic development during the past five years, especially NMR. Advances in data collection, spectra assignment and analysis, structure calculation and computer graphics have removed the barriers between NMR spectra assignment, NMR structure assessment and visualization. Many quality indicators such as bond length, angle and NOE violations (inter-atomic distances that lie outside of NOE ranges) have been developed and used for quality assessment of NMR structures. Novel refinement schemes aimed at increasing the accuracy of the resulting structures have been proposed and tested. As a result, proteins up to 30 kDa in size (about 260 residues) are now routinely accessible by NMR spectroscopy with increased resolution, equivalent to approximately 2.5-Å resolution crystal structures.


Garrett Dancik

Home Department: Statistics

Major Professor: Dr. Karin Dorman
Co-Major Professor: Dr. Doug Jones

Title: Exploring host-pathogen relationships through computer simulations of intracellular infection

Abstract: Computer simulations of infectious disease allow for the identification and estimation of important pathogen and immune parameters, the validation of theoretical biological models with experimental data, and the characterization of the host-pathogen interactions that lead to emergent and sometimes counterintuitive behavior. This dissertation describes the development, analysis, and calibration of a computer model of Leishmania major infection, the identification of correlates of escape mutant success and optimal escape strategies in a computer model of a viral infection, and statistical software to aid in computer model analysis and calibration.

In an agent-based model of L. major infection, sensitivity analysis reveals that increasing growth rates can favor or suppress parasite load, depending on the stage of the infection and the ability of the pathogen to avoid detection. Calibration of the computer model suggests that the pathogen has a relatively slow growth rate and can grow for an extended time before damaging the host cell.

In a computer model of viral infection, we find that the relative overall importance of the cellular (or humoral) response consistently correlates with both the success of immune escape and the optimal escape strategy, and that correlation is relatively robust to the time the escape mutant arises. Mutants that simultaneously escape both responses perform substantially better than humoral or cellular escape mutants alone, highlighting the importance of both responses in controlling infection. Interestingly, loss of infectiousness of humoral escape mutants favors the virus, likely because decreasing infectivity weakens the cellular response.

Finally, Gaussian processes (GP) are commonly used as fast predictors of computer model output and are essential tools for computer model calibration and analysis. We describe the R package mlegp, which fits GPs to scalar or multivariate computer model output and performs sensitivity analysis to identify and characterize the effects of important model parameters.
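
To illustrate how a Gaussian process can act as a fast predictor of computer model output, here is a minimal Python sketch. It is not the mlegp package and does not reproduce its interface or its maximum likelihood fitting. The squared-exponential kernel, the fixed length scale, the small nugget, and the function names `sq_exp_kernel`, `solve`, and `gp_predict` are all assumptions chosen for illustration.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two scalar inputs (assumed kernel)."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_predict(train_x, train_y, test_x, nugget=1e-8):
    """Posterior mean of a zero-mean GP at each point in test_x."""
    n = len(train_x)
    # Covariance matrix of the training inputs, with a small nugget on the
    # diagonal for numerical stability.
    cov = [
        [sq_exp_kernel(train_x[i], train_x[j]) + (nugget if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    weights = solve(cov, train_y)
    return [
        sum(sq_exp_kernel(x, train_x[i]) * weights[i] for i in range(n))
        for x in test_x
    ]


# Usage: treat math.sin as a stand-in for an expensive simulator, run it at a
# few design points, then predict cheaply at a new input.
design = [0.0, 1.0, 2.0, 3.0]
outputs = [math.sin(x) for x in design]
print(gp_predict(design, outputs, [1.5]))
```

In practice the kernel parameters would be estimated from the simulator runs rather than fixed as they are here.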


Lixia Diao

Home Department: Computer Science

Major Professor: Dr. David Fernandez-Baca
Co-Major Professor: Dr. Xun Gu

Title: Consensus properties of supertree construction methods

Abstract: The combination of a set of rooted perfect phylogenetic trees on overlapping leaf sets into one supertree is important and fundamental for evolutionary biology. In this thesis, we will present three supertree techniques – MRP, MRF, MinCutSupertree – and compare the consensus properties of MRP and MRF with some consensus tree criteria.
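
As a rough illustration of how source trees on overlapping leaf sets can be combined, the sketch below builds the binary matrix used by matrix representation with parsimony, assuming that is what MRP denotes here. Each non-root clade of each source tree becomes one column: "1" for taxa inside the clade, "0" for taxa in that tree but outside the clade, and "?" for taxa absent from that tree. The nested-tuple tree format and the function names are assumptions made for this example, and the parsimony search that would turn the matrix into a supertree is not shown.

```python
def leaves(tree):
    """Return the set of leaf names under a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Return the leaf set of every internal node except the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found


def mrp_matrix(trees):
    """Encode rooted source trees as one character string per taxon."""
    taxa = sorted(frozenset().union(*(leaves(t) for t in trees)))
    columns = []
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            columns.append({
                t: ("1" if t in clade else "0") if t in present else "?"
                for t in taxa
            })
    return {t: "".join(col[t] for col in columns) for t in taxa}


# Usage: two rooted trees that overlap on taxa B and C.
source_trees = [(("A", "B"), "C"), (("B", "C"), "D")]
print(mrp_matrix(source_trees))
# {'A': '1?', 'B': '11', 'C': '01', 'D': '?0'}
```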


Jing Ding

Home Department: Electrical and Computer Engineering

Major Professor: Dr. Dan Berleant
Co-Major Professor: Dr. Eve Wurtele

Title: BOW-Based vs. Concept-Based Text Clustering for Functional Analysis of Genes

Abstract: The rapid development in genomic technologies (e.g. microarray) has enabled biologists to simultaneously monitor expression of hundreds or even thousands of genes in a single experiment. Interpreting the biological meaning of the expression patterns still relies largely on biologists' domain knowledge, as well as information collected from the literature and/or various public databases. Individual experts' domain knowledge is insufficient for large datasets, and manually collecting and analyzing information from the literature and/or public databases is tedious and time-consuming. Computer-aided functional analysis tools are highly desirable. We developed GeneNarrator, a text-mining system for functional analysis of microarray data. Given a list of genes, GeneNarrator collects functional information (MEDLINE citations) from PubMed, and clusters the citations into functional topics. The genes are then mapped to the topics and clustered into groups based on their similarities in topic distribution.


Pan Du

Home Department: Electrical and Computer Engineering

Major Professor: Dr. Julie Dickerson
Co-Major Professor: Dr. Eve Wurtele

Title: Multi-scale Genetic Network Inference based on Time Series Gene Expression Profiles

Abstract: This work integrates multi-scale clustering and short-time correlation to estimate genetic regulatory networks with different time resolutions and detail levels. Gene expression data are noisy and large scale. Clustering is widely used to group genes with similar patterns. The cluster centers can be used to infer the genetic networks among these clusters. This work introduces the Multi-scale Fuzzy K-means clustering algorithm to uncover groups of coregulated genes and capture the networks in different levels of detail.
Time series expression profiles provide dynamic information for inferring gene regulatory relationships. Large scale network inference, identifying the transient interactions and feedback loops as well as differentiating direct and indirect interactions are among the major challenges of genetic network inference. Pairwise time correlation can detect linear interactions between genes. Estimates of the time delay and direction of causality in the inferred network can also be made. Partial correlation and d-separation theory are combined to differentiate the direct and indirect interactions and identify feedback loops. Gene expression regulation can happen in specific time periods and conditions instead of across the whole expression profile. Short-time correlation can capture transient interactions.
The network discovery algorithm was validated using yeast cell cycle data. The algorithm successfully identified the yeast cell cycle development stages, cell cycle and negative feedback loops, and indicated how the network dynamically changes over time. The inferred network reflects most interactions previously identified by genome-wide location analysis and matches extant literature results. The inferred network provides more detailed information about genes (or clusters) and the interactions among them. Interesting genes, clusters and interactions were identified, which match the literature and the gene ontology information and provide hypotheses for further studies.
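
The idea of using pairwise time correlation to estimate a time delay and a direction of influence between two expression profiles can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm developed in the dissertation: it uses plain Pearson correlation over shifted windows, and the names `pearson` and `best_lag` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(source, target, max_lag):
    """Return (lag, r) with the largest |r| between source[t] and target[t + lag].

    A positive lag means the source profile leads the target profile.
    """
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = source[:len(source) - lag], target[lag:]
        else:
            a, b = source[-lag:], target[:len(target) + lag]
        if len(a) < 3:
            continue  # too few overlapping time points to correlate
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Usage: gene_b repeats gene_a two time points later (toy data).
gene_a = [0, 1, 4, 2, 7, 3, 9, 5, 6, 8]
gene_b = [0.5, 0.2] + gene_a[:-2]
print(best_lag(gene_a, gene_b, max_lag=3))
```

Restricting the same calculation to a sliding window of time points, rather than the whole profile, would give a short-time variant in the spirit of the abstract.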


Tyra Dunn

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Xun Gu
Co-Major Professor: Dr. Daniel Voytas

Title: Genomic differences between humans and primates

Abstract: Scientists around the world have wondered for many years what distinguishes speciation. Of particular interest is the genetic basis for human/primate (chimpanzee or gorilla) separation. Humans and chimpanzees are 99% identical in their genomic DNA sequence, thus making them very closely related. Despite this high degree of sequence similarity, humans and primates have a number of striking phenotypic differences. We hypothesize that sequence changes that have occurred between humans and primates have altered developmental programs. Because transcription factors alter the expression of numerous genes, we also hypothesize that changes in the expression or activity of transcription factors are responsible for the different phenotypic traits among humans and primates.

Using human chromosome 22 as a model for comparison between human and primate DNA, a random selection of noncoding genes approximately 1-2 kilobases (kb) long upstream was sequenced. Focusing on promoter regions in the sequence data, we detected significant differences when comparing humans and gorillas (p-value < 0.01) and gorillas and chimpanzees (p-value < 0.01), suggesting that limited similarities existed between the species. When comparing humans and chimpanzees, no significant difference was detected (p-value > 0.1). Using this information, transcription factors were analyzed between the human and chimpanzee data to determine if transcription regulation was different between the species. The results indicated no significant difference between humans and chimpanzees at the single-nucleotide level even though the species differ at the genetic and phenotypic levels. The results also indicated that changes in transcription regulation have played a major role in determining speciation. This research opens new avenues in investigating how many of the differences have functional consequences and the relative contributions of these transcription factors to expression differences.


Tyra Dunn

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Heather Greenlee
Co-Major Professor: Dr. Vasant Honavar

Thesis Presentation: June 12, 2007

Title: Characterizing and Influencing Differentiation Of Retinal Progenitor Cells

Abstract: The vertebrate neural retina is a complex organ that is well suited for studying development of the central nervous system. Blinding degenerative retinal diseases including retinitis pigmentosa, macular degeneration and glaucoma are characterized by loss of retinal neurons. At this time there is no way to replace retinal cell loss due to disease or injury since differentiated retinal cells are unable to regenerate. As a potential approach for treating retinal injury, neural progenitor cells have been proposed as a unique source of transplantable cells to replace lost cells in the damaged retina.

Previous studies have transplanted a variety of neural stem cells to the eye in hopes of developing a therapy to replace retinal neurons lost to disease. Integration, survival and differentiation of these cell types have been variably successful. At the moment little is known about the fundamental biological differences between stem cell or progenitor cell types.

We have used proteomic profiling to begin to identify unique characteristics of retinal progenitor cells. Our results demonstrate that expanded retinal progenitor cells express higher levels of stress-response proteins compared to their brain-derived counterparts. Further, we have described the dynamic expression of stress-response proteins during in vivo retinal development. Finally, we have demonstrated that changing the oxidative environment by addition of the antioxidant vitamin E to retinal progenitor cells differentiated in vitro decreases expression of stress-response proteins and alters their differentiation. These studies are the first to describe the expression of stress-response proteins during in vitro and in vivo retinal cellular development. Our results demonstrate the importance of understanding the oxidative nature of a host environment and how differentiation of transplanted cells might be affected.


Scott Emrich

Home Department: Electrical and Computer Engineering

Major Professor: Dr. Srinivas Aluru
Co-Major Professor: Dr. Patrick Schnable

Title: Assembly and Analysis of Complex Plant Genomes

Presentation: June 8, 2007

Abstract: Concurrent advances in high-throughput sequencing and assembly have led to the completion of many complex genomes. Even so, these assemblies require substantial computational resources. In this dissertation, we present a massively parallel approach that scales to thousands of processors without duplicating the biological expertise present in conventional assembly software. Additional bioinformatics techniques were required to accurately assemble the maize genome including novel repeat detection, and the resulting framework has been strongly supported by maize experimental data. More recently, this framework has been generalized for fruit fly, sorghum, soybean and environmental sequence assemblies. Questions in plant genome analysis were also addressed. For example, we have discovered an estimated 350 “orphan” maize genes and have shown that approximately 1% of all maize genes were recently duplicated, many of which into at least two functional copies. LCM-454 sequencing is introduced and analyses that indicate this approach can discover rare, potentially tissue-specific transcripts and thousands of SNPs will be presented. This dissertation combines high performance computing, computational biology and high-throughput sequencing for our ongoing work on the maize genome project. We conclude by describing how these contributions can be useful for any species, including non-model organisms that are unlikely to be fully sequenced.


Joset Etzel

Home Department: Electrical and Computer Engineering

Major Professor: Dr. Julie Dickerson
Co-Major Professor: Dr. Ralph Adolphs

Title: Algorithms and Procedures to Analyze Physiological Signals in Psychophysiological Research

Abstract: This dissertation presents analytical techniques which allow more information to be derived from psychophysiological data than otherwise possible. The techniques include an implemented algorithm for chest strain-gauge respiration signal analysis and a permutation testing method for evaluating changes over time in physiological signals. These methods are applied to three data sets, each examining physiological correlates of emotional experience. In the first study physiological correlates of moods induced using music were identified, although respiration entrainment confounds the issue of whether mood or the music caused the observed patterns. The second study examined physiological responses while subjects watched an emotional movie under three conditions; changes relating both to the movie scenes and condition were identified. Finally, the third study evaluates short term changes in heart rate while viewing words in terms of the type of word viewed and later word recall.
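
As an illustration of the permutation-testing idea mentioned above, here is a minimal Python sketch of an exact paired sign-flip test. It is a generic textbook procedure, not necessarily the method developed in the dissertation; the function name and the heart-rate differences are hypothetical.

```python
from itertools import product

def sign_flip_p_value(differences):
    """Exact two-sided paired permutation test on the mean difference.

    Under the null hypothesis each within-subject difference is equally
    likely to be positive or negative, so every sign pattern is enumerated.
    """
    n = len(differences)
    observed = abs(sum(differences)) / n
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, differences))) / n
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total

# Hypothetical per-subject changes in heart rate (stimulus minus baseline).
changes = [2.0, 1.5, 3.0, 2.5, 1.0, 2.0]
p = sign_flip_p_value(changes)
```

Exact enumeration costs 2^n evaluations, so for more than roughly 20 subjects a random sample of sign patterns would normally replace the full loop.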


Fang Fang

Home Department: Statistics

Major Professor: Dr. Karin Dorman
Co-Major Professor: Dr. Drena Dobbs

Title: Virus Recombination: Modeling and Data Analysis

Abstract: As a key evolutionary process, recombination shapes the genetic structure of virus populations. The dramatic increase of virus full-length sequences provides a chance to study virus recombination through molecular data. Many statistical methods have been developed, and many of them are phylogeny-based. My research focuses on recombination modeling and data analysis. I first apply an existing phylogeny-based method, the Bayesian dual multiple change-point (DMCP) model, to investigate the role of representative data types for recombination study. We conclude that consensus data is overall the best data type to represent virus genotypes. Using consensus data we studied recombination on all full-length hepatitis B virus (HBV) sequences, and set up a system for using the DMCP model for large scale sequence analysis. We discovered that HBV has an extremely high recombination rate. For the first time we reported circulating recombinant forms of hepatitis B virus, and identified one potential recombination hotspot. One important goal of studying recombination is to find potential recombination hotspots and to reveal the molecular mechanism of recombination. This goal requires identification of all recombinants generated by different recombination events, which is not trivial when recombinant sequences have similar mosaic structures. Extending the DMCP model, I developed a method to identify the number of recombination events producing multiple recombinants. I apply this method to several HBV recombinants that have identical mosaic structure and find at least two recombination events.


Jianmin Feng

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Volker Brendel
Co-Major Professor: Dr. Zhijun Wu

Title: A new approach for discovering protein motifs

Abstract: Motif recognition is a powerful homology based sequence analysis tool for clustering new protein sequences into different families based on characteristic motifs. Compared to BLAST, these approaches typically have lower false positive rates and can reveal more remotely related family members. However, the current motif databases do not cover all the sequences in protein sequence databases. One of the major reasons for the low coverage of motif databases is that there is only a small set of known member sequences available for constructing protein motifs for many gene families. I have designed a new algorithm, “mFISHER”, to detect protein motifs from only 2-5 known member sequences by artificial evolution of given sequences based on a position specific PAM evolution model. Based on my test results on 160 motif families, the overall average recall rate or sensitivity (true/(true + false negative)) and specificity (true/(true + false positive)) are 88% and 95%, respectively. Compared with MEME (Multiple EM for Motif Extraction), mFISHER is better based on the recall rate, especially when only 2 or 3 members are available. Both approaches have similar sensitivity. mFISHER is promising for constructing protein motifs when only a few known members are available.
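
The two evaluation measures defined in the abstract can be written out directly. The Python sketch below uses hypothetical counts, not the thesis results. Note that the quantity the abstract calls specificity, true/(true + false positive), is more commonly called precision or positive predictive value.

```python
def recall(true_positives, false_negatives):
    """Sensitivity: fraction of real family members that the motif finds."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    """Fraction of reported hits that are real family members.

    This is the measure the abstract labels "specificity".
    """
    return true_positives / (true_positives + false_positives)

# Hypothetical search result: 45 members found, 5 missed, 3 false hits.
r = recall(true_positives=45, false_negatives=5)
p = precision(true_positives=45, false_positives=3)
```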


Xiang Gao

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Daniel Voytas
Co-Major Professor: Dr. Les Miller

Title: Studying the replication mechanism of the yeast retrotransposon Ty5 by molecular and computational approaches

Abstract: The yeast retrotransposon Ty5 is a Ty1/copia element. Officially, it is in the Hemivirus genus of the Pseudoviridae family. The ability to genetically manipulate retrotransposons and the yeast host cell was taken advantage of to explore replication mechanisms unique to Ty5 and common to most retrotransposons. Because of the abundance and diversity of retroelement sequences, along with the fact that many retroelement enzymes have evolved unique functional specificities, computational approaches were also developed to study functional divergence in replication. By screening a randomly mutagenized Ty5 library, two mutations (Y68C, D252N) that caused higher transposition frequencies were identified. Both mutations increased Ty5 cDNA levels, but did not have dramatic effects on the steps after cDNA synthesis (i.e. integration and recombination), or protein synthesis, processing, or solubility. The D252N mutation increased the hydrogen bonding potential of the CCHC zinc finger of nucleocapsid protein (NCp), making the Ty5 NCp zinc finger more like Ty1/copia consensus zinc fingers in terms of hydrogen bonding potential. Other mutations that increased the hydrogen bonding potential (D252R, D252K) provided the same fold increase in Ty5 reverse transcription, and naturally occurring mutations in the Ty5 zinc finger repress this function. Hydrogen bonding is suggested to be a universal requirement for the function of retroviral type zinc fingers and cellular zinc fingers. A half-tRNA priming mechanism for Ty5 reverse transcription was also demonstrated. Mutations in the anticodon of tRNA (IMT) and the putative PBS of Ty5 decreased transposition, but transposition was restored when complementarity between the IMT and PBS was restored. A tree-based method and supplemental Split Tester software were developed to study the functional divergence of reverse transcriptase (RT) with respect to half-tRNA and full-tRNA priming mechanisms. 
The domains identified by this computational approach were previously experimentally demonstrated to bind with the tRNA primer/template in HIV RT. Using this software, another domain related to integrase functional specificity, namely whether or not integrase carries out 3’-end processing during integration, was also consistently identified in different integrase datasets. A model describing this functional divergence is proposed.


Zhong Gao

Home Department: Computer Science

Major Professor: Dr. Vasant Honavar
Co-Major Professor: Dr. Kai-Ming Ho

Title: Genome wide recognition of Tumor Necrosis Factor (TNF) related ligands in human and Arabidopsis genomes: A structural genomics approach

Abstract: Tumor necrosis factors (TNFs) play a crucial role in mammalian signal transduction pathways for cell proliferation, survival, and differentiation. Human and other species (such as Arabidopsis) genome sequencing projects provide a unique opportunity for genome-wide recognition of TNF related ligand proteins and discovery of potential TNF-TNFR signal transduction mechanism in plants. Genome-wide recognition of TNF related proteins in human and Arabidopsis was carried out using secondary structure prediction and protein fold recognition. In the protein fold recognition scheme, sequence-structure models are evaluated using contact energy score based on Miyazawa-Jernigan and Li-Tang-Wingreen models. Secondary structure composition based initial screening not only reduces search space of protein fold recognition but also shifts the score distribution of the selected candidates to a higher score region. In order to investigate influence of sequence length on threading results, protein fold recognition was conducted on human and Arabidopsis genome sequences of different length. The test on known TNFs from diverse species indicates that about 83% of TNFs can be identified; the test on human genome sequences shows that about 80% of known TNFs can be recognized. Integration of secondary structure profiling into the scheme can improve performance by adjusting local sequence-structure relationship. However, this improvement largely depends on accuracy of secondary structure prediction. Average scoring performs better than maximal scoring in model evaluation and selection. Pattern classification algorithms such as decision tree, neural network, Naïve Bayes classifier, and support vector machine are applied to discriminate TNF related proteins from the competitive false positives which have similar secondary structure composition to known TNFs and also have high fold recognition scores. 
Both known TNF and false positive sequences are represented with the twenty q values corresponding to twenty amino acids in Li-Tang-Wingreen model. Cross-validation results show that Naïve Bayes classifier performs better than SVM, neural network, and decision tree, and Naïve Bayes classifier is suitable for stringent control of false positives. This genome-wide search scheme was used to search potential TNF-like signal proteins in Arabidopsis genome. Possible role of candidates in human and Arabidopsis genomes is discussed. These results demonstrate that structure based methods can facilitate functional prediction in a genome scale.


Aspen Garry

Home Department: Ecology, Evolution, & Organismal Biology

Major Professor: Dr. Dean Adams
Co-Major Professor: Dr. Gavin Naylor

Title: Geometric Morphometric analysis of shark teeth of the genus Rhizoprionodon: The modern, the ancient, and the hypothetical. Modern tooth shape analysis and test of ancestry prediction methods by comparison to fossil shapes

Abstract: Shark teeth are extremely common in the fossil record, and they can potentially provide insight into the evolutionary history of sharks. However, isolated fossil teeth are difficult to assign to the correct jaw, position, and taxon without organismal context because individual sharks exhibit a variety of tooth shapes. Tooth shape varies across jaws, positions within each jaw, and taxa.

Fortunately, tooth shape is quantifiable, and shapes can be compared using the techniques of geometric morphometrics, which measure shape and its covariation with other variables. Analysis of modern tooth shapes was performed in order to gain understanding of patterns of modern tooth shape variation. These results could then be applied to fossils to provide better identification of fossils in order to make use of sharks’ extensive fossil record.

To quantify modern patterns of tooth shape variation, teeth of five Rhizoprionodon species and representatives of three closely related genera (Loxodon, Eusphyra, and Sphyrna) were quantified and analyzed using geometric morphometric methods. Ancestral tooth shapes were estimated using the modern shape data mapped onto a phylogeny created using molecular data, and a Brownian motion model of evolution. These shapes were compared to fossil teeth from Rhizoprionodon sp. and Sphyrna spp. to evaluate the accuracy of the estimated ancestral shapes.
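
For readers unfamiliar with ancestral estimation under Brownian motion, the following Python sketch shows the standard calculation for a single continuous variable on a small tree. It is a generic illustration, not the analysis pipeline used in the thesis: the nested-tuple tree format, the function name, and the numbers are hypothetical, and real shape data would involve many correlated coordinates rather than one value.

```python
def bm_root_estimate(node):
    """Return (estimate, added_variance) for a leaf value or an internal node.

    A leaf is a number. An internal node is a tuple
    (left, left_branch_length, right, right_branch_length).
    Each descendant is weighted by the inverse of its branch length, and
    the uncertainty of an estimated node is passed upward by lengthening
    the branch above it.
    """
    if not isinstance(node, tuple):
        return float(node), 0.0
    left, v_left, right, v_right = node
    x1, extra1 = bm_root_estimate(left)
    x2, extra2 = bm_root_estimate(right)
    v1 = v_left + extra1
    v2 = v_right + extra2
    estimate = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    return estimate, v1 * v2 / (v1 + v2)

# Hypothetical tree: two sister species (values 1.0 and 3.0, branch length 1.0
# each) joined to the root by a branch of length 1.0, plus an outgroup
# (value 8.0) on a branch of length 2.0.
tree = ((1.0, 1.0, 3.0, 1.0), 1.0, 8.0, 2.0)
root_value, _ = bm_root_estimate(tree)
```

The value returned for the root is the maximum-likelihood estimate under this model; values computed at deeper nodes during the recursion use only the tips below them and are therefore partial estimates.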

Modern teeth at the front of the jaw displayed the most dramatic shape differences between jaws and positions. Teeth from each genus could be distinguished, but species within Rhizoprionodon could not. Fossil tooth shapes most closely resembled those of modern teeth, indicating that tooth shape did not change according to the Brownian motion model used to predict ancestral shapes.


Jianying Gu

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Xun Gu
Co-Major Professor: Dr. Dan Nettleton

Title: Functional divergence and genome evolution of vertebrate protein kinases

Abstract: The emerging complete and nearly complete genome sequences have provided a significant amount of material for large-scale comparative genomic analysis. Novel methods have been developed to elucidate the function of gene products and functional interacting networks. Many of these post-genomic attempts have focused on unveiling the evolutionary forces that have shaped the network organization. Among various evolutionary forces, duplication of a functional domain, individual gene, chromosomal segment, or entire genome has long been thought of as the primary source of functional novelties in a vast number of gene families. It is therefore intriguing to quantitatively trace the changes of evolutionary constraints after a duplication event.

This study is focused on the exploitation of the functional divergence and evolutionary patterns in vertebrate kinase complements (denoted as kinomes) and kinase-regulated signaling transduction pathways, using a combinatorial statistical and evolutionary approach. The analysis of an individual kinase gene family (Jak), protein tyrosine kinase superfamily, and a kinase mediated signaling transduction pathway (TGF-β) showed that functional divergence (altered functional constraint) after (domain or gene) duplication is a general pattern. Moreover, the age distribution of the vertebrate kinomes showed that (1) The major kinase-related animal specific signal-transduction pathways have been generated through an ancient continuous domain shuffling (or duplications) during the time period from early stage of eukaryotes to metazoan evolution; (2) Vertebrate tissue-specificity of signal-transduction is facilitated by large-scale duplication event(s) in the early stage of vertebrates; and (3) The kinase pseudogenes are generated through either segmental duplication or retrotransposition very recently.


Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Patrick Schnable
Co-Major Professor: Dr. Dan Ashlock

Title: Adaption of Multiclustering to the Analysis of Microarray Data

Presentation Date: Thursday, May 10, 2007

Abstract: Clustering has become an integral part of microarray data analysis and interpretation. It helps reduce the scale of information generated by microarray experiments to a level at which biologists can generate hypotheses. There is a danger that artifacts induced by clustering methods can cause misinterpretation of the data. A clustering method that can accurately capture the natural structure of the data would be a useful tool for biologists to discover the biological meaning buried in the data. To this end, a new clustering algorithm, called K-means multiclustering, is introduced. The method can avoid the artifacts induced by distance or similarity metrics by amalgamating the results of many K-means clusterings.

Results: The multiclustering algorithm is a model-free clustering method. It is found to be reliable and consistent in capturing the underlying data structure, with accuracy that is competitive with model based clustering and superior to other methods on synthetic microarray data generated in a manner consistent with the hypothesis of model based clustering. The algorithm has a high level of immunity to artifacts introduced by the metric used to measure the distance between data points. It can successfully cluster data sets which are designed to have different shapes and variation and cannot be correctly clustered by traditional clustering methods. The cut plot computed by this method is a very simple and useful summary of the data structure. A detailed view of the formation of clusters can also be generated by the method to reveal the underlying hierarchical structure of the data set.
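
The idea of amalgamating many K-means clusterings can be sketched in Python as follows. This is only one plausible reading of the approach, not the thesis algorithm: counting how often each pair of points shares a cluster, cycling through a few values of K, and reporting connected components at a cut threshold are assumptions, as are all function names and the toy data.

```python
import random
from itertools import combinations

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iterations=50):
    """Plain Lloyd's K-means; returns a cluster label for every point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels

def co_association(points, runs, k_choices, seed=0):
    """Count, for every pair of points, how many runs put them together."""
    rng = random.Random(seed)
    counts = {pair: 0 for pair in combinations(range(len(points)), 2)}
    for run in range(runs):
        k = k_choices[run % len(k_choices)]  # cycle through the K values
        labels = kmeans(points, k, rng)
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return counts

def components_at(counts, n, threshold):
    """Number of groups left after cutting pairs seen together < threshold times."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), c in counts.items():
        if c >= threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# Toy data (made up): two well-separated groups of three points each.
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
        (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
counts = co_association(data, runs=20, k_choices=[2, 3])
groups = components_at(counts, len(data), threshold=10)
```

Plotting the number of groups against the threshold gives a curve in the spirit of the cut plot mentioned above: long flat stretches suggest a stable number of clusters.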


Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Daniel Voytas
Co-Major Professor: Dr. Mei Hong

Title: Characterization of the Sireviruses:  A unique group of Ty1/copia LTR retrotransposons in plants

Abstract: Plant genomes have allowed the expansion of many types of mobile genetic elements.  LTR retrotransposons are a subclass of mobile genetic elements that replicate using an RNA intermediate.  The Pseudoviridae (Ty1/copia) are a family of LTR retrotransposons, and the Sireviruses are one of three genera in the Pseudoviridae.  The Sireviruses have features that set them apart from classical retrotransposons.  Different members of the Sireviruses show great variability in their genomic structures and the translational tricks they use to express their encoded proteins.  For example, we have shown that the SIRE1 elements of soybean use stop codon suppression to express their Env-like protein.  Secondly, some monocot members of the Sireviruses may use a bypass mechanism to translate Pol.
 
Another notable feature of the Sireviruses is that most carry additional coding information in the form of an open reading frame (ORF) referred to as an env-like ORF, and all have encoded extra coding information in their gag gene.  The env-like ORF has caused speculation that these elements are plant retroviruses, although no experimental evidence has determined this to be true.  However, using a yeast two-hybrid screen, we have discovered an interaction between multiple Sirevirus Gags and a family of related host cell proteins referred to as dynein light chain LC8 and LC6.  The LC8 and LC6 proteins are highly conserved in eukaryotes and are components of the dynein and myosin-V motors.  LC8 can bind cargo (cell proteins or virus particles) to allow movement along the cytoskeleton.  Thus, one hypothesis is that the interaction of the Sirevirus Gags with LC8 or LC6 may allow for movement of the Sirevirus virus-like particles or transposition intermediates within a cell (for example, from cytoplasmic to nuclear compartments).  If true, this would not only represent the first example of a movement mechanism for any retrotransposon, but it also illustrates how plant retrotransposons and plant viruses use similar mechanisms to achieve a common goal.  In addition, an initial characterization of the expression and localization of the Arabidopsis thaliana LC8/LC6 gene family was completed.


Home Department: Biochemistry, Biophysics and Molecular Biology

Major Professor: Dr. Mark Hargrove

Title: Structural Characterization of Ligand Binding in Hexacoordinate Hemoglobins

Presentation: Thursday, August 17, 2006

Abstract: The goal of biophysics is to study the structures of the components of living organisms and to understand the mechanics of the processes of life. Hemoglobin is a well suited model for this study. As an essential component of the life blood of mammals, and easy to obtain in large quantities, hemoglobin and its monomeric partner myoglobin are two of the most well studied and characterized components of life. Yet hemoglobin studies continue to reveal new forms of hemoglobin, raising new questions, functional possibilities, and research opportunities. My research focuses on hemoglobins classified as hexacoordinate. I have focused particularly on the structural characterization of these proteins upon ligand binding. Included below for your benefit is a list of abbreviations and terms used in my talk along with their definitions.

Hbs -- hemoglobins
hxHbs -- hexacoordinate hemoglobins
trHbs -- truncated hemoglobins
nsHbs -- nonsymbiotic hemoglobins
sHbs -- symbiotic hemoglobins
SynHb -- Hb from Synechocystis
ferric -- oxidized (3+ iron)
ferrous -- reduced (2+ iron)
ligand -- small binding molecule like oxygen
k' -- rate of ligand binding
K -- equilibrium binding association constant
soret -- optical peak around 390-440nm

List of publications:
Hoy, J. A., Kundu, S., Trent, J. T., 3rd, Ramaswamy, S., and Hargrove, M. S. (2004). The crystal structure of Synechocystis hemoglobin with a covalent heme linkage. J Biol Chem. 279, 16535-16542.
Trent, J. T., 3rd, Kundu, S., Hoy, J. A., and Hargrove, M. S. (2004). Crystallographic analysis of synechocystis cyanoglobin reveals the structural changes accompanying ligand binding in a hexacoordinate hemoglobin. J Mol Biol. 341, 1097-1108.
Smagghe, B. J., Kundu, S., Hoy, J. A., Halder, P., Weiland, T. R., Savage, A., Venugopal, A., Goodman, M., Premer, S., Hargrove, M. S. (2006). Role of Phenylalanine B10 in Plant Nonsymbiotic Hemoglobins. Biochemistry Aug 15;45(32):9735-9745.
Hoy, J. A., Smagghe, B. J., Halder, P., Hargrove, M. S. (2006). Covalent heme attachment in Synechocystis hemoglobin is required to prevent ferrous heme dissociation. Manuscript in preparation.
Hoy, J. A., Robinson, H., Trent, J. T., Kakar, S., Smagghe, B. J., Hargrove, M. S. (2006). Crystal structure of a nonsymbiotic plant hemoglobin; implications for the evolution of oxygen transport. Manuscript in preparation.

Bio:
BA in Physics and BA in Humanities from Wartburg College, Waverly, Iowa, 1996
MS in Physics from Iowa State University, 1999
Temporary Instructor of Physics, ISU, 1999 - 2000
PhD studies in Biophysics, ISU, 2000 - 2006
Postdoc in Hargrove Lab


LaRon Hughes - M.S.

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Karin Dorman
Co-Major Professor: Dr. Susan Carpenter

Title: EIAV DB: A comprehensive Equine Infectious Anemia Virus (EIAV) database

M.S. Abstract: A major problem in biology is the storage and retrieval of biological data in a meaningful and efficient manner. With the advent of mass sequencing projects, such as the human genome project, the need to store, retrieve, and analyze sequence data is stronger than ever before. The following thesis tackles a small part of this problem by presenting techniques, models, and applications for productively storing and retrieving a set of related viral sequences in a central data bank. The thesis begins by providing an overview of the relational database and its role in storing biological data. The main chapter of the thesis is a description of a novel relational database application (EIAV DB). EIAV DB is a central repository of Equine Infectious Anemia Virus sequence and feature information. The models and application provide insight into technologies that help alleviate the storage and retrieval problem.


LaRon Hughes - PhD

Home Department: Animal Science

Major Professor: Dr. Jim Reecy
Co-Major Professor: Dr. Vasant Honavar

Title: Hypothesis building using the Animal Trait Ontology

PhD Abstract: With the advent of sequencing projects in model organisms, humans, and domesticated livestock species, the need for storage, retrieval, and analysis of genomics information for these animals has become important.  The Animal Trait Ontology (ATO) is an ontology that has been created to store the relationships between farm animal traits for several domesticated farm animals.  The Collaborative Ontology Building (COB) editor was used to create and edit the ATO.  An online ontology browser has been developed to search and browse the ontology and to view the relationships between the terms.  Some of the traits in the ontology are linked to associated quantitative trait loci (QTL) information for each species through a tool called the Comparative Animal QTL (CAQ) tool which allows users to compare QTL experiments in livestock species.  The tool allows QTL experiments to be compared based on 1) one trait given one species, and 2) two traits given one species.  The effectiveness of the tool is recorded in the form of a data and statistical analysis which demonstrates its use in examining pleiotropic effects for traits in the pig.  In addition, the Human and Animal Trait Ontology is discussed and it will form an agglomeration of several different species ontologies, including the ATO, that will form a consensus for describing phenotypes and traits across different disease models.


Cizhong Jiang

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Thomas Peterson
Co-Major Professor: Dr. Xun Gu

Title: Computational and molecular analysis of Myb gene family

Abstract: Myb proteins are defined by a highly conserved DNA-specific binding domain termed Myb, which is composed of approximately 50 amino acids with constantly spaced tryptophan residues. Multiple copies of Myb domains often exist as tandem repeats within a single protein. There are up to four tandem Myb repeats present in Myb proteins identified to date (termed R0R1R2R3 hereafter). In our study, we collected additional Myb genes, and performed a series of phylogenetic analyses to explore the evolutionary origin of Myb genes. The results suggest that the Myb gene family originated from an ancient one Myb-box gene. One and two intragenic duplications produced R2R3 and R1R2R3 Myb genes, respectively, which then co-existed in the primitive eukaryotes and gave rise to the currently extant Myb genes. Based on our results, we proposed that plant R1R2R3 Myb genes were derived from R2R3 Myb genes by gain of the R1 repeat through an ancient intragenic duplication; this gain model is more parsimonious than the previous proposal that plant R2R3 Myb genes were derived from R1R2R3 Myb genes by loss of the R1 repeat. The phylogenetic analysis of isolated individual Myb repeats indicates that R2 repeat has evolved more slowly than the R1 and R3 repeats. However, it is not clear which repeat is the most ancient one.

Another goal of our project is to classify and predict functions of Myb genes. We clustered the closely-related Myb genes into subgroups from Arabidopsis and rice on a basis of sequence similarity and phylogeny. The gene structure analysis revealed that both the positions and phases of introns are conserved in the same subgroup, although these differ between subgroups. Conserved motifs were detected in C-terminal coding regions within subgroups, and these motifs exist specifically in Myb genes. We also found that Myb genes with similar functions are clustered together. In contrast, no conserved regulatory elements were identified in the divergent non-coding regions. Additionally, the distribution pattern of introns in the phylogenetic tree indicates that Myb domains originally had a compact size without introns. Non-coding sequences were inserted and the splicing sites were conserved during evolution.


Brent Kronmiller

Home Department: Plant Pathology

Major Professor: Dr. Roger Wise
Co-Major Professor: Dr. Xun Gu

Title: Assembly And Annotation Tools For Analysis Of Large Contiguous Regions Of The Maize Genome

Abstract: LTR retrotransposons make up significant portions of many of the larger grass genomes. Their repeated sequences across the genome, their terminal repeats, and their nested cluster configuration make assembly of sequence clones challenging and identification of gene regions difficult.  In this thesis I provide tools for both assembly and annotation of highly repetitive genomes and use these tools to construct what are currently the two longest maize sequence contigs.
      In the first part of the thesis I present TEnest, annotation and visualization software for transposable elements in grass genomes.  TEnest identifies all fragmented transposable elements within the input sequence and reconstructs each to the original insertion state.  This provides a chronological display of the nesting pattern of clustered transposable elements.  For LTR retrotransposons TEnest calculates an estimated age since insertion based on the divergence of its paired LTRs.  I also provide a case study of TEnest on the available maize genome sequence.  TEnest shows the distribution of transposon families, ages of insertion, and frequencies of solo LTRs.  In addition I provide a phylogenetic analysis of retrotransposon families showing the estimated ages since insertion of LTR retrotransposons cluster with their sequence identity, showing that LTR retrotransposons experience specific intervals of extreme proliferation to expand across the genome.
      In the second part of this thesis I introduce our two contiguous maize sequences, rf1-associated contigs rf1-C1 and rf1-C2 sequenced from maize B73.  These are the two longest contiguous maize sequences and provide previously unmatched sequence quality for answering many questions surrounding the makeup of the maize genome.  Here, using TEnest, we propose two maize assembly techniques for highly repetitive regions.  The use of these processes has allowed us to provide the high quality contiguous sequences of the rf1-associated region and will assist researchers with assembly of difficult sequence clones.  We show definite separation between gene and repeat regions.  The rf1-associated contigs, when compared to the rice and sorghum genomes, show conserved macro-colinearity between genes across the long sequences.  A closer look at individual gene islands, however, shows micro-non-colinearity across the analyzed grass species.
      The third section of this thesis compares the B73 rf1-associated sequence contigs with two BACs sequenced from Wf9-BG, an Rf1-containing maize line.  Here we identify four genes in an island corresponding to a similar gene island in B73; however, a fifth gene is missing from Wf9-BG.  Two repeat clusters surround the gene island; one matches its counterpart in B73, whereas the second does not align to B73.  Leading up to this area of recombination we observe a drastically increased frequency of polymorphisms.
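
The age estimate that TEnest derives from the divergence of paired LTRs can be illustrated with a minimal sketch. The abstract does not say which distance model or substitution rate TEnest uses, so the following assumes pre-aligned LTRs, a Jukes-Cantor correction, and a caller-supplied rate; the function names are hypothetical and are not part of TEnest.

```python
import math


def jukes_cantor_distance(ltr_a: str, ltr_b: str) -> float:
    """Corrected substitutions per site between two aligned LTR sequences."""
    if len(ltr_a) != len(ltr_b):
        raise ValueError("LTR sequences must be aligned to the same length")
    # Compare only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(ltr_a.upper(), ltr_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_age_years(ltr_a: str, ltr_b: str,
                        rate_per_site_per_year: float) -> float:
    """Estimate time since insertion.

    The two LTRs are identical when the element inserts and then diverge
    independently, so the elapsed time is K / (2 * rate).
    """
    distance = jukes_cantor_distance(ltr_a, ltr_b)
    return distance / (2.0 * rate_per_site_per_year)
```

For example, `insertion_age_years(left_ltr, right_ltr, 1.3e-8)` would treat 1.3e-8 substitutions per site per year as the rate; that value is used here only as an illustrative assumption.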


Alain Laederach

Home Department: Chemical and Biological Engineering

Major Professor: Dr. Peter Reilly
Co-Major Professor: Dr. Amy Andreotti

Title: Protein-Carbohydrate and Protein-Protein interactions: Using models to better understand and predict specific molecular recognition

Abstract: Any molecular recognition event results in a change in the free energy of the system. The extent of this change is related to the association constant, such that the more negative the free energy change is, the tighter the interaction between receptor and ligand. Protein-carbohydrate interactions play a critical role in signal transduction, innate immunity and metabolism. Modeling these interactions is somewhat complicated by the inherent flexibility of carbohydrates as well as their relatively large number of functional groups. An empirical scoring function for docking carbohydrates to proteins will be presented, specifically tailored to predict both the correct binding orientation and free energy of binding of the carbohydrate-ligand/protein-receptor complex. This new scoring function can predict free energies of binding to within 1.1 kcal/mol residual standard error, a definite improvement over existing scoring functions, which result in standard errors well over 2 kcal/mol. Application of automated docking methodology to determine carbohydrate recognition specificity of the C-type lectin human Surfactant Protein D will also be presented. In the second part of the thesis, the role of π-stacking interactions (e.g. between Tyr side chains) in stabilizing protein folds will be discussed. A 17-residue peptide derived from the naturally occurring anti-microbial peptide Tachyplesin I is investigated using NMR spectroscopy. NOE cross peaks were observed confirming the existence of this interaction in solution. In the final part of the thesis, a quantitative NMR investigation into the self-association behavior of the regulatory domains of several Tec family member kinases will be presented. Of particular interest, self-association within Bruton's Tyrosine Kinase (Btk) regulatory domains occurs through the formation of an asymmetric homodimer.
Together this work demonstrates the importance of rigorous biophysical characterization of bio-molecular recognition events and how interdependent computational modeling and experimentation are.
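
The relationship between the binding free energy and the association constant mentioned in this abstract is the standard thermodynamic identity ΔG = -RT ln(Ka). The short sketch below shows what an error in predicted free energy means for the predicted association constant. The temperature of 298.15 K and the function names are assumptions made for illustration; they do not come from the thesis.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol * K)


def association_constant(delta_g_kcal_per_mol: float,
                         temperature_k: float = 298.15) -> float:
    """Ka from the standard binding free energy, using dG = -RT ln(Ka)."""
    return math.exp(-delta_g_kcal_per_mol / (GAS_CONSTANT_KCAL * temperature_k))


def fold_error_in_ka(delta_g_error_kcal_per_mol: float,
                     temperature_k: float = 298.15) -> float:
    """Multiplicative uncertainty in Ka implied by an error in dG."""
    rt = GAS_CONSTANT_KCAL * temperature_k
    return math.exp(abs(delta_g_error_kcal_per_mol) / rt)
```

Under these assumptions, a 1.1 kcal/mol error corresponds to roughly a six-fold uncertainty in Ka, whereas a 2 kcal/mol error corresponds to nearly a thirty-fold uncertainty.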


Michael Lawrence

Home Department: Statistics

Major Professor: Dr. Dianne Cook
Co-Major Professor: Dr. Eve Wurtele

Title: Interactive graphics, graphical user interfaces and software interfaces for the analysis of biological experimental data and networks

Abstract: Biologists need to analyze and comprehend increasingly large and more complex experimental data. These experimental data are multivariate, where each row corresponds to a biological entity, and each column corresponds to the level of an experimental treatment. Biological experiments often produce multiple data sets, each describing one aspect of the system, such as the transcriptome recorded by a microarray or metabolome recorded using gas chromatography mass spectrometry (GC-MS). A biochemical network model provides a conceptual system-level framework for integrating data from different sources. Effective use of graphics enhances the comprehension of data, and interactive graphics permit the analyst to actively explore data, check its integrity, satiate curiosities and reveal the unexpected. Interactive graphics have not been widely applied as a means for understanding data from biological experiments. This thesis addresses these needs by providing new methods and software that apply interactive graphics in coordination with numerical methods to the analysis of biological data, in a manner that is accessible to biologists.


Nicole Leahy

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Daniel Ashlock
Co-Major Professor: Dr. John Mayfield

Title: Pseudophyte evolutionary algorithm: A simple computational model of parapatric speciation

Abstract: The Pseudophyte Evolutionary Algorithm (PEA) is an individual-based computer model of a population of haploid, annual plants used to examine the process of speciation in a patchy environment. The model incorporates both pre-mating and post-zygotic mechanisms for the evolution of reproductive isolation, via pollen selection and offspring inviability, respectively. The PEA treats speciation as an emergent property rather than an explicit feature of the model, which makes it possible to examine how environmental patchiness, the number and arrangement of loci, and the reproductive output of individuals affect the strength of isolating mechanisms as well as the rate at which these evolve. The effect of how genotypes are mapped to phenotypes was also explored to examine the sensitivity of the PEA to alternate representations.


Jae-Hyung Lee

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Drena Dobbs
Co-Major Professor: Dr. Kai-Ming Ho

Title: Analysis of protein-RNA and protein-peptide interactions in Equine Infectious Anemia Virus (EIAV) infection

Abstract: Macromolecular interactions are essential for virtually all cellular functions including signal transduction processes, metabolic processes, regulation of gene expression and immune responses. This dissertation focuses on the characterization of two important macromolecular interactions involved in the relationship between Equine Infectious Anemia Virus (EIAV) and its host cell in the horse: i) the interaction between the EIAV Rev protein and its binding site, the Rev-responsive element (RRE) and ii) interactions between equine MHC class I molecules and epitope peptides derived from EIAV proteins. EIAV, one of the most divergent members of the lentivirus family, has a single-stranded RNA genome and carries several regulatory and structural proteins within its viral particle. Rev is an essential EIAV-encoded regulatory protein that interacts with the viral RRE, a specific binding site in the viral mRNA. Using a combination of experimental and computational methods, the interactions between EIAV Rev and RRE were characterized in detail. EIAV Rev was shown to have a bipartite RNA binding domain containing two arginine rich motifs (ARMs). The RRE secondary structure was determined and specific structural motifs that act as cis-regulatory elements for EIAV Rev-RRE interaction were identified. Interestingly, a structural motif located in the high affinity Rev binding site is well conserved in several diverse lentiviral genomes, including HIV-1. Macromolecular interactions involved in the immune response of the horse to EIAV infection were investigated by analyzing complexes between MHC class I proteins and epitope peptides derived from EIAV Rev, Env and Gag proteins. Computational modeling results provided a mechanistic explanation for the experimental finding that a single amino acid change in the peptide binding domain of the equine MHC class I molecule differentially affects the recognition of specific epitopes by EIAV-specific CTL.
Together, the findings in this dissertation provide novel insights into the strategy used by EIAV to replicate itself, and provide new details about how the host cell responds to and defends against EIAV upon infection. Moreover, they have contributed to our understanding of the macromolecular recognition events that regulate these processes.


Darrin Lemmer

Home Department: Biochemistry, Biophysics & Molecular Biology

Major Professor: Dr. Gloria Culver
Co-Major Professor: Dr. Drena Dobbs

Title: CAVEMol: an immersive 3D molecule viewer

Abstract: As the number of solved molecular structures deposited with the Protein Data Bank (PDB) increases, so too does the desire for more advanced ways of using this data. Traditional applications for viewing and manipulating molecular structures create a computer-generated model on a standard desktop computer screen. The display may employ some method of stereography to create the illusion of depth, but generally the user just sees a flat image. The user is able to interact with the molecule by magnifying it to get a closer look at a particular area of interest, or by rotating it along an arbitrary axis, thus allowing all sides of the molecule to be seen, though only one side is in view at any given time. The user may also be able to see changes in the molecule over time whereby each conformation of the molecule is a separate frame of an animation, or they may even be able to make modifications to the structure in real time. Regardless of the amount of control the user has over the molecule, however, one thing remains the same: the user experiences the molecule as though it were an object floating behind the monitor screen which they can indirectly control using a mouse or other pointing device.
An immersive environment, on the other hand, provides a new paradigm for molecular visualization, allowing the user a much more realistic interaction with the molecule. The user becomes part of the viewing experience, traversing a molecule as though walking or flying within it. The molecule can completely surround them on all sides, giving them a true sense of the size and shape of the molecule in three dimensions. The user may also interact with the object directly, moving and rotating it with their hands rather than a mouse.
This approach should prove particularly valuable for operations such as “interactive docking,” which allows a user to manipulate the interface between two molecules to identify favorable interaction sites. While this can be done to a degree in today’s desktop molecule viewers, the operation is difficult and time consuming. Because today’s viewers are limited to a flat screen display, a user can only attempt to dock two molecules in two dimensions at a time. When the structure is rotated, more often than not the third dimension is not properly aligned. Realigning the third dimension invariably breaks one or both of the first two. The result is a long and frustrating cycle of alignment, rotation, and realignment. By allowing direct manipulation in all three dimensions simultaneously, the immersive perspective eliminates this cycle.

This thesis presents the design and implementation of CAVEMol, a molecular visualization application for immersive environments. I will also give an overview of molecular visualization and immersive environments, and then discuss future work that can be done in this area as well as applications where molecular visualization in an immersive environment can be particularly valuable.


Haining Lin

Home Department: Computer Science

Major Professor: Dr. Xiaoqiu Huang
Co-Major Professor: Dr. Daniel Voytas

Title: BACAP: An assembly program for hierarchical shotgun sequencing

Abstract: We propose a sequence-based algorithm BACAP to assemble BAC sequences generated from hierarchical shotgun sequencing. Our approach relies on sequence similarity rather than physical mapping. It follows the “overlap-layout-consensus” framework used for shotgun sequencing data. BACAP uses heuristic methods to achieve efficiency and accuracy. It was tested on four simulated data sets of 200 BAC-size sequences each and one real data set of 228 rice BACs from TIGR. The average running time was 25 minutes on one 900 MHz IA-64 GenuineIntel Itanium machine. Our results show that BACAP can quickly and accurately accomplish some BAC assembly tasks without physical mapping information.
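
The “overlap-layout-consensus” framework named in this abstract can be illustrated with a toy sketch of its first two stages. This is not BACAP's algorithm: it assumes exact suffix-prefix matches and a greedy merge order, whereas the abstract only says that BACAP uses heuristic methods. The function names and the `min_len` parameter are hypothetical, and the consensus stage is omitted.

```python
def suffix_prefix_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(seqs: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(seqs)
    while True:
        best = (0, -1, -1)  # (overlap length, left index, right index)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = suffix_prefix_overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs  # no remaining overlaps: layout is finished
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
```

Calling `greedy_assemble(["ACGTAC", "TACGGA", "GGATTT"])` returns `["ACGTACGGATTT"]`. Real BAC-scale data would need approximate alignment to tolerate sequencing errors and repeats, which this sketch does not attempt.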


Yuan Lin

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Xun Gu
Co-Major Professor: Dr. Xiaoqiu Huang

Title: The Relationship of Sequence Similarity and Expression Pattern Similarity between Yeast Genes within Gene Families

Abstract: After gene duplication, the sequences and expression patterns of duplicated genes diverge. It is known that the functional divergence of duplicated genes can be related to the divergence of both their coding sequences and their expression profiles, the latter caused mainly by sequence changes in regulatory regions. It is not known, however, whether sequence divergence and expression pattern divergence are correlated. Earlier research by Andreas Wagner showed at most a very weak correlation between them. In contrast, our research shows a strong correlation between sequence similarity and expression profile similarity when the sequences are highly conserved; in addition, the degree of coexpression of duplicated genes is consistent with their duplication order.


Patricia Lonosky

Home Department: Botany

Major Professor: Dr. Steve Rodermel
Co-Major Professor: Dr. Vasant Honavar

Title: Proteomics of the developing chloroplast in maize

Abstract: Chloroplast protein expression profiles during the light-induced biogenesis of the maize plastid were determined from 2D gel analysis. At five time points of this ‘greening’ process (0, 2, 4, 12, and 48 hours post-illumination), maize plant tissue was collected, plastids were isolated, and protein was precipitated and separated in two dimensions using 2D protein gels. From these proteome maps, spot quantities were analyzed by Principal Components Analysis, hierarchical pairwise average linkage cluster analysis, Adaptive Resonance Theory 2 (ART2) cluster analysis, and Self Organizing Map cluster analysis to determine chloroplast protein expression profiles. 54 spots representing 26 proteins were identified by MALDI-TOF mass spectrometry and used to verify the protein expression profiles. Two main conclusions were drawn from this data: 1) ART2 may be a useful clustering tool for expression data, and 2) different forms or modifications of the same protein show different expression patterns.


Wiesia Mentzen

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Eve Wurtele
Co-Major Professor: Dr. Xun Gu

Title: From Pathway to Regulon in Arabidopsis

Abstract: I apply combined bioinformatic approaches using genomic and transcriptomic data to investigate the fatty acid biosynthesis pathway, at the molecular level, and in the context of the systems biology of Arabidopsis.  Fatty acids are essential components of all known bacterial and eukaryotic cells, with critical roles as energy reserves and as the metabolic precursors for biological membranes. The pathway for fatty acid synthesis seems to be conserved across all living systems. Acetyl-CoA carboxylase, a member of a superfamily of biotin-dependent enzymes, catalyzes the first committed step of the fatty acid biosynthesis pathway. Phylogenetic study exposed complex and intertwined evolutionary histories of this family, with multiple domain fusions and rearrangements. As revealed by meta-analysis of a wide array of Arabidopsis transcriptomic data, fatty acid biosynthesis is transcriptionally regulated, and this regulation extends not only across all pathway reactions but also to some substrate- and cofactor-producing reactions, thus defining a major transcriptionally co-regulated pathway. I extend the meta-analysis of the transcriptome to find groups of coexpressed genes (also called modules, or regulons) in the Arabidopsis genome. Major functionally-coherent gene groups were identified. These comprise development, information processing, defense, and metabolism, as well as tissue- and organelle-specific processes.


Erin Myers

Home Department: Ecology, Evolution and Organismal Biology

Major Professor: Dr. Fred Janzen
Co-Major Professor: Dr. Dean Adams

Title: Post-orbital color pattern variation and the evolution of a radiation of turtles (Graptemys)

Abstract: One of the most deeply studied areas in the field of evolutionary biology is the formation and maintenance of new species, as well as the variation in the rate and extent to which taxa radiate. A range of evolutionary processes, from ecological adaptation to sexual selection and reinforcement, can lead to the formation of new species. However, the formation of new species likely results from several isolating mechanisms acting in concert. The map turtle complex (genus: Graptemys) is an excellent model system for exploring the nature of speciation given its exceptional species richness and high levels of morphological diversity, particularly in facial coloration patterns. This research utilizes an integrative approach to establish the role of post-orbital color patterns in species diversification and maintenance. This multifaceted approach incorporates aspects of phylogenetics, population and quantitative genetics, morphometrics, and behavior to assess morphological evolution within species and across the genus. The phylogeny of map turtles was characterized by a hard polytomy indicating rapid speciation. Across the genus, morphological evolution occurred in a parsimonious manner. Within species, both morphology and genetics exhibited a pattern of isolation by distance. Temperature significantly influenced coloration patterns, and multivariate heritability was generally low. Finally, in behavior trials, neither males nor females spent significantly more time with members of their own species. In all projects, the signatures of sexual selection or reinforcement were absent or equivocal where they would be expected if they were the main forces continuing to shape interactions among map turtle species. The results of this research indicate that the role of past and on-going selection on coloration pattern within the map turtle clade has been limited, indicating that post-orbital coloration was not the driving factor in the radiation of this turtle clade.
Alternative explanations for map turtle species richness are explored.


Myron Peto

Home Department: Biochemistry, Biophysics and Molecular Biology

Major Professor: Dr. Robert Jernigan
Co-Major Professor: Dr. Drena Dobbs

Title: Studies of Protein Designability using Reduced Models

Presentation: July 9, 2007

Abstract: One of the most important problems in computational structural biology is protein designability, that is, why protein sequences are not random strings of amino acids but instead show regular patterns that encode protein structures. Many previous studies that have attempted to solve the problem have relied upon reduced models of proteins. In particular, the 2D square and the 3D cubic lattices together with reduced amino acid alphabets have been examined extensively and have led to interesting results that shed some light on evolutionary relationships among proteins. Here, in addition to the 2D square lattice, we study the 2D triangular and 3D face centered cubic (fcc) lattices, perform designability studies using different shapes embedded in the 2D square lattice, and use machine learning algorithms to classify binary sequences folding to highly- or poorly-designable conformations. In the first part of the thesis we extend the transfer matrix method to the 2D triangular lattice. The transfer matrix method is a highly efficient method of enumerating all conformations within a compact lattice area that was developed earlier for the 2D square and 3D cubic lattices. In addition, we also enumerated all compact conformations within simple geometries on the 2D triangular and 3D fcc lattices using a standard backtracking algorithm. In the second part of the thesis we describe protein designability studies on various shapes in the 2D square lattice using a reduced hydrophobic-polar (HP) amino acid alphabet. We used a simple energy function that counted the number of H-H, H-P and P-P interactions within a restricted set of protein shapes that have the same number of residues and non-bonded contacts. We found a difference in the designabilities of different protein shapes. Finally, in the third part of the thesis we used standard machine learning algorithms to classify two classes of protein sequences.
We first performed a designability study for two shapes, using a binary HP alphabet, on the 2D triangular lattice and separated highly- and poorly-designable conformations. Highly-designable conformations had many sequences folding to them with the lowest energy and poorly-designable conformations had few or no sequences folding to them. Sequences were classified as highly- or poorly-designable depending on whether they folded to highly- or poorly-designable structures. Using several machine learning algorithms such as Decision Tree, Naïve Bayes, and Support Vector Machine, we were able to classify highly- and poorly-designable sequences with high accuracy.
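
The contact-counting energy function described for the 2D square lattice can be sketched as follows. The abstract does not give the weights assigned to H-H, H-P and P-P contacts, so the default weights below are placeholders, and the function names are hypothetical. A conformation is assumed to be a self-avoiding path of lattice coordinates, one per residue.

```python
def contact_counts(sequence: str, path: list[tuple[int, int]]) -> dict[str, int]:
    """Count non-bonded nearest-neighbour contacts on the 2D square lattice."""
    if set(sequence) - {"H", "P"}:
        raise ValueError("sequence must contain only 'H' and 'P'")
    if len(sequence) != len(path) or len(set(path)) != len(path):
        raise ValueError("path must be self-avoiding and match the sequence length")
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        if abs(x0 - x1) + abs(y0 - y1) != 1:
            raise ValueError("consecutive residues must occupy adjacent sites")
    index_at = {site: i for i, site in enumerate(path)}
    counts = {"HH": 0, "HP": 0, "PP": 0}
    for i, (x, y) in enumerate(path):
        # Looking only right and up visits every neighbouring pair once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and abs(i - j) > 1:  # skip chain-bonded pairs
                counts["".join(sorted(sequence[i] + sequence[j]))] += 1
    return counts


def hp_energy(sequence, path, weights=None):
    """Weighted sum of contacts; the default weights are placeholders."""
    weights = weights or {"HH": -1.0, "HP": 0.0, "PP": 0.0}
    counts = contact_counts(sequence, path)
    return sum(weights[kind] * n for kind, n in counts.items())
```

For the four-residue sequence `"HPPH"` folded around a unit square, `[(0, 0), (1, 0), (1, 1), (0, 1)]`, the only non-bonded contact is between the two H residues, so `hp_energy` returns `-1.0` with the placeholder weights. A designability study would evaluate this energy for every sequence on every allowed conformation and count how many sequences have each conformation as their unique lowest-energy state.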


Bradley Powers

Home Department: Mathematics

Major Professor: Dr. Dan Ashlock
Co-Major Professor: Dr. Kirk Moloney

Title: The Effect of Tags on Non-Local Adaptation

Abstract: This project investigates in greater depth the phenomenon of non-local adaptation previously observed in an evolutionary model based on the game iterated Prisoner’s Dilemma. Non-local adaptation is the ability of an agent or population of agents to perform well against other agents that share no common history or ancestry with them. Populations of agents both with and without identifying tags are evolved to perform noisy iterated prisoner’s dilemma on a toroidal grid. The agents consist of a finite state machine specialized for playing iterated prisoner’s dilemma and simple tag recognition capability. The populations are allowed to evolve for 10,000 generations and the state of the world is stored every 500 generations. Populations from these samples are placed in competition with populations from generation 10,000. This procedure is repeated for varying levels of overall mutation rate, with and without tags, and varying frequencies of tag related mutations. Non-local adaptation is seen in these populations; however, tags seem to slow the acquisition of non-local adaptation. Although non-local adaptation is not a widely accepted phenomenon in biology, these results suggest that it may happen and that the effect is persistent in the face of changes in mutation rate and in the face of increased task complexity. Further analysis shows that the populations tend to have a predominant tag most of the time, with punctuated periods of increased tag space usage that most likely correspond to invasion of the population by an opportunistic agent with a new tag identifier.
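
A single match of noisy iterated Prisoner's Dilemma between two finite state machine agents can be sketched as below. The abstract does not specify the payoff values, the noise model, or the machine encoding, so the sketch assumes the conventional 5/3/1/0 payoffs, independent move flips with a fixed probability, and a Moore-style machine. Tags, the toroidal grid, and evolution are left out, and every name is hypothetical.

```python
import random

# Payoffs (my_score, opponent_score) indexed by (my_move, opponent_move).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


class FiniteStateAgent:
    """Each state emits a move; the opponent's move selects the next state."""

    def __init__(self, moves, transitions, start=0):
        self.moves = moves              # state index -> "C" or "D"
        self.transitions = transitions  # state index -> {"C": next, "D": next}
        self.start = start


def play(agent_a, agent_b, rounds, noise=0.0, rng=None):
    """Return the total scores of both agents after `rounds` rounds."""
    rng = rng or random.Random()
    flip = {"C": "D", "D": "C"}
    state_a, state_b = agent_a.start, agent_b.start
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = agent_a.moves[state_a], agent_b.moves[state_b]
        # Noise: each intended move is flipped with probability `noise`.
        if rng.random() < noise:
            move_a = flip[move_a]
        if rng.random() < noise:
            move_b = flip[move_b]
        gain_a, gain_b = PAYOFF[(move_a, move_b)]
        score_a += gain_a
        score_b += gain_b
        # Each agent reacts to the move its opponent actually played.
        state_a = agent_a.transitions[state_a][move_b]
        state_b = agent_b.transitions[state_b][move_a]
    return score_a, score_b


TIT_FOR_TAT = FiniteStateAgent(["C", "D"], [{"C": 0, "D": 1}, {"C": 0, "D": 1}])
ALWAYS_DEFECT = FiniteStateAgent(["D"], [{"C": 0, "D": 0}])
```

Without noise, `play(TIT_FOR_TAT, ALWAYS_DEFECT, 10)` returns `(9, 14)`. Testing for non-local adaptation in this setting would mean playing agents from one evolved population against agents from a population with which they share no ancestry.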


Justin Recknor

Home Department: Statistics

Major Professor: Dr. Dan Nettleton
Co-Major Professor: Dr. Jim Reecy

Title: Identification of Differentially Expressed Functional Categories in Microarray Studies Using Nonparametric Multivariate Analyses

Abstract: Tests of differential expression across groups of genes, within a functional category, are performed using a method motivated by Barry, Nobel, and Wright (2005). Rather than comparing marginal distributions on a gene-by-gene basis across treatment groups, we use a test statistic that can detect general changes in multivariate distributions across treatment groups. Resampling-based methods and multiple-testing adjustments are used to obtain simultaneous inference for multiple groups of genes. Results are visualized on a directed acyclic graph, and new methods for pinpointing genes of greatest interest are provided.


Kyoungmin Roh

Home Department: Ecology, Evolution and Organismal Biology (EEOB)

Major Professor: Dr. Steve Proulx

Title: Evolutionary variance of gene network via simulated annealing algorithm

Abstract: Molecular biology research has traditionally focused on examining and collecting data on a single gene or a single reaction. Recently, however, there has been much interest in the dynamics of gene regulatory networks (E. Klipp et al., 2005). We applied a mathematical approach to modeling gene networks. The models depict the reaction kinetics of the constituent parts. Their functions are built from simple expressions derived from Michaelis-Menten enzyme kinetics, and the functional forms are usually chosen as Hill functions, which serve as an approximation of the real molecular dynamics (E. Klipp et al., 2005). These dynamics depend on many parameters, which strongly influence the behavior of the resulting gene network. We therefore used a simulated annealing algorithm to find high-fitness, optimal parameters for the gene network. Simulated annealing is suitable for problems with many degrees of freedom (Jonathan Tomshine and Yiannis N. Kaznessis, 2006). We developed three models, each with two genes that experience two different environments, and simulated them to describe the behavior of evolving gene networks. From the simulations, we could determine how the genes interact with each other over evolutionary time, obtain a high fitness for each gene network model, and trace how each gene network evolved by tracking its parameters and fitness. We also analyzed the relationship between high fitness and the parameters. We think this approach can be applied to the design and optimization of other gene networks, and that these findings are useful for the analysis of evolving gene networks.
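
The two ingredients named in this abstract, Hill functions and simulated annealing over network parameters, can be sketched in a few lines. The thesis's actual network models, fitness function, and cooling schedule are not given in the abstract, so the geometric cooling, the Gaussian proposal step, and all names and default values below are assumptions chosen for illustration.

```python
import math
import random


def hill_activation(x: float, k: float, n: float) -> float:
    """Hill function: fraction of maximal expression at regulator level x."""
    return x ** n / (k ** n + x ** n)


def simulated_annealing(fitness, start, steps=2000, t_start=1.0, t_end=1e-3,
                        step_size=0.1, rng=None):
    """Maximise `fitness` over a list of real-valued parameters."""
    rng = rng or random.Random()
    current, current_fit = list(start), fitness(start)
    best, best_fit = list(current), current_fit
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = [p + rng.gauss(0.0, step_size) for p in current]
        candidate_fit = fitness(candidate)
        delta = candidate_fit - current_fit
        # Always accept improvements; accept worse candidates with
        # Boltzmann probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = candidate, candidate_fit
            if current_fit > best_fit:
                best, best_fit = list(current), current_fit
    return best, best_fit
```

A fitness function for a two-gene model might, for instance, score how closely `hill_activation` outputs match a target expression level in each of the two environments; that particular choice is hypothetical. Because the routine keeps the best parameter set seen so far, the returned fitness is never lower than that of the starting point.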


Jeffry D. Sander

Home Department: Genetics, Development and Cell Biology

Major Professor: Dr. Drena Dobbs
Co-Major Professor: Dr. Daniel Voytas

Title: Characterization and design of C2H2 zinc finger proteins as custom DNA binding domains

Abstract: As the storage medium for the source code of life, DNA is fundamentally linked to all cellular processes. Nature employs hundreds of sequence-specific DNA binding proteins as transcription factors and repressors to regulate the flow of genetic expression and replication. By adapting these DNA-binding domains to target desired genome locations, they can be harnessed to treat diseases by regulating genes and repairing diseased gene sequences. The C2H2 zinc finger motif is perhaps the most promising and versatile DNA binding framework. Each C2H2 zinc finger domain (module) is capable of recognizing approximately three adjacent nucleotide bases in standard B form DNA. Through directed mutagenesis, novel zinc finger modules (ZFMs) can be selected for most of the 64 possible DNA triplets. By assembling multiple ZFMs with the appropriate linkers, zinc finger proteins (ZFPs) can be generated to specifically bind extended sequence motifs. Several methods of varying complexity are currently available for ZFP engineering. ZFPs generated from the relatively simple modular design method often fail to function in vivo. Those generated using the most reliable module subsets, those recognizing triplets with a 5' guanine (GNN), only function an estimated 50% of the time, while modularly assembled ZFPs comprised primarily of non-GNN modules rarely function in vivo. These low success rates are extremely problematic for applications requiring multiple ZFPs targeting adjacent sequence motifs. More complex approaches provide enhanced success rates as compared to modular design, with the drawback that they are also more labor intensive and require additional biological expertise. 
In this work we engineered ZFPs, analyzed characteristics of functional engineered zinc finger proteins and their targets, formulated algorithms predictive of ZFP success for both modular assembly and OPEN (Oligomerized Pool Engineering) selection methods, and generated online software tools to aid others in the successful application of this technology.


Shannon D. Schlueter

Home Department: Genetics, Development and Cell Biology

Major Professor: Dr. Volker Brendel
Co-Major Professor: Dr. Randy Shoemaker

Title: Plant genome informatics: Evaluation and Analysis of genomic DNA features involved in transcriptional processing of protein coding genes

Abstract:  As biological data collection methods have become more cost effective and less time consuming, the necessity of computational tools to store, manage, and analyze such data has led to the creation of a broad field of research. With the vast majority of effort in bioinformatics being applied to research on vertebrate species, researchers in the plant sciences have often been left with less than satisfactory tools to fill this need. In the course of this study, I have developed xGDB, an extensible infrastructure for integrating biological data resources and applying them to hypothesis driven research. Eleven plant species xGDB databases have been made publicly available at http://www.plantgdb.org. Using the infrastructure provided by xGDB, a sophisticated system was developed to investigate the reliability of protein coding gene structure annotations on a per gene basis. With this, I generated the necessary dataset to develop and test a plant specific probabilistic model of RNA polymerase II transcription start sites and promoters. Through application of this model, a look at individual plant protein coding gene promoters has shown unique structure and organization. Together, this work demonstrates the importance of integrated computational infrastructure and genomic domain knowledge.


Justin Schonfeld

Home Department: Mathematics

Major Professor: Dr. Dan Ashlock
Co-Major Professor: Dr. Dan Voytas

Title: A modular data analysis pipeline for the discovery of novel RNA motifs

Abstract: This dissertation presents a modular software pipeline that searches collections of RNA sequences for novel RNA motifs. In this case the motifs incorporate elements of primary and secondary structure. The motif search pipeline breaks up sets of RNA sequences into shorter segments of RNA primary sequence called bricks. The bricks are then folded to obtain low energy secondary structures. The distance estimation module of the pipeline then calculates distances between the folded bricks and analyzes the resulting distance matrices for patterns. An initial implementation of the pipeline is applied to synthetic and biological data sets. This implementation introduces a new distance measure for comparing RNA sequences based on structural annotation of the folded sequence as well as a new data analysis technique called nonlinear projection. The modular nature of the pipeline is then used to explore the relationships between several different distance measures on random data, synthetic data, and a biological data set consisting of iron response elements. It is shown that the different distance measures capture different relationships between the RNA sequences. The nonlinear projection algorithm is used to produce two-dimensional projections of the distance matrices, which are examined via inspection and k-means multiclustering. The pipeline is able to successfully cluster synthetic RNA sequences based only on primary sequence data as well as the iron response elements data set. The dissertation also presents a preliminary analysis of a large biological data set of HIV sequences in which crossover points were localized.
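
The first and third stages of the pipeline described above (cutting sequences into bricks and building a distance matrix) can be sketched in Python as follows. The brick length, the step between bricks, and the use of Hamming distance on primary sequence are assumptions made for illustration; the folding stage and the structure-based distance measure introduced in the dissertation are not reproduced here.

```python
def make_bricks(sequence, brick_length, step):
    """Cut one RNA sequence into fixed-length segments ("bricks")."""
    if brick_length <= 0 or step <= 0:
        raise ValueError("brick_length and step must be positive")
    last_start = len(sequence) - brick_length
    return [sequence[i:i + brick_length] for i in range(0, last_start + 1, step)]


def hamming(a, b):
    """Stand-in distance: number of mismatched positions between two bricks."""
    if len(a) != len(b):
        raise ValueError("bricks must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(bricks):
    """Symmetric matrix of pairwise distances between bricks."""
    return [[hamming(a, b) for b in bricks] for a in bricks]


# Hypothetical input: bricks of length 6 taken every 3 bases.
bricks = make_bricks("GGGAAAGGGAAA", brick_length=6, step=3)
matrix = distance_matrix(bricks)
```

Because each stage is a separate function, a different distance measure can be swapped in without touching the brick construction, which is the kind of modularity the abstract describes.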


Sachet Shukla

Home Department: Electrical and Computer Engineering

Major Professor: Dr. Srinivas Aluru
Co-Major Professor: Dr. Charles Link

Title: Region-specific implication of 5'UTR motifs in translational control mechanisms

Abstract: This study uses a novel approach based on the RESCUE technique (Relative Enhancer and Silencer Classification by Unanimous Enrichment) (Fairbrother et al., 2002) to identify region-specific motifs in the 5'UTR. A highly selective screening procedure is described and implemented, which drastically reduces the false positive rate of motifs identified by the original technique. For increased accuracy, we present the results only for species that have well-curated mRNA data as maintained in the RefSeq curated database. The results of these computations suggest that there are motifs in the 5'UTR that act in conjunction with the Kozak consensus sequence in the process of translation initiation. Specifically, motifs have been identified in the inter-ATG regions of 5'UTRs with multiple uATGs (upstream ATGs) that may have an effect on translation initiation. Strong and weak Kozak sequences have also been associated with mutually exclusive motif sets both upstream and downstream of the true start codon. Finally, a number of motifs were identified as being preferentially present in the uORF (upstream Open Reading Frame) regions, which argues against the theory that uORF sequences are random. In general, uORF regions are also found to be strongly selective against motifs associated with strong Kozak sequences.

In addition to the above-stated results, which are applicable across species, motif overlap analysis (e.g., motifs that are associated with both strong Kozak sequences and the inter-ATG region upstream of the true start codon) also suggests some species-specific translational control mechanisms. The region-specific identification of motifs itself is probably indicative of higher-order secondary and tertiary structures and interactions. The experimental validation of these results could lead to the discovery of novel primary/secondary motifs and translational control mechanisms encoded in the 5' untranslated regions of different species.
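
As a rough illustration of the regions discussed above, the following Python sketch locates upstream ATGs in a 5'UTR and extracts the sequence between consecutive uATGs. Treating an inter-ATG region as the stretch between two successive uATGs is an assumption made for this example, and the function names and input sequence are hypothetical; the motif enrichment analysis itself is not shown.

```python
def find_uatgs(utr5):
    """Return 0-based start positions of every ATG in a 5'UTR sequence."""
    seq = utr5.upper().replace("U", "T")
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def inter_atg_regions(utr5):
    """Sequences lying between consecutive upstream ATGs (assumed definition)."""
    seq = utr5.upper().replace("U", "T")
    positions = find_uatgs(seq)
    return [seq[a + 3:b] for a, b in zip(positions, positions[1:])]


# Hypothetical 5'UTR containing two upstream ATGs.
uatg_positions = find_uatgs("CCATGGCTTAATGCC")
regions = inter_atg_regions("CCATGGCTTAATGCC")
```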


Michael Sparks

Home Department: Genetics, Development and Cell Biology

Major Professor: Dr. Volker Brendel
Co-Major Professor: Dr. Jonathan Wendel

Title: Computational annotation of eukaryotic gene structures: algorithms development and software systems

Abstract: An important foundation for the advancement of both basic and applied biological science is correct annotation of protein-coding gene repertoires in model organisms. Accurate automated annotation of eukaryotic gene structures remains a challenging, open-ended and critical problem for modern computational biology.

The use of extrinsic (homology) information has been shown as a quite successful strategy for this task, though it is not a perfect solution. Therefore, the continued development of methods not explicitly reliant on homology information—the so-called ab initio gene prediction methods—should help to more rapidly achieve a comprehensive understanding of gene content in model eukaryotes.

This thesis explores the development of novel algorithms in an attempt to advance the current state-of-the-art in ab initio gene prediction. The work has been conducted with an eye towards contributing open source, well-documented, and extensible software systems implementing the methods, and to generate novel biological knowledge with respect to plant taxa, in particular. Splice site prediction, coding fragment recognition, translation initiation site prediction and overall gene structure prediction will be discussed.


Robert Thompson

Home Department: Veterinary Microbiology and Preventive Medicine

Major Professor: Dr. Susan Carpenter
Co-Major Professor: Dr. Dan Ashlock

Title: Application of computational tools to analyze evolution of equine infectious anemia virus

Abstract: Evolution is the study of how variation alters the phenotype and population dynamics over time. Population genetics theories fit viral evolution well because of the properties of a viral population. Retroviruses are characterized by a high mutation and replication rate, which produces a heterogeneous mixture of viral variants commonly referred to as a quasispecies. Equine infectious anemia virus (EIAV) infection is a well-studied model for retrovirus variation and evolution (32, 33, 34). EIAV infection is characterized by a rapid, variable, dynamic disease course. Dynamic features of clinical disease as well as the ability of the horse to control the infection make EIAV an excellent system to study evolution of viral quasispecies during progression of clinical disease. Here, we describe analyses of genetic data from longitudinal studies of genetic variation in a horse experimentally infected with equine infectious anemia virus. These studies include the genes encoding the regulatory protein Rev and the surface envelope glycoprotein, SU. Phylogenetic and cluster analyses suggested that the population of Rev variants was comprised of two distinct quasispecies that co-existed during infection. The populations shifted rapidly during febrile and afebrile periods, with as little as 10 days between changes in population dominance. In this study, we also examined evolution of EIAV envelope quasispecies and found that quasispecies in the chronic period evolve by random processes while quasispecies in the inapparent period evolve by a combination of Darwinian selection and random processes. These results suggest that the envelope evolves by different processes during different stages of disease. Different evolutionary mechanisms during different stages of disease require stage-specific approaches to anti-retroviral therapy.
Together, these results suggest there are unique host environments and viral population interactions during different stages of disease. Multiple quasispecies and varying processes of evolution during persistent retrovirus infection challenge current thinking and have important biological implications for control of viral infections.


Peter Vedell

Home Department: Mathematics

Major Professor: Dr. Zhijun Wu
Co-Major Professor: Dr. Robert Jernigan

Title: Boundary Value Approaches To Molecular Dynamics Simulation

Abstract: Conformational transitions of biomolecules like proteins play an important role in many cellular processes, most often in a positive way, but sometimes in a detrimental way, perhaps causing diseases. Knowledge about conformational transitions of proteins and other biomolecules has the potential to be important in many areas of biological research. Simulation is an important means of studying these transitions. When a molecule has more than one known stable conformation, one can consider study of conformational transitions by a boundary value approach to molecular dynamics simulation. Application of multiple-shooting methods – an iterative numerical method for solving boundary value problems for ordinary differential equations – is proposed to find Newtonian molecular dynamics trajectories for a system subject to an all-atom molecular mechanics force field. These trajectories correspond to conformational transitions of proteins. Important aspects of this work include assessment of potential biological significance and computational challenges. The many computational issues include feasibility of the approach for larger systems, convergence properties, global optimization algorithms, efficient methods for finding initial trajectories, choice of boundary conditions, methods for parameter reduction, algorithms for handling the initial-value sub-problems, for computing Jacobian matrices, and for solving resulting nonlinear systems of equations. Distance matrix interpolation methods, which are particularly useful for constructing approximate trajectories for application in situations where all-atom Newtonian trajectories are not feasible, have previously been described ([Kim2002]). We introduce different distance matrix interpolation approaches that hold some promise for useful application for the purpose of efficiently constructing initial trajectories as well as for possible progress in construction of approximate trajectories. 
The results from simulating conformational transitions of alanine dipeptide are presented.

REFERENCES

[Kim2002] Kim M, Jernigan R, Chirikjian G. Efficient generation of feasible pathways for protein conformational transitions. Biophysical Journal, 83: 1620 (2002).
[Elb1999] Elber R, Meller J, Olender R. Stochastic path approach to compute atomically detailed trajectories: application to the folding of C peptide. Journal of Physical Chemistry B. 103: 6, (1999).
[Sch1997] Schlick T, Barth E, Mandziuk M. Biomolecular dynamics at long time steps: Bridging the timescale gap between simulation and experimentation. Annual Review of Biophysics and Biomolecular Structure. 26: 181 (1997).
[Ved2006a] Vedell P, Wu Z. Multiple Shooting Methods for Boundary Value Approaches to Biomolecular Dynamics Simulation. (submitted, 2006).
[Ved2006b] Vedell P, Wu Z. Shooting methods with inexact boundary conditions and parameter reduction for protein dynamics simulation (in preparation).
[Ved2006c] Vedell P, Jernigan R, Wu Z. Distance matrix interpolation methods for boundary value approaches to biomolecular dynamics simulation (in preparation).


Kent Vander Velden

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Gavin Naylor
Co-Major Professor: Dr. Vasant Honavar

Title: Spatial Clustering of Differences in Measured Homoplasy with Respect to Protein Structure

Abstract: The identification of residues that hold misleading phylogenetic signals and those that are of functional significance are intertwined. Advances in the one area can support the other mainly because misleading phylogenetic signals come from residues that are not evolving as a random process. This paper is a case study of the comparison of a widely accepted phylogenetic tree to trees that have been inferred from sequence data of five proteins. A new metric, RI Difference, based on the Retention Index, is suggested for measuring the relative support that individual sites provide for two trees. Through the identification of sites harboring disproportionately large misleading phylogenetic signal, we attempt to identify residues that are cooperating to define the function of the protein. This information is presented in the context of the structure of the protein, where clustering patterns (or the lack of them) are observed in the implicated residues. A new bioinformatic software tool, RI Compare, is presented that implements the metric and blends heterogeneous information from alignments, phylogenetic trees, and structure to support this research. The results are offered, followed by some speculation as to what might be causing erroneous trees to be inferred. The relationship of the implicated residues to those of known importance is also discussed. While, regrettably, the results of this paper do not seem to suggest that the RI Difference measure is a general measure for the identification of functionally important residues in all proteins, there is evidence to suggest it may be applicable to the large transmembrane class of proteins. Unfortunately, no experimental tests of the implicated residues have been performed at this time, and judgment of the correctness of the results has been based solely on the proximity of the implicated residues to ligands, other chains, and residues of known importance.
However, even if the RI Difference measure is identifying residues other than the functionally significant ones, the fact that the cluster patterns are unlikely to occur at random is intriguing.


Thomas Vigdal

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Daniel Voytas
Co-Major Professor: Dr. Volker Brendel

Title: Insertion site similarities in the Tc1/mariner element family

Abstract: In this study we report the first insertion site profile for the Tc1-like transposon Sleeping Beauty. We found that Sleeping Beauty prefers a consensus sequence of ATA TATAT, where the underlined TA represents the canonical target site. We also performed computational structural prediction analyses on Sleeping Beauty insertion sites and found that they were significantly different from random DNA. We then compared Sleeping Beauty’s insertion site structural profile with the insertion site profiles generated from three other studies of Tc1/mariner elements: Tc1 (genomic copies and experimentally induced insertions) (van Luenen and Plasterk, 1994), Tc3 (van Luenen and Plasterk, 1994) and Himar1 (Lampe et al., 1998). Through this comparison we found that bendability, protein-induced deformability and A-philicity are the most significant for determining insertion site preference. We further examined Tn5 and Ty1 insertion sites and found that Tn5 shares similarities with the Tc1/mariner elements whereas Ty1 insertions do not. Therefore, we predict that a large amount of the Tc1/mariner elements’, and possibly other DNA transposons’, target site specificity relies on unusual DNA structure in the area of insertion.


Jianmin Wang

Home Department: Computer Science

Major Professor: Dr. Xiaoqiu Huang
Co-Major Professor: Dr. Xun Gu

Title: COMPUTATIONAL STUDIES OF ESTS: ASSEMBLY, SNP DETECTION, AND APPLICATIONS IN ALTERNATIVE SPLICING

Abstract: EST sequences are important in functional genomics studies. To better use available EST resources, clustering and assembling are crucial techniques. For EST sequences with deep coverage, no current assembly program can handle them well. We describe a deep assembly program named DA. The program keeps the number of differences in each contig alignment under control by making corrections to differences that are likely due to sequencing errors. Experimental results on the 115 clusters from the UniGene database show that DA can handle data sets of deep coverage efficiently. A comparison of the DA consensus sequences with the finished human and mouse genomes indicates that the consensus sequences are of acceptable quality. EST sequences can be used in SNP discovery. We describe a computational method for finding common SNPs with allele frequencies in single-pass sequences of deep coverage. The method enhances a widely used program named PolyBayes in several aspects. We present results from our method and PolyBayes on eighteen data sets of human expressed sequence tags (ESTs) with deep coverage. The results indicate that our method used almost all single-pass sequences in computation of the allele frequencies of SNPs. EST sequences can also be used to study alternative splicing (AS), which is the most common post-transcriptional event in metazoans. We first developed a pipeline to identify AS forms by comparing alignments between expressed sequences and genomic sequences. Then we studied the relationship between AS and gene duplication. We observed that duplicate genes have fewer AS forms than single-copy genes; we also found that the loss of alternative splicing in duplicate genes may occur shortly after the gene duplication. Further analysis of the alternative splicing distribution in human duplicate pairs showed the asymmetric evolution of alternative splicing after gene duplications. We also compared AS among six species.
We found significant differences in both AS rates and splice forms per gene among the studied species through detailed and categorized studies. The difference in AS rate between rice and Arabidopsis is significant enough to lead to a difference in protein diversity between those two species.

References:

Jianmin Wang, Xiaoqiu Huang. A method for finding single-nucleotide polymorphisms with allele frequency in sequences of deep coverage. BMC Bioinformatics. 2005 6:220
Zhixi Su, Jianmin Wang (co-authors), Jun Yu, Xiaoqiu Huang, and Xun Gu. Evolution of alternative splicing after gene duplication. Genome Res. 2006


Xiangyun Wang

Home Department: Computer Science

Major Professor: Dr. Vasant Honavar
Co-Major Professor: Dr. Drena Dobbs

Title: Protein Function Classification: A Data-Driven Approach

Abstract: Machine learning offers one of the most effective and practical approaches to data-driven knowledge acquisition. The decision tree learning algorithm is one of the simplest and most commonly used machine-learning algorithms for data-driven induction of classifiers. My work describes an approach to data-driven discovery of sequence motif-based models in the form of decision trees for assigning protein sequences to functional families. Unlike approaches that try to classify protein sequences based on the presence of a single motif, this method is able to capture regularities that can be described in terms of the presence or absence of combinations of motifs. A training set of peptidase sequences with known functions is used to automatically construct decision trees that capture regularities that are sufficient to assign the sequences to their respective functional families.


Yingchun Wang

Home Department: Biochemistry, Biophysics & Molecular Biology

Major Professor: Dr. Parag Chitnis
Co-Major Professor: Dr. Suresh Kothari

Title: Identification and functional analysis of thylakoid membrane proteome

Abstract: Membrane proteins play crucial roles in many metabolic pathways. Functions of most membrane proteins remain to be revealed because of their insolubility. New technological breakthroughs in proteomics together with more available genomic sequence information make it possible to study functions of membrane proteins on a genome-wide scale. We are trying to use methods in biochemistry, genetics, proteomics and bioinformatics to study the functions of the thylakoid proteome of Synechocystis sp. PCC6803. The thylakoid membrane proteins were separated into peripheral and integral fractions and resolved on 2-D gels with different pH ranges. The protein spots in the 2-D gels were subjected to peptide mass fingerprinting analysis, and in total 390 out of 558 analyzed spots were identified as protein products of 128 individual genes, of which 38 genes encode hypothetical proteins with unknown function. To study the function of the hypothetical proteins, we knocked out the DNA sequence of the corresponding ORF, and 10 knockout mutants were obtained. The growth analysis for the mutant cells revealed that only one mutant (H1), which has a deletion in the ORF slr0110, showed a conditional growth phenotype. Detailed analysis indicated that the H1 mutant is sensitive to both glucose and light, which is caused by the over-reduction of the PQ pool in the thylakoid membrane. The ID and the structural and functional information of the identified proteins as well as the 2-D reference maps were included in a web-based relational database for thylakoid membrane proteins. The database was constructed with MySQL, and the application programs were developed with SQL, Perl, JavaScript and HTML. Users can search the information of identified proteins and compare their own identified proteins with the identified proteins in the database. A manager interface is also provided for the routine maintenance of the database.


Yufeng Wang

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Zhijun Wu
Co-Major Professor: Dr. Dan Ashlock

Title: Functional divergence and age distribution of vertebrate gene families

Abstract: Biology is undergoing a revolution based on the accelerating determination of DNA sequences, including the complete genomes of a growing number of organisms (Adams et al., 2000; International Human Genome Sequencing Consortium, 2001; Venter et al., 2001). During this post-genomic era, functional genomics seeks to devise and apply technologies that take advantage of the flood of sequence information to analyze and predict in vivo functions of proteins (Doolittle 1996; McKusick 1997; Durbin et al. 1998). One of the missions of protein genomics is to make direct predictions on function(s) prior to biological experimentation.

The objective of this study is to develop and apply statistical methods to predict functional content from primary sequence and to explore the pattern of gene family evolution. This dissertation is composed of a general introduction, four chapters, each of which is in journal manuscript format, and a general conclusions section. The four chapters that detail the core of the research work are outlined below.

Chapter 1 introduces a new statistical model for testing functional divergence and predicting critical residues (Gu 1999) through a case study of the caspase gene family. By taking advantage of substantial experimental data on caspases, the functional/structural basis of our predictions is extensively studied. The objective of this study is to show the potential of combining new methodology with the classical phylogenetic approach in functional genomics.

Chapter 2 extends the study to a comprehensive survey of functional divergence among a large number of gene families using the method of Gu (1999) (PHYBA, phylogeny-based analysis). The technical issues, biological implications and potential applications are addressed in detail in this chapter.

Chapter 3 investigates the evolutionary patterns of 49 gene families that are generated in the early stage of vertebrates. The times of gene duplications are estimated to test the hypothesis of two-rounds (2R) of genome duplication. Complicated evolutionary patterns (2R/3R) are surveyed.

Chapter 4 examines the impacts of gene duplications on the functional divergence in vertebrate gene families. Two patterns of functional divergence after gene duplication(s) are illustrated.


Matthew Wilkerson

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Volker Brendel
Co-Major Professor: Dr. Thomas Peterson

Title: Genesis of gene structures and computational analysis of U12-type introns

Abstract: Completely sequenced genomes provide a wealth of information that has allowed the exploration of large scale biological questions and continues to provide a critical resource for the advancement of biological research. Previously, the number of completely sequenced genomes was small and was generally limited to the model organisms. Currently, the number of genomes completely or partially sequenced is rapidly increasing, with 338 different eukaryotic genomes available as of October 2007. With a genome sequence in hand, the typical first step is gene structure annotation, or identifying the location and structural features of genes within the genome sequence, after which functional descriptions of the genes and relationships to homologous genes can be made, and other higher level research questions can be investigated. Annotation, then, imparts the biology of the organism onto the genome sequence. The goal of this thesis is to provide useful computational tools for gene annotation in emerging and mature genomes, and to analyze a particularly difficult-to-annotate gene feature. The process of gene structure annotation requires a genomic sequence of sufficient size so that it can contain a full gene, which in eukaryotes can be thousands of nucleotides. The popular method of whole genome shotgun sequencing produces small sequence fragments of hundreds of nucleotides, which are eventually assembled into chromosome sequences, a process that can take several years from start to finish. In the interim, these small sequence fragments are deposited into repositories for historical reference and dissemination purposes, but since they are too small to contain a gene, these fragments are not particularly useful for gene structure annotation purposes.
I have developed a web-based tool, Tracembler, which facilitates dynamic gene annotation of these fragments through on the fly sequence similarity searching and assembly. Hence, Tracembler allows biologists and interested scientists to immediately create gene annotations upon the latest sequences from emerging genomes without having to wait for the completion of the genome sequencing project. On the other end of the genome maturity spectrum, accurate gene structure annotation, which includes the biologically-correct specification of exons, introns, untranslated regions, protein coding regions, and alternatively spliced variants of a gene, remains a challenge for completely sequenced genomes. Pure computational approaches are excellent for providing an approximate initial summary of an organism’s gene space, but they are not completely accurate or comprehensive. Manual annotation by a human curator, who inspects and reviews the available evidence to make decisions in constructing a gene structure annotation, is considered the highest quality method. Hindrances to manual annotation are that it is time consuming, has restricted participation, and is not easy to conduct. Removing these limitations of manual annotation, I have developed the yrGATE (“your Gene structure Annotation Tool for Eukaryotes”) software, which enables individuals to create gene structure annotations using high quality evidence through an easy-to-use dynamic web browser interface and submit their annotations to a community database. A particular category of often mis-annotated genes is those containing U12-type introns. U12-type introns are a class of introns that have highly conserved sequence features, have a specific spliceosome that processes their removal from pre-mRNA transcripts, and comprise less than 1% of the introns in any studied eukaryotic organism. 
One reason for their mis-annotation is that most gene prediction programs are not designed to specifically recognize them, which is likely caused by U12-type introns’ unique sequence features and rare occurrence. Apart from their mis-annotation, U12-type introns are intriguing due to their unique proposed evolutionary history and due to their maintenance in organisms at very low frequencies in a seemingly functional redundancy with the major splicing system. In order to further the understanding of this intriguing gene feature, a large-scale annotation and computational investigation of U12-type introns in the context of their host genes and evolution was completed, which yielded several new discoveries.


Di Wu

Home Department: Mathematics

Major Professor: Dr. Zhijun Wu
Co-Major Professor: Dr. Robert Jernigan

Title: Distance-based Protein Structure Modeling

Abstract: Protein structure modeling can be studied based on knowledge of the interactions or distances between pairs of atoms. This so-called distance-based protein structure modeling includes problems of structure determination and refinement as well as analysis of protein dynamics. The distances for certain pairs of atoms in a protein can often be obtained from our knowledge of various types of bond lengths and bond angles or from physical experiments such as nuclear magnetic resonance (NMR). The coordinates of the atoms and hence the protein structure can then be determined by using the known distances. However, this requires the solution of a mathematical problem called the distance geometry problem, which has been proved to be computationally intractable in general. On the other hand, due to insufficient distance data such as nuclear Overhauser effect (NOE) data in NMR, the protein structures determined by conventional techniques usually are not as accurate as desired. Therefore, the uses of such protein structures in important applications including homology modeling and rational drug design have been severely limited. In this work, we have developed several efficient algorithms, along with supporting theory, for the solution of the distance geometry problem using a geometric build-up algorithm. We also introduced a knowledge-based method for protein structure refinement, in which we constructed a dedicated structural database for protein inter-atomic distance distributions and derived so-called mean force potentials to refine NMR-determined protein structures. We have participated in the CASPR competition regarding comparative models and reported some substantial improvement using mean force potentials. Finally, an efficient and simple method called Local-DME calculations has been developed specifically to study protein dynamics of NMR ensembles.

References:

Wu, D., and Wu, Z. An Updated Geometric Build-Up Algorithm for Solving the Molecular Distance Geometry Problem with Sparse Distance Data. Journal of Global Optimization, 2006 (accepted).

Wu, D., Cui, F., Jernigan, R., and Wu, Z., PIDD: Database for Protein Inter-atomic Distance Distributions, submitted to NAR, 2006.

Wu, D., Jernigan, R., and Wu, Z., Refinement of NMR-Determined Protein Structures with Database Derived Mean Force Potentials, to be submitted, 2006.


Shiquan Wu

Home Department: Mathematics

Major Professor: Dr. Xun Gu
Co-Major Professor: Dr. Zhijun Wu

Title: Comparative genomics: Multiple genome rearrangement and efficient algorithm development

Abstract: Multiple genome rearrangement by signed reversal is discussed: For a collection of genomes represented by signed permutations, reconstruct their evolutionary history by using signed reversals, i.e., find a tree where the given genomes are assigned to leaf nodes and ancestral genomes (i.e. signed permutations) are hypothesized at internal nodes such that the total reversal distance summed over all edges of the tree is minimized. It is equivalent to finding an optimal Steiner tree that connects the given genomes by signed reversal paths. The key for the problem is to reconstruct all optimal ancestral genomes or Steiner nodes.

The problem is NP-hard and can only be solved by efficient approximation algorithms. Various algorithms and programs have been designed to solve the problem, such as BPAnalysis, GRAPPA, the grid search algorithm, and the MGR greedy split algorithm (Chapter 1). However, they may have expensive computational costs or low inference accuracy. In this thesis, several new algorithms are developed, including the nearest path search algorithm (Chapter 2), the neighbor-perturbing algorithm (Chapter 3), the branch and bound algorithm (Chapter 3), the perturbing-improving algorithm (Chapter 4), and the partitioning algorithm (Chapter 5). With theoretical proofs, computer simulations, and biological applications, these algorithms are shown to be 2-approximation algorithms and more efficient than the existing algorithms.
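
The objective stated above can be made concrete with a small Python sketch: a signed reversal reverses a segment of a signed permutation and flips the sign of every gene in it, and a candidate tree is scored by summing reversal distances over its edges. The exhaustive breadth-first search used here for the distance is only practical for very small genomes and is an illustrative stand-in, not one of the algorithms developed in the thesis; all function names and the example genomes are hypothetical.

```python
from collections import deque


def signed_reversal(genome, i, j):
    """Reverse genome[i..j] (inclusive, 0-based) and flip each gene's sign."""
    if not 0 <= i <= j < len(genome):
        raise ValueError("need 0 <= i <= j < len(genome)")
    flipped = [-g for g in reversed(genome[i:j + 1])]
    return list(genome[:i]) + flipped + list(genome[j + 1:])


def reversal_distance(source, target):
    """Minimum number of signed reversals from source to target (brute force)."""
    start, goal = tuple(source), tuple(target)
    if sorted(map(abs, start)) != sorted(map(abs, goal)):
        raise ValueError("genomes must contain the same genes")
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        genome, dist = queue.popleft()
        if genome == goal:
            return dist
        for i in range(len(genome)):
            for j in range(i, len(genome)):
                nxt = tuple(signed_reversal(genome, i, j))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    raise RuntimeError("target not reachable")  # not expected for valid input


def tree_cost(edges, genomes):
    """Total reversal distance summed over all edges of a tree."""
    return sum(reversal_distance(genomes[u], genomes[v]) for u, v in edges)


# Hypothetical star tree: three leaf genomes joined to one ancestral genome M.
genomes = {
    "A": [1, -2, 3],
    "B": [-3, -2, -1],
    "C": [1, 2, -3],
    "M": [1, 2, 3],
}
cost = tree_cost([("M", "A"), ("M", "B"), ("M", "C")], genomes)
```

In this example each leaf is one reversal away from the hypothesized ancestor M, so the tree has total cost 3; choosing a different ancestral genome would change the score, which is the quantity the Steiner tree formulation minimizes.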


Wu Xu

Home Department: Biochemistry, Biophysics & Molecular Biology

Major Professor: Dr. Parag Chitnis
Co-Major Professor: Dr. Suresh Kothari

Title: Is there a code for transcription factor-DNA recognition?

Abstract: Whole genome sequences are available from a wide variety of species, including 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant. There are 2000-3000 transcription factors among the approximately 30,000-40,000 genes in the human genome, and they play the central role in controlling cell development, cell growth and differentiation. Abnormal activity of transcription factors often leads to disease. Elucidating the transcriptional regulatory network will be the next challenge of the post-genomic era. Gene regulation begins with the selective binding of transcription factors to a particular DNA site out of a vast number of potential sites in the genome. It is unclear how transcription factors specifically recognize the correct sites among hundreds or thousands of potential sites in the genome. We investigated DNA recognition sites functionally mapped by biochemical and biophysical approaches, as well as transcription factor-DNA complexes solved by X-ray or NMR from the Protein Data Bank. The purpose of this study is to find whether there is a simple code for transcription factor-DNA recognition. Our analyses show that (i) DNA recognition sequences are typically 4-10 bases long; (ii) there is no GC or AT preference in the sequences studied; (iii) the positively charged amino acids Arg and Lys make the majority of contacts with bases and phosphates; (iv) some favored interaction pairs, Arg-G, Lys-G and Glu-C, are observed. However, no simple code for transcription factor-DNA recognition emerges from our study. A relational database for storing and retrieving the collected data was built as an example to demonstrate the importance of databases in computational biology.


Aimin Yan

Home Department: Biochemistry, Biophysics and Molecular Biology

Major Professor: Dr. Robert Jernigan
Co-Major Professor: Dr. Zhijun Wu

Title: Analysis on protein structures using statistical and computational methods

Abstract: The orientation of side chains relative to the radial vector from the center of the protein to an amino acid is studied. We find that the average angles for different residue types are highly correlated with their hydrophobicities, and the average side chain orientations in different parts of structures exhibit characteristically different features. The application of our findings on side chain orientation to protein tertiary structure prediction has also been considered. Several statistical machine learning methods are used to check the predictability of side chain orientation.

One method to validate the computed motions generated from the elastic network models (ENM) is to compare them with the principal components (PCs) of multiple structures. The multiple structures of the same protein are superimposed first, and the correspondence between the experimental conformational changes represented by PCs and the normal modes from ENM are calculated. Here we use two superposition methods (least-squares fitting and maximum likelihood), and find that the extent of the correspondence between two conformational spaces depends on the superposition method.

The effects on motion of removing some protein subunits from partial 30S ribosome structures are studied. Our results show that some larger changes caused by removing a single protein subunit can be restored by the removal of another subunit, which indicates their interdependence. We further find that the subunits showing such interdependence have strongly positive motion correlations and interact with each other, which is consistent with previous computational studies and experimental results from other groups.


Lei Yang

Home Department: Biochemistry, Biophysics, and Molecular Biology

Major Professor: Dr. Robert Jernigan
Co-Major Professor: Dr. Zhijun Wu

Title: Understanding protein motions by computational modeling and statistical approaches

Abstract: Because of its appealing simplicity, the elastic network model (ENM) has been widely accepted and applied to study many molecular motion problems, such as the molecular mechanisms of chaperonin GroEL-GroES function, allosteric changes in hemoglobin, ribosome motions, motor-protein motions, and conformational changes in general. In this dissertation, the ENM is employed to study various protein dynamics problems, and its validity is also examined by comparing with experimental data.

First, we apply principal component analysis (PCA) to identify the essential protein motions from multiple structures (X-ray, NMR and MD) of the HIV-1 protease. We find significant similarities between the first few of these key motions and the first few low-frequency normal modes from the ENM, suggesting that the ENM provides a coarse-grained and structurally-based explanation for the experimentally observed conformational changes.

Second, we extend these approaches from a single protein (HIV-1 protease) to thousands of proteins whose multiple NMR structures are available. We also find close correspondence between the experimentally observed dynamics and the ENM predicted ones, indicating the validity of using the ENM to computationally predict protein dynamics.

Third, we develop a regression model for the isotropic B-factor predictions by combining the protein rigid body motions with the ENM. The new model shows significant improvements in B-factor predictions. Fourth, we further examine the validity of using the ENM to study protein motions. We use the anisotropic form of ENM to predict the anisotropic temperature factors of proteins. It presents a timely and important evaluation of the model, shows the extent of its accuracy in reproducing experimental anisotropic temperature factors, and suggests ways to improve the model.

Finally, we apply the ENM to study a dataset of 170 protein pairs having "open" and "closed" structures, and try to address how well a conformational change can be predicted by the ENM and how to improve the model. The results indicate that the applicability of ENM for explaining conformational changes is not limited by either the size of the studied protein or even the scale of the conformational change. Instead, it depends strongly on how collective the transition is.
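To make the elastic network idea concrete, here is a minimal sketch of the simplest isotropic variant, the Gaussian network model (GNM), which predicts relative residue fluctuations (proportional to isotropic B-factors) from a contact matrix. This is not the regression model or the anisotropic model developed in the dissertation; the function name, the 7.0 Å default cutoff, and the toy coordinates are assumptions chosen for illustration.

```python
import numpy as np


def gnm_fluctuations(coords, cutoff=7.0):
    """Relative mean-square fluctuations from a Gaussian network model.

    coords: (n, 3) array of node positions (e.g. C-alpha atoms), in angstroms.
    cutoff: nodes closer than this distance are joined by a spring
            (assumed value; it is a tunable parameter).
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    contact = (dist <= cutoff) & ~np.eye(len(coords), dtype=bool)

    # Kirchhoff (graph Laplacian) matrix: -1 for contacts, degree on diagonal.
    kirchhoff = -contact.astype(float)
    np.fill_diagonal(kirchhoff, contact.sum(axis=1))

    # The pseudo-inverse discards the zero eigenvalue (rigid-body mode);
    # its diagonal is proportional to the predicted fluctuations.
    return np.diag(np.linalg.pinv(kirchhoff))


# Toy example: five nodes on a straight line, 3.8 angstroms apart,
# with a cutoff that connects only sequence neighbors.
toy_chain = [[3.8 * k, 0.0, 0.0] for k in range(5)]
toy_fluct = gnm_fluctuations(toy_chain, cutoff=4.0)
```

In the toy chain the end nodes have fewer contacts and therefore larger predicted fluctuations than the central node, which mirrors the flexible termini commonly seen in experimental B-factors.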


Changhui Yan

Home Department: Computer Science

Major Professor: Dr. Vasant Honavar
Co-Major Professor: Dr. Drena Dobbs

Title: Identification of interface residues involved in protein-protein and protein-DNA interactions from sequence using machine learning approaches

Abstract: In this study, we develop machine-learning methods to identify amino acid residues involved in protein-protein interactions and protein-DNA interactions. We focus on the methods using sequence information alone and build classifiers that can classify residues into interface and non-interface residues based on local sequence information. To facilitate the study of developing machine-learning algorithms to identify interface residues and the study of searching for characteristics that can distinguish the interfaces from the rest of the proteins, we also develop a database of protein-protein interfaces and systematically analyze the characteristics of the interfaces.
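A common way to present "local sequence information" to a classifier is a fixed-width window centered on each residue. The sketch below shows that encoding step only; the window half-width, the padding character, the function name, and the example sequence and labels are all assumptions for illustration and are not taken from the thesis.

```python
def sequence_windows(sequence, labels, half_width=4, pad="X"):
    """Pair each residue's label with the window of residues around it.

    sequence:   amino acid string.
    labels:     one label per residue (e.g. 1 = interface, 0 = non-interface).
    half_width: residues taken on each side of the target (assumed value).
    pad:        placeholder used beyond the sequence ends (assumed symbol).
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [(padded[i:i + size], labels[i]) for i in range(len(sequence))]


# Hypothetical six-residue example with made-up interface labels.
examples = sequence_windows("ACDEFG", [0, 1, 0, 0, 1, 0], half_width=1)
```

Each `(window, label)` pair can then be turned into a feature vector and passed to whichever classifier is being trained.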


Liang Ye

Home Department: Computer Science

Major Professor: Dr. Xiaoqiu Huang
Co-Major Professor: Dr. Dan Voytas

Title: Sequence comparison methods, statistics, and applications

Abstract: With more genomes being sequenced, understanding biological signals encoded in a genome has become a key challenge in modern biology. Cross-species comparison is a powerful approach in revealing those functional elements. In this thesis we first address some basic issues in sequence comparison, including optimization of sequence alignment parameters and statistical significance assessment of similarity scores. We present a method for assessing the effects of parameters on the sensitivity and specificity of an alignment algorithm on real coding DNA sequences. We then describe a computational and statistical method for assessing the statistical significance of the best alignment between two protein sequences. Multiple alignment of genomic sequences is a powerful approach for genome data analysis and annotation. We develop a sensitive multiple alignment program named MAP2 based on the generalized pairwise global alignment algorithm evaluated and tested above for handling long, different intergenic and intragenic regions in genomic sequences. We propose two similarity measures for evaluation of the performance of MAP2 and existing multiple alignment programs. We also present experimental results by MAP2 on six simulated data sets to show its strength in detecting the boundaries between similar and different regions. Finally, we apply different alignment algorithms to various sequence data, including genomic sequences, EST sequences, and cDNA sequences in the grass family, to explore gene conservation among the grass family and examine the usage of the rice genome as a reference to study other grass genomes.
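For readers unfamiliar with pairwise global alignment, the following is a minimal sketch of the classic Needleman-Wunsch score computation with a linear gap penalty. It is a deliberately simpler technique than the generalized global alignment algorithm underlying MAP2, and the scoring parameters are arbitrary assumptions; it also illustrates why parameter choice matters, since the optimal score depends directly on them.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b
    (Needleman-Wunsch, linear gap penalty, assumed scoring values)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        curr = [i * gap]
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Identical sequences score one match per position.
identical_score = global_alignment_score("GATTACA", "GATTACA")
# One deleted base costs a single gap: three matches plus one gap.
one_gap_score = global_alignment_score("ACGT", "AGT")
```

Only two rows of the dynamic programming table are kept, so memory use grows with the length of the shorter dimension rather than the full table.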

References:

Ye L., Wang J., Huang X. Selection of effective parameter values for alignment of DNA sequences. (Submitted to BMC Bioinformatics)
Ye L., Huang X. (2005) MAP2: multiple alignment of syntenic genomic sequences. Nucleic Acids Research, 33(1):162-170


Hailong Zhang

Home Department: Botany

Major Professor: Dr. Eve Wurtele
Co-Major Professor: Dr. Julie Dickerson

Title: MetNet DB: A Comprehensive Metabolic and Regulatory Network Database

Abstract: One of the major challenges in the post-genome era is to determine the cellular functions of genes and their products, and to understand how the interactions among these entities in cellular contexts could yield a living cell. To attack this problem, the Gene Expression Tool kit (GET) project was launched at Iowa State University. This thesis will describe a general data model for representing metabolic and regulatory biological networks. The model is implemented in a relational database, MetNet DB, which serves as an information hub in the GET software package. The thesis will also present one of MetNet DB's practical applications: the Probe database, which is based on information derived from MetNet DB. The Probe database provides integrated functional annotations for Arabidopsis microarray probes. Currently two large datasets are supported: the Affymetrix Arabidopsis GeneChip and the AFGC EST microarray. The Probe database can be seamlessly integrated with other microarray data analysis tools, such as GeneSpring. This provides efficient annotation for mining Arabidopsis RNA profiling data.


Wuyan Zhang

Home Department: Statistics

Major Professor: Dr. Alicia Carriquiry
Co-Major Professor: Dr. Jack Dekkers

Title: The design and analysis of microarray experiments using pooled samples for the study of quantitative traits

Abstract: Microarrays can simultaneously measure the mRNA expression levels of thousands of genes. In such experiments, mRNA samples are sometimes pooled across individuals to reduce cost or to increase mRNA volume in the sample.  Our main objective is to investigate the effect of pooling mRNA on different types of inferences drawn from three important types of genomic experimentation.  First, we investigate the effect of pooling mRNA on the power with which we can identify differentially expressed genes. We propose a statistical model for gene expression in a pool that mimics the process of mRNA pooling and develop the appropriate F statistics to test for differentially expressed genes. We show our power estimation is more conservative and less biased. Second, we investigate the effect of two different mRNA pooling strategies on the estimate of the correlation between phenotype and gene expression. We propose a maximum likelihood method to estimate the correlation between phenotype and expression. The MLE outperforms the standard Pearson correlation estimate in terms of bias and precision when individuals are stratified by phenotype prior to pooling. Finally, we evaluate the efficiency of a recently proposed QTL mapping approach which combines the idea of mRNA pooling with expression QTL transcriptome mapping.  We argue that by pooling mRNA we can reduce the number of microarrays required by 2-fold or more and directly target the generation of expression data that is relevant to the phenotypic traits of interest.  The reduction in cost can be achieved with negligible loss in power when QTL mapping is done via the standard regression approach. However, when mapping is carried out via composite interval mapping which takes into account linkage disequilibrium effects, the loss in power can be significant.


Xiaosi Zhang

Home Department: Computer Science, Artificial Intelligence Research Laboratory

Major Professor: Dr. Vasant Honavar
Co-Major Professor: Dr. Xun Gu

Title: Gene Expression Analysis

Abstract: Microarray technology provides an approach to measure the expression levels of a large number of genes simultaneously and an insight into the transcriptional state of the cell. It can be used for searching for co-expressed genes under certain conditions. As such, it has become a powerful tool in genetic network research and functional genomics. Meanwhile, the technology produces large amounts of data and the data interpretation becomes a major bottleneck.

In this study, public yeast gene expression data is analyzed by Principal Components Analysis (PCA), Hierarchical Clustering, Self Organizing Mapping (SOM) and Adaptive Resonance Theory 2 (ART-2). The four statistical methods are also applied to maize chloroplast protein expression data during the greening process. PCA can reduce the dimensionality of the data set. The first few components contain the most variance in the data and represent meaningful expression patterns. ART-2 is a neural network method, which is applied to gene expression analysis for the first time in our study. It provides very good clustering quality. Unlike Hierarchical Clustering, ART-2 is not limited by a rigid structure, and unlike SOM, it does not require the number of clusters to be fixed in advance. ART-2 can deal with noise in the data, and it is easy to implement and its results are easy to interpret. The algorithm is also fast and scalable.
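The dimensionality reduction step can be sketched with a singular value decomposition. This is a generic PCA outline, not the code used in the study; it assumes rows are genes and columns are experimental conditions, and the function name and toy matrix are hypothetical.

```python
import numpy as np


def pca(expression, n_components=2):
    """Principal component analysis via SVD.

    expression: (genes, conditions) matrix (assumed layout).
    Returns the per-gene scores, the component directions, and the
    fraction of total variance explained by each kept component.
    """
    x = np.asarray(expression, dtype=float)
    centered = x - x.mean(axis=0)  # center each condition
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = u[:, :n_components] * s[:n_components]
    return scores, vt[:n_components], explained[:n_components]


# Toy matrix whose rows lie on one line, so a single component
# captures all of the variance.
toy_scores, toy_components, toy_ratio = pca(
    [[0, 0], [1, 2], [2, 4], [3, 6]], n_components=1
)
```

The explained-variance fractions indicate how many components are worth keeping before clustering the scores.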


Zhongqi Zhang

Home Department: Statistics

Major Professor: Dr. Kenneth Koehler
Co-Major Professor: Dr. Xun Gu

Title: Application of computational tools to analyze evolution of equine infectious anemia virus

Abstract: My Ph.D. is mainly about applying statistical methods to the analyses of gene expression data, i.e. microarray data, putting the gene expression process into an evolution framework, and characterizing the expression evolution procedure. Such expression divergence analysis can deepen our understanding of the phenotypic evolution at the transcriptional level.

Molecular phylogeny currently plays a major role in analyzing genomic data, trying to understand the relationship between genes, chromosomes and species. However, for another major source of genomic information, large-scale gene expression analysis, little research has been done from an evolutionary point of view. In chapter 1, we reviewed a preliminary phylogenetic expression analysis developed by Gu (2000) that used a Brownian motion process to represent expression variation among duplicate genes in a gene family. The general Brownian-based model can be transformed and restricted to obtain several derived models or sub-models. Each sub-model can be applied to deal with specific biological questions, depending on the imposed restrictions.

Based on the E0 model described by Gu (2004), we develop a fast algorithm to predict expression profiles at the ancestral nodes (genes). By comparing ancestral expression profiles with progeny expression profiles, expression divergence, i.e. the change in expression profile along the duplication lineage, can be revealed and quantified. Such expression divergence can be used as an indicator of function divergence, showing whether the gene activity is under selection pressure along that specific lineage and inferring the potential function difference between progeny genes. Details about ancestral expression inference can be found in chapter II.

The phylogenetic expression analysis proposed by Gu (2004) is rather complicated, especially as it requires the use of maximum likelihood estimation, which is sensitive to model assumptions. In chapter III, we transformed the key idea of the Brownian-based E0 model into the form of an expression distance structure, and used the modified molecular phylogenetic approach to reconstruct an expression tree. Such expression phylogeny has the same convenience and flexibility as the molecular phylogeny in molecular evolutionary study. However, we did notice that there are some differences between expression phylogeny and molecular phylogeny, and such differences reveal the decoupling between expression profile evolution and sequence evolution.

In chapter IV, we used yeast expression data and motif data to study the relationship between expression divergence and motif divergence. Although it has long been believed that motif structure is the key factor in shaping the expression profiles, our analysis only reveals a weak coupling relationship between the two profiles. Many studies have shown that transcription regulation is a very complicated dynamic process. It involves interactions between DNA and proteins, such as motifs and transcription factors; interactions between proteins, such as the transcription factors and their cofactors; the modification and degradation of the proteins; the structure of the DNA sequence, such as the condensation of local chromatin; etc. Our results simply indicate that motif structure is only part of the story and people should be very cautious when making assumptions about the relationship between motif structures and expression profiles.

The research summarized in this dissertation is still in its theoretical stage. My next and immediate task is to apply those ideas to the analysis of real data. In order to accomplish this goal, statistical modeling and analysis in chapter II and III will first be incorporated into a program package which allows convenient and fast analysis, especially when dealing with massive datasets at the genomic level. In this research report, we showed some examples. In the future, we will extend these analyses to the whole genome of some organism, in particular, all the yeast gene families, and conduct the expression divergence analysis at the genome level.

During the study of motif and expression relationships, we realized the importance of gene networks in all aspects of an organism's activities. In my next research project, I plan to combine the information from all the available components of gene networks, such as gene expression, gene duplication, metabolic pathways, motif structure, null mutation mutants, etc., and see if I can identify some relationships among those components.

My long-term research goal is to integrate genomic data resources with evolutionary concepts and further investigate relationships among sequence divergence, expression divergence and function divergence. A key interest is to understand how those divergence processes can be related to or shaped by the structure and development of the gene networks.


Hua Zhou

Home Department: Statistics

Major Professor: Dr. Karin Dorman
Co-Major Professor: Dr. Susan Carpenter

Title: Branching process models for HIV-1 drug resistant mutants

Abstract: HIV drug therapy often fails because of the appearance of resistant viral mutants. Thus knowledge of the abundance of resistant mutants prior to treatment is essential for optimizing drug therapy to avoid resurgence of resistant mutants. A simple multitype continuous-time branching process model is developed and investigated for the generation of resistant viral mutants during HIV-1 infection. The growth of mutant populations is characterized by their means, variances and distributions from the start of acute infection to the equilibrium state in the chronic stage. The expressions for the equilibrium frequencies of mutants are derived and their dependence on mutation rates and mutant fitness explored. The model suggests that mutants with three or more point mutations are unlikely to occur prior to treatment. A similar branching process model is also used to compute the number of resistant mutants that are generated de novo during treatment. Then the two possible causes of resistance-related treatment failure are discriminated by characterizing the ratio of the number of resistant mutants produced de novo to the number of preexisting resistant mutants.


Huaijun Zhou

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Xun Gu
Co-Major Professor: Dr. Susan Lamont

Title: Statistical analysis of functional divergence in gene families

Abstract: Gene duplication is thought to be a major source of functional innovation in a large number of gene families. The prediction of critical residues for functional divergence between homologous genes is important for functional genomics. The Toll-like receptor (TLR) gene family plays an important role in innate immunity and the adaptive immune response. All TLR protein sequences from vertebrate animals were collected to investigate functional divergence and evolutionary patterns between TLR gene clusters. Four independent clusters were identified. Functional divergence of the domains in the TLR family was characterized by a site-specific posterior profile analysis, and critical residues for altered selective constraints of amino acid sites after gene duplication were predicted. The extracellular domain of TLR genes showed higher functional divergence than the cytoplasmic domain. Further analysis indicated that the region between Leucine-rich repeats (LRR) 10 and 14 of TLR4 is a potential target for future functional genomics studies. For a study of a large set of gene families, we collected the cDNA sequences of all orthologous genes from human, mouse or rat belonging to two-gene, three-gene, and four-gene clusters from all available gene families in the database. The nonsynonymous and synonymous substitution rates for all orthologs between human and mouse or rat were estimated, and the ratio of the nonsynonymous to the synonymous substitution rate was calculated. The nonsynonymous substitution rate was positively correlated with both the synonymous substitution rate and the ratio of nonsynonymous to synonymous substitution rates, which suggests that the nonsynonymous substitution rate is a major determinant of that ratio. The significant differences in nonsynonymous substitution rate among most paralogous genes suggest that nonsynonymous substitution plays an important role in creating novel function following gene duplication.


Wei Zhu

Home Department: Genetics, Development & Cell Biology

Major Professor: Dr. Volker Brendel
Co-Major Professor: Dr. Srinivas Aluru

Title: Spliced alignment and its application in Arabidopsis thaliana

Abstract: The goal of my project has been to develop and apply methods for gene identification in genome sequences based on evidence from expressed sequence tags (ESTs) or homologous protein sequences. For this purpose, we developed an efficient spliced alignment program, GeneSeqer (available at http://bioinformatics.iastate.edu/cgi-bin/gs.cgi), which is capable of aligning ESTs with a large genomic sequence. Another program, MyGV (available at http://bioinformatics.iastate.edu/bioinformatics2go/MyGV/), written in Java as a browser to visualize the output of GeneSeqer, has also been distributed recently. As a practical test and demonstration, GeneSeqer was applied to map 174,628 Arabidopsis EST sequences onto the whole genome of Arabidopsis thaliana (5 chromosomes, about 117M bp in total), and all results were parsed and imported into a MySQL database. Much useful information was inferred from the Arabidopsis spliced alignment results, which could serve as a valuable resource for a number of projects of special scientific interest, such as alternative splicing, non-canonical splice sites, mini-exons, etc. We developed an elaborate web interface that allows visual and interactive querying and browsing of EST spliced alignments and GenBank annotation, accessible at http://zmdb.iastate.edu/PlantGDB/AtGDB.html.


Copyright © 2000-2008, Iowa State University, all rights reserved.
Please direct corrections, suggestions, and comments to bcb@iastate.edu.