Understanding how genes interact plays a vital role in the analysis of complex biological systems. Gene regulatory networks (GRNs) represent the gene-gene regulatory interactions in a genome and display the relationships between various gene activities. GRN modeling and inference are carried out mainly with gene expression microarray data, which are massive, heterogeneous and high-dimensional. In a typical dataset, the number of samples n (on the order of tens) is substantially smaller than the number of genes p (on the order of hundreds or even thousands), which makes it very difficult to reconstruct a GRN from the data. The aim of this thesis is to develop a novel causal model for GRN inference that exploits naturally existing causal gene interactions (i.e. the expression of gene Y is caused by interaction with another gene X), thereby achieving higher reconstruction accuracy. The method is based on decomposing the entire GRN into sub-networks, namely the Markov blankets (MBs) of each gene. The causal GRN model is obtained by applying a minimal set of constraints, which reduces the extremely large search space to a smaller set of possible models. The reconstructed networks are pruned further to eliminate false positives, resulting in a minimally connected, best-fit GRN for the data. Synthetic datasets allow new techniques and approaches to be validated, since the underlying mechanisms of the GRNs that generate these datasets are completely known. Realistic synthetic datasets are used to validate the robustness of the method under varying topology, sample size, time delay, noise, vertex in-degree and presence of hidden nodes. We present a novel approach for synthetically generating gene networks using causal relationships.
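As an illustration of the decomposition into Markov blankets mentioned above, the following minimal sketch computes the Markov blanket of each gene in a small directed network, using the standard definition (parents, children and the other parents of those children). The gene names, the `toy_grn` structure and the `markov_blanket` helper are hypothetical and chosen only for illustration; this is not the thesis's inference procedure, which has to learn the structure from data.

```python
from typing import Dict, Set


def markov_blanket(parents: Dict[str, Set[str]], gene: str) -> Set[str]:
    """Return the parents, children and co-parents of `gene` in a DAG.

    `parents` maps every gene to the set of genes that regulate it.
    """
    # Children are the genes that list `gene` among their regulators.
    children = {g for g, ps in parents.items() if gene in ps}
    # Co-parents (spouses) are the other regulators of those children.
    spouses: Set[str] = set()
    for child in children:
        spouses |= parents[child]
    blanket = set(parents.get(gene, set())) | children | spouses
    blanket.discard(gene)  # a gene is not part of its own blanket
    return blanket


# Hypothetical toy network: each gene maps to the set of its regulators.
toy_grn = {
    "G1": set(),
    "G2": {"G1"},
    "G3": {"G1", "G4"},
    "G4": set(),
    "G5": {"G3"},
}

# One sub-network (Markov blanket) per gene.
blankets = {g: markov_blanket(toy_grn, g) for g in toy_grn}
```

In this toy example each blanket is a small sub-network around one gene, which is the sense in which a whole GRN can be viewed as a collection of per-gene Markov blankets.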
To reverse engineer the gene network accurately and efficiently from time-course expression data, a Guided Genetic Algorithm (GGA) is developed to carry out a heuristic search through the space of qualitative causal networks, incorporating the causal relationships between genes. The GGA exploits diversity and high-level heuristics to generate fit networks quickly (in fewer iterations) and is shown to outperform the simple GA (SGA) currently applied by researchers. Building upon GGA, we further improve the search process with another new technique, which we refer to as FOMBGA (Frequently Occurring Markov Blanket Genetic Algorithm). FOMBGA replaces the crossover and mutation operators with a probabilistic model of the frequency of occurrence of fit Markov blankets (MBs). Estimating the GRN parameters amounts to estimating the conditional probability distributions (CPDs) of the given GRN. Because the data are high-dimensional, exact computation of the CPDs is computationally infeasible. In this thesis, given the network structures, we deduce a unique minimal I-map of the GRN by estimating the conditional probability distribution of each variable (gene) from the dataset. This is achieved with a novel variant of the Markov Chain Monte Carlo (MCMC) method in which the search space is gradually reduced, so that convergence occurs quickly and within a reasonable computation time. The performance of the parameter estimation technique is further improved by integrating regulatory sequence motif data and GO annotations. Investigations are carried out using both synthetic datasets and yeast cell-cycle gene expression datasets. The experiments show that the proposed modeling approach has excellent inferential capabilities and high accuracy even in the presence of noise.
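To make concrete what estimating a CPD for a gene given a fixed network structure involves, the sketch below estimates P(gene | regulators) from discretised expression samples. It deliberately uses simple smoothed frequency counts rather than the MCMC variant described above, whose details are not given here. The function name `estimate_cpd`, the binary on/off discretisation, the pseudocount and the toy samples are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def estimate_cpd(
    samples: List[Dict[str, int]],
    gene: str,
    parents: Sequence[str],
    states: Sequence[int] = (0, 1),
    pseudocount: float = 1.0,
) -> Dict[Tuple[int, ...], Dict[int, float]]:
    """Estimate P(gene | parents) by smoothed counting (not MCMC).

    Only parent configurations that occur in `samples` get an entry.
    """
    counts: Dict[Tuple[int, ...], Counter] = defaultdict(Counter)
    for row in samples:
        config = tuple(row[p] for p in parents)
        counts[config][row[gene]] += 1

    cpd: Dict[Tuple[int, ...], Dict[int, float]] = {}
    for config, c in counts.items():
        # Additive smoothing keeps unseen states at non-zero probability.
        total = sum(c.values()) + pseudocount * len(states)
        cpd[config] = {s: (c[s] + pseudocount) / total for s in states}
    return cpd


# Hypothetical discretised samples (1 = expressed, 0 = not expressed).
samples = [
    {"X": 1, "Y": 1},
    {"X": 1, "Y": 1},
    {"X": 1, "Y": 0},
    {"X": 0, "Y": 0},
    {"X": 0, "Y": 0},
]

# CPD of gene Y given its single assumed regulator X.
cpd_y = estimate_cpd(samples, "Y", ["X"])
```

With few samples and many regulators, most parent configurations are never observed, which is one way to see why direct estimation becomes impractical for high-dimensional expression data and why sampling-based approaches are attractive.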
The gene network inferred from yeast cell-cycle data is investigated for its biological relevance using well-known interactions, sequence analysis, motif patterns and GO data available in the literature. The studies recovered known interactions and predicted novel ones.
This thesis is protected by copyright. Copyright in the thesis remains with the author. The Monash University ARROW Repository has a non-exclusive licence to publish and communicate this thesis online.