Simon Fraser University


http://www.sfu.ca/


Charith Bhagya Karunarathna

July 31, 2017

A document by Charith Bhagya Karunarathna.

Charith Bhagya Karunarathna

July 29, 2017

A document by Charith Bhagya Karunarathna.

Using gene genealogies to localize rare variants associated with complex traits in di...

Charith Bhagya Karunarathna

July 24, 2017

Many methods have been proposed to detect disease association with sequence variants in candidate genomic regions. However, the literature lacks a comparison of these methods in terms of their ability to localize or fine-map the causal risk variants lying within the candidate region. We extend a previous comparison of the detection abilities of these methods to a comparison of their localization abilities. In contrast to previous work, cases and controls are sampled from a diploid (i.e., two-parent) rather than a haploid (one-parent) population. We simulated 200 sequencing datasets of a 2-million base-pair candidate genomic region for 50 cases and 50 controls. Risk variants were in a middle subregion. We present a case study of one simulated dataset to illustrate the methods and describe simulation results to score which method best localizes the risk subregion. Our results lend support to the potential of genealogy-based methods for genetic fine-mapping of disease.

Multivariate association between single-nucleotide polymorphisms in Alzgene linkage...

Elena Szefer

and 1 more

March 19, 2017

INTRODUCTION Alzheimer’s disease (AD) is a neurodegenerative disorder causing cognitive impairment and memory loss. The estimated heritability of late-onset AD is 60%-80%, and the largest susceptibility allele is the ε4 allele of _APOE_, which may play a role in 20% to 25% of AD cases. Numerous studies have identified susceptibility genes which account for some of the missing heritability of AD, with many associated variants having been identified through genome-wide association studies (GWAS) \citep[e.g.][]{Beecham_2009, Kamboh_2012, Bertram_2008}. Apart from _APOE_, the associated variants have mostly had moderate or small effect sizes, suggesting that the remaining heritability of AD may be explained by many additional genetic variants of small effect. Identifying susceptibility variants with small effect sizes in GWAS is challenging since strict multiple testing corrections are required to maintain a reasonable family-wise error rate. This analysis focuses on leveraging information from prior family studies of AD, by looking for association in previously identified linkage regions reported on the Alzgene website. Linkage regions for AD are genomic regions that tend to be co-inherited with AD in families. By definition, linkage regions include susceptibility genes that are co-transmitted with the disease. The regions currently identified from family studies of AD are large, however, since families contain relatively few transmissions. Further transmissions over multiple generations would provide finer-grained information about the location of susceptibility genes. Previous studies have fine-mapped a single linkage region through association of AD with genetic variants in densely genotyped or sequenced regions, or have confirmed linkage to AD in genomic regions identified from GWAS. In this report, we aim to fine-map multiple linkage regions for AD through multivariate association of their SNPs to the rates of atrophy in brain regions affected by AD.
We analyze data from two phases of the Alzheimer’s Disease Neuroimaging Initiative, ADNI-1 and ADNI-2, which are case-control studies of AD and mild cognitive impairment. The rates of atrophy in brain regions affected by AD are so-called endophenotypes: observable traits that reflect disease progression. By investigating the joint association between the genomic variants and the neuroimaging endophenotypes, we use the information about disease progression to supervise the selection of single-nucleotide polymorphisms (SNPs). This multivariate approach to analysis stands in contrast to the commonly used mass-univariate approach in which separate regressions are fit for each SNP, and the disease outcome is predicted by the minor allele counts. Simultaneous analysis of association is preferred because the reduced residual variation leads to (i) a clearer assessment of the signal from each SNP, (ii) increased power to detect signal, and (iii) a decreased false-positive rate. We also employ inverse probability weighting to account for the biased sampling design of the ADNI-1 and ADNI-2 studies, an aspect of analysis that has not been accounted for in many previous imaging genetics studies. Methods that explicitly account for gene structure have been proposed for analyzing the association between multiple imaging phenotypes and SNPs in candidate genes \citep[e.g.][]{Wang_2011, 1605.02234}. However, these methods become computationally intractable when analyzing data with tens of thousands of genotyped variants. To select SNPs associated with disease progression, we instead use sparse canonical correlation analysis (SCCA) to find a sparse linear combination of SNPs having maximal correlation with the imaging endophenotypes. Multiple penalty schemes have been proposed to implement the sparse estimation in SCCA.
We employ an SCCA implementation that estimates the sparse linear combinations by computing sparse approximations to the left singular vectors of the cross-correlation matrix of the SNP data and the neuroimaging endophenotype data. Sparsity is introduced through soft-thresholding of the coefficient estimates, which has been noted to be similar in implementation to a limiting form of the elastic-net. A drawback of ℓ₁-type penalties is that not all SNPs from an LD block of highly correlated SNPs that are associated with the outcome will be selected into the model. We prefer an elastic-net-like penalty over alternative implementations with ℓ₁ penalties because it allows selection of all potentially associated SNPs regardless of the linkage-disequilibrium (LD) structure in the data. We may think of SNP genotypes as a matrix X and imaging phenotypes as a matrix Y measured on the same n subjects. \citet*{Robert_1976} showed that estimating the maximum correlation between linear combinations of X and Y in canonical correlation analysis is equivalent to estimating the linear combinations having the maximum RV coefficient, a measure of linear association between the multivariate datasets. As the squared correlation coefficient between the first canonical variates, the RV coefficient is well-suited for testing linear association in our context. We use a permutation test based on the RV coefficient to assess the association between the initial list of SNPs in X and the phenotypes in Y. Although the RV coefficient may overestimate association when n ≪ p, a permutation test with the RV coefficient is preferred over a parametric hypothesis test since the permutation null distribution is computed under the same conditions as the observed RV coefficient, resulting in a valid hypothesis test. The outcome of this test is used to determine whether or not to proceed with a second refinement stage that reduces the number of SNPs by applying SCCA.
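The permutation test described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard trace-based form of the RV coefficient computed from column-centred data, and the function names (`rv_coefficient`, `rv_permutation_test`) and the number of permutations are hypothetical choices made for this example.

```python
import random

def center_columns(M):
    """Subtract each column mean so that every column of M sums to zero."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M]

def gram(M):
    """n x n matrix of inner products between the rows (subjects) of M."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in M] for r1 in M]

def frob_inner(A, B):
    """Sum of elementwise products; equals trace(A B) for symmetric A, B."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def rv_coefficient(X, Y):
    """RV coefficient between two datasets measured on the same subjects."""
    WX = gram(center_columns(X))
    WY = gram(center_columns(Y))
    return frob_inner(WX, WY) / (frob_inner(WX, WX) * frob_inner(WY, WY)) ** 0.5

def rv_permutation_test(X, Y, n_perm=999, seed=0):
    """Permute the subjects of Y to build a null distribution for the RV."""
    rng = random.Random(seed)
    WX = gram(center_columns(X))
    WY = gram(center_columns(Y))
    # The denominator is unchanged by relabelling subjects, so compute it once.
    denom = (frob_inner(WX, WX) * frob_inner(WY, WY)) ** 0.5
    observed = frob_inner(WX, WY) / denom
    n = len(X)
    idx = list(range(n))
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        permuted = sum(
            WX[i][j] * WY[idx[i]][idx[j]] for i in range(n) for j in range(n)
        ) / denom
        if permuted >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (exceed + 1) / (n_perm + 1)
```

Because the observed and permuted statistics are computed from the same n and p, any upward bias of the RV coefficient when n ≪ p affects both in the same way, which is the rationale given in the text for preferring the permutation test.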
Selection of the soft-thresholding parameter in SCCA is challenging in our context. Since the number of SNPs exceeds the sample size and many of the SNPs are expected to be unassociated with the phenotypes, large sample correlations can arise by chance. Indeed, the prescribed procedure of selecting the penalty parameter with highest predicted correlation across cross-validation test sets results in more than 98% of the SNPs remaining in the model. A prediction criterion for choosing the penalty term may contribute to the lack of variable selection, allowing redundant variables into the model. When the same tuning parameter is used for variable selection and shrinkage, redundant variables tend to be selected to compensate for overshrinkage of coefficient estimates and losses in predictive ability. In our case, there is effectively no variable selection and little insight is gained by allowing for sparsity in the solution. To circumvent the lack of variable selection from SCCA, we fix the tuning parameter to select about 10% of the SNPs and then use resampling to determine the relative importance of each SNP to the association with neuroimaging endophenotypes. Instead of using the prediction-optimal penalty term, we fixed the soft-thresholding parameter for the SNPs to achieve variable selection based on the rationale that no more than about 7,500 SNPs, or approximately 10%, are expected to be associated with the phenotypes. This choice is guided by prior experience in genetic association studies, where the majority of genetic variants have no effect on the phenotypes, or an effect that is indistinguishable from zero. The organization of the manuscript is as follows. The Materials and Methods section describes the ADNI data, the data processing procedures, and the methods applied for discovery, refinement, and validation. The Results section presents the results of the analyses.
The Discussion section notes challenges and successes of the analysis, including considerations for modelling continuous phenotype data under a case-control sampling design, and provides interpretation of the results.
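The idea of fixing the soft-thresholding parameter so that roughly 10% of the SNP coefficients survive, as described in the introduction above, can be illustrated with a small sketch. It shows only the thresholding step on a single coefficient vector, not a full SCCA fit; the helper names and the rule for picking the threshold from the sorted magnitudes are assumptions made for this example.

```python
def soft_threshold(values, lam):
    """Shrink each coefficient toward zero by lam; small ones become exactly 0."""
    out = []
    for v in values:
        if abs(v) > lam:
            out.append((abs(v) - lam) * (1.0 if v > 0 else -1.0))
        else:
            out.append(0.0)
    return out

def threshold_for_fraction(values, fraction):
    """Pick lam so that about `fraction` of the coefficients stay nonzero.

    With tied magnitudes, fewer coefficients than requested may survive.
    """
    k = int(round(fraction * len(values)))
    mags = sorted((abs(v) for v in values), reverse=True)
    if k <= 0:
        return mags[0]          # largest magnitude: every coefficient is zeroed
    if k >= len(mags):
        return 0.0              # no shrinkage: nothing is removed
    return mags[k]              # (k+1)-th largest: the k larger ones survive

# Hypothetical coefficient vector for ten SNPs.
coefs = [0.5, -0.1, 0.05, -0.9, 0.2, 0.0, 0.3, -0.02, 0.01, 0.07]
lam = threshold_for_fraction(coefs, 0.10)
sparse = soft_threshold(coefs, lam)
n_selected = sum(1 for c in sparse if c != 0.0)
```

In this toy vector only the coefficient with the largest magnitude remains nonzero, and it is shrunk by the threshold. Repeating the selection over resampled datasets, as the text describes, would then give a selection frequency for each SNP.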

Charith Bhagya Karunarathna

and 1 more

July 19, 2016

INTRODUCTION Most genetic association studies focus on common variants, but rare genetic variants can play major roles in influencing complex traits. The rare susceptibility variants identified through sequencing have potential to explain some of the ‘missing heritability’ of complex traits. However, standard methods to test for association with single genetic variants are underpowered for rare variants unless sample sizes are very large. The lack of power of single-variant approaches holds in fine-mapping as well as genome-wide association studies. In this report, we are concerned with fine-mapping a genomic region that has been sequenced in cases and controls to identify disease-risk loci. A number of methods have been developed to evaluate the disease association for both a single variant and multiple variants in a genomic region. Besides single-variant methods, we consider three broad classes of methods for analysing sequence data: pooled-variant, joint-modelling and tree-based methods. Pooled-variant methods evaluate the cumulative effects of multiple genetic variants in a genomic region. The score statistics from marginal models of the trait association with individual variants are collapsed into a single test statistic, either by combining the information for multiple variants into a single genetic score or by evaluating the distribution of the pooled score statistics of individual variants. Joint-modelling methods identify the joint effect of multiple genetic variants simultaneously. These methods can assess whether a variant carries any further information about the trait beyond what is explained by the other variants. When trait-influencing variants are in low linkage disequilibrium, this approach may be more powerful than pooling test statistics for marginal associations across variants. A local genealogical tree represents the ancestry of the sample of haplotypes at each locus in the genomic region being fine-mapped.
Haplotypes carrying the same disease risk alleles are expected to be related and cluster on the genealogical tree at a disease risk locus. Tree-based methods assess whether trait values co-cluster with the ancestral tree for the haplotypes. A method has been developed to reconstruct and score genealogies according to the case-control clusters. In practice true trees are unknown. However, cluster statistics based on true trees represent a best case for detecting association as tree uncertainty is eliminated. Burkett et al. use known trees to assess the effectiveness of such a tree-based approach for detection of rare, disease-risk variants in a candidate genomic region under various models of disease risk in a haploid population. They found that Mantel statistics computed on the known trees outperform popular methods for detecting rare variants associated with disease. Following Burkett et al., we use clustering tests based on true trees as benchmarks against which to compare the popular association methods. However, unlike Burkett et al., who focus on detection of disease risk variants, we here focus on localization of association signal in the candidate genomic region. Moreover, we use a diploid disease model instead of a haploid disease model. In this article, we compare the performance of selected rare-variant association methods for fine-mapping a disease locus. Our investigation focuses on localizing the association signal to the subregion between 950 kbp and 1050 kbp within a 2 Mbp candidate genomic region. To motivate our study, we use variant data simulated from coalescent trees. Our work on localization of association signal extends that of Burkett et al., which investigated the ability to detect association signal in the candidate region, without regard to localization. To illustrate ideas, we start by working through a particular example dataset as a case study for insight into selected association methods.
We next perform a simulation study involving 200 sequencing datasets and score which association method localizes best, overall. Our results indicate the potential of ancestral tree-based approaches for localizing the association signal.
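One simple way to score localization, given the risk subregion stated above, is to ask how far the strongest association signal lies from that subregion. The sketch below is an illustration under assumptions: the scoring rule (average distance of the peak positions from the 950 to 1050 kbp interval) and all function names are hypothetical and are not taken from the authors' simulation study.

```python
RISK_START_KBP = 950.0   # start of the risk subregion, from the text
RISK_END_KBP = 1050.0    # end of the risk subregion, from the text

def peak_positions(positions_kbp, scores):
    """Positions at which the association score attains its maximum."""
    best = max(scores)
    return [p for p, s in zip(positions_kbp, scores) if s == best]

def distance_to_risk_region(pos, start=RISK_START_KBP, end=RISK_END_KBP):
    """Distance in kbp from pos to the risk subregion; 0 if pos is inside it."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0.0

def localization_error(positions_kbp, scores):
    """Average distance of the peak signal from the risk subregion."""
    peaks = peak_positions(positions_kbp, scores)
    return sum(distance_to_risk_region(p) for p in peaks) / len(peaks)

# Hypothetical association profile across the candidate region.
positions = [100.0, 600.0, 980.0, 1500.0]
error_good = localization_error(positions, [0.3, 1.2, 4.5, 0.8])
error_tied = localization_error(positions, [5.0, 1.0, 1.0, 5.0])
```

Averaging this error over the simulated datasets would give one number per association method, with smaller values indicating better localization. Ties in the peak score are averaged here so that a method producing a flat profile is not rewarded.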

Bacon

and 2 more

March 17, 2016

Even when good coding practices are followed, arbitrary code execution bugs can still exist. By leveraging the pledge(2) system call and a static analysis framework, we attempt to mitigate these bugs by automatically inserting pledge calls. Although an algorithm was devised to do this, time limitations prevented its full implementation.

Charith Bhagya Karunarathna

and 1 more

October 09, 2015

INTRODUCTION
- Brief literature review
  - Most genetic association studies focus on common variants.
  - But rare genetic variants can play major roles in influencing complex traits.
  - The rare susceptibility variants identified through sequencing have potential to explain some of the ‘missing heritability’ of complex traits.
  - However, standard methods to test for association with single genetic variants are underpowered for rare variants unless sample sizes are very large.
  - The lack of power of single-variant approaches holds in fine-mapping as well as genome-wide association studies.
  - In this report, we are concerned with fine-mapping a genomic region that has been sequenced in cases and controls to identify disease-risk loci.
  - A number of methods have been developed to evaluate the disease association for both a single variant and multiple variants in a genomic region.
  - Besides single-variant methods, we consider three broad classes of methods for analysing sequence data: pooled-variant, joint-modelling and tree-based methods.
- Overview of 3 types of analysis methods (besides single-variant methods)
  - Pooled-variant methods evaluate the cumulative effects of multiple genetic variants in a genomic region. The score statistics from marginal models of the trait association with individual variants are collapsed into a single test statistic, either by combining the information for multiple variants into a single genetic score or by evaluating the distribution of the pooled score statistics of individual variants.
  - Joint-modelling methods identify the joint effect of multiple genetic variants simultaneously. These methods can assess whether a variant carries any further information about the trait beyond what is explained by the other variants. When trait-influencing variants are in low linkage disequilibrium, this approach may be more powerful than pooling test statistics for marginal associations across variants.
  - Tree-based methods.
    - A local genealogical tree represents the ancestry of the sample of haplotypes at each locus in the genomic region being fine-mapped.
    - Haplotypes carrying the same disease risk alleles are expected to be related and cluster on the genealogical tree at a disease risk locus.
    - Tree-based methods assess whether trait values co-cluster with the ancestral tree for the haplotypes.
    - A method has been developed to reconstruct and score genealogies according to the case-control clusters.
- Review Burkett et al. study briefly(!), what it found.
  - In practice true trees are unknown. However, cluster statistics based on true trees represent a best case for detecting association as tree uncertainty is eliminated.
  - Burkett et al. use known trees to assess the effectiveness of such a tree-based approach for detection of rare, disease-risk variants in a candidate genomic region under various models of disease risk in a haploid population.
  - They found that Mantel statistics computed on the known trees outperform popular methods for detecting rare variants associated with disease.
  - Following Burkett et al., we use clustering tests based on true trees as benchmarks against which to compare the popular association methods.
  - However, unlike Burkett et al., who focus on _detection_ of disease risk variants, we here focus on _localization_ of association signal in the candidate genomic region. Moreover, we use a diploid disease model instead of a haploid disease model.

Using gene genealogies to localize rare variants associated with complex traits in di...

Charith Bhagya Karunarathna

and 1 more

October 08, 2015

INTRODUCTION Most genetic association studies focus on common variants, but rare genetic variants can play major roles in influencing complex traits. The rare susceptibility variants identified through sequencing have potential to explain some of the ‘missing heritability’ of complex traits. However, for rare variants, standard methods to test for association with single genetic variants are underpowered unless sample sizes are very large. The lack of power of single-variant approaches holds in fine-mapping as well as genome-wide association studies. In this report, we are concerned with fine-mapping a genomic region that has been sequenced in cases and controls to identify disease-risk loci. Our work extends an earlier comparison of methods for _detecting_ disease association in cases and controls to a comparison of methods for _localizing_ the association signal. In the previous investigation, cases and controls were sampled from a haploid or one-parent population. However, in the current investigation, cases and controls are sampled from a diploid or two-parent population to mimic studies in human populations. A number of methods have been developed to evaluate the disease association for both a single variant and multiple variants in a genomic region. Besides single-variant methods, we consider three broad classes of methods for analysing sequence data: pooled-variant, joint-modelling and tree-based methods. Pooled-variant methods evaluate the cumulative effects of multiple genetic variants in a genomic region. The score statistics from marginal models of the trait association with individual variants are collapsed into a single test statistic by combining the information for multiple variants into a single genetic score. Joint-modelling methods model the joint effect of multiple genetic variants on the trait simultaneously.
These methods can assess whether a variant carries any further information about the trait beyond what is explained by the other variants. When trait-influencing variants are in low linkage disequilibrium, this approach may be more powerful than pooling test statistics for marginal associations across variants. Tree-based methods assess whether trait values co-cluster with the local genealogical tree for the haplotypes. A local genealogical tree represents the ancestry of the sample of haplotypes at each locus. Haplotypes carrying the same disease risk alleles are expected to be related and cluster on the genealogical tree at a disease-risk locus. A method has been developed to reconstruct and score local genealogies according to the case-control clusters. In practice true trees are unknown. However, clustering statistics based on true trees represent a best case for detecting association as tree uncertainty is eliminated. Burkett et al. used known trees to assess the effectiveness of such a tree-based approach for detection of disease-risk variants in a haploid population. They found that clustering statistics computed on the known trees outperform popular methods for detecting causal rare variants in a candidate genomic region. Following Burkett et al., we use Mantel tests as the clustering statistics based on true trees. These tree-based statistics, which rely on known trees, serve as benchmarks against which to compare the popular association methods. However, unlike Burkett et al., who focus on detection of disease-risk variants, we here focus on localization of association signal in the candidate genomic region. Moreover, we use a diploid disease model instead of a haploid disease model. In this report, we compare the performance of selected association methods for fine-mapping a disease locus in the middle of a larger, candidate, genomic region. In our simulation study, we use variant data generated under the coalescent model.
To illustrate ideas, we start by working through a particular example dataset as a case study for insight into the association methods. We next perform a simulation study involving 200 sequencing datasets and score which association method localizes best, overall. Our results indicate the potential of ancestral tree-based approaches for localizing the association signal.
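The Mantel test used as the tree-based clustering statistic can be sketched as a permutation test on two distance matrices. This is a generic version of the technique under stated assumptions: one matrix holds pairwise distances on a known genealogical tree, the other holds pairwise differences in case-control status, and each sampled unit is treated as a single haplotype, which sidesteps the pairing of haplotypes within diploid individuals. The function names and toy data are hypothetical.

```python
import random
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def mantel_statistic(D1, D2, order=None):
    """Correlation between the upper triangles of two distance matrices.

    `order` relabels the rows and columns of D2 jointly, which is how the
    permutation null distribution is generated.
    """
    n = len(D1)
    order = list(range(n)) if order is None else order
    pairs = list(combinations(range(n), 2))
    a = [D1[i][j] for i, j in pairs]
    b = [D2[order[i]][order[j]] for i, j in pairs]
    return pearson(a, b)

def mantel_test(D1, D2, n_perm=999, seed=0):
    """Observed Mantel statistic and a one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = mantel_statistic(D1, D2)
    order = list(range(len(D1)))
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        if mantel_statistic(D1, D2, order) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Toy example: six haplotypes in two tree clusters that match disease status.
status = [1, 1, 1, 0, 0, 0]
tree_dist = [[0 if i == j else (1 if status[i] == status[j] else 4)
              for j in range(6)] for i in range(6)]
trait_dist = [[abs(status[i] - status[j]) for j in range(6)] for i in range(6)]
stat, p_value = mantel_test(tree_dist, trait_dist, n_perm=199, seed=1)
```

In the toy data the tree clusters coincide with case-control status, so the observed statistic is 1. Computing the statistic at the tree for each locus across the region would give the association profile whose peak is then used for localization.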

Strong Lens Time Delay Challenge: I. Experimental Design

Greg Dobler

and 7 more

May 28, 2013

ABSTRACT: The time delays between point-like images in gravitational lens systems can be used to measure cosmological parameters as well as probe the dark matter (sub-)structure within the lens galaxy. The number of lenses with measured time delays is growing rapidly due to dedicated efforts. In the near future, the upcoming _Large Synoptic Survey Telescope_ (LSST) will monitor ∼10³ lens systems consisting of a foreground elliptical galaxy producing multiple images of a background quasar. In an effort to assess the present capabilities of the community to accurately measure the time delays in strong gravitational lens systems, and to provide input to dedicated monitoring campaigns and future LSST cosmology feasibility studies, we pose a “Time Delay Challenge” (TDC). The challenge is organized as a set of “ladders,” each containing a group of simulated datasets to be analyzed blindly by participating independent analysis teams. Each rung on a ladder consists of a set of realistic mock observed lensed quasar light curves, with the rungs’ datasets increasing in complexity and realism to incorporate a variety of anticipated physical and experimental effects. The initial challenge described here has two ladders, TDC0 and TDC1. TDC0 has a small number of datasets, and is designed to be used as a practice set by the participating teams as they set up their analysis pipelines. The non-mandatory deadline for completion of TDC0 will be December 1, 2013. The teams that perform sufficiently well on TDC0 will then be able to participate in the much more demanding TDC1. TDC1 will consist of 10³ lightcurves, a sample designed to provide the statistical power to make meaningful statements about the sub-percent accuracy that will be required to provide competitive Dark Energy constraints in the LSST era.
In this paper we describe the simulated datasets in general terms, lay out the structure of the challenge and define a minimal set of metrics that will be used to quantify the goodness-of-fit, efficiency, precision, and accuracy of the algorithms. The results for TDC1 from the participating teams will be presented in a companion paper to be submitted after the closing of TDC1, with all TDC1 participants as co-authors.
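The four quantities named in the abstract (efficiency, goodness-of-fit, precision, and accuracy) could be computed from a set of submitted delays along the following lines. The formulas in this sketch are assumptions, since the abstract does not state them: efficiency is taken as the fraction of light curves with a submitted delay, goodness-of-fit as a mean squared standardized residual, precision as the mean claimed fractional uncertainty, and accuracy as the mean fractional bias. The function name and the input values are hypothetical.

```python
def tdc_metrics(true_delays, est_delays, est_errors, n_total):
    """Summary metrics for submitted time-delay estimates (assumed forms).

    true_delays, est_delays, est_errors: values for the submitted systems only.
    n_total: number of light-curve pairs in the rung, submitted or not.
    """
    n_sub = len(est_delays)
    triples = list(zip(true_delays, est_delays, est_errors))
    efficiency = n_sub / n_total
    # Mean squared residual in units of the claimed uncertainty.
    chi2 = sum(((e - t) / s) ** 2 for t, e, s in triples) / n_sub
    # Mean claimed uncertainty relative to the size of the true delay.
    precision = sum(s / abs(t) for t, e, s in triples) / n_sub
    # Mean signed fractional error; values near zero indicate low bias.
    accuracy = sum((e - t) / t for t, e, s in triples) / n_sub
    return {"f": efficiency, "chi2": chi2, "P": precision, "A": accuracy}

# Hypothetical submission: delays (days) for 2 of 4 systems.
metrics = tdc_metrics(
    true_delays=[10.0, -20.0],
    est_delays=[11.0, -19.0],
    est_errors=[1.0, 2.0],
    n_total=4,
)
```

Dividing by the true delay keeps the precision and accuracy measures dimensionless, which is what allows a target such as sub-percent accuracy to be stated independently of the delay length.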