Biological Big Data Analytics: Challenges and Algorithms
We live in an era of big data. Voluminous datasets are generated and have to be processed in every area of science and engineering. This is especially true in biology. Efficient techniques are needed to process these data. In particular, we need tools to extract useful information from massive data sets. Society at large can benefit immensely from advances in this arena. For example, information extracted from biological data can result in gene identification, diagnosis for diseases, drug design, etc. Market-data information can be used for custom-designed catalogues for customers, supermarket shelving, and so on. Weather prediction and protecting the environment from pollution are possible with the analysis of atmospheric data.
In this talk we present some challenges existing in processing biological big data. We also provide an overview of some basic techniques. In particular, we will summarize various data processing and reduction techniques.
Sanguthevar Rajasekaran received his M.E. degree in Automation from the Indian Institute of Science (Bangalore) in 1983, and his Ph.D. degree in Computer Science from Harvard University in 1988. Currently he is the Board of Trustees Distinguished Professor, UTC Chair Professor of Computer Science and Engineering, and the Director of Booth Engineering Center for Advanced Technologies (BECAT) at the University of Connecticut. Before joining UConn, he served as a faculty member in the CISE Department of the University of Florida and in the CIS Department of the University of Pennsylvania. During 2000-2002 he was the Chief Scientist for Arcot Systems. His research interests include Big Data, Bioinformatics, Algorithms, Data Mining, Randomized Computing, and HPC. He has published over 350 research articles in journals and conferences. He has co-authored two texts on algorithms and co-edited six books on algorithms and related topics. His research has been supported by grants from agencies such as NSF, NIH, DARPA, and DHS (totaling $9M as the PI and an additional $9M as a co-PI). He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and the American Association for the Advancement of Science (AAAS). He is also an elected member of the Connecticut Academy of Science and Engineering.
Adventures with large biomedical datasets: diseases, medical records, environment and genetics
I will attempt to cover several interrelated analysis topics, spending more time on parts that resonate with the audience.
First, I will introduce our recent study analyzing phenotypic data harvested from over 150 million unique patients. Curiously, these non-genetic large-scale data can be used for genetic inferences. We discovered that complex diseases are associated with unique sets of rare Mendelian variants, referred to as the “Mendelian code.” We found that the genetic loci indicated by this code were enriched for common risk alleles. Moreover, we used probabilistic modeling to demonstrate for the first time that deleterious Mendelian variants likely contribute to complex disease risk in a non-additive fashion.
The second topic that I hope to cover is the analysis of apparent clusters of neurodevelopmental disorders. Disease clusters are defined as geographically compact areas where a particular disease, such as a cancer, shows a significantly increased rate. It is presently unclear how common such clusters are for neurodevelopmental maladies, such as autism spectrum disorders (ASD) and intellectual disability (ID). As in the first story, examining data for one third of the whole US population, we demonstrated that (1) ASD and ID manifest strong clustering across US counties; (2) counties with high ASD rates also appear to have high ID rates; and (3) the spatial variation of both phenotypes appears to be driven by environment and, to a lesser extent, by economic incentives at the state level.
The third topic is about using electronic medical record data to 1) estimate the heritability and familial environmental patterns of diseases, and 2) infer the genetic and environmental correlations between disease pairs from a set of complex diseases. I am particularly interested in inferring objective classifications of diseases (based on a formal optimization criterion), separately from environmental and genetic factors.
Andrey Rzhetsky is an Edna K. Papazian Professor of Medicine and Human Genetics at the University of Chicago. He is also a Pritzker Scholar and a Senior Fellow of both the Computation Institute and the Institute for Genomics and Systems Biology at the University of Chicago.
His research is focused on computational analysis of complex human phenotypes in the context of changes and perturbations of underlying molecular networks. The input data for these studies are supplied by large-scale mining of free text, computation over clinical records, and high-throughput systems biology experiments.
Going beyond Patterns: Deep Understanding of Biology with Machine Learning
A major goal in computational biology is the development of algorithms, analysis techniques, and tools towards deep mechanistic understanding of life at a molecular level. In the process, computational biology must take advantage of the new developments in artificial intelligence and machine learning, and then move beyond pattern analysis to provide testable hypotheses for experimental scientists. This talk will focus on our contributions to this process and the relevant related work. We will first discuss the development of machine learning techniques for partially observable domains such as molecular biology; in particular, methods for accurate estimation of the frequency of occurrence of hard-to-measure and rare events. We will then show how these methods play key roles in inferring protein function and the phenotypic effect of coding sequence variants, with an emphasis on understanding the molecular mechanisms of human genetic disease. We will assess the value of these methods in a wet lab where we tested the molecular mechanisms behind selected de novo mutations in a cohort of individuals with neurodevelopmental disorders. Finally, we will discuss implications for genome interpretation.
Predrag Radivojac is a Professor of Computer Science at Indiana University Bloomington. Prof. Radivojac received his Bachelor's and Master's degrees in Electrical Engineering from the University of Novi Sad and University of Belgrade, Serbia. His Ph.D. degree is in Computer Science from Temple University (2003) under the direction of Prof. Zoran Obradovic and co-direction of Prof. Keith Dunker. In 2004 he held a post-doctoral position in Keith Dunker's lab at Indiana University School of Medicine, after which he joined Indiana University Bloomington. Prof. Radivojac's research is in the areas of computational biology and machine learning with specific interests in protein function, MS/MS proteomics, genome interpretation, and precision health. He received a National Science Foundation (NSF) CAREER Award in 2007 and is an honorary member of the Institute for Advanced Study at Technical University of Munich. Prof. Radivojac's projects have been supported by NSF and National Institutes of Health (NIH). He is currently an Editorial Board member for the journal Bioinformatics, Associate Editor for PLoS Computational Biology, and serves on the Board of Directors of the International Society for Computational Biology (ISCB).
Towards Automated Deep Learning Model Construction and Its Applications in Computational Chemical Biology
In recent years, research in Artificial Neural Networks (ANNs) has resurged, now under the Deep-Learning umbrella, and grown extremely popular due to major breakthroughs in methodological and computing capabilities. Deep-Learning (DL) methods belong to the family of representation-learning algorithms that attempt to extract and organize discriminative information from the data. The recently reported success of DL techniques in crowd-sourced chemical biology data analysis and predictive toxicology competitions has showcased these methods as powerful tools for drug-discovery and toxicology research. Nevertheless, reported applications of DL techniques for modeling complex bioactivity data for small molecules remain limited.
In this talk I will present our recent work on optimizing the hyper-parameters of feed-forward Deep Neural Nets (DNNs) and on evaluating the performance of these methods against shallow methods. In our study, 48 DNN, 24 Random Forest, 20 SVM and 6 Naïve Bayes configurations, selected arbitrarily but reasonably, were compared on 7 diverse bioactivity datasets assembled from the ChEMBL repository, with circular fingerprints as molecular descriptors. Our results demonstrate that DNNs are powerful techniques for modeling complex bioactivity data. I will then talk about a project towards a collaborative environment that supports the automated construction, optimization, profiling, sharing, running, and reuse of deep (and shallow) machine learning models.
Dr. Jun (Luke) Huan is the Charles E. & Mary Jane Spahr Professor in the Department of Electrical Engineering and Computer Science at the University of Kansas. He directs the Data Science and Computational Life Sciences Laboratory at KU Information and Telecommunication Technology Center (ITTC).
Dr. Huan works on data science, machine learning, data mining, big data, and interdisciplinary topics including bioinformatics and health informatics. He has published more than 120 peer-reviewed papers in leading conferences and journals and has graduated more than ten graduate students including seven PhDs. Dr. Huan serves on the editorial boards of several international journals including the Springer Journal of Big Data, Elsevier Journal of Big Data Research, and the International Journal of Data Mining and Bioinformatics. He regularly serves on the program committees of top-tier international conferences on machine learning, data mining, big data, and bioinformatics.
Dr. Huan's research is recognized internationally. He was a recipient of the National Science Foundation Faculty Early Career Development Award in 2009. His group won the Best Student Paper Award at the IEEE International Conference on Data Mining in 2011 and the Best Paper Award (runner-up) at the ACM International Conference on Information and Knowledge Management in 2009. His work has appeared in mass media including Science Daily, R&D magazine, and EurekAlert (sponsored by AAAS). Dr. Huan's research has been supported by NSF, NIH, DoD, and the University of Kansas.
Starting January 2016, Dr. Huan serves as a Program Director in NSF at its Intelligent and Information Division in the Computer and Information Science and Engineering Directorate.
An Energy Landscape View of Protein Structure, Dynamics, and (Dys)Function
The energy landscape underscores the inherent nature of proteins as dynamic systems interconverting between structures with varying energies. Recently, our laboratory has developed a computational framework that feasibly reconstructs the energy landscape of any form of a protein of interest, thus allowing in silico investigation of the impact of pathogenic mutations on equilibrium structure and dynamics. The framework operates under the umbrella of stochastic optimization and leverages experimentally known, stable and semi-stable structural states of a protein’s variants to reconstruct the energy landscape of any variant of interest. The availability of landscapes of wildtype and diseased variants of a protein opens the way for data mining techniques to harness quantitative information embedded in landscapes. We share findings from a recent line of research in our laboratory that automatically extracts the hierarchical organization and structure of a molecular energy landscape and summarizes a landscape with geometric attributes. As we demonstrate on an enzyme central to human biology and health, mining landscapes allows categorizing variants and summarizing the mechanisms via which mutations alter dynamics and function. We share results on oncogenic and syndrome-causing variants of the human Ras enzyme. These results signal an exciting stage where machines can compute and mine landscapes to autonomously learn how mutations impact function and even elucidate the role of specific structural states and transitions of a protein variant in biological activities in the cell.
Dr. Amarda Shehu is an Associate Professor in the Department of Computer Science at George Mason University and is also affiliated with the School of Systems Biology and the Department of Bioengineering. Shehu received her B.S. in Computer Science and Mathematics from Clarkson University in Potsdam, NY in 2002 and her Ph.D. in Computer Science from Rice University in Houston, TX in 2008, where she was an NIH fellow of the Nanobiology Training Program of the Gulf Coast Consortia. Shehu’s research contributions are in computational structural biology, biophysics, and bioinformatics with a focus on issues concerning the relationship between sequence, structure, dynamics, and function in biological molecules. Her research is supported by various NSF programs, including Intelligent Information Systems, Computing Core Foundations, and Software Infrastructure. Shehu is also the recipient of an NSF CAREER award and two Jeffress Memorial Trust Awards. Shehu is an associate editor of IEEE Transactions in Computational Biology and Bioinformatics. She has served as program committee chair and general chair of the IEEE BIBM and ACM BCB conferences and is routinely a guest editor of special collections and issues in journals, such as PLoS Computational Biology, IEEE Transactions in Computational Biology and Bioinformatics, BMC Structural Biology, and J Computational Biology.
Radiomics – Beyond Imaging for Personalized and Precision Medicine
Radiomics refers to the high-throughput computation, analysis and selection of advanced quantitative imaging features from standard-of-care medical images acquired using, for instance, CT, PET or MRI. Indeed, the increasing adoption of electronic patient records as well as the widespread use of PACS have made available heterogeneous patient data, spanning different spatial and temporal scales, modalities, and functionalities. Radiomics is also evolving into radiogenomics, which looks for correlations between cancer imaging features and gene expression. On the basis of such image features and of medical and biological data, radiomics and radiogenomics are currently directed towards the development of personalized and precision medicine models that aim to provide valuable diagnostic, prognostic or predictive information.
Prof. Paolo Soda, PhD, is an Associate Professor in Computer Science at the Department of Engineering, University Campus Bio-Medico di Roma (UCBM), Italy. His research interests include pattern recognition, machine learning, big data analytics, and data mining applied to data, signal, 2D and 3D image and video processing and analysis. The practical impact of this research has been in biomedical applications, in particular computer-aided diagnosis and decision support systems. Prof. Paolo Soda has received six external grants from both government funding agencies and industry, totaling over 500 thousand euros in external funding. He has published over 80 refereed papers in international journals and conference proceedings, and is co-author of two international patents. Since June 2017 Paolo has served as chair of the IEEE Technical Committee on Computational Life Sciences (http://tccls.computer.org/). Since 2012, he has also served as associate editor of the proceedings of the annual international conference of the IEEE Engineering in Medicine & Biology Society, and since the same year he has been a member of the Steering Committee of the International Symposium on Computer-Based Medical Systems (CBMS). He was general co-chair of the 25th and 29th CBMS editions in 2012 and 2016, respectively. In the last few years, Paolo Soda has also served as program and special tracks co-chair. From 2009 to 2012 he co-organized at CBMS special tracks on knowledge discovery and decision systems in biomedicine, and in 2012 he co-organized a contest on bioimage classification at the 21st International Conference on Pattern Recognition. He also currently serves as a member of the program committee of several conferences. He was guest editor of Pattern Recognition (vol. 47(7), 2014) and Artificial Intelligence in Medicine (vol. 50(1), 2010).
Prof. Paolo Soda received his Master’s diploma and PhD in biomedical engineering from UCBM in 2004 and 2008, respectively, co-founding with his supervisor, Prof. Giulio Iannello, the Unit of Computer Systems and Bioinformatics. He continued as a postdoctoral researcher in 2009 at the Department of Engineering, UCBM, and as an assistant professor from 2010 to 2014 at the Department of Medicine, UCBM. In 2013 and 2015 he taught a digital imaging class at the Technical Medical Superior School of Locarno, Switzerland; in 2014 he taught a machine learning class at the faculty of Computer Science, Henan University, China; and in 2009 and 2012 he received European training grants to carry out scientific and teaching activities on machine learning and computer vision at the Polytech'Nice, Université de Nice-Sophia Antipolis, France, and at the Eindhoven University of Technology, The Netherlands.
ChIP-Seq Data Completion and Transcription Factors Binding Analyses
Transcription factors (TFs), as the key regulatory elements of gene transcription, can activate or suppress transcription by binding to specific sets of DNA sequences. The introduction of ChIP-seq sequencing technologies has provided immense opportunities for precise categorization of TF binding sites. In this talk, we will introduce several novel computational models for integrative analysis of the accumulated ChIP-seq data. Firstly, due to cost, time or sample material availability, it is not always possible for researchers to obtain ChIP-seq data for every TF in every sample of interest, which considerably limits the power of integrative studies. To tackle this problem, we propose Local Sensitive Unified Embedding (LSUE) for imputing new ChIP-seq datasets. Secondly, we construct gene regulatory networks in 13 human tissues by integrating large-scale TF-gene regulations with gene and protein expression data. By comparing these regulatory networks, we found many tissue-specific regulations that are important for tissue identity. In particular, the tissue-specific TFs are found to regulate more genes than those expressed in multiple tissues, and the processes regulated by these tissue-specific TFs are closely related to tissue functions. Therefore, recognizing tissue-specific regulatory networks can help us better understand the molecular mechanisms underlying diseases and identify new disease genes.
De-Shuang Huang is Chaired Professor in the Department of Computer Science and Director of the Institute of Machine Learning and Systems Biology at Tongji University, China. He received his M.S. and Ph.D. in electronic engineering from National Defense University of Science and Technology and Xidian University, China, in 1989 and 1993, respectively. He was a recipient of the “Hundred Talents Program of Chinese Academy of Sciences” (2000). He was also a visiting professor at the George Washington University, Washington DC, USA (2003), Queen’s University of Belfast, UK (2006) and Inha University, Korea (2007, 2008 & 2009). Currently, he is a visiting professor at Liverpool John Moores University, UK. His main research interests include neural networks, pattern recognition and bioinformatics.
De-Shuang Huang is currently a Fellow of the International Association for Pattern Recognition (IAPR Fellow), a member of the Board of Governors of the International Neural Network Society (INNS), a Senior Member of the IEEE and of INNS, a member of the Bioinformatics and Bioengineering Technical Committee and the Neural Networks Technical Committee of IEEE CIS, Co-Chair of the Big Data Analytics section within INNS, and an associate editor of several mainstream international journals such as Neural Networks. He founded the International Conference on Intelligent Computing (ICIC) in 2005. ICIC has since been held annually with him serving as General or Steering Committee Chair. He also served as General Chair of the 2015 International Joint Conference on Neural Networks (IJCNN 2015), July 12-17, 2015, Killarney, Ireland; Program Committee Chair of the 2014 11th IEEE Computational Intelligence in Bioinformatics and Computational Biology Conference (IEEE-CIBCBC), May 21-24, 2014, Honolulu, USA; Technical Committee Co-Chair of the 2014 IEEE World Congress on Computational Intelligence-International Joint Conference on Neural Networks, July 6-11, 2014, Beijing, China; and Asia Liaison of the 2013 International Joint Conference on Neural Networks, August 4-9, 2013, Dallas, TX, USA.
He has published over 360 papers in international journals, international conference proceedings, and book chapters, including over 160 SCI-indexed papers. He has also published three monographs (in Chinese), one of which, entitled “Systematic Theory of Neural Networks for Pattern Recognition”, won the Second-Class Prize of the 8th Excellent High Technology Books of China in 1997.
Differential Privacy Preserving Deep Learning in Healthcare
The remarkable development of deep learning in the healthcare domain presents obvious privacy issues when deep neural networks are built on users’ personal and highly sensitive data, e.g., clinical records, user profiles, and biomedical images. In this talk, we concentrate on recent research on differential privacy preserving deep learning. Differential privacy ensures that the adversary cannot infer any information about any particular record with high confidence (controlled by a privacy budget) from the released learning models. In the first part of this talk, we introduce the concept of differential privacy and present several mechanisms, including the Laplace mechanism, the exponential mechanism, input perturbation, and functional perturbation, that have been developed to enforce differential privacy in data mining and machine learning models. In the second part of this talk, we discuss how to apply and adapt those mechanisms to preserve differential privacy in deep learning models. In particular, we discuss how to achieve differential privacy by injecting noise into the input data, the parameter gradients, or the loss functions of deep learning models. Finally, we present challenges and findings when applying differential privacy preserving deep learning models to human behavior prediction and classification tasks in a health social network.
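As a rough illustration of two of the ideas named in this abstract, the sketch below shows the Laplace mechanism applied to a counting query and a simple form of gradient perturbation (per-example clipping followed by Gaussian noise, the recipe commonly called DP-SGD). It is a minimal sketch under stated assumptions, not the speaker's implementation: the function names, the toy data, and the parameter values are all hypothetical, and no privacy accounting across training steps is attempted.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Laplace mechanism for a counting query (hypothetical helper).

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def noisy_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Gradient perturbation sketch: clip each gradient, sum, add noise, average.

    Each per-example gradient is rescaled so its L2 norm is at most clip_norm,
    then Gaussian noise with standard deviation noise_multiplier * clip_norm
    is added to every coordinate of the sum.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            total[i] += x * factor
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    ages = [34, 71, 68, 25, 80, 59]  # toy data, not from any real dataset
    print(private_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng))
    print(noisy_mean_gradient([[3.0, 4.0], [0.3, 0.4]], 1.0, 1.1, rng))
```

A smaller epsilon (a tighter privacy budget) gives a larger noise scale and therefore a less accurate released count; in the gradient sketch, clip_norm bounds how much any single record can influence an update before noise is added.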
Dr. Xintao Wu is a professor and the Charles D. Morgan/Acxiom Endowed Graduate Research Chair in Database, and leads the Social Awareness and Intelligent Learning (SAIL) Lab in the Computer Science and Computer Engineering Department at the University of Arkansas. He was a faculty member in the College of Computing and Informatics at the University of North Carolina at Charlotte from 2001 to 2014. Dr. Wu's major research interests include data mining, privacy and security, fairness aware learning, and big data analysis. His recent research work has been to develop 1) privacy preserving techniques for mining tabular data, social network data, healthcare data, and GWAS data; 2) spectral analysis based fraud detection techniques in social networks; and 3) causal network based discrimination detection and prevention in training data and prediction models. Dr. Wu has published over 100 scholarly papers. He and his students have received several awards including a PAKDD'09 Best Student Paper Runner-up Award, WISE'12 Challenge Runner-up Award, PAKDD'13 Best Application Paper Award, and BIBM'13 Best Paper Award. Dr. Wu has served on the editorial boards of several international journals and has frequently served on the program committees of top international conferences, including ACM KDD, CIKM, IEEE ICDM, BIBM, SIAM SDM, PKDD, and PAKDD. Dr. Wu is a recipient of an NSF CAREER Award (2006), an Excellence in Undergraduate Teaching Award (2005) and an Outstanding Faculty Research Award (2009) from the College of Computing and Informatics at UNC Charlotte, and an Outstanding Researcher Award from the Computer Science and Computer Engineering Department at the University of Arkansas.