COMPUTATIONAL AND DATA SCIENCES

By enabling virtual experiments and computer simulations, scientific computing has become the third pillar of scientific investigation, alongside theory and experiment, and is central to innovation in most domains of our lives. It underpins the majority of today’s technological, economic and societal feats. We have entered an era in which soaring amounts of data offer enormous opportunities, but only to those who are able to harness them. We are standing at a turning point where the economic success of a nation is determined by its ability to exploit the vast amounts of information that are generated daily. The upcoming challenge is to harvest the fruits borne by the computational sciences in research fields that have not yet benefited from their full potential, e.g. biology, health and the social and behavioural sciences. Our strategy to achieve this is to leverage the mathematical, procedural and algorithmic commonality between apparently disparate research fields.

DATA-DRIVEN DISCOVERY

Understanding a world that grows more complex and generates data at an increasing rate relies on the ability to construct robust and reliable models. The integration of hypothesis-driven and data-driven science, supported by advanced technology, is considered to be the future of scientific discovery and knowledge generation: it enables automatic process prediction and optimisation, and improves decision-making at all levels. Many expect data-driven discovery to become the fourth pillar of research and the main driver for overcoming the limitations of traditional modelling approaches. The data-driven modelling approach is closely related to the concept of Big Data, which is marked by a recent shift of paradigms: from model calibration to model identification, and from low data complexity to increased data volume, data velocity, data variety, data veracity and ultimately data value. We observe a conceptual shift from ‘model by data abstraction’ to ‘data is the model’, together with an increasing size of the parameter spaces involved.

DATA CHALLENGES

In addition to the complexity brought forth by the size of the parameter space, uncertainty, volatility and ambiguity are factors that can also be addressed using data-driven science. Discovery through data requires integrated data mining, data exploration (interrogation and association), predictive modelling, sensitivity quantification and the incorporation of feedback from ‘new and/or better’ data, operating in a closed chain of descriptive, diagnostic, predictive and even prescriptive analytics. These innovative research perspectives are challenged by a number of aspects: (1) the presence of ‘Big Data’, characterised not only by size metrics such as high data volume, large data rate, data variety and dimensionality, but also by the inability of existing computational approaches to process the data because of its properties; (2) the science of data, which refers to the (subject-specific) evaluation of data quality (imperfect, incomplete, low-density), the structure of data (combination and types of data), and content interpretation and characterisation through models and theory; (3) the data science dimension, which deals with the requirements on computing infrastructure and, even more importantly, the scalability and modularity of computational algorithms for the continuous mining, assimilation and visualisation of heterogeneous and conceptually drifting data, including questions of data privacy, security and hierarchy.

METHODS

To address these issues, Machine Learning and Statistical Analysis are fundamental. Depending on whether a system’s outputs are available, Supervised Learning (mapping input to output) and Unsupervised Learning (discovering the structure of input data) are two ways forward. Classification automatically identifies the category to which an object belongs; methods such as Support Vector Machines and Random Forests can be used for image and pattern recognition or spam detection. Clustering automatically groups similar objects into sets using k-Means, Mean-Shift or Spectral Clustering, and thereby helps with patient cohort segmentation or the grouping of experimental results. Regression algorithms such as Lasso, Ridge Regression or Gaussian Processes predict continuous-valued attributes associated with an object or person. Model Selection methods improve predictive models by enabling the comparison, validation and selection of models and parameters using Grid Search or Cross-Validation. Data Pre-Processing techniques transform input data or extract features for use in specific Machine Learning algorithms. Dimensionality Reduction methods help to unveil characteristic information hidden in large and/or high-dimensional data sets via Principal Component Analysis, Manifold Learning or Feature Selection. Deep Learning describes hierarchical learning of data representations that capture various levels of abstraction.
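
As an illustration of the Unsupervised Learning setting described above, the following is a minimal sketch of k-Means clustering (Lloyd's algorithm) using only the Python standard library. The function name `k_means`, its parameters and the toy data are assumptions made for this example and do not come from any particular project; in practice an established Machine Learning library would normally be used instead.

```python
import random
from math import dist


def k_means(points, k, n_iter=100, seed=0):
    """Group points into k clusters by alternating assignment and update steps."""
    rng = random.Random(seed)
    # Initialise the centroids from k randomly chosen input points.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups of 2-D measurements.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

The first three points should receive one label and the last three the other; which group is numbered 0 depends on the random initialisation. No outputs (labels) are supplied to the algorithm, which is what distinguishes this setting from Supervised Learning.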