# Statistical Learning Workshop 2020

Machine learning can be loosely defined as the set of computational methods that use experience to improve performance or make accurate predictions. With the revival of neural networks and the advent of deep learning, machine learning research now focuses on the development of ever more complex models, requiring very large training sets and huge amounts of computational power. Understanding the properties of these models and analyzing them is, at least for the moment, receiving less attention in machine learning research. To fill this gap, one possibility is to rely on one of the pillars of machine learning: statistics.

Many machine learning algorithms rely on statistical models, such as ridge and lasso regression. The statistics community has derived precise theoretical results for these models, including their asymptotic properties and the construction of confidence intervals for parameters. These results enable model interpretation, robust inference and causal statements. Such results are mostly still missing in the machine learning world, because the models that are used there are inherently complex.

In this workshop we bring together the research communities of statistics and machine learning to foster a discussion between the two fields and develop research synergies. The workshop takes place online for the whole day of 18 September. The workshop will feature talks by:

In addition, a number of PhD students will present their work in talks. The detailed program can be found below. You may register for the workshop by sending an e-mail with the subject 'Registration' to sl.ws.geneva@gmail.com (please check your spam folder in case you haven't received a reply). Please note that participation is free of charge, but registration is mandatory.

The organizers,

Prof. Sebastian Engelke, UNIGE, GSEM

Prof. Alexandros Kalousis, HES-SO

Prof. Davide La Vecchia, UNIGE, GSEM

The Workshop's program (video link for each presenter's talk)

---------------------

8:40-8:45 Organizers

8:45-8:55 Prof. Antoine Geissbuhler, Vice-rector of the University of Geneva

--------------------

9:00-10:40

• Prof. Olivier Scaillet, UNIGE, GSEM, Statistical analysis of network data: learning from small samples (video)
• Prof. Francois Fleuret, UNIGE, EPFL, Fast attention models (video)
• Mr. Alban Moor, UNIGE, GSEM, A Higher-Order Correct Fast Moving-Average Bootstrap for Dependent Data (video)
• Mrs. Frantzeska Lavda, PhD student UNIGE, HESGE, Data-dependent conditional priors for unsupervised learning of multimodal data (video)

--------------------

10:40-11:15 Break

--------------------

11:15-12:35

• Prof. Sylvain Sardy, UNIGE, What needles do sparse neural networks find in nonlinear haystacks (video)
• Prof. Simon Scheidegger, UNIL, HEC, Gaussian Process methods for Asset Pricing and Dynamic Portfolio Choice (video)
• Mr. Lionel Blonde, PhD student UNIGE, HESGE, Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning (video)

--------------------

12:35-14:15 Lunch break

--------------------

14:15-16:00

• Prof. Giacomo De Giorgi, UNIGE, GSEM, Predict Mortality from Credit Reports
• Mr. Edoardo Vignotto, PhD student UNIGE, GSEM, Towards dynamical adjustment of the full temperature distribution (video)
• Prof. Dimitri van De Ville, UNIGE, EPFL, Community-Aware Graph Signal Processing (video 1, video 2)
• Mr. Cesare Miglioli, PhD student UNIGE, GSEM, SWAG: A Wrapper Method for Sparse Learning (video)

--------------------

16:00-16:30 Break

--------------------

16:30-17:45

• Mrs. Amina Mollaysa, UNIGE, HESGE, Goal-directed Generation of Discrete Structures with Conditional Generative Models (video)
• Mr. Nicola Gnecco, UNIGE, GSEM, Causal discovery in heavy-tailed models (video)
• Dr Kostantinos Sechidis, Novartis, From feature selection to predictive biomarker discovery

--------------------

17:45-17:50 Concluding remarks

Abstracts

Prof. Olivier Scaillet, Statistical analysis of network data: learning from small samples

For small-sample network (spatial panel) data, we illustrate that first-order asymptotic theory suffers from finite-sample distortions. We develop saddlepoint techniques which perform well in small samples and feature higher-order accuracy. We deploy the new higher-order asymptotic techniques for the Gaussian maximum likelihood estimator in a spatial panel data model, with fixed effects, time-varying covariates, and spatially correlated errors. Our saddlepoint density and tail area approximations feature a relative error of order O(m^(-1)) for m = n(T-1), with n being the cross-sectional dimension and T the time-series dimension. The main theoretical tool is the tilted-Edgeworth technique in a non-identically distributed setting. The density approximation is always non-negative, does not need resampling, and is accurate in the tails. We compare the saddlepoint density approximation to the Edgeworth expansion and to the asymptotic approximation. An empirical application to the investment-saving relationship in OECD (Organisation for Economic Co-operation and Development) countries shows disagreement between testing results based on first-order asymptotics and saddlepoint techniques.

This talk is based on the working paper "Saddlepoint approximations for spatial panel data models" by Chaonan Jiang, Davide La Vecchia, Elvezio Ronchetti, and Olivier Scaillet

Prof. Francois Fleuret, Fast attention models

Deep neural networks based on attention mechanisms have become the standard models for natural language processing, and they also demonstrate very promising results in computer vision and other application domains. Their impressive performance, however, comes at an extremely high computational cost, due to the very structure of the attention layers.

In this talk I will give a rapid introduction to attention mechanisms and transformer architectures, and present our recent contributions that aim at reducing their computational cost.

Mr. Alban Moor, A Higher-Order Correct Fast Moving-Average Bootstrap for Dependent Data

We develop the theory of a novel fast bootstrap for dependent data. Our scheme is based on the i.i.d. resampling of the smoothed moment indicators. We characterize the class of parametric and semiparametric estimation problems for which the method is valid. We show the asymptotic refinements of the proposed procedure, proving that it is higher-order correct under mild assumptions on the time series, the estimating functions, and the smoothing kernel. We illustrate the applicability and the advantages of our procedure for M-estimators, the generalized method of moments and generalized empirical likelihood estimation. In a Monte Carlo study we consider an autoregressive conditional duration model and we compare our method with other existing, routinely applied first- and higher-order correct methods. The results provide numerical evidence that the novel bootstrap yields higher-order accurate confidence intervals, while remaining computationally lighter than its higher-order competitors. A real-data example on the dynamics of the trading volume of US stocks illustrates the applicability of our method.

This is joint work with D. La Vecchia and O. Scaillet

Mrs. Frantzeska Lavda, Data-dependent conditional priors for unsupervised learning of multimodal data

One of the major shortcomings of variational autoencoders is the inability to produce generations from the individual modalities of data originating from mixture distributions. This is primarily due to the use of a simple isotropic Gaussian as the prior for the latent code in the ancestral sampling procedure for data generations. In this paper, we propose a novel formulation of variational autoencoders, conditional prior VAE (CP-VAE), with a two-level generative process for the observed data where continuous and discrete variables are introduced in addition to the observed variables. By learning data-dependent conditional priors, the new variational objective naturally encourages a better match between the posterior and prior conditionals, and the learning of the latent categories encoding the major source of variation of the original data in an unsupervised manner. Through sampling continuous latent code from the data-dependent conditional priors, we are able to generate new samples from the individual mixture components corresponding to the multimodal structure of the original data. Moreover, we unify and analyse our objective under different independence assumptions for the joint distribution of the continuous and discrete latent variables. We provide an empirical evaluation on one synthetic dataset and three image datasets, FashionMNIST, MNIST, and Omniglot, illustrating the generative performance of our new model compared to multiple baselines.

This is joint work with Magda Gregorova and Alexandros Kalousis

Prof. Sylvain Sardy, What needles do sparse neural networks find in nonlinear haystacks

Inspired by LASSO, we regularize with an L1-penalty the estimation of an artificial neural network (ANN). This has the advantage of performing variable selection (e.g., gene selection or feature selection in an image) and avoiding over-fitting. The selection of the regularization parameter is done with the Quantile Universal Threshold method. This method requires no estimation of the noise variance, no expensive cross-validation and can retrieve the needles and the sparsity structure of the ANN in certain regimes.
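The variable-selection effect of an L1 penalty mentioned in this abstract can be illustrated with soft-thresholding, the closed-form solution of a one-dimensional L1-penalized least-squares problem. The sketch below is a generic illustration and not the speaker's method: the function name `soft_threshold`, the coefficient values and the threshold 0.5 are assumptions chosen for the example, and the Quantile Universal Threshold rule for picking the penalty is not implemented.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w) over w, for lam >= 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients; small ones are set exactly to zero.
coefficients = [2.0, -0.3, 0.1, -1.5]
sparse = [soft_threshold(c, 0.5) for c in coefficients]

# Indices of the coefficients that survive the penalty (the selected variables).
selected = [i for i, w in enumerate(sparse) if w != 0.0]
```

Coefficients whose magnitude is below the threshold are mapped to exactly zero, which is why an L1 penalty selects variables rather than merely shrinking them.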

Prof. Simon Scheidegger, Gaussian Process methods for Asset Pricing and Dynamic Portfolio Choice

In this paper, we consider the portfolio optimization problem for a multiperiod investor who seeks to maximize her utility facing multiple risky assets and proportional transaction costs in the presence of return predictability. Due to the curse of dimensionality, this problem is challenging to solve, even numerically. To this end, we propose to embed Gaussian Process regression, in combination with the active subspace method and Bayesian active learning, inside a parallelized dynamic programming algorithm. Preliminary results will show that with this generic setup, we push the boundary of the current state of the art in the literature along several dimensions. The said combination of tools allows us to study important open problems in this literature, including (i) the characterization of no-trade regions (potentially volume and welfare implications in economies with several assets), and (ii) the optimal portfolio behavior in economies with a stochastic opportunity set or stochastic frictions.

Joint work with Fabio Trojani.

Mr. Lionel Blonde, Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning

Despite the recent success of reinforcement learning in various domains, these approaches remain, for the most part, deterringly sensitive to hyper-parameters and are often riddled with essential engineering feats allowing their success. We consider the case of off-policy generative adversarial imitation learning, and perform an in-depth review, qualitative and quantitative, of the method. Crucially, we show that forcing the learned reward function to be locally Lipschitz-continuous is a sine qua non condition for the method to perform well. We then study the effects of this necessary condition and provide several theoretical results involving the local Lipschitzness of the state-value function. Finally, we propose a novel reward-modulation technique inspired by a new interpretation of gradient-penalty regularization in reinforcement learning. Besides being extremely easy to implement and bringing little to no overhead, we show that our method provides improvements in several continuous control environments of the MuJoCo suite.

Joint work with Pablo Strasser and Alexandros Kalousis

Prof. Giacomo De Giorgi, Predict Mortality from Credit Reports

Data on hundreds of variables related to individual consumer finance behavior (such as credit card and loan activity) is routinely collected in many countries and plays an important role in lending decisions. We postulate that the detailed nature of this data may be used to predict outcomes in unrelated domains such as individual health. We build a series of machine learning models to demonstrate that credit report data can be used to predict individual mortality. Variable groups related to credit cards and various loans, mostly unsecured loans, are shown to carry significant predictive power. Lags of these variables are also significant, indicating that dynamics matter as well. Improved mortality predictions based on consumer finance data can have important economic implications in insurance markets but may also raise privacy concerns.

Mr. Edoardo Vignotto, Towards dynamical adjustment of the full temperature distribution

Internal variability due to atmospheric circulation can dominate the thermodynamical signal present in the climate system for small spatial or short temporal scales, thus fundamentally limiting the detectability of forced climate signals. Dynamical adjustment techniques aim to enhance the signal-to-noise ratio of trends in climate variables such as temperature by removing the influence of atmospheric circulation variability. Forced thermodynamical signals unrelated to circulation variability are then thought to remain in the residuals, allowing a more accurate quantification of changes even at the regional or decadal scale. The majority of these methods focus on the averages of climate variables, thus discounting important distributional features. Here we propose a machine learning dynamical adjustment method for the full temperature distribution that recognizes the stochastic nature of the relationship between the dynamical and thermodynamical components. Furthermore, we illustrate how this method enables evaluating how specific events would have unfolded in a different, counterfactual climate from a few decades ago, thereby characterizing the emergent effect of climatic changes over decadal time scales. We apply our method to observational data over Europe and over the last 70 years.

This is joint work with S. Sippel, F. Lehner, and E. M. Fischer.

Dr Kostantinos Sechidis, From feature selection to predictive biomarker discovery

One of the key challenges of personalised medicine is to identify which patients will respond positively to a given treatment. The area of subgroup identification focuses on this challenge, i.e. identifying groups of patients that experienced an enhanced treatment effect even if the study failed to show an effect in the overall population. A crucial first step towards subgroup identification is to identify the variables (e.g. biomarkers) that modify the treatment effect, known as predictive biomarkers. In this talk we will connect the problem of predictive biomarker discovery with the problem of supervised feature selection, which occurs when we observe a response variable together with a large number of features, and we would like to know which variables are truly associated with the response. Furthermore, we will review a recent method for performing feature selection while controlling the false discovery rate (FDR) - the expected fraction of variables falsely selected among all discoveries. Finally, we will provide some insights on how to perform controlled predictive biomarker discovery.
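The FDR definition given in this abstract can be made concrete with a small sketch. It computes the false discovery proportion of a single selection and averages it over repeated selections to approximate the expectation. This is only an illustration of the quantity being controlled, not the selection method reviewed in the talk; the function names, the biomarker labels and the selection outcomes are all hypothetical.

```python
def false_discovery_proportion(selected, true_features):
    """Fraction of selected features that are not truly associated."""
    selected = set(selected)
    if not selected:
        return 0.0  # convention: no discoveries means no false discoveries
    false_discoveries = selected - set(true_features)
    return len(false_discoveries) / len(selected)


def empirical_fdr(selections, true_features):
    """Average proportion over repeated selections, approximating the FDR."""
    fdps = [false_discovery_proportion(s, true_features) for s in selections]
    return sum(fdps) / len(fdps)


# Hypothetical biomarker names and selection outcomes, for illustration only.
truth = {"bm1", "bm2"}
runs = [{"bm1", "bm2"}, {"bm1", "bm7"}, set(), {"bm9"}]
fdr_estimate = empirical_fdr(runs, truth)
```

In practice the set of truly associated variables is unknown, so a calculation like this is only possible in simulation studies where the truth is fixed by design.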

Mr. Cesare Miglioli, SWAG: A Wrapper Method for Sparse Learning

Predictive power has always been the main research focus of learning algorithms. While the general approach for these algorithms is to consider all possible attributes in a dataset to best predict the response of interest, an important branch of research is focused on sparse learning. Indeed, in many practical settings we believe that only an extremely small combination of different attributes affects the response. However, even sparse-learning methods can still preserve a high number of attributes in high-dimensional settings and possibly deliver inconsistent prediction performance. The latter methods can also be hard to interpret for researchers and practitioners, a problem which is even more relevant for the "black-box"-type mechanisms of many learning approaches. Finally, there is often a problem of replicability since not all data-collection procedures measure (or observe) the same attributes and therefore cannot make use of proposed learners for testing purposes. To address all the previous issues, we propose to study a procedure that combines screening and wrapper methods and aims to find a library of extremely low-dimensional attribute combinations (with consequent low data collection and storage costs) in order to (i) match or improve the predictive performance of any particular learning method which uses all attributes as an input (including sparse learners); (ii) provide a low-dimensional network of attributes easily interpretable by researchers and practitioners; and (iii) increase the potential replicability of results due to a diversity of attribute combinations defining strong learners with equivalent predictive power. We call this algorithm "Sparse Wrapper AlGorithm" (SWAG).

Joint work with R. Molinari, G. Bakalli, S. Guerrier, S. Orso and O. Scaillet

Mrs. Amina Mollaysa, Goal-directed Generation of Discrete Structures with Conditional Generative Models

Despite recent advances, goal-directed generation of structured discrete data remains challenging. For problems such as program synthesis (generating source code) and materials design (generating molecules), finding examples which satisfy desired constraints or exhibit desired properties is difficult. In practice, expensive heuristic search or reinforcement learning algorithms are often employed.

We investigate the use of conditional generative models which directly attack this inverse problem, by modeling the distribution of discrete structures given properties of interest. Unfortunately, maximum likelihood training of such models often fails, with the samples from the generative model inadequately respecting the input properties. To address this, we introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward. We avoid high-variance score-function estimators that would otherwise be required by sampling from an approximation to the normalized rewards, allowing simple Monte Carlo estimation of model gradients. We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value. In both cases, we find improvements over maximum likelihood estimation and other baselines.

Joint work with Brooks Paige and Alexandros Kalousis

Mr. Nicola Gnecco, Causal discovery in heavy-tailed models

Causal questions are omnipresent in many scientific problems. While much progress has been made in the analysis of causal relationships between random variables, these methods are not well suited if the causal mechanisms manifest themselves only in extremes. This work aims to connect the two fields of causal inference and extreme value theory. We define the causal tail coefficient that captures asymmetries in the extremal dependence of two random variables. In the population case, the causal tail coefficient is shown to reveal the causal structure if the distribution follows a linear structural causal model. This holds even in the presence of latent common causes that have the same tail index as the observed variables. Based on a consistent estimator of the causal tail coefficient, we propose a computationally highly efficient algorithm that infers causal structure from finitely many data. We prove that our method consistently recovers the causal order and compare it to other well-established and non-extremal approaches in causal discovery on synthetic and real data. The code is available as an open-access R package on GitHub.

This is joint work with N. Meinshausen, J. Peters, and S. Engelke.

Prof. Dimitri van De Ville, Community-Aware Graph Signal Processing

The emerging field of graph signal processing (GSP) makes it possible to transpose classical signal processing operations (e.g., filtering) to signals on graphs. The GSP framework is generally built upon the graph Laplacian, which plays a crucial role in studying graph properties and measuring graph signal smoothness. Here instead, we propose the graph modularity matrix as the centerpiece of GSP, in order to incorporate knowledge about graph community structure when processing signals on the graph, but without the need for community detection. We study this approach in several generic settings such as filtering, optimal sampling and reconstruction, surrogate data generation, and denoising. Feasibility is illustrated by a small-scale example and a transportation network dataset, as well as one application in human neuroimaging where community-aware GSP reveals relationships between behavior and brain features that are not shown by Laplacian-based GSP. This work demonstrates how concepts from network science can lead to new meaningful operations on graph signals.
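For readers unfamiliar with the modularity matrix named in this abstract, the sketch below builds it for a toy graph. The abstract does not spell out the definition, so the code assumes the usual one for an undirected, unweighted graph: B[i][j] = A[i][j] - k[i] * k[j] / (2m), with A the adjacency matrix, k the node degrees and m the number of edges. The function name and the example graph are hypothetical, and none of the signal processing operations from the talk are implemented.

```python
def modularity_matrix(adjacency):
    """Return B with B[i][j] = A[i][j] - k[i] * k[j] / (2 * m)."""
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)  # the degree sum equals twice the number of edges
    n = len(adjacency)
    return [
        [adjacency[i][j] - degrees[i] * degrees[j] / two_m for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy graph: two triangles (nodes 0-2 and 3-5) joined by edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
B = modularity_matrix(A)
```

Each entry compares an observed edge with the number of edges expected between two nodes of those degrees under random wiring, so entries within a densely connected community tend to be positive.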

Joint work with Miljan Petrovic, Raphaël Liégeois, Thomas Bolto